20 May, 2016

11 commits

  • Provide /proc/sys/vm/stat_refresh to force an immediate update of
    per-cpu into global vmstats: useful to avoid a sleep(2) or whatever
    before checking counts when testing. Originally added to work around a
    bug which left counts stranded indefinitely on a cpu going idle (an
    inaccuracy magnified when small below-batch numbers represent "huge"
    amounts of memory), but I believe that bug is now fixed: nonetheless,
    this is still a useful knob.

    Its schedule_on_each_cpu() is probably too expensive just to fold into
    reading /proc/meminfo itself: give this mode 0600 to prevent abuse.
    Allow a write or a read to do the same: nothing to read, but "grep -h
    Shmem /proc/sys/vm/stat_refresh /proc/meminfo" is convenient. Oh, and
    since global_page_state() itself is careful to disguise any underflow as
    0, hack in an "Invalid argument" and pr_warn() if a counter is negative
    after the refresh - this helped to fix a misaccounting of
    NR_ISOLATED_FILE in my migration code.

    But on recent kernels, I find that NR_ALLOC_BATCH and NR_PAGES_SCANNED
    go negative some of the time. I have not yet worked out why, but
    have no evidence that it's actually harmful. Punt for the moment by
    just ignoring the anomaly on those.
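
    As a rough user-space illustration of the folding described above (not
    the kernel code), the sketch below keeps per-cpu deltas that only reach
    the global value once they pass a batch threshold, plus a refresh that
    folds everything at once and reports a negative result. NR_CPUS, BATCH,
    struct counter, counter_mod() and counter_refresh() are all hypothetical
    names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4   /* hypothetical CPU count for this sketch */
#define BATCH   32  /* hypothetical per-cpu fold threshold */

struct counter {
	long global;           /* value a reader normally sees */
	long percpu[NR_CPUS];  /* deltas not yet folded into global */
};

/* Apply delta on one CPU; fold into the global value only once the
 * local delta has grown beyond the batch size in either direction. */
static void counter_mod(struct counter *c, int cpu, long delta)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	c->percpu[cpu] += delta;
	if (c->percpu[cpu] > BATCH || c->percpu[cpu] < -BATCH) {
		c->global += c->percpu[cpu];
		c->percpu[cpu] = 0;
	}
}

/* Fold every per-cpu delta into the global value immediately.
 * Returns -1 if the global value is negative afterwards, else 0. */
static int counter_refresh(struct counter *c)
{
	for (size_t cpu = 0; cpu < NR_CPUS; cpu++) {
		c->global += c->percpu[cpu];
		c->percpu[cpu] = 0;
	}
	return c->global < 0 ? -1 : 0;
}
```

    Small below-batch updates stay invisible in the global value until
    counter_refresh() runs, which is the situation the knob is meant to help
    with when testing.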

    Signed-off-by: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Andres Lagar-Cavilla
    Cc: Yang Shi
    Cc: Ning Qu
    Cc: Mel Gorman
    Cc: Andres Lagar-Cavilla
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Lots of code does

    node = next_node(node, XXX);
    if (node == MAX_NUMNODES)
        node = first_node(XXX);

    so create next_node_in() to do this and use it in various places.
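
    The wrap-around pattern can be sketched in plain C over a bitmask. This
    is only an illustration under assumptions: MAX_NODES, find_next() and
    next_in_mask() are hypothetical stand-ins, not the kernel's nodemask
    helpers.

```c
#include <assert.h>

#define MAX_NODES 32  /* hypothetical stand-in for MAX_NUMNODES */

/* Lowest set bit at or above 'start', or MAX_NODES if there is none. */
static int find_next(unsigned int mask, int start)
{
	for (int n = start; n < MAX_NODES; n++)
		if (mask & (1u << n))
			return n;
	return MAX_NODES;
}

/* Next set node after 'node', wrapping around to the first set node.
 * Returns MAX_NODES only when the mask is empty. */
static int next_in_mask(int node, unsigned int mask)
{
	assert(node >= -1 && node < MAX_NODES);
	int next = find_next(mask, node + 1);
	if (next == MAX_NODES)
		next = find_next(mask, 0);  /* wrap to the first node */
	return next;
}
```

    Folding the wrap into one helper means callers no longer repeat the
    "hit the end, start over" check themselves.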

    [mhocko@suse.com: use next_node_in() helper]
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Signed-off-by: Michal Hocko
    Cc: Xishi Qiu
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Laura Abbott
    Cc: Hui Zhu
    Cc: Wang Xiaoqiang
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Many developers already know that the reference count field of struct
    page is _count and that it is an atomic type. They may try to modify it
    directly, which would defeat the purpose of the page reference count
    tracepoint. To prevent direct modification of _count, this patch renames
    it to _refcount and adds a warning comment in the code. After that,
    developers who need to handle the reference count will find that the
    field should not be accessed directly.

    [akpm@linux-foundation.org: fix comments, per Vlastimil]
    [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
    [sfr@canb.auug.org.au: sync ethernet driver changes]
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Stephen Rothwell
    Cc: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Johannes Berg
    Cc: "David S. Miller"
    Cc: Sunil Goutham
    Cc: Chris Metcalf
    Cc: Manish Chopra
    Cc: Yuval Mintz
    Cc: Tariq Toukan
    Cc: Saeed Mahameed
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • A recent cleanup removed some exported functions that were not used
    anywhere, which in turn exposed the fact that some other functions in
    the same file are only used in some configurations.

    We now get a warning about them when CONFIG_HOTPLUG_CPU is disabled:

    kernel/padata.c:670:12: error: '__padata_remove_cpu' defined but not used [-Werror=unused-function]
    static int __padata_remove_cpu(struct padata_instance *pinst, int cpu)
    ^~~~~~~~~~~~~~~~~~~
    kernel/padata.c:650:12: error: '__padata_add_cpu' defined but not used [-Werror=unused-function]
    static int __padata_add_cpu(struct padata_instance *pinst, int cpu)

    This rearranges the code so the __padata_remove_cpu/__padata_add_cpu
    functions are within the #ifdef that protects the code that calls them.

    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: 4ba6d78c671e ("kernel/padata.c: removed unused code")
    Signed-off-by: Arnd Bergmann
    Cc: Richard Cochran
    Cc: Steffen Klassert
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • By accident I stumbled across code that has never been used. This
    driver has EXPORT_SYMBOL functions, and the only user of the code is
    pcrypt.c, but this only uses a subset of the exported symbols.

    According to 'git log -G', the functions, padata_set_cpumasks,
    padata_add_cpu, and padata_remove_cpu have never been used since they
    were first introduced. This patch removes the unused code.

    On one 64 bit build, with CRYPTO_PCRYPT built in, the text is more than
    4k smaller.

    kbuild_hp> size $KBUILD_OUTPUT/vmlinux
        text    data     bss      dec    hex filename
    10566658 4678360 1122304 16367322 f9beda vmlinux
    10561984 4678360 1122304 16362648 f9ac98 vmlinux

    On another config, 32 bit, the saving is about 0.5k bytes.

    kbuild_hp-x86> size $KBUILD_OUTPUT/vmlinux
       text    data     bss      dec    hex filename
    6012005 2409513 2785280 11206798 ab008e vmlinux
    6011491 2409513 2785280 11206284 aafe8c vmlinux

    Signed-off-by: Richard Cochran
    Cc: Steffen Klassert
    Cc: Herbert Xu
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Cochran
     
  • When activating a static object we need to make sure that the object is
    tracked in the object tracker. If it is a non-static object then the
    activation is illegal.

    In the previous implementation, each subsystem needed to take care of
    this in its fixup callbacks. We can instead put it into the debugobjects
    core. This removes duplicated code and leaves *pure* fixup callbacks.

    To achieve this, a new callback "is_static_object" is introduced to let
    the type-specific code decide whether an object is static or not. If
    yes, we take it into the object tracker, otherwise we give a warning and
    invoke the fixup callback.
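
    The decision flow described above might look roughly like the following
    user-space sketch. It is not the debugobjects implementation: struct
    tracker, tracked(), activate(), MAX_TRACKED and example_is_static() are
    hypothetical names, and the "fixup" is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TRACKED 8  /* hypothetical tracker capacity */

struct tracker {
	const void *objs[MAX_TRACKED];
	size_t count;
	/* type-specific callback: is this address a static object? */
	bool (*is_static_object)(const void *addr);
	int fixups;  /* how many times the fixup path was taken */
};

static bool tracked(const struct tracker *t, const void *addr)
{
	for (size_t i = 0; i < t->count; i++)
		if (t->objs[i] == addr)
			return true;
	return false;
}

/* Returns true if the activation is legal, false if the object was
 * neither tracked nor static and the fixup path had to be taken. */
static bool activate(struct tracker *t, const void *addr)
{
	if (tracked(t, addr))
		return true;
	if (t->is_static_object && t->is_static_object(addr)) {
		assert(t->count < MAX_TRACKED);
		t->objs[t->count++] = addr;  /* start tracking it */
		return true;
	}
	t->fixups++;  /* stand-in for "warn and invoke fixup callback" */
	return false;
}

/* Example callback: only this one object counts as static. */
static int static_obj;
static bool example_is_static(const void *addr)
{
	return addr == &static_obj;
}
```

    With the static check in the core, a type only supplies the predicate;
    its fixup callback no longer has to special-case static objects.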

    This change has passed the debugobjects selftest, and I also did some
    testing with all debugobjects support enabled.

    Finally, I have a concern about the fixups: can a fixup change an object
    that is in an incorrect state? The 'addr' may not point to any valid
    object if a non-static object is not tracked, so changing such an object
    can overwrite someone else's memory and cause unexpected behaviour. For
    example, timer_fixup_activate binds the timer to the function
    stub_timer.

    Link: http://lkml.kernel.org/r/1462576157-14539-1-git-send-email-changbin.du@intel.com
    [changbin.du@intel.com: improve code comments where invoke the new is_static_object callback]
    Link: http://lkml.kernel.org/r/1462777431-8171-1-git-send-email-changbin.du@intel.com
    Signed-off-by: Du, Changbin
    Cc: Jonathan Corbet
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Du, Changbin
     
  • Update the return type to use bool instead of int, corresponding to
    change (debugobjects: make fixup functions return bool instead of int).

    Signed-off-by: Du, Changbin
    Cc: Jonathan Corbet
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Du, Changbin
     
  • Update the return type to use bool instead of int, corresponding to
    change (debugobjects: make fixup functions return bool instead of int).

    Signed-off-by: Du, Changbin
    Cc: Jonathan Corbet
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Du, Changbin
     
  • Update the return type to use bool instead of int, corresponding to
    change (debugobjects: make fixup functions return bool instead of int)

    Signed-off-by: Du, Changbin
    Cc: Jonathan Corbet
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Du, Changbin
     
  • All references to timespec_add_safe() now use timespec64_add_safe().

    The plan is to replace struct timespec references with struct timespec64
    throughout the kernel as timespec is not y2038 safe.

    Drop timespec_add_safe() and use timespec64_add_safe() for all
    architectures.

    Link: http://lkml.kernel.org/r/1461947989-21926-4-git-send-email-deepa.kernel@gmail.com
    Signed-off-by: Deepa Dinamani
    Acked-by: John Stultz
    Cc: Thomas Gleixner
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Deepa Dinamani
     
  • timespec64_add_safe() has been defined in time64.h for 64 bit systems.
    But, 32 bit systems only have an extern function prototype defined.
    Provide a definition for the above function.

    The function will be necessary as part of y2038 changes. struct
    timespec is not y2038 safe. All references to timespec will be replaced
    by struct timespec64. The function is meant to be a replacement for
    timespec_add_safe().

    The implementation is similar to timespec_add_safe().
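
    For illustration only, an overflow-safe add of two second/nanosecond
    pairs could be sketched as below. struct ts64 and ts64_add_safe() are
    hypothetical stand-ins, the inputs are assumed non-negative and
    normalized, and saturating at the maximum value is an assumption of this
    sketch rather than something stated in the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000L

struct ts64 {          /* hypothetical stand-in for struct timespec64 */
	int64_t sec;
	long    nsec;  /* 0 .. NSEC_PER_SEC - 1 */
};

/* Add two non-negative, normalized times. If the seconds would
 * overflow, saturate to the largest representable second count. */
static struct ts64 ts64_add_safe(struct ts64 a, struct ts64 b)
{
	struct ts64 res;

	assert(a.sec >= 0 && b.sec >= 0);
	assert(a.nsec >= 0 && a.nsec < NSEC_PER_SEC);
	assert(b.nsec >= 0 && b.nsec < NSEC_PER_SEC);

	long nsec = a.nsec + b.nsec;  /* below 2e9, fits in 32 bits */
	int64_t carry = nsec >= NSEC_PER_SEC;
	if (carry)
		nsec -= NSEC_PER_SEC;

	/* Compare before adding so the signed sum cannot overflow. */
	if (a.sec > INT64_MAX - b.sec - carry) {
		res.sec = INT64_MAX;
		res.nsec = 0;
		return res;
	}
	res.sec = a.sec + b.sec + carry;
	res.nsec = nsec;
	return res;
}
```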

    Link: http://lkml.kernel.org/r/1461947989-21926-2-git-send-email-deepa.kernel@gmail.com
    Signed-off-by: Deepa Dinamani
    Acked-by: John Stultz
    Cc: Thomas Gleixner
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Deepa Dinamani
     

19 May, 2016

3 commits

  • Pull tracing updates from Steven Rostedt:
    "This includes two new updates for the ftrace infrastructure.

    - With the changing of the code for filtering events by pid, from a
    list of pids to a bitmask, we can now easily implement following
    forks. A new tracing option "event-fork", when set, causes tasks
    with pids in set_event_pid to have their child pids added to
    set_event_pid when they fork, so the child will be traced as
    well.

    Note, if "event-fork" is set and a task with its pid in
    set_event_pid exits, its pid will be removed from set_event_pid

    - The addition of Tom Zanussi's hist triggers. This includes a very
    thorough documentation on how to use the hist triggers with events.
    This introduces a quick and easy way to get histogram data from
    events and their fields.

    Some other cleanups and updates were added as well. Like Masami
    Hiramatsu added test cases for the event trigger and hist triggers.
    Also I added a speed up of filtering by using a temp buffer when
    filters are set"

    * tag 'trace-v4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (45 commits)
    tracing: Use temp buffer when filtering events
    tracing: Remove TRACE_EVENT_FL_USE_CALL_FILTER logic
    tracing: Remove unused function trace_current_buffer_lock_reserve()
    tracing: Remove one use of trace_current_buffer_lock_reserve()
    tracing: Have trace_buffer_unlock_commit() call the _regs version with NULL
    tracing: Remove unused function trace_current_buffer_discard_commit()
    tracing: Move trace_buffer_unlock_commit{_regs}() to local header
    tracing: Fold filter_check_discard() into its only user
    tracing: Make filter_check_discard() local
    tracing: Move event_trigger_unlock_commit{_regs}() to local header
    tracing: Don't use the address of the buffer array name in copy_from_user
    tracing: Handle tracing_map_alloc_elts() error path correctly
    tracing: Add check for NULL event field when creating hist field
    tracing: checking for NULL instead of IS_ERR()
    tracing: Do not inherit event-fork option for instances
    tracing: Fix unsigned comparison to zero in hist trigger code
    kselftests/ftrace: Add a test for log2 modifier of hist trigger
    tracing: Add hist trigger 'log2' modifier
    kselftests/ftrace: Add hist trigger testcases
    kselftests/ftrace : Add event trigger testcases
    ...

    Linus Torvalds
     
  • Pull audit updates from Paul Moore:
    "Four small audit patches for 4.7.

    Two are simple cleanups around the audit thread management code, one
    adds a tty field to AUDIT_LOGIN events, and the final patch makes
    tty_name() usable regardless of CONFIG_TTY.

    Nothing controversial, and it all passes our regression test"

    * 'stable-4.7' of git://git.infradead.org/users/pcmoore/audit:
    tty: provide tty_name() even without CONFIG_TTY
    audit: add tty field to LOGIN event
    audit: we don't need to __set_current_state(TASK_RUNNING)
    audit: cleanup prune_tree_thread

    Linus Torvalds
     
  • Pull misc vfs cleanups from Al Viro:
    "Assorted cleanups and fixes all over the place"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    coredump: only charge written data against RLIMIT_CORE
    coredump: get rid of coredump_params->written
    ecryptfs_lookup(): try either only encrypted or plaintext name
    ecryptfs: avoid multiple aliases for directories
    bpf: reject invalid names right in ->lookup()
    __d_alloc(): treat NULL name as QSTR("/", 1)
    mtd: switch ubi_open_volume_path() to vfs_stat()
    mtd: switch open_mtd_by_chdev() to use of vfs_stat()

    Linus Torvalds
     

18 May, 2016

8 commits

  • Pull GPIO updates from Linus Walleij:
    "This is the bulk of GPIO changes for kernel cycle v4.7:

    Core infrastructural changes:

    - Support for natively single-ended GPIO driver stages.

    This means that if the hardware has registers to configure open
    drain or open source configuration, we use that rather than (as we
    did before) try to emulate it by switching the line to an input to
    get high impedance.

    This is also documented thoroughly in Documentation/gpio/driver.txt
    for those of you who did not understand one word of what I just
    wrote.

    - Start to do away with the unnecessarily complex and unintelligible
    ARCH_REQUIRE_GPIOLIB and ARCH_WANT_OPTIONAL_GPIOLIB, another
    evolutional artifact from the time when the GPIO subsystem was
    unmaintained.

    Archs can now just select GPIOLIB and be done with it, cleanups to
    arches will trickle in for the next kernel. Some minor archs ACKed
    the changes immediately so these are included in this pull request.

    - Advancing the use of the data pointer inside the GPIO device for
    storing driver data by switching the PowerPC, Super-H, Unicore and
    a few other subarches or subsystem drivers in ALSA SoC, Input,
    serial, SSB, staging etc to use it.

    - The initialization now reads the input/output state of the GPIO
    lines, so that each GPIO descriptor knows - if this callback is
    implemented - whether the line is input or output. This also
    reflects nicely in userspace "lsgpio".

    - It is now possible to name GPIO producer names, line names, from
    the device tree. (Platform data has been supported for a while).
    I bet we will get a similar mechanism for ACPI one of those days.
    This makes it possible to get sensible producer names for e.g.
    GPIO rails in "lsgpio" in userspace.

    New drivers:

    - New driver for the Loongson1.

    - The XLP driver now supports Broadcom Vulcan ARM64.

    - The IT87 driver now supports IT8620 and IT8628.

    - The PCA953X driver now supports Galileo Gen2.

    Driver improvements:

    - MCP23S08 was switched to use the gpiolib irqchip helpers and now
    also supports level-triggered interrupts.

    - 74x164 and RCAR now support the .set_multiple() callback

    - AMDPT was converted to use generic GPIO.

    - TC3589x, TPS65218, SX150X, F7188X, MENZ127, VX855, WM831X, WM8994
    support the new single ended callback for open drain and in some
    cases open source.

    - Implement the .get_direction() callback for a few more drivers like
    PL061, Xgene.

    Cleanups:

    - Paul Gortmaker combed through the drivers and de-modularized those
    that are not really modules.

    - Move the GPIO poweroff DT bindings to the power subdir where they
    belong.

    - Rename gpio-generic.c to gpio-mmio.c, which is much more to the
    point. That's what it is handling, nothing more, nothing less"

    * tag 'gpio-v4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (126 commits)
    MIPS: do away with ARCH_[WANT_OPTIONAL|REQUIRE]_GPIOLIB
    gpio: zevio: make it explicitly non-modular
    gpio: timberdale: make it explicitly non-modular
    gpio: stmpe: make it explicitly non-modular
    gpio: sodaville: make it explicitly non-modular
    pinctrl: sh-pfc: Let gpio_chip.to_irq() return zero on error
    gpio: dwapb: Add ACPI device ID for DWAPB GPIO controller on X-Gene platforms
    gpio: dt-bindings: add wd,mbl-gpio bindings
    gpio: of: make it possible to name GPIO lines
    gpio: make gpiod_to_irq() return negative for NO_IRQ
    gpio: xgene: implement .get_direction()
    gpio: xgene: Enable ACPI support for X-Gene GFC GPIO driver
    gpio: tegra: Implement gpio_get_direction callback
    gpio: set up initial state from .get_direction()
    gpio: rename gpio-generic.c into gpio-mmio.c
    gpio: generic: fix GPIO_GENERIC_PLATFORM is set to module case
    gpio: dwapb: add gpio-signaled acpi event support
    gpio: dwapb: convert device node to fwnode
    gpio: dwapb: remove name from dwapb_port_property
    gpio/qoriq: select IRQ_DOMAIN
    ...

    Linus Torvalds
     
  • Pull livepatching updates from Jiri Kosina:

    - removal of our own implementation of architecture-specific relocation
    code, leveraging existing code in the module loader to perform
    arch-dependent work, from Jessica Yu.

    The relevant patches have been acked by Rusty (for module.c) and
    Heiko (for s390).

    - live patching support for ppc64le, which is a joint work of Michael
    Ellerman and Torsten Duwe. This is coming from a topic branch that is
    shared between livepatching.git and the ppc tree.

    - addition of livepatching documentation from Petr Mladek

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
    livepatch: make object/func-walking helpers more robust
    livepatch: Add some basic livepatch documentation
    powerpc/livepatch: Add live patching support on ppc64le
    powerpc/livepatch: Add livepatch stack to struct thread_info
    powerpc/livepatch: Add livepatch header
    livepatch: Allow architectures to specify an alternate ftrace location
    ftrace: Make ftrace_location_range() global
    livepatch: robustify klp_register_patch() API error checking
    Documentation: livepatch: outline Elf format and requirements for patch modules
    livepatch: reuse module loader code to write relocations
    module: s390: keep mod_arch_specific for livepatch modules
    module: preserve Elf information for livepatch modules
    Elf: add livepatch-specific Elf constants

    Linus Torvalds
     
  • Pull networking updates from David Miller:
    "Highlights:

    1) Support SPI based w5100 devices, from Akinobu Mita.

    2) Partial Segmentation Offload, from Alexander Duyck.

    3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.

    4) Allow cls_flower stats offload, from Amir Vadai.

    5) Implement bpf blinding, from Daniel Borkmann.

    6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
    actually using FASYNC these atomics are superfluous. From Eric
    Dumazet.

    7) Run TCP more preemptibly, also from Eric Dumazet.

    8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
    driver, from Gal Pressman.

    9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.

    10) Improve BPF usage documentation, from Jesper Dangaard Brouer.

    11) Support tunneling offloads in qed, from Manish Chopra.

    12) aRFS offloading in mlx5e, from Maor Gottlieb.

    13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
    Leitner.

    14) Add MSG_EOR support to TCP, this allows controlling packet
    coalescing on application record boundaries for more accurate
    socket timestamp sampling. From Martin KaFai Lau.

    15) Fix alignment of 64-bit netlink attributes across the board, from
    Nicolas Dichtel.

    16) Per-vlan stats in bridging, from Nikolay Aleksandrov.

    17) Several conversions of drivers to ethtool ksettings, from Philippe
    Reynes.

    18) Checksum neutral ILA in ipv6, from Tom Herbert.

    19) Factorize all of the various marvell dsa drivers into one, from
    Vivien Didelot

    20) Add VF support to qed driver, from Yuval Mintz"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
    Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
    Revert "phy dp83867: Make rgmii parameters optional"
    r8169: default to 64-bit DMA on recent PCIe chips
    phy dp83867: Make rgmii parameters optional
    phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
    bpf: arm64: remove callee-save registers use for tmp registers
    asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
    switchdev: pass pointer to fib_info instead of copy
    net_sched: close another race condition in tcf_mirred_release()
    tipc: fix nametable publication field in nl compat
    drivers: net: Don't print unpopulated net_device name
    qed: add support for dcbx.
    ravb: Add missing free_irq() calls to ravb_close()
    qed: Remove a stray tab
    net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fec-mpc52xx: use phydev from struct net_device
    bpf, doc: fix typo on bpf_asm descriptions
    stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
    net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fs-enet: use phydev from struct net_device
    ...

    Linus Torvalds
     
  • Pull core block layer updates from Jens Axboe:
    "These are the core block IO changes for this merge window. Nothing
    earth shattering in here, it's mostly just fixes. In detail:

    - Fix for a long standing issue where wrong ordering in blk-mq caused
    order_to_size() to spew a warning. From Bart.

    - Async discard support from Christoph. Basically just splitting our
    sync interface into a submit + wait part.

    - Add a cleaner interface for flagging whether a device has a write
    back cache or not. We've previously overloaded blk_queue_flush()
    with this, but let's make it more explicit. Drivers cleaned up and
    updated in the drivers pull request. From me.

    - Fix for a double check for whether IO accounting is enabled or not.
    From Michael Callahan.

    - Fix for the async discard from Mike Snitzer, reinstating the early
    EOPNOTSUPP return if the device doesn't support discards.

    - Also from Mike, export bio_inc_remaining() so dm can drop its
    private copy of it.

    - From Ming Lin, add support for passing in an offset for request
    payloads.

    - Tag function export from Sagi, which will be used in NVMe in the
    drivers pull.

    - Two blktrace related fixes from Shaohua.

    - Propagate NOMERGE flag when making a request from a bio, also from
    Shaohua.

    - An optimization to not parse cgroup paths in blk-throttle, if we
    don't need to. From Shaohua"

    * 'for-4.7/core' of git://git.kernel.dk/linux-block:
    blk-mq: fix undefined behaviour in order_to_size()
    blk-throttle: don't parse cgroup path if trace isn't enabled
    blktrace: add missed mask name
    blktrace: delete garbage for message trace
    block: make bio_inc_remaining() interface accessible again
    block: reinstate early return of -EOPNOTSUPP from blkdev_issue_discard
    block: Minor blk_account_io_start usage cleanup
    block: add __blkdev_issue_discard
    block: remove struct bio_batch
    block: copy NOMERGE flag from bio to request
    block: add ability to flag write back caching on a device
    blk-mq: Export tagset iter function
    block: add offset in blk_add_request_payload()
    writeback: Fix performance regression in wb_over_bg_thresh()

    Linus Torvalds
     
  • Pull parallel filesystem directory handling update from Al Viro.

    This is the main parallel directory work by Al that makes the vfs layer
    able to do lookup and readdir in parallel within a single directory.
    That's a big change, since this used to be all protected by the
    directory inode mutex.

    The inode mutex is replaced by an rwsem, and serialization of lookups of
    a single name is done by an "in-progress" dentry marker.

    The series begins with xattr cleanups, and then ends with switching
    filesystems over to actually doing the readdir in parallel (switching to
    the "iterate_shared()" that only takes the read lock).

    A more detailed explanation of the process from Al Viro:
    "The xattr work starts with some acl fixes, then switches ->getxattr to
    passing inode and dentry separately. This is the point where the
    things start to get tricky - that got merged into the very beginning
    of the -rc3-based #work.lookups, to allow untangling the
    security_d_instantiate() mess. The xattr work itself proceeds to
    switch a lot of filesystems to generic_...xattr(); no complications
    there.

    After that initial xattr work, the series then does the following:

    - untangle security_d_instantiate()

    - convert a bunch of open-coded lookup_one_len_unlocked() to calls of
    that thing; one such place (in overlayfs) actually yields a trivial
    conflict with overlayfs fixes later in the cycle - overlayfs ended
    up switching to a variant of lookup_one_len_unlocked() sans the
    permission checks. I would've dropped that commit (it gets
    overridden on merge from #ovl-fixes in #for-next; proper resolution
    is to use the variant in mainline fs/overlayfs/super.c), but I
    didn't want to rebase the damn thing - it was fairly late in the
    cycle...

    - some filesystems had managed to depend on lookup/lookup exclusion
    for *fs-internal* data structures in a way that would break if we
    relaxed the VFS exclusion. Fixing hadn't been hard, fortunately.

    - core of that series - parallel lookup machinery, replacing
    ->i_mutex with rwsem, making lookup_slow() take it only shared. At
    that point lookups happen in parallel; lookups on the same name
    wait for the in-progress one to be done with that dentry.

    Surprisingly little code, at that - almost all of it is in
    fs/dcache.c, with fs/namei.c changes limited to lookup_slow() -
    making it use the new primitive and actually switching to locking
    shared.

    - parallel readdir stuff - first of all, we provide the exclusion on
    per-struct file basis, same as we do for read() vs lseek() for
    regular files. That takes care of most of the needed exclusion in
    readdir/readdir; however, these guys are trickier than lookups, so
    I went for switching them one-by-one. To do that, a new method
    '->iterate_shared()' is added and filesystems are switched to it
    as they are either confirmed to be OK with shared lock on directory
    or fixed to be OK with that. I hope to kill the original method
    come next cycle (almost all in-tree filesystems are switched
    already), but it's still not quite finished.

    - several filesystems get switched to parallel readdir. The
    interesting part here is dealing with dcache preseeding by readdir;
    that needs minor adjustment to be safe with directory locked only
    shared.

    Most of the filesystems doing that got switched in those
    commits. Important exception: NFS. Turns out that NFS folks, with
    their, er, insistence on VFS getting the fuck out of the way of the
    Smart Filesystem Code That Knows How And What To Lock(tm) have
    grown the locking of their own. They had their own homegrown
    rwsem, with lookup/readdir/atomic_open being *writers* (sillyunlink
    is the reader there). Of course, with VFS getting the fuck out of
    the way, as requested, the actual smarts of the smart filesystem
    code etc. had become exposed...

    - do_last/lookup_open/atomic_open cleanups. As the result, open()
    without O_CREAT locks the directory only shared. Including the
    ->atomic_open() case. Backmerge from #for-linus in the middle of
    that - atomic_open() fix got brought in.

    - then comes NFS switch to saner (VFS-based ;-) locking, killing the
    homegrown "lookup and readdir are writers" kinda-sorta rwsem. All
    exclusion for sillyunlink/lookup is done by the parallel lookups
    mechanism. Exclusion between sillyunlink and rmdir is a real rwsem
    now - rmdir being the writer.

    Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel
    now.

    - the rest of the series consists of switching a lot of filesystems
    to parallel readdir; in a lot of cases ->llseek() gets simplified
    as well. One backmerge in there (again, #for-linus - rockridge
    fix)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (74 commits)
    ext4: switch to ->iterate_shared()
    hfs: switch to ->iterate_shared()
    hfsplus: switch to ->iterate_shared()
    hostfs: switch to ->iterate_shared()
    hpfs: switch to ->iterate_shared()
    hpfs: handle allocation failures in hpfs_add_pos()
    gfs2: switch to ->iterate_shared()
    f2fs: switch to ->iterate_shared()
    afs: switch to ->iterate_shared()
    befs: switch to ->iterate_shared()
    befs: constify stuff a bit
    isofs: switch to ->iterate_shared()
    get_acorn_filename(): deobfuscate a bit
    btrfs: switch to ->iterate_shared()
    logfs: no need to lock directory in lseek
    switch ecryptfs to ->iterate_shared
    9p: switch to ->iterate_shared()
    fat: switch to ->iterate_shared()
    romfs, squashfs: switch to ->iterate_shared()
    more trivial ->iterate_shared conversions
    ...

    Linus Torvalds
     
  • Pull irq updates from Thomas Gleixner:
    "This update delivers:

    - Yet another interrupt chip driver (LPC32xx)
    - Core functions to handle partitioned per-cpu interrupts
    - Enhancements to the IPI core
    - Proper handling of irq type configuration
    - A large set of ARM GIC enhancements
    - The usual pile of small fixes, cleanups and enhancements"

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
    irqchip/bcm2836: Use a more generic memory barrier call
    irqchip/bcm2836: Fix compiler warning on 64-bit build
    irqchip/bcm2836: Drop smp_set_ops on arm64 builds
    irqchip/gic: Add helper functions for GIC setup and teardown
    irqchip/gic: Store GIC configuration parameters
    irqchip/gic: Pass GIC pointer to save/restore functions
    irqchip/gic: Return an error if GIC initialisation fails
    irqchip/gic: Remove static irq_chip definition for eoimode1
    irqchip/gic: Don't initialise chip if mapping IO space fails
    irqchip/gic: WARN if setting the interrupt type for a PPI fails
    irqchip/gic: Don't unnecessarily write the IRQ configuration
    irqchip: Mask the non-type/sense bits when translating an IRQ
    genirq: Ensure IRQ descriptor is valid when setting-up the IRQ
    irqchip/gic-v3: Configure all interrupts as non-secure Group-1
    irqchip/gic-v2m: Add workaround for Broadcom NS2 GICv2m erratum
    irqchip/irq-alpine-msi: Don't use
    irqchip/mbigen: Checking for IS_ERR() instead of NULL
    irqchip/gic-v3: Remove inexistant register definition
    irqchip/gicv3-its: Don't allow devices whose ID is outside range
    irqchip: Add LPC32xx interrupt controller driver
    ...

    Linus Torvalds
     
  • Pull timer updates from Thomas Gleixner:
    "A rather small set of patches from the timer department:

    - Some more y2038 work
    - Yet another new clocksource driver
    - The usual set of small fixes, cleanups and enhancements"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    clocksource/drivers/tegra: Remove unused suspend/resume code
    clockevents/driversi/mps2: add MPS2 Timer driver
    dt-bindings: document the MPS2 timer bindings
    clocksource/drivers/mtk_timer: Add __init attribute
    clockevents/drivers/dw_apb_timer: Implement ->set_state_oneshot_stopped()
    time: Introduce do_sys_settimeofday64()
    security: Introduce security_settime64()
    clocksource: Add missing include of of.h.

    Linus Torvalds
     
  • …t/rostedt/linux-trace

    Pull tracing ring-buffer fixes from Steven Rostedt:
    "Hao Qin reported an integer overflow possibility with signed and
    unsigned numbers in the ring-buffer code.

    https://bugzilla.kernel.org/show_bug.cgi?id=118001

    At first I did not think this was too much of an issue, because the
    overflow would be caught later when either too much data was allocated
    or it would trigger RB_WARN_ON() which shuts down the ring buffer.

    But looking closer into it, I found that the right settings could
    bypass the checks and crash the kernel. Luckily, this is only
    accessible by root.

    The first fix is to convert all the variables into long, such that we
    don't get into issues between 32 bit variables being assigned 64 bit
    ones. This fixes the RB_WARN_ON() triggering.

    The next fix is to get rid of a duplicate DIV_ROUND_UP() that when
    called twice with the right value, can cause a kernel crash.

    The first DIV_ROUND_UP() is to normalize the input and it is checked
    against the minimum allowable value. But then DIV_ROUND_UP() is
    called again, which can overflow due to the (a + b - 1)/b logic. The
    first call upped the value; the second can overflow (with the +b
    part).

    The second call to DIV_ROUND_UP() came in via a second change a while
    ago and the code is cleaned up to remove it"

    * tag 'trace-fixes-v4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ring-buffer: Prevent overflow of size in ring_buffer_resize()
    ring-buffer: Use long for nr_pages to avoid overflow failures
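
    The general hazard of the (a + b - 1)/b pattern described above can be
    sketched in a few lines. This is only an illustration, not the
    ring_buffer_resize() code path: the 64-bit mask, the BUF_PAGE_SIZE
    value and both helper names are assumptions chosen for the example.

```python
MASK64 = (1 << 64) - 1  # simulate a 64-bit unsigned long (assumption)


def div_round_up_wrapping(n, d):
    """(n + d - 1) / d with 64-bit wraparound, as C unsigned arithmetic behaves."""
    return ((n + d - 1) & MASK64) // d


def div_round_up_safe(n, d):
    """Round-up division that never forms n + d - 1, so it cannot wrap."""
    return n // d + (1 if n % d else 0)


BUF_PAGE_SIZE = 4096  # hypothetical page payload size, for illustration only

size = MASK64 - 100  # a huge request, close to the maximum representable value
wrapped = div_round_up_wrapping(size, BUF_PAGE_SIZE)
exact = div_round_up_safe(size, BUF_PAGE_SIZE)
print(wrapped, exact)
```

    With these inputs the wrapping variant yields a tiny page count while
    the exact answer is enormous, which is the kind of mismatch that lets a
    later sanity check be bypassed.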

    Linus Torvalds
     

17 May, 2016

14 commits

  • …ing-ppc64' into for-linus

    Jiri Kosina
     
  • Backmerge to resolve a conflict in ovl_lookup_real();
    "ovl_lookup_real(): use lookup_one_len_unlocked()" instead,
    but it was too late in the cycle to rebase.

    Al Viro
     
  • Pull power management updates from Rafael Wysocki:
    "The majority of changes go into the cpufreq subsystem this time.

    To me, quite obviously, the biggest ticket item is the new "schedutil"
    governor. Interestingly enough, it's the first new cpufreq governor
    since the beginning of the git era (except for some out-of-the-tree
    ones).

    There are two main differences between it and the existing governors.
    First, it uses the information provided by the scheduler directly for
    making its decisions, so it doesn't have to track anything by itself.
    Second, it can invoke drivers (supporting that feature) to adjust CPU
    performance right away without having to spawn work items to be
    executed in process context or similar. Currently, the acpi-cpufreq
    driver is the only one supporting that mode of operation, but then it
    is used on a large number of systems.

    The "schedutil" governor as included here is very simple and mostly
    regarded as a foundation for future work on the integration of the
    scheduler with CPU power management (in fact, there is work in
    progress on top of it already). Nevertheless it works and the
    preliminary results obtained with it are encouraging.

    There also is some consolidation of CPU frequency management for ARM
    platforms that can add their machine IDs to the new stub dt-platdev
    driver now and that will take care of creating the requisite platform
    device for cpufreq-dt, so it is not necessary to do that in platform
    code any more. Several ARM platforms are switched over to using this
    generic mechanism.

    In addition to that, the intel_pstate driver is now going to respect
    CPU frequency limits set by the platform firmware (or a BMC) and
    provided via the ACPI _PPC object.

    The devfreq subsystem is getting a new "passive" governor for SoC
    subsystems that will depend on somebody else to manage their voltage
    rails and its support for Samsung Exynos SoCs is consolidated.

    The rest is support for new hardware (Intel Broxton support in
    intel_idle for one example), bug fixes, optimizations and cleanups in
    a number of places.

    Specifics:

    - New cpufreq "schedutil" governor (making decisions based on CPU
    utilization information provided by the scheduler and capable of
    switching CPU frequencies right away if the underlying driver
    supports that) and support for fast frequency switching in the
    acpi-cpufreq driver (Rafael Wysocki)

    - Consolidation of CPU frequency management on ARM platforms allowing
    them to get rid of some platform-specific boilerplate code if they
    are going to use the cpufreq-dt driver (Viresh Kumar, Finley Xiao,
    Marc Gonzalez)

    - Support for ACPI _PPC and CPU frequency limits in the intel_pstate
    driver (Srinivas Pandruvada)

    - Fixes and cleanups in the cpufreq core and generic governor code
    (Rafael Wysocki, Sai Gurrappadi)

    - intel_pstate driver optimizations and cleanups (Rafael Wysocki,
    Philippe Longepe, Chen Yu, Joe Perches)

    - cpufreq powernv driver fixes and cleanups (Akshay Adiga, Shilpasri
    Bhat)

    - cpufreq qoriq driver fixes and cleanups (Jia Hongtao)

    - ACPI cpufreq driver cleanups (Viresh Kumar)

    - Assorted cpufreq driver updates (Ashwin Chaugule, Geliang Tang,
    Javier Martinez Canillas, Paul Gortmaker, Sudeep Holla)

    - Assorted cpufreq fixes and cleanups (Joe Perches, Arnd Bergmann)

    - Fixes and cleanups in the OPP (Operating Performance Points)
    framework, mostly related to OPP sharing, and reorganization of
    OF-dependent code in it (Viresh Kumar, Arnd Bergmann, Sudeep Holla)

    - New "passive" governor for devfreq (for SoC subsystems that will
    rely on someone else for the management of their power resources)
    and consolidation of devfreq support for Exynos platforms, coding
    style and typo fixes for devfreq (Chanwoo Choi, MyungJoo Ham)

    - PM core fixes and cleanups, mostly to make it work better with the
    generic power domains (genpd) framework, and updates for that
    framework (Ulf Hansson, Thierry Reding, Colin Ian King)

    - Intel Broxton support for the intel_idle driver (Len Brown)

    - cpuidle core optimization and fix (Daniel Lezcano, Dave Gerlach)

    - ARM cpuidle cleanups (Jisheng Zhang)

    - Intel Kabylake support for the RAPL power capping driver (Jacob
    Pan)

    - AVS (Adaptive Voltage Switching) rockchip-io driver update (Heiko
    Stuebner)

    - Updates for the cpupower tool (Arjun Sreedharan, Colin Ian King,
    Mattia Dongili, Thomas Renninger)"

    * tag 'pm-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (112 commits)
    intel_pstate: Clean up get_target_pstate_use_performance()
    intel_pstate: Use sample.core_avg_perf in get_avg_pstate()
    intel_pstate: Clarify average performance computation
    intel_pstate: Avoid unnecessary synchronize_sched() during initialization
    cpufreq: schedutil: Make default depend on CONFIG_SMP
    cpufreq: powernv: del_timer_sync when global and local pstate are equal
    cpufreq: powernv: Move smp_call_function_any() out of irq safe block
    intel_pstate: Clean up intel_pstate_get()
    cpufreq: schedutil: Make it depend on CONFIG_SMP
    cpufreq: governor: Fix handling of special cases in dbs_update()
    PM / OPP: Move CONFIG_OF dependent code in a separate file
    cpufreq: intel_pstate: Ignore _PPC processing under HWP
    cpufreq: arm_big_little: use generic OPP functions for {init, free}_opp_table
    PM / OPP: add non-OF versions of dev_pm_opp_{cpumask_, }remove_table
    cpufreq: tango: Use generic platdev driver
    PM / OPP: pass cpumask by reference
    cpufreq: Fix GOV_LIMITS handling for the userspace governor
    cpupower: fix potential memory leak
    PM / devfreq: style/typo fixes
    PM / devfreq: exynos: Add the detailed correlation for Exynos5422 bus
    ..

    Linus Torvalds
     
  • Pull arm64 updates from Will Deacon:

    - virt_to_page/page_address optimisations

    - support for NUMA systems described using device-tree

    - support for hibernate/suspend-to-disk

    - proper support for maxcpus= command line parameter

    - detection and graceful handling of AArch64-only CPUs

    - miscellaneous cleanups and non-critical fixes

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (92 commits)
    arm64: do not enforce strict 16 byte alignment to stack pointer
    arm64: kernel: Fix incorrect brk randomization
    arm64: cpuinfo: Missing NULL terminator in compat_hwcap_str
    arm64: secondary_start_kernel: Remove unnecessary barrier
    arm64: Ensure pmd_present() returns false after pmd_mknotpresent()
    arm64: Replace hard-coded values in the pmd/pud_bad() macros
    arm64: Implement pmdp_set_access_flags() for hardware AF/DBM
    arm64: Fix typo in the pmdp_huge_get_and_clear() definition
    arm64: mm: remove unnecessary EXPORT_SYMBOL_GPL
    arm64: always use STRICT_MM_TYPECHECKS
    arm64: kvm: Fix kvm teardown for systems using the extended idmap
    arm64: kaslr: increase randomization granularity
    arm64: kconfig: drop CONFIG_RTC_LIB dependency
    arm64: make ARCH_SUPPORTS_DEBUG_PAGEALLOC depend on !HIBERNATION
    arm64: hibernate: Refuse to hibernate if the boot cpu is offline
    arm64: kernel: Add support for hibernate/suspend-to-disk
    PM / Hibernate: Call flush_icache_range() on pages restored in-place
    arm64: Add new asm macro copy_page
    arm64: Promote KERNEL_START/KERNEL_END definitions to a header file
    arm64: kernel: Include _AC definition in page.h
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:

    - massive CPU hotplug rework (Thomas Gleixner)

    - improve migration fairness (Peter Zijlstra)

    - CPU load calculation updates/cleanups (Yuyang Du)

    - cpufreq updates (Steve Muckle)

    - nohz optimizations (Frederic Weisbecker)

    - switch_mm() micro-optimization on x86 (Andy Lutomirski)

    - ... lots of other enhancements, fixes and cleanups.

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (66 commits)
    ARM: Hide finish_arch_post_lock_switch() from modules
    sched/core: Provide a tsk_nr_cpus_allowed() helper
    sched/core: Use tsk_cpus_allowed() instead of accessing ->cpus_allowed
    sched/loadavg: Fix loadavg artifacts on fully idle and on fully loaded systems
    sched/fair: Correct unit of load_above_capacity
    sched/fair: Clean up scale confusion
    sched/nohz: Fix affine unpinned timers mess
    sched/fair: Fix fairness issue on migration
    sched/core: Kill sched_class::task_waking to clean up the migration logic
    sched/fair: Prepare to fix fairness problems on migration
    sched/fair: Move record_wakee()
    sched/core: Fix comment typo in wake_q_add()
    sched/core: Remove unused variable
    sched: Make hrtick_notifier an explicit call
    sched/fair: Make ilb_notifier an explicit call
    sched/hotplug: Make activate() the last hotplug step
    sched/hotplug: Move migration CPU_DYING to sched_cpu_dying()
    sched/migration: Move CPU_ONLINE into scheduler state
    sched/migration: Move calc_load_migrate() into CPU_DYING
    sched/migration: Move prepare transition to SCHED_STARTING state
    ...

    Linus Torvalds
     
  • Pull perf updates from Ingo Molnar:
    "Bigger kernel side changes:

    - Add backwards writing capability to the perf ring-buffer code,
    which is preparation for future advanced features like robust
    'overwrite support' and snapshot mode. (Wang Nan)

    - Add pause and resume ioctls for the perf ringbuffer (Wang Nan)

    - x86 Intel cstate code cleanups and reorganization (Thomas Gleixner)

    - x86 Intel uncore and CPU PMU driver updates (Kan Liang, Peter
    Zijlstra)

    - x86 AUX (Intel PT) related enhancements and updates (Alexander
    Shishkin)

    - x86 MSR PMU driver enhancements and updates (Huang Rui)

    - ... and lots of other changes spread out over 40+ commits.

    Biggest tooling side changes:

    - 'perf trace' features and enhancements. (Arnaldo Carvalho de Melo)

    - BPF tooling updates (Wang Nan)

    - 'perf sched' updates (Jiri Olsa)

    - 'perf probe' updates (Masami Hiramatsu)

    - ... plus 200+ other enhancements, fixes and cleanups to tools/

    The merge commits, the shortlog and the changelogs contain a lot more
    details"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (249 commits)
    perf/core: Disable the event on a truncated AUX record
    perf/x86/intel/pt: Generate PMI in the STOP region as well
    perf buildid-cache: Use lsdir() for looking up buildid caches
    perf symbols: Use lsdir() for the search in kcore cache directory
    perf tools: Use SBUILD_ID_SIZE where applicable
    perf tools: Fix lsdir to set errno correctly
    perf trace: Move seccomp args beautifiers to tools/perf/trace/beauty/
    perf trace: Move flock op beautifier to tools/perf/trace/beauty/
    perf build: Add build-test for debug-frame on arm/arm64
    perf build: Add build-test for libunwind cross-platforms support
    perf script: Fix export of callchains with recursion in db-export
    perf script: Fix callchain addresses in db-export
    perf script: Fix symbol insertion behavior in db-export
    perf symbols: Add dso__insert_symbol function
    perf scripting python: Use Py_FatalError instead of die()
    perf tools: Remove xrealloc and ALLOC_GROW
    perf help: Do not use ALLOC_GROW in add_cmd_list
    perf pmu: Make pmu_formats_string to check return value of strbuf
    perf header: Make topology checkers to check return value of strbuf
    perf tools: Make alias handler to check return value of strbuf
    ...

    Linus Torvalds
     
  • Pull support for killable rwsems from Ingo Molnar:
    "This, by Michal Hocko, implements down_write_killable().

    The main usecase will be to update mmap_sem usage sites to use this new
    API, to allow the mm-reaper introduced in commit aac453635549 ("mm,
    oom: introduce oom reaper") to tear down oom victim address spaces
    asynchronously with minimum latencies and without deadlock worries"

    [ The vfs will want it too as the inode lock is changed from a mutex to
    a rwsem due to the parallel lookup and readdir updates ]

    * 'locking-rwsem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/rwsem: Fix comment on register clobbering
    locking/rwsem: Fix down_write_killable()
    locking/rwsem, x86: Add frame annotation for call_rwsem_down_write_failed_killable()
    locking/rwsem: Provide down_write_killable()
    locking/rwsem, x86: Provide __down_write_killable()
    locking/rwsem, s390: Provide __down_write_killable()
    locking/rwsem, ia64: Provide __down_write_killable()
    locking/rwsem, alpha: Provide __down_write_killable()
    locking/rwsem: Introduce basis for down_write_killable()
    locking/rwsem, sparc: Drop superfluous arch specific implementation
    locking/rwsem, sh: Drop superfluous arch specific implementation
    locking/rwsem, xtensa: Drop superfluous arch specific implementation
    locking/rwsem: Drop explicit memory barriers
    locking/rwsem: Get rid of __down_write_nested()

    Linus Torvalds
     
  • Pull locking changes from Ingo Molnar:
    "The main changes in this cycle were:

    - pvqspinlock statistics fixes (Davidlohr Bueso)

    - flip atomic_fetch_or() arguments (Peter Zijlstra)

    - locktorture simplification (Paul E. McKenney)

    - documentation updates (SeongJae Park, David Howells, Davidlohr
    Bueso, Paul E McKenney, Peter Zijlstra, Will Deacon)

    - various fixes"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/atomics: Flip atomic_fetch_or() arguments
    locking/pvqspinlock: Robustify init_qspinlock_stat()
    locking/pvqspinlock: Avoid double resetting of stats
    lcoking/locktorture: Simplify the torture_runnable computation
    locking/Documentation: Clarify that ACQUIRE applies to loads, RELEASE applies to stores
    locking/Documentation: State purpose of memory-barriers.txt
    locking/Documentation: Add disclaimer
    locking/Documentation/lockdep: Fix spelling mistakes
    locking/lockdep: Deinline register_lock_class(), save 2328 bytes
    locking/locktorture: Fix NULL pointer dereference for cleanup paths
    locking/locktorture: Fix deboosting NULL pointer dereference
    locking/Documentation: Mention smp_cond_acquire()
    locking/Documentation: Insert white spaces consistently
    locking/Documentation: Fix formatting inconsistencies
    locking/Documentation: Add missed subsection in TOC
    locking/Documentation: Fix missed s/lock/acquire renames
    locking/Documentation: Clarify relationship of barrier() to control dependencies

    Linus Torvalds
     
  • Pull core signal updates from Ingo Molnar:
    "These updates from Stas Sergeev and Andy Lutomirski improve the
    sigaltstack interface by extending its ABI with the SS_AUTODISARM
    feature, which makes it possible to use swapcontext() in a sighandler
    that works on sigaltstack. Without this flag, the subsequent signal
    will corrupt the state of the switched-away sighandler.

    The inspiration is more robust dosemu signal handling"

    * 'core-signals-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    signals/sigaltstack: Change SS_AUTODISARM to (1U << 31)
    signals/sigaltstack: Report current flag bits in sigaltstack()
    selftests/sigaltstack: Fix the sigaltstack test on old kernels
    signals/sigaltstack: If SS_AUTODISARM, bypass on_sig_stack()
    selftests/sigaltstack: Add new testcase for sigaltstack(SS_ONSTACK|SS_AUTODISARM)
    signals/sigaltstack: Implement SS_AUTODISARM flag
    signals/sigaltstack: Prepare to add new SS_xxx flags
    signals/sigaltstack, x86/signals: Unify the x86 sigaltstack check with other architectures

    Linus Torvalds
     
  • Pull RCU updates from Ingo Molnar:
    "The main changes are:

    - Documentation updates, including fixes to the design-level
    requirements documentation and a fixed version of the design-level
    data-structure documentation. These fixes include removing
    cartoons and getting rid of the html/htmlx duplication.

    - Further improvements to the new-age expedited grace periods.

    - Miscellaneous fixes.

    - Torture-test changes, including a new rcuperf module for measuring
    RCU grace-period performance and scalability, which is useful for
    the expedited-grace-period changes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (56 commits)
    rcutorture: Add boot-time adjustment of leaf fanout
    rcutorture: Add irqs-disabled test for call_rcu()
    rcutorture: Dump trace buffer upon shutdown
    rcutorture: Don't rebuild identical kernel
    rcutorture: Add OS-jitter capability
    documentation: Add documentation for RCU's major data structures
    rcutorture: Convert test duration to seconds early
    torture: Kill qemu, not parent process
    torture: Clarify refusal to run more than one torture test
    rcutorture: Consider FROZEN hotplug notifier transitions
    rcutorture: Remove redundant initialization to zero
    rcuperf: Do not wake up shutdown wait queue if "shutdown" is false.
    rcutorture: Add largish-system rcuperf scenario
    rcutorture: Avoid RCU CPU stall warning and RT throttling
    rcutorture: Add rcuperf holdoff boot parameter to reduce interference
    rcutorture: Make scripts analyze rcuperf trace data, if present
    rcutorture: Make rcuperf collect expedited event-trace data
    rcutorture: Print measure of batching efficiency
    rcutorture: Set rcuperf writer kthreads to real-time priority
    rcutorture: Bind rcuperf reader/writer kthreads to CPUs
    ...

    Linus Torvalds
     
  • This work adds a generic facility for use from eBPF JIT compilers
    that allows for further hardening of JIT generated images through
    blinding constants. In response to the original work on BPF JIT
    spraying published by Keegan McAllister [1], most BPF JITs were
    changed to make images read-only and start at a randomized offset
    in the page, where the rest was filled with trap instructions. We
    have this nowadays in x86, arm, arm64 and s390 JIT compilers.
    Additionally, later work also made eBPF interpreter images read
    only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
    arm, arm64 and s390 archs as well currently. This is done by
    default for mentioned JITs when JITing is enabled. Furthermore,
    we had a generic and configurable constant blinding facility on our
    todo for quite some time now to further make spraying harder, and
    first implementation since around netconf 2016.

    We found that for systems where untrusted users can load cBPF/eBPF
    code where JIT is enabled, start offset randomization helps a bit
    to make jumps into crafted payload harder, but in case where larger
    programs that cross page boundary are injected, we again have some
    part of the program opcodes at a page start offset. With improved
    guessing and more reliable payload injection, chances can increase
    to jump into such payload. Elena Reshetova recently wrote a test
    case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
    can leave some more room for payloads. Note that for all this,
    additional bugs in the kernel are still required to make the jump
    (and of course to guess right, to not jump into a trap) and naturally
    the JIT must be enabled, which is disabled by default.

    For helping mitigation, the general idea is to provide an option
    bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
    that for cases where JIT should be enabled for performance reasons,
    the generated image can be further hardened with blinding constants
    for unprivileged users (bpf_jit_harden == 1), with trading off
    performance for these, but not for privileged ones. We also added
    the option of blinding for all users (bpf_jit_harden == 2), which
    is quite helpful for testing f.e. with test_bpf.ko. No further
    hardening levels of the bpf_jit_harden switch are intended; the
    rationale is to keep it dead simple to use as on/off. Since this
    functionality would need to be duplicated over and over for JIT
    compilers to use, which are already complex enough, we provide a
    generic eBPF byte-code level based blinding implementation, which is
    then just transparently JITed. JIT compilers need to make only a few
    changes to integrate this facility and can be migrated one by one.
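
    The policy described in this paragraph can be summarised as a small
    decision function. This is a sketch only: should_blind() and its
    boolean "privileged" argument are hypothetical names for illustration
    and do not reflect how the kernel actually checks privileges.

```python
def should_blind(jit_enable, jit_harden, privileged):
    """Decide whether constants are blinded for a program being JITed.

    jit_enable -- truthy if the JIT is enabled at all (bpf_jit_enable)
    jit_harden -- 0: off, 1: unprivileged users only, 2: all users
    privileged -- hypothetical flag: True if the loading user is privileged
    """
    if not jit_enable:
        return False  # nothing is JITed, so there is nothing to blind
    if jit_harden == 2:
        return True  # blind for everyone (handy for testing)
    if jit_harden == 1:
        return not privileged  # only unprivileged users pay the cost
    return False


for harden in (0, 1, 2):
    print(harden, should_blind(1, harden, False), should_blind(1, harden, True))
```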

    This option is for eBPF JITs and will be used in x86, arm64, s390
    without too much effort, and soon ppc64 JITs, thus that native eBPF
    can be blinded as well as cBPF to eBPF migrations, so that both can
    be covered with a single implementation. The rule for JITs is that
    bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
    and in case blinding is disabled, we follow normally with JITing the
    passed program. In case blinding is enabled and we fail during the
    process of blinding itself, we must return with the interpreter.
    Similarly, in case the JITing process after the blinding failed, we
    return normally to the interpreter with the non-blinded code. Meaning,
    interpreter doesn't change in any way and operates on eBPF code as
    usual. For doing this pre-JIT blinding step, we need to make use of
    a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
    to the JIT and not in any way part of the eBPF architecture. It is
    used in the same way as JITs internally make use of some helper
    registers when emitting code, only that here the helper register is
    one abstraction level higher, in eBPF bytecode, but nevertheless in
    the JIT phase. That helper register is needed since, f.e., a manually
    written program can issue loads to all registers of the eBPF
    architecture.
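
    The fallback rules for JITs stated above can be sketched as follows.
    This models only the control flow: jit_compile(), and the blind and
    jit callables passed to it, are hypothetical stand-ins and not the
    kernel's bpf_int_jit_compile() interface. Each callable is assumed to
    return None on failure.

```python
def jit_compile(prog, blinding_enabled, blind, jit):
    """Return (what_runs, mode) following the fallback rules in the text."""
    if not blinding_enabled:
        # Blinding disabled: JIT the passed program as usual.
        image = jit(prog)
        return (image, "jit") if image is not None else (prog, "interpreter")

    blinded = blind(prog)
    if blinded is None:
        # Blinding itself failed: run the original program in the interpreter.
        return prog, "interpreter"

    image = jit(blinded)
    if image is None:
        # JITing the blinded program failed: interpreter, non-blinded code.
        return prog, "interpreter"
    return image, "jit+blinding"


ok_blind = lambda p: "blinded(" + p + ")"
ok_jit = lambda p: "image(" + p + ")"
fail = lambda p: None

print(jit_compile("prog", True, ok_blind, ok_jit))
print(jit_compile("prog", True, ok_blind, fail))
```

    In every failure case the interpreter receives the original program,
    so the interpreter itself never sees blinded code.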

    The core concept with the additional register is: blind out all 32
    and 64 bit constants by converting BPF_K based instructions into a
    small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
    is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
    and REG <OP> BPF_REG_AX, so actual operation on the target register
    is translated from BPF_K into BPF_X one that is operating on
    BPF_REG_AX's content. During rewriting phase when blinding, RND is
    newly generated via prandom_u32() for each processed instruction.
    64 bit loads are split into two 32 bit loads to make translation and
    patching not too complex. Only basic thing required by JITs is to
    call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
    pair, and to map BPF_REG_AX into an unused register.
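
    The rewrite of a single 32 bit constant can be sketched like this. It
    is an illustration, not the kernel implementation: the function names
    and the tuple-based instruction format are made up for the example, and
    secrets.randbits() merely stands in for prandom_u32(). The fixed RND
    and the resulting blinded value are taken from the first mov/xor pair
    of the bpf_jit_disasm extract in this commit message.

```python
import secrets

MASK32 = 0xFFFFFFFF


def blind_constant(k_val, rnd=None):
    """Return (RND ^ K_VAL, RND), the two immediates emitted instead of K_VAL."""
    if rnd is None:
        rnd = secrets.randbits(32)  # stand-in for the kernel's prandom_u32()
    return (rnd ^ k_val) & MASK32, rnd & MASK32


def unblind(blinded, rnd):
    """Value that ends up in the auxiliary register at run time."""
    return (blinded ^ rnd) & MASK32


def blind_mov_imm(dst, k_val, rnd=None):
    """Rewrite 'dst = K_VAL' into register-based steps via the AX helper."""
    blinded, rnd = blind_constant(k_val, rnd)
    return [
        ("mov", "AX", blinded),  # AX := RND ^ K_VAL
        ("xor", "AX", rnd),      # AX ^= RND, which restores K_VAL
        ("mov", dst, "AX"),      # the original op now reads a register
    ]


payload = 0xA8909090  # constant from the unhardened extract
blinded, rnd = blind_constant(payload, rnd=0x4989B5F3)
print(hex(blinded), hex(unblind(blinded, rnd)))
```

    Since RND is drawn afresh for each instruction, the payload bytes never
    appear as an immediate in the generated image.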

    Small bpf_jit_disasm extract from [2] when applied to x86 JIT:

    echo 0 > /proc/sys/net/core/bpf_jit_harden

    ffffffffa034f5e9 + :
    [...]
    39: mov $0xa8909090,%eax
    3e: mov $0xa8909090,%eax
    43: mov $0xa8ff3148,%eax
    48: mov $0xa89081b4,%eax
    4d: mov $0xa8900bb0,%eax
    52: mov $0xa810e0c1,%eax
    57: mov $0xa8908eb4,%eax
    5c: mov $0xa89020b0,%eax
    [...]

    echo 1 > /proc/sys/net/core/bpf_jit_harden

    ffffffffa034f1e5 + :
    [...]
    39: mov $0xe1192563,%r10d
    3f: xor $0x4989b5f3,%r10d
    46: mov %r10d,%eax
    49: mov $0xb8296d93,%r10d
    4f: xor $0x10b9fd03,%r10d
    56: mov %r10d,%eax
    59: mov $0x8c381146,%r10d
    5f: xor $0x24c7200e,%r10d
    66: mov %r10d,%eax
    69: mov $0xeb2a830e,%r10d
    6f: xor $0x43ba02ba,%r10d
    76: mov %r10d,%eax
    79: mov $0xd9730af,%r10d
    7f: xor $0xa5073b1f,%r10d
    86: mov %r10d,%eax
    89: mov $0x9a45662b,%r10d
    8f: xor $0x325586ea,%r10d
    96: mov %r10d,%eax
    [...]

    As can be seen, original constants that carry payload are hidden
    when enabled, actual operations are transformed from constant-based
    to register-based ones, making jumps into constants ineffective.
    Above extract/example uses single BPF load instruction over and
    over, but of course all instructions with constants are blinded.

    Performance wise, JIT with blinding performs a bit slower than just
    JIT and faster than interpreter case. This is expected, since we
    still get all the performance benefits from JITing and in normal
    use-cases not every single instruction needs to be blinded. Summing
    up all 296 test cases averaged over multiple runs from test_bpf.ko
    suite, interpreter was 55% slower than JIT only and JIT with blinding
    was 8% slower than JIT only. Since there are also some extremes in
    the test suite, I expect for ordinary workloads that the performance
    for the JIT with blinding case is even closer to JIT only case,
    f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
    35 (+ blinding), and 151 (interpreter).
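
    For reference, the relative costs implied by the nmap timings quoted
    above can be worked out directly. Only the three timings come from the
    text; the helper name and the choice of metrics are illustrative.

```python
timings_ns = {"jit": 29, "jit_blinding": 35, "interpreter": 151}


def slowdown_pct(value, baseline):
    """Percentage by which value exceeds baseline."""
    return (value - baseline) / baseline * 100.0


jit = timings_ns["jit"]
blinding_pct = slowdown_pct(timings_ns["jit_blinding"], jit)
interp_pct = slowdown_pct(timings_ns["interpreter"], jit)

# Share of the JIT-over-interpreter gain that blinding gives back.
gap_pct = (timings_ns["jit_blinding"] - jit) / (timings_ns["interpreter"] - jit) * 100.0

print(round(blinding_pct, 1), round(interp_pct, 1), round(gap_pct, 1))
```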

    BPF test suite, seccomp test suite, eBPF sample code and various
    bigger networking eBPF programs have been tested with this and were
    running fine. For testing purposes, I also adapted interpreter and
    redirected blinded eBPF image to interpreter and also here all tests
    pass.

    [1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
    [2] https://github.com/01org/jit-spray-poc-for-ksp/
    [3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5

    Signed-off-by: Daniel Borkmann
    Reviewed-by: Elena Reshetova
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Since the blinding is strictly only called from inside eBPF JITs,
    we need to change signatures for bpf_int_jit_compile() and
    bpf_prog_select_runtime() first in order to prepare that the
    eBPF program we're dealing with can change underneath. Hence,
    for call sites, we need to return the latest prog. No functional
    change in this patch.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Move the functionality to patch instructions out of the verifier
    code and into the core as the new bpf_patch_insn_single() helper
    will be needed later on for blinding as well. No changes in
    functionality.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Besides others, remove redundant comments where the code is self
    documenting enough, and properly indent various bpf_verifier_ops
    and bpf_prog_type_list declarations. Moreover, remove two exports
    that actually have no module user.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

16 May, 2016

3 commits

  • * pm-cpufreq: (63 commits)
    intel_pstate: Clean up get_target_pstate_use_performance()
    intel_pstate: Use sample.core_avg_perf in get_avg_pstate()
    intel_pstate: Clarify average performance computation
    intel_pstate: Avoid unnecessary synchronize_sched() during initialization
    cpufreq: schedutil: Make default depend on CONFIG_SMP
    cpufreq: powernv: del_timer_sync when global and local pstate are equal
    cpufreq: powernv: Move smp_call_function_any() out of irq safe block
    intel_pstate: Clean up intel_pstate_get()
    cpufreq: schedutil: Make it depend on CONFIG_SMP
    cpufreq: governor: Fix handling of special cases in dbs_update()
    cpufreq: intel_pstate: Ignore _PPC processing under HWP
    cpufreq: arm_big_little: use generic OPP functions for {init, free}_opp_table
    cpufreq: tango: Use generic platdev driver
    cpufreq: Fix GOV_LIMITS handling for the userspace governor
    cpufreq: mvebu: Move cpufreq code into drivers/cpufreq/
    cpufreq: dt: Kill platform-data
    mvebu: Use dev_pm_opp_set_sharing_cpus() to mark OPP tables as shared
    cpufreq: dt: Identify cpu-sharing for platforms without operating-points-v2
    cpufreq: governor: Change confusing struct field and variable names
    cpufreq: intel_pstate: Enable PPC enforcement for servers
    ...

    Rafael J. Wysocki
     
  • The new signal_pending exit path in __rwsem_down_write_failed_common()
    was fingered as breaking his kernel by Tetsuo Handa.

    Upon inspection it was found that there are two things wrong with it;

    - it forgets to remove WAITING_BIAS if it leaves the list empty, or
    - it forgets to wake further waiters that were blocked on the now
    removed waiter.

    Especially the first issue causes new lock attempts to block and stall
    indefinitely, as the code assumes that pending waiters mean there is
    an owner that will wake when it releases the lock.

    Reported-by: Tetsuo Handa
    Tested-by: Tetsuo Handa
    Tested-by: Michal Hocko
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Andrew Morton
    Cc: Arnaldo Carvalho de Melo
    Cc: Chris Zankel
    Cc: David S. Miller
    Cc: Davidlohr Bueso
    Cc: H. Peter Anvin
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Max Filippov
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vince Weaver
    Cc: Waiman Long
    Link: http://lkml.kernel.org/r/20160512115745.GP3192@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The nf_conntrack_core.c fix in 'net' is not relevant in 'net-next'
    because we no longer have a per-netns conntrack hash.

    The ip_gre.c conflict as well as the iwlwifi ones were cases of
    overlapping changes.

    Conflicts:
    drivers/net/wireless/intel/iwlwifi/mvm/tx.c
    net/ipv4/ip_gre.c
    net/netfilter/nf_conntrack_core.c

    Signed-off-by: David S. Miller

    David S. Miller
     

14 May, 2016

1 commit

  • Pull cgroup fixes from Tejun Heo:
    "During v4.6-rc1 cgroup namespace support was merged. There is an
    issue where it's impossible to tell whether a given cgroup mount point
    is bind mounted or namespaced. Serge has been working on the issue
    but it took longer than expected to resolve, so the late pull request.

    Given that it's a completely new feature and the patches don't touch
    anything else, the risk seems acceptable. However, if this is too
    late, an alternative is plugging new cgroup ns creation for v4.6 and
    retrying for v4.7"

    * 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: fix compile warning
    kernfs: kernfs_sop_show_path: don't return 0 after seq_dentry call
    cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces
    kernfs_path_from_node_locked: don't overwrite nlen

    Linus Torvalds