02 Apr, 2016

1 commit

  • Newer Fedora and OpenSUSE didn't boot with my standard configuration.
    It took me some time to figure out why; in fact, I had to write a script
    to try different config options systematically.

    The problem is that something (systemd) in dracut depends on
    CONFIG_FHANDLE, which adds open by file handle syscalls.

    While it is set in defconfigs it is very easy to miss when updating
    older configs because it is not default y.

    Make it default y and also depend on EXPERT, as dracut use is likely
    widespread.

    Signed-off-by: Andi Kleen
    Cc: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
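
The failure mode described in this commit (an option that an older config silently lacks because it is not default y) can be caught with a small check. The following is a minimal sketch, assuming the usual `.config` text format; the function name `missing_options` and the sample config are hypothetical, and this is not the script the author mentions.

```python
def missing_options(config_text, required):
    """Return the required CONFIG_ symbols that are not set to 'y' in a .config text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Unset options appear as comments: "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        if value == "y":
            enabled.add(name)
    return [opt for opt in required if opt not in enabled]


# Hypothetical older config that predates the new default.
old_config = """\
CONFIG_EXPERT=y
# CONFIG_FHANDLE is not set
CONFIG_CGROUPS=y
"""

print(missing_options(old_config, ["CONFIG_FHANDLE", "CONFIG_CGROUPS"]))
```

Running such a check after updating an old config would flag CONFIG_FHANDLE before the boot failure is hit.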
     

19 Mar, 2016

1 commit

  • Pull cgroup updates from Tejun Heo:
    "cgroup changes for v4.6-rc1. No userland visible behavior changes in
    this pull request. I'll send out a separate pull request for the
    addition of cgroup namespace support.

    - The biggest change is the revamping of cgroup core task migration
    and controller handling logic. There are quite a few places where
    controllers and tasks are manipulated. Previously, many of those
    places implemented custom operations for each specific use case
    assuming specific starting conditions. While this worked, it makes
    the code fragile and difficult to follow.

    The bulk of this pull request restructures these operations so that
    most related operations are performed through common helpers which
    implement recursive (subtrees are always processed consistently)
    and idempotent (they make cgroup hierarchy converge to the target
    state rather than performing operations assuming specific starting
    conditions). This makes the code a lot easier to understand,
    verify and extend.

    - Implicit controller support is added. This is primarily for using
    perf_event on the v2 hierarchy so that perf can match cgroup v2
    path without requiring the user to do anything special. The kernel
    portion of perf_event changes is acked but userland changes are
    still pending review.

    - cgroup_no_v1= boot parameter added to ease testing cgroup v2 in
    certain environments.

    - There is a regression introduced during v4.4 devel cycle where
    attempts to migrate zombie tasks can mess up internal object
    management. This was fixed earlier this week and included in this
    pull request w/ stable cc'd.

    - Misc non-critical fixes and improvements"

    * 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (44 commits)
    cgroup: avoid false positive gcc-6 warning
    cgroup: ignore css_sets associated with dead cgroups during migration
    Documentation: cgroup v2: Trivial heading correction.
    cgroup: implement cgroup_subsys->implicit_on_dfl
    cgroup: use css_set->mg_dst_cgrp for the migration target cgroup
    cgroup: make cgroup[_taskset]_migrate() take cgroup_root instead of cgroup
    cgroup: move migration destination verification out of cgroup_migrate_prepare_dst()
    cgroup: fix incorrect destination cgroup in cgroup_update_dfl_csses()
    cgroup: Trivial correction to reflect controller.
    cgroup: remove stale item in cgroup-v1 document INDEX file.
    cgroup: update css iteration in cgroup_update_dfl_csses()
    cgroup: allocate 2x cgrp_cset_links when setting up a new root
    cgroup: make cgroup_calc_subtree_ss_mask() take @this_ss_mask
    cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends
    cgroup: use cgroup_apply_enable_control() in cgroup creation path
    cgroup: combine cgroup_mutex locking and offline css draining
    cgroup: factor out cgroup_{apply|finalize}_control() from cgroup_subtree_control_write()
    cgroup: introduce cgroup_{save|propagate|restore}_control()
    cgroup: make cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() recursive
    cgroup: factor out cgroup_apply_control_enable() from cgroup_subtree_control_write()
    ...

    Linus Torvalds
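
The "recursive and idempotent" property described in this pull request can be illustrated with a toy model. This is only a sketch: the `Cgroup` class and `apply_control()` below are hypothetical stand-ins, not the kernel's data structures or helpers, and the model applies one target set to a whole subtree for simplicity.

```python
class Cgroup:
    def __init__(self, name, enabled=(), children=()):
        self.name = name
        self.enabled = set(enabled)      # controllers currently enabled
        self.children = list(children)


def apply_control(cgrp, target):
    """Make cgrp and its whole subtree converge on `target`; return the number of changes."""
    target = set(target)
    # Count controllers that must be enabled or disabled, whatever the starting state.
    changes = len(cgrp.enabled ^ target)
    cgrp.enabled = set(target)
    # Recursive: every descendant is processed the same way.
    for child in cgrp.children:
        changes += apply_control(child, target)
    return changes


root = Cgroup("root", {"memory"}, [Cgroup("a", {"io"}), Cgroup("b")])
first = apply_control(root, {"memory", "io"})
second = apply_control(root, {"memory", "io"})   # idempotent: nothing left to do
print(first, second)
```

The helper never assumes a starting condition: it compares the current state with the target and closes the gap, so a second call with the same target changes nothing.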
     

18 Mar, 2016

1 commit

  • Pull security layer updates from James Morris:
    "There are a bunch of fixes to the TPM, IMA, and Keys code, with minor
    fixes scattered across the subsystem.

    IMA now requires signed policy, and that policy is also now measured
    and appraised"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (67 commits)
    X.509: Make algo identifiers text instead of enum
    akcipher: Move the RSA DER encoding check to the crypto layer
    crypto: Add hash param to pkcs1pad
    sign-file: fix build with CMS support disabled
    MAINTAINERS: update tpmdd urls
    MODSIGN: linux/string.h should be #included to get memcpy()
    certs: Fix misaligned data in extra certificate list
    X.509: Handle midnight alternative notation in GeneralizedTime
    X.509: Support leap seconds
    Handle ISO 8601 leap seconds and encodings of midnight in mktime64()
    X.509: Fix leap year handling again
    PKCS#7: fix unitialized boolean 'want'
    firmware: change kernel read fail to dev_dbg()
    KEYS: Use the symbol value for list size, updated by scripts/insert-sys-cert
    KEYS: Reserve an extra certificate symbol for inserting without recompiling
    modsign: hide openssl output in silent builds
    tpm_tis: fix build warning with tpm_tis_resume
    ima: require signed IMA policy
    ima: measure and appraise the IMA policy itself
    ima: load policy using path
    ...

    Linus Torvalds
     

17 Mar, 2016

1 commit

  • Merge first patch-bomb from Andrew Morton:

    - some misc things

    - ocfs2 updates

    - about half of MM

    - checkpatch updates

    - autofs4 update

    * emailed patches from Andrew Morton: (120 commits)
    autofs4: fix string.h include in auto_dev-ioctl.h
    autofs4: use pr_xxx() macros directly for logging
    autofs4: change log print macros to not insert newline
    autofs4: make autofs log prints consistent
    autofs4: fix some white space errors
    autofs4: fix invalid ioctl return in autofs4_root_ioctl_unlocked()
    autofs4: fix coding style line length in autofs4_wait()
    autofs4: fix coding style problem in autofs4_get_set_timeout()
    autofs4: coding style fixes
    autofs: show pipe inode in mount options
    kallsyms: add support for relative offsets in kallsyms address table
    kallsyms: don't overload absolute symbol type for percpu symbols
    x86: kallsyms: disable absolute percpu symbols on !SMP
    checkpatch: fix another left brace warning
    checkpatch: improve UNSPECIFIED_INT test for bare signed/unsigned uses
    checkpatch: warn on bare unsigned or signed declarations without int
    checkpatch: exclude asm volatile from complex macro check
    mm: memcontrol: drop unnecessary lru locking from mem_cgroup_migrate()
    mm: migrate: consolidate mem_cgroup_migrate() calls
    mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous
    ...

    Linus Torvalds
     

16 Mar, 2016

4 commits

  • Similar to how relative extables are implemented, it is possible to emit
    the kallsyms table in such a way that it contains offsets relative to
    some anchor point in the kernel image rather than absolute addresses.

    On 64-bit architectures, it cuts the size of the kallsyms address table
    in half, since offsets between kernel symbols can typically be expressed
    in 32 bits. This saves several hundreds of kilobytes of permanent
    .rodata on average. In addition, the kallsyms address table is no
    longer subject to dynamic relocation when CONFIG_RELOCATABLE is in
    effect, so the relocation work done after decompression now doesn't have
    to do relocation updates for all these values. This saves up to 24
    bytes (i.e., the size of a ELF64 RELA relocation table entry) per value,
    which easily adds up to a couple of megabytes of uncompressed __init
    data on ppc64 or arm64. Even if these relocation entries typically
    compress well, the combined size reduction of 2.8 MB uncompressed for a
    ppc64_defconfig build (of which 2.4 MB is __init data) results in a ~500
    KB space saving in the compressed image.

    Since it is useful for some architectures (like x86) to retain the
    ability to emit absolute values as well, this patch also adds support
    for capturing both absolute and relative values when
    KALLSYMS_ABSOLUTE_PERCPU is in effect, by emitting absolute per-cpu
    addresses as positive 32-bit values, and addresses relative to the
    lowest encountered relative symbol as negative values, which are
    subtracted from the runtime address of this base symbol to produce the
    actual address.

    Support for the above is enabled by default for all architectures except
    IA-64 and Tile-GX, whose symbols are too far apart to capture in this
    manner.

    Signed-off-by: Ard Biesheuvel
    Tested-by: Guenter Roeck
    Reviewed-by: Kees Cook
    Tested-by: Kees Cook
    Cc: Heiko Carstens
    Cc: Michael Ellerman
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Benjamin Herrenschmidt
    Cc: Michal Marek
    Cc: Rusty Russell
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ard Biesheuvel
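
The encoding described in this commit can be sketched in a few lines. This is an illustrative model, not the code in scripts/kallsyms.c: the function names and sample addresses are made up, and the extra `- 1` bias (so that the base symbol itself still encodes as a negative value) is an assumption of this sketch.

```python
import struct


def encode_table(symbols):
    """symbols: list of (address, is_percpu). Returns (base, signed 32-bit values)."""
    # The anchor is the lowest symbol that is stored relatively.
    base = min(addr for addr, is_percpu in symbols if not is_percpu)
    encoded = []
    for addr, is_percpu in symbols:
        if is_percpu:
            if not 0 <= addr < 2**31:
                raise ValueError("per-cpu address does not fit in a positive 32-bit value")
            value = addr                 # absolute value, stored as-is
        else:
            value = base - addr - 1      # relative value, always negative
            if value < -2**31:
                raise ValueError("symbol too far from the base symbol")
        encoded.append(value)
    return base, encoded


def decode(runtime_base, value):
    """Recover an address from a table entry and the runtime address of the base symbol."""
    return value if value >= 0 else runtime_base - 1 - value


symbols = [
    (0xFFFFFFFF81000000, False),
    (0xFFFFFFFF81000040, False),
    (0x1C000, True),                     # zero-based per-cpu offset
    (0xFFFFFFFF81A00000, False),
]
base, table = encode_table(symbols)

absolute_bytes = len(struct.pack("<%dQ" % len(symbols), *[a for a, _ in symbols]))
relative_bytes = len(struct.pack("<%di" % len(table), *table))
print(absolute_bytes, relative_bytes)
```

Because only the base symbol's runtime address is needed at lookup time, the table entries themselves need no relocation: passing a shifted base to `decode()` moves every relative symbol by the same amount while per-cpu values stay absolute.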
     
  • scripts/kallsyms.c has a special --absolute-percpu command line option
    which deals with the zero based per cpu offsets that are used when
    building for SMP on x86_64. This means that the option should only be
    passed in that case, so add a Kconfig symbol with the correct predicate,
    and use that instead.

    Signed-off-by: Ard Biesheuvel
    Tested-by: Guenter Roeck
    Reviewed-by: Kees Cook
    Tested-by: Kees Cook
    Acked-by: Rusty Russell
    Cc: Heiko Carstens
    Cc: Michael Ellerman
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Benjamin Herrenschmidt
    Cc: Michal Marek
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ard Biesheuvel
     
  • Use list_for_each_entry() instead of list_for_each() to simplify the code.

    Signed-off-by: Geliang Tang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • Pull cpu hotplug updates from Thomas Gleixner:
    "This is the first part of the ongoing cpu hotplug rework:

    - Initial implementation of the state machine

    - Runs all online and prepare down callbacks on the plugged cpu and
    not on some random processor

    - Replaces busy loop waiting with completions

    - Adds tracepoints so the states can be followed"

    More detailed commentary on this work from an earlier email:
    "What's wrong with the current cpu hotplug infrastructure?

    - Asymmetry

    The hotplug notifier mechanism is asymmetric versus the bringup and
    teardown. This is mostly caused by the notifier mechanism.

    - Largely undocumented dependencies

    While some notifiers use explicitly defined notifier priorities,
    we have quite some notifiers which use numerical priorities to
    express dependencies without any documentation why.

    - Control processor driven

    Most of the bringup/teardown of a cpu is driven by a control
    processor. While it is understandable that preparatory steps,
    like idle thread creation, memory allocation for and initialization
    of essential facilities needs to be done before a cpu can boot,
    there is no reason why everything else must run on a control
    processor. Before this patch series, bringup looks like this:

    Control CPU                   Booting CPU

    do preparatory steps
    kick cpu into life

                                  do low level init

    sync with booting cpu         sync with control cpu

    bring the rest up

    - All or nothing approach

    There is no way to do partial bringups. That's something which is
    really desired because we waste e.g. at boot substantial amount of
    time just busy waiting that the cpu comes to life. That's stupid
    as we could very well do preparatory steps and the initial IPI for
    other cpus and then go back and do the necessary low level
    synchronization with the freshly booted cpu.

    - Minimal debuggability

    Due to the notifier based design, it's impossible to switch between
    two stages of the bringup/teardown back and forth in order to test
    the correctness. So in many hotplug notifiers the cancel
    mechanisms are either not existent or completely untested.

    - Notifier [un]registering is tedious

    To [un]register notifiers we need to protect against hotplug at
    every callsite. There is no mechanism that bringup/teardown
    callbacks are issued on the online cpus, so every caller needs to
    do it itself. That also includes error rollback.

    What's the new design?

    The base of the new design is a symmetric state machine, where both
    the control processor and the booting/dying cpu execute a well
    defined set of states. Each state is symmetric in the end, except
    for some well defined exceptions, and the bringup/teardown can be
    stopped and reversed at almost all states.

    So the bringup of a cpu will look like this in the future:

    Control CPU                   Booting CPU

    do preparatory steps
    kick cpu into life

                                  do low level init

    sync with booting cpu         sync with control cpu

                                  bring itself up

    The synchronization step does not require the control cpu to wait.
    That mechanism can be done asynchronously via a worker or some
    other mechanism.

    The teardown can be made very similar, so that the dying cpu cleans
    up and brings itself down. Cleanups which need to be done after
    the cpu is gone, can be scheduled asynchronously as well.

    There is a long way to this, as we need to refactor the notion when a
    cpu is available. Today we set the cpu online right after it comes
    out of the low level bringup, which is not really correct.

    The proper mechanism is to set it to available, i.e. cpu local
    threads, like softirqd, hotplug thread etc. can be scheduled on that
    cpu, and once it finished all booting steps, it's set to online, so
    general workloads can be scheduled on it. The reverse happens on
    teardown. First thing to do is to forbid scheduling of general
    workloads, then teardown all the per cpu resources and finally shut it
    off completely.

    This patch series implements the basic infrastructure for this at the
    core level. This includes the following:

    - Basic state machine implementation with well defined states, so
    ordering and prioritization can be expressed.

    - Interfaces to [un]register state callbacks

    This invokes the bringup/teardown callback on all online cpus with
    the proper protection in place and [un]installs the callbacks in
    the state machine array.

    For callbacks which have no particular ordering requirement we have
    a dynamic state space, so that drivers don't have to register an
    explicit hotplug state.

    If a callback fails, the code automatically does a rollback to the
    previous state.

    - Sysfs interface to drive the state machine to a particular step.

    This is only partially functional today. Full functionality and
    therefore testability will be achieved once we converted all
    existing hotplug notifiers over to the new scheme.

    - Run all CPU_ONLINE/DOWN_PREPARE notifiers on the booting/dying
    processor:

    Control CPU                   Booting CPU

    do preparatory steps
    kick cpu into life

                                  do low level init

    sync with booting cpu         sync with control cpu
    wait for boot
                                  bring itself up

                                  Signal completion to control cpu

    In a previous step of this work we've done a full tree mechanical
    conversion of all hotplug notifiers to the new scheme. The balance
    is a net removal of about 4000 lines of code.

    This is not included in this series, as we decided to take a
    different approach. Instead of mechanically converting everything
    over, we will do a proper overhaul of the usage sites one by one so
    they nicely fit into the symmetric callback scheme.

    I decided to do that after I looked at the ugliness of some of the
    converted sites and figured out that their hotplug mechanism is
    completely buggered anyway. So there is no point to do a
    mechanical conversion first as we need to go through the usage
    sites one by one again in order to achieve a full symmetric and
    testable behaviour"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
    cpu/hotplug: Document states better
    cpu/hotplug: Fix smpboot thread ordering
    cpu/hotplug: Remove redundant state check
    cpu/hotplug: Plug death reporting race
    rcu: Make CPU_DYING_IDLE an explicit call
    cpu/hotplug: Make wait for dead cpu completion based
    cpu/hotplug: Let upcoming cpu bring itself fully up
    arch/hotplug: Call into idle with a proper state
    cpu/hotplug: Move online calls to hotplugged cpu
    cpu/hotplug: Create hotplug threads
    cpu/hotplug: Split out the state walk into functions
    cpu/hotplug: Unpark smpboot threads from the state machine
    cpu/hotplug: Move scheduler cpu_online notifier to hotplug core
    cpu/hotplug: Implement setup/removal interface
    cpu/hotplug: Make target state writeable
    cpu/hotplug: Add sysfs state interface
    cpu/hotplug: Hand in target state to _cpu_up/down
    cpu/hotplug: Convert the hotplugged cpu work to a state machine
    cpu/hotplug: Convert to a state machine for the control processor
    cpu/hotplug: Add tracepoints
    ...

    Linus Torvalds
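
The symmetric state machine with automatic rollback described in this pull request can be modelled compactly. The sketch below is a simplified illustration under stated assumptions: the class names, the state names and the `set_target()` method are hypothetical and do not mirror the kernel's interfaces.

```python
class HotplugState:
    def __init__(self, name, startup=None, teardown=None):
        self.name = name
        self.startup = startup      # called when walking up through this state
        self.teardown = teardown    # called when walking back down


class CpuStateMachine:
    def __init__(self, states):
        self.states = states
        self.current = 0            # number of states whose startup has completed

    def set_target(self, target):
        """Walk towards `target`; if a callback fails, roll back to where we started."""
        start = self.current
        try:
            self._walk(target)
        except Exception:
            self._walk(start)
            return False
        return True

    def _walk(self, target):
        while self.current < target:
            step = self.states[self.current]
            if step.startup:
                step.startup()
            self.current += 1
        while self.current > target:
            step = self.states[self.current - 1]
            if step.teardown:
                step.teardown()
            self.current -= 1


log = []


def failing_online():
    raise RuntimeError("online callback failed")


states = [
    HotplugState("prepare", lambda: log.append("prepare up"), lambda: log.append("prepare down")),
    HotplugState("bringup", lambda: log.append("bringup up"), lambda: log.append("bringup down")),
    HotplugState("online", failing_online),
]
cpu = CpuStateMachine(states)
full_ok = cpu.set_target(3)       # fails in "online", rolls back to state 0
partial_ok = cpu.set_target(2)    # partial bringup stops before "online"
print(full_ok, partial_ok, cpu.current)
```

Driving the machine to an intermediate target is the same idea as the sysfs interface mentioned in the message: any step can be reached, and reversed, on its own.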
     

15 Mar, 2016

1 commit

  • Pull read-only kernel memory updates from Ingo Molnar:
    "This tree adds two (security related) enhancements to the kernel's
    handling of read-only kernel memory:

    - extend read-only kernel memory to a new class of formerly writable
    kernel data: 'post-init read-only memory' via the __ro_after_init
    attribute, and mark the ARM and x86 vDSO as such read-only memory.

    This kind of attribute can be used for data that requires a once
    per bootup initialization sequence, but is otherwise never modified
    after that point.

    This feature was based on the work by PaX Team and Brad Spengler.

    (by Kees Cook, the ARM vDSO bits by David Brown.)

    - make CONFIG_DEBUG_RODATA always enabled on x86 and remove the
    Kconfig option. This simplifies the kernel and also signals that
    read-only memory is the default model and a first-class citizen.
    (Kees Cook)"

    * 'mm-readonly-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    ARM/vdso: Mark the vDSO code read-only after init
    x86/vdso: Mark the vDSO code read-only after init
    lkdtm: Verify that '__ro_after_init' works correctly
    arch: Introduce post-init read-only memory
    x86/mm: Always enable CONFIG_DEBUG_RODATA and remove the Kconfig option
    mm/init: Add 'rodata=off' boot cmdline parameter to disable read-only kernel mappings
    asm-generic: Consolidate mark_rodata_ro()

    Linus Torvalds
     

05 Mar, 2016

1 commit


04 Mar, 2016

1 commit

  • Move the RSA EMSA-PKCS1-v1_5 encoding from the asymmetric-key public_key
    subtype to the rsa crypto module's pkcs1pad template. This means that the
    public_key subtype no longer has any dependencies on public key type.

    To make this work, the following changes have been made:

    (1) The rsa pkcs1pad template is now used for RSA keys. This strips off the
    padding and returns just the message hash.

    (2) In a previous patch, the pkcs1pad template gained an optional second
    parameter that, if given, specifies the hash used. We now give this,
    and pkcs1pad checks the encoded message E(M) for the EMSA-PKCS1-v1_5
    encoding and verifies that the correct digest OID is present.

    (3) The crypto driver in crypto/asymmetric_keys/rsa.c is now reduced to
    something that doesn't care about what the encryption actually does
    and has been merged into public_key.c.

    (4) CONFIG_PUBLIC_KEY_ALGO_RSA is gone. Module signing must set
    CONFIG_CRYPTO_RSA=y instead.

    Thoughts:

    (*) Should the encoding style (eg. raw, EMSA-PKCS1-v1_5) also be passed to
    the padding template? Should there be multiple padding templates
    registered that share most of the code?

    Signed-off-by: David Howells
    Signed-off-by: Tadeusz Struk
    Acked-by: Herbert Xu

    David Howells
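
For readers unfamiliar with the encoding being moved, here is a minimal sketch of EMSA-PKCS1-v1_5 for SHA-256 as specified in PKCS #1. It covers only the padding and digest-OID check, not the RSA operation; the function names are hypothetical and this is not the kernel's pkcs1pad implementation.

```python
import hashlib

# DER-encoded DigestInfo prefix for SHA-256 (from PKCS #1); the 32-byte hash follows it.
SHA256_DIGEST_INFO = bytes.fromhex("3031300d060960864801650304020105000420")


def emsa_pkcs1_v15_encode(message, em_len):
    """Build EM = 0x00 || 0x01 || PS (0xff bytes) || 0x00 || DigestInfo || hash."""
    t = SHA256_DIGEST_INFO + hashlib.sha256(message).digest()
    if em_len < len(t) + 11:
        raise ValueError("intended encoded message length too short")
    ps = b"\xff" * (em_len - len(t) - 3)
    return b"\x00\x01" + ps + b"\x00" + t


def strip_and_check(em):
    """Strip the padding, verify the digest OID, and return just the message hash."""
    if em[:2] != b"\x00\x01":
        raise ValueError("bad leading bytes")
    sep = em.find(b"\x00", 2)
    # PS must be at least 8 bytes long and consist only of 0xff.
    if sep < 10 or em[2:sep] != b"\xff" * (sep - 2):
        raise ValueError("bad padding string")
    t = em[sep + 1:]
    if not t.startswith(SHA256_DIGEST_INFO):
        raise ValueError("unexpected digest OID")
    digest = t[len(SHA256_DIGEST_INFO):]
    if len(digest) != hashlib.sha256().digest_size:
        raise ValueError("bad digest length")
    return digest


em = emsa_pkcs1_v15_encode(b"module contents", 256)   # 256 bytes = 2048-bit modulus
recovered = strip_and_check(em)
print(recovered == hashlib.sha256(b"module contents").digest())
```

The caller then only compares the returned hash with its own, which is the "strips off the padding and returns just the message hash" behaviour described in point (1).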
     

02 Mar, 2016

2 commits

  • Handle the smpboot threads in the state machine.

    Signed-off-by: Thomas Gleixner
    Cc: linux-arch@vger.kernel.org
    Cc: Rik van Riel
    Cc: Rafael Wysocki
    Cc: "Srivatsa S. Bhat"
    Cc: Peter Zijlstra
    Cc: Arjan van de Ven
    Cc: Sebastian Siewior
    Cc: Rusty Russell
    Cc: Steven Rostedt
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Andrew Morton
    Cc: Paul McKenney
    Cc: Linus Torvalds
    Cc: Paul Turner
    Link: http://lkml.kernel.org/r/20160226182341.295777684@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Move the split out steps into a callback array and let the cpu_up/down
    code iterate through the array functions. For now most of the
    callbacks are asymmetric to resemble the current hotplug maze.

    Signed-off-by: Thomas Gleixner
    Cc: linux-arch@vger.kernel.org
    Cc: Rik van Riel
    Cc: Rafael Wysocki
    Cc: "Srivatsa S. Bhat"
    Cc: Peter Zijlstra
    Cc: Arjan van de Ven
    Cc: Sebastian Siewior
    Cc: Rusty Russell
    Cc: Steven Rostedt
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Andrew Morton
    Cc: Paul McKenney
    Cc: Linus Torvalds
    Cc: Paul Turner
    Link: http://lkml.kernel.org/r/20160226182340.671816690@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

22 Feb, 2016

1 commit

  • It may be useful to debug writes to the readonly sections of memory,
    so provide a cmdline "rodata=off" to allow for this. This can be
    expanded in the future to support "log" and "write" modes, but that
    will need to be architecture-specific.

    This also makes KDB software breakpoints more usable, as read-only
    mappings can now be disabled on any kernel.

    Suggested-by: H. Peter Anvin
    Signed-off-by: Kees Cook
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: David Brown
    Cc: Denys Vlasenko
    Cc: Emese Revfy
    Cc: Linus Torvalds
    Cc: Mathias Krause
    Cc: Michael Ellerman
    Cc: PaX Team
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: kernel-hardening@lists.openwall.com
    Cc: linux-arch
    Link: http://lkml.kernel.org/r/1455748879-21872-3-git-send-email-keescook@chromium.org
    Signed-off-by: Ingo Molnar

    Kees Cook
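
How such a switch might be read from a command line can be sketched as follows. This is a hypothetical model, not the kernel's parser: only "rodata=off" comes from the commit message, while accepting "on" and ignoring unknown values are assumptions of this sketch.

```python
def rodata_enabled(cmdline, default=True):
    """Return False when the command line carries rodata=off, else `default`."""
    enabled = default
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key != "rodata" or not sep:
            continue
        if value == "off":
            enabled = False
        elif value == "on":          # assumption: "on" restores the default
            enabled = True
        # Other values (for example a future "log" mode) are ignored here.
    return enabled


print(rodata_enabled("root=/dev/sda1 rodata=off quiet"))
```

Keeping the default enabled means read-only mappings stay in force unless a debugging session explicitly asks otherwise.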
     

09 Feb, 2016

1 commit

  • Lockdep is initialized at compile time now. Get rid of lockdep_init().

    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Cc: Linus Torvalds
    Cc: Mike Krinkin
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Cc: mm-commits@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Andrey Ryabinin
     

21 Jan, 2016

4 commits

  • What CONFIG_INET and CONFIG_LEGACY_KMEM guard inside the memory
    controller code is insignificant, having these conditionals is not
    worth the complication and fragility that comes with them.

    [akpm@linux-foundation.org: rework mem_cgroup_css_free() statement ordering]
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Let the user know that CONFIG_MEMCG_KMEM does not apply to the cgroup2
    interface. This also makes legacy-only code sections stand out better.

    [arnd@arndb.de: mm: memcontrol: only manage socket pressure for CONFIG_INET]
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Acked-by: Vladimir Davydov
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Make initrd_load() return bool due to this particular function only using
    either one or zero as its return value.

    No functional change.

    Signed-off-by: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • Make obsolete_checksetup() return bool due to this particular function
    only using either one or zero as its return value.

    No functional change.

    Signed-off-by: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     

18 Jan, 2016

3 commits

  • Pull audit updates from Paul Moore:
    "Seven audit patches for 4.5, all very minor despite the diffstat.

    The diffstat churn for linux/audit.h can be attributed to needing to
    reshuffle the linux/audit.h header to fix the seccomp auditing issue
    (see the commit description for details).

    Besides the seccomp/audit fix, most of the fixes are around trying to
    improve the connection with the audit daemon and a Kconfig
    simplification. Nothing crazy, and everything passes our little
    audit-testsuite"

    * 'upstream' of git://git.infradead.org/users/pcmoore/audit:
    audit: always enable syscall auditing when supported and audit is enabled
    audit: force seccomp event logging to honor the audit_enabled flag
    audit: Delete unnecessary checks before two function calls
    audit: wake up threads if queue switched from limited to unlimited
    audit: include auditd's threads in audit_log_start() wait exception
    audit: remove audit_backlog_wait_overflow
    audit: don't needlessly reset valid wait time

    Linus Torvalds
     
  • Merge second patch-bomb from Andrew Morton:

    - more MM stuff:

    - Kirill's page-flags rework

    - Kirill's now-allegedly-fixed THP rework

    - MADV_FREE implementation

    - DAX feature work (msync/fsync). This isn't quite complete but DAX
    is new and it's good enough and the guys have a handle on what
    needs to be done - I expect this to be wrapped in the next week or
    two.

    - some vsprintf maintenance work

    - various other misc bits

    * emailed patches from Andrew Morton: (145 commits)
    printk: change recursion_bug type to bool
    lib/vsprintf: factor out %pN[F] handler as netdev_bits()
    lib/vsprintf: refactor duplicate code to special_hex_number()
    printk-formats.txt: remove unimplemented %pT
    printk: help pr_debug and pr_devel to optimize out arguments
    lib/test_printf.c: test dentry printing
    lib/test_printf.c: add test for large bitmaps
    lib/test_printf.c: account for kvasprintf tests
    lib/test_printf.c: add a few number() tests
    lib/test_printf.c: test precision quirks
    lib/test_printf.c: check for out-of-bound writes
    lib/test_printf.c: don't BUG
    lib/kasprintf.c: add sanity check to kvasprintf
    lib/vsprintf.c: warn about too large precisions and field widths
    lib/vsprintf.c: help gcc make number() smaller
    lib/vsprintf.c: expand field_width to 24 bits
    lib/vsprintf.c: eliminate potential race in string()
    lib/vsprintf.c: move string() below widen_string()
    lib/vsprintf.c: pull out padding code from dentry_name()
    printk: do cond_resched() between lines while outputting to consoles
    ...

    Linus Torvalds
     
  • Pull documentation updates from Jon Corbet:
    "A relatively boring cycle in the docs tree. There's a few kernel-doc
    fixes and various document tweaks.

    One patch reaches out of the documentation subtree to fix a comment in
    init/do_mounts_rd.c. There didn't seem to be anybody more appropriate
    to take that one, so I accepted it"

    * tag 'docs-4.5' of git://git.lwn.net/linux: (29 commits)
    thermal: add description for integral_cutoff unit
    Documentation: update libhugetlbfs site url
    Documentation: Explain pci=conf1,conf2 more verbosely
    DMA-API: fix confusing sentence in Documentation/DMA-API.txt
    Documentation: translations: update linux cross reference link
    Documentation: fix typo in CodingStyle
    init, Documentation: Remove ramdisk_blocksize mentions
    Documentation-getdelays: Apply a recommendation from "checkpatch.pl" in main()
    Documentation: HOWTO: update versions from 3.x to 4.x
    Documentation: remove outdated references from translations
    Doc: treewide: Fix grammar "a" to "an"
    Documentation: cpu-hotplug: Fix sysfs mount instructions
    can-doc: Add hint about getting timestamps
    Fix CFQ I/O scheduler parameter name in documentation
    Documentation: arm: remove dead links from Marvell Berlin docs
    Documentation: HOWTO: update code cross reference link
    Doc: Docbook/iio: Fix typo in iio.tmpl
    DocBook: make index.html generation less verbose by default
    DocBook: Cleanup: remove an unused $(call) line
    DocBook: Add a help message for DOCBOOKS env var
    ...

    Linus Torvalds
     

17 Jan, 2016

1 commit

  • uselib hasn't been used since libc5; glibc does not use it. Deprecate
    uselib a bit more, by making the default y only if libc5 was widely used
    on the platform.

    This makes arm64 kernel built with defconfig slightly smaller

    bloat-o-meter:
    add/remove: 0/3 grow/shrink: 0/2 up/down: 0/-1390 (-1390)
    function                 old     new   delta
    kernel_config_data     18164   18162      -2
    uselib_flags              20       -     -20
    padzero                  216     192     -24
    sys_uselib               380       -    -380
    load_elf_library         964       -    -964

    Signed-off-by: Riku Voipio
    Reviewed-by: Josh Triplett
    Acked-by: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Riku Voipio
     

13 Jan, 2016

2 commits

  • To the best of our knowledge, everyone who enables audit at compile
    time also enables syscall auditing; this patch simplifies the Kconfig
    menus by removing the option to disable syscall auditing when audit
    is selected and the target arch supports it.

    Signed-off-by: Paul Moore

    Paul Moore
     
  • Pull cgroup updates from Tejun Heo:

    - cgroup v2 interface is now official. It's no longer hidden behind a
    devel flag and can be mounted using the new cgroup2 fs type.

    Unfortunately, cpu v2 interface hasn't made it yet due to the
    discussion around in-process hierarchical resource distribution and
    only memory and io controllers can be used on the v2 interface at the
    moment.

    - The existing documentation which has always been a bit of mess is
    relocated under Documentation/cgroup-v1/. Documentation/cgroup-v2.txt
    is added as the authoritative documentation for the v2 interface.

    - Some features are added through for-4.5-ancestor-test branch to
    enable netfilter xt_cgroup match to use cgroup v2 paths. The actual
    netfilter changes will be merged through the net tree which pulled in
    the said branch.

    - Various cleanups

    * 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: rename cgroup documentations
    cgroup: fix a typo.
    cgroup: Remove resource_counter.txt in Documentation/cgroup-legacy/00-INDEX.
    cgroup: demote subsystem init messages to KERN_DEBUG
    cgroup: Fix uninitialized variable warning
    cgroup: put controller Kconfig options in meaningful order
    cgroup: clean up the kernel configuration menu nomenclature
    cgroup_pids: fix a typo.
    Subject: cgroup: Fix incomplete dd command in blkio documentation
    cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends
    cpuset: Replace all instances of time_t with time64_t
    cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation
    cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/
    cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type

    Linus Torvalds
     

06 Jan, 2016

1 commit

  • …k/linux-rcu into core/rcu

    Pull RCU changes from Paul E. McKenney:

    - Adding transitivity uniformly to rcu_node structure ->lock
    acquisitions. (This is implemented by the first two commits
    on top of v4.4-rc2 due to the pervasive nature of this change.)

    - Documentation updates, including RCU requirements.

    - Expedited grace-period changes.

    - Miscellaneous fixes.

    - Linked-list fixes, courtesy of KTSAN.

    - Torture-test updates.

    - Late-breaking fix to sysrq-generated crash.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

26 Dec, 2015

1 commit

  • The brd driver has never supported the ramdisk_blocksize kernel
    parameter that was in the rd driver it replaced, so remove
    mention of this parameter from comments and Documentation.

    Commit 9db5579be4bb ("rewrite rd") replaced rd with brd, keeping
    a brd_blocksize variable in struct brd_device but never using it.

    Commit a2cba2913c76 ("brd: get rid of unused members from struct
    brd_device") removed the unused variable.

    Commit f5abc8e75815 ("Documentation/blockdev/ramdisk.txt: updates")
    removed mentions of ramdisk_blocksize from that file.

    Signed-off-by: Robert Elliott
    Signed-off-by: Jonathan Corbet

    Robert Elliott
     

19 Dec, 2015

2 commits


13 Dec, 2015

1 commit

  • Currently the full stop_machine() routine is only enabled on SMP if
    module unloading is enabled, or if the CPUs are hotpluggable. This
    leads to configurations where stop_machine() is broken as it will then
    only run the callback on the local CPU with irqs disabled, and not stop
    the other CPUs or run the callback on them.

    For example, this breaks MTRR setup on x86 in certain configs since
    ea8596bb2d8d379 ("kprobes/x86: Remove unused text_poke_smp() and
    text_poke_smp_batch() functions") as the MTRR is only established on the
    boot CPU.

    This patch removes the Kconfig option for STOP_MACHINE and uses the SMP
    and HOTPLUG_CPU config options to compile the correct stop_machine() for
    the architecture, removing the false dependency on MODULE_UNLOAD in the
    process.
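
    The effect of dropping the MODULE_UNLOAD dependency can be illustrated
    with a small truth-table model. This is a sketch in Python, not Kconfig
    or the patch itself. The function names are hypothetical, and the "new"
    rule below is an assumption drawn from the description above (only SMP
    and HOTPLUG_CPU matter), not a copy of the actual preprocessor check.

```python
from itertools import product


def old_full_stop_machine(smp, module_unload, hotplug_cpu):
    """Old rule as described above: on SMP, the full routine is built only
    when module unloading or CPU hotplug is enabled."""
    return smp and (module_unload or hotplug_cpu)


def new_full_stop_machine(smp, hotplug_cpu):
    """Assumed new rule: MODULE_UNLOAD no longer plays any role."""
    return smp or hotplug_cpu


def newly_fixed_smp_configs():
    """List SMP configurations that get the full routine only under the new rule."""
    fixed = []
    for module_unload, hotplug_cpu in product([False, True], repeat=2):
        old = old_full_stop_machine(True, module_unload, hotplug_cpu)
        new = new_full_stop_machine(True, hotplug_cpu)
        if new and not old:
            fixed.append({"MODULE_UNLOAD": module_unload, "HOTPLUG_CPU": hotplug_cpu})
    return fixed
```

    Under these assumptions, the only SMP configuration whose behaviour
    changes is the one with both MODULE_UNLOAD and HOTPLUG_CPU disabled,
    which matches the broken case described in the first paragraph.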

    Link: https://lkml.org/lkml/2014/10/8/124
    References: https://bugs.freedesktop.org/show_bug.cgi?id=84794
    Signed-off-by: Chris Wilson
    Acked-by: Ingo Molnar
    Cc: "Paul E. McKenney"
    Cc: Pranith Kumar
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: H. Peter Anvin
    Cc: Tejun Heo
    Cc: Iulia Manda
    Cc: Andy Lutomirski
    Cc: Rusty Russell
    Cc: Peter Zijlstra
    Cc: Chuck Ebbert
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wilson
     

05 Dec, 2015

1 commit

  • This commit adds the invocation of rcu_end_inkernel_boot() just before
    init is invoked. This allows the CONFIG_RCU_EXPEDITE_BOOT Kconfig
    option to do something useful and prepares for the upcoming
    rcupdate.rcu_normal_after_boot kernel parameter.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

12 Sep, 2015

1 commit

  • Here is an implementation of a new system call, sys_membarrier(), which
    executes a memory barrier on all threads running on the system. It is
    implemented by calling synchronize_sched(). It can be used to
    distribute the cost of user-space memory barriers asymmetrically by
    transforming pairs of memory barriers into pairs consisting of
    sys_membarrier() and a compiler barrier. For synchronization primitives
    that distinguish between read-side and write-side (e.g. userspace RCU
    [1], rwlocks), the read-side can be accelerated significantly by moving
    the bulk of the memory barrier overhead to the write-side.

    The existing applications of which I am aware that would be improved by
    this system call are as follows:

    * Through Userspace RCU library (http://urcu.so)
    - DNS server (Knot DNS) https://www.knot-dns.cz/
    - Network sniffer (http://netsniff-ng.org/)
    - Distributed object storage (https://sheepdog.github.io/sheepdog/)
    - User-space tracing (http://lttng.org)
    - Network storage system (https://www.gluster.org/)
    - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
    - Financial software (https://lkml.org/lkml/2015/3/23/189)

    Those projects use RCU in userspace to increase read-side speed and
    scalability compared to locking. Especially in the case of RCU used by
    libraries, sys_membarrier can speed up the read-side by moving the bulk of
    the memory barrier cost to synchronize_rcu().

    * Direct users of sys_membarrier
    - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)

    Microsoft core dotnet GC developers are planning to use the mprotect()
    side-effect of issuing memory barriers through IPIs as a way to implement
    Windows FlushProcessWriteBuffers() on Linux. They are referring to
    sys_membarrier in their github thread, specifically stating that
    sys_membarrier() is what they are looking for.

    To explain the benefit of this scheme, let's introduce two example threads:

    Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
    Thread B (frequent, e.g. executing liburcu
    rcu_read_lock()/rcu_read_unlock())

    In a scheme where all smp_mb() in thread A are ordering memory accesses
    with respect to smp_mb() present in Thread B, we can change each
    smp_mb() within Thread A into calls to sys_membarrier() and each
    smp_mb() within Thread B into compiler barriers "barrier()".

    Before the change, we had, for each smp_mb() pair:

    Thread A                    Thread B
    previous mem accesses       previous mem accesses
    smp_mb()                    smp_mb()
    following mem accesses      following mem accesses

    After the change, these pairs become:

    Thread A                    Thread B
    prev mem accesses           prev mem accesses
    sys_membarrier()            barrier()
    follow mem accesses         follow mem accesses

    As we can see, there are two possible scenarios: either Thread B memory
    accesses do not happen concurrently with Thread A accesses (1), or they
    do (2).

    1) Non-concurrent Thread A vs Thread B accesses:

    Thread A                    Thread B
    prev mem accesses
    sys_membarrier()
    follow mem accesses
                                prev mem accesses
                                barrier()
                                follow mem accesses

    In this case, thread B accesses will be weakly ordered. This is OK,
    because at that point, thread A is not particularly interested in
    ordering them with respect to its own accesses.

    2) Concurrent Thread A vs Thread B accesses

    Thread A                    Thread B
    prev mem accesses           prev mem accesses
    sys_membarrier()            barrier()
    follow mem accesses         follow mem accesses

    In this case, thread B accesses, which are ensured to be in program
    order thanks to the compiler barrier, will be "upgraded" to full
    smp_mb() by synchronize_sched().

    * Benchmarks

    On Intel Xeon E5405 (8 cores)
    (one thread is calling sys_membarrier, the other 7 threads are busy
    looping)

    1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.

    * User-space user of this system call: Userspace RCU library

    Both the signal-based and the sys_membarrier userspace RCU schemes
    permit us to remove the memory barrier from the userspace RCU
    rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
    accelerating them. These memory barriers are replaced by compiler
    barriers on the read-side, and all matching memory barriers on the
    write-side are turned into an invocation of a memory barrier on all
    active threads in the process. By letting the kernel perform this
    synchronization rather than dumbly sending a signal to every process
    threads (as we currently do), we diminish the number of unnecessary wake
    ups and only issue the memory barriers on active threads. Non-running
    threads do not need to execute such barrier anyway, because these are
    implied by the scheduler context switches.

    Results in liburcu:

    Operations in 10s, 6 readers, 2 writers:

    memory barriers in reader:    1701557485 reads, 2202847 writes
    signal-based scheme:          9830061167 reads,    6700 writes
    sys_membarrier:               9952759104 reads,     425 writes
    sys_membarrier (dyn. check):  7970328887 reads,     425 writes

    The dynamic sys_membarrier availability check adds some overhead to
    the read-side compared to the signal-based scheme, but besides that,
    sys_membarrier slightly outperforms the signal-based scheme. However,
    this non-expedited sys_membarrier implementation has a much slower grace
    period than signal and memory barrier schemes.
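
    To put the liburcu numbers above in proportion, the counts can be
    normalised against the "memory barriers in reader" baseline. The sketch
    below only re-uses the figures quoted in this message; the dictionary
    and function names are hypothetical.

```python
# Operations in 10s, 6 readers, 2 writers (figures quoted above).
READS = {
    "memory barriers in reader": 1701557485,
    "signal-based scheme": 9830061167,
    "sys_membarrier": 9952759104,
    "sys_membarrier (dyn. check)": 7970328887,
}
WRITES = {
    "memory barriers in reader": 2202847,
    "signal-based scheme": 6700,
    "sys_membarrier": 425,
    "sys_membarrier (dyn. check)": 425,
}


def relative_to_baseline(counts, baseline="memory barriers in reader"):
    """Return each scheme's count as a multiple of the baseline scheme."""
    base = counts[baseline]
    return {name: value / base for name, value in counts.items()}


if __name__ == "__main__":
    for name, factor in relative_to_baseline(READS).items():
        print(f"{name:30s} reads x{factor:.2f}")
    for name, factor in relative_to_baseline(WRITES).items():
        print(f"{name:30s} writes x{factor:.6f}")
```

    The read-side gain is a single-digit multiple, while the write-side
    rate drops by several orders of magnitude, which is the slow grace
    period mentioned above.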

    Besides diminishing the number of wake-ups, one major advantage of the
    membarrier system call over the signal-based scheme is that it does not
    need to reserve a signal. This plays much more nicely with libraries,
    and with processes injected into for tracing purposes, for which we
    cannot expect that signals will be unused by the application.

    An expedited version of this system call can be added later on to speed
    up the grace period. Its implementation will likely depend on reading
    the cpu_curr()->mm without holding each CPU's rq lock.

    This patch adds the system call to x86 and to asm-generic.

    [1] http://urcu.so

    membarrier(2) man page:

    MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2)

    NAME
    membarrier - issue memory barriers on a set of threads

    SYNOPSIS
    #include <linux/membarrier.h>

    int membarrier(int cmd, int flags);

    DESCRIPTION
    The cmd argument is one of the following:

    MEMBARRIER_CMD_QUERY
    Query the set of supported commands. It returns a bitmask of
    supported commands.

    MEMBARRIER_CMD_SHARED
    Execute a memory barrier on all threads running on the system.
    Upon return from system call, the caller thread is ensured that
    all running threads have passed through a state where all memory
    accesses to user-space addresses match program order between
    entry to and return from the system call (non-running threads
    are de facto in such a state). This covers threads from all processes
    running on the system. This command returns 0.

    The flags argument needs to be 0. For future extensions.

    All memory accesses performed in program order from each targeted
    thread are guaranteed to be ordered with respect to sys_membarrier(). If
    we use the semantic "barrier()" to represent a compiler barrier forcing
    memory accesses to be performed in program order across the barrier,
    and smp_mb() to represent explicit memory barriers forcing full memory
    ordering across the barrier, we have the following ordering table for
    each pair of barrier(), sys_membarrier() and smp_mb():

    The pair ordering is detailed as (O: ordered, X: not ordered):

                        barrier()   smp_mb()   sys_membarrier()
    barrier()              X           X            O
    smp_mb()               X           O            O
    sys_membarrier()       O           O            O
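
    The pair-ordering table can also be written as a lookup structure,
    which makes its symmetry easy to check. This is only a transcription of
    the table above into Python; the names ORDERING, is_ordered and
    is_symmetric are hypothetical.

```python
PRIMITIVES = ("barrier()", "smp_mb()", "sys_membarrier()")

# True means "O" (ordered), False means "X" (not ordered).
ORDERING = {
    "barrier()": {"barrier()": False, "smp_mb()": False, "sys_membarrier()": True},
    "smp_mb()": {"barrier()": False, "smp_mb()": True, "sys_membarrier()": True},
    "sys_membarrier()": {"barrier()": True, "smp_mb()": True, "sys_membarrier()": True},
}


def is_ordered(first, second):
    """Return True if the pair of primitives gives ordered memory accesses."""
    return ORDERING[first][second]


def is_symmetric():
    """Check that the table does not depend on which thread uses which primitive."""
    return all(
        ORDERING[a][b] == ORDERING[b][a] for a in PRIMITIVES for b in PRIMITIVES
    )
```

    The row for sys_membarrier() is all "O", which is what allows the
    frequent side of a pair to be downgraded to a plain compiler barrier.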

    RETURN VALUE
    On success, these system calls return zero. On error, -1 is returned,
    and errno is set appropriately. For a given command, with flags
    argument set to 0, this system call is guaranteed to always return the
    same value until reboot.

    ERRORS
    ENOSYS System call is not implemented.

    EINVAL Invalid arguments.

    Linux 2015-04-15 MEMBARRIER(2)
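
    As a sketch of how a caller might interpret the MEMBARRIER_CMD_QUERY
    result described in the man page, the helper below decodes the returned
    bitmask. It does not issue the system call. The numeric constants are
    not stated in this message; they are assumed to match the values in the
    uapi header (MEMBARRIER_CMD_QUERY = 0, MEMBARRIER_CMD_SHARED = 1 << 0),
    and the function name is hypothetical.

```python
# Assumed values, mirroring the uapi header rather than this commit text.
MEMBARRIER_CMD_QUERY = 0
MEMBARRIER_CMD_SHARED = 1 << 0

COMMAND_NAMES = {MEMBARRIER_CMD_SHARED: "MEMBARRIER_CMD_SHARED"}


def supported_commands(query_result):
    """Decode the bitmask returned by membarrier(MEMBARRIER_CMD_QUERY, 0).

    A negative value models the -1 error return (for example ENOSYS on a
    kernel without the system call).
    """
    if query_result < 0:
        raise OSError("membarrier query failed")
    return [name for bit, name in COMMAND_NAMES.items() if query_result & bit]
```

    A library doing the dynamic availability check mentioned earlier could
    run such a query once at startup and fall back to another scheme when
    the list comes back empty or the query fails.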

    Signed-off-by: Mathieu Desnoyers
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Josh Triplett
    Cc: KOSAKI Motohiro
    Cc: Steven Rostedt
    Cc: Nicholas Miell
    Cc: Ingo Molnar
    Cc: Alan Cox
    Cc: Lai Jiangshan
    Cc: Stephen Hemminger
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: David Howells
    Cc: Pranith Kumar
    Cc: Michael Kerrisk
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     

11 Sep, 2015

2 commits

    There are two kexec load syscalls, kexec_load and kexec_file_load.
    kexec_file_load has already been split out as kernel/kexec_file.c. In this
    patch I split the kexec_load syscall code out to kernel/kexec.c.

    Also add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
    use kexec_file_load only, or vice versa.

    The original requirement is from Ted Ts'o, who wants the kexec kernel
    signature to be checked with CONFIG_KEXEC_VERIFY_SIG enabled. But
    kexec-tools using the kexec_load syscall can bypass the check.

    Vivek Goyal proposed to create a common kconfig option so user can compile
    in only one syscall for loading kexec kernel. KEXEC/KEXEC_FILE selects
    KEXEC_CORE so that old config files still work.
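
    The "select" relationship described above can be modelled in a few
    lines. This is a Python sketch of the dependency logic only, not
    Kconfig syntax, and the function name resolve_kexec_core is
    hypothetical.

```python
def resolve_kexec_core(config):
    """Return a copy of config with KEXEC_CORE derived from its selectors.

    Models "KEXEC/KEXEC_FILE selects KEXEC_CORE": the core option is on
    whenever at least one of the two syscall options is on.
    """
    resolved = dict(config)
    resolved["KEXEC_CORE"] = bool(config.get("KEXEC") or config.get("KEXEC_FILE"))
    return resolved
```

    For example, an old config that only sets KEXEC still ends up with
    KEXEC_CORE enabled, and a config that sets only KEXEC_FILE gets the
    shared code without the kexec_load syscall.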

    Because general code needs CONFIG_KEXEC_CORE, I updated all the
    architecture Kconfig files with a new option KEXEC_CORE, and let KEXEC select
    KEXEC_CORE in arch Kconfig. Also updated general kernel code with to
    kexec_load syscall.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Dave Young
    Cc: Eric W. Biederman
    Cc: Vivek Goyal
    Cc: Petr Tesarik
    Cc: Theodore Ts'o
    Cc: Josh Boyer
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     
  • We need to launch the usermodehelper kernel threads with the widest
    affinity and this is partly why we use khelper. This workqueue has
    unbound properties and thus a wide affinity inherited by all its children.

    Now khelper also has special properties that we aren't much interested in:
    ordered and singlethread. There is really no need for ordering, as all
    we do is create kernel threads. This can be done concurrently. And
    singlethread is a useless limitation as well.

    The workqueue engine already provides generic unbound workqueues that
    don't share these useless properties and handle parallel jobs well.

    The only worrisome specific is their affinity to the node of the current
    CPU. It's fine for creating the usermodehelper kernel threads but those
    inherit this affinity for longer jobs such as requesting modules.

    This patch proposes to use these node affine unbound workqueues assuming
    that a node is sufficient to handle several parallel usermodehelper
    requests.

    Signed-off-by: Frederic Weisbecker
    Cc: Rik van Riel
    Reviewed-by: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     

09 Sep, 2015

1 commit

  • Pull security subsystem updates from James Morris:
    "Highlights:

    - PKCS#7 support added to support signed kexec, also utilized for
    module signing. See comments in 3f1e1bea.

    ** NOTE: this requires linking against the OpenSSL library, which
    must be installed, e.g. the openssl-devel on Fedora **

    - Smack
    - add IPv6 host labeling; ignore labels on kernel threads
    - support smack labeling mounts which use binary mount data

    - SELinux:
    - add ioctl whitelisting (see
    http://kernsec.org/files/lss2015/vanderstoep.pdf)
    - fix mprotect PROT_EXEC regression caused by mm change

    - Seccomp:
    - add ptrace options for suspend/resume"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (57 commits)
    PKCS#7: Add OIDs for sha224, sha284 and sha512 hash algos and use them
    Documentation/Changes: Now need OpenSSL devel packages for module signing
    scripts: add extract-cert and sign-file to .gitignore
    modsign: Handle signing key in source tree
    modsign: Use if_changed rule for extracting cert from module signing key
    Move certificate handling to its own directory
    sign-file: Fix warning about BIO_reset() return value
    PKCS#7: Add MODULE_LICENSE() to test module
    Smack - Fix build error with bringup unconfigured
    sign-file: Document dependency on OpenSSL devel libraries
    PKCS#7: Appropriately restrict authenticated attributes and content type
    KEYS: Add a name for PKEY_ID_PKCS7
    PKCS#7: Improve and export the X.509 ASN.1 time object decoder
    modsign: Use extract-cert to process CONFIG_SYSTEM_TRUSTED_KEYS
    extract-cert: Cope with multiple X.509 certificates in a single file
    sign-file: Generate CMS message as signature instead of PKCS#7
    PKCS#7: Support CMS messages also [RFC5652]
    X.509: Change recorded SKID & AKID to not include Subject or Issuer
    PKCS#7: Check content type and versions
    MAINTAINERS: The keyrings mailing list has moved
    ...

    Linus Torvalds
     

06 Sep, 2015

2 commits

  • Pull vfs updates from Al Viro:
    "In this one:

    - d_move fixes (Eric Biederman)

    - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error
    handling ought to be fixed)

    - switch of sb_writers to percpu rwsem (Oleg Nesterov)

    - superblock scalability (Josef Bacik and Dave Chinner)

    - swapon(2) race fix (Hugh Dickins)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits)
    vfs: Test for and handle paths that are unreachable from their mnt_root
    dcache: Reduce the scope of i_lock in d_splice_alias
    dcache: Handle escaped paths in prepend_path
    mm: fix potential data race in SyS_swapon
    inode: don't softlockup when evicting inodes
    inode: rename i_wb_list to i_io_list
    sync: serialise per-superblock sync operations
    inode: convert inode_sb_list_lock to per-sb
    inode: add hlist_fake to avoid the inode hash lock in evict
    writeback: plug writeback at a high level
    change sb_writers to use percpu_rw_semaphore
    shift percpu_counter_destroy() into destroy_super_work()
    percpu-rwsem: kill CONFIG_PERCPU_RWSEM
    percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
    percpu-rwsem: introduce percpu_down_read_trylock()
    document rwsem_release() in sb_wait_write()
    fix the broken lockdep logic in __sb_start_write()
    introduce __sb_writers_{acquired,release}() helpers
    ufs_inode_get{frag,block}(): get rid of 'phys' argument
    ufs_getfrag_block(): tidy up a bit
    ...

    Linus Torvalds
     
  • Merge patch-bomb from Andrew Morton:

    - a few misc things

    - Andy's "ambient capabilities"

    - fs/notify updates

    - the ocfs2 queue

    - kernel/watchdog.c updates and feature work.

    - some of MM. Includes Andrea's userfaultfd feature.

    [ Hadn't noticed that userfaultfd was 'default y' when applying the
    patches, so that got fixed in this merge instead. We do _not_ mark
    new features that nobody uses yet 'default y' - Linus ]

    * emailed patches from Andrew Morton : (118 commits)
    mm/hugetlb.c: make vma_has_reserves() return bool
    mm/madvise.c: make madvise_behaviour_valid() return bool
    mm/memory.c: make tlb_next_batch() return bool
    mm/dmapool.c: change is_page_busy() return from int to bool
    mm: remove struct node_active_region
    mremap: simplify the "overlap" check in mremap_to()
    mremap: don't do uneccesary checks if new_len == old_len
    mremap: don't do mm_populate(new_addr) on failure
    mm: move ->mremap() from file_operations to vm_operations_struct
    mremap: don't leak new_vma if f_op->mremap() fails
    mm/hugetlb.c: make vma_shareable() return bool
    mm: make GUP handle pfn mapping unless FOLL_GET is requested
    mm: fix status code which move_pages() returns for zero page
    mm: memcontrol: bring back the VM_BUG_ON() in mem_cgroup_swapout()
    genalloc: add support of multiple gen_pools per device
    genalloc: add name arg to gen_pool_get() and devm_gen_pool_create()
    mm/memblock: WARN_ON when nid differs from overlap region
    Documentation/features/vm: add feature description and arch support status for batched TLB flush after unmap
    mm: defer flush of writable TLB entries
    mm: send one IPI per CPU to TLB flush all entries after unmapping pages
    ...

    Linus Torvalds
     

05 Sep, 2015

2 commits

  • An IPI is sent to flush remote TLBs when a page is unmapped that was
    potentially accessed by other CPUs. There are many circumstances where
    this happens but the obvious one is kswapd reclaiming pages belonging to a
    running process as kswapd and the task are likely running on separate
    CPUs.

    On small machines, this is not a significant problem but as machines get
    larger with more cores and more memory, the cost of these IPIs can be
    high. This patch uses a simple structure that tracks CPUs that
    potentially have TLB entries for pages being unmapped. When the unmapping
    is complete, the full TLB is flushed on the assumption that a refill cost
    is lower than flushing individual entries.
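
    The saving from batching can be sketched with a toy IPI count. This is
    a simplified Python model under stated assumptions, not the kernel
    data structure: each unmapped page is represented by the set of remote
    CPUs that may hold a TLB entry for it, and both function names are
    hypothetical.

```python
def ipis_without_batching(pages):
    """One IPI per unmapped page for every CPU that may cache its entry."""
    return sum(len(cpus) for cpus in pages)


def ipis_with_batching(pages):
    """Accumulate the CPUs while unmapping, then send one full-flush IPI
    to each CPU that was recorded."""
    pending = set()
    for cpus in pages:
        pending |= cpus
    return len(pending)


if __name__ == "__main__":
    # Three pages, touched by overlapping sets of CPUs.
    pages = [{1, 2}, {2, 3}, {1}]
    print(ipis_without_batching(pages), ipis_with_batching(pages))
```

    In this model the batched count is bounded by the number of CPUs
    rather than by the number of pages, at the price of each target CPU
    losing its whole TLB instead of individual entries.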

    Architectures wishing to do this must give the following guarantee.

    If a clean page is unmapped and not immediately flushed, the
    architecture must guarantee that a write to that linear address
    from a CPU with a cached TLB entry will trap a page fault.

    This is essentially what the kernel already depends on but the window is
    much larger with this patch applied and is worth highlighting. The
    architecture should consider whether the cost of the full TLB flush is
    higher than sending an IPI to flush each individual entry. An additional
    architecture helper called flush_tlb_local is required. It's a trivial
    wrapper with some accounting in the x86 case.

    The impact of this patch depends on the workload as measuring any benefit
    requires both mapped pages co-located on the LRU and memory pressure. The
    case with the biggest impact is multiple processes reading mapped pages
    taken from the vm-scalability test suite. The test case uses NR_CPU
    readers of mapped files that consume 10*RAM.

    Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs

                                           4.2.0-rc1              4.2.0-rc1
                                             vanilla           flushfull-v7
    Ops lru-file-mmap-read-elapsed       159.62 (  0.00%)      120.68 ( 24.40%)
    Ops lru-file-mmap-read-time_range     30.59 (  0.00%)        2.80 ( 90.85%)
    Ops lru-file-mmap-read-time_stddv      6.70 (  0.00%)        0.64 ( 90.38%)

                 4.2.0-rc1     4.2.0-rc1
                   vanilla  flushfull-v7
    User            581.00        611.43
    System         5804.93       4111.76
    Elapsed         161.03        122.12

    This is showing that the readers completed 24.40% faster with 29% less
    system CPU time. From vmstats, it is known that the vanilla kernel was
    interrupted roughly 900K times per second during the steady phase of the
    test and the patched kernel was interrupted 180K times per second.
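
    The percentages quoted for the 4-node machine follow directly from the
    table. A short check, using only the figures above (the helper name
    percent_reduction is hypothetical):

```python
def percent_reduction(before, after):
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100.0


# Figures from the 4-node, 48-CPU run quoted above.
ELAPSED_OPS = percent_reduction(159.62, 120.68)   # lru-file-mmap-read-elapsed
SYSTEM_TIME = percent_reduction(5804.93, 4111.76)  # System CPU time
INTERRUPTS = percent_reduction(900_000, 180_000)   # interrupts per second

if __name__ == "__main__":
    print(f"elapsed: {ELAPSED_OPS:.2f}%")
    print(f"system:  {SYSTEM_TIME:.2f}%")
    print(f"irqs:    {INTERRUPTS:.2f}%")
```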

    The impact is lower on a single socket machine.

                                           4.2.0-rc1              4.2.0-rc1
                                             vanilla           flushfull-v7
    Ops lru-file-mmap-read-elapsed        25.33 (  0.00%)       20.38 ( 19.54%)
    Ops lru-file-mmap-read-time_range      0.91 (  0.00%)        1.44 (-58.24%)
    Ops lru-file-mmap-read-time_stddv      0.28 (  0.00%)        0.47 (-65.34%)

                 4.2.0-rc1     4.2.0-rc1
                   vanilla  flushfull-v7
    User             58.09         57.64
    System          111.82         76.56
    Elapsed          27.29         22.55

    It's still a noticeable improvement with vmstat showing interrupts went
    from roughly 500K per second to 45K per second.

    The patch will have no impact on workloads that are under no memory pressure
    or have relatively few mapped pages. It will have an unpredictable impact on the
    workload running on the CPU being flushed as it'll depend on how many TLB
    entries need to be refilled and how long that takes. Worst case, the TLB
    will be completely cleared of active entries when the target PFNs were not
    resident at all.

    [sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Dave Hansen
    Acked-by: Ingo Molnar
    Cc: Linus Torvalds
    Signed-off-by: Sasha Levin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This allows userfaultfd to be selected during configuration so that it is built.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

02 Sep, 2015

1 commit

  • Pull cgroup updates from Tejun Heo:

    - a new PIDs controller is added. It turns out that PIDs are actually
    an independent resource from kmem due to the limited PID space.

    - more core preparations for the v2 interface. Once cpu side interface
    is settled, it should be ready for lifting the devel mask.
    for-4.3-unified-base was temporarily branched so that other trees
    (block) can pull cgroup core changes that blkcg changes depend on.

    - a non-critical idr_preload usage bug fix.

    * 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: pids: fix invalid get/put usage
    cgroup: introduce cgroup_subsys->legacy_name
    cgroup: don't print subsystems for the default hierarchy
    cgroup: make cftype->private a unsigned long
    cgroup: export cgrp_dfl_root
    cgroup: define controller file conventions
    cgroup: fix idr_preload usage
    cgroup: add documentation for the PIDs controller
    cgroup: implement the PIDs subsystem
    cgroup: allow a cgroup subsystem to reject a fork

    Linus Torvalds