21 Jan, 2016

4 commits

  • What CONFIG_INET and CONFIG_LEGACY_KMEM guard inside the memory
    controller code is insignificant, having these conditionals is not
    worth the complication and fragility that comes with them.

    [akpm@linux-foundation.org: rework mem_cgroup_css_free() statement ordering]
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Let the user know that CONFIG_MEMCG_KMEM does not apply to the cgroup2
    interface. This also makes legacy-only code sections stand out better.

    [arnd@arndb.de: mm: memcontrol: only manage socket pressure for CONFIG_INET]
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Acked-by: Vladimir Davydov
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Make initrd_load() return bool due to this particular function only using
    either one or zero as its return value.

    No functional change.

    Signed-off-by: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • Make obsolete_checksetup() return bool due to this particular function
    only using either one or zero as its return value.

    No functional change.

    Signed-off-by: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     

18 Jan, 2016

3 commits

  • Pull audit updates from Paul Moore:
    "Seven audit patches for 4.5, all very minor despite the diffstat.

    The diffstat churn for linux/audit.h can be attributed to needing to
    reshuffle the linux/audit.h header to fix the seccomp auditing issue
    (see the commit description for details).

    Besides the seccomp/audit fix, most of the fixes are around trying to
    improve the connection with the audit daemon and a Kconfig
    simplification. Nothing crazy, and everything passes our little
    audit-testsuite"

    * 'upstream' of git://git.infradead.org/users/pcmoore/audit:
    audit: always enable syscall auditing when supported and audit is enabled
    audit: force seccomp event logging to honor the audit_enabled flag
    audit: Delete unnecessary checks before two function calls
    audit: wake up threads if queue switched from limited to unlimited
    audit: include auditd's threads in audit_log_start() wait exception
    audit: remove audit_backlog_wait_overflow
    audit: don't needlessly reset valid wait time

    Linus Torvalds
     
  • Merge second patch-bomb from Andrew Morton:

    - more MM stuff:

    - Kirill's page-flags rework

    - Kirill's now-allegedly-fixed THP rework

    - MADV_FREE implementation

    - DAX feature work (msync/fsync). This isn't quite complete but DAX
    is new and it's good enough and the guys have a handle on what
    needs to be done - I expect this to be wrapped in the next week or
    two.

    - some vsprintf maintenance work

    - various other misc bits

    * emailed patches from Andrew Morton : (145 commits)
    printk: change recursion_bug type to bool
    lib/vsprintf: factor out %pN[F] handler as netdev_bits()
    lib/vsprintf: refactor duplicate code to special_hex_number()
    printk-formats.txt: remove unimplemented %pT
    printk: help pr_debug and pr_devel to optimize out arguments
    lib/test_printf.c: test dentry printing
    lib/test_printf.c: add test for large bitmaps
    lib/test_printf.c: account for kvasprintf tests
    lib/test_printf.c: add a few number() tests
    lib/test_printf.c: test precision quirks
    lib/test_printf.c: check for out-of-bound writes
    lib/test_printf.c: don't BUG
    lib/kasprintf.c: add sanity check to kvasprintf
    lib/vsprintf.c: warn about too large precisions and field widths
    lib/vsprintf.c: help gcc make number() smaller
    lib/vsprintf.c: expand field_width to 24 bits
    lib/vsprintf.c: eliminate potential race in string()
    lib/vsprintf.c: move string() below widen_string()
    lib/vsprintf.c: pull out padding code from dentry_name()
    printk: do cond_resched() between lines while outputting to consoles
    ...

    Linus Torvalds
     
  • Pull documentation updates from Jon Corbet:
    "A relatively boring cycle in the docs tree. There's a few kernel-doc
    fixes and various document tweaks.

    One patch reaches out of the documentation subtree to fix a comment in
    init/do_mounts_rd.c. There didn't seem to be anybody more appropriate
    to take that one, so I accepted it"

    * tag 'docs-4.5' of git://git.lwn.net/linux: (29 commits)
    thermal: add description for integral_cutoff unit
    Documentation: update libhugetlbfs site url
    Documentation: Explain pci=conf1,conf2 more verbosely
    DMA-API: fix confusing sentence in Documentation/DMA-API.txt
    Documentation: translations: update linux cross reference link
    Documentation: fix typo in CodingStyle
    init, Documentation: Remove ramdisk_blocksize mentions
    Documentation-getdelays: Apply a recommendation from "checkpatch.pl" in main()
    Documentation: HOWTO: update versions from 3.x to 4.x
    Documentation: remove outdated references from translations
    Doc: treewide: Fix grammar "a" to "an"
    Documentation: cpu-hotplug: Fix sysfs mount instructions
    can-doc: Add hint about getting timestamps
    Fix CFQ I/O scheduler parameter name in documentation
    Documentation: arm: remove dead links from Marvell Berlin docs
    Documentation: HOWTO: update code cross reference link
    Doc: Docbook/iio: Fix typo in iio.tmpl
    DocBook: make index.html generation less verbose by default
    DocBook: Cleanup: remove an unused $(call) line
    DocBook: Add a help message for DOCBOOKS env var
    ...

    Linus Torvalds
     

17 Jan, 2016

1 commit

  • uselib hasn't been used since libc5; glibc does not use it. Deprecate
    uselib a bit more, by making the default y only if libc5 was widely used
    on the plaform.

    This makes arm64 kernel built with defconfig slightly smaller

    bloat-o-meter:
    add/remove: 0/3 grow/shrink: 0/2 up/down: 0/-1390 (-1390)
    function old new delta
    kernel_config_data 18164 18162 -2
    uselib_flags 20 - -20
    padzero 216 192 -24
    sys_uselib 380 - -380
    load_elf_library 964 - -964

    Signed-off-by: Riku Voipio
    Reviewed-by: Josh Triplett
    Acked-by: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Riku Voipio
     

13 Jan, 2016

2 commits

  • To the best of our knowledge, everyone who enables audit at compile
    time also enables syscall auditing; this patch simplifies the Kconfig
    menus by removing the option to disable syscall auditing when audit
    is selected and the target arch supports it.

    Signed-off-by: Paul Moore

    Paul Moore
     
  • Pull cgroup updates from Tejun Heo:

    - cgroup v2 interface is now official. It's no longer hidden behind a
    devel flag and can be mounted using the new cgroup2 fs type.

    Unfortunately, cpu v2 interface hasn't made it yet due to the
    discussion around in-process hierarchical resource distribution and
    only memory and io controllers can be used on the v2 interface at the
    moment.

    - The existing documentation which has always been a bit of mess is
    relocated under Documentation/cgroup-v1/. Documentation/cgroup-v2.txt
    is added as the authoritative documentation for the v2 interface.

    - Some features are added through for-4.5-ancestor-test branch to
    enable netfilter xt_cgroup match to use cgroup v2 paths. The actual
    netfilter changes will be merged through the net tree which pulled in
    the said branch.

    - Various cleanups

    * 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: rename cgroup documentations
    cgroup: fix a typo.
    cgroup: Remove resource_counter.txt in Documentation/cgroup-legacy/00-INDEX.
    cgroup: demote subsystem init messages to KERN_DEBUG
    cgroup: Fix uninitialized variable warning
    cgroup: put controller Kconfig options in meaningful order
    cgroup: clean up the kernel configuration menu nomenclature
    cgroup_pids: fix a typo.
    Subject: cgroup: Fix incomplete dd command in blkio documentation
    cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends
    cpuset: Replace all instances of time_t with time64_t
    cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation
    cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/
    cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type

    Linus Torvalds
     

06 Jan, 2016

1 commit

  • …k/linux-rcu into core/rcu

    Pull RCU changes from Paul E. McKenney:

    - Adding transitivity uniformly to rcu_node structure ->lock
    acquisitions. (This is implemented by the first two commits
    on top of v4.4-rc2 due to the pervasive nature of this change.)

    - Documentation updates, including RCU requirements.

    - Expedited grace-period changes.

    - Miscellaneous fixes.

    - Linked-list fixes, courtesy of KTSAN.

    - Torture-test updates.

    - Late-breaking fix to sysrq-generated crash.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

26 Dec, 2015

1 commit

  • The brd driver has never supported the ramdisk_blocksize kernel
    parameter that was in the rd driver it replaced, so remove
    mention of this parameter from comments and Documentation.

    Commit 9db5579be4bb ("rewrite rd") replaced rd with brd, keeping
    a brd_blocksize variable in struct brd_device but never using it.

    Commit a2cba2913c76 ("brd: get rid of unused members from struct
    brd_device") removed the unused variable.

    Commit f5abc8e75815 ("Documentation/blockdev/ramdisk.txt: updates")
    removed mentions of ramdisk_blocksize from that file.

    Signed-off-by: Robert Elliott
    Signed-off-by: Jonathan Corbet

    Robert Elliott
     

19 Dec, 2015

2 commits


13 Dec, 2015

1 commit

  • Currently the full stop_machine() routine is only enabled on SMP if
    module unloading is enabled, or if the CPUs are hotpluggable. This
    leads to configurations where stop_machine() is broken as it will then
    only run the callback on the local CPU with irqs disabled, and not stop
    the other CPUs or run the callback on them.

    For example, this breaks MTRR setup on x86 in certain configs since
    ea8596bb2d8d379 ("kprobes/x86: Remove unused text_poke_smp() and
    text_poke_smp_batch() functions") as the MTRR is only established on the
    boot CPU.

    This patch removes the Kconfig option for STOP_MACHINE and uses the SMP
    and HOTPLUG_CPU config options to compile the correct stop_machine() for
    the architecture, removing the false dependency on MODULE_UNLOAD in the
    process.

    Link: https://lkml.org/lkml/2014/10/8/124
    References: https://bugs.freedesktop.org/show_bug.cgi?id=84794
    Signed-off-by: Chris Wilson
    Acked-by: Ingo Molnar
    Cc: "Paul E. McKenney"
    Cc: Pranith Kumar
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: H. Peter Anvin
    Cc: Tejun Heo
    Cc: Iulia Manda
    Cc: Andy Lutomirski
    Cc: Rusty Russell
    Cc: Peter Zijlstra
    Cc: Chuck Ebbert
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wilson
     

05 Dec, 2015

1 commit

  • This commit adds the invocation of rcu_end_inkernel_boot() just before
    init is invoked. This allows the CONFIG_RCU_EXPEDITE_BOOT Kconfig
    option to do something useful and prepares for the upcoming
    rcupdate.rcu_normal_after_boot kernel parameter.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

12 Sep, 2015

1 commit

  • Here is an implementation of a new system call, sys_membarrier(), which
    executes a memory barrier on all threads running on the system. It is
    implemented by calling synchronize_sched(). It can be used to
    distribute the cost of user-space memory barriers asymmetrically by
    transforming pairs of memory barriers into pairs consisting of
    sys_membarrier() and a compiler barrier. For synchronization primitives
    that distinguish between read-side and write-side (e.g. userspace RCU
    [1], rwlocks), the read-side can be accelerated significantly by moving
    the bulk of the memory barrier overhead to the write-side.

    The existing applications of which I am aware that would be improved by
    this system call are as follows:

    * Through Userspace RCU library (http://urcu.so)
    - DNS server (Knot DNS) https://www.knot-dns.cz/
    - Network sniffer (http://netsniff-ng.org/)
    - Distributed object storage (https://sheepdog.github.io/sheepdog/)
    - User-space tracing (http://lttng.org)
    - Network storage system (https://www.gluster.org/)
    - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
    - Financial software (https://lkml.org/lkml/2015/3/23/189)

    Those projects use RCU in userspace to increase read-side speed and
    scalability compared to locking. Especially in the case of RCU used by
    libraries, sys_membarrier can speed up the read-side by moving the bulk of
    the memory barrier cost to synchronize_rcu().

    * Direct users of sys_membarrier
    - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)

    Microsoft core dotnet GC developers are planning to use the mprotect()
    side-effect of issuing memory barriers through IPIs as a way to implement
    Windows FlushProcessWriteBuffers() on Linux. They are referring to
    sys_membarrier in their github thread, specifically stating that
    sys_membarrier() is what they are looking for.

    To explain the benefit of this scheme, let's introduce two example threads:

    Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
    Thread B (frequent, e.g. executing liburcu
    rcu_read_lock()/rcu_read_unlock())

    In a scheme where all smp_mb() in thread A are ordering memory accesses
    with respect to smp_mb() present in Thread B, we can change each
    smp_mb() within Thread A into calls to sys_membarrier() and each
    smp_mb() within Thread B into compiler barriers "barrier()".

    Before the change, we had, for each smp_mb() pairs:

    Thread A Thread B
    previous mem accesses previous mem accesses
    smp_mb() smp_mb()
    following mem accesses following mem accesses

    After the change, these pairs become:

    Thread A Thread B
    prev mem accesses prev mem accesses
    sys_membarrier() barrier()
    follow mem accesses follow mem accesses

    As we can see, there are two possible scenarios: either Thread B memory
    accesses do not happen concurrently with Thread A accesses (1), or they
    do (2).

    1) Non-concurrent Thread A vs Thread B accesses:

    Thread A Thread B
    prev mem accesses
    sys_membarrier()
    follow mem accesses
    prev mem accesses
    barrier()
    follow mem accesses

    In this case, thread B accesses will be weakly ordered. This is OK,
    because at that point, thread A is not particularly interested in
    ordering them with respect to its own accesses.

    2) Concurrent Thread A vs Thread B accesses

    Thread A Thread B
    prev mem accesses prev mem accesses
    sys_membarrier() barrier()
    follow mem accesses follow mem accesses

    In this case, thread B accesses, which are ensured to be in program
    order thanks to the compiler barrier, will be "upgraded" to full
    smp_mb() by synchronize_sched().

    * Benchmarks

    On Intel Xeon E5405 (8 cores)
    (one thread is calling sys_membarrier, the other 7 threads are busy
    looping)

    1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call.

    * User-space user of this system call: Userspace RCU library

    Both the signal-based and the sys_membarrier userspace RCU schemes
    permit us to remove the memory barrier from the userspace RCU
    rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
    accelerating them. These memory barriers are replaced by compiler
    barriers on the read-side, and all matching memory barriers on the
    write-side are turned into an invocation of a memory barrier on all
    active threads in the process. By letting the kernel perform this
    synchronization rather than dumbly sending a signal to every process
    threads (as we currently do), we diminish the number of unnecessary wake
    ups and only issue the memory barriers on active threads. Non-running
    threads do not need to execute such barrier anyway, because these are
    implied by the scheduler context switches.

    Results in liburcu:

    Operations in 10s, 6 readers, 2 writers:

    memory barriers in reader: 1701557485 reads, 2202847 writes
    signal-based scheme: 9830061167 reads, 6700 writes
    sys_membarrier: 9952759104 reads, 425 writes
    sys_membarrier (dyn. check): 7970328887 reads, 425 writes

    The dynamic sys_membarrier availability check adds some overhead to
    the read-side compared to the signal-based scheme, but besides that,
    sys_membarrier slightly outperforms the signal-based scheme. However,
    this non-expedited sys_membarrier implementation has a much slower grace
    period than signal and memory barrier schemes.

    Besides diminishing the number of wake-ups, one major advantage of the
    membarrier system call over the signal-based scheme is that it does not
    need to reserve a signal. This plays much more nicely with libraries,
    and with processes injected into for tracing purposes, for which we
    cannot expect that signals will be unused by the application.

    An expedited version of this system call can be added later on to speed
    up the grace period. Its implementation will likely depend on reading
    the cpu_curr()->mm without holding each CPU's rq lock.

    This patch adds the system call to x86 and to asm-generic.

    [1] http://urcu.so

    membarrier(2) man page:

    MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2)

    NAME
    membarrier - issue memory barriers on a set of threads

    SYNOPSIS
    #include

    int membarrier(int cmd, int flags);

    DESCRIPTION
    The cmd argument is one of the following:

    MEMBARRIER_CMD_QUERY
    Query the set of supported commands. It returns a bitmask of
    supported commands.

    MEMBARRIER_CMD_SHARED
    Execute a memory barrier on all threads running on the system.
    Upon return from system call, the caller thread is ensured that
    all running threads have passed through a state where all memory
    accesses to user-space addresses match program order between
    entry to and return from the system call (non-running threads
    are de facto in such a state). This covers threads from all pro=E2=80=90
    cesses running on the system. This command returns 0.

    The flags argument needs to be 0. For future extensions.

    All memory accesses performed in program order from each targeted
    thread is guaranteed to be ordered with respect to sys_membarrier(). If
    we use the semantic "barrier()" to represent a compiler barrier forcing
    memory accesses to be performed in program order across the barrier,
    and smp_mb() to represent explicit memory barriers forcing full memory
    ordering across the barrier, we have the following ordering table for
    each pair of barrier(), sys_membarrier() and smp_mb():

    The pair ordering is detailed as (O: ordered, X: not ordered):

    barrier() smp_mb() sys_membarrier()
    barrier() X X O
    smp_mb() X O O
    sys_membarrier() O O O

    RETURN VALUE
    On success, these system calls return zero. On error, -1 is returned,
    and errno is set appropriately. For a given command, with flags
    argument set to 0, this system call is guaranteed to always return the
    same value until reboot.

    ERRORS
    ENOSYS System call is not implemented.

    EINVAL Invalid arguments.

    Linux 2015-04-15 MEMBARRIER(2)

    Signed-off-by: Mathieu Desnoyers
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Josh Triplett
    Cc: KOSAKI Motohiro
    Cc: Steven Rostedt
    Cc: Nicholas Miell
    Cc: Ingo Molnar
    Cc: Alan Cox
    Cc: Lai Jiangshan
    Cc: Stephen Hemminger
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: David Howells
    Cc: Pranith Kumar
    Cc: Michael Kerrisk
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     

11 Sep, 2015

2 commits

  • There are two kexec load syscalls, kexec_load another and kexec_file_load.
    kexec_file_load has been splited as kernel/kexec_file.c. In this patch I
    split kexec_load syscall code to kernel/kexec.c.

    And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
    use kexec_file_load only, or vice verse.

    The original requirement is from Ted Ts'o, he want kexec kernel signature
    being checked with CONFIG_KEXEC_VERIFY_SIG enabled. But kexec-tools use
    kexec_load syscall can bypass the checking.

    Vivek Goyal proposed to create a common kconfig option so user can compile
    in only one syscall for loading kexec kernel. KEXEC/KEXEC_FILE selects
    KEXEC_CORE so that old config files still work.

    Because there's general code need CONFIG_KEXEC_CORE, so I updated all the
    architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects
    KEXEC_CORE in arch Kconfig. Also updated general kernel code with to
    kexec_load syscall.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Dave Young
    Cc: Eric W. Biederman
    Cc: Vivek Goyal
    Cc: Petr Tesarik
    Cc: Theodore Ts'o
    Cc: Josh Boyer
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     
  • We need to launch the usermodehelper kernel threads with the widest
    affinity and this is partly why we use khelper. This workqueue has
    unbound properties and thus a wide affinity inherited by all its children.

    Now khelper also has special properties that we aren't much interested in:
    ordered and singlethread. There is really no need about ordering as all
    we do is creating kernel threads. This can be done concurrently. And
    singlethread is a useless limitation as well.

    The workqueue engine already proposes generic unbound workqueues that
    don't share these useless properties and handle well parallel jobs.

    The only worrysome specific is their affinity to the node of the current
    CPU. It's fine for creating the usermodehelper kernel threads but those
    inherit this affinity for longer jobs such as requesting modules.

    This patch proposes to use these node affine unbound workqueues assuming
    that a node is sufficient to handle several parallel usermodehelper
    requests.

    Signed-off-by: Frederic Weisbecker
    Cc: Rik van Riel
    Reviewed-by: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     

09 Sep, 2015

1 commit

  • Pull security subsystem updates from James Morris:
    "Highlights:

    - PKCS#7 support added to support signed kexec, also utilized for
    module signing. See comments in 3f1e1bea.

    ** NOTE: this requires linking against the OpenSSL library, which
    must be installed, e.g. the openssl-devel on Fedora **

    - Smack
    - add IPv6 host labeling; ignore labels on kernel threads
    - support smack labeling mounts which use binary mount data

    - SELinux:
    - add ioctl whitelisting (see
    http://kernsec.org/files/lss2015/vanderstoep.pdf)
    - fix mprotect PROT_EXEC regression caused by mm change

    - Seccomp:
    - add ptrace options for suspend/resume"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (57 commits)
    PKCS#7: Add OIDs for sha224, sha284 and sha512 hash algos and use them
    Documentation/Changes: Now need OpenSSL devel packages for module signing
    scripts: add extract-cert and sign-file to .gitignore
    modsign: Handle signing key in source tree
    modsign: Use if_changed rule for extracting cert from module signing key
    Move certificate handling to its own directory
    sign-file: Fix warning about BIO_reset() return value
    PKCS#7: Add MODULE_LICENSE() to test module
    Smack - Fix build error with bringup unconfigured
    sign-file: Document dependency on OpenSSL devel libraries
    PKCS#7: Appropriately restrict authenticated attributes and content type
    KEYS: Add a name for PKEY_ID_PKCS7
    PKCS#7: Improve and export the X.509 ASN.1 time object decoder
    modsign: Use extract-cert to process CONFIG_SYSTEM_TRUSTED_KEYS
    extract-cert: Cope with multiple X.509 certificates in a single file
    sign-file: Generate CMS message as signature instead of PKCS#7
    PKCS#7: Support CMS messages also [RFC5652]
    X.509: Change recorded SKID & AKID to not include Subject or Issuer
    PKCS#7: Check content type and versions
    MAINTAINERS: The keyrings mailing list has moved
    ...

    Linus Torvalds
     

06 Sep, 2015

2 commits

  • Pull vfs updates from Al Viro:
    "In this one:

    - d_move fixes (Eric Biederman)

    - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error
    handling ought to be fixed)

    - switch of sb_writers to percpu rwsem (Oleg Nesterov)

    - superblock scalability (Josef Bacik and Dave Chinner)

    - swapon(2) race fix (Hugh Dickins)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits)
    vfs: Test for and handle paths that are unreachable from their mnt_root
    dcache: Reduce the scope of i_lock in d_splice_alias
    dcache: Handle escaped paths in prepend_path
    mm: fix potential data race in SyS_swapon
    inode: don't softlockup when evicting inodes
    inode: rename i_wb_list to i_io_list
    sync: serialise per-superblock sync operations
    inode: convert inode_sb_list_lock to per-sb
    inode: add hlist_fake to avoid the inode hash lock in evict
    writeback: plug writeback at a high level
    change sb_writers to use percpu_rw_semaphore
    shift percpu_counter_destroy() into destroy_super_work()
    percpu-rwsem: kill CONFIG_PERCPU_RWSEM
    percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
    percpu-rwsem: introduce percpu_down_read_trylock()
    document rwsem_release() in sb_wait_write()
    fix the broken lockdep logic in __sb_start_write()
    introduce __sb_writers_{acquired,release}() helpers
    ufs_inode_get{frag,block}(): get rid of 'phys' argument
    ufs_getfrag_block(): tidy up a bit
    ...

    Linus Torvalds
     
  • Merge patch-bomb from Andrew Morton:

    - a few misc things

    - Andy's "ambient capabilities"

    - fs/nofity updates

    - the ocfs2 queue

    - kernel/watchdog.c updates and feature work.

    - some of MM. Includes Andrea's userfaultfd feature.

    [ Hadn't noticed that userfaultfd was 'default y' when applying the
    patches, so that got fixed in this merge instead. We do _not_ mark
    new features that nobody uses yet 'default y' - Linus ]

    * emailed patches from Andrew Morton : (118 commits)
    mm/hugetlb.c: make vma_has_reserves() return bool
    mm/madvise.c: make madvise_behaviour_valid() return bool
    mm/memory.c: make tlb_next_batch() return bool
    mm/dmapool.c: change is_page_busy() return from int to bool
    mm: remove struct node_active_region
    mremap: simplify the "overlap" check in mremap_to()
    mremap: don't do uneccesary checks if new_len == old_len
    mremap: don't do mm_populate(new_addr) on failure
    mm: move ->mremap() from file_operations to vm_operations_struct
    mremap: don't leak new_vma if f_op->mremap() fails
    mm/hugetlb.c: make vma_shareable() return bool
    mm: make GUP handle pfn mapping unless FOLL_GET is requested
    mm: fix status code which move_pages() returns for zero page
    mm: memcontrol: bring back the VM_BUG_ON() in mem_cgroup_swapout()
    genalloc: add support of multiple gen_pools per device
    genalloc: add name arg to gen_pool_get() and devm_gen_pool_create()
    mm/memblock: WARN_ON when nid differs from overlap region
    Documentation/features/vm: add feature description and arch support status for batched TLB flush after unmap
    mm: defer flush of writable TLB entries
    mm: send one IPI per CPU to TLB flush all entries after unmapping pages
    ...

    Linus Torvalds
     

05 Sep, 2015

2 commits

  • An IPI is sent to flush remote TLBs when a page is unmapped that was
    potentially accesssed by other CPUs. There are many circumstances where
    this happens but the obvious one is kswapd reclaiming pages belonging to a
    running process as kswapd and the task are likely running on separate
    CPUs.

    On small machines, this is not a significant problem but as machine gets
    larger with more cores and more memory, the cost of these IPIs can be
    high. This patch uses a simple structure that tracks CPUs that
    potentially have TLB entries for pages being unmapped. When the unmapping
    is complete, the full TLB is flushed on the assumption that a refill cost
    is lower than flushing individual entries.

    Architectures wishing to do this must give the following guarantee.

    If a clean page is unmapped and not immediately flushed, the
    architecture must guarantee that a write to that linear address
    from a CPU with a cached TLB entry will trap a page fault.

    This is essentially what the kernel already depends on but the window is
    much larger with this patch applied and is worth highlighting. The
    architecture should consider whether the cost of the full TLB flush is
    higher than sending an IPI to flush each individual entry. An additional
    architecture helper called flush_tlb_local is required. It's a trivial
    wrapper with some accounting in the x86 case.

    The impact of this patch depends on the workload as measuring any benefit
    requires both mapped pages co-located on the LRU and memory pressure. The
    case with the biggest impact is multiple processes reading mapped pages
    taken from the vm-scalability test suite. The test case uses NR_CPU
    readers of mapped files that consume 10*RAM.

    Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs

    4.2.0-rc1 4.2.0-rc1
    vanilla flushfull-v7
    Ops lru-file-mmap-read-elapsed 159.62 ( 0.00%) 120.68 ( 24.40%)
    Ops lru-file-mmap-read-time_range 30.59 ( 0.00%) 2.80 ( 90.85%)
    Ops lru-file-mmap-read-time_stddv 6.70 ( 0.00%) 0.64 ( 90.38%)

    4.2.0-rc1 4.2.0-rc1
    vanilla flushfull-v7
    User 581.00 611.43
    System 5804.93 4111.76
    Elapsed 161.03 122.12

    This is showing that the readers completed 24.40% faster with 29% less
    system CPU time. From vmstats, it is known that the vanilla kernel was
    interrupted roughly 900K times per second during the steady phase of the
    test and the patched kernel was interrupts 180K times per second.

    The impact is lower on a single socket machine.

    4.2.0-rc1 4.2.0-rc1
    vanilla flushfull-v7
    Ops lru-file-mmap-read-elapsed 25.33 ( 0.00%) 20.38 ( 19.54%)
    Ops lru-file-mmap-read-time_range 0.91 ( 0.00%) 1.44 (-58.24%)
    Ops lru-file-mmap-read-time_stddv 0.28 ( 0.00%) 0.47 (-65.34%)

    4.2.0-rc1 4.2.0-rc1
    vanilla flushfull-v7
    User 58.09 57.64
    System 111.82 76.56
    Elapsed 27.29 22.55

    It's still a noticeable improvement with vmstat showing interrupts went
    from roughly 500K per second to 45K per second.

    The patch will have no impact on workloads with no memory pressure or have
    relatively few mapped pages. It will have an unpredictable impact on the
    workload running on the CPU being flushed as it'll depend on how many TLB
    entries need to be refilled and how long that takes. Worst case, the TLB
    will be completely cleared of active entries when the target PFNs were not
    resident at all.

    [sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Dave Hansen
    Acked-by: Ingo Molnar
    Cc: Linus Torvalds
    Signed-off-by: Sasha Levin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This allows to select the userfaultfd during configuration to build it.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

02 Sep, 2015

1 commit

  • Pull cgroup updates from Tejun Heo:

    - a new PIDs controller is added. It turns out that PIDs are actually
    an independent resource from kmem due to the limited PID space.

    - more core preparations for the v2 interface. Once cpu side interface
    is settled, it should be ready for lifting the devel mask.
    for-4.3-unified-base was temporarily branched so that other trees
    (block) can pull cgroup core changes that blkcg changes depend on.

    - a non-critical idr_preload usage bug fix.

    * 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: pids: fix invalid get/put usage
    cgroup: introduce cgroup_subsys->legacy_name
    cgroup: don't print subsystems for the default hierarchy
    cgroup: make cftype->private a unsigned long
    cgroup: export cgrp_dfl_root
    cgroup: define controller file conventions
    cgroup: fix idr_preload usage
    cgroup: add documentation for the PIDs controller
    cgroup: implement the PIDs subsystem
    cgroup: allow a cgroup subsystem to reject a fork

    Linus Torvalds
     

15 Aug, 2015

1 commit


14 Aug, 2015

1 commit


13 Aug, 2015

1 commit

  • The revised sign-file program is no longer a script that wraps the openssl
    program, but now rather a program that makes use of OpenSSL's crypto
    library. This means that to build the sign-file program, the kernel build
    process now has a dependency on the OpenSSL development packages in
    addition to OpenSSL itself.

    Document this in Kconfig and in module-signing.txt.

    Signed-off-by: David Howells
    Reviewed-by: David Woodhouse

    David Howells
     

12 Aug, 2015

1 commit

  • …k/linux-rcu into core/rcu

    Pull RCU changes from Paul E. McKenney:

    - The combination of tree geometry-initialization simplifications
    and OS-jitter-reduction changes to expedited grace periods.
    These two are stacked due to the large number of conflicts
    that would otherwise result.

    [ With one addition, a temporary commit to silence a lockdep false
    positive. Additional changes to the expedited grace-period
    primitives (queued for 4.4) remove the cause of this false
    positive, and therefore include a revert of this temporary commit. ]

    - Documentation updates.

    - Torture-test updates.

    - Miscellaneous fixes.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

07 Aug, 2015

7 commits

  • Let the user explicitly provide a file containing trusted keys, instead of
    just automatically finding files matching *.x509 in the build tree and
    trusting whatever we find. This really ought to be an *explicit*
    configuration, and the build rules for dealing with the files were
    fairly painful too.

    Fix applied from James Morris that removes an '=' from a macro definition
    in kernel/Makefile as this is a feature that only exists from GNU make 3.82
    onwards.

    Signed-off-by: David Woodhouse
    Signed-off-by: David Howells

    David Woodhouse
     
  • The current rule for generating signing_key.priv and signing_key.x509 is
    a classic example of a bad rule which has a tendency to break parallel
    make. When invoked to create *either* target, it generates the other
    target as a side-effect that make didn't predict.

    So let's switch to using a single file signing_key.pem which contains
    both key and certificate. That matches what we do in the case of an
    external key specified by CONFIG_MODULE_SIG_KEY anyway, so it's also
    slightly cleaner.

    Signed-off-by: David Woodhouse
    Signed-off-by: David Howells

    David Woodhouse
     
  • Where an external PEM file or PKCS#11 URI is given, we can get the cert
    from it for ourselves instead of making the user drop signing_key.x509
    in place for us.

    Signed-off-by: David Woodhouse
    Signed-off-by: David Howells

    David Woodhouse
     
  • Signed-off-by: David Woodhouse
    Signed-off-by: David Howells

    David Woodhouse
     
  • Extract the function that drives the PKCS#7 signature verification given a
    data blob and a PKCS#7 blob out from the module signing code and lump it with
    the system keyring code as it's generic. This makes it independent of module
    config options and opens it to use by the firmware loader.

    Signed-off-by: David Howells
    Cc: Luis R. Rodriguez
    Cc: Rusty Russell
    Cc: Ming Lei
    Cc: Seth Forshee
    Cc: Kyle McMartin

    David Howells
     
  • Move to using PKCS#7 messages as module signatures because:

    (1) We have to be able to support the use of X.509 certificates that don't
    have a subjKeyId set. We're currently relying on this to look up the
    X.509 certificate in the trusted keyring list.

    (2) PKCS#7 message signed information blocks have a field that supplies the
    data required to match with the X.509 certificate that signed it.

    (3) The PKCS#7 certificate carries fields that specify the digest algorithm
    used to generate the signature in a standardised way and the X.509
    certificates specify the public key algorithm in a standardised way - so
    we don't need our own methods of specifying these.

    (4) We now have PKCS#7 message support in the kernel for signed kexec purposes
    and we can make use of this.

    To make this work, the old sign-file script has been replaced with a program
    that needs compiling in a previous patch. The rules to build it are added
    here.

    Signed-off-by: David Howells
    Tested-by: Vivek Goyal

    David Howells
     
  • Dave Hansen reported the following;

    My laptop has been behaving strangely with 4.2-rc2. Once I log
    in to my X session, I start getting all kinds of strange errors
    from applications and see this in my dmesg:

    VFS: file-max limit 8192 reached

    The problem is that the file-max is calculated before memory is fully
    initialised and miscalculates how much memory the kernel is using. This
    patch recalculates file-max after deferred memory initialisation. Note
    that using memory hotplug infrastructure would not have avoided this
    problem as the value is not recalculated after memory hot-add.

    4.1: files_stat.max_files = 6582781
    4.2-rc2: files_stat.max_files = 8192
    4.2-rc2 patched: files_stat.max_files = 6562467

    Small differences with the patch applied and 4.1 but not enough to matter.

    Signed-off-by: Mel Gorman
    Reported-by: Dave Hansen
    Cc: Nicolai Stange
    Cc: Dave Hansen
    Cc: Alex Ng
    Cc: Fengguang Wu
    Cc: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

23 Jul, 2015

1 commit


15 Jul, 2015

1 commit

  • Adds a new single-purpose PIDs subsystem to limit the number of
    tasks that can be forked inside a cgroup. Essentially this is an
    implementation of RLIMIT_NPROC that applies to a cgroup rather than a
    process tree.

    However, it should be noted that organisational operations (adding and
    removing tasks from a PIDs hierarchy) will *not* be prevented. Rather,
    the number of tasks in the hierarchy cannot exceed the limit through
    forking. This is due to the fact that, in the unified hierarchy, attach
    cannot fail (and it is not possible for a task to overcome its PIDs
    cgroup policy limit by attaching to a child cgroup -- even if migrating
    mid-fork it must be able to fork in the parent first).

    PIDs are fundamentally a global resource, and it is possible to reach
    PID exhaustion inside a cgroup without hitting any reasonable kmemcg
    policy. Once you've hit PID exhaustion, you're only in a marginally
    better state than OOM. This subsystem allows PID exhaustion inside a
    cgroup to be prevented.

    Signed-off-by: Aleksa Sarai
    Signed-off-by: Tejun Heo

    Aleksa Sarai
     

07 Jul, 2015

1 commit


04 Jul, 2015

1 commit

  • Pull scheduler fixes from Ingo Molnar:
    "Debug info and other statistics fixes and related enhancements"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/numa: Fix numa balancing stats in /proc/pid/sched
    sched/numa: Show numa_group ID in /proc/sched_debug task listings
    sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h
    sched/stat: Expose /proc/pid/schedstat if CONFIG_SCHED_INFO=y
    sched/stat: Simplify the sched_info accounting dependency

    Linus Torvalds