09 Aug, 2016

1 commit

  • Pull usercopy protection from Kees Cook:
    "Tbhis implements HARDENED_USERCOPY verification of copy_to_user and
    copy_from_user bounds checking for most architectures on SLAB and
    SLUB"

    * tag 'usercopy-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    mm: SLUB hardened usercopy support
    mm: SLAB hardened usercopy support
    s390/uaccess: Enable hardened usercopy
    sparc/uaccess: Enable hardened usercopy
    powerpc/uaccess: Enable hardened usercopy
    ia64/uaccess: Enable hardened usercopy
    arm64/uaccess: Enable hardened usercopy
    ARM: uaccess: Enable hardened usercopy
    x86/uaccess: Enable hardened usercopy
    mm: Hardened usercopy
    mm: Implement stack frame object validation
    mm: Add is_migrate_cma_page
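
    For a rough feel of the mechanism (a simplified sketch, not the exact
    kernel code), the copy_*_user() paths gain a check_object_size() call
    that validates the kernel-side buffer before any data moves:

    static inline unsigned long
    checked_copy_from_user(void *to, const void __user *from, unsigned long n)
    {
            /* BUG()s if [to, to+n) spans outside one slab object,
             * stack frame or other sane region. */
            check_object_size(to, n, false);    /* false: not copying to user */
            return copy_from_user(to, from, n);
    }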

    Linus Torvalds
     

03 Aug, 2016

4 commits

  • CONFIG_TRIM_UNUSED_KSYMS doesn't trim just symbols that are totally
    unused in-tree - it trims the symbols unused by any in-tree modules
    actually built. If you've done a 'make localmodconfig' and only build
    a hundred or so modules, it's pretty likely that your out-of-tree
    module will come up lacking something...

    Hopefully this will save the next guy from a Homer Simpson "D'oh!"
    moment.

    Link: http://lkml.kernel.org/r/10177.1469787292@turing-police.cc.vt.edu
    Signed-off-by: Valdis Kletnieks
    Cc: Michal Marek
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Valdis Kletnieks
     
  • Doing patches with an allmodconfig kernel and committing stuff into a
    local tree has an unfortunate consequence: the kernel version changes
    (as it should), leading to recompiling and relinking of several files
    even if they weren't touched (or interesting at all). This, and
    "git-whatever" figuring out the current version, slows down
    compilation for no good reason.

    But let's face it, "allmodconfig" kernels don't care about the kernel
    version; they are simply compile-check guinea pigs.

    Make LOCALVERSION_AUTO depend on !COMPILE_TEST, so it doesn't sneak into
    allmodconfig .config.

    Link: http://lkml.kernel.org/r/20160707214954.GC31678@p183.telecom.by
    Signed-off-by: Alexey Dobriyan
    Cc: Michal Marek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • UML is a bit special since it does not have iomem nor dma. That means a
    lot of drivers will not build if they miss a dependency on HAS_IOMEM.
    s390 used to have the same issues but since it gained PCI support UML is
    the only stranger.

    We are tired of patching dozens of new drivers after every merge
    window just to un-break allmod/yesconfig UML builds. One could argue
    that a decent driver has to know what it depends on and that
    therefore a missing HAS_IOMEM dependency is a clear driver bug. But
    the dependency is not obvious, and not everyone does UML builds with
    COMPILE_TEST enabled when developing a device driver.

    A possible solution to make these builds succeed on UML would be
    providing stub functions for ioremap() and friends which fail at
    runtime. Another one is simply disabling COMPILE_TEST for UML. Since
    it is the least hassle and does not force us to fake iomem support,
    let's do the latter.
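
    For illustration, the rejected stub approach would have looked
    roughly like this (hypothetical sketch, never merged):

    /* Fake iomem helpers so drivers build on UML but fail at runtime. */
    static inline void __iomem *ioremap(phys_addr_t offset, size_t size)
    {
            return NULL;            /* UML has no iomem to map */
    }

    static inline void iounmap(void __iomem *addr)
    {
    }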

    Link: http://lkml.kernel.org/r/1466152995-28367-1-git-send-email-richard@nod.at
    Signed-off-by: Richard Weinberger
    Acked-by: Arnd Bergmann
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
     
  • The cgroup documentation path has changed to "cgroup-v1"; update the
    reference accordingly.

    Link: http://lkml.kernel.org/r/1470148443-6509-1-git-send-email-iamyooon@gmail.com
    Signed-off-by: seokhoon.yoon
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    seokhoon.yoon
     

29 Jul, 2016

1 commit

  • Pull trivial tree updates from Jiri Kosina.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    fat: fix error message for bogus number of directory entries
    fat: fix typo s/supeblock/superblock/
    ASoC: max9877: Remove unused function declaration
    dw2102: don't output spurious blank lines to the kernel log
    init: fix Kconfig text
    ARM: io: fix comment grammar
    ocfs: fix ocfs2_xattr_user_get() argument name
    scsi/qla2xxx: Remove erroneous unused macro qla82xx_get_temp_val1()

    Linus Torvalds
     

27 Jul, 2016

3 commits

  • Implements freelist randomization for the SLUB allocator. It was
    previously implemented for the SLAB allocator. Both use the same
    configuration option (CONFIG_SLAB_FREELIST_RANDOM).

    The list is randomized during initialization of a new set of pages. The
    order on different freelist sizes is pre-computed at boot for
    performance. Each kmem_cache has its own randomized freelist.

    This security feature reduces the predictability of the kernel SLUB
    allocator against heap overflows, rendering attacks much less stable.

    For example these attacks exploit the predictability of the heap:
    - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
    - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)
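
    The shuffling step itself is a plain Fisher-Yates pass over the
    object indexes of a new page; a minimal, self-contained sketch in
    userspace C (rand() stands in for the kernel's entropy sources):

    #include <stdio.h>
    #include <stdlib.h>

    /* Randomize the order in which objects 0..count-1 are handed out,
     * so consecutive allocations land at unpredictable addresses. */
    static void randomize_freelist(unsigned int *list, unsigned int count)
    {
            for (unsigned int i = count - 1; i > 0; i--) {
                    unsigned int j = rand() % (i + 1);
                    unsigned int tmp = list[i];

                    list[i] = list[j];
                    list[j] = tmp;
            }
    }

    int main(void)
    {
            unsigned int list[16];

            for (unsigned int i = 0; i < 16; i++)
                    list[i] = i;
            randomize_freelist(list, 16);
            for (unsigned int i = 0; i < 16; i++)
                    printf("%u ", list[i]);
            printf("\n");
            return 0;
    }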

    Performance results:

    slab_test impact is between 3% and 4% on average for 100000 attempts
    without SMP. It is a very focused test; kernbench shows that the
    overall impact on the system is much lower.

    Before:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
    100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
    100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
    100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
    100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
    100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
    100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
    100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
    100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
    100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
    2. Kmalloc: alloc/free test
    100000 times kmalloc(8)/kfree -> 70 cycles
    100000 times kmalloc(16)/kfree -> 70 cycles
    100000 times kmalloc(32)/kfree -> 70 cycles
    100000 times kmalloc(64)/kfree -> 70 cycles
    100000 times kmalloc(128)/kfree -> 70 cycles
    100000 times kmalloc(256)/kfree -> 69 cycles
    100000 times kmalloc(512)/kfree -> 70 cycles
    100000 times kmalloc(1024)/kfree -> 73 cycles
    100000 times kmalloc(2048)/kfree -> 72 cycles
    100000 times kmalloc(4096)/kfree -> 71 cycles

    After:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
    100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
    100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
    100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
    100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
    100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
    100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
    100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
    100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
    100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
    2. Kmalloc: alloc/free test
    100000 times kmalloc(8)/kfree -> 66 cycles
    100000 times kmalloc(16)/kfree -> 66 cycles
    100000 times kmalloc(32)/kfree -> 66 cycles
    100000 times kmalloc(64)/kfree -> 66 cycles
    100000 times kmalloc(128)/kfree -> 65 cycles
    100000 times kmalloc(256)/kfree -> 67 cycles
    100000 times kmalloc(512)/kfree -> 67 cycles
    100000 times kmalloc(1024)/kfree -> 64 cycles
    100000 times kmalloc(2048)/kfree -> 67 cycles
    100000 times kmalloc(4096)/kfree -> 67 cycles

    Kernbench, before:

    Average Optimal load -j 12 Run (std deviation):
    Elapsed Time 101.873 (1.16069)
    User Time 1045.22 (1.60447)
    System Time 88.969 (0.559195)
    Percent CPU 1112.9 (13.8279)
    Context Switches 189140 (2282.15)
    Sleeps 99008.6 (768.091)

    After:

    Average Optimal load -j 12 Run (std deviation):
    Elapsed Time 102.47 (0.562732)
    User Time 1045.3 (1.34263)
    System Time 88.311 (0.342554)
    Percent CPU 1105.8 (6.49444)
    Context Switches 189081 (2355.78)
    Sleeps 99231.5 (800.358)

    Link: http://lkml.kernel.org/r/1464295031-26375-3-git-send-email-thgarnie@google.com
    Signed-off-by: Thomas Garnier
    Reviewed-by: Kees Cook
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Garnier
     
  • Under CONFIG_HARDENED_USERCOPY, this adds object size checking to the
    SLUB allocator to catch any copies that may span objects. Includes a
    redzone handling fix discovered by Michael Ellerman.

    Based on code from PaX and grsecurity.
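
    The heart of the check is small; a simplified sketch (parameter names
    are illustrative, not the exact mm/slub.c interface), including the
    left-redzone adjustment that Ellerman's report led to:

    /* Return NULL if [ptr, ptr+n) fits in one object, else the name of
     * the offending cache. obj_offset is ptr's offset within its object. */
    static const char *check_span(unsigned long obj_offset, unsigned long n,
                                  unsigned long object_size,
                                  unsigned long red_left_pad,
                                  const char *cache_name)
    {
            /* A pointer into the left redzone is never a valid object. */
            if (obj_offset < red_left_pad)
                    return cache_name;
            obj_offset -= red_left_pad;

            /* Allow only copies falling entirely within the object. */
            if (obj_offset <= object_size && n <= object_size - obj_offset)
                    return NULL;

            return cache_name;
    }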

    Signed-off-by: Kees Cook
    Tested-by: Michael Ellerman
    Reviewed-by: Laura Abbott

    Kees Cook
     
  • Under CONFIG_HARDENED_USERCOPY, this adds object size checking to the
    SLAB allocator to catch any copies that may span objects.

    Based on code from PaX and grsecurity.

    Signed-off-by: Kees Cook
    Tested-by: Valdis Kletnieks

    Kees Cook
     

26 Jul, 2016

2 commits

  • Pull NOHZ updates from Ingo Molnar:

    - fix system/idle cputime leaked on cputime accounting (all nohz
    configs) (Rik van Riel)

    - remove the messy, ad-hoc irqtime account on nohz-full and make it
    compatible with CONFIG_IRQ_TIME_ACCOUNTING=y instead (Rik van Riel)

    - cleanups (Frederic Weisbecker)

    - remove unnecessary irq disablement in the irqtime code (Rik van Riel)

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/cputime: Drop local_irq_save/restore from irqtime_account_irq()
    sched/cputime: Reorganize vtime native irqtime accounting headers
    sched/cputime: Clean up the old vtime gen irqtime accounting completely
    sched/cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code
    sched/cputime: Count actually elapsed irq & softirq time

    Linus Torvalds
     
  • Pull RCU updates from Ingo Molnar:
    "The main changes in this cycle were:

    - documentation updates

    - miscellaneous fixes

    - minor reorganization of code

    - torture-test updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
    rcu: Correctly handle sparse possible cpus
    rcu: sysctl: Panic on RCU Stall
    rcu: Fix a typo in a comment
    rcu: Make call_rcu_tasks() tolerate first call with irqs disabled
    rcu: Disable TASKS_RCU for usermode Linux
    rcu: No ordering for rcu_assign_pointer() of NULL
    rcutorture: Fix error return code in rcu_perf_init()
    torture: Inflict default jitter
    rcuperf: Don't treat gp_exp mis-setting as a WARN
    rcutorture: Drop "-soundhw pcspkr" from x86 boot arguments
    rcutorture: Don't specify the cpu type of QEMU on PPC
    rcutorture: Make -soundhw a x86 specific option
    rcutorture: Use vmlinux as the fallback kernel image
    rcutorture/doc: Create initrd using dracut
    torture: Stop onoff task if there is only one cpu
    torture: Add starvation events to error summary
    torture: Break online and offline functions out of torture_onoff()
    torture: Forgive lengthy trace dumps and preemption
    torture: Remove CONFIG_RCU_TORTURE_TEST_RUNNABLE, simplify code
    torture: Simplify code, eliminate RCU_PERF_TEST_RUNNABLE
    ...

    Linus Torvalds
     

14 Jul, 2016

1 commit

  • The CONFIG_VIRT_CPU_ACCOUNTING_GEN irq time tracking code does not
    appear to currently work right.

    On CPUs without nohz_full=, only tick based irq time sampling is
    done, which breaks down when dealing with a nohz_idle CPU.

    On firewalls and similar systems, no ticks may happen on a CPU for a
    while, and the irq time spent may never get accounted properly. This
    can cause issues with capacity planning and power saving, which use
    the CPU statistics as inputs in decision making.

    Remove the VTIME_GEN vtime irq time code, and replace it with the
    IRQ_TIME_ACCOUNTING code, when selected as a config option by the user.
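
    The replacement approach can be sketched as follows (names are
    illustrative, not the exact kernel symbols): sample a fine-grained
    clock on IRQ entry/exit and accumulate the delta, instead of hoping
    that a timer tick lands inside the handler.

    static u64 irq_entry_time;
    static u64 total_irq_time;

    void on_irq_enter(void)
    {
            irq_entry_time = sched_clock();
    }

    void on_irq_exit(void)
    {
            /* Charge exactly the time spent in the handler, whether or
             * not the tick is running on this CPU. */
            total_irq_time += sched_clock() - irq_entry_time;
    }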

    Signed-off-by: Rik van Riel
    Signed-off-by: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Radim Krcmar
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1468421405-20056-3-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

07 Jul, 2016

1 commit

  • The "expert" menu was broken (split) such that all entries in it after
    KALLSYMS were displayed in the "General setup" area instead of in the
    "Expert users" area. Fix this by adding one kconfig dependency.

    Yes, the Expert users menu is fragile. Problems like this have happened
    several times in the past. I will attempt to isolate the Expert users
    menu if there is interest in that.

    Fixes: 4d5d5664c900 ("x86: kallsyms: disable absolute percpu symbols on !SMP")
    Signed-off-by: Randy Dunlap
    Cc: Ard Biesheuvel
    Cc: stable@vger.kernel.org # 4.6
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

21 Jun, 2016

1 commit


16 Jun, 2016

1 commit

  • Usermode Linux currently does not implement arch_irqs_disabled_flags(),
    which results in a build failure in TASKS_RCU. Therefore, this commit
    disables the TASKS_RCU Kconfig option in usermode Linux builds. The
    usermode Linux maintainers expect to merge arch_irqs_disabled_flags()
    into 4.8, at which point this commit may be reverted.

    Signed-off-by: Paul E. McKenney
    Cc: Jeff Dike
    Acked-by: Richard Weinberger

    Paul E. McKenney
     

27 May, 2016

1 commit

  • Pull kbuild updates from Michal Marek:

    - new option CONFIG_TRIM_UNUSED_KSYMS which does a two-pass build and
    unexports symbols which are not used in the current config [Nicolas
    Pitre]

    - several kbuild rule cleanups [Masahiro Yamada]

    - warning option adjustments for gcov etc [Arnd Bergmann]

    - a few more small fixes

    * 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: (31 commits)
    kbuild: move -Wunused-const-variable to W=1 warning level
    kbuild: fix if_change and friends to consider argument order
    kbuild: fix adjust_autoksyms.sh for modules that need only one symbol
    kbuild: fix ksym_dep_filter when multiple EXPORT_SYMBOL() on the same line
    gcov: disable -Wmaybe-uninitialized warning
    gcov: disable tree-loop-im to reduce stack usage
    gcov: disable for COMPILE_TEST
    Kbuild: disable 'maybe-uninitialized' warning for CONFIG_PROFILE_ALL_BRANCHES
    Kbuild: change CC_OPTIMIZE_FOR_SIZE definition
    kbuild: forbid kernel directory to contain spaces and colons
    kbuild: adjust ksym_dep_filter for some cmd_* renames
    kbuild: Fix dependencies for final vmlinux link
    kbuild: better abstract vmlinux sequential prerequisites
    kbuild: fix call to adjust_autoksyms.sh when output directory specified
    kbuild: Get rid of KBUILD_STR
    kbuild: rename cmd_as_s_S to cmd_cpp_s_S
    kbuild: rename cmd_cc_i_c to cmd_cpp_i_c
    kbuild: drop redundant "PHONY += FORCE"
    kbuild: delete unnecessary "@:"
    kbuild: mark help target as PHONY
    ...

    Linus Torvalds
     

21 May, 2016

2 commits

  • Testing has shown that the backtrace sometimes does not fit into the 4kB
    temporary buffer that is used in NMI context. The warnings are gone
    when I double the temporary buffer size.

    This patch doubles the buffer size and makes it configurable.

    Note that this problem existed even in the x86-specific implementation
    that was added by the commit a9edc8809328 ("x86/nmi: Perform a safe NMI
    stack trace on all CPUs"). Nobody noticed it because it did not print
    any warnings.

    Signed-off-by: Petr Mladek
    Cc: Jan Kara
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Russell King
    Cc: Daniel Thompson
    Cc: Jiri Kosina
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: David Miller
    Cc: Daniel Thompson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • printk() takes some locks and cannot be used in a safe way in NMI
    context.

    The chance of a deadlock is real especially when printing stacks from
    all CPUs. This particular problem has been addressed on x86 by the
    commit a9edc8809328 ("x86/nmi: Perform a safe NMI stack trace on all
    CPUs").

    The patchset brings two big advantages. First, it makes the NMI
    backtraces safe on all architectures for free. Second, it makes all
    NMI messages almost safe on all architectures (the temporary buffer
    is limited, so we should still keep the number of messages in NMI
    context to a minimum).

    Note that there already are several messages printed in NMI context:
    WARN_ON(in_nmi()), BUG_ON(in_nmi()), anything being printed out from MCE
    handlers. These are not easy to avoid.

    This patch reuses most of the code and makes it generic. It is useful
    for all messages and architectures that support NMI.

    The alternative printk_func is set when entering NMI context and is
    reset when leaving it. It queues IRQ work to copy the messages into
    the main ring buffer in a safe context.

    __printk_nmi_flush() copies all available messages and resets the
    buffer. Simple cmpxchg operations are then enough to synchronize with
    writers. A spinlock is also used to synchronize with other flushers.

    We no longer use seq_buf because it depends on an external lock. It
    would be hard to make all supported operations safe for lockless use,
    and it would be confusing and error prone to make only some
    operations safe.

    The code is put into separate printk/nmi.c as suggested by Steven
    Rostedt. It needs a per-CPU buffer and is compiled only on
    architectures that call nmi_enter(). This is achieved by the new
    HAVE_NMI Kconfig flag.

    The exceptions are the MN10300 and Xtensa architectures; we need to
    clean up NMI handling there first. Let's do it separately.

    The patch is heavily based on the draft from Peter Zijlstra, see

    https://lkml.org/lkml/2015/6/10/327
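
    The lockless append can be sketched in portable C11 (illustrative
    only; the kernel variant is per-CPU and is flushed from IRQ work):

    #include <stdatomic.h>
    #include <string.h>

    #define NMI_BUF_SIZE 4096

    static char nmi_buf[NMI_BUF_SIZE];
    static atomic_uint nmi_len;

    /* Reserve space with a cmpxchg loop, then copy. The NMI writer
     * never takes a lock, so this is safe even if the NMI interrupted
     * printk() itself. */
    static int nmi_vprintk(const char *msg, unsigned int len)
    {
            unsigned int old, new;

            do {
                    old = atomic_load(&nmi_len);
                    new = old + len;
                    if (new > NMI_BUF_SIZE)
                            return 0;       /* buffer full, message dropped */
            } while (!atomic_compare_exchange_weak(&nmi_len, &old, new));

            memcpy(nmi_buf + old, msg, len);
            return len;
    }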

    [arnd@arndb.de: printk-nmi: use %zu format string for size_t]
    [akpm@linux-foundation.org: min_t->min - all types are size_t here]
    Signed-off-by: Petr Mladek
    Suggested-by: Peter Zijlstra
    Suggested-by: Steven Rostedt
    Cc: Jan Kara
    Acked-by: Russell King [arm part]
    Cc: Daniel Thompson
    Cc: Jiri Kosina
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: David Miller
    Cc: Daniel Thompson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     

20 May, 2016

1 commit

  • Provides an optional config (CONFIG_SLAB_FREELIST_RANDOM) to randomize
    the SLAB freelist. The list is randomized during initialization of a
    new set of pages. The order on different freelist sizes is pre-computed
    at boot for performance. Each kmem_cache has its own randomized
    freelist. Before pre-computed lists are available, freelists are
    generated dynamically. This security feature reduces the
    predictability of the kernel SLAB allocator against heap overflows,
    rendering attacks much less stable.

    For example this attack against SLUB (also applicable against SLAB)
    would be affected:

    https://jon.oberheide.org/blog/2010/09/10/linux-kernel-can-slub-overflow/

    Also, since v4.6 the freelist has been moved to the end of the SLAB.
    This means a controllable heap is opened to new attacks not yet
    publicly discussed. A kernel heap overflow can be transformed into
    multiple use-after-frees. This feature makes this type of attack
    harder too.

    To generate entropy, we use get_random_bytes_arch because 0 bits of
    entropy are available at that boot stage. In the worst case this
    function will fall back to the get_random_bytes sub API. We also
    generate a random shift number to shift the pre-computed freelist for
    each new set of pages.
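
    The shift step amounts to rotating the boot-time list (illustrative
    sketch; function and parameter names are made up):

    /* Rotate the pre-computed order by a per-page random shift, so
     * pages sharing one pre-computed list still get different layouts. */
    static void apply_random_shift(unsigned int *dst, const unsigned int *pre,
                                   unsigned int count, unsigned int shift)
    {
            for (unsigned int i = 0; i < count; i++)
                    dst[i] = pre[(i + shift) % count];
    }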

    The config option name is not specific to the SLAB as this approach will
    be extended to other allocators like SLUB.

    Performance results highlighted no major changes:

    Hackbench (running 90 10 times):

    Before average: 0.0698
    After average: 0.0663 (-5.01%)

    slab_test: 1 run on boot. A difference is only seen on the 2048-byte
    size test, which is the worst-case scenario covered by freelist
    randomization. New slab pages are constantly being created on the
    10000 allocations. Variance should be mainly due to getting new pages
    every few allocations.

    Before:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    10000 times kmalloc(8) -> 99 cycles kfree -> 112 cycles
    10000 times kmalloc(16) -> 109 cycles kfree -> 140 cycles
    10000 times kmalloc(32) -> 129 cycles kfree -> 137 cycles
    10000 times kmalloc(64) -> 141 cycles kfree -> 141 cycles
    10000 times kmalloc(128) -> 152 cycles kfree -> 148 cycles
    10000 times kmalloc(256) -> 195 cycles kfree -> 167 cycles
    10000 times kmalloc(512) -> 257 cycles kfree -> 199 cycles
    10000 times kmalloc(1024) -> 393 cycles kfree -> 251 cycles
    10000 times kmalloc(2048) -> 649 cycles kfree -> 228 cycles
    10000 times kmalloc(4096) -> 806 cycles kfree -> 370 cycles
    10000 times kmalloc(8192) -> 814 cycles kfree -> 411 cycles
    10000 times kmalloc(16384) -> 892 cycles kfree -> 455 cycles
    2. Kmalloc: alloc/free test
    10000 times kmalloc(8)/kfree -> 121 cycles
    10000 times kmalloc(16)/kfree -> 121 cycles
    10000 times kmalloc(32)/kfree -> 121 cycles
    10000 times kmalloc(64)/kfree -> 121 cycles
    10000 times kmalloc(128)/kfree -> 121 cycles
    10000 times kmalloc(256)/kfree -> 119 cycles
    10000 times kmalloc(512)/kfree -> 119 cycles
    10000 times kmalloc(1024)/kfree -> 119 cycles
    10000 times kmalloc(2048)/kfree -> 119 cycles
    10000 times kmalloc(4096)/kfree -> 121 cycles
    10000 times kmalloc(8192)/kfree -> 119 cycles
    10000 times kmalloc(16384)/kfree -> 119 cycles

    After:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    10000 times kmalloc(8) -> 130 cycles kfree -> 86 cycles
    10000 times kmalloc(16) -> 118 cycles kfree -> 86 cycles
    10000 times kmalloc(32) -> 121 cycles kfree -> 85 cycles
    10000 times kmalloc(64) -> 176 cycles kfree -> 102 cycles
    10000 times kmalloc(128) -> 178 cycles kfree -> 100 cycles
    10000 times kmalloc(256) -> 205 cycles kfree -> 109 cycles
    10000 times kmalloc(512) -> 262 cycles kfree -> 136 cycles
    10000 times kmalloc(1024) -> 342 cycles kfree -> 157 cycles
    10000 times kmalloc(2048) -> 701 cycles kfree -> 238 cycles
    10000 times kmalloc(4096) -> 803 cycles kfree -> 364 cycles
    10000 times kmalloc(8192) -> 835 cycles kfree -> 404 cycles
    10000 times kmalloc(16384) -> 896 cycles kfree -> 441 cycles
    2. Kmalloc: alloc/free test
    10000 times kmalloc(8)/kfree -> 121 cycles
    10000 times kmalloc(16)/kfree -> 121 cycles
    10000 times kmalloc(32)/kfree -> 123 cycles
    10000 times kmalloc(64)/kfree -> 142 cycles
    10000 times kmalloc(128)/kfree -> 121 cycles
    10000 times kmalloc(256)/kfree -> 119 cycles
    10000 times kmalloc(512)/kfree -> 119 cycles
    10000 times kmalloc(1024)/kfree -> 119 cycles
    10000 times kmalloc(2048)/kfree -> 119 cycles
    10000 times kmalloc(4096)/kfree -> 119 cycles
    10000 times kmalloc(8192)/kfree -> 119 cycles
    10000 times kmalloc(16384)/kfree -> 119 cycles

    [akpm@linux-foundation.org: propagate gfp_t into cache_random_seq_create()]
    Signed-off-by: Thomas Garnier
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Kees Cook
    Cc: Greg Thelen
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Garnier
     

10 May, 2016

1 commit

  • CC_OPTIMIZE_FOR_SIZE disables the often useful -Wmaybe-uninitialized
    warning, because that causes a ridiculous amount of false positives
    when combined with -Os.

    This means a lot of warnings don't show up for developers testing
    with an 'allmodconfig' kernel, which has CC_OPTIMIZE_FOR_SIZE
    enabled, but only later in randconfig builds that don't enable it.

    This changes the Kconfig logic around CC_OPTIMIZE_FOR_SIZE to make it
    a 'choice' statement defaulting to CC_OPTIMIZE_FOR_PERFORMANCE, which
    gets added for this purpose. The allmodconfig and allyesconfig
    kernels now default to -O2 with the maybe-uninitialized warning
    enabled.
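
    A typical false positive looks like the pattern below; whether GCC
    warns depends on how far it tracks the guard, which varies with the
    optimization level (illustrative example, not from the kernel):

    int compute(void);
    void do_other_work(void);

    /* gcc -Os -Wmaybe-uninitialized may claim 'v' can be used
     * uninitialized here, even though the 'flag' guard makes that
     * impossible. */
    int example(int flag)
    {
            int v;

            if (flag)
                    v = compute();
            do_other_work();
            if (flag)
                    return v;       /* only reached when v was assigned */
            return 0;
    }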

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Michal Marek

    Arnd Bergmann
     

02 Apr, 2016

1 commit

  • Newer Fedora and OpenSUSE didn't boot with my standard configuration.
    It took me some time to figure out why; in fact, I had to write a
    script to try different config options systematically.

    The problem is that something (systemd) in dracut depends on
    CONFIG_FHANDLE, which adds the open-by-file-handle syscalls.

    While it is set in defconfigs, it is very easy to miss when updating
    older configs because it is not default y.

    Make it default y and also depend on EXPERT, as dracut use is likely
    widespread.
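
    For reference, these are the syscalls CONFIG_FHANDLE provides; a
    minimal sketch (open_by_handle_at() additionally requires
    CAP_DAC_READ_SEARCH, so only the handle lookup is shown):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
            struct file_handle *fh;
            int mount_id;

            fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
            fh->handle_bytes = MAX_HANDLE_SZ;

            /* Translate a path into an opaque, persistent handle that
             * open_by_handle_at() can later reopen without the path. */
            if (name_to_handle_at(AT_FDCWD, "/etc/hostname", fh,
                                  &mount_id, 0) < 0) {
                    perror("name_to_handle_at"); /* ENOSYS without CONFIG_FHANDLE */
                    return 1;
            }
            printf("handle: %u bytes, type %d\n",
                   fh->handle_bytes, fh->handle_type);
            return 0;
    }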

    Signed-off-by: Andi Kleen
    Cc: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

30 Mar, 2016

1 commit


19 Mar, 2016

1 commit

  • Pull cgroup updates from Tejun Heo:
    "cgroup changes for v4.6-rc1. No userland visible behavior changes in
    this pull request. I'll send out a separate pull request for the
    addition of cgroup namespace support.

    - The biggest change is the revamping of cgroup core task migration
    and controller handling logic. There are quite a few places where
    controllers and tasks are manipulated. Previously, many of those
    places implemented custom operations for each specific use case
    assuming specific starting conditions. While this worked, it makes
    the code fragile and difficult to follow.

    The bulk of this pull request restructures these operations so that
    most related operations are performed through common helpers which
    implement recursive (subtrees are always processed consistently)
    and idempotent (they make cgroup hierarchy converge to the target
    state rather than performing operations assuming specific starting
    conditions). This makes the code a lot easier to understand,
    verify and extend.

    - Implicit controller support is added. This is primarily for using
    perf_event on the v2 hierarchy so that perf can match cgroup v2
    path without requiring the user to do anything special. The kernel
    portion of perf_event changes is acked but userland changes are
    still pending review.

    - cgroup_no_v1= boot parameter added to ease testing cgroup v2 in
    certain environments.

    - There is a regression introduced during v4.4 devel cycle where
    attempts to migrate zombie tasks can mess up internal object
    management. This was fixed earlier this week and included in this
    pull request w/ stable cc'd.

    - Misc non-critical fixes and improvements"

    * 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (44 commits)
    cgroup: avoid false positive gcc-6 warning
    cgroup: ignore css_sets associated with dead cgroups during migration
    Documentation: cgroup v2: Trivial heading correction.
    cgroup: implement cgroup_subsys->implicit_on_dfl
    cgroup: use css_set->mg_dst_cgrp for the migration target cgroup
    cgroup: make cgroup[_taskset]_migrate() take cgroup_root instead of cgroup
    cgroup: move migration destination verification out of cgroup_migrate_prepare_dst()
    cgroup: fix incorrect destination cgroup in cgroup_update_dfl_csses()
    cgroup: Trivial correction to reflect controller.
    cgroup: remove stale item in cgroup-v1 document INDEX file.
    cgroup: update css iteration in cgroup_update_dfl_csses()
    cgroup: allocate 2x cgrp_cset_links when setting up a new root
    cgroup: make cgroup_calc_subtree_ss_mask() take @this_ss_mask
    cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends
    cgroup: use cgroup_apply_enable_control() in cgroup creation path
    cgroup: combine cgroup_mutex locking and offline css draining
    cgroup: factor out cgroup_{apply|finalize}_control() from cgroup_subtree_control_write()
    cgroup: introduce cgroup_{save|propagate|restore}_control()
    cgroup: make cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() recursive
    cgroup: factor out cgroup_apply_control_enable() from cgroup_subtree_control_write()
    ...

    Linus Torvalds
     

18 Mar, 2016

1 commit

  • Pull security layer updates from James Morris:
    "There are a bunch of fixes to the TPM, IMA, and Keys code, with minor
    fixes scattered across the subsystem.

    IMA now requires signed policy, and that policy is also now measured
    and appraised"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (67 commits)
    X.509: Make algo identifiers text instead of enum
    akcipher: Move the RSA DER encoding check to the crypto layer
    crypto: Add hash param to pkcs1pad
    sign-file: fix build with CMS support disabled
    MAINTAINERS: update tpmdd urls
    MODSIGN: linux/string.h should be #included to get memcpy()
    certs: Fix misaligned data in extra certificate list
    X.509: Handle midnight alternative notation in GeneralizedTime
    X.509: Support leap seconds
    Handle ISO 8601 leap seconds and encodings of midnight in mktime64()
    X.509: Fix leap year handling again
    PKCS#7: fix unitialized boolean 'want'
    firmware: change kernel read fail to dev_dbg()
    KEYS: Use the symbol value for list size, updated by scripts/insert-sys-cert
    KEYS: Reserve an extra certificate symbol for inserting without recompiling
    modsign: hide openssl output in silent builds
    tpm_tis: fix build warning with tpm_tis_resume
    ima: require signed IMA policy
    ima: measure and appraise the IMA policy itself
    ima: load policy using path
    ...

    Linus Torvalds
     

16 Mar, 2016

2 commits

  • Similar to how relative extables are implemented, it is possible to emit
    the kallsyms table in such a way that it contains offsets relative to
    some anchor point in the kernel image rather than absolute addresses.

    On 64-bit architectures, it cuts the size of the kallsyms address table
    in half, since offsets between kernel symbols can typically be expressed
    in 32 bits. This saves several hundreds of kilobytes of permanent
    .rodata on average. In addition, the kallsyms address table is no
    longer subject to dynamic relocation when CONFIG_RELOCATABLE is in
    effect, so the relocation work done after decompression now doesn't have
    to do relocation updates for all these values. This saves up to 24
    bytes (i.e., the size of an ELF64 RELA relocation table entry) per value,
    which easily adds up to a couple of megabytes of uncompressed __init
    data on ppc64 or arm64. Even if these relocation entries typically
    compress well, the combined size reduction of 2.8 MB uncompressed for a
    ppc64_defconfig build (of which 2.4 MB is __init data) results in a ~500
    KB space saving in the compressed image.

    Since it is useful for some architectures (like x86) to retain the
    ability to emit absolute values as well, this patch also adds support
    for capturing both absolute and relative values when
    KALLSYMS_ABSOLUTE_PERCPU is in effect, by emitting absolute per-cpu
    addresses as positive 32-bit values, and addresses relative to the
    lowest encountered relative symbol as negative values, which are
    subtracted from the runtime address of this base symbol to produce the
    actual address.

    Support for the above is enabled by default for all architectures except
    IA-64 and Tile-GX, whose symbols are too far apart to capture in this
    manner.
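
    The decoding side reduces to a few lines; a sketch (variable names
    mirror the kernel's, but the surrounding plumbing is elided):

    #include <stdint.h>

    extern const int32_t kallsyms_offsets[];
    extern const unsigned long kallsyms_relative_base;
    extern const int absolute_percpu;

    static unsigned long sym_address(int idx)
    {
            /* Without the absolute-percpu twist, every entry is an
             * unsigned offset from the relative base. */
            if (!absolute_percpu)
                    return kallsyms_relative_base + (uint32_t)kallsyms_offsets[idx];

            /* Positive entries are absolute (per-cpu) addresses... */
            if (kallsyms_offsets[idx] >= 0)
                    return kallsyms_offsets[idx];

            /* ...negative ones are relative, stored as -1 - offset. */
            return kallsyms_relative_base - 1 - kallsyms_offsets[idx];
    }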

    Signed-off-by: Ard Biesheuvel
    Tested-by: Guenter Roeck
    Reviewed-by: Kees Cook
    Tested-by: Kees Cook
    Cc: Heiko Carstens
    Cc: Michael Ellerman
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Benjamin Herrenschmidt
    Cc: Michal Marek
    Cc: Rusty Russell
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ard Biesheuvel
     
  • scripts/kallsyms.c has a special --absolute-percpu command line option
    which deals with the zero based per cpu offsets that are used when
    building for SMP on x86_64. This means that the option should only be
    passed in that case, so add a Kconfig symbol with the correct predicate,
    and use that instead.

    Signed-off-by: Ard Biesheuvel
    Tested-by: Guenter Roeck
    Reviewed-by: Kees Cook
    Tested-by: Kees Cook
    Acked-by: Rusty Russell
    Cc: Heiko Carstens
    Cc: Michael Ellerman
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Benjamin Herrenschmidt
    Cc: Michal Marek
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ard Biesheuvel
     

05 Mar, 2016

1 commit


04 Mar, 2016

1 commit

  • Move the RSA EMSA-PKCS1-v1_5 encoding from the asymmetric-key
    public_key subtype to the rsa crypto module's pkcs1pad template. This
    means that the public_key subtype no longer has any dependencies on
    the public key type.

    To make this work, the following changes have been made:

    (1) The rsa pkcs1pad template is now used for RSA keys. This strips off the
    padding and returns just the message hash.

    (2) In a previous patch, the pkcs1pad template gained an optional second
    parameter that, if given, specifies the hash used. We now give this,
    and pkcs1pad checks the encoded message E(M) for the EMSA-PKCS1-v1_5
    encoding and verifies that the correct digest OID is present.

    (3) The crypto driver in crypto/asymmetric_keys/rsa.c is now reduced
    to something that doesn't care about what the encryption actually
    does and has been merged into public_key.c.

    (4) CONFIG_PUBLIC_KEY_ALGO_RSA is gone. Module signing must set
    CONFIG_CRYPTO_RSA=y instead.

    Thoughts:

    (*) Should the encoding style (eg. raw, EMSA-PKCS1-v1_5) also be passed to
    the padding template? Should there be multiple padding templates
    registered that share most of the code?
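
    Concretely, a verifier now names the hash when instantiating the
    template (a minimal sketch of point (2) above):

    /* The hash is a parameter of the pkcs1pad template, so the padding
     * code itself can verify the digest OID embedded in E(M). */
    struct crypto_akcipher *tfm =
            crypto_alloc_akcipher("pkcs1pad(rsa,sha256)", 0, 0);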

    Signed-off-by: David Howells
    Signed-off-by: Tadeusz Struk
    Acked-by: Herbert Xu

    David Howells
     

21 Jan, 2016

2 commits

  • What CONFIG_INET and CONFIG_LEGACY_KMEM guard inside the memory
    controller code is insignificant; having these conditionals is not
    worth the complication and fragility that comes with them.

    [akpm@linux-foundation.org: rework mem_cgroup_css_free() statement ordering]
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Let the user know that CONFIG_MEMCG_KMEM does not apply to the cgroup2
    interface. This also makes legacy-only code sections stand out better.

    [arnd@arndb.de: mm: memcontrol: only manage socket pressure for CONFIG_INET]
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Acked-by: Vladimir Davydov
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

18 Jan, 2016

1 commit

  • Pull audit updates from Paul Moore:
    "Seven audit patches for 4.5, all very minor despite the diffstat.

    The diffstat churn for linux/audit.h can be attributed to needing to
    reshuffle the linux/audit.h header to fix the seccomp auditing issue
    (see the commit description for details).

    Besides the seccomp/audit fix, most of the fixes are around trying to
    improve the connection with the audit daemon and a Kconfig
    simplification. Nothing crazy, and everything passes our little
    audit-testsuite"

    * 'upstream' of git://git.infradead.org/users/pcmoore/audit:
    audit: always enable syscall auditing when supported and audit is enabled
    audit: force seccomp event logging to honor the audit_enabled flag
    audit: Delete unnecessary checks before two function calls
    audit: wake up threads if queue switched from limited to unlimited
    audit: include auditd's threads in audit_log_start() wait exception
    audit: remove audit_backlog_wait_overflow
    audit: don't needlessly reset valid wait time

    Linus Torvalds
     

17 Jan, 2016

1 commit

  • uselib hasn't been used since libc5; glibc does not use it. Deprecate
    uselib a bit more, by making the default y only if libc5 was widely
    used on the platform.

    This makes an arm64 kernel built with defconfig slightly smaller.

    bloat-o-meter:
    add/remove: 0/3 grow/shrink: 0/2 up/down: 0/-1390 (-1390)
    function             old    new  delta
    kernel_config_data 18164  18162     -2
    uselib_flags          20      -    -20
    padzero              216    192    -24
    sys_uselib           380      -   -380
    load_elf_library     964      -   -964

    Signed-off-by: Riku Voipio
    Reviewed-by: Josh Triplett
    Acked-by: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Riku Voipio
     

13 Jan, 2016

2 commits

  • To the best of our knowledge, everyone who enables audit at compile
    time also enables syscall auditing; this patch simplifies the Kconfig
    menus by removing the option to disable syscall auditing when audit
    is selected and the target arch supports it.

    Signed-off-by: Paul Moore

    Paul Moore
     
  • Pull cgroup updates from Tejun Heo:

    - cgroup v2 interface is now official. It's no longer hidden behind a
    devel flag and can be mounted using the new cgroup2 fs type.

    Unfortunately, cpu v2 interface hasn't made it yet due to the
    discussion around in-process hierarchical resource distribution and
    only memory and io controllers can be used on the v2 interface at the
    moment.

    - The existing documentation which has always been a bit of mess is
    relocated under Documentation/cgroup-v1/. Documentation/cgroup-v2.txt
    is added as the authoritative documentation for the v2 interface.

    - Some features are added through for-4.5-ancestor-test branch to
    enable netfilter xt_cgroup match to use cgroup v2 paths. The actual
    netfilter changes will be merged through the net tree which pulled in
    the said branch.

    - Various cleanups

    * 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: rename cgroup documentations
    cgroup: fix a typo.
    cgroup: Remove resource_counter.txt in Documentation/cgroup-legacy/00-INDEX.
    cgroup: demote subsystem init messages to KERN_DEBUG
    cgroup: Fix uninitialized variable warning
    cgroup: put controller Kconfig options in meaningful order
    cgroup: clean up the kernel configuration menu nomenclature
    cgroup_pids: fix a typo.
    Subject: cgroup: Fix incomplete dd command in blkio documentation
    cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends
    cpuset: Replace all instances of time_t with time64_t
    cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation
    cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/
    cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type

    Linus Torvalds
     

19 Dec, 2015

2 commits


13 Dec, 2015

1 commit

  • Currently the full stop_machine() routine is only enabled on SMP if
    module unloading is enabled, or if the CPUs are hotpluggable. This
    leads to configurations where stop_machine() is broken as it will then
    only run the callback on the local CPU with irqs disabled, and not stop
    the other CPUs or run the callback on them.

    For example, this breaks MTRR setup on x86 in certain configs since
    ea8596bb2d8d379 ("kprobes/x86: Remove unused text_poke_smp() and
    text_poke_smp_batch() functions") as the MTRR is only established on the
    boot CPU.

    This patch removes the Kconfig option for STOP_MACHINE and uses the SMP
    and HOTPLUG_CPU config options to compile the correct stop_machine() for
    the architecture, removing the false dependency on MODULE_UNLOAD in the
    process.
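
    For reference, the degenerate form that such configs were getting is
    essentially the UP stub (simplified sketch):

    /* Without the real stop_machine(), the "stopped machine" is just
     * the local CPU with interrupts off - other CPUs keep running. */
    static int stop_machine_local(int (*fn)(void *), void *data)
    {
            unsigned long flags;
            int ret;

            local_irq_save(flags);
            ret = fn(data);
            local_irq_restore(flags);
            return ret;
    }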

    Link: https://lkml.org/lkml/2014/10/8/124
    References: https://bugs.freedesktop.org/show_bug.cgi?id=84794
    Signed-off-by: Chris Wilson
    Acked-by: Ingo Molnar
    Cc: "Paul E. McKenney"
    Cc: Pranith Kumar
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: H. Peter Anvin
    Cc: Tejun Heo
    Cc: Iulia Manda
    Cc: Andy Lutomirski
    Cc: Rusty Russell
    Cc: Peter Zijlstra
    Cc: Chuck Ebbert
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wilson
     

12 Sep, 2015

1 commit

  • Here is an implementation of a new system call, sys_membarrier(), which
    executes a memory barrier on all threads running on the system. It is
    implemented by calling synchronize_sched(). It can be used to
    distribute the cost of user-space memory barriers asymmetrically by
    transforming pairs of memory barriers into pairs consisting of
    sys_membarrier() and a compiler barrier. For synchronization primitives
    that distinguish between read-side and write-side (e.g. userspace RCU
    [1], rwlocks), the read-side can be accelerated significantly by moving
    the bulk of the memory barrier overhead to the write-side.

    The existing applications of which I am aware that would be improved by
    this system call are as follows:

    * Through Userspace RCU library (http://urcu.so)
    - DNS server (Knot DNS) https://www.knot-dns.cz/
    - Network sniffer (http://netsniff-ng.org/)
    - Distributed object storage (https://sheepdog.github.io/sheepdog/)
    - User-space tracing (http://lttng.org)
    - Network storage system (https://www.gluster.org/)
    - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
    - Financial software (https://lkml.org/lkml/2015/3/23/189)

    Those projects use RCU in userspace to increase read-side speed and
    scalability compared to locking. Especially in the case of RCU used by
    libraries, sys_membarrier can speed up the read-side by moving the bulk of
    the memory barrier cost to synchronize_rcu().

    * Direct users of sys_membarrier
    - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)

    Microsoft core dotnet GC developers are planning to use the mprotect()
    side-effect of issuing memory barriers through IPIs as a way to implement
    Windows FlushProcessWriteBuffers() on Linux. They are referring to
    sys_membarrier in their github thread, specifically stating that
    sys_membarrier() is what they are looking for.

    To explain the benefit of this scheme, let's introduce two example threads:

    Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
    Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())

    In a scheme where all smp_mb() in thread A are ordering memory accesses
    with respect to smp_mb() present in Thread B, we can change each
    smp_mb() within Thread A into calls to sys_membarrier() and each
    smp_mb() within Thread B into compiler barriers "barrier()".

    Before the change, we had, for each smp_mb() pairs:

    Thread A                  Thread B
    previous mem accesses     previous mem accesses
    smp_mb()                  smp_mb()
    following mem accesses    following mem accesses

    After the change, these pairs become:

    Thread A                  Thread B
    prev mem accesses         prev mem accesses
    sys_membarrier()          barrier()
    follow mem accesses       follow mem accesses

    As we can see, there are two possible scenarios: either Thread B memory
    accesses do not happen concurrently with Thread A accesses (1), or they
    do (2).

    1) Non-concurrent Thread A vs Thread B accesses:

    Thread A                  Thread B
    prev mem accesses
    sys_membarrier()
    follow mem accesses
                              prev mem accesses
                              barrier()
                              follow mem accesses

    In this case, thread B accesses will be weakly ordered. This is OK,
    because at that point, thread A is not particularly interested in
    ordering them with respect to its own accesses.

    2) Concurrent Thread A vs Thread B accesses

    Thread A                  Thread B
    prev mem accesses         prev mem accesses
    sys_membarrier()          barrier()
    follow mem accesses       follow mem accesses

    In this case, thread B accesses, which are ensured to be in program
    order thanks to the compiler barrier, will be "upgraded" to full
    smp_mb() by synchronize_sched().

    * Benchmarks

    On Intel Xeon E5405 (8 cores)
    (one thread is calling sys_membarrier, the other 7 threads are busy
    looping)

    1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.

    * User-space user of this system call: Userspace RCU library

    Both the signal-based and the sys_membarrier userspace RCU schemes
    permit us to remove the memory barrier from the userspace RCU
    rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
    accelerating them. These memory barriers are replaced by compiler
    barriers on the read-side, and all matching memory barriers on the
    write-side are turned into an invocation of a memory barrier on all
    active threads in the process. By letting the kernel perform this
    synchronization rather than dumbly sending a signal to every thread
    of the process (as we currently do), we diminish the number of
    unnecessary wake-ups and only issue the memory barriers on active
    threads. Non-running threads do not need to execute such barriers
    anyway, because they are implied by the scheduler context switches.

    Results in liburcu:

    Operations in 10s, 6 readers, 2 writers:

    memory barriers in reader: 1701557485 reads, 2202847 writes
    signal-based scheme: 9830061167 reads, 6700 writes
    sys_membarrier: 9952759104 reads, 425 writes
    sys_membarrier (dyn. check): 7970328887 reads, 425 writes

    The dynamic sys_membarrier availability check adds some overhead to
    the read-side compared to the signal-based scheme, but besides that,
    sys_membarrier slightly outperforms the signal-based scheme. However,
    this non-expedited sys_membarrier implementation has a much slower grace
    period than signal and memory barrier schemes.

    Besides diminishing the number of wake-ups, one major advantage of the
    membarrier system call over the signal-based scheme is that it does not
    need to reserve a signal. This plays much more nicely with libraries,
    and with processes injected into for tracing purposes, for which we
    cannot expect that signals will be unused by the application.

    An expedited version of this system call can be added later on to speed
    up the grace period. Its implementation will likely depend on reading
    the cpu_curr()->mm without holding each CPU's rq lock.

    This patch adds the system call to x86 and to asm-generic.

    [1] http://urcu.so

    membarrier(2) man page:

    MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2)

    NAME
    membarrier - issue memory barriers on a set of threads

    SYNOPSIS
    #include <linux/membarrier.h>

    int membarrier(int cmd, int flags);

    DESCRIPTION
    The cmd argument is one of the following:

    MEMBARRIER_CMD_QUERY
    Query the set of supported commands. It returns a bitmask of
    supported commands.

    MEMBARRIER_CMD_SHARED
    Execute a memory barrier on all threads running on the system.
    Upon return from system call, the caller thread is ensured that
    all running threads have passed through a state where all memory
    accesses to user-space addresses match program order between
    entry to and return from the system call (non-running threads
    are de facto in such a state). This covers threads from all
    processes running on the system. This command returns 0.

    The flags argument needs to be 0; it is reserved for future extensions.

    All memory accesses performed in program order from each targeted
    thread are guaranteed to be ordered with respect to sys_membarrier(). If
    we use the semantic "barrier()" to represent a compiler barrier forcing
    memory accesses to be performed in program order across the barrier,
    and smp_mb() to represent explicit memory barriers forcing full memory
    ordering across the barrier, we have the following ordering table for
    each pair of barrier(), sys_membarrier() and smp_mb():

    The pair ordering is detailed as (O: ordered, X: not ordered):

                           barrier()   smp_mb()   sys_membarrier()
    barrier()                  X          X              O
    smp_mb()                   X          O              O
    sys_membarrier()           O          O              O

    RETURN VALUE
    On success, these system calls return zero. On error, -1 is returned,
    and errno is set appropriately. For a given command, with flags
    argument set to 0, this system call is guaranteed to always return the
    same value until reboot.

    ERRORS
    ENOSYS System call is not implemented.

    EINVAL Invalid arguments.

    Linux 2015-04-15 MEMBARRIER(2)
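
    A minimal userspace caller matching the man page above (assumes a
    kernel and headers with the syscall wired up):

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/membarrier.h>

    static int membarrier(int cmd, int flags)
    {
            return syscall(__NR_membarrier, cmd, flags);
    }

    int main(void)
    {
            int mask = membarrier(MEMBARRIER_CMD_QUERY, 0);

            if (mask < 0 || !(mask & MEMBARRIER_CMD_SHARED))
                    return 1;       /* fall back to plain memory barriers */

            /* System-wide barrier: pairs with barrier() on the fast path. */
            membarrier(MEMBARRIER_CMD_SHARED, 0);
            return 0;
    }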

    Signed-off-by: Mathieu Desnoyers
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Josh Triplett
    Cc: KOSAKI Motohiro
    Cc: Steven Rostedt
    Cc: Nicholas Miell
    Cc: Ingo Molnar
    Cc: Alan Cox
    Cc: Lai Jiangshan
    Cc: Stephen Hemminger
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: David Howells
    Cc: Pranith Kumar
    Cc: Michael Kerrisk
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     

09 Sep, 2015

1 commit

  • Pull security subsystem updates from James Morris:
    "Highlights:

    - PKCS#7 support added to support signed kexec, also utilized for
    module signing. See comments in 3f1e1bea.

    ** NOTE: this requires linking against the OpenSSL library, which
    must be installed, e.g. the openssl-devel on Fedora **

    - Smack
    - add IPv6 host labeling; ignore labels on kernel threads
    - support smack labeling mounts which use binary mount data

    - SELinux:
    - add ioctl whitelisting (see
    http://kernsec.org/files/lss2015/vanderstoep.pdf)
    - fix mprotect PROT_EXEC regression caused by mm change

    - Seccomp:
    - add ptrace options for suspend/resume"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (57 commits)
    PKCS#7: Add OIDs for sha224, sha284 and sha512 hash algos and use them
    Documentation/Changes: Now need OpenSSL devel packages for module signing
    scripts: add extract-cert and sign-file to .gitignore
    modsign: Handle signing key in source tree
    modsign: Use if_changed rule for extracting cert from module signing key
    Move certificate handling to its own directory
    sign-file: Fix warning about BIO_reset() return value
    PKCS#7: Add MODULE_LICENSE() to test module
    Smack - Fix build error with bringup unconfigured
    sign-file: Document dependency on OpenSSL devel libraries
    PKCS#7: Appropriately restrict authenticated attributes and content type
    KEYS: Add a name for PKEY_ID_PKCS7
    PKCS#7: Improve and export the X.509 ASN.1 time object decoder
    modsign: Use extract-cert to process CONFIG_SYSTEM_TRUSTED_KEYS
    extract-cert: Cope with multiple X.509 certificates in a single file
    sign-file: Generate CMS message as signature instead of PKCS#7
    PKCS#7: Support CMS messages also [RFC5652]
    X.509: Change recorded SKID & AKID to not include Subject or Issuer
    PKCS#7: Check content type and versions
    MAINTAINERS: The keyrings mailing list has moved
    ...

    Linus Torvalds
     

06 Sep, 2015

1 commit

  • Pull vfs updates from Al Viro:
    "In this one:

    - d_move fixes (Eric Biederman)

    - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error
    handling ought to be fixed)

    - switch of sb_writers to percpu rwsem (Oleg Nesterov)

    - superblock scalability (Josef Bacik and Dave Chinner)

    - swapon(2) race fix (Hugh Dickins)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits)
    vfs: Test for and handle paths that are unreachable from their mnt_root
    dcache: Reduce the scope of i_lock in d_splice_alias
    dcache: Handle escaped paths in prepend_path
    mm: fix potential data race in SyS_swapon
    inode: don't softlockup when evicting inodes
    inode: rename i_wb_list to i_io_list
    sync: serialise per-superblock sync operations
    inode: convert inode_sb_list_lock to per-sb
    inode: add hlist_fake to avoid the inode hash lock in evict
    writeback: plug writeback at a high level
    change sb_writers to use percpu_rw_semaphore
    shift percpu_counter_destroy() into destroy_super_work()
    percpu-rwsem: kill CONFIG_PERCPU_RWSEM
    percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
    percpu-rwsem: introduce percpu_down_read_trylock()
    document rwsem_release() in sb_wait_write()
    fix the broken lockdep logic in __sb_start_write()
    introduce __sb_writers_{acquired,release}() helpers
    ufs_inode_get{frag,block}(): get rid of 'phys' argument
    ufs_getfrag_block(): tidy up a bit
    ...

    Linus Torvalds