09 Aug, 2016

1 commit

  • Pull usercopy protection from Kees Cook:
    "Tbhis implements HARDENED_USERCOPY verification of copy_to_user and
    copy_from_user bounds checking for most architectures on SLAB and
    SLUB"

    * tag 'usercopy-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    mm: SLUB hardened usercopy support
    mm: SLAB hardened usercopy support
    s390/uaccess: Enable hardened usercopy
    sparc/uaccess: Enable hardened usercopy
    powerpc/uaccess: Enable hardened usercopy
    ia64/uaccess: Enable hardened usercopy
    arm64/uaccess: Enable hardened usercopy
    ARM: uaccess: Enable hardened usercopy
    x86/uaccess: Enable hardened usercopy
    mm: Hardened usercopy
    mm: Implement stack frame object validation
    mm: Add is_migrate_cma_page
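
    For a rough feel of the mechanism (a simplified sketch, not the exact
    kernel code), the copy_*_user() paths gain a check_object_size() call
    that validates the kernel-side buffer before any data moves:

    static inline unsigned long
    checked_copy_from_user(void *to, const void __user *from, unsigned long n)
    {
            /* BUG()s if [to, to+n) spans outside one slab object,
             * stack frame or other sane region. */
            check_object_size(to, n, false);    /* false: not copying to user */
            return copy_from_user(to, from, n);
    }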

    Linus Torvalds
     

03 Aug, 2016

4 commits

  • CONFIG_TRIM_UNUSED_KSYMS doesn't trim just symbols that are totally
    unused in-tree - it trims the symbols unused by any in-tree modules
    actually built. If you've done a 'make localmodconfig' and only build
    a hundred or so modules, it's pretty likely that your out-of-tree
    module will come up lacking something...

    Hopefully this will save the next guy from a Homer Simpson "D'oh!"
    moment.

    Link: http://lkml.kernel.org/r/10177.1469787292@turing-police.cc.vt.edu
    Signed-off-by: Valdis Kletnieks
    Cc: Michal Marek
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Valdis Kletnieks
     
  • Doing patches with an allmodconfig kernel and committing stuff into a
    local tree has an unfortunate consequence: the kernel version changes
    (as it should), leading to recompiling and relinking of several files
    even if they weren't touched (or interesting at all). This, and
    "git-whatever" figuring out the current version, slows down
    compilation for no good reason.

    But let's face it, "allmodconfig" kernels don't care about the kernel
    version; they are simply compile-check guinea pigs.

    Make LOCALVERSION_AUTO depend on !COMPILE_TEST, so it doesn't sneak into
    allmodconfig .config.

    Link: http://lkml.kernel.org/r/20160707214954.GC31678@p183.telecom.by
    Signed-off-by: Alexey Dobriyan
    Cc: Michal Marek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • UML is a bit special since it does not have iomem nor dma. That means a
    lot of drivers will not build if they miss a dependency on HAS_IOMEM.
    s390 used to have the same issues but since it gained PCI support UML is
    the only stranger.

    We are tired of patching dozens of new drivers after every merge
    window just to un-break allmod/yesconfig UML builds. One could argue
    that a decent driver has to know what it depends on and that
    therefore a missing HAS_IOMEM dependency is a clear driver bug. But
    the dependency is not obvious, and not everyone does UML builds with
    COMPILE_TEST enabled when developing a device driver.

    A possible solution to make these builds succeed on UML would be
    providing stub functions for ioremap() and friends which fail at
    runtime. Another one is simply disabling COMPILE_TEST for UML. Since
    it is the least hassle and does not force us to fake iomem support,
    let's do the latter.
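
    For illustration, the rejected stub approach would have looked
    roughly like this (hypothetical sketch, never merged):

    /* Fake iomem helpers so drivers build on UML but fail at runtime. */
    static inline void __iomem *ioremap(phys_addr_t offset, size_t size)
    {
            return NULL;            /* UML has no iomem to map */
    }

    static inline void iounmap(void __iomem *addr)
    {
    }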

    Link: http://lkml.kernel.org/r/1466152995-28367-1-git-send-email-richard@nod.at
    Signed-off-by: Richard Weinberger
    Acked-by: Arnd Bergmann
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
     
  • The cgroup documentation path has changed to "cgroup-v1"; update the
    reference accordingly.

    Link: http://lkml.kernel.org/r/1470148443-6509-1-git-send-email-iamyooon@gmail.com
    Signed-off-by: seokhoon.yoon
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    seokhoon.yoon
     

29 Jul, 2016

1 commit

  • Pull trivial tree updates from Jiri Kosina.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    fat: fix error message for bogus number of directory entries
    fat: fix typo s/supeblock/superblock/
    ASoC: max9877: Remove unused function declaration
    dw2102: don't output spurious blank lines to the kernel log
    init: fix Kconfig text
    ARM: io: fix comment grammar
    ocfs: fix ocfs2_xattr_user_get() argument name
    scsi/qla2xxx: Remove erroneous unused macro qla82xx_get_temp_val1()

    Linus Torvalds
     

27 Jul, 2016

3 commits

  • Implements freelist randomization for the SLUB allocator. It was
    previously implemented for the SLAB allocator. Both use the same
    configuration option (CONFIG_SLAB_FREELIST_RANDOM).

    The list is randomized during initialization of a new set of pages. The
    order on different freelist sizes is pre-computed at boot for
    performance. Each kmem_cache has its own randomized freelist.

    This security feature reduces the predictability of the kernel SLUB
    allocator against heap overflows, rendering attacks much less stable.

    For example these attacks exploit the predictability of the heap:
    - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
    - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)
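
    The shuffling step itself is a plain Fisher-Yates pass over the
    object indexes of a new page; a minimal, self-contained sketch in
    userspace C (rand() stands in for the kernel's entropy sources):

    #include <stdio.h>
    #include <stdlib.h>

    /* Randomize the order in which objects 0..count-1 are handed out,
     * so consecutive allocations land at unpredictable addresses. */
    static void randomize_freelist(unsigned int *list, unsigned int count)
    {
            for (unsigned int i = count - 1; i > 0; i--) {
                    unsigned int j = rand() % (i + 1);
                    unsigned int tmp = list[i];

                    list[i] = list[j];
                    list[j] = tmp;
            }
    }

    int main(void)
    {
            unsigned int list[16];

            for (unsigned int i = 0; i < 16; i++)
                    list[i] = i;
            randomize_freelist(list, 16);
            for (unsigned int i = 0; i < 16; i++)
                    printf("%u ", list[i]);
            printf("\n");
            return 0;
    }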

    Performance results:

    slab_test impact is between 3% and 4% on average for 100000 attempts
    without SMP. It is a very focused test; kernbench shows that the
    overall impact on the system is much lower.

    Before:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
    100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
    100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
    100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
    100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
    100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
    100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
    100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
    100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
    100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
    2. Kmalloc: alloc/free test
    100000 times kmalloc(8)/kfree -> 70 cycles
    100000 times kmalloc(16)/kfree -> 70 cycles
    100000 times kmalloc(32)/kfree -> 70 cycles
    100000 times kmalloc(64)/kfree -> 70 cycles
    100000 times kmalloc(128)/kfree -> 70 cycles
    100000 times kmalloc(256)/kfree -> 69 cycles
    100000 times kmalloc(512)/kfree -> 70 cycles
    100000 times kmalloc(1024)/kfree -> 73 cycles
    100000 times kmalloc(2048)/kfree -> 72 cycles
    100000 times kmalloc(4096)/kfree -> 71 cycles

    After:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
    100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
    100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
    100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
    100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
    100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
    100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
    100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
    100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
    100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
    2. Kmalloc: alloc/free test
    100000 times kmalloc(8)/kfree -> 66 cycles
    100000 times kmalloc(16)/kfree -> 66 cycles
    100000 times kmalloc(32)/kfree -> 66 cycles
    100000 times kmalloc(64)/kfree -> 66 cycles
    100000 times kmalloc(128)/kfree -> 65 cycles
    100000 times kmalloc(256)/kfree -> 67 cycles
    100000 times kmalloc(512)/kfree -> 67 cycles
    100000 times kmalloc(1024)/kfree -> 64 cycles
    100000 times kmalloc(2048)/kfree -> 67 cycles
    100000 times kmalloc(4096)/kfree -> 67 cycles

    Kernbench, before:

    Average Optimal load -j 12 Run (std deviation):
    Elapsed Time 101.873 (1.16069)
    User Time 1045.22 (1.60447)
    System Time 88.969 (0.559195)
    Percent CPU 1112.9 (13.8279)
    Context Switches 189140 (2282.15)
    Sleeps 99008.6 (768.091)

    After:

    Average Optimal load -j 12 Run (std deviation):
    Elapsed Time 102.47 (0.562732)
    User Time 1045.3 (1.34263)
    System Time 88.311 (0.342554)
    Percent CPU 1105.8 (6.49444)
    Context Switches 189081 (2355.78)
    Sleeps 99231.5 (800.358)

    Link: http://lkml.kernel.org/r/1464295031-26375-3-git-send-email-thgarnie@google.com
    Signed-off-by: Thomas Garnier
    Reviewed-by: Kees Cook
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Garnier
     
  • Under CONFIG_HARDENED_USERCOPY, this adds object size checking to the
    SLUB allocator to catch any copies that may span objects. Includes a
    redzone handling fix discovered by Michael Ellerman.

    Based on code from PaX and grsecurity.
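
    The heart of the check is small; a simplified sketch (parameter names
    are illustrative, not the exact mm/slub.c interface), including the
    left-redzone adjustment that Ellerman's report led to:

    /* Return NULL if [ptr, ptr+n) fits in one object, else the name of
     * the offending cache. obj_offset is ptr's offset within its object. */
    static const char *check_span(unsigned long obj_offset, unsigned long n,
                                  unsigned long object_size,
                                  unsigned long red_left_pad,
                                  const char *cache_name)
    {
            /* A pointer into the left redzone is never a valid object. */
            if (obj_offset < red_left_pad)
                    return cache_name;
            obj_offset -= red_left_pad;

            /* Allow only copies falling entirely within the object. */
            if (obj_offset <= object_size && n <= object_size - obj_offset)
                    return NULL;

            return cache_name;
    }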

    Signed-off-by: Kees Cook
    Tested-by: Michael Ellerman
    Reviewed-by: Laura Abbott

    Kees Cook
     
  • Under CONFIG_HARDENED_USERCOPY, this adds object size checking to the
    SLAB allocator to catch any copies that may span objects.

    Based on code from PaX and grsecurity.

    Signed-off-by: Kees Cook
    Tested-by: Valdis Kletnieks

    Kees Cook
     

26 Jul, 2016

2 commits

  • Pull NOHZ updates from Ingo Molnar:

    - fix system/idle cputime leaked on cputime accounting (all nohz
    configs) (Rik van Riel)

    - remove the messy, ad-hoc irqtime account on nohz-full and make it
    compatible with CONFIG_IRQ_TIME_ACCOUNTING=y instead (Rik van Riel)

    - cleanups (Frederic Weisbecker)

    - remove unnecessary irq disablement in the irqtime code (Rik van Riel)

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/cputime: Drop local_irq_save/restore from irqtime_account_irq()
    sched/cputime: Reorganize vtime native irqtime accounting headers
    sched/cputime: Clean up the old vtime gen irqtime accounting completely
    sched/cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code
    sched/cputime: Count actually elapsed irq & softirq time

    Linus Torvalds
     
  • Pull RCU updates from Ingo Molnar:
    "The main changes in this cycle were:

    - documentation updates

    - miscellaneous fixes

    - minor reorganization of code

    - torture-test updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
    rcu: Correctly handle sparse possible cpus
    rcu: sysctl: Panic on RCU Stall
    rcu: Fix a typo in a comment
    rcu: Make call_rcu_tasks() tolerate first call with irqs disabled
    rcu: Disable TASKS_RCU for usermode Linux
    rcu: No ordering for rcu_assign_pointer() of NULL
    rcutorture: Fix error return code in rcu_perf_init()
    torture: Inflict default jitter
    rcuperf: Don't treat gp_exp mis-setting as a WARN
    rcutorture: Drop "-soundhw pcspkr" from x86 boot arguments
    rcutorture: Don't specify the cpu type of QEMU on PPC
    rcutorture: Make -soundhw a x86 specific option
    rcutorture: Use vmlinux as the fallback kernel image
    rcutorture/doc: Create initrd using dracut
    torture: Stop onoff task if there is only one cpu
    torture: Add starvation events to error summary
    torture: Break online and offline functions out of torture_onoff()
    torture: Forgive lengthy trace dumps and preemption
    torture: Remove CONFIG_RCU_TORTURE_TEST_RUNNABLE, simplify code
    torture: Simplify code, eliminate RCU_PERF_TEST_RUNNABLE
    ...

    Linus Torvalds
     

14 Jul, 2016

1 commit

  • The CONFIG_VIRT_CPU_ACCOUNTING_GEN irq time tracking code does not
    appear to currently work right.

    On CPUs without nohz_full=, only tick based irq time sampling is
    done, which breaks down when dealing with a nohz_idle CPU.

    On firewalls and similar systems, no ticks may happen on a CPU for a
    while, and the irq time spent may never get accounted properly. This
    can cause issues with capacity planning and power saving, which use
    the CPU statistics as inputs in decision making.

    Remove the VTIME_GEN vtime irq time code, and replace it with the
    IRQ_TIME_ACCOUNTING code, when selected as a config option by the user.
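
    The replacement approach can be sketched as follows (names are
    illustrative, not the exact kernel symbols): sample a fine-grained
    clock on IRQ entry/exit and accumulate the delta, instead of hoping
    that a timer tick lands inside the handler.

    static u64 irq_entry_time;
    static u64 total_irq_time;

    void on_irq_enter(void)
    {
            irq_entry_time = sched_clock();
    }

    void on_irq_exit(void)
    {
            /* Charge exactly the time spent in the handler, whether or
             * not the tick is running on this CPU. */
            total_irq_time += sched_clock() - irq_entry_time;
    }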

    Signed-off-by: Rik van Riel
    Signed-off-by: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Radim Krcmar
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1468421405-20056-3-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

07 Jul, 2016

1 commit

  • The "expert" menu was broken (split) such that all entries in it after
    KALLSYMS were displayed in the "General setup" area instead of in the
    "Expert users" area. Fix this by adding one kconfig dependency.

    Yes, the Expert users menu is fragile. Problems like this have happened
    several times in the past. I will attempt to isolate the Expert users
    menu if there is interest in that.

    Fixes: 4d5d5664c900 ("x86: kallsyms: disable absolute percpu symbols on !SMP")
    Signed-off-by: Randy Dunlap
    Cc: Ard Biesheuvel
    Cc: stable@vger.kernel.org # 4.6
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

21 Jun, 2016

1 commit


16 Jun, 2016

1 commit

  • Usermode Linux currently does not implement arch_irqs_disabled_flags(),
    which results in a build failure in TASKS_RCU. Therefore, this commit
    disables the TASKS_RCU Kconfig option in usermode Linux builds. The
    usermode Linux maintainers expect to merge arch_irqs_disabled_flags()
    into 4.8, at which point this commit may be reverted.

    Signed-off-by: Paul E. McKenney
    Cc: Jeff Dike
    Acked-by: Richard Weinberger

    Paul E. McKenney
     

27 May, 2016

1 commit

  • Pull kbuild updates from Michal Marek:

    - new option CONFIG_TRIM_UNUSED_KSYMS which does a two-pass build and
    unexports symbols which are not used in the current config [Nicolas
    Pitre]

    - several kbuild rule cleanups [Masahiro Yamada]

    - warning option adjustments for gcov etc [Arnd Bergmann]

    - a few more small fixes

    * 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: (31 commits)
    kbuild: move -Wunused-const-variable to W=1 warning level
    kbuild: fix if_change and friends to consider argument order
    kbuild: fix adjust_autoksyms.sh for modules that need only one symbol
    kbuild: fix ksym_dep_filter when multiple EXPORT_SYMBOL() on the same line
    gcov: disable -Wmaybe-uninitialized warning
    gcov: disable tree-loop-im to reduce stack usage
    gcov: disable for COMPILE_TEST
    Kbuild: disable 'maybe-uninitialized' warning for CONFIG_PROFILE_ALL_BRANCHES
    Kbuild: change CC_OPTIMIZE_FOR_SIZE definition
    kbuild: forbid kernel directory to contain spaces and colons
    kbuild: adjust ksym_dep_filter for some cmd_* renames
    kbuild: Fix dependencies for final vmlinux link
    kbuild: better abstract vmlinux sequential prerequisites
    kbuild: fix call to adjust_autoksyms.sh when output directory specified
    kbuild: Get rid of KBUILD_STR
    kbuild: rename cmd_as_s_S to cmd_cpp_s_S
    kbuild: rename cmd_cc_i_c to cmd_cpp_i_c
    kbuild: drop redundant "PHONY += FORCE"
    kbuild: delete unnecessary "@:"
    kbuild: mark help target as PHONY
    ...

    Linus Torvalds
     

21 May, 2016

2 commits

  • Testing has shown that the backtrace sometimes does not fit into the 4kB
    temporary buffer that is used in NMI context. The warnings are gone
    when I double the temporary buffer size.

    This patch doubles the buffer size and makes it configurable.

    Note that this problem existed even in the x86-specific implementation
    that was added by the commit a9edc8809328 ("x86/nmi: Perform a safe NMI
    stack trace on all CPUs"). Nobody noticed it because it did not print
    any warnings.

    Signed-off-by: Petr Mladek
    Cc: Jan Kara
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Russell King
    Cc: Daniel Thompson
    Cc: Jiri Kosina
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: David Miller
    Cc: Daniel Thompson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • printk() takes some locks and cannot be used in a safe way in NMI
    context.

    The chance of a deadlock is real especially when printing stacks from
    all CPUs. This particular problem has been addressed on x86 by the
    commit a9edc8809328 ("x86/nmi: Perform a safe NMI stack trace on all
    CPUs").

    The patchset brings two big advantages. First, it makes the NMI
    backtraces safe on all architectures for free. Second, it makes all
    NMI messages almost safe on all architectures (the temporary buffer
    is limited, so we should still keep the number of messages in NMI
    context to a minimum).

    Note that there already are several messages printed in NMI context:
    WARN_ON(in_nmi()), BUG_ON(in_nmi()), anything being printed out from MCE
    handlers. These are not easy to avoid.

    This patch reuses most of the code and makes it generic. It is useful
    for all messages and architectures that support NMI.

    The alternative printk_func is set when entering NMI context and is
    reset when leaving it. It queues IRQ work to copy the messages into
    the main ring buffer in a safe context.

    __printk_nmi_flush() copies all available messages and resets the
    buffer. Simple cmpxchg operations are then enough to synchronize with
    writers. A spinlock is also used to synchronize with other flushers.

    We no longer use seq_buf because it depends on an external lock. It
    would be hard to make all supported operations safe for lockless use,
    and it would be confusing and error prone to make only some
    operations safe.

    The code is put into separate printk/nmi.c as suggested by Steven
    Rostedt. It needs a per-CPU buffer and is compiled only on
    architectures that call nmi_enter(). This is achieved by the new
    HAVE_NMI Kconfig flag.

    The exceptions are the MN10300 and Xtensa architectures; we need to
    clean up NMI handling there first. Let's do it separately.

    The patch is heavily based on the draft from Peter Zijlstra, see

    https://lkml.org/lkml/2015/6/10/327
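
    The lockless append can be sketched in portable C11 (illustrative
    only; the kernel variant is per-CPU and is flushed from IRQ work):

    #include <stdatomic.h>
    #include <string.h>

    #define NMI_BUF_SIZE 4096

    static char nmi_buf[NMI_BUF_SIZE];
    static atomic_uint nmi_len;

    /* Reserve space with a cmpxchg loop, then copy. The NMI writer
     * never takes a lock, so this is safe even if the NMI interrupted
     * printk() itself. */
    static int nmi_vprintk(const char *msg, unsigned int len)
    {
            unsigned int old, new;

            do {
                    old = atomic_load(&nmi_len);
                    new = old + len;
                    if (new > NMI_BUF_SIZE)
                            return 0;       /* buffer full, message dropped */
            } while (!atomic_compare_exchange_weak(&nmi_len, &old, new));

            memcpy(nmi_buf + old, msg, len);
            return len;
    }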

    [arnd@arndb.de: printk-nmi: use %zu format string for size_t]
    [akpm@linux-foundation.org: min_t->min - all types are size_t here]
    Signed-off-by: Petr Mladek
    Suggested-by: Peter Zijlstra
    Suggested-by: Steven Rostedt
    Cc: Jan Kara
    Acked-by: Russell King [arm part]
    Cc: Daniel Thompson
    Cc: Jiri Kosina
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: David Miller
    Cc: Daniel Thompson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     

20 May, 2016

1 commit

  • Provides an optional config (CONFIG_SLAB_FREELIST_RANDOM) to randomize
    the SLAB freelist. The list is randomized during initialization of a
    new set of pages. The order on different freelist sizes is pre-computed
    at boot for performance. Each kmem_cache has its own randomized
    freelist. Before pre-computed lists are available, freelists are
    generated dynamically. This security feature reduces the
    predictability of the kernel SLAB allocator against heap overflows,
    rendering attacks much less stable.

    For example this attack against SLUB (also applicable against SLAB)
    would be affected:

    https://jon.oberheide.org/blog/2010/09/10/linux-kernel-can-slub-overflow/

    Also, since v4.6 the freelist has been moved to the end of the SLAB.
    This means a controllable heap is opened to new attacks not yet
    publicly discussed. A kernel heap overflow can be transformed into
    multiple use-after-frees. This feature makes this type of attack
    harder too.

    To generate entropy, we use get_random_bytes_arch because 0 bits of
    entropy are available at that boot stage. In the worst case this
    function will fall back to the get_random_bytes sub API. We also
    generate a random shift number to shift the pre-computed freelist for
    each new set of pages.
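
    The shift step amounts to rotating the boot-time list (illustrative
    sketch; function and parameter names are made up):

    /* Rotate the pre-computed order by a per-page random shift, so
     * pages sharing one pre-computed list still get different layouts. */
    static void apply_random_shift(unsigned int *dst, const unsigned int *pre,
                                   unsigned int count, unsigned int shift)
    {
            for (unsigned int i = 0; i < count; i++)
                    dst[i] = pre[(i + shift) % count];
    }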

    The config option name is not specific to the SLAB as this approach will
    be extended to other allocators like SLUB.

    Performance results highlighted no major changes:

    Hackbench (running 90 10 times):

    Before average: 0.0698
    After average: 0.0663 (-5.01%)

    slab_test: 1 run on boot. A difference is only seen on the 2048-byte
    size test, which is the worst-case scenario covered by freelist
    randomization. New slab pages are constantly being created on the
    10000 allocations. Variance should be mainly due to getting new pages
    every few allocations.

    Before:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    10000 times kmalloc(8) -> 99 cycles kfree -> 112 cycles
    10000 times kmalloc(16) -> 109 cycles kfree -> 140 cycles
    10000 times kmalloc(32) -> 129 cycles kfree -> 137 cycles
    10000 times kmalloc(64) -> 141 cycles kfree -> 141 cycles
    10000 times kmalloc(128) -> 152 cycles kfree -> 148 cycles
    10000 times kmalloc(256) -> 195 cycles kfree -> 167 cycles
    10000 times kmalloc(512) -> 257 cycles kfree -> 199 cycles
    10000 times kmalloc(1024) -> 393 cycles kfree -> 251 cycles
    10000 times kmalloc(2048) -> 649 cycles kfree -> 228 cycles
    10000 times kmalloc(4096) -> 806 cycles kfree -> 370 cycles
    10000 times kmalloc(8192) -> 814 cycles kfree -> 411 cycles
    10000 times kmalloc(16384) -> 892 cycles kfree -> 455 cycles
    2. Kmalloc: alloc/free test
    10000 times kmalloc(8)/kfree -> 121 cycles
    10000 times kmalloc(16)/kfree -> 121 cycles
    10000 times kmalloc(32)/kfree -> 121 cycles
    10000 times kmalloc(64)/kfree -> 121 cycles
    10000 times kmalloc(128)/kfree -> 121 cycles
    10000 times kmalloc(256)/kfree -> 119 cycles
    10000 times kmalloc(512)/kfree -> 119 cycles
    10000 times kmalloc(1024)/kfree -> 119 cycles
    10000 times kmalloc(2048)/kfree -> 119 cycles
    10000 times kmalloc(4096)/kfree -> 121 cycles
    10000 times kmalloc(8192)/kfree -> 119 cycles
    10000 times kmalloc(16384)/kfree -> 119 cycles

    After:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    10000 times kmalloc(8) -> 130 cycles kfree -> 86 cycles
    10000 times kmalloc(16) -> 118 cycles kfree -> 86 cycles
    10000 times kmalloc(32) -> 121 cycles kfree -> 85 cycles
    10000 times kmalloc(64) -> 176 cycles kfree -> 102 cycles
    10000 times kmalloc(128) -> 178 cycles kfree -> 100 cycles
    10000 times kmalloc(256) -> 205 cycles kfree -> 109 cycles
    10000 times kmalloc(512) -> 262 cycles kfree -> 136 cycles
    10000 times kmalloc(1024) -> 342 cycles kfree -> 157 cycles
    10000 times kmalloc(2048) -> 701 cycles kfree -> 238 cycles
    10000 times kmalloc(4096) -> 803 cycles kfree -> 364 cycles
    10000 times kmalloc(8192) -> 835 cycles kfree -> 404 cycles
    10000 times kmalloc(16384) -> 896 cycles kfree -> 441 cycles
    2. Kmalloc: alloc/free test
    10000 times kmalloc(8)/kfree -> 121 cycles
    10000 times kmalloc(16)/kfree -> 121 cycles
    10000 times kmalloc(32)/kfree -> 123 cycles
    10000 times kmalloc(64)/kfree -> 142 cycles
    10000 times kmalloc(128)/kfree -> 121 cycles
    10000 times kmalloc(256)/kfree -> 119 cycles
    10000 times kmalloc(512)/kfree -> 119 cycles
    10000 times kmalloc(1024)/kfree -> 119 cycles
    10000 times kmalloc(2048)/kfree -> 119 cycles
    10000 times kmalloc(4096)/kfree -> 119 cycles
    10000 times kmalloc(8192)/kfree -> 119 cycles
    10000 times kmalloc(16384)/kfree -> 119 cycles

    [akpm@linux-foundation.org: propagate gfp_t into cache_random_seq_create()]
    Signed-off-by: Thomas Garnier
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Kees Cook
    Cc: Greg Thelen
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Garnier
     

10 May, 2016

1 commit

  • CC_OPTIMIZE_FOR_SIZE disables the often useful -Wmaybe-uninitialized
    warning, because that causes a ridiculous amount of false positives
    when combined with -Os.

    This means a lot of warnings don't show up for developers testing
    with an 'allmodconfig' kernel, which has CC_OPTIMIZE_FOR_SIZE
    enabled, but only later in randconfig builds that don't enable it.

    This changes the Kconfig logic around CC_OPTIMIZE_FOR_SIZE to make it
    a 'choice' statement defaulting to CC_OPTIMIZE_FOR_PERFORMANCE, which
    gets added for this purpose. The allmodconfig and allyesconfig
    kernels now default to -O2 with the maybe-uninitialized warning
    enabled.
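
    A typical false positive looks like the pattern below; whether GCC
    warns depends on how far it tracks the guard, which varies with the
    optimization level (illustrative example, not from the kernel):

    int compute(void);
    void do_other_work(void);

    /* gcc -Os -Wmaybe-uninitialized may claim 'v' can be used
     * uninitialized here, even though the 'flag' guard makes that
     * impossible. */
    int example(int flag)
    {
            int v;

            if (flag)
                    v = compute();
            do_other_work();
            if (flag)
                    return v;       /* only reached when v was assigned */
            return 0;
    }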

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Michal Marek

    Arnd Bergmann
     

02 Apr, 2016

1 commit

  • Newer Fedora and OpenSUSE didn't boot with my standard configuration.
    It took me some time to figure out why; in fact, I had to write a
    script to try different config options systematically.

    The problem is that something (systemd) in dracut depends on
    CONFIG_FHANDLE, which adds the open-by-file-handle syscalls.

    While it is set in defconfigs, it is very easy to miss when updating
    older configs because it is not default y.

    Make it default y and also depend on EXPERT, as dracut use is likely
    widespread.
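
    For reference, these are the syscalls CONFIG_FHANDLE provides; a
    minimal sketch (open_by_handle_at() additionally requires
    CAP_DAC_READ_SEARCH, so only the handle lookup is shown):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
            struct file_handle *fh;
            int mount_id;

            fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
            fh->handle_bytes = MAX_HANDLE_SZ;

            /* Translate a path into an opaque, persistent handle that
             * open_by_handle_at() can later reopen without the path. */
            if (name_to_handle_at(AT_FDCWD, "/etc/hostname", fh,
                                  &mount_id, 0) < 0) {
                    perror("name_to_handle_at"); /* ENOSYS without CONFIG_FHANDLE */
                    return 1;
            }
            printf("handle: %u bytes, type %d\n",
                   fh->handle_bytes, fh->handle_type);
            return 0;
    }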

    Signed-off-by: Andi Kleen
    Cc: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

30 Mar, 2016

1 commit


19 Mar, 2016

1 commit

  • Pull cgroup updates from Tejun Heo:
    "cgroup changes for v4.6-rc1. No userland visible behavior changes in
    this pull request. I'll send out a separate pull request for the
    addition of cgroup namespace support.

    - The biggest change is the revamping of cgroup core task migration
    and controller handling logic. There are quite a few places where
    controllers and tasks are manipulated. Previously, many of those
    places implemented custom operations for each specific use case
    assuming specific starting conditions. While this worked, it makes
    the code fragile and difficult to follow.

    The bulk of this pull request restructures these operations so that
    most related operations are performed through common helpers which
    implement recursive (subtrees are always processed consistently)
    and idempotent (they make cgroup hierarchy converge to the target
    state rather than performing operations assuming specific starting
    conditions). This makes the code a lot easier to understand,
    verify and extend.

    - Implicit controller support is added. This is primarily for using
    perf_event on the v2 hierarchy so that perf can match cgroup v2
    path without requiring the user to do anything special. The kernel
    portion of perf_event changes is acked but userland changes are
    still pending review.

    - cgroup_no_v1= boot parameter added to ease testing cgroup v2 in
    certain environments.

    - There is a regression introduced during v4.4 devel cycle where
    attempts to migrate zombie tasks can mess up internal object
    management. This was fixed earlier this week and included in this
    pull request w/ stable cc'd.

    - Misc non-critical fixes and improvements"

    * 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (44 commits)
    cgroup: avoid false positive gcc-6 warning
    cgroup: ignore css_sets associated with dead cgroups during migration
    Documentation: cgroup v2: Trivial heading correction.
    cgroup: implement cgroup_subsys->implicit_on_dfl
    cgroup: use css_set->mg_dst_cgrp for the migration target cgroup
    cgroup: make cgroup[_taskset]_migrate() take cgroup_root instead of cgroup
    cgroup: move migration destination verification out of cgroup_migrate_prepare_dst()
    cgroup: fix incorrect destination cgroup in cgroup_update_dfl_csses()
    cgroup: Trivial correction to reflect controller.
    cgroup: remove stale item in cgroup-v1 document INDEX file.
    cgroup: update css iteration in cgroup_update_dfl_csses()
    cgroup: allocate 2x cgrp_cset_links when setting up a new root
    cgroup: make cgroup_calc_subtree_ss_mask() take @this_ss_mask
    cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends
    cgroup: use cgroup_apply_enable_control() in cgroup creation path
    cgroup: combine cgroup_mutex locking and offline css draining
    cgroup: factor out cgroup_{apply|finalize}_control() from cgroup_subtree_control_write()
    cgroup: introduce cgroup_{save|propagate|restore}_control()
    cgroup: make cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() recursive
    cgroup: factor out cgroup_apply_control_enable() from cgroup_subtree_control_write()
    ...

    Linus Torvalds
     

18 Mar, 2016

1 commit

  • Pull security layer updates from James Morris:
    "There are a bunch of fixes to the TPM, IMA, and Keys code, with minor
    fixes scattered across the subsystem.

    IMA now requires signed policy, and that policy is also now measured
    and appraised"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (67 commits)
    X.509: Make algo identifiers text instead of enum
    akcipher: Move the RSA DER encoding check to the crypto layer
    crypto: Add hash param to pkcs1pad
    sign-file: fix build with CMS support disabled
    MAINTAINERS: update tpmdd urls
    MODSIGN: linux/string.h should be #included to get memcpy()
    certs: Fix misaligned data in extra certificate list
    X.509: Handle midnight alternative notation in GeneralizedTime
    X.509: Support leap seconds
    Handle ISO 8601 leap seconds and encodings of midnight in mktime64()
    X.509: Fix leap year handling again
    PKCS#7: fix unitialized boolean 'want'
    firmware: change kernel read fail to dev_dbg()
    KEYS: Use the symbol value for list size, updated by scripts/insert-sys-cert
    KEYS: Reserve an extra certificate symbol for inserting without recompiling
    modsign: hide openssl output in silent builds
    tpm_tis: fix build warning with tpm_tis_resume
    ima: require signed IMA policy
    ima: measure and appraise the IMA policy itself
    ima: load policy using path
    ...

    Linus Torvalds
     

16 Mar, 2016

2 commits

  • Similar to how relative extables are implemented, it is possible to emit
    the kallsyms table in such a way that it contains offsets relative to
    some anchor point in the kernel image rather than absolute addresses.

    On 64-bit architectures, it cuts the size of the kallsyms address table
    in half, since offsets between kernel symbols can typically be expressed
    in 32 bits. This saves several hundreds of kilobytes of permanent
    .rodata on average. In addition, the kallsyms address table is no
    longer subject to dynamic relocation when CONFIG_RELOCATABLE is in
    effect, so the relocation work done after decompression now doesn't have
    to do relocation updates for all these values. This saves up to 24
    bytes (i.e., the size of an ELF64 RELA relocation table entry) per value,
    which easily adds up to a couple of megabytes of uncompressed __init
    data on ppc64 or arm64. Even if these relocation entries typically
    compress well, the combined size reduction of 2.8 MB uncompressed for a
    ppc64_defconfig build (of which 2.4 MB is __init data) results in a ~500
    KB space saving in the compressed image.

    Since it is useful for some architectures (like x86) to retain the
    ability to emit absolute values as well, this patch also adds support
    for capturing both absolute and relative values when
    KALLSYMS_ABSOLUTE_PERCPU is in effect, by emitting absolute per-cpu
    addresses as positive 32-bit values, and addresses relative to the
    lowest encountered relative symbol as negative values, which are
    subtracted from the runtime address of this base symbol to produce the
    actual address.

    Support for the above is enabled by default for all architectures except
    IA-64 and Tile-GX, whose symbols are too far apart to capture in this
    manner.
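
    The decoding side reduces to a few lines; a sketch (variable names
    mirror the kernel's, but the surrounding plumbing is elided):

    #include <stdint.h>

    extern const int32_t kallsyms_offsets[];
    extern const unsigned long kallsyms_relative_base;
    extern const int absolute_percpu;

    static unsigned long sym_address(int idx)
    {
            /* Without the absolute-percpu twist, every entry is an
             * unsigned offset from the relative base. */
            if (!absolute_percpu)
                    return kallsyms_relative_base + (uint32_t)kallsyms_offsets[idx];

            /* Positive entries are absolute (per-cpu) addresses... */
            if (kallsyms_offsets[idx] >= 0)
                    return kallsyms_offsets[idx];

            /* ...negative ones are relative, stored as -1 - offset. */
            return kallsyms_relative_base - 1 - kallsyms_offsets[idx];
    }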

    Signed-off-by: Ard Biesheuvel
    Tested-by: Guenter Roeck
    Reviewed-by: Kees Cook
    Tested-by: Kees Cook
    Cc: Heiko Carstens
    Cc: Michael Ellerman
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Benjamin Herrenschmidt
    Cc: Michal Marek
    Cc: Rusty Russell
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ard Biesheuvel
     
  • scripts/kallsyms.c has a special --absolute-percpu command line option
    which deals with the zero based per cpu offsets that are used when
    building for SMP on x86_64. This means that the option should only be
    passed in that case, so add a Kconfig symbol with the correct predicate,
    and use that instead.

    Signed-off-by: Ard Biesheuvel
    Tested-by: Guenter Roeck
    Reviewed-by: Kees Cook
    Tested-by: Kees Cook
    Acked-by: Rusty Russell
    Cc: Heiko Carstens
    Cc: Michael Ellerman
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Benjamin Herrenschmidt
    Cc: Michal Marek
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ard Biesheuvel
     

05 Mar, 2016

1 commit


04 Mar, 2016

1 commit

  • Move the RSA EMSA-PKCS1-v1_5 encoding from the asymmetric-key
    public_key subtype to the rsa crypto module's pkcs1pad template. This
    means that the public_key subtype no longer has any dependencies on
    the public key type.

    To make this work, the following changes have been made:

    (1) The rsa pkcs1pad template is now used for RSA keys. This strips off the
    padding and returns just the message hash.

    (2) In a previous patch, the pkcs1pad template gained an optional second
    parameter that, if given, specifies the hash used. We now give this,
    and pkcs1pad checks the encoded message E(M) for the EMSA-PKCS1-v1_5
    encoding and verifies that the correct digest OID is present.

    (3) The crypto driver in crypto/asymmetric_keys/rsa.c is now reduced
    to something that doesn't care about what the encryption actually
    does and has been merged into public_key.c.

    (4) CONFIG_PUBLIC_KEY_ALGO_RSA is gone. Module signing must set
    CONFIG_CRYPTO_RSA=y instead.

    Thoughts:

    (*) Should the encoding style (eg. raw, EMSA-PKCS1-v1_5) also be passed to
    the padding template? Should there be multiple padding templates
    registered that share most of the code?
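
    Concretely, a verifier now names the hash when instantiating the
    template (a minimal sketch of point (2) above):

    /* The hash is a parameter of the pkcs1pad template, so the padding
     * code itself can verify the digest OID embedded in E(M). */
    struct crypto_akcipher *tfm =
            crypto_alloc_akcipher("pkcs1pad(rsa,sha256)", 0, 0);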

    Signed-off-by: David Howells
    Signed-off-by: Tadeusz Struk
    Acked-by: Herbert Xu

    David Howells
     

21 Jan, 2016

2 commits

  • What CONFIG_INET and CONFIG_LEGACY_KMEM guard inside the memory
    controller code is insignificant; having these conditionals is not
    worth the complication and fragility that comes with them.

    [akpm@linux-foundation.org: rework mem_cgroup_css_free() statement ordering]
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Let the user know that CONFIG_MEMCG_KMEM does not apply to the cgroup2
    interface. This also makes legacy-only code sections stand out better.

    [arnd@arndb.de: mm: memcontrol: only manage socket pressure for CONFIG_INET]
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Acked-by: Vladimir Davydov
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

18 Jan, 2016

1 commit

  • Pull audit updates from Paul Moore:
    "Seven audit patches for 4.5, all very minor despite the diffstat.

    The diffstat churn for linux/audit.h can be attributed to needing to
    reshuffle the linux/audit.h header to fix the seccomp auditing issue
    (see the commit description for details).

    Besides the seccomp/audit fix, most of the fixes are around trying to
    improve the connection with the audit daemon and a Kconfig
    simplification. Nothing crazy, and everything passes our little
    audit-testsuite"

    * 'upstream' of git://git.infradead.org/users/pcmoore/audit:
    audit: always enable syscall auditing when supported and audit is enabled
    audit: force seccomp event logging to honor the audit_enabled flag
    audit: Delete unnecessary checks before two function calls
    audit: wake up threads if queue switched from limited to unlimited
    audit: include auditd's threads in audit_log_start() wait exception
    audit: remove audit_backlog_wait_overflow
    audit: don't needlessly reset valid wait time

    Linus Torvalds
     

17 Jan, 2016

1 commit

  • uselib hasn't been used since libc5; glibc does not use it. Deprecate
    uselib a bit more, by making the default y only if libc5 was widely
    used on the platform.

    This makes an arm64 kernel built with defconfig slightly smaller.

    bloat-o-meter:
    add/remove: 0/3 grow/shrink: 0/2 up/down: 0/-1390 (-1390)
    function             old    new  delta
    kernel_config_data 18164  18162     -2
    uselib_flags          20      -    -20
    padzero              216    192    -24
    sys_uselib           380      -   -380
    load_elf_library     964      -   -964

    Signed-off-by: Riku Voipio
    Reviewed-by: Josh Triplett
    Acked-by: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Riku Voipio
     

13 Jan, 2016

2 commits

  • To the best of our knowledge, everyone who enables audit at compile
    time also enables syscall auditing; this patch simplifies the Kconfig
    menus by removing the option to disable syscall auditing when audit
    is selected and the target arch supports it.

    Signed-off-by: Paul Moore

    Paul Moore
     
  • Pull cgroup updates from Tejun Heo:

    - cgroup v2 interface is now official. It's no longer hidden behind a
    devel flag and can be mounted using the new cgroup2 fs type.

    Unfortunately, cpu v2 interface hasn't made it yet due to the
    discussion around in-process hierarchical resource distribution and
    only memory and io controllers can be used on the v2 interface at the
    moment.

    - The existing documentation which has always been a bit of mess is
    relocated under Documentation/cgroup-v1/. Documentation/cgroup-v2.txt
    is added as the authoritative documentation for the v2 interface.

    - Some features are added through for-4.5-ancestor-test branch to
    enable netfilter xt_cgroup match to use cgroup v2 paths. The actual
    netfilter changes will be merged through the net tree which pulled in
    the said branch.

    - Various cleanups

    * 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: rename cgroup documentations
    cgroup: fix a typo.
    cgroup: Remove resource_counter.txt in Documentation/cgroup-legacy/00-INDEX.
    cgroup: demote subsystem init messages to KERN_DEBUG
    cgroup: Fix uninitialized variable warning
    cgroup: put controller Kconfig options in meaningful order
    cgroup: clean up the kernel configuration menu nomenclature
    cgroup_pids: fix a typo.
    Subject: cgroup: Fix incomplete dd command in blkio documentation
    cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends
    cpuset: Replace all instances of time_t with time64_t
    cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation
    cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/
    cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type

    Linus Torvalds
     

19 Dec, 2015

2 commits


13 Dec, 2015

1 commit

  • Currently the full stop_machine() routine is only enabled on SMP if
    module unloading is enabled, or if the CPUs are hotpluggable. This
    leads to configurations where stop_machine() is broken as it will then
    only run the callback on the local CPU with irqs disabled, and not stop
    the other CPUs or run the callback on them.

    For example, this breaks MTRR setup on x86 in certain configs since
    ea8596bb2d8d379 ("kprobes/x86: Remove unused text_poke_smp() and
    text_poke_smp_batch() functions") as the MTRR is only established on the
    boot CPU.

    This patch removes the Kconfig option for STOP_MACHINE and uses the SMP
    and HOTPLUG_CPU config options to compile the correct stop_machine() for
    the architecture, removing the false dependency on MODULE_UNLOAD in the
    process.
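
    For reference, the degenerate form that such configs were getting is
    essentially the UP stub (simplified sketch):

    /* Without the real stop_machine(), the "stopped machine" is just
     * the local CPU with interrupts off - other CPUs keep running. */
    static int stop_machine_local(int (*fn)(void *), void *data)
    {
            unsigned long flags;
            int ret;

            local_irq_save(flags);
            ret = fn(data);
            local_irq_restore(flags);
            return ret;
    }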

    Link: https://lkml.org/lkml/2014/10/8/124
    References: https://bugs.freedesktop.org/show_bug.cgi?id=84794
    Signed-off-by: Chris Wilson
    Acked-by: Ingo Molnar
    Cc: "Paul E. McKenney"
    Cc: Pranith Kumar
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: H. Peter Anvin
    Cc: Tejun Heo
    Cc: Iulia Manda
    Cc: Andy Lutomirski
    Cc: Rusty Russell
    Cc: Peter Zijlstra
    Cc: Chuck Ebbert
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wilson
     

12 Sep, 2015

1 commit

  • Here is an implementation of a new system call, sys_membarrier(), which
    executes a memory barrier on all threads running on the system. It is
    implemented by calling synchronize_sched(). It can be used to
    distribute the cost of user-space memory barriers asymmetrically by
    transforming pairs of memory barriers into pairs consisting of
    sys_membarrier() and a compiler barrier. For synchronization primitives
    that distinguish between read-side and write-side (e.g. userspace RCU
    [1], rwlocks), the read-side can be accelerated significantly by moving
    the bulk of the memory barrier overhead to the write-side.

    The existing applications of which I am aware that would be improved by
    this system call are as follows:

    * Through Userspace RCU library (http://urcu.so)
    - DNS server (Knot DNS) https://www.knot-dns.cz/
    - Network sniffer (http://netsniff-ng.org/)
    - Distributed object storage (https://sheepdog.github.io/sheepdog/)
    - User-space tracing (http://lttng.org)
    - Network storage system (https://www.gluster.org/)
    - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
    - Financial software (https://lkml.org/lkml/2015/3/23/189)

    Those projects use RCU in userspace to increase read-side speed and
    scalability compared to locking. Especially in the case of RCU used by
    libraries, sys_membarrier can speed up the read-side by moving the bulk of
    the memory barrier cost to synchronize_rcu().

    * Direct users of sys_membarrier
    - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)

    Microsoft core dotnet GC developers are planning to use the mprotect()
    side-effect of issuing memory barriers through IPIs as a way to implement
    Windows FlushProcessWriteBuffers() on Linux. They are referring to
    sys_membarrier in their github thread, specifically stating that
    sys_membarrier() is what they are looking for.

    To explain the benefit of this scheme, let's introduce two example threads:

    Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
    Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())

    In a scheme where all smp_mb() in thread A are ordering memory accesses
    with respect to smp_mb() present in Thread B, we can change each
    smp_mb() within Thread A into calls to sys_membarrier() and each
    smp_mb() within Thread B into compiler barriers "barrier()".

    Before the change, we had, for each smp_mb() pairs:

    Thread A                  Thread B
    previous mem accesses     previous mem accesses
    smp_mb()                  smp_mb()
    following mem accesses    following mem accesses

    After the change, these pairs become:

    Thread A                  Thread B
    prev mem accesses         prev mem accesses
    sys_membarrier()          barrier()
    follow mem accesses       follow mem accesses

    As we can see, there are two possible scenarios: either Thread B memory
    accesses do not happen concurrently with Thread A accesses (1), or they
    do (2).

    1) Non-concurrent Thread A vs Thread B accesses:

    Thread A                  Thread B
    prev mem accesses
    sys_membarrier()
    follow mem accesses
                              prev mem accesses
                              barrier()
                              follow mem accesses

    In this case, thread B accesses will be weakly ordered. This is OK,
    because at that point, thread A is not particularly interested in
    ordering them with respect to its own accesses.

    2) Concurrent Thread A vs Thread B accesses

    Thread A                  Thread B
    prev mem accesses         prev mem accesses
    sys_membarrier()          barrier()
    follow mem accesses       follow mem accesses

    In this case, thread B accesses, which are ensured to be in program
    order thanks to the compiler barrier, will be "upgraded" to full
    smp_mb() by synchronize_sched().

    * Benchmarks

    On Intel Xeon E5405 (8 cores)
    (one thread is calling sys_membarrier, the other 7 threads are busy
    looping)

    1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.

    * User-space user of this system call: Userspace RCU library

    Both the signal-based and the sys_membarrier userspace RCU schemes
    permit us to remove the memory barrier from the userspace RCU
    rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
    accelerating them. These memory barriers are replaced by compiler
    barriers on the read-side, and all matching memory barriers on the
    write-side are turned into an invocation of a memory barrier on all
    active threads in the process. By letting the kernel perform this
    synchronization rather than dumbly sending a signal to every thread
    of the process (as we currently do), we diminish the number of
    unnecessary wake-ups and only issue the memory barriers on active
    threads. Non-running threads do not need to execute such barriers
    anyway, because they are implied by the scheduler context switches.

    Results in liburcu:

    Operations in 10s, 6 readers, 2 writers:

    memory barriers in reader: 1701557485 reads, 2202847 writes
    signal-based scheme: 9830061167 reads, 6700 writes
    sys_membarrier: 9952759104 reads, 425 writes
    sys_membarrier (dyn. check): 7970328887 reads, 425 writes

    The dynamic sys_membarrier availability check adds some overhead to
    the read-side compared to the signal-based scheme, but besides that,
    sys_membarrier slightly outperforms the signal-based scheme. However,
    this non-expedited sys_membarrier implementation has a much slower grace
    period than signal and memory barrier schemes.

    Besides diminishing the number of wake-ups, one major advantage of the
    membarrier system call over the signal-based scheme is that it does not
    need to reserve a signal. This plays much more nicely with libraries,
    and with processes injected into for tracing purposes, for which we
    cannot expect that signals will be unused by the application.

    An expedited version of this system call can be added later on to speed
    up the grace period. Its implementation will likely depend on reading
    the cpu_curr()->mm without holding each CPU's rq lock.

    This patch adds the system call to x86 and to asm-generic.

    [1] http://urcu.so

    membarrier(2) man page:

    MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2)

    NAME
    membarrier - issue memory barriers on a set of threads

    SYNOPSIS
    #include <linux/membarrier.h>

    int membarrier(int cmd, int flags);

    DESCRIPTION
    The cmd argument is one of the following:

    MEMBARRIER_CMD_QUERY
    Query the set of supported commands. It returns a bitmask of
    supported commands.

    MEMBARRIER_CMD_SHARED
    Execute a memory barrier on all threads running on the system.
    Upon return from system call, the caller thread is ensured that
    all running threads have passed through a state where all memory
    accesses to user-space addresses match program order between
    entry to and return from the system call (non-running threads
    are de facto in such a state). This covers threads from all
    processes running on the system. This command returns 0.

    The flags argument needs to be 0; it is reserved for future extensions.

    All memory accesses performed in program order from each targeted
    thread are guaranteed to be ordered with respect to sys_membarrier(). If
    we use the semantic "barrier()" to represent a compiler barrier forcing
    memory accesses to be performed in program order across the barrier,
    and smp_mb() to represent explicit memory barriers forcing full memory
    ordering across the barrier, we have the following ordering table for
    each pair of barrier(), sys_membarrier() and smp_mb():

    The pair ordering is detailed as (O: ordered, X: not ordered):

                           barrier()   smp_mb()   sys_membarrier()
    barrier()                  X          X              O
    smp_mb()                   X          O              O
    sys_membarrier()           O          O              O

    RETURN VALUE
    On success, these system calls return zero. On error, -1 is returned,
    and errno is set appropriately. For a given command, with flags
    argument set to 0, this system call is guaranteed to always return the
    same value until reboot.

    ERRORS
    ENOSYS System call is not implemented.

    EINVAL Invalid arguments.

    Linux 2015-04-15 MEMBARRIER(2)
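
    A minimal userspace caller matching the man page above (assumes a
    kernel and headers with the syscall wired up):

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/membarrier.h>

    static int membarrier(int cmd, int flags)
    {
            return syscall(__NR_membarrier, cmd, flags);
    }

    int main(void)
    {
            int mask = membarrier(MEMBARRIER_CMD_QUERY, 0);

            if (mask < 0 || !(mask & MEMBARRIER_CMD_SHARED))
                    return 1;       /* fall back to plain memory barriers */

            /* System-wide barrier: pairs with barrier() on the fast path. */
            membarrier(MEMBARRIER_CMD_SHARED, 0);
            return 0;
    }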

    Signed-off-by: Mathieu Desnoyers
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Josh Triplett
    Cc: KOSAKI Motohiro
    Cc: Steven Rostedt
    Cc: Nicholas Miell
    Cc: Ingo Molnar
    Cc: Alan Cox
    Cc: Lai Jiangshan
    Cc: Stephen Hemminger
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: David Howells
    Cc: Pranith Kumar
    Cc: Michael Kerrisk
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     

09 Sep, 2015

1 commit

  • Pull security subsystem updates from James Morris:
    "Highlights:

    - PKCS#7 support added to support signed kexec, also utilized for
    module signing. See comments in 3f1e1bea.

    ** NOTE: this requires linking against the OpenSSL library, which
    must be installed, e.g. the openssl-devel on Fedora **

    - Smack
    - add IPv6 host labeling; ignore labels on kernel threads
    - support smack labeling mounts which use binary mount data

    - SELinux:
    - add ioctl whitelisting (see
    http://kernsec.org/files/lss2015/vanderstoep.pdf)
    - fix mprotect PROT_EXEC regression caused by mm change

    - Seccomp:
    - add ptrace options for suspend/resume"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (57 commits)
    PKCS#7: Add OIDs for sha224, sha284 and sha512 hash algos and use them
    Documentation/Changes: Now need OpenSSL devel packages for module signing
    scripts: add extract-cert and sign-file to .gitignore
    modsign: Handle signing key in source tree
    modsign: Use if_changed rule for extracting cert from module signing key
    Move certificate handling to its own directory
    sign-file: Fix warning about BIO_reset() return value
    PKCS#7: Add MODULE_LICENSE() to test module
    Smack - Fix build error with bringup unconfigured
    sign-file: Document dependency on OpenSSL devel libraries
    PKCS#7: Appropriately restrict authenticated attributes and content type
    KEYS: Add a name for PKEY_ID_PKCS7
    PKCS#7: Improve and export the X.509 ASN.1 time object decoder
    modsign: Use extract-cert to process CONFIG_SYSTEM_TRUSTED_KEYS
    extract-cert: Cope with multiple X.509 certificates in a single file
    sign-file: Generate CMS message as signature instead of PKCS#7
    PKCS#7: Support CMS messages also [RFC5652]
    X.509: Change recorded SKID & AKID to not include Subject or Issuer
    PKCS#7: Check content type and versions
    MAINTAINERS: The keyrings mailing list has moved
    ...

    Linus Torvalds
     

06 Sep, 2015

1 commit

  • Pull vfs updates from Al Viro:
    "In this one:

    - d_move fixes (Eric Biederman)

    - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error
    handling ought to be fixed)

    - switch of sb_writers to percpu rwsem (Oleg Nesterov)

    - superblock scalability (Josef Bacik and Dave Chinner)

    - swapon(2) race fix (Hugh Dickins)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits)
    vfs: Test for and handle paths that are unreachable from their mnt_root
    dcache: Reduce the scope of i_lock in d_splice_alias
    dcache: Handle escaped paths in prepend_path
    mm: fix potential data race in SyS_swapon
    inode: don't softlockup when evicting inodes
    inode: rename i_wb_list to i_io_list
    sync: serialise per-superblock sync operations
    inode: convert inode_sb_list_lock to per-sb
    inode: add hlist_fake to avoid the inode hash lock in evict
    writeback: plug writeback at a high level
    change sb_writers to use percpu_rw_semaphore
    shift percpu_counter_destroy() into destroy_super_work()
    percpu-rwsem: kill CONFIG_PERCPU_RWSEM
    percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
    percpu-rwsem: introduce percpu_down_read_trylock()
    document rwsem_release() in sb_wait_write()
    fix the broken lockdep logic in __sb_start_write()
    introduce __sb_writers_{acquired,release}() helpers
    ufs_inode_get{frag,block}(): get rid of 'phys' argument
    ufs_getfrag_block(): tidy up a bit
    ...

    Linus Torvalds