Eric Lee / smarc-fsl-linux-kernel

21 Jan, 2016

4 commits

d886f4e48 mm: memcontrol: rein in the CONFIG space madness

What CONFIG_INET and CONFIG_LEGACY_KMEM guard inside the memory
controller code is insignificant; having these conditionals is not
worth the complication and fragility that comes with them.

[akpm@linux-foundation.org: rework mem_cgroup_css_free() statement ordering]
Signed-off-by: Johannes Weiner
Cc: Michal Hocko
Acked-by: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2016-01-21 09:09:18 +0800
489c2a20a mm: memcontrol: introduce CONFIG_MEMCG_LEGACY_KMEM

Let the user know that CONFIG_MEMCG_KMEM does not apply to the cgroup2
interface. This also makes legacy-only code sections stand out better.

[arnd@arndb.de: mm: memcontrol: only manage socket pressure for CONFIG_INET]
Signed-off-by: Johannes Weiner
Cc: Michal Hocko
Cc: Tejun Heo
Acked-by: Vladimir Davydov
Signed-off-by: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2016-01-21 09:09:18 +0800
f057f3b22 init/do_mounts: initrd_load() can be boolean

Make initrd_load() return bool, since this function only ever returns
one or zero.

No functional change.

Signed-off-by: Yaowei Bai
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yaowei Bai
2016-01-21 09:09:18 +0800
31c025b5f init/main.c: obsolete_checksetup can be boolean

Make obsolete_checksetup() return bool, since this function only ever
returns one or zero.

No functional change.

Signed-off-by: Yaowei Bai
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yaowei Bai
2016-01-21 09:09:18 +0800

18 Jan, 2016

3 commits

2d663b558 Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit

Pull audit updates from Paul Moore:
"Seven audit patches for 4.5, all very minor despite the diffstat.

The diffstat churn for linux/audit.h can be attributed to needing to
reshuffle the linux/audit.h header to fix the seccomp auditing issue
(see the commit description for details).

Besides the seccomp/audit fix, most of the fixes are around trying to
improve the connection with the audit daemon and a Kconfig
simplification. Nothing crazy, and everything passes our little
audit-testsuite"

* 'upstream' of git://git.infradead.org/users/pcmoore/audit:
audit: always enable syscall auditing when supported and audit is enabled
audit: force seccomp event logging to honor the audit_enabled flag
audit: Delete unnecessary checks before two function calls
audit: wake up threads if queue switched from limited to unlimited
audit: include auditd's threads in audit_log_start() wait exception
audit: remove audit_backlog_wait_overflow
audit: don't needlessly reset valid wait time

Linus Torvalds
2016-01-18 10:48:49 +0800
0cbeafb24 Merge branch 'akpm' (patches from Andrew)

Merge second patch-bomb from Andrew Morton:

- more MM stuff:

- Kirill's page-flags rework

- Kirill's now-allegedly-fixed THP rework

- MADV_FREE implementation

- DAX feature work (msync/fsync). This isn't quite complete but DAX
is new and it's good enough and the guys have a handle on what
needs to be done - I expect this to be wrapped in the next week or
two.

- some vsprintf maintenance work

- various other misc bits

* emailed patches from Andrew Morton: (145 commits)
printk: change recursion_bug type to bool
lib/vsprintf: factor out %pN[F] handler as netdev_bits()
lib/vsprintf: refactor duplicate code to special_hex_number()
printk-formats.txt: remove unimplemented %pT
printk: help pr_debug and pr_devel to optimize out arguments
lib/test_printf.c: test dentry printing
lib/test_printf.c: add test for large bitmaps
lib/test_printf.c: account for kvasprintf tests
lib/test_printf.c: add a few number() tests
lib/test_printf.c: test precision quirks
lib/test_printf.c: check for out-of-bound writes
lib/test_printf.c: don't BUG
lib/kasprintf.c: add sanity check to kvasprintf
lib/vsprintf.c: warn about too large precisions and field widths
lib/vsprintf.c: help gcc make number() smaller
lib/vsprintf.c: expand field_width to 24 bits
lib/vsprintf.c: eliminate potential race in string()
lib/vsprintf.c: move string() below widen_string()
lib/vsprintf.c: pull out padding code from dentry_name()
printk: do cond_resched() between lines while outputting to consoles
...

Linus Torvalds
2016-01-18 04:58:52 +0800
e535d74bc Merge tag 'docs-4.5' of git://git.lwn.net/linux

Pull documentation updates from Jon Corbet:
"A relatively boring cycle in the docs tree. There's a few kernel-doc
fixes and various document tweaks.

One patch reaches out of the documentation subtree to fix a comment in
init/do_mounts_rd.c. There didn't seem to be anybody more appropriate
to take that one, so I accepted it"

* tag 'docs-4.5' of git://git.lwn.net/linux: (29 commits)
thermal: add description for integral_cutoff unit
Documentation: update libhugetlbfs site url
Documentation: Explain pci=conf1,conf2 more verbosely
DMA-API: fix confusing sentence in Documentation/DMA-API.txt
Documentation: translations: update linux cross reference link
Documentation: fix typo in CodingStyle
init, Documentation: Remove ramdisk_blocksize mentions
Documentation-getdelays: Apply a recommendation from "checkpatch.pl" in main()
Documentation: HOWTO: update versions from 3.x to 4.x
Documentation: remove outdated references from translations
Doc: treewide: Fix grammar "a" to "an"
Documentation: cpu-hotplug: Fix sysfs mount instructions
can-doc: Add hint about getting timestamps
Fix CFQ I/O scheduler parameter name in documentation
Documentation: arm: remove dead links from Marvell Berlin docs
Documentation: HOWTO: update code cross reference link
Doc: Docbook/iio: Fix typo in iio.tmpl
DocBook: make index.html generation less verbose by default
DocBook: Cleanup: remove an unused $(call) line
DocBook: Add a help message for DOCBOOKS env var
...

Linus Torvalds
2016-01-18 03:55:07 +0800

17 Jan, 2016

1 commit

b2113a417 uselib: default depending if libc5 was used

uselib hasn't been used since libc5; glibc does not use it. Deprecate
uselib a bit more, by making the default y only if libc5 was widely used
on the platform.

This makes an arm64 kernel built with defconfig slightly smaller.

bloat-o-meter:
add/remove: 0/3 grow/shrink: 0/2 up/down: 0/-1390 (-1390)
function                                     old     new   delta
kernel_config_data                         18164   18162      -2
uselib_flags                                  20       -     -20
padzero                                      216     192     -24
sys_uselib                                   380       -    -380
load_elf_library                             964       -    -964

Signed-off-by: Riku Voipio
Reviewed-by: Josh Triplett
Acked-by: Geert Uytterhoeven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Riku Voipio
2016-01-17 03:17:24 +0800

13 Jan, 2016

2 commits

cb74ed278 audit: always enable syscall auditing when supported and audit is enabled

To the best of our knowledge, everyone who enables audit at compile
time also enables syscall auditing; this patch simplifies the Kconfig
menus by removing the option to disable syscall auditing when audit
is selected and the target arch supports it.

Signed-off-by: Paul Moore

Paul Moore
2016-01-13 22:18:55 +0800
34a9304a9 Merge branch 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

- cgroup v2 interface is now official. It's no longer hidden behind a
devel flag and can be mounted using the new cgroup2 fs type.

Unfortunately, cpu v2 interface hasn't made it yet due to the
discussion around in-process hierarchical resource distribution and
only memory and io controllers can be used on the v2 interface at the
moment.

- The existing documentation which has always been a bit of a mess is
relocated under Documentation/cgroup-v1/. Documentation/cgroup-v2.txt
is added as the authoritative documentation for the v2 interface.

- Some features are added through for-4.5-ancestor-test branch to
enable netfilter xt_cgroup match to use cgroup v2 paths. The actual
netfilter changes will be merged through the net tree which pulled in
the said branch.

- Various cleanups

* 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: rename cgroup documentations
cgroup: fix a typo.
cgroup: Remove resource_counter.txt in Documentation/cgroup-legacy/00-INDEX.
cgroup: demote subsystem init messages to KERN_DEBUG
cgroup: Fix uninitialized variable warning
cgroup: put controller Kconfig options in meaningful order
cgroup: clean up the kernel configuration menu nomenclature
cgroup_pids: fix a typo.
Subject: cgroup: Fix incomplete dd command in blkio documentation
cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends
cpuset: Replace all instances of time_t with time64_t
cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation
cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/
cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type

Linus Torvalds
2016-01-13 11:20:32 +0800

06 Jan, 2016

1 commit

3104fb3dd Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu

Pull RCU changes from Paul E. McKenney:

- Adding transitivity uniformly to rcu_node structure ->lock
acquisitions. (This is implemented by the first two commits
on top of v4.4-rc2 due to the pervasive nature of this change.)

- Documentation updates, including RCU requirements.

- Expedited grace-period changes.

- Miscellaneous fixes.

- Linked-list fixes, courtesy of KTSAN.

- Torture-test updates.

- Late-breaking fix to sysrq-generated crash.

Signed-off-by: Ingo Molnar <mingo@kernel.org>

Ingo Molnar
2016-01-06 18:41:48 +0800

26 Dec, 2015

1 commit

b03539665 init, Documentation: Remove ramdisk_blocksize mentions

The brd driver has never supported the ramdisk_blocksize kernel
parameter that was in the rd driver it replaced, so remove
mention of this parameter from comments and Documentation.

Commit 9db5579be4bb ("rewrite rd") replaced rd with brd, keeping
a brd_blocksize variable in struct brd_device but never using it.

Commit a2cba2913c76 ("brd: get rid of unused members from struct
brd_device") removed the unused variable.

Commit f5abc8e75815 ("Documentation/blockdev/ramdisk.txt: updates")
removed mentions of ramdisk_blocksize from that file.

Signed-off-by: Robert Elliott
Signed-off-by: Jonathan Corbet

Robert Elliott
2015-12-26 20:22:00 +0800

19 Dec, 2015

2 commits

6bf024e69 cgroup: put controller Kconfig options in meaningful order

To make it easier to quickly find what's needed, list the basic
resource controllers of cgroup2 first - io, memory, cpu - while
pushing the more exotic and/or legacy controllers to the bottom.

tj: Removed spurious "&& CGROUPS" from CGROUP_PERF as suggested by Li.

Signed-off-by: Johannes Weiner
Acked-by: Zefan Li
Signed-off-by: Tejun Heo

Johannes Weiner
2015-12-19 01:43:15 +0800
a0166ec4b cgroup: clean up the kernel configuration menu nomenclature

The config options for the different cgroup controllers use various
terms: resource controller, cgroup subsystem, etc. Simplify this to
"controller", which is clear enough in the cgroup context.

Signed-off-by: Johannes Weiner
Signed-off-by: Tejun Heo

Johannes Weiner
2015-12-19 01:43:15 +0800

13 Dec, 2015

1 commit

86fffe4a6 kernel: remove stop_machine() Kconfig dependency

Currently the full stop_machine() routine is only enabled on SMP if
module unloading is enabled, or if the CPUs are hotpluggable. This
leads to configurations where stop_machine() is broken as it will then
only run the callback on the local CPU with irqs disabled, and not stop
the other CPUs or run the callback on them.

For example, this breaks MTRR setup on x86 in certain configs since
ea8596bb2d8d379 ("kprobes/x86: Remove unused text_poke_smp() and
text_poke_smp_batch() functions") as the MTRR is only established on the
boot CPU.

This patch removes the Kconfig option for STOP_MACHINE and uses the SMP
and HOTPLUG_CPU config options to compile the correct stop_machine() for
the architecture, removing the false dependency on MODULE_UNLOAD in the
process.
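
The effect of keying the implementation on SMP and HOTPLUG_CPU rather than on a separate STOP_MACHINE option can be pictured with a small user-space sketch. This is not the kernel's code: CONFIG_SMP and CONFIG_HOTPLUG_CPU are the option names from the commit message, while demo_stop_machine(), DEMO_NR_CPUS and the callback type are hypothetical stand-ins, and the exact shape of the real preprocessor guard is an assumption.

```c
/* Hypothetical stand-in: define CONFIG_SMP to mimic an SMP build. */
#define CONFIG_SMP 1
#define DEMO_NR_CPUS 4

typedef int (*demo_fn_t)(void *data);

#if defined(CONFIG_SMP) || defined(CONFIG_HOTPLUG_CPU)
/* "Full" variant: the callback is run once for every simulated CPU. */
static int demo_stop_machine(demo_fn_t fn, void *data)
{
    int ret = 0;

    for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
        ret |= fn(data);
    return ret;
}
#else
/* Uniprocessor variant: only the local CPU runs the callback. */
static int demo_stop_machine(demo_fn_t fn, void *data)
{
    return fn(data);
}
#endif

/* Example callback: counts how many times it was invoked. */
static int demo_count_call(void *data)
{
    (*(int *)data)++;
    return 0;
}
```

With CONFIG_SMP defined as above, calling demo_stop_machine(demo_count_call, &calls) increments calls DEMO_NR_CPUS times; removing both defines selects the local-only variant, which is the behaviour the commit describes as broken on SMP machines.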

Link: https://lkml.org/lkml/2014/10/8/124
References: https://bugs.freedesktop.org/show_bug.cgi?id=84794
Signed-off-by: Chris Wilson
Acked-by: Ingo Molnar
Cc: "Paul E. McKenney"
Cc: Pranith Kumar
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: Johannes Weiner
Cc: H. Peter Anvin
Cc: Tejun Heo
Cc: Iulia Manda
Cc: Andy Lutomirski
Cc: Rusty Russell
Cc: Peter Zijlstra
Cc: Chuck Ebbert
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Chris Wilson
2015-12-13 02:15:34 +0800

05 Dec, 2015

1 commit

967dcb8fe rcu: Wire up rcu_end_inkernel_boot()

This commit adds the invocation of rcu_end_inkernel_boot() just before
init is invoked. This allows the CONFIG_RCU_EXPEDITE_BOOT Kconfig
option to do something useful and prepares for the upcoming
rcupdate.rcu_normal_after_boot kernel parameter.

Signed-off-by: Paul E. McKenney

Paul E. McKenney
2015-12-05 04:26:54 +0800

12 Sep, 2015

1 commit

5b25b13ab sys_membarrier(): system-wide memory barrier (generic, x86)

Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads running on the system. It is
implemented by calling synchronize_sched(). It can be used to
distribute the cost of user-space memory barriers asymmetrically by
transforming pairs of memory barriers into pairs consisting of
sys_membarrier() and a compiler barrier. For synchronization primitives
that distinguish between read-side and write-side (e.g. userspace RCU
[1], rwlocks), the read-side can be accelerated significantly by moving
the bulk of the memory barrier overhead to the write-side.

The existing applications of which I am aware that would be improved by
this system call are as follows:

* Through Userspace RCU library (http://urcu.so)
- DNS server (Knot DNS) https://www.knot-dns.cz/
- Network sniffer (http://netsniff-ng.org/)
- Distributed object storage (https://sheepdog.github.io/sheepdog/)
- User-space tracing (http://lttng.org)
- Network storage system (https://www.gluster.org/)
- Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
- Financial software (https://lkml.org/lkml/2015/3/23/189)

Those projects use RCU in userspace to increase read-side speed and
scalability compared to locking. Especially in the case of RCU used by
libraries, sys_membarrier can speed up the read-side by moving the bulk of
the memory barrier cost to synchronize_rcu().

* Direct users of sys_membarrier
- core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)

Microsoft core dotnet GC developers are planning to use the mprotect()
side-effect of issuing memory barriers through IPIs as a way to implement
Windows FlushProcessWriteBuffers() on Linux. They are referring to
sys_membarrier in their github thread, specifically stating that
sys_membarrier() is what they are looking for.

To explain the benefit of this scheme, let's introduce two example threads:

Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu
rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A are ordering memory accesses
with respect to smp_mb() present in Thread B, we can change each
smp_mb() within Thread A into calls to sys_membarrier() and each
smp_mb() within Thread B into compiler barriers "barrier()".

Before the change, we had, for each smp_mb() pair:

Thread A                    Thread B
previous mem accesses       previous mem accesses
smp_mb()                    smp_mb()
following mem accesses      following mem accesses

After the change, these pairs become:

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).

1) Non-concurrent Thread A vs Thread B accesses:

Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() by synchronize_sched().

* Benchmarks

On Intel Xeon E5405 (8 cores)
(one thread is calling sys_membarrier, the other 7 threads are busy
looping)

1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.

* User-space user of this system call: Userspace RCU library

Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invocation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every thread of the
process (as we currently do), we diminish the number of unnecessary wake
ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such barrier anyway, because these are
implied by the scheduler context switches.

Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

memory barriers in reader: 1701557485 reads, 2202847 writes
signal-based scheme: 9830061167 reads, 6700 writes
sys_membarrier: 9952759104 reads, 425 writes
sys_membarrier (dyn. check): 7970328887 reads, 425 writes

The dynamic sys_membarrier availability check adds some overhead to
the read-side compared to the signal-based scheme, but besides that,
sys_membarrier slightly outperforms the signal-based scheme. However,
this non-expedited sys_membarrier implementation has a much slower grace
period than signal and memory barrier schemes.
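
The "dyn. check" variant measured above can be sketched as follows. This is a minimal illustration, not liburcu's implementation: every demo_* name is hypothetical, the two command values mirror MEMBARRIER_CMD_QUERY and MEMBARRIER_CMD_SHARED from this patch, and the compiler barrier uses GCC/Clang inline-asm syntax. The point is that readers pay only for a flag test and a compiler barrier when the system call is available, and fall back to a real fence when it is not.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() under strict -std modes */
#endif
#include <errno.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values mirror MEMBARRIER_CMD_QUERY / MEMBARRIER_CMD_SHARED. */
enum { DEMO_CMD_QUERY = 0, DEMO_CMD_SHARED = 1 << 0 };

/* Compiler-only barrier: constrains the compiler, emits no instruction. */
#define demo_barrier() __asm__ __volatile__("" ::: "memory")

/* Set once at start-up; read on every read-side barrier. */
static bool demo_has_membarrier;

static long demo_membarrier(int cmd, int flags)
{
#ifdef SYS_membarrier
    return syscall(SYS_membarrier, cmd, flags);
#else
    (void)cmd;
    (void)flags;
    errno = ENOSYS;             /* headers predate the system call */
    return -1;
#endif
}

/* Dynamic availability check: call once before any reader or writer runs. */
static void demo_init(void)
{
    long mask = demo_membarrier(DEMO_CMD_QUERY, 0);

    demo_has_membarrier = mask > 0 && (mask & DEMO_CMD_SHARED) != 0;
}

/* Read side (frequent): compiler barrier if possible, full fence otherwise. */
static inline void demo_read_side_barrier(void)
{
    if (demo_has_membarrier)
        demo_barrier();
    else
        __sync_synchronize();
}

/* Write side (rare): pays for the system call on behalf of all readers. */
static inline void demo_write_side_barrier(void)
{
    if (demo_has_membarrier)
        (void)demo_membarrier(DEMO_CMD_SHARED, 0); /* real code should check */
    else
        __sync_synchronize();
}
```

Both sides must agree on the mode, which is why the flag is decided once in demo_init(): mixing a compiler-only reader with a plain fence on the writer would lose the ordering guarantee.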

Besides diminishing the number of wake-ups, one major advantage of the
membarrier system call over the signal-based scheme is that it does not
need to reserve a signal. This plays much more nicely with libraries,
and with processes injected into for tracing purposes, for which we
cannot expect that signals will be unused by the application.

An expedited version of this system call can be added later on to speed
up the grace period. Its implementation will likely depend on reading
the cpu_curr()->mm without holding each CPU's rq lock.

This patch adds the system call to x86 and to asm-generic.

[1] http://urcu.so

membarrier(2) man page:

MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2)

NAME
membarrier - issue memory barriers on a set of threads

SYNOPSIS
#include <linux/membarrier.h>

int membarrier(int cmd, int flags);

DESCRIPTION
The cmd argument is one of the following:

MEMBARRIER_CMD_QUERY
Query the set of supported commands. It returns a bitmask of
supported commands.

MEMBARRIER_CMD_SHARED
Execute a memory barrier on all threads running on the system.
Upon return from system call, the caller thread is ensured that
all running threads have passed through a state where all memory
accesses to user-space addresses match program order between
entry to and return from the system call (non-running threads
are de facto in such a state). This covers threads from all
processes running on the system. This command returns 0.

The flags argument needs to be 0. For future extensions.

All memory accesses performed in program order from each targeted
thread are guaranteed to be ordered with respect to sys_membarrier(). If
we use the semantic "barrier()" to represent a compiler barrier forcing
memory accesses to be performed in program order across the barrier,
and smp_mb() to represent explicit memory barriers forcing full memory
ordering across the barrier, we have the following ordering table for
each pair of barrier(), sys_membarrier() and smp_mb():

The pair ordering is detailed as (O: ordered, X: not ordered):

                       barrier()   smp_mb()   sys_membarrier()
barrier()                 X           X              O
smp_mb()                  X           O              O
sys_membarrier()          O           O              O

RETURN VALUE
On success, these system calls return zero. On error, -1 is returned,
and errno is set appropriately. For a given command, with flags
argument set to 0, this system call is guaranteed to always return the
same value until reboot.

ERRORS
ENOSYS System call is not implemented.

EINVAL Invalid arguments.

Linux 2015-04-15 MEMBARRIER(2)
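
As a hedged illustration of the interface described in the man page above, the sketch below queries the supported commands before relying on MEMBARRIER_CMD_SHARED. It assumes the call is reached through syscall(2), since no libc wrapper is described here; the demo_* names are hypothetical, and the two command values are written out locally (0 for the query, bit 0 for the shared barrier) rather than taken from the header.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() under strict -std modes */
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local copies of the command values, to keep the sketch self-contained. */
enum {
    DEMO_MEMBARRIER_CMD_QUERY  = 0,
    DEMO_MEMBARRIER_CMD_SHARED = 1 << 0,
};

/* Thin wrapper: returns the raw result, or -1 with errno set on error. */
static long demo_sys_membarrier(int cmd, int flags)
{
#ifdef SYS_membarrier
    return syscall(SYS_membarrier, cmd, flags);
#else
    (void)cmd;
    (void)flags;
    errno = ENOSYS;             /* headers predate the system call */
    return -1;
#endif
}

/*
 * Returns 1 if MEMBARRIER_CMD_SHARED is advertised by the query bitmask,
 * 0 if the system call exists but does not advertise it, and -1 if the
 * query itself fails (for example with ENOSYS).
 */
static int demo_shared_supported(void)
{
    long mask = demo_sys_membarrier(DEMO_MEMBARRIER_CMD_QUERY, 0);

    if (mask < 0)
        return -1;
    return (mask & DEMO_MEMBARRIER_CMD_SHARED) ? 1 : 0;
}
```

A caller would typically run demo_shared_supported() once at start-up and issue demo_sys_membarrier(DEMO_MEMBARRIER_CMD_SHARED, 0) only when it returned 1, keeping the flags argument at 0 as the man page requires.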

Signed-off-by: Mathieu Desnoyers
Reviewed-by: Paul E. McKenney
Reviewed-by: Josh Triplett
Cc: KOSAKI Motohiro
Cc: Steven Rostedt
Cc: Nicholas Miell
Cc: Ingo Molnar
Cc: Alan Cox
Cc: Lai Jiangshan
Cc: Stephen Hemminger
Cc: Thomas Gleixner
Cc: Peter Zijlstra
Cc: David Howells
Cc: Pranith Kumar
Cc: Michael Kerrisk
Cc: Shuah Khan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mathieu Desnoyers
2015-09-12 06:21:34 +0800

11 Sep, 2015

2 commits

2965faa5e kexec: split kexec_load syscall from kexec core code

There are two kexec load syscalls, kexec_load and kexec_file_load.
kexec_file_load has already been split out into kernel/kexec_file.c. In
this patch I split the kexec_load syscall code out into kernel/kexec.c.

Also add a new Kconfig option, KEXEC_CORE, so we can disable kexec_load and
use kexec_file_load only, or vice versa.

The original requirement is from Ted Ts'o, who wants the kexec kernel
signature to be checked when CONFIG_KEXEC_VERIFY_SIG is enabled. But
kexec-tools can bypass that check by using the kexec_load syscall.

Vivek Goyal proposed creating a common Kconfig option so that users can
compile in only one syscall for loading the kexec kernel. KEXEC/KEXEC_FILE
select KEXEC_CORE so that old config files still work.

Because general code needs CONFIG_KEXEC_CORE, I updated all the
architecture Kconfig files with the new option KEXEC_CORE and let KEXEC
select KEXEC_CORE in the arch Kconfig. I also updated the general kernel
code accordingly.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Dave Young
Cc: Eric W. Biederman
Cc: Vivek Goyal
Cc: Petr Tesarik
Cc: Theodore Ts'o
Cc: Josh Boyer
Cc: David Howells
Cc: Geert Uytterhoeven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dave Young
2015-09-11 04:29:01 +0800
90f023030 kmod: use system_unbound_wq instead of khelper

We need to launch the usermodehelper kernel threads with the widest
affinity and this is partly why we use khelper. This workqueue has
unbound properties and thus a wide affinity inherited by all its children.

Now khelper also has special properties that we aren't much interested in:
ordered and singlethread. There is really no need for ordering, as all
we do is create kernel threads, which can be done concurrently. And
singlethread is a useless limitation as well.

The workqueue engine already provides generic unbound workqueues that
don't share these useless properties and handle parallel jobs well.

The only worrisome detail is their affinity to the node of the current
CPU. That is fine for creating the usermodehelper kernel threads, but those
threads inherit this affinity for longer jobs such as requesting modules.

This patch proposes to use these node affine unbound workqueues assuming
that a node is sufficient to handle several parallel usermodehelper
requests.

Signed-off-by: Frederic Weisbecker
Cc: Rik van Riel
Reviewed-by: Oleg Nesterov
Cc: Christoph Lameter
Cc: Tejun Heo
Cc: Rusty Russell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Frederic Weisbecker
2015-09-11 04:29:01 +0800

09 Sep, 2015

1 commit

b793c005c Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security

Pull security subsystem updates from James Morris:
"Highlights:

- PKCS#7 support added to support signed kexec, also utilized for
module signing. See comments in 3f1e1bea.

** NOTE: this requires linking against the OpenSSL library, which
must be installed, e.g. the openssl-devel on Fedora **

- Smack
- add IPv6 host labeling; ignore labels on kernel threads
- support smack labeling mounts which use binary mount data

- SELinux:
- add ioctl whitelisting (see
http://kernsec.org/files/lss2015/vanderstoep.pdf)
- fix mprotect PROT_EXEC regression caused by mm change

- Seccomp:
- add ptrace options for suspend/resume"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (57 commits)
PKCS#7: Add OIDs for sha224, sha284 and sha512 hash algos and use them
Documentation/Changes: Now need OpenSSL devel packages for module signing
scripts: add extract-cert and sign-file to .gitignore
modsign: Handle signing key in source tree
modsign: Use if_changed rule for extracting cert from module signing key
Move certificate handling to its own directory
sign-file: Fix warning about BIO_reset() return value
PKCS#7: Add MODULE_LICENSE() to test module
Smack - Fix build error with bringup unconfigured
sign-file: Document dependency on OpenSSL devel libraries
PKCS#7: Appropriately restrict authenticated attributes and content type
KEYS: Add a name for PKEY_ID_PKCS7
PKCS#7: Improve and export the X.509 ASN.1 time object decoder
modsign: Use extract-cert to process CONFIG_SYSTEM_TRUSTED_KEYS
extract-cert: Cope with multiple X.509 certificates in a single file
sign-file: Generate CMS message as signature instead of PKCS#7
PKCS#7: Support CMS messages also [RFC5652]
X.509: Change recorded SKID & AKID to not include Subject or Issuer
PKCS#7: Check content type and versions
MAINTAINERS: The keyrings mailing list has moved
...

Linus Torvalds
2015-09-09 03:41:25 +0800

06 Sep, 2015

2 commits

7d9071a09 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs updates from Al Viro:
"In this one:

- d_move fixes (Eric Biederman)

- UFS fixes (me; locking is mostly sane now, a bunch of bugs in error
handling ought to be fixed)

- switch of sb_writers to percpu rwsem (Oleg Nesterov)

- superblock scalability (Josef Bacik and Dave Chinner)

- swapon(2) race fix (Hugh Dickins)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits)
vfs: Test for and handle paths that are unreachable from their mnt_root
dcache: Reduce the scope of i_lock in d_splice_alias
dcache: Handle escaped paths in prepend_path
mm: fix potential data race in SyS_swapon
inode: don't softlockup when evicting inodes
inode: rename i_wb_list to i_io_list
sync: serialise per-superblock sync operations
inode: convert inode_sb_list_lock to per-sb
inode: add hlist_fake to avoid the inode hash lock in evict
writeback: plug writeback at a high level
change sb_writers to use percpu_rw_semaphore
shift percpu_counter_destroy() into destroy_super_work()
percpu-rwsem: kill CONFIG_PERCPU_RWSEM
percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
percpu-rwsem: introduce percpu_down_read_trylock()
document rwsem_release() in sb_wait_write()
fix the broken lockdep logic in __sb_start_write()
introduce __sb_writers_{acquired,release}() helpers
ufs_inode_get{frag,block}(): get rid of 'phys' argument
ufs_getfrag_block(): tidy up a bit
...

Linus Torvalds
2015-09-06 11:34:28 +0800
6c0f568e8 Merge branch 'akpm' (patches from Andrew)

Merge patch-bomb from Andrew Morton:

- a few misc things

- Andy's "ambient capabilities"

- fs/notify updates

- the ocfs2 queue

- kernel/watchdog.c updates and feature work.

- some of MM. Includes Andrea's userfaultfd feature.

[ Hadn't noticed that userfaultfd was 'default y' when applying the
patches, so that got fixed in this merge instead. We do _not_ mark
new features that nobody uses yet 'default y' - Linus ]

* emailed patches from Andrew Morton: (118 commits)
mm/hugetlb.c: make vma_has_reserves() return bool
mm/madvise.c: make madvise_behaviour_valid() return bool
mm/memory.c: make tlb_next_batch() return bool
mm/dmapool.c: change is_page_busy() return from int to bool
mm: remove struct node_active_region
mremap: simplify the "overlap" check in mremap_to()
mremap: don't do uneccesary checks if new_len == old_len
mremap: don't do mm_populate(new_addr) on failure
mm: move ->mremap() from file_operations to vm_operations_struct
mremap: don't leak new_vma if f_op->mremap() fails
mm/hugetlb.c: make vma_shareable() return bool
mm: make GUP handle pfn mapping unless FOLL_GET is requested
mm: fix status code which move_pages() returns for zero page
mm: memcontrol: bring back the VM_BUG_ON() in mem_cgroup_swapout()
genalloc: add support of multiple gen_pools per device
genalloc: add name arg to gen_pool_get() and devm_gen_pool_create()
mm/memblock: WARN_ON when nid differs from overlap region
Documentation/features/vm: add feature description and arch support status for batched TLB flush after unmap
mm: defer flush of writable TLB entries
mm: send one IPI per CPU to TLB flush all entries after unmapping pages
...

Linus Torvalds
2015-09-06 05:27:38 +0800

05 Sep, 2015

2 commits

72b252aed mm: send one IPI per CPU to TLB flush all entries after unmapping pages

An IPI is sent to flush remote TLBs when a page is unmapped that was
potentially accessed by other CPUs. There are many circumstances where
this happens but the obvious one is kswapd reclaiming pages belonging to a
running process as kswapd and the task are likely running on separate
CPUs.

On small machines, this is not a significant problem but as machines get
larger with more cores and more memory, the cost of these IPIs can be
high. This patch uses a simple structure that tracks CPUs that
potentially have TLB entries for pages being unmapped. When the unmapping
is complete, the full TLB is flushed on the assumption that a refill cost
is lower than flushing individual entries.
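
The batching idea in the paragraph above can be sketched in user space. This is only a simulation under stated assumptions: struct demo_tlb_batch and the demo_* helpers are hypothetical names, a 32-bit word stands in for a CPU mask, and a counter stands in for the IPIs that the real code would send. It shows why unmapping many pages costs at most one flush request per CPU instead of one per page.

```c
#include <stdint.h>

#define DEMO_NR_CPUS 8

/* Hypothetical batch: one bit per CPU that may hold stale TLB entries. */
struct demo_tlb_batch {
    uint32_t cpumask;         /* CPUs to flush once unmapping is done */
    unsigned int ipis_sent;   /* running total of simulated IPIs */
};

/* Record that 'cpu' may have cached a translation for an unmapped page. */
static void demo_batch_add_cpu(struct demo_tlb_batch *b, unsigned int cpu)
{
    b->cpumask |= UINT32_C(1) << cpu;
}

/*
 * Called after all pages in the batch are unmapped: send one simulated
 * IPI to each recorded CPU (each would flush its whole TLB), then reset
 * the batch. Returns the number of IPIs sent for this batch.
 */
static unsigned int demo_batch_flush(struct demo_tlb_batch *b)
{
    unsigned int sent = 0;

    for (unsigned int cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
        if (b->cpumask & (UINT32_C(1) << cpu))
            sent++;

    b->cpumask = 0;
    b->ipis_sent += sent;
    return sent;
}
```

If 100 pages are unmapped and each was potentially accessed on CPUs 1 and 5, demo_batch_add_cpu() is called 200 times but demo_batch_flush() reports only two simulated IPIs.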

Architectures wishing to do this must give the following guarantee.

If a clean page is unmapped and not immediately flushed, the
architecture must guarantee that a write to that linear address
from a CPU with a cached TLB entry will trap a page fault.

This is essentially what the kernel already depends on but the window is
much larger with this patch applied and is worth highlighting. The
architecture should consider whether the cost of the full TLB flush is
higher than sending an IPI to flush each individual entry. An additional
architecture helper called flush_tlb_local is required. It's a trivial
wrapper with some accounting in the x86 case.

The impact of this patch depends on the workload as measuring any benefit
requires both mapped pages co-located on the LRU and memory pressure. The
case with the biggest impact is multiple processes reading mapped pages
taken from the vm-scalability test suite. The test case uses NR_CPU
readers of mapped files that consume 10*RAM.

Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs

                                         4.2.0-rc1          4.2.0-rc1
                                           vanilla       flushfull-v7
Ops lru-file-mmap-read-elapsed      159.62 (  0.00%)   120.68 ( 24.40%)
Ops lru-file-mmap-read-time_range    30.59 (  0.00%)     2.80 ( 90.85%)
Ops lru-file-mmap-read-time_stddv     6.70 (  0.00%)     0.64 ( 90.38%)

           4.2.0-rc1     4.2.0-rc1
             vanilla  flushfull-v7
User          581.00        611.43
System       5804.93       4111.76
Elapsed       161.03        122.12

This is showing that the readers completed 24.40% faster with 29% less
system CPU time. From vmstats, it is known that the vanilla kernel was
interrupted roughly 900K times per second during the steady phase of the
test and the patched kernel was interrupted 180K times per second.

The impact is lower on a single socket machine.

                                           4.2.0-rc1           4.2.0-rc1
                                             vanilla        flushfull-v7
Ops lru-file-mmap-read-elapsed       25.33 (  0.00%)     20.38 ( 19.54%)
Ops lru-file-mmap-read-time_range     0.91 (  0.00%)      1.44 (-58.24%)
Ops lru-file-mmap-read-time_stddv     0.28 (  0.00%)      0.47 (-65.34%)

           4.2.0-rc1     4.2.0-rc1
             vanilla  flushfull-v7
User           58.09         57.64
System        111.82         76.56
Elapsed        27.29         22.55

It's still a noticeable improvement with vmstat showing interrupts went
from roughly 500K per second to 45K per second.

The patch will have no impact on workloads that are under no memory
pressure or that have relatively few mapped pages. It will have an
unpredictable impact on the
workload running on the CPU being flushed as it'll depend on how many TLB
entries need to be refilled and how long that takes. Worst case, the TLB
will be completely cleared of active entries when the target PFNs were not
resident at all.

[sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush]
Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Cc: Dave Hansen
Acked-by: Ingo Molnar
Cc: Linus Torvalds
Signed-off-by: Sasha Levin
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2015-09-05 07:54:41 +0800
a14c151e5 userfaultfd: buildsystem activation

This allows userfaultfd to be selected during configuration so that it is built.

Signed-off-by: Andrea Arcangeli
Acked-by: Pavel Emelyanov
Cc: Sanidhya Kashyap
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov"
Cc: Andres Lagar-Cavilla
Cc: Dave Hansen
Cc: Paolo Bonzini
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Andy Lutomirski
Cc: Hugh Dickins
Cc: Peter Feiner
Cc: "Dr. David Alan Gilbert"
Cc: Johannes Weiner
Cc: "Huangpeng (Peter)"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2015-09-05 07:54:41 +0800

02 Sep, 2015

1 commit

8bdc69b76 Merge branch 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

- a new PIDs controller is added. It turns out that PIDs are actually
an independent resource from kmem due to the limited PID space.

- more core preparations for the v2 interface. Once cpu side interface
is settled, it should be ready for lifting the devel mask.
for-4.3-unified-base was temporarily branched so that other trees
(block) can pull cgroup core changes that blkcg changes depend on.

- a non-critical idr_preload usage bug fix.

* 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: pids: fix invalid get/put usage
cgroup: introduce cgroup_subsys->legacy_name
cgroup: don't print subsystems for the default hierarchy
cgroup: make cftype->private a unsigned long
cgroup: export cgrp_dfl_root
cgroup: define controller file conventions
cgroup: fix idr_preload usage
cgroup: add documentation for the PIDs controller
cgroup: implement the PIDs subsystem
cgroup: allow a cgroup subsystem to reject a fork

Linus Torvalds
2015-09-02 23:04:23 +0800

15 Aug, 2015

1 commit

bf3eac84c percpu-rwsem: kill CONFIG_PERCPU_RWSEM

Remove CONFIG_PERCPU_RWSEM; the next patch adds an unconditional
user of percpu_rw_semaphore.

Signed-off-by: Oleg Nesterov

Oleg Nesterov
2015-08-15 19:52:11 +0800

14 Aug, 2015

1 commit

cfc411e7f Move certificate handling to its own directory

Move certificate handling out of the kernel/ directory and into a certs/
directory to get all the weird stuff in one place and move the generated
signing keys into this directory.

Signed-off-by: David Howells
Reviewed-by: David Woodhouse

David Howells
2015-08-14 23:06:13 +0800

13 Aug, 2015

1 commit

228c37ff9 sign-file: Document dependency on OpenSSL devel libraries

The revised sign-file program is no longer a script that wraps the openssl
program, but now rather a program that makes use of OpenSSL's crypto
library. This means that to build the sign-file program, the kernel build
process now has a dependency on the OpenSSL development packages in
addition to OpenSSL itself.

Document this in Kconfig and in module-signing.txt.

Signed-off-by: David Howells
Reviewed-by: David Woodhouse

David Howells
2015-08-13 00:01:01 +0800

12 Aug, 2015

1 commit

9b9412dc7 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu

Pull RCU changes from Paul E. McKenney:

- The combination of tree geometry-initialization simplifications
and OS-jitter-reduction changes to expedited grace periods.
These two are stacked due to the large number of conflicts
that would otherwise result.

[ With one addition, a temporary commit to silence a lockdep false
positive. Additional changes to the expedited grace-period
primitives (queued for 4.4) remove the cause of this false
positive, and therefore include a revert of this temporary commit. ]

- Documentation updates.

- Torture-test updates.

- Miscellaneous fixes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>

Ingo Molnar
2015-08-12 18:12:12 +0800

07 Aug, 2015

7 commits

99d27b1b5 modsign: Add explicit CONFIG_SYSTEM_TRUSTED_KEYS option

Let the user explicitly provide a file containing trusted keys, instead of
just automatically finding files matching *.x509 in the build tree and
trusting whatever we find. This really ought to be an *explicit*
configuration, and the build rules for dealing with the files were
fairly painful too.

A fix from James Morris is also applied: it removes an '=' from a macro
definition in kernel/Makefile, because that syntax is only supported from
GNU make 3.82 onwards.

Signed-off-by: David Woodhouse
Signed-off-by: David Howells

David Woodhouse
2015-08-07 23:26:14 +0800
fb1179499 modsign: Use single PEM file for autogenerated key

The current rule for generating signing_key.priv and signing_key.x509 is
a classic example of a bad rule which has a tendency to break parallel
make. When invoked to create *either* target, it generates the other
target as a side-effect that make didn't predict.

So let's switch to using a single file signing_key.pem which contains
both key and certificate. That matches what we do in the case of an
external key specified by CONFIG_MODULE_SIG_KEY anyway, so it's also
slightly cleaner.

Signed-off-by: David Woodhouse
Signed-off-by: David Howells

David Woodhouse
2015-08-07 23:26:14 +0800
1329e8cc6 modsign: Extract signing cert from CONFIG_MODULE_SIG_KEY if needed

Where an external PEM file or PKCS#11 URI is given, we can get the cert
from it for ourselves instead of making the user drop signing_key.x509
in place for us.

Signed-off-by: David Woodhouse
Signed-off-by: David Howells

David Woodhouse
2015-08-07 23:26:14 +0800
19e91b69d modsign: Allow external signing key to be specified

Signed-off-by: David Woodhouse
Signed-off-by: David Howells

David Woodhouse
2015-08-07 23:26:14 +0800
091f6e26e MODSIGN: Extract the blob PKCS#7 signature verifier from module signing

Extract the function that drives the PKCS#7 signature verification given a
data blob and a PKCS#7 blob out from the module signing code and lump it with
the system keyring code as it's generic. This makes it independent of module
config options and opens it to use by the firmware loader.

Signed-off-by: David Howells
Cc: Luis R. Rodriguez
Cc: Rusty Russell
Cc: Ming Lei
Cc: Seth Forshee
Cc: Kyle McMartin

David Howells
2015-08-07 23:26:13 +0800
3f1e1bea3 MODSIGN: Use PKCS#7 messages as module signatures

Move to using PKCS#7 messages as module signatures because:

(1) We have to be able to support the use of X.509 certificates that don't
have a subjKeyId set. We're currently relying on this to look up the
X.509 certificate in the trusted keyring list.

(2) PKCS#7 message signed information blocks have a field that supplies the
data required to match with the X.509 certificate that signed it.

(3) The PKCS#7 certificate carries fields that specify the digest algorithm
used to generate the signature in a standardised way and the X.509
certificates specify the public key algorithm in a standardised way - so
we don't need our own methods of specifying these.

(4) We now have PKCS#7 message support in the kernel for signed kexec purposes
and we can make use of this.

To make this work, a previous patch replaced the old sign-file script with
a program that needs compiling. The rules to build it are added here.

Signed-off-by: David Howells
Tested-by: Vivek Goyal

David Howells
2015-08-07 23:26:13 +0800
4248b0da4 fs, file table: reinit files_stat.max_files after deferred memory initialisation

Dave Hansen reported the following:

My laptop has been behaving strangely with 4.2-rc2. Once I log
in to my X session, I start getting all kinds of strange errors
from applications and see this in my dmesg:

VFS: file-max limit 8192 reached

The problem is that the file-max is calculated before memory is fully
initialised and miscalculates how much memory the kernel is using. This
patch recalculates file-max after deferred memory initialisation. Note
that using memory hotplug infrastructure would not have avoided this
problem as the value is not recalculated after memory hot-add.

4.1: files_stat.max_files = 6582781
4.2-rc2: files_stat.max_files = 8192
4.2-rc2 patched: files_stat.max_files = 6562467

The patched value differs slightly from the 4.1 value, but not enough to matter.
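
A rough model of why the limit collapses to 8192 is sketched below. It assumes the sizing heuristic is "about 10% of the memory not already in use, at roughly 1 KiB per open file, with a fixed floor"; that formula, the 4 KiB page size and the names `model_max_files`, `MODEL_NR_FILE` and `MODEL_PAGE_SIZE` are assumptions made for illustration, and the page counts used with it are made up, not taken from the report.

```c
#include <assert.h>

#define MODEL_NR_FILE   8192UL  /* assumed floor; matches the limit in the report */
#define MODEL_PAGE_SIZE 4096UL  /* assumed 4 KiB pages */

/*
 * Estimate the file-max limit from the number of pages the kernel currently
 * knows about (totalram_pages) and how many of those are free (free_pages).
 */
unsigned long model_max_files(unsigned long totalram_pages,
                              unsigned long free_pages)
{
    assert(totalram_pages > 0 && free_pages <= totalram_pages);

    /* Reserve 1.5x the memory already in use, capped below the total. */
    unsigned long memreserve = (totalram_pages - free_pages) * 3 / 2;
    if (memreserve > totalram_pages - 1)
        memreserve = totalram_pages - 1;

    /* 10% of the remainder, expressed in KiB, i.e. about 1 KiB per file. */
    unsigned long n =
        (totalram_pages - memreserve) * (MODEL_PAGE_SIZE / 1024) / 10;

    return n > MODEL_NR_FILE ? n : MODEL_NR_FILE;
}
```

With only 20000 pages visible and 15000 of them in use, as might be the case before deferred initialisation finishes, the estimate is 0 and the floor of 8192 wins. With 4194304 pages visible and 100000 in use, the same function yields 1617721, which is why rerunning the calculation later restores a sensible limit.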

Signed-off-by: Mel Gorman
Reported-by: Dave Hansen
Cc: Nicolai Stange
Cc: Dave Hansen
Cc: Alex Ng
Cc: Fengguang Wu
Cc: Peter Zijlstra (Intel)
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2015-08-07 09:39:40 +0800

23 Jul, 2015

1 commit

be55fa2ad rcu: Hide RCU_NOCB_CPU behind RCU_EXPERT

This commit prevents Kconfig from asking the user about RCU_NOCB_CPU
unless the user really wants to be asked.

Reported-by: Ingo Molnar
Signed-off-by: Paul E. McKenney
Cc: Steven Rostedt
Cc: Sebastian Andrzej Siewior
Cc: Thomas Gleixner

Paul E. McKenney
2015-07-23 06:27:25 +0800

15 Jul, 2015

1 commit

49b786ea1 cgroup: implement the PIDs subsystem

Adds a new single-purpose PIDs subsystem to limit the number of
tasks that can be forked inside a cgroup. Essentially this is an
implementation of RLIMIT_NPROC that applies to a cgroup rather than a
process tree.

However, it should be noted that organisational operations (adding and
removing tasks from a PIDs hierarchy) will *not* be prevented. Rather,
the number of tasks in the hierarchy cannot exceed the limit through
forking. This is due to the fact that, in the unified hierarchy, attach
cannot fail (and it is not possible for a task to overcome its PIDs
cgroup policy limit by attaching to a child cgroup -- even if migrating
mid-fork it must be able to fork in the parent first).

PIDs are fundamentally a global resource, and it is possible to reach
PID exhaustion inside a cgroup without hitting any reasonable kmemcg
policy. Once you've hit PID exhaustion, you're only in a marginally
better state than OOM. This subsystem allows PID exhaustion inside a
cgroup to be prevented.
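
The fork and attach behaviour described above can be modelled in user space as follows. This is a sketch under assumptions, not the subsystem's implementation: `struct pids_group` and the three functions are hypothetical names, a negative limit stands for "no limit", and locking is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one cgroup in a PIDs hierarchy. */
struct pids_group {
    struct pids_group *parent;  /* NULL for the root */
    long count;                 /* tasks in this group and its descendants */
    long limit;                 /* negative means "no limit" */
};

/*
 * Fork path: charge one task to g and every ancestor. If any level would
 * exceed its limit, undo the charges made so far and reject the fork.
 */
bool pids_try_charge(struct pids_group *g)
{
    for (struct pids_group *p = g; p != NULL; p = p->parent) {
        p->count++;
        if (p->limit >= 0 && p->count > p->limit) {
            /* Roll back every level from g up to and including p. */
            for (struct pids_group *q = g; q != p->parent; q = q->parent)
                q->count--;
            return false;
        }
    }
    return true;
}

/* Exit path: release one task from g and every ancestor. */
void pids_uncharge(struct pids_group *g)
{
    for (struct pids_group *p = g; p != NULL; p = p->parent) {
        assert(p->count > 0);
        p->count--;
    }
}

/* Attach path: never fails, so the destination may end up over its limit. */
void pids_attach(struct pids_group *from, struct pids_group *to)
{
    pids_uncharge(from);
    for (struct pids_group *p = to; p != NULL; p = p->parent)
        p->count++;  /* limit deliberately not checked here */
}
```

In this model a group with a limit of 2 accepts two forks and rejects the third, while attaching a task from a sibling still succeeds and leaves the count at 3. Further forks in that group keep failing until enough tasks exit.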

Signed-off-by: Aleksa Sarai
Signed-off-by: Tejun Heo

Aleksa Sarai
2015-07-15 05:29:23 +0800

07 Jul, 2015

1 commit

d1ec4c34c rcu: Drop RCU_USER_QS in favor of NO_HZ_FULL

The RCU_USER_QS Kconfig parameter is now just a synonym for NO_HZ_FULL,
so this commit eliminates RCU_USER_QS, replacing all uses with NO_HZ_FULL.

Reported-by: Frederic Weisbecker
Signed-off-by: Paul E. McKenney
Acked-by: Frederic Weisbecker

Paul E. McKenney
2015-07-07 04:52:18 +0800

04 Jul, 2015

1 commit

22a093b2f Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:
"Debug info and other statistics fixes and related enhancements"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/numa: Fix numa balancing stats in /proc/pid/sched
sched/numa: Show numa_group ID in /proc/sched_debug task listings
sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h
sched/stat: Expose /proc/pid/schedstat if CONFIG_SCHED_INFO=y
sched/stat: Simplify the sched_info accounting dependency

Linus Torvalds
2015-07-04 23:56:53 +0800