01 Feb, 2017

1 commit

  • commit ff9f8a7cf935468a94d9927c68b00daae701667e upstream.

    We perform the conversion between kernel jiffies and ms only when
    exporting the kernel value to user space.

    We need to do the opposite operation when the value is written by the
    user.

    This only matters when HZ != 1000.
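    For orientation, these are the standard kernel conversion helpers
    involved (kernel_val and user_ms are illustrative names; the proc
    handler wiring around them is omitted):

        unsigned int ms = jiffies_to_msecs(kernel_val);  /* on read: jiffies -> ms */
        unsigned long j = msecs_to_jiffies(user_ms);     /* on write: ms -> jiffies,
                                                            the missing half */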

    Signed-off-by: Eric Dumazet
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

11 Oct, 2016

1 commit

  • Pull misc vfs updates from Al Viro:
    "Assorted misc bits and pieces.

    There are several single-topic branches left after this (rename2
    series from Miklos, current_time series from Deepa Dinamani, xattr
    series from Andreas, uaccess stuff from me) and I'd prefer to
    send those separately"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
    proc: switch auxv to use of __mem_open()
    hpfs: support FIEMAP
    cifs: get rid of unused arguments of CIFSSMBWrite()
    posix_acl: uapi header split
    posix_acl: xattr representation cleanups
    fs/aio.c: eliminate redundant loads in put_aio_ring_file
    fs/internal.h: add const to ns_dentry_operations declaration
    compat: remove compat_printk()
    fs/buffer.c: make __getblk_slow() static
    proc: unsigned file descriptors
    fs/file: more unsigned file descriptors
    fs: compat: remove redundant check of nr_segs
    cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
    cifs: don't use memcpy() to copy struct iov_iter
    get rid of separate multipage fault-in primitives
    fs: Avoid premature clearing of capabilities
    fs: Give dentry to inode_change_ok() instead of inode
    fuse: Propagate dentry down to inode_change_ok()
    ceph: Propagate dentry down to inode_change_ok()
    xfs: Propagate dentry down to inode_change_ok()
    ...

    Linus Torvalds
     

07 Oct, 2016

1 commit

  • Pull namespace updates from Eric Biederman:
    "This set of changes is a number of smaller things that have been
    overlooked in other development cycles focused on more fundamental
    change. The devpts changes are small things that were a distraction
    until we managed to kill off DEVPTS_MULTIPLE_INSTANCES. There is a
    trivial regression fix to autofs for the unprivileged mount changes
    that went in last cycle. A pair of ioctls has been added by Andrey
    Vagin, making it possible to discover the relationships between
    namespaces when referring to them through file descriptors.

    The big user visible change is starting to add simple resource limits
    to catch programs that misbehave. With namespaces in general and user
    namespaces in particular allowing users to use more kinds of
    resources, it has become important to have something to limit errant
    programs. Because the purpose of these limits is to catch errant
    programs, the code needs to be inexpensive to use as it is always on,
    and the default limits need to be high enough that well-behaved
    programs on well-behaved systems don't encounter them.

    To this end, after some review I have implemented per-user,
    per-user-namespace limits, and use them to limit the number of
    namespaces. The limits being per user mean that one user cannot
    exhaust the limits of another user. The limits being per user
    namespace allow contexts where the limit is 0, so security-conscious
    folks can remove the code used to manage namespaces from their threat
    analysis (as they have historically done, since it was root-only). At
    the same time, the limits being per user namespace allow other parts
    of the system to use namespaces.

    Namespaces are increasingly being used in application sandboxing
    scenarios, so an all-or-nothing, system-wide disable for the
    security-conscious would make increasing use of these sandboxes
    impossible.

    There is also a new limit on the maximum number of mounts present in
    a single mount namespace. It is nontrivial to guess what a reasonable
    system-wide limit on the number of mount structures in the kernel
    would be, especially as it varies based on how a system is using
    containers. A limit on the number of mounts in a mount namespace,
    however, is much easier to understand and set. In most cases in
    practice only about 1000 mounts are used. Given that some autofs
    scenarios have the potential to reach 30,000 to 50,000 mounts, I have
    set the default limit for the number of mounts at 100,000, which is
    well above every known set of users but low enough that the mount
    hash tables don't degrade unreasonably.

    These limits are a start. I expect this establishes a pattern that
    other limits for resources that namespaces use will follow. There has
    been interest in making inotify event limits per user per user
    namespace, as well as interest expressed in making details about what
    is going on in the kernel more visible"
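    The new knobs land under /proc/sys/user/. A minimal user-space peek,
    assuming this series is applied (a missing file just means an older
    kernel):

        #include <stdio.h>

        int main(void)
        {
            const char *knobs[] = {
                "/proc/sys/user/max_user_namespaces",
                "/proc/sys/user/max_mnt_namespaces",
                "/proc/sys/user/max_net_namespaces",
            };
            char buf[64];

            for (int i = 0; i < 3; i++) {
                FILE *f = fopen(knobs[i], "r");
                if (!f)
                    continue;              /* kernel predates this series */
                if (fgets(buf, sizeof(buf), f))
                    printf("%s = %s", knobs[i], buf);
                fclose(f);
            }
            return 0;
        }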

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
    autofs: Fix automounts by using current_real_cred()->uid
    mnt: Add a per mount namespace limit on the number of mounts
    netns: move {inc,dec}_net_namespaces into #ifdef
    nsfs: Simplify __ns_get_path
    tools/testing: add a test to check nsfs ioctl-s
    nsfs: add ioctl to get a parent namespace
    nsfs: add ioctl to get an owning user namespace for ns file descriptor
    kernel: add a helper to get an owning user namespace for a namespace
    devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
    devpts: Remove sync_filesystems
    devpts: Make devpts_kill_sb safe if fsi is NULL
    devpts: Simplify devpts_mount by using mount_nodev
    devpts: Move the creation of /dev/pts/ptmx into fill_super
    devpts: Move parse_mount_options into fill_super
    userns: When the per user per user namespace limit is reached return ENOSPC
    userns: Document per user per user namespace limits.
    mntns: Add a limit on the number of mount namespaces.
    netns: Add a limit on the number of net namespaces
    cgroupns: Add a limit on the number of cgroup namespaces
    ipcns: Add a limit on the number of ipc namespaces
    ...

    Linus Torvalds
     

01 Oct, 2016

1 commit

  • CAI Qian pointed out that the semantics
    of shared subtrees make it possible to create an exponentially
    increasing number of mounts in a mount namespace.

    mkdir /tmp/1 /tmp/2
    mount --make-rshared /
    for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

    will create 2^20 (1,048,576) mounts, which is a practical problem, as
    some people have managed to hit this by accident.

    As such CVE-2016-6213 was assigned.

    Ian Kent described the situation for autofs users
    as follows:

    > The number of mounts for direct mount maps is usually not very large because of
    > the way they are implemented, large direct mount maps can have performance
    > problems. There can be anywhere from a few (likely case a few hundred) to less
    > than 10000, plus mounts that have been triggered and not yet expired.
    >
    > Indirect mounts have one autofs mount at the root plus the number of mounts that
    > have been triggered and not yet expired.
    >
    > The number of autofs indirect map entries can range from a few to the common
    > case of several thousand and in rare cases up to between 30000 and 50000. I've
    > not heard of people with maps larger than 50000 entries.
    >
    > The larger the number of map entries the greater the possibility for a large
    > number of active mounts so it's not hard to expect cases of a 1000 or somewhat
    > more active mounts.

    So I am setting the default number of mounts allowed per mount
    namespace at 100,000. This is more than enough for any use case I
    know of, but small enough to quickly stop an exponential increase in
    mounts, which should be perfect for catching misconfigurations and
    malfunctioning programs.

    For anyone who needs a higher limit this can be changed by writing
    to the new /proc/sys/fs/mount-max sysctl.
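    The enforcement idea, as a kernel-style sketch (identifier names here
    are illustrative, not the exact upstream ones):

        /* Refuse to attach mounts once the namespace is at its limit;
         * the limit itself is what /proc/sys/fs/mount-max exposes. */
        static int may_add_mounts(unsigned int ns_mounts,
                                  unsigned int to_add,
                                  unsigned int mount_max)
        {
            if (ns_mounts + to_add > mount_max)
                return -ENOSPC;
            return 0;
        }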

    Tested-by: CAI Qian
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

28 Sep, 2016

2 commits

  • After 7e8e385aaf6e ("x86/compat: Remove sys32_vm86_warning"), this
    function has become unused, so we can remove it as well.

    Link: http://lkml.kernel.org/r/20160617142903.3070388-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Cc: Alexander Viro
    Cc: "Theodore Ts'o"
    Cc: Arnaldo Carvalho de Melo
    Signed-off-by: Andrew Morton

    Arnd Bergmann
     
  • Propagate unsignedness for grand total of 149 bytes:

    $ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux
    add/remove: 0/0 grow/shrink: 0/10 up/down: 0/-149 (-149)
    function old new delta
    set_close_on_exec 99 98 -1
    put_files_struct 201 200 -1
    get_close_on_exec 59 58 -1
    do_prlimit 498 497 -1
    do_execveat_common.isra 1662 1661 -1
    __close_fd 178 173 -5
    do_dup2 219 204 -15
    seq_show 685 660 -25
    __alloc_fd 384 357 -27
    dup_fd 718 646 -72

    It mostly comes from converting "unsigned int" to "long" for bit operations.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Al Viro

    Alexey Dobriyan
     

27 Aug, 2016

1 commit

    We have scripts which write to certain fields on 3.18 kernels, but
    this seems to be failing on 4.4 kernels. One entry we write to here
    is xfrm_aevent_rseqth, which is a u32.

    echo 4294967295 > /proc/sys/net/core/xfrm_aevent_rseqth

    Commit 230633d109e3 ("kernel/sysctl.c: detect overflows when converting
    to int") prevented writing to sysctl entries when integer overflow
    occurs. However, this does not apply to unsigned integers.

    Heinrich suggested that we introduce a new option to handle 64 bit
    limits and set min as 0 and max as UINT_MAX. This might not work as it
    leads to issues similar to __do_proc_doulongvec_minmax. Alternatively,
    we would need to change the datatype of the entry to 64 bit.

    static int __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, ...)
    {
        i = (unsigned long *) data;
        /* ^ this cast reads beyond the size of data (u32) */
        vleft = table->maxlen / sizeof(unsigned long);
        /* ^ vleft is 0: maxlen is sizeof(u32), which is less than
           sizeof(unsigned long) on x86_64 */

    Introduce a new proc handler proc_douintvec. Individual proc entries
    will need to be updated to use the new handler.
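    A sketch of what such an update looks like; the table and variable are
    illustrative, and registration is elided (proc_douintvec is the
    handler this patch introduces):

        static unsigned int example_u32;   /* e.g. an rseqth-style counter */

        static struct ctl_table example_table[] = {
            {
                .procname     = "example_u32",
                .data         = &example_u32,
                .maxlen       = sizeof(unsigned int),
                .mode         = 0644,
                .proc_handler = proc_douintvec,   /* was proc_dointvec */
            },
            { }
        };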

    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: 230633d109e3 ("kernel/sysctl.c:detect overflows when converting to int")
    Link: http://lkml.kernel.org/r/1471479806-5252-1-git-send-email-subashab@codeaurora.org
    Signed-off-by: Subash Abhinov Kasiviswanathan
    Cc: Heinrich Schuchardt
    Cc: Kees Cook
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Subash Abhinov Kasiviswanathan
     

03 Aug, 2016

1 commit

  • Add a "printk.devkmsg" kernel command line parameter which controls how
    userspace writes into /dev/kmsg. It has three options:

    * ratelimit - ratelimit logging from userspace
    * on        - unlimited logging from userspace
    * off       - logging from userspace gets ignored

    The default setting is to ratelimit the messages written to it.

    This changes the kernel default setting from "on" to "ratelimit", and
    we do that because we want to keep userspace spamming of /dev/kmsg at
    sane levels. This matters especially when a small kernel log buffer
    wraps around and messages get lost. With the ratelimit setting,
    kernel messages have a somewhat higher chance of surviving all the
    spamming.

    It additionally does not limit logging to /dev/kmsg while the system is
    booting if we haven't disabled it on the command line.

    Furthermore, we can control the logging from a lower-priority sysctl
    interface - kernel.printk_devkmsg.

    That interface will succeed only if printk.devkmsg *hasn't* been
    supplied on the command line. If it has, then printk.devkmsg is a
    one-time setting which remains for the duration of the system lifetime.
    This "locking" of the setting is to prevent userspace from changing the
    logging on us through sysctl(2).
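    A small user-space demo of the interface being controlled here; under
    the new "ratelimit" default, most of this burst should be throttled
    (compare the output of "dmesg | grep 'devkmsg demo'" afterwards):

        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            int fd = open("/dev/kmsg", O_WRONLY);
            if (fd < 0) {
                perror("open /dev/kmsg");
                return 1;
            }
            for (int i = 0; i < 100; i++) {
                char msg[64];
                int n = snprintf(msg, sizeof(msg), "<6>devkmsg demo %d\n", i);
                (void)write(fd, msg, n);   /* excess gets ratelimited */
            }
            close(fd);
            return 0;
        }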

    This patch is based on previous patches from Linus and Steven.

    [bp@suse.de: fixes]
    Link: http://lkml.kernel.org/r/20160719072344.GC25563@nazgul.tnic
    Link: http://lkml.kernel.org/r/20160716061745.15795-3-bp@alien8.de
    Signed-off-by: Borislav Petkov
    Cc: Dave Young
    Cc: Franck Bui
    Cc: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Uwe Kleine-König
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     

29 Jul, 2016

1 commit

  • As reclaim is now per-node based, convert zone_reclaim to be
    node_reclaim. It is possible that a node will be reclaimed multiple
    times if it has multiple zones but this is unavoidable without caching
    all nodes traversed so far. The documentation and interface to
    userspace are the same from a configuration perspective, and
    behaviour will be similar unless the node-local allocation requests
    were also limited to lower zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-24-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

16 Jun, 2016

1 commit

    It is not always easy to determine the cause of an RCU stall just by
    analysing the RCU stall messages, mainly when the problem is caused
    by the indirect starvation of RCU threads - for example, when
    preempt_rcu is not awakened due to the starvation of a timer softirq.

    We have been hard-coding panic() in the RCU stall functions for
    some time while testing the kernel-rt. But this is not possible in
    some scenarios, like when supporting customers.

    This patch implements the sysctl kernel.panic_on_rcu_stall. If
    set to 1, the system will panic() when an RCU stall takes place,
    enabling the capture of a vmcore. The vmcore provides a way to analyze
    all kernel/tasks states, helping out to point to the culprit and the
    solution for the stall.

    The kernel.panic_on_rcu_stall sysctl is disabled by default.
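    The core of the change is small. A kernel-style sketch along the
    lines described (the surrounding stall-report plumbing is omitted):

        int sysctl_panic_on_rcu_stall __read_mostly;

        static void panic_on_rcu_stall(void)
        {
            if (sysctl_panic_on_rcu_stall)
                panic("RCU Stall\n");
        }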

    Changes from v1:
    - Fixed a typo in the git log
    - The if(sysctl_panic_on_rcu_stall) panic() is in a static function
    - Fixed the CONFIG_TINY_RCU compilation issue
    - The var sysctl_panic_on_rcu_stall is now __read_mostly

    Cc: Jonathan Corbet
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Mathieu Desnoyers
    Cc: Lai Jiangshan
    Acked-by: Christian Borntraeger
    Reviewed-by: Josh Triplett
    Reviewed-by: Arnaldo Carvalho de Melo
    Tested-by: "Luis Claudio R. Goncalves"
    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Paul E. McKenney

    Daniel Bristot de Oliveira
     

26 May, 2016

1 commit

  • Pull perf updates from Ingo Molnar:
    "Mostly tooling and PMU driver fixes, but also a number of late updates
    such as the reworking of the call-chain size limiting logic to make
    call-graph recording more robust, plus tooling side changes for the
    new 'backwards ring-buffer' extension to the perf ring-buffer"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
    perf record: Read from backward ring buffer
    perf record: Rename variable to make code clear
    perf record: Prevent reading invalid data in record__mmap_read
    perf evlist: Add API to pause/resume
    perf trace: Use the ptr->name beautifier as default for "filename" args
    perf trace: Use the fd->name beautifier as default for "fd" args
    perf report: Add srcline_from/to branch sort keys
    perf evsel: Record fd into perf_mmap
    perf evsel: Add overwrite attribute and check write_backward
    perf tools: Set buildid dir under symfs when --symfs is provided
    perf trace: Only auto set call-graph to "dwarf" when syscalls are being traced
    perf annotate: Sort list of recognised instructions
    perf annotate: Fix identification of ARM blt and bls instructions
    perf tools: Fix usage of max_stack sysctl
    perf callchain: Stop validating callchains by the max_stack sysctl
    perf trace: Fix exit_group() formatting
    perf top: Use machine->kptr_restrict_warned
    perf trace: Warn when trying to resolve kernel addresses with kptr_restrict=1
    perf machine: Do not bail out if not managing to read ref reloc symbol
    perf/x86/intel/p4: Trival indentation fix, remove space
    ...

    Linus Torvalds
     

20 May, 2016

1 commit

  • Provide /proc/sys/vm/stat_refresh to force an immediate update of
    per-cpu into global vmstats: useful to avoid a sleep(2) or whatever
    before checking counts when testing. Originally added to work around a
    bug which left counts stranded indefinitely on a cpu going idle (an
    inaccuracy magnified when small below-batch numbers represent "huge"
    amounts of memory), but I believe that bug is now fixed: nonetheless,
    this is still a useful knob.

    Its schedule_on_each_cpu() is probably too expensive just to fold into
    reading /proc/meminfo itself: give this mode 0600 to prevent abuse.
    Allow a write or a read to do the same: nothing to read, but "grep -h
    Shmem /proc/sys/vm/stat_refresh /proc/meminfo" is convenient. Oh, and
    since global_page_state() itself is careful to disguise any underflow as
    0, hack in an "Invalid argument" and pr_warn() if a counter is negative
    after the refresh - this helped to fix a misaccounting of
    NR_ISOLATED_FILE in my migration code.

    But on recent kernels, I find that NR_ALLOC_BATCH and NR_PAGES_SCANNED
    often go negative some of the time. I have not yet worked out why, but
    have no evidence that it's actually harmful. Punt for the moment by
    just ignoring the anomaly on those.

    Signed-off-by: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Andres Lagar-Cavilla
    Cc: Yang Shi
    Cc: Ning Qu
    Cc: Mel Gorman
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 May, 2016

2 commits

  • The perf_sample->ip_callchain->nr value includes all the entries in the
    ip_callchain->ip[] array, real addresses and PERF_CONTEXT_{KERNEL,USER,etc},
    while what the user expects is that the value in the
    kernel.perf_event_max_stack sysctl, or in the upcoming per-event
    perf_event_attr.sample_max_stack knob, be honoured in terms of IP
    addresses in the stack trace.

    So allocate a bunch of extra entries for contexts, and do the accounting
    via perf_callchain_entry_ctx struct members.

    A new sysctl, kernel.perf_event_max_contexts_per_stack is also
    introduced for investigating possible bugs in the callchain
    implementation by some arch.

    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Brendan Gregg
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/n/tip-3b4wnqk340c4sg4gwkfdi9yk@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     
    So that it can be used for other stack-related knobs, such as the
    upcoming one to tweak the max number of contexts per stack sample.

    In all those cases we can only change the value if there are no perf
    sessions collecting stacks, so they need to grab that mutex, etc.

    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-8t3fk94wuzp8m2z1n4gc0s17@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

27 Apr, 2016

1 commit

  • The default remains 127, which is good for most cases, and not even hit
    most of the time, but then for some cases, as reported by Brendan, 1024+
    deep frames are appearing on the radar for things like groovy, ruby.

    And in some workloads putting a _lower_ cap on this may make sense. A
    per-event limit still needs to be put in place, though.

    The new file is:

    # cat /proc/sys/kernel/perf_event_max_stack
    127

    Changing it:

    # echo 256 > /proc/sys/kernel/perf_event_max_stack
    # cat /proc/sys/kernel/perf_event_max_stack
    256

    But as soon as there is some event using callchains we get:

    # echo 512 > /proc/sys/kernel/perf_event_max_stack
    -bash: echo: write error: Device or resource busy
    #

    Because we only allocate the callchain percpu data structures when
    there is a user, changing the max is easy - it's just a matter of
    having no callchain users at that point.

    Reported-and-Tested-by: Brendan Gregg
    Reviewed-by: Frederic Weisbecker
    Acked-by: Alexei Starovoitov
    Acked-by: David Ahern
    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/r/20160426002928.GB16708@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

18 Mar, 2016

1 commit

  • In machines with 140G of memory and enterprise flash storage, we have
    seen read and write bursts routinely exceed the kswapd watermarks and
    cause thundering herds in direct reclaim. Unfortunately, the only way
    to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
    system's emergency reserves - which is entirely unrelated to the
    system's latency requirements. In order to get kswapd to maintain a
    250M buffer of free memory, the emergency reserves need to be set to 1G.
    That is a lot of memory wasted for no good reason.

    On the other hand, it's reasonable to assume that allocation bursts and
    overall allocation concurrency scale with memory capacity, so it makes
    sense to make kswapd aggressiveness a function of that as well.

    Change the kswapd watermark scale factor from the currently fixed 25% of
    the tunable emergency reserve to a tunable 0.1% of memory.

    Beyond 1G of memory, this will produce bigger watermark steps than the
    current formula in default settings. Ensure that the new formula never
    chooses steps smaller than that, i.e. 25% of the emergency reserve.

    On a 140G machine, this raises the default watermark steps - the
    distance between min and low, and low and high - from 16M to 143M.
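    In other words, the step computation becomes roughly the following
    (a sketch of the formula as described, not necessarily the exact
    upstream expression; the default watermark_scale_factor of 10
    corresponds to 0.1% of memory):

        step = max(emergency_reserve / 4,
                   managed_pages * watermark_scale_factor / 10000);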

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

09 Feb, 2016

1 commit

  • schedstats is very useful during debugging and performance tuning but it
    incurs overhead to calculate the stats. As such, even though it can be
    disabled at build time, it is often enabled as the information is useful.

    This patch adds a kernel command-line and sysctl tunable to enable or
    disable schedstats on demand (when it's built in). It is disabled
    by default as someone who knows they need it can also learn to enable
    it when necessary.

    The benefits are dependent on how scheduler-intensive the workload is.
    If it is then the patch reduces the number of cycles spent calculating
    the stats with a small benefit from reducing the cache footprint of the
    scheduler.

    These measurements were taken from a 48-core 2-socket machine with
    Xeon(R) E5-2670 v3 CPUs, although they were also tested on a
    single-socket 8-core machine with Intel i7-3770 processors.

    netperf-tcp
    4.5.0-rc1 4.5.0-rc1
    vanilla nostats-v3r1
    Hmean 64 560.45 ( 0.00%) 575.98 ( 2.77%)
    Hmean 128 766.66 ( 0.00%) 795.79 ( 3.80%)
    Hmean 256 950.51 ( 0.00%) 981.50 ( 3.26%)
    Hmean 1024 1433.25 ( 0.00%) 1466.51 ( 2.32%)
    Hmean 2048 2810.54 ( 0.00%) 2879.75 ( 2.46%)
    Hmean 3312 4618.18 ( 0.00%) 4682.09 ( 1.38%)
    Hmean 4096 5306.42 ( 0.00%) 5346.39 ( 0.75%)
    Hmean 8192 10581.44 ( 0.00%) 10698.15 ( 1.10%)
    Hmean 16384 18857.70 ( 0.00%) 18937.61 ( 0.42%)

    Small gains here; UDP_STREAM showed nothing interesting and neither
    did the TCP_RR tests. The gains on the 8-core machine were very
    similar.

    tbench4
    4.5.0-rc1 4.5.0-rc1
    vanilla nostats-v3r1
    Hmean mb/sec-1 500.85 ( 0.00%) 522.43 ( 4.31%)
    Hmean mb/sec-2 984.66 ( 0.00%) 1018.19 ( 3.41%)
    Hmean mb/sec-4 1827.91 ( 0.00%) 1847.78 ( 1.09%)
    Hmean mb/sec-8 3561.36 ( 0.00%) 3611.28 ( 1.40%)
    Hmean mb/sec-16 5824.52 ( 0.00%) 5929.03 ( 1.79%)
    Hmean mb/sec-32 10943.10 ( 0.00%) 10802.83 ( -1.28%)
    Hmean mb/sec-64 15950.81 ( 0.00%) 16211.31 ( 1.63%)
    Hmean mb/sec-128 15302.17 ( 0.00%) 15445.11 ( 0.93%)
    Hmean mb/sec-256 14866.18 ( 0.00%) 15088.73 ( 1.50%)
    Hmean mb/sec-512 15223.31 ( 0.00%) 15373.69 ( 0.99%)
    Hmean mb/sec-1024 14574.25 ( 0.00%) 14598.02 ( 0.16%)
    Hmean mb/sec-2048 13569.02 ( 0.00%) 13733.86 ( 1.21%)
    Hmean mb/sec-3072 12865.98 ( 0.00%) 13209.23 ( 2.67%)

    Small gains of 2-4% at low thread counts and otherwise flat. The
    gains on the 8-core machine were slightly different.

    tbench4 on 8-core i7-3770 single socket machine
    Hmean mb/sec-1 442.59 ( 0.00%) 448.73 ( 1.39%)
    Hmean mb/sec-2 796.68 ( 0.00%) 794.39 ( -0.29%)
    Hmean mb/sec-4 1322.52 ( 0.00%) 1343.66 ( 1.60%)
    Hmean mb/sec-8 2611.65 ( 0.00%) 2694.86 ( 3.19%)
    Hmean mb/sec-16 2537.07 ( 0.00%) 2609.34 ( 2.85%)
    Hmean mb/sec-32 2506.02 ( 0.00%) 2578.18 ( 2.88%)
    Hmean mb/sec-64 2511.06 ( 0.00%) 2569.16 ( 2.31%)
    Hmean mb/sec-128 2313.38 ( 0.00%) 2395.50 ( 3.55%)
    Hmean mb/sec-256 2110.04 ( 0.00%) 2177.45 ( 3.19%)
    Hmean mb/sec-512 2072.51 ( 0.00%) 2053.97 ( -0.89%)

    In contrast, this shows a relatively steady 2-3% gain at higher
    thread counts. Due to the nature of the patch and the type of
    workload, it's not a surprise that the result depends on the CPU
    used.

    hackbench-pipes
    4.5.0-rc1 4.5.0-rc1
    vanilla nostats-v3r1
    Amean 1 0.0637 ( 0.00%) 0.0660 ( -3.59%)
    Amean 4 0.1229 ( 0.00%) 0.1181 ( 3.84%)
    Amean 7 0.1921 ( 0.00%) 0.1911 ( 0.52%)
    Amean 12 0.3117 ( 0.00%) 0.2923 ( 6.23%)
    Amean 21 0.4050 ( 0.00%) 0.3899 ( 3.74%)
    Amean 30 0.4586 ( 0.00%) 0.4433 ( 3.33%)
    Amean 48 0.5910 ( 0.00%) 0.5694 ( 3.65%)
    Amean 79 0.8663 ( 0.00%) 0.8626 ( 0.43%)
    Amean 110 1.1543 ( 0.00%) 1.1517 ( 0.22%)
    Amean 141 1.4457 ( 0.00%) 1.4290 ( 1.16%)
    Amean 172 1.7090 ( 0.00%) 1.6924 ( 0.97%)
    Amean 192 1.9126 ( 0.00%) 1.9089 ( 0.19%)

    Some small gains and losses; while the variance data is not included,
    it's close to the noise. The UMA machine did not show anything
    particularly different.

    pipetest
    4.5.0-rc1 4.5.0-rc1
    vanilla nostats-v2r2
    Min Time 4.13 ( 0.00%) 3.99 ( 3.39%)
    1st-qrtle Time 4.38 ( 0.00%) 4.27 ( 2.51%)
    2nd-qrtle Time 4.46 ( 0.00%) 4.39 ( 1.57%)
    3rd-qrtle Time 4.56 ( 0.00%) 4.51 ( 1.10%)
    Max-90% Time 4.67 ( 0.00%) 4.60 ( 1.50%)
    Max-93% Time 4.71 ( 0.00%) 4.65 ( 1.27%)
    Max-95% Time 4.74 ( 0.00%) 4.71 ( 0.63%)
    Max-99% Time 4.88 ( 0.00%) 4.79 ( 1.84%)
    Max Time 4.93 ( 0.00%) 4.83 ( 2.03%)
    Mean Time 4.48 ( 0.00%) 4.39 ( 1.91%)
    Best99%Mean Time 4.47 ( 0.00%) 4.39 ( 1.91%)
    Best95%Mean Time 4.46 ( 0.00%) 4.38 ( 1.93%)
    Best90%Mean Time 4.45 ( 0.00%) 4.36 ( 1.98%)
    Best50%Mean Time 4.36 ( 0.00%) 4.25 ( 2.49%)
    Best10%Mean Time 4.23 ( 0.00%) 4.10 ( 3.13%)
    Best5%Mean Time 4.19 ( 0.00%) 4.06 ( 3.20%)
    Best1%Mean Time 4.13 ( 0.00%) 4.00 ( 3.39%)

    Small improvement and similar gains were seen on the UMA machine.

    The gain is small but it stands to reason that doing less work in the
    scheduler is a good thing. The downside is that the lack of schedstats and
    tracepoints may be surprising to experts doing performance analysis until
    they find the existence of the schedstats= parameter or schedstats sysctl.
    It will be automatically activated for latencytop and sleep profiling to
    alleviate the problem. For tracepoints there is a simple warning, as
    it's not safe to activate schedstats in a context where it is known
    the tracepoint may be wanted but is unavailable.

    Signed-off-by: Mel Gorman
    Reviewed-by: Matt Fleming
    Reviewed-by: Srikar Dronamraju
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1454663316-22048-1-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

23 Jan, 2016

1 commit

  • Pull more vfs updates from Al Viro:
    "Embarrassing braino fix + pipe page accounting + fixing an eyesore in
    find_filesystem() (checking that s1 is equal to prefix of s2 of given
    length can be done in many ways, but "compare strlen(s1) with length
    and then do strncmp()" is not a good one...)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    [regression] fix braino in fs/dlm/user.c
    pipe: limit the per-user amount of pages allocated in pipes
    find_filesystem(): simplify comparison

    Linus Torvalds
     

21 Jan, 2016

1 commit

  • SYSCTL_WRITES_WARN was added in commit f4aacea2f5d1 ("sysctl: allow for
    strict write position handling"), and released in v3.16 in August of
    2014. Since then I can find only 1 instance of non-zero offset
    writing[1], and it was fixed immediately in CRIU[2]. As such, it
    appears safe to flip this to the strict state now.

    [1] https://www.google.com/search?q="when%20file%20position%20was%20not%200"
    [2] http://lists.openvz.org/pipermail/criu/2015-April/019819.html

    Signed-off-by: Kees Cook
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

20 Jan, 2016

1 commit

    On not-so-small systems, it is possible for a single process to cause
    an OOM condition by filling large pipes with data that is never read.
    A typical process filling 4000 pipes with 1 MB of data will use 4 GB
    of memory. On small systems it may be tricky to set the pipe max size
    to prevent this from happening.

    This patch makes it possible to enforce a per-user soft limit above
    which new pipes will be limited to a single page, effectively limiting
    them to 4 kB each, as well as a hard limit above which no new pipes may
    be created for this user. This has the effect of protecting the system
    against memory abuse without hurting other users, and still allowing
    pipes to work correctly though with less data at once.

    The limits are controlled by two new sysctls: pipe-user-pages-soft
    and pipe-user-pages-hard. Both may be disabled by setting them to
    zero. The default soft limit allows the default number of FDs per
    process (1024) to create pipes of the default size (64kB), thus
    reaching a limit of 64MB before starting to create only smaller
    pipes. With 256 processes limited to 1024 FDs each, this results in
    1024*64kB + (256*1024 - 1024) * 4kB = 1084 MB of memory allocated for
    a user. The hard limit is disabled by default to avoid breaking
    existing applications that make intensive use of pipes (e.g. for
    splicing).
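    A user-space probe of the soft limit (assumes this patch;
    RLIMIT_NOFILE may need raising to hold enough pipes open to cross it):

        #define _GNU_SOURCE            /* for F_GETPIPE_SZ */
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            for (int i = 0; i < 4096; i++) {
                int fds[2];
                if (pipe(fds) < 0) {
                    perror("pipe");    /* EMFILE, or the hard limit if set */
                    break;
                }
                if (i % 512 == 0)      /* drops to one page past the soft limit */
                    printf("pipe %4d capacity: %d bytes\n",
                           i, fcntl(fds[0], F_GETPIPE_SZ));
            }
            return 0;
        }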

    Reported-by: socketpair@gmail.com
    Reported-by: Tetsuo Handa
    Mitigates: CVE-2013-4312 (Linux 2.0+)
    Suggested-by: Linus Torvalds
    Signed-off-by: Willy Tarreau
    Signed-off-by: Al Viro

    Willy Tarreau
     

15 Jan, 2016

1 commit

  • Address Space Layout Randomization (ASLR) provides a barrier to
    exploitation of user-space processes in the presence of security
    vulnerabilities by making it more difficult to find desired code/data
    which could help an attack. This is done by adding a random offset to
    the location of regions in the process address space, with a greater
    range of potential offset values corresponding to better protection/a
    larger search-space for brute force, but also to greater potential for
    fragmentation.

    The offset added to the mmap_base address, which provides the basis for
    the majority of the mappings for a process, is set once on process exec
    in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
    which reflect, hopefully, the best compromise for all systems. The
    trade-off between increased entropy in the offset value generation and
    the corresponding increased variability in address space fragmentation
    is not absolute, however, and some platforms may tolerate higher amounts
    of entropy. This patch introduces both new Kconfig values and a sysctl
    interface which may be used to change the amount of entropy used for
    offset generation on a system.

    The direct motivation for this change was in response to the
    libstagefright vulnerabilities that affected Android, specifically to
    information provided by Google's project zero at:

    http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html

    The attack presented therein, by Google's project zero, specifically
    targeted the limited randomness used to generate the offset added to the
    mmap_base address in order to craft a brute-force-based attack.
    Concretely, the attack was against the mediaserver process, which was
    limited to respawning every 5 seconds, on an arm device. The hard-coded
    8 bits used resulted in an average expected success rate of defeating
    the mmap ASLR after just over 10 minutes (128 tries at 5 seconds a
    piece). With this patch, and an accompanying increase in the entropy
    value to 16 bits, the same attack would take an average expected time of
    over 45 hours (32768 tries), which makes it both less feasible and more
    likely to be noticed.

    The introduced Kconfig and sysctl options are limited by per-arch
    minimum and maximum values, the minimum of which was chosen to match the
    current hard-coded value and the maximum of which was chosen so as to
    give the greatest flexibility without generating an invalid mmap_base
    address, generally 3-4 bits less than the number of bits in the
    user-space-accessible virtual address space.

    When deciding whether or not to change the default value, a system
    developer should consider that the mmap_base address could be placed
    anywhere up to 2^(value) bits away from the non-randomized location,
    which would introduce variable-sized areas above and below the
    mmap_base address such that the maximum vm_area_struct size may be
    reduced, preventing very large allocations.
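    The effect is easy to observe from user space: run the following
    repeatedly and compare the high bits of the returned address (on a
    kernel with this series, the spread is governed by the new
    vm.mmap_rnd_bits sysctl):

        #include <stdio.h>
        #include <sys/mman.h>

        int main(void)
        {
            void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            printf("anonymous mapping at %p\n", p);
            munmap(p, 4096);
            return 0;
        }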

    This patch (of 4):

    ASLR only uses as few as 8 bits to generate the random offset for the
    mmap base address on 32 bit architectures. This value was chosen to
    prevent a poorly chosen value from dividing the address space in such a
    way as to prevent large allocations. This may not be an issue on all
    platforms. Allow the specification of a minimum number of bits so that
    platforms desiring greater ASLR protection may determine where to place
    the trade-off.

    Signed-off-by: Daniel Cashman
    Cc: Russell King
    Acked-by: Kees Cook
    Cc: Ingo Molnar
    Cc: Jonathan Corbet
    Cc: Don Zickus
    Cc: Eric W. Biederman
    Cc: Heinrich Schuchardt
    Cc: Josh Poimboeuf
    Cc: Kirill A. Shutemov
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Thomas Gleixner
    Cc: David Rientjes
    Cc: Mark Salyzyn
    Cc: Jeff Vander Stoep
    Cc: Nick Kralevich
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: "H. Peter Anvin"
    Cc: Hector Marco-Gisbert
    Cc: Borislav Petkov
    Cc: Ralf Baechle
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Cashman
     

06 Nov, 2015

2 commits

  • The only way to enable a hardlockup to panic the machine is to set
    'nmi_watchdog=panic' on the kernel command line.

    This makes it awkward for end users and folks who want to run
    automated tests (like myself).

    Mimic the softlockup_panic knob and create a /proc/sys/kernel/hardlockup_panic
    knob.

    Signed-off-by: Don Zickus
    Cc: Ulrich Obergfell
    Acked-by: Jiri Kosina
    Reviewed-by: Aaron Tomlin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Don Zickus
     
    In many cases of hardlockup reports, it's actually not possible to
    know why it triggered, because the CPU that got stuck is usually
    waiting on a resource (with IRQs disabled) that some other CPU is
    holding.

    IOW, we are often looking at the stacktrace of the victim and not the
    actual offender.

    Introduce a sysctl / cmdline parameter that makes it possible to have
    the hardlockup detector perform an all-CPU backtrace.

    Signed-off-by: Jiri Kosina
    Reviewed-by: Aaron Tomlin
    Cc: Ulrich Obergfell
    Acked-by: Don Zickus
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     

13 Oct, 2015

1 commit

    In order to let unprivileged users load and execute eBPF programs,
    teach the verifier to prevent pointer leaks. The verifier will
    prevent:
    - any arithmetic on pointers
      (except R10+Imm, which is used to compute stack addresses)
    - comparison of pointers
      (except if (map_value_ptr == 0) ... )
    - passing pointers to helper functions
    - indirectly passing pointers in stack to helper functions
    - returning a pointer from a bpf program
    - storing pointers into ctx or maps

    Spill/fill of pointers into stack is allowed, but mangling
    of pointers stored in the stack or reading them byte by byte is not.

    Within bpf programs the pointers do exist, since programs need to
    be able to access maps, pass the skb pointer to LD_ABS insns, etc.,
    but programs cannot pass such pointer values to the outside
    or obfuscate them.

    Only allow BPF_PROG_TYPE_SOCKET_FILTER unprivileged programs,
    so that socket filters (tcpdump), af_packet (quic acceleration)
    and future kcm can use it.
    tracing and tc cls/act program types still require root permissions,
    since tracing actually needs to be able to see all kernel pointers
    and tc is for root only.

    For example, the following unprivileged socket filter program is allowed:
    int bpf_prog1(struct __sk_buff *skb)
    {
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);

        if (value)
            *value += skb->len;
        return 0;
    }

    but the following program is not:
    int bpf_prog1(struct __sk_buff *skb)
    {
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);

        if (value)
            *value += (u64) skb;
        return 0;
    }
    since it would leak the kernel address into the map.

    Unprivileged socket filter bpf programs have access to the
    following helper functions:
    - map lookup/update/delete (but they cannot store kernel pointers into them)
    - get_random (it's already exposed to unprivileged user space)
    - get_smp_processor_id
    - tail_call into another socket filter program
    - ktime_get_ns

    The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
    This toggle defaults to off (0) but can be set to true (1). Once
    true, bpf programs and maps cannot be accessed from an unprivileged
    process, and the toggle cannot be set back to false.

    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Kees Cook
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

11 Sep, 2015

2 commits

  • The following

        if (val < 0)
            *lvalp = (unsigned long)-val;

    is incorrect because the compiler is free to assume -val to be positive
    and use a sign-extend instruction for extending the bit pattern. This is
    a problem if val == INT_MIN:

    # echo -2147483648 >/proc/sys/dev/scsi/logging_level
    # cat /proc/sys/dev/scsi/logging_level
    -18446744071562067968

    Cast to unsigned long before negation - that way we first sign-extend and
    then negate an unsigned, which is well defined. With this:

    # cat /proc/sys/dev/scsi/logging_level
    -2147483648
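    The corrected conversion, per the description above:

        if (val < 0)
            *lvalp = -(unsigned long)val;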

    Signed-off-by: Ilya Dryomov
    Cc: Mikulas Patocka
    Cc: Robert Xiao
    Cc: "Eric W. Biederman"
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ilya Dryomov
     
    There are two kexec load syscalls: kexec_load and kexec_file_load.
    kexec_file_load has been split out into kernel/kexec_file.c. In this
    patch I split the kexec_load syscall code into kernel/kexec.c.

    And add a new kconfig option KEXEC_CORE, so we can disable kexec_load
    and use kexec_file_load only, or vice versa.

    The original requirement is from Ted Ts'o: he wants the kexec kernel
    signature to be checked with CONFIG_KEXEC_VERIFY_SIG enabled. But
    kexec-tools can use the kexec_load syscall to bypass that checking.

    Vivek Goyal proposed to create a common kconfig option so user can compile
    in only one syscall for loading kexec kernel. KEXEC/KEXEC_FILE selects
    KEXEC_CORE so that old config files still work.

    Because there is general code that needs CONFIG_KEXEC_CORE, I updated
    all the architecture Kconfigs with the new option KEXEC_CORE, and let
    KEXEC select KEXEC_CORE in the arch Kconfig. General kernel code was
    also updated accordingly for the kexec_load syscall.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Dave Young
    Cc: Eric W. Biederman
    Cc: Vivek Goyal
    Cc: Petr Tesarik
    Cc: Theodore Ts'o
    Cc: Josh Boyer
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     

04 Jul, 2015

1 commit

  • Pull user namespace updates from Eric Biederman:
    "Long ago and far away when user namespaces where young it was realized
    that allowing fresh mounts of proc and sysfs with only user namespace
    permissions could violate the basic rule that only root gets to decide
    if proc or sysfs should be mounted at all.

    Some hacks were put in place to reduce the worst of the damage that
    could be done, and the common-sense rule was adopted that fresh
    mounts of proc and sysfs should allow no more than bind mounts of
    proc and sysfs. Unfortunately that rule has not been fully enforced.

    There are two kinds of gaps in that enforcement. Only filesystems
    mounted on empty directories of proc and sysfs should be ignored but
    the test for empty directories was insufficient. So in my tree
    directories on proc, sysctl and sysfs that will always be empty are
    created specially. Every other technique is imperfect as an ordinary
    directory can have entries added even after a readdir returns and
    shows that the directory is empty. Special creation of directories
    for mount points makes the code in the kernel a smidge clearer about
    its purpose. I asked container developers from the various container
    projects to help test this and no holes were found in the set of mount
    points on proc and sysfs that are created specially.

    This set of changes also starts enforcing that the mount flags of
    fresh mounts of proc and sysfs are consistent with the existing mount
    of proc and sysfs. I expected this to be the boring part of the work but
    unfortunately unprivileged userspace winds up mounting fresh copies of
    proc and sysfs with noexec and nosuid clear when root set those flags
    on the previous mount of proc and sysfs. So for now only the atime,
    read-only and nodev attributes which userspace happens to keep
    consistent are enforced. Dealing with the noexec and nosuid
    attributes remains for another time.

    This set of changes also addresses an issue with how open file
    descriptors from /proc/<pid>/ns/* are displayed. Recently readlink of
    /proc/<pid>/fd has been triggering a WARN_ON that has not been
    meaningful since it was added (as all of the code in the kernel was
    converted) and is not now actively wrong.

    There is also a short list of issues that have not been fixed yet that
    I will mention briefly.

    It is possible to rename a directory from below to above a bind mount.
    At which point any directory pointers below the renamed directory can
    be walked up to the root directory of the filesystem. With user
    namespaces enabled a bind mount of the bind mount can be created
    allowing the user to pick a directory whose children they can rename
    to outside of the bind mount. This is challenging to fix and doubly
    so because all obvious solutions must touch code that is in the
    performance part of pathname resolution.

    As mentioned above there is also a question of how to ensure that
    developers by accident or with purpose do not introduce executable
    files on sysfs and proc and in doing so introduce security regressions
    in the current userspace that will not be immediately obvious and as
    such are likely to require breaking userspace in painful ways once
    they are recognized"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    vfs: Remove incorrect debugging WARN in prepend_path
    mnt: Update fs_fully_visible to test for permanently empty directories
    sysfs: Create mountpoints with sysfs_create_mount_point
    sysfs: Add support for permanently empty directories to serve as mount points.
    kernfs: Add support for always empty directories.
    proc: Allow creating permanently empty directories that serve as mount points
    sysctl: Allow creating permanently empty directories that serve as mountpoints.
    fs: Add helper functions for permanently empty directories.
    vfs: Ignore unlocked mounts in fs_fully_visible
    mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
    mnt: Refactor the logic for mounting sysfs and proc in a user namespace

    Linus Torvalds
     

25 Jun, 2015

1 commit

  • Change the default behavior of watchdog so it only runs on the
    housekeeping cores when nohz_full is enabled at build and boot time.
    Allow modifying the set of cores the watchdog is currently running on
    with a new kernel.watchdog_cpumask sysctl.

    In the current system, the watchdog subsystem runs a periodic timer that
    schedules the watchdog kthread to run. However, nohz_full cores are
    designed to allow userspace application code running on those cores to
    have 100% access to the CPU. So the watchdog system prevents the
    nohz_full application code from being able to run the way it wants to,
    thus the motivation to suppress the watchdog on nohz_full cores, which
    this patchset provides by default.

    However, if we disable the watchdog globally, then the housekeeping
    cores can't benefit from the watchdog functionality. So we allow
    disabling it only on some cores. See Documentation/lockup-watchdogs.txt
    for more information.

    [jhubbard@nvidia.com: fix a watchdog crash in some configurations]
    Signed-off-by: Chris Metcalf
    Acked-by: Don Zickus
    Cc: Ingo Molnar
    Cc: Ulrich Obergfell
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Frederic Weisbecker
    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     

19 Jun, 2015

1 commit

    Eric reported that the timer_migration sysctl is not really nice
    performance-wise, as it needs to check at every timer insertion whether
    the feature is enabled or not. Further the check does not live in the
    timer code, so we have an extra function call which checks an extra
    cache line to figure out that it is disabled.

    We can do better and store that information in the per cpu (hr)timer
    bases. I pondered to use a static key, but that's a nightmare to
    update from the nohz code and the timer base cache line is hot anyway
    when we select a timer base.
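    A sketch of the resulting shape (struct and field names illustrative):

        /* The flag lives in the per-cpu base the hot path already
         * touches, so checking it is effectively free. */
        struct example_timer_base {
            spinlock_t lock;
            bool       migration_enabled;   /* mirrored from the sysctl
                                               and the nohz switch */
            /* ... timer wheel state ... */
        };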

    The old logic enabled the timer migration unconditionally if
    CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
    line.

    With this modification, we start off with migration disabled. The user
    visible sysctl is still set to enabled. If the kernel switches to NOHZ
    migration is enabled, if the user did not disable it via the sysctl
    prior to the switch. If nohz=off is on the kernel command line,
    migration stays disabled no matter what.

    Before:
    47.76% hog [.] main
    14.84% [kernel] [k] _raw_spin_lock_irqsave
    9.55% [kernel] [k] _raw_spin_unlock_irqrestore
    6.71% [kernel] [k] mod_timer
    6.24% [kernel] [k] lock_timer_base.isra.38
    3.76% [kernel] [k] detach_if_pending
    3.71% [kernel] [k] del_timer
    2.50% [kernel] [k] internal_add_timer
    1.51% [kernel] [k] get_nohz_timer_target
    1.28% [kernel] [k] __internal_add_timer
    0.78% [kernel] [k] timerfn
    0.48% [kernel] [k] wake_up_nohz_cpu

    After:
    48.10% hog [.] main
    15.25% [kernel] [k] _raw_spin_lock_irqsave
    9.76% [kernel] [k] _raw_spin_unlock_irqrestore
    6.50% [kernel] [k] mod_timer
    6.44% [kernel] [k] lock_timer_base.isra.38
    3.87% [kernel] [k] detach_if_pending
    3.80% [kernel] [k] del_timer
    2.67% [kernel] [k] internal_add_timer
    1.33% [kernel] [k] __internal_add_timer
    0.73% [kernel] [k] timerfn
    0.54% [kernel] [k] wake_up_nohz_cpu

    Reported-by: Eric Dumazet
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Paul McKenney
    Cc: Frederic Weisbecker
    Cc: Viresh Kumar
    Cc: John Stultz
    Cc: Joonwoo Park
    Cc: Wenbo Wang
    Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

17 Apr, 2015

2 commits

    When converting unsigned long to int, overflows may occur. These are
    currently not detected when writing to the sysctl file system.

    E.g. on a system where int has 32 bits and long has 64 bits
    echo 0x800001234 > /proc/sys/kernel/threads-max
    has the same effect as
    echo 0x1234 > /proc/sys/kernel/threads-max

    The patch adds the missing check in do_proc_dointvec_conv.

    With the patch, an overflow will result in an error EINVAL when
    writing to the sysctl file system.
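    A sketch of the added check (along the lines of the patch; the
    negative-value branch is elided here):

        /* Reject values that do not fit in an int rather than
         * silently truncating them. */
        if (*lvalp > (unsigned long)INT_MAX)
            return -EINVAL;
        *valp = *lvalp;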

    Signed-off-by: Heinrich Schuchardt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heinrich Schuchardt
     
  • Users can change the maximum number of threads by writing to
    /proc/sys/kernel/threads-max.

    With the patch the value entered is checked against the same limits that
    apply when fork_init is called.

    Signed-off-by: Heinrich Schuchardt
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Guenter Roeck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heinrich Schuchardt
     

16 Apr, 2015

1 commit

  • Currently, pages which are marked as unevictable are protected from
    compaction, but not from other types of migration. The POSIX real time
    extension explicitly states that mlock() will prevent a major page
    fault, but the spirit of this is that mlock() should give a process the
    ability to control sources of latency, including minor page faults.
    However, the mlock manpage only explicitly says that a locked page will
    not be written to swap and this can cause some confusion. The
    compaction code today does not give a developer who wants to avoid swap
    but wants to have large contiguous areas available any method to achieve
    this state. This patch introduces a sysctl for controlling compaction
    behavior with respect to the unevictable lru. Users who demand no page
    faults after a page is present can set compact_unevictable_allowed to 0
    and users who need the large contiguous areas can enable compaction on
    locked memory by leaving the default value of 1.

    To illustrate this problem I wrote a quick test program that mmaps a
    large number of 1MB files filled with random data. These maps are
    created locked and read only. Then every other mmap is unmapped and I
    attempt to allocate huge pages to the static huge page pool. When the
    compact_unevictable_allowed sysctl is 0, I cannot allocate hugepages
    after fragmenting memory. When the value is set to 1, allocations
    succeed.

    Signed-off-by: Eric B Munson
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Acked-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Thomas Gleixner
    Cc: Christoph Lameter
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     

15 Apr, 2015

3 commits

  • Merge first patchbomb from Andrew Morton:

    - arch/sh updates

    - ocfs2 updates

    - kernel/watchdog feature

    - about half of mm/

    * emailed patches from Andrew Morton : (122 commits)
    Documentation: update arch list in the 'memtest' entry
    Kconfig: memtest: update number of test patterns up to 17
    arm: add support for memtest
    arm64: add support for memtest
    memtest: use phys_addr_t for physical addresses
    mm: move memtest under mm
    mm, hugetlb: abort __get_user_pages if current has been oom killed
    mm, mempool: do not allow atomic resizing
    memcg: print cgroup information when system panics due to panic_on_oom
    mm: numa: remove migrate_ratelimited
    mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
    mm: split ET_DYN ASLR from mmap ASLR
    s390: redefine randomize_et_dyn for ELF_ET_DYN_BASE
    mm: expose arch_mmap_rnd when available
    s390: standardize mmap_rnd() usage
    powerpc: standardize mmap_rnd() usage
    mips: extract logic for mmap_rnd()
    arm64: standardize mmap_rnd() usage
    x86: standardize mmap_rnd() usage
    arm: factor out mmap ASLR into mmap_rnd
    ...

    Linus Torvalds
     
  • With the current user interface of the watchdog mechanism it is only
    possible to disable or enable both lockup detectors at the same time.
    This series introduces new kernel parameters and changes the semantics of
    some existing kernel parameters, so that the hard lockup detector and the
    soft lockup detector can be disabled or enabled individually. With this
    series applied, the user interface is as follows.

    - parameters in /proc/sys/kernel

    . soft_watchdog
    This is a new parameter to control and examine the run state of
    the soft lockup detector.

    . nmi_watchdog
    The semantics of this parameter have changed. It can now be used
    to control and examine the run state of the hard lockup detector.

    . watchdog
    This parameter is still available to control the run state of both
    lockup detectors at the same time. If this parameter is examined,
    it shows the logical OR of soft_watchdog and nmi_watchdog.

    . watchdog_thresh
    The semantics of this parameter are not affected by the patch.

    - kernel command line parameters

    . nosoftlockup
    The semantics of this parameter have changed. It can now be used
    to disable the soft lockup detector at boot time.

    . nmi_watchdog=0 or nmi_watchdog=1
    Disable or enable the hard lockup detector at boot time. The patch
    introduces '=1' as a new option.

    . nowatchdog
    The semantics of this parameter are not affected by the patch. It
    is still available to disable both lockup detectors at boot time.

    Also, remove the proc_dowatchdog() function which is no longer needed.

    [dzickus@redhat.com: wrote changelog]
    [dzickus@redhat.com: update documentation for kernel params and sysctl]
    Signed-off-by: Ulrich Obergfell
    Signed-off-by: Don Zickus
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • Pull vfs update from Al Viro:
    "Part one:

    - struct filename-related cleanups

    - saner iov_iter_init() replacements (and switching the syscalls to
    use of those)

    - ntfs switch to ->write_iter() (Anton)

    - aio cleanups and splitting iocb into common and async parts
    (Christoph)

    - assorted fixes (me, bfields, Andrew Elble)

    There's a lot more, including the completion of switchover to
    ->{read,write}_iter(), d_inode/d_backing_inode annotations, f_flags
    race fixes, etc, but that goes after #for-davem merge. David has
    pulled it, and once it's in I'll send the next vfs pull request"

    * 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (35 commits)
    sg_start_req(): use import_iovec()
    sg_start_req(): make sure that there's not too many elements in iovec
    blk_rq_map_user(): use import_single_range()
    sg_io(): use import_iovec()
    process_vm_access: switch to {compat_,}import_iovec()
    switch keyctl_instantiate_key_common() to iov_iter
    switch {compat_,}do_readv_writev() to {compat_,}import_iovec()
    aio_setup_vectored_rw(): switch to {compat_,}import_iovec()
    vmsplice_to_user(): switch to import_iovec()
    kill aio_setup_single_vector()
    aio: simplify arguments of aio_setup_..._rw()
    aio: lift iov_iter_init() into aio_setup_..._rw()
    lift iov_iter into {compat_,}do_readv_writev()
    NFS: fix BUG() crash in notify_change() with patch to chown_common()
    dcache: return -ESTALE not -EBUSY on distributed fs race
    NTFS: Version 2.1.32 - Update file write from aio_write to write_iter.
    VFS: Add iov_iter_fault_in_multipages_readable()
    drop bogus check in file_open_root()
    switch security_inode_getattr() to struct path *
    constify tomoyo_realpath_from_path()
    ...

    Linus Torvalds
     

11 Feb, 2015

1 commit

  • Commit ed4d4902ebdd ("mm, hugetlb: remove hugetlb_zero and
    hugetlb_infinity") replaced 'unsigned long hugetlb_zero' with 'int zero'
    leading to out-of-bounds access in proc_doulongvec_minmax(). Use
    '.extra1 = NULL' instead of '.extra1 = &zero'. Passing NULL is
    equivalent to passing minimal value, which is 0 for unsigned types.

    Fixes: ed4d4902ebdd ("mm, hugetlb: remove hugetlb_zero and hugetlb_infinity")
    Signed-off-by: Andrey Ryabinin
    Reported-by: Dmitry Vyukov
    Suggested-by: Manfred Spraul
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin