01 Feb, 2017

1 commit

  • commit ff9f8a7cf935468a94d9927c68b00daae701667e upstream.

    We perform the conversion between kernel jiffies and ms only when
    exporting the kernel value to user space.

    We need to do the opposite operation when the value is written by the
    user.

    This only matters when HZ != 1000.
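    For orientation, these are the standard kernel conversion helpers
    involved (kernel_val and user_ms are illustrative names; the proc
    handler wiring around them is omitted):

        unsigned int ms = jiffies_to_msecs(kernel_val);  /* on read: jiffies -> ms */
        unsigned long j = msecs_to_jiffies(user_ms);     /* on write: ms -> jiffies,
                                                            the missing half */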

    Signed-off-by: Eric Dumazet
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

11 Oct, 2016

1 commit

  • Pull misc vfs updates from Al Viro:
    "Assorted misc bits and pieces.

    There are several single-topic branches left after this (rename2
    series from Miklos, current_time series from Deepa Dinamani, xattr
    series from Andreas, uaccess stuff from me) and I'd prefer to
    send those separately"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
    proc: switch auxv to use of __mem_open()
    hpfs: support FIEMAP
    cifs: get rid of unused arguments of CIFSSMBWrite()
    posix_acl: uapi header split
    posix_acl: xattr representation cleanups
    fs/aio.c: eliminate redundant loads in put_aio_ring_file
    fs/internal.h: add const to ns_dentry_operations declaration
    compat: remove compat_printk()
    fs/buffer.c: make __getblk_slow() static
    proc: unsigned file descriptors
    fs/file: more unsigned file descriptors
    fs: compat: remove redundant check of nr_segs
    cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
    cifs: don't use memcpy() to copy struct iov_iter
    get rid of separate multipage fault-in primitives
    fs: Avoid premature clearing of capabilities
    fs: Give dentry to inode_change_ok() instead of inode
    fuse: Propagate dentry down to inode_change_ok()
    ceph: Propagate dentry down to inode_change_ok()
    xfs: Propagate dentry down to inode_change_ok()
    ...

    Linus Torvalds
     

07 Oct, 2016

1 commit

  • Pull namespace updates from Eric Biederman:
    "This set of changes is a number of smaller things that have been
    overlooked in other development cycles focused on more fundamental
    change. The devpts changes are small things that were a distraction
    until we managed to kill off DEVPTS_MULTIPLE_INSTANCES. There is a
    trivial regression fix to autofs for the unprivileged mount changes
    that went in last cycle. A pair of ioctls has been added by Andrey
    Vagin, making it possible to discover the relationships between
    namespaces when referring to them through file descriptors.

    The big user visible change is starting to add simple resource limits
    to catch programs that misbehave. With namespaces in general and user
    namespaces in particular allowing users to use more kinds of
    resources, it has become important to have something to limit errant
    programs. Because the purpose of these limits is to catch errant
    programs, the code needs to be inexpensive to use as it is always on,
    and the default limits need to be high enough that well-behaved
    programs on well-behaved systems don't encounter them.

    To this end, after some review I have implemented per-user,
    per-user-namespace limits, and use them to limit the number of
    namespaces. The limits being per user mean that one user cannot
    exhaust the limits of another user. The limits being per user
    namespace allow contexts where the limit is 0, so security-conscious
    folks can remove the code used to manage namespaces from their threat
    analysis (as they have historically done, since it was root-only). At
    the same time, the limits being per user namespace allow other parts
    of the system to use namespaces.

    Namespaces are increasingly being used in application sandboxing
    scenarios, so an all-or-nothing, system-wide disable for the
    security-conscious would make increasing use of these sandboxes
    impossible.

    There is also a new limit on the maximum number of mounts present in
    a single mount namespace. It is nontrivial to guess what a reasonable
    system-wide limit on the number of mount structures in the kernel
    would be, especially as it varies based on how a system is using
    containers. A limit on the number of mounts in a mount namespace,
    however, is much easier to understand and set. In most cases in
    practice only about 1000 mounts are used. Given that some autofs
    scenarios have the potential to reach 30,000 to 50,000 mounts, I have
    set the default limit for the number of mounts at 100,000, which is
    well above every known set of users but low enough that the mount
    hash tables don't degrade unreasonably.

    These limits are a start. I expect this establishes a pattern that
    other limits for resources that namespaces use will follow. There has
    been interest in making inotify event limits per user per user
    namespace, as well as interest expressed in making details about what
    is going on in the kernel more visible"
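    The new knobs land under /proc/sys/user/. A minimal user-space peek,
    assuming this series is applied (a missing file just means an older
    kernel):

        #include <stdio.h>

        int main(void)
        {
            const char *knobs[] = {
                "/proc/sys/user/max_user_namespaces",
                "/proc/sys/user/max_mnt_namespaces",
                "/proc/sys/user/max_net_namespaces",
            };
            char buf[64];

            for (int i = 0; i < 3; i++) {
                FILE *f = fopen(knobs[i], "r");
                if (!f)
                    continue;              /* kernel predates this series */
                if (fgets(buf, sizeof(buf), f))
                    printf("%s = %s", knobs[i], buf);
                fclose(f);
            }
            return 0;
        }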

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
    autofs: Fix automounts by using current_real_cred()->uid
    mnt: Add a per mount namespace limit on the number of mounts
    netns: move {inc,dec}_net_namespaces into #ifdef
    nsfs: Simplify __ns_get_path
    tools/testing: add a test to check nsfs ioctl-s
    nsfs: add ioctl to get a parent namespace
    nsfs: add ioctl to get an owning user namespace for ns file descriptor
    kernel: add a helper to get an owning user namespace for a namespace
    devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
    devpts: Remove sync_filesystems
    devpts: Make devpts_kill_sb safe if fsi is NULL
    devpts: Simplify devpts_mount by using mount_nodev
    devpts: Move the creation of /dev/pts/ptmx into fill_super
    devpts: Move parse_mount_options into fill_super
    userns: When the per user per user namespace limit is reached return ENOSPC
    userns: Document per user per user namespace limits.
    mntns: Add a limit on the number of mount namespaces.
    netns: Add a limit on the number of net namespaces
    cgroupns: Add a limit on the number of cgroup namespaces
    ipcns: Add a limit on the number of ipc namespaces
    ...

    Linus Torvalds
     

01 Oct, 2016

1 commit

  • CAI Qian pointed out that the semantics
    of shared subtrees make it possible to create an exponentially
    increasing number of mounts in a mount namespace.

    mkdir /tmp/1 /tmp/2
    mount --make-rshared /
    for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

    will create 2^20 (1,048,576) mounts, which is a practical problem, as
    some people have managed to hit this by accident.

    As such CVE-2016-6213 was assigned.

    Ian Kent described the situation for autofs users
    as follows:

    > The number of mounts for direct mount maps is usually not very large because of
    > the way they are implemented, large direct mount maps can have performance
    > problems. There can be anywhere from a few (likely case a few hundred) to less
    > than 10000, plus mounts that have been triggered and not yet expired.
    >
    > Indirect mounts have one autofs mount at the root plus the number of mounts that
    > have been triggered and not yet expired.
    >
    > The number of autofs indirect map entries can range from a few to the common
    > case of several thousand and in rare cases up to between 30000 and 50000. I've
    > not heard of people with maps larger than 50000 entries.
    >
    > The larger the number of map entries the greater the possibility for a large
    > number of active mounts so it's not hard to expect cases of a 1000 or somewhat
    > more active mounts.

    So I am setting the default number of mounts allowed per mount
    namespace at 100,000. This is more than enough for any use case I
    know of, but small enough to quickly stop an exponential increase in
    mounts, which should be perfect for catching misconfigurations and
    malfunctioning programs.

    For anyone who needs a higher limit this can be changed by writing
    to the new /proc/sys/fs/mount-max sysctl.
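    The enforcement idea, as a kernel-style sketch (identifier names here
    are illustrative, not the exact upstream ones):

        /* Refuse to attach mounts once the namespace is at its limit;
         * the limit itself is what /proc/sys/fs/mount-max exposes. */
        static int may_add_mounts(unsigned int ns_mounts,
                                  unsigned int to_add,
                                  unsigned int mount_max)
        {
            if (ns_mounts + to_add > mount_max)
                return -ENOSPC;
            return 0;
        }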

    Tested-by: CAI Qian
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

28 Sep, 2016

2 commits

  • After 7e8e385aaf6e ("x86/compat: Remove sys32_vm86_warning"), this
    function has become unused, so we can remove it as well.

    Link: http://lkml.kernel.org/r/20160617142903.3070388-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Cc: Alexander Viro
    Cc: "Theodore Ts'o"
    Cc: Arnaldo Carvalho de Melo
    Signed-off-by: Andrew Morton

    Arnd Bergmann
     
  • Propagate unsignedness for grand total of 149 bytes:

    $ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux
    add/remove: 0/0 grow/shrink: 0/10 up/down: 0/-149 (-149)
    function old new delta
    set_close_on_exec 99 98 -1
    put_files_struct 201 200 -1
    get_close_on_exec 59 58 -1
    do_prlimit 498 497 -1
    do_execveat_common.isra 1662 1661 -1
    __close_fd 178 173 -5
    do_dup2 219 204 -15
    seq_show 685 660 -25
    __alloc_fd 384 357 -27
    dup_fd 718 646 -72

    It mostly comes from converting "unsigned int" to "long" for bit operations.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Al Viro

    Alexey Dobriyan
     

27 Aug, 2016

1 commit

    We have scripts which write to certain fields on 3.18 kernels, but
    this seems to be failing on 4.4 kernels. One entry we write to here
    is xfrm_aevent_rseqth, which is a u32.

    echo 4294967295 > /proc/sys/net/core/xfrm_aevent_rseqth

    Commit 230633d109e3 ("kernel/sysctl.c: detect overflows when converting
    to int") prevented writing to sysctl entries when integer overflow
    occurs. However, this does not apply to unsigned integers.

    Heinrich suggested that we introduce a new option to handle 64 bit
    limits and set min as 0 and max as UINT_MAX. This might not work as it
    leads to issues similar to __do_proc_doulongvec_minmax. Alternatively,
    we would need to change the datatype of the entry to 64 bit.

    static int __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, ...)
    {
        i = (unsigned long *) data;
        /* ^ this cast reads beyond the size of data (u32) */
        vleft = table->maxlen / sizeof(unsigned long);
        /* ^ vleft is 0: maxlen is sizeof(u32), which is less than
           sizeof(unsigned long) on x86_64 */

    Introduce a new proc handler proc_douintvec. Individual proc entries
    will need to be updated to use the new handler.
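    A sketch of what such an update looks like; the table and variable are
    illustrative, and registration is elided (proc_douintvec is the
    handler this patch introduces):

        static unsigned int example_u32;   /* e.g. an rseqth-style counter */

        static struct ctl_table example_table[] = {
            {
                .procname     = "example_u32",
                .data         = &example_u32,
                .maxlen       = sizeof(unsigned int),
                .mode         = 0644,
                .proc_handler = proc_douintvec,   /* was proc_dointvec */
            },
            { }
        };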

    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: 230633d109e3 ("kernel/sysctl.c:detect overflows when converting to int")
    Link: http://lkml.kernel.org/r/1471479806-5252-1-git-send-email-subashab@codeaurora.org
    Signed-off-by: Subash Abhinov Kasiviswanathan
    Cc: Heinrich Schuchardt
    Cc: Kees Cook
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Subash Abhinov Kasiviswanathan
     

03 Aug, 2016

1 commit

  • Add a "printk.devkmsg" kernel command line parameter which controls how
    userspace writes into /dev/kmsg. It has three options:

    * ratelimit - ratelimit logging from userspace
    * on        - unlimited logging from userspace
    * off       - logging from userspace gets ignored

    The default setting is to ratelimit the messages written to it.

    This changes the kernel default setting from "on" to "ratelimit", and
    we do that because we want to keep userspace spamming of /dev/kmsg at
    sane levels. This matters especially when a small kernel log buffer
    wraps around and messages get lost. With the ratelimit setting,
    kernel messages have a somewhat higher chance of surviving all the
    spamming.

    It additionally does not limit logging to /dev/kmsg while the system is
    booting if we haven't disabled it on the command line.

    Furthermore, we can control the logging from a lower-priority sysctl
    interface - kernel.printk_devkmsg.

    That interface will succeed only if printk.devkmsg *hasn't* been
    supplied on the command line. If it has, then printk.devkmsg is a
    one-time setting which remains for the duration of the system lifetime.
    This "locking" of the setting is to prevent userspace from changing the
    logging on us through sysctl(2).
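    A small user-space demo of the interface being controlled here; under
    the new "ratelimit" default, most of this burst should be throttled
    (compare the output of "dmesg | grep 'devkmsg demo'" afterwards):

        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            int fd = open("/dev/kmsg", O_WRONLY);
            if (fd < 0) {
                perror("open /dev/kmsg");
                return 1;
            }
            for (int i = 0; i < 100; i++) {
                char msg[64];
                int n = snprintf(msg, sizeof(msg), "<6>devkmsg demo %d\n", i);
                (void)write(fd, msg, n);   /* excess gets ratelimited */
            }
            close(fd);
            return 0;
        }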

    This patch is based on previous patches from Linus and Steven.

    [bp@suse.de: fixes]
    Link: http://lkml.kernel.org/r/20160719072344.GC25563@nazgul.tnic
    Link: http://lkml.kernel.org/r/20160716061745.15795-3-bp@alien8.de
    Signed-off-by: Borislav Petkov
    Cc: Dave Young
    Cc: Franck Bui
    Cc: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Uwe Kleine-König
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     

29 Jul, 2016

1 commit

  • As reclaim is now per-node based, convert zone_reclaim to be
    node_reclaim. It is possible that a node will be reclaimed multiple
    times if it has multiple zones but this is unavoidable without caching
    all nodes traversed so far. The documentation and interface to
    userspace are the same from a configuration perspective, and
    behaviour will be similar unless the node-local allocation requests
    were also limited to lower zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-24-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

16 Jun, 2016

1 commit

    It is not always easy to determine the cause of an RCU stall just by
    analysing the RCU stall messages, mainly when the problem is caused
    by the indirect starvation of RCU threads - for example, when
    preempt_rcu is not awakened due to the starvation of a timer softirq.

    We have been hard-coding panic() in the RCU stall functions for
    some time while testing the kernel-rt. But this is not possible in
    some scenarios, like when supporting customers.

    This patch implements the sysctl kernel.panic_on_rcu_stall. If
    set to 1, the system will panic() when an RCU stall takes place,
    enabling the capture of a vmcore. The vmcore provides a way to analyze
    all kernel/tasks states, helping out to point to the culprit and the
    solution for the stall.

    The kernel.panic_on_rcu_stall sysctl is disabled by default.
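    The core of the change is small. A kernel-style sketch along the
    lines described (the surrounding stall-report plumbing is omitted):

        int sysctl_panic_on_rcu_stall __read_mostly;

        static void panic_on_rcu_stall(void)
        {
            if (sysctl_panic_on_rcu_stall)
                panic("RCU Stall\n");
        }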

    Changes from v1:
    - Fixed a typo in the git log
    - The if(sysctl_panic_on_rcu_stall) panic() is in a static function
    - Fixed the CONFIG_TINY_RCU compilation issue
    - The var sysctl_panic_on_rcu_stall is now __read_mostly

    Cc: Jonathan Corbet
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Mathieu Desnoyers
    Cc: Lai Jiangshan
    Acked-by: Christian Borntraeger
    Reviewed-by: Josh Triplett
    Reviewed-by: Arnaldo Carvalho de Melo
    Tested-by: "Luis Claudio R. Goncalves"
    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Paul E. McKenney

    Daniel Bristot de Oliveira
     

26 May, 2016

1 commit

  • Pull perf updates from Ingo Molnar:
    "Mostly tooling and PMU driver fixes, but also a number of late updates
    such as the reworking of the call-chain size limiting logic to make
    call-graph recording more robust, plus tooling side changes for the
    new 'backwards ring-buffer' extension to the perf ring-buffer"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
    perf record: Read from backward ring buffer
    perf record: Rename variable to make code clear
    perf record: Prevent reading invalid data in record__mmap_read
    perf evlist: Add API to pause/resume
    perf trace: Use the ptr->name beautifier as default for "filename" args
    perf trace: Use the fd->name beautifier as default for "fd" args
    perf report: Add srcline_from/to branch sort keys
    perf evsel: Record fd into perf_mmap
    perf evsel: Add overwrite attribute and check write_backward
    perf tools: Set buildid dir under symfs when --symfs is provided
    perf trace: Only auto set call-graph to "dwarf" when syscalls are being traced
    perf annotate: Sort list of recognised instructions
    perf annotate: Fix identification of ARM blt and bls instructions
    perf tools: Fix usage of max_stack sysctl
    perf callchain: Stop validating callchains by the max_stack sysctl
    perf trace: Fix exit_group() formatting
    perf top: Use machine->kptr_restrict_warned
    perf trace: Warn when trying to resolve kernel addresses with kptr_restrict=1
    perf machine: Do not bail out if not managing to read ref reloc symbol
    perf/x86/intel/p4: Trival indentation fix, remove space
    ...

    Linus Torvalds
     

20 May, 2016

1 commit

  • Provide /proc/sys/vm/stat_refresh to force an immediate update of
    per-cpu into global vmstats: useful to avoid a sleep(2) or whatever
    before checking counts when testing. Originally added to work around a
    bug which left counts stranded indefinitely on a cpu going idle (an
    inaccuracy magnified when small below-batch numbers represent "huge"
    amounts of memory), but I believe that bug is now fixed: nonetheless,
    this is still a useful knob.

    Its schedule_on_each_cpu() is probably too expensive just to fold into
    reading /proc/meminfo itself: give this mode 0600 to prevent abuse.
    Allow a write or a read to do the same: nothing to read, but "grep -h
    Shmem /proc/sys/vm/stat_refresh /proc/meminfo" is convenient. Oh, and
    since global_page_state() itself is careful to disguise any underflow as
    0, hack in an "Invalid argument" and pr_warn() if a counter is negative
    after the refresh - this helped to fix a misaccounting of
    NR_ISOLATED_FILE in my migration code.

    But on recent kernels, I find that NR_ALLOC_BATCH and NR_PAGES_SCANNED
    often go negative some of the time. I have not yet worked out why, but
    have no evidence that it's actually harmful. Punt for the moment by
    just ignoring the anomaly on those.

    Signed-off-by: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Andres Lagar-Cavilla
    Cc: Yang Shi
    Cc: Ning Qu
    Cc: Mel Gorman
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 May, 2016

2 commits

  • The perf_sample->ip_callchain->nr value includes all the entries in the
    ip_callchain->ip[] array, real addresses and PERF_CONTEXT_{KERNEL,USER,etc},
    while what the user expects is that the value in the
    kernel.perf_event_max_stack sysctl, or in the upcoming per-event
    perf_event_attr.sample_max_stack knob, be honoured in terms of IP
    addresses in the stack trace.

    So allocate a bunch of extra entries for contexts, and do the accounting
    via perf_callchain_entry_ctx struct members.

    A new sysctl, kernel.perf_event_max_contexts_per_stack is also
    introduced for investigating possible bugs in the callchain
    implementation by some arch.

    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Brendan Gregg
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/n/tip-3b4wnqk340c4sg4gwkfdi9yk@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     
    So that it can be used for other stack-related knobs, such as the
    upcoming one to tweak the max number of contexts per stack sample.

    In all those cases we can only change the value if there are no perf
    sessions collecting stacks, so they need to grab that mutex, etc.

    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-8t3fk94wuzp8m2z1n4gc0s17@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

27 Apr, 2016

1 commit

  • The default remains 127, which is good for most cases, and not even hit
    most of the time, but then for some cases, as reported by Brendan, 1024+
    deep frames are appearing on the radar for things like groovy, ruby.

    And in some workloads putting a _lower_ cap on this may make sense. A
    per-event limit still needs to be put in place, though.

    The new file is:

    # cat /proc/sys/kernel/perf_event_max_stack
    127

    Changing it:

    # echo 256 > /proc/sys/kernel/perf_event_max_stack
    # cat /proc/sys/kernel/perf_event_max_stack
    256

    But as soon as there is some event using callchains we get:

    # echo 512 > /proc/sys/kernel/perf_event_max_stack
    -bash: echo: write error: Device or resource busy
    #

    Because we only allocate the callchain percpu data structures when
    there is a user, changing the max is easy - it's just a matter of
    having no callchain users at that point.

    Reported-and-Tested-by: Brendan Gregg
    Reviewed-by: Frederic Weisbecker
    Acked-by: Alexei Starovoitov
    Acked-by: David Ahern
    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/r/20160426002928.GB16708@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

18 Mar, 2016

1 commit

  • In machines with 140G of memory and enterprise flash storage, we have
    seen read and write bursts routinely exceed the kswapd watermarks and
    cause thundering herds in direct reclaim. Unfortunately, the only way
    to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
    system's emergency reserves - which is entirely unrelated to the
    system's latency requirements. In order to get kswapd to maintain a
    250M buffer of free memory, the emergency reserves need to be set to 1G.
    That is a lot of memory wasted for no good reason.

    On the other hand, it's reasonable to assume that allocation bursts and
    overall allocation concurrency scale with memory capacity, so it makes
    sense to make kswapd aggressiveness a function of that as well.

    Change the kswapd watermark scale factor from the currently fixed 25% of
    the tunable emergency reserve to a tunable 0.1% of memory.

    Beyond 1G of memory, this will produce bigger watermark steps than the
    current formula in default settings. Ensure that the new formula never
    chooses steps smaller than that, i.e. 25% of the emergency reserve.

    On a 140G machine, this raises the default watermark steps - the
    distance between min and low, and low and high - from 16M to 143M.
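    In other words, the step computation becomes roughly the following
    (a sketch of the formula as described, not necessarily the exact
    upstream expression; the default watermark_scale_factor of 10
    corresponds to 0.1% of memory):

        step = max(emergency_reserve / 4,
                   managed_pages * watermark_scale_factor / 10000);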

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

09 Feb, 2016

1 commit

  • schedstats is very useful during debugging and performance tuning but it
    incurs overhead to calculate the stats. As such, even though it can be
    disabled at build time, it is often enabled as the information is useful.

    This patch adds a kernel command-line and sysctl tunable to enable or
    disable schedstats on demand (when it's built in). It is disabled
    by default as someone who knows they need it can also learn to enable
    it when necessary.

    The benefits are dependent on how scheduler-intensive the workload is.
    If it is then the patch reduces the number of cycles spent calculating
    the stats with a small benefit from reducing the cache footprint of the
    scheduler.

    These measurements were taken from a 48-core 2-socket machine with
    Xeon(R) E5-2670 v3 CPUs, although they were also tested on a
    single-socket 8-core machine with Intel i7-3770 processors.

    netperf-tcp
    4.5.0-rc1 4.5.0-rc1
    vanilla nostats-v3r1
    Hmean 64 560.45 ( 0.00%) 575.98 ( 2.77%)
    Hmean 128 766.66 ( 0.00%) 795.79 ( 3.80%)
    Hmean 256 950.51 ( 0.00%) 981.50 ( 3.26%)
    Hmean 1024 1433.25 ( 0.00%) 1466.51 ( 2.32%)
    Hmean 2048 2810.54 ( 0.00%) 2879.75 ( 2.46%)
    Hmean 3312 4618.18 ( 0.00%) 4682.09 ( 1.38%)
    Hmean 4096 5306.42 ( 0.00%) 5346.39 ( 0.75%)
    Hmean 8192 10581.44 ( 0.00%) 10698.15 ( 1.10%)
    Hmean 16384 18857.70 ( 0.00%) 18937.61 ( 0.42%)

    Small gains here; UDP_STREAM showed nothing interesting and neither
    did the TCP_RR tests. The gains on the 8-core machine were very
    similar.

    tbench4
    4.5.0-rc1 4.5.0-rc1
    vanilla nostats-v3r1
    Hmean mb/sec-1 500.85 ( 0.00%) 522.43 ( 4.31%)
    Hmean mb/sec-2 984.66 ( 0.00%) 1018.19 ( 3.41%)
    Hmean mb/sec-4 1827.91 ( 0.00%) 1847.78 ( 1.09%)
    Hmean mb/sec-8 3561.36 ( 0.00%) 3611.28 ( 1.40%)
    Hmean mb/sec-16 5824.52 ( 0.00%) 5929.03 ( 1.79%)
    Hmean mb/sec-32 10943.10 ( 0.00%) 10802.83 ( -1.28%)
    Hmean mb/sec-64 15950.81 ( 0.00%) 16211.31 ( 1.63%)
    Hmean mb/sec-128 15302.17 ( 0.00%) 15445.11 ( 0.93%)
    Hmean mb/sec-256 14866.18 ( 0.00%) 15088.73 ( 1.50%)
    Hmean mb/sec-512 15223.31 ( 0.00%) 15373.69 ( 0.99%)
    Hmean mb/sec-1024 14574.25 ( 0.00%) 14598.02 ( 0.16%)
    Hmean mb/sec-2048 13569.02 ( 0.00%) 13733.86 ( 1.21%)
    Hmean mb/sec-3072 12865.98 ( 0.00%) 13209.23 ( 2.67%)

    Small gains of 2-4% at low thread counts and otherwise flat. The
    gains on the 8-core machine were slightly different.

    tbench4 on 8-core i7-3770 single socket machine
    Hmean mb/sec-1 442.59 ( 0.00%) 448.73 ( 1.39%)
    Hmean mb/sec-2 796.68 ( 0.00%) 794.39 ( -0.29%)
    Hmean mb/sec-4 1322.52 ( 0.00%) 1343.66 ( 1.60%)
    Hmean mb/sec-8 2611.65 ( 0.00%) 2694.86 ( 3.19%)
    Hmean mb/sec-16 2537.07 ( 0.00%) 2609.34 ( 2.85%)
    Hmean mb/sec-32 2506.02 ( 0.00%) 2578.18 ( 2.88%)
    Hmean mb/sec-64 2511.06 ( 0.00%) 2569.16 ( 2.31%)
    Hmean mb/sec-128 2313.38 ( 0.00%) 2395.50 ( 3.55%)
    Hmean mb/sec-256 2110.04 ( 0.00%) 2177.45 ( 3.19%)
    Hmean mb/sec-512 2072.51 ( 0.00%) 2053.97 ( -0.89%)

    In contrast, this shows a relatively steady 2-3% gain at higher
    thread counts. Due to the nature of the patch and the type of
    workload, it's not a surprise that the result depends on the CPU
    used.

    hackbench-pipes
    4.5.0-rc1 4.5.0-rc1
    vanilla nostats-v3r1
    Amean 1 0.0637 ( 0.00%) 0.0660 ( -3.59%)
    Amean 4 0.1229 ( 0.00%) 0.1181 ( 3.84%)
    Amean 7 0.1921 ( 0.00%) 0.1911 ( 0.52%)
    Amean 12 0.3117 ( 0.00%) 0.2923 ( 6.23%)
    Amean 21 0.4050 ( 0.00%) 0.3899 ( 3.74%)
    Amean 30 0.4586 ( 0.00%) 0.4433 ( 3.33%)
    Amean 48 0.5910 ( 0.00%) 0.5694 ( 3.65%)
    Amean 79 0.8663 ( 0.00%) 0.8626 ( 0.43%)
    Amean 110 1.1543 ( 0.00%) 1.1517 ( 0.22%)
    Amean 141 1.4457 ( 0.00%) 1.4290 ( 1.16%)
    Amean 172 1.7090 ( 0.00%) 1.6924 ( 0.97%)
    Amean 192 1.9126 ( 0.00%) 1.9089 ( 0.19%)

    Some small gains and losses; while the variance data is not included,
    it's close to the noise. The UMA machine did not show anything
    particularly different.

    pipetest
    4.5.0-rc1 4.5.0-rc1
    vanilla nostats-v2r2
    Min Time 4.13 ( 0.00%) 3.99 ( 3.39%)
    1st-qrtle Time 4.38 ( 0.00%) 4.27 ( 2.51%)
    2nd-qrtle Time 4.46 ( 0.00%) 4.39 ( 1.57%)
    3rd-qrtle Time 4.56 ( 0.00%) 4.51 ( 1.10%)
    Max-90% Time 4.67 ( 0.00%) 4.60 ( 1.50%)
    Max-93% Time 4.71 ( 0.00%) 4.65 ( 1.27%)
    Max-95% Time 4.74 ( 0.00%) 4.71 ( 0.63%)
    Max-99% Time 4.88 ( 0.00%) 4.79 ( 1.84%)
    Max Time 4.93 ( 0.00%) 4.83 ( 2.03%)
    Mean Time 4.48 ( 0.00%) 4.39 ( 1.91%)
    Best99%Mean Time 4.47 ( 0.00%) 4.39 ( 1.91%)
    Best95%Mean Time 4.46 ( 0.00%) 4.38 ( 1.93%)
    Best90%Mean Time 4.45 ( 0.00%) 4.36 ( 1.98%)
    Best50%Mean Time 4.36 ( 0.00%) 4.25 ( 2.49%)
    Best10%Mean Time 4.23 ( 0.00%) 4.10 ( 3.13%)
    Best5%Mean Time 4.19 ( 0.00%) 4.06 ( 3.20%)
    Best1%Mean Time 4.13 ( 0.00%) 4.00 ( 3.39%)

    Small improvement and similar gains were seen on the UMA machine.

    The gain is small but it stands to reason that doing less work in the
    scheduler is a good thing. The downside is that the lack of schedstats and
    tracepoints may be surprising to experts doing performance analysis until
    they find the existence of the schedstats= parameter or schedstats sysctl.
    It will be automatically activated for latencytop and sleep profiling to
    alleviate the problem. For tracepoints there is a simple warning, as
    it's not safe to activate schedstats in a context where it is known
    the tracepoint may be wanted but is unavailable.

    Signed-off-by: Mel Gorman
    Reviewed-by: Matt Fleming
    Reviewed-by: Srikar Dronamraju
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1454663316-22048-1-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

23 Jan, 2016

1 commit

  • Pull more vfs updates from Al Viro:
    "Embarrassing braino fix + pipe page accounting + fixing an eyesore in
    find_filesystem() (checking that s1 is equal to prefix of s2 of given
    length can be done in many ways, but "compare strlen(s1) with length
    and then do strncmp()" is not a good one...)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    [regression] fix braino in fs/dlm/user.c
    pipe: limit the per-user amount of pages allocated in pipes
    find_filesystem(): simplify comparison

    Linus Torvalds
     

21 Jan, 2016

1 commit

  • SYSCTL_WRITES_WARN was added in commit f4aacea2f5d1 ("sysctl: allow for
    strict write position handling"), and released in v3.16 in August of
    2014. Since then I can find only 1 instance of non-zero offset
    writing[1], and it was fixed immediately in CRIU[2]. As such, it
    appears safe to flip this to the strict state now.

    [1] https://www.google.com/search?q="when%20file%20position%20was%20not%200"
    [2] http://lists.openvz.org/pipermail/criu/2015-April/019819.html

    Signed-off-by: Kees Cook
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

20 Jan, 2016

1 commit

    On not-so-small systems, it is possible for a single process to cause
    an OOM condition by filling large pipes with data that is never read.
    A typical process filling 4000 pipes with 1 MB of data will use 4 GB
    of memory. On small systems it may be tricky to set the pipe max size
    to prevent this from happening.

    This patch makes it possible to enforce a per-user soft limit above
    which new pipes will be limited to a single page, effectively limiting
    them to 4 kB each, as well as a hard limit above which no new pipes may
    be created for this user. This has the effect of protecting the system
    against memory abuse without hurting other users, and still allowing
    pipes to work correctly though with less data at once.

    The limits are controlled by two new sysctls: pipe-user-pages-soft
    and pipe-user-pages-hard. Both may be disabled by setting them to
    zero. The default soft limit allows the default number of FDs per
    process (1024) to create pipes of the default size (64kB), thus
    reaching a limit of 64MB before starting to create only smaller
    pipes. With 256 processes limited to 1024 FDs each, this results in
    1024*64kB + (256*1024 - 1024) * 4kB = 1084 MB of memory allocated for
    a user. The hard limit is disabled by default to avoid breaking
    existing applications that make intensive use of pipes (e.g. for
    splicing).
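    A user-space probe of the soft limit (assumes this patch;
    RLIMIT_NOFILE may need raising to hold enough pipes open to cross it):

        #define _GNU_SOURCE            /* for F_GETPIPE_SZ */
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            for (int i = 0; i < 4096; i++) {
                int fds[2];
                if (pipe(fds) < 0) {
                    perror("pipe");    /* EMFILE, or the hard limit if set */
                    break;
                }
                if (i % 512 == 0)      /* drops to one page past the soft limit */
                    printf("pipe %4d capacity: %d bytes\n",
                           i, fcntl(fds[0], F_GETPIPE_SZ));
            }
            return 0;
        }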

    Reported-by: socketpair@gmail.com
    Reported-by: Tetsuo Handa
    Mitigates: CVE-2013-4312 (Linux 2.0+)
    Suggested-by: Linus Torvalds
    Signed-off-by: Willy Tarreau
    Signed-off-by: Al Viro

    Willy Tarreau
     

15 Jan, 2016

1 commit

  • Address Space Layout Randomization (ASLR) provides a barrier to
    exploitation of user-space processes in the presence of security
    vulnerabilities by making it more difficult to find desired code/data
    which could help an attack. This is done by adding a random offset to
    the location of regions in the process address space, with a greater
    range of potential offset values corresponding to better protection/a
    larger search-space for brute force, but also to greater potential for
    fragmentation.

    The offset added to the mmap_base address, which provides the basis for
    the majority of the mappings for a process, is set once on process exec
    in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
    which reflect, hopefully, the best compromise for all systems. The
    trade-off between increased entropy in the offset value generation and
    the corresponding increased variability in address space fragmentation
    is not absolute, however, and some platforms may tolerate higher amounts
    of entropy. This patch introduces both new Kconfig values and a sysctl
    interface which may be used to change the amount of entropy used for
    offset generation on a system.

    The direct motivation for this change was in response to the
    libstagefright vulnerabilities that affected Android, specifically to
    information provided by Google's project zero at:

    http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html

    The attack presented therein, by Google's project zero, specifically
    targeted the limited randomness used to generate the offset added to the
    mmap_base address in order to craft a brute-force-based attack.
    Concretely, the attack was against the mediaserver process, which was
    limited to respawning every 5 seconds, on an arm device. The hard-coded
    8 bits used resulted in an average expected success rate of defeating
    the mmap ASLR after just over 10 minutes (128 tries at 5 seconds a
    piece). With this patch, and an accompanying increase in the entropy
    value to 16 bits, the same attack would take an average expected time of
    over 45 hours (32768 tries), which makes it both less feasible and more
    likely to be noticed.

    The introduced Kconfig and sysctl options are limited by per-arch
    minimum and maximum values, the minimum of which was chosen to match the
    current hard-coded value and the maximum of which was chosen so as to
    give the greatest flexibility without generating an invalid mmap_base
    address, generally 3-4 bits less than the number of bits in the
    user-space-accessible virtual address space.

    When deciding whether or not to change the default value, a system
    developer should consider that the mmap_base address could be placed
    anywhere up to 2^(value) bits away from the non-randomized location,
    which would introduce variable-sized areas above and below the
    mmap_base address such that the maximum vm_area_struct size may be
    reduced, preventing very large allocations.
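    The effect is easy to observe from user space: run the following
    repeatedly and compare the high bits of the returned address (on a
    kernel with this series, the spread is governed by the new
    vm.mmap_rnd_bits sysctl):

        #include <stdio.h>
        #include <sys/mman.h>

        int main(void)
        {
            void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            printf("anonymous mapping at %p\n", p);
            munmap(p, 4096);
            return 0;
        }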

    This patch (of 4):

    ASLR only uses as few as 8 bits to generate the random offset for the
    mmap base address on 32 bit architectures. This value was chosen to
    prevent a poorly chosen value from dividing the address space in such a
    way as to prevent large allocations. This may not be an issue on all
    platforms. Allow the specification of a minimum number of bits so that
    platforms desiring greater ASLR protection may determine where to place
    the trade-off.

    Signed-off-by: Daniel Cashman
    Cc: Russell King
    Acked-by: Kees Cook
    Cc: Ingo Molnar
    Cc: Jonathan Corbet
    Cc: Don Zickus
    Cc: Eric W. Biederman
    Cc: Heinrich Schuchardt
    Cc: Josh Poimboeuf
    Cc: Kirill A. Shutemov
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Thomas Gleixner
    Cc: David Rientjes
    Cc: Mark Salyzyn
    Cc: Jeff Vander Stoep
    Cc: Nick Kralevich
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: "H. Peter Anvin"
    Cc: Hector Marco-Gisbert
    Cc: Borislav Petkov
    Cc: Ralf Baechle
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Cashman
     

06 Nov, 2015

2 commits

  • The only way to enable a hardlockup to panic the machine is to set
    'nmi_watchdog=panic' on the kernel command line.

    This makes it awkward for end users and folks who want to run
    automated tests (like myself).

    Mimic the softlockup_panic knob and create a /proc/sys/kernel/hardlockup_panic
    knob.

    Signed-off-by: Don Zickus
    Cc: Ulrich Obergfell
    Acked-by: Jiri Kosina
    Reviewed-by: Aaron Tomlin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Don Zickus
     
    In many cases of hardlockup reports, it's actually not possible to
    know why it triggered, because the CPU that got stuck is usually
    waiting on a resource (with IRQs disabled) that some other CPU is
    holding.

    IOW, we are often looking at the stacktrace of the victim and not the
    actual offender.

    Introduce a sysctl / cmdline parameter that makes it possible to have
    the hardlockup detector perform an all-CPU backtrace.

    Signed-off-by: Jiri Kosina
    Reviewed-by: Aaron Tomlin
    Cc: Ulrich Obergfell
    Acked-by: Don Zickus
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     

13 Oct, 2015

1 commit

    In order to let unprivileged users load and execute eBPF programs,
    teach the verifier to prevent pointer leaks. The verifier will
    prevent:
    - any arithmetic on pointers
      (except R10+Imm, which is used to compute stack addresses)
    - comparison of pointers
      (except if (map_value_ptr == 0) ... )
    - passing pointers to helper functions
    - indirectly passing pointers in stack to helper functions
    - returning a pointer from a bpf program
    - storing pointers into ctx or maps

    Spill/fill of pointers into stack is allowed, but mangling
    of pointers stored in the stack or reading them byte by byte is not.

    Within bpf programs the pointers do exist, since programs need to
    be able to access maps, pass the skb pointer to LD_ABS insns, etc.,
    but programs cannot pass such pointer values to the outside
    or obfuscate them.

    Only allow BPF_PROG_TYPE_SOCKET_FILTER unprivileged programs,
    so that socket filters (tcpdump), af_packet (quic acceleration)
    and future kcm can use it.
    tracing and tc cls/act program types still require root permissions,
    since tracing actually needs to be able to see all kernel pointers
    and tc is for root only.

    For example, the following unprivileged socket filter program is allowed:
    int bpf_prog1(struct __sk_buff *skb)
    {
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);

        if (value)
            *value += skb->len;
        return 0;
    }

    but the following program is not:
    int bpf_prog1(struct __sk_buff *skb)
    {
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);

        if (value)
            *value += (u64) skb;
        return 0;
    }
    since it would leak the kernel address into the map.

    Unprivileged socket filter bpf programs have access to the
    following helper functions:
    - map lookup/update/delete (but they cannot store kernel pointers into them)
    - get_random (it's already exposed to unprivileged user space)
    - get_smp_processor_id
    - tail_call into another socket filter program
    - ktime_get_ns

    The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
    This toggle defaults to off (0) but can be set to true (1). Once
    true, bpf programs and maps cannot be accessed from an unprivileged
    process, and the toggle cannot be set back to false.

    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Kees Cook
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

11 Sep, 2015

2 commits

  • The following

        if (val < 0)
            *lvalp = (unsigned long)-val;

    is incorrect because the compiler is free to assume -val to be positive
    and use a sign-extend instruction for extending the bit pattern. This is
    a problem if val == INT_MIN:

    # echo -2147483648 >/proc/sys/dev/scsi/logging_level
    # cat /proc/sys/dev/scsi/logging_level
    -18446744071562067968

    Cast to unsigned long before negation - that way we first sign-extend and
    then negate an unsigned, which is well defined. With this:

    # cat /proc/sys/dev/scsi/logging_level
    -2147483648
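    The corrected conversion, per the description above:

        if (val < 0)
            *lvalp = -(unsigned long)val;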

    Signed-off-by: Ilya Dryomov
    Cc: Mikulas Patocka
    Cc: Robert Xiao
    Cc: "Eric W. Biederman"
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ilya Dryomov
     
    There are two kexec load syscalls: kexec_load and kexec_file_load.
    kexec_file_load has been split out into kernel/kexec_file.c. In this
    patch I split the kexec_load syscall code into kernel/kexec.c.

    And add a new kconfig option KEXEC_CORE, so we can disable kexec_load
    and use kexec_file_load only, or vice versa.

    The original requirement is from Ted Ts'o: he wants the kexec kernel
    signature to be checked with CONFIG_KEXEC_VERIFY_SIG enabled. But
    kexec-tools can use the kexec_load syscall to bypass that checking.

    Vivek Goyal proposed to create a common kconfig option so user can compile
    in only one syscall for loading kexec kernel. KEXEC/KEXEC_FILE selects
    KEXEC_CORE so that old config files still work.

    Because there is general code that needs CONFIG_KEXEC_CORE, I updated
    all the architecture Kconfigs with the new option KEXEC_CORE, and let
    KEXEC select KEXEC_CORE in the arch Kconfig. General kernel code was
    also updated accordingly for the kexec_load syscall.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Dave Young
    Cc: Eric W. Biederman
    Cc: Vivek Goyal
    Cc: Petr Tesarik
    Cc: Theodore Ts'o
    Cc: Josh Boyer
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     

04 Jul, 2015

1 commit

  • Pull user namespace updates from Eric Biederman:
    "Long ago and far away when user namespaces where young it was realized
    that allowing fresh mounts of proc and sysfs with only user namespace
    permissions could violate the basic rule that only root gets to decide
    if proc or sysfs should be mounted at all.

    Some hacks were put in place to reduce the worst of the damage that
    could be done, and the common-sense rule was adopted that fresh
    mounts of proc and sysfs should allow no more than bind mounts of
    proc and sysfs. Unfortunately that rule has not been fully enforced.

    There are two kinds of gaps in that enforcement. Only filesystems
    mounted on empty directories of proc and sysfs should be ignored but
    the test for empty directories was insufficient. So in my tree
    directories on proc, sysctl and sysfs that will always be empty are
    created specially. Every other technique is imperfect as an ordinary
    directory can have entries added even after a readdir returns and
    shows that the directory is empty. Special creation of directories
    for mount points makes the code in the kernel a smidge clearer about
    its purpose. I asked container developers from the various container
    projects to help test this and no holes were found in the set of mount
    points on proc and sysfs that are created specially.

    This set of changes also starts enforcing that the mount flags of
    fresh mounts of proc and sysfs are consistent with the existing mount
    of proc and sysfs. I expected this to be the boring part of the work but
    unfortunately unprivileged userspace winds up mounting fresh copies of
    proc and sysfs with noexec and nosuid clear when root set those flags
    on the previous mount of proc and sysfs. So for now only the atime,
    read-only and nodev attributes which userspace happens to keep
    consistent are enforced. Dealing with the noexec and nosuid
    attributes remains for another time.

    This set of changes also addresses an issue with how open file
    descriptors from /proc/<pid>/ns/* are displayed. Recently readlink of
    /proc/<pid>/fd has been triggering a WARN_ON that has not been
    meaningful since it was added (as all of the code in the kernel was
    converted) and is not now actively wrong.

    There is also a short list of issues that have not been fixed yet that
    I will mention briefly.

    It is possible to rename a directory from below to above a bind mount.
    At which point any directory pointers below the renamed directory can
    be walked up to the root directory of the filesystem. With user
    namespaces enabled a bind mount of the bind mount can be created
    allowing the user to pick a directory whose children they can rename
    to outside of the bind mount. This is challenging to fix and doubly
    so because all obvious solutions must touch code that is in the
    performance part of pathname resolution.

    As mentioned above there is also a question of how to ensure that
    developers by accident or with purpose do not introduce executable
    files on sysfs and proc and in doing so introduce security regressions
    in the current userspace that will not be immediately obvious and as
    such are likely to require breaking userspace in painful ways once
    they are recognized"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    vfs: Remove incorrect debugging WARN in prepend_path
    mnt: Update fs_fully_visible to test for permanently empty directories
    sysfs: Create mountpoints with sysfs_create_mount_point
    sysfs: Add support for permanently empty directories to serve as mount points.
    kernfs: Add support for always empty directories.
    proc: Allow creating permanently empty directories that serve as mount points
    sysctl: Allow creating permanently empty directories that serve as mountpoints.
    fs: Add helper functions for permanently empty directories.
    vfs: Ignore unlocked mounts in fs_fully_visible
    mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
    mnt: Refactor the logic for mounting sysfs and proc in a user namespace

    Linus Torvalds
     

25 Jun, 2015

1 commit

  • Change the default behavior of watchdog so it only runs on the
    housekeeping cores when nohz_full is enabled at build and boot time.
    Allow modifying the set of cores the watchdog is currently running on
    with a new kernel.watchdog_cpumask sysctl.

    In the current system, the watchdog subsystem runs a periodic timer that
    schedules the watchdog kthread to run. However, nohz_full cores are
    designed to allow userspace application code running on those cores to
    have 100% access to the CPU. So the watchdog system prevents the
    nohz_full application code from being able to run the way it wants to,
    thus the motivation to suppress the watchdog on nohz_full cores, which
    this patchset provides by default.

    However, if we disable the watchdog globally, then the housekeeping
    cores can't benefit from the watchdog functionality. So we allow
    disabling it only on some cores. See Documentation/lockup-watchdogs.txt
    for more information.

    [jhubbard@nvidia.com: fix a watchdog crash in some configurations]
    Signed-off-by: Chris Metcalf
    Acked-by: Don Zickus
    Cc: Ingo Molnar
    Cc: Ulrich Obergfell
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Frederic Weisbecker
    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     

19 Jun, 2015

1 commit

    Eric reported that the timer_migration sysctl is not really nice
    performance-wise, as it needs to check at every timer insertion whether
    the feature is enabled or not. Further the check does not live in the
    timer code, so we have an extra function call which checks an extra
    cache line to figure out that it is disabled.

    We can do better and store that information in the per cpu (hr)timer
    bases. I pondered to use a static key, but that's a nightmare to
    update from the nohz code and the timer base cache line is hot anyway
    when we select a timer base.
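    A sketch of the resulting shape (struct and field names illustrative):

        /* The flag lives in the per-cpu base the hot path already
         * touches, so checking it is effectively free. */
        struct example_timer_base {
            spinlock_t lock;
            bool       migration_enabled;   /* mirrored from the sysctl
                                               and the nohz switch */
            /* ... timer wheel state ... */
        };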

    The old logic enabled the timer migration unconditionally if
    CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
    line.

    With this modification, we start off with migration disabled. The user
    visible sysctl is still set to enabled. If the kernel switches to NOHZ
    migration is enabled, if the user did not disable it via the sysctl
    prior to the switch. If nohz=off is on the kernel command line,
    migration stays disabled no matter what.

    Before:
    47.76% hog [.] main
    14.84% [kernel] [k] _raw_spin_lock_irqsave
    9.55% [kernel] [k] _raw_spin_unlock_irqrestore
    6.71% [kernel] [k] mod_timer
    6.24% [kernel] [k] lock_timer_base.isra.38
    3.76% [kernel] [k] detach_if_pending
    3.71% [kernel] [k] del_timer
    2.50% [kernel] [k] internal_add_timer
    1.51% [kernel] [k] get_nohz_timer_target
    1.28% [kernel] [k] __internal_add_timer
    0.78% [kernel] [k] timerfn
    0.48% [kernel] [k] wake_up_nohz_cpu

    After:
    48.10% hog [.] main
    15.25% [kernel] [k] _raw_spin_lock_irqsave
    9.76% [kernel] [k] _raw_spin_unlock_irqrestore
    6.50% [kernel] [k] mod_timer
    6.44% [kernel] [k] lock_timer_base.isra.38
    3.87% [kernel] [k] detach_if_pending
    3.80% [kernel] [k] del_timer
    2.67% [kernel] [k] internal_add_timer
    1.33% [kernel] [k] __internal_add_timer
    0.73% [kernel] [k] timerfn
    0.54% [kernel] [k] wake_up_nohz_cpu

    Reported-by: Eric Dumazet
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Paul McKenney
    Cc: Frederic Weisbecker
    Cc: Viresh Kumar
    Cc: John Stultz
    Cc: Joonwoo Park
    Cc: Wenbo Wang
    Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

17 Apr, 2015

2 commits

    When converting unsigned long to int, overflows may occur. These are
    currently not detected when writing to the sysctl file system.

    E.g. on a system where int has 32 bits and long has 64 bits
    echo 0x800001234 > /proc/sys/kernel/threads-max
    has the same effect as
    echo 0x1234 > /proc/sys/kernel/threads-max

    The patch adds the missing check in do_proc_dointvec_conv.

    With the patch, an overflow will result in an error EINVAL when
    writing to the sysctl file system.
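    A sketch of the added check (along the lines of the patch; the
    negative-value branch is elided here):

        /* Reject values that do not fit in an int rather than
         * silently truncating them. */
        if (*lvalp > (unsigned long)INT_MAX)
            return -EINVAL;
        *valp = *lvalp;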

    Signed-off-by: Heinrich Schuchardt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heinrich Schuchardt
     
  • Users can change the maximum number of threads by writing to
    /proc/sys/kernel/threads-max.

    With the patch the value entered is checked against the same limits that
    apply when fork_init is called.

    Signed-off-by: Heinrich Schuchardt
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Guenter Roeck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heinrich Schuchardt
     

16 Apr, 2015

1 commit

  • Currently, pages which are marked as unevictable are protected from
    compaction, but not from other types of migration. The POSIX real time
    extension explicitly states that mlock() will prevent a major page
    fault, but the spirit of this is that mlock() should give a process the
    ability to control sources of latency, including minor page faults.
    However, the mlock manpage only explicitly says that a locked page will
    not be written to swap and this can cause some confusion. The
    compaction code today does not give a developer who wants to avoid swap
    but wants to have large contiguous areas available any method to achieve
    this state. This patch introduces a sysctl for controlling compaction
    behavior with respect to the unevictable lru. Users who demand no page
    faults after a page is present can set compact_unevictable_allowed to 0
    and users who need the large contiguous areas can enable compaction on
    locked memory by leaving the default value of 1.

    To illustrate this problem I wrote a quick test program that mmaps a
    large number of 1MB files filled with random data. These maps are
    created locked and read only. Then every other mmap is unmapped and I
    attempt to allocate huge pages to the static huge page pool. When the
    compact_unevictable_allowed sysctl is 0, I cannot allocate hugepages
    after fragmenting memory. When the value is set to 1, allocations
    succeed.

    Signed-off-by: Eric B Munson
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Acked-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Thomas Gleixner
    Cc: Christoph Lameter
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     

15 Apr, 2015

3 commits

  • Merge first patchbomb from Andrew Morton:

    - arch/sh updates

    - ocfs2 updates

    - kernel/watchdog feature

    - about half of mm/

    * emailed patches from Andrew Morton : (122 commits)
    Documentation: update arch list in the 'memtest' entry
    Kconfig: memtest: update number of test patterns up to 17
    arm: add support for memtest
    arm64: add support for memtest
    memtest: use phys_addr_t for physical addresses
    mm: move memtest under mm
    mm, hugetlb: abort __get_user_pages if current has been oom killed
    mm, mempool: do not allow atomic resizing
    memcg: print cgroup information when system panics due to panic_on_oom
    mm: numa: remove migrate_ratelimited
    mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
    mm: split ET_DYN ASLR from mmap ASLR
    s390: redefine randomize_et_dyn for ELF_ET_DYN_BASE
    mm: expose arch_mmap_rnd when available
    s390: standardize mmap_rnd() usage
    powerpc: standardize mmap_rnd() usage
    mips: extract logic for mmap_rnd()
    arm64: standardize mmap_rnd() usage
    x86: standardize mmap_rnd() usage
    arm: factor out mmap ASLR into mmap_rnd
    ...

    Linus Torvalds
     
  • With the current user interface of the watchdog mechanism it is only
    possible to disable or enable both lockup detectors at the same time.
    This series introduces new kernel parameters and changes the semantics of
    some existing kernel parameters, so that the hard lockup detector and the
    soft lockup detector can be disabled or enabled individually. With this
    series applied, the user interface is as follows.

    - parameters in /proc/sys/kernel

    . soft_watchdog
    This is a new parameter to control and examine the run state of
    the soft lockup detector.

    . nmi_watchdog
    The semantics of this parameter have changed. It can now be used
    to control and examine the run state of the hard lockup detector.

    . watchdog
    This parameter is still available to control the run state of both
    lockup detectors at the same time. If this parameter is examined,
    it shows the logical OR of soft_watchdog and nmi_watchdog.

    . watchdog_thresh
    The semantics of this parameter are not affected by the patch.

    - kernel command line parameters

    . nosoftlockup
    The semantics of this parameter have changed. It can now be used
    to disable the soft lockup detector at boot time.

    . nmi_watchdog=0 or nmi_watchdog=1
    Disable or enable the hard lockup detector at boot time. The patch
    introduces '=1' as a new option.

    . nowatchdog
    The semantics of this parameter are not affected by the patch. It
    is still available to disable both lockup detectors at boot time.

    Also, remove the proc_dowatchdog() function which is no longer needed.

    [dzickus@redhat.com: wrote changelog]
    [dzickus@redhat.com: update documentation for kernel params and sysctl]
    Signed-off-by: Ulrich Obergfell
    Signed-off-by: Don Zickus
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • Pull vfs update from Al Viro:
    "Part one:

    - struct filename-related cleanups

    - saner iov_iter_init() replacements (and switching the syscalls to
    use of those)

    - ntfs switch to ->write_iter() (Anton)

    - aio cleanups and splitting iocb into common and async parts
    (Christoph)

    - assorted fixes (me, bfields, Andrew Elble)

    There's a lot more, including the completion of switchover to
    ->{read,write}_iter(), d_inode/d_backing_inode annotations, f_flags
    race fixes, etc, but that goes after #for-davem merge. David has
    pulled it, and once it's in I'll send the next vfs pull request"

    * 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (35 commits)
    sg_start_req(): use import_iovec()
    sg_start_req(): make sure that there's not too many elements in iovec
    blk_rq_map_user(): use import_single_range()
    sg_io(): use import_iovec()
    process_vm_access: switch to {compat_,}import_iovec()
    switch keyctl_instantiate_key_common() to iov_iter
    switch {compat_,}do_readv_writev() to {compat_,}import_iovec()
    aio_setup_vectored_rw(): switch to {compat_,}import_iovec()
    vmsplice_to_user(): switch to import_iovec()
    kill aio_setup_single_vector()
    aio: simplify arguments of aio_setup_..._rw()
    aio: lift iov_iter_init() into aio_setup_..._rw()
    lift iov_iter into {compat_,}do_readv_writev()
    NFS: fix BUG() crash in notify_change() with patch to chown_common()
    dcache: return -ESTALE not -EBUSY on distributed fs race
    NTFS: Version 2.1.32 - Update file write from aio_write to write_iter.
    VFS: Add iov_iter_fault_in_multipages_readable()
    drop bogus check in file_open_root()
    switch security_inode_getattr() to struct path *
    constify tomoyo_realpath_from_path()
    ...

    Linus Torvalds
     

11 Feb, 2015

1 commit

  • Commit ed4d4902ebdd ("mm, hugetlb: remove hugetlb_zero and
    hugetlb_infinity") replaced 'unsigned long hugetlb_zero' with 'int zero'
    leading to out-of-bounds access in proc_doulongvec_minmax(). Use
    '.extra1 = NULL' instead of '.extra1 = &zero'. Passing NULL is
    equivalent to passing minimal value, which is 0 for unsigned types.

    Fixes: ed4d4902ebdd ("mm, hugetlb: remove hugetlb_zero and hugetlb_infinity")
    Signed-off-by: Andrey Ryabinin
    Reported-by: Dmitry Vyukov
    Suggested-by: Manfred Spraul
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin