27 Jul, 2013

1 commit

  • Some (integer) sysctl values are expressed in ms but have to be
    represented internally as jiffies. The msecs_to_jiffies() function
    returns an unsigned long, which then gets assigned to the integer.
    This patch prevents the value from being assigned if it is bigger
    than INT_MAX, done in a similar way as in cba9f3 ("Range checking in
    do_proc_dointvec_(userhz_)jiffies_conv").
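
    A minimal userspace sketch of the hazard being guarded against (the names
    and the check are illustrative, not the patched kernel code):

        #include <limits.h>
        #include <stdio.h>

        /* An unsigned long jiffies value must be range-checked before it is
         * stored in an int-sized sysctl variable, or it silently overflows. */
        int main(void)
        {
            unsigned long jiffies_val = (unsigned long)INT_MAX + 1000;

            if (jiffies_val > INT_MAX) {
                printf("reject %lu: does not fit in an int\n", jiffies_val);
                return 1;
            }
            printf("accept %lu\n", jiffies_val);
            return 0;
        }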

    Signed-off-by: Francesco Fusco
    CC: Andrew Morton
    CC: linux-kernel@vger.kernel.org
    Signed-off-by: David S. Miller

    Francesco Fusco
     

12 Jul, 2013

2 commits

  • Get upstream changes so we can apply fixes against them

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Pull tracing changes from Steven Rostedt:
    "The majority of the changes here are cleanups for the large changes
    that were added to 3.10, which includes several bug fixes that have
    been marked for stable.

    As for new features, there were a few, but nothing to write to LWN
    about. These include:

    New function triggers called "dump" and "cpudump" that will cause
    ftrace to dump its buffer to the console when the function is called.
    The difference between "dump" and "cpudump" is that "dump" will dump
    the entire contents of the ftrace buffer, whereas "cpudump" will only
    dump the contents of the ftrace buffer for the CPU that called the
    function.

    Another small enhancement is a new sysctl switch called
    "traceoff_on_warning" which, when enabled, will disable tracing if any
    WARN_ON() is triggered. This is useful if you want to debug what
    caused a warning and do not want to risk losing your trace data by the
    ring buffer overwriting the data before you can disable it. There's
    also a kernel command line option of the same name that enables this
    at boot"

    * tag 'trace-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (34 commits)
    tracing: Make tracing_open_generic_{tr,tc}() static
    tracing: Remove ftrace() function
    tracing: Remove TRACE_EVENT_TYPE enum definition
    tracing: Make tracer_tracing_{off,on,is_on}() static
    tracing: Fix irqs-off tag display in syscall tracing
    uprobes: Fix return value in error handling path
    tracing: Fix race between deleting buffer and setting events
    tracing: Add trace_array_get/put() to event handling
    tracing: Get trace_array ref counts when accessing trace files
    tracing: Add trace_array_get/put() to handle instance refs better
    tracing: Protect ftrace_trace_arrays list in trace_events.c
    tracing: Make trace_marker use the correct per-instance buffer
    ftrace: Do not run selftest if command line parameter is set
    tracing/kprobes: Don't pass addr=ip to perf_trace_buf_submit()
    tracing: Use flag buffer_disabled for irqsoff tracer
    tracing/kprobes: Turn trace_probe->files into list_head
    tracing: Fix disabling of soft disable
    tracing: Add missing syscall_metadata comment
    tracing: Simplify code for showing of soft disabled flag
    tracing/kprobes: Kill probe_enable_lock
    ...

    Linus Torvalds
     

10 Jul, 2013

1 commit

  • …eric/linux-dynticks into timers/urgent

    Pull nohz updates/fixes from Frederic Weisbecker:

    ' Note that "watchdog: Boot-disable by default on full dynticks" is a temporary
    solution to solve the issue with the watchdog that prevents the tick from
    stopping. This is to make sure that 3.11 doesn't have that problem as several
    people complained about it.

    A proper and longer term solution has been proposed by Peterz:

    http://lkml.kernel.org/r/20130618103632.GO3204@twins.programming.kicks-ass.net
    '

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

23 Jun, 2013

1 commit

  • This patch keeps track of how long perf's NMI handler is taking,
    and also calculates how many samples perf can take a second. If
    the sample length times the expected max number of samples
    exceeds a configurable threshold, it drops the sample rate.

    This way, we don't have a runaway sampling process eating up the
    CPU.
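
    A rough userspace model of the throttling idea described above (the rate,
    threshold and names are assumptions, not the kernel's variables):

        #include <stdio.h>

        static unsigned int sample_rate_hz = 100000;       /* current max rate  */
        static const unsigned int cpu_limit_percent = 25;  /* assumed threshold */

        /* percent of one CPU spent sampling = rate * len_ns / 1e9 * 100 */
        static void check_sample_rate(unsigned long avg_sample_len_ns)
        {
            unsigned long percent = (unsigned long)sample_rate_hz
                                    * avg_sample_len_ns / 10000000UL;

            if (percent > cpu_limit_percent) {
                sample_rate_hz /= 2;                        /* back off */
                printf("throttling: new max sample rate %u Hz\n", sample_rate_hz);
            }
        }

        int main(void)
        {
            check_sample_rate(5000);    /* pretend each NMI sample takes 5 us */
            return 0;
        }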

    This patch can tend to drop the sample rate down to a level where
    perf doesn't work very well. *BUT* the alternative is that my
    system hangs because it spends all of its time handling NMIs.

    I'll take a busted performance tool over an entire system that's
    busted and undebuggable any day.

    BTW, my suspicion is that there's still an underlying bug here.
    Using the HPET instead of the TSC is definitely a contributing
    factor, but I suspect there are some other things going on.
    But, I can't go dig down on a bug like that with my machine
    hanging all the time.

    Signed-off-by: Dave Hansen
    Acked-by: Peter Zijlstra
    Cc: paulus@samba.org
    Cc: acme@ghostprotocols.net
    Cc: Dave Hansen
    [ Prettified it a bit. ]
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

20 Jun, 2013

2 commits

  • We have two very conflicting state variable names in the
    watchdog:

    * watchdog_enabled: This one reflects the user interface. It's
    set to 1 by default and can be overridden with boot options
    or the sysctl/procfs interface.

    * watchdog_disabled: This is the internal toggle state that
    tells if watchdog threads, timers and NMI events are currently
    running or not. This state mostly depends on the user settings.
    It's a convenient state latch.

    Now we really need to find clearer names because those
    are just too confusing to encourage deep review.

    watchdog_enabled now becomes watchdog_user_enabled to reflect
    its purpose as an interface.

    watchdog_disabled becomes watchdog_running to suggest its
    role as a pure internal state.

    Signed-off-by: Frederic Weisbecker
    Cc: Srivatsa S. Bhat
    Cc: Anish Singh
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Don Zickus

    Frederic Weisbecker
     
  • Add a traceoff_on_warning option, settable both on the kernel command line
    and as a sysctl. When set, any WARN*() function that is hit will cause
    the tracing_on variable to be cleared, which disables writing to the
    ring buffer.

    This is useful especially when tracing a bug with function tracing. When
    a warning is hit, the print caused by the warning can flood the trace with
    the functions producing the output for the warning. This can make the
    resulting trace useless by either hiding where the bug happened, or worse,
    by overflowing the buffer and losing the trace of the bug totally.
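
    A minimal way to flip the switch at runtime, assuming the knob is exposed
    as /proc/sys/kernel/traceoff_on_warning (needs root):

        #include <stdio.h>

        /* Equivalent to `echo 1 > /proc/sys/kernel/traceoff_on_warning`. */
        int main(void)
        {
            FILE *f = fopen("/proc/sys/kernel/traceoff_on_warning", "w");

            if (!f) {
                perror("traceoff_on_warning");
                return 1;
            }
            fprintf(f, "1\n");
            return fclose(f) ? 1 : 0;
        }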

    Acked-by: Peter Zijlstra
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

28 May, 2013

1 commit

  • In old kernels, it was allowed to set softlockup_thresh to -1 or 0
    to disable softlockup detection. However, watchdog_thresh only
    uses 0 to disable detection; setting it to -1 just froze my
    box and there was nothing I could do but reboot.

    Signed-off-by: Li Zefan
    Acked-by: Don Zickus
    Link: http://lkml.kernel.org/r/51959668.9040106@huawei.com
    Signed-off-by: Ingo Molnar

    Li Zefan
     

30 Apr, 2013

3 commits

  • Add an admin_reserve_kbytes knob to allow admins to change the hardcoded
    memory reserve to something other than 3%, which may be multiple
    gigabytes on large memory systems. Only about 8MB is necessary to
    enable recovery in the default mode, and only a few hundred MB are
    required even when overcommit is disabled.

    This affects OVERCOMMIT_GUESS and OVERCOMMIT_NEVER.

    admin_reserve_kbytes is initialized to min(3% free pages, 8MB)

    I arrived at 8MB by summing the RSS of sshd or login, bash, and top.
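
    A small userspace model of that default (illustrative only; the /32
    approximation of 3% and the helper name are assumptions):

        #include <stdio.h>

        /* reserve = min(~3% of free memory, 8MB), expressed in kilobytes */
        static unsigned long default_admin_reserve_kb(unsigned long free_kb)
        {
            unsigned long three_percent = free_kb / 32;   /* ~3% */
            unsigned long eight_mb_kb = 8UL * 1024;

            return three_percent < eight_mb_kb ? three_percent : eight_mb_kb;
        }

        int main(void)
        {
            /* with 4GB free, the 8MB cap wins */
            printf("%lu kB\n", default_admin_reserve_kb(4UL * 1024 * 1024));
            return 0;
        }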

    Please see first patch in this series for full background, motivation,
    testing, and full changelog.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: make init_admin_reserve() static]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
     
  • Add user_reserve_kbytes knob.

    Limit the growth of the memory reserved for other user processes to
    min(3% current process size, user_reserve_pages). Only about 8MB is
    necessary to enable recovery in the default mode, and only a few hundred
    MB are required even when overcommit is disabled.

    user_reserve_pages defaults to min(3% free pages, 128MB)

    I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
    then adding the RSS of each.

    This only affects OVERCOMMIT_NEVER mode.
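
    A rough userspace model of the 'never' mode check with both reserves
    applied (simplified; names, units and the /32 approximation of 3% are
    assumptions, not the actual __vm_enough_memory()):

        #include <stdbool.h>
        #include <stdio.h>

        static bool enough_memory(unsigned long committed_pages,
                                  unsigned long request_pages,
                                  unsigned long limit_pages,    /* ram*ratio/100 + swap */
                                  unsigned long process_pages,  /* current process size */
                                  unsigned long user_reserve_pages,
                                  unsigned long admin_reserve_pages,
                                  bool is_admin)
        {
            unsigned long allowed = limit_pages;

            if (!is_admin)                     /* keep room for root to log in */
                allowed -= admin_reserve_pages;

            /* keep min(3% of this process, user reserve) free for other users */
            allowed -= (process_pages / 32 < user_reserve_pages)
                           ? process_pages / 32 : user_reserve_pages;

            return committed_pages + request_pages <= allowed;
        }

        int main(void)
        {
            printf("%d\n", enough_memory(800000, 100000, 1000000,
                                         500000, 32768, 2048, false));
            return 0;
        }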

    Background

    1. user reserve

    __vm_enough_memory reserves a hardcoded 3% of the current process size for
    other applications when overcommit is disabled. This was done so that a
    user could recover if they launched a memory hogging process. Without the
    reserve, a user would easily run into a message such as:

    bash: fork: Cannot allocate memory

    2. admin reserve

    Additionally, a hardcoded 3% of free memory is reserved for root in both
    overcommit 'guess' and 'never' modes. This was intended to prevent a
    scenario where root can't log in and perform recovery operations.

    Note that this reserve shrinks, and doesn't guarantee a useful reserve.

    Motivation

    The two hardcoded memory reserves should be updated to account for current
    memory sizes.

    Also, the admin reserve would be more useful if it didn't shrink too much.

    When the current code was originally written, 1GB was considered
    "enterprise". Now the 3% reserve can grow to multiple GB on large memory
    systems, and it only needs to be a few hundred MB at most to enable a user
    or admin to recover a system with an unwanted memory hogging process.

    I've found that reducing these reserves is especially beneficial for a
    specific type of application load:

    * single application system
    * one or few processes (e.g. one per core)
    * allocating all available memory
    * not initializing every page immediately
    * long running

    I've run scientific clusters with this sort of load. A long running job
    sometimes failed many hours (weeks of CPU time) into a calculation. They
    weren't initializing all of their memory immediately, and they weren't
    using calloc, so I put systems into overcommit 'never' mode. These
    clusters run diskless and have no swap.

    However, with the current reserves, a user wishing to allocate as much
    memory as possible to one process may be prevented from using, for
    example, almost 2GB out of 32GB.

    The effect is less, but still significant when a user starts a job with
    one process per core. I have repeatedly seen a set of processes
    requesting the same amount of memory fail because one of them could not
    allocate the amount of memory a user would expect to be able to allocate.
    For example, Message Passing Interface (MPI) processes, one per core. And
    it is similar for other parallel programming frameworks.

    Changing this reserve code will make the overcommit never mode more useful
    by allowing applications to allocate nearly all of the available memory.

    Also, the new admin_reserve_kbytes will be safer than the current behavior
    since the hardcoded 3% of available memory reserve can shrink to something
    useless in the case where applications have grabbed all available memory.

    Risks

    * "bash: fork: Cannot allocate memory"

    The downside of the first patch-- which creates a tunable user reserve
    that is only used in overcommit 'never' mode--is that an admin can set
    it so low that a user may not be able to kill their process, even if
    they already have a shell prompt.

    Of course, a user can get in the same predicament with the current 3%
    reserve--they just have to launch processes until 3% becomes negligible.

    * root-cant-log-in problem

    The second patch, adding the tunable rootuser_reserve_pages, allows
    the admin to shoot themselves in the foot by setting it too small. They
    can easily get the system into a state where root-can't-log-in.

    However, the new admin_reserve_kbytes will be safer than the current
    behavior since the hardcoded 3% of available memory reserve can shrink
    to something useless in the case where applications have grabbed all
    available memory.

    Alternatives

    * Memory cgroups provide a more flexible way to limit application memory.

    Not everyone wants to set up cgroups or deal with their overhead.

    * We could create a fourth overcommit mode which provides smaller reserves.

    The size of useful reserves may be drastically different depending
    on whether the system is embedded or enterprise.

    * Force users to initialize all of their memory or use calloc.

    Some users don't want/expect the system to overcommit when they malloc.
    Overcommit 'never' mode is for this scenario, and it should work well.

    The new user and admin reserve tunables are simple to use, with low
    overhead compared to cgroups. The patches preserve current behavior where
    3% of memory is less than 128MB, except that the admin reserve doesn't
    shrink to an unusable size under pressure. The code allows admins to tune
    for embedded and enterprise usage.

    FAQ

    * How is the root-cant-login problem addressed?
    What happens if admin_reserve_pages is set to 0?

    Root is free to shoot themselves in the foot by setting
    admin_reserve_kbytes too low.

    On x86_64, the minimum useful reserve is:
    8MB for overcommit 'guess'
    128MB for overcommit 'never'

    admin_reserve_pages defaults to min(3% free memory, 8MB)

    So, anyone switching to 'never' mode needs to adjust
    admin_reserve_pages.

    * How do you calculate a minimum useful reserve?

    A user or the admin needs enough memory to login and perform
    recovery operations, which includes, at a minimum:

    sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

    For overcommit 'guess', we can sum resident set sizes (RSS)
    because we only need enough memory to handle what the recovery
    programs will typically use. On x86_64 this is about 8MB.

    For overcommit 'never', we can take the max of their virtual sizes (VSZ)
    and add the sum of their RSS. We use VSZ instead of RSS because this mode
    forces us to ensure we can fulfill all of the requested memory allocations--
    even if the programs only use a fraction of what they ask for.
    On x86_64 this is about 128MB.

    When swap is enabled, reserves are useful even when they are as
    small as 10MB, regardless of overcommit mode.

    When both swap and overcommit are disabled, then the admin should
    tune the reserves higher to be absolutely safe. Over 230MB each
    was safest in my testing.

    * What happens if user_reserve_pages is set to 0?

    Note, this only affects overcommit 'never' mode.

    Then a user will be able to allocate all available memory minus
    admin_reserve_kbytes.

    However, they will easily see a message such as:

    "bash: fork: Cannot allocate memory"

    And they won't be able to recover/kill their application.
    The admin should be able to recover the system if
    admin_reserve_kbytes is set appropriately.

    * What's the difference between overcommit 'guess' and 'never'?

    "Guess" allows an allocation if there are enough free + reclaimable
    pages. It has a hardcoded 3% of free pages reserved for root.

    "Never" allows an allocation if there is enough swap + a configurable
    percentage (default is 50) of physical RAM. It has a hardcoded 3% of
    free pages reserved for root, like "Guess" mode. It also has a
    hardcoded 3% of the current process size reserved for additional
    applications.

    * Why is overcommit 'guess' not suitable even when an app eventually
    writes to every page? It takes free pages, file pages, available
    swap pages, and reclaimable slab pages into consideration. In other
    words, if these pages are all available, then why isn't overcommit
    'guess' suitable?

    Because it only looks at the present state of the system. It
    does not take into account the memory that other applications have
    malloced, but haven't initialized yet. It overcommits the system.

    Test Summary

    There was little change in behavior in the default overcommit 'guess'
    mode with swap enabled before and after the patch. This was expected.

    Systems run most predictably (i.e. no oom kills) in overcommit 'never'
    mode with swap enabled. This also allowed the most memory to be allocated
    to a user application.

    Overcommit 'guess' mode without swap is a bad idea. It is easy to
    crash the system. None of the other tested combinations crashed.
    This matches my experience on the Roadrunner supercomputer.

    Without the tunable user reserve, a system in overcommit 'never' mode
    and without swap does not allow the user to recover, although the
    admin can.

    With the new tunable reserves, a system in overcommit 'never' mode
    and without swap can be configured to:

    1. maximize user-allocatable memory, running close to the edge of
    recoverability

    2. maximize recoverability, sacrificing allocatable memory to
    ensure that a user cannot take down a system

    Test Description

    Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap

    System is booted into multiuser console mode, with unnecessary services
    turned off. Caches were dropped before each test.

    Hogs are user memtester processes that attempt to allocate all free memory
    as reported by /proc/meminfo

    In overcommit 'never' mode, memory_ratio=100

    Test Results

    3.9.0-rc1-mm1

    Overcommit | Swap | Hogs | MB Got/Wanted | OOMs  | User Recovery | Admin Recovery
    ----------   ----   ----   -------------   -----   -------------   --------------
    guess        yes    1      5432/5432       no      yes             yes
    guess        yes    4      5444/5444       1       yes             yes
    guess        no     1      5302/5449       no      yes             yes
    guess        no     4      -               crash   no              no

    never        yes    1      5460/5460       1       yes             yes
    never        yes    4      5460/5460       1       yes             yes
    never        no     1      5218/5432       no      no              yes
    never        no     4      5203/5448       no      no              yes

    3.9.0-rc1-mm1-tunablereserves

    User and Admin Recovery show their respective reserves, if applicable.

    Overcommit | Swap | Hogs | MB Got/Wanted | OOMs  | User Recovery | Admin Recovery
    ----------   ----   ----   -------------   -----   -------------   --------------
    guess        yes    1      5419/5419       no      -      yes      8MB    yes
    guess        yes    4      5436/5436       1       -      yes      8MB    yes
    guess        no     1      5440/5440       *       -      yes      8MB    yes
    guess        no     4      -               crash   -      no       8MB    no

    * process would successfully mlock, then the oom killer would pick it

    never        yes    1      5446/5446       no      10MB   yes      20MB   yes
    never        yes    4      5456/5456       no      10MB   yes      20MB   yes
    never        no     1      5387/5429       no      128MB  no       8MB    barely
    never        no     1      5323/5428       no      226MB  barely   8MB    barely
    never        no     1      5323/5428       no      226MB  barely   8MB    barely

    never        no     1      5359/5448       no      10MB   no       10MB   barely

    never        no     1      5323/5428       no      0MB    no       10MB   barely
    never        no     1      5332/5428       no      0MB    no       50MB   yes
    never        no     1      5293/5429       no      0MB    no       90MB   yes

    never        no     1      5001/5427       no      230MB  yes      338MB  yes
    never        no     4*     4998/5424       no      230MB  yes      338MB  yes

    * more memtesters were launched, able to allocate approximately another 100MB

    Future Work

    - Test larger memory systems.

    - Test an embedded image.

    - Test other architectures.

    - Time malloc microbenchmarks.

    - Would it be useful to be able to set overcommit policy for
    each memory cgroup?

    - Some lines are slightly above 80 chars.
    Perhaps define a macro to convert between pages and kb?
    Other places in the kernel do this.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: make init_user_reserve() static]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
     
  • drop_caches.c provides code only invokable via sysctl, so don't compile it
    in when CONFIG_SYSCTL=n.

    Signed-off-by: Josh Triplett
    Acked-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josh Triplett
     

02 Mar, 2013

1 commit

  • Pull new ARC architecture from Vineet Gupta:
    "Initial ARC Linux port with some fixes on top for 3.9-rc1:

    I would like to introduce the Linux port to ARC Processors (from
    Synopsys) for 3.9-rc1. The patch-set has been discussed on the public
    lists since Nov and has received a fair bit of review, especially from
    Arnd, tglx, Al and other subsystem maintainers for DeviceTree, kgdb...

    The arch bits are in arch/arc, some asm-generic changes (acked by
    Arnd), a minor change to PARISC (acked by Helge).

    The series is a touch bigger for a new port for 2 main reasons:

    1. It enables a basic kernel in first sub-series and adds
    ptrace/kgdb/.. later

    2. Some of the fallout of review (DeviceTree support, multi-platform-
    image support) was added on top of the original series, primarily to
    record the revision history.

    This updated pull request additionally contains

    - fixes due to our GNU tools catching up with the new syscall/ptrace
    ABI

    - some (minor) cross-arch Kconfig updates."

    * tag 'arc-v3.9-rc1-late' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: (82 commits)
    ARC: split elf.h into uapi and export it for userspace
    ARC: Fixup the current ABI version
    ARC: gdbserver using regset interface possibly broken
    ARC: Kconfig cleanup tracking cross-arch Kconfig pruning in merge window
    ARC: make a copy of flat DT
    ARC: [plat-arcfpga] DT arc-uart bindings change: "baud" => "current-speed"
    ARC: Ensure CONFIG_VIRT_TO_BUS is not enabled
    ARC: Fix pt_orig_r8 access
    ARC: [3.9] Fallout of hlist iterator update
    ARC: 64bit RTSC timestamp hardware issue
    ARC: Don't fiddle with non-existent caches
    ARC: Add self to MAINTAINERS
    ARC: Provide a default serial.h for uart drivers needing BASE_BAUD
    ARC: [plat-arcfpga] defconfig for fully loaded ARC Linux
    ARC: [Review] Multi-platform image #8: platform registers SMP callbacks
    ARC: [Review] Multi-platform image #7: SMP common code to use callbacks
    ARC: [Review] Multi-platform image #6: cpu-to-dma-addr optional
    ARC: [Review] Multi-platform image #5: NR_IRQS defined by ARC core
    ARC: [Review] Multi-platform image #4: Isolate platform headers
    ARC: [Review] Multi-platform image #3: switch to board callback
    ...

    Linus Torvalds
     

28 Feb, 2013

1 commit

  • The existing SUID_DUMP_* defines duplicate the newer SUID_DUMPABLE_*
    defines introduced in 54b501992dd2 ("coredump: warn about unsafe
    suid_dumpable / core_pattern combo"). Remove the new ones, and use the
    prior values instead.

    Signed-off-by: Kees Cook
    Reported-by: Chen Gang
    Cc: Alexander Viro
    Cc: Alan Cox
    Cc: "Eric W. Biederman"
    Cc: Doug Ledford
    Cc: Serge Hallyn
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

26 Feb, 2013

1 commit

  • Pull module update from Rusty Russell:
    "The sweeping change is to make add_taint() explicitly indicate whether
    to disable lockdep, but it's a mechanical change."

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    MODSIGN: Add option to not sign modules during modules_install
    MODSIGN: Add -s option to sign-file
    MODSIGN: Specify the hash algorithm on sign-file command line
    MODSIGN: Simplify Makefile with a Kconfig helper
    module: clean up load_module a little more.
    modpost: Ignore ARC specific non-alloc sections
    module: constify within_module_*
    taint: add explicit flag to show whether lock dep is still OK.
    module: printk message when module signature fail taints kernel.

    Linus Torvalds
     

24 Feb, 2013

1 commit

  • When calculating the amount of dirtyable memory, min_free_kbytes should be
    subtracted because it is not intended for dirty pages.
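
    The idea, as a hedged one-function sketch (units in kilobytes; not the
    actual page-writeback.c change):

        /* exclude the min_free_kbytes reserve from the dirtyable pool */
        unsigned long dirtyable_kb(unsigned long free_kb,
                                   unsigned long reclaimable_kb,
                                   unsigned long min_free_kb)
        {
            unsigned long x = free_kb + reclaimable_kb;

            return x > min_free_kb ? x - min_free_kb : 0;
        }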

    Addresses http://bugs.debian.org/695182

    [akpm@linux-foundation.org: fix up min_free_kbytes extern declarations]
    [akpm@linux-foundation.org: fix min() warning]
    Signed-off-by: Paul Szabo
    Acked-by: Rik van Riel
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Szabo
     

22 Feb, 2013

1 commit


16 Feb, 2013

1 commit

  • PARISC defines /proc/sys/kernel/unaligned-trap to runtime-toggle
    unaligned access emulation.

    The exact mechanics of enabling/disabling are still arch specific, but we
    can make the sysctl usable by other arches.

    Signed-off-by: Vineet Gupta
    Acked-by: Helge Deller
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: "Eric W. Biederman"
    Cc: Serge Hallyn

    Vineet Gupta
     

08 Feb, 2013

2 commits

  • Add a /proc/sys/kernel scheduler knob named
    sched_rr_timeslice_ms that allows global changing of the
    SCHED_RR timeslice value. The user-visible value is in milliseconds
    but is stored as jiffies. Setting it to 0 (zero) resets it to the
    default (currently 100ms).
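
    An illustrative model of the knob's behaviour (the HZ value and names are
    assumptions, not the kernel's handler):

        #include <stdio.h>

        #define HZ 250                 /* assumed tick rate for the example */
        #define DEFAULT_RR_MS 100

        static int rr_timeslice_jiffies;

        static void set_rr_timeslice_ms(int ms)
        {
            if (ms <= 0)
                ms = DEFAULT_RR_MS;                 /* 0 resets to the default */
            rr_timeslice_jiffies = ms * HZ / 1000;  /* store ms as jiffies */
        }

        int main(void)
        {
            set_rr_timeslice_ms(0);
            printf("%d jiffies\n", rr_timeslice_jiffies);   /* 25 at HZ=250 */
            return 0;
        }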

    Signed-off-by: Clark Williams
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20130207094704.13751796@riff.lan
    Signed-off-by: Ingo Molnar

    Clark Williams
     
  • Move the sysctl-related bits from include/linux/sched.h into
    a new file: include/linux/sched/sysctl.h. Then update source
    files requiring access to those bits by including the new
    header file.

    Signed-off-by: Clark Williams
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan
    Signed-off-by: Ingo Molnar

    Clark Williams
     

21 Jan, 2013

1 commit


10 Jan, 2013

1 commit

  • IA64 defines /proc/sys/kernel/ignore-unaligned-usertrap to control
    verbose warnings on unaligned access emulation.

    Although the exact mechanics of what to do with sysctl (ignore/shout)
    are arch specific, this change enables the sysctl to be usable cross-arch.

    Signed-off-by: Vineet Gupta
    Cc: Fenghua Yu
    Cc: "Eric W. Biederman"
    Cc: Serge Hallyn
    Signed-off-by: Tony Luck

    Vineet Gupta
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for the workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

11 Dec, 2012

4 commits

  • The PTE scanning rate and fault rates are two of the biggest sources of
    system CPU overhead with automatic NUMA placement. Ideally a proper policy
    would detect if a workload was properly placed, schedule and adjust the
    PTE scanning rate accordingly. We do not track the necessary information
    to do that but we at least know if we migrated or not.

    This patch scans slower if a page was not migrated as the result of a
    NUMA hinting fault up to sysctl_numa_balancing_scan_period_max which is
    now higher than the previous default. Once every minute it will reset
    the scanner in case of phase changes.
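
    The back-off rule, as an illustrative sketch (the period doubling and the
    cap value are assumptions, not the scheduler code):

        static unsigned int scan_period_ms = 1000;
        static const unsigned int scan_period_max_ms = 60000;   /* assumed cap */

        static void numa_fault_feedback(int page_was_migrated)
        {
            if (page_was_migrated)
                return;                     /* placement looks fine, keep rate */

            scan_period_ms *= 2;            /* scan slower */
            if (scan_period_ms > scan_period_max_ms)
                scan_period_ms = scan_period_max_ms;
        }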

    This is hilariously crude and the numbers are arbitrary. Workloads will
    converge quite slowly in comparison to what a proper policy should be able
    to do. On the plus side, we will chew up less CPU for workloads that have
    no need for automatic balancing.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • Add a 1 second delay before starting to scan the working set of
    a task and starting to balance it amongst nodes.

    [ note that before the constant per task WSS sampling rate patch
    the initial scan would happen much later still, in effect that
    patch caused this regression. ]

    The theory is that short-run tasks benefit very little from NUMA
    placement: they come and go, and they better stick to the node
    they were started on. As tasks mature and rebalance to other CPUs
    and nodes, so does their NUMA placement have to change and so
    does it start to matter more and more.

    In practice this change fixes an observable kbuild regression:

    # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]

    !NUMA:
    45.291088843 seconds time elapsed ( +- 0.40% )
    45.154231752 seconds time elapsed ( +- 0.36% )

    +NUMA, no slow start:
    46.172308123 seconds time elapsed ( +- 0.30% )
    46.343168745 seconds time elapsed ( +- 0.25% )

    +NUMA, 1 sec slow start:
    45.224189155 seconds time elapsed ( +- 0.25% )
    45.160866532 seconds time elapsed ( +- 0.17% )

    and it also fixes an observable perf bench (hackbench) regression:

    # perf stat --null --repeat 10 perf bench sched messaging

    -NUMA:               0.246225691 seconds time elapsed ( +- 1.31% )
    +NUMA no slow start: 0.252620063 seconds time elapsed ( +- 1.13% )
    +NUMA 1sec delay:    0.248076230 seconds time elapsed ( +- 1.35% )

    The implementation is simple and straightforward, most of the patch
    deals with adding the /proc/sys/kernel/numa_balancing_scan_delay_ms tunable
    knob.

    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    [ Wrote the changelog, ran measurements, tuned the default. ]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Peter Zijlstra
     
  • Previously, to probe the working set of a task, we'd use
    a very simple and crude method: mark all of its address
    space PROT_NONE.

    That method has various (obvious) disadvantages:

    - it samples the working set at dissimilar rates,
    giving some tasks a sampling quality advantage
    over others.

    - creates performance problems for tasks with very
    large working sets

    - over-samples processes with large address spaces but
    which only very rarely execute

    Improve that method by keeping a rotating offset into the
    address space that marks the current position of the scan,
    and advance it by a constant rate (in a CPU cycles execution
    proportional manner). If the offset reaches the last mapped
    address of the mm then it starts over at the first
    address.
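
    A minimal sketch of such a rotating-offset scan (illustrative; the chunk
    size matches the 256 MB figure quoted below, everything else is assumed):

        #include <stdio.h>

        #define SCAN_CHUNK (256UL << 20)            /* 256 MB per pass */

        static unsigned long scan_offset;

        static void scan_one_pass(unsigned long mm_start, unsigned long mm_end)
        {
            unsigned long start, end;

            if (scan_offset < mm_start || scan_offset >= mm_end)
                scan_offset = mm_start;             /* wrap around */

            start = scan_offset;
            end = (start + SCAN_CHUNK < mm_end) ? start + SCAN_CHUNK : mm_end;

            printf("mark [%#lx, %#lx) for NUMA hinting faults\n", start, end);
            scan_offset = (end == mm_end) ? mm_start : end;
        }

        int main(void)
        {
            int i;

            for (i = 0; i < 5; i++)                 /* walk a 1 GB mm, wrapping */
                scan_one_pass(0x400000UL, 0x400000UL + (1UL << 30));
            return 0;
        }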

    The per-task nature of the working set sampling functionality in this tree
    allows such constant rate, per task, execution-weight proportional sampling
    of the working set, with an adaptive sampling interval/frequency that
    goes from once per 100ms up to just once per 8 seconds. The current
    sampling volume is 256 MB per interval.

    As tasks mature and converge their working set, so does the
    sampling rate slow down to just a trickle, 256 MB per 8
    seconds of CPU time executed.

    This, beyond being adaptive, also rate-limits rarely
    executing systems and does not over-sample on overloaded
    systems.

    [ In AutoNUMA speak, this patch deals with the effective sampling
    rate of the 'hinting page fault'. AutoNUMA's scanning is
    currently rate-limited, but it is also fundamentally
    single-threaded, executing in the knuma_scand kernel thread,
    so the limit in AutoNUMA is global and does not scale up with
    the number of CPUs, nor does it scan tasks in an execution
    proportional manner.

    So the idea of rate-limiting the scanning was first implemented
    in the AutoNUMA tree via a global rate limit. This patch goes
    beyond that by implementing an execution rate proportional
    working set sampling rate that is not implemented via a single
    global scanning daemon. ]

    [ Dan Carpenter pointed out a possible NULL pointer dereference in the
    first version of this patch. ]

    Based-on-idea-by: Andrea Arcangeli
    Bug-Found-By: Dan Carpenter
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    [ Wrote changelog and fixed bug. ]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Peter Zijlstra
     
  • NOTE: This patch is based on "sched, numa, mm: Add fault driven
    placement and migration policy" but as it throws away all the policy
    to just leave a basic foundation I had to drop the signed-offs-by.

    This patch creates a bare-bones method for setting PTEs pte_numa in the
    context of the scheduler that when faulted later will be faulted onto the
    node the CPU is running on. In itself this does nothing useful but any
    placement policy will fundamentally depend on receiving hints on placement
    from fault context and doing something intelligent about it.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel

    Peter Zijlstra
     

29 Nov, 2012

1 commit


09 Oct, 2012

1 commit

  • Introduce SYSCTL_EXCEPTION_TRACE config option and select it in the
    architectures requiring support for the "exception-trace" debug_table
    entry in kernel/sysctl.c.

    Signed-off-by: Catalin Marinas
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     

06 Oct, 2012

1 commit

  • Adds an expert Kconfig option, CONFIG_COREDUMP, which allows disabling of
    core dump. This saves approximately 2.6k in the compiled kernel, and
    complements CONFIG_ELF_CORE, which now depends on it.

    Disabling CONFIG_COREDUMP also disables coredump-related sysctls, except
    for suid_dumpable and related functions, which are necessary for ptrace.

    [akpm@linux-foundation.org: fix binfmt_aout.c build]
    Signed-off-by: Alex Kelly
    Reviewed-by: Josh Triplett
    Acked-by: Serge Hallyn
    Acked-by: Kees Cook
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Kelly
     

02 Oct, 2012

1 commit

  • Pull arm64 support from Catalin Marinas:
    "Linux support for the 64-bit ARM architecture (AArch64)

    Features currently supported:
    - 39-bit address space for user and kernel (each)
    - 4KB and 64KB page configurations
    - Compat (32-bit) user applications (ARMv7, EABI only)
    - Flattened Device Tree (mandated for all AArch64 platforms)
    - ARM generic timers"

    * tag 'arm64-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64: (35 commits)
    arm64: ptrace: remove obsolete ptrace request numbers from user headers
    arm64: Do not set the SMP/nAMP processor bit
    arm64: MAINTAINERS update
    arm64: Build infrastructure
    arm64: Miscellaneous header files
    arm64: Generic timers support
    arm64: Loadable modules
    arm64: Miscellaneous library functions
    arm64: Performance counters support
    arm64: Add support for /proc/sys/debug/exception-trace
    arm64: Debugging support
    arm64: Floating point and SIMD
    arm64: 32-bit (compat) applications support
    arm64: User access library functions
    arm64: Signal handling support
    arm64: VDSO support
    arm64: System calls handling
    arm64: ELF definitions
    arm64: SMP support
    arm64: DMA mapping API
    ...

    Linus Torvalds
     

17 Sep, 2012

1 commit

  • This patch allows setting of the show_unhandled_signals variable via
    /proc/sys/debug/exception-trace. The default value is currently 1
    showing unhandled user faults (undefined instructions, data aborts) and
    invalid signal stack frames.

    Signed-off-by: Catalin Marinas
    Acked-by: Tony Lindgren
    Acked-by: Arnd Bergmann
    Acked-by: Nicolas Pitre
    Acked-by: Olof Johansson
    Acked-by: Santosh Shilimkar

    Catalin Marinas
     

04 Sep, 2012

1 commit

  • Unlike others, sched_migration_cost, sched_time_avg and
    sched_shares_window don't have a time unit as a suffix. Add them.

    Signed-off-by: Namhyung Kim
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1345083330-19486-1-git-send-email-namhyung@kernel.org
    Signed-off-by: Ingo Molnar

    Namhyung Kim
     

02 Aug, 2012

1 commit

  • Pull second vfs pile from Al Viro:
    "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
    deadlock reproduced by xfstests 068), symlink and hardlink restriction
    patches, plus assorted cleanups and fixes.

    Note that another fsfreeze deadlock (emergency thaw one) is *not*
    dealt with - the series by Fernando conflicts a lot with Jan's, breaks
    userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
    for massive vfsmount leak; this is going to be handled next cycle.
    There probably will be another pull request, but that stuff won't be
    in it."

    Fix up trivial conflicts due to unrelated changes next to each other in
    drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
    delousing target_core_file a bit
    Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
    fs: Remove old freezing mechanism
    ext2: Implement freezing
    btrfs: Convert to new freezing mechanism
    nilfs2: Convert to new freezing mechanism
    ntfs: Convert to new freezing mechanism
    fuse: Convert to new freezing mechanism
    gfs2: Convert to new freezing mechanism
    ocfs2: Convert to new freezing mechanism
    xfs: Convert to new freezing code
    ext4: Convert to new freezing mechanism
    fs: Protect write paths by sb_start_write - sb_end_write
    fs: Skip atime update on frozen filesystem
    fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
    fs: Improve filesystem freezing handling
    switch the protection of percpu_counter list to spinlock
    nfsd: Push mnt_want_write() outside of i_mutex
    btrfs: Push mnt_want_write() outside of i_mutex
    fat: Push mnt_want_write() outside of i_mutex
    ...

    Linus Torvalds
     

01 Aug, 2012

1 commit

  • Since per-BDI flusher threads were introduced in 2.6, the pdflush
    mechanism is not used any more. But the old interface exported through
    /proc/sys/vm/nr_pdflush_threads still exists and is obviously useless.

    For backwards compatibility, printk a warning and return 2 to notify
    the users that the interface has been removed.

    Signed-off-by: Wanpeng Li
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

31 Jul, 2012

2 commits

  • register_sysctl_table() is a strange function, as it makes internal
    allocations (a header) to register a sysctl_table. This header is a
    handle to the table that is created, and can be used to unregister the
    table. But if the table is permanent and never unregistered, the header
    acts the same as a static variable.

    Unfortunately, this allocation of memory that is never expected to be
    freed fools kmemleak into thinking that we have leaked memory. For those
    sysctl tables that are never unregistered, and have no pointer referencing
    them, kmemleak will think that these are memory leaks:

    unreferenced object 0xffff880079fb9d40 (size 192):
    comm "swapper/0", pid 0, jiffies 4294667316 (age 12614.152s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc+0x73/0x98
    [] kmemleak_alloc_recursive.constprop.42+0x16/0x18
    [] __kmalloc+0x107/0x153
    [] kzalloc.constprop.8+0xe/0x10
    [] __register_sysctl_paths+0xe1/0x160
    [] register_sysctl_paths+0x1b/0x1d
    [] register_sysctl_table+0x18/0x1a
    [] sysctl_init+0x10/0x14
    [] proc_sys_init+0x2f/0x31
    [] proc_root_init+0xa5/0xa7
    [] start_kernel+0x3d0/0x40a
    [] x86_64_start_reservations+0xae/0xb2
    [] x86_64_start_kernel+0x102/0x111
    [] 0xffffffffffffffff

    The sysctl_base_table used by sysctl itself is one such instance that
    registers the table to never be unregistered.

    Use kmemleak_not_leak() to suppress the kmemleak false positive.
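
    For a table that is intentionally never unregistered, the pattern looks
    roughly like this (a kernel-style sketch; 'my_table' and the init function
    are made-up names, not the hunk in this patch):

        #include <linux/kmemleak.h>
        #include <linux/sysctl.h>

        static struct ctl_table_header *hdr;

        static int __init my_sysctl_init(void)
        {
                hdr = register_sysctl_table(my_table);
                if (hdr)
                        kmemleak_not_leak(hdr);   /* never freed, on purpose */
                return 0;
        }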

    Signed-off-by: Steven Rostedt
    Acked-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • When suid_dumpable=2, detect unsafe core_pattern settings and warn when
    they are seen.

    Signed-off-by: Kees Cook
    Suggested-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Alan Cox
    Cc: "Eric W. Biederman"
    Cc: Doug Ledford
    Cc: Serge Hallyn
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

30 Jul, 2012

1 commit

  • This adds symlink and hardlink restrictions to the Linux VFS.

    Symlinks:

    A long-standing class of security issues is the symlink-based
    time-of-check-time-of-use race, most commonly seen in world-writable
    directories like /tmp. The common method of exploitation of this flaw
    is to cross privilege boundaries when following a given symlink (i.e. a
    root process follows a symlink belonging to another user). For a likely
    incomplete list of hundreds of examples across the years, please see:
    http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp

    The solution is to permit symlinks to only be followed when outside
    a sticky world-writable directory, or when the uid of the symlink and
    follower match, or when the directory owner matches the symlink's owner.
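
    A rough model of that rule (illustrative only; the real check lives in the
    VFS and works on inode/cred structures, not plain uids):

        #include <stdbool.h>
        #include <sys/types.h>

        static bool may_follow_link(uid_t follower, uid_t link_owner,
                                    uid_t dir_owner, bool dir_sticky,
                                    bool dir_world_writable)
        {
            if (!(dir_sticky && dir_world_writable))
                return true;            /* restriction only applies in sticky,
                                           world-writable directories */
            if (follower == link_owner)
                return true;            /* following your own symlink */
            if (dir_owner == link_owner)
                return true;            /* link owned by the directory owner */
            return false;
        }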

    Some pointers to the history of earlier discussion that I could find:

    1996 Aug, Zygo Blaxell
    http://marc.info/?l=bugtraq&m=87602167419830&w=2
    1996 Oct, Andrew Tridgell
    http://lkml.indiana.edu/hypermail/linux/kernel/9610.2/0086.html
    1997 Dec, Albert D Cahalan
    http://lkml.org/lkml/1997/12/16/4
    2005 Feb, Lorenzo Hernández García-Hierro
    http://lkml.indiana.edu/hypermail/linux/kernel/0502.0/1896.html
    2010 May, Kees Cook
    https://lkml.org/lkml/2010/5/30/144

    Past objections and rebuttals could be summarized as:

    - Violates POSIX.
    - POSIX didn't consider this situation and it's not useful to follow
    a broken specification at the cost of security.
    - Might break unknown applications that use this feature.
    - Applications that break because of the change are easy to spot and
    fix. Applications that are vulnerable to symlink ToCToU by not having
    the change aren't. Additionally, no applications have yet been found
    that rely on this behavior.
    - Applications should just use mkstemp() or O_CREAT|O_EXCL.
    - True, but applications are not perfect, and new software is written
    all the time that makes these mistakes; blocking this flaw at the
    kernel is a single solution to the entire class of vulnerability.
    - This should live in the core VFS.
    - This should live in an LSM. (https://lkml.org/lkml/2010/5/31/135)
    - This should live in an LSM.
    - This should live in the core VFS. (https://lkml.org/lkml/2010/8/2/188)

    Hardlinks:

    On systems that have user-writable directories on the same partition
    as system files, a long-standing class of security issues is the
    hardlink-based time-of-check-time-of-use race, most commonly seen in
    world-writable directories like /tmp. The common method of exploitation
    of this flaw is to cross privilege boundaries when following a given
    hardlink (i.e. a root process follows a hardlink created by another
    user). Additionally, an issue exists where users can "pin" a potentially
    vulnerable setuid/setgid file so that an administrator will not actually
    upgrade a system fully.

    The solution is to permit hardlinks to only be created when the user is
    already the existing file's owner, or if they already have read/write
    access to the existing file.

    Many Linux users are surprised when they learn they can link to files
    they have no access to, so this change appears to follow the doctrine
    of "least surprise". Additionally, this change does not violate POSIX,
    which states "the implementation may require that the calling process
    has permission to access the existing file"[1].

    This change is known to break some implementations of the "at" daemon,
    though the version used by Fedora and Ubuntu has been fixed[2] for
    a while. Otherwise, the change has been undisruptive while in use in
    Ubuntu for the last 1.5 years.

    [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/linkat.html
    [2] http://anonscm.debian.org/gitweb/?p=collab-maint/at.git;a=commitdiff;h=f4114656c3a6c6f6070e315ffdf940a49eda3279

    This patch is based on the patches in Openwall and grsecurity, along with
    suggestions from Al Viro. I have added a sysctl to enable the protected
    behavior, and documentation.

    Signed-off-by: Kees Cook
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Kees Cook
     

05 Apr, 2012

1 commit

  • Commit bfdc0b4 adds code to restrict access to dmesg_restrict,
    however, it incorrectly alters kptr_restrict rather than
    dmesg_restrict.

    The original patch from Richard Weinberger
    (https://lkml.org/lkml/2011/3/14/362) alters dmesg_restrict as
    expected, and so the patch seems to have been misapplied.

    This adds the CAP_SYS_ADMIN check to both dmesg_restrict and
    kptr_restrict, since both are sensitive.

    Reported-by: Phillip Lougher
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Acked-by: Richard Weinberger
    Cc: stable@vger.kernel.org
    Signed-off-by: James Morris

    Kees Cook
     

29 Mar, 2012

2 commits

  • Merge third batch of patches from Andrew Morton:
    - Some MM stragglers
    - core SMP library cleanups (on_each_cpu_mask)
    - Some IPI optimisations
    - kexec
    - kdump
    - IPMI
    - the radix-tree iterator work
    - various other misc bits.

    "That'll do for -rc1. I still have ~10 patches for 3.4, will send
    those along when they've baked a little more."

    * emailed from Andrew Morton : (35 commits)
    backlight: fix typo in tosa_lcd.c
    crc32: add help text for the algorithm select option
    mm: move hugepage test examples to tools/testing/selftests/vm
    mm: move slabinfo.c to tools/vm
    mm: move page-types.c from Documentation to tools/vm
    selftests/Makefile: make `run_tests' depend on `all'
    selftests: launch individual selftests from the main Makefile
    radix-tree: use iterators in find_get_pages* functions
    radix-tree: rewrite gang lookup using iterator
    radix-tree: introduce bit-optimized iterator
    fs/proc/namespaces.c: prevent crash when ns_entries[] is empty
    nbd: rename the nbd_device variable from lo to nbd
    pidns: add reboot_pid_ns() to handle the reboot syscall
    sysctl: use bitmap library functions
    ipmi: use locks on watchdog timeout set on reboot
    ipmi: simplify locking
    ipmi: fix message handling during panics
    ipmi: use a tasklet for handling received messages
    ipmi: increase KCS timeouts
    ipmi: decrease the IPMI message transaction time in interrupt mode
    ...

    Linus Torvalds
     
  • Use bitmap_set() instead of using set_bit() for each bit. This conversion
    is valid because the bitmap is private in the function call and atomic
    bitops were unnecessary.

    This also includes a minor change:
    - Use bitmap_copy() for shorter typing
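
    The shape of the conversion, as an illustrative before/after (made-up
    function names, not the sysctl.c hunk):

        #include <linux/bitmap.h>

        static void set_range_old(unsigned long *map, unsigned int first,
                                  unsigned int nbits)
        {
                unsigned int i;

                for (i = 0; i < nbits; i++)
                        set_bit(first + i, map);    /* atomic, bit by bit */
        }

        static void set_range_new(unsigned long *map, unsigned int first,
                                  unsigned int nbits)
        {
                bitmap_set(map, first, nbits);      /* non-atomic range fill */
        }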

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita