14 Nov, 2013

1 commit

  • Pull core locking changes from Ingo Molnar:
    "The biggest changes:

    - add lockdep support for seqcount/seqlocks structures, this
    unearthed both bugs and required extra annotation.

    - move the various kernel locking primitives to the new
    kernel/locking/ directory"

    * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
    block: Use u64_stats_init() to initialize seqcounts
    locking/lockdep: Mark __lockdep_count_forward_deps() as static
    lockdep/proc: Fix lock-time avg computation
    locking/doc: Update references to kernel/mutex.c
    ipv6: Fix possible ipv6 seqlock deadlock
    cpuset: Fix potential deadlock w/ set_mems_allowed
    seqcount: Add lockdep functionality to seqcount/seqlock structures
    net: Explicitly initialize u64_stats_sync structures for lockdep
    locking: Move the percpu-rwsem code to kernel/locking/
    locking: Move the lglocks code to kernel/locking/
    locking: Move the rwsem code to kernel/locking/
    locking: Move the rtmutex code to kernel/locking/
    locking: Move the semaphore core to kernel/locking/
    locking: Move the spinlock code to kernel/locking/
    locking: Move the lockdep code to kernel/locking/
    locking: Move the mutex code to kernel/locking/
    hung_task debugging: Add tracepoint to report the hang
    x86/locking/kconfig: Update paravirt spinlock Kconfig description
    lockstat: Report avg wait and hold times
    lockdep, x86/alternatives: Drop ancient lockdep fixup message
    ...

    Linus Torvalds
     

12 Nov, 2013

2 commits

  • Pull scheduler changes from Ingo Molnar:
    "The main changes in this cycle are:

    - (much) improved CONFIG_NUMA_BALANCING support from Mel Gorman, Rik
    van Riel, Peter Zijlstra et al. Yay!

    - optimize preemption counter handling: merge the NEED_RESCHED flag
    into the preempt_count variable, by Peter Zijlstra.

    - wait.h fixes and code reorganization from Peter Zijlstra

    - cfs_bandwidth fixes from Ben Segall

    - SMP load-balancer cleanups from Peter Zijlstra

    - idle balancer improvements from Jason Low

    - other fixes and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
    ftrace, sched: Add TRACE_FLAG_PREEMPT_RESCHED
    stop_machine: Fix race between stop_two_cpus() and stop_cpus()
    sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus
    sched: Fix asymmetric scheduling for POWER7
    sched: Move completion code from core.c to completion.c
    sched: Move wait code from core.c to wait.c
    sched: Move wait.c into kernel/sched/
    sched/wait: Fix __wait_event_interruptible_lock_irq_timeout()
    sched: Avoid throttle_cfs_rq() racing with period_timer stopping
    sched: Guarantee new group-entities always have weight
    sched: Fix hrtimer_cancel()/rq->lock deadlock
    sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining
    sched: Fix race on toggling cfs_bandwidth_used
    sched: Remove extra put_online_cpus() inside sched_setaffinity()
    sched/rt: Fix task_tick_rt() comment
    sched/wait: Fix build breakage
    sched/wait: Introduce prepare_to_wait_event()
    sched/wait: Add ___wait_cond_timeout() to wait_event*_timeout() too
    sched: Remove get_online_cpus() usage
    sched: Fix race in migrate_swap_stop()
    ...

    Linus Torvalds
     
  • Pull perf updates from Ingo Molnar:
    "As a first remark I'd like to note that the way to build perf tooling
    has been simplified and sped up, in the future it should be enough for
    you to build perf via:

    cd tools/perf/
    make install

    (i.e. without the -j option.) The build system will figure out the
    number of CPUs and will do a parallel build+install.

    The various build system inefficiencies and breakages Linus reported
    against the v3.12 pull request should now be resolved - please
    (re-)report any remaining annoyances or bugs.

    Main changes on the perf kernel side:

    * Performance optimizations:
    . perf ring-buffer code optimizations, by Peter Zijlstra
    . perf ring-buffer code optimizations, by Oleg Nesterov
    . x86 NMI call-stack processing optimizations, by Peter Zijlstra
    . perf context-switch optimizations, by Peter Zijlstra
    . perf sampling speedups, by Peter Zijlstra
    . x86 Intel PEBS processing speedups, by Peter Zijlstra

    * Enhanced hardware support:
    . for Intel Ivy Bridge-EP uncore PMUs, by Zheng Yan
    . for Haswell transactions, by Andi Kleen, Peter Zijlstra

    * Core perf events code enhancements and fixes by Oleg Nesterov:
    . for uprobes, if fork() is called with pending ret-probes
    . for uprobes platform support code

    * New ABI details by Andi Kleen:
    . Report x86 Haswell TSX transaction abort cost as weight

    Main changes on the perf tooling side (some of these tooling changes
    utilize the above kernel side changes):

    * 'perf report/top' enhancements:

    . Convert callchain children list to rbtree, greatly reducing the
    time taken for callchain processing, from Namhyung Kim.

    . Add new COMM infrastructure, further improving histogram
    processing, from Frédéric Weisbecker, one fix from Namhyung Kim.

    . Add /proc/kcore based live-annotation improvements, including
    build-id cache support, multi map 'call' instruction navigation
    fixes, kcore address validation, objdump workarounds. From
    Adrian Hunter.

    . Show progress on histogram collapsing, that can take a long
    time, from Namhyung Kim.

    . Add --max-stack option to limit callchain stack scan in 'top'
    and 'report', improving callchain processing when reducing the
    stack depth is an option, from Waiman Long.

    . Add new option --ignore-vmlinux for perf top, from Willy
    Tarreau.

    * 'perf trace' enhancements:

    . 'perf trace' can now use a 'perf probe' dynamic tracepoint to hook
    into the userspace -> kernel pathname copy so that it can map fds
    to pathnames without reading /proc/pid/fd/ symlinks.
    From Arnaldo Carvalho de Melo.

    . Show VFS path associated with fd in live sessions, using a
    'vfs_getname' 'perf probe' created dynamic tracepoint or by
    looking at /proc/pid/fd, from Arnaldo Carvalho de Melo.

    . Add 'trace' beautifiers for lots of syscall arguments, from
    Arnaldo Carvalho de Melo.

    . Implement more compact 'trace' output by suppressing zeroed
    args, from Arnaldo Carvalho de Melo.

    . Show thread COMM by default in 'trace', from Arnaldo Carvalho de
    Melo.

    . Add option to show full timestamp in 'trace', from David Ahern.

    . Add 'record' command in 'trace', to record raw_syscalls:*, from
    David Ahern.

    . Add summary option to dump syscall statistics in 'trace', from
    David Ahern.

    . Improve error messages in 'trace', providing hints about system
    configuration steps needed for using it, from Ramkumar
    Ramachandra.

    . 'perf trace' now emits hints as to why tracing is not possible,
    helping the user to setup the system to allow tracing in the
    desired permission granularity, telling if the problem is due to
    debugfs not being mounted or not enough permissions for !root,
    the /proc/sys/kernel/perf_event_paranoid value, etc. From
    Arnaldo Carvalho de Melo.

    * 'perf record' enhancements:

    . Check maximum frequency rate for record/top, emitting better
    error messages, from Jiri Olsa.

    . 'perf record' code cleanups, from David Ahern.

    . Improve write_output error message in 'perf record', from Adrian
    Hunter.

    . Allow specifying B/K/M/G unit to the --mmap-pages arguments,
    from Jiri Olsa.

    . Fix command line callchain attribute tests to handle the new
    -g/--call-chain semantics, from Arnaldo Carvalho de Melo.

    * 'perf kvm' enhancements:

    . Disable live kvm command if timerfd is not supported, from David
    Ahern.

    . Fix detection of non-core features, from David Ahern.

    * 'perf list' enhancements:

    . Add usage to 'perf list', from David Ahern.

    . Show error in 'perf list' if tracepoints not available, from
    Pekka Enberg.

    * 'perf probe' enhancements:

    . Support "$vars" meta argument syntax for local variables,
    allowing asking for all possible variables at a given probe
    point to be collected when it hits, from Masami Hiramatsu.

    * 'perf sched' enhancements:

    . Address the root cause of that 'perf sched' stack initialization
    build slowdown, by programmatically setting a big array after
    moving the global variable back to the stack. Fix from Adrian
    Hunter.

    * 'perf script' enhancements:

    . Set up output options for in-stream attributes, from Adrian
    Hunter.

    . Print addr by default for BTS in 'perf script', from Adrian
    Hunter.

    * 'perf stat' enhancements:

    . Improved messages when doing profiling in all or a subset of
    CPUs using a workload as the session delimiter, as in:

    'perf stat --cpu 0,2 sleep 10s'

    from Arnaldo Carvalho de Melo.

    . Add units to nanosec-based counters in 'perf stat', from David
    Ahern.

    . Remove bogus info when using 'perf stat' -e cycles/instructions,
    from Ramkumar Ramachandra.

    * 'perf lock' enhancements:

    . 'perf lock' fixes and cleanups, from Davidlohr Bueso.

    * 'perf test' enhancements:

    . Fixup PERF_SAMPLE_TRANSACTION handling in sample synthesizing
    and 'perf test', from Adrian Hunter.

    . Clarify the "sample parsing" test entry, from Arnaldo Carvalho
    de Melo.

    . Consider PERF_SAMPLE_TRANSACTION in the "sample parsing" test,
    from Arnaldo Carvalho de Melo.

    . Memory leak fixes in 'perf test', from Felipe Pena.

    * 'perf bench' enhancements:

    . Change the procps-visible command name of individual benchmark
    tests, plus cleanups, from Ingo Molnar.

    * Generic perf tooling infrastructure/plumbing changes:

    . Separating data file properties from session, code
    reorganization from Jiri Olsa.

    . Fix version when building out of tree, as when using one of
    these:

    $ make help | grep perf
    perf-tar-src-pkg - Build perf-3.12.0.tar source tarball
    perf-targz-src-pkg - Build perf-3.12.0.tar.gz source tarball
    perf-tarbz2-src-pkg - Build perf-3.12.0.tar.bz2 source tarball
    perf-tarxz-src-pkg - Build perf-3.12.0.tar.xz source tarball
    $

    from David Ahern.

    . Enhance option parse error message, showing just the help lines
    of the options affected, from Namhyung Kim.

    . libtraceevent updates from upstream trace-cmd repo, from Steven
    Rostedt.

    . Always use perf_evsel__set_sample_bit to set sample_type, from
    Adrian Hunter.

    . Memory and mmap leak fixes from Chenggang Qin.

    . Assorted build fixes from David Ahern and Jiri Olsa.

    . Speed up and prettify the build system, from Ingo Molnar.

    . Implement addr2line directly using libbfd, from Roberto Vitillo.

    . Separate the GTK support in a separate libperf-gtk.so DSO, that
    is only loaded when --gtk is specified, from Namhyung Kim.

    . perf bash completion fixes and improvements from Ramkumar
    Ramachandra.

    . Support for Openembedded/Yocto -dbg packages, from Ricardo
    Ribalda Delgado.

    And lots and lots of other fixes and code reorganizations that did not
    make it into the list, see the shortlog, diffstat and the Git log for
    details!"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (300 commits)
    uprobes: Fix the memory out of bound overwrite in copy_insn()
    uprobes: Fix the wrong usage of current->utask in uprobe_copy_process()
    perf tools: Remove unneeded include
    perf record: Remove post_processing_offset variable
    perf record: Remove advance_output function
    perf record: Refactor feature handling into a separate function
    perf trace: Don't relookup fields by name in each sample
    perf tools: Fix version when building out of tree
    perf evsel: Ditch evsel->handler.data field
    uprobes: Export write_opcode() as uprobe_write_opcode()
    uprobes: Introduce arch_uprobe->ixol
    uprobes: Kill module_init() and module_exit()
    uprobes: Move function declarations out of arch
    perf/x86/intel: Add Ivy Bridge-EP uncore IRP box support
    perf/x86/intel/uncore: Add filter support for IvyBridge-EP QPI boxes
    perf: Factor out strncpy() in perf_event_mmap_event()
    tools/perf: Add required memory barriers
    perf: Fix arch_perf_out_copy_user default
    perf: Update a stale comment
    perf: Optimize perf_output_begin() -- address calculation
    ...

    Linus Torvalds
     

09 Oct, 2013

3 commits

  • Shared faults can lead to lots of unnecessary page migrations,
    slowing down the system, and causing private faults to hit the
    per-pgdat migration ratelimit.

    This patch adds sysctl numa_balancing_migrate_deferred, which specifies
    how many shared page migrations to skip unconditionally, after each page
    migration that is skipped because it is a shared fault.

    This reduces the number of page migrations back and forth in
    shared fault situations. It also gives a strong preference to
    the tasks that are already running where most of the memory is,
    and to moving the other tasks closer to that memory (the deferral
    logic is sketched below).

    Testing this with a much higher scan rate than the default
    still seems to result in fewer page migrations than before.

    Memory seems to be somewhat better consolidated than previously,
    with multi-instance specjbb runs on a 4 node system.

    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-62-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Rik van Riel
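
    A minimal standalone sketch of the deferral counter described above; the
    type and function names here are illustrative stand-ins, not the kernel's
    own identifiers:

      #include <stdbool.h>
      #include <stdio.h>

      static int sysctl_numa_balancing_migrate_deferred = 16; /* the new knob */

      struct task_numa_state {
          int numa_migrate_deferred;  /* shared-fault migrations left to skip */
      };

      /* Called when a shared fault would otherwise migrate a page. */
      static bool defer_shared_migration(struct task_numa_state *t)
      {
          if (t->numa_migrate_deferred > 0) {
              t->numa_migrate_deferred--;
              return true;              /* skip this migration unconditionally */
          }
          return false;
      }

      /* Called when a shared-fault migration was just skipped: arm the
       * counter so the next N shared faults do not migrate either. */
      static void arm_migration_deferral(struct task_numa_state *t)
      {
          t->numa_migrate_deferred = sysctl_numa_balancing_migrate_deferred;
      }

      int main(void)
      {
          struct task_numa_state t = { 0 };

          arm_migration_deferral(&t);
          for (int fault = 0; fault < 20; fault++)
              printf("shared fault %2d: %s\n", fault,
                     defer_shared_migration(&t) ? "deferred" : "may migrate");
          return 0;
      }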
     
  • With scan rate adaptions based on whether the workload has properly
    converged or not there should be no need for the scan period reset
    hammer. Get rid of it.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-60-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • This patch favours moving tasks towards NUMA node that recorded a higher
    number of NUMA faults during active load balancing. Ideally this is
    self-reinforcing as the longer the task runs on that node, the more faults
    it should incur causing task_numa_placement to keep the task running on that
    node. In reality a big weakness is that the node's CPUs can be overloaded
    and it would be more efficient to queue tasks on an idle node and migrate
    to the new node. This would require additional smarts in the balancer, so
    for now the balancer will simply prefer to place the task on the preferred
    node for a number of PTE scans, which is controlled by the
    numa_balancing_settle_count sysctl. Once the settle_count number of scans
    has completed, the scheduler is free to place the task on an alternative
    node if the load is imbalanced.

    [srikar@linux.vnet.ibm.com: Fixed statistics]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    [ Tunable and use higher faults instead of preferred. ]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-23-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

04 Oct, 2013

1 commit

  • /proc/sys/kernel/perf_event_max_sample_rate will accept
    negative values as well as 0.

    Negative values are unreasonable, and 0 causes a
    divide by zero exception in perf_proc_update_handler.

    This patch enforces a lower limit of 1.

    Signed-off-by: Knut Petersen
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/5242DB0C.4070005@t-online.de
    Signed-off-by: Ingo Molnar

    Knut Petersen
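
    A standalone sketch of the clamp described above: values below 1 are
    refused so the handler can never divide by zero. The helper below is an
    illustration, not the kernel's actual sysctl table entry:

      #include <errno.h>
      #include <stdio.h>

      static int perf_event_max_sample_rate = 100000;  /* default, samples/sec */

      static int set_max_sample_rate(int new_rate)
      {
          if (new_rate < 1)
              return -EINVAL;  /* 0 or a negative value would break the divide */
          perf_event_max_sample_rate = new_rate;

          /* the handler then recomputes a per-tick budget, roughly: */
          int max_samples_per_tick = perf_event_max_sample_rate / 100; /* HZ=100 assumed */
          printf("rate=%d -> max_samples_per_tick=%d\n",
                 perf_event_max_sample_rate, max_samples_per_tick);
          return 0;
      }

      int main(void)
      {
          printf("write 0:     %d\n", set_max_sample_rate(0));      /* rejected */
          printf("write -5:    %d\n", set_max_sample_rate(-5));     /* rejected */
          printf("write 50000: %d\n", set_max_sample_rate(50000));  /* accepted */
          return 0;
      }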
     

23 Sep, 2013

1 commit

    As 'sysctl_hung_task_check_count' is 'unsigned long', when this
    value is assigned to max_count in check_hung_uninterruptible_tasks()
    it's truncated to 'int' type.

    This causes a minor artifact: if we write 2^32 to sysctl.hung_task_check_count,
    hung task detection will be effectively disabled.

    With this fix, it will still truncate the user input to 32 bits, but
    reading sysctl.hung_task_check_count reflects the actual truncated value.

    Signed-off-by: Li Zefan
    Acked-by: Ingo Molnar
    Link: http://lkml.kernel.org/r/523FFF4E.9050401@huawei.com
    Signed-off-by: Ingo Molnar

    Li Zefan
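
    A standalone illustration of the truncation described above (assumes a
    64-bit 'long', as on x86_64): writing 2^32 to the unsigned long sysctl
    and assigning it to an int yields 0, which disables the check loop:

      #include <stdio.h>

      int main(void)
      {
          unsigned long sysctl_hung_task_check_count = 1UL << 32; /* user wrote 2^32 */
          int max_count = sysctl_hung_task_check_count;           /* truncated to 0 */

          printf("sysctl value: %lu, loop bound actually used: %d\n",
                 sysctl_hung_task_check_count, max_count);

          /* check_hung_uninterruptible_tasks() effectively iterates like this,
           * so with max_count == 0 no task is ever examined: */
          for (int i = 0; i < max_count; i++)
              ;
          return 0;
      }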
     

13 Sep, 2013

1 commit

  • Pull vfs pile 4 from Al Viro:
    "list_lru pile, mostly"

    This came out of Andrew's pile, Al ended up doing the merge work so that
    Andrew didn't have to.

    Additionally, a few fixes.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits)
    super: fix for destroy lrus
    list_lru: dynamically adjust node arrays
    shrinker: Kill old ->shrink API.
    shrinker: convert remaining shrinkers to count/scan API
    staging/lustre/libcfs: cleanup linux-mem.h
    staging/lustre/ptlrpc: convert to new shrinker API
    staging/lustre/obdclass: convert lu_object shrinker to count/scan API
    staging/lustre/ldlm: convert to shrinkers to count/scan API
    hugepage: convert huge zero page shrinker to new shrinker API
    i915: bail out earlier when shrinker cannot acquire mutex
    drivers: convert shrinkers to new count/scan API
    fs: convert fs shrinkers to new scan/count API
    xfs: fix dquot isolation hang
    xfs-convert-dquot-cache-lru-to-list_lru-fix
    xfs: convert dquot cache lru to list_lru
    xfs: rework buffer dispose list tracking
    xfs-convert-buftarg-lru-to-generic-code-fix
    xfs: convert buftarg LRU to generic code
    fs: convert inode and dentry shrinking to be node aware
    vmscan: per-node deferred work
    ...

    Linus Torvalds
     

12 Sep, 2013

1 commit

  • Now hugepage migration is enabled, although restricted on pmd-based
    hugepages for now (due to lack of testing.) So we should allocate
    migratable hugepages from ZONE_MOVABLE if possible.

    This patch makes the GFP flags used for hugepage allocation dependent on
    migration support, not only on the value of hugepages_treat_as_movable.
    It causes no change in behavior for architectures which do not support
    hugepage migration (a sketch of the flag selection follows below).

    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Cc: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
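
    A standalone sketch of the flag selection described above: ZONE_MOVABLE
    is allowed either when the admin asked for it or when this hugepage size
    is migratable. The flag values, the order test, and the helper names are
    simplified stand-ins, not the kernel's definitions:

      #include <stdbool.h>
      #include <stdio.h>

      #define GFP_HIGHUSER          0x1
      #define GFP_HIGHUSER_MOVABLE  0x3   /* GFP_HIGHUSER | __GFP_MOVABLE */

      static int hugepages_treat_as_movable;   /* the existing sysctl */

      struct hstate { unsigned int order; };

      /* Only pmd-sized hugepages are migratable so far (2MB on x86-64). */
      static bool hugepage_migration_supported(struct hstate *h)
      {
          return h->order == 9;
      }

      static int htlb_alloc_mask(struct hstate *h)
      {
          if (hugepages_treat_as_movable || hugepage_migration_supported(h))
              return GFP_HIGHUSER_MOVABLE;
          return GFP_HIGHUSER;
      }

      int main(void)
      {
          struct hstate pmd_huge = { .order = 9 }, pud_huge = { .order = 18 };

          printf("pmd hugepage gfp mask: %#x\n", htlb_alloc_mask(&pmd_huge));
          printf("pud hugepage gfp mask: %#x\n", htlb_alloc_mask(&pud_huge));
          return 0;
      }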
     

11 Sep, 2013

1 commit

  • This series reworks our current object cache shrinking infrastructure in
    two main ways:

    * Noticing that a lot of users copy and paste their own version of LRU
    lists for objects, we put some effort in providing a generic version.
    It is modeled after the filesystem users: dentries, inodes, and xfs
    (for various tasks), but we expect that other users could benefit in
    the near future with little or no modification. Let us know if you
    have any issues.

    * The underlying list_lru being proposed automatically and
    transparently keeps the elements in per-node lists, and is able to
    manipulate the node lists individually. Given this infrastructure, we
    are able to modify the up-to-now hammer called shrink_slab to proceed
    with node-reclaim instead of always searching memory from all over like
    it has been doing.

    Per-node lru lists are also expected to lead to less contention in the lru
    locks on multi-node scans, since we are now no longer fighting for a
    global lock. The locks usually disappear from the profilers with this
    change.

    Although we have no official benchmarks for this version - be our guest to
    independently evaluate this - earlier versions of this series were
    performance tested (details at
    http://permalink.gmane.org/gmane.linux.kernel.mm/100537) yielding no
    visible performance regressions while yielding a better qualitative
    behavior in NUMA machines.

    With this infrastructure in place, we can use the list_lru entry point to
    provide memcg isolation and per-memcg targeted reclaim. Historically,
    those two pieces of work have been posted together. This version presents
    only the infrastructure work, deferring the memcg work for a later time,
    so we can focus on getting this part tested. You can see more about the
    history of such work at http://lwn.net/Articles/552769/

    Dave Chinner (18):
    dcache: convert dentry_stat.nr_unused to per-cpu counters
    dentry: move to per-sb LRU locks
    dcache: remove dentries from LRU before putting on dispose list
    mm: new shrinker API
    shrinker: convert superblock shrinkers to new API
    list: add a new LRU list type
    inode: convert inode lru list to generic lru list code.
    dcache: convert to use new lru list infrastructure
    list_lru: per-node list infrastructure
    shrinker: add node awareness
    fs: convert inode and dentry shrinking to be node aware
    xfs: convert buftarg LRU to generic code
    xfs: rework buffer dispose list tracking
    xfs: convert dquot cache lru to list_lru
    fs: convert fs shrinkers to new scan/count API
    drivers: convert shrinkers to new count/scan API
    shrinker: convert remaining shrinkers to count/scan API
    shrinker: Kill old ->shrink API.

    Glauber Costa (7):
    fs: bump inode and dentry counters to long
    super: fix calculation of shrinkable objects for small numbers
    list_lru: per-node API
    vmscan: per-node deferred work
    i915: bail out earlier when shrinker cannot acquire mutex
    hugepage: convert huge zero page shrinker to new shrinker API
    list_lru: dynamically adjust node arrays

    This patch:

    There are situations in very large machines in which we can have a large
    quantity of dirty inodes, unused dentries, etc. This is particularly true
    when unmounting a filesystem, since every live object will eventually be
    discarded.

    Dave Chinner reported a problem with this while experimenting with the
    shrinker revamp patchset. So we believe it is time for a change. This
    patch just converts the counters from int to long. Machines where it
    matters should have a big long anyway.

    Signed-off-by: Glauber Costa
    Cc: Dave Chinner
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: Dave Chinner
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Glauber Costa
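
    A sketch of the count/scan shrinker interface this series converts
    everything to, with the kernel types stubbed out locally so the shape is
    visible; the field names follow the description above but this is not
    the in-tree definition:

      #include <stdio.h>

      #define SHRINK_STOP (~0UL)

      struct shrink_control {
          unsigned long nr_to_scan;  /* objects the VM asks us to free */
          int nid;                   /* NUMA node under reclaim (node awareness) */
      };

      struct shrinker {
          /* the old single ->shrink() callback is split into two jobs: */
          unsigned long (*count_objects)(struct shrinker *, struct shrink_control *);
          unsigned long (*scan_objects)(struct shrinker *, struct shrink_control *);
      };

      static unsigned long my_cache_objects = 1000;

      static unsigned long my_count(struct shrinker *s, struct shrink_control *sc)
      {
          return my_cache_objects;   /* cheap: report freeable objects on sc->nid */
      }

      static unsigned long my_scan(struct shrinker *s, struct shrink_control *sc)
      {
          unsigned long freed = sc->nr_to_scan < my_cache_objects ?
                                sc->nr_to_scan : my_cache_objects;

          my_cache_objects -= freed; /* actually free up to nr_to_scan objects */
          return freed;              /* or SHRINK_STOP if no progress was possible */
      }

      int main(void)
      {
          struct shrinker s = { .count_objects = my_count, .scan_objects = my_scan };
          struct shrink_control sc = { .nr_to_scan = 128, .nid = 0 };

          printf("freeable=%lu freed=%lu\n",
                 s.count_objects(&s, &sc), s.scan_objects(&s, &sc));
          return 0;
      }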
     

27 Jul, 2013

1 commit

    Some (integer) sysctl values are expressed in ms and have to be
    represented internally as jiffies. The msecs_to_jiffies() function
    returns an unsigned long, which gets assigned to the integer. This
    patch prevents the value from being assigned if it is bigger than
    INT_MAX, done in a similar way as in cba9f3 ("Range checking in
    do_proc_dointvec_(userhz_)jiffies_conv").

    Signed-off-by: Francesco Fusco
    CC: Andrew Morton
    CC: linux-kernel@vger.kernel.org
    Signed-off-by: David S. Miller

    Francesco Fusco
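
    A standalone sketch of the range check described above: millisecond
    values whose jiffies representation would not fit in the int backing the
    sysctl are rejected. HZ and the helper are simplified stand-ins:

      #include <errno.h>
      #include <limits.h>
      #include <stdio.h>

      #define HZ 1000

      static unsigned long ms_to_jiffies(unsigned long ms)
      {
          return ms * HZ / 1000;
      }

      static int store_ms_sysctl(int *valp, unsigned long user_ms)
      {
          unsigned long jif = ms_to_jiffies(user_ms);

          if (jif > INT_MAX)
              return -EINVAL;        /* would overflow the int backing store */
          *valp = (int)jif;
          return 0;
      }

      int main(void)
      {
          int val = 0;

          printf("1000 ms:       %d\n", store_ms_sysctl(&val, 1000UL));       /* ok */
          printf("3000000000 ms: %d\n", store_ms_sysctl(&val, 3000000000UL)); /* rejected */
          return 0;
      }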
     

12 Jul, 2013

2 commits

  • Get upstream changes so we can apply fixes against them

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Pull tracing changes from Steven Rostedt:
    "The majority of the changes here are cleanups for the large changes
    that were added to 3.10, which includes several bug fixes that have
    been marked for stable.

    As for new features, there were a few, but nothing to write to LWN
    about. These include:

    New function triggers called "dump" and "cpudump" that will cause
    ftrace to dump its buffer to the console when the function is called.
    The difference between "dump" and "cpudump" is that "dump" will dump
    the entire contents of the ftrace buffer, whereas "cpudump" will only
    dump the contents of the ftrace buffer for the CPU that called the
    function.

    Another small enhancement is a new sysctl switch called
    "traceoff_on_warning" which, when enabled, will disable tracing if any
    WARN_ON() is triggered. This is useful if you want to debug what
    caused a warning and do not want to risk losing your trace data by the
    ring buffer overwriting the data before you can disable it. There's
    also a kernel command line option that will make this enabled at boot
    up called the same thing"

    * tag 'trace-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (34 commits)
    tracing: Make tracing_open_generic_{tr,tc}() static
    tracing: Remove ftrace() function
    tracing: Remove TRACE_EVENT_TYPE enum definition
    tracing: Make tracer_tracing_{off,on,is_on}() static
    tracing: Fix irqs-off tag display in syscall tracing
    uprobes: Fix return value in error handling path
    tracing: Fix race between deleting buffer and setting events
    tracing: Add trace_array_get/put() to event handling
    tracing: Get trace_array ref counts when accessing trace files
    tracing: Add trace_array_get/put() to handle instance refs better
    tracing: Protect ftrace_trace_arrays list in trace_events.c
    tracing: Make trace_marker use the correct per-instance buffer
    ftrace: Do not run selftest if command line parameter is set
    tracing/kprobes: Don't pass addr=ip to perf_trace_buf_submit()
    tracing: Use flag buffer_disabled for irqsoff tracer
    tracing/kprobes: Turn trace_probe->files into list_head
    tracing: Fix disabling of soft disable
    tracing: Add missing syscall_metadata comment
    tracing: Simplify code for showing of soft disabled flag
    tracing/kprobes: Kill probe_enable_lock
    ...

    Linus Torvalds
     

10 Jul, 2013

1 commit

  • …eric/linux-dynticks into timers/urgent

    Pull nohz updates/fixes from Frederic Weisbecker:

    ' Note that "watchdog: Boot-disable by default on full dynticks" is a temporary
    solution to solve the issue with the watchdog that prevents the tick from
    stopping. This is to make sure that 3.11 doesn't have that problem as several
    people complained about it.

    A proper and longer term solution has been proposed by Peterz:

    http://lkml.kernel.org/r/20130618103632.GO3204@twins.programming.kicks-ass.net
    '

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

23 Jun, 2013

1 commit

  • This patch keeps track of how long perf's NMI handler is taking,
    and also calculates how many samples perf can take a second. If
    the sample length times the expected max number of samples
    exceeds a configurable threshold, it drops the sample rate.

    This way, we don't have a runaway sampling process eating up the
    CPU.

    This patch can tend to drop the sample rate down to a level where
    perf doesn't work very well. *BUT* the alternative is that my
    system hangs because it spends all of its time handling NMIs.

    I'll take a busted performance tool over an entire system that's
    busted and undebuggable any day.

    BTW, my suspicion is that there's still an underlying bug here.
    Using the HPET instead of the TSC is definitely a contributing
    factor, but I suspect there are some other things going on.
    But, I can't go dig down on a bug like that with my machine
    hanging all the time.

    Signed-off-by: Dave Hansen
    Acked-by: Peter Zijlstra
    Cc: paulus@samba.org
    Cc: acme@ghostprotocols.net
    Cc: Dave Hansen
    [ Prettified it a bit. ]
    Signed-off-by: Ingo Molnar

    Dave Hansen
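
    A standalone sketch of the throttling check described above: if the
    observed time per sample multiplied by the configured max sample rate
    would exceed the allowed share of CPU time, the sample rate is lowered.
    The numbers and the halving policy are illustrative, not the kernel's
    exact bookkeeping:

      #include <stdio.h>

      #define NSEC_PER_SEC 1000000000ULL

      static int perf_event_max_sample_rate = 100000;   /* samples/sec */
      static int perf_cpu_time_max_percent  = 25;       /* allowed CPU share */

      static void perf_sample_event_took(unsigned long long sample_len_ns)
      {
          unsigned long long allowed_ns =
              NSEC_PER_SEC * perf_cpu_time_max_percent / 100;
          unsigned long long projected_ns =
              sample_len_ns * (unsigned long long)perf_event_max_sample_rate;

          if (projected_ns > allowed_ns) {
              perf_event_max_sample_rate /= 2;           /* back off */
              printf("perf samples took %llu ns each, lowering max_sample_rate to %d\n",
                     sample_len_ns, perf_event_max_sample_rate);
          }
      }

      int main(void)
      {
          perf_sample_event_took(1000);   /* cheap NMI: within budget, no change */
          perf_sample_event_took(50000);  /* 50us * 100k/s = 5s of CPU/s: throttle */
          return 0;
      }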
     

20 Jun, 2013

2 commits

  • We have two very conflicting state variable names in the
    watchdog:

    * watchdog_enabled: This one reflects the user interface. It's
    set to 1 by default and can be overridden with boot options
    or sysctl/procfs interface.

    * watchdog_disabled: This is the internal toggle state that
    tells if watchdog threads, timers and NMI events are currently
    running or not. This state mostly depends on the user settings.
    It's a convenient state latch.

    Now we really need to find clearer names because those
    are just too confusing to encourage deep review.

    watchdog_enabled now becomes watchdog_user_enabled to reflect
    its purpose as an interface.

    watchdog_disabled becomes watchdog_running to suggest its
    role as a pure internal state.

    Signed-off-by: Frederic Weisbecker
    Cc: Srivatsa S. Bhat
    Cc: Anish Singh
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Don Zickus

    Frederic Weisbecker
     
  • Add a traceoff_on_warning option in both the kernel command line as well
    as a sysctl option. When set, any WARN*() function that is hit will cause
    the tracing_on variable to be cleared, which disables writing to the
    ring buffer.

    This is useful especially when tracing a bug with function tracing. When
    a warning is hit, the print caused by the warning can flood the trace with
    the functions producing the output for the warning. This can make the
    resulting trace useless by either hiding where the bug happened, or worse,
    by overflowing the buffer and losing the trace of the bug totally.

    Acked-by: Peter Zijlstra
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
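
    A minimal standalone sketch of the hook described above: when the flag is
    set (via the sysctl or the boot parameter), the WARN path switches the
    ring buffer off so the trace leading up to the warning is preserved. The
    tracing core is stubbed out here:

      #include <stdbool.h>
      #include <stdio.h>

      static int traceoff_on_warning;        /* sysctl / boot-parameter flag */
      static bool ring_buffer_enabled = true;

      static void tracing_off(void)          /* stub for the real buffer switch */
      {
          ring_buffer_enabled = false;
      }

      /* called from the WARN*() slow path */
      static void disable_trace_on_warning(void)
      {
          if (traceoff_on_warning)
              tracing_off();
      }

      int main(void)
      {
          traceoff_on_warning = 1;           /* echo 1 > .../traceoff_on_warning */
          disable_trace_on_warning();        /* a WARN_ON() fired somewhere */
          printf("ring buffer is now %s\n", ring_buffer_enabled ? "on" : "off");
          return 0;
      }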
     

28 May, 2013

1 commit

  • In old kernels, it's allowed to set softlockup_thresh to -1 or 0
    to disable softlockup detection. However watchdog_thresh only
    uses 0 to disable detection, and setting it to -1 just froze my
    box; there was nothing I could do but reboot.

    Signed-off-by: Li Zefan
    Acked-by: Don Zickus
    Link: http://lkml.kernel.org/r/51959668.9040106@huawei.com
    Signed-off-by: Ingo Molnar

    Li Zefan
     

30 Apr, 2013

3 commits

  • Add an admin_reserve_kbytes knob to allow admins to change the hardcoded
    memory reserve to something other than 3%, which may be multiple
    gigabytes on large memory systems. Only about 8MB is necessary to
    enable recovery in the default mode, and only a few hundred MB are
    required even when overcommit is disabled.

    This affects OVERCOMMIT_GUESS and OVERCOMMIT_NEVER.

    admin_reserve_kbytes is initialized to min(3% free pages, 8MB)

    I arrived at 8MB by summing the RSS of sshd or login, bash, and top.

    Please see first patch in this series for full background, motivation,
    testing, and full changelog.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: make init_admin_reserve() static]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
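
    A standalone sketch of the default described above, admin_reserve_kbytes
    = min(3% of free pages, 8MB). The free-page count is passed in as a
    plain number here; the kernel reads its own counters:

      #include <stdio.h>

      #define PAGE_SHIFT 12   /* 4KB pages assumed for the illustration */

      static unsigned long sysctl_admin_reserve_kbytes;

      static void init_admin_reserve(unsigned long free_pages)
      {
          unsigned long free_kbytes = free_pages << (PAGE_SHIFT - 10);
          unsigned long three_percent = free_kbytes / 32;   /* ~3% */

          sysctl_admin_reserve_kbytes =
              three_percent < 8192 ? three_percent : 8192;  /* min(3%, 8MB) */
      }

      int main(void)
      {
          init_admin_reserve(4UL << 20);   /* ~16GB free, in 4KB pages */
          printf("admin_reserve_kbytes = %lu\n", sysctl_admin_reserve_kbytes);
          return 0;
      }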
     
  • Add user_reserve_kbytes knob.

    Limit the growth of the memory reserved for other user processes to
    min(3% current process size, user_reserve_pages). Only about 8MB is
    necessary to enable recovery in the default mode, and only a few hundred
    MB are required even when overcommit is disabled.

    user_reserve_pages defaults to min(3% free pages, 128MB)

    I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
    then adding the RSS of each.

    This only affects OVERCOMMIT_NEVER mode.

    Background

    1. user reserve

    __vm_enough_memory reserves a hardcoded 3% of the current process size for
    other applications when overcommit is disabled. This was done so that a
    user could recover if they launched a memory hogging process. Without the
    reserve, a user would easily run into a message such as:

    bash: fork: Cannot allocate memory

    2. admin reserve

    Additionally, a hardcoded 3% of free memory is reserved for root in both
    overcommit 'guess' and 'never' modes. This was intended to prevent a
    scenario where root can't log in and perform recovery operations.

    Note that this reserve shrinks, and doesn't guarantee a useful reserve.

    Motivation

    The two hardcoded memory reserves should be updated to account for current
    memory sizes.

    Also, the admin reserve would be more useful if it didn't shrink too much.

    When the current code was originally written, 1GB was considered
    "enterprise". Now the 3% reserve can grow to multiple GB on large memory
    systems, and it only needs to be a few hundred MB at most to enable a user
    or admin to recover a system with an unwanted memory hogging process.

    I've found that reducing these reserves is especially beneficial for a
    specific type of application load:

    * single application system
    * one or few processes (e.g. one per core)
    * allocating all available memory
    * not initializing every page immediately
    * long running

    I've run scientific clusters with this sort of load. A long running job
    sometimes failed many hours (weeks of CPU time) into a calculation. They
    weren't initializing all of their memory immediately, and they weren't
    using calloc, so I put systems into overcommit 'never' mode. These
    clusters run diskless and have no swap.

    However, with the current reserves, a user wishing to allocate as much
    memory as possible to one process may be prevented from using, for
    example, almost 2GB out of 32GB.

    The effect is less, but still significant when a user starts a job with
    one process per core. I have repeatedly seen a set of processes
    requesting the same amount of memory fail because one of them could not
    allocate the amount of memory a user would expect to be able to allocate.
    For example, Message Passing Interface (MPI) processes, one per core. And
    it is similar for other parallel programming frameworks.

    Changing this reserve code will make the overcommit never mode more useful
    by allowing applications to allocate nearly all of the available memory.

    Also, the new admin_reserve_kbytes will be safer than the current behavior
    since the hardcoded 3% of available memory reserve can shrink to something
    useless in the case where applications have grabbed all available memory.

    Risks

    * "bash: fork: Cannot allocate memory"

    The downside of the first patch-- which creates a tunable user reserve
    that is only used in overcommit 'never' mode--is that an admin can set
    it so low that a user may not be able to kill their process, even if
    they already have a shell prompt.

    Of course, a user can get in the same predicament with the current 3%
    reserve--they just have to launch processes until 3% becomes negligible.

    * root-cant-log-in problem

    The second patch, adding the tunable rootuser_reserve_pages, allows
    the admin to shoot themselves in the foot by setting it too small. They
    can easily get the system into a state where root-can't-log-in.

    However, the new admin_reserve_kbytes will be safer than the current
    behavior since the hardcoded 3% of available memory reserve can shrink
    to something useless in the case where applications have grabbed all
    available memory.

    Alternatives

    * Memory cgroups provide a more flexible way to limit application memory.

    Not everyone wants to set up cgroups or deal with their overhead.

    * We could create a fourth overcommit mode which provides smaller reserves.

    The size of useful reserves may be drastically different depending
    on whether the system is embedded or enterprise.

    * Force users to initialize all of their memory or use calloc.

    Some users don't want/expect the system to overcommit when they malloc.
    Overcommit 'never' mode is for this scenario, and it should work well.

    The new user and admin reserve tunables are simple to use, with low
    overhead compared to cgroups. The patches preserve current behavior where
    3% of memory is less than 128MB, except that the admin reserve doesn't
    shrink to an unusable size under pressure. The code allows admins to tune
    for embedded and enterprise usage.

    FAQ

    * How is the root-cant-login problem addressed?
    What happens if admin_reserve_pages is set to 0?

    Root is free to shoot themselves in the foot by setting
    admin_reserve_kbytes too low.

    On x86_64, the minimum useful reserve is:
    8MB for overcommit 'guess'
    128MB for overcommit 'never'

    admin_reserve_pages defaults to min(3% free memory, 8MB)

    So, anyone switching to 'never' mode needs to adjust
    admin_reserve_pages.

    * How do you calculate a minimum useful reserve?

    A user or the admin needs enough memory to login and perform
    recovery operations, which includes, at a minimum:

    sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

    For overcommit 'guess', we can sum resident set sizes (RSS)
    because we only need enough memory to handle what the recovery
    programs will typically use. On x86_64 this is about 8MB.

    For overcommit 'never', we can take the max of their virtual sizes (VSZ)
    and add the sum of their RSS. We use VSZ instead of RSS because this
    mode forces us to ensure we can fulfill all of the requested memory
    allocations--
    even if the programs only use a fraction of what they ask for.
    On x86_64 this is about 128MB.

    When swap is enabled, reserves are useful even when they are as
    small as 10MB, regardless of overcommit mode.

    When both swap and overcommit are disabled, then the admin should
    tune the reserves higher to be absolutely safe. Over 230MB each
    was safest in my testing.

    * What happens if user_reserve_pages is set to 0?

    Note, this only affects overcommit 'never' mode.

    Then a user will be able to allocate all available memory minus
    admin_reserve_kbytes.

    However, they will easily see a message such as:

    "bash: fork: Cannot allocate memory"

    And they won't be able to recover/kill their application.
    The admin should be able to recover the system if
    admin_reserve_kbytes is set appropriately.

    * What's the difference between overcommit 'guess' and 'never'?

    "Guess" allows an allocation if there are enough free + reclaimable
    pages. It has a hardcoded 3% of free pages reserved for root.

    "Never" allows an allocation if there is enough swap + a configurable
    percentage (default is 50) of physical RAM. It has a hardcoded 3% of
    free pages reserved for root, like "Guess" mode. It also has a
    hardcoded 3% of the current process size reserved for additional
    applications.

    * Why is overcommit 'guess' not suitable even when an app eventually
    writes to every page? It takes free pages, file pages, available
    swap pages, reclaimable slab pages into consideration. In other words,
    these are all pages available, then why isn't overcommit suitable?

    Because it only looks at the present state of the system. It
    does not take into account the memory that other applications have
    malloced, but haven't initialized yet. It overcommits the system.

    Test Summary

    There was little change in behavior in the default overcommit 'guess'
    mode with swap enabled before and after the patch. This was expected.

    Systems run most predictably (i.e. no oom kills) in overcommit 'never'
    mode with swap enabled. This also allowed the most memory to be allocated
    to a user application.

    Overcommit 'guess' mode without swap is a bad idea. It is easy to
    crash the system. None of the other tested combinations crashed.
    This matches my experience on the Roadrunner supercomputer.

    Without the tunable user reserve, a system in overcommit 'never' mode
    and without swap does not allow the user to recover, although the
    admin can.

    With the new tunable reserves, a system in overcommit 'never' mode
    and without swap can be configured to:

    1. maximize user-allocatable memory, running close to the edge of
    recoverability

    2. maximize recoverability, sacrificing allocatable memory to
    ensure that a user cannot take down a system

    Test Description

    Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap

    System is booted into multiuser console mode, with unnecessary services
    turned off. Caches were dropped before each test.

    Hogs are user memtester processes that attempt to allocate all free memory
    as reported by /proc/meminfo

    In overcommit 'never' mode, memory_ratio=100

    Test Results

    3.9.0-rc1-mm1

    Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
    ----------   ----   ----   -------------   ----   -------------   --------------
    guess        yes    1      5432/5432       no     yes             yes
    guess        yes    4      5444/5444       1      yes             yes
    guess        no     1      5302/5449       no     yes             yes
    guess        no     4      -               crash  no              no

    never        yes    1      5460/5460       1      yes             yes
    never        yes    4      5460/5460       1      yes             yes
    never        no     1      5218/5432       no     no              yes
    never        no     4      5203/5448       no     no              yes

    3.9.0-rc1-mm1-tunablereserves

    User and Admin Recovery show their respective reserves, if applicable.

    Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
    ----------   ----   ----   -------------   ----   -------------   --------------
    guess        yes    1      5419/5419       no     -     yes         8MB yes
    guess        yes    4      5436/5436       1      -     yes         8MB yes
    guess        no     1      5440/5440       *      -     yes         8MB yes
    guess        no     4      -               crash  -     no          8MB no

    * process would successfully mlock, then the oom killer would pick it

    never        yes    1      5446/5446       no     10MB  yes         20MB yes
    never        yes    4      5456/5456       no     10MB  yes         20MB yes
    never        no     1      5387/5429       no     128MB no          8MB barely
    never        no     1      5323/5428       no     226MB barely      8MB barely
    never        no     1      5323/5428       no     226MB barely      8MB barely

    never        no     1      5359/5448       no     10MB  no          10MB barely

    never        no     1      5323/5428       no     0MB   no          10MB barely
    never        no     1      5332/5428       no     0MB   no          50MB yes
    never        no     1      5293/5429       no     0MB   no          90MB yes

    never        no     1      5001/5427       no     230MB yes         338MB yes
    never        no     4*     4998/5424       no     230MB yes         338MB yes

    * more memtesters were launched, able to allocate approximately another 100MB

    Future Work

    - Test larger memory systems.

    - Test an embedded image.

    - Test other architectures.

    - Time malloc microbenchmarks.

    - Would it be useful to be able to set overcommit policy for
    each memory cgroup?

    - Some lines are slightly above 80 chars.
    Perhaps define a macro to convert between pages and kb?
    Other places in the kernel do this.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: make init_user_reserve() static]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
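
    A standalone sketch of how the two reserves enter the OVERCOMMIT_NEVER
    accounting described above: the admin reserve is withheld from everyone
    but root, and min(3% of the current process, user_reserve_kbytes) is kept
    back for other user processes. The arithmetic is simplified and the
    function name is an illustration, not the kernel's __vm_enough_memory:

      #include <stdbool.h>
      #include <stdio.h>

      static unsigned long sysctl_user_reserve_kbytes  = 131072; /* 128MB default */
      static unsigned long sysctl_admin_reserve_kbytes = 8192;   /* 8MB default */

      static bool never_mode_enough_memory(unsigned long allowed_kb,
                                           unsigned long committed_kb,
                                           unsigned long process_size_kb,
                                           unsigned long request_kb,
                                           bool is_admin)
      {
          if (!is_admin)
              allowed_kb -= sysctl_admin_reserve_kbytes;

          /* keep room so other user processes (a shell, kill, ...) still work */
          unsigned long other_reserve_kb = process_size_kb / 32;   /* ~3% */
          if (other_reserve_kb > sysctl_user_reserve_kbytes)
              other_reserve_kb = sysctl_user_reserve_kbytes;
          allowed_kb -= other_reserve_kb;

          return committed_kb + request_kb <= allowed_kb;
      }

      int main(void)
      {
          /* 32GB allowed, 30GB committed, a 3GB process asks for 2GB more */
          printf("%s\n",
                 never_mode_enough_memory(32UL << 20, 30UL << 20,
                                          3UL << 20, 1UL << 21, false)
                 ? "allocation allowed" : "allocation denied");
          return 0;
      }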
     
  • drop_caches.c provides code only invokable via sysctl, so don't compile it
    in when CONFIG_SYSCTL=n.

    Signed-off-by: Josh Triplett
    Acked-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josh Triplett
     

02 Mar, 2013

1 commit

  • Pull new ARC architecture from Vineet Gupta:
    "Initial ARC Linux port with some fixes on top for 3.9-rc1:

    I would like to introduce the Linux port to ARC Processors (from
    Synopsys) for 3.9-rc1. The patch-set has been discussed on the public
    lists since Nov and has received a fair bit of review, especially from
    Arnd, tglx, Al and other subsystem maintainers for DeviceTree, kgdb...

    The arch bits are in arch/arc, some asm-generic changes (acked by
    Arnd), a minor change to PARISC (acked by Helge).

    The series is a touch bigger for a new port for 2 main reasons:

    1. It enables a basic kernel in first sub-series and adds
    ptrace/kgdb/.. later

    2. Some of the fallout of review (DeviceTree support, multi-platform-
    image support) were added on top of orig series, primarily to
    record the revision history.

    This updated pull request additionally contains

    - fixes due to our GNU tools catching up with the new syscall/ptrace
    ABI

    - some (minor) cross-arch Kconfig updates."

    * tag 'arc-v3.9-rc1-late' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: (82 commits)
    ARC: split elf.h into uapi and export it for userspace
    ARC: Fixup the current ABI version
    ARC: gdbserver using regset interface possibly broken
    ARC: Kconfig cleanup tracking cross-arch Kconfig pruning in merge window
    ARC: make a copy of flat DT
    ARC: [plat-arcfpga] DT arc-uart bindings change: "baud" => "current-speed"
    ARC: Ensure CONFIG_VIRT_TO_BUS is not enabled
    ARC: Fix pt_orig_r8 access
    ARC: [3.9] Fallout of hlist iterator update
    ARC: 64bit RTSC timestamp hardware issue
    ARC: Don't fiddle with non-existent caches
    ARC: Add self to MAINTAINERS
    ARC: Provide a default serial.h for uart drivers needing BASE_BAUD
    ARC: [plat-arcfpga] defconfig for fully loaded ARC Linux
    ARC: [Review] Multi-platform image #8: platform registers SMP callbacks
    ARC: [Review] Multi-platform image #7: SMP common code to use callbacks
    ARC: [Review] Multi-platform image #6: cpu-to-dma-addr optional
    ARC: [Review] Multi-platform image #5: NR_IRQS defined by ARC core
    ARC: [Review] Multi-platform image #4: Isolate platform headers
    ARC: [Review] Multi-platform image #3: switch to board callback
    ...

    Linus Torvalds
     

28 Feb, 2013

1 commit

  • The existing SUID_DUMP_* defines duplicate the newer SUID_DUMPABLE_*
    defines introduced in 54b501992dd2 ("coredump: warn about unsafe
    suid_dumpable / core_pattern combo"). Remove the new ones, and use the
    prior values instead.

    Signed-off-by: Kees Cook
    Reported-by: Chen Gang
    Cc: Alexander Viro
    Cc: Alan Cox
    Cc: "Eric W. Biederman"
    Cc: Doug Ledford
    Cc: Serge Hallyn
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

26 Feb, 2013

1 commit

  • Pull module update from Rusty Russell:
    "The sweeping change is to make add_taint() explicitly indicate whether
    to disable lockdep, but it's a mechanical change."

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    MODSIGN: Add option to not sign modules during modules_install
    MODSIGN: Add -s option to sign-file
    MODSIGN: Specify the hash algorithm on sign-file command line
    MODSIGN: Simplify Makefile with a Kconfig helper
    module: clean up load_module a little more.
    modpost: Ignore ARC specific non-alloc sections
    module: constify within_module_*
    taint: add explicit flag to show whether lock dep is still OK.
    module: printk message when module signature fail taints kernel.

    Linus Torvalds
     

24 Feb, 2013

1 commit

    When calculating the amount of dirtyable memory, min_free_kbytes should
    be subtracted because it is not intended for dirty pages.

    Addresses http://bugs.debian.org/695182

    [akpm@linux-foundation.org: fix up min_free_kbytes extern declarations]
    [akpm@linux-foundation.org: fix min() warning]
    Signed-off-by: Paul Szabo
    Acked-by: Rik van Riel
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Szabo
     

16 Feb, 2013

1 commit

  • PARISC defines /proc/sys/kernel/unaligned-trap to runtime toggle
    unaligned access emulation.

    Although the exact mechanics of enabling/disabling are still
    arch-specific, we can make the sysctl usable by other arches.

    Signed-off-by: Vineet Gupta
    Acked-by: Helge Deller
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: "Eric W. Biederman"
    Cc: Serge Hallyn

    Vineet Gupta
     

08 Feb, 2013

2 commits

  • Add a /proc/sys/kernel scheduler knob named
    sched_rr_timeslice_ms that allows global changing of the
    SCHED_RR timeslice value. The user-visible value is in milliseconds
    but is stored as jiffies. Setting to 0 (zero) resets to the
    default (currently 100ms).

    Signed-off-by: Clark Williams
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20130207094704.13751796@riff.lan
    Signed-off-by: Ingo Molnar

    Clark Williams
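
    A standalone sketch of the conversion described above: the user writes
    milliseconds, the scheduler keeps jiffies, and writing 0 restores the
    default timeslice. HZ and the default are illustrative constants:

      #include <stdio.h>

      #define HZ               250
      #define RR_TIMESLICE_MS  100       /* default SCHED_RR timeslice */

      static int sched_rr_timeslice;      /* stored in jiffies */

      static void set_sched_rr_timeslice_ms(int msecs)
      {
          if (msecs <= 0)
              msecs = RR_TIMESLICE_MS;    /* writing 0 resets to the default */
          sched_rr_timeslice = msecs * HZ / 1000;
      }

      int main(void)
      {
          set_sched_rr_timeslice_ms(25);
          printf("25 ms     -> %d jiffies\n", sched_rr_timeslice);
          set_sched_rr_timeslice_ms(0);
          printf("0 (reset) -> %d jiffies\n", sched_rr_timeslice);
          return 0;
      }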
     
  • Move the sysctl-related bits from include/linux/sched.h into
    a new file: include/linux/sched/sysctl.h. Then update source
    files requiring access to those bits by including the new
    header file.

    Signed-off-by: Clark Williams
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan
    Signed-off-by: Ingo Molnar

    Clark Williams
     

10 Jan, 2013

1 commit

  • IA64 defines /proc/sys/kernel/ignore-unaligned-usertrap to control
    verbose warnings on unaligned access emulation.

    Although the exact mechanics of what to do with sysctl (ignore/shout)
    are arch specific, this change enables the sysctl to be usable cross-arch.

    Signed-off-by: Vineet Gupta
    Cc: Fenghua Yu
    Cc: "Eric W. Biederman"
    Cc: Serge Hallyn
    Signed-off-by: Tony Luck

    Vineet Gupta
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for his workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

11 Dec, 2012

4 commits

  • The PTE scanning rate and fault rates are two of the biggest sources of
    system CPU overhead with automatic NUMA placement. Ideally a proper policy
    would detect if a workload was properly placed, schedule and adjust the
    PTE scanning rate accordingly. We do not track the necessary information
    to do that but we at least know if we migrated or not.

    This patch scans slower if a page was not migrated as the result of a
    NUMA hinting fault up to sysctl_numa_balancing_scan_period_max which is
    now higher than the previous default. Once every minute it will reset
    the scanner in case of phase changes.

    This is hilariously crude and the numbers are arbitrary. Workloads will
    converge quite slowly in comparison to what a proper policy should be able
    to do. On the plus side, we will chew up less CPU for workloads that have
    no need for automatic balancing.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • Add a 1 second delay before starting to scan the working set of
    a task and starting to balance it amongst nodes.

    [ note that before the constant per task WSS sampling rate patch
    the initial scan would happen much later still, in effect that
    patch caused this regression. ]

    The theory is that short-run tasks benefit very little from NUMA
    placement: they come and go, and they better stick to the node
    they were started on. As tasks mature and rebalance to other CPUs
    and nodes, so does their NUMA placement have to change and so
    does it start to matter more and more.

    In practice this change fixes an observable kbuild regression:

    # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]

    !NUMA:
    45.291088843 seconds time elapsed ( +- 0.40% )
    45.154231752 seconds time elapsed ( +- 0.36% )

    +NUMA, no slow start:
    46.172308123 seconds time elapsed ( +- 0.30% )
    46.343168745 seconds time elapsed ( +- 0.25% )

    +NUMA, 1 sec slow start:
    45.224189155 seconds time elapsed ( +- 0.25% )
    45.160866532 seconds time elapsed ( +- 0.17% )

    and it also fixes an observable perf bench (hackbench) regression:

    # perf stat --null --repeat 10 perf bench sched messaging

    -NUMA:               0.246225691 seconds time elapsed ( +- 1.31% )
    +NUMA no slow start: 0.252620063 seconds time elapsed ( +- 1.13% )
    +NUMA 1sec delay:    0.248076230 seconds time elapsed ( +- 1.35% )

    The implementation is simple and straightforward, most of the patch
    deals with adding the /proc/sys/kernel/numa_balancing_scan_delay_ms tunable
    knob.

    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    [ Wrote the changelog, ran measurements, tuned the default. ]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Peter Zijlstra
     
  • Previously, to probe the working set of a task, we'd use
    a very simple and crude method: mark all of its address
    space PROT_NONE.

    That method has various (obvious) disadvantages:

    - it samples the working set at dissimilar rates,
    giving some tasks a sampling quality advantage
    over others.

    - creates performance problems for tasks with very
    large working sets

    - over-samples processes with large address spaces but
    which only very rarely execute

    Improve that method by keeping a rotating offset into the
    address space that marks the current position of the scan,
    and advance it by a constant rate (in a CPU cycles execution
    proportional manner). If the offset reaches the last mapped
    address of the mm, it starts over at the first address.

    The per-task nature of the working set sampling functionality in this tree
    allows such constant rate, per task, execution-weight proportional sampling
    of the working set, with an adaptive sampling interval/frequency that
    goes from once per 100ms up to just once per 8 seconds. The current
    sampling volume is 256 MB per interval.

    As tasks mature and converge their working set, so does the
    sampling rate slow down to just a trickle, 256 MB per 8
    seconds of CPU time executed.

    This, beyond being adaptive, also rate-limits rarely
    executing systems and does not over-sample on overloaded
    systems.

    [ In AutoNUMA speak, this patch deals with the effective sampling
    rate of the 'hinting page fault'. AutoNUMA's scanning is
    currently rate-limited, but it is also fundamentally
    single-threaded, executing in the knuma_scand kernel thread,
    so the limit in AutoNUMA is global and does not scale up with
    the number of CPUs, nor does it scan tasks in an execution
    proportional manner.

    So the idea of rate-limiting the scanning was first implemented
    in the AutoNUMA tree via a global rate limit. This patch goes
    beyond that by implementing an execution rate proportional
    working set sampling rate that is not implemented via a single
    global scanning daemon. ]

    [ Dan Carpenter pointed out a possible NULL pointer dereference in the
    first version of this patch. ]

    Based-on-idea-by: Andrea Arcangeli
    Bug-Found-By: Dan Carpenter
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    [ Wrote changelog and fixed bug. ]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Peter Zijlstra
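
    A standalone sketch of the rotating scan window described above: each
    pass marks a fixed-size window (256MB by default) starting at a per-mm
    offset, then advances the offset and wraps at the end of the mapped
    range. Addresses are plain numbers here and the names are stand-ins:

      #include <stdio.h>

      #define SCAN_SIZE_MB 256UL

      struct mm_sketch {
          unsigned long task_size;        /* end of the mapped range, simplified */
          unsigned long numa_scan_offset; /* where the next pass starts */
      };

      static void task_numa_scan_pass(struct mm_sketch *mm)
      {
          unsigned long start = mm->numa_scan_offset;
          unsigned long end   = start + (SCAN_SIZE_MB << 20);

          if (start >= mm->task_size) {   /* reached the end: start over */
              start = 0;
              end = SCAN_SIZE_MB << 20;
          }
          if (end > mm->task_size)
              end = mm->task_size;

          printf("mark [%#lx, %#lx) prot_numa for hinting faults\n", start, end);
          mm->numa_scan_offset = end;
      }

      int main(void)
      {
          struct mm_sketch mm = { .task_size = 600UL << 20, .numa_scan_offset = 0 };

          for (int pass = 0; pass < 4; pass++)   /* 256MB, 256MB, 88MB, then wrap */
              task_numa_scan_pass(&mm);
          return 0;
      }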
     
  • NOTE: This patch is based on "sched, numa, mm: Add fault driven
    placement and migration policy" but as it throws away all the policy
    to just leave a basic foundation I had to drop the signed-offs-by.

    This patch creates a bare-bones method for setting PTEs pte_numa in the
    context of the scheduler; when faulted later, they will be faulted onto the
    node the CPU is running on. In itself this does nothing useful but any
    placement policy will fundamentally depend on receiving hints on placement
    from fault context and doing something intelligent about it.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel

    Peter Zijlstra