24 Jun, 2014

2 commits

  • A 'softlockup' is defined as a bug that causes the kernel to loop in
    kernel mode for more than a predefined period of time, without giving
    other tasks a chance to run.

    Currently, upon detection of this condition by the per-cpu watchdog
    task, debug information (including a stack trace) is sent to the system
    log.

    On some occasions, we have observed that the "victim" rather than the
    actual "culprit" (i.e. the owner/holder of the contended resource) is
    reported to the user. Often this information has proven to be
    insufficient to assist debugging efforts.

    To avoid loss of useful debug information, for architectures which
    support NMI, this patch makes it possible to improve soft lockup
    reporting. This is accomplished by issuing an NMI to each cpu to obtain
    a stack trace.

    If NMI is not supported, we fall back to the old method. A sysctl
    and a boot-time parameter are available to toggle this feature.

    [dzickus@redhat.com: add CONFIG_SMP in certain areas]
    [akpm@linux-foundation.org: additional CONFIG_SMP=n optimisations]
    [mq@suse.cz: fix warning]
    Signed-off-by: Aaron Tomlin
    Signed-off-by: Don Zickus
    Cc: David S. Miller
    Cc: Mateusz Guzik
    Cc: Oleg Nesterov
    Signed-off-by: Jan Moskyto Matejka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Tomlin
     
  • Oleg reports a division by zero error on zero-length write() to the
    percpu_pagelist_fraction sysctl:

    divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
    CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
    RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
    RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246
    RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
    RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
    R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
    R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
    FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
    Call Trace:
    proc_sys_call_handler+0xb3/0xc0
    proc_sys_write+0x14/0x20
    vfs_write+0xba/0x1e0
    SyS_write+0x46/0xb0
    tracesys+0xe1/0xe6

    However, if the percpu_pagelist_fraction sysctl is set by the user, it
    is also impossible to restore it to the kernel default since the user
    cannot write 0 to the sysctl.

    This patch allows the user to write 0 to restore the default behavior.
    It still requires a fraction equal to or larger than 8, however, as
    stated by the documentation for sanity. If a value in the range [1, 7]
    is written, the sysctl will return EINVAL.

    This successfully solves the divide by zero issue at the same time.

    Signed-off-by: David Rientjes
    Reported-by: Oleg Drokin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
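
The validation rule described in the commit above can be sketched in user space as follows. This is a minimal model, not the kernel code: `validate_fraction` and `pcp_high` are hypothetical names, inputs are assumed to be non-negative integers, and the minimum of 8 is taken from the commit message.

```python
import errno

# Minimum accepted non-zero fraction, per the commit message.
MIN_PERCPU_PAGELIST_FRACTION = 8


def validate_fraction(new_value, old_value):
    """Return (stored_value, error); error is 0 or errno.EINVAL."""
    if new_value != 0 and new_value < MIN_PERCPU_PAGELIST_FRACTION:
        # Values in [1, 7] are rejected and the previous setting is kept.
        return old_value, errno.EINVAL
    # 0 restores the default; anything >= 8 is accepted as-is.
    return new_value, 0


def pcp_high(managed_pages, fraction):
    """Per-cpu list high mark for a zone, or None when the default applies."""
    if fraction == 0:
        # The default sizing is used, so no division is attempted.
        return None
    return managed_pages // fraction
```

Treating 0 as "use the default" before any division is what removes the divide error: a zero-length write that leaves the value at 0 never reaches the division.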
     

20 Jun, 2014

1 commit

  • Pull sparc fixes from David Miller:
    "Sparc sparse fixes from Sam Ravnborg"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next: (67 commits)
    sparc64: fix sparse warnings in int_64.c
    sparc64: fix sparse warning in ftrace.c
    sparc64: fix sparse warning in kprobes.c
    sparc64: fix sparse warning in kgdb_64.c
    sparc64: fix sparse warnings in compat_audit.c
    sparc64: fix sparse warnings in init_64.c
    sparc64: fix sparse warnings in aes_glue.c
    sparc: fix sparse warnings in smp_32.c + smp_64.c
    sparc64: fix sparse warnings in perf_event.c
    sparc64: fix sparse warnings in kprobes.c
    sparc64: fix sparse warning in tsb.c
    sparc64: clean up compat_sigset_t.seta handling
    sparc64: fix sparse "Should it be static?" warnings in signal32.c
    sparc64: fix sparse warnings in sys_sparc32.c
    sparc64: fix sparse warning in pci.c
    sparc64: fix sparse warnings in smp_64.c
    sparc64: fix sparse warning in prom_64.c
    sparc64: fix sparse warning in btext.c
    sparc64: fix sparse warnings in sys_sparc_64.c + unaligned_64.c
    sparc64: fix sparse warning in process_64.c
    ...

    Conflicts:
    arch/sparc/include/asm/pgtable_64.h

    Linus Torvalds
     

13 Jun, 2014

1 commit

  • Pull networking updates from David Miller:

    1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.

    2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
    Benniston.

    3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
    Mork.

    4) BPF now has a "random" opcode, from Chema Gonzalez.

    5) Add more BPF documentation and improve test framework, from Daniel
    Borkmann.

    6) Support TCP fastopen over ipv6, from Daniel Lee.

    7) Add software TSO helper functions and use them to support software
    TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia.

    8) Support software TSO in fec driver too, from Nimrod Andy.

    9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.

    10) Handle broadcasts more gracefully over macvlan when there are large
    numbers of interfaces configured, from Herbert Xu.

    11) Allow more control over fwmark used for non-socket based responses,
    from Lorenzo Colitti.

    12) Do TCP congestion window limiting based upon measurements, from Neal
    Cardwell.

    13) Support busy polling in SCTP, from Neal Horman.

    14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.

    15) Bridge promisc mode handling improvements from Vlad Yasevich.

    16) Don't use inetpeer entries to implement ID generation any more, it
    performs poorly, from Eric Dumazet.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
    rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
    tcp: fixing TLP's FIN recovery
    net: fec: Add software TSO support
    net: fec: Add Scatter/gather support
    net: fec: Increase buffer descriptor entry number
    net: fec: Factorize feature setting
    net: fec: Enable IP header hardware checksum
    net: fec: Factorize the .xmit transmit function
    bridge: fix compile error when compiling without IPv6 support
    bridge: fix smatch warning / potential null pointer dereference
    via-rhine: fix full-duplex with autoneg disable
    bnx2x: Enlarge the dorq threshold for VFs
    bnx2x: Check for UNDI in uncommon branch
    bnx2x: Fix 1G-baseT link
    bnx2x: Fix link for KR with swapped polarity lane
    sctp: Fix sk_ack_backlog wrap-around problem
    net/core: Add VF link state control policy
    net/fsl: xgmac_mdio is dependent on OF_MDIO
    net/fsl: Make xgmac_mdio read error message useful
    net_sched: drr: warn when qdisc is not work conserving
    ...

    Linus Torvalds
     

07 Jun, 2014

4 commits

  • This typedef is unnecessary and should just be removed.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • When writing to a sysctl string, each write, regardless of VFS position,
    begins writing the string from the start. This means the contents of
    the last write to the sysctl controls the string contents instead of the
    first:

    open("/proc/sys/kernel/modprobe", O_WRONLY) = 1
    write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
    write(1, "/bin/true", 9) = 9
    close(1) = 0

    $ cat /proc/sys/kernel/modprobe
    /bin/true

    Expected behaviour would be to have the sysctl be "AAAA..." capped at
    maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
    contents of the second write. Similarly, multiple short writes would
    not append to the sysctl.

    The old behavior is unlike regular POSIX files enough that doing audits
    of software that interact with sysctls can end up in unexpected or
    dangerous situations. For example, "as long as the input starts with a
    trusted path" turns out to be an insufficient filter, as what must also
    happen is for the input to be entirely contained in a single write
    syscall -- not a common consideration, especially for high level tools.

    This provides kernel.sysctl_writes_strict as a way to make this behavior
    act in a less surprising manner for strings, and disallows non-zero file
    position when writing numeric sysctls (similar to what is already done
    when reading from non-zero file positions). For now, the default (0) is
    to warn about non-zero file position use, but retain the legacy
    behavior. Setting this to -1 disables the warning, and setting this to
    1 enables the file position respecting behavior.

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: move misplaced hunk, per Randy]
    Signed-off-by: Kees Cook
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
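
The difference between the legacy and the position-respecting behavior can be illustrated with a small user-space model. This is a sketch under assumptions: `apply_write` is a hypothetical helper, the file offset is passed in by the caller, and the choice to ignore a strict-mode write that starts past the end of the stored string is a simplification of the kernel logic.

```python
KMOD_PATH_LEN = 256  # maxlen quoted in the commit message


def apply_write(current, data, pos, maxlen, strict):
    """Return the sysctl string after one write() of `data` at offset `pos`."""
    if strict:
        if pos > len(current):
            # Write starts past the end of the stored string: ignore it.
            return current
        kept = current[:pos]
    else:
        # Legacy behavior: every write starts again from offset 0.
        kept = ""
    # Only the first line is used, and one byte is reserved for the terminator.
    line = data.split("\n", 1)[0]
    room = maxlen - 1 - len(kept)
    return kept + line[:room]


def replay(writes, strict, maxlen=KMOD_PATH_LEN):
    """Apply a sequence of writes the way a single open file descriptor would."""
    value, pos = "", 0
    for data in writes:
        value = apply_write(value, data, pos, maxlen, strict)
        pos += len(data)
    return value
```

Replaying the two writes from the trace above, `replay(["A" * 4096, "/bin/true"], strict=False)` yields `/bin/true`, while `strict=True` keeps the capped run of `A` characters. Two short writes such as `"/bin"` followed by `"/true"` are concatenated only in strict mode.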
     
  • Consolidate buffer length checking with new-line/end-of-line checking.
    Additionally, instead of reading user memory twice, just do the
    assignment during the loop.

    This change doesn't affect the potential races here. It was already
    possible to read a sysctl that was in the middle of a write. In both
    cases, the string will always be NULL terminated. The pre-existing race
    remains a problem to be solved.

    Signed-off-by: Kees Cook
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • When writing to a sysctl string, each write, regardless of VFS position,
    began writing the string from the start. This meant the contents of the
    last write to the sysctl controlled the string contents instead of the
    first.

    This misbehavior was featured in an exploit against Chrome OS. While
    it's not in itself a vulnerability, it's a weirdness that isn't on the
    mind of most auditors: "This filter looks correct, the first line
    written would not be meaningful to sysctl" doesn't apply here, since the
    size of the write and the contents of the final write are what matter
    when writing to sysctls.

    This adds the sysctl kernel.sysctl_writes_strict to control the write
    behavior. The default (0) reports when VFS position is non-0 on a
    write, but retains legacy behavior, -1 disables the warning, and 1
    enables the position-respecting behavior.

    The long-term plan here is to wait for userspace to be fixed in response
    to the new warning and to then switch the default kernel behavior to the
    new position-respecting behavior.

    This patch (of 4):

    The char buffer arguments are needlessly cast in weird places. Clean it
    up so things are easier to read.

    Signed-off-by: Kees Cook
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

05 Jun, 2014

1 commit

  • Pull x86 vdso updates from Peter Anvin:
    "Vdso cleanups and improvements largely from Andy Lutomirski. This
    makes the vdso a lot less 'special'"

    * 'x86/vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/vdso, build: Make LE access macros clearer, host-safe
    x86/vdso, build: Fix cross-compilation from big-endian architectures
    x86/vdso, build: When vdso2c fails, unlink the output
    x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
    x86, mm: Replace arch_vma_name with vm_ops->name for vsyscalls
    x86, mm: Improve _install_special_mapping and fix x86 vdso naming
    mm, fs: Add vm_ops->name as an alternative to arch_vma_name
    x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
    x86, vdso: Remove vestiges of VDSO_PRELINK and some outdated comments
    x86, vdso: Move the vvar and hpet mappings next to the 64-bit vDSO
    x86, vdso: Move the 32-bit vdso special pages after the text
    x86, vdso: Reimplement vdso.so preparation in build-time C
    x86, vdso: Move syscall and sysenter setup into kernel/cpu/common.c
    x86, vdso: Clean up 32-bit vs 64-bit vdso params
    x86, mm: Ensure correct alignment of the fixmap

    Linus Torvalds
     

19 May, 2014

1 commit

  • Fix following warning:
    tsb.c:290:5: warning: symbol 'sysctl_tsb_ratio' was not declared. Should it be static?

    Add extern declaration in asm/setup.h and remove local declaration
    in kernel/sysctl.c

    Signed-off-by: Sam Ravnborg
    Signed-off-by: David S. Miller

    Sam Ravnborg
     

15 May, 2014

1 commit

  • ip_local_port_range is already per netns, so should ip_local_reserved_ports
    be. And since it is none by default we don't actually need it when we don't
    enable CONFIG_SYSCTL.

    By the way, rename inet_is_reserved_local_port() to inet_is_local_reserved_port()

    Cc: "David S. Miller"
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     

06 May, 2014

1 commit


26 Apr, 2014

1 commit


08 Apr, 2014

1 commit

  • As sysctl_hung_task_timeout_sec is unsigned long, when this value is
    larger than LONG_MAX/HZ, the function schedule_timeout_interruptible in
    watchdog will return immediately without sleeping and print:

    schedule_timeout: wrong timeout value ffffffffffffff83

    and then the function watchdog will call schedule_timeout_interruptible
    again and again. The screen will be filled with

    "schedule_timeout: wrong timeout value ffffffffffffff83"

    This patch adds a check and correction in sysctl, so that the function
    schedule_timeout_interruptible always gets a valid parameter.

    Signed-off-by: Liu Hua
    Tested-by: Satoru Takeuchi
    Cc: [3.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liu Hua
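
The overflow described above comes down to simple arithmetic: the timeout in seconds is multiplied by HZ and handed to a function that takes a signed long. A minimal sketch, assuming a 64-bit long and an illustrative HZ of 1000 (the real value is a build-time option); `as_signed_long` and `timeout_is_safe` are hypothetical names:

```python
LONG_MAX = 2**63 - 1  # assumes a 64-bit signed long
HZ = 1000             # assumed tick rate, for illustration only


def as_signed_long(value):
    """Reinterpret an unsigned 64-bit quantity as a signed long."""
    value &= 2**64 - 1
    return value - 2**64 if value > LONG_MAX else value


def timeout_is_safe(timeout_secs):
    """True when timeout_secs * HZ still fits in a signed long."""
    return timeout_secs <= LONG_MAX // HZ
```

Read as a signed long, the value in the log line above, `ffffffffffffff83`, is -125, which is why schedule_timeout rejects it as a wrong timeout value.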
     

04 Apr, 2014

1 commit

  • There is plenty of anecdotal evidence and a load of blog posts
    suggesting that using "drop_caches" periodically keeps your system
    running in "tip top shape". Perhaps adding some kernel documentation
    will increase the amount of accurate data on its use.

    If we are not shrinking caches effectively, then we have real bugs.
    Using drop_caches will simply mask the bugs and make them harder to
    find, but certainly does not fix them, nor is it an appropriate
    "workaround" to limit the size of the caches. On the contrary, there
    have been bug reports on issues that turned out to be misguided use of
    cache dropping.

    Dropping caches is a very drastic and disruptive operation that is good
    for debugging and running tests, but if it creates bug reports from
    production use, kernel developers should be aware of its use.

    Add a bit more documentation about it, a syslog message to track down
    abusers, and vmstat drop counters to help analyze problem reports.

    [akpm@linux-foundation.org: checkpatch fixes]
    [hannes@cmpxchg.org: add runtime suppression control]
    Signed-off-by: Dave Hansen
    Signed-off-by: Michal Hocko
    Acked-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

02 Apr, 2014

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the pull request for the core block IO bits for the 3.15
    kernel. It's a smaller round this time, it contains:

    - Various little blk-mq fixes and additions from Christoph and
    myself.

    - Cleanup of the IPI usage from the block layer, and associated
    helper code. From Frederic Weisbecker and Jan Kara.

    - Duplicate code cleanup in bio-integrity from Gu Zheng. This will
    give you a merge conflict, but that should be easy to resolve.

    - blk-mq notify spinlock fix for RT from Mike Galbraith.

    - A blktrace partial accounting bug fix from Roman Pen.

    - Missing REQ_SYNC detection fix for blk-mq from Shaohua Li"

    * 'for-3.15/core' of git://git.kernel.dk/linux-block: (25 commits)
    blk-mq: add REQ_SYNC early
    rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock
    blk-mq: support partial I/O completions
    blk-mq: merge blk_mq_insert_request and blk_mq_run_request
    blk-mq: remove blk_mq_alloc_rq
    blk-mq: don't dump CPU -> hw queue map on driver load
    blk-mq: fix wrong usage of hctx->state vs hctx->flags
    blk-mq: allow blk_mq_init_commands() to return failure
    block: remove old blk_iopoll_enabled variable
    blktrace: fix accounting of partially completed requests
    smp: Rename __smp_call_function_single() to smp_call_function_single_async()
    smp: Remove wait argument from __smp_call_function_single()
    watchdog: Simplify a little the IPI call
    smp: Move __smp_call_function_single() below its safe version
    smp: Consolidate the various smp_call_function_single() declensions
    smp: Teach __smp_call_function_single() to check for offline cpus
    smp: Remove unused list_head from csd
    smp: Iterate functions through llist_for_each_entry_safe()
    block: Stop abusing rq->csd.list in blk-softirq
    block: Remove useless IPI struct initialization
    ...

    Linus Torvalds
     

13 Mar, 2014

1 commit

  • This was a debugging measure to toggle enabled/disabled
    when testing. But for real production setups, it's not
    safe to toggle this setting without either reloading
    drivers or quiescing IO first. Neither of which the toggle
    enforces.

    Additionally, it makes drivers deal with the conditional
    state.

    Remove it completely. It's up to the driver whether iopoll
    is enabled or not.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

02 Feb, 2014

1 commit


01 Feb, 2014

1 commit

  • Pull core debug changes from Ingo Molnar:
    "This contains mostly kernel debugging related updates:

    - make hung_task detection more configurable to distros
    - add final bits for x86 UV NMI debugging, with related KGDB changes
    - update the mailing-list of MAINTAINERS entries I'm involved with"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    hung_task: Display every hung task warning
    sysctl: Add neg_one as a standard constraint
    x86/uv/nmi, kgdb/kdb: Fix UV NMI handler when KDB not configured
    x86/uv/nmi: Fix Sparse warnings
    kgdb/kdb: Fix no KDB config problem
    MAINTAINERS: Restore "L: linux-kernel@vger.kernel.org" entries

    Linus Torvalds
     

28 Jan, 2014

1 commit

  • Excessive migration of pages can hurt the performance of workloads
    that span multiple NUMA nodes. However, it turns out that the
    p->numa_migrate_deferred knob is a really big hammer, which does
    reduce migration rates, but does not actually help performance.

    Now that the second stage of the automatic numa balancing code
    has stabilized, it is time to replace the simplistic migration
    deferral code with something smarter.

    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Peter Zijlstra
    Cc: Chegu Vinod
    Link: http://lkml.kernel.org/r/1390860228-21539-2-git-send-email-riel@redhat.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

25 Jan, 2014

2 commits

  • When khungtaskd detects hung tasks, it prints out
    backtraces from a number of those tasks.

    Limiting the number of backtraces being printed
    out can result in the user not seeing the information
    necessary to debug the issue. The hung_task_warnings
    sysctl controls this feature.

    This patch makes it possible for hung_task_warnings
    to accept a special value to print an unlimited
    number of backtraces when khungtaskd detects hung
    tasks.

    The special value is -1. To use this value it is
    necessary to change types from ulong to int.

    Signed-off-by: Aaron Tomlin
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: oleg@redhat.com
    Link: http://lkml.kernel.org/r/1390239253-24030-3-git-send-email-atomlin@redhat.com
    [ Build warning fix. ]
    Signed-off-by: Ingo Molnar

    Aaron Tomlin
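
The effect of the special value can be sketched as a small counter rule. This is an illustrative model, not the kernel source; `should_report` is a hypothetical name.

```python
def should_report(warnings_left):
    """Return (report, new_warnings_left) for one detected hung task."""
    if warnings_left == 0:
        # Budget exhausted: stay silent.
        return False, 0
    if warnings_left > 0:
        # Finite budget: report and use up one warning.
        return True, warnings_left - 1
    # Negative value (-1): report and never decrement, so it is unlimited.
    return True, warnings_left
```

A signed type is needed for this to work, since an unsigned counter cannot hold -1 as a distinct "unlimited" marker.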
     
  • Add neg_one to the list of standard constraints - will be used by the next patch.

    Signed-off-by: Aaron Tomlin
    Acked-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: oleg@redhat.com
    Link: http://lkml.kernel.org/r/1390239253-24030-2-git-send-email-atomlin@redhat.com
    Signed-off-by: Ingo Molnar

    Aaron Tomlin
     

24 Jan, 2014

2 commits

  • For general-purpose (i.e. distro) kernel builds it makes sense to build
    with CONFIG_KEXEC to allow end users to choose what kind of things they
    want to do with kexec. However, in the face of trying to lock down a
    system with such a kernel, there needs to be a way to disable kexec_load
    (much like module loading can be disabled). Without this, it is too easy
    for the root user to modify kernel memory even when CONFIG_STRICT_DEVMEM
    and modules_disabled are set. With this change, it is still possible to
    load an image for use later, then disable kexec_load so the image (or lack
    of image) can't be altered.

    The intention is for using this in environments where "perfect"
    enforcement is hard. Without a verified boot, along with verified
    modules, and along with verified kexec, this is trying to give a system a
    better chance to defend itself (or at least grow the window of
    discoverability) against attack in the face of a privilege escalation.

    In my mind, I consider several boot scenarios:

    1) Verified boot of read-only verified root fs loading fd-based
    verification of kexec images.
    2) Secure boot of writable root fs loading signed kexec images.
    3) Regular boot loading kexec (e.g. kcrash) image early and locking it.
    4) Regular boot with no control of kexec image at all.

    1 and 2 don't exist yet, but will soon once the verified kexec series has
    landed. 4 is the state of things now. The gap between 2 and 4 is too
    large, so this change creates scenario 3, a middle-ground above 4 when 2
    and 1 are not possible for a system.

    Signed-off-by: Kees Cook
    Acked-by: Rik van Riel
    Cc: Vivek Goyal
    Cc: Eric Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
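
Scenario 3 (load an image early, then lock it) can be modeled as a one-way toggle. This is a sketch under assumptions: the class and function names are hypothetical, and the rule that the toggle can only move from 0 to 1 is taken from the sysctl's documentation rather than from the message above.

```python
import errno


class KexecLoadDisabled:
    """User-space model of a toggle that can be set but never cleared."""

    def __init__(self):
        self.value = 0

    def write(self, new_value):
        # Only the value 1 is accepted, so the toggle cannot return to 0.
        if new_value != 1:
            return errno.EINVAL
        self.value = 1
        return 0


def kexec_load_allowed(toggle, is_privileged):
    """Loading an image needs privilege and an unset toggle."""
    return is_privileged and toggle.value == 0
```

An early-boot script would load the crash image first and then write 1; from that point on, even a privileged caller can no longer replace the image.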
     
  • Add a working sysctl to enable/disable automatic numa memory balancing
    at runtime.

    This allows us to track down performance problems with this feature and
    is generally a good idea.

    This was possible earlier through debugfs, but only with special
    debugging options set. Also fix the boot message.

    [akpm@linux-foundation.org: s/sched_numa_balancing/sysctl_numa_balancing/]
    Signed-off-by: Andi Kleen
    Acked-by: Mel Gorman
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

22 Jan, 2014

1 commit

  • Some applications that run on HPC clusters are designed around the
    availability of RAM and the overcommit ratio is fine tuned to get the
    maximum usage of memory without swapping. With growing memory, the
    1%-of-all-RAM grain provided by overcommit_ratio has become too coarse
    for these workloads (on a 2TB machine it represents no less than 20GB).

    This patch adds the new overcommit_kbytes sysctl variable that allows a
    much finer grain.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Jerome Marchand
    Cc: Dave Hansen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
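
The granularity argument can be checked with a few lines of arithmetic. This is a simplified sketch: the function names are hypothetical, all sizes are in kilobytes, and adjustments such as huge pages are ignored.

```python
def limit_from_ratio(ram_kb, swap_kb, ratio_percent):
    """Commit limit when tuning with a whole-percent ratio."""
    return ram_kb * ratio_percent // 100 + swap_kb


def limit_from_kbytes(swap_kb, overcommit_kb):
    """Commit limit when tuning with an absolute kilobyte amount."""
    return overcommit_kb + swap_kb


def ratio_step_kb(ram_kb):
    """Smallest change the percent-based knob can express."""
    return ram_kb // 100
```

For a machine with 2 TiB of RAM, `ratio_step_kb(2 * 1024**3)` is a little over 20 GiB, matching the figure quoted above, whereas the kilobyte-based knob can move the limit in 1 kB steps.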
     

13 Jan, 2014

2 commits

  • Remove the deadline specific sysctls for now. The problem with them is
    that the interaction with the existing rt knobs is nearly impossible
    to get right.

    The current (as per before this patch) situation is that the rt and dl
    bandwidth is completely separate and we enforce rt+dl < 100%. This is
    undesirable because this means that the rt default of 95% leaves us
    hardly any room, even though dl tasks are safer than rt tasks.

    Another proposed solution (a discarded patch) was to have the dl
    bandwidth be a fraction of the rt bandwidth. This is highly
    confusing imo.

    Furthermore neither proposal is consistent with the situation we
    actually want; which is rt tasks run from a dl server. In which case
    the rt bandwidth is a direct subset of dl.

    So whichever way we go, the introduction of dl controls at this point
    is painful. Therefore remove them and instead share the rt budget.

    This means that for now the rt knobs are used for dl admission control
    and the dl runtime is accounted against the rt runtime. I realise that
    this isn't entirely desirable either; but whatever we do we appear to
    need to change the interface later, so better have a small interface
    for now.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-zpyqbqds1r0vyxtxza1e7rdc@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In order for deadline scheduling to be effective and useful, it is
    important to have some method of keeping the allocation of the
    available CPU bandwidth to tasks and task groups under control.
    This is usually called "admission control" and if it is not performed
    at all, no guarantee can be given on the actual scheduling of the
    -deadline tasks.

    Since RT-throttling was introduced, each task group has had a
    bandwidth associated with it, calculated as a certain amount of
    runtime over a period. Moreover, to make it possible to manipulate
    such bandwidth, readable/writable controls have been added to both
    procfs (for system wide settings) and cgroupfs (for per-group
    settings).

    Therefore, the same interface is being used for controlling the
    bandwidth distribution to -deadline tasks and task groups, i.e.,
    new controls but with similar names, equivalent meaning and with
    the same usage paradigm are added.

    However, more discussion is needed in order to figure out how
    we want to manage SCHED_DEADLINE bandwidth at the task group level.
    Therefore, this patch adds a less sophisticated, but actually
    very sensible, mechanism to ensure that a certain utilization
    cap is not overcome per each root_domain (the single rq for !SMP
    configurations).

    Another main difference between deadline bandwidth management and
    RT-throttling is that -deadline tasks have bandwidth of their own
    (while -rt ones don't!), and thus we don't need a higher level
    throttling mechanism to enforce the desired bandwidth.

    This patch, therefore:

    - adds system wide deadline bandwidth management by means of:
    * /proc/sys/kernel/sched_dl_runtime_us,
    * /proc/sys/kernel/sched_dl_period_us,
    that determine (i.e., runtime / period) the total bandwidth
    available on each CPU of each root_domain for -deadline tasks;

    - couples the RT and deadline bandwidth management, i.e., enforces
    that the sum of the bandwidth devoted to -rt and -deadline tasks
    stays below 100%.

    This means that, for a root_domain comprising M CPUs, -deadline tasks
    can be created as long as the sum of their bandwidths stays below:

    M * (sched_dl_runtime_us / sched_dl_period_us)

    It is also possible to disable this bandwidth management logic, and
    thus be free to oversubscribe the system up to any arbitrary level.

    Signed-off-by: Dario Faggioli
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Dario Faggioli
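
The admission test described in this commit, with total bandwidth bounded by M * (sched_dl_runtime_us / sched_dl_period_us), can be sketched as below. This is a user-space illustration only: the function names are hypothetical, `None` is used here as a stand-in for "bandwidth management disabled", and admitting a task that lands exactly on the cap is an assumption about the boundary case.

```python
from fractions import Fraction


def dl_capacity(cpus, dl_runtime_us, dl_period_us):
    """Total -deadline bandwidth available to a root_domain of `cpus` CPUs."""
    return cpus * Fraction(dl_runtime_us, dl_period_us)


def admit(existing, candidate, cpus, dl_runtime_us, dl_period_us):
    """Decide whether `candidate` fits next to the `existing` tasks.

    Tasks are (runtime_us, period_us) pairs; each contributes
    runtime / period to the bandwidth already in use.
    """
    if dl_runtime_us is None:
        # Admission control disabled: oversubscription is allowed.
        return True
    used = sum((Fraction(r, p) for r, p in existing), Fraction(0))
    needed = Fraction(candidate[0], candidate[1])
    return used + needed <= dl_capacity(cpus, dl_runtime_us, dl_period_us)
```

With two CPUs and a 950000/1000000 setting the cap is 1.9. Three tasks at 0.5 each leave room for a fourth task at 0.3, but not for another at 0.5. Exact fractions are used so that the comparison is not affected by floating-point rounding.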
     

17 Dec, 2013

1 commit

  • Commit 887c290e (sched/numa: Decide whether to favour task or group weights
    based on swap candidate relationships) dropped the check against
    sysctl_numa_balancing_settle_count; this patch removes the sysctl.

    Signed-off-by: Wanpeng Li
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Naoya Horiguchi
    Link: http://lkml.kernel.org/r/1386833006-6600-1-git-send-email-liwanp@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     

14 Nov, 2013

1 commit

  • Pull core locking changes from Ingo Molnar:
    "The biggest changes:

    - add lockdep support for seqcount/seqlocks structures, this
    unearthed both bugs and required extra annotation.

    - move the various kernel locking primitives to the new
    kernel/locking/ directory"

    * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
    block: Use u64_stats_init() to initialize seqcounts
    locking/lockdep: Mark __lockdep_count_forward_deps() as static
    lockdep/proc: Fix lock-time avg computation
    locking/doc: Update references to kernel/mutex.c
    ipv6: Fix possible ipv6 seqlock deadlock
    cpuset: Fix potential deadlock w/ set_mems_allowed
    seqcount: Add lockdep functionality to seqcount/seqlock structures
    net: Explicitly initialize u64_stats_sync structures for lockdep
    locking: Move the percpu-rwsem code to kernel/locking/
    locking: Move the lglocks code to kernel/locking/
    locking: Move the rwsem code to kernel/locking/
    locking: Move the rtmutex code to kernel/locking/
    locking: Move the semaphore core to kernel/locking/
    locking: Move the spinlock code to kernel/locking/
    locking: Move the lockdep code to kernel/locking/
    locking: Move the mutex code to kernel/locking/
    hung_task debugging: Add tracepoint to report the hang
    x86/locking/kconfig: Update paravirt spinlock Kconfig description
    lockstat: Report avg wait and hold times
    lockdep, x86/alternatives: Drop ancient lockdep fixup message
    ...

    Linus Torvalds
     

13 Nov, 2013

1 commit


12 Nov, 2013

2 commits

  • Pull scheduler changes from Ingo Molnar:
    "The main changes in this cycle are:

    - (much) improved CONFIG_NUMA_BALANCING support from Mel Gorman, Rik
    van Riel, Peter Zijlstra et al. Yay!

    - optimize preemption counter handling: merge the NEED_RESCHED flag
    into the preempt_count variable, by Peter Zijlstra.

    - wait.h fixes and code reorganization from Peter Zijlstra

    - cfs_bandwidth fixes from Ben Segall

    - SMP load-balancer cleanups from Peter Zijlstra

    - idle balancer improvements from Jason Low

    - other fixes and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
    ftrace, sched: Add TRACE_FLAG_PREEMPT_RESCHED
    stop_machine: Fix race between stop_two_cpus() and stop_cpus()
    sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus
    sched: Fix asymmetric scheduling for POWER7
    sched: Move completion code from core.c to completion.c
    sched: Move wait code from core.c to wait.c
    sched: Move wait.c into kernel/sched/
    sched/wait: Fix __wait_event_interruptible_lock_irq_timeout()
    sched: Avoid throttle_cfs_rq() racing with period_timer stopping
    sched: Guarantee new group-entities always have weight
    sched: Fix hrtimer_cancel()/rq->lock deadlock
    sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining
    sched: Fix race on toggling cfs_bandwidth_used
    sched: Remove extra put_online_cpus() inside sched_setaffinity()
    sched/rt: Fix task_tick_rt() comment
    sched/wait: Fix build breakage
    sched/wait: Introduce prepare_to_wait_event()
    sched/wait: Add ___wait_cond_timeout() to wait_event*_timeout() too
    sched: Remove get_online_cpus() usage
    sched: Fix race in migrate_swap_stop()
    ...

    Linus Torvalds
     
  • Pull perf updates from Ingo Molnar:
    "As a first remark I'd like to note that the way to build perf tooling
    has been simplified and sped up, in the future it should be enough for
    you to build perf via:

    cd tools/perf/
    make install

    (i.e. without the -j option.) The build system will figure out the
    number of CPUs and will do a parallel build+install.

    The various build system inefficiencies and breakages Linus reported
    against the v3.12 pull request should now be resolved - please
    (re-)report any remaining annoyances or bugs.

    Main changes on the perf kernel side:

    * Performance optimizations:
    . perf ring-buffer code optimizations, by Peter Zijlstra
    . perf ring-buffer code optimizations, by Oleg Nesterov
    . x86 NMI call-stack processing optimizations, by Peter Zijlstra
    . perf context-switch optimizations, by Peter Zijlstra
    . perf sampling speedups, by Peter Zijlstra
    . x86 Intel PEBS processing speedups, by Peter Zijlstra

    * Enhanced hardware support:
    . for Intel Ivy Bridge-EP uncore PMUs, by Zheng Yan
    . for Haswell transactions, by Andi Kleen, Peter Zijlstra

    * Core perf events code enhancements and fixes by Oleg Nesterov:
    . for uprobes, if fork() is called with pending ret-probes
    . for uprobes platform support code

    * New ABI details by Andi Kleen:
    . Report x86 Haswell TSX transaction abort cost as weight

    Main changes on the perf tooling side (some of these tooling changes
    utilize the above kernel side changes):

    * 'perf report/top' enhancements:

    . Convert callchain children list to rbtree, greatly reducing the
    time taken for callchain processing, from Namhyung Kim.

    . Add new COMM infrastructure, further improving histogram
    processing, from Frédéric Weisbecker, one fix from Namhyung Kim.

    . Add /proc/kcore based live-annotation improvements, including
    build-id cache support, multi map 'call' instruction navigation
    fixes, kcore address validation, objdump workarounds. From
    Adrian Hunter.

    . Show progress on histogram collapsing, that can take a long
    time, from Namhyung Kim.

    . Add --max-stack option to limit callchain stack scan in 'top'
    and 'report', improving callchain processing when reducing the
    stack depth is an option, from Waiman Long.

    . Add new option --ignore-vmlinux for perf top, from Willy
    Tarreau.

    * 'perf trace' enhancements:

    . 'perf trace' can now use 'perf probe' dynamic tracepoints
    to hook into the userspace -> kernel pathname copy so that it
    can map fds to pathnames without reading /proc/pid/fd/ symlinks.
    From Arnaldo Carvalho de Melo.

    . Show VFS path associated with fd in live sessions, using a
    'vfs_getname' 'perf probe' created dynamic tracepoint or by
    looking at /proc/pid/fd, from Arnaldo Carvalho de Melo.

    . Add 'trace' beautifiers for lots of syscall arguments, from
    Arnaldo Carvalho de Melo.

    . Implement more compact 'trace' output by suppressing zeroed
    args, from Arnaldo Carvalho de Melo.

    . Show thread COMM by default in 'trace', from Arnaldo Carvalho de
    Melo.

    . Add option to show full timestamp in 'trace', from David Ahern.

    . Add 'record' command in 'trace', to record raw_syscalls:*, from
    David Ahern.

    . Add summary option to dump syscall statistics in 'trace', from
    David Ahern.

    . Improve error messages in 'trace', providing hints about system
    configuration steps needed for using it, from Ramkumar
    Ramachandra.

    . 'perf trace' now emits hints as to why tracing is not possible,
    helping the user to set up the system to allow tracing in the
    desired permission granularity, telling if the problem is due to
    debugfs not being mounted or with not enough permission for
    !root, /proc/sys/kernel/perf_event_paranoid value, etc. From
    Arnaldo Carvalho de Melo.

    * 'perf record' enhancements:

    . Check maximum frequency rate for record/top, emitting better
    error messages, from Jiri Olsa.

    . 'perf record' code cleanups, from David Ahern.

    . Improve write_output error message in 'perf record', from Adrian
    Hunter.

    . Allow specifying B/K/M/G unit to the --mmap-pages arguments,
    from Jiri Olsa.

    . Fix command line callchain attribute tests to handle the new
    -g/--call-chain semantics, from Arnaldo Carvalho de Melo.

    * 'perf kvm' enhancements:

    . Disable live kvm command if timerfd is not supported, from David
    Ahern.

    . Fix detection of non-core features, from David Ahern.

    * 'perf list' enhancements:

    . Add usage to 'perf list', from David Ahern.

    . Show error in 'perf list' if tracepoints not available, from
    Pekka Enberg.

    * 'perf probe' enhancements:

    . Support "$vars" meta argument syntax for local variables,
    allowing asking for all possible variables at a given probe
    point to be collected when it hits, from Masami Hiramatsu.

    * 'perf sched' enhancements:

    . Address the root cause of the 'perf sched' stack initialization
    build slowdown, by programmatically setting a big array after
    moving the global variable back to the stack. Fix from Adrian
    Hunter.

    * 'perf script' enhancements:

    . Set up output options for in-stream attributes, from Adrian
    Hunter.

    . Print addr by default for BTS in 'perf script', from Adrian
    Hunter.

    * 'perf stat' enhancements:

    . Improved messages when doing profiling in all or a subset of
    CPUs using a workload as the session delimiter, as in:

    'perf stat --cpu 0,2 sleep 10s'

    from Arnaldo Carvalho de Melo.

    . Add units to nanosec-based counters in 'perf stat', from David
    Ahern.

    . Remove bogus info when using 'perf stat' -e cycles/instructions,
    from Ramkumar Ramachandra.

    * 'perf lock' enhancements:

    . 'perf lock' fixes and cleanups, from Davidlohr Bueso.

    * 'perf test' enhancements:

    . Fixup PERF_SAMPLE_TRANSACTION handling in sample synthesizing
    and 'perf test', from Adrian Hunter.

    . Clarify the "sample parsing" test entry, from Arnaldo Carvalho
    de Melo.

    . Consider PERF_SAMPLE_TRANSACTION in the "sample parsing" test,
    from Arnaldo Carvalho de Melo.

    . Memory leak fixes in 'perf test', from Felipe Pena.

    * 'perf bench' enhancements:

    . Change the procps visible command-name of individual benchmark
    tests plus cleanups, from Ingo Molnar.

    * Generic perf tooling infrastructure/plumbing changes:

    . Separating data file properties from session, code
    reorganization from Jiri Olsa.

    . Fix version when building out of tree, as when using one of
    these:

    $ make help | grep perf
    perf-tar-src-pkg - Build perf-3.12.0.tar source tarball
    perf-targz-src-pkg - Build perf-3.12.0.tar.gz source tarball
    perf-tarbz2-src-pkg - Build perf-3.12.0.tar.bz2 source tarball
    perf-tarxz-src-pkg - Build perf-3.12.0.tar.xz source tarball
    $

    from David Ahern.

    . Enhance option parse error message, showing just the help lines
    of the options affected, from Namhyung Kim.

    . libtraceevent updates from upstream trace-cmd repo, from Steven
    Rostedt.

    . Always use perf_evsel__set_sample_bit to set sample_type, from
    Adrian Hunter.

    . Memory and mmap leak fixes from Chenggang Qin.

    . Assorted build fixes from David Ahern and Jiri Olsa.

    . Speed up and prettify the build system, from Ingo Molnar.

    . Implement addr2line directly using libbfd, from Roberto Vitillo.

    . Move the GTK support into a separate libperf-gtk.so DSO that
    is only loaded when --gtk is specified, from Namhyung Kim.

    . perf bash completion fixes and improvements from Ramkumar
    Ramachandra.

    . Support for Openembedded/Yocto -dbg packages, from Ricardo
    Ribalda Delgado.

    And lots and lots of other fixes and code reorganizations that did not
    make it into the list, see the shortlog, diffstat and the Git log for
    details!"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (300 commits)
    uprobes: Fix the memory out of bound overwrite in copy_insn()
    uprobes: Fix the wrong usage of current->utask in uprobe_copy_process()
    perf tools: Remove unneeded include
    perf record: Remove post_processing_offset variable
    perf record: Remove advance_output function
    perf record: Refactor feature handling into a separate function
    perf trace: Don't relookup fields by name in each sample
    perf tools: Fix version when building out of tree
    perf evsel: Ditch evsel->handler.data field
    uprobes: Export write_opcode() as uprobe_write_opcode()
    uprobes: Introduce arch_uprobe->ixol
    uprobes: Kill module_init() and module_exit()
    uprobes: Move function declarations out of arch
    perf/x86/intel: Add Ivy Bridge-EP uncore IRP box support
    perf/x86/intel/uncore: Add filter support for IvyBridge-EP QPI boxes
    perf: Factor out strncpy() in perf_event_mmap_event()
    tools/perf: Add required memory barriers
    perf: Fix arch_perf_out_copy_user default
    perf: Update a stale comment
    perf: Optimize perf_output_begin() -- address calculation
    ...

    Linus Torvalds
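
The 'perf record' item above about B/K/M/G units for --mmap-pages can be illustrated with a small parser. This is a minimal sketch, not the perf implementation: `parse_size()` is a hypothetical helper, and it assumes that a suffixed value denotes a size with K, M and G meaning powers of 1024. How a bare number without a unit is interpreted is deliberately left out, so the sketch rejects it.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse "<digits><B|K|M|G>" into a byte count.
 * Returns 0 on success and -1 on any malformed or overflowing input.
 */
static int parse_size(const char *s, unsigned long long *bytes)
{
	char *end;
	unsigned long long val, mult;

	/* strtoull() would accept a sign or leading spaces; require a digit. */
	if (*s < '0' || *s > '9')
		return -1;

	errno = 0;
	val = strtoull(s, &end, 10);
	if (errno)
		return -1;

	switch (*end) {
	case 'B': mult = 1ULL;       break;
	case 'K': mult = 1ULL << 10; break;
	case 'M': mult = 1ULL << 20; break;
	case 'G': mult = 1ULL << 30; break;
	default:
		return -1;	/* missing or unknown unit */
	}

	/* Nothing may follow the unit character. */
	if (end[1] != '\0')
		return -1;

	/* Refuse values whose byte count would not fit. */
	if (val > ULLONG_MAX / mult)
		return -1;

	*bytes = val * mult;
	return 0;
}
```

A caller would pass the option argument, for example `parse_size("512K", &bytes)`, and convert the resulting byte count to whatever unit it needs.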
     

06 Nov, 2013

1 commit


17 Oct, 2013

1 commit


09 Oct, 2013

3 commits

  • Shared faults can lead to lots of unnecessary page migrations,
    slowing down the system, and causing private faults to hit the
    per-pgdat migration ratelimit.

    This patch adds sysctl numa_balancing_migrate_deferred, which specifies
    how many shared page migrations to skip unconditionally, after each page
    migration that is skipped because it is a shared fault.

    This reduces the number of page migrations back and forth in
    shared fault situations. It also gives a strong preference to
    the tasks that are already running where most of the memory is,
    and to moving the other tasks nearer to that memory.

    Testing this with a much higher scan rate than the default
    still seems to result in fewer page migrations than before.

    Memory seems to be somewhat better consolidated than previously,
    with multi-instance specjbb runs on a 4 node system.

    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-62-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • With scan rate adaptations based on whether the workload has properly
    converged or not, there should be no need for the scan period reset
    hammer. Get rid of it.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-60-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • This patch favours moving tasks towards the NUMA node that recorded a
    higher number of NUMA faults during active load balancing. Ideally this is
    self-reinforcing: the longer the task runs on that node, the more faults
    it should incur, causing task_numa_placement to keep the task running on
    that node. In reality a big weakness is that the node's CPUs can be
    overloaded, and it would be more efficient to queue tasks on an idle node
    and migrate to the new node. This would require additional smarts in the
    balancer, so for now the balancer will simply prefer to place the task on
    the preferred node for a number of PTE scans, which is controlled by the
    numa_balancing_settle_count sysctl. Once the settle_count number of scans
    has completed, the scheduler is free to place the task on an alternative
    node if the load is imbalanced.

    [srikar@linux.vnet.ibm.com: Fixed statistics]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    [ Tunable and use higher faults instead of preferred. ]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-23-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
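
Two of the tunables described in the commits of this date, numa_balancing_migrate_deferred and numa_balancing_settle_count, amount to small counting rules. The sketch below models them in user space under stated assumptions: `struct numa_task`, its fields and both functions are hypothetical names chosen for illustration, and the logic follows only the wording of the commit messages, not the kernel source.

```c
#include <stdbool.h>

/* Hypothetical per-task state; field names are illustrative only. */
struct numa_task {
	int deferred_left;	/* shared migrations still to be skipped */
	int scans_on_preferred;	/* PTE scans completed since placement */
};

/*
 * Deferral rule: once a migration has been skipped because the fault was
 * shared, skip the next 'migrate_deferred' shared migrations
 * unconditionally.
 */
static bool allow_shared_migration(struct numa_task *t, bool skipped_as_shared,
				   int migrate_deferred)
{
	if (t->deferred_left > 0) {
		t->deferred_left--;
		return false;
	}
	if (skipped_as_shared) {
		t->deferred_left = migrate_deferred;
		return false;
	}
	return true;
}

/*
 * Settling rule: keep the task on its preferred node until 'settle_count'
 * scans have completed; after that it may move only if the load is
 * imbalanced.
 */
static bool may_leave_preferred_node(const struct numa_task *t,
				     int settle_count, bool load_imbalanced)
{
	if (t->scans_on_preferred < settle_count)
		return false;
	return load_imbalanced;
}
```

With a deferral value of 2, one skipped shared fault is followed by two more unconditional skips before a migration is allowed again, which is the back-and-forth damping the first commit describes.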
     

04 Oct, 2013

1 commit

  • /proc/sys/kernel/perf_event_max_sample_rate will accept
    negative values as well as 0.

    Negative values are unreasonable, and 0 causes a
    divide by zero exception in perf_proc_update_handler.

    This patch enforces a lower limit of 1.

    Signed-off-by: Knut Petersen
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/5242DB0C.4070005@t-online.de
    Signed-off-by: Ingo Molnar

    Knut Petersen
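
The effect of the lower limit can be shown in user space. This is a minimal sketch rather than the kernel code: `set_max_sample_rate()` and `sample_period_ns()` are hypothetical helpers, and the period calculation (nanoseconds per second divided by the rate) is an assumption used only to show a division that a rate of 0 would break.

```c
#include <errno.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical stand-in for the sysctl-backed variable. */
static int max_sample_rate = 100000;

/* Reject anything below the lower limit of 1, as the patch describes. */
static int set_max_sample_rate(int requested)
{
	if (requested < 1)
		return -EINVAL;
	max_sample_rate = requested;
	return 0;
}

/* Safe only because max_sample_rate can never be 0 or negative. */
static long sample_period_ns(void)
{
	return NSEC_PER_SEC / max_sample_rate;
}
```

Because the check happens at write time, every later reader of the value can divide by it without a guard of its own.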
     

23 Sep, 2013

1 commit

  • As 'sysctl_hung_task_check_count' is 'unsigned long' when this
    value is assigned to max_count in check_hung_uninterruptible_tasks(),
    it's truncated to 'int' type.

    This causes a minor artifact: if we write 2^32 to sysctl.hung_task_check_count,
    hung task detection will be effectively disabled.

    With this fix, it will still truncate the user input to 32 bits, but
    reading sysctl.hung_task_check_count reflects the actual truncated value.

    Signed-off-by: Li Zefan
    Acked-by: Ingo Molnar
    Link: http://lkml.kernel.org/r/523FFF4E.9050401@huawei.com
    Signed-off-by: Ingo Molnar

    Li Zefan
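
The artifact described above can be reproduced outside the kernel. The following is a sketch under assumptions: `truncated_max_count()` and `tasks_checked()` are hypothetical names, the 32-bit truncation is written as an explicit mask so that the result does not depend on the platform's integer widths, and the loop only imitates a scan that stops when its budget reaches zero.

```c
/*
 * Mimic assigning a wide sysctl value to an 'int' limit. Only the low
 * 32 bits survive, so 2^32 becomes 0.
 */
static int truncated_max_count(unsigned long long check_count)
{
	return (int)(unsigned int)(check_count & 0xffffffffULL);
}

/* Hypothetical scan loop: returns how many tasks would be examined. */
static int tasks_checked(unsigned long long check_count, int nr_tasks)
{
	int max_count = truncated_max_count(check_count);
	int checked = 0;
	int i;

	for (i = 0; i < nr_tasks; i++) {
		if (!max_count--)
			break;	/* budget exhausted, stop scanning */
		checked++;
	}
	return checked;
}
```

Writing 2^32 leaves a budget of 0, so the loop exits before examining any task, which is why detection is effectively disabled in that case.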
     

13 Sep, 2013

1 commit

  • Pull vfs pile 4 from Al Viro:
    "list_lru pile, mostly"

    This came out of Andrew's pile, Al ended up doing the merge work so that
    Andrew didn't have to.

    Additionally, a few fixes.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits)
    super: fix for destroy lrus
    list_lru: dynamically adjust node arrays
    shrinker: Kill old ->shrink API.
    shrinker: convert remaining shrinkers to count/scan API
    staging/lustre/libcfs: cleanup linux-mem.h
    staging/lustre/ptlrpc: convert to new shrinker API
    staging/lustre/obdclass: convert lu_object shrinker to count/scan API
    staging/lustre/ldlm: convert to shrinkers to count/scan API
    hugepage: convert huge zero page shrinker to new shrinker API
    i915: bail out earlier when shrinker cannot acquire mutex
    drivers: convert shrinkers to new count/scan API
    fs: convert fs shrinkers to new scan/count API
    xfs: fix dquot isolation hang
    xfs-convert-dquot-cache-lru-to-list_lru-fix
    xfs: convert dquot cache lru to list_lru
    xfs: rework buffer dispose list tracking
    xfs-convert-buftarg-lru-to-generic-code-fix
    xfs: convert buftarg LRU to generic code
    fs: convert inode and dentry shrinking to be node aware
    vmscan: per-node deferred work
    ...

    Linus Torvalds