14 Dec, 2014

1 commit

  • SysV can be abused to allocate locked kernel memory. For most systems, a
    small limit doesn't make sense, see the discussion with regards to SHMMAX.

    Therefore: increase MSGMNI to the maximum supported.

    And: If we ignore the risk of locking too much memory, then an automatic
    scaling of MSGMNI doesn't make sense. Therefore the logic can be removed.

    The code preserves auto_msgmni to avoid breaking any user space applications
    that expect that the value exists.

    Notes:
    1) If an administrator must limit the memory allocations, then he can set
    MSGMNI as necessary.

    Or he can disable sysv entirely (as e.g. done by Android).

    2) MSGMAX and MSGMNB are intentionally not increased, as these values are used
    to control latency vs. throughput:
    If MSGMNB is large, then msgsnd() just returns and more messages can be queued
    before a task switch to a task that calls msgrcv() is forced.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

12 Dec, 2014

1 commit

  • Pull networking updates from David Miller:

    1) New offloading infrastructure and example 'rocker' driver for
    offloading of switching and routing to hardware.

    This work was done by a large group of dedicated individuals, not
    limited to: Scott Feldman, Jiri Pirko, Thomas Graf, John Fastabend,
    Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli, Roopa Prabhu

    2) Start making the networking operate on IOV iterators instead of
    modifying iov objects in-situ during transfers. Thanks to Al Viro
    and Herbert Xu.

    3) A set of new netlink interfaces for the TIPC stack, from Richard
    Alpe.

    4) Remove unnecessary looping during ipv6 routing lookups, from Martin
    KaFai Lau.

    5) Add PAUSE frame generation support to gianfar driver, from Matei
    Pavaluca.

    6) Allow for larger reordering levels in TCP, which are easily
    achievable in the real world right now, from Eric Dumazet.

    7) Add a variable of napi_schedule that doesn't need to disable cpu
    interrupts, from Eric Dumazet.

    8) Use a doubly linked list to optimize neigh_parms_release(), from
    Nicolas Dichtel.

    9) Various enhancements to the kernel BPF verifier, and allow eBPF
    programs to actually be attached to sockets. From Alexei
    Starovoitov.

    10) Support TSO/LSO in sunvnet driver, from David L Stevens.

    11) Allow controlling ECN usage via routing metrics, from Florian
    Westphal.

    12) Remote checksum offload, from Tom Herbert.

    13) Add split-header receive, BQL, and xmit_more support to amd-xgbe
    driver, from Thomas Lendacky.

    14) Add MPLS support to openvswitch, from Simon Horman.

    15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen
    Klassert.

    16) Do gro flushes on a per-device basis using a timer, from Eric
    Dumazet. This tries to resolve the conflicting goals between the
    desired handling of bulk vs. RPC-like traffic.

    17) Allow userspace to ask for the CPU upon what a packet was
    received/steered, via SO_INCOMING_CPU. From Eric Dumazet.

    18) Limit GSO packets to half the current congestion window, from Eric
    Dumazet.

    19) Add a generic helper so that all drivers set their RSS keys in a
    consistent way, from Eric Dumazet.

    20) Add xmit_more support to enic driver, from Govindarajulu
    Varadarajan.

    21) Add VLAN packet scheduler action, from Jiri Pirko.

    22) Support configurable RSS hash functions via ethtool, from Eyal
    Perry.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits)
    Fix race condition between vxlan_sock_add and vxlan_sock_release
    net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header
    net/mlx4: Add support for A0 steering
    net/mlx4: Refactor QUERY_PORT
    net/mlx4_core: Add explicit error message when rule doesn't meet configuration
    net/mlx4: Add A0 hybrid steering
    net/mlx4: Add mlx4_bitmap zone allocator
    net/mlx4: Add a check if there are too many reserved QPs
    net/mlx4: Change QP allocation scheme
    net/mlx4_core: Use tasklet for user-space CQ completion events
    net/mlx4_core: Mask out host side virtualization features for guests
    net/mlx4_en: Set csum level for encapsulated packets
    be2net: Export tunnel offloads only when a VxLAN tunnel is created
    gianfar: Fix dma check map error when DMA_API_DEBUG is enabled
    cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call
    net: fec: only enable mdio interrupt before phy device link up
    net: fec: clear all interrupt events to support i.MX6SX
    net: fec: reset fep link status in suspend function
    net: sock: fix access via invalid file descriptor
    net: introduce helper macro for_each_cmsghdr
    ...

    Linus Torvalds
     

11 Dec, 2014

1 commit

  • There have been several times where I have had to rebuild a kernel to
    cause a panic when hitting a WARN() in the code in order to get a crash
    dump from a system. Sometimes this is easy to do, other times (such as
    in the case of a remote admin) it is not trivial to send new images to
    the user.

    A much easier method would be a switch to change the WARN() over to a
    panic. This makes debugging easier in that I can now test the actual
    image the WARN() was seen on and I do not have to engage in remote
    debugging.

    This patch adds a panic_on_warn kernel parameter and
    /proc/sys/kernel/panic_on_warn calls panic() in the
    warn_slowpath_common() path. The function will still print out the
    location of the warning.

    An example of the panic_on_warn output:

    The first line below is from the WARN_ON() to output the WARN_ON()'s
    location. After that the panic() output is displayed.

    WARNING: CPU: 30 PID: 11698 at /home/prarit/dummy_module/dummy-module.c:25 init_dummy+0x1f/0x30 [dummy_module]()
    Kernel panic - not syncing: panic_on_warn set ...

    CPU: 30 PID: 11698 Comm: insmod Tainted: G W OE 3.17.0+ #57
    Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
    0000000000000000 000000008e3f87df ffff88080f093c38 ffffffff81665190
    0000000000000000 ffffffff818aea3d ffff88080f093cb8 ffffffff8165e2ec
    ffffffff00000008 ffff88080f093cc8 ffff88080f093c68 000000008e3f87df
    Call Trace:
    [] dump_stack+0x46/0x58
    [] panic+0xd0/0x204
    [] ? init_dummy+0x1f/0x30 [dummy_module]
    [] warn_slowpath_common+0xd0/0xd0
    [] ? dummy_greetings+0x40/0x40 [dummy_module]
    [] warn_slowpath_null+0x1a/0x20
    [] init_dummy+0x1f/0x30 [dummy_module]
    [] do_one_initcall+0xd4/0x210
    [] ? __vunmap+0xc2/0x110
    [] load_module+0x16a9/0x1b30
    [] ? store_uevent+0x70/0x70
    [] ? copy_module_from_fd.isra.44+0x129/0x180
    [] SyS_finit_module+0xa6/0xd0
    [] system_call_fastpath+0x12/0x17

    Successfully tested by me.

    hpa said: There is another very valid use for this: many operators would
    rather a machine shuts down than being potentially compromised either
    functionally or security-wise.

    Signed-off-by: Prarit Bhargava
    Cc: Jonathan Corbet
    Cc: Rusty Russell
    Cc: "H. Peter Anvin"
    Cc: Andi Kleen
    Cc: Masami Hiramatsu
    Acked-by: Yasuaki Ishimatsu
    Cc: Fabian Frederick
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Prarit Bhargava
     

17 Nov, 2014

1 commit

  • RSS (Receive Side Scaling) typically uses Toeplitz hash and a 40 or 52 bytes
    RSS key.

    Some drivers use a constant (and well known key), some drivers use a random
    key per port, making bonding setups hard to tune. Well known keys increase
    attack surface, considering that number of queues is usually a power of two.

    This patch provides infrastructure to help drivers doing the right thing.

    netdev_rss_key_fill() should be used by drivers to initialize their RSS key,
    even if they provide ethtool -X support to let user redefine the key later.

    A new /proc/sys/net/core/netdev_rss_key file can be used to get the host
    RSS key even for drivers not providing ethtool -x support, in case some
    applications want to precisely setup flows to match some RX queues.

    Tested:

    myhost:~# cat /proc/sys/net/core/netdev_rss_key
    11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41:36:40:74:b6:15:ca:27:44:aa:b3:4d:72

    myhost:~# ethtool -x eth0
    RX flow hash indirection table for eth0 with 8 RX ring(s):
    0: 0 1 2 3 4 5 6 7
    RSS hash key:
    11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Nov, 2014

1 commit

  • Use the more common dynamic_debug capable net_dbg_ratelimited
    and remove the LIMIT_NETDEBUG macro.

    All messages are still ratelimited.

    Some KERN_ uses are changed to KERN_DEBUG.

    This may have some negative impact on messages that were
    emitted at KERN_INFO that are not not enabled at all unless
    DEBUG is defined or dynamic_debug is enabled. Even so,
    these messages are now _not_ emitted by default.

    This also eliminates the use of the net_msg_warn sysctl
    "/proc/sys/net/core/warnings". For backward compatibility,
    the sysctl is not removed, but it has no function. The extern
    declaration of net_msg_warn is removed from sock.h and made
    static in net/core/sysctl_net_core.c

    Miscellanea:

    o Update the sysctl documentation
    o Remove the embedded uses of pr_fmt
    o Coalesce format fragments
    o Realign arguments

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

14 Oct, 2014

1 commit

  • format_corename() can only pass the leader's pid to the core handler,
    but there is no simple way to figure out which thread originated the
    coredump.

    As Jan explains, this also means that there is no simple way to create
    the backtrace of the crashed process:

    As programs are mostly compiled with implicit gcc -fomit-frame-pointer
    one needs program's .eh_frame section (equivalently PT_GNU_EH_FRAME
    segment) or .debug_frame section. .debug_frame usually is present only
    in separate debug info files usually not even installed on the system.
    While .eh_frame is a part of the executable/library (and it is even
    always mapped for C++ exceptions unwinding) it no longer has to be
    present anywhere on the disk as the program could be upgraded in the
    meantime and the running instance has its executable file already
    unlinked from disk.

    One possibility is to echo 0x3f >/proc/*/coredump_filter and dump all
    the file-backed memory including the executable's .eh_frame section.
    But that can create huge core files, for example even due to mmapped
    data files.

    Other possibility would be to read .eh_frame from /proc/PID/mem at the
    core_pattern handler time of the core dump. For the backtrace one needs
    to read the register state first which can be done from core_pattern
    handler:

    ptrace(PTRACE_SEIZE, tid, 0, PTRACE_O_TRACEEXIT)
    close(0); // close pipe fd to resume the sleeping dumper
    waitpid(); // should report EXIT
    PTRACE_GETREGS or other requests

    The remaining problem is how to get the 'tid' value of the crashed
    thread. It could be read from the first NT_PRSTATUS note of the core
    file but that makes the core_pattern handler complicated.

    Unfortunately %t is already used so this patch uses %i/%I.

    Automatic Bug Reporting Tool (https://github.com/abrt/abrt/wiki/overview)
    is experimenting with this. It is using the elfutils
    (https://fedorahosted.org/elfutils/) unwinder for generating the
    backtraces. Apart from not needing matching executables as mentioned
    above, another advantage is that we can get the backtrace without saving
    the core (which might be quite large) to disk.

    [mmilata@redhat.com: final paragraph of changelog]
    Signed-off-by: Jan Kratochvil
    Signed-off-by: Oleg Nesterov
    Cc: Alexander Viro
    Cc: Denys Vlasenko
    Cc: Jan Kratochvil
    Cc: Mark Wielaard
    Cc: Martin Milata
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

02 Sep, 2014

1 commit

  • TIPC name table updates are distributed asynchronously in a cluster,
    entailing a risk of certain race conditions. E.g., if two nodes
    simultaneously issue conflicting (overlapping) publications, this may
    not be detected until both publications have reached a third node, in
    which case one of the publications will be silently dropped on that
    node. Hence, we end up with an inconsistent name table.

    In most cases this conflict is just a temporary race, e.g., one
    node is issuing a publication under the assumption that a previous,
    conflicting, publication has already been withdrawn by the other node.
    However, because of the (rtt related) distributed update delay, this
    may not yet hold true on all nodes. The symptom of this failure is a
    syslog message: "tipc: Cannot publish {%u,%u,%u}, overlap error".

    In this commit we add a resiliency queue at the receiving end of
    the name table distributor. When insertion of an arriving publication
    fails, we retain it in this queue for a short amount of time, assuming
    that another update will arrive very soon and clear the conflict. If so
    happens, we insert the publication, otherwise we drop it.

    The (configurable) retention value defaults to 2000 ms. Knowing from
    experience that the situation described above is extremely rare, there
    is no risk that the queue will accumulate any large number of items.

    Signed-off-by: Erik Hugne
    Signed-off-by: Jon Maloy
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Erik Hugne
     

09 Aug, 2014

1 commit

  • This taint flag will be set if the system has ever entered a softlockup
    state. Similar to TAINT_WARN it is useful to know whether or not the
    system has been in a softlockup state when debugging.

    [akpm@linux-foundation.org: apply the taint before calling panic()]
    Signed-off-by: Josh Hunt
    Cc: Jason Baron
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josh Hunt
     

24 Jun, 2014

2 commits

  • A 'softlockup' is defined as a bug that causes the kernel to loop in
    kernel mode for more than a predefined period to time, without giving
    other tasks a chance to run.

    Currently, upon detection of this condition by the per-cpu watchdog
    task, debug information (including a stack trace) is sent to the system
    log.

    On some occasions, we have observed that the "victim" rather than the
    actual "culprit" (i.e. the owner/holder of the contended resource) is
    reported to the user. Often this information has proven to be
    insufficient to assist debugging efforts.

    To avoid loss of useful debug information, for architectures which
    support NMI, this patch makes it possible to improve soft lockup
    reporting. This is accomplished by issuing an NMI to each cpu to obtain
    a stack trace.

    If NMI is not supported we just revert back to the old method. A sysctl
    and boot-time parameter is available to toggle this feature.

    [dzickus@redhat.com: add CONFIG_SMP in certain areas]
    [akpm@linux-foundation.org: additional CONFIG_SMP=n optimisations]
    [mq@suse.cz: fix warning]
    Signed-off-by: Aaron Tomlin
    Signed-off-by: Don Zickus
    Cc: David S. Miller
    Cc: Mateusz Guzik
    Cc: Oleg Nesterov
    Signed-off-by: Jan Moskyto Matejka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Tomlin
     
  • Oleg reports a division by zero error on zero-length write() to the
    percpu_pagelist_fraction sysctl:

    divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
    CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
    RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
    RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246
    RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
    RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
    R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
    R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
    FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
    Call Trace:
    proc_sys_call_handler+0xb3/0xc0
    proc_sys_write+0x14/0x20
    vfs_write+0xba/0x1e0
    SyS_write+0x46/0xb0
    tracesys+0xe1/0xe6

    However, if the percpu_pagelist_fraction sysctl is set by the user, it
    is also impossible to restore it to the kernel default since the user
    cannot write 0 to the sysctl.

    This patch allows the user to write 0 to restore the default behavior.
    It still requires a fraction equal to or larger than 8, however, as
    stated by the documentation for sanity. If a value in the range [1, 7]
    is written, the sysctl will return EINVAL.

    This successfully solves the divide by zero issue at the same time.

    Signed-off-by: David Rientjes
    Reported-by: Oleg Drokin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

07 Jun, 2014

1 commit

  • When writing to a sysctl string, each write, regardless of VFS position,
    begins writing the string from the start. This means the contents of
    the last write to the sysctl controls the string contents instead of the
    first:

    open("/proc/sys/kernel/modprobe", O_WRONLY) = 1
    write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
    write(1, "/bin/true", 9) = 9
    close(1) = 0

    $ cat /proc/sys/kernel/modprobe
    /bin/true

    Expected behaviour would be to have the sysctl be "AAAA..." capped at
    maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
    contents of the second write. Similarly, multiple short writes would
    not append to the sysctl.

    The old behavior is unlike regular POSIX files enough that doing audits
    of software that interact with sysctls can end up in unexpected or
    dangerous situations. For example, "as long as the input starts with a
    trusted path" turns out to be an insufficient filter, as what must also
    happen is for the input to be entirely contained in a single write
    syscall -- not a common consideration, especially for high level tools.

    This provides kernel.sysctl_writes_strict as a way to make this behavior
    act in a less surprising manner for strings, and disallows non-zero file
    position when writing numeric sysctls (similar to what is already done
    when reading from non-zero file positions). For now, the default (0) is
    to warn about non-zero file position use, but retain the legacy
    behavior. Setting this to -1 disables the warning, and setting this to
    1 enables the file position respecting behavior.

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: move misplaced hunk, per Randy]
    Signed-off-by: Kees Cook
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

05 Jun, 2014

2 commits

  • Existing description is worded in a way which almost encourages setting of
    vfs_cache_pressure above 100, possibly way above it.

    Users are left in a dark what this numeric value is - an int? a
    percentage? what the scale is?

    As a result, we are getting reports about noticeable performance
    degradation from users who have set vfs_cache_pressure to ridiculously
    high values - because they thought there is no downside to it.

    Via code inspection it's obvious that this value is treated as a
    percentage. This patch changes text to reflect this fact, and adds a
    cautionary paragraph advising against setting vfs_cache_pressure sky high.

    Signed-off-by: Denys Vlasenko
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denys Vlasenko
     
  • When it was introduced, zone_reclaim_mode made sense as NUMA distances
    punished and workloads were generally partitioned to fit into a NUMA
    node. NUMA machines are now common but few of the workloads are
    NUMA-aware and it's routine to see major performance degradation due to
    zone_reclaim_mode being enabled but relatively few can identify the
    problem.

    Those that require zone_reclaim_mode are likely to be able to detect
    when it needs to be enabled and tune appropriately so lets have a
    sensible default for the bulk of users.

    This patch (of 2):

    zone_reclaim_mode causes processes to prefer reclaiming memory from
    local node instead of spilling over to other nodes. This made sense
    initially when NUMA machines were almost exclusively HPC and the
    workload was partitioned into nodes. The NUMA penalties were
    sufficiently high to justify reclaiming the memory. On current machines
    and workloads it is often the case that zone_reclaim_mode destroys
    performance but not all users know how to detect this. Favour the
    common case and disable it by default. Users that are sophisticated
    enough to know they need zone_reclaim_mode will detect it.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Zhang Yanfei
    Acked-by: Michal Hocko
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

08 Apr, 2014

1 commit

  • As sysctl_hung_task_timeout_sec is unsigned long, when this value is
    larger then LONG_MAX/HZ, the function schedule_timeout_interruptible in
    watchdog will return immediately without sleep and with print :

    schedule_timeout: wrong timeout value ffffffffffffff83

    and then the funtion watchdog will call schedule_timeout_interruptible
    again and again. The screen will be filled with

    "schedule_timeout: wrong timeout value ffffffffffffff83"

    This patch does some check and correction in sysctl, to let the function
    schedule_timeout_interruptible allways get the valid parameter.

    Signed-off-by: Liu Hua
    Tested-by: Satoru Takeuchi
    Cc: [3.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liu Hua
     

07 Apr, 2014

1 commit

  • Pull module updates from Rusty Russell:
    "Nothing major: the stricter permissions checking for sysfs broke a
    staging driver; fix included. Greg KH said he'd take the patch but
    hadn't as the merge window opened, so it's included here to avoid
    breaking build"

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    staging: fix up speakup kobject mode
    Use 'E' instead of 'X' for unsigned module taint flag.
    VERIFY_OCTAL_PERMISSIONS: stricter checking for sysfs perms.
    kallsyms: fix percpu vars on x86-64 with relocation.
    kallsyms: generalize address range checking
    module: LLVMLinux: Remove unused function warning from __param_check macro
    Fix: module signature vs tracepoints: add new TAINT_UNSIGNED_MODULE
    module: remove MODULE_GENERIC_TABLE
    module: allow multiple calls to MODULE_DEVICE_TABLE() per module
    module: use pr_cont

    Linus Torvalds
     

04 Apr, 2014

1 commit

  • There is plenty of anecdotal evidence and a load of blog posts
    suggesting that using "drop_caches" periodically keeps your system
    running in "tip top shape". Perhaps adding some kernel documentation
    will increase the amount of accurate data on its use.

    If we are not shrinking caches effectively, then we have real bugs.
    Using drop_caches will simply mask the bugs and make them harder to
    find, but certainly does not fix them, nor is it an appropriate
    "workaround" to limit the size of the caches. On the contrary, there
    have been bug reports on issues that turned out to be misguided use of
    cache dropping.

    Dropping caches is a very drastic and disruptive operation that is good
    for debugging and running tests, but if it creates bug reports from
    production use, kernel developers should be aware of its use.

    Add a bit more documentation about it, a syslog message to track down
    abusers, and vmstat drop counters to help analyze problem reports.

    [akpm@linux-foundation.org: checkpatch fixes]
    [hannes@cmpxchg.org: add runtime suppression control]
    Signed-off-by: Dave Hansen
    Signed-off-by: Michal Hocko
    Acked-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

01 Apr, 2014

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "Bigger changes:

    - sched/idle restructuring: they are WIP preparation for deeper
    integration between the scheduler and idle state selection, by
    Nicolas Pitre.

    - add NUMA scheduling pseudo-interleaving, by Rik van Riel.

    - optimize cgroup context switches, by Peter Zijlstra.

    - RT scheduling enhancements, by Thomas Gleixner.

    The rest is smaller changes, non-urgnt fixes and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (68 commits)
    sched: Clean up the task_hot() function
    sched: Remove double calculation in fix_small_imbalance()
    sched: Fix broken setscheduler()
    sparc64, sched: Remove unused sparc64_multi_core
    sched: Remove unused mc_capable() and smt_capable()
    sched/numa: Move task_numa_free() to __put_task_struct()
    sched/fair: Fix endless loop in idle_balance()
    sched/core: Fix endless loop in pick_next_task()
    sched/fair: Push down check for high priority class task into idle_balance()
    sched/rt: Fix picking RT and DL tasks from empty queue
    trace: Replace hardcoding of 19 with MAX_NICE
    sched: Guarantee task priority in pick_next_task()
    sched/idle: Remove stale old file
    sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED
    cpuidle/arm64: Remove redundant cpuidle_idle_call()
    cpuidle/powernv: Remove redundant cpuidle_idle_call()
    sched, nohz: Exclude isolated cores from load balancing
    sched: Fix select_task_rq_fair() description comments
    workqueue: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
    sys: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
    ...

    Linus Torvalds
     

13 Mar, 2014

1 commit

  • Users have reported being unable to trace non-signed modules loaded
    within a kernel supporting module signature.

    This is caused by tracepoint.c:tracepoint_module_coming() refusing to
    take into account tracepoints sitting within force-loaded modules
    (TAINT_FORCED_MODULE). The reason for this check, in the first place, is
    that a force-loaded module may have a struct module incompatible with
    the layout expected by the kernel, and can thus cause a kernel crash
    upon forced load of that module on a kernel with CONFIG_TRACEPOINTS=y.

    Tracepoints, however, specifically accept TAINT_OOT_MODULE and
    TAINT_CRAP, since those modules do not lead to the "very likely system
    crash" issue cited above for force-loaded modules.

    With kernels having CONFIG_MODULE_SIG=y (signed modules), a non-signed
    module is tainted re-using the TAINT_FORCED_MODULE taint flag.
    Unfortunately, this means that Tracepoints treat that module as a
    force-loaded module, and thus silently refuse to consider any tracepoint
    within this module.

    Since an unsigned module does not fit within the "very likely system
    crash" category of tainting, add a new TAINT_UNSIGNED_MODULE taint flag
    to specifically address this taint behavior, and accept those modules
    within Tracepoints. We use the letter 'X' as a taint flag character for
    a module being loaded that doesn't know how to sign its name (proposed
    by Steven Rostedt).

    Also add the missing 'O' entry to trace event show_module_flags() list
    for the sake of completeness.

    Signed-off-by: Mathieu Desnoyers
    Acked-by: Steven Rostedt
    NAKed-by: Ingo Molnar
    CC: Thomas Gleixner
    CC: David Howells
    CC: Greg Kroah-Hartman
    Signed-off-by: Rusty Russell

    Mathieu Desnoyers
     

27 Feb, 2014

1 commit


02 Feb, 2014

1 commit


01 Feb, 2014

1 commit

  • Pull core debug changes from Ingo Molnar:
    "This contains mostly kernel debugging related updates:

    - make hung_task detection more configurable to distros
    - add final bits for x86 UV NMI debugging, with related KGDB changes
    - update the mailing-list of MAINTAINERS entries I'm involved with"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    hung_task: Display every hung task warning
    sysctl: Add neg_one as a standard constraint
    x86/uv/nmi, kgdb/kdb: Fix UV NMI handler when KDB not configured
    x86/uv/nmi: Fix Sparse warnings
    kgdb/kdb: Fix no KDB config problem
    MAINTAINERS: Restore "L: linux-kernel@vger.kernel.org" entries

    Linus Torvalds
     

31 Jan, 2014

1 commit


30 Jan, 2014

1 commit

  • Prior to commit fe35004fbf9e ("mm: avoid swapping out with
    swappiness==0") setting swappiness to 0, reclaim code could still evict
    recently used user anonymous memory to swap even though there is a
    significant amount of RAM used for page cache.

    The behaviour of setting swappiness to 0 has since changed. When set,
    the reclaim code does not initiate swap until the amount of free pages
    and file-backed pages, is less than the high water mark in a zone.

    Let's update the documentation to reflect this.

    [akpm@linux-foundation.org: remove comma, per Randy]
    Signed-off-by: Aaron Tomlin
    Acked-by: Rik van Riel
    Acked-by: Bryn M. Reeves
    Cc: Satoru Moriya
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Tomlin
     

28 Jan, 2014

1 commit

  • Excessive migration of pages can hurt the performance of workloads
    that span multiple NUMA nodes. However, it turns out that the
    p->numa_migrate_deferred knob is a really big hammer, which does
    reduce migration rates, but does not actually help performance.

    Now that the second stage of the automatic numa balancing code
    has stabilized, it is time to replace the simplistic migration
    deferral code with something smarter.

    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Peter Zijlstra
    Cc: Chegu Vinod
    Link: http://lkml.kernel.org/r/1390860228-21539-2-git-send-email-riel@redhat.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

25 Jan, 2014

1 commit

  • When khungtaskd detects hung tasks, it prints out
    backtraces from a number of those tasks.

    Limiting the number of backtraces being printed
    out can result in the user not seeing the information
    necessary to debug the issue. The hung_task_warnings
    sysctl controls this feature.

    This patch makes it possible for hung_task_warnings
    to accept a special value to print an unlimited
    number of backtraces when khungtaskd detects hung
    tasks.

    The special value is -1. To use this value it is
    necessary to change types from ulong to int.

    Signed-off-by: Aaron Tomlin
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: oleg@redhat.com
    Link: http://lkml.kernel.org/r/1390239253-24030-3-git-send-email-atomlin@redhat.com
    [ Build warning fix. ]
    Signed-off-by: Ingo Molnar

    Aaron Tomlin
     

24 Jan, 2014

1 commit

  • For general-purpose (i.e. distro) kernel builds it makes sense to build
    with CONFIG_KEXEC to allow end users to choose what kind of things they
    want to do with kexec. However, in the face of trying to lock down a
    system with such a kernel, there needs to be a way to disable kexec_load
    (much like module loading can be disabled). Without this, it is too easy
    for the root user to modify kernel memory even when CONFIG_STRICT_DEVMEM
    and modules_disabled are set. With this change, it is still possible to
    load an image for use later, then disable kexec_load so the image (or lack
    of image) can't be altered.

    The intention is for using this in environments where "perfect"
    enforcement is hard. Without a verified boot, along with verified
    modules, and along with verified kexec, this is trying to give a system a
    better chance to defend itself (or at least grow the window of
    discoverability) against attack in the face of a privilege escalation.

    In my mind, I consider several boot scenarios:

    1) Verified boot of read-only verified root fs loading fd-based
    verification of kexec images.
    2) Secure boot of writable root fs loading signed kexec images.
    3) Regular boot loading kexec (e.g. kcrash) image early and locking it.
    4) Regular boot with no control of kexec image at all.

    1 and 2 don't exist yet, but will soon once the verified kexec series has
    landed. 4 is the state of things now. The gap between 2 and 4 is too
    large, so this change creates scenario 3, a middle-ground above 4 when 2
    and 1 are not possible for a system.

    Signed-off-by: Kees Cook
    Acked-by: Rik van Riel
    Cc: Vivek Goyal
    Cc: Eric Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

22 Jan, 2014

1 commit

  • Some applications that run on HPC clusters are designed around the
    availability of RAM and the overcommit ratio is fine tuned to get the
    maximum usage of memory without swapping. With growing memory, the
    1%-of-all-RAM grain provided by overcommit_ratio has become too coarse
    for these workload (on a 2TB machine it represents no less than 20GB).

    This patch adds the new overcommit_kbytes sysctl variable that allow a
    much finer grain.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Jerome Marchand
    Cc: Dave Hansen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     

17 Dec, 2013

1 commit

  • commit 887c290e (sched/numa: Decide whether to favour task or group weights
    based on swap candidate relationships) drop the check against
    sysctl_numa_balancing_settle_count, this patch remove the sysctl.

    Signed-off-by: Wanpeng Li
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Naoya Horiguchi
    Link: http://lkml.kernel.org/r/1386833006-6600-1-git-send-email-liwanp@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     

13 Nov, 2013

2 commits

  • Some setuid binaries will allow reading of files which have read
    permission by the real user id. This is problematic with files which
    use %pK because the file access permission is checked at open() time,
    but the kptr_restrict setting is checked at read() time. If a setuid
    binary opens a %pK file as an unprivileged user, and then elevates
    permissions before reading the file, then kernel pointer values may be
    leaked.

    This happens for example with the setuid pppd application on Ubuntu 12.04:

    $ head -1 /proc/kallsyms
    00000000 T startup_32

    $ pppd file /proc/kallsyms
    pppd: In file /proc/kallsyms: unrecognized option 'c1000000'

    This will only leak the pointer value from the first line, but other
    setuid binaries may leak more information.

    Fix this by adding a check that in addition to the current process having
    CAP_SYSLOG, that effective user and group ids are equal to the real ids.
    If a setuid binary reads the contents of a file which uses %pK then the
    pointer values will be printed as NULL if the real user is unprivileged.

    Update the sysctl documentation to reflect the changes, and also correct
    the documentation to state the kptr_restrict=0 is the default.

    This is a only temporary solution to the issue. The correct solution is
    to do the permission check at open() time on files, and to replace %pK
    with a function which checks the open() time permission. %pK uses in
    printk should be removed since no sane permission check can be done, and
    instead protected by using dmesg_restrict.

    Signed-off-by: Ryan Mallon
    Cc: Kees Cook
    Cc: Alexander Viro
    Cc: Joe Perches
    Cc: "Eric W. Biederman"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ryan Mallon
     
  • Now dirty_background_ratio/dirty_ratio contains a percentage of total
    avaiable memory, which contains free pages and reclaimable pages. The
    number of these pages is not equal to the number of total system memory.
    But they are described as a percentage of total system memory in
    Documentation/sysctl/vm.txt. So we need to fix them to avoid
    misunderstanding.

    Signed-off-by: Zheng Liu
    Cc: Rob Landley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zheng Liu
     

09 Oct, 2013

5 commits

  • Shared faults can lead to lots of unnecessary page migrations,
    slowing down the system, and causing private faults to hit the
    per-pgdat migration ratelimit.

    This patch adds sysctl numa_balancing_migrate_deferred, which specifies
    how many shared page migrations to skip unconditionally, after each page
    migration that is skipped because it is a shared fault.

    This reduces the number of page migrations back and forth in
    shared fault situations. It also gives a strong preference to
    the tasks that are already running where most of the memory is,
    and to moving the other tasks to near the memory.

    Testing this with a much higher scan rate than the default
    still seems to result in fewer page migrations than before.

    Memory seems to be somewhat better consolidated than previously,
    with multi-instance specjbb runs on a 4 node system.

    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-62-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • With scan rate adaptions based on whether the workload has properly
    converged or not there should be no need for the scan period reset
    hammer. Get rid of it.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-60-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • This patch favours moving tasks towards NUMA node that recorded a higher
    number of NUMA faults during active load balancing. Ideally this is
    self-reinforcing as the longer the task runs on that node, the more faults
    it should incur causing task_numa_placement to keep the task running on that
    node. In reality a big weakness is that the nodes CPUs can be overloaded
    and it would be more efficient to queue tasks on an idle node and migrate
    to the new node. This would require additional smarts in the balancer so
    for now the balancer will simply prefer to place the task on the preferred
    node for a PTE scans which is controlled by the numa_balancing_settle_count
    sysctl. Once the settle_count number of scans has complete the schedule
    is free to place the task on an alternative node if the load is imbalanced.

    [srikar@linux.vnet.ibm.com: Fixed statistics]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    [ Tunable and use higher faults instead of preferred. ]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-23-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • The NUMA PTE scan rate is controlled with a combination of the
    numa_balancing_scan_period_min, numa_balancing_scan_period_max and
    numa_balancing_scan_size. This scan rate is independent of the size
    of the task and as an aside it is further complicated by the fact that
    numa_balancing_scan_size controls how many pages are marked pte_numa and
    not how much virtual memory is scanned.

    In combination, it is almost impossible to meaningfully tune the min and
    max scan periods and reasoning about performance is complex when the time
    to complete a full scan is is partially a function of the tasks memory
    size. This patch alters the semantic of the min and max tunables to be
    about tuning the length time it takes to complete a scan of a tasks occupied
    virtual address space. Conceptually this is a lot easier to understand. There
    is a "sanity" check to ensure the scan rate is never extremely fast based on
    the amount of virtual memory that should be scanned in a second. The default
    of 2.5G seems arbitrary but it is to have the maximum scan rate after the
    patch roughly match the maximum scan rate before the patch was applied.

    On a similar note, numa_scan_period is in milliseconds and not
    jiffies. Properly placed pages slow the scanning rate but adding 10 jiffies
    to numa_scan_period means that the rate scanning slows depends on HZ which
    is confusing. Get rid of the jiffies_to_msec conversion and treat it as ms.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-18-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-3-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

12 Sep, 2013

2 commits

  • Add a new %P variable to be used in core_pattern. This variable contains
    the global PID (PID in the init namespace) as %p contains the PID in the
    current namespace which isn't always what we want.

    The main use for this is to make it easier to handle crashes that happened
    within a container. With that new variables it's possible to have the
    crashes dumped into the container or forwarded to the host with the right
    PID (from the host's point of view).

    Signed-off-by: Stéphane Graber
    Reported-by: Hans Feldt
    Cc: Alexander Viro
    Cc: Eric W. Biederman
    Cc: Andy Whitcroft
    Acked-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stéphane Graber
     
  • Now hugepage migration is enabled, although restricted on pmd-based
    hugepages for now (due to lack of testing.) So we should allocate
    migratable hugepages from ZONE_MOVABLE if possible.

    This patch makes GFP flags in hugepage allocation dependent on migration
    support, not only the value of hugepages_treat_as_movable. It provides no
    change on the behavior for architectures which do not support hugepage
    migration,

    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Cc: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

31 Aug, 2013

1 commit

  • By default, the pfifo_fast queue discipline has been used by default
    for all devices. But we have better choices now.

    This patch allow setting the default queueing discipline with sysctl.
    This allows easy use of better queueing disciplines on all devices
    without having to use tc qdisc scripts. It is intended to allow
    an easy path for distributions to make fq_codel or sfq the default
    qdisc.

    This patch also makes pfifo_fast more of a first class qdisc, since
    it is now possible to manually override the default and explicitly
    use pfifo_fast. The behavior for systems who do not use the sysctl
    is unchanged, they still get pfifo_fast

    Also removes leftover random # in sysctl net core.

    Signed-off-by: Stephen Hemminger
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    stephen hemminger
     

02 Aug, 2013

1 commit


11 Jul, 2013

1 commit

  • Rename LL_SO to BUSY_POLL_SO
    Rename sysctl_net_ll_{read,poll} to sysctl_busy_{read,poll}
    Fix up users of these variables.
    Fix documentation for sysctl.

    a patch for the socket.7 man page will follow separately,
    because of limitations of my mail setup.

    Signed-off-by: Eliezer Tamir
    Signed-off-by: David S. Miller

    Eliezer Tamir