10 Jul, 2014

1 commit

  • commit 7cd2b0a34ab8e4db971920eef8982f985441adfb upstream.

    Oleg reports a division by zero error on zero-length write() to the
    percpu_pagelist_fraction sysctl:

    divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
    CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
    RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
    RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246
    RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
    RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
    R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
    R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
    FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
    Call Trace:
    proc_sys_call_handler+0xb3/0xc0
    proc_sys_write+0x14/0x20
    vfs_write+0xba/0x1e0
    SyS_write+0x46/0xb0
    tracesys+0xe1/0xe6

    However, if the percpu_pagelist_fraction sysctl is set by the user, it
    is also impossible to restore it to the kernel default since the user
    cannot write 0 to the sysctl.

    This patch allows the user to write 0 to restore the default behavior.
    It still requires a fraction equal to or larger than 8, however, as
    stated by the documentation for sanity. If a value in the range [1, 7]
    is written, the sysctl will return EINVAL.

    This successfully solves the divide by zero issue at the same time.

    Signed-off-by: David Rientjes
    Reported-by: Oleg Drokin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Rientjes
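
    The validation rule described above can be sketched as follows. This is
    a minimal model, not the kernel code: the function names, the
    default_high parameter and the managed_pages // fraction formula are
    illustrative assumptions; only the accepted ranges (0, or 8 and above)
    come from the commit text.

```python
import errno

# Smallest non-zero fraction accepted, per the commit text.
MIN_PERCPU_PAGELIST_FRACTION = 8


def validate_percpu_pagelist_fraction(value: int) -> int:
    """Return 0 if the written value is acceptable, -EINVAL otherwise."""
    if value == 0:
        # 0 restores the kernel default behavior.
        return 0
    if value < MIN_PERCPU_PAGELIST_FRACTION:
        # Values in [1, 7] (and negatives, in this sketch) are rejected.
        return -errno.EINVAL
    return 0


def pagelist_high(managed_pages: int, fraction: int, default_high: int) -> int:
    """Hypothetical helper: never divide when the fraction is 0."""
    if fraction == 0:
        return default_high
    return managed_pages // fraction
```

    Treating 0 as "use the default" is what removes the division: the
    divide is only reached with a fraction of at least 8.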
     

06 May, 2014

1 commit

  • commit 80df28476505ed4e6701c3448c63c9229a50c655 upstream.

    As sysctl_hung_task_timeout_sec is unsigned long, when this value is
    larger than LONG_MAX/HZ, the function schedule_timeout_interruptible in
    watchdog will return immediately without sleeping and print:

    schedule_timeout: wrong timeout value ffffffffffffff83

    and then the function watchdog will call schedule_timeout_interruptible
    again and again. The screen will be filled with

    "schedule_timeout: wrong timeout value ffffffffffffff83"

    This patch adds a check and correction in sysctl, so that the function
    schedule_timeout_interruptible always gets a valid parameter.

    Signed-off-by: Liu Hua
    Tested-by: Satoru Takeuchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Liu Hua
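
    The overflow behind the message above can be reproduced with plain
    integer arithmetic. This sketch assumes a 64-bit long and an
    illustrative HZ of 1000; the helper names are hypothetical and the
    range check only models the kind of bound the commit describes, not
    the exact kernel code.

```python
LONG_BITS = 64                       # assumption: 64-bit kernel
LONG_MAX = 2 ** (LONG_BITS - 1) - 1


def as_signed_long(value: int) -> int:
    """Interpret an unsigned bit pattern as a two's-complement long."""
    value &= (1 << LONG_BITS) - 1
    return value - (1 << LONG_BITS) if value > LONG_MAX else value


def max_timeout_secs(hz: int) -> int:
    """Largest timeout in seconds whose jiffies value still fits in a long."""
    return LONG_MAX // hz


def is_valid_timeout(secs: int, hz: int) -> bool:
    """Bound check of the kind the commit message describes."""
    return 0 <= secs <= max_timeout_secs(hz)


# The value from the log message is a negative number when read as a long.
logged = as_signed_long(0xFFFFFFFFFFFFFF83)
```

    Here logged evaluates to -125, which is why schedule_timeout treats
    the timeout as invalid and returns immediately.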
     

01 Feb, 2014

1 commit

  • Pull core debug changes from Ingo Molnar:
    "This contains mostly kernel debugging related updates:

    - make hung_task detection more configurable to distros
    - add final bits for x86 UV NMI debugging, with related KGDB changes
    - update the mailing-list of MAINTAINERS entries I'm involved with"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    hung_task: Display every hung task warning
    sysctl: Add neg_one as a standard constraint
    x86/uv/nmi, kgdb/kdb: Fix UV NMI handler when KDB not configured
    x86/uv/nmi: Fix Sparse warnings
    kgdb/kdb: Fix no KDB config problem
    MAINTAINERS: Restore "L: linux-kernel@vger.kernel.org" entries

    Linus Torvalds
     

30 Jan, 2014

1 commit

  • Prior to commit fe35004fbf9e ("mm: avoid swapping out with
    swappiness==0"), with swappiness set to 0, reclaim code could still evict
    recently used user anonymous memory to swap even though there is a
    significant amount of RAM used for page cache.

    The behaviour of setting swappiness to 0 has since changed. When set,
    the reclaim code does not initiate swap until the amount of free pages
    and file-backed pages is less than the high water mark in a zone.

    Let's update the documentation to reflect this.

    [akpm@linux-foundation.org: remove comma, per Randy]
    Signed-off-by: Aaron Tomlin
    Acked-by: Rik van Riel
    Acked-by: Bryn M. Reeves
    Cc: Satoru Moriya
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Tomlin
     

25 Jan, 2014

1 commit

  • When khungtaskd detects hung tasks, it prints out
    backtraces from a number of those tasks.

    Limiting the number of backtraces being printed
    out can result in the user not seeing the information
    necessary to debug the issue. The hung_task_warnings
    sysctl controls this feature.

    This patch makes it possible for hung_task_warnings
    to accept a special value to print an unlimited
    number of backtraces when khungtaskd detects hung
    tasks.

    The special value is -1. To use this value it is
    necessary to change types from ulong to int.

    Signed-off-by: Aaron Tomlin
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: oleg@redhat.com
    Link: http://lkml.kernel.org/r/1390239253-24030-3-git-send-email-atomlin@redhat.com
    [ Build warning fix. ]
    Signed-off-by: Ingo Molnar

    Aaron Tomlin
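
    The counter semantics described above can be modelled as below. This
    is a sketch under the assumption that the counter is decremented once
    per reported hung task; the function name and return shape are
    hypothetical.

```python
from typing import Tuple

UNLIMITED = -1  # special value introduced by the patch


def report_hung_task(warnings_left: int) -> Tuple[bool, int]:
    """Return (should_print_backtrace, updated_counter)."""
    if warnings_left == 0:
        # Budget exhausted: stay silent.
        return False, 0
    if warnings_left > 0:
        # Finite budget: consume one warning.
        warnings_left -= 1
    # A negative counter (UNLIMITED) is never decremented.
    return True, warnings_left
```

    A signed type is needed because -1 has to be distinguishable from a
    large positive budget, which is why the patch moves from ulong to int.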
     

24 Jan, 2014

1 commit

  • For general-purpose (i.e. distro) kernel builds it makes sense to build
    with CONFIG_KEXEC to allow end users to choose what kind of things they
    want to do with kexec. However, in the face of trying to lock down a
    system with such a kernel, there needs to be a way to disable kexec_load
    (much like module loading can be disabled). Without this, it is too easy
    for the root user to modify kernel memory even when CONFIG_STRICT_DEVMEM
    and modules_disabled are set. With this change, it is still possible to
    load an image for use later, then disable kexec_load so the image (or lack
    of image) can't be altered.

    The intention is for using this in environments where "perfect"
    enforcement is hard. Without a verified boot, along with verified
    modules, and along with verified kexec, this is trying to give a system a
    better chance to defend itself (or at least grow the window of
    discoverability) against attack in the face of a privilege escalation.

    In my mind, I consider several boot scenarios:

    1) Verified boot of read-only verified root fs loading fd-based
    verification of kexec images.
    2) Secure boot of writable root fs loading signed kexec images.
    3) Regular boot loading kexec (e.g. kcrash) image early and locking it.
    4) Regular boot with no control of kexec image at all.

    1 and 2 don't exist yet, but will soon once the verified kexec series has
    landed. 4 is the state of things now. The gap between 2 and 4 is too
    large, so this change creates scenario 3, a middle-ground above 4 when 2
    and 1 are not possible for a system.

    Signed-off-by: Kees Cook
    Acked-by: Rik van Riel
    Cc: Vivek Goyal
    Cc: Eric Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

22 Jan, 2014

1 commit

  • Some applications that run on HPC clusters are designed around the
    availability of RAM and the overcommit ratio is fine tuned to get the
    maximum usage of memory without swapping. With growing memory, the
    1%-of-all-RAM grain provided by overcommit_ratio has become too coarse
    for these workloads (on a 2TB machine it represents no less than 20GB).

    This patch adds the new overcommit_kbytes sysctl variable that allows a
    much finer grain.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Jerome Marchand
    Cc: Dave Hansen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
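
    The granularity argument can be checked with a small calculation. The
    sketch below assumes 4 kB pages, ignores huge pages, and assumes that
    a non-zero overcommit_kbytes takes precedence over overcommit_ratio;
    the function name is hypothetical.

```python
PAGE_SIZE_KB = 4  # assumption: 4 kB pages


def commit_limit_pages(total_ram_pages: int, swap_pages: int,
                       overcommit_ratio: int = 50,
                       overcommit_kbytes: int = 0) -> int:
    """Model of the commit limit, in pages, under the stated assumptions."""
    if overcommit_kbytes:
        allowed = overcommit_kbytes // PAGE_SIZE_KB
    else:
        allowed = total_ram_pages * overcommit_ratio // 100
    return allowed + swap_pages


# 2 TiB of RAM expressed in 4 kB pages.
ram_pages = 2 * 2 ** 40 // (PAGE_SIZE_KB * 1024)

# One step of overcommit_ratio (50 -> 51), converted to GiB.
step_pages = (commit_limit_pages(ram_pages, 0, 51)
              - commit_limit_pages(ram_pages, 0, 50))
step_gib = step_pages * PAGE_SIZE_KB / (1024 * 1024)
```

    With these inputs one ratio step is a little over 20 GiB, whereas
    overcommit_kbytes can move the limit one page at a time.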
     

17 Dec, 2013

1 commit

  • commit 887c290e (sched/numa: Decide whether to favour task or group weights
    based on swap candidate relationships) dropped the check against
    sysctl_numa_balancing_settle_count, so this patch removes the sysctl.

    Signed-off-by: Wanpeng Li
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Naoya Horiguchi
    Link: http://lkml.kernel.org/r/1386833006-6600-1-git-send-email-liwanp@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     

13 Nov, 2013

2 commits

  • Some setuid binaries will allow reading of files which have read
    permission by the real user id. This is problematic with files which
    use %pK because the file access permission is checked at open() time,
    but the kptr_restrict setting is checked at read() time. If a setuid
    binary opens a %pK file as an unprivileged user, and then elevates
    permissions before reading the file, then kernel pointer values may be
    leaked.

    This happens for example with the setuid pppd application on Ubuntu 12.04:

    $ head -1 /proc/kallsyms
    00000000 T startup_32

    $ pppd file /proc/kallsyms
    pppd: In file /proc/kallsyms: unrecognized option 'c1000000'

    This will only leak the pointer value from the first line, but other
    setuid binaries may leak more information.

    Fix this by adding a check that in addition to the current process having
    CAP_SYSLOG, that effective user and group ids are equal to the real ids.
    If a setuid binary reads the contents of a file which uses %pK then the
    pointer values will be printed as NULL if the real user is unprivileged.

    Update the sysctl documentation to reflect the changes, and also correct
    the documentation to state that kptr_restrict=0 is the default.

    This is only a temporary solution to the issue. The correct solution is
    to do the permission check at open() time on files, and to replace %pK
    with a function which checks the open() time permission. %pK uses in
    printk should be removed since no sane permission check can be done, and
    instead protected by using dmesg_restrict.

    Signed-off-by: Ryan Mallon
    Cc: Kees Cook
    Cc: Alexander Viro
    Cc: Joe Perches
    Cc: "Eric W. Biederman"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ryan Mallon
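
    The added check can be summarised as a predicate. This is a sketch:
    the function and parameter names are hypothetical, and the handling of
    kptr_restrict values 0 and 2 (never hide, always hide) is stated here
    as an assumption about the surrounding sysctl semantics rather than
    something this commit message spells out.

```python
def shows_real_pointer(kptr_restrict: int, has_cap_syslog: bool,
                       uid: int, euid: int, gid: int, egid: int) -> bool:
    """Return True if a %pK value would be printed unmodified."""
    if kptr_restrict == 0:
        # Assumed default: no restriction.
        return True
    if kptr_restrict == 1:
        # The fix: CAP_SYSLOG alone is not enough; the effective ids
        # must also equal the real ids, which rules out setuid readers.
        return has_cap_syslog and uid == euid and gid == egid
    # Assumed: any higher setting hides pointers from everyone.
    return False
```

    A setuid-root binary started by user 1000 has uid 1000 and euid 0, so
    the predicate is false even though the process holds CAP_SYSLOG.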
     
  • Now dirty_background_ratio/dirty_ratio are expressed as a percentage of
    total available memory, which consists of free pages and reclaimable
    pages. The number of these pages is not equal to the total system memory.
    But they are described as a percentage of total system memory in
    Documentation/sysctl/vm.txt. So we need to fix the descriptions to avoid
    misunderstanding.

    Signed-off-by: Zheng Liu
    Cc: Rob Landley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zheng Liu
     

09 Oct, 2013

5 commits

  • Shared faults can lead to lots of unnecessary page migrations,
    slowing down the system, and causing private faults to hit the
    per-pgdat migration ratelimit.

    This patch adds sysctl numa_balancing_migrate_deferred, which specifies
    how many shared page migrations to skip unconditionally, after each page
    migration that is skipped because it is a shared fault.

    This reduces the number of page migrations back and forth in
    shared fault situations. It also gives a strong preference to
    the tasks that are already running where most of the memory is,
    and to moving the other tasks to near the memory.

    Testing this with a much higher scan rate than the default
    still seems to result in fewer page migrations than before.

    Memory seems to be somewhat better consolidated than previously,
    with multi-instance specjbb runs on a 4 node system.

    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-62-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • With scan rate adaptions based on whether the workload has properly
    converged or not there should be no need for the scan period reset
    hammer. Get rid of it.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-60-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • This patch favours moving tasks towards the NUMA node that recorded a higher
    number of NUMA faults during active load balancing. Ideally this is
    self-reinforcing as the longer the task runs on that node, the more faults
    it should incur, causing task_numa_placement to keep the task running on that
    node. In reality a big weakness is that the node's CPUs can be overloaded
    and it would be more efficient to queue tasks on an idle node and migrate
    to the new node. This would require additional smarts in the balancer so
    for now the balancer will simply prefer to place the task on the preferred
    node for a number of PTE scans, which is controlled by the
    numa_balancing_settle_count sysctl. Once the settle_count number of scans
    has completed, the scheduler is free to place the task on an alternative
    node if the load is imbalanced.

    [srikar@linux.vnet.ibm.com: Fixed statistics]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    [ Tunable and use higher faults instead of preferred. ]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-23-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • The NUMA PTE scan rate is controlled with a combination of the
    numa_balancing_scan_period_min, numa_balancing_scan_period_max and
    numa_balancing_scan_size. This scan rate is independent of the size
    of the task and as an aside it is further complicated by the fact that
    numa_balancing_scan_size controls how many pages are marked pte_numa and
    not how much virtual memory is scanned.

    In combination, it is almost impossible to meaningfully tune the min and
    max scan periods and reasoning about performance is complex when the time
    to complete a full scan is partially a function of the task's memory
    size. This patch alters the semantic of the min and max tunables to be
    about tuning the length of time it takes to complete a scan of a task's
    occupied virtual address space. Conceptually this is a lot easier to
    understand. There
    is a "sanity" check to ensure the scan rate is never extremely fast based on
    the amount of virtual memory that should be scanned in a second. The default
    of 2.5G seems arbitrary but it is to have the maximum scan rate after the
    patch roughly match the maximum scan rate before the patch was applied.

    On a similar note, numa_scan_period is in milliseconds and not
    jiffies. Properly placed pages slow the scanning rate but adding 10 jiffies
    to numa_scan_period means that the rate at which scanning slows depends on
    HZ, which is confusing. Get rid of the jiffies_to_msec conversion and treat
    it as ms.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-18-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-3-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

12 Sep, 2013

2 commits

  • Add a new %P variable to be used in core_pattern. This variable contains
    the global PID (PID in the init namespace) as %p contains the PID in the
    current namespace which isn't always what we want.

    The main use for this is to make it easier to handle crashes that happened
    within a container. With that new variable it's possible to have the
    crashes dumped into the container or forwarded to the host with the right
    PID (from the host's point of view).

    Signed-off-by: Stéphane Graber
    Reported-by: Hans Feldt
    Cc: Alexander Viro
    Cc: Eric W. Biederman
    Cc: Andy Whitcroft
    Acked-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stéphane Graber
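
    The difference between %p and %P can be illustrated with a toy
    expander. This is a sketch only: it handles just %p, %P and %%, leaves
    every other specifier untouched, and the function name and PID values
    are made up for the example.

```python
def expand_core_pattern(pattern: str, ns_pid: int, global_pid: int) -> str:
    """Expand %p (PID in the current namespace) and %P (global PID)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%" and i + 1 < len(pattern):
            spec = pattern[i + 1]
            if spec == "%":
                out.append("%")
            elif spec == "p":
                out.append(str(ns_pid))
            elif spec == "P":
                out.append(str(global_pid))
            else:
                # Other specifiers are passed through in this sketch.
                out.append("%" + spec)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

    For a process that is PID 12 inside its container and PID 34567 in
    the init namespace, "core.%p.%P" expands to "core.12.34567".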
     
  • Now hugepage migration is enabled, although restricted on pmd-based
    hugepages for now (due to lack of testing.) So we should allocate
    migratable hugepages from ZONE_MOVABLE if possible.

    This patch makes GFP flags in hugepage allocation dependent on migration
    support, not only the value of hugepages_treat_as_movable. It provides no
    change in behavior for architectures which do not support hugepage
    migration.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Cc: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

31 Aug, 2013

1 commit

  • Until now, the pfifo_fast queue discipline has been used by default
    for all devices. But we have better choices now.

    This patch allows setting the default queueing discipline with sysctl.
    This allows easy use of better queueing disciplines on all devices
    without having to use tc qdisc scripts. It is intended to allow
    an easy path for distributions to make fq_codel or sfq the default
    qdisc.

    This patch also makes pfifo_fast more of a first class qdisc, since
    it is now possible to manually override the default and explicitly
    use pfifo_fast. The behavior for systems that do not use the sysctl
    is unchanged; they still get pfifo_fast.

    Also removes leftover random # in sysctl net core.

    Signed-off-by: Stephen Hemminger
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    stephen hemminger
     

02 Aug, 2013

1 commit


11 Jul, 2013

1 commit

  • Rename LL_SO to BUSY_POLL_SO
    Rename sysctl_net_ll_{read,poll} to sysctl_busy_{read,poll}
    Fix up users of these variables.
    Fix documentation for sysctl.

    a patch for the socket.7 man page will follow separately,
    because of limitations of my mail setup.

    Signed-off-by: Eliezer Tamir
    Signed-off-by: David S. Miller

    Eliezer Tamir
     

10 Jul, 2013

2 commits

  • Pull networking updates from David Miller:
    "This is a re-do of the net-next pull request for the current merge
    window. The only difference from the one I made the other day is that
    this has Eliezer's interface renames and the timeout handling changes
    made based upon your feedback, as well as a few bug fixes that have
    trickled in.

    Highlights:

    1) Low latency device polling, eliminating the cost of interrupt
    handling and context switches. Allows direct polling of a network
    device from socket operations, such as recvmsg() and poll().

    Currently ixgbe, mlx4, and bnx2x support this feature.

    Full high level description, performance numbers, and design in
    commit 0a4db187a999 ("Merge branch 'll_poll'")

    From Eliezer Tamir.

    2) With the routing cache removed, ip_check_mc_rcu() gets exercised
    more than ever before in the case where we have lots of multicast
    addresses. Use a hash table instead of a simple linked list, from
    Eric Dumazet.

    3) Add driver for Atheros QCA98xx 802.11ac wireless devices, from
    Bartosz Markowski, Janusz Dziedzic, Kalle Valo, Marek Kwaczynski,
    Marek Puzyniak, Michal Kazior, and Sujith Manoharan.

    4) Support reporting the TUN device persist flag to userspace, from
    Pavel Emelyanov.

    5) Allow controlling network device VF link state using netlink, from
    Rony Efraim.

    6) Support GRE tunneling in openvswitch, from Pravin B Shelar.

    7) Adjust SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF for modern times, from
    Daniel Borkmann and Eric Dumazet.

    8) Allow controlling of TCP quickack behavior on a per-route basis,
    from Cong Wang.

    9) Several bug fixes and improvements to vxlan from Stephen
    Hemminger, Pravin B Shelar, and Mike Rapoport. In particular,
    support receiving on multiple UDP ports.

    10) Major cleanups, particularly in the area of debugging and cookie
    lifetime handling, to the SCTP protocol code. From Daniel
    Borkmann.

    11) Allow packets to cross network namespaces when traversing tunnel
    devices. From Nicolas Dichtel.

    12) Allow monitoring netlink traffic via AF_PACKET sockets, in a
    manner akin to how we monitor real network traffic via ptype_all.
    From Daniel Borkmann.

    13) Several bug fixes and improvements for the new alx device driver,
    from Johannes Berg.

    14) Fix scalability issues in the netem packet scheduler's time queue,
    by using an rbtree. From Eric Dumazet.

    15) Several bug fixes in TCP loss recovery handling, from Yuchung
    Cheng.

    16) Add support for GSO segmentation of MPLS packets, from Simon
    Horman.

    17) Make network notifiers have a real data type for the opaque
    pointer that's passed into them. Use this to properly handle
    network device flag changes in arp_netdev_event(). From Jiri
    Pirko and Timo Teräs.

    18) Convert several drivers over to module_pci_driver(), from Peter
    Huewe.

    19) tcp_fixup_rcvbuf() can loop 500 times over loopback, just use a
    O(1) calculation instead. From Eric Dumazet.

    20) Support setting of explicit tunnel peer addresses in ipv6, just
    like ipv4. From Nicolas Dichtel.

    21) Protect x86 BPF JIT against spraying attacks, from Eric Dumazet.

    22) Prevent a single high rate flow from overrunning an individual cpu
    during RX packet processing via selective flow shedding. From
    Willem de Bruijn.

    23) Don't use spinlocks in TCP md5 signing fast paths, from Eric
    Dumazet.

    24) Don't just drop GSO packets which are above the TBF scheduler's
    burst limit, chop them up so they are in-bounds instead. Also
    from Eric Dumazet.

    25) VLAN offloads are missed when configured on top of a bridge, fix
    from Vlad Yasevich.

    26) Support IPV6 in ping sockets. From Lorenzo Colitti.

    27) Receive flow steering targets should be updated at poll() time
    too, from David Majnemer.

    28) Fix several corner case regressions in PMTU/redirect handling due
    to the routing cache removal, from Timo Teräs.

    29) We have to be mindful of ipv4 mapped ipv6 sockets in
    udp_v6_push_pending_frames(). From Hannes Frederic Sowa.

    30) Fix L2TP sequence number handling bugs, from James Chapman."

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1214 commits)
    drivers/net: caif: fix wrong rtnl_is_locked() usage
    drivers/net: enic: release rtnl_lock on error-path
    vhost-net: fix use-after-free in vhost_net_flush
    net: mv643xx_eth: do not use port number as platform device id
    net: sctp: confirm route during forward progress
    virtio_net: fix race in RX VQ processing
    virtio: support unlocked queue poll
    net/cadence/macb: fix bug/typo in extracting gem_irq_read_clear bit
    Documentation: Fix references to defunct linux-net@vger.kernel.org
    net/fs: change busy poll time accounting
    net: rename low latency sockets functions to busy poll
    bridge: fix some kernel warning in multicast timer
    sfc: Fix memory leak when discarding scattered packets
    sit: fix tunnel update via netlink
    dt:net:stmmac: Add dt specific phy reset callback support.
    dt:net:stmmac: Add support to dwmac version 3.610 and 3.710
    dt:net:stmmac: Allocate platform data only if its NULL.
    net:stmmac: fix memleak in the open method
    ipv6: rt6_check_neigh should successfully verify neigh if no NUD information are available
    net: ipv6: fix wrong ping_v6_sendmsg return value
    ...

    Linus Torvalds
     
  • The default zonelist order selector will select "node" order if any node's
    DMA zone comprises greater than 70% of its local memory instead of 60%,
    according to default_zonelist_order::low_kmem_size > total * 70/100.

    Signed-off-by: Wanpeng Li
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

09 Jul, 2013

1 commit

  • Rename functions in include/net/ll_poll.h to busy wait.
    Clarify documentation about expected power use increase.
    Rename POLL_LL to POLL_BUSY_LOOP.
    Add need_resched() testing to poll/select busy loops.

    Note that in select and poll, can_busy_poll is dynamic and is
    updated continuously to reflect the existence of supported
    sockets with valid queue information.

    Signed-off-by: Eliezer Tamir
    Signed-off-by: David S. Miller

    Eliezer Tamir
     

05 Jul, 2013

1 commit

  • Pull trivial tree updates from Jiri Kosina:
    "The usual stuff from trivial tree"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
    treewide: relase -> release
    Documentation/cgroups/memory.txt: fix stat file documentation
    sysctl/net.txt: delete reference to obsolete 2.4.x kernel
    spinlock_api_smp.h: fix preprocessor comments
    treewide: Fix typo in printk
    doc: device tree: clarify stuff in usage-model.txt.
    open firmware: "/aliasas" -> "/aliases"
    md: bcache: Fixed a typo with the word 'arithmetic'
    irq/generic-chip: fix a few kernel-doc entries
    frv: Convert use of typedef ctl_table to struct ctl_table
    sgi: xpc: Convert use of typedef ctl_table to struct ctl_table
    doc: clk: Fix incorrect wording
    Documentation/arm/IXP4xx fix a typo
    Documentation/networking/ieee802154 fix a typo
    Documentation/DocBook/media/v4l fix a typo
    Documentation/video4linux/si476x.txt fix a typo
    Documentation/virtual/kvm/api.txt fix a typo
    Documentation/early-userspace/README fix a typo
    Documentation/video4linux/soc-camera.txt fix a typo
    lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment
    ...

    Linus Torvalds
     

26 Jun, 2013

1 commit

  • select/poll busy-poll support.

    Split sysctl value into two separate ones, one for read and one for poll.
    updated Documentation/sysctl/net.txt

    Add a new poll flag POLL_LL. When this flag is set, sock_poll will call
    sk_poll_ll if possible. sock_poll sets this flag in its return value
    to indicate to select/poll when a socket that can busy poll is found.

    When poll/select have nothing to report, call the low-level
    sock_poll again until we are out of time or we find something.

    Once the system call finds something, it stops setting POLL_LL, so it can
    return the result to the user ASAP.

    Signed-off-by: Eliezer Tamir
    Signed-off-by: David S. Miller

    Eliezer Tamir
     

24 Jun, 2013

1 commit


23 Jun, 2013

1 commit

  • This patch keeps track of how long perf's NMI handler is taking,
    and also calculates how many samples perf can take a second. If
    the sample length times the expected max number of samples
    exceeds a configurable threshold, it drops the sample rate.

    This way, we don't have a runaway sampling process eating up the
    CPU.

    This patch can tend to drop the sample rate down to a level where
    perf doesn't work very well. *BUT* the alternative is that my
    system hangs because it spends all of its time handling NMIs.

    I'll take a busted performance tool over an entire system that's
    busted and undebuggable any day.

    BTW, my suspicion is that there's still an underlying bug here.
    Using the HPET instead of the TSC is definitely a contributing
    factor, but I suspect there are some other things going on.
    But, I can't go dig down on a bug like that with my machine
    hanging all the time.

    Signed-off-by: Dave Hansen
    Acked-by: Peter Zijlstra
    Cc: paulus@samba.org
    Cc: acme@ghostprotocols.net
    Cc: Dave Hansen
    [ Prettified it a bit. ]
    Signed-off-by: Ingo Molnar

    Dave Hansen
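
    The throttling condition described above can be sketched as a budget
    check. Everything here is illustrative: the function name, the
    percent-of-CPU form of the threshold, and especially the choice of the
    new rate (the largest rate that fits the budget) are assumptions; the
    commit text only says the rate is dropped when the product exceeds a
    configurable threshold.

```python
NSEC_PER_SEC = 1_000_000_000


def throttle_sample_rate(avg_sample_ns: int, samples_per_sec: int,
                         max_cpu_percent: int) -> int:
    """Return the sample rate to use after applying the time budget."""
    budget_ns = NSEC_PER_SEC * max_cpu_percent // 100
    if avg_sample_ns * samples_per_sec <= budget_ns:
        # Handler time fits in the budget: keep the current rate.
        return samples_per_sec
    # Over budget: drop to a rate that fits (assumed policy), minimum 1.
    return max(1, budget_ns // avg_sample_ns)
```

    With a 25% budget, a handler averaging 1 ms per sample cannot sustain
    1000 samples per second, so this sketch lowers the rate to 250.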
     

18 Jun, 2013

1 commit

  • As per feedback from the netdev community, we change the buffer
    overflow protection algorithm in receiving sockets so that it
    always respects the nominal upper limit set in sk_rcvbuf.

    Instead of scaling up from a small sk_rcvbuf value, which leads to
    violation of the configured sk_rcvbuf limit, we now calculate the
    weighted per-message limit by scaling down from a much bigger value,
    still in the same field, according to the importance priority of the
    received message.

    To allow for administrative tunability of the socket receive buffer
    size, we create a tipc_rmem sysctl variable to allow the user to
    configure an even bigger value via sysctl command. It is a size of
    three (min/default/max) to be consistent with things like tcp_rmem.

    By default, the value initialized in tipc_rmem[1] is equal to the
    receive socket size needed by a TIPC_CRITICAL_IMPORTANCE message.
    This value is also set as the default value of sk_rcvbuf.

    Originally-by: Jon Maloy
    Cc: Neil Horman
    Cc: Jon Maloy
    [Ying: added sysctl variation to Jon's original patch]
    Signed-off-by: Ying Xue
    [PG: don't compile sysctl.c if not config'd; add Documentation]
    Signed-off-by: Paul Gortmaker
    Signed-off-by: David S. Miller

    Ying Xue
     

11 Jun, 2013

1 commit

  • Adds an ndo_ll_poll method and the code that supports it.
    This method can be used by low latency applications to busy-poll
    Ethernet device queues directly from the socket code.
    sysctl_net_ll_poll controls how many microseconds to poll.
    Default is zero (disabled).
    Individual protocol support will be added by subsequent patches.

    Signed-off-by: Alexander Duyck
    Signed-off-by: Jesse Brandeburg
    Signed-off-by: Eliezer Tamir
    Acked-by: Eric Dumazet
    Tested-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eliezer Tamir
     

28 May, 2013

2 commits


20 May, 2013

1 commit

  • This patch removes the mention of the sysfs net_device weight attribute
    (class/net//weight)
    in Documentation/sysctl/net.txt, since the net sysfs weight attribute
    was removed by the following patch:

    [NET]: Make NAPI polling independent of struct net_device objects
    bea3348eef27e6044b6161fd04c3152215f96411

    Signed-off-by: Rami Rosen
    Signed-off-by: David S. Miller

    Rami Rosen
     

30 Apr, 2013

2 commits

  • Add an admin_reserve_kbytes knob to allow admins to change the hardcoded
    memory reserve to something other than 3%, which may be multiple
    gigabytes on large memory systems. Only about 8MB is necessary to
    enable recovery in the default mode, and only a few hundred MB are
    required even when overcommit is disabled.

    This affects OVERCOMMIT_GUESS and OVERCOMMIT_NEVER.

    admin_reserve_kbytes is initialized to min(3% free pages, 8MB)

    I arrived at 8MB by summing the RSS of sshd or login, bash, and top.

    Please see first patch in this series for full background, motivation,
    testing, and full changelog.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: make init_admin_reserve() static]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
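
    The initial value quoted above, min(3% free pages, 8MB), works out as
    follows. This is a sketch assuming 4 kB pages and an exact 3/100
    factor; the kernel may approximate the percentage differently, and the
    function name is hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 kB pages


def default_admin_reserve_kbytes(free_pages: int) -> int:
    """Model of the initial admin_reserve_kbytes value, in kB."""
    free_kbytes = free_pages * PAGE_SIZE // 1024
    three_percent = free_kbytes * 3 // 100
    return min(three_percent, 8 * 1024)
```

    On a machine with about 4 GB free the 8 MB cap applies; on a machine
    with only 40 MB free the reserve shrinks to 3% of that, about 1.2 MB.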
     
  • Add user_reserve_kbytes knob.

    Limit the growth of the memory reserved for other user processes to
    min(3% current process size, user_reserve_pages). Only about 8MB is
    necessary to enable recovery in the default mode, and only a few hundred
    MB are required even when overcommit is disabled.

    user_reserve_pages defaults to min(3% free pages, 128MB)
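
    The two formulas above can be sketched in Python as follows. This is a
    hedged model, not the kernel code: the function names are hypothetical,
    all sizes are in kilobytes, and 3% is computed with integer division.

```python
MB_IN_KB = 1024  # kilobytes per megabyte


def default_user_reserve_kb(free_kb: int) -> int:
    """Default for the knob: min(3% of free memory, 128MB)."""
    return min(free_kb * 3 // 100, 128 * MB_IN_KB)


def effective_user_reserve_kb(process_size_kb: int, user_reserve_kb: int) -> int:
    """Reserve kept for other user processes: min(3% of process size, knob)."""
    return min(process_size_kb * 3 // 100, user_reserve_kb)


knob = default_user_reserve_kb(5725 * MB_IN_KB)                   # 5725MB free
big_job = effective_user_reserve_kb(30 * 1024 * MB_IN_KB, knob)   # 30GB process
small_job = effective_user_reserve_kb(1024 * MB_IN_KB, knob)      # 1GB process
print(knob, big_job, small_job)
```

    For the large process the knob caps the reserve; for the small one the
    3% term still applies, so behavior is unchanged where 3% is below the
    knob.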

    I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
    then adding the RSS of each.

    This only affects OVERCOMMIT_NEVER mode.

    Background

    1. user reserve

    __vm_enough_memory reserves a hardcoded 3% of the current process size for
    other applications when overcommit is disabled. This was done so that a
    user could recover if they launched a memory hogging process. Without the
    reserve, a user would easily run into a message such as:

    bash: fork: Cannot allocate memory

    2. admin reserve

    Additionally, a hardcoded 3% of free memory is reserved for root in both
    overcommit 'guess' and 'never' modes. This was intended to prevent a
    scenario where root cannot log in and perform recovery operations.

    Note that this reserve shrinks, and doesn't guarantee a useful reserve.

    Motivation

    The two hardcoded memory reserves should be updated to account for current
    memory sizes.

    Also, the admin reserve would be more useful if it didn't shrink too much.

    When the current code was originally written, 1GB was considered
    "enterprise". Now the 3% reserve can grow to multiple GB on large memory
    systems, and it only needs to be a few hundred MB at most to enable a user
    or admin to recover a system with an unwanted memory hogging process.

    I've found that reducing these reserves is especially beneficial for a
    specific type of application load:

    * single application system
    * one or few processes (e.g. one per core)
    * allocating all available memory
    * not initializing every page immediately
    * long running

    I've run scientific clusters with this sort of load. A long running job
    sometimes failed many hours (weeks of CPU time) into a calculation. They
    weren't initializing all of their memory immediately, and they weren't
    using calloc, so I put systems into overcommit 'never' mode. These
    clusters run diskless and have no swap.

    However, with the current reserves, a user wishing to allocate as much
    memory as possible to one process may be prevented from using, for
    example, almost 2GB out of 32GB.

    The effect is less, but still significant when a user starts a job with
    one process per core. I have repeatedly seen a set of processes
    requesting the same amount of memory fail because one of them could not
    allocate the amount of memory a user would expect to be able to allocate.
    This happens, for example, with Message Passing Interface (MPI) processes
    running one per core, and similarly with other parallel programming
    frameworks.

    Changing this reserve code will make the overcommit never mode more useful
    by allowing applications to allocate nearly all of the available memory.

    Also, the new admin_reserve_kbytes will be safer than the current behavior
    since the hardcoded 3% of available memory reserve can shrink to something
    useless in the case where applications have grabbed all available memory.

    Risks

    * "bash: fork: Cannot allocate memory"

    The downside of the first patch-- which creates a tunable user reserve
    that is only used in overcommit 'never' mode--is that an admin can set
    it so low that a user may not be able to kill their process, even if
    they already have a shell prompt.

    Of course, a user can get in the same predicament with the current 3%
    reserve--they just have to launch processes until 3% becomes negligible.

    * root-cant-log-in problem

    The second patch, adding the tunable admin_reserve_kbytes, allows
    the admin to shoot themselves in the foot by setting it too small. They
    can easily get the system into a state where root cannot log in.

    However, the new admin_reserve_kbytes will be safer than the current
    behavior since the hardcoded 3% of available memory reserve can shrink
    to something useless in the case where applications have grabbed all
    available memory.

    Alternatives

    * Memory cgroups provide a more flexible way to limit application memory.

    Not everyone wants to set up cgroups or deal with their overhead.

    * We could create a fourth overcommit mode which provides smaller reserves.

    The size of useful reserves may be drastically different depending
    on whether the system is embedded or enterprise.

    * Force users to initialize all of their memory or use calloc.

    Some users don't want/expect the system to overcommit when they malloc.
    Overcommit 'never' mode is for this scenario, and it should work well.

    The new user and admin reserve tunables are simple to use, with low
    overhead compared to cgroups. The patches preserve current behavior where
    3% of memory is less than 128MB, except that the admin reserve doesn't
    shrink to an unusable size under pressure. The code allows admins to tune
    for embedded and enterprise usage.

    FAQ

    * How is the root-cant-login problem addressed?
    What happens if admin_reserve_pages is set to 0?

    Root is free to shoot themselves in the foot by setting
    admin_reserve_kbytes too low.

    On x86_64, the minimum useful reserve is:
    8MB for overcommit 'guess'
    128MB for overcommit 'never'

    admin_reserve_pages defaults to min(3% free memory, 8MB)

    So, anyone switching to 'never' mode needs to adjust
    admin_reserve_pages.

    * How do you calculate a minimum useful reserve?

    A user or the admin needs enough memory to login and perform
    recovery operations, which includes, at a minimum:

    sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

    For overcommit 'guess', we can sum resident set sizes (RSS)
    because we only need enough memory to handle what the recovery
    programs will typically use. On x86_64 this is about 8MB.

    For overcommit 'never', we can take the max of their virtual sizes (VSZ)
    and add the sum of their RSS. We use VSZ instead of RSS because this mode
    forces us to ensure we can fulfill all of the requested memory allocations,
    even if the programs only use a fraction of what they ask for.
    On x86_64 this is about 128MB.
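
    The two calculations can be sketched as below. The per-program RSS and
    VSZ figures are made-up placeholders chosen only to show the arithmetic;
    they are not measurements from the commit, and the function names are
    hypothetical.

```python
def min_reserve_guess_kb(programs):
    """Overcommit 'guess': sum of the resident set sizes."""
    return sum(p["rss_kb"] for p in programs)


def min_reserve_never_kb(programs):
    """Overcommit 'never': largest virtual size plus the sum of the RSS."""
    return max(p["vsz_kb"] for p in programs) + sum(p["rss_kb"] for p in programs)


# Placeholder sizes in kB (assumed values, not real measurements).
recovery_programs = [
    {"name": "sshd", "rss_kb": 4000, "vsz_kb": 100000},
    {"name": "bash", "rss_kb": 2000, "vsz_kb": 110000},
    {"name": "top", "rss_kb": 1500, "vsz_kb": 115000},
]

guess_kb = min_reserve_guess_kb(recovery_programs)
never_kb = min_reserve_never_kb(recovery_programs)
print(guess_kb, never_kb)
```

    On a real system the inputs would come from a tool such as ps, and the
    results will differ by architecture and distribution.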

    When swap is enabled, reserves are useful even when they are as
    small as 10MB, regardless of overcommit mode.

    When both swap and overcommit are disabled, then the admin should
    tune the reserves higher to be absolutely safe. Over 230MB each
    was safest in my testing.

    * What happens if user_reserve_pages is set to 0?

    Note, this only affects overcommit 'never' mode.

    Then a user will be able to allocate all available memory minus
    admin_reserve_kbytes.

    However, they will easily see a message such as:

    "bash: fork: Cannot allocate memory"

    And they won't be able to recover/kill their application.
    The admin should be able to recover the system if
    admin_reserve_kbytes is set appropriately.

    * What's the difference between overcommit 'guess' and 'never'?

    "Guess" allows an allocation if there are enough free + reclaimable
    pages. It has a hardcoded 3% of free pages reserved for root.

    "Never" allows an allocation if there is enough swap + a configurable
    percentage (default is 50) of physical RAM. It has a hardcoded 3% of
    free pages reserved for root, like "Guess" mode. It also has a
    hardcoded 3% of the current process size reserved for additional
    applications.
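
    The "Never" rule (swap plus a configurable percentage of physical RAM)
    can be sketched as below. This is a simplified model: it ignores the
    root and user reserves, and the function names are hypothetical. The
    sample figures are the RAM and swap sizes of the test VM described
    later in this message.

```python
MB_IN_KB = 1024  # kilobytes per megabyte


def never_mode_commit_limit_kb(ram_kb: int, swap_kb: int, ratio_percent: int = 50) -> int:
    """Commit limit: swap + ratio_percent of physical RAM (reserves ignored)."""
    return swap_kb + ram_kb * ratio_percent // 100


def allows(request_kb: int, committed_kb: int, limit_kb: int) -> bool:
    """An allocation is allowed while total committed memory stays within the limit."""
    return committed_kb + request_kb <= limit_kb


ram_kb = 5725 * MB_IN_KB
swap_kb = 4096 * MB_IN_KB
limit_default = never_mode_commit_limit_kb(ram_kb, swap_kb)       # ratio 50
limit_full = never_mode_commit_limit_kb(ram_kb, swap_kb, 100)     # ratio 100
print(limit_default, limit_full)
```

    With the ratio at 100, a single 5460MB request fits under the limit,
    while an 8GB request would exceed the default limit.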

    * Why is overcommit 'guess' not suitable even when an app eventually
    writes to every page? It takes free pages, file pages, available
    swap pages, and reclaimable slab pages into consideration. In other
    words, if it accounts for all available pages, why isn't it suitable?

    Because it only looks at the present state of the system. It
    does not take into account the memory that other applications have
    malloced, but haven't initialized yet. It overcommits the system.

    Test Summary

    There was little change in behavior in the default overcommit 'guess'
    mode with swap enabled before and after the patch. This was expected.

    Systems run most predictably (i.e. no oom kills) in overcommit 'never'
    mode with swap enabled. This also allowed the most memory to be allocated
    to a user application.

    Overcommit 'guess' mode without swap is a bad idea. It is easy to
    crash the system. None of the other tested combinations crashed.
    This matches my experience on the Roadrunner supercomputer.

    Without the tunable user reserve, a system in overcommit 'never' mode
    and without swap does not allow the user to recover, although the
    admin can.

    With the new tunable reserves, a system in overcommit 'never' mode
    and without swap can be configured to:

    1. maximize user-allocatable memory, running close to the edge of
    recoverability

    2. maximize recoverability, sacrificing allocatable memory to
    ensure that a user cannot take down a system

    Test Description

    Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap

    System is booted into multiuser console mode, with unnecessary services
    turned off. Caches were dropped before each test.

    Hogs are user memtester processes that attempt to allocate all free memory
    as reported by /proc/meminfo

    In overcommit 'never' mode, memory_ratio=100

    Test Results

    3.9.0-rc1-mm1

    Overcommit | Swap | Hogs | MB Got/Wanted | OOMs  | User Recovery | Admin Recovery
    ---------- | ---- | ---- | ------------- | ----- | ------------- | --------------
    guess      | yes  | 1    | 5432/5432     | no    | yes           | yes
    guess      | yes  | 4    | 5444/5444     | 1     | yes           | yes
    guess      | no   | 1    | 5302/5449     | no    | yes           | yes
    guess      | no   | 4    | -             | crash | no            | no

    never      | yes  | 1    | 5460/5460     | 1     | yes           | yes
    never      | yes  | 4    | 5460/5460     | 1     | yes           | yes
    never      | no   | 1    | 5218/5432     | no    | no            | yes
    never      | no   | 4    | 5203/5448     | no    | no            | yes

    3.9.0-rc1-mm1-tunablereserves

    User and Admin Recovery show their respective reserves, if applicable.

    Overcommit | Swap | Hogs | MB Got/Wanted | OOMs  | User Recovery | Admin Recovery
    ---------- | ---- | ---- | ------------- | ----- | ------------- | --------------
    guess      | yes  | 1    | 5419/5419     | no    | - yes         | 8MB yes
    guess      | yes  | 4    | 5436/5436     | 1     | - yes         | 8MB yes
    guess      | no   | 1    | 5440/5440     | *     | - yes         | 8MB yes
    guess      | no   | 4    | -             | crash | - no          | 8MB no

    * process would successfully mlock, then the oom killer would pick it

    never      | yes  | 1    | 5446/5446     | no    | 10MB yes      | 20MB yes
    never      | yes  | 4    | 5456/5456     | no    | 10MB yes      | 20MB yes
    never      | no   | 1    | 5387/5429     | no    | 128MB no      | 8MB barely
    never      | no   | 1    | 5323/5428     | no    | 226MB barely  | 8MB barely
    never      | no   | 1    | 5323/5428     | no    | 226MB barely  | 8MB barely

    never      | no   | 1    | 5359/5448     | no    | 10MB no       | 10MB barely

    never      | no   | 1    | 5323/5428     | no    | 0MB no        | 10MB barely
    never      | no   | 1    | 5332/5428     | no    | 0MB no        | 50MB yes
    never      | no   | 1    | 5293/5429     | no    | 0MB no        | 90MB yes

    never      | no   | 1    | 5001/5427     | no    | 230MB yes     | 338MB yes
    never      | no   | 4*   | 4998/5424     | no    | 230MB yes     | 338MB yes

    * more memtesters were launched, able to allocate approximately another 100MB

    Future Work

    - Test larger memory systems.

    - Test an embedded image.

    - Test other architectures.

    - Time malloc microbenchmarks.

    - Would it be useful to be able to set overcommit policy for
    each memory cgroup?

    - Some lines are slightly above 80 chars.
    Perhaps define a macro to convert between pages and kb?
    Other places in the kernel do this.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: make init_user_reserve() static]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
     

05 Jan, 2013

2 commits

  • Signed-off-by: Carlos Alberto Lopez Perez
    Cc: Rob Landley
    Cc: Larry Finger
    Cc: Neil Horman
    Cc: Mitsuo Hayasaka
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carlos Alberto Lopez Perez
     
  • Add 3 new variables and sysctls to tune them (one "next_id" variable
    each for messages, semaphores and shared memory). Each variable can be
    used to set the desired id for the next allocated IPC object. By default
    it is equal to -1 and the old behaviour is preserved. If the variable is
    non-negative, the desired id is extracted from it and used as the
    start value when searching for a free IDR slot.

    Notes:

    1) This patch doesn't guarantee that the new object will have the desired
    id. So it's up to user space how to handle a new object with the wrong id.

    2) After a successful id allocation attempt, "next_id" will be set back
    to -1 (if it was non-negative).
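
    The semantics described above can be modeled with a small Python toy.
    The class below is a hypothetical stand-in for one IPC id table, not
    the kernel's IDR: it assumes the default path hands out the lowest free
    slot, and it ignores details such as sequence numbers.

```python
class ToyIpcIds:
    """Toy model of one IPC id table with a "next_id" hint (illustrative only)."""

    def __init__(self):
        self.used = set()
        self.next_id = -1  # -1 means: keep the old behaviour

    def allocate(self) -> int:
        # A non-negative next_id is only a starting point for the search.
        start = self.next_id if self.next_id >= 0 else 0
        new_id = start
        while new_id in self.used:
            new_id += 1
        self.used.add(new_id)
        # After a successful allocation the hint is reset to -1.
        if self.next_id >= 0:
            self.next_id = -1
        return new_id


ids = ToyIpcIds()
a = ids.allocate()   # no hint set
ids.next_id = 5
b = ids.allocate()   # hint honoured
ids.next_id = 5
c = ids.allocate()   # slot 5 is taken, so the search moves on
d = ids.allocate()   # hint was reset, back to the default path
print(a, b, c, d)
```

    The third allocation shows note 1 in action: the caller asked for 5 but
    received a different id, so user space has to check the result.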

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Stanislav Kinsbursky
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Al Viro
    Cc: KOSAKI Motohiro
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislav Kinsbursky
     

06 Oct, 2012

1 commit

  • Some coredump handlers want to create a core file in a way compatible with
    standard behavior. Standard behavior with fs.suid_dumpable = 2 is to
    create core file with uid=gid=0. However, there was no way for coredump
    handler to know that the process being dumped was suid'ed.

    This patch adds the new %d specifier for format_corename() which simply
    reports __get_dumpable(mm->flags). This is compatible with the
    /proc/sys/fs/suid_dumpable values we already have.
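
    To illustrate how a handler might consume the new specifier, here is a
    toy Python expander for a core pattern. It is a sketch under stated
    assumptions: it handles only %%, %p and %d, it silently drops any other
    specifier, and the function name and sample pattern are hypothetical.
    It does not reproduce format_corename().

```python
def expand_core_pattern(pattern: str, pid: int, dumpable: int) -> str:
    """Expand %%, %p (pid) and %d (dump mode); other specifiers are dropped."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch != "%":
            out.append(ch)
            i += 1
            continue
        spec = pattern[i + 1] if i + 1 < len(pattern) else ""
        if spec == "%":
            out.append("%")
        elif spec == "p":
            out.append(str(pid))
        elif spec == "d":
            out.append(str(dumpable))
        i += 2  # skip the '%' and the specifier character
    return "".join(out)


name = expand_core_pattern("core.%p.mode%d", pid=1234, dumpable=2)
# With fs.suid_dumpable = 2 the standard behavior is a core file with uid=gid=0.
owned_by_root = name.endswith("mode2")
print(name, owned_by_root)
```

    A handler that receives the dump mode this way can tell that the
    process was suid'ed and choose the core file ownership accordingly.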

    Addresses https://bugzilla.redhat.com/show_bug.cgi?id=787135

    Developed during a discussion with Denys Vlasenko.

    Signed-off-by: Oleg Nesterov
    Cc: Denys Vlasenko
    Cc: Alex Kelly
    Cc: Andi Kleen
    Cc: Cong Wang
    Cc: Jiri Moskovcak
    Acked-by: Neil Horman
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

04 Aug, 2012

1 commit


02 Aug, 2012

1 commit

  • Pull second vfs pile from Al Viro:
    "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
    deadlock reproduced by xfstests 068), symlink and hardlink restriction
    patches, plus assorted cleanups and fixes.

    Note that another fsfreeze deadlock (emergency thaw one) is *not*
    dealt with - the series by Fernando conflicts a lot with Jan's, breaks
    userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
    for massive vfsmount leak; this is going to be handled next cycle.
    There probably will be another pull request, but that stuff won't be
    in it."

    Fix up trivial conflicts due to unrelated changes next to each other in
    drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
    delousing target_core_file a bit
    Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
    fs: Remove old freezing mechanism
    ext2: Implement freezing
    btrfs: Convert to new freezing mechanism
    nilfs2: Convert to new freezing mechanism
    ntfs: Convert to new freezing mechanism
    fuse: Convert to new freezing mechanism
    gfs2: Convert to new freezing mechanism
    ocfs2: Convert to new freezing mechanism
    xfs: Convert to new freezing code
    ext4: Convert to new freezing mechanism
    fs: Protect write paths by sb_start_write - sb_end_write
    fs: Skip atime update on frozen filesystem
    fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
    fs: Improve filesystem freezing handling
    switch the protection of percpu_counter list to spinlock
    nfsd: Push mnt_want_write() outside of i_mutex
    btrfs: Push mnt_want_write() outside of i_mutex
    fat: Push mnt_want_write() outside of i_mutex
    ...

    Linus Torvalds
     

01 Aug, 2012

1 commit

  • The number of ptes and swap entries are used in the oom killer's badness
    heuristic, so they should be shown in the tasklist dump.

    This patch adds those fields and replaces cpu and oom_adj values that are
    currently emitted. Cpu isn't interesting and oom_adj is deprecated and
    will be removed later this year; the same information is already displayed
    as oom_score_adj, which is used internally.

    At the same time, make the documentation a little more clear to state this
    information is helpful to determine why the oom killer chose the task it
    did to kill.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes