20 May, 2016

1 commit

  • Provide /proc/sys/vm/stat_refresh to force an immediate update of
    per-cpu into global vmstats: useful to avoid a sleep(2) or whatever
    before checking counts when testing. Originally added to work around a
    bug which left counts stranded indefinitely on a cpu going idle (an
    inaccuracy magnified when small below-batch numbers represent "huge"
    amounts of memory), but I believe that bug is now fixed: nonetheless,
    this is still a useful knob.

    Its schedule_on_each_cpu() is probably too expensive just to fold into
    reading /proc/meminfo itself: give this mode 0600 to prevent abuse.
    Allow a write or a read to do the same: nothing to read, but "grep -h
    Shmem /proc/sys/vm/stat_refresh /proc/meminfo" is convenient. Oh, and
    since global_page_state() itself is careful to disguise any underflow as
    0, hack in an "Invalid argument" and pr_warn() if a counter is negative
    after the refresh - this helped to fix a misaccounting of
    NR_ISOLATED_FILE in my migration code.

    But on recent kernels, I find that NR_ALLOC_BATCH and NR_PAGES_SCANNED
    go negative some of the time. I have not yet worked out why, but
    have no evidence that it's actually harmful. Punt for the moment by
    just ignoring the anomaly on those.
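
    A minimal sketch of the test-side usage described above, assuming
    Python. The helper names (parse_meminfo_kb, refreshed_counter_kb) are
    hypothetical. The file is read first so that per-cpu counts are folded
    into the global ones, then a counter is parsed from /proc/meminfo.
    Because of the 0600 mode, reading stat_refresh requires root, so the
    parser is split out and can be exercised on sample text.

```python
import re

STAT_REFRESH = "/proc/sys/vm/stat_refresh"  # path from the commit message, mode 0600
MEMINFO = "/proc/meminfo"


def parse_meminfo_kb(text, field):
    """Return the kB value of `field` from /proc/meminfo-style text, or None."""
    match = re.search(rf"^{re.escape(field)}:\s+(\d+)\s+kB$", text, re.MULTILINE)
    return int(match.group(1)) if match else None


def refreshed_counter_kb(field, refresh_path=STAT_REFRESH, meminfo_path=MEMINFO):
    """Force a vmstat fold, then read one counter (hypothetical helper).

    The read of stat_refresh returns no data; the access itself triggers the
    refresh. An OSError with errno EINVAL would correspond to the
    "Invalid argument" reported when a counter is negative after the refresh.
    """
    with open(refresh_path) as f:
        f.read()
    with open(meminfo_path) as f:
        return parse_meminfo_kb(f.read(), field)


# Sample text with made-up values, so the parser can run without root.
SAMPLE = "MemTotal:       16315924 kB\nShmem:             123456 kB\n"
print(parse_meminfo_kb(SAMPLE, "Shmem"))
```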

    Signed-off-by: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Andres Lagar-Cavilla
    Cc: Yang Shi
    Cc: Ning Qu
    Cc: Mel Gorman
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

18 May, 2016

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    1) Support SPI based w5100 devices, from Akinobu Mita.

    2) Partial Segmentation Offload, from Alexander Duyck.

    3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.

    4) Allow cls_flower stats offload, from Amir Vadai.

    5) Implement bpf blinding, from Daniel Borkmann.

    6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
    actually using FASYNC these atomics are superfluous. From Eric
    Dumazet.

    7) Run TCP more preemptibly, also from Eric Dumazet.

    8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
    driver, from Gal Pressman.

    9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.

    10) Improve BPF usage documentation, from Jesper Dangaard Brouer.

    11) Support tunneling offloads in qed, from Manish Chopra.

    12) aRFS offloading in mlx5e, from Maor Gottlieb.

    13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
    Leitner.

    14) Add MSG_EOR support to TCP, this allows controlling packet
    coalescing on application record boundaries for more accurate
    socket timestamp sampling. From Martin KaFai Lau.

    15) Fix alignment of 64-bit netlink attributes across the board, from
    Nicolas Dichtel.

    16) Per-vlan stats in bridging, from Nikolay Aleksandrov.

    17) Several conversions of drivers to ethtool ksettings, from Philippe
    Reynes.

    18) Checksum neutral ILA in ipv6, from Tom Herbert.

    19) Factorize all of the various marvell dsa drivers into one, from
    Vivien Didelot

    20) Add VF support to qed driver, from Yuval Mintz"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
    Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
    Revert "phy dp83867: Make rgmii parameters optional"
    r8169: default to 64-bit DMA on recent PCIe chips
    phy dp83867: Make rgmii parameters optional
    phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
    bpf: arm64: remove callee-save registers use for tmp registers
    asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
    switchdev: pass pointer to fib_info instead of copy
    net_sched: close another race condition in tcf_mirred_release()
    tipc: fix nametable publication field in nl compat
    drivers: net: Don't print unpopulated net_device name
    qed: add support for dcbx.
    ravb: Add missing free_irq() calls to ravb_close()
    qed: Remove a stray tab
    net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fec-mpc52xx: use phydev from struct net_device
    bpf, doc: fix typo on bpf_asm descriptions
    stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
    net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fs-enet: use phydev from struct net_device
    ...

    Linus Torvalds
     

17 May, 2016

1 commit

  • This work adds a generic facility for use from eBPF JIT compilers
    that allows for further hardening of JIT generated images through
    blinding constants. In response to the original work on BPF JIT
    spraying published by Keegan McAllister [1], most BPF JITs were
    changed to make images read-only and start at a randomized offset
    in the page, where the rest was filled with trap instructions. We
    have this nowadays in x86, arm, arm64 and s390 JIT compilers.
    Additionally, later work also made eBPF interpreter images read
    only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
    arm, arm64 and s390 archs as well currently. This is done by
    default for mentioned JITs when JITing is enabled. Furthermore,
    we had a generic and configurable constant blinding facility on our
    todo for quite some time now to further make spraying harder, and
    first implementation since around netconf 2016.

    We found that for systems where untrusted users can load cBPF/eBPF
    code where JIT is enabled, start offset randomization helps a bit
    to make jumps into crafted payload harder, but in case where larger
    programs that cross page boundary are injected, we again have some
    part of the program opcodes at a page start offset. With improved
    guessing and more reliable payload injection, chances can increase
    to jump into such payload. Elena Reshetova recently wrote a test
    case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
    can leave some more room for payloads. Note that for all this,
    additional bugs in the kernel are still required to make the jump
    (and of course to guess right, to not jump into a trap) and naturally
    the JIT must be enabled, which is disabled by default.

    For helping mitigation, the general idea is to provide an option
    bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
    that for cases where JIT should be enabled for performance reasons,
    the generated image can be further hardened with blinding constants
    for unprivileged users (bpf_jit_harden == 1), with trading off
    performance for these, but not for privileged ones. We also added
    the option of blinding for all users (bpf_jit_harden == 2), which
    is quite helpful for testing, e.g. with test_bpf.ko. No further
    hardening levels are intended for the bpf_jit_harden switch; the
    rationale is to keep it dead simple to use as on/off. Since this
    functionality would need to be duplicated over and over for JIT
    compilers to use, which are already complex enough, we provide a
    generic eBPF byte-code level based blinding implementation, which is
    then just transparently JITed. JIT compilers need to make only a few
    changes to integrate this facility and can be migrated one by one.

    This option is for eBPF JITs and will be used in x86, arm64, s390
    without too much effort, and soon ppc64 JITs, thus that native eBPF
    can be blinded as well as cBPF to eBPF migrations, so that both can
    be covered with a single implementation. The rule for JITs is that
    bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
    and in case blinding is disabled, we follow normally with JITing the
    passed program. In case blinding is enabled and we fail during the
    process of blinding itself, we must return with the interpreter.
    Similarly, in case the JITing process after the blinding failed, we
    return normally to the interpreter with the non-blinded code. Meaning,
    interpreter doesn't change in any way and operates on eBPF code as
    usual. For doing this pre-JIT blinding step, we need to make use of
    a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
    to the JIT and not in any way part of the eBPF architecture. It works
    in the same way as JITs internally make use of some helper registers
    when emitting code, only that here the helper register is one
    abstraction level higher in eBPF bytecode, but nevertheless in JIT
    phase. That helper register is needed since f.e. manually written
    program can issue loads to all registers of eBPF architecture.

    The core concept with the additional register is: blind out all 32
    and 64 bit constants by converting BPF_K based instructions into a
    small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
    is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
    and REG <OP> BPF_REG_AX, so actual operation on the target register
    is translated from BPF_K into BPF_X one that is operating on
    BPF_REG_AX's content. During rewriting phase when blinding, RND is
    newly generated via prandom_u32() for each processed instruction.
    64 bit loads are split into two 32 bit loads to make translation and
    patching not too complex. Only basic thing required by JITs is to
    call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
    pair, and to map BPF_REG_AX into an unused register.
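
    The rewrite described above can be sketched as follows, assuming
    Python and a toy instruction encoding. The tuples are illustrative and
    are not real eBPF opcodes; random.getrandbits() stands in for the
    kernel's prandom_u32(). A single "load immediate" is turned into the
    three-step sequence, and a tiny interpreter confirms the target
    register still ends up with the original constant.

```python
import random

MASK32 = 0xFFFFFFFF


def blind_mov_imm(dst, k_val, rnd=None):
    """Rewrite 'dst = K_VAL' into a blinded three-instruction sequence.

    Tuples are a toy encoding (op, dst, src_or_imm), not real eBPF insns.
    """
    if rnd is None:
        rnd = random.getrandbits(32)  # stand-in for prandom_u32()
    return [
        ("mov_imm", "AX", (rnd ^ k_val) & MASK32),  # BPF_REG_AX := RND ^ K_VAL
        ("xor_imm", "AX", rnd),                     # BPF_REG_AX ^= RND
        ("mov_reg", dst, "AX"),                     # register-based (BPF_X) form
    ]


def run(insns):
    """Interpret the toy instruction list and return the register file."""
    regs = {}
    for op, dst, src in insns:
        if op == "mov_imm":
            regs[dst] = src & MASK32
        elif op == "xor_imm":
            regs[dst] = (regs[dst] ^ src) & MASK32
        elif op == "mov_reg":
            regs[dst] = regs[src]
    return regs


payload = 0xA8909090  # constant taken from the unhardened extract
blinded = blind_mov_imm("R0", payload, rnd=0x4989B5F3)
immediates = [imm for op, _, imm in blinded if op != "mov_reg"]
print(hex(run(blinded)["R0"]), [hex(i) for i in immediates])
```

    With rnd fixed to 0x4989b5f3, the two immediates match the first
    mov/xor pair in the hardened extract (0xe1192563 and 0x4989b5f3), and
    the payload constant itself no longer appears in the instruction
    stream.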

    Small bpf_jit_disasm extract from [2] when applied to x86 JIT:

    echo 0 > /proc/sys/net/core/bpf_jit_harden

    ffffffffa034f5e9 + <x>:
    [...]
    39: mov $0xa8909090,%eax
    3e: mov $0xa8909090,%eax
    43: mov $0xa8ff3148,%eax
    48: mov $0xa89081b4,%eax
    4d: mov $0xa8900bb0,%eax
    52: mov $0xa810e0c1,%eax
    57: mov $0xa8908eb4,%eax
    5c: mov $0xa89020b0,%eax
    [...]

    echo 1 > /proc/sys/net/core/bpf_jit_harden

    ffffffffa034f1e5 + <x>:
    [...]
    39: mov $0xe1192563,%r10d
    3f: xor $0x4989b5f3,%r10d
    46: mov %r10d,%eax
    49: mov $0xb8296d93,%r10d
    4f: xor $0x10b9fd03,%r10d
    56: mov %r10d,%eax
    59: mov $0x8c381146,%r10d
    5f: xor $0x24c7200e,%r10d
    66: mov %r10d,%eax
    69: mov $0xeb2a830e,%r10d
    6f: xor $0x43ba02ba,%r10d
    76: mov %r10d,%eax
    79: mov $0xd9730af,%r10d
    7f: xor $0xa5073b1f,%r10d
    86: mov %r10d,%eax
    89: mov $0x9a45662b,%r10d
    8f: xor $0x325586ea,%r10d
    96: mov %r10d,%eax
    [...]

    As can be seen, original constants that carry payload are hidden
    when enabled, actual operations are transformed from constant-based
    to register-based ones, making jumps into constants ineffective.
    Above extract/example uses single BPF load instruction over and
    over, but of course all instructions with constants are blinded.

    Performance wise, JIT with blinding performs a bit slower than just
    JIT and faster than interpreter case. This is expected, since we
    still get all the performance benefits from JITing and in normal
    use-cases not every single instruction needs to be blinded. Summing
    up all 296 test cases averaged over multiple runs from test_bpf.ko
    suite, interpreter was 55% slower than JIT only and JIT with blinding
    was 8% slower than JIT only. Since there are also some extremes in
    the test suite, I expect for ordinary workloads that the performance
    for the JIT with blinding case is even closer to JIT only case,
    f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
    35 (+ blinding), and 151 (interpreter).

    BPF test suite, seccomp test suite, eBPF sample code and various
    bigger networking eBPF programs have been tested with this and were
    running fine. For testing purposes, I also adapted interpreter and
    redirected blinded eBPF image to interpreter and also here all tests
    pass.

    [1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
    [2] https://github.com/01org/jit-spray-poc-for-ksp/
    [3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5

    Signed-off-by: Daniel Borkmann
    Reviewed-by: Elena Reshetova
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

11 May, 2016

1 commit


10 May, 2016

1 commit

  • Allowing unprivileged kernel profiling lets any user follow kernel
    control flow and dump kernel registers. This most likely allows trivial
    kASLR bypassing, and it may allow other mischief as well. (Off the top
    of my head, the PERF_SAMPLE_REGS_INTR output during /dev/urandom reads
    could be quite interesting.)

    Signed-off-by: Andy Lutomirski
    Acked-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

05 May, 2016

1 commit


29 Apr, 2016

1 commit

  • Commit 3193913ce62c ("mm: page_alloc: default node-ordering on 64-bit
    NUMA, zone-ordering on 32-bit") changes the default value of
    numa_zonelist_order. Update the document.

    Signed-off-by: Xishi Qiu
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     

27 Apr, 2016

1 commit

  • The default remains 127, which is good for most cases, and not even hit
    most of the time, but then for some cases, as reported by Brendan, 1024+
    deep frames are appearing on the radar for things like groovy, ruby.

    And in some workloads putting a _lower_ cap on this may make sense. One
    that is per event still needs to be put in place tho.

    The new file is:

    # cat /proc/sys/kernel/perf_event_max_stack
    127

    Changing it:

    # echo 256 > /proc/sys/kernel/perf_event_max_stack
    # cat /proc/sys/kernel/perf_event_max_stack
    256

    But as soon as there is some event using callchains we get:

    # echo 512 > /proc/sys/kernel/perf_event_max_stack
    -bash: echo: write error: Device or resource busy
    #

    Because we only allocate the callchain percpu data structures when there
    is a user, which allows for changing the max easily, it's just a matter
    of having no callchain users at that point.
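
    A minimal sketch of how a tool might handle this, assuming Python;
    set_max_stack is a hypothetical helper name. It writes the new depth
    and treats EBUSY, the "Device or resource busy" case shown above, as
    "retry once no event is using callchains". The demo writes to a
    temporary file, so it does not touch the real sysctl.

```python
import errno
import os
import tempfile


def set_max_stack(depth, path="/proc/sys/kernel/perf_event_max_stack"):
    """Try to set the maximum callchain depth (hypothetical helper).

    Returns True on success and False on EBUSY, i.e. while some event is
    using callchains. Any other error (e.g. EACCES) is re-raised.
    """
    try:
        with open(path, "w") as f:
            f.write(f"{depth}\n")
        return True
    except OSError as exc:
        if exc.errno == errno.EBUSY:
            return False
        raise


# Demo against a scratch file standing in for the sysctl.
with tempfile.TemporaryDirectory() as scratch:
    demo_path = os.path.join(scratch, "perf_event_max_stack")
    ok = set_max_stack(256, path=demo_path)
    with open(demo_path) as f:
        written = f.read()
print(ok, written.strip())
```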

    Reported-and-Tested-by: Brendan Gregg
    Reviewed-by: Frederic Weisbecker
    Acked-by: Alexei Starovoitov
    Acked-by: David Ahern
    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/r/20160426002928.GB16708@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

19 Mar, 2016

1 commit

  • Merge second patch-bomb from Andrew Morton:

    - a couple of hotfixes

    - the rest of MM

    - a new timer slack control in procfs

    - a couple of procfs fixes

    - a few misc things

    - some printk tweaks

    - lib/ updates, notably to radix-tree.

    - add my and Nick Piggin's old userspace radix-tree test harness to
    tools/testing/radix-tree/. Matthew said it was a godsend during the
    radix-tree work he did.

    - a few code-size improvements, switching to __always_inline where gcc
    screwed up.

    - partially implement character sets in sscanf

    * emailed patches from Andrew Morton: (118 commits)
    sscanf: implement basic character sets
    lib/bug.c: use common WARN helper
    param: convert some "on"/"off" users to strtobool
    lib: add "on"/"off" support to kstrtobool
    lib: update single-char callers of strtobool()
    lib: move strtobool() to kstrtobool()
    include/linux/unaligned: force inlining of byteswap operations
    include/uapi/linux/byteorder, swab: force inlining of some byteswap operations
    include/asm-generic/atomic-long.h: force inlining of some atomic_long operations
    usb: common: convert to use match_string() helper
    ide: hpt366: convert to use match_string() helper
    ata: hpt366: convert to use match_string() helper
    power: ab8500: convert to use match_string() helper
    power: charger_manager: convert to use match_string() helper
    drm/edid: convert to use match_string() helper
    pinctrl: convert to use match_string() helper
    device property: convert to use match_string() helper
    lib/string: introduce match_string() helper
    radix-tree tests: add test for radix_tree_iter_next
    radix-tree tests: add regression3 test
    ...

    Linus Torvalds
     

18 Mar, 2016

2 commits

  • In machines with 140G of memory and enterprise flash storage, we have
    seen read and write bursts routinely exceed the kswapd watermarks and
    cause thundering herds in direct reclaim. Unfortunately, the only way
    to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
    system's emergency reserves - which is entirely unrelated to the
    system's latency requirements. In order to get kswapd to maintain a
    250M buffer of free memory, the emergency reserves need to be set to 1G.
    That is a lot of memory wasted for no good reason.

    On the other hand, it's reasonable to assume that allocation bursts and
    overall allocation concurrency scale with memory capacity, so it makes
    sense to make kswapd aggressiveness a function of that as well.

    Change the kswapd watermark scale factor from the currently fixed 25% of
    the tunable emergency reserve to a tunable 0.1% of memory.

    Beyond 1G of memory, this will produce bigger watermark steps than the
    current formula in default settings. Ensure that the new formula never
    chooses steps smaller than that, i.e. 25% of the emergency reserve.

    On a 140G machine, this raises the default watermark steps - the
    distance between min and low, and low and high - from 16M to 143M.
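
    The new step calculation can be sketched as below, assuming Python.
    The function name is hypothetical, the scale factor is taken to be in
    units of 0.01% of memory (so 10 means the 0.1% default), and the 64M
    emergency reserve is an assumption chosen because 25% of it gives the
    16M old step quoted above.

```python
def watermark_step_kib(managed_kib, min_free_kib, scale_factor=10):
    """Distance between min/low and low/high watermarks, in KiB.

    scale_factor is assumed to be in units of 0.01% of memory,
    so the default of 10 corresponds to 0.1%.
    """
    old_step = min_free_kib // 4                    # fixed 25% of the reserve
    new_step = managed_kib * scale_factor // 10000  # tunable share of memory
    return max(old_step, new_step)                  # never smaller than before


KIB_PER_GIB = 1024 * 1024
reserve_kib = 64 * 1024  # assumed 64M emergency reserve

step = watermark_step_kib(140 * KIB_PER_GIB, reserve_kib)
small = watermark_step_kib(512 * 1024, reserve_kib)  # 512M machine
print(step // 1024, "MiB on 140G;", small // 1024, "MiB on 512M")
```

    On the small machine the 0.1% term is below the old step, so the
    max() keeps the previous 25% behaviour.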

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Pull tty/serial updates from Greg KH:
    "Here's the big tty/serial driver pull request for 4.6-rc1.

    Lots of changes in here, Peter has been on a tear again, with lots of
    refactoring and bug fixes, many thanks to the great work he has been
    doing. Lots of driver updates and fixes as well, full details in the
    shortlog.

    All have been in linux-next for a while with no reported issues"

    * tag 'tty-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (220 commits)
    serial: 8250: describe CONFIG_SERIAL_8250_RSA
    serial: samsung: optimize UART rx fifo access routine
    serial: pl011: add mark/space parity support
    serial: sa1100: make sa1100_register_uart_fns a function
    tty: serial: 8250: add MOXA Smartio MUE boards support
    serial: 8250: convert drivers to use up_to_u8250p()
    serial: 8250/mediatek: fix building with SERIAL_8250=m
    serial: 8250/ingenic: fix building with SERIAL_8250=m
    serial: 8250/uniphier: fix modular build
    Revert "drivers/tty/serial: make 8250/8250_ingenic.c explicitly non-modular"
    Revert "drivers/tty/serial: make 8250/8250_mtk.c explicitly non-modular"
    serial: mvebu-uart: initial support for Armada-3700 serial port
    serial: mctrl_gpio: Add missing module license
    serial: ifx6x60: avoid uninitialized variable use
    tty/serial: at91: fix bad offset for UART timeout register
    tty/serial: at91: restore dynamic driver binding
    serial: 8250: Add hardware dependency to RT288X option
    TTY, devpts: document pty count limiting
    tty: goldfish: support platform_device with id -1
    drivers: tty: goldfish: Add device tree bindings
    ...

    Linus Torvalds
     

15 Mar, 2016

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle are:

    - Make schedstats a runtime tunable (disabled by default) and
    optimize it via static keys.

    As most distributions enable CONFIG_SCHEDSTATS=y due to its
    instrumentation value, this is a nice performance enhancement.
    (Mel Gorman)

    - Implement 'simple waitqueues' (swait): these are just pure
    waitqueues without any of the more complex features of full-blown
    waitqueues (callbacks, wake flags, wake keys, etc.). Simple
    waitqueues have less memory overhead and are faster.

    Use simple waitqueues in the RCU code (in 4 different places) and
    for handling KVM vCPU wakeups.

    (Peter Zijlstra, Daniel Wagner, Thomas Gleixner, Paul Gortmaker,
    Marcelo Tosatti)

    - sched/numa enhancements (Rik van Riel)

    - NOHZ performance enhancements (Rik van Riel)

    - Various sched/deadline enhancements (Steven Rostedt)

    - Various fixes (Peter Zijlstra)

    - ... and a number of other fixes, cleanups and smaller enhancements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
    sched/cputime: Fix steal_account_process_tick() to always return jiffies
    sched/deadline: Remove dl_new from struct sched_dl_entity
    Revert "kbuild: Add option to turn incompatible pointer check into error"
    sched/deadline: Remove superfluous call to switched_to_dl()
    sched/debug: Fix preempt_disable_ip recording for preempt_disable()
    sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
    time, acct: Drop irq save & restore from __acct_update_integrals()
    acct, time: Change indentation in __acct_update_integrals()
    sched, time: Remove non-power-of-two divides from __acct_update_integrals()
    sched/rt: Kick RT bandwidth timer immediately on start up
    sched/debug: Add deadline scheduler bandwidth ratio to /proc/sched_debug
    sched/debug: Move sched_domain_sysctl to debug.c
    sched/debug: Move the /sys/kernel/debug/sched_features file setup into debug.c
    sched/rt: Fix PI handling vs. sched_setscheduler()
    sched/core: Remove duplicated sched_group_set_shares() prototype
    sched/fair: Consolidate nohz CPU load update code
    sched/fair: Avoid using decay_load_missed() with a negative value
    sched/deadline: Always calculate end of period on sched_yield()
    sched/cgroup: Fix cgroup entity load tracking tear-down
    rcu: Use simple wait queues where possible in rcutree
    ...

    Linus Torvalds
     

08 Mar, 2016

1 commit

  • Logic has been changed in kernel 3.4 by commit e9aba5158a80
    ("tty: rework pty count limiting") but still not documented.

    Sysctl kernel.pty.max works as global limit, kernel.pty.reserve ptys
    are reserved for initial devpts instance (mounted without "newinstance").
    Per-instance limit also could be set by mount option "max=%d".

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
     

09 Feb, 2016

1 commit

  • schedstats is very useful during debugging and performance tuning but it
    incurs overhead to calculate the stats. As such, even though it can be
    disabled at build time, it is often enabled as the information is useful.

    This patch adds a kernel command-line and sysctl tunable to enable or
    disable schedstats on demand (when it's built in). It is disabled
    by default as someone who knows they need it can also learn to enable
    it when necessary.

    The benefits are dependent on how scheduler-intensive the workload is.
    If it is then the patch reduces the number of cycles spent calculating
    the stats with a small benefit from reducing the cache footprint of the
    scheduler.

    These measurements were taken from a 48-core 2-socket
    machine with Xeon(R) E5-2670 v3 cpus although they were also tested on a
    single-socket 8-core machine with Intel i7-3770 processors.

    netperf-tcp
                             4.5.0-rc1            4.5.0-rc1
                               vanilla         nostats-v3r1
    Hmean    64        560.45 ( 0.00%)      575.98 ( 2.77%)
    Hmean    128       766.66 ( 0.00%)      795.79 ( 3.80%)
    Hmean    256       950.51 ( 0.00%)      981.50 ( 3.26%)
    Hmean    1024     1433.25 ( 0.00%)     1466.51 ( 2.32%)
    Hmean    2048     2810.54 ( 0.00%)     2879.75 ( 2.46%)
    Hmean    3312     4618.18 ( 0.00%)     4682.09 ( 1.38%)
    Hmean    4096     5306.42 ( 0.00%)     5346.39 ( 0.75%)
    Hmean    8192    10581.44 ( 0.00%)    10698.15 ( 1.10%)
    Hmean    16384   18857.70 ( 0.00%)    18937.61 ( 0.42%)

    Small gains here, UDP_STREAM showed nothing interesting and neither did
    the TCP_RR tests. The gains on the 8-core machine were very similar.
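
    For readers checking the tables, the percentage column for the Hmean
    (higher is better) rows can be reproduced from the two values shown.
    This is a minimal sketch assuming Python; gain_percent is a
    hypothetical helper, and the sample rows are copied from the
    netperf-tcp table above.

```python
def gain_percent(baseline, patched):
    """Relative change of a higher-is-better metric, as a percentage."""
    return round((patched - baseline) / baseline * 100, 2)


# (vanilla, nostats-v3r1) pairs copied from the netperf-tcp table.
rows = {
    64: (560.45, 575.98),
    128: (766.66, 795.79),
    16384: (18857.70, 18937.61),
}
for size, (vanilla, nostats) in rows.items():
    print(size, gain_percent(vanilla, nostats))
```

    The Amean rows further down are times (lower is better), so the sign
    convention there is reversed; those percentages appear to be computed
    from unrounded data and may not reproduce exactly from the displayed
    values.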

    tbench4
                                   4.5.0-rc1            4.5.0-rc1
                                     vanilla         nostats-v3r1
    Hmean    mb/sec-1        500.85 ( 0.00%)      522.43 ( 4.31%)
    Hmean    mb/sec-2        984.66 ( 0.00%)     1018.19 ( 3.41%)
    Hmean    mb/sec-4       1827.91 ( 0.00%)     1847.78 ( 1.09%)
    Hmean    mb/sec-8       3561.36 ( 0.00%)     3611.28 ( 1.40%)
    Hmean    mb/sec-16      5824.52 ( 0.00%)     5929.03 ( 1.79%)
    Hmean    mb/sec-32     10943.10 ( 0.00%)    10802.83 (-1.28%)
    Hmean    mb/sec-64     15950.81 ( 0.00%)    16211.31 ( 1.63%)
    Hmean    mb/sec-128    15302.17 ( 0.00%)    15445.11 ( 0.93%)
    Hmean    mb/sec-256    14866.18 ( 0.00%)    15088.73 ( 1.50%)
    Hmean    mb/sec-512    15223.31 ( 0.00%)    15373.69 ( 0.99%)
    Hmean    mb/sec-1024   14574.25 ( 0.00%)    14598.02 ( 0.16%)
    Hmean    mb/sec-2048   13569.02 ( 0.00%)    13733.86 ( 1.21%)
    Hmean    mb/sec-3072   12865.98 ( 0.00%)    13209.23 ( 2.67%)

    Small gains of 2-4% at low thread counts and otherwise flat. The
    gains on the 8-core machine were slightly different:

    tbench4 on 8-core i7-3770 single socket machine
    Hmean    mb/sec-1        442.59 ( 0.00%)      448.73 ( 1.39%)
    Hmean    mb/sec-2        796.68 ( 0.00%)      794.39 (-0.29%)
    Hmean    mb/sec-4       1322.52 ( 0.00%)     1343.66 ( 1.60%)
    Hmean    mb/sec-8       2611.65 ( 0.00%)     2694.86 ( 3.19%)
    Hmean    mb/sec-16      2537.07 ( 0.00%)     2609.34 ( 2.85%)
    Hmean    mb/sec-32      2506.02 ( 0.00%)     2578.18 ( 2.88%)
    Hmean    mb/sec-64      2511.06 ( 0.00%)     2569.16 ( 2.31%)
    Hmean    mb/sec-128     2313.38 ( 0.00%)     2395.50 ( 3.55%)
    Hmean    mb/sec-256     2110.04 ( 0.00%)     2177.45 ( 3.19%)
    Hmean    mb/sec-512     2072.51 ( 0.00%)     2053.97 (-0.89%)

    In contrast, this shows a relatively steady 2-3% gain at higher thread
    counts. Due to the nature of the patch and the type of workload, it's
    not a surprise that the result will depend on the CPU used.

    hackbench-pipes
                             4.5.0-rc1            4.5.0-rc1
                               vanilla         nostats-v3r1
    Amean    1         0.0637 ( 0.00%)      0.0660 (-3.59%)
    Amean    4         0.1229 ( 0.00%)      0.1181 ( 3.84%)
    Amean    7         0.1921 ( 0.00%)      0.1911 ( 0.52%)
    Amean    12        0.3117 ( 0.00%)      0.2923 ( 6.23%)
    Amean    21        0.4050 ( 0.00%)      0.3899 ( 3.74%)
    Amean    30        0.4586 ( 0.00%)      0.4433 ( 3.33%)
    Amean    48        0.5910 ( 0.00%)      0.5694 ( 3.65%)
    Amean    79        0.8663 ( 0.00%)      0.8626 ( 0.43%)
    Amean    110       1.1543 ( 0.00%)      1.1517 ( 0.22%)
    Amean    141       1.4457 ( 0.00%)      1.4290 ( 1.16%)
    Amean    172       1.7090 ( 0.00%)      1.6924 ( 0.97%)
    Amean    192       1.9126 ( 0.00%)      1.9089 ( 0.19%)

    Some small gains and losses and, while the variance data is not included,
    it's close to the noise. The UMA machine did not show anything particularly
    different.

    pipetest
                              4.5.0-rc1            4.5.0-rc1
                                vanilla         nostats-v2r2
    Min         Time      4.13 ( 0.00%)        3.99 ( 3.39%)
    1st-qrtle   Time      4.38 ( 0.00%)        4.27 ( 2.51%)
    2nd-qrtle   Time      4.46 ( 0.00%)        4.39 ( 1.57%)
    3rd-qrtle   Time      4.56 ( 0.00%)        4.51 ( 1.10%)
    Max-90%     Time      4.67 ( 0.00%)        4.60 ( 1.50%)
    Max-93%     Time      4.71 ( 0.00%)        4.65 ( 1.27%)
    Max-95%     Time      4.74 ( 0.00%)        4.71 ( 0.63%)
    Max-99%     Time      4.88 ( 0.00%)        4.79 ( 1.84%)
    Max         Time      4.93 ( 0.00%)        4.83 ( 2.03%)
    Mean        Time      4.48 ( 0.00%)        4.39 ( 1.91%)
    Best99%Mean Time      4.47 ( 0.00%)        4.39 ( 1.91%)
    Best95%Mean Time      4.46 ( 0.00%)        4.38 ( 1.93%)
    Best90%Mean Time      4.45 ( 0.00%)        4.36 ( 1.98%)
    Best50%Mean Time      4.36 ( 0.00%)        4.25 ( 2.49%)
    Best10%Mean Time      4.23 ( 0.00%)        4.10 ( 3.13%)
    Best5%Mean  Time      4.19 ( 0.00%)        4.06 ( 3.20%)
    Best1%Mean  Time      4.13 ( 0.00%)        4.00 ( 3.39%)

    Small improvement and similar gains were seen on the UMA machine.

    The gain is small but it stands to reason that doing less work in the
    scheduler is a good thing. The downside is that the lack of schedstats and
    tracepoints may be surprising to experts doing performance analysis until
    they find the existence of the schedstats= parameter or schedstats sysctl.
    It will be automatically activated for latencytop and sleep profiling to
    alleviate the problem. For tracepoints, there is a simple warning as it's
    not safe to activate schedstats in the context when it's known the tracepoint
    may be wanted but is unavailable.

    Signed-off-by: Mel Gorman
    Reviewed-by: Matt Fleming
    Reviewed-by: Srikar Dronamraju
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1454663316-22048-1-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

03 Feb, 2016

1 commit

  • …/acme/linux into perf/core

    Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:

    User visible changes:

    - Rename the "colors.code" ~/.perfconfig variable to "colors.jump_arrows",
    as it controls just that UI element in the annotate browser (Taeung Song)

    - Avoid trying to read ELF symtabs from device files, noticed while doing
    memory profiling work (Jiri Olsa)

    - Improve context detection when offering options in the hists browser,
    i.e. some options don't make sense when the browser is not working with
    a perf.data file ('perf top' mode), only in 'perf report' mode, like
    scripting (Namhyung Kim)

    Infrastructure changes:

    - Eliminate duplication in the hists browser filter functions, getting the
    common part into a function that receives callbacks for filtering by
    DSO, thread, etc. (Namhyung Kim)

    - Fix misleadingly indented assignment, found using
    gcc6 -Wmisleading-indentation (Markus Trippelsdorf)

    - Handle LLVM relocation oddities in libbpf, introducing a 'perf test' that
    detects such problems and then fixing the problem, so that the test now
    passes (Wang Nan)

    - More improvements to the build infrastructure to allow reusing the
    feature detection facilities (Wang Nan)

    - Auto initialize the globals needed by cpu__max_{cpu,node}() routines
    (Arnaldo Carvalho de Melo)

    Documentation changes:

    - Document the perf sysctls in Documentation/sysctl/kernel.txt (Ben Hutchings)

    - Document a bunch more ~/.perfconfig knobs (Taeung Song)

    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

26 Jan, 2016

1 commit

  • perf_event_paranoid was only documented in source code and a perf error
    message. Copy the documentation from the error message to
    Documentation/sysctl/kernel.txt.

    perf_cpu_time_max_percent was already documented but missing from the
    list at the top, so add it there.

    Signed-off-by: Ben Hutchings
    Cc: Peter Zijlstra
    Cc: linux-doc@vger.kernel.org
    Link: http://lkml.kernel.org/r/20160119213515.GG2637@decadent.org.uk
    [ Remove reference to external Documentation file, provide info inline, as before ]
    Signed-off-by: Arnaldo Carvalho de Melo

    Ben Hutchings
     

23 Jan, 2016

1 commit

  • Pull more vfs updates from Al Viro:
    "Embarrassing braino fix + pipe page accounting + fixing an eyesore in
    find_filesystem() (checking that s1 is equal to prefix of s2 of given
    length can be done in many ways, but "compare strlen(s1) with length
    and then do strncmp()" is not a good one...)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    [regression] fix braino in fs/dlm/user.c
    pipe: limit the per-user amount of pages allocated in pipes
    find_filesystem(): simplify comparison

    Linus Torvalds
     

21 Jan, 2016

1 commit

  • SYSCTL_WRITES_WARN was added in commit f4aacea2f5d1 ("sysctl: allow for
    strict write position handling"), and released in v3.16 in August of
    2014. Since then I can find only 1 instance of non-zero offset
    writing[1], and it was fixed immediately in CRIU[2]. As such, it
    appears safe to flip this to the strict state now.

    [1] https://www.google.com/search?q="when%20file%20position%20was%20not%200"
    [2] http://lists.openvz.org/pipermail/criu/2015-April/019819.html

    Signed-off-by: Kees Cook
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

20 Jan, 2016

1 commit

  • On not-so-small systems, it is possible for a single process to cause an
    OOM condition by filling large pipes with data that are never read. A
    typical process filling 4000 pipes with 1 MB of data will use 4 GB of
    memory. On small systems it may be tricky to set the pipe max size to
    prevent this from happening.

    This patch makes it possible to enforce a per-user soft limit above
    which new pipes will be limited to a single page, effectively limiting
    them to 4 kB each, as well as a hard limit above which no new pipes may
    be created for this user. This has the effect of protecting the system
    against memory abuse without hurting other users, and still allowing
    pipes to work correctly though with less data at once.

    The limits are controlled by two new sysctls: pipe-user-pages-soft and
    pipe-user-pages-hard. Both may be disabled by setting them to zero. The
    default soft limit allows the default number of FDs per process (1024)
    to create pipes of the default size (64kB), thus reaching a limit of 64MB
    before starting to create only smaller pipes. With 256 processes limited
    to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB =
    1084 MB of memory allocated for a user. The hard limit is disabled by
    default to avoid breaking existing applications that make intensive use
    of pipes (e.g. for splicing).
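
    The worst-case figure quoted above can be reproduced with a small
    calculation. This is only a sketch of the changelog's arithmetic, not
    kernel code; the function and parameter names are made up for
    illustration.

```c
#include <assert.h>

/*
 * Worst-case pipe memory for one user, in kB, following the changelog's
 * arithmetic: pipes created below the soft limit keep the default size,
 * every later pipe is capped at a single page.
 * All names and parameters here are illustrative, not kernel interfaces.
 */
unsigned long pipe_worst_case_kb(unsigned long nr_procs,
                                 unsigned long fds_per_proc,
                                 unsigned long full_size_pipes,
                                 unsigned long default_pipe_kb,
                                 unsigned long page_kb)
{
    unsigned long total_pipes = nr_procs * fds_per_proc;

    /* The full-size pipes are a subset of all pipes the user can open. */
    assert(full_size_pipes <= total_pipes);

    return full_size_pipes * default_pipe_kb +
           (total_pipes - full_size_pipes) * page_kb;
}
```

    Calling pipe_worst_case_kb(256, 1024, 1024, 64, 4) gives 1110016 kB,
    which is the 1084 MB stated in the changelog.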

    Reported-by: socketpair@gmail.com
    Reported-by: Tetsuo Handa
    Mitigates: CVE-2013-4312 (Linux 2.0+)
    Suggested-by: Linus Torvalds
    Signed-off-by: Willy Tarreau
    Signed-off-by: Al Viro

    Willy Tarreau
     

16 Jan, 2016

1 commit

  • Merge first patch-bomb from Andrew Morton:

    - A few hotfixes which missed 4.4 because I was asleep. cc'ed to
    -stable

    - A few misc fixes

    - OCFS2 updates

    - Part of MM. Including pretty large changes to page-flags handling
    and to thp management which have been buffered up for 2-3 cycles now.

    I have a lot of MM material this time.

    [ It turns out the THP part wasn't quite ready, so that got dropped from
    this series - Linus ]

    * emailed patches from Andrew Morton: (117 commits)
    zsmalloc: reorganize struct size_class to pack 4 bytes hole
    mm/zbud.c: use list_last_entry() instead of list_tail_entry()
    zram/zcomp: do not zero out zcomp private pages
    zram: pass gfp from zcomp frontend to backend
    zram: try vmalloc() after kmalloc()
    zram/zcomp: use GFP_NOIO to allocate streams
    mm: add tracepoint for scanning pages
    drivers/base/memory.c: fix kernel warning during memory hotplug on ppc64
    mm/page_isolation: use macro to judge the alignment
    mm: fix noisy sparse warning in LIBCFS_ALLOC_PRE()
    mm: rework virtual memory accounting
    include/linux/memblock.h: fix ordering of 'flags' argument in comments
    mm: move lru_to_page to mm_inline.h
    Documentation/filesystems: describe the shared memory usage/accounting
    memory-hotplug: don't BUG() in register_memory_resource()
    hugetlb: make mm and fs code explicitly non-modular
    mm/swapfile.c: use list_for_each_entry_safe in free_swap_count_continuations
    mm: /proc/pid/clear_refs: no need to clear VM_SOFTDIRTY in clear_soft_dirty_pmd()
    mm: make sure isolate_lru_page() is never called for tail page
    vmstat: make vmstat_updater deferrable again and shut down on idle
    ...

    Linus Torvalds
     

15 Jan, 2016

2 commits

  • Pull trivial tree updates from Jiri Kosina.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    floppy: make local variable non-static
    exynos: fixes an incorrect header guard
    dt-bindings: fixes some incorrect header guards
    cpufreq-dt: correct dead link in documentation
    cpufreq: ARM big LITTLE: correct dead link in documentation
    treewide: Fix typos in printk
    Documentation: filesystem: Fix typo in fs/eventfd.c
    fs/super.c: use && instead of & for warn_on condition
    Documentation: fix sysfs-ptp
    lib: scatterlist: fix Kconfig description

    Linus Torvalds
     
  • Address Space Layout Randomization (ASLR) provides a barrier to
    exploitation of user-space processes in the presence of security
    vulnerabilities by making it more difficult to find desired code/data
    which could help an attack. This is done by adding a random offset to
    the location of regions in the process address space, with a greater
    range of potential offset values corresponding to better protection/a
    larger search-space for brute force, but also to greater potential for
    fragmentation.

    The offset added to the mmap_base address, which provides the basis for
    the majority of the mappings for a process, is set once on process exec
    in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
    which reflect, hopefully, the best compromise for all systems. The
    trade-off between increased entropy in the offset value generation and
    the corresponding increased variability in address space fragmentation
    is not absolute, however, and some platforms may tolerate higher amounts
    of entropy. This patch introduces both new Kconfig values and a sysctl
    interface which may be used to change the amount of entropy used for
    offset generation on a system.

    The direct motivation for this change was in response to the
    libstagefright vulnerabilities that affected Android, specifically to
    information provided by Google's project zero at:

    http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html

    The attack presented therein, by Google's project zero, specifically
    targeted the limited randomness used to generate the offset added to the
    mmap_base address in order to craft a brute-force-based attack.
    Concretely, the attack was against the mediaserver process, which was
    limited to respawning every 5 seconds, on an arm device. The hard-coded
    8 bits used resulted in an average expected success rate of defeating
    the mmap ASLR after just over 10 minutes (128 tries at 5 seconds a
    piece). With this patch, and an accompanying increase in the entropy
    value to 16 bits, the same attack would take an average expected time of
    over 45 hours (32768 tries), which makes it both less feasible and more
    likely to be noticed.
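
    The two time estimates follow from the same simple model: on average an
    attacker needs half of the 2^bits possible guesses, and each guess costs
    one respawn interval. A minimal sketch of that model, with hypothetical
    function names:

```c
#include <assert.h>

/*
 * Expected number of guesses to hit one value out of 2^entropy_bits equally
 * likely candidates, using the changelog's "half the search space" estimate.
 */
unsigned long expected_tries(unsigned int entropy_bits)
{
    assert(entropy_bits >= 1 && entropy_bits < 8 * sizeof(unsigned long));
    return 1UL << (entropy_bits - 1);
}

/* Expected attack duration in seconds when each guess costs one respawn. */
unsigned long expected_seconds(unsigned int entropy_bits,
                               unsigned long respawn_seconds)
{
    return expected_tries(entropy_bits) * respawn_seconds;
}
```

    With a 5 second respawn interval, 8 bits give 128 tries and 640 seconds
    (just over 10 minutes), while 16 bits give 32768 tries and 163840
    seconds (about 45.5 hours).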

    The introduced Kconfig and sysctl options are limited by per-arch
    minimum and maximum values, the minimum of which was chosen to match the
    current hard-coded value and the maximum of which was chosen so as to
    give the greatest flexibility without generating an invalid mmap_base
    address, generally 3-4 bits less than the number of bits in the
    user-space accessible virtual address space.

    When deciding whether or not to change the default value, a system
    developer should consider that the mmap_base address could be placed
    anywhere up to 2^(value) bits away from the non-randomized location,
    which would introduce variable-sized areas above and below the mmap_base
    address such that the maximum vm_area_struct size may be reduced,
    preventing very large allocations.
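
    To get a feel for the sizes involved, the sketch below computes how far
    the base can move for a given number of random bits. It assumes the
    random value is applied as a count of whole pages and uses a 4 KiB page
    (page_shift of 12) in the examples; both are assumptions of this sketch,
    and the function name is hypothetical.

```c
#include <assert.h>

/*
 * Size in bytes of the window covered by a random offset of rnd_bits bits,
 * assuming the offset counts whole pages of (1 << page_shift) bytes.
 * Page granularity and the page size are assumptions of this sketch.
 */
unsigned long long mmap_rnd_span_bytes(unsigned int rnd_bits,
                                       unsigned int page_shift)
{
    assert(rnd_bits + page_shift < 64);
    return 1ULL << (rnd_bits + page_shift);
}
```

    Under these assumptions, 8 bits span 1 MiB and 16 bits span 256 MiB,
    which is the kind of variable-sized area that can get in the way of
    very large allocations on a 32-bit address space.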

    This patch (of 4):

    ASLR only uses as few as 8 bits to generate the random offset for the
    mmap base address on 32 bit architectures. This value was chosen to
    prevent a poorly chosen value from dividing the address space in such a
    way as to prevent large allocations. This may not be an issue on all
    platforms. Allow the specification of a minimum number of bits so that
    platforms desiring greater ASLR protection may determine where to place
    the trade-off.

    Signed-off-by: Daniel Cashman
    Cc: Russell King
    Acked-by: Kees Cook
    Cc: Ingo Molnar
    Cc: Jonathan Corbet
    Cc: Don Zickus
    Cc: Eric W. Biederman
    Cc: Heinrich Schuchardt
    Cc: Josh Poimboeuf
    Cc: Kirill A. Shutemov
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Thomas Gleixner
    Cc: David Rientjes
    Cc: Mark Salyzyn
    Cc: Jeff Vander Stoep
    Cc: Nick Kralevich
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: "H. Peter Anvin"
    Cc: Hector Marco-Gisbert
    Cc: Borislav Petkov
    Cc: Ralf Baechle
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Cashman
     

19 Dec, 2015

1 commit

  • kernel.panic_on_io_nmi sysctl was introduced by commit

    5211a242d0cb ("x86: Add sysctl to allow panic on IOCK NMI error")

    but its documentation is missing. So, add it.

    Signed-off-by: Hidehiro Kawai
    Requested-by: Borislav Petkov
    Cc: Andrew Morton
    Cc: Baoquan He
    Cc: Chris Metcalf
    Cc: Don Zickus
    Cc: "Eric W. Biederman"
    Cc: Heinrich Schuchardt
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jiri Kosina
    Cc: Jonathan Corbet
    Cc: kexec@lists.infradead.org
    Cc: linux-doc@vger.kernel.org
    Cc: Manfred Spraul
    Cc: Masami Hiramatsu
    Cc: Michal Hocko
    Cc: Nicolas Iooss
    Cc: Peter Zijlstra
    Cc: Seth Jennings
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Ulrich Obergfell
    Cc: Vivek Goyal
    Cc: x86-ml
    Link: http://lkml.kernel.org/r/20151210014637.25437.71903.stgit@softrs
    Signed-off-by: Borislav Petkov
    Signed-off-by: Thomas Gleixner

    Hidehiro Kawai
     

08 Dec, 2015

1 commit

  • s/avaiable/available/g

    This fixup is already in scripts/spelling.txt.

    The fix in Documentation/ABI/testing/sysfs-ptp affects documentation of
    a /sys entry: the /sys entry itself is correct.

    Signed-off-by: Chris Dunlop
    Signed-off-by: Jiri Kosina

    Chris Dunlop
     

10 Nov, 2015

1 commit


06 Nov, 2015

1 commit

  • In many cases of hardlockup reports, it's actually not possible to know
    why it triggered, because the CPU that got stuck is usually waiting on a
    resource (with IRQs disabled) that some other CPU is holding.

    IOW, we are often looking at the stacktrace of the victim and not the
    actual offender.

    Introduce sysctl / cmdline parameter that makes it possible to have
    hardlockup detector perform all-CPU backtrace.

    Signed-off-by: Jiri Kosina
    Reviewed-by: Aaron Tomlin
    Cc: Ulrich Obergfell
    Acked-by: Don Zickus
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     

18 Sep, 2015

1 commit


09 Sep, 2015

1 commit

  • The comment says that the per-cpu batchsize and zone watermarks are
    determined by present_pages, which is definitely wrong: they are both
    calculated from managed_pages. Fix it.

    Signed-off-by: Yaowei Bai
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     

24 Jul, 2015

1 commit


26 Jun, 2015

1 commit

  • When adding __printf attribute to cn_printf, gcc reports some issues:

    fs/coredump.c:213:5: warning: format '%d' expects argument of type
    'int', but argument 3 has type 'kuid_t' [-Wformat=]
    err = cn_printf(cn, "%d", cred->uid);
    ^
    fs/coredump.c:217:5: warning: format '%d' expects argument of type
    'int', but argument 3 has type 'kgid_t' [-Wformat=]
    err = cn_printf(cn, "%d", cred->gid);
    ^

    These warnings come from the fact that the value of uid/gid needs to be
    extracted from the kuid_t/kgid_t structure before being used as an
    integer. More precisely, cred->uid and cred->gid need to be converted to
    either user-namespace uid/gid or to init_user_ns uid/gid.

    Use init_user_ns in order not to break existing ABI, and document this in
    Documentation/sysctl/kernel.txt.

    While at it, format uid and gid values with %u instead of %d because
    uid_t/__kernel_uid32_t and gid_t/__kernel_gid32_t are unsigned int.

    Signed-off-by: Nicolas Iooss
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Iooss
     

25 Jun, 2015

1 commit

  • Change the default behavior of watchdog so it only runs on the
    housekeeping cores when nohz_full is enabled at build and boot time.
    Allow modifying the set of cores the watchdog is currently running on
    with a new kernel.watchdog_cpumask sysctl.

    In the current system, the watchdog subsystem runs a periodic timer that
    schedules the watchdog kthread to run. However, nohz_full cores are
    designed to allow userspace application code running on those cores to
    have 100% access to the CPU. So the watchdog system prevents the
    nohz_full application code from being able to run the way it wants to,
    thus the motivation to suppress the watchdog on nohz_full cores, which
    this patchset provides by default.

    However, if we disable the watchdog globally, then the housekeeping
    cores can't benefit from the watchdog functionality. So we allow
    disabling it only on some cores. See Documentation/lockup-watchdogs.txt
    for more information.
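
    The default described above amounts to "all cores except the nohz_full
    ones". A toy model using one bit per CPU is sketched below; it is not
    the kernel's cpumask API, and the function name is hypothetical.

```c
#include <assert.h>

/*
 * Toy model of the default watchdog mask: one bit per CPU, up to 64 CPUs.
 * The watchdog runs on every possible CPU that is not a nohz_full CPU,
 * i.e. on the housekeeping cores. Not the kernel's cpumask interface.
 */
unsigned long long default_watchdog_mask(unsigned long long possible_cpus,
                                         unsigned long long nohz_full_cpus)
{
    /* nohz_full CPUs must be a subset of the possible CPUs. */
    assert((nohz_full_cpus & ~possible_cpus) == 0);
    return possible_cpus & ~nohz_full_cpus;
}
```

    For an 8-CPU system (mask 0xff) with CPUs 2-7 set up as nohz_full (mask
    0xfc), the function returns 0x03, so only CPUs 0 and 1 run the watchdog.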

    [jhubbard@nvidia.com: fix a watchdog crash in some configurations]
    Signed-off-by: Chris Metcalf
    Acked-by: Don Zickus
    Cc: Ingo Molnar
    Cc: Ulrich Obergfell
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Frederic Weisbecker
    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     

17 Apr, 2015

1 commit

  • File /proc/sys/kernel/threads-max controls the maximum number of threads
    that can be created using fork().

    [akpm@linux-foundation.org: fix typo, per Guenter]
    Signed-off-by: Heinrich Schuchardt
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Guenter Roeck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heinrich Schuchardt
     

16 Apr, 2015

1 commit

  • Currently, pages which are marked as unevictable are protected from
    compaction, but not from other types of migration. The POSIX real time
    extension explicitly states that mlock() will prevent a major page
    fault, but the spirit of this is that mlock() should give a process the
    ability to control sources of latency, including minor page faults.
    However, the mlock manpage only explicitly says that a locked page will
    not be written to swap and this can cause some confusion. The
    compaction code today does not give a developer who wants to avoid swap
    but wants to have large contiguous areas available any method to achieve
    this state. This patch introduces a sysctl for controlling compaction
    behavior with respect to the unevictable lru. Users who demand no page
    faults after a page is present can set compact_unevictable_allowed to 0
    and users who need the large contiguous areas can enable compaction on
    locked memory by leaving the default value of 1.

    To illustrate this problem I wrote a quick test program that mmaps a
    large number of 1MB files filled with random data. These maps are
    created locked and read only. Then every other mmap is unmapped and I
    attempt to allocate huge pages to the static huge page pool. When the
    compact_unevictable_allowed sysctl is 0, I cannot allocate hugepages
    after fragmenting memory. When the value is set to 1, allocations
    succeed.

    Signed-off-by: Eric B Munson
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Acked-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Thomas Gleixner
    Cc: Christoph Lameter
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     

15 Apr, 2015

1 commit

  • With the current user interface of the watchdog mechanism it is only
    possible to disable or enable both lockup detectors at the same time.
    This series introduces new kernel parameters and changes the semantics of
    some existing kernel parameters, so that the hard lockup detector and the
    soft lockup detector can be disabled or enabled individually. With this
    series applied, the user interface is as follows.

    - parameters in /proc/sys/kernel

      . soft_watchdog
        This is a new parameter to control and examine the run state of
        the soft lockup detector.

      . nmi_watchdog
        The semantics of this parameter have changed. It can now be used
        to control and examine the run state of the hard lockup detector.

      . watchdog
        This parameter is still available to control the run state of both
        lockup detectors at the same time. If this parameter is examined,
        it shows the logical OR of soft_watchdog and nmi_watchdog.

      . watchdog_thresh
        The semantics of this parameter are not affected by the patch.

    - kernel command line parameters

      . nosoftlockup
        The semantics of this parameter have changed. It can now be used
        to disable the soft lockup detector at boot time.

      . nmi_watchdog=0 or nmi_watchdog=1
        Disable or enable the hard lockup detector at boot time. The patch
        introduces '=1' as a new option.

      . nowatchdog
        The semantics of this parameter are not affected by the patch. It
        is still available to disable both lockup detectors at boot time.
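
    The relationship between the three /proc/sys/kernel run-state
    parameters can be modelled in a few lines. This is a user-space sketch
    of the semantics listed above, not the kernel implementation; the
    struct and function names are hypothetical.

```c
#include <assert.h>

/* Run state of the two detectors as seen through /proc/sys/kernel. */
struct lockup_detectors {
    int soft_watchdog;  /* 1 = soft lockup detector running */
    int nmi_watchdog;   /* 1 = hard lockup detector running */
};

/* Reading "watchdog" shows the logical OR of the two detectors. */
int watchdog_read(struct lockup_detectors d)
{
    return d.soft_watchdog || d.nmi_watchdog;
}

/* Writing "watchdog" switches both detectors at the same time. */
struct lockup_detectors watchdog_write(int enable)
{
    struct lockup_detectors d;

    assert(enable == 0 || enable == 1);
    d.soft_watchdog = enable;
    d.nmi_watchdog = enable;
    return d;
}
```

    In this model, disabling only one detector still leaves "watchdog"
    reading 1; it reads 0 only when both are off.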

    Also, remove the proc_dowatchdog() function which is no longer needed.

    [dzickus@redhat.com: wrote changelog]
    [dzickus@redhat.com: update documentation for kernel params and sysctl]
    Signed-off-by: Ulrich Obergfell
    Signed-off-by: Don Zickus
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     

12 Feb, 2015

3 commits

  • Merge second set of updates from Andrew Morton:
    "More of MM"

    * emailed patches from Andrew Morton: (83 commits)
    mm/nommu.c: fix arithmetic overflow in __vm_enough_memory()
    mm/mmap.c: fix arithmetic overflow in __vm_enough_memory()
    vmstat: Reduce time interval to stat update on idle cpu
    mm/page_owner.c: remove unnecessary stack_trace field
    Documentation/filesystems/proc.txt: describe /proc/<pid>/map_files
    mm: incorporate read-only pages into transparent huge pages
    vmstat: do not use deferrable delayed work for vmstat_update
    mm: more aggressive page stealing for UNMOVABLE allocations
    mm: always steal split buddies in fallback allocations
    mm: when stealing freepages, also take pages created by splitting buddy page
    mincore: apply page table walker on do_mincore()
    mm: /proc/pid/clear_refs: avoid split_huge_page()
    mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)
    mempolicy: apply page table walker on queue_pages_range()
    arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma()
    memcg: cleanup preparation for page table walk
    numa_maps: remove numa_maps->vma
    numa_maps: fix typo in gather_hugetbl_stats
    pagemap: use walk->vma instead of calling find_vma()
    clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk()
    ...

    Linus Torvalds
     
  • Dave noticed that an unprivileged process can allocate a significant
    amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the
    oom-killer and memory cgroup. The trick is to allocate a lot of PMD page
    tables. The Linux kernel doesn't account PMD tables to the process, only
    PTE tables.

    The test case below uses a few tricks to allocate a lot of PMD page tables
    while keeping VmRSS and VmPTE low. oom_score for the process will be 0.

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    #define PUD_SIZE (1UL << 30)
    #define PMD_SIZE (1UL << 21)

    #define NR_PUD 130000

    int main(void)
    {
            char *addr = NULL;
            unsigned long i;

            /* Keep THP off so each touched region needs a real PMD table */
            prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
            for (i = 0; i < NR_PUD ; i++) {
                    /* Map one PUD-sized region right after the previous one */
                    addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
                                    MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
                    if (addr == MAP_FAILED) {
                            perror("mmap");
                            break;
                    }
                    /* Touch it to populate the PMD table, then drop the page */
                    *addr = 'x';
                    munmap(addr, PMD_SIZE);
                    if (mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
                                    MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0)
                                    == MAP_FAILED)
                            perror("re-mmap"), exit(1);
            }
            printf("PID %d consumed %lu KiB in PMD page tables\n",
                            getpid(), i * 4096 >> 10);
            return pause();
    }
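
    The ">500 MiB" figure matches what the program above reports when all
    NR_PUD iterations succeed: one 4096-byte PMD table per PUD-sized region.
    A minimal sketch of that arithmetic, with a hypothetical helper name:

```c
#include <assert.h>

/*
 * Memory held in PMD page tables, in KiB, given the number of tables and
 * the size of one table in bytes. Mirrors the reproducer's final printf
 * expression "i * 4096 >> 10"; the helper itself is only illustrative.
 */
unsigned long pmd_tables_kib(unsigned long nr_tables,
                             unsigned long table_bytes)
{
    /* A page table smaller than 1 KiB would make the KiB result useless. */
    assert(table_bytes >= 1024);
    return nr_tables * table_bytes >> 10;
}
```

    For 130000 tables of 4096 bytes this yields 520000 KiB, a little under
    508 MiB, none of which was charged to the process before this patch.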

    The patch addresses the issue by accounting PMD tables to the process the
    same way we account PTE tables.

    The main places where PMD tables are accounted are __pmd_alloc() and
    free_pmd_range(). But there are a few corner cases:

    - HugeTLB can share PMD page tables. The patch handles this by accounting
    the table to all processes that share it.

    - x86 PAE pre-allocates a few PMD tables on fork.

    - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust the sanity
    check on exit(2).

    Accounting only happens on configurations where the PMD page table level
    is present (PMD is not folded). As with nr_ptes, we use a per-mm counter.
    The counter value is used to calculate the baseline for the badness score
    by the oom-killer.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dave Hansen
    Cc: Hugh Dickins
    Reviewed-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: David Rientjes
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Pull documentation updates from Jonathan Corbet:
    "Highlights this time around include:

    - A thrashing of SubmittingPatches to bring it out of the "send
    everything to Linus" era of kernel development.

    - A new document on completions from Nicholas McGuire

    - Lots of typo fixes, formatting improvements, corrections, build
    fixes, and more"

    * tag 'docs-for-linus' of git://git.lwn.net/linux-2.6: (35 commits)
    Documentation: Fix the wrong command `echo -1 > set_ftrace_pid` for cleaning the filter.
    can-doc: Fixed a wrong filepath in can.txt
    Documentation: Fix trivial typo in comment.
    kgdb,docs: Fix typo and minor style issues
    Documentation: add description for FTRACE probe status
    doc: brief user documentation for completion
    Documentation/misc-devices/mei: Fix indentation of embedded code.
    Documentation/misc-devices/mei: Fix indentation of enumeration.
    Documentation/misc-devices/mei: Fix spacing around parentheses.
    Documentation/misc-devices/mei: Fix formatting of headings.
    Documentation: devicetree: Fix double words in Doumentation/devicetree
    Documentation: mm: Fix typo in vm.txt
    lockstat: Add documentation on contention and contenting points
    Documentation: fix blackfin gptimers-example build errors
    Fixes column alignment in table of contents entry 1.9 in Documentation/filesystems/proc.txt
    CodingStyle: enable emacs display of trailing whitespace
    DocBook: Do not exceed argument list limit
    gpio: board.txt: Fix the gpio name example
    Documentation/SubmittingPatches: unify whitespace/tabs for the DCO
    MAINTAINERS: Add the docs-next git tree to the maintainer entry
    ...

    Linus Torvalds
     

11 Feb, 2015

1 commit

  • Pull networking updates from David Miller:

    1) More iov_iter conversion work from Al Viro.

    [ The "crypto: switch af_alg_make_sg() to iov_iter" commit was
    wrong, and this pull actually adds an extra commit on top of the
    branch I'm pulling to fix that up, so that the pre-merge state is
    ok. - Linus ]

    2) Various optimizations to the ipv4 forwarding information base trie
    lookup implementation. From Alexander Duyck.

    3) Remove sock_iocb altogether, from Christoph Hellwig.

    4) Allow congestion control algorithm selection via routing metrics.
    From Daniel Borkmann.

    5) Make ipv4 uncached route list per-cpu, from Eric Dumazet.

    6) Handle rfs hash collisions more gracefully, also from Eric Dumazet.

    7) Add xmit_more support to r8169, e1000, and e1000e drivers. From
    Florian Westphal.

    8) Transparent Ethernet Bridging support for GRO, from Jesse Gross.

    9) Add BPF packet actions to packet scheduler, from Jiri Pirko.

    10) Add support for unique flow IDs to openvswitch, from Joe Stringer.

    11) New NetCP ethernet driver, from Muralidharan Karicheri and Wingman
    Kwok.

    12) More sanely handle out-of-window dupacks, which can result in
    serious ACK storms. From Neal Cardwell.

    13) Various rhashtable bug fixes and enhancements, from Herbert Xu,
    Patrick McHardy, and Thomas Graf.

    14) Support xmit_more in be2net, from Sathya Perla.

    15) Group Policy extensions for vxlan, from Thomas Graf.

    16) Remove Checksum Offload support for vxlan, from Tom Herbert.

    17) Like ipv4, support lockless transmit over ipv6 UDP sockets. From
    Vlad Yasevich.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1494+1 commits)
    crypto: fix af_alg_make_sg() conversion to iov_iter
    ipv4: Namespecify TCP PMTU mechanism
    i40e: Fix for stats init function call in Rx setup
    tcp: don't include Fast Open option in SYN-ACK on pure SYN-data
    openvswitch: Only set TUNNEL_VXLAN_OPT if VXLAN-GBP metadata is set
    ipv6: Make __ipv6_select_ident static
    ipv6: Fix fragment id assignment on LE arches.
    bridge: Fix inability to add non-vlan fdb entry
    net: Mellanox: Delete unnecessary checks before the function call "vunmap"
    cxgb4: Add support in cxgb4 to get expansion rom version via ethtool
    ethtool: rename reserved1 memeber in ethtool_drvinfo for expansion ROM version
    net: dsa: Remove redundant phy_attach()
    IB/mlx4: Reset flow support for IB kernel ULPs
    IB/mlx4: Always use the correct port for mirrored multicast attachments
    net/bonding: Fix potential bad memory access during bonding events
    tipc: remove tipc_snprintf
    tipc: nl compat add noop and remove legacy nl framework
    tipc: convert legacy nl stats show to nl compat
    tipc: convert legacy nl net id get to nl compat
    tipc: convert legacy nl net id set to nl compat
    ...

    Linus Torvalds
     

03 Feb, 2015

1 commit

  • Tx timestamps are looped onto the error queue on top of an skb. This
    mechanism leaks packet headers to processes unless the no-payload
    option SOF_TIMESTAMPING_OPT_TSONLY is set.

    Add a sysctl that optionally drops looped timestamps with data. This
    only affects processes without CAP_NET_RAW.

    The policy is checked when timestamps are generated in the stack.
    It is possible for timestamps with data to be reported after the
    sysctl is set, if these were queued internally earlier.

    No vulnerability is immediately known that exploits knowledge
    gleaned from packet headers, but it may still be preferable to allow
    administrators to lock down this path at the cost of possible
    breakage of legacy applications.
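
    The policy described above can be summarised as a three-way decision.
    The sketch below is a user-space model of that decision only; the
    function and parameter names are hypothetical and it does not reproduce
    the kernel's actual check.

```c
#include <assert.h>

/*
 * Model of the policy: may a looped Tx timestamp be queued for this socket?
 *   allow_data       value of the new sysctl (1 = timestamps with data allowed)
 *   tsonly           socket set SOF_TIMESTAMPING_OPT_TSONLY (no payload looped)
 *   has_cap_net_raw  socket owner holds CAP_NET_RAW
 * Returns 1 if the timestamp may be looped, 0 if it is dropped.
 */
int may_loop_tx_timestamp(int allow_data, int tsonly, int has_cap_net_raw)
{
    assert(allow_data == 0 || allow_data == 1);

    if (tsonly)
        return 1;               /* no packet data attached, nothing to leak */
    if (allow_data)
        return 1;               /* administrator permits looped data */
    return has_cap_net_raw != 0; /* otherwise only privileged sockets */
}
```

    With the sysctl cleared, an unprivileged socket keeps receiving
    timestamps only if it asked for the no-payload variant, which is the
    legacy-application breakage the changelog warns about.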

    Signed-off-by: Willem de Bruijn

    ----

    Changes
    (v1 -> v2)
    - test socket CAP_NET_RAW instead of capable(CAP_NET_RAW)
    (rfc -> v1)
    - document the sysctl in Documentation/sysctl/net.txt
    - fix access control race: read .._OPT_TSONLY only once,
    use same value for permission check and skb generation.
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

29 Jan, 2015

1 commit