24 Aug, 2017

1 commit

  • As eBPF JIT support for arm32 was added recently with
    commit 39c13c204bb1150d401e27d41a9d8b332be47c49, it seems appropriate to
    add arm32 as arch with support for eBPF JIT in bpf and sysctl docs as well.

    Signed-off-by: Shubham Bansal
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Shubham Bansal
     

21 Aug, 2017

1 commit


19 Aug, 2017

1 commit


18 Aug, 2017

1 commit


11 Jul, 2017

1 commit

  • It seems that there are still people using 32b kernels which a lot of
    memory and the IO tend to suck a lot for them by default. Mostly
    because writers are throttled too when the lowmem is used. We have
    highmem_is_dirtyable to work around that issue but it seems we never
    bothered to document it. Let's do it now, finally.

    Link: http://lkml.kernel.org/r/20170626093200.18958-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Alkis Georgopoulos
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

22 Apr, 2017

1 commit

  • Constants used for tuning are generally a bad idea, especially as hardware
    changes over time. Replace the constant 2 jiffies with sysctl variable
    netdev_budget_usecs to enable sysadmins to tune the softirq processing.
    Also document the variable.

    For example, a very fast machine might tune this to 1000 microseconds,
    while my regression testing 486DX-25 needs it to be 4000 microseconds on
    a nearly idle network to prevent time_squeeze from being incremented.

    Version 2: changed jiffies to microseconds for predictable units.

    Signed-off-by: Matthew Whitehead
    Signed-off-by: David S. Miller

    Matthew Whitehead
     

05 Mar, 2017

1 commit

  • Pull documentation fixes from Jonathan Corbet:
    "A few fixes for the docs tree, including one for a 4.11 build
    regression"

    * tag 'docs-4.11-fixes' of git://git.lwn.net/linux:
    Documentation/sphinx: fix primary_domain configuration
    docs: Fix htmldocs build failure
    doc/ko_KR/memory-barriers: Update control-dependencies section
    pcieaer doc: update the link
    Documentation: Update path to sysrq.txt

    Linus Torvalds
     

04 Mar, 2017

1 commit


25 Feb, 2017

1 commit

  • If madvise(2) advice will result in the underlying vma being split and
    the number of areas mapped by the process will exceed
    /proc/sys/vm/max_map_count as a result, return ENOMEM instead of EAGAIN.

    EAGAIN is returned by madvise(2) when a kernel resource, such as slab,
    is temporarily unavailable. It indicates that userspace should retry
    the advice in the near future. This is important for advice such as
    MADV_DONTNEED which is often used by malloc implementations to free
    memory back to the system: we really do want to free memory back when
    madvise(2) returns EAGAIN because slab allocations (for vmas, anon_vmas,
    or mempolicies) cannot be allocated.

    Encountering /proc/sys/vm/max_map_count is not a temporary failure,
    however, so return ENOMEM to indicate this is a more serious issue. A
    followup patch to the man page will specify this behavior.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701241431120.42507@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Cc: Jonathan Corbet
    Cc: Johannes Weiner
    Cc: Jerome Marchand
    Cc: "Kirill A. Shutemov"
    Cc: Michael Kerrisk
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

18 Feb, 2017

1 commit

  • Long standing issue with JITed programs is that stack traces from
    function tracing check whether a given address is kernel code
    through {__,}kernel_text_address(), which checks for code in core
    kernel, modules and dynamically allocated ftrace trampolines. But
    what is still missing is BPF JITed programs (interpreted programs
    are not an issue as __bpf_prog_run() will be attributed to them),
    thus when a stack trace is triggered, the code walking the stack
    won't see any of the JITed ones. The same for address correlation
    done from user space via reading /proc/kallsyms. This is read by
    tools like perf, but the latter is also useful for permanent live
    tracing with eBPF itself in combination with stack maps when other
    eBPF types are part of the callchain. See offwaketime example on
    dumping stack from a map.

    This work tries to tackle that issue by making the addresses and
    symbols known to the kernel. The lookup from *kernel_text_address()
    is implemented through a latched RB tree that can be read under
    RCU in fast-path that is also shared for symbol/size/offset lookup
    for a specific given address in kallsyms. The slow-path iteration
    through all symbols in the seq file done via RCU list, which holds
    a tiny fraction of all exported ksyms, usually below 0.1 percent.
    Function symbols are exported as bpf_prog_, in order to aide
    debugging and attribution. This facility is currently enabled for
    root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
    is active in any mode. The rationale behind this is that still a lot
    of systems ship with world read permissions on kallsyms thus addresses
    should not get suddenly exposed for them. If that situation gets
    much better in future, we always have the option to change the
    default on this. Likewise, unprivileged programs are not allowed
    to add entries there either, but that is less of a concern as most
    such programs types relevant in this context are for root-only anyway.
    If enabled, call graphs and stack traces will then show a correct
    attribution; one example is illustrated below, where the trace is
    now visible in tooling such as perf script --kallsyms=/proc/kallsyms
    and friends.

    Before:

    7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
    f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)

    After:

    7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
    [...]
    7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

30 Dec, 2016

1 commit

  • Oftenly, introducing side effects on packet processing on the other half
    of the stack by adjusting one of TX/RX via sysctl is not desirable.
    There are cases of demand for asymmetric, orthogonal configurability.

    This holds true especially for nodes where RPS for RFS usage on top is
    configured and therefore use the 'old dev_weight'. This is quite a
    common base configuration setup nowadays, even with NICs of superior processing
    support (e.g. aRFS).

    A good example use case are nodes acting as noSQL data bases with a
    large number of tiny requests and rather fewer but large packets as responses.
    It's affordable to have large budget and rx dev_weights for the
    requests. But as a side effect having this large a number on TX
    processed in one run can overwhelm drivers.

    This patch therefore introduces an independent configurability via sysctl to
    userland.

    Signed-off-by: Matthias Tafelmeier
    Signed-off-by: David S. Miller

    Matthias Tafelmeier
     

13 Dec, 2016

1 commit

  • Pull documentation update from Jonathan Corbet:
    "These are the documentation changes for 4.10.

    It's another busy cycle for the docs tree, as the sphinx conversion
    continues. Highlights include:

    - Further work on PDF output, which remains a bit of a pain but
    should be more solid now.

    - Five more DocBook template files converted to Sphinx. Only 27 to
    go... Lots of plain-text files have also been converted and
    integrated.

    - Images in binary formats have been replaced with more
    source-friendly versions.

    - Various bits of organizational work, including the renaming of
    various files discussed at the kernel summit.

    - New documentation for the device_link mechanism.

    ... and, of course, lots of typo fixes and small updates"

    * tag 'docs-4.10' of git://git.lwn.net/linux: (193 commits)
    dma-buf: Extract dma-buf.rst
    Update Documentation/00-INDEX
    docs: 00-INDEX: document directories/files with no docs
    docs: 00-INDEX: remove non-existing entries
    docs: 00-INDEX: add missing entries for documentation files/dirs
    docs: 00-INDEX: consolidate process/ and admin-guide/ description
    scripts: add a script to check if Documentation/00-INDEX is sane
    Docs: change sh -> awk in REPORTING-BUGS
    Documentation/core-api/device_link: Add initial documentation
    core-api: remove an unexpected unident
    ppc/idle: Add documentation for powersave=off
    Doc: Correct typo, "Introdution" => "Introduction"
    Documentation/atomic_ops.txt: convert to ReST markup
    Documentation/local_ops.txt: convert to ReST markup
    Documentation/assoc_array.txt: convert to ReST markup
    docs-rst: parse-headers.pl: cleanup the documentation
    docs-rst: fix media cleandocs target
    docs-rst: media/Makefile: reorganize the rules
    docs-rst: media: build SVG from graphviz files
    docs-rst: replace bayer.png by a SVG image
    ...

    Linus Torvalds
     

26 Oct, 2016

1 commit

  • For mostly historical reasons, the x86 oops dump shows the raw stack
    values:

    ...
    [registers]
    Stack:
    ffff880079af7350 ffff880079905400 0000000000000000 ffffc900008f3ae0
    ffffffffa0196610 0000000000000001 00010000ffffffff 0000000087654321
    0000000000000002 0000000000000000 0000000000000000 0000000000000000
    Call Trace:
    ...

    This seems to be an artifact from long ago, and probably isn't needed
    anymore. It generally just adds noise to the dump, and it can be
    actively harmful because it leaks kernel addresses.

    Linus says:

    "The stack dump actually goes back to forever, and it used to be
    useful back in 1992 or so. But it used to be useful mainly because
    stacks were simpler and we didn't have very good call traces anyway. I
    definitely remember having used them - I just do not remember having
    used them in the last ten+ years.

    Of course, it's still true that if you can trigger an oops, you've
    likely already lost the security game, but since the stack dump is so
    useless, let's aim to just remove it and make games like the above
    harder."

    This also removes the related 'kstack=' cmdline option and the
    'kstack_depth_to_print' sysctl.

    Suggested-by: Linus Torvalds
    Signed-off-by: Josh Poimboeuf
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/e83bd50df52d8fe88e94d2566426ae40d813bf8f.1477405374.git.jpoimboe@redhat.com
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf
     

24 Oct, 2016

1 commit


01 Oct, 2016

1 commit

  • CAI Qian pointed out that the semantics
    of shared subtrees make it possible to create an exponentially
    increasing number of mounts in a mount namespace.

    mkdir /tmp/1 /tmp/2
    mount --make-rshared /
    for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

    Will create create 2^20 or 1048576 mounts, which is a practical problem
    as some people have managed to hit this by accident.

    As such CVE-2016-6213 was assigned.

    Ian Kent described the situation for autofs users
    as follows:

    > The number of mounts for direct mount maps is usually not very large because of
    > the way they are implemented, large direct mount maps can have performance
    > problems. There can be anywhere from a few (likely case a few hundred) to less
    > than 10000, plus mounts that have been triggered and not yet expired.
    >
    > Indirect mounts have one autofs mount at the root plus the number of mounts that
    > have been triggered and not yet expired.
    >
    > The number of autofs indirect map entries can range from a few to the common
    > case of several thousand and in rare cases up to between 30000 and 50000. I've
    > not heard of people with maps larger than 50000 entries.
    >
    > The larger the number of map entries the greater the possibility for a large
    > number of active mounts so it's not hard to expect cases of a 1000 or somewhat
    > more active mounts.

    So I am setting the default number of mounts allowed per mount
    namespace at 100,000. This is more than enough for any use case I
    know of, but small enough to quickly stop an exponential increase
    in mounts. Which should be perfect to catch misconfigurations and
    malfunctioning programs.

    For anyone who needs a higher limit this can be changed by writing
    to the new /proc/sys/fs/mount-max sysctl.

    Tested-by: CAI Qian
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

23 Sep, 2016

1 commit


03 Aug, 2016

1 commit

  • Add a "printk.devkmsg" kernel command line parameter which controls how
    userspace writes into /dev/kmsg. It has three options:

    * ratelimit - ratelimit logging from userspace.
    * on - unlimited logging from userspace
    * off - logging from userspace gets ignored

    The default setting is to ratelimit the messages written to it.

    This changes the kernel default setting of "on" to "ratelimit" and we do
    that because we want to keep userspace spamming /dev/kmsg to sane
    levels. This is especially moot when a small kernel log buffer wraps
    around and messages get lost. So the ratelimiting setting should be a
    sane setting where kernel messages should have a bit higher chance of
    survival from all the spamming.

    It additionally does not limit logging to /dev/kmsg while the system is
    booting if we haven't disabled it on the command line.

    Furthermore, we can control the logging from a lower priority sysctl
    interface - kernel.printk_devkmsg.

    That interface will succeed only if printk.devkmsg *hasn't* been
    supplied on the command line. If it has, then printk.devkmsg is a
    one-time setting which remains for the duration of the system lifetime.
    This "locking" of the setting is to prevent userspace from changing the
    logging on us through sysctl(2).

    This patch is based on previous patches from Linus and Steven.

    [bp@suse.de: fixes]
    Link: http://lkml.kernel.org/r/20160719072344.GC25563@nazgul.tnic
    Link: http://lkml.kernel.org/r/20160716061745.15795-3-bp@alien8.de
    Signed-off-by: Borislav Petkov
    Cc: Dave Young
    Cc: Franck Bui
    Cc: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Uwe Kleine-König
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     

27 Jul, 2016

1 commit

  • Pull documentation updates from Jonathan Corbet:
    "Some big changes this month, headlined by the addition of a new
    formatted documentation mechanism based on the Sphinx system.

    The objectives here are to make it easier to create better-integrated
    (and more attractive) documents while (eventually) dumping our
    one-of-a-kind, cobbled-together system for something that is widely
    used and maintained by others. There's a fair amount of information
    what's being done, why, and how to use it in:

    https://lwn.net/Articles/692704/
    https://lwn.net/Articles/692705/

    Closer to home, Documentation/kernel-documentation.rst describes how
    it works.

    For now, the new system exists alongside the old one; you should soon
    see the GPU documentation converted over in the DRM pull and some
    significant media conversion work as well. Once all the docs have
    been moved over and we're convinced that the rough edges (of which are
    are a few) have been smoothed over, the DocBook-based stuff should go
    away.

    Primary credit is to Jani Nikula for doing the heavy lifting to make
    this stuff actually work; there has also been notable effort from
    Markus Heiser, Daniel Vetter, and Mauro Carvalho Chehab.

    Expect a couple of conflicts on the new index.rst file over the course
    of the merge window; they are trivially resolvable. That file may be
    a bit of a conflict magnet in the short term, but I don't expect that
    situation to last for any real length of time.

    Beyond that, of course, we have the usual collection of tweaks,
    updates, and typo fixes"

    * tag 'docs-for-linus' of git://git.lwn.net/linux: (77 commits)
    doc-rst: kernel-doc: fix handling of address_space tags
    Revert "doc/sphinx: Enable keep_warnings"
    doc-rst: kernel-doc directive, fix state machine reporter
    docs: deprecate kernel-doc-nano-HOWTO.txt
    doc/sphinx: Enable keep_warnings
    Documentation: add watermark_scale_factor to the list of vm systcl file
    kernel-doc: Fix up warning output
    docs: Get rid of some kernel-documentation warnings
    doc-rst: add an option to ignore DocBooks when generating docs
    workqueue: Fix a typo in workqueue.txt
    Doc: ocfs: Fix typo in filesystems/ocfs2-online-filecheck.txt
    Documentation/sphinx: skip build if user requested specific DOCBOOKS
    Documentation: add cleanmediadocs to the documentation targets
    Add .pyc files to .gitignore
    Doc: PM: Fix a typo in intel_powerclamp.txt
    doc-rst: flat-table directive - initial implementation
    Documentation: add meta-documentation for Sphinx and kernel-doc
    Documentation: tiny typo fix in usb/gadget_multi.txt
    Documentation: fix wrong value in md.txt
    bcache: documentation formatting, edited for clarity, stripe alignment notes
    ...

    Linus Torvalds
     

18 Jul, 2016

1 commit


16 Jun, 2016

1 commit

  • It is not always easy to determine the cause of an RCU stall just by
    analysing the RCU stall messages, mainly when the problem is caused
    by the indirect starvation of rcu threads. For example, when preempt_rcu
    is not awakened due to the starvation of a timer softirq.

    We have been hard coding panic() in the RCU stall functions for
    some time while testing the kernel-rt. But this is not possible in
    some scenarios, like when supporting customers.

    This patch implements the sysctl kernel.panic_on_rcu_stall. If
    set to 1, the system will panic() when an RCU stall takes place,
    enabling the capture of a vmcore. The vmcore provides a way to analyze
    all kernel/tasks states, helping out to point to the culprit and the
    solution for the stall.

    The kernel.panic_on_rcu_stall sysctl is disabled by default.

    Changes from v1:
    - Fixed a typo in the git log
    - The if(sysctl_panic_on_rcu_stall) panic() is in a static function
    - Fixed the CONFIG_TINY_RCU compilation issue
    - The var sysctl_panic_on_rcu_stall is now __read_mostly

    Cc: Jonathan Corbet
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Mathieu Desnoyers
    Cc: Lai Jiangshan
    Acked-by: Christian Borntraeger
    Reviewed-by: Josh Triplett
    Reviewed-by: Arnaldo Carvalho de Melo
    Tested-by: "Luis Claudio R. Goncalves"
    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Paul E. McKenney

    Daniel Bristot de Oliveira
     

26 May, 2016

1 commit

  • Pull perf updates from Ingo Molnar:
    "Mostly tooling and PMU driver fixes, but also a number of late updates
    such as the reworking of the call-chain size limiting logic to make
    call-graph recording more robust, plus tooling side changes for the
    new 'backwards ring-buffer' extension to the perf ring-buffer"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
    perf record: Read from backward ring buffer
    perf record: Rename variable to make code clear
    perf record: Prevent reading invalid data in record__mmap_read
    perf evlist: Add API to pause/resume
    perf trace: Use the ptr->name beautifier as default for "filename" args
    perf trace: Use the fd->name beautifier as default for "fd" args
    perf report: Add srcline_from/to branch sort keys
    perf evsel: Record fd into perf_mmap
    perf evsel: Add overwrite attribute and check write_backward
    perf tools: Set buildid dir under symfs when --symfs is provided
    perf trace: Only auto set call-graph to "dwarf" when syscalls are being traced
    perf annotate: Sort list of recognised instructions
    perf annotate: Fix identification of ARM blt and bls instructions
    perf tools: Fix usage of max_stack sysctl
    perf callchain: Stop validating callchains by the max_stack sysctl
    perf trace: Fix exit_group() formatting
    perf top: Use machine->kptr_restrict_warned
    perf trace: Warn when trying to resolve kernel addresses with kptr_restrict=1
    perf machine: Do not bail out if not managing to read ref reloc symbol
    perf/x86/intel/p4: Trival indentation fix, remove space
    ...

    Linus Torvalds
     

20 May, 2016

1 commit

  • Provide /proc/sys/vm/stat_refresh to force an immediate update of
    per-cpu into global vmstats: useful to avoid a sleep(2) or whatever
    before checking counts when testing. Originally added to work around a
    bug which left counts stranded indefinitely on a cpu going idle (an
    inaccuracy magnified when small below-batch numbers represent "huge"
    amounts of memory), but I believe that bug is now fixed: nonetheless,
    this is still a useful knob.

    Its schedule_on_each_cpu() is probably too expensive just to fold into
    reading /proc/meminfo itself: give this mode 0600 to prevent abuse.
    Allow a write or a read to do the same: nothing to read, but "grep -h
    Shmem /proc/sys/vm/stat_refresh /proc/meminfo" is convenient. Oh, and
    since global_page_state() itself is careful to disguise any underflow as
    0, hack in an "Invalid argument" and pr_warn() if a counter is negative
    after the refresh - this helped to fix a misaccounting of
    NR_ISOLATED_FILE in my migration code.

    But on recent kernels, I find that NR_ALLOC_BATCH and NR_PAGES_SCANNED
    often go negative some of the time. I have not yet worked out why, but
    have no evidence that it's actually harmful. Punt for the moment by
    just ignoring the anomaly on those.

    Signed-off-by: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Andres Lagar-Cavilla
    Cc: Yang Shi
    Cc: Ning Qu
    Cc: Mel Gorman
    Cc: Andres Lagar-Cavilla
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

18 May, 2016

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    1) Support SPI based w5100 devices, from Akinobu Mita.

    2) Partial Segmentation Offload, from Alexander Duyck.

    3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.

    4) Allow cls_flower stats offload, from Amir Vadai.

    5) Implement bpf blinding, from Daniel Borkmann.

    6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
    actually using FASYNC these atomics are superfluous. From Eric
    Dumazet.

    7) Run TCP more preemptibly, also from Eric Dumazet.

    8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
    driver, from Gal Pressman.

    9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.

    10) Improve BPF usage documentation, from Jesper Dangaard Brouer.

    11) Support tunneling offloads in qed, from Manish Chopra.

    12) aRFS offloading in mlx5e, from Maor Gottlieb.

    13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
    Leitner.

    14) Add MSG_EOR support to TCP, this allows controlling packet
    coalescing on application record boundaries for more accurate
    socket timestamp sampling. From Martin KaFai Lau.

    15) Fix alignment of 64-bit netlink attributes across the board, from
    Nicolas Dichtel.

    16) Per-vlan stats in bridging, from Nikolay Aleksandrov.

    17) Several conversions of drivers to ethtool ksettings, from Philippe
    Reynes.

    18) Checksum neutral ILA in ipv6, from Tom Herbert.

    19) Factorize all of the various marvell dsa drivers into one, from
    Vivien Didelot

    20) Add VF support to qed driver, from Yuval Mintz"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
    Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
    Revert "phy dp83867: Make rgmii parameters optional"
    r8169: default to 64-bit DMA on recent PCIe chips
    phy dp83867: Make rgmii parameters optional
    phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
    bpf: arm64: remove callee-save registers use for tmp registers
    asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
    switchdev: pass pointer to fib_info instead of copy
    net_sched: close another race condition in tcf_mirred_release()
    tipc: fix nametable publication field in nl compat
    drivers: net: Don't print unpopulated net_device name
    qed: add support for dcbx.
    ravb: Add missing free_irq() calls to ravb_close()
    qed: Remove a stray tab
    net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fec-mpc52xx: use phydev from struct net_device
    bpf, doc: fix typo on bpf_asm descriptions
    stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
    net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fs-enet: use phydev from struct net_device
    ...

    Linus Torvalds
     

17 May, 2016

2 commits

  • The perf_sample->ip_callchain->nr value includes all the entries in the
    ip_callchain->ip[] array, real addresses and PERF_CONTEXT_{KERNEL,USER,etc},
    while what the user expects is that what is in the kernel.perf_event_max_stack
    sysctl or in the upcoming per event perf_event_attr.sample_max_stack knob be
    honoured in terms of IP addresses in the stack trace.

    So allocate a bunch of extra entries for contexts, and do the accounting
    via perf_callchain_entry_ctx struct members.

    A new sysctl, kernel.perf_event_max_contexts_per_stack is also
    introduced for investigating possible bugs in the callchain
    implementation by some arch.

    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Brendan Gregg
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/n/tip-3b4wnqk340c4sg4gwkfdi9yk@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     
  • This work adds a generic facility for use from eBPF JIT compilers
    that allows for further hardening of JIT generated images through
    blinding constants. In response to the original work on BPF JIT
    spraying published by Keegan McAllister [1], most BPF JITs were
    changed to make images read-only and start at a randomized offset
    in the page, where the rest was filled with trap instructions. We
    have this nowadays in x86, arm, arm64 and s390 JIT compilers.
    Additionally, later work also made eBPF interpreter images read
    only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
    arm, arm64 and s390 archs as well currently. This is done by
    default for mentioned JITs when JITing is enabled. Furthermore,
    we had a generic and configurable constant blinding facility on our
    todo for quite some time now to further make spraying harder, and
    first implementation since around netconf 2016.

    We found that for systems where untrusted users can load cBPF/eBPF
    code where JIT is enabled, start offset randomization helps a bit
    to make jumps into crafted payload harder, but in case where larger
    programs that cross page boundary are injected, we again have some
    part of the program opcodes at a page start offset. With improved
    guessing and more reliable payload injection, chances can increase
    to jump into such payload. Elena Reshetova recently wrote a test
    case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
    can leave some more room for payloads. Note that for all this,
    additional bugs in the kernel are still required to make the jump
    (and of course to guess right, to not jump into a trap) and naturally
    the JIT must be enabled, which is disabled by default.

    For helping mitigation, the general idea is to provide an option
    bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
    that for cases where JIT should be enabled for performance reasons,
    the generated image can be further hardened with blinding constants
    for unpriviledged users (bpf_jit_harden == 1), with trading off
    performance for these, but not for privileged ones. We also added
    the option of blinding for all users (bpf_jit_harden == 2), which
    is quite helpful for testing f.e. with test_bpf.ko. There are no
    further e.g. hardening levels of bpf_jit_harden switch intended,
    rationale is to have it dead simple to use as on/off. Since this
    functionality would need to be duplicated over and over for JIT
    compilers to use, which are already complex enough, we provide a
    generic eBPF byte-code level based blinding implementation, which is
    then just transparently JITed. JIT compilers need to make only a few
    changes to integrate this facility and can be migrated one by one.

    This option is for eBPF JITs and will be used in x86, arm64, s390
    without too much effort, and soon ppc64 JITs, thus that native eBPF
    can be blinded as well as cBPF to eBPF migrations, so that both can
    be covered with a single implementation. The rule for JITs is that
    bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
    and in case blinding is disabled, we follow normally with JITing the
    passed program. In case blinding is enabled and we fail during the
    process of blinding itself, we must return with the interpreter.
    Similarly, in case the JITing process after the blinding failed, we
    return normally to the interpreter with the non-blinded code. Meaning,
    interpreter doesn't change in any way and operates on eBPF code as
    usual. For doing this pre-JIT blinding step, we need to make use of
    a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
    to the JIT and not in any way part of the eBPF architecture. Just like
    in the same way as JITs internally make use of some helper registers
    when emitting code, only that here the helper register is one
    abstraction level higher in eBPF bytecode, but nevertheless in JIT
    phase. That helper register is needed since f.e. manually written
    program can issue loads to all registers of eBPF architecture.

    The core concept with the additional register is: blind out all 32
    and 64 bit constants by converting BPF_K based instructions into a
    small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
    is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
    and REG BPF_REG_AX, so actual operation on the target register
    is translated from BPF_K into BPF_X one that is operating on
    BPF_REG_AX's content. During rewriting phase when blinding, RND is
    newly generated via prandom_u32() for each processed instruction.
    64 bit loads are split into two 32 bit loads to make translation and
    patching not too complex. Only basic thing required by JITs is to
    call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
    pair, and to map BPF_REG_AX into an unused register.

    Small bpf_jit_disasm extract from [2] when applied to x86 JIT:

    echo 0 > /proc/sys/net/core/bpf_jit_harden

    ffffffffa034f5e9 + :
    [...]
    39: mov $0xa8909090,%eax
    3e: mov $0xa8909090,%eax
    43: mov $0xa8ff3148,%eax
    48: mov $0xa89081b4,%eax
    4d: mov $0xa8900bb0,%eax
    52: mov $0xa810e0c1,%eax
    57: mov $0xa8908eb4,%eax
    5c: mov $0xa89020b0,%eax
    [...]

    echo 1 > /proc/sys/net/core/bpf_jit_harden

    ffffffffa034f1e5 + :
    [...]
    39: mov $0xe1192563,%r10d
    3f: xor $0x4989b5f3,%r10d
    46: mov %r10d,%eax
    49: mov $0xb8296d93,%r10d
    4f: xor $0x10b9fd03,%r10d
    56: mov %r10d,%eax
    59: mov $0x8c381146,%r10d
    5f: xor $0x24c7200e,%r10d
    66: mov %r10d,%eax
    69: mov $0xeb2a830e,%r10d
    6f: xor $0x43ba02ba,%r10d
    76: mov %r10d,%eax
    79: mov $0xd9730af,%r10d
    7f: xor $0xa5073b1f,%r10d
    86: mov %r10d,%eax
    89: mov $0x9a45662b,%r10d
    8f: xor $0x325586ea,%r10d
    96: mov %r10d,%eax
    [...]

    As can be seen, original constants that carry payload are hidden
    when enabled, actual operations are transformed from constant-based
    to register-based ones, making jumps into constants ineffective.
    Above extract/example uses single BPF load instruction over and
    over, but of course all instructions with constants are blinded.

    Performance wise, JIT with blinding performs a bit slower than just
    JIT and faster than interpreter case. This is expected, since we
    still get all the performance benefits from JITing and in normal
    use-cases not every single instruction needs to be blinded. Summing
    up all 296 test cases averaged over multiple runs from test_bpf.ko
    suite, interpreter was 55% slower than JIT only and JIT with blinding
    was 8% slower than JIT only. Since there are also some extremes in
    the test suite, I expect for ordinary workloads that the performance
    for the JIT with blinding case is even closer to JIT only case,
    f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
    35 (+ blinding), and 151 (interpreter).

    BPF test suite, seccomp test suite, eBPF sample code and various
    bigger networking eBPF programs have been tested with this and were
    running fine. For testing purposes, I also adapted interpreter and
    redirected blinded eBPF image to interpreter and also here all tests
    pass.

    [1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
    [2] https://github.com/01org/jit-spray-poc-for-ksp/
    [3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5

    Signed-off-by: Daniel Borkmann
    Reviewed-by: Elena Reshetova
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

11 May, 2016

1 commit


10 May, 2016

1 commit

  • Allowing unprivileged kernel profiling lets any user dump follow kernel
    control flow and dump kernel registers. This most likely allows trivial
    kASLR bypassing, and it may allow other mischief as well. (Off the top
    of my head, the PERF_SAMPLE_REGS_INTR output during /dev/urandom reads
    could be quite interesting.)

    Signed-off-by: Andy Lutomirski
    Acked-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

05 May, 2016

1 commit


29 Apr, 2016

1 commit

  • Commit 3193913ce62c ("mm: page_alloc: default node-ordering on 64-bit
    NUMA, zone-ordering on 32-bit") changes the default value of
    numa_zonelist_order. Update the document.

    Signed-off-by: Xishi Qiu
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     

27 Apr, 2016

1 commit

  • The default remains 127, which is good for most cases, and not even hit
    most of the time, but then for some cases, as reported by Brendan, 1024+
    deep frames are appearing on the radar for things like groovy, ruby.

    And in some workloads putting a _lower_ cap on this may make sense. One
    that is per event still needs to be put in place tho.

    The new file is:

    # cat /proc/sys/kernel/perf_event_max_stack
    127

    Chaging it:

    # echo 256 > /proc/sys/kernel/perf_event_max_stack
    # cat /proc/sys/kernel/perf_event_max_stack
    256

    But as soon as there is some event using callchains we get:

    # echo 512 > /proc/sys/kernel/perf_event_max_stack
    -bash: echo: write error: Device or resource busy
    #

    Because we only allocate the callchain percpu data structures when there
    is a user, which allows for changing the max easily, its just a matter
    of having no callchain users at that point.

    Reported-and-Tested-by: Brendan Gregg
    Reviewed-by: Frederic Weisbecker
    Acked-by: Alexei Starovoitov
    Acked-by: David Ahern
    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/r/20160426002928.GB16708@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

19 Mar, 2016

1 commit

  • Merge second patch-bomb from Andrew Morton:

    - a couple of hotfixes

    - the rest of MM

    - a new timer slack control in procfs

    - a couple of procfs fixes

    - a few misc things

    - some printk tweaks

    - lib/ updates, notably to radix-tree.

    - add my and Nick Piggin's old userspace radix-tree test harness to
    tools/testing/radix-tree/. Matthew said it was a godsend during the
    radix-tree work he did.

    - a few code-size improvements, switching to __always_inline where gcc
    screwed up.

    - partially implement character sets in sscanf

    * emailed patches from Andrew Morton : (118 commits)
    sscanf: implement basic character sets
    lib/bug.c: use common WARN helper
    param: convert some "on"/"off" users to strtobool
    lib: add "on"/"off" support to kstrtobool
    lib: update single-char callers of strtobool()
    lib: move strtobool() to kstrtobool()
    include/linux/unaligned: force inlining of byteswap operations
    include/uapi/linux/byteorder, swab: force inlining of some byteswap operations
    include/asm-generic/atomic-long.h: force inlining of some atomic_long operations
    usb: common: convert to use match_string() helper
    ide: hpt366: convert to use match_string() helper
    ata: hpt366: convert to use match_string() helper
    power: ab8500: convert to use match_string() helper
    power: charger_manager: convert to use match_string() helper
    drm/edid: convert to use match_string() helper
    pinctrl: convert to use match_string() helper
    device property: convert to use match_string() helper
    lib/string: introduce match_string() helper
    radix-tree tests: add test for radix_tree_iter_next
    radix-tree tests: add regression3 test
    ...

    Linus Torvalds
     

18 Mar, 2016

2 commits

  • In machines with 140G of memory and enterprise flash storage, we have
    seen read and write bursts routinely exceed the kswapd watermarks and
    cause thundering herds in direct reclaim. Unfortunately, the only way
    to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
    system's emergency reserves - which is entirely unrelated to the
    system's latency requirements. In order to get kswapd to maintain a
    250M buffer of free memory, the emergency reserves need to be set to 1G.
    That is a lot of memory wasted for no good reason.

    On the other hand, it's reasonable to assume that allocation bursts and
    overall allocation concurrency scale with memory capacity, so it makes
    sense to make kswapd aggressiveness a function of that as well.

    Change the kswapd watermark scale factor from the currently fixed 25% of
    the tunable emergency reserve to a tunable 0.1% of memory.

    Beyond 1G of memory, this will produce bigger watermark steps than the
    current formula in default settings. Ensure that the new formula never
    chooses steps smaller than that, i.e. 25% of the emergency reserve.

    On a 140G machine, this raises the default watermark steps - the
    distance between min and low, and low and high - from 16M to 143M.

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Pull tty/serial updates from Greg KH:
    "Here's the big tty/serial driver pull request for 4.6-rc1.

    Lots of changes in here, Peter has been on a tear again, with lots of
    refactoring and bugs fixes, many thanks to the great work he has been
    doing. Lots of driver updates and fixes as well, full details in the
    shortlog.

    All have been in linux-next for a while with no reported issues"

    * tag 'tty-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (220 commits)
    serial: 8250: describe CONFIG_SERIAL_8250_RSA
    serial: samsung: optimize UART rx fifo access routine
    serial: pl011: add mark/space parity support
    serial: sa1100: make sa1100_register_uart_fns a function
    tty: serial: 8250: add MOXA Smartio MUE boards support
    serial: 8250: convert drivers to use up_to_u8250p()
    serial: 8250/mediatek: fix building with SERIAL_8250=m
    serial: 8250/ingenic: fix building with SERIAL_8250=m
    serial: 8250/uniphier: fix modular build
    Revert "drivers/tty/serial: make 8250/8250_ingenic.c explicitly non-modular"
    Revert "drivers/tty/serial: make 8250/8250_mtk.c explicitly non-modular"
    serial: mvebu-uart: initial support for Armada-3700 serial port
    serial: mctrl_gpio: Add missing module license
    serial: ifx6x60: avoid uninitialized variable use
    tty/serial: at91: fix bad offset for UART timeout register
    tty/serial: at91: restore dynamic driver binding
    serial: 8250: Add hardware dependency to RT288X option
    TTY, devpts: document pty count limiting
    tty: goldfish: support platform_device with id -1
    drivers: tty: goldfish: Add device tree bindings
    ...

    Linus Torvalds
     

15 Mar, 2016

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle are:

    - Make schedstats a runtime tunable (disabled by default) and
    optimize it via static keys.

    As most distributions enable CONFIG_SCHEDSTATS=y due to its
    instrumentation value, this is a nice performance enhancement.
    (Mel Gorman)

    - Implement 'simple waitqueues' (swait): these are just pure
    waitqueues without any of the more complex features of full-blown
    waitqueues (callbacks, wake flags, wake keys, etc.). Simple
    waitqueues have less memory overhead and are faster.

    Use simple waitqueues in the RCU code (in 4 different places) and
    for handling KVM vCPU wakeups.

    (Peter Zijlstra, Daniel Wagner, Thomas Gleixner, Paul Gortmaker,
    Marcelo Tosatti)

    - sched/numa enhancements (Rik van Riel)

    - NOHZ performance enhancements (Rik van Riel)

    - Various sched/deadline enhancements (Steven Rostedt)

    - Various fixes (Peter Zijlstra)

    - ... and a number of other fixes, cleanups and smaller enhancements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
    sched/cputime: Fix steal_account_process_tick() to always return jiffies
    sched/deadline: Remove dl_new from struct sched_dl_entity
    Revert "kbuild: Add option to turn incompatible pointer check into error"
    sched/deadline: Remove superfluous call to switched_to_dl()
    sched/debug: Fix preempt_disable_ip recording for preempt_disable()
    sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
    time, acct: Drop irq save & restore from __acct_update_integrals()
    acct, time: Change indentation in __acct_update_integrals()
    sched, time: Remove non-power-of-two divides from __acct_update_integrals()
    sched/rt: Kick RT bandwidth timer immediately on start up
    sched/debug: Add deadline scheduler bandwidth ratio to /proc/sched_debug
    sched/debug: Move sched_domain_sysctl to debug.c
    sched/debug: Move the /sys/kernel/debug/sched_features file setup into debug.c
    sched/rt: Fix PI handling vs. sched_setscheduler()
    sched/core: Remove duplicated sched_group_set_shares() prototype
    sched/fair: Consolidate nohz CPU load update code
    sched/fair: Avoid using decay_load_missed() with a negative value
    sched/deadline: Always calculate end of period on sched_yield()
    sched/cgroup: Fix cgroup entity load tracking tear-down
    rcu: Use simple wait queues where possible in rcutree
    ...

    Linus Torvalds
     

08 Mar, 2016

1 commit

  • Logic has been changed in kernel 3.4 by commit e9aba5158a80
    ("tty: rework pty count limiting") but still not documented.

    Sysctl kernel.pty.max works as global limit, kernel.pty.reserve ptys
    are reserved for initial devpts instance (mounted without "newinstance").
    Per-instance limit also could be set by mount option "max=%d".

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
     

09 Feb, 2016

1 commit

  • schedstats is very useful during debugging and performance tuning but it
    incurs overhead to calculate the stats. As such, even though it can be
    disabled at build time, it is often enabled as the information is useful.

    This patch adds a kernel command-line and sysctl tunable to enable or
    disable schedstats on demand (when it's built in). It is disabled
    by default as someone who knows they need it can also learn to enable
    it when necessary.

    The benefits are dependent on how scheduler-intensive the workload is.
    If it is then the patch reduces the number of cycles spent calculating
    the stats with a small benefit from reducing the cache footprint of the
    scheduler.

    These measurements were taken from a 48-core 2-socket
    machine with Xeon(R) E5-2670 v3 cpus although they were also tested on a
    single socket machine 8-core machine with Intel i7-3770 processors.

    netperf-tcp
    4.5.0-rc1 4.5.0-rc1
    vanilla nostats-v3r1
    Hmean 64 560.45 ( 0.00%) 575.98 ( 2.77%)
    Hmean 128 766.66 ( 0.00%) 795.79 ( 3.80%)
    Hmean 256 950.51 ( 0.00%) 981.50 ( 3.26%)
    Hmean 1024 1433.25 ( 0.00%) 1466.51 ( 2.32%)
    Hmean 2048 2810.54 ( 0.00%) 2879.75 ( 2.46%)
    Hmean 3312 4618.18 ( 0.00%) 4682.09 ( 1.38%)
    Hmean 4096 5306.42 ( 0.00%) 5346.39 ( 0.75%)
    Hmean 8192 10581.44 ( 0.00%) 10698.15 ( 1.10%)
    Hmean 16384 18857.70 ( 0.00%) 18937.61 ( 0.42%)

    Small gains here, UDP_STREAM showed nothing intresting and neither did
    the TCP_RR tests. The gains on the 8-core machine were very similar.

    tbench4
    4.5.0-rc1 4.5.0-rc1
    vanilla nostats-v3r1
    Hmean mb/sec-1 500.85 ( 0.00%) 522.43 ( 4.31%)
    Hmean mb/sec-2 984.66 ( 0.00%) 1018.19 ( 3.41%)
    Hmean mb/sec-4 1827.91 ( 0.00%) 1847.78 ( 1.09%)
    Hmean mb/sec-8 3561.36 ( 0.00%) 3611.28 ( 1.40%)
    Hmean mb/sec-16 5824.52 ( 0.00%) 5929.03 ( 1.79%)
    Hmean mb/sec-32 10943.10 ( 0.00%) 10802.83 ( -1.28%)
    Hmean mb/sec-64 15950.81 ( 0.00%) 16211.31 ( 1.63%)
    Hmean mb/sec-128 15302.17 ( 0.00%) 15445.11 ( 0.93%)
    Hmean mb/sec-256 14866.18 ( 0.00%) 15088.73 ( 1.50%)
    Hmean mb/sec-512 15223.31 ( 0.00%) 15373.69 ( 0.99%)
    Hmean mb/sec-1024 14574.25 ( 0.00%) 14598.02 ( 0.16%)
    Hmean mb/sec-2048 13569.02 ( 0.00%) 13733.86 ( 1.21%)
    Hmean mb/sec-3072 12865.98 ( 0.00%) 13209.23 ( 2.67%)

    Small gains of 2-4% at low thread counts and otherwise flat. The
    gains on the 8-core machine were slightly different

    tbench4 on 8-core i7-3770 single socket machine
    Hmean mb/sec-1 442.59 ( 0.00%) 448.73 ( 1.39%)
    Hmean mb/sec-2 796.68 ( 0.00%) 794.39 ( -0.29%)
    Hmean mb/sec-4 1322.52 ( 0.00%) 1343.66 ( 1.60%)
    Hmean mb/sec-8 2611.65 ( 0.00%) 2694.86 ( 3.19%)
    Hmean mb/sec-16 2537.07 ( 0.00%) 2609.34 ( 2.85%)
    Hmean mb/sec-32 2506.02 ( 0.00%) 2578.18 ( 2.88%)
    Hmean mb/sec-64 2511.06 ( 0.00%) 2569.16 ( 2.31%)
    Hmean mb/sec-128 2313.38 ( 0.00%) 2395.50 ( 3.55%)
    Hmean mb/sec-256 2110.04 ( 0.00%) 2177.45 ( 3.19%)
    Hmean mb/sec-512 2072.51 ( 0.00%) 2053.97 ( -0.89%)

    In constract, this shows a relatively steady 2-3% gain at higher thread
    counts. Due to the nature of the patch and the type of workload, it's
    not a surprise that the result will depend on the CPU used.

    hackbench-pipes
    4.5.0-rc1 4.5.0-rc1
    vanilla nostats-v3r1
    Amean 1 0.0637 ( 0.00%) 0.0660 ( -3.59%)
    Amean 4 0.1229 ( 0.00%) 0.1181 ( 3.84%)
    Amean 7 0.1921 ( 0.00%) 0.1911 ( 0.52%)
    Amean 12 0.3117 ( 0.00%) 0.2923 ( 6.23%)
    Amean 21 0.4050 ( 0.00%) 0.3899 ( 3.74%)
    Amean 30 0.4586 ( 0.00%) 0.4433 ( 3.33%)
    Amean 48 0.5910 ( 0.00%) 0.5694 ( 3.65%)
    Amean 79 0.8663 ( 0.00%) 0.8626 ( 0.43%)
    Amean 110 1.1543 ( 0.00%) 1.1517 ( 0.22%)
    Amean 141 1.4457 ( 0.00%) 1.4290 ( 1.16%)
    Amean 172 1.7090 ( 0.00%) 1.6924 ( 0.97%)
    Amean 192 1.9126 ( 0.00%) 1.9089 ( 0.19%)

    Some small gains and losses and while the variance data is not included,
    it's close to the noise. The UMA machine did not show anything particularly
    different

    pipetest
    4.5.0-rc1 4.5.0-rc1
    vanilla nostats-v2r2
    Min Time 4.13 ( 0.00%) 3.99 ( 3.39%)
    1st-qrtle Time 4.38 ( 0.00%) 4.27 ( 2.51%)
    2nd-qrtle Time 4.46 ( 0.00%) 4.39 ( 1.57%)
    3rd-qrtle Time 4.56 ( 0.00%) 4.51 ( 1.10%)
    Max-90% Time 4.67 ( 0.00%) 4.60 ( 1.50%)
    Max-93% Time 4.71 ( 0.00%) 4.65 ( 1.27%)
    Max-95% Time 4.74 ( 0.00%) 4.71 ( 0.63%)
    Max-99% Time 4.88 ( 0.00%) 4.79 ( 1.84%)
    Max Time 4.93 ( 0.00%) 4.83 ( 2.03%)
    Mean Time 4.48 ( 0.00%) 4.39 ( 1.91%)
    Best99%Mean Time 4.47 ( 0.00%) 4.39 ( 1.91%)
    Best95%Mean Time 4.46 ( 0.00%) 4.38 ( 1.93%)
    Best90%Mean Time 4.45 ( 0.00%) 4.36 ( 1.98%)
    Best50%Mean Time 4.36 ( 0.00%) 4.25 ( 2.49%)
    Best10%Mean Time 4.23 ( 0.00%) 4.10 ( 3.13%)
    Best5%Mean Time 4.19 ( 0.00%) 4.06 ( 3.20%)
    Best1%Mean Time 4.13 ( 0.00%) 4.00 ( 3.39%)

    Small improvement and similar gains were seen on the UMA machine.

    The gain is small but it stands to reason that doing less work in the
    scheduler is a good thing. The downside is that the lack of schedstats and
    tracepoints may be surprising to experts doing performance analysis until
    they find the existence of the schedstats= parameter or schedstats sysctl.
    It will be automatically activated for latencytop and sleep profiling to
    alleviate the problem. For tracepoints, there is a simple warning as it's
    not safe to activate schedstats in the context when it's known the tracepoint
    may be wanted but is unavailable.

    Signed-off-by: Mel Gorman
    Reviewed-by: Matt Fleming
    Reviewed-by: Srikar Dronamraju
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1454663316-22048-1-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

03 Feb, 2016

1 commit

  • …/acme/linux into perf/core

    Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:

    User visible changes:

    - Rename the "colors.code" ~/.perfconfig variable to "colors.jump_arrows",
    as it controls just the that UI element in the annotate browser (Taeung Song)

    - Avoid trying to read ELF symtabs from device files, noticed while doing
    memory profiling work (Jiri Olsa)

    - Improve context detection when offering options in the hists browser,
    i.e. some options don't make sense when the browser is not working with
    a perf.data file ('perf top' mode), only in 'perf report' mode, like
    scripting (Namhyung Kim)

    Infrastructure changes:

    - Elliminate duplication in the hists browser filter functions, getting the
    common part into a function that receives callbacks for filtering by
    DSO, thread, etc. (Namhyung Kim)

    - Fix misleadingly indented assignment, found using
    gcc6 -Wmisleading-indentation (Markus Trippelsdorf)

    - Handle LLVM relocation oddities in libbpf, introducing a 'perf test' that
    detects such problems and then fixing the problem, so that the test now
    passes (Wang Nan)

    - More improvements to the build infrastructure to allow reusing the
    feature detection facilities (Wang Nan)

    - Auto initialize the globals needed by cpu__max_{cpu,node}() routines
    (Arnaldo Carvalho de Melo)

    Documentation changes:

    - Document the perf sysctls in Documentation/sysctl/kernel.txt (Ben Hutchings)

    - Document a bunch more ~/.perfconfig knobs (Taeung Song)

    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

26 Jan, 2016

1 commit

  • perf_event_paranoid was only documented in source code and a perf error
    message. Copy the documentation from the error message to
    Documentation/sysctl/kernel.txt.

    perf_cpu_time_max_percent was already documented but missing from the
    list at the top, so add it there.

    Signed-off-by: Ben Hutchings
    Cc: Peter Zijlstra
    Cc: linux-doc@vger.kernel.org
    Link: http://lkml.kernel.org/r/20160119213515.GG2637@decadent.org.uk
    [ Remove reference to external Documentation file, provide info inline, as before ]
    Signed-off-by: Arnaldo Carvalho de Melo

    Ben Hutchings
     

23 Jan, 2016

1 commit

  • Pull more vfs updates from Al Viro:
    "Embarrassing braino fix + pipe page accounting + fixing an eyesore in
    find_filesystem() (checking that s1 is equal to prefix of s2 of given
    length can be done in many ways, but "compare strlen(s1) with length
    and then do strncmp()" is not a good one...)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    [regression] fix braino in fs/dlm/user.c
    pipe: limit the per-user amount of pages allocated in pipes
    find_filesystem(): simplify comparison

    Linus Torvalds
     

21 Jan, 2016

1 commit

  • SYSCTL_WRITES_WARN was added in commit f4aacea2f5d1 ("sysctl: allow for
    strict write position handling"), and released in v3.16 in August of
    2014. Since then I can find only 1 instance of non-zero offset
    writing[1], and it was fixed immediately in CRIU[2]. As such, it
    appears safe to flip this to the strict state now.

    [1] https://www.google.com/search?q="when%20file%20position%20was%20not%200"
    [2] http://lists.openvz.org/pipermail/criu/2015-April/019819.html

    Signed-off-by: Kees Cook
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook