01 Dec, 2018

1 commit

  • commit 30aba6656f61ed44cba445a3c0d38b296fa9e8f5 upstream.

    Disallows opening FIFOs or regular files not owned by the user in
    world-writable sticky directories, unless the owner is the same as that
    of the directory or the file is opened without the O_CREAT flag. The
    purpose is to make data-spoofing attacks harder. This protection can be
    turned on and off separately for FIFOs and regular files via sysctl,
    just like the symlink/hardlink protection. This patch is based on
    Openwall's "HARDEN_FIFO" feature by Solar Designer.

    This is a brief list of old vulnerabilities that could have been
    prevented by this feature; some of them even allowed privilege
    escalation:

    CVE-2000-1134
    CVE-2007-3852
    CVE-2008-0525
    CVE-2009-0416
    CVE-2011-4834
    CVE-2015-1838
    CVE-2015-7442
    CVE-2016-7489

    This list is not meant to be complete. It's difficult to track down all
    vulnerabilities of this kind because they were often reported without
    any mention of this particular attack vector. In fact, before the
    hardlink/symlink restrictions, FIFOs and regular files weren't the
    favorite vehicle for exploiting them.

    [s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
    Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
    Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
    [keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
    [keescook@chromium.org: adjust commit subject]
    Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
    Signed-off-by: Salvatore Mesoraca
    Signed-off-by: Kees Cook
    Suggested-by: Solar Designer
    Suggested-by: Kees Cook
    Cc: Al Viro
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Loic
    Signed-off-by: Greg Kroah-Hartman

    Salvatore Mesoraca
     

23 Sep, 2017

1 commit

  • Pull seccomp updates from Kees Cook:
    "Major additions:

    - sysctl and seccomp operation to discover available actions
    (tyhicks)

    - new per-filter configurable logging infrastructure and sysctl
    (tyhicks)

    - SECCOMP_RET_LOG to log allowed syscalls (tyhicks)

    - SECCOMP_RET_KILL_PROCESS as the new strictest possible action

    - self-tests for new behaviors"

    [ This is the seccomp part of the security pull request during the merge
    window that was nixed due to unrelated problems - Linus ]

    * tag 'seccomp-v4.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    samples: Unrename SECCOMP_RET_KILL
    selftests/seccomp: Test thread vs process killing
    seccomp: Implement SECCOMP_RET_KILL_PROCESS action
    seccomp: Introduce SECCOMP_RET_KILL_PROCESS
    seccomp: Rename SECCOMP_RET_KILL to SECCOMP_RET_KILL_THREAD
    seccomp: Action to log before allowing
    seccomp: Filter flag to log all actions except SECCOMP_RET_ALLOW
    seccomp: Selftest for detection of filter flag support
    seccomp: Sysctl to configure actions that are allowed to be logged
    seccomp: Operation for checking if an action is available
    seccomp: Sysctl to display available actions
    seccomp: Provide matching filter for introspection
    selftests/seccomp: Refactor RET_ERRNO tests
    selftests/seccomp: Add simple seccomp overhead benchmark
    selftests/seccomp: Add tests for basic ptrace actions

    Linus Torvalds
     

07 Sep, 2017

3 commits

  • Merge updates from Andrew Morton:

    - various misc bits

    - DAX updates

    - OCFS2

    - most of MM

    * emailed patches from Andrew Morton : (119 commits)
    mm,fork: introduce MADV_WIPEONFORK
    x86,mpx: make mpx depend on x86-64 to free up VMA flag
    mm: add /proc/pid/smaps_rollup
    mm: hugetlb: clear target sub-page last when clearing huge page
    mm: oom: let oom_reap_task and exit_mmap run concurrently
    swap: choose swap device according to numa node
    mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
    mm, oom: do not rely on TIF_MEMDIE for memory reserves access
    z3fold: use per-cpu unbuddied lists
    mm, swap: don't use VMA based swap readahead if HDD is used as swap
    mm, swap: add sysfs interface for VMA based swap readahead
    mm, swap: VMA based swap readahead
    mm, swap: fix swap readahead marking
    mm, swap: add swap readahead hit statistics
    mm/vmalloc.c: don't reinvent the wheel but use existing llist API
    mm/vmstat.c: fix wrong comment
    selftests/memfd: add memfd_create hugetlbfs selftest
    mm/shmem: add hugetlbfs support to memfd_create()
    mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
    mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
    ...

    Linus Torvalds
     
  • Patch series "cleanup zonelists initialization", v1.

    This is aimed at cleaning up the zonelists initialization code we have,
    but the primary motivation was bug report [2], which got resolved, yet
    the usage of stop_machine is just too ugly to live. Most patches are
    straightforward, but 3 of them need special consideration.

    Patch 1 removes zone ordered zonelists completely. I am CCing linux-api
    because this is a user visible change. As I argue in the patch
    description I do not think we have a strong usecase for it these days.
    I have kept sysctl in place and warn into the log if somebody tries to
    configure zone lists ordering. If somebody has a real usecase for it we
    can revert this patch but I do not expect anybody will actually notice
    runtime differences. This patch is not strictly needed for the rest but
    it made patch 6 easier to implement.

    Patch 7 removes stop_machine from build_all_zonelists without adding any
    special synchronization between iterators and updater which I _believe_
    is acceptable as explained in the changelog. I hope I am not missing
    anything.

    Patch 8 then removes zonelists_mutex, which is kind of ugly as well and
    not really needed AFAICS, but care should be taken when double-checking
    my thinking.

    This patch (of 9):

    Supporting zone-ordered zonelists costs us a lot of code while the
    usefulness is arguable, if existent at all. Mel has already made node
    ordering the default on 64b systems. 32b systems are still using
    ZONELIST_ORDER_ZONE because it is considered better to fall back to a
    different NUMA node rather than consume precious lowmem zones.

    This argument is, however, weakened by the fact that memory reclaim has
    been reworked to be node- rather than zone-oriented. This means that
    lowmem requests have to skip over all highmem pages on LRUs already,
    so zone ordering doesn't save much reclaim time. The only remaining
    advantage of zone ordering shows under light memory pressure, when
    highmem requests never fall into lowmem zones and the lowmem pressure
    doesn't need to reclaim.

    Considering that 32b NUMA systems are rather suboptimal already and it
    is generally advisable to use a 64b kernel on such HW, I believe we
    should rather care about the code maintainability and just get rid of
    ZONELIST_ORDER_ZONE altogether. Keep the sysctl in place and warn if
    somebody tries to set zone ordering either from the kernel command
    line or the sysctl.
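
    For reference, a sketch of the post-series behaviour of the knob (node
    ordering remains the only mode; attempts to change it only warn):

    # sysctl vm.numa_zonelist_order
    vm.numa_zonelist_order = Node
    # sysctl -w vm.numa_zonelist_order=Zone    # ignored with a warning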

    [mhocko@suse.com: reading vm.numa_zonelist_order will never terminate]
    Link: http://lkml.kernel.org/r/20170721143915.14161-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Shaohua Li
    Cc: Toshi Kani
    Cc: Abdul Haleem
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Pull networking updates from David Miller:

    1) Support ipv6 checksum offload in sunvnet driver, from Shannon
    Nelson.

    2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric
    Dumazet.

    3) Allow generic XDP to work on virtual devices, from John Fastabend.

    4) Add bpf device maps and XDP_REDIRECT, which can be used to build
    arbitrary switching frameworks using XDP. From John Fastabend.

    5) Remove UFO offloads from the tree, gave us little other than bugs.

    6) Remove the IPSEC flow cache, from Florian Westphal.

    7) Support ipv6 route offload in mlxsw driver.

    8) Support VF representors in bnxt_en, from Sathya Perla.

    9) Add support for forward error correction modes to ethtool, from
    Vidya Sagar Ravipati.

    10) Add time filter for packet scheduler action dumping, from Jamal Hadi
    Salim.

    11) Extend the zerocopy sendmsg() used by virtio and tap to regular
    sockets via MSG_ZEROCOPY. From Willem de Bruijn.

    12) Significantly rework value tracking in the BPF verifier, from Edward
    Cree.

    13) Add new jump instructions to eBPF, from Daniel Borkmann.

    14) Rework rtnetlink plumbing so that operations can be run without
    taking the RTNL semaphore. From Florian Westphal.

    15) Support XDP in tap driver, from Jason Wang.

    16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal.

    17) Add Huawei hinic ethernet driver.

    18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan
    Delalande.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits)
    i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq
    i40e: avoid NVM acquire deadlock during NVM update
    drivers: net: xgene: Remove return statement from void function
    drivers: net: xgene: Configure tx/rx delay for ACPI
    drivers: net: xgene: Read tx/rx delay for ACPI
    rocker: fix kcalloc parameter order
    rds: Fix non-atomic operation on shared flag variable
    net: sched: don't use GFP_KERNEL under spin lock
    vhost_net: correctly check tx avail during rx busy polling
    net: mdio-mux: add mdio_mux parameter to mdio_mux_init()
    rxrpc: Make service connection lookup always check for retry
    net: stmmac: Delete dead code for MDIO registration
    gianfar: Fix Tx flow control deactivation
    cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6
    cxgb4: Fix pause frame count in t4_get_port_stats
    cxgb4: fix memory leak
    tun: rename generic_xdp to skb_xdp
    tun: reserve extra headroom only when XDP is set
    net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping
    net: dsa: bcm_sf2: Advertise number of egress queues
    ...

    Linus Torvalds
     

24 Aug, 2017

2 commits

  • …nux/kernel/git/acme/linux into perf/core

    Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:

    - Expression parser enhancements for metrics (Andi Kleen)

    - Fix buffer overflow while freeing events in 'perf stat' (Andi Kleen)

    - Fix static linking with elfutils's libdw and with libunwind
    in Debian/Ubuntu (Konstantin Khlebnikov)

    - Tighten detection of BPF events, avoiding matching some other PMU
    events such as 'cpu/uops_executed.core,cmask=1/' as a .c source
    file that ended up being considered a BPF event (Andi Kleen)

    - Add Skylake server uncore JSON vendor events (Andi Kleen)

    - Add support for printing new mem_info encodings, including
    'perf test' checks (Andi Kleen)

    - Really install manpages via 'make install-man' (Konstantin Khlebnikov)

    - Fix documentation for perf_event_paranoid and perf_event_mlock_kb
    sysctls (Konstantin Khlebnikov)

    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
    As eBPF JIT support for arm32 was added recently with
    commit 39c13c204bb1150d401e27d41a9d8b332be47c49, it seems appropriate to
    list arm32 as an arch with eBPF JIT support in the bpf and sysctl docs
    as well.

    Signed-off-by: Shubham Bansal
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Shubham Bansal
     

23 Aug, 2017

1 commit

    Fix misprint CAP_IOC_LOCK -> CAP_IPC_LOCK. This capability has nothing
    to do with raw tracepoints; this part is about bypassing mlock limits.

    Sysctl kernel.perf_event_paranoid = -1 allows raw and ftrace function
    tracepoints without CAP_SYS_ADMIN.
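
    The setting mentioned above, as a concrete invocation (a sketch; -1
    lifts the paranoia restrictions entirely):

    # sysctl -w kernel.perf_event_paranoid=-1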

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Shishkin
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/150322916080.129746.11285255474738558340.stgit@buzz
    Signed-off-by: Arnaldo Carvalho de Melo

    Konstantin Khlebnikov
     

15 Aug, 2017

1 commit

  • This patch creates a read-only sysctl containing an ordered list of
    seccomp actions that the kernel supports. The ordering, from left to
    right, is the lowest action value (kill) to the highest action value
    (allow). Currently, a read of the sysctl file would return "kill trap
    errno trace allow". The contents of this sysctl file can be useful for
    userspace code as well as the system administrator.

    The path to the sysctl is:

    /proc/sys/kernel/seccomp/actions_avail
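
    A read returns the supported actions in order; for the kernel this
    patch targets, that looks like:

    # cat /proc/sys/kernel/seccomp/actions_avail
    kill trap errno trace allow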

    libseccomp and other userspace code can easily determine which actions
    the current kernel supports. The set of actions supported by the
    running kernel may differ from the set of action macros found in the
    kernel headers installed where the userspace code was built.

    In addition, this sysctl will allow system administrators to know which
    actions are supported by the kernel and make it easier to configure
    exactly what seccomp logs through the audit subsystem. Support for this
    level of logging configuration will come in a future patch.

    Signed-off-by: Tyler Hicks
    Signed-off-by: Kees Cook

    Tyler Hicks
     

11 Jul, 2017

1 commit

    It seems that there are still people using 32b kernels with a lot of
    memory, and IO tends to suck a lot for them by default, mostly
    because writers are throttled too when the lowmem is used. We have
    highmem_is_dirtyable to work around that issue, but it seems we never
    bothered to document it. Let's do it now, finally.
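
    The knob being documented is a simple boolean; a sketch of enabling it
    on an affected 32b machine:

    # sysctl -w vm.highmem_is_dirtyable=1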

    Link: http://lkml.kernel.org/r/20170626093200.18958-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Alkis Georgopoulos
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

22 Apr, 2017

1 commit

  • Constants used for tuning are generally a bad idea, especially as hardware
    changes over time. Replace the constant 2 jiffies with sysctl variable
    netdev_budget_usecs to enable sysadmins to tune the softirq processing.
    Also document the variable.

    For example, a very fast machine might tune this to 1000 microseconds,
    while my regression testing 486DX-25 needs it to be 4000 microseconds on
    a nearly idle network to prevent time_squeeze from being incremented.
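
    A sketch of the tuning described above (the knob lives under
    net.core):

    # sysctl -w net.core.netdev_budget_usecs=1000    # very fast machine
    # sysctl -w net.core.netdev_budget_usecs=4000    # 486DX-25, idle network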

    Version 2: changed jiffies to microseconds for predictable units.

    Signed-off-by: Matthew Whitehead
    Signed-off-by: David S. Miller

    Matthew Whitehead
     

05 Mar, 2017

1 commit

  • Pull documentation fixes from Jonathan Corbet:
    "A few fixes for the docs tree, including one for a 4.11 build
    regression"

    * tag 'docs-4.11-fixes' of git://git.lwn.net/linux:
    Documentation/sphinx: fix primary_domain configuration
    docs: Fix htmldocs build failure
    doc/ko_KR/memory-barriers: Update control-dependencies section
    pcieaer doc: update the link
    Documentation: Update path to sysrq.txt

    Linus Torvalds
     

25 Feb, 2017

1 commit

  • If madvise(2) advice will result in the underlying vma being split and
    the number of areas mapped by the process will exceed
    /proc/sys/vm/max_map_count as a result, return ENOMEM instead of EAGAIN.

    EAGAIN is returned by madvise(2) when a kernel resource, such as slab,
    is temporarily unavailable. It indicates that userspace should retry
    the advice in the near future. This is important for advice such as
    MADV_DONTNEED which is often used by malloc implementations to free
    memory back to the system: we really do want to free memory back when
    madvise(2) returns EAGAIN because slab allocations (for vmas, anon_vmas,
    or mempolicies) cannot be allocated.

    Encountering /proc/sys/vm/max_map_count is not a temporary failure,
    however, so return ENOMEM to indicate this is a more serious issue. A
    followup patch to the man page will specify this behavior.
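
    Since hitting the limit is not transient, the administrative remedy is
    to raise it; a sketch, assuming the usual 65530 default:

    # sysctl vm.max_map_count
    vm.max_map_count = 65530
    # sysctl -w vm.max_map_count=131060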

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701241431120.42507@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Cc: Jonathan Corbet
    Cc: Johannes Weiner
    Cc: Jerome Marchand
    Cc: "Kirill A. Shutemov"
    Cc: Michael Kerrisk
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

18 Feb, 2017

1 commit

  • Long standing issue with JITed programs is that stack traces from
    function tracing check whether a given address is kernel code
    through {__,}kernel_text_address(), which checks for code in core
    kernel, modules and dynamically allocated ftrace trampolines. But
    what is still missing is BPF JITed programs (interpreted programs
    are not an issue as __bpf_prog_run() will be attributed to them),
    thus when a stack trace is triggered, the code walking the stack
    won't see any of the JITed ones. The same for address correlation
    done from user space via reading /proc/kallsyms. This is read by
    tools like perf, but the latter is also useful for permanent live
    tracing with eBPF itself in combination with stack maps when other
    eBPF types are part of the callchain. See offwaketime example on
    dumping stack from a map.

    This work tries to tackle that issue by making the addresses and
    symbols known to the kernel. The lookup from *kernel_text_address()
    is implemented through a latched RB tree that can be read under
    RCU in fast-path that is also shared for symbol/size/offset lookup
    for a specific given address in kallsyms. The slow-path iteration
    through all symbols in the seq file is done via an RCU list, which
    holds a tiny fraction of all exported ksyms, usually below 0.1 percent.
    Function symbols are exported as bpf_prog_<tag>, in order to aid
    debugging and attribution. This facility is currently enabled for
    root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
    is active in any mode. The rationale behind this is that still a lot
    of systems ship with world read permissions on kallsyms thus addresses
    should not get suddenly exposed for them. If that situation gets
    much better in future, we always have the option to change the
    default on this. Likewise, unprivileged programs are not allowed
    to add entries there either, but that is less of a concern as most
    such programs types relevant in this context are for root-only anyway.
    If enabled, call graphs and stack traces will then show a correct
    attribution; one example is illustrated below, where the trace is
    now visible in tooling such as perf script --kallsyms=/proc/kallsyms
    and friends.
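
    A sketch of enabling the facility (root-only, and per the above it
    stays off while any JIT hardening mode is active):

    # echo 1 > /proc/sys/net/core/bpf_jit_enable
    # echo 1 > /proc/sys/net/core/bpf_jit_kallsyms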

    Before:

    7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
    f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)

    After:

    7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
    [...]
    7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

30 Dec, 2016

1 commit

    Often, introducing side effects on packet processing on the other half
    of the stack by adjusting one of TX/RX via sysctl is not desirable.
    There are cases that demand asymmetric, orthogonal configurability.

    This holds true especially for nodes where RPS (for RFS usage on top)
    is configured, which therefore use the old dev_weight. This is quite a
    common base configuration nowadays, even with NICs of superior
    processing support (e.g. aRFS).

    A good example use case are nodes acting as NoSQL databases, with a
    large number of tiny requests and rather fewer but larger packets as
    responses. It's affordable to have a large budget and large rx
    dev_weights for the requests, but as a side effect, having this large
    a number processed on TX in one run can overwhelm drivers.

    This patch therefore introduces an independent configurability via sysctl to
    userland.
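
    A sketch of such an asymmetric setup, assuming the dev_weight_rx_bias
    and dev_weight_tx_bias knobs this patch introduces (each scales the
    base dev_weight for its direction):

    # sysctl -w net.core.dev_weight_rx_bias=4    # more RX work per poll
    # sysctl -w net.core.dev_weight_tx_bias=1    # keep TX at the base weight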

    Signed-off-by: Matthias Tafelmeier
    Signed-off-by: David S. Miller

    Matthias Tafelmeier
     

13 Dec, 2016

1 commit

  • Pull documentation update from Jonathan Corbet:
    "These are the documentation changes for 4.10.

    It's another busy cycle for the docs tree, as the sphinx conversion
    continues. Highlights include:

    - Further work on PDF output, which remains a bit of a pain but
    should be more solid now.

    - Five more DocBook template files converted to Sphinx. Only 27 to
    go... Lots of plain-text files have also been converted and
    integrated.

    - Images in binary formats have been replaced with more
    source-friendly versions.

    - Various bits of organizational work, including the renaming of
    various files discussed at the kernel summit.

    - New documentation for the device_link mechanism.

    ... and, of course, lots of typo fixes and small updates"

    * tag 'docs-4.10' of git://git.lwn.net/linux: (193 commits)
    dma-buf: Extract dma-buf.rst
    Update Documentation/00-INDEX
    docs: 00-INDEX: document directories/files with no docs
    docs: 00-INDEX: remove non-existing entries
    docs: 00-INDEX: add missing entries for documentation files/dirs
    docs: 00-INDEX: consolidate process/ and admin-guide/ description
    scripts: add a script to check if Documentation/00-INDEX is sane
    Docs: change sh -> awk in REPORTING-BUGS
    Documentation/core-api/device_link: Add initial documentation
    core-api: remove an unexpected unident
    ppc/idle: Add documentation for powersave=off
    Doc: Correct typo, "Introdution" => "Introduction"
    Documentation/atomic_ops.txt: convert to ReST markup
    Documentation/local_ops.txt: convert to ReST markup
    Documentation/assoc_array.txt: convert to ReST markup
    docs-rst: parse-headers.pl: cleanup the documentation
    docs-rst: fix media cleandocs target
    docs-rst: media/Makefile: reorganize the rules
    docs-rst: media: build SVG from graphviz files
    docs-rst: replace bayer.png by a SVG image
    ...

    Linus Torvalds
     

26 Oct, 2016

1 commit

  • For mostly historical reasons, the x86 oops dump shows the raw stack
    values:

    ...
    [registers]
    Stack:
    ffff880079af7350 ffff880079905400 0000000000000000 ffffc900008f3ae0
    ffffffffa0196610 0000000000000001 00010000ffffffff 0000000087654321
    0000000000000002 0000000000000000 0000000000000000 0000000000000000
    Call Trace:
    ...

    This seems to be an artifact from long ago, and probably isn't needed
    anymore. It generally just adds noise to the dump, and it can be
    actively harmful because it leaks kernel addresses.

    Linus says:

    "The stack dump actually goes back to forever, and it used to be
    useful back in 1992 or so. But it used to be useful mainly because
    stacks were simpler and we didn't have very good call traces anyway. I
    definitely remember having used them - I just do not remember having
    used them in the last ten+ years.

    Of course, it's still true that if you can trigger an oops, you've
    likely already lost the security game, but since the stack dump is so
    useless, let's aim to just remove it and make games like the above
    harder."

    This also removes the related 'kstack=' cmdline option and the
    'kstack_depth_to_print' sysctl.

    Suggested-by: Linus Torvalds
    Signed-off-by: Josh Poimboeuf
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/e83bd50df52d8fe88e94d2566426ae40d813bf8f.1477405374.git.jpoimboe@redhat.com
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf
     

01 Oct, 2016

1 commit

  • CAI Qian pointed out that the semantics
    of shared subtrees make it possible to create an exponentially
    increasing number of mounts in a mount namespace.

    mkdir /tmp/1 /tmp/2
    mount --make-rshared /
    for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

    This will create 2^20 (1048576) mounts, which is a practical problem,
    as some people have managed to hit this by accident.

    As such CVE-2016-6213 was assigned.

    Ian Kent described the situation for autofs users
    as follows:

    > The number of mounts for direct mount maps is usually not very large because of
    > the way they are implemented, large direct mount maps can have performance
    > problems. There can be anywhere from a few (likely case a few hundred) to less
    > than 10000, plus mounts that have been triggered and not yet expired.
    >
    > Indirect mounts have one autofs mount at the root plus the number of mounts that
    > have been triggered and not yet expired.
    >
    > The number of autofs indirect map entries can range from a few to the common
    > case of several thousand and in rare cases up to between 30000 and 50000. I've
    > not heard of people with maps larger than 50000 entries.
    >
    > The larger the number of map entries the greater the possibility for a large
    > number of active mounts so it's not hard to expect cases of a 1000 or somewhat
    > more active mounts.

    So I am setting the default number of mounts allowed per mount
    namespace at 100,000. This is more than enough for any use case I
    know of, but small enough to quickly stop an exponential increase
    in mounts, which should be perfect for catching misconfigurations and
    malfunctioning programs.

    For anyone who needs a higher limit this can be changed by writing
    to the new /proc/sys/fs/mount-max sysctl.
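
    Checking and, if really needed, raising the limit (a sketch; 100,000
    is the default being set here):

    # cat /proc/sys/fs/mount-max
    100000
    # echo 1000000 > /proc/sys/fs/mount-max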

    Tested-by: CAI Qian
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

03 Aug, 2016

1 commit

  • Add a "printk.devkmsg" kernel command line parameter which controls how
    userspace writes into /dev/kmsg. It has three options:

    * ratelimit - ratelimit logging from userspace
    * on - unlimited logging from userspace
    * off - logging from userspace gets ignored

    The default setting is to ratelimit the messages written to it.

    This changes the kernel default setting from "on" to "ratelimit"
    because we want to keep userspace spamming of /dev/kmsg at sane levels.
    This matters especially when a small kernel log buffer wraps around and
    messages get lost, so with ratelimiting in place, kernel messages have
    a somewhat higher chance of surviving all the spamming.

    It additionally does not limit logging to /dev/kmsg while the system is
    booting if we haven't disabled it on the command line.

    Furthermore, we can control the logging from a lower priority sysctl
    interface - kernel.printk_devkmsg.

    That interface will succeed only if printk.devkmsg *hasn't* been
    supplied on the command line. If it has, then printk.devkmsg is a
    one-time setting which remains for the duration of the system lifetime.
    This "locking" of the setting is to prevent userspace from changing the
    logging on us through sysctl(2).
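
    A sketch of the sysctl side (the write fails once printk.devkmsg has
    been given on the command line, as described above):

    # cat /proc/sys/kernel/printk_devkmsg
    ratelimit
    # echo on > /proc/sys/kernel/printk_devkmsg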

    This patch is based on previous patches from Linus and Steven.

    [bp@suse.de: fixes]
    Link: http://lkml.kernel.org/r/20160719072344.GC25563@nazgul.tnic
    Link: http://lkml.kernel.org/r/20160716061745.15795-3-bp@alien8.de
    Signed-off-by: Borislav Petkov
    Cc: Dave Young
    Cc: Franck Bui
    Cc: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Uwe Kleine-König
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     

27 Jul, 2016

1 commit

  • Pull documentation updates from Jonathan Corbet:
    "Some big changes this month, headlined by the addition of a new
    formatted documentation mechanism based on the Sphinx system.

    The objectives here are to make it easier to create better-integrated
    (and more attractive) documents while (eventually) dumping our
    one-of-a-kind, cobbled-together system for something that is widely
    used and maintained by others. There's a fair amount of information
    what's being done, why, and how to use it in:

    https://lwn.net/Articles/692704/
    https://lwn.net/Articles/692705/

    Closer to home, Documentation/kernel-documentation.rst describes how
    it works.

    For now, the new system exists alongside the old one; you should soon
    see the GPU documentation converted over in the DRM pull and some
    significant media conversion work as well. Once all the docs have
    been moved over and we're convinced that the rough edges (of which are
    are a few) have been smoothed over, the DocBook-based stuff should go
    away.

    Primary credit is to Jani Nikula for doing the heavy lifting to make
    this stuff actually work; there has also been notable effort from
    Markus Heiser, Daniel Vetter, and Mauro Carvalho Chehab.

    Expect a couple of conflicts on the new index.rst file over the course
    of the merge window; they are trivially resolvable. That file may be
    a bit of a conflict magnet in the short term, but I don't expect that
    situation to last for any real length of time.

    Beyond that, of course, we have the usual collection of tweaks,
    updates, and typo fixes"

    * tag 'docs-for-linus' of git://git.lwn.net/linux: (77 commits)
    doc-rst: kernel-doc: fix handling of address_space tags
    Revert "doc/sphinx: Enable keep_warnings"
    doc-rst: kernel-doc directive, fix state machine reporter
    docs: deprecate kernel-doc-nano-HOWTO.txt
    doc/sphinx: Enable keep_warnings
    Documentation: add watermark_scale_factor to the list of vm systcl file
    kernel-doc: Fix up warning output
    docs: Get rid of some kernel-documentation warnings
    doc-rst: add an option to ignore DocBooks when generating docs
    workqueue: Fix a typo in workqueue.txt
    Doc: ocfs: Fix typo in filesystems/ocfs2-online-filecheck.txt
    Documentation/sphinx: skip build if user requested specific DOCBOOKS
    Documentation: add cleanmediadocs to the documentation targets
    Add .pyc files to .gitignore
    Doc: PM: Fix a typo in intel_powerclamp.txt
    doc-rst: flat-table directive - initial implementation
    Documentation: add meta-documentation for Sphinx and kernel-doc
    Documentation: tiny typo fix in usb/gadget_multi.txt
    Documentation: fix wrong value in md.txt
    bcache: documentation formatting, edited for clarity, stripe alignment notes
    ...

    Linus Torvalds
     

16 Jun, 2016

1 commit

    It is not always easy to determine the cause of an RCU stall just by
    analysing the RCU stall messages, mainly when the problem is caused
    by the indirect starvation of RCU threads; for example, when
    preempt_rcu is not awakened due to the starvation of a timer softirq.

    We have been hard-coding panic() in the RCU stall functions for
    some time while testing the kernel-rt, but this is not possible in
    some scenarios, like when supporting customers.

    This patch implements the sysctl kernel.panic_on_rcu_stall. If
    set to 1, the system will panic() when an RCU stall takes place,
    enabling the capture of a vmcore. The vmcore provides a way to analyze
    all kernel/tasks states, helping out to point to the culprit and the
    solution for the stall.

    The kernel.panic_on_rcu_stall sysctl is disabled by default.
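
    Arming it for vmcore capture is a one-liner (a sketch):

    # sysctl -w kernel.panic_on_rcu_stall=1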

    Changes from v1:
    - Fixed a typo in the git log
    - The if(sysctl_panic_on_rcu_stall) panic() is in a static function
    - Fixed the CONFIG_TINY_RCU compilation issue
    - The var sysctl_panic_on_rcu_stall is now __read_mostly

    Cc: Jonathan Corbet
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Mathieu Desnoyers
    Cc: Lai Jiangshan
    Acked-by: Christian Borntraeger
    Reviewed-by: Josh Triplett
    Reviewed-by: Arnaldo Carvalho de Melo
    Tested-by: "Luis Claudio R. Goncalves"
    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Paul E. McKenney

    Daniel Bristot de Oliveira
     

26 May, 2016

1 commit

  • Pull perf updates from Ingo Molnar:
    "Mostly tooling and PMU driver fixes, but also a number of late updates
    such as the reworking of the call-chain size limiting logic to make
    call-graph recording more robust, plus tooling side changes for the
    new 'backwards ring-buffer' extension to the perf ring-buffer"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
    perf record: Read from backward ring buffer
    perf record: Rename variable to make code clear
    perf record: Prevent reading invalid data in record__mmap_read
    perf evlist: Add API to pause/resume
    perf trace: Use the ptr->name beautifier as default for "filename" args
    perf trace: Use the fd->name beautifier as default for "fd" args
    perf report: Add srcline_from/to branch sort keys
    perf evsel: Record fd into perf_mmap
    perf evsel: Add overwrite attribute and check write_backward
    perf tools: Set buildid dir under symfs when --symfs is provided
    perf trace: Only auto set call-graph to "dwarf" when syscalls are being traced
    perf annotate: Sort list of recognised instructions
    perf annotate: Fix identification of ARM blt and bls instructions
    perf tools: Fix usage of max_stack sysctl
    perf callchain: Stop validating callchains by the max_stack sysctl
    perf trace: Fix exit_group() formatting
    perf top: Use machine->kptr_restrict_warned
    perf trace: Warn when trying to resolve kernel addresses with kptr_restrict=1
    perf machine: Do not bail out if not managing to read ref reloc symbol
    perf/x86/intel/p4: Trival indentation fix, remove space
    ...

    Linus Torvalds
     

20 May, 2016

1 commit

  • Provide /proc/sys/vm/stat_refresh to force an immediate update of
    per-cpu into global vmstats: useful to avoid a sleep(2) or whatever
    before checking counts when testing. Originally added to work around a
    bug which left counts stranded indefinitely on a cpu going idle (an
    inaccuracy magnified when small below-batch numbers represent "huge"
    amounts of memory), but I believe that bug is now fixed: nonetheless,
    this is still a useful knob.

    Its schedule_on_each_cpu() is probably too expensive just to fold into
    reading /proc/meminfo itself: give this mode 0600 to prevent abuse.
    Allow a write or a read to do the same: nothing to read, but "grep -h
    Shmem /proc/sys/vm/stat_refresh /proc/meminfo" is convenient. Oh, and
    since global_page_state() itself is careful to disguise any underflow as
    0, hack in an "Invalid argument" and pr_warn() if a counter is negative
    after the refresh - this helped to fix a misaccounting of
    NR_ISOLATED_FILE in my migration code.

    But on recent kernels, I find that NR_ALLOC_BATCH and NR_PAGES_SCANNED
    often go negative some of the time. I have not yet worked out why, but
    have no evidence that it's actually harmful. Punt for the moment by
    just ignoring the anomaly on those.
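
    The combined read mentioned above, spelled out (reading stat_refresh
    produces no output itself; it just flushes the per-cpu deltas before
    /proc/meminfo is read):

    # grep -h Shmem /proc/sys/vm/stat_refresh /proc/meminfo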

    Signed-off-by: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Andres Lagar-Cavilla
    Cc: Yang Shi
    Cc: Ning Qu
    Cc: Mel Gorman
    Cc: Andres Lagar-Cavilla
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

18 May, 2016

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    1) Support SPI based w5100 devices, from Akinobu Mita.

    2) Partial Segmentation Offload, from Alexander Duyck.

    3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.

    4) Allow cls_flower stats offload, from Amir Vadai.

    5) Implement bpf blinding, from Daniel Borkmann.

    6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
    actually using FASYNC these atomics are superfluous. From Eric
    Dumazet.

    7) Run TCP more preemptibly, also from Eric Dumazet.

    8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
    driver, from Gal Pressman.

    9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.

    10) Improve BPF usage documentation, from Jesper Dangaard Brouer.

    11) Support tunneling offloads in qed, from Manish Chopra.

    12) aRFS offloading in mlx5e, from Maor Gottlieb.

    13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
    Leitner.

    14) Add MSG_EOR support to TCP, this allows controlling packet
    coalescing on application record boundaries for more accurate
    socket timestamp sampling. From Martin KaFai Lau.

    15) Fix alignment of 64-bit netlink attributes across the board, from
    Nicolas Dichtel.

    16) Per-vlan stats in bridging, from Nikolay Aleksandrov.

    17) Several conversions of drivers to ethtool ksettings, from Philippe
    Reynes.

    18) Checksum neutral ILA in ipv6, from Tom Herbert.

    19) Factorize all of the various marvell dsa drivers into one, from
    Vivien Didelot

    20) Add VF support to qed driver, from Yuval Mintz"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
    Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
    Revert "phy dp83867: Make rgmii parameters optional"
    r8169: default to 64-bit DMA on recent PCIe chips
    phy dp83867: Make rgmii parameters optional
    phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
    bpf: arm64: remove callee-save registers use for tmp registers
    asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
    switchdev: pass pointer to fib_info instead of copy
    net_sched: close another race condition in tcf_mirred_release()
    tipc: fix nametable publication field in nl compat
    drivers: net: Don't print unpopulated net_device name
    qed: add support for dcbx.
    ravb: Add missing free_irq() calls to ravb_close()
    qed: Remove a stray tab
    net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fec-mpc52xx: use phydev from struct net_device
    bpf, doc: fix typo on bpf_asm descriptions
    stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
    net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fs-enet: use phydev from struct net_device
    ...

    Linus Torvalds
     

17 May, 2016

2 commits

    The perf_sample->ip_callchain->nr value includes all the entries in the
    ip_callchain->ip[] array, both real addresses and the
    PERF_CONTEXT_{KERNEL,USER,etc} markers, while what the user expects is
    that the kernel.perf_event_max_stack sysctl, or the upcoming per-event
    perf_event_attr.sample_max_stack knob, be honoured in terms of IP
    addresses in the stack trace.

    So allocate a bunch of extra entries for contexts, and do the
    accounting via perf_callchain_entry_ctx struct members.

    A new sysctl, kernel.perf_event_max_contexts_per_stack, is also
    introduced for investigating possible bugs in the callchain
    implementation by some arch.
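
    A sketch of inspecting the new knob; the default shown is an
    assumption (PERF_MAX_CONTEXTS_PER_STACK, believed to be 8), not stated
    above:

    # cat /proc/sys/kernel/perf_event_max_contexts_per_stack
    8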

    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Brendan Gregg
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/n/tip-3b4wnqk340c4sg4gwkfdi9yk@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     
  • This work adds a generic facility for use from eBPF JIT compilers
    that allows for further hardening of JIT generated images through
    blinding constants. In response to the original work on BPF JIT
    spraying published by Keegan McAllister [1], most BPF JITs were
    changed to make images read-only and start at a randomized offset
    in the page, where the rest was filled with trap instructions. We
    have this nowadays in x86, arm, arm64 and s390 JIT compilers.
    Additionally, later work also made eBPF interpreter images read
    only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
    arm, arm64 and s390 archs as well currently. This is done by
    default for mentioned JITs when JITing is enabled. Furthermore,
    we had a generic and configurable constant blinding facility on our
    todo list for quite some time now to further make spraying harder, and
    a first implementation of it has been in the works since around
    netconf 2016.

    We found that for systems where untrusted users can load cBPF/eBPF
    code where JIT is enabled, start offset randomization helps a bit
    to make jumps into crafted payload harder, but in case where larger
    programs that cross page boundary are injected, we again have some
    part of the program opcodes at a page start offset. With improved
    guessing and more reliable payload injection, chances can increase
    to jump into such payload. Elena Reshetova recently wrote a test
    case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
    can leave some more room for payloads. Note that for all this,
    additional bugs in the kernel are still required to make the jump
    (and of course to guess right, to not jump into a trap) and naturally
    the JIT must be enabled, which is disabled by default.

    For helping mitigation, the general idea is to provide an option
    bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
    that for cases where JIT should be enabled for performance reasons,
    the generated image can be further hardened with blinding constants
    for unprivileged users (bpf_jit_harden == 1), trading off
    performance for these, but not for privileged ones. We also added
    the option of blinding for all users (bpf_jit_harden == 2), which
    is quite helpful for testing, e.g. with test_bpf.ko. No further
    hardening levels of the bpf_jit_harden switch are intended; the
    rationale is to have it dead simple to use as on/off. Since this
    functionality would need to be duplicated over and over for JIT
    compilers to use, which are already complex enough, we provide a
    generic eBPF byte-code level based blinding implementation, which is
    then just transparently JITed. JIT compilers need to make only a few
    changes to integrate this facility and can be migrated one by one.

    This option is for eBPF JITs and will be used in x86, arm64, s390
    without too much effort, and soon ppc64 JITs, thus that native eBPF
    can be blinded as well as cBPF to eBPF migrations, so that both can
    be covered with a single implementation. The rule for JITs is that
    bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
    and in case blinding is disabled, we follow normally with JITing the
    passed program. In case blinding is enabled and we fail during the
    process of blinding itself, we must return with the interpreter.
    Similarly, in case the JITing process after the blinding failed, we
    return normally to the interpreter with the non-blinded code. Meaning,
    interpreter doesn't change in any way and operates on eBPF code as
    usual. For doing this pre-JIT blinding step, we need to make use of
    a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
    to the JIT and not in any way part of the eBPF architecture. Just like
    in the same way as JITs internally make use of some helper registers
    when emitting code, only that here the helper register is one
    abstraction level higher in eBPF bytecode, but nevertheless in JIT
    phase. That helper register is needed since f.e. manually written
    program can issue loads to all registers of eBPF architecture.

    The core concept with the additional register is: blind out all 32
    and 64 bit constants by converting BPF_K based instructions into a
    small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
    is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
    and REG <OP> BPF_REG_AX, so the actual operation on the target register
    is translated from a BPF_K form into a BPF_X one that operates on
    BPF_REG_AX's content. During the rewriting phase when blinding, RND is
    newly generated via prandom_u32() for each processed instruction.
    64 bit loads are split into two 32 bit loads to make translation and
    patching not too complex. Only basic thing required by JITs is to
    call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
    pair, and to map BPF_REG_AX into an unused register.

    Small bpf_jit_disasm extract from [2] when applied to x86 JIT:

    echo 0 > /proc/sys/net/core/bpf_jit_harden

    ffffffffa034f5e9 + :
    [...]
    39: mov $0xa8909090,%eax
    3e: mov $0xa8909090,%eax
    43: mov $0xa8ff3148,%eax
    48: mov $0xa89081b4,%eax
    4d: mov $0xa8900bb0,%eax
    52: mov $0xa810e0c1,%eax
    57: mov $0xa8908eb4,%eax
    5c: mov $0xa89020b0,%eax
    [...]

    echo 1 > /proc/sys/net/core/bpf_jit_harden

    ffffffffa034f1e5 + :
    [...]
    39: mov $0xe1192563,%r10d
    3f: xor $0x4989b5f3,%r10d
    46: mov %r10d,%eax
    49: mov $0xb8296d93,%r10d
    4f: xor $0x10b9fd03,%r10d
    56: mov %r10d,%eax
    59: mov $0x8c381146,%r10d
    5f: xor $0x24c7200e,%r10d
    66: mov %r10d,%eax
    69: mov $0xeb2a830e,%r10d
    6f: xor $0x43ba02ba,%r10d
    76: mov %r10d,%eax
    79: mov $0xd9730af,%r10d
    7f: xor $0xa5073b1f,%r10d
    86: mov %r10d,%eax
    89: mov $0x9a45662b,%r10d
    8f: xor $0x325586ea,%r10d
    96: mov %r10d,%eax
    [...]

    As can be seen, original constants that carry payload are hidden
    when enabled, actual operations are transformed from constant-based
    to register-based ones, making jumps into constants ineffective.
    Above extract/example uses single BPF load instruction over and
    over, but of course all instructions with constants are blinded.

    Performance-wise, JIT with blinding performs a bit slower than plain
    JIT and faster than the interpreter case. This is expected, since we
    still get all the performance benefits from JITing and in normal
    use-cases not every single instruction needs to be blinded. Summing
    up all 296 test cases averaged over multiple runs from test_bpf.ko
    suite, interpreter was 55% slower than JIT only and JIT with blinding
    was 8% slower than JIT only. Since there are also some extremes in
    the test suite, I expect for ordinary workloads that the performance
    for the JIT with blinding case is even closer to JIT only case,
    f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
    35 (+ blinding), and 151 (interpreter).

    BPF test suite, seccomp test suite, eBPF sample code and various
    bigger networking eBPF programs have been tested with this and were
    running fine. For testing purposes, I also adapted interpreter and
    redirected blinded eBPF image to interpreter and also here all tests
    pass.

    [1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
    [2] https://github.com/01org/jit-spray-poc-for-ksp/
    [3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5

    Signed-off-by: Daniel Borkmann
    Reviewed-by: Elena Reshetova
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

10 May, 2016

1 commit

    Allowing unprivileged kernel profiling lets any user follow kernel
    control flow and dump kernel registers. This most likely allows trivial
    kASLR bypassing, and it may allow other mischief as well. (Off the top
    of my head, the PERF_SAMPLE_REGS_INTR output during /dev/urandom reads
    could be quite interesting.)
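
    The knob in question is kernel.perf_event_paranoid; admins who still
    want unprivileged kernel profiling can lower it again (a sketch,
    assuming the pre-change default of 1):

    # sysctl -w kernel.perf_event_paranoid=1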

    Signed-off-by: Andy Lutomirski
    Acked-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

29 Apr, 2016

1 commit

  • Commit 3193913ce62c ("mm: page_alloc: default node-ordering on 64-bit
    NUMA, zone-ordering on 32-bit") changes the default value of
    numa_zonelist_order. Update the document.

    Signed-off-by: Xishi Qiu
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     

27 Apr, 2016

1 commit

    The default remains 127, which is good for most cases, and not even hit
    most of the time, but then for some cases, as reported by Brendan,
    1024+ deep frames are appearing on the radar for things like groovy
    and ruby.

    And in some workloads putting a _lower_ cap on this may make sense. One
    that is per event still needs to be put in place, though.

    The new file is:

    # cat /proc/sys/kernel/perf_event_max_stack
    127

    Changing it:

    # echo 256 > /proc/sys/kernel/perf_event_max_stack
    # cat /proc/sys/kernel/perf_event_max_stack
    256

    But as soon as there is some event using callchains we get:

    # echo 512 > /proc/sys/kernel/perf_event_max_stack
    -bash: echo: write error: Device or resource busy
    #

    Because we only allocate the callchain percpu data structures when
    there is a user, which allows for changing the max easily, it's just a
    matter of having no callchain users at that point.

    Reported-and-Tested-by: Brendan Gregg
    Reviewed-by: Frederic Weisbecker
    Acked-by: Alexei Starovoitov
    Acked-by: David Ahern
    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/r/20160426002928.GB16708@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

19 Mar, 2016

1 commit

  • Merge second patch-bomb from Andrew Morton:

    - a couple of hotfixes

    - the rest of MM

    - a new timer slack control in procfs

    - a couple of procfs fixes

    - a few misc things

    - some printk tweaks

    - lib/ updates, notably to radix-tree.

    - add my and Nick Piggin's old userspace radix-tree test harness to
    tools/testing/radix-tree/. Matthew said it was a godsend during the
    radix-tree work he did.

    - a few code-size improvements, switching to __always_inline where gcc
    screwed up.

    - partially implement character sets in sscanf

    * emailed patches from Andrew Morton : (118 commits)
    sscanf: implement basic character sets
    lib/bug.c: use common WARN helper
    param: convert some "on"/"off" users to strtobool
    lib: add "on"/"off" support to kstrtobool
    lib: update single-char callers of strtobool()
    lib: move strtobool() to kstrtobool()
    include/linux/unaligned: force inlining of byteswap operations
    include/uapi/linux/byteorder, swab: force inlining of some byteswap operations
    include/asm-generic/atomic-long.h: force inlining of some atomic_long operations
    usb: common: convert to use match_string() helper
    ide: hpt366: convert to use match_string() helper
    ata: hpt366: convert to use match_string() helper
    power: ab8500: convert to use match_string() helper
    power: charger_manager: convert to use match_string() helper
    drm/edid: convert to use match_string() helper
    pinctrl: convert to use match_string() helper
    device property: convert to use match_string() helper
    lib/string: introduce match_string() helper
    radix-tree tests: add test for radix_tree_iter_next
    radix-tree tests: add regression3 test
    ...

    Linus Torvalds
     

18 Mar, 2016

1 commit

  • In machines with 140G of memory and enterprise flash storage, we have
    seen read and write bursts routinely exceed the kswapd watermarks and
    cause thundering herds in direct reclaim. Unfortunately, the only way
    to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
    system's emergency reserves - which is entirely unrelated to the
    system's latency requirements. In order to get kswapd to maintain a
    250M buffer of free memory, the emergency reserves need to be set to 1G.
    That is a lot of memory wasted for no good reason.

    On the other hand, it's reasonable to assume that allocation bursts and
    overall allocation concurrency scale with memory capacity, so it makes
    sense to make kswapd aggressiveness a function of that as well.

    Change the kswapd watermark scale factor from the currently fixed 25% of
    the tunable emergency reserve to a tunable 0.1% of memory.

    Beyond 1G of memory, this will produce bigger watermark steps than the
    current formula in default settings. Ensure that the new formula never
    chooses steps smaller than that, i.e. 25% of the emergency reserve.

    On a 140G machine, this raises the default watermark steps - the
    distance between min and low, and low and high - from 16M to 143M.
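
    The resulting tunable is vm.watermark_scale_factor, expressed in
    fractions of 10,000, so the default of 10 corresponds to 0.1% of
    memory; a sketch:

    # sysctl vm.watermark_scale_factor
    vm.watermark_scale_factor = 10
    # sysctl -w vm.watermark_scale_factor=100    # scale steps to 1% of memory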

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner