02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default, all files without license information fall under the
    kernel's default license, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.
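    In practice, the tag is a one-line comment at the top of each affected
    file; a minimal illustration for a .c file (headers and assembly use
    whatever comment style is appropriate to them):

```c
// SPDX-License-Identifier: GPL-2.0
```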

    This patch is based on work done by Thomas Gleixner, Kate Stewart, and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX license identifier should be applied
    to a file was done in a spreadsheet of side-by-side results of the output
    of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files, created by Philippe Ombredanne. Philippe prepared the
    base worksheet and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis, with 60,537 files
    assessed. Kate Stewart did a file-by-file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    should be applied to each file. She confirmed any determination that was
    not immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source.
    - File already had some variant of a license header in it (even if <5
    lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

22 Oct, 2017

6 commits

  • Pull smp/hotplug fix from Thomas Gleixner:
    "The recent rework of the callback invocation missed cleaning up the
    leftovers of the operation, so under certain circumstances a
    subsequent CPU hotplug operation accesses stale data and crashes.
    Clean it up."

    * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    cpu/hotplug: Reset node state after operation

    Linus Torvalds
     
  • Pull irq fixes from Thomas Gleixner:
    "A set of small fixes mostly in the irq drivers area:

    - Make the tango irq chip work correctly, which requires a new
    function in the generic irq chip implementation

    - A set of updates to the GIC-V3 ITS driver removing a bogus BUG_ON()
    and parsing the VCPU table size correctly"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq: generic chip: remove irq_gc_mask_disable_reg_and_ack()
    irqchip/tango: Use irq_gc_mask_disable_and_ack_set
    genirq: generic chip: Add irq_gc_mask_disable_and_ack_set()
    irqchip/gic-v3-its: Add missing changes to support 52bit physical address
    irqchip/gic-v3-its: Fix the incorrect parsing of VCPU table size
    irqchip/gic-v3-its: Fix the incorrect BUG_ON in its_init_vpe_domain()
    DT: arm,gic-v3: Update the ITS size in the examples

    Linus Torvalds
     
  • Pull networking fixes from David Miller:
    "A little more than usual this time around. Been travelling, so that is
    part of it.

    Anyways, here are the highlights:

    1) Deal with memcontrol races wrt. listener dismantle, from Eric
    Dumazet.

    2) Handle page allocation failures properly in nfp driver, from Jakub
    Kicinski.

    3) Fix memory leaks in macsec, from Sabrina Dubroca.

    4) Fix crashes in pppol2tp_session_ioctl(), from Guillaume Nault.

    5) Several fixes in bnxt_en driver, including preventing potential
    NVRAM parameter corruption from Michael Chan.

    6) Fix for KRACK attacks in wireless, from Johannes Berg.

    7) rtnetlink event generation fixes from Xin Long.

    8) Deadlock in mlxsw driver, from Ido Schimmel.

    9) Disallow arithmetic operations on context pointers in bpf, from
    Jakub Kicinski.

    10) Missing sock_owned_by_user() check in sctp_icmp_redirect(), from
    Xin Long.

    11) Only TCP is supported for sockmap, make that explicit with a
    check, from John Fastabend.

    12) Fix IP options state races in DCCP and TCP, from Eric Dumazet.

    13) Fix panic in packet_getsockopt(), also from Eric Dumazet.

    14) Add missing locking in hv_sock layer, from Dexuan Cui.

    15) Various aquantia bug fixes, including several statistics handling
    cures. From Igor Russkikh et al.

    16) Fix arithmetic overflow in devmap code, from John Fastabend.

    17) Fix busted socket memory accounting when we get a fault in the tcp
    zero copy paths. From Willem de Bruijn.

    18) Don't leave opt->tot_len uninitialized in ipv6, from Eric Dumazet"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
    stmmac: Don't access tx_q->dirty_tx before netif_tx_lock
    ipv6: flowlabel: do not leave opt->tot_len with garbage
    of_mdio: Fix broken PHY IRQ in case of probe deferral
    textsearch: fix typos in library helpers
    rxrpc: Don't release call mutex on error pointer
    net: stmmac: Prevent infinite loop in get_rx_timestamp_status()
    net: stmmac: Fix stmmac_get_rx_hwtstamp()
    net: stmmac: Add missing call to dev_kfree_skb()
    mlxsw: spectrum_router: Configure TIGCR on init
    mlxsw: reg: Add Tunneling IPinIP General Configuration Register
    net: ethtool: remove error check for legacy setting transceiver type
    soreuseport: fix initialization race
    net: bridge: fix returning of vlan range op errors
    sock: correct sk_wmem_queued accounting on efault in tcp zerocopy
    bpf: add test cases to bpf selftests to cover all access tests
    bpf: fix pattern matches for direct packet access
    bpf: fix off by one for range markings with L{T, E} patterns
    bpf: devmap fix arithmetic overflow in bitmap_size calculation
    net: aquantia: Bad udp rate on default interrupt coalescing
    net: aquantia: Enable coalescing management via ethtool interface
    ...

    Linus Torvalds
     
  • Alexander had a test program with direct packet access, where
    the access test was in the form of data + X > data_end. In an
    unrelated change to the program, LLVM decided to swap the branches
    and emitted code for the test in the form of data + X <= data_end,
    a pattern the verifier's direct packet access matching did not yet
    recognize.
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • During review I noticed that the current logic for direct packet
    access marking in check_cond_jmp_op() has an off by one for the
    upper right range border when marking in find_good_pkt_pointers()
    with BPF_JLT and BPF_JLE. It's not really harmful given access
    up to pkt_end is always safe, but we should nevertheless correct
    the range marking before it becomes ABI. If pkt_data' denotes a
    pkt_data derived pointer (pkt_data + X), then for pkt_data' < pkt_end
    in the true branch, as well as for the swapped form pkt_end > pkt_data',
    the verifier simulation cannot deduce that a byte load of
    pkt_data' - 1 would succeed in this branch.

    Fixes: b4e432f1000a ("bpf: enable BPF_J{LT, LE, SLT, SLE} opcodes in verifier")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • An integer overflow is possible in dev_map_bitmap_size() when
    calculating the BITS_TO_LONG logic which becomes, after macro
    replacement,

    (((n) + (d) - 1)/ (d))

    where 'n' is a __u32 and 'd' is (8 * sizeof(long)). To avoid the
    overflow, cast to u64 before the arithmetic.

    Reported-by: Richard Weinberger
    Acked-by: Daniel Borkmann
    Signed-off-by: John Fastabend
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    John Fastabend
     

21 Oct, 2017

2 commits

  • The recent rework of the cpu hotplug internals changed the usage of the per
    cpu state->node field, but failed to clean it up after use.

    So subsequent hotplug operations use the stale pointer from a previous
    operation and hand it into the callback functions. The callbacks then
    dereference a pointer which either belongs to a different facility or
    points to freed and potentially reused memory. In either case data
    corruption and crashes are the obvious consequence.

    Reset the node and the last pointers in the per cpu state to NULL after the
    operation which set them has completed.

    Fixes: 96abb968549c ("smp/hotplug: Allow external multi-instance rollback")
    Reported-by: Tvrtko Ursulin
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Boris Ostrovsky
    Cc: "Paul E. McKenney"
    Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1710211606130.3213@nanos

    Thomas Gleixner
     
  • As pointed out by Linus and David, the earlier waitid() fix resulted in
    a (currently harmless) unbalanced user_access_end() call. This fixes it
    to just directly return EFAULT on access_ok() failure.

    Fixes: 96ca579a1ecc ("waitid(): Add missing access_ok() checks")
    Acked-by: David Daney
    Cc: Al Viro
    Signed-off-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Kees Cook
     

20 Oct, 2017

6 commits

  • Devmap is used with XDP, which requires CAP_NET_ADMIN, so let's also
    make CAP_NET_ADMIN required to use the map.

    Signed-off-by: John Fastabend
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    John Fastabend
     
  • Restrict sockmap to CAP_NET_ADMIN.

    Signed-off-by: John Fastabend
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    John Fastabend
     
  • SK_SKB BPF programs are run from the socket/tcp context but early in
    the stack before much of the TCP metadata is needed in tcp_skb_cb. So
    we can use some unused fields to place BPF metadata needed for SK_SKB
    programs when implementing the redirect function.

    This allows us to drop the preempt disable logic. It does however
    require an API change so sk_redirect_map() has been updated to
    additionally provide ctx_ptr to skb. Note, we do however continue to
    disable/enable preemption around actual BPF program running to account
    for map updates.

    Signed-off-by: John Fastabend
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    John Fastabend
     
  • Only TCP sockets have been tested and at the moment the state change
    callback only handles TCP sockets. This adds a check to ensure that
    sockets actually being added are TCP sockets.

    For net-next we can consider UDP support.

    Signed-off-by: John Fastabend
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    John Fastabend
     
  • Because many of RCU's files have not been included in docbook, a
    number of errors have accumulated. This commit fixes them.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     
  • This introduces a "register private expedited" membarrier command which
    allows eventual removal of important memory barrier constraints on the
    scheduler fast-paths. It changes how the "private expedited" membarrier
    command (new to 4.14) is used from user-space.

    This new command allows processes to register their intent to use the
    private expedited command. This affects how the expedited private
    command introduced in 4.14-rc is meant to be used, and should be merged
    before 4.14 final.

    Processes are now required to register before using
    MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.

    This fixes a problem that arose when designing requested extensions to
    sys_membarrier() to allow JITs to efficiently flush old code from
    instruction caches. Several potential algorithms are much less painful
    if the user registers intent to use this functionality early on, for
    example, before the process spawns its second thread. Registering at
    this time removes the need to interrupt each and every thread in that
    process at the first expedited sys_membarrier() system call.

    Signed-off-by: Mathieu Desnoyers
    Acked-by: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Alexander Viro
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     

19 Oct, 2017

2 commits

  • PCPU_MIN_UNIT_SIZE is an implementation detail of the percpu
    allocator. Given we support __GFP_NOWARN now, let's just let
    the allocation request fail naturally instead. The two call
    sites from BPF mistakenly assumed __GFP_NOWARN would work, so
    no changes needed to their actual __alloc_percpu_gfp() calls
    which use the flag already.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • It was reported that syzkaller was able to trigger a splat on
    devmap percpu allocation due to illegal/unsupported allocation
    request size passed to __alloc_percpu():

    [ 70.094249] illegal size (32776) or align (8) for percpu allocation
    [ 70.094256] ------------[ cut here ]------------
    [ 70.094259] WARNING: CPU: 3 PID: 3451 at mm/percpu.c:1365 pcpu_alloc+0x96/0x630
    [...]
    [ 70.094325] Call Trace:
    [ 70.094328] __alloc_percpu_gfp+0x12/0x20
    [ 70.094330] dev_map_alloc+0x134/0x1e0
    [ 70.094331] SyS_bpf+0x9bc/0x1610
    [ 70.094333] ? selinux_task_setrlimit+0x5a/0x60
    [ 70.094334] ? security_task_setrlimit+0x43/0x60
    [ 70.094336] entry_SYSCALL_64_fastpath+0x1a/0xa5

    This was due to too large max_entries for the map such that we
    surpassed the upper limit of PCPU_MIN_UNIT_SIZE. It's fine to
    fail naturally here, so switch to __alloc_percpu_gfp() and pass
    __GFP_NOWARN instead.

    Fixes: 11393cc9b9be ("xdp: Add batching support to redirect map")
    Reported-by: Mark Rutland
    Reported-by: Shankara Pailoor
    Reported-by: Richard Weinberger
    Signed-off-by: Daniel Borkmann
    Cc: John Fastabend
    Acked-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

18 Oct, 2017

1 commit

  • Commit f1174f77b50c ("bpf/verifier: rework value tracking")
    removed the crafty selection of which pointer types are
    allowed to be modified. This is OK for most pointer types
    since adjust_ptr_min_max_vals() will catch operations on
    immutable pointers. One exception is PTR_TO_CTX, which is
    now allowed to be offset freely.

    The intent of aforementioned commit was to allow context
    access via modified registers. The offset passed to
    ->is_valid_access() verifier callback has been adjusted
    by the value of the variable offset.

    What is missing, however, is taking the variable offset
    into account when the context register is used. Or in terms
    of the code adding the offset to the value passed to the
    ->convert_ctx_access() callback. This leads to the following
    eBPF user code:

    r1 += 68
    r0 = *(u32 *)(r1 + 8)
    exit

    being translated to this in kernel space:

    0: (07) r1 += 68
    1: (61) r0 = *(u32 *)(r1 +180)
    2: (95) exit

    Offset 8 corresponds to 180 in the kernel, but offset
    76 is valid too. The verifier will "accept" access to offset
    68+8=76 but then "convert" it as an access to offset 8, i.e. 180.
    Effective access to offset 248 is beyond the kernel context.
    (This is a __sk_buff example on a debug-heavy kernel -
    packet mark is 8 -> 180, 76 would be data.)

    Dereferencing the modified context pointer is not as easy
    as dereferencing other types, because we have to translate
    the access to reading a field in kernel structures which is
    usually at a different offset and often of a different size.
    To allow modifying the pointer we would have to make sure
    that given eBPF instruction will always access the same
    field or the fields accessed are "compatible" in terms of
    offset and size...

    Disallow dereferencing modified context pointers and add
    to selftests the test case described here.

    Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
    Signed-off-by: Jakub Kicinski
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Acked-by: Edward Cree
    Signed-off-by: David S. Miller

    Jakub Kicinski
     


14 Oct, 2017

1 commit

  • Kmemleak considers any pointers on task stacks as references. This
    patch clears newly allocated and reused vmap stacks.

    Link: http://lkml.kernel.org/r/150728990124.744199.8403409836394318684.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

13 Oct, 2017

3 commits

  • Any usage of the irq_gc_mask_disable_reg_and_ack() function has
    been replaced with the desired functionality.

    The incorrect and ambiguously named function is removed here to
    prevent accidental misuse.

    Signed-off-by: Doug Berger
    Signed-off-by: Marc Zyngier

    Doug Berger
     
  • The irq_gc_mask_disable_reg_and_ack() function name implies that it
    provides the combined functions of irq_gc_mask_disable_reg() and
    irq_gc_ack(). However, the implementation does not actually do
    that since it writes the mask instead of the disable register. It
    also does not maintain the mask cache which makes it inappropriate
    to use with other masking functions.

    In addition, commit 659fb32d1b67 ("genirq: replace irq_gc_ack() with
    {set,clr}_bit variants (fwd)") effectively renamed irq_gc_ack() to
    irq_gc_ack_set_bit() so this function probably should have also been
    renamed at that time.

    The generic chip code currently provides three functions for use
    with the irq_mask member of the irq_chip structure and two functions
    for use with the irq_ack member of the irq_chip structure. These
    functions could be combined into six functions for use with the
    irq_mask_ack member of the irq_chip structure. However, since only
    one of the combinations is currently used, only the function
    irq_gc_mask_disable_and_ack_set() is added by this commit.

    The '_reg' and '_bit' portions of the base function name were left
    out of the new combined function name in an attempt to keep the
    function name length manageable with the 80 character source code
    line length while still allowing the distinct aspects of each
    combination to be captured by the name.

    If other combinations are desired in the future please add them to
    the irq generic chip library at that time.

    Signed-off-by: Doug Berger
    Signed-off-by: Marc Zyngier

    Doug Berger
     
  • Pull livepatching fix from Jiri Kosina:

    - bugfix for handling of coming modules (incorrect handling of failure)
    from Joe Lawrence

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
    livepatch: unpatch all klp_objects if klp_module_coming fails

    Linus Torvalds
     

12 Oct, 2017

1 commit

  • Merge waitid() fix from Kees Cook.

    I'd have hoped that the unsafe_{get|put}_user() naming would have
    avoided these kinds of stupid bugs, but no such luck.

    * waitid-fix:
    waitid(): Add missing access_ok() checks

    Linus Torvalds
     

11 Oct, 2017

4 commits

  • When an incoming module is considered for livepatching by
    klp_module_coming(), it iterates over multiple patches and multiple
    kernel objects in this order:

    list_for_each_entry(patch, &klp_patches, list) {
    klp_for_each_object(patch, obj) {

    which means that if one of the kernel objects fails to patch,
    klp_module_coming()'s error path needs to unpatch and cleanup any kernel
    objects that were already patched by a previous patch.

    Reported-by: Miroslav Benes
    Suggested-by: Petr Mladek
    Signed-off-by: Joe Lawrence
    Acked-by: Josh Poimboeuf
    Reviewed-by: Petr Mladek
    Signed-off-by: Jiri Kosina

    Joe Lawrence
     
  • This reverts commit fbb1fb4ad415cb31ce944f65a5ca700aaf73a227.

    This was not the proper fix, so let's cleanly revert it so that the
    following patch can be carried to stable versions.

    sock_cgroup_ptr() callers do not expect a NULL return value.

    Signed-off-by: Eric Dumazet
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Pull seccomp fixlet from Kees Cook:
    "Minor seccomp fix for v4.14-rc5. I debated sending this at all for
    v4.14, but since it fixes a minor issue in the prior fix, which also
    went to -stable, it seemed better to just get all of it cleaned up
    right now.

    - fix missed "static" to avoid Sparse warning (Colin King)"

    * tag 'seccomp-v4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    seccomp: make function __get_seccomp_filter static

    Linus Torvalds
     
  • The function __get_seccomp_filter is local to the source and does
    not need to be in global scope, so make it static.

    Cleans up sparse warning:
    symbol '__get_seccomp_filter' was not declared. Should it be static?

    Signed-off-by: Colin Ian King
    Fixes: 66a733ea6b61 ("seccomp: fix the usage of get/put_seccomp_filter() in seccomp_get_filter()")
    Cc: stable@vger.kernel.org
    Signed-off-by: Kees Cook

    Colin Ian King
     

10 Oct, 2017

8 commits

  • While load_balance() masks the source CPUs against active_mask, it had
    a hole against the destination CPU. Ensure the destination CPU is also
    part of the 'domain-mask & active-mask' set.

    Reported-by: Levin, Alexander (Sasha Levin)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 77d1dfda0e79 ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The trivial wake_affine_idle() implementation is very good for a
    number of workloads, but it comes apart at the moment there are no
    idle CPUs left, IOW, the overloaded case.

    hackbench:

    NO_WA_WEIGHT WA_WEIGHT

    hackbench-20 : 7.362717561 seconds 6.450509391 seconds

    (win)

    netperf:

    NO_WA_WEIGHT WA_WEIGHT

    TCP_SENDFILE-1 : Avg: 54524.6 Avg: 52224.3
    TCP_SENDFILE-10 : Avg: 48185.2 Avg: 46504.3
    TCP_SENDFILE-20 : Avg: 29031.2 Avg: 28610.3
    TCP_SENDFILE-40 : Avg: 9819.72 Avg: 9253.12
    TCP_SENDFILE-80 : Avg: 5355.3 Avg: 4687.4

    TCP_STREAM-1 : Avg: 41448.3 Avg: 42254
    TCP_STREAM-10 : Avg: 24123.2 Avg: 25847.9
    TCP_STREAM-20 : Avg: 15834.5 Avg: 18374.4
    TCP_STREAM-40 : Avg: 5583.91 Avg: 5599.57
    TCP_STREAM-80 : Avg: 2329.66 Avg: 2726.41

    TCP_RR-1 : Avg: 80473.5 Avg: 82638.8
    TCP_RR-10 : Avg: 72660.5 Avg: 73265.1
    TCP_RR-20 : Avg: 52607.1 Avg: 52634.5
    TCP_RR-40 : Avg: 57199.2 Avg: 56302.3
    TCP_RR-80 : Avg: 25330.3 Avg: 26867.9

    UDP_RR-1 : Avg: 108266 Avg: 107844
    UDP_RR-10 : Avg: 95480 Avg: 95245.2
    UDP_RR-20 : Avg: 68770.8 Avg: 68673.7
    UDP_RR-40 : Avg: 76231 Avg: 75419.1
    UDP_RR-80 : Avg: 34578.3 Avg: 35639.1

    UDP_STREAM-1 : Avg: 64684.3 Avg: 66606
    UDP_STREAM-10 : Avg: 52701.2 Avg: 52959.5
    UDP_STREAM-20 : Avg: 30376.4 Avg: 29704
    UDP_STREAM-40 : Avg: 15685.8 Avg: 15266.5
    UDP_STREAM-80 : Avg: 8415.13 Avg: 7388.97

    (wins and losses)

    sysbench:

    NO_WA_WEIGHT WA_WEIGHT

    sysbench-mysql-2 : 2135.17 per sec. 2142.51 per sec.
    sysbench-mysql-5 : 4809.68 per sec. 4800.19 per sec.
    sysbench-mysql-10 : 9158.59 per sec. 9157.05 per sec.
    sysbench-mysql-20 : 14570.70 per sec. 14543.55 per sec.
    sysbench-mysql-40 : 22130.56 per sec. 22184.82 per sec.
    sysbench-mysql-80 : 20995.56 per sec. 21904.18 per sec.

    sysbench-psql-2 : 1679.58 per sec. 1705.06 per sec.
    sysbench-psql-5 : 3797.69 per sec. 3879.93 per sec.
    sysbench-psql-10 : 7253.22 per sec. 7258.06 per sec.
    sysbench-psql-20 : 11166.75 per sec. 11220.00 per sec.
    sysbench-psql-40 : 17277.28 per sec. 17359.78 per sec.
    sysbench-psql-80 : 17112.44 per sec. 17221.16 per sec.

    (increase on the top end)

    tbench:

    NO_WA_WEIGHT

    Throughput 685.211 MB/sec 2 clients 2 procs max_latency=0.123 ms
    Throughput 1596.64 MB/sec 5 clients 5 procs max_latency=0.119 ms
    Throughput 2985.47 MB/sec 10 clients 10 procs max_latency=0.262 ms
    Throughput 4521.15 MB/sec 20 clients 20 procs max_latency=0.506 ms
    Throughput 9438.1 MB/sec 40 clients 40 procs max_latency=2.052 ms
    Throughput 8210.5 MB/sec 80 clients 80 procs max_latency=8.310 ms

    WA_WEIGHT

    Throughput 697.292 MB/sec 2 clients 2 procs max_latency=0.127 ms
    Throughput 1596.48 MB/sec 5 clients 5 procs max_latency=0.080 ms
    Throughput 2975.22 MB/sec 10 clients 10 procs max_latency=0.254 ms
    Throughput 4575.14 MB/sec 20 clients 20 procs max_latency=0.502 ms
    Throughput 9468.65 MB/sec 40 clients 40 procs max_latency=2.069 ms
    Throughput 8631.73 MB/sec 80 clients 80 procs max_latency=8.605 ms

    (increase on the top end)

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Rik van Riel
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Eric reported a sysbench regression against commit:

    3fed382b46ba ("sched/numa: Implement NUMA node level wake_affine()")

    Similarly, Rik was looking at the NAS-lu.C benchmark, which regressed
    against his v3.10 enterprise kernel.

    PRE (current tip/master):

    ivb-ep sysbench:

    2: [30 secs] transactions: 64110 (2136.94 per sec.)
    5: [30 secs] transactions: 143644 (4787.99 per sec.)
    10: [30 secs] transactions: 274298 (9142.93 per sec.)
    20: [30 secs] transactions: 418683 (13955.45 per sec.)
    40: [30 secs] transactions: 320731 (10690.15 per sec.)
    80: [30 secs] transactions: 355096 (11834.28 per sec.)

    hsw-ex NAS:

    OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds = 18.01
    OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds = 17.89
    OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds = 17.93
    lu.C.x_threads_144_run_1.log: Time in seconds = 434.68
    lu.C.x_threads_144_run_2.log: Time in seconds = 405.36
    lu.C.x_threads_144_run_3.log: Time in seconds = 433.83

    POST (+patch):

    ivb-ep sysbench:

    2: [30 secs] transactions: 64494 (2149.75 per sec.)
    5: [30 secs] transactions: 145114 (4836.99 per sec.)
    10: [30 secs] transactions: 278311 (9276.69 per sec.)
    20: [30 secs] transactions: 437169 (14571.60 per sec.)
    40: [30 secs] transactions: 669837 (22326.73 per sec.)
    80: [30 secs] transactions: 631739 (21055.88 per sec.)

    hsw-ex NAS:

    lu.C.x_threads_144_run_1.log: Time in seconds = 23.36
    lu.C.x_threads_144_run_2.log: Time in seconds = 22.96
    lu.C.x_threads_144_run_3.log: Time in seconds = 22.52

    This patch takes out all the shiny wake_affine() stuff and goes back to
    utter basics. Between the two CPUs involved with the wakeup (the CPU
    doing the wakeup and the CPU we ran on previously) pick the CPU we can
    run on _now_.

    This recovers much of the regression against the older kernels,
    but leaves some ground in the overloaded case. The default-enabled
    WA_WEIGHT (which will be introduced in the next patch) is an attempt
    to address the overloaded situation.

    Reported-by: Eric Farman
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Christian Borntraeger
    Cc: Linus Torvalds
    Cc: Matthew Rosato
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Cc: jinpuwang@gmail.com
    Cc: vcaputo@pengaru.com
    Fixes: 3fed382b46ba ("sched/numa: Implement NUMA node level wake_affine()")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Update cgroup time when an event is scheduled in by descendants.

    Reviewed-and-tested-by: Jiri Olsa
    Signed-off-by: leilei.lin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@kernel.org
    Cc: alexander.shishkin@linux.intel.com
    Cc: brendan.d.gregg@gmail.com
    Cc: yang_oliver@hotmail.com
    Link: http://lkml.kernel.org/r/CALPjY3mkHiekRkRECzMi9G-bjUQOvOjVBAqxmWkTzc-g+0LwMg@mail.gmail.com
    Signed-off-by: Ingo Molnar

    leilei.lin
     
  • Since commit:

    1fd7e4169954 ("perf/core: Remove perf_cpu_context::unique_pmu")

    ... when a PMU is unregistered then its associated ->pmu_cpu_context is
    unconditionally freed. Whilst this is fine for dynamically allocated
    context types (i.e. those registered using perf_invalid_context), this
    causes a problem for sharing of static contexts such as
    perf_{sw,hw}_context, which are used by multiple built-in PMUs and
    effectively have a global lifetime.

    Whilst testing the ARM SPE driver, which must use perf_sw_context to
    support per-task AUX tracing, unregistering the driver as a result of a
    module unload resulted in:

    Unable to handle kernel NULL pointer dereference at virtual address 00000038
    Internal error: Oops: 96000004 [#1] PREEMPT SMP
    Modules linked in: [last unloaded: arm_spe_pmu]
    PC is at ctx_resched+0x38/0xe8
    LR is at perf_event_exec+0x20c/0x278
    [...]
    ctx_resched+0x38/0xe8
    perf_event_exec+0x20c/0x278
    setup_new_exec+0x88/0x118
    load_elf_binary+0x26c/0x109c
    search_binary_handler+0x90/0x298
    do_execveat_common.isra.14+0x540/0x618
    SyS_execve+0x38/0x48

    since the software context has been freed and the ctx.pmu->pmu_disable_count
    field has been set to NULL.

    This patch fixes the problem by avoiding the freeing of static PMU contexts
    altogether. Whilst the sharing of dynamic contexts is questionable, this
    actually requires the caller to share their context pointer explicitly
    and so the burden is on them to manage the object lifetime.

    Reported-by: Kim Phillips
    Signed-off-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Mark Rutland
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 1fd7e4169954 ("perf/core: Remove perf_cpu_context::unique_pmu")
    Link: http://lkml.kernel.org/r/1507040450-7730-1-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
     
  • There is some complication between check_prevs_add() and
    check_prev_add() wrt. saving stack traces. The problem is that we want
    to be frugal with saving stack traces, since it consumes static
    resources.

    We'll only know in check_prev_add() if we need the trace, but we can
    call into it multiple times. So we want to do on-demand and re-use.

    A further complication is that check_prev_add() can drop graph_lock
    and mess with our static resources.

    In any case, the current state; after commit:

    ce07a9415f26 ("locking/lockdep: Make check_prev_add() able to handle external stack_trace")

    is that we'll assume the trace contains valid data once
    check_prev_add() returns '2'. However, as noted by Josh, this is
    false: check_prev_add() can return '2' before having saved a trace,
    which then results in the possibility of using uninitialized data.
    Testing, as reported by Wu, shows a NULL deref.

    So simplify.

    Since the graph_lock() thing is a debug path that hasn't
    really been used in a long while, take it out back and avoid the
    head-ache.

    Further initialize the stack_trace to a known 'empty' state; as long
    as nr_entries == 0, nothing should deref entries. We can then use the
    'entries == NULL' test for a valid trace / on-demand saving.

    Analyzed-by: Josh Poimboeuf
    Reported-by: Fengguang Wu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Byungchul Park
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: ce07a9415f26 ("locking/lockdep: Make check_prev_add() able to handle external stack_trace")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • sk_clone_lock() might run while TCP/DCCP listener already vanished.

    In order to prevent use after free, it is better to defer cgroup_sk_alloc()
    to the point we know both parent and child exist, and from process context.

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Signed-off-by: Eric Dumazet
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Adds missing access_ok() checks.

    CVE-2017-5123

    Reported-by: Chris Salls
    Signed-off-by: Kees Cook
    Acked-by: Al Viro
    Fixes: 4c48abe91be0 ("waitid(): switch copyout of siginfo to unsafe_put_user()")
    Cc: stable@kernel.org # 4.13
    Signed-off-by: Linus Torvalds

    Kees Cook