11 Feb, 2019

2 commits

  • Pull perf fixes from Ingo Molnar:
    "A couple of kernel side fixes:

    - Fix the Intel uncore driver on certain hardware configurations

    - Fix a CPU hotplug related memory allocation bug

    - Remove a spurious WARN()

    ... plus also a handful of perf tooling fixes"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf script python: Add Python3 support to tests/attr.py
    perf trace: Support multiple "vfs_getname" probes
    perf symbols: Filter out hidden symbols from labels
    perf symbols: Add fallback definitions for GELF_ST_VISIBILITY()
    tools headers uapi: Sync linux/in.h copy from the kernel sources
    perf clang: Do not use 'return std::move(something)'
    perf mem/c2c: Fix perf_mem_events to support powerpc
    perf tests evsel-tp-sched: Fix bitwise operator
    perf/core: Don't WARN() for impossible ring-buffer sizes
    perf/x86/intel: Delay memory deallocation until x86_pmu_dead_cpu()
    perf/x86/intel/uncore: Add Node ID mask

    Linus Torvalds
     
  • Pull locking fixes from Ingo Molnar:
    "An rtmutex (PI-futex) deadlock scenario fix, plus a locking
    documentation fix"

    * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    futex: Handle early deadlock return correctly
    futex: Fix barrier comment

    Linus Torvalds
     

09 Feb, 2019

3 commits

  • Pull signal fixes from Eric Biederman:
    "This contains four small fixes for signal handling. A missing range
    check, a regression fix, prioritizing signals we have already started
    a signal group exit for, and better detection of synchronous signals.

    The confused decision of which signals to handle failed spectacularly
    when a timer was pointed at SIGBUS and the stack overflowed, resulting
    in an unkillable process in an infinite loop instead of a SIGSEGV and
    core dump"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    signal: Better detection of synchronous signals
    signal: Always notice exiting tasks
    signal: Always attempt to allocate siginfo for SIGSTOP
    signal: Make siginmask safe when passed a signal of 0

    Linus Torvalds
     
  • Pull networking fixes from David Miller:
    "This pull request is dedicated to the upcoming snowpocalypse parts 2
    and 3 in the Pacific Northwest:

    1) Drop profiles are broken because some drivers use dev_kfree_skb*
    instead of dev_consume_skb*, from Yang Wei.

    2) Fix IWLWIFI kconfig deps, from Luca Coelho.

    3) Fix percpu maps updating in bpftool, from Paolo Abeni.

    4) Missing station release in batman-adv, from Felix Fietkau.

    5) Fix some networking compat ioctl bugs, from Johannes Berg.

    6) ucc_geth must reset the BQL queue state when stopping the device,
    from Mathias Thore.

    7) Several XDP bug fixes in virtio_net from Toshiaki Makita.

    8) TSO packets must be sent always on queue 0 in stmmac, from Jose
    Abreu.

    9) Fix socket refcounting bug in RDS, from Eric Dumazet.

    10) Handle sparse cpu allocations in bpf selftests, from Martynas
    Pumputis.

    11) Make sure mgmt frames have enough tailroom in mac80211, from Felix
    Fietkau.

    12) Use safe list walking in sctp_sendmsg() asoc list traversal, from
    Greg Kroah-Hartman.

    13) Make DCCP's ccid_hc_[rt]x_parse_options always check for NULL
    ccid, from Eric Dumazet.

    14) Need to reload WoL password into bcmsysport device after deep
    sleeps, from Florian Fainelli.

    15) Remove filter from mask before freeing in cls_flower, from Petr
    Machata.

    16) Missing release and use after free in error paths of s390 qeth
    code, from Julian Wiedmann.

    17) Fix lockdep false positive in dsa code, from Marc Zyngier.

    18) Fix counting of ATU violations in mv88e6xxx, from Andrew Lunn.

    19) Fix EQ firmware assert in qed driver, from Manish Chopra.

    20) Don't default Cavium PTP to Y in kconfig, from Bjorn Helgaas"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
    net: dsa: b53: Fix for failure when irq is not defined in dt
    sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
    geneve: should not call rt6_lookup() when ipv6 was disabled
    net: Don't default Cavium PTP driver to 'y'
    net: broadcom: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
    net: via-velocity: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
    net: tehuti: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
    net: sun: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
    net: fsl_ucc_hdlc: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
    net: fec_mpc52xx: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
    net: smsc: epic100: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
    net: dscc4: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
    net: tulip: de2104x: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
    net: defxx: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
    net/mlx5e: Don't overwrite pedit action when multiple pedit used
    net/mlx5e: Update hw flows when encap source mac changed
    qed*: Advance drivers version to 8.37.0.20
    qed: Change verbosity for coalescing message.
    qede: Fix system crash on configuring channels.
    qed: Consider TX tcs while deriving the max num_queues for PF.
    ...

    Linus Torvalds
     
  • Pull driver core fixes from Greg KH:
    "Here are some driver core fixes for 5.0-rc6.

    Well, not so much "driver core" as "debugfs". There's a lot of
    outstanding debugfs cleanup patches coming in through different
    subsystem trees, and in that process it was found that the debugfs
    core really should return errors when something bad happens, to prevent
    random files from showing up in the root of debugfs afterward. So
    debugfs was fixed up to handle this properly, and then two fixes for
    the relay and blk-mq code were needed as they were making invalid
    assumptions about debugfs return values.

    There's also a cacheinfo fix in here that resolves a tiny issue.

    All of these have been in linux-next for over a week with no reported
    problems"

    * tag 'driver-core-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
    blk-mq: protect debugfs_create_files() from failures
    relay: check return of create_buf_file() properly
    debugfs: debugfs_lookup() should return NULL if not found
    debugfs: return error values, not NULL
    debugfs: fix debugfs_rename parameter checking
    cacheinfo: Keep the old value if of_property_read_u32 fails

    Linus Torvalds
     

08 Feb, 2019

2 commits

  • commit 56222b212e8e ("futex: Drop hb->lock before enqueueing on the
    rtmutex") changed the locking rules in the futex code so that the hash
    bucket lock is no longer held while the waiter is enqueued into the
    rtmutex wait list. This made the lock and the unlock path symmetric, but
    unfortunately the possible early exit from __rt_mutex_proxy_start() due to
    a detected deadlock was not updated accordingly. That allows a concurrent
    unlocker to observe inconsistent state, which triggers the warning in the
    unlock path.

    futex_lock_pi()                     futex_unlock_pi()
      lock(hb->lock)
      queue(hb_waiter)                    lock(hb->lock)
      lock(rtmutex->wait_lock)
      unlock(hb->lock)
                                          // acquired hb->lock
                                          hb_waiter = futex_top_waiter()
                                          lock(rtmutex->wait_lock)
      __rt_mutex_proxy_start()
        ---> fail
             remove(rtmutex_waiter);
        ---> returns -EDEADLOCK
      unlock(rtmutex->wait_lock)
                                          // acquired wait_lock
                                          wake_futex_pi()
                                          rt_mutex_next_owner()
                                          --> returns NULL
                                          --> WARN

      lock(hb->lock)
      unqueue(hb_waiter)

    The problem is caused by the remove(rtmutex_waiter) in the failure case of
    __rt_mutex_proxy_start() as this lets the unlocker observe a waiter in the
    hash bucket but no waiter on the rtmutex, i.e. inconsistent state.

    The original commit handles this correctly for the other early return cases
    (timeout, signal) by delaying the removal of the rtmutex waiter until the
    returning task reacquired the hash bucket lock.

    Treat the failure case of __rt_mutex_proxy_start() in the same way and let
    the existing cleanup code handle the eventual handover of the rtmutex
    gracefully. The regular rt_mutex_proxy_start() gains the rtmutex waiter
    removal for the failure case, so that the other callsites are still
    operating correctly.
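
    As a rough illustration only, the inconsistent state can be modelled
    in user space with two plain lists. This is a toy sketch, not the
    kernel code: PiState, proxy_start_fails(), unlocker_observes() and
    locker_cleanup() are hypothetical names, and real locking is left out.

```python
from dataclasses import dataclass, field


@dataclass
class PiState:
    """Toy model of the two wait lists a PI futex has to keep in sync."""
    hb_waiters: list = field(default_factory=list)       # futex hash bucket queue
    rtmutex_waiters: list = field(default_factory=list)  # rtmutex wait list


def proxy_start_fails(state, task, remove_early):
    """Locker side: enqueue on both lists, then hit the deadlock error path."""
    state.hb_waiters.append(task)
    state.rtmutex_waiters.append(task)
    if remove_early:
        # Old behaviour: the rtmutex waiter disappears while the hash
        # bucket entry is still visible to a concurrent unlocker.
        state.rtmutex_waiters.remove(task)


def unlocker_observes(state):
    """Unlocker side: what it sees after taking hb->lock and wait_lock."""
    if not state.hb_waiters:
        return "no waiter"
    if not state.rtmutex_waiters:
        return "WARN"  # a top waiter exists but there is no next owner
    return "handover to " + state.rtmutex_waiters[0]


def locker_cleanup(state, task):
    """Locker side, after reacquiring hb->lock: leave both lists together."""
    if task in state.rtmutex_waiters:
        state.rtmutex_waiters.remove(task)
    state.hb_waiters.remove(task)
```

    With remove_early=True the unlocker sees "WARN"; with
    remove_early=False it sees a consistent pair of lists until
    locker_cleanup() removes the task from both.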

    Add proper comments to the code so all these details are fully documented.

    Thanks to Peter for helping with the analysis and writing the really
    valuable code comments.

    Fixes: 56222b212e8e ("futex: Drop hb->lock before enqueueing on the rtmutex")
    Reported-by: Heiko Carstens
    Co-developed-by: Peter Zijlstra
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Thomas Gleixner
    Tested-by: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: linux-s390@vger.kernel.org
    Cc: Stefan Liebler
    Cc: Sebastian Sewior
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1901292311410.1950@nanos.tec.linutronix.de

    Thomas Gleixner
     
  • The current comment for the barrier that guarantees that waiter increment
    is always before taking the hb spinlock (barrier (A)) needs to be fixed as
    it is misplaced.

    This is obviously referring to hb_waiters_inc, which is a full barrier.

    Reported-by: Peter Zijlstra
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190206185602.949-1-dave@stgolabs.net

    Davidlohr Bueso
     

07 Feb, 2019

3 commits

  • Recently syzkaller was able to create unkillable processes by
    creating a timer that is delivered as a thread local signal on SIGHUP,
    and receiving SIGHUP with SA_NODEFER, ultimately causing a loop that
    keeps trying and failing to deliver SIGHUP.

    When the stack overflows, delivery of SIGHUP fails and force_sigsegv is
    called. Unfortunately, because SIGSEGV is numerically higher than
    SIGHUP, next_signal tries again to deliver a SIGHUP.

    From a quality of implementation standpoint attempting to deliver the
    timer SIGHUP signal is wrong. We should attempt to deliver the
    synchronous SIGSEGV signal we just forced.

    We can make that happen in a fairly straightforward manner: instead
    of just looking at the signal number, we also look at the
    si_code. In particular, for exceptions (aka synchronous signals) the
    si_code is always greater than 0.

    That still has the potential to pick up a number of asynchronous
    signals as in a few cases the same si_codes that are used
    for synchronous signals are also used for asynchronous signals,
    and SI_KERNEL is also included in the list of possible si_codes.

    Still, the heuristic is much better and timer signals are definitely
    excluded, which is enough to prevent all known ways for someone
    to send a process signals fast enough to cause unexpected and
    arguably incorrect behavior.
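
    A minimal user-space sketch of the two heuristics follows. It is a
    model, not the kernel implementation: pick_by_number() and
    pick_by_si_code() are hypothetical names, the constants are the x86
    Linux values written out by hand, and the pending queue uses a timer
    pointed at SIGBUS purely as an illustrative assumption.

```python
# x86 Linux values, spelled out so the sketch does not depend on the host.
SIGBUS, SIGSEGV = 7, 11
SI_USER, SI_TIMER, SEGV_MAPERR = 0, -2, 1
# SIGILL, SIGTRAP, SIGBUS, SIGFPE, SIGSEGV, SIGSYS
SYNCHRONOUS = {4, 5, 7, 8, 11, 31}


def pick_by_number(pending):
    """Old heuristic: lowest-numbered pending signal in the synchronous set."""
    candidates = [s for s in pending if s[0] in SYNCHRONOUS]
    return min(candidates, key=lambda s: s[0]) if candidates else None


def pick_by_si_code(pending):
    """New heuristic: first queued entry in the set that also has si_code > 0."""
    for signo, si_code in pending:
        if signo in SYNCHRONOUS and si_code > SI_USER:
            return (signo, si_code)
    return None


# A timer-generated SIGBUS queued ahead of a forced SIGSEGV.
pending = [(SIGBUS, SI_TIMER), (SIGSEGV, SEGV_MAPERR)]
```

    On this queue pick_by_number() keeps choosing the timer signal, while
    pick_by_si_code() skips it because SI_TIMER is negative and selects
    the forced SIGSEGV.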

    Cc: stable@vger.kernel.org
    Fixes: a27341cd5fcb ("Prioritize synchronous signals over 'normal' signals")
    Tested-by: Dmitry Vyukov
    Reported-by: Dmitry Vyukov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Recently syzkaller was able to create unkillable processes by
    creating a timer that is delivered as a thread local signal on SIGHUP,
    and receiving SIGHUP with SA_NODEFER, ultimately causing a loop that
    keeps trying and failing to deliver SIGHUP.

    Upon examination it turns out part of the problem is actually most of
    the solution. Since 2.5, signal delivery has found all fatal signals,
    marked the signal group for death, and queued SIGKILL in every thread's
    queue, relying on signal->group_exit_code to preserve the
    information of which was the actual fatal signal.

    The conversion of all fatal signals to SIGKILL results in the
    synchronous signal heuristic in next_signal kicking in and preferring
    SIGHUP to SIGKILL, which is especially problematic as all
    fatal signals have already been transformed into SIGKILL.

    Instead of dequeueing signals and depending upon SIGKILL to
    be the first signal dequeued, first test if the signal group
    has already been marked for death. This guarantees that
    nothing in the signal queue can prevent a process that needs
    to exit from exiting.
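
    The ordering change can be sketched as below. This is an illustrative
    model under stated assumptions: dequeue_preferring() is a hypothetical
    stand-in for a dequeue step that favours some other signal over
    SIGKILL, and get_signal_old()/get_signal_new() are not the kernel
    functions.

```python
SIGKILL = 9


def dequeue_preferring(pending, preferred):
    """Stand-in for a dequeue that favours `preferred` over everything else."""
    for signo in pending:
        if signo == preferred:
            return signo
    return pending[0] if pending else None


def get_signal_old(group_exit, pending, preferred):
    # Depends on SIGKILL happening to be the first signal dequeued.
    return dequeue_preferring(pending, preferred)


def get_signal_new(group_exit, pending, preferred):
    # Test the group-exit mark before looking at the queue at all.
    if group_exit:
        return SIGKILL
    return dequeue_preferring(pending, preferred)
```

    With a group exit already marked and a queue of [9, 1] where signal 1
    is preferred, the old ordering returns 1 and the new ordering returns
    SIGKILL regardless of what else is queued.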

    Cc: stable@vger.kernel.org
    Tested-by: Dmitry Vyukov
    Reported-by: Dmitry Vyukov
    Ref: ebf5ebe31d2c ("[PATCH] signal-fixes-2.5.59-A4")
    History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Pull tracing fixes from Steven Rostedt:
    "This has two fixes for uprobe code.

    - Cut and paste fix to have uprobe printks say "uprobe" and not
    "kprobe"

    - Add terminating '\0' byte when copying function arguments"

    * tag 'trace-v5.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing/uprobes: Fix output for multiple string arguments
    tracing: uprobes: Fix typo in pr_fmt string

    Linus Torvalds
     

05 Feb, 2019

1 commit

  • Since 2.5.34 the code has had the potential to not allocate siginfo
    for SIGSTOP signals. Except for ptrace this is perfectly fine as only
    ptrace can use PTRACE_PEEK_SIGINFO and see what the contents of
    the delivered siginfo are.

    Users of PTRACE_PEEK_SIGINFO that care about the contents of siginfo
    for SIGSTOP are rare, but they do exist. A seccomp self test
    has cared and lldb cares.

    Jack Andersen writes:

    > The patch titled
    > `signal: Never allocate siginfo for SIGKILL or SIGSTOP`
    > created a regression for users of PTRACE_GETSIGINFO needing to
    > discern signals that were raised via the tgkill syscall.
    >
    > A notable user of this tgkill+ptrace combination is lldb while
    > debugging a multithreaded program. Without the ability to detect a
    > SIGSTOP originating from tgkill, lldb does not have a way to
    > synchronize on a per-thread basis and falls back to SIGSTOP-ing the
    > entire process.

    Everyone affected by this, please note: the kernel can still fail to
    allocate a siginfo structure. The allocation is with GFP_KERNEL and
    is best effort only. If memory is tight when the signal allocation
    comes in, this will fail to allocate a siginfo.

    So I strongly recommend looking at more robust solutions for
    synchronizing with a single thread, such as PTRACE_INTERRUPT, or, if
    that does not work, persuading your friendly local kernel developer to
    build the interface you need.
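
    For tools that keep relying on siginfo, a defensive check might look
    like the sketch below. It rests on an assumption that is not stated in
    this commit: that a stop whose siginfo could not be allocated shows up
    with si_code SI_USER and a sender pid of 0. stop_origin() is a
    hypothetical helper, and the si_code values are the Linux ones.

```python
SI_USER = 0    # Linux si_code for kill()-style signals
SI_TKILL = -6  # Linux si_code for tkill()/tgkill()


def stop_origin(si_code, si_pid):
    """Classify a SIGSTOP from the siginfo fields a tracer read back."""
    if si_code == SI_TKILL:
        return "tgkill"
    if si_code == SI_USER and si_pid == 0:
        # Assumed fallback shape: the siginfo may never have been allocated,
        # so the origin cannot be trusted.
        return "unknown"
    return "other"
```

    A tracer that gets "unknown" cannot tell a per-thread stop from a
    process-wide one and has to fall back to another synchronization
    method.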

    Reported-by: Tycho Andersen
    Reported-by: Kees Cook
    Reported-by: Jack Andersen
    Acked-by: Linus Torvalds
    Reviewed-by: Christian Brauner
    Cc: stable@vger.kernel.org
    Fixes: f149b3155744 ("signal: Never allocate siginfo for SIGKILL or SIGSTOP")
    Fixes: 6dfc88977e42 ("[PATCH] shared thread signals")
    History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

04 Feb, 2019

3 commits

  • The perf tool uses /proc/sys/kernel/perf_event_mlock_kb to determine how
    large its ringbuffer mmap should be. This can be configured to arbitrary
    values, which can be larger than the maximum possible allocation from
    kmalloc.

    When this is configured to a suitably large value (e.g. thanks to the
    perf fuzzer), attempting to use perf record triggers a WARN_ON_ONCE() in
    __alloc_pages_nodemask():

    WARNING: CPU: 2 PID: 5666 at mm/page_alloc.c:4511 __alloc_pages_nodemask+0x3f8/0xbc8

    Let's avoid this by checking that the requested allocation is possible
    before calling kzalloc.
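
    The kind of check described here can be sketched as follows. Every
    constant is an assumption chosen for illustration (4 KiB pages, an
    allocator order limit of 11, 64-bit pointers, a 256-byte header), and
    allocation_possible() is a hypothetical name, not the kernel function.

```python
PAGE_SHIFT = 12    # assumed: 4 KiB pages
MAX_ORDER = 11     # assumed: page allocator order limit
POINTER_SIZE = 8   # assumed: 64-bit pointers
HEADER_SIZE = 256  # hypothetical size of the ring-buffer bookkeeping struct


def order_base_2(n):
    """Smallest k such that 2**k >= n, for n >= 1."""
    return (n - 1).bit_length()


def allocation_possible(nr_pages):
    """Reject sizes the allocator could never satisfy, before allocating."""
    size = HEADER_SIZE + nr_pages * POINTER_SIZE
    return order_base_2(size) < PAGE_SHIFT + MAX_ORDER
```

    Under these assumptions a request for 16 pages passes, while a request
    for 2**30 pages is refused up front instead of reaching the allocator
    and triggering the warning.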

    Reported-by: Julien Thierry
    Signed-off-by: Mark Rutland
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Julien Thierry
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190110142745.25495-1-mark.rutland@arm.com
    Signed-off-by: Ingo Molnar

    Mark Rutland
     
  • Pull cpu hotplug fixes from Thomas Gleixner:
    "Two fixes for the cpu hotplug machinery:

    - Replace the overly clever 'SMT disabled by BIOS' detection logic as
    it breaks KVM scenarios and prevents speculation control updates
    when the Hyperthreads are brought online late after boot.

    - Remove a redundant invocation of the speculation control update
    function"

    * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM
    x86/speculation: Remove redundant arch_smt_update() invocation

    Linus Torvalds
     
  • Pull perf fixes from Thomas Gleixner:
    "A pile of perf updates:

    - Fix broken sanity check in the /proc/sys/kernel/perf_cpu_time_max_percent
    write handler

    - Cure a perf script crash which was caused by an uninitialized data
    structure

    - Highlight the hottest instruction in perf top and not a random one

    - Cure yet another clang issue when building perf python

    - Handle topology entries with no CPU correctly in the tools

    - Handle perf data which contains both tracepoints and performance
    counter entries correctly.

    - Add a missing NULL pointer check in perf ordered_events_free()"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf script: Fix crash when processing recorded stat data
    perf top: Fix wrong hottest instruction highlighted
    perf tools: Handle TOPOLOGY headers with no CPU
    perf python: Remove -fstack-clash-protection when building with some clang versions
    perf core: Fix perf_proc_update_handler() bug
    perf script: Fix crash with printing mixed trace point and other events
    perf ordered_events: Fix crash in ordered_events__free

    Linus Torvalds
     

02 Feb, 2019

3 commits

  • psi has provisions to shut off the periodic aggregation worker when
    there is a period of no task activity - and thus no data that needs
    aggregating. However, while developing psi monitoring, Suren noticed
    that the aggregation clock currently won't stay shut off for good.

    Debugging this revealed a flaw in the idle design: an aggregation run
    will see no task activity and decide to go to sleep; shortly thereafter,
    the kworker thread that executed the aggregation will go idle and cause
    a scheduling change, during which the psi callback will kick the
    !pending worker again. This will ping-pong forever, and is equivalent
    to having no shut-off logic at all (but with more code!)

    Fix this by exempting aggregation workers from psi's clock waking logic
    when the state change is them going to sleep. To do this, tag workers
    with the last work function they executed, and if in psi we see a worker
    going to sleep after aggregating psi data, we will not reschedule the
    aggregation work item.

    What if the worker is also executing other items before or after?

    Any psi state times that were incurred by work items preceding the
    aggregation work will have been collected from the per-cpu buckets
    during the aggregation itself. If there are work items following the
    aggregation work, the worker's last_func tag will be overwritten and the
    aggregator will be kept alive to process this genuine new activity.

    If the aggregation work is the last thing the worker does, and we decide
    to go idle, the brief period of non-idle time incurred between the
    aggregation run and the kworker's dequeue will be stranded in the
    per-cpu buckets until the clock is woken by later activity. But that
    should not be a problem. The buckets can hold 4s worth of time, and
    future activity will wake the clock with a 2s delay, giving us 2s worth
    of data we can leave behind when disabling aggregation. If it takes a
    worker more than two seconds to go idle after it finishes its last work
    item, we likely have bigger problems in the system, and won't notice one
    sample that was averaged with a bogus per-CPU weight.
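
    The tagging scheme can be sketched in a few lines. This is a
    user-space model with hypothetical names (Worker, should_wake_clock()
    and the two placeholder work functions); it only mirrors the decision
    described above, not the scheduler or workqueue code.

```python
def psi_update_work():
    """Placeholder for the aggregation work function."""


def other_work():
    """Placeholder for any unrelated work item."""


class Worker:
    def __init__(self):
        self.last_func = None

    def run(self, func):
        self.last_func = func  # tag the worker with what it just executed
        func()


def should_wake_clock(task, going_to_sleep):
    """Decide whether a scheduling change should re-arm the aggregation clock."""
    if (going_to_sleep and isinstance(task, Worker)
            and task.last_func is psi_update_work):
        return False  # the aggregator itself going idle is not new activity
    return True
```

    A worker that last ran psi_update_work() and goes to sleep does not
    wake the clock; once it runs other_work() the tag is overwritten and
    its next sleep counts as genuine activity again.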

    Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
    Fixes: eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO")
    Signed-off-by: Johannes Weiner
    Reported-by: Suren Baghdasaryan
    Acked-by: Tejun Heo
    Cc: Peter Zijlstra
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Currently, exit_ptrace() adds all ptraced tasks to a dead list, then
    zap_pid_ns_processes() waits on all tasks in the current pidns, and only
    then are tasks from the dead list released.

    zap_pid_ns_processes() can get stuck waiting on tasks from the dead
    list. In this case, we will have one unkillable process with one or
    more dead children.

    Thanks to Oleg for the advice to release tasks in find_child_reaper().

    Link: http://lkml.kernel.org/r/20190110175200.12442-1-avagin@gmail.com
    Fixes: 7c8bd2322c7f ("exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()")
    Signed-off-by: Andrei Vagin
    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrei Vagin
     
  • Alexei Starovoitov says:

    ====================
    pull-request: bpf 2019-01-31

    The following pull-request contains BPF updates for your *net* tree.

    The main changes are:

    1) disable preemption in sender side of socket filters, from Alexei.

    2) fix two potential deadlocks in syscall bpf lookup and prog_register,
    from Martin and Alexei.

    3) fix BTF to allow typedef on func_proto, from Yonghong.

    4) two bpftool fixes, from Jiri and Paolo.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

01 Feb, 2019

4 commits

  • map_lookup_elem used to not acquire a spinlock
    in order to optimize the reader.

    That was true until commit 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation").
    The syscall's map_lookup_elem(stackmap) calls bpf_stackmap_copy().
    bpf_stackmap_copy() may find the elem no longer needed after the copy is done.
    If that is the case, pcpu_freelist_push() saves this elem for reuse later.
    This push requires a spinlock.

    If a tracing bpf_prog runs in the middle of the syscall's
    map_lookup_elem(stackmap) and calls
    bpf_get_stackid(stackmap), which also requires the same pcpu_freelist's
    spinlock, it may end up in a deadlock, as reported by
    Eric Dumazet in https://patchwork.ozlabs.org/patch/1030266/

    The situation is the same as the syscall's map_update_elem() which
    needs to acquire the pcpu_freelist's spinlock and could race
    with tracing bpf_prog. Hence, this patch fixes it by protecting
    bpf_stackmap_copy() with this_cpu_inc(bpf_prog_active)
    to prevent tracing bpf_prog from running.

    A later commit, f1a2e44a3aec ("bpf: add queue and stack maps"), added a
    syscall map_lookup_elem path that also acquires a spinlock and races
    with tracing bpf_prog similarly.
    Hence, this patch is forward looking and protects the majority
    of the map lookups. bpf_map_offload_lookup_elem() is the exception
    since it is for network bpf_prog only (i.e. never called by tracing
    bpf_prog).
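
    The guard can be pictured with the sketch below. It is a simplified
    single-CPU model, not the kernel code: one module-level counter stands
    in for the per-CPU bpf_prog_active, and syscall_map_access() and
    run_tracing_prog() are hypothetical names.

```python
from contextlib import contextmanager

bpf_prog_active = 0  # stands in for the per-CPU counter


@contextmanager
def syscall_map_access():
    """Syscall side: raise the counter around code that takes the freelist lock."""
    global bpf_prog_active
    bpf_prog_active += 1
    try:
        yield
    finally:
        bpf_prog_active -= 1


def run_tracing_prog(prog):
    """Tracing side: refuse to run while the counter is raised."""
    if bpf_prog_active:
        return None  # skipped, so it can never contend for the same spinlock
    return prog()
```

    A tracing program fired inside the with block is skipped, and it runs
    normally again once the block has exited.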

    Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation")
    Reported-by: Eric Dumazet
    Acked-by: Alexei Starovoitov
    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Martin KaFai Lau
     
  • Lockdep found a potential deadlock between cpu_hotplug_lock, bpf_event_mutex, and cpuctx_mutex:
    [ 13.007000] WARNING: possible circular locking dependency detected
    [ 13.007587] 5.0.0-rc3-00018-g2fa53f892422-dirty #477 Not tainted
    [ 13.008124] ------------------------------------------------------
    [ 13.008624] test_progs/246 is trying to acquire lock:
    [ 13.009030] 0000000094160d1d (tracepoints_mutex){+.+.}, at: tracepoint_probe_register_prio+0x2d/0x300
    [ 13.009770]
    [ 13.009770] but task is already holding lock:
    [ 13.010239] 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60
    [ 13.010877]
    [ 13.010877] which lock already depends on the new lock.
    [ 13.010877]
    [ 13.011532]
    [ 13.011532] the existing dependency chain (in reverse order) is:
    [ 13.012129]
    [ 13.012129] -> #4 (bpf_event_mutex){+.+.}:
    [ 13.012582] perf_event_query_prog_array+0x9b/0x130
    [ 13.013016] _perf_ioctl+0x3aa/0x830
    [ 13.013354] perf_ioctl+0x2e/0x50
    [ 13.013668] do_vfs_ioctl+0x8f/0x6a0
    [ 13.014003] ksys_ioctl+0x70/0x80
    [ 13.014320] __x64_sys_ioctl+0x16/0x20
    [ 13.014668] do_syscall_64+0x4a/0x180
    [ 13.015007] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 13.015469]
    [ 13.015469] -> #3 (&cpuctx_mutex){+.+.}:
    [ 13.015910] perf_event_init_cpu+0x5a/0x90
    [ 13.016291] perf_event_init+0x1b2/0x1de
    [ 13.016654] start_kernel+0x2b8/0x42a
    [ 13.016995] secondary_startup_64+0xa4/0xb0
    [ 13.017382]
    [ 13.017382] -> #2 (pmus_lock){+.+.}:
    [ 13.017794] perf_event_init_cpu+0x21/0x90
    [ 13.018172] cpuhp_invoke_callback+0xb3/0x960
    [ 13.018573] _cpu_up+0xa7/0x140
    [ 13.018871] do_cpu_up+0xa4/0xc0
    [ 13.019178] smp_init+0xcd/0xd2
    [ 13.019483] kernel_init_freeable+0x123/0x24f
    [ 13.019878] kernel_init+0xa/0x110
    [ 13.020201] ret_from_fork+0x24/0x30
    [ 13.020541]
    [ 13.020541] -> #1 (cpu_hotplug_lock.rw_sem){++++}:
    [ 13.021051] static_key_slow_inc+0xe/0x20
    [ 13.021424] tracepoint_probe_register_prio+0x28c/0x300
    [ 13.021891] perf_trace_event_init+0x11f/0x250
    [ 13.022297] perf_trace_init+0x6b/0xa0
    [ 13.022644] perf_tp_event_init+0x25/0x40
    [ 13.023011] perf_try_init_event+0x6b/0x90
    [ 13.023386] perf_event_alloc+0x9a8/0xc40
    [ 13.023754] __do_sys_perf_event_open+0x1dd/0xd30
    [ 13.024173] do_syscall_64+0x4a/0x180
    [ 13.024519] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 13.024968]
    [ 13.024968] -> #0 (tracepoints_mutex){+.+.}:
    [ 13.025434] __mutex_lock+0x86/0x970
    [ 13.025764] tracepoint_probe_register_prio+0x2d/0x300
    [ 13.026215] bpf_probe_register+0x40/0x60
    [ 13.026584] bpf_raw_tracepoint_open.isra.34+0xa4/0x130
    [ 13.027042] __do_sys_bpf+0x94f/0x1a90
    [ 13.027389] do_syscall_64+0x4a/0x180
    [ 13.027727] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 13.028171]
    [ 13.028171] other info that might help us debug this:
    [ 13.028171]
    [ 13.028807] Chain exists of:
    [ 13.028807] tracepoints_mutex --> &cpuctx_mutex --> bpf_event_mutex
    [ 13.028807]
    [ 13.029666] Possible unsafe locking scenario:
    [ 13.029666]
    [ 13.030140] CPU0 CPU1
    [ 13.030510] ---- ----
    [ 13.030875] lock(bpf_event_mutex);
    [ 13.031166] lock(&cpuctx_mutex);
    [ 13.031645] lock(bpf_event_mutex);
    [ 13.032135] lock(tracepoints_mutex);
    [ 13.032441]
    [ 13.032441] *** DEADLOCK ***
    [ 13.032441]
    [ 13.032911] 1 lock held by test_progs/246:
    [ 13.033239] #0: 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60
    [ 13.033909]
    [ 13.033909] stack backtrace:
    [ 13.034258] CPU: 1 PID: 246 Comm: test_progs Not tainted 5.0.0-rc3-00018-g2fa53f892422-dirty #477
    [ 13.034964] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
    [ 13.035657] Call Trace:
    [ 13.035859] dump_stack+0x5f/0x8b
    [ 13.036130] print_circular_bug.isra.37+0x1ce/0x1db
    [ 13.036526] __lock_acquire+0x1158/0x1350
    [ 13.036852] ? lock_acquire+0x98/0x190
    [ 13.037154] lock_acquire+0x98/0x190
    [ 13.037447] ? tracepoint_probe_register_prio+0x2d/0x300
    [ 13.037876] __mutex_lock+0x86/0x970
    [ 13.038167] ? tracepoint_probe_register_prio+0x2d/0x300
    [ 13.038600] ? tracepoint_probe_register_prio+0x2d/0x300
    [ 13.039028] ? __mutex_lock+0x86/0x970
    [ 13.039337] ? __mutex_lock+0x24a/0x970
    [ 13.039649] ? bpf_probe_register+0x1d/0x60
    [ 13.039992] ? __bpf_trace_sched_wake_idle_without_ipi+0x10/0x10
    [ 13.040478] ? tracepoint_probe_register_prio+0x2d/0x300
    [ 13.040906] tracepoint_probe_register_prio+0x2d/0x300
    [ 13.041325] bpf_probe_register+0x40/0x60
    [ 13.041649] bpf_raw_tracepoint_open.isra.34+0xa4/0x130
    [ 13.042068] ? __might_fault+0x3e/0x90
    [ 13.042374] __do_sys_bpf+0x94f/0x1a90
    [ 13.042678] do_syscall_64+0x4a/0x180
    [ 13.042975] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 13.043382] RIP: 0033:0x7f23b10a07f9
    [ 13.045155] RSP: 002b:00007ffdef42fdd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
    [ 13.045759] RAX: ffffffffffffffda RBX: 00007ffdef42ff70 RCX: 00007f23b10a07f9
    [ 13.046326] RDX: 0000000000000070 RSI: 00007ffdef42fe10 RDI: 0000000000000011
    [ 13.046893] RBP: 00007ffdef42fdf0 R08: 0000000000000038 R09: 00007ffdef42fe10
    [ 13.047462] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
    [ 13.048029] R13: 0000000000000016 R14: 00007f23b1db4690 R15: 0000000000000000

    Since tracepoints_mutex will be taken in tracepoint_probe_register/unregister()
    there is no need to take bpf_event_mutex too.
    bpf_event_mutex is protecting modifications to prog array used in kprobe/perf bpf progs.
    bpf_raw_tracepoints don't need to take this mutex.
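
    The report can be read as a cycle in a "held -> wanted" graph. The
    sketch below is an illustration, not how lockdep is implemented:
    has_cycle() is a hypothetical helper and the edges are one reading of
    dependency entries #4 to #0 in the trace above.

```python
def has_cycle(edges):
    """Depth-first search for a cycle in a directed lock-order graph."""
    graph = {}
    for held, wanted in edges:
        graph.setdefault(held, []).append(wanted)
        graph.setdefault(wanted, [])
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY or (colour[nxt] == WHITE and visit(nxt)):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and visit(node) for node in graph)


# Each pair is (lock already held, lock being acquired).
CHAIN = [
    ("cpuctx_mutex", "bpf_event_mutex"),        # entry #4
    ("pmus_lock", "cpuctx_mutex"),              # entry #3
    ("cpu_hotplug_lock", "pmus_lock"),          # entry #2
    ("tracepoints_mutex", "cpu_hotplug_lock"),  # entry #1
]
# Entry #0: the raw tracepoint path took tracepoints_mutex under bpf_event_mutex.
RAW_TP_EDGE = ("bpf_event_mutex", "tracepoints_mutex")
```

    has_cycle(CHAIN + [RAW_TP_EDGE]) is true, and dropping RAW_TP_EDGE,
    which is what no longer taking bpf_event_mutex on that path amounts
    to, leaves an acyclic chain.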

    Fixes: c4f6699dfcb8 ("bpf: introduce BPF_RAW_TRACEPOINT")
    Acked-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
     
  • Lockdep warns about a false positive:
    [ 12.492084] 00000000e6b28347 (&head->lock){+...}, at: pcpu_freelist_push+0x2a/0x40
    [ 12.492696] but this lock was taken by another, HARDIRQ-safe lock in the past:
    [ 12.493275] (&rq->lock){-.-.}
    [ 12.493276]
    [ 12.493276]
    [ 12.493276] and interrupts could create inverse lock ordering between them.
    [ 12.493276]
    [ 12.494435]
    [ 12.494435] other info that might help us debug this:
    [ 12.494979] Possible interrupt unsafe locking scenario:
    [ 12.494979]
    [ 12.495518] CPU0 CPU1
    [ 12.495879] ---- ----
    [ 12.496243] lock(&head->lock);
    [ 12.496502] local_irq_disable();
    [ 12.496969] lock(&rq->lock);
    [ 12.497431] lock(&head->lock);
    [ 12.497890]
    [ 12.498104] lock(&rq->lock);
    [ 12.498368]
    [ 12.498368] *** DEADLOCK ***
    [ 12.498368]
    [ 12.498837] 1 lock held by dd/276:
    [ 12.499110] #0: 00000000c58cb2ee (rcu_read_lock){....}, at: trace_call_bpf+0x5e/0x240
    [ 12.499747]
    [ 12.499747] the shortest dependencies between 2nd lock and 1st lock:
    [ 12.500389] -> (&rq->lock){-.-.} {
    [ 12.500669] IN-HARDIRQ-W at:
    [ 12.500934] _raw_spin_lock+0x2f/0x40
    [ 12.501373] scheduler_tick+0x4c/0xf0
    [ 12.501812] update_process_times+0x40/0x50
    [ 12.502294] tick_periodic+0x27/0xb0
    [ 12.502723] tick_handle_periodic+0x1f/0x60
    [ 12.503203] timer_interrupt+0x11/0x20
    [ 12.503651] __handle_irq_event_percpu+0x43/0x2c0
    [ 12.504167] handle_irq_event_percpu+0x20/0x50
    [ 12.504674] handle_irq_event+0x37/0x60
    [ 12.505139] handle_level_irq+0xa7/0x120
    [ 12.505601] handle_irq+0xa1/0x150
    [ 12.506018] do_IRQ+0x77/0x140
    [ 12.506411] ret_from_intr+0x0/0x1d
    [ 12.506834] _raw_spin_unlock_irqrestore+0x53/0x60
    [ 12.507362] __setup_irq+0x481/0x730
    [ 12.507789] setup_irq+0x49/0x80
    [ 12.508195] hpet_time_init+0x21/0x32
    [ 12.508644] x86_late_time_init+0xb/0x16
    [ 12.509106] start_kernel+0x390/0x42a
    [ 12.509554] secondary_startup_64+0xa4/0xb0
    [ 12.510034] IN-SOFTIRQ-W at:
    [ 12.510305] _raw_spin_lock+0x2f/0x40
    [ 12.510772] try_to_wake_up+0x1c7/0x4e0
    [ 12.511220] swake_up_locked+0x20/0x40
    [ 12.511657] swake_up_one+0x1a/0x30
    [ 12.512070] rcu_process_callbacks+0xc5/0x650
    [ 12.512553] __do_softirq+0xe6/0x47b
    [ 12.512978] irq_exit+0xc3/0xd0
    [ 12.513372] smp_apic_timer_interrupt+0xa9/0x250
    [ 12.513876] apic_timer_interrupt+0xf/0x20
    [ 12.514343] default_idle+0x1c/0x170
    [ 12.514765] do_idle+0x199/0x240
    [ 12.515159] cpu_startup_entry+0x19/0x20
    [ 12.515614] start_kernel+0x422/0x42a
    [ 12.516045] secondary_startup_64+0xa4/0xb0
    [ 12.516521] INITIAL USE at:
    [ 12.516774] _raw_spin_lock_irqsave+0x38/0x50
    [ 12.517258] rq_attach_root+0x16/0xd0
    [ 12.517685] sched_init+0x2f2/0x3eb
    [ 12.518096] start_kernel+0x1fb/0x42a
    [ 12.518525] secondary_startup_64+0xa4/0xb0
    [ 12.518986] }
    [ 12.519132] ... key at: [] __key.71384+0x0/0x8
    [ 12.519649] ... acquired at:
    [ 12.519892] pcpu_freelist_pop+0x7b/0xd0
    [ 12.520221] bpf_get_stackid+0x1d2/0x4d0
    [ 12.520563] ___bpf_prog_run+0x8b4/0x11a0
    [ 12.520887]
    [ 12.521008] -> (&head->lock){+...} {
    [ 12.521292] HARDIRQ-ON-W at:
    [ 12.521539] _raw_spin_lock+0x2f/0x40
    [ 12.521950] pcpu_freelist_push+0x2a/0x40
    [ 12.522396] bpf_get_stackid+0x494/0x4d0
    [ 12.522828] ___bpf_prog_run+0x8b4/0x11a0
    [ 12.523296] INITIAL USE at:
    [ 12.523537] _raw_spin_lock+0x2f/0x40
    [ 12.523944] pcpu_freelist_populate+0xc0/0x120
    [ 12.524417] htab_map_alloc+0x405/0x500
    [ 12.524835] __do_sys_bpf+0x1a3/0x1a90
    [ 12.525253] do_syscall_64+0x4a/0x180
    [ 12.525659] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 12.526167] }
    [ 12.526311] ... key at: [] __key.13130+0x0/0x8
    [ 12.526812] ... acquired at:
    [ 12.527047] __lock_acquire+0x521/0x1350
    [ 12.527371] lock_acquire+0x98/0x190
    [ 12.527680] _raw_spin_lock+0x2f/0x40
    [ 12.527994] pcpu_freelist_push+0x2a/0x40
    [ 12.528325] bpf_get_stackid+0x494/0x4d0
    [ 12.528645] ___bpf_prog_run+0x8b4/0x11a0
    [ 12.528970]
    [ 12.529092]
    [ 12.529092] stack backtrace:
    [ 12.529444] CPU: 0 PID: 276 Comm: dd Not tainted 5.0.0-rc3-00018-g2fa53f892422 #475
    [ 12.530043] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
    [ 12.530750] Call Trace:
    [ 12.530948] dump_stack+0x5f/0x8b
    [ 12.531248] check_usage_backwards+0x10c/0x120
    [ 12.531598] ? ___bpf_prog_run+0x8b4/0x11a0
    [ 12.531935] ? mark_lock+0x382/0x560
    [ 12.532229] mark_lock+0x382/0x560
    [ 12.532496] ? print_shortest_lock_dependencies+0x180/0x180
    [ 12.532928] __lock_acquire+0x521/0x1350
    [ 12.533271] ? find_get_entry+0x17f/0x2e0
    [ 12.533586] ? find_get_entry+0x19c/0x2e0
    [ 12.533902] ? lock_acquire+0x98/0x190
    [ 12.534196] lock_acquire+0x98/0x190
    [ 12.534482] ? pcpu_freelist_push+0x2a/0x40
    [ 12.534810] _raw_spin_lock+0x2f/0x40
    [ 12.535099] ? pcpu_freelist_push+0x2a/0x40
    [ 12.535432] pcpu_freelist_push+0x2a/0x40
    [ 12.535750] bpf_get_stackid+0x494/0x4d0
    [ 12.536062] ___bpf_prog_run+0x8b4/0x11a0

    It has been explained that this is a false positive here:
    https://lkml.org/lkml/2018/7/25/756
    Recap:
    - stackmap uses pcpu_freelist
    - The lock in pcpu_freelist is a percpu lock
    - stackmap is only used by tracing bpf_prog
    - A tracing bpf_prog cannot be run if another bpf_prog
    has already been running (ensured by the percpu bpf_prog_active counter).

    Eric pointed out that this lockdep splat stops other
    legit lockdep splats in selftests/bpf/test_progs.c.

    Fix this by calling local_irq_save/restore for stackmap.

    Another false positive had also been worked around by calling
    local_irq_save in commit 89ad2fa3f043 ("bpf: fix lockdep splat").
    That commit added unnecessary irq_save/restore to fast path of
    bpf hash map. irqs are already disabled at that point, since htab
    is holding per bucket spin_lock with irqsave.

    Let's reduce overhead for htab by introducing __pcpu_freelist_push/pop
    function w/o irqsave and convert pcpu_freelist_push/pop to irqsave
    to be used elsewhere (right now only in stackmap).
    It stops lockdep false positive in stackmap with a bit of acceptable overhead.
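
    The split between a raw helper and an irqsave wrapper can be modelled in
    user space. This is only a sketch: the toy_* names and the boolean
    interrupt flag are hypothetical stand-ins, not the kernel's
    pcpu_freelist code or its local_irq_save()/local_irq_restore().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Models the CPU interrupt-enable flag; purely illustrative. */
bool toy_irqs_enabled = true;

/* Hypothetical stand-ins for local_irq_save()/local_irq_restore(). */
bool toy_irq_save(void)
{
    bool was_enabled = toy_irqs_enabled;

    toy_irqs_enabled = false;
    return was_enabled;
}

void toy_irq_restore(bool was_enabled)
{
    toy_irqs_enabled = was_enabled;
}

struct toy_fl_node { struct toy_fl_node *next; };
struct toy_freelist { struct toy_fl_node *first; };

/* Raw variant: the caller guarantees interrupts are already disabled. */
void toy_push_raw(struct toy_freelist *list, struct toy_fl_node *node)
{
    assert(!toy_irqs_enabled);
    node->next = list->first;
    list->first = node;
}

/* Wrapper for callers that may run with interrupts enabled. */
void toy_push(struct toy_freelist *list, struct toy_fl_node *node)
{
    bool flags = toy_irq_save();

    toy_push_raw(list, node);
    toy_irq_restore(flags);
}
```

    A caller that already runs with interrupts off (htab in the text above)
    would use the raw variant and skip the redundant save/restore; everyone
    else goes through the wrapper.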

    Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation")
    Reported-by: Naresh Kamboju
    Reported-by: Eric Dumazet
    Acked-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
     
  • Disabled preemption is necessary for proper access to per-cpu maps
    from BPF programs.

    But the sender side of socket filters didn't have preemption disabled:
    unix_dgram_sendmsg->sk_filter->sk_filter_trim_cap->bpf_prog_run_save_cb->BPF_PROG_RUN

    and a combination of af_packet with tun device didn't disable either:
    tpacket_snd->packet_direct_xmit->packet_pick_tx_queue->ndo_select_queue->
    tun_select_queue->tun_ebpf_select_queue->bpf_prog_run_clear_cb->BPF_PROG_RUN

    Disable preemption before executing BPF programs (both classic and extended).

    Reported-by: Jann Horn
    Signed-off-by: Alexei Starovoitov
    Acked-by: Song Liu
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
     

31 Jan, 2019

2 commits

  • If create_buf_file() returns an error, don't try to reference it later
    as a valid dentry pointer.

    This problem was exposed when debugfs started to return errors instead
    of just NULL for some calls when they do not succeed properly.

    Also, the check for WARN_ON(dentry) was just wrong :)

    Reported-by: Kees Cook
    Reported-and-tested-by: syzbot+16c3a70e1e9b29346c43@syzkaller.appspotmail.com
    Reported-by: Tetsuo Handa
    Cc: Andrew Morton
    Cc: David Rientjes
    Fixes: ff9fb72bc077 ("debugfs: return error values, not NULL")
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     
  • With the following commit:

    73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS")

    ... the hotplug code attempted to detect when SMT was disabled by BIOS,
    in which case it reported SMT as permanently disabled. However, that
    code broke a virt hotplug scenario, where the guest is booted with only
    primary CPU threads, and a sibling is brought online later.

    The problem is that there doesn't seem to be a way to reliably
    distinguish between the HW "SMT disabled by BIOS" case and the virt
    "sibling not yet brought online" case. So the above-mentioned commit
    was a bit misguided, as it permanently disabled SMT for both cases,
    preventing future virt sibling hotplugs.

    Going back and reviewing the original problems which were attempted to
    be solved by that commit, when SMT was disabled in BIOS:

    1) /sys/devices/system/cpu/smt/control showed "on" instead of
    "notsupported"; and

    2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning.

    I'd propose that we instead consider #1 above to not actually be a
    problem. Because, at least in the virt case, it's possible that SMT
    wasn't disabled by BIOS and a sibling thread could be brought online
    later. So it makes sense to just always default the smt control to "on"
    to allow for that possibility (assuming cpuid indicates that the CPU
    supports SMT).

    The real problem is #2, which has a simple fix: change vmx_vm_init() to
    query the actual current SMT state -- i.e., whether any siblings are
    currently online -- instead of looking at the SMT "control" sysfs value.

    So fix it by:

    a) reverting the original "fix" and its followup fix:

    73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS")
    bc2d8d262cba ("cpu/hotplug: Fix SMT supported evaluation")

    and

    b) changing vmx_vm_init() to query the actual current SMT state --
    instead of the sysfs control value -- to determine whether the L1TF
    warning is needed. This also requires the 'sched_smt_present'
    variable to be exported, instead of 'cpu_smt_control'.

    Fixes: 73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS")
    Reported-by: Igor Mammedov
    Signed-off-by: Josh Poimboeuf
    Signed-off-by: Thomas Gleixner
    Cc: Joe Mario
    Cc: Jiri Kosina
    Cc: Peter Zijlstra
    Cc: kvm@vger.kernel.org
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.com

    Josh Poimboeuf
     

30 Jan, 2019

2 commits

  • Current implementation does not allow typedef func_proto.
    But it is actually allowed.
    -bash-4.4$ cat t.c
    typedef int (f) (int);
    f *g;
    -bash-4.4$ clang -O2 -g -c -target bpf t.c -Xclang -target-feature -Xclang +dwarfris
    -bash-4.4$ pahole -JV t.o
    File t.o:
    [1] PTR (anon) type_id=2
    [2] TYPEDEF f type_id=3
    [3] FUNC_PROTO (anon) return=4 args=(4 (anon))
    [4] INT int size=4 bit_offset=0 nr_bits=32 encoding=SIGNED
    -bash-4.4$

    This patch relaxes the btf verifier to allow such (typedef func_proto)
    patterns.
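
    For readers unfamiliar with the pattern: a typedef may name a function
    type itself, and a pointer to it is then declared separately, exactly as
    in t.c above. A minimal sketch, where twice() and apply() are
    hypothetical names added for illustration:

```c
/* A typedef that names a function type, not a pointer-to-function type. */
typedef int (f)(int);

/* Hypothetical function whose type matches f. */
int twice(int x)
{
    return 2 * x;
}

/* A pointer to the typedef'd function type, as in t.c. */
f *g = twice;

/* Calls through the pointer; g(x) is equivalent to (*g)(x). */
int apply(int x)
{
    return g(x);
}
```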

    Fixes: 2667a2626f4d ("bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO")
    Acked-by: Martin KaFai Lau
    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov

    Yonghong Song
     
  • With commit a74cfffb03b7 ("x86/speculation: Rework SMT state change"),
    arch_smt_update() is invoked from each individual CPU hotplug function.

    Therefore the extra arch_smt_update() call in the sysfs SMT control is
    redundant.

    Fixes: a74cfffb03b7 ("x86/speculation: Rework SMT state change")
    Signed-off-by: Zhenzhong Duan
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/e2e064f2-e8ef-42ca-bf4f-76b612964752@default

    Zhenzhong Duan
     

28 Jan, 2019

3 commits

  • Pull timer fix from Thomas Gleixner:
    "A single regression fix to address the unintended breakage of posix
    cpu timers.

    This is caused by a new sanity check in the common code, which fails
    for posix cpu timers under certain conditions because the posix cpu
    timer code never updates the variable which is checked"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    posix-cpu-timers: Unbreak timer rearming

    Linus Torvalds
     
  • Pull locking fixes from Thomas Gleixner:
    "A small series of fixes which all address possible missed wakeups:

    - Document and fix the wakeup ordering of wake_q

    - Add the missing barrier in rcuwait_wake_up(), which was documented
    in the comment but missing in the code

    - Fix the possible missed wakeups in the rwsem and futex code"

    * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/rwsem: Fix (possible) missed wakeup
    futex: Fix (possible) missed wakeup
    sched/wake_q: Fix wakeup ordering for wake_q
    sched/wake_q: Document wake_q_add()
    sched/wait: Fix rcuwait_wake_up() ordering

    Linus Torvalds
     
  • Pull irq fixes from Thomas Gleixner:
    "A small set of fixes for the interrupt subsystem:

    - Fix a double increment in the irq descriptor allocator which
    resulted in a sanity check only being done for every second
    affinity mask

    - Add a missing device tree translation in the stm32-exti driver.
    Without that the interrupt association is completely wrong.

    - Initialize the mutex in the GIC-V3 MBI driver

    - Fix the alignment for aliasing devices in the GIC-V3-ITS driver so
    multi MSI allocations work correctly

    - Ensure that the initial affinity of an interrupt is not empty at
    startup time.

    - Drop bogus include in the madera irq chip driver

    - Fix KernelDoc regression"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    irqchip/gic-v3-its: Align PCI Multi-MSI allocation on their size
    genirq/irqdesc: Fix double increment in alloc_descs()
    genirq: Fix the kerneldoc comment for struct irq_affinity_desc
    irqchip/madera: Drop GPIO includes
    irqchip/gic-v3-mbi: Fix uninitialized mbi_lock
    irqchip/stm32-exti: Add domain translate function
    genirq: Make sure the initial affinity is not empty

    Linus Torvalds
     

22 Jan, 2019

1 commit

  • …inux/kernel/git/acme/linux into perf/urgent

    Pull perf/urgent fixes from Arnaldo Carvalho de Melo:

    Kernel:

    Stephane Eranian:

    - Fix perf_proc_update_handler() bug.

    perf script:

    Andi Kleen:

    - Fix crash with printing mixed trace point and other events.

    Tony Jones:

    - Fix crash when processing recorded stat data.

    perf top:

    He Kuang:

    - Fix wrong hottest instruction highlighted.

    perf python:

    Arnaldo Carvalho de Melo:

    - Remove -fstack-clash-protection when building with some clang versions.

    perf ordered_events:

    Jiri Olsa:

    - Fix out of buffers crash in ordered_events__free().

    perf cpu_map:

    Stephane Eranian:

    - Handle TOPOLOGY headers with no CPU.

    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

21 Jan, 2019

6 commits

  • Because wake_q_add() can imply an immediate wakeup (cmpxchg failure
    case), we must not rely on the wakeup being delayed. However, commit:

    e38513905eea ("locking/rwsem: Rework zeroing reader waiter->task")

    relies on exactly that behaviour in that the wakeup must not happen
    until after we clear waiter->task.

    [ peterz: Added changelog. ]

    Signed-off-by: Xie Yongji
    Signed-off-by: Zhang Yu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: e38513905eea ("locking/rwsem: Rework zeroing reader waiter->task")
    Link: https://lkml.kernel.org/r/1543495830-2644-1-git-send-email-xieyongji@baidu.com
    Signed-off-by: Ingo Molnar

    Xie Yongji
     
  • We must not rely on wake_q_add() to delay the wakeup; in particular
    commit:

    1d0dcb3ad9d3 ("futex: Implement lockless wakeups")

    moved wake_q_add() before smp_store_release(&q->lock_ptr, NULL), which
    could result in futex_wait() waking before observing ->lock_ptr ==
    NULL and going back to sleep again.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 1d0dcb3ad9d3 ("futex: Implement lockless wakeups")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Notably, cmpxchg() does not provide ordering when it fails; however,
    wake_q_add() requires ordering in this specific case too. Without this
    it would be possible for the concurrent wakeup to not observe our
    prior state.

    Andrea Parri provided:

    C wake_up_q-wake_q_add

    {
    int next = 0;
    int y = 0;
    }

    P0(int *next, int *y)
    {
    int r0;

    /* in wake_up_q() */

    WRITE_ONCE(*next, 1); /* node->next = NULL */
    smp_mb(); /* implied by wake_up_process() */
    r0 = READ_ONCE(*y);
    }

    P1(int *next, int *y)
    {
    int r1;

    /* in wake_q_add() */

    WRITE_ONCE(*y, 1); /* wake_cond = true */
    smp_mb__before_atomic();
    r1 = cmpxchg_relaxed(next, 1, 2);
    }

    exists (0:r0=0 /\ 1:r1=0)

    This "exists" clause cannot be satisfied according to the LKMM:

    Test wake_up_q-wake_q_add Allowed
    States 3
    0:r0=0; 1:r1=1;
    0:r0=1; 1:r1=0;
    0:r0=1; 1:r1=1;
    No
    Witnesses
    Positive: 0 Negative: 3
    Condition exists (0:r0=0 /\ 1:r1=0)
    Observation wake_up_q-wake_q_add Never 0 3

    Reported-by: Yongji Xie
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The only guarantee provided by wake_q_add() is that a wakeup will
    happen after it, it does _NOT_ guarantee the wakeup will be delayed
    until the matching wake_up_q().

    If wake_q_add() fails the cmpxchg() a concurrent wakeup is pending and
    that can happen at any time after the cmpxchg(). This means we should
    not rely on the wakeup happening at wake_up_q(), but should be ready
    for wake_q_add() to issue the wakeup.

    The delay, if provided (most likely), should only result in more efficient
    behaviour.
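
    The "claim" step that decides whether wake_q_add() queues the task or
    leaves it to an already pending wakeup can be sketched with C11 atomics.
    This is an illustration only: the toy_* names are hypothetical, and the
    C11 compare-exchange used here is sequentially consistent by default,
    unlike the kernel's cmpxchg(), which gives no ordering on failure.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a wake queue node embedded in a task. */
struct toy_wake_node {
    _Atomic(struct toy_wake_node *) next;
};

/* Sentinel marking "queued, end of list"; any non-NULL value works. */
#define TOY_WAKE_TAIL ((struct toy_wake_node *)0x1)

void toy_wake_node_init(struct toy_wake_node *node)
{
    atomic_init(&node->next, (struct toy_wake_node *)NULL);
}

/*
 * Returns true if this caller claimed the node and now owns the wakeup.
 * Returns false if the node was already claimed: another waker has a
 * wakeup pending, and it may fire at any time, including right now.
 */
bool toy_wake_q_claim(struct toy_wake_node *node)
{
    struct toy_wake_node *expected = NULL;

    return atomic_compare_exchange_strong(&node->next, &expected,
                                          TOY_WAKE_TAIL);
}
```

    A caller for which the claim fails must therefore have finished all
    state updates the woken task depends on before attempting the claim.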

    Reported-by: Yongji Xie
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • For some peculiar reason rcuwait_wake_up() has the right barrier in
    the comment, but not in the code.

    This mistake has been observed to cause a deadlock in the following
    situation:

    P1                                  P2

    percpu_up_read()                    percpu_down_write()
      rcu_sync_is_idle() // false
                                          rcu_sync_enter()
                                          ...
      __percpu_up_read()

    [S] ,- __this_cpu_dec(*sem->read_count)
        |  smp_rmb();
    [L] |  task = rcu_dereference(w->task) // NULL
        |
        |                           [S]   w->task = current
        |                                 smp_mb();
        |                           [L]   readers_active_check() // fail
        `->

    Where the smp_rmb() (obviously) fails to constrain the store.

    [ peterz: Added changelog. ]

    Signed-off-by: Prateek Sood
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Andrea Parri
    Acked-by: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 8f95c90ceb54 ("sched/wait, RCU: Introduce rcuwait machinery")
    Link: https://lkml.kernel.org/r/1543590656-7157-1-git-send-email-prsood@codeaurora.org
    Signed-off-by: Ingo Molnar

    Prateek Sood
     
  • Pull networking fixes from David Miller:

    1) Fix endless loop in nf_tables, from Phil Sutter.

    2) Fix cross namespace ip6_gre tunnel hash list corruption, from
    Olivier Matz.

    3) Don't be too strict in phy_start_aneg() otherwise we might not allow
    restarting auto negotiation. From Heiner Kallweit.

    4) Fix various KMSAN uninitialized value cases in tipc, from Ying Xue.

    5) Memory leak in act_tunnel_key, from Davide Caratti.

    6) Handle chip errata of mv88e6390 PHY, from Andrew Lunn.

    7) Remove linear SKB assumption in fou/fou6, from Eric Dumazet.

    8) Missing udplite rehash callbacks, from Alexey Kodanev.

    9) Log dirty pages properly in vhost, from Jason Wang.

    10) Use consume_skb() in neigh_probe() as this is a normal free not a
    drop, from Yang Wei. Likewise in macvlan_process_broadcast().

    11) Missing device_del() in mdiobus_register() error paths, from Thomas
    Petazzoni.

    12) Fix checksum handling of short packets in mlx5, from Cong Wang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (96 commits)
    bpf: in __bpf_redirect_no_mac pull mac only if present
    virtio_net: bulk free tx skbs
    net: phy: phy driver features are mandatory
    isdn: avm: Fix string plus integer warning from Clang
    net/mlx5e: Fix cb_ident duplicate in indirect block register
    net/mlx5e: Fix wrong (zero) TX drop counter indication for representor
    net/mlx5e: Fix wrong error code return on FEC query failure
    net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames
    tools: bpftool: Cleanup license mess
    bpf: fix inner map masking to prevent oob under speculation
    bpf: pull in pkt_sched.h header for tooling to fix bpftool build
    selftests: forwarding: Add a test case for externally learned FDB entries
    selftests: mlxsw: Test FDB offload indication
    mlxsw: spectrum_switchdev: Do not treat static FDB entries as sticky
    net: bridge: Mark FDB entries that were added by user as such
    mlxsw: spectrum_fid: Update dummy FID index
    mlxsw: pci: Return error on PCI reset timeout
    mlxsw: pci: Increase PCI SW reset timeout
    mlxsw: pci: Ring CQ's doorbell before RDQ's
    MAINTAINERS: update email addresses of liquidio driver maintainers
    ...

    Linus Torvalds
     

19 Jan, 2019

1 commit

  • During review I noticed that inner meta map setup for map in
    map is buggy in that it does not propagate all needed data
    from the reference map which the verifier is later accessing.

    In particular one such case is index masking to prevent out of
    bounds access under speculative execution due to missing the
    map's unpriv_array/index_mask field propagation. Fix this such
    that the verifier is generating the correct code for inlined
    lookups in case of unprivileged use.
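
    The idea behind index masking can be sketched in plain C. The toy_*
    names are hypothetical and this is not the kernel's array map code; the
    mask is assumed to be the next power of two at or above max_entries,
    minus one, with the backing storage sized to that power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an array map; not the kernel's struct. */
struct toy_array {
    uint32_t max_entries;
    uint32_t index_mask;  /* roundup_pow2(max_entries) - 1 */
    uint64_t *values;     /* must hold roundup_pow2(max_entries) slots */
};

/* Smallest power of two >= n; assumes 1 <= n <= 2^31. */
uint32_t toy_roundup_pow2(uint32_t n)
{
    uint32_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

void toy_array_init(struct toy_array *a, uint64_t *values,
                    uint32_t max_entries)
{
    a->max_entries = max_entries;
    a->index_mask = toy_roundup_pow2(max_entries) - 1;
    a->values = values;
}

uint64_t *toy_array_lookup(const struct toy_array *a, uint32_t index)
{
    if (index >= a->max_entries)
        return NULL;
    /*
     * Architecturally a no-op after the check above. Its purpose is to
     * keep the index inside the allocation even if the branch is
     * mispredicted and the load executes speculatively.
     */
    index &= a->index_mask;
    return &a->values[index];
}
```

    With max_entries equal to 1 the mask is 0, which matches the
    "r0 &= 0" instruction in the dumps of this commit. A C compiler is free
    to drop the redundant AND; the verifier emits it as an explicit BPF
    instruction, so the sketch only shows the arithmetic.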

    Before patch (test_verifier's 'map in map access' dump):

    # bpftool prog dump xla id 3
    0: (62) *(u32 *)(r10 -4) = 0
    1: (bf) r2 = r10
    2: (07) r2 += -4
    3: (18) r1 = map[id:4]
    5: (07) r1 += 272 |
    6: (61) r0 = *(u32 *)(r2 +0) |
    7: (35) if r0 >= 0x1 goto pc+6 | Inlined map in map lookup
    8: (54) (u32) r0 &= (u32) 0 | with index masking for
    9: (67) r0 <<= 3 | map->unpriv_array.
    10: (0f) r0 += r1 |
    11: (79) r0 = *(u64 *)(r0 +0) |
    12: (15) if r0 == 0x0 goto pc+1 |
    13: (05) goto pc+1 |
    14: (b7) r0 = 0 |
    15: (15) if r0 == 0x0 goto pc+11
    16: (62) *(u32 *)(r10 -4) = 0
    17: (bf) r2 = r10
    18: (07) r2 += -4
    19: (bf) r1 = r0
    20: (07) r1 += 272 |
    21: (61) r0 = *(u32 *)(r2 +0) | Index masking missing (!)
    22: (35) if r0 >= 0x1 goto pc+3 | for inner map despite
    23: (67) r0 <<= 3 | map->unpriv_array set.
    24: (0f) r0 += r1 |
    25: (05) goto pc+1 |
    26: (b7) r0 = 0 |
    27: (b7) r0 = 0
    28: (95) exit

    After patch:

    # bpftool prog dump xla id 1
    0: (62) *(u32 *)(r10 -4) = 0
    1: (bf) r2 = r10
    2: (07) r2 += -4
    3: (18) r1 = map[id:2]
    5: (07) r1 += 272 |
    6: (61) r0 = *(u32 *)(r2 +0) |
    7: (35) if r0 >= 0x1 goto pc+6 | Same inlined map in map lookup
    8: (54) (u32) r0 &= (u32) 0 | with index masking due to
    9: (67) r0 <<= 3 | map->unpriv_array.
    10: (0f) r0 += r1 |
    11: (79) r0 = *(u64 *)(r0 +0) |
    12: (15) if r0 == 0x0 goto pc+1 |
    13: (05) goto pc+1 |
    14: (b7) r0 = 0 |
    15: (15) if r0 == 0x0 goto pc+12
    16: (62) *(u32 *)(r10 -4) = 0
    17: (bf) r2 = r10
    18: (07) r2 += -4
    19: (bf) r1 = r0
    20: (07) r1 += 272 |
    21: (61) r0 = *(u32 *)(r2 +0) |
    22: (35) if r0 >= 0x1 goto pc+4 | Now fixed inlined inner map
    23: (54) (u32) r0 &= (u32) 0 | lookup with proper index masking
    24: (67) r0 <<= 3 | map->unpriv_array.
    25: (0f) r0 += r1 |
    26: (05) goto pc+1 |
    27: (b7) r0 = 0 |
    28: (b7) r0 = 0
    29: (95) exit

    Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation")
    Signed-off-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
     

18 Jan, 2019

3 commits

  • The perf_proc_update_handler() handles /proc/sys/kernel/perf_event_max_sample_rate
    sysctl variable. When the PMU IRQ handler timing monitoring is disabled, i.e.,
    when /proc/sys/kernel/perf_cpu_time_max_percent is equal to 0 or 100,
    then no modification to sysctl_perf_event_sample_rate is allowed to prevent
    possible hang from wrong values.

    The problem is that the test to prevent modification is made after the
    sysctl variable is modified in perf_proc_update_handler().

    You get an error:

    $ echo 10001 >/proc/sys/kernel/perf_event_max_sample_rate
    echo: write error: invalid argument

    But the value is still modified causing all sorts of inconsistencies:

    $ cat /proc/sys/kernel/perf_event_max_sample_rate
    10001

    This patch fixes the problem by moving the parsing of the value after
    the test.
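
    The ordering problem can be reproduced in a few lines of user-space C.
    This is a sketch under stated assumptions: the toy_* variables and
    functions are hypothetical stand-ins, not the kernel's sysctl handler.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the two sysctl values. */
int toy_sample_rate = 100000;
int toy_cpu_time_max_percent = 25;  /* 0 or 100 disables the monitor */

/* True when the timing monitor is off and updates must be refused. */
static int toy_updates_forbidden(void)
{
    return toy_cpu_time_max_percent == 0 ||
           toy_cpu_time_max_percent == 100;
}

/* Buggy order: the value is parsed and stored before the test runs. */
int toy_update_rate_buggy(const char *buf)
{
    toy_sample_rate = (int)strtol(buf, NULL, 10);
    if (toy_updates_forbidden())
        return -EINVAL;  /* error reported, but the value already changed */
    return 0;
}

/* Fixed order: test first, parse and store only if allowed. */
int toy_update_rate_fixed(const char *buf)
{
    if (toy_updates_forbidden())
        return -EINVAL;
    toy_sample_rate = (int)strtol(buf, NULL, 10);
    return 0;
}
```

    In the buggy variant a rejected write still changes the stored value,
    which is the inconsistency described above.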

    Committer testing:

    # echo 100 > /proc/sys/kernel/perf_cpu_time_max_percent
    # echo 10001 > /proc/sys/kernel/perf_event_max_sample_rate
    -bash: echo: write error: Invalid argument
    # cat /proc/sys/kernel/perf_event_max_sample_rate
    10001
    #

    Signed-off-by: Stephane Eranian
    Reviewed-by: Andi Kleen
    Reviewed-by: Jiri Olsa
    Tested-by: Arnaldo Carvalho de Melo
    Cc: Kan Liang
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1547169436-6266-1-git-send-email-eranian@google.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Stephane Eranian
     
  • The recent rework of alloc_descs() introduced a double increment of the
    loop counter. As a consequence only every second affinity mask is
    validated.

    Remove it.
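
    The effect of a stray second increment is easy to see in isolation. The
    sketch below is not the alloc_descs() code; the toy_* helpers are
    hypothetical and simply count how many entries each loop visits.

```c
#include <stddef.h>

/* Loop with a stray extra increment: visits indices 0, 2, 4, ... */
size_t toy_visited_buggy(size_t n)
{
    size_t visited = 0;

    for (size_t i = 0; i < n; i++) {
        visited++;  /* stands in for validating entry i */
        i++;        /* the double increment */
    }
    return visited;
}

/* Loop with the extra increment removed: visits every index. */
size_t toy_visited_fixed(size_t n)
{
    size_t visited = 0;

    for (size_t i = 0; i < n; i++)
        visited++;  /* stands in for validating entry i */
    return visited;
}
```

    For eight entries the first loop checks four of them and the second
    checks all eight.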

    [ tglx: Massaged changelog ]

    Fixes: c410abbbacb9 ("genirq/affinity: Add is_managed to struct irq_affinity_desc")
    Signed-off-by: Huacai Chen
    Signed-off-by: Thomas Gleixner
    Cc: Fuxin Zhang
    Cc: Zhangjin Wu
    Cc: Huacai Chen
    Cc: Dou Liyang
    Link: https://lkml.kernel.org/r/1547694009-16261-1-git-send-email-chenhc@lemote.com

    Huacai Chen
     
  • Pull swiotlb fix from Konrad Rzeszutek Wilk:
    "A tiny fix for v5.0-rc2:

    This fixes an issue with GPU cards not working anymore with the DMA
    mapping work Christopher did - as the SWIOTLB is initialized first and
    then free'd (as IOMMU is available) but we forgot to clear our start
    and end entries which are used and BOOM"

    * 'stable/for-linus-5.0' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb:
    swiotlb: clear io_tlb_start and io_tlb_end in swiotlb_exit

    Linus Torvalds
     

17 Jan, 2019

1 commit

  • When printing multiple uprobe arguments as strings the output for the
    earlier arguments would also include all later string arguments.

    This is best explained in an example:

    Consider adding a uprobe to a function receiving two strings as
    parameters which is at offset 0xa0 in strlib.so and we want to print
    both parameters when the uprobe is hit (on x86_64):

    $ echo 'p:func /lib/strlib.so:0xa0 +0(%di):string +0(%si):string' > \
    /sys/kernel/debug/tracing/uprobe_events

    When the function is called as func("foo", "bar") and we hit the probe,
    the trace file shows a line like the following:

    [...] func: (0x7f7e683706a0) arg1="foobar" arg2="bar"

    Note the extra "bar" printed as part of arg1. This behaviour stacks up
    for additional string arguments.

    The strings are stored in a dynamically growing part of the uprobe
    buffer by fetch_store_string() after copying them from userspace via
    strncpy_from_user(). The return value of strncpy_from_user() is then
    directly used as the required size for the string. However, this does
    not take the terminating null byte into account as the documentation
    for strncpy_from_user() clearly states that it "[...] returns the
    length of the string (not including the trailing NUL)" even though the
    null byte will be copied to the destination.

    Therefore, subsequent calls to fetch_store_string() will overwrite
    the terminating null byte of the most recently fetched string with
    the first character of the current string, leading to the
    "accumulation" of strings in earlier arguments in the output.

    Fix this by incrementing the return value of strncpy_from_user() by
    one if we did not hit the maximum buffer size.
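
    The accumulation effect can be reproduced in user space. In this
    sketch toy_copy_str() is a hypothetical stand-in that mimics the
    documented return convention of strncpy_from_user(), and
    toy_store_string() is an illustrative packer, not the kernel's
    fetch_store_string().

```c
#include <stddef.h>

/*
 * Copies src into dst, including the NUL if it fits within maxlen bytes.
 * Returns the string length excluding the NUL, or maxlen if no NUL was
 * found within maxlen bytes.
 */
long toy_copy_str(char *dst, const char *src, long maxlen)
{
    long i;

    for (i = 0; i < maxlen; i++) {
        dst[i] = src[i];
        if (src[i] == '\0')
            return i;
    }
    return maxlen;
}

/*
 * Appends src at buf + *off and advances *off by the bytes consumed.
 * With account_nul == 0 the NUL is not counted (the bug); with
 * account_nul != 0 it is counted unless the buffer limit was hit.
 */
long toy_store_string(char *buf, long size, long *off, const char *src,
                      int account_nul)
{
    long maxlen = size - *off;
    long ret = toy_copy_str(buf + *off, src, maxlen);

    if (account_nul && ret < maxlen)
        ret++;  /* include the terminating NUL byte */
    *off += ret;
    return ret;
}
```

    Storing "foo" and then "bar" without counting the NUL leaves "foobar"
    at the start of the buffer; counting it keeps the two strings separate.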

    Link: http://lkml.kernel.org/r/20190116141629.5752-1-andreas.ziegler@fau.de

    Cc: Ingo Molnar
    Cc: stable@vger.kernel.org
    Fixes: 5baaa59ef09e ("tracing/probes: Implement 'memory' fetch method for uprobes")
    Acked-by: Masami Hiramatsu
    Signed-off-by: Andreas Ziegler
    Signed-off-by: Steven Rostedt (VMware)

    Andreas Ziegler