18 Apr, 2019

2 commits

  • Commit 0a0e0829f990 ("nohz: Fix missing tick reprogram when interrupting an
    inline softirq") got backported to stable trees and now causes the NOHZ
    softirq pending warning to trigger. It's not an upstream issue as the NOHZ
    update logic has been changed there.

    The problem is when a softirq disabled section gets interrupted and on
    return from interrupt the tick/nohz state is evaluated, which then can
    observe pending soft interrupts. These soft interrupts are legitimately
    pending because they cannot be processed as long as soft interrupts are
    disabled and the interrupted code will correctly process them when soft
    interrupts are reenabled.

    Add a check for softirqs disabled to the pending check to prevent the
    warning.
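
    A minimal kernel-style sketch of that idea follows; the helper name and
    exact placement are assumptions for illustration, not the literal
    backported hunk:

    static bool nohz_softirq_pending_warn_sketch(void)
    {
            if (!local_softirq_pending())
                    return false;
            /*
             * Inside a softirq-disabled (local_bh_disable()) section the
             * pending bits are expected; they will be processed when softirqs
             * are re-enabled, so do not warn about them.
             */
            if (softirq_count())
                    return false;
            return true;    /* genuinely unexpected pending softirqs */
    }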

    Reported-by: Grygorii Strashko
    Reported-by: John Crispin
    Signed-off-by: Thomas Gleixner
    Tested-by: Grygorii Strashko
    Tested-by: John Crispin
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Anna-Maria Gleixner
    Cc: Greg Kroah-Hartman
    Cc: stable@vger.kernel.org
    Acked-by: Frederic Weisbecker
    Tested-by: Geert Uytterhoeven

    Signed-off-by: Leonard Crestez
    Acked-by: Jason Liu
    Signed-off-by: Arulpandiyan Vadivel

    Thomas Gleixner
     
  • These macros can be reused by governors which don't use the common
    governor code present in cpufreq_governor.c and should be moved to the
    relevant header.

    Now that they are getting moved to the right header file, reuse them in
    schedutil governor as well (that required rename of show/store
    routines).

    Also create a gov_attr_wo() macro for write-only sysfs files; this will be
    used by the Interactive governor in a later patch.
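
    For illustration, a hedged sketch of what such attribute macros typically
    look like (the exact definitions live in the cpufreq governor headers and
    may differ):

    #define gov_attr_ro(_name)                                  \
    static struct governor_attr _name =                         \
    __ATTR(_name, 0444, show_##_name, NULL)

    #define gov_attr_wo(_name)                                  \
    static struct governor_attr _name =                         \
    __ATTR(_name, 0200, NULL, store_##_name)

    #define gov_attr_rw(_name)                                  \
    static struct governor_attr _name =                         \
    __ATTR(_name, 0644, show_##_name, store_##_name)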

    Signed-off-by: Viresh Kumar
    (Vipul: Fixed merge conflicts)
    Signed-off-by: Vipul Kumar

    Viresh Kumar
     

17 Apr, 2019

4 commits

  • commit 0e9f02450da07fc7b1346c8c32c771555173e397 upstream.

    A NULL pointer dereference bug was reported on a distribution kernel but
    the same issue should be present on the mainline kernel. It occurred on
    s390 but should not be arch-specific. A partial oops looks like:

    Unable to handle kernel pointer dereference in virtual kernel address space
    ...
    Call Trace:
    ...
    try_to_wake_up+0xfc/0x450
    vhost_poll_wakeup+0x3a/0x50 [vhost]
    __wake_up_common+0xbc/0x178
    __wake_up_common_lock+0x9e/0x160
    __wake_up_sync_key+0x4e/0x60
    sock_def_readable+0x5e/0x98

    The bug hits any time between 1 hour and 3 days. The dereference occurs
    in update_cfs_rq_h_load when accumulating h_load. The problem is that
    cfs_rq->h_load_next is not protected by any locking and can be updated
    by parallel calls to task_h_load. Depending on the compiler, code may be
    generated that re-reads cfs_rq->h_load_next after the check for NULL and
    then oopses when reading se->avg.load_avg. The disassembly showed that it
    was possible to re-read h_load_next after the check for NULL.

    While this does not appear to be an issue for later compilers, it's still
    an accident if the correct code is generated. Full locking in this path
    would have high overhead so this patch uses READ_ONCE to read h_load_next
    only once and check for NULL before dereferencing. It was confirmed that
    there were no further oops after 10 days of testing.

    As Peter pointed out, it is also necessary to use WRITE_ONCE() to avoid any
    potential problems with store tearing.
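
    A minimal userspace sketch of the pattern (names are illustrative, not
    the scheduler code): loading the pointer once through a READ_ONCE()-style
    macro removes the window in which the compiler could reload it after the
    NULL check.

    #include <stdio.h>

    #define READ_ONCE(x)        (*(volatile __typeof__(x) *)&(x))
    #define WRITE_ONCE(x, val)  (*(volatile __typeof__(x) *)&(x) = (val))

    struct entity { long load_avg; };

    static struct entity *h_load_next;  /* updated concurrently in the real bug */

    static long accumulate_sketch(void)
    {
            struct entity *se = READ_ONCE(h_load_next); /* single load, as in the fix */

            return se ? se->load_avg : 0;  /* safe even if h_load_next changes now */
    }

    int main(void)
    {
            struct entity e = { .load_avg = 42 };

            WRITE_ONCE(h_load_next, &e);   /* paired WRITE_ONCE() avoids store tearing */
            printf("%ld\n", accumulate_sketch());
            return 0;
    }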

    Signed-off-by: Mel Gorman
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc:
    Fixes: 685207963be9 ("sched: Move h_load calculation to task_h_load()")
    Link: https://lkml.kernel.org/r/20190319123610.nsivgf3mjbjjesxb@techsingularity.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit e8458e7afa855317b14915d7b86ab3caceea7eb6 upstream.

    When CONFIG_SPARSE_IRQ is disabled, the request_mutex in struct irq_desc
    is not initialized, which causes a malfunction.

    Fixes: 9114014cf4e6 ("genirq: Add mutex to irq desc to serialize request/free_irq()")
    Signed-off-by: Kefeng Wang
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Mukesh Ojha
    Cc: Marc Zyngier
    Cc:
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190404074512.145533-1-wangkefeng.wang@huawei.com
    Signed-off-by: Greg Kroah-Hartman

    Kefeng Wang
     
  • commit 325aa19598e410672175ed50982f902d4e3f31c5 upstream.

    If a child irqchip calls irq_chip_set_wake_parent() but its parent irqchip
    has the IRQCHIP_SKIP_SET_WAKE flag set an error is returned.

    This is inconsistent behaviour vs. set_irq_wake_real() which returns 0 when
    the irqchip has the IRQCHIP_SKIP_SET_WAKE flag set. It doesn't attempt to
    walk the chain of parents and set irq wake on any chips that don't have the
    flag set either. If the intent is to call the .irq_set_wake() callback of
    the parent irqchip, then we expect irqchip implementations to omit the
    IRQCHIP_SKIP_SET_WAKE flag and implement an .irq_set_wake() function that
    calls irq_chip_set_wake_parent().

    The problem has been observed on a Qualcomm sdm845 device where set wake
    fails on any GPIO interrupts after applying work in progress wakeup irq
    patches to the GPIO driver. The chain of chips looks like this:

    QCOM GPIO -> QCOM PDC (SKIP) -> ARM GIC (SKIP)

    The GPIO controller's parent is the QCOM PDC irqchip, which in turn has the
    ARM GIC as its parent. The QCOM PDC irqchip has the IRQCHIP_SKIP_SET_WAKE
    flag set, and so does the grandparent ARM GIC.

    The GPIO driver doesn't know if the parent needs to set wake or not, so it
    unconditionally calls irq_chip_set_wake_parent() causing this function to
    return a failure because the parent irqchip (PDC) doesn't have the
    .irq_set_wake() callback set. Returning 0 instead makes everything work and
    irqs from the GPIO controller can be configured for wakeup.

    Make it consistent by returning 0 (success) from irq_chip_set_wake_parent()
    when a parent chip has IRQCHIP_SKIP_SET_WAKE set.
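
    A hedged sketch of the corrected behaviour (simplified, not the literal
    upstream hunk):

    int irq_chip_set_wake_parent_sketch(struct irq_data *data, unsigned int on)
    {
            data = data->parent_data;

            /* Parent opted out of set_wake handling: report success. */
            if (data->chip->flags & IRQCHIP_SKIP_SET_WAKE)
                    return 0;

            if (data->chip->irq_set_wake)
                    return data->chip->irq_set_wake(data, on);

            return -ENOSYS;
    }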

    [ tglx: Massaged changelog ]

    Fixes: 08b55e2a9208e ("genirq: Add irqchip_set_wake_parent")
    Signed-off-by: Stephen Boyd
    Signed-off-by: Thomas Gleixner
    Acked-by: Marc Zyngier
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-gpio@vger.kernel.org
    Cc: Lina Iyer
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190325181026.247796-1-swboyd@chromium.org
    Signed-off-by: Greg Kroah-Hartman

    Stephen Boyd
     
  • commit 07d7e12091f4ab869cc6a4bb276399057e73b0b3 upstream.

    To calculate a remaining time, it's required to subtract the current time
    from the expiration time. In alarm_timer_remaining() the arguments of
    ktime_sub are swapped.
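
    In other words, the remaining time must be computed as "expires minus
    now". A hedged sketch of the corrected expression (the surrounding
    function signature is simplified):

    static ktime_t alarm_timer_remaining_sketch(struct alarm *alarm, ktime_t now)
    {
            return ktime_sub(alarm->node.expires, now);  /* was ktime_sub(now, expires) */
    }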

    Fixes: d653d8457c76 ("alarmtimer: Implement remaining callback")
    Signed-off-by: Andrei Vagin
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Mukesh Ojha
    Cc: Stephen Boyd
    Cc: John Stultz
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190408041542.26338-1-avagin@gmail.com
    Signed-off-by: Greg Kroah-Hartman

    Andrei Vagin
     

06 Apr, 2019

11 commits

  • [ Upstream commit ce48c457b95316b9a01b5aa9d4456ce820df94b4 ]

    Since we've had:

    commit cb538267ea1e ("jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations")

    we've been getting some lockdep warnings during init, such as on HiKey960:

    [ 0.820495] WARNING: CPU: 4 PID: 0 at kernel/cpu.c:316 lockdep_assert_cpus_held+0x3c/0x48
    [ 0.820498] Modules linked in:
    [ 0.820509] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G S 4.20.0-rc5-00051-g4cae42a #34
    [ 0.820511] Hardware name: HiKey960 (DT)
    [ 0.820516] pstate: 600001c5 (nZCv dAIF -PAN -UAO)
    [ 0.820520] pc : lockdep_assert_cpus_held+0x3c/0x48
    [ 0.820523] lr : lockdep_assert_cpus_held+0x38/0x48
    [ 0.820526] sp : ffff00000a9cbe50
    [ 0.820528] x29: ffff00000a9cbe50 x28: 0000000000000000
    [ 0.820533] x27: 00008000b69e5000 x26: ffff8000bff4cfe0
    [ 0.820537] x25: ffff000008ba69e0 x24: 0000000000000001
    [ 0.820541] x23: ffff000008fce000 x22: ffff000008ba70c8
    [ 0.820545] x21: 0000000000000001 x20: 0000000000000003
    [ 0.820548] x19: ffff00000a35d628 x18: ffffffffffffffff
    [ 0.820552] x17: 0000000000000000 x16: 0000000000000000
    [ 0.820556] x15: ffff00000958f848 x14: 455f3052464d4d34
    [ 0.820559] x13: 00000000769dde98 x12: ffff8000bf3f65a8
    [ 0.820564] x11: 0000000000000000 x10: ffff00000958f848
    [ 0.820567] x9 : ffff000009592000 x8 : ffff00000958f848
    [ 0.820571] x7 : ffff00000818ffa0 x6 : 0000000000000000
    [ 0.820574] x5 : 0000000000000000 x4 : 0000000000000001
    [ 0.820578] x3 : 0000000000000000 x2 : 0000000000000001
    [ 0.820582] x1 : 00000000ffffffff x0 : 0000000000000000
    [ 0.820587] Call trace:
    [ 0.820591] lockdep_assert_cpus_held+0x3c/0x48
    [ 0.820598] static_key_enable_cpuslocked+0x28/0xd0
    [ 0.820606] arch_timer_check_ool_workaround+0xe8/0x228
    [ 0.820610] arch_timer_starting_cpu+0xe4/0x2d8
    [ 0.820615] cpuhp_invoke_callback+0xe8/0xd08
    [ 0.820619] notify_cpu_starting+0x80/0xb8
    [ 0.820625] secondary_start_kernel+0x118/0x1d0

    We've also had a similar warning in sched_init_smp() for every
    asymmetric system that would enable the sched_asym_cpucapacity static
    key, although that was singled out in:

    commit 40fa3780bac2 ("sched/core: Take the hotplug lock in sched_init_smp()")

    Those warnings are actually harmless, since we cannot have hotplug
    operations at the time they appear. Instead of starting to sprinkle
    useless hotplug lock operations in the init codepaths, mute the
    warnings until they start warning about real problems.

    Suggested-by: Peter Zijlstra
    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: cai@gmx.us
    Cc: daniel.lezcano@linaro.org
    Cc: dietmar.eggemann@arm.com
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: longman@redhat.com
    Cc: marc.zyngier@arm.com
    Cc: mark.rutland@arm.com
    Link: https://lkml.kernel.org/r/1545243796-23224-2-git-send-email-valentin.schneider@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Valentin Schneider
     
  • [ Upstream commit 51bee5abeab2058ea5813c5615d6197a23dbf041 ]

    The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which
    needs pids_free() to uncharge the pid.

    However, ->free() is called from __put_task_struct()->cgroup_free() and this
    is too late. Even the trivial program which does

    for (;;) {
            int pid = fork();
            assert(pid >= 0);
            if (pid)
                    wait(NULL);
            else
                    exit(0);
    }

    can run out of limits because release_task()->call_rcu(delayed_put_task_struct)
    implies an RCU gp after the task/pid goes away and before the final put().

    Test-case:

    mkdir -p /tmp/CG
    mount -t cgroup2 none /tmp/CG
    echo '+pids' > /tmp/CG/cgroup.subtree_control

    mkdir /tmp/CG/PID
    echo 2 > /tmp/CG/PID/pids.max

    perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
    echo $! > /tmp/CG/PID/cgroup.procs

    Without this patch the forking process fails soon after migration.

    Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite
    into the new helper, cgroup_release(), called by release_task() which actually
    frees the pid(s).

    Reported-by: Herton R. Krzesinski
    Reported-by: Jan Stancek
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Tejun Heo
    Signed-off-by: Sasha Levin

    Oleg Nesterov
     
  • [ Upstream commit c546951d9c9300065bad253ecdf1ac59ce9d06c8 ]

    move_queued_task() synchronizes with task_rq_lock() as follows:

    move_queued_task()              task_rq_lock()

    [S] ->on_rq = MIGRATING         [L] rq = task_rq()
    WMB (__set_task_cpu())          ACQUIRE (rq->lock);
    [S] ->cpu = new_cpu             [L] ->on_rq

    where "[L] rq = task_rq()" is ordered before "ACQUIRE (rq->lock)" by an
    address dependency and, in turn, "ACQUIRE (rq->lock)" is ordered before
    "[L] ->on_rq" by the ACQUIRE itself.

    Use READ_ONCE() to load ->cpu in task_rq() (c.f., task_cpu()) to honor
    this address dependency. Also, mark the accesses to ->cpu and ->on_rq
    with READ_ONCE()/WRITE_ONCE() to comply with the LKMM.

    Signed-off-by: Andrea Parri
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alan Stern
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/20190121155240.27173-1-andrea.parri@amarulasolutions.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Andrea Parri
     
  • [ Upstream commit 1ca4fa3ab604734e38e2a3000c9abf788512ffa7 ]

    register_sched_domain_sysctl() copies the cpu_possible_mask into
    sd_sysctl_cpus, but only if sd_sysctl_cpus hasn't already been
    allocated (ie, CONFIG_CPUMASK_OFFSTACK is set). However, when
    CONFIG_CPUMASK_OFFSTACK is not set, sd_sysctl_cpus is left
    uninitialized (all zeroes) and the kernel may fail to initialize
    sched_domain sysctl entries for all possible CPUs.

    This is visible to the user if the kernel is booted with maxcpus=n, or
    if ACPI tables have been modified to leave CPUs offline, and then
    checking for missing /proc/sys/kernel/sched_domain/cpu* entries.

    Fix this by separating the allocation and initialization, and adding a
    flag so that the possible CPU entries are initialized only while the
    system is booting.

    Tested-by: Syuuichirou Ishii
    Tested-by: Tarumizu, Kohei
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Masayoshi Mizuma
    Acked-by: Joe Lawrence
    Cc: Linus Torvalds
    Cc: Masayoshi Mizuma
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190129151245.5073-1-msys.mizuma@gmail.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Hidetoshi Seto
     
  • [ Upstream commit 840018668ce2d96783356204ff282d6c9b0e5f66 ]

    When pmu::setup_aux() is called the coresight PMU needs to know which
    sink to use for the session by looking up the information in the
    event's attr::config2 field.

    As such simply replace the cpu information by the complete perf_event
    structure and change all affected customers.

    Signed-off-by: Mathieu Poirier
    Reviewed-by: Suzuki Poulouse
    Acked-by: Peter Zijlstra
    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Greg Kroah-Hartman
    Cc: H. Peter Anvin
    Cc: Heiko Carstens
    Cc: Jiri Olsa
    Cc: Mark Rutland
    Cc: Martin Schwidefsky
    Cc: Namhyung Kim
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-s390@vger.kernel.org
    Link: http://lkml.kernel.org/r/20190131184714.20388-2-mathieu.poirier@linaro.org
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: Sasha Levin

    Mathieu Poirier
     
  • [ Upstream commit 1136b0728969901a091f0471968b2b76ed14d9ad ]

    Waiman reported that on large systems with a large number of interrupts the
    readout of /proc/stat takes a long time to sum up the interrupt
    statistics. In principle this is not a problem, but for unknown reasons
    some enterprise-quality software reads /proc/stat at a high frequency.

    The reason for this is that interrupt statistics are accounted per cpu. So
    the /proc/stat logic has to sum up the interrupt stats for each interrupt.

    This can be largely avoided for interrupts which are not marked as
    'PER_CPU' interrupts by simply adding a per interrupt summation counter
    which is incremented along with the per interrupt per cpu counter.

    The PER_CPU interrupts need to avoid that and use only per cpu accounting
    because they share the interrupt number and the interrupt descriptor and
    concurrent updates would conflict or require unwanted synchronization.
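
    A hedged sketch of the accounting idea; the counter's name and the exact
    placement are assumptions, not the literal patch:

    static inline void kstat_incr_irq_sketch(struct irq_desc *desc)
    {
            __this_cpu_inc(*desc->kstat_irqs);       /* per-cpu count, as before */

            /*
             * New summation counter, skipped for true per-CPU interrupts,
             * which keep per-cpu-only accounting to avoid contended updates.
             */
            if (!irq_settings_is_per_cpu_devid(desc))
                    atomic_inc(&desc->tot_count);
    }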

    Reported-by: Waiman Long
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Waiman Long
    Reviewed-by: Marc Zyngier
    Reviewed-by: Davidlohr Bueso
    Cc: Matthew Wilcox
    Cc: Andrew Morton
    Cc: Alexey Dobriyan
    Cc: Kees Cook
    Cc: linux-fsdevel@vger.kernel.org
    Cc: Davidlohr Bueso
    Cc: Miklos Szeredi
    Cc: Daniel Colascione
    Cc: Dave Chinner
    Cc: Randy Dunlap
    Link: https://lkml.kernel.org/r/20190208135020.925487496@linutronix.de

    Thomas Gleixner
     
  • [ Upstream commit 99687cdbb3f6c8e32bcc7f37496e811f30460e48 ]

    The percpu members of struct sd_data and s_data are declared as:

    struct ... ** __percpu member;

    So their type is:

    __percpu pointer to pointer to struct ...

    But looking at how they're used, their type should be:

    pointer to __percpu pointer to struct ...

    and they should thus be declared as:

    struct ... * __percpu *member;

    So fix the placement of '__percpu' in the definition of these
    structures.

    This addresses a bunch of Sparse's warnings like:

    warning: incorrect type in initializer (different address spaces)
    expected void const [noderef] *__vpp_verify
    got struct sched_domain **

    Signed-off-by: Luc Van Oostenryck
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190118144936.79158-1-luc.vanoostenryck@gmail.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Luc Van Oostenryck
     
  • [ Upstream commit a39f15b9644fac3f950f522c39e667c3af25c588 ]

    Since kprobe itself depends on RCU, probing on RCU debug
    routine can cause recursive breakpoint bugs.

    Prohibit probing on RCU debug routines.

    int3
      ->do_int3()
        ->ist_enter()
          ->RCU_LOCKDEP_WARN()
            ->debug_lockdep_rcu_enabled() -> int3

    Signed-off-by: Masami Hiramatsu
    Cc: Alexander Shishkin
    Cc: Andrea Righi
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mathieu Desnoyers
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/154998807741.31052.11229157537816341591.stgit@devbox
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Masami Hiramatsu
     
  • [ Upstream commit b4ff1b44bcd384d22fcbac6ebaf9cc0d33debe50 ]

    cgroup_rstat_cpu_pop_updated() is used to traverse the updated cgroups
    on flush. While it was only visiting updated ones in the subtree, it
    was visiting @root unconditionally. We can easily check whether @root
    is updated or not by looking at its ->updated_next just as with the
    cgroups in the subtree.

    * Remove the unnecessary cgroup_parent() test. The system root cgroup
    is never updated and thus its ->updated_next is always NULL. No
    need to test whether cgroup_parent() exists in addition to
    ->updated_next.

    * Terminate traverse if ->updated_next is NULL. This can only happen
    for subtree @root and there's no reason to visit it if it's not
    marked updated.

    This reduces cpu consumption when reading a lot of rstat backed files.
    In a micro benchmark reading stat from ~1600 cgroups, the sys time was
    lowered by >40%.

    Signed-off-by: Tejun Heo
    Signed-off-by: Sasha Levin

    Tejun Heo
     
  • [ Upstream commit 32a5ad9c22852e6bd9e74bdec5934ef9d1480bc5 ]

    Currently, when writing

    echo 18446744073709551616 > /proc/sys/fs/file-max

    /proc/sys/fs/file-max will overflow and be set to 0. That quickly
    crashes the system.

    This commit sets the max and min value for file-max. The max value is
    set to long int. Any higher value cannot currently be used as the
    percpu counters are long ints and not unsigned integers.

    Note that the file-max value is ultimately parsed via
    __do_proc_doulongvec_minmax(). This function does not report an error when
    min or max are exceeded, which means that if a value larger than long int
    is written, userspace will not receive an error; instead the old value
    will be kept. There is an argument to be made that this should be changed and
    __do_proc_doulongvec_minmax() should return an error when a dedicated min
    or max value are exceeded. However this has the potential to break
    userspace so let's defer this to an RFC patch.
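
    A hedged sketch of the bounded sysctl table entry described above (the
    helper variable names are illustrative):

    static unsigned long zero_ul;                  /* lower bound: 0 */
    static unsigned long long_max = LONG_MAX;      /* upper bound: long int max */

    /* fs.file-max entry, clamped so an overflowing write cannot wrap to 0. */
    {
            .procname     = "file-max",
            .data         = &files_stat.max_files,
            .maxlen       = sizeof(files_stat.max_files),
            .mode         = 0644,
            .proc_handler = proc_doulongvec_minmax,
            .extra1       = &zero_ul,
            .extra2       = &long_max,
    },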

    Link: http://lkml.kernel.org/r/20190107222700.15954-3-christian@brauner.io
    Signed-off-by: Christian Brauner
    Acked-by: Kees Cook
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Dominik Brodowski
    Cc: "Eric W. Biederman"
    Cc: Joe Lawrence
    Cc: Luis Chamberlain
    Cc: Waiman Long
    [christian@brauner.io: v4]
    Link: http://lkml.kernel.org/r/20190210203943.8227-3-christian@brauner.io
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Christian Brauner
     
  • [ Upstream commit 31b265b3baaf55f209229888b7ffea523ddab366 ]

    As reported back in 2016-11 [1], the "ftdump" kdb command triggers a
    BUG for "sleeping function called from invalid context".

    kdb's "ftdump" command wants to call ring_buffer_read_prepare() in
    atomic context. A very simple solution for this is to add allocation
    flags to ring_buffer_read_prepare() so kdb can call it without
    triggering the allocation error. This patch does that.

    Note that in the original email thread about this, it was suggested
    that perhaps the solution for kdb was to either preallocate the buffer
    ahead of time or create our own iterator. I'm hoping that this
    alternative of adding allocation flags to ring_buffer_read_prepare()
    can be considered since it means I don't need to duplicate more of the
    core trace code into "trace_kdb.c" (for either creating my own
    iterator or re-preparing a ring allocator whose memory was already
    allocated).

    NOTE: another option for kdb is to actually figure out how to make it
    reuse the existing ftrace_dump() function and totally eliminate the
    duplication. This sounds very appealing and actually works (the "sr
    z" command can be seen to properly dump the ftrace buffer). The
    downside here is that ftrace_dump() fully consumes the trace buffer.
    Unless that is changed I'd rather not use it because it means "ftdump
    | grep xyz" won't be very useful to search the ftrace buffer since it
    will throw away the whole trace on the first grep. A future patch to
    dump only the last few lines of the buffer will also be hard to
    implement.

    [1] https://lkml.kernel.org/r/20161117191605.GA21459@google.com
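
    As a hedged illustration of the resulting interface, kdb's dump path can
    then pass GFP_ATOMIC through the new allocation-flags parameter (call
    site shown as a sketch, not the exact diff):

    iter.buffer_iter[cpu] =
            ring_buffer_read_prepare(iter.trace_buffer->buffer, cpu, GFP_ATOMIC);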

    Link: http://lkml.kernel.org/r/20190308193205.213659-1-dianders@chromium.org

    Reported-by: Brian Norris
    Signed-off-by: Douglas Anderson
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Sasha Levin

    Douglas Anderson
     

03 Apr, 2019

3 commits

  • commit 0803278b0b4d8eeb2b461fb698785df65a725d9e upstream.

    Syzkaller hit 'KASAN: use-after-free Write in sanitize_ptr_alu' bug.

    Call trace:

    dump_stack+0xbf/0x12e
    print_address_description+0x6a/0x280
    kasan_report+0x237/0x360
    sanitize_ptr_alu+0x85a/0x8d0
    adjust_ptr_min_max_vals+0x8f2/0x1ca0
    adjust_reg_min_max_vals+0x8ed/0x22e0
    do_check+0x1ca6/0x5d00
    bpf_check+0x9ca/0x2570
    bpf_prog_load+0xc91/0x1030
    __se_sys_bpf+0x61e/0x1f00
    do_syscall_64+0xc8/0x550
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Fault injection trace:

     kfree+0xea/0x290
     free_func_state+0x4a/0x60
     free_verifier_state+0x61/0xe0
     push_stack+0x216/0x2f0
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Xu Yu
     
  • commit 206b92353c839c0b27a0b9bec24195f93fd6cf7a upstream.

    Tianyu reported a crash in a CPU hotplug teardown callback when booting a
    kernel which has CONFIG_HOTPLUG_CPU disabled with the 'nosmt' boot
    parameter.

    It turns out that the SMP=y CONFIG_HOTPLUG_CPU=n case has been broken
    forever in case that a bringup callback fails. Unfortunately this issue was
    not recognized when the CPU hotplug code was reworked, so the shortcoming
    just stayed in place.

    When a bringup callback fails, the CPU hotplug code rolls back the
    operation and takes the CPU offline.

    The 'nosmt' command line argument uses a bringup failure to abort the
    bringup of SMT sibling CPUs. This partial bringup is required due to the
    MCE misdesign on Intel CPUs.

    With CONFIG_HOTPLUG_CPU=y the rollback works perfectly fine, but
    CONFIG_HOTPLUG_CPU=n lacks essential mechanisms to exercise the low level
    teardown of a CPU including the synchronizations in various facilities like
    RCU, NOHZ and others.

    As a consequence the teardown callbacks which must be executed on the
    outgoing CPU within stop machine with interrupts disabled are executed on
    the control CPU in interrupt enabled and preemptible context causing the
    kernel to crash and burn. The pre state machine code has a different
    failure mode which is more subtle and resulting in a less obvious use after
    free crash because the control side frees resources which are still in use
    by the undead CPU.

    But this is not an x86-only problem. Any architecture which supports the
    SMP=y HOTPLUG_CPU=n combination suffers from the same issue. It's just less
    likely to be triggered because in 99.99999% of the cases all bringup
    callbacks succeed.

    The easy solution of making HOTPLUG_CPU mandatory for SMP is not working on
    all architectures as the following architectures have either no hotplug
    support at all or not all subarchitectures support it:

    alpha, arc, hexagon, openrisc, riscv, sparc (32bit), mips (partial).

    Crashing the kernel in such a situation is not an acceptable state
    either.

    Implement a minimal rollback variant by limiting the teardown to the point
    where all regular teardown callbacks have been invoked and leave the CPU in
    the 'dead' idle state. This has the following consequences:

    - the CPU is brought down to the point where the stop_machine takedown
    would happen.

    - the CPU stays there forever and is idle

    - The CPU is cleared in the CPU active mask, but not in the CPU online
    mask which is a legit state.

    - Interrupts are not forced away from the CPU

    - All facilities which only look at online mask would still see it, but
    that is the case during normal hotplug/unplug operations as well. It's
    just a (way) longer time frame.

    This will expose issues which haven't been exposed before, or only seldom,
    because the normally transient state of being non-active but online is now
    a permanent state. In testing this already exposed an issue with workqueues
    where the vmstat code schedules work on the almost-dead CPU, which ends up
    in an unbound workqueue and triggers 'preemptible context' warnings. This is
    not a problem with this change; it merely exposes an already existing issue.
    Still this is better than crashing fully without a chance to debug it.

    This is mainly thought as workaround for those architectures which do not
    support HOTPLUG_CPU. All others should enforce HOTPLUG_CPU for SMP.

    Fixes: 2e1a3483ce74 ("cpu/hotplug: Split out the state walk into functions")
    Reported-by: Tianyu Lan
    Signed-off-by: Thomas Gleixner
    Tested-by: Tianyu Lan
    Acked-by: Greg Kroah-Hartman
    Cc: Konrad Wilk
    Cc: Josh Poimboeuf
    Cc: Mukesh Ojha
    Cc: Peter Zijlstra
    Cc: Jiri Kosina
    Cc: Rik van Riel
    Cc: Andy Lutomirski
    Cc: Micheal Kelley
    Cc: "K. Y. Srinivasan"
    Cc: Linus Torvalds
    Cc: Borislav Petkov
    Cc: K. Y. Srinivasan
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190326163811.503390616@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 7dd47617114921fdd8c095509e5e7b4373cc44a1 upstream.

    The rework of the watchdog core to use cpu_stop_work broke the watchdog
    cpumask on CPU hotplug.

    The watchdog_enable/disable() functions are now called unconditionally from
    the hotplug callback, i.e. even on CPUs which are not in the watchdog
    cpumask. As a consequence the watchdog can become unstoppable.

    Only invoke them when the plugged CPU is in the watchdog cpumask.
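
    A hedged sketch of the guard in the CPU-online hotplug callback
    (simplified):

    int lockup_detector_online_cpu_sketch(unsigned int cpu)
    {
            /* Only start the softlockup watchdog on CPUs in the watchdog cpumask. */
            if (cpumask_test_cpu(cpu, &watchdog_allowed_mask))
                    watchdog_enable(cpu);
            return 0;
    }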

    Fixes: 9cf57731b63e ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work")
    Reported-by: Maxime Coquelin
    Signed-off-by: Thomas Gleixner
    Tested-by: Maxime Coquelin
    Cc: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Don Zickus
    Cc: Ricardo Neri
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1903262245490.1789@nanos.tec.linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

27 Mar, 2019

2 commits

  • commit 71492580571467fb7177aade19c18ce7486267f5 upstream.

    Tetsuo Handa had reported he saw an incorrect "downgrading a read lock"
    warning right after a previous lockdep warning. It is likely that the
    previous warning turned off lock debugging, leaving lockdep in an
    inconsistent state and leading to the lock downgrade warning.

    Fix that by adding a check for debug_locks at the beginning of
    __lock_downgrade().
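
    A hedged sketch of the added guard (simplified):

    static void lock_downgrade_sketch(struct lockdep_map *lock, unsigned long ip)
    {
            /*
             * An earlier splat disabled lock debugging; lockdep state may be
             * inconsistent, so skip the downgrade checks entirely.
             */
            if (unlikely(!debug_locks))
                    return;

            /* ... regular __lock_downgrade() processing ... */
    }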

    Debugged-by: Tetsuo Handa
    Reported-by: Tetsuo Handa
    Reported-by: syzbot+53383ae265fb161ef488@syzkaller.appspotmail.com
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/1547093005-26085-1-git-send-email-longman@redhat.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Waiman Long
     
  • commit 5a07168d8d89b00fe1760120714378175b3ef992 upstream.

    The futex code requires that the user space addresses of futexes are 32bit
    aligned. sys_futex() checks this in futex_get_keys() but the robust list
    code has no alignment check in place.

    As a consequence the kernel crashes on architectures with strict alignment
    requirements in handle_futex_death() when trying to cmpxchg() on an
    unaligned futex address which was retrieved from the robust list.

    [ tglx: Rewrote changelog, proper sizeof() based alignment check and add
    comment ]
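
    A hedged sketch of the sizeof()-based alignment check mentioned in the
    note above (simplified):

    static int handle_futex_death_sketch(u32 __user *uaddr)
    {
            /*
             * Robust-list entries must be naturally aligned for a 32-bit
             * futex before the kernel attempts a cmpxchg on the user address.
             */
            if (((unsigned long)uaddr % sizeof(*uaddr)) != 0)
                    return -1;

            /* ... existing cmpxchg-based death handling ... */
            return 0;
    }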

    Fixes: 0771dfefc9e5 ("[PATCH] lightweight robust futexes: core")
    Signed-off-by: Chen Jie
    Signed-off-by: Thomas Gleixner
    Cc:
    Cc:
    Cc:
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/1552621478-119787-1-git-send-email-chenjie6@huawei.com
    Signed-off-by: Greg Kroah-Hartman

    Chen Jie
     

24 Mar, 2019

8 commits

  • commit 1d1f898df6586c5ea9aeaf349f13089c6fa37903 upstream.

    The rcu_gp_kthread_wake() function is invoked when it might be necessary
    to wake the RCU grace-period kthread. Because self-wakeups are normally
    a useless waste of CPU cycles, if rcu_gp_kthread_wake() is invoked from
    this kthread, it naturally refuses to do the wakeup.

    Unfortunately, natural though it might be, this heuristic fails when
    rcu_gp_kthread_wake() is invoked from an interrupt or softirq handler
    that interrupted the grace-period kthread just after the final check of
    the wait-event condition but just before the schedule() call. In this
    case, a wakeup is required, even though the call to rcu_gp_kthread_wake()
    is within the RCU grace-period kthread's context. Failing to provide
    this wakeup can result in grace periods failing to start, which in turn
    results in out-of-memory conditions.

    This race window is quite narrow, but it actually did happen during real
    testing. It would of course need to be fixed even if it was strictly
    theoretical in nature.

    This patch does not Cc stable because it does not apply cleanly to
    earlier kernel versions.
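
    A hedged sketch of the adjusted self-wakeup test (per the bracketed note
    in the trailer below; field names follow the RCU code but the exact form
    is assumed):

    static void rcu_gp_kthread_wake_sketch(struct rcu_state *rsp)
    {
            /*
             * Only skip the wakeup when really running in the GP kthread's
             * own task context, not in an irq/softirq handler that merely
             * interrupted it.
             */
            if ((current == rsp->gp_kthread &&
                 !in_interrupt() && !in_serving_softirq()) ||
                !READ_ONCE(rsp->gp_flags) || !rsp->gp_kthread)
                    return;

            swake_up_one(&rsp->gp_wq);
    }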

    Fixes: 48a7639ce80c ("rcu: Make callers awaken grace-period kthread")
    Reported-by: "He, Bo"
    Co-developed-by: "Zhang, Jun"
    Co-developed-by: "He, Bo"
    Co-developed-by: "xiao, jin"
    Co-developed-by: Bai, Jie A
    Signed-off: "Zhang, Jun"
    Signed-off: "He, Bo"
    Signed-off: "xiao, jin"
    Signed-off: Bai, Jie A
    Signed-off-by: "Zhang, Jun"
    [ paulmck: Switch from !in_softirq() to "!in_interrupt() &&
    !in_serving_softirq() to avoid redundant wakeups and to also handle the
    interrupt-handler scenario as well as the softirq-handler scenario that
    actually occurred in testing. ]
    Signed-off-by: Paul E. McKenney
    Link: https://lkml.kernel.org/r/CD6925E8781EFD4D8E11882D20FC406D52A11F61@SHSMSX104.ccr.corp.intel.com
    Signed-off-by: Greg Kroah-Hartman

    Zhang, Jun
     
  • commit 8cf7630b29701d364f8df4a50e4f1f5e752b2778 upstream.

    This bug has apparently existed since the introduction of this function
    in the pre-git era (4500e91754d3 in Thomas Gleixner's history.git,
    "[NET]: Add proc_dointvec_userhz_jiffies, use it for proper handling of
    neighbour sysctls.").

    As a minimal fix we can simply duplicate the corresponding check in
    do_proc_dointvec_conv().

    Link: http://lkml.kernel.org/r/20190207123426.9202-3-zev@bewilderbeest.net
    Signed-off-by: Zev Weiss
    Cc: Brendan Higgins
    Cc: Iurii Zaikin
    Cc: Kees Cook
    Cc: Luis Chamberlain
    Cc: [2.6.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Zev Weiss
     
  • commit 83540fbc8812a580b6ad8f93f4c29e62e417687e upstream.

    The first version of this method was missing the check for
    `ret == PATH_MAX`; then such a check was added, but it didn't call kfree()
    on error, so there was still a small memory leak in the error case.
    Fix it by using strndup_user() instead of open-coding it.
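
    A hedged sketch of the strndup_user() pattern (the variable names are
    illustrative): it bounds the copy and returns an ERR_PTR() on failure, so
    there is no separately allocated buffer left to leak on the error path.

    path = strndup_user(user_path_ptr, PATH_MAX);   /* user_path_ptr: assumed name */
    if (IS_ERR(path))
            return PTR_ERR(path);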

    Link: http://lkml.kernel.org/r/20190220165443.152385-1-jannh@google.com

    Cc: Ingo Molnar
    Cc: stable@vger.kernel.org
    Fixes: 0eadcc7a7bc0 ("perf/core: Fix perf_uprobe_init()")
    Reviewed-by: Masami Hiramatsu
    Acked-by: Song Liu
    Signed-off-by: Jann Horn
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • commit e7f0c424d0806b05d6f47be9f202b037eb701707 upstream.

    Commit d716ff71dd12 ("tracing: Remove taking of trace_types_lock in
    pipe files") uses the current tracer instead of the copy in
    tracing_open_pipe(), but it forgot to remove the freeing statement in
    the error path.

    There's an error path that can call kfree(iter->trace) after the iter->trace
    was assigned to tr->current_trace, which would be bad to free.

    Link: http://lkml.kernel.org/r/1550060946-45984-1-git-send-email-yi.zhang@huawei.com

    Cc: stable@vger.kernel.org
    Fixes: d716ff71dd12 ("tracing: Remove taking of trace_types_lock in pipe files")
    Signed-off-by: zhangyi (F)
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    zhangyi (F)
     
  • commit 9f0bbf3115ca9f91f43b7c74e9ac7d79f47fc6c2 upstream.

    Because there may be random garbage beyond a string's null terminator,
    it's not correct to copy the complete character array for use as a
    hist trigger key. This results in multiple histogram entries for the
    'same' string key.

    So, in the case of a string key, use strncpy instead of memcpy to
    avoid copying in the extra bytes.
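
    A minimal userspace demonstration of the difference (illustrative only;
    the kernel change applies the same idea to hist trigger keys):

    #include <stdio.h>
    #include <string.h>

    #define KEY_LEN 16

    int main(void)
    {
            char src1[KEY_LEN] = "gdbus";
            char src2[KEY_LEN] = "gdbus";
            char key1[KEY_LEN], key2[KEY_LEN];

            src1[7] = 'X';                 /* garbage beyond the terminator */

            memcpy(key1, src1, KEY_LEN);   /* copies the garbage too */
            memcpy(key2, src2, KEY_LEN);
            printf("memcpy keys equal:  %d\n", memcmp(key1, key2, KEY_LEN) == 0);

            strncpy(key1, src1, KEY_LEN);  /* copies the string, zero-fills the rest */
            strncpy(key2, src2, KEY_LEN);
            printf("strncpy keys equal: %d\n", memcmp(key1, key2, KEY_LEN) == 0);
            return 0;
    }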

    Before, using the gdbus entries in the following hist trigger as an
    example:

    # echo 'hist:key=comm' > /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
    # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist

    ...

    { comm: ImgDecoder #4 } hitcount: 203
    { comm: gmain } hitcount: 213
    { comm: gmain } hitcount: 216
    { comm: StreamTrans #73 } hitcount: 221
    { comm: mozStorage #3 } hitcount: 230
    { comm: gdbus } hitcount: 233
    { comm: StyleThread#5 } hitcount: 253
    { comm: gdbus } hitcount: 256
    { comm: gdbus } hitcount: 260
    { comm: StyleThread#4 } hitcount: 271

    ...

    # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist | egrep gdbus | wc -l
    51

    After:

    # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist | egrep gdbus | wc -l
    1

    Link: http://lkml.kernel.org/r/50c35ae1267d64eee975b8125e151e600071d4dc.1549309756.git.tom.zanussi@linux.intel.com

    Cc: Namhyung Kim
    Cc: stable@vger.kernel.org
    Fixes: 79e577cbce4c4 ("tracing: Support string type key properly")
    Signed-off-by: Tom Zanussi
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Tom Zanussi
     
  • commit 399504e21a10be16dd1408ba0147367d9d82a10c upstream.

    same story as with last May fixes in sysfs (7b745a4e4051
    "unfuck sysfs_mount()"); new_sb is left uninitialized
    in case of early errors in kernfs_mount_ns() and papering
    over it by treating any error from kernfs_mount_ns() as
    equivalent to !new_ns ends up conflating the cases when
    objects had never been transferred to a superblock with
    ones when that has happened and resulting new superblock
    had been dropped. Easily fixed (same way as in sysfs
    case). Additionally, there's a superblock leak on
    kernfs_node_dentry() failure *and* a dentry leak inside
    kernfs_node_dentry() itself - the latter on probably
    impossible errors, but the former not impossible to trigger
    (as the matter of fact, injecting allocation failures
    at that point *does* trigger it).

    Cc: stable@kernel.org
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • [ Upstream commit 7c0cdf0b3940f63d9777c3fcf250a2f83859ca54 ]

    trie_delete_elem() was deleting an entry even though the key did not match,
    as long as the prefixlen was correct. This patch adds a check on matchlen.

    Reproducer:

    $ sudo bpftool map create /sys/fs/bpf/mylpm type lpm_trie key 8 value 1 entries 128 name mylpm flags 1
    $ sudo bpftool map update pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 aa bb cc dd value hex 01
    $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
    key: 10 00 00 00 aa bb cc dd value: 01
    Found 1 element
    $ sudo bpftool map delete pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 ff ff ff ff
    $ echo $?
    0
    $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
    Found 0 elements

    A similar reproducer is added in the selftests.

    Without the patch:

    $ sudo ./tools/testing/selftests/bpf/test_lpm_map
    test_lpm_map: test_lpm_map.c:485: test_lpm_delete: Assertion `bpf_map_delete_elem(map_fd, key) == -1 && errno == ENOENT' failed.
    Aborted

    With the patch: test_lpm_map runs without errors.

    Fixes: e454cf595853 ("bpf: Implement map_delete_elem for BPF_MAP_TYPE_LPM_TRIE")
    Cc: Craig Gallek
    Signed-off-by: Alban Crequy
    Acked-by: Craig Gallek
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Alban Crequy
     
  • [ Upstream commit 3defaf2f15b2bfd86c6664181ac009e91985f8ac ]

    Lockdep warns about false positive:
    [ 11.211460] ------------[ cut here ]------------
    [ 11.211936] DEBUG_LOCKS_WARN_ON(depth
    [ 11.223874] ? __local_bh_enable+0x7a/0x80
    [ 11.224199] up_read+0x1c/0xa0
    [ 11.224446] do_up_read+0x12/0x20
    [ 11.224713] irq_work_run_list+0x43/0x70
    [ 11.225030] irq_work_run+0x26/0x50
    [ 11.225310] smp_irq_work_interrupt+0x57/0x1f0
    [ 11.225662] irq_work_interrupt+0xf/0x20

    since rw_semaphore is released in a different task vs task that locked the sema.
    It is expected behavior.
    Fix the warning with up_read_non_owner() and rwsem_release() annotation.

    Fixes: bae77c5eb5b2 ("bpf: enable stackmap with build_id in nmi context")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Alexei Starovoitov
     

14 Mar, 2019

5 commits

  • [ Upstream commit 7c4cd051add3d00bbff008a133c936c515eaa8fe ]

    The map_lookup_elem syscall used to not acquire a spinlock
    in order to optimize the reader.

    It was true until commit 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation")
    The syscall's map_lookup_elem(stackmap) calls bpf_stackmap_copy().
    bpf_stackmap_copy() may find the elem no longer needed after the copy is done.
    If that is the case, pcpu_freelist_push() saves this elem for reuse later.
    This push requires a spinlock.

    If a tracing bpf_prog got run in the middle of the syscall's
    map_lookup_elem(stackmap) and this tracing bpf_prog is calling
    bpf_get_stackid(stackmap) which also requires the same pcpu_freelist's
    spinlock, it may end up with a dead lock situation as reported by
    Eric Dumazet in https://patchwork.ozlabs.org/patch/1030266/

    The situation is the same as the syscall's map_update_elem() which
    needs to acquire the pcpu_freelist's spinlock and could race
    with tracing bpf_prog. Hence, this patch fixes it by protecting
    bpf_stackmap_copy() with this_cpu_inc(bpf_prog_active)
    to prevent tracing bpf_prog from running.

    A later syscall's map_lookup_elem commit f1a2e44a3aec ("bpf: add queue and stack maps")
    also acquires a spinlock and races with tracing bpf_prog similarly.
    Hence, this patch is forward looking and protects the majority
    of the map lookups. bpf_map_offload_lookup_elem() is the exception
    since it is for network bpf_prog only (i.e. never called by tracing
    bpf_prog).
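
    A hedged sketch of the guard around the stackmap copy in the syscall path
    (simplified):

    /*
     * Prevent a tracing bpf_prog from running on this CPU while we may take
     * the pcpu_freelist lock inside bpf_stackmap_copy().
     */
    preempt_disable();
    this_cpu_inc(bpf_prog_active);
    err = bpf_stackmap_copy(map, key, value);
    this_cpu_dec(bpf_prog_active);
    preempt_enable();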

    Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation")
    Reported-by: Eric Dumazet
    Acked-by: Alexei Starovoitov
    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Martin KaFai Lau
     
  • [ Upstream commit e16ec34039c701594d55d08a5aa49ee3e1abc821 ]

    Lockdep found a potential deadlock between cpu_hotplug_lock, bpf_event_mutex, and cpuctx_mutex:
    [ 13.007000] WARNING: possible circular locking dependency detected
    [ 13.007587] 5.0.0-rc3-00018-g2fa53f892422-dirty #477 Not tainted
    [ 13.008124] ------------------------------------------------------
    [ 13.008624] test_progs/246 is trying to acquire lock:
    [ 13.009030] 0000000094160d1d (tracepoints_mutex){+.+.}, at: tracepoint_probe_register_prio+0x2d/0x300
    [ 13.009770]
    [ 13.009770] but task is already holding lock:
    [ 13.010239] 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60
    [ 13.010877]
    [ 13.010877] which lock already depends on the new lock.
    [ 13.010877]
    [ 13.011532]
    [ 13.011532] the existing dependency chain (in reverse order) is:
    [ 13.012129]
    [ 13.012129] -> #4 (bpf_event_mutex){+.+.}:
    [ 13.012582] perf_event_query_prog_array+0x9b/0x130
    [ 13.013016] _perf_ioctl+0x3aa/0x830
    [ 13.013354] perf_ioctl+0x2e/0x50
    [ 13.013668] do_vfs_ioctl+0x8f/0x6a0
    [ 13.014003] ksys_ioctl+0x70/0x80
    [ 13.014320] __x64_sys_ioctl+0x16/0x20
    [ 13.014668] do_syscall_64+0x4a/0x180
    [ 13.015007] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 13.015469]
    [ 13.015469] -> #3 (&cpuctx_mutex){+.+.}:
    [ 13.015910] perf_event_init_cpu+0x5a/0x90
    [ 13.016291] perf_event_init+0x1b2/0x1de
    [ 13.016654] start_kernel+0x2b8/0x42a
    [ 13.016995] secondary_startup_64+0xa4/0xb0
    [ 13.017382]
    [ 13.017382] -> #2 (pmus_lock){+.+.}:
    [ 13.017794] perf_event_init_cpu+0x21/0x90
    [ 13.018172] cpuhp_invoke_callback+0xb3/0x960
    [ 13.018573] _cpu_up+0xa7/0x140
    [ 13.018871] do_cpu_up+0xa4/0xc0
    [ 13.019178] smp_init+0xcd/0xd2
    [ 13.019483] kernel_init_freeable+0x123/0x24f
    [ 13.019878] kernel_init+0xa/0x110
    [ 13.020201] ret_from_fork+0x24/0x30
    [ 13.020541]
    [ 13.020541] -> #1 (cpu_hotplug_lock.rw_sem){++++}:
    [ 13.021051] static_key_slow_inc+0xe/0x20
    [ 13.021424] tracepoint_probe_register_prio+0x28c/0x300
    [ 13.021891] perf_trace_event_init+0x11f/0x250
    [ 13.022297] perf_trace_init+0x6b/0xa0
    [ 13.022644] perf_tp_event_init+0x25/0x40
    [ 13.023011] perf_try_init_event+0x6b/0x90
    [ 13.023386] perf_event_alloc+0x9a8/0xc40
    [ 13.023754] __do_sys_perf_event_open+0x1dd/0xd30
    [ 13.024173] do_syscall_64+0x4a/0x180
    [ 13.024519] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 13.024968]
    [ 13.024968] -> #0 (tracepoints_mutex){+.+.}:
    [ 13.025434] __mutex_lock+0x86/0x970
    [ 13.025764] tracepoint_probe_register_prio+0x2d/0x300
    [ 13.026215] bpf_probe_register+0x40/0x60
    [ 13.026584] bpf_raw_tracepoint_open.isra.34+0xa4/0x130
    [ 13.027042] __do_sys_bpf+0x94f/0x1a90
    [ 13.027389] do_syscall_64+0x4a/0x180
    [ 13.027727] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 13.028171]
    [ 13.028171] other info that might help us debug this:
    [ 13.028171]
    [ 13.028807] Chain exists of:
    [ 13.028807] tracepoints_mutex --> &cpuctx_mutex --> bpf_event_mutex
    [ 13.028807]
    [ 13.029666] Possible unsafe locking scenario:
    [ 13.029666]
    [ 13.030140] CPU0 CPU1
    [ 13.030510] ---- ----
    [ 13.030875] lock(bpf_event_mutex);
    [ 13.031166] lock(&cpuctx_mutex);
    [ 13.031645] lock(bpf_event_mutex);
    [ 13.032135] lock(tracepoints_mutex);
    [ 13.032441]
    [ 13.032441] *** DEADLOCK ***
    [ 13.032441]
    [ 13.032911] 1 lock held by test_progs/246:
    [ 13.033239] #0: 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60
    [ 13.033909]
    [ 13.033909] stack backtrace:
    [ 13.034258] CPU: 1 PID: 246 Comm: test_progs Not tainted 5.0.0-rc3-00018-g2fa53f892422-dirty #477
    [ 13.034964] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
    [ 13.035657] Call Trace:
    [ 13.035859] dump_stack+0x5f/0x8b
    [ 13.036130] print_circular_bug.isra.37+0x1ce/0x1db
    [ 13.036526] __lock_acquire+0x1158/0x1350
    [ 13.036852] ? lock_acquire+0x98/0x190
    [ 13.037154] lock_acquire+0x98/0x190
    [ 13.037447] ? tracepoint_probe_register_prio+0x2d/0x300
    [ 13.037876] __mutex_lock+0x86/0x970
    [ 13.038167] ? tracepoint_probe_register_prio+0x2d/0x300
    [ 13.038600] ? tracepoint_probe_register_prio+0x2d/0x300
    [ 13.039028] ? __mutex_lock+0x86/0x970
    [ 13.039337] ? __mutex_lock+0x24a/0x970
    [ 13.039649] ? bpf_probe_register+0x1d/0x60
    [ 13.039992] ? __bpf_trace_sched_wake_idle_without_ipi+0x10/0x10
    [ 13.040478] ? tracepoint_probe_register_prio+0x2d/0x300
    [ 13.040906] tracepoint_probe_register_prio+0x2d/0x300
    [ 13.041325] bpf_probe_register+0x40/0x60
    [ 13.041649] bpf_raw_tracepoint_open.isra.34+0xa4/0x130
    [ 13.042068] ? __might_fault+0x3e/0x90
    [ 13.042374] __do_sys_bpf+0x94f/0x1a90
    [ 13.042678] do_syscall_64+0x4a/0x180
    [ 13.042975] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 13.043382] RIP: 0033:0x7f23b10a07f9
    [ 13.045155] RSP: 002b:00007ffdef42fdd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
    [ 13.045759] RAX: ffffffffffffffda RBX: 00007ffdef42ff70 RCX: 00007f23b10a07f9
    [ 13.046326] RDX: 0000000000000070 RSI: 00007ffdef42fe10 RDI: 0000000000000011
    [ 13.046893] RBP: 00007ffdef42fdf0 R08: 0000000000000038 R09: 00007ffdef42fe10
    [ 13.047462] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
    [ 13.048029] R13: 0000000000000016 R14: 00007f23b1db4690 R15: 0000000000000000

    Since tracepoints_mutex will be taken in tracepoint_probe_register/unregister()
    there is no need to take bpf_event_mutex too.
    bpf_event_mutex is protecting modifications to prog array used in kprobe/perf bpf progs.
    bpf_raw_tracepoints don't need to take this mutex.

    Fixes: c4f6699dfcb8 ("bpf: introduce BPF_RAW_TRACEPOINT")
    Acked-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Alexei Starovoitov
     
  • [ Upstream commit a89fac57b5d080771efd4d71feaae19877cf68f0 ]

    Lockdep warns about false positive:
    [ 12.492084] 00000000e6b28347 (&head->lock){+...}, at: pcpu_freelist_push+0x2a/0x40
    [ 12.492696] but this lock was taken by another, HARDIRQ-safe lock in the past:
    [ 12.493275] (&rq->lock){-.-.}
    [ 12.493276]
    [ 12.493276]
    [ 12.493276] and interrupts could create inverse lock ordering between them.
    [ 12.493276]
    [ 12.494435]
    [ 12.494435] other info that might help us debug this:
    [ 12.494979] Possible interrupt unsafe locking scenario:
    [ 12.494979]
    [ 12.495518] CPU0 CPU1
    [ 12.495879] ---- ----
    [ 12.496243] lock(&head->lock);
    [ 12.496502] local_irq_disable();
    [ 12.496969] lock(&rq->lock);
    [ 12.497431] lock(&head->lock);
    [ 12.497890]
    [ 12.498104] lock(&rq->lock);
    [ 12.498368]
    [ 12.498368] *** DEADLOCK ***
    [ 12.498368]
    [ 12.498837] 1 lock held by dd/276:
    [ 12.499110] #0: 00000000c58cb2ee (rcu_read_lock){....}, at: trace_call_bpf+0x5e/0x240
    [ 12.499747]
    [ 12.499747] the shortest dependencies between 2nd lock and 1st lock:
    [ 12.500389] -> (&rq->lock){-.-.} {
    [ 12.500669] IN-HARDIRQ-W at:
    [ 12.500934] _raw_spin_lock+0x2f/0x40
    [ 12.501373] scheduler_tick+0x4c/0xf0
    [ 12.501812] update_process_times+0x40/0x50
    [ 12.502294] tick_periodic+0x27/0xb0
    [ 12.502723] tick_handle_periodic+0x1f/0x60
    [ 12.503203] timer_interrupt+0x11/0x20
    [ 12.503651] __handle_irq_event_percpu+0x43/0x2c0
    [ 12.504167] handle_irq_event_percpu+0x20/0x50
    [ 12.504674] handle_irq_event+0x37/0x60
    [ 12.505139] handle_level_irq+0xa7/0x120
    [ 12.505601] handle_irq+0xa1/0x150
    [ 12.506018] do_IRQ+0x77/0x140
    [ 12.506411] ret_from_intr+0x0/0x1d
    [ 12.506834] _raw_spin_unlock_irqrestore+0x53/0x60
    [ 12.507362] __setup_irq+0x481/0x730
    [ 12.507789] setup_irq+0x49/0x80
    [ 12.508195] hpet_time_init+0x21/0x32
    [ 12.508644] x86_late_time_init+0xb/0x16
    [ 12.509106] start_kernel+0x390/0x42a
    [ 12.509554] secondary_startup_64+0xa4/0xb0
    [ 12.510034] IN-SOFTIRQ-W at:
    [ 12.510305] _raw_spin_lock+0x2f/0x40
    [ 12.510772] try_to_wake_up+0x1c7/0x4e0
    [ 12.511220] swake_up_locked+0x20/0x40
    [ 12.511657] swake_up_one+0x1a/0x30
    [ 12.512070] rcu_process_callbacks+0xc5/0x650
    [ 12.512553] __do_softirq+0xe6/0x47b
    [ 12.512978] irq_exit+0xc3/0xd0
    [ 12.513372] smp_apic_timer_interrupt+0xa9/0x250
    [ 12.513876] apic_timer_interrupt+0xf/0x20
    [ 12.514343] default_idle+0x1c/0x170
    [ 12.514765] do_idle+0x199/0x240
    [ 12.515159] cpu_startup_entry+0x19/0x20
    [ 12.515614] start_kernel+0x422/0x42a
    [ 12.516045] secondary_startup_64+0xa4/0xb0
    [ 12.516521] INITIAL USE at:
    [ 12.516774] _raw_spin_lock_irqsave+0x38/0x50
    [ 12.517258] rq_attach_root+0x16/0xd0
    [ 12.517685] sched_init+0x2f2/0x3eb
    [ 12.518096] start_kernel+0x1fb/0x42a
    [ 12.518525] secondary_startup_64+0xa4/0xb0
    [ 12.518986] }
    [ 12.519132] ... key at: [] __key.71384+0x0/0x8
    [ 12.519649] ... acquired at:
    [ 12.519892] pcpu_freelist_pop+0x7b/0xd0
    [ 12.520221] bpf_get_stackid+0x1d2/0x4d0
    [ 12.520563] ___bpf_prog_run+0x8b4/0x11a0
    [ 12.520887]
    [ 12.521008] -> (&head->lock){+...} {
    [ 12.521292] HARDIRQ-ON-W at:
    [ 12.521539] _raw_spin_lock+0x2f/0x40
    [ 12.521950] pcpu_freelist_push+0x2a/0x40
    [ 12.522396] bpf_get_stackid+0x494/0x4d0
    [ 12.522828] ___bpf_prog_run+0x8b4/0x11a0
    [ 12.523296] INITIAL USE at:
    [ 12.523537] _raw_spin_lock+0x2f/0x40
    [ 12.523944] pcpu_freelist_populate+0xc0/0x120
    [ 12.524417] htab_map_alloc+0x405/0x500
    [ 12.524835] __do_sys_bpf+0x1a3/0x1a90
    [ 12.525253] do_syscall_64+0x4a/0x180
    [ 12.525659] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 12.526167] }
    [ 12.526311] ... key at: [] __key.13130+0x0/0x8
    [ 12.526812] ... acquired at:
    [ 12.527047] __lock_acquire+0x521/0x1350
    [ 12.527371] lock_acquire+0x98/0x190
    [ 12.527680] _raw_spin_lock+0x2f/0x40
    [ 12.527994] pcpu_freelist_push+0x2a/0x40
    [ 12.528325] bpf_get_stackid+0x494/0x4d0
    [ 12.528645] ___bpf_prog_run+0x8b4/0x11a0
    [ 12.528970]
    [ 12.529092]
    [ 12.529092] stack backtrace:
    [ 12.529444] CPU: 0 PID: 276 Comm: dd Not tainted 5.0.0-rc3-00018-g2fa53f892422 #475
    [ 12.530043] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
    [ 12.530750] Call Trace:
    [ 12.530948] dump_stack+0x5f/0x8b
    [ 12.531248] check_usage_backwards+0x10c/0x120
    [ 12.531598] ? ___bpf_prog_run+0x8b4/0x11a0
    [ 12.531935] ? mark_lock+0x382/0x560
    [ 12.532229] mark_lock+0x382/0x560
    [ 12.532496] ? print_shortest_lock_dependencies+0x180/0x180
    [ 12.532928] __lock_acquire+0x521/0x1350
    [ 12.533271] ? find_get_entry+0x17f/0x2e0
    [ 12.533586] ? find_get_entry+0x19c/0x2e0
    [ 12.533902] ? lock_acquire+0x98/0x190
    [ 12.534196] lock_acquire+0x98/0x190
    [ 12.534482] ? pcpu_freelist_push+0x2a/0x40
    [ 12.534810] _raw_spin_lock+0x2f/0x40
    [ 12.535099] ? pcpu_freelist_push+0x2a/0x40
    [ 12.535432] pcpu_freelist_push+0x2a/0x40
    [ 12.535750] bpf_get_stackid+0x494/0x4d0
    [ 12.536062] ___bpf_prog_run+0x8b4/0x11a0

    It has been explained that this is a false positive here:
    https://lkml.org/lkml/2018/7/25/756
    Recap:
    - stackmap uses pcpu_freelist
    - The lock in pcpu_freelist is a percpu lock
    - stackmap is only used by tracing bpf_prog
    - A tracing bpf_prog cannot be run if another bpf_prog
    has already been running (ensured by the percpu bpf_prog_active counter).

    Eric pointed out that this lockdep splat stops other
    legit lockdep splats in selftests/bpf/test_progs.c.

    Fix this by calling local_irq_save/restore for stackmap.

    Another false positive had also been worked around by calling
    local_irq_save in commit 89ad2fa3f043 ("bpf: fix lockdep splat").
    That commit added unnecessary irq_save/restore to fast path of
    bpf hash map. irqs are already disabled at that point, since htab
    is holding per bucket spin_lock with irqsave.

    Let's reduce overhead for htab by introducing __pcpu_freelist_push/pop
    function w/o irqsave and convert pcpu_freelist_push/pop to irqsave
    to be used elsewhere (right now only in stackmap).
    It stops lockdep false positive in stackmap with a bit of acceptable overhead.
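
    A hedged sketch of the resulting split (simplified): the double-underscore
    variant only takes the lock and is used where IRQs are already disabled
    (htab), while the plain variant wraps it with irqsave for other callers
    such as stackmap.

    void pcpu_freelist_push(struct pcpu_freelist *s,
                            struct pcpu_freelist_node *node)
    {
            unsigned long flags;

            local_irq_save(flags);
            __pcpu_freelist_push(s, node);   /* lock-only variant, no irqsave */
            local_irq_restore(flags);
    }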

    Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation")
    Reported-by: Naresh Kamboju
    Reported-by: Eric Dumazet
    Acked-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Alexei Starovoitov
     
  • [ Upstream commit 2c1cf00eeacb784781cf1c9896b8af001246d339 ]

    If create_buf_file() returns an error, don't try to reference it later
    as a valid dentry pointer.

    This problem was exposed when debugfs started to return errors instead
    of just NULL for some calls when they do not succeed properly.

    Also, the check for WARN_ON(dentry) was just wrong :)

    Reported-by: Kees Cook
    Reported-and-tested-by: syzbot+16c3a70e1e9b29346c43@syzkaller.appspotmail.com
    Reported-by: Tetsuo Handa
    Cc: Andrew Morton
    Cc: David Rientjes
    Fixes: ff9fb72bc077 ("debugfs: return error values, not NULL")
    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Sasha Levin

    Greg Kroah-Hartman
     
  • [ Upstream commit 1a51c5da5acc6c188c917ba572eebac5f8793432 ]

    The perf_proc_update_handler() handles /proc/sys/kernel/perf_event_max_sample_rate
    sysctl variable. When the PMU IRQ handler timing monitoring is disabled, i.e.,
    when /proc/sys/kernel/perf_cpu_time_max_percent is equal to 0 or 100,
    then no modification to sysctl_perf_event_sample_rate is allowed to prevent
    possible hang from wrong values.

    The problem is that the test to prevent modification is made after the
    sysctl variable is modified in perf_proc_update_handler().

    You get an error:

    $ echo 10001 >/proc/sys/kernel/perf_event_max_sample_rate
    echo: write error: invalid argument

    But the value is still modified causing all sorts of inconsistencies:

    $ cat /proc/sys/kernel/perf_event_max_sample_rate
    10001

    This patch fixes the problem by moving the parsing of the value after
    the test.
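
    A hedged sketch of the reordered handler (simplified; names follow the
    perf core but the exact form is assumed):

    int perf_proc_update_handler_sketch(struct ctl_table *table, int write,
                                        void __user *buffer, size_t *lenp,
                                        loff_t *ppos)
    {
            int ret;

            /*
             * Refuse the write while IRQ time monitoring is disabled,
             * *before* proc_dointvec_minmax() can overwrite the sysctl value.
             */
            if (write && (sysctl_perf_cpu_time_max_percent == 100 ||
                          sysctl_perf_cpu_time_max_percent == 0))
                    return -EINVAL;

            ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
            if (ret || !write)
                    return ret;

            update_perf_cpu_limits();
            return 0;
    }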

    Committer testing:

    # echo 100 > /proc/sys/kernel/perf_cpu_time_max_percent
    # echo 10001 > /proc/sys/kernel/perf_event_max_sample_rate
    -bash: echo: write error: Invalid argument
    # cat /proc/sys/kernel/perf_event_max_sample_rate
    10001
    #

    Signed-off-by: Stephane Eranian
    Reviewed-by: Andi Kleen
    Reviewed-by: Jiri Olsa
    Tested-by: Arnaldo Carvalho de Melo
    Cc: Kan Liang
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1547169436-6266-1-git-send-email-eranian@google.com
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: Sasha Levin

    Stephane Eranian
     

10 Mar, 2019

2 commits

  • commit 3612af783cf52c74a031a2f11b82247b2599d3cd upstream.

    Marek reported that he saw an issue with the below snippet in that
    timing measurements were off when loaded as unpriv, while results
    were reasonable when loaded as privileged:

    [...]
    uint64_t a = bpf_ktime_get_ns();
    uint64_t b = bpf_ktime_get_ns();
    uint64_t delta = b - a;
    if ((int64_t)delta > 0) {
    [...]

    Turns out there is a bug where a corner case is missing in the fix
    d3bd7413e0ca ("bpf: fix sanitation of alu op with pointer / scalar
    type from different paths"), namely fixup_bpf_calls() only checks
    whether aux has a non-zero alu_state, but it also needs to test for
    the case of BPF_ALU_NON_POINTER since in both occasions we need to
    skip the masking rewrite (as there is nothing to mask).

    Fixes: d3bd7413e0ca ("bpf: fix sanitation of alu op with pointer / scalar type from different paths")
    Reported-by: Marek Majkowski
    Reported-by: Arthur Fabre
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/netdev/CAJPywTJqP34cK20iLM5YmUMz9KXQOdu1-+BZrGMAGgLuBWz7fg@mail.gmail.com/T/
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • commit 6a072128d262d2b98d31626906a96700d1fc11eb upstream.

    When tracing syscall exit events it is extremely useful to filter exit
    codes equal to some negative value, to react only to the required errors.
    But negative numbers do not work:

    [root@snorch sys_exit_read]# echo "ret == -1" > filter
    bash: echo: write error: Invalid argument
    [root@snorch sys_exit_read]# cat filter
    ret == -1
    ^
    parse_error: Invalid value (did you forget quotes)?

    Similar thing happens when setting triggers.

    This is a regression in v4.17 introduced by the commit mentioned below;
    testing without that commit shows no problem with negative numbers.

    Link: http://lkml.kernel.org/r/20180823102534.7642-1-ptikhomirov@virtuozzo.com

    Cc: stable@vger.kernel.org
    Fixes: 80765597bc58 ("tracing: Rewrite filter logic to be simpler and faster")
    Signed-off-by: Pavel Tikhomirov
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Pavel Tikhomirov
     

06 Mar, 2019

3 commits

  • [ Upstream commit e158488be27b157802753a59b336142dc0eb0380 ]

    Because wake_q_add() can imply an immediate wakeup (cmpxchg failure
    case), we must not rely on the wakeup being delayed. However, commit:

    e38513905eea ("locking/rwsem: Rework zeroing reader waiter->task")

    relies on exactly that behaviour in that the wakeup must not happen
    until after we clear waiter->task.

    [ peterz: Added changelog. ]

    Signed-off-by: Xie Yongji
    Signed-off-by: Zhang Yu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: e38513905eea ("locking/rwsem: Rework zeroing reader waiter->task")
    Link: https://lkml.kernel.org/r/1543495830-2644-1-git-send-email-xieyongji@baidu.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Xie Yongji
     
  • [ Upstream commit b061c38bef43406df8e73c5be06cbfacad5ee6ad ]

    We must not rely on wake_q_add() to delay the wakeup; in particular
    commit:

    1d0dcb3ad9d3 ("futex: Implement lockless wakeups")

    moved wake_q_add() before smp_store_release(&q->lock_ptr, NULL), which
    could result in futex_wait() waking before observing ->lock_ptr ==
    NULL and going back to sleep again.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 1d0dcb3ad9d3 ("futex: Implement lockless wakeups")
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Peter Zijlstra
     
  • [ Upstream commit 4c4e3731564c8945ac5ac90fc2a1e1f21cb79c92 ]

    Notable cmpxchg() does not provide ordering when it fails, however
    wake_q_add() requires ordering in this specific case too. Without this
    it would be possible for the concurrent wakeup to not observe our
    prior state.

    Andrea Parri provided:

    C wake_up_q-wake_q_add

    {
            int next = 0;
            int y = 0;
    }

    P0(int *next, int *y)
    {
            int r0;

            /* in wake_up_q() */

            WRITE_ONCE(*next, 1);   /* node->next = NULL */
            smp_mb();               /* implied by wake_up_process() */
            r0 = READ_ONCE(*y);
    }

    P1(int *next, int *y)
    {
            int r1;

            /* in wake_q_add() */

            WRITE_ONCE(*y, 1);      /* wake_cond = true */
            smp_mb__before_atomic();
            r1 = cmpxchg_relaxed(next, 1, 2);
    }

    exists (0:r0=0 /\ 1:r1=0)

    This "exists" clause cannot be satisfied according to the LKMM:

    Test wake_up_q-wake_q_add Allowed
    States 3
    0:r0=0; 1:r1=1;
    0:r0=1; 1:r1=0;
    0:r0=1; 1:r1=1;
    No
    Witnesses
    Positive: 0 Negative: 3
    Condition exists (0:r0=0 /\ 1:r1=0)
    Observation wake_up_q-wake_q_add Never 0 3

    Reported-by: Yongji Xie
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Peter Zijlstra