10 Aug, 2021

3 commits

  • Starting the process-wide cputime counter needs to be done in the same
    sighand locking sequence as actually arming the related timer, otherwise
    it races against concurrent timers being set or expiring in the same
    threadgroup.

    Detecting that the cputime counter is started without holding the sighand
    lock is a first step toward debugging such situations.
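
    A minimal sketch of the kind of debug assertion described above; the
    function name is made up for illustration and this is not the actual
    patch:

    /* Illustrative only: complain if the process-wide cputime counter is
     * started without tsk's sighand lock held. */
    static void start_process_cputimer(struct task_struct *tsk)
    {
            lockdep_assert_held(&tsk->sighand->siglock);
            /* ... actually start the per-process cputime counter here ... */
    }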

    Suggested-by: Peter Zijlstra (Intel)
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lore.kernel.org/r/20210726125513.271824-2-frederic@kernel.org

    Frederic Weisbecker
     
  • The variable ret is being initialized with a value that is never read;
    it is updated later on. The assignment is redundant and can be removed.
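
    As an illustration of the pattern being cleaned up (the call below is a
    made-up placeholder, not the actual code):

    int ret;                /* was: int ret = 0; -- that value was never read */

    ret = do_register();    /* hypothetical call; ret is first read after this */
    if (ret < 0)
            return ret;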

    Addresses-Coverity: ("Unused value")

    Signed-off-by: Colin Ian King
    Signed-off-by: Thomas Gleixner
    Link: https://lore.kernel.org/r/20210721120147.109570-1-colin.king@canonical.com

    Colin Ian King
     
  • The functions get_online_cpus() and put_online_cpus() have been
    deprecated during the CPU hotplug rework. They map directly to
    cpus_read_lock() and cpus_read_unlock().

    Replace deprecated CPU-hotplug functions with the official version.
    The behavior remains unchanged.
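
    A sketch of the mechanical replacement being described; the surrounding
    function and per-CPU work are made up for illustration:

    static void walk_online_cpus(void)
    {
            int cpu;

            cpus_read_lock();                       /* was: get_online_cpus();  */
            for_each_online_cpu(cpu)
                    pr_info("cpu %d is online\n", cpu);     /* placeholder work */
            cpus_read_unlock();                     /* was: put_online_cpus();  */
    }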

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Link: https://lore.kernel.org/r/20210803141621.780504-35-bigeasy@linutronix.de

    Sebastian Andrzej Siewior
     

09 Aug, 2021

3 commits

  • Pull timer fix from Thomas Gleixner:
    "A single timer fix:

    - Prevent a memory ordering issue in the timer expiry code which
    makes it possible to observe falsely that the callback has been
    executed already while that's not the case, which violates the
    guarantee of del_timer_sync()"

    * tag 'timers-urgent-2021-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    timers: Move clearing of base::timer_running under base::lock

    Linus Torvalds
     
  • Pull scheduler fix from Thomas Gleixner:
    "A single scheduler fix:

    - Prevent a double enqueue caused by rt_effective_prio() being
    invoked twice in __sched_setscheduler()"

    * tag 'sched-urgent-2021-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/rt: Fix double enqueue caused by rt_effective_prio

    Linus Torvalds
     
  • Pull perf fixes from Thomas Gleixner:
    "A set of perf fixes:

    - Correct the permission checks for perf event which send SIGTRAP to
    a different process and clean up that code to be more readable.

    - Prevent an out of bound MSR access in the x86 perf code which
    happened due to an incomplete limiting to the actually available
    hardware counters.

    - Prevent access to the AMD64_EVENTSEL_HOSTONLY bit when running
    inside a guest.

    - Handle small core counter re-enabling correctly by issuing an ACK
    right before reenabling it to prevent a stale PEBS record being
    kept around"

    * tag 'perf-urgent-2021-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/intel: Apply mid ACK for small core
    perf/x86/amd: Don't touch the AMD64_EVENTSEL_HOSTONLY bit inside the guest
    perf/x86: Fix out of bound MSR access
    perf: Refactor permissions check into perf_check_permission()
    perf: Fix required permissions if sigtrap is requested

    Linus Torvalds
     

07 Aug, 2021

1 commit

  • Pull tracing fixes from Steven Rostedt:
    "Fix tracepoint race between static_call and callback data

    As callbacks to a tracepoint are paired with the data that is passed
    in when the callback is registered to the tracepoint, it must have
    that data passed to the callback when the tracepoint is triggered,
    else bad things will happen. To keep the two together, they are both
    assigned to a tracepoint structure and added to an array. The
    tracepoint call site will dereference the structure (via RCU) and call
    the callback in that structure along with the data in that structure.
    This keeps the callback and data tightly coupled.

    Because of the overhead that retpolines have on tracepoint callbacks,
    if there's only one callback attached to a tracepoint (a common case),
    then it is called via a static call (code modified to do a direct call
    instead of an indirect call). But to implement this, the data had to
    be decoupled from the callback, as now the callback is implemented via
    a direct call from the static call and not an indirect call from the
    dereferenced structure.

    Note, the static call only calls a callback used when there's a single
    callback attached to the tracepoint. If more than one callback is
    attached to the same tracepoint, then the static call will call an
    iterator function that goes back to dereferencing the structure
    keeping the callback and its data tightly coupled again.

    Issues can arise when going from 0 callbacks to one, as the static
    call is assigned to the callback, and it must take care that the data
    passed to it is loaded before the static call calls the callback.
    Going from 1 to 2 callbacks is not an issue, as long as the static
    call is updated to the iterator before the tracepoint structure array
    is updated via RCU. Going from 2 to more or back down to 2 is not an
    issue as the iterator can handle all these cases. But going from 2 to
    1, care must be taken as the static call is now calling a callback and
    the data that is loaded must be the data for that callback.

    Care was taken to ensure the callback and data would be in-sync, but
    after a bug was reported, it became clear that not enough was done to
    make sure that was the case. These changes address this.

    The first change is to compare the old and new data instead of the old
    and new callback, as it's the data that can corrupt the callback, even
    if the callback is the same (something getting freed).

    The next change is to convert these transitions into states, to make
    it easier to know when a synchronization is needed, and to perform
    those synchronizations. The problem with this patch is that it slows
    down disabling all events from under a second, to making it take over
    10 seconds to do the same work. But that is addressed in the final
    patch.

    The final patch uses the RCU state functions to keep track of the RCU
    state between the transitions, and only needs to perform the
    synchronization if an RCU synchronization hasn't been done already.
    This brings the performance of disabling all events back to its
    original value. That's because no synchronization is required between
    disabling tracepoints but is required when enabling a tracepoint after
    it's been disabled. If an RCU synchronization happens after the
    tracepoint is disabled, and before it is re-enabled, there's no need
    to do the synchronization again.

    Both the second and third patches have subtle complexities, which is why
    they are separated into two patches. But because the second patch causes
    such a regression in performance, the third patch adds a "Fixes" tag to
    the second patch, so that the two must be backported together and not
    just the second patch"

    * tag 'trace-v5.14-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracepoint: Use rcu get state and cond sync for static call updates
    tracepoint: Fix static call function vs data state mismatch
    tracepoint: static call: Compare data on transition from 2->1 callees

    Linus Torvalds
     

06 Aug, 2021

5 commits

  • State transitions from 1->0->1 and N->2->1 callbacks require RCU
    synchronization. Rather than performing the RCU synchronization every
    time the state change occurs, which is quite slow when many tracepoints
    are registered in batch, keep a snapshot of the RCU state on the most
    recent transitions belonging to a chain, and conditionally wait for a
    grace period on the last transition of the chain only if one g.p. has
    not elapsed since the last snapshot.

    This applies to both RCU and SRCU.

    This brings the performance regression caused by commit 231264d6927f
    ("Fix: tracepoint: static call function vs data state mismatch") back to
    what it was originally.

    Before this commit:

    # trace-cmd start -e all
    # time trace-cmd start -p nop

    real 0m10.593s
    user 0m0.017s
    sys 0m0.259s

    After this commit:

    # trace-cmd start -e all
    # time trace-cmd start -p nop

    real 0m0.878s
    user 0m0.000s
    sys 0m0.103s
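
    A minimal sketch of the conditional grace-period wait described above,
    using the stock RCU cookie API; the variable and function names are
    illustrative, not the actual tracepoint code:

    static unsigned long chain_rcu_cookie;  /* snapshot from an earlier transition */

    static void snapshot_chain_state(void)
    {
            chain_rcu_cookie = get_state_synchronize_rcu();
    }

    static void sync_chain_if_needed(void)
    {
            /* Blocks only if no grace period elapsed since the snapshot. */
            cond_synchronize_rcu(chain_rcu_cookie);
    }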

    Link: https://lkml.kernel.org/r/20210805192954.30688-1-mathieu.desnoyers@efficios.com
    Link: https://lore.kernel.org/io-uring/4ebea8f0-58c9-e571-fd30-0ce4f6f09c70@samba.org/

    Cc: stable@vger.kernel.org
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andrew Morton
    Cc: "Paul E. McKenney"
    Cc: Stefan Metzmacher
    Fixes: 231264d6927f ("Fix: tracepoint: static call function vs data state mismatch")
    Signed-off-by: Mathieu Desnoyers
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Steven Rostedt (VMware)

    Mathieu Desnoyers
     
  • On a 1->0->1 callbacks transition, there is an issue with the new
    callback using the old callback's data.

    Considering __DO_TRACE_CALL:

    do {                                                            \
            struct tracepoint_func *it_func_ptr;                    \
            void *__data;                                           \
            it_func_ptr =                                           \
                    rcu_dereference_raw((&__tracepoint_##name)->funcs); \
            if (it_func_ptr) {                                      \
                    __data = (it_func_ptr)->data;                   \

    ----> [ delayed here on one CPU (e.g. vcpu preempted by the host) ]

                    static_call(tp_func_##name)(__data, args);      \
            }                                                       \
    } while (0)

    It has loaded the tp->funcs of the old callback, so it will try to use the
    old data. This can be fixed by adding an RCU sync anywhere in the 1->0->1
    transition chain.

    On a N->2->1 transition, we need an rcu-sync because you may have a
    sequence of 3->2->1 (or 1->2->1) where the element 0 data is unchanged
    between 2->1, but was changed from 3->2 (or from 1->2), which may be
    observed by the static call. This can be fixed by adding an
    unconditional RCU sync in transition 2->1.

    Note, this fixes a correctness issue at the cost of adding a tremendous
    performance regression to the disabling of tracepoints.

    Before this commit:

    # trace-cmd start -e all
    # time trace-cmd start -p nop

    real 0m0.778s
    user 0m0.000s
    sys 0m0.061s

    After this commit:

    # trace-cmd start -e all
    # time trace-cmd start -p nop

    real 0m10.593s
    user 0m0.017s
    sys 0m0.259s

    A follow up fix will introduce a more lightweight scheme based on RCU
    get_state and cond_sync, that will return the performance back to what it
    was. As both this change and the lightweight versions are complex on their
    own, for bisecting any issues that this may cause, they are kept as two
    separate changes.

    Link: https://lkml.kernel.org/r/20210805132717.23813-3-mathieu.desnoyers@efficios.com
    Link: https://lore.kernel.org/io-uring/4ebea8f0-58c9-e571-fd30-0ce4f6f09c70@samba.org/

    Cc: stable@vger.kernel.org
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andrew Morton
    Cc: "Paul E. McKenney"
    Cc: Stefan Metzmacher
    Fixes: d25e37d89dd2 ("tracepoint: Optimize using static_call()")
    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Steven Rostedt (VMware)

    Mathieu Desnoyers
     
  • On transition from 2->1 callees, we should be comparing .data rather
    than .func, because the same callback can be registered twice with
    different data, and what we care about here is that the data of array
    element 0 is unchanged to skip rcu sync.
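
    As a hedged illustration of the comparison change (array names are made
    up; this is not the exact code in kernel/tracepoint.c):

    /* Skip the RCU sync only when element 0's data is unchanged. */
    if (tp_funcs[0].data != old_funcs[0].data)      /* was: comparing .func */
            needs_rcu_sync = true;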

    Link: https://lkml.kernel.org/r/20210805132717.23813-2-mathieu.desnoyers@efficios.com
    Link: https://lore.kernel.org/io-uring/4ebea8f0-58c9-e571-fd30-0ce4f6f09c70@samba.org/

    Cc: stable@vger.kernel.org
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andrew Morton
    Cc: "Paul E. McKenney"
    Cc: Stefan Metzmacher
    Fixes: 547305a64632 ("tracepoint: Fix out of sync data passing by static caller")
    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Steven Rostedt (VMware)

    Mathieu Desnoyers
     
  • Pull ucounts fix from Eric Biederman:
    "Fix a subtle locking versus reference counting bug in the ucount
    changes, found by syzbot"

    * 'for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    ucounts: Fix race condition between alloc_ucounts and put_ucounts

    Linus Torvalds
     
  • Pull tracing fixes from Steven Rostedt:
    "Various tracing fixes:

    - Fix NULL pointer dereference caused by an error path

    - Give histogram calculation fields a size, otherwise it breaks
    synthetic creation based on them.

    - Reject strings being used for number calculations.

    - Fix recordmcount.pl warning on llvm building RISC-V allmodconfig

    - Fix the draw_functrace.py script to handle the new trace output

    - Fix warning of smp_processor_id() in preemptible code"

    * tag 'trace-v5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Quiet smp_processor_id() use in preemptable warning in hwlat
    scripts/tracing: fix the bug that can't parse raw_trace_func
    scripts/recordmcount.pl: Remove check_objcopy() and $can_use_local
    tracing: Reject string operand in the histogram expression
    tracing / histogram: Give calculation hist_fields a size
    tracing: Fix NULL pointer dereference in start_creating

    Linus Torvalds
     

05 Aug, 2021

3 commits

  • The hardware latency detector (hwlat) has a mode in which it runs one thread
    across CPUs. The logic to move from the currently running CPU to the next
    one in the list does a smp_processor_id() to find where it currently is.
    Unfortunately, it's done with preemption enabled, and this triggers a
    warning for using smp_processor_id() in a preempt enabled section.

    As it is only using smp_processor_id() to get information on where it
    currently is in order to simply move it to the next CPU, it doesn't really
    care if it got moved in the meantime. It will simply balance out later if
    such a case arises.

    Switch smp_processor_id() to raw_smp_processor_id() to quiet that warning.
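
    A sketch of what the round-robin move looks like with the raw accessor;
    the cpumask variable is illustrative:

    /* Only a hint of the current CPU is needed; being migrated right after
     * this read is harmless, so the raw variant avoids the warning. */
    int next_cpu = cpumask_next(raw_smp_processor_id(), current_mask);

    if (next_cpu >= nr_cpu_ids)
            next_cpu = cpumask_first(current_mask);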

    Link: https://lkml.kernel.org/r/20210804141848.79edadc0@oasis.local.home

    Acked-by: Daniel Bristot de Oliveira
    Fixes: 8fa826b7344d ("trace/hwlat: Implement the mode config option")
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • Since the string type cannot be the target of addition or subtraction
    operations, it must be rejected. Without this fix, strings are silently
    converted to digits.
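
    A hedged sketch of the kind of operand check described; the flag name is
    taken from the histogram code, and the surrounding logic is simplified:

    /* Strings cannot appear in +/- expressions; reject them up front. */
    if (operand->flags & HIST_FIELD_FL_STRING)
            return -EINVAL;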

    Link: https://lkml.kernel.org/r/162742654278.290973.1523000673366456634.stgit@devnote2

    Cc: stable@vger.kernel.org
    Fixes: 100719dcef447 ("tracing: Add simple expression support to hist triggers")
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Masami Hiramatsu
     
  • When working on my user space applications, I found a bug in the synthetic
    event code where the automated synthetic event field was not matching the
    event field calculation it was attached to. Looking deeper into it, it was
    because the calculation hist_field was not given a size.

    The synthetic event fields are matched to their hist_fields either by
    having the field have an identical string type, or if that does not match,
    then the size and signed values are used to match the fields.

    The problem arose when I tried to match a calculation where the fields
    were "unsigned int". My tool created a synthetic event of type "u32". But
    it failed to match. The string was:

    diff=field1-field2:onmatch(event).trace(synth,$diff)

    Adding debugging into the kernel, I found that the size of "diff" was 0.
    And since it was given "unsigned int" as a type, the histogram fallback
    code used size and signed. The signed matched, but the size of u32 (4) did
    not match zero, and the event failed to be created.

    This can be worse if the field you want to match is not one of the
    acceptable fields for a synthetic event. As event fields can have any type
    that is supported in Linux, this can cause an issue. For example, if a
    field's type is an enum, there's no way to use it in any calculations.

    Have the calculation field simply take on the size of what it is
    calculating.

    Link: https://lkml.kernel.org/r/20210730171951.59c7743f@oasis.local.home

    Cc: Tom Zanussi
    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Ingo Molnar
    Cc: Andrew Morton
    Cc: stable@vger.kernel.org
    Fixes: 100719dcef447 ("tracing: Add simple expression support to hist triggers")
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     

04 Aug, 2021

1 commit

  • Double enqueues in rt runqueues (list) have been reported while running
    a simple test that spawns a number of threads doing a short sleep/run
    pattern while being concurrently setscheduled between rt and fair class.

    WARNING: CPU: 3 PID: 2825 at kernel/sched/rt.c:1294 enqueue_task_rt+0x355/0x360
    CPU: 3 PID: 2825 Comm: setsched__13
    RIP: 0010:enqueue_task_rt+0x355/0x360
    Call Trace:
    __sched_setscheduler+0x581/0x9d0
    _sched_setscheduler+0x63/0xa0
    do_sched_setscheduler+0xa0/0x150
    __x64_sys_sched_setscheduler+0x1a/0x30
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    list_add double add: new=ffff9867cb629b40, prev=ffff9867cb629b40,
    next=ffff98679fc67ca0.
    kernel BUG at lib/list_debug.c:31!
    invalid opcode: 0000 [#1] PREEMPT_RT SMP PTI
    CPU: 3 PID: 2825 Comm: setsched__13
    RIP: 0010:__list_add_valid+0x41/0x50
    Call Trace:
    enqueue_task_rt+0x291/0x360
    __sched_setscheduler+0x581/0x9d0
    _sched_setscheduler+0x63/0xa0
    do_sched_setscheduler+0xa0/0x150
    __x64_sys_sched_setscheduler+0x1a/0x30
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    __sched_setscheduler() uses rt_effective_prio() to handle proper queuing
    of priority boosted tasks that are setscheduled while being boosted.
    rt_effective_prio() is however called twice per __sched_setscheduler()
    call: first directly by __sched_setscheduler() before dequeuing the task
    and then by __setscheduler() to actually do the priority change. If the
    priority of the pi_top_task is concurrently being changed however, it
    might happen that the two calls return different results. If, for example,
    the first call returned the same rt priority the task was running at and
    the second one a fair priority, the task won't be removed from the rt list
    (on_list still set) and will then be enqueued in the fair runqueue. When
    it is eventually setscheduled back to rt it will be seen as already
    enqueued and the WARNING/BUG will be issued.

    Fix this by calling rt_effective_prio() only once and then reusing the
    return value. While at it refactor code as well for clarity. Concurrent
    priority inheritance handling is still safe and will eventually converge
    to a new state by following the inheritance chain(s).
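
    A simplified sketch of the shape of the fix described above: evaluate the
    boosted priority once and hand the cached value down (heavily condensed,
    not the full __sched_setscheduler() flow):

    /* Evaluate PI boosting only once ... */
    newprio = rt_effective_prio(p, newprio);

    /* ... and reuse the cached result instead of calling
     * rt_effective_prio() a second time when applying the change. */
    __setscheduler_prio(p, newprio);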

    Fixes: 0782e63bc6fe ("sched: Handle priority boosted tasks proper in setscheduler()")
    [squashed Peterz changes; added changelog]
    Reported-by: Mark Simmons
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20210803104501.38333-1-juri.lelli@redhat.com

    Peter Zijlstra
     

31 Jul, 2021

2 commits

  • Pull networking fixes from Jakub Kicinski:
    "Networking fixes for 5.14-rc4, including fixes from bpf, can, WiFi
    (mac80211) and netfilter trees.

    Current release - regressions:

    - mac80211: fix starting aggregation sessions on mesh interfaces

    Current release - new code bugs:

    - sctp: send pmtu probe only if packet loss in Search Complete state

    - bnxt_en: add missing periodic PHC overflow check

    - devlink: fix phys_port_name of virtual port and merge error

    - hns3: change the method of obtaining default ptp cycle

    - can: mcba_usb_start(): add missing urb->transfer_dma initialization

    Previous releases - regressions:

    - set true network header for ECN decapsulation

    - mlx5e: RX, avoid possible data corruption w/ relaxed ordering and
    LRO

    - phy: re-add check for PHY_BRCM_DIS_TXCRXC_NOENRGY on the BCM54811
    PHY

    - sctp: fix return value check in __sctp_rcv_asconf_lookup

    Previous releases - always broken:

    - bpf:
    - more spectre corner case fixes, introduce a BPF nospec
    instruction for mitigating Spectre v4
    - fix OOB read when printing XDP link fdinfo
    - sockmap: fix cleanup related races

    - mac80211: fix enabling 4-address mode on a sta vif after assoc

    - can:
    - raw: raw_setsockopt(): fix raw_rcv panic for sock UAF
    - j1939: j1939_session_deactivate(): clarify lifetime of session
    object, avoid UAF
    - fix number of identical memory leaks in USB drivers

    - tipc:
    - do not blindly write skb_shinfo frags when doing decryption
    - fix sleeping in tipc accept routine"

    * tag 'net-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits)
    gve: Update MAINTAINERS list
    can: esd_usb2: fix memory leak
    can: ems_usb: fix memory leak
    can: usb_8dev: fix memory leak
    can: mcba_usb_start(): add missing urb->transfer_dma initialization
    can: hi311x: fix a signedness bug in hi3110_cmd()
    MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver
    bpf: Fix leakage due to insufficient speculative store bypass mitigation
    bpf: Introduce BPF nospec instruction for mitigating Spectre v4
    sis900: Fix missing pci_disable_device() in probe and remove
    net: let flow have same hash in two directions
    nfc: nfcsim: fix use after free during module unload
    tulip: windbond-840: Fix missing pci_disable_device() in probe and remove
    sctp: fix return value check in __sctp_rcv_asconf_lookup
    nfc: s3fwrn5: fix undefined parameter values in dev_err()
    net/mlx5: Fix mlx5_vport_tbl_attr chain from u16 to u32
    net/mlx5e: Fix nullptr in mlx5e_hairpin_get_mdev()
    net/mlx5: Unload device upon firmware fatal error
    net/mlx5e: Fix page allocation failure for ptp-RQ over SF
    net/mlx5e: Fix page allocation failure for trap-RQ over SF
    ...

    Linus Torvalds
     
  • event_trace_add_tracer() can fail. In this case, it leads to a crash in
    start_creating() with the call stack below. Handle the error scenario
    properly in trace_array_create_dir().

    Call trace:
    down_write+0x7c/0x204
    start_creating.25017+0x6c/0x194
    tracefs_create_file+0xc4/0x2b4
    init_tracer_tracefs+0x5c/0x940
    trace_array_create_dir+0x58/0xb4
    trace_array_create+0x1bc/0x2b8
    trace_array_get_by_name+0xdc/0x18c

    Link: https://lkml.kernel.org/r/1627651386-21315-1-git-send-email-kamaagra@codeaurora.org

    Cc: stable@vger.kernel.org
    Fixes: 4114fbfd02f1 ("tracing: Enable creating new instance early boot")
    Signed-off-by: Kamal Agrawal
    Signed-off-by: Steven Rostedt (VMware)

    Kamal Agrawal
     

29 Jul, 2021

4 commits

  • Daniel Borkmann says:

    ====================
    pull-request: bpf 2021-07-29

    The following pull-request contains BPF updates for your *net* tree.

    We've added 9 non-merge commits during the last 14 day(s) which contain
    a total of 20 files changed, 446 insertions(+), 138 deletions(-).

    The main changes are:

    1) Fix UBSAN out-of-bounds splat for showing XDP link fdinfo, from Lorenz Bauer.

    2) Fix insufficient Spectre v4 mitigation in BPF runtime, from Daniel Borkmann,
    Piotr Krysiuk and Benedict Schlueter.

    3) Batch of fixes for BPF sockmap found under stress testing, from John Fastabend.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Spectre v4 gadgets make use of memory disambiguation, which is a set of
    techniques that execute memory access instructions, that is, loads and
    stores, out of program order; Intel's optimization manual, section 2.4.4.5:

    A load instruction micro-op may depend on a preceding store. Many
    microarchitectures block loads until all preceding store addresses are
    known. The memory disambiguator predicts which loads will not depend on
    any previous stores. When the disambiguator predicts that a load does
    not have such a dependency, the load takes its data from the L1 data
    cache. Eventually, the prediction is verified. If an actual conflict is
    detected, the load and all succeeding instructions are re-executed.

    af86ca4e3088 ("bpf: Prevent memory disambiguation attack") tried to mitigate
    this attack by sanitizing the memory locations through preemptive "fast"
    (low latency) stores of zero prior to the actual "slow" (high latency) store
    of a pointer value such that upon dependency misprediction the CPU then
    speculatively executes the load of the pointer value and retrieves the zero
    value instead of the attacker controlled scalar value previously stored at
    that location, meaning, subsequent access in the speculative domain is then
    redirected to the "zero page".

    The sanitized preemptive store of zero prior to the actual "slow" store is
    done through a simple ST instruction based on r10 (frame pointer) with
    relative offset to the stack location that the verifier has been tracking
    on the original used register for STX, which does not have to be r10. Thus,
    there are no memory dependencies for this store, since it's only using r10
    and immediate constant of zero; hence af86ca4e3088 /assumed/ a low latency
    operation.

    However, a recent attack demonstrated that this mitigation is not sufficient
    since the preemptive store of zero could also be turned into a "slow" store
    and is thus bypassed as well:

    [...]
    // r2 = oob address (e.g. scalar)
    // r7 = pointer to map value
    31: (7b) *(u64 *)(r10 -16) = r2
    // r9 will remain "fast" register, r10 will become "slow" register below
    32: (bf) r9 = r10
    // JIT maps BPF reg to x86 reg:
    // r9 -> r15 (callee saved)
    // r10 -> rbp
    // train store forward prediction to break dependency link between both r9
    // and r10 by evicting them from the predictor's LRU table.
    33: (61) r0 = *(u32 *)(r7 +24576)
    34: (63) *(u32 *)(r7 +29696) = r0
    35: (61) r0 = *(u32 *)(r7 +24580)
    36: (63) *(u32 *)(r7 +29700) = r0
    37: (61) r0 = *(u32 *)(r7 +24584)
    38: (63) *(u32 *)(r7 +29704) = r0
    39: (61) r0 = *(u32 *)(r7 +24588)
    40: (63) *(u32 *)(r7 +29708) = r0
    [...]
    543: (61) r0 = *(u32 *)(r7 +25596)
    544: (63) *(u32 *)(r7 +30716) = r0
    // prepare call to bpf_ringbuf_output() helper. the latter will cause rbp
    // to spill to stack memory while r13/r14/r15 (all callee saved regs) remain
    // in hardware registers. rbp becomes slow due to push/pop latency. below is
    // disasm of bpf_ringbuf_output() helper for better visual context:
    //
    // ffffffff8117ee20: 41 54 push r12
    // ffffffff8117ee22: 55 push rbp
    // ffffffff8117ee23: 53 push rbx
    // ffffffff8117ee24: 48 f7 c1 fc ff ff ff test rcx,0xfffffffffffffffc
    // ffffffff8117ee2b: 0f 85 af 00 00 00 jne ffffffff8117eee0 spilled_ptr)
    condition, for example, the previous content of that memory location could
    also be a pointer to map or map value. Without the fix, a speculative
    store bypass is not mitigated in such precondition and can then lead to
    a type confusion in the speculative domain leaking kernel memory near
    these pointer types.

    While brainstorming on various alternative mitigation possibilities, we also
    stumbled upon a retrospective from Chrome developers [0]:

    [...] For variant 4, we implemented a mitigation to zero the unused memory
    of the heap prior to allocation, which cost about 1% when done concurrently
    and 4% for scavenging. Variant 4 defeats everything we could think of. We
    explored more mitigations for variant 4 but the threat proved to be more
    pervasive and dangerous than we anticipated. For example, stack slots used
    by the register allocator in the optimizing compiler could be subject to
    type confusion, leading to pointer crafting. Mitigating type confusion for
    stack slots alone would have required a complete redesign of the backend of
    the optimizing compiler, perhaps man years of work, without a guarantee of
    completeness. [...]

    From BPF side, the problem space is reduced, however, options are rather
    limited. One idea that has been explored was to xor-obfuscate pointer spills
    to the BPF stack:

    [...]
    // preoccupy the CPU store port by running sequence of dummy stores.
    [...]
    2106: (63) *(u32 *)(r7 +29796) = r0
    2107: (63) *(u32 *)(r7 +29800) = r0
    2108: (63) *(u32 *)(r7 +29804) = r0
    2109: (63) *(u32 *)(r7 +29808) = r0
    2110: (63) *(u32 *)(r7 +29812) = r0
    // overwrite scalar with dummy pointer; xored with random 'secret' value
    // of 943576462 before store ...
    2111: (b4) w11 = 943576462
    2112: (af) r11 ^= r7
    2113: (7b) *(u64 *)(r10 -16) = r11
    2114: (79) r11 = *(u64 *)(r10 -16)
    2115: (b4) w2 = 943576462
    2116: (af) r2 ^= r11
    // ... and restored with the same 'secret' value with the help of AX reg.
    2117: (71) r3 = *(u8 *)(r2 +0)
    [...]

    While the above would not prevent speculation, it would make data leakage
    infeasible by directing it to random locations. In order to be effective
    and prevent type confusion under speculation, such random secret would have
    to be regenerated for each store. The additional complexity involved for a
    tracking mechanism that prevents jumps such that restoring spilled pointers
    would not get corrupted is not worth the gain for unprivileged. Hence, the
    fix in here eventually opted for emitting a non-public BPF_ST | BPF_NOSPEC
    instruction which the x86 JIT translates into a lfence opcode. Inserting the
    latter in between the store and load instruction is one of the mitigations
    options [1]. The x86 instruction manual notes:

    [...] An LFENCE that follows an instruction that stores to memory might
    complete before the data being stored have become globally visible. [...]

    The latter meaning that the preceding store instruction finished execution
    and the store is at minimum guaranteed to be in the CPU's store queue, but
    it's not guaranteed to be in that CPU's L1 cache at that point (globally
    visible). The latter would only be guaranteed via sfence. So the load which
    is guaranteed to execute after the lfence for that local CPU would have to
    rely on store-to-load forwarding. [2], in section 2.3 on store buffers says:

    [...] For every store operation that is added to the ROB, an entry is
    allocated in the store buffer. This entry requires both the virtual and
    physical address of the target. Only if there is no free entry in the store
    buffer, the frontend stalls until there is an empty slot available in the
    store buffer again. Otherwise, the CPU can immediately continue adding
    subsequent instructions to the ROB and execute them out of order. On Intel
    CPUs, the store buffer has up to 56 entries. [...]

    One small upside on the fix is that it lifts constraints from af86ca4e3088
    where the sanitize_stack_off relative to r10 must be the same when coming
    from different paths. The BPF_ST | BPF_NOSPEC gets emitted after a BPF_STX
    or BPF_ST instruction. This happens either when we store a pointer or data
    value to the BPF stack for the first time, or upon later pointer spills.
    The former needs to be enforced since otherwise stale stack data could be
    leaked under speculation as outlined earlier. For non-x86 JITs the BPF_ST |
    BPF_NOSPEC mapping is currently optimized away, but others could emit a
    speculation barrier as well if necessary. For real-world unprivileged
    programs e.g. generated by LLVM, pointer spill/fill is only generated upon
    register pressure and LLVM only tries to do that for pointers which are not
    used often. The program main impact will be the initial BPF_ST | BPF_NOSPEC
    sanitation for the STACK_INVALID case when the first write to a stack slot
    occurs e.g. upon map lookup. In future we might refine ways to mitigate
    the latter cost.

    [0] https://arxiv.org/pdf/1902.05178.pdf
    [1] https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639/
    [2] https://arxiv.org/pdf/1905.05725.pdf

    Fixes: af86ca4e3088 ("bpf: Prevent memory disambiguation attack")
    Fixes: f7cf25b2026d ("bpf: track spill/fill of constants")
    Co-developed-by: Piotr Krysiuk
    Co-developed-by: Benedict Schlueter
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Piotr Krysiuk
    Signed-off-by: Benedict Schlueter
    Acked-by: Alexei Starovoitov

    Daniel Borkmann
     
  • In case of JITs, each of the JIT backends compiles the BPF nospec instruction
    /either/ to a machine instruction which emits a speculation barrier /or/ to
    /no/ machine instruction in case the underlying architecture is not affected
    by Speculative Store Bypass or has different mitigations in place already.

    This covers both x86 and (implicitly) arm64: In case of x86, we use 'lfence'
    instruction for mitigation. In case of arm64, we rely on the firmware mitigation
    as controlled via the ssbd kernel parameter. Whenever the mitigation is enabled,
    it works for all of the kernel code with no need to provide any additional
    instructions here (hence only comment in arm64 JIT). Other archs can follow
    as needed. The BPF nospec instruction is specifically targeting Spectre v4
    since i) we don't use a serialization barrier for the Spectre v1 case, and
    ii) mitigation instructions for v1 and v4 might be different on some archs.

    The BPF nospec is required for a future commit, where the BPF verifier does
    annotate intermediate BPF programs with speculation barriers.
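
    For context, a C-level equivalent of the x86 barrier mentioned above
    could look like the helper below; the name is made up and this is not the
    JIT code itself:

    static inline void ssb_barrier(void)
    {
            /* The x86 JIT is described as emitting this opcode for BPF_NOSPEC. */
            asm volatile("lfence" ::: "memory");
    }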

    Co-developed-by: Piotr Krysiuk
    Co-developed-by: Benedict Schlueter
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Piotr Krysiuk
    Signed-off-by: Benedict Schlueter
    Acked-by: Alexei Starovoitov

    Daniel Borkmann
     
  • The race happens because put_ucounts() doesn't use spinlock and
    get_ucounts is not under spinlock:

    CPU0                                        CPU1
    ----                                        ----
    alloc_ucounts()                             put_ucounts()

    spin_lock_irq(&ucounts_lock);
    ucounts = find_ucounts(ns, uid, hashent);

                                                atomic_dec_and_test(&ucounts->count))

    spin_unlock_irq(&ucounts_lock);

                                                spin_lock_irqsave(&ucounts_lock, flags);
                                                hlist_del_init(&ucounts->node);
                                                spin_unlock_irqrestore(&ucounts_lock, flags);
                                                kfree(ucounts);

    ucounts = get_ucounts(ucounts);

    ==================================================================
    BUG: KASAN: use-after-free in instrument_atomic_read_write include/linux/instrumented.h:101 [inline]
    BUG: KASAN: use-after-free in atomic_add_negative include/asm-generic/atomic-instrumented.h:556 [inline]
    BUG: KASAN: use-after-free in get_ucounts kernel/ucount.c:152 [inline]
    BUG: KASAN: use-after-free in get_ucounts kernel/ucount.c:150 [inline]
    BUG: KASAN: use-after-free in alloc_ucounts+0x19b/0x5b0 kernel/ucount.c:188
    Write of size 4 at addr ffff88802821e41c by task syz-executor.4/16785

    CPU: 1 PID: 16785 Comm: syz-executor.4 Not tainted 5.14.0-rc1-next-20210712-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:88 [inline]
    dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:105
    print_address_description.constprop.0.cold+0x6c/0x309 mm/kasan/report.c:233
    __kasan_report mm/kasan/report.c:419 [inline]
    kasan_report.cold+0x83/0xdf mm/kasan/report.c:436
    check_region_inline mm/kasan/generic.c:183 [inline]
    kasan_check_range+0x13d/0x180 mm/kasan/generic.c:189
    instrument_atomic_read_write include/linux/instrumented.h:101 [inline]
    atomic_add_negative include/asm-generic/atomic-instrumented.h:556 [inline]
    get_ucounts kernel/ucount.c:152 [inline]
    get_ucounts kernel/ucount.c:150 [inline]
    alloc_ucounts+0x19b/0x5b0 kernel/ucount.c:188
    set_cred_ucounts+0x171/0x3a0 kernel/cred.c:684
    __sys_setuid+0x285/0x400 kernel/sys.c:623
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x44/0xae
    RIP: 0033:0x4665d9
    Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007fde54097188 EFLAGS: 00000246 ORIG_RAX: 0000000000000069
    RAX: ffffffffffffffda RBX: 000000000056bf80 RCX: 00000000004665d9
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000ff
    RBP: 00000000004bfcb9 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 000000000056bf80
    R13: 00007ffc8655740f R14: 00007fde54097300 R15: 0000000000022000

    Allocated by task 16784:
    kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
    kasan_set_track mm/kasan/common.c:46 [inline]
    set_alloc_info mm/kasan/common.c:434 [inline]
    ____kasan_kmalloc mm/kasan/common.c:513 [inline]
    ____kasan_kmalloc mm/kasan/common.c:472 [inline]
    __kasan_kmalloc+0x9b/0xd0 mm/kasan/common.c:522
    kmalloc include/linux/slab.h:591 [inline]
    kzalloc include/linux/slab.h:721 [inline]
    alloc_ucounts+0x23d/0x5b0 kernel/ucount.c:169
    set_cred_ucounts+0x171/0x3a0 kernel/cred.c:684
    __sys_setuid+0x285/0x400 kernel/sys.c:623
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    Freed by task 16785:
    kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
    kasan_set_track+0x1c/0x30 mm/kasan/common.c:46
    kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360
    ____kasan_slab_free mm/kasan/common.c:366 [inline]
    ____kasan_slab_free mm/kasan/common.c:328 [inline]
    __kasan_slab_free+0xfb/0x130 mm/kasan/common.c:374
    kasan_slab_free include/linux/kasan.h:229 [inline]
    slab_free_hook mm/slub.c:1650 [inline]
    slab_free_freelist_hook+0xdf/0x240 mm/slub.c:1675
    slab_free mm/slub.c:3235 [inline]
    kfree+0xeb/0x650 mm/slub.c:4295
    put_ucounts kernel/ucount.c:200 [inline]
    put_ucounts+0x117/0x150 kernel/ucount.c:192
    put_cred_rcu+0x27a/0x520 kernel/cred.c:124
    rcu_do_batch kernel/rcu/tree.c:2550 [inline]
    rcu_core+0x7ab/0x1380 kernel/rcu/tree.c:2785
    __do_softirq+0x29b/0x9c2 kernel/softirq.c:558

    Last potentially related work creation:
    kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
    kasan_record_aux_stack+0xe5/0x110 mm/kasan/generic.c:348
    insert_work+0x48/0x370 kernel/workqueue.c:1332
    __queue_work+0x5c1/0xed0 kernel/workqueue.c:1498
    queue_work_on+0xee/0x110 kernel/workqueue.c:1525
    queue_work include/linux/workqueue.h:507 [inline]
    call_usermodehelper_exec+0x1f0/0x4c0 kernel/umh.c:435
    kobject_uevent_env+0xf8f/0x1650 lib/kobject_uevent.c:618
    netdev_queue_add_kobject net/core/net-sysfs.c:1621 [inline]
    netdev_queue_update_kobjects+0x374/0x450 net/core/net-sysfs.c:1655
    register_queue_kobjects net/core/net-sysfs.c:1716 [inline]
    netdev_register_kobject+0x35a/0x430 net/core/net-sysfs.c:1959
    register_netdevice+0xd33/0x1500 net/core/dev.c:10331
    nsim_init_netdevsim drivers/net/netdevsim/netdev.c:317 [inline]
    nsim_create+0x381/0x4d0 drivers/net/netdevsim/netdev.c:364
    __nsim_dev_port_add+0x32e/0x830 drivers/net/netdevsim/dev.c:1295
    nsim_dev_port_add_all+0x53/0x150 drivers/net/netdevsim/dev.c:1355
    nsim_dev_probe+0xcb5/0x1190 drivers/net/netdevsim/dev.c:1496
    call_driver_probe drivers/base/dd.c:517 [inline]
    really_probe+0x23c/0xcd0 drivers/base/dd.c:595
    __driver_probe_device+0x338/0x4d0 drivers/base/dd.c:747
    driver_probe_device+0x4c/0x1a0 drivers/base/dd.c:777
    __device_attach_driver+0x20b/0x2f0 drivers/base/dd.c:894
    bus_for_each_drv+0x15f/0x1e0 drivers/base/bus.c:427
    __device_attach+0x228/0x4a0 drivers/base/dd.c:965
    bus_probe_device+0x1e4/0x290 drivers/base/bus.c:487
    device_add+0xc2f/0x2180 drivers/base/core.c:3356
    nsim_bus_dev_new drivers/net/netdevsim/bus.c:431 [inline]
    new_device_store+0x436/0x710 drivers/net/netdevsim/bus.c:298
    bus_attr_store+0x72/0xa0 drivers/base/bus.c:122
    sysfs_kf_write+0x110/0x160 fs/sysfs/file.c:139
    kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296
    call_write_iter include/linux/fs.h:2152 [inline]
    new_sync_write+0x426/0x650 fs/read_write.c:518
    vfs_write+0x75a/0xa40 fs/read_write.c:605
    ksys_write+0x12d/0x250 fs/read_write.c:658
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    Second to last potentially related work creation:
    kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
    kasan_record_aux_stack+0xe5/0x110 mm/kasan/generic.c:348
    insert_work+0x48/0x370 kernel/workqueue.c:1332
    __queue_work+0x5c1/0xed0 kernel/workqueue.c:1498
    queue_work_on+0xee/0x110 kernel/workqueue.c:1525
    queue_work include/linux/workqueue.h:507 [inline]
    call_usermodehelper_exec+0x1f0/0x4c0 kernel/umh.c:435
    kobject_uevent_env+0xf8f/0x1650 lib/kobject_uevent.c:618
    kobject_synth_uevent+0x701/0x850 lib/kobject_uevent.c:208
    uevent_store+0x20/0x50 drivers/base/core.c:2371
    dev_attr_store+0x50/0x80 drivers/base/core.c:2072
    sysfs_kf_write+0x110/0x160 fs/sysfs/file.c:139
    kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296
    call_write_iter include/linux/fs.h:2152 [inline]
    new_sync_write+0x426/0x650 fs/read_write.c:518
    vfs_write+0x75a/0xa40 fs/read_write.c:605
    ksys_write+0x12d/0x250 fs/read_write.c:658
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    The buggy address belongs to the object at ffff88802821e400
    which belongs to the cache kmalloc-192 of size 192
    The buggy address is located 28 bytes inside of
    192-byte region [ffff88802821e400, ffff88802821e4c0)
    The buggy address belongs to the page:
    page:ffffea0000a08780 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2821e
    flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
    raw: 00fff00000000200 dead000000000100 dead000000000122 ffff888010841a00
    raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected
    page_owner tracks the page as allocated
    page last allocated via order 0, migratetype Unmovable, gfp_mask 0x12cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY), pid 1, ts 12874702440, free_ts 12637793385
    prep_new_page mm/page_alloc.c:2433 [inline]
    get_page_from_freelist+0xa72/0x2f80 mm/page_alloc.c:4166
    __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5374
    alloc_page_interleave+0x1e/0x200 mm/mempolicy.c:2119
    alloc_pages+0x238/0x2a0 mm/mempolicy.c:2242
    alloc_slab_page mm/slub.c:1713 [inline]
    allocate_slab+0x32b/0x4c0 mm/slub.c:1853
    new_slab mm/slub.c:1916 [inline]
    new_slab_objects mm/slub.c:2662 [inline]
    ___slab_alloc+0x4ba/0x820 mm/slub.c:2825
    __slab_alloc.constprop.0+0xa7/0xf0 mm/slub.c:2865
    slab_alloc_node mm/slub.c:2947 [inline]
    slab_alloc mm/slub.c:2989 [inline]
    __kmalloc+0x312/0x330 mm/slub.c:4133
    kmalloc include/linux/slab.h:596 [inline]
    kzalloc include/linux/slab.h:721 [inline]
    __register_sysctl_table+0x112/0x1090 fs/proc/proc_sysctl.c:1318
    rds_tcp_init_net+0x1db/0x4f0 net/rds/tcp.c:551
    ops_init+0xaf/0x470 net/core/net_namespace.c:140
    __register_pernet_operations net/core/net_namespace.c:1137 [inline]
    register_pernet_operations+0x35a/0x850 net/core/net_namespace.c:1214
    register_pernet_device+0x26/0x70 net/core/net_namespace.c:1301
    rds_tcp_init+0x77/0xe0 net/rds/tcp.c:717
    do_one_initcall+0x103/0x650 init/main.c:1285
    do_initcall_level init/main.c:1360 [inline]
    do_initcalls init/main.c:1376 [inline]
    do_basic_setup init/main.c:1396 [inline]
    kernel_init_freeable+0x6b8/0x741 init/main.c:1598
    page last free stack trace:
    reset_page_owner include/linux/page_owner.h:24 [inline]
    free_pages_prepare mm/page_alloc.c:1343 [inline]
    free_pcp_prepare+0x312/0x7d0 mm/page_alloc.c:1394
    free_unref_page_prepare mm/page_alloc.c:3329 [inline]
    free_unref_page+0x19/0x690 mm/page_alloc.c:3408
    __vunmap+0x783/0xb70 mm/vmalloc.c:2587
    free_work+0x58/0x70 mm/vmalloc.c:82
    process_one_work+0x98d/0x1630 kernel/workqueue.c:2276
    worker_thread+0x658/0x11f0 kernel/workqueue.c:2422
    kthread+0x3e5/0x4d0 kernel/kthread.c:319
    ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

    Memory state around the buggy address:
    ffff88802821e300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ffff88802821e380: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc
    >ffff88802821e400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    ffff88802821e480: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
    ffff88802821e500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ==================================================================

    - The race fix has two parts.
    * Changing the code to guarantee that ucounts->count is only decremented
    when ucounts_lock is held. This guarantees that find_ucounts
    will never find a structure with a zero reference count.
    * Changing alloc_ucounts to increment ucounts->count while
    ucounts_lock is held. This guarantees the reference count on the
    found data structure will not be decremented to zero (and the data
    structure freed) before the reference count is incremented.
    -- Eric Biederman
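
    A simplified sketch of the first part of the fix as described above,
    assuming the stock atomic_dec_and_lock_irqsave() helper; illustrative,
    not the exact patch:

    void put_ucounts(struct ucounts *ucounts)
    {
            unsigned long flags;

            /* Drop the last reference only while holding ucounts_lock, so
             * find_ucounts() can never observe a zero refcount. */
            if (atomic_dec_and_lock_irqsave(&ucounts->count, &ucounts_lock, flags)) {
                    hlist_del_init(&ucounts->node);
                    spin_unlock_irqrestore(&ucounts_lock, flags);
                    kfree(ucounts);
            }
    }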

    Reported-by: syzbot+01985d7909f9468f013c@syzkaller.appspotmail.com
    Reported-by: syzbot+59dd63761094a80ad06d@syzkaller.appspotmail.com
    Reported-by: syzbot+6cd79f45bb8fa1c9eeae@syzkaller.appspotmail.com
    Reported-by: syzbot+b6e65bd125a05f803d6b@syzkaller.appspotmail.com
    Fixes: b6c336528926 ("Use atomic_t for ucounts reference counting")
    Cc: Hillf Danton
    Signed-off-by: Alexey Gladkov
    Link: https://lkml.kernel.org/r/7b2ace1759b281cdd2d66101d6b305deef722efb.1627397820.git.legion@kernel.org
    Signed-off-by: Eric W. Biederman

    Alexey Gladkov
     

28 Jul, 2021

3 commits

  • Pull cgroup fix from Tejun Heo:
    "Fix leak of filesystem context root which is triggered by LTP.

    Not too likely to be a problem in non-testing environments"

    * 'for-5.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup1: fix leaked context root causing sporadic NULL deref in LTP

    Linus Torvalds
     
  • Pull workqueue fix from Tejun Heo:
    "Fix a use-after-free in allocation failure handling path"

    * 'for-5.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: fix UAF in pwq_unbound_release_workfn()

    Linus Torvalds
     
  • syzbot reported KCSAN data races vs. timer_base::timer_running being set to
    NULL without holding base::lock in expire_timers().

    This looks innocent and most reads are clearly not problematic, but
    Frederic identified an issue which is:

    int data = 0;

    void timer_func(struct timer_list *t)
    {
            data = 1;
    }

    CPU 0                                               CPU 1
    ------------------------------                      --------------------------
    base = lock_timer_base(timer, &flags);              raw_spin_unlock(&base->lock);
    if (base->running_timer != timer)                   call_timer_fn(timer, fn, baseclk);
      ret = detach_if_pending(timer, base, true);       base->running_timer = NULL;
    raw_spin_unlock_irqrestore(&base->lock, flags);     raw_spin_lock(&base->lock);

    x = data;

    If the timer has previously executed on CPU 1, then CPU 0 can observe
    base->running_timer == NULL and return, assuming the timer has completed,
    but that is not guaranteed on all architectures. The comment for
    del_timer_sync() makes that guarantee. Moving the assignment under
    base->lock prevents this.

    For non-RT kernel it's performance wise completely irrelevant whether the
    store happens before or after taking the lock. For an RT kernel moving the
    store under the lock requires an extra unlock/lock pair in the case that
    there is a waiter for the timer, but that's not the end of the world.
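
    A simplified sketch of where the store now sits in the expiry path, per
    the description above (non-RT case, details omitted):

    raw_spin_unlock_irq(&base->lock);
    call_timer_fn(timer, fn, baseclk);
    raw_spin_lock_irq(&base->lock);
    base->running_timer = NULL;         /* cleared while base->lock is held */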

    Reported-by: syzbot+aa7c2385d46c5eba0b89@syzkaller.appspotmail.com
    Reported-by: syzbot+abea4558531bae1ba9fe@syzkaller.appspotmail.com
    Fixes: 030dcdd197d7 ("timers: Prepare support for PREEMPT_RT")
    Signed-off-by: Thomas Gleixner
    Tested-by: Sebastian Andrzej Siewior
    Link: https://lore.kernel.org/r/87lfea7gw8.fsf@nanos.tec.linutronix.de
    Cc: stable@vger.kernel.org

    Thomas Gleixner
     

26 Jul, 2021

4 commits

  • gcc doesn't care, but clang quite reasonably pointed out that the recent
    commit e9ba16e68cce ("smpboot: Mark idle_init() as __always_inlined to
    work around aggressive compiler un-inlining") did some really odd
    things:

    kernel/smpboot.c:50:20: warning: duplicate 'inline' declaration specifier [-Wduplicate-decl-specifier]
    static inline void __always_inline idle_init(unsigned int cpu)
    ^

    which not only has that duplicate inlining specifier, but the new
    __always_inline was put in the wrong place of the function definition.

    We put the storage class specifiers (ie things like "static" and
    "extern") first, and the type information after that. And while the
    compiler may not care, we put the inline specifier before the types.

    So it should be just

    static __always_inline void idle_init(unsigned int cpu)

    instead.

    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull timer fixes from Thomas Gleixner:
    "A small set of timer related fixes:

    - Plug a race between rearm and process tick in the posix CPU timers
    code

    - Make the optimization to avoid recalculation of the next timer
    interrupt work correctly when there are no timers pending"

    * tag 'timers-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    timers: Fix get_next_timer_interrupt() with no timers pending
    posix-cpu-timers: Fix rearm racing against process tick

    Linus Torvalds
     
  • Pull core fix from Thomas Gleixner:
    "A single update for the boot code to prevent aggressive un-inlining
    which causes a section mismatch"

    * tag 'core-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    smpboot: Mark idle_init() as __always_inlined to work around aggressive compiler un-inlining

    Linus Torvalds
     
  • Pull dma-mapping fix from Christoph Hellwig:

    - handle vmalloc addresses in dma_common_{mmap,get_sgtable} (Roman
    Skakun)

    * tag 'dma-mapping-5.14-1' of git://git.infradead.org/users/hch/dma-mapping:
    dma-mapping: handle vmalloc addresses in dma_common_{mmap,get_sgtable}

    Linus Torvalds
     

24 Jul, 2021

1 commit

  • Pull tracing fixes from Steven Rostedt:

    - Fix deadloop in ring buffer because of using stale "read" variable

    - Fix synthetic event use of field_pos as boolean and not an index

    - Fixed histogram special var "cpu" overriding event fields called
    "cpu"

    - Cleaned up error prone logic in alloc_synth_event()

    - Removed call to synchronize_rcu_tasks_rude() when not needed

    - Removed redundant initialization of a local variable "ret"

    - Fixed kernel crash when updating tracepoint callbacks of different
    priorities.

    * tag 'trace-v5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracepoints: Update static_call before tp_funcs when adding a tracepoint
    ftrace: Remove redundant initialization of variable ret
    ftrace: Avoid synchronize_rcu_tasks_rude() call when not necessary
    tracing: Clean up alloc_synth_event()
    tracing/histogram: Rename "cpu" to "common_cpu"
    tracing: Synthetic event field_pos is an index not a boolean
    tracing: Fix bug in rb_per_cpu_empty() that might cause deadloop.

    Linus Torvalds
     

23 Jul, 2021

7 commits

  • Because of the significant overhead that retpolines pose on indirect
    calls, the tracepoint code was updated to use the new "static_calls" that
    can modify the running code to directly call a function instead of using
    an indirect caller, and this function can be changed at runtime.

    In the tracepoint code that calls all the registered callbacks that are
    attached to a tracepoint, the following is done:

    it_func_ptr = rcu_dereference_raw((&__tracepoint_##name)->funcs);
    if (it_func_ptr) {
            __data = (it_func_ptr)->data;
            static_call(tp_func_##name)(__data, args);
    }

    If there's just a single callback, the static_call is updated to just call
    that callback directly. Once another handler is added, then the static
    caller is updated to call the iterator, that simply loops over all the
    funcs in the array and calls each of the callbacks like the old method
    using indirect calling.

    The issue was discovered with a race between updating the funcs array and
    updating the static_call. The funcs array was updated first and then the
    static_call was updated. This is not an issue as long as the first element
    in the old array is the same as the first element in the new array. But
    that assumption is incorrect, because callbacks also have a priority
    field, and if there's a callback added that has a higher priority than the
    callback on the old array, then it will become the first callback in the
    new array. This means that it is possible to call the old callback with
    the new callback data element, which can cause a kernel panic.

    static_call = callback1()
    funcs[] = {callback1,data1};
    callback2 has higher priority than callback1

    CPU 1                                       CPU 2
    -----                                       -----

    new_funcs = {callback2,data2},
                {callback1,data1}

    rcu_assign_pointer(tp->funcs, new_funcs);

    /*
     * Now tp->funcs has the new array
     * but the static_call still calls callback1
     */

                                                it_func_ptr = tp->funcs [ new_funcs ]
                                                data = it_func_ptr->data [ data2 ]
                                                static_call(callback1, data);

                                                /* Now callback1 is called with
                                                 * callback2's data */

                                                [ KERNEL PANIC ]

    update_static_call(iterator);

    To prevent this from happening, always switch the static_call to the
    iterator before assigning the tp->funcs to the new array. The iterator will
    always properly match the callback with its data.

    To trigger this bug:

    In one terminal:

    while :; do hackbench 50; done

    In another terminal

    echo 1 > /sys/kernel/tracing/events/sched/sched_waking/enable
    while :; do
    echo 1 > /sys/kernel/tracing/set_event_pid;
    sleep 0.5
    echo 0 > /sys/kernel/tracing/set_event_pid;
    sleep 0.5
    done

    And it doesn't take long to crash. This is because the set_event_pid adds
    a callback to the sched_waking tracepoint with a high priority, which will
    be called before the sched_waking trace event callback is called.

    Note, the removal to a single callback updates the array first, before
    changing the static_call to single callback, which is the proper order as
    the first element in the array is the same as what the static_call is
    being changed to.
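
    A hedged sketch of the ordering the fix enforces when adding a callback;
    the update helper name is illustrative:

    /* 1) Point the static call at the safe iterator first ... */
    tracepoint_update_call(tp, tp_funcs);
    /* 2) ... then publish the new callback array. */
    rcu_assign_pointer(tp->funcs, tp_funcs);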

    Link: https://lore.kernel.org/io-uring/4ebea8f0-58c9-e571-fd30-0ce4f6f09c70@samba.org/

    Cc: stable@vger.kernel.org
    Fixes: d25e37d89dd2f ("tracepoint: Optimize using static_call()")
    Reported-by: Stefan Metzmacher
    tested-by: Stefan Metzmacher
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • The variable ret is being initialized with a value that is never
    read; it is updated later on. The assignment is redundant and can be
    removed.

    Link: https://lkml.kernel.org/r/20210721120915.122278-1-colin.king@canonical.com

    Addresses-Coverity: ("Unused value")
    Signed-off-by: Colin Ian King
    Signed-off-by: Steven Rostedt (VMware)

    Colin Ian King
     
  • synchronize_rcu_tasks_rude() triggers IPIs and forces rescheduling on
    all CPUs. It is a costly operation and, when targeting nohz_full CPUs,
    very disruptive (hence the name). So avoid calling it when 'old_hash'
    doesn't need to be freed.
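
    As a rough sketch of the idea (illustrative only; the real call site and
    helper names in ftrace may differ), the heavy synchronization is simply
    made conditional on there actually being an old hash to release:

        /* Sketch, not the exact kernel code. */
        if (old_hash) {
                /* Only pay for the IPIs/forced rescheduling when there is an
                 * old hash whose readers must be waited out before freeing. */
                synchronize_rcu_tasks_rude();
                free_ftrace_hash(old_hash);
        }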

    Link: https://lkml.kernel.org/r/20210721114726.1545103-1-nsaenzju@redhat.com

    Signed-off-by: Nicolas Saenz Julienne
    Signed-off-by: Steven Rostedt (VMware)

    Nicolas Saenz Julienne
     
  • alloc_synth_event() currently has the following code to initialize the
    event fields and dynamic_fields:

        for (i = 0, j = 0; i < n_fields; i++) {
                event->fields[i] = fields[i];

                if (fields[i]->is_dynamic) {
                        event->dynamic_fields[j] = fields[i];
                        event->dynamic_fields[j]->field_pos = i;
                        event->dynamic_fields[j++] = fields[i];
                        event->n_dynamic_fields++;
                }
        }

    1) It would make more sense to have all fields keep track of their
    field_pos.

    2) event->dynamic_fields[j] is assigned twice for no reason.

    3) We can move updating event->n_dynamic_fields outside the loop, and just
    assign it to j.

    This combination makes the code much cleaner.
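
    A minimal sketch of the cleaned-up loop that follows from those three points
    (an assumption about the final shape, not necessarily the exact committed
    code):

        for (i = 0, j = 0; i < n_fields; i++) {
                fields[i]->field_pos = i;       /* 1) every field records its position */
                event->fields[i] = fields[i];

                if (fields[i]->is_dynamic)
                        event->dynamic_fields[j++] = fields[i];   /* 2) single assignment */
        }
        event->n_dynamic_fields = j;            /* 3) set once, outside the loop */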

    Link: https://lkml.kernel.org/r/20210721195341.29bb0f77@oasis.local.home

    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • Currently the histogram logic allows the user to write "cpu" in as an
    event field, and it will record the CPU that the event happened on.

    The problem with this is that there are a lot of events that have "cpu"
    as a real field, and using "cpu" to mean the CPU the event ran on makes
    it impossible to run histograms on the "cpu" field of those events.

    For example, if I want to have a histogram on the count of the
    workqueue_queue_work event on its cpu field, running:

    ># echo 'hist:keys=cpu' > events/workqueue/workqueue_queue_work/trigger

    Gives a misleading and wrong result.

    Change the keyword to "common_cpu", since no event should have "common_*"
    fields; that prefix is reserved for fields used by all events. And it
    makes sense here, as common_cpu is a field common to all events.

    Now we can even do:

    ># echo 'hist:keys=common_cpu,cpu if cpu < 100' > events/workqueue/workqueue_queue_work/trigger
    ># cat events/workqueue/workqueue_queue_work/hist
    # event histogram
    #
    # trigger info: hist:keys=common_cpu,cpu:vals=hitcount:sort=hitcount:size=2048 if cpu < 100 [active]
    #

    { common_cpu: 0, cpu: 2 } hitcount: 1
    { common_cpu: 0, cpu: 4 } hitcount: 1
    { common_cpu: 7, cpu: 7 } hitcount: 1
    { common_cpu: 0, cpu: 7 } hitcount: 1
    { common_cpu: 0, cpu: 1 } hitcount: 1
    { common_cpu: 0, cpu: 6 } hitcount: 2
    { common_cpu: 0, cpu: 5 } hitcount: 2
    { common_cpu: 1, cpu: 1 } hitcount: 4
    { common_cpu: 6, cpu: 6 } hitcount: 4
    { common_cpu: 5, cpu: 5 } hitcount: 14
    { common_cpu: 4, cpu: 4 } hitcount: 26
    { common_cpu: 0, cpu: 0 } hitcount: 39
    { common_cpu: 2, cpu: 2 } hitcount: 184

    Now for backward compatibility, I added a trick. If "cpu" is used, and
    the field is not found, it will fall back to "common_cpu" and work as
    it did before. This way, it will still work for old programs that use
    "cpu" to get the actual CPU, but if the event has a "cpu" as a field, it
    will get that event's "cpu" field, which is probably what it wants
    anyway.
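
    In rough pseudo-C, the fallback behaves like the snippet below (the helper
    and variable names are hypothetical, for illustration only; they are not
    the actual hist-trigger code):

        /* Hypothetical helpers, illustrating the described fallback. */
        field = find_event_field(event, field_name);
        if (!field && strcmp(field_name, "cpu") == 0)
                /* No real "cpu" field: fall back to the CPU the event ran on. */
                field_name = "common_cpu";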

    I updated the tracefs/README to include documentation about both the
    common_timestamp and the common_cpu. This way, if that text is present in
    the README, then an application can know that common_cpu is supported over
    just plain "cpu".

    Link: https://lkml.kernel.org/r/20210721110053.26b4f641@oasis.local.home

    Cc: Namhyung Kim
    Cc: Ingo Molnar
    Cc: Andrew Morton
    Cc: stable@vger.kernel.org
    Fixes: 8b7622bf94a44 ("tracing: Add cpu field for hist triggers")
    Reviewed-by: Tom Zanussi
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • Performing the following:

    ># echo 'wakeup_lat s32 pid; u64 delta; char wake_comm[]' > synthetic_events
    ># echo 'hist:keys=pid:__arg__1=common_timestamp.usecs' > events/sched/sched_waking/trigger
    ># echo 'hist:keys=next_pid:pid=next_pid,delta=common_timestamp.usecs-$__arg__1:onmatch(sched.sched_waking).trace(wakeup_lat,$pid,$delta,prev_comm)'\
    > events/sched/sched_switch/trigger
    ># echo 1 > events/synthetic/enable

    Crashed the kernel:

    BUG: kernel NULL pointer dereference, address: 000000000000001b
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    Oops: 0000 [#1] PREEMPT SMP
    CPU: 7 PID: 0 Comm: swapper/7 Not tainted 5.13.0-rc5-test+ #104
    Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
    RIP: 0010:strlen+0x0/0x20
    Code: f6 82 80 2b 0b bc 20 74 11 0f b6 50 01 48 83 c0 01 f6 82 80 2b 0b bc
    20 75 ef c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 3f 00 74 10
    48 89 f8 48 83 c0 01 80 38 9 f8 c3 31
    RSP: 0018:ffffaa75000d79d0 EFLAGS: 00010046
    RAX: 0000000000000002 RBX: ffff9cdb55575270 RCX: 0000000000000000
    RDX: ffff9cdb58c7a320 RSI: ffffaa75000d7b40 RDI: 000000000000001b
    RBP: ffffaa75000d7b40 R08: ffff9cdb40a4f010 R09: ffffaa75000d7ab8
    R10: ffff9cdb4398c700 R11: 0000000000000008 R12: ffff9cdb58c7a320
    R13: ffff9cdb55575270 R14: ffff9cdb58c7a000 R15: 0000000000000018
    FS: 0000000000000000(0000) GS:ffff9cdb5aa00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000000000000001b CR3: 00000000c0612006 CR4: 00000000001706e0
    Call Trace:
    trace_event_raw_event_synth+0x90/0x1d0
    action_trace+0x5b/0x70
    event_hist_trigger+0x4bd/0x4e0
    ? cpumask_next_and+0x20/0x30
    ? update_sd_lb_stats.constprop.0+0xf6/0x840
    ? __lock_acquire.constprop.0+0x125/0x550
    ? find_held_lock+0x32/0x90
    ? sched_clock_cpu+0xe/0xd0
    ? lock_release+0x155/0x440
    ? update_load_avg+0x8c/0x6f0
    ? enqueue_entity+0x18a/0x920
    ? __rb_reserve_next+0xe5/0x460
    ? ring_buffer_lock_reserve+0x12a/0x3f0
    event_triggers_call+0x52/0xe0
    trace_event_buffer_commit+0x1ae/0x240
    trace_event_raw_event_sched_switch+0x114/0x170
    __traceiter_sched_switch+0x39/0x50
    __schedule+0x431/0xb00
    schedule_idle+0x28/0x40
    do_idle+0x198/0x2e0
    cpu_startup_entry+0x19/0x20
    secondary_startup_64_no_verify+0xc2/0xcb

    The reason is that the dynamic events array keeps track of the field
    position of the fields array, via the field_pos variable in the
    synth_field structure. Unfortunately, that field is a boolean for some
    reason, which means any field_pos greater than 1 will be a bug (in this
    case it was 2).
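
    In other words, the declaration needed to change roughly as below (a sketch
    of the implied fix; the surrounding struct members are omitted and the exact
    layout is an assumption):

        /* before: an index stored in a boolean gets truncated to 0/1 */
        bool            field_pos;

        /* after: can hold any index into the event's fields[] array */
        unsigned int    field_pos;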

    Link: https://lkml.kernel.org/r/20210721191008.638bce34@oasis.local.home

    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Ingo Molnar
    Cc: Andrew Morton
    Cc: stable@vger.kernel.org
    Fixes: bd82631d7ccdc ("tracing: Add support for dynamic strings to synthetic events")
    Reviewed-by: Tom Zanussi
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • Pull networking fixes from David Miller:

    1) Fix type of bind option flag in af_xdp, from Baruch Siach.

    2) Fix use after free in bpf_xdp_link_release(), from Xuan Zhuo.

    3) PM refcnt imbalance in r8152, from Takashi Iwai.

    4) Sign extension bug in liquidio, from Colin Ian King.

    5) Missing range check in s390 bpf jit, from Colin Ian King.

    6) Uninit value in caif_seqpkt_sendmsg(), from Ziyang Xuan.

    7) Fix skb page recycling race, from Ilias Apalodimas.

    8) Fix memory leak in tcindex_partial_destroy_work, from Pavel Skripkin.

    9) netrom timer sk refcnt issues, from Nguyen Dinh Phi.

    10) Fix data races around tcp's tfo_active_disable_stamp, from Eric
    Dumazet.

    11) act_skbmod should only operate on ethernet packets, from Peilin Ye.

    12) Fix slab out-of-bounds in fib6_nh_flush_exceptions(), from Paolo
    Abeni.

    13) Fix sparx5 dependencies, from Yajun Deng.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (74 commits)
    dpaa2-switch: seed the buffer pool after allocating the swp
    net: sched: cls_api: Fix the the wrong parameter
    net: sparx5: fix unmet dependencies warning
    net: dsa: tag_ksz: dont let the hardware process the layer 4 checksum
    net: dsa: ensure linearized SKBs in case of tail taggers
    ravb: Remove extra TAB
    ravb: Fix a typo in comment
    net: dsa: sja1105: make VID 4095 a bridge VLAN too
    tcp: disable TFO blackhole logic by default
    sctp: do not update transport pathmtu if SPP_PMTUD_ENABLE is not set
    net: ixp46x: fix ptp build failure
    ibmvnic: Remove the proper scrq flush
    selftests: net: add ESP-in-UDP PMTU test
    udp: check encap socket in __udp_lib_err
    sctp: update active_key for asoc when old key is being replaced
    r8169: Avoid duplicate sysfs entry creation error
    ixgbe: Fix packet corruption due to missing DMA sync
    Revert "qed: fix possible unpaired spin_{un}lock_bh in _qed_mcp_cmd_and_union()"
    ipv6: fix another slab-out-of-bounds in fib6_nh_flush_exceptions
    fsl/fman: Add fibre support
    ...

    Linus Torvalds
     

22 Jul, 2021

3 commits

  • The "rb_per_cpu_empty()" misinterpret the condition (as not-empty) when
    "head_page" and "commit_page" of "struct ring_buffer_per_cpu" points to
    the same buffer page, whose "buffer_data_page" is empty and "read" field
    is non-zero.

    An error scenario can be constructed as follows (from the kernel's
    perspective):

    1. All pages in the buffer have been accessed by reader(s) so that all of
    them will have a non-zero "read" field.

    2. Read and clear all buffer pages so that "rb_num_of_entries()" will
    return 0, indicating there's no more data to read. It is also required
    that the "reader_page", "commit_page" and "tail_page" point to the same
    page, while "head_page" is the next page after them.

    3. Invoke "ring_buffer_lock_reserve()" with a "length" large enough to
    shoot past the end of the current tail buffer page. Now the "head_page",
    "commit_page" and "tail_page" all point to the same page.

    4. Discard the current event with "ring_buffer_discard_commit()", so that
    "head_page", "commit_page" and "tail_page" point to a page whose buffer
    data page is now empty.

    Once the error scenario has been constructed, "tracing_read_pipe" is
    trapped in an infinite loop: "trace_empty()" returns 0 since
    "rb_per_cpu_empty()" returns 0 when it hits the CPU containing the
    constructed ring buffer. Then "trace_find_next_entry_inc()" always
    returns NULL since "rb_num_of_entries()" reports there is no more entry
    to read. Finally "trace_seq_to_user()" returns "-EBUSY", sending
    "tracing_read_pipe" back to the start of the "waitagain" loop.

    I've also written a proof-of-concept script to construct the scenario
    and trigger the bug automatically, you can use it to trace and validate
    my reasoning above:

    https://github.com/aegistudio/RingBufferDetonator.git

    Tests have been carried out on linux kernel 5.14-rc2
    (2734d6c1b1a089fb593ef6a23d4b70903526fe0c), my fixed version
    of the kernel (to verify that the update fixes the bug) and
    some older kernels (to determine the range of affected kernels). Test
    results are also attached to the proof-of-concept repository.

    Link: https://lore.kernel.org/linux-trace-devel/YPaNxsIlb2yjSi5Y@aegistudio/
    Link: https://lore.kernel.org/linux-trace-devel/YPgrN85WL9VyrZ55@aegistudio

    Cc: stable@vger.kernel.org
    Fixes: bf41a158cacba ("ring-buffer: make reentrant")
    Suggested-by: Linus Torvalds
    Signed-off-by: Haoran Luo
    Signed-off-by: Steven Rostedt (VMware)

    Haoran Luo
     
  • I got a UAF report when doing fuzz testing:

    [ 152.880091][ T8030] ==================================================================
    [ 152.881240][ T8030] BUG: KASAN: use-after-free in pwq_unbound_release_workfn+0x50/0x190
    [ 152.882442][ T8030] Read of size 4 at addr ffff88810d31bd00 by task kworker/3:2/8030
    [ 152.883578][ T8030]
    [ 152.883932][ T8030] CPU: 3 PID: 8030 Comm: kworker/3:2 Not tainted 5.13.0+ #249
    [ 152.885014][ T8030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
    [ 152.886442][ T8030] Workqueue: events pwq_unbound_release_workfn
    [ 152.887358][ T8030] Call Trace:
    [ 152.887837][ T8030] dump_stack_lvl+0x75/0x9b
    [ 152.888525][ T8030] ? pwq_unbound_release_workfn+0x50/0x190
    [ 152.889371][ T8030] print_address_description.constprop.10+0x48/0x70
    [ 152.890326][ T8030] ? pwq_unbound_release_workfn+0x50/0x190
    [ 152.891163][ T8030] ? pwq_unbound_release_workfn+0x50/0x190
    [ 152.891999][ T8030] kasan_report.cold.15+0x82/0xdb
    [ 152.892740][ T8030] ? pwq_unbound_release_workfn+0x50/0x190
    [ 152.893594][ T8030] __asan_load4+0x69/0x90
    [ 152.894243][ T8030] pwq_unbound_release_workfn+0x50/0x190
    [ 152.895057][ T8030] process_one_work+0x47b/0x890
    [ 152.895778][ T8030] worker_thread+0x5c/0x790
    [ 152.896439][ T8030] ? process_one_work+0x890/0x890
    [ 152.897163][ T8030] kthread+0x223/0x250
    [ 152.897747][ T8030] ? set_kthread_struct+0xb0/0xb0
    [ 152.898471][ T8030] ret_from_fork+0x1f/0x30
    [ 152.899114][ T8030]
    [ 152.899446][ T8030] Allocated by task 8884:
    [ 152.900084][ T8030] kasan_save_stack+0x21/0x50
    [ 152.900769][ T8030] __kasan_kmalloc+0x88/0xb0
    [ 152.901416][ T8030] __kmalloc+0x29c/0x460
    [ 152.902014][ T8030] alloc_workqueue+0x111/0x8e0
    [ 152.902690][ T8030] __btrfs_alloc_workqueue+0x11e/0x2a0
    [ 152.903459][ T8030] btrfs_alloc_workqueue+0x6d/0x1d0
    [ 152.904198][ T8030] scrub_workers_get+0x1e8/0x490
    [ 152.904929][ T8030] btrfs_scrub_dev+0x1b9/0x9c0
    [ 152.905599][ T8030] btrfs_ioctl+0x122c/0x4e50
    [ 152.906247][ T8030] __x64_sys_ioctl+0x137/0x190
    [ 152.906916][ T8030] do_syscall_64+0x34/0xb0
    [ 152.907535][ T8030] entry_SYSCALL_64_after_hwframe+0x44/0xae
    [ 152.908365][ T8030]
    [ 152.908688][ T8030] Freed by task 8884:
    [ 152.909243][ T8030] kasan_save_stack+0x21/0x50
    [ 152.909893][ T8030] kasan_set_track+0x20/0x30
    [ 152.910541][ T8030] kasan_set_free_info+0x24/0x40
    [ 152.911265][ T8030] __kasan_slab_free+0xf7/0x140
    [ 152.911964][ T8030] kfree+0x9e/0x3d0
    [ 152.912501][ T8030] alloc_workqueue+0x7d7/0x8e0
    [ 152.913182][ T8030] __btrfs_alloc_workqueue+0x11e/0x2a0
    [ 152.913949][ T8030] btrfs_alloc_workqueue+0x6d/0x1d0
    [ 152.914703][ T8030] scrub_workers_get+0x1e8/0x490
    [ 152.915402][ T8030] btrfs_scrub_dev+0x1b9/0x9c0
    [ 152.916077][ T8030] btrfs_ioctl+0x122c/0x4e50
    [ 152.916729][ T8030] __x64_sys_ioctl+0x137/0x190
    [ 152.917414][ T8030] do_syscall_64+0x34/0xb0
    [ 152.918034][ T8030] entry_SYSCALL_64_after_hwframe+0x44/0xae
    [ 152.918872][ T8030]
    [ 152.919203][ T8030] The buggy address belongs to the object at ffff88810d31bc00
    [ 152.919203][ T8030] which belongs to the cache kmalloc-512 of size 512
    [ 152.921155][ T8030] The buggy address is located 256 bytes inside of
    [ 152.921155][ T8030] 512-byte region [ffff88810d31bc00, ffff88810d31be00)
    [ 152.922993][ T8030] The buggy address belongs to the page:
    [ 152.923800][ T8030] page:ffffea000434c600 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10d318
    [ 152.925249][ T8030] head:ffffea000434c600 order:2 compound_mapcount:0 compound_pincount:0
    [ 152.926399][ T8030] flags: 0x57ff00000010200(slab|head|node=1|zone=2|lastcpupid=0x7ff)
    [ 152.927515][ T8030] raw: 057ff00000010200 dead000000000100 dead000000000122 ffff888009c42c80
    [ 152.928716][ T8030] raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
    [ 152.929890][ T8030] page dumped because: kasan: bad access detected
    [ 152.930759][ T8030]
    [ 152.931076][ T8030] Memory state around the buggy address:
    [ 152.931851][ T8030] ffff88810d31bc00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [ 152.932967][ T8030] ffff88810d31bc80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [ 152.934068][ T8030] >ffff88810d31bd00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [ 152.935189][ T8030] ^
    [ 152.935763][ T8030] ffff88810d31bd80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [ 152.936847][ T8030] ffff88810d31be00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    [ 152.937940][ T8030] ==================================================================

    If apply_wqattrs_prepare() fails in alloc_workqueue(), it will call put_pwq(),
    which queues a work item that calls pwq_unbound_release_workfn() and uses the
    'wq'. But the 'wq' allocated in alloc_workqueue() is freed in the error path
    when apply_wqattrs_prepare() fails, so this leads to a use-after-free.

    CPU0                                        CPU1

    alloc_workqueue()
    alloc_and_link_pwqs()
    apply_wqattrs_prepare() fails
    apply_wqattrs_cleanup()
    schedule_work(&pwq->unbound_release_work)
    kfree(wq)
                                                worker_thread()
                                                pwq_unbound_release_workfn()

    Suggested-by: Lai Jiangshan
    Signed-off-by: Yang Yingliang
    Reviewed-by: Lai Jiangshan
    Tested-by: Pavel Skripkin
    Signed-off-by: Tejun Heo

    Yang Yingliang
     
  • Richard reported sporadic (roughly one in 10 or so) null dereferences and
    other strange behaviour for a set of automated LTP tests. Things like:

    BUG: kernel NULL pointer dereference, address: 0000000000000008
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    Oops: 0000 [#1] PREEMPT SMP PTI
    CPU: 0 PID: 1516 Comm: umount Not tainted 5.10.0-yocto-standard #1
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014
    RIP: 0010:kernfs_sop_show_path+0x1b/0x60

    ...or these others:

    RIP: 0010:do_mkdirat+0x6a/0xf0
    RIP: 0010:d_alloc_parallel+0x98/0x510
    RIP: 0010:do_readlinkat+0x86/0x120

    There were other less common instances of some kind of a general scribble
    but the common theme was mount and cgroup and a dubious dentry triggering
    the NULL dereference. I was only able to reproduce it under qemu by
    replicating Richard's setup as closely as possible - I never did get it
    to happen on bare metal, even while keeping everything else the same.

    In commit 71d883c37e8d ("cgroup_do_mount(): massage calling conventions")
    we see this as a part of the overall change:

    --------------
            struct cgroup_subsys *ss;
    -       struct dentry *dentry;

    [...]

    -       dentry = cgroup_do_mount(&cgroup_fs_type, fc->sb_flags, root,
    -                                CGROUP_SUPER_MAGIC, ns);

    [...]

    -       if (percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
    -               struct super_block *sb = dentry->d_sb;
    -               dput(dentry);
    +       ret = cgroup_do_mount(fc, CGROUP_SUPER_MAGIC, ns);
    +       if (!ret && percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
    +               struct super_block *sb = fc->root->d_sb;
    +               dput(fc->root);
                    deactivate_locked_super(sb);
                    msleep(10);
                    return restart_syscall();
            }
    --------------

    In changing from the local "*dentry" variable to using fc->root, we now
    export/leave that dentry pointer in the file context after doing the dput()
    in the unlikely "is_dying" case. With LTP doing a crazy amount of back to
    back mount/unmount [testcases/bin/cgroup_regression_5_1.sh] the unlikely
    becomes slightly likely and then bad things happen.

    A fix would be to not leave the stale reference in fc->root as follows:

    --------------
                    dput(fc->root);
    +               fc->root = NULL;
                    deactivate_locked_super(sb);
    --------------

    ...but then we are just open-coding a duplicate of fc_drop_locked() so we
    simply use that instead.
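
    For reference, fc_drop_locked() in fs/fsopen.c is essentially that same
    sequence (quoted from memory here, so treat the exact body as an
    approximation rather than a verbatim copy):

        void fc_drop_locked(struct fs_context *fc)
        {
                struct super_block *sb = fc->root->d_sb;

                dput(fc->root);
                fc->root = NULL;
                deactivate_locked_super(sb);
        }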

    Cc: Al Viro
    Cc: Tejun Heo
    Cc: Zefan Li
    Cc: Johannes Weiner
    Cc: stable@vger.kernel.org # v5.1+
    Reported-by: Richard Purdie
    Fixes: 71d883c37e8d ("cgroup_do_mount(): massage calling conventions")
    Signed-off-by: Paul Gortmaker
    Signed-off-by: Tejun Heo

    Paul Gortmaker