16 Feb, 2022

9 commits

  • commit 0764db9b49c932b89ee4d9e3236dff4bb07b4a66 upstream.

    Alexander reported a circular lock dependency revealed by the mmap1 ltp
    test:

    LOCKDEP_CIRCULAR (suite: ltp, case: mtest06 (mmap1))
    WARNING: possible circular locking dependency detected
    5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1 Not tainted
    ------------------------------------------------------
    mmap1/202299 is trying to acquire lock:
    00000001892c0188 (css_set_lock){..-.}-{2:2}, at: obj_cgroup_release+0x4a/0xe0
    but task is already holding lock:
    00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180
    which lock already depends on the new lock.
    the existing dependency chain (in reverse order) is:
    -> #1 (&sighand->siglock){-.-.}-{2:2}:
    __lock_acquire+0x604/0xbd8
    lock_acquire.part.0+0xe2/0x238
    lock_acquire+0xb0/0x200
    _raw_spin_lock_irqsave+0x6a/0xd8
    __lock_task_sighand+0x90/0x190
    cgroup_freeze_task+0x2e/0x90
    cgroup_migrate_execute+0x11c/0x608
    cgroup_update_dfl_csses+0x246/0x270
    cgroup_subtree_control_write+0x238/0x518
    kernfs_fop_write_iter+0x13e/0x1e0
    new_sync_write+0x100/0x190
    vfs_write+0x22c/0x2d8
    ksys_write+0x6c/0xf8
    __do_syscall+0x1da/0x208
    system_call+0x82/0xb0
    -> #0 (css_set_lock){..-.}-{2:2}:
    check_prev_add+0xe0/0xed8
    validate_chain+0x736/0xb20
    __lock_acquire+0x604/0xbd8
    lock_acquire.part.0+0xe2/0x238
    lock_acquire+0xb0/0x200
    _raw_spin_lock_irqsave+0x6a/0xd8
    obj_cgroup_release+0x4a/0xe0
    percpu_ref_put_many.constprop.0+0x150/0x168
    drain_obj_stock+0x94/0xe8
    refill_obj_stock+0x94/0x278
    obj_cgroup_charge+0x164/0x1d8
    kmem_cache_alloc+0xac/0x528
    __sigqueue_alloc+0x150/0x308
    __send_signal+0x260/0x550
    send_signal+0x7e/0x348
    force_sig_info_to_task+0x104/0x180
    force_sig_fault+0x48/0x58
    __do_pgm_check+0x120/0x1f0
    pgm_check_handler+0x11e/0x180
    other info that might help us debug this:
    Possible unsafe locking scenario:
    CPU0 CPU1
    ---- ----
    lock(&sighand->siglock);
    lock(css_set_lock);
    lock(&sighand->siglock);
    lock(css_set_lock);
    *** DEADLOCK ***
    2 locks held by mmap1/202299:
    #0: 00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180
    #1: 00000001892ad560 (rcu_read_lock){....}-{1:2}, at: percpu_ref_put_many.constprop.0+0x0/0x168
    stack backtrace:
    CPU: 15 PID: 202299 Comm: mmap1 Not tainted 5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1
    Hardware name: IBM 3906 M04 704 (LPAR)
    Call Trace:
    dump_stack_lvl+0x76/0x98
    check_noncircular+0x136/0x158
    check_prev_add+0xe0/0xed8
    validate_chain+0x736/0xb20
    __lock_acquire+0x604/0xbd8
    lock_acquire.part.0+0xe2/0x238
    lock_acquire+0xb0/0x200
    _raw_spin_lock_irqsave+0x6a/0xd8
    obj_cgroup_release+0x4a/0xe0
    percpu_ref_put_many.constprop.0+0x150/0x168
    drain_obj_stock+0x94/0xe8
    refill_obj_stock+0x94/0x278
    obj_cgroup_charge+0x164/0x1d8
    kmem_cache_alloc+0xac/0x528
    __sigqueue_alloc+0x150/0x308
    __send_signal+0x260/0x550
    send_signal+0x7e/0x348
    force_sig_info_to_task+0x104/0x180
    force_sig_fault+0x48/0x58
    __do_pgm_check+0x120/0x1f0
    pgm_check_handler+0x11e/0x180
    INFO: lockdep is turned off.

    In this example a slab allocation from __send_signal() caused a refill
    and drain of a percpu objcg stock, which resulted in the release of
    another, unrelated objcg. The objcg release path requires taking
    css_set_lock, which is used to synchronize objcg lists.

    This can create a circular dependency with the sighand lock, which is
    taken with css_set_lock held by the freezer code (to freeze a
    task).

    In general it seems that using css_set_lock to synchronize objcg lists
    makes any slab allocation or deallocation risky while css_set_lock, or
    any lock nested inside it, is held.

    To fix the problem and make the code more robust let's stop using
    css_set_lock to synchronize objcg lists and use a new dedicated spinlock
    instead.

    Link: https://lkml.kernel.org/r/Yfm1IHmoGdyUR81T@carbon.dhcp.thefacebook.com
    Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
    Signed-off-by: Roman Gushchin
    Reported-by: Alexander Egorenkov
    Tested-by: Alexander Egorenkov
    Reviewed-by: Waiman Long
    Acked-by: Tejun Heo
    Reviewed-by: Shakeel Butt
    Reviewed-by: Jeremy Linton
    Tested-by: Jeremy Linton
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     
  • [ Upstream commit 9eeabdf17fa0ab75381045c867c370f4cc75a613 ]

    When uncloning an skb dst and its associated metadata, a new
    dst+metadata is allocated and later replaces the old one in the skb.
    This is helpful to have a non-shared dst+metadata attached to a specific
    skb.

    The issue is the uncloned dst+metadata is initialized with a refcount of
    1, which is increased to 2 before attaching it to the skb. When
    tun_dst_unclone returns, the dst+metadata is only referenced from a
    single place (the skb) while its refcount is 2. Its refcount will never
    drop to 0 (when the skb is consumed), leading to a memory leak.

    Fix this by removing the call to dst_hold in tun_dst_unclone, as the
    dst+metadata refcount is already 1.

    Fixes: fc4099f17240 ("openvswitch: Fix egress tunnel info.")
    Cc: Pravin B Shelar
    Reported-by: Vlad Buslov
    Tested-by: Vlad Buslov
    Signed-off-by: Antoine Tenart
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Antoine Tenart
     
  • [ Upstream commit cfc56f85e72f5b9c5c5be26dc2b16518d36a7868 ]

    When uncloning an skb dst and its associated metadata a new dst+metadata
    is allocated and the tunnel information from the old metadata is copied
    over there.

    The issue is the tunnel metadata has references to cached dst entries,
    which are copied along the way. When a dst+metadata refcount drops to 0
    the metadata is freed, including the cached dst entries. As they are also
    referenced in the initial dst+metadata, this ends up in use-after-frees.

    In practice the above did not happen because of another issue, the
    dst+metadata was never freed because its refcount never dropped to 0
    (this will be fixed in a subsequent patch).

    Fix this by initializing the dst cache after copying the tunnel
    information from the old metadata to also unshare the dst cache.

    Fixes: d71785ffc7e7 ("net: add dst_cache to ovs vxlan lwtunnel")
    Cc: Paolo Abeni
    Reported-by: Vlad Buslov
    Tested-by: Vlad Buslov
    Signed-off-by: Antoine Tenart
    Acked-by: Paolo Abeni
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Antoine Tenart
     
  • [ Upstream commit d1ca60efc53d665cf89ed847a14a510a81770b81 ]

    When userspace, e.g. conntrackd, inserts an entry with a specified helper,
    it's possible that the helper is lost immediately after it's added:

    ctnetlink_create_conntrack
    -> nf_ct_helper_ext_add + assign helper
    -> ctnetlink_setup_nat
    -> ctnetlink_parse_nat_setup
    -> parse_nat_setup -> nfnetlink_parse_nat_setup
    -> nf_nat_setup_info
    -> nf_conntrack_alter_reply
    -> __nf_ct_try_assign_helper

    ... and __nf_ct_try_assign_helper will zero the helper again.

    Set the IPS_HELPER bit to bypass the auto-assign logic; it's unwanted here,
    just like when a helper is assigned via the ruleset.

    Dropped old 'not strictly necessary' comment, it referred to use of
    rcu_assign_pointer() before it got replaced by RCU_INIT_POINTER().

    NB: the Fixes tag is intentionally incorrect; this extends the referenced
    commit, but this change won't build without IPS_HELPER introduced there.

    Fixes: 6714cf5465d280 ("netfilter: nf_conntrack: fix explicit helper attachment and NAT")
    Reported-by: Pham Thanh Tuyen
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Sasha Levin

    Florian Westphal
     
  • commit cb1f65c1e1424a4b5e4a86da8aa3b8fd8459c8ec upstream.

    After commit e3728b50cd9b ("ACPI: PM: s2idle: Avoid possible race
    related to the EC GPE") wakeup interrupts occurring immediately after
    the one discarded by acpi_s2idle_wake() may be missed. Moreover, if
    the SCI triggers again immediately after the rearming in
    acpi_s2idle_wake(), that wakeup may be missed too.

    The problem is that pm_system_irq_wakeup() only calls pm_system_wakeup()
    when pm_wakeup_irq is 0, but that's not the case any more after the
    interrupt causing acpi_s2idle_wake() to run until pm_wakeup_irq is
    cleared by the pm_wakeup_clear() call in s2idle_loop(). However,
    there may be wakeup interrupts occurring in that time frame and if
    that happens, they will be missed.

    To address that issue, first move the clearing of pm_wakeup_irq to
    the point at which it is known that the interrupt causing
    acpi_s2idle_wake() to run will be discarded, before rearming the SCI
    for wakeup. Moreover, because that only reduces the size of the
    time window in which the issue may manifest itself, allow
    pm_system_irq_wakeup() to register two wakeup interrupts in
    a row and, when discarding the first one, replace it with the second
    one. [Of course, this assumes that only one wakeup interrupt can be
    discarded in one go, but currently that is the case and I am not
    aware of any plans to change that.]

    Fixes: e3728b50cd9b ("ACPI: PM: s2idle: Avoid possible race related to the EC GPE")
    Cc: # 5.4+
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Rafael J. Wysocki
     
  • [ Upstream commit 33569ef3c754a82010f266b7b938a66a3ccf90a4 ]

    register_nosave_region_late() is an unused wrapper forcing kmalloc
    allocation for registering nosave regions; remove it. Also, rename
    __register_nosave_region() to register_nosave_region() now that there
    is no need for disambiguation.

    Signed-off-by: Amadeusz Sławiński
    Reviewed-by: Cezary Rojewski
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Sasha Levin

    Amadeusz Sławiński
     
  • [ Upstream commit 1976b2b31462151403c9fc110204fcc2a77bdfd1 ]

    Query the server for other possible trunkable locations for a given
    file system on a 4.1+ mount.

    v2:
    -- added missing static to nfs4_discover_trunking,
    reported by the kernel test robot

    Signed-off-by: Olga Kornievskaia
    Signed-off-by: Anna Schumaker
    Signed-off-by: Sasha Levin

    Olga Kornievskaia
     
  • [ Upstream commit 8a59bb93b7e3cca389af44781a429ac12ac49be6 ]

    Define and store if server returns it supports fs_locations attribute
    as a capability.

    Signed-off-by: Olga Kornievskaia
    Signed-off-by: Anna Schumaker
    Signed-off-by: Sasha Levin

    Olga Kornievskaia
     
  • [ Upstream commit b5e7b59c3480f355910f9d2c6ece5857922a5e54 ]

    Currently the nfs_access_get_cached family of functions reports a
    'struct nfs_access_entry' as the result, with both .mask and .cred set.
    However, the .cred is never used. This is probably good, as there is no
    guarantee that it won't be freed before use.

    Change to only report the 'mask' - as this is all that is used or needed.

    Signed-off-by: NeilBrown
    Signed-off-by: Anna Schumaker
    Signed-off-by: Sasha Levin

    NeilBrown
     

09 Feb, 2022

3 commits

  • commit ef9989afda73332df566852d6e9ca695c05f10ce upstream.

    When transitioning to/from guest mode, it is necessary to inform
    lockdep, tracing, and RCU in a specific order, similar to the
    requirements for transitions to/from user mode. Additionally, it is
    necessary to perform vtime accounting for a window around running the
    guest, with RCU enabled, such that timer interrupts taken from the guest
    can be accounted as guest time.

    Most architectures don't handle all the necessary pieces, and have a
    number of common bugs, including unsafe usage of RCU during the window
    between guest_enter() and guest_exit().

    On x86, this was dealt with across commits:

    87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
    0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
    9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
    3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
    135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
    160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
    bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")

    ... but those fixes are specific to x86, and as the resulting logic
    (while correct) is split across generic helper functions and
    x86-specific helper functions, it is difficult to see that the
    entry/exit accounting is balanced.

    This patch adds generic helpers which architectures can use to handle
    guest entry/exit consistently and correctly. The guest_{enter,exit}()
    helpers are split into guest_timing_{enter,exit}() to perform vtime
    accounting, and guest_context_{enter,exit}() to perform the necessary
    context tracking and RCU management. The existing guest_{enter,exit}()
    helpers are left as wrappers of these.

    Atop this, new guest_state_enter_irqoff() and guest_state_exit_irqoff()
    helpers are added to handle the ordering of lockdep, tracing, and RCU
    management. These are intended to mirror exit_to_user_mode() and
    enter_from_user_mode().

    Subsequent patches will migrate architectures over to the new helpers,
    following a sequence:

    guest_timing_enter_irqoff();

    guest_state_enter_irqoff();
    < run the vcpu >
    guest_state_exit_irqoff();

    < take any pending IRQs >

    guest_timing_exit_irqoff();

    This sequence handles all of the above correctly, and more clearly
    balances the entry and exit portions, making it easier to understand.

    The existing helpers are marked as deprecated, and will be removed once
    all architectures have been converted.

    There should be no functional change as a result of this patch.

    Signed-off-by: Mark Rutland
    Reviewed-by: Marc Zyngier
    Reviewed-by: Paolo Bonzini
    Reviewed-by: Nicolas Saenz Julienne
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Mark Rutland
     
  • commit 314c459a6fe0957b5885fbc65c53d51444092880 upstream.

    Since commit 974b9b2c68f3 ("mm: consolidate pte_index() and
    pte_offset_*() definitions") pte_index is a static inline and there is
    no define for it that can be recognized by the preprocessor. As a
    result, vm_insert_pages() uses slower loop over vm_insert_page() instead
    of insert_pages() that amortizes the cost of spinlock operations when
    inserting multiple pages.

    Link: https://lkml.kernel.org/r/20220111145457.20748-1-rppt@kernel.org
    Fixes: 974b9b2c68f3 ("mm: consolidate pte_index() and pte_offset_*() definitions")
    Signed-off-by: Mike Rapoport
    Reported-by: Christian Dietrich
    Reviewed-by: Khalid Aziz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Rapoport
     
  • commit 06feec6005c9d9500cd286ec440aabf8b2ddd94d upstream.

    Correct the size of the iec_status array by changing it to the size of
    the status array of struct snd_aes_iec958. This fixes out-of-bounds slab
    read accesses made by memcpy() in the hdmi-codec driver. The problem
    was reported by KASAN.

    Cc: stable@vger.kernel.org
    Signed-off-by: Dmitry Osipenko
    Link: https://lore.kernel.org/r/20220112195039.1329-1-digetx@gmail.com
    Signed-off-by: Mark Brown
    Signed-off-by: Greg Kroah-Hartman

    Dmitry Osipenko
     

02 Feb, 2022

15 commits

  • commit a37d9a17f099072fe4d3a9048b0321978707a918 upstream.

    Apparently, there are some applications that use IN_DELETE event as an
    invalidation mechanism and expect that if they try to open a file with
    the name reported with the delete event, that it should not contain the
    content of the deleted file.

    Commit 49246466a989 ("fsnotify: move fsnotify_nameremove() hook out of
    d_delete()") moved the fsnotify delete hook before d_delete() so fsnotify
    will have access to a positive dentry.

    This allowed a race where opening the deleted file via cached dentry
    is now possible after receiving the IN_DELETE event.

    To fix the regression, create a new hook fsnotify_delete() that takes
    the unlinked inode as an argument and use a helper d_delete_notify() to
    pin the inode, so we can pass it to fsnotify_delete() after d_delete().

    Backporting hint: this regression is from v5.3. Although the patch will
    apply with only trivial conflicts to v5.4 and v5.10, it won't build,
    because fsnotify_delete() implementation is different in each of those
    versions (see fsnotify_link()).

    A follow up patch will fix the fsnotify_unlink/rmdir() calls in pseudo
    filesystem that do not need to call d_delete().

    Link: https://lore.kernel.org/r/20220120215305.282577-1-amir73il@gmail.com
    Reported-by: Ivan Delalande
    Link: https://lore.kernel.org/linux-fsdevel/YeNyzoDM5hP5LtGW@visor/
    Fixes: 49246466a989 ("fsnotify: move fsnotify_nameremove() hook out of d_delete()")
    Cc: stable@vger.kernel.org # v5.3+
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Amir Goldstein
     
  • commit 51e50fbd3efc6064c30ed73a5e009018b36e290a upstream.

    When CONFIG_CGROUPS is disabled psi code generates the following
    warnings:

    kernel/sched/psi.c:1112:21: warning: no previous prototype for 'psi_trigger_create' [-Wmissing-prototypes]
    1112 | struct psi_trigger *psi_trigger_create(struct psi_group *group,
    | ^~~~~~~~~~~~~~~~~~
    kernel/sched/psi.c:1182:6: warning: no previous prototype for 'psi_trigger_destroy' [-Wmissing-prototypes]
    1182 | void psi_trigger_destroy(struct psi_trigger *t)
    | ^~~~~~~~~~~~~~~~~~~
    kernel/sched/psi.c:1249:10: warning: no previous prototype for 'psi_trigger_poll' [-Wmissing-prototypes]
    1249 | __poll_t psi_trigger_poll(void **trigger_ptr,
    | ^~~~~~~~~~~~~~~~

    Change the declarations of these functions in the header to provide the
    prototypes even when they are unused.

    Link: https://lkml.kernel.org/r/20220119223940.787748-2-surenb@google.com
    Fixes: 0e94682b73bf ("psi: introduce psi monitor")
    Signed-off-by: Suren Baghdasaryan
    Reported-by: kernel test robot
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Suren Baghdasaryan
     
  • [ Upstream commit 3c42b2019863b327caa233072c50739d4144dd16 ]

    ./include/net/route.h:373:48: warning: incorrect type in argument 2 (different base types)
    ./include/net/route.h:373:48: expected unsigned int [usertype] key
    ./include/net/route.h:373:48: got restricted __be32 [usertype] daddr

    Fixes: 5c9f7c1dfc2e ("ipv4: Add helpers for neigh lookup for nexthop")
    Signed-off-by: Eric Dumazet
    Reviewed-by: David Ahern
    Link: https://lore.kernel.org/r/20220127013404.1279313-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Sasha Levin

    Eric Dumazet
     
  • [ Upstream commit 36268983e90316b37000a005642af42234dabb36 ]

    This reverts commit b75326c201242de9495ff98e5d5cff41d7fc0d9d.

    This commit breaks Linux compatibility with USGv6 tests. The RFC this
    commit was based on is actually an expired draft: no published RFC
    currently allows the new behaviour it introduced.

    Without full IETF endorsement, the flash renumbering scenario this
    patch was supposed to enable is never going to work, as other IPv6
    equipment on the same LAN will keep the 2 hours limit.

    Fixes: b75326c20124 ("ipv6: Honor all IPv6 PIO Valid Lifetime values")
    Signed-off-by: Guillaume Nault
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Guillaume Nault
     
  • [ Upstream commit 09f5e7dc7ad705289e1b1ec065439aa3c42951c4 ]

    Time readers that cannot take locks (due to NMI etc..) currently make
    use of perf_event::shadow_ctx_time, which, for that event gives:

    time' = now + (time - timestamp)

    or, alternatively arranged:

    time' = time + (now - timestamp)

    IOW, the progression of time since the last time the shadow_ctx_time
    was updated.

    There's problems with this:

    A) the shadow_ctx_time is per-event, even though the ctx_time it
    reflects is obviously per context. The direct consequence of this
    is that the context needs to iterate all events all the time to
    keep the shadow_ctx_time in sync.

    B) even with the prior point, the context itself might not be active
    meaning its time should not advance to begin with.

    C) shadow_ctx_time isn't consistently updated when ctx_time is

    There are 3 users of this stuff, that suffer differently from this:

    - calc_timer_values()
        - perf_output_read()
        - perf_event_update_userpage() /* A */

    - perf_event_read_local() /* A,B */

    In particular, perf_output_read() doesn't suffer at all, because it's
    sample driven and hence only relevant when the event is actually
    running.

    The same was supposed to be true for perf_event_update_userpage(),
    after all self-monitoring implies the context is active *HOWEVER*, as
    per commit f79256532682 ("perf/core: fix userpage->time_enabled of
    inactive events") this goes wrong when combined with counter
    overcommit, in that case those events that do not get scheduled when
    the context becomes active (task events typically) miss out on the
    EVENT_TIME update and ENABLED time is inflated (for a little while)
    with the time the context was inactive. Once the event gets rotated
    in, this gets corrected, leading to a non-monotonic timeflow.

    perf_event_read_local() made things even worse, it can request time at
    any point, suffering all the problems perf_event_update_userpage()
    does and more. Because while perf_event_update_userpage() is limited
    by the context being active, perf_event_read_local() users have no
    such constraint.

    Therefore, completely overhaul things and do away with
    perf_event::shadow_ctx_time. Instead have regular context time updates
    keep track of this offset directly and provide perf_event_time_now()
    to complement perf_event_time().

    perf_event_time_now() will, in addition to being context wide, also
    take into account whether the context is active. For an inactive
    context, it will not advance time.

    This latter property means the cgroup perf_cgroup_info context needs
    to grow additional state to track this.

    Additionally, since all this is strictly per-cpu, we can use barrier()
    to order context activity vs context time.

    Fixes: 7d9285e82db5 ("perf/bpf: Extend the perf_event_read_local() interface, a.k.a. "bpf: perf event change needed for subsequent bpf helpers"")
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Song Liu
    Tested-by: Namhyung Kim
    Link: https://lkml.kernel.org/r/YcB06DasOBtU0b00@hirez.programming.kicks-ass.net
    Signed-off-by: Sasha Levin

    Peter Zijlstra
     
  • [ Upstream commit aed28b7a2d620cb5cd0c554cb889075c02e25e8e ]

    Fixes: e26d9972720e ("SUNRPC: Clean up scheduling of autoclose")
    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker
    Signed-off-by: Sasha Levin

    Chuck Lever
     
  • [ Upstream commit 76497b1adb89175eee85afc437f08a68247314b3 ]

    Clean up: BIT() is preferred over open-coding the shift.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust
    Signed-off-by: Sasha Levin

    Chuck Lever
     
  • commit aafc2e3285c2d7a79b7ee15221c19fbeca7b1509 upstream.

    struct fib6_node's fn_sernum field can be
    read while other threads change it.

    Add READ_ONCE()/WRITE_ONCE() annotations.

    Do not change existing smp barriers in fib6_get_cookie_safe()
    and __fib6_update_sernum_upto_root()

    syzbot reported:

    BUG: KCSAN: data-race in fib6_clean_node / inet6_csk_route_socket

    write to 0xffff88813df62e2c of 4 bytes by task 1920 on cpu 1:
    fib6_clean_node+0xc2/0x260 net/ipv6/ip6_fib.c:2178
    fib6_walk_continue+0x38e/0x430 net/ipv6/ip6_fib.c:2112
    fib6_walk net/ipv6/ip6_fib.c:2160 [inline]
    fib6_clean_tree net/ipv6/ip6_fib.c:2240 [inline]
    __fib6_clean_all+0x1a9/0x2e0 net/ipv6/ip6_fib.c:2256
    fib6_flush_trees+0x6c/0x80 net/ipv6/ip6_fib.c:2281
    rt_genid_bump_ipv6 include/net/net_namespace.h:488 [inline]
    addrconf_dad_completed+0x57f/0x870 net/ipv6/addrconf.c:4230
    addrconf_dad_work+0x908/0x1170
    process_one_work+0x3f6/0x960 kernel/workqueue.c:2307
    worker_thread+0x616/0xa70 kernel/workqueue.c:2454
    kthread+0x1bf/0x1e0 kernel/kthread.c:359
    ret_from_fork+0x1f/0x30

    read to 0xffff88813df62e2c of 4 bytes by task 15701 on cpu 0:
    fib6_get_cookie_safe include/net/ip6_fib.h:285 [inline]
    rt6_get_cookie include/net/ip6_fib.h:306 [inline]
    ip6_dst_store include/net/ip6_route.h:234 [inline]
    inet6_csk_route_socket+0x352/0x3c0 net/ipv6/inet6_connection_sock.c:109
    inet6_csk_xmit+0x91/0x1e0 net/ipv6/inet6_connection_sock.c:121
    __tcp_transmit_skb+0x1323/0x1840 net/ipv4/tcp_output.c:1402
    tcp_transmit_skb net/ipv4/tcp_output.c:1420 [inline]
    tcp_write_xmit+0x1450/0x4460 net/ipv4/tcp_output.c:2680
    __tcp_push_pending_frames+0x68/0x1c0 net/ipv4/tcp_output.c:2864
    tcp_push+0x2d9/0x2f0 net/ipv4/tcp.c:725
    mptcp_push_release net/mptcp/protocol.c:1491 [inline]
    __mptcp_push_pending+0x46c/0x490 net/mptcp/protocol.c:1578
    mptcp_sendmsg+0x9ec/0xa50 net/mptcp/protocol.c:1764
    inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:643
    sock_sendmsg_nosec net/socket.c:705 [inline]
    sock_sendmsg net/socket.c:725 [inline]
    kernel_sendmsg+0x97/0xd0 net/socket.c:745
    sock_no_sendpage+0x84/0xb0 net/core/sock.c:3086
    inet_sendpage+0x9d/0xc0 net/ipv4/af_inet.c:834
    kernel_sendpage+0x187/0x200 net/socket.c:3492
    sock_sendpage+0x5a/0x70 net/socket.c:1007
    pipe_to_sendpage+0x128/0x160 fs/splice.c:364
    splice_from_pipe_feed fs/splice.c:418 [inline]
    __splice_from_pipe+0x207/0x500 fs/splice.c:562
    splice_from_pipe fs/splice.c:597 [inline]
    generic_splice_sendpage+0x94/0xd0 fs/splice.c:746
    do_splice_from fs/splice.c:767 [inline]
    direct_splice_actor+0x80/0xa0 fs/splice.c:936
    splice_direct_to_actor+0x345/0x650 fs/splice.c:891
    do_splice_direct+0x106/0x190 fs/splice.c:979
    do_sendfile+0x675/0xc40 fs/read_write.c:1245
    __do_sys_sendfile64 fs/read_write.c:1310 [inline]
    __se_sys_sendfile64 fs/read_write.c:1296 [inline]
    __x64_sys_sendfile64+0x102/0x140 fs/read_write.c:1296
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    value changed: 0x0000026f -> 0x00000271

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 PID: 15701 Comm: syz-executor.2 Not tainted 5.16.0-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    The Fixes tag I chose is probably arbitrary, I do not think
    we need to backport this patch to older kernels.

    Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Link: https://lore.kernel.org/r/20220120174112.1126644-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • commit 23f57406b82de51809d5812afd96f210f8b627f3 upstream.

    ip_select_ident_segs() has been very conservative about using
    the connected socket private generator only for packets with IP_DF
    set, claiming it was needed for some VJ compression implementations.

    As mentioned in this referenced document, this can be abused.
    (Ref: Off-Path TCP Exploits of the Mixed IPID Assignment)

    Before switching to pure random IPID generation and possibly hurting
    some workloads, let's use the private inet socket generator.

    Not only this will remove one vulnerability, this will also
    improve performance of TCP flows using pmtudisc==IP_PMTUDISC_DONT

    Fixes: 73f156a6e8c1 ("inetpeer: get rid of ip_id_count")
    Signed-off-by: Eric Dumazet
    Reviewed-by: David Ahern
    Reported-by: Ray Che
    Cc: Willy Tarreau
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • commit 47934e06b65637c88a762d9c98329ae6e3238888 upstream.

    In one net namespace, after creating a packet socket without binding
    it to a device, users in other net namespaces can observe the new
    `packet_type` added by this packet socket by reading `/proc/net/ptype`
    file. This is a minor information leak, as the packet socket is
    namespace aware.

    Add a net pointer in `packet_type` to keep the net namespace of
    the corresponding packet socket. In `ptype_seq_show`, this net pointer
    must be checked when it is not NULL.

    Fixes: 2feb27dbe00c ("[NETNS]: Minor information leak via /proc/net/ptype file.")
    Signed-off-by: Congyu Liu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Sasha Levin

    Congyu Liu
     
  • commit 945c37ed564770c78dfe6b9f08bed57a1b4e60ef upstream.

    When CONFIG_USB_ROLE_SWITCH is not defined, add a
    usb_role_switch_find_by_fwnode() definition which returns NULL.

    Fixes: c6919d5e0cd1 ("usb: roles: Add usb_role_switch_find_by_fwnode()")
    Signed-off-by: Linyu Yuan
    Link: https://lore.kernel.org/r/1641818608-25039-1-git-send-email-quic_linyyuan@quicinc.com
    Signed-off-by: Greg Kroah-Hartman

    Linyu Yuan
     
  • commit 27fe73394a1c6d0b07fa4d95f1bca116d1cc66e9 upstream.

    It has been reported that the tag setting operation on newly-allocated
    pages can cause the page flags to be corrupted when performed
    concurrently with other flag updates as a result of the use of
    non-atomic operations.

    Fix the problem by using a compare-exchange loop to update the tag.

    Link: https://lkml.kernel.org/r/20220120020148.1632253-1-pcc@google.com
    Link: https://linux-review.googlesource.com/id/I456b24a2b9067d93968d43b4bb3351c0cec63101
    Fixes: 2813b9c02962 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc")
    Signed-off-by: Peter Collingbourne
    Reviewed-by: Andrey Konovalov
    Cc: Peter Zijlstra
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Peter Collingbourne
     
  • commit f23653fe64479d96910bfda2b700b1af17c991ac upstream.

    Fix a user API regression introduced with commit f76edd8f7ce0 ("tty:
    cyclades, remove this orphan"), which removed a part of the API and
    caused compilation errors for user programs using said part, such as
    GCC 9 in its libsanitizer component[1]:

    .../libsanitizer/sanitizer_common/sanitizer_platform_limits_posix.cc:160:10: fatal error: linux/cyclades.h: No such file or directory
    160 | #include <linux/cyclades.h>
    | ^~~~~~~~~~~~~~~~~~
    compilation terminated.
    make[4]: *** [Makefile:664: sanitizer_platform_limits_posix.lo] Error 1

    As the absolute minimum required, bring `struct cyclades_monitor' and
    the ioctl numbers back so as to make the library build again. Add a
    preprocessor warning as to the obsolescence of the features provided.

    References:

    [1] GCC PR sanitizer/100379, "cyclades.h is removed from linux kernel
    header files",

    Fixes: f76edd8f7ce0 ("tty: cyclades, remove this orphan")
    Cc: stable@vger.kernel.org # v5.13+
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Maciej W. Rozycki
    Link: https://lore.kernel.org/r/alpine.DEB.2.20.2201260733430.11348@tpp.orcam.me.uk
    Signed-off-by: Greg Kroah-Hartman

    Maciej W. Rozycki
     
  • commit e45c47d1f94e0cc7b6b079fdb4bcce2995e2adc4 upstream.

    The bio_start_io_acct_time() interface is like bio_start_io_acct(),
    except that it allows start_time to be passed in. This gives drivers
    the ability to defer starting accounting until after IO is issued
    (but possibly not entirely, due to bio splitting).

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Mike Snitzer
    Link: https://lore.kernel.org/r/20220128155841.39644-2-snitzer@redhat.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     
  • commit a06247c6804f1a7c86a2e5398a4c1f1db1471848 upstream.

    With a write operation on psi files replacing the old trigger with a
    new one, the lifetime of its waitqueue is totally arbitrary.
    Overwriting an existing trigger causes its waitqueue to be freed, and
    a pending poll() will stumble on trigger->event_wait, which was
    destroyed.
    Fix this by disallowing redefinition of an existing psi trigger. If a
    write operation is used on a file descriptor with an already existing
    psi trigger, the operation fails with an EBUSY error.
    Also bypass the psi_disabled check in psi_trigger_destroy, as the
    flag can be flipped after the trigger is created, leading to a memory
    leak.

    Fixes: 0e94682b73bf ("psi: introduce psi monitor")
    Reported-by: syzbot+cdb5dd11c97cc532efad@syzkaller.appspotmail.com
    Suggested-by: Linus Torvalds
    Analyzed-by: Eric Biggers
    Signed-off-by: Suren Baghdasaryan
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Eric Biggers
    Acked-by: Johannes Weiner
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20220111232309.1786347-1-surenb@google.com
    Signed-off-by: Greg Kroah-Hartman

    Suren Baghdasaryan
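A minimal userspace analogue of the fix (the names are ours, not the kernel's): once a trigger is attached to the file's private state, a second write fails with EBUSY rather than freeing the old trigger, and its waitqueue, underneath a pending poll().

```c
#include <errno.h>
#include <stddef.h>

struct trigger { int threshold; };

/* Attach a trigger to a slot; refuse to replace one that exists. */
static int trigger_set(struct trigger **slot, struct trigger *t)
{
    if (*slot != NULL)
        return -EBUSY;  /* redefinition disallowed */
    *slot = t;
    return 0;
}
```

Because the first trigger is never freed out from under a waiter, the use-after-free window on its waitqueue is gone by construction.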
     

27 Jan, 2022

13 commits

  • commit fb80445c438c78b40b547d12b8d56596ce4ccfeb upstream.

    commit 56b765b79e9a ("htb: improved accuracy at high rates") broke
    "overhead X", "linklayer atm" and "mpu X" attributes.

    "overhead X" and "linklayer atm" have already been fixed. This restores
    the "mpu X" handling, as might be used by DOCSIS or Ethernet shaping:

    tc class add ... htb rate X overhead 4 mpu 64

    The code being fixed is used by htb, tbf and act_police. Cake has its
    own mpu handling. qdisc_calculate_pkt_len still uses the size table
    containing values adjusted for mpu by user space.

    iproute2 tc has always passed mpu into the kernel via a tc_ratespec
    structure, but the kernel never directly acted on it, merely stored it
    so that it could be read back by `tc class show`.

    Rather, tc would generate length-to-time tables that included the mpu
    (and linklayer) in their construction, and the kernel used those tables.

    Since v3.7, the tables were no longer used. Along with "mpu", this also
    broke "overhead" and "linklayer" which were fixed in 01cb71d2d47b
    ("net_sched: restore "overhead xxx" handling", v3.10) and 8a8e3d84b171
    ("net_sched: restore "linklayer atm" handling", v3.11).

    "overhead" was fixed by simply restoring use of tc_ratespec::overhead -
    this had originally been used by the kernel but was initially omitted
    from the new non-table-based calculations.

    "linklayer" had been handled in the table like "mpu", but the mode was
    not originally passed in tc_ratespec. The new implementation was made to
    handle it by getting new versions of tc to pass the mode in an extended
    tc_ratespec, and for older versions of tc the table contents were analysed
    at load time to deduce linklayer.

    As "mpu" has always been given to the kernel in tc_ratespec,
    accompanying the mpu-based table, we can restore system functionality
    with no userspace change by making the kernel act on the tc_ratespec
    value.

    Fixes: 56b765b79e9a ("htb: improved accuracy at high rates")
    Signed-off-by: Kevin Bracey
    Cc: Eric Dumazet
    Cc: Jiri Pirko
    Cc: Vimalkumar
    Link: https://lore.kernel.org/r/20220112170210.1014351-1-kevin@bracey.fi
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Kevin Bracey
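The effect of acting on the mpu can be sketched with a simplified length-to-time conversion, loosely modeled on the kernel's rate code; the names and the plain ns-per-byte arithmetic here are ours, for illustration only.

```c
#include <stdint.h>

/* Never bill fewer than 'mpu' bytes for a packet. */
static unsigned int apply_mpu(unsigned int len, unsigned int mpu)
{
    return len < mpu ? mpu : len;
}

/* Transmission time in ns for 'len' bytes at 'rate_Bps' bytes/second,
 * with the minimum-packet-unit floor applied first. */
static uint64_t l2t_ns(unsigned int len, unsigned int mpu, uint64_t rate_Bps)
{
    len = apply_mpu(len, mpu);
    return (uint64_t)len * 1000000000ULL / rate_Bps;
}
```

With `mpu 64`, a 40-byte ACK is billed as 64 bytes, matching what the old user-space-generated tables encoded.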
     
  • commit 91341fa0003befd097e190ec2a4bf63ad957c49a upstream.

    Both fields can be read/written without synchronization; add proper
    accessors and documentation.

    Fixes: d5dd88794a13 ("inet: fix various use-after-free in defrags units")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
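The accessor pattern being added looks roughly like the following sketch: simplified versions of the kernel's READ_ONCE()/WRITE_ONCE(), which force a single access through a volatile-qualified pointer so the compiler cannot tear, fuse, or re-read the value. The struct and field names below are illustrative, not the ones in the patch.

```c
/* Simplified READ_ONCE/WRITE_ONCE for scalar types. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

struct frag_queue_cfg {
    long timeout;  /* read locklessly from the fast path */
};

static long cfg_timeout(const struct frag_queue_cfg *cfg)
{
    return READ_ONCE(cfg->timeout);
}

static void cfg_set_timeout(struct frag_queue_cfg *cfg, long t)
{
    WRITE_ONCE(cfg->timeout, t);
}
```

Wrapping the field in named accessors also documents which reads and writes are intentionally lockless.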
     
  • commit b7ec62d7ee0f0b8af6ba190501dff7f9ee6545ca upstream.

    find_first_bit() and find_first_zero_bit() are not protected with
    ifdefs as other functions in find.h. It causes build errors on some
    platforms if CONFIG_GENERIC_FIND_FIRST_BIT is enabled.

    Signed-off-by: Yury Norov
    Fixes: 2cc7b6a44ac2 ("lib: add fast path for find_first_*_bit() and find_last_bit()")
    Reported-by: kernel test robot
    Tested-by: Wolfram Sang
    Signed-off-by: Greg Kroah-Hartman

    Yury Norov
     
  • commit ec3bb890817e4398f2d46e12e2e205495b116be9 upstream.

    When there is no policy configured on the system, the default policy is
    checked in xfrm_route_forward. However, it was done with the wrong
    direction (XFRM_POLICY_FWD instead of XFRM_POLICY_OUT).
    The default policy for XFRM_POLICY_FWD was checked just before, with a call
    to xfrm[46]_policy_check().

    CC: stable@vger.kernel.org
    Fixes: 2d151d39073a ("xfrm: Add possibility to set the default to block if we have no policy")
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: Steffen Klassert
    Signed-off-by: Greg Kroah-Hartman

    Nicolas Dichtel
     
  • [ Upstream commit 222a011efc839ca1f51bf89fe7a2b3705fa55ccd ]

    When finding the socket to report an error on, if the invoking packet
    is using Segment Routing, the IPv6 destination address is that of an
    intermediate router, not the end destination. Extract the ultimate
    destination address from the segment address.

    This change allows traceroute to function in the presence of Segment
    Routing.

    Signed-off-by: Andrew Lunn
    Reviewed-by: David Ahern
    Reviewed-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Andrew Lunn
     
  • [ Upstream commit e41294408c56c68ea0f269d757527bf33b39118a ]

    RFC8754 says:

    ICMP error packets generated within the SR domain are sent to source
    nodes within the SR domain. The invoking packet in the ICMP error
    message may contain an SRH. Since the destination address of a packet
    with an SRH changes as each segment is processed, it may not be the
    destination used by the socket or application that generated the
    invoking packet.

    For the source of an invoking packet to process the ICMP error
    message, the ultimate destination address of the IPv6 header may be
    required. The following logic is used to determine the destination
    address for use by protocol-error handlers.

    * Walk all extension headers of the invoking IPv6 packet to the
    routing extension header preceding the upper-layer header.

    - If routing header is type 4 Segment Routing Header (SRH)

    o The SID at Segment List[0] may be used as the destination
    address of the invoking packet.

    Mangle the skb so the network header points to the invoking packet
    inside the ICMP packet. The seg6 helpers can then be used on the skb
    to find any segment routing headers. If found, mark this fact in the
    IPv6 control block of the skb, and store the offset into the packet of
    the SRH. Then restore the skb back to its old state.

    Signed-off-by: Andrew Lunn
    Reviewed-by: David Ahern
    Reviewed-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Andrew Lunn
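The Segment List[0] extraction described above can be sketched against a simplified view of the RFC 8754 header; in an SRH the segment list is stored in reverse path order, so Segment List[0] (the first entry in the array) holds the last segment, i.e. the ultimate destination. This sketch skips the length and alignment validation the kernel's seg6 helpers perform.

```c
#include <stdint.h>
#include <string.h>

struct srh {
    uint8_t  nexthdr;
    uint8_t  hdrlen;        /* in 8-octet units, excluding the first 8 */
    uint8_t  type;          /* 4 = Segment Routing Header */
    uint8_t  segments_left;
    uint8_t  last_entry;
    uint8_t  flags;
    uint16_t tag;
    uint8_t  segments[][16];/* Segment List[0] first: final destination */
};

/* Copy the ultimate destination out of a routing-header buffer.
 * Returns 0 on success, -1 if this is not an SRH with >= 1 segment. */
static int srh_final_dest(const uint8_t *buf, size_t len, uint8_t dst[16])
{
    const struct srh *h = (const struct srh *)buf;

    if (len < sizeof(*h) + 16 || h->type != 4 || h->hdrlen < 2)
        return -1;
    memcpy(dst, h->segments[0], 16);
    return 0;
}
```

An ICMP error handler can then hand this address, rather than the intermediate router's, to the protocol-error code.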
     
  • [ Upstream commit fa55a7d745de2d10489295b0674a403e2a5d490d ]

    An ICMP error message can contain in its message body part of an IPv6
    packet which invoked the error. Such a packet might contain a segment
    router header. Export get_srh() so the ICMP code can make use of it.

    Since this changes the scope of the function from local to global, add
    the seg6_ prefix to keep the namespace clean. Also move it into seg6.c
    so it is always available, not just when IPV6_SEG6_LWTUNNEL is
    enabled.

    Signed-off-by: Andrew Lunn
    Reviewed-by: David Ahern
    Reviewed-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Andrew Lunn
     
  • [ Upstream commit f81bdeaf816142e0729eea0cc84c395ec9673151 ]

    ACPICA commit bc02c76d518135531483dfc276ed28b7ee632ce1

    The current ACPI_ACCESS_*_WIDTH defines do not provide a way to
    test that size is small enough to not cause an overflow when
    applied to a 32-bit integer.

    Rather than adding more magic numbers, add ACPI_ACCESS_*_SHIFT,
    ACPI_ACCESS_*_MAX, and ACPI_ACCESS_*_DEFAULT #defines and
    redefine ACPI_ACCESS_*_WIDTH in terms of the new #defines.

    This was initially reported on Linux, where a size of 102 in
    ACPI_ACCESS_BIT_WIDTH caused an overflow error in the SPCR
    initialization code.

    Link: https://github.com/acpica/acpica/commit/bc02c76d
    Signed-off-by: Mark Langsdorf
    Signed-off-by: Bob Moore
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Sasha Levin

    Mark Langsdorf
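The SHIFT/MAX/DEFAULT scheme can be sketched for the bit-width case as follows; the concrete values here are ours, chosen so that size 1 encodes an 8-bit access, and may differ from ACPICA's actual defines.

```c
#define ACCESS_BIT_SHIFT   2                        /* width = 1 << (size + shift) */
#define ACCESS_BIT_MAX     (31 - ACCESS_BIT_SHIFT)  /* largest size fitting a u32 */
#define ACCESS_BIT_DEFAULT 1                        /* size encoding an 8-bit access */
#define ACCESS_BIT_WIDTH(size) (1u << ((size) + ACCESS_BIT_SHIFT))

/* With MAX available, callers can clamp instead of overflowing: */
static unsigned int access_bit_width(unsigned int size)
{
    if (size > ACCESS_BIT_MAX)
        size = ACCESS_BIT_DEFAULT;
    return ACCESS_BIT_WIDTH(size);
}
```

This is exactly the check that the SPCR case above lacked: a firmware-supplied size of 102 is caught by the MAX comparison instead of being fed into a 32-bit shift.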
     
  • [ Upstream commit 4e484b3e969b52effd95c17f7a86f39208b2ccf4 ]

    The kernel generates a mapping change message, XFRM_MSG_MAPPING,
    when a source port change is detected on an input state with UDP
    encapsulation set. The kernel generates a message for each IPsec
    packet with a new source port. For a high-speed flow, per-packet
    mapping change messages can be excessive and can overload the user
    space listener.

    Introduce rate limiting for XFRM_MSG_MAPPING message to the user space.

    The rate limiting is configurable via netlink, when adding a new SA or
    updating it. Use the new attribute XFRMA_MTIMER_THRESH in seconds.

    v1->v2 change:
    - update xfrm_sa_len()

    v2->v3 changes:
    - use u32 instead of unsigned long to reduce the size of struct xfrm_state
    - fix xfrm_compat size (Reported-by: kernel test robot)
    - accept XFRM_MSG_MAPPING only when XFRMA_ENCAP is present

    Co-developed-by: Thomas Egerer
    Signed-off-by: Thomas Egerer
    Signed-off-by: Antony Antony
    Signed-off-by: Steffen Klassert
    Signed-off-by: Sasha Levin

    Antony Antony
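The per-state rate limiting can be sketched as below; the field and function names are ours, not the kernel's, and real time sources are replaced by a caller-supplied timestamp.

```c
#include <stdbool.h>
#include <stdint.h>

struct sa_state {
    uint32_t mapping_maxage;    /* threshold in seconds; 0 = notify always */
    uint32_t new_mapping_time;  /* time of the last notification sent */
};

/* Decide whether a mapping-change event should reach user space. */
static bool mapping_should_notify(struct sa_state *x, uint32_t now)
{
    if (x->mapping_maxage == 0)  /* unconfigured: keep the old behavior */
        return true;
    if (now - x->new_mapping_time >= x->mapping_maxage) {
        x->new_mapping_time = now;
        return true;
    }
    return false;
}
```

With the threshold unset, every port change is still reported, so existing deployments see no behavior change until they opt in.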
     
  • [ Upstream commit d1579e61192e0e686faa4208500ef4c3b529b16c ]

    Because refcount_dec_not_one() returns true if the target refcount
    becomes saturated, it is generally unsafe to use its return value as
    a loop termination condition, but that is what happens when a device
    link's supplier device is released during runtime PM suspend
    operations and on device link removal.

    To address this, introduce pm_runtime_release_supplier() to be used
    in the above cases which will check the supplier device's runtime
    PM usage counter in addition to the refcount_dec_not_one() return
    value, so the loop can be terminated in case the rpm_active refcount
    value becomes invalid, and update the code in question to use it as
    appropriate.

    This change is not expected to have any visible functional impact.

    Reported-by: Peter Zijlstra
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Greg Kroah-Hartman
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Sasha Levin

    Rafael J. Wysocki
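A tiny model shows why the return value alone cannot terminate the loop: a saturated refcount sticks at its maximum and dec-not-one keeps returning true forever. The names below are ours; the kernel fix similarly bounds the loop with the supplier's runtime PM usage counter.

```c
#include <limits.h>
#include <stdbool.h>

/* Model of refcount_dec_not_one(): a saturated count never changes
 * and the function keeps reporting success. */
static bool refcount_dec_not_one(unsigned int *r)
{
    if (*r == 1)
        return false;   /* would drop to zero: caller handles that */
    if (*r == UINT_MAX)
        return true;    /* saturated: value is frozen */
    (*r)--;
    return true;
}

/* Drop rpm_active references, but never loop more times than the
 * usage counter says are outstanding; returns the number dropped. */
static unsigned int release_supplier(unsigned int *rpm_active,
                                     unsigned int usage_count)
{
    unsigned int drops = 0;

    while (usage_count > 0 && refcount_dec_not_one(rpm_active)) {
        usage_count--;
        drops++;
    }
    return drops;
}
```

Against a saturated count, the unbounded `while (refcount_dec_not_one(...))` form would spin forever; the extra condition guarantees termination.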
     
  • [ Upstream commit fd8d135b2c5e88662f2729e034913f183455a667 ]

    Add a HID_QUIRK_X_INVERT/HID_QUIRK_Y_INVERT quirk that can be used
    to invert the X/Y values.

    Signed-off-by: Alistair Francis
    [bentiss: silence checkpatch warning]
    Signed-off-by: Benjamin Tissoires
    Link: https://lore.kernel.org/r/20211208124045.61815-2-alistair@alistair23.me
    Signed-off-by: Sasha Levin

    Alistair Francis
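What such an invert quirk does to a reported value can be sketched as a reflection within the axis's logical range; this is a simplification of the driver-side handling, for illustration only.

```c
/* Reflect a value within [logical_min, logical_max]. */
static int invert_axis(int value, int logical_min, int logical_max)
{
    return logical_max - (value - logical_min);
}
```

Applying it twice returns the original value, as expected of an inversion.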
     
  • [ Upstream commit 1a68b346a2c9969c05e80a3b99a9ab160b5655c0 ]

    Currently, acpi_bus_get_status() calls acpi_device_always_present() to
    allow platform quirks to override the _STA return to report that a
    device is present (status = ACPI_STA_DEFAULT) independent of the _STA
    return.

    In some cases it might also be useful to have the opposite functionality
    and have a platform quirk which marks a device as not present (status = 0)
    to work around ACPI table bugs.

    Change acpi_device_always_present() into a more generic
    acpi_device_override_status() function to allow this.

    Signed-off-by: Hans de Goede
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Sasha Levin

    Hans de Goede
     
  • [ Upstream commit cb0e52b7748737b2cf6481fdd9b920ce7e1ebbdf ]

    We've noticed cases where tasks in a cgroup are stalled on memory but
    there is little memory FULL pressure since tasks stay on the runqueue
    in reclaim.

    A simple example involves a single threaded program that keeps leaking
    and touching large amounts of memory. It runs in a cgroup with swap
    enabled, memory.high set at 10M and cpu.max ratio set at 5%. Though
    there is significant CPU pressure and memory SOME, there is barely any
    memory FULL since the task enters reclaim and stays on the runqueue.
    However, this memory-bound task is effectively stalled on memory and
    we expect memory FULL to match memory SOME in this scenario.

    The code is confused about memstall && running, thinking there is a
    stalled task and a productive task when there's only one task: a
    reclaimer that's counted as both. To fix this, we redefine the
    condition for PSI_MEM_FULL to check that all running tasks are in an
    active memstall instead of checking that there are no running tasks.

    case PSI_MEM_FULL:
    - return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]);
    + return unlikely(tasks[NR_MEMSTALL] &&
    + tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);

    This will capture reclaimers. It will also capture tasks that called
    psi_memstall_enter() and are about to sleep, but this should be
    negligible noise.

    Signed-off-by: Brian Chen
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Johannes Weiner
    Link: https://lore.kernel.org/r/20211110213312.310243-1-brianchen118@gmail.com
    Signed-off-by: Sasha Levin

    Brian Chen