28 Sep, 2022

1 commit

  • [ Upstream commit c0feea594e058223973db94c1c32a830c9807c86 ]

    As Hillf Danton mentioned in [1]:

      syzbot should have been able to catch cancel_work_sync() in work context
      by checking lockdep_map in __flush_work() for both flush and cancel.

    Being unable to report the obvious deadlock scenario shown below is a
    bug. From a locking dependency perspective, the sync version of a cancel
    request should behave like a flush request, because it waits for the work
    to complete if that work has already started executing.

    ----------
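    #include <linux/module.h>
    #include <linux/mutex.h>
    #include <linux/sched.h>
    #include <linux/workqueue.h>

    static DEFINE_MUTEX(mutex);

    /* Runs in work context: sleep, then take the mutex the canceller holds. */
    static void work_fn(struct work_struct *work)
    {
            schedule_timeout_uninterruptible(HZ / 5);
            mutex_lock(&mutex);
            mutex_unlock(&mutex);
    }
    static DECLARE_WORK(work, work_fn);

    static int __init test_init(void)
    {
            schedule_work(&work);
            /* Give work_fn() time to start executing. */
            schedule_timeout_uninterruptible(HZ / 10);
            mutex_lock(&mutex);
            /* Deadlock: waits for work_fn(), which waits for the mutex. */
            cancel_work_sync(&work);
            mutex_unlock(&mutex);
            return -EINVAL;
    }
    module_init(test_init);
    MODULE_LICENSE("GPL");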
    ----------

    The check this patch restores was added by commit 0976dfc1d0cd80a4
    ("workqueue: Catch more locking problems with flush_work()").

    Then, lockdep's crossrelease feature was added by commit b09be676e0ff25bd
    ("locking/lockdep: Implement the 'crossrelease' feature"). As a result,
    this check was once removed by commit fd1a5b04dfb899f8 ("workqueue: Remove
    now redundant lock acquisitions wrt. workqueue flushes").

    But lockdep's crossrelease feature was removed by commit e966eaeeb623f099
    ("locking/lockdep: Remove the cross-release locking checks"). At this
    point, this check should have been restored.

    Then, commit d6e89786bed977f3 ("workqueue: skip lockdep wq dependency in
    cancel_work_sync()") introduced a boolean flag in order to distinguish
    flush_work() and cancel_work_sync(), for checking "struct workqueue_struct"
    dependency when called from cancel_work_sync() was causing false positives.

    Then, commit 87915adc3f0acdf0 ("workqueue: re-add lockdep dependencies for
    flushing") tried to restore the "struct work_struct" dependency check, but
    mistakenly made it conditional on this boolean flag. As the example above
    shows, the "struct work_struct" dependency needs to be checked for both
    flush_work() and cancel_work_sync().

    Link: https://lkml.kernel.org/r/20220504044800.4966-1-hdanton@sina.com [1]
    Reported-by: Hillf Danton
    Suggested-by: Lai Jiangshan
    Fixes: 87915adc3f0acdf0 ("workqueue: re-add lockdep dependencies for flushing")
    Cc: Johannes Berg
    Signed-off-by: Tetsuo Handa
    Signed-off-by: Tejun Heo
    Signed-off-by: Sasha Levin

    Tetsuo Handa
     

23 Sep, 2022

1 commit

  • commit 43626dade36fa74d3329046f4ae2d7fdefe401c6 upstream.

    syzbot is hitting the percpu_rwsem_assert_held(&cpu_hotplug_lock) warning
    in cpuset_attach() [1], because commit 4f7e7236435ca0ab ("cgroup: Fix
    threadgroup_rwsem cpus_read_lock() deadlock") missed that
    cpuset_attach() is also called from cgroup_attach_task_all().
    Add cpus_read_lock(), as cgroup_procs_write_start() does.

    Link: https://syzkaller.appspot.com/bug?extid=29d3a3b4d86c8136ad9e [1]
    Reported-by: syzbot
    Signed-off-by: Tetsuo Handa
    Fixes: 4f7e7236435ca0ab ("cgroup: Fix threadgroup_rwsem cpus_read_lock() deadlock")
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tetsuo Handa
     

20 Sep, 2022

2 commits

  • [ Upstream commit 54c3931957f6a6194d5972eccc36d052964b2abe ]

    Currently, the argument passed to lockdep_hardirqs_{on,off}() is fixed
    to CALLER_ADDR0.
    trace_hardirqs_on_caller() was intended to use caller_addr, the address
    that the caller wants to be traced.

    For example, the lockdep log on riscv shows the last {enabled,disabled}
    location as __trace_hardirqs_{on,off} all the time (if called by):
    [ 57.853175] hardirqs last enabled at (2519): __trace_hardirqs_on+0xc/0x14
    [ 57.853848] hardirqs last disabled at (2520): __trace_hardirqs_off+0xc/0x14

    After switching to trace_hardirqs_xx_caller, the log is more informative:
    [ 53.781428] hardirqs last enabled at (2595): restore_all+0xe/0x66
    [ 53.782185] hardirqs last disabled at (2596): ret_from_exception+0xa/0x10

    Link: https://lkml.kernel.org/r/20220901104515.135162-2-zouyipeng@huawei.com

    Cc: stable@vger.kernel.org
    Fixes: c3bc8fd637a96 ("tracing: Centralize preemptirq tracepoints and unify their usage")
    Signed-off-by: Yipeng Zou
    Signed-off-by: Steven Rostedt (Google)
    Signed-off-by: Sasha Levin

    Yipeng Zou
     
  • [ Upstream commit 8b023accc8df70e72f7704d29fead7ca914d6837 ]

    While looking into a bug related to the compiler's handling of addresses
    of labels, I noticed some uses of _THIS_IP_ seemed unused in lockdep.
    Drive by cleanup.

    -Wunused-parameter:
    kernel/locking/lockdep.c:1383:22: warning: unused parameter 'ip'
    kernel/locking/lockdep.c:4246:48: warning: unused parameter 'ip'
    kernel/locking/lockdep.c:4844:19: warning: unused parameter 'ip'

    Signed-off-by: Nick Desaulniers
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Link: https://lore.kernel.org/r/20220314221909.2027027-1-ndesaulniers@google.com
    Stable-dep-of: 54c3931957f6 ("tracing: hold caller_addr to hardirq_{enable,disable}_ip")
    Signed-off-by: Sasha Levin

    Nick Desaulniers
     

15 Sep, 2022

7 commits

  • [ Upstream commit 3f0461613ebcdc8c4073e235053d06d5aa58750f ]

    The second operand passed to slot_addr() is declared as int or unsigned int
    in all call sites. The left-shift to get the offset of a slot can overflow
    if swiotlb size is larger than 4G.

    Convert the macro to an inline function and declare the second argument as
    phys_addr_t to avoid the potential overflow.
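The wraparound can be reproduced in user space. The following is a minimal sketch, not the kernel code: IO_TLB_SHIFT is 11 in the kernel (2 KiB slots), but the helper names slot_offset_32() and slot_offset_64() are hypothetical, and the 32-bit variant uses unsigned arithmetic so that the wraparound is well-defined C.

```c
#include <stdint.h>

/* Slot size shift; 11 matches the kernel's IO_TLB_SHIFT (2 KiB slots). */
#define IO_TLB_SHIFT 11

/*
 * Hypothetical helper: the shift is done in 32-bit arithmetic, as happens
 * when the operand is an unsigned int. The result wraps to a small value
 * once idx << IO_TLB_SHIFT reaches 4 GiB.
 */
static inline uint64_t slot_offset_32(unsigned int idx)
{
	return (uint64_t)(uint32_t)((uint32_t)idx << IO_TLB_SHIFT);
}

/*
 * Hypothetical helper: the index is widened to 64 bits before shifting,
 * analogous to declaring the argument as phys_addr_t on a 64-bit platform.
 */
static inline uint64_t slot_offset_64(uint64_t idx)
{
	return idx << IO_TLB_SHIFT;
}
```

With 2 KiB slots, slot index 2097152 (2^21) is the first one whose offset no longer fits in 32 bits: the 32-bit helper yields 0 for it, while the 64-bit helper yields 4294967296.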

    Fixes: 26a7e094783d ("swiotlb: refactor swiotlb_tbl_map_single")
    Signed-off-by: Chao Gao
    Reviewed-by: Dongli Zhang
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Chao Gao
     
  • [ Upstream commit 85eaeb5058f0f04dffb124c97c86b4f18db0b833 ]

    Fix a nested deadlock in the ODP flow by using mmput_async().

    The call trace below [1] shows that calling mmput() while holding
    umem_odp->umem_mutex, as ib_umem_odp_map_dma_and_lock() requires, might
    trigger exit_mmap()->__mmu_notifier_release()->mlx5_ib_invalidate_range()
    in the same task, which may deadlock when it tries to take the same
    mutex.

    Switching to mmput_async() solves the problem: the exit_mmap() flow above
    is then run from another task and executes once the lock becomes
    available.

    [1]
    [64843.077665] task:kworker/u133:2 state:D stack: 0 pid:80906 ppid: 2 flags:0x00004000
    [64843.077672] Workqueue: mlx5_ib_page_fault mlx5_ib_eqe_pf_action [mlx5_ib]
    [64843.077719] Call Trace:
    [64843.077722]
    [64843.077724] __schedule+0x23d/0x590
    [64843.077729] schedule+0x4e/0xb0
    [64843.077735] schedule_preempt_disabled+0xe/0x10
    [64843.077740] __mutex_lock.constprop.0+0x263/0x490
    [64843.077747] __mutex_lock_slowpath+0x13/0x20
    [64843.077752] mutex_lock+0x34/0x40
    [64843.077758] mlx5_ib_invalidate_range+0x48/0x270 [mlx5_ib]
    [64843.077808] __mmu_notifier_release+0x1a4/0x200
    [64843.077816] exit_mmap+0x1bc/0x200
    [64843.077822] ? walk_page_range+0x9c/0x120
    [64843.077828] ? __cond_resched+0x1a/0x50
    [64843.077833] ? mutex_lock+0x13/0x40
    [64843.077839] ? uprobe_clear_state+0xac/0x120
    [64843.077860] mmput+0x5f/0x140
    [64843.077867] ib_umem_odp_map_dma_and_lock+0x21b/0x580 [ib_core]
    [64843.077931] pagefault_real_mr+0x9a/0x140 [mlx5_ib]
    [64843.077962] pagefault_mr+0xb4/0x550 [mlx5_ib]
    [64843.077992] pagefault_single_data_segment.constprop.0+0x2ac/0x560 [mlx5_ib]
    [64843.078022] mlx5_ib_eqe_pf_action+0x528/0x780 [mlx5_ib]
    [64843.078051] process_one_work+0x22b/0x3d0
    [64843.078059] worker_thread+0x53/0x410
    [64843.078065] ? process_one_work+0x3d0/0x3d0
    [64843.078073] kthread+0x12a/0x150
    [64843.078079] ? set_kthread_struct+0x50/0x50
    [64843.078085] ret_from_fork+0x22/0x30
    [64843.078093]

    Fixes: 36f30e486dce ("IB/core: Improve ODP to use hmm_range_fault()")
    Reviewed-by: Maor Gottlieb
    Signed-off-by: Yishai Hadas
    Link: https://lore.kernel.org/r/74d93541ea533ef7daec6f126deb1072500aeb16.1661251841.git.leonro@nvidia.com
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Sasha Levin

    Yishai Hadas
     
  • [ Upstream commit 4f7e7236435ca0abe005c674ebd6892c6e83aeb3 ]

    Bringing up a CPU may involve creating and destroying tasks which requires
    read-locking threadgroup_rwsem, so threadgroup_rwsem nests inside
    cpus_read_lock(). However, cpuset's ->attach(), which may be called with
    threadgroup_rwsem write-locked, also wants to disable CPU hotplug and
    acquires cpus_read_lock(), leading to a deadlock.

    Fix it by guaranteeing that ->attach() is always called with CPU hotplug
    disabled and removing cpus_read_lock() call from cpuset_attach().

    Signed-off-by: Tejun Heo
    Reviewed-and-tested-by: Imran Khan
    Reported-and-tested-by: Xuewen Yan
    Fixes: 05c7b7a92cc8 ("cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug")
    Cc: stable@vger.kernel.org # v5.17+
    Signed-off-by: Sasha Levin

    Tejun Heo
     
  • [ Upstream commit 671c11f0619e5ccb380bcf0f062f69ba95fc974a ]

    cgroup_update_dfl_csses() write-locks the threadgroup_rwsem because
    updating the csses can trigger process migrations. However, if the subtree
    doesn't contain any tasks, there won't be any cgroup migrations. This
    condition can be trivially detected by testing whether
    mgctx.preloaded_src_csets is empty. Elide write-locking threadgroup_rwsem if
    the subtree is empty.

    After this optimization, the usage pattern of creating a cgroup, enabling
    the necessary controllers, and then seeding it with CLONE_INTO_CGROUP and
    then removing the cgroup after it becomes empty doesn't need to write-lock
    threadgroup_rwsem at all.

    Signed-off-by: Tejun Heo
    Cc: Christian Brauner
    Cc: Michal Koutný
    Signed-off-by: Sasha Levin

    Tejun Heo
     
  • commit c2e406596571659451f4b95e37ddfd5a8ef1d0dc upstream.

    Kuyo reports that the pattern of using debugfs_remove(debugfs_lookup())
    leaks a dentry and with a hotplug stress test, the machine eventually
    runs out of memory.

    Fix this up by using the newly created debugfs_lookup_and_remove() call
    instead which properly handles the dentry reference counting logic.

    Cc: Major Chen
    Cc: stable
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Juri Lelli
    Cc: Vincent Guittot
    Cc: Dietmar Eggemann
    Cc: Steven Rostedt
    Cc: Ben Segall
    Cc: Mel Gorman
    Cc: Daniel Bristot de Oliveira
    Cc: Valentin Schneider
    Cc: Matthias Brugger
    Reported-by: Kuyo Chang
    Tested-by: Kuyo Chang
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lore.kernel.org/r/20220902123107.109274-2-gregkh@linuxfoundation.org
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     
  • commit 1efda38d6f9ba26ac88b359c6277f1172db03f1e upstream.

    The system call gate area counts as kernel text, but trying
    to install a kprobe in this area fails with an Oops later on.
    To fix this, explicitly disallow the gate area for kprobes.

    Found by syzkaller with the following reproducer:
    perf_event_open$cgroup(&(0x7f00000001c0)={0x6, 0x80, 0x0, 0x0, 0x0, 0x0, 0x80ffff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, @perf_config_ext={0x0, 0xffffffffff600000}}, 0xffffffffffffffff, 0x0, 0xffffffffffffffff, 0x0)

    Sample report:
    BUG: unable to handle page fault for address: fffffbfff3ac6000
    PGD 6dfcb067 P4D 6dfcb067 PUD 6df8f067 PMD 6de4d067 PTE 0
    Oops: 0000 [#1] PREEMPT SMP KASAN NOPTI
    CPU: 0 PID: 21978 Comm: syz-executor.2 Not tainted 6.0.0-rc3-00363-g7726d4c3e60b-dirty #6
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
    RIP: 0010:__insn_get_emulate_prefix arch/x86/lib/insn.c:91 [inline]
    RIP: 0010:insn_get_emulate_prefix arch/x86/lib/insn.c:106 [inline]
    RIP: 0010:insn_get_prefixes.part.0+0xa8/0x1110 arch/x86/lib/insn.c:134
    Code: 49 be 00 00 00 00 00 fc ff df 48 8b 40 60 48 89 44 24 08 e9 81 00 00 00 e8 e5 4b 39 ff 4c 89 fa 4c 89 f9 48 c1 ea 03 83 e1 07 0f b6 14 32 38 ca 7f 08 84 d2 0f 85 06 10 00 00 48 89 d8 48 89
    RSP: 0018:ffffc900088bf860 EFLAGS: 00010246
    RAX: 0000000000040000 RBX: ffffffff9b9bebc0 RCX: 0000000000000000
    RDX: 1ffffffff3ac6000 RSI: ffffc90002d82000 RDI: ffffc900088bf9e8
    RBP: ffffffff9d630001 R08: 0000000000000000 R09: ffffc900088bf9e8
    R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000001
    R13: ffffffff9d630000 R14: dffffc0000000000 R15: ffffffff9d630000
    FS: 00007f63eef63640(0000) GS:ffff88806d000000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: fffffbfff3ac6000 CR3: 0000000029d90005 CR4: 0000000000770ef0
    PKRU: 55555554
    Call Trace:

    insn_get_prefixes arch/x86/lib/insn.c:131 [inline]
    insn_get_opcode arch/x86/lib/insn.c:272 [inline]
    insn_get_modrm+0x64a/0x7b0 arch/x86/lib/insn.c:343
    insn_get_sib+0x29a/0x330 arch/x86/lib/insn.c:421
    insn_get_displacement+0x350/0x6b0 arch/x86/lib/insn.c:464
    insn_get_immediate arch/x86/lib/insn.c:632 [inline]
    insn_get_length arch/x86/lib/insn.c:707 [inline]
    insn_decode+0x43a/0x490 arch/x86/lib/insn.c:747
    can_probe+0xfc/0x1d0 arch/x86/kernel/kprobes/core.c:282
    arch_prepare_kprobe+0x79/0x1c0 arch/x86/kernel/kprobes/core.c:739
    prepare_kprobe kernel/kprobes.c:1160 [inline]
    register_kprobe kernel/kprobes.c:1641 [inline]
    register_kprobe+0xb6e/0x1690 kernel/kprobes.c:1603
    __register_trace_kprobe kernel/trace/trace_kprobe.c:509 [inline]
    __register_trace_kprobe+0x26a/0x2d0 kernel/trace/trace_kprobe.c:477
    create_local_trace_kprobe+0x1f7/0x350 kernel/trace/trace_kprobe.c:1833
    perf_kprobe_init+0x18c/0x280 kernel/trace/trace_event_perf.c:271
    perf_kprobe_event_init+0xf8/0x1c0 kernel/events/core.c:9888
    perf_try_init_event+0x12d/0x570 kernel/events/core.c:11261
    perf_init_event kernel/events/core.c:11325 [inline]
    perf_event_alloc.part.0+0xf7f/0x36a0 kernel/events/core.c:11619
    perf_event_alloc kernel/events/core.c:12059 [inline]
    __do_sys_perf_event_open+0x4a8/0x2a00 kernel/events/core.c:12157
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x63/0xcd
    RIP: 0033:0x7f63ef7efaed
    Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007f63eef63028 EFLAGS: 00000246 ORIG_RAX: 000000000000012a
    RAX: ffffffffffffffda RBX: 00007f63ef90ff80 RCX: 00007f63ef7efaed
    RDX: 0000000000000000 RSI: ffffffffffffffff RDI: 00000000200001c0
    RBP: 00007f63ef86019c R08: 0000000000000000 R09: 0000000000000000
    R10: ffffffffffffffff R11: 0000000000000246 R12: 0000000000000000
    R13: 0000000000000002 R14: 00007f63ef90ff80 R15: 00007f63eef43000

    Modules linked in:
    CR2: fffffbfff3ac6000
    ---[ end trace 0000000000000000 ]---
    RIP: 0010:__insn_get_emulate_prefix arch/x86/lib/insn.c:91 [inline]
    RIP: 0010:insn_get_emulate_prefix arch/x86/lib/insn.c:106 [inline]
    RIP: 0010:insn_get_prefixes.part.0+0xa8/0x1110 arch/x86/lib/insn.c:134
    Code: 49 be 00 00 00 00 00 fc ff df 48 8b 40 60 48 89 44 24 08 e9 81 00 00 00 e8 e5 4b 39 ff 4c 89 fa 4c 89 f9 48 c1 ea 03 83 e1 07 0f b6 14 32 38 ca 7f 08 84 d2 0f 85 06 10 00 00 48 89 d8 48 89
    RSP: 0018:ffffc900088bf860 EFLAGS: 00010246
    RAX: 0000000000040000 RBX: ffffffff9b9bebc0 RCX: 0000000000000000
    RDX: 1ffffffff3ac6000 RSI: ffffc90002d82000 RDI: ffffc900088bf9e8
    RBP: ffffffff9d630001 R08: 0000000000000000 R09: ffffc900088bf9e8
    R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000001
    R13: ffffffff9d630000 R14: dffffc0000000000 R15: ffffffff9d630000
    FS: 00007f63eef63640(0000) GS:ffff88806d000000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: fffffbfff3ac6000 CR3: 0000000029d90005 CR4: 0000000000770ef0
    PKRU: 55555554
    ==================================================================

    Link: https://lkml.kernel.org/r/20220907200917.654103-1-lk@c--e.de

    cc: "Naveen N. Rao"
    cc: Anil S Keshavamurthy
    cc: "David S. Miller"
    Cc: stable@vger.kernel.org
    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Acked-by: Masami Hiramatsu (Google)
    Signed-off-by: Christian A. Ehrhardt
    Signed-off-by: Steven Rostedt (Google)
    Signed-off-by: Greg Kroah-Hartman

    Christian A. Ehrhardt
     
  • commit cecf8e128ec69149fe53c9a7bafa505a4bee25d9 upstream.

    Since check_user_trigger() is called outside of an RCU read-side
    critical section, its list_for_each_entry_rcu() triggers a suspicious
    RCU usage warning.

    # echo hist:keys=pid > events/sched/sched_stat_runtime/trigger
    # cat events/sched/sched_stat_runtime/trigger
    [ 43.167032]
    [ 43.167418] =============================
    [ 43.167992] WARNING: suspicious RCU usage
    [ 43.168567] 5.19.0-rc5-00029-g19ebe4651abf #59 Not tainted
    [ 43.169283] -----------------------------
    [ 43.169863] kernel/trace/trace_events_trigger.c:145 RCU-list traversed in non-reader section!!
    ...

    However, the file->triggers list is safe to access while event_mutex
    is held.
    To fix this warning, add a lockdep_is_held() check to the
    list_for_each_entry_rcu().

    Link: https://lkml.kernel.org/r/166226474977.223837.1992182913048377113.stgit@devnote2

    Cc: stable@vger.kernel.org
    Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
    Signed-off-by: Masami Hiramatsu (Google)
    Signed-off-by: Steven Rostedt (Google)
    Signed-off-by: Greg Kroah-Hartman

    Masami Hiramatsu (Google)
     

08 Sep, 2022

2 commits

  • [ Upstream commit 7d6620f107bae6ed687ff07668e8e8f855487aa9 ]

    Syzkaller reported a triggered kernel BUG as follows:

    ------------[ cut here ]------------
    kernel BUG at kernel/bpf/cgroup.c:925!
    invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
    CPU: 1 PID: 194 Comm: detach Not tainted 5.19.0-14184-g69dac8e431af #8
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
    rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
    RIP: 0010:__cgroup_bpf_detach+0x1f2/0x2a0
    Code: 00 e8 92 60 30 00 84 c0 75 d8 4c 89 e0 31 f6 85 f6 74 19 42 f6 84
    28 48 05 00 00 02 75 0e 48 8b 80 c0 00 00 00 48 85 c0 75 e5 0b 48
    8b 0c5
    RSP: 0018:ffffc9000055bdb0 EFLAGS: 00000246
    RAX: 0000000000000000 RBX: ffff888100ec0800 RCX: ffffc900000f1000
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff888100ec4578
    RBP: 0000000000000000 R08: ffff888100ec0800 R09: 0000000000000040
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff888100ec4000
    R13: 000000000000000d R14: ffffc90000199000 R15: ffff888100effb00
    FS: 00007f68213d2b80(0000) GS:ffff88813bc80000(0000)
    knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055f74a0e5850 CR3: 0000000102836000 CR4: 00000000000006e0
    Call Trace:

    cgroup_bpf_prog_detach+0xcc/0x100
    __sys_bpf+0x2273/0x2a00
    __x64_sys_bpf+0x17/0x20
    do_syscall_64+0x3b/0x90
    entry_SYSCALL_64_after_hwframe+0x63/0xcd
    RIP: 0033:0x7f68214dbcb9
    Code: 08 44 89 e0 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 48 89 f8 48 89
    f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01
    f0 ff8
    RSP: 002b:00007ffeb487db68 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
    RAX: ffffffffffffffda RBX: 000000000000000b RCX: 00007f68214dbcb9
    RDX: 0000000000000090 RSI: 00007ffeb487db70 RDI: 0000000000000009
    RBP: 0000000000000003 R08: 0000000000000012 R09: 0000000b00000003
    R10: 00007ffeb487db70 R11: 0000000000000246 R12: 00007ffeb487dc20
    R13: 0000000000000004 R14: 0000000000000001 R15: 000055f74a1011b0

    Modules linked in:
    ---[ end trace 0000000000000000 ]---

    Reproduction steps:

    For the following cgroup tree,

    root
    |
    cg1
    |
    cg2

    1. Attach prog2 to cg2, and then attach prog1 to cg1; both bpf progs
       have attach type NONE or OVERRIDE.
    2. Write 1 to /proc/thread-self/fail-nth for failslab.
    3. Detach prog1 from cg1; the kernel BUG then occurs.

    Failslab injection makes kmalloc fail, so the detach path falls back to
    purge_effective_progs. The problem is that cg2 has another prog attached,
    so when the iteration goes through the cg2 layer it increments pos to 1.
    Subsequent operations are then skipped by the following condition, and
    cg ends up NULL.

    `if (pos && !(cg->bpf.flags[atype] & BPF_F_ALLOW_MULTI))`

    A NULL cg means that no link or prog matched. This is expected and is
    not a bug, so simply skip the no-match case.

    Fixes: 4c46091ee985 ("bpf: Fix KASAN use-after-free Read in compute_effective_progs")
    Signed-off-by: Pu Lehui
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20220813134030.1972696-1-pulehui@huawei.com
    Signed-off-by: Sasha Levin

    Pu Lehui
     
  • [ Upstream commit 14b20b784f59bdd95f6f1cfb112c9818bcec4d84 ]

    The verifier cannot perform sufficient validation of any pointers passed
    into bpf_attr and treats them as integers rather than pointers. The helper
    will then read from arbitrary pointers passed into it. Restrict the helper
    to CAP_PERFMON since the security model in BPF of arbitrary kernel read is
    CAP_BPF + CAP_PERFMON.

    Fixes: af2ac3e13e45 ("bpf: Prepare bpf syscall to be used from kernel and user space.")
    Signed-off-by: YiFei Zhu
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20220816205517.682470-1-zhuyifei@google.com
    Signed-off-by: Sasha Levin

    YiFei Zhu
     

05 Sep, 2022

2 commits

  • commit 9c80e79906b4ca440d09e7f116609262bb747909 upstream.

    The assumption in __disable_kprobe() is wrong, and it could try to disarm
    an already disarmed kprobe and fire the WARN_ONCE() below. [0] We can
    easily reproduce this issue.

    1. Write 0 to /sys/kernel/debug/kprobes/enabled.

    # echo 0 > /sys/kernel/debug/kprobes/enabled

    2. Run execsnoop. At this time, one kprobe is disabled.

    # /usr/share/bcc/tools/execsnoop &
    [1] 2460
    PCOMM PID PPID RET ARGS

    # cat /sys/kernel/debug/kprobes/list
    ffffffff91345650 r __x64_sys_execve+0x0 [FTRACE]
    ffffffff91345650 k __x64_sys_execve+0x0 [DISABLED][FTRACE]

    3. Write 1 to /sys/kernel/debug/kprobes/enabled, which changes
    kprobes_all_disarmed to false but does not arm the disabled kprobe.

    # echo 1 > /sys/kernel/debug/kprobes/enabled

    # cat /sys/kernel/debug/kprobes/list
    ffffffff91345650 r __x64_sys_execve+0x0 [FTRACE]
    ffffffff91345650 k __x64_sys_execve+0x0 [DISABLED][FTRACE]

    4. Kill execsnoop; __disable_kprobe() then calls disarm_kprobe() for the
    disabled kprobe and hits the WARN_ONCE() in __disarm_kprobe_ftrace().

    # fg
    /usr/share/bcc/tools/execsnoop
    ^C

    Actually, WARN_ONCE() is fired twice, and __unregister_kprobe_top() misses
    some cleanups and leaves the aggregated kprobe in the hash table. Then,
    __unregister_trace_kprobe() initialises tk->rp.kp.list and creates an
    infinite loop like this.

    aggregated kprobe.list -> kprobe.list -.
                                      ^    |
                                      '.__.'

    In this situation, these commands fall into the infinite loop and result
    in RCU stall or soft lockup.

    cat /sys/kernel/debug/kprobes/list : show_kprobe_addr() enters into the
                                         infinite loop with RCU.

    /usr/share/bcc/tools/execsnoop : warn_kprobe_rereg() holds kprobe_mutex,
                                     and __get_valid_kprobe() is stuck in
                                     the loop.

    To avoid the issue, make sure we don't call disarm_kprobe() for disabled
    kprobes.

    [0]
    Failed to disarm kprobe-ftrace at __x64_sys_execve+0x0/0x40 (error -2)
    WARNING: CPU: 6 PID: 2460 at kernel/kprobes.c:1130 __disarm_kprobe_ftrace.isra.19 (kernel/kprobes.c:1129)
    Modules linked in: ena
    CPU: 6 PID: 2460 Comm: execsnoop Not tainted 5.19.0+ #28
    Hardware name: Amazon EC2 c5.2xlarge/, BIOS 1.0 10/16/2017
    RIP: 0010:__disarm_kprobe_ftrace.isra.19 (kernel/kprobes.c:1129)
    Code: 24 8b 02 eb c1 80 3d c4 83 f2 01 00 75 d4 48 8b 75 00 89 c2 48 c7 c7 90 fa 0f 92 89 04 24 c6 05 ab 83 01 e8 e4 94 f0 ff 0b 8b 04 24 eb b1 89 c6 48 c7 c7 60 fa 0f 92 89 04 24 e8 cc 94
    RSP: 0018:ffff9e6ec154bd98 EFLAGS: 00010282
    RAX: 0000000000000000 RBX: ffffffff930f7b00 RCX: 0000000000000001
    RDX: 0000000080000001 RSI: ffffffff921461c5 RDI: 00000000ffffffff
    RBP: ffff89c504286da8 R08: 0000000000000000 R09: c0000000fffeffff
    R10: 0000000000000000 R11: ffff9e6ec154bc28 R12: ffff89c502394e40
    R13: ffff89c502394c00 R14: ffff9e6ec154bc00 R15: 0000000000000000
    FS: 00007fe800398740(0000) GS:ffff89c812d80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000000c00057f010 CR3: 0000000103b54006 CR4: 00000000007706e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    PKRU: 55555554
    Call Trace:

    __disable_kprobe (kernel/kprobes.c:1716)
    disable_kprobe (kernel/kprobes.c:2392)
    __disable_trace_kprobe (kernel/trace/trace_kprobe.c:340)
    disable_trace_kprobe (kernel/trace/trace_kprobe.c:429)
    perf_trace_event_unreg.isra.2 (./include/linux/tracepoint.h:93 kernel/trace/trace_event_perf.c:168)
    perf_kprobe_destroy (kernel/trace/trace_event_perf.c:295)
    _free_event (kernel/events/core.c:4971)
    perf_event_release_kernel (kernel/events/core.c:5176)
    perf_release (kernel/events/core.c:5186)
    __fput (fs/file_table.c:321)
    task_work_run (./include/linux/sched.h:2056 (discriminator 1) kernel/task_work.c:179 (discriminator 1))
    exit_to_user_mode_prepare (./include/linux/resume_user_mode.h:49 kernel/entry/common.c:169 kernel/entry/common.c:201)
    syscall_exit_to_user_mode (./arch/x86/include/asm/jump_label.h:55 ./arch/x86/include/asm/nospec-branch.h:384 ./arch/x86/include/asm/entry-common.h:94 kernel/entry/common.c:133 kernel/entry/common.c:296)
    do_syscall_64 (arch/x86/entry/common.c:87)
    entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
    RIP: 0033:0x7fe7ff210654
    Code: 15 79 89 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb be 0f 1f 00 8b 05 9a cd 20 00 48 63 ff 85 c0 75 11 b8 03 00 00 00 0f 05 3d 00 f0 ff ff 77 3a f3 c3 48 83 ec 18 48 89 7c 24 08 e8 34 fc
    RSP: 002b:00007ffdbd1d3538 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
    RAX: 0000000000000000 RBX: 0000000000000008 RCX: 00007fe7ff210654
    RDX: 0000000000000000 RSI: 0000000000002401 RDI: 0000000000000008
    RBP: 0000000000000000 R08: 94ae31d6fda838a4 R09: 00007fe8001c9d30
    R10: 00007ffdbd1d34b0 R11: 0000000000000246 R12: 00007ffdbd1d3600
    R13: 0000000000000000 R14: fffffffffffffffc R15: 00007ffdbd1d3560

    Link: https://lkml.kernel.org/r/20220813020509.90805-1-kuniyu@amazon.com
    Fixes: 69d54b916d83 ("kprobes: makes kprobes/enabled works correctly for optimized kprobes.")
    Signed-off-by: Kuniyuki Iwashima
    Reported-by: Ayushman Dutta
    Cc: "Naveen N. Rao"
    Cc: Anil S Keshavamurthy
    Cc: "David S. Miller"
    Cc: Masami Hiramatsu
    Cc: Wang Nan
    Cc: Kuniyuki Iwashima
    Cc: Ayushman Dutta
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Kuniyuki Iwashima
     
  • commit c3b0f72e805f0801f05fa2aa52011c4bfc694c44 upstream.

    ftrace_startup() does not remove ops from ftrace_ops_list when
    ftrace_startup_enable() fails:

    register_ftrace_function
      ftrace_startup
        __register_ftrace_function
          ...
          add_ftrace_ops(&ftrace_ops_list, ops)
          ...
        ...
        ftrace_startup_enable // if ftrace failed to modify, ftrace_disabled is set to 1
        ...
        return 0 // ops is in the ftrace_ops_list.

    When ftrace_disabled = 1, unregister_ftrace_function() simply returns
    without doing anything:

    unregister_ftrace_function
      ftrace_shutdown
        if (unlikely(ftrace_disabled))
                return -ENODEV; // return here, __unregister_ftrace_function is not executed,
                                // as a result, ops is still in the ftrace_ops_list
        __unregister_ftrace_function
        ...

    If ops is dynamically allocated, it will be freed later; in that case,
    is_ftrace_trampoline() accesses a NULL pointer:

    is_ftrace_trampoline
      ftrace_ops_trampoline
        do_for_each_ftrace_op(op, ftrace_ops_list) // OOPS! op may be NULL!

    Syzkaller reports as follows:
    [ 1203.506103] BUG: kernel NULL pointer dereference, address: 000000000000010b
    [ 1203.508039] #PF: supervisor read access in kernel mode
    [ 1203.508798] #PF: error_code(0x0000) - not-present page
    [ 1203.509558] PGD 800000011660b067 P4D 800000011660b067 PUD 130fb8067 PMD 0
    [ 1203.510560] Oops: 0000 [#1] SMP KASAN PTI
    [ 1203.511189] CPU: 6 PID: 29532 Comm: syz-executor.2 Tainted: G B W 5.10.0 #8
    [ 1203.512324] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
    [ 1203.513895] RIP: 0010:is_ftrace_trampoline+0x26/0xb0
    [ 1203.514644] Code: ff eb d3 90 41 55 41 54 49 89 fc 55 53 e8 f2 00 fd ff 48 8b 1d 3b 35 5d 03 e8 e6 00 fd ff 48 8d bb 90 00 00 00 e8 2a 81 26 00 8b ab 90 00 00 00 48 85 ed 74 1d e8 c9 00 fd ff 48 8d bb 98 00
    [ 1203.518838] RSP: 0018:ffffc900012cf960 EFLAGS: 00010246
    [ 1203.520092] RAX: 0000000000000000 RBX: 000000000000007b RCX: ffffffff8a331866
    [ 1203.521469] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000000010b
    [ 1203.522583] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff8df18b07
    [ 1203.523550] R10: fffffbfff1be3160 R11: 0000000000000001 R12: 0000000000478399
    [ 1203.524596] R13: 0000000000000000 R14: ffff888145088000 R15: 0000000000000008
    [ 1203.525634] FS: 00007f429f5f4700(0000) GS:ffff8881daf00000(0000) knlGS:0000000000000000
    [ 1203.526801] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1203.527626] CR2: 000000000000010b CR3: 0000000170e1e001 CR4: 00000000003706e0
    [ 1203.528611] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 1203.529605] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

    Therefore, when ftrace_startup_enable() fails, we need to roll back the
    registration and remove ops from ftrace_ops_list.
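The rollback idea can be modelled in user space. The sketch below is an illustration only and is not the ftrace implementation: struct ops, ops_list, register_ops() and the enable_ok parameter are all hypothetical stand-ins for ftrace_ops, ftrace_ops_list, ftrace_startup() and the outcome of ftrace_startup_enable().

```c
#include <stddef.h>

/* Hypothetical stand-in for an ftrace_ops entry on a singly linked list. */
struct ops {
	struct ops *next;
};

static struct ops *ops_list;	/* models ftrace_ops_list */

/* Insert o at the head of the list. */
static void list_add_ops(struct ops *o)
{
	o->next = ops_list;
	ops_list = o;
}

/* Unlink o from the list if it is present. */
static void list_del_ops(struct ops *o)
{
	struct ops **p;

	for (p = &ops_list; *p; p = &(*p)->next) {
		if (*p == o) {
			*p = o->next;
			o->next = NULL;
			return;
		}
	}
}

/* Return 1 if o is currently on the list, 0 otherwise. */
static int list_contains(const struct ops *o)
{
	const struct ops *p;

	for (p = ops_list; p; p = p->next)
		if (p == o)
			return 1;
	return 0;
}

/*
 * Register o, then run the enable step. enable_ok == 0 models a failure
 * of the enable step. On failure the list insertion is undone, so the
 * caller may free o without leaving a dangling entry on the list.
 */
static int register_ops(struct ops *o, int enable_ok)
{
	list_add_ops(o);
	if (!enable_ok) {
		list_del_ops(o);	/* roll back the registration */
		return -1;
	}
	return 0;
}
```

In this model a failed register_ops() leaves the list exactly as it was before the call, which is the property the fix restores for ftrace_ops_list.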

    Link: https://lkml.kernel.org/r/20220818032659.56209-1-yangjihong1@huawei.com

    Suggested-by: Steven Rostedt
    Signed-off-by: Yang Jihong
    Signed-off-by: Steven Rostedt (Google)
    Signed-off-by: Greg Kroah-Hartman

    Yang Jihong
     

31 Aug, 2022

4 commits

  • commit a657182a5c5150cdfacb6640aad1d2712571a409 upstream.

    Hsin-Wei reported a KASAN splat triggered by their BPF runtime fuzzer which
    is based on a customized syzkaller:

    BUG: KASAN: slab-out-of-bounds in bpf_int_jit_compile+0x1257/0x13f0
    Read of size 8 at addr ffff888004e90b58 by task syz-executor.0/1489
    CPU: 1 PID: 1489 Comm: syz-executor.0 Not tainted 5.19.0 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
    1.13.0-1ubuntu1.1 04/01/2014
    Call Trace:

    dump_stack_lvl+0x9c/0xc9
    print_address_description.constprop.0+0x1f/0x1f0
    ? bpf_int_jit_compile+0x1257/0x13f0
    kasan_report.cold+0xeb/0x197
    ? kvmalloc_node+0x170/0x200
    ? bpf_int_jit_compile+0x1257/0x13f0
    bpf_int_jit_compile+0x1257/0x13f0
    ? arch_prepare_bpf_dispatcher+0xd0/0xd0
    ? rcu_read_lock_sched_held+0x43/0x70
    bpf_prog_select_runtime+0x3e8/0x640
    ? bpf_obj_name_cpy+0x149/0x1b0
    bpf_prog_load+0x102f/0x2220
    ? __bpf_prog_put.constprop.0+0x220/0x220
    ? find_held_lock+0x2c/0x110
    ? __might_fault+0xd6/0x180
    ? lock_downgrade+0x6e0/0x6e0
    ? lock_is_held_type+0xa6/0x120
    ? __might_fault+0x147/0x180
    __sys_bpf+0x137b/0x6070
    ? bpf_perf_link_attach+0x530/0x530
    ? new_sync_read+0x600/0x600
    ? __fget_files+0x255/0x450
    ? lock_downgrade+0x6e0/0x6e0
    ? fput+0x30/0x1a0
    ? ksys_write+0x1a8/0x260
    __x64_sys_bpf+0x7a/0xc0
    ? syscall_enter_from_user_mode+0x21/0x70
    do_syscall_64+0x3b/0x90
    entry_SYSCALL_64_after_hwframe+0x63/0xcd
    RIP: 0033:0x7f917c4e2c2d

    The problem here is that a range of tnum_range(0, map->max_entries - 1) has
    limited ability to represent the concrete tight range with the tnum as the
    set of resulting states from value + mask can result in a superset of the
    actual intended range, and as such a tnum_in(range, reg->var_off) check may
    yield true when it shouldn't, for example tnum_range(0, 2) would result in
    00XX -> v = 0000, m = 0011 such that the intended set of {0, 1, 2} is here
    represented by a less precise superset of {0, 1, 2, 3}. As the register is
    known const scalar, really just use the concrete reg->var_off.value for the
    upper index check.
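
    A minimal userspace sketch of the imprecision described above, assuming
    simplified copies of the verifier's tnum helpers. fls64_model(),
    key_ok_tnum() and key_ok_const() are illustrative names, not kernel
    functions, and the real verifier code differs in detail:

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace model of a tristate number: bits set in mask are unknown. */
struct tnum {
	uint64_t value;
	uint64_t mask;
};

/* 1-based position of the highest set bit, 0 when x == 0. */
static int fls64_model(uint64_t x)
{
	int bits = 0;

	while (x) {
		bits++;
		x >>= 1;
	}
	return bits;
}

static struct tnum tnum_const(uint64_t value)
{
	return (struct tnum){ .value = value, .mask = 0 };
}

/* A tnum covering [min, max]: all bits up to the highest differing bit become unknown. */
static struct tnum tnum_range(uint64_t min, uint64_t max)
{
	uint64_t chi = min ^ max, delta;
	int bits = fls64_model(chi);

	if (bits > 63)	/* 1ULL << 64 would be undefined */
		return (struct tnum){ .value = 0, .mask = UINT64_MAX };
	delta = (1ULL << bits) - 1;
	return (struct tnum){ .value = min & ~delta, .mask = delta };
}

/* True if every concrete value described by b is also described by a. */
static bool tnum_in(struct tnum a, struct tnum b)
{
	if (b.mask & ~a.mask)
		return false;
	b.value &= ~a.mask;
	return a.value == b.value;
}

/* Imprecise check: accepts any key inside the tnum superset of the range. */
static bool key_ok_tnum(struct tnum key, uint32_t max_entries)
{
	return tnum_in(tnum_range(0, max_entries - 1), key);
}

/* Precise check for a known-constant key: compare the concrete value. */
static bool key_ok_const(struct tnum key, uint32_t max_entries)
{
	return key.mask == 0 && key.value < max_entries;
}
```

    With max_entries == 3, key_ok_tnum() accepts the constant key 3 because
    tnum_range(0, 2) is value 0, mask 3, while key_ok_const() rejects it.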

    Fixes: d2e4c1e6c294 ("bpf: Constant map key tracking for prog array pokes")
    Reported-by: Hsin-Wei Hung
    Signed-off-by: Daniel Borkmann
    Cc: Shung-Hsi Yu
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/r/984b37f9fdf7ac36831d2137415a4a915744c1b6.1661462653.git.daniel@iogearbox.net
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • commit a8faed3a02eeb75857a3b5d660fa80fe79db77a3 upstream.

    When CONFIG_ADVISE_SYSCALLS is not set/enabled and CONFIG_COMPAT is
    set/enabled, the riscv compat_syscall_table references
    'compat_sys_fadvise64_64', which is not defined:

    riscv64-linux-ld: arch/riscv/kernel/compat_syscall_table.o:(.rodata+0x6f8):
    undefined reference to `compat_sys_fadvise64_64'

    Add 'fadvise64_64' to kernel/sys_ni.c as a conditional COMPAT function so
    that when CONFIG_ADVISE_SYSCALLS is not set, there is a fallback function
    available.

    Link: https://lkml.kernel.org/r/20220807220934.5689-1-rdunlap@infradead.org
    Fixes: d3ac21cacc24 ("mm: Support compiling out madvise and fadvise")
    Signed-off-by: Randy Dunlap
    Suggested-by: Arnd Bergmann
    Reviewed-by: Arnd Bergmann
    Cc: Josh Triplett
    Cc: Paul Walmsley
    Cc: Palmer Dabbelt
    Cc: Albert Ou
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Randy Dunlap
     
  • commit 763f4fb76e24959c370cdaa889b2492ba6175580 upstream.

    Root cause:
    rebind_subsystems() holds no lock when it moves a css object from list A
    to list B, so a concurrent list_for_each_entry_rcu() walker can end up
    treating list B's head as a css node.

    Solution:
    Add a grace period before invalidating the removed rstat_css_node.

    Reported-by: Jing-Ting Wu
    Suggested-by: Michal Koutný
    Signed-off-by: Jing-Ting Wu
    Tested-by: Jing-Ting Wu
    Link: https://lore.kernel.org/linux-arm-kernel/d8f0bc5e2fb6ed259f9334c83279b4c011283c41.camel@mediatek.com/T/
    Acked-by: Mukesh Ojha
    Fixes: a7df69b81aac ("cgroup: rstat: support cgroup1")
    Cc: stable@vger.kernel.org # v5.13+
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Jing-Ting Wu
     
  • commit ad982c3be4e60c7d39c03f782733503cbd88fd2a upstream.

    audit_alloc_mark() assigns pathname to audit_mark->path. On the error path
    from fsnotify_add_inode_mark(), fsnotify_put_mark() frees
    audit_mark->path, but the caller of audit_alloc_mark() then frees the
    pathname again, resulting in a double free.

    Fix this by resetting audit_mark->path to NULL pointer on error path
    from fsnotify_add_inode_mark().
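
    The ownership rule behind this fix can be sketched in userspace C. This is
    only a model under stated assumptions: struct mark, mark_put() and
    mark_alloc() are hypothetical stand-ins for the audit/fsnotify objects,
    and add_fails simulates an error from fsnotify_add_inode_mark():

```c
#include <stdlib.h>

/* Hypothetical stand-in for the audit mark; it may own a path string. */
struct mark {
	char *path;
};

/* Models dropping the last reference: frees the mark and the path it owns. */
static void mark_put(struct mark *m)
{
	free(m->path);	/* free(NULL) is a no-op */
	free(m);
}

/*
 * On success the mark takes ownership of pathname.
 * On failure the caller keeps ownership and is expected to free it.
 */
static struct mark *mark_alloc(char *pathname, int add_fails)
{
	struct mark *m = malloc(sizeof(*m));

	if (!m)
		return NULL;
	m->path = pathname;
	if (add_fails) {
		m->path = NULL;	/* hand ownership back before the put */
		mark_put(m);
		return NULL;
	}
	return m;
}
```

    Without the m->path = NULL line, mark_put() and the caller would both free
    the same string.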

    Cc: stable@vger.kernel.org
    Fixes: 7b1293234084d ("fsnotify: Add group pointer in fsnotify_init_mark()")
    Signed-off-by: Gaosheng Cui
    Reviewed-by: Jan Kara
    Signed-off-by: Paul Moore
    Signed-off-by: Greg Kroah-Hartman

    Gaosheng Cui
     

25 Aug, 2022

11 commits

  • [ Upstream commit 7c56a8733d0a2a4be2438a7512566e5ce552fccf ]

    In some circumstances it may be interesting to reconfigure the watchdog
    from inside the kernel.

    On PowerPC, this may be helpful before and after an LPAR migration (LPM)
    is initiated: because migration implies some latencies, the watchdog, and
    especially the NMI watchdog, is expected to trigger during this operation.
    Reconfiguring the watchdog with a factor would prevent it from triggering
    too frequently during LPM.

    Rename lockup_detector_reconfigure() as __lockup_detector_reconfigure() and
    create a new function lockup_detector_reconfigure() calling
    __lockup_detector_reconfigure() under the protection of watchdog_mutex.

    Signed-off-by: Laurent Dufour
    [mpe: Squash in build fix from Laurent, reported by Sachin]
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20220713154729.80789-3-ldufour@linux.ibm.com
    Signed-off-by: Sasha Levin

    Laurent Dufour
     
  • commit f04dec93466a0481763f3b56cdadf8076e28bfbf upstream.

    Currently when an event probe (eprobe) hooks to a string field, it does
    not display it as a string, but instead as a number. This makes the field
    rather useless. Handle the different kinds of strings, dynamic, static,
    relational/dynamic etc.

    Now when a string field is used, the ":string" type can be used to display
    it:

    echo "e:sw sched/sched_switch comm=$next_comm:string" > dynamic_events

    Link: https://lkml.kernel.org/r/20220820134400.959640191@goodmis.org

    Cc: stable@vger.kernel.org
    Cc: Ingo Molnar
    Cc: Andrew Morton
    Cc: Tzvetomir Stoyanov
    Cc: Tom Zanussi
    Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
    Acked-by: Masami Hiramatsu (Google)
    Signed-off-by: Steven Rostedt (Google)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Google)
     
  • commit ef1e93d2eeb58a1f08c37b22a2314b94bc045f15 upstream.

    bpf_iter_attach_map() acquires a map uref, and the uref may be released
    before or in the middle of iterating map elements. For example, the uref
    could be released in bpf_iter_detach_map() as part of
    bpf_link_release(), or could be released in bpf_map_put_with_uref() as
    part of bpf_map_release().

    So acquire an extra map uref in bpf_iter_init_hash_map() and
    release it in bpf_iter_fini_hash_map().

    Fixes: d6c4503cc296 ("bpf: Implement bpf iterator for hash maps")
    Signed-off-by: Hou Tao
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/r/20220810080538.1845898-3-houtao@huaweicloud.com
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Greg Kroah-Hartman

    Hou Tao
     
  • commit f76fa6b338055054f80c72b29c97fb95c1becadc upstream.

    bpf_iter_attach_map() acquires a map uref, and the uref may be released
    before or in the middle of iterating map elements. For example, the uref
    could be released in bpf_iter_detach_map() as part of
    bpf_link_release(), or could be released in bpf_map_put_with_uref() as
    part of bpf_map_release().

    An alternative fix is to acquire an extra bpf_link reference, just like
    a pinned map iterator does, but that introduces an unnecessary dependency
    on bpf_link instead of bpf_map.

    So choose another fix: acquiring an extra map uref in .init_seq_private
    for array map iterator.

    Fixes: d3cc2ab546ad ("bpf: Implement bpf iterator for array maps")
    Signed-off-by: Hou Tao
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/r/20220810080538.1845898-2-houtao@huaweicloud.com
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Greg Kroah-Hartman

    Hou Tao
     
  • commit 275c30bcee66a27d1aa97a215d607ad6d49804cb upstream.

    The LRU map that is preallocated may have its elements reused while
    another program holds a pointer to it from bpf_map_lookup_elem. Hence,
    only check_and_free_fields is appropriate when the element is being
    deleted, as it ensures proper synchronization against concurrent access
    of the map value. After that, we cannot call check_and_init_map_value
    again as it may rewrite bpf_spin_lock, bpf_timer, and kptr fields while
    they can be concurrently accessed from a BPF program.

    This is safe to do as when the map entry is deleted, concurrent access
    is protected against by check_and_free_fields, i.e. an existing timer
    would be freed, and any existing kptr will be released by it. The
    program can create further timers and kptrs after check_and_free_fields,
    but they will eventually be released once the preallocated items are
    freed on map destruction, even if the item is never reused again. Hence,
    the deleted item sitting in the free list can still have resources
    attached to it, and they would never leak.

    With spin_lock, we never touch the field at all on delete or update, as
    we may end up modifying the state of the lock. Since the verifier
    ensures that a bpf_spin_lock call is always paired with bpf_spin_unlock
    call, the program will eventually release the lock so that on reuse the
    new user of the value can take the lock.

    Essentially, for the preallocated case, we must assume that the map
    value may always be in use by the program, even when it is sitting in
    the freelist, and handle things accordingly, i.e. use proper
    synchronization inside check_and_free_fields, and never reinitialize the
    special fields when it is reused on update.

    Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
    Acked-by: Yonghong Song
    Signed-off-by: Kumar Kartikeya Dwivedi
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/r/20220809213033.24147-3-memxor@gmail.com
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Greg Kroah-Hartman

    Kumar Kartikeya Dwivedi
     
  • commit b2380577d4fe1c0ef3fa50417f1e441c016e4cbe upstream.

    Make filtering consistent with histograms. As "cpu" can be a field of an
    event, allow for "common_cpu" to keep it from being confused with the
    "cpu" field of the event.

    Link: https://lkml.kernel.org/r/20220820134401.513062765@goodmis.org
    Link: https://lore.kernel.org/all/20220820220920.e42fa32b70505b1904f0a0ad@kernel.org/

    Cc: stable@vger.kernel.org
    Cc: Ingo Molnar
    Cc: Andrew Morton
    Cc: Tzvetomir Stoyanov
    Cc: Tom Zanussi
    Fixes: 1e3bac71c5053 ("tracing/histogram: Rename "cpu" to "common_cpu"")
    Suggested-by: Masami Hiramatsu (Google)
    Acked-by: Masami Hiramatsu (Google)
    Signed-off-by: Steven Rostedt (Google)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Google)
     
  • commit ab8384442ee512fc0fc72deeb036110843d0e7ff upstream.

    Both $comm and $COMM can be used to get current->comm in eprobes and the
    filtering and histogram logic. Make kprobes and uprobes consistent in this
    regard and allow both $comm and $COMM as well. Currently kprobes and
    uprobes only handle $comm, which is inconsistent with the other utilities,
    and can be confusing to users.

    Link: https://lkml.kernel.org/r/20220820134401.317014913@goodmis.org
    Link: https://lore.kernel.org/all/20220820220442.776e1ddaf8836e82edb34d01@kernel.org/

    Cc: stable@vger.kernel.org
    Cc: Ingo Molnar
    Cc: Andrew Morton
    Cc: Tzvetomir Stoyanov
    Cc: Tom Zanussi
    Fixes: 533059281ee5 ("tracing: probeevent: Introduce new argument fetching code")
    Suggested-by: Masami Hiramatsu (Google)
    Acked-by: Masami Hiramatsu (Google)
    Signed-off-by: Steven Rostedt (Google)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Google)
     
  • commit 6a832ec3d680b3a4f4fad5752672827d71bae501 upstream.

    Currently, if a symbol "@" is attempted to be used with an event probe
    (eprobes), it will cause a NULL pointer dereference crash.

    Both kprobes and uprobes can reference data other than the main registers,
    such as immediate addresses, symbols and the current task name. Have
    eprobes do the same thing.

    For "comm", if "comm" is used and the event being attached to does not
    have the "comm" field, then make it the "$comm" that kprobes has. This is
    consistent with the way histograms and filters work.

    Link: https://lkml.kernel.org/r/20220820134401.136924220@goodmis.org

    Cc: stable@vger.kernel.org
    Cc: Ingo Molnar
    Cc: Andrew Morton
    Cc: Masami Hiramatsu
    Cc: Tzvetomir Stoyanov
    Cc: Tom Zanussi
    Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
    Signed-off-by: Steven Rostedt (Google)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Google)
     
  • commit 02333de90e5945e2fe7fc75b15b4eb9aee187f0a upstream.

    The variable $comm is hard coded as a string, which is true for both
    kprobes and uprobes, but for event probes (eprobes) it is a field name. In
    most cases the "comm" field would be a string, but there's no guarantee of
    that fact.

    Do not assume that comm is a string. Not to mention, it currently forces
    comm fields to fault, as string processing for event probes is currently
    broken.

    Link: https://lkml.kernel.org/r/20220820134400.756152112@goodmis.org

    Cc: stable@vger.kernel.org
    Cc: Ingo Molnar
    Cc: Andrew Morton
    Cc: Masami Hiramatsu
    Cc: Tzvetomir Stoyanov
    Cc: Tom Zanussi
    Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
    Signed-off-by: Steven Rostedt (Google)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Google)
     
  • commit 2673c60ee67e71f2ebe34386e62d348f71edee47 upstream.

    While playing with event probes (eprobes), I tried to see what would
    happen if I attempted to retrieve the instruction pointer (%rip) knowing
    that event probes do not use pt_regs. The result was:

    BUG: kernel NULL pointer dereference, address: 0000000000000024
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    Oops: 0000 [#1] PREEMPT SMP PTI
    CPU: 1 PID: 1847 Comm: trace-cmd Not tainted 5.19.0-rc5-test+ #309
    Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01
    v03.03 07/14/2016
    RIP: 0010:get_event_field.isra.0+0x0/0x50
    Code: ff 48 c7 c7 c0 8f 74 a1 e8 3d 8b f5 ff e8 88 09 f6 ff 4c 89 e7 e8
    50 6a 13 00 48 89 ef 5b 5d 41 5c 41 5d e9 42 6a 13 00 66 90 63 47 24
    8b 57 2c 48 01 c6 8b 47 28 83 f8 02 74 0e 83 f8 04 74
    RSP: 0018:ffff916c394bbaf0 EFLAGS: 00010086
    RAX: ffff916c854041d8 RBX: ffff916c8d9fbf50 RCX: ffff916c255d2000
    RDX: 0000000000000000 RSI: ffff916c255d2008 RDI: 0000000000000000
    RBP: 0000000000000000 R08: ffff916c3a2a0c08 R09: ffff916c394bbda8
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff916c854041d8
    R13: ffff916c854041b0 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff916c9ea40000(0000)
    knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000024 CR3: 000000011b60a002 CR4: 00000000001706e0
    Call Trace:

    get_eprobe_size+0xb4/0x640
    ? __mod_node_page_state+0x72/0xc0
    __eprobe_trace_func+0x59/0x1a0
    ? __mod_lruvec_page_state+0xaa/0x1b0
    ? page_remove_file_rmap+0x14/0x230
    ? page_remove_rmap+0xda/0x170
    event_triggers_call+0x52/0xe0
    trace_event_buffer_commit+0x18f/0x240
    trace_event_raw_event_sched_wakeup_template+0x7a/0xb0
    try_to_wake_up+0x260/0x4c0
    __wake_up_common+0x80/0x180
    __wake_up_common_lock+0x7c/0xc0
    do_notify_parent+0x1c9/0x2a0
    exit_notify+0x1a9/0x220
    do_exit+0x2ba/0x450
    do_group_exit+0x2d/0x90
    __x64_sys_exit_group+0x14/0x20
    do_syscall_64+0x3b/0x90
    entry_SYSCALL_64_after_hwframe+0x46/0xb0

    Obviously this is not the desired result.

    Move the testing for TPARG_FL_TPOINT which is only used for event probes
    to the top of the "$" variable check, as all the other variables are not
    used for event probes. Also add a check in the register parsing "%" to
    fail if an event probe is used.

    Link: https://lkml.kernel.org/r/20220820134400.564426983@goodmis.org

    Cc: stable@vger.kernel.org
    Cc: Ingo Molnar
    Cc: Andrew Morton
    Cc: Tzvetomir Stoyanov
    Cc: Tom Zanussi
    Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
    Acked-by: Masami Hiramatsu (Google)
    Signed-off-by: Steven Rostedt (Google)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Google)
     
  • commit 7249921d94ff64f67b733eca0b68853a62032b3d upstream.

    If in perf_trace_event_init(), the perf_trace_event_open() fails, then it
    will call perf_trace_event_unreg() which will not only unregister the perf
    trace event, but will also call the put() function of the tp_event.

    The problem here is that the trace_event_try_get_ref() is called by the
    caller of perf_trace_event_init() and if perf_trace_event_init() returns a
    failure, it will then call trace_event_put(). But since the
    perf_trace_event_unreg() already called the trace_event_put() function, it
    triggers a WARN_ON().

    WARNING: CPU: 1 PID: 30309 at kernel/trace/trace_dynevent.c:46 trace_event_dyn_put_ref+0x15/0x20

    If perf_trace_event_reg() does not call the trace_event_try_get_ref() then
    the perf_trace_event_unreg() should not be calling trace_event_put(). This
    breaks symmetry and causes bugs like these.

    Pull out the trace_event_put() from perf_trace_event_unreg() and call it
    in the locations that perf_trace_event_unreg() is called. This not only
    fixes this bug, but also brings back the proper symmetry of the reg/unreg
    vs get/put logic.
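
    The restored symmetry can be sketched as follows. This is a hedged model,
    not the kernel code: struct event and every helper name below are
    hypothetical, and open_fails simulates perf_trace_event_open() failing:

```c
/* Hypothetical model of a dynamic trace event with a reference count. */
struct event {
	int refcnt;
};

static int event_try_get_ref(struct event *e)
{
	e->refcnt++;
	return 1;
}

static void event_put(struct event *e)
{
	e->refcnt--;
}

/* Undoes registration only; it no longer touches the reference count. */
static void event_unreg(struct event *e)
{
	(void)e;
}

/* Models the init step: on failure it unregisters but does not put. */
static int event_init(struct event *e, int open_fails)
{
	if (open_fails) {
		event_unreg(e);
		return -1;
	}
	return 0;
}

/* The caller took the reference, so the caller drops it exactly once. */
static int event_attach(struct event *e, int open_fails)
{
	if (!event_try_get_ref(e))
		return -1;
	if (event_init(e, open_fails)) {
		event_put(e);
		return -1;
	}
	return 0;
}
```

    If event_unreg() also called event_put(), the failure path would drop the
    count twice for a single get.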

    Link: https://lore.kernel.org/all/cover.1660347763.git.kjlx@templeofstupid.com/
    Link: https://lkml.kernel.org/r/20220816192817.43d5e17f@gandalf.local.home

    Cc: stable@vger.kernel.org
    Fixes: 1d18538e6a092 ("tracing: Have dynamic events have a ref counter")
    Reported-by: Krister Johansen
    Reviewed-by: Krister Johansen
    Tested-by: Krister Johansen
    Acked-by: Jiri Olsa
    Signed-off-by: Steven Rostedt (Google)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Google)
     

17 Aug, 2022

10 commits

  • [ Upstream commit 55de2c0b5610cba5a5a93c0788031133c457e689 ]

    Add '__rel_loc' using trace event macros. These macros are usually
    not used in the kernel, except for testing purposes.
    This also adds "rel_" variants of the macros for dynamic_array, string,
    and bitmask.

    Link: https://lkml.kernel.org/r/163757342119.510314.816029622439099016.stgit@devnote2

    Cc: Beau Belgrave
    Cc: Namhyung Kim
    Cc: Tom Zanussi
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Sasha Levin

    Masami Hiramatsu
     
  • [ Upstream commit 9c9b26b0df270d4f9246e483a44686fca951a29c ]

    The csdlock_debug kernel-boot parameter is parsed by the
    early_param() function csdlock_debug(). If set, csdlock_debug()
    invokes static_branch_enable() to enable csd_lock_wait feature, which
    triggers a panic on arm64 for kernels built with CONFIG_SPARSEMEM=y and
    CONFIG_SPARSEMEM_VMEMMAP=n.

    With CONFIG_SPARSEMEM_VMEMMAP=n, __nr_to_section is called in
    static_key_enable() and returns NULL, resulting in a NULL dereference
    because mem_section is initialized only later in sparse_init().

    This is also a problem for powerpc because early_param() functions
    are invoked earlier than jump_label_init(), also resulting in
    static_key_enable() failures. These failures cause the warning "static
    key 'xxx' used before call to jump_label_init()".

    Thus, early_param is too early for csd_lock_wait to run
    static_branch_enable(), so change it to __setup to fix these.

    Fixes: 8d0968cc6b8f ("locking/csd_lock: Add boot parameter for controlling CSD lock debugging")
    Cc: stable@vger.kernel.org
    Reported-by: Chen jingwen
    Signed-off-by: Chen Zhongjin
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Sasha Levin

    Chen Zhongjin
     
  • [ Upstream commit b8ac29b40183a6038919768b5d189c9bd91ce9b4 ]

    The rng's random_init() function contributes the real time to the rng at
    boot time, so that events can at least start in relation to something
    particular in the real world. But this clock might not yet be set at that
    point in boot, so nothing is contributed. In addition, the relation
    between minor clock changes from, say, NTP, and the cycle counter is
    potentially useful entropic data.

    This commit addresses this by mixing in a time stamp on calls to
    settimeofday and adjtimex. No entropy is credited in doing so, so it
    doesn't make initialization faster, but it is still useful input to
    have.

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Cc: stable@vger.kernel.org
    Reviewed-by: Thomas Gleixner
    Reviewed-by: Eric Biggers
    Signed-off-by: Jason A. Donenfeld
    Signed-off-by: Sasha Levin

    Jason A. Donenfeld
     
  • [ Upstream commit 751d4cbc43879229dbc124afefe240b70fd29a85 ]

    The following warning was triggered on a large machine early in boot on
    a distribution kernel but the same problem should also affect mainline.

    WARNING: CPU: 439 PID: 10 at ../kernel/workqueue.c:2231 process_one_work+0x4d/0x440
    Call Trace:

    rescuer_thread+0x1f6/0x360
    kthread+0x156/0x180
    ret_from_fork+0x22/0x30

    Commit c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
    optimises ttwu by queueing a task that is descheduling on the wakelist,
    but does not check if the task descheduling is still allowed to run on that CPU.

    In this warning, the problematic task is a workqueue rescue thread which
    checks if the rescue is for a per-cpu workqueue and running on the wrong CPU.
    While this is early in boot and it should be possible to create workers,
    the rescue thread may still be used if MAYDAY_INITIAL_TIMEOUT or
    MAYDAY_INTERVAL is reached, and on a sufficiently large machine the rescue
    thread is used frequently.

    Tracing confirmed that the task should have migrated properly using the
    stopper thread to handle the migration. However, a parallel wakeup from udev
    running on another CPU that does not share CPU cache observes p->on_cpu and
    uses task_cpu(p), queues the task on the old CPU and triggers the warning.

    Check that the wakee task that is descheduling is still allowed to run
    on its current CPU and if not, wait for the descheduling to complete
    and select an allowed CPU.

    Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
    Signed-off-by: Mel Gorman
    Signed-off-by: Ingo Molnar
    Link: https://lore.kernel.org/r/20220804092119.20137-1-mgorman@techsingularity.net
    Signed-off-by: Sasha Levin

    Mel Gorman
     
  • [ Upstream commit f3dd3f674555bd9455c5ae7fafce0696bd9931b3 ]

    Wakelist can help avoid cache bouncing and offload the overhead of waker
    cpu. So far, using wakelist within the same llc only happens on
    WF_ON_CPU, and this limitation could be removed to further improve
    wakeup performance.

    The commit 518cd6234178 ("sched: Only queue remote wakeups when
    crossing cache boundaries") disabled queuing tasks on wakelist when
    the cpus share llc. This is because, at that time, the scheduler must
    send IPIs to do ttwu_queue_wakelist. Nowadays, ttwu_queue_wakelist also
    supports TIF_POLLING, so this is not a problem now when the wakee cpu is
    in idle polling.

    Benefits:
    Queuing the task on an idle cpu can help improve performance on the waker
    cpu and utilization on the wakee cpu, and further improve locality because
    the wakee cpu can handle its own rq. This patch helps improve rt on
    our real java workloads where wakeup happens frequently.

    Consider the normal condition (CPU0 and CPU1 share same llc)
    Before this patch:

    CPU0                                          CPU1

    select_task_rq()                              idle
    rq_lock(CPU1->rq)
    enqueue_task(CPU1->rq)
    notify CPU1 (by sending IPI or CPU1 polling)

                                                  resched()

    After this patch:

    CPU0                                          CPU1

    select_task_rq()                              idle
    add to wakelist of CPU1
    notify CPU1 (by sending IPI or CPU1 polling)

                                                  rq_lock(CPU1->rq)
                                                  enqueue_task(CPU1->rq)
                                                  resched()

    CPU0 can finish its work earlier: it only needs to put the task on the
    wakelist and return. CPU1 is idle anyway, so it can handle its own
    runqueue data.

    This patch makes no difference with respect to IPIs.
    This patch only takes effect when the wakee cpu is:
    1) idle polling
    2) idle not polling

    For 1), there will be no IPI with or without this patch.

    For 2), there will always be an IPI before or after this patch.
    Before this patch: waker cpu will enqueue task and check preempt. Since
    "idle" will be sure to be preempted, waker cpu must send a resched IPI.
    After this patch: waker cpu will put the task to the wakelist of wakee
    cpu, and send an IPI.

    Benchmark:
    We've tested schbench, unixbench, and hackbench on both x86 and arm64.

    On x86 (Intel Xeon Platinum 8269CY):
    schbench -m 2 -t 8

    Latency percentiles (usec)                  before         after
    50.0000th:                                       8             6
    75.0000th:                                      10             7
    90.0000th:                                      11             8
    95.0000th:                                      12             8
    *99.0000th:                                     13            10
    99.5000th:                                      15            11
    99.9000th:                                      18            14

    Unixbench with full threads (104)
                                                before         after
    Dhrystone 2 using register variables    3011862938    3009935994   -0.06%
    Double-Precision Whetstone                617119.3      617298.5    0.03%
    Execl Throughput                           27667.3       27627.3   -0.14%
    File Copy 1024 bufsize 2000 maxblocks     785871.4      784906.2   -0.12%
    File Copy 256 bufsize 500 maxblocks       210113.6      212635.4    1.20%
    File Copy 4096 bufsize 8000 maxblocks    2328862.2     2320529.1   -0.36%
    Pipe Throughput                        145535622.8   145323033.2   -0.15%
    Pipe-based Context Switching             3221686.4     3583975.4   11.25%
    Process Creation                          101347.1      103345.4    1.97%
    Shell Scripts (1 concurrent)              120193.5      123977.8    3.15%
    Shell Scripts (8 concurrent)               17233.4       17138.4   -0.55%
    System Call Overhead                     5300604.8     5312213.6    0.22%

    hackbench -g 1 -l 100000
                                                before         after
    Time                                         3.246         2.251

    On arm64 (Ampere Altra):
    schbench -m 2 -t 8

    Latency percentiles (usec)                  before         after
    50.0000th:                                      14            10
    75.0000th:                                      19            14
    90.0000th:                                      22            16
    95.0000th:                                      23            16
    *99.0000th:                                     24            17
    99.5000th:                                      24            17
    99.9000th:                                      28            25

    Unixbench with full threads (80)
                                                before         after
    Dhrystone 2 using register variables    3536194249    3537019613    0.02%
    Double-Precision Whetstone                629383.6      629431.6    0.01%
    Execl Throughput                           65920.5       65846.2   -0.11%
    File Copy 1024 bufsize 2000 maxblocks    1063722.8     1064026.8    0.03%
    File Copy 256 bufsize 500 maxblocks       322684.5      318724.5   -1.23%
    File Copy 4096 bufsize 8000 maxblocks    2348285.3     2328804.8   -0.83%
    Pipe Throughput                        133542875.3   131619389.8   -1.44%
    Pipe-based Context Switching             3215356.1     3576945.1   11.25%
    Process Creation                          108520.5      120184.6   10.75%
    Shell Scripts (1 concurrent)              122636.3        121888   -0.61%
    Shell Scripts (8 concurrent)               17462.1       17381.4   -0.46%
    System Call Overhead                     4429998.9     4435006.7    0.11%

    hackbench -g 1 -l 100000
                                                before         after
    Time                                         4.217         2.916

    This patch improves schbench, hackbench and the Pipe-based Context
    Switching test of unixbench when idle cpus exist, with no obvious
    regression on the other unixbench tests.
    This can help improve rt in scenarios where wakeups happen frequently.

    Signed-off-by: Tianchen Ding
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Link: https://lore.kernel.org/r/20220608233412.327341-3-dtcccc@linux.alibaba.com
    Signed-off-by: Sasha Levin

    Tianchen Ding
     
  • [ Upstream commit 28156108fecb1f808b21d216e8ea8f0d205a530c ]

    The commit 2ebb17717550 ("sched/core: Offload wakee task activation if it
    the wakee is descheduling") checked rq->nr_running <= 1 to avoid task
    stacking when WF_ON_CPU.

    Per the ordering of writes to p->on_rq and p->on_cpu, observing p->on_cpu
    (WF_ON_CPU) in ttwu_queue_cond() implies !p->on_rq, IOW p has gone through
    the deactivate_task() in __schedule(), thus p has been accounted out of
    rq->nr_running. As such, the task being the only runnable task on the rq
    implies reading rq->nr_running == 0 at that point.

    The benchmark result is in [1].

    [1] https://lore.kernel.org/all/e34de686-4e85-bde1-9f3c-9bbc86b38627@linux.alibaba.com/

    Suggested-by: Valentin Schneider
    Signed-off-by: Tianchen Ding
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Link: https://lore.kernel.org/r/20220608233412.327341-2-dtcccc@linux.alibaba.com
    Signed-off-by: Sasha Levin

    Tianchen Ding
     
  • [ Upstream commit b6e8d40d43ae4dec00c8fea2593eeea3114b8f44 ]

    With cgroup v2, the cpuset's cpus_allowed mask can be empty indicating
    that the cpuset will just use the effective CPUs of its parent. So
    cpuset_can_attach() can call task_can_attach() with an empty mask.
    This can lead to cpumask_any_and() returning nr_cpu_ids, causing the call
    to dl_bw_of() to crash due to a percpu value access with an out-of-bounds
    CPU value. For example:

    [80468.182258] BUG: unable to handle page fault for address: ffffffff8b6648b0
    :
    [80468.191019] RIP: 0010:dl_cpu_busy+0x30/0x2b0
    :
    [80468.207946] Call Trace:
    [80468.208947] cpuset_can_attach+0xa0/0x140
    [80468.209953] cgroup_migrate_execute+0x8c/0x490
    [80468.210931] cgroup_update_dfl_csses+0x254/0x270
    [80468.211898] cgroup_subtree_control_write+0x322/0x400
    [80468.212854] kernfs_fop_write_iter+0x11c/0x1b0
    [80468.213777] new_sync_write+0x11f/0x1b0
    [80468.214689] vfs_write+0x1eb/0x280
    [80468.215592] ksys_write+0x5f/0xe0
    [80468.216463] do_syscall_64+0x5c/0x80
    [80468.224287] entry_SYSCALL_64_after_hwframe+0x44/0xae

    Fix that by using effective_cpus instead. For cgroup v1, effective_cpus
    is the same as cpus_allowed. For v2, effective_cpus is the real cpumask
    to be used by tasks within the cpuset anyway.

    Also update task_can_attach()'s 2nd argument name to cs_effective_cpus to
    reflect the change. In addition, a check is added to task_can_attach()
    to guard against the possibility that cpumask_any_and() may return a
    value >= nr_cpu_ids.
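
    The added guard can be modeled in userspace as below. This is a sketch
    under stated assumptions: NR_CPU_IDS, any_and(), per_cpu_busy[] and
    can_attach() are hypothetical names that only mimic the shape of the
    check, with 64-bit masks standing in for cpumasks:

```c
#include <stdint.h>

#define NR_CPU_IDS 8u	/* hypothetical CPU count for this model */

/* Models cpumask_any_and(): a CPU set in both masks, or NR_CPU_IDS if none. */
static unsigned int any_and(uint64_t a, uint64_t b)
{
	uint64_t both = a & b;
	unsigned int cpu;

	for (cpu = 0; cpu < NR_CPU_IDS; cpu++)
		if (both & (1ULL << cpu))
			return cpu;
	return NR_CPU_IDS;
}

/* Hypothetical per-CPU state; indexing it with NR_CPU_IDS would be out of bounds. */
static int per_cpu_busy[NR_CPU_IDS];

/* Returns 0 if the attach may proceed, -1 otherwise. */
static int can_attach(uint64_t active_mask, uint64_t cs_effective_cpus)
{
	unsigned int cpu = any_and(active_mask, cs_effective_cpus);

	if (cpu >= NR_CPU_IDS)
		return -1;	/* empty intersection: bail out before the per-CPU access */
	return per_cpu_busy[cpu] ? -1 : 0;
}
```

    An empty cs_effective_cpus mask now yields an error instead of an
    out-of-range index.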

    Fixes: 7f51412a415d ("sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets")
    Signed-off-by: Waiman Long
    Signed-off-by: Ingo Molnar
    Acked-by: Juri Lelli
    Link: https://lore.kernel.org/r/20220803015451.2219567-1-longman@redhat.com
    Signed-off-by: Sasha Levin

    Waiman Long
     
  • [ Upstream commit 772b6539fdda31462cc08368e78df60b31a58bab ]

    Both functions are doing almost the same, that is checking if admission
    control is still respected.

    With exclusive cpusets, dl_task_can_attach() checks if the destination
    cpuset (i.e. its root domain) has enough CPU capacity to accommodate the
    task.
    dl_cpu_busy() checks if there is enough CPU capacity in the cpuset in
    case the CPU is hot-plugged out.

    dl_task_can_attach() is used to check if a task can be admitted while
    dl_cpu_busy() is used to check if a CPU can be hotplugged out.

    Make dl_cpu_busy() able to deal with a task and use it instead of
    dl_task_can_attach() in task_can_attach().

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Juri Lelli
    Link: https://lore.kernel.org/r/20220302183433.333029-4-dietmar.eggemann@arm.com
    Signed-off-by: Sasha Levin

    Dietmar Eggemann
     
  • [ Upstream commit 28f6c37a2910f565b4f5960df52b2eccae28c891 ]

    kernel_text_address() treats ftrace_trampoline, kprobe_insn_slot
    and bpf_text_address as valid kprobe addresses - which is not ideal.

    These text areas are removable and changeable without any notification
    to kprobes, and probing on them can trigger unexpected behavior:

    https://lkml.org/lkml/2022/7/26/1148

    Considering that jump_label and static_call text are already
    forbidden to probe, kernel_text_address() should be replaced with
    core_kernel_text() and is_module_text_address() to check other text
    areas which are unsafe to kprobe.

    [ mingo: Rewrote the changelog. ]

    Fixes: 5b485629ba0d ("kprobes, extable: Identify kprobes trampolines as kernel text area")
    Fixes: 74451e66d516 ("bpf: make jited programs visible in traces")
    Signed-off-by: Chen Zhongjin
    Signed-off-by: Ingo Molnar
    Acked-by: Masami Hiramatsu (Google)
    Link: https://lore.kernel.org/r/20220801033719.228248-1-chenzhongjin@huawei.com
    Signed-off-by: Sasha Levin

    Chen Zhongjin
     
  • [ Upstream commit c51ba246cb172c9e947dc6fb8868a1eaf0b2a913 ]

    In the failure case of trying to use a buffer which we'd previously
    failed to allocate, the "!mem" condition is no longer sufficient since
    io_tlb_default_mem became static and assigned by default. Update the
    condition to work as intended per the rest of that conversion.
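
    Why the NULL test stopped being enough can be shown with a small model.
    The message above does not spell out the updated condition, so checking a
    slab count here is an assumption, and the struct below is a reduced,
    hypothetical version of struct io_tlb_mem:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduced model of the bounce buffer descriptor. */
struct io_tlb_mem {
	unsigned long nslabs;	/* stays 0 until a buffer has been allocated */
};

/* Statically allocated, so a pointer to it is never NULL. */
static struct io_tlb_mem io_tlb_default_mem;

/* Old-style check: always true for a static object. */
static bool buffer_usable_old(const struct io_tlb_mem *mem)
{
	return mem != NULL;
}

/* Assumed updated check: also require that slabs were actually allocated. */
static bool buffer_usable_new(const struct io_tlb_mem *mem)
{
	return mem != NULL && mem->nslabs != 0;
}
```

    With the static object, buffer_usable_old() reports success even when the
    allocation failed, while buffer_usable_new() does not.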

    Fixes: 463e862ac63e ("swiotlb: Convert io_default_tlb_mem to static allocation")
    Signed-off-by: Robin Murphy
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Robin Murphy