23 Sep, 2022

1 commit

  • commit 43626dade36fa74d3329046f4ae2d7fdefe401c6 upstream.

    syzbot is hitting a percpu_rwsem_assert_held(&cpu_hotplug_lock) warning at
    cpuset_attach() [1], because commit 4f7e7236435ca0ab ("cgroup: Fix
    threadgroup_rwsem cpus_read_lock() deadlock") missed that
    cpuset_attach() is also called from cgroup_attach_task_all().
    Add cpus_read_lock() there, like cgroup_procs_write_start() does.
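
    A minimal sketch of the shape of the fix, assuming a simplified
    cgroup_attach_task_all() (the real function also takes cgroup_mutex and
    iterates over every mounted hierarchy):

        int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
        {
                int ret = 0;

                /* the hotplug lock must nest outside threadgroup_rwsem,
                 * mirroring cgroup_procs_write_start() */
                cpus_read_lock();
                percpu_down_write(&cgroup_threadgroup_rwsem);

                /* ... migrate @tsk in each hierarchy ... */

                percpu_up_write(&cgroup_threadgroup_rwsem);
                cpus_read_unlock();
                return ret;
        }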

    Link: https://syzkaller.appspot.com/bug?extid=29d3a3b4d86c8136ad9e [1]
    Reported-by: syzbot
    Signed-off-by: Tetsuo Handa
    Fixes: 4f7e7236435ca0ab ("cgroup: Fix threadgroup_rwsem cpus_read_lock() deadlock")
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tetsuo Handa
     

15 Sep, 2022

2 commits

  • [ Upstream commit 4f7e7236435ca0abe005c674ebd6892c6e83aeb3 ]

    Bringing up a CPU may involve creating and destroying tasks which requires
    read-locking threadgroup_rwsem, so threadgroup_rwsem nests inside
    cpus_read_lock(). However, cpuset's ->attach(), which may be called with
    threadgroup_rwsem write-locked, also wants to disable CPU hotplug and
    acquires cpus_read_lock(), leading to a deadlock.

    Fix it by guaranteeing that ->attach() is always called with CPU hotplug
    disabled and removing cpus_read_lock() call from cpuset_attach().
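
    The resulting lock ordering can be sketched as a pair of helpers; the
    names cgroup_attach_lock()/cgroup_attach_unlock() follow the upstream
    patch, but the bodies here are a simplified sketch:

        /* hotplug lock first, then threadgroup_rwsem -- never the reverse */
        static void cgroup_attach_lock(void)
        {
                cpus_read_lock();
                percpu_down_write(&cgroup_threadgroup_rwsem);
        }

        static void cgroup_attach_unlock(void)
        {
                percpu_up_write(&cgroup_threadgroup_rwsem);
                cpus_read_unlock();
        }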

    Signed-off-by: Tejun Heo
    Reviewed-and-tested-by: Imran Khan
    Reported-and-tested-by: Xuewen Yan
    Fixes: 05c7b7a92cc8 ("cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug")
    Cc: stable@vger.kernel.org # v5.17+
    Signed-off-by: Sasha Levin

    Tejun Heo
     
  • [ Upstream commit 671c11f0619e5ccb380bcf0f062f69ba95fc974a ]

    cgroup_update_dfl_csses() write-locks threadgroup_rwsem because updating
    the csses can trigger process migrations. However, if the subtree doesn't
    contain any tasks, there won't be any cgroup migrations. This condition
    can be trivially detected by testing whether mgctx.preloaded_src_csets is
    empty. Elide write-locking threadgroup_rwsem if the subtree is empty.

    After this optimization, the usage pattern of creating a cgroup, enabling
    the necessary controllers, seeding it with CLONE_INTO_CGROUP, and removing
    the cgroup after it becomes empty doesn't need to write-lock
    threadgroup_rwsem at all.
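
    A hedged sketch of the elision, assuming mgctx has already been populated
    by cgroup_migrate_add_src() for every cset in the subtree:

        bool has_tasks = !list_empty(&mgctx.preloaded_src_csets);

        /* only write-lock when a migration can actually happen */
        if (has_tasks)
                percpu_down_write(&cgroup_threadgroup_rwsem);

        /* ... prepare and execute the migration ... */

        if (has_tasks)
                percpu_up_write(&cgroup_threadgroup_rwsem);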

    Signed-off-by: Tejun Heo
    Cc: Christian Brauner
    Cc: Michal Koutný
    Signed-off-by: Sasha Levin

    Tejun Heo
     

31 Aug, 2022

1 commit

  • commit 763f4fb76e24959c370cdaa889b2492ba6175580 upstream.

    Root cause:
    rebind_subsystems() holds no lock while moving a css object from list A
    to list B, so a concurrent list_for_each_entry_rcu() walker can end up
    treating B's list head as a css node.

    Solution:
    Add a grace period before invalidating the removed rstat_css_node.
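
    A sketch of the fix in rebind_subsystems(), following the upstream patch
    (simplified):

        if (ss->css_rstat_flush) {
                list_del_rcu(&css->rstat_css_node);
                synchronize_rcu();      /* let concurrent rstat walkers drain */
                list_add_rcu(&css->rstat_css_node, &dcgrp->rstat_css_list);
        }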

    Reported-by: Jing-Ting Wu
    Suggested-by: Michal Koutný
    Signed-off-by: Jing-Ting Wu
    Tested-by: Jing-Ting Wu
    Link: https://lore.kernel.org/linux-arm-kernel/d8f0bc5e2fb6ed259f9334c83279b4c011283c41.camel@mediatek.com/T/
    Acked-by: Mukesh Ojha
    Fixes: a7df69b81aac ("cgroup: rstat: support cgroup1")
    Cc: stable@vger.kernel.org # v5.13+
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Jing-Ting Wu
     

17 Aug, 2022

1 commit

  • [ Upstream commit b6e8d40d43ae4dec00c8fea2593eeea3114b8f44 ]

    With cgroup v2, the cpuset's cpus_allowed mask can be empty, indicating
    that the cpuset will just use the effective CPUs of its parent. So
    cpuset_can_attach() can call task_can_attach() with an empty mask.
    This can lead to cpumask_any_and() returning nr_cpu_ids, causing the
    call to dl_bw_of() to crash due to a per-CPU value access with an
    out-of-bounds CPU number. For example:

    [80468.182258] BUG: unable to handle page fault for address: ffffffff8b6648b0
    :
    [80468.191019] RIP: 0010:dl_cpu_busy+0x30/0x2b0
    :
    [80468.207946] Call Trace:
    [80468.208947] cpuset_can_attach+0xa0/0x140
    [80468.209953] cgroup_migrate_execute+0x8c/0x490
    [80468.210931] cgroup_update_dfl_csses+0x254/0x270
    [80468.211898] cgroup_subtree_control_write+0x322/0x400
    [80468.212854] kernfs_fop_write_iter+0x11c/0x1b0
    [80468.213777] new_sync_write+0x11f/0x1b0
    [80468.214689] vfs_write+0x1eb/0x280
    [80468.215592] ksys_write+0x5f/0xe0
    [80468.216463] do_syscall_64+0x5c/0x80
    [80468.224287] entry_SYSCALL_64_after_hwframe+0x44/0xae

    Fix that by using effective_cpus instead. For cgroup v1, effective_cpus
    is the same as cpus_allowed. For v2, effective_cpus is the real cpumask
    to be used by tasks within the cpuset anyway.

    Also update task_can_attach()'s 2nd argument name to cs_effective_cpus to
    reflect the change. In addition, a check is added to task_can_attach()
    to guard against the possibility that cpumask_any_and() may return a
    value >= nr_cpu_ids.
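
    A hedged sketch of the added guard in task_can_attach(), assuming the
    dl_cpu_busy() helper named in the trace above (ret is the function's
    return accumulator):

        if (dl_task(p) && !cpumask_empty(cs_effective_cpus)) {
                int cpu = cpumask_any_and(cpu_active_mask, cs_effective_cpus);

                /* guard against an empty intersection */
                if (unlikely(cpu >= nr_cpu_ids))
                        return -EINVAL;
                ret = dl_cpu_busy(cpu, p);
        }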

    Fixes: 7f51412a415d ("sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets")
    Signed-off-by: Waiman Long
    Signed-off-by: Ingo Molnar
    Acked-by: Juri Lelli
    Link: https://lore.kernel.org/r/20220803015451.2219567-1-longman@redhat.com
    Signed-off-by: Sasha Levin

    Waiman Long
     

22 Jul, 2022

1 commit

  • commit 07fd5b6cdf3cc30bfde8fe0f644771688be04447 upstream.

    Each cset (css_set) is pinned by its tasks. When we're moving tasks around
    across csets for a migration, we need to hold the source and destination
    csets to ensure that they don't go away while we're moving tasks about. This
    is done by linking cset->mg_preload_node on either the
    mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
    same cset->mg_preload_node for both the src and dst lists was deemed okay as
    a cset can't be both the source and destination at the same time.

    Unfortunately, this overloading becomes problematic when multiple tasks are
    involved in a migration and some of them are identity noop migrations while
    others are actually moving across cgroups. For example, this can happen with
    the following sequence on cgroup1:

    #1> mkdir -p /sys/fs/cgroup/misc/a/b
    #2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
    #3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
    #4> PID=$!
    #5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
    #6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs

    Here, #5 moves only the group leader into a/b (the v1 "tasks" file
    operates on single threads), and #6 then moves the whole process,
    including the group leader, back into a. In this final migration,
    non-leader threads would be doing identity migration while the group
    leader is doing an actual one.

    After #3, let's say the whole process was in cset A, and that after #5,
    the leader moves to cset B. Then, during #6, the following happens:

    1. cgroup_migrate_add_src() is called on B for the leader.

    2. cgroup_migrate_add_src() is called on A for the other threads.

    3. cgroup_migrate_prepare_dst() is called. It scans the src list.

    4. It notices that B wants to migrate to A, so it tries to add A to the
    dst list but realizes that its ->mg_preload_node is already busy.

    5. Then it notices that A wants to migrate to A; as it's an identity
    migration, it culls A by list_del_init()'ing its ->mg_preload_node and
    putting references accordingly.

    6. The rest of migration takes place with B on the src list but nothing on
    the dst list.

    This means that A isn't held while migration is in progress. If all tasks
    leave A before the migration finishes and the incoming task pins it, the
    cset will be destroyed leading to use-after-free.

    This is caused by overloading cset->mg_preload_node for both src and dst
    preload lists. We wanted to exclude the cset from the src list but ended up
    inadvertently excluding it from the dst list too.

    This patch fixes the issue by separating out cset->mg_preload_node into
    ->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
    preloadings don't interfere with each other.
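
    A sketch of the split, following the upstream patch:

        struct css_set {
                /* ... */

                /* entry on mgctx->preloaded_src_csets */
                struct list_head mg_src_preload_node;
                /* entry on mgctx->preloaded_dst_csets */
                struct list_head mg_dst_preload_node;

                /* ... */
        };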

    Signed-off-by: Tejun Heo
    Reported-by: Mukesh Ojha
    Reported-by: shisiyuan
    Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
    Link: https://www.spinics.net/lists/cgroups/msg33313.html
    Fixes: f817de98513d ("cgroup: prepare migration path for unified hierarchy")
    Cc: stable@vger.kernel.org # v3.16+
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

18 May, 2022

1 commit

  • commit 2685027fca387b602ae565bff17895188b803988 upstream.

    There are three places where the cpu and node masks of the top cpuset
    can be initialized, listed in the order they are executed:
    1) start_kernel -> cpuset_init()
    2) start_kernel -> cgroup_init() -> cpuset_bind()
    3) kernel_init_freeable() -> do_basic_setup() -> cpuset_init_smp()

    The first cpuset_init() call just sets all the bits in the masks.
    The second cpuset_bind() call sets cpus_allowed and mems_allowed to the
    default v2 values. The third cpuset_init_smp() call sets them back to
    v1 values.

    For systems with cgroup v2 setup, cpuset_bind() is called once. As a
    result, cpu and memory node hot add may fail to update the cpu and node
    masks of the top cpuset to include the newly added cpu or node in a
    cgroup v2 environment.

    For systems with a cgroup v1 setup, cpuset_bind() is called again by
    rebind_subsystems() when the v1 cpuset filesystem is mounted, as shown
    in the dmesg log below from an instrumented kernel.

    [ 2.609781] cpuset_bind() called - v2 = 1
    [ 3.079473] cpuset_init_smp() called
    [ 7.103710] cpuset_bind() called - v2 = 0

    smp_init() is called after the first two init functions. So we don't
    have a complete list of active cpus and memory nodes until later in
    cpuset_init_smp() which is the right time to set up effective_cpus
    and effective_mems.

    To fix this cgroup v2 mask setup problem, the potentially incorrect
    cpus_allowed & mems_allowed settings in cpuset_init_smp() are removed.
    For cgroup v2 systems, the initial cpuset_bind() call will set the masks
    correctly. For cgroup v1 systems, the second call to cpuset_bind()
    will do the right setup.
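
    A hedged sketch of cpuset_init_smp() after the change: only the
    effective masks are set up here, while cpus_allowed/mems_allowed are
    left to cpuset_bind():

        void __init cpuset_init_smp(void)
        {
                /* cpus_allowed/mems_allowed are no longer touched here */
                cpumask_copy(top_cpuset.effective_cpus, cpu_active_mask);
                top_cpuset.effective_mems = node_states[N_MEMORY];
                /* ... */
        }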

    cc: stable@vger.kernel.org
    Signed-off-by: Waiman Long
    Tested-by: Feng Tang
    Reviewed-by: Michal Koutný
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Waiman Long
     

02 Mar, 2022

2 commits

  • commit 467a726b754f474936980da793b4ff2ec3e382a7 upstream.

    The idea is to check: a) the owning user_ns of the cgroup_ns, and b)
    capabilities in init_user_ns.

    Commit 24f600856418 ("cgroup-v1: Require capabilities to set
    release_agent") got this wrong in the write handler of release_agent,
    since it checked the user_ns of the opener (which may differ from the
    owning user_ns of the cgroup_ns).
    Secondly, to avoid a possible confused-deputy attack, the capability of
    the opener must be checked.
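
    A hedged sketch of the corrected check in the release_agent write
    handler (simplified; ctx is the per-open-file context holding the
    opener's cgroup namespace):

        ctx = of->priv;
        if (ctx->ns->user_ns != &init_user_ns ||
            !file_ns_capable(of->file, &init_user_ns, CAP_SYS_ADMIN))
                return -EPERM;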

    Fixes: 24f600856418 ("cgroup-v1: Require capabilities to set release_agent")
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/stable/20220216121142.GB30035@blackbody.suse.cz/
    Signed-off-by: Michal Koutný
    Reviewed-by: Masami Ichikawa(CIP)
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Michal Koutný
     
  • commit 05c7b7a92cc87ff8d7fde189d0fade250697573c upstream.

    As previously discussed (https://lkml.org/lkml/2022/1/20/51),
    cpuset_attach() is affected by a similar cpu hotplug race, as in the
    following scenario:

    cpuset_attach()                               cpu hotplug
    ---------------------------                   ----------------------
    down_write(cpuset_rwsem)
    guarantee_online_cpus() // (load cpus_attach)
                                                  sched_cpu_deactivate
                                                    set_cpu_active()
                                                    // will change cpu_active_mask
    set_cpus_allowed_ptr(cpus_attach)
      __set_cpus_allowed_ptr_locked()
      // (if the intersection of cpus_attach and
      //  cpu_active_mask is empty, will return -EINVAL)
    up_write(cpuset_rwsem)

    To avoid races such as the one described above, protect the
    cpuset_attach() call with cpu_hotplug_lock.
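
    The shape of the fix at that time, as a hedged sketch (this locking was
    later reworked by commit 4f7e7236435ca0ab, above):

        static void cpuset_attach(struct cgroup_taskset *tset)
        {
                /* ... */
                cpus_read_lock();
                percpu_down_write(&cpuset_rwsem);
                /* guarantee_online_cpus() and set_cpus_allowed_ptr() now
                 * run against a stable cpu_active_mask */
                /* ... */
                percpu_up_write(&cpuset_rwsem);
                cpus_read_unlock();
        }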

    Fixes: be367d099270 ("cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time")
    Cc: stable@vger.kernel.org # v2.6.32+
    Reported-by: Zhao Gongyi
    Signed-off-by: Zhang Qiao
    Acked-by: Waiman Long
    Reviewed-by: Michal Koutný
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Zhang Qiao
     

09 Feb, 2022

1 commit

  • commit 2bdfd2825c9662463371e6691b1a794e97fa36b4 upstream.

    It was found that a "suspicious RCU usage" lockdep warning was issued
    with the rcu_read_lock() call in update_sibling_cpumasks(). It is
    because the update_cpumasks_hier() function may sleep. So we have
    to release the RCU lock, call update_cpumasks_hier() and reacquire
    it afterward.

    Also add a percpu_rwsem_assert_held() in update_sibling_cpumasks()
    instead of stating that in the comment.
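
    A hedged, simplified sketch of update_sibling_cpumasks() after the fix
    (the real patch also pins each sibling with css_tryget_online() across
    the unlocked region):

        percpu_rwsem_assert_held(&cpuset_rwsem);

        rcu_read_lock();
        cpuset_for_each_child(sibling, pos_css, parent) {
                if (!sibling->use_parent_ecpus)
                        continue;

                rcu_read_unlock();
                update_cpumasks_hier(sibling, tmp);     /* may sleep */
                rcu_read_lock();
        }
        rcu_read_unlock();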

    Fixes: 4716909cc5c5 ("cpuset: Track cpusets that use parent's effective_cpus")
    Signed-off-by: Waiman Long
    Tested-by: Phil Auld
    Reviewed-by: Phil Auld
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Waiman Long
     

05 Feb, 2022

2 commits

  • commit c80d401c52a2d1baf2a5afeb06f0ffe678e56d23 upstream.

    subparts_cpus should be limited as a subset of cpus_allowed, but it is
    updated wrongly by using cpumask_andnot(). Use cpumask_and() instead to
    fix it.
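
    The one-line shape of the fix, hedged (cp being the cpuset whose
    partition masks are being updated):

        /* keep subparts_cpus a subset of cpus_allowed */
        cpumask_and(cp->subparts_cpus, cp->subparts_cpus, cp->cpus_allowed);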

    Fixes: ee8dde0cd2ce ("cpuset: Add new v2 cpuset.sched.partition flag")
    Signed-off-by: Tianchen Ding
    Reviewed-by: Waiman Long
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tianchen Ding
     
  • commit 24f6008564183aa120d07c03d9289519c2fe02af upstream.

    The cgroup release_agent is called with call_usermodehelper. The function
    call_usermodehelper starts the release_agent with a full set of
    capabilities. Therefore require capabilities when setting the
    release_agent.
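
    A hedged sketch of the check this commit added (later refined by commit
    467a726b754f, above):

        if (!capable(CAP_SYS_ADMIN))
                return -EPERM;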

    Reported-by: Tabitha Sable
    Tested-by: Tabitha Sable
    Fixes: 81a6a5cdd2c5 ("Task Control Groups: automatic userspace notification of idle cgroups")
    Cc: stable@vger.kernel.org # v2.6.24+
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

02 Feb, 2022

1 commit

  • commit a06247c6804f1a7c86a2e5398a4c1f1db1471848 upstream.

    With a write operation on psi files replacing an old trigger with a new
    one, the lifetime of its waitqueue is totally arbitrary. Overwriting an
    existing trigger causes its waitqueue to be freed and a pending poll()
    will stumble on trigger->event_wait, which was destroyed.

    Fix this by disallowing redefinition of an existing psi trigger. If a
    write operation is used on a file descriptor with an already existing
    psi trigger, the operation will fail with an -EBUSY error.

    Also bypass the psi_disabled check in psi_trigger_destroy, as the flag
    can be flipped after the trigger is created, leading to a memory leak.
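
    A hedged sketch of the -EBUSY guard in the psi write path, assuming the
    seq_file-based handler used by the psi interface files:

        struct seq_file *seq = file->private_data;

        /* seq->lock protects seq->private against concurrent writers */
        mutex_lock(&seq->lock);

        /* allow only one trigger per file descriptor */
        if (seq->private) {
                mutex_unlock(&seq->lock);
                return -EBUSY;
        }

        new = psi_trigger_create(&psi_system, buf, nbytes, res);
        if (IS_ERR(new)) {
                mutex_unlock(&seq->lock);
                return PTR_ERR(new);
        }

        smp_store_release(&seq->private, new);
        mutex_unlock(&seq->lock);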

    Fixes: 0e94682b73bf ("psi: introduce psi monitor")
    Reported-by: syzbot+cdb5dd11c97cc532efad@syzkaller.appspotmail.com
    Suggested-by: Linus Torvalds
    Analyzed-by: Eric Biggers
    Signed-off-by: Suren Baghdasaryan
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Eric Biggers
    Acked-by: Johannes Weiner
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20220111232309.1786347-1-surenb@google.com
    Signed-off-by: Greg Kroah-Hartman

    Suren Baghdasaryan
     

11 Jan, 2022

3 commits

  • commit e57457641613fef0d147ede8bd6a3047df588b95 upstream.

    cgroup process migration permission checks are performed at write time as
    whether a given operation is allowed or not is dependent on the content of
    the write - the PID. This currently uses current's cgroup namespace which is
    a potential security weakness as it may allow scenarios where a less
    privileged process tricks a more privileged one into writing into a fd that
    it created.

    This patch makes cgroup remember the cgroup namespace at the time of open
    and uses it for migration permission checks instead of current's. Note
    that this only applies to cgroup2 as cgroup1 doesn't have namespace
    support.

    This also fixes a use-after-free bug on cgroupns reported in

    https://lore.kernel.org/r/00000000000048c15c05d0083397@google.com

    Note that backporting this fix also requires the preceding patch.
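
    A hedged sketch of capturing the namespace at open time in the per-file
    context (simplified from the upstream patch):

        static int cgroup_file_open(struct kernfs_open_file *of)
        {
                struct cgroup_file_ctx *ctx;

                ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
                if (!ctx)
                        return -ENOMEM;

                /* pin the opener's cgroup namespace for later checks */
                ctx->ns = current->nsproxy->cgroup_ns;
                get_cgroup_ns(ctx->ns);
                of->priv = ctx;
                /* ... */
                return 0;
        }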

    Reported-by: "Eric W. Biederman"
    Suggested-by: Linus Torvalds
    Cc: Michal Koutný
    Cc: Oleg Nesterov
    Reviewed-by: Michal Koutný
    Reported-by: syzbot+50f5cf33a284ce738b62@syzkaller.appspotmail.com
    Link: https://lore.kernel.org/r/00000000000048c15c05d0083397@google.com
    Fixes: 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option")
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit 0d2b5955b36250a9428c832664f2079cbf723bec upstream.

    of->priv is currently used by each interface file implementation to store
    private information. This patch collects the current two private data
    usages into struct cgroup_file_ctx, which is allocated and freed by the
    common path. This allows generic private data which applies to multiple
    files, and will be used in the following patch.

    Note that cgroup_procs iterator is now embedded as procs.iter in the new
    cgroup_file_ctx so that it doesn't need to be allocated and freed
    separately.

    v2: union dropped from cgroup_file_ctx and the procs iterator is embedded in
    cgroup_file_ctx as suggested by Linus.

    v3: Michal pointed out that cgroup1's procs pidlist uses of->priv too.
    Converted. Didn't change to embedded allocation as cgroup1 pidlists get
    stored for caching.
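
    A sketch of the new per-open-file context, following the upstream patch:

        struct cgroup_file_ctx {
                struct {
                        void                    *trigger;
                } psi;

                struct {
                        bool                    started;
                        /* embedded rather than separately allocated */
                        struct css_task_iter    iter;
                } procs;

                struct {
                        /* cgroup1 pidlists get stored for caching */
                        struct cgroup_pidlist   *pidlist;
                } procs1;
        };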

    Signed-off-by: Tejun Heo
    Cc: Linus Torvalds
    Reviewed-by: Michal Koutný
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit 1756d7994ad85c2479af6ae5a9750b92324685af upstream.

    cgroup process migration permission checks are performed at write time as
    whether a given operation is allowed or not is dependent on the content of
    the write - the PID. This currently uses current's credentials which is a
    potential security weakness as it may allow scenarios where a less
    privileged process tricks a more privileged one into writing into a fd that
    it created.

    This patch makes both the cgroup2 and cgroup1 process migration
    interfaces use the credentials saved at the time of open (file->f_cred)
    instead of current's.
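
    A hedged sketch of the cgroup1 side, comparing against the opener's
    credentials rather than current's (task being the migration target):

        const struct cred *cred = of->file->f_cred;     /* saved at open */
        const struct cred *tcred = get_task_cred(task);
        int ret = 0;

        if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
            !uid_eq(cred->euid, tcred->uid) &&
            !uid_eq(cred->euid, tcred->suid))
                ret = -EACCES;
        put_cred(tcred);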

    Reported-by: "Eric W. Biederman"
    Suggested-by: Linus Torvalds
    Fixes: 187fe84067bd ("cgroup: require write perm on common ancestor when moving processes on the default hierarchy")
    Reviewed-by: Michal Koutný
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

19 Nov, 2021

2 commits

  • [ Upstream commit 81c49d39aea8a10e6d05d3aa1cb65ceb721e19b0 ]

    In account_guest_time in kernel/sched/cputime.c guest time is
    attributed to both CPUTIME_NICE and CPUTIME_USER in addition to
    CPUTIME_GUEST_NICE and CPUTIME_GUEST respectively. Therefore, adding
    both to calculate usage results in double counting any guest time at
    the rootcg.
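
    A hedged, simplified sketch of the rootcg usage aggregation after the
    fix:

        for_each_possible_cpu(i) {
                u64 *cpustat = per_cpu_ptr(&kernel_cpustat, i)->cpustat;

                utime += cpustat[CPUTIME_USER];
                utime += cpustat[CPUTIME_NICE];
                /* CPUTIME_GUEST/CPUTIME_GUEST_NICE are already folded
                 * into USER/NICE above, so they must not be added again */
        }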

    Fixes: 936f2a70f207 ("cgroup: add cpu.stat file to root cgroup")
    Signed-off-by: Dan Schatzberg
    Signed-off-by: Tejun Heo
    Signed-off-by: Sasha Levin

    Dan Schatzberg
     
  • [ Upstream commit 7ee285395b211cad474b2b989db52666e0430daf ]

    It was found that the following warning was displayed when remounting
    controllers from cgroup v2 to v1:

    [ 8042.997778] WARNING: CPU: 88 PID: 80682 at kernel/cgroup/cgroup.c:3130 cgroup_apply_control_disable+0x158/0x190
    :
    [ 8043.091109] RIP: 0010:cgroup_apply_control_disable+0x158/0x190
    [ 8043.096946] Code: ff f6 45 54 01 74 39 48 8d 7d 10 48 c7 c6 e0 46 5a a4 e8 7b 67 33 00 e9 41 ff ff ff 49 8b 84 24 e8 01 00 00 0f b7 40 08 eb 95 0b e9 5f ff ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3
    [ 8043.115692] RSP: 0018:ffffba8a47c23d28 EFLAGS: 00010202
    [ 8043.120916] RAX: 0000000000000036 RBX: ffffffffa624ce40 RCX: 000000000000181a
    [ 8043.128047] RDX: ffffffffa63c43e0 RSI: ffffffffa63c43e0 RDI: ffff9d7284ee1000
    [ 8043.135180] RBP: ffff9d72874c5800 R08: ffffffffa624b090 R09: 0000000000000004
    [ 8043.142314] R10: ffffffffa624b080 R11: 0000000000002000 R12: ffff9d7284ee1000
    [ 8043.149447] R13: ffff9d7284ee1000 R14: ffffffffa624ce70 R15: ffffffffa6269e20
    [ 8043.156576] FS: 00007f7747cff740(0000) GS:ffff9d7a5fc00000(0000) knlGS:0000000000000000
    [ 8043.164663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 8043.170409] CR2: 00007f7747e96680 CR3: 0000000887d60001 CR4: 00000000007706e0
    [ 8043.177539] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 8043.184673] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 8043.191804] PKRU: 55555554
    [ 8043.194517] Call Trace:
    [ 8043.196970] rebind_subsystems+0x18c/0x470
    [ 8043.201070] cgroup_setup_root+0x16c/0x2f0
    [ 8043.205177] cgroup1_root_to_use+0x204/0x2a0
    [ 8043.209456] cgroup1_get_tree+0x3e/0x120
    [ 8043.213384] vfs_get_tree+0x22/0xb0
    [ 8043.216883] do_new_mount+0x176/0x2d0
    [ 8043.220550] __x64_sys_mount+0x103/0x140
    [ 8043.224474] do_syscall_64+0x38/0x90
    [ 8043.228063] entry_SYSCALL_64_after_hwframe+0x44/0xae

    It was caused by the fact that rebind_subsystems() disables the
    controllers to be rebound one by one. If more than one of the disabled
    controllers originally came from the default hierarchy,
    cgroup_apply_control_disable() will be called multiple times for the
    same default hierarchy. A controller may be killed by css_kill() in
    the first round; in the second round, the killed controller may not be
    completely dead yet, leading to the warning.

    To avoid this problem, we collect the ssids of all controllers that
    need to be disabled from the default hierarchy and then disable them
    in one go instead of one by one.
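
    A hedged sketch of the one-go disable, using the subsystem-mask
    iteration helpers from kernel/cgroup/cgroup.c:

        u16 dfl_disable_ss_mask = 0;

        do_each_subsys_mask(ss, ssid, ss_mask) {
                /* ... existing checks ... */
                if (ss->root == &cgrp_dfl_root)
                        dfl_disable_ss_mask |= 1 << ssid;
        } while_each_subsys_mask();

        if (dfl_disable_ss_mask) {
                struct cgroup *scgrp = &cgrp_dfl_root.cgrp;

                /* disable all default-hierarchy controllers together */
                cgrp_dfl_root.subsys_mask &= ~dfl_disable_ss_mask;
                WARN_ON(cgroup_apply_control(scgrp));
                cgroup_finalize_control(scgrp, 0);
        }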

    Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
    Signed-off-by: Waiman Long
    Signed-off-by: Tejun Heo
    Signed-off-by: Sasha Levin

    Waiman Long
     

23 Oct, 2021

1 commit

  • When enabling CONFIG_CGROUP_BPF, kmemleak can be observed by running
    the command as below:

    $mount -t cgroup -o none,name=foo cgroup cgroup/
    $umount cgroup/

    unreferenced object 0xc3585c40 (size 64):
    comm "mount", pid 425, jiffies 4294959825 (age 31.990s)
    hex dump (first 32 bytes):
    01 00 00 80 84 8c 28 c0 00 00 00 00 00 00 00 00 ......(.........
    00 00 00 00 00 00 00 00 6c 43 a0 c3 00 00 00 00 ........lC......
    backtrace:
    [] cgroup_bpf_inherit+0x44/0x24c
    [] cgroup_setup_root+0x174/0x37c
    [] cgroup1_get_tree+0x2c0/0x4a0
    [] vfs_get_tree+0x24/0x108
    [] path_mount+0x384/0x988
    [] do_mount+0x64/0x9c
    [] sys_mount+0xfc/0x1f4
    [] ret_fast_syscall+0x0/0x48
    [] 0xbeb4daa8

    This is because, since commit 2b0d3d3e4fcf ("percpu_ref: reduce
    memory footprint of percpu_ref in fast path"), root_cgrp->bpf.refcnt.data
    is allocated by percpu_ref_init() in cgroup_bpf_inherit(), which is
    called by cgroup_setup_root() when mounting, but is not freed along with
    root_cgrp when umounting. Adding a cgroup_bpf_offline() call, which
    calls percpu_ref_kill(), to cgroup_kill_sb() frees
    root_cgrp->bpf.refcnt.data on the umount path.

    This patch also fixes commit 4bfc0bb2c60e ("bpf: decouple the lifetime
    of cgroup_bpf from cgroup itself"): a cgroup_bpf_offline() is needed to
    free the resources that are allocated by cgroup_bpf_inherit() in
    cgroup_setup_root().

    Inside cgroup_bpf_offline(), cgroup_get() is called at the beginning,
    and cgroup_put() is called at the end of cgroup_bpf_release(), which is
    triggered by cgroup_bpf_offline(). So cgroup_bpf_offline() keeps the
    cgroup's refcount balanced.
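
    A hedged sketch of the fix in cgroup_kill_sb() (simplified):

        static void cgroup_kill_sb(struct super_block *sb)
        {
                struct kernfs_root *kf_root = kernfs_root_from_sb(sb);
                struct cgroup_root *root = cgroup_root_from_kf(kf_root);

                if (list_empty(&root->cgrp.self.children) &&
                    root != &cgrp_dfl_root &&
                    !percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
                        /* added: release the bpf refcnt data as well */
                        cgroup_bpf_offline(&root->cgrp);
                        percpu_ref_kill(&root->cgrp.self.refcnt);
                }
                cgroup_put(&root->cgrp);
                kernfs_kill_sb(sb);
        }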

    Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path")
    Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
    Signed-off-by: Quanyang Wang
    Signed-off-by: Alexei Starovoitov
    Acked-by: Roman Gushchin
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20211018075623.26884-1-quanyang.wang@windriver.com

    Quanyang Wang
     

28 Sep, 2021

1 commit

  • If cgroup_sk_alloc() is called from interrupt context, then just assign the
    root cgroup to skcd->cgroup. Prior to commit 8520e224f547 ("bpf, cgroups:
    Fix cgroup v2 fallback on v1/v2 mixed mode") we would just return, and later
    on in sock_cgroup_ptr(), we were NULL-testing the cgroup in fast-path, and
    iff indeed NULL returning the root cgroup (v ?: &cgrp_dfl_root.cgrp). Rather
    than re-adding the NULL-test to the fast-path we can just assign it once from
    cgroup_sk_alloc() given v1/v2 handling has been simplified. The migration from
    NULL test with returning &cgrp_dfl_root.cgrp to assigning &cgrp_dfl_root.cgrp
    directly does /not/ change behavior for callers of sock_cgroup_ptr().

    syzkaller was able to trigger a splat in the legacy netrom code base,
    where the RX handler in nr_rx_frame() calls nr_make_new(), which calls
    sk_alloc() and therefore cgroup_sk_alloc() under the in_interrupt()
    condition. This leaves skcd->cgroup NULL, which cgroup_sk_free() then
    trips over, since it expects a non-NULL object. There are a few other
    candidates aside from netrom with a similar pattern, where the
    accept-like implementation just calls sk_alloc() and thus
    cgroup_sk_alloc() instead of sk_clone_lock() with the corresponding
    cgroup_sk_clone(), which would inherit the cgroup from the parent
    socket. None of them are related to core protocols where BPF cgroup
    programs run. However, in the future, they should implement a similar
    inheritance mechanism.

    Additionally, with a !CONFIG_CGROUP_NET_PRIO and !CONFIG_CGROUP_NET_CLASSID
    configuration, the same issue was exposed also prior to 8520e224f547 due to
    commit e876ecc67db8 ("cgroup: memcg: net: do not associate sock with unrelated
    cgroup") which added the early in_interrupt() return back then.

    Fixes: 8520e224f547 ("bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode")
    Fixes: e876ecc67db8 ("cgroup: memcg: net: do not associate sock with unrelated cgroup")
    Reported-by: syzbot+df709157a4ecaf192b03@syzkaller.appspotmail.com
    Reported-by: syzbot+533f389d4026d86a2a95@syzkaller.appspotmail.com
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Tested-by: syzbot+df709157a4ecaf192b03@syzkaller.appspotmail.com
    Tested-by: syzbot+533f389d4026d86a2a95@syzkaller.appspotmail.com
    Acked-by: Tejun Heo
    Link: https://lore.kernel.org/bpf/20210927123921.21535-1-daniel@iogearbox.net

    Daniel Borkmann
     

14 Sep, 2021

3 commits

  • Daniel Borkmann says:

    ====================
    pull-request: bpf 2021-09-14

    The following pull-request contains BPF updates for your *net* tree.

    We've added 7 non-merge commits during the last 13 day(s) which contain
    a total of 18 files changed, 334 insertions(+), 193 deletions(-).

    The main changes are:

    1) Fix mmap_lock lockdep splat in BPF stack map's build_id lookup, from Yonghong Song.

    2) Fix BPF cgroup v2 program bypass upon net_cls/prio activation, from Daniel Borkmann.

    3) Fix kvcalloc() BTF line info splat on oversized allocation attempts, from Bixuan Cui.

    4) Fix BPF selftest build of task_pt_regs test for arm64/s390, from Jean-Philippe Brucker.

    5) Fix BPF's disasm.{c,h} to dual-license so that it is aligned with bpftool given the former
    is a build dependency for the latter, from Daniel Borkmann with ACKs from contributors.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used.
    Back in the days, commit bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
    embedded per-socket cgroup information into sock->sk_cgrp_data and in order
    to save 8 bytes in struct sock made both mutually exclusive, that is, when
    cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2
    falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp).

    The assumption made was "there is no reason to mix the two and this is in line
    with how legacy and v2 compatibility is handled" as stated in bd1060a1d671.
    However, with Kubernetes more widely supporting cgroups v2 as well nowadays,
    this assumption no longer holds, and the possibility of the v1/v2 mixed mode
    with the v2 root fallback being hit becomes a real security issue.

    Many of the cgroup v2 BPF programs are also used for policy enforcement, just
    to pick _one_ example, that is, to programmatically deny socket related system
    calls like connect(2) or bind(2). A v2 root fallback would implicitly cause
    a policy bypass for the affected Pods.

    In production environments, we have recently seen this case due to various
    circumstances: i) a different 3rd party agent and/or ii) a container runtime
    such as [0] in the user's environment configuring legacy cgroup v1 net_cls
    tags, which triggered implicitly mentioned root fallback. Another case is
    Kubernetes projects like kind [1] which create Kubernetes nodes in a container
    and also add cgroup namespaces to the mix, meaning programs which are attached
    to the cgroup v2 root of the cgroup namespace get attached to a non-root
    cgroup v2 path from init namespace point of view. And the latter's root is
    out of reach for agents on a kind Kubernetes node to configure. Meaning, any
    entity on the node setting cgroup v1 net_cls tag will trigger the bypass
    despite cgroup v2 BPF programs attached to the namespace root.

    Generally, this mutual exclusiveness does not hold anymore in today's user
    environments and makes cgroup v2 usage from BPF side fragile and unreliable.
    This fix adds proper struct cgroup pointer for the cgroup v2 case to struct
    sock_cgroup_data in order to address these issues; this implicitly also fixes
    the tradeoffs being made back then with regards to races and refcount leaks
    as stated in bd1060a1d671, and removes the fallback, so that cgroup v2 BPF
    programs always operate as expected.

    [0] https://github.com/nestybox/sysbox/
    [1] https://kind.sigs.k8s.io/
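
    A sketch of sock_cgroup_data after the fix: the v2 cgroup pointer now
    coexists with the v1 tagging fields instead of being mutually exclusive
    with them:

        struct sock_cgroup_data {
                struct cgroup   *cgroup;        /* v2 */
        #ifdef CONFIG_CGROUP_NET_CLASSID
                u32             classid;        /* v1 */
        #endif
        #ifdef CONFIG_CGROUP_NET_PRIO
                u16             prioidx;        /* v1 */
        #endif
        };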

    Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Acked-by: Stanislav Fomichev
    Acked-by: Tejun Heo
    Link: https://lore.kernel.org/bpf/20210913230759.2313-1-daniel@iogearbox.net

    Daniel Borkmann
     
  • Since commit 1243dc518c9d ("cgroup/cpuset: Convert cpuset_mutex to
    percpu_rwsem"), cpuset_mutex has been replaced by cpuset_rwsem which is
    a percpu rwsem. However, the comments in kernel/cgroup/cpuset.c still
    reference cpuset_mutex which are now incorrect.

    Change all the references of cpuset_mutex to cpuset_rwsem.

    Fixes: 1243dc518c9d ("cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem")
    Signed-off-by: Waiman Long
    Signed-off-by: Tejun Heo

    Waiman Long
     

04 Sep, 2021

2 commits

  • Merge misc updates from Andrew Morton:
    "173 patches.

    Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
    pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
    bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
    hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
    oom-kill, migration, ksm, percpu, vmstat, and madvise)"

    * emailed patches from Andrew Morton : (173 commits)
    mm/madvise: add MADV_WILLNEED to process_madvise()
    mm/vmstat: remove unneeded return value
    mm/vmstat: simplify the array size calculation
    mm/vmstat: correct some wrong comments
    mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
    selftests: vm: add COW time test for KSM pages
    selftests: vm: add KSM merging time test
    mm: KSM: fix data type
    selftests: vm: add KSM merging across nodes test
    selftests: vm: add KSM zero page merging test
    selftests: vm: add KSM unmerge test
    selftests: vm: add KSM merge test
    mm/migrate: correct kernel-doc notation
    mm: wire up syscall process_mrelease
    mm: introduce process_mrelease system call
    memblock: make memblock_find_in_range method private
    mm/mempolicy.c: use in_task() in mempolicy_slab_node()
    mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
    mm/mempolicy: advertise new MPOL_PREFERRED_MANY
    mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
    ...

    Linus Torvalds
     
    A container admin can create new namespaces and force the kernel to
    allocate up to several pages of memory for the namespaces and their
    associated structures.

    Net and uts namespaces have enabled accounting for such allocations.
    It makes sense to account for the remaining ones as well, to restrict
    the host's memory consumption from inside a memcg-limited container.
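
    A hedged sketch of the change for one of the namespaces covered, the
    cgroup namespace:

        /* GFP_KERNEL_ACCOUNT charges the allocation to the memcg */
        new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL_ACCOUNT);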

    Link: https://lkml.kernel.org/r/5525bcbf-533e-da27-79b7-158686c64e13@virtuozzo.com
    Signed-off-by: Vasily Averin
    Acked-by: Serge Hallyn
    Acked-by: Christian Brauner
    Acked-by: Kirill Tkhai
    Reviewed-by: Shakeel Butt
    Cc: Alexander Viro
    Cc: Alexey Dobriyan
    Cc: Andrei Vagin
    Cc: Borislav Petkov
    Cc: Borislav Petkov
    Cc: Dmitry Safonov
    Cc: "Eric W. Biederman"
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: "J. Bruce Fields"
    Cc: Jeff Layton
    Cc: Jens Axboe
    Cc: Jiri Slaby
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Oleg Nesterov
    Cc: Roman Gushchin
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Vladimir Davydov
    Cc: Yutian Yang
    Cc: Zefan Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Averin
     

01 Sep, 2021

1 commit

  • Pull cgroup updates from Tejun Heo:
    "Two cpuset behavior changes:

    - cpuset on cgroup2 is changed to enable memory migration based on
    nodemask by default.

    - A notification is generated when cpuset partition state changes.

    All other patches are minor fixes and cleanups"

    * 'for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: Avoid compiler warnings with no subsystems
    cgroup/cpuset: Avoid memory migration when nodemasks match
    cgroup/cpuset: Enable memory migration for cpuset v2
    cgroup/cpuset: Enable event notification when partition state changes
    cgroup: cgroup-v1: clean up kernel-doc notation
    cgroup: Replace deprecated CPU-hotplug functions.
    cgroup/cpuset: Fix violation of cpuset locking rule
    cgroup/cpuset: Fix a partition bug with hotplug
    cgroup/cpuset: Miscellaneous code cleanup
    cgroup: remove cgroup_mount from comments

    Linus Torvalds
     

31 Aug, 2021

2 commits

  • Pull scheduler updates from Ingo Molnar:

    - The biggest change in this cycle is scheduler support for asymmetric
    scheduling affinity, to support the execution of legacy 32-bit tasks
    on AArch32 systems that also have 64-bit-only CPUs.

    Architectures can fill in this functionality by defining their own
    task_cpu_possible_mask(p). When this is done, the scheduler will make
    sure the task will only be scheduled on CPUs that support it.

    (The actual arm64 specific changes are not part of this tree.)

    For other architectures there will be no change in functionality.

    - Add cgroup SCHED_IDLE support

    - Increase node-distance flexibility & delay determining it until a CPU
    is brought online. (This enables platforms where node distance isn't
    final until the CPU is online.)

    - Deadline scheduler enhancements & fixes

    - Misc fixes & cleanups.

    * tag 'sched-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
    eventfd: Make signal recursion protection a task bit
    sched/fair: Mark tg_is_idle() an inline in the !CONFIG_FAIR_GROUP_SCHED case
    sched: Introduce dl_task_check_affinity() to check proposed affinity
    sched: Allow task CPU affinity to be restricted on asymmetric systems
    sched: Split the guts of sched_setaffinity() into a helper function
    sched: Introduce task_struct::user_cpus_ptr to track requested affinity
    sched: Reject CPU affinity changes based on task_cpu_possible_mask()
    cpuset: Cleanup cpuset_cpus_allowed_fallback() use in select_fallback_rq()
    cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()
    cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1
    sched: Introduce task_cpu_possible_mask() to limit fallback rq selection
    sched: Cgroup SCHED_IDLE support
    sched/topology: Skip updating masks for non-online nodes
    sched: Replace deprecated CPU-hotplug functions.
    sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS
    sched: Fix UCLAMP_FLAG_IDLE setting
    sched/deadline: Fix missing clock update in migrate_task_rq_dl()
    sched/fair: Avoid a second scan of target in select_idle_cpu
    sched/fair: Use prev instead of new target as recent_used_cpu
    sched: Don't report SCHED_FLAG_SUGOV in sched_getattr()
    ...

    Linus Torvalds
     
  • As done before in commit cb4a31675270 ("cgroup: use bitmask to filter
    for_each_subsys"), avoid compiler warnings for the pathological case of
    having no subsystems (i.e. CGROUP_SUBSYS_COUNT == 0). This condition is
    hit for the arm multi_v7_defconfig config under -Wzero-length-bounds:

    In file included from ./arch/arm/include/generated/asm/rwonce.h:1,
    from include/linux/compiler.h:264,
    from include/uapi/linux/swab.h:6,
    from include/linux/swab.h:5,
    from arch/arm/include/asm/opcodes.h:86,
    from arch/arm/include/asm/bug.h:7,
    from include/linux/bug.h:5,
    from include/linux/thread_info.h:13,
    from include/asm-generic/current.h:5,
    from ./arch/arm/include/generated/asm/current.h:1,
    from include/linux/sched.h:12,
    from include/linux/cgroup.h:12,
    from kernel/cgroup/cgroup-internal.h:5,
    from kernel/cgroup/cgroup.c:31:
    kernel/cgroup/cgroup.c: In function 'of_css':
    kernel/cgroup/cgroup.c:651:42: warning: array subscript '<unknown>' is outside the bounds of an
    interior zero-length array 'struct cgroup_subsys_state *[0]' [-Wzero-length-bounds]
    651 | return rcu_dereference_raw(cgrp->subsys[cft->ss->id]);

    Reported-by: Stephen Rothwell
    Cc: Tejun Heo
    Cc: Zefan Li
    Cc: Johannes Weiner
    Cc: cgroups@vger.kernel.org
    Signed-off-by: Kees Cook
    Signed-off-by: Tejun Heo

    Kees Cook
     

26 Aug, 2021

1 commit

  • With the introduction of ee9707e8593d ("cgroup/cpuset: Enable memory
    migration for cpuset v2") attaching a process to a different cgroup will
    trigger a memory migration regardless of whether it's really needed.
    Memory migration is an expensive operation, so bypass it if the
    nodemasks passed to cpuset_migrate_mm() are equal.

    Note that we're not only avoiding the migration work itself, but also a
    call to lru_cache_disable(), which triggers and flushes an LRU drain
    work on every online CPU.
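
    A hedged sketch of the bypass in cpuset_migrate_mm():

        static void cpuset_migrate_mm(struct mm_struct *mm,
                                      const nodemask_t *from,
                                      const nodemask_t *to)
        {
                /* nothing to migrate; drop the mm reference and bail */
                if (nodes_equal(*from, *to)) {
                        mmput(mm);
                        return;
                }
                /* ... queue the migration work as before ... */
        }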

    Signed-off-by: Nicolas Saenz Julienne
    Signed-off-by: Tejun Heo

    Nicolas Saenz Julienne
     

20 Aug, 2021

3 commits

  • select_fallback_rq() only needs to recheck for an allowed CPU if the
    affinity mask of the task has changed since the last check.

    Return a 'bool' from cpuset_cpus_allowed_fallback() to indicate whether
    the affinity mask was updated, and use this to elide the allowed check
    when the mask has been left alone.

    No functional change.
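
    A hedged sketch of the call site in select_fallback_rq() after the
    change:

        case cpuset:
                if (cpuset_cpus_allowed_fallback(p)) {
                        /* mask changed: recheck the allowed CPUs */
                        state = possible;
                        break;
                }
                fallthrough;
        case possible:
                /* ... */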

    Suggested-by: Valentin Schneider
    Signed-off-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Link: https://lore.kernel.org/r/20210730112443.23245-5-will@kernel.org

    Will Deacon
     
  • Asymmetric systems may not offer the same level of userspace ISA support
    across all CPUs, meaning that some applications cannot be executed by
    some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
    not feature support for 32-bit applications on both clusters.

    Modify guarantee_online_cpus() to take task_cpu_possible_mask() into
    account when trying to find a suitable set of online CPUs for a given
    task. This will avoid passing an invalid mask to set_cpus_allowed_ptr()
    during ->attach() and will subsequently allow the cpuset hierarchy to be
    taken into account when forcefully overriding the affinity mask for a
    task which requires migration to a compatible CPU.
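
    A hedged, simplified sketch of guarantee_online_cpus() honouring the
    task's possible mask (the real code also guards the top-cpuset case):

        static void guarantee_online_cpus(struct task_struct *tsk,
                                          struct cpumask *pmask)
        {
                const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
                struct cpuset *cs;

                if (WARN_ON(!cpumask_and(pmask, possible_mask, cpu_online_mask)))
                        cpumask_copy(pmask, cpu_online_mask);

                rcu_read_lock();
                cs = task_cs(tsk);
                /* walk up until an ancestor's effective_cpus intersects */
                while (!cpumask_intersects(cs->effective_cpus, pmask))
                        cs = parent_cs(cs);
                cpumask_and(pmask, pmask, cs->effective_cpus);
                rcu_read_unlock();
        }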

    Signed-off-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20210730112443.23245-4-will@kernel.org

    Will Deacon
     
  • If the scheduler cannot find an allowed CPU for a task,
    cpuset_cpus_allowed_fallback() will widen the affinity to cpu_possible_mask
    if cgroup v1 is in use.

    In preparation for allowing architectures to provide their own fallback
    mask, just return early if we're either using cgroup v1 or we're using
    cgroup v2 with a mask that contains invalid CPUs. This will allow
    select_fallback_rq() to figure out the mask by itself.
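
    A hedged, simplified sketch of the early-return logic:

        const struct cpumask *cs_mask;

        rcu_read_lock();
        cs_mask = task_cs(tsk)->cpus_allowed;
        /* only widen for v2, and only if the mask is fully valid;
         * otherwise let select_fallback_rq() figure it out */
        if (is_in_v2_mode() && cpumask_subset(cs_mask, task_cpu_possible_mask(tsk)))
                do_set_cpus_allowed(tsk, cs_mask);
        rcu_read_unlock();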

    Signed-off-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Reviewed-by: Quentin Perret
    Link: https://lkml.kernel.org/r/20210730112443.23245-3-will@kernel.org

    Will Deacon
     

13 Aug, 2021

1 commit

  • When a user changes cpuset.cpus, each task in a v2 cpuset will be moved
    to one of the new cpus if it is not there already. For memory, however,
    they won't be migrated to the new nodes when cpuset.mems changes. This is
    an inconsistency in behavior.

    In cpuset v1, there is a memory_migrate control file to enable such
    behavior by setting the CS_MEMORY_MIGRATE flag. Make it the default
    for cpuset v2 so that we have a consistent set of behavior for both
    cpus and memory.

    There is certainly a cost to making memory migration the default, but it
    is a one-time cost that shouldn't really matter as long as cpuset.mems
    isn't changed frequently. Update the cgroup-v2.rst file to document the
    new behavior and recommend against changing cpuset.mems frequently.

    Since there won't be any concurrent access to the newly allocated cpuset
    structure in cpuset_css_alloc(), we can use the cheaper non-atomic
    __set_bit() instead of the more expensive atomic set_bit().
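
    A hedged sketch of the cheaper bit set in cpuset_css_alloc():

        /* the new cpuset isn't visible yet, so the non-atomic
         * __set_bit() suffices */
        if (is_in_v2_mode())
                __set_bit(CS_MEMORY_MIGRATE, &cs->flags);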

    Signed-off-by: Waiman Long
    Acked-by: Johannes Weiner
    Signed-off-by: Tejun Heo

    Waiman Long
     

12 Aug, 2021

2 commits

  • A valid cpuset partition can become invalid if all its CPUs are offlined
    or somehow removed. This can happen through external events without
    "cpuset.cpus.partition" being touched at all.

    Users that rely on the property of a partition being present do not
    currently have a simple way to get such an event notified other than
    constant periodic polling which is both inefficient and cumbersome.

    To make life easier for those users, event notification is now enabled
    for "cpuset.cpus.partition" whenever its state changes.

    Suggested-by: Tejun Heo
    Signed-off-by: Waiman Long
    Signed-off-by: Tejun Heo

    Waiman Long
     
  • Fix kernel-doc warnings found in cgroup-v1.c:

    kernel/cgroup/cgroup-v1.c:55: warning: No description found for return value of 'cgroup_attach_task_all'
    kernel/cgroup/cgroup-v1.c:94: warning: expecting prototype for cgroup_trasnsfer_tasks(). Prototype was for cgroup_transfer_tasks() instead
    cgroup-v1.c:96: warning: No description found for return value of 'cgroup_transfer_tasks'
    kernel/cgroup/cgroup-v1.c:687: warning: No description found for return value of 'cgroupstats_build'

    Signed-off-by: Randy Dunlap
    Cc: Tejun Heo
    Cc: Zefan Li
    Cc: Johannes Weiner
    Cc: cgroups@vger.kernel.org
    Signed-off-by: Tejun Heo

    Randy Dunlap
     

10 Aug, 2021

2 commits

  • The functions get_online_cpus() and put_online_cpus() have been
    deprecated during the CPU hotplug rework. They map directly to
    cpus_read_lock() and cpus_read_unlock().

    Replace deprecated CPU-hotplug functions with the official version.
    The behavior remains unchanged.
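
    The replacement is mechanical, e.g.:

        -       get_online_cpus();
        +       cpus_read_lock();
                /* ... */
        -       put_online_cpus();
        +       cpus_read_unlock();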

    Cc: Zefan Li
    Cc: Tejun Heo
    Cc: Johannes Weiner
    Cc: cgroups@vger.kernel.org
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Tejun Heo

    Sebastian Andrzej Siewior
     
  • The cpuset fields that manage partition root state do not strictly
    follow the cpuset locking rule that update to cpuset has to be done
    with both the callback_lock and cpuset_mutex held. This is now fixed
    by making sure that the locking rule is upheld.
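
    A hedged sketch of the locking rule being enforced for partition-state
    updates:

        percpu_down_write(&cpuset_rwsem);       /* or asserted held */
        spin_lock_irq(&callback_lock);
        cs->partition_root_state = new_prs;     /* protected update */
        spin_unlock_irq(&callback_lock);
        percpu_up_write(&cpuset_rwsem);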

    Fixes: 3881b86128d0 ("cpuset: Add an error state to cpuset.sched.partition")
    Fixes: 4b842da276a8 ("cpuset: Make CPU hotplug work with partition")
    Signed-off-by: Waiman Long
    Signed-off-by: Tejun Heo

    Waiman Long
     

28 Jul, 2021

1 commit

  • 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock") added
    cgroup_rstat_flush_irqsafe() allowing flushing to happen from the irq
    context. However, rstat paths use u64_stats_sync to synchronize access to
    64bit stat counters on 32bit machines. u64_stats_sync is implemented using
    seq_lock and trying to read from an irq context can lead to A-A deadlock if
    the irq happens to interrupt the stat update.

    Fix it by using the irqsafe variants - u64_stats_update_begin_irqsave() and
    u64_stats_update_end_irqrestore() - in the update paths. Note that none of
    this matters on 64bit machines. All these are just for 32bit SMP setups.
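
    A hedged sketch of an update path using the irqsafe variants (the
    bsync/bstat field names follow the rstat per-cpu structure; delta_exec
    is illustrative):

        unsigned long flags;

        flags = u64_stats_update_begin_irqsave(&rstatc->bsync);
        rstatc->bstat.cputime.sum_exec_runtime += delta_exec;
        u64_stats_update_end_irqrestore(&rstatc->bsync, flags);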

    Note that the interface was introduced way back, its first and currently
    only use was recently added by 2d146aa3aa84 ("mm: memcontrol: switch to
    rstat"). Stable tagging targets this commit.

    Signed-off-by: Tejun Heo
    Reported-by: Rik van Riel
    Fixes: 2d146aa3aa84 ("mm: memcontrol: switch to rstat")
    Cc: stable@vger.kernel.org # v5.13+

    Tejun Heo
     

27 Jul, 2021

1 commit

    In cpuset_hotplug_workfn(), the detection of whether the cpu list
    has been changed is done by comparing the effective cpus of the top
    cpuset with the cpu_active_mask. However, in the rare case that all
    the CPUs in subparts_cpus are offlined, the detection fails
    and the partition states are not updated correctly. Fix it by forcing
    the cpus_updated flag to true in this particular case.
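
    A hedged sketch of the forced flag:

        /* if all CPUs in subparts_cpus went offline, the effective_cpus
         * comparison above won't notice; force an update */
        if (!cpus_updated && top_cpuset.nr_subparts_cpus)
                cpus_updated = true;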

    Fixes: 4b842da276a8 ("cpuset: Make CPU hotplug work with partition")
    Signed-off-by: Waiman Long
    Signed-off-by: Tejun Heo

    Waiman Long