23 Sep, 2022

1 commit

  • commit 43626dade36fa74d3329046f4ae2d7fdefe401c6 upstream.

    syzbot is hitting a percpu_rwsem_assert_held(&cpu_hotplug_lock) warning at
    cpuset_attach() [1], because commit 4f7e7236435ca0ab ("cgroup: Fix
    threadgroup_rwsem cpus_read_lock() deadlock") missed that
    cpuset_attach() is also called from cgroup_attach_task_all().
    Add cpus_read_lock() there, like cgroup_procs_write_start() does.
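
    A minimal sketch of the shape of the fix, assuming a simplified
    cgroup_attach_task_all() (the real function also takes cgroup_mutex and
    iterates over every mounted hierarchy):

        int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
        {
                int ret = 0;

                /* the hotplug lock must nest outside threadgroup_rwsem,
                 * mirroring cgroup_procs_write_start() */
                cpus_read_lock();
                percpu_down_write(&cgroup_threadgroup_rwsem);

                /* ... migrate @tsk in each hierarchy ... */

                percpu_up_write(&cgroup_threadgroup_rwsem);
                cpus_read_unlock();
                return ret;
        }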

    Link: https://syzkaller.appspot.com/bug?extid=29d3a3b4d86c8136ad9e [1]
    Reported-by: syzbot
    Signed-off-by: Tetsuo Handa
    Fixes: 4f7e7236435ca0ab ("cgroup: Fix threadgroup_rwsem cpus_read_lock() deadlock")
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tetsuo Handa
     

15 Sep, 2022

2 commits

  • [ Upstream commit 4f7e7236435ca0abe005c674ebd6892c6e83aeb3 ]

    Bringing up a CPU may involve creating and destroying tasks which requires
    read-locking threadgroup_rwsem, so threadgroup_rwsem nests inside
    cpus_read_lock(). However, cpuset's ->attach(), which may be called with
    threadgroup_rwsem write-locked, also wants to disable CPU hotplug and
    acquires cpus_read_lock(), leading to a deadlock.

    Fix it by guaranteeing that ->attach() is always called with CPU hotplug
    disabled and removing cpus_read_lock() call from cpuset_attach().
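
    The resulting lock ordering can be sketched as a pair of helpers; the
    names cgroup_attach_lock()/cgroup_attach_unlock() follow the upstream
    patch, but the bodies here are a simplified sketch:

        /* hotplug lock first, then threadgroup_rwsem -- never the reverse */
        static void cgroup_attach_lock(void)
        {
                cpus_read_lock();
                percpu_down_write(&cgroup_threadgroup_rwsem);
        }

        static void cgroup_attach_unlock(void)
        {
                percpu_up_write(&cgroup_threadgroup_rwsem);
                cpus_read_unlock();
        }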

    Signed-off-by: Tejun Heo
    Reviewed-and-tested-by: Imran Khan
    Reported-and-tested-by: Xuewen Yan
    Fixes: 05c7b7a92cc8 ("cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug")
    Cc: stable@vger.kernel.org # v5.17+
    Signed-off-by: Sasha Levin

    Tejun Heo
     
  • [ Upstream commit 671c11f0619e5ccb380bcf0f062f69ba95fc974a ]

    cgroup_update_dfl_csses() write-locks threadgroup_rwsem because updating
    the csses can trigger process migrations. However, if the subtree doesn't
    contain any tasks, there won't be any cgroup migrations. This condition
    can be trivially detected by testing whether mgctx.preloaded_src_csets is
    empty. Elide write-locking threadgroup_rwsem if the subtree is empty.

    After this optimization, the usage pattern of creating a cgroup, enabling
    the necessary controllers, seeding it with CLONE_INTO_CGROUP, and removing
    the cgroup after it becomes empty doesn't need to write-lock
    threadgroup_rwsem at all.
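
    A hedged sketch of the elision, assuming mgctx has already been populated
    by cgroup_migrate_add_src() for every cset in the subtree:

        bool has_tasks = !list_empty(&mgctx.preloaded_src_csets);

        /* only write-lock when a migration can actually happen */
        if (has_tasks)
                percpu_down_write(&cgroup_threadgroup_rwsem);

        /* ... prepare and execute the migration ... */

        if (has_tasks)
                percpu_up_write(&cgroup_threadgroup_rwsem);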

    Signed-off-by: Tejun Heo
    Cc: Christian Brauner
    Cc: Michal Koutný
    Signed-off-by: Sasha Levin

    Tejun Heo
     

31 Aug, 2022

1 commit

  • commit 763f4fb76e24959c370cdaa889b2492ba6175580 upstream.

    Root cause:
    rebind_subsystems() holds no lock while moving a css object from list A
    to list B, so a concurrent list_for_each_entry_rcu() walker can end up
    treating B's list head as a css node.

    Solution:
    Add a grace period before invalidating the removed rstat_css_node.
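
    A sketch of the fix in rebind_subsystems(), following the upstream patch
    (simplified):

        if (ss->css_rstat_flush) {
                list_del_rcu(&css->rstat_css_node);
                synchronize_rcu();      /* let concurrent rstat walkers drain */
                list_add_rcu(&css->rstat_css_node, &dcgrp->rstat_css_list);
        }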

    Reported-by: Jing-Ting Wu
    Suggested-by: Michal Koutný
    Signed-off-by: Jing-Ting Wu
    Tested-by: Jing-Ting Wu
    Link: https://lore.kernel.org/linux-arm-kernel/d8f0bc5e2fb6ed259f9334c83279b4c011283c41.camel@mediatek.com/T/
    Acked-by: Mukesh Ojha
    Fixes: a7df69b81aac ("cgroup: rstat: support cgroup1")
    Cc: stable@vger.kernel.org # v5.13+
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Jing-Ting Wu
     

17 Aug, 2022

1 commit

  • [ Upstream commit b6e8d40d43ae4dec00c8fea2593eeea3114b8f44 ]

    With cgroup v2, the cpuset's cpus_allowed mask can be empty, indicating
    that the cpuset will just use the effective CPUs of its parent. So
    cpuset_can_attach() can call task_can_attach() with an empty mask.
    This can lead to cpumask_any_and() returning nr_cpu_ids, causing the
    call to dl_bw_of() to crash due to a per-CPU value access with an
    out-of-bounds CPU number. For example:

    [80468.182258] BUG: unable to handle page fault for address: ffffffff8b6648b0
    :
    [80468.191019] RIP: 0010:dl_cpu_busy+0x30/0x2b0
    :
    [80468.207946] Call Trace:
    [80468.208947] cpuset_can_attach+0xa0/0x140
    [80468.209953] cgroup_migrate_execute+0x8c/0x490
    [80468.210931] cgroup_update_dfl_csses+0x254/0x270
    [80468.211898] cgroup_subtree_control_write+0x322/0x400
    [80468.212854] kernfs_fop_write_iter+0x11c/0x1b0
    [80468.213777] new_sync_write+0x11f/0x1b0
    [80468.214689] vfs_write+0x1eb/0x280
    [80468.215592] ksys_write+0x5f/0xe0
    [80468.216463] do_syscall_64+0x5c/0x80
    [80468.224287] entry_SYSCALL_64_after_hwframe+0x44/0xae

    Fix that by using effective_cpus instead. For cgroup v1, effective_cpus
    is the same as cpus_allowed. For v2, effective_cpus is the real cpumask
    to be used by tasks within the cpuset anyway.

    Also update task_can_attach()'s 2nd argument name to cs_effective_cpus to
    reflect the change. In addition, a check is added to task_can_attach()
    to guard against the possibility that cpumask_any_and() may return a
    value >= nr_cpu_ids.
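
    A hedged sketch of the added guard in task_can_attach(), assuming the
    dl_cpu_busy() helper named in the trace above (ret is the function's
    return accumulator):

        if (dl_task(p) && !cpumask_empty(cs_effective_cpus)) {
                int cpu = cpumask_any_and(cpu_active_mask, cs_effective_cpus);

                /* guard against an empty intersection */
                if (unlikely(cpu >= nr_cpu_ids))
                        return -EINVAL;
                ret = dl_cpu_busy(cpu, p);
        }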

    Fixes: 7f51412a415d ("sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets")
    Signed-off-by: Waiman Long
    Signed-off-by: Ingo Molnar
    Acked-by: Juri Lelli
    Link: https://lore.kernel.org/r/20220803015451.2219567-1-longman@redhat.com
    Signed-off-by: Sasha Levin

    Waiman Long
     

22 Jul, 2022

1 commit

  • commit 07fd5b6cdf3cc30bfde8fe0f644771688be04447 upstream.

    Each cset (css_set) is pinned by its tasks. When we're moving tasks around
    across csets for a migration, we need to hold the source and destination
    csets to ensure that they don't go away while we're moving tasks about. This
    is done by linking cset->mg_preload_node on either the
    mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
    same cset->mg_preload_node for both the src and dst lists was deemed okay as
    a cset can't be both the source and destination at the same time.

    Unfortunately, this overloading becomes problematic when multiple tasks are
    involved in a migration and some of them are identity noop migrations while
    others are actually moving across cgroups. For example, this can happen with
    the following sequence on cgroup1:

    #1> mkdir -p /sys/fs/cgroup/misc/a/b
    #2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
    #3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
    #4> PID=$!
    #5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
    #6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs

    Here, #5 moves only the group leader into a/b (the v1 "tasks" file
    operates on single threads), and #6 then moves the whole process,
    including the group leader, back into a. In this final migration,
    non-leader threads would be doing identity migration while the group
    leader is doing an actual one.

    After #3, let's say the whole process was in cset A, and that after #5,
    the leader moves to cset B. Then, during #6, the following happens:

    1. cgroup_migrate_add_src() is called on B for the leader.

    2. cgroup_migrate_add_src() is called on A for the other threads.

    3. cgroup_migrate_prepare_dst() is called. It scans the src list.

    4. It notices that B wants to migrate to A, so it tries to add A to the
    dst list but realizes that its ->mg_preload_node is already busy.

    5. Then it notices that A wants to migrate to A; as it's an identity
    migration, it culls A by list_del_init()'ing its ->mg_preload_node and
    putting references accordingly.

    6. The rest of migration takes place with B on the src list but nothing on
    the dst list.

    This means that A isn't held while migration is in progress. If all tasks
    leave A before the migration finishes and the incoming task pins it, the
    cset will be destroyed leading to use-after-free.

    This is caused by overloading cset->mg_preload_node for both src and dst
    preload lists. We wanted to exclude the cset from the src list but ended up
    inadvertently excluding it from the dst list too.

    This patch fixes the issue by separating out cset->mg_preload_node into
    ->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
    preloadings don't interfere with each other.
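
    A sketch of the split, following the upstream patch:

        struct css_set {
                /* ... */

                /* entry on mgctx->preloaded_src_csets */
                struct list_head mg_src_preload_node;
                /* entry on mgctx->preloaded_dst_csets */
                struct list_head mg_dst_preload_node;

                /* ... */
        };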

    Signed-off-by: Tejun Heo
    Reported-by: Mukesh Ojha
    Reported-by: shisiyuan
    Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
    Link: https://www.spinics.net/lists/cgroups/msg33313.html
    Fixes: f817de98513d ("cgroup: prepare migration path for unified hierarchy")
    Cc: stable@vger.kernel.org # v3.16+
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

18 May, 2022

1 commit

  • commit 2685027fca387b602ae565bff17895188b803988 upstream.

    There are three places where the cpu and node masks of the top cpuset
    can be initialized, listed in the order they are executed:
    1) start_kernel -> cpuset_init()
    2) start_kernel -> cgroup_init() -> cpuset_bind()
    3) kernel_init_freeable() -> do_basic_setup() -> cpuset_init_smp()

    The first cpuset_init() call just sets all the bits in the masks.
    The second cpuset_bind() call sets cpus_allowed and mems_allowed to the
    default v2 values. The third cpuset_init_smp() call sets them back to
    v1 values.

    For systems with cgroup v2 setup, cpuset_bind() is called once. As a
    result, cpu and memory node hot add may fail to update the cpu and node
    masks of the top cpuset to include the newly added cpu or node in a
    cgroup v2 environment.

    For systems with a cgroup v1 setup, cpuset_bind() is called again by
    rebind_subsystems() when the v1 cpuset filesystem is mounted, as shown
    in the dmesg log below from an instrumented kernel.

    [ 2.609781] cpuset_bind() called - v2 = 1
    [ 3.079473] cpuset_init_smp() called
    [ 7.103710] cpuset_bind() called - v2 = 0

    smp_init() is called after the first two init functions. So we don't
    have a complete list of active cpus and memory nodes until later in
    cpuset_init_smp() which is the right time to set up effective_cpus
    and effective_mems.

    To fix this cgroup v2 mask setup problem, the potentially incorrect
    cpus_allowed & mems_allowed settings in cpuset_init_smp() are removed.
    For cgroup v2 systems, the initial cpuset_bind() call will set the masks
    correctly. For cgroup v1 systems, the second call to cpuset_bind()
    will do the right setup.
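
    A hedged sketch of cpuset_init_smp() after the change: only the
    effective masks are set up here, while cpus_allowed/mems_allowed are
    left to cpuset_bind():

        void __init cpuset_init_smp(void)
        {
                /* cpus_allowed/mems_allowed are no longer touched here */
                cpumask_copy(top_cpuset.effective_cpus, cpu_active_mask);
                top_cpuset.effective_mems = node_states[N_MEMORY];
                /* ... */
        }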

    cc: stable@vger.kernel.org
    Signed-off-by: Waiman Long
    Tested-by: Feng Tang
    Reviewed-by: Michal Koutný
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Waiman Long
     

02 Mar, 2022

2 commits

  • commit 467a726b754f474936980da793b4ff2ec3e382a7 upstream.

    The idea is to check: a) the owning user_ns of the cgroup_ns, and b)
    capabilities in init_user_ns.

    Commit 24f600856418 ("cgroup-v1: Require capabilities to set
    release_agent") got this wrong in the write handler of release_agent,
    since it checked the user_ns of the opener (which may differ from the
    owning user_ns of the cgroup_ns).
    Secondly, to avoid a possible confused-deputy attack, the capability of
    the opener must be checked.
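
    A hedged sketch of the corrected check in the release_agent write
    handler (simplified; ctx is the per-open-file context holding the
    opener's cgroup namespace):

        ctx = of->priv;
        if (ctx->ns->user_ns != &init_user_ns ||
            !file_ns_capable(of->file, &init_user_ns, CAP_SYS_ADMIN))
                return -EPERM;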

    Fixes: 24f600856418 ("cgroup-v1: Require capabilities to set release_agent")
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/stable/20220216121142.GB30035@blackbody.suse.cz/
    Signed-off-by: Michal Koutný
    Reviewed-by: Masami Ichikawa(CIP)
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Michal Koutný
     
  • commit 05c7b7a92cc87ff8d7fde189d0fade250697573c upstream.

    As previously discussed (https://lkml.org/lkml/2022/1/20/51),
    cpuset_attach() is affected by a similar cpu hotplug race, as in the
    following scenario:

    cpuset_attach()                               cpu hotplug
    ---------------------------                   ----------------------
    down_write(cpuset_rwsem)
    guarantee_online_cpus() // (load cpus_attach)
                                                  sched_cpu_deactivate
                                                    set_cpu_active()
                                                    // will change cpu_active_mask
    set_cpus_allowed_ptr(cpus_attach)
      __set_cpus_allowed_ptr_locked()
      // (if the intersection of cpus_attach and
      //  cpu_active_mask is empty, will return -EINVAL)
    up_write(cpuset_rwsem)

    To avoid races such as the one described above, protect the
    cpuset_attach() call with cpu_hotplug_lock.
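
    The shape of the fix at that time, as a hedged sketch (this locking was
    later reworked by commit 4f7e7236435ca0ab, above):

        static void cpuset_attach(struct cgroup_taskset *tset)
        {
                /* ... */
                cpus_read_lock();
                percpu_down_write(&cpuset_rwsem);
                /* guarantee_online_cpus() and set_cpus_allowed_ptr() now
                 * run against a stable cpu_active_mask */
                /* ... */
                percpu_up_write(&cpuset_rwsem);
                cpus_read_unlock();
        }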

    Fixes: be367d099270 ("cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time")
    Cc: stable@vger.kernel.org # v2.6.32+
    Reported-by: Zhao Gongyi
    Signed-off-by: Zhang Qiao
    Acked-by: Waiman Long
    Reviewed-by: Michal Koutný
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Zhang Qiao
     

09 Feb, 2022

1 commit

  • commit 2bdfd2825c9662463371e6691b1a794e97fa36b4 upstream.

    It was found that a "suspicious RCU usage" lockdep warning was issued
    with the rcu_read_lock() call in update_sibling_cpumasks(). It is
    because the update_cpumasks_hier() function may sleep. So we have
    to release the RCU lock, call update_cpumasks_hier() and reacquire
    it afterward.

    Also add a percpu_rwsem_assert_held() in update_sibling_cpumasks()
    instead of stating that in the comment.
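
    A hedged, simplified sketch of update_sibling_cpumasks() after the fix
    (the real patch also pins each sibling with css_tryget_online() across
    the unlocked region):

        percpu_rwsem_assert_held(&cpuset_rwsem);

        rcu_read_lock();
        cpuset_for_each_child(sibling, pos_css, parent) {
                if (!sibling->use_parent_ecpus)
                        continue;

                rcu_read_unlock();
                update_cpumasks_hier(sibling, tmp);     /* may sleep */
                rcu_read_lock();
        }
        rcu_read_unlock();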

    Fixes: 4716909cc5c5 ("cpuset: Track cpusets that use parent's effective_cpus")
    Signed-off-by: Waiman Long
    Tested-by: Phil Auld
    Reviewed-by: Phil Auld
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Waiman Long
     

05 Feb, 2022

2 commits

  • commit c80d401c52a2d1baf2a5afeb06f0ffe678e56d23 upstream.

    subparts_cpus should be limited as a subset of cpus_allowed, but it is
    updated wrongly by using cpumask_andnot(). Use cpumask_and() instead to
    fix it.
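
    The one-line shape of the fix, hedged (cp being the cpuset whose
    partition masks are being updated):

        /* keep subparts_cpus a subset of cpus_allowed */
        cpumask_and(cp->subparts_cpus, cp->subparts_cpus, cp->cpus_allowed);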

    Fixes: ee8dde0cd2ce ("cpuset: Add new v2 cpuset.sched.partition flag")
    Signed-off-by: Tianchen Ding
    Reviewed-by: Waiman Long
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tianchen Ding
     
  • commit 24f6008564183aa120d07c03d9289519c2fe02af upstream.

    The cgroup release_agent is called with call_usermodehelper. The function
    call_usermodehelper starts the release_agent with a full set of
    capabilities. Therefore require capabilities when setting the
    release_agent.
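
    A hedged sketch of the check this commit added (later refined by commit
    467a726b754f, above):

        if (!capable(CAP_SYS_ADMIN))
                return -EPERM;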

    Reported-by: Tabitha Sable
    Tested-by: Tabitha Sable
    Fixes: 81a6a5cdd2c5 ("Task Control Groups: automatic userspace notification of idle cgroups")
    Cc: stable@vger.kernel.org # v2.6.24+
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

02 Feb, 2022

1 commit

  • commit a06247c6804f1a7c86a2e5398a4c1f1db1471848 upstream.

    With a write operation on psi files replacing an old trigger with a new
    one, the lifetime of its waitqueue is totally arbitrary. Overwriting an
    existing trigger causes its waitqueue to be freed and a pending poll()
    will stumble on trigger->event_wait, which was destroyed.

    Fix this by disallowing redefinition of an existing psi trigger. If a
    write operation is used on a file descriptor with an already existing
    psi trigger, the operation will fail with an -EBUSY error.

    Also bypass the psi_disabled check in psi_trigger_destroy, as the flag
    can be flipped after the trigger is created, leading to a memory leak.
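
    A hedged sketch of the -EBUSY guard in the psi write path, assuming the
    seq_file-based handler used by the psi interface files:

        struct seq_file *seq = file->private_data;

        /* seq->lock protects seq->private against concurrent writers */
        mutex_lock(&seq->lock);

        /* allow only one trigger per file descriptor */
        if (seq->private) {
                mutex_unlock(&seq->lock);
                return -EBUSY;
        }

        new = psi_trigger_create(&psi_system, buf, nbytes, res);
        if (IS_ERR(new)) {
                mutex_unlock(&seq->lock);
                return PTR_ERR(new);
        }

        smp_store_release(&seq->private, new);
        mutex_unlock(&seq->lock);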

    Fixes: 0e94682b73bf ("psi: introduce psi monitor")
    Reported-by: syzbot+cdb5dd11c97cc532efad@syzkaller.appspotmail.com
    Suggested-by: Linus Torvalds
    Analyzed-by: Eric Biggers
    Signed-off-by: Suren Baghdasaryan
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Eric Biggers
    Acked-by: Johannes Weiner
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20220111232309.1786347-1-surenb@google.com
    Signed-off-by: Greg Kroah-Hartman

    Suren Baghdasaryan
     

11 Jan, 2022

3 commits

  • commit e57457641613fef0d147ede8bd6a3047df588b95 upstream.

    cgroup process migration permission checks are performed at write time as
    whether a given operation is allowed or not is dependent on the content of
    the write - the PID. This currently uses current's cgroup namespace which is
    a potential security weakness as it may allow scenarios where a less
    privileged process tricks a more privileged one into writing into a fd that
    it created.

    This patch makes cgroup remember the cgroup namespace at the time of open
    and uses it for migration permission checks instead of current's. Note
    that this only applies to cgroup2 as cgroup1 doesn't have namespace
    support.

    This also fixes a use-after-free bug on cgroupns reported in

    https://lore.kernel.org/r/00000000000048c15c05d0083397@google.com

    Note that backporting this fix also requires the preceding patch.
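
    A hedged sketch of capturing the namespace at open time in the per-file
    context (simplified from the upstream patch):

        static int cgroup_file_open(struct kernfs_open_file *of)
        {
                struct cgroup_file_ctx *ctx;

                ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
                if (!ctx)
                        return -ENOMEM;

                /* pin the opener's cgroup namespace for later checks */
                ctx->ns = current->nsproxy->cgroup_ns;
                get_cgroup_ns(ctx->ns);
                of->priv = ctx;
                /* ... */
                return 0;
        }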

    Reported-by: "Eric W. Biederman"
    Suggested-by: Linus Torvalds
    Cc: Michal Koutný
    Cc: Oleg Nesterov
    Reviewed-by: Michal Koutný
    Reported-by: syzbot+50f5cf33a284ce738b62@syzkaller.appspotmail.com
    Link: https://lore.kernel.org/r/00000000000048c15c05d0083397@google.com
    Fixes: 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option")
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit 0d2b5955b36250a9428c832664f2079cbf723bec upstream.

    of->priv is currently used by each interface file implementation to store
    private information. This patch collects the current two private data
    usages into struct cgroup_file_ctx, which is allocated and freed by the
    common path. This allows generic private data which applies to multiple
    files, and will be used in the following patch.

    Note that cgroup_procs iterator is now embedded as procs.iter in the new
    cgroup_file_ctx so that it doesn't need to be allocated and freed
    separately.

    v2: union dropped from cgroup_file_ctx and the procs iterator is embedded in
    cgroup_file_ctx as suggested by Linus.

    v3: Michal pointed out that cgroup1's procs pidlist uses of->priv too.
    Converted. Didn't change to embedded allocation as cgroup1 pidlists get
    stored for caching.
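
    A sketch of the new per-open-file context, following the upstream patch:

        struct cgroup_file_ctx {
                struct {
                        void                    *trigger;
                } psi;

                struct {
                        bool                    started;
                        /* embedded rather than separately allocated */
                        struct css_task_iter    iter;
                } procs;

                struct {
                        /* cgroup1 pidlists get stored for caching */
                        struct cgroup_pidlist   *pidlist;
                } procs1;
        };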

    Signed-off-by: Tejun Heo
    Cc: Linus Torvalds
    Reviewed-by: Michal Koutný
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit 1756d7994ad85c2479af6ae5a9750b92324685af upstream.

    cgroup process migration permission checks are performed at write time as
    whether a given operation is allowed or not is dependent on the content of
    the write - the PID. This currently uses current's credentials which is a
    potential security weakness as it may allow scenarios where a less
    privileged process tricks a more privileged one into writing into a fd that
    it created.

    This patch makes both the cgroup2 and cgroup1 process migration
    interfaces use the credentials saved at the time of open (file->f_cred)
    instead of current's.
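
    A hedged sketch of the cgroup1 side, comparing against the opener's
    credentials rather than current's (task being the migration target):

        const struct cred *cred = of->file->f_cred;     /* saved at open */
        const struct cred *tcred = get_task_cred(task);
        int ret = 0;

        if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
            !uid_eq(cred->euid, tcred->uid) &&
            !uid_eq(cred->euid, tcred->suid))
                ret = -EACCES;
        put_cred(tcred);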

    Reported-by: "Eric W. Biederman"
    Suggested-by: Linus Torvalds
    Fixes: 187fe84067bd ("cgroup: require write perm on common ancestor when moving processes on the default hierarchy")
    Reviewed-by: Michal Koutný
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

19 Nov, 2021

2 commits

  • [ Upstream commit 81c49d39aea8a10e6d05d3aa1cb65ceb721e19b0 ]

    In account_guest_time in kernel/sched/cputime.c guest time is
    attributed to both CPUTIME_NICE and CPUTIME_USER in addition to
    CPUTIME_GUEST_NICE and CPUTIME_GUEST respectively. Therefore, adding
    both to calculate usage results in double counting any guest time at
    the rootcg.
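
    A hedged, simplified sketch of the rootcg usage aggregation after the
    fix:

        for_each_possible_cpu(i) {
                u64 *cpustat = per_cpu_ptr(&kernel_cpustat, i)->cpustat;

                utime += cpustat[CPUTIME_USER];
                utime += cpustat[CPUTIME_NICE];
                /* CPUTIME_GUEST/CPUTIME_GUEST_NICE are already folded
                 * into USER/NICE above, so they must not be added again */
        }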

    Fixes: 936f2a70f207 ("cgroup: add cpu.stat file to root cgroup")
    Signed-off-by: Dan Schatzberg
    Signed-off-by: Tejun Heo
    Signed-off-by: Sasha Levin

    Dan Schatzberg
     
  • [ Upstream commit 7ee285395b211cad474b2b989db52666e0430daf ]

    It was found that the following warning was displayed when remounting
    controllers from cgroup v2 to v1:

    [ 8042.997778] WARNING: CPU: 88 PID: 80682 at kernel/cgroup/cgroup.c:3130 cgroup_apply_control_disable+0x158/0x190
    :
    [ 8043.091109] RIP: 0010:cgroup_apply_control_disable+0x158/0x190
    [ 8043.096946] Code: ff f6 45 54 01 74 39 48 8d 7d 10 48 c7 c6 e0 46 5a a4 e8 7b 67 33 00 e9 41 ff ff ff 49 8b 84 24 e8 01 00 00 0f b7 40 08 eb 95 0b e9 5f ff ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3
    [ 8043.115692] RSP: 0018:ffffba8a47c23d28 EFLAGS: 00010202
    [ 8043.120916] RAX: 0000000000000036 RBX: ffffffffa624ce40 RCX: 000000000000181a
    [ 8043.128047] RDX: ffffffffa63c43e0 RSI: ffffffffa63c43e0 RDI: ffff9d7284ee1000
    [ 8043.135180] RBP: ffff9d72874c5800 R08: ffffffffa624b090 R09: 0000000000000004
    [ 8043.142314] R10: ffffffffa624b080 R11: 0000000000002000 R12: ffff9d7284ee1000
    [ 8043.149447] R13: ffff9d7284ee1000 R14: ffffffffa624ce70 R15: ffffffffa6269e20
    [ 8043.156576] FS: 00007f7747cff740(0000) GS:ffff9d7a5fc00000(0000) knlGS:0000000000000000
    [ 8043.164663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 8043.170409] CR2: 00007f7747e96680 CR3: 0000000887d60001 CR4: 00000000007706e0
    [ 8043.177539] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 8043.184673] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 8043.191804] PKRU: 55555554
    [ 8043.194517] Call Trace:
    [ 8043.196970] rebind_subsystems+0x18c/0x470
    [ 8043.201070] cgroup_setup_root+0x16c/0x2f0
    [ 8043.205177] cgroup1_root_to_use+0x204/0x2a0
    [ 8043.209456] cgroup1_get_tree+0x3e/0x120
    [ 8043.213384] vfs_get_tree+0x22/0xb0
    [ 8043.216883] do_new_mount+0x176/0x2d0
    [ 8043.220550] __x64_sys_mount+0x103/0x140
    [ 8043.224474] do_syscall_64+0x38/0x90
    [ 8043.228063] entry_SYSCALL_64_after_hwframe+0x44/0xae

    It was caused by the fact that rebind_subsystems() disables the
    controllers to be rebound one by one. If more than one of the disabled
    controllers originally came from the default hierarchy,
    cgroup_apply_control_disable() will be called multiple times for the
    same default hierarchy. A controller may be killed by css_kill() in
    the first round; in the second round, the killed controller may not be
    completely dead yet, leading to the warning.

    To avoid this problem, we collect the ssids of all controllers that
    need to be disabled from the default hierarchy and then disable them
    in one go instead of one by one.
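
    A hedged sketch of the one-go disable, using the subsystem-mask
    iteration helpers from kernel/cgroup/cgroup.c:

        u16 dfl_disable_ss_mask = 0;

        do_each_subsys_mask(ss, ssid, ss_mask) {
                /* ... existing checks ... */
                if (ss->root == &cgrp_dfl_root)
                        dfl_disable_ss_mask |= 1 << ssid;
        } while_each_subsys_mask();

        if (dfl_disable_ss_mask) {
                struct cgroup *scgrp = &cgrp_dfl_root.cgrp;

                /* disable all default-hierarchy controllers together */
                cgrp_dfl_root.subsys_mask &= ~dfl_disable_ss_mask;
                WARN_ON(cgroup_apply_control(scgrp));
                cgroup_finalize_control(scgrp, 0);
        }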

    Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
    Signed-off-by: Waiman Long
    Signed-off-by: Tejun Heo
    Signed-off-by: Sasha Levin

    Waiman Long
     

23 Oct, 2021

1 commit

  • When enabling CONFIG_CGROUP_BPF, kmemleak can be observed by running
    the command as below:

    $mount -t cgroup -o none,name=foo cgroup cgroup/
    $umount cgroup/

    unreferenced object 0xc3585c40 (size 64):
    comm "mount", pid 425, jiffies 4294959825 (age 31.990s)
    hex dump (first 32 bytes):
    01 00 00 80 84 8c 28 c0 00 00 00 00 00 00 00 00 ......(.........
    00 00 00 00 00 00 00 00 6c 43 a0 c3 00 00 00 00 ........lC......
    backtrace:
    [] cgroup_bpf_inherit+0x44/0x24c
    [] cgroup_setup_root+0x174/0x37c
    [] cgroup1_get_tree+0x2c0/0x4a0
    [] vfs_get_tree+0x24/0x108
    [] path_mount+0x384/0x988
    [] do_mount+0x64/0x9c
    [] sys_mount+0xfc/0x1f4
    [] ret_fast_syscall+0x0/0x48
    [] 0xbeb4daa8

    This is because, since commit 2b0d3d3e4fcf ("percpu_ref: reduce
    memory footprint of percpu_ref in fast path"), root_cgrp->bpf.refcnt.data
    is allocated by percpu_ref_init() in cgroup_bpf_inherit(), which is
    called by cgroup_setup_root() when mounting, but is not freed along with
    root_cgrp when umounting. Adding a cgroup_bpf_offline() call, which
    calls percpu_ref_kill(), to cgroup_kill_sb() frees
    root_cgrp->bpf.refcnt.data on the umount path.

    This patch also fixes commit 4bfc0bb2c60e ("bpf: decouple the lifetime
    of cgroup_bpf from cgroup itself"): a cgroup_bpf_offline() is needed to
    free the resources that are allocated by cgroup_bpf_inherit() in
    cgroup_setup_root().

    Inside cgroup_bpf_offline(), cgroup_get() is called at the beginning,
    and cgroup_put() is called at the end of cgroup_bpf_release(), which is
    triggered by cgroup_bpf_offline(). So cgroup_bpf_offline() keeps the
    cgroup's refcount balanced.
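
    A hedged sketch of the fix in cgroup_kill_sb() (simplified):

        static void cgroup_kill_sb(struct super_block *sb)
        {
                struct kernfs_root *kf_root = kernfs_root_from_sb(sb);
                struct cgroup_root *root = cgroup_root_from_kf(kf_root);

                if (list_empty(&root->cgrp.self.children) &&
                    root != &cgrp_dfl_root &&
                    !percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
                        /* added: release the bpf refcnt data as well */
                        cgroup_bpf_offline(&root->cgrp);
                        percpu_ref_kill(&root->cgrp.self.refcnt);
                }
                cgroup_put(&root->cgrp);
                kernfs_kill_sb(sb);
        }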

    Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path")
    Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
    Signed-off-by: Quanyang Wang
    Signed-off-by: Alexei Starovoitov
    Acked-by: Roman Gushchin
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20211018075623.26884-1-quanyang.wang@windriver.com

    Quanyang Wang
     

28 Sep, 2021

1 commit

  • If cgroup_sk_alloc() is called from interrupt context, then just assign the
    root cgroup to skcd->cgroup. Prior to commit 8520e224f547 ("bpf, cgroups:
    Fix cgroup v2 fallback on v1/v2 mixed mode") we would just return, and later
    on in sock_cgroup_ptr(), we were NULL-testing the cgroup in fast-path, and
    iff indeed NULL returning the root cgroup (v ?: &cgrp_dfl_root.cgrp). Rather
    than re-adding the NULL-test to the fast-path we can just assign it once from
    cgroup_sk_alloc() given v1/v2 handling has been simplified. The migration from
    NULL test with returning &cgrp_dfl_root.cgrp to assigning &cgrp_dfl_root.cgrp
    directly does /not/ change behavior for callers of sock_cgroup_ptr().

    syzkaller was able to trigger a splat in the legacy netrom code base,
    where the RX handler in nr_rx_frame() calls nr_make_new(), which calls
    sk_alloc() and therefore cgroup_sk_alloc() under the in_interrupt()
    condition. This leaves skcd->cgroup NULL, which cgroup_sk_free() then
    trips over, since it expects a non-NULL object. There are a few other
    candidates aside from netrom with a similar pattern, where the
    accept-like implementation just calls sk_alloc() and thus
    cgroup_sk_alloc() instead of sk_clone_lock() with the corresponding
    cgroup_sk_clone(), which would inherit the cgroup from the parent
    socket. None of them are related to core protocols where BPF cgroup
    programs run. However, in the future, they should implement a similar
    inheritance mechanism.

    Additionally, with a !CONFIG_CGROUP_NET_PRIO and !CONFIG_CGROUP_NET_CLASSID
    configuration, the same issue was exposed also prior to 8520e224f547 due to
    commit e876ecc67db8 ("cgroup: memcg: net: do not associate sock with unrelated
    cgroup") which added the early in_interrupt() return back then.

    Fixes: 8520e224f547 ("bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode")
    Fixes: e876ecc67db8 ("cgroup: memcg: net: do not associate sock with unrelated cgroup")
    Reported-by: syzbot+df709157a4ecaf192b03@syzkaller.appspotmail.com
    Reported-by: syzbot+533f389d4026d86a2a95@syzkaller.appspotmail.com
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Tested-by: syzbot+df709157a4ecaf192b03@syzkaller.appspotmail.com
    Tested-by: syzbot+533f389d4026d86a2a95@syzkaller.appspotmail.com
    Acked-by: Tejun Heo
    Link: https://lore.kernel.org/bpf/20210927123921.21535-1-daniel@iogearbox.net

    Daniel Borkmann
     

14 Sep, 2021

3 commits

  • Daniel Borkmann says:

    ====================
    pull-request: bpf 2021-09-14

    The following pull-request contains BPF updates for your *net* tree.

    We've added 7 non-merge commits during the last 13 day(s) which contain
    a total of 18 files changed, 334 insertions(+), 193 deletions(-).

    The main changes are:

    1) Fix mmap_lock lockdep splat in BPF stack map's build_id lookup, from Yonghong Song.

    2) Fix BPF cgroup v2 program bypass upon net_cls/prio activation, from Daniel Borkmann.

    3) Fix kvcalloc() BTF line info splat on oversized allocation attempts, from Bixuan Cui.

    4) Fix BPF selftest build of task_pt_regs test for arm64/s390, from Jean-Philippe Brucker.

    5) Fix BPF's disasm.{c,h} to dual-license so that it is aligned with bpftool given the former
    is a build dependency for the latter, from Daniel Borkmann with ACKs from contributors.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used.
    Back in the days, commit bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
    embedded per-socket cgroup information into sock->sk_cgrp_data and in order
    to save 8 bytes in struct sock made both mutually exclusive, that is, when
    cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2
    falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp).

    The assumption made was "there is no reason to mix the two and this is in line
    with how legacy and v2 compatibility is handled" as stated in bd1060a1d671.
    However, with Kubernetes more widely supporting cgroups v2 as well nowadays,
    this assumption no longer holds, and the possibility of the v1/v2 mixed mode
    with the v2 root fallback being hit becomes a real security issue.

    Many of the cgroup v2 BPF programs are also used for policy enforcement, just
    to pick _one_ example, that is, to programmatically deny socket related system
    calls like connect(2) or bind(2). A v2 root fallback would implicitly cause
    a policy bypass for the affected Pods.

    In production environments, we have recently seen this case due to various
    circumstances: i) a different 3rd party agent and/or ii) a container runtime
    such as [0] in the user's environment configuring legacy cgroup v1 net_cls
    tags, which triggered implicitly mentioned root fallback. Another case is
    Kubernetes projects like kind [1] which create Kubernetes nodes in a container
    and also add cgroup namespaces to the mix, meaning programs which are attached
    to the cgroup v2 root of the cgroup namespace get attached to a non-root
    cgroup v2 path from init namespace point of view. And the latter's root is
    out of reach for agents on a kind Kubernetes node to configure. Meaning, any
    entity on the node setting cgroup v1 net_cls tag will trigger the bypass
    despite cgroup v2 BPF programs attached to the namespace root.

    Generally, this mutual exclusiveness does not hold anymore in today's user
    environments and makes cgroup v2 usage from BPF side fragile and unreliable.
    This fix adds proper struct cgroup pointer for the cgroup v2 case to struct
    sock_cgroup_data in order to address these issues; this implicitly also fixes
    the tradeoffs being made back then with regards to races and refcount leaks
    as stated in bd1060a1d671, and removes the fallback, so that cgroup v2 BPF
    programs always operate as expected.

    [0] https://github.com/nestybox/sysbox/
    [1] https://kind.sigs.k8s.io/
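
    A sketch of sock_cgroup_data after the fix: the v2 cgroup pointer now
    coexists with the v1 tagging fields instead of being mutually exclusive
    with them:

        struct sock_cgroup_data {
                struct cgroup   *cgroup;        /* v2 */
        #ifdef CONFIG_CGROUP_NET_CLASSID
                u32             classid;        /* v1 */
        #endif
        #ifdef CONFIG_CGROUP_NET_PRIO
                u16             prioidx;        /* v1 */
        #endif
        };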

    Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Acked-by: Stanislav Fomichev
    Acked-by: Tejun Heo
    Link: https://lore.kernel.org/bpf/20210913230759.2313-1-daniel@iogearbox.net

    Daniel Borkmann
     
  • Since commit 1243dc518c9d ("cgroup/cpuset: Convert cpuset_mutex to
    percpu_rwsem"), cpuset_mutex has been replaced by cpuset_rwsem which is
    a percpu rwsem. However, the comments in kernel/cgroup/cpuset.c still
    reference cpuset_mutex which are now incorrect.

    Change all the references of cpuset_mutex to cpuset_rwsem.

    Fixes: 1243dc518c9d ("cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem")
    Signed-off-by: Waiman Long
    Signed-off-by: Tejun Heo

    Waiman Long
     

04 Sep, 2021

2 commits

  • Merge misc updates from Andrew Morton:
    "173 patches.

    Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
    pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
    bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
    hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
    oom-kill, migration, ksm, percpu, vmstat, and madvise)"

    * emailed patches from Andrew Morton : (173 commits)
    mm/madvise: add MADV_WILLNEED to process_madvise()
    mm/vmstat: remove unneeded return value
    mm/vmstat: simplify the array size calculation
    mm/vmstat: correct some wrong comments
    mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
    selftests: vm: add COW time test for KSM pages
    selftests: vm: add KSM merging time test
    mm: KSM: fix data type
    selftests: vm: add KSM merging across nodes test
    selftests: vm: add KSM zero page merging test
    selftests: vm: add KSM unmerge test
    selftests: vm: add KSM merge test
    mm/migrate: correct kernel-doc notation
    mm: wire up syscall process_mrelease
    mm: introduce process_mrelease system call
    memblock: make memblock_find_in_range method private
    mm/mempolicy.c: use in_task() in mempolicy_slab_node()
    mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
    mm/mempolicy: advertise new MPOL_PREFERRED_MANY
    mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
    ...

    Linus Torvalds
     
    A container admin can create new namespaces and force the kernel to
    allocate up to several pages of memory for the namespaces and their
    associated structures.

    Net and uts namespaces have enabled accounting for such allocations.
    It makes sense to account for the remaining ones as well, to restrict
    the host's memory consumption from inside a memcg-limited container.
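
    A hedged sketch of the change for one of the namespaces covered, the
    cgroup namespace:

        /* GFP_KERNEL_ACCOUNT charges the allocation to the memcg */
        new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL_ACCOUNT);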

    Link: https://lkml.kernel.org/r/5525bcbf-533e-da27-79b7-158686c64e13@virtuozzo.com
    Signed-off-by: Vasily Averin
    Acked-by: Serge Hallyn
    Acked-by: Christian Brauner
    Acked-by: Kirill Tkhai
    Reviewed-by: Shakeel Butt
    Cc: Alexander Viro
    Cc: Alexey Dobriyan
    Cc: Andrei Vagin
    Cc: Borislav Petkov
    Cc: Borislav Petkov
    Cc: Dmitry Safonov
    Cc: "Eric W. Biederman"
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: "J. Bruce Fields"
    Cc: Jeff Layton
    Cc: Jens Axboe
    Cc: Jiri Slaby
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Oleg Nesterov
    Cc: Roman Gushchin
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Vladimir Davydov
    Cc: Yutian Yang
    Cc: Zefan Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Averin
     

01 Sep, 2021

1 commit

  • Pull cgroup updates from Tejun Heo:
    "Two cpuset behavior changes:

    - cpuset on cgroup2 is changed to enable memory migration based on
    nodemask by default.

    - A notification is generated when cpuset partition state changes.

    All other patches are minor fixes and cleanups"

    * 'for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: Avoid compiler warnings with no subsystems
    cgroup/cpuset: Avoid memory migration when nodemasks match
    cgroup/cpuset: Enable memory migration for cpuset v2
    cgroup/cpuset: Enable event notification when partition state changes
    cgroup: cgroup-v1: clean up kernel-doc notation
    cgroup: Replace deprecated CPU-hotplug functions.
    cgroup/cpuset: Fix violation of cpuset locking rule
    cgroup/cpuset: Fix a partition bug with hotplug
    cgroup/cpuset: Miscellaneous code cleanup
    cgroup: remove cgroup_mount from comments

    Linus Torvalds
     

31 Aug, 2021

2 commits

  • Pull scheduler updates from Ingo Molnar:

    - The biggest change in this cycle is scheduler support for asymmetric
    scheduling affinity, to support the execution of legacy 32-bit tasks
    on AArch32 systems that also have 64-bit-only CPUs.

    Architectures can fill in this functionality by defining their own
    task_cpu_possible_mask(p). When this is done, the scheduler will make
    sure the task will only be scheduled on CPUs that support it.

    (The actual arm64 specific changes are not part of this tree.)

    For other architectures there will be no change in functionality.

    - Add cgroup SCHED_IDLE support

    - Increase node-distance flexibility & delay determining it until a CPU
    is brought online. (This enables platforms where node distance isn't
    final until the CPU is online.)

    - Deadline scheduler enhancements & fixes

    - Misc fixes & cleanups.

    * tag 'sched-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
    eventfd: Make signal recursion protection a task bit
    sched/fair: Mark tg_is_idle() an inline in the !CONFIG_FAIR_GROUP_SCHED case
    sched: Introduce dl_task_check_affinity() to check proposed affinity
    sched: Allow task CPU affinity to be restricted on asymmetric systems
    sched: Split the guts of sched_setaffinity() into a helper function
    sched: Introduce task_struct::user_cpus_ptr to track requested affinity
    sched: Reject CPU affinity changes based on task_cpu_possible_mask()
    cpuset: Cleanup cpuset_cpus_allowed_fallback() use in select_fallback_rq()
    cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()
    cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1
    sched: Introduce task_cpu_possible_mask() to limit fallback rq selection
    sched: Cgroup SCHED_IDLE support
    sched/topology: Skip updating masks for non-online nodes
    sched: Replace deprecated CPU-hotplug functions.
    sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS
    sched: Fix UCLAMP_FLAG_IDLE setting
    sched/deadline: Fix missing clock update in migrate_task_rq_dl()
    sched/fair: Avoid a second scan of target in select_idle_cpu
    sched/fair: Use prev instead of new target as recent_used_cpu
    sched: Don't report SCHED_FLAG_SUGOV in sched_getattr()
    ...

    Linus Torvalds
     
  • As done before in commit cb4a31675270 ("cgroup: use bitmask to filter
    for_each_subsys"), avoid compiler warnings for the pathological case of
    having no subsystems (i.e. CGROUP_SUBSYS_COUNT == 0). This condition is
    hit for the arm multi_v7_defconfig config under -Wzero-length-bounds:

    In file included from ./arch/arm/include/generated/asm/rwonce.h:1,
    from include/linux/compiler.h:264,
    from include/uapi/linux/swab.h:6,
    from include/linux/swab.h:5,
    from arch/arm/include/asm/opcodes.h:86,
    from arch/arm/include/asm/bug.h:7,
    from include/linux/bug.h:5,
    from include/linux/thread_info.h:13,
    from include/asm-generic/current.h:5,
    from ./arch/arm/include/generated/asm/current.h:1,
    from include/linux/sched.h:12,
    from include/linux/cgroup.h:12,
    from kernel/cgroup/cgroup-internal.h:5,
    from kernel/cgroup/cgroup.c:31:
    kernel/cgroup/cgroup.c: In function 'of_css':
    kernel/cgroup/cgroup.c:651:42: warning: array subscript '<unknown>' is outside the bounds of an
    interior zero-length array 'struct cgroup_subsys_state *[0]' [-Wzero-length-bounds]
    651 | return rcu_dereference_raw(cgrp->subsys[cft->ss->id]);

    Reported-by: Stephen Rothwell
    Cc: Tejun Heo
    Cc: Zefan Li
    Cc: Johannes Weiner
    Cc: cgroups@vger.kernel.org
    Signed-off-by: Kees Cook
    Signed-off-by: Tejun Heo

    Kees Cook
     

26 Aug, 2021

1 commit

  • With the introduction of ee9707e8593d ("cgroup/cpuset: Enable memory
    migration for cpuset v2") attaching a process to a different cgroup will
    trigger a memory migration regardless of whether it's really needed.
    Memory migration is an expensive operation, so bypass it if the
    nodemasks passed to cpuset_migrate_mm() are equal.

    Note that we're not only avoiding the migration work itself, but also a
    call to lru_cache_disable(), which triggers and flushes an LRU drain
    work on every online CPU.
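
    A hedged sketch of the bypass in cpuset_migrate_mm():

        static void cpuset_migrate_mm(struct mm_struct *mm,
                                      const nodemask_t *from,
                                      const nodemask_t *to)
        {
                /* nothing to migrate; drop the mm reference and bail */
                if (nodes_equal(*from, *to)) {
                        mmput(mm);
                        return;
                }
                /* ... queue the migration work as before ... */
        }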

    Signed-off-by: Nicolas Saenz Julienne
    Signed-off-by: Tejun Heo

    Nicolas Saenz Julienne
     

20 Aug, 2021

3 commits

  • select_fallback_rq() only needs to recheck for an allowed CPU if the
    affinity mask of the task has changed since the last check.

    Return a 'bool' from cpuset_cpus_allowed_fallback() to indicate whether
    the affinity mask was updated, and use this to elide the allowed check
    when the mask has been left alone.

    No functional change.
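
    A hedged sketch of the call site in select_fallback_rq() after the
    change:

        case cpuset:
                if (cpuset_cpus_allowed_fallback(p)) {
                        /* mask changed: recheck the allowed CPUs */
                        state = possible;
                        break;
                }
                fallthrough;
        case possible:
                /* ... */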

    Suggested-by: Valentin Schneider
    Signed-off-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Link: https://lore.kernel.org/r/20210730112443.23245-5-will@kernel.org

    Will Deacon
     
  • Asymmetric systems may not offer the same level of userspace ISA support
    across all CPUs, meaning that some applications cannot be executed by
    some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
    not feature support for 32-bit applications on both clusters.

    Modify guarantee_online_cpus() to take task_cpu_possible_mask() into
    account when trying to find a suitable set of online CPUs for a given
    task. This will avoid passing an invalid mask to set_cpus_allowed_ptr()
    during ->attach() and will subsequently allow the cpuset hierarchy to be
    taken into account when forcefully overriding the affinity mask for a
    task which requires migration to a compatible CPU.
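
    A hedged, simplified sketch of guarantee_online_cpus() honouring the
    task's possible mask (the real code also guards the top-cpuset case):

        static void guarantee_online_cpus(struct task_struct *tsk,
                                          struct cpumask *pmask)
        {
                const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
                struct cpuset *cs;

                if (WARN_ON(!cpumask_and(pmask, possible_mask, cpu_online_mask)))
                        cpumask_copy(pmask, cpu_online_mask);

                rcu_read_lock();
                cs = task_cs(tsk);
                /* walk up until an ancestor's effective_cpus intersects */
                while (!cpumask_intersects(cs->effective_cpus, pmask))
                        cs = parent_cs(cs);
                cpumask_and(pmask, pmask, cs->effective_cpus);
                rcu_read_unlock();
        }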

    Signed-off-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20210730112443.23245-4-will@kernel.org

    Will Deacon
     
  • If the scheduler cannot find an allowed CPU for a task,
    cpuset_cpus_allowed_fallback() will widen the affinity to cpu_possible_mask
    if cgroup v1 is in use.

    In preparation for allowing architectures to provide their own fallback
    mask, just return early if we're either using cgroup v1 or we're using
    cgroup v2 with a mask that contains invalid CPUs. This will allow
    select_fallback_rq() to figure out the mask by itself.
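
    A hedged, simplified sketch of the early-return logic:

        const struct cpumask *cs_mask;

        rcu_read_lock();
        cs_mask = task_cs(tsk)->cpus_allowed;
        /* only widen for v2, and only if the mask is fully valid;
         * otherwise let select_fallback_rq() figure it out */
        if (is_in_v2_mode() && cpumask_subset(cs_mask, task_cpu_possible_mask(tsk)))
                do_set_cpus_allowed(tsk, cs_mask);
        rcu_read_unlock();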

    Signed-off-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Reviewed-by: Quentin Perret
    Link: https://lkml.kernel.org/r/20210730112443.23245-3-will@kernel.org

    Will Deacon
     

13 Aug, 2021

1 commit

  • When a user changes cpuset.cpus, each task in a v2 cpuset will be moved
    to one of the new cpus if it is not there already. For memory, however,
    they won't be migrated to the new nodes when cpuset.mems changes. This is
    an inconsistency in behavior.

    In cpuset v1, there is a memory_migrate control file to enable such
    behavior by setting the CS_MEMORY_MIGRATE flag. Make it the default
    for cpuset v2 so that we have a consistent set of behavior for both
    cpus and memory.

    There is certainly a cost to making memory migration the default, but it
    is a one-time cost that shouldn't really matter as long as cpuset.mems
    isn't changed frequently. Update the cgroup-v2.rst file to document the
    new behavior and recommend against changing cpuset.mems frequently.

    Since there won't be any concurrent access to the newly allocated cpuset
    structure in cpuset_css_alloc(), we can use the cheaper non-atomic
    __set_bit() instead of the more expensive atomic set_bit().
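
    A hedged sketch of the cheaper bit set in cpuset_css_alloc():

        /* the new cpuset isn't visible yet, so the non-atomic
         * __set_bit() suffices */
        if (is_in_v2_mode())
                __set_bit(CS_MEMORY_MIGRATE, &cs->flags);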

    Signed-off-by: Waiman Long
    Acked-by: Johannes Weiner
    Signed-off-by: Tejun Heo

    Waiman Long
     

12 Aug, 2021

2 commits

  • A valid cpuset partition can become invalid if all its CPUs are offlined
    or somehow removed. This can happen through external events without
    "cpuset.cpus.partition" being touched at all.

    Users that rely on the property of a partition being present do not
    currently have a simple way to get such an event notified other than
    constant periodic polling which is both inefficient and cumbersome.

    To make life easier for those users, event notification is now enabled
    for "cpuset.cpus.partition" whenever its state changes.

    Suggested-by: Tejun Heo
    Signed-off-by: Waiman Long
    Signed-off-by: Tejun Heo

    Waiman Long
     
  • Fix kernel-doc warnings found in cgroup-v1.c:

    kernel/cgroup/cgroup-v1.c:55: warning: No description found for return value of 'cgroup_attach_task_all'
    kernel/cgroup/cgroup-v1.c:94: warning: expecting prototype for cgroup_trasnsfer_tasks(). Prototype was for cgroup_transfer_tasks() instead
    cgroup-v1.c:96: warning: No description found for return value of 'cgroup_transfer_tasks'
    kernel/cgroup/cgroup-v1.c:687: warning: No description found for return value of 'cgroupstats_build'

    Signed-off-by: Randy Dunlap
    Cc: Tejun Heo
    Cc: Zefan Li
    Cc: Johannes Weiner
    Cc: cgroups@vger.kernel.org
    Signed-off-by: Tejun Heo

    Randy Dunlap
     

10 Aug, 2021

2 commits

  • The functions get_online_cpus() and put_online_cpus() have been
    deprecated during the CPU hotplug rework. They map directly to
    cpus_read_lock() and cpus_read_unlock().

    Replace deprecated CPU-hotplug functions with the official version.
    The behavior remains unchanged.
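
    The replacement is mechanical, e.g.:

        -       get_online_cpus();
        +       cpus_read_lock();
                /* ... */
        -       put_online_cpus();
        +       cpus_read_unlock();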

    Cc: Zefan Li
    Cc: Tejun Heo
    Cc: Johannes Weiner
    Cc: cgroups@vger.kernel.org
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Tejun Heo

    Sebastian Andrzej Siewior
     
  • The cpuset fields that manage partition root state do not strictly
    follow the cpuset locking rule that update to cpuset has to be done
    with both the callback_lock and cpuset_mutex held. This is now fixed
    by making sure that the locking rule is upheld.
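
    A hedged sketch of the locking rule being enforced for partition-state
    updates:

        percpu_down_write(&cpuset_rwsem);       /* or asserted held */
        spin_lock_irq(&callback_lock);
        cs->partition_root_state = new_prs;     /* protected update */
        spin_unlock_irq(&callback_lock);
        percpu_up_write(&cpuset_rwsem);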

    Fixes: 3881b86128d0 ("cpuset: Add an error state to cpuset.sched.partition")
    Fixes: 4b842da276a8 ("cpuset: Make CPU hotplug work with partition")
    Signed-off-by: Waiman Long
    Signed-off-by: Tejun Heo

    Waiman Long
     

28 Jul, 2021

1 commit

  • 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock") added
    cgroup_rstat_flush_irqsafe() allowing flushing to happen from the irq
    context. However, rstat paths use u64_stats_sync to synchronize access to
    64bit stat counters on 32bit machines. u64_stats_sync is implemented using
    seq_lock and trying to read from an irq context can lead to A-A deadlock if
    the irq happens to interrupt the stat update.

    Fix it by using the irqsafe variants - u64_stats_update_begin_irqsave() and
    u64_stats_update_end_irqrestore() - in the update paths. Note that none of
    this matters on 64bit machines. All these are just for 32bit SMP setups.
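
    A hedged sketch of an update path using the irqsafe variants (the
    bsync/bstat field names follow the rstat per-cpu structure; delta_exec
    is illustrative):

        unsigned long flags;

        flags = u64_stats_update_begin_irqsave(&rstatc->bsync);
        rstatc->bstat.cputime.sum_exec_runtime += delta_exec;
        u64_stats_update_end_irqrestore(&rstatc->bsync, flags);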

    Note that the interface was introduced way back, its first and currently
    only use was recently added by 2d146aa3aa84 ("mm: memcontrol: switch to
    rstat"). Stable tagging targets this commit.

    Signed-off-by: Tejun Heo
    Reported-by: Rik van Riel
    Fixes: 2d146aa3aa84 ("mm: memcontrol: switch to rstat")
    Cc: stable@vger.kernel.org # v5.13+

    Tejun Heo
     

27 Jul, 2021

1 commit

    In cpuset_hotplug_workfn(), the detection of whether the cpu list
    has been changed is done by comparing the effective cpus of the top
    cpuset with the cpu_active_mask. However, in the rare case that all
    the CPUs in subparts_cpus are offlined, the detection fails
    and the partition states are not updated correctly. Fix it by forcing
    the cpus_updated flag to true in this particular case.
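
    A hedged sketch of the forced flag:

        /* if all CPUs in subparts_cpus went offline, the effective_cpus
         * comparison above won't notice; force an update */
        if (!cpus_updated && top_cpuset.nr_subparts_cpus)
                cpus_updated = true;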

    Fixes: 4b842da276a8 ("cpuset: Make CPU hotplug work with partition")
    Signed-off-by: Waiman Long
    Signed-off-by: Tejun Heo

    Waiman Long