Eric Lee / smarc-fsl-linux-kernel

06 Apr, 2019

2 commits

d0bc74c56 cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting ... Browse Code »

[ Upstream commit 51bee5abeab2058ea5813c5615d6197a23dbf041 ]

The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which
needs pids_free() to uncharge the pid.

However, ->free() is called from __put_task_struct()->cgroup_free() and this
is too late. Even the trivial program which does

for (;;) {
int pid = fork();
assert(pid >= 0);
if (pid)
wait(NULL);
else
exit(0);
}

can run out of limits because release_task()->call_rcu(delayed_put_task_struct)
implies an RCU gp after the task/pid goes away and before the final put().

Test-case:

mkdir -p /tmp/CG
mount -t cgroup2 none /tmp/CG
echo '+pids' > /tmp/CG/cgroup.subtree_control

mkdir /tmp/CG/PID
echo 2 > /tmp/CG/PID/pids.max

perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
echo $! > /tmp/CG/PID/cgroup.procs

Without this patch the forking process fails soon after migration.

Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite
into the new helper, cgroup_release(), called by release_task() which actually
frees the pid(s).

Reported-by: Herton R. Krzesinski
Reported-by: Jan Stancek
Signed-off-by: Oleg Nesterov
Signed-off-by: Tejun Heo
Signed-off-by: Sasha Levin

Oleg Nesterov
2019-04-06 04:33:13 +0800
a74ebf047 cgroup, rstat: Don't flush subtree root unless necessary ... Browse Code »

[ Upstream commit b4ff1b44bcd384d22fcbac6ebaf9cc0d33debe50 ]

cgroup_rstat_cpu_pop_updated() is used to traverse the updated cgroups
on flush. While it was only visiting updated ones in the subtree, it
was visiting @root unconditionally. We can easily check whether @root
is updated or not by looking at its ->updated_next just as with the
cgroups in the subtree.

* Remove the unnecessary cgroup_parent() test. The system root cgroup
is never updated and thus its ->updated_next is always NULL. No
need to test whether cgroup_parent() exists in addition to
->updated_next.

* Terminate traverse if ->updated_next is NULL. This can only happen
for subtree @root and there's no reason to visit it if it's not
marked updated.

This reduces cpu consumption when reading a lot of rstat backed files.
In a micro benchmark reading stat from ~1600 cgroups, the sys time was
lowered by >40%.

Signed-off-by: Tejun Heo
Signed-off-by: Sasha Levin

Tejun Heo
2019-04-06 04:33:06 +0800

24 Mar, 2019

1 commit

7a8b04843 fix cgroup_do_mount() handling of failure exits ... Browse Code »

commit 399504e21a10be16dd1408ba0147367d9d82a10c upstream.

same story as with last May fixes in sysfs (7b745a4e4051
"unfuck sysfs_mount()"); new_sb is left uninitialized
in case of early errors in kernfs_mount_ns() and papering
over it by treating any error from kernfs_mount_ns() as
equivalent to !new_ns ends up conflating the cases when
objects had never been transferred to a superblock with
ones when that has happened and resulting new superblock
had been dropped. Easily fixed (same way as in sysfs
case). Additionally, there's a superblock leak on
kernfs_node_dentry() failure *and* a dentry leak inside
kernfs_node_dentry() itself - the latter on probably
impossible errors, but the former not impossible to trigger
(as the matter of fact, injecting allocation failures
at that point *does* trigger it).

Cc: stable@kernel.org
Signed-off-by: Al Viro
Signed-off-by: Greg Kroah-Hartman

Al Viro
2019-03-24 03:09:53 +0800

13 Feb, 2019

1 commit

4b5abffd6 cgroup: fix parsing empty mount option string ... Browse Code »

[ Upstream commit e250d91d65750a0c0c62483ac4f9f357e7317617 ]

This fixes the case where all mount options specified are consumed by an
LSM and all that's left is an empty string. In this case cgroupfs should
accept the string and not fail.

How to reproduce (with SELinux enabled):

# umount /sys/fs/cgroup/unified
# mount -o context=system_u:object_r:cgroup_t:s0 -t cgroup2 cgroup2 /sys/fs/cgroup/unified
mount: /sys/fs/cgroup/unified: wrong fs type, bad option, bad superblock on cgroup2, missing codepage or helper program, or other error.
# dmesg | tail -n 1
[ 31.575952] cgroup: cgroup2: unknown option ""

Fixes: 67e9c74b8a87 ("cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type")
[NOTE: should apply on top of commit 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option"), older versions need manual rebase]
Suggested-by: Stephen Smalley
Signed-off-by: Ondrej Mosnacek
Signed-off-by: Tejun Heo
Signed-off-by: Sasha Levin

Ondrej Mosnacek
2019-02-13 02:47:17 +0800

10 Jan, 2019

1 commit

8a2fbdd5b cgroup: fix CSS_TASK_ITER_PROCS ... Browse Code »

commit e9d81a1bc2c48ea9782e3e8b53875f419766ef47 upstream.

CSS_TASK_ITER_PROCS implements process-only iteration by making
css_task_iter_advance() skip tasks which aren't threadgroup leaders;
however, when an iteration is started css_task_iter_start() calls the
inner helper function css_task_iter_advance_css_set() instead of
css_task_iter_advance(). As the helper doesn't have the skip logic,
when the first task to visit is a non-leader thread, it doesn't get
skipped correctly as shown in the following example.

# ps -L 2030
PID LWP TTY STAT TIME COMMAND
2030 2030 pts/0 Sl+ 0:00 ./test-thread
2030 2031 pts/0 Sl+ 0:00 ./test-thread
# mkdir -p /sys/fs/cgroup/x/a/b
# echo threaded > /sys/fs/cgroup/x/a/cgroup.type
# echo threaded > /sys/fs/cgroup/x/a/b/cgroup.type
# echo 2030 > /sys/fs/cgroup/x/a/cgroup.procs
# cat /sys/fs/cgroup/x/a/cgroup.threads
2030
2031
# cat /sys/fs/cgroup/x/cgroup.procs
2030
# echo 2030 > /sys/fs/cgroup/x/a/b/cgroup.threads
# cat /sys/fs/cgroup/x/cgroup.procs
2031
2030

The last read of cgroup.procs is incorrectly showing non-leader 2031
in cgroup.procs output.

This can be fixed by updating css_task_iter_advance() to handle the
first advance and css_task_iters_tart() to call
css_task_iter_advance() instead of the inner helper. After the fix,
the same commands result in the following (correct) result:

# ps -L 2062
PID LWP TTY STAT TIME COMMAND
2062 2062 pts/0 Sl+ 0:00 ./test-thread
2062 2063 pts/0 Sl+ 0:00 ./test-thread
# mkdir -p /sys/fs/cgroup/x/a/b
# echo threaded > /sys/fs/cgroup/x/a/cgroup.type
# echo threaded > /sys/fs/cgroup/x/a/b/cgroup.type
# echo 2062 > /sys/fs/cgroup/x/a/cgroup.procs
# cat /sys/fs/cgroup/x/a/cgroup.threads
2062
2063
# cat /sys/fs/cgroup/x/cgroup.procs
2062
# echo 2062 > /sys/fs/cgroup/x/a/b/cgroup.threads
# cat /sys/fs/cgroup/x/cgroup.procs
2062

Signed-off-by: Tejun Heo
Reported-by: "Michael Kerrisk (man-pages)"
Fixes: 8cfd8147df67 ("cgroup: implement cgroup v2 thread support")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2019-01-10 00:38:45 +0800

05 Oct, 2018

1 commit

479adb89a cgroup: Fix dom_cgrp propagation when enabling threaded mode ... Browse Code »

A cgroup which is already a threaded domain may be converted into a
threaded cgroup if the prerequisite conditions are met. When this
happens, all threaded descendant should also have their ->dom_cgrp
updated to the new threaded domain cgroup. Unfortunately, this
propagation was missing leading to the following failure.

# cd /sys/fs/cgroup/unified
# cat cgroup.subtree_control # show that no controllers are enabled

# mkdir -p mycgrp/a/b/c
# echo threaded > mycgrp/a/b/cgroup.type

At this point, the hierarchy looks as follows:

mycgrp [d]
a [dt]
b [t]
c [inv]

Now let's make node "a" threaded (and thus "mycgrp" s made "domain threaded"):

# echo threaded > mycgrp/a/cgroup.type

By this point, we now have a hierarchy that looks as follows:

mycgrp [dt]
a [t]
b [t]
c [inv]

But, when we try to convert the node "c" from "domain invalid" to
"threaded", we get ENOTSUP on the write():

# echo threaded > mycgrp/a/b/c/cgroup.type
sh: echo: write error: Operation not supported

This patch fixes the problem by

* Moving the opencoded ->dom_cgrp save and restoration in
cgroup_enable_threaded() into cgroup_{save|restore}_control() so
that mulitple cgroups can be handled.

* Updating all threaded descendants' ->dom_cgrp to point to the new
dom_cgrp when enabling threaded mode.

Signed-off-by: Tejun Heo
Reported-and-tested-by: "Michael Kerrisk (man-pages)"
Reported-by: Amin Jamali
Reported-by: Joao De Almeida Pereira
Link: https://lore.kernel.org/r/CAKgNAkhHYCMn74TCNiMJ=ccLd7DcmXSbvw3CbZ1YREeG7iJM5g@mail.gmail.com
Fixes: 454000adaa2a ("cgroup: introduce cgroup->dom_cgrp and threaded css_set handling")
Cc: stable@vger.kernel.org # v4.14+

Tejun Heo
2018-10-05 04:28:08 +0800

25 Aug, 2018

1 commit

596766102 Merge branch 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup ... Browse Code »

Pull cgroup updates from Tejun Heo:
"Just one commit from Steven to take out spin lock from trace event
handlers"

* 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/tracing: Move taking of spin lock out of trace event handlers

Linus Torvalds
2018-08-25 04:19:27 +0800

21 Jul, 2018

1 commit

488dee96b kernfs: allow creating kernfs objects with arbitrary uid/gid ... Browse Code »

This change allows creating kernfs files and directories with arbitrary
uid/gid instead of always using GLOBAL_ROOT_UID/GID by extending
kernfs_create_dir_ns() and kernfs_create_file_ns() with uid/gid arguments.
The "simple" kernfs_create_file() and kernfs_create_dir() are left alone
and always create objects belonging to the global root.

When creating symlinks ownership (uid/gid) is taken from the target kernfs
object.

Co-Developed-by: Tyler Hicks
Signed-off-by: Dmitry Torokhov
Signed-off-by: Tyler Hicks
Signed-off-by: David S. Miller

Dmitry Torokhov
2018-07-21 14:44:35 +0800

12 Jul, 2018

1 commit

e4f8d81c7 cgroup/tracing: Move taking of spin lock out of trace event handlers ... Browse Code »

It is unwise to take spin locks from the handlers of trace events.
Mainly, because they can introduce lockups, because it introduces locks
in places that are normally not tested. Worse yet, because trace events
are tucked away in the include/trace/events/ directory, locks that are
taken there are forgotten about.

As a general rule, I tell people never to take any locks in a trace
event handler.

Several cgroup trace event handlers call cgroup_path() which eventually
takes the kernfs_rename_lock spinlock. This injects the spinlock in the
code without people realizing it. It also can cause issues for the
PREEMPT_RT patch, as the spinlock becomes a mutex, and the trace event
handlers are called with preemption disabled.

By moving the calculation of the cgroup_path() out of the trace event
handlers and into a macro (surrounded by a
trace_cgroup_##type##_enabled()), then we could place the cgroup_path
into a string, and pass that to the trace event. Not only does this
remove the taking of the spinlock out of the trace event handler, but
it also means that the cgroup_path() only needs to be called once (it
is currently called twice, once to get the length to reserver the
buffer for, and once again to get the path itself. Now it only needs to
be done once.

Reported-by: Sebastian Andrzej Siewior
Signed-off-by: Steven Rostedt (VMware)
Signed-off-by: Tejun Heo

Steven Rostedt (VMware)
2018-07-12 01:48:47 +0800

16 Jun, 2018

1 commit

5fb94e9ca docs: Fix some broken references ... Browse Code »

As we move stuff around, some doc references are broken. Fix some of
them via this script:
./scripts/documentation-file-ref-check --fix

Manually checked if the produced result is valid, removing a few
false-positives.

Acked-by: Takashi Iwai
Acked-by: Masami Hiramatsu
Acked-by: Stephen Boyd
Acked-by: Charles Keepax
Acked-by: Mathieu Poirier
Reviewed-by: Coly Li
Signed-off-by: Mauro Carvalho Chehab
Acked-by: Jonathan Corbet

Mauro Carvalho Chehab
2018-06-16 05:10:01 +0800

13 Jun, 2018

2 commits

42bc47b35 treewide: Use array_size() in vmalloc() ... Browse Code »

The vmalloc() function has no 2-factor argument form, so multiplication
factors need to be wrapped in array_size(). This patch replaces cases of:

vmalloc(a * b)

with:
vmalloc(array_size(a, b))

as well as handling cases of:

vmalloc(a * b * c)

with:

vmalloc(array3_size(a, b, c))

This does, however, attempt to ignore constant size factors like:

vmalloc(4 * 1024)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
vmalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
vmalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
vmalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
vmalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
vmalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
vmalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
vmalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
vmalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
vmalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
vmalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
vmalloc(
- sizeof(TYPE) * (COUNT_ID)
+ array_size(COUNT_ID, sizeof(TYPE))
, ...)
|
vmalloc(
- sizeof(TYPE) * COUNT_ID
+ array_size(COUNT_ID, sizeof(TYPE))
, ...)
|
vmalloc(
- sizeof(TYPE) * (COUNT_CONST)
+ array_size(COUNT_CONST, sizeof(TYPE))
, ...)
|
vmalloc(
- sizeof(TYPE) * COUNT_CONST
+ array_size(COUNT_CONST, sizeof(TYPE))
, ...)
|
vmalloc(
- sizeof(THING) * (COUNT_ID)
+ array_size(COUNT_ID, sizeof(THING))
, ...)
|
vmalloc(
- sizeof(THING) * COUNT_ID
+ array_size(COUNT_ID, sizeof(THING))
, ...)
|
vmalloc(
- sizeof(THING) * (COUNT_CONST)
+ array_size(COUNT_CONST, sizeof(THING))
, ...)
|
vmalloc(
- sizeof(THING) * COUNT_CONST
+ array_size(COUNT_CONST, sizeof(THING))
, ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

vmalloc(
- SIZE * COUNT
+ array_size(COUNT, SIZE)
, ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
vmalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
vmalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
vmalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
vmalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
vmalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
vmalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
vmalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
vmalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
vmalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
vmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
vmalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
vmalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
vmalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
vmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
vmalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
vmalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
vmalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
vmalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
vmalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
vmalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
vmalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
vmalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)

// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
vmalloc(C1 * C2 * C3, ...)
|
vmalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)

// And then all remaining 2 factors products when they're not all constants.
@@
expression E1, E2;
constant C1, C2;
@@

(
vmalloc(C1 * C2, ...)
|
vmalloc(
- E1 * E2
+ array_size(E1, E2)
, ...)
)

Signed-off-by: Kees Cook

Kees Cook
2018-06-13 07:19:22 +0800
6da2ec560 treewide: kmalloc() -> kmalloc_array() ... Browse Code »

The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
patch replaces cases of:

kmalloc(a * b, gfp)

with:
kmalloc_array(a * b, gfp)

as well as handling cases of:

kmalloc(a * b * c, gfp)

with:

kmalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

kmalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

kmalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The tools/ directory was manually excluded, since it has its own
implementation of kmalloc().

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
kmalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kmalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
kmalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kmalloc
+ kmalloc_array
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
kmalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
kmalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kmalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
kmalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
kmalloc(C1 * C2 * C3, ...)
|
kmalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
kmalloc(sizeof(THING) * C2, ...)
|
kmalloc(sizeof(TYPE) * C2, ...)
|
kmalloc(C1 * C2 * C3, ...)
|
kmalloc(C1 * C2, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- (E1) * E2
+ E1, E2
, ...)
|
- kmalloc
+ kmalloc_array
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kmalloc
+ kmalloc_array
(
- E1 * E2
+ E1, E2
, ...)
)

Signed-off-by: Kees Cook

Kees Cook
2018-06-13 07:19:22 +0800

07 Jun, 2018

2 commits

285767604 Merge tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux ... Browse Code »

Pull overflow updates from Kees Cook:
"This adds the new overflow checking helpers and adds them to the
2-factor argument allocators. And this adds the saturating size
helpers and does a treewide replacement for the struct_size() usage.
Additionally this adds the overflow testing modules to make sure
everything works.

I'm still working on the treewide replacements for allocators with
"simple" multiplied arguments:

*alloc(a * b, ...) -> *alloc_array(a, b, ...)

and

*zalloc(a * b, ...) -> *calloc(a, b, ...)

as well as the more complex cases, but that's separable from this
portion of the series. I expect to have the rest sent before -rc1
closes; there are a lot of messy cases to clean up.

Summary:

- Introduce arithmetic overflow test helper functions (Rasmus)

- Use overflow helpers in 2-factor allocators (Kees, Rasmus)

- Introduce overflow test module (Rasmus, Kees)

- Introduce saturating size helper functions (Matthew, Kees)

- Treewide use of struct_size() for allocators (Kees)"

* tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
treewide: Use struct_size() for devm_kmalloc() and friends
treewide: Use struct_size() for vmalloc()-family
treewide: Use struct_size() for kmalloc()-family
device: Use overflow helpers for devm_kmalloc()
mm: Use overflow helpers in kvmalloc()
mm: Use overflow helpers in kmalloc_array*()
test_overflow: Add memory allocation overflow tests
overflow.h: Add allocation size calculation helpers
test_overflow: Report test failures
test_overflow: macrofy some more, do more tests for free
lib: add runtime test of check_*_overflow functions
compiler.h: enable builtin overflow checkers and add fallback code

Linus Torvalds
2018-06-07 08:27:14 +0800
acafe7e30 treewide: Use struct_size() for kmalloc()-family ... Browse Code »

One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
int stuff;
void *entry[];
};

instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);

This patch makes the changes for kmalloc()-family (and kvmalloc()-family)
uses. It was done via automatic conversion with manual review for the
"CHECKME" non-standard cases noted below, using the following Coccinelle
script:

// pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
// sizeof *pkey_cache->table, GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@

- alloc(sizeof(*VAR) + COUNT * sizeof(*VAR->ELEMENT), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)

// mr = kzalloc(sizeof(*mr) + m * sizeof(mr->map[0]), GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@

- alloc(sizeof(*VAR) + COUNT * sizeof(VAR->ELEMENT[0]), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)

// Same pattern, but can't trivially locate the trailing element name,
// or variable name.
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
expression SOMETHING, COUNT, ELEMENT;
@@

- alloc(sizeof(SOMETHING) + COUNT * sizeof(ELEMENT), GFP)
+ alloc(CHECKME_struct_size(&SOMETHING, ELEMENT, COUNT), GFP)

Signed-off-by: Kees Cook

Kees Cook
2018-06-07 02:15:43 +0800

06 Jun, 2018

1 commit

9f25a8da4 Merge branch 'for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup ... Browse Code »

Pull cgroup updates from Tejun Heo:

- For cpustat, cgroup has a percpu hierarchical stat mechanism which
propagates up the hierarchy lazily.

This contains commits to factor out and generalize the mechanism so
that it can be used for other cgroup stats too.

The original intention was to update memcg stats to use it but memcg
went for a different approach, so still the only user is cpustat. The
factoring out and generalization still make sense and it's likely
that this can be used for other purposes in the future.

- cgroup uses kernfs_notify() (which uses fsnotify()) to inform user
space of certain events. A rate limiting mechanism is added.

- Other misc changes.

* 'for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: css_set_lock should nest inside tasklist_lock
rdmacg: Convert to use match_string() helper
cgroup: Make cgroup_rstat_updated() ready for root cgroup usage
cgroup: Add memory barriers to plug cgroup_rstat_updated() race window
cgroup: Add cgroup_subsys->css_rstat_flush()
cgroup: Replace cgroup_rstat_mutex with a spinlock
cgroup: Factor out and expose cgroup_rstat_*() interface functions
cgroup: Reorganize kernel/cgroup/rstat.c
cgroup: Distinguish base resource stat implementation from rstat
cgroup: Rename stat to rstat
cgroup: Rename kernel/cgroup/stat.c to kernel/cgroup/rstat.c
cgroup: Limit event generation frequency
cgroup: Explicitly remove core interface files

Linus Torvalds
2018-06-06 08:08:45 +0800

24 May, 2018

1 commit

d8742e229 cgroup: css_set_lock should nest inside tasklist_lock ... Browse Code »

cgroup_enable_task_cg_lists() incorrectly nests non-irq-safe
tasklist_lock inside irq-safe css_set_lock triggering the following
lockdep warning.

WARNING: possible irq lock inversion dependency detected
4.17.0-rc1-00027-gb37d049 #6 Not tainted
--------------------------------------------------------
systemd/1 just changed the state of lock:
00000000fe57773b (css_set_lock){..-.}, at: cgroup_free+0xf2/0x12a
but this lock took another, SOFTIRQ-unsafe lock in the past:
(tasklist_lock){.+.+}

and interrupts could create inverse lock ordering between them.

other info that might help us debug this:
Possible interrupt unsafe locking scenario:

CPU0 CPU1
---- ----
lock(tasklist_lock);
local_irq_disable();
lock(css_set_lock);
lock(tasklist_lock);

lock(css_set_lock);

*** DEADLOCK ***

The condition is highly unlikely to actually happen especially given
that the path is executed only once per boot.

Signed-off-by: Tejun Heo
Reported-by: Boqun Feng

Tejun Heo
2018-05-24 02:04:54 +0800

16 May, 2018

1 commit

3f3942aca proc: introduce proc_create_single{,_data} ... Browse Code »

Variants of proc_create{,_data} that directly take a seq_file show
callback and drastically reduces the boilerplate code in the callers.

All trivial callers converted over.

Signed-off-by: Christoph Hellwig

Christoph Hellwig
2018-05-16 13:23:35 +0800

08 May, 2018

1 commit

cc659e76f rdmacg: Convert to use match_string() helper ... Browse Code »

The new helper returns index of the matching string in an array.
We are going to use it here.

Signed-off-by: Andy Shevchenko
Signed-off-by: Tejun Heo

Andy Shevchenko
2018-05-08 00:27:26 +0800

27 Apr, 2018

11 commits

c43c5ea75 cgroup: Make cgroup_rstat_updated() ready for root cgroup usage ... Browse Code »

cgroup_rstat_updated() ensures that the cgroup's rstat is linked to
the parent. If there's no parent, it never gets linked and the
function ends up grabbing and releasing the cgroup_rstat_lock each
time for no reason which can be expensive.

This hasn't been a problem till now because nobody was calling the
function for the root cgroup but rstat is gonna be exposed to
controllers and use cases, so let's get ready. Make
cgroup_rstat_updated() an no-op for the root cgroup.

Signed-off-by: Tejun Heo

Tejun Heo
2018-04-27 05:29:06 +0800
9a9e97b2f cgroup: Add memory barriers to plug cgroup_rstat_updated() race window ... Browse Code »

cgroup_rstat_updated() has a small race window where an updated
signaling can race with flush and could be lost till the next update.
This wasn't a problem for the existing usages, but we plan to use
rstat to track counters which need to be accurate.

This patch plugs the race window by synchronizing
cgroup_rstat_updated() and flush path with memory barriers around
cgroup_rstat_cpu->updated_next pointer.

Signed-off-by: Tejun Heo

Tejun Heo
2018-04-27 05:29:05 +0800
8f53470ba cgroup: Add cgroup_subsys->css_rstat_flush() ... Browse Code »

This patch adds cgroup_subsys->css_rstat_flush(). If a subsystem has
this callback, its csses are linked on cgrp->css_rstat_list and rstat
will call the function whenever the associated cgroup is flushed.
Flush is also performed when such csses are released so that residual
counts aren't lost.

Combined with the rstat API previous patches factored out, this allows
controllers to plug into rstat to manage their statistics in a
scalable way.

Signed-off-by: Tejun Heo

Tejun Heo
2018-04-27 05:29:05 +0800
0fa294fb1 cgroup: Replace cgroup_rstat_mutex with a spinlock ... Browse Code »

Currently, rstat flush path is protected with a mutex which is fine as
all the existing users are from interface file show path. However,
rstat is being generalized for use by controllers and flushing from
atomic contexts will be necessary.

This patch replaces cgroup_rstat_mutex with a spinlock and adds a
irq-safe flush function - cgroup_rstat_flush_irqsafe(). Explicit
yield handling is added to the flush path so that other flush
functions can yield to other threads and flushers.

Signed-off-by: Tejun Heo

Tejun Heo
2018-04-27 05:29:05 +0800
6162cef0f cgroup: Factor out and expose cgroup_rstat_*() interface functions ... Browse Code »

cgroup_rstat is being generalized so that controllers can use it too.
This patch factors out and exposes the following interface functions.

* cgroup_rstat_updated(): Renamed from cgroup_rstat_cpu_updated() for
consistency.

* cgroup_rstat_flush_hold/release(): Factored out from base stat
implementation.

* cgroup_rstat_flush(): Verbatim expose.

While at it, drop assert on cgroup_rstat_mutex in
cgroup_base_stat_flush() as it crosses layers and make a minor comment
update.

v2: Added EXPORT_SYMBOL_GPL(cgroup_rstat_updated) to fix a build bug.

Signed-off-by: Tejun Heo

Tejun Heo
2018-04-27 05:29:05 +0800
a17556f8d cgroup: Reorganize kernel/cgroup/rstat.c ... Browse Code »

Currently, rstat.c has rstat and base stat implementations intermixed.
Collect base stat implementation at the end of the file. Also,
reorder the prototypes.

This patch doesn't make any functional changes.

Signed-off-by: Tejun Heo

Tejun Heo
2018-04-27 05:29:05 +0800
d4ff749b5 cgroup: Distinguish base resource stat implementation from rstat ... Browse Code »

Base resource stat accounts universial (not specific to any
controller) resource consumptions on top of rstat. Currently, its
implementation is intermixed with rstat implementation making the code
confusing to follow.

This patch clarifies the distintion by doing the followings.

* Encapsulate base resource stat counters, currently only cputime, in
struct cgroup_base_stat.

* Move prev_cputime into struct cgroup and initialize it with cgroup.

* Rename the related functions so that they start with cgroup_base_stat.

* Prefix the related variables and field names with b.

This patch doesn't make any functional changes.

Signed-off-by: Tejun Heo

Tejun Heo
2018-04-27 05:29:04 +0800
c58632b36 cgroup: Rename stat to rstat ... Browse Code »

stat is too generic a name and ends up causing subtle confusions.
It'll be made generic so that controllers can plug into it, which will
make the problem worse. Let's rename it to something more specific -
cgroup_rstat for cgroup recursive stat.

This patch does the following renames. No other changes.

* cpu_stat -> rstat_cpu
* stat -> rstat
* ?cstat -> ?rstatc

Note that the renames are selective. The unrenamed are the ones which
implement basic resource statistics on top of rstat. This will be
further cleaned up in the following patches.

Signed-off-by: Tejun Heo

Tejun Heo
2018-04-27 05:29:04 +0800
a5c2b93f7 cgroup: Rename kernel/cgroup/stat.c to kernel/cgroup/rstat.c ... Browse Code »

stat is too generic a name and ends up causing subtle confusions.
It'll be made generic so that controllers can plug into it, which will
make the problem worse. Let's rename it to something more specific -
cgroup_rstat for cgroup recursive stat.

First, rename kernel/cgroup/stat.c to kernel/cgroup/rstat.c. No
content changes.

Signed-off-by: Tejun Heo

Tejun Heo
2018-04-27 05:29:04 +0800
b12e35832 cgroup: Limit event generation frequency ... Browse Code »

".events" files generate file modified event to notify userland of
possible new events. Some of the events can be quite bursty
(e.g. memory high event) and generating notification each time is
costly and pointless.

This patch implements a event rate limit mechanism. If a new
notification is requested before 10ms has passed since the previous
notification, the new notification is delayed till then.

As this only delays from the second notification on in a given close
cluster of notifications, userland reactions to notifications
shouldn't be delayed at all in most cases while avoiding notification
storms.

Signed-off-by: Tejun Heo

Tejun Heo
2018-04-27 05:29:04 +0800
5faaf05f2 cgroup: Explicitly remove core interface files ... Browse Code »

The "cgroup." core interface files bypass the usual interface removal
path and get removed recursively along with the cgroup itself. While
this works now, the subtle discrepancy gets in the way of implementing
common mechanisms.

This patch updates cgroup core interface file handling so that it's
consistent with controller interface files. When added, the css is
marked CSS_VISIBLE and they're explicitly removed before the cgroup is
destroyed.

This doesn't cause user-visible behavior changes.

Signed-off-by: Tejun Heo

Tejun Heo
2018-04-27 05:29:04 +0800

04 Apr, 2018

1 commit

d92cd810e Merge branch 'for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq ... Browse Code »

Pull workqueue updates from Tejun Heo:
"rcu_work addition and a couple trivial changes"

* 'for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: remove the comment about the old manager_arb mutex
workqueue: fix the comments of nr_idle
fs/aio: Use rcu_work instead of explicit rcu and work item
cgroup: Use rcu_work instead of explicit rcu and work item
RCU, workqueue: Implement rcu_work

Linus Torvalds
2018-04-04 09:00:13 +0800

20 Mar, 2018

1 commit

8f36aaec9 cgroup: Use rcu_work instead of explicit rcu and work item ... Browse Code »

Workqueue now has rcu_work. Use it instead of open-coding rcu -> work
item bouncing.

Signed-off-by: Tejun Heo

Tejun Heo
2018-03-20 01:12:03 +0800

22 Feb, 2018

1 commit

d1897c953 cgroup: fix rule checking for threaded mode switching ... Browse Code »

A domain cgroup isn't allowed to be turned threaded if its subtree is
populated or domain controllers are enabled. cgroup_enable_threaded()
depended on cgroup_can_be_thread_root() test to enforce this rule. A
parent which has populated domain descendants or have domain
controllers enabled can't become a thread root, so the above rules are
enforced automatically.

However, for the root cgroup which can host mixed domain and threaded
children, cgroup_can_be_thread_root() doesn't check any of those
conditions and thus first level cgroups ends up escaping those rules.

This patch fixes the bug by adding explicit checks for those rules in
cgroup_enable_threaded().

Reported-by: Michael Kerrisk (man-pages)
Signed-off-by: Tejun Heo
Fixes: 8cfd8147df67 ("cgroup: implement cgroup v2 thread support")
Cc: stable@vger.kernel.org # v4.14+

Tejun Heo
2018-02-22 03:39:22 +0800

07 Feb, 2018

1 commit

77ef80c65 kernel/cpuset: current_cpuset_is_being_rebound can be boolean ... Browse Code »

Make current_cpuset_is_being_rebound return bool due to this particular
function only using either one or zero as its return value.

No functional change.

Link: http://lkml.kernel.org/r/1513266622-15860-4-git-send-email-baiyaowei@cmss.chinamobile.com
Signed-off-by: Yaowei Bai
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yaowei Bai
2018-02-07 10:32:47 +0800

20 Jan, 2018

1 commit

08a77676f string: drop __must_check from strscpy() and restore strscpy() usages in cgroup ... Browse Code »

e7fd37ba1217 ("cgroup: avoid copying strings longer than the buffers")
converted possibly unsafe strncpy() usages in cgroup to strscpy().
However, although the callsites are completely fine with truncated
copied, because strscpy() is marked __must_check, it led to the
following warnings.

kernel/cgroup/cgroup.c: In function ‘cgroup_file_name’:
kernel/cgroup/cgroup.c:1400:10: warning: ignoring return value of ‘strscpy’, declared with attribute warn_unused_result [-Wunused-result]
strscpy(buf, cft->name, CGROUP_FILE_NAME_MAX);
^

To avoid the warnings, 50034ed49645 ("cgroup: use strlcpy() instead of
strscpy() to avoid spurious warning") switched them to strlcpy().

strlcpy() is worse than strlcpy() because it unconditionally runs
strlen() on the source string, and the only reason we switched to
strlcpy() here was because it was lacking __must_check, which doesn't
reflect any material differences between the two function. It's just
that someone added __must_check to strscpy() and not to strlcpy().

These basic string copy operations are used in variety of ways, and
one of not-so-uncommon use cases is safely handling truncated copies,
where the caller naturally doesn't care about the return value. The
__must_check doesn't match the actual use cases and forces users to
opt for inferior variants which lack __must_check by happenstance or
spread ugly (void) casts.

Remove __must_check from strscpy() and restore strscpy() usages in
cgroup.

Signed-off-by: Tejun Heo
Suggested-by: Linus Torvalds
Cc: Ma Shimiao
Cc: Arnd Bergmann
Cc: Chris Metcalf

Tejun Heo
2018-01-20 00:51:36 +0800

11 Jan, 2018

1 commit

4f58424da cgroup: make cgroup.threads delegatable ... Browse Code »

Make cgroup.threads file delegatable.
The behavior of cgroup.threads should follow the behavior of cgroup.procs.

Signed-off-by: Roman Gushchin
Discovered-by: Michael Kerrisk
Cc: Tejun Heo
Signed-off-by: Tejun Heo

Roman Gushchin
2018-01-11 01:42:32 +0800

20 Dec, 2017

1 commit

74d0833c6 cgroup: fix css_task_iter crash on CSS_TASK_ITER_PROC ... Browse Code »

While teaching css_task_iter to handle skipping over tasks which
aren't group leaders, bc2fb7ed089f ("cgroup: add @flags to
css_task_iter_start() and implement CSS_TASK_ITER_PROCS") introduced a
silly bug.

CSS_TASK_ITER_PROCS is implemented by repeating
css_task_iter_advance() while the advanced cursor is pointing to a
non-leader thread. However, the cursor variable, @l, wasn't updated
when the iteration has to advance to the next css_set and the
following repetition would operate on the terminal @l from the
previous iteration which isn't pointing to a valid task leading to
oopses like the following or infinite looping.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000254
IP: __task_pid_nr_ns+0xc7/0xf0
PGD 0 P4D 0
Oops: 0000 [#1] SMP
...
CPU: 2 PID: 1 Comm: systemd Not tainted 4.14.4-200.fc26.x86_64 #1
Hardware name: System manufacturer System Product Name/PRIME B350M-A, BIOS 3203 11/09/2017
task: ffff88c4baee8000 task.stack: ffff96d5c3158000
RIP: 0010:__task_pid_nr_ns+0xc7/0xf0
RSP: 0018:ffff96d5c315bd50 EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffff88c4b68c6000 RCX: 0000000000000250
RDX: ffffffffa5e47960 RSI: 0000000000000000 RDI: ffff88c490f6ab00
RBP: ffff96d5c315bd50 R08: 0000000000001000 R09: 0000000000000005
R10: ffff88c4be006b80 R11: ffff88c42f1b8004 R12: ffff96d5c315bf18
R13: ffff88c42d7dd200 R14: ffff88c490f6a510 R15: ffff88c4b68c6000
FS: 00007f9446f8ea00(0000) GS:ffff88c4be680000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000254 CR3: 00000007f956f000 CR4: 00000000003406e0
Call Trace:
cgroup_procs_show+0x19/0x30
cgroup_seqfile_show+0x4c/0xb0
kernfs_seq_show+0x21/0x30
seq_read+0x2ec/0x3f0
kernfs_fop_read+0x134/0x180
__vfs_read+0x37/0x160
? security_file_permission+0x9b/0xc0
vfs_read+0x8e/0x130
SyS_read+0x55/0xc0
entry_SYSCALL_64_fastpath+0x1a/0xa5
RIP: 0033:0x7f94455f942d
RSP: 002b:00007ffe81ba2d00 EFLAGS: 00000293 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 00005574e2233f00 RCX: 00007f94455f942d
RDX: 0000000000001000 RSI: 00005574e2321a90 RDI: 000000000000002b
RBP: 0000000000000000 R08: 00005574e2321a90 R09: 00005574e231de60
R10: 00007f94458c8b38 R11: 0000000000000293 R12: 00007f94458c8ae0
R13: 00007ffe81ba3800 R14: 0000000000000000 R15: 00005574e2116560
Code: 04 74 0e 89 f6 48 8d 04 76 48 8d 04 c5 f0 05 00 00 48 8b bf b8 05 00 00 48 01 c7 31 c0 48 8b 0f 48 85 c9 74 18 8b b2 30 08 00 00 71 04 77 0d 48 c1 e6 05 48 01 f1 48 3b 51 38 74 09 5d c3 8b
RIP: __task_pid_nr_ns+0xc7/0xf0 RSP: ffff96d5c315bd50

Fix it by moving the initialization of the cursor below the repeat
label. While at it, rename it to @next for readability.

Signed-off-by: Tejun Heo
Fixes: bc2fb7ed089f ("cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS")
Cc: stable@vger.kernel.org # v4.14+
Reported-by: Laura Abbott
Reported-by: Bronek Kozicki
Reported-by: George Amanakis
Signed-off-by: Tejun Heo

Tejun Heo
2017-12-20 23:09:19 +0800

19 Dec, 2017

1 commit

116d2f749 cgroup: Fix deadlock in cpu hotplug path ... Browse Code »

Deadlock during cgroup migration from cpu hotplug path when a task T is
being moved from source to destination cgroup.

kworker/0:0
cpuset_hotplug_workfn()
cpuset_hotplug_update_tasks()
hotplug_update_tasks_legacy()
remove_tasks_in_empty_cpuset()
cgroup_transfer_tasks() // stuck in iterator loop
cgroup_migrate()
cgroup_migrate_add_task()

In cgroup_migrate_add_task() it checks for PF_EXITING flag of task T.
Task T will not migrate to destination cgroup. css_task_iter_start()
will keep pointing to task T in loop waiting for task T cg_list node
to be removed.

Task T
do_exit()
exit_signals() // sets PF_EXITING
exit_task_namespaces()
switch_task_namespaces()
free_nsproxy()
put_mnt_ns()
drop_collected_mounts()
namespace_unlock()
synchronize_rcu()
_synchronize_rcu_expedited()
schedule_work() // on cpu0 low priority worker pool
wait_event() // waiting for work item to execute

Task T inserted a work item in the worklist of cpu0 low priority
worker pool. It is waiting for expedited grace period work item
to execute. This work item will only be executed once kworker/0:0
complete execution of cpuset_hotplug_workfn().

kworker/0:0 ==> Task T ==>kworker/0:0

In case of PF_EXITING task being migrated from source to destination
cgroup, migrate next available task in source cgroup.

Signed-off-by: Prateek Sood
Signed-off-by: Tejun Heo

Prateek Sood
2017-12-19 21:38:47 +0800

15 Dec, 2017

1 commit

50034ed49 cgroup: use strlcpy() instead of strscpy() to avoid spurious warning ... Browse Code »

As long as cft->name is guaranteed to be NUL-terminated, using strlcpy() would
work just as well and avoid that warning, so the change below could be folded
into that commit.

Signed-off-by: Arnd Bergmann
Signed-off-by: Tejun Heo

Arnd Bergmann
2017-12-15 21:09:47 +0800

12 Dec, 2017

1 commit

e7fd37ba1 cgroup: avoid copying strings longer than the buffers ... Browse Code »

cgroup root name and file name have max length limit, we should
avoid copying longer name than that to the name.

tj: minor update to $SUBJ.

Signed-off-by: Ma Shimiao
Signed-off-by: Tejun Heo

Ma Shimiao
2017-12-12 23:53:29 +0800

05 Dec, 2017

1 commit

bdfbbda90 Revert "cgroup/cpuset: remove circular dependency deadlock" ... Browse Code »

This reverts commit aa24163b2ee5c92120e32e99b5a93143a0f4258e.

This and the following commit led to another circular locking scenario
and the scenario which is fixed by this commit no longer exists after
e8b3f8db7aad ("workqueue/hotplug: simplify workqueue_offline_cpu()")
which removes work item flushing from hotplug path.

Revert it for now.

Signed-off-by: Tejun Heo

Tejun Heo
2017-12-05 06:55:59 +0800