Eric Lee / smarc-ti-linux-kernel | Embedian Git Server

14 May, 2014

3 commits

0ab7a60de cgroup: css_release() shouldn't clear cgroup->subsys[]

c1a71504e971 ("cgroup: don't recycle cgroup id until all csses' have
been destroyed") made cgroup ID persist until a cgroup is released and
added cgroup->subsys[] clearing to css_release() so that css_from_id()
doesn't return a css which has already been released, which happens
before cgroup release; however, the right change here was updating
offline_css() to clear cgroup->subsys[], which was done by e32978031016
("cgroup: cgroup->subsys[] should be cleared after the css is
offlined"), instead of clearing it from css_release().

We're now clearing cgroup->subsys[] twice. This is okay for
traditional hierarchies as a css's lifetime is the same as its
cgroup's; however, this confuses unified hierarchy and turning on and
off a controller repeatedly using "cgroup.subtree_control" can lead to
an oops like the following which happens because cgroup->subsys[] is
incorrectly cleared asynchronously by css_release().

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [] kill_css+0x21/0x1c0
PGD 1170d067 PUD f0ab067 PMD 0
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 2 PID: 459 Comm: bash Not tainted 3.15.0-rc2-work+ #5
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff880009296710 ti: ffff88000e198000 task.ti: ffff88000e198000
RIP: 0010:[] [] kill_css+0x21/0x1c0
RSP: 0018:ffff88000e199dc8 EFLAGS: 00010202
RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001
RDX: 0000000000000001 RSI: ffffffff8238a968 RDI: ffff880009296f98
RBP: ffff88000e199de0 R08: 0000000000000001 R09: 02b0000000000000
R10: 0000000000000000 R11: ffff880009296fc0 R12: 0000000000000001
R13: ffff88000db6fc58 R14: 0000000000000001 R15: ffff8800139dcc00
FS: 00007ff9160c5740(0000) GS:ffff88001fb00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 0000000013947000 CR4: 00000000000006e0
Stack:
ffff88000e199de0 ffffffff82389160 0000000000000001 ffff88000e199e80
ffffffff8113537f 0000000000000007 ffff88000e74af00 ffff88000e199e48
ffff880009296710 ffff88000db6fc00 ffffffff8239c100 0000000000000002
Call Trace:
[] cgroup_subtree_control_write+0x85f/0xa00
[] cgroup_file_write+0x38/0x1d0
[] kernfs_fop_write+0xe7/0x170
[] vfs_write+0xb6/0x1c0
[] SyS_write+0x4d/0xc0
[] system_call_fastpath+0x16/0x1b
Code: 5c 41 5d 41 5e 41 5f 5d c3 90 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb 48 83 ec 08 8b 05 37 ad 29 01 85 c0 0f 85 df 00 00 00 8b 43 08 48 8b 3b be 01 00 00 00 8b 48 5c d3 e6 e8 49 ff ff
RIP [] kill_css+0x21/0x1c0
RSP
CR2: 0000000000000008
---[ end trace e7aae1f877c4e1b4 ]---

Remove the unnecessary cgroup->subsys[] clearing from css_release().

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2014-05-14 00:10:59 +0800
54504e977 cgroup: cgroup_idr_lock should be bh

cgroup_idr_remove() can be invoked from bh leading to lockdep
detecting possible AA deadlock (IN_BH/ON_BH). Make the lock bh-safe.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2014-05-14 00:10:59 +0800
0cee8b778 cgroup: fix offlining child waiting in cgroup_subtree_control_write()

cgroup_subtree_control_write() waits for offline to complete
child-by-child before enabling a controller; however, it has a couple
of bugs.

* It doesn't initialize the wait_queue_t. This can lead to an infinite
hang on the following schedule(), among other things.

* It forgets to pin the child before releasing cgroup_tree_mutex and
performing schedule(). The child may already be gone by the time it
wakes up and invokes finish_wait(). Pin the child being waited on.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2014-05-14 00:10:59 +0800

13 May, 2014

5 commits

f21a4f759 Merge branch 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-3.16

Pull to receive e37a06f10994 ("cgroup: fix the retry path of
cgroup_mount()") to avoid unnecessary conflicts with planned
cgroup_tree_mutex removal and also to be able to remove the temp fix
added by 36c38fb7144a ("blkcg: use trylock on blkcg_pol_mutex in
blkcg_reset_stats()") afterwards.

Signed-off-by: Tejun Heo <tj@kernel.org>

Tejun Heo
2014-05-13 23:30:04 +0800
36e9d2ebc cgroup: fix rcu_read_lock() leak in update_if_frozen()

While updating cgroup_freezer locking, 68fafb77d827 ("cgroup_freezer:
replace freezer->lock with freezer_mutex") introduced a bug in
update_if_frozen() where it returns with rcu_read_lock() held. Fix it
by adding rcu_read_unlock() before returning.

Signed-off-by: Tejun Heo
Reported-by: kbuild test robot

Tejun Heo
2014-05-13 23:28:30 +0800
d39ea871c Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu into for-3.16

Pull to receive percpu_ref_tryget[_live]() changes. Planned cgroup
changes will make use of them.

Signed-off-by: Tejun Heo

Tejun Heo
2014-05-13 23:27:24 +0800
e5ced8ebb cgroup_freezer: replace freezer->lock with freezer_mutex

After 96d365e0b86e ("cgroup: make css_set_lock a rwsem and rename it
to css_set_rwsem"), css task iterators require sleepable context as
they may block on css_set_rwsem. I missed that cgroup_freezer was
iterating tasks under the IRQ-safe spinlock freezer->lock. This leads to
errors like the following on freezer state reads and transitions.

BUG: sleeping function called from invalid context at /work/os/work/kernel/locking/rwsem.c:20
in_atomic(): 0, irqs_disabled(): 0, pid: 462, name: bash
5 locks held by bash/462:
#0: (sb_writers#7){.+.+.+}, at: [] vfs_write+0x1a3/0x1c0
#1: (&of->mutex){+.+.+.}, at: [] kernfs_fop_write+0xbb/0x170
#2: (s_active#70){.+.+.+}, at: [] kernfs_fop_write+0xc3/0x170
#3: (freezer_mutex){+.+...}, at: [] freezer_write+0x61/0x1e0
#4: (rcu_read_lock){......}, at: [] freezer_write+0x53/0x1e0
Preemption disabled at:[] console_unlock+0x1e4/0x460

CPU: 3 PID: 462 Comm: bash Not tainted 3.15.0-rc1-work+ #10
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
ffff88000916a6d0 ffff88000e0a3da0 ffffffff81cf8c96 0000000000000000
ffff88000e0a3dc8 ffffffff810cf4f2 ffffffff82388040 ffff880013aaf740
0000000000000002 ffff88000e0a3de8 ffffffff81d05974 0000000000000246
Call Trace:
[] dump_stack+0x4e/0x7a
[] __might_sleep+0x162/0x260
[] down_read+0x24/0x60
[] css_task_iter_start+0x27/0x70
[] freezer_apply_state+0x5d/0x130
[] freezer_write+0xf6/0x1e0
[] cgroup_file_write+0xd8/0x230
[] kernfs_fop_write+0xe7/0x170
[] vfs_write+0xb6/0x1c0
[] SyS_write+0x4d/0xc0
[] system_call_fastpath+0x16/0x1b

freezer->lock used to be used in hot paths but that time is long gone
and there's no reason for the lock to be an IRQ-safe spinlock or even
per-cgroup. In fact, given that a cgroup may contain a large
number of tasks, it's not a good idea to iterate over them while
holding an IRQ-safe spinlock.

Let's simplify locking by replacing per-cgroup freezer->lock with
global freezer_mutex. This also makes the comments explaining the
intricacies of policy inheritance and the locking around it
unnecessary as the states are protected by a common mutex.

The conversion is mostly straightforward. The following are worth
mentioning.

* freezer_css_online() no longer needs double locking.

* freezer_attach() now performs propagation simply while holding
freezer_mutex. update_if_frozen() race no longer exists and the
comment is removed.

* freezer_fork() now tests whether the task is in root cgroup using
the new task_css_is_root() without doing rcu_read_lock/unlock(). If
not, it grabs freezer_mutex and performs the operation.

* freezer_read() and freezer_change_state() grab freezer_mutex across
the whole operation and pin the css while iterating so that each
descendant processing happens in sleepable context.

Fixes: 96d365e0b86e ("cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem")
Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2014-05-13 23:26:31 +0800
5024ae29c cgroup: introduce task_css_is_root()

Determining the css of a task usually requires RCU read lock as that's
the only thing which keeps the returned css accessible till its
reference is acquired; however, testing whether a task belongs to the
root can be performed without dereferencing the returned css by
comparing the returned pointer against the root one in init_css_set[]
which never changes.

Implement task_css_is_root() which can be invoked in any context.
This will be used by the scheduled cgroup_freezer change.

v2: cgroup no longer supports modular controllers. No need to export
init_css_set. Pointed out by Li.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2014-05-13 23:26:27 +0800

10 May, 2014

2 commits

4fb6e2504 percpu-refcount: implement percpu_ref_tryget()

Implement percpu_ref_tryget() which fails if the refcnt already
reached zero. Note that this is different from the recently renamed
percpu_ref_tryget_live() which fails if the refcnt has been killed and
is draining the remaining references. percpu_ref_tryget() succeeds on
a killed refcnt as long as its current refcnt is above zero.
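
The difference between the two flavours can be illustrated with a small userspace model. This is only a sketch: `RefModel` and its methods are hypothetical names, it is single-threaded, and it assumes that killing merely marks the ref as dying while the holders drop their references separately. It is not the kernel's percpu-refcount implementation.

```python
class RefModel:
    """Toy, single-threaded model of the two tryget flavours."""

    def __init__(self):
        self.count = 1        # initial reference held by the owner
        self.killed = False   # set once teardown has started

    def kill(self):
        """Mark the ref as dying; existing references keep draining."""
        self.killed = True

    def put(self):
        """Drop one reference."""
        assert self.count > 0, "put() without a matching reference"
        self.count -= 1

    def tryget(self):
        """Conventional tryget: fails only once the count reached zero."""
        if self.count == 0:
            return False
        self.count += 1
        return True

    def tryget_live(self):
        """Stricter variant: also fails while a killed ref is draining."""
        if self.killed:
            return False
        return self.tryget()
```

In this model, after `kill()` a call to `tryget_live()` fails immediately, while `tryget()` keeps succeeding until the last reference has been put.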

Signed-off-by: Tejun Heo
Acked-by: Kent Overstreet

Tejun Heo
2014-05-10 03:42:35 +0800
2070d50e1 percpu-refcount: rename percpu_ref_tryget() to percpu_ref_tryget_live()

percpu_ref_tryget() is different from the usual tryget semantics in
that it fails if the refcnt is in its dying stage even if the refcnt
hasn't reached zero yet. We're about to introduce the more
conventional tryget and the current one has only one user. Let's
rename it to percpu_ref_tryget_live() so that it explicitly signifies
the peculiarities of its semantics.

This is pure rename.

Signed-off-by: Tejun Heo
Acked-by: Kent Overstreet

Tejun Heo
2014-05-10 03:42:15 +0800

07 May, 2014

1 commit

2b53f41fa cgroup: remove unused CGRP_SANE_BEHAVIOR

This cgroup flag has never been used. Only CGRP_ROOT_SANE_BEHAVIOR is
used. Remove it.

Signed-off-by: Tejun Heo

Tejun Heo
2014-05-07 21:21:56 +0800

06 May, 2014

4 commits

12d3089c1 kernel/cpuset.c: convert printk to pr_foo()

Cc: Andrew Morton
Signed-off-by: Fabian Frederick
Acked-by: Li Zefan
Signed-off-by: Tejun Heo

Fabian Frederick
2014-05-06 19:31:14 +0800
fc34ac1dc kernel/cpuset.c: kernel-doc fixes

This patch also converts seq_printf to seq_puts

Cc: Andrew Morton
Signed-off-by: Fabian Frederick
Acked-by: Li Zefan
Signed-off-by: Tejun Heo

Fabian Frederick
2014-05-06 19:31:14 +0800
60106946c kernel/cgroup.c: fix 2 kernel-doc warnings

Fix typo and variable name.

tj: Updated @cgrp argument description in cgroup_destroy_css_killed()

Cc: Andrew Morton
Signed-off-by: Fabian Frederick
Signed-off-by: Tejun Heo

Fabian Frederick
2014-05-06 02:33:04 +0800
36c38fb71 blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats()

During the recent conversion of cgroup to kernfs, cgroup_tree_mutex
which nests above both the kernfs s_active protection and cgroup_mutex
is added to synchronize cgroup file type operations as cgroup_mutex
needed to be grabbed from some file operations and thus can't be put
above s_active protection.

While this arrangement mostly worked for cgroup, this triggered the
following lockdep warning.

======================================================
[ INFO: possible circular locking dependency detected ]
3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429 Tainted: G W
-------------------------------------------------------
trinity-c173/9024 is trying to acquire lock:
(blkcg_pol_mutex){+.+.+.}, at: blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)

but task is already holding lock:
(s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (s_active#89){++++.+}:
lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
__kernfs_remove (arch/x86/include/asm/atomic.h:27 fs/kernfs/dir.c:352 fs/kernfs/dir.c:1024)
kernfs_remove_by_name_ns (fs/kernfs/dir.c:1219)
cgroup_addrm_files (include/linux/kernfs.h:427 kernel/cgroup.c:1074 kernel/cgroup.c:2899)
cgroup_clear_dir (kernel/cgroup.c:1092 (discriminator 2))
rebind_subsystems (kernel/cgroup.c:1144)
cgroup_setup_root (kernel/cgroup.c:1568)
cgroup_mount (kernel/cgroup.c:1716)
mount_fs (fs/super.c:1094)
vfs_kern_mount (fs/namespace.c:899)
do_mount (fs/namespace.c:2238 fs/namespace.c:2561)
SyS_mount (fs/namespace.c:2758 fs/namespace.c:2729)
tracesys (arch/x86/kernel/entry_64.S:746)

-> #1 (cgroup_tree_mutex){+.+.+.}:
lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
cgroup_add_cftypes (include/linux/list.h:76 kernel/cgroup.c:3040)
blkcg_policy_register (block/blk-cgroup.c:1106)
throtl_init (block/blk-throttle.c:1694)
do_one_initcall (init/main.c:789)
kernel_init_freeable (init/main.c:854 init/main.c:863 init/main.c:882 init/main.c:1003)
kernel_init (init/main.c:935)
ret_from_fork (arch/x86/kernel/entry_64.S:552)

-> #0 (blkcg_pol_mutex){+.+.+.}:
__lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
cgroup_file_write (kernel/cgroup.c:2714)
kernfs_fop_write (fs/kernfs/file.c:295)
vfs_write (fs/read_write.c:532)
SyS_write (fs/read_write.c:584 fs/read_write.c:576)
tracesys (arch/x86/kernel/entry_64.S:746)

other info that might help us debug this:

Chain exists of:
blkcg_pol_mutex --> cgroup_tree_mutex --> s_active#89

Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(s_active#89);
                               lock(cgroup_tree_mutex);
                               lock(s_active#89);
  lock(blkcg_pol_mutex);

*** DEADLOCK ***

4 locks held by trinity-c173/9024:
#0: (&f->f_pos_lock){+.+.+.}, at: __fdget_pos (fs/file.c:714)
#1: (sb_writers#18){.+.+.+}, at: vfs_write (include/linux/fs.h:2255 fs/read_write.c:530)
#2: (&of->mutex){+.+.+.}, at: kernfs_fop_write (fs/kernfs/file.c:283)
#3: (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)

stack backtrace:
CPU: 3 PID: 9024 Comm: trinity-c173 Tainted: G W 3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429
ffffffff919687b0 ffff8805f6373bb8 ffffffff8e52cdbb 0000000000000002
ffffffff919d8400 ffff8805f6373c08 ffffffff8e51fb88 0000000000000004
ffff8805f6373c98 ffff8805f6373c08 ffff88061be70d98 ffff88061be70dd0
Call Trace:
dump_stack (lib/dump_stack.c:52)
print_circular_bug (kernel/locking/lockdep.c:1216)
__lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
cgroup_file_write (kernel/cgroup.c:2714)
kernfs_fop_write (fs/kernfs/file.c:295)
vfs_write (fs/read_write.c:532)
SyS_write (fs/read_write.c:584 fs/read_write.c:576)

This is a highly unlikely but valid circular dependency between "echo
1 > blkcg.reset_stats" and cfq module [un]loading. cgroup is going
through further locking update which will remove this complication but
for now let's use trylock on blkcg_pol_mutex and retry the file
operation if the trylock fails.
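
The trylock-and-retry idea can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel change itself: `pol_mutex`, `reset_stats()` and `reset_stats_with_retry()` are hypothetical names, and a plain `threading.Lock` stands in for blkcg_pol_mutex.

```python
import threading

# Hypothetical stand-in for blkcg_pol_mutex.
pol_mutex = threading.Lock()


def reset_stats(stats):
    """Zero all counters; return False if the lock is busy.

    The lock is never waited on, so a caller that already holds an
    outer lock cannot end up blocked behind the inner one.
    """
    if not pol_mutex.acquire(blocking=False):
        return False  # caller is expected to retry the whole operation
    try:
        for key in stats:
            stats[key] = 0
        return True
    finally:
        pol_mutex.release()


def reset_stats_with_retry(stats, max_attempts=3):
    """Retry the whole operation a bounded number of times."""
    for _ in range(max_attempts):
        if reset_stats(stats):
            return True
    return False
```

The point of the pattern is that the operation backs out completely on contention instead of sleeping on the inner lock, which breaks the circular wait in the scenario shown above.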

Signed-off-by: Tejun Heo
Reported-by: Sasha Levin
References: http://lkml.kernel.org/g/5363C04B.4010400@oracle.com

Tejun Heo
2014-05-06 01:48:18 +0800

05 May, 2014

8 commits

d2c2b11cf device_cgroup: check if exception removal is allowed

When the device cgroup hierarchy was introduced in
bd2953ebbb53 - devcg: propagate local changes down the hierarchy

a specific case was overlooked. Consider the hierarchy below:

    A   default policy: ALLOW, exceptions will deny access
     \
      B default policy: ALLOW, exceptions will deny access

There's no need to verify when a new exception is added to B because
in this case exceptions will deny access to further devices, which is
always fine. Hierarchy in device cgroup only makes sure B won't have
more access than A.

But when an exception is removed (by writing devices.allow), it isn't
checked if the user is in fact removing an inherited exception from A,
thus giving more access to B.

Example:

# echo 'a' >A/devices.allow
# echo 'c 1:3 rw' >A/devices.deny
# echo $$ >A/B/tasks
# echo >/dev/null
-bash: /dev/null: Operation not permitted
# echo 'c 1:3 w' >A/B/devices.allow
# echo >/dev/null
#

This shouldn't be allowed and this patch fixes it by making sure to never allow
exceptions in this case to be removed if the exception is partially or fully
present on the parent.
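
The check described above can be modelled roughly as follows. This is a simplified sketch, not the kernel's implementation: `DevException`, `parse_exception()`, `overlaps()` and `may_remove_exception()` are hypothetical names, the 'a' (all devices) type is not modelled, and "partially or fully present" is approximated as "same device type, matching or wildcard major/minor, and at least one access bit in common".

```python
from typing import FrozenSet, NamedTuple

ANY = "*"  # wildcard major or minor, as in "c 1:* rw"


class DevException(NamedTuple):
    dev_type: str           # 'c' or 'b'
    major: str
    minor: str
    access: FrozenSet[str]  # subset of {'r', 'w', 'm'}


def parse_exception(text):
    """Parse a rule such as 'c 1:3 rw' into a DevException."""
    dev_type, dev, access = text.split()
    major, minor = dev.split(":")
    return DevException(dev_type, major, minor, frozenset(access))


def _num_overlaps(a, b):
    """Two major/minor fields overlap if equal or if either is a wildcard."""
    return a == ANY or b == ANY or a == b


def overlaps(a, b):
    """True if the two exceptions share a device and an access bit."""
    return (a.dev_type == b.dev_type
            and _num_overlaps(a.major, b.major)
            and _num_overlaps(a.minor, b.minor)
            and bool(a.access & b.access))


def may_remove_exception(parent_exceptions, ex):
    """Refuse removal if ex is partially or fully present on the parent."""
    return not any(overlaps(p, ex) for p in parent_exceptions)
```

With the example above, A holds the exception `c 1:3 rw`, so `may_remove_exception([parse_exception("c 1:3 rw")], parse_exception("c 1:3 w"))` is false in this model and the write to B's devices.allow would be rejected.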

v3: missing '*' in function description
v2: improved log message and formatting fixes

Cc: cgroups@vger.kernel.org
Cc: Li Zefan
Cc: stable@vger.kernel.org
Signed-off-by: Aristeu Rozanski
Acked-by: Serge Hallyn
Signed-off-by: Tejun Heo

Aristeu Rozanski
2014-05-05 23:20:12 +0800
f5f3cf6f7 device_cgroup: fix the comment format for recently added functions

Moving more extensive explanations to the end of the comment.

Cc: Li Zefan
Signed-off-by: Aristeu Rozanski
Acked-by: Serge Hallyn
Signed-off-by: Tejun Heo

Aristeu Rozanski
2014-05-05 03:21:09 +0800
15a4c835e cgroup, memcg: implement css->id and convert css_from_id() to use it

Until now, cgroup->id has been used to identify all the associated
csses and css_from_id() takes cgroup ID and returns the matching css
by looking up the cgroup and then dereferencing the css associated
with it; however, now that the lifetimes of cgroup and css are
separate, this is incorrect and breaks on the unified hierarchy when a
controller is disabled and enabled back again before the previous
instance is released.

This patch adds css->id which is a subsystem-unique ID and converts
css_from_id() to look up by the new css->id instead. memcg is the
only user of css_from_id() and also converted to use css->id instead.

For traditional hierarchies, this shouldn't make any functional
difference.

Signed-off-by: Tejun Heo
Acked-by: Michal Hocko
Cc: Johannes Weiner
Cc: Jianyu Zhan
Acked-by: Li Zefan

Tejun Heo
2014-05-05 03:09:14 +0800
ddfcadab3 cgroup: update init_css() into init_and_link_css()

init_css() takes the cgroup the new css belongs to as an argument and
initializes the new css's ->cgroup and ->parent pointers but doesn't
acquire the matching reference counts. After the previous patch,
create_css() puts init_css() and reference acquisition right next to
each other. Let's move reference acquisition into init_css() and
rename the function to init_and_link_css(). This makes sense and is
easier to follow. This makes the root csses hold a reference on
cgrp_dfl_root.cgrp, which is harmless.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2014-05-05 03:09:14 +0800
a2bed8209 cgroup: use RCU free in create_css() failure path

Currently, when create_css() fails in the middle, the half-initialized
css is freed by invoking cgroup_subsys->css_free() directly. This
patch updates the function so that it invokes RCU free path instead.
As the RCU free path puts the parent css and owning cgroup, their
references are now acquired right after a new css is successfully
allocated.

This doesn't make any visible difference now but is to enable
implementing css->id and RCU protected lookup by such IDs.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2014-05-05 03:09:14 +0800
6fa4918d0 cgroup: protect cgroup_root->cgroup_idr with a spinlock

Currently, cgroup_root->cgroup_idr is protected by cgroup_mutex, which
ends up requiring cgroup_put() to be invoked under sleepable context.
This is okay for now but is an unusual requirement and we'll soon add
css->id which will have the same problem but won't be able to simply
grab cgroup_mutex as removal will have to happen from css_release()
which can't sleep.

Introduce cgroup_idr_lock and idr_alloc/replace/remove() wrappers
which protects the idr operations with the lock and use them for
cgroup_root->cgroup_idr. cgroup_put() no longer needs to grab
cgroup_mutex and css_from_id() is updated to always require RCU read
lock instead of either RCU read lock or cgroup_mutex, which doesn't
affect the existing users.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2014-05-05 03:09:13 +0800
7d699ddb2 cgroup, memcg: allocate cgroup ID from 1

Currently, cgroup->id is allocated from 0, which is always assigned to
the root cgroup; unfortunately, memcg wants to use ID 0 to indicate
invalid IDs and ends up incrementing all IDs by one.

It's reasonable to reserve 0 for special purposes. This patch updates
cgroup core so that ID 0 is not used and the root cgroups get ID 1.
The ID incrementing is removed from memcg.

Signed-off-by: Tejun Heo
Acked-by: Michal Hocko
Cc: Johannes Weiner
Acked-by: Li Zefan

Tejun Heo
2014-05-05 03:09:13 +0800
69dfa00cc cgroup: make flags and subsys_masks unsigned int

There's no reason to use atomic bitops for cgroup_subsys_state->flags,
cgroup_root->flags and various subsys_masks. This patch updates those
to use bitwise and/or operations instead and converts them from
unsigned long to unsigned int.

This makes the fields occupy (marginally) smaller space and makes it
clear that they don't require atomicity.

This patch doesn't cause any behavior difference.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2014-05-05 03:09:13 +0800

26 Apr, 2014

10 commits

ed3d261b5 cgroup: Use more current logging style

Use pr_fmt and remove embedded prefixes.
Realign modified multi-line statements to open parenthesis.
Convert embedded function name to "%s: ", __func__

Signed-off-by: Joe Perches
Signed-off-by: Tejun Heo

Joe Perches
2014-04-26 06:28:03 +0800
a2a1f9eaf cgroup: replace pr_warning with preferred pr_warn

As suggested by scripts/checkpatch.pl, substitute all pr_warning()
with pr_warn().

No functional change.

Signed-off-by: Jianyu Zhan
Signed-off-by: Tejun Heo

Jianyu Zhan
2014-04-26 06:28:03 +0800
f8719ccf7 cgroup: remove orphaned cgroup_pidlist_seq_operations

6612f05b88fa309c9 ("cgroup: unify pidlist and other file handling")
has removed the only user of cgroup_pidlist_seq_operations,
cgroup_pidlist_open().

This patch removes it.

Signed-off-by: Jianyu Zhan
Signed-off-by: Tejun Heo

Jianyu Zhan
2014-04-26 06:28:03 +0800
2f0edc04e cgroup: clean up obsolete comment for parse_cgroupfs_options()

1d5be6b287c8efc87 ("cgroup: move module ref handling into
rebind_subsystems()") made parse_cgroupfs_options() no longer take
refcounts on subsystems.

And with unified hierarchy, parse_cgroupfs_options() no longer needs to
be called with cgroup_mutex held to protect cgroup_subsys[].

So this patch removes the BUG_ON() and the comment. As the comment
doesn't contain useful information afterwards, the whole comment is
removed.

Signed-off-by: Jianyu Zhan
Signed-off-by: Tejun Heo

Jianyu Zhan
2014-04-26 06:28:03 +0800
657315780 cgroup: add documentation about unified hierarchy

Unified hierarchy will be the new version of cgroup interface. This
patch adds Documentation/cgroups/unified-hierarchy.txt which describes
the design and rationales of unified hierarchy.

v2: Grammatical updates as per Randy Dunlap's review.

Signed-off-by: Tejun Heo
Cc: Randy Dunlap

Tejun Heo
2014-04-26 06:28:02 +0800
842b597ee cgroup: implement cgroup.populated for the default hierarchy

cgroup users often need a way to determine when a cgroup's
subhierarchy becomes empty so that it can be cleaned up. cgroup
currently provides release_agent for it; unfortunately, this mechanism
is riddled with issues.

* It delivers events by forking and execing a userland binary
specified as the release_agent. This is a long deprecated method of
notification delivery. It's extremely heavy, slow and cumbersome to
integrate with larger infrastructure.

* There is single monitoring point at the root. There's no way to
delegate management of a subtree.

* The event isn't recursive. It triggers when a cgroup doesn't have
any tasks or child cgroups. Events for internal nodes trigger only
after all children are removed. This again makes it impossible to
delegate management of a subtree.

* Events are filtered from the kernel side. "notify_on_release" file
is used to subscribe to or suppress release event. This is
unnecessarily complicated and probably done this way because event
delivery itself was expensive.

This patch implements interface file "cgroup.populated" which can be
used to monitor whether the cgroup's subhierarchy has tasks in it or
not. Its value is 0 if there is no task in the cgroup and its
descendants; otherwise, 1, and kernfs_notify() notification is
triggered when the value changes, which can be monitored through poll
and [di]notify.

This is a lot lighter and simpler and trivially allows delegating
management of a subhierarchy - subhierarchy monitoring can block further
propagation simply by putting itself or another process in the root of
the subhierarchy and monitoring events that it's interested in from there
without interfering with monitoring higher in the tree.
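
The semantics of "cgroup.populated" described above can be sketched with a small userspace model. This is an illustration only: `Cgroup`, `add_tasks()` and the `events` list are hypothetical, and appending to `events` stands in for the kernfs_notify() call made when the value flips.

```python
class Cgroup:
    """Toy cgroup that tracks whether its subhierarchy has tasks."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.nr_tasks = 0             # tasks directly in this cgroup
        self.populated_children = 0   # children whose subtree has tasks
        self.events = []              # recorded value changes

    def populated(self):
        """0 if the cgroup and all descendants are empty, otherwise 1."""
        return 1 if (self.nr_tasks or self.populated_children) else 0


def add_tasks(cgrp, delta):
    """Add tasks (or remove them with a negative delta) and notify."""
    was = cgrp.populated()
    cgrp.nr_tasks += delta
    assert cgrp.nr_tasks >= 0
    # Walk towards the root, stopping at the first unchanged ancestor.
    while cgrp is not None:
        now = cgrp.populated()
        if now == was:
            break
        cgrp.events.append(now)  # stands in for kernfs_notify()
        parent = cgrp.parent
        if parent is None:
            break
        was = parent.populated()
        parent.populated_children += 1 if now else -1
        cgrp = parent
```

Because propagation stops at the first ancestor whose value does not change, a process sitting in the root of a delegated subhierarchy keeps that root populated and events from below do not travel further up.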

v2: Patch description updated as per Serge.

v3: "cgroup.subtree_populated" renamed to "cgroup.populated". The
subtree_ prefix was a bit confusing because
"cgroup.subtree_control" uses it to denote the tree rooted at the
cgroup sans the cgroup itself while the populated state includes
the cgroup itself.

Signed-off-by: Tejun Heo
Acked-by: Serge Hallyn
Acked-by: Li Zefan
Cc: Lennart Poettering

Tejun Heo
2014-04-26 06:28:02 +0800
50bce01b0 Merge branch 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core into for-3.16

Pull in driver-core-next to receive kernfs_notify() updates which will
be used by the planned "cgroup.populated" implementation.

Signed-off-by: Tejun Heo <tj@kernel.org>

Tejun Heo
2014-04-26 06:25:55 +0800
86d56134f kobject: Make support for uevent_helper optional.

Support for uevent_helper, aka hotplug, is not required on many systems
these days but it can still be enabled via sysfs or sysctl.

Reported-by: Darren Shepherd
Signed-off-by: Michael Marineau
Signed-off-by: Greg Kroah-Hartman

Michael Marineau
2014-04-26 03:00:49 +0800
d911d9874 kernfs: make kernfs_notify() trigger inotify events too

kernfs_notify() is used to indicate either that new data is available or
that the content of a file has changed. It currently only triggers poll,
which may not be the most convenient interface, especially when there
are a lot of files to monitor. Let's hook it up to fsnotify too so that the
events can be monitored via inotify as well.

fsnotify_modify() requires file * but kernfs_notify() doesn't have any
specific file associated; however, we can walk all super_blocks
associated with a kernfs_root and, as kernfs always associates one ino
with one inode and one dentry with one inode, it's trivial to look up the
dentry associated with a given kernfs_node. As any active monitor
would pin the dentry, just looking up the existing dentry is enough. This
patch looks up the dentry associated with the specified kernfs_node
and generates events equivalent to fsnotify_modify().

Note that as fsnotify doesn't provide fsnotify_modify() equivalent
which can be called with dentry, kernfs_notify() directly calls
fsnotify_parent() and fsnotify(). It might be better to add a wrapper
in fsnotify.h instead.

Signed-off-by: Tejun Heo
Cc: John McCutchan
Cc: Robert Love
Cc: Eric Paris
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2014-04-26 02:43:31 +0800
7d568a838 kernfs: implement kernfs_root->supers list

Currently, there's no way to find out which super_blocks are
associated with a given kernfs_root. Let's implement it - the planned
inotify extension to kernfs_notify() needs it.

Make kernfs_super_info point back to the super_block and chain it at
kernfs_root->supers.

Signed-off-by: Tejun Heo
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2014-04-26 02:43:31 +0800

23 Apr, 2014

7 commits

f8f22e53a cgroup: implement dynamic subtree controller enable/disable on the default hierarchy

cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.

This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.

  root - A - B - C
               \ D

root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.

The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraint imposed by cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.

When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.

* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.

* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the children of the cgroup and the
ones prefixed with '-' are disabled.

* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.

* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.

* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.

* "cgroup.controllers" in all non-root cgroups is a read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".

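The write format and the enable/disable rules listed above can be
sketched in user-space C. This is only an illustrative model, not the
kernel implementation: the controller table, the bitmask layout and the
function names parse_subtree_control() and subtree_control_allowed() are
all assumptions made for this example.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical controller table; names and bit positions are illustrative. */
static const char *const ctrl_names[] = { "cpu", "memory", "io" };
#define NR_CTRLS (sizeof(ctrl_names) / sizeof(ctrl_names[0]))

/*
 * Parse a whitespace-separated list such as "+memory -io" into two bitmasks.
 * Returns 0 on success, -1 on a token without '+'/'-' or an unknown name.
 * A later token for the same controller overrides an earlier one.
 * On failure the output masks may be partially filled.
 */
static int parse_subtree_control(const char *buf,
                                 unsigned *enable, unsigned *disable)
{
    const char *p = buf;

    assert(buf && enable && disable);
    *enable = *disable = 0;

    for (;;) {
        size_t len, i;

        p += strspn(p, " \t\n");          /* skip separators */
        if (*p == '\0')
            break;
        len = strcspn(p, " \t\n");        /* length of this token */
        if ((p[0] != '+' && p[0] != '-') || len < 2)
            return -1;

        for (i = 0; i < NR_CTRLS; i++)
            if (strlen(ctrl_names[i]) == len - 1 &&
                strncmp(p + 1, ctrl_names[i], len - 1) == 0)
                break;
        if (i == NR_CTRLS)
            return -1;

        if (p[0] == '+') {
            *enable |= 1u << i;
            *disable &= ~(1u << i);
        } else {
            *disable |= 1u << i;
            *enable &= ~(1u << i);
        }
        p += len;
    }
    return 0;
}

/*
 * Rules from the list above:
 *  - a controller may be enabled only if the parent makes it available
 *  - nothing may be enabled while the cgroup itself has tasks
 *  - a controller may be disabled only if no child has it enabled
 * Returns 1 if the request is allowed, 0 otherwise.
 */
static int subtree_control_allowed(unsigned enable, unsigned disable,
                                   unsigned parent_mask, int has_tasks,
                                   unsigned children_mask)
{
    if (enable & ~parent_mask)
        return 0;
    if (enable && has_tasks)
        return 0;
    if (disable & children_mask)
        return 0;
    return 1;
}
```

In this model, parent_mask stands for the parent's
"cgroup.subtree_control" (which is what "cgroup.controllers" shows in
the cgroup), and children_mask is the union of the children's
"cgroup.subtree_control" masks.
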
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.

v2: Non-root cgroups now also have "cgroup.controllers".

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2014-04-23 23:13:16 +0800
f817de985 cgroup: prepare migration path for unified hierarchy

Unified hierarchy implementation would require re-migrating tasks onto
the same cgroup on the default hierarchy to reflect updated effective
csses. Update cgroup_migrate_prepare_dst() so that it accepts NULL as
the destination cgrp. When NULL is specified, the destination is
considered to be the cgroup on the default hierarchy associated with
each css_set.

After this change, the identity check in cgroup_migrate_add_src()
isn't sufficient for noop detection as the associated csses may change
without any cgroup association changing. The only way to tell whether
a migration is noop or not is testing whether the source and
destination csets are identical. The noop check in
cgroup_migrate_add_src() is removed and cset identity test is added to
cgroup_migrate_prepare_dst(). If it's detected that source and
destination csets are identical, the cset is removed from
@preloaded_csets and all the migration nodes are cleared which makes
cgroup_migrate() ignore the cset.
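
Why the cgroup identity check stops being sufficient can be shown with
a toy model. The sketch below assumes css_sets are unique per
(default-hierarchy cgroup, css array) combination, so pointer equality
means "identical cset"; the fixed-size pool and the helper names
find_css_set() and migration_is_noop() are simplifications made for
this example, not the kernel code.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SUBSYS 2   /* illustrative; not the kernel's subsystem count */
#define MAX_CSETS 8   /* size of the toy pool */

struct css { int unused; };      /* stand-in for cgroup_subsys_state */
struct cgroup { int unused; };   /* stand-in for a cgroup */

/* Toy css_set: the cgroup on the default hierarchy plus one css per subsystem. */
struct css_set {
    struct cgroup *dfl_cgrp;
    struct css *subsys[NR_SUBSYS];
};

static struct css_set cset_pool[MAX_CSETS];
static size_t nr_csets;

/* Return the unique css_set for this combination, creating it if needed. */
static struct css_set *find_css_set(struct cgroup *dfl_cgrp,
                                    struct css *const subsys[NR_SUBSYS])
{
    size_t i;
    int s;

    for (i = 0; i < nr_csets; i++) {
        struct css_set *cset = &cset_pool[i];
        int same = (cset->dfl_cgrp == dfl_cgrp);

        for (s = 0; same && s < NR_SUBSYS; s++)
            same = (cset->subsys[s] == subsys[s]);
        if (same)
            return cset;
    }
    if (nr_csets == MAX_CSETS)
        return NULL;              /* toy pool exhausted */

    cset_pool[nr_csets].dfl_cgrp = dfl_cgrp;
    for (s = 0; s < NR_SUBSYS; s++)
        cset_pool[nr_csets].subsys[s] = subsys[s];
    return &cset_pool[nr_csets++];
}

/* A migration is a noop only when source and destination csets are identical. */
static int migration_is_noop(const struct css_set *src,
                             const struct css_set *dst)
{
    return src == dst;
}
```

If a controller is enabled so that the effective css of a cgroup
changes, the destination lookup yields a different css_set even though
the cgroup itself is the same, and the migration is no longer a noop.
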

Also, make the function append the destination css_sets to
@preloaded_list so that destination css_sets always come after source
css_sets.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2014-04-23 23:13:16 +0800
7fd8c565d cgroup: update subsystem rebind restrictions

Because the default root couldn't have any non-root csses attached to
it, rebinding away from it was always allowed; however, the default
hierarchy will soon host the unified hierarchy and have non-root csses
so the rebind restrictions need to be updated accordingly.

Instead of special casing rebinding from the default hierarchy and
then checking whether the source hierarchy has children cgroups, which
implies non-root csses for !dfl hierarchies, simply check whether the
source hierarchy has non-root csses for the subsystem using
css_next_child().
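
The updated check can be pictured with a small user-space model. The
struct layout below and the helper can_rebind_from() are hypothetical;
css_next_child() here is a toy analogue that only mimics the "NULL
position yields the first child" convention and is not the kernel
function.

```c
#include <assert.h>
#include <stddef.h>

/* Toy css with an intrusive child list; the layout is illustrative. */
struct css {
    struct css *first_child;
    struct css *next_sibling;
};

/* Toy analogue of css_next_child(): a NULL @pos yields the first child. */
static struct css *css_next_child(struct css *pos, struct css *parent)
{
    assert(parent);
    return pos ? pos->next_sibling : parent->first_child;
}

/* Rebinding away is allowed only if the root css has no child csses. */
static int can_rebind_from(struct css *root_css)
{
    return css_next_child(NULL, root_css) == NULL;
}
```

The point of the change is that the test looks at csses of the
subsystem being moved rather than at child cgroups, which no longer
imply csses on the default hierarchy.
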

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2014-04-23 23:13:16 +0800
6803c0062 cgroup: add css_set->dfl_cgrp

To implement the unified hierarchy behavior, we'll need to be able to
determine the associated cgroup on the default hierarchy from css_set.
Let's add css_set->dfl_cgrp so that it can be accessed conveniently
and efficiently.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2014-04-23 23:13:16 +0800
bd53d617b cgroup: allow cgroup creation and suppress automatic css creation in the unified hierarchy

Now that effective css handling has been added and iterators updated
accordingly, it's safe to allow cgroup creation in the default
hierarchy. Unblock cgroup creation in the default hierarchy.

As the default hierarchy will implement explicit enabling and
disabling of controllers on each cgroup, suppress automatic css
enabling on cgroup creation.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2014-04-23 23:13:16 +0800
e32978031 cgroup: cgroup->subsys[] should be cleared after the css is offlined

After a css finishes offlining, offline_css() mistakenly performs
RCU_INIT_POINTER(css->cgroup->subsys[ss->id], css) which just sets the
cgroup->subsys[] pointer to the current value. The intention was to
clear it after offline is complete, not reassign the same value.

Update it to assign NULL instead of the current value. This makes
cgroup_css() return NULL once offline is complete. All the
existing users of the function either can handle NULL return already
or guarantee that the css doesn't get offlined.
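
The effect of the fix can be illustrated with a minimal user-space
sketch. It uses plain pointer assignment in place of
RCU_INIT_POINTER(), and the struct layouts, the online flag and
online_css() are simplified assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SUBSYS 2   /* illustrative; not the kernel's subsystem count */

struct cgroup;

struct css {                       /* stand-in for cgroup_subsys_state */
    struct cgroup *cgroup;
    int ss_id;
    int online;
};

struct cgroup {
    struct css *subsys[NR_SUBSYS];
};

/* Look up the css of @cgrp for subsystem @ss_id; NULL if none is online. */
static struct css *cgroup_css(struct cgroup *cgrp, int ss_id)
{
    assert(cgrp && ss_id >= 0 && ss_id < NR_SUBSYS);
    return cgrp->subsys[ss_id];
}

/* Publish the css in its cgroup's slot. */
static void online_css(struct css *css)
{
    css->online = 1;
    css->cgroup->subsys[css->ss_id] = css;
}

static void offline_css(struct css *css)
{
    css->online = 0;
    /*
     * The bug stored @css back into the slot, which left it unchanged.
     * The fix stores NULL so that lookups stop returning the css.
     */
    css->cgroup->subsys[css->ss_id] = NULL;
}
```

After offline_css() the lookup returns NULL, which is the behavior the
callers are expected to handle.
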

While this is a bugfix, as css lifetime is currently tied to the
cgroup it belongs to, this bug doesn't cause any actual problems.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2014-04-23 23:13:15 +0800
3ebb2b6ef cgroup: teach css_task_iter about effective csses

Currently, css_task_iter iterates tasks associated with a css by
visiting each css_set associated with the owning cgroup and walking
tasks of each of them. This works fine for !unified hierarchies as
each cgroup has its own css for each associated subsystem on the
hierarchy; however, on the planned unified hierarchy, a cgroup may not
have csses associated and its tasks would be considered associated
with the matching css of the nearest ancestor which has the subsystem
enabled.
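
The "nearest ancestor which has the subsystem enabled" rule can be
sketched as a walk up the parent chain. This is a simplified model for
illustration only: the struct layouts and the helper name
effective_css() are assumptions, and a NULL slot is used here to mean
"controller not enabled on this cgroup".

```c
#include <assert.h>
#include <stddef.h>

#define NR_SUBSYS 2   /* illustrative; not the kernel's subsystem count */

struct css { int unused; };          /* stand-in for cgroup_subsys_state */

struct cgroup {
    struct cgroup *parent;           /* NULL for the root */
    struct css *subsys[NR_SUBSYS];   /* NULL where the controller is not enabled */
};

/*
 * Return the css that effectively governs @cgrp for subsystem @ss_id:
 * the cgroup's own css if it has one, else that of the nearest ancestor.
 * Returns NULL if no cgroup on the path to the root has one.
 */
static struct css *effective_css(struct cgroup *cgrp, int ss_id)
{
    assert(ss_id >= 0 && ss_id < NR_SUBSYS);
    for (; cgrp; cgrp = cgrp->parent)
        if (cgrp->subsys[ss_id])
            return cgrp->subsys[ss_id];
    return NULL;
}
```

With this model, tasks in a cgroup without its own css are accounted to
an ancestor's css, which is why iterating the tasks of a css has to
cover more than the owning cgroup.
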

This means that on the default unified hierarchy, just walking all
tasks associated with a cgroup isn't enough to walk all tasks which
are associated with the specified css. If any of its children doesn't
have the matching css enabled, task iteration should also include all
tasks from the subtree. We already added cgroup->e_csets[] to list
all css_sets effectively associated with a given css, so walking the
css_sets on that list instead achieves such iteration.

This patch updates css_task_iter iteration such that it walks css_sets
on cgroup->e_csets[] instead of cgroup->cset_links if iteration is
requested on a non-dummy css. Thanks to the previous iteration
update, this change can be achieved with the addition of
css_task_iter->ss and minimal updates to css_advance_task_iter() and
css_task_iter_start().

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2014-04-23 23:13:15 +0800