12 Aug, 2013

1 commit

  • commit 084457f284abf6789d90509ee11dae383842b23b upstream.

    cgroup_cfts_commit() uses dget() to keep the cgroup alive after
    cgroup_mutex is dropped, but dget() won't prevent cgroupfs from being
    unmounted. When the race happens, vfs will see some dentries with a
    non-zero refcount while umount is in progress.

    Keep running this:
    mount -t cgroup -o blkio xxx /cgroup
    umount /cgroup

    And this:
    modprobe cfq-iosched
    rmmod cfq-iosched

    After a while, the BUG() in shrink_dcache_for_umount_subtree() may
    be triggered:

    BUG: Dentry xxx{i=0,n=blkio.yyy} still in use (1) [umount of cgroup cgroup]

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Li Zefan
     

22 Jul, 2013

1 commit

  • commit 1c8158eeae0f37d0eee9f1fbe68080df6a408df2 upstream.

    commit 5db9a4d99b0157a513944e9a44d29c9cec2e91dc
    Author: Tejun Heo
    Date: Sat Jul 7 16:08:18 2012 -0700

    cgroup: fix cgroup hierarchy umount race

    This commit fixed a race caused by the dput() in css_dput_fn(), but
    the dput() in cgroup_event_remove() can also lead to the same BUG().

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Li Zefan
     

29 May, 2013

1 commit

  • With the new __DEVEL__sane_behavior mount option, if the root cgroup
    is already mounted without the xattr option, mounting a new cgroup
    hierarchy with xattr is rejected by design, which is fine. However,
    if the root cgroup is not mounted with __DEVEL__sane_behavior,
    mounting a new cgroup with the xattr option succeeds, but the EA
    function does not work as expected afterwards: setting attributes
    under either cgroup fails with ENOTSUPP. e.g.

    setfattr: /cgroup2/test: Operation not supported

    Instead of keeping silent in this case, it's better to emit a log
    entry at warning level. That helps users understand what is going on
    behind the scenes, and it is essentially an improvement that does not
    break backward compatibility.

    With this fix, the mount attempt above keeps working as usual, but
    the following line can be found in the system log:

    [ ...] cgroup: new mount options do not match the existing superblock

    tj: minor formatting / message updates.

    Signed-off-by: Jie Liu
    Reported-by: Alexey Kodanev
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org

    Jeff Liu
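
    A minimal sketch of the idea in the entry above: warn instead of
    staying silent when a new mount's options don't match the live cgroup
    superblock. The check location, condition and field names are
    assumptions for illustration; only the message text comes from the
    log line quoted above.

    /* sketch: somewhere in the mount path, after an existing root is found */
    if (opts.flags != root->flags || opts.subsys_mask != root->subsys_mask)
        pr_warning("cgroup: new mount options do not match the existing superblock\n");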
     

24 May, 2013

1 commit

  • When cgroup_next_descendant_pre() initiates a walk, it checks whether
    the subtree root has any children and, if not, returns NULL. Later
    code assumes that the subtree isn't empty. This is broken because
    the subtree may become empty in between, which can lead to the
    traversal escaping the subtree by walking to the sibling of the
    subtree root.

    There's no reason to have the early exit path. Remove it along with
    the later assumption that the subtree isn't empty. This simplifies
    the code a bit and fixes the subtle bug.

    While at it, fix the comment of cgroup_for_each_descendant_pre() which
    was incorrectly referring to ->css_offline() instead of
    ->css_online().

    Signed-off-by: Tejun Heo
    Reviewed-by: Michal Hocko
    Cc: stable@vger.kernel.org

    Tejun Heo
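
    A self-contained sketch of the pre-order "next descendant" step the
    entry above describes, with simplified, hypothetical types (the kernel
    walks struct cgroup via RCU-protected child/sibling lists). Note that
    no early exit for a childless subtree root is needed: the child lookup
    simply yields NULL and the sibling-climbing loop stops at @root, so the
    walk cannot escape to @root's siblings even if children vanish mid-walk.

    struct node {
        struct node *parent;
        struct node *first_child;
        struct node *next_sibling;
    };

    /* return the next node of a pre-order walk of @root's descendants */
    static struct node *next_descendant_pre(struct node *pos, struct node *root)
    {
        if (!pos)                     /* first iteration: start from @root */
            pos = root;

        if (pos->first_child)         /* visit the first child if there is one */
            return pos->first_child;

        while (pos != root) {         /* else the closest ancestor's next sibling */
            if (pos->next_sibling)
                return pos->next_sibling;
            pos = pos->parent;
        }
        return NULL;                  /* the walk of @root's subtree is complete */
    }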
     

14 May, 2013

1 commit

  • cgroup_create_file() calls d_instantiate(), which may decide to look
    at the xattrs on the file. Smack always does this and SELinux can be
    configured to do so.

    But cgroup_add_file() didn't initialize xattrs before calling
    cgroup_create_file(), which ultimately leads to dereferencing a NULL
    dentry->d_fsdata.

    This bug has been there since cgroup xattr was introduced.

    Cc: stable@vger.kernel.org # 3.8.x
    Reported-by: Ivan Bulatovic
    Reported-by: Casey Schaufler
    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

02 May, 2013

2 commits

  • Pull VFS updates from Al Viro:

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     
  • Signed-off-by: Al Viro

    Al Viro
     

30 Apr, 2013

4 commits

  • Pull scheduler changes from Ingo Molnar:
    "The main changes in this development cycle were:

    - full dynticks preparatory work by Frederic Weisbecker

    - factor out the cpu time accounting code better, by Li Zefan

    - multi-CPU load balancer cleanups and improvements by Joonsoo Kim

    - various smaller fixes and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
    sched: Fix init NOHZ_IDLE flag
    sched: Prevent to re-select dst-cpu in load_balance()
    sched: Rename load_balance_tmpmask to load_balance_mask
    sched: Move up affinity check to mitigate useless redoing overhead
    sched: Don't consider other cpus in our group in case of NEWLY_IDLE
    sched: Explicitly cpu_idle_type checking in rebalance_domains()
    sched: Change position of resched_cpu() in load_balance()
    sched: Fix wrong rq's runnable_avg update with rt tasks
    sched: Document task_struct::personality field
    sched/cpuacct/UML: Fix header file dependency bug on the UML build
    cgroup: Kill subsys.active flag
    sched/cpuacct: No need to check subsys active state
    sched/cpuacct: Initialize cpuacct subsystem earlier
    sched/cpuacct: Initialize root cpuacct earlier
    sched/cpuacct: Allocate per_cpu cpuusage for root cpuacct statically
    sched/cpuacct: Clean up cpuacct.h
    sched/cpuacct: Remove redundant NULL checks in cpuacct_acount_field()
    sched/cpuacct: Remove redundant NULL checks in cpuacct_charge()
    sched/cpuacct: Add cpuacct_acount_field()
    sched/cpuacct: Add cpuacct_init()
    ...

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:

    - Fixes and a lot of cleanups. Locking cleanup is finally complete.
    cgroup_mutex is no longer exposed to individual controllers, which
    used to cause nasty deadlock issues. Li fixed and cleaned up quite a
    bit, including long-standing ones like the racy cgroup_path().

    - device cgroup now supports proper hierarchy thanks to Aristeu.

    - perf_event cgroup now supports proper hierarchy.

    - A new mount option "__DEVEL__sane_behavior" is added. As indicated
    by the name, this option is to be used for development only at this
    point and generates a warning message when used. Unfortunately,
    the cgroup interface currently has too many breakages and inconsistencies
    to implement a consistent and unified hierarchy on top. The new flag
    is used to collect the behavior changes which are necessary to
    implement consistent unified hierarchy. It's likely that this flag
    won't be used verbatim when it becomes ready but will be enabled
    implicitly along with unified hierarchy.

    The option currently disables some of the broken behaviors in cgroup core
    and also .use_hierarchy switch in memcg (will be routed through -mm),
    which can be used to make very unusual hierarchy where nesting is
    partially honored. It will also be used to implement hierarchy
    support for blk-throttle which would be impossible otherwise without
    introducing a full separate set of control knobs.

    This is essentially versioning of interface which isn't very nice but
    at this point I can't see any other options which would allow keeping
    the interface the same while moving towards hierarchy behavior which
    is at least somewhat sane. The planned unified hierarchy is likely
    to require some level of adaptation from userland anyway, so I think
    it'd be best to take the chance and update the interface such that
    it's supportable in the long term.

    Maintaining the existing interface does complicate cgroup core but
    shouldn't put too much strain on individual controllers and I think
    it'd be manageable for the foreseeable future. Maybe we'll be able
    to drop it in a decade.

    Fix up conflicts (including a semantic one adding a new #include to ppc
    that was uncovered by the header file changes) as per Tejun.

    * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits)
    cpuset: fix compile warning when CONFIG_SMP=n
    cpuset: fix cpu hotplug vs rebuild_sched_domains() race
    cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
    cgroup: restore the call to eventfd->poll()
    cgroup: fix use-after-free when umounting cgroupfs
    cgroup: fix broken file xattrs
    devcg: remove parent_cgroup.
    memcg: force use_hierarchy if sane_behavior
    cgroup: remove cgrp->top_cgroup
    cgroup: introduce sane_behavior mount option
    move cgroupfs_root to include/linux/cgroup.h
    cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
    cgroup: make cgroup_path() not print double slashes
    Revert "cgroup: remove bind() method from cgroup_subsys."
    perf: make perf_event cgroup hierarchical
    cgroup: implement cgroup_is_descendant()
    cgroup: make sure parent won't be destroyed before its children
    cgroup: remove bind() method from cgroup_subsys.
    devcg: remove broken_hierarchy tag
    cgroup: remove cgroup_lock_is_held()
    ...

    Linus Torvalds
     
  • Pull workqueue updates from Tejun Heo:
    "A lot of activities on workqueue side this time. The changes achieve
    the followings.

    - WQ_UNBOUND workqueues - the workqueues which are not per-cpu - are
    updated to be able to interface with multiple backend worker pools.
    This involved a lot of churning but the end result seems actually
    neater as unbound workqueues are now a lot closer to per-cpu ones.

    - The ability to interface with multiple backend worker pools is
    used to implement unbound workqueues with custom attributes.
    Currently the supported attributes are the nice level and CPU
    affinity. It may be expanded to include cgroup association in
    future. The attributes can be specified either by calling
    apply_workqueue_attrs() or through /sys/bus/workqueue/WQ_NAME/* if
    the workqueue in question is exported through sysfs.

    The backend worker pools are keyed by the actual attributes and
    shared by any workqueues which share the same attributes. When
    attributes of a workqueue are changed, the workqueue binds to the
    worker pool with the specified attributes while leaving the work
    items which are already executing in its previous worker pools
    alone.

    This allows converting custom worker pool implementations which
    want worker attribute tuning to use workqueues. The writeback pool
    is already converted in the block tree and a couple of others are
    likely to follow, including the btrfs io workers.

    - WQ_UNBOUND's ability to bind to multiple worker pools is also used
    to make it NUMA-aware. Because there's no association between work
    item issuer and the specific worker assigned to execute it, before
    this change, using unbound workqueue led to unnecessary cross-node
    bouncing and it couldn't be helped by autonuma as it requires tasks
    to have implicit node affinity and workers are assigned randomly.

    After these changes, an unbound workqueue now binds to multiple
    NUMA-affine worker pools so that queued work items are executed in
    the same node. This is turned on by default but can be disabled
    system-wide or for individual workqueues.

    Crypto was requesting NUMA affinity as encrypting data across
    different nodes can contribute noticeable overhead; doing it
    per-cpu was too limiting for certain cases, and IO throughput could
    be bottlenecked by one CPU being fully occupied while others had
    idle cycles.

    While the new features required a lot of changes including
    restructuring locking, it didn't complicate the execution paths much.
    The unbound workqueue handling is now closer to per-cpu ones and the
    new features are implemented by simply associating a workqueue with
    different sets of backend worker pools without changing queue,
    execution or flush paths.

    As such, even though the amount of change is very high, I feel
    relatively safe in that it isn't likely to cause subtle issues with
    basic correctness of work item execution and handling. If something
    is wrong, it's likely to show up as being associated with worker pools
    with the wrong attributes or OOPS while workqueue attributes are being
    changed or during CPU hotplug.

    While this creates more backend worker pools, it doesn't add too many
    more workers unless, of course, there are many workqueues with unique
    combinations of attributes. Assuming everything else is the same,
    NUMA awareness costs an extra worker pool per NUMA node with online
    CPUs.

    There are also a couple things which are being routed outside the
    workqueue tree.

    - block tree pulled in workqueue for-3.10 so that writeback worker
    pool can be converted to unbound workqueue with sysfs control
    exposed. This simplifies the code, makes writeback workers
    NUMA-aware and allows tuning nice level and CPU affinity via sysfs.

    - The conversion to workqueue means that there's no longer a 1:1
    association between a specific worker and a given filesystem's
    writeback, which makes writeback folks unhappy as they want to be
    able to tell which filesystem caused a problem from a backtrace on
    systems with many filesystems mounted. This is
    resolved by allowing work items to set debug info string which is
    printed when the task is dumped. As this change involves unifying
    implementations of dump_stack() and friends in arch codes, it's
    being routed through Andrew's -mm tree."

    * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (84 commits)
    workqueue: use kmem_cache_free() instead of kfree()
    workqueue: avoid false negative WARN_ON() in destroy_workqueue()
    workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity
    workqueue: implement NUMA affinity for unbound workqueues
    workqueue: introduce put_pwq_unlocked()
    workqueue: introduce numa_pwq_tbl_install()
    workqueue: use NUMA-aware allocation for pool_workqueues
    workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()
    workqueue: map an unbound workqueues to multiple per-node pool_workqueues
    workqueue: move hot fields of workqueue_struct to the end
    workqueue: make workqueue->name[] fixed len
    workqueue: add workqueue->unbound_attrs
    workqueue: determine NUMA node of workers accourding to the allowed cpumask
    workqueue: drop 'H' from kworker names of unbound worker pools
    workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]
    workqueue: move pwq_pool_locking outside of get/put_unbound_pool()
    workqueue: fix memory leak in apply_workqueue_attrs()
    workqueue: fix unbound workqueue attrs hashing / comparison
    workqueue: fix race condition in unbound workqueue free path
    workqueue: remove pwq_lock which is no longer used
    ...

    Linus Torvalds
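
    A short sketch of the apply_workqueue_attrs() usage described above,
    assuming an unbound workqueue wq (created with WQ_UNBOUND) and an int
    ret in scope; the nice value and node are illustrative only.

    struct workqueue_attrs *attrs;

    attrs = alloc_workqueue_attrs(GFP_KERNEL);
    if (!attrs)
        return -ENOMEM;
    attrs->nice = -10;                                  /* worker nice level */
    cpumask_copy(attrs->cpumask, cpumask_of_node(0));   /* restrict workers to node 0 */
    ret = apply_workqueue_attrs(wq, attrs);             /* rebind wq to a matching pool */
    free_workqueue_attrs(attrs);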
     
  • Now that we have generic and well-ordered cgroup tree walkers, there
    is no need to keep css_get_next() around.

    Signed-off-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Ying Han
    Cc: Tejun Heo
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

27 Apr, 2013

2 commits

  • I mistakenly removed the call to eventfd->poll() while I was actually
    intending to remove the return value...

    Calling eventfd->poll() hooks cgroup_event_wake() to the poll
    waitqueue, which will be called to unregister the eventfd when a
    cgroup is removed or the eventfd is closed.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • Try:
    # mount -t cgroup xxx /cgroup
    # mkdir /cgroup/sub && rmdir /cgroup/sub && umount /cgroup

    And you might see this:

    ida_remove called for id=1 which is not allocated.

    It's because cgroup_kill_sb() is called to destroy root->cgroup_ida
    and free cgrp->root before ida_simple_remove() is called. What's
    worse, we're accessing cgrp->root after it has been freed.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

19 Apr, 2013

1 commit

  • We should store file xattrs in struct cfent instead of struct cftype,
    because cftype is a type while cfent is an object instance of that
    type.

    For example, each cgroup has a tasks file, and each tasks file is
    associated with a unique cfent, but all those files share the same
    struct cftype.

    Alexey Kodanev reported a crash, which can be reproduced:

    # mount -t cgroup -o xattr /sys/fs/cgroup
    # mkdir /sys/fs/cgroup/test
    # setfattr -n trusted.value -v test_value /sys/fs/cgroup/tasks
    # rmdir /sys/fs/cgroup/test
    # umount /sys/fs/cgroup
    oops!

    In this case, simple_xattrs_free() will free the same struct simple_xattrs
    twice.

    tj: Dropped unused local variable @cft from cgroup_diput().

    Cc: stable@vger.kernel.org # 3.8.x
    Reported-by: Alexey Kodanev
    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
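
    A rough sketch of the resulting layout, with most fields trimmed and
    the remaining ones assumed for illustration: per-type data stays in
    struct cftype, while the per-file xattrs move to struct cfent.

    struct cftype {                 /* one per file type, shared by all cgroups */
        char name[MAX_CFTYPE_NAME];
        /* read/write methods, mode, etc. */
    };

    struct cfent {                  /* one per file instance in a cgroup directory */
        struct list_head node;
        struct dentry *dentry;
        struct cftype *type;
        struct simple_xattrs xattrs;    /* moved here from struct cftype */
    };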
     

15 Apr, 2013

5 commits

  • It's not used, and it can be retrieved via cgrp->root->top_cgroup.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • It's a sad fact that at this point various cgroup controllers are
    carrying so many idiosyncrasies and pure insanities that it simply
    isn't possible to reach any sort of sane, consistent behavior while
    staying fully compatible with what already has been exposed to
    userland.

    As we can't break exposed userland interface, transitioning to sane
    behaviors can only be done in steps while maintaining backwards
    compatibility. This patch introduces a new mount option -
    __DEVEL__sane_behavior - which disables crazy features and enforces
    consistent behaviors in cgroup core proper and various controllers.
    As exactly which behaviors it changes are still being determined, the
    mount option, at this point, is useful only for development of the new
    behaviors. As such, the mount option is prefixed with __DEVEL__ and
    generates a warning message when used.

    Eventually, once we get to the point where all controllers' behaviors
    are consistent enough to implement a unified hierarchy, the __DEVEL__
    prefix will be dropped, and more importantly, the unified hierarchy
    will enforce sane_behavior by default. Maybe we'll be able to
    completely drop the crazy stuff after a while, maybe not, but we at
    least have a strategy to move on to saner behaviors.

    This patch introduces the mount option and changes the following
    behaviors in cgroup core.

    * Mount options "noprefix" and "clone_children" are disallowed. Also,
    cgroupfs file cgroup.clone_children is not created.

    * When mounting an existing superblock, mount options should match.
    This is currently pretty crazy. If one mounts a cgroup, creates a
    subdirectory, unmounts it and then mounts it again with different
    options, it looks like the new options are applied but they aren't.

    * Remount is disallowed.

    The behaviors changes are documented in the comment above
    CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
    controllers are converted and planned improvements progress.

    v2: Dropped unnecessary explicit file permission setting from the
    sane_behavior cftype entry as suggested by Li Zefan.

    Signed-off-by: Tejun Heo
    Acked-by: Serge E. Hallyn
    Acked-by: Li Zefan
    Cc: Michal Hocko
    Cc: Vivek Goyal

    Tejun Heo
     
  • While controllers shouldn't be accessing cgroupfs_root directly, it
    being hidden inside kernel/cgroup.c makes some things pretty silly.
    This makes routing hierarchy-wide settings which need to be visible
    to controllers cumbersome.

    We're gonna add another hierarchy-wide setting which needs to be
    accessed from controllers. Move cgroupfs_root and its flags to the
    header file so that we can access root settings with inline helpers.

    Signed-off-by: Tejun Heo
    Acked-by: Serge E. Hallyn
    Acked-by: Li Zefan

    Tejun Heo
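
    The kind of inline helper this enables, sketched with the
    sane_behavior flag that the following patch introduces (taken as an
    assumption here):

    static inline bool cgroup_sane_behavior(const struct cgroup *cgrp)
    {
        return cgrp->root->flags & CGRP_ROOT_SANE_BEHAVIOR;
    }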
     
  • There's no reason to be using bitops, which tend to be more
    cumbersome, to handle root flags. Convert them to masks. Also, as
    they'll be moved to include/linux/cgroup.h and it's generally a good
    idea, add CGRP_ prefix.

    Note that flags are assigned from (1 << 1). The first bit will be
    used by a flag which will be added soon.

    Signed-off-by: Tejun Heo
    Acked-by: Serge E. Hallyn
    Acked-by: Li Zefan

    Tejun Heo
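
    A sketch of the converted flags; the bit positions follow the note
    above, while the specific flag names are assumptions for illustration.

    enum {
        /* (1 << 0) is reserved for the flag added by the next patch */
        CGRP_ROOT_NOPREFIX = (1 << 1),  /* mounted subsystems have no named prefix */
        CGRP_ROOT_XATTR    = (1 << 2),  /* supports extended attributes */
    };

    /* flags are now tested with plain masks, e.g. (root->flags & CGRP_ROOT_XATTR),
     * instead of test_bit()/set_bit() on a bit number */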
     
  • While reimplementing cgroup_path(), 65dff759d2 ("cgroup: fix
    cgroup_path() vs rename() race") introduced a bug where the path of a
    non-root cgroup would have two slashes at the beginning, which is
    caused by treating the root cgroup, which has the name '/', like
    non-root cgroups.

    $ grep systemd /proc/self/cgroup
    1:name=systemd://user/root/1

    Fix it by special-casing the root cgroup and not looping over it in
    the normal path.

    Signed-off-by: Tejun Heo
    Cc: Li Zefan

    Tejun Heo
     

13 Apr, 2013

1 commit


11 Apr, 2013

3 commits

  • A couple of controllers want to determine whether two cgroups are in
    an ancestor/descendant relationship. As it's more likely that the
    descendant is the primary subject of interest and there are other
    operations focusing on the descendants, let's ask is_descendant
    rather than is_ancestor.

    Implementation is trivial as the previous patch guarantees that all
    ancestors of a cgroup stay accessible as long as the cgroup is
    accessible.

    tj: Removed depth optimization, renamed from cgroup_is_ancestor(),
    rewrote descriptions.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
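
    A self-contained sketch of the trivial ancestor walk the entry refers
    to, using a simplified, hypothetical node type (the kernel version
    operates on struct cgroup through its parent pointer). It is safe
    because, per the previous patch, all ancestors stay accessible as long
    as the cgroup itself is accessible.

    #include <stdbool.h>
    #include <stddef.h>

    struct cgrp { struct cgrp *parent; };

    /* true if @cgrp is @ancestor itself or lies anywhere below it */
    static bool cgrp_is_descendant(const struct cgrp *cgrp, const struct cgrp *ancestor)
    {
        while (cgrp) {
            if (cgrp == ancestor)
                return true;
            cgrp = cgrp->parent;
        }
        return false;
    }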
     
  • Suppose we rmdir a cgroup while there are still css refs; this cgroup
    won't be freed. Then we rmdir the parent cgroup, and the parent is
    freed immediately because its css refcount has dropped to 0. Now it
    would be a disaster if the still-alive child cgroup tries to access
    its parent.

    Make sure this won't happen.

    Signed-off-by: Li Zefan
    Reviewed-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • The bind() method of cgroup_subsys is not used in any of the
    controllers (cpuset, freezer, blkio, net_cls, memcg, net_prio,
    devices, perf, hugetlb, cpu and cpuacct).

    tj: Removed the entry on ->bind() from
    Documentation/cgroups/cgroups.txt. Also updated a couple
    paragraphs which were suggesting that dynamic re-binding may be
    implemented. It's not gonna.

    Signed-off-by: Rami Rosen
    Signed-off-by: Tejun Heo

    Rami Rosen
     

10 Apr, 2013

1 commit


08 Apr, 2013

5 commits

  • We don't want controllers to assume that the information is officially
    available and do funky things with it.

    The only user is task_subsys_state_check() which uses it to verify RCU
    access context. We can move cgroup_lock_is_held() inside
    CONFIG_PROVE_RCU but that doesn't add meaningful protection compared
    to conditionally exposing cgroup_mutex.

    Remove cgroup_lock_is_held(), export cgroup_mutex iff CONFIG_PROVE_RCU
    and use lockdep_is_held() directly on the mutex in
    task_subsys_state_check().

    While at it, add parentheses around macro arguments in
    task_subsys_state_check().

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
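
    A sketch of the resulting pattern; the exact expression inside
    task_subsys_state_check() differs, so treat the details as assumptions.

    #ifdef CONFIG_PROVE_RCU
    extern struct mutex cgroup_mutex;   /* visible only for lockdep annotations */
    #endif

    /* RCU-protected accesses are then checked directly against the mutex, e.g.:
     *   rcu_dereference_check(task->cgroups, lockdep_is_held(&cgroup_mutex));
     */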
     
  • Now that the locking interface is unexported, there's no reason to
    keep around these thin wrappers. Kill them and use mutex operations
    directly.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Now that all external cgroup_lock() users are gone, we can finally
    unexport the locking interface and prevent future abuse of
    cgroup_mutex.

    Make cgroup_[un]lock() and cgroup_lock_live_group() static. Also,
    cgroup_attach_task() doesn't have any users left and can't be used
    without the locking interface anyway. Make it static too.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • cgroup_lock_live_group() and cgroup_attach_task() are scheduled to be
    made static. Relocate the former and cgroup_attach_task_all() so that
    we don't need forward declarations.

    This patch is pure relocation.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • When a cpuset becomes empty (no CPU or memory), its tasks are
    transferred to the nearest ancestor with execution resources. This
    is implemented using cgroup_scan_tasks() with a callback which grabs
    cgroup_mutex and invokes cgroup_attach_task() on each task.

    Both cgroup_mutex and cgroup_attach_task() are scheduled to be
    unexported. Implement cgroup_transfer_tasks() in cgroup proper which
    is essentially the same as move_member_tasks_to_cpuset() except that
    it takes cgroups instead of cpusets and @to comes before @from like
    normal functions with those arguments, and replace
    move_member_tasks_to_cpuset() with it.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

04 Apr, 2013

1 commit


20 Mar, 2013

3 commits

  • These two functions share most of the code.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • The 3rd parameter of flex_array_prealloc() is the number of elements,
    not the index of the last element.

    The effect of the bug: when opening cgroup.procs, a flex array is
    allocated and all of its elements are preallocated with GFP_KERNEL,
    except the last one, which is later allocated with GFP_ATOMIC; if
    that allocation fails, it triggers a BUG_ON().

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org

    Li Zefan
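
    A sketch of the corrected call, with variable names and the error
    label assumed for illustration; the point is that
    flex_array_prealloc() takes an element count, so passing
    group_size - 1 leaves the last slot unallocated.

    group = flex_array_alloc(sizeof(struct task_struct *), group_size, GFP_KERNEL);
    if (!group)
        return -ENOMEM;
    /* buggy: flex_array_prealloc(group, 0, group_size - 1, GFP_KERNEL); */
    retval = flex_array_prealloc(group, 0, group_size, GFP_KERNEL);
    if (retval)
        goto out_free_group_list;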
     
  • PF_THREAD_BOUND was originally used to mark kernel threads which were
    bound to a specific CPU using kthread_bind(), and a task with the
    flag set allows cpus_allowed modifications only from itself.
    Workqueue is currently abusing it to prevent userland from meddling
    with the cpus_allowed of workqueue workers.

    What we need is a flag to prevent userland from messing with the
    cpus_allowed of certain kernel tasks. In the kernel, anyone can
    (incorrectly) squash the flag, and, for worker-type usages,
    restricting cpus_allowed modification to the task itself doesn't
    provide meaningful extra protection as other tasks can inject work
    items to the task anyway.

    This patch replaces PF_THREAD_BOUND with PF_NO_SETAFFINITY.
    sched_setaffinity() checks the flag and returns -EINVAL if it is set.
    set_cpus_allowed_ptr() is no longer affected by the flag.

    This will allow simplifying workqueue worker CPU affinity management.

    Signed-off-by: Tejun Heo
    Acked-by: Ingo Molnar
    Reviewed-by: Lai Jiangshan
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner

    Tejun Heo
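
    A sketch of the new check in sched_setaffinity(); the surrounding
    error-handling variable and label are assumed for illustration.

    if (p->flags & PF_NO_SETAFFINITY) {
        retval = -EINVAL;
        goto out_put_task;
    }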
     

13 Mar, 2013

5 commits


06 Mar, 2013

1 commit

  • subsys[i] is set to NULL in cgroup_unload_subsys() at modular unload,
    and that's protected by cgroup_mutex, and then the memory *subsys[i]
    resides in will be freed.

    So this is unsafe without any locking:

    if (!ss || ss->module)
    ...

    v2:
    - add a comment for enum cgroup_subsys_id
    - simplify the comment in cgroup_exit()

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

05 Mar, 2013

1 commit

  • We no longer fail rmdir() when there're still css refs, so we don't
    need to check css refs in check_for_release().

    This also avoids a bug: cgroup_has_css_refs() accesses subsys[i]
    without cgroup_mutex, so it can race with cgroup_unload_subsys().

    cgroup_has_css_refs()
    ...
    if (ss == NULL || ss->root != cgrp->root)

    If ss points to net_cls_subsys, and the cls_cgroup module is unloaded
    right after the former check but before the latter, the memory that
    net_cls_subsys resides in has become invalid.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan