21 Feb, 2013

1 commit

  • Pull cpuset changes from Tejun Heo:

    - Synchronization has seen a lot of changes, with a focus on
    decoupling cpuset synchronization from cgroup internal locking.

    After this change, there only remain a couple of mostly trivial
    dependencies on cgroup_lock outside cgroup core proper. cgroup_lock
    is scheduled to be unexported in this devel cycle.

    This will finally remove the fragile locking order around cgroup
    (cgroup locking should be one of the outermost locks, yet it has been
    acquired from deep inside individual controllers).

    - At this point, Li is the most knowledgeable about cpuset and is
    taking over its maintainership.

    * 'for-3.9-cpuset' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cpuset: drop spurious retval assignment in proc_cpuset_show()
    cpuset: fix RCU lockdep splat
    cpuset: update MAINTAINERS
    cpuset: remove cpuset->parent
    cpuset: replace cpuset->stack_list with cpuset_for_each_descendant_pre()
    cpuset: replace cgroup_mutex locking with cpuset internal locking
    cpuset: schedule hotplug propagation from cpuset_attach() if the cpuset is empty
    cpuset: pin down cpus and mems while a task is being attached
    cpuset: make CPU / memory hotplug propagation asynchronous
    cpuset: drop async_rebuild_sched_domains()
    cpuset: don't nest cgroup_mutex inside get_online_cpus()
    cpuset: reorganize CPU / memory hotplug handling
    cpuset: cleanup cpuset[_can]_attach()
    cpuset: introduce cpuset_for_each_child()
    cpuset: introduce CS_ONLINE
    cpuset: introduce ->css_on/offline()
    cpuset: remove fast exit path from remove_tasks_in_empty_cpuset()
    cpuset: remove unused cpuset_unlock()

    Linus Torvalds
     

19 Feb, 2013

1 commit

  • rename() will change dentry->d_name. The result of this race can
    be worse than seeing a partially rewritten name; we might access
    a stale pointer, because rename() will re-allocate memory to hold
    a longer name.

    It's safe under the protection of dentry->d_lock.

    v2: check NULL dentry before acquiring dentry lock.
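
    As a rough sketch of the pattern the fix relies on (the helper name
    here is hypothetical, not taken from the patch):

      /* Copy a cpuset dentry's name safely against a concurrent rename(). */
      static void cpuset_copy_name(struct dentry *dentry, char *buf, size_t len)
      {
              if (!dentry) {                  /* v2: check NULL first */
                      strlcpy(buf, "/", len);
                      return;
              }
              spin_lock(&dentry->d_lock);     /* rename() also takes d_lock */
              strlcpy(buf, (const char *)dentry->d_name.name, len);
              spin_unlock(&dentry->d_lock);
      }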

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org

    Li Zefan
     

16 Jan, 2013

2 commits

  • proc_cpuset_show() has a spurious -EINVAL assignment which does
    nothing. Remove it.

    This patch doesn't make any functional difference.

    tj: Rewrote patch description.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • 5d21cc2db040d01f8c19b8602f6987813e1176b4 ("cpuset: replace
    cgroup_mutex locking with cpuset internal locking") incorrectly
    converted proc_cpuset_show() from cgroup_lock() to cpuset_mutex.
    proc_cpuset_show() is accessing cgroup hierarchy proper to determine
    cgroup path which can't be protected by cpuset_mutex. This triggered
    the following RCU warning.

    ===============================
    [ INFO: suspicious RCU usage. ]
    3.8.0-rc3-next-20130114-sasha-00016-ga107525-dirty #262 Tainted: G W
    -------------------------------
    include/linux/cgroup.h:534 suspicious rcu_dereference_check() usage!

    other info that might help us debug this:

    rcu_scheduler_active = 1, debug_locks = 1
    2 locks held by trinity/7514:
    #0: (&p->lock){+.+.+.}, at: [] seq_read+0x3a/0x3e0
    #1: (cpuset_mutex){+.+...}, at: [] proc_cpuset_show+0x84/0x190

    stack backtrace:
    Pid: 7514, comm: trinity Tainted: G W
    3.8.0-rc3-next-20130114-sasha-00016-ga107525-dirty #262
    Call Trace:
    [] lockdep_rcu_suspicious+0x10b/0x120
    [] proc_cpuset_show+0x111/0x190
    [] seq_read+0x1b7/0x3e0
    [] ? seq_lseek+0x110/0x110
    [] do_loop_readv_writev+0x4b/0x90
    [] do_readv_writev+0xf6/0x1d0
    [] vfs_readv+0x3e/0x60
    [] sys_readv+0x50/0xd0
    [] tracesys+0xe1/0xe6

    The operation can be performed under RCU read lock. Replace
    cpuset_mutex locking with RCU read locking.
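
    As a sketch, the resulting locking shape (assuming the 3.8-era
    helpers task_subsys_state() and cgroup_path()):

      rcu_read_lock();
      css = task_subsys_state(tsk, cpuset_subsys_id);
      retval = cgroup_path(css->cgroup, buf, PAGE_SIZE);
      rcu_read_unlock();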

    tj: Rewrote patch description.

    Reported-by: Sasha Levin
    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

08 Jan, 2013

15 commits

  • cgroup already tracks the hierarchy. Follow cgroup->parent to find
    the parent and drop cpuset->parent.
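
    A minimal sketch of what the replacement helper could look like
    (name and details assumed):

      /* derive the parent cpuset from the cgroup hierarchy instead */
      static struct cpuset *parent_cs(const struct cpuset *cs)
      {
              struct cgroup *pcgrp = cs->css.cgroup->parent;

              return pcgrp ? cgroup_cs(pcgrp) : NULL;
      }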

    Signed-off-by: Tejun Heo
    Reviewed-by: Michal Hocko
    Acked-by: Li Zefan

    Tejun Heo
     
  • Implement cpuset_for_each_descendant_pre() and replace the
    cpuset-specific tree walking using cpuset->stack_list with it.
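
    A sketch of what such a wrapper looks like when layered on the
    generic cgroup iterator (macro body assumed):

      #define cpuset_for_each_descendant_pre(des_cs, pos_cgrp, root_cs)        \
              cgroup_for_each_descendant_pre((pos_cgrp), (root_cs)->css.cgroup) \
                      if (is_cpuset_online(((des_cs) = cgroup_cs((pos_cgrp)))))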

    Signed-off-by: Tejun Heo
    Reviewed-by: Michal Hocko
    Acked-by: Li Zefan

    Tejun Heo
     
  • Supposedly for historical reasons, cpuset depends on cgroup core for
    locking. It depends on cgroup_mutex in cgroup callbacks and grabs
    cgroup_mutex from other places where it wants to be synchronized.
    This is majorly messy and highly prone to introducing circular locking
    dependency especially because cgroup_mutex is supposed to be one of
    the outermost locks.

    As previous patches already plugged possible races which may happen by
    decoupling from cgroup_mutex, replacing cgroup_mutex with the
    cpuset-specific cpuset_mutex is mostly straightforward. Introduce
    cpuset_mutex, replace all occurrences of cgroup_mutex with it, and add
    cpuset_mutex locking to places which inherited cgroup_mutex from
    cgroup core.

    The only complication is from cpuset wanting to initiate task
    migration when a cpuset loses all cpus or memory nodes. Task
    migration may go through full cgroup and all subsystem locking and
    should be initiated without holding any cpuset specific lock; however,
    a previous patch already made hotplug handling asynchronous, and
    moving the task migration part outside other locks is easy.
    cpuset_propagate_hotplug_workfn() now invokes
    remove_tasks_in_empty_cpuset() without holding any lock.
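
    The shape of the conversion, sketched on one callback (details
    elided):

      static DEFINE_MUTEX(cpuset_mutex);      /* cpuset-internal lock */

      static void cpuset_css_offline(struct cgroup *cgrp)
      {
              mutex_lock(&cpuset_mutex);      /* was: cgroup_lock() */
              /* ... cpuset-internal teardown ... */
              mutex_unlock(&cpuset_mutex);    /* was: cgroup_unlock() */
      }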

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • cpuset is scheduled to be decoupled from cgroup_lock which will make
    hotplug handling race with task migration. cpus or mems will be
    allowed to go offline between ->can_attach() and ->attach(). If
    hotplug takes down all cpus or mems of a cpuset while attach is in
    progress, ->attach() may end up putting tasks into an empty cpuset.

    This patch makes ->attach() schedule hotplug propagation if the
    cpuset is empty after attaching is complete. This will move the tasks
    to the nearest ancestor which can execute them, and the end result
    would be as if hotplug handling happened after the tasks finished
    attaching.

    cpuset_write_resmask() now also flushes cpuset_propagate_hotplug_wq to
    wait for propagations scheduled directly by cpuset_attach().

    This currently doesn't make any functional difference as everything is
    protected by cgroup_mutex but enables decoupling the locking.
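
    A sketch of the check at the end of ->attach() (the exact condition
    is assumed from the description above):

      /* at the end of cpuset_attach() */
      if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
              schedule_cpuset_propagate_hotplug(cs);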

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • cpuset is scheduled to be decoupled from cgroup_lock which will make
    configuration updates race with task migration. Any config update
    will be allowed to happen between ->can_attach() and ->attach(). If
    such config update removes either all cpus or mems, by the time
    ->attach() is called, the condition verified by ->can_attach(), that
    the cpuset is capable of hosting the tasks, is no longer true.

    This patch adds cpuset->attach_in_progress which is incremented from
    ->can_attach() and decremented when the attach operation finishes
    either successfully or not. validate_change() treats cpusets w/
    non-zero ->attach_in_progress like cpusets w/ tasks and refuses to
    remove all cpus or mems from it.

    This currently doesn't make any functional difference as everything is
    protected by cgroup_mutex but enables decoupling the locking.
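
    A sketch of both halves (variable names assumed):

      /* cpuset_can_attach(): record the pending attach */
      cs->attach_in_progress++;

      /* validate_change(): in-progress attaches count like tasks */
      if ((cgroup_task_count(cur->css.cgroup) || cur->attach_in_progress) &&
          (cpumask_empty(trial->cpus_allowed) ||
           nodes_empty(trial->mems_allowed)))
              return -ENOSPC;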

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • cpuset_hotplug_workfn() has been invoking cpuset_propagate_hotplug()
    directly to propagate hotplug updates to !root cpusets; however, this
    has the following problems.

    * cpuset locking is scheduled to be decoupled from cgroup_mutex,
    cgroup_mutex will be unexported, and cgroup_attach_task() will do
    cgroup locking internally, so propagation can't synchronously move
    tasks to a parent cgroup while walking the hierarchy.

    * We can't use the cgroup generic tree iterator because propagation to
    each cpuset may sleep. With propagation done asynchronously, we can
    drop the rather ugly cpuset-specific iteration.

    Convert cpuset_propagate_hotplug() to
    cpuset_propagate_hotplug_workfn() and execute it from newly added
    cpuset->hotplug_work. The work items are run on an ordered workqueue,
    so the propagation order is preserved. cpuset_hotplug_workfn()
    schedules all propagations while holding cgroup_mutex and waits for
    completion without cgroup_mutex. Each in-flight propagation holds a
    reference to the cpuset->css.

    This patch doesn't cause any functional difference.
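
    A sketch of the scheduling side, holding the css reference across
    the async work (helper name from the changelog):

      static void schedule_cpuset_propagate_hotplug(struct cpuset *cs)
      {
              /* pin the css for the duration of the work item */
              if (!css_tryget(&cs->css))
                      return;                 /* cpuset is going away */

              if (!queue_work(cpuset_propagate_hotplug_wq, &cs->hotplug_work))
                      css_put(&cs->css);      /* already queued */
      }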

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • In general, we want to make cgroup_mutex one of the outermost locks
    and be able to use get_online_cpus() and friends from cgroup methods.
    With cpuset hotplug made async, get_online_cpus() can now be nested
    inside cgroup_mutex.

    Currently, cpuset avoids nesting get_online_cpus() inside cgroup_mutex
    by bouncing sched_domain rebuilding to a work item. As such nesting
    is allowed now, remove the workqueue bouncing code and always rebuild
    sched_domains synchronously. This also nests sched_domains_mutex
    inside cgroup_mutex, which is intended and should be okay.
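
    Before/after, as a sketch:

      /* before: bounce to a work item to avoid get_online_cpus() nesting */
      schedule_work(&rebuild_sched_domains_work);

      /* after: the nesting is allowed, rebuild synchronously */
      rebuild_sched_domains_locked();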

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • CPU / memory hotplug path currently grabs cgroup_mutex from hotplug
    event notifications. We want to separate cpuset locking from cgroup
    core and make cgroup_mutex outer to hotplug synchronization so that,
    among other things, mechanisms which depend on get_online_cpus() can
    be used from cgroup callbacks. In general, we want to keep
    cgroup_mutex the outermost lock to minimize locking interactions among
    different controllers.

    Convert cpuset_handle_hotplug() to cpuset_hotplug_workfn() and
    schedule it from the hotplug notifications. As the function can
    already handle multiple mixed events without any input, converting it
    to a work function is mostly trivial; however, one complication is
    that cpuset_update_active_cpus() needs to update sched domains
    synchronously to reflect an offlined cpu to avoid confusing the
    scheduler. This is worked around by falling back to the default
    single sched domain synchronously before scheduling the actual hotplug
    work. This makes the sched domains get rebuilt twice per CPU hotplug
    event, but the operation isn't that heavy, and much of the second
    rebuild would be a noop for systems w/ a single sched domain, which is
    the common case.

    This decouples cpuset hotplug handling from the notification callbacks
    and there can be an arbitrary delay between the actual event and
    updates to cpusets. Scheduler and mm can handle it fine, but moving
    tasks out of an empty cpuset may race against writes to the cpuset
    restoring execution resources, which can lead to confusing behavior.
    Flush the hotplug work item from cpuset_write_resmask() to avoid such
    confusion.

    v2: Synchronous sched domain rebuilding using the fallback sched
    domain added. This fixes various issues caused by confused
    scheduler putting tasks on a dead CPU, including the one reported
    by Li Zefan.
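
    A sketch of the resulting notification path (assuming the era's
    cpuset_update_active_cpus() signature):

      void cpuset_update_active_cpus(bool cpu_online)
      {
              /*
               * Reflect the change synchronously with the fallback
               * single sched domain so the scheduler never sees a dead
               * CPU, then let the work item do the full update.
               */
              partition_sched_domains(1, NULL, NULL);
              schedule_work(&cpuset_hotplug_work);
      }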

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Reorganize hotplug path to prepare for async hotplug handling.

    * Both CPU and memory hotplug handling is collected into a single
    function - cpuset_handle_hotplug(). It doesn't take any argument
    but compares the current settings of top_cpuset against what's
    actually available to determine what happened. This function
    directly updates top_cpuset. If there are CPUs or memory nodes
    which are taken down, cpuset_propagate_hotplug() is invoked on all
    !root cpusets.

    * cpuset_propagate_hotplug() is responsible for updating the specified
    cpuset so that it doesn't include any resource which isn't available
    to top_cpuset. If no CPU or memory is left after update, all tasks
    are moved to the nearest ancestor with both resources.

    * update_tasks_cpumask() and update_tasks_nodemask() are now always
    called after cpus or mems masks are updated even if the cpuset
    doesn't have any task. This is for brevity and not expected to have
    any measurable effect.

    * cpu_active_mask and N_HIGH_MEMORY are read exactly once per
    cpuset_handle_hotplug() invocation, all cpusets share the same view
    of what resources are available, and cpuset_handle_hotplug() can
    handle multiple resources going up and down. These properties will
    allow async operation.

    The reorganization, while drastic, is equivalent and shouldn't cause
    any behavior difference. This will enable making hotplug handling
    async and remove get_online_cpus() -> cgroup_mutex nesting.
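
    A sketch of the diff-against-reality approach (variable names
    assumed):

      /* in cpuset_handle_hotplug(); statics are fine under cgroup_mutex */
      static cpumask_t new_cpus, off_cpus;
      static nodemask_t new_mems, off_mems;

      cpumask_copy(&new_cpus, cpu_active_mask);       /* read exactly once */
      new_mems = node_states[N_HIGH_MEMORY];          /* read exactly once */

      /* whatever top_cpuset has beyond reality was just taken down */
      cpumask_andnot(&off_cpus, top_cpuset.cpus_allowed, &new_cpus);
      nodes_andnot(off_mems, top_cpuset.mems_allowed, new_mems);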

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • cpuset_can_attach() prepares the global variables cpus_attach and
    cpuset_attach_nodemask_{to|from}, which are used by cpuset_attach().
    There is no reason to do the preparation in cpuset_can_attach(); the
    same information can be accessed from cpuset_attach().

    Move the preparation logic from cpuset_can_attach() to cpuset_attach()
    and make the global variables static ones inside cpuset_attach().

    With this change, there's no reason to keep
    cpuset_attach_nodemask_{from|to} global. Move them inside
    cpuset_attach(). Unfortunately, we need to keep cpus_attach global as
    it can't be allocated from cpuset_attach().

    v2: cpus_attach not converted to cpumask_t as per Li Zefan and Rusty
    Russell.
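
    A sketch of where the variables end up:

      /* can't be allocated from cpuset_attach(), so this one stays global */
      static cpumask_var_t cpus_attach;

      static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
      {
              /* static: too large for the stack; serialized by cgroup_mutex */
              static nodemask_t cpuset_attach_nodemask_from;
              static nodemask_t cpuset_attach_nodemask_to;

              /* ... prepare both here instead of in cpuset_can_attach() ... */
      }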

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Rusty Russell

    Tejun Heo
     
  • Instead of iterating cgroup->children directly, introduce and use
    cpuset_for_each_child(), which wraps cgroup_for_each_child() and
    performs an online check. As it uses the generic iterator, it
    requires RCU read locking too.

    As cpuset is currently protected by cgroup_mutex, non-online cpusets
    aren't visible to all the iterations and this patch currently doesn't
    make any functional difference. This will be used to de-couple cpuset
    locking from cgroup core.
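
    A usage sketch; note the RCU read lock the generic iterator
    requires:

      struct cpuset *child;
      struct cgroup *pos_cgrp;

      rcu_read_lock();
      cpuset_for_each_child(child, pos_cgrp, cs) {
              /* only online (CS_ONLINE) children are visited */
      }
      rcu_read_unlock();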

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Add CS_ONLINE which is set from css_online() and cleared from
    css_offline(). This will enable using generic cgroup iterator while
    allowing decoupling cpuset from cgroup internal locking.
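
    The flag and its test, sketched:

      /* CS_ONLINE: set from css_online(), cleared from css_offline() */
      static inline bool is_cpuset_online(const struct cpuset *cs)
      {
              return test_bit(CS_ONLINE, &cs->flags);
      }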

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Add cpuset_css_on/offline() and rearrange css init/exit such that,

    * Allocation and clearing to the default values happen in css_alloc().
    Allocation now uses kzalloc().

    * Config inheritance and registration happen in css_online().

    * css_offline() undoes what css_online() did.

    * css_free() frees.

    This doesn't introduce any visible behavior changes. This will help
    cleaning up locking.
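
    A skeleton of the resulting split (bodies elided, shapes assumed):

      static struct cgroup_subsys_state *cpuset_css_alloc(struct cgroup *cgrp)
      {
              struct cpuset *cs = kzalloc(sizeof(*cs), GFP_KERNEL);

              return cs ? &cs->css : ERR_PTR(-ENOMEM);
      }

      static int cpuset_css_online(struct cgroup *cgrp)
      {
              /* set CS_ONLINE, inherit config from the parent */
              return 0;
      }

      static void cpuset_css_offline(struct cgroup *cgrp)
      {
              /* clear CS_ONLINE, undo what css_online() did */
      }

      static void cpuset_css_free(struct cgroup *cgrp)
      {
              kfree(cgroup_cs(cgrp));
      }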

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • The function isn't that hot, the overhead of missing the fast exit is
    low, the test itself depends heavily on cgroup internals, and it's
    gonna be a hindrance when trying to decouple cpuset locking from
    cgroup core. Remove the fast exit path.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

13 Dec, 2012

1 commit

  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we
    should use N_MEMORY instead.
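
    The conversion, sketched:

      /* was: for_each_node_state(node, N_HIGH_MEMORY) */
      for_each_node_state(node, N_MEMORY) {
              /* visits every node that has any memory at all */
      }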

    Signed-off-by: Lai Jiangshan
    Acked-by: Hillf Danton
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     

20 Nov, 2012

2 commits

  • Currently CGRP_CPUSET_CLONE_CHILDREN triggers ->post_clone(). Now
    that clone_children is cpuset specific, there's no reason to have this
    rather odd option activation mechanism in cgroup core. cpuset can
    check the flag from its ->css_alloc() and take the necessary
    action.

    Move cpuset_post_clone() logic to the end of cpuset_css_alloc() and
    remove cgroup_subsys->post_clone().

    Loosely based on Glauber's "generalize post_clone into post_create"
    patch.

    Signed-off-by: Tejun Heo
    Original-patch-by: Glauber Costa
    Original-patch:
    Acked-by: Serge E. Hallyn
    Acked-by: Li Zefan
    Cc: Glauber Costa

    Tejun Heo
     
  • Rename cgroup_subsys css lifetime related callbacks to better describe
    what their roles are. Also, update documentation.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

24 Jul, 2012

4 commits

  • cpuset_track_online_cpus() is no longer present. So remove the
    outdated comment and replace it with a reference to
    cpuset_update_active_cpus(), which is its equivalent.

    Also, we no longer lack memory hot-unplug, and David Rientjes pointed
    out how it is dealt with. So update that comment as well.

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20120524141700.3692.98192.stgit@srivatsabhat.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srivatsa S. Bhat
     
  • Separate out the cpuset related handling for CPU/Memory online/offline.
    This also helps us exploit the most obvious and basic level of optimization
    that any notification mechanism (CPU/Mem online/offline) has to offer us:
    "We *know* why we have been invoked. So stop pretending that we are lost,
    and do only the necessary amount of processing!".

    And while at it, rename scan_for_empty_cpusets() to
    scan_cpusets_upon_hotplug(), which is more appropriate considering how
    it is restructured.
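
    A sketch of the resulting interface, letting callers state why the
    scan was invoked:

      enum hotplug_event {
              CPUSET_CPU_OFFLINE,
              CPUSET_MEM_OFFLINE,
      };

      /* was scan_for_empty_cpusets(); now does only event-relevant work */
      static void scan_cpusets_upon_hotplug(struct cpuset *root,
                                            enum hotplug_event event);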

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20120524141650.3692.48637.stgit@srivatsabhat.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srivatsa S. Bhat
     
  • At present, the functions that deal with cpusets during CPU/Mem hotplug
    are quite messy, since a lot of the functionality is mixed up without clear
    separation. And this takes a toll on optimization as well. For example,
    the function cpuset_update_active_cpus() is called on both CPU offline and CPU
    online events; and it invokes scan_for_empty_cpusets(), which makes sense
    only for CPU offline events. And hence, the current code ends up unnecessarily
    traversing the cpuset tree during CPU online also.

    As a first step towards cleaning up those functions, encapsulate the cpuset
    tree traversal in a helper function, so as to facilitate upcoming changes.

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20120524141635.3692.893.stgit@srivatsabhat.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srivatsa S. Bhat
     
  • In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
    masks as and when necessary to ensure that the tasks belonging to the cpusets
    have some place (online CPUs) to run on. And regular CPU hotplug is
    destructive in the sense that the kernel doesn't remember the original cpuset
    configurations set by the user, across hotplug operations.

    However, suspend/resume (which uses CPU hotplug) is a special case in which
    the kernel has the responsibility to restore the system (during resume), to
    exactly the same state it was in before suspend.

    In order to achieve that, do the following:

    1. Don't modify cpusets during suspend/resume. At all.
    In particular, don't move the tasks from one cpuset to another, and
    don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
    during the CPU hotplug operations that are carried out in the
    suspend/resume path.

    2. However, cpusets and sched domains are related. We just want to avoid
    altering cpusets alone. So, to keep the sched domains updated, build
    a single sched domain (containing all active cpus) during each of the
    CPU hotplug operations carried out in s/r path, effectively ignoring
    the cpusets' cpus_allowed masks.

    (Since userspace is frozen while doing all this, it will go unnoticed.)

    3. During the last CPU online operation during resume, build the sched
    domains by looking up the (unaltered) cpusets' cpus_allowed masks.
    That will bring back the system to the same original state as it was in
    before suspend.

    Ultimately, this will not only solve the cpuset problem related to
    suspend/resume (i.e., restore the cpusets to exactly what they were
    before suspend, by not touching them at all) but also speed up
    suspend/resume, because we avoid running the cpuset update code for
    every CPU being offlined/onlined.
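
    A sketch of the special-casing in the sched CPU notifier (the
    _FROZEN variants identify the suspend/resume path; the counter name
    is assumed):

      static int cpuset_cpu_active(struct notifier_block *nfb,
                                   unsigned long action, void *hcpu)
      {
              switch (action) {
              case CPU_ONLINE_FROZEN:                 /* resume path */
                      num_cpus_frozen--;
                      if (likely(num_cpus_frozen)) {
                              /* not the last online: keep single domain */
                              partition_sched_domains(1, NULL, NULL);
                              break;
                      }
                      /* fall through: last online of resume */
              case CPU_ONLINE:
                      cpuset_update_active_cpus();    /* rebuild from cpusets */
                      break;
              default:
                      return NOTIFY_DONE;
              }
              return NOTIFY_OK;
      }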

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20120524141611.3692.20155.stgit@srivatsabhat.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srivatsa S. Bhat
     

23 May, 2012

1 commit

  • Pull cgroup updates from Tejun Heo:
    "cgroup file type addition / removal is updated so that file types are
    added and removed instead of individual files so that dynamic file
    type addition / removal can be implemented by cgroup and used by
    controllers. blkio controller changes which will come through block
    tree are dependent on this. Other changes include res_counter cleanup
    and disallowing kthread / PF_THREAD_BOUND threads to be attached to
    non-root cgroups.

    There's a reported bug with the file type addition / removal handling
    which can lead to oops on cgroup umount. The issue is being looked
    into. It shouldn't cause problems for most setups and isn't a
    security concern."

    Fix up trivial conflict in Documentation/feature-removal-schedule.txt

    * 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
    res_counter: Account max_usage when calling res_counter_charge_nofail()
    res_counter: Merge res_counter_charge and res_counter_charge_nofail
    cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads
    cgroup: remove cgroup_subsys->populate()
    cgroup: get rid of populate for memcg
    cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcg
    cgroup: make css->refcnt clearing on cgroup removal optional
    cgroup: use negative bias on css->refcnt to block css_tryget()
    cgroup: implement cgroup_rm_cftypes()
    cgroup: introduce struct cfent
    cgroup: relocate __d_cgrp() and __d_cft()
    cgroup: remove cgroup_add_file[s]()
    cgroup: convert memcg controller to the new cftype interface
    memcg: always create memsw files if CONFIG_CGROUP_MEM_RES_CTLR_SWAP
    cgroup: convert all non-memcg controllers to the new cftype interface
    cgroup: relocate cftype and cgroup_subsys definitions in controllers
    cgroup: merge cft_release_agent cftype array into the base files array
    cgroup: implement cgroup_add_cftypes() and friends
    cgroup: build list of all cgroups under a given cgroupfs_root
    cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir()
    ...

    Linus Torvalds
     

02 Apr, 2012

2 commits

  • Pull cpumask cleanups from Rusty Russell:
    "(Somehow forgot to send this out; it's been sitting in linux-next, and
    if you don't want it, it can sit there another cycle)"

    I'm a sucker for things that actually delete lines of code.

    Fix up trivial conflict in arch/arm/kernel/kprobes.c, where Rusty fixed
    a user of &cpu_online_map to be cpu_online_mask, but that code got
    deleted by commit b21d55e98ac2 ("ARM: 7332/1: extract out code patch
    function from kprobes").

    * tag 'for-linus' of git://github.com/rustyrussell/linux:
    cpumask: remove old cpu_*_map.
    documentation: remove references to cpu_*_map.
    drivers/cpufreq/db8500-cpufreq: remove references to cpu_*_map.
    remove references to cpu_*_map in arch/

    Linus Torvalds
     
  • Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio,
    net_cls and device controllers to use the new cftype based interface.
    Termination entry is added to cftype arrays and populate callbacks are
    replaced with cgroup_subsys->base_cftypes initializations.

    This is a functionally identical transformation. There shouldn't be
    any visible behavior change.

    memcg is rather special and will be converted separately.
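
    The shape of the conversion, sketched for one controller (handlers
    elided):

      static struct cftype files[] = {
              {
                      .name = "cpus",
                      /* ... read/write handlers ... */
              },
              { }     /* terminating entry */
      };

      struct cgroup_subsys cpuset_subsys = {
              .name = "cpuset",
              /* .populate is gone; the core registers base_cftypes itself */
              .base_cftypes = files,
              /* ... */
      };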

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Paul Menage
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "David S. Miller"
    Cc: Vivek Goyal

    Tejun Heo
     

28 Mar, 2012

1 commit

  • We don't use "cpu" any more after 2baab4e904 "sched: Fix
    select_fallback_rq() vs cpu_active/cpu_online".

    Signed-off-by: Dan Carpenter
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120328104608.GD29022@elgon.mountain
    Signed-off-by: Ingo Molnar

    Dan Carpenter
     

27 Mar, 2012

1 commit

  • Commit 5fbd036b55 ("sched: Cleanup cpu_active madness"), which was
    supposed to finally sort the cpu_active mess, instead uncovered more.

    Since CPU_STARTING is run before setting the cpu online, there's a
    (small) window where the cpu is active but !online.

    If during this time there's a wakeup of a task that used to reside on
    that cpu, select_task_rq() will use select_fallback_rq() to compute an
    alternative cpu to run on, since we find !online.

    select_fallback_rq() however will compute the new cpu against
    cpu_active, this means that it can return the same cpu it started out
    with, the !online one, since that cpu is in fact marked active.

    This results in us trying to schedule a task on an offline cpu and
    triggering a WARN in the IPI code.

    The solution proposed by Chuansheng Liu of setting cpu_active in
    set_cpu_online() is buggy: firstly, not all archs actually use
    set_cpu_online(); secondly, not all archs call set_cpu_online() with
    IRQs disabled. This means we would introduce either the same race or
    the race from fd8a7de17 ("x86: cpu-hotplug: Prevent softirq wakeup on
    wrong CPU") -- albeit much narrower.

    [ By setting online first and active later we have a window of
    online,!active, fresh and bound kthreads have task_cpu() of 0 and
    since cpu0 isn't in tsk_cpus_allowed() we end up in
    select_fallback_rq() which excludes !active, resulting in a reset
    of ->cpus_allowed and the thread running all over the place. ]

    The solution is to re-work select_fallback_rq() to require active
    _and_ online. This makes the active,!online case work as expected,
    OTOH archs running CPU_STARTING after setting online are now
    vulnerable to the issue from fd8a7de17 -- these are alpha and
    blackfin.
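
    A sketch of the reworked loop in select_fallback_rq():

      /* a fallback cpu must be allowed, online _and_ active */
      for_each_cpu(dest_cpu, nodemask) {
              if (!cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
                      continue;
              if (!cpu_online(dest_cpu))
                      continue;
              if (!cpu_active(dest_cpu))
                      continue;
              return dest_cpu;
      }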

    Reported-by: Chuansheng Liu
    Signed-off-by: Peter Zijlstra
    Cc: Mike Frysinger
    Cc: linux-alpha@vger.kernel.org
    Link: http://lkml.kernel.org/n/tip-hubqk1i10o4dpvlm06gq7v6j@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

22 Mar, 2012

1 commit

  • Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
    changing cpuset's mems") wins a super prize for the largest number of
    memory barriers entered into fast paths for one commit.

    [get|put]_mems_allowed is incredibly heavy with pairs of full memory
    barriers inserted into a number of hot paths. This was detected while
    investigating a large page allocator slowdown introduced some time
    after 2.6.32. The largest portion of this overhead was shown by
    oprofile to be at an mfence introduced by this commit into the page
    allocator hot path.

    For extra style points, the commit introduced the use of yield() in an
    implementation of what looks like a spinning mutex.

    This patch replaces the full memory barriers on both read and write
    sides with a sequence counter with just read barriers on the fast path
    side. This is much cheaper on some architectures, including x86. The
    main bulk of the patch is the retry logic if the nodemask changes in a
    manner that can cause a false failure.

    While updating the nodemask, a check is made to see if a false failure
    is a risk. If it is, the sequence number gets bumped and parallel
    allocators will briefly stall while the nodemask update takes place.
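
    A sketch of the read-side pattern after the change (seqcount retry
    idiom; the field name is assumed):

      unsigned int seq;
      nodemask_t nodes;

      do {
              seq = read_seqcount_begin(&current->mems_allowed_seq);
              nodes = current->mems_allowed;
              /* ... attempt the allocation against @nodes ... */
      } while (read_seqcount_retry(&current->mems_allowed_seq, seq));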

    In a page fault test microbenchmark, oprofile samples from
    __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
    actual results were

                                  3.3.0-rc3           3.3.0-rc3
                                  rc3-vanilla         nobarrier-v2r1
    Clients 1 UserTime                 0.07 (  0.00%)       0.08 (-14.19%)
    Clients 2 UserTime                 0.07 (  0.00%)       0.07 (  2.72%)
    Clients 4 UserTime                 0.08 (  0.00%)       0.07 (  3.29%)
    Clients 1 SysTime                  0.70 (  0.00%)       0.65 (  6.65%)
    Clients 2 SysTime                  0.85 (  0.00%)       0.82 (  3.65%)
    Clients 4 SysTime                  1.41 (  0.00%)       1.41 (  0.32%)
    Clients 1 WallTime                 0.77 (  0.00%)       0.74 (  4.19%)
    Clients 2 WallTime                 0.47 (  0.00%)       0.45 (  3.73%)
    Clients 4 WallTime                 0.38 (  0.00%)       0.37 (  1.58%)
    Clients 1 Flt/sec/cpu         497620.28 (  0.00%)  520294.53 (  4.56%)
    Clients 2 Flt/sec/cpu         414639.05 (  0.00%)  429882.01 (  3.68%)
    Clients 4 Flt/sec/cpu         257959.16 (  0.00%)  258761.48 (  0.31%)
    Clients 1 Flt/sec             495161.39 (  0.00%)  517292.87 (  4.47%)
    Clients 2 Flt/sec             820325.95 (  0.00%)  850289.77 (  3.65%)
    Clients 4 Flt/sec            1020068.93 (  0.00%) 1022674.06 (  0.26%)

    MMTests Statistics: duration
    Sys Time Running Test (seconds)              135.68      132.17
    User+Sys Time Running Test (seconds)         164.2       160.13
    Total Elapsed Time (seconds)                 123.46      120.87

    The overall improvement is small but the System CPU time is much
    improved and roughly in correlation to what oprofile reported (these
    performance figures are without profiling so skew is expected). The
    actual number of page faults is noticeably improved.

    For benchmarks like kernel builds, the overall benefit is marginal but
    the system CPU time is slightly reduced.

    To test the actual bug the commit fixed I opened two terminals. The
    first ran within a cpuset and continually ran a small program that
    faulted 100M of anonymous data. In a second window, the nodemask of the
    cpuset was continually randomised in a loop.

    Without the commit, the program would fail every so often (usually
    within 10 seconds) and obviously with the commit everything worked fine.
    With this patch applied, it also worked fine so the fix should be
    functionally equivalent.

    Signed-off-by: Mel Gorman
    Cc: Miao Xie
    Cc: David Rientjes
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

03 Feb, 2012

1 commit

  • The argument is not used at all, and it's not necessary, because
    a specific callback handler of course knows which subsys it
    belongs to.

    Now only ->populate() takes this argument, because the handlers of
    this callback always call cgroup_add_file()/cgroup_add_files().

    So we reduce a few lines of code, though the shrinking of object size
    is minimal.
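
    The change, sketched on one callback (before/after):

      /* before */
      static struct cgroup_subsys_state *
      cpuset_create(struct cgroup_subsys *ss, struct cgroup *cgrp);

      /* after: the handler already knows which subsys it belongs to */
      static struct cgroup_subsys_state *
      cpuset_create(struct cgroup *cgrp);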

    16 files changed, 113 insertions(+), 162 deletions(-)

    text data bss dec hex filename
    5486240 656987 7039960 13183187 c928d3 vmlinux.o.orig
    5486170 656987 7039960 13183117 c9288d vmlinux.o

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

10 Jan, 2012

1 commit

  • * 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
    cgroup: fix to allow mounting a hierarchy by name
    cgroup: move assignment out of condition in cgroup_attach_proc()
    cgroup: Remove task_lock() from cgroup_post_fork()
    cgroup: add sparse annotation to cgroup_iter_start() and cgroup_iter_end()
    cgroup: mark cgroup_rmdir_waitq and cgroup_attach_proc() as static
    cgroup: only need to check oldcgrp==newgrp once
    cgroup: remove redundant get/put of task struct
    cgroup: remove redundant get/put of old css_set from migrate
    cgroup: Remove unnecessary task_lock before fetching css_set on migration
    cgroup: Drop task_lock(parent) on cgroup_fork()
    cgroups: remove redundant get/put of css_set from css_set_check_fetched()
    resource cgroups: remove bogus cast
    cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
    cgroup, cpuset: don't use ss->pre_attach()
    cgroup: don't use subsys->can_attach_task() or ->attach_task()
    cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()
    cgroup: improve old cgroup handling in cgroup_attach_proc()
    cgroup: always lock threadgroup during migration
    threadgroup: extend threadgroup_lock() to cover exit and exec
    threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
    ...

    Fix up conflict in kernel/cgroup.c due to commit e0197aae59e5: "cgroups:
    fix a css_set not found bug in cgroup_attach_proc" that already
    mentioned that the bug is fixed (differently) in Tejun's cgroup
    patchset. This one, in other words.

    Linus Torvalds
     

21 Dec, 2011

1 commit

  • Kernels where MAX_NUMNODES > BITS_PER_LONG may temporarily see an empty
    nodemask in a tsk's mempolicy if its previous nodemask is remapped onto a
    new set of allowed cpuset nodes where the two nodemasks, as a result of
    the remap, are now disjoint.

    c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
    cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
    nodes from changing for a thread. This causes any update to a set of
    allowed nodes to stall until put_mems_allowed() is called.

    This stall is unnecessary, however, if at least one node remains unchanged
    in the update to the set of allowed nodes. This was addressed by
    89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
    node remains set"), but it's still possible that an empty nodemask may be
    read from a mempolicy because the old nodemask may be remapped to the new
    nodemask during rebind. To prevent this, only avoid the stall if there is
    no mempolicy for the thread being changed.

    This is a temporary solution until all reads from mempolicy nodemasks can
    be guaranteed to not be empty without the get_mems_allowed()
    synchronization.

    Also moves the check for nodemask intersection inside task_lock() so that
    tsk->mems_allowed cannot change. This ensures that nothing can set this
    tsk's mems_allowed out from under us and also protects tsk->mempolicy.
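
    A sketch of the check, now inside task_lock() (condition per the
    description; the helper name is assumed):

      task_lock(tsk);
      /*
       * Stall only when it can matter: a mempolicy exists, or the new
       * nodemask shares no node with the current one.
       */
      need_loop = task_has_mempolicy(tsk) ||
                  !nodes_intersects(*newmems, tsk->mems_allowed);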

    Reported-by: Miao Xie
    Signed-off-by: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

13 Dec, 2011

3 commits

  • ->pre_attach() is supposed to be called before migration, which is
    observed during process migration, but task migration does it the
    other way around. The only ->pre_attach() user is cpuset, which can
    do the same operations in ->can_attach(). Collapse cpuset_pre_attach()
    into cpuset_can_attach().

    -v2: Patch contamination from later patch removed. Spotted by Paul
    Menage.

    Signed-off-by: Tejun Heo
    Reviewed-by: Frederic Weisbecker
    Acked-by: Paul Menage
    Cc: Li Zefan

    Tejun Heo
     
  • Now that subsys->can_attach() and attach() take @tset instead of
    @task, they can handle per-task operations. Convert
    ->can_attach_task() and ->attach_task() users to use ->can_attach()
    and attach() instead. Most conversions are straightforward.
    Noteworthy changes are:

    * In cgroup_freezer, remove unnecessary NULL assignments to unused
    methods. It's useless and very prone to getting out of sync, which
    has already happened.

    * In cpuset, the PF_THREAD_BOUND test is now applied to each task.
    This doesn't make any practical difference but is conceptually
    cleaner.

    Signed-off-by: Tejun Heo
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Frederic Weisbecker
    Acked-by: Li Zefan
    Cc: Paul Menage
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: James Morris
    Cc: Ingo Molnar
    Cc: Peter Zijlstra

    Tejun Heo
     
  • Currently, there's no way to pass multiple tasks to cgroup_subsys
    methods, necessitating separate per-process and per-task methods.
    This patch introduces cgroup_taskset, which can be used to
    pass multiple tasks and their associated cgroups to cgroup_subsys
    methods.

    Three methods - can_attach(), cancel_attach() and attach() - are
    converted to use cgroup_taskset. This unifies passed parameters so
    that all methods have access to all information. Conversions in this
    patchset are identical and don't introduce any behavior change.

    -v2: documentation updated as per Paul Menage's suggestion.
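
    A usage sketch of the new interface (era-appropriate callback
    signature assumed):

      static int cpuset_can_attach(struct cgroup_subsys *ss,
                                   struct cgroup *cgrp,
                                   struct cgroup_taskset *tset)
      {
              struct task_struct *task;

              cgroup_taskset_for_each(task, cgrp, tset) {
                      /* per-task checks, all in one callback */
                      if (task->flags & PF_THREAD_BOUND)
                              return -EINVAL;
              }
              return 0;
      }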

    Signed-off-by: Tejun Heo
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Frederic Weisbecker
    Acked-by: Paul Menage
    Acked-by: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: KAMEZAWA Hiroyuki
    Cc: James Morris

    Tejun Heo