Eric Lee / smarc-fsl-linux-kernel

14 Sep, 2015

1 commit

2f9de0cc2 cpuset: use trialcs->mems_allowed as a temp variable

commit 24ee3cf89bef04e8bc23788aca4e029a3f0f06d9 upstream.

The comment says it's using trialcs->mems_allowed as a temp variable but
it didn't match the code. Change the code to match the comment.

This fixes an issue when writing in cpuset.mems when a sub-directory
exists: we need to write several times for the information to persist:

| root@alban:/sys/fs/cgroup/cpuset# mkdir footest9
| root@alban:/sys/fs/cgroup/cpuset# cd footest9
| root@alban:/sys/fs/cgroup/cpuset/footest9# mkdir aa
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
|
| root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
|
| root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
| 0
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems
|
| root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > aa/cpuset.mems
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems
| 0
| root@alban:/sys/fs/cgroup/cpuset/footest9#

This should help to fix the following issue in Docker:
https://github.com/opencontainers/runc/issues/133
In some conditions, a Docker container needs to be started twice in
order to work.

Signed-off-by: Alban Crequy
Tested-by: Iago López Galeiras
Acked-by: Li Zefan
Signed-off-by: Tejun Heo
Signed-off-by: Greg Kroah-Hartman

Alban Crequy
2015-09-14 00:07:46 +0800

15 Apr, 2015

1 commit

6e276d2a5 kernel, cpuset: remove exception for __GFP_THISNODE

Nothing calls __cpuset_node_allowed() with __GFP_THISNODE set anymore, so
remove the obscure comment about it and its special-case exception.

Signed-off-by: David Rientjes
Acked-by: Vlastimil Babka
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: Joonsoo Kim
Cc: Johannes Weiner
Cc: Mel Gorman
Cc: Pravin Shelar
Cc: Jarno Rajahalme
Cc: Li Zefan
Cc: Greg Thelen
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2015-04-15 07:49:03 +0800

20 Mar, 2015

1 commit

47b8ea718 cpusets, isolcpus: exclude isolcpus from load balancing in cpusets

Ensure that cpus specified with the isolcpus= boot commandline
option stay outside of the load balancing in the kernel scheduler.

Operations like load balancing can introduce unwanted latencies,
which is exactly what the isolcpus= commandline is there to prevent.

Previously, simply creating a new cpuset, without even touching the
cpuset.cpus field inside the new cpuset, would undo the effects of
isolcpus=, by creating a scheduler domain spanning the whole system,
and setting up load balancing inside that domain. The cpuset root
cpuset.cpus file is read-only, so there was not even a way to undo
that effect.
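The exclusion described above can be pictured as a mask operation. This is a minimal userspace sketch, not the kernel code: `cpu_mask_t` and `balance_span()` are hypothetical names, and one 64-bit word stands in for a real cpumask.

```c
#include <stdint.h>

/* One bit per CPU; a single 64-bit word is enough for this illustration. */
typedef uint64_t cpu_mask_t;

/*
 * Hypothetical helper: the CPUs that may take part in load balancing are
 * the cpuset's CPUs minus the ones listed in isolcpus=.
 */
static cpu_mask_t balance_span(cpu_mask_t cpuset_cpus, cpu_mask_t isolated)
{
	return cpuset_cpus & ~isolated;
}
```

With CPUs 0-15 in the cpuset and CPUs 2-3 isolated, `balance_span(0xFFFF, 0x000C)` leaves CPUs 0-1 and 4-15 in the balanced span.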

This does not impact the majority of cpusets users, since isolcpus=
is a fairly specialized feature used for realtime purposes.

Cc: Peter Zijlstra
Cc: Clark Williams
Cc: Li Zefan
Cc: Ingo Molnar
Cc: Luiz Capitulino
Cc: Mike Galbraith
Cc: cgroups@vger.kernel.org
Signed-off-by: Rik van Riel
Tested-by: David Rientjes
Acked-by: Peter Zijlstra (Intel)
Acked-by: David Rientjes
Acked-by: Zefan Li
Signed-off-by: Tejun Heo

Rik van Riel
2015-03-20 02:28:19 +0800

03 Mar, 2015

3 commits

283cb41f4 cpuset: Fix cpuset sched_relax_domain_level

The cpuset.sched_relax_domain_level can control how far we do
immediate load balancing on a system. However, it was found on recent
kernels that echo'ing a value into cpuset.sched_relax_domain_level
did not reduce any immediate load balancing.

The reason this occurred was because the update_domain_attr_tree() traversal
did not update for the "top_cpuset". This resulted in nothing being changed
when modifying the sched_relax_domain_level parameter.

This patch is able to address that problem by having update_domain_attr_tree()
allow updates for the root in the cpuset traversal.
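The shape of the fix can be sketched as a subtree walk that examines the root before its descendants. This is only an illustration: `struct cs_node` and `subtree_relax_level()` are hypothetical, and it assumes the attribute of interest is the largest level found in the subtree.

```c
#include <stddef.h>

/* Hypothetical stand-in for a cpuset in a first-child/next-sibling tree. */
struct cs_node {
	int relax_domain_level;		/* -1 means "no request" */
	struct cs_node *child;		/* first child, or NULL */
	struct cs_node *sibling;	/* next sibling, or NULL */
};

/*
 * Return the largest relax_domain_level in the subtree rooted at root.
 * The root is examined before its descendants, so a value written to the
 * top node is never skipped.
 */
static int subtree_relax_level(const struct cs_node *root)
{
	const struct cs_node *c;
	int level;

	if (root == NULL)
		return -1;

	level = root->relax_domain_level;	/* include the root itself */
	for (c = root->child; c != NULL; c = c->sibling) {
		int sub = subtree_relax_level(c);

		if (sub > level)
			level = sub;
	}
	return level;
}
```

A walk that started at the root's children instead would ignore a level set on the top node, which matches the symptom described in the commit message.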

Fixes: fc560a26acce ("cpuset: replace cpuset->stack_list with cpuset_for_each_descendant_pre()")
Cc: # 3.9+
Signed-off-by: Jason Low
Signed-off-by: Zefan Li
Signed-off-by: Tejun Heo
Tested-by: Serge Hallyn

Jason Low
2015-03-03 00:55:04 +0800

79063bffc cpuset: fix a warning when clearing configured masks in old hierarchy

When we clear cpuset.cpus, cpuset.effective_cpus won't be cleared:

# mount -t cgroup -o cpuset xxx /mnt
# mkdir /mnt/tmp
# echo 0 > /mnt/tmp/cpuset.cpus
# echo > /mnt/tmp/cpuset.cpus
# cat cpuset.cpus

# cat cpuset.effective_cpus
0-15

And a kernel warning in update_cpumasks_hier() is triggered:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 4028 at kernel/cpuset.c:894 update_cpumasks_hier+0x471/0x650()

Cc: # 3.17+
Signed-off-by: Zefan Li
Signed-off-by: Tejun Heo
Tested-by: Serge Hallyn

Zefan Li
2015-03-03 00:55:04 +0800

790317e1b cpuset: initialize effective masks when clone_children is enabled

If clone_children is enabled, effective masks won't be initialized
due to the bug:

# mount -t cgroup -o cpuset xxx /mnt
# echo 1 > cgroup.clone_children
# mkdir /mnt/tmp
# cat /mnt/tmp/
# cat cpuset.effective_cpus

# cat cpuset.cpus
0-15

And then this cpuset won't constrain the tasks in it.

Neither the bug nor the fix has any effect on the unified hierarchy, as
there's no clone_children flag there any more.

Reported-by: Christian Brauner
Reported-by: Serge Hallyn
Cc: # 3.17+
Signed-off-by: Zefan Li
Signed-off-by: Tejun Heo
Tested-by: Serge Hallyn

Zefan Li
2015-03-03 00:55:04 +0800

14 Feb, 2015

1 commit

e8e6d97c9 cpuset: use %*pb[l] to print bitmaps including cpumasks and nodemasks

printk and friends can now format bitmaps using '%*pb[l]'. cpumask
and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
respectively which can be used to generate the two printf arguments
necessary to format the specified cpu/nodemask.

* kernel/cpuset.c::cpuset_print_task_mems_allowed() used a static
buffer which is protected by a dedicated spinlock. Removed.
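For readers unfamiliar with the list form that '%*pbl' produces (for example "0-3,8"), the following userspace sketch builds the same kind of string from a 64-bit mask. `format_bit_list()` is a hypothetical helper written for this illustration, not a kernel function; in the kernel the formatting is done by printk itself.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical userspace helper: write the set bits of mask into buf as a
 * range list such as "0-3,8". Returns the number of characters written,
 * not counting the terminating NUL. Stops early if buf is too small.
 */
static size_t format_bit_list(uint64_t mask, char *buf, size_t len)
{
	size_t used = 0;
	int bit = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';

	while (bit < 64) {
		int start, end, n;

		if (!(mask & (UINT64_C(1) << bit))) {
			bit++;
			continue;
		}

		/* Found the start of a run of set bits; find where it ends. */
		start = bit;
		while (bit < 64 && (mask & (UINT64_C(1) << bit)))
			bit++;
		end = bit - 1;

		if (start == end)
			n = snprintf(buf + used, len - used, "%s%d",
				     used ? "," : "", start);
		else
			n = snprintf(buf + used, len - used, "%s%d-%d",
				     used ? "," : "", start, end);

		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* drop the partial range */
			break;
		}
		used += (size_t)n;
	}
	return used;
}
```

Because the caller supplies the buffer, no shared static buffer or lock is needed, which is the same property the commit gains by letting printk format the mask directly.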

Signed-off-by: Tejun Heo
Cc: Li Zefan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tejun Heo
2015-02-14 13:21:37 +0800

13 Feb, 2015

1 commit

8f4ab07f4 kernel/cpuset.c: Mark cpuset_init_current_mems_allowed as __init

The only caller of cpuset_init_current_mems_allowed is the __init
annotated build_all_zonelists_init, so we can also make the former __init.

Signed-off-by: Rasmus Villemoes
Cc: Vlastimil Babka
Cc: Rik van Riel
Cc: Joonsoo Kim
Cc: David Rientjes
Cc: Vishnu Pratap Singh
Cc: Pintu Kumar
Cc: Michal Nazarewicz
Cc: Mel Gorman
Cc: Paul Gortmaker
Cc: Peter Zijlstra
Cc: Tim Chen
Cc: Hugh Dickins
Cc: Li Zefan
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rasmus Villemoes
2015-02-13 10:54:11 +0800

12 Dec, 2014

1 commit

2756d373a Merge branch 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup update from Tejun Heo:
"cpuset got simplified a bit. cgroup core got a fix on unified
hierarchy and grew some effective css related interfaces which will be
used for blkio support for writeback IO traffic which is currently
being worked on"

* 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: implement cgroup_get_e_css()
cgroup: add cgroup_subsys->css_e_css_changed()
cgroup: add cgroup_subsys->css_released()
cgroup: fix the async css offline wait logic in cgroup_subtree_control_write()
cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write()
cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask()
cpuset: lock vs unlock typo
cpuset: simplify cpuset_node_allowed API
cpuset: convert callback_mutex to a spinlock

Linus Torvalds
2014-12-12 10:57:19 +0800

28 Oct, 2014

2 commits

f82f80426 sched/deadline: Ensure that updates to exclusive cpusets don't break AC

How we deal with updates to exclusive cpusets is currently broken.
As an example, suppose we have an exclusive cpuset composed of
two cpus: A[cpu0,cpu1]. We can assign SCHED_DEADLINE task to it
up to the allowed bandwidth. If we want now to modify cpusetA's
cpumask, we have to check that removing a cpu's amount of
bandwidth doesn't break AC guarantees. This thing isn't checked
in the current code.

This patch fixes the problem above, denying an update if the
new cpumask won't have enough bandwidth for SCHED_DEADLINE tasks
that are currently active.
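The admission-control check described above amounts to comparing the bandwidth already handed out with what the shrunken cpuset could still offer. The sketch below is an assumption-laden illustration: `struct dl_pool`, its fields, and `shrink_allowed()` are hypothetical, and bandwidth is expressed in arbitrary integer units per CPU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one exclusive cpuset. */
struct dl_pool {
	uint64_t per_cpu_bw;	/* maximum bandwidth allowed on each CPU */
	uint64_t total_bw;	/* bandwidth of the tasks already admitted */
};

/*
 * A cpumask update is acceptable only if the admitted tasks still fit
 * into the bandwidth provided by the remaining CPUs.
 */
static bool shrink_allowed(const struct dl_pool *pool, unsigned int new_cpus)
{
	return pool->total_bw <= pool->per_cpu_bw * new_cpus;
}
```

With 950 units allowed per CPU and 1500 units admitted, shrinking from two CPUs to one would be denied, while keeping two CPUs would be accepted.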

Signed-off-by: Juri Lelli
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Li Zefan
Cc: cgroups@vger.kernel.org
Link: http://lkml.kernel.org/r/5433E6AF.5080105@arm.com
Signed-off-by: Ingo Molnar

Juri Lelli
2014-10-28 17:48:00 +0800

7f51412a4 sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets

Exclusive cpusets are the only way users can restrict SCHED_DEADLINE tasks'
affinity (performing what is commonly called clustered scheduling).
Unfortunately, this is currently broken for two reasons:

- No check is performed when the user tries to attach a task to
an exclusive cpuset (recall that exclusive cpusets have an
associated maximum allowed bandwidth).

- Bandwidths of source and destination cpusets are not correctly
updated after a task is migrated between them.

This patch fixes both things at once, as they are opposite faces
of the same coin.

The check is performed in cpuset_can_attach(), as there aren't any
points of failure after that function. The update is split in two
halves. We first reserve bandwidth in the destination cpuset, after
we pass the check in cpuset_can_attach(). And we then release
bandwidth from the source cpuset when the task's affinity is
actually changed. Even if there can be time windows when sched_setattr()
may erroneously fail in the source cpuset, we are fine with it, as
we can't perform an atomic update of both cpusets at once.
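The two-step update can be sketched as a reserve step that may fail followed by a release step that cannot. The names below (`struct dl_pool`, `reserve_bw()`, `release_bw()`) are hypothetical and the units are arbitrary; the real code operates on the scheduler's own bandwidth structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one exclusive cpuset. */
struct dl_pool {
	uint64_t per_cpu_bw;	/* maximum bandwidth allowed on each CPU */
	uint64_t total_bw;	/* bandwidth of the tasks already admitted */
};

/*
 * Step 1 (at attach-check time): reserve the task's bandwidth in the
 * destination. This is the only step allowed to fail.
 */
static bool reserve_bw(struct dl_pool *dst, unsigned int cpus, uint64_t task_bw)
{
	if (dst->total_bw + task_bw > dst->per_cpu_bw * cpus)
		return false;
	dst->total_bw += task_bw;
	return true;
}

/*
 * Step 2 (when the affinity actually changes): give the bandwidth back
 * to the source. Between the two steps the task is counted in both pools.
 */
static void release_bw(struct dl_pool *src, uint64_t task_bw)
{
	src->total_bw -= task_bw;
}
```

During the window between the two calls the source pool still carries the migrating task's share, which is why a concurrent request against the source may be rejected even though it would fit a moment later.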

Reported-by: Daniel Wagner
Reported-by: Vincent Legout
Signed-off-by: Juri Lelli
Signed-off-by: Peter Zijlstra (Intel)
Cc: Dario Faggioli
Cc: Michael Trimarchi
Cc: Fabio Checconi
Cc: michael@amarulasolutions.com
Cc: luca.abeni@unitn.it
Cc: Li Zefan
Cc: Linus Torvalds
Cc: cgroups@vger.kernel.org
Link: http://lkml.kernel.org/r/1411118561-26323-3-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar

Juri Lelli
2014-10-28 17:47:58 +0800

27 Oct, 2014

3 commits

cea74465e cpuset: lock vs unlock typo

This will deadlock instead of unlocking.

Fixes: f73eae8d8384 ('cpuset: simplify cpuset_node_allowed API')
Signed-off-by: Dan Carpenter
Acked-by: Vladimir Davydov
Signed-off-by: Tejun Heo

Dan Carpenter
2014-10-27 23:53:29 +0800

344736f29 cpuset: simplify cpuset_node_allowed API

Current cpuset API for checking if a zone/node is allowed to allocate
from looks rather awkward. We have hardwall and softwall versions of
cpuset_node_allowed with the softwall version doing literally the same
as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags.
If it isn't, the softwall version may check the given node against the
enclosing hardwall cpuset, which it needs to take the callback lock to
do.

Such a distinction was introduced by commit 02a0e53d8227 ("cpuset:
rework cpuset_zone_allowed api"). Before that, we had a single version with
the __GFP_HARDWALL flag determining its behavior. The purpose of the
commit was to avoid sleep-in-atomic bugs when someone would mistakenly
call the function without the __GFP_HARDWALL flag for an atomic
allocation. The suffixes introduced were intended to make the callers
think before using the function.

However, since the callback lock was converted from mutex to spinlock by
the previous patch, the softwall check function cannot sleep, and these
precautions are no longer necessary.

So let's simplify the API back to the single check.

Suggested-by: David Rientjes
Signed-off-by: Vladimir Davydov
Acked-by: Christoph Lameter
Acked-by: Zefan Li
Signed-off-by: Tejun Heo

Vladimir Davydov
2014-10-27 23:15:27 +0800

8447a0fee cpuset: convert callback_mutex to a spinlock

The callback_mutex is only used to synchronize reads/updates of cpusets'
flags and cpu/node masks. These operations should always proceed fast so
there's no reason why we can't use a spinlock instead of the mutex.

Converting the callback_mutex into a spinlock will let us call
cpuset_zone_allowed_softwall from atomic context. This, in turn, makes
it possible to simplify the code by merging the hardwall and softwall
checks into the same function, which is the business of the next patch.

Suggested-by: Zefan Li
Signed-off-by: Vladimir Davydov
Acked-by: Christoph Lameter
Acked-by: Zefan Li
Signed-off-by: Tejun Heo

Vladimir Davydov
2014-10-27 23:15:26 +0800

10 Oct, 2014

1 commit

b211e9d7c Merge branch 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:
"Nothing too interesting. Just a handful of cleanup patches"

* 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
Revert "cgroup: remove redundant variable in cgroup_mount()"
cgroup: remove redundant variable in cgroup_mount()
cgroup: fix missing unlock in cgroup_release_agent()
cgroup: remove CGRP_RELEASABLE flag
perf/cgroup: Remove perf_put_cgroup()
cgroup: remove redundant check in cgroup_ino()
cpuset: simplify proc_cpuset_show()
cgroup: simplify proc_cgroup_show()
cgroup: use a per-cgroup work for release agent
cgroup: remove bogus comments
cgroup: remove redundant code in cgroup_rmdir()
cgroup: remove some useless forward declarations
cgroup: fix a typo in comment.

Linus Torvalds
2014-10-10 19:24:40 +0800

25 Sep, 2014

1 commit

2ad654bc5 cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags

When we change cpuset.memory_spread_{page,slab}, cpuset will flip
PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
This should be done using atomic bitops, but currently we don't,
which is broken.

Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
when one thread tried to clear PF_USED_MATH while at the same time another
thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
the same task.

Here's the full report:
https://lkml.org/lkml/2014/9/19/230

To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.
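The hazard is a plain read-modify-write on a flags word that two threads update at once: one update can overwrite the other. The sketch below shows the atomic alternative using C11 `<stdatomic.h>` in userspace. The struct, the bit numbers, and the helper names are hypothetical and only mirror the idea of keeping such bits in a separately updated atomic word.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers for this illustration. */
enum { SPREAD_PAGE_BIT = 0, SPREAD_SLAB_BIT = 1 };

/* Hypothetical task with a flags word that is only updated atomically. */
struct task {
	atomic_ulong atomic_flags;
};

/* Set one bit without disturbing concurrent updates to other bits. */
static void task_set_flag(struct task *t, int bit)
{
	atomic_fetch_or(&t->atomic_flags, 1UL << bit);
}

/* Clear one bit without disturbing concurrent updates to other bits. */
static void task_clear_flag(struct task *t, int bit)
{
	atomic_fetch_and(&t->atomic_flags, ~(1UL << bit));
}

static bool task_test_flag(struct task *t, int bit)
{
	return (atomic_load(&t->atomic_flags) & (1UL << bit)) != 0;
}
```

Each helper is a single atomic operation on the word, so a concurrent set of one bit and clear of another cannot lose either update.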

v4:
- updated mm/slab.c. (Fengguang Wu)
- updated Documentation.

Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Miao Xie
Cc: Kees Cook
Fixes: 950592f7b991 ("cpusets: update tasks' page/slab spread flags in time")
Cc: # 2.6.31+
Reported-by: Tetsuo Handa
Signed-off-by: Zefan Li
Signed-off-by: Tejun Heo

Zefan Li
2014-09-25 10:16:06 +0800

19 Sep, 2014

1 commit

52de4779f cpuset: simplify proc_cpuset_show()

Use the ONE macro instead of REG, and we can simplify proc_cpuset_show().

Signed-off-by: Zefan Li
Signed-off-by: Tejun Heo

Zefan Li
2014-09-19 01:27:23 +0800

05 Aug, 2014

1 commit

47dfe4037 Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup changes from Tejun Heo:
"Mostly changes to get the v2 interface ready. The core features are
mostly ready now and I think it's reasonable to expect to drop the
devel mask in one or two devel cycles at least for a subset of
controllers.

- cgroup added a controller dependency mechanism so that block cgroup
can depend on memory cgroup. This will be used to finally support
IO provisioning on the writeback traffic, which is currently being
implemented.

- The v2 interface now uses a separate table so that the interface
files for the new interface are explicitly declared in one place.
Each controller will explicitly review and add the files for the
new interface.

- cpuset is getting ready for the hierarchical behavior which is in
the similar style with other controllers so that an ancestor's
configuration change doesn't change the descendants' configurations
irreversibly and processes aren't silently migrated when a CPU or
node goes down.

All the changes are to the new interface and no behavior changed for
the multiple hierarchies"

* 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits)
cpuset: fix the WARN_ON() in update_nodemasks_hier()
cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test
cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core
cgroup: distinguish the default and legacy hierarchies when handling cftypes
cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[]
cpuset: export effective masks to userspace
cpuset: allow writing offlined masks to cpuset.cpus/mems
cpuset: enable onlined cpu/node in effective masks
cpuset: refactor cpuset_hotplug_update_tasks()
cpuset: make cs->{cpus, mems}_allowed as user-configured masks
cpuset: apply cs->effective_{cpus,mems}
cpuset: initialize top_cpuset's configured masks at mount
cpuset: use effective cpumask to build sched domains
cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty
cpuset: update cs->effective_{cpus, mems} when config changes
cpuset: update cpuset->effective_{cpus,mems} at hotplug
cpuset: add cs->effective_cpus and cs->effective_mems
cgroup: clean up sane_behavior handling
...

Linus Torvalds
2014-08-05 01:11:28 +0800

30 Jul, 2014

1 commit

a13812683 cpuset: fix the WARN_ON() in update_nodemasks_hier()

The WARN_ON() is used to check if we break the legal hierarchy, on
which the effective mems should be equal to configured mems.

Reported-by: Mike Qiu
Tested-by: Mike Qiu
Signed-off-by: Li Zefan

Li Zefan
2014-07-30 23:26:58 +0800

15 Jul, 2014

1 commit

5577964e6 cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes

Currently, cgroup_subsys->base_cftypes is used for both the unified
default hierarchy and legacy ones and subsystems can mark each file
with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
only on one of them. This is quite hairy and error-prone. Also, we
may end up exposing interface files to the default hierarchy without
thinking it through.

cgroup_subsys will grow two separate cftype arrays and apply each only
on the hierarchies of the matching type. This will allow organizing
cftypes in a lot clearer way and encourage subsystems to scrutinize
the interface which is being exposed in the new default hierarchy.

In preparation, this patch renames cgroup_subsys->base_cftypes to
cgroup_subsys->legacy_cftypes. This patch is pure rename.

Signed-off-by: Tejun Heo
Acked-by: Neil Horman
Acked-by: Li Zefan
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Vivek Goyal
Cc: Peter Zijlstra
Cc: Paul Mackerras
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
Cc: Aristeu Rozanski
Cc: Aneesh Kumar K.V

Tejun Heo
2014-07-15 23:05:09 +0800

10 Jul, 2014

12 commits

afd1a8b3e cpuset: export effective masks to userspace

cpuset.cpus and cpuset.mems are the configured masks, and we need
to export effective masks to userspace, so users know the real
cpus_allowed and mems_allowed that apply to the tasks in a cpuset.

v2:
- export those masks unconditionally, suggested by Tejun.

Signed-off-by: Li Zefan
Signed-off-by: Tejun Heo

Li Zefan
2014-07-10 03:56:18 +0800

5d8ba82c3 cpuset: allow writing offlined masks to cpuset.cpus/mems

As the configured masks won't be limited by its parent, and the top
cpuset's masks won't change when hotplug happens, it's natural to
allow writing offlined masks to the configured masks.

If on default hierarchy:

# echo 0 > /sys/devices/system/cpu/cpu1/online
# mkdir /cpuset/sub
# echo 1 > /cpuset/sub/cpuset.cpus
# cat /cpuset/sub/cpuset.cpus
1

If on legacy hierarchy:

# echo 0 > /sys/devices/system/cpu/cpu1/online
# mkdir /cpuset/sub
# echo 1 > /cpuset/sub/cpuset.cpus
-bash: echo: write error: Invalid argument

Note the checks don't need to be gated by cgroup_on_dfl, because we've
initialized top_cpuset.{cpus,mems}_allowed accordingly in cpuset_bind().

Signed-off-by: Li Zefan
Signed-off-by: Tejun Heo

Li Zefan
2014-07-10 03:56:17 +0800

be4c9dd7a cpuset: enable onlined cpu/node in effective masks

Firstly offline cpu1:

# echo 0-1 > cpuset.cpus
# echo 0 > /sys/devices/system/cpu/cpu1/online
# cat cpuset.cpus
0-1
# cat cpuset.effective_cpus
0

Then online it:

# echo 1 > /sys/devices/system/cpu/cpu1/online
# cat cpuset.cpus
0-1
# cat cpuset.effective_cpus
0-1

And cpuset will bring it back to the effective mask.

The implementation is quite straightforward. Instead of calculating the
offlined cpus/mems and doing updates, we just set the new effective_mask
to online_mask & configured_mask.
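The offline/online transcript above can be replayed with plain bit masks. This is a sketch only: `hotplug_effective()` is a hypothetical name and a 64-bit word stands in for a cpumask.

```c
#include <stdint.h>

/*
 * Hypothetical helper: after a hotplug event the effective mask is simply
 * the configured mask restricted to what is currently online.
 */
static uint64_t hotplug_effective(uint64_t online, uint64_t configured)
{
	return online & configured;
}
```

With cpuset.cpus set to 0-1 (mask 0x3), taking cpu1 offline gives `hotplug_effective(0x1, 0x3)`, which is CPU 0 only; bringing it back gives `hotplug_effective(0x3, 0x3)`, which is CPUs 0-1 again. The configured mask never changes.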

This is a behavior change for default hierarchy, so legacy hierarchy
won't be affected.

v2:
- make refactoring of cpuset_hotplug_update_tasks() a separate patch,
suggested by Tejun.
- make hotplug_update_tasks_insane() use @new_cpus and @new_mems as
hotplug_update_tasks_sane() does.

Signed-off-by: Li Zefan
Signed-off-by: Tejun Heo

Li Zefan
2014-07-10 03:56:17 +0800

390a36aad cpuset: refactor cpuset_hotplug_update_tasks()

We mix the handling for both default hierarchy and legacy hierarchy in
the same function, and it's quite messy, so split into two functions.

Signed-off-by: Li Zefan
Signed-off-by: Tejun Heo

Li Zefan
2014-07-10 03:56:17 +0800

7e88291be cpuset: make cs->{cpus, mems}_allowed as user-configured masks

Now that effective cpumasks enforce the hierarchical behavior,
we can use cs->{cpus,mems}_allowed as configured masks.

Configured masks can be changed by writing cpuset.cpus and cpuset.mems
only. The new behaviors are:

- They won't be changed by hotplug anymore.
- They won't be limited by its parent's masks.

This is a behavior change, but it won't take effect unless mounted with
sane_behavior.

v2:
- Add comments to explain the differences between configured masks and
effective masks.

Signed-off-by: Li Zefan
Signed-off-by: Tejun Heo

Li Zefan
2014-07-10 03:56:17 +0800

ae1c80238 cpuset: apply cs->effective_{cpus,mems}

Now we can use cs->effective_{cpus,mems} as effective masks. It's
used whenever:

- we update tasks' cpus_allowed/mems_allowed,
- we want to retrieve tasks_cs(tsk)'s cpus_allowed/mems_allowed.

They actually replace effective_{cpu,node}mask_cpuset().

effective_mask == configured_mask & parent effective_mask except when
the result is empty, in which case it inherits parent effective_mask.
The result equals the mask computed from effective_{cpu,node}mask_cpuset().
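The rule stated above, including the empty-result fallback, fits in a few lines. This is a userspace sketch with a hypothetical function name and a 64-bit word in place of a cpumask or nodemask.

```c
#include <stdint.h>

/*
 * Hypothetical helper: a cpuset's effective mask is its configured mask
 * limited by the parent's effective mask. If nothing is left, the cpuset
 * falls back to the parent's effective mask.
 */
static uint64_t effective_mask(uint64_t configured, uint64_t parent_effective)
{
	uint64_t m = configured & parent_effective;

	return m ? m : parent_effective;
}
```

For a parent whose effective mask is CPUs 0-1, a child configured with CPUs 1-2 ends up with CPU 1, while a child configured with CPUs 2-3 has an empty intersection and inherits CPUs 0-1.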

This won't affect the original legacy hierarchy, because in this case we
make sure the effective masks are always the same with user-configured
masks.

Signed-off-by: Li Zefan
Signed-off-by: Tejun Heo

Li Zefan
2014-07-10 03:56:17 +0800

39bd0d15e cpuset: initialize top_cpuset's configured masks at mount

We now have to support different behaviors for the default hierarchy and
legacy hierarchy, so top_cpuset's configured masks need to be initialized
accordingly.

Suppose we've offlined cpu1.

On default hierarchy:

# mount -t cgroup -o __DEVEL__sane_behavior xxx /cpuset
# cat /cpuset/cpuset.cpus
0-15

On legacy hierarchy:

# mount -t cgroup xxx /cpuset
# cat /cpuset/cpuset.cpus
0,2-15

Signed-off-by: Li Zefan
Signed-off-by: Tejun Heo

Li Zefan
2014-07-10 03:56:17 +0800

8b5f1c52d cpuset: use effective cpumask to build sched domains

We're going to have separate user-configured masks and effective ones.

Eventually configured masks can only be changed by writing cpuset.cpus
and cpuset.mems, and they won't be restricted by parent cpuset. While
effective masks reflect cpu/memory hotplug and hierarchical restriction,
and these are the real masks that apply to the tasks in the cpuset.

We calculate effective mask this way:
- top cpuset's effective_mask == online_mask, otherwise
- cpuset's effective_mask == configured_mask & parent effective_mask,
if the result is empty, it inherits parent effective mask.

Those behavior changes are for default hierarchy only. For legacy
hierarchy, effective_mask and configured_mask are the same, so we won't
break old interfaces.

We should partition sched domains according to effective_cpus, which
is the real cpulist that takes effects on tasks in the cpuset.

This won't introduce behavior change.

v2:
- Add a comment for the call of rebuild_sched_domains(), suggested
by Tejun.

Signed-off-by: Li Zefan
Signed-off-by: Tejun Heo

Li Zefan
2014-07-10 03:56:16 +0800

554b0d1c8 cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty

We're going to have separate user-configured masks and effective ones.

Eventually configured masks can only be changed by writing cpuset.cpus
and cpuset.mems, and they won't be restricted by parent cpuset. While
effective masks reflect cpu/memory hotplug and hierarchical restriction,
and these are the real masks that apply to the tasks in the cpuset.

We calculate effective mask this way:
- top cpuset's effective_mask == online_mask, otherwise
- cpuset's effective_mask == configured_mask & parent effective_mask,
if the result is empty, it inherits parent effective mask.

Those behavior changes are for default hierarchy only. For legacy
hierarchy, effective_mask and configured_mask are the same, so we won't
break old interfaces.

To make cs->effective_{cpus,mems} to be effective masks, we need to
- update the effective masks at hotplug
- update the effective masks at config change
- take on ancestor's mask when the effective mask is empty

The last item is done here.

This won't introduce behavior change.

Signed-off-by: Li Zefan
Signed-off-by: Tejun Heo

Li Zefan
2014-07-10 03:56:16 +0800

734d45130 cpuset: update cs->effective_{cpus, mems} when config changes

We're going to have separate user-configured masks and effective ones.

Eventually configured masks can only be changed by writing cpuset.cpus
and cpuset.mems, and they won't be restricted by parent cpuset. While
effective masks reflect cpu/memory hotplug and hierarchical restriction,
and these are the real masks that apply to the tasks in the cpuset.

We calculate effective mask this way:
- top cpuset's effective_mask == online_mask, otherwise
- cpuset's effective_mask == configured_mask & parent effective_mask,
if the result is empty, it inherits parent effective mask.

Those behavior changes are for default hierarchy only. For legacy
hierarchy, effective_mask and configured_mask are the same, so we won't
break old interfaces.

To make cs->effective_{cpus,mems} to be effective masks, we need to
- update the effective masks at hotplug
- update the effective masks at config change
- take on ancestor's mask when the effective mask is empty

The second item is done here. We don't need to treat root_cs specially
in update_cpumasks_hier().
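How a configuration change flows down the hierarchy can be sketched as a top-down walk in which every node is recomputed from its parent's freshly updated effective mask. The tree layout and the names `struct cs_node`, `update_children()` and `update_hierarchy()` are hypothetical; the real traversal lives in update_cpumasks_hier() and is not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cpuset node in a first-child/next-sibling tree. */
struct cs_node {
	uint64_t configured;	/* what the user wrote */
	uint64_t effective;	/* what actually applies to tasks */
	struct cs_node *child;
	struct cs_node *sibling;
};

/* Recompute every descendant from its parent's effective mask. */
static void update_children(struct cs_node *parent)
{
	struct cs_node *c;

	for (c = parent->child; c != NULL; c = c->sibling) {
		uint64_t m = c->configured & parent->effective;

		/* An empty result inherits the parent's effective mask. */
		c->effective = m ? m : parent->effective;
		update_children(c);
	}
}

/* The top node's effective mask is the online mask; the rest follows. */
static void update_hierarchy(struct cs_node *top, uint64_t online)
{
	top->effective = online;
	update_children(top);
}
```

Parents are always updated before their children, so each child sees its parent's new value, and the top node needs no special case beyond taking the online mask.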

This won't introduce behavior change.

v3:
- add a WARN_ON() to check if effective masks are the same with configured
masks on legacy hierarchy.
- pass trialcs->cpus_allowed to update_cpumasks_hier() and add a comment for
it. Similar change for update_nodemasks_hier(). Suggested by Tejun.

v2:
- revise the comment in update_{cpu,node}masks_hier(), suggested by Tejun.
- fix to use @cp instead of @cs in these two functions.

Signed-off-by: Li Zefan
Signed-off-by: Tejun Heo

Li Zefan
2014-07-10 03:56:16 +0800

1344ab9c2 cpuset: update cpuset->effective_{cpus,mems} at hotplug

We're going to have separate user-configured masks and effective ones.

Eventually configured masks can only be changed by writing cpuset.cpus
and cpuset.mems, and they won't be restricted by parent cpuset. While
effective masks reflect cpu/memory hotplug and hierarchical restriction,
and these are the real masks that apply to the tasks in the cpuset.

We calculate effective mask this way:
- top cpuset's effective_mask == online_mask, otherwise
- cpuset's effective_mask == configured_mask & parent effective_mask,
if the result is empty, it inherits parent effective mask.

Those behavior changes are for default hierarchy only. For legacy
hierarchy, effective_mask and configured_mask are the same, so we won't
break old interfaces.

To make cs->effective_{cpus,mems} to be effective masks, we need to
- update the effective masks at hotplug
- update the effective masks at config change
- take on ancestor's mask when the effective mask is empty

The first item is done here.

This won't introduce behavior change.

Signed-off-by: Li Zefan
Signed-off-by: Tejun Heo

Li Zefan
2014-07-10 03:56:15 +0800

e2b9a3d7d cpuset: add cs->effective_cpus and cs->effective_mems

We're going to have separate user-configured masks and effective ones.

Eventually configured masks can only be changed by writing cpuset.cpus
and cpuset.mems, and they won't be restricted by parent cpuset. While
effective masks reflect cpu/memory hotplug and hierarchical restriction,
and these are the real masks that apply to the tasks in the cpuset.

We calculate effective mask this way:
- top cpuset's effective_mask == online_mask, otherwise
- cpuset's effective_mask == configured_mask & parent effective_mask,
if the result is empty, it inherits parent effective mask.

Those behavior changes are for default hierarchy only. For legacy
hierarchy, effective_mask and configured_mask are the same, so we won't
break old interfaces.

This patch adds the effective masks to struct cpuset and initializes
them. The effective masks of the top cpuset is the same with configured
masks, and a child cpuset inherits its parent's effective masks.

This won't introduce behavior change.

v2:
- s/real_{mems,cpus}_allowed/effective_{mems,cpus}, suggested by Tejun.
- don't init effective masks in cpuset_css_online() if !cgroup_on_dfl.

Signed-off-by: Li Zefan
Signed-off-by: Tejun Heo

Li Zefan
2014-07-10 03:56:15 +0800

09 Jul, 2014

1 commit

aa6ec29be cgroup: remove sane_behavior support on non-default hierarchies

sane_behavior has been used as a development vehicle for the default
unified hierarchy. Now that the default hierarchy is in place, the
flag became redundant and confusing as its usage is allowed on all
hierarchies. There are gonna be either the default hierarchy or
legacy ones. Let's make that clear by removing sane_behavior support
on non-default hierarchies.

This patch replaces cgroup_sane_behavior() with cgroup_on_dfl(). The
comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of
cgroup_on_dfl() with sane_behavior specific part dropped.

On the default and legacy hierarchies w/o sane_behavior, this
shouldn't cause any behavior differences.

Signed-off-by: Tejun Heo
Acked-by: Vivek Goyal
Acked-by: Li Zefan
Cc: Johannes Weiner
Cc: Michal Hocko

Tejun Heo
2014-07-09 22:08:08 +0800

02 Jul, 2014

1 commit

76bb5ab8f cpuset: break kernfs active protection in cpuset_write_resmask()

Writing to either "cpuset.cpus" or "cpuset.mems" file flushes
cpuset_hotplug_work so that cpu or memory hotunplug doesn't end up
migrating tasks off a cpuset after new resources are added to it.

As cpuset_hotplug_work calls into cgroup core via
cgroup_transfer_tasks(), this flushing adds the dependency to cgroup
core locking from cpuset_write_resmask(). This used to be okay because
cgroup interface files were protected by a different mutex; however,
8353da1f91f1 ("cgroup: remove cgroup_tree_mutex") simplified the
cgroup core locking and this dependency became a deadlock hazard -
cgroup file removal performed under cgroup core lock tries to drain
on-going file operation which is trying to flush cpuset_hotplug_work
blocked on the same cgroup core lock.

The locking simplification was done because kernfs added a much
easier way to deal with circular dependencies involving kernfs active
protection. Let's use the same strategy in cpuset and break active
protection in cpuset_write_resmask(). While it isn't the prettiest,
this is a very rare, likely unique, situation which also goes away on
the unified hierarchy.

The commands to trigger the deadlock warning without the patch and the
lockdep output follow.

localhost:/ # mount -t cgroup -o cpuset xxx /cpuset
localhost:/ # mkdir /cpuset/tmp
localhost:/ # echo 1 > /cpuset/tmp/cpuset.cpus
localhost:/ # echo 0 > cpuset/tmp/cpuset.mems
localhost:/ # echo $$ > /cpuset/tmp/tasks
localhost:/ # echo 0 > /sys/devices/system/cpu/cpu1/online

======================================================
[ INFO: possible circular locking dependency detected ]
3.16.0-rc1-0.1-default+ #7 Not tainted
-------------------------------------------------------
kworker/1:0/32649 is trying to acquire lock:
(cgroup_mutex){+.+.+.}, at: [] cgroup_transfer_tasks+0x37/0x150

but task is already holding lock:
(cpuset_hotplug_work){+.+...}, at: [] process_one_work+0x192/0x520

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (cpuset_hotplug_work){+.+...}:
...
-> #1 (s_active#175){++++.+}:
...
-> #0 (cgroup_mutex){+.+.+.}:
...

other info that might help us debug this:

Chain exists of:
cgroup_mutex --> s_active#175 --> cpuset_hotplug_work

Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(cpuset_hotplug_work);
                               lock(s_active#175);
                               lock(cpuset_hotplug_work);
  lock(cgroup_mutex);

*** DEADLOCK ***

2 locks held by kworker/1:0/32649:
#0: ("events"){.+.+.+}, at: [] process_one_work+0x192/0x520
#1: (cpuset_hotplug_work){+.+...}, at: [] process_one_work+0x192/0x520

stack backtrace:
CPU: 1 PID: 32649 Comm: kworker/1:0 Not tainted 3.16.0-rc1-0.1-default+ #7
...
Call Trace:
[] dump_stack+0x72/0x8a
[] print_circular_bug+0x10f/0x120
[] check_prev_add+0x43e/0x4b0
[] validate_chain+0x656/0x7c0
[] __lock_acquire+0x382/0x660
[] lock_acquire+0xf9/0x170
[] mutex_lock_nested+0x6f/0x380
[] cgroup_transfer_tasks+0x37/0x150
[] hotplug_update_tasks_insane+0x110/0x1d0
[] cpuset_hotplug_update_tasks+0x13d/0x180
[] cpuset_hotplug_workfn+0x18c/0x630
[] process_one_work+0x254/0x520
[] worker_thread+0x13d/0x3d0
[] kthread+0xf8/0x100
[] ret_from_fork+0x7c/0xb0

Signed-off-by: Tejun Heo
Reported-by: Li Zefan
Tested-by: Li Zefan

Tejun Heo
2014-07-02 04:42:28 +0800

25 Jun, 2014

1 commit

391acf970 cpuset,mempolicy: fix sleeping function called from invalid context

When running with the kernel (3.15-rc7+), the following bug occurs:
[ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
[ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
[ 9969.441175] INFO: lockdep is turned off.
[ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85
[ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
[ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
[ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
[ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
[ 9969.974071] Call Trace:
[ 9970.003403] [] dump_stack+0x4d/0x66
[ 9970.065074] [] __might_sleep+0xfa/0x130
[ 9970.130743] [] mutex_lock_nested+0x3c/0x4f0
[ 9970.200638] [] ? kmem_cache_alloc+0x1bc/0x210
[ 9970.272610] [] cpuset_mems_allowed+0x27/0x140
[ 9970.344584] [] ? __mpol_dup+0x63/0x150
[ 9970.409282] [] __mpol_dup+0xe5/0x150
[ 9970.471897] [] ? __mpol_dup+0x63/0x150
[ 9970.536585] [] ? copy_process.part.23+0x606/0x1d40
[ 9970.613763] [] ? trace_hardirqs_on+0xd/0x10
[ 9970.683660] [] ? monotonic_to_bootbased+0x2f/0x50
[ 9970.759795] [] copy_process.part.23+0x670/0x1d40
[ 9970.834885] [] do_fork+0xd8/0x380
[ 9970.894375] [] ? __audit_syscall_entry+0x9c/0xf0
[ 9970.969470] [] SyS_clone+0x16/0x20
[ 9971.030011] [] stub_clone+0x69/0x90
[ 9971.091573] [] ? system_call_fastpath+0x16/0x1b

The cause is that cpuset_mems_allowed() tries to take
mutex_lock(&callback_mutex) under rcu_read_lock() (which is held in
__mpol_dup()). In cpuset_mems_allowed(), the access to the cpuset is
already under rcu_read_lock(), so in __mpol_dup() we can shrink the
rcu_read_lock() region to protect only the cpuset access in
current_cpuset_is_being_rebound(), which avoids this bug.
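
The narrowing described above can be sketched in userspace as follows.
This is only an illustration, not the kernel code: every mock_* name, the
rcu_depth counter, and the being_rebound flag are assumptions made up for
this example. The read-side section is modelled as a depth counter, and a
"sleeping" call made while the counter is non-zero is recorded as a
violation.

```c
#include <stdbool.h>

/* Userspace stand-ins; none of this is the kernel implementation. */
static int rcu_depth;         /* > 0 means inside a mock RCU read-side section */
static int sleep_violations;  /* counts sleeping calls made in atomic context */
static bool being_rebound;    /* hypothetical state that must be read under RCU */

static void mock_rcu_read_lock(void)   { rcu_depth++; }
static void mock_rcu_read_unlock(void) { rcu_depth--; }

/* Mock of a function that takes a sleeping lock, as cpuset_mems_allowed() does. */
static void mock_mems_allowed(void)
{
    if (rcu_depth > 0)
        sleep_violations++;   /* a mutex must not be taken here */
}

/* Before: the whole body, including the sleeping call, is inside the section. */
static void dup_policy_wide(void)
{
    mock_rcu_read_lock();
    if (being_rebound)
        mock_mems_allowed();
    mock_rcu_read_unlock();
}

/* After: only the read of the protected state is inside the section. */
static bool mock_is_being_rebound(void)
{
    bool ret;

    mock_rcu_read_lock();
    ret = being_rebound;
    mock_rcu_read_unlock();
    return ret;
}

static void dup_policy_narrow(void)
{
    if (mock_is_being_rebound())
        mock_mems_allowed();  /* runs outside the read-side section */
}
```

With being_rebound set, dup_policy_wide() records one violation while
dup_policy_narrow() records none.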

This patch is a temporary solution that only addresses the bug
mentioned above; it does not fix the long-standing issue with cpuset.mems
rebinding on fork():

"When the forker's task_struct is duplicated (which includes
->mems_allowed) and it races with an update to cpuset_being_rebound
in update_tasks_nodemask() then the task's mems_allowed doesn't get
updated. And the child task's mems_allowed can be wrong if the
cpuset's nodemask changes before the child has been added to the
cgroup's tasklist."

Signed-off-by: Gu Zheng
Acked-by: Li Zefan
Signed-off-by: Tejun Heo
Cc: stable

Gu Zheng
2014-06-25 21:42:11 +0800

10 Jun, 2014

1 commit

14208b0ec Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:
"A lot of activities on cgroup side. Heavy restructuring including
locking simplification took place to improve the code base and enable
implementation of the unified hierarchy, which currently exists behind
a __DEVEL__ mount option. The core support is mostly complete but
individual controllers need further work. To explain the design and
rationales of the unified hierarchy

Documentation/cgroups/unified-hierarchy.txt

is added.

Another notable change is css (cgroup_subsys_state - what each
controller uses to identify and interact with a cgroup) iteration
update. This is part of continuing updates on css object lifetime and
visibility. cgroup started with reference count draining on removal
way back and is now reaching a point where csses behave and are
iterated like normal refcnted objects albeit with some complexities to
allow distinguishing the state where they're being deleted. The css
iteration update isn't taken advantage of yet but is planned to be
used to simplify memcg significantly"

* 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
cgroup: disallow disabled controllers on the default hierarchy
cgroup: don't destroy the default root
cgroup: disallow debug controller on the default hierarchy
cgroup: clean up MAINTAINERS entries
cgroup: implement css_tryget()
device_cgroup: use css_has_online_children() instead of has_children()
cgroup: convert cgroup_has_live_children() into css_has_online_children()
cgroup: use CSS_ONLINE instead of CGRP_DEAD
cgroup: iterate cgroup_subsys_states directly
cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
cgroup: move cgroup->serial_nr into cgroup_subsys_state
cgroup: link all cgroup_subsys_states in their sibling lists
cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
cgroup: remove cgroup->parent
device_cgroup: remove direct access to cgroup->children
memcg: update memcg_has_children() to use css_next_child()
memcg: remove tasks/children test from mem_cgroup_force_empty()
cgroup: remove css_parent()
cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
cgroup: use cgroup->self.refcnt for cgroup refcnting
...

Linus Torvalds
2014-06-10 06:03:33 +0800

05 Jun, 2014

1 commit

664eeddee mm: page_alloc: use jump labels to avoid checking number_of_cpusets

If cpusets are not in use then we still check a global variable on every
page allocation. Use jump labels to avoid the overhead.
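
As a rough userspace illustration of the idea (not the kernel's jump label
machinery, which patches the branch in the instruction stream), the sketch
below guards the expensive check behind a cheap "is the feature in use"
test. The mock_* names, the counter-based key, and the node policy are all
assumptions invented for this example; __builtin_expect is a GCC/Clang
builtin used here only as a branch hint.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a static key: a plain use counter. */
struct mock_static_key {
    int enabled;
};

static struct mock_static_key cpusets_key;  /* zero: no cpusets in use */
static int slow_path_calls;                 /* how often the slow check ran */

static void mock_key_inc(struct mock_static_key *k) { k->enabled++; }
static void mock_key_dec(struct mock_static_key *k) { k->enabled--; }

static bool mock_cpusets_enabled(void)
{
    /* In the kernel this test would be a patched branch, not a memory load. */
    return __builtin_expect(cpusets_key.enabled > 0, 0);
}

/* Hypothetical policy standing in for the per-allocation cpuset check. */
static bool mock_node_allowed_slow(int node)
{
    slow_path_calls++;
    return node == 0;
}

static bool mock_alloc_allowed(int node)
{
    if (!mock_cpusets_enabled())
        return true;                        /* fast path: nothing to check */
    return mock_node_allowed_slow(node);
}
```

While the key is zero, mock_alloc_allowed() never reaches the slow check;
after mock_key_inc() every call does.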

Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Cc: Johannes Weiner
Cc: Vlastimil Babka
Cc: Jan Kara
Cc: Michal Hocko
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Theodore Ts'o
Cc: "Paul E. McKenney"
Cc: Oleg Nesterov
Cc: Peter Zijlstra
Cc: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2014-06-05 07:54:08 +0800

17 May, 2014

1 commit

5c9d535b8 cgroup: remove css_parent()

cgroup in general is moving towards using cgroup_subsys_state as the
fundamental structural component and css_parent() was introduced to
convert from using cgroup->parent to css->parent. It was quite some
time ago and we're moving forward with making css more prominent.

This patch drops the trivial wrapper css_parent() and lets the users
dereference css->parent. While at it, explicitly mark fields of css
which are public and immutable.
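
A minimal userspace sketch of what direct dereferencing looks like for a
caller. struct mock_css and mock_css_depth() are hypothetical and exist
only for this example; the only assumption carried over from the text is
that the state object has a ->parent pointer, taken here to be NULL at the
root.

```c
#include <stddef.h>

/* Hypothetical miniature of a subsystem state with a public parent link. */
struct mock_css {
    struct mock_css *parent;  /* assumed NULL for the root */
};

/* Walk up via ->parent directly instead of going through a wrapper. */
static int mock_css_depth(const struct mock_css *css)
{
    int depth = 0;

    while (css->parent) {
        css = css->parent;
        depth++;
    }
    return depth;
}
```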

v2: New usage from device_cgroup.c converted.

Signed-off-by: Tejun Heo
Acked-by: Michal Hocko
Acked-by: Neil Horman
Acked-by: "David S. Miller"
Acked-by: Li Zefan
Cc: Vivek Goyal
Cc: Jens Axboe
Cc: Peter Zijlstra
Cc: Johannes Weiner

Tejun Heo
2014-05-17 01:22:48 +0800

14 May, 2014

2 commits

451af504d cgroup: replace cftype->write_string() with cftype->write()

Convert all cftype->write_string() users to the new cftype->write()
which maps directly to kernfs write operation and has full access to
kernfs and cgroup contexts. The conversions are mostly mechanical.

* @css and @cft are accessed using of_css() and of_cft() accessors
respectively instead of being specified as arguments.

* Should return @nbytes on success instead of 0.

* @buf is not trimmed automatically. Trim if necessary. Note that
blkcg and netprio don't need this as the parsers already handle
whitespaces.
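
The return-value and trimming points above can be sketched in userspace
like this. It is not a real cftype handler: mock_write(), mock_strstrip(),
and stored_value are hypothetical names for this example, and the handler
takes only the buffer and its length rather than the kernfs context.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

static long stored_value;  /* hypothetical setting updated by the handler */

/* Userspace helper, similar in spirit to the kernel's strstrip(). */
static char *mock_strstrip(char *s)
{
    char *end;

    while (isspace((unsigned char)*s))
        s++;
    end = s + strlen(s);
    while (end > s && isspace((unsigned char)end[-1]))
        end--;
    *end = '\0';
    return s;
}

/* Hypothetical handler in the new style: buffer and length in, length out. */
static ssize_t mock_write(char *buf, size_t nbytes)
{
    char *endp;
    long val;

    buf = mock_strstrip(buf);  /* the buffer is no longer trimmed for us */
    if (*buf == '\0')
        return -EINVAL;

    errno = 0;
    val = strtol(buf, &endp, 10);
    if (errno || *endp != '\0')
        return -EINVAL;

    stored_value = val;
    return (ssize_t)nbytes;    /* success reports the bytes consumed, not 0 */
}
```

Writing " 42\n" (4 bytes) stores 42 and returns 4; writing "4x\n" returns
-EINVAL and leaves the stored value alone.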

cftype->write_string() has no users left after the conversions and is
removed.

While at it, remove unnecessary local variable @p in
cgroup_subtree_control_write() and stale comment about
CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.

This patch doesn't introduce any visible behavior changes.

v2: netprio was missing from conversion. Converted.

Signed-off-by: Tejun Heo
Acked-by: Aristeu Rozanski
Acked-by: Vivek Goyal
Acked-by: Li Zefan
Cc: Jens Axboe
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Neil Horman
Cc: "David S. Miller"

Tejun Heo
2014-05-14 00:16:21 +0800

ec903c0c8 cgroup: rename css_tryget*() to css_tryget_online*()

Unlike the more usual refcnting, what css_tryget() provides is the
distinction between online and offline csses instead of protection
against upping a refcnt which already reached zero. cgroup is
planning to provide actual tryget which fails if the refcnt already
reached zero. Let's rename the existing trygets so that they clearly
indicate that they're about onliness.

I thought about keeping the existing names as-are and introducing new
names for the planned actual tryget; however, given that each
controller participates in the synchronization of the online state, it
seems worthwhile to make it explicit that these functions are about
on/offline state.

Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
to css_tryget_online_from_dir(). This is pure rename.
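
The distinction between the two kinds of tryget can be sketched in
userspace as below. This is a single-threaded illustration only, with no
atomics or percpu refcounting; struct mock_css_ref and the mock_* functions
are hypothetical names for this example, not kernel APIs.

```c
#include <stdbool.h>

/* Hypothetical miniature of a refcounted object with an online flag. */
struct mock_css_ref {
    int refcnt;
    bool online;
};

/* Online-style tryget: fails once offline, even if references remain. */
static bool mock_tryget_online(struct mock_css_ref *css)
{
    if (!css->online)
        return false;
    css->refcnt++;
    return true;
}

/* "Actual" tryget: fails only once the count has already reached zero. */
static bool mock_tryget(struct mock_css_ref *css)
{
    if (css->refcnt == 0)
        return false;
    css->refcnt++;
    return true;
}

static void mock_put(struct mock_css_ref *css)
{
    css->refcnt--;
}
```

An offline object that still has references rejects mock_tryget_online()
but accepts mock_tryget(); once the count drops to zero, both fail.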

v2: cgroup_freezer grew new usages of css_tryget(). Update
accordingly.

Signed-off-by: Tejun Heo
Acked-by: Johannes Weiner
Acked-by: Michal Hocko
Acked-by: Li Zefan
Cc: Vivek Goyal
Cc: Jens Axboe
Cc: Peter Zijlstra
Cc: Paul Mackerras
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo

Tejun Heo
2014-05-14 00:11:01 +0800