29 Jul, 2016
1 commit
-
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") has added TIF_MEMDIE and PF_EXITING check but
it is checking the flag on the current task rather than the given one.This doesn't make much sense and it is actually wrong. If the current
task which updates the nodemask of a cpuset got killed by the OOM killer
then a part of the cpuset cgroup processes would have incompatible
nodemask which is surprising to say the least.The comment suggests the intention was to skip oom victim or an exiting
task so we should be checking the given task. But even then it would be
layering violation because it is the memory allocator to interpret the
TIF_MEMDIE meaning. Simply drop both checks. All tasks in the cpuset
should simply follow the same mask.Link: http://lkml.kernel.org/r/1467029719-17602-3-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko
Cc: David Rientjes
Cc: Miao Xie
Cc: Miao Xie
Cc: "Rafael J. Wysocki"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
20 May, 2016
2 commits
-
An important function for cpusets is cpuset_node_allowed(), which
optimizes on the fact if there's a single root CPU set, it must be
trivially allowed. But the check "nr_cpusets()
Signed-off-by: Mel Gorman
Acked-by: Zefan Li
Cc: Jesper Dangaard Brouer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Lots of code does
node = next_node(node, XXX);
if (node == MAX_NUMNODES)
node = first_node(XXX);so create next_node_in() to do this and use it in various places.
[mhocko@suse.com: use next_node_in() helper]
Acked-by: Vlastimil Babka
Acked-by: Michal Hocko
Signed-off-by: Michal Hocko
Cc: Xishi Qiu
Cc: Joonsoo Kim
Cc: David Rientjes
Cc: Naoya Horiguchi
Cc: Laura Abbott
Cc: Hui Zhu
Cc: Wang Xiaoqiang
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
26 Apr, 2016
1 commit
-
Since e93ad19d0564 ("cpuset: make mm migration asynchronous"), cpuset
kicks off asynchronous NUMA node migration if necessary during task
migration and flushes it from cpuset_post_attach_flush() which is
called at the end of __cgroup_procs_write(). This is to avoid
performing migration with cgroup_threadgroup_rwsem write-locked which
can lead to deadlock through dependency on kworker creation.memcg has a similar issue with charge moving, so let's convert it to
an official callback rather than the current one-off cpuset specific
function. This patch adds cgroup_subsys->post_attach callback and
makes cpuset register cpuset_post_attach_flush() as its ->post_attach.The conversion is mostly one-to-one except that the new callback is
called under cgroup_mutex. This is to guarantee that no other
migration operations are started before ->post_attach callbacks are
finished. cgroup_mutex is one of the outermost mutex in the system
and has never been and shouldn't be a problem. We can add specialized
synchronization around __cgroup_procs_write() but I don't think
there's any noticeable benefit.Signed-off-by: Tejun Heo
Cc: Li Zefan
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: # 4.4+ prerequisite for the next patch
22 Mar, 2016
1 commit
-
Pull cgroup namespace support from Tejun Heo:
"These are changes to implement namespace support for cgroup which has
been pending for quite some time now. It is very straight-forward and
only affects what part of cgroup hierarchies are visible.After unsharing, mounting a cgroup fs will be scoped to the cgroups
the task belonged to at the time of unsharing and the cgroup paths
exposed to userland would be adjusted accordingly"* 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: fix and restructure error handling in copy_cgroup_ns()
cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
Add FS_USERNS_FLAG to cgroup fs
cgroup: Add documentation for cgroup namespaces
cgroup: mount cgroupns-root when inside non-init cgroupns
kernfs: define kernfs_node_dentry
cgroup: cgroup namespace setns support
cgroup: introduce cgroup namespaces
sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
kernfs: Add API to generate relative kernfs path
23 Feb, 2016
1 commit
-
Signed-off-by: Tejun Heo
Cc: Li Zefan
Cc: Ingo Molnar
Cc: Peter Zijlstra
17 Feb, 2016
1 commit
-
Introduce the ability to create new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred as cgroupns-root).
The main purpose of cgroup namespace is to virtualize the contents
of /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
they will see a relative path from their cgroupns-root).
For a correctly setup container this enables container-tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking system level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.Signed-off-by: Aditya Kali
Signed-off-by: Serge Hallyn
Signed-off-by: Tejun Heo
22 Jan, 2016
1 commit
-
If "cpuset.memory_migrate" is set, when a process is moved from one
cpuset to another with a different memory node mask, pages in used by
the process are migrated to the new set of nodes. This was performed
synchronously in the ->attach() callback, which is synchronized
against process management. Recently, the synchronization was changed
from per-process rwsem to global percpu rwsem for simplicity and
optimization.Combined with the synchronous mm migration, this led to deadlocks
because mm migration could schedule a work item which may in turn try
to create a new worker blocking on the process management lock held
from cgroup process migration path.This heavy an operation shouldn't be performed synchronously from that
deep inside cgroup migration in the first place. This patch punts the
actual migration to an ordered workqueue and updates cgroup process
migration and cpuset config update paths to flush the workqueue after
all locks are released. This way, the operations still seem
synchronous to userland without entangling mm migration with process
management synchronization. CPU hotplug can also invoke mm migration
but there's no reason for it to wait for mm migrations and thus
doesn't synchronize against their completions.Signed-off-by: Tejun Heo
Reported-and-tested-by: Christian Borntraeger
Cc: stable@vger.kernel.org # v4.4+
03 Dec, 2015
2 commits
-
Consider the following v2 hierarchy.
P0 (+memory) --- P1 (-memory) --- A
\- BP0 has memory enabled in its subtree_control while P1 doesn't. If
both A and B contain processes, they would belong to the memory css of
P1. Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter. IOW, enabling controllers
can cause atomic migrations into different csses.The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses. pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
Modules linked in:
CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
...
ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
Call Trace:
[] dump_stack+0x4e/0x82
[] warn_slowpath_common+0x82/0xc0
[] warn_slowpath_null+0x1a/0x20
[] pids_cancel.constprop.6+0x31/0x40
[] pids_can_attach+0x6d/0xf0
[] cgroup_taskset_migrate+0x6c/0x330
[] cgroup_migrate+0xf5/0x190
[] cgroup_attach_task+0x176/0x200
[] __cgroup_procs_write+0x2ad/0x460
[] cgroup_procs_write+0x14/0x20
[] cgroup_file_write+0x35/0x1c0
[] kernfs_fop_write+0x141/0x190
[] __vfs_write+0x28/0xe0
[] vfs_write+0xac/0x1a0
[] SyS_write+0x49/0xb0
[] entry_SYSCALL_64_fastpath+0x12/0x76This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated. All controllers are
updated accordingly.* Controllers which don't care whether there are one or multiple
target csses can be converted trivially. cpu, io, freezer, perf,
netclassid and netprio fall in this category.* cpuset's current implementation assumes that there's single source
and destination and thus doesn't support v2 hierarchy already. The
only change made by this patchset is how that single destination css
is obtained.* memory migration path already doesn't do anything on v2. How the
single destination css is obtained is updated and the prep stage of
mem_cgroup_can_attach() is reordered to accomodate the change.* pids is the only controller which was affected by this bug. It now
correctly handles multi-destination migrations and no longer causes
counter underflow from incorrect accounting.Signed-off-by: Tejun Heo
Reported-and-tested-by: Daniel Wagner
Cc: Aleksa Sarai
26 Nov, 2015
1 commit
-
The following patch replaces all instances of time_t with time64_t i.e.
change the type used for representing time from 32-bit to 64-bit. All
32-bit kernels to date use a signed 32-bit time_t type, which can only
represent time until January 2038. Since embedded systems running 32-bit
Linux are going to survive beyond that date, we have to change all
current uses, in a backwards compatible way.The patch also changes the function get_seconds() that returns a 32-bit
integer to ktime_get_seconds() that returns seconds as 64-bit integer.The patch changes the type of ticks from time_t to u32. We keep ticks as
32-bits as the function uses 32-bit arithmetic which would prove less
expensive than 64-bit arithmetic and the function is expected to be
called atleast once every 32 seconds.Signed-off-by: Heena Sirwani
Reviewed-by: Arnd Bergmann
Signed-off-by: Arnd Bergmann
Signed-off-by: Tejun Heo
06 Nov, 2015
2 commits
-
Merge patch-bomb from Andrew Morton:
- inotify tweaks
- some ocfs2 updates (many more are awaiting review)
- various misc bits
- kernel/watchdog.c updates
- Some of mm. I have a huge number of MM patches this time and quite a
lot of it is quite difficult and much will be held over to next time.* emailed patches from Andrew Morton : (162 commits)
selftests: vm: add tests for lock on fault
mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage
mm: introduce VM_LOCKONFAULT
mm: mlock: add new mlock system call
mm: mlock: refactor mlock, munlock, and munlockall code
kasan: always taint kernel on report
mm, slub, kasan: enable user tracking by default with KASAN=y
kasan: use IS_ALIGNED in memory_is_poisoned_8()
kasan: Fix a type conversion error
lib: test_kasan: add some testcases
kasan: update reference to kasan prototype repo
kasan: move KASAN_SANITIZE in arch/x86/boot/Makefile
kasan: various fixes in documentation
kasan: update log messages
kasan: accurately determine the type of the bad access
kasan: update reported bug types for kernel memory accesses
kasan: update reported bug types for not user nor kernel memory accesses
mm/kasan: prevent deadlock in kasan reporting
mm/kasan: don't use kasan shadow pointer in generic functions
mm/kasan: MODULE_VADDR is not available on all archs
... -
The oom killer takes task_lock() in a couple of places solely to protect
printing the task's comm.A process's comm, including current's comm, may change due to
/proc/pid/comm or PR_SET_NAME.The comm will always be NULL-terminated, so the worst race scenario would
only be during update. We can tolerate a comm being printed that is in
the middle of an update to avoid taking the lock.Other locations in the kernel have already dropped task_lock() when
printing comm, so this is consistent.Signed-off-by: David Rientjes
Suggested-by: Oleg Nesterov
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: Sergey Senozhatsky
Acked-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
16 Oct, 2015
1 commit
-
Currently, cgroup_has_tasks() tests whether the target cgroup has any
css_set linked to it. This works because a css_set's refcnt converges
with the number of tasks linked to it and thus there's no css_set
linked to a cgroup if it doesn't have any live tasks.To help tracking resource usage of zombie tasks, putting the ref of
css_set will be separated from disassociating the task from the
css_set which means that a cgroup may have css_sets linked to it even
when it doesn't have any live tasks.This patch replaces cgroup_has_tasks() with cgroup_is_populated()
which tests cgroup->nr_populated instead which locally counts the
number of populated css_sets. Unlike cgroup_has_tasks(),
cgroup_is_populated() is recursive - if any of the descendants is
populated, the cgroup is populated too. While this changes the
meaning of the test, all the existing users are okay with the change.While at it, replace the open-coded ->populated_cnt test in
cgroup_events_show() with cgroup_is_populated().Signed-off-by: Tejun Heo
Cc: Li Zefan
Cc: Johannes Weiner
Cc: Michal Hocko
23 Sep, 2015
2 commits
-
It wasn't explicitly documented but, when a process is being migrated,
cpuset and memcg depend on cgroup_taskset_first() returning the
threadgroup leader; however, this approach is somewhat ghetto and
would no longer work for the planned multi-process migration.This patch introduces explicit cgroup_taskset_for_each_leader() which
iterates over only the threadgroup leaders and replaces
cgroup_taskset_first() usages for accessing the leader with it.This prepares both memcg and cpuset for multi-process migration. This
patch also updates the documentation for cgroup_taskset_for_each() to
clarify the iteration rules and removes comments mentioning task
ordering in tasksets.v2: A previous patch which added threadgroup leader test was dropped.
Patch updated accordingly.Signed-off-by: Tejun Heo
Acked-by: Zefan Li
Acked-by: Michal Hocko
Cc: Johannes Weiner -
If memory_migrate flag is set, cpuset migrates memory according to the
destnation css's nodemask. The current implementation migrates memory
whenever any thread of a process is migrated making the behavior
somewhat arbitrary. Let's tie memory operations to the threadgroup
leader so that memory is migrated only when the leader is migrated.While this is a behavior change, given the inherent fuziness, this
change is not too likely to be noticed and allows us to clearly define
who owns the memory (always the leader) and helps the planned atomic
multi-process migration.Note that we're currently migrating memory in migration path proper
while holding all the locks. In the long term, this should be moved
out to an async work item.Signed-off-by: Tejun Heo
Acked-by: Zefan Li
19 Sep, 2015
1 commit
-
cftype->mode allows controllers to give arbitrary permissions to
interface knobs. Except for "cgroup.event_control", the existing uses
are spurious.* Some explicitly specify S_IRUGO | S_IWUSR even though that's the
default.* "cpuset.memory_pressure" specifies S_IRUGO while also setting a
write callback which returns -EACCES. All it needs to do is simply
not setting a write callback."cgroup.event_control" uses cftype->mode to make the file
world-writable. It's a misdesigned interface and we don't want
controllers to be tweaking interface file permissions in general.
This patch removes cftype->mode and all its spurious uses and
implements CFTYPE_WORLD_WRITABLE for "cgroup.event_control" which is
marked as compatibility-only.Signed-off-by: Tejun Heo
Cc: Li Zefan
Cc: Johannes Weiner
18 Sep, 2015
1 commit
-
cgroup_on_dfl() tests whether the cgroup's root is the default
hierarchy; however, an individual controller is only interested in
whether the controller is attached to the default hierarchy and never
tests a cgroup which doesn't belong to the hierarchy that the
controller is attached to.This patch replaces cgroup_on_dfl() tests in controllers with faster
static_key based cgroup_subsys_on_dfl(). This leaves cgroup core as
the only user of cgroup_on_dfl() and the function is moved from the
header file to cgroup.c.Signed-off-by: Tejun Heo
Acked-by: Zefan Li
Cc: Vivek Goyal
Cc: Jens Axboe
Cc: Johannes Weiner
Cc: Michal Hocko
10 Aug, 2015
1 commit
-
The comment says it's using trialcs->mems_allowed as a temp variable but
it didn't match the code. Change the code to match the comment.This fixes an issue when writing in cpuset.mems when a sub-directory
exists: we need to write several times for the information to persist:| root@alban:/sys/fs/cgroup/cpuset# mkdir footest9
| root@alban:/sys/fs/cgroup/cpuset# cd footest9
| root@alban:/sys/fs/cgroup/cpuset/footest9# mkdir aa
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
|
| root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
|
| root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
| 0
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems
|
| root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > aa/cpuset.mems
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems
| 0
| root@alban:/sys/fs/cgroup/cpuset/footest9#This should help to fix the following issue in Docker:
https://github.com/opencontainers/runc/issues/133
In some conditions, a Docker container needs to be started twice in
order to work.Signed-off-by: Alban Crequy
Tested-by: Iago López Galeiras
Cc: # 3.17+
Acked-by: Li Zefan
Signed-off-by: Tejun Heo
15 Apr, 2015
1 commit
-
Nothing calls __cpuset_node_allowed() with __GFP_THISNODE set anymore, so
remove the obscure comment about it and its special-case exception.Signed-off-by: David Rientjes
Acked-by: Vlastimil Babka
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: Joonsoo Kim
Cc: Johannes Weiner
Cc: Mel Gorman
Cc: Pravin Shelar
Cc: Jarno Rajahalme
Cc: Li Zefan
Cc: Greg Thelen
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
20 Mar, 2015
1 commit
-
Ensure that cpus specified with the isolcpus= boot commandline
option stay outside of the load balancing in the kernel scheduler.Operations like load balancing can introduce unwanted latencies,
which is exactly what the isolcpus= commandline is there to prevent.Previously, simply creating a new cpuset, without even touching the
cpuset.cpus field inside the new cpuset, would undo the effects of
isolcpus=, by creating a scheduler domain spanning the whole system,
and setting up load balancing inside that domain. The cpuset root
cpuset.cpus file is read-only, so there was not even a way to undo
that effect.This does not impact the majority of cpusets users, since isolcpus=
is a fairly specialized feature used for realtime purposes.Cc: Peter Zijlstra
Cc: Clark Williams
Cc: Li Zefan
Cc: Ingo Molnar
Cc: Luiz Capitulino
Cc: Mike Galbraith
Cc: cgroups@vger.kernel.org
Signed-off-by: Rik van Riel
Tested-by: David Rientjes
Acked-by: Peter Zijlstra (Intel)
Acked-by: David Rientjes
Acked-by: Zefan Li
Signed-off-by: Tejun Heo
03 Mar, 2015
3 commits
-
The cpuset.sched_relax_domain_level can control how far we do
immediate load balancing on a system. However, it was found on recent
kernels that echo'ing a value into cpuset.sched_relax_domain_level
did not reduce any immediate load balancing.The reason this occurred was because the update_domain_attr_tree() traversal
did not update for the "top_cpuset". This resulted in nothing being changed
when modifying the sched_relax_domain_level parameter.This patch is able to address that problem by having update_domain_attr_tree()
allow updates for the root in the cpuset traversal.Fixes: fc560a26acce ("cpuset: replace cpuset->stack_list with cpuset_for_each_descendant_pre()")
Cc: # 3.9+
Signed-off-by: Jason Low
Signed-off-by: Zefan Li
Signed-off-by: Tejun Heo
Tested-by: Serge Hallyn -
When we clear cpuset.cpus, cpuset.effective_cpus won't be cleared:
# mount -t cgroup -o cpuset xxx /mnt
# mkdir /mnt/tmp
# echo 0 > /mnt/tmp/cpuset.cpus
# echo > /mnt/tmp/cpuset.cpus
# cat cpuset.cpus# cat cpuset.effective_cpus
0-15And a kernel warning in update_cpumasks_hier() is triggered:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 4028 at kernel/cpuset.c:894 update_cpumasks_hier+0x471/0x650()Cc: # 3.17+
Signed-off-by: Zefan Li
Signed-off-by: Tejun Heo
Tested-by: Serge Hallyn -
If clone_children is enabled, effective masks won't be initialized
due to the bug:# mount -t cgroup -o cpuset xxx /mnt
# echo 1 > cgroup.clone_children
# mkdir /mnt/tmp
# cat /mnt/tmp/
# cat cpuset.effective_cpus# cat cpuset.cpus
0-15And then this cpuset won't constrain the tasks in it.
Either the bug or the fix has no effect on unified hierarchy, as
there's no clone_chidren flag there any more.Reported-by: Christian Brauner
Reported-by: Serge Hallyn
Cc: # 3.17+
Signed-off-by: Zefan Li
Signed-off-by: Tejun Heo
Tested-by: Serge Hallyn
14 Feb, 2015
1 commit
-
printk and friends can now format bitmaps using '%*pb[l]'. cpumask
and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
respectively which can be used to generate the two printf arguments
necessary to format the specified cpu/nodemask.* kernel/cpuset.c::cpuset_print_task_mems_allowed() used a static
buffer which is protected by a dedicated spinlock. Removed.Signed-off-by: Tejun Heo
Cc: Li Zefan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
13 Feb, 2015
1 commit
-
The only caller of cpuset_init_current_mems_allowed is the __init
annotated build_all_zonelists_init, so we can also make the former __init.Signed-off-by: Rasmus Villemoes
Cc: Vlastimil Babka
Cc: Rik van Riel
Cc: Joonsoo Kim
Cc: David Rientjes
Cc: Vishnu Pratap Singh
Cc: Pintu Kumar
Cc: Michal Nazarewicz
Cc: Mel Gorman
Cc: Paul Gortmaker
Cc: Peter Zijlstra
Cc: Tim Chen
Cc: Hugh Dickins
Cc: Li Zefan
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Dec, 2014
1 commit
-
Pull cgroup update from Tejun Heo:
"cpuset got simplified a bit. cgroup core got a fix on unified
hierarchy and grew some effective css related interfaces which will be
used for blkio support for writeback IO traffic which is currently
being worked on"* 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: implement cgroup_get_e_css()
cgroup: add cgroup_subsys->css_e_css_changed()
cgroup: add cgroup_subsys->css_released()
cgroup: fix the async css offline wait logic in cgroup_subtree_control_write()
cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write()
cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask()
cpuset: lock vs unlock typo
cpuset: simplify cpuset_node_allowed API
cpuset: convert callback_mutex to a spinlock
28 Oct, 2014
2 commits
-
How we deal with updates to exclusive cpusets is currently broken.
As an example, suppose we have an exclusive cpuset composed of
two cpus: A[cpu0,cpu1]. We can assign SCHED_DEADLINE task to it
up to the allowed bandwidth. If we want now to modify cpusetA's
cpumask, we have to check that removing a cpu's amount of
bandwidth doesn't break AC guarantees. This thing isn't checked
in the current code.This patch fixes the problem above, denying an update if the
new cpumask won't have enough bandwidth for SCHED_DEADLINE tasks
that are currently active.Signed-off-by: Juri Lelli
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Li Zefan
Cc: cgroups@vger.kernel.org
Link: http://lkml.kernel.org/r/5433E6AF.5080105@arm.com
Signed-off-by: Ingo Molnar -
Exclusive cpusets are the only way users can restrict SCHED_DEADLINE tasks
affinity (performing what is commonly called clustered scheduling).
Unfortunately, such thing is currently broken for two reasons:- No check is performed when the user tries to attach a task to
an exlusive cpuset (recall that exclusive cpusets have an
associated maximum allowed bandwidth).- Bandwidths of source and destination cpusets are not correctly
updated after a task is migrated between them.This patch fixes both things at once, as they are opposite faces
of the same coin.The check is performed in cpuset_can_attach(), as there aren't any
points of failure after that function. The updated is split in two
halves. We first reserve bandwidth in the destination cpuset, after
we pass the check in cpuset_can_attach(). And we then release
bandwidth from the source cpuset when the task's affinity is
actually changed. Even if there can be time windows when sched_setattr()
may erroneously fail in the source cpuset, we are fine with it, as
we can't perfom an atomic update of both cpusets at once.Reported-by: Daniel Wagner
Reported-by: Vincent Legout
Signed-off-by: Juri Lelli
Signed-off-by: Peter Zijlstra (Intel)
Cc: Dario Faggioli
Cc: Michael Trimarchi
Cc: Fabio Checconi
Cc: michael@amarulasolutions.com
Cc: luca.abeni@unitn.it
Cc: Li Zefan
Cc: Linus Torvalds
Cc: cgroups@vger.kernel.org
Link: http://lkml.kernel.org/r/1411118561-26323-3-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar
27 Oct, 2014
3 commits
-
This will deadlock instead of unlocking.
Fixes: f73eae8d8384 ('cpuset: simplify cpuset_node_allowed API')
Signed-off-by: Dan Carpenter
Acked-by: Vladimir Davydov
Signed-off-by: Tejun Heo -
Current cpuset API for checking if a zone/node is allowed to allocate
from looks rather awkward. We have hardwall and softwall versions of
cpuset_node_allowed with the softwall version doing literally the same
as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags.
If it isn't, the softwall version may check the given node against the
enclosing hardwall cpuset, which it needs to take the callback lock to
do.Such a distinction was introduced by commit 02a0e53d8227 ("cpuset:
rework cpuset_zone_allowed api"). Before, we had the only version with
the __GFP_HARDWALL flag determining its behavior. The purpose of the
commit was to avoid sleep-in-atomic bugs when someone would mistakenly
call the function without the __GFP_HARDWALL flag for an atomic
allocation. The suffixes introduced were intended to make the callers
think before using the function.However, since the callback lock was converted from mutex to spinlock by
the previous patch, the softwall check function cannot sleep, and these
precautions are no longer necessary.So let's simplify the API back to the single check.
Suggested-by: David Rientjes
Signed-off-by: Vladimir Davydov
Acked-by: Christoph Lameter
Acked-by: Zefan Li
Signed-off-by: Tejun Heo -
The callback_mutex is only used to synchronize reads/updates of cpusets'
flags and cpu/node masks. These operations should always proceed fast so
there's no reason why we can't use a spinlock instead of the mutex.Converting the callback_mutex into a spinlock will let us call
cpuset_zone_allowed_softwall from atomic context. This, in turn, makes
it possible to simplify the code by merging the hardwall and asoftwall
checks into the same function, which is the business of the next patch.Suggested-by: Zefan Li
Signed-off-by: Vladimir Davydov
Acked-by: Christoph Lameter
Acked-by: Zefan Li
Signed-off-by: Tejun Heo
10 Oct, 2014
1 commit
-
Pull cgroup updates from Tejun Heo:
"Nothing too interesting. Just a handful of cleanup patches"* 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
Revert "cgroup: remove redundant variable in cgroup_mount()"
cgroup: remove redundant variable in cgroup_mount()
cgroup: fix missing unlock in cgroup_release_agent()
cgroup: remove CGRP_RELEASABLE flag
perf/cgroup: Remove perf_put_cgroup()
cgroup: remove redundant check in cgroup_ino()
cpuset: simplify proc_cpuset_show()
cgroup: simplify proc_cgroup_show()
cgroup: use a per-cgroup work for release agent
cgroup: remove bogus comments
cgroup: remove redundant code in cgroup_rmdir()
cgroup: remove some useless forward declarations
cgroup: fix a typo in comment.
25 Sep, 2014
1 commit
-
When we change cpuset.memory_spread_{page,slab}, cpuset will flip
PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
This should be done using atomic bitops, but currently we don't,
which is broken.Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
when one thread tried to clear PF_USED_MATH while at the same time another
thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
the same task.Here's the full report:
https://lkml.org/lkml/2014/9/19/230To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.
v4:
- updated mm/slab.c. (Fengguang Wu)
- updated Documentation.Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Miao Xie
Cc: Kees Cook
Fixes: 950592f7b991 ("cpusets: update tasks' page/slab spread flags in time")
Cc: # 2.6.31+
Reported-by: Tetsuo Handa
Signed-off-by: Zefan Li
Signed-off-by: Tejun Heo
19 Sep, 2014
1 commit
-
Use the ONE macro instead of REG, and we can simplify proc_cpuset_show().
Signed-off-by: Zefan Li
Signed-off-by: Tejun Heo
05 Aug, 2014
1 commit
-
Pull cgroup changes from Tejun Heo:
"Mostly changes to get the v2 interface ready. The core features are
mostly ready now and I think it's reasonable to expect to drop the
devel mask in one or two devel cycles at least for a subset of
controllers.- cgroup added a controller dependency mechanism so that block cgroup
can depend on memory cgroup. This will be used to finally support
IO provisioning on the writeback traffic, which is currently being
implemented.- The v2 interface now uses a separate table so that the interface
files for the new interface are explicitly declared in one place.
Each controller will explicitly review and add the files for the
new interface.- cpuset is getting ready for the hierarchical behavior which is in
the similar style with other controllers so that an ancestor's
configuration change doesn't change the descendants' configurations
irreversibly and processes aren't silently migrated when a CPU or
node goes down.All the changes are to the new interface and no behavior changed for
the multiple hierarchies"* 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits)
cpuset: fix the WARN_ON() in update_nodemasks_hier()
cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test
cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core
cgroup: distinguish the default and legacy hierarchies when handling cftypes
cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[]
cpuset: export effective masks to userspace
cpuset: allow writing offlined masks to cpuset.cpus/mems
cpuset: enable onlined cpu/node in effective masks
cpuset: refactor cpuset_hotplug_update_tasks()
cpuset: make cs->{cpus, mems}_allowed as user-configured masks
cpuset: apply cs->effective_{cpus,mems}
cpuset: initialize top_cpuset's configured masks at mount
cpuset: use effective cpumask to build sched domains
cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty
cpuset: update cs->effective_{cpus, mems} when config changes
cpuset: update cpuset->effective_{cpus,mems} at hotplug
cpuset: add cs->effective_cpus and cs->effective_mems
cgroup: clean up sane_behavior handling
...
30 Jul, 2014
1 commit
-
The WARN_ON() is used to check if we break the legal hierarchy, on
which the effective mems should be equal to configured mems.Reported-by: Mike Qiu
Tested-by: Mike Qiu
Signed-off-by: Li Zefan
15 Jul, 2014
1 commit
-
Currently, cgroup_subsys->base_cftypes is used for both the unified
default hierarchy and legacy ones and subsystems can mark each file
with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
only on one of them. This is quite hairy and error-prone. Also, we
may end up exposing interface files to the default hierarchy without
thinking it through.cgroup_subsys will grow two separate cftype arrays and apply each only
on the hierarchies of the matching type. This will allow organizing
cftypes in a lot clearer way and encourage subsystems to scrutinize
the interface which is being exposed in the new default hierarchy.In preparation, this patch renames cgroup_subsys->base_cftypes to
cgroup_subsys->legacy_cftypes. This patch is pure rename.Signed-off-by: Tejun Heo
Acked-by: Neil Horman
Acked-by: Li Zefan
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Vivek Goyal
Cc: Peter Zijlstra
Cc: Paul Mackerras
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
Cc: Aristeu Rozanski
Cc: Aneesh Kumar K.V
10 Jul, 2014
2 commits
-
cpuset.cpus and cpuset.mems are the configured masks, and we need
to export effective masks to userspace, so users know the real
cpus_allowed and mems_allowed that apply to the tasks in a cpuset.v2:
- export those masks unconditionally, suggested by Tejun.Signed-off-by: Li Zefan
Signed-off-by: Tejun Heo -
As the configured masks won't be limited by its parent, and the top
cpuset's masks won't change when hotplug happens, it's natural to
allow writing offlined masks to the configured masks.If on default hierarchy:
# echo 0 > /sys/devices/system/cpu/cpu1/online
# mkdir /cpuset/sub
# echo 1 > /cpuset/sub/cpuset.cpus
# cat /cpuset/sub/cpuset.cpus
1If on legacy hierarchy:
# echo 0 > /sys/devices/system/cpu/cpu1/online
# mkdir /cpuset/sub
# echo 1 > /cpuset/sub/cpuset.cpus
-bash: echo: write error: Invalid argumentNote the checks don't need to be gated by cgroup_on_dfl, because we've
initialized top_cpuset.{cpus,mems}_allowed accordingly in cpuset_bind().Signed-off-by: Li Zefan
Signed-off-by: Tejun Heo