15 Oct, 2016

1 commit

  • Pull cgroup updates from Tejun Heo:

    - tracepoints for basic cgroup management operations added

    - kernfs and cgroup path formatting functions updated to behave in the
    style of strlcpy()

    - non-critical bug fixes

    * 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL
    cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent()
    cpuset: fix error handling regression in proc_cpuset_show()
    cgroup: add tracepoints for basic operations
    cgroup: make cgroup_path() and friends behave in the style of strlcpy()
    kernfs: remove kernfs_path_len()
    kernfs: make kernfs_path*() behave in the style of strlcpy()
    kernfs: add dummy implementation of kernfs_path_from_node()

    Linus Torvalds
     

29 Sep, 2016

1 commit

  • 4c737b41de7f ("cgroup: make cgroup_path() and friends behave in the
    style of strlcpy()") botched the conversion of proc_cpuset_show() and
    broke its error handling. It made the function return 0 on failures
    and fail to handle error returns from cgroup_path_ns(). Fix it.

    Reported-by: Dan Carpenter
    Signed-off-by: Tejun Heo

    Tejun Heo
     

13 Sep, 2016

1 commit

  • A discrepancy between cpu_online_mask and cpuset's effective_cpus
    mask is inevitable during hotplug, since cpuset defers updating the
    effective_cpus mask using a workqueue, and during that time nothing
    prevents the system from performing further hotplug operations. For
    that reason, guarantee_online_cpus() walks up the cpuset hierarchy
    until it finds an intersection, under the assumption that the top
    cpuset's effective_cpus mask intersects with cpu_online_mask even
    when such a race occurs.

    However, a sequence of CPU hotplug operations can open a time
    window during which none of the effective CPUs in the top cpuset
    intersect with cpu_online_mask.

    For example when there are 4 possible CPUs 0-3 and only CPU0 is online:

    ========================  ===========================
    cpu_online_mask           top_cpuset.effective_cpus
    ========================  ===========================
    echo 1 > cpu2/online.
    (CPU hotplug notifier woke up hotplug work but
    it is not yet scheduled.)
    [0,2]                     [0]

    echo 0 > cpu0/online.
    (The workqueue is still runnable.)
    [2]                       [0]
    ========================  ===========================

    Now there is no intersection between cpu_online_mask and
    top_cpuset.effective_cpus. Thus invoking sys_sched_setaffinity() at
    this moment can cause following:

    Unable to handle kernel NULL pointer dereference at virtual address 000000d0
    ------------[ cut here ]------------
    Kernel BUG at ffffffc0001389b0 [verbose debug info unavailable]
    Internal error: Oops - BUG: 96000005 [#1] PREEMPT SMP
    Modules linked in:
    CPU: 2 PID: 1420 Comm: taskset Tainted: G W 4.4.8+ #98
    task: ffffffc06a5c4880 ti: ffffffc06e124000 task.ti: ffffffc06e124000
    PC is at guarantee_online_cpus+0x2c/0x58
    LR is at cpuset_cpus_allowed+0x4c/0x6c

    Process taskset (pid: 1420, stack limit = 0xffffffc06e124020)
    Call trace:
    [] guarantee_online_cpus+0x2c/0x58
    [] cpuset_cpus_allowed+0x4c/0x6c
    [] sched_setaffinity+0xc0/0x1ac
    [] SyS_sched_setaffinity+0x98/0xac
    [] el0_svc_naked+0x24/0x28

    The top cpuset's effective_cpus are guaranteed to be identical to
    cpu_online_mask eventually. Hence fall back to cpu_online_mask when
    there is no intersection between top cpuset's effective_cpus and
    cpu_online_mask.
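
    A simplified sketch of the fix in guarantee_online_cpus() (condensed
    from the patch; locking and error handling elided):

    while (!cpumask_intersects(cs->effective_cpus, cpu_online_mask)) {
            cs = parent_cs(cs);
            if (unlikely(!cs)) {
                    /* top cpuset lost the race too; cpu_online_mask is
                     * the eventual value, so fall back to it */
                    cpumask_copy(pmask, cpu_online_mask);
                    return;
            }
    }
    cpumask_and(pmask, cs->effective_cpus, cpu_online_mask);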

    Signed-off-by: Joonwoo Park
    Acked-by: Li Zefan
    Cc: Tejun Heo
    Cc: cgroups@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: # 3.17+
    Signed-off-by: Tejun Heo

    Joonwoo Park
     

10 Aug, 2016

2 commits

  • cgroup_path() and friends used to format the path from the end and
    thus the resulting path usually didn't start at the beginning of
    the passed-in buffer. Also, when the buffer was too small, the
    partial result was truncated from the head rather than the tail,
    and there was no way to tell how long the full path would be. These
    made the functions less robust and more awkward to use.

    With recent updates to kernfs_path(), cgroup_path() and friends can be
    made to behave in strlcpy() style.

    * cgroup_path(), cgroup_path_ns[_locked]() and task_cgroup_path()
    now always return the length of the full path. If the buffer is too
    small, it contains nul-terminated truncated output.

    * All users updated accordingly.

    v2: cgroup_path() usage in kernel/sched/debug.c converted.
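
    A sketch of the resulting calling convention (caller code is
    illustrative):

    char buf[PATH_MAX];
    int len = cgroup_path(cgrp, buf, sizeof(buf));

    /* As with strlcpy(), a return value >= the buffer size means the
     * output was truncated, and tells us how big the buffer must be. */
    if (len >= sizeof(buf))
            pr_warn("cgroup path truncated, need %d bytes\n", len + 1);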

    Signed-off-by: Tejun Heo
    Acked-by: Greg Kroah-Hartman
    Cc: Serge Hallyn
    Cc: Peter Zijlstra

    Tejun Heo
     
  • A new task inherits cpus_allowed and mems_allowed masks from its parent,
    but if someone changes cpuset's config by writing to cpuset.cpus/cpuset.mems
    before this new task is inserted into the cgroup's task list, the new task
    won't be updated accordingly.

    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org

    Zefan Li
     

29 Jul, 2016

1 commit

  • Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
    changing cpuset's mems") added a TIF_MEMDIE and PF_EXITING check,
    but it checks the flags on the current task rather than the given
    one.

    This doesn't make much sense and it is actually wrong. If the
    current task, which updates the nodemask of a cpuset, got killed by
    the OOM killer, then part of the cpuset cgroup's processes would
    have an incompatible nodemask, which is surprising to say the
    least.

    The comment suggests the intention was to skip an OOM victim or an
    exiting task, so we should be checking the given task. But even
    then it would be a layering violation, because it is up to the
    memory allocator to interpret the meaning of TIF_MEMDIE. Simply
    drop both checks. All tasks in the cpuset should simply follow the
    same mask.

    Link: http://lkml.kernel.org/r/1467029719-17602-3-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: David Rientjes
    Cc: Miao Xie
    Cc: Miao Xie
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

20 May, 2016

2 commits

  • An important function for cpusets is cpuset_node_allowed(), which
    optimizes on the fact that if there's a single root CPU set, it
    must be trivially allowed. But the check "nr_cpusets() <= 1" did
    not use the static key machinery to its full potential; this patch
    uses the static key better and converts it to the new static_branch
    API.
    Signed-off-by: Mel Gorman
    Acked-by: Zefan Li
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Lots of code does

    node = next_node(node, XXX);
    if (node == MAX_NUMNODES)
            node = first_node(XXX);

    so create next_node_in() to do this and use it in various places.
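
    The helper folds the wrap-around into a single call; its core is
    essentially (cf. the lib/nodemask.c implementation):

    int __next_node_in(int node, const nodemask_t *srcp)
    {
            int ret = __next_node(node, srcp);

            /* wrap back to the first node once we run off the end */
            if (ret == MAX_NUMNODES)
                    ret = __first_node(srcp);
            return ret;
    }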

    [mhocko@suse.com: use next_node_in() helper]
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Signed-off-by: Michal Hocko
    Cc: Xishi Qiu
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Laura Abbott
    Cc: Hui Zhu
    Cc: Wang Xiaoqiang
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

26 Apr, 2016

1 commit

  • Since e93ad19d0564 ("cpuset: make mm migration asynchronous"), cpuset
    kicks off asynchronous NUMA node migration if necessary during task
    migration and flushes it from cpuset_post_attach_flush() which is
    called at the end of __cgroup_procs_write(). This is to avoid
    performing migration with cgroup_threadgroup_rwsem write-locked which
    can lead to deadlock through dependency on kworker creation.

    memcg has a similar issue with charge moving, so let's convert it to
    an official callback rather than the current one-off cpuset specific
    function. This patch adds cgroup_subsys->post_attach callback and
    makes cpuset register cpuset_post_attach_flush() as its ->post_attach.

    The conversion is mostly one-to-one except that the new callback is
    called under cgroup_mutex. This is to guarantee that no other
    migration operations are started before ->post_attach callbacks are
    finished. cgroup_mutex is one of the outermost mutexes in the
    system and has never been, and shouldn't be, a problem. We could
    add specialized synchronization around __cgroup_procs_write(), but
    I don't think there's any noticeable benefit.
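
    A sketch of the registration in kernel/cpuset.c (other fields
    omitted):

    struct cgroup_subsys cpuset_cgrp_subsys = {
            /* ... existing callbacks ... */
            .post_attach = cpuset_post_attach_flush,
    };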

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: # 4.4+ prerequisite for the next patch

    Tejun Heo
     

22 Mar, 2016

1 commit

  • Pull cgroup namespace support from Tejun Heo:
    "These are changes to implement namespace support for cgroup which has
    been pending for quite some time now. It is very straight-forward and
    only affects what part of cgroup hierarchies are visible.

    After unsharing, mounting a cgroup fs will be scoped to the cgroups
    the task belonged to at the time of unsharing and the cgroup paths
    exposed to userland would be adjusted accordingly"

    * 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: fix and restructure error handling in copy_cgroup_ns()
    cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
    Add FS_USERNS_FLAG to cgroup fs
    cgroup: Add documentation for cgroup namespaces
    cgroup: mount cgroupns-root when inside non-init cgroupns
    kernfs: define kernfs_node_dentry
    cgroup: cgroup namespace setns support
    cgroup: introduce cgroup namespaces
    sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
    kernfs: Add API to generate relative kernfs path

    Linus Torvalds
     

17 Feb, 2016

1 commit

  • Introduce the ability to create a new cgroup namespace. The newly
    created cgroup namespace remembers the cgroup of the process at the
    point of creation of the cgroup namespace (referred to as
    cgroupns-root). The main purpose of cgroup namespaces is to
    virtualize the contents of the /proc/self/cgroup file. Processes
    inside a cgroup namespace are only able to see paths relative to
    their namespace root (unless they are moved outside of their
    cgroupns-root, at which point they will see a relative path from
    their cgroupns-root). For a correctly set up container this enables
    container tools (like libcontainer, lxc, lmctfy, etc.) to create
    completely virtualized containers without leaking the system-level
    cgroup hierarchy to the task. This patch only implements the
    'unshare' part of the cgroupns.
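
    A minimal userspace demo of the unshare part (assumes a kernel with
    cgroup namespaces and a libc exposing CLONE_NEWCGROUP; requires
    CAP_SYS_ADMIN):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
            if (unshare(CLONE_NEWCGROUP) < 0) {
                    perror("unshare(CLONE_NEWCGROUP)");
                    return EXIT_FAILURE;
            }
            /* Paths in /proc/self/cgroup are now shown relative to the
             * cgroupns-root, e.g. "/" instead of the full hierarchy. */
            system("cat /proc/self/cgroup");
            return 0;
    }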

    Signed-off-by: Aditya Kali
    Signed-off-by: Serge Hallyn
    Signed-off-by: Tejun Heo

    Aditya Kali
     

22 Jan, 2016

1 commit

  • If "cpuset.memory_migrate" is set, when a process is moved from one
    cpuset to another with a different memory node mask, pages in used by
    the process are migrated to the new set of nodes. This was performed
    synchronously in the ->attach() callback, which is synchronized
    against process management. Recently, the synchronization was changed
    from per-process rwsem to global percpu rwsem for simplicity and
    optimization.

    Combined with the synchronous mm migration, this led to deadlocks
    because mm migration could schedule a work item which may in turn try
    to create a new worker blocking on the process management lock held
    from cgroup process migration path.

    An operation this heavy shouldn't be performed synchronously from
    that deep inside cgroup migration in the first place. This patch
    punts the actual migration to an ordered workqueue and updates the
    cgroup process migration and cpuset config update paths to flush
    the workqueue after all locks are released. This way, the
    operations still appear synchronous to userland without entangling
    mm migration with process management synchronization. CPU hotplug
    can also invoke mm migration, but there's no reason for it to wait
    for mm migrations, and thus it doesn't synchronize against their
    completions.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Christian Borntraeger
    Cc: stable@vger.kernel.org # v4.4+

    Tejun Heo
     

03 Dec, 2015

2 commits

  • Tejun Heo
     
  • Consider the following v2 hierarchy.

    P0 (+memory) --- P1 (-memory) --- A
                                   \- B

    P0 has memory enabled in its subtree_control while P1 doesn't. If
    both A and B contain processes, they would belong to the memory css
    of P1. Now if memory is enabled on P1's subtree_control, memory
    csses should be created on both A and B, and A's processes should
    be moved to the former and B's processes to the latter. IOW,
    enabling controllers can cause atomic migrations into different
    csses.

    The core cgroup migration logic has been updated accordingly, but
    the controller migration methods haven't, and still assume that all
    tasks migrate to a single target css; furthermore, the methods were
    fed the css in which subtree_control was updated, which is the
    parent of the target csses. The pids controller depends on the
    migration methods to move charges, and this made the controller
    attribute charges to the wrong csses, often triggering the
    following warning by driving a counter negative.

    WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
    Modules linked in:
    CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
    ...
    ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
    ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
    ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
    Call Trace:
    [] dump_stack+0x4e/0x82
    [] warn_slowpath_common+0x82/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] pids_cancel.constprop.6+0x31/0x40
    [] pids_can_attach+0x6d/0xf0
    [] cgroup_taskset_migrate+0x6c/0x330
    [] cgroup_migrate+0xf5/0x190
    [] cgroup_attach_task+0x176/0x200
    [] __cgroup_procs_write+0x2ad/0x460
    [] cgroup_procs_write+0x14/0x20
    [] cgroup_file_write+0x35/0x1c0
    [] kernfs_fop_write+0x141/0x190
    [] __vfs_write+0x28/0xe0
    [] vfs_write+0xac/0x1a0
    [] SyS_write+0x49/0xb0
    [] entry_SYSCALL_64_fastpath+0x12/0x76

    This patch fixes the bug by removing the @css parameter from the
    three migration methods, ->can_attach(), ->cancel_attach() and
    ->attach(), and by updating the cgroup_taskset iteration helpers to
    also return the destination css in addition to the task being
    migrated (see the sketch after the list below). All controllers are
    updated accordingly.

    * Controllers which don't care whether there are one or multiple
    target csses can be converted trivially. cpu, io, freezer, perf,
    netclassid and netprio fall in this category.

    * cpuset's current implementation assumes that there's a single
    source and destination and thus doesn't support v2 hierarchy
    already. The only change made by this patchset is how that single
    destination css is obtained.

    * memory migration path already doesn't do anything on v2. How the
    single destination css is obtained is updated and the prep stage of
    mem_cgroup_can_attach() is reordered to accommodate the change.

    * pids is the only controller which was affected by this bug. It now
    correctly handles multi-destination migrations and no longer causes
    counter underflow from incorrect accounting.
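
    A sketch of the updated iteration (helper and callback signatures
    per this patch; the body is illustrative):

    static int foo_can_attach(struct cgroup_taskset *tset)
    {
            struct cgroup_subsys_state *dst_css;
            struct task_struct *task;

            /* each task is now paired with its own destination css
             * rather than one caller-supplied @css */
            cgroup_taskset_for_each(task, dst_css, tset) {
                    /* charge/validate against dst_css */
            }
            return 0;
    }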

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Daniel Wagner
    Cc: Aleksa Sarai

    Tejun Heo
     

26 Nov, 2015

1 commit

  • The following patch replaces all instances of time_t with time64_t,
    i.e. changes the type used for representing time from 32-bit to
    64-bit. All 32-bit kernels to date use a signed 32-bit time_t type,
    which can only represent time until January 2038. Since embedded
    systems running 32-bit Linux are going to survive beyond that date,
    we have to change all current uses, in a backwards-compatible way.

    The patch also changes the function get_seconds() that returns a 32-bit
    integer to ktime_get_seconds() that returns seconds as 64-bit integer.

    The patch changes the type of ticks from time_t to u32. We keep
    ticks as 32 bits, as the function uses 32-bit arithmetic, which is
    less expensive than 64-bit arithmetic, and the function is expected
    to be called at least once every 32 seconds.
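
    A sketch of the pattern (variable and field names are illustrative
    assumptions, not the patch's exact code):

    time64_t now = ktime_get_seconds(); /* was: time_t + get_seconds() */
    u32 ticks = (u32)(now - fmp->time); /* 32-bit math is fine since the
                                           meter runs at least every 32s */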

    Signed-off-by: Heena Sirwani
    Reviewed-by: Arnd Bergmann
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Tejun Heo

    Arnd Bergmann
     

06 Nov, 2015

2 commits

  • Merge patch-bomb from Andrew Morton:

    - inotify tweaks

    - some ocfs2 updates (many more are awaiting review)

    - various misc bits

    - kernel/watchdog.c updates

    - Some of mm. I have a huge number of MM patches this time and quite a
    lot of it is quite difficult and much will be held over to next time.

    * emailed patches from Andrew Morton: (162 commits)
    selftests: vm: add tests for lock on fault
    mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage
    mm: introduce VM_LOCKONFAULT
    mm: mlock: add new mlock system call
    mm: mlock: refactor mlock, munlock, and munlockall code
    kasan: always taint kernel on report
    mm, slub, kasan: enable user tracking by default with KASAN=y
    kasan: use IS_ALIGNED in memory_is_poisoned_8()
    kasan: Fix a type conversion error
    lib: test_kasan: add some testcases
    kasan: update reference to kasan prototype repo
    kasan: move KASAN_SANITIZE in arch/x86/boot/Makefile
    kasan: various fixes in documentation
    kasan: update log messages
    kasan: accurately determine the type of the bad access
    kasan: update reported bug types for kernel memory accesses
    kasan: update reported bug types for not user nor kernel memory accesses
    mm/kasan: prevent deadlock in kasan reporting
    mm/kasan: don't use kasan shadow pointer in generic functions
    mm/kasan: MODULE_VADDR is not available on all archs
    ...

    Linus Torvalds
     
  • The oom killer takes task_lock() in a couple of places solely to protect
    printing the task's comm.

    A process's comm, including current's comm, may change due to
    /proc/pid/comm or PR_SET_NAME.

    The comm will always be NULL-terminated, so the worst race scenario would
    only be during update. We can tolerate a comm being printed that is in
    the middle of an update to avoid taking the lock.

    Other locations in the kernel have already dropped task_lock() when
    printing comm, so this is consistent.

    Signed-off-by: David Rientjes
    Suggested-by: Oleg Nesterov
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Sergey Senozhatsky
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

16 Oct, 2015

1 commit

  • Currently, cgroup_has_tasks() tests whether the target cgroup has any
    css_set linked to it. This works because a css_set's refcnt converges
    with the number of tasks linked to it and thus there's no css_set
    linked to a cgroup if it doesn't have any live tasks.

    To help tracking resource usage of zombie tasks, putting the ref of
    css_set will be separated from disassociating the task from the
    css_set which means that a cgroup may have css_sets linked to it even
    when it doesn't have any live tasks.

    This patch replaces cgroup_has_tasks() with cgroup_is_populated(),
    which instead tests cgroup->populated_cnt, a local count of
    populated css_sets. Unlike cgroup_has_tasks(),
    cgroup_is_populated() is recursive - if any of the descendants is
    populated, the cgroup is populated too. While this changes the
    meaning of the test, all the existing users are okay with the
    change.

    While at it, replace the open-coded ->populated_cnt test in
    cgroup_events_show() with cgroup_is_populated().

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko

    Tejun Heo
     

23 Sep, 2015

2 commits

  • It wasn't explicitly documented but, when a process is being migrated,
    cpuset and memcg depend on cgroup_taskset_first() returning the
    threadgroup leader; however, this approach is somewhat ghetto and
    would no longer work for the planned multi-process migration.

    This patch introduces explicit cgroup_taskset_for_each_leader() which
    iterates over only the threadgroup leaders and replaces
    cgroup_taskset_first() usages for accessing the leader with it.

    This prepares both memcg and cpuset for multi-process migration. This
    patch also updates the documentation for cgroup_taskset_for_each() to
    clarify the iteration rules and removes comments mentioning task
    ordering in tasksets.

    v2: A previous patch which added threadgroup leader test was dropped.
    Patch updated accordingly.
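
    A sketch of the new helper in use (two-argument form as introduced
    here; the loop body is illustrative):

    struct task_struct *leader;

    /* visits only threadgroup leaders, replacing the old
     * cgroup_taskset_first() idiom */
    cgroup_taskset_for_each_leader(leader, tset) {
            /* per-process work, e.g. memory migration setup */
    }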

    Signed-off-by: Tejun Heo
    Acked-by: Zefan Li
    Acked-by: Michal Hocko
    Cc: Johannes Weiner

    Tejun Heo
     
  • If the memory_migrate flag is set, cpuset migrates memory according
    to the destination css's nodemask. The current implementation
    migrates memory whenever any thread of a process is migrated,
    making the behavior somewhat arbitrary. Let's tie memory operations
    to the threadgroup leader so that memory is migrated only when the
    leader is migrated.

    While this is a behavior change, given the inherent fuzziness, this
    change is not too likely to be noticed, and it allows us to clearly
    define who owns the memory (always the leader) and helps the
    planned atomic multi-process migration.

    Note that we're currently migrating memory in migration path proper
    while holding all the locks. In the long term, this should be moved
    out to an async work item.

    Signed-off-by: Tejun Heo
    Acked-by: Zefan Li

    Tejun Heo
     

19 Sep, 2015

1 commit

  • cftype->mode allows controllers to give arbitrary permissions to
    interface knobs. Except for "cgroup.event_control", the existing uses
    are spurious.

    * Some explicitly specify S_IRUGO | S_IWUSR even though that's the
    default.

    * "cpuset.memory_pressure" specifies S_IRUGO while also setting a
    write callback which returns -EACCES. All it needs to do is simply
    not setting a write callback.

    "cgroup.event_control" uses cftype->mode to make the file
    world-writable. It's a misdesigned interface and we don't want
    controllers to be tweaking interface file permissions in general.
    This patch removes cftype->mode and all its spurious uses and
    implements CFTYPE_WORLD_WRITABLE for "cgroup.event_control" which is
    marked as compatibility-only.
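
    A sketch of the resulting cftype entry (flag name per this patch;
    other fields omitted):

    {
            .name = "cgroup.event_control",  /* compatibility only */
            .flags = CFTYPE_WORLD_WRITABLE,
            /* write/seq callbacks elided */
    },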

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Cc: Johannes Weiner

    Tejun Heo
     

18 Sep, 2015

1 commit

  • cgroup_on_dfl() tests whether the cgroup's root is the default
    hierarchy; however, an individual controller is only interested in
    whether the controller is attached to the default hierarchy and never
    tests a cgroup which doesn't belong to the hierarchy that the
    controller is attached to.

    This patch replaces cgroup_on_dfl() tests in controllers with faster
    static_key based cgroup_subsys_on_dfl(). This leaves cgroup core as
    the only user of cgroup_on_dfl() and the function is moved from the
    header file to cgroup.c.
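
    A sketch of the replacement test (real API; the branch bodies are
    illustrative):

    /* compiles down to a static_key branch instead of dereferencing
     * the cgroup's root */
    if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
            /* default-hierarchy (v2) behavior */
    } else {
            /* legacy-hierarchy behavior */
    }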

    Signed-off-by: Tejun Heo
    Acked-by: Zefan Li
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Michal Hocko

    Tejun Heo
     

10 Aug, 2015

1 commit

  • The comment says it's using trialcs->mems_allowed as a temp
    variable, but the code didn't match. Change the code to match the
    comment.

    This fixes an issue when writing in cpuset.mems when a sub-directory
    exists: we need to write several times for the information to persist:

    | root@alban:/sys/fs/cgroup/cpuset# mkdir footest9
    | root@alban:/sys/fs/cgroup/cpuset# cd footest9
    | root@alban:/sys/fs/cgroup/cpuset/footest9# mkdir aa
    | root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
    |
    | root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems
    | root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
    |
    | root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems
    | root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
    | 0
    | root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems
    |
    | root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > aa/cpuset.mems
    | root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems
    | 0
    | root@alban:/sys/fs/cgroup/cpuset/footest9#

    This should help to fix the following issue in Docker:
    https://github.com/opencontainers/runc/issues/133
    In some conditions, a Docker container needs to be started twice in
    order to work.

    Signed-off-by: Alban Crequy
    Tested-by: Iago López Galeiras
    Cc: # 3.17+
    Acked-by: Li Zefan
    Signed-off-by: Tejun Heo

    Alban Crequy
     

15 Apr, 2015

1 commit

  • Nothing calls __cpuset_node_allowed() with __GFP_THISNODE set anymore, so
    remove the obscure comment about it and its special-case exception.

    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Pravin Shelar
    Cc: Jarno Rajahalme
    Cc: Li Zefan
    Cc: Greg Thelen
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

20 Mar, 2015

1 commit

  • Ensure that cpus specified with the isolcpus= boot commandline
    option stay outside of the load balancing in the kernel scheduler.

    Operations like load balancing can introduce unwanted latencies,
    which is exactly what the isolcpus= commandline is there to prevent.

    Previously, simply creating a new cpuset, without even touching the
    cpuset.cpus field inside the new cpuset, would undo the effects of
    isolcpus=, by creating a scheduler domain spanning the whole system,
    and setting up load balancing inside that domain. The cpuset root
    cpuset.cpus file is read-only, so there was not even a way to undo
    that effect.

    This does not impact the majority of cpusets users, since isolcpus=
    is a fairly specialized feature used for realtime purposes.

    Cc: Peter Zijlstra
    Cc: Clark Williams
    Cc: Li Zefan
    Cc: Ingo Molnar
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: cgroups@vger.kernel.org
    Signed-off-by: Rik van Riel
    Tested-by: David Rientjes
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: David Rientjes
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Rik van Riel
     

03 Mar, 2015

3 commits

  • The cpuset.sched_relax_domain_level file can control how far we do
    immediate load balancing on a system. However, it was found on
    recent kernels that echoing a value into
    cpuset.sched_relax_domain_level did not reduce any immediate load
    balancing.

    This occurred because the update_domain_attr_tree() traversal did
    not update the "top_cpuset", so nothing changed when modifying the
    sched_relax_domain_level parameter.

    This patch addresses the problem by having update_domain_attr_tree()
    allow updates for the root in the cpuset traversal.

    Fixes: fc560a26acce ("cpuset: replace cpuset->stack_list with cpuset_for_each_descendant_pre()")
    Cc: # 3.9+
    Signed-off-by: Jason Low
    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo
    Tested-by: Serge Hallyn

    Jason Low
     
  • When we clear cpuset.cpus, cpuset.effective_cpus won't be cleared:

    # mount -t cgroup -o cpuset xxx /mnt
    # mkdir /mnt/tmp
    # echo 0 > /mnt/tmp/cpuset.cpus
    # echo > /mnt/tmp/cpuset.cpus
    # cat cpuset.cpus

    # cat cpuset.effective_cpus
    0-15

    And a kernel warning in update_cpumasks_hier() is triggered:

    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 4028 at kernel/cpuset.c:894 update_cpumasks_hier+0x471/0x650()

    Cc: # 3.17+
    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo
    Tested-by: Serge Hallyn

    Zefan Li
     
  • If clone_children is enabled, effective masks won't be initialized
    due to the bug:

    # mount -t cgroup -o cpuset xxx /mnt
    # echo 1 > cgroup.clone_children
    # mkdir /mnt/tmp
    # cat /mnt/tmp/cpuset.effective_cpus

    # cat cpuset.cpus
    0-15

    And then this cpuset won't constrain the tasks in it.

    Neither the bug nor the fix has any effect on the unified
    hierarchy, as there's no clone_children flag there any more.

    Reported-by: Christian Brauner
    Reported-by: Serge Hallyn
    Cc: # 3.17+
    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo
    Tested-by: Serge Hallyn

    Zefan Li
     

14 Feb, 2015

1 commit

  • printk and friends can now format bitmaps using '%*pb[l]'. cpumask
    and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
    respectively which can be used to generate the two printf arguments
    necessary to format the specified cpu/nodemask.

    * kernel/cpuset.c::cpuset_print_task_mems_allowed() used a static
    buffer which is protected by a dedicated spinlock. Removed.
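
    A sketch of the new format in use (the helpers are the real macros;
    'cpus' and 'mems' are illustrative locals):

    /* '%*pbl' prints a bitmap as a ranged list such as "0-3,5",
     * '%*pb' as hex; the *_pr_args() macros expand to the width and
     * bits-pointer pair the format expects. */
    pr_info("cpus allowed: %*pbl\n", cpumask_pr_args(cpus));
    pr_info("mems allowed: %*pbl\n", nodemask_pr_args(&mems));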

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

13 Feb, 2015

1 commit

  • The only caller of cpuset_init_current_mems_allowed is the __init
    annotated build_all_zonelists_init, so we can also make the former __init.

    Signed-off-by: Rasmus Villemoes
    Cc: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Vishnu Pratap Singh
    Cc: Pintu Kumar
    Cc: Michal Nazarewicz
    Cc: Mel Gorman
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Tim Chen
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     

12 Dec, 2014

1 commit

  • Pull cgroup update from Tejun Heo:
    "cpuset got simplified a bit. cgroup core got a fix on unified
    hierarchy and grew some effective css related interfaces which will be
    used for blkio support for writeback IO traffic which is currently
    being worked on"

    * 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: implement cgroup_get_e_css()
    cgroup: add cgroup_subsys->css_e_css_changed()
    cgroup: add cgroup_subsys->css_released()
    cgroup: fix the async css offline wait logic in cgroup_subtree_control_write()
    cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write()
    cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask()
    cpuset: lock vs unlock typo
    cpuset: simplify cpuset_node_allowed API
    cpuset: convert callback_mutex to a spinlock

    Linus Torvalds
     

28 Oct, 2014

2 commits

  • How we deal with updates to exclusive cpusets is currently broken.
    As an example, suppose we have an exclusive cpuset composed of
    two cpus: A[cpu0,cpu1]. We can assign SCHED_DEADLINE tasks to it
    up to the allowed bandwidth. If we now want to modify cpusetA's
    cpumask, we have to check that removing a cpu's amount of
    bandwidth doesn't break AC guarantees. The current code doesn't
    check this.

    This patch fixes the problem above, denying an update if the
    new cpumask won't have enough bandwidth for the SCHED_DEADLINE
    tasks that are currently active.

    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Li Zefan
    Cc: cgroups@vger.kernel.org
    Link: http://lkml.kernel.org/r/5433E6AF.5080105@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • Exclusive cpusets are the only way users can restrict SCHED_DEADLINE
    tasks' affinity (performing what is commonly called clustered
    scheduling). Unfortunately, such a thing is currently broken for
    two reasons:

    - No check is performed when the user tries to attach a task to
    an exclusive cpuset (recall that exclusive cpusets have an
    associated maximum allowed bandwidth).

    - Bandwidths of source and destination cpusets are not correctly
    updated after a task is migrated between them.

    This patch fixes both things at once, as they are opposite faces
    of the same coin.

    The check is performed in cpuset_can_attach(), as there aren't any
    points of failure after that function. The update is split in two
    halves. We first reserve bandwidth in the destination cpuset, after
    we pass the check in cpuset_can_attach(). And we then release
    bandwidth from the source cpuset when the task's affinity is
    actually changed. Even if there can be time windows when
    sched_setattr() may erroneously fail in the source cpuset, we are
    fine with it, as we can't perform an atomic update of both cpusets
    at once.

    Reported-by: Daniel Wagner
    Reported-by: Vincent Legout
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Dario Faggioli
    Cc: Michael Trimarchi
    Cc: Fabio Checconi
    Cc: michael@amarulasolutions.com
    Cc: luca.abeni@unitn.it
    Cc: Li Zefan
    Cc: Linus Torvalds
    Cc: cgroups@vger.kernel.org
    Link: http://lkml.kernel.org/r/1411118561-26323-3-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     

27 Oct, 2014

3 commits

  • This will deadlock instead of unlocking.

    Fixes: f73eae8d8384 ('cpuset: simplify cpuset_node_allowed API')
    Signed-off-by: Dan Carpenter
    Acked-by: Vladimir Davydov
    Signed-off-by: Tejun Heo

    Dan Carpenter
     
  • Current cpuset API for checking if a zone/node is allowed to allocate
    from looks rather awkward. We have hardwall and softwall versions of
    cpuset_node_allowed with the softwall version doing literally the same
    as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags.
    If it isn't, the softwall version may check the given node against the
    enclosing hardwall cpuset, which it needs to take the callback lock to
    do.

    Such a distinction was introduced by commit 02a0e53d8227 ("cpuset:
    rework cpuset_zone_allowed api"). Before, we had the only version with
    the __GFP_HARDWALL flag determining its behavior. The purpose of the
    commit was to avoid sleep-in-atomic bugs when someone would mistakenly
    call the function without the __GFP_HARDWALL flag for an atomic
    allocation. The suffixes introduced were intended to make the callers
    think before using the function.

    However, since the callback lock was converted from mutex to spinlock by
    the previous patch, the softwall check function cannot sleep, and these
    precautions are no longer necessary.

    So let's simplify the API back to the single check.
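
    A sketch of the simplified call (per this patch, hardwall vs
    softwall is now selected by the gfp flags alone; the caller is
    illustrative):

    if (!cpuset_node_allowed(node, gfp_mask))
            continue;  /* skip nodes the cpuset forbids */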

    Suggested-by: David Rientjes
    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Vladimir Davydov
     
  • The callback_mutex is only used to synchronize reads/updates of cpusets'
    flags and cpu/node masks. These operations should always proceed fast so
    there's no reason why we can't use a spinlock instead of the mutex.

    Converting the callback_mutex into a spinlock will let us call
    cpuset_zone_allowed_softwall from atomic context. This, in turn,
    makes it possible to simplify the code by merging the hardwall and
    softwall checks into the same function, which is the business of
    the next patch.
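
    A sketch of the conversion (lock name per the patch series):

    static DEFINE_SPINLOCK(callback_lock);  /* was a struct mutex */

    spin_lock_irq(&callback_lock);
    /* read/update cpuset flags and cpu/node masks */
    spin_unlock_irq(&callback_lock);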

    Suggested-by: Zefan Li
    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Vladimir Davydov
     

10 Oct, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "Nothing too interesting. Just a handful of cleanup patches"

    * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    Revert "cgroup: remove redundant variable in cgroup_mount()"
    cgroup: remove redundant variable in cgroup_mount()
    cgroup: fix missing unlock in cgroup_release_agent()
    cgroup: remove CGRP_RELEASABLE flag
    perf/cgroup: Remove perf_put_cgroup()
    cgroup: remove redundant check in cgroup_ino()
    cpuset: simplify proc_cpuset_show()
    cgroup: simplify proc_cgroup_show()
    cgroup: use a per-cgroup work for release agent
    cgroup: remove bogus comments
    cgroup: remove redundant code in cgroup_rmdir()
    cgroup: remove some useless forward declarations
    cgroup: fix a typo in comment.

    Linus Torvalds
     

25 Sep, 2014

1 commit

  • When we change cpuset.memory_spread_{page,slab}, cpuset will flip
    the PF_SPREAD_{PAGE,SLAB} bits of tsk->flags for each task in that
    cpuset. This should be done using atomic bitops, but currently we
    don't, which is broken.

    Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
    when one thread tried to clear PF_USED_MATH while at the same time another
    thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
    the same task.

    Here's the full report:
    https://lkml.org/lkml/2014/9/19/230

    To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.

    v4:
    - updated mm/slab.c. (Fengguang Wu)
    - updated Documentation.
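
    A sketch of the fix (the helpers are those added by the patch; they
    wrap atomic bitops on the separate task->atomic_flags word):

    if (is_spread_page(cs))
            task_set_spread_page(task);    /* set_bit(PFA_SPREAD_PAGE) */
    else
            task_clear_spread_page(task);  /* clear_bit(PFA_SPREAD_PAGE) */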

    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Miao Xie
    Cc: Kees Cook
    Fixes: 950592f7b991 ("cpusets: update tasks' page/slab spread flags in time")
    Cc: # 2.6.31+
    Reported-by: Tetsuo Handa
    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo

    Zefan Li