22 Jun, 2016

2 commits

  • The `events_limit` variable needs to be formatted with %lld and not %ld.
    This fixes the following warning discovered by kbuild test robot:

    kernel/cgroup_pids.c: In function 'pids_events_show':
    kernel/cgroup_pids.c:313:24: warning: format '%ld' expects argument of type
    'long int', but argument 3 has type 'long long int' [-Wformat=]
    seq_printf(sf, "max %ld\n", atomic64_read(&pids->events_limit));
    ^

    tj: Added explicit (s64) cast as atomic64 switches between long long
    and long depending on 32 or 64.

    Signed-off-by: Kenny Yu
    Signed-off-by: Tejun Heo

    Kenny Yu
     
  • This patch adds more visibility into the pids controller when the controller
    rejects a fork request. Whenever fork fails because the limit on the number of
    pids in the cgroup is reached, the controller will log this and also notify the
    newly added cgroups events file. The `max` key in the events file represents
    the number of times fork failed because of the pids controller.

    This change also logs only the first time the `max` event counter is
    incremented. This is to provide a hint to the user to understand why fork
    failed, as users are not yet used to seeing fork failures because of the
    pids controller.

    Signed-off-by: Kenny Yu
    Acked-by: Johannes Weiner cmpxchg.org>
    Signed-off-by: Tejun Heo

    Kenny Yu
     

15 Dec, 2015

1 commit


03 Dec, 2015

3 commits

  • Now that nobody use the "priv" arg passed to can_fork/cancel_fork/fork we can
    kill CGROUP_CANFORK_COUNT/SUBSYS_TAG/etc and cgrp_ss_priv[] in copy_process().

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Tejun Heo

    Oleg Nesterov
     
  • Because accounting resources for the root cgroup sometimes incurs
    measureable overhead for workloads which don't care about cgroup and
    often ends up calculating a number which is available elsewhere in a
    slightly different form, cgroup is not in the business of providing
    system-wide statistics. The pids controller which was introduced
    recently was exposing "pids.current" at the root. This patch disable
    accounting for root cgroup and removes the file from the root
    directory.

    While this is a userland visible behavior change, pids has been
    available only in one version and was badly broken there, so I don't
    think this will be noticeable. If it turns out to be a problem, we
    can reinstate it for v1 hierarchies.

    Signed-off-by: Tejun Heo
    Cc: Aleksa Sarai

    Tejun Heo
     
  • Consider the following v2 hierarchy.

    P0 (+memory) --- P1 (-memory) --- A
    \- B

    P0 has memory enabled in its subtree_control while P1 doesn't. If
    both A and B contain processes, they would belong to the memory css of
    P1. Now if memory is enabled on P1's subtree_control, memory csses
    should be created on both A and B and A's processes should be moved to
    the former and B's processes the latter. IOW, enabling controllers
    can cause atomic migrations into different csses.

    The core cgroup migration logic has been updated accordingly but the
    controller migration methods haven't and still assume that all tasks
    migrate to a single target css; furthermore, the methods were fed the
    css in which subtree_control was updated which is the parent of the
    target csses. pids controller depends on the migration methods to
    move charges and this made the controller attribute charges to the
    wrong csses often triggering the following warning by driving a
    counter negative.

    WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
    Modules linked in:
    CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
    ...
    ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
    ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
    ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
    Call Trace:
    [] dump_stack+0x4e/0x82
    [] warn_slowpath_common+0x82/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] pids_cancel.constprop.6+0x31/0x40
    [] pids_can_attach+0x6d/0xf0
    [] cgroup_taskset_migrate+0x6c/0x330
    [] cgroup_migrate+0xf5/0x190
    [] cgroup_attach_task+0x176/0x200
    [] __cgroup_procs_write+0x2ad/0x460
    [] cgroup_procs_write+0x14/0x20
    [] cgroup_file_write+0x35/0x1c0
    [] kernfs_fop_write+0x141/0x190
    [] __vfs_write+0x28/0xe0
    [] vfs_write+0xac/0x1a0
    [] SyS_write+0x49/0xb0
    [] entry_SYSCALL_64_fastpath+0x12/0x76

    This patch fixes the bug by removing @css parameter from the three
    migration methods, ->can_attach, ->cancel_attach() and ->attach() and
    updating cgroup_taskset iteration helpers also return the destination
    css in addition to the task being migrated. All controllers are
    updated accordingly.

    * Controllers which don't care whether there are one or multiple
    target csses can be converted trivially. cpu, io, freezer, perf,
    netclassid and netprio fall in this category.

    * cpuset's current implementation assumes that there's single source
    and destination and thus doesn't support v2 hierarchy already. The
    only change made by this patchset is how that single destination css
    is obtained.

    * memory migration path already doesn't do anything on v2. How the
    single destination css is obtained is updated and the prep stage of
    mem_cgroup_can_attach() is reordered to accomodate the change.

    * pids is the only controller which was affected by this bug. It now
    correctly handles multi-destination migrations and no longer causes
    counter underflow from incorrect accounting.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Daniel Wagner
    Cc: Aleksa Sarai

    Tejun Heo
     

30 Nov, 2015

2 commits

  • Now that we know that the forking task can't migrate amd the child is always
    moved to the same cgroup by cgroup_post_fork()->css_set_move_task() we can
    change pids_can_fork() and pids_cancel_fork() to just use task_css(current).
    And since we no longer need to pin this css, we can remove pid_fork().

    Note: the patch uses task_css_check(true), perhaps it makes sense to add a
    helper or change task_css_set_check() to take cgroup_threadgroup_rwsem into
    account.

    Signed-off-by: Oleg Nesterov
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Oleg Nesterov
     
  • If the new child migrates to another cgroup before cgroup_post_fork() calls
    subsys->fork(), then both pids_can_attach() and pids_fork() will do the same
    pids_uncharge(old_pids) + pids_charge(pids) sequence twice.

    Change copy_process() to call threadgroup_change_begin/threadgroup_change_end
    unconditionally. percpu_down_read() is cheap and this allows other cleanups,
    see the next changes.

    Also, this way we can unify cgroup_threadgroup_rwsem and dup_mmap_sem.

    Signed-off-by: Oleg Nesterov
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Oleg Nesterov
     

16 Oct, 2015

2 commits

  • pids controller is completely broken in that it uncharges when a task
    exits allowing zombies to escape resource control. With the recent
    updates, cgroup core now maintains cgroup association till task free
    and pids controller can be fixed by uncharging on free instead of
    exit.

    This patch adds cgroup_subsys->free() method and update pids
    controller to use it instead of ->exit() for uncharging.

    Signed-off-by: Tejun Heo
    Cc: Aleksa Sarai

    Tejun Heo
     
  • cgroup_exit() is called when a task exits and disassociates the
    exiting task from its cgroups and half-attach it to the root cgroup.
    This is unnecessary and undesirable.

    No controller actually needs an exiting task to be disassociated with
    non-root cgroups. Both cpu and perf_event controllers update the
    association to the root cgroup from their exit callbacks just to keep
    consistent with the cgroup core behavior.

    Also, this disassociation makes it difficult to track resources held
    by zombies or determine where the zombies came from. Currently, pids
    controller is completely broken as it uncharges on exit and zombies
    always escape the resource restriction. With cgroup association being
    reset on exit, fixing it is pretty painful.

    There's no reason to reset cgroup membership on exit. The zombie can
    be removed from its css_set so that it doesn't show up on
    "cgroup.procs" and thus can't be migrated or interfere with cgroup
    removal. It can still pin and point to the css_set so that its cgroup
    membership is maintained. This patch makes cgroup core keep zombies
    associated with their cgroups at the time of exit.

    * Previous patches decoupled populated_cnt tracking from css_set
    lifetime, so a dying task can be simply unlinked from its css_set
    while pinning and pointing to the css_set. This keeps css_set
    association from task side alive while hiding it from "cgroup.procs"
    and populated_cnt tracking. The css_set reference is dropped when
    the task_struct is freed.

    * ->exit() callback no longer needs the css arguments as the
    associated css never changes once PF_EXITING is set. Removed.

    * cpu and perf_events controllers no longer need ->exit() callbacks.
    There's no reason to explicitly switch away on exit. The final
    schedule out is enough. The callbacks are removed.

    * On traditional hierarchies, nothing changes. "/proc/PID/cgroup"
    still reports "/" for all zombies. On the default hierarchy,
    "/proc/PID/cgroup" keeps reporting the cgroup that the task belonged
    to at the time of exit. If the cgroup gets removed before the task
    is reaped, " (deleted)" is appended.

    v2: Build brekage due to missing dummy cgroup_free() when
    !CONFIG_CGROUP fixed.

    Signed-off-by: Tejun Heo
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo

    Tejun Heo
     

26 Aug, 2015

1 commit

  • Fix incorrect usage of css_get and css_put to put a different css in
    pids_{cancel_,}attach() than the one grabbed in pids_can_attach(). This
    could lead to quite serious memory leakage (and unsafe operations on the
    putted css).

    tj: minor comment update

    Signed-off-by: Aleksa Sarai
    Signed-off-by: Tejun Heo

    Aleksa Sarai
     

15 Jul, 2015

1 commit

  • Adds a new single-purpose PIDs subsystem to limit the number of
    tasks that can be forked inside a cgroup. Essentially this is an
    implementation of RLIMIT_NPROC that applies to a cgroup rather than a
    process tree.

    However, it should be noted that organisational operations (adding and
    removing tasks from a PIDs hierarchy) will *not* be prevented. Rather,
    the number of tasks in the hierarchy cannot exceed the limit through
    forking. This is due to the fact that, in the unified hierarchy, attach
    cannot fail (and it is not possible for a task to overcome its PIDs
    cgroup policy limit by attaching to a child cgroup -- even if migrating
    mid-fork it must be able to fork in the parent first).

    PIDs are fundamentally a global resource, and it is possible to reach
    PID exhaustion inside a cgroup without hitting any reasonable kmemcg
    policy. Once you've hit PID exhaustion, you're only in a marginally
    better state than OOM. This subsystem allows PID exhaustion inside a
    cgroup to be prevented.

    Signed-off-by: Aleksa Sarai
    Signed-off-by: Tejun Heo

    Aleksa Sarai