03 Dec, 2015

1 commit

  • Consider the following v2 hierarchy.

    P0 (+memory) --- P1 (-memory) --- A
                                  \- B

    P0 has memory enabled in its subtree_control while P1 doesn't. If
    both A and B contain processes, they would belong to the memory css of
    P1. Now if memory is enabled on P1's subtree_control, memory csses
    should be created on both A and B and A's processes should be moved to
    the former and B's processes the latter. IOW, enabling controllers
    can cause atomic migrations into different csses.

    The core cgroup migration logic has been updated accordingly but the
    controller migration methods haven't and still assume that all tasks
    migrate to a single target css; furthermore, the methods were fed the
    css in which subtree_control was updated which is the parent of the
    target csses. pids controller depends on the migration methods to
    move charges and this made the controller attribute charges to the
    wrong csses often triggering the following warning by driving a
    counter negative.

    WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
    Modules linked in:
    CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
    ...
    ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
    ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
    ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
    Call Trace:
    [] dump_stack+0x4e/0x82
    [] warn_slowpath_common+0x82/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] pids_cancel.constprop.6+0x31/0x40
    [] pids_can_attach+0x6d/0xf0
    [] cgroup_taskset_migrate+0x6c/0x330
    [] cgroup_migrate+0xf5/0x190
    [] cgroup_attach_task+0x176/0x200
    [] __cgroup_procs_write+0x2ad/0x460
    [] cgroup_procs_write+0x14/0x20
    [] cgroup_file_write+0x35/0x1c0
    [] kernfs_fop_write+0x141/0x190
    [] __vfs_write+0x28/0xe0
    [] vfs_write+0xac/0x1a0
    [] SyS_write+0x49/0xb0
    [] entry_SYSCALL_64_fastpath+0x12/0x76

    This patch fixes the bug by removing the @css parameter from the three
    migration methods, ->can_attach(), ->cancel_attach() and ->attach(), and
    updating the cgroup_taskset iteration helpers to also return the
    destination css in addition to the task being migrated. All controllers
    are updated accordingly.

    * Controllers which don't care whether there are one or multiple
    target csses can be converted trivially. cpu, io, freezer, perf,
    netclassid and netprio fall in this category.

    * cpuset's current implementation assumes that there's single source
    and destination and thus doesn't support v2 hierarchy already. The
    only change made by this patchset is how that single destination css
    is obtained.

    * memory migration path already doesn't do anything on v2. How the
    single destination css is obtained is updated and the prep stage of
    mem_cgroup_can_attach() is reordered to accommodate the change.

    * pids is the only controller which was affected by this bug. It now
    correctly handles multi-destination migrations and no longer causes
    counter underflow from incorrect accounting.
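
    The per-task destination idea can be sketched in userspace C. This is
    only an illustration under stated assumptions: struct css, struct task
    and attach_all() below are hypothetical stand-ins, not the kernel's
    types or the actual pids controller code. Each task carries its own
    destination, so a charge is moved to the css that task really enters
    rather than to one css handed in by the caller.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct css { long charges; };            /* one pids-like counter per css */
struct task { struct css *src, *dst; };  /* css the task leaves / enters */

/*
 * Move one charge per task from its source css to that task's own
 * destination css, instead of charging a single caller-supplied css.
 */
static void attach_all(struct task *tasks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		tasks[i].src->charges--;
		tasks[i].dst->charges++;
	}
}

/* Two tasks start out charged to P1; one ends up in A, the other in B. */
static int demo_split_migration(void)
{
	struct css p1 = { 2 }, a = { 0 }, b = { 0 };
	struct task tasks[] = { { &p1, &a }, { &p1, &b } };

	attach_all(tasks, 2);
	return p1.charges == 0 && a.charges == 1 && b.charges == 1;
}
```

    With a single shared destination, both charges would have landed on one
    css and a later uncharge on the other could drive its counter negative.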

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Daniel Wagner
    Cc: Aleksa Sarai

    Tejun Heo
     

16 Nov, 2015

1 commit

  • 6f60eade2433 ("cgroup: generalize obtaining the handles of and
    notifying cgroup files") introduced cftype->file_offset so that the
    handles for per-css file instances can be recorded. These handles
    then can be used, for example, to generate file modified
    notifications.

    Unfortunately, it made the wrong assumption that files are created
    once for a given css and removed on its destruction. Due to the
    dependencies among subsystems, a css may be hidden from userland and
    then later shown again. This is implemented by removing and
    re-creating the affected files, so the associated kernfs_node for a
    given cgroup file may change over time. This incorrect assumption led
    to the corruption of css->files lists.

    Reimplement cftype->file_offset handling so that cgroup_file->kn is
    protected by a lock and updated as files are created and destroyed.
    This also makes keeping them on per-cgroup list unnecessary.

    Signed-off-by: Tejun Heo
    Reported-by: James Sedgwick
    Fixes: 6f60eade2433 ("cgroup: generalize obtaining the handles of and notifying cgroup files")
    Acked-by: Johannes Weiner
    Acked-by: Zefan Li

    Tejun Heo
     

16 Oct, 2015

4 commits

  • cgroup_exit() is called when a task exits and disassociates the
    exiting task from its cgroups and half-attaches it to the root cgroup.
    This is unnecessary and undesirable.

    No controller actually needs an exiting task to be disassociated with
    non-root cgroups. Both cpu and perf_event controllers update the
    association to the root cgroup from their exit callbacks just to keep
    consistent with the cgroup core behavior.

    Also, this disassociation makes it difficult to track resources held
    by zombies or determine where the zombies came from. Currently, pids
    controller is completely broken as it uncharges on exit and zombies
    always escape the resource restriction. With cgroup association being
    reset on exit, fixing it is pretty painful.

    There's no reason to reset cgroup membership on exit. The zombie can
    be removed from its css_set so that it doesn't show up on
    "cgroup.procs" and thus can't be migrated or interfere with cgroup
    removal. It can still pin and point to the css_set so that its cgroup
    membership is maintained. This patch makes cgroup core keep zombies
    associated with their cgroups at the time of exit.

    * Previous patches decoupled populated_cnt tracking from css_set
    lifetime, so a dying task can be simply unlinked from its css_set
    while pinning and pointing to the css_set. This keeps css_set
    association from task side alive while hiding it from "cgroup.procs"
    and populated_cnt tracking. The css_set reference is dropped when
    the task_struct is freed.

    * ->exit() callback no longer needs the css arguments as the
    associated css never changes once PF_EXITING is set. Removed.

    * cpu and perf_events controllers no longer need ->exit() callbacks.
    There's no reason to explicitly switch away on exit. The final
    schedule out is enough. The callbacks are removed.

    * On traditional hierarchies, nothing changes. "/proc/PID/cgroup"
    still reports "/" for all zombies. On the default hierarchy,
    "/proc/PID/cgroup" keeps reporting the cgroup that the task belonged
    to at the time of exit. If the cgroup gets removed before the task
    is reaped, " (deleted)" is appended.

    v2: Build breakage due to missing dummy cgroup_free() when
    !CONFIG_CGROUP fixed.

    Signed-off-by: Tejun Heo
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo

    Tejun Heo
     
  • css_set_rwsem is the inner lock protecting css_sets and is accessed
    from hot paths such as fork and exit. Internally, it has no reason to
    be a rwsem or even mutex. There are no internal blocking operations
    while holding it. This was rwsem because css task iteration used to
    expose it to external iterator users. As the previous patch updated
    css task iteration such that the locking is not leaked to its users,
    there's no reason to keep it a rwsem.

    This patch converts css_set_rwsem to a spinlock and renames it to
    css_set_lock. It uses bh-safe operations as a planned usage needs to
    access it from RCU callback context.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • css_sets are synchronized through css_set_rwsem but the locking scheme
    is kinda bizarre. The hot paths - fork and exit - have to write lock
    the rwsem making the rw part pointless; furthermore, many readers
    already hold cgroup_mutex.

    One of the readers is css task iteration. It read locks the rwsem
    over the entire duration of iteration. This leads to silly locking
    behavior. When cpuset tries to migrate processes of a cgroup to a
    different NUMA node, css_set_rwsem is held across the entire migration
    attempt which can take a long time locking out forking, exiting and
    other cgroup operations.

    This patch updates css task iteration so that it locks css_set_rwsem
    only while the iterator is being advanced. css task iteration
    involves two levels - css_set and task iteration. As css_sets in use
    are practically immutable, simply pinning the current one is enough
    for resuming iteration afterwards. Task iteration is tricky as tasks
    may leave their css_set while iteration is in progress. This is
    solved by keeping track of active iterators and advancing them if
    their next task leaves its css_set.
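
    The iterator-tracking scheme can be sketched as follows. It is a
    single-threaded userspace model, assuming a fixed-size table of active
    iterators; struct cset, struct task_iter and the helper names are
    hypothetical and locking is omitted. The point it shows is that removing
    a task first steps any iterator parked on that task to the following one.

```c
#include <stddef.h>

#define MAX_ITERS 4

struct task { int pid; struct task *prev, *next; };
struct task_iter { struct task *pos; };      /* next task to hand out */

struct cset {
	struct task *head;
	struct task_iter *iters[MAX_ITERS];  /* currently active iterators */
};

static void cset_add(struct cset *cs, struct task *t)  /* push at head */
{
	t->prev = NULL;
	t->next = cs->head;
	if (cs->head)
		cs->head->prev = t;
	cs->head = t;
}

/* Unlink @t, advancing every active iterator whose next task is @t. */
static void cset_remove(struct cset *cs, struct task *t)
{
	for (int i = 0; i < MAX_ITERS; i++)
		if (cs->iters[i] && cs->iters[i]->pos == t)
			cs->iters[i]->pos = t->next;

	if (t->prev)
		t->prev->next = t->next;
	else
		cs->head = t->next;
	if (t->next)
		t->next->prev = t->prev;
	t->prev = t->next = NULL;
}

/* Register @it as active and park it on the first task. */
static void iter_start(struct cset *cs, struct task_iter *it)
{
	it->pos = cs->head;
	for (int i = 0; i < MAX_ITERS; i++)
		if (!cs->iters[i]) {
			cs->iters[i] = it;
			break;
		}
}

/* In the real code the lock would be held only around this step. */
static struct task *iter_next(struct task_iter *it)
{
	struct task *t = it->pos;

	if (t)
		it->pos = t->next;
	return t;
}
```

    For a list 1, 2, 3, an iterator that has returned task 1 and then sees
    task 2 removed returns task 3 next instead of following a stale pointer.
    Unregistering an iterator when it finishes is left out of the sketch.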

    v2: put_task_struct() in css_task_iter_next() moved outside
    css_set_rwsem. A later patch will add cgroup operations to
    task_struct free path which may grab the same lock and this avoids
    deadlock possibilities.

    css_set_move_task() updated to use list_for_each_entry_safe() when
    walking task_iters and advancing them. This is necessary as
    advancing an iter may remove it from the list.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • Currently, cgroup_has_tasks() tests whether the target cgroup has any
    css_set linked to it. This works because a css_set's refcnt converges
    with the number of tasks linked to it and thus there's no css_set
    linked to a cgroup if it doesn't have any live tasks.

    To help tracking resource usage of zombie tasks, putting the ref of
    css_set will be separated from disassociating the task from the
    css_set which means that a cgroup may have css_sets linked to it even
    when it doesn't have any live tasks.

    This patch replaces cgroup_has_tasks() with cgroup_is_populated()
    which tests cgroup->nr_populated instead which locally counts the
    number of populated css_sets. Unlike cgroup_has_tasks(),
    cgroup_is_populated() is recursive - if any of the descendants is
    populated, the cgroup is populated too. While this changes the
    meaning of the test, all the existing users are okay with the change.

    While at it, replace the open-coded ->populated_cnt test in
    cgroup_events_show() with cgroup_is_populated().
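
    A rough model of the recursive populated test, assuming each cgroup keeps
    one counter covering its own populated css_sets plus its populated
    children. struct cgrp, update_populated() and is_populated() are
    hypothetical names for this sketch, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>

struct cgrp {
	struct cgrp *parent;
	int populated_cnt;   /* populated css_sets here + populated children */
};

/*
 * One css_set of @cg became populated (true) or empty (false).
 * Only a 0 <-> 1 transition is visible to the parent, so propagation
 * stops as soon as a level does not change state.
 */
static void update_populated(struct cgrp *cg, bool populated)
{
	for (; cg; cg = cg->parent) {
		bool changed;

		if (populated)
			changed = (cg->populated_cnt++ == 0);
		else
			changed = (--cg->populated_cnt == 0);
		if (!changed)
			break;
	}
}

/* True if @cg or any of its descendants has a populated css_set. */
static bool is_populated(const struct cgrp *cg)
{
	return cg->populated_cnt > 0;
}
```

    Populating a leaf therefore makes every ancestor report populated, which
    is the recursive behavior described above.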

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko

    Tejun Heo
     

23 Sep, 2015

1 commit

  • It wasn't explicitly documented but, when a process is being migrated,
    cpuset and memcg depend on cgroup_taskset_first() returning the
    threadgroup leader; however, this approach is somewhat ghetto and
    would no longer work for the planned multi-process migration.

    This patch introduces explicit cgroup_taskset_for_each_leader() which
    iterates over only the threadgroup leaders and replaces
    cgroup_taskset_first() usages for accessing the leader with it.

    This prepares both memcg and cpuset for multi-process migration. This
    patch also updates the documentation for cgroup_taskset_for_each() to
    clarify the iteration rules and removes comments mentioning task
    ordering in tasksets.

    v2: A previous patch which added threadgroup leader test was dropped.
    Patch updated accordingly.

    Signed-off-by: Tejun Heo
    Acked-by: Zefan Li
    Acked-by: Michal Hocko
    Cc: Johannes Weiner

    Tejun Heo
     

19 Sep, 2015

1 commit

  • cgroup core handles creations and removals of cgroup interface files
    as described by cftypes. There are cases where the handle for a given
    file instance is necessary, for example, to generate a file modified
    event. Currently, this is handled by explicitly matching the callback
    method pointer and storing the file handle manually in
    cgroup_add_file(). While this simple approach works for cgroup core
    files, it can't for controller interface files.

    This patch generalizes cgroup interface file handle handling. struct
    cgroup_file is defined and each cftype can optionally tell cgroup core
    to store the file handle by setting ->file_offset. A file handle
    remains accessible as long as the containing css is accessible.

    Both "cgroup.procs" and "cgroup.events" are converted to use the new
    generic mechanism instead of hooking directly into cgroup_add_file().
    Also, cgroup_file_notify() which takes a struct cgroup_file and
    generates a file modified event on it is added and replaces explicit
    kernfs_notify() invocations.

    This generalizes cgroup file handle handling and allows controllers to
    generate file modified notifications.

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Cc: Johannes Weiner

    Tejun Heo
     

18 Sep, 2015

2 commits

  • cgroup_on_dfl() tests whether the cgroup's root is the default
    hierarchy; however, an individual controller is only interested in
    whether the controller is attached to the default hierarchy and never
    tests a cgroup which doesn't belong to the hierarchy that the
    controller is attached to.

    This patch replaces cgroup_on_dfl() tests in controllers with faster
    static_key based cgroup_subsys_on_dfl(). This leaves cgroup core as
    the only user of cgroup_on_dfl() and the function is moved from the
    header file to cgroup.c.

    Signed-off-by: Tejun Heo
    Acked-by: Zefan Li
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Michal Hocko

    Tejun Heo
     
  • Whether a subsys is enabled and attached to the default hierarchy
    seldom changes and may be tested in the hot paths. This patch
    implements static_key based cgroup_subsys_enabled() and
    cgroup_subsys_on_dfl() tests.

    The following patches will update the users and remove duplicate
    mechanisms.

    Signed-off-by: Tejun Heo
    Acked-by: Zefan Li

    Tejun Heo
     

26 Aug, 2015

1 commit


05 Aug, 2015

1 commit

  • Traditionally, each cgroup controller implemented whatever interface
    it wanted leading to interfaces which are widely inconsistent.
    Examining the requirements of the controllers readily shows that there
    are only a few control schemes shared among all.

    Two major controllers already had to implement new interface for the
    unified hierarchy due to significant structural changes. Let's take
    the chance to establish common conventions throughout all controllers.

    This patch defines CGROUP_WEIGHT_MIN/DFL/MAX to be used on all weight
    based control knobs and documents the conventions that controllers
    should follow on the unified hierarchy. Except for io.weight knob,
    all existing unified hierarchy knobs are already compliant. A
    follow-up patch will update io.weight.

    v2: Added descriptions of min, low and high knobs.

    Signed-off-by: Tejun Heo
    Acked-by: Johannes Weiner
    Cc: Li Zefan
    Cc: Peter Zijlstra

    Tejun Heo
     

15 Jul, 2015

1 commit

  • Add a new cgroup subsystem callback can_fork that conditionally
    states whether or not the fork is accepted or rejected by a cgroup
    policy. In addition, add a cancel_fork callback so that if an error
    occurs later in the forking process, any state modified by can_fork can
    be reverted.

    Allow for a private opaque pointer to be passed from cgroup_can_fork to
    cgroup_post_fork, allowing for the fork state to be stored by each
    subsystem separately.
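
    The can_fork/cancel_fork pairing might look roughly like this for a
    counter-style policy. This is a userspace sketch: struct pids_like and
    do_fork() are hypothetical, the callbacks do not use the real kernel
    signatures, and the private pointer mentioned above is left out.

```c
#include <errno.h>

/* Hypothetical counter-based policy: at most @limit tasks. */
struct pids_like { long count, limit; };

/* Accept or reject the fork; on success a charge is taken up front. */
static int can_fork(struct pids_like *p)
{
	if (p->count >= p->limit)
		return -EAGAIN;
	p->count++;
	return 0;
}

/* Undo whatever can_fork() changed. */
static void cancel_fork(struct pids_like *p)
{
	p->count--;
}

/*
 * Fork path: @later_error simulates a failure that happens after
 * can_fork() has already succeeded (0 means no failure).
 */
static int do_fork(struct pids_like *p, int later_error)
{
	int ret = can_fork(p);

	if (ret)
		return ret;              /* rejected by policy */
	if (later_error) {
		cancel_fork(p);          /* revert the charge */
		return later_error;
	}
	return 0;                        /* post_fork would run here */
}
```

    A fork that fails after the policy check leaves the counter unchanged,
    which is the reason for having a separate cancel step.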

    Also add a tagging system for cgroup_subsys.h to allow for CGROUP_
    enumerations to be defined and used. In addition, explicitly add a
    CGROUP_CANFORK_COUNT macro to make arrays easier to define.

    This is in preparation for implementing the pids cgroup subsystem.

    Signed-off-by: Aleksa Sarai
    Signed-off-by: Tejun Heo

    Aleksa Sarai
     

27 Jun, 2015

1 commit

  • Pull cgroup updates from Tejun Heo:

    - threadgroup_lock got reorganized so that its users can pick the
    actual locking mechanism to use. Its only user - cgroups - is
    updated to use a percpu_rwsem instead of per-process rwsem.

    This makes things a bit lighter on hot paths and allows cgroups to
    perform and fail multi-task (a process) migrations atomically.
    Multi-task migrations are used in several places including the
    unified hierarchy.

    - Delegation rule and documentation added to unified hierarchy. This
    will likely be the last interface update from the cgroup core side
    for unified hierarchy before lifting the devel mask.

    - Some groundwork for the pids controller which is scheduled to be
    merged in the coming devel cycle.

    * 'for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: add delegation section to unified hierarchy documentation
    cgroup: require write perm on common ancestor when moving processes on the default hierarchy
    cgroup: separate out cgroup_procs_write_permission() from __cgroup_procs_write()
    kernfs: make kernfs_get_inode() public
    MAINTAINERS: add a cgroup core co-maintainer
    cgroup: fix uninitialised iterator in for_each_subsys_which
    cgroup: replace explicit ss_mask checking with for_each_subsys_which
    cgroup: use bitmask to filter for_each_subsys
    cgroup: add seq_file forward declaration for struct cftype
    cgroup: simplify threadgroup locking
    sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem
    sched, cgroup: reorganize threadgroup locking
    cgroup: switch to unsigned long for bitmasks
    cgroup: reorganize include/linux/cgroup.h
    cgroup: separate out include/linux/cgroup-defs.h
    cgroup: fix some comment typos

    Linus Torvalds
     

02 Jun, 2015

1 commit

  • bio_associate_current() currently open codes task_css() and
    css_tryget_online() to find and pin $current's blkcg css. Abstract it
    into task_get_css() which is implemented from cgroup side. As a task
    is always associated with an online css for every subsystem except
    while the css_set update is propagating, task_get_css() retries till
    css_tryget_online() succeeds.
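
    The retry loop can be modeled in userspace as below. The types are
    simplified stand-ins, and propagate() is a hypothetical helper that plays
    the role of the concurrent css_set update finishing; the real function
    runs under RCU and simply re-reads the task's css.

```c
#include <stdbool.h>
#include <stddef.h>

struct css { int refcnt; bool online; };

struct task {
	struct css *css;      /* css currently seen through the task */
	struct css *pending;  /* css an in-flight update will install */
};

/* Take a reference only if the css is still online. */
static bool css_tryget_online(struct css *c)
{
	if (!c->online)
		return false;
	c->refcnt++;
	return true;
}

/* Stand-in for the concurrent css_set update completing. */
static void propagate(struct task *t)
{
	if (t->pending) {
		t->css = t->pending;
		t->pending = NULL;
	}
}

/*
 * Return the task's css with a reference held. Assumes the invariant
 * from the text: the task points at an online css except while an
 * update is propagating, so the loop terminates.
 */
static struct css *task_get_css(struct task *t)
{
	for (;;) {
		struct css *c = t->css;   /* re-read on every attempt */

		if (css_tryget_online(c))
			return c;
		propagate(t);
	}
}
```

    If the first attempt hits a css that has just gone offline, the next
    attempt sees the newly installed css and pins that one instead.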

    This is a cleanup and shouldn't lead to noticeable behavior changes.

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Cc: Jens Axboe
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

19 May, 2015

2 commits

  • From c4d440938b5e2015c70594fe6666a099c844f929 Mon Sep 17 00:00:00 2001
    From: Tejun Heo
    Date: Wed, 13 May 2015 16:21:40 -0400

    Over time, cgroup.h grew organically and doesn't have much logical
    structure at this point. Separation of cgroup-defs.h in the previous
    patch gives us a good chance for reorganizing cgroup.h as changes to
    the header are likely to cause conflicts anyway.

    This patch reorganizes cgroup.h so that it has consistent logical
    grouping.

    This is pure reorganization.

    v2: Relocating #ifdef CONFIG_CGROUPS caused build failure when cgroup
    is disabled. Dropped.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • From 2d728f74bfc071df06773e2fd7577dd5dab6425d Mon Sep 17 00:00:00 2001
    From: Tejun Heo
    Date: Wed, 13 May 2015 15:37:01 -0400

    This patch separates out cgroup-defs.h from cgroup.h which has grown a
    lot of dependencies. cgroup-defs.h currently only contains constant
    and type definitions and can be used to break circular include
    dependency. While moving, definitions are reordered so that
    cgroup-defs.h has consistent logical structure.

    This patch is pure reorganization.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

07 Jan, 2015

1 commit


12 Dec, 2014

1 commit

  • Pull cgroup update from Tejun Heo:
    "cpuset got simplified a bit. cgroup core got a fix on unified
    hierarchy and grew some effective css related interfaces which will be
    used for blkio support for writeback IO traffic which is currently
    being worked on"

    * 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: implement cgroup_get_e_css()
    cgroup: add cgroup_subsys->css_e_css_changed()
    cgroup: add cgroup_subsys->css_released()
    cgroup: fix the async css offline wait logic in cgroup_subtree_control_write()
    cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write()
    cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask()
    cpuset: lock vs unlock typo
    cpuset: simplify cpuset_node_allowed API
    cpuset: convert callback_mutex to a spinlock

    Linus Torvalds
     

11 Dec, 2014

2 commits

  • Merge first patchbomb from Andrew Morton:
    - a few minor cifs fixes
    - dma-debug updates
    - ocfs2
    - slab
    - about half of MM
    - procfs
    - kernel/exit.c
    - panic.c tweaks
    - printk updates
    - lib/ updates
    - checkpatch updates
    - fs/binfmt updates
    - the drivers/rtc tree
    - nilfs
    - kmod fixes
    - more kernel/exit.c
    - various other misc tweaks and fixes

    * emailed patches from Andrew Morton: (190 commits)
    exit: pidns: fix/update the comments in zap_pid_ns_processes()
    exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
    exit: exit_notify: re-use "dead" list to autoreap current
    exit: reparent: call forget_original_parent() under tasklist_lock
    exit: reparent: avoid find_new_reaper() if no children
    exit: reparent: introduce find_alive_thread()
    exit: reparent: introduce find_child_reaper()
    exit: reparent: document the ->has_child_subreaper checks
    exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper()
    exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparenting
    exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
    exit: proc: don't try to flush /proc/tgid/task/tgid
    exit: release_task: fix the comment about group leader accounting
    exit: wait: drop tasklist_lock before psig->c* accounting
    exit: wait: don't use zombie->real_parent
    exit: wait: cleanup the ptrace_reparented() checks
    usermodehelper: kill the kmod_thread_locker logic
    usermodehelper: don't use CLONE_VFORK for ____call_usermodehelper()
    fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp
    nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races
    ...

    Linus Torvalds
     
  • Charges currently pin the css indirectly by playing tricks during
    css_offline(): user pages stall the offlining process until all of them
    have been reparented, whereas kmemcg acquires a keep-alive reference if
    outstanding kernel pages are detected at that point.

    In preparation for removing all this complexity, make the pinning explicit
    and acquire a css reference for every charged page.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

20 Nov, 2014

1 commit


18 Nov, 2014

3 commits

  • Implement cgroup_get_e_css() which finds and gets the effective css
    for the specified cgroup and subsystem combination. This function
    always returns a valid pinned css. This will be used by cgroup
    writeback support.

    While at it, add comment to cgroup_e_css() to explain why that
    function is different from cgroup_get_e_css() and has to test
    cgrp->child_subsys_mask instead of cgroup_css(cgrp, ss).
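
    A simplified model of the "find and pin the effective css" lookup for a
    single subsystem. It assumes the effective css is the nearest online one
    on the path to the root, with a root css that always exists as the
    fallback; struct cgrp and get_e_css() are hypothetical names for this
    sketch.

```c
#include <stdbool.h>
#include <stddef.h>

struct css { int refcnt; bool online; };

struct cgrp {
	struct cgrp *parent;
	struct css *css;      /* NULL if the subsystem is not enabled here */
};

/*
 * Walk from @cg towards the root and pin the first online css found.
 * If none is found, fall back to @root_css, so the caller always gets
 * a valid css with a reference held.
 */
static struct css *get_e_css(struct cgrp *cg, struct css *root_css)
{
	for (; cg; cg = cg->parent) {
		struct css *c = cg->css;

		if (c && c->online) {
			c->refcnt++;
			return c;
		}
	}
	root_css->refcnt++;
	return root_css;
}
```

    The caller is expected to drop the reference when done; that part is not
    shown.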

    Signed-off-by: Tejun Heo
    Acked-by: Zefan Li

    Tejun Heo
     
  • Add a new cgroup_subsys operation ->css_e_css_changed(). This is
    invoked if any of the effective csses seen from the css's cgroup may
    have changed. This will be used to implement cgroup writeback
    support.

    Signed-off-by: Tejun Heo
    Acked-by: Zefan Li

    Tejun Heo
     
  • Add a new cgroup subsys callback css_released(). This is called when
    the reference count of the css (cgroup_subsys_state) reaches zero
    before RCU scheduling free.

    Signed-off-by: Tejun Heo
    Acked-by: Zefan Li

    Tejun Heo
     

19 Sep, 2014

4 commits

  • We call put_css_set() after setting CGRP_RELEASABLE flag in
    cgroup_task_migrate(), but in other places we call it without setting
    the flag. I don't see the necessity of this flag.

    Moreover once the flag is set, it will never be cleared, unless writing
    to the notify_on_release control file, so it can be quite confusing
    if we look at the output of debug.releasable.

    # mount -t cgroup -o debug xxx /cgroup
    # mkdir /cgroup/child
    # cat /cgroup/child/debug.releasable
    0
    # echo $$ > /cgroup/child/tasks
    # cat /cgroup/child/debug.releasable
    0
    # echo $$ > /cgroup/tasks && echo $$ > /cgroup/child/tasks
    # cat /cgroup/child/debug.releasable
    1

    Signed-off-by: Tejun Heo

    Zefan Li
     
  • After we implemented default unified hierarchy, cgrp->kn can never
    be NULL.

    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo

    Zefan Li
     
  • Use the ONE macro instead of REG, and we can simplify proc_cgroup_show().

    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo

    Zefan Li
     
  • Instead of using a global work to schedule release agent on removable
    cgroups, we change to use a per-cgroup work to do this, which makes
    the code much simpler.

    v2: use a dedicated work instead of reusing css->destroy_work. (Tejun)

    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo

    Zefan Li
     

18 Sep, 2014

1 commit


15 Jul, 2014

4 commits

  • cgroup now distinguishes cftypes for the default and legacy
    hierarchies more explicitly by using separate arrays and
    CFTYPE_ONLY_ON_DFL and CFTYPE_INSANE should be and are used only
    inside cgroup core proper. Let's make it clear that the flags are
    internal by prefixing them with double underscores.

    CFTYPE_INSANE is renamed to __CFTYPE_NOT_ON_DFL for consistency. The
    two flags are also collected and assigned bits >= 16 so that they
    aren't mixed with the published flags.

    v2: Convert the extra ones in cgroup_exit_cftypes() which are added by
    revision to the previous patch.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Until now, cftype arrays carried files for both the default and legacy
    hierarchies and the files which needed to be used on only one of them
    were flagged with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE. This
    gets confusing very quickly and we may end up exposing interface files
    to the default hierarchy without thinking it through.

    This patch makes cgroup core provide separate sets of interfaces for
    cftype handling so that the cftypes for the default and legacy
    hierarchies are clearly distinguished. The previous two patches
    renamed the existing ones so that they clearly indicate that they're
    for the legacy hierarchies. This patch adds the interfaces for the
    default hierarchy and applies them selectively depending on the
    hierarchy type.

    * cftypes added through cgroup_subsys->dfl_cftypes and
    cgroup_add_dfl_cftypes() only show up on the default hierarchy.

    * cftypes added through cgroup_subsys->legacy_cftypes and
    cgroup_add_legacy_cftypes() only show up on the legacy hierarchies.

    * cgroup_subsys->dfl_cftypes and ->legacy_cftypes can point to the
    same array for the cases where the interface files are identical on
    both types of hierarchies.

    * This makes all the existing subsystem interface files legacy-only by
    default and all subsystems will have no interface file created when
    enabled on the default hierarchy. Each subsystem should explicitly
    review and compose the interface for the default hierarchy.

    * A boot param "cgroup__DEVEL__legacy_files_on_dfl" is added which
    makes subsystems that haven't decided on their interface files for
    the default hierarchy present the legacy files on the default
    hierarchy so that their behavior on the default hierarchy can be
    tested. As the awkward name suggests, this is for development only.

    * memcg's CFTYPE_INSANE on "use_hierarchy" is noop now as the whole
    array isn't used on the default hierarchy. The flag is removed.

    v2: Updated documentation for cgroup__DEVEL__legacy_files_on_dfl.

    v3: Clear CFTYPE_ONLY_ON_DFL and CFTYPE_INSANE when cfts are removed
    as suggested by Li.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vivek Goyal
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Aristeu Rozanski
    Cc: Aneesh Kumar K.V

    Tejun Heo
     
  • Currently, cftypes added by cgroup_add_cftypes() are used for both the
    unified default hierarchy and legacy ones and subsystems can mark each
    file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to
    appear only on one of them. This is quite hairy and error-prone.
    Also, we may end up exposing interface files to the default hierarchy
    without thinking it through.

    cgroup_subsys will grow two separate cftype addition functions and
    apply each only on the hierarchies of the matching type. This will
    allow organizing cftypes in a lot clearer way and encourage subsystems
    to scrutinize the interface which is being exposed in the new default
    hierarchy.

    In preparation, this patch adds cgroup_add_legacy_cftypes() which
    currently is a simple wrapper around cgroup_add_cftypes() and replaces
    all cgroup_add_cftypes() usages with it.

    While at it, this patch drops a completely spurious return from
    __hugetlb_cgroup_file_init().

    This patch doesn't introduce any functional differences.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Aneesh Kumar K.V

    Tejun Heo
     
  • Currently, cgroup_subsys->base_cftypes is used for both the unified
    default hierarchy and legacy ones and subsystems can mark each file
    with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
    only on one of them. This is quite hairy and error-prone. Also, we
    may end up exposing interface files to the default hierarchy without
    thinking it through.

    cgroup_subsys will grow two separate cftype arrays and apply each only
    on the hierarchies of the matching type. This will allow organizing
    cftypes in a lot clearer way and encourage subsystems to scrutinize
    the interface which is being exposed in the new default hierarchy.

    In preparation, this patch renames cgroup_subsys->base_cftypes to
    cgroup_subsys->legacy_cftypes. This patch is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vivek Goyal
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Aristeu Rozanski
    Cc: Aneesh Kumar K.V

    Tejun Heo
     

09 Jul, 2014

6 commits

  • sane_behavior has been used as a development vehicle for the default
    unified hierarchy. Now that the default hierarchy is in place, the
    flag became redundant and confusing as its usage is allowed on all
    hierarchies. There are gonna be either the default hierarchy or
    legacy ones. Let's make that clear by removing sane_behavior support
    on non-default hierarchies.

    This patch replaces cgroup_sane_behavior() with cgroup_on_dfl(). The
    comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of
    cgroup_on_dfl() with sane_behavior specific part dropped.

    On the default and legacy hierarchies w/o sane_behavior, this
    shouldn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko

    Tejun Heo
     
  • cgroup_root->flags only contains CGRP_ROOT_* flags and there's no
    reason to mask the flags. Remove CGRP_ROOT_OPTION_MASK.

    This doesn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Currently, the blkio subsystem attributes all of writeback IOs to the
    root. One of the issues is that there's no way to tell who originated
    a writeback IO from block layer. Those IOs are usually issued
    asynchronously from a task which didn't have anything to do with
    actually generating the dirty pages. The memory subsystem, when
    enabled, already keeps track of the ownership of each dirty page and
    it's desirable for blkio to piggyback instead of adding its own
    per-page tag.

    blkio piggybacking on memory is an implementation detail which
    preferably should be handled automatically without requiring explicit
    userland action. To achieve that, this patch implements
    cgroup_subsys->depends_on which contains the mask of subsystems which
    should be enabled together when the subsystem is enabled.

    The previous patches already implemented the support for enabled but
    invisible subsystems and cgroup_subsys->depends_on can be easily
    implemented by updating cgroup_refresh_child_subsys_mask() so that it
    calculates cgroup->child_subsys_mask considering
    cgroup_subsys->depends_on of the explicitly enabled subsystems.

    Documentation/cgroups/unified-hierarchy.txt is updated to explain that
    subsystems may not become immediately available after userland stops
    using them and that dependency could be a factor in it. As
    subsystems may already keep residual references, this doesn't
    significantly change how subsystem rebinding can be used.
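
    The mask calculation described above can be sketched as follows. This
    is a minimal userspace model, not the kernel implementation: the
    three-subsystem table, the enum values, and the function name
    refresh_child_subsys_mask() are assumptions for the example, with
    blkio depending on memory as in the motivating case.

```c
#include <assert.h>
#include <stdint.h>

#define NR_SUBSYS 3

/* Hypothetical subsystem ids used only in this sketch. */
enum { SUBSYS_MEMORY, SUBSYS_BLKIO, SUBSYS_CPU };

/* depends_on[i]: subsystems which must be enabled together with i. */
static const uint32_t depends_on[NR_SUBSYS] = {
    [SUBSYS_MEMORY] = 0,
    [SUBSYS_BLKIO]  = 1u << SUBSYS_MEMORY,
    [SUBSYS_CPU]    = 0,
};

/*
 * Start from the explicitly enabled subsystems and keep adding the
 * dependencies of everything enabled so far until the mask is stable.
 */
uint32_t refresh_child_subsys_mask(uint32_t subtree_control)
{
    uint32_t cur = subtree_control;

    assert((subtree_control >> NR_SUBSYS) == 0);

    for (;;) {
        uint32_t next = cur;

        for (int i = 0; i < NR_SUBSYS; i++)
            if (cur & (1u << i))
                next |= depends_on[i];

        if (next == cur)
            break;
        cur = next;
    }
    return cur;
}
```

    Enabling only blkio in this model yields a mask with both blkio and
    memory set, while the explicitly requested mask still holds blkio
    alone.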

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Johannes Weiner

    Tejun Heo
     
  • cgroup is implementing support for subsystem dependency which would
    require a way to enable a subsystem even when it's not directly
    configured through "cgroup.subtree_control".

    The previous patches added support for explicitly and implicitly
    enabled subsystems and showing/hiding their interface files. An
    explicitly enabled subsystem may become implicitly enabled if it's
    turned off through "cgroup.subtree_control" but there are subsystems
    depending on it. In such cases, the subsystem, as it's turned off
    when seen from userland, shouldn't enforce any resource control.
    Also, the subsystem may be explicitly turned on later again and its
    interface files should be as close to the initial state as possible.

    This patch adds cgroup_subsys->css_reset() which is invoked when a css
    is hidden. The callback should disable resource control and reset the
    state to the vanilla state.
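
    A minimal sketch of how such a callback might be wired up is shown
    below. The css_sketch fields, LIMIT_NONE, example_css_reset(), and
    hide_css() are hypothetical names for this example; only the
    css_reset() callback itself comes from this patch.

```c
#include <assert.h>

#define LIMIT_NONE (-1L)    /* hypothetical "no limit configured" value */

/* Reduced model of a css: one tunable plus its visibility. */
struct css_sketch {
    long limit;     /* hypothetical resource limit */
    int visible;    /* nonzero while interface files are shown */
};

/* Reduced model of the subsystem operations table. */
struct subsys_ops_sketch {
    void (*css_reset)(struct css_sketch *css);  /* optional callback */
};

/* Example callback: drop the limit so nothing is enforced while hidden. */
void example_css_reset(struct css_sketch *css)
{
    css->limit = LIMIT_NONE;
}

/* Hide a css and, if the subsystem provides css_reset(), invoke it. */
void hide_css(const struct subsys_ops_sketch *ss, struct css_sketch *css)
{
    assert(ss && css);

    css->visible = 0;
    if (ss->css_reset)
        ss->css_reset(css);
}
```

    After hide_css() the css in this model enforces nothing and starts
    from its default settings if it is shown again.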

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Johannes Weiner

    Tejun Heo
     
  • cgroup is implementing support for subsystem dependency which would
    require a way to enable a subsystem even when it's not directly
    configured through "cgroup.subtree_control".

    The preceding patch distinguished cgroup->subtree_control and
    ->child_subsys_mask where the former is the subsystems explicitly
    configured by the userland and the latter is the set of all enabled
    subsystems, which currently equals the former but will include
    subsystems implicitly enabled through dependency.

    Subsystems which are enabled due to dependency shouldn't be visible to
    userland. This patch updates cgroup_subtree_control_write() and
    create_css() such that interface files are not created for implicitly
    enabled subsystems.

    * @visible parameter is added to create_css(). Interface files are
    created only when true.

    * If an already implicitly enabled subsystem is turned on through
    "cgroup.subtree_control", the existing css should be used. css
    draining is skipped.

    * cgroup_subtree_control_write() computes the new target
    cgroup->child_subsys_mask and creates/kills or shows/hides csses
    accordingly.

    As the two subsystem masks are still kept identical, this patch
    doesn't introduce any behavior changes.
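
    The create/kill versus show/hide decision can be sketched per
    subsystem as below. This is a simplified model under the assumption
    that subtree_control is always a subset of child_subsys_mask; the
    enum, the function name plan_css_action(), and its parameters are
    invented for this example and do not mirror the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome for one subsystem's css in a child cgroup. */
enum css_action {
    CSS_NONE,
    CSS_CREATE_VISIBLE,  /* newly enabled and explicitly requested */
    CSS_CREATE_HIDDEN,   /* newly enabled through dependency only */
    CSS_KILL,            /* no longer enabled at all */
    CSS_SHOW,            /* already enabled, now explicitly requested */
    CSS_HIDE,            /* still enabled, but only through dependency */
};

/*
 * sc  = subtree_control (explicitly enabled, hence visible)
 * csm = child_subsys_mask (all enabled, including implicit ones)
 */
enum css_action plan_css_action(unsigned int old_sc, unsigned int old_csm,
                                unsigned int new_sc, unsigned int new_csm,
                                int ssid)
{
    assert(ssid >= 0 && ssid < 32);
    assert((old_sc & ~old_csm) == 0);
    assert((new_sc & ~new_csm) == 0);

    unsigned int bit = 1u << ssid;
    bool was_enabled = old_csm & bit;
    bool was_visible = old_sc & bit;
    bool now_enabled = new_csm & bit;
    bool now_visible = new_sc & bit;

    if (!was_enabled && now_enabled)
        return now_visible ? CSS_CREATE_VISIBLE : CSS_CREATE_HIDDEN;
    if (was_enabled && !now_enabled)
        return CSS_KILL;
    if (was_enabled && was_visible != now_visible)
        return now_visible ? CSS_SHOW : CSS_HIDE;
    return CSS_NONE;
}
```

    In this model the CSS_SHOW case reuses the existing css, which
    corresponds to the second point above.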

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Johannes Weiner

    Tejun Heo
     
  • cgroup is implementing support for subsystem dependency which would
    require a way to enable a subsystem even when it's not directly
    configured through "cgroup.subtree_control".

    Previously, cgroup->child_subsys_mask directly reflected
    "cgroup.subtree_control" and the enabled subsystems in the child
    cgroups. This patch adds cgroup->subtree_control which
    "cgroup.subtree_control" operates on. cgroup->child_subsys_mask is
    now calculated from cgroup->subtree_control by
    cgroup_refresh_child_subsys_mask(), which sets it identical to
    cgroup->subtree_control for now.

    This will allow using cgroup->child_subsys_mask for all the enabled
    subsystems including the implicit ones and ->subtree_control for
    tracking the explicitly requested ones. This patch keeps the two
    masks identical and doesn't introduce any behavior changes.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Johannes Weiner

    Tejun Heo