17 May, 2014

1 commit

  • cgroup in general is moving towards using cgroup_subsys_state as the
    fundamental structural component and css_parent() was introduced to
    convert from using cgroup->parent to css->parent. It was quite some
    time ago and we're moving forward with making css more prominent.

    This patch drops the trivial wrapper css_parent() and lets the users
    dereference css->parent. While at it, explicitly mark fields of css
    which are public and immutable.
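
    As a rough illustration of the direct dereference, here is a userspace
    sketch. struct demo_css and demo_css_depth() are hypothetical stand-ins,
    not kernel definitions; only the ->parent link is modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a css; only the parent link is modeled. */
struct demo_css {
	struct demo_css *parent;	/* NULL at the top of the hierarchy */
};

/* Count ancestors by dereferencing ->parent directly, with no wrapper. */
static int demo_css_depth(const struct demo_css *css)
{
	int depth = 0;

	while (css->parent) {
		css = css->parent;
		depth++;
	}
	return depth;
}
```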

    v2: New usage from device_cgroup.c converted.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Acked-by: Neil Horman
    Acked-by: "David S. Miller"
    Acked-by: Li Zefan
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Cc: Peter Zijlstra
    Cc: Johannes Weiner

    Tejun Heo
     

14 May, 2014

2 commits

  • Convert all cftype->write_string() users to the new cftype->write()
    which maps directly to kernfs write operation and has full access to
    kernfs and cgroup contexts. The conversions are mostly mechanical.

    * @css and @cft are accessed using of_css() and of_cft() accessors
    respectively instead of being specified as arguments.

    * Should return @nbytes on success instead of 0.

    * @buf is not trimmed automatically. Trim if necessary. Note that
    blkcg and netprio don't need this as the parsers already handle
    whitespaces.
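
    The conversion rules above can be sketched in userspace C. The handler
    signature, demo_strip() and the accepted keywords are assumptions made
    for illustration; the real ->write() takes kernfs context arguments
    that are not modeled here.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Trim leading and trailing whitespace in place (hypothetical helper). */
static char *demo_strip(char *s)
{
	size_t len;

	while (isspace((unsigned char)*s))
		s++;
	len = strlen(s);
	while (len && isspace((unsigned char)s[len - 1]))
		s[--len] = '\0';
	return s;
}

/* Hypothetical write handler: parses a keyword and records the request. */
static ssize_t demo_write(char *buf, size_t nbytes, int *freeze)
{
	buf = demo_strip(buf);		/* @buf is not trimmed automatically */

	if (strcmp(buf, "FROZEN") == 0)
		*freeze = 1;
	else if (strcmp(buf, "THAWED") == 0)
		*freeze = 0;
	else
		return -EINVAL;

	return (ssize_t)nbytes;		/* success returns @nbytes, not 0 */
}
```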

    cftype->write_string() has no user left after the conversions and is
    removed.

    While at it, remove unnecessary local variable @p in
    cgroup_subtree_control_write() and stale comment about
    CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.

    This patch doesn't introduce any visible behavior changes.

    v2: netprio was missing from conversion. Converted.

    Signed-off-by: Tejun Heo
    Acked-by: Aristeu Rozanski
    Acked-by: Vivek Goyal
    Acked-by: Li Zefan
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Neil Horman
    Cc: "David S. Miller"

    Tejun Heo
     
  • Unlike the more usual refcnting, what css_tryget() provides is the
    distinction between online and offline csses instead of protection
    against upping a refcnt which already reached zero. cgroup is
    planning to provide actual tryget which fails if the refcnt already
    reached zero. Let's rename the existing trygets so that they clearly
    indicate that they test onliness.

    I thought about keeping the existing names as-is and introducing new
    names for the planned actual tryget; however, given that each
    controller participates in the synchronization of the online state, it
    seems worthwhile to make it explicit that these functions are about
    on/offline state.

    Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
    to css_tryget_online_from_dir(). This is pure rename.
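
    The distinction between the two flavors can be sketched as follows.
    This is a single-threaded userspace model with hypothetical names; the
    kernel uses percpu_ref and none of that machinery is reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in: a plain counter plus an online flag. */
struct demo_ref_css {
	int refcnt;
	bool online;
};

/* Succeeds only while the css is online, however many refs remain. */
static bool demo_tryget_online(struct demo_ref_css *css)
{
	if (!css->online)
		return false;
	css->refcnt++;
	return true;
}

/* Conventional tryget: fails only once the count already reached zero. */
static bool demo_tryget(struct demo_ref_css *css)
{
	if (css->refcnt == 0)
		return false;
	css->refcnt++;
	return true;
}
```

    An offline css that still holds references fails the first call but
    passes the second, which is the difference the new names make explicit.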

    v2: cgroup_freezer grew new usages of css_tryget(). Update
    accordingly.

    Signed-off-by: Tejun Heo
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo

    Tejun Heo
     

13 May, 2014

2 commits

  • While updating cgroup_freezer locking, 68fafb77d827 ("cgroup_freezer:
    replace freezer->lock with freezer_mutex") introduced a bug in
    update_if_frozen() where it returns with rcu_read_lock() held. Fix it
    by adding rcu_read_unlock() before returning.
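
    The class of bug can be illustrated with a userspace model in which a
    depth counter stands in for the RCU read-side critical section. All
    names are hypothetical and the loop is not the actual
    update_if_frozen() body.

```c
#include <assert.h>
#include <stdbool.h>

static int demo_rcu_depth;	/* models read-side nesting depth */

static void demo_rcu_read_lock(void)   { demo_rcu_depth++; }
static void demo_rcu_read_unlock(void) { demo_rcu_depth--; }

/* Returns true if every child is frozen; every exit path drops the lock. */
static bool demo_all_children_frozen(const bool *child_frozen, int nr)
{
	int i;

	demo_rcu_read_lock();
	for (i = 0; i < nr; i++) {
		if (!child_frozen[i]) {
			/* the unlock an early return must not skip */
			demo_rcu_read_unlock();
			return false;
		}
	}
	demo_rcu_read_unlock();
	return true;
}
```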

    Signed-off-by: Tejun Heo
    Reported-by: kbuild test robot

    Tejun Heo
     
  • After 96d365e0b86e ("cgroup: make css_set_lock a rwsem and rename it
    to css_set_rwsem"), css task iterators require sleepable context as
    they may block on css_set_rwsem. I missed that cgroup_freezer was
    iterating tasks under IRQ-safe spinlock freezer->lock. This leads to
    errors like the following on freezer state reads and transitions.

    BUG: sleeping function called from invalid context at /work/os/work/kernel/locking/rwsem.c:20
    in_atomic(): 0, irqs_disabled(): 0, pid: 462, name: bash
    5 locks held by bash/462:
    #0: (sb_writers#7){.+.+.+}, at: [] vfs_write+0x1a3/0x1c0
    #1: (&of->mutex){+.+.+.}, at: [] kernfs_fop_write+0xbb/0x170
    #2: (s_active#70){.+.+.+}, at: [] kernfs_fop_write+0xc3/0x170
    #3: (freezer_mutex){+.+...}, at: [] freezer_write+0x61/0x1e0
    #4: (rcu_read_lock){......}, at: [] freezer_write+0x53/0x1e0
    Preemption disabled at:[] console_unlock+0x1e4/0x460

    CPU: 3 PID: 462 Comm: bash Not tainted 3.15.0-rc1-work+ #10
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    ffff88000916a6d0 ffff88000e0a3da0 ffffffff81cf8c96 0000000000000000
    ffff88000e0a3dc8 ffffffff810cf4f2 ffffffff82388040 ffff880013aaf740
    0000000000000002 ffff88000e0a3de8 ffffffff81d05974 0000000000000246
    Call Trace:
    [] dump_stack+0x4e/0x7a
    [] __might_sleep+0x162/0x260
    [] down_read+0x24/0x60
    [] css_task_iter_start+0x27/0x70
    [] freezer_apply_state+0x5d/0x130
    [] freezer_write+0xf6/0x1e0
    [] cgroup_file_write+0xd8/0x230
    [] kernfs_fop_write+0xe7/0x170
    [] vfs_write+0xb6/0x1c0
    [] SyS_write+0x4d/0xc0
    [] system_call_fastpath+0x16/0x1b

    freezer->lock used to be used in hot paths but that time is long gone
    and there's no reason for the lock to be IRQ-safe spinlock or even
    per-cgroup. In fact, given the fact that a cgroup may contain large
    number of tasks, it's not a good idea to iterate over them while
    holding IRQ-safe spinlock.

    Let's simplify locking by replacing per-cgroup freezer->lock with
    global freezer_mutex. This also makes the comments explaining the
    intricacies of policy inheritance and the locking around it
    unnecessary, as the states are protected by a common mutex.

    The conversion is mostly straight-forward. The following are worth
    mentioning.

    * freezer_css_online() no longer needs double locking.

    * freezer_attach() now performs propagation simply while holding
    freezer_mutex. update_if_frozen() race no longer exists and the
    comment is removed.

    * freezer_fork() now tests whether the task is in root cgroup using
    the new task_css_is_root() without doing rcu_read_lock/unlock(). If
    not, it grabs freezer_mutex and performs the operation.

    * freezer_read() and freezer_change_state() grab freezer_mutex across
    the whole operation and pin the css while iterating so that each
    descendant processing happens in sleepable context.

    Fixes: 96d365e0b86e ("cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem")
    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

19 Mar, 2014

1 commit

  • cftype->write_string() just passes on the writeable buffer from kernfs
    and there's no reason to add const restriction on the buffer. The
    only thing const achieves is unnecessarily complicating parsing of the
    buffer. Drop const from @buffer.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki

    Tejun Heo
     

25 Feb, 2014

1 commit

  • cgroup_subsys->fork() callback is special in that it's called outside
    the usual cgroup locking and may race with on-going migration.
    freezer_fork() currently doesn't consider such race condition;
    however, it is still correct thanks to the fact that freeze_task() may
    be called spuriously.

    This is quite subtle. Let's explain what's going on and add test to
    detect racing and losing to task migration and skip freeze_task() in
    such cases for documentation.

    This doesn't make any behavior difference meaningful to userland.

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Cc: "Rafael J. Wysocki"

    Tejun Heo
     

13 Feb, 2014

1 commit

  • If !NULL, @skip_css makes cgroup_taskset_for_each() skip the matching
    css. The intention of the interface is to make it easy to skip css's
    (cgroup_subsys_states) which already match the migration target;
    however, this is entirely unnecessary as migration taskset doesn't
    include tasks which are already in the target cgroup. Drop @skip_css
    from cgroup_taskset_for_each().

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann

    Tejun Heo
     

08 Feb, 2014

1 commit

  • cgroup_subsys is a bit messier than it needs to be.

    * The name of a subsys can be different from its internal identifier
    defined in cgroup_subsys.h. Most subsystems use the matching name
    but three - cpu, memory and perf_event - use different ones.

    * cgroup_subsys_id enums are postfixed with _subsys_id and each
    cgroup_subsys is postfixed with _subsys. cgroup.h is widely
    included throughout various subsystems, it doesn't and shouldn't
    have claim on such generic names which don't have any qualifier
    indicating that they belong to cgroup.

    * cgroup_subsys->subsys_id should always equal the matching
    cgroup_subsys_id enum; however, we require each controller to
    initialize it and then BUG if they don't match, which is a bit
    silly.

    This patch cleans up cgroup_subsys names and initialization by doing
    the following.

    * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
    cgroup_subsys with _cgrp_subsys.

    * With the above, renaming subsys identifiers to match the userland
    visible names doesn't cause any naming conflicts. All non-matching
    identifiers are renamed to match the official names.

    cpu_cgroup -> cpu
    mem_cgroup -> memory
    perf -> perf_event

    * controllers no longer need to initialize ->subsys_id and ->name.
    They're generated in cgroup core and set automatically during boot.

    * Redundant cgroup_subsys declarations removed.

    * While updating BUG_ON()s in cgroup_init_early(), convert them to
    WARN()s. BUGging that early during boot is stupid - the kernel
    can't print anything, even through serial console and the trap
    handler doesn't even link stack frame properly for back-tracing.
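
    One way to generate matching ids and names from a single list is an
    X-macro, sketched below. The DEMO_* macros and the three-entry list
    are assumptions for illustration; the kernel's actual list lives in
    cgroup_subsys.h and its generation macros may differ.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical subsystem list; one entry per controller. */
#define DEMO_SUBSYS_LIST(X) X(cpu) X(memory) X(perf_event)

/* Generate cpu_cgrp_id, memory_cgrp_id, ... from the list. */
#define DEMO_ID(name) name##_cgrp_id,
enum demo_subsys_id { DEMO_SUBSYS_LIST(DEMO_ID) DEMO_SUBSYS_COUNT };
#undef DEMO_ID

/* Generate the user-visible names from the same identifiers. */
#define DEMO_NAME(name) [name##_cgrp_id] = #name,
static const char *const demo_subsys_name[] = {
	DEMO_SUBSYS_LIST(DEMO_NAME)
};
#undef DEMO_NAME
```

    Because both tables come from one list, an id and its name cannot
    drift apart, so there is nothing left for a controller to initialize.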

    This patch doesn't introduce any behavior changes.

    v2: Rebased on top of fe1217c4f3f7 ("net: net_cls: move cgroupfs
    classid handling into core").

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: "David S. Miller"
    Acked-by: "Rafael J. Wysocki"
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra
    Acked-by: Aristeu Rozanski
    Acked-by: Ingo Molnar
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Serge E. Hallyn
    Cc: Vivek Goyal
    Cc: Thomas Graf

    Tejun Heo
     

06 Dec, 2013

1 commit

  • In preparation of conversion to kernfs, cgroup file handling is
    updated so that it can be easily mapped to kernfs. This patch
    replaces cftype->read_seq_string() with cftype->seq_show() which is
    not limited to single_open() operation and will map directly to
    kernfs seq_file interface.

    The conversions are mechanical. As ->seq_show() doesn't have @css and
    @cft, the functions which make use of them are converted to use
    seq_css() and seq_cft() respectively. On several occasions, e.g. if
    it has seq_string in its name, the function name is updated to fit the
    new method better.

    This patch does not introduce any behavior changes.

    Signed-off-by: Tejun Heo
    Acked-by: Aristeu Rozanski
    Acked-by: Vivek Goyal
    Acked-by: Michal Hocko
    Acked-by: Daniel Wagner
    Acked-by: Li Zefan
    Cc: Jens Axboe
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Neil Horman

    Tejun Heo
     

09 Aug, 2013

11 commits

  • Previously, all css descendant iterators didn't include the origin
    (root of subtree) css in the iteration. The reasons were maintaining
    consistency with css_for_each_child() and that at the time of
    introduction more use cases needed skipping the origin anyway;
    however, given that css_is_descendant() considers self to be a
    descendant, omitting the origin css has become more confusing and
    looking at the accumulated use cases rather clearly indicates that
    including origin would result in simpler code overall.

    While this is a change which can easily lead to subtle bugs, cgroup
    API including the iterators has recently gone through major
    restructuring and no out-of-tree changes will be applicable without
    adjustments making this a relatively acceptable opportunity for this
    type of change.

    The conversions are mostly straight-forward. If the iteration block
    had explicit origin handling before or after, it's moved inside the
    iteration. If not, if (pos == origin) continue; is added. Some
    conversions add extra reference get/put around origin handling by
    consolidating origin handling and the rest. While the extra ref
    operations aren't strictly necessary, this shouldn't cause any
    noticeable difference.
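
    The effect of including the origin can be sketched with a small
    pre-order walk in userspace C. struct demo_node and demo_count() are
    hypothetical; the kernel iterators are macros over css's and are not
    reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tree node with first-child / next-sibling links. */
struct demo_node {
	struct demo_node *first_child;
	struct demo_node *next_sibling;
};

/* Pre-order count of @pos and its descendants, origin included. */
static int demo_count(const struct demo_node *pos,
		      const struct demo_node *origin, bool skip_origin)
{
	const struct demo_node *child;
	int n = 0;

	/* Equivalent of "if (pos == origin) continue;" in a loop body. */
	if (!(skip_origin && pos == origin))
		n++;

	for (child = pos->first_child; child; child = child->next_sibling)
		n += demo_count(child, origin, skip_origin);
	return n;
}
```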

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Michal Hocko
    Cc: Jens Axboe
    Cc: Matt Helsley
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo
     
  • cgroup is in the process of converting to css (cgroup_subsys_state)
    from cgroup as the principal subsystem interface handle. This is
    mostly to prepare for the unified hierarchy support where css's will
    be created and destroyed dynamically but also helps cleaning up
    subsystem implementations as css is usually what they are interested
    in anyway.

    cgroup_taskset which is used by the subsystem attach methods is the
    last cgroup subsystem API which isn't using css as the handle. Update
    cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and
    cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp.

    The conversions are pretty mechanical. One exception is
    cpuset::cgroup_cs(), which lost its last user and got removed.

    This patch shouldn't introduce any functional changes.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Daniel Wagner
    Cc: Ingo Molnar
    Cc: Matt Helsley
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is in the process of converting to css (cgroup_subsys_state)
    from cgroup as the principal subsystem interface handle. This is
    mostly to prepare for the unified hierarchy support where css's will
    be created and destroyed dynamically but also helps cleaning up
    subsystem implementations as css is usually what they are interested
    in anyway.

    This patch converts task iterators to deal with css instead of cgroup.
    Note that under unified hierarchy, different sets of tasks will be
    considered belonging to a given cgroup depending on the subsystem in
    question and making the iterators deal with css instead of cgroup
    provides them with enough information about the iteration.

    While at it, fix several function comment formats in cpuset.c.

    This patch doesn't introduce any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley

    Tejun Heo
     
  • Currently all cgroup_task_iter functions require @cgrp to be passed
    in, which is superfluous and increases the chance of usage error. Make
    cgroup_task_iter remember the cgroup being iterated and drop @cgrp
    argument from next and end functions.
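
    The shape of such an iterator can be sketched in userspace C. The
    structures and function names below are hypothetical stand-ins, and
    no locking is modeled.

```c
#include <assert.h>
#include <stddef.h>

struct demo_task {
	struct demo_task *next;
	int pid;
};

struct demo_cgroup {
	struct demo_task *tasks;
};

/* The iterator remembers what it walks, so next/end need no @cgrp. */
struct demo_task_iter {
	struct demo_cgroup *origin;
	struct demo_task *pos;
};

static void demo_task_iter_start(struct demo_cgroup *cgrp,
				 struct demo_task_iter *it)
{
	it->origin = cgrp;
	it->pos = cgrp->tasks;
}

static struct demo_task *demo_task_iter_next(struct demo_task_iter *it)
{
	struct demo_task *task = it->pos;

	if (task)
		it->pos = task->next;
	return task;
}

static void demo_task_iter_end(struct demo_task_iter *it)
{
	it->origin = NULL;
	it->pos = NULL;
}
```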

    This patch doesn't introduce any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Cc: Matt Helsley
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo
     
  • cgroup now has multiple iterators and it's quite confusing to have
    something which walks over tasks of a single cgroup named cgroup_iter.
    Let's rename it to cgroup_task_iter.

    While at it, reformat / update comments and replace the overview
    comment above the interface function decls with proper function
    comments. Such overview can be useful but function comments should be
    more than enough here.

    This is pure rename and doesn't introduce any functional changes.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Cc: Matt Helsley
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using css
    (cgroup_subsys_state) as the primary handle instead of cgroup in
    subsystem API. For hierarchy iterators, this is beneficial because

    * In most cases, css is the only thing subsystems care about anyway.

    * On the planned unified hierarchy, iterations for different
    subsystems will need to skip over different subtrees of the
    hierarchy depending on which subsystems are enabled on each cgroup.
    Passing around css makes it unnecessary to explicitly specify the
    subsystem in question as css is intersection between cgroup and
    subsystem.

    * For the planned unified hierarchy, css's would need to be created
    and destroyed dynamically independent from cgroup hierarchy. Having
    cgroup core manage css iteration makes enforcing deref rules a lot
    easier.

    Most subsystem conversions are straight-forward. Noteworthy changes
    are

    * blkio: cgroup_to_blkcg() is no longer used. Removed.

    * freezer: cgroup_freezer() is no longer used. Removed.

    * devices: cgroup_to_devcgroup() is no longer used. Removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup.
    Please see the previous commit which converts the subsystem methods
    for rationale.

    This patch converts all cftype file operations to take @css instead of
    @cgroup. cftypes for the cgroup core files don't have their subsystem
    pointer set. These will automatically use the dummy_css added by the
    previous patch and can be converted the same way.

    Most subsystem conversions are straightforward but there are some
    interesting ones.

    * freezer: update_if_frozen() is also converted to take @css instead
    of @cgroup for consistency. This will make the code look simpler
    too once iterators are converted to use css.

    * memory/vmpressure: mem_cgroup_from_css() needs to be exported to
    vmpressure while mem_cgroup_from_cont() can be made static.
    Updated accordingly.

    * cpu: cgroup_tg() doesn't have any user left. Removed.

    * cpuacct: cgroup_ca() doesn't have any user left. Removed.

    * hugetlb: hugetlb_cgroup_from_cgroup() doesn't have any user left.
    Removed.

    * net_cls: cgrp_cls_state() doesn't have any user left. Removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup *
    in subsystem implementations for the following reasons.

    * With unified hierarchy, subsystems will be dynamically bound and
    unbound from cgroups and thus css's (cgroup_subsys_state) may be
    created and destroyed dynamically over the lifetime of a cgroup,
    which is different from the current state where all css's are
    allocated and destroyed together with the associated cgroup. This
    in turn means that cgroup_css() should be synchronized and may
    return NULL, making it more cumbersome to use.

    * Differing levels of per-subsystem granularity in the unified
    hierarchy means that the task and descendant iterators should behave
    differently depending on the specific subsystem the iteration is
    being performed for.

    * In the majority of cases, subsystems only care about their part in the
    cgroup hierarchy - ie. the hierarchy of css's. Subsystem methods
    often obtain the matching css pointer from the cgroup and don't
    bother with the cgroup pointer itself. Passing around css fits
    much better.

    This patch converts all cgroup_subsys methods to take @css instead of
    @cgroup. The conversions are mostly straight-forward. A few
    noteworthy changes are

    * ->css_alloc() now takes css of the parent cgroup rather than the
    pointer to the new cgroup as the css for the new cgroup doesn't
    exist yet. Knowing the parent css is enough for all the existing
    subsystems.

    * In kernel/cgroup.c::offline_css(), unnecessary open coded css
    dereference is replaced with local variable access.

    This patch shouldn't cause any behavior differences.

    v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
    with local variable @css as suggested by Li Zefan.

    Rebased on top of new for-3.12 which includes for-3.11-fixes so
    that ->css_free() invocation added by da0a12caff ("cgroup: fix a
    leak when percpu_ref_init() fails") is converted too. Suggested
    by Li Zefan.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • Currently, controllers have to explicitly follow the cgroup hierarchy
    to find the parent of a given css. cgroup is moving towards using
    cgroup_subsys_state as the main controller interface construct, so
    let's provide a way to climb the hierarchy using just csses.

    This patch implements css_parent() which, given a css, returns its
    parent. The function is guaranteed to return a valid non-NULL parent css as
    long as the target css is not at the top of the hierarchy.

    freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices
    are converted to use css_parent() instead of accessing cgroup->parent
    directly.

    * __parent_ca() is dropped from cpuacct and its usage is replaced with
    parent_ca(). The only difference between the two was NULL test on
    cgroup->parent which is now embedded in css_parent() making the
    distinction moot. Note that eventually a css->parent field will be
    added to css and the NULL check in css_parent() will go away.

    This patch shouldn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • css (cgroup_subsys_state) is usually embedded in a subsys specific
    data structure. Subsystems either use container_of() directly to cast
    from css to such data structure or have an accessor function wrapping
    such cast. As cgroup as a whole is moving towards using css as the main
    interface handle, add and update such accessors to ease dealing with
    css's.

    All accessors explicitly handle NULL input and return NULL in those
    cases. While this looks like an extra branch in the code, as all
    controllers specific data structures have css as the first field, the
    casting doesn't involve any offsetting and the compiler can trivially
    optimize out the branch.

    * blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such
    accessor. Added.

    * memory, hugetlb and devices already had one but didn't explicitly
    handle NULL input. Updated.
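
    A NULL-tolerant accessor of this kind can be sketched in userspace C.
    demo_container_of() is a simplified stand-in for the kernel's
    container_of(), and the structures are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's container_of(). */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_acc_css {
	int id;
};

/* Controller data with the css embedded as the first field. */
struct demo_acc_freezer {
	struct demo_acc_css css;
	unsigned int state;
};

/* Accessor: maps NULL to NULL, otherwise casts to the container. */
static struct demo_acc_freezer *demo_css_freezer(struct demo_acc_css *css)
{
	return css ? demo_container_of(css, struct demo_acc_freezer, css)
		   : NULL;
}
```

    With css as the first field the offset is zero, which is why the NULL
    branch is cheap for a compiler to fold away.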

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • The names of the two struct cgroup_subsys_state accessors -
    cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
    The former clashes with the type name and the latter doesn't even
    indicate it's somehow related to cgroup.

    We're about to revamp large portion of cgroup API, so, let's rename
    them so that they're less awkward. Most per-controller usages of the
    accessors are localized in accessor wrappers and given the amount of
    scheduled changes, this isn't gonna add any noticeable headache.

    Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
    to task_css(). This patch is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

20 Nov, 2012

2 commits


10 Nov, 2012

6 commits

  • Up until now, cgroup_freezer didn't implement hierarchy properly.
    cgroups could be arranged in hierarchy but it didn't make any
    difference in how each cgroup_freezer behaved. They all operated
    separately.

    This patch implements proper hierarchy support. If a cgroup is
    frozen, all its descendants are frozen. A cgroup is thawed iff it and
    all its ancestors are THAWED. freezer.self_freezing shows the current
    freezing state for the cgroup itself. freezer.parent_freezing shows
    whether the cgroup is freezing because any of its ancestors is
    freezing.

    freezer_post_create() locks the parent and new cgroup and inherits the
    parent's state and freezer_change_state() applies new state top-down
    using cgroup_for_each_descendant_pre() which guarantees that no child
    can escape its parent's state. update_if_frozen() uses
    cgroup_for_each_descendant_post() to propagate frozen states
    bottom-up.
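
    The top-down inheritance rule can be sketched as below. The structure
    and helpers are hypothetical, locking is omitted, and callers are
    assumed to visit parents before children, as a pre-order walk would.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-cgroup freezer state. */
struct demo_hfreezer {
	struct demo_hfreezer *parent;
	bool self_freezing;	/* this cgroup was asked to freeze */
	bool parent_freezing;	/* some ancestor is freezing */
};

/* A cgroup is freezing if it or any ancestor requested it. */
static bool demo_is_freezing(const struct demo_hfreezer *f)
{
	return f->self_freezing || f->parent_freezing;
}

/* Refresh the inherited flag; the parent must already be up to date. */
static void demo_inherit(struct demo_hfreezer *f)
{
	f->parent_freezing = f->parent ? demo_is_freezing(f->parent) : false;
}
```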

    Synchronization could be coarser and easier by using a single mutex to
    protect all hierarchy operations. Finer grained approach was used
    because it wasn't too difficult for cgroup_freezer and I think it's
    beneficial to have an example implementation and cgroup_freezer is
    rather simple and can serve as a good one.

    As this makes cgroup_freezer properly hierarchical,
    freezer_subsys.broken_hierarchy marking is removed.

    Note that this patch changes userland visible behavior - freezing a
    cgroup now freezes all its descendants too. This behavior change is
    intended and has been warned via .broken_hierarchy.

    v2: Michal spotted a bug in freezer_change_state() - descendants were
    inheriting from the wrong ancestor. Fixed.

    v3: Documentation/cgroups/freezer-subsystem.txt updated.

    Signed-off-by: Tejun Heo
    Reviewed-by: Michal Hocko

    Tejun Heo
     
  • A cgroup is online and visible to iteration between ->post_create()
    and ->pre_destroy(). This patch introduces CGROUP_FREEZER_ONLINE and
    toggles it from the newly added freezer_post_create() and
    freezer_pre_destroy() while holding freezer->lock such that a
    cgroup_freezer can be reliably distinguished to be online. This will
    be used by full hierarchy support.

    ONLINE test is added to freezer_apply_state() but it currently doesn't
    make any difference as freezer_write() can only be called for an
    online cgroup.

    Adjusting system_freezing_cnt on destruction is moved from
    freezer_destroy() to the new freezer_pre_destroy() for consistency.

    This patch doesn't introduce any noticeable behavior change.

    Signed-off-by: Tejun Heo
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko

    Tejun Heo
     
  • Introduce FREEZING_SELF and FREEZING_PARENT and make FREEZING OR of
    the two flags. This is to prepare for full hierarchy support.

    freezer_apply_state() is updated such that it can handle setting and
    clearing of both flags. The two flags are also exposed to userland
    via read-only files self_freezing and parent_freezing.
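
    A minimal sketch of the two-flag scheme follows. The bit values and
    the demo_apply() helper are assumptions chosen for illustration, not
    the values used in cgroup_freezer.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit assignments. */
enum {
	DEMO_FREEZING_SELF   = 1 << 0,
	DEMO_FREEZING_PARENT = 1 << 1,
	DEMO_FREEZING        = DEMO_FREEZING_SELF | DEMO_FREEZING_PARENT,
};

/* Set or clear one freezing condition and return the new state. */
static unsigned int demo_apply(unsigned int state, bool freeze,
			       unsigned int flag)
{
	return freeze ? (state | flag) : (state & ~flag);
}
```

    Testing state & DEMO_FREEZING then answers "is any freezing condition
    set" without caring which one.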

    Other than the added cgroupfs files, this patch doesn't introduce any
    behavior change.

    Signed-off-by: Tejun Heo
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko

    Tejun Heo
     
  • freezer->state was an enum value - one of THAWED, FREEZING and FROZEN.
    As the scheduled full hierarchy support requires more than one
    freezing condition, switch it to mask of flags. If FREEZING is not
    set, it's thawed. FREEZING is set if freezing or frozen. If frozen,
    both FREEZING and FROZEN are set. Now that tasks can be attached to
    an already frozen cgroup, this also makes freezing condition checks
    more natural.
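
    The mapping from flag mask back to the three user-visible states can
    be sketched as follows. The bit values and demo_state_str() are
    hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical bit assignments for the state mask. */
enum {
	DEMO_ST_FREEZING = 1 << 0,
	DEMO_ST_FROZEN   = 1 << 1,
};

/* FROZEN implies FREEZING, so FROZEN is checked first. */
static const char *demo_state_str(unsigned int state)
{
	if (state & DEMO_ST_FROZEN)
		return "FROZEN";
	if (state & DEMO_ST_FREEZING)
		return "FREEZING";
	return "THAWED";
}
```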

    This patch doesn't introduce any behavior change.

    Signed-off-by: Tejun Heo
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko

    Tejun Heo
     
  • * Make freezer_change_state() take bool @freeze instead of enum
    freezer_state.

    * Separate out freezer_apply_state() out of freezer_change_state().
    This makes freezer_change_state() a rather silly thin wrapper. It
    will be filled with hierarchy handling later on.

    This patch doesn't introduce any behavior change.

    Signed-off-by: Tejun Heo
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko

    Tejun Heo
     
  • * Clean-up indentation and line-breaks. Drop the invalid comment
    about freezer->lock.

    * Make all internal functions take @freezer instead of both @cgroup
    and @freezer.

    Signed-off-by: Tejun Heo
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko

    Tejun Heo
     

27 Oct, 2012

1 commit

  • try_to_freeze_tasks() and cgroup_freezer rely on scheduler locks
    to ensure that a task doing STOPPED/TRACED -> RUNNING transition
    can't escape freezing. This mostly works, but ptrace_stop() does
    not necessarily call schedule(), it can change task->state back to
    RUNNING and check freezing() without any lock/barrier in between.

    We could add the necessary barrier, but this patch changes
    ptrace_stop() and do_signal_stop() to use freezable_schedule().
    This fixes the race, freezer_count() and freezer_should_skip()
    carefully avoid the race.

    And this simplifies the code, try_to_freeze_tasks/update_if_frozen
    no longer need to use task_is_stopped_or_traced() checks with the
    non trivial assumptions. We can rely on the mechanism which was
    specially designed to mark the sleeping task as "frozen enough".

    v2: As Tejun pointed out, we can also change get_signal_to_deliver()
    and move try_to_freeze() up before 'relock' label.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Tejun Heo

    Oleg Nesterov
     

21 Oct, 2012

3 commits

  • freezer_read/write() used cgroup_lock_live_group() to synchronize
    against task migration into and out of the target cgroup.
    cgroup_lock_live_group() grabs the internal cgroup lock and using it
    from outside cgroup core leads to complex and fragile locking
    dependency issues which are difficult to resolve.

    Now that freezer_can_attach() is replaced with freezer_attach() and
    update_if_frozen() updated, nothing requires excluding migration
    against freezer state reads and changes.

    This patch removes cgroup_lock_live_group() and the matching
    cgroup_unlock() usages. The prone-to-bitrot, already outdated and
    unnecessary global lock hierarchy documentation is replaced with
    documentation in local scope.

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Rafael J. Wysocki
    Cc: Li Zefan

    Tejun Heo
     
  • Locking will change such that migration can happen while
    freezer_read/write() is in progress. This means that
    update_if_frozen() can no longer assume that all tasks in the cgroup
    conform to the current freezer state - newly migrated tasks which
    haven't finished freezer_attach() yet might be in any state.

    This patch updates update_if_frozen() such that it no longer verifies
    task states against freezer state. It now simply decides whether
    FREEZING stage is complete.

    This removal of verification makes it meaningless to call from
    freezer_change_state(). Drop it and move the fast exit test from
    freezer_read() - the only remaining caller - to update_if_frozen().

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Rafael J. Wysocki
    Cc: Li Zefan

    Tejun Heo
     
  • cgroup_freezer is one of the few users of cgroup_subsys->can_attach()
    and uses it to prevent tasks from being migrated into or out of a
    frozen cgroup. This makes cgroup_freezer cumbersome to use especially
    when co-mounted with other controllers.

    ->can_attach() is problematic in general as it can make co-mounting
    multiple cgroups difficult - migrating tasks may fail for reasons
    completely irrelevant for other controllers. freezer_can_attach() in
    particular is more problematic because it messes with cgroup internal
    locking to ensure that the state verification performed at
    freezer_can_attach() stays valid until migration is complete.

    This patch replaces freezer_can_attach() with freezer_attach() so that
    tasks are always allowed to migrate - they are nudged into the
    conforming state from freezer_attach(). This means that there can be
    tasks which are being migrated which don't conform to the current
    cgroup_freezer state until freezer_attach() is complete. Under the
    current locking scheme, the only such place is freezer_fork() which is
    updated to handle such window.
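
    The difference between vetoing a migration and nudging migrated tasks
    can be sketched in plain userland C. This is only an illustrative
    model, not the kernel code: every at_* name, the struct layouts and
    the drop from FROZEN back to FREEZING are assumptions made for the
    example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the three freezer states. */
enum at_state { AT_THAWED, AT_FREEZING, AT_FROZEN };

struct at_task {
	bool freeze_requested;	/* models "this task has been asked to freeze" */
};

struct at_freezer {
	enum at_state state;
};

/* Old model: veto the migration when the destination is frozen. */
int at_can_attach(const struct at_freezer *dst)
{
	return dst->state == AT_FROZEN ? -1 : 0;
}

/*
 * New model: migration always succeeds and each migrated task is
 * nudged towards the destination's current state afterwards.
 */
void at_attach(struct at_freezer *dst, struct at_task *tasks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (dst->state == AT_THAWED) {
			tasks[i].freeze_requested = false;
		} else {
			tasks[i].freeze_requested = true;
			/*
			 * Assumption: the newcomer is not frozen yet, so the
			 * group as a whole is only FREEZING again.
			 */
			dst->state = AT_FREEZING;
		}
	}
}
```

    In this model at_can_attach() is the only place where a migration can
    fail; at_attach() has no failure path, which is what makes it easier
    to combine with other controllers.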

    While this patch doesn't remove the use of internal cgroup locking
    from freezer_read/write() paths, it removes the requirement to keep
    the freezer state constant while migrating and enables such change.

    Note that this creates a userland visible behavior change - FROZEN
    cgroup can no longer be used to lock migrations in and out of the
    cgroup. This behavior change is intended. I don't think the feature
    is necessary - userland should coordinate accesses to cgroup fs anyway
    - and even if the feature is needed cgroup_freezer is the completely
    wrong place to implement it.

    Signed-off-by: Tejun Heo
    LKML-Reference:
    Cc: Matt Helsley
    Cc: Oleg Nesterov
    Cc: Rafael J. Wysocki
    Cc: Li Zefan

    Tejun Heo
     

17 Oct, 2012

3 commits

  • cgroup_freezer doesn't transition from FREEZING to FROZEN if the
    cgroup contains PF_NOFREEZE tasks or tasks sleeping with
    PF_FREEZER_SKIP set.

    Only kernel tasks can be non-freezable (PF_NOFREEZE) and there's
    nothing cgroup_freezer or userland can do about or to it. It's
    pointless to stall the transition for PF_NOFREEZE tasks.

    PF_FREEZER_SKIP indicates that the task can be skipped when
    determining whether frozen state is reached. A task with
    PF_FREEZER_SKIP is guaranteed to perform try_to_freeze() after it
    wakes up and can be considered frozen much like stopped or traced
    tasks. Note that a vfork parent uses PF_FREEZER_SKIP while waiting
    for the child.

    This updates update_if_frozen() such that it only considers freezable
    tasks and treats %true freezer_should_skip() tasks as frozen.

    This allows cgroups with kthreads and vfork parents to successfully
    reach FROZEN state.
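
    The completion check described above can be sketched roughly as
    follows. This is a userland model under stated assumptions: the FC_*
    flags and fc_* names are hypothetical stand-ins for PF_NOFREEZE,
    PF_FREEZER_SKIP and the frozen state, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's per-task flags. */
enum {
	FC_NOFREEZE	= 1 << 0,	/* models PF_NOFREEZE */
	FC_FREEZER_SKIP	= 1 << 1,	/* models PF_FREEZER_SKIP */
	FC_FROZEN	= 1 << 2,	/* task has actually frozen */
};

struct fc_task {
	unsigned int flags;
};

/* Returns true when the FREEZING stage can be considered complete. */
bool fc_freezing_complete(const struct fc_task *tasks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		unsigned int f = tasks[i].flags;

		if (f & FC_NOFREEZE)
			continue;	/* non-freezable: not considered */
		if (f & (FC_FROZEN | FC_FREEZER_SKIP))
			continue;	/* frozen, or counted as frozen */
		return false;		/* a freezable task is still running */
	}
	return true;
}
```

    With this rule a group holding only a non-freezable task and a
    skipped sleeper counts as complete, while a single running freezable
    task keeps the group in FREEZING.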

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Rafael J. Wysocki

    Tejun Heo
     
  • try_to_freeze_cgroup() has condition checks which are intended to fail
    the write operation to freezer.state if there are tasks which can't be
    frozen. The condition checks have been broken for quite some time
    now. freeze_task() returns %false if the target task can't be frozen,
    so num_cant_freeze_now is never incremented.

    In addition, strangely, cgroup freezing proceeds even after the write
    has failed, which is rather broken.

    This patch rips out the non-working code intended to fail the write to
    freezer.state when the cgroup contains non-freezable tasks and makes
    it official that writes to freezer.state succeed whether there are
    non-freezable tasks in the cgroup or not.

    This leaves is_task_frozen_enough() with only one user -
    update_if_frozen(). Collapse it into the caller. Note that this
    removes an extra call to freezing().

    This doesn't cause any userland behavior changes.

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Rafael J. Wysocki

    Tejun Heo
     
  • cgroup core has a bug which violates a basic rule about event
    notifications - when a new entity needs to be added, you add that to
    the notification list first and then make the new entity conform to
    the current state. If done in the reverse order, an event happening
    in between will be lost.

    cgroup_subsys->fork() is invoked way before the new task is added to
    the css_set. Currently, cgroup_freezer is the only user of ->fork()
    and uses it to make new tasks conform to the current state of the
    freezer. If FROZEN state is requested while fork is in progress
    between cgroup_fork_callbacks() and cgroup_post_fork(), the child
    could escape freezing - the cgroup isn't frozen when ->fork() is
    called and the freezer couldn't see the new task on the css_set.

    This patch moves cgroup_subsys->fork() invocation to
    cgroup_post_fork() after the new task is added to the css_set.
    cgroup_fork_callbacks() is removed.

    Because now a task may be migrated during cgroup_subsys->fork(),
    freezer_fork() is updated so that it adheres to the usual RCU locking
    and the rather pointless comment on why locking can be different there
    is removed (if it doesn't make anything simpler, why even bother?).
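
    The ordering rule can be illustrated with a small userland model. It
    is a sketch only: the ev_* names, the fixed-size member array and the
    race callback (which stands in for a freeze request arriving inside
    the window) are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

#define EV_MAX_TASKS 8

struct ev_task {
	bool frozen;
};

struct ev_group {
	bool freeze_requested;
	struct ev_task *members[EV_MAX_TASKS];	/* the notification list */
	size_t nr;
};

/* The event: request freezing and apply it to every listed member. */
void ev_freeze_group(struct ev_group *g)
{
	g->freeze_requested = true;
	for (size_t i = 0; i < g->nr; i++)
		g->members[i]->frozen = true;
}

/* Make a task conform to the group's current state. */
void ev_conform(const struct ev_group *g, struct ev_task *t)
{
	t->frozen = g->freeze_requested;
}

/* Add a task to the notification list. */
void ev_link(struct ev_group *g, struct ev_task *t)
{
	if (g->nr < EV_MAX_TASKS)
		g->members[g->nr++] = t;
}

/* Reverse order: conform first, link second. The event is lost. */
void ev_join_buggy(struct ev_group *g, struct ev_task *t,
		   void (*race)(struct ev_group *))
{
	ev_conform(g, t);
	if (race)
		race(g);	/* event arrives in the window */
	ev_link(g, t);
}

/* Correct order: link first, then conform to the current state. */
void ev_join_fixed(struct ev_group *g, struct ev_task *t,
		   void (*race)(struct ev_group *))
{
	ev_link(g, t);
	if (race)
		race(g);	/* event arrives in the window */
	ev_conform(g, t);
}
```

    Passing ev_freeze_group as the race callback leaves the new task
    unfrozen after ev_join_buggy() even though freezing was requested,
    while ev_join_fixed() ends with the task frozen.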

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Rafael J. Wysocki
    Cc: stable@vger.kernel.org

    Tejun Heo
     

15 Sep, 2012

1 commit

  • Currently, cgroup hierarchy support is a mess. cpu related subsystems
    behave correctly - configuration, accounting and control on a parent
    properly cover its children. blkio and freezer completely ignore
    hierarchy and treat all cgroups as if they're directly under the root
    cgroup. Others show yet different behaviors.

    These differing interpretations of cgroup hierarchy make using cgroup
    confusing and make it impossible to co-mount controllers into the same
    hierarchy and obtain sane behavior.

    Eventually, we want full hierarchy support from all subsystems and
    probably a unified hierarchy. Users using separate hierarchies
    expecting completely different behaviors depending on the mounted
    subsystem is detrimental to making any progress on this front.

    This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
    for controllers which are lacking in hierarchy support. The goal of
    this patch is two-fold.

    * Move users away from using hierarchy on currently non-hierarchical
    subsystems, so that implementing proper hierarchy support on those
    doesn't surprise them.

    * Keep track of which controllers are broken how and nudge the
    subsystems to implement proper hierarchy support.

    For now, start with a single warning message. We can whine louder
    later on.
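
    A warn-once check of this kind might look like the sketch below. It
    is a userland model, not the patch itself: the bh_* names, the depth
    parameter, the warned field and the message text are assumptions made
    for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of a controller with a broken_hierarchy flag. */
struct bh_subsys {
	const char *name;
	bool broken_hierarchy;	/* controller lacks hierarchy support */
	bool warned;		/* the single warning was already printed */
};

/*
 * Called after a cgroup is created at the given depth, where depth 1
 * means directly under the root. Returns true if a warning was printed.
 */
bool bh_check_create(struct bh_subsys *ss, int depth)
{
	if (!ss->broken_hierarchy || depth <= 1 || ss->warned)
		return false;

	fprintf(stderr,
		"%s: nested cgroups are not properly supported and the behavior may change\n",
		ss->name);
	ss->warned = true;
	return true;
}
```

    In this model only the first nested creation on a flagged controller
    produces a message; flat hierarchies and controllers with working
    hierarchy support stay silent.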

    v2: Fixed a typo spotted by Michal. Warning message updated.

    v3: Updated memcg part so that it doesn't generate warning in the
    cases where .use_hierarchy=false doesn't make the behavior
    different from root.use_hierarchy=true. Fixed a typo spotted by
    Glauber.

    v4: Check ->broken_hierarchy after cgroup creation is complete so that
    ->create() can affect the result per Michal. Dropped unnecessary
    memcg root handling per Michal.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Acked-by: Serge E. Hallyn
    Cc: Glauber Costa
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Johannes Weiner
    Cc: Thomas Graf
    Cc: Vivek Goyal
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Neil Horman
    Cc: Aneesh Kumar K.V

    Tejun Heo
     

02 Apr, 2012

1 commit

  • Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio,
    net_cls and device controllers to use the new cftype based interface.
    Termination entry is added to cftype arrays and populate callbacks are
    replaced with cgroup_subsys->base_cftypes initializations.
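
    A terminated file array can be sketched as follows. This is a
    simplified userland model: struct ct_file, its fields and the two
    example entries are hypothetical and only illustrate how a
    zero-filled last entry lets a consumer walk the array without a
    separate length.

```c
#include <stddef.h>

/* Hypothetical, simplified model of a control file description. */
struct ct_file {
	char name[32];		/* empty name marks the end of the array */
	int (*read)(void);
};

int ct_read_state(void)
{
	return 0;
}

int ct_read_debug(void)
{
	return 1;
}

const struct ct_file ct_files[] = {
	{ .name = "state", .read = ct_read_state },
	{ .name = "debug", .read = ct_read_debug },
	{ 0 }	/* terminate */
};

/* Walk the array until the terminating entry and count real entries. */
size_t ct_count(const struct ct_file *cfts)
{
	size_t n = 0;

	while (cfts[n].name[0] != '\0')
		n++;
	return n;
}
```

    Because the array carries its own end marker, a subsystem only needs
    to point at the array; no populate callback is required to register
    each file by hand.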

    This is a functionally identical transformation. There shouldn't be any
    visible behavior change.

    memcg is rather special and will be converted separately.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Paul Menage
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "David S. Miller"
    Cc: Vivek Goyal

    Tejun Heo
     

03 Feb, 2012

1 commit

  • The argument is not used at all, and it's not necessary, because
    a specific callback handler of course knows which subsys it
    belongs to.

    Now only ->populate() takes this argument, because the handlers of
    this callback always call cgroup_add_file()/cgroup_add_files().

    So we reduce a few lines of code, though the shrinking of object size
    is minimal.

    16 files changed, 113 insertions(+), 162 deletions(-)

    text data bss dec hex filename
    5486240 656987 7039960 13183187 c928d3 vmlinux.o.orig
    5486170 656987 7039960 13183117 c9288d vmlinux.o

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

10 Jan, 2012

1 commit

  • * 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
    cgroup: fix to allow mounting a hierarchy by name
    cgroup: move assignement out of condition in cgroup_attach_proc()
    cgroup: Remove task_lock() from cgroup_post_fork()
    cgroup: add sparse annotation to cgroup_iter_start() and cgroup_iter_end()
    cgroup: mark cgroup_rmdir_waitq and cgroup_attach_proc() as static
    cgroup: only need to check oldcgrp==newgrp once
    cgroup: remove redundant get/put of task struct
    cgroup: remove redundant get/put of old css_set from migrate
    cgroup: Remove unnecessary task_lock before fetching css_set on migration
    cgroup: Drop task_lock(parent) on cgroup_fork()
    cgroups: remove redundant get/put of css_set from css_set_check_fetched()
    resource cgroups: remove bogus cast
    cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
    cgroup, cpuset: don't use ss->pre_attach()
    cgroup: don't use subsys->can_attach_task() or ->attach_task()
    cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()
    cgroup: improve old cgroup handling in cgroup_attach_proc()
    cgroup: always lock threadgroup during migration
    threadgroup: extend threadgroup_lock() to cover exit and exec
    threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
    ...

    Fix up conflict in kernel/cgroup.c due to commit e0197aae59e5: "cgroups:
    fix a css_set not found bug in cgroup_attach_proc" that already
    mentioned that the bug is fixed (differently) in Tejun's cgroup
    patchset. This one, in other words.

    Linus Torvalds