14 May, 2014

3 commits

  • c1a71504e971 ("cgroup: don't recycle cgroup id until all csses' have
    been destroyed") made cgroup ID persist until a cgroup is released and
    added cgroup->subsys[] clearing to css_release() so that css_from_id()
    doesn't return a css which has already been released which happens
    before cgroup release; however, the right change here was updating
    offline_css() to clear cgroup->subsys[] which was done by e32978031016
    ("cgroup: cgroup->subsys[] should be cleared after the css is
    offlined") instead of clearing it from css_release().

    We're now clearing cgroup->subsys[] twice. This is okay for
    traditional hierarchies as a css's lifetime is the same as its
    cgroup's; however, this confuses unified hierarchy and turning on and
    off a controller repeatedly using "cgroup.subtree_control" can lead to
    an oops like the following which happens because cgroup->subsys[] is
    incorrectly cleared asynchronously by css_release().

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    IP: [] kill_css+0x21/0x1c0
    PGD 1170d067 PUD f0ab067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in:
    CPU: 2 PID: 459 Comm: bash Not tainted 3.15.0-rc2-work+ #5
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff880009296710 ti: ffff88000e198000 task.ti: ffff88000e198000
    RIP: 0010:[] [] kill_css+0x21/0x1c0
    RSP: 0018:ffff88000e199dc8 EFLAGS: 00010202
    RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001
    RDX: 0000000000000001 RSI: ffffffff8238a968 RDI: ffff880009296f98
    RBP: ffff88000e199de0 R08: 0000000000000001 R09: 02b0000000000000
    R10: 0000000000000000 R11: ffff880009296fc0 R12: 0000000000000001
    R13: ffff88000db6fc58 R14: 0000000000000001 R15: ffff8800139dcc00
    FS: 00007ff9160c5740(0000) GS:ffff88001fb00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000008 CR3: 0000000013947000 CR4: 00000000000006e0
    Stack:
    ffff88000e199de0 ffffffff82389160 0000000000000001 ffff88000e199e80
    ffffffff8113537f 0000000000000007 ffff88000e74af00 ffff88000e199e48
    ffff880009296710 ffff88000db6fc00 ffffffff8239c100 0000000000000002
    Call Trace:
    [] cgroup_subtree_control_write+0x85f/0xa00
    [] cgroup_file_write+0x38/0x1d0
    [] kernfs_fop_write+0xe7/0x170
    [] vfs_write+0xb6/0x1c0
    [] SyS_write+0x4d/0xc0
    [] system_call_fastpath+0x16/0x1b
    Code: 5c 41 5d 41 5e 41 5f 5d c3 90 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb 48 83 ec 08 8b 05 37 ad 29 01 85 c0 0f 85 df 00 00 00 8b 43 08 48 8b 3b be 01 00 00 00 8b 48 5c d3 e6 e8 49 ff ff
    RIP [] kill_css+0x21/0x1c0
    RSP
    CR2: 0000000000000008
    ---[ end trace e7aae1f877c4e1b4 ]---

    Remove the unnecessary cgroup->subsys[] clearing from css_release().

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • cgroup_idr_remove() can be invoked from bh leading to lockdep
    detecting possible AA deadlock (IN_BH/ON_BH). Make the lock bh-safe.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • cgroup_subtree_control_write() waits for offline to complete
    child-by-child before enabling a controller; however, it has a couple
    of bugs.

    * It doesn't initialize the wait_queue_t. This can lead to an infinite
    hang on the following schedule(), among other things.

    * It forgets to pin the child before releasing cgroup_tree_mutex and
    performing schedule(). The child may already be gone by the time it
    wakes up and invokes finish_wait(). Pin the child being waited on.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

13 May, 2014

5 commits

  • …j/cgroup into for-3.16

    Pull to receive e37a06f10994 ("cgroup: fix the retry path of
    cgroup_mount()") to avoid unnecessary conflicts with planned
    cgroup_tree_mutex removal and also to be able to remove the temp fix
    added by 36c38fb7144a ("blkcg: use trylock on blkcg_pol_mutex in
    blkcg_reset_stats()") afterwards.

    Signed-off-by: Tejun Heo <tj@kernel.org>

    Tejun Heo
     
  • While updating cgroup_freezer locking, 68fafb77d827 ("cgroup_freezer:
    replace freezer->lock with freezer_mutex") introduced a bug in
    update_if_frozen() where it returns with rcu_read_lock() held. Fix it
    by adding rcu_read_unlock() before returning.

    Signed-off-by: Tejun Heo
    Reported-by: kbuild test robot

    Tejun Heo
     
  • Pull to receive percpu_ref_tryget[_live]() changes. Planned cgroup
    changes will make use of them.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • After 96d365e0b86e ("cgroup: make css_set_lock a rwsem and rename it
    to css_set_rwsem"), css task iterators require sleepable context as
    they may block on css_set_rwsem. I missed that cgroup_freezer was
    iterating tasks under IRQ-safe spinlock freezer->lock. This leads to
    errors like the following on freezer state reads and transitions.

    BUG: sleeping function called from invalid context at /work/os/work/kernel/locking/rwsem.c:20
    in_atomic(): 0, irqs_disabled(): 0, pid: 462, name: bash
    5 locks held by bash/462:
    #0: (sb_writers#7){.+.+.+}, at: [] vfs_write+0x1a3/0x1c0
    #1: (&of->mutex){+.+.+.}, at: [] kernfs_fop_write+0xbb/0x170
    #2: (s_active#70){.+.+.+}, at: [] kernfs_fop_write+0xc3/0x170
    #3: (freezer_mutex){+.+...}, at: [] freezer_write+0x61/0x1e0
    #4: (rcu_read_lock){......}, at: [] freezer_write+0x53/0x1e0
    Preemption disabled at:[] console_unlock+0x1e4/0x460

    CPU: 3 PID: 462 Comm: bash Not tainted 3.15.0-rc1-work+ #10
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    ffff88000916a6d0 ffff88000e0a3da0 ffffffff81cf8c96 0000000000000000
    ffff88000e0a3dc8 ffffffff810cf4f2 ffffffff82388040 ffff880013aaf740
    0000000000000002 ffff88000e0a3de8 ffffffff81d05974 0000000000000246
    Call Trace:
    [] dump_stack+0x4e/0x7a
    [] __might_sleep+0x162/0x260
    [] down_read+0x24/0x60
    [] css_task_iter_start+0x27/0x70
    [] freezer_apply_state+0x5d/0x130
    [] freezer_write+0xf6/0x1e0
    [] cgroup_file_write+0xd8/0x230
    [] kernfs_fop_write+0xe7/0x170
    [] vfs_write+0xb6/0x1c0
    [] SyS_write+0x4d/0xc0
    [] system_call_fastpath+0x16/0x1b

    freezer->lock used to be used in hot paths but that time is long gone
    and there's no reason for the lock to be an IRQ-safe spinlock or even
    per-cgroup. In fact, given that a cgroup may contain a large number
    of tasks, it's not a good idea to iterate over them while holding an
    IRQ-safe spinlock.

    Let's simplify locking by replacing per-cgroup freezer->lock with
    global freezer_mutex. This also makes the comments explaining the
    intricacies of policy inheritance and the locking around it
    unnecessary as the states are protected by a common mutex.

    The conversion is mostly straightforward. The following are worth
    mentioning.

    * freezer_css_online() no longer needs double locking.

    * freezer_attach() now performs propagation simply while holding
    freezer_mutex. update_if_frozen() race no longer exists and the
    comment is removed.

    * freezer_fork() now tests whether the task is in root cgroup using
    the new task_css_is_root() without doing rcu_read_lock/unlock(). If
    not, it grabs freezer_mutex and performs the operation.

    * freezer_read() and freezer_change_state() grab freezer_mutex across
    the whole operation and pin the css while iterating so that each
    descendant processing happens in sleepable context.

    Fixes: 96d365e0b86e ("cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem")
    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Determining the css of a task usually requires RCU read lock as that's
    the only thing which keeps the returned css accessible till its
    reference is acquired; however, testing whether a task belongs to the
    root can be performed without dereferencing the returned css by
    comparing the returned pointer against the root one in init_css_set[]
    which never changes.

    Implement task_css_is_root() which can be invoked in any context.
    This will be used by the scheduled cgroup_freezer change.

    v2: cgroup no longer supports modular controllers. No need to export
    init_css_set. Pointed out by Li.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

10 May, 2014

2 commits

  • Implement percpu_ref_tryget() which fails if the refcnt already
    reached zero. Note that this is different from the recently renamed
    percpu_ref_tryget_live() which fails if the refcnt has been killed and
    is draining the remaining references. percpu_ref_tryget() succeeds on
    a killed refcnt as long as its current refcnt is above zero.
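
    The difference between the two tryget flavors can be modeled in
    userspace as follows. This is only a sketch: the model_ref type and
    its helpers are hypothetical, a plain counter stands in for the
    percpu counters, and nothing here is thread-safe.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: a plain counter stands in for the percpu refcnt. */
struct model_ref {
    unsigned long count;   /* outstanding references */
    bool dying;            /* set once the ref has been killed */
};

/* Conventional tryget: fails only when the count already reached zero. */
static bool model_ref_tryget(struct model_ref *ref)
{
    if (ref->count == 0)
        return false;
    ref->count++;
    return true;
}

/* "Live" tryget: additionally fails as soon as the ref has been killed. */
static bool model_ref_tryget_live(struct model_ref *ref)
{
    if (ref->dying)
        return false;
    return model_ref_tryget(ref);
}

/* Drop one reference. */
static void model_ref_put(struct model_ref *ref)
{
    assert(ref->count > 0);
    ref->count--;
}

/* Kill: mark the ref as dying and drop the initial reference. */
static void model_ref_kill(struct model_ref *ref)
{
    ref->dying = true;
    model_ref_put(ref);
}
```

    In this model, a killed ref which still has holders accepts
    model_ref_tryget() but rejects model_ref_tryget_live().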

    Signed-off-by: Tejun Heo
    Acked-by: Kent Overstreet

    Tejun Heo
     
  • percpu_ref_tryget() is different from the usual tryget semantics in
    that it fails if the refcnt is in its dying stage even if the refcnt
    hasn't reached zero yet. We're about to introduce the more
    conventional tryget and the current one has only one user. Let's
    rename it to percpu_ref_tryget_live() so that it explicitly signifies
    the peculiarities of its semantics.

    This is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Kent Overstreet

    Tejun Heo
     

07 May, 2014

1 commit


06 May, 2014

4 commits

  • Cc: Andrew Morton
    Signed-off-by: Fabian Frederick
    Acked-by: Li Zefan
    Signed-off-by: Tejun Heo

    Fabian Frederick
     
  • This patch also converts seq_printf to seq_puts

    Cc: Andrew Morton
    Signed-off-by: Fabian Frederick
    Acked-by: Li Zefan
    Signed-off-by: Tejun Heo

    Fabian Frederick
     
  • Fix typo and variable name.

    tj: Updated @cgrp argument description in cgroup_destroy_css_killed()

    Cc: Andrew Morton
    Signed-off-by: Fabian Frederick
    Signed-off-by: Tejun Heo

    Fabian Frederick
     
  • During the recent conversion of cgroup to kernfs, cgroup_tree_mutex
    which nests above both the kernfs s_active protection and cgroup_mutex
    is added to synchronize cgroup file type operations as cgroup_mutex
    needed to be grabbed from some file operations and thus can't be put
    above s_active protection.

    While this arrangement mostly worked for cgroup, this triggered the
    following lockdep warning.

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429 Tainted: G W
    -------------------------------------------------------
    trinity-c173/9024 is trying to acquire lock:
    (blkcg_pol_mutex){+.+.+.}, at: blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)

    but task is already holding lock:
    (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (s_active#89){++++.+}:
    lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
    __kernfs_remove (arch/x86/include/asm/atomic.h:27 fs/kernfs/dir.c:352 fs/kernfs/dir.c:1024)
    kernfs_remove_by_name_ns (fs/kernfs/dir.c:1219)
    cgroup_addrm_files (include/linux/kernfs.h:427 kernel/cgroup.c:1074 kernel/cgroup.c:2899)
    cgroup_clear_dir (kernel/cgroup.c:1092 (discriminator 2))
    rebind_subsystems (kernel/cgroup.c:1144)
    cgroup_setup_root (kernel/cgroup.c:1568)
    cgroup_mount (kernel/cgroup.c:1716)
    mount_fs (fs/super.c:1094)
    vfs_kern_mount (fs/namespace.c:899)
    do_mount (fs/namespace.c:2238 fs/namespace.c:2561)
    SyS_mount (fs/namespace.c:2758 fs/namespace.c:2729)
    tracesys (arch/x86/kernel/entry_64.S:746)

    -> #1 (cgroup_tree_mutex){+.+.+.}:
    lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
    mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
    cgroup_add_cftypes (include/linux/list.h:76 kernel/cgroup.c:3040)
    blkcg_policy_register (block/blk-cgroup.c:1106)
    throtl_init (block/blk-throttle.c:1694)
    do_one_initcall (init/main.c:789)
    kernel_init_freeable (init/main.c:854 init/main.c:863 init/main.c:882 init/main.c:1003)
    kernel_init (init/main.c:935)
    ret_from_fork (arch/x86/kernel/entry_64.S:552)

    -> #0 (blkcg_pol_mutex){+.+.+.}:
    __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
    lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
    mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
    blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
    cgroup_file_write (kernel/cgroup.c:2714)
    kernfs_fop_write (fs/kernfs/file.c:295)
    vfs_write (fs/read_write.c:532)
    SyS_write (fs/read_write.c:584 fs/read_write.c:576)
    tracesys (arch/x86/kernel/entry_64.S:746)

    other info that might help us debug this:

    Chain exists of:
    blkcg_pol_mutex --> cgroup_tree_mutex --> s_active#89

    Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(s_active#89);
                                   lock(cgroup_tree_mutex);
                                   lock(s_active#89);
      lock(blkcg_pol_mutex);

    *** DEADLOCK ***

    4 locks held by trinity-c173/9024:
    #0: (&f->f_pos_lock){+.+.+.}, at: __fdget_pos (fs/file.c:714)
    #1: (sb_writers#18){.+.+.+}, at: vfs_write (include/linux/fs.h:2255 fs/read_write.c:530)
    #2: (&of->mutex){+.+.+.}, at: kernfs_fop_write (fs/kernfs/file.c:283)
    #3: (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)

    stack backtrace:
    CPU: 3 PID: 9024 Comm: trinity-c173 Tainted: G W 3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429
    ffffffff919687b0 ffff8805f6373bb8 ffffffff8e52cdbb 0000000000000002
    ffffffff919d8400 ffff8805f6373c08 ffffffff8e51fb88 0000000000000004
    ffff8805f6373c98 ffff8805f6373c08 ffff88061be70d98 ffff88061be70dd0
    Call Trace:
    dump_stack (lib/dump_stack.c:52)
    print_circular_bug (kernel/locking/lockdep.c:1216)
    __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
    lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
    mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
    blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
    cgroup_file_write (kernel/cgroup.c:2714)
    kernfs_fop_write (fs/kernfs/file.c:295)
    vfs_write (fs/read_write.c:532)
    SyS_write (fs/read_write.c:584 fs/read_write.c:576)

    This is a highly unlikely but valid circular dependency between "echo
    1 > blkcg.reset_stats" and cfq module [un]loading. cgroup is going
    through further locking update which will remove this complication but
    for now let's use trylock on blkcg_pol_mutex and retry the file
    operation if the trylock fails.

    Signed-off-by: Tejun Heo
    Reported-by: Sasha Levin
    References: http://lkml.kernel.org/g/5363C04B.4010400@oracle.com

    Tejun Heo
     

05 May, 2014

8 commits

  • [PATCH v3 1/2] device_cgroup: check if exception removal is allowed

    When the device cgroup hierarchy was introduced in
    bd2953ebbb53 - devcg: propagate local changes down the hierarchy

    a specific case was overlooked. Consider the hierarchy below:

    A   default policy: ALLOW, exceptions will deny access
     \
      B default policy: ALLOW, exceptions will deny access

    There's no need to verify when a new exception is added to B because
    in this case exceptions will deny access to further devices, which is
    always fine. Hierarchy in device cgroup only makes sure B won't have
    more access than A.

    But when an exception is removed (by writing devices.allow), it isn't
    checked if the user is in fact removing an inherited exception from A,
    thus giving more access to B.

    Example:

    # echo 'a' >A/devices.allow
    # echo 'c 1:3 rw' >A/devices.deny
    # echo $$ >A/B/tasks
    # echo >/dev/null
    -bash: /dev/null: Operation not permitted
    # echo 'c 1:3 w' >A/B/devices.allow
    # echo >/dev/null
    #

    This shouldn't be allowed and this patch fixes it by making sure to never allow
    exceptions in this case to be removed if the exception is partially or fully
    present on the parent.
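
    The rule can be sketched in userspace as below. This is a model under
    stated assumptions, not the patch itself: struct dev_exception, ANY_DEV
    and removal_allowed() are hypothetical names, only 'b' and 'c' device
    types are modeled, and both cgroups are assumed to use the default
    ALLOW policy from the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ACC_READ   1u
#define ACC_WRITE  2u
#define ACC_MKNOD  4u
#define ANY_DEV    (-1)          /* stands in for the '*' wildcard */

/* Hypothetical model of one exception entry. */
struct dev_exception {
    char type;                   /* 'b' or 'c' */
    int major, minor;            /* ANY_DEV matches every number */
    unsigned int access;         /* ACC_* bits */
};

static bool num_overlaps(int a, int b)
{
    return a == ANY_DEV || b == ANY_DEV || a == b;
}

/* True if the two exceptions share at least one device/access combination. */
static bool exception_overlaps(const struct dev_exception *a,
                               const struct dev_exception *b)
{
    return a->type == b->type &&
           num_overlaps(a->major, b->major) &&
           num_overlaps(a->minor, b->minor) &&
           (a->access & b->access) != 0;
}

/*
 * Removing @ex from the child is allowed only if no exception of the
 * parent is partially or fully covered by it.
 */
static bool removal_allowed(const struct dev_exception *parent, size_t n,
                            const struct dev_exception *ex)
{
    assert(parent != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (exception_overlaps(&parent[i], ex))
            return false;
    return true;
}
```

    With A holding the 'c 1:3 rw' exception from the example, this model
    rejects removing 'c 1:3 w' in B.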

    v3: missing '*' in function description
    v2: improved log message and formatting fixes

    Cc: cgroups@vger.kernel.org
    Cc: Li Zefan
    Cc: stable@vger.kernel.org
    Signed-off-by: Aristeu Rozanski
    Acked-by: Serge Hallyn
    Signed-off-by: Tejun Heo

    Aristeu Rozanski
     
  • Moving more extensive explanations to the end of the comment.

    Cc: Li Zefan
    Signed-off-by: Aristeu Rozanski
    Acked-by: Serge Hallyn
    Signed-off-by: Tejun Heo

    Aristeu Rozanski
     
  • Until now, cgroup->id has been used to identify all the associated
    csses and css_from_id() takes cgroup ID and returns the matching css
    by looking up the cgroup and then dereferencing the css associated
    with it; however, now that the lifetimes of cgroup and css are
    separate, this is incorrect and breaks on the unified hierarchy when a
    controller is disabled and enabled back again before the previous
    instance is released.

    This patch adds css->id which is a subsystem-unique ID and converts
    css_from_id() to look up by the new css->id instead. memcg is the
    only user of css_from_id() and also converted to use css->id instead.

    For traditional hierarchies, this shouldn't make any functional
    difference.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Jianyu Zhan
    Acked-by: Li Zefan

    Tejun Heo
     
  • init_css() takes the cgroup the new css belongs to as an argument and
    initializes the new css's ->cgroup and ->parent pointers but doesn't
    acquire the matching reference counts. After the previous patch,
    create_css() puts init_css() and reference acquisition right next to
    each other. Let's move reference acquisition into init_css() and
    rename the function to init_and_link_css(). This makes sense and is
    easier to follow. This makes the root csses hold a reference on
    cgrp_dfl_root.cgrp, which is harmless.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Currently, when create_css() fails in the middle, the half-initialized
    css is freed by invoking cgroup_subsys->css_free() directly. This
    patch updates the function so that it invokes RCU free path instead.
    As the RCU free path puts the parent css and owning cgroup, their
    references are now acquired right after a new css is successfully
    allocated.

    This doesn't make any visible difference now but is to enable
    implementing css->id and RCU protected lookup by such IDs.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Currently, cgroup_root->cgroup_idr is protected by cgroup_mutex, which
    ends up requiring cgroup_put() to be invoked under sleepable context.
    This is okay for now but is an unusual requirement and we'll soon add
    css->id which will have the same problem but won't be able to simply
    grab cgroup_mutex as removal will have to happen from css_release()
    which can't sleep.

    Introduce cgroup_idr_lock and idr_alloc/replace/remove() wrappers
    which protect the idr operations with the lock and use them for
    cgroup_root->cgroup_idr. cgroup_put() no longer needs to grab
    cgroup_mutex and css_from_id() is updated to always require RCU read
    lock instead of either RCU read lock or cgroup_mutex, which doesn't
    affect the existing users.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Currently, cgroup->id is allocated from 0, which is always assigned to
    the root cgroup; unfortunately, memcg wants to use ID 0 to indicate
    invalid IDs and ends up incrementing all IDs by one.

    It's reasonable to reserve 0 for special purposes. This patch updates
    cgroup core so that ID 0 is not used and the root cgroups get ID 1.
    The ID incrementing is removed from memcg.
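
    A minimal userspace sketch of the idea, assuming a tiny fixed-size
    table in place of the kernel idr (struct model_idr and its helpers are
    hypothetical): allocation starts at 1, so 0 can never name a live
    object and is free to mean "invalid".

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_IDS 8

/* Hypothetical stand-in for an idr: slot index == ID, NULL == free. */
struct model_idr {
    void *slots[MODEL_MAX_IDS];
};

/* Allocate the lowest free ID in [start, end); returns -1 if none is free. */
static int model_idr_alloc(struct model_idr *idr, void *ptr, int start, int end)
{
    assert(ptr != NULL && start >= 0);
    if (end > MODEL_MAX_IDS)
        end = MODEL_MAX_IDS;
    for (int id = start; id < end; id++) {
        if (idr->slots[id] == NULL) {
            idr->slots[id] = ptr;
            return id;
        }
    }
    return -1;
}

/* Look up an ID; out-of-range and unused IDs yield NULL. */
static void *model_idr_find(const struct model_idr *idr, int id)
{
    if (id < 0 || id >= MODEL_MAX_IDS)
        return NULL;
    return idr->slots[id];
}
```

    Passing 1 as the start of the range is what keeps ID 0 unused in this
    model.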

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Acked-by: Li Zefan

    Tejun Heo
     
  • There's no reason to use atomic bitops for cgroup_subsys_state->flags,
    cgroup_root->flags and various subsys_masks. This patch updates those
    to use bitwise and/or operations instead and converts them from
    unsigned long to unsigned int.

    This makes the fields occupy (marginally) smaller space and makes it
    clear that they don't require atomicity.
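
    As an illustration only, flag handling with plain bitwise operations
    on an unsigned int could look like the following. The MODEL_* names
    and struct model_css are hypothetical, and callers are assumed to
    provide whatever serialization the field needs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits kept in a plain unsigned int. */
#define MODEL_CSS_ROOT    (1u << 0)
#define MODEL_CSS_ONLINE  (1u << 1)

struct model_css {
    unsigned int flags;
};

/* Set a flag with a plain OR; no atomic bitop is involved. */
static void model_css_set_online(struct model_css *css)
{
    assert(!(css->flags & MODEL_CSS_ONLINE));   /* must not be online yet */
    css->flags |= MODEL_CSS_ONLINE;
}

/* Clear a flag with a plain AND of the inverted mask. */
static void model_css_clear_online(struct model_css *css)
{
    css->flags &= ~MODEL_CSS_ONLINE;
}

static bool model_css_is_online(const struct model_css *css)
{
    return (css->flags & MODEL_CSS_ONLINE) != 0;
}
```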

    This patch doesn't cause any behavior difference.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

26 Apr, 2014

10 commits

  • Use pr_fmt and remove embedded prefixes.
    Realign modified multi-line statements to open parenthesis.
    Convert embedded function name to "%s: ", __func__

    Signed-off-by: Joe Perches
    Signed-off-by: Tejun Heo

    Joe Perches
     
  • As suggested by scripts/checkpatch.pl, substitute all pr_warning()
    with pr_warn().

    No functional change.

    Signed-off-by: Jianyu Zhan
    Signed-off-by: Tejun Heo

    Jianyu Zhan
     
  • 6612f05b88fa309c9 ("cgroup: unify pidlist and other file handling")
    has removed the only user of cgroup_pidlist_seq_operations :
    cgroup_pidlist_open().

    This patch removes it.

    Signed-off-by: Jianyu Zhan
    Signed-off-by: Tejun Heo

    Jianyu Zhan
     
  • 1d5be6b287c8efc87 ("cgroup: move module ref handling into
    rebind_subsystems()") made parse_cgroupfs_options() no longer take
    refcounts on subsystems.

    And with unified hierarchy, parse_cgroupfs_options() no longer needs
    to be called with cgroup_mutex held to protect cgroup_subsys[].

    So this patch removes BUG_ON() and the comment. As the comment
    doesn't contain useful information afterwards, the whole comment is
    removed.

    Signed-off-by: Jianyu Zhan
    Signed-off-by: Tejun Heo

    Jianyu Zhan
     
  • Unified hierarchy will be the new version of cgroup interface. This
    patch adds Documentation/cgroups/unified-hierarchy.txt which describes
    the design and rationales of unified hierarchy.

    v2: Grammatical updates as per Randy Dunlap's review.

    Signed-off-by: Tejun Heo
    Cc: Randy Dunlap

    Tejun Heo
     
  • cgroup users often need a way to determine when a cgroup's
    subhierarchy becomes empty so that it can be cleaned up. cgroup
    currently provides release_agent for it; unfortunately, this mechanism
    is riddled with issues.

    * It delivers events by forking and execing a userland binary
    specified as the release_agent. This is a long deprecated method of
    notification delivery. It's extremely heavy, slow and cumbersome to
    integrate with larger infrastructure.

    * There is a single monitoring point at the root. There's no way to
    delegate management of a subtree.

    * The event isn't recursive. It triggers when a cgroup doesn't have
    any tasks or child cgroups. Events for internal nodes trigger only
    after all children are removed. This again makes it impossible to
    delegate management of a subtree.

    * Events are filtered from the kernel side. "notify_on_release" file
    is used to subscribe to or suppress release event. This is
    unnecessarily complicated and probably done this way because event
    delivery itself was expensive.

    This patch implements interface file "cgroup.populated" which can be
    used to monitor whether the cgroup's subhierarchy has tasks in it or
    not. Its value is 0 if there is no task in the cgroup and its
    descendants; otherwise, 1, and kernfs_notify() notification is
    triggered when the value changes, which can be monitored through poll
    and [di]notify.

    This is a lot lighter and simpler and trivially allows delegating
    management of subhierarchy - subhierarchy monitoring can block further
    propagation simply by putting itself or another process in the root of
    the subhierarchy and monitor events that it's interested in from there
    without interfering with monitoring higher in the tree.
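
    A userspace monitor for such a file might look roughly like the sketch
    below. It assumes the usual convention for kernfs-backed files, namely
    that a change wakes poll() with POLLPRI set and that the value is
    re-read from offset 0. The helper names are hypothetical and the caller
    is expected to open the "cgroup.populated" file of interest itself.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Interpret the file content: "0" means empty, "1" populated, else -1. */
static int parse_populated(const char *buf)
{
    assert(buf != NULL);
    if ((buf[0] == '0' || buf[0] == '1') &&
        (buf[1] == '\n' || buf[1] == '\0'))
        return buf[0] - '0';
    return -1;
}

/* Re-read the current value from the start of an already open fd. */
static int read_populated(int fd)
{
    char buf[8];
    ssize_t n;

    if (lseek(fd, 0, SEEK_SET) < 0)
        return -1;
    n = read(fd, buf, sizeof(buf) - 1);
    if (n <= 0)
        return -1;
    buf[n] = '\0';
    return parse_populated(buf);
}

/*
 * Block until the value differs from @last. Returns the new value, or -1
 * on error. Assumes poll() reports POLLPRI when the file is notified.
 */
static int wait_populated_change(int fd, int last)
{
    for (;;) {
        struct pollfd pfd = { .fd = fd, .events = POLLPRI };
        int cur;

        if (poll(&pfd, 1, -1) < 0)
            return -1;
        cur = read_populated(fd);
        if (cur < 0 || cur != last)
            return cur;
    }
}
```

    A caller would read the initial value with read_populated() and then
    loop on wait_populated_change(), cleaning up the subhierarchy whenever
    the value drops to 0.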

    v2: Patch description updated as per Serge.

    v3: "cgroup.subtree_populated" renamed to "cgroup.populated". The
    subtree_ prefix was a bit confusing because
    "cgroup.subtree_control" uses it to denote the tree rooted at the
    cgroup sans the cgroup itself while the populated state includes
    the cgroup itself.

    Signed-off-by: Tejun Heo
    Acked-by: Serge Hallyn
    Acked-by: Li Zefan
    Cc: Lennart Poettering

    Tejun Heo
     
  • …/gregkh/driver-core into for-3.16

    Pull in driver-core-next to receive kernfs_notify() updates which will
    be used by the planned "cgroup.populated" implementation.

    Signed-off-by: Tejun Heo <tj@kernel.org>

    Tejun Heo
     
  • Support for uevent_helper, aka hotplug, is not required on many systems
    these days but it can still be enabled via sysfs or sysctl.

    Reported-by: Darren Shepherd
    Signed-off-by: Michael Marineau
    Signed-off-by: Greg Kroah-Hartman

    Michael Marineau
     
  • kernfs_notify() is used to indicate either new data is available or
    the content of a file has changed. It currently only triggers poll
    which may not be the most convenient to monitor especially when there
    are a lot to monitor. Let's hook it up to fsnotify too so that the
    events can be monitored via inotify too.

    fsnotify_modify() requires file * but kernfs_notify() doesn't have any
    specific file associated; however, we can walk all super_blocks
    associated with a kernfs_root and as kernfs always associates one ino
    with one inode and one dentry with an inode, it's trivial to look up the
    dentry associated with a given kernfs_node. As any active monitor
    would pin dentry, just looking up existing dentry is enough. This
    patch looks up the dentry associated with the specified kernfs_node
    and generates events equivalent to fsnotify_modify().

    Note that as fsnotify doesn't provide fsnotify_modify() equivalent
    which can be called with dentry, kernfs_notify() directly calls
    fsnotify_parent() and fsnotify(). It might be better to add a wrapper
    in fsnotify.h instead.

    Signed-off-by: Tejun Heo
    Cc: John McCutchan
    Cc: Robert Love
    Cc: Eric Paris
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • Currently, there's no way to find out which super_blocks are
    associated with a given kernfs_root. Let's implement it - the planned
    inotify extension to kernfs_notify() needs it.

    Make kernfs_super_info point back to the super_block and chain it at
    kernfs_root->supers.

    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

23 Apr, 2014

7 commits

  • cgroup is switching away from multiple hierarchies and will use one
    unified default hierarchy where controllers can be dynamically enabled
    and disabled per subtree. The default hierarchy will serve as the
    unified hierarchy to which all controllers are attached and a css on
    the default hierarchy would need to also serve the tasks of descendant
    cgroups which don't have the controller enabled - ie. the tree may be
    collapsed from leaf towards root when viewed from specific
    controllers. This has been implemented through effective css in the
    previous patches.

    This patch finally implements dynamic subtree controller
    enable/disable on the default hierarchy via a new knob -
    "cgroup.subtree_control" which controls which controllers are enabled
    on the child cgroups. Let's assume a hierarchy like the following.

    root - A - B - C
                 \ D

    root's "cgroup.subtree_control" determines which controllers are
    enabled on A. A's on B. B's on C and D. This coincides with the
    fact that controllers on the immediate sub-level are used to
    distribute the resources of the parent. In fact, it's natural to
    assume that resource control knobs of a child belong to its parent.
    Enabling a controller in "cgroup.subtree_control" declares that
    distribution of the respective resources of the cgroup will be
    controlled. Note that this means that controller enable states are
    shared among siblings.

    The default hierarchy has an extra restriction - only cgroups which
    don't contain any task may have controllers enabled in
    "cgroup.subtree_control". Combined with the other properties of the
    default hierarchy, this guarantees that, from the view point of
    controllers, tasks are only on the leaf cgroups. In other words, only
    leaf csses may contain tasks. This rules out situations where child
    cgroups compete against internal tasks of the parent, which is a
    competition between two different types of entities without any clear
    way to determine resource distribution between the two. Different
    controllers handle it differently and all the implemented behaviors
    are ambiguous, ad-hoc, cumbersome and/or just wrong. Having these
    structural constraints imposed from cgroup core removes the burden
    from controller implementations and enables showing one consistent
    behavior across all controllers.

    When a controller is enabled or disabled, css associations for the
    controller in the subtrees of each child should be updated. After
    enabling, the whole subtree of a child should point to the new css of
    the child. After disabling, the whole subtree of a child should point
    to the cgroup's css. This is implemented by first updating cgroup
    states such that cgroup_e_css() result points to the appropriate css
    and then invoking cgroup_update_dfl_csses() which migrates all tasks
    in the affected subtrees to the self cgroup on the default hierarchy.

    * When read, "cgroup.subtree_control" lists all the currently enabled
    controllers on the children of the cgroup.

    * White-space separated list of controller names prefixed with either
    '+' or '-' can be written to "cgroup.subtree_control". The ones
    prefixed with '+' are enabled on the controller and '-' disabled.

    * A controller can be enabled iff the parent's
    "cgroup.subtree_control" enables it and disabled iff no child's
    "cgroup.subtree_control" has it enabled.

    * If a cgroup has tasks, no controller can be enabled via
    "cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
    some controllers enabled, tasks can't be migrated into the cgroup.

    * All controllers which aren't bound on other hierarchies are
    automatically associated with the root cgroup of the default
    hierarchy. All the controllers which are bound to the default
    hierarchy are listed in the read-only file "cgroup.controllers" in
    the root directory.

    * "cgroup.controllers" in all non-root cgroups is read-only file whose
    content is equal to that of "cgroup.subtree_control" of the parent.
    This indicates which controllers can be used in the cgroup's
    "cgroup.subtree_control".

    This is still experimental and there are some holes, one of which is
    that ->can_attach() failure during cgroup_update_dfl_csses() may leave
    the cgroups in an undefined state. The issues will be addressed by
    future patches.
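
    The write format described above (white-space separated controller
    names, each prefixed with '+' or '-') can be parsed along these lines.
    This is a userspace sketch, not the kernel parser: the controller
    table and parse_subtree_control() are hypothetical, and letting a later
    token override an earlier one for the same controller is a choice made
    for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical controller table; bit i corresponds to entry i. */
static const char *const model_controllers[] = { "cpu", "memory", "io" };
#define MODEL_NR_CONTROLLERS \
    (sizeof(model_controllers) / sizeof(model_controllers[0]))

/*
 * Parse input such as "+memory -io" into enable/disable bitmasks.
 * Returns 0 on success, -1 on a token without a '+'/'-' prefix or with an
 * unknown controller name. The masks are unspecified on failure.
 */
static int parse_subtree_control(const char *input,
                                 unsigned int *enable, unsigned int *disable)
{
    const char *p = input;

    assert(input != NULL && enable != NULL && disable != NULL);
    *enable = 0;
    *disable = 0;
    for (;;) {
        unsigned int bit = 0;
        size_t len, i;

        p += strspn(p, " \t\n");          /* skip separators */
        if (*p == '\0')
            return 0;
        len = strcspn(p, " \t\n");        /* token length, at least 1 */
        if (*p != '+' && *p != '-')
            return -1;
        for (i = 0; i < MODEL_NR_CONTROLLERS; i++) {
            const char *name = model_controllers[i];

            if (strlen(name) == len - 1 &&
                strncmp(p + 1, name, len - 1) == 0)
                bit = 1u << i;
        }
        if (bit == 0)
            return -1;
        if (*p == '+') {
            *enable |= bit;
            *disable &= ~bit;
        } else {
            *disable |= bit;
            *enable &= ~bit;
        }
        p += len;
    }
}
```

    The resulting masks would then be checked against the rules listed
    above, for example that the parent's "cgroup.subtree_control" enables
    every controller in the enable mask.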

    v2: Non-root cgroups now also have "cgroup.controllers".

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Unified hierarchy implementation would require re-migrating tasks onto
    the same cgroup on the default hierarchy to reflect updated effective
    csses. Update cgroup_migrate_prepare_dst() so that it accepts NULL as
    the destination cgrp. When NULL is specified, the destination is
    considered to be the cgroup on the default hierarchy associated with
    each css_set.

    After this change, the identity check in cgroup_migrate_add_src()
    isn't sufficient for noop detection as the associated csses may change
    without any cgroup association changing. The only way to tell whether
    a migration is noop or not is testing whether the source and
    destination csets are identical. The noop check in
    cgroup_migrate_add_src() is removed and cset identity test is added to
    cgroup_migrate_prepare_dst(). If it's detected that source and
    destination csets are identical, the cset is removed from
    @preloaded_csets and all the migration nodes are cleared which makes
    cgroup_migrate() ignore the cset.

    Also, make the function append the destination css_sets to
    @preloaded_list so that destination css_sets always come after source
    css_sets.
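
    As a rough user-space illustration of why cset identity is the right
    noop test: if csets are looked up through a table that returns the
    existing entry for an identical set of csses, a pointer comparison
    between source and destination is enough. The types cset_key and
    cset_model and the helpers find_cset() and migration_is_noop() below
    are hypothetical and only model the idea; they are not the kernel
    data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NR_SUBSYS 3
#define MAX_CSETS 8

/* Hypothetical key: one css id per subsystem. */
struct cset_key {
    int css_id[NR_SUBSYS];
};

/* Simplified stand-in for a css_set. */
struct cset_model {
    struct cset_key key;
};

static struct cset_model cset_table[MAX_CSETS];
static size_t nr_csets;

/* Return the existing cset with an identical key, or create a new one. */
static struct cset_model *find_cset(const struct cset_key *key)
{
    size_t i;

    for (i = 0; i < nr_csets; i++)
        if (!memcmp(cset_table[i].key.css_id, key->css_id,
                    sizeof(key->css_id)))
            return &cset_table[i];

    assert(nr_csets < MAX_CSETS);
    cset_table[nr_csets].key = *key;
    return &cset_table[nr_csets++];
}

/*
 * A migration is a noop only if source and destination resolve to the
 * same cset. Comparing cgroups is not enough because the csses behind
 * a cgroup may change while the cgroup stays the same.
 */
static int migration_is_noop(const struct cset_model *src,
                             const struct cset_model *dst)
{
    return src == dst;
}
```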

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Because the default root couldn't have any non-root csses attached to
    it, rebinding away from it was always allowed; however, the default
    hierarchy will soon host the unified hierarchy and have non-root csses
    so the rebind restrictions need to be updated accordingly.

    Instead of special casing rebinding from the default hierarchy and
    then checking whether the source hierarchy has children cgroups, which
    implies non-root csses for !dfl hierarchies, simply check whether the
    source hierarchy has non-root csses for the subsystem using
    css_next_child().
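
    The check can be pictured with the small user-space model below. It
    assumes a css that only records its child links; struct css_model,
    next_child() and can_rebind_away() are hypothetical names that loosely
    mirror how css_next_child() is used, not the kernel code itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a css: only the links needed to find children. */
struct css_model {
    struct css_model *first_child;
    struct css_model *next_sibling;
};

/* Return the child after @pos, or the first child when @pos is NULL. */
static struct css_model *next_child(struct css_model *pos,
                                    struct css_model *parent)
{
    assert(parent);
    return pos ? pos->next_sibling : parent->first_child;
}

/*
 * Rebinding is allowed only when the root css of the source hierarchy
 * has no child csses for the subsystem, regardless of whether the
 * hierarchy has child cgroups.
 */
static int can_rebind_away(struct css_model *root_css)
{
    return next_child(NULL, root_css) == NULL;
}
```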

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • To implement the unified hierarchy behavior, we'll need to be able to
    determine the associated cgroup on the default hierarchy from css_set.
    Let's add css_set->dfl_cgrp so that it can be accessed conveniently
    and efficiently.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Now that effective css handling has been added and iterators updated
    accordingly, it's safe to allow cgroup creation in the default
    hierarchy. Unblock cgroup creation in the default hierarchy.

    As the default hierarchy will implement explicit enabling and
    disabling of controllers on each cgroup, suppress automatic css
    enabling on cgroup creation.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • After a css finishes offlining, offline_css() mistakenly performs
    RCU_INIT_POINTER(css->cgroup->subsys[ss->id], css) which just sets the
    cgroup->subsys[] pointer to the current value. The intention was to
    clear it after offline is complete, not reassign the same value.

    Update it to assign NULL instead of the current value. This makes
    cgroup_css() return NULL once offline is complete. All the
    existing users of the function either can handle NULL return already
    or guarantee that the css doesn't get offlined.
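
    The difference between the two assignments can be shown with a
    simplified user-space model. Plain pointer stores stand in for
    RCU_INIT_POINTER() here, and struct css_model, struct cgroup_model,
    offline_css_buggy(), offline_css_fixed() and lookup_css() are
    hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SUBSYS 2

struct cgroup_model;

/* Hypothetical stand-in for a css. */
struct css_model {
    int ss_id;
    struct cgroup_model *cgroup;
    int online;
};

/* Hypothetical stand-in for a cgroup with its subsys[] table. */
struct cgroup_model {
    struct css_model *subsys[NR_SUBSYS];
};

/* Models the bug: the slot is rewritten with the value it already holds. */
static void offline_css_buggy(struct css_model *css)
{
    css->online = 0;
    css->cgroup->subsys[css->ss_id] = css;
}

/* Models the fix: the slot is cleared once offlining is complete. */
static void offline_css_fixed(struct css_model *css)
{
    css->online = 0;
    css->cgroup->subsys[css->ss_id] = NULL;
}

/* Lookup in the spirit of cgroup_css(); may return NULL after offline. */
static struct css_model *lookup_css(struct cgroup_model *cgrp, int ss_id)
{
    assert(ss_id >= 0 && ss_id < NR_SUBSYS);
    return cgrp->subsys[ss_id];
}
```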

    While this is a bugfix, as css lifetime is currently tied to the
    cgroup it belongs to, this bug doesn't cause any actual problems.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Currently, css_task_iter iterates tasks associated with a css by
    visiting each css_set associated with the owning cgroup and walking
    tasks of each of them. This works fine for !unified hierarchies as
    each cgroup has its own css for each associated subsystem on the
    hierarchy; however, on the planned unified hierarchy, a cgroup may not
    have csses associated and its tasks would be considered associated
    with the matching css of the nearest ancestor which has the subsystem
    enabled.

    This means that on the default unified hierarchy, just walking all
    tasks associated with a cgroup isn't enough to walk all tasks which
    are associated with the specified css. If any of its children doesn't
    have the matching css enabled, task iteration should also include all
    tasks from the subtree. We already added cgroup->e_csets[] to list
    all css_sets effectively associated with a given css; walking the
    css_sets on that list instead achieves such iteration.

    This patch updates css_task_iter iteration such that it walks css_sets
    on cgroup->e_csets[] instead of cgroup->cset_links if iteration is
    requested on a non-dummy css. Thanks to the previous iteration
    update, this change can be achieved with the addition of
    css_task_iter->ss and minimal updates to css_advance_task_iter() and
    css_task_iter_start().
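
    The effect on task iteration can be sketched with a toy user-space
    model. It assumes each cgroup only records its parent, whether the css
    is enabled on it, and a task count; the kernel walks the precomputed
    cgroup->e_csets[] list rather than deriving ownership on the fly as
    done here. struct cgrp_model, effective_owner() and count_css_tasks()
    are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup on the default hierarchy. */
struct cgrp_model {
    const struct cgrp_model *parent;
    int css_enabled;    /* does this cgroup have its own css? */
    int nr_tasks;       /* tasks directly in this cgroup */
};

/* Nearest ancestor (including the cgroup itself) with the css enabled. */
static const struct cgrp_model *effective_owner(const struct cgrp_model *cgrp)
{
    while (cgrp && !cgrp->css_enabled)
        cgrp = cgrp->parent;
    return cgrp;
}

/*
 * Count every task effectively associated with @owner's css: the tasks
 * of @owner plus those of descendants that lack their own css.
 */
static int count_css_tasks(const struct cgrp_model *owner,
                           const struct cgrp_model *all, size_t n)
{
    int total = 0;
    size_t i;

    for (i = 0; i < n; i++)
        if (effective_owner(&all[i]) == owner)
            total += all[i].nr_tasks;
    return total;
}
```

    Walking only the tasks directly in a cgroup would miss the tasks of
    children without their own css, which is the gap the e_csets[] based
    iteration closes.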

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo