12 Aug, 2013

1 commit

  • commit 084457f284abf6789d90509ee11dae383842b23b upstream.

    cgroup_cfts_commit() uses dget() to keep the cgroup alive after
    cgroup_mutex is dropped, but dget() won't prevent cgroupfs from being
    unmounted. When the race happens, vfs will see some dentries with a
    non-zero refcount while umount is in progress.

    Keep running this:
    mount -t cgroup -o blkio xxx /cgroup
    umount /cgroup

    And this:
    modprobe cfq-iosched
    rmmod cfq-iosched

    After a while, the BUG() in shrink_dcache_for_umount_subtree() may
    be triggered:

    BUG: Dentry xxx{i=0,n=blkio.yyy} still in use (1) [umount of cgroup cgroup]

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Li Zefan
     

22 Jul, 2013

1 commit

  • commit 1c8158eeae0f37d0eee9f1fbe68080df6a408df2 upstream.

    commit 5db9a4d99b0157a513944e9a44d29c9cec2e91dc
    Author: Tejun Heo
    Date: Sat Jul 7 16:08:18 2012 -0700

    cgroup: fix cgroup hierarchy umount race

    This commit fixed a race caused by the dput() in css_dput_fn(), but
    the dput() in cgroup_event_remove() can also lead to the same BUG().

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Li Zefan
     

29 May, 2013

1 commit

  • With the new __DEVEL__sane_behavior mount option, if the root cgroup
    is already mounted without the xattr option, mounting a new cgroup
    hierarchy with xattr is rejected by design, which is fine. However,
    if the root cgroup is not mounted with __DEVEL__sane_behavior,
    mounting a new cgroup with the xattr option succeeds, but the EA
    function does not work as expected afterwards: setting attributes
    under either cgroup fails with ENOTSUPP. e.g.

    setfattr: /cgroup2/test: Operation not supported

    Instead of keeping silent in this case, it's better to emit a log
    entry at warning level. That helps users understand what is going on
    behind the scenes, and it is essentially an improvement that does not
    break backward compatibility.

    With this fix, the mount attempt above keeps working as usual, but
    the following line can be found in the system log:

    [ ...] cgroup: new mount options do not match the existing superblock

    tj: minor formatting / message updates.

    Signed-off-by: Jie Liu
    Reported-by: Alexey Kodanev
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org

    Jeff Liu
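
    A minimal sketch of the idea in the entry above: warn instead of
    staying silent when a new mount's options don't match the live cgroup
    superblock. The check location, condition and field names are
    assumptions for illustration; only the message text comes from the
    log line quoted above.

    /* sketch: somewhere in the mount path, after an existing root is found */
    if (opts.flags != root->flags || opts.subsys_mask != root->subsys_mask)
        pr_warning("cgroup: new mount options do not match the existing superblock\n");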
     

24 May, 2013

1 commit

  • When cgroup_next_descendant_pre() initiates a walk, it checks whether
    the subtree root has any children and, if not, returns NULL. Later
    code assumes that the subtree isn't empty. This is broken because
    the subtree may become empty in between, which can lead to the
    traversal escaping the subtree by walking to the sibling of the
    subtree root.

    There's no reason to have the early exit path. Remove it along with
    the later assumption that the subtree isn't empty. This simplifies
    the code a bit and fixes the subtle bug.

    While at it, fix the comment of cgroup_for_each_descendant_pre() which
    was incorrectly referring to ->css_offline() instead of
    ->css_online().

    Signed-off-by: Tejun Heo
    Reviewed-by: Michal Hocko
    Cc: stable@vger.kernel.org

    Tejun Heo
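
    A self-contained sketch of the pre-order "next descendant" step the
    entry above describes, with simplified, hypothetical types (the kernel
    walks struct cgroup via RCU-protected child/sibling lists). Note that
    no early exit for a childless subtree root is needed: the child lookup
    simply yields NULL and the sibling-climbing loop stops at @root, so the
    walk cannot escape to @root's siblings even if children vanish mid-walk.

    struct node {
        struct node *parent;
        struct node *first_child;
        struct node *next_sibling;
    };

    /* return the next node of a pre-order walk of @root's descendants */
    static struct node *next_descendant_pre(struct node *pos, struct node *root)
    {
        if (!pos)                     /* first iteration: start from @root */
            pos = root;

        if (pos->first_child)         /* visit the first child if there is one */
            return pos->first_child;

        while (pos != root) {         /* else the closest ancestor's next sibling */
            if (pos->next_sibling)
                return pos->next_sibling;
            pos = pos->parent;
        }
        return NULL;                  /* the walk of @root's subtree is complete */
    }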
     

14 May, 2013

1 commit

  • cgroup_create_file() calls d_instantiate(), which may decide to look
    at the xattrs on the file. Smack always does this and SELinux can be
    configured to do so.

    But cgroup_add_file() didn't initialize xattrs before calling
    cgroup_create_file(), which ultimately leads to dereferencing a NULL
    dentry->d_fsdata.

    This bug has been there since cgroup xattr was introduced.

    Cc: stable@vger.kernel.org # 3.8.x
    Reported-by: Ivan Bulatovic
    Reported-by: Casey Schaufler
    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

02 May, 2013

2 commits

  • Pull VFS updates from Al Viro:

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     
  • Signed-off-by: Al Viro

    Al Viro
     

30 Apr, 2013

4 commits

  • Pull scheduler changes from Ingo Molnar:
    "The main changes in this development cycle were:

    - full dynticks preparatory work by Frederic Weisbecker

    - factor out the cpu time accounting code better, by Li Zefan

    - multi-CPU load balancer cleanups and improvements by Joonsoo Kim

    - various smaller fixes and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
    sched: Fix init NOHZ_IDLE flag
    sched: Prevent to re-select dst-cpu in load_balance()
    sched: Rename load_balance_tmpmask to load_balance_mask
    sched: Move up affinity check to mitigate useless redoing overhead
    sched: Don't consider other cpus in our group in case of NEWLY_IDLE
    sched: Explicitly cpu_idle_type checking in rebalance_domains()
    sched: Change position of resched_cpu() in load_balance()
    sched: Fix wrong rq's runnable_avg update with rt tasks
    sched: Document task_struct::personality field
    sched/cpuacct/UML: Fix header file dependency bug on the UML build
    cgroup: Kill subsys.active flag
    sched/cpuacct: No need to check subsys active state
    sched/cpuacct: Initialize cpuacct subsystem earlier
    sched/cpuacct: Initialize root cpuacct earlier
    sched/cpuacct: Allocate per_cpu cpuusage for root cpuacct statically
    sched/cpuacct: Clean up cpuacct.h
    sched/cpuacct: Remove redundant NULL checks in cpuacct_acount_field()
    sched/cpuacct: Remove redundant NULL checks in cpuacct_charge()
    sched/cpuacct: Add cpuacct_acount_field()
    sched/cpuacct: Add cpuacct_init()
    ...

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:

    - Fixes and a lot of cleanups. Locking cleanup is finally complete.
    cgroup_mutex is no longer exposed to individual controllers, which
    used to cause nasty deadlock issues. Li fixed and cleaned up quite a
    bit, including long-standing ones like the racy cgroup_path().

    - device cgroup now supports proper hierarchy thanks to Aristeu.

    - perf_event cgroup now supports proper hierarchy.

    - A new mount option "__DEVEL__sane_behavior" is added. As indicated
    by the name, this option is to be used for development only at this
    point and generates a warning message when used. Unfortunately,
    the cgroup interface currently has too many breakages and inconsistencies
    to implement a consistent and unified hierarchy on top. The new flag
    is used to collect the behavior changes which are necessary to
    implement consistent unified hierarchy. It's likely that this flag
    won't be used verbatim when it becomes ready but will be enabled
    implicitly along with unified hierarchy.

    The option currently disables some of the broken behaviors in cgroup core
    and also .use_hierarchy switch in memcg (will be routed through -mm),
    which can be used to make very unusual hierarchy where nesting is
    partially honored. It will also be used to implement hierarchy
    support for blk-throttle which would be impossible otherwise without
    introducing a full separate set of control knobs.

    This is essentially versioning of interface which isn't very nice but
    at this point I can't see any other options which would allow keeping
    the interface the same while moving towards hierarchy behavior which
    is at least somewhat sane. The planned unified hierarchy is likely
    to require some level of adaptation from userland anyway, so I think
    it'd be best to take the chance and update the interface such that
    it's supportable in the long term.

    Maintaining the existing interface does complicate cgroup core but
    shouldn't put too much strain on individual controllers and I think
    it'd be manageable for the foreseeable future. Maybe we'll be able
    to drop it in a decade.

    Fix up conflicts (including a semantic one adding a new #include to ppc
    that was uncovered by the header file changes) as per Tejun.

    * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits)
    cpuset: fix compile warning when CONFIG_SMP=n
    cpuset: fix cpu hotplug vs rebuild_sched_domains() race
    cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
    cgroup: restore the call to eventfd->poll()
    cgroup: fix use-after-free when umounting cgroupfs
    cgroup: fix broken file xattrs
    devcg: remove parent_cgroup.
    memcg: force use_hierarchy if sane_behavior
    cgroup: remove cgrp->top_cgroup
    cgroup: introduce sane_behavior mount option
    move cgroupfs_root to include/linux/cgroup.h
    cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
    cgroup: make cgroup_path() not print double slashes
    Revert "cgroup: remove bind() method from cgroup_subsys."
    perf: make perf_event cgroup hierarchical
    cgroup: implement cgroup_is_descendant()
    cgroup: make sure parent won't be destroyed before its children
    cgroup: remove bind() method from cgroup_subsys.
    devcg: remove broken_hierarchy tag
    cgroup: remove cgroup_lock_is_held()
    ...

    Linus Torvalds
     
  • Pull workqueue updates from Tejun Heo:
    "A lot of activities on workqueue side this time. The changes achieve
    the followings.

    - WQ_UNBOUND workqueues - the workqueues which are not per-cpu - are
    updated to be able to interface with multiple backend worker pools.
    This involved a lot of churning but the end result seems actually
    neater as unbound workqueues are now a lot closer to per-cpu ones.

    - The ability to interface with multiple backend worker pools is
    used to implement unbound workqueues with custom attributes.
    Currently the supported attributes are the nice level and CPU
    affinity. It may be expanded to include cgroup association in
    future. The attributes can be specified either by calling
    apply_workqueue_attrs() or through /sys/bus/workqueue/WQ_NAME/* if
    the workqueue in question is exported through sysfs.

    The backend worker pools are keyed by the actual attributes and
    shared by any workqueues which share the same attributes. When
    attributes of a workqueue are changed, the workqueue binds to the
    worker pool with the specified attributes while leaving the work
    items which are already executing in its previous worker pools
    alone.

    This allows converting custom worker pool implementations which
    want worker attribute tuning to use workqueues. The writeback pool
    is already converted in the block tree and a couple of others are
    likely to follow, including the btrfs io workers.

    - WQ_UNBOUND's ability to bind to multiple worker pools is also used
    to make it NUMA-aware. Because there's no association between work
    item issuer and the specific worker assigned to execute it, before
    this change, using unbound workqueue led to unnecessary cross-node
    bouncing and it couldn't be helped by autonuma as it requires tasks
    to have implicit node affinity and workers are assigned randomly.

    After these changes, an unbound workqueue now binds to multiple
    NUMA-affine worker pools so that queued work items are executed in
    the same node. This is turned on by default but can be disabled
    system-wide or for individual workqueues.

    Crypto was requesting NUMA affinity as encrypting data across
    different nodes can contribute noticeable overhead; doing it
    per-cpu was too limiting for certain cases, and IO throughput could
    be bottlenecked by one CPU being fully occupied while others had
    idle cycles.

    While the new features required a lot of changes including
    restructuring locking, it didn't complicate the execution paths much.
    The unbound workqueue handling is now closer to per-cpu ones and the
    new features are implemented by simply associating a workqueue with
    different sets of backend worker pools without changing queue,
    execution or flush paths.

    As such, even though the amount of change is very high, I feel
    relatively safe in that it isn't likely to cause subtle issues with
    basic correctness of work item execution and handling. If something
    is wrong, it's likely to show up as being associated with worker pools
    with the wrong attributes or OOPS while workqueue attributes are being
    changed or during CPU hotplug.

    While this creates more backend worker pools, it doesn't add too many
    more workers unless, of course, there are many workqueues with unique
    combinations of attributes. Assuming everything else is the same,
    NUMA awareness costs an extra worker pool per NUMA node with online
    CPUs.

    There are also a couple things which are being routed outside the
    workqueue tree.

    - block tree pulled in workqueue for-3.10 so that writeback worker
    pool can be converted to unbound workqueue with sysfs control
    exposed. This simplifies the code, makes writeback workers
    NUMA-aware and allows tuning nice level and CPU affinity via sysfs.

    - The conversion to workqueue means that there's no longer a 1:1
    association between a specific worker and a given filesystem's
    writeback, which makes writeback folks unhappy as they want to be
    able to tell which filesystem caused a problem from a backtrace on
    systems with many filesystems mounted. This is
    resolved by allowing work items to set debug info string which is
    printed when the task is dumped. As this change involves unifying
    implementations of dump_stack() and friends in arch codes, it's
    being routed through Andrew's -mm tree."

    * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (84 commits)
    workqueue: use kmem_cache_free() instead of kfree()
    workqueue: avoid false negative WARN_ON() in destroy_workqueue()
    workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity
    workqueue: implement NUMA affinity for unbound workqueues
    workqueue: introduce put_pwq_unlocked()
    workqueue: introduce numa_pwq_tbl_install()
    workqueue: use NUMA-aware allocation for pool_workqueues
    workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()
    workqueue: map an unbound workqueues to multiple per-node pool_workqueues
    workqueue: move hot fields of workqueue_struct to the end
    workqueue: make workqueue->name[] fixed len
    workqueue: add workqueue->unbound_attrs
    workqueue: determine NUMA node of workers accourding to the allowed cpumask
    workqueue: drop 'H' from kworker names of unbound worker pools
    workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]
    workqueue: move pwq_pool_locking outside of get/put_unbound_pool()
    workqueue: fix memory leak in apply_workqueue_attrs()
    workqueue: fix unbound workqueue attrs hashing / comparison
    workqueue: fix race condition in unbound workqueue free path
    workqueue: remove pwq_lock which is no longer used
    ...

    Linus Torvalds
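
    A short sketch of the apply_workqueue_attrs() usage described above,
    assuming an unbound workqueue wq (created with WQ_UNBOUND) and an int
    ret in scope; the nice value and node are illustrative only.

    struct workqueue_attrs *attrs;

    attrs = alloc_workqueue_attrs(GFP_KERNEL);
    if (!attrs)
        return -ENOMEM;
    attrs->nice = -10;                                  /* worker nice level */
    cpumask_copy(attrs->cpumask, cpumask_of_node(0));   /* restrict workers to node 0 */
    ret = apply_workqueue_attrs(wq, attrs);             /* rebind wq to a matching pool */
    free_workqueue_attrs(attrs);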
     
  • Now that we have generic and well-ordered cgroup tree walkers, there
    is no need to keep css_get_next() around.

    Signed-off-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Ying Han
    Cc: Tejun Heo
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

27 Apr, 2013

2 commits

  • I mistakenly removed the call to eventfd->poll() while I was actually
    intending to remove the return value...

    Calling eventfd->poll() hooks cgroup_event_wake() to the poll
    waitqueue, which will be called to unregister the eventfd when a
    cgroup is removed or the eventfd is closed.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • Try:
    # mount -t cgroup xxx /cgroup
    # mkdir /cgroup/sub && rmdir /cgroup/sub && umount /cgroup

    And you might see this:

    ida_remove called for id=1 which is not allocated.

    It's because cgroup_kill_sb() is called to destroy root->cgroup_ida
    and free cgrp->root before ida_simple_remove() is called. What's
    worse, we're accessing cgrp->root after it has been freed.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

19 Apr, 2013

1 commit

  • We should store file xattrs in struct cfent instead of struct cftype,
    because cftype is a type while cfent is an object instance of that
    type.

    For example, each cgroup has a tasks file, and each tasks file is
    associated with a unique cfent, but all those files share the same
    struct cftype.

    Alexey Kodanev reported a crash, which can be reproduced:

    # mount -t cgroup -o xattr /sys/fs/cgroup
    # mkdir /sys/fs/cgroup/test
    # setfattr -n trusted.value -v test_value /sys/fs/cgroup/tasks
    # rmdir /sys/fs/cgroup/test
    # umount /sys/fs/cgroup
    oops!

    In this case, simple_xattrs_free() will free the same struct simple_xattrs
    twice.

    tj: Dropped unused local variable @cft from cgroup_diput().

    Cc: stable@vger.kernel.org # 3.8.x
    Reported-by: Alexey Kodanev
    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
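
    A rough sketch of the resulting layout, with most fields trimmed and
    the remaining ones assumed for illustration: per-type data stays in
    struct cftype, while the per-file xattrs move to struct cfent.

    struct cftype {                 /* one per file type, shared by all cgroups */
        char name[MAX_CFTYPE_NAME];
        /* read/write methods, mode, etc. */
    };

    struct cfent {                  /* one per file instance in a cgroup directory */
        struct list_head node;
        struct dentry *dentry;
        struct cftype *type;
        struct simple_xattrs xattrs;    /* moved here from struct cftype */
    };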
     

15 Apr, 2013

5 commits

  • It's not used, and it can be retrieved via cgrp->root->top_cgroup.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • It's a sad fact that at this point various cgroup controllers are
    carrying so many idiosyncrasies and pure insanities that it simply
    isn't possible to reach any sort of sane, consistent behavior while
    staying fully compatible with what already has been exposed to
    userland.

    As we can't break exposed userland interface, transitioning to sane
    behaviors can only be done in steps while maintaining backwards
    compatibility. This patch introduces a new mount option -
    __DEVEL__sane_behavior - which disables crazy features and enforces
    consistent behaviors in cgroup core proper and various controllers.
    As exactly which behaviors it changes are still being determined, the
    mount option, at this point, is useful only for development of the new
    behaviors. As such, the mount option is prefixed with __DEVEL__ and
    generates a warning message when used.

    Eventually, once we get to the point where all controllers' behaviors
    are consistent enough to implement a unified hierarchy, the __DEVEL__
    prefix will be dropped, and more importantly, the unified hierarchy
    will enforce sane_behavior by default. Maybe we'll be able to
    completely drop the crazy stuff after a while, maybe not, but we at
    least have a strategy to move on to saner behaviors.

    This patch introduces the mount option and changes the following
    behaviors in cgroup core.

    * Mount options "noprefix" and "clone_children" are disallowed. Also,
    cgroupfs file cgroup.clone_children is not created.

    * When mounting an existing superblock, mount options should match.
    This is currently pretty crazy. If one mounts a cgroup, creates a
    subdirectory, unmounts it and then mounts it again with different
    options, it looks like the new options are applied but they aren't.

    * Remount is disallowed.

    The behaviors changes are documented in the comment above
    CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
    controllers are converted and planned improvements progress.

    v2: Dropped unnecessary explicit file permission setting from the
    sane_behavior cftype entry as suggested by Li Zefan.

    Signed-off-by: Tejun Heo
    Acked-by: Serge E. Hallyn
    Acked-by: Li Zefan
    Cc: Michal Hocko
    Cc: Vivek Goyal

    Tejun Heo
     
  • While controllers shouldn't be accessing cgroupfs_root directly, it
    being hidden inside kernel/cgroup.c makes some things pretty silly.
    This makes routing hierarchy-wide settings which need to be visible
    to controllers cumbersome.

    We're gonna add another hierarchy-wide setting which needs to be
    accessed from controllers. Move cgroupfs_root and its flags to the
    header file so that we can access root settings with inline helpers.

    Signed-off-by: Tejun Heo
    Acked-by: Serge E. Hallyn
    Acked-by: Li Zefan

    Tejun Heo
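
    The kind of inline helper this enables, sketched with the
    sane_behavior flag that the following patch introduces (taken as an
    assumption here):

    static inline bool cgroup_sane_behavior(const struct cgroup *cgrp)
    {
        return cgrp->root->flags & CGRP_ROOT_SANE_BEHAVIOR;
    }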
     
  • There's no reason to be using bitops, which tend to be more
    cumbersome, to handle root flags. Convert them to masks. Also, as
    they'll be moved to include/linux/cgroup.h and it's generally a good
    idea, add CGRP_ prefix.

    Note that flags are assigned from (1 << 1). The first bit will be
    used by a flag which will be added soon.

    Signed-off-by: Tejun Heo
    Acked-by: Serge E. Hallyn
    Acked-by: Li Zefan

    Tejun Heo
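
    A sketch of the converted flags; the bit positions follow the note
    above, while the specific flag names are assumptions for illustration.

    enum {
        /* (1 << 0) is reserved for the flag added by the next patch */
        CGRP_ROOT_NOPREFIX = (1 << 1),  /* mounted subsystems have no named prefix */
        CGRP_ROOT_XATTR    = (1 << 2),  /* supports extended attributes */
    };

    /* flags are now tested with plain masks, e.g. (root->flags & CGRP_ROOT_XATTR),
     * instead of test_bit()/set_bit() on a bit number */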
     
  • While reimplementing cgroup_path(), 65dff759d2 ("cgroup: fix
    cgroup_path() vs rename() race") introduced a bug where the path of a
    non-root cgroup would have two slashes at the beginning, which is
    caused by treating the root cgroup, which has the name '/', like
    non-root cgroups.

    $ grep systemd /proc/self/cgroup
    1:name=systemd://user/root/1

    Fix it by special-casing the root cgroup and not looping over it in
    the normal path.

    Signed-off-by: Tejun Heo
    Cc: Li Zefan

    Tejun Heo
     

13 Apr, 2013

1 commit


11 Apr, 2013

3 commits

  • A couple of controllers want to determine whether two cgroups are in
    an ancestor/descendant relationship. As it's more likely that the
    descendant is the primary subject of interest and there are other
    operations focusing on the descendants, let's ask is_descendant
    rather than is_ancestor.

    Implementation is trivial as the previous patch guarantees that all
    ancestors of a cgroup stay accessible as long as the cgroup is
    accessible.

    tj: Removed depth optimization, renamed from cgroup_is_ancestor(),
    rewrote descriptions.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
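
    A self-contained sketch of the trivial ancestor walk the entry refers
    to, using a simplified, hypothetical node type (the kernel version
    operates on struct cgroup through its parent pointer). It is safe
    because, per the previous patch, all ancestors stay accessible as long
    as the cgroup itself is accessible.

    #include <stdbool.h>
    #include <stddef.h>

    struct cgrp { struct cgrp *parent; };

    /* true if @cgrp is @ancestor itself or lies anywhere below it */
    static bool cgrp_is_descendant(const struct cgrp *cgrp, const struct cgrp *ancestor)
    {
        while (cgrp) {
            if (cgrp == ancestor)
                return true;
            cgrp = cgrp->parent;
        }
        return false;
    }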
     
  • Suppose we rmdir a cgroup while there are still css refs; this cgroup
    won't be freed. Then we rmdir the parent cgroup, and the parent is
    freed immediately because its css refcount has dropped to 0. Now it
    would be a disaster if the still-alive child cgroup tries to access
    its parent.

    Make sure this won't happen.

    Signed-off-by: Li Zefan
    Reviewed-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • The bind() method of cgroup_subsys is not used in any of the
    controllers (cpuset, freezer, blkio, net_cls, memcg, net_prio,
    devices, perf, hugetlb, cpu and cpuacct).

    tj: Removed the entry on ->bind() from
    Documentation/cgroups/cgroups.txt. Also updated a couple
    paragraphs which were suggesting that dynamic re-binding may be
    implemented. It's not gonna.

    Signed-off-by: Rami Rosen
    Signed-off-by: Tejun Heo

    Rami Rosen
     

10 Apr, 2013

1 commit


08 Apr, 2013

5 commits

  • We don't want controllers to assume that the information is officially
    available and do funky things with it.

    The only user is task_subsys_state_check() which uses it to verify RCU
    access context. We can move cgroup_lock_is_held() inside
    CONFIG_PROVE_RCU but that doesn't add meaningful protection compared
    to conditionally exposing cgroup_mutex.

    Remove cgroup_lock_is_held(), export cgroup_mutex iff CONFIG_PROVE_RCU
    and use lockdep_is_held() directly on the mutex in
    task_subsys_state_check().

    While at it, add parentheses around macro arguments in
    task_subsys_state_check().

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
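
    A sketch of the resulting pattern; the exact expression inside
    task_subsys_state_check() differs, so treat the details as assumptions.

    #ifdef CONFIG_PROVE_RCU
    extern struct mutex cgroup_mutex;   /* visible only for lockdep annotations */
    #endif

    /* RCU-protected accesses are then checked directly against the mutex, e.g.:
     *   rcu_dereference_check(task->cgroups, lockdep_is_held(&cgroup_mutex));
     */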
     
  • Now that the locking interface is unexported, there's no reason to
    keep around these thin wrappers. Kill them and use mutex operations
    directly.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Now that all external cgroup_lock() users are gone, we can finally
    unexport the locking interface and prevent future abuse of
    cgroup_mutex.

    Make cgroup_[un]lock() and cgroup_lock_live_group() static. Also,
    cgroup_attach_task() doesn't have any users left and can't be used
    without the locking interface anyway. Make it static too.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • cgroup_lock_live_group() and cgroup_attach_task() are scheduled to be
    made static. Relocate the former and cgroup_attach_task_all() so that
    we don't need forward declarations.

    This patch is pure relocation.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • When a cpuset becomes empty (no CPU or memory), its tasks are
    transferred to the nearest ancestor with execution resources. This
    is implemented using cgroup_scan_tasks() with a callback which grabs
    cgroup_mutex and invokes cgroup_attach_task() on each task.

    Both cgroup_mutex and cgroup_attach_task() are scheduled to be
    unexported. Implement cgroup_transfer_tasks() in cgroup proper which
    is essentially the same as move_member_tasks_to_cpuset() except that
    it takes cgroups instead of cpusets and @to comes before @from like
    normal functions with those arguments, and replace
    move_member_tasks_to_cpuset() with it.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

04 Apr, 2013

1 commit


20 Mar, 2013

3 commits

  • These two functions share most of the code.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • The 3rd parameter of flex_array_prealloc() is the number of elements,
    not the index of the last element.

    The effect of the bug: when opening cgroup.procs, a flex array is
    allocated and all of its elements are preallocated with GFP_KERNEL,
    except the last one, which is later allocated with GFP_ATOMIC; if
    that allocation fails, it triggers a BUG_ON().

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org

    Li Zefan
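
    A sketch of the corrected call, with variable names and the error
    label assumed for illustration; the point is that
    flex_array_prealloc() takes an element count, so passing
    group_size - 1 leaves the last slot unallocated.

    group = flex_array_alloc(sizeof(struct task_struct *), group_size, GFP_KERNEL);
    if (!group)
        return -ENOMEM;
    /* buggy: flex_array_prealloc(group, 0, group_size - 1, GFP_KERNEL); */
    retval = flex_array_prealloc(group, 0, group_size, GFP_KERNEL);
    if (retval)
        goto out_free_group_list;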
     
  • PF_THREAD_BOUND was originally used to mark kernel threads which were
    bound to a specific CPU using kthread_bind(), and a task with the
    flag set allows cpus_allowed modifications only from itself.
    Workqueue is currently abusing it to prevent userland from meddling
    with the cpus_allowed of workqueue workers.

    What we need is a flag to prevent userland from messing with the
    cpus_allowed of certain kernel tasks. In the kernel, anyone can
    (incorrectly) squash the flag, and, for worker-type usages,
    restricting cpus_allowed modification to the task itself doesn't
    provide meaningful extra protection as other tasks can inject work
    items to the task anyway.

    This patch replaces PF_THREAD_BOUND with PF_NO_SETAFFINITY.
    sched_setaffinity() checks the flag and returns -EINVAL if it is set.
    set_cpus_allowed_ptr() is no longer affected by the flag.

    This will allow simplifying workqueue worker CPU affinity management.

    Signed-off-by: Tejun Heo
    Acked-by: Ingo Molnar
    Reviewed-by: Lai Jiangshan
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner

    Tejun Heo
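
    A sketch of the new check in sched_setaffinity(); the surrounding
    error-handling variable and label are assumed for illustration.

    if (p->flags & PF_NO_SETAFFINITY) {
        retval = -EINVAL;
        goto out_put_task;
    }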
     

13 Mar, 2013

5 commits


06 Mar, 2013

1 commit

  • subsys[i] is set to NULL in cgroup_unload_subsys() at modular unload,
    and that's protected by cgroup_mutex, and then the memory *subsys[i]
    resides in will be freed.

    So this is unsafe without any locking:

    if (!ss || ss->module)
    ...

    v2:
    - add a comment for enum cgroup_subsys_id
    - simplify the comment in cgroup_exit()

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

05 Mar, 2013

1 commit

  • We no longer fail rmdir() when there're still css refs, so we don't
    need to check css refs in check_for_release().

    This also avoids a bug: cgroup_has_css_refs() accesses subsys[i]
    without cgroup_mutex, so it can race with cgroup_unload_subsys().

    cgroup_has_css_refs()
    ...
    if (ss == NULL || ss->root != cgrp->root)

    If ss points to net_cls_subsys, and the cls_cgroup module is unloaded
    right after the former check but before the latter, the memory that
    net_cls_subsys resides in has become invalid.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan