03 Nov, 2011

1 commit

  • While back-porting Johannes Weiner's patch "mm: memcg-aware global
    reclaim" for an internal effort, we noticed a significant performance
    regression during page-reclaim heavy workloads due to high contention of
    the ss->id_lock. This lock protects idr map, and serializes calls to
    idr_get_next() in css_get_next() (which is used during the memcg hierarchy
    walk).

    Since idr_get_next() is just doing a look up, we need only serialize it
    with respect to idr_remove()/idr_get_new(). By making the ss->id_lock a
    rwlock, contention is greatly reduced and performance improves.

    Tested: cat a 256m file from a ramdisk in a 128m container 50 times on
    each core (one file + container per core) in parallel on a NUMA machine.
    Result is the time for the test to complete in 1 of the containers.
    Both kernels included Johannes' memcg-aware global reclaim patches.

    Before rwlock patch: 1710.778s
    After rwlock patch: 152.227s

    Signed-off-by: Andrew Bresticker
    Cc: Paul Menage
    Cc: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Bresticker
     

09 Jul, 2011

1 commit


27 May, 2011

2 commits

  • The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
    leads to some problems:

    * cgroup creation is out-of-control
    * cgroup name can conflict when pids are looping
    * it is not possible to have a single process handling a lot of
    namespaces without falling into an exponential creation time
    * we may want to create a namespace without creating a cgroup

    The ns_cgroup was replaced by a compatibility flag 'clone_children',
    where a newly created cgroup will copy the parent cgroup values.
    The userspace has to manually create a cgroup and add a task to
    the 'tasks' file.

    This patch removes the ns_cgroup as suggested in the following thread:

    https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html

    The 'cgroup_clone' function is removed because it is no longer used.

    This is a userspace-visible change. Commit 45531757b45c ("cgroup: notify
    ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
    printk warning users that the feature is planned for removal. Since that
    time we have heard from XXX users who were affected by this.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Serge E. Hallyn
    Cc: Eric W. Biederman
    Cc: Jamal Hadi Salim
    Reviewed-by: Li Zefan
    Acked-by: Paul Menage
    Acked-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Lezcano
     
  • Add cgroup subsystem callbacks for per-thread attachment in atomic contexts

    Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
    for cgroups' subsystem interface. Unlike can_attach and attach, these
    are for per-thread operations, to be called potentially many times when
    attaching an entire threadgroup.

    Also, the old "bool threadgroup" interface is removed, as replaced by
    this. All subsystems are modified for the new interface - of note is
    cpuset, which requires from/to nodemasks for attach to be globally scoped
    (though per-cpuset would work too) to persist from its pre_attach to
    attach_task and attach.
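
    A minimal userspace sketch of how per-thread callbacks might be driven
    when a whole threadgroup is moved. The struct layout, the integer thread
    ids and the call order (check every thread, prepare once, then attach
    each thread) are assumptions for illustration, not the kernel's
    definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the per-thread hooks named in the changelog. */
struct subsys_ops {
    int  (*can_attach_task)(int tid);  /* 0 to allow, nonzero to veto */
    void (*pre_attach)(void);          /* called once per group move */
    void (*attach_task)(int tid);      /* called once per thread */
};

/*
 * Move every thread in tids[]. Returns 0 on success or the first veto
 * code; when a thread is vetoed, no thread has been attached yet.
 */
int attach_threadgroup(const struct subsys_ops *ops, const int *tids, size_t n)
{
    assert(ops && tids);

    for (size_t i = 0; i < n; i++) {
        if (ops->can_attach_task) {
            int err = ops->can_attach_task(tids[i]);
            if (err)
                return err;
        }
    }
    if (ops->pre_attach)
        ops->pre_attach();
    for (size_t i = 0; i < n; i++) {
        if (ops->attach_task)
            ops->attach_task(tids[i]);
    }
    return 0;
}

/* Demo subsystem: vetoes negative ids and counts attached threads. */
int demo_attached;

int demo_can_attach_task(int tid)
{
    return tid < 0 ? -1 : 0;
}

void demo_attach_task(int tid)
{
    (void)tid;
    demo_attached++;
}

const struct subsys_ops demo_ops = {
    .can_attach_task = demo_can_attach_task,
    .pre_attach = NULL,
    .attach_task = demo_attach_task,
};
```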

    This is a pre-patch for cgroup-procs-writable.patch.

    Signed-off-by: Ben Blum
    Cc: "Eric W. Biederman"
    Cc: Li Zefan
    Cc: Matt Helsley
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     

31 Mar, 2011

1 commit


16 Feb, 2011

2 commits

  • This kernel patch adds the ability to filter monitoring based on
    container groups (cgroups). This is for use in per-cpu mode only.

    The cgroup to monitor is passed as a file descriptor in the pid
    argument to the syscall. The file descriptor must be opened to
    the cgroup name in the cgroup filesystem. For instance, if the
    cgroup name is foo and cgroupfs is mounted in /cgroup, then the
    file descriptor is opened to /cgroup/foo. Cgroup mode is
    activated by passing PERF_FLAG_PID_CGROUP in the flags argument
    to the syscall.

    For instance to measure in cgroup foo on CPU1 assuming
    cgroupfs is mounted under /cgroup:

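    struct perf_event_attr attr;	/* fill in type/config before use */
    int cgroup_fd, fd;

    /* the cgroup is identified by an fd opened on its cgroupfs directory */
    cgroup_fd = open("/cgroup/foo", O_RDONLY);
    /* pid argument carries cgroup_fd, cpu is 1, no group leader (-1) */
    fd = perf_event_open(&attr, cgroup_fd, 1, -1, PERF_FLAG_PID_CGROUP);
    close(cgroup_fd);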

    Signed-off-by: Stephane Eranian
    [ added perf_cgroup_{exit,attach} ]
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     
  • Make the ::exit method act like ::attach, it is after all very nearly
    the same thing.

    The bug had no effect on correctness - fixing it is an optimization for
    the scheduler. Also, later perf-cgroups patches rely on it.

    Signed-off-by: Peter Zijlstra
    Acked-by: Paul Menage
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

02 Nov, 2010

1 commit

  • "gadget", "through", "command", "maintain", "maintain", "controller", "address",
    "between", "initiali[zs]e", "instead", "function", "select", "already",
    "equal", "access", "management", "hierarchy", "registration", "interest",
    "relative", "memory", "offset", "already",

    Signed-off-by: Uwe Kleine-König
    Signed-off-by: Jiri Kosina

    Uwe Kleine-König
     

28 Oct, 2010

1 commit

  • The ns_cgroup is a control group interacting with the namespaces. When a
    new namespace is created, a corresponding cgroup is automatically created
    too. The cgroup name is the pid of the process that called 'unshare' or
    of the child of 'clone'.

    This cgroup is tied to the namespace because it prevents a process from
    escaping the control group, and it uses the post_clone callback so that
    the child cgroup inherits the values of the parent cgroup.

    Unfortunately, the more we use this cgroup, the more problems we face
    with it:

    (1) when a process unshares, the cgroup name may conflict with a
    previous cgroup with the same pid, so unshare or clone returns -EEXIST

    (2) cgroup creation is out of control because an application may create
    several namespaces, for which the system automatically creates several
    cgroups behind its back and leaves them on the cgroupfs (e.g. a vrf
    based on the network namespace).

    (3) the combination of (1) and (2) forces an administrator to regularly
    check and clean up these cgroups.

    This patchset removes the ns_cgroup by adding a new flag to the cgroup and
    a cgroupfs mount option. The flag enables copying the parent cgroup's
    values when a child cgroup is created. We can then safely remove the
    ns_cgroup, as this flag provides compatibility. We now have to manually
    create a cgroup and add the task to it, which is consistent with the
    cgroup framework.

    This patch:

    Sent as an answer to a previous thread around the ns_cgroup.

    https://lists.linux-foundation.org/pipermail/containers/2009-June/018627.html

    It adds a control file 'clone_children' for a cgroup. This control file
    is a boolean specifying if the child cgroup should be a clone of the
    parent cgroup or not. The default value is 'false'.

    This flag makes the child cgroup call the post_clone callback of every
    subsystem that provides one.

    At present, cpuset is the only subsystem that implements the
    post_clone callback.

    The option can be set at mount time by specifying the 'clone_children'
    mount option.
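
    From userspace, setting such a boolean control file could look like the
    sketch below. The helper name set_bool_file() is hypothetical, and the
    path passed to it (for instance a 'clone_children' file inside a cgroup
    directory under an assumed /cgroup mount point) is an assumption.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Write "1" or "0" to an existing boolean control file.
 * Returns 0 on success, -1 on failure (errno is left as set by the
 * failing open, write or close call).
 */
int set_bool_file(const char *path, int on)
{
    assert(path);

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, on ? "1" : "0", 1);
    int rc = (n == 1) ? 0 : -1;

    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

    A caller would enable the flag on the parent before creating children,
    e.g. set_bool_file("/cgroup/foo/clone_children", 1), with the path
    adjusted to the real mount point and file name.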

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Serge E. Hallyn
    Cc: Eric W. Biederman
    Acked-by: Paul Menage
    Reviewed-by: Li Zefan
    Cc: Jamal Hadi Salim
    Cc: Matt Helsley
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Lezcano
     

07 Oct, 2010

1 commit


10 Sep, 2010

1 commit

  • Add cgroup_attach_task_all()

    The existing cgroup_attach_task_current_cg() API is called by a thread to
    attach another thread to all of its cgroups; this is unsuitable for cases
    where a privileged task wants to attach itself to the cgroups of a less
    privileged one, since the call must be made from the context of the target
    task.

    This patch adds a more generic cgroup_attach_task_all() API that allows
    both the source task and to-be-moved task to be specified.
    cgroup_attach_task_current_cg() becomes a specialization of the more
    generic new function.

    [menage@google.com: rewrote changelog]
    [akpm@linux-foundation.org: address reviewer comments]
    Signed-off-by: Michael S. Tsirkin
    Tested-by: Alex Williamson
    Acked-by: Paul Menage
    Cc: Li Zefan
    Cc: Ben Blum
    Cc: Sridhar Samudrala
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     

20 Aug, 2010

1 commit


05 Aug, 2010

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1443 commits)
    phy/marvell: add 88ec048 support
    igb: Program MDICNFG register prior to PHY init
    e1000e: correct MAC-PHY interconnect register offset for 82579
    hso: Add new product ID
    can: Add driver for esd CAN-USB/2 device
    l2tp: fix export of header file for userspace
    can-raw: Fix skb_orphan_try handling
    Revert "net: remove zap_completion_queue"
    net: cleanup inclusion
    phy/marvell: add 88e1121 interface mode support
    u32: negative offset fix
    net: Fix a typo from "dev" to "ndev"
    igb: Use irq_synchronize per vector when using MSI-X
    ixgbevf: fix null pointer dereference due to filter being set for VLAN 0
    e1000e: Fix irq_synchronize in MSI-X case
    e1000e: register pm_qos request on hardware activation
    ip_fragment: fix subtracting PPPOE_SES_HLEN from mtu twice
    net: Add getsockopt support for TCP thin-streams
    cxgb4: update driver version
    cxgb4: add new PCI IDs
    ...

    Manually fix up conflicts in:
    - drivers/net/e1000e/netdev.c: due to pm_qos registration
    infrastructure changes
    - drivers/net/phy/marvell.c: conflict between adding 88ec048 support
    and cleaning up the IDs
    - drivers/net/wireless/ipw2x00/ipw2100.c: trivial ipw2100_pm_qos_req
    conflict (registration change vs marking it static)

    Linus Torvalds
     

28 Jul, 2010

1 commit


09 Jun, 2010

1 commit

  • PROVE_RCU has a few issues with the cpu_cgroup because the scheduler
    typically holds rq->lock around the css rcu derefs but the generic
    cgroup code doesn't (and can't) know about that lock.

    Provide means to add extra checks to the css dereference and use that
    in the scheduler to annotate its users.

    The addition of rq->lock to these checks is correct because the
    cgroup_subsys::attach() method takes the rq->lock for each task it
    moves, therefore by holding that lock, we ensure the task is pinned to
    the current cgroup and the RCU dereference is valid.

    That leaves one genuine race in __sched_setscheduler() where we used
    task_group() without holding any of the required locks and thus raced
    with the cgroup code. Solve this by moving the check under the
    appropriate lock.

    Signed-off-by: Peter Zijlstra
    Cc: "Paul E. McKenney"
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

28 May, 2010

1 commit

  • Since we are unable to handle an error returned by
    cftype.unregister_event() properly, let's make the callback
    void-returning.

    mem_cgroup_unregister_event() has been rewritten to be a "never fail"
    function. On mem_cgroup_usage_register_event() we save old buffer for
    thresholds array and reuse it in mem_cgroup_usage_unregister_event() to
    avoid allocation.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Phil Carmody
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

05 May, 2010

1 commit

  • Expand task_subsys_state()'s rcu_dereference_check() to include the full
    locking rule as documented in Documentation/cgroups/cgroups.txt by adding
    a check for task->alloc_lock being held.

    This fixes an RCU false positive when resuming from suspend. The warning
    comes from freezer cgroup in cgroup_freezing_or_frozen().

    Signed-off-by: Li Zefan
    Acked-by: Matt Helsley
    Signed-off-by: Paul E. McKenney

    Li Zefan
     

13 Mar, 2010

7 commits

  • Events should be removed after rmdir of cgroup directory, but before
    destroying subsystem state objects. Let's take reference to cgroup
    directory dentry to do that.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Acked-by: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Dan Malek
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • This patchset introduces eventfd-based API for notifications in cgroups
    and implements memory notifications on top of it.

    It uses statistics in the memory controller to track memory usage.

    Output of time(1) on building kernel on tmpfs:

    Root cgroup before changes:
    make -j2 506.37 user 60.93s system 193% cpu 4:52.77 total
    Non-root cgroup before changes:
    make -j2 507.14 user 62.66s system 193% cpu 4:54.74 total
    Root cgroup after changes (0 thresholds):
    make -j2 507.13 user 62.20s system 193% cpu 4:53.55 total
    Non-root cgroup after changes (0 thresholds):
    make -j2 507.70 user 64.20s system 193% cpu 4:55.70 total
    Root cgroup after changes (1 thresholds, never crossed):
    make -j2 506.97 user 62.20s system 193% cpu 4:53.90 total
    Non-root cgroup after changes (1 thresholds, never crossed):
    make -j2 507.55 user 64.08s system 193% cpu 4:55.63 total

    This patch:

    Introduce the write-only file "cgroup.event_control" in every cgroup.

    To register a new notification handler you need to:
    - create an eventfd;
    - open a control file to be monitored. Callbacks register_event() and
    unregister_event() must be defined for the control file;
    - write "<event_fd> <control_fd> <args>" to cgroup.event_control.
    Interpretation of args is defined by control file implementation;

    eventfd will be woken up by the control file implementation or when the
    cgroup is removed.

    To unregister a notification handler just close the eventfd.

    If you need notification functionality for a control file you have to
    implement callbacks register_event() and unregister_event() in the
    struct cftype.
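
    The registration steps can be sketched from userspace as follows,
    assuming the line written to cgroup.event_control has the form
    "<event_fd> <control_fd> <args>" as described in the kernel's cgroups
    documentation. The helper names and the directory and file names passed
    in by the caller are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Build the line written to cgroup.event_control.
 * Returns the string length, or -1 if it does not fit in buf.
 */
int format_event_control(char *buf, size_t len, int event_fd,
                         int control_fd, const char *args)
{
    assert(buf && args);
    int n = snprintf(buf, len, "%d %d %s", event_fd, control_fd, args);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Register an eventfd for one control file of one cgroup.
 * On success returns the eventfd and stores the still-open control file
 * descriptor in *control_fd_out; the caller closes both when done.
 * Returns -1 on any failure.
 */
int register_cgroup_event(const char *cgroup_dir, const char *control_file,
                          const char *args, int *control_fd_out)
{
    char path[512], line[128];
    int efd, cfd = -1, ecfd = -1, n;

    efd = eventfd(0, 0);
    if (efd < 0)
        return -1;

    n = snprintf(path, sizeof(path), "%s/%s", cgroup_dir, control_file);
    if (n < 0 || (size_t)n >= sizeof(path))
        goto fail;
    cfd = open(path, O_RDONLY);
    if (cfd < 0)
        goto fail;

    n = snprintf(path, sizeof(path), "%s/cgroup.event_control", cgroup_dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        goto fail;
    ecfd = open(path, O_WRONLY);
    if (ecfd < 0)
        goto fail;

    n = format_event_control(line, sizeof(line), efd, cfd, args);
    if (n < 0 || write(ecfd, line, (size_t)n) != n)
        goto fail;

    close(ecfd);
    *control_fd_out = cfd;
    return efd;

fail:
    if (ecfd >= 0)
        close(ecfd);
    if (cfd >= 0)
        close(cfd);
    close(efd);
    return -1;
}
```

    After a successful call, a blocking read() of 8 bytes on the returned
    eventfd waits for the next notification.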

    [kamezawa.hiroyu@jp.fujitsu.com: Kconfig fix]
    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: KAMEZAWA Hiroyuki
    Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Dan Malek
    Cc: Vladislav Buzov
    Cc: Daisuke Nishimura
    Cc: Alexander Shishkin
    Cc: Davide Libenzi
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Provides support for unloading modular subsystems.

    This patch adds a new function cgroup_unload_subsys which is to be used
    for removing a loaded subsystem during module deletion. Reference
    counting of the subsystems' modules is moved from once (at load time) to
    once per attached hierarchy (in parse_cgroupfs_options and
    rebind_subsystems) (i.e., 0 or 1).

    Signed-off-by: Ben Blum
    Acked-by: Li Zefan
    Cc: Paul Menage
    Cc: "David S. Miller"
    Cc: KAMEZAWA Hiroyuki
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Add interface between cgroups subsystem management and module loading

    This patch implements rudimentary module-loading support for cgroups -
    namely, a cgroup_load_subsys (similar to cgroup_init_subsys) for use as a
    module initcall, and a struct module pointer in struct cgroup_subsys.

    Several functions that might be wanted by modules have had EXPORT_SYMBOL
    added to them, but it's unclear exactly which functions want it and which
    won't.

    Signed-off-by: Ben Blum
    Acked-by: Li Zefan
    Cc: Paul Menage
    Cc: "David S. Miller"
    Cc: KAMEZAWA Hiroyuki
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • This patch series provides the ability for cgroup subsystems to be
    compiled as modules both within and outside the kernel tree. This is
    mainly useful for classifiers and subsystems that hook into components
    that are already modules. cls_cgroup and blkio-cgroup serve as the
    example use cases for this feature.

    It provides an interface cgroup_load_subsys() and cgroup_unload_subsys()
    which modular subsystems can use to register and depart during runtime.
    The net_cls classifier subsystem serves as the example for a subsystem
    which can be converted into a module using these changes.

    Patch #1 sets up the subsys[] array so its contents can be dynamic as
    modules appear and (eventually) disappear. Iterations over the array are
    modified to handle when subsystems are absent, and the dynamic section of
    the array is protected by cgroup_mutex.

    Patch #2 implements an interface for modules to load subsystems, called
    cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module
    pointer in struct cgroup_subsys.

    Patch #3 adds a mechanism for unloading modular subsystems, which includes
    a more advanced rework of the rudimentary reference counting introduced in
    patch 2.

    Patch #4 modifies the net_cls subsystem, which already had some module
    declarations, to be configurable as a module, which also serves as a
    simple proof-of-concept.

    Part of implementing patches 2 and 4 involved updating css pointers in
    each css_set when the module appears or leaves. In doing this, it was
    discovered that css_sets always remain linked to the dummy cgroup,
    regardless of whether or not any subsystems are actually bound to it
    (i.e., not mounted on an actual hierarchy). The subsystem loading and
    unloading code therefore should keep in mind the special cases where the
    added subsystem is the only one in the dummy cgroup (and therefore all
    css_sets need to be linked back into it) and where the removed subsys was
    the only one in the dummy cgroup (and therefore all css_sets should be
    unlinked from it) - however, as all css_sets always stay attached to the
    dummy cgroup anyway, these cases are ignored. Any fix that addresses this
    issue should also make sure these cases are addressed in the subsystem
    loading and unloading code.

    This patch:

    Make subsys[] able to be dynamically populated to support modular
    subsystems

    This patch reworks the way the subsys[] array is used so that subsystems
    can register themselves after boot time, and enables the internals of
    cgroups to be able to handle when subsystems are not present or may
    appear/disappear.

    Signed-off-by: Ben Blum
    Acked-by: Li Zefan
    Cc: Paul Menage
    Cc: "David S. Miller"
    Cc: KAMEZAWA Hiroyuki
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Current css_get() and css_put() increment/decrement css->refcnt one by
    one.

    This patch adds a new function __css_get(), which takes "count" as an arg
    and increments css->refcnt by "count". This patch also adds a new
    arg ("count") to __css_put() and changes the function to decrement
    css->refcnt by "count".

    These coalesced versions of __css_get()/__css_put() will be used to improve
    performance of memcg's moving charge feature later, where instead of
    calling css_get()/css_put() repeatedly, these new functions will be used.
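
    The idea of coalescing reference updates can be sketched with C11
    atomics. This is a simplified userspace model: struct css_sketch and the
    *_many() helpers are hypothetical, and the kernel's real refcount rules
    for css objects are not reproduced here.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for an object with a reference counter. */
struct css_sketch {
    atomic_int refcnt;
};

/* Take `count` references with a single atomic add. */
void css_get_many(struct css_sketch *css, int count)
{
    assert(count > 0);
    atomic_fetch_add(&css->refcnt, count);
}

/*
 * Drop `count` references with a single atomic subtract.
 * Returns 1 if this call released the last reference, 0 otherwise.
 */
int css_put_many(struct css_sketch *css, int count)
{
    assert(count > 0);
    int old = atomic_fetch_sub(&css->refcnt, count);
    assert(old >= count);
    return old == count;
}
```

    Moving N charges then costs one atomic operation instead of N.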

    No change is needed for current users of css_get()/css_put().

    Signed-off-by: Daisuke Nishimura
    Acked-by: Paul Menage
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Add cancel_attach() operation to struct cgroup_subsys. cancel_attach()
    can be used when the can_attach() operation prepares something for the
    subsys and that preparation must be rolled back because attaching the
    task fails after can_attach() has succeeded.
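
    A minimal sketch of the rollback pattern, covering one failure path (a
    later subsystem vetoes after earlier ones have prepared). The struct
    layout and the demo callbacks are hypothetical and only model the idea.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subsystem with a prepare hook and its undo hook. */
struct subsys {
    int  (*can_attach)(struct subsys *ss);     /* 0 on success */
    void (*cancel_attach)(struct subsys *ss);  /* undo can_attach() */
    int reserved;  /* state prepared by can_attach() */
    int veto;      /* demo knob: make can_attach() fail */
};

int demo_can_attach(struct subsys *ss)
{
    if (ss->veto)
        return -1;
    ss->reserved = 1;
    return 0;
}

void demo_cancel_attach(struct subsys *ss)
{
    ss->reserved = 0;
}

/*
 * Run can_attach() on every subsystem. If one fails, call
 * cancel_attach() only on those whose can_attach() already succeeded.
 */
int prepare_attach(struct subsys **subsys, size_t n)
{
    assert(subsys);

    for (size_t i = 0; i < n; i++) {
        int err = subsys[i]->can_attach(subsys[i]);
        if (err) {
            while (i-- > 0) {
                if (subsys[i]->cancel_attach)
                    subsys[i]->cancel_attach(subsys[i]);
            }
            return err;
        }
    }
    return 0;
}
```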

    Signed-off-by: Daisuke Nishimura
    Acked-by: Li Zefan
    Reviewed-by: Paul Menage
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

25 Feb, 2010

1 commit

  • Update the rcu_dereference() usages to take advantage of the new
    lockdep-based checking.

    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    LKML-Reference:
    [ -v2: fix allmodconfig missing symbol export build failure on x86 ]
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

02 Oct, 2009

1 commit


24 Sep, 2009

5 commits

  • Alter the ss->can_attach and ss->attach functions to be able to deal with
    a whole threadgroup at a time, for use in cgroup_attach_proc. (This is a
    pre-patch to cgroup-procs-writable.patch.)

    Currently, the new mode of the attach function can only tell the subsystem
    about the old cgroup of the threadgroup leader. No subsystem currently
    needs that information for each thread that's being moved, but if one were
    to be added (for example, one that counts tasks within a group), this part
    would need to be reworked to give the subsystem the right
    information.

    [hidave.darkstar@gmail.com: fix build]
    Signed-off-by: Ben Blum
    Signed-off-by: Paul Menage
    Acked-by: Li Zefan
    Reviewed-by: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Changes css_set freeing mechanism to be under RCU

    This is a prepatch for making the procs file writable. In order to free the
    old css_sets for each task to be moved as they're being moved, the freeing
    mechanism must be RCU-protected, or else we would have to have a call to
    synchronize_rcu() for each task before freeing its old css_set.

    Signed-off-by: Ben Blum
    Signed-off-by: Paul Menage
    Cc: "Paul E. McKenney"
    Acked-by: Li Zefan
    Cc: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Previously there was the problem in which two processes from different pid
    namespaces reading the tasks or procs file could result in one process
    seeing results from the other's namespace. Rather than one pidlist for
    each file in a cgroup, we now keep a list of pidlists keyed by namespace
    and file type (tasks versus procs) in which entries are placed on demand.
    Each pidlist has its own lock, and because the pidlists themselves are
    passed around in the seq_file's private pointer, we don't have to touch the
    cgroup or its master list except when creating and destroying entries.
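
    The keyed lookup can be sketched as a find-or-create over a linked
    list. The struct, the opaque namespace key and the helper names are
    hypothetical; locking and the pid arrays themselves are left out.

```c
#include <assert.h>
#include <stdlib.h>

enum pidlist_type { PIDLIST_TASKS, PIDLIST_PROCS };

/* Hypothetical entry, keyed by (namespace, file type). */
struct pidlist {
    const void *ns;          /* opaque namespace key */
    enum pidlist_type type;
    struct pidlist *next;
};

/*
 * Return the entry for (ns, type), creating it on demand.
 * Returns NULL only if allocation fails.
 */
struct pidlist *pidlist_find_or_create(struct pidlist **head, const void *ns,
                                       enum pidlist_type type)
{
    assert(head);

    for (struct pidlist *cur = *head; cur; cur = cur->next) {
        if (cur->ns == ns && cur->type == type)
            return cur;
    }

    struct pidlist *fresh = calloc(1, sizeof(*fresh));
    if (!fresh)
        return NULL;
    fresh->ns = ns;
    fresh->type = type;
    fresh->next = *head;
    *head = fresh;
    return fresh;
}

/* Free every entry and reset the list head. */
void pidlist_free_all(struct pidlist **head)
{
    assert(head);
    while (*head) {
        struct pidlist *next = (*head)->next;
        free(*head);
        *head = next;
    }
}
```

    Two readers in different namespaces get distinct entries, so neither
    sees pids translated for the other.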

    Signed-off-by: Ben Blum
    Signed-off-by: Paul Menage
    Cc: Li Zefan
    Cc: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • struct cgroup used to have a bunch of fields for keeping track of the
    pidlist for the tasks file. Those are now separated into a new struct
    cgroup_pidlist, of which there are two: one for procs and one for tasks.
    The way the seq_file operations are set up is changed so that just the
    pidlist struct gets passed around as the private data.

    Interface example: Suppose a multithreaded process has pid 1000 and other
    threads with ids 1001, 1002, 1003:
    $ cat tasks
    1000
    1001
    1002
    1003
    $ cat cgroup.procs
    1000
    $

    Signed-off-by: Ben Blum
    Signed-off-by: Paul Menage
    Acked-by: Li Zefan
    Cc: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • The following series adds a "cgroup.procs" file to each cgroup that
    reports unique tgids rather than pids, and allows all threads in a
    threadgroup to be atomically moved to a new cgroup.

    The subsystem "attach" interface is modified to support attaching whole
    threadgroups at a time, which could introduce potential problems if any
    subsystem were to need to access the old cgroup of every thread being
    moved. The attach interface may need to be revised if this becomes the
    case.

    Also added is functionality for read/write locking all CLONE_THREAD
    fork()ing within a threadgroup, by means of an rwsem that lives in the
    sighand_struct, for per-threadgroup-ness and also for sharing a cacheline
    with the sighand's atomic count. This scheme should introduce no extra
    overhead in the fork path when there's no contention.

    The final patch reveals potential for a race when forking before a
    subsystem's attach function is called - one potential solution in case any
    subsystem has this problem is to hang on to the group's fork mutex through
    the attach() calls, though no subsystem yet demonstrates need for an
    extended critical section.

    This patch:

    Revert

    commit 096b7fe012d66ed55e98bc8022405ede0cc80e96
    Author: Li Zefan
    AuthorDate: Wed Jul 29 15:04:04 2009 -0700
    Commit: Linus Torvalds
    CommitDate: Wed Jul 29 19:10:35 2009 -0700

    cgroups: fix pid namespace bug

    This is in preparation for some clashing cgroups changes that subsume the
    original commit's functionality.

    The original commit fixed a pid namespace bug which Ben Blum fixed
    independently (in the same way, but with different code) as part of a
    series of patches. I played around with trying to reconcile Ben's patch
    series with Li's patch, but concluded that it was simpler to just revert
    Li's, given that Ben's patch series contained essentially the same fix.

    Signed-off-by: Paul Menage
    Cc: Li Zefan
    Cc: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     

30 Jul, 2009

2 commits

  • After commit ec64f51545fffbc4cb968f0cea56341a4b07e85a ("cgroup: fix
    frequent -EBUSY at rmdir"), cgroup's rmdir (especially against memcg)
    no longer returns -EBUSY because of temporary ref counts. That commit
    expects all refs held after pre_destroy() to be temporary, but they were
    not. As a result, rmdir can wait forever. This patch fixes that with the
    following changes:

    - set the CGRP_WAIT_ON_RMDIR flag before pre_destroy().
    - clear the CGRP_WAIT_ON_RMDIR flag when the subsys finds a racy case;
    if there are sleepers, wake them up.
    - rmdir() sleeps only when the CGRP_WAIT_ON_RMDIR flag is set.

    Tested-by: Daisuke Nishimura
    Reported-by: Daisuke Nishimura
    Reviewed-by: Paul Menage
    Acked-by: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The bug was introduced by commit cc31edceee04a7b87f2be48f9489ebb72d264844
    ("cgroups: convert tasks file to use a seq_file with shared pid array").

    We cache a pid array for all threads that are opening the same "tasks"
    file, but the pids in the array are always from the namespace of the
    last process that opened the file, so all other threads will read pids
    from that namespace instead of their own namespaces.

    To fix it, we maintain a list of pid arrays, which is keyed by pid_ns.
    The list will be of length 1 most of the time.

    Reported-by: Paul Menage
    Idea-by: Paul Menage
    Signed-off-by: Li Zefan
    Reviewed-by: Serge Hallyn
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     

04 Apr, 2009

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits)
    trivial: Update my email address
    trivial: NULL noise: drivers/mtd/tests/mtd_*test.c
    trivial: NULL noise: drivers/media/dvb/frontends/drx397xD_fw.h
    trivial: Fix misspelling of "Celsius".
    trivial: remove unused variable 'path' in alloc_file()
    trivial: fix a pdlfush -> pdflush typo in comment
    trivial: jbd header comment typo fix for JBD_PARANOID_IOFAIL
    trivial: wusb: Storage class should be before const qualifier
    trivial: drivers/char/bsr.c: Storage class should be before const qualifier
    trivial: h8300: Storage class should be before const qualifier
    trivial: fix where cgroup documentation is not correctly referred to
    trivial: Give the right path in Documentation example
    trivial: MTD: remove EOL from MODULE_DESCRIPTION
    trivial: Fix typo in bio_split()'s documentation
    trivial: PWM: fix of #endif comment
    trivial: fix typos/grammar errors in Kconfig texts
    trivial: Fix misspelling of firmware
    trivial: cgroups: documentation typo and spelling corrections
    trivial: Update contact info for Jochen Hein
    trivial: fix typo "resgister" -> "register"
    ...

    Linus Torvalds
     

03 Apr, 2009

6 commits

  • We need to pass some data to test_task() or process_task() in some cases.
    Will be used later.

    Signed-off-by: Li Zefan
    Cc: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • This patch tries to fix OOM killer problems caused by hierarchy.
    Currently, memcg has its own OOM kill function (in oom_kill.c) that tries
    to kill a task in the memcg.

    But when hierarchy is used, this is broken and the correct task cannot
    be killed. For example, in the following cgroup

    /groupA/ hierarchy=1, limit=1G,
        01 nolimit
        02 nolimit

    All tasks' memory usage under /groupA, /groupA/01, and /groupA/02 is
    limited to groupA's 1 Gbyte, but the OOM killer only kills tasks in groupA.

    This patch makes the bad process be selected from all tasks
    under the hierarchy. BTW, currently, oom_jiffies is updated against groupA
    in the above case. oom_jiffies of the whole tree should be updated.

    To see how oom_jiffies is used, please check mem_cgroup_oom_called()
    callers.

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: const fix]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • We have some read-only files and write-only files, but currently they are
    all set to 0644, which is counter-intuitive and causes trouble for some
    cgroup tools like libcgroup.

    This patch adds 'mode' to struct cftype to allow a cgroup subsys to set its
    own files' file mode. In most cases cft->mode can be left at its default
    of 0 and cgroup will figure out the proper mode.
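
    One plausible way to derive the default is sketched below. The rule
    used here (readable by everyone if a read handler exists, writable by
    the owner if a write handler exists) is an assumption for illustration;
    struct cftype_sketch and cft_file_mode() are hypothetical names.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical, reduced view of a control file description. */
struct cftype_sketch {
    mode_t mode;    /* 0 means "derive the mode from the handlers" */
    int has_read;   /* nonzero if a read handler is defined */
    int has_write;  /* nonzero if a write handler is defined */
};

/* Return the explicit mode if set, else one derived from the handlers. */
mode_t cft_file_mode(const struct cftype_sketch *cft)
{
    assert(cft);

    if (cft->mode)
        return cft->mode;

    mode_t mode = 0;
    if (cft->has_read)
        mode |= S_IRUSR | S_IRGRP | S_IROTH;  /* 0444 */
    if (cft->has_write)
        mode |= S_IWUSR;                      /* 0200 */
    return mode;
}
```

    Under this rule a read-only file ends up 0444, a write-only file 0200,
    and a file with both handlers keeps the old 0644.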

    Acked-by: Paul Menage
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • In the following situation, with the memory subsystem,

    /groupA use_hierarchy==1
        /01 some tasks
        /02 some tasks
        /03 some tasks
        /04 empty

    When tasks under 01/02/03 hit the limit on /groupA, hierarchical reclaim
    is triggered and the kernel walks the tree under groupA. In this case,
    rmdir /groupA/04 frequently fails with -EBUSY because of a temporary
    refcnt held by the kernel.

    In general, a cgroup can be rmdir'd if it has no child groups and no
    tasks. Frequent rmdir() failures are not useful to users.
    (And in most cases the reason for -EBUSY is unknown to users.)

    This patch tries to change the above behavior by
    - retrying if a css refcnt is held by someone.
    - adding a return value to pre_destroy(), which allows a subsystem to
      say "we're really busy!"
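
    A minimal userspace sketch of the two changes above. All names here
    (struct cg_sketch, rmdir_sketch(), wait_for_ref_drop(), always_busy())
    are hypothetical, and the bounded retry count is an assumption made to
    keep the example finite. It is not the kernel's implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup with one subsystem attached. */
struct cg_sketch {
    int css_refcnt;                            /* temporary references held by tree walkers */
    int (*pre_destroy)(struct cg_sketch *cg);  /* 0 on success, -EBUSY if really busy */
};

/* Hypothetical: stands in for sleeping until one temporary reference is dropped. */
void wait_for_ref_drop(struct cg_sketch *cg)
{
    if (cg->css_refcnt > 0)
        cg->css_refcnt--;
}

/* Try to remove the group, retrying while only transient references remain. */
int rmdir_sketch(struct cg_sketch *cg, int max_retries)
{
    assert(cg != NULL);
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        int ret = cg->pre_destroy ? cg->pre_destroy(cg) : 0;

        if (ret)
            return ret;         /* the subsystem says it is really busy */
        if (cg->css_refcnt == 0)
            return 0;           /* no references left: removal can proceed */
        wait_for_ref_drop(cg);  /* transient reference: wait, then retry */
    }
    return -EBUSY;              /* still referenced after all retries */
}

/* Example pre_destroy() callback that always refuses. */
int always_busy(struct cg_sketch *cg) { (void)cg; return -EBUSY; }
```

    The point of the sketch is the split between the two failure cases: a
    transient reference leads to a retry, while a pre_destroy() error is
    returned to the caller at once.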

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Patch for Per-CSS(Cgroup Subsys State) ID and private hierarchy code.

    This patch attaches unique ID to each css and provides following.

    - css_lookup(subsys, id)
    returns pointer to struct cgroup_subsys_state of id.
    - css_get_next(subsys, id, rootid, depth, foundid)
    returns the next css under "root" by scanning

    When cgroup_subsys->use_id is set, an id for css is maintained.

    The cgroup framework only prepares
    - css_id of the root css for the subsys
    - id is automatically attached at creation of a css.
    - id is *not* freed automatically, because the cgroup framework
      doesn't know the lifetime of cgroup_subsys_state.
      A free_css_id() function is provided. It must be called by the subsys.

    There are several reasons to develop this.
    - Saving space. For example, memcg's swap_cgroup is an array of
      pointers to cgroup, but it does not need to be very fast.
      By replacing pointers (8 bytes per entry) with IDs (2 bytes per
      entry), we can reduce memory usage considerably.

    - Scanning without lock.
    CSS_ID provides "scan id under this ROOT" function. By this, scanning
    css under root can be written without locks.
    ex)
    do {
        rcu_read_lock();
        next = css_get_next(subsys, id, root, &found);
        /* check sanity of next here */
        css_tryget();
        rcu_read_unlock();
        id = found + 1;
    } while(...)

    Characteristics:
    - Each css has unique ID under subsys.
    - Lifetime of ID is controlled by subsys.
    - css ID contains "ID" and "Depth in hierarchy" and stack of hierarchy
    - Allowed ID is 1-65535, ID 0 is UNUSED ID.

    Design Choices:
    - scan-by-ID vs. scan-by-tree-walk.
      As with /proc's pid scan, scan-by-ID is robust when scanning is done
      by the following kind of routine:
      scan -> rest a while (release a lock) -> continue from where it was
      interrupted.
      memcg's hierarchical reclaim does this.

    - When subsys->use_id is set, # of css in the system is limited to
    65535.
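
    To make the "ID, depth and stack of hierarchy" idea concrete, here is a
    small userspace sketch of scan-by-ID. The struct css_id_sketch layout,
    SKETCH_MAX_DEPTH, under_root() and scan_next() are hypothetical, and a
    linear walk over an array indexed by ID stands in for the real lookup
    structure. It only illustrates why a scan can resume from a plain
    integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_DEPTH 8  /* hypothetical cap on hierarchy depth */

/* Hypothetical layout: own ID, depth, and the IDs of all ancestors by depth. */
struct css_id_sketch {
    uint16_t id;                       /* 1..65535, 0 means unused */
    uint16_t depth;                    /* top of the hierarchy is depth 0 */
    uint16_t stack[SKETCH_MAX_DEPTH];  /* stack[d] = ID of the ancestor at depth d,
                                          stack[depth] = own ID */
};

/* A node is under root if root's ID appears in its stack at root's depth. */
int under_root(const struct css_id_sketch *node, const struct css_id_sketch *root)
{
    return node->depth >= root->depth && node->stack[root->depth] == root->id;
}

/*
 * Return the entry with the smallest ID >= start_id that lies under root
 * (root itself included), or NULL if there is none. table is indexed by ID
 * and holds NULL for unused IDs.
 */
const struct css_id_sketch *scan_next(const struct css_id_sketch *const *table,
                                      size_t table_len, uint16_t start_id,
                                      const struct css_id_sketch *root)
{
    assert(table != NULL && root != NULL);
    for (size_t id = start_id ? start_id : 1; id < table_len; id++) {
        if (table[id] && under_root(table[id], root))
            return table[id];
    }
    return NULL;
}
```

    A caller can drop all locks between steps and later resume with
    scan_next(table, len, found->id + 1, root), which mirrors the
    "id = found + 1" step in the example above. Entries created or removed
    in the meantime do not invalidate the saved position.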

    [bharata@linux.vnet.ibm.com: remove rcu_read_lock() from css_get_next()]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Bharata B Rao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The ns_proxy cgroup allows moving processes to child cgroups only one
    level deep at a time. This commit relaxes this restriction and makes it
    possible to attach tasks directly to grandchild cgroups, e.g.:

    ($pid is in the root cgroup)
    echo $pid > /cgroup/CG1/CG2/tasks

    Previously this operation would fail with -EPERM and would have to be
    performed as two steps:
    echo $pid > /cgroup/CG1/tasks
    echo $pid > /cgroup/CG1/CG2/tasks

    Also, the target cgroup no longer needs to be empty to move a task there.

    Signed-off-by: Grzegorz Nosek
    Acked-by: Serge Hallyn
    Reviewed-by: Li Zefan
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Grzegorz Nosek