29 Oct, 2009

1 commit

  • cgroup_write_X64() and cgroup_write_string() ignore the return value of
    strstrip(). This causes slightly inconsistent behavior.

    example:
    =========================
    # cd /mnt/cgroup/hoge
    # cat memory.swappiness
    60
    # echo "59 " > memory.swappiness
    # cat memory.swappiness
    59
    # echo " 58" > memory.swappiness
    bash: echo: write error: Invalid argument

    This patch fixes it.
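
    The asymmetry in the example can be reproduced in user space. The
    sketch below assumes a hypothetical strip() helper that mimics
    strstrip(): trailing whitespace is cut in place, but leading
    whitespace is skipped only through the returned pointer, so a caller
    that ignores the return value still sees " 58".

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical user-space analogue of strstrip(): trims trailing whitespace
 * in place and returns a pointer to the first non-blank character. */
static char *strip(char *s)
{
    assert(s != NULL);
    size_t len = strlen(s);

    /* Trailing whitespace is removed by overwriting it with NULs. */
    while (len > 0 && isspace((unsigned char)s[len - 1]))
        s[--len] = '\0';

    /* Leading whitespace is only skipped via the returned pointer. */
    while (*s && isspace((unsigned char)*s))
        s++;
    return s;
}
```

    A caller that keeps parsing from the original buffer handles "59 "
    correctly but not " 58"; parsing from the returned pointer handles
    both.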

    Cc: Li Zefan
    Acked-by: Paul Menage
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

02 Oct, 2009

2 commits


24 Sep, 2009

11 commits

  • Alter the ss->can_attach and ss->attach functions to be able to deal with
    a whole threadgroup at a time, for use in cgroup_attach_proc. (This is a
    pre-patch to cgroup-procs-writable.patch.)

    Currently, the new mode of the attach function can only tell the
    subsystem about the old cgroup of the threadgroup leader. No subsystem
    currently needs that information for each thread that's being moved, but
    if one were to be added (for example, one that counts tasks within a
    group) this part would need to be reworked to tell the subsystem the
    right information.

    [hidave.darkstar@gmail.com: fix build]
    Signed-off-by: Ben Blum
    Signed-off-by: Paul Menage
    Acked-by: Li Zefan
    Reviewed-by: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Changes css_set freeing mechanism to be under RCU

    This is a prepatch for making the procs file writable. In order to free the
    old css_sets for each task to be moved as they're being moved, the freeing
    mechanism must be RCU-protected, or else we would have to have a call to
    synchronize_rcu() for each task before freeing its old css_set.

    Signed-off-by: Ben Blum
    Signed-off-by: Paul Menage
    Cc: "Paul E. McKenney"
    Acked-by: Li Zefan
    Cc: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Separates all pidlist allocation requests into a separate function that
    decides, based on the requested size, whether the array needs to be
    vmalloced or can be obtained via kmalloc, and similarly for kfree/vfree.
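
    A minimal sketch of such a size-based decision follows. The page size
    and the two-page threshold are assumptions chosen for illustration, not
    values taken from this patch, and pidlist_needs_vmalloc() is a
    hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed values for illustration only; not taken from the patch. */
#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_LARGE_THRESHOLD (2u * SKETCH_PAGE_SIZE)

/* Returns true when an array of `count` entries of `entry_size` bytes should
 * come from the vmalloc-style allocator, false for the kmalloc-style one.
 * The division avoids overflow in count * entry_size. */
static bool pidlist_needs_vmalloc(size_t count, size_t entry_size)
{
    assert(entry_size > 0);
    return count > SKETCH_LARGE_THRESHOLD / entry_size;
}
```

    Using the same predicate on the free path keeps the allocator and the
    matching free routine in agreement.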

    Signed-off-by: Ben Blum
    Signed-off-by: Paul Menage
    Acked-by: Li Zefan
    Cc: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Previously, two processes from different pid namespaces reading the
    tasks or procs file could result in one process seeing results from the
    other's namespace. Rather than one pidlist for each file in a cgroup, we
    now keep a list of pidlists keyed by namespace and file type (tasks
    versus procs) in which entries are placed on demand. Each pidlist has
    its own lock, and because the pidlists themselves are passed around in
    the seq_file's private pointer, we don't have to touch the cgroup or its
    master list except when creating and destroying entries.
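
    The on-demand, keyed lookup can be sketched as a find-or-create over a
    linked list. This is a simplified user-space illustration: the struct
    layout and function name are hypothetical, the namespace is reduced to
    an opaque pointer, and all locking is omitted.

```c
#include <assert.h>
#include <stdlib.h>

enum pidlist_type { PIDLIST_TASKS, PIDLIST_PROCS };

/* Simplified stand-in: the namespace is reduced to an opaque pointer key. */
struct pidlist {
    const void *ns;
    enum pidlist_type type;
    struct pidlist *next;
};

/* Return the entry for (ns, type), creating it on first use.
 * Returns NULL only if allocation fails. */
static struct pidlist *pidlist_find_or_create(struct pidlist **head,
                                              const void *ns,
                                              enum pidlist_type type)
{
    assert(head != NULL);
    for (struct pidlist *l = *head; l != NULL; l = l->next)
        if (l->ns == ns && l->type == type)
            return l;

    struct pidlist *fresh = calloc(1, sizeof(*fresh));
    if (fresh == NULL)
        return NULL;
    fresh->ns = ns;
    fresh->type = type;
    fresh->next = *head;
    *head = fresh;
    return fresh;
}
```

    Readers in different namespaces, or reading different files, get
    distinct entries, while repeated lookups with the same key share one.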

    Signed-off-by: Ben Blum
    Signed-off-by: Paul Menage
    Cc: Li Zefan
    Cc: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • struct cgroup used to have a bunch of fields for keeping track of the
    pidlist for the tasks file. Those are now separated into a new struct
    cgroup_pidlist, of which there are two, one for procs and one for tasks.
    The way the seq_file operations are set up is changed so that just the
    pidlist struct gets passed around as the private data.

    Interface example: Suppose a multithreaded process has pid 1000 and other
    threads with ids 1001, 1002, 1003:
    $ cat tasks
    1000
    1001
    1002
    1003
    $ cat cgroup.procs
    1000
    $
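
    One way to get from per-thread entries to the single line shown for
    cgroup.procs is to collect each thread's tgid, sort, and drop
    duplicates. The helper below is a hypothetical user-space sketch of
    that step, not the code from this patch.

```c
#include <assert.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sort `ids` in place, drop duplicates, and return the new length. */
static size_t sort_unique(int *ids, size_t n)
{
    assert(ids != NULL || n == 0);
    if (n == 0)
        return 0;
    qsort(ids, n, sizeof(ids[0]), cmp_int);

    size_t out = 1;
    for (size_t i = 1; i < n; i++)
        if (ids[i] != ids[out - 1])
            ids[out++] = ids[i];
    return out;
}
```

    In the example above all four threads report tgid 1000, so the four
    entries collapse to one.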

    Signed-off-by: Ben Blum
    Signed-off-by: Paul Menage
    Acked-by: Li Zefan
    Cc: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • The following series adds a "cgroup.procs" file to each cgroup that
    reports unique tgids rather than pids, and allows all threads in a
    threadgroup to be atomically moved to a new cgroup.

    The subsystem "attach" interface is modified to support attaching whole
    threadgroups at a time, which could introduce potential problems if any
    subsystem were to need to access the old cgroup of every thread being
    moved. The attach interface may need to be revised if this becomes the
    case.

    Also added is functionality for read/write locking all CLONE_THREAD
    fork()ing within a threadgroup, by means of an rwsem that lives in the
    sighand_struct, for per-threadgroup-ness and also for sharing a cacheline
    with the sighand's atomic count. This scheme should introduce no extra
    overhead in the fork path when there's no contention.

    The final patch reveals potential for a race when forking before a
    subsystem's attach function is called - one potential solution in case any
    subsystem has this problem is to hang on to the group's fork mutex through
    the attach() calls, though no subsystem yet demonstrates need for an
    extended critical section.

    This patch:

    Revert

    commit 096b7fe012d66ed55e98bc8022405ede0cc80e96
    Author: Li Zefan
    AuthorDate: Wed Jul 29 15:04:04 2009 -0700
    Commit: Linus Torvalds
    CommitDate: Wed Jul 29 19:10:35 2009 -0700

    cgroups: fix pid namespace bug

    This is in preparation for some clashing cgroups changes that subsume the
    original commit's functionality.

    The original commit fixed a pid namespace bug which Ben Blum fixed
    independently (in the same way, but with different code) as part of a
    series of patches. I played around with trying to reconcile Ben's patch
    series with Li's patch, but concluded that it was simpler to just revert
    Li's, given that Ben's patch series contained essentially the same fix.

    Signed-off-by: Paul Menage
    Cc: Li Zefan
    Cc: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • This patch removes the restriction that a cgroup hierarchy must have at
    least one bound subsystem. The mount option "none" is treated as an
    explicit request for no bound subsystems.

    A hierarchy with no subsystems can be useful for plain task tracking, and
    is also a step towards the support for multiply-bindable subsystems.

    As part of this change, the hierarchy id is no longer calculated from the
    bitmask of subsystems in the hierarchy (since this is not guaranteed to be
    unique) but is allocated via an ida. Reference counts on cgroups from
    css_set objects are now taken explicitly one per hierarchy, rather than
    one per subsystem.

    Example usage:

    mount -t cgroup -o none,name=foo cgroup /mnt/cgroup

    Based on the "no-op"/"none" subsystem concept proposed by
    kamezawa.hiroyu@jp.fujitsu.com

    Signed-off-by: Paul Menage
    Reviewed-by: Li Zefan
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Dhaval Giani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Currently the cgroups code makes the assumption that the subsystem
    pointers in a struct css_set uniquely identify the hierarchy->cgroup
    mappings associated with the css_set; and there's no way to directly
    identify the associated set of cgroups other than by indirecting through
    the appropriate subsystem state pointers.

    This patch removes the need for that assumption by adding a back-pointer
    from struct cg_cgroup_link object to its associated cgroup; this allows
    the set of cgroups to be determined by traversing the cg_links list in
    the struct css_set.
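
    A minimal sketch of that traversal, using simplified and partly
    hypothetical structures (singly linked instead of the kernel's list
    heads, and a function name invented for illustration):

```c
#include <assert.h>
#include <stddef.h>

struct root { int id; };               /* one per hierarchy */
struct cgroup { struct root *root; };

/* Simplified link object: `cgrp` is the back-pointer the patch describes. */
struct cg_link {
    struct cgroup *cgrp;
    struct cg_link *next;              /* next link in the css_set's list */
};

struct css_set { struct cg_link *links; };

/* Walk the css_set's links and return its cgroup in `root`, or NULL. */
static struct cgroup *css_set_cgroup_in(const struct css_set *cg,
                                        const struct root *root)
{
    assert(cg != NULL);
    for (struct cg_link *l = cg->links; l != NULL; l = l->next)
        if (l->cgrp->root == root)
            return l->cgrp;
    return NULL;
}
```

    No subsystem state pointer is consulted, so the lookup also works for
    a hierarchy that has no subsystems bound.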

    Signed-off-by: Paul Menage
    Reviewed-by: Li Zefan
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Dhaval Giani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • While it's architecturally clean to have the cgroup debug subsystem be
    completely independent of the cgroups framework, it limits its usefulness
    for debugging the contents of internal data structures. Move the debug
    subsystem code into the scope of all the cgroups data structures to make
    more detailed debugging possible.

    Signed-off-by: Paul Menage
    Reviewed-by: Li Zefan
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Dhaval Giani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • To simplify referring to cgroup hierarchies in mount statements, and to
    allow disambiguation in the presence of empty hierarchies and
    multiply-bindable subsystems, this patch adds support for naming a new
    cgroup hierarchy via the "name=" mount option.

    A pre-existing hierarchy may be specified by either name or by subsystems;
    a hierarchy's name cannot be changed by a remount operation.

    Example usage:

    # To create a hierarchy called "foo" containing the "cpu" subsystem
    mount -t cgroup -oname=foo,cpu cgroup /mnt/cgroup1

    # To mount the "foo" hierarchy on a second location
    mount -t cgroup -oname=foo cgroup /mnt/cgroup2

    Signed-off-by: Paul Menage
    Reviewed-by: Li Zefan
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Dhaval Giani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Make the last unlock sequence consistent with the previous unlock sequence.

    Acked-by: Balbir Singh
    Acked-by: Paul Menage
    Signed-off-by: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiaotian Feng
     

23 Sep, 2009

1 commit

  • Make all seq_operations structs const, to help mitigate against
    revectoring user-triggerable function pointers.

    This is derived from the grsecurity patch, although generated from scratch
    because it's simpler than extracting the changes from there.

    Signed-off-by: James Morris
    Acked-by: Serge Hallyn
    Acked-by: Casey Schaufler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morris
     

22 Sep, 2009

2 commits


11 Sep, 2009

1 commit


30 Jul, 2009

2 commits

  • After commit ec64f51545fffbc4cb968f0cea56341a4b07e85a ("cgroup: fix
    frequent -EBUSY at rmdir"), cgroup's rmdir (especially against memcg)
    no longer returns -EBUSY because of temporary ref counts. That commit
    expects all refs remaining after pre_destroy() to be temporary, but they
    were not. As a result, rmdir can wait forever. This patch tries to fix
    that and makes the following changes:

    - set CGRP_WAIT_ON_RMDIR flag before pre_destroy().
    - clear CGRP_WAIT_ON_RMDIR flag when the subsys finds a racy case;
      if there are sleepers, wake them up.
    - rmdir() sleeps only when CGRP_WAIT_ON_RMDIR flag is set.

    Tested-by: Daisuke Nishimura
    Reported-by: Daisuke Nishimura
    Reviewed-by: Paul Menage
    Acked-by: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The bug was introduced by commit cc31edceee04a7b87f2be48f9489ebb72d264844
    ("cgroups: convert tasks file to use a seq_file with shared pid array").

    We cache a pid array for all threads that are opening the same "tasks"
    file, but the pids in the array are always from the namespace of the
    last process that opened the file, so all other threads will read pids
    from that namespace instead of their own namespaces.

    To fix it, we maintain a list of pid arrays, which is keyed by pid_ns.
    The list will be of length 1 most of the time.

    Reported-by: Paul Menage
    Idea-by: Paul Menage
    Signed-off-by: Li Zefan
    Reviewed-by: Serge Hallyn
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     

19 Jun, 2009

1 commit

  • The 'noprefix' option was introduced for backwards-compatibility of
    cpuset, but actually it can be used when mounting other subsystems.

    This makes name collisions possible, and now a collision can really
    happen, because we have a 'stat' file in both the memory and cpuacct
    subsystems:

    # mount -t cgroup -o noprefix,memory,cpuacct xxx /mnt

    Cgroup will happily mount the 2 subsystems, but only the 'stat' file of
    the memory subsys can be seen.

    We don't want users to use noprefix, and also want to avoid name
    collisions, so we change to allow noprefix only if mounting just the
    cpuset subsystem.
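
    The resulting rule amounts to comparing the requested subsystem mask
    with the single cpuset bit. The function and parameter names below are
    hypothetical; the sketch only illustrates the check and why the shift
    uses an unsigned long constant.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* `cpuset_id` is the bit index of cpuset in the subsystem mask (assumed). */
static bool noprefix_allowed(unsigned long subsys_bits, unsigned int cpuset_id)
{
    assert(cpuset_id < sizeof(unsigned long) * CHAR_BIT);
    /* 1UL keeps the shift well-defined for indexes of 32 and above on
     * platforms where unsigned long is 64 bits wide. */
    return subsys_bits == (1UL << cpuset_id);
}
```

    With this check, the memory plus cpuacct mount shown above is rejected
    when noprefix is given, while a cpuset-only mount is still accepted.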

    [akpm@linux-foundation.org: fix shift for cpuset_subsys_id >= 32]
    Signed-off-by: Li Zefan
    Cc: Paul Menage
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Acked-by: Dhaval Giani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     

12 Jun, 2009

1 commit


09 May, 2009

1 commit


03 Apr, 2009

7 commits

  • This patch tries to fix OOM Killer problems caused by hierarchy.
    Now, memcg itself has an OOM kill function (in oom_kill.c) and tries to
    kill a task in memcg.

    But when hierarchy is used, it's broken and the correct task cannot
    be killed. For example, in the following cgroup

    /groupA/ hierarchy=1, limit=1G,
        01 nolimit
        02 nolimit

    all tasks' memory usage under /groupA, /groupA/01, /groupA/02 is limited
    to groupA's 1 Gbyte, but the OOM Killer just kills tasks in groupA.

    This patch makes the bad process be selected from all tasks
    under the hierarchy. BTW, currently, oom_jiffies is updated against
    groupA in the above case. oom_jiffies of the tree should be updated.

    To see how oom_jiffies is used, please check mem_cgroup_oom_called()
    callers.

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: const fix]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Remount can fail in either of these cases:
    - wrong mount options are specified, or option 'noprefix' is changed.
    - a to-be-added subsys is already mounted/active.

    When using remount to change 'release_agent', in the former failure
    case remount will return an errno with release_agent unchanged, but in
    the latter case remount will return EBUSY with release_agent changed,
    which is unexpected I think:

    # mount -t cgroup -o cpu xxx /cgrp1
    # mount -t cgroup -o cpuset,release_agent=agent1 yyy /cgrp2
    # cat /cgrp2/release_agent
    agent1
    # mount -t cgroup -o remount,cpuset,noprefix,release_agent=agent2 yyy /cgrp2
    mount: /cgrp2 not mounted already, or bad option
    # cat /cgrp2/release_agent
    agent1
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • We have some read-only files and write-only files, but currently they are
    all set to 0644, which is counter-intuitive and causes trouble for some
    cgroup tools like libcgroup.

    This patch adds 'mode' to struct cftype to allow a cgroup subsys to set
    its own files' file mode. In most cases cft->mode can be left at the
    default of 0 and cgroup will figure out the proper mode.
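
    One plausible way to "figure out" a mode is to derive it from which
    handlers a file provides. The sketch below assumes that rule
    (world-readable if there is a read handler, owner-writable if there is
    a write handler); the function name and the exact bits are assumptions,
    not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Pick permission bits for a control file. An explicit nonzero mode wins;
 * otherwise derive one from which handlers exist. */
static mode_t control_file_mode(mode_t explicit_mode, bool has_read,
                                bool has_write)
{
    /* Only permission bits are expected here. */
    assert((explicit_mode & ~(mode_t)07777) == 0);
    if (explicit_mode != 0)
        return explicit_mode;

    mode_t mode = 0;
    if (has_read)
        mode |= S_IRUSR | S_IRGRP | S_IROTH;   /* 0444 */
    if (has_write)
        mode |= S_IWUSR;                       /* 0200 */
    return mode;
}
```

    Under this rule a read-only file ends up 0444, a write-only file 0200,
    and a file with both handlers keeps the familiar 0644.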

    Acked-by: Paul Menage
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Reduces object file size a bit:

    Before:
    $ size kernel/cgroup.o
       text    data     bss     dec     hex filename
      21593    7804    4924   34321    8611 kernel/cgroup.o
    After:
    $ size kernel/cgroup.o
       text    data     bss     dec     hex filename
      21537    7744    4924   34205    859d kernel/cgroup.o

    Signed-off-by: Jesper Juhl
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • In the following situation, with the memory subsystem,

    /groupA use_hierarchy==1
        /01 some tasks
        /02 some tasks
        /03 some tasks
        /04 empty

    when tasks under 01/02/03 hit the limit on /groupA, hierarchical reclaim
    is triggered and the kernel walks the tree under groupA. In this case,
    rmdir /groupA/04 frequently fails with -EBUSY because of a temporary
    refcnt held by the kernel.

    In general, a cgroup can be rmdir'd if there are no child groups and
    no tasks. Frequent failures of rmdir() are not useful to users.
    (And in most cases the reason for -EBUSY is unknown to users.)

    This patch tries to modify the above behavior by
    - retrying if css_refcnt is held by someone.
    - adding a "return value" to pre_destroy(), which allows a subsystem to
      say "we're really busy!"

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Patch for per-CSS (cgroup subsys state) ID and private hierarchy code.

    This patch attaches a unique ID to each css and provides the following:

    - css_lookup(subsys, id)
      returns a pointer to the struct cgroup_subsys_state with that id.
    - css_get_next(subsys, id, rootid, depth, foundid)
      returns the next css under "root" by scanning.

    When cgroup_subsys->use_id is set, an id for each css is maintained.

    The cgroup framework only prepares
    - the css_id of the root css for the subsys.
    - an id is automatically attached at creation of a css.
    - the id is *not* freed automatically, because the cgroup framework
      doesn't know the lifetime of a cgroup_subsys_state.
      A free_css_id() function is provided. This must be called by the subsys.

    There are several reasons to develop this.
    - Saving space. For example, memcg's swap_cgroup is an array of
      pointers to cgroup, but it does not need to be very fast.
      By replacing pointers (8 bytes per entry) with IDs (2 bytes per entry),
      we can reduce memory usage considerably.

    - Scanning without lock.
      CSS_ID provides a "scan id under this ROOT" function. With this,
      scanning css under root can be written without locks.
      ex)
      do {
              rcu_read_lock();
              next = cgroup_get_next(subsys, id, root, &found);
              /* check sanity of next here */
              css_tryget();
              rcu_read_unlock();
              id = found + 1;
      } while(...)

    Characteristics:
    - Each css has a unique ID under its subsys.
    - Lifetime of the ID is controlled by the subsys.
    - A css ID contains "ID", "depth in hierarchy" and a stack of the hierarchy.
    - Allowed IDs are 1-65535; ID 0 is the UNUSED ID.

    Design Choices:
    - scan-by-ID vs. scan-by-tree-walk.
      As /proc's pid scan does, scan-by-ID is robust when scanning is done
      by the following kind of routine:
      scan -> rest a while (release a lock) -> continue from where interrupted
      memcg's hierarchical reclaim does this.

    - When subsys->use_id is set, the number of css in the system is limited
      to 65535.

    [bharata@linux.vnet.ibm.com: remove rcu_read_lock() from css_get_next()]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Bharata B Rao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The ns_proxy cgroup allows moving processes to child cgroups only one
    level deep at a time. This commit relaxes this restriction and makes it
    possible to attach tasks directly to grandchild cgroups, e.g.:

    ($pid is in the root cgroup)
    echo $pid > /cgroup/CG1/CG2/tasks

    Previously this operation would fail with -EPERM and would have to be
    performed as two steps:
    echo $pid > /cgroup/CG1/tasks
    echo $pid > /cgroup/CG1/CG2/tasks

    Also, the target cgroup no longer needs to be empty to move a task there.
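
    The relaxed rule can be pictured as an ancestry walk: the move is
    acceptable when the task's current cgroup is found somewhere on the
    path from the target up to the root. The structure and function below
    are hypothetical, and whether the real check also accepts the same
    cgroup as target is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct cg_node { struct cg_node *parent; };   /* NULL parent = root */

/* True if `target` is `from` itself or lies anywhere below it. */
static bool is_self_or_descendant(const struct cg_node *target,
                                  const struct cg_node *from)
{
    assert(target != NULL && from != NULL);
    for (const struct cg_node *c = target; c != NULL; c = c->parent)
        if (c == from)
            return true;
    return false;
}
```

    With a one-level-only rule, the walk would stop after a single parent
    step; following parent pointers to the root is what lets a task in the
    root cgroup go straight to CG1/CG2.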

    Signed-off-by: Grzegorz Nosek
    Acked-by: Serge Hallyn
    Reviewed-by: Li Zefan
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Grzegorz Nosek
     

28 Mar, 2009

2 commits

  • simple_set_mnt() is defined as returning 'int' but always returns 0.
    Callers assume simple_set_mnt() never fails and don't properly clean up
    if it were to _ever_ fail. For instance, get_sb_single() and
    get_sb_nodev() should:

    up_write(&sb->s_umount);
    deactivate_super(sb);

    if simple_set_mnt() fails.

    Since simple_set_mnt() never fails, it would be cleaner if it did not
    return anything.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Serge Hallyn
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Sukadev Bhattiprolu
     
  • Signed-off-by: Al Viro

    Al Viro
     

19 Feb, 2009

1 commit

  • In cgroup_kill_sb(), root is freed before sb is detached from the list, so
    another sget() may find this sb and call cgroup_test_super(), which will
    access the root that has been freed.

    Reported-by: Al Viro
    Signed-off-by: Li Zefan
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     

12 Feb, 2009

1 commit

  • I enabled all cgroup subsystems when compiling kernel, and then:
    # mount -t cgroup -o net_cls xxx /mnt
    # mkdir /mnt/0

    This showed up immediately:
    BUG: MAX_LOCKDEP_SUBCLASSES too low!
    turning off the locking correctness validator.

    It's caused by the cgroup hierarchy lock:
    for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
            struct cgroup_subsys *ss = subsys[i];
            if (ss->root == root)
                    mutex_lock_nested(&ss->hierarchy_mutex, i);
    }

    Now we have 9 cgroup subsystems, and the above 'i' for net_cls is 8, but
    MAX_LOCKDEP_SUBCLASSES is 8.

    This patch uses different lockdep keys for different subsystems.

    Signed-off-by: Li Zefan
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     

30 Jan, 2009

4 commits

  • root_count was being incremented in cgroup_get_sb() after all error
    checking was complete, but decremented in cgroup_kill_sb(), which can be
    called on a superblock that we gave up on due to an error. This patch
    changes cgroup_kill_sb() to only decrement root_count if the root was
    previously linked into the list of roots.

    Signed-off-by: Paul Menage
    Tested-by: Serge Hallyn
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • css_tryget() and cgroup_clear_css_refs() contain polling loops; these
    loops should have cpu_relax() calls in them to reduce cross-cache traffic.

    Signed-off-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • I fixed a bug in cgroup_clone() in Linus' tree in commit 7b574b7
    ("cgroups: fix a race between cgroup_clone and umount") without noticing
    there was a cleanup patch in the -mm tree that should have been rebased
    (now commit 104cbd5, "cgroups: use task_lock() for access tsk->cgroups
    safe in cgroup_clone()"), which resulted in lock inconsistency.

    Signed-off-by: Li Zefan
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Now, cgrp->sibling is handled under the hierarchy mutex.
    The error path should do so, too.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

09 Jan, 2009

2 commits

  • Add css_tryget(), that obtains a counted reference on a CSS. It is used
    in situations where the caller has a "weak" reference to the CSS, i.e.
    one that does not protect the cgroup from removal via a reference count,
    but would instead be cleaned up by a destroy() callback.

    css_tryget() will return true on success, or false if the cgroup is being
    removed.

    This is similar to Kamezawa Hiroyuki's patch from a week or two ago, but
    with the difference that in the event of css_tryget() racing with a
    cgroup_rmdir(), css_tryget() will only return false if the cgroup really
    does get removed.

    This implementation is done by biasing css->refcnt, so that a refcnt of 1
    means "releasable" and 0 means "released or releasing". In the event of a
    race, css_tryget() distinguishes between "released" and "releasing" by
    checking for the CSS_REMOVED flag in css->flags.

    Signed-off-by: Paul Menage
    Tested-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • These patches introduce new locking/refcount support for cgroups to
    reduce the need for subsystems to call cgroup_lock(). This will
    ultimately allow the atomicity of cgroup_rmdir() (which was removed
    recently) to be restored.

    These three patches give:

    1/3 - introduce a per-subsystem hierarchy_mutex which a subsystem can
    use to prevent changes to its own cgroup tree

    2/3 - use hierarchy_mutex in place of calling cgroup_lock() in the
    memory controller

    3/3 - introduce a css_tryget() function similar to the one recently
    proposed by Kamezawa, but avoiding spurious refcount failures in
    the event of a race between a css_tryget() and an unsuccessful
    cgroup_rmdir()

    Future patches will likely involve:

    - using hierarchy mutex in place of cgroup_lock() in more subsystems
    where appropriate

    - restoring the atomicity of cgroup_rmdir() with respect to cgroup_create()

    This patch:

    Add a hierarchy_mutex to the cgroup_subsys object that protects changes to
    the hierarchy observed by that subsystem. It is taken by the cgroup
    subsystem (in addition to cgroup_mutex) for the following operations:

    - linking a cgroup into that subsystem's cgroup tree
    - unlinking a cgroup from that subsystem's cgroup tree
    - moving the subsystem to/from a hierarchy (including across the
    bind() callback)

    Thus if the subsystem holds its own hierarchy_mutex, it can safely
    traverse its own hierarchy.

    Signed-off-by: Paul Menage
    Tested-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage