26 Jul, 2008

5 commits

  • cgroup_clone creates a new cgroup with the pid of the task. This works
    correctly for unshare, but for clone cgroup_clone is called from
    copy_namespaces inside copy_process, which happens before the new pid is
    created. As a result, the new cgroup was created with current's pid.
    This patch:

    1. Moves the call inside copy_process to after the new pid
    is created
    2. Passes the struct pid into ns_cgroup_clone (as it is not
    yet attached to the task)
    3. Passes a name from ns_cgroup_clone() into cgroup_clone()
    so as to keep cgroup_clone() itself simpler
    4. Uses pid_vnr() to get the process id value, so that the
    pid used to name the new cgroup is always the pid as it
    would be known to the task which did the cloning or
    unsharing. I think that is the most intuitive thing to
    do. This way, task t1 does clone(CLONE_NEWPID) to get
    t2, which does clone(CLONE_NEWPID) to get t3, then the
    cgroup for t3 will be named for the pid by which t2 knows
    t3.
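
    Point 4 can be pictured with a small userspace model. This is only a
    sketch: struct pid_model, pid_nr_seen_from() and cgroup_name_for() are
    hypothetical names, not kernel APIs, and it assumes a pid simply carries
    one number per nested pid namespace level.

```c
#include <stddef.h>
#include <stdio.h>

#define MAX_NS_DEPTH 4

/* Hypothetical model: one pid number per nested pid namespace level. */
struct pid_model {
    int level;             /* deepest namespace level the task lives in */
    int nr[MAX_NS_DEPTH];  /* nr[i] = pid value as seen from level i */
};

/* Pid value as known to a viewer at viewer_level; 0 if not visible. */
static int pid_nr_seen_from(const struct pid_model *pid, int viewer_level)
{
    if (viewer_level < 0 || viewer_level > pid->level)
        return 0;
    return pid->nr[viewer_level];
}

/* Name the new cgroup after the pid as the cloning task knows it. */
static int cgroup_name_for(char *name, size_t len,
                           const struct pid_model *child, int cloner_level)
{
    int n = snprintf(name, len, "%d", pid_nr_seen_from(child, cloner_level));

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

    In this model, if t3 is recorded as level 2 with numbers {4242, 17, 1}
    and t2 lives at level 1, the cgroup for t3 is named "17".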

    (Thanks to Dan Smith for finding the main bug)

    Changelog:
    June 11: Incorporate Paul Menage's feedback: don't pass
    NULL to ns_cgroup_clone from unshare, and reduce
    patch size by using 'nodename' in cgroup_clone.
    June 10: Original version

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Serge Hallyn
    Acked-by: Paul Menage
    Tested-by: Dan Smith
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • This patch contains cleanups suggested by reviewers for the recent
    write_string() patchset:

    - pair cgroup_lock_live_group() with cgroup_unlock() in cgroup.c for
    clarity, rather than directly unlocking cgroup_mutex.

    - make the return type of cgroup_lock_live_group() a bool

    - use a #define'd constant for the local buffer size in read/write functions

    Signed-off-by: Paul Menage
    Cc: Paul Jackson
    Cc: Pavel Emelyanov
    Cc: Balbir Singh
    Acked-by: Serge Hallyn
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Adds cgroup_release_agent_write() and cgroup_release_agent_show()
    methods to handle writing/reading the path to a cgroup hierarchy's
    release agent. As a result, cgroup_common_file_read() is now unnecessary.

    As part of the change, a previously-tolerated race in
    cgroup_release_agent() is avoided by copying the current
    release_agent_path prior to calling call_usermode_helper().

    Signed-off-by: Paul Menage
    Cc: Paul Jackson
    Cc: Pavel Emelyanov
    Cc: Balbir Singh
    Acked-by: Serge Hallyn
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • This patch adds a write_string() method for cgroups control files. The
    semantics are that a buffer is copied from userspace to kernelspace
    and the handler function invoked on that buffer. The buffer is
    guaranteed to be nul-terminated, and no longer than max_write_len
    (defaulting to 64 bytes if unspecified). Later patches will convert
    existing raw file write handlers in control group subsystems to use
    this method.
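
    The semantics described above can be modelled in userspace roughly as
    follows. This is a sketch under assumptions, not the kernel code:
    model_write_string(), string_handler and count_handler() are hypothetical
    names, memcpy() stands in for the copy from userspace, and treating a
    write of exactly max_write_len bytes as acceptable is an assumption.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DEFAULT_MAX_WRITE_LEN 64  /* default quoted in the commit message */

/* Hypothetical handler type: receives a nul-terminated buffer. */
typedef int (*string_handler)(const char *buf);

/*
 * Reject over-long writes, copy the data into a local buffer,
 * nul-terminate it, then invoke the handler on that buffer.
 */
static int model_write_string(const char *src, size_t nbytes,
                              size_t max_write_len, string_handler handler)
{
    char local[256];

    if (max_write_len == 0)
        max_write_len = DEFAULT_MAX_WRITE_LEN;
    if (max_write_len >= sizeof(local))
        return -EINVAL;           /* model limitation, not kernel behaviour */
    if (nbytes > max_write_len)
        return -E2BIG;

    memcpy(local, src, nbytes);   /* stands in for the userspace copy */
    local[nbytes] = '\0';
    return handler(local);
}

/* Example handler: reports the length of the string it was given. */
static int count_handler(const char *buf)
{
    return (int)strlen(buf);
}
```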

    Signed-off-by: Paul Menage
    Cc: Paul Jackson
    Cc: Pavel Emelyanov
    Acked-by: Balbir Singh
    Acked-by: Serge Hallyn
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • This patch removes some extraneous spaces from method declarations in
    struct cftype, to fit in with conventional kernel style.

    Signed-off-by: Paul Menage
    Cc: Paul Jackson
    Cc: Pavel Emelyanov
    Cc: Balbir Singh
    Cc: Serge Hallyn
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     

29 Apr, 2008

9 commits

  • Remove the mem_cgroup member from mm_struct and instead add an owner.

    This approach was suggested by Paul Menage. Its advantage is that, once
    mm->owner is known, the cgroup can be determined from the subsystem id.
    It also allows several control groups that are virtually grouped by
    mm_struct to exist independently of the memory controller, i.e. without
    adding a mem_cgroup-like member to mm_struct for each controller.

    A new config option CONFIG_MM_OWNER is added and the memory resource
    controller selects this config option.

    This patch also adds cgroup callbacks to notify subsystems when mm->owner
    changes. The mm_cgroup_changed callback is called with the task_lock() of
    the new task held and is called just prior to changing the mm->owner.

    I am indebted to Paul Menage for the several reviews of this patchset and
    helping me make it lighter and simpler.

    This patch was tested on a powerpc box; it was compiled with the
    MM_OWNER config option both turned on and off.

    After the thread group leader exits, it is moved to init_css_set by
    cgroup_exit(), so all future charges from running threads are
    redirected to the init_css_set's subsystem.

    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Hugh Dickins
    Cc: Sudhir Kumar
    Cc: YAMAMOTO Takashi
    Cc: Hirokazu Takahashi
    Cc: David Rientjes
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Pekka Enberg
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Introduce a read_seq() helper in cftype, which uses seq_file to print out
    lists. Use it in the devices cgroup. Also split devices.allow into two
    files, so now devices.deny and devices.allow are the ones to use to manipulate
    the whitelist, while devices.list outputs the cgroup's current whitelist.

    Signed-off-by: Serge E. Hallyn
    Acked-by: Paul Menage
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Now we can run through the hash table instead of running through the
    linked-list.

    Signed-off-by: Li Zefan
    Reviewed-by: Paul Menage
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • When we attach a process to a different cgroup, the css_set linked-list will
    be run through to find a suitable existing css_set to use. This patch
    implements a hash table for better performance.

    The following benchmarks have been tested:

    For N in 1, 5, 10, 50, 100, 500, 1000, create N cgroups with one sleeping
    task in each, and then move an additional task through each cgroup in
    turn.

    Here is a test result:

    N      Loop    orig - Time(s)   hash - Time(s)
    ----------------------------------------------
    1      10000   1.201231728      1.196311177
    5      2000    1.065743872      1.040566424
    10     1000    0.991054735      0.986876440
    50     200     0.976554203      0.969608733
    100    100     0.998504680      0.969218270
    500    20      1.157347764      0.962602963
    1000   10      1.619521852      1.085140172
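
    As a rough illustration of the idea, the lookup can be keyed on the set
    of subsystem state pointers. The sketch below is a userspace model under
    assumptions: the names, NSUBSYS, the table size and the hash function are
    all made up for illustration and are not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NSUBSYS   4                  /* assumed number of subsystems */
#define HASH_BITS 7                  /* assumed table size: 128 buckets */
#define HASH_SIZE (1u << HASH_BITS)

struct css_set_model {
    uintptr_t subsys[NSUBSYS];       /* stand-ins for subsystem state pointers */
    struct css_set_model *next;      /* chain within one bucket */
};

static struct css_set_model *buckets[HASH_SIZE];

/* Fold the subsystem pointers into a bucket index. */
static unsigned int bucket_of(const uintptr_t *subsys)
{
    uintptr_t sum = 0;
    size_t i;

    for (i = 0; i < NSUBSYS; i++)
        sum += subsys[i];
    sum ^= sum >> 16;
    return (unsigned int)(sum & (HASH_SIZE - 1));
}

static void css_set_insert(struct css_set_model *cg)
{
    unsigned int b = bucket_of(cg->subsys);

    cg->next = buckets[b];
    buckets[b] = cg;
}

/* Only one bucket chain is scanned, not every existing css_set. */
static struct css_set_model *css_set_find(const uintptr_t *subsys)
{
    struct css_set_model *cg;

    for (cg = buckets[bucket_of(subsys)]; cg; cg = cg->next)
        if (memcmp(cg->subsys, subsys, sizeof(cg->subsys)) == 0)
            return cg;
    return NULL;
}
```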

    Signed-off-by: Li Zefan
    Reviewed-by: Paul Menage
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • The trigger callback can be used to receive a kick from user space. The
    string written is ignored.

    The cftype->private is used for multiplexing events.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Paul Menage
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • These patches add cgroups read_s64 and write_s64 control file methods (the
    signed equivalent of read_u64/write_u64) and use them to implement the
    cpu.rt_runtime_us control file in the CFS cgroup subsystem.

    This patch:

    These are the signed equivalents of the read_u64/write_u64 methods
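
    What a signed write handler has to do with the incoming text might look
    like the following userspace sketch. parse_s64() is a hypothetical
    helper built on strtoll(); it is not the kernel implementation, and
    accepting one trailing newline is an assumption.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a signed 64-bit value such as "-1\n"; returns 0 or -EINVAL. */
static int parse_s64(const char *buf, int64_t *out)
{
    char *end;
    long long v;

    errno = 0;
    v = strtoll(buf, &end, 0);
    if (end == buf || errno == ERANGE)
        return -EINVAL;           /* no digits, or value out of range */
    if (*end == '\n')
        end++;                    /* tolerate a single trailing newline */
    if (*end != '\0')
        return -EINVAL;           /* trailing garbage */
    *out = (int64_t)v;
    return 0;
}
```

    A negative value such as -1 is the kind of input the unsigned
    read_u64/write_u64 pair cannot represent.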

    Signed-off-by: Paul Menage
    Acked-by: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • The "releasable" control file provided by the cgroup framework exports the
    state of a per-cgroup flag that's related to the notify-on-release feature.
    This isn't really generally useful, unless you're trying to debug this
    particular feature of cgroups.

    This patch moves the "releasable" file to the cgroup_debug subsystem.

    Signed-off-by: Paul Menage
    Cc: "Li Zefan"
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Cc: "YAMAMOTO Takashi"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Adds a new type of supported control file representation, a map from strings
    to u64 values.

    Each map entry is printed as a line in a similar format to /proc/vmstat, i.e.
    "$key $value\n"
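
    The output format can be sketched in userspace as below. struct
    map_entry and format_map() are hypothetical names used for illustration
    only; the kernel-side callback interface is not reproduced here.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical key/value pair for one line of the control file. */
struct map_entry {
    const char *key;
    uint64_t value;
};

/* Write each entry as "$key $value\n"; returns bytes written or -1. */
static int format_map(char *out, size_t outlen,
                      const struct map_entry *entries, size_t n)
{
    size_t used = 0, i;

    if (outlen == 0)
        return -1;
    out[0] = '\0';
    for (i = 0; i < n; i++) {
        int w = snprintf(out + used, outlen - used, "%s %" PRIu64 "\n",
                         entries[i].key, entries[i].value);

        if (w < 0 || (size_t)w >= outlen - used)
            return -1;            /* output buffer too small */
        used += (size_t)w;
    }
    return (int)used;
}
```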

    Signed-off-by: Paul Menage
    Cc: "Li Zefan"
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Cc: "YAMAMOTO Takashi"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Several people have justifiably complained that the "_uint" suffix is
    inappropriate for functions that handle u64 values, so this patch just renames
    all these functions and their users to have the suffix _u64.

    [peterz@infradead.org: build fix]
    Signed-off-by: Paul Menage
    Cc: "Li Zefan"
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Cc: "YAMAMOTO Takashi"
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     

05 Apr, 2008

1 commit

  • The effects of cgroup_disable=foo are:

    - foo isn't auto-mounted if you mount all cgroups in a single hierarchy
    - foo isn't visible as an individually mountable subsystem

    As a result there will only ever be one call to foo->create(), at init time;
    all processes will stay in this group, and the group will never be mounted on
    a visible hierarchy. Any additional effects (e.g. not allocating metadata)
    are up to the foo subsystem.
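
    The flag-setting side of this can be modelled in userspace as below.
    This is a sketch, not the kernel parser: struct subsys_model and
    apply_cgroup_disable() are hypothetical names, and accepting a
    comma-separated list of subsystem names is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a subsystem with a "disabled" bit. */
struct subsys_model {
    const char *name;
    int disabled;
};

/* Mark every subsystem named in arg as disabled; returns how many matched. */
static int apply_cgroup_disable(const char *arg,
                                struct subsys_model *ss, size_t n)
{
    int matched = 0;

    while (*arg) {
        size_t len = strcspn(arg, ",");   /* length of the current token */
        size_t i;

        for (i = 0; i < n; i++) {
            if (len > 0 && strlen(ss[i].name) == len &&
                strncmp(arg, ss[i].name, len) == 0 && !ss[i].disabled) {
                ss[i].disabled = 1;
                matched++;
            }
        }
        arg += len;
        if (*arg == ',')
            arg++;                        /* skip the separator */
    }
    return matched;
}
```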

    This doesn't handle early_init subsystems (their "disabled" bit isn't set),
    but it could easily be extended to do so if any of the early_init systems
    wanted it. I think it would just involve some nastier parameter processing,
    since it would occur before the command-line argument parser had been run.

    Hugh said:

    Ballpark figures, I'm trying to get this question out rather than
    processing the exact numbers: CONFIG_CGROUP_MEM_RES_CTLR adds 15% overhead
    to the affected paths, booting with cgroup_disable=memory cuts that back to
    1% overhead (due to slightly bigger struct page).

    I'm no expert on distros, they may have no interest whatever in
    CONFIG_CGROUP_MEM_RES_CTLR=y; and the rest of us can easily build with or
    without it, or apply the cgroup_disable=memory patches.

    Unix bench's execl test result on x86_64 was

    == just after boot without mounting any cgroup fs. ==
    mem_cgroup=off : Execl Throughput 43.0 3150.1 732.6
    mem_cgroup=on  : Execl Throughput 43.0 2932.6 682.0
    ==

    [lizf@cn.fujitsu.com: fix boot option parsing]
    Signed-off-by: Balbir Singh
    Cc: Paul Menage
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Sudhir Kumar
    Cc: YAMAMOTO Takashi
    Cc: David Rientjes
    Signed-off-by: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     

24 Feb, 2008

2 commits

  • - replace old name 'cont' with 'cgrp' (Paul Menage did this cleanup for
    cgroup.c in commit bd89aabc6761de1c35b154fe6f914a445d301510)
    - remove a duplicate declaration of cgroup_path()

    Signed-off-by: Li Zefan
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • fix:
    - comments about need_forkexit_callback
    - comments about release agent
    - typo and comment style, etc.

    Signed-off-by: Li Zefan
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     

08 Feb, 2008

3 commits

  • This patch corrects a situation that occurs when one disables all the cpus in
    a cpuset.

    Currently, the disabled (cpu-less) cpuset inherits the cpus of its parent,
    which is incorrect because it may then overlap its cpu-exclusive sibling.

    Tasks of an empty cpuset should be moved to the cpuset which is the parent of
    their current cpuset. Or if the parent cpuset has no cpus, to its parent,
    etc.
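
    The walk up the hierarchy can be sketched as follows. This is a
    userspace model: struct cpuset_model and nearest_ancestor_with_cpus()
    are hypothetical names, and a plain bitmask stands in for the cpu mask.

```c
#include <stddef.h>

/* Hypothetical stand-in for a cpuset: a cpu bitmask and a parent link. */
struct cpuset_model {
    unsigned long cpus_allowed;
    struct cpuset_model *parent;
};

/*
 * Find where the tasks of an empty cpuset should go: the closest
 * ancestor that still has cpus, or NULL if there is none.
 */
static struct cpuset_model *
nearest_ancestor_with_cpus(const struct cpuset_model *cs)
{
    struct cpuset_model *p = cs->parent;

    while (p && p->cpus_allowed == 0)
        p = p->parent;
    return p;
}
```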

    And the empty cpuset should be released (if it is flagged notify_on_release).

    Depends on the cgroup_scan_tasks() function (proposed by David Rientjes) to
    iterate through all tasks in the cpu-less cpuset. We are deliberately
    avoiding a walk of the tasklist.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Cliff Wickman
    Cc: Paul Menage
    Cc: Paul Jackson
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     
  • Provide cgroup_scan_tasks(), which iterates through every task in a cgroup,
    calling a test function and a process function for each. And call the process
    function without holding the css_set_lock lock.

    The idea is David Rientjes', who predicted that such a function will make
    it much easier in the future to extend things that require access to each
    task in a cgroup without holding the lock.
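
    The two-callback shape can be illustrated with a simplified userspace
    model. It is a sketch only: the names are hypothetical, a plain flag
    stands in for css_set_lock, and collecting matches into a fixed array
    before processing is a simplification rather than the patch's algorithm.

```c
#include <stddef.h>

#define MAX_SELECTED 64

struct task_model {
    int pid;
    int processed;
};

typedef int  (*scan_test_fn)(const struct task_model *t);
typedef void (*scan_process_fn)(struct task_model *t);

static int lock_held;         /* stand-in for css_set_lock */
static int processed_locked;  /* process calls made while "locked" */

/* Select tasks under the lock, then process them with the lock dropped. */
static int scan_tasks(struct task_model *tasks, size_t n,
                      scan_test_fn test, scan_process_fn process)
{
    struct task_model *selected[MAX_SELECTED];
    size_t count = 0, i;

    lock_held = 1;
    for (i = 0; i < n && count < MAX_SELECTED; i++)
        if (!test || test(&tasks[i]))
            selected[count++] = &tasks[i];
    lock_held = 0;

    for (i = 0; i < count; i++)
        process(selected[i]);
    return (int)count;
}

/* Example test function: pick tasks with an even pid. */
static int has_even_pid(const struct task_model *t)
{
    return t->pid % 2 == 0;
}

/* Example process function: records whether the lock was held. */
static void mark_processed(struct task_model *t)
{
    if (lock_held)
        processed_locked++;
    t->processed = 1;
}
```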

    [akpm@linux-foundation.org: cleanup]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Cliff Wickman
    Cc: Paul Menage
    Cc: Paul Jackson
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     
  • Add a handler "pre_destroy" to cgroup_subsys. It is called before
    cgroup_rmdir() checks all subsys's refcnt.

    I think this is useful for subsys which have some extra refs even if there
    are no tasks in cgroup. By adding pre_destroy(), the kernel keeps the rule
    "destroy() against subsystem is called only when refcnt=0." and allows css
    ref to be used by other objects than tasks.
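
    The ordering this implies can be sketched in userspace as below. The
    names (struct css_model, model_rmdir(), drop_extra_refs()) are
    hypothetical and the model ignores locking; it only shows that
    pre_destroy() runs before the refcnt check.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one subsystem's state in a cgroup. */
struct css_model {
    int refcnt;
    int destroyed;
    void (*pre_destroy)(struct css_model *css);
};

/* Give subsystems a chance to drop extra refs, then check and destroy. */
static int model_rmdir(struct css_model *css, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (css[i].pre_destroy)
            css[i].pre_destroy(&css[i]);

    for (i = 0; i < n; i++)
        if (css[i].refcnt > 0)
            return -EBUSY;        /* still referenced: do not destroy */

    for (i = 0; i < n; i++)
        css[i].destroyed = 1;     /* destroy() only with refcnt == 0 */
    return 0;
}

/* Example pre_destroy: release references held by non-task objects. */
static void drop_extra_refs(struct css_model *css)
{
    css->refcnt = 0;
}
```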

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

20 Oct, 2007

9 commits

  • This patch is inspired by the discussion at
    http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics
    as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263. The
    patch is on top of 2.6.21-mm1 with Paul's cgroups v9 patches (forward
    ported)

    This patch implements per cgroup statistics infrastructure and re-uses
    code from the taskstats interface. A new set of cgroup operations are
    registered with commands and attributes. It should be very easy to
    *extend* per cgroup statistics, by adding members to the cgroupstats
    structure.

    The current model for cgroupstats is pull; a push model (to post
    statistics on interesting events) should be very easy to add. Currently
    user space requests statistics by passing the cgroup file
    descriptor. Statistics about the state of all the tasks in the cgroup
    are returned to user space.

    TODO's/NOTE:

    This patch provides an infrastructure for implementing cgroup statistics.
    Based on the needs of each controller, we can incrementally add more
    statistics, event-based notification of statistics, and accumulation of
    taskstats into cgroup statistics in the future.

    Sample output

    # ./cgroupstats -C /cgroup/a
    sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0

    # ./cgroupstats -C /cgroup/
    sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0

    If the approach looks good, I'll enhance and post the user space utility for
    the same

    Feedback, comments, test results are always welcome!

    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Balbir Singh
    Cc: Paul Menage
    Cc: Jay Lan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Add the following files to the cgroup filesystem:

    notify_on_release - configures/reports whether the cgroup subsystem should
    attempt to run a release script when this cgroup becomes unused

    release_agent - configures/reports the release agent to be used for this
    hierarchy (top level in each hierarchy only)

    releasable - reports whether this cgroup would have been auto-released if
    notify_on_release was true and a release agent was configured (mainly useful
    for debugging)

    To avoid locking issues, invoking the userspace release agent is done via a
    workqueue task; cgroups that need to have their release agents invoked by
    the workqueue task are linked on to a list.

    [pj@sgi.com: Need to include kmod.h]
    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Replace the struct css_set embedded in task_struct with a pointer; all tasks
    that have the same set of memberships across all hierarchies will share a
    css_set object, and will be linked via their css_sets field to the "tasks"
    list_head in the css_set.

    Assuming that many tasks share the same cgroup assignments, this reduces
    overall space usage and keeps the size of the task_struct down (three pointers
    added to task_struct compared to a non-cgroups kernel, no matter how many
    subsystems are registered).

    [akpm@linux-foundation.org: fix a printk]
    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Add:

    /proc/cgroups - general system info

    /proc/*/cgroup - per-task cgroup membership info

    [a.p.zijlstra@chello.nl: cgroups: bdi init hooks]
    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Add support for cgroup_clone(), a way to create new cgroups intended to
    be used for systems such as namespace unsharing. A new subsystem callback,
    post_clone(), is added to allow subsystems to automatically configure cloned
    cgroups.

    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • This adds the necessary hooks to the fork() and exit() paths to ensure
    that new children inherit their parent's cgroup assignments, and that
    exiting processes release reference counts on their cgroups.
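
    In userspace terms the reference counting amounts to something like the
    sketch below. The structures and function names are hypothetical
    stand-ins, and the model leaves out locking and whatever the real exit
    path does with the task's pointer afterwards.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the shared cgroup state and a task. */
struct css_set_model {
    int refcount;
};

struct task_model {
    struct css_set_model *cgroups;
};

/* fork hook: the child shares the parent's assignments and takes a ref. */
static void model_fork(const struct task_model *parent,
                       struct task_model *child)
{
    child->cgroups = parent->cgroups;
    child->cgroups->refcount++;
}

/* exit hook: the exiting task gives its reference back. */
static void model_exit(struct task_model *t)
{
    t->cgroups->refcount--;
    t->cgroups = NULL;
}
```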

    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Add write_uint() helper method for cgroup subsystems

    This helper is analogous to the read_uint() helper method for
    reporting u64 values to userspace. It's designed to reduce the amount
    of boilerplate required for creating new cgroup subsystems.

    Signed-off-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Add the per-directory "tasks" file for cgroupfs mounts; this allows the
    user to determine which tasks are members of a cgroup by reading a
    cgroup's "tasks", and to move a task into a cgroup by writing its pid to
    its "tasks".
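
    From userspace, moving a task then comes down to writing its pid into
    the file. A minimal sketch, assuming a hierarchy is mounted somewhere
    such as /cgroup; build_tasks_path() and attach_pid() are hypothetical
    helper names, not part of any library:

```c
#include <stddef.h>
#include <stdio.h>

/* Build "<cgroup_dir>/tasks"; returns 0 on success, -1 if it won't fit. */
static int build_tasks_path(char *path, size_t len, const char *cgroup_dir)
{
    int n = snprintf(path, len, "%s/tasks", cgroup_dir);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Move a task into a cgroup by writing its pid to the "tasks" file. */
static int attach_pid(const char *cgroup_dir, long pid)
{
    char path[4096];
    FILE *f;
    int ok;

    if (build_tasks_path(path, sizeof(path), cgroup_dir) != 0)
        return -1;
    f = fopen(path, "w");
    if (!f)
        return -1;
    ok = fprintf(f, "%ld\n", pid) > 0;
    if (fclose(f) != 0)
        ok = 0;                   /* the write may only fail at close */
    return ok ? 0 : -1;
}
```

    Reading the same file back lists the tasks that are members of the
    cgroup.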

    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Generic Process Control Groups
    ------------------------------

    There have recently been various proposals floating around for
    resource management/accounting and other task grouping subsystems in
    the kernel, including ResGroups, User BeanCounters, NSProxy
    cgroups, and others. These all need the basic abstraction of being
    able to group together multiple processes in an aggregate, in order to
    track/limit the resources permitted to those processes, or control
    other behaviour of the processes, and all implement this grouping in
    different ways.

    This patchset provides a framework for tracking and grouping processes
    into arbitrary "cgroups" and assigning arbitrary state to those
    groupings, in order to control the behaviour of the cgroup as an
    aggregate.

    The intention is that the various resource management and
    virtualization/cgroup efforts can also become task cgroup
    clients, with the result that:

    - the userspace APIs are (somewhat) normalised

    - it's easier to test e.g. the ResGroups CPU controller in
    conjunction with the BeanCounters memory controller, or use either of
    them as the resource-control portion of a virtual server system.

    - the additional kernel footprint of any of the competing resource
    management systems is substantially reduced, since it doesn't need
    to provide process grouping/containment, hence improving their
    chances of getting into the kernel

    This patch:

    Add the main task cgroups framework - the cgroup filesystem, and the
    basic structures for tracking membership and associating subsystem state
    objects to tasks.

    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage