11 Apr, 2008

1 commit

  • Extend the /proc//cgroup file to include the appropriate hierarchy ID on
    each line.

    Currently this ID isn't really needed since a hierarchy can be completely
    identified by the set of subsystems bound to it, but this is likely to change
    in the near future in order to support stateless subsystems and
    merging/rebinding of subsystems. Getting this change into 2.6.25 reduces the
    need for an API change later.

    Signed-off-by: Paul Menage
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     

05 Apr, 2008

1 commit

  • The effects of cgroup_disable=foo are:

    - foo isn't auto-mounted if you mount all cgroups in a single hierarchy
    - foo isn't visible as an individually mountable subsystem

    As a result there will only ever be one call to foo->create(), at init time;
    all processes will stay in this group, and the group will never be mounted on
    a visible hierarchy. Any additional effects (e.g. not allocating metadata)
    are up to the foo subsystem.

    This doesn't handle early_init subsystems (their "disabled" bit isn't set),
    but it could easily be extended to do so if any of the early_init subsystems
    wanted it. I think it would just involve some nastier parameter processing,
    since it would occur before the command-line argument parser had been run.

    Hugh said:

    Ballpark figures (I'm trying to get this question out rather than
    process the exact numbers): CONFIG_CGROUP_MEM_RES_CTLR adds 15% overhead
    to the affected paths; booting with cgroup_disable=memory cuts that back to
    1% overhead (due to a slightly bigger struct page).

    I'm no expert on distros, they may have no interest whatever in
    CONFIG_CGROUP_MEM_RES_CTLR=y; and the rest of us can easily build with or
    without it, or apply the cgroup_disable=memory patches.

    UnixBench's execl test results on x86_64:

    == just after boot, without mounting any cgroup fs ==
                                            baseline   result    index
    mem_cgroup=off : Execl Throughput   43.0       3150.1    732.6
    mem_cgroup=on  : Execl Throughput   43.0       2932.6    682.0
    ==

    [lizf@cn.fujitsu.com: fix boot option parsing]
    Signed-off-by: Balbir Singh
    Cc: Paul Menage
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Sudhir Kumar
    Cc: YAMAMOTO Takashi
    Cc: David Rientjes
    Signed-off-by: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     

31 Mar, 2008

1 commit


05 Mar, 2008

1 commit

  • The documentation says the default value of notify_on_release of a child
    cgroup is inherited from its parent, which is reasonable, but the
    implementation always leaves the flag disabled.

    Signed-off-by: Li Zefan
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     

24 Feb, 2008

5 commits


08 Feb, 2008

9 commits

  • There's one place that works with task pids: the "tasks" file in cgroups.
    The read/write handlers assume that the pid values go to/come from user
    space, and thus each is a virtual pid, i.e. the pid as it is seen from
    inside a namespace.

    Tune the code accordingly.

    Signed-off-by: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • This patch corrects a situation that occurs when one disables all the cpus in
    a cpuset.

    Currently, the disabled (cpu-less) cpuset inherits the cpus of its parent,
    which is incorrect because it may then overlap its cpu-exclusive sibling.

    Tasks of an empty cpuset should be moved to the parent of their current
    cpuset or, if the parent cpuset has no cpus, to its parent, and so on.

    And the empty cpuset should be released (if it is flagged notify_on_release).

    Depends on the cgroup_scan_tasks() function (proposed by David Rientjes) to
    iterate through all tasks in the cpu-less cpuset. We are deliberately
    avoiding a walk of the tasklist.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Cliff Wickman
    Cc: Paul Menage
    Cc: Paul Jackson
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     
  • Provide cgroup_scan_tasks(), which iterates through every task in a cgroup,
    calling a test function and a process function for each. The process
    function is called without holding css_set_lock.

    The idea is David Rientjes', who predicted that such a function would make
    it much easier in the future to extend things that require access to each
    task in a cgroup without holding the lock.

    [akpm@linux-foundation.org: cleanup]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Cliff Wickman
    Cc: Paul Menage
    Cc: Paul Jackson
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     
  • Add a handler "pre_destroy" to cgroup_subsys. It is called before
    cgroup_rmdir() checks every subsystem's refcnt.

    I think this is useful for subsystems that hold some extra refs even when
    there are no tasks in the cgroup. By adding pre_destroy(), the kernel keeps
    the rule "destroy() against a subsystem is called only when refcnt=0" and
    allows css refs to be held by objects other than tasks.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • cgroup_is_releasable() and notify_on_release() should be static,
    not global inline.

    Signed-off-by: Adrian Bunk
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Move the calls to the cgroup subsystem destroy() methods from
    cgroup_rmdir() to cgroup_diput(). This allows control file reads and
    writes to access their subsystem state without having to be concerned with
    locking against cgroup destruction - the control file dentry will keep the
    cgroup and its subsystem state objects alive until the file is closed.

    The documentation is updated to reflect the changed semantics of destroy();
    additionally the locking comments for destroy() and some other methods were
    clarified and decrustified.

    Signed-off-by: Paul Menage
    Cc: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Simplify the space stripping code in cgroup file write.

    [akpm@linux-foundation.org: s/BUG_ON/BUILD_BUG_ON/]
    Signed-off-by: Paul Jackson
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Coding style fix - one line conditionals don't get braces.

    Signed-off-by: Paul Jackson
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • This patch removes dead code spotted by the Coverity checker
    (look at the "(nbytes >= PATH_MAX)" check).

    Signed-off-by: Adrian Bunk
    Cc: Paul Jackson
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     

15 Nov, 2007

1 commit

  • When I boot with the 'quiet' parameter, I see on the screen:

    [ 0.000000] Initializing cgroup subsys cpuset
    [ 0.000000] Initializing cgroup subsys cpu
    [ 39.036026] Initializing cgroup subsys cpuacct
    [ 39.036080] Initializing cgroup subsys debug
    [ 39.036118] Initializing cgroup subsys ns

    This patch lowers the priority of those messages, adds a "cgroup: " prefix
    to a couple of other printks, and removes the useless reference to the
    source file.

    Signed-off-by: Diego Calleja
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Diego Calleja
     

24 Oct, 2007

1 commit


20 Oct, 2007

11 commits

  • Replace "cont" with "cgrp" and other misc renaming

    This patch finishes renaming some of the identifiers that were missed in
    the great "task containers" -> "control groups" rename. Primarily it
    renames the local variable "cont" to "cgrp" in a number of places, and
    renames the CONT_* enum members to CGRP_*.

    This patch is not intended to have any effect on the generated code;
    the output of "objdump -d kernel/cgroup.o" is unchanged.

    Signed-off-by: Paul Menage
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • There are two places that do so - the cgroups subsystem and the autofs
    code.

    Signed-off-by: Pavel Emelyanov
    Cc: Ian Kent
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • This patch is inspired by the discussion at
    http://lkml.org/lkml/2007/4/11/187 and implements per-cgroup statistics
    as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263. The
    patch applies on top of 2.6.21-mm1 with Paul's cgroups v9 patches (forward
    ported).

    This patch implements per-cgroup statistics infrastructure and re-uses
    code from the taskstats interface. A new set of cgroup operations is
    registered with commands and attributes. It should be very easy to
    *extend* per-cgroup statistics by adding members to the cgroupstats
    structure.

    The current model for cgroupstats is pull-based; a push model (posting
    statistics on interesting events) should be very easy to add. Currently,
    user space requests statistics by passing the cgroup file descriptor.
    Statistics about the state of all the tasks in the cgroup are returned to
    user space.

    TODOs/NOTE:

    This patch provides an infrastructure for implementing cgroup statistics.
    Based on the needs of each controller, we can incrementally add more
    statistics, event-based notification of statistics, and accumulation of
    taskstats into cgroup statistics in the future.

    Sample output

    # ./cgroupstats -C /cgroup/a
    sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0

    # ./cgroupstats -C /cgroup/
    sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0

    If the approach looks good, I'll enhance and post the user space utility
    as well.

    Feedback, comments, test results are always welcome!

    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Balbir Singh
    Cc: Paul Menage
    Cc: Jay Lan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Add the following files to the cgroup filesystem:

    notify_on_release - configures/reports whether the cgroup subsystem should
    attempt to run a release script when this cgroup becomes unused

    release_agent - configures/reports the release agent to be used for this
    hierarchy (top level in each hierarchy only)

    releasable - reports whether this cgroup would have been auto-released if
    notify_on_release was true and a release agent was configured (mainly useful
    for debugging)

    To avoid locking issues, invoking the userspace release agent is done via a
    workqueue task; cgroups that need to have their release agents invoked by
    the workqueue task are linked onto a list.

    [pj@sgi.com: Need to include kmod.h]
    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Replace the struct css_set embedded in task_struct with a pointer; all tasks
    that have the same set of memberships across all hierarchies will share a
    css_set object, and will be linked via their css_sets field to the "tasks"
    list_head in the css_set.

    Assuming that many tasks share the same cgroup assignments, this reduces
    overall space usage and keeps the size of the task_struct down (three pointers
    added to task_struct compared to a non-cgroups kernel, no matter how many
    subsystems are registered).

    [akpm@linux-foundation.org: fix a printk]
    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Add:

    /proc/cgroups - general system info

    /proc/*/cgroup - per-task cgroup membership info

    [a.p.zijlstra@chello.nl: cgroups: bdi init hooks]
    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Add support for cgroup_clone(), a way to create new cgroups intended to
    be used for systems such as namespace unsharing. A new subsystem callback,
    post_clone(), is added to allow subsystems to automatically configure cloned
    cgroups.

    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • This adds the necessary hooks to the fork() and exit() paths to ensure
    that new children inherit their parent's cgroup assignments, and that
    exiting processes release reference counts on their cgroups.

    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Add write_uint() helper method for cgroup subsystems

    This helper is analogous to the read_uint() helper method for
    reporting u64 values to userspace. It's designed to reduce the amount
    of boilerplate required for creating new cgroup subsystems.

    Signed-off-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Add the per-directory "tasks" file for cgroupfs mounts; this allows the
    user to determine which tasks are members of a cgroup by reading a
    cgroup's "tasks", and to move a task into a cgroup by writing its pid to
    its "tasks".

    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Generic Process Control Groups
    ------------------------------

    There have recently been various proposals floating around for
    resource management/accounting and other task grouping subsystems in
    the kernel, including ResGroups, User BeanCounters, NSProxy
    cgroups, and others. These all need the basic abstraction of being
    able to group together multiple processes in an aggregate, in order to
    track/limit the resources permitted to those processes, or control
    other behaviour of the processes, and all implement this grouping in
    different ways.

    This patchset provides a framework for tracking and grouping processes
    into arbitrary "cgroups" and assigning arbitrary state to those
    groupings, in order to control the behaviour of the cgroup as an
    aggregate.

    The intention is that the various resource management and
    virtualization/cgroup efforts can also become task cgroup
    clients, with the result that:

    - the userspace APIs are (somewhat) normalised

    - it's easier to test e.g. the ResGroups CPU controller in
    conjunction with the BeanCounters memory controller, or use either of
    them as the resource-control portion of a virtual server system.

    - the additional kernel footprint of any of the competing resource
    management systems is substantially reduced, since it doesn't need
    to provide process grouping/containment, hence improving their
    chances of getting into the kernel

    This patch:

    Add the main task cgroups framework - the cgroup filesystem, and the
    basic structures for tracking membership and associating subsystem state
    objects to tasks.

    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage