04 Jan, 2014

1 commit

  • Zefan Li requested [1] to perform the following cleanup/refactoring:

    - Split cgroupfs classid handling into net core to better express a
    possible more generic use.

    - Disable module support for cgroupfs bits as the majority of other
    cgroupfs subsystems do not have that, and seems to be not wished
    from cgroup side. Zefan probably might want to follow-up for netprio
    later on.

    - By this, code can be further reduced which previously took care of
    functionality built when compiled as module.

    cgroupfs bits are being placed under net/core/netclassid_cgroup.c, so
    that we are consistent with {netclassid,netprio}_cgroup naming that is
    under net/core/ as suggested by Zefan.

    No change in functionality, but only code refactoring that is being
    done here.

    [1] http://patchwork.ozlabs.org/patch/304825/

    Suggested-by: Li Zefan
    Signed-off-by: Daniel Borkmann
    Cc: Zefan Li
    Cc: Thomas Graf
    Cc: cgroups@vger.kernel.org
    Acked-by: Li Zefan
    Signed-off-by: Pablo Neira Ayuso

    Daniel Borkmann
     

10 Jul, 2013

1 commit

  • cafe563591 ("bcache: A block layer cache") added a new cgroup
    subsystem bcache_subsys without proper review and ack. bcache_subsys
    seems to use cgroup for group stats and per-group cache_mode
    configuration. This is very much the type of usage that we don't want
    to allow.

    Fortunately, CONFIG_CGROUP_BCACHE which enables bcache_subsys is
    currently commented out, so this shouldn't have any upstream users.
    Let's nip in the bud. While at it, clarify in cgroup_subsys.h that no
    new subsystem should be added without explicit acks from cgroup
    maintainers.

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Cc: cgroups@vger.kernel.org
    Cc: Kent Overstreet
    Cc: Jens Axboe
    Cc: linux-bcache@vger.kernel.org

    Tejun Heo
     

24 Mar, 2013

1 commit

  • Does writethrough and writeback caching, handles unclean shutdown, and
    has a bunch of other nifty features motivated by real world usage.

    See the wiki at http://bcache.evilpiepirate.org for more.

    Signed-off-by: Kent Overstreet

    Kent Overstreet
     

15 Sep, 2012

1 commit

  • Before we are able to define all subsystem ids at compile time we need
    a more fine grained control what gets defined when we include
    cgroup_subsys.h. For example we define the enums for the subsystems or
    to declare for struct cgroup_subsys (builtin subsystem) by including
    cgroup_subsys.h and defining SUBSYS accordingly.

    Currently, the decision if a subsys is used is defined inside the
    header by testing if CONFIG_*=y is true. By moving this test outside
    of cgroup_subsys.h we are able to control it on the include level.

    This is done by introducing IS_SUBSYS_ENABLED which then is defined
    according the task, e.g. is CONFIG_*=y or CONFIG_*=m.

    Signed-off-by: Daniel Wagner
    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Neil Horman
    Cc: Gao feng
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Cc: netdev@vger.kernel.org
    Cc: cgroups@vger.kernel.org

    Daniel Wagner
     

01 Aug, 2012

2 commits

  • Sanity:

    CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
    CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

    [mhocko@suse.cz: fix missed bits]
    Cc: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Aneesh Kumar K.V
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Implement a new controller that allows us to control HugeTLB allocations.
    The extension allows to limit the HugeTLB usage per control group and
    enforces the controller limit during page fault. Since HugeTLB doesn't
    support page reclaim, enforcing the limit at page fault time implies that,
    the application will get SIGBUS signal if it tries to access HugeTLB pages
    beyond its limit. This requires the application to know beforehand how
    much HugeTLB pages it would require for its use.

    The charge/uncharge calls will be added to HugeTLB code in later patch.
    Support for cgroup removal will be added in later patches.

    [akpm@linux-foundation.org: s/CONFIG_CGROUP_HUGETLB_RES_CTLR/CONFIG_MEMCG_HUGETLB/g]
    [akpm@linux-foundation.org: s/CONFIG_MEMCG_HUGETLB/CONFIG_CGROUP_HUGETLB/g]
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Aneesh Kumar K.V
    Cc: David Rientjes
    Cc: Hillf Danton
    Reviewed-by: Michal Hocko
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

23 Nov, 2011

1 commit

  • This patch adds in the infrastructure code to create the network priority
    cgroup. The cgroup, in addition to the standard processes file creates two
    control files:

    1) prioidx - This is a read-only file that exports the index of this cgroup.
    This is a value that is both arbitrary and unique to a cgroup in this subsystem,
    and is used to index the per-device priority map

    2) priomap - This is a writeable file. On read it reports a table of 2-tuples
    where name is the name of a network interface and priority is
    indicates the priority assigned to frames egresessing on the named interface and
    originating from a pid in this cgroup

    This cgroup allows for skb priority to be set prior to a root qdisc getting
    selected. This is benenficial for DCB enabled systems, in that it allows for any
    application to use dcb configured priorities so without application modification

    Signed-off-by: Neil Horman
    Signed-off-by: John Fastabend
    CC: Robert Love
    CC: "David S. Miller"
    Signed-off-by: David S. Miller

    Neil Horman
     

27 May, 2011

1 commit

  • The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
    leads to some problems:

    * cgroup creation is out-of-control
    * cgroup name can conflict when pids are looping
    * it is not possible to have a single process handling a lot of
    namespaces without falling in a exponential creation time
    * we may want to create a namespace without creating a cgroup

    The ns_cgroup was replaced by a compatibility flag 'clone_children',
    where a newly created cgroup will copy the parent cgroup values.
    The userspace has to manually create a cgroup and add a task to
    the 'tasks' file.

    This patch removes the ns_cgroup as suggested in the following thread:

    https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html

    The 'cgroup_clone' function is removed because it is no longer used.

    This is a userspace-visible change. Commit 45531757b45c ("cgroup: notify
    ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
    printk warning users that the feature is planned for removal. Since that
    time we have heard from XXX users who were affected by this.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Serge E. Hallyn
    Cc: Eric W. Biederman
    Cc: Jamal Hadi Salim
    Reviewed-by: Li Zefan
    Acked-by: Paul Menage
    Acked-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Lezcano
     

16 Feb, 2011

1 commit

  • This kernel patch adds the ability to filter monitoring based on
    container groups (cgroups). This is for use in per-cpu mode only.

    The cgroup to monitor is passed as a file descriptor in the pid
    argument to the syscall. The file descriptor must be opened to
    the cgroup name in the cgroup filesystem. For instance, if the
    cgroup name is foo and cgroupfs is mounted in /cgroup, then the
    file descriptor is opened to /cgroup/foo. Cgroup mode is
    activated by passing PERF_FLAG_PID_CGROUP in the flags argument
    to the syscall.

    For instance to measure in cgroup foo on CPU1 assuming
    cgroupfs is mounted under /cgroup:

    struct perf_event_attr attr;
    int cgroup_fd, fd;

    cgroup_fd = open("/cgroup/foo", O_RDONLY);
    fd = perf_event_open(&attr, cgroup_fd, 1, -1, PERF_FLAG_PID_CGROUP);
    close(cgroup_fd);

    Signed-off-by: Stephane Eranian
    [ added perf_cgroup_{exit,attach} ]
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     

04 Dec, 2009

1 commit


08 Nov, 2008

1 commit

  • The classifier should cover the most common use case and will work
    without any special configuration.

    The principle of the classifier is to directly access the
    task_struct via get_current(). In order for this to work,
    classification requests from softirqs must be ignored. This is
    not a problem because the vast majority of packets in softirq
    context are not assigned to a task anyway. For this to work, a
    mechanism is needed to trace softirq context.

    This repost goes back to the method of relying on the number of
    nested bh disable calls for the sake of not adding too much
    complexity and the option to come up with something more reliable
    if actually needed.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     

20 Oct, 2008

1 commit

  • This patch implements a new freezer subsystem in the control groups
    framework. It provides a way to stop and resume execution of all tasks in
    a cgroup by writing in the cgroup filesystem.

    The freezer subsystem in the container filesystem defines a file named
    freezer.state. Writing "FROZEN" to the state file will freeze all tasks
    in the cgroup. Subsequently writing "RUNNING" will unfreeze the tasks in
    the cgroup. Reading will return the current state.

    * Examples of usage :

    # mkdir /containers/freezer
    # mount -t cgroup -ofreezer freezer /containers
    # mkdir /containers/0
    # echo $some_pid > /containers/0/tasks

    to get status of the freezer subsystem :

    # cat /containers/0/freezer.state
    RUNNING

    to freeze all tasks in the container :

    # echo FROZEN > /containers/0/freezer.state
    # cat /containers/0/freezer.state
    FREEZING
    # cat /containers/0/freezer.state
    FROZEN

    to unfreeze all tasks in the container :

    # echo RUNNING > /containers/0/freezer.state
    # cat /containers/0/freezer.state
    RUNNING

    This is the basic mechanism which should do the right thing for user space
    task in a simple scenario.

    It's important to note that freezing can be incomplete. In that case we
    return EBUSY. This means that some tasks in the cgroup are busy doing
    something that prevents us from completely freezing the cgroup at this
    time. After EBUSY, the cgroup will remain partially frozen -- reflected
    by freezer.state reporting "FREEZING" when read. The state will remain
    "FREEZING" until one of these things happens:

    1) Userspace cancels the freezing operation by writing "RUNNING" to
    the freezer.state file
    2) Userspace retries the freezing operation by writing "FROZEN" to
    the freezer.state file (writing "FREEZING" is not legal
    and returns EIO)
    3) The tasks that blocked the cgroup from entering the "FROZEN"
    state disappear from the cgroup's set of tasks.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: export thaw_process]
    Signed-off-by: Cedric Le Goater
    Signed-off-by: Matt Helsley
    Acked-by: Serge E. Hallyn
    Tested-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     

29 Apr, 2008

1 commit

  • Implement a cgroup to track and enforce open and mknod restrictions on device
    files. A device cgroup associates a device access whitelist with each cgroup.
    A whitelist entry has 4 fields. 'type' is a (all), c (char), or b (block).
    'all' means it applies to all types and all major and minor numbers. Major
    and minor are either an integer or * for all. Access is a composition of r
    (read), w (write), and m (mknod).

    The root device cgroup starts with rwm to 'all'. A child devcg gets a copy of
    the parent. Admins can then remove devices from the whitelist or add new
    entries. A child cgroup can never receive a device access which is denied its
    parent. However when a device access is removed from a parent it will not
    also be removed from the child(ren).

    An entry is added using devices.allow, and removed using
    devices.deny. For instance

    echo 'c 1:3 mr' > /cgroups/1/devices.allow

    allows cgroup 1 to read and mknod the device usually known as
    /dev/null. Doing

    echo a > /cgroups/1/devices.deny

    will remove the default 'a *:* mrw' entry.

    CAP_SYS_ADMIN is needed to change permissions or move another task to a new
    cgroup. A cgroup may not be granted more permissions than the cgroup's parent
    has. Any task can move itself between cgroups. This won't be sufficient, but
    we can decide the best way to adequately restrict movement later.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix may-be-used-uninitialized warning]
    Signed-off-by: Serge E. Hallyn
    Acked-by: James Morris
    Looks-good-to: Pavel Emelyanov
    Cc: Daniel Hokka Zakrisson
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

05 Mar, 2008

1 commit

  • Rename Memory Controller to Memory Resource Controller. Reflect the same
    changes in the CONFIG definition for the Memory Resource Controller. Group
    together the config options for Resource Counters and Memory Resource
    Controller.

    Signed-off-by: Balbir Singh
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

13 Feb, 2008

1 commit


08 Feb, 2008

1 commit

  • Setup the memory cgroup and add basic hooks and controls to integrate
    and work with the cgroup.

    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

03 Dec, 2007

1 commit

  • Commit cfb5285660aad4931b2ebbfa902ea48a37dfffa1 removed a useful feature for
    us, which provided a cpu accounting resource controller. This feature would be
    useful if someone wants to group tasks only for accounting purpose and doesnt
    really want to exercise any control over their cpu consumption.

    The patch below reintroduces the feature. It is based on Paul Menage's
    original patch (Commit 62d0df64065e7c135d0002f069444fbdfc64768f), with
    these differences:

    - Removed load average information. I felt it needs more thought (esp
    to deal with SMP and virtualized platforms) and can be added for
    2.6.25 after more discussions.
    - Convert group cpu usage to be nanosecond accurate (as rest of the cfs
    stats are) and invoke cpuacct_charge() from the respective scheduler
    classes
    - Make accounting scalable on SMP systems by splitting the usage
    counter to be per-cpu
    - Move the code from kernel/cpu_acct.c to kernel/sched.c (since the
    code is not big enough to warrant a new file and also this rightly
    needs to live inside the scheduler. Also things like accessing
    rq->lock while reading cpu usage becomes easier if the code lived in
    kernel/sched.c)

    The patch also modifies the cpu controller not to provide the same accounting
    information.

    Tested-by: Balbir Singh

    Tested the patches on top of 2.6.24-rc3. The patches work fine. Ran
    some simple tests like cpuspin (spin on the cpu), ran several tasks in
    the same group and timed them. Compared their time stamps with
    cpuacct.usage.

    Signed-off-by: Srivatsa Vaddagiri
    Signed-off-by: Balbir Singh
    Signed-off-by: Ingo Molnar

    Srivatsa Vaddagiri
     

15 Nov, 2007

1 commit

  • Revert 62d0df64065e7c135d0002f069444fbdfc64768f.

    This was originally intended as a simple initial example of how to create a
    control groups subsystem; it wasn't intended for mainline, but I didn't make
    this clear enough to Andrew.

    The CFS cgroup subsystem now has better functionality for the per-cgroup usage
    accounting (based directly on CFS stats) than the "usage" status file in this
    patch, and the "load" status file is rather simplistic - although having a
    per-cgroup load average report would be a useful feature, I don't believe this
    patch actually provides it. If it gets into the final 2.6.24 we'd probably
    have to support this interface for ever.

    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

20 Oct, 2007

6 commits

  • Enable "cgroup" (formerly containers) based fair group scheduling. This
    will let administrator create arbitrary groups of tasks (using "cgroup"
    pseudo filesystem) and control their cpu bandwidth usage.

    [akpm@linux-foundation.org: fix cpp condition]
    Signed-off-by: Srivatsa Vaddagiri
    Signed-off-by: Dhaval Giani
    Cc: Randy Dunlap
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa Vaddagiri
     
  • When a task enters a new namespace via a clone() or unshare(), a new cgroup
    is created and the task moves into it.

    This version names cgroups which are automatically created using
    cgroup_clone() as "node_" where pid is the pid of the unsharing or
    cloned process. (Thanks Pavel for the idea) This is safe because if the
    process unshares again, it will create

    /cgroups/(...)/node_/node_

    The only possibilities (AFAICT) for a -EEXIST on unshare are

    1. pid wraparound
    2. a process fails an unshare, then tries again.

    Case 1 is unlikely enough that I ignore it (at least for now). In case 2, the
    node_ will be empty and can be rmdir'ed to make the subsequent unshare()
    succeed.

    Changelog:
    Name cloned cgroups as "node_".

    [clg@fr.ibm.com: fix order of cgroup subsystems in init/Kconfig]
    Signed-off-by: Serge E. Hallyn
    Cc: Paul Menage
    Signed-off-by: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • This example subsystem exports debugging information as an aid to diagnosing
    refcount leaks, etc, in the cgroup framework.

    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • This example demonstrates how to use the generic cgroup subsystem for a
    simple resource tracker that counts, for the processes in a cgroup, the
    total CPU time used and the %CPU used in the last complete 10 second interval.

    Portions contributed by Balbir Singh

    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Remove the filesystem support logic from the cpusets system and makes cpusets
    a cgroup subsystem

    The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
    passed through to the cgroup filesystem with the appropriate options to
    emulate the old cpuset filesystem behaviour.

    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Generic Process Control Groups
    --------------------------

    There have recently been various proposals floating around for
    resource management/accounting and other task grouping subsystems in
    the kernel, including ResGroups, User BeanCounters, NSProxy
    cgroups, and others. These all need the basic abstraction of being
    able to group together multiple processes in an aggregate, in order to
    track/limit the resources permitted to those processes, or control
    other behaviour of the processes, and all implement this grouping in
    different ways.

    This patchset provides a framework for tracking and grouping processes
    into arbitrary "cgroups" and assigning arbitrary state to those
    groupings, in order to control the behaviour of the cgroup as an
    aggregate.

    The intention is that the various resource management and
    virtualization/cgroup efforts can also become task cgroup
    clients, with the result that:

    - the userspace APIs are (somewhat) normalised

    - it's easier to test e.g. the ResGroups CPU controller in
    conjunction with the BeanCounters memory controller, or use either of
    them as the resource-control portion of a virtual server system.

    - the additional kernel footprint of any of the competing resource
    management systems is substantially reduced, since it doesn't need
    to provide process grouping/containment, hence improving their
    chances of getting into the kernel

    This patch:

    Add the main task cgroups framework - the cgroup filesystem, and the
    basic structures for tracking membership and associating subsystem state
    objects to tasks.

    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage