Documentation/cgroups/cgroups.txt 26.3 KB
  				CGROUPS
  				-------

  Written by Paul Menage <menage@google.com> based on
  Documentation/cgroups/cpusets.txt
  
  Original copyright statements from cpusets.txt:
  Portions Copyright (C) 2004 BULL SA.
  Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
  Modified by Paul Jackson <pj@sgi.com>
  Modified by Christoph Lameter <clameter@sgi.com>
  
  CONTENTS:
  =========
  
  1. Control Groups
    1.1 What are cgroups ?
    1.2 Why are cgroups needed ?
    1.3 How are cgroups implemented ?
    1.4 What does notify_on_release do ?
    1.5 What does clone_children do ?
    1.6 How do I use cgroups ?
  2. Usage Examples and Syntax
    2.1 Basic Usage
    2.2 Attaching processes
    2.3 Mounting hierarchies by name
    2.4 Notification API
  3. Kernel API
    3.1 Overview
    3.2 Synchronization
    3.3 Subsystem API
  4. Questions
  
  1. Control Groups
  =================
  
  1.1 What are cgroups ?
  ----------------------
  
  Control Groups provide a mechanism for aggregating/partitioning sets of
  tasks, and all their future children, into hierarchical groups with
  specialized behaviour.
  
  Definitions:
  
  A *cgroup* associates a set of tasks with a set of parameters for one
  or more subsystems.
  
  A *subsystem* is a module that makes use of the task grouping
  facilities provided by cgroups to treat groups of tasks in
  particular ways. A subsystem is typically a "resource controller" that
  schedules a resource or applies per-cgroup limits, but it may be
  anything that wants to act on a group of processes, e.g. a
  virtualization subsystem.
  
  A *hierarchy* is a set of cgroups arranged in a tree, such that
  every task in the system is in exactly one of the cgroups in the
  hierarchy, and a set of subsystems; each subsystem has system-specific
  state attached to each cgroup in the hierarchy.  Each hierarchy has
  an instance of the cgroup virtual filesystem associated with it.

  At any one time there may be multiple active hierarchies of task
  cgroups. Each hierarchy is a partition of all tasks in the system.
  
  User level code may create and destroy cgroups by name in an
  instance of the cgroup virtual file system, specify and query to
  which cgroup a task is assigned, and list the task pids assigned to
  a cgroup. Those creations and assignments only affect the hierarchy
  associated with that instance of the cgroup file system.
  
  On their own, the only use for cgroups is for simple job
  tracking. The intention is that other subsystems hook into the generic
  cgroup support to provide new attributes for cgroups, such as
  accounting/limiting the resources which processes in a cgroup can
  access. For example, cpusets (see Documentation/cgroups/cpusets.txt) allows
  you to associate a set of CPUs and a set of memory nodes with the
  tasks in each cgroup.
  
  1.2 Why are cgroups needed ?
  ----------------------------
  
  There are multiple efforts to provide process aggregations in the
  Linux kernel, mainly for resource tracking purposes. Such efforts
  include cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server
  namespaces. These all require the basic notion of a
  grouping/partitioning of processes, with newly forked processes ending
  in the same group (cgroup) as their parent process.
  
  The kernel cgroup patch provides the minimum essential kernel
  mechanisms required to efficiently implement such groups. It has
  minimal impact on the system fast paths, and provides hooks for
  specific subsystems such as cpusets to provide additional behaviour as
  desired.
  
  Multiple hierarchy support is provided to allow for situations where
  the division of tasks into cgroups is distinctly different for
  different subsystems - having parallel hierarchies allows each
  hierarchy to be a natural division of tasks, without having to handle
  complex combinations of tasks that would be present if several
  unrelated subsystems needed to be forced into the same tree of
  cgroups.
  
  At one extreme, each resource controller or subsystem could be in a
  separate hierarchy; at the other extreme, all subsystems
  would be attached to the same hierarchy.
  
  As an example of a scenario (originally proposed by vatsa@in.ibm.com)
  that can benefit from multiple hierarchies, consider a large
  university server with various users - students, professors, system
  tasks etc. The resource planning for this server could be along the
  following lines:
         CPU :          "Top cpuset"
                         /       \
                 CPUSet1         CPUSet2
                    |               |
                 (Professors)    (Students)

                 In addition (system tasks) are attached to topcpuset (so
                 that they can run anywhere) with a limit of 20%

         Memory : Professors (50%), Students (30%), system (20%)

         Disk : Professors (50%), Students (30%), system (20%)

         Network : WWW browsing (20%), Network File System (60%), others (20%)
                                 / \
                 Professors (15%)  students (5%)

  Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd goes
  into the NFS network class.

  At the same time Firefox/Lynx will share an appropriate CPU/Memory class
  depending on who launched it (prof/student).
  
  With the ability to classify tasks differently for different resources
  (by putting those resource subsystems in different hierarchies), the
  admin can easily set up a script which receives exec notifications and,
  depending on who is launching the browser, runs:

      # echo browser_pid > /sys/fs/cgroup/<restype>/<userclass>/tasks
  
  With only a single hierarchy, the admin would potentially have to create
  a separate cgroup for every browser launched and associate it with the
  appropriate network and other resource classes.  This may lead to a
  proliferation of such cgroups.
  
  Also, let's say that the administrator would like to temporarily give
  enhanced network access to a student's browser (since it is night and
  the user wants to do online gaming :)) or give one of the student's
  simulation apps enhanced CPU power. With the ability to write pids
  directly to resource classes, it's just a matter of:

         # echo pid > /sys/fs/cgroup/network/<new_class>/tasks
         (after some time)
         # echo pid > /sys/fs/cgroup/network/<orig_class>/tasks
  
  Without this ability, he would have to split the cgroup into
  multiple separate ones and then associate the new cgroups with the
  new resource classes.
  
  
  
  1.3 How are cgroups implemented ?
  ---------------------------------
  
  Control Groups extends the kernel as follows:
  
   - Each task in the system has a reference-counted pointer to a
     css_set.
  
   - A css_set contains a set of reference-counted pointers to
     cgroup_subsys_state objects, one for each cgroup subsystem
     registered in the system. There is no direct link from a task to
     the cgroup of which it's a member in each hierarchy, but this
     can be determined by following pointers through the
     cgroup_subsys_state objects. This is because accessing the
     subsystem state is something that's expected to happen frequently
     and in performance-critical code, whereas operations that require a
     task's actual cgroup assignments (in particular, moving between
     cgroups) are less common. A linked list runs through the cg_list
     field of each task_struct using the css_set, anchored at
     css_set->tasks.
  
   - A cgroup hierarchy filesystem can be mounted for browsing and
     manipulation from user space.
  
   - You can list all the tasks (by pid) attached to any cgroup.
  
  The implementation of cgroups requires a few, simple hooks
  into the rest of the kernel, none in performance critical paths:
  
   - in init/main.c, to initialize the root cgroups and initial
     css_set at system boot.
  
   - in fork and exit, to attach and detach a task from its css_set.
  
  In addition a new file system, of type "cgroup" may be mounted, to
  enable browsing and modifying the cgroups presently known to the
  kernel.  When mounting a cgroup hierarchy, you may specify a
  comma-separated list of subsystems to mount as the filesystem mount
  options.  By default, mounting the cgroup filesystem attempts to
  mount a hierarchy containing all registered subsystems.
  
  If an active hierarchy with exactly the same set of subsystems already
  exists, it will be reused for the new mount. If no existing hierarchy
  matches, and any of the requested subsystems are in use in an existing
  hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy
  is activated, associated with the requested subsystems.
  
  It's not currently possible to bind a new subsystem to an active
  cgroup hierarchy, or to unbind a subsystem from an active cgroup
  hierarchy. This may be possible in future, but is fraught with nasty
  error-recovery issues.
  
  When a cgroup filesystem is unmounted, if there are any
  child cgroups created below the top-level cgroup, that hierarchy
  will remain active even though unmounted; if there are no
  child cgroups then the hierarchy will be deactivated.
  
  No new system calls are added for cgroups - all support for
  querying and modifying cgroups is via this cgroup file system.
  
  Each task under /proc has an added file named 'cgroup' displaying,
  for each active hierarchy, the subsystem names and the cgroup name
  as the path relative to the root of the cgroup file system.
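
  As an illustration, the C sketch below splits one line of
  /proc/<pid>/cgroup into its fields. It assumes the usual
  colon-separated layout "<hierarchy-id>:<subsystems>:<path>"; the
  struct and function names and the buffer sizes are made up for this
  example and are not part of any kernel or library API.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One parsed line of /proc/<pid>/cgroup (illustrative type). */
struct cgroup_entry {
	int hierarchy_id;
	char subsystems[128];	/* comma-separated subsystem names */
	char path[256];		/* path relative to the hierarchy root */
};

/* Returns 0 on success, -1 if the line is malformed or a field is too long. */
int parse_cgroup_line(const char *line, struct cgroup_entry *out)
{
	const char *first = strchr(line, ':');
	const char *second = first ? strchr(first + 1, ':') : NULL;
	size_t subsys_len, path_len;

	if (!first || !second)
		return -1;

	out->hierarchy_id = atoi(line);

	subsys_len = (size_t)(second - first - 1);
	if (subsys_len >= sizeof(out->subsystems))
		return -1;
	memcpy(out->subsystems, first + 1, subsys_len);
	out->subsystems[subsys_len] = '\0';

	/* The path runs to the end of the line; drop a trailing newline. */
	path_len = strcspn(second + 1, "\n");
	if (path_len >= sizeof(out->path))
		return -1;
	memcpy(out->path, second + 1, path_len);
	out->path[path_len] = '\0';
	return 0;
}
```

  A caller would typically fgets() each line of /proc/self/cgroup and
  pass it to parse_cgroup_line().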
  
  Each cgroup is represented by a directory in the cgroup file system
  containing the following files describing that cgroup:
   - tasks: list of tasks (by pid) attached to that cgroup.  This list
     is not guaranteed to be sorted.  Writing a thread id into this file
     moves the thread into this cgroup.
   - cgroup.procs: list of tgids in the cgroup.  This list is not
     guaranteed to be sorted or free of duplicate tgids, and userspace
     should sort/uniquify the list if this property is required.
     Writing a thread group id into this file moves all threads in that
     group into this cgroup.
   - notify_on_release flag: run the release agent on exit?
   - release_agent: the path to use for release notifications (this file
     exists in the top cgroup only)
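
  Since neither tasks nor cgroup.procs is guaranteed to be sorted (and
  cgroup.procs may contain duplicates), userspace that needs a clean
  list has to post-process it. A minimal sketch of that step follows;
  the function name and the fixed-size caller buffer are assumptions
  made for the example.

```c
#include <stdio.h>
#include <stdlib.h>

/* qsort() comparator for long values. */
static int cmp_long(const void *a, const void *b)
{
	long x = *(const long *)a, y = *(const long *)b;

	return (x > y) - (x < y);
}

/*
 * Read up to max ids from a cgroup.procs (or tasks) file, then sort
 * them and drop duplicates in place.
 * Returns the number of unique ids, or -1 if the file cannot be opened.
 */
long read_unique_ids(const char *path, long *ids, long max)
{
	FILE *f = fopen(path, "r");
	long n = 0, i, unique = 0;

	if (!f)
		return -1;
	while (n < max && fscanf(f, "%ld", &ids[n]) == 1)
		n++;
	fclose(f);

	qsort(ids, (size_t)n, sizeof(ids[0]), cmp_long);

	/* After sorting, duplicates are adjacent; keep the first of each run. */
	for (i = 0; i < n; i++)
		if (i == 0 || ids[i] != ids[i - 1])
			ids[unique++] = ids[i];
	return unique;
}
```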
  
  Other subsystems such as cpusets may add additional files in each
  cgroup dir.
  
  New cgroups are created using the mkdir system call or shell
  command.  The properties of a cgroup, such as its flags, are
  modified by writing to the appropriate file in that cgroup's
  directory, as listed above.
  
  The named hierarchical structure of nested cgroups allows partitioning
  a large system into nested, dynamically changeable, "soft-partitions".
  
  The attachment of each task, automatically inherited at fork by any
  children of that task, to a cgroup allows organizing the work load
  on a system into related sets of tasks.  A task may be re-attached to
  any other cgroup, if allowed by the permissions on the necessary
  cgroup file system directories.
  
  When a task is moved from one cgroup to another, it gets a new
  css_set pointer - if there's an already existing css_set with the
  desired collection of cgroups then that group is reused, else a new
  css_set is allocated. The appropriate existing css_set is located by
  looking into a hash table.

  To allow access from a cgroup to the css_sets (and hence tasks)
  that comprise it, a set of cg_cgroup_link objects form a lattice;
  each cg_cgroup_link is linked into a list of cg_cgroup_links for
  a single cgroup on its cgrp_link_list field, and a list of
  cg_cgroup_links for a single css_set on its cg_link_list.
  
  Thus the set of tasks in a cgroup can be listed by iterating over
  each css_set that references the cgroup, and sub-iterating over
  each css_set's task set.

  The use of a Linux virtual file system (vfs) to represent the
  cgroup hierarchy provides for a familiar permission and name space
  for cgroups, with a minimum of additional kernel code.
  
  1.4 What does notify_on_release do ?
  ------------------------------------

  If the notify_on_release flag is enabled (1) in a cgroup, then
  whenever the last task in the cgroup leaves (exits or attaches to
  some other cgroup) and the last child cgroup of that cgroup
  is removed, then the kernel runs the command specified by the contents
  of the "release_agent" file in that hierarchy's root directory,
  supplying the pathname (relative to the mount point of the cgroup
  file system) of the abandoned cgroup.  This enables automatic
  removal of abandoned cgroups.  The default value of
  notify_on_release in the root cgroup at system boot is disabled
  (0).  The default value of other cgroups at creation is the current
  value of their parent's notify_on_release setting. The default value of
  a cgroup hierarchy's release_agent path is empty.

  1.5 What does clone_children do ?
  ---------------------------------
  
  If the clone_children flag is enabled (1) in a cgroup, then all
  cgroups created beneath will call the post_clone callbacks for each
  subsystem of the newly created cgroup. Usually when this callback is
  implemented for a subsystem, it copies the values of the parent
  subsystem; this is the case for cpuset.
  
  1.6 How do I use cgroups ?
  --------------------------
  
  To start a new job that is to be contained within a cgroup, using
  the "cpuset" cgroup subsystem, the steps are something like:
   1) mount -t tmpfs cgroup_root /sys/fs/cgroup
   2) mkdir /sys/fs/cgroup/cpuset
   3) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
   4) Create the new cgroup by doing mkdir's and write's (or echo's) in
      the /sys/fs/cgroup virtual file system.
   5) Start a task that will be the "founding father" of the new job.
   6) Attach that task to the new cgroup by writing its pid to the
      /sys/fs/cgroup/cpuset/tasks file for that cgroup.
   7) fork, exec or clone the job tasks from this founding father task.
  
  For example, the following sequence of commands will set up a cgroup
  named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
  and then start a subshell 'sh' in that cgroup:
    mount -t tmpfs cgroup_root /sys/fs/cgroup
    mkdir /sys/fs/cgroup/cpuset
    mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
    cd /sys/fs/cgroup/cpuset
    mkdir Charlie
    cd Charlie
    /bin/echo 2-3 > cpuset.cpus
    /bin/echo 1 > cpuset.mems
    /bin/echo $$ > tasks
    sh
    # The subshell 'sh' is now running in cgroup Charlie
    # The next line should display '/Charlie'
    cat /proc/self/cgroup
  
  2. Usage Examples and Syntax
  ============================
  
  2.1 Basic Usage
  ---------------
  
  Creating, modifying, and using cgroups can be done through the cgroup
  virtual filesystem.

  To mount a cgroup hierarchy with all available subsystems, type:
  # mount -t cgroup xxx /sys/fs/cgroup
  
  The "xxx" is not interpreted by the cgroup code, but will appear in
  /proc/mounts so may be any useful identifying string that you like.

  Note: Some subsystems do not work without some user input first.  For instance,
  if cpusets are enabled the user will have to populate the cpus and mems files
  for each new cgroup created before that group can be used.

  As explained in section `1.2 Why are cgroups needed?' you should create
  different hierarchies of cgroups for each single resource or group of
  resources you want to control. Therefore, you should mount a tmpfs on
  /sys/fs/cgroup and create directories for each cgroup resource or resource
  group.
  
  # mount -t tmpfs cgroup_root /sys/fs/cgroup
  # mkdir /sys/fs/cgroup/rg1

  To mount a cgroup hierarchy with just the cpuset and memory
  subsystems, type:
  # mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1
  
  To change the set of subsystems bound to a mounted hierarchy, just
  remount with different options:
  # mount -o remount,cpuset,blkio hier1 /sys/fs/cgroup/rg1

  Now memory is removed from the hierarchy and blkio is added.

  Note this will add blkio to the hierarchy but won't remove memory or
  cpuset, because the new options are appended to the old ones:
  # mount -o remount,blkio /sys/fs/cgroup/rg1
  
  To specify a hierarchy's release_agent:
  # mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
    xxx /sys/fs/cgroup/rg1
  
  Note that specifying 'release_agent' more than once will return failure.
  
  Note that changing the set of subsystems is currently only supported
  when the hierarchy consists of a single (root) cgroup. Supporting
  the ability to arbitrarily bind/unbind subsystems from an existing
  cgroup hierarchy is intended to be implemented in the future.

  Then under /sys/fs/cgroup/rg1 you can find a tree that corresponds to the
  tree of the cgroups in the system. For instance, /sys/fs/cgroup/rg1
  is the cgroup that holds the whole system.

  If you want to change the value of release_agent:
  # echo "/sbin/new_release_agent" > /sys/fs/cgroup/rg1/release_agent
  
  It can also be changed via remount.

  If you want to create a new cgroup under /sys/fs/cgroup/rg1:
  # cd /sys/fs/cgroup/rg1
  # mkdir my_cgroup
  
  Now you want to do something with this cgroup.
  # cd my_cgroup
  
  In this directory you can find several files:
  # ls
  cgroup.procs notify_on_release tasks
  (plus whatever files are added by the attached subsystems)
  
  Now attach your shell to this cgroup:
  # /bin/echo $$ > tasks
  
  You can also create cgroups inside your cgroup by using mkdir in this
  directory.
  # mkdir my_sub_cs
  
  To remove a cgroup, just use rmdir:
  # rmdir my_sub_cs
  
  This will fail if the cgroup is in use (has cgroups inside, or
  has processes attached, or is held alive by another subsystem-specific
  reference).
  
  2.2 Attaching processes
  -----------------------
  
  # /bin/echo PID > tasks
  
  Note that it is PID, not PIDs. You can only attach ONE task at a time.
  If you have several tasks to attach, you have to do it one after another:
  
  # /bin/echo PID1 > tasks
  # /bin/echo PID2 > tasks
  	...
  # /bin/echo PIDn > tasks
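
  The same one-pid-per-write rule applies when attaching tasks from a
  program rather than a shell. The sketch below is one way to do it in
  C; attach_task() and attach_tasks() are names invented for this
  example, and the cgroup directory is passed in by the caller.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Attach one task by writing its pid to <cgroup_dir>/tasks.
 * Every pid gets its own open/write/close, one id per write.
 * Returns 0 on success, -1 on error.
 */
int attach_task(const char *cgroup_dir, pid_t pid)
{
	char path[4096], buf[32];
	int fd, len, ret;

	len = snprintf(path, sizeof(path), "%s/tasks", cgroup_dir);
	if (len < 0 || len >= (int)sizeof(path))
		return -1;
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	len = snprintf(buf, sizeof(buf), "%ld\n", (long)pid);
	ret = write(fd, buf, (size_t)len) == len ? 0 : -1;
	close(fd);
	return ret;
}

/* Attach several tasks in turn; returns how many were attached. */
int attach_tasks(const char *cgroup_dir, const pid_t *pids, int count)
{
	int i, attached = 0;

	for (i = 0; i < count; i++)
		if (attach_task(cgroup_dir, pids[i]) == 0)
			attached++;
	return attached;
}
```

  A caller can compare the return value of attach_tasks() with count to
  detect tasks that could not be moved.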
  You can attach the current shell task by echoing 0:
  
  # echo 0 > tasks
  You can use the cgroup.procs file instead of the tasks file to move all
  threads in a threadgroup at once. Echoing the pid of any task in a
  threadgroup to cgroup.procs causes all tasks in that threadgroup to
  be attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
  in the writing task's threadgroup.
  Note: Since every task is always a member of exactly one cgroup in each
  mounted hierarchy, to remove a task from its current cgroup you must
  move it into a new cgroup (possibly the root cgroup) by writing to the
  new cgroup's tasks file.
  Note: Due to some restrictions enforced by some cgroup subsystems, moving
  a process to another cgroup can fail.

  2.3 Mounting hierarchies by name
  --------------------------------
  
  Passing the name=<x> option when mounting a cgroups hierarchy
  associates the given name with the hierarchy.  This can be used when
  mounting a pre-existing hierarchy, in order to refer to it by name
  rather than by its set of active subsystems.  Each hierarchy is either
  nameless, or has a unique name.
  
  The name should match [\w.-]+
  
  When passing a name=<x> option for a new hierarchy, you need to
  specify subsystems manually; the legacy behaviour of mounting all
  subsystems when none are explicitly specified is not supported when
  you give a hierarchy a name.

  The name of the hierarchy appears as part of the hierarchy description
  in /proc/mounts and /proc/<pid>/cgroup.

  2.4 Notification API
  --------------------
  
  There is a mechanism which allows userspace to get notifications about
  the changing status of a cgroup.

  To register a new notification handler you need to:
   - create a file descriptor for event notification using eventfd(2);
   - open a control file to be monitored (e.g. memory.usage_in_bytes);
   - write "<event_fd> <control_fd> <args>" to cgroup.event_control.
     Interpretation of args is defined by the control file implementation.

  The eventfd will be woken up by the control file implementation or when
  the cgroup is removed.

  To unregister a notification handler, just close the eventfd.

  NOTE: Notifications are only available for control files that implement
  support for them. See the documentation for the subsystem.
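
  The registration steps can be sketched in C roughly as follows. The
  function name and its parameters are made up for this example, and
  args is passed through verbatim because its meaning depends on the
  control file. The sketch keeps the control file descriptor open and
  hands it back to the caller, as a conservative choice.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Register an eventfd for notifications on one control file of a cgroup.
 * Returns the eventfd on success (and stores the control file descriptor
 * in *control_fd_out), or -1 on error. The caller read()s the eventfd to
 * wait for events and close()s it to unregister.
 */
int register_cgroup_event(const char *cgroup_dir, const char *control_file,
			  const char *args, int *control_fd_out)
{
	char path[4096], line[128];
	int efd, cfd = -1, ecfd = -1, len;

	/* Step 1: create the file descriptor for event notification. */
	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;

	/* Step 2: open the control file to be monitored. */
	snprintf(path, sizeof(path), "%s/%s", cgroup_dir, control_file);
	cfd = open(path, O_RDONLY);
	if (cfd < 0)
		goto fail;

	/* Step 3: write "<event_fd> <control_fd> <args>" to cgroup.event_control. */
	snprintf(path, sizeof(path), "%s/cgroup.event_control", cgroup_dir);
	ecfd = open(path, O_WRONLY);
	if (ecfd < 0)
		goto fail;
	len = snprintf(line, sizeof(line), "%d %d %s", efd, cfd, args);
	if (len < 0 || len >= (int)sizeof(line) ||
	    write(ecfd, line, (size_t)len) != len)
		goto fail;

	close(ecfd);
	*control_fd_out = cfd;
	return efd;
fail:
	if (ecfd >= 0)
		close(ecfd);
	if (cfd >= 0)
		close(cfd);
	close(efd);
	return -1;
}
```

  After a successful call, a blocking read() of eight bytes (a uint64_t)
  from the returned eventfd waits for the next notification.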

  3. Kernel API
  =============
  
  3.1 Overview
  ------------
  
  Each kernel subsystem that wants to hook into the generic cgroup
  system needs to create a cgroup_subsys object. This contains
  various methods, which are callbacks from the cgroup system, along
  with a subsystem id which will be assigned by the cgroup system.
  
  Other fields in the cgroup_subsys object include:
  
  - subsys_id: a unique array index for the subsystem, indicating which
    entry in cgroup->subsys[] this subsystem should be managing.

  - name: should be initialized to a unique subsystem name. Should be
    no longer than MAX_CGROUP_TYPE_NAMELEN.

  - early_init: indicate if the subsystem needs early initialization
    at system boot.
  
  Each cgroup object created by the system has an array of pointers,
  indexed by subsystem id; this pointer is entirely managed by the
  subsystem; the generic cgroup code will never touch this pointer.
  
  3.2 Synchronization
  -------------------
  
  There is a global mutex, cgroup_mutex, used by the cgroup
  system. This should be taken by anything that wants to modify a
  cgroup. It may also be taken to prevent cgroups from being
  modified, but more specific locks may be more appropriate in that
  situation.
  
  See kernel/cgroup.c for more details.
  
  Subsystems can take/release the cgroup_mutex via the functions
  cgroup_lock()/cgroup_unlock().
  
  Accessing a task's cgroup pointer may be done in the following ways:
  - while holding cgroup_mutex
  - while holding the task's alloc_lock (via task_lock())
  - inside an rcu_read_lock() section via rcu_dereference()
  
  3.3 Subsystem API
  -----------------
  
  Each subsystem should:
  
  - add an entry in linux/cgroup_subsys.h
  - define a cgroup_subsys object called <name>_subsys

  If a subsystem can be compiled as a module, it should also have in its
  module initcall a call to cgroup_load_subsys(), and in its exitcall a
  call to cgroup_unload_subsys(). It should also set
  <name>_subsys.module = THIS_MODULE in its .c file.

  Each subsystem may export the following methods. The only mandatory
  methods are create/destroy. Any others that are null are presumed to
  be successful no-ops.

  struct cgroup_subsys_state *create(struct cgroup_subsys *ss,
  				   struct cgroup *cgrp)
  (cgroup_mutex held by caller)
  
  Called to create a subsystem state object for a cgroup. The
  subsystem should allocate its subsystem state object for the passed
  cgroup, returning a pointer to the new object on success or a
  negative error code. On success, the subsystem pointer should point to
  a structure of type cgroup_subsys_state (typically embedded in a
  larger subsystem-specific object), which will be initialized by the
  cgroup system. Note that this will be called at initialization to
  create the root subsystem state for this subsystem; this case can be
  identified by the passed cgroup object having a NULL parent (since
  it's the root of the hierarchy) and may be an appropriate place for
  initialization code.

  void destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
  (cgroup_mutex held by caller)

  The cgroup system is about to destroy the passed cgroup; the subsystem
  should do any necessary cleanup and free its subsystem state
  object. By the time this method is called, the cgroup has already been
  unlinked from the file system and from the child list of its parent;
  cgroup->parent is still valid. (Note - can also be called for a
  newly-created cgroup if an error occurs after this subsystem's
  create() method has been called for the new cgroup).
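
  To make the create()/destroy() contract more concrete, here is a
  userspace model of the pattern, not kernel code. The struct
  definitions are cut-down stand-ins for the kernel types, ERR_PTR() and
  IS_ERR() only mimic the kernel helpers of the same name, and the
  "demo" subsystem, its limit field and the inherit-from-parent policy
  are all hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for kernel helpers and types; only what the sketch needs. */
#define DEMO_SUBSYS_ID 0
#define ERR_PTR(err) ((void *)(long)(err))
#define IS_ERR(ptr)  ((unsigned long)(ptr) >= (unsigned long)-4095)

struct cgroup_subsys_state;
struct cgroup {
	struct cgroup *parent;			/* NULL for the root cgroup */
	struct cgroup_subsys_state *subsys[1];	/* indexed by subsystem id */
};
struct cgroup_subsys_state { struct cgroup *cgroup; };
struct cgroup_subsys { const char *name; int subsys_id; };

/* Hypothetical per-cgroup state with the generic state embedded in it. */
struct demo_state {
	struct cgroup_subsys_state css;
	long limit;
};

/* Recover the enclosing demo_state from its embedded css. */
static struct demo_state *css_to_demo(struct cgroup_subsys_state *css)
{
	return (struct demo_state *)((char *)css -
				     offsetof(struct demo_state, css));
}

struct cgroup_subsys_state *demo_create(struct cgroup_subsys *ss,
					struct cgroup *cgrp)
{
	struct demo_state *st = calloc(1, sizeof(*st));

	(void)ss;
	if (!st)
		return ERR_PTR(-ENOMEM);
	if (!cgrp->parent)
		st->limit = -1;	/* root: one-time setup; -1 = unlimited here */
	else
		st->limit = css_to_demo(
			cgrp->parent->subsys[DEMO_SUBSYS_ID])->limit;
	return &st->css;
}

void demo_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
{
	(void)ss;
	free(css_to_demo(cgrp->subsys[DEMO_SUBSYS_ID]));
	cgrp->subsys[DEMO_SUBSYS_ID] = NULL;
}
```

  In this model the caller plays the role of the generic cgroup code: it
  stores the pointer returned by demo_create() in
  cgrp->subsys[DEMO_SUBSYS_ID] before any child is created.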

  int pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);
  
  Called before checking the reference count on each subsystem. This may
  be useful for subsystems which hold extra references even when there
  are no tasks in the cgroup. If pre_destroy() returns an error code,
  rmdir() fails with that code. Because a failed rmdir() can be retried,
  pre_destroy() may be called multiple times for the same cgroup.
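
  The retry behaviour can be pictured with a small userspace sketch.
  The types are stand-ins for the kernel's, and the extra_refs field,
  demo_reclaim_one() and demo_pre_destroy() are hypothetical; the
  sketch assumes a subsystem that can release only one lingering
  reference per call.

```c
#include <assert.h>
#include <errno.h>

/* Userspace stand-ins for the kernel types; only what the sketch needs. */
struct cgroup_subsys { int unused; };
struct cgroup { int extra_refs; };  /* hypothetical: refs not tied to tasks */

/* Hypothetical helper: try to drop one lingering reference. */
static void demo_reclaim_one(struct cgroup *cgrp)
{
    if (cgrp->extra_refs > 0)
        cgrp->extra_refs--;
}

int demo_pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
{
    (void)ss;
    assert(cgrp->extra_refs >= 0);
    demo_reclaim_one(cgrp);
    /* Still pinned: rmdir() fails with -EBUSY and a retry calls us again. */
    return cgrp->extra_refs ? -EBUSY : 0;
}
```

  With two outstanding references, the first call reports -EBUSY and
  the second succeeds, which is why the method has to tolerate being
  invoked repeatedly.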
  
  int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
  	       struct cgroup_taskset *tset)
  (cgroup_mutex held by caller)

  Called prior to moving one or more tasks into a cgroup; if the
  subsystem returns an error, this will abort the attach operation.
  @tset contains the tasks to be attached and is guaranteed to have at
  least one task in it.
  
  If there are multiple tasks in the taskset, then:
    - it's guaranteed that all are from the same thread group
    - @tset contains all tasks from the thread group whether or not
      they're switching cgroups
    - the first task is the leader
  
  Each @tset entry also contains the task's old cgroup, and tasks which
  aren't switching cgroup can be skipped easily using the
  cgroup_taskset_for_each() iterator. Note that this isn't called on a
  fork. If this method returns 0 (success) then the result should remain
  valid while the caller holds cgroup_mutex, and it is ensured that
  either attach() or cancel_attach() will be called in future.

  void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
  		   struct cgroup_taskset *tset)
  (cgroup_mutex held by caller)
  
  Called when a task attach operation has failed after can_attach() has
  succeeded. A subsystem whose can_attach() has side effects should
  provide this function so that it can roll them back; otherwise it is
  not necessary. This will be called only for subsystems whose
  can_attach() operation has succeeded. The parameters are identical to
  can_attach().

  void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
  	    struct cgroup_taskset *tset)
  (cgroup_mutex held by caller)
  
  Called after the task has been attached to the cgroup, to allow any
  post-attachment activity that requires memory allocations or blocking.
  The parameters are identical to can_attach().
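
  Taken together, can_attach(), cancel_attach() and attach() behave
  like a reserve / roll back / commit sequence. The userspace sketch
  below shows one way a subsystem might use them. It is only an
  illustration: the capacity accounting, the demo_* names and the
  array-backed struct cgroup_taskset are assumptions made for the
  example. In the kernel the taskset is opaque and would be walked
  with cgroup_taskset_for_each().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace stand-ins; the accounting fields are hypothetical. */
struct cgroup_subsys { int unused; };
struct cgroup { int capacity; int reserved; int attached; };
struct demo_task { struct cgroup *old_cgrp; };
struct cgroup_taskset { struct demo_task *tasks; size_t count; };

/* Count only the tasks whose old cgroup differs from the destination. */
static int demo_count_moving(const struct cgroup *dst,
                             const struct cgroup_taskset *tset)
{
    int n = 0;

    for (size_t i = 0; i < tset->count; i++)
        if (tset->tasks[i].old_cgrp != dst)
            n++;
    return n;
}

int demo_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
                    struct cgroup_taskset *tset)
{
    int n = demo_count_moving(cgrp, tset);

    (void)ss;
    assert(tset->count >= 1);         /* guaranteed by the caller */
    if (cgrp->attached + cgrp->reserved + n > cgrp->capacity)
        return -ENOSPC;               /* aborts the whole attach operation */
    cgrp->reserved += n;              /* side effect that may need undoing */
    return 0;
}

void demo_cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
                        struct cgroup_taskset *tset)
{
    (void)ss;
    cgrp->reserved -= demo_count_moving(cgrp, tset);  /* roll back */
}

void demo_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
                 struct cgroup_taskset *tset)
{
    int n = demo_count_moving(cgrp, tset);

    (void)ss;
    cgrp->reserved -= n;              /* commit the reservation */
    cgrp->attached += n;
}
```

  A subsystem whose can_attach() only inspects state, without
  reserving anything, would have nothing to undo and could leave
  cancel_attach() NULL.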

  void fork(struct cgroup_subsys *ss, struct task_struct *task)

  Called when a task is forked into a cgroup.
  
  void exit(struct cgroup_subsys *ss, struct task_struct *task)

  Called during task exit.

  int populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
  (cgroup_mutex held by caller)
  
  Called after creation of a cgroup to allow a subsystem to populate
  the cgroup directory with file entries.  The subsystem should make
  calls to cgroup_add_file() with objects of type cftype (see
  include/linux/cgroup.h for details).  Note that although this
  method can return an error code, the error code is currently not
  always handled well.

  void post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp)
  (cgroup_mutex held by caller)

  Called during cgroup_create() to do any parameter initialization which
  might be required before a task could attach.  For example, in cpusets,
  no task may attach before 'cpus' and 'mems' are set up.

  void bind(struct cgroup_subsys *ss, struct cgroup *root)
  (cgroup_mutex and ss->hierarchy_mutex held by caller)
  
  Called when a cgroup subsystem is rebound to a different hierarchy
  and root cgroup. Currently this will only involve movement between
  the default hierarchy (which never has sub-cgroups) and a hierarchy
  that is being created/destroyed (and hence has no sub-cgroups).
  
  4. Questions
  ============
  
  Q: what's up with this '/bin/echo' ?
  A: bash's builtin 'echo' command does not check calls to write() for
     errors. If you use it in the cgroup file system, you won't be
     able to tell whether a command succeeded or failed.
  
  Q: When I attach processes, only the first one on the line really gets
     attached !
  A: We can only return one error code per call to write(). So you should
     write only ONE pid per call.
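
  Both answers amount to the same advice: issue one write() per pid
  and check its result. A minimal userspace sketch of that pattern
  follows. demo_attach_pid() and demo_attach_pids() are hypothetical
  helper names, and the path passed in is assumed to be a cgroup's
  'tasks' file.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write a single pid to a cgroup 'tasks' file with one write() call.
 * Returns 0 on success or a negative errno value on failure.
 */
int demo_attach_pid(const char *tasks_path, pid_t pid)
{
    char buf[32];
    int len, fd, err = 0;
    ssize_t n;

    len = snprintf(buf, sizeof(buf), "%ld\n", (long)pid);
    assert(len > 0 && (size_t)len < sizeof(buf));

    fd = open(tasks_path, O_WRONLY);
    if (fd < 0)
        return -errno;
    n = write(fd, buf, (size_t)len);
    if (n < 0)
        err = -errno;                 /* the kernel rejected this pid */
    else if (n != len)
        err = -EIO;                   /* short write: treat as a failure */
    if (close(fd) < 0 && !err)
        err = -errno;
    return err;
}

/* Attach each pid separately; returns the number of pids that failed. */
int demo_attach_pids(const char *tasks_path, const pid_t *pids, size_t count)
{
    int failures = 0;

    for (size_t i = 0; i < count; i++) {
        int err = demo_attach_pid(tasks_path, pids[i]);

        if (err) {
            fprintf(stderr, "pid %ld: %s\n", (long)pids[i], strerror(-err));
            failures++;
        }
    }
    return failures;
}
```

  Because every pid gets its own write(), a failure for one pid is
  reported individually and does not hide the outcome of the others.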