  				CGROUPS
  				-------
  
  Written by Paul Menage <menage@google.com> based on
  Documentation/cgroup-v1/cpusets.txt
  
  Original copyright statements from cpusets.txt:
  Portions Copyright (C) 2004 BULL SA.
  Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
  Modified by Paul Jackson <pj@sgi.com>
  Modified by Christoph Lameter <cl@linux.com>
  
  CONTENTS:
  =========
  
  1. Control Groups
    1.1 What are cgroups ?
    1.2 Why are cgroups needed ?
    1.3 How are cgroups implemented ?
    1.4 What does notify_on_release do ?
    1.5 What does clone_children do ?
    1.6 How do I use cgroups ?
  2. Usage Examples and Syntax
    2.1 Basic Usage
    2.2 Attaching processes
    2.3 Mounting hierarchies by name
  3. Kernel API
    3.1 Overview
    3.2 Synchronization
    3.3 Subsystem API
  4. Extended attribute usage
  5. Questions
  
  1. Control Groups
  =================
  
  1.1 What are cgroups ?
  ----------------------
  
  Control Groups provide a mechanism for aggregating/partitioning sets of
  tasks, and all their future children, into hierarchical groups with
  specialized behaviour.
  
  Definitions:
  
  A *cgroup* associates a set of tasks with a set of parameters for one
  or more subsystems.
  
  A *subsystem* is a module that makes use of the task grouping
  facilities provided by cgroups to treat groups of tasks in
  particular ways. A subsystem is typically a "resource controller" that
  schedules a resource or applies per-cgroup limits, but it may be
  anything that wants to act on a group of processes, e.g. a
  virtualization subsystem.
  
  A *hierarchy* is a set of cgroups arranged in a tree, such that
  every task in the system is in exactly one of the cgroups in the
  hierarchy, and a set of subsystems; each subsystem has system-specific
  state attached to each cgroup in the hierarchy.  Each hierarchy has
  an instance of the cgroup virtual filesystem associated with it.
  
  At any one time there may be multiple active hierarchies of task
  cgroups. Each hierarchy is a partition of all tasks in the system.
  
  User-level code may create and destroy cgroups by name in an
  instance of the cgroup virtual file system, specify and query to
  which cgroup a task is assigned, and list the task PIDs assigned to
  a cgroup. Those creations and assignments only affect the hierarchy
  associated with that instance of the cgroup file system.
  
  On their own, the only use for cgroups is for simple job
  tracking. The intention is that other subsystems hook into the generic
  cgroup support to provide new attributes for cgroups, such as
  accounting/limiting the resources which processes in a cgroup can
  access. For example, cpusets (see Documentation/cgroup-v1/cpusets.txt) allow
  you to associate a set of CPUs and a set of memory nodes with the
  tasks in each cgroup.
  
  1.2 Why are cgroups needed ?
  ----------------------------
  
  There are multiple efforts to provide process aggregations in the
  Linux kernel, mainly for resource-tracking purposes. Such efforts
  include cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server
  namespaces. These all require the basic notion of a
  grouping/partitioning of processes, with newly forked processes ending
  up in the same group (cgroup) as their parent process.
  
  The kernel cgroup patch provides the minimum essential kernel
  mechanisms required to efficiently implement such groups. It has
  minimal impact on the system fast paths, and provides hooks for
  specific subsystems such as cpusets to provide additional behaviour as
  desired.
  
  Multiple hierarchy support is provided to allow for situations where
  the division of tasks into cgroups is distinctly different for
  different subsystems - having parallel hierarchies allows each
  hierarchy to be a natural division of tasks, without having to handle
  complex combinations of tasks that would be present if several
  unrelated subsystems needed to be forced into the same tree of
  cgroups.
  
  At one extreme, each resource controller or subsystem could be in a
  separate hierarchy; at the other extreme, all subsystems
  would be attached to the same hierarchy.
  
  As an example of a scenario (originally proposed by vatsa@in.ibm.com)
  that can benefit from multiple hierarchies, consider a large
  university server with various users - students, professors, system
  tasks etc. The resource planning for this server could be along the
  following lines:
  
         CPU :          "Top cpuset"
                         /       \
                 CPUSet1         CPUSet2
                    |               |
                 (Professors)    (Students)
  
                 In addition (system tasks) are attached to topcpuset (so
                 that they can run anywhere) with a limit of 20%
  
         Memory : Professors (50%), Students (30%), system (20%)
  
         Disk : Professors (50%), Students (30%), system (20%)
  
         Network : WWW browsing (20%), Network File System (60%), others (20%)
                                 / \
                 Professors (15%)  students (5%)
  
  Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd goes
  into the NFS network class.
  
  At the same time Firefox/Lynx will share an appropriate CPU/Memory class
  depending on who launched it (prof/student).
  
  With the ability to classify tasks differently for different resources
  (by putting those resource subsystems in different hierarchies),
  the admin can easily set up a script which receives exec notifications
  and depending on who is launching the browser he can
  
      # echo browser_pid > /sys/fs/cgroup/<restype>/<userclass>/tasks
  
  With only a single hierarchy, he would potentially have to create
  a separate cgroup for every browser launched and associate it with
  the appropriate network and other resource classes.  This may lead to
  a proliferation of such cgroups.
  
  Also let's say that the administrator would like to give enhanced network
  access temporarily to a student's browser (since it is night and the user
  wants to do online gaming :))  OR give one of the student's simulation
  apps enhanced CPU power.
  
  With the ability to write PIDs directly to resource classes, it's just a
  matter of:
  
         # echo pid > /sys/fs/cgroup/network/<new_class>/tasks
         (after some time)
         # echo pid > /sys/fs/cgroup/network/<orig_class>/tasks
  
  Without this ability, the administrator would have to split the cgroup into
  multiple separate ones and then associate the new cgroups with the
  new resource classes.
  
  
  
  1.3 How are cgroups implemented ?
  ---------------------------------
  
  Control Groups extend the kernel as follows:
  
   - Each task in the system has a reference-counted pointer to a
     css_set.
  
   - A css_set contains a set of reference-counted pointers to
     cgroup_subsys_state objects, one for each cgroup subsystem
     registered in the system. There is no direct link from a task to
     the cgroup of which it's a member in each hierarchy, but this
     can be determined by following pointers through the
     cgroup_subsys_state objects. This is because accessing the
     subsystem state is something that's expected to happen frequently
     and in performance-critical code, whereas operations that require a
     task's actual cgroup assignments (in particular, moving between
     cgroups) are less common. A linked list runs through the cg_list
     field of each task_struct using the css_set, anchored at
     css_set->tasks.
  
   - A cgroup hierarchy filesystem can be mounted for browsing and
     manipulation from user space.
  
   - You can list all the tasks (by PID) attached to any cgroup.
  
  The implementation of cgroups requires a few, simple hooks
  into the rest of the kernel, none in performance-critical paths:
  
   - in init/main.c, to initialize the root cgroups and initial
     css_set at system boot.
  
   - in fork and exit, to attach and detach a task from its css_set.
  
  In addition, a new file system of type "cgroup" may be mounted, to
  enable browsing and modifying the cgroups presently known to the
  kernel.  When mounting a cgroup hierarchy, you may specify a
  comma-separated list of subsystems to mount as the filesystem mount
  options.  By default, mounting the cgroup filesystem attempts to
  mount a hierarchy containing all registered subsystems.
  
  If an active hierarchy with exactly the same set of subsystems already
  exists, it will be reused for the new mount. If no existing hierarchy
  matches, and any of the requested subsystems are in use in an existing
  hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy
  is activated, associated with the requested subsystems.
  
  It's not currently possible to bind a new subsystem to an active
  cgroup hierarchy, or to unbind a subsystem from an active cgroup
  hierarchy. This may be possible in future, but is fraught with nasty
  error-recovery issues.
  
  When a cgroup filesystem is unmounted, if there are any
  child cgroups created below the top-level cgroup, that hierarchy
  will remain active even though unmounted; if there are no
  child cgroups then the hierarchy will be deactivated.
  
  No new system calls are added for cgroups - all support for
  querying and modifying cgroups is via this cgroup file system.
  
  Each task under /proc has an added file named 'cgroup' displaying,
  for each active hierarchy, the subsystem names and the cgroup name
  as the path relative to the root of the cgroup file system.
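
  As a rough illustration of the per-task 'cgroup' file described
  above, the sketch below splits each line into its fields. It assumes
  the usual colon-separated layout (hierarchy ID, subsystem list,
  cgroup path); the function name parse_proc_cgroup and the sample
  lines are made up for this example.

```shell
# Print "subsystems<TAB>path" for every hierarchy listed on stdin.
# Assumed input format, one line per hierarchy: hierarchy-ID:subsystems:path
parse_proc_cgroup() {
    while IFS=: read -r id subsystems path; do
        # Skip blank lines.
        [ -n "$id" ] || continue
        printf '%s\t%s\n' "$subsystems" "$path"
    done
}

# Hypothetical sample input; on a live system one would instead run:
#   parse_proc_cgroup < /proc/self/cgroup
sample='4:cpuset:/Charlie
2:cpu,cpuacct:/'
printf '%s\n' "$sample" | parse_proc_cgroup
```

  Because the path is read into the last variable, a cgroup path that
  itself contains a colon is kept intact.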
  
  Each cgroup is represented by a directory in the cgroup file system
  containing the following files describing that cgroup:
  
   - tasks: list of tasks (by PID) attached to that cgroup.  This list
     is not guaranteed to be sorted.  Writing a thread ID into this file
     moves the thread into this cgroup.
   - cgroup.procs: list of thread group IDs in the cgroup.  This list is
     not guaranteed to be sorted or free of duplicate TGIDs, and userspace
     should sort/uniquify the list if this property is required.
     Writing a thread group ID into this file moves all threads in that
     group into this cgroup.
   - notify_on_release flag: run the release agent on exit?
   - release_agent: the path to use for release notifications (this file
     exists in the top cgroup only)
  
  Other subsystems such as cpusets may add additional files in each
  cgroup dir.
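
  Since cgroup.procs may be unsorted and may repeat a TGID, userspace
  that needs a clean list can post-process it. A minimal sketch, where
  unique_tgids is a hypothetical helper name and the path in the
  comment is only an example:

```shell
# Print the distinct thread group IDs from a cgroup.procs-style file,
# one per line, in ascending numeric order.
unique_tgids() {
    sort -n -u "$1"
}

# Example call on a mounted hierarchy (path is an assumption):
#   unique_tgids /sys/fs/cgroup/cpuset/Charlie/cgroup.procs
```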
  
  New cgroups are created using the mkdir system call or shell
  command.  The properties of a cgroup, such as its flags, are
  modified by writing to the appropriate file in that cgroup's
  directory, as listed above.
  
  The named hierarchical structure of nested cgroups allows partitioning
  a large system into nested, dynamically changeable, "soft-partitions".
  
  The attachment of each task, automatically inherited at fork by any
  children of that task, to a cgroup allows organizing the work load
  on a system into related sets of tasks.  A task may be re-attached to
  any other cgroup, if allowed by the permissions on the necessary
  cgroup file system directories.
  
  When a task is moved from one cgroup to another, it gets a new
  css_set pointer - if there's an already existing css_set with the
  desired collection of cgroups then that group is reused, otherwise a new
  css_set is allocated. The appropriate existing css_set is located by
  looking into a hash table.
  
  To allow access from a cgroup to the css_sets (and hence tasks)
  that comprise it, a set of cg_cgroup_link objects form a lattice;
  each cg_cgroup_link is linked into a list of cg_cgroup_links for
  a single cgroup on its cgrp_link_list field, and a list of
  cg_cgroup_links for a single css_set on its cg_link_list.
  
  Thus the set of tasks in a cgroup can be listed by iterating over
  each css_set that references the cgroup, and sub-iterating over
  each css_set's task set.
  
  The use of a Linux virtual file system (vfs) to represent the
  cgroup hierarchy provides for a familiar permission and name space
  for cgroups, with a minimum of additional kernel code.
  
  1.4 What does notify_on_release do ?
  ------------------------------------
  
  If the notify_on_release flag is enabled (1) in a cgroup, then
  whenever the last task in the cgroup leaves (exits or attaches to
  some other cgroup) and the last child cgroup of that cgroup
  is removed, then the kernel runs the command specified by the contents
  of the "release_agent" file in that hierarchy's root directory,
  supplying the pathname (relative to the mount point of the cgroup
  file system) of the abandoned cgroup.  This enables automatic
  removal of abandoned cgroups.  The default value of
  notify_on_release in the root cgroup at system boot is disabled
  (0).  The default value of other cgroups at creation is the current
  value of their parents' notify_on_release settings. The default value of
  a cgroup hierarchy's release_agent path is empty.
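
  To make the "automatic removal of abandoned cgroups" concrete, here
  is a sketch of what a release agent could do with the pathname it
  receives. The mount point in CGROUP_ROOT and the function name
  remove_abandoned_cgroup are assumptions for illustration; the agent
  is handed only the path relative to the mount point, so it has to
  know the mount point by other means.

```shell
# Assumed mount point of the hierarchy this agent serves.
CGROUP_ROOT=${CGROUP_ROOT:-/sys/fs/cgroup/cpuset}

# remove_abandoned_cgroup RELPATH
# RELPATH is the abandoned cgroup's path relative to the mount point.
remove_abandoned_cgroup() {
    relpath=$1
    # Refuse an empty path or the root cgroup itself.
    case $relpath in
        '' | /) return 1 ;;
    esac
    # rmdir fails if the cgroup has gained tasks or children meanwhile.
    rmdir "$CGROUP_ROOT/${relpath#/}"
}

# Installed as a release agent, the script would end with:
#   remove_abandoned_cgroup "$1"
```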
  
  1.5 What does clone_children do ?
  ---------------------------------
  
  This flag only affects the cpuset controller. If the clone_children
  flag is enabled (1) in a cgroup, a new cpuset cgroup will copy its
  configuration from the parent during initialization.
  
  1.6 How do I use cgroups ?
  --------------------------
  
  To start a new job that is to be contained within a cgroup, using
  the "cpuset" cgroup subsystem, the steps are something like:
  
   1) mount -t tmpfs cgroup_root /sys/fs/cgroup
   2) mkdir /sys/fs/cgroup/cpuset
   3) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
   4) Create the new cgroup by doing mkdir's and write's (or echo's) in
      the /sys/fs/cgroup/cpuset virtual file system.
   5) Start a task that will be the "founding father" of the new job.
   6) Attach that task to the new cgroup by writing its PID to the
      /sys/fs/cgroup/cpuset tasks file for that cgroup.
   7) fork, exec or clone the job tasks from this founding father task.
  
  For example, the following sequence of commands will setup a cgroup
  named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
  and then start a subshell 'sh' in that cgroup:
  
    mount -t tmpfs cgroup_root /sys/fs/cgroup
    mkdir /sys/fs/cgroup/cpuset
    mount -t cgroup cpuset -ocpuset /sys/fs/cgroup/cpuset
    cd /sys/fs/cgroup/cpuset
    mkdir Charlie
    cd Charlie
    /bin/echo 2-3 > cpuset.cpus
    /bin/echo 1 > cpuset.mems
    /bin/echo $$ > tasks
    sh
    # The subshell 'sh' is now running in cgroup Charlie
    # The next line should display '/Charlie'
    cat /proc/self/cgroup
  
  2. Usage Examples and Syntax
  ============================
  
  2.1 Basic Usage
  ---------------
  
  Creating, modifying, and using cgroups can be done through the cgroup
  virtual filesystem.
  
  To mount a cgroup hierarchy with all available subsystems, type:
  # mount -t cgroup xxx /sys/fs/cgroup
  
  The "xxx" is not interpreted by the cgroup code, but will appear in
  /proc/mounts so may be any useful identifying string that you like.
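
  The identifying string can then be used to pick cgroup mounts out of
  /proc/mounts. A small sketch, assuming the standard /proc/mounts
  column order (device, mount point, filesystem type, options, ...);
  list_cgroup_mounts is a hypothetical helper name:

```shell
# Print "name mountpoint options" for every mount of type "cgroup"
# found on stdin, which is expected to be in /proc/mounts format.
list_cgroup_mounts() {
    awk '$3 == "cgroup" { print $1, $2, $4 }'
}

# On a live system:
#   list_cgroup_mounts < /proc/mounts
```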
  
  Note: Some subsystems do not work without some user input first.  For instance,
  if cpusets are enabled the user will have to populate the cpus and mems files
  for each new cgroup created before that group can be used.
  
  As explained in section `1.2 Why are cgroups needed?' you should create
  different hierarchies of cgroups for each single resource or group of
  resources you want to control. Therefore, you should mount a tmpfs on
  /sys/fs/cgroup and create directories for each cgroup resource or resource
  group.
  
  # mount -t tmpfs cgroup_root /sys/fs/cgroup
  # mkdir /sys/fs/cgroup/rg1
  
  To mount a cgroup hierarchy with just the cpuset and memory
  subsystems, type:
  # mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1
  
  While remounting cgroups is currently supported, using it is not
  recommended. Remounting allows changing bound subsystems and
  release_agent. Rebinding is hardly useful as it only works when the
  hierarchy is empty, and release_agent itself should be replaced with
  conventional fsnotify. The support for remounting will be removed in
  the future.
  
  To specify a hierarchy's release_agent:
  # mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
    xxx /sys/fs/cgroup/rg1
  
  Note that specifying 'release_agent' more than once will return failure.
  
  Note that changing the set of subsystems is currently only supported
  when the hierarchy consists of a single (root) cgroup. Supporting
  the ability to arbitrarily bind/unbind subsystems from an existing
  cgroup hierarchy is intended to be implemented in the future.
  
  Then under /sys/fs/cgroup/rg1 you can find a tree that corresponds to the
  tree of the cgroups in the system. For instance, /sys/fs/cgroup/rg1
  is the cgroup that holds the whole system.
  
  If you want to change the value of release_agent:
  # echo "/sbin/new_release_agent" > /sys/fs/cgroup/rg1/release_agent
  
  It can also be changed via remount.
  
  If you want to create a new cgroup under /sys/fs/cgroup/rg1:
  # cd /sys/fs/cgroup/rg1
  # mkdir my_cgroup
  
  Now you want to do something with this cgroup.
  # cd my_cgroup
  
  In this directory you can find several files:
  # ls
  cgroup.procs notify_on_release tasks
  (plus whatever files are added by the attached subsystems)
  
  Now attach your shell to this cgroup:
  # /bin/echo $$ > tasks
  
  You can also create cgroups inside your cgroup by using mkdir in this
  directory.
  # mkdir my_sub_cs
  
  To remove a cgroup, just use rmdir:
  # rmdir my_sub_cs
  
  This will fail if the cgroup is in use (has cgroups inside, or
  has processes attached, or is held alive by another subsystem-specific
  reference).
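
  Because rmdir fails on a cgroup that still has cgroups inside, a
  whole subtree has to be removed from the leaves upward. A minimal
  sketch (remove_cgroup_tree is a hypothetical helper; it does not
  move tasks out first, so any cgroup with processes attached will
  still refuse to go away):

```shell
# remove_cgroup_tree DIR
# Remove DIR and every cgroup directory below it, children first.
remove_cgroup_tree() {
    # -depth visits children before their parents, so each rmdir sees a
    # cgroup whose child cgroups have already been removed.
    find "$1" -depth -type d -exec rmdir {} \;
    # Succeed only if the top-level directory is really gone.
    [ ! -d "$1" ]
}
```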
  
  2.2 Attaching processes
  -----------------------
  
  # /bin/echo PID > tasks
  
  Note that it is PID, not PIDs. You can only attach ONE task at a time.
  If you have several tasks to attach, you have to do it one after another:
  
  # /bin/echo PID1 > tasks
  # /bin/echo PID2 > tasks
  	...
  # /bin/echo PIDn > tasks
  
  You can attach the current shell task by echoing 0:
  
  # echo 0 > tasks
  
  You can use the cgroup.procs file instead of the tasks file to move all
  threads in a threadgroup at once. Echoing the PID of any task in a
  threadgroup to cgroup.procs causes all tasks in that threadgroup to be
  attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
  in the writing task's threadgroup.
  
  Note: Since every task is always a member of exactly one cgroup in each
  mounted hierarchy, to remove a task from its current cgroup you must
  move it into a new cgroup (possibly the root cgroup) by writing to the
  new cgroup's tasks file.
  
  Note: Due to some restrictions enforced by some cgroup subsystems, moving
  a process to another cgroup can fail.
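
  The one-PID-per-write rule and the possibility of failure suggest a
  small loop that checks every write. The following is a sketch only;
  attach_tasks is a hypothetical helper, and the fallback to the
  shell's own echo is an assumption for systems without /bin/echo.

```shell
# Prefer the external echo, as this document does, when it is available.
ECHO=/bin/echo
[ -x "$ECHO" ] || ECHO=echo

# attach_tasks TASKS_FILE PID...
# Writes each PID with its own write; stops at the first failure.
attach_tasks() {
    tasks_file=$1
    shift
    for pid in "$@"; do
        if ! "$ECHO" "$pid" > "$tasks_file"; then
            echo "failed to attach $pid via $tasks_file" >&2
            return 1
        fi
    done
}

# Example call (path is an assumption):
#   attach_tasks /sys/fs/cgroup/rg1/my_cgroup/tasks 1234 1235
```

  Stopping at the first failure keeps it clear which PID was rejected,
  since each write can report only one error.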
  
  2.3 Mounting hierarchies by name
  --------------------------------
  
  Passing the name=<x> option when mounting a cgroups hierarchy
  associates the given name with the hierarchy.  This can be used when
  mounting a pre-existing hierarchy, in order to refer to it by name
  rather than by its set of active subsystems.  Each hierarchy is either
  nameless, or has a unique name.
  
  The name should match [\w.-]+
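
  A name can be checked against this pattern before attempting the
  mount. The sketch below assumes \w means ASCII letters, digits and
  underscore; valid_hierarchy_name is a hypothetical helper name.

```shell
# Return 0 if $1 is non-empty and consists only of characters from
# [A-Za-z0-9_.-], and 1 otherwise.
valid_hierarchy_name() {
    case $1 in
        '' | *[!A-Za-z0-9_.-]*) return 1 ;;
    esac
    return 0
}

# Example:
#   valid_hierarchy_name "$name" || echo "bad hierarchy name: $name" >&2
```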
  
  When passing a name=<x> option for a new hierarchy, you need to
  specify subsystems manually; the legacy behaviour of mounting all
  subsystems when none are explicitly specified is not supported when
  you give a hierarchy a name.
  
  The name of the hierarchy appears as part of the hierarchy description
  in /proc/mounts and /proc/<pid>/cgroup.
  
  
  3. Kernel API
  =============
  
  3.1 Overview
  ------------
  
  Each kernel subsystem that wants to hook into the generic cgroup
  system needs to create a cgroup_subsys object. This contains
  various methods, which are callbacks from the cgroup system, along
  with a subsystem ID which will be assigned by the cgroup system.
  
  Other fields in the cgroup_subsys object include:
  
  - subsys_id: a unique array index for the subsystem, indicating which
    entry in cgroup->subsys[] this subsystem should be managing.
  
  - name: should be initialized to a unique subsystem name. Should be
    no longer than MAX_CGROUP_TYPE_NAMELEN.
  
  - early_init: indicate if the subsystem needs early initialization
    at system boot.
  
  Each cgroup object created by the system has an array of pointers,
  indexed by subsystem ID; this pointer is entirely managed by the
  subsystem; the generic cgroup code will never touch this pointer.
  
  3.2 Synchronization
  -------------------
  
  There is a global mutex, cgroup_mutex, used by the cgroup
  system. This should be taken by anything that wants to modify a
  cgroup. It may also be taken to prevent cgroups from being
  modified, but more specific locks may be more appropriate in that
  situation.
  
  See kernel/cgroup.c for more details.
  
  Subsystems can take/release the cgroup_mutex via the functions
  cgroup_lock()/cgroup_unlock().
  
  Accessing a task's cgroup pointer may be done in the following ways:
  - while holding cgroup_mutex
  - while holding the task's alloc_lock (via task_lock())
  - inside an rcu_read_lock() section via rcu_dereference()
  
  3.3 Subsystem API
  -----------------
  
  Each subsystem should:
  
  - add an entry in linux/cgroup_subsys.h
  - define a cgroup_subsys object called <name>_subsys
  
  If a subsystem can be compiled as a module, it should also have in its
  module initcall a call to cgroup_load_subsys(), and in its exitcall a
  call to cgroup_unload_subsys(). It should also set its_subsys.module =
  THIS_MODULE in its .c file.
  
  Each subsystem may export the following methods. The only mandatory
  methods are css_alloc/free. Any others that are null are presumed to
  be successful no-ops.
  
  struct cgroup_subsys_state *css_alloc(struct cgroup *cgrp)
  (cgroup_mutex held by caller)
  
  Called to allocate a subsystem state object for a cgroup. The
  subsystem should allocate its subsystem state object for the passed
  cgroup, returning a pointer to the new object on success or an
  ERR_PTR() value on failure. On success, the subsystem pointer should
  point to a structure of type cgroup_subsys_state (typically embedded in a
  larger subsystem-specific object), which will be initialized by the
  cgroup system. Note that this will be called at initialization to
  create the root subsystem state for this subsystem; this case can be
  identified by the passed cgroup object having a NULL parent (since
  it's the root of the hierarchy) and may be an appropriate place for
  initialization code.
  
  int css_online(struct cgroup *cgrp)
  (cgroup_mutex held by caller)
  
  Called after @cgrp has successfully completed all allocations and been
  made visible to cgroup_for_each_child/descendant_*() iterators. The
  subsystem may choose to fail creation by returning -errno. This
  callback can be used to implement reliable state sharing and
  propagation along the hierarchy. See the comment on
  cgroup_for_each_descendant_pre() for details.
  
  void css_offline(struct cgroup *cgrp)
  (cgroup_mutex held by caller)
  
  This is the counterpart of css_online() and called iff css_online()
  has succeeded on @cgrp. This signifies the beginning of the end of
  @cgrp. @cgrp is being removed and the subsystem should start dropping
  all references it's holding on @cgrp. When all references are dropped,
  cgroup removal will proceed to the next step - css_free(). After this
  callback, @cgrp should be considered dead to the subsystem.
  
  void css_free(struct cgroup *cgrp)
  (cgroup_mutex held by caller)
  
  The cgroup system is about to free @cgrp; the subsystem should free
  its subsystem state object. By the time this method is called, @cgrp
  is completely unused; @cgrp->parent is still valid. (Note - can also
  be called for a newly-created cgroup if an error occurs after this
  subsystem's css_alloc() method has been called for the new cgroup).
  
  int can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
  (cgroup_mutex held by caller)
  
  Called prior to moving one or more tasks into a cgroup; if the
  subsystem returns an error, this will abort the attach operation.
  @tset contains the tasks to be attached and is guaranteed to have at
  least one task in it.
  
  If there are multiple tasks in the taskset, then:
    - it's guaranteed that all are from the same thread group
    - @tset contains all tasks from the thread group whether or not
      they're switching cgroups
    - the first task is the leader
  
  Each @tset entry also contains the task's old cgroup and tasks which
  aren't switching cgroup can be skipped easily using the
  cgroup_taskset_for_each() iterator. Note that this isn't called on a
  fork. If this method returns 0 (success) then this should remain valid
  while the caller holds cgroup_mutex and it is ensured that either
  attach() or cancel_attach() will be called in future.
  
  void css_reset(struct cgroup_subsys_state *css)
  (cgroup_mutex held by caller)
  
  An optional operation which should restore @css's configuration to the
  initial state.  This is currently only used on the unified hierarchy
  when a subsystem is disabled on a cgroup through
  "cgroup.subtree_control" but should remain enabled because other
  subsystems depend on it.  cgroup core makes such a css invisible by
  removing the associated interface files and invokes this callback so
  that the hidden subsystem can return to the initial neutral state.
  This prevents unexpected resource control from a hidden css and
  ensures that the configuration is in the initial state when it is made
  visible again later.
  
  void cancel_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
  (cgroup_mutex held by caller)
  
  Called when a task attach operation has failed after can_attach() has succeeded.
  A subsystem whose can_attach() has side effects should provide this
  function so that it can implement a rollback; otherwise it is not necessary.
  This is called only for subsystems whose can_attach() operation has
  succeeded. The parameters are identical to can_attach().
  
  void attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
  (cgroup_mutex held by caller)
  
  Called after the task has been attached to the cgroup, to allow any
  post-attachment activity that requires memory allocations or blocking.
  The parameters are identical to can_attach().
  
  void fork(struct task_struct *task)
  
  Called when a task is forked into a cgroup.
  
  void exit(struct task_struct *task)
  
  Called during task exit.
  
  void free(struct task_struct *task)
  
  Called when the task_struct is freed.
  
  void bind(struct cgroup *root)
  (cgroup_mutex held by caller)
  
  Called when a cgroup subsystem is rebound to a different hierarchy
  and root cgroup. Currently this will only involve movement between
  the default hierarchy (which never has sub-cgroups) and a hierarchy
  that is being created/destroyed (and hence has no sub-cgroups).
  
  4. Extended attribute usage
  ===========================
  
  cgroup filesystem supports certain types of extended attributes in its
  directories and files.  The current supported types are:
  	- Trusted (XATTR_TRUSTED)
  	- Security (XATTR_SECURITY)
  
  Both require CAP_SYS_ADMIN capability to set.
  
  As in tmpfs, the extended attributes in the cgroup filesystem are stored
  using kernel memory and it's advised to keep the usage to a minimum.  This
  is the reason why user-defined extended attributes are not supported, since
  any user can set them and there is no limit on the value size.
  
  The current known users for this feature are SELinux to limit cgroup usage
  in containers and systemd for assorted meta data like main PID in a cgroup
  (systemd creates a cgroup per service).
  
  5. Questions
  ============
  
  Q: What's up with this '/bin/echo' ?
  A: bash's builtin 'echo' command does not check calls to write() for
     errors. If you use it in the cgroup file system, you won't be
     able to tell whether a command succeeded or failed.
  
  Q: When I attach processes, only the first PID on the line really gets attached !
  A: We can only return one error code per call to write(). So you should
     write only ONE PID at a time.