Commit 45ce80fb6b6f9594d1396d44dd7e7c02d596fef8

Authored by Li Zefan
Committed by Linus Torvalds
1 parent 23964d2d02

cgroups: consolidate cgroup documents

Move Documentation/cpusets.txt and Documentation/controllers/* to
Documentation/cgroups/

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Showing 17 changed files with 1824 additions and 1824 deletions

Documentation/cgroups/cgroups.txt
1 1 CGROUPS
2 2 -------
3 3  
4   -Written by Paul Menage <menage@google.com> based on Documentation/cpusets.txt
  4 +Written by Paul Menage <menage@google.com> based on
  5 +Documentation/cgroups/cpusets.txt
5 6  
6 7 Original copyright statements from cpusets.txt:
7 8 Portions Copyright (C) 2004 BULL SA.
... ... @@ -68,7 +69,7 @@
68 69 tracking. The intention is that other subsystems hook into the generic
69 70 cgroup support to provide new attributes for cgroups, such as
70 71 accounting/limiting the resources which processes in a cgroup can
71   -access. For example, cpusets (see Documentation/cpusets.txt) allows
  72 +access. For example, cpusets (see Documentation/cgroups/cpusets.txt) allows
72 73 you to associate a set of CPUs and a set of memory nodes with the
73 74 tasks in each cgroup.
74 75  
Documentation/cgroups/cpuacct.txt
  1 +CPU Accounting Controller
  2 +-------------------------
  3 +
  4 +The CPU accounting controller is used to group tasks using cgroups and
  5 +account the CPU usage of these groups of tasks.
  6 +
  7 +The CPU accounting controller supports multi-hierarchy groups. An accounting
  8 +group accumulates the CPU usage of all of its child groups and the tasks
  9 +directly present in its group.
  10 +
  11 +Accounting groups can be created by first mounting the cgroup filesystem.
  12 +
  13 +# mkdir /cgroups
  14 +# mount -t cgroup -ocpuacct none /cgroups
  15 +
  16 +With the above steps, the initial or the parent accounting group
  17 +becomes visible at /cgroups. At bootup, this group includes all the
  18 +tasks in the system. /cgroups/tasks lists the tasks in this cgroup.
  19 +/cgroups/cpuacct.usage gives the CPU time (in nanoseconds) obtained by
  20 +this group which is essentially the CPU time obtained by all the tasks
  21 +in the system.
  22 +
  23 +New accounting groups can be created under the parent group /cgroups.
  24 +
  25 +# cd /cgroups
  26 +# mkdir g1
  28 +# echo $$ > g1/tasks
  28 +
  29 +The above steps create a new group g1 and move the current shell
  30 +process (bash) into it. CPU time consumed by this bash and its children
  31 +can be obtained from g1/cpuacct.usage, and the same is also accumulated
  32 +in /cgroups/cpuacct.usage.
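
As a quick check, the counters can be read back directly (a sketch,
assuming the /cgroups mount point and the group g1 created above):

# cat /cgroups/g1/cpuacct.usage      (CPU time of g1, in nanoseconds)
# cat /cgroups/cpuacct.usage         (parent group; includes g1's usage)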
Documentation/cgroups/cpusets.txt
  1 + CPUSETS
  2 + -------
  3 +
  4 +Copyright (C) 2004 BULL SA.
  5 +Written by Simon.Derr@bull.net
  6 +
  7 +Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
  8 +Modified by Paul Jackson <pj@sgi.com>
  9 +Modified by Christoph Lameter <clameter@sgi.com>
  10 +Modified by Paul Menage <menage@google.com>
  11 +Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
  12 +
  13 +CONTENTS:
  14 +=========
  15 +
  16 +1. Cpusets
  17 + 1.1 What are cpusets ?
  18 + 1.2 Why are cpusets needed ?
  19 + 1.3 How are cpusets implemented ?
  20 + 1.4 What are exclusive cpusets ?
  21 + 1.5 What is memory_pressure ?
  22 + 1.6 What is memory spread ?
  23 + 1.7 What is sched_load_balance ?
  24 + 1.8 What is sched_relax_domain_level ?
  25 + 1.9 How do I use cpusets ?
  26 +2. Usage Examples and Syntax
  27 + 2.1 Basic Usage
  28 + 2.2 Adding/removing cpus
  29 + 2.3 Setting flags
  30 + 2.4 Attaching processes
  31 +3. Questions
  32 +4. Contact
  33 +
  34 +1. Cpusets
  35 +==========
  36 +
  37 +1.1 What are cpusets ?
  38 +----------------------
  39 +
  40 +Cpusets provide a mechanism for assigning a set of CPUs and Memory
  41 +Nodes to a set of tasks. In this document "Memory Node" refers to
  42 +an on-line node that contains memory.
  43 +
  44 +Cpusets constrain the CPU and Memory placement of tasks to only
  45 +the resources within a task's current cpuset. They form a nested
  46 +hierarchy visible in a virtual file system. These are the essential
  47 +hooks, beyond what is already present, required to manage dynamic
  48 +job placement on large systems.
  49 +
  50 +Cpusets use the generic cgroup subsystem described in
  51 +Documentation/cgroups/cgroups.txt.
  52 +
  53 +Requests by a task, using the sched_setaffinity(2) system call to
  54 +include CPUs in its CPU affinity mask, and using the mbind(2) and
  55 +set_mempolicy(2) system calls to include Memory Nodes in its memory
  56 +policy, are both filtered through that task's cpuset, filtering out any
  57 +CPUs or Memory Nodes not in that cpuset. The scheduler will not
  58 +schedule a task on a CPU that is not allowed in its cpus_allowed
  59 +vector, and the kernel page allocator will not allocate a page on a
  60 +node that is not allowed in the requesting task's mems_allowed vector.
  61 +
  62 +User level code may create and destroy cpusets by name in the cgroup
  63 +virtual file system, manage the attributes and permissions of these
  64 +cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
  65 +specify and query to which cpuset a task is assigned, and list the
  66 +task pids assigned to a cpuset.
  67 +
  68 +
  69 +1.2 Why are cpusets needed ?
  70 +----------------------------
  71 +
  72 +The management of large computer systems, with many processors (CPUs),
  73 +complex memory cache hierarchies and multiple Memory Nodes having
  74 +non-uniform access times (NUMA) presents additional challenges for
  75 +the efficient scheduling and memory placement of processes.
  76 +
  77 +Frequently, more modest-sized systems can be operated with adequate
  78 +efficiency just by letting the operating system automatically share
  79 +the available CPU and Memory resources amongst the requesting tasks.
  80 +
  81 +But larger systems, which benefit more from careful processor and
  82 +memory placement to reduce memory access times and contention,
  83 +and which typically represent a larger investment for the customer,
  84 +can benefit from explicitly placing jobs on properly sized subsets of
  85 +the system.
  86 +
  87 +This can be especially valuable on:
  88 +
  89 + * Web Servers running multiple instances of the same web application,
  90 + * Servers running different applications (for instance, a web server
  91 + and a database), or
  92 + * NUMA systems running large HPC applications with demanding
  93 + performance characteristics.
  94 +
  95 +These subsets, or "soft partitions", must be able to be dynamically
  96 +adjusted, as the job mix changes, without impacting other concurrently
  97 +executing jobs. The pages of running jobs may also be moved when
  98 +their allowed memory locations are changed.
  99 +
  100 +The kernel cpuset patch provides the minimum essential kernel
  101 +mechanisms required to efficiently implement such subsets. It
  102 +leverages existing CPU and Memory Placement facilities in the Linux
  103 +kernel to avoid any additional impact on the critical scheduler or
  104 +memory allocator code.
  105 +
  106 +
  107 +1.3 How are cpusets implemented ?
  108 +---------------------------------
  109 +
  110 +Cpusets provide a Linux kernel mechanism to constrain which CPUs and
  111 +Memory Nodes are used by a process or set of processes.
  112 +
  113 +The Linux kernel already has a pair of mechanisms to specify on which
  114 +CPUs a task may be scheduled (sched_setaffinity) and on which Memory
  115 +Nodes it may obtain memory (mbind, set_mempolicy).
  116 +
  117 +Cpusets extend these two mechanisms as follows:
  118 +
  119 + - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
  120 + kernel.
  121 + - Each task in the system is attached to a cpuset, via a pointer
  122 + in the task structure to a reference counted cgroup structure.
  123 + - Calls to sched_setaffinity are filtered to just those CPUs
  124 + allowed in that task's cpuset.
  125 + - Calls to mbind and set_mempolicy are filtered to just
  126 + those Memory Nodes allowed in that task's cpuset.
  127 + - The root cpuset contains all the system's CPUs and Memory
  128 + Nodes.
  129 + - For any cpuset, one can define child cpusets containing a subset
  130 + of the parent's CPU and Memory Node resources.
  131 + - The hierarchy of cpusets can be mounted at /dev/cpuset, for
  132 + browsing and manipulation from user space.
  133 + - A cpuset may be marked exclusive, which ensures that no other
  134 + cpuset (except direct ancestors and descendants) may contain
  135 + any overlapping CPUs or Memory Nodes.
  136 + - You can list all the tasks (by pid) attached to any cpuset.
  137 +
  138 +The implementation of cpusets requires a few simple hooks
  139 +into the rest of the kernel, none in performance critical paths:
  140 +
  141 + - in init/main.c, to initialize the root cpuset at system boot.
  142 + - in fork and exit, to attach and detach a task from its cpuset.
  143 + - in sched_setaffinity, to mask the requested CPUs by what's
  144 + allowed in that task's cpuset.
  145 + - in sched.c migrate_all_tasks(), to keep migrating tasks within
  146 + the CPUs allowed by their cpuset, if possible.
  147 + - in the mbind and set_mempolicy system calls, to mask the requested
  148 + Memory Nodes by what's allowed in that task's cpuset.
  149 + - in page_alloc.c, to restrict memory to allowed nodes.
  150 + - in vmscan.c, to restrict page recovery to the current cpuset.
  151 +
  152 +You should mount the "cgroup" filesystem type in order to enable
  153 +browsing and modifying the cpusets presently known to the kernel. No
  154 +new system calls are added for cpusets - all support for querying and
  155 +modifying cpusets is via this cpuset file system.
  156 +
  157 +The /proc/<pid>/status file for each task has four added lines,
  158 +displaying the task's cpus_allowed (on which CPUs it may be scheduled)
  159 +and mems_allowed (on which Memory Nodes it may obtain memory),
  160 +in the two formats seen in the following example:
  161 +
  162 + Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff
  163 + Cpus_allowed_list: 0-127
  164 + Mems_allowed: ffffffff,ffffffff
  165 + Mems_allowed_list: 0-63
  166 +
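These fields can be inspected directly; for example, the following
command (a sketch, run against the current shell) prints all four lines:

# grep allowed /proc/self/status
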
  167 +Each cpuset is represented by a directory in the cgroup file system
  168 +containing (on top of the standard cgroup files) the following
  169 +files describing that cpuset:
  170 +
  171 + - cpus: list of CPUs in that cpuset
  172 + - mems: list of Memory Nodes in that cpuset
  173 + - memory_migrate flag: if set, move pages to cpuset's nodes
  174 + - cpu_exclusive flag: is cpu placement exclusive?
  175 + - mem_exclusive flag: is memory placement exclusive?
  176 + - mem_hardwall flag: is memory allocation hardwalled?
  177 + - memory_pressure: measure of how much paging pressure in cpuset
  178 +
  179 +In addition, only the root cpuset has the following file:
  180 + - memory_pressure_enabled flag: compute memory_pressure?
  181 +
  182 +New cpusets are created using the mkdir system call or shell
  183 +command. The properties of a cpuset, such as its flags, allowed
  184 +CPUs and Memory Nodes, and attached tasks, are modified by writing
  185 +to the appropriate file in that cpuset's directory, as listed above.
  186 +
  187 +The named hierarchical structure of nested cpusets allows partitioning
  188 +a large system into nested, dynamically changeable, "soft-partitions".
  189 +
  190 +The attachment of each task, automatically inherited at fork by any
  191 +children of that task, to a cpuset allows organizing the work load
  192 +on a system into related sets of tasks such that each set is constrained
  193 +to using the CPUs and Memory Nodes of a particular cpuset. A task
  194 +may be re-attached to any other cpuset, if allowed by the permissions
  195 +on the necessary cpuset file system directories.
  196 +
  197 +Such management of a system "in the large" integrates smoothly with
  198 +the detailed placement done on individual tasks and memory regions
  199 +using the sched_setaffinity, mbind and set_mempolicy system calls.
  200 +
  201 +The following rules apply to each cpuset:
  202 +
  203 + - Its CPUs and Memory Nodes must be a subset of its parent's.
  204 + - It can't be marked exclusive unless its parent is.
  205 + - If it is cpu or memory exclusive, it may not overlap any sibling.
  206 +
  207 +These rules, and the natural hierarchy of cpusets, enable efficient
  208 +enforcement of the exclusive guarantee, without having to scan all
  209 +cpusets every time any of them changes to ensure nothing overlaps an
  210 +exclusive cpuset. Also, the use of a Linux virtual file system (vfs)
  211 +to represent the cpuset hierarchy provides for a familiar permission
  212 +and name space for cpusets, with a minimum of additional kernel code.
  213 +
  214 +The cpus and mems files in the root (top_cpuset) cpuset are
  215 +read-only. The cpus file automatically tracks the value of
  216 +cpu_online_map using a CPU hotplug notifier, and the mems file
  217 +automatically tracks the value of node_states[N_HIGH_MEMORY]--i.e.,
  218 +nodes with memory--using the cpuset_track_online_nodes() hook.
  219 +
  220 +
  221 +1.4 What are exclusive cpusets ?
  222 +--------------------------------
  223 +
  224 +If a cpuset is cpu or mem exclusive, no other cpuset, other than
  225 +a direct ancestor or descendant, may share any of the same CPUs or
  226 +Memory Nodes.
  227 +
  228 +A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled",
  229 +i.e. it restricts kernel allocations for page, buffer and other data
  230 +commonly shared by the kernel across multiple users. All cpusets,
  231 +whether hardwalled or not, restrict allocations of memory for user
  232 +space. This enables configuring a system so that several independent
  233 +jobs can share common kernel data, such as file system pages, while
  234 +isolating each job's user allocation in its own cpuset. To do this,
  235 +construct a large mem_exclusive cpuset to hold all the jobs, and
  236 +construct child, non-mem_exclusive cpusets for each individual job.
  237 +Only a small amount of typical kernel memory, such as requests from
  238 +interrupt handlers, is allowed to be taken outside even a
  239 +mem_exclusive cpuset.
  240 +
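A minimal sketch of that layout (hypothetical names; assumes the cpuset
hierarchy is mounted at /dev/cpuset as described in section 1.9):

# cd /dev/cpuset
# mkdir jobs
# cd jobs
# /bin/echo 0-7 > cpus
# /bin/echo 0-1 > mems
# /bin/echo 1 > mem_exclusive      (hardwall kernel allocations here)
# mkdir job1                       (per-job child, not mem_exclusive)
# /bin/echo 0-3 > job1/cpus
# /bin/echo 0 > job1/mems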
  241 +
  242 +1.5 What is memory_pressure ?
  243 +-----------------------------
  244 +The memory_pressure of a cpuset provides a simple per-cpuset metric
  245 +of the rate at which the tasks in a cpuset are attempting to free up
  246 +in-use memory on the nodes of the cpuset to satisfy additional memory
  247 +requests.
  248 +
  249 +This enables batch managers monitoring jobs running in dedicated
  250 +cpusets to efficiently detect what level of memory pressure that job
  251 +is causing.
  252 +
  253 +This is useful both on tightly managed systems running a wide mix of
  254 +submitted jobs, which may choose to terminate or re-prioritize jobs that
  255 +are trying to use more memory than allowed on the nodes assigned them,
  256 +and with tightly coupled, long running, massively parallel scientific
  257 +computing jobs that will dramatically fail to meet required performance
  258 +goals if they start to use more memory than allowed to them.
  259 +
  260 +This mechanism provides a very economical way for the batch manager
  261 +to monitor a cpuset for signs of memory pressure. It's up to the
  262 +batch manager or other user code to decide what to do about it and
  263 +take action.
  264 +
  265 +==> Unless this feature is enabled by writing "1" to the special file
  266 + /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
  267 + code of __alloc_pages() for this metric reduces to simply noticing
  268 + that the cpuset_memory_pressure_enabled flag is zero. So only
  269 + systems that enable this feature will compute the metric.
  270 +
  271 +Why a per-cpuset, running average:
  272 +
  273 + Because this meter is per-cpuset, rather than per-task or mm,
  274 + the system load imposed by a batch scheduler monitoring this
  275 + metric is sharply reduced on large systems, because a scan of
  276 + the tasklist can be avoided on each set of queries.
  277 +
  278 + Because this meter is a running average, instead of an accumulating
  279 + counter, a batch scheduler can detect memory pressure with a
  280 + single read, instead of having to read and accumulate results
  281 + for a period of time.
  282 +
  283 + Because this meter is per-cpuset rather than per-task or mm,
  284 + the batch scheduler can obtain the key information, memory
  285 + pressure in a cpuset, with a single read, rather than having to
  286 + query and accumulate results over all the (dynamically changing)
  287 + set of tasks in the cpuset.
  288 +
  289 +A per-cpuset simple digital filter (requires a spinlock and 3 words
  290 +of data per-cpuset) is kept, and updated by any task attached to that
  291 +cpuset, if it enters the synchronous (direct) page reclaim code.
  292 +
  293 +A per-cpuset file provides an integer number representing the recent
  294 +(half-life of 10 seconds) rate of direct page reclaims caused by
  295 +the tasks in the cpuset, in units of reclaims attempted per second,
  296 +times 1000.
  297 +
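For example (a sketch; "Charlie" is the hypothetical cpuset from the
example in section 1.9):

# /bin/echo 1 > /dev/cpuset/memory_pressure_enabled
# cat /dev/cpuset/Charlie/memory_pressure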
  298 +
  299 +1.6 What is memory spread ?
  300 +---------------------------
  301 +There are two boolean flag files per cpuset that control where the
  302 +kernel allocates pages for the file system buffers and related
  303 +in-kernel data structures. They are called 'memory_spread_page' and
  304 +'memory_spread_slab'.
  305 +
  306 +If the per-cpuset boolean flag file 'memory_spread_page' is set, then
  307 +the kernel will spread the file system buffers (page cache) evenly
  308 +over all the nodes that the faulting task is allowed to use, instead
  309 +of preferring to put those pages on the node where the task is running.
  310 +
  311 +If the per-cpuset boolean flag file 'memory_spread_slab' is set,
  312 +then the kernel will spread some file system related slab caches,
  313 +such as those for inodes and dentries, evenly over all the nodes that the
  314 +faulting task is allowed to use, instead of preferring to put those
  315 +pages on the node where the task is running.
  316 +
  317 +The setting of these flags does not affect anonymous data segment or
  318 +stack segment pages of a task.
  319 +
  320 +By default, both kinds of memory spreading are off, and memory
  321 +pages are allocated on the node local to where the task is running,
  322 +except perhaps as modified by the task's NUMA mempolicy or cpuset
  323 +configuration, so long as sufficient free memory pages are available.
  324 +
  325 +When new cpusets are created, they inherit the memory spread settings
  326 +of their parent.
  327 +
  328 +Setting memory spreading causes allocations for the affected page
  329 +or slab caches to ignore the task's NUMA mempolicy and be spread
  330 +instead. Tasks using mbind() or set_mempolicy() calls to set NUMA
  331 +mempolicies will not notice any change in these calls as a result of
  332 +their containing task's memory spread settings. If memory spreading
  333 +is turned off, then the currently specified NUMA mempolicy once again
  334 +applies to memory page allocations.
  335 +
  336 +Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag
  337 +files. By default they contain "0", meaning that the feature is off
  338 +for that cpuset. If a "1" is written to that file, then that turns
  339 +the named feature on.
  340 +
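For example, from inside a cpuset directory (a sketch):

# /bin/echo 1 > memory_spread_page    -> spread page cache over mems
# cat memory_spread_page              -> now displays "1"
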
  341 +The implementation is simple.
  342 +
  343 +Setting the flag 'memory_spread_page' turns on a per-process flag
  344 +PF_SPREAD_PAGE for each task that is in that cpuset or subsequently
  345 +joins that cpuset. The page allocation calls for the page cache
  346 +are modified to perform an inline check for this PF_SPREAD_PAGE task
  347 +flag, and if set, a call to a new routine cpuset_mem_spread_node()
  348 +returns the node to prefer for the allocation.
  349 +
  350 +Similarly, setting 'memory_spread_slab' turns on the flag
  351 +PF_SPREAD_SLAB, and appropriately marked slab caches will allocate
  352 +pages from the node returned by cpuset_mem_spread_node().
  353 +
  354 +The cpuset_mem_spread_node() routine is also simple. It uses the
  355 +value of a per-task rotor cpuset_mem_spread_rotor to select the next
  356 +node in the current task's mems_allowed to prefer for the allocation.
  357 +
  358 +This memory placement policy is also known (in other contexts) as
  359 +round-robin or interleave.
  360 +
  361 +This policy can provide substantial improvements for jobs that need
  362 +to place thread local data on the corresponding node, but that need
  363 +to access large file system data sets that need to be spread across
  364 +the several nodes in the job's cpuset in order to fit. Without this
  365 +policy, especially for jobs that might have one thread reading in the
  366 +data set, the memory allocation across the nodes in the job's cpuset
  367 +can become very uneven.
  368 +
  369 +1.7 What is sched_load_balance ?
  370 +--------------------------------
  371 +
  372 +The kernel scheduler (kernel/sched.c) automatically load balances
  373 +tasks. If one CPU is underutilized, kernel code running on that
  374 +CPU will look for tasks on other more overloaded CPUs and move those
  375 +tasks to itself, within the constraints of such placement mechanisms
  376 +as cpusets and sched_setaffinity.
  377 +
  378 +The algorithmic cost of load balancing and its impact on key shared
  379 +kernel data structures such as the task list increases more than
  380 +linearly with the number of CPUs being balanced. So the scheduler
  381 +has support to partition the system's CPUs into a number of sched
  382 +domains such that it only load balances within each sched domain.
  383 +Each sched domain covers some subset of the CPUs in the system;
  384 +no two sched domains overlap; some CPUs might not be in any sched
  385 +domain and hence won't be load balanced.
  386 +
  387 +Put simply, it costs less to balance between two smaller sched domains
  388 +than one big one, but doing so means that overloads in one of the
  389 +two domains won't be load balanced to the other one.
  390 +
  391 +By default, there is one sched domain covering all CPUs, except those
  392 +marked isolated using the kernel boot time "isolcpus=" argument.
  393 +
  394 +This default load balancing across all CPUs is not well suited for
  395 +the following two situations:
  396 + 1) On large systems, load balancing across many CPUs is expensive.
  397 + If the system is managed using cpusets to place independent jobs
  398 + on separate sets of CPUs, full load balancing is unnecessary.
  399 + 2) Systems supporting realtime on some CPUs need to minimize
  400 + system overhead on those CPUs, including avoiding task load
  401 + balancing if that is not needed.
  402 +
  403 +When the per-cpuset flag "sched_load_balance" is enabled (the default
  404 +setting), it requests that all the CPUs in that cpuset's allowed 'cpus'
  405 +be contained in a single sched domain, ensuring that load balancing
  406 +can move a task (not otherwise pinned, as by sched_setaffinity)
  407 +from any CPU in that cpuset to any other.
  408 +
  409 +When the per-cpuset flag "sched_load_balance" is disabled, then the
  410 +scheduler will avoid load balancing across the CPUs in that cpuset,
  411 +--except-- in so far as is necessary because some overlapping cpuset
  412 +has "sched_load_balance" enabled.
  413 +
  414 +So, for example, if the top cpuset has the flag "sched_load_balance"
  415 +enabled, then the scheduler will have one sched domain covering all
  416 +CPUs, and the setting of the "sched_load_balance" flag in any other
  417 +cpusets won't matter, as we're already fully load balancing.
  418 +
  419 +Therefore in the above two situations, the top cpuset flag
  420 +"sched_load_balance" should be disabled, and only some of the smaller,
  421 +child cpusets have this flag enabled.
  422 +
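A sketch of that arrangement, using a hypothetical child cpuset "batch1"
under /dev/cpuset:

# /bin/echo 0 > /dev/cpuset/sched_load_balance        (top: don't balance)
# /bin/echo 1 > /dev/cpuset/batch1/sched_load_balance (balance batch1 only)
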
  423 +When doing this, you don't usually want to leave any unpinned tasks in
  424 +the top cpuset that might use non-trivial amounts of CPU, as such tasks
  425 +may be artificially constrained to some subset of CPUs, depending on
  426 +the particulars of this flag setting in descendant cpusets. Even if
  427 +such a task could use spare CPU cycles on some other CPUs, the kernel
  428 +scheduler might not consider the possibility of load balancing that
  429 +task to that underused CPU.
  430 +
  431 +Of course, tasks pinned to a particular CPU can be left in a cpuset
  432 +that disables "sched_load_balance" as those tasks aren't going anywhere
  433 +else anyway.
  434 +
  435 +There is an impedance mismatch here, between cpusets and sched domains.
  436 +Cpusets are hierarchical and nest. Sched domains are flat; they don't
  437 +overlap and each CPU is in at most one sched domain.
  438 +
  439 +It is necessary for sched domains to be flat because load balancing
  440 +across partially overlapping sets of CPUs would risk unstable dynamics
  441 +that would be beyond our understanding. So if each of two partially
  442 +overlapping cpusets enables the flag 'sched_load_balance', then we
  443 +form a single sched domain that is a superset of both. We won't move
  444 +a task to a CPU outside its cpuset, but the scheduler load balancing
  445 +code might waste some compute cycles considering that possibility.
  446 +
  447 +This mismatch is why there is not a simple one-to-one relation
  448 +between which cpusets have the flag "sched_load_balance" enabled,
  449 +and the sched domain configuration. If a cpuset enables the flag, it
  450 +will get balancing across all its CPUs, but if it disables the flag,
  451 +it will only be assured of no load balancing if no other overlapping
  452 +cpuset enables the flag.
  453 +
  454 +If two cpusets have partially overlapping 'cpus' allowed, and only
  455 +one of them has this flag enabled, then the other may find its
  456 +tasks only partially load balanced, just on the overlapping CPUs.
  457 +This is just the general case of the top_cpuset example given a few
  458 +paragraphs above. In the general case, as in the top cpuset case,
  459 +don't leave tasks that might use non-trivial amounts of CPU in
  460 +such partially load balanced cpusets, as they may be artificially
  461 +constrained to some subset of the CPUs allowed to them, for lack of
  462 +load balancing to the other CPUs.
  463 +
  464 +1.7.1 sched_load_balance implementation details.
  465 +------------------------------------------------
  466 +
  467 +The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary
  468 +to most cpuset flags.) When enabled for a cpuset, the kernel will
  469 +ensure that it can load balance across all the CPUs in that cpuset
  470 +(makes sure that all the CPUs in the cpus_allowed of that cpuset are
  471 +in the same sched domain.)
  472 +
  473 +If two overlapping cpusets both have 'sched_load_balance' enabled,
  474 +then they will be (must be) both in the same sched domain.
  475 +
  476 +If, as is the default, the top cpuset has 'sched_load_balance' enabled,
  477 +then by the above that means there is a single sched domain covering
  478 +the whole system, regardless of any other cpuset settings.
  479 +
  480 +The kernel commits to user space that it will avoid load balancing
  481 +where it can. It will pick as fine-grained a partition of sched
  482 +domains as it can while still providing load balancing for any set
  483 +of CPUs allowed to a cpuset having 'sched_load_balance' enabled.
  484 +
  485 +The internal kernel cpuset to scheduler interface passes from the
  486 +cpuset code to the scheduler code a partition of the load balanced
  487 +CPUs in the system. This partition is a set of subsets (represented
  488 +as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all
  489 +the CPUs that must be load balanced.
  490 +
  491 +Whenever the 'sched_load_balance' flag changes, or CPUs come or go
  492 +from a cpuset with this flag enabled, or a cpuset with this flag
  493 +enabled is removed, the cpuset code builds a new such partition and
  494 +passes it to the scheduler sched domain setup code, to have the sched
  495 +domains rebuilt as necessary.
  496 +
  497 +This partition exactly defines what sched domains the scheduler should
  498 +set up - one sched domain for each element (cpumask_t) in the partition.
  499 +
  500 +The scheduler remembers the currently active sched domain partitions.
  501 +When the scheduler routine partition_sched_domains() is invoked from
  502 +the cpuset code to update these sched domains, it compares the new
  503 +partition requested with the current, and updates its sched domains,
  504 +removing the old and adding the new, for each change.
  505 +
  506 +
  507 +1.8 What is sched_relax_domain_level ?
  508 +--------------------------------------
  509 +
  510 +Within a sched domain, the scheduler migrates tasks in two ways: periodic
  511 +load balancing on the tick, and at the time of certain schedule events.
  512 +
  513 +When a task is woken up, the scheduler tries to move it to an idle CPU.
  514 +For example, if a task A running on CPU X activates another task B
  515 +on the same CPU X, and if CPU Y is X's sibling and idle, then the
  516 +scheduler migrates task B to CPU Y so that task B can start on
  517 +CPU Y without waiting for task A on CPU X.
  518 +
  519 +Likewise, if a CPU runs out of tasks in its runqueue, it tries to pull
  520 +extra tasks from other busy CPUs to help them before it goes
  521 +idle.
  522 +
  523 +Of course, it costs something to search for movable tasks and/or
  524 +idle CPUs, so the scheduler might not search all CPUs in the domain
  525 +every time. In fact, on some architectures, the search range on these
  526 +events is limited to the same socket or node as the CPU,
  527 +while the load balance on the tick searches all of them.
  528 +
  529 +For example, assume CPU Z is relatively far from CPU X. Even if CPU Z
  530 +is idle while CPU X and its siblings are busy, the scheduler can't
  531 +migrate woken task B from X to Z since Z is out of its search range.
  532 +As a result, task B on CPU X has to wait for task A or for the load
  533 +balance on the next tick. For some applications in special situations,
  534 +waiting one tick may be too long.
  535 +
  536 +The 'sched_relax_domain_level' file allows you to request changing
  537 +this search range as you like. This file takes an int value which
  538 +ideally indicates the size of the search range in levels, as follows;
  539 +otherwise, the initial value -1 indicates the cpuset has no request.
  540 +
  541 + -1 : no request. use system default or follow request of others.
  542 + 0 : no search.
  543 + 1 : search siblings (hyperthreads in a core).
  544 + 2 : search cores in a package.
  545 + 3 : search cpus in a node [= system wide on non-NUMA system]
  546 + ( 4 : search nodes in a chunk of node [on NUMA system] )
  547 + ( 5 : search system wide [on NUMA system] )
  548 +
  549 +The system default is architecture dependent. The system default
  550 +can be changed using the relax_domain_level= boot parameter.
  551 +
  552 +This file is per-cpuset and affects the sched domain the cpuset
  553 +belongs to. Therefore, if the flag 'sched_load_balance' of a cpuset
  554 +is disabled, then 'sched_relax_domain_level' has no effect since
  555 +there is no sched domain belonging to the cpuset.
  556 +
  557 +If multiple cpusets are overlapping and hence form a single sched
  558 +domain, the largest value among those is used. Be careful: if one
  559 +requests 0 and the others request -1, then 0 is used.
  560 +
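For example, to request that wakeups in a cpuset search for idle CPUs up
to the node level (a sketch, from inside that cpuset's directory):

# /bin/echo 3 > sched_relax_domain_level
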
  561 +Note that modifying this file will have both good and bad effects,
  562 +and whether it is acceptable or not depends on your situation.
  563 +Don't modify this file if you are not sure.
  564 +
  565 +If your situation is:
  566 + - The migration costs between CPUs can be assumed to be considerably
  567 + small (for you), due to your application's behavior or special
  568 + hardware support for CPU caches, etc.
  569 + - The search cost has no impact (for you), or you can make the
  570 + search cost small enough by managing your cpusets compactly, etc.
  571 + - Low latency is required, even if it sacrifices cache hit rate, etc.
  572 +then increasing 'sched_relax_domain_level' would benefit you.
  573 +
  574 +
  575 +1.9 How do I use cpusets ?
  576 +--------------------------
  577 +
  578 +In order to minimize the impact of cpusets on critical kernel
  579 +code, such as the scheduler, and due to the fact that the kernel
  580 +does not support one task updating the memory placement of another
  581 +task directly, the impact on a task of changing its cpuset CPU
  582 +or Memory Node placement, or of changing to which cpuset a task
  583 +is attached, is subtle.
  584 +
  585 +If a cpuset has its Memory Nodes modified, then for each task attached
  586 +to that cpuset, the next time that the kernel attempts to allocate
  587 +a page of memory for that task, the kernel will notice the change
  588 +in the task's cpuset, and update its per-task memory placement to
  589 +remain within the new cpuset's memory placement. If the task was using
  590 +mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
  591 +its new cpuset, then the task will continue to use whatever subset
  592 +of MPOL_BIND nodes are still allowed in the new cpuset. If the task
  593 +was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
  594 +in the new cpuset, then the task will be essentially treated as if it
  595 +was MPOL_BIND bound to the new cpuset (even though its numa placement,
  596 +as queried by get_mempolicy(), doesn't change). If a task is moved
  597 +from one cpuset to another, then the kernel will adjust the task's
  598 +memory placement, as above, the next time that the kernel attempts
  599 +to allocate a page of memory for that task.
  600 +
  601 +If a cpuset has its 'cpus' modified, then each task in that cpuset
  602 +will have its allowed CPU placement changed immediately. Similarly,
  603 +if a task's pid is written to a cpuset's 'tasks' file, in either its
  604 +current cpuset or another cpuset, then its allowed CPU placement is
  605 +changed immediately. If such a task had been bound to some subset
  606 +of its cpuset using the sched_setaffinity() call, the task will be
  607 +allowed to run on any CPU allowed in its new cpuset, negating the
  608 +effect of the prior sched_setaffinity() call.
  609 +
  610 +In summary, the memory placement of a task whose cpuset is changed is
  611 +updated by the kernel, on the next allocation of a page for that task,
  612 +but the processor placement is not updated, until that task's pid is
  613 +rewritten to the 'tasks' file of its cpuset. This is done to avoid
  614 +impacting the scheduler code in the kernel with a check for changes
  615 +in a task's processor placement.
  616 +
  617 +Normally, once a page is allocated (given a physical page
  618 +of main memory) then that page stays on whatever node it
  619 +was allocated, so long as it remains allocated, even if the
  620 +cpuset's memory placement policy 'mems' subsequently changes.
  621 +If the cpuset flag file 'memory_migrate' is set true, then when
  622 +tasks are attached to that cpuset, any pages that task had
  623 +allocated to it on nodes in its previous cpuset are migrated
  624 +to the task's new cpuset. The relative placement of the page within
  625 +the cpuset is preserved during these migration operations if possible.
  626 +For example if the page was on the second valid node of the prior cpuset
  627 +then the page will be placed on the second valid node of the new cpuset.
  628 +
  629 +Also if 'memory_migrate' is set true, then if that cpuset's
  630 +'mems' file is modified, pages allocated to tasks in that
  631 +cpuset, that were on nodes in the previous setting of 'mems',
  632 +will be moved to nodes in the new setting of 'mems.'
  633 +Pages that were not in the task's prior cpuset, or in the cpuset's
  634 +prior 'mems' setting, will not be moved.
  635 +
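For example (a sketch, from inside a cpuset directory):

# /bin/echo 1 > memory_migrate    -> migrate pages on 'mems' changes or task moves
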
  636 +There is an exception to the above. If hotplug functionality is used
  637 +to remove all the CPUs that are currently assigned to a cpuset,
  638 +then all the tasks in that cpuset will be moved to the nearest ancestor
  639 +with non-empty cpus. But the moving of some (or all) tasks might fail if
  640 +the cpuset is bound to another cgroup subsystem which has some restrictions
  641 +on task attaching. In this failing case, those tasks will stay
  642 +in the original cpuset, and the kernel will automatically update
  643 +their cpus_allowed to allow all online CPUs. When memory hotplug
  644 +functionality for removing Memory Nodes is available, a similar exception
  645 +is expected to apply there as well. In general, the kernel prefers to
  646 +violate cpuset placement, over starving a task that has had all
  647 +its allowed CPUs or Memory Nodes taken offline.
  648 +
  649 +There is a second exception to the above. GFP_ATOMIC requests are
  650 +kernel internal allocations that must be satisfied, immediately.
  651 +The kernel may drop some requests, in rare cases even panic, if a
  652 +GFP_ATOMIC alloc fails. If the request cannot be satisfied within
  653 +the current task's cpuset, then we relax the cpuset, and look for
  654 +memory anywhere we can find it. It's better to violate the cpuset
  655 +than stress the kernel.
  656 +
  657 +To start a new job that is to be contained within a cpuset, the steps are:
  658 +
  659 + 1) mkdir /dev/cpuset
  660 + 2) mount -t cgroup -ocpuset cpuset /dev/cpuset
  661 + 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
  662 + the /dev/cpuset virtual file system.
  663 + 4) Start a task that will be the "founding father" of the new job.
  664 + 5) Attach that task to the new cpuset by writing its pid to the
  665 + /dev/cpuset tasks file for that cpuset.
  666 + 6) fork, exec or clone the job tasks from this founding father task.
  667 +
  668 +For example, the following sequence of commands will set up a cpuset
  669 +named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
  670 +and then start a subshell 'sh' in that cpuset:
  671 +
  672 + mount -t cgroup -ocpuset cpuset /dev/cpuset
  673 + cd /dev/cpuset
  674 + mkdir Charlie
  675 + cd Charlie
  676 + /bin/echo 2-3 > cpus
  677 + /bin/echo 1 > mems
  678 + /bin/echo $$ > tasks
  679 + sh
  680 + # The subshell 'sh' is now running in cpuset Charlie
  681 + # The next line should display '/Charlie'
  682 + cat /proc/self/cpuset
  683 +
  684 +In the future, a C library interface to cpusets will likely be
  685 +available. For now, the only way to query or modify cpusets is
  686 +via the cpuset file system, using the various cd, mkdir, echo, cat,
  687 +rmdir commands from the shell, or their equivalent from C.
  688 +
  689 +The sched_setaffinity calls can also be done at the shell prompt using
  690 +SGI's runon or Robert Love's taskset. The mbind and set_mempolicy
  691 +calls can be done at the shell prompt using the numactl command
  692 +(part of Andi Kleen's numa package).
  693 +
  694 +2. Usage Examples and Syntax
  695 +============================
  696 +
  697 +2.1 Basic Usage
  698 +---------------
  699 +
  700 +Creating, modifying, and using cpusets can be done through the cpuset
  701 +virtual filesystem.
  702 +
  703 +To mount it, type:
  704 +# mount -t cgroup -o cpuset cpuset /dev/cpuset
  705 +
  706 +Then under /dev/cpuset you can find a tree that corresponds to the
  707 +tree of the cpusets in the system. For instance, /dev/cpuset
  708 +is the cpuset that holds the whole system.
  709 +
  710 +If you want to create a new cpuset under /dev/cpuset:
  711 +# cd /dev/cpuset
  712 +# mkdir my_cpuset
  713 +
  714 +Now you want to do something with this cpuset.
  715 +# cd my_cpuset
  716 +
  717 +In this directory you can find several files:
  718 +# ls
  719 +cpu_exclusive memory_migrate mems tasks
  720 +cpus memory_pressure notify_on_release
  721 +mem_exclusive memory_spread_page sched_load_balance
  722 +mem_hardwall memory_spread_slab sched_relax_domain_level
  723 +
  724 +Reading them will give you information about the state of this cpuset:
  725 +the CPUs and Memory Nodes it can use, the processes that are using
  726 +it, its properties. By writing to these files you can manipulate
  727 +the cpuset.
  728 +
  729 +Set some flags:
  730 +# /bin/echo 1 > cpu_exclusive
  731 +
  732 +Add some cpus:
  733 +# /bin/echo 0-7 > cpus
  734 +
  735 +Add some mems:
  736 +# /bin/echo 0-7 > mems
  737 +
  738 +Now attach your shell to this cpuset:
  739 +# /bin/echo $$ > tasks
  740 +
  741 +You can also create cpusets inside your cpuset by using mkdir in this
  742 +directory.
  743 +# mkdir my_sub_cs
  744 +
  745 +To remove a cpuset, just use rmdir:
  746 +# rmdir my_sub_cs
  747 +This will fail if the cpuset is in use (has cpusets inside, or has
  748 +processes attached).
  749 +
  750 +Note that for legacy reasons, the "cpuset" filesystem exists as a
  751 +wrapper around the cgroup filesystem.
  752 +
  753 +The command
  754 +
  755 +mount -t cpuset X /dev/cpuset
  756 +
  757 +is equivalent to
  758 +
  759 +mount -t cgroup -ocpuset X /dev/cpuset
  760 +echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent
  761 +
  762 +2.2 Adding/removing cpus
  763 +------------------------
  764 +
  765 +This is the syntax to use when writing in the cpus or mems files
  766 +in cpuset directories:
  767 +
  768 +# /bin/echo 1-4 > cpus -> set cpus list to cpus 1,2,3,4
  769 +# /bin/echo 1,2,3,4 > cpus -> set cpus list to cpus 1,2,3,4
  770 +
  771 +2.3 Setting flags
  772 +-----------------
  773 +
  774 +The syntax is very simple:
  775 +
  776 +# /bin/echo 1 > cpu_exclusive -> set flag 'cpu_exclusive'
  777 +# /bin/echo 0 > cpu_exclusive -> unset flag 'cpu_exclusive'
  778 +
  779 +2.4 Attaching processes
  780 +-----------------------
  781 +
  782 +# /bin/echo PID > tasks
  783 +
  784 +Note that it is PID, not PIDs. You can only attach ONE task at a time.
  785 +If you have several tasks to attach, you have to do it one after another:
  786 +
  787 +# /bin/echo PID1 > tasks
  788 +# /bin/echo PID2 > tasks
  789 + ...
  790 +# /bin/echo PIDn > tasks
  791 +
  792 +
  793 +3. Questions
  794 +============
  795 +
  796 +Q: what's up with this '/bin/echo' ?
  797 +A: bash's builtin 'echo' command does not check calls to write() against
  798 + errors. If you use it in the cpuset file system, you won't be
  799 + able to tell whether a command succeeded or failed.
  800 +
  801 +Q: When I attach processes, only the first one on the line really gets attached!
  802 +A: We can only return one error code per call to write(). So you should also
  803 + put only ONE pid.
  804 +
  805 +4. Contact
  806 +==========
  807 +
  808 +Web: http://www.bullopensource.org/cpuset
Documentation/cgroups/devices.txt
  1 +Device Whitelist Controller
  2 +
  3 +1. Description:
  4 +
  5 +Implement a cgroup to track and enforce open and mknod restrictions
  6 +on device files. A device cgroup associates a device access
  7 +whitelist with each cgroup. A whitelist entry has 4 fields.
  8 +'type' is a (all), c (char), or b (block). 'all' means it applies
  9 +to all types and all major and minor numbers. Major and minor are
  10 +either an integer or * for all. Access is a composition of r
  11 +(read), w (write), and m (mknod).
  12 +
  13 +The root device cgroup starts with rwm to 'all'. A child device
  14 +cgroup gets a copy of the parent. Administrators can then remove
  15 +devices from the whitelist or add new entries. A child cgroup can
  16 +never receive a device access which is denied by its parent. However,
  17 +when a device access is removed from a parent, it will not also be
  18 +removed from the child(ren).
  19 +
  20 +2. User Interface
  21 +
  22 +An entry is added using devices.allow, and removed using
  23 +devices.deny. For instance
  24 +
  25 + echo 'c 1:3 mr' > /cgroups/1/devices.allow
  26 +
  27 +allows cgroup 1 to read and mknod the device usually known as
  28 +/dev/null. Doing
  29 +
  30 + echo a > /cgroups/1/devices.deny
  31 +
  32 +will remove the default 'a *:* rwm' entry. Doing
  33 +
  34 + echo a > /cgroups/1/devices.allow
  35 +
  36 +will add the 'a *:* rwm' entry to the whitelist.
  37 +
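The current whitelist can be read back from the read-only devices.list
file (a sketch, continuing the example above):

 cat /cgroups/1/devices.list
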
  38 +3. Security
  39 +
  40 +Any task can move itself between cgroups. This clearly won't
  41 +suffice, but we can decide the best way to adequately restrict
  42 +movement as people get some experience with this. We may just want
  43 +to require CAP_SYS_ADMIN, which at least is a separate bit from
  44 +CAP_MKNOD. We may want to just refuse moving to a cgroup which
  45 +isn't a descendant of the current one. Or we may want to use
  46 +CAP_MAC_ADMIN, since we really are trying to lock down root.
  47 +
  48 +CAP_SYS_ADMIN is needed to modify the whitelist or move another
  49 +task to a new cgroup. (Again we'll probably want to change that).
  50 +
  51 +A cgroup may not be granted more permissions than the cgroup's
  52 +parent has.
Documentation/cgroups/memcg_test.txt
  1 +Memory Resource Controller(Memcg) Implementation Memo.
  2 +Last Updated: 2008/12/15
  3 +Base Kernel Version: based on 2.6.28-rc8-mm.
  4 +
  5 +Because the VM is getting complex (one of the reasons is memcg...), memcg's behavior
  6 +is complex. This is a document for memcg's internal behavior.
  7 +Please note that implementation details can be changed.
  8 +
  9 +(*) Topics on the API should be in Documentation/cgroups/memory.txt
  10 +
  11 +0. How to record usage ?
  12 + 2 objects are used.
  13 +
  14 + page_cgroup ....an object per page.
  15 + Allocated at boot or memory hotplug. Freed at memory hot removal.
  16 +
  17 + swap_cgroup ... an entry per swp_entry.
  18 + Allocated at swapon(). Freed at swapoff().
  19 +
  20 + The page_cgroup has a USED bit, and double counting against a page_cgroup
  21 + never occurs. swap_cgroup is used only when a charged page is swapped out.
  22 +
  23 +1. Charge
  24 +
  25 + a page/swp_entry may be charged (usage += PAGE_SIZE) at
  26 +
  27 + mem_cgroup_newpage_charge()
  28 + Called at new page fault and Copy-On-Write.
  29 +
  30 + mem_cgroup_try_charge_swapin()
  31 + Called at do_swap_page() (page fault on swap entry) and swapoff.
  32 + Followed by charge-commit-cancel protocol. (With swap accounting)
  33 + At commit, a charge recorded in swap_cgroup is removed.
  34 +
  35 + mem_cgroup_cache_charge()
  36 + Called at add_to_page_cache()
  37 +
  38 + mem_cgroup_cache_charge_swapin()
  39 + Called at shmem's swapin.
  40 +
  41 + mem_cgroup_prepare_migration()
  42 + Called before migration. "extra" charge is done and followed by
  43 + charge-commit-cancel protocol.
  44 + At commit, charge against oldpage or newpage will be committed.
  45 +
  46 +2. Uncharge
  47 + a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
  48 +
  49 + mem_cgroup_uncharge_page()
  50 + Called when an anonymous page is fully unmapped. I.e., mapcount goes
  51 + to 0. If the page is SwapCache, uncharge is delayed until
  52 + mem_cgroup_uncharge_swapcache().
  53 +
  54 + mem_cgroup_uncharge_cache_page()
  55 + Called when a page-cache is deleted from radix-tree. If the page is
  56 + SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache().
  57 +
  58 + mem_cgroup_uncharge_swapcache()
  59 + Called when SwapCache is removed from radix-tree. The charge itself
  60 + is moved to swap_cgroup. (If mem+swap controller is disabled, no
  61 + charge to swap occurs.)
  62 +
  63 + mem_cgroup_uncharge_swap()
  64 + Called when swp_entry's refcnt goes down to 0. A charge against swap
  65 + disappears.
  66 +
  67 + mem_cgroup_end_migration(old, new)
  68 + At success of migration old is uncharged (if necessary), a charge
  69 + to new page is committed. At failure, charge to old page is committed.
  70 +
  71 +3. charge-commit-cancel
  72 + In some cases, we can't know whether this "charge" is valid or not at
  73 + charge time (because of races).
  74 + To handle such cases, there are charge-commit-cancel functions.
  75 + mem_cgroup_try_charge_XXX
  76 + mem_cgroup_commit_charge_XXX
  77 + mem_cgroup_cancel_charge_XXX
  78 + these are used in swap-in and migration.
  79 +
  80 + At try_charge(), there are no flags to say "this page is charged";
  81 + at this point, usage += PAGE_SIZE.
  82 +
  83 + At commit(), the function checks whether the page should be charged or
  84 + not, and either sets flags or avoids charging (usage -= PAGE_SIZE).
  85 +
  86 + At cancel(), simply usage -= PAGE_SIZE.
  87 +
  88 +In the explanation below, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
  89 +
  90 +4. Anonymous
  91 + Anonymous page is newly allocated at
  92 + - page fault into MAP_ANONYMOUS mapping.
  93 + - Copy-On-Write.
  94 + It is charged right after it's allocated, before doing any page table
  95 + related operations. Of course, it's uncharged when another page is used
  96 + for the fault address.
  97 +
  98 + When freeing an anonymous page (by exit() or munmap()), zap_pte() is called
  99 + and pages for ptes are freed one by one (see mm/memory.c). Uncharges
  100 + are done at page_remove_rmap() when page_mapcount() goes down to 0.
  101 +
  102 + Another page freeing is by page-reclaim (vmscan.c) and anonymous
  103 + pages are swapped out. In this case, the page is marked as
  104 + PageSwapCache(). The uncharge() routine doesn't uncharge the page marked
  105 + as SwapCache(). It's delayed until __delete_from_swap_cache().
  106 +
  107 + 4.1 Swap-in.
  108 + At swap-in, the page is taken from swap-cache. There are 2 cases.
  109 +
  110 + (a) If the SwapCache is newly allocated and read, it has no charges.
  111 + (b) If the SwapCache has been mapped by processes, it has been
  112 + charged already.
  113 +
  114 + This swap-in is one of the most complicated cases. In do_swap_page(),
  115 + the following events occur when the pte is unchanged.
  116 +
  117 + (1) the page (SwapCache) is looked up.
  118 + (2) lock_page()
  119 + (3) try_charge_swapin()
  120 + (4) reuse_swap_page() (may call delete_swap_cache())
  121 + (5) commit_charge_swapin()
  122 + (6) swap_free().
  123 +
  124 + Consider the following situations, for example.
  125 +
  126 + (A) The page has not been charged before (2) and reuse_swap_page()
  127 + doesn't call delete_from_swap_cache().
  128 + (B) The page has not been charged before (2) and reuse_swap_page()
  129 + calls delete_from_swap_cache().
  130 + (C) The page has been charged before (2) and reuse_swap_page() doesn't
  131 + call delete_from_swap_cache().
  132 + (D) The page has been charged before (2) and reuse_swap_page() calls
  133 + delete_from_swap_cache().
  134 +
  135 + The memory.usage/memsw.usage changes to this page/swp_entry will be:
  136 + Case (A) (B) (C) (D)
  137 + Event
  138 + Before (2) 0/ 1 0/ 1 1/ 1 1/ 1
  139 + ===========================================
  140 + (3) +1/+1 +1/+1 +1/+1 +1/+1
  141 + (4) - 0/ 0 - -1/ 0
  142 + (5) 0/-1 0/ 0 -1/-1 0/ 0
  143 + (6) - 0/-1 - 0/-1
  144 + ===========================================
  145 + Result 1/ 1 1/ 1 1/ 1 1/ 1
  146 +
  147 + In all cases, charges to this page should be 1/ 1.
  148 +
  149 + 4.2 Swap-out.
  150 + At swap-out, typical state transition is below.
  151 +
  152 + (a) add to swap cache. (marked as SwapCache)
  153 + swp_entry's refcnt += 1.
  154 + (b) fully unmapped.
  155 + swp_entry's refcnt += # of ptes.
  156 + (c) write back to swap.
  157 + (d) delete from swap cache. (remove from SwapCache)
  158 + swp_entry's refcnt -= 1.
  159 +
  160 +
  161 + At (b), the page is marked as SwapCache and not uncharged.
  162 + At (d), the page is removed from SwapCache and a charge in page_cgroup
  163 + is moved to swap_cgroup.
  164 +
  165 + Finally, at task exit,
  166 + (e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
  167 + Here, a charge in swap_cgroup disappears.
  168 +
  169 +5. Page Cache
  170 + Page Cache is charged at
  171 + - add_to_page_cache_locked().
  172 +
  173 + uncharged at
  174 + - __remove_from_page_cache().
  175 +
  176 + The logic is very clear. (About migration, see below)
  177 + Note: __remove_from_page_cache() is called by remove_from_page_cache()
  178 + and __remove_mapping().
  179 +
  180 +6. Shmem(tmpfs) Page Cache
  181 + Memcg's charge/uncharge have special handlers for shmem. The best way
  182 + to understand shmem's page state transitions is to read mm/shmem.c.
  183 + But a brief explanation of the behavior of memcg around shmem will be
  184 + helpful for understanding the logic.
  185 +
  186 + Shmem's page (just leaf page, not direct/indirect block) can be on
  187 + - radix-tree of shmem's inode.
  188 + - SwapCache.
  189 + - Both on radix-tree and SwapCache. This happens at swap-in
  190 + and swap-out.
  191 +
  192 + It's charged when...
  193 + - A new page is added to shmem's radix-tree.
  194 + - A swp page is read. (move a charge from swap_cgroup to page_cgroup)
  195 + It's uncharged when
  196 + - A page is removed from radix-tree and not SwapCache.
  197 + - When SwapCache is removed, a charge is moved to swap_cgroup.
  198 + - When swp_entry's refcnt goes down to 0, a charge in swap_cgroup
  199 + disappears.
  200 +
  201 +7. Page Migration
  202 + One of the most complicated functions is the page migration handler.
  203 + Memcg has 2 routines. Assume that we are migrating a page's contents
  204 + from OLDPAGE to NEWPAGE.
  205 +
  206 + Usual migration logic is..
  207 + (a) remove the page from LRU.
  208 + (b) allocate NEWPAGE (migration target)
  209 + (c) lock by lock_page().
  210 + (d) unmap all mappings.
  211 + (e-1) If necessary, replace entry in radix-tree.
  212 + (e-2) move contents of a page.
  213 + (f) map all mappings again.
  214 + (g) pushback the page to LRU.
  215 + (-) OLDPAGE will be freed.
  216 +
  217 + Before (g), memcg should complete all necessary charge/uncharge to
  218 + NEWPAGE/OLDPAGE.
  219 +
  220 + The point is....
  221 + - If OLDPAGE is anonymous, all charges will be dropped at (d) because
  222 + try_to_unmap() drops all mapcount and the page will not be
  223 + SwapCache.
  224 +
  225 + - If OLDPAGE is SwapCache, charges will be kept at (g) because
  226 + __delete_from_swap_cache() isn't called at (e-1)
  227 +
  228 + - If OLDPAGE is page-cache, charges will be kept at (g) because
  229 + remove_from_swap_cache() isn't called at (e-1)
  230 +
  231 + memcg provides following hooks.
  232 +
  233 + - mem_cgroup_prepare_migration(OLDPAGE)
  234 + Called after (b) to account a charge (usage += PAGE_SIZE) against
  235 + memcg which OLDPAGE belongs to.
  236 +
  237 + - mem_cgroup_end_migration(OLDPAGE, NEWPAGE)
  238 + Called after (f) before (g).
  239 + If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already
  240 + charged, a charge by prepare_migration() is automatically canceled.
  241 + If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE.
  242 +
  243 + But zap_pte() (by exit or munmap) can be called while migration,
  244 + we have to check if OLDPAGE/NEWPAGE is a valid page after commit().
  245 +
  246 +8. LRU
  247 + Each memcg has its own private LRU. For now, its handling is under the
  248 + global VM's control (meaning that it's handled under the global
  249 + zone->lru_lock). Almost all routines around memcg's LRU are called by
  250 + the global LRU's list management functions under zone->lru_lock.
  251 +
  252 + A special function is mem_cgroup_isolate_pages(). This scans
  253 + memcg's private LRU and calls __isolate_lru_page() to extract a page
  254 + from the LRU.
  255 + (By __isolate_lru_page(), the page is removed from both the global and
  256 + the private LRU.)
  257 +
  258 +
  259 +9. Typical Tests.
  260 +
  261 + Tests for racy cases.
  262 +
  263 + 9.1 Small limit to memcg.
  264 + When testing racy cases, it's a good idea to set memcg's limit
  265 + very small, rather than in GB. Many races were found in tests under
  266 + xKB or xxMB limits.
  267 + (Memory behavior under GB limits and memory behavior under MB limits
  268 + show very different situations.)
  269 +
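 For example (a sketch; the cgroup path is hypothetical, and
 memory.limit_in_bytes accepts suffixed values):

 # echo 4M > /opt/cgroup/01/memory.limit_in_bytes
 # cat /opt/cgroup/01/memory.limit_in_bytes
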
  270 + 9.2 Shmem
  271 + Historically, memcg's shmem handling was poor and we saw a fair amount
  272 + of trouble here. This is because shmem is page-cache but can also be
  273 + SwapCache. Testing with shmem/tmpfs is always a good test.
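      +
      + For instance, assuming the small-limit group from 9.1 (paths
      + illustrative), a tmpfs write exercises the page-cache/SwapCache
      + transitions described in section 6:
      +
      + # mount -t tmpfs none /mnt
      + # echo $$ > /cgroups/test/tasks
      + # dd if=/dev/zero of=/mnt/file bs=4k count=10000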
  274 +
  275 + 9.3 Migration
  276 + For NUMA, migration is another special case. For easy testing, cpusets
  277 + are useful. The following is a sample setup to trigger migration.
  278 +
  279 + mount -t cgroup -o cpuset none /opt/cpuset
  280 +
  281 + mkdir /opt/cpuset/01
  282 + echo 1 > /opt/cpuset/01/cpuset.cpus
  283 + echo 0 > /opt/cpuset/01/cpuset.mems
  284 + echo 1 > /opt/cpuset/01/cpuset.memory_migrate
  285 + mkdir /opt/cpuset/02
  286 + echo 1 > /opt/cpuset/02/cpuset.cpus
  287 + echo 1 > /opt/cpuset/02/cpuset.mems
  288 + echo 1 > /opt/cpuset/02/cpuset.memory_migrate
  289 +
  290 + With the above setup, when you move a task from 01 to 02, page
  291 + migration from node 0 to node 1 will occur. The following is a script
  292 + to migrate all tasks under a cpuset.
  293 + --
  294 + move_task()
  295 + {
  296 + for pid in $1
  297 + do
  298 + /bin/echo $pid >$2/tasks 2>/dev/null
  299 + echo -n $pid
  300 + echo -n " "
  301 + done
  302 + echo END
  303 + }
  304 +
  305 + G1_TASK=`cat ${G1}/tasks`
  306 + G2_TASK=`cat ${G2}/tasks`
  307 + move_task "${G1_TASK}" ${G2} &
  308 + --
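      +
      + In this script, G1 and G2 are assumed to point at the source and
      + destination cpuset directories, e.g.:
      +
      + G1=/opt/cpuset/01
      + G2=/opt/cpuset/02
      +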
  309 + 9.4 Memory hotplug.
  310 + The memory hotplug test is another good test.
  311 + To offline memory, do the following:
  312 + # echo offline > /sys/devices/system/memory/memoryXXX/state
  313 + (XXX is the number of the memory section)
  314 + This is an easy way to test page migration, too.
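      +
      + A simple offline/online stress loop might look like this (the section
      + number 8 below is illustrative; pick one that is removable on your
      + machine):
      +
      + while true; do
      +     echo offline > /sys/devices/system/memory/memory8/state
      +     echo online > /sys/devices/system/memory/memory8/state
      + done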
  315 +
  316 + 9.5 mkdir/rmdir
  317 + When using hierarchy, the mkdir/rmdir test should be done.
  318 + Use tests like the following.
  319 +
  320 + echo 1 >/opt/cgroup/01/memory.use_hierarchy
  321 + mkdir /opt/cgroup/01/child_a
  322 + mkdir /opt/cgroup/01/child_b
  323 +
  324 + Set a limit on 01.
  325 + Add a limit to 01/child_b.
  326 + Run jobs under child_a and child_b.
  327 +
  328 + Create/delete the following groups at random while the jobs are running:
  329 + /opt/cgroup/01/child_a/child_aa
  330 + /opt/cgroup/01/child_b/child_bb
  331 + /opt/cgroup/01/child_c
  332 +
  333 + Running new jobs in a new group is also good.
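      +
      + A minimal random create/delete loop under the layout above might be:
      +
      + while true; do
      +     mkdir /opt/cgroup/01/child_a/child_aa 2>/dev/null
      +     rmdir /opt/cgroup/01/child_a/child_aa 2>/dev/null
      +     mkdir /opt/cgroup/01/child_c 2>/dev/null
      +     rmdir /opt/cgroup/01/child_c 2>/dev/null
      + done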
  334 +
  335 + 9.6 Mount with other subsystems.
  336 + Mounting with other subsystems is a good test because there are
  337 + races and lock dependencies with other cgroup subsystems.
  338 +
  339 + example)
  340 + # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices
  341 +
  342 + and do task moves, mkdir, rmdir, etc. under this.
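      +
      + For example, after the combined mount above (group names illustrative):
      +
      + # mkdir /cgroup/A /cgroup/B
      + # echo $$ > /cgroup/A/tasks
      + # echo $$ > /cgroup/B/tasks
      + # rmdir /cgroup/A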
Documentation/cgroups/memory.txt
  1 +Memory Resource Controller
  2 +
  3 +NOTE: The Memory Resource Controller is generically referred to
  4 +as the memory controller in this document. Do not confuse the memory
  5 +controller used here with the memory controller that is used in hardware.
  6 +
  7 +Salient features
  8 +
  9 +a. Enable control of both RSS (mapped) and Page Cache (unmapped) pages
  10 +b. The infrastructure allows easy addition of other types of memory to control
  11 +c. Provides *zero overhead* for non memory controller users
  12 +d. Provides a double LRU: global memory pressure causes reclaim from the
  13 + global LRU; a cgroup, on hitting a limit, reclaims from the per-cgroup
  14 + LRU
  15 +
  16 +NOTE: Swap Cache (unmapped) is not accounted now.
  17 +
  18 +Benefits and Purpose of the memory controller
  19 +
  20 +The memory controller isolates the memory behaviour of a group of tasks
  21 +from the rest of the system. The article on LWN [12] mentions some probable
  22 +uses of the memory controller. The memory controller can be used to
  23 +
  24 +a. Isolate an application or a group of applications
  25 + Memory hungry applications can be isolated and limited to a smaller
  26 + amount of memory.
  27 +b. Create a cgroup with a limited amount of memory; this can be used
  28 + as a good alternative to booting with mem=XXXX.
  29 +c. Virtualization solutions can control the amount of memory they want
  30 + to assign to a virtual machine instance.
  31 +d. A CD/DVD burner could control the amount of memory used by the
  32 + rest of the system to ensure that burning does not fail due to lack
  33 + of available memory.
  34 +e. There are several other use cases; find one, or use the controller just
  35 + for fun (to learn and hack on the VM subsystem).
  36 +
  37 +1. History
  38 +
  39 +The memory controller has a long history. A request for comments for the memory
  40 +controller was posted by Balbir Singh [1]. At the time the RFC was posted
  41 +there were several implementations for memory control. The goal of the
  42 +RFC was to build consensus and agreement for the minimal features required
  43 +for memory control. The first RSS controller was posted by Balbir Singh [2]
  44 +in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
  45 +RSS controller. At OLS, at the resource management BoF, everyone suggested
  46 +that we handle both page cache and RSS together. Another request was raised
  47 +to allow user space handling of OOM. The current memory controller is
  48 +at version 6; it combines both mapped (RSS) and unmapped Page
  49 +Cache Control [11].
  50 +
  51 +2. Memory Control
  52 +
  53 +Memory is a unique resource in the sense that it is present in a limited
  54 +amount. If a task requires a lot of CPU processing, the task can spread
  55 +its processing over a period of hours, days, months or years, but with
  56 +memory, the same physical memory needs to be reused to accomplish the task.
  57 +
  58 +The memory controller implementation has been divided into phases. These
  59 +are:
  60 +
  61 +1. Memory controller
  62 +2. mlock(2) controller
  63 +3. Kernel user memory accounting and slab control
  64 +4. User mappings length controller
  65 +
  66 +The memory controller is the first controller developed.
  67 +
  68 +2.1. Design
  69 +
  70 +The core of the design is a counter called the res_counter. The res_counter
  71 +tracks the current memory usage and limit of the group of processes associated
  72 +with the controller. Each cgroup has a memory controller specific data
  73 +structure (mem_cgroup) associated with it.
  74 +
  75 +2.2. Accounting
  76 +
  77 + +--------------------+
  78 + | mem_cgroup |
  79 + | (res_counter) |
  80 + +--------------------+
  81 + / ^ \
  82 + / | \
  83 + +---------------+ | +---------------+
  84 + | mm_struct | |.... | mm_struct |
  85 + | | | | |
  86 + +---------------+ | +---------------+
  87 + |
  88 + + --------------+
  89 + |
  90 + +---------------+ +------+--------+
  91 + | page +----------> page_cgroup|
  92 + | | | |
  93 + +---------------+ +---------------+
  94 +
  95 + (Figure 1: Hierarchy of Accounting)
  96 +
  97 +
  98 +Figure 1 shows the important aspects of the controller
  99 +
  100 +1. Accounting happens per cgroup
  101 +2. Each mm_struct knows about which cgroup it belongs to
  102 +3. Each page has a pointer to the page_cgroup, which in turn knows the
  103 + cgroup it belongs to
  104 +
  105 +The accounting is done as follows: mem_cgroup_charge() is invoked to setup
  106 +the necessary data structures and check if the cgroup that is being charged
  107 +is over its limit. If it is then reclaim is invoked on the cgroup.
  108 +More details can be found in the reclaim section of this document.
  109 +If everything goes well, a page meta-data-structure called page_cgroup is
  110 +allocated and associated with the page. This routine also adds the page to
  111 +the per cgroup LRU.
  112 +
  113 +2.2.1 Accounting details
  114 +
  115 +All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
  116 +(Some pages which can never be reclaimed and will not be on the global LRU
  117 + are not accounted. We just account pages under usual VM management.)
  118 +
  119 +RSS pages are accounted at page_fault unless they've already been accounted
  120 +for earlier. A file page will be accounted for as Page Cache when it's
  121 +inserted into the inode (radix-tree). While it's mapped into the page tables of
  122 +processes, duplicate accounting is carefully avoided.
  123 +
  124 +An RSS page is unaccounted when it's fully unmapped. A PageCache page is
  125 +unaccounted when it's removed from the radix-tree.
  126 +
  127 +At page migration, accounting information is kept.
  128 +
  129 +Note: we just account pages-on-LRU because our purpose is to control the
  130 +amount of used pages; not-on-LRU pages tend to be out of control from the
  131 +VM's point of view.
  131 +
  132 +2.3 Shared Page Accounting
  133 +
  134 +Shared pages are accounted on the basis of the first touch approach. The
  135 +cgroup that first touches a page is accounted for the page. The principle
  136 +behind this approach is that a cgroup that aggressively uses a shared
  137 +page will eventually get charged for it (once it is uncharged from
  138 +the cgroup that brought it in -- this will happen on memory pressure).
  139 +
  140 +Exception: If CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not used:
  141 +When you do swapoff and thereby force swapped-out pages of shmem (tmpfs)
  142 +back into memory, charges for those pages are accounted against the
  143 +caller of swapoff rather than the users of shmem.
  144 +
  145 +
  146 +2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP)
  147 +
  148 +The swap extension allows you to record charges for swap. A swapped-in page
  149 +is charged back to the cgroup which originally owned it, if possible.
  149 +
  150 +When swap is accounted, the following files are added:
  151 + - memory.memsw.usage_in_bytes
  152 + - memory.memsw.limit_in_bytes
  153 +
  154 +The usage of mem+swap is limited by memsw.limit_in_bytes.
  155 +
  156 +Note: why 'mem+swap' rather than just swap?
  157 +The global LRU (kswapd) can swap out arbitrary pages. Swapping out means
  158 +moving the account from memory to swap; there is no change in the usage
  159 +of mem+swap.
  160 +
  161 +In other words, when we want to limit the usage of swap without affecting
  162 +the global LRU, a mem+swap limit is better than just limiting swap, from
  163 +the OS's point of view.
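    +
    +For example, assuming the group set up in section 3 below, one can cap
    +mem+swap at 6M while capping memory alone at 4M (values illustrative;
    +memsw.limit_in_bytes must not be smaller than limit_in_bytes):
    +
    +# echo 4M > /cgroups/0/memory.limit_in_bytes
    +# echo 6M > /cgroups/0/memory.memsw.limit_in_bytes
    +# cat /cgroups/0/memory.memsw.usage_in_bytes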
  164 +
  165 +2.5 Reclaim
  166 +
  167 +Each cgroup maintains a per cgroup LRU that consists of an active
  168 +and inactive list. When a cgroup goes over its limit, we first try
  169 +to reclaim memory from the cgroup so as to make space for the new
  170 +pages that the cgroup has touched. If the reclaim is unsuccessful,
  171 +an OOM routine is invoked to select and kill the bulkiest task in the
  172 +cgroup.
  173 +
  174 +The reclaim algorithm has not been modified for cgroups, except that
  175 +pages that are selected for reclaiming come from the per cgroup LRU
  176 +list.
  177 +
  178 +2.6 Locking
  179 +
  180 +The memory controller uses the following locking hierarchy:
  181 +
  182 +1. zone->lru_lock is used for selecting pages to be isolated
  183 +2. mem->per_zone->lru_lock protects the per cgroup LRU (per zone)
  184 +3. lock_page_cgroup() is used to protect page->page_cgroup
  185 +
  186 +3. User Interface
  187 +
  188 +0. Configuration
  189 +
  190 +a. Enable CONFIG_CGROUPS
  191 +b. Enable CONFIG_RESOURCE_COUNTERS
  192 +c. Enable CONFIG_CGROUP_MEM_RES_CTLR
  193 +
  194 +1. Prepare the cgroups
  195 +# mkdir -p /cgroups
  196 +# mount -t cgroup none /cgroups -o memory
  197 +
  198 +2. Make the new group and move bash into it
  199 +# mkdir /cgroups/0
  200 +# echo $$ > /cgroups/0/tasks
  201 +
  202 +Since now we're in the 0 cgroup, we can
  203 +alter the memory limit:
  204 +# echo 4M > /cgroups/0/memory.limit_in_bytes
  205 +
  206 +NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
  207 +mega or gigabytes.
  208 +
  209 +# cat /cgroups/0/memory.limit_in_bytes
  210 +4194304
  211 +
  212 +NOTE: The interface has now changed to display the usage in bytes
  213 +instead of pages.
  214 +
  215 +We can check the usage:
  216 +# cat /cgroups/0/memory.usage_in_bytes
  217 +1216512
  218 +
  219 +A successful write to this file does not guarantee that the limit is set
  220 +to the exact value written into the file. It can be adjusted due to a
  221 +number of factors, such as rounding up to page boundaries or the total
  222 +availability of memory on the system. The user is required to re-read
  223 +this file after a write to see the value actually committed by the kernel.
  224 +
  225 +# echo 1 > memory.limit_in_bytes
  226 +# cat memory.limit_in_bytes
  227 +4096
  228 +
  229 +The memory.failcnt field gives the number of times that the cgroup limit was
  230 +exceeded.
  231 +
  232 +The memory.stat file gives accounting information. Currently, the numbers
  233 +of cache, RSS and active/inactive pages are shown.
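    +
    +For example (the values shown are illustrative):
    +
    +# cat /cgroups/0/memory.failcnt
    +12
    +# cat /cgroups/0/memory.stat
    +cache 1203
    +rss 97
    +
    +(see section 5.2 for the full list of memory.stat fields)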
  234 +
  235 +4. Testing
  236 +
  237 +Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11].
  238 +Apart from that v6 has been tested with several applications and regular
  239 +daily use. The controller has also been tested on the PPC64, x86_64 and
  240 +UML platforms.
  241 +
  242 +4.1 Troubleshooting
  243 +
  244 +Sometimes a user might find that the application under a cgroup is
  245 +terminated. There are several causes for this:
  246 +
  247 +1. The cgroup limit is too low (just too low to do anything useful)
  248 +2. The user is using anonymous memory and swap is turned off or too low
  249 +
  250 +A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
  251 +some of the pages cached in the cgroup (page cache pages).
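    +
    +For example:
    +
    +# sync
    +# echo 1 > /proc/sys/vm/drop_caches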
  252 +
  253 +4.2 Task migration
  254 +
  255 +When a task migrates from one cgroup to another, its charge is not
  256 +carried forward. The pages allocated in the original cgroup still
  257 +remain charged to it; the charge is dropped when the page is freed or
  258 +reclaimed.
  259 +
  260 +4.3 Removing a cgroup
  261 +
  262 +A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
  263 +cgroup might have some charge associated with it, even though all
  264 +tasks have migrated away from it.
  265 +Such charges are freed (by default) or moved to the parent. When moved,
  266 +both RSS and CACHES are moved to the parent.
  267 +If both of them are busy, rmdir() returns -EBUSY. See section 5.1 also.
  268 +
  269 +Charges recorded in swap information are not updated at removal of a cgroup.
  270 +The recorded information is discarded, and a cgroup which uses the swap
  271 +entry (swapcache) will be charged as its new owner.
  272 +
  273 +
  274 +5. Misc. interfaces.
  275 +
  276 +5.1 force_empty
  277 + The memory.force_empty interface is provided to make a cgroup's memory
  278 + usage empty. You can use this interface only when the cgroup has no tasks.
  279 + When you write anything to this file:
  280 +
  281 + # echo 0 > memory.force_empty
  282 +
  283 + Almost all pages tracked by this memcg will be unmapped and freed. Some
  284 + pages cannot be freed because they are locked or in use. Such pages are
  285 + moved to the parent, and this cgroup becomes empty. But this may return
  286 + -EBUSY if the cgroup is too busy.
  287 +
  288 + The typical use case for this interface is calling it before rmdir().
  289 + Because rmdir() moves all pages to the parent, some out-of-use page caches
  290 + can be moved to the parent. If you want to avoid that, force_empty is useful.
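    +
    + For example, once all tasks have been moved out of the group
    + (paths as in section 3):
    +
    + # echo 0 > /cgroups/0/memory.force_empty
    + # rmdir /cgroups/0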
  291 +
  292 +5.2 stat file
  293 + The memory.stat file includes the following statistics (currently):
  294 + cache - # of pages from page-cache and shmem.
  295 + rss - # of pages from anonymous memory.
  296 + pgpgin - # of event of charging
  297 + pgpgout - # of event of uncharging
  298 + active_anon - # of pages on active lru of anon, shmem.
  299 + inactive_anon - # of pages on inactive lru of anon, shmem
  300 + active_file - # of pages on active lru of file-cache
  301 + inactive_file - # of pages on inactive lru of file cache
  302 + unevictable - # of pages which cannot be reclaimed (mlocked etc.)
  303 +
  304 + The following depend on CONFIG_DEBUG_VM.
  305 + inactive_ratio - VM internal parameter. (see mm/page_alloc.c)
  306 + recent_rotated_anon - VM internal parameter. (see mm/vmscan.c)
  307 + recent_rotated_file - VM internal parameter. (see mm/vmscan.c)
  308 + recent_scanned_anon - VM internal parameter. (see mm/vmscan.c)
  309 + recent_scanned_file - VM internal parameter. (see mm/vmscan.c)
  310 +
  311 + Memo:
  312 + recent_rotated means the recent frequency of LRU rotation.
  313 + recent_scanned means the recent # of scans over the LRU.
  314 + These are shown for easier debugging; please see the code for details.
  315 +
  316 +
  317 +5.3 swappiness
  318 + Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
  319 +
  320 + The swappiness of the following cgroups can't be changed:
  321 + - the root cgroup (it uses /proc/sys/vm/swappiness).
  322 + - a cgroup which uses hierarchy and has a child cgroup.
  323 + - a cgroup which uses hierarchy and is not the root of the hierarchy.
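    +
    + For example, to make a standalone group avoid swapping as much as
    + possible (the file name memory.swappiness follows the naming convention
    + of the other knobs):
    +
    + # echo 0 > /cgroups/0/memory.swappiness
    + # cat /cgroups/0/memory.swappiness
    + 0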
  324 +
  325 +
  326 +6. Hierarchy support
  327 +
  328 +The memory controller supports a deep hierarchy and hierarchical accounting.
  329 +The hierarchy is created by creating the appropriate cgroups in the
  330 +cgroup filesystem. Consider for example, the following cgroup filesystem
  331 +hierarchy
  332 +
  333 + root
  334 + / | \
  335 + / | \
  336 + a b c
  337 + | \
  338 + | \
  339 + d e
  340 +
  341 +In the diagram above, with hierarchical accounting enabled, all memory
  342 +usage of e is accounted to its ancestors up to the root (i.e., c and root)
  343 +that have memory.use_hierarchy enabled. If one of the ancestors goes over
  344 +its limit, the reclaim algorithm reclaims from the tasks in that ancestor
  345 +and in the ancestor's children.
  346 +
  347 +6.1 Enabling hierarchical accounting and reclaim
  348 +
  349 +The memory controller disables the hierarchy feature by default. It can
  350 +be enabled by writing 1 to the memory.use_hierarchy file of the root cgroup:
  351 +
  352 +# echo 1 > memory.use_hierarchy
  353 +
  354 +The feature can be disabled by
  355 +
  356 +# echo 0 > memory.use_hierarchy
  357 +
  358 +NOTE1: Enabling/disabling will fail if the cgroup already has other
  359 +cgroups created below it.
  360 +
  361 +NOTE2: This feature can be enabled/disabled per subtree.
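    +
    +As a concrete sketch of the diagram above (using the mount point from
    +section 3; directory names as in the figure; the 40M limit is illustrative):
    +
    +# echo 1 > /cgroups/memory.use_hierarchy
    +# mkdir /cgroups/c
    +# mkdir /cgroups/c/e
    +# echo 40M > /cgroups/c/memory.limit_in_bytes
    +# echo $$ > /cgroups/c/e/tasks
    +
    +Memory used by the shell is now charged to e, c and the root; if c goes
    +over 40M, reclaim is attempted from the tasks in c and its children.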
  362 +
  363 +7. TODO
  364 +
  365 +1. Add support for accounting huge pages (as a separate controller)
  366 +2. Make per-cgroup scanner reclaim not-shared pages first
  367 +3. Teach controller to account for shared-pages
  368 +4. Start reclamation in the background when the limit is
  369 + not yet hit but the usage is getting closer
  370 +
  371 +Summary
  372 +
  373 +Overall, the memory controller has been a stable controller and has been
  374 +commented and discussed quite extensively in the community.
  375 +
  376 +References
  377 +
  378 +1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
  379 +2. Singh, Balbir. Memory Controller (RSS Control),
  380 + http://lwn.net/Articles/222762/
  381 +3. Emelianov, Pavel. Resource controllers based on process cgroups
  382 + http://lkml.org/lkml/2007/3/6/198
  383 +4. Emelianov, Pavel. RSS controller based on process cgroups (v2)
  384 + http://lkml.org/lkml/2007/4/9/78
  385 +5. Emelianov, Pavel. RSS controller based on process cgroups (v3)
  386 + http://lkml.org/lkml/2007/5/30/244
  387 +6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/
  388 +7. Vaidyanathan, Srinivasan. Control Groups: Pagecache accounting and control
  389 + subsystem (v3), http://lwn.net/Articles/235534/
  390 +8. Singh, Balbir. RSS controller v2 test results (lmbench),
  391 + http://lkml.org/lkml/2007/5/17/232
  392 +9. Singh, Balbir. RSS controller v2 AIM9 results
  393 + http://lkml.org/lkml/2007/5/18/1
  394 +10. Singh, Balbir. Memory controller v6 test results,
  395 + http://lkml.org/lkml/2007/8/19/36
  396 +11. Singh, Balbir. Memory controller introduction (v6),
  397 + http://lkml.org/lkml/2007/8/17/69
  398 +12. Corbet, Jonathan. Controlling memory use in cgroups,
  399 + http://lwn.net/Articles/243795/
Documentation/cgroups/resource_counter.txt
  1 +
  2 + The Resource Counter
  3 +
  4 +The resource counter, declared at include/linux/res_counter.h,
  5 +is intended to facilitate resource management by controllers
  6 +by providing common infrastructure for accounting.
  7 +
  8 +This infrastructure includes the res_counter structure and routines
  9 +to work with it.
  10 +
  11 +
  12 +
  13 +1. Crucial parts of the res_counter structure
  14 +
  15 + a. unsigned long long usage
  16 +
  17 + The usage value shows the amount of a resource that is consumed
  18 + by a group at a given time. The units of measurement should be
  19 + determined by the controller that uses this counter. E.g. it can
  20 + be bytes, items or any other unit the controller operates on.
  21 +
  22 + b. unsigned long long max_usage
  23 +
  24 + The maximal value of the usage over time.
  25 +
  26 + This value is useful when gathering statistical information about
  27 + the particular group, as it shows the actual resource requirements
  28 + for a particular group, not just some usage snapshot.
  29 +
  30 + c. unsigned long long limit
  31 +
  32 + The maximal allowed amount of the resource the group may consume. If
  33 + the group requests more of the resource, so that the usage value
  34 + would exceed the limit, the allocation is rejected (see
  35 + the next section).
  36 +
  37 + d. unsigned long long failcnt
  38 +
  39 + The failcnt stands for "failures counter". This is the number of
  40 + resource allocation attempts that failed.
  41 +
  42 + e. spinlock_t lock
  43 +
  44 + Protects changes of the above values.
  45 +
  46 +
  47 +
  48 +2. Basic accounting routines
  49 +
  50 + a. void res_counter_init(struct res_counter *rc)
  51 +
  52 + Initializes the resource counter. As usual, should be the first
  53 + routine called for a new counter.
  54 +
  55 + b. int res_counter_charge[_locked]
  56 + (struct res_counter *rc, unsigned long val)
  57 +
  58 + When a resource is about to be allocated it has to be accounted
  59 + with the appropriate resource counter (controller should determine
  60 + which one to use on its own). This operation is called "charging".
  61 +
  62 + It is not very important which operation - resource allocation
  63 + or charging - is performed first, but:
  64 + * if the allocation is performed first, this may create a
  65 + temporary resource over-usage by the time the resource counter
  66 + is charged;
  67 + * if the charging is performed first, then it should be undone
  68 + (uncharged) on the error path, if one is taken.
  69 +
  70 + c. void res_counter_uncharge[_locked]
  71 + (struct res_counter *rc, unsigned long val)
  72 +
  73 + When a resource is released (freed) it should be de-accounted
  74 + from the resource counter it was accounted to. This is called
  75 + "uncharging".
  76 +
  77 + The _locked routines assume that res_counter->lock is already held by the caller.
  78 +
  79 +
  80 + 2.1 Other accounting routines
  81 +
  82 + There are more routines that may help you with common needs, like
  83 + checking whether the limit is reached or resetting the max_usage
  84 + value. They are all declared in include/linux/res_counter.h.
  85 +
  86 +
  87 +
  88 +3. Analyzing the resource counter registrations
  89 +
  90 + a. If the failcnt value constantly grows, this means that the counter's
  91 + limit is too tight. Either the group is misbehaving and consumes too
  92 + many resources, or the configuration is not suitable for the group
  93 + and the limit should be increased.
  94 +
  95 + b. The max_usage value can be used to quickly tune the group. One may
  96 + set the limit to its maximal value and either load the container
  97 + with a common usage pattern or leave it running for a while. After
  98 + this the max_usage value shows the amount of memory the container
  99 + would require during its common activity.
  100 +
  101 + Setting the limit a bit above this value gives a pretty good
  102 + configuration that works in most of the cases.
  103 +
  104 + c. If the max_usage is much less than the limit, but the failcnt value
  105 + is growing, then the group tries to allocate a big chunk of resource
  106 + at once.
  107 +
  108 + d. If the max_usage is much less than the limit, but the failcnt value
  109 + is 0, then the group has been given a limit that is too high, one it
  110 + does not require. It is better to lower the limit a bit, leaving more
  111 + of the resource for other groups.
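    +
    + For example, with the memory controller (whose file names follow the
    + rules in section 4 below), the tuning flow from point (b) might be
    + (values illustrative):
    +
    + # echo -1 > /cgroups/0/memory.limit_in_bytes
    + (run the common workload for a while, then:)
    + # cat /cgroups/0/memory.max_usage_in_bytes
    + 73728000
    + # echo 80M > /cgroups/0/memory.limit_in_bytes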
  112 +
  113 +
  114 +
  115 +4. Communication with the control groups subsystem (cgroups)
  116 +
  117 +All the resource controllers that are using cgroups and resource counters
  118 +should provide files (in the cgroup filesystem) to work with the resource
  119 +counter fields. They are recommended to adhere to the following rules:
  120 +
  121 + a. File names
  122 +
  123 + Field name File name
  124 + ---------------------------------------------------
  125 + usage usage_in_<unit_of_measurement>
  126 + max_usage max_usage_in_<unit_of_measurement>
  127 + limit limit_in_<unit_of_measurement>
  128 + failcnt failcnt
  129 + lock no file :)
  130 +
  131 + b. Reading from file should show the corresponding field value in the
  132 + appropriate format.
  133 +
  134 + c. Writing to file
  135 +
  136 + Field Expected behavior
  137 + ----------------------------------
  138 + usage prohibited
  139 + max_usage reset to usage
  140 + limit set the limit
  141 + failcnt reset to zero
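    +
    + For instance, with the memory controller these rules mean (values
    + illustrative):
    +
    + # cat memory.max_usage_in_bytes
    + 4325376
    + # echo 0 > memory.max_usage_in_bytes
    + # echo 0 > memory.failcnt
    +
    + where the two writes reset max_usage to the current usage and failcnt
    + to zero, regardless of the value written.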
  142 +
  143 +
  144 +
  145 +5. Usage example
  146 +
  147 + a. Declare a task group (take a look at cgroups subsystem for this) and
  148 + fold a res_counter into it
  149 +
  150 + struct my_group {
  151 + struct res_counter res;
  152 +
  153 + <other fields>
  154 + };
  155 +
  156 + b. Put hooks in resource allocation/release paths
  157 +
  158 + int alloc_something(...)
  159 + {
  160 + if (res_counter_charge(res_counter_ptr, amount) < 0)
  161 + return -ENOMEM;
  162 +
  163 + <allocate the resource and return to the caller>
  164 + }
  165 +
  166 + void release_something(...)
  167 + {
  168 + res_counter_uncharge(res_counter_ptr, amount);
  169 +
  170 + <release the resource>
  171 + }
  172 +
  173 + In order to keep the usage value self-consistent, both the
  174 + "res_counter_ptr" and the "amount" in release_something() should be
  175 + the same as they were in alloc_something() when the resource being
  176 + released was allocated.
  177 +
  178 + c. Provide a way to read res_counter values and to set them (the
  179 + cgroups subsystem can help with this).
  180 +
  181 + d. Compile and run :)
Documentation/controllers/memory.txt
1   -Memory Resource Controller
2   -
3   -NOTE: The Memory Resource Controller has been generically been referred
4   -to as the memory controller in this document. Do not confuse memory controller
5   -used here with the memory controller that is used in hardware.
6   -
7   -Salient features
8   -
9   -a. Enable control of both RSS (mapped) and Page Cache (unmapped) pages
10   -b. The infrastructure allows easy addition of other types of memory to control
11   -c. Provides *zero overhead* for non memory controller users
12   -d. Provides a double LRU: global memory pressure causes reclaim from the
13   - global LRU; a cgroup on hitting a limit, reclaims from the per
14   - cgroup LRU
15   -
16   -NOTE: Swap Cache (unmapped) is not accounted now.
17   -
18   -Benefits and Purpose of the memory controller
19   -
20   -The memory controller isolates the memory behaviour of a group of tasks
21   -from the rest of the system. The article on LWN [12] mentions some probable
22   -uses of the memory controller. The memory controller can be used to
23   -
24   -a. Isolate an application or a group of applications
25   - Memory hungry applications can be isolated and limited to a smaller
26   - amount of memory.
27   -b. Create a cgroup with limited amount of memory, this can be used
28   - as a good alternative to booting with mem=XXXX.
29   -c. Virtualization solutions can control the amount of memory they want
30   - to assign to a virtual machine instance.
31   -d. A CD/DVD burner could control the amount of memory used by the
32   - rest of the system to ensure that burning does not fail due to lack
33   - of available memory.
34   -e. There are several other use cases, find one or use the controller just
35   - for fun (to learn and hack on the VM subsystem).
36   -
37   -1. History
38   -
39   -The memory controller has a long history. A request for comments for the memory
40   -controller was posted by Balbir Singh [1]. At the time the RFC was posted
41   -there were several implementations for memory control. The goal of the
42   -RFC was to build consensus and agreement for the minimal features required
43   -for memory control. The first RSS controller was posted by Balbir Singh[2]
44   -in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
45   -RSS controller. At OLS, at the resource management BoF, everyone suggested
46   -that we handle both page cache and RSS together. Another request was raised
47   -to allow user space handling of OOM. The current memory controller is
48   -at version 6; it combines both mapped (RSS) and unmapped Page
49   -Cache Control [11].
50   -
51   -2. Memory Control
52   -
53   -Memory is a unique resource in the sense that it is present in a limited
54   -amount. If a task requires a lot of CPU processing, the task can spread
55   -its processing over a period of hours, days, months or years, but with
56   -memory, the same physical memory needs to be reused to accomplish the task.
57   -
58   -The memory controller implementation has been divided into phases. These
59   -are:
60   -
61   -1. Memory controller
62   -2. mlock(2) controller
63   -3. Kernel user memory accounting and slab control
64   -4. user mappings length controller
65   -
66   -The memory controller is the first controller developed.
67   -
68   -2.1. Design
69   -
70   -The core of the design is a counter called the res_counter. The res_counter
71   -tracks the current memory usage and limit of the group of processes associated
72   -with the controller. Each cgroup has a memory controller specific data
73   -structure (mem_cgroup) associated with it.
74   -
75   -2.2. Accounting
76   -
77   - +--------------------+
78   - | mem_cgroup |
79   - | (res_counter) |
80   - +--------------------+
81   - / ^ \
82   - / | \
83   - +---------------+ | +---------------+
84   - | mm_struct | |.... | mm_struct |
85   - | | | | |
86   - +---------------+ | +---------------+
87   - |
88   - + --------------+
89   - |
90   - +---------------+ +------+--------+
91   - | page +----------> page_cgroup|
92   - | | | |
93   - +---------------+ +---------------+
94   -
95   - (Figure 1: Hierarchy of Accounting)
96   -
97   -
98   -Figure 1 shows the important aspects of the controller
99   -
100   -1. Accounting happens per cgroup
101   -2. Each mm_struct knows about which cgroup it belongs to
102   -3. Each page has a pointer to the page_cgroup, which in turn knows the
103   - cgroup it belongs to
104   -
105   -The accounting is done as follows: mem_cgroup_charge() is invoked to setup
106   -the necessary data structures and check if the cgroup that is being charged
107   -is over its limit. If it is then reclaim is invoked on the cgroup.
108   -More details can be found in the reclaim section of this document.
109   -If everything goes well, a page meta-data-structure called page_cgroup is
110   -allocated and associated with the page. This routine also adds the page to
111   -the per cgroup LRU.
112   -
113   -2.2.1 Accounting details
114   -
115   -All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
116   -(some pages which never be reclaimable and will not be on global LRU
117   - are not accounted. we just accounts pages under usual vm management.)
118   -
119   -RSS pages are accounted at page_fault unless they've already been accounted
120   -for earlier. A file page will be accounted for as Page Cache when it's
121   -inserted into inode (radix-tree). While it's mapped into the page tables of
122   -processes, duplicate accounting is carefully avoided.
123   -
124   -A RSS page is unaccounted when it's fully unmapped. A PageCache page is
125   -unaccounted when it's removed from radix-tree.
126   -
127   -At page migration, accounting information is kept.
128   -
129   -Note: we just account pages-on-lru because our purpose is to control amount
130   -of used pages. not-on-lru pages are tend to be out-of-control from vm view.
131   -
132   -2.3 Shared Page Accounting
133   -
134   -Shared pages are accounted on the basis of the first touch approach. The
135   -cgroup that first touches a page is accounted for the page. The principle
136   -behind this approach is that a cgroup that aggressively uses a shared
137   -page will eventually get charged for it (once it is uncharged from
138   -the cgroup that brought it in -- this will happen on memory pressure).
139   -
140   -Exception: If CONFIG_CGROUP_CGROUP_MEM_RES_CTLR_SWAP is not used..
141   -When you do swapoff and make swapped-out pages of shmem(tmpfs) to
142   -be backed into memory in force, charges for pages are accounted against the
143   -caller of swapoff rather than the users of shmem.
144   -
145   -
146   -2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP)
147   -Swap Extension allows you to record charge for swap. A swapped-in page is
148   -charged back to original page allocator if possible.
149   -
150   -When swap is accounted, following files are added.
151   - - memory.memsw.usage_in_bytes.
152   - - memory.memsw.limit_in_bytes.
153   -
154   -usage of mem+swap is limited by memsw.limit_in_bytes.
155   -
156   -Note: why 'mem+swap' rather than swap.
157   -The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
158   -to move account from memory to swap...there is no change in usage of
159   -mem+swap.
160   -
161   -In other words, when we want to limit the usage of swap without affecting
162   -global LRU, mem+swap limit is better than just limiting swap from OS point
163   -of view.
164   -
165   -2.5 Reclaim
166   -
167   -Each cgroup maintains a per cgroup LRU that consists of an active
168   -and inactive list. When a cgroup goes over its limit, we first try
169   -to reclaim memory from the cgroup so as to make space for the new
170   -pages that the cgroup has touched. If the reclaim is unsuccessful,
171   -an OOM routine is invoked to select and kill the bulkiest task in the
172   -cgroup.
173   -
174   -The reclaim algorithm has not been modified for cgroups, except that
175   -pages that are selected for reclaiming come from the per cgroup LRU
176   -list.
177   -
178   -2. Locking
179   -
180   -The memory controller uses the following hierarchy
181   -
182   -1. zone->lru_lock is used for selecting pages to be isolated
183   -2. mem->per_zone->lru_lock protects the per cgroup LRU (per zone)
184   -3. lock_page_cgroup() is used to protect page->page_cgroup
185   -
186   -3. User Interface
187   -
188   -0. Configuration
189   -
190   -a. Enable CONFIG_CGROUPS
191   -b. Enable CONFIG_RESOURCE_COUNTERS
192   -c. Enable CONFIG_CGROUP_MEM_RES_CTLR
193   -
194   -1. Prepare the cgroups
195   -# mkdir -p /cgroups
196   -# mount -t cgroup none /cgroups -o memory
197   -
198   -2. Make the new group and move bash into it
199   -# mkdir /cgroups/0
200   -# echo $$ > /cgroups/0/tasks
201   -
202   -Since now we're in the 0 cgroup,
203   -We can alter the memory limit:
204   -# echo 4M > /cgroups/0/memory.limit_in_bytes
205   -
206   -NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
207   -mega or gigabytes.
208   -
209   -# cat /cgroups/0/memory.limit_in_bytes
210   -4194304
211   -
212   -NOTE: The interface has now changed to display the usage in bytes
213   -instead of pages
214   -
215   -We can check the usage:
216   -# cat /cgroups/0/memory.usage_in_bytes
217   -1216512
218   -
219   -A successful write to this file does not guarantee a successful set of
220   -this limit to the value written into the file. This can be due to a
221   -number of factors, such as rounding up to page boundaries or the total
222   -availability of memory on the system. The user is required to re-read
223   -this file after a write to guarantee the value committed by the kernel.
224   -
225   -# echo 1 > memory.limit_in_bytes
226   -# cat memory.limit_in_bytes
227   -4096
228   -
229   -The memory.failcnt field gives the number of times that the cgroup limit was
230   -exceeded.
231   -
232   -The memory.stat file gives accounting information. Now, the number of
233   -caches, RSS and Active pages/Inactive pages are shown.
234   -
235   -4. Testing
236   -
237   -Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11].
238   -Apart from that v6 has been tested with several applications and regular
239   -daily use. The controller has also been tested on the PPC64, x86_64 and
240   -UML platforms.
241   -
242   -4.1 Troubleshooting
243   -
244   -Sometimes a user might find that the application under a cgroup is
245   -terminated. There are several causes for this:
246   -
247   -1. The cgroup limit is too low (just too low to do anything useful)
248   -2. The user is using anonymous memory and swap is turned off or too low
249   -
250   -A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
251   -some of the pages cached in the cgroup (page cache pages).
252   -
253   -4.2 Task migration
254   -
255   -When a task migrates from one cgroup to another, it's charge is not
256   -carried forward. The pages allocated from the original cgroup still
257   -remain charged to it, the charge is dropped when the page is freed or
258   -reclaimed.
259   -
260   -4.3 Removing a cgroup
261   -
262   -A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
263   -cgroup might have some charge associated with it, even though all
264   -tasks have migrated away from it.
265   -Such charges are freed(at default) or moved to its parent. When moved,
266   -both of RSS and CACHES are moved to parent.
267   -If both of them are busy, rmdir() returns -EBUSY. See 5.1 Also.
268   -
269   -Charges recorded in swap information is not updated at removal of cgroup.
270   -Recorded information is discarded and a cgroup which uses swap (swapcache)
271   -will be charged as a new owner of it.
272   -
273   -
274   -5. Misc. interfaces.
275   -
276   -5.1 force_empty
277   - memory.force_empty interface is provided to make cgroup's memory usage empty.
278   - You can use this interface only when the cgroup has no tasks.
279   - When writing anything to this
280   -
281   - # echo 0 > memory.force_empty
282   -
283   - Almost all pages tracked by this memcg will be unmapped and freed. Some of
284   - pages cannot be freed because it's locked or in-use. Such pages are moved
285   - to parent and this cgroup will be empty. But this may return -EBUSY in
286   - some too busy case.
287   -
288   - Typical use case of this interface is that calling this before rmdir().
289   - Because rmdir() moves all pages to parent, some out-of-use page caches can be
290   - moved to the parent. If you want to avoid that, force_empty will be useful.
291   -
292   -5.2 stat file
293   - The memory.stat file currently includes the following statistics:
294   - cache - # of pages from the page cache and shmem.
295   - rss - # of pages from anonymous memory.
296   - pgpgin - # of charging events.
297   - pgpgout - # of uncharging events.
298   - active_anon - # of pages on the active LRU of anon and shmem.
299   - inactive_anon - # of pages on the inactive LRU of anon and shmem.
300   - active_file - # of pages on the active LRU of the file cache.
301   - inactive_file - # of pages on the inactive LRU of the file cache.
302   - unevictable - # of pages that cannot be reclaimed (mlocked etc.).
303   -
304   - The following entries depend on CONFIG_DEBUG_VM:
305   - inactive_ratio - VM internal parameter. (see mm/page_alloc.c)
306   - recent_rotated_anon - VM internal parameter. (see mm/vmscan.c)
307   - recent_rotated_file - VM internal parameter. (see mm/vmscan.c)
308   - recent_scanned_anon - VM internal parameter. (see mm/vmscan.c)
309   - recent_scanned_file - VM internal parameter. (see mm/vmscan.c)
310   -
311   - Memo:
312   - recent_rotated means the recent frequency of LRU rotation.
313   - recent_scanned means the recent # of scans of the LRU.
314   - These are shown to help debugging; see the code for their exact meanings.
315   -
316   -
317   -5.3 swappiness
318   - Similar to /proc/sys/vm/swappiness, but affecting this hierarchy of groups only.
319   -
320   - The swappiness of the following cgroups cannot be changed:
321   - - the root cgroup (it uses /proc/sys/vm/swappiness).
322   - - a cgroup which uses the hierarchy and has a child cgroup.
323   - - a cgroup which uses the hierarchy and is not the root of the hierarchy.
324   -
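      - For example (the value 30 is only illustrative; the valid range mirrors
      - /proc/sys/vm/swappiness):
      -
      - # echo 30 > memory.swappiness
      - # cat memory.swappiness
      - 30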
325   -
326   -6. Hierarchy support
327   -
328   -The memory controller supports a deep hierarchy and hierarchical accounting.
329   -The hierarchy is created by creating the appropriate cgroups in the
330   -cgroup filesystem. Consider for example, the following cgroup filesystem
331   -hierarchy
332   -
333   - root
334   - / | \
335   - / | \
336   - a b c
337   - | \
338   - | \
339   - d e
340   -
341   -In the diagram above, with hierarchical accounting enabled, all memory
342   -usage of e is accounted to its ancestors, up to the root (i.e. c and root),
343   -that have memory.use_hierarchy enabled. If one of the ancestors goes over its
344   -limit, the reclaim algorithm reclaims from the tasks in the ancestor and the
345   -children of the ancestor.
346   -
347   -6.1 Enabling hierarchical accounting and reclaim
348   -
349   -The memory controller disables the hierarchy feature by default. Support
350   -can be enabled by writing 1 to the memory.use_hierarchy file of the root cgroup:
351   -
352   -# echo 1 > memory.use_hierarchy
353   -
354   -The feature can be disabled by
355   -
356   -# echo 0 > memory.use_hierarchy
357   -
358   -NOTE1: Enabling/disabling will fail if the cgroup already has other
359   -cgroups created below it.
360   -
361   -NOTE2: This feature can be enabled/disabled per subtree.
362   -
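      -A minimal sketch of building the "c -> d" branch of the diagram above
      -(mount point and cgroup names are hypothetical):
      -
      -# mount -t cgroup -omemory none /cgroups
      -# echo 1 > /cgroups/memory.use_hierarchy
      -# mkdir /cgroups/c
      -# mkdir /cgroups/c/d
      -
      -With this setup, memory charged to tasks in c/d is also accounted to c
      -and to the root.
      -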
363   -7. TODO
364   -
365   -1. Add support for accounting huge pages (as a separate controller)
366   -2. Make per-cgroup scanner reclaim not-shared pages first
367   -3. Teach controller to account for shared-pages
368   -4. Start reclamation in the background when the limit is
369   - not yet hit but the usage is getting closer
370   -
371   -Summary
372   -
373   -Overall, the memory controller has been a stable controller and has been
374   -reviewed and discussed quite extensively in the community.
375   -
376   -References
377   -
378   -1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
379   -2. Singh, Balbir. Memory Controller (RSS Control),
380   - http://lwn.net/Articles/222762/
381   -3. Emelianov, Pavel. Resource controllers based on process cgroups,
382   -   http://lkml.org/lkml/2007/3/6/198
383   -4. Emelianov, Pavel. RSS controller based on process cgroups (v2),
384   -   http://lkml.org/lkml/2007/4/9/78
385   -5. Emelianov, Pavel. RSS controller based on process cgroups (v3),
386   -   http://lkml.org/lkml/2007/5/30/244
387   -6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/
388   -7. Vaidyanathan, Srinivasan. Control Groups: Pagecache accounting and control
389   -   subsystem (v3), http://lwn.net/Articles/235534/
390   -8. Singh, Balbir. RSS controller v2 test results (lmbench),
391   -   http://lkml.org/lkml/2007/5/17/232
392   -9. Singh, Balbir. RSS controller v2 AIM9 results,
393   -   http://lkml.org/lkml/2007/5/18/1
394   -10. Singh, Balbir. Memory controller v6 test results,
395   -    http://lkml.org/lkml/2007/8/19/36
396   -11. Singh, Balbir. Memory controller introduction (v6),
397   -    http://lkml.org/lkml/2007/8/17/69
398   -12. Corbet, Jonathan. Controlling memory use in cgroups,
399   -    http://lwn.net/Articles/243795/
Documentation/controllers/resource_counter.txt
1   -
2   - The Resource Counter
3   -
4   -The resource counter, declared at include/linux/res_counter.h,
5   -is supposed to facilitate the resource management by controllers
6   -by providing common stuff for accounting.
7   -
8   -This "stuff" includes the res_counter structure and routines
9   -to work with it.
10   -
11   -
12   -
13   -1. Crucial parts of the res_counter structure
14   -
15   - a. unsigned long long usage
16   -
17   - The usage value shows the amount of a resource that is consumed
18   - by a group at a given time. The units of measurement should be
19   - determined by the controller that uses this counter. E.g. it can
20   - be bytes, items or any other unit the controller operates on.
21   -
22   - b. unsigned long long max_usage
23   -
24   - The maximal value of the usage over time.
25   -
26   - This value is useful when gathering statistical information about
27   - the particular group, as it shows the actual resource requirements
28   - for a particular group, not just some usage snapshot.
29   -
30   - c. unsigned long long limit
31   -
32   - The maximal amount of the resource that the group is allowed to
33   - consume. If the group requests more resources, so that the usage
34   - value would exceed the limit, the resource allocation is rejected
35   - (see the next section).
36   -
37   - d. unsigned long long failcnt
38   -
39   - The failcnt stands for "failures counter". This is the number of
40   - resource allocation attempts that failed.
41   -
42   - e. spinlock_t lock
43   -
44   - Protects changes of the above values.
45   -
46   -
47   -
48   -2. Basic accounting routines
49   -
50   - a. void res_counter_init(struct res_counter *rc)
51   -
52   - Initializes the resource counter. As usual, should be the first
53   - routine called for a new counter.
54   -
55   - b. int res_counter_charge[_locked]
56   - (struct res_counter *rc, unsigned long val)
57   -
58   - When a resource is about to be allocated, it has to be accounted
59   - with the appropriate resource counter (the controller should determine
60   - which one to use on its own). This operation is called "charging".
61   -
62   - It is not very important which operation - resource allocation
63   - or charging - is performed first, but note that
64   - * if the allocation is performed first, this may create a
65   - temporary resource over-usage by the time the resource counter is
66   - charged;
67   - * if the charging is performed first, then it should be uncharged
68   - on the error path (if taken); see the sketch after this list.
69   -
70   - c. void res_counter_uncharge[_locked]
71   - (struct res_counter *rc, unsigned long val)
72   -
73   - When a resource is released (freed) it should be de-accounted
74   - from the resource counter it was accounted to. This is called
75   - "uncharging".
76   -
77   - The _locked routines imply that the res_counter->lock is taken.
78   -
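      - A minimal sketch of the "charge first" ordering from item (b) above
      - (the do_allocate() helper and its error convention are hypothetical):
      -
      - int alloc_charged(struct res_counter *rc, unsigned long amount)
      - {
      -         if (res_counter_charge(rc, amount) < 0)
      -                 return -ENOMEM; /* limit hit, charge rejected */
      -
      -         if (do_allocate(amount) != 0) {
      -                 /* error path: drop the charge taken above */
      -                 res_counter_uncharge(rc, amount);
      -                 return -ENOMEM;
      -         }
      -
      -         return 0;
      - }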
79   -
80   - 2.1 Other accounting routines
81   -
82   - There are more routines that may help you with common needs, like
83   - checking whether the limit is reached or resetting the max_usage
84   - value. They are all declared in include/linux/res_counter.h.
85   -
86   -
87   -
88   -3. Analyzing the resource counter readings
89   -
90   - a. If the failcnt value constantly grows, this means that the counter's
91   - limit is too tight. Either the group is misbehaving and consumes too
92   - many resources, or the configuration is not suitable for the group
93   - and the limit should be increased.
94   -
95   - b. The max_usage value can be used to quickly tune the group. One may
96   - set the limits to maximal values and either load the container with
97   - a common pattern or leave it running for a while. After this the max_usage
98   - value shows the amount of memory the container would require during
99   - its common activity.
100   -
101   - Setting the limit a bit above this value gives a pretty good
102   - configuration that works in most of the cases.
103   -
104   - c. If the max_usage is much less than the limit, but the failcnt value
105   - is growing, then the group tries to allocate a big chunk of the resource
106   - at once.
107   -
108   - d. If the max_usage is much less than the limit, but the failcnt value
109   - is 0, then this group has been given a limit that is too high and that
110   - it does not require. It is better to lower the limit a bit, leaving
111   - more of the resource for other groups.
112   -
113   -
114   -
115   -4. Communication with the control groups subsystem (cgroups)
116   -
117   -All the resource controllers that are using cgroups and resource counters
118   -should provide files (in the cgroup filesystem) to work with the resource
119   -counter fields. They are recommended to adhere to the following rules:
120   -
121   - a. File names
122   -
123   - Field name File name
124   - ---------------------------------------------------
125   - usage usage_in_<unit_of_measurement>
126   - max_usage max_usage_in_<unit_of_measurement>
127   - limit limit_in_<unit_of_measurement>
128   - failcnt failcnt
129   - lock no file :)
130   -
131   - b. Reading from file should show the corresponding field value in the
132   - appropriate format.
133   -
134   - c. Writing to file
135   -
136   - Field Expected behavior
137   - ----------------------------------
138   - usage prohibited
139   - max_usage reset to usage
140   - limit set the limit
141   - failcnt reset to zero
142   -
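      - For example, the memory controller measures in bytes, so following
      - these rules its files are named:
      -
      - memory.usage_in_bytes
      - memory.max_usage_in_bytes
      - memory.limit_in_bytes
      - memory.failcnt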
143   -
144   -
145   -5. Usage example
146   -
147   - a. Declare a task group (take a look at cgroups subsystem for this) and
148   - fold a res_counter into it
149   -
150   - struct my_group {
151   - struct res_counter res;
152   -
153   - <other fields>
154   - };
155   -
156   - b. Put hooks in resource allocation/release paths
157   -
158   - int alloc_something(...)
159   - {
160   - if (res_counter_charge(res_counter_ptr, amount) < 0)
161   - return -ENOMEM;
162   -
163   - <allocate the resource and return to the caller>
164   - }
165   -
166   - void release_something(...)
167   - {
168   - res_counter_uncharge(res_counter_ptr, amount);
169   -
170   - <release the resource>
171   - }
172   -
173   - In order to keep the usage value self-consistent, both the
174   - "res_counter_ptr" and the "amount" in release_something() should be
175   - the same as they were in alloc_something() when the resource being
176   - released was allocated.
177   -
178   - c. Provide a way to read the res_counter values and to set them (the
179   - cgroups subsystem can still help with this).
180   -
181   - d. Compile and run :)
Documentation/cpusets.txt
1   - CPUSETS
2   - -------
3   -
4   -Copyright (C) 2004 BULL SA.
5   -Written by Simon.Derr@bull.net
6   -
7   -Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
8   -Modified by Paul Jackson <pj@sgi.com>
9   -Modified by Christoph Lameter <clameter@sgi.com>
10   -Modified by Paul Menage <menage@google.com>
11   -Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
12   -
13   -CONTENTS:
14   -=========
15   -
16   -1. Cpusets
17   - 1.1 What are cpusets ?
18   - 1.2 Why are cpusets needed ?
19   - 1.3 How are cpusets implemented ?
20   - 1.4 What are exclusive cpusets ?
21   - 1.5 What is memory_pressure ?
22   - 1.6 What is memory spread ?
23   - 1.7 What is sched_load_balance ?
24   - 1.8 What is sched_relax_domain_level ?
25   - 1.9 How do I use cpusets ?
26   -2. Usage Examples and Syntax
27   - 2.1 Basic Usage
28   - 2.2 Adding/removing cpus
29   - 2.3 Setting flags
30   - 2.4 Attaching processes
31   -3. Questions
32   -4. Contact
33   -
34   -1. Cpusets
35   -==========
36   -
37   -1.1 What are cpusets ?
38   -----------------------
39   -
40   -Cpusets provide a mechanism for assigning a set of CPUs and Memory
41   -Nodes to a set of tasks. In this document "Memory Node" refers to
42   -an on-line node that contains memory.
43   -
44   -Cpusets constrain the CPU and Memory placement of tasks to only
45   -the resources within a task's current cpuset. They form a nested
46   -hierarchy visible in a virtual file system. These are the essential
47   -hooks, beyond what is already present, required to manage dynamic
48   -job placement on large systems.
49   -
50   -Cpusets use the generic cgroup subsystem described in
51   -Documentation/cgroups/cgroups.txt.
52   -
53   -Requests by a task, using the sched_setaffinity(2) system call to
54   -include CPUs in its CPU affinity mask, and using the mbind(2) and
55   -set_mempolicy(2) system calls to include Memory Nodes in its memory
56   -policy, are both filtered through that task's cpuset, filtering out any
57   -CPUs or Memory Nodes not in that cpuset. The scheduler will not
58   -schedule a task on a CPU that is not allowed in its cpus_allowed
59   -vector, and the kernel page allocator will not allocate a page on a
60   -node that is not allowed in the requesting task's mems_allowed vector.
61   -
62   -User level code may create and destroy cpusets by name in the cgroup
63   -virtual file system, manage the attributes and permissions of these
64   -cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
65   -specify and query to which cpuset a task is assigned, and list the
66   -task pids assigned to a cpuset.
67   -
68   -
69   -1.2 Why are cpusets needed ?
70   -----------------------------
71   -
72   -The management of large computer systems, with many processors (CPUs),
73   -complex memory cache hierarchies and multiple Memory Nodes having
74   -non-uniform access times (NUMA) presents additional challenges for
75   -the efficient scheduling and memory placement of processes.
76   -
77   -Frequently more modest sized systems can be operated with adequate
78   -efficiency just by letting the operating system automatically share
79   -the available CPU and Memory resources amongst the requesting tasks.
80   -
81   -But larger systems, which benefit more from careful processor and
82   -memory placement to reduce memory access times and contention,
83   -and which typically represent a larger investment for the customer,
84   -can benefit from explicitly placing jobs on properly sized subsets of
85   -the system.
86   -
87   -This can be especially valuable on:
88   -
89   - * Web Servers running multiple instances of the same web application,
90   - * Servers running different applications (for instance, a web server
91   - and a database), or
92   - * NUMA systems running large HPC applications with demanding
93   - performance characteristics.
94   -
95   -These subsets, or "soft partitions", must be able to be dynamically
96   -adjusted, as the job mix changes, without impacting other concurrently
97   -executing jobs. The location of the running jobs' pages may also be moved
98   -when the memory locations are changed.
99   -
100   -The kernel cpuset patch provides the minimum essential kernel
101   -mechanisms required to efficiently implement such subsets. It
102   -leverages existing CPU and Memory Placement facilities in the Linux
103   -kernel to avoid any additional impact on the critical scheduler or
104   -memory allocator code.
105   -
106   -
107   -1.3 How are cpusets implemented ?
108   ----------------------------------
109   -
110   -Cpusets provide a Linux kernel mechanism to constrain which CPUs and
111   -Memory Nodes are used by a process or set of processes.
112   -
113   -The Linux kernel already has a pair of mechanisms to specify on which
114   -CPUs a task may be scheduled (sched_setaffinity) and on which Memory
115   -Nodes it may obtain memory (mbind, set_mempolicy).
116   -
117   -Cpusets extends these two mechanisms as follows:
118   -
119   - - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
120   - kernel.
121   - - Each task in the system is attached to a cpuset, via a pointer
122   - in the task structure to a reference counted cgroup structure.
123   - - Calls to sched_setaffinity are filtered to just those CPUs
124   - allowed in that task's cpuset.
125   - - Calls to mbind and set_mempolicy are filtered to just
126   - those Memory Nodes allowed in that task's cpuset.
127   - - The root cpuset contains all the system's CPUs and Memory
128   - Nodes.
129   - - For any cpuset, one can define child cpusets containing a subset
130   - of the parent's CPU and Memory Node resources.
131   - - The hierarchy of cpusets can be mounted at /dev/cpuset, for
132   - browsing and manipulation from user space.
133   - - A cpuset may be marked exclusive, which ensures that no other
134   - cpuset (except direct ancestors and descendants) may contain
135   - any overlapping CPUs or Memory Nodes.
136   - - You can list all the tasks (by pid) attached to any cpuset.
137   -
138   -The implementation of cpusets requires a few simple hooks
139   -into the rest of the kernel, none in performance critical paths:
140   -
141   - - in init/main.c, to initialize the root cpuset at system boot.
142   - - in fork and exit, to attach and detach a task from its cpuset.
143   - - in sched_setaffinity, to mask the requested CPUs by what's
144   - allowed in that task's cpuset.
145   - - in sched.c migrate_all_tasks(), to keep migrating tasks within
146   - the CPUs allowed by their cpuset, if possible.
147   - - in the mbind and set_mempolicy system calls, to mask the requested
148   - Memory Nodes by what's allowed in that task's cpuset.
149   - - in page_alloc.c, to restrict memory to allowed nodes.
150   - - in vmscan.c, to restrict page recovery to the current cpuset.
151   -
152   -You should mount the "cgroup" filesystem type in order to enable
153   -browsing and modifying the cpusets presently known to the kernel. No
154   -new system calls are added for cpusets - all support for querying and
155   -modifying cpusets is via this cpuset file system.
156   -
157   -The /proc/<pid>/status file for each task has four added lines,
158   -displaying the task's cpus_allowed (on which CPUs it may be scheduled)
159   -and mems_allowed (on which Memory Nodes it may obtain memory),
160   -in the two formats seen in the following example:
161   -
162   - Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff
163   - Cpus_allowed_list: 0-127
164   - Mems_allowed: ffffffff,ffffffff
165   - Mems_allowed_list: 0-63
166   -
167   -Each cpuset is represented by a directory in the cgroup file system
168   -containing (on top of the standard cgroup files) the following
169   -files describing that cpuset:
170   -
171   - - cpus: list of CPUs in that cpuset
172   - - mems: list of Memory Nodes in that cpuset
173   - - memory_migrate flag: if set, move pages to the cpuset's nodes
174   - - cpu_exclusive flag: is cpu placement exclusive?
175   - - mem_exclusive flag: is memory placement exclusive?
176   - - mem_hardwall flag: is memory allocation hardwalled?
177   - - memory_pressure: measure of how much paging pressure is in the cpuset
178   -
179   -In addition, the root cpuset only has the following file:
180   - - memory_pressure_enabled flag: compute memory_pressure?
181   -
182   -New cpusets are created using the mkdir system call or shell
183   -command. The properties of a cpuset, such as its flags, allowed
184   -CPUs and Memory Nodes, and attached tasks, are modified by writing
185   -to the appropriate file in that cpuset's directory, as listed above.
186   -
187   -The named hierarchical structure of nested cpusets allows partitioning
188   -a large system into nested, dynamically changeable, "soft-partitions".
189   -
190   -Each task is attached to a cpuset, and that attachment is automatically
191   -inherited at fork by any children of that task. This allows organizing the
192   -work load on a system into related sets of tasks, each set constrained
193   -to using the CPUs and Memory Nodes of a particular cpuset. A task
194   -may be re-attached to any other cpuset, if allowed by the permissions
195   -on the necessary cpuset file system directories.
196   -
197   -Such management of a system "in the large" integrates smoothly with
198   -the detailed placement done on individual tasks and memory regions
199   -using the sched_setaffinity, mbind and set_mempolicy system calls.
200   -
201   -The following rules apply to each cpuset:
202   -
203   - - Its CPUs and Memory Nodes must be a subset of its parent's.
204   - - It can't be marked exclusive unless its parent is.
205   - - If its cpu or memory is exclusive, they may not overlap those of any sibling.
206   -
207   -These rules, and the natural hierarchy of cpusets, enable efficient
208   -enforcement of the exclusive guarantee, without having to scan all
209   -cpusets every time any of them changes to ensure nothing overlaps an
210   -exclusive cpuset. Also, the use of a Linux virtual file system (vfs)
211   -to represent the cpuset hierarchy provides for a familiar permission
212   -and name space for cpusets, with a minimum of additional kernel code.
213   -
214   -The cpus and mems files in the root (top_cpuset) cpuset are
215   -read-only. The cpus file automatically tracks the value of
216   -cpu_online_map using a CPU hotplug notifier, and the mems file
217   -automatically tracks the value of node_states[N_HIGH_MEMORY]--i.e.,
218   -nodes with memory--using the cpuset_track_online_nodes() hook.
219   -
220   -
221   -1.4 What are exclusive cpusets ?
222   ---------------------------------
223   -
224   -If a cpuset is cpu or mem exclusive, no other cpuset, other than
225   -a direct ancestor or descendant, may share any of the same CPUs or
226   -Memory Nodes.
227   -
228   -A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled",
229   -i.e. it restricts kernel allocations for page, buffer and other data
230   -commonly shared by the kernel across multiple users. All cpusets,
231   -whether hardwalled or not, restrict allocations of memory for user
232   -space. This enables configuring a system so that several independent
233   -jobs can share common kernel data, such as file system pages, while
234   -isolating each job's user allocation in its own cpuset. To do this,
235   -construct a large mem_exclusive cpuset to hold all the jobs, and
236   -construct child, non-mem_exclusive cpusets for each individual job.
237   -Only a small amount of typical kernel memory, such as requests from
238   -interrupt handlers, is allowed to be taken outside even a
239   -mem_exclusive cpuset.
240   -
241   -
242   -1.5 What is memory_pressure ?
243   ------------------------------
244   -The memory_pressure of a cpuset provides a simple per-cpuset metric
245   -of the rate at which the tasks in a cpuset are attempting to free up
246   -in-use memory on the nodes of the cpuset to satisfy additional memory
247   -requests.
248   -
249   -This enables batch managers monitoring jobs running in dedicated
250   -cpusets to efficiently detect what level of memory pressure that job
251   -is causing.
252   -
253   -This is useful both on tightly managed systems running a wide mix of
254   -submitted jobs, which may choose to terminate or re-prioritize jobs that
255   -are trying to use more memory than allowed on the nodes assigned to them,
256   -and with tightly coupled, long running, massively parallel scientific
257   -computing jobs that will dramatically fail to meet required performance
258   -goals if they start to use more memory than allowed to them.
259   -
260   -This mechanism provides a very economical way for the batch manager
261   -to monitor a cpuset for signs of memory pressure. It's up to the
262   -batch manager or other user code to decide what to do about it and
263   -take action.
264   -
265   -==> Unless this feature is enabled by writing "1" to the special file
266   - /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
267   - code of __alloc_pages() for this metric reduces to simply noticing
268   - that the cpuset_memory_pressure_enabled flag is zero. So only
269   - systems that enable this feature will compute the metric.
270   -
271   -Why a per-cpuset, running average:
272   -
273   - Because this meter is per-cpuset, rather than per-task or mm,
274   - the system load imposed by a batch scheduler monitoring this
275   - metric is sharply reduced on large systems, because a scan of
276   - the tasklist can be avoided on each set of queries.
277   -
278   - Because this meter is a running average, instead of an accumulating
279   - counter, a batch scheduler can detect memory pressure with a
280   - single read, instead of having to read and accumulate results
281   - for a period of time.
282   -
283   - Because this meter is per-cpuset rather than per-task or mm,
284   - the batch scheduler can obtain the key information, memory
285   - pressure in a cpuset, with a single read, rather than having to
286   - query and accumulate results over all the (dynamically changing)
287   - set of tasks in the cpuset.
288   -
289   -A per-cpuset simple digital filter (requires a spinlock and 3 words
290   -of data per-cpuset) is kept, and updated by any task attached to that
291   -cpuset, if it enters the synchronous (direct) page reclaim code.
292   -
293   -A per-cpuset file provides an integer number representing the recent
294   -(half-life of 10 seconds) rate of direct page reclaims caused by
295   -the tasks in the cpuset, in units of reclaims attempted per second,
296   -times 1000.
297   -
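      -A minimal sketch of using this metric (paths assume the cpuset hierarchy
      -is mounted at /dev/cpuset, as in section 1.9; the cpuset name and the
      -value shown are illustrative):
      -
      - # /bin/echo 1 > /dev/cpuset/memory_pressure_enabled
      - # cat /dev/cpuset/Charlie/memory_pressure
      - 0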
298   -
299   -1.6 What is memory spread ?
300   ----------------------------
301   -There are two boolean flag files per cpuset that control where the
302   -kernel allocates pages for the file system buffers and related
303   -in-kernel data structures. They are called 'memory_spread_page' and
304   -'memory_spread_slab'.
305   -
306   -If the per-cpuset boolean flag file 'memory_spread_page' is set, then
307   -the kernel will spread the file system buffers (page cache) evenly
308   -over all the nodes that the faulting task is allowed to use, instead
309   -of preferring to put those pages on the node where the task is running.
310   -
311   -If the per-cpuset boolean flag file 'memory_spread_slab' is set,
312   -then the kernel will spread some file system related slab caches,
313   -such as those for inodes and dentries, evenly over all the nodes that
314   -the faulting task is allowed to use, instead of preferring to put those
315   -pages on the node where the task is running.
316   -
317   -The setting of these flags does not affect anonymous data segment or
318   -stack segment pages of a task.
319   -
320   -By default, both kinds of memory spreading are off, and memory
321   -pages are allocated on the node local to where the task is running,
322   -except perhaps as modified by the task's NUMA mempolicy or cpuset
323   -configuration, so long as sufficient free memory pages are available.
324   -
325   -When new cpusets are created, they inherit the memory spread settings
326   -of their parent.
327   -
328   -Setting memory spreading causes allocations for the affected page
329   -or slab caches to ignore the task's NUMA mempolicy and be spread
330   -instead. Tasks using mbind() or set_mempolicy() calls to set NUMA
331   -mempolicies will not notice any change in these calls as a result of
332   -their containing task's memory spread settings. If memory spreading
333   -is turned off, then the currently specified NUMA mempolicy once again
334   -applies to memory page allocations.
335   -
336   -Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag
337   -files. By default they contain "0", meaning that the feature is off
338   -for that cpuset. If a "1" is written to that file, then that turns
339   -the named feature on.
340   -
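      -A minimal sketch, run inside a cpuset directory (see section 2 for
      -mounting the hierarchy):
      -
      - # /bin/echo 1 > memory_spread_page
      - # cat memory_spread_page
      - 1
      -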
341   -The implementation is simple.
342   -
343   -Setting the flag 'memory_spread_page' turns on a per-process flag
344   -PF_SPREAD_PAGE for each task that is in that cpuset or subsequently
345   -joins that cpuset. The page allocation calls for the page cache
346   -are modified to perform an inline check for this PF_SPREAD_PAGE task
347   -flag, and if it is set, a call to a new routine cpuset_mem_spread_node()
348   -returns the node to prefer for the allocation.
349   -
350   -Similarly, setting 'memory_spread_slab' turns on the flag
351   -PF_SPREAD_SLAB, and appropriately marked slab caches will allocate
352   -pages from the node returned by cpuset_mem_spread_node().
353   -
354   -The cpuset_mem_spread_node() routine is also simple. It uses the
355   -value of a per-task rotor cpuset_mem_spread_rotor to select the next
356   -node in the current task's mems_allowed to prefer for the allocation.
357   -
358   -This memory placement policy is also known (in other contexts) as
359   -round-robin or interleave.
360   -
361   -This policy can provide substantial improvements for jobs that need
362   -to place thread local data on the corresponding node, but that need
363   -to access large file system data sets that need to be spread across
364   -the several nodes in the job's cpuset in order to fit. Without this
365   -policy, especially for jobs that might have one thread reading in the
366   -data set, the memory allocation across the nodes in the job's cpuset
367   -can become very uneven.
368   -
369   -1.7 What is sched_load_balance ?
370   ---------------------------------
371   -
372   -The kernel scheduler (kernel/sched.c) automatically load balances
373   -tasks. If one CPU is underutilized, kernel code running on that
374   -CPU will look for tasks on other more overloaded CPUs and move those
375   -tasks to itself, within the constraints of such placement mechanisms
376   -as cpusets and sched_setaffinity.
377   -
378   -The algorithmic cost of load balancing and its impact on key shared
379   -kernel data structures such as the task list increases more than
380   -linearly with the number of CPUs being balanced. So the scheduler
381   -has support to partition the system's CPUs into a number of sched
382   -domains such that it only load balances within each sched domain.
383   -Each sched domain covers some subset of the CPUs in the system;
384   -no two sched domains overlap; some CPUs might not be in any sched
385   -domain and hence won't be load balanced.
386   -
387   -Put simply, it costs less to balance between two smaller sched domains
388   -than one big one, but doing so means that overloads in one of the
389   -two domains won't be load balanced to the other one.
390   -
391   -By default, there is one sched domain covering all CPUs, except those
392   -marked isolated using the kernel boot time "isolcpus=" argument.
393   -
394   -This default load balancing across all CPUs is not well suited for
395   -the following two situations:
396   - 1) On large systems, load balancing across many CPUs is expensive.
397   - If the system is managed using cpusets to place independent jobs
398   - on separate sets of CPUs, full load balancing is unnecessary.
399   - 2) Systems supporting realtime on some CPUs need to minimize
400   - system overhead on those CPUs, including avoiding task load
401   - balancing if that is not needed.
402   -
403   -When the per-cpuset flag "sched_load_balance" is enabled (the default
404   -setting), it requests that all the CPUs in that cpuset's allowed 'cpus'
405   -be contained in a single sched domain, ensuring that load balancing
406   -can move a task (not otherwise pinned, as by sched_setaffinity)
407   -from any CPU in that cpuset to any other.
408   -
409   -When the per-cpuset flag "sched_load_balance" is disabled, then the
410   -scheduler will avoid load balancing across the CPUs in that cpuset,
411   ---except-- in so far as is necessary because some overlapping cpuset
412   -has "sched_load_balance" enabled.
413   -
414   -So, for example, if the top cpuset has the flag "sched_load_balance"
415   -enabled, then the scheduler will have one sched domain covering all
416   -CPUs, and the setting of the "sched_load_balance" flag in any other
417   -cpusets won't matter, as we're already fully load balancing.
418   -
419   -Therefore in the above two situations, the top cpuset flag
420   -"sched_load_balance" should be disabled, and only some of the smaller,
421   -child cpusets have this flag enabled.
422   -
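      -A minimal sketch of that configuration (cpuset names are hypothetical;
      -see section 2 for mounting the hierarchy at /dev/cpuset):
      -
      - # /bin/echo 0 > /dev/cpuset/sched_load_balance
      - # /bin/echo 1 > /dev/cpuset/jobA/sched_load_balance
      - # /bin/echo 1 > /dev/cpuset/jobB/sched_load_balance
      -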
423   -When doing this, you don't usually want to leave any unpinned tasks in
424   -the top cpuset that might use non-trivial amounts of CPU, as such tasks
425   -may be artificially constrained to some subset of CPUs, depending on
426   -the particulars of this flag setting in descendent cpusets. Even if
427   -such a task could use spare CPU cycles in some other CPUs, the kernel
428   -scheduler might not consider the possibility of load balancing that
429   -task to that underused CPU.
430   -
431   -Of course, tasks pinned to a particular CPU can be left in a cpuset
432   -that disables "sched_load_balance" as those tasks aren't going anywhere
433   -else anyway.
434   -
435   -There is an impedance mismatch here, between cpusets and sched domains.
436   -Cpusets are hierarchical and nest. Sched domains are flat; they don't
437   -overlap and each CPU is in at most one sched domain.
438   -
439   -It is necessary for sched domains to be flat because load balancing
440   -across partially overlapping sets of CPUs would risk unstable dynamics
441   -that would be beyond our understanding. So if each of two partially
442   -overlapping cpusets enables the flag 'sched_load_balance', then we
443   -form a single sched domain that is a superset of both. We won't move
444   -a task to a CPU outside its cpuset, but the scheduler load balancing
445   -code might waste some compute cycles considering that possibility.
446   -
447   -This mismatch is why there is not a simple one-to-one relation
448   -between which cpusets have the flag "sched_load_balance" enabled,
449   -and the sched domain configuration. If a cpuset enables the flag, it
450   -will get balancing across all its CPUs, but if it disables the flag,
451   -it will only be assured of no load balancing if no other overlapping
452   -cpuset enables the flag.
453   -
454   -If two cpusets have partially overlapping 'cpus' allowed, and only
455   -one of them has this flag enabled, then the other may find its
456   -tasks only partially load balanced, just on the overlapping CPUs.
457   -This is just the general case of the top_cpuset example given a few
458   -paragraphs above. In the general case, as in the top cpuset case,
459   -don't leave tasks that might use non-trivial amounts of CPU in
460   -such partially load balanced cpusets, as they may be artificially
461   -constrained to some subset of the CPUs allowed to them, for lack of
462   -load balancing to the other CPUs.
463   -
464   -1.7.1 sched_load_balance implementation details.
465   -------------------------------------------------
466   -
467   -The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary
468   -to most cpuset flags.) When enabled for a cpuset, the kernel will
469   -ensure that it can load balance across all the CPUs in that cpuset
470   -(makes sure that all the CPUs in the cpus_allowed of that cpuset are
471   -in the same sched domain.)
472   -
473   -If two overlapping cpusets both have 'sched_load_balance' enabled,
474   -then they will be (must be) both in the same sched domain.
475   -
476   -If, as is the default, the top cpuset has 'sched_load_balance' enabled,
477   -then by the above that means there is a single sched domain covering
478   -the whole system, regardless of any other cpuset settings.
479   -
480   -The kernel commits to user space that it will avoid load balancing
481   -where it can. It will pick as fine a granularity partition of sched
482   -domains as it can while still providing load balancing for any set
483   -of CPUs allowed to a cpuset having 'sched_load_balance' enabled.
484   -
485   -The internal kernel cpuset to scheduler interface passes from the
486   -cpuset code to the scheduler code a partition of the load balanced
487   -CPUs in the system. This partition is a set of subsets (represented
488   -as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all
489   -the CPUs that must be load balanced.
490   -
491   -Whenever the 'sched_load_balance' flag changes, or CPUs come or go
492   -from a cpuset with this flag enabled, or a cpuset with this flag
493   -enabled is removed, the cpuset code builds a new such partition and
494   -passes it to the scheduler sched domain setup code, to have the sched
495   -domains rebuilt as necessary.
496   -
497   -This partition exactly defines what sched domains the scheduler should
498   -set up - one sched domain for each element (cpumask_t) in the partition.
499   -
500   -The scheduler remembers the currently active sched domain partitions.
501   -When the scheduler routine partition_sched_domains() is invoked from
502   -the cpuset code to update these sched domains, it compares the new
503   -partition requested with the current, and updates its sched domains,
504   -removing the old and adding the new, for each change.
505   -
506   -
507   -1.8 What is sched_relax_domain_level ?
508   ---------------------------------------
509   -
510   -Within a sched domain, the scheduler migrates tasks in two ways: periodic
511   -load balancing on the tick, and at the time of certain scheduling events.
512   -
513   -When a task is woken up, the scheduler tries to move it to an idle CPU.
514   -For example, if a task A running on CPU X activates another task B
515   -on the same CPU X, and if CPU Y is X's sibling and is idle, then the
516   -scheduler migrates task B to CPU Y so that task B can start on CPU Y
517   -without waiting for task A on CPU X.
518   -
519   -And if a CPU runs out of tasks in its runqueue, it tries to pull
520   -extra tasks from other busy CPUs, to help them, before it goes idle.
521   -
522   -Of course it costs some searching to find movable tasks and/or idle
523   -CPUs, so the scheduler might not search all CPUs in the domain every
524   -time. In fact, on some architectures, the search ranges on these events
525   -are limited to the same socket or node where the CPU is located, while
526   -the load balance on tick searches all of them.
527   -
528   -For example, assume CPU Z is relatively far from CPU X. Even if CPU Z
529   -is idle while CPU X and its siblings are busy, the scheduler can't
530   -migrate woken task B from X to Z since it is out of its search range.
531   -As a result, task B on CPU X needs to wait for task A or for the load
532   -balance on the next tick. For some applications in special situations,
533   -waiting one tick may be too long.
535   -
536   -The 'sched_relax_domain_level' file allows you to request a change of
537   -this search range as you like. This file takes an int value which
538   -ideally indicates the size of the search range in levels, as listed
539   -below; the initial value -1 indicates that the cpuset has no request.
540   -
541   - -1 : no request. use system default or follow request of others.
542   - 0 : no search.
543   - 1 : search siblings (hyperthreads in a core).
544   - 2 : search cores in a package.
545   - 3 : search cpus in a node [= system wide on non-NUMA system]
546   - ( 4 : search nodes in a chunk of node [on NUMA system] )
547   - ( 5 : search system wide [on NUMA system] )
548   -
549   -The system default is architecture dependent. The system default
550   -can be changed using the relax_domain_level= boot parameter.
551   -
552   -This file is per-cpuset and affects the sched domain to which the cpuset
553   -belongs. Therefore if the flag 'sched_load_balance' of a cpuset
554   -is disabled, then 'sched_relax_domain_level' has no effect since
555   -there is no sched domain belonging to the cpuset.
556   -
557   -If multiple cpusets are overlapping and hence form a single sched
558   -domain, the largest value among them is used. Be careful: if one
559   -requests 0 and the others are -1, then 0 is used.
560   -
561   -Note that modifying this file will have both good and bad effects,
562   -and whether it is acceptable or not depends on your situation.
563   -Don't modify this file if you are not sure.
564   -
565   -If your situation is:
566   - - The migration costs between each cpu can be assumed to be considerably
567   - small (for you) due to your special application's behavior or special
568   - hardware support for CPU caches etc.
569   - - The searching cost has no impact (for you), or you can make the
570   - searching cost small enough, e.g. by keeping the cpuset compact.
571   - - Low latency is required even if it sacrifices cache hit rate etc.,
572   -then increasing 'sched_relax_domain_level' would benefit you.
573   -
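      -A minimal sketch, run inside a cpuset directory (level 1 requests
      -searching the siblings, per the table above):
      -
      - # /bin/echo 1 > sched_relax_domain_level
      - # cat sched_relax_domain_level
      - 1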
574   -
575   -1.9 How do I use cpusets ?
576   ---------------------------
577   -
578   -In order to minimize the impact of cpusets on critical kernel
579   -code, such as the scheduler, and due to the fact that the kernel
580   -does not support one task updating the memory placement of another
581   -task directly, the impact on a task of changing its cpuset CPU
582   -or Memory Node placement, or of changing to which cpuset a task
583   -is attached, is subtle.
584   -
585   -If a cpuset has its Memory Nodes modified, then for each task attached
586   -to that cpuset, the next time that the kernel attempts to allocate
587   -a page of memory for that task, the kernel will notice the change
588   -in the task's cpuset, and update its per-task memory placement to
589   -remain within the new cpuset's memory placement. If the task was using
590   -mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
591   -its new cpuset, then the task will continue to use whatever subset
592   -of MPOL_BIND nodes are still allowed in the new cpuset. If the task
593   -was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
594   -in the new cpuset, then the task will be essentially treated as if it
595   -was MPOL_BIND bound to the new cpuset (even though its numa placement,
596   -as queried by get_mempolicy(), doesn't change). If a task is moved
597   -from one cpuset to another, then the kernel will adjust the task's
598   -memory placement, as above, the next time that the kernel attempts
599   -to allocate a page of memory for that task.
600   -
601   -If a cpuset has its 'cpus' modified, then each task in that cpuset
602   -will have its allowed CPU placement changed immediately. Similarly,
603   -if a task's pid is written to a cpuset's 'tasks' file, in either its
604   -current cpuset or another cpuset, then its allowed CPU placement is
605   -changed immediately. If such a task had been bound to some subset
606   -of its cpuset using the sched_setaffinity() call, the task will be
607   -allowed to run on any CPU allowed in its new cpuset, negating the
608   -effect of the prior sched_setaffinity() call.
609   -
610   -In summary, the memory placement of a task whose cpuset is changed is
611   -updated by the kernel, on the next allocation of a page for that task,
612   -but the processor placement is not updated, until that task's pid is
613   -rewritten to the 'tasks' file of its cpuset. This is done to avoid
614   -impacting the scheduler code in the kernel with a check for changes
615   -in a task's processor placement.
616   -
617   -Normally, once a page is allocated (given a physical page
618   -of main memory) then that page stays on whatever node it
619   -was allocated, so long as it remains allocated, even if the
620   -cpuset's memory placement policy 'mems' subsequently changes.
621   -If the cpuset flag file 'memory_migrate' is set true, then when
622   -tasks are attached to that cpuset, any pages that task had
623   -allocated to it on nodes in its previous cpuset are migrated
624   -to the task's new cpuset. The relative placement of the page within
625   -the cpuset is preserved during these migration operations if possible.
626   -For example if the page was on the second valid node of the prior cpuset
627   -then the page will be placed on the second valid node of the new cpuset.
628   -
629   -Also if 'memory_migrate' is set true, then if that cpuset's
630   -'mems' file is modified, pages allocated to tasks in that
631   -cpuset, that were on nodes in the previous setting of 'mems',
632   -will be moved to nodes in the new setting of 'mems.'
633   -Pages that were not in the task's prior cpuset, or in the cpuset's
634   -prior 'mems' setting, will not be moved.
635   -
636   -There is an exception to the above. If hotplug functionality is used
637   -to remove all the CPUs that are currently assigned to a cpuset,
638   -then all the tasks in that cpuset will be moved to the nearest ancestor
639   -with non-empty cpus. But the moving of some (or all) tasks might fail if
640   -cpuset is bound with another cgroup subsystem which has some restrictions
641   -on task attaching. In this failing case, those tasks will stay
642   -in the original cpuset, and the kernel will automatically update
643   -their cpus_allowed to allow all online CPUs. When memory hotplug
644   -functionality for removing Memory Nodes is available, a similar exception
645   -is expected to apply there as well. In general, the kernel prefers to
646   -violate cpuset placement, over starving a task that has had all
647   -its allowed CPUs or Memory Nodes taken offline.
648   -
649   -There is a second exception to the above. GFP_ATOMIC requests are
650   -kernel internal allocations that must be satisfied, immediately.
651   -The kernel may drop some requests, in rare cases even panic, if a
652   -GFP_ATOMIC alloc fails. If the request cannot be satisfied within
653   -the current task's cpuset, then we relax the cpuset, and look for
654   -memory anywhere we can find it. It's better to violate the cpuset
655   -than stress the kernel.
656   -
657   -To start a new job that is to be contained within a cpuset, the steps are:
658   -
659   - 1) mkdir /dev/cpuset
660   - 2) mount -t cgroup -ocpuset cpuset /dev/cpuset
661   - 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
662   - the /dev/cpuset virtual file system.
663   - 4) Start a task that will be the "founding father" of the new job.
664   - 5) Attach that task to the new cpuset by writing its pid to the
665   - /dev/cpuset tasks file for that cpuset.
666   - 6) fork, exec or clone the job tasks from this founding father task.
667   -
668   -For example, the following sequence of commands will set up a cpuset
669   -named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
670   -and then start a subshell 'sh' in that cpuset:
671   -
672   - mount -t cgroup -ocpuset cpuset /dev/cpuset
673   - cd /dev/cpuset
674   - mkdir Charlie
675   - cd Charlie
676   - /bin/echo 2-3 > cpus
677   - /bin/echo 1 > mems
678   - /bin/echo $$ > tasks
679   - sh
680   - # The subshell 'sh' is now running in cpuset Charlie
681   - # The next line should display '/Charlie'
682   - cat /proc/self/cpuset
683   -
684   -In the future, a C library interface to cpusets will likely be
685   -available. For now, the only way to query or modify cpusets is
686   -via the cpuset file system, using the various cd, mkdir, echo, cat,
687   -rmdir commands from the shell, or their equivalent from C.
688   -
689   -The sched_setaffinity calls can also be done at the shell prompt using
690   -SGI's runon or Robert Love's taskset. The mbind and set_mempolicy
691   -calls can be done at the shell prompt using the numactl command
692   -(part of Andi Kleen's numa package).
693   -
694   -2. Usage Examples and Syntax
695   -============================
696   -
697   -2.1 Basic Usage
698   ----------------
699   -
700   -Creating, modifying and using cpusets can be done through the cpuset
701   -virtual filesystem.
702   -
703   -To mount it, type:
704   -# mount -t cgroup -o cpuset cpuset /dev/cpuset
705   -
706   -Then under /dev/cpuset you can find a tree that corresponds to the
707   -tree of the cpusets in the system. For instance, /dev/cpuset
708   -is the cpuset that holds the whole system.
709   -
710   -If you want to create a new cpuset under /dev/cpuset:
711   -# cd /dev/cpuset
712   -# mkdir my_cpuset
713   -
714   -Now you want to do something with this cpuset.
715   -# cd my_cpuset
716   -
717   -In this directory you can find several files:
718   -# ls
719   -cpu_exclusive memory_migrate mems tasks
720   -cpus memory_pressure notify_on_release
721   -mem_exclusive memory_spread_page sched_load_balance
722   -mem_hardwall memory_spread_slab sched_relax_domain_level
723   -
724   -Reading them will give you information about the state of this cpuset:
725   -the CPUs and Memory Nodes it can use, the processes that are using
726   -it, its properties. By writing to these files you can manipulate
727   -the cpuset.
728   -
729   -Set some flags:
730   -# /bin/echo 1 > cpu_exclusive
731   -
732   -Add some cpus:
733   -# /bin/echo 0-7 > cpus
734   -
735   -Add some mems:
736   -# /bin/echo 0-7 > mems
737   -
738   -Now attach your shell to this cpuset:
739   -# /bin/echo $$ > tasks
740   -
741   -You can also create cpusets inside your cpuset by using mkdir in this
742   -directory.
743   -# mkdir my_sub_cs
744   -
745   -To remove a cpuset, just use rmdir:
746   -# rmdir my_sub_cs
747   -This will fail if the cpuset is in use (has cpusets inside, or has
748   -processes attached).
749   -
750   -Note that for legacy reasons, the "cpuset" filesystem exists as a
751   -wrapper around the cgroup filesystem.
752   -
753   -The command
754   -
755   -mount -t cpuset X /dev/cpuset
756   -
757   -is equivalent to
758   -
759   -mount -t cgroup -ocpuset X /dev/cpuset
760   -echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent
761   -
762   -2.2 Adding/removing cpus
763   -------------------------
764   -
765   -This is the syntax to use when writing in the cpus or mems files
766   -in cpuset directories:
767   -
768   -# /bin/echo 1-4 > cpus -> set cpus list to cpus 1,2,3,4
769   -# /bin/echo 1,2,3,4 > cpus -> set cpus list to cpus 1,2,3,4
770   -
771   -2.3 Setting flags
772   ------------------
773   -
774   -The syntax is very simple:
775   -
776   -# /bin/echo 1 > cpu_exclusive -> set flag 'cpu_exclusive'
777   -# /bin/echo 0 > cpu_exclusive -> unset flag 'cpu_exclusive'
778   -
779   -2.4 Attaching processes
780   ------------------------
781   -
782   -# /bin/echo PID > tasks
783   -
784   -Note that it is PID, not PIDs. You can only attach ONE task at a time.
785   -If you have several tasks to attach, you have to do it one after another:
786   -
787   -# /bin/echo PID1 > tasks
788   -# /bin/echo PID2 > tasks
789   - ...
790   -# /bin/echo PIDn > tasks
791   -
792   -
793   -3. Questions
794   -============
795   -
796   -Q: what's up with this '/bin/echo' ?
797   -A: bash's builtin 'echo' command does not check calls to write() against
798   - errors. If you use it in the cpuset file system, you won't be
799   - able to tell whether a command succeeded or failed.
800   -
801   -Q: When I attach processes, only the first one on the line actually gets attached!
802   -A: We can only return one error code per call to write(). So you should also
803   - put only ONE pid per write.
804   -
805   -4. Contact
806   -==========
807   -
808   -Web: http://www.bullopensource.org/cpuset
Documentation/scheduler/sched-design-CFS.txt
... ... @@ -231,7 +231,7 @@
231 231  
232 232 This options needs CONFIG_CGROUPS to be defined, and lets the administrator
233 233 create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See
234   - Documentation/cgroups.txt for more information about this filesystem.
  234 + Documentation/cgroups/cgroups.txt for more information about this filesystem.
235 235  
236 236 Only one of these options to group tasks can be chosen and not both.
237 237  
include/linux/res_counter.h
... ... @@ -9,7 +9,7 @@
9 9 *
10 10 * Author: Pavel Emelianov <xemul@openvz.org>
11 11 *
12   - * See Documentation/controllers/resource_counter.txt for more
  12 + * See Documentation/cgroups/resource_counter.txt for more
13 13 * info about what this counter is.
14 14 */
15 15  
... ... @@ -323,8 +323,8 @@
323 323 This option allows you to create arbitrary task groups
324 324 using the "cgroup" pseudo filesystem and control
325 325 the cpu bandwidth allocated to each such task group.
326   - Refer to Documentation/cgroups.txt for more information
327   - on "cgroup" pseudo filesystem.
  326 + Refer to Documentation/cgroups/cgroups.txt for more
  327 + information on "cgroup" pseudo filesystem.
328 328  
329 329 endchoice
330 330  
331 331  
... ... @@ -335,10 +335,9 @@
335 335 use with process control subsystems such as Cpusets, CFS, memory
336 336 controls or device isolation.
337 337 See
338   - - Documentation/cpusets.txt (Cpusets)
339 338 - Documentation/scheduler/sched-design-CFS.txt (CFS)
340   - - Documentation/cgroups/ (features for grouping, isolation)
341   - - Documentation/controllers/ (features for resource control)
  339 + - Documentation/cgroups/ (features for grouping, isolation
  340 + and resource control)
342 341  
343 342 Say N if unsure.
344 343  
... ... @@ -568,7 +568,7 @@
568 568 * load balancing domains (sched domains) as specified by that partial
569 569 * partition.
570 570 *
571   - * See "What is sched_load_balance" in Documentation/cpusets.txt
  571 + * See "What is sched_load_balance" in Documentation/cgroups/cpusets.txt
572 572 * for a background explanation of this.
573 573 *
574 574 * Does not return errors, on the theory that the callers of this