Commit 45ce80fb6b6f9594d1396d44dd7e7c02d596fef8

Authored by Li Zefan
Committed by Linus Torvalds
1 parent 23964d2d02

cgroups: consolidate cgroup documents

Move Documentation/cpusets.txt and Documentation/controllers/* to
Documentation/cgroups/

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Showing 17 changed files with 1824 additions and 1824 deletions

Documentation/cgroups/cgroups.txt
1 1 CGROUPS
2 2 -------
3 3  
4   -Written by Paul Menage <menage@google.com> based on Documentation/cpusets.txt
  4 +Written by Paul Menage <menage@google.com> based on
  5 +Documentation/cgroups/cpusets.txt
5 6  
6 7 Original copyright statements from cpusets.txt:
7 8 Portions Copyright (C) 2004 BULL SA.
... ... @@ -68,7 +69,7 @@
68 69 tracking. The intention is that other subsystems hook into the generic
69 70 cgroup support to provide new attributes for cgroups, such as
70 71 accounting/limiting the resources which processes in a cgroup can
71   -access. For example, cpusets (see Documentation/cpusets.txt) allows
  72 +access. For example, cpusets (see Documentation/cgroups/cpusets.txt) allows
72 73 you to associate a set of CPUs and a set of memory nodes with the
73 74 tasks in each cgroup.
74 75  
Documentation/cgroups/cpuacct.txt
  1 +CPU Accounting Controller
  2 +-------------------------
  3 +
  4 +The CPU accounting controller is used to group tasks using cgroups and
  5 +account the CPU usage of these groups of tasks.
  6 +
  7 +The CPU accounting controller supports multi-hierarchy groups. An accounting
  8 +group accumulates the CPU usage of all of its child groups and the tasks
  9 +directly present in its group.
  10 +
  11 +Accounting groups can be created by first mounting the cgroup filesystem.
  12 +
  13 +# mkdir /cgroups
  14 +# mount -t cgroup -ocpuacct none /cgroups
  15 +
  16 +With the above steps, the initial or the parent accounting group
  17 +becomes visible at /cgroups. At bootup, this group includes all the
  18 +tasks in the system. /cgroups/tasks lists the tasks in this cgroup.
  19 +/cgroups/cpuacct.usage gives the CPU time (in nanoseconds) obtained by
  20 +this group which is essentially the CPU time obtained by all the tasks
  21 +in the system.
  22 +
  23 +New accounting groups can be created under the parent group /cgroups.
  24 +
  25 +# cd /cgroups
  26 +# mkdir g1
  28 +# echo $$ > g1/tasks
  28 +
  29 +The above steps create a new group g1 and move the current shell
  30 +process (bash) into it. CPU time consumed by this bash and its children
  31 +can be obtained from g1/cpuacct.usage, and the same is also accumulated
  32 +in /cgroups/cpuacct.usage.
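
As a quick check, the counters can be read back directly (a sketch,
assuming the /cgroups mount point and the group g1 created above):

# cat /cgroups/g1/cpuacct.usage      (CPU time of g1, in nanoseconds)
# cat /cgroups/cpuacct.usage         (parent group; includes g1's usage)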
Documentation/cgroups/cpusets.txt
  1 + CPUSETS
  2 + -------
  3 +
  4 +Copyright (C) 2004 BULL SA.
  5 +Written by Simon.Derr@bull.net
  6 +
  7 +Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
  8 +Modified by Paul Jackson <pj@sgi.com>
  9 +Modified by Christoph Lameter <clameter@sgi.com>
  10 +Modified by Paul Menage <menage@google.com>
  11 +Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
  12 +
  13 +CONTENTS:
  14 +=========
  15 +
  16 +1. Cpusets
  17 + 1.1 What are cpusets ?
  18 + 1.2 Why are cpusets needed ?
  19 + 1.3 How are cpusets implemented ?
  20 + 1.4 What are exclusive cpusets ?
  21 + 1.5 What is memory_pressure ?
  22 + 1.6 What is memory spread ?
  23 + 1.7 What is sched_load_balance ?
  24 + 1.8 What is sched_relax_domain_level ?
  25 + 1.9 How do I use cpusets ?
  26 +2. Usage Examples and Syntax
  27 + 2.1 Basic Usage
  28 + 2.2 Adding/removing cpus
  29 + 2.3 Setting flags
  30 + 2.4 Attaching processes
  31 +3. Questions
  32 +4. Contact
  33 +
  34 +1. Cpusets
  35 +==========
  36 +
  37 +1.1 What are cpusets ?
  38 +----------------------
  39 +
  40 +Cpusets provide a mechanism for assigning a set of CPUs and Memory
  41 +Nodes to a set of tasks. In this document "Memory Node" refers to
  42 +an on-line node that contains memory.
  43 +
  44 +Cpusets constrain the CPU and Memory placement of tasks to only
  45 +the resources within a task's current cpuset. They form a nested
  46 +hierarchy visible in a virtual file system. These are the essential
  47 +hooks, beyond what is already present, required to manage dynamic
  48 +job placement on large systems.
  49 +
  50 +Cpusets use the generic cgroup subsystem described in
  51 +Documentation/cgroups/cgroups.txt.
  52 +
  53 +Requests by a task, using the sched_setaffinity(2) system call to
  54 +include CPUs in its CPU affinity mask, and using the mbind(2) and
  55 +set_mempolicy(2) system calls to include Memory Nodes in its memory
  56 +policy, are both filtered through that task's cpuset, filtering out any
  57 +CPUs or Memory Nodes not in that cpuset. The scheduler will not
  58 +schedule a task on a CPU that is not allowed in its cpus_allowed
  59 +vector, and the kernel page allocator will not allocate a page on a
  60 +node that is not allowed in the requesting task's mems_allowed vector.
  61 +
  62 +User level code may create and destroy cpusets by name in the cgroup
  63 +virtual file system, manage the attributes and permissions of these
  64 +cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
  65 +specify and query to which cpuset a task is assigned, and list the
  66 +task pids assigned to a cpuset.
  67 +
  68 +
  69 +1.2 Why are cpusets needed ?
  70 +----------------------------
  71 +
  72 +The management of large computer systems, with many processors (CPUs),
  73 +complex memory cache hierarchies and multiple Memory Nodes having
  74 +non-uniform access times (NUMA) presents additional challenges for
  75 +the efficient scheduling and memory placement of processes.
  76 +
  77 +Frequently, more modest-sized systems can be operated with adequate
  78 +efficiency just by letting the operating system automatically share
  79 +the available CPU and Memory resources amongst the requesting tasks.
  80 +
  81 +But larger systems, which benefit more from careful processor and
  82 +memory placement to reduce memory access times and contention,
  83 +and which typically represent a larger investment for the customer,
  84 +can benefit from explicitly placing jobs on properly sized subsets of
  85 +the system.
  86 +
  87 +This can be especially valuable on:
  88 +
  89 + * Web Servers running multiple instances of the same web application,
  90 + * Servers running different applications (for instance, a web server
  91 + and a database), or
  92 + * NUMA systems running large HPC applications with demanding
  93 + performance characteristics.
  94 +
  95 +These subsets, or "soft partitions", must be able to be dynamically
  96 +adjusted, as the job mix changes, without impacting other concurrently
  97 +executing jobs. The pages of running jobs may also be moved when
  98 +their allowed memory locations are changed.
  99 +
  100 +The kernel cpuset patch provides the minimum essential kernel
  101 +mechanisms required to efficiently implement such subsets. It
  102 +leverages existing CPU and Memory Placement facilities in the Linux
  103 +kernel to avoid any additional impact on the critical scheduler or
  104 +memory allocator code.
  105 +
  106 +
  107 +1.3 How are cpusets implemented ?
  108 +---------------------------------
  109 +
  110 +Cpusets provide a Linux kernel mechanism to constrain which CPUs and
  111 +Memory Nodes are used by a process or set of processes.
  112 +
  113 +The Linux kernel already has a pair of mechanisms to specify on which
  114 +CPUs a task may be scheduled (sched_setaffinity) and on which Memory
  115 +Nodes it may obtain memory (mbind, set_mempolicy).
  116 +
  117 +Cpusets extend these two mechanisms as follows:
  118 +
  119 + - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
  120 + kernel.
  121 + - Each task in the system is attached to a cpuset, via a pointer
  122 + in the task structure to a reference counted cgroup structure.
  123 + - Calls to sched_setaffinity are filtered to just those CPUs
  124 + allowed in that task's cpuset.
  125 + - Calls to mbind and set_mempolicy are filtered to just
  126 + those Memory Nodes allowed in that task's cpuset.
  127 + - The root cpuset contains all the system's CPUs and Memory
  128 + Nodes.
  129 + - For any cpuset, one can define child cpusets containing a subset
  130 + of the parent's CPU and Memory Node resources.
  131 + - The hierarchy of cpusets can be mounted at /dev/cpuset, for
  132 + browsing and manipulation from user space.
  133 + - A cpuset may be marked exclusive, which ensures that no other
  134 + cpuset (except direct ancestors and descendants) may contain
  135 + any overlapping CPUs or Memory Nodes.
  136 + - You can list all the tasks (by pid) attached to any cpuset.
  137 +
  138 +The implementation of cpusets requires a few simple hooks
  139 +into the rest of the kernel, none in performance critical paths:
  140 +
  141 + - in init/main.c, to initialize the root cpuset at system boot.
  142 + - in fork and exit, to attach and detach a task from its cpuset.
  143 + - in sched_setaffinity, to mask the requested CPUs by what's
  144 + allowed in that task's cpuset.
  145 + - in sched.c migrate_all_tasks(), to keep migrating tasks within
  146 + the CPUs allowed by their cpuset, if possible.
  147 + - in the mbind and set_mempolicy system calls, to mask the requested
  148 + Memory Nodes by what's allowed in that task's cpuset.
  149 + - in page_alloc.c, to restrict memory to allowed nodes.
  150 + - in vmscan.c, to restrict page recovery to the current cpuset.
  151 +
  152 +You should mount the "cgroup" filesystem type in order to enable
  153 +browsing and modifying the cpusets presently known to the kernel. No
  154 +new system calls are added for cpusets - all support for querying and
  155 +modifying cpusets is via this cpuset file system.
  156 +
  157 +The /proc/<pid>/status file for each task has four added lines,
  158 +displaying the task's cpus_allowed (on which CPUs it may be scheduled)
  159 +and mems_allowed (on which Memory Nodes it may obtain memory),
  160 +in the two formats seen in the following example:
  161 +
  162 + Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff
  163 + Cpus_allowed_list: 0-127
  164 + Mems_allowed: ffffffff,ffffffff
  165 + Mems_allowed_list: 0-63
  166 +
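These fields can be inspected directly; for example, the following
command (a sketch, run against the current shell) prints all four lines:

# grep allowed /proc/self/status
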
  167 +Each cpuset is represented by a directory in the cgroup file system
  168 +containing (on top of the standard cgroup files) the following
  169 +files describing that cpuset:
  170 +
  171 + - cpus: list of CPUs in that cpuset
  172 + - mems: list of Memory Nodes in that cpuset
  173 + - memory_migrate flag: if set, move pages to cpuset's nodes
  174 + - cpu_exclusive flag: is cpu placement exclusive?
  175 + - mem_exclusive flag: is memory placement exclusive?
  176 + - mem_hardwall flag: is memory allocation hardwalled?
  177 + - memory_pressure: measure of how much paging pressure in cpuset
  178 +
  179 +In addition, only the root cpuset has the following file:
  180 + - memory_pressure_enabled flag: compute memory_pressure?
  181 +
  182 +New cpusets are created using the mkdir system call or shell
  183 +command. The properties of a cpuset, such as its flags, allowed
  184 +CPUs and Memory Nodes, and attached tasks, are modified by writing
  185 +to the appropriate file in that cpuset's directory, as listed above.
  186 +
  187 +The named hierarchical structure of nested cpusets allows partitioning
  188 +a large system into nested, dynamically changeable, "soft-partitions".
  189 +
  190 +The attachment of each task, automatically inherited at fork by any
  191 +children of that task, to a cpuset allows organizing the work load
  192 +on a system into related sets of tasks such that each set is constrained
  193 +to using the CPUs and Memory Nodes of a particular cpuset. A task
  194 +may be re-attached to any other cpuset, if allowed by the permissions
  195 +on the necessary cpuset file system directories.
  196 +
  197 +Such management of a system "in the large" integrates smoothly with
  198 +the detailed placement done on individual tasks and memory regions
  199 +using the sched_setaffinity, mbind and set_mempolicy system calls.
  200 +
  201 +The following rules apply to each cpuset:
  202 +
  203 + - Its CPUs and Memory Nodes must be a subset of its parent's.
  204 + - It can't be marked exclusive unless its parent is.
  205 + - If it is cpu or memory exclusive, it may not overlap any sibling.
  206 +
  207 +These rules, and the natural hierarchy of cpusets, enable efficient
  208 +enforcement of the exclusive guarantee, without having to scan all
  209 +cpusets every time any of them changes to ensure nothing overlaps an
  210 +exclusive cpuset. Also, the use of a Linux virtual file system (vfs)
  211 +to represent the cpuset hierarchy provides for a familiar permission
  212 +and name space for cpusets, with a minimum of additional kernel code.
  213 +
  214 +The cpus and mems files in the root (top_cpuset) cpuset are
  215 +read-only. The cpus file automatically tracks the value of
  216 +cpu_online_map using a CPU hotplug notifier, and the mems file
  217 +automatically tracks the value of node_states[N_HIGH_MEMORY]--i.e.,
  218 +nodes with memory--using the cpuset_track_online_nodes() hook.
  219 +
  220 +
  221 +1.4 What are exclusive cpusets ?
  222 +--------------------------------
  223 +
  224 +If a cpuset is cpu or mem exclusive, no other cpuset, other than
  225 +a direct ancestor or descendant, may share any of the same CPUs or
  226 +Memory Nodes.
  227 +
  228 +A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled",
  229 +i.e. it restricts kernel allocations for page, buffer and other data
  230 +commonly shared by the kernel across multiple users. All cpusets,
  231 +whether hardwalled or not, restrict allocations of memory for user
  232 +space. This enables configuring a system so that several independent
  233 +jobs can share common kernel data, such as file system pages, while
  234 +isolating each job's user allocation in its own cpuset. To do this,
  235 +construct a large mem_exclusive cpuset to hold all the jobs, and
  236 +construct child, non-mem_exclusive cpusets for each individual job.
  237 +Only a small amount of typical kernel memory, such as requests from
  238 +interrupt handlers, is allowed to be taken outside even a
  239 +mem_exclusive cpuset.
  240 +
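A minimal sketch of that layout (hypothetical names; assumes the cpuset
hierarchy is mounted at /dev/cpuset as described in section 1.9):

# cd /dev/cpuset
# mkdir jobs
# cd jobs
# /bin/echo 0-7 > cpus
# /bin/echo 0-1 > mems
# /bin/echo 1 > mem_exclusive      (hardwall kernel allocations here)
# mkdir job1                       (per-job child, not mem_exclusive)
# /bin/echo 0-3 > job1/cpus
# /bin/echo 0 > job1/mems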
  241 +
  242 +1.5 What is memory_pressure ?
  243 +-----------------------------
  244 +The memory_pressure of a cpuset provides a simple per-cpuset metric
  245 +of the rate at which the tasks in a cpuset are attempting to free up
  246 +in-use memory on the nodes of the cpuset to satisfy additional memory
  247 +requests.
  248 +
  249 +This enables batch managers monitoring jobs running in dedicated
  250 +cpusets to efficiently detect what level of memory pressure that job
  251 +is causing.
  252 +
  253 +This is useful both on tightly managed systems running a wide mix of
  254 +submitted jobs, which may choose to terminate or re-prioritize jobs that
  255 +are trying to use more memory than allowed on the nodes assigned them,
  256 +and with tightly coupled, long running, massively parallel scientific
  257 +computing jobs that will dramatically fail to meet required performance
  258 +goals if they start to use more memory than allowed to them.
  259 +
  260 +This mechanism provides a very economical way for the batch manager
  261 +to monitor a cpuset for signs of memory pressure. It's up to the
  262 +batch manager or other user code to decide what to do about it and
  263 +take action.
  264 +
  265 +==> Unless this feature is enabled by writing "1" to the special file
  266 + /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
  267 + code of __alloc_pages() for this metric reduces to simply noticing
  268 + that the cpuset_memory_pressure_enabled flag is zero. So only
  269 + systems that enable this feature will compute the metric.
  270 +
  271 +Why a per-cpuset, running average:
  272 +
  273 + Because this meter is per-cpuset, rather than per-task or mm,
  274 + the system load imposed by a batch scheduler monitoring this
  275 + metric is sharply reduced on large systems, because a scan of
  276 + the tasklist can be avoided on each set of queries.
  277 +
  278 + Because this meter is a running average, instead of an accumulating
  279 + counter, a batch scheduler can detect memory pressure with a
  280 + single read, instead of having to read and accumulate results
  281 + for a period of time.
  282 +
  283 + Because this meter is per-cpuset rather than per-task or mm,
  284 + the batch scheduler can obtain the key information, memory
  285 + pressure in a cpuset, with a single read, rather than having to
  286 + query and accumulate results over all the (dynamically changing)
  287 + set of tasks in the cpuset.
  288 +
  289 +A per-cpuset simple digital filter (requires a spinlock and 3 words
  290 +of data per-cpuset) is kept, and updated by any task attached to that
  291 +cpuset, if it enters the synchronous (direct) page reclaim code.
  292 +
  293 +A per-cpuset file provides an integer number representing the recent
  294 +(half-life of 10 seconds) rate of direct page reclaims caused by
  295 +the tasks in the cpuset, in units of reclaims attempted per second,
  296 +times 1000.
  297 +
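For example (a sketch; "Charlie" is the hypothetical cpuset from the
example in section 1.9):

# /bin/echo 1 > /dev/cpuset/memory_pressure_enabled
# cat /dev/cpuset/Charlie/memory_pressure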
  298 +
  299 +1.6 What is memory spread ?
  300 +---------------------------
  301 +There are two boolean flag files per cpuset that control where the
  302 +kernel allocates pages for the file system buffers and related
  303 +in-kernel data structures. They are called 'memory_spread_page' and
  304 +'memory_spread_slab'.
  305 +
  306 +If the per-cpuset boolean flag file 'memory_spread_page' is set, then
  307 +the kernel will spread the file system buffers (page cache) evenly
  308 +over all the nodes that the faulting task is allowed to use, instead
  309 +of preferring to put those pages on the node where the task is running.
  310 +
  311 +If the per-cpuset boolean flag file 'memory_spread_slab' is set,
  312 +then the kernel will spread some file system related slab caches,
  313 +such as those for inodes and dentries, evenly over all the nodes that the
  314 +faulting task is allowed to use, instead of preferring to put those
  315 +pages on the node where the task is running.
  316 +
  317 +The setting of these flags does not affect anonymous data segment or
  318 +stack segment pages of a task.
  319 +
  320 +By default, both kinds of memory spreading are off, and memory
  321 +pages are allocated on the node local to where the task is running,
  322 +except perhaps as modified by the task's NUMA mempolicy or cpuset
  323 +configuration, so long as sufficient free memory pages are available.
  324 +
  325 +When new cpusets are created, they inherit the memory spread settings
  326 +of their parent.
  327 +
  328 +Setting memory spreading causes allocations for the affected page
  329 +or slab caches to ignore the task's NUMA mempolicy and be spread
  330 +instead. Tasks using mbind() or set_mempolicy() calls to set NUMA
  331 +mempolicies will not notice any change in these calls as a result of
  332 +their containing task's memory spread settings. If memory spreading
  333 +is turned off, then the currently specified NUMA mempolicy once again
  334 +applies to memory page allocations.
  335 +
  336 +Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag
  337 +files. By default they contain "0", meaning that the feature is off
  338 +for that cpuset. If a "1" is written to that file, then that turns
  339 +the named feature on.
  340 +
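For example, from inside a cpuset directory (a sketch):

# /bin/echo 1 > memory_spread_page    -> spread page cache over mems
# cat memory_spread_page              -> now displays "1"
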
  341 +The implementation is simple.
  342 +
  343 +Setting the flag 'memory_spread_page' turns on a per-process flag
  344 +PF_SPREAD_PAGE for each task that is in that cpuset or subsequently
  345 +joins that cpuset. The page allocation calls for the page cache
  346 +are modified to perform an inline check for this PF_SPREAD_PAGE task
  347 +flag, and if set, a call to a new routine cpuset_mem_spread_node()
  348 +returns the node to prefer for the allocation.
  349 +
  350 +Similarly, setting 'memory_spread_slab' turns on the flag
  351 +PF_SPREAD_SLAB, and appropriately marked slab caches will allocate
  352 +pages from the node returned by cpuset_mem_spread_node().
  353 +
  354 +The cpuset_mem_spread_node() routine is also simple. It uses the
  355 +value of a per-task rotor cpuset_mem_spread_rotor to select the next
  356 +node in the current task's mems_allowed to prefer for the allocation.
  357 +
  358 +This memory placement policy is also known (in other contexts) as
  359 +round-robin or interleave.
  360 +
  361 +This policy can provide substantial improvements for jobs that need
  362 +to place thread local data on the corresponding node, but that need
  363 +to access large file system data sets that need to be spread across
  364 +the several nodes in the job's cpuset in order to fit. Without this
  365 +policy, especially for jobs that might have one thread reading in the
  366 +data set, the memory allocation across the nodes in the job's cpuset
  367 +can become very uneven.
  368 +
  369 +1.7 What is sched_load_balance ?
  370 +--------------------------------
  371 +
  372 +The kernel scheduler (kernel/sched.c) automatically load balances
  373 +tasks. If one CPU is underutilized, kernel code running on that
  374 +CPU will look for tasks on other more overloaded CPUs and move those
  375 +tasks to itself, within the constraints of such placement mechanisms
  376 +as cpusets and sched_setaffinity.
  377 +
  378 +The algorithmic cost of load balancing and its impact on key shared
  379 +kernel data structures such as the task list increases more than
  380 +linearly with the number of CPUs being balanced. So the scheduler
  381 +has support to partition the system's CPUs into a number of sched
  382 +domains such that it only load balances within each sched domain.
  383 +Each sched domain covers some subset of the CPUs in the system;
  384 +no two sched domains overlap; some CPUs might not be in any sched
  385 +domain and hence won't be load balanced.
  386 +
  387 +Put simply, it costs less to balance between two smaller sched domains
  388 +than one big one, but doing so means that overloads in one of the
  389 +two domains won't be load balanced to the other one.
  390 +
  391 +By default, there is one sched domain covering all CPUs, except those
  392 +marked isolated using the kernel boot time "isolcpus=" argument.
  393 +
  394 +This default load balancing across all CPUs is not well suited for
  395 +the following two situations:
  396 + 1) On large systems, load balancing across many CPUs is expensive.
  397 + If the system is managed using cpusets to place independent jobs
  398 + on separate sets of CPUs, full load balancing is unnecessary.
  399 + 2) Systems supporting realtime on some CPUs need to minimize
  400 + system overhead on those CPUs, including avoiding task load
  401 + balancing if that is not needed.
  402 +
  403 +When the per-cpuset flag "sched_load_balance" is enabled (the default
  404 +setting), it requests that all the CPUs in that cpuset's allowed 'cpus'
  405 +be contained in a single sched domain, ensuring that load balancing
  406 +can move a task (not otherwise pinned, as by sched_setaffinity)
  407 +from any CPU in that cpuset to any other.
  408 +
  409 +When the per-cpuset flag "sched_load_balance" is disabled, then the
  410 +scheduler will avoid load balancing across the CPUs in that cpuset,
  411 +--except-- in so far as is necessary because some overlapping cpuset
  412 +has "sched_load_balance" enabled.
  413 +
  414 +So, for example, if the top cpuset has the flag "sched_load_balance"
  415 +enabled, then the scheduler will have one sched domain covering all
  416 +CPUs, and the setting of the "sched_load_balance" flag in any other
  417 +cpusets won't matter, as we're already fully load balancing.
  418 +
  419 +Therefore in the above two situations, the top cpuset flag
  420 +"sched_load_balance" should be disabled, and only some of the smaller,
  421 +child cpusets have this flag enabled.
  422 +
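A sketch of that arrangement, using a hypothetical child cpuset "batch1"
under /dev/cpuset:

# /bin/echo 0 > /dev/cpuset/sched_load_balance        (top: don't balance)
# /bin/echo 1 > /dev/cpuset/batch1/sched_load_balance (balance batch1 only)
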
  423 +When doing this, you don't usually want to leave any unpinned tasks in
  424 +the top cpuset that might use non-trivial amounts of CPU, as such tasks
  425 +may be artificially constrained to some subset of CPUs, depending on
  426 +the particulars of this flag setting in descendant cpusets. Even if
  427 +such a task could use spare CPU cycles on some other CPUs, the kernel
  428 +scheduler might not consider the possibility of load balancing that
  429 +task to that underused CPU.
  430 +
  431 +Of course, tasks pinned to a particular CPU can be left in a cpuset
  432 +that disables "sched_load_balance" as those tasks aren't going anywhere
  433 +else anyway.
  434 +
  435 +There is an impedance mismatch here, between cpusets and sched domains.
  436 +Cpusets are hierarchical and nest. Sched domains are flat; they don't
  437 +overlap and each CPU is in at most one sched domain.
  438 +
  439 +It is necessary for sched domains to be flat because load balancing
  440 +across partially overlapping sets of CPUs would risk unstable dynamics
  441 +that would be beyond our understanding. So if each of two partially
  442 +overlapping cpusets enables the flag 'sched_load_balance', then we
  443 +form a single sched domain that is a superset of both. We won't move
  444 +a task to a CPU outside its cpuset, but the scheduler load balancing
  445 +code might waste some compute cycles considering that possibility.
  446 +
  447 +This mismatch is why there is not a simple one-to-one relation
  448 +between which cpusets have the flag "sched_load_balance" enabled,
  449 +and the sched domain configuration. If a cpuset enables the flag, it
  450 +will get balancing across all its CPUs, but if it disables the flag,
  451 +it will only be assured of no load balancing if no other overlapping
  452 +cpuset enables the flag.
  453 +
  454 +If two cpusets have partially overlapping 'cpus' allowed, and only
  455 +one of them has this flag enabled, then the other may find its
  456 +tasks only partially load balanced, just on the overlapping CPUs.
  457 +This is just the general case of the top_cpuset example given a few
  458 +paragraphs above. In the general case, as in the top cpuset case,
  459 +don't leave tasks that might use non-trivial amounts of CPU in
  460 +such partially load balanced cpusets, as they may be artificially
  461 +constrained to some subset of the CPUs allowed to them, for lack of
  462 +load balancing to the other CPUs.
  463 +
  464 +1.7.1 sched_load_balance implementation details.
  465 +------------------------------------------------
  466 +
  467 +The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary
  468 +to most cpuset flags.) When enabled for a cpuset, the kernel will
  469 +ensure that it can load balance across all the CPUs in that cpuset
  470 +(makes sure that all the CPUs in the cpus_allowed of that cpuset are
  471 +in the same sched domain.)
  472 +
  473 +If two overlapping cpusets both have 'sched_load_balance' enabled,
  474 +then they will be (must be) both in the same sched domain.
  475 +
  476 +If, as is the default, the top cpuset has 'sched_load_balance' enabled,
  477 +then by the above that means there is a single sched domain covering
  478 +the whole system, regardless of any other cpuset settings.
  479 +
  480 +The kernel commits to user space that it will avoid load balancing
  481 +where it can. It will pick as fine-grained a partition of sched
  482 +domains as it can while still providing load balancing for any set
  483 +of CPUs allowed to a cpuset having 'sched_load_balance' enabled.
  484 +
  485 +The internal kernel cpuset to scheduler interface passes from the
  486 +cpuset code to the scheduler code a partition of the load balanced
  487 +CPUs in the system. This partition is a set of subsets (represented
  488 +as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all
  489 +the CPUs that must be load balanced.
  490 +
  491 +Whenever the 'sched_load_balance' flag changes, or CPUs come or go
  492 +from a cpuset with this flag enabled, or a cpuset with this flag
  493 +enabled is removed, the cpuset code builds a new such partition and
  494 +passes it to the scheduler sched domain setup code, to have the sched
  495 +domains rebuilt as necessary.
  496 +
  497 +This partition exactly defines what sched domains the scheduler should
  498 +set up - one sched domain for each element (cpumask_t) in the partition.
  499 +
  500 +The scheduler remembers the currently active sched domain partitions.
  501 +When the scheduler routine partition_sched_domains() is invoked from
  502 +the cpuset code to update these sched domains, it compares the new
  503 +partition requested with the current, and updates its sched domains,
  504 +removing the old and adding the new, for each change.
  505 +
  506 +
  507 +1.8 What is sched_relax_domain_level ?
  508 +--------------------------------------
  509 +
  510 +Within a sched domain, the scheduler migrates tasks in two ways: periodic
  511 +load balancing on the tick, and at the time of certain schedule events.
  512 +
  513 +When a task is woken up, the scheduler tries to move it to an idle CPU.
  514 +For example, if a task A running on CPU X activates another task B
  515 +on the same CPU X, and if CPU Y is X's sibling and idle, then the
  516 +scheduler migrates task B to CPU Y so that task B can start on
  517 +CPU Y without waiting for task A on CPU X.
  518 +
  519 +Likewise, if a CPU runs out of tasks in its runqueue, it tries to pull
  520 +extra tasks from other busy CPUs to help them before it goes
  521 +idle.
  522 +
  523 +Of course, it costs something to search for movable tasks and/or
  524 +idle CPUs, so the scheduler might not search all CPUs in the domain
  525 +every time. In fact, on some architectures, the search range on these
  526 +events is limited to the same socket or node as the CPU,
  527 +while the load balance on the tick searches all of them.
  528 +
  529 +For example, assume CPU Z is relatively far from CPU X. Even if CPU Z
  530 +is idle while CPU X and its siblings are busy, the scheduler can't
  531 +migrate woken task B from X to Z since Z is out of its search range.
  532 +As a result, task B on CPU X has to wait for task A or for the load
  533 +balance on the next tick. For some applications in special situations,
  534 +waiting one tick may be too long.
  535 +
  536 +The 'sched_relax_domain_level' file allows you to request changing
  537 +this search range as you like. This file takes an int value which
  538 +ideally indicates the size of the search range in levels, as follows;
  539 +otherwise, the initial value -1 indicates the cpuset has no request.
  540 +
  541 + -1 : no request. use system default or follow request of others.
  542 + 0 : no search.
  543 + 1 : search siblings (hyperthreads in a core).
  544 + 2 : search cores in a package.
  545 + 3 : search cpus in a node [= system wide on non-NUMA system]
  546 + ( 4 : search nodes in a chunk of node [on NUMA system] )
  547 + ( 5 : search system wide [on NUMA system] )
  548 +
  549 +The system default is architecture dependent. The system default
  550 +can be changed using the relax_domain_level= boot parameter.
  551 +
  552 +This file is per-cpuset and affects the sched domain the cpuset
  553 +belongs to. Therefore, if the flag 'sched_load_balance' of a cpuset
  554 +is disabled, then 'sched_relax_domain_level' has no effect since
  555 +there is no sched domain belonging to the cpuset.
  556 +
  557 +If multiple cpusets are overlapping and hence form a single sched
  558 +domain, the largest value among those is used. Be careful: if one
  559 +requests 0 and the others request -1, then 0 is used.
  560 +
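For example, to request that wakeups in a cpuset search for idle CPUs up
to the node level (a sketch, from inside that cpuset's directory):

# /bin/echo 3 > sched_relax_domain_level
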
  561 +Note that modifying this file will have both good and bad effects,
  562 +and whether it is acceptable or not depends on your situation.
  563 +Don't modify this file if you are not sure.
  564 +
  565 +If your situation is:
  566 + - The migration costs between CPUs can be assumed to be considerably
  567 + small (for you), due to your application's behavior or special
  568 + hardware support for CPU caches, etc.
  569 + - The search cost has no impact (for you), or you can make the
  570 + search cost small enough by managing your cpusets compactly, etc.
  571 + - Low latency is required, even if it sacrifices cache hit rate, etc.
  572 +then increasing 'sched_relax_domain_level' would benefit you.
  573 +
  574 +
  575 +1.9 How do I use cpusets ?
  576 +--------------------------
  577 +
  578 +In order to minimize the impact of cpusets on critical kernel
  579 +code, such as the scheduler, and due to the fact that the kernel
  580 +does not support one task updating the memory placement of another
  581 +task directly, the impact on a task of changing its cpuset CPU
  582 +or Memory Node placement, or of changing to which cpuset a task
  583 +is attached, is subtle.
  584 +
  585 +If a cpuset has its Memory Nodes modified, then for each task attached
  586 +to that cpuset, the next time that the kernel attempts to allocate
  587 +a page of memory for that task, the kernel will notice the change
  588 +in the task's cpuset, and update its per-task memory placement to
  589 +remain within the new cpuset's memory placement. If the task was using
  590 +mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
  591 +its new cpuset, then the task will continue to use whatever subset
  592 +of MPOL_BIND nodes are still allowed in the new cpuset. If the task
  593 +was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
  594 +in the new cpuset, then the task will be essentially treated as if it
  595 +was MPOL_BIND bound to the new cpuset (even though its numa placement,
  596 +as queried by get_mempolicy(), doesn't change). If a task is moved
  597 +from one cpuset to another, then the kernel will adjust the task's
  598 +memory placement, as above, the next time that the kernel attempts
  599 +to allocate a page of memory for that task.
  600 +
  601 +If a cpuset has its 'cpus' modified, then each task in that cpuset
  602 +will have its allowed CPU placement changed immediately. Similarly,
  603 +if a task's pid is written to a cpuset's 'tasks' file, in either its
  604 +current cpuset or another cpuset, then its allowed CPU placement is
  605 +changed immediately. If such a task had been bound to some subset
  606 +of its cpuset using the sched_setaffinity() call, the task will be
  607 +allowed to run on any CPU allowed in its new cpuset, negating the
  608 +effect of the prior sched_setaffinity() call.
  609 +
  610 +In summary, the memory placement of a task whose cpuset is changed is
  611 +updated by the kernel, on the next allocation of a page for that task,
  612 +but the processor placement is not updated, until that task's pid is
  613 +rewritten to the 'tasks' file of its cpuset. This is done to avoid
  614 +impacting the scheduler code in the kernel with a check for changes
  615 +in a task's processor placement.
  616 +
  617 +Normally, once a page is allocated (given a physical page
  618 +of main memory) then that page stays on whatever node it
  619 +was allocated, so long as it remains allocated, even if the
  620 +cpuset's memory placement policy 'mems' subsequently changes.
  621 +If the cpuset flag file 'memory_migrate' is set true, then when
  622 +tasks are attached to that cpuset, any pages that task had
  623 +allocated to it on nodes in its previous cpuset are migrated
  624 +to the task's new cpuset. The relative placement of the page within
  625 +the cpuset is preserved during these migration operations if possible.
  626 +For example if the page was on the second valid node of the prior cpuset
  627 +then the page will be placed on the second valid node of the new cpuset.
  628 +
  629 +Also if 'memory_migrate' is set true, then if that cpuset's
  630 +'mems' file is modified, pages allocated to tasks in that
  631 +cpuset, that were on nodes in the previous setting of 'mems',
  632 +will be moved to nodes in the new setting of 'mems.'
  633 +Pages that were not in the task's prior cpuset, or in the cpuset's
  634 +prior 'mems' setting, will not be moved.
  635 +
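For example (a sketch, from inside a cpuset directory):

# /bin/echo 1 > memory_migrate    -> migrate pages on 'mems' changes or task moves
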
  636 +There is an exception to the above. If hotplug functionality is used
  637 +to remove all the CPUs that are currently assigned to a cpuset,
  638 +then all the tasks in that cpuset will be moved to the nearest ancestor
  639 +with non-empty cpus. But the moving of some (or all) tasks might fail if
  640 +the cpuset is bound to another cgroup subsystem which has some restrictions
  641 +on task attaching. In this failing case, those tasks will stay
  642 +in the original cpuset, and the kernel will automatically update
  643 +their cpus_allowed to allow all online CPUs. When memory hotplug
  644 +functionality for removing Memory Nodes is available, a similar exception
  645 +is expected to apply there as well. In general, the kernel prefers to
  646 +violate cpuset placement, over starving a task that has had all
  647 +its allowed CPUs or Memory Nodes taken offline.
  648 +
  649 +There is a second exception to the above. GFP_ATOMIC requests are
  650 +kernel internal allocations that must be satisfied, immediately.
  651 +The kernel may drop some requests, in rare cases even panic, if a
  652 +GFP_ATOMIC alloc fails. If the request cannot be satisfied within
  653 +the current task's cpuset, then we relax the cpuset, and look for
  654 +memory anywhere we can find it. It's better to violate the cpuset
  655 +than stress the kernel.
  656 +
  657 +To start a new job that is to be contained within a cpuset, the steps are:
  658 +
  659 + 1) mkdir /dev/cpuset
  660 + 2) mount -t cgroup -ocpuset cpuset /dev/cpuset
  661 + 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
  662 + the /dev/cpuset virtual file system.
  663 + 4) Start a task that will be the "founding father" of the new job.
  664 + 5) Attach that task to the new cpuset by writing its pid to the
  665 + /dev/cpuset tasks file for that cpuset.
  666 + 6) fork, exec or clone the job tasks from this founding father task.
  667 +
  668 +For example, the following sequence of commands will set up a cpuset
  669 +named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
  670 +and then start a subshell 'sh' in that cpuset:
  671 +
  672 + mount -t cgroup -ocpuset cpuset /dev/cpuset
  673 + cd /dev/cpuset
  674 + mkdir Charlie
  675 + cd Charlie
  676 + /bin/echo 2-3 > cpus
  677 + /bin/echo 1 > mems
  678 + /bin/echo $$ > tasks
  679 + sh
  680 + # The subshell 'sh' is now running in cpuset Charlie
  681 + # The next line should display '/Charlie'
  682 + cat /proc/self/cpuset
  683 +
  684 +In the future, a C library interface to cpusets will likely be
  685 +available. For now, the only way to query or modify cpusets is
  686 +via the cpuset file system, using the various cd, mkdir, echo, cat,
  687 +rmdir commands from the shell, or their equivalent from C.
  688 +
  689 +The sched_setaffinity calls can also be done at the shell prompt using
  690 +SGI's runon or Robert Love's taskset. The mbind and set_mempolicy
  691 +calls can be done at the shell prompt using the numactl command
  692 +(part of Andi Kleen's numa package).
  693 +
  694 +2. Usage Examples and Syntax
  695 +============================
  696 +
  697 +2.1 Basic Usage
  698 +---------------
  699 +
  700 +Creating, modifying, and using cpusets can be done through the cpuset
  701 +virtual filesystem.
  702 +
  703 +To mount it, type:
  704 +# mount -t cgroup -o cpuset cpuset /dev/cpuset
  705 +
  706 +Then under /dev/cpuset you can find a tree that corresponds to the
  707 +tree of the cpusets in the system. For instance, /dev/cpuset
  708 +is the cpuset that holds the whole system.
  709 +
  710 +If you want to create a new cpuset under /dev/cpuset:
  711 +# cd /dev/cpuset
  712 +# mkdir my_cpuset
  713 +
  714 +Now you want to do something with this cpuset.
  715 +# cd my_cpuset
  716 +
  717 +In this directory you can find several files:
  718 +# ls
  719 +cpu_exclusive memory_migrate mems tasks
  720 +cpus memory_pressure notify_on_release
  721 +mem_exclusive memory_spread_page sched_load_balance
  722 +mem_hardwall memory_spread_slab sched_relax_domain_level
  723 +
  724 +Reading them will give you information about the state of this cpuset:
  725 +the CPUs and Memory Nodes it can use, the processes that are using
  726 +it, its properties. By writing to these files you can manipulate
  727 +the cpuset.
  728 +
  729 +Set some flags:
  730 +# /bin/echo 1 > cpu_exclusive
  731 +
  732 +Add some cpus:
  733 +# /bin/echo 0-7 > cpus
  734 +
  735 +Add some mems:
  736 +# /bin/echo 0-7 > mems
  737 +
  738 +Now attach your shell to this cpuset:
  739 +# /bin/echo $$ > tasks
  740 +
  741 +You can also create cpusets inside your cpuset by using mkdir in this
  742 +directory.
  743 +# mkdir my_sub_cs
  744 +
  745 +To remove a cpuset, just use rmdir:
  746 +# rmdir my_sub_cs
  747 +This will fail if the cpuset is in use (has cpusets inside, or has
  748 +processes attached).
  749 +
  750 +Note that for legacy reasons, the "cpuset" filesystem exists as a
  751 +wrapper around the cgroup filesystem.
  752 +
  753 +The command
  754 +
  755 +mount -t cpuset X /dev/cpuset
  756 +
  757 +is equivalent to
  758 +
  759 +mount -t cgroup -ocpuset X /dev/cpuset
  760 +echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent
  761 +
  762 +2.2 Adding/removing cpus
  763 +------------------------
  764 +
  765 +This is the syntax to use when writing in the cpus or mems files
  766 +in cpuset directories:
  767 +
  768 +# /bin/echo 1-4 > cpus -> set cpus list to cpus 1,2,3,4
  769 +# /bin/echo 1,2,3,4 > cpus -> set cpus list to cpus 1,2,3,4
  770 +
  771 +2.3 Setting flags
  772 +-----------------
  773 +
  774 +The syntax is very simple:
  775 +
  776 +# /bin/echo 1 > cpu_exclusive -> set flag 'cpu_exclusive'
  777 +# /bin/echo 0 > cpu_exclusive -> unset flag 'cpu_exclusive'
  778 +
  779 +2.4 Attaching processes
  780 +-----------------------
  781 +
  782 +# /bin/echo PID > tasks
  783 +
  784 +Note that it is PID, not PIDs. You can only attach ONE task at a time.
  785 +If you have several tasks to attach, you have to do it one after another:
  786 +
  787 +# /bin/echo PID1 > tasks
  788 +# /bin/echo PID2 > tasks
  789 + ...
  790 +# /bin/echo PIDn > tasks
  791 +
  792 +
  793 +3. Questions
  794 +============
  795 +
  796 +Q: what's up with this '/bin/echo' ?
  797 +A: bash's builtin 'echo' command does not check calls to write() against
  798 + errors. If you use it in the cpuset file system, you won't be
  799 + able to tell whether a command succeeded or failed.
  800 +
  801 +Q: When I attach processes, only the first one on the line really gets attached!
  802 +A: We can only return one error code per call to write(). So you should also
  803 + put only ONE pid.
  804 +
  805 +4. Contact
  806 +==========
  807 +
  808 +Web: http://www.bullopensource.org/cpuset
Documentation/cgroups/devices.txt
  1 +Device Whitelist Controller
  2 +
  3 +1. Description:
  4 +
  5 +Implement a cgroup to track and enforce open and mknod restrictions
  6 +on device files. A device cgroup associates a device access
  7 +whitelist with each cgroup. A whitelist entry has 4 fields.
  8 +'type' is a (all), c (char), or b (block). 'all' means it applies
  9 +to all types and all major and minor numbers. Major and minor are
  10 +either an integer or * for all. Access is a composition of r
  11 +(read), w (write), and m (mknod).
  12 +
  13 +The root device cgroup starts with rwm to 'all'. A child device
  14 +cgroup gets a copy of the parent. Administrators can then remove
  15 +devices from the whitelist or add new entries. A child cgroup can
  16 +never receive a device access which is denied by its parent. However,
  17 +when a device access is removed from a parent, it will not also be
  18 +removed from the child(ren).
  19 +
  20 +2. User Interface
  21 +
  22 +An entry is added using devices.allow, and removed using
  23 +devices.deny. For instance
  24 +
  25 + echo 'c 1:3 mr' > /cgroups/1/devices.allow
  26 +
  27 +allows cgroup 1 to read and mknod the device usually known as
  28 +/dev/null. Doing
  29 +
  30 + echo a > /cgroups/1/devices.deny
  31 +
  32 +will remove the default 'a *:* rwm' entry. Doing
  33 +
  34 + echo a > /cgroups/1/devices.allow
  35 +
  36 +will add the 'a *:* rwm' entry to the whitelist.
  37 +
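The current whitelist can be read back from the read-only devices.list
file (a sketch, continuing the example above):

 cat /cgroups/1/devices.list
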
  38 +3. Security
  39 +
  40 +Any task can move itself between cgroups. This clearly won't
  41 +suffice, but we can decide the best way to adequately restrict
  42 +movement as people get some experience with this. We may just want
  43 +to require CAP_SYS_ADMIN, which at least is a separate bit from
  44 +CAP_MKNOD. We may want to just refuse moving to a cgroup which
  45 +isn't a descendant of the current one. Or we may want to use
  46 +CAP_MAC_ADMIN, since we really are trying to lock down root.
  47 +
  48 +CAP_SYS_ADMIN is needed to modify the whitelist or move another
  49 +task to a new cgroup. (Again we'll probably want to change that).
  50 +
  51 +A cgroup may not be granted more permissions than the cgroup's
  52 +parent has.
Documentation/cgroups/memcg_test.txt
  1 +Memory Resource Controller(Memcg) Implementation Memo.
  2 +Last Updated: 2008/12/15
  3 +Base Kernel Version: based on 2.6.28-rc8-mm.
  4 +
  5 +Because the VM is getting complex (one of the reasons is memcg...), memcg's behavior
  6 +is complex. This is a document for memcg's internal behavior.
  7 +Please note that implementation details can be changed.
  8 +
  9 +(*) Topics on the API should be in Documentation/cgroups/memory.txt
  10 +
  11 +0. How to record usage ?
  12 + 2 objects are used.
  13 +
  14 + page_cgroup ....an object per page.
  15 + Allocated at boot or memory hotplug. Freed at memory hot removal.
  16 +
  17 + swap_cgroup ... an entry per swp_entry.
  18 + Allocated at swapon(). Freed at swapoff().
  19 +
  20 + The page_cgroup has a USED bit, and double counting against a page_cgroup
  21 + never occurs. swap_cgroup is used only when a charged page is swapped out.
  22 +
  23 +1. Charge
  24 +
  25 + a page/swp_entry may be charged (usage += PAGE_SIZE) at
  26 +
  27 + mem_cgroup_newpage_charge()
  28 + Called at new page fault and Copy-On-Write.
  29 +
  30 + mem_cgroup_try_charge_swapin()
  31 + Called at do_swap_page() (page fault on swap entry) and swapoff.
  32 + Followed by charge-commit-cancel protocol. (With swap accounting)
  33 + At commit, a charge recorded in swap_cgroup is removed.
  34 +
  35 + mem_cgroup_cache_charge()
  36 + Called at add_to_page_cache()
  37 +
  38 + mem_cgroup_cache_charge_swapin()
  39 + Called at shmem's swapin.
  40 +
  41 + mem_cgroup_prepare_migration()
  42 + Called before migration. "extra" charge is done and followed by
  43 + charge-commit-cancel protocol.
  44 + At commit, charge against oldpage or newpage will be committed.
  45 +
  46 +2. Uncharge
  47 + a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
  48 +
  49 + mem_cgroup_uncharge_page()
  50 + Called when an anonymous page is fully unmapped. I.e., mapcount goes
  51 + to 0. If the page is SwapCache, uncharge is delayed until
  52 + mem_cgroup_uncharge_swapcache().
  53 +
  54 + mem_cgroup_uncharge_cache_page()
  55 + Called when a page-cache is deleted from radix-tree. If the page is
  56 + SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache().
  57 +
  58 + mem_cgroup_uncharge_swapcache()
  59 + Called when SwapCache is removed from radix-tree. The charge itself
  60 + is moved to swap_cgroup. (If mem+swap controller is disabled, no
  61 + charge to swap occurs.)
  62 +
  63 + mem_cgroup_uncharge_swap()
  64 + Called when swp_entry's refcnt goes down to 0. A charge against swap
  65 + disappears.
  66 +
  67 + mem_cgroup_end_migration(old, new)
  68 + At success of migration old is uncharged (if necessary), a charge
  69 + to new page is committed. At failure, charge to old page is committed.
  70 +
  71 +3. charge-commit-cancel
  72 + In some cases, we can't know whether this "charge" is valid or not at
  73 + charge time (because of races).
  74 + To handle such cases, there are charge-commit-cancel functions.
  75 + mem_cgroup_try_charge_XXX
  76 + mem_cgroup_commit_charge_XXX
  77 + mem_cgroup_cancel_charge_XXX
  78 + these are used in swap-in and migration.
  79 +
  80 + At try_charge(), there are no flags to say "this page is charged";
  81 + at this point, usage += PAGE_SIZE.
  82 +
  83 + At commit(), the function checks whether the page should be charged or
  84 + not, and either sets flags or avoids charging (usage -= PAGE_SIZE).
  85 +
  86 + At cancel(), simply usage -= PAGE_SIZE.
  87 +
  88 +In the explanation below, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
  89 +
  90 +4. Anonymous
  91 + Anonymous page is newly allocated at
  92 + - page fault into MAP_ANONYMOUS mapping.
  93 + - Copy-On-Write.
  94 + It is charged right after it's allocated, before doing any page table
  95 + related operations. Of course, it's uncharged when another page is used
  96 + for the fault address.
  97 +
  98 + When freeing an anonymous page (by exit() or munmap()), zap_pte() is called
  99 + and pages for ptes are freed one by one (see mm/memory.c). Uncharges
  100 + are done at page_remove_rmap() when page_mapcount() goes down to 0.
  101 +
  102 + Another page freeing is by page-reclaim (vmscan.c) and anonymous
  103 + pages are swapped out. In this case, the page is marked as
  104 + PageSwapCache(). The uncharge() routine doesn't uncharge the page marked
  105 + as SwapCache(). It's delayed until __delete_from_swap_cache().
  106 +
  107 + 4.1 Swap-in.
  108 + At swap-in, the page is taken from swap-cache. There are 2 cases.
  109 +
  110 + (a) If the SwapCache is newly allocated and read, it has no charges.
  111 + (b) If the SwapCache has been mapped by processes, it has been
  112 + charged already.
  113 +
  114 + This swap-in is one of the most complicated cases. In do_swap_page(),
  115 + the following events occur when the pte is unchanged.
  116 +
  117 + (1) the page (SwapCache) is looked up.
  118 + (2) lock_page()
  119 + (3) try_charge_swapin()
  120 + (4) reuse_swap_page() (may call delete_swap_cache())
  121 + (5) commit_charge_swapin()
  122 + (6) swap_free().
  123 +
  124 + Consider the following situations, for example.
  125 +
  126 + (A) The page has not been charged before (2) and reuse_swap_page()
  127 + doesn't call delete_from_swap_cache().
  128 + (B) The page has not been charged before (2) and reuse_swap_page()
  129 + calls delete_from_swap_cache().
  130 + (C) The page has been charged before (2) and reuse_swap_page() doesn't
  131 + call delete_from_swap_cache().
  132 + (D) The page has been charged before (2) and reuse_swap_page() calls
  133 + delete_from_swap_cache().
  134 +
  135 + The memory.usage/memsw.usage changes to this page/swp_entry will be:
  136 + Case (A) (B) (C) (D)
  137 + Event
  138 + Before (2) 0/ 1 0/ 1 1/ 1 1/ 1
  139 + ===========================================
  140 + (3) +1/+1 +1/+1 +1/+1 +1/+1
  141 + (4) - 0/ 0 - -1/ 0
  142 + (5) 0/-1 0/ 0 -1/-1 0/ 0
  143 + (6) - 0/-1 - 0/-1
  144 + ===========================================
  145 + Result 1/ 1 1/ 1 1/ 1 1/ 1
  146 +
  147 + In all cases, charges to this page should be 1/ 1.
  148 +
  149 + 4.2 Swap-out.
  150 + At swap-out, typical state transition is below.
  151 +
  152 + (a) add to swap cache. (marked as SwapCache)
  153 + swp_entry's refcnt += 1.
  154 + (b) fully unmapped.
  155 + swp_entry's refcnt += # of ptes.
  156 + (c) write back to swap.
  157 + (d) delete from swap cache. (remove from SwapCache)
  158 + swp_entry's refcnt -= 1.
  159 +
  160 +
  161 + At (b), the page is marked as SwapCache and not uncharged.
  162 + At (d), the page is removed from SwapCache and a charge in page_cgroup
  163 + is moved to swap_cgroup.
  164 +
  165 + Finally, at task exit,
  166 + (e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
  167 + Here, a charge in swap_cgroup disappears.
  168 +
  169 +5. Page Cache
  170 + Page Cache is charged at
  171 + - add_to_page_cache_locked().
  172 +
  173 + uncharged at
  174 + - __remove_from_page_cache().
  175 +
  176 + The logic is very clear. (About migration, see below)
  177 + Note: __remove_from_page_cache() is called by remove_from_page_cache()
  178 + and __remove_mapping().
  179 +
  180 +6. Shmem(tmpfs) Page Cache
  181 + Memcg's charge/uncharge have special handlers for shmem. The best way
  182 + to understand shmem's page state transitions is to read mm/shmem.c.
  183 + But a brief explanation of the behavior of memcg around shmem will be
  184 + helpful for understanding the logic.
  185 +
  186 + Shmem's page (just leaf page, not direct/indirect block) can be on
  187 + - radix-tree of shmem's inode.
  188 + - SwapCache.
  189 + - Both on radix-tree and SwapCache. This happens at swap-in
  190 + and swap-out.
  191 +
  192 + It's charged when...
  193 + - A new page is added to shmem's radix-tree.
  194 + - A swp page is read. (move a charge from swap_cgroup to page_cgroup)
  195 + It's uncharged when
  196 + - A page is removed from radix-tree and not SwapCache.
  197 + - When SwapCache is removed, a charge is moved to swap_cgroup.
  198 + - When swp_entry's refcnt goes down to 0, a charge in swap_cgroup
  199 + disappears.
  200 +
  201 +7. Page Migration
  202 + One of the most complicated functions is the page migration handler.
  203 + Memcg has 2 routines. Assume that we are migrating a page's contents
  204 + from OLDPAGE to NEWPAGE.
  205 +
  206 + Usual migration logic is..
  207 + (a) remove the page from LRU.
  208 + (b) allocate NEWPAGE (migration target)
  209 + (c) lock by lock_page().
  210 + (d) unmap all mappings.
  211 + (e-1) If necessary, replace entry in radix-tree.
  212 + (e-2) move contents of a page.
  213 + (f) map all mappings again.
  214 + (g) pushback the page to LRU.
  215 + (-) OLDPAGE will be freed.
  216 +
  217 + Before (g), memcg should complete all necessary charge/uncharge to
  218 + NEWPAGE/OLDPAGE.
  219 +
  220 + The point is....
  221 + - If OLDPAGE is anonymous, all charges will be dropped at (d) because
  222 + try_to_unmap() drops all mapcount and the page will not be
  223 + SwapCache.
  224 +
  225 + - If OLDPAGE is SwapCache, charges will be kept at (g) because
  226 + __delete_from_swap_cache() isn't called at (e-1)
  227 +
  228 + - If OLDPAGE is page-cache, charges will be kept at (g) because
  229 + remove_from_swap_cache() isn't called at (e-1)
  230 +
  231 + memcg provides following hooks.
  232 +
  233 + - mem_cgroup_prepare_migration(OLDPAGE)
  234 + Called after (b) to account a charge (usage += PAGE_SIZE) against
  235 + memcg which OLDPAGE belongs to.
  236 +
  237 + - mem_cgroup_end_migration(OLDPAGE, NEWPAGE)
  238 + Called after (f) before (g).
  239 + If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already
  240 + charged, a charge by prepare_migration() is automatically canceled.
  241 + If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE.
  242 +
  243 + But zap_pte() (by exit or munmap) can be called while migration,
  244 + we have to check if OLDPAGE/NEWPAGE is a valid page after commit().
  245 +
  246 +8. LRU
  247 + Each memcg has its own private LRU. For now, its handling is under the
  248 + global VM's control (meaning that it's handled under the global
  249 + zone->lru_lock). Almost all routines around memcg's LRU are called by
  250 + the global LRU's list management functions under zone->lru_lock.
  251 +
  252 + A special function is mem_cgroup_isolate_pages(). This scans
  253 + memcg's private LRU and calls __isolate_lru_page() to extract a page
  254 + from the LRU.
  255 + (By __isolate_lru_page(), the page is removed from both the global and
  256 + the private LRU.)
  257 +
  258 +
  259 +9. Typical Tests.
  260 +
  261 + Tests for racy cases.
  262 +
  263 + 9.1 Small limit to memcg.
  264 + When testing racy cases, it's a good idea to set memcg's limit
  265 + very small, rather than in GB. Many races were found in tests under
  266 + xKB or xxMB limits.
  267 + (Memory behavior under GB limits and memory behavior under MB limits
  268 + show very different situations.)
  269 +
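 For example (a sketch; the cgroup path is hypothetical, and
 memory.limit_in_bytes accepts suffixed values):

 # echo 4M > /opt/cgroup/01/memory.limit_in_bytes
 # cat /opt/cgroup/01/memory.limit_in_bytes
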
  270 + 9.2 Shmem
  271 + Historically, memcg's shmem handling was poor and we saw a fair amount
  272 + of trouble here. This is because shmem is page-cache but can also be
  273 + SwapCache. Testing with shmem/tmpfs is always a good test.
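      +
      + For instance, assuming the small-limit group from 9.1 (paths
      + illustrative), a tmpfs write exercises the page-cache/SwapCache
      + transitions described in section 6:
      +
      + # mount -t tmpfs none /mnt
      + # echo $$ > /cgroups/test/tasks
      + # dd if=/dev/zero of=/mnt/file bs=4k count=10000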
  274 +
  275 + 9.3 Migration
  276 + For NUMA, migration is another special case. For easy testing, cpusets
  277 + are useful. The following is a sample setup to trigger migration.
  278 +
  279 + mount -t cgroup -o cpuset none /opt/cpuset
  280 +
  281 + mkdir /opt/cpuset/01
  282 + echo 1 > /opt/cpuset/01/cpuset.cpus
  283 + echo 0 > /opt/cpuset/01/cpuset.mems
  284 + echo 1 > /opt/cpuset/01/cpuset.memory_migrate
  285 + mkdir /opt/cpuset/02
  286 + echo 1 > /opt/cpuset/02/cpuset.cpus
  287 + echo 1 > /opt/cpuset/02/cpuset.mems
  288 + echo 1 > /opt/cpuset/02/cpuset.memory_migrate
  289 +
  290 + With the above setup, when you move a task from 01 to 02, page
  291 + migration from node 0 to node 1 will occur. The following is a script
  292 + to migrate all tasks under a cpuset.
  293 + --
  294 + move_task()
  295 + {
  296 + for pid in $1
  297 + do
  298 + /bin/echo $pid >$2/tasks 2>/dev/null
  299 + echo -n $pid
  300 + echo -n " "
  301 + done
  302 + echo END
  303 + }
  304 +
  305 + G1_TASK=`cat ${G1}/tasks`
  306 + G2_TASK=`cat ${G2}/tasks`
  307 + move_task "${G1_TASK}" ${G2} &
  308 + --
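      +
      + In this script, G1 and G2 are assumed to point at the source and
      + destination cpuset directories, e.g.:
      +
      + G1=/opt/cpuset/01
      + G2=/opt/cpuset/02
      +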
  309 + 9.4 Memory hotplug.
  310 + The memory hotplug test is another good test.
  311 + To offline memory, do the following:
  312 + # echo offline > /sys/devices/system/memory/memoryXXX/state
  313 + (XXX is the number of the memory section)
  314 + This is an easy way to test page migration, too.
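      +
      + A simple offline/online stress loop might look like this (the section
      + number 8 below is illustrative; pick one that is removable on your
      + machine):
      +
      + while true; do
      +     echo offline > /sys/devices/system/memory/memory8/state
      +     echo online > /sys/devices/system/memory/memory8/state
      + done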
  315 +
  316 + 9.5 mkdir/rmdir
  317 + When using hierarchy, the mkdir/rmdir test should be done.
  318 + Use tests like the following.
  319 +
  320 + echo 1 >/opt/cgroup/01/memory.use_hierarchy
  321 + mkdir /opt/cgroup/01/child_a
  322 + mkdir /opt/cgroup/01/child_b
  323 +
  324 + Set a limit on 01.
  325 + Add a limit to 01/child_b.
  326 + Run jobs under child_a and child_b.
  327 +
  328 + Create/delete the following groups at random while the jobs are running:
  329 + /opt/cgroup/01/child_a/child_aa
  330 + /opt/cgroup/01/child_b/child_bb
  331 + /opt/cgroup/01/child_c
  332 +
  333 + Running new jobs in a new group is also good.
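      +
      + A minimal random create/delete loop under the layout above might be:
      +
      + while true; do
      +     mkdir /opt/cgroup/01/child_a/child_aa 2>/dev/null
      +     rmdir /opt/cgroup/01/child_a/child_aa 2>/dev/null
      +     mkdir /opt/cgroup/01/child_c 2>/dev/null
      +     rmdir /opt/cgroup/01/child_c 2>/dev/null
      + done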
  334 +
  335 + 9.6 Mount with other subsystems.
  336 + Mounting with other subsystems is a good test because there are
  337 + races and lock dependencies with other cgroup subsystems.
  338 +
  339 + example)
  340 + # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices
  341 +
  342 + and do task moves, mkdir, rmdir, etc. under this.
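      +
      + For example, after the combined mount above (group names illustrative):
      +
      + # mkdir /cgroup/A /cgroup/B
      + # echo $$ > /cgroup/A/tasks
      + # echo $$ > /cgroup/B/tasks
      + # rmdir /cgroup/A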
Documentation/cgroups/memory.txt
  1 +Memory Resource Controller
  2 +
  3 +NOTE: The Memory Resource Controller is generically referred to
  4 +as the memory controller in this document. Do not confuse the memory
  5 +controller used here with the memory controller that is used in hardware.
  6 +
  7 +Salient features
  8 +
  9 +a. Enable control of both RSS (mapped) and Page Cache (unmapped) pages
  10 +b. The infrastructure allows easy addition of other types of memory to control
  11 +c. Provides *zero overhead* for non memory controller users
  12 +d. Provides a double LRU: global memory pressure causes reclaim from the
  13 + global LRU; a cgroup, on hitting a limit, reclaims from the per-cgroup
  14 + LRU
  15 +
  16 +NOTE: Swap Cache (unmapped) is not accounted now.
  17 +
  18 +Benefits and Purpose of the memory controller
  19 +
  20 +The memory controller isolates the memory behaviour of a group of tasks
  21 +from the rest of the system. The article on LWN [12] mentions some probable
  22 +uses of the memory controller. The memory controller can be used to
  23 +
  24 +a. Isolate an application or a group of applications
  25 + Memory hungry applications can be isolated and limited to a smaller
  26 + amount of memory.
  27 +b. Create a cgroup with a limited amount of memory; this can be used
  28 + as a good alternative to booting with mem=XXXX.
  29 +c. Virtualization solutions can control the amount of memory they want
  30 + to assign to a virtual machine instance.
  31 +d. A CD/DVD burner could control the amount of memory used by the
  32 + rest of the system to ensure that burning does not fail due to lack
  33 + of available memory.
  34 +e. There are several other use cases; find one, or use the controller just
  35 + for fun (to learn and hack on the VM subsystem).
  36 +
  37 +1. History
  38 +
  39 +The memory controller has a long history. A request for comments for the memory
  40 +controller was posted by Balbir Singh [1]. At the time the RFC was posted
  41 +there were several implementations for memory control. The goal of the
  42 +RFC was to build consensus and agreement for the minimal features required
  43 +for memory control. The first RSS controller was posted by Balbir Singh [2]
  44 +in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
  45 +RSS controller. At OLS, at the resource management BoF, everyone suggested
  46 +that we handle both page cache and RSS together. Another request was raised
  47 +to allow user space handling of OOM. The current memory controller is
  48 +at version 6; it combines both mapped (RSS) and unmapped Page
  49 +Cache Control [11].
  50 +
  51 +2. Memory Control
  52 +
  53 +Memory is a unique resource in the sense that it is present in a limited
  54 +amount. If a task requires a lot of CPU processing, the task can spread
  55 +its processing over a period of hours, days, months or years, but with
  56 +memory, the same physical memory needs to be reused to accomplish the task.
  57 +
  58 +The memory controller implementation has been divided into phases. These
  59 +are:
  60 +
  61 +1. Memory controller
  62 +2. mlock(2) controller
  63 +3. Kernel user memory accounting and slab control
  64 +4. User mappings length controller
  65 +
  66 +The memory controller is the first controller developed.
  67 +
  68 +2.1. Design
  69 +
  70 +The core of the design is a counter called the res_counter. The res_counter
  71 +tracks the current memory usage and limit of the group of processes associated
  72 +with the controller. Each cgroup has a memory controller specific data
  73 +structure (mem_cgroup) associated with it.
  74 +
  75 +2.2. Accounting
  76 +
  77 + +--------------------+
  78 + | mem_cgroup |
  79 + | (res_counter) |
  80 + +--------------------+
  81 + / ^ \
  82 + / | \
  83 + +---------------+ | +---------------+
  84 + | mm_struct | |.... | mm_struct |
  85 + | | | | |
  86 + +---------------+ | +---------------+
  87 + |
  88 + + --------------+
  89 + |
  90 + +---------------+ +------+--------+
  91 + | page +----------> page_cgroup|
  92 + | | | |
  93 + +---------------+ +---------------+
  94 +
  95 + (Figure 1: Hierarchy of Accounting)
  96 +
  97 +
  98 +Figure 1 shows the important aspects of the controller
  99 +
  100 +1. Accounting happens per cgroup
  101 +2. Each mm_struct knows about which cgroup it belongs to
  102 +3. Each page has a pointer to the page_cgroup, which in turn knows the
  103 + cgroup it belongs to
  104 +
  105 +The accounting is done as follows: mem_cgroup_charge() is invoked to setup
  106 +the necessary data structures and check if the cgroup that is being charged
  107 +is over its limit. If it is then reclaim is invoked on the cgroup.
  108 +More details can be found in the reclaim section of this document.
  109 +If everything goes well, a page meta-data-structure called page_cgroup is
  110 +allocated and associated with the page. This routine also adds the page to
  111 +the per cgroup LRU.
  112 +
  113 +2.2.1 Accounting details
  114 +
  115 +All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
  116 +(Some pages which can never be reclaimed and will not be on the global LRU
  117 + are not accounted. We just account pages under usual VM management.)
  118 +
  119 +RSS pages are accounted at page_fault unless they've already been accounted
  120 +for earlier. A file page will be accounted for as Page Cache when it's
  121 +inserted into the inode (radix-tree). While it's mapped into the page tables of
  122 +processes, duplicate accounting is carefully avoided.
  123 +
  124 +An RSS page is unaccounted when it's fully unmapped. A PageCache page is
  125 +unaccounted when it's removed from the radix-tree.
  126 +
  127 +At page migration, accounting information is kept.
  128 +
  129 +Note: we just account pages-on-LRU because our purpose is to control the
  130 +amount of used pages; not-on-LRU pages tend to be out of control from the
  131 +VM's point of view.
  131 +
  132 +2.3 Shared Page Accounting
  133 +
  134 +Shared pages are accounted on the basis of the first touch approach. The
  135 +cgroup that first touches a page is accounted for the page. The principle
  136 +behind this approach is that a cgroup that aggressively uses a shared
  137 +page will eventually get charged for it (once it is uncharged from
  138 +the cgroup that brought it in -- this will happen on memory pressure).
  139 +
  140 +Exception: If CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not used:
  141 +When you do swapoff and thereby force swapped-out pages of shmem (tmpfs)
  142 +back into memory, charges for those pages are accounted against the
  143 +caller of swapoff rather than the users of shmem.
  144 +
  145 +
  146 +2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP)
  147 +
  148 +The swap extension allows you to record charges for swap. A swapped-in page
  149 +is charged back to the cgroup which originally owned it, if possible.
  149 +
  150 +When swap is accounted, the following files are added:
  151 + - memory.memsw.usage_in_bytes
  152 + - memory.memsw.limit_in_bytes
  153 +
  154 +The usage of mem+swap is limited by memsw.limit_in_bytes.
  155 +
  156 +Note: why 'mem+swap' rather than just swap?
  157 +The global LRU (kswapd) can swap out arbitrary pages. Swapping out means
  158 +moving the account from memory to swap; there is no change in the usage
  159 +of mem+swap.
  160 +
  161 +In other words, when we want to limit the usage of swap without affecting
  162 +the global LRU, a mem+swap limit is better than just limiting swap, from
  163 +the OS's point of view.
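    +
    +For example, assuming the group set up in section 3 below, one can cap
    +mem+swap at 6M while capping memory alone at 4M (values illustrative;
    +memsw.limit_in_bytes must not be smaller than limit_in_bytes):
    +
    +# echo 4M > /cgroups/0/memory.limit_in_bytes
    +# echo 6M > /cgroups/0/memory.memsw.limit_in_bytes
    +# cat /cgroups/0/memory.memsw.usage_in_bytes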
  164 +
  165 +2.5 Reclaim
  166 +
  167 +Each cgroup maintains a per cgroup LRU that consists of an active
  168 +and inactive list. When a cgroup goes over its limit, we first try
  169 +to reclaim memory from the cgroup so as to make space for the new
  170 +pages that the cgroup has touched. If the reclaim is unsuccessful,
  171 +an OOM routine is invoked to select and kill the bulkiest task in the
  172 +cgroup.
  173 +
  174 +The reclaim algorithm has not been modified for cgroups, except that
  175 +pages that are selected for reclaiming come from the per cgroup LRU
  176 +list.
  177 +
  178 +2.6 Locking
  179 +
  180 +The memory controller uses the following locking hierarchy:
  181 +
  182 +1. zone->lru_lock is used for selecting pages to be isolated
  183 +2. mem->per_zone->lru_lock protects the per cgroup LRU (per zone)
  184 +3. lock_page_cgroup() is used to protect page->page_cgroup
  185 +
  186 +3. User Interface
  187 +
  188 +0. Configuration
  189 +
  190 +a. Enable CONFIG_CGROUPS
  191 +b. Enable CONFIG_RESOURCE_COUNTERS
  192 +c. Enable CONFIG_CGROUP_MEM_RES_CTLR
  193 +
  194 +1. Prepare the cgroups
  195 +# mkdir -p /cgroups
  196 +# mount -t cgroup none /cgroups -o memory
  197 +
  198 +2. Make the new group and move bash into it
  199 +# mkdir /cgroups/0
  200 +# echo $$ > /cgroups/0/tasks
  201 +
  202 +Since now we're in the 0 cgroup, we can
  203 +alter the memory limit:
  204 +# echo 4M > /cgroups/0/memory.limit_in_bytes
  205 +
  206 +NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
  207 +mega or gigabytes.
  208 +
  209 +# cat /cgroups/0/memory.limit_in_bytes
  210 +4194304
  211 +
  212 +NOTE: The interface has now changed to display the usage in bytes
  213 +instead of pages.
  214 +
  215 +We can check the usage:
  216 +# cat /cgroups/0/memory.usage_in_bytes
  217 +1216512
  218 +
  219 +A successful write to this file does not guarantee that the limit is set
  220 +to the exact value written into the file. It can be adjusted due to a
  221 +number of factors, such as rounding up to page boundaries or the total
  222 +availability of memory on the system. The user is required to re-read
  223 +this file after a write to see the value actually committed by the kernel.
  224 +
  225 +# echo 1 > memory.limit_in_bytes
  226 +# cat memory.limit_in_bytes
  227 +4096
  228 +
  229 +The memory.failcnt field gives the number of times that the cgroup limit was
  230 +exceeded.
  231 +
  232 +The memory.stat file gives accounting information. Currently, the numbers
  233 +of cache, RSS and active/inactive pages are shown.
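    +
    +For example (the values shown are illustrative):
    +
    +# cat /cgroups/0/memory.failcnt
    +12
    +# cat /cgroups/0/memory.stat
    +cache 1203
    +rss 97
    +
    +(see section 5.2 for the full list of memory.stat fields)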
  234 +
  235 +4. Testing
  236 +
  237 +Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11].
  238 +Apart from that v6 has been tested with several applications and regular
  239 +daily use. The controller has also been tested on the PPC64, x86_64 and
  240 +UML platforms.
  241 +
  242 +4.1 Troubleshooting
  243 +
  244 +Sometimes a user might find that the application under a cgroup is
  245 +terminated. There are several causes for this:
  246 +
  247 +1. The cgroup limit is too low (just too low to do anything useful)
  248 +2. The user is using anonymous memory and swap is turned off or too low
  249 +
  250 +A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
  251 +some of the pages cached in the cgroup (page cache pages).
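    +
    +For example:
    +
    +# sync
    +# echo 1 > /proc/sys/vm/drop_caches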
  252 +
  253 +4.2 Task migration
  254 +
  255 +When a task migrates from one cgroup to another, its charge is not
  256 +carried forward. The pages allocated in the original cgroup still
  257 +remain charged to it; the charge is dropped when the page is freed or
  258 +reclaimed.
  259 +
  260 +4.3 Removing a cgroup
  261 +
  262 +A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
  263 +cgroup might have some charge associated with it, even though all
  264 +tasks have migrated away from it.
  265 +Such charges are freed (by default) or moved to the parent. When moved,
  266 +both RSS and CACHES are moved to the parent.
  267 +If both of them are busy, rmdir() returns -EBUSY. See section 5.1 also.
  268 +
  269 +Charges recorded in swap information are not updated at removal of a cgroup.
  270 +The recorded information is discarded, and a cgroup which uses the swap
  271 +entry (swapcache) will be charged as its new owner.
  272 +
  273 +
  274 +5. Misc. interfaces.
  275 +
  276 +5.1 force_empty
  277 + The memory.force_empty interface is provided to make a cgroup's memory
  278 + usage empty. You can use this interface only when the cgroup has no tasks.
  279 + When you write anything to this file:
  280 +
  281 + # echo 0 > memory.force_empty
  282 +
  283 + Almost all pages tracked by this memcg will be unmapped and freed. Some
  284 + pages cannot be freed because they are locked or in use. Such pages are
  285 + moved to the parent, and this cgroup becomes empty. But this may return
  286 + -EBUSY if the cgroup is too busy.
  287 +
  288 + The typical use case for this interface is calling it before rmdir().
  289 + Because rmdir() moves all pages to the parent, some out-of-use page caches
  290 + can be moved to the parent. If you want to avoid that, force_empty is useful.
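    +
    + For example, once all tasks have been moved out of the group
    + (paths as in section 3):
    +
    + # echo 0 > /cgroups/0/memory.force_empty
    + # rmdir /cgroups/0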
  291 +
  292 +5.2 stat file
  293 + The memory.stat file includes the following statistics (currently):
  294 + cache - # of pages from page-cache and shmem.
  295 + rss - # of pages from anonymous memory.
  296 + pgpgin - # of event of charging
  297 + pgpgout - # of event of uncharging
  298 + active_anon - # of pages on active lru of anon, shmem.
  299 + inactive_anon - # of pages on inactive lru of anon, shmem
  300 + active_file - # of pages on active lru of file-cache
  301 + inactive_file - # of pages on inactive lru of file cache
  302 + unevictable - # of pages which cannot be reclaimed (mlocked etc.)
  303 +
  304 + The following depend on CONFIG_DEBUG_VM.
  305 + inactive_ratio - VM internal parameter. (see mm/page_alloc.c)
  306 + recent_rotated_anon - VM internal parameter. (see mm/vmscan.c)
  307 + recent_rotated_file - VM internal parameter. (see mm/vmscan.c)
  308 + recent_scanned_anon - VM internal parameter. (see mm/vmscan.c)
  309 + recent_scanned_file - VM internal parameter. (see mm/vmscan.c)
  310 +
  311 + Memo:
  312 + recent_rotated means the recent frequency of LRU rotation.
  313 + recent_scanned means the recent # of scans over the LRU.
  314 + These are shown for easier debugging; please see the code for details.
  315 +
  316 +
  317 +5.3 swappiness
  318 + Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
  319 +
  320 + The swappiness of the following cgroups can't be changed:
  321 + - the root cgroup (it uses /proc/sys/vm/swappiness).
  322 + - a cgroup which uses hierarchy and has a child cgroup.
  323 + - a cgroup which uses hierarchy and is not the root of the hierarchy.
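    +
    + For example, to make a standalone group avoid swapping as much as
    + possible (the file name memory.swappiness follows the naming convention
    + of the other knobs):
    +
    + # echo 0 > /cgroups/0/memory.swappiness
    + # cat /cgroups/0/memory.swappiness
    + 0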
  324 +
  325 +
  326 +6. Hierarchy support
  327 +
  328 +The memory controller supports a deep hierarchy and hierarchical accounting.
  329 +The hierarchy is created by creating the appropriate cgroups in the
  330 +cgroup filesystem. Consider for example, the following cgroup filesystem
  331 +hierarchy
  332 +
  333 + root
  334 + / | \
  335 + / | \
  336 + a b c
  337 + | \
  338 + | \
  339 + d e
  340 +
  341 +In the diagram above, with hierarchical accounting enabled, all memory
  342 +usage of e is accounted to its ancestors up to the root (i.e., c and root)
  343 +that have memory.use_hierarchy enabled. If one of the ancestors goes over
  344 +its limit, the reclaim algorithm reclaims from the tasks in that ancestor
  345 +and in the ancestor's children.
  346 +
  347 +6.1 Enabling hierarchical accounting and reclaim
  348 +
  349 +The memory controller disables the hierarchy feature by default. It can
  350 +be enabled by writing 1 to the memory.use_hierarchy file of the root cgroup:
  351 +
  352 +# echo 1 > memory.use_hierarchy
  353 +
  354 +The feature can be disabled by
  355 +
  356 +# echo 0 > memory.use_hierarchy
  357 +
  358 +NOTE1: Enabling/disabling will fail if the cgroup already has other
  359 +cgroups created below it.
  360 +
  361 +NOTE2: This feature can be enabled/disabled per subtree.
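    +
    +As a concrete sketch of the diagram above (using the mount point from
    +section 3; directory names as in the figure; the 40M limit is illustrative):
    +
    +# echo 1 > /cgroups/memory.use_hierarchy
    +# mkdir /cgroups/c
    +# mkdir /cgroups/c/e
    +# echo 40M > /cgroups/c/memory.limit_in_bytes
    +# echo $$ > /cgroups/c/e/tasks
    +
    +Memory used by the shell is now charged to e, c and the root; if c goes
    +over 40M, reclaim is attempted from the tasks in c and its children.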
  362 +
  363 +7. TODO
  364 +
  365 +1. Add support for accounting huge pages (as a separate controller)
  366 +2. Make per-cgroup scanner reclaim not-shared pages first
  367 +3. Teach controller to account for shared-pages
  368 +4. Start reclamation in the background when the limit is
  369 + not yet hit but the usage is getting closer
  370 +
  371 +Summary
  372 +
  373 +Overall, the memory controller has been a stable controller and has been
  374 +commented and discussed quite extensively in the community.
  375 +
  376 +References
  377 +
  378 +1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
  379 +2. Singh, Balbir. Memory Controller (RSS Control),
  380 + http://lwn.net/Articles/222762/
  381 +3. Emelianov, Pavel. Resource controllers based on process cgroups
  382 + http://lkml.org/lkml/2007/3/6/198
  383 +4. Emelianov, Pavel. RSS controller based on process cgroups (v2)
  384 + http://lkml.org/lkml/2007/4/9/78
  385 +5. Emelianov, Pavel. RSS controller based on process cgroups (v3)
  386 + http://lkml.org/lkml/2007/5/30/244
  387 +6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/
  388 +7. Vaidyanathan, Srinivasan. Control Groups: Pagecache accounting and control
  389 + subsystem (v3), http://lwn.net/Articles/235534/
  390 +8. Singh, Balbir. RSS controller v2 test results (lmbench),
  391 + http://lkml.org/lkml/2007/5/17/232
  392 +9. Singh, Balbir. RSS controller v2 AIM9 results
  393 + http://lkml.org/lkml/2007/5/18/1
  394 +10. Singh, Balbir. Memory controller v6 test results,
  395 + http://lkml.org/lkml/2007/8/19/36
  396 +11. Singh, Balbir. Memory controller introduction (v6),
  397 + http://lkml.org/lkml/2007/8/17/69
  398 +12. Corbet, Jonathan. Controlling memory use in cgroups,
  399 + http://lwn.net/Articles/243795/
Documentation/cgroups/resource_counter.txt
  1 +
  2 + The Resource Counter
  3 +
  4 +The resource counter, declared at include/linux/res_counter.h,
  5 +is intended to facilitate resource management by controllers
  6 +by providing common infrastructure for accounting.
  7 +
  8 +This infrastructure includes the res_counter structure and routines
  9 +to work with it.
  10 +
  11 +
  12 +
  13 +1. Crucial parts of the res_counter structure
  14 +
  15 + a. unsigned long long usage
  16 +
  17 + The usage value shows the amount of a resource that is consumed
  18 + by a group at a given time. The units of measurement should be
  19 + determined by the controller that uses this counter. E.g. it can
  20 + be bytes, items or any other unit the controller operates on.
  21 +
  22 + b. unsigned long long max_usage
  23 +
  24 + The maximal value of the usage over time.
  25 +
  26 + This value is useful when gathering statistical information about
  27 + the particular group, as it shows the actual resource requirements
  28 + for a particular group, not just some usage snapshot.
  29 +
  30 + c. unsigned long long limit
  31 +
  32 + The maximal allowed amount of the resource the group may consume. If
  33 + the group requests more of the resource, so that the usage value
  34 + would exceed the limit, the allocation is rejected (see
  35 + the next section).
  36 +
  37 + d. unsigned long long failcnt
  38 +
  39 + The failcnt stands for "failures counter". This is the number of
  40 + resource allocation attempts that failed.
  41 +
  42 + e. spinlock_t lock
  43 +
  44 + Protects changes of the above values.
  45 +
  46 +
  47 +
  48 +2. Basic accounting routines
  49 +
  50 + a. void res_counter_init(struct res_counter *rc)
  51 +
  52 + Initializes the resource counter. As usual, should be the first
  53 + routine called for a new counter.
  54 +
  55 + b. int res_counter_charge[_locked]
  56 + (struct res_counter *rc, unsigned long val)
  57 +
  58 + When a resource is about to be allocated it has to be accounted
  59 + with the appropriate resource counter (controller should determine
  60 + which one to use on its own). This operation is called "charging".
  61 +
  62 + It is not very important which operation - resource allocation
  63 + or charging - is performed first, but:
  64 + * if the allocation is performed first, this may create a
  65 + temporary resource over-usage by the time the resource counter
  66 + is charged;
  67 + * if the charging is performed first, then it should be undone
  68 + (uncharged) on the error path, if one is taken.
  69 +
  70 + c. void res_counter_uncharge[_locked]
  71 + (struct res_counter *rc, unsigned long val)
  72 +
  73 + When a resource is released (freed) it should be de-accounted
  74 + from the resource counter it was accounted to. This is called
  75 + "uncharging".
  76 +
  77 + The _locked routines assume that res_counter->lock is already held by the caller.
  78 +
  79 +
  80 + 2.1 Other accounting routines
  81 +
  82 + There are more routines that may help you with common needs, like
  83 + checking whether the limit is reached or resetting the max_usage
  84 + value. They are all declared in include/linux/res_counter.h.
  85 +
  86 +
  87 +
  88 +3. Analyzing the resource counter registrations
  89 +
  90 + a. If the failcnt value constantly grows, this means that the counter's
  91 + limit is too tight. Either the group is misbehaving and consumes too
  92 + many resources, or the configuration is not suitable for the group
  93 + and the limit should be increased.
  94 +
  95 + b. The max_usage value can be used to quickly tune the group. One may
  96 + set the limit to its maximal value and either load the container
  97 + with a common usage pattern or leave it running for a while. After
  98 + this the max_usage value shows the amount of memory the container
  99 + would require during its common activity.
  100 +
  101 + Setting the limit a bit above this value gives a pretty good
  102 + configuration that works in most of the cases.
  103 +
  104 + c. If the max_usage is much less than the limit, but the failcnt value
  105 + is growing, then the group tries to allocate a big chunk of resource
  106 + at once.
  107 +
  108 + d. If the max_usage is much less than the limit, but the failcnt value
  109 + is 0, then the group has been given a limit that is too high, one it
  110 + does not require. It is better to lower the limit a bit, leaving more
  111 + of the resource for other groups.
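    +
    + For example, with the memory controller (whose file names follow the
    + rules in section 4 below), the tuning flow from point (b) might be
    + (values illustrative):
    +
    + # echo -1 > /cgroups/0/memory.limit_in_bytes
    + (run the common workload for a while, then:)
    + # cat /cgroups/0/memory.max_usage_in_bytes
    + 73728000
    + # echo 80M > /cgroups/0/memory.limit_in_bytes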
  112 +
  113 +
  114 +
  115 +4. Communication with the control groups subsystem (cgroups)
  116 +
  117 +All the resource controllers that are using cgroups and resource counters
  118 +should provide files (in the cgroup filesystem) to work with the resource
  119 +counter fields. They are recommended to adhere to the following rules:
  120 +
  121 + a. File names
  122 +
  123 + Field name File name
  124 + ---------------------------------------------------
  125 + usage usage_in_<unit_of_measurement>
  126 + max_usage max_usage_in_<unit_of_measurement>
  127 + limit limit_in_<unit_of_measurement>
  128 + failcnt failcnt
  129 + lock no file :)
  130 +
  131 + b. Reading from file should show the corresponding field value in the
  132 + appropriate format.
  133 +
  134 + c. Writing to file
  135 +
  136 + Field Expected behavior
  137 + ----------------------------------
  138 + usage prohibited
  139 + max_usage reset to usage
  140 + limit set the limit
  141 + failcnt reset to zero
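    +
    + For instance, with the memory controller these rules mean (values
    + illustrative):
    +
    + # cat memory.max_usage_in_bytes
    + 4325376
    + # echo 0 > memory.max_usage_in_bytes
    + # echo 0 > memory.failcnt
    +
    + where the two writes reset max_usage to the current usage and failcnt
    + to zero, regardless of the value written.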
  142 +
  143 +
  144 +
  145 +5. Usage example
  146 +
  147 + a. Declare a task group (take a look at cgroups subsystem for this) and
  148 + fold a res_counter into it
  149 +
  150 + struct my_group {
  151 + struct res_counter res;
  152 +
  153 + <other fields>
  154 + };
  155 +
  156 + b. Put hooks in resource allocation/release paths
  157 +
  158 + int alloc_something(...)
  159 + {
  160 + if (res_counter_charge(res_counter_ptr, amount) < 0)
  161 + return -ENOMEM;
  162 +
  163 + <allocate the resource and return to the caller>
  164 + }
  165 +
  166 + void release_something(...)
  167 + {
  168 + res_counter_uncharge(res_counter_ptr, amount);
  169 +
  170 + <release the resource>
  171 + }
  172 +
  173 + In order to keep the usage value self-consistent, both the
  174 + "res_counter_ptr" and the "amount" in release_something() should be
  175 + the same as they were in alloc_something() when the resource being
  176 + released was allocated.
  177 +
  178 + c. Provide a way to read res_counter values and to set them (the
  179 + cgroups subsystem can help with this).
  180 +
  181 + d. Compile and run :)
Documentation/controllers/memory.txt
1   -Memory Resource Controller
2   -
3   -NOTE: The Memory Resource Controller has been generically been referred
4   -to as the memory controller in this document. Do not confuse memory controller
5   -used here with the memory controller that is used in hardware.
6   -
7   -Salient features
8   -
9   -a. Enable control of both RSS (mapped) and Page Cache (unmapped) pages
10   -b. The infrastructure allows easy addition of other types of memory to control
11   -c. Provides *zero overhead* for non memory controller users
12   -d. Provides a double LRU: global memory pressure causes reclaim from the
13   - global LRU; a cgroup on hitting a limit, reclaims from the per
14   - cgroup LRU
15   -
16   -NOTE: Swap Cache (unmapped) is not accounted now.
17   -
18   -Benefits and Purpose of the memory controller
19   -
20   -The memory controller isolates the memory behaviour of a group of tasks
21   -from the rest of the system. The article on LWN [12] mentions some probable
22   -uses of the memory controller. The memory controller can be used to
23   -
24   -a. Isolate an application or a group of applications
25   - Memory hungry applications can be isolated and limited to a smaller
26   - amount of memory.
27   -b. Create a cgroup with limited amount of memory, this can be used
28   - as a good alternative to booting with mem=XXXX.
29   -c. Virtualization solutions can control the amount of memory they want
30   - to assign to a virtual machine instance.
31   -d. A CD/DVD burner could control the amount of memory used by the
32   - rest of the system to ensure that burning does not fail due to lack
33   - of available memory.
34   -e. There are several other use cases, find one or use the controller just
35   - for fun (to learn and hack on the VM subsystem).
36   -
37   -1. History
38   -
39   -The memory controller has a long history. A request for comments for the memory
40   -controller was posted by Balbir Singh [1]. At the time the RFC was posted
41   -there were several implementations for memory control. The goal of the
42   -RFC was to build consensus and agreement for the minimal features required
43   -for memory control. The first RSS controller was posted by Balbir Singh[2]
44   -in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
45   -RSS controller. At OLS, at the resource management BoF, everyone suggested
46   -that we handle both page cache and RSS together. Another request was raised
47   -to allow user space handling of OOM. The current memory controller is
48   -at version 6; it combines both mapped (RSS) and unmapped Page
49   -Cache Control [11].
50   -
51   -2. Memory Control
52   -
53   -Memory is a unique resource in the sense that it is present in a limited
54   -amount. If a task requires a lot of CPU processing, the task can spread
55   -its processing over a period of hours, days, months or years, but with
56   -memory, the same physical memory needs to be reused to accomplish the task.
57   -
58   -The memory controller implementation has been divided into phases. These
59   -are:
60   -
61   -1. Memory controller
62   -2. mlock(2) controller
63   -3. Kernel user memory accounting and slab control
64   -4. user mappings length controller
65   -
66   -The memory controller is the first controller developed.
67   -
68   -2.1. Design
69   -
70   -The core of the design is a counter called the res_counter. The res_counter
71   -tracks the current memory usage and limit of the group of processes associated
72   -with the controller. Each cgroup has a memory controller specific data
73   -structure (mem_cgroup) associated with it.
74   -
75   -2.2. Accounting
76   -
77   - +--------------------+
78   - | mem_cgroup |
79   - | (res_counter) |
80   - +--------------------+
81   - / ^ \
82   - / | \
83   - +---------------+ | +---------------+
84   - | mm_struct | |.... | mm_struct |
85   - | | | | |
86   - +---------------+ | +---------------+
87   - |
88   - + --------------+
89   - |
90   - +---------------+ +------+--------+
91   - | page +----------> page_cgroup|
92   - | | | |
93   - +---------------+ +---------------+
94   -
95   - (Figure 1: Hierarchy of Accounting)
96   -
97   -
98   -Figure 1 shows the important aspects of the controller
99   -
100   -1. Accounting happens per cgroup
101   -2. Each mm_struct knows about which cgroup it belongs to
102   -3. Each page has a pointer to the page_cgroup, which in turn knows the
103   - cgroup it belongs to
104   -
105   -The accounting is done as follows: mem_cgroup_charge() is invoked to setup
106   -the necessary data structures and check if the cgroup that is being charged
107   -is over its limit. If it is then reclaim is invoked on the cgroup.
108   -More details can be found in the reclaim section of this document.
109   -If everything goes well, a page meta-data-structure called page_cgroup is
110   -allocated and associated with the page. This routine also adds the page to
111   -the per cgroup LRU.
112   -
113   -2.2.1 Accounting details
114   -
115   -All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
116   -(some pages which never be reclaimable and will not be on global LRU
117   - are not accounted. we just accounts pages under usual vm management.)
118   -
119   -RSS pages are accounted at page_fault unless they've already been accounted
120   -for earlier. A file page will be accounted for as Page Cache when it's
121   -inserted into inode (radix-tree). While it's mapped into the page tables of
122   -processes, duplicate accounting is carefully avoided.
123   -
124   -A RSS page is unaccounted when it's fully unmapped. A PageCache page is
125   -unaccounted when it's removed from radix-tree.
126   -
127   -At page migration, accounting information is kept.
128   -
129   -Note: we just account pages-on-lru because our purpose is to control amount
130   -of used pages. not-on-lru pages are tend to be out-of-control from vm view.
131   -
132   -2.3 Shared Page Accounting
133   -
134   -Shared pages are accounted on the basis of the first touch approach. The
135   -cgroup that first touches a page is accounted for the page. The principle
136   -behind this approach is that a cgroup that aggressively uses a shared
137   -page will eventually get charged for it (once it is uncharged from
138   -the cgroup that brought it in -- this will happen on memory pressure).
139   -
140   -Exception: If CONFIG_CGROUP_CGROUP_MEM_RES_CTLR_SWAP is not used..
141   -When you do swapoff and make swapped-out pages of shmem(tmpfs) to
142   -be backed into memory in force, charges for pages are accounted against the
143   -caller of swapoff rather than the users of shmem.
144   -
145   -
146   -2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP)
147   -Swap Extension allows you to record charge for swap. A swapped-in page is
148   -charged back to original page allocator if possible.
149   -
150   -When swap is accounted, following files are added.
151   - - memory.memsw.usage_in_bytes.
152   - - memory.memsw.limit_in_bytes.
153   -
154   -usage of mem+swap is limited by memsw.limit_in_bytes.
155   -
156   -Note: why 'mem+swap' rather than swap.
157   -The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
158   -to move account from memory to swap...there is no change in usage of
159   -mem+swap.
160   -
161   -In other words, when we want to limit the usage of swap without affecting
162   -global LRU, mem+swap limit is better than just limiting swap from OS point
163   -of view.
164   -
165   -2.5 Reclaim
166   -
167   -Each cgroup maintains a per cgroup LRU that consists of an active
168   -and inactive list. When a cgroup goes over its limit, we first try
169   -to reclaim memory from the cgroup so as to make space for the new
170   -pages that the cgroup has touched. If the reclaim is unsuccessful,
171   -an OOM routine is invoked to select and kill the bulkiest task in the
172   -cgroup.
173   -
174   -The reclaim algorithm has not been modified for cgroups, except that
175   -pages that are selected for reclaiming come from the per cgroup LRU
176   -list.
177   -
178   -2. Locking
179   -
180   -The memory controller uses the following hierarchy
181   -
182   -1. zone->lru_lock is used for selecting pages to be isolated
183   -2. mem->per_zone->lru_lock protects the per cgroup LRU (per zone)
184   -3. lock_page_cgroup() is used to protect page->page_cgroup
185   -
186   -3. User Interface
187   -
188   -0. Configuration
189   -
190   -a. Enable CONFIG_CGROUPS
191   -b. Enable CONFIG_RESOURCE_COUNTERS
192   -c. Enable CONFIG_CGROUP_MEM_RES_CTLR
193   -
194   -1. Prepare the cgroups
195   -# mkdir -p /cgroups
196   -# mount -t cgroup none /cgroups -o memory
197   -
198   -2. Make the new group and move bash into it
199   -# mkdir /cgroups/0
200   -# echo $$ > /cgroups/0/tasks
201   -
202   -Since now we're in the 0 cgroup,
203   -We can alter the memory limit:
204   -# echo 4M > /cgroups/0/memory.limit_in_bytes
205   -
206   -NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
207   -mega or gigabytes.
208   -
209   -# cat /cgroups/0/memory.limit_in_bytes
210   -4194304
211   -
212   -NOTE: The interface has now changed to display the usage in bytes
213   -instead of pages
214   -
215   -We can check the usage:
216   -# cat /cgroups/0/memory.usage_in_bytes
217   -1216512
218   -
219   -A successful write to this file does not guarantee a successful set of
220   -this limit to the value written into the file. This can be due to a
221   -number of factors, such as rounding up to page boundaries or the total
222   -availability of memory on the system. The user is required to re-read
223   -this file after a write to guarantee the value committed by the kernel.
224   -
225   -# echo 1 > memory.limit_in_bytes
226   -# cat memory.limit_in_bytes
227   -4096
228   -
229   -The memory.failcnt field gives the number of times that the cgroup limit was
230   -exceeded.
231   -
232   -The memory.stat file gives accounting information. Now, the number of
233   -caches, RSS and Active pages/Inactive pages are shown.
234   -
235   -4. Testing
236   -
237   -Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11].
238   -Apart from that v6 has been tested with several applications and regular
239   -daily use. The controller has also been tested on the PPC64, x86_64 and
240   -UML platforms.
241   -
242   -4.1 Troubleshooting
243   -
244   -Sometimes a user might find that the application under a cgroup is
245   -terminated. There are several causes for this:
246   -
247   -1. The cgroup limit is too low (just too low to do anything useful)
248   -2. The user is using anonymous memory and swap is turned off or too low
249   -
250   -A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
251   -some of the pages cached in the cgroup (page cache pages).
252   -
253   -4.2 Task migration
254   -
255   -When a task migrates from one cgroup to another, it's charge is not
256   -carried forward. The pages allocated from the original cgroup still
257   -remain charged to it, the charge is dropped when the page is freed or
258   -reclaimed.
259   -
260   -4.3 Removing a cgroup
261   -
262   -A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
263   -cgroup might have some charge associated with it, even though all
264   -tasks have migrated away from it.
265   -Such charges are freed(at default) or moved to its parent. When moved,
266   -both of RSS and CACHES are moved to parent.
267   -If both of them are busy, rmdir() returns -EBUSY. See 5.1 Also.
268   -
269   -Charges recorded in swap information is not updated at removal of cgroup.
270   -Recorded information is discarded and a cgroup which uses swap (swapcache)
271   -will be charged as a new owner of it.
272   -
273   -
274   -5. Misc. interfaces.
275   -
276   -5.1 force_empty
277   - memory.force_empty interface is provided to make cgroup's memory usage empty.
278   - You can use this interface only when the cgroup has no tasks.
279   - When writing anything to this
280   -
281   - # echo 0 > memory.force_empty
282   -
283   - Almost all pages tracked by this memcg will be unmapped and freed. Some of
284   - pages cannot be freed because it's locked or in-use. Such pages are moved
285   - to parent and this cgroup will be empty. But this may return -EBUSY in
286   - some too busy case.
287   -
288   - Typical use case of this interface is that calling this before rmdir().
289   - Because rmdir() moves all pages to parent, some out-of-use page caches can be
290   - moved to the parent. If you want to avoid that, force_empty will be useful.
291   -
292   -5.2 stat file
293   - The memory.stat file currently includes the following statistics:
294   - cache - # of pages from the page cache and shmem.
295   - rss - # of pages from anonymous memory.
296   - pgpgin - # of charging events.
297   - pgpgout - # of uncharging events.
298   - active_anon - # of pages on the active LRU of anon and shmem.
299   - inactive_anon - # of pages on the inactive LRU of anon and shmem.
300   - active_file - # of pages on the active LRU of the file cache.
301   - inactive_file - # of pages on the inactive LRU of the file cache.
302   - unevictable - # of pages that cannot be reclaimed (mlocked etc.).
303   -
304   - The following entries depend on CONFIG_DEBUG_VM:
305   - inactive_ratio - VM internal parameter. (see mm/page_alloc.c)
306   - recent_rotated_anon - VM internal parameter. (see mm/vmscan.c)
307   - recent_rotated_file - VM internal parameter. (see mm/vmscan.c)
308   - recent_scanned_anon - VM internal parameter. (see mm/vmscan.c)
309   - recent_scanned_file - VM internal parameter. (see mm/vmscan.c)
310   -
311   - Memo:
312   - recent_rotated means the recent frequency of LRU rotation.
313   - recent_scanned means the recent # of scans of the LRU.
314   - These are shown to help debugging; see the code for their exact meanings.
315   -
316   -
317   -5.3 swappiness
318   - Similar to /proc/sys/vm/swappiness, but affecting this hierarchy of groups only.
319   -
320   - The swappiness of the following cgroups cannot be changed:
321   - - the root cgroup (it uses /proc/sys/vm/swappiness).
322   - - a cgroup which uses the hierarchy and has a child cgroup.
323   - - a cgroup which uses the hierarchy and is not the root of the hierarchy.
324   -
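      - For example (the value 30 is only illustrative; the valid range mirrors
      - /proc/sys/vm/swappiness):
      -
      - # echo 30 > memory.swappiness
      - # cat memory.swappiness
      - 30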
325   -
326   -6. Hierarchy support
327   -
328   -The memory controller supports a deep hierarchy and hierarchical accounting.
329   -The hierarchy is created by creating the appropriate cgroups in the
330   -cgroup filesystem. Consider for example, the following cgroup filesystem
331   -hierarchy
332   -
333   - root
334   - / | \
335   - / | \
336   - a b c
337   - | \
338   - | \
339   - d e
340   -
341   -In the diagram above, with hierarchical accounting enabled, all memory
342   -usage of e is accounted to its ancestors, up to the root (i.e. c and root),
343   -that have memory.use_hierarchy enabled. If one of the ancestors goes over its
344   -limit, the reclaim algorithm reclaims from the tasks in the ancestor and the
345   -children of the ancestor.
346   -
347   -6.1 Enabling hierarchical accounting and reclaim
348   -
349   -The memory controller disables the hierarchy feature by default. Support
350   -can be enabled by writing 1 to the memory.use_hierarchy file of the root cgroup:
351   -
352   -# echo 1 > memory.use_hierarchy
353   -
354   -The feature can be disabled by
355   -
356   -# echo 0 > memory.use_hierarchy
357   -
358   -NOTE1: Enabling/disabling will fail if the cgroup already has other
359   -cgroups created below it.
360   -
361   -NOTE2: This feature can be enabled/disabled per subtree.
362   -
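      -A minimal sketch of building the "c -> d" branch of the diagram above
      -(mount point and cgroup names are hypothetical):
      -
      -# mount -t cgroup -omemory none /cgroups
      -# echo 1 > /cgroups/memory.use_hierarchy
      -# mkdir /cgroups/c
      -# mkdir /cgroups/c/d
      -
      -With this setup, memory charged to tasks in c/d is also accounted to c
      -and to the root.
      -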
363   -7. TODO
364   -
365   -1. Add support for accounting huge pages (as a separate controller)
366   -2. Make per-cgroup scanner reclaim not-shared pages first
367   -3. Teach controller to account for shared-pages
368   -4. Start reclamation in the background when the limit is
369   - not yet hit but the usage is getting closer
370   -
371   -Summary
372   -
373   -Overall, the memory controller has been a stable controller and has been
374   -reviewed and discussed quite extensively in the community.
375   -
376   -References
377   -
378   -1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
379   -2. Singh, Balbir. Memory Controller (RSS Control),
380   - http://lwn.net/Articles/222762/
381   -3. Emelianov, Pavel. Resource controllers based on process cgroups,
382   -   http://lkml.org/lkml/2007/3/6/198
383   -4. Emelianov, Pavel. RSS controller based on process cgroups (v2),
384   -   http://lkml.org/lkml/2007/4/9/78
385   -5. Emelianov, Pavel. RSS controller based on process cgroups (v3),
386   -   http://lkml.org/lkml/2007/5/30/244
387   -6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/
388   -7. Vaidyanathan, Srinivasan. Control Groups: Pagecache accounting and control
389   -   subsystem (v3), http://lwn.net/Articles/235534/
390   -8. Singh, Balbir. RSS controller v2 test results (lmbench),
391   -   http://lkml.org/lkml/2007/5/17/232
392   -9. Singh, Balbir. RSS controller v2 AIM9 results,
393   -   http://lkml.org/lkml/2007/5/18/1
394   -10. Singh, Balbir. Memory controller v6 test results,
395   -    http://lkml.org/lkml/2007/8/19/36
396   -11. Singh, Balbir. Memory controller introduction (v6),
397   -    http://lkml.org/lkml/2007/8/17/69
398   -12. Corbet, Jonathan. Controlling memory use in cgroups,
399   -    http://lwn.net/Articles/243795/
Documentation/controllers/resource_counter.txt
1   -
2   - The Resource Counter
3   -
4   -The resource counter, declared at include/linux/res_counter.h,
5   -is supposed to facilitate the resource management by controllers
6   -by providing common stuff for accounting.
7   -
8   -This "stuff" includes the res_counter structure and routines
9   -to work with it.
10   -
11   -
12   -
13   -1. Crucial parts of the res_counter structure
14   -
15   - a. unsigned long long usage
16   -
17   - The usage value shows the amount of a resource that is consumed
18   - by a group at a given time. The units of measurement should be
19   - determined by the controller that uses this counter. E.g. it can
20   - be bytes, items or any other unit the controller operates on.
21   -
22   - b. unsigned long long max_usage
23   -
24   - The maximal value of the usage over time.
25   -
26   - This value is useful when gathering statistical information about
27   - the particular group, as it shows the actual resource requirements
28   - for a particular group, not just some usage snapshot.
29   -
30   - c. unsigned long long limit
31   -
32   - The maximal amount of the resource that the group is allowed to
33   - consume. If the group requests more resources, so that the usage
34   - value would exceed the limit, the resource allocation is rejected
35   - (see the next section).
36   -
37   - d. unsigned long long failcnt
38   -
39   - The failcnt stands for "failures counter". This is the number of
40   - resource allocation attempts that failed.
41   -
42   - e. spinlock_t lock
43   -
44   - Protects changes of the above values.
45   -
46   -
47   -
48   -2. Basic accounting routines
49   -
50   - a. void res_counter_init(struct res_counter *rc)
51   -
52   - Initializes the resource counter. As usual, should be the first
53   - routine called for a new counter.
54   -
55   - b. int res_counter_charge[_locked]
56   - (struct res_counter *rc, unsigned long val)
57   -
58   - When a resource is about to be allocated, it has to be accounted
59   - with the appropriate resource counter (the controller should determine
60   - which one to use on its own). This operation is called "charging".
61   -
62   - It is not very important which operation - resource allocation
63   - or charging - is performed first, but note that
64   - * if the allocation is performed first, this may create a
65   - temporary resource over-usage by the time the resource counter is
66   - charged;
67   - * if the charging is performed first, then it should be uncharged
68   - on the error path (if taken); see the sketch after this list.
69   -
70   - c. void res_counter_uncharge[_locked]
71   - (struct res_counter *rc, unsigned long val)
72   -
73   - When a resource is released (freed) it should be de-accounted
74   - from the resource counter it was accounted to. This is called
75   - "uncharging".
76   -
77   - The _locked routines imply that the res_counter->lock is taken.
78   -
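      - A minimal sketch of the "charge first" ordering from item (b) above
      - (the do_allocate() helper and its error convention are hypothetical):
      -
      - int alloc_charged(struct res_counter *rc, unsigned long amount)
      - {
      -         if (res_counter_charge(rc, amount) < 0)
      -                 return -ENOMEM; /* limit hit, charge rejected */
      -
      -         if (do_allocate(amount) != 0) {
      -                 /* error path: drop the charge taken above */
      -                 res_counter_uncharge(rc, amount);
      -                 return -ENOMEM;
      -         }
      -
      -         return 0;
      - }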
79   -
80   - 2.1 Other accounting routines
81   -
82   - There are more routines that may help you with common needs, like
83   - checking whether the limit is reached or resetting the max_usage
84   - value. They are all declared in include/linux/res_counter.h.
85   -
86   -
87   -
88   -3. Analyzing the resource counter readings
89   -
90   - a. If the failcnt value constantly grows, this means that the counter's
91   - limit is too tight. Either the group is misbehaving and consumes too
92   - many resources, or the configuration is not suitable for the group
93   - and the limit should be increased.
94   -
95   - b. The max_usage value can be used to quickly tune the group. One may
96   - set the limits to maximal values and either load the container with
97   - a common pattern or leave it running for a while. After this the max_usage
98   - value shows the amount of memory the container would require during
99   - its common activity.
100   -
101   - Setting the limit a bit above this value gives a pretty good
102   - configuration that works in most of the cases.
103   -
104   - c. If the max_usage is much less than the limit, but the failcnt value
105   - is growing, then the group tries to allocate a big chunk of the resource
106   - at once.
107   -
108   - d. If the max_usage is much less than the limit, but the failcnt value
109   - is 0, then this group has been given a limit that is too high and that
110   - it does not require. It is better to lower the limit a bit, leaving
111   - more of the resource for other groups.
112   -
113   -
114   -
115   -4. Communication with the control groups subsystem (cgroups)
116   -
117   -All the resource controllers that are using cgroups and resource counters
118   -should provide files (in the cgroup filesystem) to work with the resource
119   -counter fields. They are recommended to adhere to the following rules:
120   -
121   - a. File names
122   -
123   - Field name File name
124   - ---------------------------------------------------
125   - usage usage_in_<unit_of_measurement>
126   - max_usage max_usage_in_<unit_of_measurement>
127   - limit limit_in_<unit_of_measurement>
128   - failcnt failcnt
129   - lock no file :)
130   -
131   - b. Reading from file should show the corresponding field value in the
132   - appropriate format.
133   -
134   - c. Writing to file
135   -
136   - Field Expected behavior
137   - ----------------------------------
138   - usage prohibited
139   - max_usage reset to usage
140   - limit set the limit
141   - failcnt reset to zero
142   -
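      - For example, the memory controller measures in bytes, so following
      - these rules its files are named:
      -
      - memory.usage_in_bytes
      - memory.max_usage_in_bytes
      - memory.limit_in_bytes
      - memory.failcnt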
143   -
144   -
145   -5. Usage example
146   -
147   - a. Declare a task group (take a look at cgroups subsystem for this) and
148   - fold a res_counter into it
149   -
150   - struct my_group {
151   - struct res_counter res;
152   -
153   - <other fields>
154   - };
155   -
156   - b. Put hooks in resource allocation/release paths
157   -
158   - int alloc_something(...)
159   - {
160   - if (res_counter_charge(res_counter_ptr, amount) < 0)
161   - return -ENOMEM;
162   -
163   - <allocate the resource and return to the caller>
164   - }
165   -
166   - void release_something(...)
167   - {
168   - res_counter_uncharge(res_counter_ptr, amount);
169   -
170   - <release the resource>
171   - }
172   -
173   - In order to keep the usage value self-consistent, both the
174   - "res_counter_ptr" and the "amount" in release_something() should be
175   - the same as they were in alloc_something() when the resource being
176   - released was allocated.
177   -
178   - c. Provide a way to read the res_counter values and to set them (the
179   - cgroups subsystem can still help with this).
180   -
181   - d. Compile and run :)
Documentation/cpusets.txt
1   - CPUSETS
2   - -------
3   -
4   -Copyright (C) 2004 BULL SA.
5   -Written by Simon.Derr@bull.net
6   -
7   -Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
8   -Modified by Paul Jackson <pj@sgi.com>
9   -Modified by Christoph Lameter <clameter@sgi.com>
10   -Modified by Paul Menage <menage@google.com>
11   -Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
12   -
13   -CONTENTS:
14   -=========
15   -
16   -1. Cpusets
17   - 1.1 What are cpusets ?
18   - 1.2 Why are cpusets needed ?
19   - 1.3 How are cpusets implemented ?
20   - 1.4 What are exclusive cpusets ?
21   - 1.5 What is memory_pressure ?
22   - 1.6 What is memory spread ?
23   - 1.7 What is sched_load_balance ?
24   - 1.8 What is sched_relax_domain_level ?
25   - 1.9 How do I use cpusets ?
26   -2. Usage Examples and Syntax
27   - 2.1 Basic Usage
28   - 2.2 Adding/removing cpus
29   - 2.3 Setting flags
30   - 2.4 Attaching processes
31   -3. Questions
32   -4. Contact
33   -
34   -1. Cpusets
35   -==========
36   -
37   -1.1 What are cpusets ?
38   -----------------------
39   -
40   -Cpusets provide a mechanism for assigning a set of CPUs and Memory
41   -Nodes to a set of tasks. In this document "Memory Node" refers to
42   -an on-line node that contains memory.
43   -
44   -Cpusets constrain the CPU and Memory placement of tasks to only
45   -the resources within a task's current cpuset. They form a nested
46   -hierarchy visible in a virtual file system. These are the essential
47   -hooks, beyond what is already present, required to manage dynamic
48   -job placement on large systems.
49   -
50   -Cpusets use the generic cgroup subsystem described in
51   -Documentation/cgroups/cgroups.txt.
52   -
53   -Requests by a task, using the sched_setaffinity(2) system call to
54   -include CPUs in its CPU affinity mask, and using the mbind(2) and
55   -set_mempolicy(2) system calls to include Memory Nodes in its memory
56   -policy, are both filtered through that task's cpuset, filtering out any
57   -CPUs or Memory Nodes not in that cpuset. The scheduler will not
58   -schedule a task on a CPU that is not allowed in its cpus_allowed
59   -vector, and the kernel page allocator will not allocate a page on a
60   -node that is not allowed in the requesting task's mems_allowed vector.
61   -
62   -User level code may create and destroy cpusets by name in the cgroup
63   -virtual file system, manage the attributes and permissions of these
64   -cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
65   -specify and query to which cpuset a task is assigned, and list the
66   -task pids assigned to a cpuset.
67   -
68   -
69   -1.2 Why are cpusets needed ?
70   -----------------------------
71   -
72   -The management of large computer systems, with many processors (CPUs),
73   -complex memory cache hierarchies and multiple Memory Nodes having
74   -non-uniform access times (NUMA) presents additional challenges for
75   -the efficient scheduling and memory placement of processes.
76   -
77   -Frequently more modest sized systems can be operated with adequate
78   -efficiency just by letting the operating system automatically share
79   -the available CPU and Memory resources amongst the requesting tasks.
80   -
81   -But larger systems, which benefit more from careful processor and
82   -memory placement to reduce memory access times and contention,
83   -and which typically represent a larger investment for the customer,
84   -can benefit from explicitly placing jobs on properly sized subsets of
85   -the system.
86   -
87   -This can be especially valuable on:
88   -
89   - * Web Servers running multiple instances of the same web application,
90   - * Servers running different applications (for instance, a web server
91   - and a database), or
92   - * NUMA systems running large HPC applications with demanding
93   - performance characteristics.
94   -
95   -These subsets, or "soft partitions", must be able to be dynamically
96   -adjusted, as the job mix changes, without impacting other concurrently
97   -executing jobs. The location of the running jobs' pages may also be moved
98   -when the memory locations are changed.
99   -
100   -The kernel cpuset patch provides the minimum essential kernel
101   -mechanisms required to efficiently implement such subsets. It
102   -leverages existing CPU and Memory Placement facilities in the Linux
103   -kernel to avoid any additional impact on the critical scheduler or
104   -memory allocator code.
105   -
106   -
107   -1.3 How are cpusets implemented ?
108   ----------------------------------
109   -
110   -Cpusets provide a Linux kernel mechanism to constrain which CPUs and
111   -Memory Nodes are used by a process or set of processes.
112   -
113   -The Linux kernel already has a pair of mechanisms to specify on which
114   -CPUs a task may be scheduled (sched_setaffinity) and on which Memory
115   -Nodes it may obtain memory (mbind, set_mempolicy).
116   -
117   -Cpusets extends these two mechanisms as follows:
118   -
119   - - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
120   - kernel.
121   - - Each task in the system is attached to a cpuset, via a pointer
122   - in the task structure to a reference counted cgroup structure.
123   - - Calls to sched_setaffinity are filtered to just those CPUs
124   - allowed in that task's cpuset.
125   - - Calls to mbind and set_mempolicy are filtered to just
126   - those Memory Nodes allowed in that task's cpuset.
127   - - The root cpuset contains all the system's CPUs and Memory
128   - Nodes.
129   - - For any cpuset, one can define child cpusets containing a subset
130   - of the parent's CPU and Memory Node resources.
131   - - The hierarchy of cpusets can be mounted at /dev/cpuset, for
132   - browsing and manipulation from user space.
133   - - A cpuset may be marked exclusive, which ensures that no other
134   - cpuset (except direct ancestors and descendants) may contain
135   - any overlapping CPUs or Memory Nodes.
136   - - You can list all the tasks (by pid) attached to any cpuset.
137   -
138   -The implementation of cpusets requires a few simple hooks
139   -into the rest of the kernel, none in performance critical paths:
140   -
141   - - in init/main.c, to initialize the root cpuset at system boot.
142   - - in fork and exit, to attach and detach a task from its cpuset.
143   - - in sched_setaffinity, to mask the requested CPUs by what's
144   - allowed in that task's cpuset.
145   - - in sched.c migrate_all_tasks(), to keep migrating tasks within
146   - the CPUs allowed by their cpuset, if possible.
147   - - in the mbind and set_mempolicy system calls, to mask the requested
148   - Memory Nodes by what's allowed in that task's cpuset.
149   - - in page_alloc.c, to restrict memory to allowed nodes.
150   - - in vmscan.c, to restrict page recovery to the current cpuset.
151   -
152   -You should mount the "cgroup" filesystem type in order to enable
153   -browsing and modifying the cpusets presently known to the kernel. No
154   -new system calls are added for cpusets - all support for querying and
155   -modifying cpusets is via this cpuset file system.
156   -
157   -The /proc/<pid>/status file for each task has four added lines,
158   -displaying the task's cpus_allowed (on which CPUs it may be scheduled)
159   -and mems_allowed (on which Memory Nodes it may obtain memory),
160   -in the two formats seen in the following example:
161   -
162   - Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff
163   - Cpus_allowed_list: 0-127
164   - Mems_allowed: ffffffff,ffffffff
165   - Mems_allowed_list: 0-63
166   -
167   -Each cpuset is represented by a directory in the cgroup file system
168   -containing (on top of the standard cgroup files) the following
169   -files describing that cpuset:
170   -
171   - - cpus: list of CPUs in that cpuset
172   - - mems: list of Memory Nodes in that cpuset
173   - - memory_migrate flag: if set, move pages to the cpuset's nodes
174   - - cpu_exclusive flag: is cpu placement exclusive?
175   - - mem_exclusive flag: is memory placement exclusive?
176   - - mem_hardwall flag: is memory allocation hardwalled?
177   - - memory_pressure: measure of how much paging pressure is in the cpuset
178   -
179   -In addition, the root cpuset only has the following file:
180   - - memory_pressure_enabled flag: compute memory_pressure?
181   -
182   -New cpusets are created using the mkdir system call or shell
183   -command. The properties of a cpuset, such as its flags, allowed
184   -CPUs and Memory Nodes, and attached tasks, are modified by writing
185   -to the appropriate file in that cpuset's directory, as listed above.
186   -
187   -The named hierarchical structure of nested cpusets allows partitioning
188   -a large system into nested, dynamically changeable, "soft-partitions".
189   -
190   -Each task is attached to a cpuset, and that attachment is automatically
191   -inherited at fork by any children of that task. This allows organizing the
192   -work load on a system into related sets of tasks, each set constrained
193   -to using the CPUs and Memory Nodes of a particular cpuset. A task
194   -may be re-attached to any other cpuset, if allowed by the permissions
195   -on the necessary cpuset file system directories.
196   -
197   -Such management of a system "in the large" integrates smoothly with
198   -the detailed placement done on individual tasks and memory regions
199   -using the sched_setaffinity, mbind and set_mempolicy system calls.
200   -
201   -The following rules apply to each cpuset:
202   -
203   - - Its CPUs and Memory Nodes must be a subset of its parent's.
204   - - It can't be marked exclusive unless its parent is.
205   - - If its cpu or memory is exclusive, they may not overlap those of any sibling.
206   -
207   -These rules, and the natural hierarchy of cpusets, enable efficient
208   -enforcement of the exclusive guarantee, without having to scan all
209   -cpusets every time any of them changes to ensure nothing overlaps an
210   -exclusive cpuset. Also, the use of a Linux virtual file system (vfs)
211   -to represent the cpuset hierarchy provides for a familiar permission
212   -and name space for cpusets, with a minimum of additional kernel code.
213   -
214   -The cpus and mems files in the root (top_cpuset) cpuset are
215   -read-only. The cpus file automatically tracks the value of
216   -cpu_online_map using a CPU hotplug notifier, and the mems file
217   -automatically tracks the value of node_states[N_HIGH_MEMORY]--i.e.,
218   -nodes with memory--using the cpuset_track_online_nodes() hook.
219   -
220   -
221   -1.4 What are exclusive cpusets ?
222   ---------------------------------
223   -
224   -If a cpuset is cpu or mem exclusive, no other cpuset, other than
225   -a direct ancestor or descendant, may share any of the same CPUs or
226   -Memory Nodes.
227   -
228   -A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled",
229   -i.e. it restricts kernel allocations for page, buffer and other data
230   -commonly shared by the kernel across multiple users. All cpusets,
231   -whether hardwalled or not, restrict allocations of memory for user
232   -space. This enables configuring a system so that several independent
233   -jobs can share common kernel data, such as file system pages, while
234   -isolating each job's user allocation in its own cpuset. To do this,
235   -construct a large mem_exclusive cpuset to hold all the jobs, and
236   -construct child, non-mem_exclusive cpusets for each individual job.
237   -Only a small amount of typical kernel memory, such as requests from
238   -interrupt handlers, is allowed to be taken outside even a
239   -mem_exclusive cpuset.
240   -
241   -
242   -1.5 What is memory_pressure ?
243   ------------------------------
244   -The memory_pressure of a cpuset provides a simple per-cpuset metric
245   -of the rate at which the tasks in a cpuset are attempting to free up
246   -in-use memory on the nodes of the cpuset to satisfy additional memory
247   -requests.
248   -
249   -This enables batch managers monitoring jobs running in dedicated
250   -cpusets to efficiently detect what level of memory pressure that job
251   -is causing.
252   -
253   -This is useful both on tightly managed systems running a wide mix of
254   -submitted jobs, which may choose to terminate or re-prioritize jobs that
255   -are trying to use more memory than allowed on the nodes assigned to them,
256   -and with tightly coupled, long running, massively parallel scientific
257   -computing jobs that will dramatically fail to meet required performance
258   -goals if they start to use more memory than allowed to them.
259   -
260   -This mechanism provides a very economical way for the batch manager
261   -to monitor a cpuset for signs of memory pressure. It's up to the
262   -batch manager or other user code to decide what to do about it and
263   -take action.
264   -
265   -==> Unless this feature is enabled by writing "1" to the special file
266   - /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
267   - code of __alloc_pages() for this metric reduces to simply noticing
268   - that the cpuset_memory_pressure_enabled flag is zero. So only
269   - systems that enable this feature will compute the metric.
270   -
271   -Why a per-cpuset, running average:
272   -
273   - Because this meter is per-cpuset, rather than per-task or mm,
274   - the system load imposed by a batch scheduler monitoring this
275   - metric is sharply reduced on large systems, because a scan of
276   - the tasklist can be avoided on each set of queries.
277   -
278   - Because this meter is a running average, instead of an accumulating
279   - counter, a batch scheduler can detect memory pressure with a
280   - single read, instead of having to read and accumulate results
281   - for a period of time.
282   -
283   - Because this meter is per-cpuset rather than per-task or mm,
284   - the batch scheduler can obtain the key information, memory
285   - pressure in a cpuset, with a single read, rather than having to
286   - query and accumulate results over all the (dynamically changing)
287   - set of tasks in the cpuset.
288   -
289   -A per-cpuset simple digital filter (requires a spinlock and 3 words
290   -of data per-cpuset) is kept, and updated by any task attached to that
291   -cpuset, if it enters the synchronous (direct) page reclaim code.
292   -
293   -A per-cpuset file provides an integer number representing the recent
294   -(half-life of 10 seconds) rate of direct page reclaims caused by
295   -the tasks in the cpuset, in units of reclaims attempted per second,
296   -times 1000.
297   -
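      -A minimal sketch of using this metric (paths assume the cpuset hierarchy
      -is mounted at /dev/cpuset, as in section 1.9; the cpuset name and the
      -value shown are illustrative):
      -
      - # /bin/echo 1 > /dev/cpuset/memory_pressure_enabled
      - # cat /dev/cpuset/Charlie/memory_pressure
      - 0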
298   -
299   -1.6 What is memory spread ?
300   ----------------------------
301   -There are two boolean flag files per cpuset that control where the
302   -kernel allocates pages for the file system buffers and related
303   -in-kernel data structures. They are called 'memory_spread_page' and
304   -'memory_spread_slab'.
305   -
306   -If the per-cpuset boolean flag file 'memory_spread_page' is set, then
307   -the kernel will spread the file system buffers (page cache) evenly
308   -over all the nodes that the faulting task is allowed to use, instead
309   -of preferring to put those pages on the node where the task is running.
310   -
311   -If the per-cpuset boolean flag file 'memory_spread_slab' is set,
312   -then the kernel will spread some file system related slab caches,
313   -such as those for inodes and dentries, evenly over all the nodes that
314   -the faulting task is allowed to use, instead of preferring to put those
315   -pages on the node where the task is running.
316   -
317   -The setting of these flags does not affect anonymous data segment or
318   -stack segment pages of a task.
319   -
320   -By default, both kinds of memory spreading are off, and memory
321   -pages are allocated on the node local to where the task is running,
322   -except perhaps as modified by the task's NUMA mempolicy or cpuset
323   -configuration, so long as sufficient free memory pages are available.
324   -
325   -When new cpusets are created, they inherit the memory spread settings
326   -of their parent.
327   -
328   -Setting memory spreading causes allocations for the affected page
329   -or slab caches to ignore the task's NUMA mempolicy and be spread
330   -instead. Tasks using mbind() or set_mempolicy() calls to set NUMA
331   -mempolicies will not notice any change in these calls as a result of
332   -their containing task's memory spread settings. If memory spreading
333   -is turned off, then the currently specified NUMA mempolicy once again
334   -applies to memory page allocations.
335   -
336   -Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag
337   -files. By default they contain "0", meaning that the feature is off
338   -for that cpuset. If a "1" is written to that file, then that turns
339   -the named feature on.
340   -
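      -A minimal sketch, run inside a cpuset directory (see section 2 for
      -mounting the hierarchy):
      -
      - # /bin/echo 1 > memory_spread_page
      - # cat memory_spread_page
      - 1
      -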
341   -The implementation is simple.
342   -
343   -Setting the flag 'memory_spread_page' turns on a per-process flag
344   -PF_SPREAD_PAGE for each task that is in that cpuset or subsequently
345   -joins that cpuset. The page allocation calls for the page cache
346   -are modified to perform an inline check for this PF_SPREAD_PAGE task
347   -flag, and if it is set, a call to a new routine cpuset_mem_spread_node()
348   -returns the node to prefer for the allocation.
349   -
350   -Similarly, setting 'memory_spread_slab' turns on the flag
351   -PF_SPREAD_SLAB, and appropriately marked slab caches will allocate
352   -pages from the node returned by cpuset_mem_spread_node().
353   -
354   -The cpuset_mem_spread_node() routine is also simple. It uses the
355   -value of a per-task rotor cpuset_mem_spread_rotor to select the next
356   -node in the current task's mems_allowed to prefer for the allocation.
357   -
358   -This memory placement policy is also known (in other contexts) as
359   -round-robin or interleave.
360   -
361   -This policy can provide substantial improvements for jobs that need
362   -to place thread local data on the corresponding node, but that need
363   -to access large file system data sets that need to be spread across
364   -the several nodes in the job's cpuset in order to fit. Without this
365   -policy, especially for jobs that might have one thread reading in the
366   -data set, the memory allocation across the nodes in the job's cpuset
367   -can become very uneven.
368   -
369   -1.7 What is sched_load_balance ?
370   ---------------------------------
371   -
372   -The kernel scheduler (kernel/sched.c) automatically load balances
373   -tasks. If one CPU is underutilized, kernel code running on that
374   -CPU will look for tasks on other more overloaded CPUs and move those
375   -tasks to itself, within the constraints of such placement mechanisms
376   -as cpusets and sched_setaffinity.
377   -
378   -The algorithmic cost of load balancing and its impact on key shared
379   -kernel data structures such as the task list increases more than
380   -linearly with the number of CPUs being balanced. So the scheduler
381   -has support to partition the system's CPUs into a number of sched
382   -domains such that it only load balances within each sched domain.
383   -Each sched domain covers some subset of the CPUs in the system;
384   -no two sched domains overlap; some CPUs might not be in any sched
385   -domain and hence won't be load balanced.
386   -
387   -Put simply, it costs less to balance between two smaller sched domains
388   -than one big one, but doing so means that overloads in one of the
389   -two domains won't be load balanced to the other one.
390   -
391   -By default, there is one sched domain covering all CPUs, except those
392   -marked isolated using the kernel boot time "isolcpus=" argument.
393   -
394   -This default load balancing across all CPUs is not well suited for
395   -the following two situations:
396   - 1) On large systems, load balancing across many CPUs is expensive.
397   - If the system is managed using cpusets to place independent jobs
398   - on separate sets of CPUs, full load balancing is unnecessary.
399   - 2) Systems supporting realtime on some CPUs need to minimize
400   - system overhead on those CPUs, including avoiding task load
401   - balancing if that is not needed.
402   -
403   -When the per-cpuset flag "sched_load_balance" is enabled (the default
404   -setting), it requests that all the CPUs in that cpuset's allowed 'cpus'
405   -be contained in a single sched domain, ensuring that load balancing
406   -can move a task (not otherwise pinned, as by sched_setaffinity)
407   -from any CPU in that cpuset to any other.
408   -
409   -When the per-cpuset flag "sched_load_balance" is disabled, then the
410   -scheduler will avoid load balancing across the CPUs in that cpuset,
411   ---except-- in so far as is necessary because some overlapping cpuset
412   -has "sched_load_balance" enabled.
413   -
414   -So, for example, if the top cpuset has the flag "sched_load_balance"
415   -enabled, then the scheduler will have one sched domain covering all
416   -CPUs, and the setting of the "sched_load_balance" flag in any other
417   -cpusets won't matter, as we're already fully load balancing.
418   -
419   -Therefore in the above two situations, the top cpuset flag
420   -"sched_load_balance" should be disabled, and only some of the smaller,
421   -child cpusets have this flag enabled.
422   -
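      -A minimal sketch of that configuration (cpuset names are hypothetical;
      -see section 2 for mounting the hierarchy at /dev/cpuset):
      -
      - # /bin/echo 0 > /dev/cpuset/sched_load_balance
      - # /bin/echo 1 > /dev/cpuset/jobA/sched_load_balance
      - # /bin/echo 1 > /dev/cpuset/jobB/sched_load_balance
      -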
423   -When doing this, you don't usually want to leave any unpinned tasks in
424   -the top cpuset that might use non-trivial amounts of CPU, as such tasks
425   -may be artificially constrained to some subset of CPUs, depending on
426   -the particulars of this flag setting in descendent cpusets. Even if
427   -such a task could use spare CPU cycles in some other CPUs, the kernel
428   -scheduler might not consider the possibility of load balancing that
429   -task to that underused CPU.
430   -
431   -Of course, tasks pinned to a particular CPU can be left in a cpuset
432   -that disables "sched_load_balance" as those tasks aren't going anywhere
433   -else anyway.
434   -
435   -There is an impedance mismatch here, between cpusets and sched domains.
436   -Cpusets are hierarchical and nest. Sched domains are flat; they don't
437   -overlap and each CPU is in at most one sched domain.
438   -
439   -It is necessary for sched domains to be flat because load balancing
440   -across partially overlapping sets of CPUs would risk unstable dynamics
441   -that would be beyond our understanding. So if each of two partially
442   -overlapping cpusets enables the flag 'sched_load_balance', then we
443   -form a single sched domain that is a superset of both. We won't move
444   -a task to a CPU outside its cpuset, but the scheduler load balancing
445   -code might waste some compute cycles considering that possibility.
446   -
447   -This mismatch is why there is not a simple one-to-one relation
448   -between which cpusets have the flag "sched_load_balance" enabled,
449   -and the sched domain configuration. If a cpuset enables the flag, it
450   -will get balancing across all its CPUs, but if it disables the flag,
451   -it will only be assured of no load balancing if no other overlapping
452   -cpuset enables the flag.
453   -
454   -If two cpusets have partially overlapping 'cpus' allowed, and only
455   -one of them has this flag enabled, then the other may find its
456   -tasks only partially load balanced, just on the overlapping CPUs.
457   -This is just the general case of the top_cpuset example given a few
458   -paragraphs above. In the general case, as in the top cpuset case,
459   -don't leave tasks that might use non-trivial amounts of CPU in
460   -such partially load balanced cpusets, as they may be artificially
461   -constrained to some subset of the CPUs allowed to them, for lack of
462   -load balancing to the other CPUs.
463   -
464   -1.7.1 sched_load_balance implementation details.
465   -------------------------------------------------
466   -
467   -The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary
468   -to most cpuset flags.) When enabled for a cpuset, the kernel will
469   -ensure that it can load balance across all the CPUs in that cpuset
470   -(makes sure that all the CPUs in the cpus_allowed of that cpuset are
471   -in the same sched domain.)
472   -
473   -If two overlapping cpusets both have 'sched_load_balance' enabled,
474   -then they will be (must be) both in the same sched domain.
475   -
476   -If, as is the default, the top cpuset has 'sched_load_balance' enabled,
477   -then by the above that means there is a single sched domain covering
478   -the whole system, regardless of any other cpuset settings.
479   -
480   -The kernel commits to user space that it will avoid load balancing
481   -where it can. It will pick as fine a granularity partition of sched
482   -domains as it can while still providing load balancing for any set
483   -of CPUs allowed to a cpuset having 'sched_load_balance' enabled.
484   -
485   -The internal kernel cpuset to scheduler interface passes from the
486   -cpuset code to the scheduler code a partition of the load balanced
487   -CPUs in the system. This partition is a set of subsets (represented
488   -as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all
489   -the CPUs that must be load balanced.
490   -
491   -Whenever the 'sched_load_balance' flag changes, or CPUs come or go
492   -from a cpuset with this flag enabled, or a cpuset with this flag
493   -enabled is removed, the cpuset code builds a new such partition and
494   -passes it to the scheduler sched domain setup code, to have the sched
495   -domains rebuilt as necessary.
496   -
497   -This partition exactly defines what sched domains the scheduler should
498   -set up - one sched domain for each element (cpumask_t) in the partition.
499   -
500   -The scheduler remembers the currently active sched domain partitions.
501   -When the scheduler routine partition_sched_domains() is invoked from
502   -the cpuset code to update these sched domains, it compares the new
503   -partition requested with the current, and updates its sched domains,
504   -removing the old and adding the new, for each change.
505   -
506   -
507   -1.8 What is sched_relax_domain_level ?
508   ---------------------------------------
509   -
510   -Within a sched domain, the scheduler migrates tasks in two ways: periodic
511   -load balancing on the tick, and at the time of certain scheduling events.
512   -
513   -When a task is woken up, the scheduler tries to move it to an idle CPU.
514   -For example, if a task A running on CPU X activates another task B
515   -on the same CPU X, and if CPU Y is X's sibling and is idle, then the
516   -scheduler migrates task B to CPU Y so that task B can start on CPU Y
517   -without waiting for task A on CPU X.
518   -
519   -And if a CPU runs out of tasks in its runqueue, it tries to pull
520   -extra tasks from other busy CPUs, to help them, before it goes idle.
521   -
522   -Of course it costs some searching to find movable tasks and/or idle
523   -CPUs, so the scheduler might not search all CPUs in the domain every
524   -time. In fact, on some architectures, the search ranges on these events
525   -are limited to the same socket or node where the CPU is located, while
526   -the load balance on tick searches all of them.
527   -
528   -For example, assume CPU Z is relatively far from CPU X. Even if CPU Z
529   -is idle while CPU X and its siblings are busy, the scheduler can't
530   -migrate woken task B from X to Z since it is out of its search range.
531   -As a result, task B on CPU X needs to wait for task A or for the load
532   -balance on the next tick. For some applications in special situations,
533   -waiting one tick may be too long.
535   -
536   -The 'sched_relax_domain_level' file allows you to request a change of
537   -this search range as you like. This file takes an int value which
538   -ideally indicates the size of the search range in levels, as listed
539   -below; the initial value -1 indicates that the cpuset has no request.
540   -
541   - -1 : no request. use system default or follow request of others.
542   - 0 : no search.
543   - 1 : search siblings (hyperthreads in a core).
544   - 2 : search cores in a package.
545   - 3 : search cpus in a node [= system wide on non-NUMA system]
546   - ( 4 : search nodes in a chunk of node [on NUMA system] )
547   - ( 5 : search system wide [on NUMA system] )
548   -
549   -The system default is architecture dependent. The system default
550   -can be changed using the relax_domain_level= boot parameter.
551   -
552   -This file is per-cpuset and affects the sched domain to which the cpuset
553   -belongs. Therefore if the flag 'sched_load_balance' of a cpuset
554   -is disabled, then 'sched_relax_domain_level' has no effect since
555   -there is no sched domain belonging to the cpuset.
556   -
557   -If multiple cpusets are overlapping and hence form a single sched
558   -domain, the largest value among them is used. Be careful: if one
559   -requests 0 and the others are -1, then 0 is used.
560   -
561   -Note that modifying this file will have both good and bad effects,
562   -and whether it is acceptable or not depends on your situation.
563   -Don't modify this file if you are not sure.
564   -
565   -If your situation is:
566   - - The migration costs between each cpu can be assumed to be considerably
567   - small (for you) due to your special application's behavior or special
568   - hardware support for CPU caches etc.
569   - - The searching cost has no impact (for you), or you can make the
570   - searching cost small enough, e.g. by keeping the cpuset compact.
571   - - Low latency is required even if it sacrifices cache hit rate etc.,
572   -then increasing 'sched_relax_domain_level' would benefit you.
573   -
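      -A minimal sketch, run inside a cpuset directory (level 1 requests
      -searching the siblings, per the table above):
      -
      - # /bin/echo 1 > sched_relax_domain_level
      - # cat sched_relax_domain_level
      - 1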
574   -
575   -1.9 How do I use cpusets ?
576   ---------------------------
577   -
578   -In order to minimize the impact of cpusets on critical kernel
579   -code, such as the scheduler, and due to the fact that the kernel
580   -does not support one task updating the memory placement of another
581   -task directly, the impact on a task of changing its cpuset CPU
582   -or Memory Node placement, or of changing to which cpuset a task
583   -is attached, is subtle.
584   -
585   -If a cpuset has its Memory Nodes modified, then for each task attached
586   -to that cpuset, the next time that the kernel attempts to allocate
587   -a page of memory for that task, the kernel will notice the change
588   -in the task's cpuset, and update its per-task memory placement to
589   -remain within the new cpuset's memory placement. If the task was using
590   -mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
591   -its new cpuset, then the task will continue to use whatever subset
592   -of MPOL_BIND nodes are still allowed in the new cpuset. If the task
593   -was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
594   -in the new cpuset, then the task will be essentially treated as if it
595   -was MPOL_BIND bound to the new cpuset (even though its numa placement,
596   -as queried by get_mempolicy(), doesn't change). If a task is moved
597   -from one cpuset to another, then the kernel will adjust the task's
598   -memory placement, as above, the next time that the kernel attempts
599   -to allocate a page of memory for that task.
600   -
601   -If a cpuset has its 'cpus' modified, then each task in that cpuset
602   -will have its allowed CPU placement changed immediately. Similarly,
603   -if a task's pid is written to a cpuset's 'tasks' file, in either its
604   -current cpuset or another cpuset, then its allowed CPU placement is
605   -changed immediately. If such a task had been bound to some subset
606   -of its cpuset using the sched_setaffinity() call, the task will be
607   -allowed to run on any CPU allowed in its new cpuset, negating the
608   -effect of the prior sched_setaffinity() call.
609   -
610   -In summary, the memory placement of a task whose cpuset is changed is
611   -updated by the kernel, on the next allocation of a page for that task,
612   -but the processor placement is not updated, until that task's pid is
613   -rewritten to the 'tasks' file of its cpuset. This is done to avoid
614   -impacting the scheduler code in the kernel with a check for changes
615   -in a task's processor placement.
616   -
617   -Normally, once a page is allocated (given a physical page
618   -of main memory) then that page stays on whatever node it
619   -was allocated, so long as it remains allocated, even if the
620   -cpuset's memory placement policy 'mems' subsequently changes.
621   -If the cpuset flag file 'memory_migrate' is set true, then when
622   -tasks are attached to that cpuset, any pages that task had
623   -allocated to it on nodes in its previous cpuset are migrated
624   -to the task's new cpuset. The relative placement of the page within
625   -the cpuset is preserved during these migration operations if possible.
626   -For example if the page was on the second valid node of the prior cpuset
627   -then the page will be placed on the second valid node of the new cpuset.
628   -
629   -Also if 'memory_migrate' is set true, then if that cpuset's
630   -'mems' file is modified, pages allocated to tasks in that
631   -cpuset, that were on nodes in the previous setting of 'mems',
632   -will be moved to nodes in the new setting of 'mems.'
633   -Pages that were not in the task's prior cpuset, or in the cpuset's
634   -prior 'mems' setting, will not be moved.
635   -
636   -There is an exception to the above. If hotplug functionality is used
637   -to remove all the CPUs that are currently assigned to a cpuset,
638   -then all the tasks in that cpuset will be moved to the nearest ancestor
639   -with non-empty cpus. But the moving of some (or all) tasks might fail if
640   -cpuset is bound with another cgroup subsystem which has some restrictions
641   -on task attaching. In this failing case, those tasks will stay
642   -in the original cpuset, and the kernel will automatically update
643   -their cpus_allowed to allow all online CPUs. When memory hotplug
644   -functionality for removing Memory Nodes is available, a similar exception
645   -is expected to apply there as well. In general, the kernel prefers to
646   -violate cpuset placement, over starving a task that has had all
647   -its allowed CPUs or Memory Nodes taken offline.
648   -
649   -There is a second exception to the above. GFP_ATOMIC requests are
650   -kernel internal allocations that must be satisfied, immediately.
651   -The kernel may drop some requests, in rare cases even panic, if a
652   -GFP_ATOMIC alloc fails. If the request cannot be satisfied within
653   -the current task's cpuset, then we relax the cpuset, and look for
654   -memory anywhere we can find it. It's better to violate the cpuset
655   -than stress the kernel.
656   -
657   -To start a new job that is to be contained within a cpuset, the steps are:
658   -
659   - 1) mkdir /dev/cpuset
660   - 2) mount -t cgroup -ocpuset cpuset /dev/cpuset
661   - 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
662   - the /dev/cpuset virtual file system.
663   - 4) Start a task that will be the "founding father" of the new job.
664   - 5) Attach that task to the new cpuset by writing its pid to the
665   - /dev/cpuset tasks file for that cpuset.
666   - 6) fork, exec or clone the job tasks from this founding father task.
667   -
668   -For example, the following sequence of commands will set up a cpuset
669   -named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
670   -and then start a subshell 'sh' in that cpuset:
671   -
672   - mount -t cgroup -ocpuset cpuset /dev/cpuset
673   - cd /dev/cpuset
674   - mkdir Charlie
675   - cd Charlie
676   - /bin/echo 2-3 > cpus
677   - /bin/echo 1 > mems
678   - /bin/echo $$ > tasks
679   - sh
680   - # The subshell 'sh' is now running in cpuset Charlie
681   - # The next line should display '/Charlie'
682   - cat /proc/self/cpuset
683   -
684   -In the future, a C library interface to cpusets will likely be
685   -available. For now, the only way to query or modify cpusets is
686   -via the cpuset file system, using the various cd, mkdir, echo, cat,
687   -rmdir commands from the shell, or their equivalent from C.
688   -
689   -The sched_setaffinity calls can also be done at the shell prompt using
690   -SGI's runon or Robert Love's taskset. The mbind and set_mempolicy
691   -calls can be done at the shell prompt using the numactl command
692   -(part of Andi Kleen's numa package).
693   -
694   -2. Usage Examples and Syntax
695   -============================
696   -
697   -2.1 Basic Usage
698   ----------------
699   -
700   -Creating, modifying and using cpusets can be done through the cpuset
701   -virtual filesystem.
702   -
703   -To mount it, type:
704   -# mount -t cgroup -o cpuset cpuset /dev/cpuset
705   -
706   -Then under /dev/cpuset you can find a tree that corresponds to the
707   -tree of the cpusets in the system. For instance, /dev/cpuset
708   -is the cpuset that holds the whole system.
709   -
710   -If you want to create a new cpuset under /dev/cpuset:
711   -# cd /dev/cpuset
712   -# mkdir my_cpuset
713   -
714   -Now you want to do something with this cpuset.
715   -# cd my_cpuset
716   -
717   -In this directory you can find several files:
718   -# ls
719   -cpu_exclusive memory_migrate mems tasks
720   -cpus memory_pressure notify_on_release
721   -mem_exclusive memory_spread_page sched_load_balance
722   -mem_hardwall memory_spread_slab sched_relax_domain_level
723   -
724   -Reading them will give you information about the state of this cpuset:
725   -the CPUs and Memory Nodes it can use, the processes that are using
726   -it, its properties. By writing to these files you can manipulate
727   -the cpuset.
728   -
729   -Set some flags:
730   -# /bin/echo 1 > cpu_exclusive
731   -
732   -Add some cpus:
733   -# /bin/echo 0-7 > cpus
734   -
735   -Add some mems:
736   -# /bin/echo 0-7 > mems
737   -
738   -Now attach your shell to this cpuset:
739   -# /bin/echo $$ > tasks
740   -
741   -You can also create cpusets inside your cpuset by using mkdir in this
742   -directory.
743   -# mkdir my_sub_cs
744   -
745   -To remove a cpuset, just use rmdir:
746   -# rmdir my_sub_cs
747   -This will fail if the cpuset is in use (has cpusets inside, or has
748   -processes attached).
749   -
750   -Note that for legacy reasons, the "cpuset" filesystem exists as a
751   -wrapper around the cgroup filesystem.
752   -
753   -The command
754   -
755   -mount -t cpuset X /dev/cpuset
756   -
757   -is equivalent to
758   -
759   -mount -t cgroup -ocpuset X /dev/cpuset
760   -echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent
761   -
762   -2.2 Adding/removing cpus
763   -------------------------
764   -
765   -This is the syntax to use when writing in the cpus or mems files
766   -in cpuset directories:
767   -
768   -# /bin/echo 1-4 > cpus -> set cpus list to cpus 1,2,3,4
769   -# /bin/echo 1,2,3,4 > cpus -> set cpus list to cpus 1,2,3,4
770   -
771   -2.3 Setting flags
772   ------------------
773   -
774   -The syntax is very simple:
775   -
776   -# /bin/echo 1 > cpu_exclusive -> set flag 'cpu_exclusive'
777   -# /bin/echo 0 > cpu_exclusive -> unset flag 'cpu_exclusive'
778   -
779   -2.4 Attaching processes
780   ------------------------
781   -
782   -# /bin/echo PID > tasks
783   -
784   -Note that it is PID, not PIDs. You can only attach ONE task at a time.
785   -If you have several tasks to attach, you have to do it one after another:
786   -
787   -# /bin/echo PID1 > tasks
788   -# /bin/echo PID2 > tasks
789   - ...
790   -# /bin/echo PIDn > tasks
791   -
792   -
793   -3. Questions
794   -============
795   -
796   -Q: what's up with this '/bin/echo' ?
797   -A: bash's builtin 'echo' command does not check calls to write() against
798   - errors. If you use it in the cpuset file system, you won't be
799   - able to tell whether a command succeeded or failed.
800   -
801   -Q: When I attach processes, only the first one on the line actually gets attached!
802   -A: We can only return one error code per call to write(). So you should also
803   - put only ONE pid per write.
804   -
805   -4. Contact
806   -==========
807   -
808   -Web: http://www.bullopensource.org/cpuset
Documentation/scheduler/sched-design-CFS.txt
... ... @@ -231,7 +231,7 @@
231 231  
232 232 This options needs CONFIG_CGROUPS to be defined, and lets the administrator
233 233 create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See
234   - Documentation/cgroups.txt for more information about this filesystem.
  234 + Documentation/cgroups/cgroups.txt for more information about this filesystem.
235 235  
236 236 Only one of these options to group tasks can be chosen and not both.
237 237  
include/linux/res_counter.h
... ... @@ -9,7 +9,7 @@
9 9 *
10 10 * Author: Pavel Emelianov <xemul@openvz.org>
11 11 *
12   - * See Documentation/controllers/resource_counter.txt for more
  12 + * See Documentation/cgroups/resource_counter.txt for more
13 13 * info about what this counter is.
14 14 */
15 15  
... ... @@ -323,8 +323,8 @@
323 323 This option allows you to create arbitrary task groups
324 324 using the "cgroup" pseudo filesystem and control
325 325 the cpu bandwidth allocated to each such task group.
326   - Refer to Documentation/cgroups.txt for more information
327   - on "cgroup" pseudo filesystem.
  326 + Refer to Documentation/cgroups/cgroups.txt for more
  327 + information on "cgroup" pseudo filesystem.
328 328  
329 329 endchoice
330 330  
331 331  
... ... @@ -335,10 +335,9 @@
335 335 use with process control subsystems such as Cpusets, CFS, memory
336 336 controls or device isolation.
337 337 See
338   - - Documentation/cpusets.txt (Cpusets)
339 338 - Documentation/scheduler/sched-design-CFS.txt (CFS)
340   - - Documentation/cgroups/ (features for grouping, isolation)
341   - - Documentation/controllers/ (features for resource control)
  339 + - Documentation/cgroups/ (features for grouping, isolation
  340 + and resource control)
342 341  
343 342 Say N if unsure.
344 343  
... ... @@ -568,7 +568,7 @@
568 568 * load balancing domains (sched domains) as specified by that partial
569 569 * partition.
570 570 *
571   - * See "What is sched_load_balance" in Documentation/cpusets.txt
  571 + * See "What is sched_load_balance" in Documentation/cgroups/cpusets.txt
572 572 * for a background explanation of this.
573 573 *
574 574 * Does not return errors, on the theory that the callers of this