Commit 45ce80fb6b6f9594d1396d44dd7e7c02d596fef8
Committed by Linus Torvalds
1 parent 23964d2d02
Exists in master and in 39 other branches
cgroups: consolidate cgroup documents
Move Documentation/cpusets.txt and Documentation/controllers/* to
Documentation/cgroups/

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Showing 17 changed files with 1824 additions and 1824 deletions
- Documentation/cgroups/cgroups.txt
- Documentation/cgroups/cpuacct.txt
- Documentation/cgroups/cpusets.txt
- Documentation/cgroups/devices.txt
- Documentation/cgroups/memcg_test.txt
- Documentation/cgroups/memory.txt
- Documentation/cgroups/resource_counter.txt
- Documentation/controllers/cpuacct.txt
- Documentation/controllers/devices.txt
- Documentation/controllers/memcg_test.txt
- Documentation/controllers/memory.txt
- Documentation/controllers/resource_counter.txt
- Documentation/cpusets.txt
- Documentation/scheduler/sched-design-CFS.txt
- include/linux/res_counter.h
- init/Kconfig
- kernel/cpuset.c
Documentation/cgroups/cgroups.txt
1 | 1 | CGROUPS |
2 | 2 | ------- |
3 | 3 | |
4 | -Written by Paul Menage <menage@google.com> based on Documentation/cpusets.txt | |
4 | +Written by Paul Menage <menage@google.com> based on | |
5 | +Documentation/cgroups/cpusets.txt | |
5 | 6 | |
6 | 7 | Original copyright statements from cpusets.txt: |
7 | 8 | Portions Copyright (C) 2004 BULL SA. |
... | ... | @@ -68,7 +69,7 @@ |
68 | 69 | tracking. The intention is that other subsystems hook into the generic |
69 | 70 | cgroup support to provide new attributes for cgroups, such as |
70 | 71 | accounting/limiting the resources which processes in a cgroup can |
71 | -access. For example, cpusets (see Documentation/cpusets.txt) allows | |
72 | +access. For example, cpusets (see Documentation/cgroups/cpusets.txt) allows | |
72 | 73 | you to associate a set of CPUs and a set of memory nodes with the |
73 | 74 | tasks in each cgroup. |
74 | 75 |
Documentation/cgroups/cpuacct.txt
1 | +CPU Accounting Controller | |
2 | +------------------------- | |
3 | + | |
4 | +The CPU accounting controller is used to group tasks using cgroups and | |
5 | +account the CPU usage of these groups of tasks. | |
6 | + | |
7 | +The CPU accounting controller supports multi-hierarchy groups. An accounting | |
8 | +group accumulates the CPU usage of all of its child groups and the tasks | |
9 | +directly present in its group. | |
10 | + | |
11 | +Accounting groups can be created by first mounting the cgroup filesystem. | |
12 | + | |
13 | +# mkdir /cgroups | |
14 | +# mount -t cgroup -ocpuacct none /cgroups | |
15 | + | |
16 | +With the above step, the initial or the parent accounting group | |
17 | +becomes visible at /cgroups. At bootup, this group includes all the | |
18 | +tasks in the system. /cgroups/tasks lists the tasks in this cgroup. | |
19 | +/cgroups/cpuacct.usage gives the CPU time (in nanoseconds) obtained by | |
20 | +this group which is essentially the CPU time obtained by all the tasks | |
21 | +in the system. | |
22 | + | |
23 | +New accounting groups can be created under the parent group /cgroups. | |
24 | + | |
25 | +# cd /cgroups | |
26 | +# mkdir g1 | |
27 | +# echo $$ > g1 | |
28 | + | |
29 | +The above steps create a new group g1 and move the current shell | |
30 | +process (bash) into it. CPU time consumed by this bash and its children | |
31 | +can be obtained from g1/cpuacct.usage and the same is accumulated in | |
32 | +/cgroups/cpuacct.usage also. |
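For instance, a quick way to read back the accumulated usage (a minimal
sketch, assuming the /cgroups mount point and the group g1 created above):

# cat /cgroups/g1/cpuacct.usage    -> CPU time (in nanoseconds) used by g1
# cat /cgroups/cpuacct.usage       -> CPU time used by all tasks in the system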
Documentation/cgroups/cpusets.txt
1 | + CPUSETS | |
2 | + ------- | |
3 | + | |
4 | +Copyright (C) 2004 BULL SA. | |
5 | +Written by Simon.Derr@bull.net | |
6 | + | |
7 | +Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. | |
8 | +Modified by Paul Jackson <pj@sgi.com> | |
9 | +Modified by Christoph Lameter <clameter@sgi.com> | |
10 | +Modified by Paul Menage <menage@google.com> | |
11 | +Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> | |
12 | + | |
13 | +CONTENTS: | |
14 | +========= | |
15 | + | |
16 | +1. Cpusets | |
17 | + 1.1 What are cpusets ? | |
18 | + 1.2 Why are cpusets needed ? | |
19 | + 1.3 How are cpusets implemented ? | |
20 | + 1.4 What are exclusive cpusets ? | |
21 | + 1.5 What is memory_pressure ? | |
22 | + 1.6 What is memory spread ? | |
23 | + 1.7 What is sched_load_balance ? | |
24 | + 1.8 What is sched_relax_domain_level ? | |
25 | + 1.9 How do I use cpusets ? | |
26 | +2. Usage Examples and Syntax | |
27 | + 2.1 Basic Usage | |
28 | + 2.2 Adding/removing cpus | |
29 | + 2.3 Setting flags | |
30 | + 2.4 Attaching processes | |
31 | +3. Questions | |
32 | +4. Contact | |
33 | + | |
34 | +1. Cpusets | |
35 | +========== | |
36 | + | |
37 | +1.1 What are cpusets ? | |
38 | +---------------------- | |
39 | + | |
40 | +Cpusets provide a mechanism for assigning a set of CPUs and Memory | |
41 | +Nodes to a set of tasks. In this document "Memory Node" refers to | |
42 | +an on-line node that contains memory. | |
43 | + | |
44 | +Cpusets constrain the CPU and Memory placement of tasks to only | |
45 | +the resources within a tasks current cpuset. They form a nested | |
46 | +hierarchy visible in a virtual file system. These are the essential | |
47 | +hooks, beyond what is already present, required to manage dynamic | |
48 | +job placement on large systems. | |
49 | + | |
50 | +Cpusets use the generic cgroup subsystem described in | |
51 | +Documentation/cgroups/cgroups.txt. | |
52 | + | |
53 | +Requests by a task, using the sched_setaffinity(2) system call to | |
54 | +include CPUs in its CPU affinity mask, and using the mbind(2) and | |
55 | +set_mempolicy(2) system calls to include Memory Nodes in its memory | |
56 | +policy, are both filtered through that tasks cpuset, filtering out any | |
57 | +CPUs or Memory Nodes not in that cpuset. The scheduler will not | |
58 | +schedule a task on a CPU that is not allowed in its cpus_allowed | |
59 | +vector, and the kernel page allocator will not allocate a page on a | |
60 | +node that is not allowed in the requesting tasks mems_allowed vector. | |
61 | + | |
62 | +User level code may create and destroy cpusets by name in the cgroup | |
63 | +virtual file system, manage the attributes and permissions of these | |
64 | +cpusets and which CPUs and Memory Nodes are assigned to each cpuset, | |
65 | +specify and query to which cpuset a task is assigned, and list the | |
66 | +task pids assigned to a cpuset. | |
67 | + | |
68 | + | |
69 | +1.2 Why are cpusets needed ? | |
70 | +---------------------------- | |
71 | + | |
72 | +The management of large computer systems, with many processors (CPUs), | |
73 | +complex memory cache hierarchies and multiple Memory Nodes having | |
74 | +non-uniform access times (NUMA) presents additional challenges for | |
75 | +the efficient scheduling and memory placement of processes. | |
76 | + | |
77 | +Frequently more modest sized systems can be operated with adequate | |
78 | +efficiency just by letting the operating system automatically share | |
79 | +the available CPU and Memory resources amongst the requesting tasks. | |
80 | + | |
81 | +But larger systems, which benefit more from careful processor and | |
82 | +memory placement to reduce memory access times and contention, | |
83 | +and which typically represent a larger investment for the customer, | |
84 | +can benefit from explicitly placing jobs on properly sized subsets of | |
85 | +the system. | |
86 | + | |
87 | +This can be especially valuable on: | |
88 | + | |
89 | + * Web Servers running multiple instances of the same web application, | |
90 | + * Servers running different applications (for instance, a web server | |
91 | + and a database), or | |
92 | + * NUMA systems running large HPC applications with demanding | |
93 | + performance characteristics. | |
94 | + | |
95 | +These subsets, or "soft partitions" must be able to be dynamically | |
96 | +adjusted, as the job mix changes, without impacting other concurrently | |
97 | +executing jobs. The location of the running jobs pages may also be moved | |
98 | +when the memory locations are changed. | |
99 | + | |
100 | +The kernel cpuset patch provides the minimum essential kernel | |
101 | +mechanisms required to efficiently implement such subsets. It | |
102 | +leverages existing CPU and Memory Placement facilities in the Linux | |
103 | +kernel to avoid any additional impact on the critical scheduler or | |
104 | +memory allocator code. | |
105 | + | |
106 | + | |
107 | +1.3 How are cpusets implemented ? | |
108 | +--------------------------------- | |
109 | + | |
110 | +Cpusets provide a Linux kernel mechanism to constrain which CPUs and | |
111 | +Memory Nodes are used by a process or set of processes. | |
112 | + | |
113 | +The Linux kernel already has a pair of mechanisms to specify on which | |
114 | +CPUs a task may be scheduled (sched_setaffinity) and on which Memory | |
115 | +Nodes it may obtain memory (mbind, set_mempolicy). | |
116 | + | |
117 | +Cpusets extends these two mechanisms as follows: | |
118 | + | |
119 | + - Cpusets are sets of allowed CPUs and Memory Nodes, known to the | |
120 | + kernel. | |
121 | + - Each task in the system is attached to a cpuset, via a pointer | |
122 | + in the task structure to a reference counted cgroup structure. | |
123 | + - Calls to sched_setaffinity are filtered to just those CPUs | |
124 | + allowed in that tasks cpuset. | |
125 | + - Calls to mbind and set_mempolicy are filtered to just | |
126 | + those Memory Nodes allowed in that tasks cpuset. | |
127 | + - The root cpuset contains all the systems CPUs and Memory | |
128 | + Nodes. | |
129 | + - For any cpuset, one can define child cpusets containing a subset | |
130 | + of the parents CPU and Memory Node resources. | |
131 | + - The hierarchy of cpusets can be mounted at /dev/cpuset, for | |
132 | + browsing and manipulation from user space. | |
133 | + - A cpuset may be marked exclusive, which ensures that no other | |
134 | + cpuset (except direct ancestors and descendents) may contain | |
135 | + any overlapping CPUs or Memory Nodes. | |
136 | + - You can list all the tasks (by pid) attached to any cpuset. | |
137 | + | |
138 | +The implementation of cpusets requires a few, simple hooks | |
139 | +into the rest of the kernel, none in performance critical paths: | |
140 | + | |
141 | + - in init/main.c, to initialize the root cpuset at system boot. | |
142 | + - in fork and exit, to attach and detach a task from its cpuset. | |
143 | + - in sched_setaffinity, to mask the requested CPUs by what's | |
144 | + allowed in that tasks cpuset. | |
145 | + - in sched.c migrate_all_tasks(), to keep migrating tasks within | |
146 | + the CPUs allowed by their cpuset, if possible. | |
147 | + - in the mbind and set_mempolicy system calls, to mask the requested | |
148 | + Memory Nodes by what's allowed in that tasks cpuset. | |
149 | + - in page_alloc.c, to restrict memory to allowed nodes. | |
150 | + - in vmscan.c, to restrict page recovery to the current cpuset. | |
151 | + | |
152 | +You should mount the "cgroup" filesystem type in order to enable | |
153 | +browsing and modifying the cpusets presently known to the kernel. No | |
154 | +new system calls are added for cpusets - all support for querying and | |
155 | +modifying cpusets is via this cpuset file system. | |
156 | + | |
157 | +The /proc/<pid>/status file for each task has four added lines, | |
158 | +displaying the tasks cpus_allowed (on which CPUs it may be scheduled) | |
159 | +and mems_allowed (on which Memory Nodes it may obtain memory), | |
160 | +in the two formats seen in the following example: | |
161 | + | |
162 | + Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff | |
163 | + Cpus_allowed_list: 0-127 | |
164 | + Mems_allowed: ffffffff,ffffffff | |
165 | + Mems_allowed_list: 0-63 | |
166 | + | |
167 | +Each cpuset is represented by a directory in the cgroup file system | |
168 | +containing (on top of the standard cgroup files) the following | |
169 | +files describing that cpuset: | |
170 | + | |
171 | + - cpus: list of CPUs in that cpuset | |
172 | + - mems: list of Memory Nodes in that cpuset | |
173 | + - memory_migrate flag: if set, move pages to cpusets nodes | |
174 | + - cpu_exclusive flag: is cpu placement exclusive? | |
175 | + - mem_exclusive flag: is memory placement exclusive? | |
176 | + - mem_hardwall flag: is memory allocation hardwalled | |
177 | + - memory_pressure: measure of how much paging pressure in cpuset | |
178 | + | |
179 | +In addition, the root cpuset only has the following file: | |
180 | + - memory_pressure_enabled flag: compute memory_pressure? | |
181 | + | |
182 | +New cpusets are created using the mkdir system call or shell | |
183 | +command. The properties of a cpuset, such as its flags, allowed | |
184 | +CPUs and Memory Nodes, and attached tasks, are modified by writing | |
185 | +to the appropriate file in that cpusets directory, as listed above. | |
186 | + | |
187 | +The named hierarchical structure of nested cpusets allows partitioning | |
188 | +a large system into nested, dynamically changeable, "soft-partitions". | |
189 | + | |
190 | +The attachment of each task, automatically inherited at fork by any | |
191 | +children of that task, to a cpuset allows organizing the work load | |
192 | +on a system into related sets of tasks such that each set is constrained | |
193 | +to using the CPUs and Memory Nodes of a particular cpuset. A task | |
194 | +may be re-attached to any other cpuset, if allowed by the permissions | |
195 | +on the necessary cpuset file system directories. | |
196 | + | |
197 | +Such management of a system "in the large" integrates smoothly with | |
198 | +the detailed placement done on individual tasks and memory regions | |
199 | +using the sched_setaffinity, mbind and set_mempolicy system calls. | |
200 | + | |
201 | +The following rules apply to each cpuset: | |
202 | + | |
203 | + - Its CPUs and Memory Nodes must be a subset of its parents. | |
204 | + - It can't be marked exclusive unless its parent is. | |
205 | + - If its cpu or memory is exclusive, they may not overlap any sibling. | |
206 | + | |
207 | +These rules, and the natural hierarchy of cpusets, enable efficient | |
208 | +enforcement of the exclusive guarantee, without having to scan all | |
209 | +cpusets every time any of them changes to ensure nothing overlaps an | |
210 | +exclusive cpuset. Also, the use of a Linux virtual file system (vfs) | |
211 | +to represent the cpuset hierarchy provides for a familiar permission | |
212 | +and name space for cpusets, with a minimum of additional kernel code. | |
213 | + | |
214 | +The cpus and mems files in the root (top_cpuset) cpuset are | |
215 | +read-only. The cpus file automatically tracks the value of | |
216 | +cpu_online_map using a CPU hotplug notifier, and the mems file | |
217 | +automatically tracks the value of node_states[N_HIGH_MEMORY]--i.e., | |
218 | +nodes with memory--using the cpuset_track_online_nodes() hook. | |
219 | + | |
220 | + | |
221 | +1.4 What are exclusive cpusets ? | |
222 | +-------------------------------- | |
223 | + | |
224 | +If a cpuset is cpu or mem exclusive, no other cpuset, other than | |
225 | +a direct ancestor or descendent, may share any of the same CPUs or | |
226 | +Memory Nodes. | |
227 | + | |
228 | +A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled", | |
229 | +i.e. it restricts kernel allocations for page, buffer and other data | |
230 | +commonly shared by the kernel across multiple users. All cpusets, | |
231 | +whether hardwalled or not, restrict allocations of memory for user | |
232 | +space. This enables configuring a system so that several independent | |
233 | +jobs can share common kernel data, such as file system pages, while | |
234 | +isolating each job's user allocation in its own cpuset. To do this, | |
235 | +construct a large mem_exclusive cpuset to hold all the jobs, and | |
236 | +construct child, non-mem_exclusive cpusets for each individual job. | |
237 | +Only a small amount of typical kernel memory, such as requests from | |
238 | +interrupt handlers, is allowed to be taken outside even a | |
239 | +mem_exclusive cpuset. | |
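A rough sketch of that layout (assuming the /dev/cpuset mount described in
section 1.9, and hypothetical job names; each child still needs its own
cpus and mems configured before tasks can be attached):

# cd /dev/cpuset
# mkdir jobs                        # large hardwalled parent for all jobs
# /bin/echo 1 > jobs/mem_exclusive
# mkdir jobs/job1                   # per-job cpusets, not mem_exclusive,
# mkdir jobs/job2                   # so they share kernel data within 'jobs'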
240 | + | |
241 | + | |
242 | +1.5 What is memory_pressure ? | |
243 | +----------------------------- | |
244 | +The memory_pressure of a cpuset provides a simple per-cpuset metric | |
245 | +of the rate that the tasks in a cpuset are attempting to free up in | |
246 | +use memory on the nodes of the cpuset to satisfy additional memory | |
247 | +requests. | |
248 | + | |
249 | +This enables batch managers monitoring jobs running in dedicated | |
250 | +cpusets to efficiently detect what level of memory pressure that job | |
251 | +is causing. | |
252 | + | |
253 | +This is useful both on tightly managed systems running a wide mix of | |
254 | +submitted jobs, which may choose to terminate or re-prioritize jobs that | |
255 | +are trying to use more memory than allowed on the nodes assigned them, | |
256 | +and with tightly coupled, long running, massively parallel scientific | |
257 | +computing jobs that will dramatically fail to meet required performance | |
258 | +goals if they start to use more memory than allowed to them. | |
259 | + | |
260 | +This mechanism provides a very economical way for the batch manager | |
261 | +to monitor a cpuset for signs of memory pressure. It's up to the | |
262 | +batch manager or other user code to decide what to do about it and | |
263 | +take action. | |
264 | + | |
265 | +==> Unless this feature is enabled by writing "1" to the special file | |
266 | + /dev/cpuset/memory_pressure_enabled, the hook in the rebalance | |
267 | + code of __alloc_pages() for this metric reduces to simply noticing | |
268 | + that the cpuset_memory_pressure_enabled flag is zero. So only | |
269 | + systems that enable this feature will compute the metric. | |
270 | + | |
271 | +Why a per-cpuset, running average: | |
272 | + | |
273 | + Because this meter is per-cpuset, rather than per-task or mm, | |
274 | + the system load imposed by a batch scheduler monitoring this | |
275 | + metric is sharply reduced on large systems, because a scan of | |
276 | + the tasklist can be avoided on each set of queries. | |
277 | + | |
278 | + Because this meter is a running average, instead of an accumulating | |
279 | + counter, a batch scheduler can detect memory pressure with a | |
280 | + single read, instead of having to read and accumulate results | |
281 | + for a period of time. | |
282 | + | |
283 | + Because this meter is per-cpuset rather than per-task or mm, | |
284 | + the batch scheduler can obtain the key information, memory | |
285 | + pressure in a cpuset, with a single read, rather than having to | |
286 | + query and accumulate results over all the (dynamically changing) | |
287 | + set of tasks in the cpuset. | |
288 | + | |
289 | +A per-cpuset simple digital filter (requires a spinlock and 3 words | |
290 | +of data per-cpuset) is kept, and updated by any task attached to that | |
291 | +cpuset, if it enters the synchronous (direct) page reclaim code. | |
292 | + | |
293 | +A per-cpuset file provides an integer number representing the recent | |
294 | +(half-life of 10 seconds) rate of direct page reclaims caused by | |
295 | +the tasks in the cpuset, in units of reclaims attempted per second, | |
296 | +times 1000. | |
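A minimal sketch of using this, assuming the /dev/cpuset mount point and
the example cpuset "Charlie" created in section 1.9:

# /bin/echo 1 > /dev/cpuset/memory_pressure_enabled   # enable the metric globally
# cat /dev/cpuset/Charlie/memory_pressure              # recent reclaim rate
                                                       # (reclaims/sec * 1000)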
297 | + | |
298 | + | |
299 | +1.6 What is memory spread ? | |
300 | +--------------------------- | |
301 | +There are two boolean flag files per cpuset that control where the | |
302 | +kernel allocates pages for the file system buffers and related in | |
303 | +kernel data structures. They are called 'memory_spread_page' and | |
304 | +'memory_spread_slab'. | |
305 | + | |
306 | +If the per-cpuset boolean flag file 'memory_spread_page' is set, then | |
307 | +the kernel will spread the file system buffers (page cache) evenly | |
308 | +over all the nodes that the faulting task is allowed to use, instead | |
309 | +of preferring to put those pages on the node where the task is running. | |
310 | + | |
311 | +If the per-cpuset boolean flag file 'memory_spread_slab' is set, | |
312 | +then the kernel will spread some file system related slab caches, | |
313 | +such as for inodes and dentries evenly over all the nodes that the | |
314 | +faulting task is allowed to use, instead of preferring to put those | |
315 | +pages on the node where the task is running. | |
316 | + | |
317 | +The setting of these flags does not affect anonymous data segment or | |
318 | +stack segment pages of a task. | |
319 | + | |
320 | +By default, both kinds of memory spreading are off, and memory | |
321 | +pages are allocated on the node local to where the task is running, | |
322 | +except perhaps as modified by the tasks NUMA mempolicy or cpuset | |
323 | +configuration, so long as sufficient free memory pages are available. | |
324 | + | |
325 | +When new cpusets are created, they inherit the memory spread settings | |
326 | +of their parent. | |
327 | + | |
328 | +Setting memory spreading causes allocations for the affected page | |
329 | +or slab caches to ignore the tasks NUMA mempolicy and be spread | |
330 | +instead. Tasks using mbind() or set_mempolicy() calls to set NUMA | |
331 | +mempolicies will not notice any change in these calls as a result of | |
332 | +their containing tasks memory spread settings. If memory spreading | |
333 | +is turned off, then the currently specified NUMA mempolicy once again | |
334 | +applies to memory page allocations. | |
335 | + | |
336 | +Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag | |
337 | +files. By default they contain "0", meaning that the feature is off | |
338 | +for that cpuset. If a "1" is written to that file, then that turns | |
339 | +the named feature on. | |
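For example, to turn both kinds of spreading on for one cpuset (a sketch,
using the example cpuset "Charlie" created in section 1.9):

# cat /dev/cpuset/Charlie/memory_spread_page     # "0": spreading currently off
# /bin/echo 1 > /dev/cpuset/Charlie/memory_spread_page
# /bin/echo 1 > /dev/cpuset/Charlie/memory_spread_slab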
340 | + | |
341 | +The implementation is simple. | |
342 | + | |
343 | +Setting the flag 'memory_spread_page' turns on a per-process flag | |
344 | +PF_SPREAD_PAGE for each task that is in that cpuset or subsequently | |
345 | +joins that cpuset. The page allocation calls for the page cache | |
346 | +are modified to perform an inline check for this PF_SPREAD_PAGE task | |
347 | +flag, and if set, a call to a new routine cpuset_mem_spread_node() | |
348 | +returns the node to prefer for the allocation. | |
349 | + | |
350 | +Similarly, setting 'memory_spread_slab' turns on the flag | |
351 | +PF_SPREAD_SLAB, and appropriately marked slab caches will allocate | |
352 | +pages from the node returned by cpuset_mem_spread_node(). | |
353 | + | |
354 | +The cpuset_mem_spread_node() routine is also simple. It uses the | |
355 | +value of a per-task rotor cpuset_mem_spread_rotor to select the next | |
356 | +node in the current tasks mems_allowed to prefer for the allocation. | |
357 | + | |
358 | +This memory placement policy is also known (in other contexts) as | |
359 | +round-robin or interleave. | |
360 | + | |
361 | +This policy can provide substantial improvements for jobs that need | |
362 | +to place thread local data on the corresponding node, but that need | |
363 | +to access large file system data sets that need to be spread across | |
364 | +the several nodes in the jobs cpuset in order to fit. Without this | |
365 | +policy, especially for jobs that might have one thread reading in the | |
366 | +data set, the memory allocation across the nodes in the jobs cpuset | |
367 | +can become very uneven. | |
368 | + | |
369 | +1.7 What is sched_load_balance ? | |
370 | +-------------------------------- | |
371 | + | |
372 | +The kernel scheduler (kernel/sched.c) automatically load balances | |
373 | +tasks. If one CPU is underutilized, kernel code running on that | |
374 | +CPU will look for tasks on other more overloaded CPUs and move those | |
375 | +tasks to itself, within the constraints of such placement mechanisms | |
376 | +as cpusets and sched_setaffinity. | |
377 | + | |
378 | +The algorithmic cost of load balancing and its impact on key shared | |
379 | +kernel data structures such as the task list increases more than | |
380 | +linearly with the number of CPUs being balanced. So the scheduler | |
381 | +has support to partition the systems CPUs into a number of sched | |
382 | +domains such that it only load balances within each sched domain. | |
383 | +Each sched domain covers some subset of the CPUs in the system; | |
384 | +no two sched domains overlap; some CPUs might not be in any sched | |
385 | +domain and hence won't be load balanced. | |
386 | + | |
387 | +Put simply, it costs less to balance between two smaller sched domains | |
388 | +than one big one, but doing so means that overloads in one of the | |
389 | +two domains won't be load balanced to the other one. | |
390 | + | |
391 | +By default, there is one sched domain covering all CPUs, except those | |
392 | +marked isolated using the kernel boot time "isolcpus=" argument. | |
393 | + | |
394 | +This default load balancing across all CPUs is not well suited for | |
395 | +the following two situations: | |
396 | + 1) On large systems, load balancing across many CPUs is expensive. | |
397 | + If the system is managed using cpusets to place independent jobs | |
398 | + on separate sets of CPUs, full load balancing is unnecessary. | |
399 | + 2) Systems supporting realtime on some CPUs need to minimize | |
400 | + system overhead on those CPUs, including avoiding task load | |
401 | + balancing if that is not needed. | |
402 | + | |
403 | +When the per-cpuset flag "sched_load_balance" is enabled (the default | |
404 | +setting), it requests that all the CPUs in that cpusets allowed 'cpus' | |
405 | +be contained in a single sched domain, ensuring that load balancing | |
406 | +can move a task (not otherwise pinned, as by sched_setaffinity) | |
407 | +from any CPU in that cpuset to any other. | |
408 | + | |
409 | +When the per-cpuset flag "sched_load_balance" is disabled, then the | |
410 | +scheduler will avoid load balancing across the CPUs in that cpuset, | |
411 | +--except-- in so far as is necessary because some overlapping cpuset | |
412 | +has "sched_load_balance" enabled. | |
413 | + | |
414 | +So, for example, if the top cpuset has the flag "sched_load_balance" | |
415 | +enabled, then the scheduler will have one sched domain covering all | |
416 | +CPUs, and the setting of the "sched_load_balance" flag in any other | |
417 | +cpusets won't matter, as we're already fully load balancing. | |
418 | + | |
419 | +Therefore in the above two situations, the top cpuset flag | |
420 | +"sched_load_balance" should be disabled, and only some of the smaller, | |
421 | +child cpusets have this flag enabled. | |
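A sketch of that arrangement, assuming the /dev/cpuset mount and two
hypothetical child cpusets "batch" and "rt" that have already been given
disjoint 'cpus':

# /bin/echo 0 > /dev/cpuset/sched_load_balance         # no system-wide domain
# /bin/echo 1 > /dev/cpuset/batch/sched_load_balance   # balance only within the
# /bin/echo 1 > /dev/cpuset/rt/sched_load_balance      # CPUs of each child cpuset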
422 | + | |
423 | +When doing this, you don't usually want to leave any unpinned tasks in | |
424 | +the top cpuset that might use non-trivial amounts of CPU, as such tasks | |
425 | +may be artificially constrained to some subset of CPUs, depending on | |
426 | +the particulars of this flag setting in descendent cpusets. Even if | |
427 | +such a task could use spare CPU cycles in some other CPUs, the kernel | |
428 | +scheduler might not consider the possibility of load balancing that | |
429 | +task to that underused CPU. | |
430 | + | |
431 | +Of course, tasks pinned to a particular CPU can be left in a cpuset | |
432 | +that disables "sched_load_balance" as those tasks aren't going anywhere | |
433 | +else anyway. | |
434 | + | |
435 | +There is an impedance mismatch here, between cpusets and sched domains. | |
436 | +Cpusets are hierarchical and nest. Sched domains are flat; they don't | |
437 | +overlap and each CPU is in at most one sched domain. | |
438 | + | |
439 | +It is necessary for sched domains to be flat because load balancing | |
440 | +across partially overlapping sets of CPUs would risk unstable dynamics | |
441 | +that would be beyond our understanding. So if each of two partially | |
442 | +overlapping cpusets enables the flag 'sched_load_balance', then we | |
443 | +form a single sched domain that is a superset of both. We won't move | |
444 | +a task to a CPU outside its cpuset, but the scheduler load balancing | |
445 | +code might waste some compute cycles considering that possibility. | |
446 | + | |
447 | +This mismatch is why there is not a simple one-to-one relation | |
448 | +between which cpusets have the flag "sched_load_balance" enabled, | |
449 | +and the sched domain configuration. If a cpuset enables the flag, it | |
450 | +will get balancing across all its CPUs, but if it disables the flag, | |
451 | +it will only be assured of no load balancing if no other overlapping | |
452 | +cpuset enables the flag. | |
453 | + | |
454 | +If two cpusets have partially overlapping 'cpus' allowed, and only | |
455 | +one of them has this flag enabled, then the other may find its | |
456 | +tasks only partially load balanced, just on the overlapping CPUs. | |
457 | +This is just the general case of the top_cpuset example given a few | |
458 | +paragraphs above. In the general case, as in the top cpuset case, | |
459 | +don't leave tasks that might use non-trivial amounts of CPU in | |
460 | +such partially load balanced cpusets, as they may be artificially | |
461 | +constrained to some subset of the CPUs allowed to them, for lack of | |
462 | +load balancing to the other CPUs. | |
463 | + | |
464 | +1.7.1 sched_load_balance implementation details. | |
465 | +------------------------------------------------ | |
466 | + | |
467 | +The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary | |
468 | +to most cpuset flags.) When enabled for a cpuset, the kernel will | |
469 | +ensure that it can load balance across all the CPUs in that cpuset | |
470 | +(makes sure that all the CPUs in the cpus_allowed of that cpuset are | |
471 | +in the same sched domain.) | |
472 | + | |
473 | +If two overlapping cpusets both have 'sched_load_balance' enabled, | |
474 | +then they will be (must be) both in the same sched domain. | |
475 | + | |
476 | +If, as is the default, the top cpuset has 'sched_load_balance' enabled, | |
477 | +then by the above that means there is a single sched domain covering | |
478 | +the whole system, regardless of any other cpuset settings. | |
479 | + | |
480 | +The kernel commits to user space that it will avoid load balancing | |
481 | +where it can. It will pick as fine a granularity partition of sched | |
482 | +domains as it can while still providing load balancing for any set | |
483 | +of CPUs allowed to a cpuset having 'sched_load_balance' enabled. | |
484 | + | |
485 | +The internal kernel cpuset to scheduler interface passes from the | |
486 | +cpuset code to the scheduler code a partition of the load balanced | |
487 | +CPUs in the system. This partition is a set of subsets (represented | |
488 | +as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all | |
489 | +the CPUs that must be load balanced. | |
490 | + | |
491 | +Whenever the 'sched_load_balance' flag changes, or CPUs come or go | |
492 | +from a cpuset with this flag enabled, or a cpuset with this flag | |
493 | +enabled is removed, the cpuset code builds a new such partition and | |
494 | +passes it to the scheduler sched domain setup code, to have the sched | |
495 | +domains rebuilt as necessary. | |
496 | + | |
497 | +This partition exactly defines what sched domains the scheduler should | |
498 | +setup - one sched domain for each element (cpumask_t) in the partition. | |
499 | + | |
500 | +The scheduler remembers the currently active sched domain partitions. | |
501 | +When the scheduler routine partition_sched_domains() is invoked from | |
502 | +the cpuset code to update these sched domains, it compares the new | |
503 | +partition requested with the current, and updates its sched domains, | |
504 | +removing the old and adding the new, for each change. | |
505 | + | |
506 | + | |
507 | +1.8 What is sched_relax_domain_level ? | |
508 | +-------------------------------------- | |
509 | + | |
510 | +Within a sched domain, the scheduler migrates tasks in two ways: periodic | |
511 | +load balancing on the tick, and at the time of certain scheduling events. | |
512 | + | |
513 | +When a task is woken up, the scheduler tries to move it to an idle CPU. | |
514 | +For example, if a task A running on CPU X activates another task B | |
515 | +on the same CPU X, and if CPU Y is X's sibling and idle, | |
516 | +then the scheduler migrates task B to CPU Y so that task B can start on | |
517 | +CPU Y without waiting for task A on CPU X. | |
518 | + | |
519 | +And if a CPU runs out of tasks in its runqueue, the CPU tries to pull | |
520 | +extra tasks from other busy CPUs to help them before it goes | |
521 | +idle. | |
522 | + | |
523 | +Of course it takes some search cost to find movable tasks and/or | |
524 | +idle CPUs, so the scheduler might not search all CPUs in the domain | |
525 | +every time. In fact, on some architectures, the search range on these | |
526 | +events is limited to the same socket or node where the CPU is located, | |
527 | +while the load balance on tick searches all of them. | |
528 | + | |
529 | +For example, assume CPU Z is relatively far from CPU X. Even if CPU Z | |
530 | +is idle while CPU X and its siblings are busy, the scheduler can't migrate | |
531 | +the woken task B from X to Z since Z is out of its search range. | |
532 | +As a result, task B on CPU X needs to wait for task A or for the load balance | |
533 | +on the next tick. For some applications in special situations, waiting | |
534 | +even one tick may be too long. | |
535 | + | |
536 | +The 'sched_relax_domain_level' file allows you to request changing | |
537 | +this search range as you like. This file takes an integer value which | |
538 | +indicates the size of the search range in levels, ideally as follows; | |
539 | +otherwise the initial value -1 indicates that the cpuset has no request. | |
540 | + | |
541 | + -1 : no request. use system default or follow request of others. | |
542 | + 0 : no search. | |
543 | + 1 : search siblings (hyperthreads in a core). | |
544 | + 2 : search cores in a package. | |
545 | + 3 : search cpus in a node [= system wide on non-NUMA system] | |
546 | + ( 4 : search nodes in a chunk of node [on NUMA system] ) | |
547 | + ( 5 : search system wide [on NUMA system] ) | |
548 | + | |
549 | +The system default is architecture dependent. The system default | |
550 | +can be changed using the relax_domain_level= boot parameter. | |
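For example, to request that wakeup-time searching cover the sibling
hyperthreads in a core (value 1 above), in the example cpuset "Charlie"
from section 1.9 (a sketch):

# /bin/echo 1 > /dev/cpuset/Charlie/sched_relax_domain_level
# cat /dev/cpuset/Charlie/sched_relax_domain_level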
551 | + | |
552 | +This file is per-cpuset and affects the sched domain to which the cpuset | |
553 | +belongs. Therefore if the flag 'sched_load_balance' of a cpuset | |
554 | +is disabled, then 'sched_relax_domain_level' has no effect since | |
555 | +there is no sched domain belonging to the cpuset. | |
556 | + | |
557 | +If multiple cpusets are overlapping and hence they form a single sched | |
558 | +domain, the largest value among those is used. Be careful, if one | |
559 | +requests 0 and others are -1 then 0 is used. | |
560 | + | |
561 | +Note that modifying this file will have both good and bad effects, | |
562 | +and whether it is acceptable or not will depend on your situation. | |
563 | +Don't modify this file if you are not sure. | |
564 | + | |
565 | +If your situation is: | |
566 | + - The migration costs between each cpu can be assumed to be considerably | |
567 | + small (for you) due to your special application's behavior or | |
568 | + special hardware support for CPU cache etc. | |
569 | + - The searching cost doesn't have an impact (for you), or you can make | |
570 | + the searching cost small enough by managing the cpuset compactly etc. | |
571 | + - Low latency is required even if it sacrifices cache hit rate etc. | |
572 | +then increasing 'sched_relax_domain_level' would benefit you. | |
573 | + | |
574 | + | |
575 | +1.9 How do I use cpusets ? | |
576 | +-------------------------- | |
577 | + | |
578 | +In order to minimize the impact of cpusets on critical kernel | |
579 | +code, such as the scheduler, and due to the fact that the kernel | |
580 | +does not support one task updating the memory placement of another | |
581 | +task directly, the impact on a task of changing its cpuset CPU | |
582 | +or Memory Node placement, or of changing to which cpuset a task | |
583 | +is attached, is subtle. | |
584 | + | |
585 | +If a cpuset has its Memory Nodes modified, then for each task attached | |
586 | +to that cpuset, the next time that the kernel attempts to allocate | |
587 | +a page of memory for that task, the kernel will notice the change | |
588 | +in the tasks cpuset, and update its per-task memory placement to | |
589 | +remain within the new cpusets memory placement. If the task was using | |
590 | +mempolicy MPOL_BIND, and the nodes to which it was bound overlap with | |
591 | +its new cpuset, then the task will continue to use whatever subset | |
592 | +of MPOL_BIND nodes are still allowed in the new cpuset. If the task | |
593 | +was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed | |
594 | +in the new cpuset, then the task will be essentially treated as if it | |
595 | +was MPOL_BIND bound to the new cpuset (even though its numa placement, | |
596 | +as queried by get_mempolicy(), doesn't change). If a task is moved | |
597 | +from one cpuset to another, then the kernel will adjust the tasks | |
598 | +memory placement, as above, the next time that the kernel attempts | |
599 | +to allocate a page of memory for that task. | |
600 | + | |
601 | +If a cpuset has its 'cpus' modified, then each task in that cpuset | |
602 | +will have its allowed CPU placement changed immediately. Similarly, | |
603 | +if a tasks pid is written to a cpusets 'tasks' file, in either its | |
604 | +current cpuset or another cpuset, then its allowed CPU placement is | |
605 | +changed immediately. If such a task had been bound to some subset | |
606 | +of its cpuset using the sched_setaffinity() call, the task will be | |
607 | +allowed to run on any CPU allowed in its new cpuset, negating the | |
608 | +effect of the prior sched_setaffinity() call. | |
609 | + | |
610 | +In summary, the memory placement of a task whose cpuset is changed is | |
611 | +updated by the kernel, on the next allocation of a page for that task, | |
612 | +but the processor placement is not updated, until that tasks pid is | |
613 | +rewritten to the 'tasks' file of its cpuset. This is done to avoid | |
614 | +impacting the scheduler code in the kernel with a check for changes | |
615 | +in a tasks processor placement. | |
616 | + | |
617 | +Normally, once a page is allocated (given a physical page | |
618 | +of main memory) then that page stays on whatever node it | |
619 | +was allocated, so long as it remains allocated, even if the | |
620 | +cpusets memory placement policy 'mems' subsequently changes. | |
621 | +If the cpuset flag file 'memory_migrate' is set true, then when | |
622 | +tasks are attached to that cpuset, any pages that task had | |
623 | +allocated to it on nodes in its previous cpuset are migrated | |
624 | +to the tasks new cpuset. The relative placement of the page within | |
625 | +the cpuset is preserved during these migration operations if possible. | |
626 | +For example if the page was on the second valid node of the prior cpuset | |
627 | +then the page will be placed on the second valid node of the new cpuset. | |
628 | + | |
629 | +Also if 'memory_migrate' is set true, then if that cpusets | |
630 | +'mems' file is modified, pages allocated to tasks in that | |
631 | +cpuset, that were on nodes in the previous setting of 'mems', | |
632 | +will be moved to nodes in the new setting of 'mems.' | |
633 | +Pages that were not in the tasks prior cpuset, or in the cpusets | |
634 | +prior 'mems' setting, will not be moved. | |
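A sketch of enabling this for the example cpuset "Charlie" from section 1.9
(which was created with mems set to node 1):

# /bin/echo 1 > /dev/cpuset/Charlie/memory_migrate   # migrate pages on future changes
# /bin/echo 2 > /dev/cpuset/Charlie/mems             # pages on node 1 move to node 2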
635 | + | |
636 | +There is an exception to the above. If hotplug functionality is used | |
637 | +to remove all the CPUs that are currently assigned to a cpuset, | |
638 | +then all the tasks in that cpuset will be moved to the nearest ancestor | |
639 | +with non-empty cpus. But the moving of some (or all) tasks might fail if | |
640 | +cpuset is bound with another cgroup subsystem which has some restrictions | |
641 | +on task attaching. In this failing case, those tasks will stay | |
642 | +in the original cpuset, and the kernel will automatically update | |
643 | +their cpus_allowed to allow all online CPUs. When memory hotplug | |
644 | +functionality for removing Memory Nodes is available, a similar exception | |
645 | +is expected to apply there as well. In general, the kernel prefers to | |
646 | +violate cpuset placement, over starving a task that has had all | |
647 | +its allowed CPUs or Memory Nodes taken offline. | |
648 | + | |
649 | +There is a second exception to the above. GFP_ATOMIC requests are | |
650 | +kernel internal allocations that must be satisfied, immediately. | |
651 | +The kernel may drop some request, in rare cases even panic, if a | |
652 | +GFP_ATOMIC alloc fails. If the request cannot be satisfied within | |
653 | +the current tasks cpuset, then we relax the cpuset, and look for | |
654 | +memory anywhere we can find it. It's better to violate the cpuset | |
655 | +than stress the kernel. | |
656 | + | |
657 | +To start a new job that is to be contained within a cpuset, the steps are: | |
658 | + | |
659 | + 1) mkdir /dev/cpuset | |
660 | + 2) mount -t cgroup -ocpuset cpuset /dev/cpuset | |
661 | + 3) Create the new cpuset by doing mkdir's and write's (or echo's) in | |
662 | + the /dev/cpuset virtual file system. | |
663 | + 4) Start a task that will be the "founding father" of the new job. | |
664 | + 5) Attach that task to the new cpuset by writing its pid to the | |
665 | + /dev/cpuset tasks file for that cpuset. | |
666 | + 6) fork, exec or clone the job tasks from this founding father task. | |
667 | + | |
668 | +For example, the following sequence of commands will set up a cpuset | |
669 | +named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, | |
670 | +and then start a subshell 'sh' in that cpuset: | |
671 | + | |
672 | + mount -t cgroup -ocpuset cpuset /dev/cpuset | |
673 | + cd /dev/cpuset | |
674 | + mkdir Charlie | |
675 | + cd Charlie | |
676 | + /bin/echo 2-3 > cpus | |
677 | + /bin/echo 1 > mems | |
678 | + /bin/echo $$ > tasks | |
679 | + sh | |
680 | + # The subshell 'sh' is now running in cpuset Charlie | |
681 | + # The next line should display '/Charlie' | |
682 | + cat /proc/self/cpuset | |
683 | + | |
684 | +In the future, a C library interface to cpusets will likely be | |
685 | +available. For now, the only way to query or modify cpusets is | |
686 | +via the cpuset file system, using the various cd, mkdir, echo, cat, | |
687 | +rmdir commands from the shell, or their equivalent from C. | |
688 | + | |
689 | +The sched_setaffinity calls can also be done at the shell prompt using | |
690 | +SGI's runon or Robert Love's taskset. The mbind and set_mempolicy | |
691 | +calls can be done at the shell prompt using the numactl command | |
692 | +(part of Andi Kleen's numa package). | |
693 | + | |
694 | +2. Usage Examples and Syntax | |
695 | +============================ | |
696 | + | |
697 | +2.1 Basic Usage | |
698 | +--------------- | |
699 | + | |
700 | +Creating, modifying, using the cpusets can be done through the cpuset | |
701 | +virtual filesystem. | |
702 | + | |
703 | +To mount it, type: | |
704 | +# mount -t cgroup -o cpuset cpuset /dev/cpuset | |
705 | + | |
706 | +Then under /dev/cpuset you can find a tree that corresponds to the | |
707 | +tree of the cpusets in the system. For instance, /dev/cpuset | |
708 | +is the cpuset that holds the whole system. | |
709 | + | |
710 | +If you want to create a new cpuset under /dev/cpuset: | |
711 | +# cd /dev/cpuset | |
712 | +# mkdir my_cpuset | |
713 | + | |
714 | +Now you want to do something with this cpuset. | |
715 | +# cd my_cpuset | |
716 | + | |
717 | +In this directory you can find several files: | |
718 | +# ls | |
719 | +cpu_exclusive memory_migrate mems tasks | |
720 | +cpus memory_pressure notify_on_release | |
721 | +mem_exclusive memory_spread_page sched_load_balance | |
722 | +mem_hardwall memory_spread_slab sched_relax_domain_level | |
723 | + | |
724 | +Reading them will give you information about the state of this cpuset: | |
725 | +the CPUs and Memory Nodes it can use, the processes that are using | |
726 | +it, its properties. By writing to these files you can manipulate | |
727 | +the cpuset. | |
728 | + | |
729 | +Set some flags: | |
730 | +# /bin/echo 1 > cpu_exclusive | |
731 | + | |
732 | +Add some cpus: | |
733 | +# /bin/echo 0-7 > cpus | |
734 | + | |
735 | +Add some mems: | |
736 | +# /bin/echo 0-7 > mems | |
737 | + | |
738 | +Now attach your shell to this cpuset: | |
739 | +# /bin/echo $$ > tasks | |
740 | + | |
741 | +You can also create cpusets inside your cpuset by using mkdir in this | |
742 | +directory. | |
743 | +# mkdir my_sub_cs | |
744 | + | |
745 | +To remove a cpuset, just use rmdir: | |
746 | +# rmdir my_sub_cs | |
747 | +This will fail if the cpuset is in use (has cpusets inside, or has | |
748 | +processes attached). | |
749 | + | |
750 | +Note that for legacy reasons, the "cpuset" filesystem exists as a | |
751 | +wrapper around the cgroup filesystem. | |
752 | + | |
753 | +The command | |
754 | + | |
755 | +mount -t cpuset X /dev/cpuset | |
756 | + | |
757 | +is equivalent to | |
758 | + | |
759 | +mount -t cgroup -ocpuset X /dev/cpuset | |
760 | +echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent | |
761 | + | |
762 | +2.2 Adding/removing cpus | |
763 | +------------------------ | |
764 | + | |
765 | +This is the syntax to use when writing in the cpus or mems files | |
766 | +in cpuset directories: | |
767 | + | |
768 | +# /bin/echo 1-4 > cpus -> set cpus list to cpus 1,2,3,4 | |
769 | +# /bin/echo 1,2,3,4 > cpus -> set cpus list to cpus 1,2,3,4 | |
770 | + | |
771 | +2.3 Setting flags | |
772 | +----------------- | |
773 | + | |
774 | +The syntax is very simple: | |
775 | + | |
776 | +# /bin/echo 1 > cpu_exclusive -> set flag 'cpu_exclusive' | |
777 | +# /bin/echo 0 > cpu_exclusive -> unset flag 'cpu_exclusive' | |
778 | + | |
779 | +2.4 Attaching processes | |
780 | +----------------------- | |
781 | + | |
782 | +# /bin/echo PID > tasks | |
783 | + | |
784 | +Note that it is PID, not PIDs. You can only attach ONE task at a time. | |
785 | +If you have several tasks to attach, you have to do it one after another: | |
786 | + | |
787 | +# /bin/echo PID1 > tasks | |
788 | +# /bin/echo PID2 > tasks | |
789 | + ... | |
790 | +# /bin/echo PIDn > tasks | |
791 | + | |
792 | + | |
793 | +3. Questions | |
794 | +============ | |
795 | + | |
796 | +Q: what's up with this '/bin/echo' ? | |
797 | +A: bash's builtin 'echo' command does not check calls to write() against | |
798 | + errors. If you use it in the cpuset file system, you won't be | |
799 | + able to tell whether a command succeeded or failed. | |
800 | + | |
801 | +Q: When I attach processes, only the first one on the line really gets attached! | |
802 | +A: We can only return one error code per call to write(). So you should also | |
803 | + put only ONE pid. | |
804 | + | |
805 | +4. Contact | |
806 | +========== | |
807 | + | |
808 | +Web: http://www.bullopensource.org/cpuset |
Documentation/cgroups/devices.txt
1 | +Device Whitelist Controller | |
2 | + | |
3 | +1. Description: | |
4 | + | |
5 | +Implement a cgroup to track and enforce open and mknod restrictions | |
6 | +on device files. A device cgroup associates a device access | |
7 | +whitelist with each cgroup. A whitelist entry has 4 fields. | |
8 | +'type' is a (all), c (char), or b (block). 'all' means it applies | |
9 | +to all types and all major and minor numbers. Major and minor are | |
10 | +either an integer or * for all. Access is a composition of r | |
11 | +(read), w (write), and m (mknod). | |
12 | + | |
13 | +The root device cgroup starts with rwm to 'all'. A child device | |
14 | +cgroup gets a copy of the parent. Administrators can then remove | |
15 | +devices from the whitelist or add new entries. A child cgroup can | |
16 | +never receive a device access which is denied by its parent. However | |
17 | +when a device access is removed from a parent it will not also be | |
18 | +removed from the child(ren). | |
19 | + | |
20 | +2. User Interface | |
21 | + | |
22 | +An entry is added using devices.allow, and removed using | |
23 | +devices.deny. For instance | |
24 | + | |
25 | + echo 'c 1:3 mr' > /cgroups/1/devices.allow | |
26 | + | |
27 | +allows cgroup 1 to read and mknod the device usually known as | |
28 | +/dev/null. Doing | |
29 | + | |
30 | + echo a > /cgroups/1/devices.deny | |
31 | + | |
32 | +will remove the default 'a *:* rwm' entry. Doing | |
33 | + | |
34 | + echo a > /cgroups/1/devices.allow | |
35 | + | |
36 | +will add the 'a *:* rwm' entry to the whitelist. | |
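A short end-to-end sketch, assuming the devices subsystem is mounted at the
hypothetical mount point /cgroups used in the examples above; devices.list,
a read-only file provided by the controller, is assumed here for inspecting
the result:

# mount -t cgroup -o devices devices /cgroups
# mkdir /cgroups/1
# echo a > /cgroups/1/devices.deny           # drop the inherited 'a *:* rwm' entry
# echo 'c 1:3 mr' > /cgroups/1/devices.allow
# cat /cgroups/1/devices.list                # show the whitelist now in effect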
37 | + | |
38 | +3. Security | |
39 | + | |
40 | +Any task can move itself between cgroups. This clearly won't | |
41 | +suffice, but we can decide the best way to adequately restrict | |
42 | +movement as people get some experience with this. We may just want | |
43 | +to require CAP_SYS_ADMIN, which at least is a separate bit from | |
44 | +CAP_MKNOD. We may want to just refuse moving to a cgroup which | |
45 | +isn't a descendent of the current one. Or we may want to use | |
46 | +CAP_MAC_ADMIN, since we really are trying to lock down root. | |
47 | + | |
48 | +CAP_SYS_ADMIN is needed to modify the whitelist or move another | |
49 | +task to a new cgroup. (Again we'll probably want to change that). | |
50 | + | |
51 | +A cgroup may not be granted more permissions than the cgroup's | |
52 | +parent has. |
Documentation/cgroups/memcg_test.txt
1 | +Memory Resource Controller (Memcg) Implementation Memo. | |
2 | +Last Updated: 2008/12/15 | |
3 | +Base Kernel Version: based on 2.6.28-rc8-mm. | |
4 | + | |
5 | +Because VM is getting complex (one of the reasons is memcg...), memcg's behavior | |
6 | +is complex. This is a document for memcg's internal behavior. | |
7 | +Please note that implementation details can be changed. | |
8 | + | |
9 | +(*) Topics on API should be in Documentation/cgroups/memory.txt | |
10 | + | |
11 | +0. How to record usage ? | |
12 | + 2 objects are used. | |
13 | + | |
14 | + page_cgroup ....an object per page. | |
15 | + Allocated at boot or memory hotplug. Freed at memory hot removal. | |
16 | + | |
17 | + swap_cgroup ... an entry per swp_entry. | |
18 | + Allocated at swapon(). Freed at swapoff(). | |
19 | + | |
20 | + The page_cgroup has a USED bit, and double counting against a page_cgroup never | |
21 | + occurs. swap_cgroup is used only when a charged page is swapped-out. | |
22 | + | |
23 | +1. Charge | |
24 | + | |
25 | + a page/swp_entry may be charged (usage += PAGE_SIZE) at | |
26 | + | |
27 | + mem_cgroup_newpage_charge() | |
28 | + Called at new page fault and Copy-On-Write. | |
29 | + | |
30 | + mem_cgroup_try_charge_swapin() | |
31 | + Called at do_swap_page() (page fault on swap entry) and swapoff. | |
32 | + Followed by charge-commit-cancel protocol. (With swap accounting) | |
33 | + At commit, a charge recorded in swap_cgroup is removed. | |
34 | + | |
35 | + mem_cgroup_cache_charge() | |
36 | + Called at add_to_page_cache() | |
37 | + | |
38 | + mem_cgroup_cache_charge_swapin() | |
39 | + Called at shmem's swapin. | |
40 | + | |
41 | + mem_cgroup_prepare_migration() | |
42 | + Called before migration. "extra" charge is done and followed by | |
43 | + charge-commit-cancel protocol. | |
44 | + At commit, charge against oldpage or newpage will be committed. | |
45 | + | |
46 | +2. Uncharge | |
47 | + a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by | |
48 | + | |
49 | + mem_cgroup_uncharge_page() | |
50 | + Called when an anonymous page is fully unmapped. I.e., mapcount goes | |
51 | + to 0. If the page is SwapCache, uncharge is delayed until | |
52 | + mem_cgroup_uncharge_swapcache(). | |
53 | + | |
54 | + mem_cgroup_uncharge_cache_page() | |
55 | + Called when a page-cache is deleted from radix-tree. If the page is | |
56 | + SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache(). | |
57 | + | |
58 | + mem_cgroup_uncharge_swapcache() | |
59 | + Called when SwapCache is removed from radix-tree. The charge itself | |
60 | + is moved to swap_cgroup. (If mem+swap controller is disabled, no | |
61 | + charge to swap occurs.) | |
62 | + | |
63 | + mem_cgroup_uncharge_swap() | |
64 | + Called when swp_entry's refcnt goes down to 0. A charge against swap | |
65 | + disappears. | |
66 | + | |
67 | + mem_cgroup_end_migration(old, new) | |
68 | + At success of migration old is uncharged (if necessary), a charge | |
69 | + to new page is committed. At failure, charge to old page is committed. | |
70 | + | |
71 | +3. charge-commit-cancel | |
72 | + In some cases, we can't know whether this "charge" is valid or not at charging | |
73 | + time (because of races). | |
74 | + To handle such cases, there are charge-commit-cancel functions. | |
75 | + mem_cgroup_try_charge_XXX | |
76 | + mem_cgroup_commit_charge_XXX | |
77 | + mem_cgroup_cancel_charge_XXX | |
78 | + these are used in swap-in and migration. | |
79 | + | |
80 | + At try_charge(), there are no flags to say "this page is charged". | |
81 | + At this point, usage += PAGE_SIZE. | |
82 | + | |
83 | + At commit(), the function checks whether the page should be charged or not | |
84 | + and sets flags or avoids charging (usage -= PAGE_SIZE). | |
85 | + | |
86 | + At cancel(), simply usage -= PAGE_SIZE. | |
87 | + | |
88 | +In the explanation below, we assume CONFIG_MEM_RES_CTRL_SWAP=y. | |
89 | + | |
90 | +4. Anonymous | |
91 | + Anonymous page is newly allocated at | |
92 | + - page fault into MAP_ANONYMOUS mapping. | |
93 | + - Copy-On-Write. | |
94 | + It is charged right after it's allocated before doing any page table | |
95 | + related operations. Of course, it's uncharged when another page is used | |
96 | + for the fault address. | |
97 | + | |
98 | + At freeing anonymous page (by exit() or munmap()), zap_pte() is called | |
99 | + and pages for ptes are freed one by one.(see mm/memory.c). Uncharges | |
100 | + are done at page_remove_rmap() when page_mapcount() goes down to 0. | |
101 | + | |
102 | + Another page freeing is by page-reclaim (vmscan.c) and anonymous | |
103 | + pages are swapped out. In this case, the page is marked as | |
104 | + PageSwapCache(). uncharge() routine doesn't uncharge the page marked | |
105 | + as SwapCache(). It's delayed until __delete_from_swap_cache(). | |
106 | + | |
107 | + 4.1 Swap-in. | |
108 | + At swap-in, the page is taken from swap-cache. There are 2 cases. | |
109 | + | |
110 | + (a) If the SwapCache is newly allocated and read, it has no charges. | |
111 | + (b) If the SwapCache has been mapped by processes, it has been | |
112 | + charged already. | |
113 | + | |
114 | + This swap-in is one of the most complicated operations. In do_swap_page(), | |
115 | + the following events occur when the pte is unchanged. | |
116 | + | |
117 | + (1) the page (SwapCache) is looked up. | |
118 | + (2) lock_page() | |
119 | + (3) try_charge_swapin() | |
120 | + (4) reuse_swap_page() (may call delete_swap_cache()) | |
121 | + (5) commit_charge_swapin() | |
122 | + (6) swap_free(). | |
123 | + | |
124 | + Consider the following situations, for example. | |
125 | + | |
126 | + (A) The page has not been charged before (2) and reuse_swap_page() | |
127 | + doesn't call delete_from_swap_cache(). | |
128 | + (B) The page has not been charged before (2) and reuse_swap_page() | |
129 | + calls delete_from_swap_cache(). | |
130 | + (C) The page has been charged before (2) and reuse_swap_page() doesn't | |
131 | + call delete_from_swap_cache(). | |
132 | + (D) The page has been charged before (2) and reuse_swap_page() calls | |
133 | + delete_from_swap_cache(). | |
134 | + | |
135 | + memory.usage/memsw.usage changes to this page/swp_entry will be | |
136 | + Case (A) (B) (C) (D) | |
137 | + Event | |
138 | + Before (2) 0/ 1 0/ 1 1/ 1 1/ 1 | |
139 | + =========================================== | |
140 | + (3) +1/+1 +1/+1 +1/+1 +1/+1 | |
141 | + (4) - 0/ 0 - -1/ 0 | |
142 | + (5) 0/-1 0/ 0 -1/-1 0/ 0 | |
143 | + (6) - 0/-1 - 0/-1 | |
144 | + =========================================== | |
145 | + Result 1/ 1 1/ 1 1/ 1 1/ 1 | |
146 | + | |
147 | + In any cases, charges to this page should be 1/ 1. | |
148 | + | |
149 | + 4.2 Swap-out. | |
150 | + At swap-out, typical state transition is below. | |
151 | + | |
152 | + (a) add to swap cache. (marked as SwapCache) | |
153 | + swp_entry's refcnt += 1. | |
154 | + (b) fully unmapped. | |
155 | + swp_entry's refcnt += # of ptes. | |
156 | + (c) write back to swap. | |
157 | + (d) delete from swap cache. (remove from SwapCache) | |
158 | + swp_entry's refcnt -= 1. | |
159 | + | |
160 | + | |
161 | + At (b), the page is marked as SwapCache and not uncharged. | |
162 | + At (d), the page is removed from SwapCache and a charge in page_cgroup | |
163 | + is moved to swap_cgroup. | |
164 | + | |
165 | + Finally, at task exit, | |
166 | + (e) zap_pte() is called and swp_entry's refcnt -=1 -> 0. | |
167 | + Here, a charge in swap_cgroup disappears. | |
168 | + | |
169 | +5. Page Cache | |
170 | + Page Cache is charged at | |
171 | + - add_to_page_cache_locked(). | |
172 | + | |
173 | + uncharged at | |
174 | + - __remove_from_page_cache(). | |
175 | + | |
176 | + The logic is very clear. (About migration, see below) | |
177 | + Note: __remove_from_page_cache() is called by remove_from_page_cache() | |
178 | + and __remove_mapping(). | |
179 | + | |
180 | +6. Shmem(tmpfs) Page Cache | |
181 | + Memcg's charge/uncharge have special handlers for shmem. The best way | |
182 | + to understand shmem's page state transitions is to read mm/shmem.c, | |
183 | + but a brief explanation of memcg's behavior around shmem is helpful | |
184 | + for understanding the logic. | |
185 | + | |
186 | + Shmem's page (just leaf page, not direct/indirect block) can be on | |
187 | + - radix-tree of shmem's inode. | |
188 | + - SwapCache. | |
189 | + - Both on radix-tree and SwapCache. This happens at swap-in | |
190 | +	and swap-out. | |
191 | + | |
192 | + It's charged when... | |
193 | + - A new page is added to shmem's radix-tree. | |
194 | + - A swp page is read. (move a charge from swap_cgroup to page_cgroup) | |
195 | + It's uncharged when | |
196 | + - A page is removed from radix-tree and not SwapCache. | |
197 | + - When SwapCache is removed, a charge is moved to swap_cgroup. | |
198 | + - When swp_entry's refcnt goes down to 0, a charge in swap_cgroup | |
199 | + disappears. | |
200 | + | |
201 | +7. Page Migration | |
202 | + One of the most complicated functions is page-migration-handler. | |
203 | + Memcg has 2 routines. Assume that we are migrating a page's contents | |
204 | + from OLDPAGE to NEWPAGE. | |
205 | + | |
206 | + The usual migration logic is: | |
207 | + (a) remove the page from LRU. | |
208 | + (b) allocate NEWPAGE (migration target) | |
209 | + (c) lock by lock_page(). | |
210 | + (d) unmap all mappings. | |
211 | + (e-1) If necessary, replace entry in radix-tree. | |
212 | + (e-2) move contents of a page. | |
213 | + (f) map all mappings again. | |
214 | + (g) pushback the page to LRU. | |
215 | + (-) OLDPAGE will be freed. | |
216 | + | |
217 | + Before (g), memcg should complete all necessary charge/uncharge to | |
218 | + NEWPAGE/OLDPAGE. | |
219 | + | |
220 | + The points are: | |
221 | + - If OLDPAGE is anonymous, all charges will be dropped at (d) because | |
222 | + try_to_unmap() drops all mapcounts and the page will not be | |
223 | + SwapCache. | |
224 | + | |
225 | + - If OLDPAGE is SwapCache, charges will be kept at (g) because | |
226 | + __delete_from_swap_cache() isn't called at (e-1). | |
227 | + | |
228 | + - If OLDPAGE is page-cache, charges will be kept at (g) because | |
229 | + __remove_from_page_cache() isn't called at (e-1). | |
230 | + | |
231 | + memcg provides the following hooks. | |
232 | + | |
233 | + - mem_cgroup_prepare_migration(OLDPAGE) | |
234 | + Called after (b) to account a charge (usage += PAGE_SIZE) against | |
235 | + memcg which OLDPAGE belongs to. | |
236 | + | |
237 | + - mem_cgroup_end_migration(OLDPAGE, NEWPAGE) | |
238 | + Called after (f) before (g). | |
239 | + If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already | |
240 | + charged, a charge by prepare_migration() is automatically canceled. | |
241 | + If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE. | |
242 | + | |
243 | + But zap_pte() (by exit or munmap) can be called during migration, so | |
244 | + we have to check whether OLDPAGE/NEWPAGE is a valid page after commit(). | |
245 | + | |
246 | +8. LRU | |
247 | + Each memcg has its own private LRU. For now, its handling is under the | |
248 | + global VM's control (i.e., it's handled under the global zone->lru_lock). | |
249 | + Almost all routines around memcg's LRU are called by the global LRU's | |
250 | + list management functions under zone->lru_lock. | |
251 | + | |
252 | + A special function is mem_cgroup_isolate_pages(). This scans the | |
253 | + memcg's private LRU and calls __isolate_lru_page() to extract a page | |
254 | + from the LRU. | |
255 | + (By __isolate_lru_page(), the page is removed from both the global and | |
256 | + the private LRU.) | |
257 | + | |
258 | + | |
259 | +9. Typical Tests. | |
260 | + | |
261 | + Tests for racy cases. | |
262 | + | |
263 | + 9.1 Small limit to memcg. | |
264 | + When you test racy cases, it's a good test to set memcg's limit | |
265 | + very small rather than in gigabytes. Many races were found in tests | |
266 | + under xKB or xxMB limits. | |
267 | + (Memory behavior under GB limits and under MB limits shows very | |
268 | + different situations.) | |
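   | + | |
   | + For instance, a minimal sketch of such a test (this is only an | |
   | + illustration; it assumes the memory controller is mounted at | |
   | + /opt/cgroup, and the group name and dd workload are arbitrary): | |
   | + | |
   | + # mkdir /opt/cgroup/race | |
   | + # echo 2M > /opt/cgroup/race/memory.limit_in_bytes | |
   | + # echo $$ > /opt/cgroup/race/tasks | |
   | + # dd if=/dev/zero of=/tmp/testfile bs=1M count=64 | |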
269 | + | |
270 | + 9.2 Shmem | |
271 | + Historically, memcg's shmem handling was poor and we saw a number | |
272 | + of troubles here. This is because shmem is page cache but can also be | |
273 | + SwapCache. Testing with shmem/tmpfs is always a good test. | |
274 | + | |
275 | + 9.3 Migration | |
276 | + For NUMA, migration is another special case. Cpusets are useful for | |
277 | + easy testing. The following is a sample script to do migration. | |
278 | + | |
279 | + mount -t cgroup -o cpuset none /opt/cpuset | |
280 | + | |
281 | + mkdir /opt/cpuset/01 | |
282 | + echo 1 > /opt/cpuset/01/cpuset.cpus | |
283 | + echo 0 > /opt/cpuset/01/cpuset.mems | |
284 | + echo 1 > /opt/cpuset/01/cpuset.memory_migrate | |
285 | + mkdir /opt/cpuset/02 | |
286 | + echo 1 > /opt/cpuset/02/cpuset.cpus | |
287 | + echo 1 > /opt/cpuset/02/cpuset.mems | |
288 | + echo 1 > /opt/cpuset/02/cpuset.memory_migrate | |
289 | + | |
290 | + With the above setup, when you move a task from 01 to 02, page migration | |
291 | + from node 0 to node 1 will occur. The following is a script to migrate | |
292 | + all tasks under a cpuset. | |
293 | + -- | |
294 | + move_task() | |
295 | + { | |
296 | + for pid in $1 | |
297 | + do | |
298 | + /bin/echo $pid >$2/tasks 2>/dev/null | |
299 | + echo -n $pid | |
300 | + echo -n " " | |
301 | + done | |
302 | + echo END | |
303 | + } | |
304 | + | |
305 | + G1_TASK=`cat ${G1}/tasks` | |
306 | + G2_TASK=`cat ${G2}/tasks` | |
307 | + move_task "${G1_TASK}" ${G2} & | |
308 | + -- | |
309 | + 9.4 Memory hotplug. | |
310 | + The memory hotplug test is another good test. | |
311 | + To offline memory, do the following: | |
312 | + # echo offline > /sys/devices/system/memory/memoryXXX/state | |
313 | + (XXX is the index of the memory section.) | |
314 | + This is an easy way to test page migration, too. | |
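   | + | |
   | + A minimal sketch of such a test (memory8 is only an example section | |
   | + index; list /sys/devices/system/memory/ first to see what exists): | |
   | + | |
   | + # ls /sys/devices/system/memory/ | |
   | + # cat /sys/devices/system/memory/memory8/state | |
   | + # echo offline > /sys/devices/system/memory/memory8/state | |
   | + # echo online > /sys/devices/system/memory/memory8/state | |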
315 | + | |
316 | + 9.5 mkdir/rmdir | |
317 | + When using hierarchy, mkdir/rmdir tests should be done. | |
318 | + Use tests like the following (see also the sketch after this section). | |
319 | + | |
320 | +	echo 1 >/opt/cgroup/01/memory.use_hierarchy | |
321 | +	mkdir /opt/cgroup/01/child_a | |
322 | +	mkdir /opt/cgroup/01/child_b | |
323 | + | |
324 | +	set a limit on 01. | |
325 | +	add a limit to 01/child_b. | |
326 | +	run jobs under child_a and child_b. | |
327 | + | |
328 | +	create/delete the following groups at random while the jobs are running: | |
329 | +	/opt/cgroup/01/child_a/child_aa | |
330 | +	/opt/cgroup/01/child_b/child_bb | |
331 | +	/opt/cgroup/01/child_c | |
332 | + | |
333 | + Running new jobs in a new group is also good. | |
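   | + | |
   | + A bash sketch of the random create/delete loop described above (the | |
   | + sleep interval and the use of $RANDOM are arbitrary choices, not part | |
   | + of the original recipe): | |
   | + | |
   | +	while true; do | |
   | +		mkdir /opt/cgroup/01/child_a/child_aa 2>/dev/null | |
   | +		mkdir /opt/cgroup/01/child_b/child_bb 2>/dev/null | |
   | +		mkdir /opt/cgroup/01/child_c 2>/dev/null | |
   | +		sleep $((RANDOM % 3)) | |
   | +		rmdir /opt/cgroup/01/child_a/child_aa 2>/dev/null | |
   | +		rmdir /opt/cgroup/01/child_b/child_bb 2>/dev/null | |
   | +		rmdir /opt/cgroup/01/child_c 2>/dev/null | |
   | +	done | |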
334 | + | |
335 | + 9.6 Mount with other subsystems. | |
336 | + Mounting with other subsystems is a good test because there are | |
337 | + races and lock dependencies with other cgroup subsystems. | |
338 | + | |
339 | + example) | |
340 | + # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices | |
341 | + | |
342 | + and do task moves, mkdir, rmdir, etc. under this. | |
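   | + | |
   | + One way to exercise the task-move part is sketched below (group names | |
   | + A and B are arbitrary; with cpuset co-mounted, cpuset.cpus and | |
   | + cpuset.mems must be populated before tasks can be attached): | |
   | + | |
   | + # mkdir /cgroup/A /cgroup/B | |
   | + # echo 0 > /cgroup/A/cpuset.cpus ; echo 0 > /cgroup/A/cpuset.mems | |
   | + # echo 0 > /cgroup/B/cpuset.cpus ; echo 0 > /cgroup/B/cpuset.mems | |
   | + # while true; do echo $$ > /cgroup/A/tasks; echo $$ > /cgroup/B/tasks; done | |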
Documentation/cgroups/memory.txt
1 | +Memory Resource Controller | |
2 | + | |
3 | +NOTE: The Memory Resource Controller is generically referred to as the | |
4 | +memory controller in this document. Do not confuse the memory controller | |
5 | +used here with the memory controller that is used in hardware. | |
6 | + | |
7 | +Salient features | |
8 | + | |
9 | +a. Enable control of both RSS (mapped) and Page Cache (unmapped) pages | |
10 | +b. The infrastructure allows easy addition of other types of memory to control | |
11 | +c. Provides *zero overhead* for non memory controller users | |
12 | +d. Provides a double LRU: global memory pressure causes reclaim from the | |
13 | + global LRU; a cgroup, on hitting a limit, reclaims from the per-cgroup | |
14 | + LRU | |
15 | + | |
16 | +NOTE: Swap Cache (unmapped) is not accounted now. | |
17 | + | |
18 | +Benefits and Purpose of the memory controller | |
19 | + | |
20 | +The memory controller isolates the memory behaviour of a group of tasks | |
21 | +from the rest of the system. The article on LWN [12] mentions some probable | |
22 | +uses of the memory controller. The memory controller can be used to | |
23 | + | |
24 | +a. Isolate an application or a group of applications | |
25 | + Memory hungry applications can be isolated and limited to a smaller | |
26 | + amount of memory. | |
27 | +b. Create a cgroup with a limited amount of memory; this can be used | |
28 | + as a good alternative to booting with mem=XXXX. | |
29 | +c. Virtualization solutions can control the amount of memory they want | |
30 | + to assign to a virtual machine instance. | |
31 | +d. A CD/DVD burner could control the amount of memory used by the | |
32 | + rest of the system to ensure that burning does not fail due to lack | |
33 | + of available memory. | |
34 | +e. There are several other use cases, find one or use the controller just | |
35 | + for fun (to learn and hack on the VM subsystem). | |
36 | + | |
37 | +1. History | |
38 | + | |
39 | +The memory controller has a long history. A request for comments for the memory | |
40 | +controller was posted by Balbir Singh [1]. At the time the RFC was posted | |
41 | +there were several implementations for memory control. The goal of the | |
42 | +RFC was to build consensus and agreement for the minimal features required | |
43 | +for memory control. The first RSS controller was posted by Balbir Singh[2] | |
44 | +in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the | |
45 | +RSS controller. At OLS, at the resource management BoF, everyone suggested | |
46 | +that we handle both page cache and RSS together. Another request was raised | |
47 | +to allow user space handling of OOM. The current memory controller is | |
48 | +at version 6; it combines both mapped (RSS) and unmapped Page | |
49 | +Cache Control [11]. | |
50 | + | |
51 | +2. Memory Control | |
52 | + | |
53 | +Memory is a unique resource in the sense that it is present in a limited | |
54 | +amount. If a task requires a lot of CPU processing, the task can spread | |
55 | +its processing over a period of hours, days, months or years, but with | |
56 | +memory, the same physical memory needs to be reused to accomplish the task. | |
57 | + | |
58 | +The memory controller implementation has been divided into phases. These | |
59 | +are: | |
60 | + | |
61 | +1. Memory controller | |
62 | +2. mlock(2) controller | |
63 | +3. Kernel user memory accounting and slab control | |
64 | +4. user mappings length controller | |
65 | + | |
66 | +The memory controller is the first controller developed. | |
67 | + | |
68 | +2.1. Design | |
69 | + | |
70 | +The core of the design is a counter called the res_counter. The res_counter | |
71 | +tracks the current memory usage and limit of the group of processes associated | |
72 | +with the controller. Each cgroup has a memory controller specific data | |
73 | +structure (mem_cgroup) associated with it. | |
74 | + | |
75 | +2.2. Accounting | |
76 | + | |
77 | + +--------------------+ | |
78 | + | mem_cgroup | | |
79 | + | (res_counter) | | |
80 | + +--------------------+ | |
81 | + / ^ \ | |
82 | + / | \ | |
83 | + +---------------+ | +---------------+ | |
84 | + | mm_struct | |.... | mm_struct | | |
85 | + | | | | | | |
86 | + +---------------+ | +---------------+ | |
87 | + | | |
88 | + + --------------+ | |
89 | + | | |
90 | + +---------------+ +------+--------+ | |
91 | + | page +----------> page_cgroup| | |
92 | + | | | | | |
93 | + +---------------+ +---------------+ | |
94 | + | |
95 | + (Figure 1: Hierarchy of Accounting) | |
96 | + | |
97 | + | |
98 | +Figure 1 shows the important aspects of the controller | |
99 | + | |
100 | +1. Accounting happens per cgroup | |
101 | +2. Each mm_struct knows about which cgroup it belongs to | |
102 | +3. Each page has a pointer to the page_cgroup, which in turn knows the | |
103 | + cgroup it belongs to | |
104 | + | |
105 | +The accounting is done as follows: mem_cgroup_charge() is invoked to setup | |
106 | +the necessary data structures and check if the cgroup that is being charged | |
107 | +is over its limit. If it is then reclaim is invoked on the cgroup. | |
108 | +More details can be found in the reclaim section of this document. | |
109 | +If everything goes well, a page meta-data-structure called page_cgroup is | |
110 | +allocated and associated with the page. This routine also adds the page to | |
111 | +the per cgroup LRU. | |
112 | + | |
113 | +2.2.1 Accounting details | |
114 | + | |
115 | +All mapped anon pages (RSS) and cache pages (Page Cache) are accounted. | |
116 | +(Some pages which are never reclaimable and will not be on the global LRU | |
117 | + are not accounted. We just account pages under usual VM management.) | |
118 | + | |
119 | +RSS pages are accounted at page_fault unless they've already been accounted | |
120 | +for earlier. A file page will be accounted for as Page Cache when it's | |
121 | +inserted into inode (radix-tree). While it's mapped into the page tables of | |
122 | +processes, duplicate accounting is carefully avoided. | |
123 | + | |
124 | +An RSS page is unaccounted when it's fully unmapped. A PageCache page is | |
125 | +unaccounted when it's removed from radix-tree. | |
126 | + | |
127 | +At page migration, accounting information is kept. | |
128 | + | |
129 | +Note: we just account pages on the LRU, because our purpose is to control the | |
130 | +amount of used pages; pages not on the LRU tend to be out of the VM's control. | |
131 | + | |
132 | +2.3 Shared Page Accounting | |
133 | + | |
134 | +Shared pages are accounted on the basis of the first touch approach. The | |
135 | +cgroup that first touches a page is accounted for the page. The principle | |
136 | +behind this approach is that a cgroup that aggressively uses a shared | |
137 | +page will eventually get charged for it (once it is uncharged from | |
138 | +the cgroup that brought it in -- this will happen on memory pressure). | |
139 | + | |
140 | +Exception: If CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not used: | |
141 | +when you do swapoff and thereby force swapped-out pages of shmem (tmpfs) | |
142 | +back into memory, charges for those pages are accounted against the | |
143 | +caller of swapoff rather than the users of shmem. | |
144 | + | |
145 | + | |
146 | +2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP) | |
147 | +The swap extension allows you to record charges for swap. A swapped-in page | |
148 | +is charged back to the cgroup that originally charged it, when possible. | |
149 | + | |
150 | +When swap is accounted, the following files are added: | |
151 | + - memory.memsw.usage_in_bytes. | |
152 | + - memory.memsw.limit_in_bytes. | |
153 | + | |
154 | +Usage of mem+swap is limited by memsw.limit_in_bytes. | |
155 | + | |
156 | +Note: why 'mem+swap' rather than just swap? | |
157 | +The global LRU (kswapd) can swap out arbitrary pages. Swapping out means | |
158 | +moving an account from memory to swap; there is no change in the usage of | |
159 | +mem+swap. | |
160 | + | |
161 | +In other words, when we want to limit the usage of swap without affecting | |
162 | +the global LRU, a mem+swap limit is better, from the OS's point of view, | |
163 | +than just limiting swap. | |
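   | + | |
   | +For example (a sketch only; it assumes a group "0" created as in the User | |
   | +Interface section below, and memsw.limit_in_bytes may not be set below | |
   | +memory.limit_in_bytes): | |
   | + | |
   | +# echo 50M > /cgroups/0/memory.limit_in_bytes | |
   | +# echo 80M > /cgroups/0/memory.memsw.limit_in_bytes | |
   | +# cat /cgroups/0/memory.memsw.usage_in_bytes | |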
164 | + | |
165 | +2.5 Reclaim | |
166 | + | |
167 | +Each cgroup maintains a per cgroup LRU that consists of an active | |
168 | +and inactive list. When a cgroup goes over its limit, we first try | |
169 | +to reclaim memory from the cgroup so as to make space for the new | |
170 | +pages that the cgroup has touched. If the reclaim is unsuccessful, | |
171 | +an OOM routine is invoked to select and kill the bulkiest task in the | |
172 | +cgroup. | |
173 | + | |
174 | +The reclaim algorithm has not been modified for cgroups, except that | |
175 | +pages that are selected for reclaiming come from the per cgroup LRU | |
176 | +list. | |
177 | + | |
178 | +2.6 Locking | |
179 | + | |
180 | +The memory controller uses the following hierarchy | |
181 | + | |
182 | +1. zone->lru_lock is used for selecting pages to be isolated | |
183 | +2. mem->per_zone->lru_lock protects the per cgroup LRU (per zone) | |
184 | +3. lock_page_cgroup() is used to protect page->page_cgroup | |
185 | + | |
186 | +3. User Interface | |
187 | + | |
188 | +0. Configuration | |
189 | + | |
190 | +a. Enable CONFIG_CGROUPS | |
191 | +b. Enable CONFIG_RESOURCE_COUNTERS | |
192 | +c. Enable CONFIG_CGROUP_MEM_RES_CTLR | |
193 | + | |
194 | +1. Prepare the cgroups | |
195 | +# mkdir -p /cgroups | |
196 | +# mount -t cgroup none /cgroups -o memory | |
197 | + | |
198 | +2. Make the new group and move bash into it | |
199 | +# mkdir /cgroups/0 | |
200 | +# echo $$ > /cgroups/0/tasks | |
201 | + | |
202 | +Since we are now in the 0 cgroup, | |
203 | +we can alter the memory limit: | |
204 | +# echo 4M > /cgroups/0/memory.limit_in_bytes | |
205 | + | |
206 | +NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, | |
207 | +mega or gigabytes. | |
208 | + | |
209 | +# cat /cgroups/0/memory.limit_in_bytes | |
210 | +4194304 | |
211 | + | |
212 | +NOTE: The interface has now changed to display the usage in bytes | |
213 | +instead of pages | |
214 | + | |
215 | +We can check the usage: | |
216 | +# cat /cgroups/0/memory.usage_in_bytes | |
217 | +1216512 | |
218 | + | |
219 | +A successful write to this file does not guarantee that the limit was set | |
220 | +to exactly the value written. This can be due to a number of factors, such | |
221 | +as rounding up to page boundaries or the total availability of memory on | |
222 | +the system. The user is required to re-read this file after a write to | |
223 | +see the value actually committed by the kernel. | |
224 | + | |
225 | +# echo 1 > memory.limit_in_bytes | |
226 | +# cat memory.limit_in_bytes | |
227 | +4096 | |
228 | + | |
229 | +The memory.failcnt field gives the number of times that the cgroup limit was | |
230 | +exceeded. | |
231 | + | |
232 | +The memory.stat file gives accounting information. Currently, the number of | |
233 | +cache, RSS and active/inactive pages is shown. | |
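   | + | |
   | +Both files can simply be read, for example (a sketch; the group path follows | |
   | +the examples above): | |
   | + | |
   | +# cat /cgroups/0/memory.failcnt | |
   | +# cat /cgroups/0/memory.stat | |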
234 | + | |
235 | +4. Testing | |
236 | + | |
237 | +Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11]. | |
238 | +Apart from that v6 has been tested with several applications and regular | |
239 | +daily use. The controller has also been tested on the PPC64, x86_64 and | |
240 | +UML platforms. | |
241 | + | |
242 | +4.1 Troubleshooting | |
243 | + | |
244 | +Sometimes a user might find that the application under a cgroup is | |
245 | +terminated. There are several causes for this: | |
246 | + | |
247 | +1. The cgroup limit is too low (just too low to do anything useful) | |
248 | +2. The user is using anonymous memory and swap is turned off or too low | |
249 | + | |
250 | +A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of | |
251 | +some of the pages cached in the cgroup (page cache pages). | |
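   | + | |
   | +That is: | |
   | + | |
   | +# sync | |
   | +# echo 1 > /proc/sys/vm/drop_caches | |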
252 | + | |
253 | +4.2 Task migration | |
254 | + | |
255 | +When a task migrates from one cgroup to another, its charge is not | |
256 | +carried forward. The pages allocated from the original cgroup still | |
257 | +remain charged to it; the charge is dropped when the page is freed or | |
258 | +reclaimed. | |
259 | + | |
260 | +4.3 Removing a cgroup | |
261 | + | |
262 | +A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a | |
263 | +cgroup might have some charge associated with it, even though all | |
264 | +tasks have migrated away from it. | |
265 | +Such charges are freed (by default) or moved to its parent. When moved, | |
266 | +both RSS and CACHES are moved to the parent. | |
267 | +If both are busy, rmdir() returns -EBUSY. See also section 5.1. | |
268 | + | |
269 | +Charges recorded in swap information are not updated when a cgroup is | |
270 | +removed. The recorded information is discarded and a cgroup which uses | |
271 | +swap (swapcache) later will be charged as the new owner of it. | |
272 | + | |
273 | + | |
274 | +5. Misc. interfaces. | |
275 | + | |
276 | +5.1 force_empty | |
277 | + The memory.force_empty interface is provided to make a cgroup's memory | |
278 | + usage empty. You can use this interface only when the cgroup has no tasks. | |
279 | + When anything is written to this file, | |
280 | + | |
281 | + # echo 0 > memory.force_empty | |
282 | + | |
283 | + almost all pages tracked by this memcg will be unmapped and freed. Some | |
284 | + pages cannot be freed because they are locked or in use; such pages are | |
285 | + moved to the parent and this cgroup will become empty. This may return | |
286 | + -EBUSY if the cgroup is too busy. | |
287 | + | |
288 | + The typical use case for this interface is calling it before rmdir(). | |
289 | + Because rmdir() moves all pages to the parent, some out-of-use page caches | |
290 | + could otherwise be moved to the parent; force_empty is useful to avoid that. | |
291 | + | |
292 | +5.2 stat file | |
293 | + The memory.stat file includes the following statistics (currently): | |
294 | + cache - # of pages from page cache and shmem. | |
295 | + rss - # of pages from anonymous memory. | |
296 | + pgpgin - # of charging events. | |
297 | + pgpgout - # of uncharging events. | |
298 | + active_anon - # of pages on the active LRU of anon and shmem. | |
299 | + inactive_anon - # of pages on the inactive LRU of anon and shmem. | |
300 | + active_file - # of pages on the active LRU of file cache. | |
301 | + inactive_file - # of pages on the inactive LRU of file cache. | |
302 | + unevictable - # of pages which cannot be reclaimed (mlocked etc.). | |
303 | + | |
304 | + The entries below depend on CONFIG_DEBUG_VM. | |
305 | + inactive_ratio - VM internal parameter. (see mm/page_alloc.c) | |
306 | + recent_rotated_anon - VM internal parameter. (see mm/vmscan.c) | |
307 | + recent_rotated_file - VM internal parameter. (see mm/vmscan.c) | |
308 | + recent_scanned_anon - VM internal parameter. (see mm/vmscan.c) | |
309 | + recent_scanned_file - VM internal parameter. (see mm/vmscan.c) | |
310 | + | |
311 | + Memo: | |
312 | + recent_rotated means the recent frequency of LRU rotation. | |
313 | + recent_scanned means the recent # of scans of the LRU. | |
314 | + These are shown for easier debugging; please see the code for details. | |
315 | + | |
316 | + | |
317 | +5.3 swappiness | |
318 | + Similar to /proc/sys/vm/swappiness, but only affecting this hierarchy of | |
319 | + groups. | |
320 | + The swappiness of the following cgroups can't be changed: | |
321 | + - the root cgroup (it uses /proc/sys/vm/swappiness). | |
322 | + - a cgroup which uses hierarchy and has a child cgroup. | |
323 | + - a cgroup which uses hierarchy and is not the root of the hierarchy. | |
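   | + | |
   | + For a leaf group that satisfies the conditions above, for example (the | |
   | + group path and value are only illustrative): | |
   | + | |
   | + # echo 30 > /cgroups/0/memory.swappiness | |
   | + # cat /cgroups/0/memory.swappiness | |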
324 | + | |
325 | + | |
326 | +6. Hierarchy support | |
327 | + | |
328 | +The memory controller supports a deep hierarchy and hierarchical accounting. | |
329 | +The hierarchy is created by creating the appropriate cgroups in the | |
330 | +cgroup filesystem. Consider for example, the following cgroup filesystem | |
331 | +hierarchy | |
332 | + | |
333 | + root | |
334 | + / | \ | |
335 | + / | \ | |
336 | + a b c | |
337 | + | \ | |
338 | + | \ | |
339 | + d e | |
340 | + | |
341 | +In the diagram above, with hierarchical accounting enabled, all memory | |
342 | +usage of e is accounted to its ancestors up to the root (i.e., c and root) | |
343 | +that have memory.use_hierarchy enabled. If one of the ancestors goes over its | |
344 | +limit, the reclaim algorithm reclaims from the tasks in that ancestor and the | |
345 | +children of that ancestor. | |
346 | + | |
347 | +6.1 Enabling hierarchical accounting and reclaim | |
348 | + | |
349 | +The memory controller disables the hierarchy feature by default. Support | |
350 | +can be enabled by writing 1 to the memory.use_hierarchy file of the root cgroup: | |
351 | + | |
352 | +# echo 1 > memory.use_hierarchy | |
353 | + | |
354 | +The feature can be disabled by | |
355 | + | |
356 | +# echo 0 > memory.use_hierarchy | |
357 | + | |
358 | +NOTE1: Enabling/disabling will fail if the cgroup already has other | |
359 | +cgroups created below it. | |
360 | + | |
361 | +NOTE2: This feature can be enabled/disabled per subtree. | |
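   | + | |
   | +For example, the c/d/e branch of the diagram above could be set up like | |
   | +this (paths and the limit value are only illustrative; per NOTE1, | |
   | +use_hierarchy is enabled before the children are created): | |
   | + | |
   | +# mkdir /cgroups/c | |
   | +# echo 1 > /cgroups/c/memory.use_hierarchy | |
   | +# echo 100M > /cgroups/c/memory.limit_in_bytes | |
   | +# mkdir /cgroups/c/d /cgroups/c/e | |
   | +# echo $$ > /cgroups/c/e/tasks | |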
362 | + | |
363 | +7. TODO | |
364 | + | |
365 | +1. Add support for accounting huge pages (as a separate controller) | |
366 | +2. Make per-cgroup scanner reclaim not-shared pages first | |
367 | +3. Teach controller to account for shared-pages | |
368 | +4. Start reclamation in the background when the limit is | |
369 | + not yet hit but the usage is getting closer | |
370 | + | |
371 | +Summary | |
372 | + | |
373 | +Overall, the memory controller has been a stable controller and has been | |
374 | +commented and discussed quite extensively in the community. | |
375 | + | |
376 | +References | |
377 | + | |
378 | +1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/ | |
379 | +2. Singh, Balbir. Memory Controller (RSS Control), | |
380 | + http://lwn.net/Articles/222762/ | |
381 | +3. Emelianov, Pavel. Resource controllers based on process cgroups | |
382 | + http://lkml.org/lkml/2007/3/6/198 | |
383 | +4. Emelianov, Pavel. RSS controller based on process cgroups (v2) | |
384 | + http://lkml.org/lkml/2007/4/9/78 | |
385 | +5. Emelianov, Pavel. RSS controller based on process cgroups (v3) | |
386 | + http://lkml.org/lkml/2007/5/30/244 | |
387 | +6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/ | |
388 | +7. Vaidyanathan, Srinivasan, Control Groups: Pagecache accounting and control | |
389 | + subsystem (v3), http://lwn.net/Articles/235534/ | |
390 | +8. Singh, Balbir. RSS controller v2 test results (lmbench), | |
391 | + http://lkml.org/lkml/2007/5/17/232 | |
392 | +9. Singh, Balbir. RSS controller v2 AIM9 results | |
393 | + http://lkml.org/lkml/2007/5/18/1 | |
394 | +10. Singh, Balbir. Memory controller v6 test results, | |
395 | + http://lkml.org/lkml/2007/8/19/36 | |
396 | +11. Singh, Balbir. Memory controller introduction (v6), | |
397 | + http://lkml.org/lkml/2007/8/17/69 | |
398 | +12. Corbet, Jonathan, Controlling memory use in cgroups, | |
399 | + http://lwn.net/Articles/243795/ |
Documentation/cgroups/resource_counter.txt
1 | + | |
2 | + The Resource Counter | |
3 | + | |
4 | +The resource counter, declared at include/linux/res_counter.h, | |
5 | +is supposed to facilitate the resource management by controllers | |
6 | +by providing common stuff for accounting. | |
7 | + | |
8 | +This "stuff" includes the res_counter structure and routines | |
9 | +to work with it. | |
10 | + | |
11 | + | |
12 | + | |
13 | +1. Crucial parts of the res_counter structure | |
14 | + | |
15 | + a. unsigned long long usage | |
16 | + | |
17 | + The usage value shows the amount of a resource that is consumed | |
18 | + by a group at a given time. The units of measurement should be | |
19 | + determined by the controller that uses this counter. E.g. it can | |
20 | + be bytes, items or any other unit the controller operates on. | |
21 | + | |
22 | + b. unsigned long long max_usage | |
23 | + | |
24 | + The maximal value of the usage over time. | |
25 | + | |
26 | + This value is useful when gathering statistical information about | |
27 | + the particular group, as it shows the actual resource requirements | |
28 | + for a particular group, not just some usage snapshot. | |
29 | + | |
30 | + c. unsigned long long limit | |
31 | + | |
32 | +	The maximal amount of resource that the group is allowed to consume. | |
33 | +	If the group requests more resources, so that the usage value | |
34 | +	would exceed the limit, the resource allocation is rejected (see | |
35 | +	the next section). | |
36 | + | |
37 | + d. unsigned long long failcnt | |
38 | + | |
39 | + The failcnt stands for "failures counter". This is the number of | |
40 | + resource allocation attempts that failed. | |
41 | + | |
42 | + e. spinlock_t lock | |
43 | + | |
44 | + Protects changes of the above values. | |
45 | + | |
46 | + | |
47 | + | |
48 | +2. Basic accounting routines | |
49 | + | |
50 | + a. void res_counter_init(struct res_counter *rc) | |
51 | + | |
52 | + Initializes the resource counter. As usual, should be the first | |
53 | + routine called for a new counter. | |
54 | + | |
55 | + b. int res_counter_charge[_locked] | |
56 | + (struct res_counter *rc, unsigned long val) | |
57 | + | |
58 | + When a resource is about to be allocated it has to be accounted | |
59 | + with the appropriate resource counter (controller should determine | |
60 | + which one to use on its own). This operation is called "charging". | |
61 | + | |
62 | +	It is not very important which operation - resource allocation | |
63 | +	or charging - is performed first, but: | |
64 | +	  * if the allocation is performed first, this may create a | |
65 | +	    temporary resource over-usage by the time the resource counter is | |
66 | +	    charged; | |
67 | +	  * if the charging is performed first, then it should be uncharged | |
68 | +	    on the error path (if that path is taken). | |
69 | + | |
70 | + c. void res_counter_uncharge[_locked] | |
71 | + (struct res_counter *rc, unsigned long val) | |
72 | + | |
73 | + When a resource is released (freed) it should be de-accounted | |
74 | + from the resource counter it was accounted to. This is called | |
75 | + "uncharging". | |
76 | + | |
77 | + The _locked routines imply that the res_counter->lock is taken. | |
78 | + | |
79 | + | |
80 | + 2.1 Other accounting routines | |
81 | + | |
82 | + There are more routines that may help you with common needs, like | |
83 | + checking whether the limit is reached or resetting the max_usage | |
84 | + value. They are all declared in include/linux/res_counter.h. | |
85 | + | |
86 | + | |
87 | + | |
88 | +3. Analyzing the resource counter registrations | |
89 | + | |
90 | + a. If the failcnt value constantly grows, this means that the counter's | |
91 | + limit is too tight. Either the group is misbehaving and consumes too | |
92 | + many resources, or the configuration is not suitable for the group | |
93 | + and the limit should be increased. | |
94 | + | |
95 | + b. The max_usage value can be used to quickly tune the group. One may | |
96 | + set the limits to maximal values and either load the container with | |
97 | + a common pattern or leave one for a while. After this the max_usage | |
98 | + value shows the amount of memory the container would require during | |
99 | + its common activity. | |
100 | + | |
101 | + Setting the limit a bit above this value gives a pretty good | |
102 | + configuration that works in most of the cases. | |
103 | + | |
104 | + c. If the max_usage is much less than the limit, but the failcnt value | |
105 | + is growing, then the group tries to allocate a big chunk of resource | |
106 | + at once. | |
107 | + | |
108 | + d. If the max_usage is much less than the limit, but the failcnt value | |
109 | +    is 0, then this group has been given a higher limit than it | |
110 | +    requires. It is better to lower the limit a bit, leaving more | |
111 | +    resource for other groups. | |
112 | + | |
113 | + | |
114 | + | |
115 | +4. Communication with the control groups subsystem (cgroups) | |
116 | + | |
117 | +All the resource controllers that are using cgroups and resource counters | |
118 | +should provide files (in the cgroup filesystem) to work with the resource | |
119 | +counter fields. They are recommended to adhere to the following rules: | |
120 | + | |
121 | + a. File names | |
122 | + | |
123 | + Field name File name | |
124 | + --------------------------------------------------- | |
125 | + usage usage_in_<unit_of_measurement> | |
126 | + max_usage max_usage_in_<unit_of_measurement> | |
127 | + limit limit_in_<unit_of_measurement> | |
128 | + failcnt failcnt | |
129 | + lock no file :) | |
130 | + | |
131 | + b. Reading from file should show the corresponding field value in the | |
132 | + appropriate format. | |
133 | + | |
134 | + c. Writing to file | |
135 | + | |
136 | + Field Expected behavior | |
137 | + ---------------------------------- | |
138 | + usage prohibited | |
139 | + max_usage reset to usage | |
140 | + limit set the limit | |
141 | + failcnt reset to zero | |
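   | + | |
   | + For the memory controller, for example, these rules yield | |
   | + memory.usage_in_bytes, memory.max_usage_in_bytes, memory.limit_in_bytes | |
   | + and memory.failcnt. Per the write rules above, writing to the max_usage | |
   | + and failcnt files resets them; a sketch (the group path follows the | |
   | + examples in memory.txt): | |
   | + | |
   | + # cat /cgroups/0/memory.usage_in_bytes | |
   | + # echo 0 > /cgroups/0/memory.max_usage_in_bytes | |
   | + # echo 0 > /cgroups/0/memory.failcnt | |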
142 | + | |
143 | + | |
144 | + | |
145 | +5. Usage example | |
146 | + | |
147 | + a. Declare a task group (take a look at cgroups subsystem for this) and | |
148 | + fold a res_counter into it | |
149 | + | |
150 | + struct my_group { | |
151 | + struct res_counter res; | |
152 | + | |
153 | + <other fields> | |
154 | +	}; | |
155 | + | |
156 | + b. Put hooks in resource allocation/release paths | |
157 | + | |
158 | + int alloc_something(...) | |
159 | + { | |
160 | + if (res_counter_charge(res_counter_ptr, amount) < 0) | |
161 | + return -ENOMEM; | |
162 | + | |
163 | + <allocate the resource and return to the caller> | |
164 | + } | |
165 | + | |
166 | + void release_something(...) | |
167 | + { | |
168 | + res_counter_uncharge(res_counter_ptr, amount); | |
169 | + | |
170 | + <release the resource> | |
171 | + } | |
172 | + | |
173 | + In order to keep the usage value self-consistent, both the | |
174 | + "res_counter_ptr" and the "amount" in release_something() should be | |
175 | + the same as they were in the alloc_something() when the releasing | |
176 | + resource was allocated. | |
177 | + | |
178 | + c. Provide a way to read res_counter values and set them (the cgroups | |
179 | +    subsystem can still help with this). | |
180 | + | |
181 | + d. Compile and run :) | |
Documentation/controllers/cpuacct.txt
1 | -CPU Accounting Controller | |
2 | -------------------------- | |
3 | - | |
4 | -The CPU accounting controller is used to group tasks using cgroups and | |
5 | -account the CPU usage of these groups of tasks. | |
6 | - | |
7 | -The CPU accounting controller supports multi-hierarchy groups. An accounting | |
8 | -group accumulates the CPU usage of all of its child groups and the tasks | |
9 | -directly present in its group. | |
10 | - | |
11 | -Accounting groups can be created by first mounting the cgroup filesystem. | |
12 | - | |
13 | -# mkdir /cgroups | |
14 | -# mount -t cgroup -ocpuacct none /cgroups | |
15 | - | |
16 | -With the above step, the initial or the parent accounting group | |
17 | -becomes visible at /cgroups. At bootup, this group includes all the | |
18 | -tasks in the system. /cgroups/tasks lists the tasks in this cgroup. | |
19 | -/cgroups/cpuacct.usage gives the CPU time (in nanoseconds) obtained by | |
20 | -this group which is essentially the CPU time obtained by all the tasks | |
21 | -in the system. | |
22 | - | |
23 | -New accounting groups can be created under the parent group /cgroups. | |
24 | - | |
25 | -# cd /cgroups | |
26 | -# mkdir g1 | |
27 | -# echo $$ > g1 | |
28 | - | |
29 | -The above steps create a new group g1 and move the current shell | |
30 | -process (bash) into it. CPU time consumed by this bash and its children | |
31 | -can be obtained from g1/cpuacct.usage and the same is accumulated in | |
32 | -/cgroups/cpuacct.usage also. |
Documentation/controllers/devices.txt
1 | -Device Whitelist Controller | |
2 | - | |
3 | -1. Description: | |
4 | - | |
5 | -Implement a cgroup to track and enforce open and mknod restrictions | |
6 | -on device files. A device cgroup associates a device access | |
7 | -whitelist with each cgroup. A whitelist entry has 4 fields. | |
8 | -'type' is a (all), c (char), or b (block). 'all' means it applies | |
9 | -to all types and all major and minor numbers. Major and minor are | |
10 | -either an integer or * for all. Access is a composition of r | |
11 | -(read), w (write), and m (mknod). | |
12 | - | |
13 | -The root device cgroup starts with rwm to 'all'. A child device | |
14 | -cgroup gets a copy of the parent. Administrators can then remove | |
15 | -devices from the whitelist or add new entries. A child cgroup can | |
16 | -never receive a device access which is denied by its parent. However | |
17 | -when a device access is removed from a parent it will not also be | |
18 | -removed from the child(ren). | |
19 | - | |
20 | -2. User Interface | |
21 | - | |
22 | -An entry is added using devices.allow, and removed using | |
23 | -devices.deny. For instance | |
24 | - | |
25 | - echo 'c 1:3 mr' > /cgroups/1/devices.allow | |
26 | - | |
27 | -allows cgroup 1 to read and mknod the device usually known as | |
28 | -/dev/null. Doing | |
29 | - | |
30 | - echo a > /cgroups/1/devices.deny | |
31 | - | |
32 | -will remove the default 'a *:* rwm' entry. Doing | |
33 | - | |
34 | - echo a > /cgroups/1/devices.allow | |
35 | - | |
36 | -will add the 'a *:* rwm' entry to the whitelist. | |
37 | - | |
38 | -3. Security | |
39 | - | |
40 | -Any task can move itself between cgroups. This clearly won't | |
41 | -suffice, but we can decide the best way to adequately restrict | |
42 | -movement as people get some experience with this. We may just want | |
43 | -to require CAP_SYS_ADMIN, which at least is a separate bit from | |
44 | -CAP_MKNOD. We may want to just refuse moving to a cgroup which | |
45 | -isn't a descendent of the current one. Or we may want to use | |
46 | -CAP_MAC_ADMIN, since we really are trying to lock down root. | |
47 | - | |
48 | -CAP_SYS_ADMIN is needed to modify the whitelist or move another | |
49 | -task to a new cgroup. (Again we'll probably want to change that). | |
50 | - | |
51 | -A cgroup may not be granted more permissions than the cgroup's | |
52 | -parent has. |
Documentation/controllers/memcg_test.txt
1 | -Memory Resource Controller(Memcg) Implementation Memo. | |
2 | -Last Updated: 2008/12/15 | |
3 | -Base Kernel Version: based on 2.6.28-rc8-mm. | |
4 | - | |
5 | -Because VM is getting complex (one of reasons is memcg...), memcg's behavior | |
6 | -is complex. This is a document for memcg's internal behavior. | |
7 | -Please note that implementation details can be changed. | |
8 | - | |
9 | -(*) Topics on API should be in Documentation/controllers/memory.txt) | |
10 | - | |
11 | -0. How to record usage ? | |
12 | - 2 objects are used. | |
13 | - | |
14 | - page_cgroup ....an object per page. | |
15 | - Allocated at boot or memory hotplug. Freed at memory hot removal. | |
16 | - | |
17 | - swap_cgroup ... an entry per swp_entry. | |
18 | - Allocated at swapon(). Freed at swapoff(). | |
19 | - | |
20 | - The page_cgroup has USED bit and double count against a page_cgroup never | |
21 | - occurs. swap_cgroup is used only when a charged page is swapped-out. | |
22 | - | |
23 | -1. Charge | |
24 | - | |
25 | - a page/swp_entry may be charged (usage += PAGE_SIZE) at | |
26 | - | |
27 | - mem_cgroup_newpage_charge() | |
28 | - Called at new page fault and Copy-On-Write. | |
29 | - | |
30 | - mem_cgroup_try_charge_swapin() | |
31 | - Called at do_swap_page() (page fault on swap entry) and swapoff. | |
32 | - Followed by charge-commit-cancel protocol. (With swap accounting) | |
33 | - At commit, a charge recorded in swap_cgroup is removed. | |
34 | - | |
35 | - mem_cgroup_cache_charge() | |
36 | - Called at add_to_page_cache() | |
37 | - | |
38 | - mem_cgroup_cache_charge_swapin() | |
39 | - Called at shmem's swapin. | |
40 | - | |
41 | - mem_cgroup_prepare_migration() | |
42 | - Called before migration. "extra" charge is done and followed by | |
43 | - charge-commit-cancel protocol. | |
44 | - At commit, charge against oldpage or newpage will be committed. | |
45 | - | |
46 | -2. Uncharge | |
47 | - a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by | |
48 | - | |
49 | - mem_cgroup_uncharge_page() | |
50 | - Called when an anonymous page is fully unmapped. I.e., mapcount goes | |
51 | - to 0. If the page is SwapCache, uncharge is delayed until | |
52 | - mem_cgroup_uncharge_swapcache(). | |
53 | - | |
54 | - mem_cgroup_uncharge_cache_page() | |
55 | - Called when a page-cache is deleted from radix-tree. If the page is | |
56 | - SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache(). | |
57 | - | |
58 | - mem_cgroup_uncharge_swapcache() | |
59 | - Called when SwapCache is removed from radix-tree. The charge itself | |
60 | - is moved to swap_cgroup. (If mem+swap controller is disabled, no | |
61 | - charge to swap occurs.) | |
62 | - | |
63 | - mem_cgroup_uncharge_swap() | |
64 | - Called when swp_entry's refcnt goes down to 0. A charge against swap | |
65 | - disappears. | |
66 | - | |
67 | - mem_cgroup_end_migration(old, new) | |
68 | - At success of migration old is uncharged (if necessary), a charge | |
69 | - to new page is committed. At failure, charge to old page is committed. | |
70 | - | |
71 | -3. charge-commit-cancel | |
72 | - In some case, we can't know this "charge" is valid or not at charging | |
73 | - (because of races). | |
74 | - To handle such case, there are charge-commit-cancel functions. | |
75 | - mem_cgroup_try_charge_XXX | |
76 | - mem_cgroup_commit_charge_XXX | |
77 | - mem_cgroup_cancel_charge_XXX | |
78 | - these are used in swap-in and migration. | |
79 | - | |
80 | - At try_charge(), there are no flags to say "this page is charged". | |
81 | - at this point, usage += PAGE_SIZE. | |
82 | - | |
83 | - At commit(), the function checks the page should be charged or not | |
84 | - and set flags or avoid charging.(usage -= PAGE_SIZE) | |
85 | - | |
86 | - At cancel(), simply usage -= PAGE_SIZE. | |
87 | - | |
88 | -Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. | |
89 | - | |
90 | -4. Anonymous | |
91 | - Anonymous page is newly allocated at | |
92 | - - page fault into MAP_ANONYMOUS mapping. | |
93 | - - Copy-On-Write. | |
94 | - It is charged right after it's allocated before doing any page table | |
95 | - related operations. Of course, it's uncharged when another page is used | |
96 | - for the fault address. | |
97 | - | |
98 | - At freeing anonymous page (by exit() or munmap()), zap_pte() is called | |
99 | - and pages for ptes are freed one by one.(see mm/memory.c). Uncharges | |
100 | - are done at page_remove_rmap() when page_mapcount() goes down to 0. | |
101 | - | |
102 | - Another page freeing is by page-reclaim (vmscan.c) and anonymous | |
103 | - pages are swapped out. In this case, the page is marked as | |
104 | - PageSwapCache(). uncharge() routine doesn't uncharge the page marked | |
105 | - as SwapCache(). It's delayed until __delete_from_swap_cache(). | |
106 | - | |
107 | - 4.1 Swap-in. | |
108 | - At swap-in, the page is taken from swap-cache. There are 2 cases. | |
109 | - | |
110 | - (a) If the SwapCache is newly allocated and read, it has no charges. | |
111 | - (b) If the SwapCache has been mapped by processes, it has been | |
112 | - charged already. | |
113 | - | |
114 | - This swap-in is one of the most complicated work. In do_swap_page(), | |
115 | - following events occur when pte is unchanged. | |
116 | - | |
117 | - (1) the page (SwapCache) is looked up. | |
118 | - (2) lock_page() | |
119 | - (3) try_charge_swapin() | |
120 | - (4) reuse_swap_page() (may call delete_swap_cache()) | |
121 | - (5) commit_charge_swapin() | |
122 | - (6) swap_free(). | |
123 | - | |
124 | - Considering following situation for example. | |
125 | - | |
126 | - (A) The page has not been charged before (2) and reuse_swap_page() | |
127 | - doesn't call delete_from_swap_cache(). | |
128 | - (B) The page has not been charged before (2) and reuse_swap_page() | |
129 | - calls delete_from_swap_cache(). | |
130 | - (C) The page has been charged before (2) and reuse_swap_page() doesn't | |
131 | - call delete_from_swap_cache(). | |
132 | - (D) The page has been charged before (2) and reuse_swap_page() calls | |
133 | - delete_from_swap_cache(). | |
134 | - | |
135 | - memory.usage/memsw.usage changes to this page/swp_entry will be | |
136 | - Case (A) (B) (C) (D) | |
137 | - Event | |
138 | - Before (2) 0/ 1 0/ 1 1/ 1 1/ 1 | |
139 | - =========================================== | |
140 | - (3) +1/+1 +1/+1 +1/+1 +1/+1 | |
141 | - (4) - 0/ 0 - -1/ 0 | |
142 | - (5) 0/-1 0/ 0 -1/-1 0/ 0 | |
143 | - (6) - 0/-1 - 0/-1 | |
144 | - =========================================== | |
145 | - Result 1/ 1 1/ 1 1/ 1 1/ 1 | |
146 | - | |
147 | - In any cases, charges to this page should be 1/ 1. | |
148 | - | |
149 | - 4.2 Swap-out. | |
150 | - At swap-out, typical state transition is below. | |
151 | - | |
152 | - (a) add to swap cache. (marked as SwapCache) | |
153 | - swp_entry's refcnt += 1. | |
154 | - (b) fully unmapped. | |
155 | - swp_entry's refcnt += # of ptes. | |
156 | - (c) write back to swap. | |
157 | - (d) delete from swap cache. (remove from SwapCache) | |
158 | - swp_entry's refcnt -= 1. | |
159 | - | |
160 | - | |
161 | - At (b), the page is marked as SwapCache and not uncharged. | |
162 | - At (d), the page is removed from SwapCache and a charge in page_cgroup | |
163 | - is moved to swap_cgroup. | |
164 | - | |
165 | - Finally, at task exit, | |
166 | - (e) zap_pte() is called and swp_entry's refcnt -=1 -> 0. | |
167 | - Here, a charge in swap_cgroup disappears. | |
168 | - | |
169 | -5. Page Cache | |
170 | - Page Cache is charged at | |
171 | - - add_to_page_cache_locked(). | |
172 | - | |
173 | - uncharged at | |
174 | - - __remove_from_page_cache(). | |
175 | - | |
176 | - The logic is very clear. (About migration, see below) | |
177 | - Note: __remove_from_page_cache() is called by remove_from_page_cache() | |
178 | - and __remove_mapping(). | |
179 | - | |
180 | -6. Shmem(tmpfs) Page Cache | |
181 | - Memcg's charge/uncharge have special handlers of shmem. The best way | |
182 | - to understand shmem's page state transition is to read mm/shmem.c. | |
183 | - But brief explanation of the behavior of memcg around shmem will be | |
184 | - helpful to understand the logic. | |
185 | - | |
186 | - Shmem's page (just leaf page, not direct/indirect block) can be on | |
187 | - - radix-tree of shmem's inode. | |
188 | - - SwapCache. | |
189 | - - Both on radix-tree and SwapCache. This happens at swap-in | |
190 | - and swap-out, | |
191 | - | |
192 | - It's charged when... | |
193 | - - A new page is added to shmem's radix-tree. | |
194 | - - A swp page is read. (move a charge from swap_cgroup to page_cgroup) | |
195 | - It's uncharged when | |
196 | - - A page is removed from radix-tree and not SwapCache. | |
197 | - - When SwapCache is removed, a charge is moved to swap_cgroup. | |
198 | - - When swp_entry's refcnt goes down to 0, a charge in swap_cgroup | |
199 | - disappears. | |
200 | - | |
201 | -7. Page Migration | |
202 | - One of the most complicated functions is page-migration-handler. | |
203 | - Memcg has 2 routines. Assume that we are migrating a page's contents | |
204 | - from OLDPAGE to NEWPAGE. | |
205 | - | |
206 | - Usual migration logic is.. | |
207 | - (a) remove the page from LRU. | |
208 | - (b) allocate NEWPAGE (migration target) | |
209 | - (c) lock by lock_page(). | |
210 | - (d) unmap all mappings. | |
211 | - (e-1) If necessary, replace entry in radix-tree. | |
212 | - (e-2) move contents of a page. | |
213 | - (f) map all mappings again. | |
214 | - (g) pushback the page to LRU. | |
215 | - (-) OLDPAGE will be freed. | |
216 | - | |
217 | - Before (g), memcg should complete all necessary charge/uncharge to | |
218 | - NEWPAGE/OLDPAGE. | |
219 | - | |
220 | - The point is.... | |
221 | - - If OLDPAGE is anonymous, all charges will be dropped at (d) because | |
222 | - try_to_unmap() drops all mapcount and the page will not be | |
223 | - SwapCache. | |
224 | - | |
225 | - - If OLDPAGE is SwapCache, charges will be kept at (g) because | |
226 | - __delete_from_swap_cache() isn't called at (e-1) | |
227 | - | |
228 | - - If OLDPAGE is page-cache, charges will be kept at (g) because | |
229 | - remove_from_swap_cache() isn't called at (e-1) | |
230 | - | |
231 | - memcg provides following hooks. | |
232 | - | |
233 | - - mem_cgroup_prepare_migration(OLDPAGE) | |
234 | - Called after (b) to account a charge (usage += PAGE_SIZE) against | |
235 | - memcg which OLDPAGE belongs to. | |
236 | - | |
237 | - - mem_cgroup_end_migration(OLDPAGE, NEWPAGE) | |
238 | - Called after (f) before (g). | |
239 | - If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already | |
240 | - charged, a charge by prepare_migration() is automatically canceled. | |
241 | - If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE. | |
242 | - | |
243 | - But zap_pte() (by exit or munmap) can be called while migration, | |
244 | - we have to check if OLDPAGE/NEWPAGE is a valid page after commit(). | |
245 | - | |
246 | -8. LRU | |
247 | - Each memcg has its own private LRU. Now, it's handling is under global | |
248 | - VM's control (means that it's handled under global zone->lru_lock). | |
249 | - Almost all routines around memcg's LRU is called by global LRU's | |
250 | - list management functions under zone->lru_lock(). | |
251 | - | |
252 | - A special function is mem_cgroup_isolate_pages(). This scans | |
253 | - memcg's private LRU and call __isolate_lru_page() to extract a page | |
254 | - from LRU. | |
255 | - (By __isolate_lru_page(), the page is removed from both of global and | |
256 | - private LRU.) | |
257 | - | |
258 | - | |
259 | -9. Typical Tests. | |
260 | - | |
261 | - Tests for racy cases. | |
262 | - | |
263 | - 9.1 Small limit to memcg. | |
264 | - When you do test to do racy case, it's good test to set memcg's limit | |
265 | - to be very small rather than GB. Many races found in the test under | |
266 | - xKB or xxMB limits. | |
267 | - (Memory behavior under GB and Memory behavior under MB shows very | |
268 | - different situation.) | |
269 | - | |
270 | - 9.2 Shmem | |
271 | - Historically, memcg's shmem handling was poor and we saw some amount | |
272 | - of troubles here. This is because shmem is page-cache but can be | |
273 | - SwapCache. Test with shmem/tmpfs is always good test. | |
274 | - | |
275 | - 9.3 Migration | |
276 | - For NUMA, migration is an another special case. To do easy test, cpuset | |
277 | - is useful. Following is a sample script to do migration. | |
278 | - | |
279 | - mount -t cgroup -o cpuset none /opt/cpuset | |
280 | - | |
281 | - mkdir /opt/cpuset/01 | |
282 | - echo 1 > /opt/cpuset/01/cpuset.cpus | |
283 | - echo 0 > /opt/cpuset/01/cpuset.mems | |
284 | - echo 1 > /opt/cpuset/01/cpuset.memory_migrate | |
285 | - mkdir /opt/cpuset/02 | |
286 | - echo 1 > /opt/cpuset/02/cpuset.cpus | |
287 | - echo 1 > /opt/cpuset/02/cpuset.mems | |
288 | - echo 1 > /opt/cpuset/02/cpuset.memory_migrate | |
289 | - | |
290 | - In above set, when you moves a task from 01 to 02, page migration to | |
291 | - node 0 to node 1 will occur. Following is a script to migrate all | |
292 | - under cpuset. | |
293 | - -- | |
294 | - move_task() | |
295 | - { | |
296 | - for pid in $1 | |
297 | - do | |
298 | - /bin/echo $pid >$2/tasks 2>/dev/null | |
299 | - echo -n $pid | |
300 | - echo -n " " | |
301 | - done | |
302 | - echo END | |
303 | - } | |
304 | - | |
305 | - G1_TASK=`cat ${G1}/tasks` | |
306 | - G2_TASK=`cat ${G2}/tasks` | |
307 | - move_task "${G1_TASK}" ${G2} & | |
308 | - -- | |
309 | - 9.4 Memory hotplug. | |
310 | - memory hotplug test is one of good test. | |
311 | - to offline memory, do following. | |
312 | - # echo offline > /sys/devices/system/memory/memoryXXX/state | |
313 | - (XXX is the place of memory) | |
314 | - This is an easy way to test page migration, too. | |
315 | - | |
316 | - 9.5 mkdir/rmdir | |
317 | - When using hierarchy, mkdir/rmdir test should be done. | |
318 | - Use tests like the following. | |
319 | - | |
320 | - echo 1 >/opt/cgroup/01/memory/use_hierarchy | |
321 | - mkdir /opt/cgroup/01/child_a | |
322 | - mkdir /opt/cgroup/01/child_b | |
323 | - | |
324 | - set limit to 01. | |
325 | - add limit to 01/child_b | |
326 | - run jobs under child_a and child_b | |
327 | - | |
328 | - create/delete following groups at random while jobs are running. | |
329 | - /opt/cgroup/01/child_a/child_aa | |
330 | - /opt/cgroup/01/child_b/child_bb | |
331 | - /opt/cgroup/01/child_c | |
332 | - | |
333 | - running new jobs in new group is also good. | |
334 | - | |
335 | - 9.6 Mount with other subsystems. | |
336 | - Mounting with other subsystems is a good test because there is a | |
337 | - race and lock dependency with other cgroup subsystems. | |
338 | - | |
339 | - example) | |
340 | - # mount -t cgroup none /cgroup -t cpuset,memory,cpu,devices | |
341 | - | |
342 | - and do task move, mkdir, rmdir etc...under this. |
Documentation/controllers/memory.txt
1 | -Memory Resource Controller | |
2 | - | |
3 | -NOTE: The Memory Resource Controller has been generically been referred | |
4 | -to as the memory controller in this document. Do not confuse memory controller | |
5 | -used here with the memory controller that is used in hardware. | |
6 | - | |
7 | -Salient features | |
8 | - | |
9 | -a. Enable control of both RSS (mapped) and Page Cache (unmapped) pages | |
10 | -b. The infrastructure allows easy addition of other types of memory to control | |
11 | -c. Provides *zero overhead* for non memory controller users | |
12 | -d. Provides a double LRU: global memory pressure causes reclaim from the | |
13 | - global LRU; a cgroup on hitting a limit, reclaims from the per | |
14 | - cgroup LRU | |
15 | - | |
16 | -NOTE: Swap Cache (unmapped) is not accounted now. | |
17 | - | |
18 | -Benefits and Purpose of the memory controller | |
19 | - | |
20 | -The memory controller isolates the memory behaviour of a group of tasks | |
21 | -from the rest of the system. The article on LWN [12] mentions some probable | |
22 | -uses of the memory controller. The memory controller can be used to | |
23 | - | |
24 | -a. Isolate an application or a group of applications | |
25 | - Memory hungry applications can be isolated and limited to a smaller | |
26 | - amount of memory. | |
27 | -b. Create a cgroup with limited amount of memory, this can be used | |
28 | - as a good alternative to booting with mem=XXXX. | |
29 | -c. Virtualization solutions can control the amount of memory they want | |
30 | - to assign to a virtual machine instance. | |
31 | -d. A CD/DVD burner could control the amount of memory used by the | |
32 | - rest of the system to ensure that burning does not fail due to lack | |
33 | - of available memory. | |
34 | -e. There are several other use cases, find one or use the controller just | |
35 | - for fun (to learn and hack on the VM subsystem). | |
36 | - | |
37 | -1. History | |
38 | - | |
39 | -The memory controller has a long history. A request for comments for the memory | |
40 | -controller was posted by Balbir Singh [1]. At the time the RFC was posted | |
41 | -there were several implementations for memory control. The goal of the | |
42 | -RFC was to build consensus and agreement for the minimal features required | |
43 | -for memory control. The first RSS controller was posted by Balbir Singh[2] | |
44 | -in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the | |
45 | -RSS controller. At OLS, at the resource management BoF, everyone suggested | |
46 | -that we handle both page cache and RSS together. Another request was raised | |
47 | -to allow user space handling of OOM. The current memory controller is | |
48 | -at version 6; it combines both mapped (RSS) and unmapped Page | |
49 | -Cache Control [11]. | |
50 | - | |
51 | -2. Memory Control | |
52 | - | |
53 | -Memory is a unique resource in the sense that it is present in a limited | |
54 | -amount. If a task requires a lot of CPU processing, the task can spread | |
55 | -its processing over a period of hours, days, months or years, but with | |
56 | -memory, the same physical memory needs to be reused to accomplish the task. | |
57 | - | |
58 | -The memory controller implementation has been divided into phases. These | |
59 | -are: | |
60 | - | |
61 | -1. Memory controller | |
62 | -2. mlock(2) controller | |
63 | -3. Kernel user memory accounting and slab control | |
64 | -4. user mappings length controller | |
65 | - | |
66 | -The memory controller is the first controller developed. | |
67 | - | |
68 | -2.1. Design | |
69 | - | |
70 | -The core of the design is a counter called the res_counter. The res_counter | |
71 | -tracks the current memory usage and limit of the group of processes associated | |
72 | -with the controller. Each cgroup has a memory controller specific data | |
73 | -structure (mem_cgroup) associated with it. | |
74 | - | |
75 | -2.2. Accounting | |
76 | - | |
77 | - +--------------------+ | |
78 | - | mem_cgroup | | |
79 | - | (res_counter) | | |
80 | - +--------------------+ | |
81 | - / ^ \ | |
82 | - / | \ | |
83 | - +---------------+ | +---------------+ | |
84 | - | mm_struct | |.... | mm_struct | | |
85 | - | | | | | | |
86 | - +---------------+ | +---------------+ | |
87 | - | | |
88 | - + --------------+ | |
89 | - | | |
90 | - +---------------+ +------+--------+ | |
91 | - | page +----------> page_cgroup| | |
92 | - | | | | | |
93 | - +---------------+ +---------------+ | |
94 | - | |
95 | - (Figure 1: Hierarchy of Accounting) | |
96 | - | |
97 | - | |
98 | -Figure 1 shows the important aspects of the controller | |
99 | - | |
100 | -1. Accounting happens per cgroup | |
101 | -2. Each mm_struct knows about which cgroup it belongs to | |
102 | -3. Each page has a pointer to the page_cgroup, which in turn knows the | |
103 | - cgroup it belongs to | |
104 | - | |
105 | -The accounting is done as follows: mem_cgroup_charge() is invoked to setup | |
106 | -the necessary data structures and check if the cgroup that is being charged | |
107 | -is over its limit. If it is then reclaim is invoked on the cgroup. | |
108 | -More details can be found in the reclaim section of this document. | |
109 | -If everything goes well, a page meta-data-structure called page_cgroup is | |
110 | -allocated and associated with the page. This routine also adds the page to | |
111 | -the per cgroup LRU. | |
112 | - | |
113 | -2.2.1 Accounting details | |
114 | - | |
115 | -All mapped anon pages (RSS) and cache pages (Page Cache) are accounted. | |
116 | -(Some pages which can never be reclaimed and will never be on the global LRU | |
117 | - are not accounted; only pages under usual VM management are accounted.) | |
118 | - | |
119 | -RSS pages are accounted at page_fault unless they've already been accounted | |
120 | -for earlier. A file page will be accounted for as Page Cache when it's | |
121 | -inserted into inode (radix-tree). While it's mapped into the page tables of | |
122 | -processes, duplicate accounting is carefully avoided. | |
123 | - | |
124 | -A RSS page is unaccounted when it's fully unmapped. A PageCache page is | |
125 | -unaccounted when it's removed from radix-tree. | |
126 | - | |
127 | -At page migration, accounting information is kept. | |
128 | - | |
129 | -Note: only pages on the LRU are accounted, because the purpose is to control | |
130 | -the amount of used pages; pages not on the LRU tend to be out of the VM's control. | |
131 | - | |
132 | -2.3 Shared Page Accounting | |
133 | - | |
134 | -Shared pages are accounted on the basis of the first touch approach. The | |
135 | -cgroup that first touches a page is accounted for the page. The principle | |
136 | -behind this approach is that a cgroup that aggressively uses a shared | |
137 | -page will eventually get charged for it (once it is uncharged from | |
138 | -the cgroup that brought it in -- this will happen on memory pressure). | |
139 | - | |
140 | -Exception: if CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not used. | |
141 | -When you do swapoff and thereby force swapped-out pages of shmem (tmpfs) | |
142 | -back into memory, charges for those pages are accounted against the | |
143 | -caller of swapoff rather than the users of shmem. | |
144 | - | |
145 | - | |
146 | -2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP) | |
147 | -Swap Extension allows you to record charge for swap. A swapped-in page is | |
148 | -charged back to original page allocator if possible. | |
149 | - | |
150 | -When swap is accounted, the following files are added. | |
151 | - - memory.memsw.usage_in_bytes. | |
152 | - - memory.memsw.limit_in_bytes. | |
153 | - | |
154 | -usage of mem+swap is limited by memsw.limit_in_bytes. | |
155 | - | |
156 | -Note: why 'mem+swap' rather than just swap? | |
157 | -The global LRU (kswapd) can swap out arbitrary pages. Swapping a page out | |
158 | -only moves its account from memory to swap, so there is no change in the | |
159 | -usage of mem+swap. | |
160 | - | |
161 | -In other words, when we want to limit the usage of swap without affecting | |
162 | -global LRU, mem+swap limit is better than just limiting swap from OS point | |
163 | -of view. | |
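
For example, a minimal sketch (it assumes CONFIG_CGROUP_MEM_RES_CTLR_SWAP
is enabled and that the controller is mounted at /cgroups with a group
named 0, as in section 3 below):

# echo 100M > /cgroups/0/memory.limit_in_bytes
# echo 120M > /cgroups/0/memory.memsw.limit_in_bytes
# cat /cgroups/0/memory.memsw.usage_in_bytes

With these settings, memory usage of group 0 is capped at 100M while
mem+swap usage is capped at 120M.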
164 | - | |
165 | -2.5 Reclaim | |
166 | - | |
167 | -Each cgroup maintains a per cgroup LRU that consists of an active | |
168 | -and inactive list. When a cgroup goes over its limit, we first try | |
169 | -to reclaim memory from the cgroup so as to make space for the new | |
170 | -pages that the cgroup has touched. If the reclaim is unsuccessful, | |
171 | -an OOM routine is invoked to select and kill the bulkiest task in the | |
172 | -cgroup. | |
173 | - | |
174 | -The reclaim algorithm has not been modified for cgroups, except that | |
175 | -pages that are selected for reclaiming come from the per cgroup LRU | |
176 | -list. | |
177 | - | |
178 | -2.6 Locking | |
179 | - | |
180 | -The memory controller uses the following hierarchy | |
181 | - | |
182 | -1. zone->lru_lock is used for selecting pages to be isolated | |
183 | -2. mem->per_zone->lru_lock protects the per cgroup LRU (per zone) | |
184 | -3. lock_page_cgroup() is used to protect page->page_cgroup | |
185 | - | |
186 | -3. User Interface | |
187 | - | |
188 | -0. Configuration | |
189 | - | |
190 | -a. Enable CONFIG_CGROUPS | |
191 | -b. Enable CONFIG_RESOURCE_COUNTERS | |
192 | -c. Enable CONFIG_CGROUP_MEM_RES_CTLR | |
193 | - | |
194 | -1. Prepare the cgroups | |
195 | -# mkdir -p /cgroups | |
196 | -# mount -t cgroup none /cgroups -o memory | |
197 | - | |
198 | -2. Make the new group and move bash into it | |
199 | -# mkdir /cgroups/0 | |
200 | -# echo $$ > /cgroups/0/tasks | |
201 | - | |
202 | -Since we are now in cgroup 0, we can alter the memory limit: | |
204 | -# echo 4M > /cgroups/0/memory.limit_in_bytes | |
205 | - | |
206 | -NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, | |
207 | -mega or gigabytes. | |
208 | - | |
209 | -# cat /cgroups/0/memory.limit_in_bytes | |
210 | -4194304 | |
211 | - | |
212 | -NOTE: The interface has now changed to display the usage in bytes | |
213 | -instead of pages | |
214 | - | |
215 | -We can check the usage: | |
216 | -# cat /cgroups/0/memory.usage_in_bytes | |
217 | -1216512 | |
218 | - | |
219 | -A successful write to this file does not guarantee that the limit was set | |
220 | -exactly to the value written into the file. This can be due to a | |
221 | -number of factors, such as rounding up to page boundaries or the total | |
222 | -availability of memory on the system. The user is required to re-read | |
223 | -this file after a write to see the value actually committed by the kernel. | |
224 | - | |
225 | -# echo 1 > memory.limit_in_bytes | |
226 | -# cat memory.limit_in_bytes | |
227 | -4096 | |
228 | - | |
229 | -The memory.failcnt field gives the number of times that the cgroup limit was | |
230 | -exceeded. | |
231 | - | |
232 | -The memory.stat file gives accounting information. Currently, the number of | |
233 | -cached pages, RSS pages and active/inactive pages is shown. | |
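
For example (a sketch, assuming the group created in step 2 above):

# cat /cgroups/0/memory.failcnt
# cat /cgroups/0/memory.stat

The first command shows how many times the limit was hit; the second
shows the per-group statistics described in section 5.2.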
234 | - | |
235 | -4. Testing | |
236 | - | |
237 | -Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11]. | |
238 | -Apart from that v6 has been tested with several applications and regular | |
239 | -daily use. The controller has also been tested on the PPC64, x86_64 and | |
240 | -UML platforms. | |
241 | - | |
242 | -4.1 Troubleshooting | |
243 | - | |
244 | -Sometimes a user might find that the application under a cgroup is | |
245 | -terminated. There are several causes for this: | |
246 | - | |
247 | -1. The cgroup limit is too low (just too low to do anything useful) | |
248 | -2. The user is using anonymous memory and swap is turned off or too low | |
249 | - | |
250 | -A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of | |
251 | -some of the pages cached in the cgroup (page cache pages). | |
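
For example, the sequence described above (note that drop_caches is a
system-wide knob, not a per-cgroup one):

# sync
# echo 1 > /proc/sys/vm/drop_caches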
252 | - | |
253 | -4.2 Task migration | |
254 | - | |
255 | -When a task migrates from one cgroup to another, its charge is not | |
256 | -carried forward. The pages allocated from the original cgroup still | |
257 | -remain charged to it; the charge is dropped when the page is freed or | |
258 | -reclaimed. | |
259 | - | |
260 | -4.3 Removing a cgroup | |
261 | - | |
262 | -A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a | |
263 | -cgroup might have some charge associated with it, even though all | |
264 | -tasks have migrated away from it. | |
265 | -Such charges are freed (by default) or moved to its parent. When moved, | |
266 | -both RSS and CACHE pages are moved to the parent. | |
267 | -If both of them are busy, rmdir() returns -EBUSY. See also section 5.1. | |
268 | - | |
269 | -Charges recorded in swap information are not updated when a cgroup is removed. | |
270 | -The recorded information is discarded, and a cgroup which uses swap (swapcache) | |
271 | -will be charged as the new owner of it. | |
272 | - | |
273 | - | |
274 | -5. Misc. interfaces. | |
275 | - | |
276 | -5.1 force_empty | |
277 | - memory.force_empty interface is provided to make cgroup's memory usage empty. | |
278 | - You can use this interface only when the cgroup has no tasks. | |
279 | - When anything is written to this file, for example | |
280 | - | |
281 | - # echo 0 > memory.force_empty | |
282 | - | |
283 | - almost all pages tracked by this memcg will be unmapped and freed. Some | |
284 | - pages cannot be freed because they are locked or in use. Such pages are | |
285 | - moved to the parent and this cgroup becomes empty. This may return -EBUSY | |
286 | - if the cgroup is too busy. | |
287 | - | |
288 | - A typical use case for this interface is to call it before rmdir(). | |
289 | - Because rmdir() moves all pages to the parent, some out-of-use page caches can | |
290 | - be moved to the parent. If you want to avoid that, force_empty is useful. | |
291 | - | |
292 | -5.2 stat file | |
293 | - The memory.stat file currently includes the following statistics: | |
294 | - cache - # of pages from page cache and shmem. | |
295 | - rss - # of pages from anonymous memory. | |
296 | - pgpgin - # of charging events | |
297 | - pgpgout - # of uncharging events | |
298 | - active_anon - # of pages on the active LRU of anon and shmem. | |
299 | - inactive_anon - # of pages on the inactive LRU of anon and shmem. | |
300 | - active_file - # of pages on the active LRU of file cache. | |
301 | - inactive_file - # of pages on the inactive LRU of file cache. | |
302 | - unevictable - # of pages that cannot be reclaimed (mlocked etc.). | |
303 | - | |
304 | - The following depend on CONFIG_DEBUG_VM: | |
305 | - inactive_ratio - VM internal parameter. (see mm/page_alloc.c) | |
306 | - recent_rotated_anon - VM internal parameter. (see mm/vmscan.c) | |
307 | - recent_rotated_file - VM internal parameter. (see mm/vmscan.c) | |
308 | - recent_scanned_anon - VM internal parameter. (see mm/vmscan.c) | |
309 | - recent_scanned_file - VM internal parameter. (see mm/vmscan.c) | |
310 | - | |
311 | - Memo: | |
312 | - recent_rotated means the recent frequency of LRU rotation. | |
313 | - recent_scanned means the recent # of scans of the LRU. | |
314 | - These are shown for easier debugging; please see the code for their exact meanings. | |
315 | - | |
316 | - | |
317 | -5.3 swappiness | |
318 | - Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only. | |
319 | - | |
320 | - The swappiness of the following cgroups can't be changed: | |
321 | - - the root cgroup (it uses /proc/sys/vm/swappiness). | |
322 | - - a cgroup which uses hierarchy and has child cgroups. | |
323 | - - a cgroup which uses hierarchy and is not the root of the hierarchy. | |
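
 For example (a sketch; it assumes the per-cgroup file is named
 memory.swappiness and that the controller is mounted at /cgroups with a
 group named 0 as in section 3):

 # cat /cgroups/0/memory.swappiness
 # echo 30 > /cgroups/0/memory.swappiness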
324 | - | |
325 | - | |
326 | -6. Hierarchy support | |
327 | - | |
328 | -The memory controller supports a deep hierarchy and hierarchical accounting. | |
329 | -The hierarchy is created by creating the appropriate cgroups in the | |
330 | -cgroup filesystem. Consider for example, the following cgroup filesystem | |
331 | -hierarchy | |
332 | - | |
333 | -               root | |
334 | -             /  |   \ | |
335 | -            /   |    \ | |
336 | -           a    b     c | |
337 | -                      | \ | |
338 | -                      |  \ | |
339 | -                      d   e | |
340 | - | |
341 | -In the diagram above, with hierarchical accounting enabled, all memory | |
342 | -usage of e is accounted to its ancestors up to the root (i.e. c and root) | |
343 | -that have memory.use_hierarchy enabled. If one of the ancestors goes over its | |
344 | -limit, the reclaim algorithm reclaims from the tasks in that ancestor and in | |
345 | -the children of that ancestor. | |
346 | - | |
347 | -6.1 Enabling hierarchical accounting and reclaim | |
348 | - | |
349 | -The memory controller by default disables the hierarchy feature. Support | |
350 | -can be enabled by writing 1 to the memory.use_hierarchy file of the root cgroup: | |
351 | - | |
352 | -# echo 1 > memory.use_hierarchy | |
353 | - | |
354 | -The feature can be disabled by | |
355 | - | |
356 | -# echo 0 > memory.use_hierarchy | |
357 | - | |
358 | -NOTE1: Enabling/disabling will fail if the cgroup already has other | |
359 | -cgroups created below it. | |
360 | - | |
361 | -NOTE2: This feature can be enabled/disabled per subtree. | |
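
For example, a sketch of building the hierarchy shown in the diagram
above (assuming the controller is mounted at /cgroups as in section 3):

# mkdir /cgroups/c
# echo 1 > /cgroups/c/memory.use_hierarchy
# mkdir /cgroups/c/d /cgroups/c/e
# echo 100M > /cgroups/c/memory.limit_in_bytes
# echo 40M > /cgroups/c/d/memory.limit_in_bytes

Because use_hierarchy is enabled in c before its children are created,
memory charged in d and e is also accounted to c, and reclaim triggered
by c's limit can target d and e as well.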
362 | - | |
363 | -7. TODO | |
364 | - | |
365 | -1. Add support for accounting huge pages (as a separate controller) | |
366 | -2. Make per-cgroup scanner reclaim not-shared pages first | |
367 | -3. Teach controller to account for shared-pages | |
368 | -4. Start reclamation in the background when the limit is | |
369 | - not yet hit but the usage is getting closer | |
370 | - | |
371 | -Summary | |
372 | - | |
373 | -Overall, the memory controller has been a stable controller and has been | |
374 | -commented and discussed quite extensively in the community. | |
375 | - | |
376 | -References | |
377 | - | |
378 | -1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/ | |
379 | -2. Singh, Balbir. Memory Controller (RSS Control), | |
380 | - http://lwn.net/Articles/222762/ | |
381 | -3. Emelianov, Pavel. Resource controllers based on process cgroups | |
382 | - http://lkml.org/lkml/2007/3/6/198 | |
383 | -4. Emelianov, Pavel. RSS controller based on process cgroups (v2) | |
384 | - http://lkml.org/lkml/2007/4/9/78 | |
385 | -5. Emelianov, Pavel. RSS controller based on process cgroups (v3) | |
386 | - http://lkml.org/lkml/2007/5/30/244 | |
387 | -6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/ | |
388 | -7. Vaidyanathan, Srinivasan, Control Groups: Pagecache accounting and control | |
389 | - subsystem (v3), http://lwn.net/Articles/235534/ | |
390 | -8. Singh, Balbir. RSS controller v2 test results (lmbench), | |
391 | - http://lkml.org/lkml/2007/5/17/232 | |
392 | -9. Singh, Balbir. RSS controller v2 AIM9 results | |
393 | - http://lkml.org/lkml/2007/5/18/1 | |
394 | -10. Singh, Balbir. Memory controller v6 test results, | |
395 | - http://lkml.org/lkml/2007/8/19/36 | |
396 | -11. Singh, Balbir. Memory controller introduction (v6), | |
397 | - http://lkml.org/lkml/2007/8/17/69 | |
398 | -12. Corbet, Jonathan, Controlling memory use in cgroups, | |
399 | - http://lwn.net/Articles/243795/ |
Documentation/controllers/resource_counter.txt
1 | - | |
2 | - The Resource Counter | |
3 | - | |
4 | -The resource counter, declared at include/linux/res_counter.h, | |
5 | -is supposed to facilitate the resource management by controllers | |
6 | -by providing common stuff for accounting. | |
7 | - | |
8 | -This "stuff" includes the res_counter structure and routines | |
9 | -to work with it. | |
10 | - | |
11 | - | |
12 | - | |
13 | -1. Crucial parts of the res_counter structure | |
14 | - | |
15 | - a. unsigned long long usage | |
16 | - | |
17 | - The usage value shows the amount of a resource that is consumed | |
18 | - by a group at a given time. The units of measurement should be | |
19 | - determined by the controller that uses this counter. E.g. it can | |
20 | - be bytes, items or any other unit the controller operates on. | |
21 | - | |
22 | - b. unsigned long long max_usage | |
23 | - | |
24 | - The maximal value of the usage over time. | |
25 | - | |
26 | - This value is useful when gathering statistical information about | |
27 | - the particular group, as it shows the actual resource requirements | |
28 | - for a particular group, not just some usage snapshot. | |
29 | - | |
30 | - c. unsigned long long limit | |
31 | - | |
32 | - The maximal allowed amount of resource to consume by the group. In | |
33 | - case the group requests for more resources, so that the usage value | |
34 | - would exceed the limit, the resource allocation is rejected (see | |
35 | - the next section). | |
36 | - | |
37 | - d. unsigned long long failcnt | |
38 | - | |
39 | - The failcnt stands for "failures counter". This is the number of | |
40 | - resource allocation attempts that failed. | |
41 | - | |
42 | - e. spinlock_t lock | |
43 | - | |
44 | - Protects changes of the above values. | |
45 | - | |
46 | - | |
47 | - | |
48 | -2. Basic accounting routines | |
49 | - | |
50 | - a. void res_counter_init(struct res_counter *rc) | |
51 | - | |
52 | - Initializes the resource counter. As usual, should be the first | |
53 | - routine called for a new counter. | |
54 | - | |
55 | - b. int res_counter_charge[_locked] | |
56 | - (struct res_counter *rc, unsigned long val) | |
57 | - | |
58 | - When a resource is about to be allocated it has to be accounted | |
59 | - with the appropriate resource counter (controller should determine | |
60 | - which one to use on its own). This operation is called "charging". | |
61 | - | |
62 | - It is not very important which operation - resource allocation | |
63 | - or charging - is performed first, but | |
64 | - * if the allocation is performed first, this may create a | |
65 | - temporary resource over-usage by the time the resource counter is | |
66 | - charged; | |
67 | - * if the charging is performed first, then it should be uncharged | |
68 | - on the error path (if that path is taken). | |
69 | - | |
70 | - c. void res_counter_uncharge[_locked] | |
71 | - (struct res_counter *rc, unsigned long val) | |
72 | - | |
73 | - When a resource is released (freed) it should be de-accounted | |
74 | - from the resource counter it was accounted to. This is called | |
75 | - "uncharging". | |
76 | - | |
77 | - The _locked routines imply that the res_counter->lock is taken. | |
78 | - | |
79 | - | |
80 | - 2.1 Other accounting routines | |
81 | - | |
82 | - There are more routines that may help you with common needs, like | |
83 | - checking whether the limit is reached or resetting the max_usage | |
84 | - value. They are all declared in include/linux/res_counter.h. | |
85 | - | |
86 | - | |
87 | - | |
88 | -3. Analyzing the resource counter registrations | |
89 | - | |
90 | - a. If the failcnt value constantly grows, this means that the counter's | |
91 | - limit is too tight. Either the group is misbehaving and consumes too | |
92 | - many resources, or the configuration is not suitable for the group | |
93 | - and the limit should be increased. | |
94 | - | |
95 | - b. The max_usage value can be used to quickly tune the group. One may | |
96 | - set the limits to maximal values and either load the container with | |
97 | - a common pattern or leave one for a while. After this the max_usage | |
98 | - value shows the amount of memory the container would require during | |
99 | - its common activity. | |
100 | - | |
101 | - Setting the limit a bit above this value gives a pretty good | |
102 | - configuration that works in most of the cases. | |
103 | - | |
104 | - c. If the max_usage is much less than the limit, but the failcnt value | |
105 | - is growing, then the group tries to allocate a big chunk of resource | |
106 | - at once. | |
107 | - | |
108 | - d. If the max_usage is much less than the limit, but the failcnt value | |
109 | - is 0, then this group's limit is higher than it requires. It is | |
110 | - better to lower the limit a bit, leaving more resource | |
111 | - for other groups. | |
112 | - | |
113 | - | |
114 | - | |
115 | -4. Communication with the control groups subsystem (cgroups) | |
116 | - | |
117 | -All the resource controllers that are using cgroups and resource counters | |
118 | -should provide files (in the cgroup filesystem) to work with the resource | |
119 | -counter fields. They are recommended to adhere to the following rules: | |
120 | - | |
121 | - a. File names | |
122 | - | |
123 | - Field name File name | |
124 | - --------------------------------------------------- | |
125 | - usage usage_in_<unit_of_measurement> | |
126 | - max_usage max_usage_in_<unit_of_measurement> | |
127 | - limit limit_in_<unit_of_measurement> | |
128 | - failcnt failcnt | |
129 | - lock no file :) | |
130 | - | |
131 | - b. Reading from file should show the corresponding field value in the | |
132 | - appropriate format. | |
133 | - | |
134 | - c. Writing to file | |
135 | - | |
136 | - Field Expected behavior | |
137 | - ---------------------------------- | |
138 | - usage prohibited | |
139 | - max_usage reset to usage | |
140 | - limit set the limit | |
141 | - failcnt reset to zero | |
142 | - | |
143 | - | |
144 | - | |
145 | -5. Usage example | |
146 | - | |
147 | - a. Declare a task group (take a look at cgroups subsystem for this) and | |
148 | - fold a res_counter into it | |
149 | - | |
150 | - struct my_group { | |
151 | - struct res_counter res; | |
152 | - | |
153 | - <other fields> | |
154 | - }; | |
155 | - | |
156 | - b. Put hooks in resource allocation/release paths | |
157 | - | |
158 | - int alloc_something(...) | |
159 | - { | |
160 | - if (res_counter_charge(res_counter_ptr, amount) < 0) | |
161 | - return -ENOMEM; | |
162 | - | |
163 | - <allocate the resource and return to the caller> | |
164 | - } | |
165 | - | |
166 | - void release_something(...) | |
167 | - { | |
168 | - res_counter_uncharge(res_counter_ptr, amount); | |
169 | - | |
170 | - <release the resource> | |
171 | - } | |
172 | - | |
173 | - In order to keep the usage value self-consistent, both the | |
174 | - "res_counter_ptr" and the "amount" in release_something() should be | |
175 | - the same as they were in the alloc_something() when the releasing | |
176 | - resource was allocated. | |
177 | - | |
178 | - c. Provide a way to read res_counter values and set them (the cgroups | |
179 | - subsystem can still help with it). | |
180 | - | |
181 | - d. Compile and run :) |
Documentation/cpusets.txt
1 | - CPUSETS | |
2 | - ------- | |
3 | - | |
4 | -Copyright (C) 2004 BULL SA. | |
5 | -Written by Simon.Derr@bull.net | |
6 | - | |
7 | -Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. | |
8 | -Modified by Paul Jackson <pj@sgi.com> | |
9 | -Modified by Christoph Lameter <clameter@sgi.com> | |
10 | -Modified by Paul Menage <menage@google.com> | |
11 | -Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> | |
12 | - | |
13 | -CONTENTS: | |
14 | -========= | |
15 | - | |
16 | -1. Cpusets | |
17 | - 1.1 What are cpusets ? | |
18 | - 1.2 Why are cpusets needed ? | |
19 | - 1.3 How are cpusets implemented ? | |
20 | - 1.4 What are exclusive cpusets ? | |
21 | - 1.5 What is memory_pressure ? | |
22 | - 1.6 What is memory spread ? | |
23 | - 1.7 What is sched_load_balance ? | |
24 | - 1.8 What is sched_relax_domain_level ? | |
25 | - 1.9 How do I use cpusets ? | |
26 | -2. Usage Examples and Syntax | |
27 | - 2.1 Basic Usage | |
28 | - 2.2 Adding/removing cpus | |
29 | - 2.3 Setting flags | |
30 | - 2.4 Attaching processes | |
31 | -3. Questions | |
32 | -4. Contact | |
33 | - | |
34 | -1. Cpusets | |
35 | -========== | |
36 | - | |
37 | -1.1 What are cpusets ? | |
38 | ----------------------- | |
39 | - | |
40 | -Cpusets provide a mechanism for assigning a set of CPUs and Memory | |
41 | -Nodes to a set of tasks. In this document "Memory Node" refers to | |
42 | -an on-line node that contains memory. | |
43 | - | |
44 | -Cpusets constrain the CPU and Memory placement of tasks to only | |
45 | -the resources within a tasks current cpuset. They form a nested | |
46 | -hierarchy visible in a virtual file system. These are the essential | |
47 | -hooks, beyond what is already present, required to manage dynamic | |
48 | -job placement on large systems. | |
49 | - | |
50 | -Cpusets use the generic cgroup subsystem described in | |
51 | -Documentation/cgroups/cgroups.txt. | |
52 | - | |
53 | -Requests by a task, using the sched_setaffinity(2) system call to | |
54 | -include CPUs in its CPU affinity mask, and using the mbind(2) and | |
55 | -set_mempolicy(2) system calls to include Memory Nodes in its memory | |
56 | -policy, are both filtered through that tasks cpuset, filtering out any | |
57 | -CPUs or Memory Nodes not in that cpuset. The scheduler will not | |
58 | -schedule a task on a CPU that is not allowed in its cpus_allowed | |
59 | -vector, and the kernel page allocator will not allocate a page on a | |
60 | -node that is not allowed in the requesting tasks mems_allowed vector. | |
61 | - | |
62 | -User level code may create and destroy cpusets by name in the cgroup | |
63 | -virtual file system, manage the attributes and permissions of these | |
64 | -cpusets and which CPUs and Memory Nodes are assigned to each cpuset, | |
65 | -specify and query to which cpuset a task is assigned, and list the | |
66 | -task pids assigned to a cpuset. | |
67 | - | |
68 | - | |
69 | -1.2 Why are cpusets needed ? | |
70 | ----------------------------- | |
71 | - | |
72 | -The management of large computer systems, with many processors (CPUs), | |
73 | -complex memory cache hierarchies and multiple Memory Nodes having | |
74 | -non-uniform access times (NUMA) presents additional challenges for | |
75 | -the efficient scheduling and memory placement of processes. | |
76 | - | |
77 | -Frequently more modest sized systems can be operated with adequate | |
78 | -efficiency just by letting the operating system automatically share | |
79 | -the available CPU and Memory resources amongst the requesting tasks. | |
80 | - | |
81 | -But larger systems, which benefit more from careful processor and | |
82 | -memory placement to reduce memory access times and contention, | |
83 | -and which typically represent a larger investment for the customer, | |
84 | -can benefit from explicitly placing jobs on properly sized subsets of | |
85 | -the system. | |
86 | - | |
87 | -This can be especially valuable on: | |
88 | - | |
89 | - * Web Servers running multiple instances of the same web application, | |
90 | - * Servers running different applications (for instance, a web server | |
91 | - and a database), or | |
92 | - * NUMA systems running large HPC applications with demanding | |
93 | - performance characteristics. | |
94 | - | |
95 | -These subsets, or "soft partitions" must be able to be dynamically | |
96 | -adjusted, as the job mix changes, without impacting other concurrently | |
97 | -executing jobs. The location of the running jobs pages may also be moved | |
98 | -when the memory locations are changed. | |
99 | - | |
100 | -The kernel cpuset patch provides the minimum essential kernel | |
101 | -mechanisms required to efficiently implement such subsets. It | |
102 | -leverages existing CPU and Memory Placement facilities in the Linux | |
103 | -kernel to avoid any additional impact on the critical scheduler or | |
104 | -memory allocator code. | |
105 | - | |
106 | - | |
107 | -1.3 How are cpusets implemented ? | |
108 | ---------------------------------- | |
109 | - | |
110 | -Cpusets provide a Linux kernel mechanism to constrain which CPUs and | |
111 | -Memory Nodes are used by a process or set of processes. | |
112 | - | |
113 | -The Linux kernel already has a pair of mechanisms to specify on which | |
114 | -CPUs a task may be scheduled (sched_setaffinity) and on which Memory | |
115 | -Nodes it may obtain memory (mbind, set_mempolicy). | |
116 | - | |
117 | -Cpusets extends these two mechanisms as follows: | |
118 | - | |
119 | - - Cpusets are sets of allowed CPUs and Memory Nodes, known to the | |
120 | - kernel. | |
121 | - - Each task in the system is attached to a cpuset, via a pointer | |
122 | - in the task structure to a reference counted cgroup structure. | |
123 | - - Calls to sched_setaffinity are filtered to just those CPUs | |
124 | - allowed in that tasks cpuset. | |
125 | - - Calls to mbind and set_mempolicy are filtered to just | |
126 | - those Memory Nodes allowed in that tasks cpuset. | |
127 | - - The root cpuset contains all the systems CPUs and Memory | |
128 | - Nodes. | |
129 | - - For any cpuset, one can define child cpusets containing a subset | |
130 | - of the parents CPU and Memory Node resources. | |
131 | - - The hierarchy of cpusets can be mounted at /dev/cpuset, for | |
132 | - browsing and manipulation from user space. | |
133 | - - A cpuset may be marked exclusive, which ensures that no other | |
134 | - cpuset (except direct ancestors and descendents) may contain | |
135 | - any overlapping CPUs or Memory Nodes. | |
136 | - - You can list all the tasks (by pid) attached to any cpuset. | |
137 | - | |
138 | -The implementation of cpusets requires a few, simple hooks | |
139 | -into the rest of the kernel, none in performance critical paths: | |
140 | - | |
141 | - - in init/main.c, to initialize the root cpuset at system boot. | |
142 | - - in fork and exit, to attach and detach a task from its cpuset. | |
143 | - - in sched_setaffinity, to mask the requested CPUs by what's | |
144 | - allowed in that tasks cpuset. | |
145 | - - in sched.c migrate_all_tasks(), to keep migrating tasks within | |
146 | - the CPUs allowed by their cpuset, if possible. | |
147 | - - in the mbind and set_mempolicy system calls, to mask the requested | |
148 | - Memory Nodes by what's allowed in that tasks cpuset. | |
149 | - - in page_alloc.c, to restrict memory to allowed nodes. | |
150 | - - in vmscan.c, to restrict page recovery to the current cpuset. | |
151 | - | |
152 | -You should mount the "cgroup" filesystem type in order to enable | |
153 | -browsing and modifying the cpusets presently known to the kernel. No | |
154 | -new system calls are added for cpusets - all support for querying and | |
155 | -modifying cpusets is via this cpuset file system. | |
156 | - | |
157 | -The /proc/<pid>/status file for each task has four added lines, | |
158 | -displaying the tasks cpus_allowed (on which CPUs it may be scheduled) | |
159 | -and mems_allowed (on which Memory Nodes it may obtain memory), | |
160 | -in the two formats seen in the following example: | |
161 | - | |
162 | - Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff | |
163 | - Cpus_allowed_list: 0-127 | |
164 | - Mems_allowed: ffffffff,ffffffff | |
165 | - Mems_allowed_list: 0-63 | |
166 | - | |
167 | -Each cpuset is represented by a directory in the cgroup file system | |
168 | -containing (on top of the standard cgroup files) the following | |
169 | -files describing that cpuset: | |
170 | - | |
171 | - - cpus: list of CPUs in that cpuset | |
172 | - - mems: list of Memory Nodes in that cpuset | |
173 | - - memory_migrate flag: if set, move pages to cpusets nodes | |
174 | - - cpu_exclusive flag: is cpu placement exclusive? | |
175 | - - mem_exclusive flag: is memory placement exclusive? | |
176 | - - mem_hardwall flag: is memory allocation hardwalled | |
177 | - - memory_pressure: measure of how much paging pressure in cpuset | |
178 | - | |
179 | -In addition, the root cpuset only has the following file: | |
180 | - - memory_pressure_enabled flag: compute memory_pressure? | |
181 | - | |
182 | -New cpusets are created using the mkdir system call or shell | |
183 | -command. The properties of a cpuset, such as its flags, allowed | |
184 | -CPUs and Memory Nodes, and attached tasks, are modified by writing | |
185 | -to the appropriate file in that cpusets directory, as listed above. | |
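
For example, a minimal sketch (it assumes the cpuset hierarchy is mounted
at /dev/cpuset as mentioned above, and that CPUs 2-3 and Memory Node 1
exist on the system):

# mkdir /dev/cpuset
# mount -t cgroup -o cpuset cpuset /dev/cpuset
# mkdir /dev/cpuset/my_set
# echo 2-3 > /dev/cpuset/my_set/cpus
# echo 1 > /dev/cpuset/my_set/mems
# echo $$ > /dev/cpuset/my_set/tasks

Note that a new cpuset needs its cpus and mems files populated before
tasks can be attached to it.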
186 | - | |
187 | -The named hierarchical structure of nested cpusets allows partitioning | |
188 | -a large system into nested, dynamically changeable, "soft-partitions". | |
189 | - | |
190 | -The attachment of each task, automatically inherited at fork by any | |
191 | -children of that task, to a cpuset allows organizing the work load | |
192 | -on a system into related sets of tasks such that each set is constrained | |
193 | -to using the CPUs and Memory Nodes of a particular cpuset. A task | |
194 | -may be re-attached to any other cpuset, if allowed by the permissions | |
195 | -on the necessary cpuset file system directories. | |
196 | - | |
197 | -Such management of a system "in the large" integrates smoothly with | |
198 | -the detailed placement done on individual tasks and memory regions | |
199 | -using the sched_setaffinity, mbind and set_mempolicy system calls. | |
200 | - | |
201 | -The following rules apply to each cpuset: | |
202 | - | |
203 | - - Its CPUs and Memory Nodes must be a subset of its parents. | |
204 | - - It can't be marked exclusive unless its parent is. | |
205 | - - If its cpu or memory is exclusive, they may not overlap any sibling. | |
206 | - | |
207 | -These rules, and the natural hierarchy of cpusets, enable efficient | |
208 | -enforcement of the exclusive guarantee, without having to scan all | |
209 | -cpusets every time any of them changes to ensure nothing overlaps an | |
210 | -exclusive cpuset. Also, the use of a Linux virtual file system (vfs) | |
211 | -to represent the cpuset hierarchy provides for a familiar permission | |
212 | -and name space for cpusets, with a minimum of additional kernel code. | |
213 | - | |
214 | -The cpus and mems files in the root (top_cpuset) cpuset are | |
215 | -read-only. The cpus file automatically tracks the value of | |
216 | -cpu_online_map using a CPU hotplug notifier, and the mems file | |
217 | -automatically tracks the value of node_states[N_HIGH_MEMORY]--i.e., | |
218 | -nodes with memory--using the cpuset_track_online_nodes() hook. | |
219 | - | |
220 | - | |
221 | -1.4 What are exclusive cpusets ? | |
222 | --------------------------------- | |
223 | - | |
224 | -If a cpuset is cpu or mem exclusive, no other cpuset, other than | |
225 | -a direct ancestor or descendent, may share any of the same CPUs or | |
226 | -Memory Nodes. | |
227 | - | |
228 | -A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled", | |
229 | -i.e. it restricts kernel allocations for page, buffer and other data | |
230 | -commonly shared by the kernel across multiple users. All cpusets, | |
231 | -whether hardwalled or not, restrict allocations of memory for user | |
232 | -space. This enables configuring a system so that several independent | |
233 | -jobs can share common kernel data, such as file system pages, while | |
234 | -isolating each job's user allocation in its own cpuset. To do this, | |
235 | -construct a large mem_exclusive cpuset to hold all the jobs, and | |
236 | -construct child, non-mem_exclusive cpusets for each individual job. | |
237 | -Only a small amount of typical kernel memory, such as requests from | |
238 | -interrupt handlers, is allowed to be taken outside even a | |
239 | -mem_exclusive cpuset. | |
240 | - | |
241 | - | |
242 | -1.5 What is memory_pressure ? | |
243 | ------------------------------ | |
244 | -The memory_pressure of a cpuset provides a simple per-cpuset metric | |
245 | -of the rate that the tasks in a cpuset are attempting to free up in-use | |
246 | -memory on the nodes of the cpuset to satisfy additional memory | |
247 | -requests. | |
248 | - | |
249 | -This enables batch managers monitoring jobs running in dedicated | |
250 | -cpusets to efficiently detect what level of memory pressure that job | |
251 | -is causing. | |
252 | - | |
253 | -This is useful both on tightly managed systems running a wide mix of | |
254 | -submitted jobs, which may choose to terminate or re-prioritize jobs that | |
255 | -are trying to use more memory than allowed on the nodes assigned them, | |
256 | -and with tightly coupled, long running, massively parallel scientific | |
257 | -computing jobs that will dramatically fail to meet required performance | |
258 | -goals if they start to use more memory than allowed to them. | |
259 | - | |
260 | -This mechanism provides a very economical way for the batch manager | |
261 | -to monitor a cpuset for signs of memory pressure. It's up to the | |
262 | -batch manager or other user code to decide what to do about it and | |
263 | -take action. | |
264 | - | |
265 | -==> Unless this feature is enabled by writing "1" to the special file | |
266 | - /dev/cpuset/memory_pressure_enabled, the hook in the rebalance | |
267 | - code of __alloc_pages() for this metric reduces to simply noticing | |
268 | - that the cpuset_memory_pressure_enabled flag is zero. So only | |
269 | - systems that enable this feature will compute the metric. | |
270 | - | |
271 | -Why a per-cpuset, running average: | |
272 | - | |
273 | - Because this meter is per-cpuset, rather than per-task or mm, | |
274 | - the system load imposed by a batch scheduler monitoring this | |
275 | - metric is sharply reduced on large systems, because a scan of | |
276 | - the tasklist can be avoided on each set of queries. | |
277 | - | |
278 | - Because this meter is a running average, instead of an accumulating | |
279 | - counter, a batch scheduler can detect memory pressure with a | |
280 | - single read, instead of having to read and accumulate results | |
281 | - for a period of time. | |
282 | - | |
283 | - Because this meter is per-cpuset rather than per-task or mm, | |
284 | - the batch scheduler can obtain the key information, memory | |
285 | - pressure in a cpuset, with a single read, rather than having to | |
286 | - query and accumulate results over all the (dynamically changing) | |
287 | - set of tasks in the cpuset. | |
288 | - | |
289 | -A per-cpuset simple digital filter (requires a spinlock and 3 words | |
290 | -of data per-cpuset) is kept, and updated by any task attached to that | |
291 | -cpuset, if it enters the synchronous (direct) page reclaim code. | |
292 | - | |
293 | -A per-cpuset file provides an integer number representing the recent | |
294 | -(half-life of 10 seconds) rate of direct page reclaims caused by | |
295 | -the tasks in the cpuset, in units of reclaims attempted per second, | |
296 | -times 1000. | |
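
For example (a sketch; my_set is a hypothetical cpuset under a hierarchy
mounted at /dev/cpuset):

# echo 1 > /dev/cpuset/memory_pressure_enabled
# cat /dev/cpuset/my_set/memory_pressure

The first command enables computation of the metric (see the note above);
the second reads the recent direct reclaim rate for that cpuset.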
297 | - | |
298 | - | |
299 | -1.6 What is memory spread ? | |
300 | ---------------------------- | |
301 | -There are two boolean flag files per cpuset that control where the | |
302 | -kernel allocates pages for the file system buffers and related in | |
303 | -kernel data structures. They are called 'memory_spread_page' and | |
304 | -'memory_spread_slab'. | |
305 | - | |
306 | -If the per-cpuset boolean flag file 'memory_spread_page' is set, then | |
307 | -the kernel will spread the file system buffers (page cache) evenly | |
308 | -over all the nodes that the faulting task is allowed to use, instead | |
309 | -of preferring to put those pages on the node where the task is running. | |
310 | - | |
311 | -If the per-cpuset boolean flag file 'memory_spread_slab' is set, | |
312 | -then the kernel will spread some file system related slab caches, | |
313 | -such as for inodes and dentries evenly over all the nodes that the | |
314 | -faulting task is allowed to use, instead of preferring to put those | |
315 | -pages on the node where the task is running. | |
316 | - | |
317 | -The setting of these flags does not affect anonymous data segment or | |
318 | -stack segment pages of a task. | |
319 | - | |
320 | -By default, both kinds of memory spreading are off, and memory | |
321 | -pages are allocated on the node local to where the task is running, | |
322 | -except perhaps as modified by the tasks NUMA mempolicy or cpuset | |
323 | -configuration, so long as sufficient free memory pages are available. | |
324 | - | |
325 | -When new cpusets are created, they inherit the memory spread settings | |
326 | -of their parent. | |
327 | - | |
328 | -Setting memory spreading causes allocations for the affected page | |
329 | -or slab caches to ignore the tasks NUMA mempolicy and be spread | |
330 | -instead. Tasks using mbind() or set_mempolicy() calls to set NUMA | |
331 | -mempolicies will not notice any change in these calls as a result of | |
332 | -their containing tasks memory spread settings. If memory spreading | |
333 | -is turned off, then the currently specified NUMA mempolicy once again | |
334 | -applies to memory page allocations. | |
335 | - | |
336 | -Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag | |
337 | -files. By default they contain "0", meaning that the feature is off | |
338 | -for that cpuset. If a "1" is written to that file, then that turns | |
339 | -the named feature on. | |
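
For example (a sketch, using the hypothetical cpuset my_set from the
earlier examples):

# echo 1 > /dev/cpuset/my_set/memory_spread_page
# echo 1 > /dev/cpuset/my_set/memory_spread_slab
# cat /dev/cpuset/my_set/memory_spread_page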
340 | - | |
341 | -The implementation is simple. | |
342 | - | |
343 | -Setting the flag 'memory_spread_page' turns on a per-process flag | |
344 | -PF_SPREAD_PAGE for each task that is in that cpuset or subsequently | |
345 | -joins that cpuset. The page allocation calls for the page cache | |
346 | -is modified to perform an inline check for this PF_SPREAD_PAGE task | |
347 | -flag, and if set, a call to a new routine cpuset_mem_spread_node() | |
348 | -returns the node to prefer for the allocation. | |
349 | - | |
350 | -Similarly, setting 'memory_spread_slab' turns on the flag | |
351 | -PF_SPREAD_SLAB, and appropriately marked slab caches will allocate | |
352 | -pages from the node returned by cpuset_mem_spread_node(). | |
353 | - | |
354 | -The cpuset_mem_spread_node() routine is also simple. It uses the | |
355 | -value of a per-task rotor cpuset_mem_spread_rotor to select the next | |
356 | -node in the current tasks mems_allowed to prefer for the allocation. | |
357 | - | |
358 | -This memory placement policy is also known (in other contexts) as | |
359 | -round-robin or interleave. | |
360 | - | |
361 | -This policy can provide substantial improvements for jobs that need | |
362 | -to place thread local data on the corresponding node, but that need | |
363 | -to access large file system data sets that need to be spread across | |
364 | -the several nodes in the jobs cpuset in order to fit. Without this | |
365 | -policy, especially for jobs that might have one thread reading in the | |
366 | -data set, the memory allocation across the nodes in the jobs cpuset | |
367 | -can become very uneven. | |
368 | - | |
369 | -1.7 What is sched_load_balance ? | |
370 | --------------------------------- | |
371 | - | |
372 | -The kernel scheduler (kernel/sched.c) automatically load balances | |
373 | -tasks. If one CPU is underutilized, kernel code running on that | |
374 | -CPU will look for tasks on other more overloaded CPUs and move those | |
375 | -tasks to itself, within the constraints of such placement mechanisms | |
376 | -as cpusets and sched_setaffinity. | |
377 | - | |
378 | -The algorithmic cost of load balancing and its impact on key shared | |
379 | -kernel data structures such as the task list increases more than | |
380 | -linearly with the number of CPUs being balanced. So the scheduler | |
381 | -has support to partition the systems CPUs into a number of sched | |
382 | -domains such that it only load balances within each sched domain. | |
383 | -Each sched domain covers some subset of the CPUs in the system; | |
384 | -no two sched domains overlap; some CPUs might not be in any sched | |
385 | -domain and hence won't be load balanced. | |
386 | - | |
387 | -Put simply, it costs less to balance between two smaller sched domains | |
388 | -than one big one, but doing so means that overloads in one of the | |
389 | -two domains won't be load balanced to the other one. | |
390 | - | |
391 | -By default, there is one sched domain covering all CPUs, except those | |
392 | -marked isolated using the kernel boot time "isolcpus=" argument. | |
393 | - | |
394 | -This default load balancing across all CPUs is not well suited for | |
395 | -the following two situations: | |
396 | - 1) On large systems, load balancing across many CPUs is expensive. | |
397 | - If the system is managed using cpusets to place independent jobs | |
398 | - on separate sets of CPUs, full load balancing is unnecessary. | |
399 | - 2) Systems supporting realtime on some CPUs need to minimize | |
400 | - system overhead on those CPUs, including avoiding task load | |
401 | - balancing if that is not needed. | |
402 | - | |
403 | -When the per-cpuset flag "sched_load_balance" is enabled (the default | |
404 | -setting), it requests that all the CPUs in that cpusets allowed 'cpus' | |
405 | -be contained in a single sched domain, ensuring that load balancing | |
406 | -can move a task (not otherwise pinned, as by sched_setaffinity) | |
407 | -from any CPU in that cpuset to any other. | |
408 | - | |
409 | -When the per-cpuset flag "sched_load_balance" is disabled, then the | |
410 | -scheduler will avoid load balancing across the CPUs in that cpuset, | |
411 | ---except-- in so far as is necessary because some overlapping cpuset | |
412 | -has "sched_load_balance" enabled. | |
413 | - | |
414 | -So, for example, if the top cpuset has the flag "sched_load_balance" | |
415 | -enabled, then the scheduler will have one sched domain covering all | |
416 | -CPUs, and the setting of the "sched_load_balance" flag in any other | |
417 | -cpusets won't matter, as we're already fully load balancing. | |
418 | - | |
419 | -Therefore in the above two situations, the top cpuset flag | |
420 | -"sched_load_balance" should be disabled, and only some of the smaller, | |
421 | -child cpusets have this flag enabled. | |
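
For example (a sketch; my_set is a hypothetical child cpuset holding one
independent job):

# echo 0 > /dev/cpuset/sched_load_balance
# echo 1 > /dev/cpuset/my_set/sched_load_balance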
422 | - | |
423 | -When doing this, you don't usually want to leave any unpinned tasks in | |
424 | -the top cpuset that might use non-trivial amounts of CPU, as such tasks | |
425 | -may be artificially constrained to some subset of CPUs, depending on | |
426 | -the particulars of this flag setting in descendent cpusets. Even if | |
427 | -such a task could use spare CPU cycles in some other CPUs, the kernel | |
428 | -scheduler might not consider the possibility of load balancing that | |
429 | -task to that underused CPU. | |
430 | - | |
431 | -Of course, tasks pinned to a particular CPU can be left in a cpuset | |
432 | -that disables "sched_load_balance" as those tasks aren't going anywhere | |
433 | -else anyway. | |
434 | - | |
435 | -There is an impedance mismatch here, between cpusets and sched domains. | |
436 | -Cpusets are hierarchical and nest. Sched domains are flat; they don't | |
437 | -overlap and each CPU is in at most one sched domain. | |
438 | - | |
439 | -It is necessary for sched domains to be flat because load balancing | |
440 | -across partially overlapping sets of CPUs would risk unstable dynamics | |
441 | -that would be beyond our understanding. So if each of two partially | |
442 | -overlapping cpusets enables the flag 'sched_load_balance', then we | |
443 | -form a single sched domain that is a superset of both. We won't move | |
444 | -a task to a CPU outside its cpuset, but the scheduler load balancing | |
445 | -code might waste some compute cycles considering that possibility. | |
446 | - | |
447 | -This mismatch is why there is not a simple one-to-one relation | |
448 | -between which cpusets have the flag "sched_load_balance" enabled, | |
449 | -and the sched domain configuration. If a cpuset enables the flag, it | |
450 | -will get balancing across all its CPUs, but if it disables the flag, | |
451 | -it will only be assured of no load balancing if no other overlapping | |
452 | -cpuset enables the flag. | |
453 | - | |
454 | -If two cpusets have partially overlapping 'cpus' allowed, and only | |
455 | -one of them has this flag enabled, then the other may find its | |
456 | -tasks only partially load balanced, just on the overlapping CPUs. | |
457 | -This is just the general case of the top_cpuset example given a few | |
458 | -paragraphs above. In the general case, as in the top cpuset case, | |
459 | -don't leave tasks that might use non-trivial amounts of CPU in | |
460 | -such partially load balanced cpusets, as they may be artificially | |
461 | -constrained to some subset of the CPUs allowed to them, for lack of | |
462 | -load balancing to the other CPUs. | |
463 | - | |
464 | -1.7.1 sched_load_balance implementation details. | |
465 | ------------------------------------------------- | |
466 | - | |
467 | -The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary | |
468 | -to most cpuset flags.) When enabled for a cpuset, the kernel will | |
469 | -ensure that it can load balance across all the CPUs in that cpuset | |
470 | -(makes sure that all the CPUs in the cpus_allowed of that cpuset are | |
471 | -in the same sched domain.) | |
472 | - | |
473 | -If two overlapping cpusets both have 'sched_load_balance' enabled, | |
474 | -then they will be (must be) both in the same sched domain. | |
475 | - | |
476 | -If, as is the default, the top cpuset has 'sched_load_balance' enabled, | |
477 | -then by the above that means there is a single sched domain covering | |
478 | -the whole system, regardless of any other cpuset settings. | |
479 | - | |
480 | -The kernel commits to user space that it will avoid load balancing | |
481 | -where it can. It will pick as fine a granularity partition of sched | |
482 | -domains as it can while still providing load balancing for any set | |
483 | -of CPUs allowed to a cpuset having 'sched_load_balance' enabled. | |
484 | - | |
485 | -The internal kernel cpuset to scheduler interface passes from the | |
486 | -cpuset code to the scheduler code a partition of the load balanced | |
487 | -CPUs in the system. This partition is a set of subsets (represented | |
488 | -as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all | |
489 | -the CPUs that must be load balanced. | |
490 | - | |
491 | -Whenever the 'sched_load_balance' flag changes, or CPUs come or go | |
492 | -from a cpuset with this flag enabled, or a cpuset with this flag | |
493 | -enabled is removed, the cpuset code builds a new such partition and | |
494 | -passes it to the scheduler sched domain setup code, to have the sched | |
495 | -domains rebuilt as necessary. | |
496 | - | |
497 | -This partition exactly defines what sched domains the scheduler should | |
498 | -setup - one sched domain for each element (cpumask_t) in the partition. | |
499 | - | |
500 | -The scheduler remembers the currently active sched domain partitions. | |
501 | -When the scheduler routine partition_sched_domains() is invoked from | |
502 | -the cpuset code to update these sched domains, it compares the new | |
503 | -partition requested with the current, and updates its sched domains, | |
504 | -removing the old and adding the new, for each change. | |
505 | - | |
506 | - | |
507 | -1.8 What is sched_relax_domain_level ? | |
508 | --------------------------------------- | |
509 | - | |
510 | -Within a sched domain, the scheduler migrates tasks in two ways: periodic | |
511 | -load balancing on the tick, and at the time of certain scheduling events. | |
512 | - | |
513 | -When a task is woken up, the scheduler tries to move it to an idle CPU. | |
514 | -For example, if a task A running on CPU X activates another task B | |
515 | -on the same CPU X, and if CPU Y is X's sibling and is idle, then the | |
516 | -scheduler migrates task B to CPU Y so that task B can start on | |
517 | -CPU Y without waiting for task A on CPU X. | |
518 | - | |
519 | -And if a CPU runs out of tasks in its runqueue, the CPU tries to pull | |
520 | -extra tasks from other busy CPUs to help them before it goes | |
521 | -idle. | |
522 | - | |
523 | -Of course it takes some search cost to find movable tasks and/or | |
524 | -idle CPUs, so the scheduler might not search all CPUs in the domain | |
525 | -every time. In fact, on some architectures, the search range on these | |
526 | -events is limited to the same socket or node where the CPU is located, | |
527 | -while the load balance on the tick searches all CPUs. | |
528 | - | |
529 | -For example, assume CPU Z is relatively far from CPU X. Even if CPU Z | |
530 | -is idle while CPU X and its siblings are busy, the scheduler can't migrate | |
531 | -the woken task B from X to Z since it is out of its search range. | |
532 | -As a result, task B on CPU X needs to wait for task A or for the load balance | |
533 | -on the next tick. For some applications in special situations, waiting | |
534 | -one tick may be too long. | |
535 | - | |
536 | -The 'sched_relax_domain_level' file allows you to request changing | |
537 | -this search range as you like. This file takes an integer value which | |
538 | -indicates the size of the search range in levels, ideally as follows; | |
539 | -otherwise the initial value -1 indicates the cpuset has no request. | |
540 | - | |
541 | - -1 : no request. use system default or follow request of others. | |
542 | - 0 : no search. | |
543 | - 1 : search siblings (hyperthreads in a core). | |
544 | - 2 : search cores in a package. | |
545 | - 3 : search cpus in a node [= system wide on non-NUMA system] | |
546 | - ( 4 : search nodes in a chunk of node [on NUMA system] ) | |
547 | - ( 5 : search system wide [on NUMA system] ) | |
548 | - | |
549 | -The system default is architecture dependent. The system default | |
550 | -can be changed using the relax_domain_level= boot parameter. | |
551 | - | |
552 | -This file is per-cpuset and affects the sched domain the cpuset | |
553 | -belongs to. Therefore if the flag 'sched_load_balance' of a cpuset | |
554 | -is disabled, then 'sched_relax_domain_level' has no effect since | |
555 | -there is no sched domain belonging to the cpuset. | |
556 | - | |
557 | -If multiple cpusets are overlapping and hence they form a single sched | |
558 | -domain, the largest value among those is used. Be careful, if one | |
559 | -requests 0 and others are -1 then 0 is used. | |
560 | - | |
561 | -Note that modifying this file will have both good and bad effects, | |
562 | -and whether it is acceptable or not will depend on your situation. | |
563 | -Don't modify this file if you are not sure. | |
564 | - | |
565 | -If your situation is: | |
566 | - - The migration costs between each cpu can be assumed to be considerably | |
567 | - small (for you) due to your special application's behavior or | |
568 | - special hardware support for the CPU cache etc. | |
569 | - - The search cost doesn't have an impact (for you), or you can make | |
570 | - the search cost small enough, e.g. by keeping the cpuset compact. | |
571 | - - Low latency is required even if it sacrifices cache hit rate etc. | |
572 | -then increasing 'sched_relax_domain_level' would benefit you. | |
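
For example (a sketch; my_set is a hypothetical cpuset whose wake-up
latency matters more than the search cost):

# echo 1 > /dev/cpuset/my_set/sched_relax_domain_level

This requests a search range of level 1 (siblings, i.e. hyperthreads in
a core) for the sched domain this cpuset belongs to, per the table above.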
573 | - | |
574 | - | |
575 | -1.9 How do I use cpusets ? | |
576 | --------------------------- | |
577 | - | |
578 | -In order to minimize the impact of cpusets on critical kernel | |
579 | -code, such as the scheduler, and due to the fact that the kernel | |
580 | -does not support one task updating the memory placement of another | |
581 | -task directly, the impact on a task of changing its cpuset CPU | |
582 | -or Memory Node placement, or of changing to which cpuset a task | |
583 | -is attached, is subtle. | |
584 | - | |
585 | -If a cpuset has its Memory Nodes modified, then for each task attached
586 | -to that cpuset, the next time that the kernel attempts to allocate
587 | -a page of memory for that task, the kernel will notice the change
588 | -in the task's cpuset, and update its per-task memory placement to
589 | -remain within the new cpuset's memory placement. If the task was using
590 | -mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
591 | -its new cpuset, then the task will continue to use whatever subset
592 | -of MPOL_BIND nodes are still allowed in the new cpuset. If the task
593 | -was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
594 | -in the new cpuset, then the task will essentially be treated as if it
595 | -were MPOL_BIND bound to the new cpuset (even though its NUMA placement,
596 | -as queried by get_mempolicy(), doesn't change). If a task is moved
597 | -from one cpuset to another, then the kernel will adjust the task's
598 | -memory placement, as above, the next time that the kernel attempts
599 | -to allocate a page of memory for that task.
600 | - | |
601 | -If a cpuset has its 'cpus' modified, then each task in that cpuset
602 | -will have its allowed CPU placement changed immediately. Similarly,
603 | -if a task's pid is written to a cpuset's 'tasks' file, in either its
604 | -current cpuset or another cpuset, then its allowed CPU placement is
605 | -changed immediately. If such a task had been bound to some subset
606 | -of its cpuset using the sched_setaffinity() call, the task will be
607 | -allowed to run on any CPU allowed in its new cpuset, negating the
608 | -effect of the prior sched_setaffinity() call.
609 | - | |
610 | -In summary, the memory placement of a task whose cpuset is changed is
611 | -updated by the kernel, on the next allocation of a page for that task,
612 | -but the processor placement is not updated until that task's pid is
613 | -rewritten to the 'tasks' file of its cpuset. This is done to avoid
614 | -impacting the scheduler code in the kernel with a check for changes
615 | -in a task's processor placement.
616 | - | |
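A hedged sketch of that last point, assuming a cpuset named 'Alpha'
(the name is an assumption): rewriting a task's pid to the 'tasks'
file of its own cpuset is what refreshes its processor placement.

    cd /dev/cpuset/Alpha
    /bin/echo 0-3 > cpus
    # Re-attach the current shell to its own cpuset so its processor
    # placement reflects the new 'cpus' setting
    /bin/echo $$ > tasks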
617 | -Normally, once a page is allocated (given a physical page
618 | -of main memory) then that page stays on whatever node it
619 | -was allocated on, so long as it remains allocated, even if the
620 | -cpuset's memory placement policy 'mems' subsequently changes.
621 | -If the cpuset flag file 'memory_migrate' is set true, then when
622 | -tasks are attached to that cpuset, any pages the task had allocated
623 | -on nodes in its previous cpuset are migrated
624 | -to the task's new cpuset. The relative placement of a page within
625 | -the cpuset is preserved during these migration operations if possible.
626 | -For example, if the page was on the second valid node of the prior cpuset
627 | -then the page will be placed on the second valid node of the new cpuset.
628 | - | |
629 | -Also, if 'memory_migrate' is set true, then if that cpuset's
630 | -'mems' file is modified, pages allocated to tasks in that
631 | -cpuset, that were on nodes in the previous setting of 'mems',
632 | -will be moved to nodes in the new setting of 'mems'.
633 | -Pages that were not in the task's prior cpuset, or in the cpuset's
634 | -prior 'mems' setting, will not be moved.
635 | - | |
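A minimal sketch of both cases, assuming a cpuset named 'Alpha' whose
'mems' currently spans nodes 0-3 (the name and node numbers are
assumptions):

    cd /dev/cpuset/Alpha
    /bin/echo 1 > memory_migrate
    # Attaching a task now migrates its already-allocated pages from
    # its previous cpuset's nodes onto this cpuset's nodes
    /bin/echo $$ > tasks
    # Shrinking 'mems' now also migrates pages off the dropped nodes
    /bin/echo 0-1 > mems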
636 | -There is an exception to the above. If hotplug functionality is used | |
637 | -to remove all the CPUs that are currently assigned to a cpuset, | |
638 | -then all the tasks in that cpuset will be moved to the nearest ancestor | |
639 | -with non-empty cpus. But the moving of some (or all) tasks might fail if
640 | -the cpuset is bound to another cgroup subsystem which has restrictions
641 | -on task attaching. In this failing case, those tasks will stay
642 | -in the original cpuset, and the kernel will automatically update | |
643 | -their cpus_allowed to allow all online CPUs. When memory hotplug | |
644 | -functionality for removing Memory Nodes is available, a similar exception | |
645 | -is expected to apply there as well. In general, the kernel prefers to
646 | -violate cpuset placement rather than starve a task that has had all
647 | -its allowed CPUs or Memory Nodes taken offline.
648 | - | |
649 | -There is a second exception to the above. GFP_ATOMIC requests are
650 | -kernel internal allocations that must be satisfied immediately.
651 | -The kernel may drop some requests, and in rare cases even panic, if a
652 | -GFP_ATOMIC alloc fails. If the request cannot be satisfied within
653 | -the current task's cpuset, then we relax the cpuset, and look for
654 | -memory anywhere we can find it. It's better to violate the cpuset
655 | -than stress the kernel. | |
656 | - | |
657 | -To start a new job that is to be contained within a cpuset, the steps are: | |
658 | - | |
659 | - 1) mkdir /dev/cpuset | |
660 | - 2) mount -t cgroup -ocpuset cpuset /dev/cpuset | |
661 | - 3) Create the new cpuset by doing mkdir's and write's (or echo's) in | |
662 | - the /dev/cpuset virtual file system. | |
663 | - 4) Start a task that will be the "founding father" of the new job. | |
664 | - 5) Attach that task to the new cpuset by writing its pid to the | |
665 | - /dev/cpuset tasks file for that cpuset. | |
666 | - 6) fork, exec or clone the job tasks from this founding father task. | |
667 | - | |
668 | -For example, the following sequence of commands will set up a cpuset
669 | -named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, | |
670 | -and then start a subshell 'sh' in that cpuset: | |
671 | - | |
672 | - mount -t cgroup -ocpuset cpuset /dev/cpuset | |
673 | - cd /dev/cpuset | |
674 | - mkdir Charlie | |
675 | - cd Charlie | |
676 | - /bin/echo 2-3 > cpus | |
677 | - /bin/echo 1 > mems | |
678 | - /bin/echo $$ > tasks | |
679 | - sh | |
680 | - # The subshell 'sh' is now running in cpuset Charlie | |
681 | - # The next line should display '/Charlie' | |
682 | - cat /proc/self/cpuset | |
683 | - | |
684 | -In the future, a C library interface to cpusets will likely be | |
685 | -available. For now, the only way to query or modify cpusets is | |
686 | -via the cpuset file system, using the various cd, mkdir, echo, cat, | |
687 | -rmdir commands from the shell, or their equivalent from C. | |
688 | - | |
689 | -The sched_setaffinity calls can also be done at the shell prompt using | |
690 | -SGI's runon or Robert Love's taskset. The mbind and set_mempolicy | |
691 | -calls can be done at the shell prompt using the numactl command | |
692 | -(part of Andi Kleen's numa package). | |
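For illustration, the equivalent shell invocations look roughly like
this (the command name 'my_app' is an assumption; the flags shown are
the standard ones for taskset and numactl):

    # Restrict a new process to CPUs 2 and 3
    taskset -c 2,3 my_app
    # Restrict a new process's memory allocations to node 1
    numactl --membind=1 my_app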
693 | - | |
694 | -2. Usage Examples and Syntax | |
695 | -============================ | |
696 | - | |
697 | -2.1 Basic Usage | |
698 | ---------------- | |
699 | - | |
700 | -Creating, modifying and using cpusets can be done through the cpuset
701 | -virtual filesystem.
702 | - | |
703 | -To mount it, type: | |
704 | -# mount -t cgroup -o cpuset cpuset /dev/cpuset | |
705 | - | |
706 | -Then under /dev/cpuset you can find a tree that corresponds to the | |
707 | -tree of the cpusets in the system. For instance, /dev/cpuset | |
708 | -is the cpuset that holds the whole system. | |
709 | - | |
710 | -If you want to create a new cpuset under /dev/cpuset: | |
711 | -# cd /dev/cpuset | |
712 | -# mkdir my_cpuset | |
713 | - | |
714 | -Now you want to do something with this cpuset. | |
715 | -# cd my_cpuset | |
716 | - | |
717 | -In this directory you can find several files: | |
718 | -# ls | |
719 | -cpu_exclusive memory_migrate mems tasks | |
720 | -cpus memory_pressure notify_on_release | |
721 | -mem_exclusive memory_spread_page sched_load_balance | |
722 | -mem_hardwall memory_spread_slab sched_relax_domain_level | |
723 | - | |
724 | -Reading them will give you information about the state of this cpuset: | |
725 | -the CPUs and Memory Nodes it can use, the processes that are using | |
726 | -it, and its properties. By writing to these files you can manipulate
727 | -the cpuset. | |
728 | - | |
729 | -Set some flags: | |
730 | -# /bin/echo 1 > cpu_exclusive | |
731 | - | |
732 | -Add some cpus: | |
733 | -# /bin/echo 0-7 > cpus | |
734 | - | |
735 | -Add some mems: | |
736 | -# /bin/echo 0-7 > mems | |
737 | - | |
738 | -Now attach your shell to this cpuset: | |
739 | -# /bin/echo $$ > tasks | |
740 | - | |
741 | -You can also create cpusets inside your cpuset by using mkdir in this | |
742 | -directory. | |
743 | -# mkdir my_sub_cs | |
744 | - | |
745 | -To remove a cpuset, just use rmdir: | |
746 | -# rmdir my_sub_cs | |
747 | -This will fail if the cpuset is in use (has cpusets inside, or has | |
748 | -processes attached). | |
749 | - | |
750 | -Note that for legacy reasons, the "cpuset" filesystem exists as a | |
751 | -wrapper around the cgroup filesystem. | |
752 | - | |
753 | -The command | |
754 | - | |
755 | -mount -t cpuset X /dev/cpuset | |
756 | - | |
757 | -is equivalent to | |
758 | - | |
759 | -mount -t cgroup -ocpuset X /dev/cpuset | |
760 | -echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent | |
761 | - | |
762 | -2.2 Adding/removing cpus | |
763 | ------------------------- | |
764 | - | |
765 | -This is the syntax to use when writing in the cpus or mems files | |
766 | -in cpuset directories: | |
767 | - | |
768 | -# /bin/echo 1-4 > cpus -> set cpus list to cpus 1,2,3,4 | |
769 | -# /bin/echo 1,2,3,4 > cpus -> set cpus list to cpus 1,2,3,4 | |
770 | - | |
771 | -2.3 Setting flags | |
772 | ------------------ | |
773 | - | |
774 | -The syntax is very simple: | |
775 | - | |
776 | -# /bin/echo 1 > cpu_exclusive -> set flag 'cpu_exclusive' | |
777 | -# /bin/echo 0 > cpu_exclusive -> unset flag 'cpu_exclusive' | |
778 | - | |
779 | -2.4 Attaching processes | |
780 | ------------------------ | |
781 | - | |
782 | -# /bin/echo PID > tasks | |
783 | - | |
784 | -Note that it is PID, not PIDs. You can only attach ONE task at a time. | |
785 | -If you have several tasks to attach, you have to do it one after another: | |
786 | - | |
787 | -# /bin/echo PID1 > tasks | |
788 | -# /bin/echo PID2 > tasks | |
789 | - ... | |
790 | -# /bin/echo PIDn > tasks | |
791 | - | |
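When there are many tasks to attach, this is usually wrapped in a shell
loop; a sketch (the process name 'my_app' is an assumption):

    for pid in $(pidof my_app); do
        /bin/echo $pid > tasks
    done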
792 | - | |
793 | -3. Questions | |
794 | -============ | |
795 | - | |
796 | -Q: What's up with this '/bin/echo'?
797 | -A: bash's builtin 'echo' command does not check its calls to write()
798 | -   for errors. If you use it in the cpuset file system, you won't be
799 | -   able to tell whether a command succeeded or failed.
800 | - | |
801 | -Q: When I attach processes, only the first one on the line actually gets attached!
802 | -A: We can only return one error code per call to write(). So you should
803 | -   write only ONE pid at a time.
804 | - | |
805 | -4. Contact | |
806 | -========== | |
807 | - | |
808 | -Web: http://www.bullopensource.org/cpuset |
Documentation/scheduler/sched-design-CFS.txt
... | ... | @@ -231,7 +231,7 @@ |
231 | 231 | |
232 | 232 | This options needs CONFIG_CGROUPS to be defined, and lets the administrator |
233 | 233 | create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See |
234 | - Documentation/cgroups.txt for more information about this filesystem. | |
234 | + Documentation/cgroups/cgroups.txt for more information about this filesystem. | |
235 | 235 | |
236 | 236 | Only one of these options to group tasks can be chosen and not both. |
237 | 237 |
include/linux/res_counter.h
init/Kconfig
... | ... | @@ -323,8 +323,8 @@ |
323 | 323 | This option allows you to create arbitrary task groups |
324 | 324 | using the "cgroup" pseudo filesystem and control |
325 | 325 | the cpu bandwidth allocated to each such task group. |
326 | - Refer to Documentation/cgroups.txt for more information | |
327 | - on "cgroup" pseudo filesystem. | |
326 | + Refer to Documentation/cgroups/cgroups.txt for more | |
327 | + information on "cgroup" pseudo filesystem. | |
328 | 328 | |
329 | 329 | endchoice |
330 | 330 | |
331 | 331 | |
... | ... | @@ -335,10 +335,9 @@ |
335 | 335 | use with process control subsystems such as Cpusets, CFS, memory |
336 | 336 | controls or device isolation. |
337 | 337 | See |
338 | - - Documentation/cpusets.txt (Cpusets) | |
339 | 338 | - Documentation/scheduler/sched-design-CFS.txt (CFS) |
340 | - - Documentation/cgroups/ (features for grouping, isolation) | |
341 | - - Documentation/controllers/ (features for resource control) | |
339 | + - Documentation/cgroups/ (features for grouping, isolation | |
340 | + and resource control) | |
342 | 341 | |
343 | 342 | Say N if unsure. |
344 | 343 |
kernel/cpuset.c
... | ... | @@ -568,7 +568,7 @@ |
568 | 568 | * load balancing domains (sched domains) as specified by that partial |
569 | 569 | * partition. |
570 | 570 | * |
571 | - * See "What is sched_load_balance" in Documentation/cpusets.txt | |
571 | + * See "What is sched_load_balance" in Documentation/cgroups/cpusets.txt | |
572 | 572 | * for a background explanation of this. |
573 | 573 | * |
574 | 574 | * Does not return errors, on the theory that the callers of this |