Commit dc10e281f5fc42e288ab979294d1d5dc9743ae1b
Committed by Linus Torvalds
1 parent 87946a7228
Exists in master and in 7 other branches
memcg: update documentation
Some information are old, and I think current document doesn't work as "a guide for users". We need summary of all of our controls, at least.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Showing 1 changed file with 198 additions and 93 deletions
Documentation/cgroups/memory.txt
1 | 1 | Memory Resource Controller |
2 | 2 | |
3 | 3 | NOTE: The Memory Resource Controller has generically been referred |
4 | -to as the memory controller in this document. Do not confuse memory controller | |
5 | -used here with the memory controller that is used in hardware. | |
4 | + to as the memory controller in this document. Do not confuse memory | |
5 | + controller used here with the memory controller that is used in hardware. | |
6 | 6 | |
7 | -Salient features | |
7 | +(For editors) | |
8 | +In this document: | |
9 | + When we mention a cgroup (cgroupfs's directory) with memory controller, | |
10 | + we call it "memory cgroup". When you see git-log and source code, you'll | |
11 | + see patch's title and function names tend to use "memcg". | |
12 | + In this document, we avoid using it. | |
8 | 13 | |
9 | -a. Enable control of Anonymous, Page Cache (mapped and unmapped) and | |
10 | - Swap Cache memory pages. | |
11 | -b. The infrastructure allows easy addition of other types of memory to control | |
12 | -c. Provides *zero overhead* for non memory controller users | |
13 | -d. Provides a double LRU: global memory pressure causes reclaim from the | |
14 | - global LRU; a cgroup on hitting a limit, reclaims from the per | |
15 | - cgroup LRU | |
16 | - | |
17 | 14 | Benefits and Purpose of the memory controller |
18 | 15 | |
19 | 16 | The memory controller isolates the memory behaviour of a group of tasks |
... | ... | @@ -33,6 +30,45 @@ |
33 | 30 | e. There are several other use cases, find one or use the controller just |
34 | 31 | for fun (to learn and hack on the VM subsystem). |
35 | 32 | |
33 | +Current Status: linux-2.6.34-mmotm(development version of 2010/April) | |
34 | + | |
35 | +Features: | |
36 | + - accounting anonymous pages, file caches, swap caches usage and limiting them. | |
37 | + - private LRU and reclaim routine. (system's global LRU and private LRU | |
38 | + work independently from each other) | |
39 | + - optionally, memory+swap usage can be accounted and limited. | |
40 | + - hierarchical accounting | |
41 | + - soft limit | |
42 | + - moving(recharging) account at moving a task is selectable. | |
43 | + - usage threshold notifier | |
44 | + - oom-killer disable knob and oom-notifier | |
45 | + - Root cgroup has no limit controls. | |
46 | + | |
47 | + Kernel memory and Hugepages are not under control yet. We just manage | |
48 | + pages on LRU. To add more controls, we have to take care of performance. | |
49 | + | |
50 | +Brief summary of control files. | |
51 | + | |
52 | + tasks # attach a task(thread) and show list of threads | |
53 | + cgroup.procs # show list of processes | |
54 | + cgroup.event_control # an interface for eventfd() | |
55 | + memory.usage_in_bytes # show current memory(RSS+Cache) usage. | |
56 | + memory.memsw.usage_in_bytes # show current memory+Swap usage | |
57 | + memory.limit_in_bytes # set/show limit of memory usage | |
58 | + memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage | |
59 | + memory.failcnt # show the number of memory usage hits limits | |
60 | + memory.memsw.failcnt # show the number of memory+Swap hits limits | |
61 | + memory.max_usage_in_bytes # show max memory usage recorded | |
62 | + memory.memsw.max_usage_in_bytes # show max memory+Swap usage recorded | |
63 | + memory.soft_limit_in_bytes # set/show soft limit of memory usage | |
64 | + memory.stat # show various statistics | |
65 | + memory.use_hierarchy # set/show hierarchical account enabled | |
66 | + memory.force_empty # trigger forced move charge to parent | |
67 | + memory.swappiness # set/show swappiness parameter of vmscan | |
68 | + (See sysctl's vm.swappiness) | |
69 | + memory.move_charge_at_immigrate # set/show controls of moving charges | |
70 | + memory.oom_control # set/show oom controls. | |
71 | + | |
36 | 72 | 1. History |
37 | 73 | |
38 | 74 | The memory controller has a long history. A request for comments for the memory |
39 | 75 | |
... | ... | @@ -106,14 +142,14 @@ |
106 | 142 | is over its limit. If it is then reclaim is invoked on the cgroup. |
107 | 143 | More details can be found in the reclaim section of this document. |
108 | 144 | If everything goes well, a page meta-data-structure called page_cgroup is |
109 | -allocated and associated with the page. This routine also adds the page to | |
110 | -the per cgroup LRU. | |
145 | +updated. page_cgroup has its own LRU on cgroup. | |
146 | +(*) page_cgroup structure is allocated at boot/memory-hotplug time. | |
111 | 147 | |
112 | 148 | 2.2.1 Accounting details |
113 | 149 | |
114 | 150 | All mapped anon pages (RSS) and cache pages (Page Cache) are accounted. |
115 | -(some pages which never be reclaimable and will not be on global LRU | |
116 | - are not accounted. we just accounts pages under usual vm management.) | |
151 | +Some pages which are never reclaimable and will not be on the global LRU | |
152 | +are not accounted. We just account pages under usual VM management. | |
117 | 153 | |
118 | 154 | RSS pages are accounted at page_fault unless they've already been accounted |
119 | 155 | for earlier. A file page will be accounted for as Page Cache when it's |
120 | 156 | |
121 | 157 | |
... | ... | @@ -121,12 +157,19 @@ |
121 | 157 | processes, duplicate accounting is carefully avoided. |
122 | 158 | |
123 | 159 | A RSS page is unaccounted when it's fully unmapped. A PageCache page is |
124 | -unaccounted when it's removed from radix-tree. | |
160 | +unaccounted when it's removed from radix-tree. Even if RSS pages are fully | |
161 | +unmapped (by kswapd), they may exist as SwapCache in the system until they | |
162 | +are really freed. Such SwapCaches are also accounted. | |
163 | +A swapped-in page is not accounted until it's mapped. | |
125 | 164 | |
165 | +Note: The kernel does swapin-readahead and read multiple swaps at once. | |
166 | +This means swapped-in pages may contain pages for other tasks than a task | |
167 | +causing page fault. So, we avoid accounting at swap-in I/O. | |
168 | + | |
126 | 169 | At page migration, accounting information is kept. |
127 | 170 | |
128 | -Note: we just account pages-on-lru because our purpose is to control amount | |
129 | -of used pages. not-on-lru pages are tend to be out-of-control from vm view. | |
171 | +Note: we just account pages-on-LRU because our purpose is to control amount | |
172 | +of used pages; not-on-LRU pages tend to be out-of-control from VM view. | |
130 | 173 | |
131 | 174 | 2.3 Shared Page Accounting |
132 | 175 | |
... | ... | @@ -143,6 +186,7 @@ |
143 | 186 | |
144 | 187 | |
145 | 188 | 2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP) |
189 | + | |
146 | 190 | Swap Extension allows you to record charge for swap. A swapped-in page is |
147 | 191 | charged back to original page allocator if possible. |
148 | 192 | |
149 | 193 | |
150 | 194 | |
... | ... | @@ -150,13 +194,20 @@ |
150 | 194 | - memory.memsw.usage_in_bytes. |
151 | 195 | - memory.memsw.limit_in_bytes. |
152 | 196 | |
153 | -usage of mem+swap is limited by memsw.limit_in_bytes. | |
197 | +memsw means memory+swap. Usage of memory+swap is limited by | |
198 | +memsw.limit_in_bytes. | |
154 | 199 | |
155 | -* why 'mem+swap' rather than swap. | |
200 | +Example: Assume a system with 4G of swap. A task which allocates 6G of memory | |
201 | +(by mistake) under 2G memory limitation will use all swap. | |
202 | +In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap. | |
203 | +By using memsw limit, you can avoid system OOM which can be caused by swap | |
204 | +shortage. | |
205 | + | |
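The memsw example above can be checked with a little arithmetic. The sketch below is a deliberately simplified model, not kernel behaviour: it assumes resident memory is capped at the memory limit, that everything beyond it spills to swap, and that the optional memory+swap limit caps the sum. The function name `charge` and its parameters are hypothetical.

```python
GIB = 1024 ** 3


def charge(alloc, mem_limit, swap_total, memsw_limit=None):
    """Return (resident, swapped) bytes for a task that tries to use `alloc` bytes.

    Simplified model (an assumption, not the kernel algorithm): resident
    memory is capped by the memory limit, the remainder goes to swap, and
    the memory+swap limit, if set, caps resident + swapped.
    """
    resident = min(alloc, mem_limit)
    swapped = min(alloc - resident, swap_total)
    if memsw_limit is not None:
        swapped = min(swapped, max(memsw_limit - resident, 0))
    return resident, swapped


# 6G allocation under a 2G memory limit on a system with 4G of swap.
without_memsw = charge(6 * GIB, 2 * GIB, 4 * GIB)
with_memsw = charge(6 * GIB, 2 * GIB, 4 * GIB, memsw_limit=3 * GIB)
print(without_memsw[1] // GIB, with_memsw[1] // GIB)  # swap used, in GiB
```

In this model the task consumes all 4G of swap without a memsw limit and only 1G with memsw.limit_in_bytes=3G. The model does not say what happens to the part of the allocation that fits under neither limit.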
206 | +* why 'memory+swap' rather than swap. | |
156 | 207 | The global LRU(kswapd) can swap out arbitrary pages. Swap-out means |
157 | 208 | to move account from memory to swap...there is no change in usage of |
158 | -mem+swap. In other words, when we want to limit the usage of swap without | |
159 | -affecting global LRU, mem+swap limit is better than just limiting swap from | |
209 | +memory+swap. In other words, when we want to limit the usage of swap without | |
210 | +affecting global LRU, memory+swap limit is better than just limiting swap from | |
160 | 211 | OS point of view. |
161 | 212 | |
162 | 213 | * What happens when a cgroup hits memory.memsw.limit_in_bytes |
163 | 214 | |
... | ... | @@ -168,12 +219,12 @@ |
168 | 219 | |
169 | 220 | 2.5 Reclaim |
170 | 221 | |
171 | -Each cgroup maintains a per cgroup LRU that consists of an active | |
172 | -and inactive list. When a cgroup goes over its limit, we first try | |
222 | +Each cgroup maintains a per cgroup LRU which has the same structure as | |
223 | +global VM. When a cgroup goes over its limit, we first try | |
173 | 224 | to reclaim memory from the cgroup so as to make space for the new |
174 | 225 | pages that the cgroup has touched. If the reclaim is unsuccessful, |
175 | 226 | an OOM routine is invoked to select and kill the bulkiest task in the |
176 | -cgroup. | |
227 | +cgroup. (See 10. OOM Control below.) | |
177 | 228 | |
178 | 229 | The reclaim algorithm has not been modified for cgroups, except that |
179 | 230 | pages that are selected for reclaiming come from the per cgroup LRU |
180 | 231 | |
181 | 232 | |
... | ... | @@ -187,13 +238,19 @@ |
187 | 238 | When oom event notifier is registered, event will be delivered. |
188 | 239 | (See oom_control section) |
189 | 240 | |
190 | -2. Locking | |
241 | +2.6 Locking | |
191 | 242 | |
192 | -The memory controller uses the following hierarchy | |
243 | + lock_page_cgroup()/unlock_page_cgroup() should not be called under | |
244 | + mapping->tree_lock. | |
193 | 245 | |
194 | -1. zone->lru_lock is used for selecting pages to be isolated | |
195 | -2. mem->per_zone->lru_lock protects the per cgroup LRU (per zone) | |
196 | -3. lock_page_cgroup() is used to protect page->page_cgroup | |
246 | + Other lock order is as follows: | |
247 | + PG_locked. | |
248 | + mm->page_table_lock | |
249 | + zone->lru_lock | |
250 | + lock_page_cgroup. | |
251 | + In many cases, just lock_page_cgroup() is called. | |
252 | + per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by | |
253 | + zone->lru_lock, it has no lock of its own. | |
197 | 254 | |
198 | 255 | 3. User Interface |
199 | 256 | |
... | ... | @@ -202,6 +259,7 @@ |
202 | 259 | a. Enable CONFIG_CGROUPS |
203 | 260 | b. Enable CONFIG_RESOURCE_COUNTERS |
204 | 261 | c. Enable CONFIG_CGROUP_MEM_RES_CTLR |
262 | +d. Enable CONFIG_CGROUP_MEM_RES_CTLR_SWAP (to use swap extension) | |
205 | 263 | |
206 | 264 | 1. Prepare the cgroups |
207 | 265 | # mkdir -p /cgroups |
208 | 266 | |
209 | 267 | |
210 | 268 | |
211 | 269 | |
212 | 270 | |
... | ... | @@ -209,31 +267,28 @@ |
209 | 267 | |
210 | 268 | 2. Make the new group and move bash into it |
211 | 269 | # mkdir /cgroups/0 |
212 | -# echo $$ > /cgroups/0/tasks | |
270 | +# echo $$ > /cgroups/0/tasks | |
213 | 271 | |
214 | -Since now we're in the 0 cgroup, | |
215 | -We can alter the memory limit: | |
272 | +Since now we're in the 0 cgroup, we can alter the memory limit: | |
216 | 273 | # echo 4M > /cgroups/0/memory.limit_in_bytes |
217 | 274 | |
218 | 275 | NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, |
219 | -mega or gigabytes. | |
276 | +mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.) | |
277 | + | |
220 | 278 | NOTE: We can write "-1" to reset the *.limit_in_bytes(unlimited). |
221 | 279 | NOTE: We cannot set limits on the root cgroup any more. |
222 | 280 | |
223 | 281 | # cat /cgroups/0/memory.limit_in_bytes |
224 | 282 | 4194304 |
225 | 283 | |
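The suffix rule in the NOTE above (k/K, m/M, g/G as powers of 1024, "-1" meaning unlimited) can be sketched as a small parser. This is an illustration of the rule as the document states it, not the kernel's own parsing code; the helper name `parse_limit` is hypothetical.

```python
# Binary shifts for the suffixes named in the document (KiB, MiB, GiB).
_SUFFIX_SHIFT = {"k": 10, "m": 20, "g": 30}


def parse_limit(text):
    """Convert a limit string such as "4M" to bytes; None means unlimited."""
    text = text.strip()
    if not text:
        raise ValueError("empty limit string")
    if text == "-1":
        return None  # "-1" resets the limit (unlimited)
    shift = _SUFFIX_SHIFT.get(text[-1].lower())
    if shift is None:
        return int(text)  # plain byte count, no suffix
    return int(text[:-1]) << shift


print(parse_limit("4M"))
```

For "4M" this gives 4194304, the value read back from memory.limit_in_bytes in the example above.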
226 | -NOTE: The interface has now changed to display the usage in bytes | |
227 | -instead of pages | |
228 | - | |
229 | 284 | We can check the usage: |
230 | 285 | # cat /cgroups/0/memory.usage_in_bytes |
231 | 286 | 1216512 |
232 | 287 | |
233 | 288 | A successful write to this file does not guarantee a successful set of |
234 | -this limit to the value written into the file. This can be due to a | |
289 | +this limit to the value written into the file. This can be due to a | |
235 | 290 | number of factors, such as rounding up to page boundaries or the total |
236 | -availability of memory on the system. The user is required to re-read | |
291 | +availability of memory on the system. The user is required to re-read | |
237 | 292 | this file after a write to guarantee the value committed by the kernel. |
238 | 293 | |
239 | 294 | # echo 1 > memory.limit_in_bytes |
240 | 295 | |
241 | 296 | |
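Because a successful write does not guarantee the exact value, a caller can write the limit and then re-read the file, as the text requires. A minimal sketch, assuming the limit file reads back as a plain decimal byte count (as in the `cat` output above); the helper name `set_limit` is hypothetical.

```python
from pathlib import Path


def set_limit(limit_file, value):
    """Write `value` to a *.limit_in_bytes file and return what was committed.

    The returned number comes from re-reading the file, so it reflects any
    rounding applied on the kernel side rather than the value requested.
    """
    path = Path(limit_file)
    path.write_text(f"{value}\n")
    return int(path.read_text().strip())
```

A call such as `set_limit("/cgroups/0/memory.limit_in_bytes", "4M")` would return the committed byte count; callers should compare it with the requested value instead of assuming they match.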
... | ... | @@ -248,15 +303,23 @@ |
248 | 303 | |
249 | 304 | 4. Testing |
250 | 305 | |
251 | -Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11]. | |
252 | -Apart from that v6 has been tested with several applications and regular | |
253 | -daily use. The controller has also been tested on the PPC64, x86_64 and | |
254 | -UML platforms. | |
306 | +For testing features and implementation, see memcg_test.txt. | |
255 | 307 | |
308 | +Performance test is also important. To see pure memory controller's overhead, | |
309 | +testing on tmpfs will give you good numbers of small overheads. | |
310 | +Example: do kernel make on tmpfs. | |
311 | + | |
312 | +Page-fault scalability is also important. When measuring parallel | |
313 | +page faults, a multi-process test may be better than a multi-thread | |
314 | +test because the latter has noise from shared objects/status. | |
315 | + | |
316 | +But the above two are testing extreme situations. | |
317 | +Trying usual test under memory controller is always helpful. | |
318 | + | |
256 | 319 | 4.1 Troubleshooting |
257 | 320 | |
258 | 321 | Sometimes a user might find that the application under a cgroup is |
259 | -terminated. There are several causes for this: | |
322 | +terminated by OOM killer. There are several causes for this: | |
260 | 323 | |
261 | 324 | 1. The cgroup limit is too low (just too low to do anything useful) |
262 | 325 | 2. The user is using anonymous memory and swap is turned off or too low |
... | ... | @@ -264,6 +327,9 @@ |
264 | 327 | A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of |
265 | 328 | some of the pages cached in the cgroup (page cache pages). |
266 | 329 | |
330 | +To know what happens, disabling OOM_Kill by 10. OOM Control (see below) and | |
331 | +seeing what happens will be helpful. | |
332 | + | |
267 | 333 | 4.2 Task migration |
268 | 334 | |
269 | 335 | When a task migrates from one cgroup to another, its charge is not |
270 | 336 | |
271 | 337 | |
... | ... | @@ -271,17 +337,20 @@ |
271 | 337 | remain charged to it, the charge is dropped when the page is freed or |
272 | 338 | reclaimed. |
273 | 339 | |
274 | -Note: You can move charges of a task along with task migration. See 8. | |
340 | +You can move charges of a task along with task migration. | |
341 | +See 8. "Move charges at task migration" | |
275 | 342 | |
276 | 343 | 4.3 Removing a cgroup |
277 | 344 | |
278 | 345 | A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a |
279 | 346 | cgroup might have some charge associated with it, even though all |
280 | -tasks have migrated away from it. | |
281 | -Such charges are freed(at default) or moved to its parent. When moved, | |
282 | -both of RSS and CACHES are moved to parent. | |
283 | -If both of them are busy, rmdir() returns -EBUSY. See 5.1 Also. | |
347 | +tasks have migrated away from it. (because we charge against pages, not | |
348 | +against tasks.) | |
284 | 349 | |
350 | +Such charges are freed or moved to their parent. At moving, both of RSS | |
351 | +and CACHES are moved to parent. | |
352 | +rmdir() may return -EBUSY if freeing/moving fails. See 5.1 also. | |
353 | + | |
285 | 354 | Charges recorded in swap information is not updated at removal of cgroup. |
286 | 355 | Recorded information is discarded and a cgroup which uses swap (swapcache) |
287 | 356 | will be charged as a new owner of it. |
... | ... | @@ -296,10 +365,10 @@ |
296 | 365 | |
297 | 366 | # echo 0 > memory.force_empty |
298 | 367 | |
299 | - Almost all pages tracked by this memcg will be unmapped and freed. Some of | |
300 | - pages cannot be freed because it's locked or in-use. Such pages are moved | |
301 | - to parent and this cgroup will be empty. But this may return -EBUSY in | |
302 | - some too busy case. | |
368 | + Almost all pages tracked by this memory cgroup will be unmapped and freed. | |
369 | + Some pages cannot be freed because they are locked or in-use. Such pages are | |
370 | + moved to parent and this cgroup will be empty. This may return -EBUSY if | |
371 | + VM is too busy to free/move all pages immediately. | |
303 | 372 | |
304 | 373 | Typical use case of this interface is that calling this before rmdir(). |
305 | 374 | Because rmdir() moves all pages to parent, some out-of-use page caches can be |
306 | 375 | |
307 | 376 | |
308 | 377 | |
309 | 378 | |
310 | 379 | |
... | ... | @@ -309,20 +378,42 @@ |
309 | 378 | |
310 | 379 | memory.stat file includes following statistics |
311 | 380 | |
381 | +# per-memory cgroup local status | |
312 | 382 | cache - # of bytes of page cache memory. |
313 | 383 | rss - # of bytes of anonymous and swap cache memory. |
384 | +mapped_file - # of bytes of mapped file (includes tmpfs/shmem) | |
314 | 385 | pgpgin - # of pages paged in (equivalent to # of charging events). |
315 | 386 | pgpgout - # of pages paged out (equivalent to # of uncharging events). |
316 | -active_anon - # of bytes of anonymous and swap cache memory on active | |
317 | - lru list. | |
387 | +swap - # of bytes of swap usage | |
318 | 388 | inactive_anon - # of bytes of anonymous memory and swap cache memory on |
319 | - inactive lru list. | |
320 | -active_file - # of bytes of file-backed memory on active lru list. | |
321 | -inactive_file - # of bytes of file-backed memory on inactive lru list. | |
389 | + inactive LRU list. | |
390 | +active_anon - # of bytes of anonymous and swap cache memory on active | |
391 | + LRU list. | |
392 | +inactive_file - # of bytes of file-backed memory on inactive LRU list. | |
393 | +active_file - # of bytes of file-backed memory on active LRU list. | |
322 | 394 | unevictable - # of bytes of memory that cannot be reclaimed (mlocked etc). |
323 | 395 | |
324 | -The following additional stats are dependent on CONFIG_DEBUG_VM. | |
396 | +# status considering hierarchy (see memory.use_hierarchy settings) | |
325 | 397 | |
398 | +hierarchical_memory_limit - # of bytes of memory limit with regard to hierarchy | |
399 | + under which the memory cgroup is | |
400 | +hierarchical_memsw_limit - # of bytes of memory+swap limit with regard to | |
401 | + hierarchy under which memory cgroup is. | |
402 | + | |
403 | +total_cache - sum of all children's "cache" | |
404 | +total_rss - sum of all children's "rss" | |
405 | +total_mapped_file - sum of all children's "mapped_file" | |
406 | +total_pgpgin - sum of all children's "pgpgin" | |
407 | +total_pgpgout - sum of all children's "pgpgout" | |
408 | +total_swap - sum of all children's "swap" | |
409 | +total_inactive_anon - sum of all children's "inactive_anon" | |
410 | +total_active_anon - sum of all children's "active_anon" | |
411 | +total_inactive_file - sum of all children's "inactive_file" | |
412 | +total_active_file - sum of all children's "active_file" | |
413 | +total_unevictable - sum of all children's "unevictable" | |
414 | + | |
415 | +# The following additional stats are dependent on CONFIG_DEBUG_VM. | |
416 | + | |
326 | 417 | inactive_ratio - VM internal parameter. (see mm/page_alloc.c) |
327 | 418 | recent_rotated_anon - VM internal parameter. (see mm/vmscan.c) |
328 | 419 | recent_rotated_file - VM internal parameter. (see mm/vmscan.c) |
329 | 420 | |
330 | 421 | |
331 | 422 | |
332 | 423 | |
333 | 424 | |
... | ... | @@ -330,25 +421,38 @@ |
330 | 421 | recent_scanned_file - VM internal parameter. (see mm/vmscan.c) |
331 | 422 | |
332 | 423 | Memo: |
333 | - recent_rotated means recent frequency of lru rotation. | |
334 | - recent_scanned means recent # of scans to lru. | |
424 | + recent_rotated means recent frequency of LRU rotation. | |
425 | + recent_scanned means recent # of scans to LRU. | |
335 | 426 | showing for better debug please see the code for meanings. |
336 | 427 | |
337 | 428 | Note: |
338 | 429 | Only anonymous and swap cache memory is listed as part of 'rss' stat. |
339 | 430 | This should not be confused with the true 'resident set size' or the |
340 | - amount of physical memory used by the cgroup. Per-cgroup rss | |
341 | - accounting is not done yet. | |
431 | + amount of physical memory used by the cgroup. | |
432 | + 'rss + mapped_file' will give you resident set size of cgroup. | |
433 | + (Note: file and shmem may be shared among other cgroups. In that case, | |
434 | + mapped_file is accounted only when the memory cgroup is owner of page | |
435 | + cache.) | |
342 | 436 | |
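The resident set size rule in the note above (rss plus mapped file pages) can be computed from memory.stat. The sketch assumes the file holds one "name value" pair per line, which this document does not spell out, and the sample numbers are made up for illustration. The helper names are hypothetical.

```python
def parse_memory_stat(text):
    """Parse memory.stat content into a dict of counters.

    Assumes one "name value" pair per line; lines without a value are skipped.
    """
    stats = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if value:
            stats[name] = int(value)
    return stats


def resident_bytes(stats):
    """Resident set size of the cgroup: anonymous/swap cache plus mapped file."""
    return stats["rss"] + stats["mapped_file"]


# Made-up sample content, not real measurements.
sample = "cache 1048576\nrss 2097152\nmapped_file 524288\n"
print(resident_bytes(parse_memory_stat(sample)))
```

As the note says, shared file and shmem pages are counted only in the cgroup that owns the page cache, so this sum can understate what the tasks actually map.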
343 | 437 | 5.3 swappiness |
344 | - Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only. | |
345 | 438 | |
346 | - Following cgroups' swappiness can't be changed. | |
347 | - - root cgroup (uses /proc/sys/vm/swappiness). | |
348 | - - a cgroup which uses hierarchy and it has child cgroup. | |
349 | - - a cgroup which uses hierarchy and not the root of hierarchy. | |
439 | +Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only. | |
350 | 440 | |
441 | +Following cgroups' swappiness can't be changed. | |
442 | +- root cgroup (uses /proc/sys/vm/swappiness). | |
443 | +- a cgroup which uses hierarchy and it has other cgroup(s) below it. | |
444 | +- a cgroup which uses hierarchy and not the root of hierarchy. | |
351 | 445 | |
446 | +5.4 failcnt | |
447 | + | |
448 | +A memory cgroup provides memory.failcnt and memory.memsw.failcnt files. | |
449 | +This failcnt(== failure count) shows the number of times that a usage counter | |
450 | +hit its limit. When a memory cgroup hits a limit, failcnt increases and | |
451 | +memory under it will be reclaimed. | |
452 | + | |
453 | +You can reset failcnt by writing 0 to failcnt file. | |
454 | +# echo 0 > .../memory.failcnt | |
455 | + | |
352 | 456 | 6. Hierarchy support |
353 | 457 | |
354 | 458 | The memory controller supports a deep hierarchy and hierarchical accounting. |
355 | 459 | |
... | ... | @@ -366,13 +470,13 @@ |
366 | 470 | |
367 | 471 | In the diagram above, with hierarchical accounting enabled, all memory |
368 | 472 | usage of e, is accounted to its ancestors up until the root (i.e, c and root), |
369 | -that has memory.use_hierarchy enabled. If one of the ancestors goes over its | |
473 | +that has memory.use_hierarchy enabled. If one of the ancestors goes over its | |
370 | 474 | limit, the reclaim algorithm reclaims from the tasks in the ancestor and the |
371 | 475 | children of the ancestor. |
372 | 476 | |
373 | 477 | 6.1 Enabling hierarchical accounting and reclaim |
374 | 478 | |
375 | -The memory controller by default disables the hierarchy feature. Support | |
479 | +A memory cgroup by default disables the hierarchy feature. Support | |
376 | 480 | can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup |
377 | 481 | |
378 | 482 | # echo 1 > memory.use_hierarchy |
379 | 483 | |
... | ... | @@ -382,10 +486,10 @@ |
382 | 486 | # echo 0 > memory.use_hierarchy |
383 | 487 | |
384 | 488 | NOTE1: Enabling/disabling will fail if the cgroup already has other |
385 | -cgroups created below it. | |
489 | + cgroups created below it. | |
386 | 490 | |
387 | 491 | NOTE2: When panic_on_oom is set to "2", the whole system will panic in |
388 | -case of an oom event in any cgroup. | |
492 | + case of an OOM event in any cgroup. | |
389 | 493 | |
390 | 494 | 7. Soft limits |
391 | 495 | |
... | ... | @@ -395,7 +499,7 @@ |
395 | 499 | a. There is no memory contention |
396 | 500 | b. They do not exceed their hard limit |
397 | 501 | |
398 | -When the system detects memory contention or low memory control groups | |
502 | +When the system detects memory contention or low memory, control groups | |
399 | 503 | are pushed back to their soft limits. If the soft limit of each control |
400 | 504 | group is very high, they are pushed back as much as possible to make |
401 | 505 | sure that one control group does not starve the others of memory. |
... | ... | @@ -409,7 +513,7 @@ |
409 | 513 | 7.1 Interface |
410 | 514 | |
411 | 515 | Soft limits can be setup by using the following commands (in this example we |
412 | -assume a soft limit of 256 megabytes) | |
516 | +assume a soft limit of 256 MiB) | |
413 | 517 | |
414 | 518 | # echo 256M > memory.soft_limit_in_bytes |
415 | 519 | |
... | ... | @@ -445,7 +549,7 @@ |
445 | 549 | Note: If we cannot find enough space for the task in the destination cgroup, we |
446 | 550 | try to make space by reclaiming memory. Task migration may fail if we |
447 | 551 | cannot make enough space. |
448 | -Note: It can take several seconds if you move charges in giga bytes order. | |
552 | +Note: It can take several seconds if you move a large amount of charges. | |
449 | 553 | |
450 | 554 | And if you want disable it again: |
451 | 555 | |
... | ... | @@ -465,7 +569,7 @@ |
465 | 569 | | enable Swap Extension(see 2.4) to enable move of swap charges. |
466 | 570 | -----+------------------------------------------------------------------------ |
467 | 571 | 1 | A charge of file pages(normal file, tmpfs file(e.g. ipc shared memory) |
468 | - | and swaps of tmpfs file) mmaped by the target task. Unlike the case of | |
572 | + | and swaps of tmpfs file) mmapped by the target task. Unlike the case of | |
469 | 573 | | anonymous pages, file pages(and swaps) in the range mmapped by the task |
470 | 574 | | will be moved even if the task hasn't done page fault, i.e. they might |
471 | 575 | | not be the task's "RSS", but other task's "RSS" that maps the same file. |
472 | 576 | |
... | ... | @@ -482,15 +586,15 @@ |
482 | 586 | |
483 | 587 | 9. Memory thresholds |
484 | 588 | |
485 | -Memory controler implements memory thresholds using cgroups notification | |
589 | +Memory cgroup implements memory thresholds using cgroups notification | |
486 | 590 | API (see cgroups.txt). It allows to register multiple memory and memsw |
487 | 591 | thresholds and gets notifications when it crosses. |
488 | 592 | |
489 | 593 | To register a threshold application need: |
490 | - - create an eventfd using eventfd(2); | |
491 | - - open memory.usage_in_bytes or memory.memsw.usage_in_bytes; | |
492 | - - write string like "<event_fd> <memory.usage_in_bytes> <threshold>" to | |
493 | - cgroup.event_control. | |
594 | +- create an eventfd using eventfd(2); | |
595 | +- open memory.usage_in_bytes or memory.memsw.usage_in_bytes; | |
596 | +- write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to | |
597 | + cgroup.event_control. | |
494 | 598 | |
495 | 599 | Application will be notified through eventfd when memory usage crosses |
496 | 600 | threshold in any direction. |
497 | 601 | |
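The three registration steps above can be sketched as follows. This assumes Python 3.10 or later on Linux for `os.eventfd`, and a cgroup directory that exposes `memory.usage_in_bytes` and `cgroup.event_control`; the helper names `threshold_request` and `register_threshold` are hypothetical.

```python
import os


def threshold_request(event_fd, usage_fd, threshold):
    """Build the "<event_fd> <fd of memory.usage_in_bytes> <threshold>" string."""
    return f"{event_fd} {usage_fd} {threshold}"


def register_threshold(cgroup_dir, threshold):
    """Register a memory usage threshold; return (event_fd, usage_fd).

    Both descriptors stay open so the caller can wait on event_fd and
    close them when done.
    """
    event_fd = os.eventfd(0)  # step 1: create an eventfd
    usage_fd = os.open(  # step 2: open the usage file
        os.path.join(cgroup_dir, "memory.usage_in_bytes"), os.O_RDONLY
    )
    # Step 3: hand both descriptors and the threshold to the kernel.
    with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as control:
        control.write(threshold_request(event_fd, usage_fd, threshold))
    return event_fd, usage_fd
```

A caller would then block in `os.eventfd_read(event_fd)` until usage crosses the threshold. Registering for memory+swap would open memory.memsw.usage_in_bytes instead.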
498 | 602 | |
499 | 603 | |
500 | 604 | |
501 | 605 | |
... | ... | @@ -501,27 +605,28 @@ |
501 | 605 | |
502 | 606 | memory.oom_control file is for OOM notification and other controls. |
503 | 607 | |
504 | -Memory controler implements oom notifier using cgroup notification | |
505 | -API (See cgroups.txt). It allows to register multiple oom notification | |
506 | -delivery and gets notification when oom happens. | |
608 | +Memory cgroup implements OOM notifier using cgroup notification | |
609 | +API (See cgroups.txt). It allows to register multiple OOM notification | |
610 | +delivery and gets notification when OOM happens. | |
507 | 611 | |
508 | 612 | To register a notifier, application need: |
509 | 613 | - create an eventfd using eventfd(2) |
510 | 614 | - open memory.oom_control file |
511 | - - write string like "<event_fd> <memory.oom_control>" to cgroup.event_control | |
615 | + - write string like "<event_fd> <fd of memory.oom_control>" to | |
616 | + cgroup.event_control | |
512 | 617 | |
513 | -Application will be notifier through eventfd when oom happens. | |
618 | +Application will be notified through eventfd when OOM happens. | |
514 | 619 | OOM notification doesn't work for root cgroup. |
515 | 620 | |
516 | -You can disable oom-killer by writing "1" to memory.oom_control file. | |
517 | -As. | |
621 | +You can disable OOM-killer by writing "1" to memory.oom_control file, as: | |
622 | + | |
518 | 623 | #echo 1 > memory.oom_control |
519 | 624 | |
520 | -This operation is only allowed to the top cgroup of subhierarchy. | |
521 | -If oom-killer is disabled, tasks under cgroup will hang/sleep | |
522 | -in memcg's oom-waitq when they request accountable memory. | |
625 | +This operation is only allowed to the top cgroup of sub-hierarchy. | |
626 | +If OOM-killer is disabled, tasks under cgroup will hang/sleep | |
627 | +in memory cgroup's OOM-waitqueue when they request accountable memory. | |
523 | 628 | |
524 | -For running them, you have to relax the memcg's oom sitaution by | |
629 | +For running them, you have to relax the memory cgroup's OOM status by | |
525 | 630 | * enlarge limit or reduce usage. |
526 | 631 | To reduce usage, |
527 | 632 | * kill some tasks. |
... | ... | @@ -532,7 +637,7 @@ |
532 | 637 | |
533 | 638 | At reading, current status of OOM is shown. |
534 | 639 | oom_kill_disable 0 or 1 (if 1, oom-killer is disabled) |
535 | - under_oom 0 or 1 (if 1, the memcg is under OOM,tasks may | |
640 | + under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may | |
536 | 641 | be stopped.) |
537 | 642 | |
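The status shown when reading memory.oom_control can be turned into booleans with a few lines. A minimal sketch, assuming each line is a "name value" pair as in the listing above; the helper name `parse_oom_control` is hypothetical.

```python
def parse_oom_control(text):
    """Map each "name 0|1" line of memory.oom_control to a boolean."""
    status = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            status[fields[0]] = fields[1] == "1"
    return status


# Sample content: OOM-killer disabled, cgroup not currently under OOM.
print(parse_oom_control("oom_kill_disable 1\nunder_oom 0\n"))
```

A monitoring script could poll this to tell whether tasks in the cgroup may be stopped (under_oom) while the OOM-killer is disabled.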
538 | 643 | 11. TODO |