Eric Lee / smarc-ti-linux-kernel

Memory Resource Controller

NOTE: The Memory Resource Controller has generically been referred to as the
      memory controller in this document. Do not confuse memory controller
      used here with the memory controller that is used in hardware.

(For editors)
In this document:
      When we mention a cgroup (cgroupfs's directory) with memory controller,
      we call it "memory cgroup". When you look at git-log and source code,
      you'll see that patch titles and function names tend to use "memcg".
      In this document, we avoid using it.

Benefits and Purpose of the memory controller

The memory controller isolates the memory behaviour of a group of tasks
from the rest of the system. The article on LWN [12] mentions some probable
uses of the memory controller. The memory controller can be used to:

a. Isolate an application or a group of applications
   Memory-hungry applications can be isolated and limited to a smaller
   amount of memory.
b. Create a cgroup with a limited amount of memory; this can be used
   as a good alternative to booting with mem=XXXX.
c. Virtualization solutions can control the amount of memory they want
   to assign to a virtual machine instance.
d. A CD/DVD burner could control the amount of memory used by the
   rest of the system to ensure that burning does not fail due to lack
   of available memory.
e. There are several other use cases; find one or use the controller just
   for fun (to learn and hack on the VM subsystem).

Current Status: linux-2.6.34-mmotm (development version of 2010/April)

Features:
 - accounting anonymous pages, file caches, swap caches usage and limiting them.
 - pages are linked to per-memcg LRU exclusively, and there is no global LRU.
 - optionally, memory+swap usage can be accounted and limited.
 - hierarchical accounting
 - soft limit
 - moving (recharging) charges when a task is moved is selectable.
 - usage threshold notifier
 - oom-killer disable knob and oom-notifier
 - Root cgroup has no limit controls.

 Kernel memory support is a work in progress, and the current version provides
 basic functionality. (See Section 2.7)

Brief summary of control files.

 tasks                              # attach a task (thread) and show list of threads
 cgroup.procs                       # show list of processes
 cgroup.event_control               # an interface for eventfd()
 memory.usage_in_bytes              # show current res_counter usage for memory
                                      (See 5.5 for details)
 memory.memsw.usage_in_bytes        # show current res_counter usage for memory+Swap
                                      (See 5.5 for details)
 memory.limit_in_bytes              # set/show limit of memory usage
 memory.memsw.limit_in_bytes        # set/show limit of memory+Swap usage
 memory.failcnt                     # show the number of times memory usage hit the limit
 memory.memsw.failcnt               # show the number of times memory+Swap usage hit the limit
 memory.max_usage_in_bytes          # show max memory usage recorded
 memory.memsw.max_usage_in_bytes    # show max memory+Swap usage recorded
 memory.soft_limit_in_bytes         # set/show soft limit of memory usage
 memory.stat                        # show various statistics
 memory.use_hierarchy               # set/show hierarchical account enabled
 memory.force_empty                 # trigger forced move charge to parent
 memory.swappiness                  # set/show swappiness parameter of vmscan
                                      (See sysctl's vm.swappiness)
 memory.move_charge_at_immigrate    # set/show controls of moving charges
 memory.oom_control                 # set/show oom controls.
 memory.numa_stat                   # show memory usage per numa node

 memory.kmem.limit_in_bytes         # set/show hard limit for kernel memory
 memory.kmem.usage_in_bytes         # show current kernel memory allocation
 memory.kmem.failcnt                # show the number of times kernel memory usage hit the limit
 memory.kmem.max_usage_in_bytes     # show max kernel memory usage recorded

 memory.kmem.tcp.limit_in_bytes     # set/show hard limit for tcp buf memory
 memory.kmem.tcp.usage_in_bytes     # show current tcp buf memory allocation
 memory.kmem.tcp.failcnt            # show the number of times tcp buf memory usage hit the limit
 memory.kmem.tcp.max_usage_in_bytes # show max tcp buf memory usage recorded
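
The control files are plain text, so ordinary shell tools can read them. As
a minimal sketch, the helper below prints the main memory counters of one
memory cgroup. The function name memcg_summary and its directory argument
are assumptions made for illustration; only the file names come from the
summary above.

```shell
# Hypothetical helper (not shipped with the kernel): print the main memory
# counters of one memory cgroup. $1 is the cgroup directory, for example
# /sys/fs/cgroup/memory/0 when mounted as described in section 3.
memcg_summary() {
    dir=$1
    for f in memory.usage_in_bytes memory.limit_in_bytes \
             memory.max_usage_in_bytes memory.failcnt; do
        # Skip files that are missing or unreadable.
        if [ -r "$dir/$f" ]; then
            printf '%s %s\n' "$f" "$(cat "$dir/$f")"
        fi
    done
}
```

Calling "memcg_summary /sys/fs/cgroup/memory/0" prints one "name value"
pair per line for the files that exist.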

1. History

The memory controller has a long history. A request for comments for the memory
controller was posted by Balbir Singh [1]. At the time the RFC was posted
there were several implementations for memory control. The goal of the
RFC was to build consensus and agreement for the minimal features required
for memory control. The first RSS controller was posted by Balbir Singh [2]
in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
RSS controller. At OLS, at the resource management BoF, everyone suggested
that we handle both page cache and RSS together. Another request was raised
to allow user space handling of OOM. The current memory controller is
at version 6; it combines both mapped (RSS) and unmapped Page
Cache Control [11].

2. Memory Control

Memory is a unique resource in the sense that it is present in a limited
amount. If a task requires a lot of CPU processing, the task can spread
its processing over a period of hours, days, months or years, but with
memory, the same physical memory needs to be reused to accomplish the task.

The memory controller implementation has been divided into phases. These
are:

1. Memory controller
2. mlock(2) controller
3. Kernel user memory accounting and slab control
4. user mappings length controller

The memory controller is the first controller developed.

2.1. Design

The core of the design is a counter called the res_counter. The res_counter
tracks the current memory usage and limit of the group of processes associated
with the controller. Each cgroup has a memory controller specific data
structure (mem_cgroup) associated with it.
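
To make the role of the counter concrete, here is a simplified user-space
model of a res_counter written as shell functions. It is a sketch under
stated assumptions, not the kernel implementation: the rc_* names are
invented for this example, locking and hierarchy are ignored, and the
failcnt and max_usage fields are included only because the control files
memory.failcnt and memory.max_usage_in_bytes expose such values.

```shell
# Simplified model of a res_counter (illustrative only, not kernel code).
rc_usage=0; rc_limit=0; rc_max_usage=0; rc_failcnt=0

# rc_init LIMIT: reset the counter and set its limit in bytes.
rc_init() {
    rc_usage=0; rc_limit=$1; rc_max_usage=0; rc_failcnt=0
}

# rc_charge BYTES: returns 0 on success, 1 if the limit would be exceeded.
rc_charge() {
    if [ $((rc_usage + $1)) -gt "$rc_limit" ]; then
        rc_failcnt=$((rc_failcnt + 1))    # count the failed attempt
        return 1
    fi
    rc_usage=$((rc_usage + $1))
    if [ "$rc_usage" -gt "$rc_max_usage" ]; then
        rc_max_usage=$rc_usage            # remember the high-water mark
    fi
    return 0
}

# rc_uncharge BYTES: give bytes back; usage never drops below zero.
rc_uncharge() {
    if [ "$1" -gt "$rc_usage" ]; then
        rc_usage=0
    else
        rc_usage=$((rc_usage - $1))
    fi
}
```

In this model a charge that would exceed the limit is rejected as a whole
rather than applied partially; the caller decides what to do next, which in
the memory controller means trying reclaim (see section 2.5).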

2.2. Accounting

                +--------------------+
                |  mem_cgroup        |
                |  (res_counter)     |
                +--------------------+
                 /            ^      \
                /             |       \
           +---------------+  |        +---------------+
           | mm_struct     |  |....    | mm_struct     |
           |               |  |        |               |
           +---------------+  |        +---------------+
                              |
                              + --------------+
                                              |
           +---------------+           +------+--------+
           | page          +---------->| page_cgroup   |
           |               |           |               |
           +---------------+           +---------------+

             (Figure 1: Hierarchy of Accounting)

Figure 1 shows the important aspects of the controller:

1. Accounting happens per cgroup
2. Each mm_struct knows about which cgroup it belongs to
3. Each page has a pointer to the page_cgroup, which in turn knows the
   cgroup it belongs to

The accounting is done as follows: mem_cgroup_charge_common() is invoked to
set up the necessary data structures and check if the cgroup that is being
charged is over its limit. If it is, then reclaim is invoked on the cgroup.
More details can be found in the reclaim section of this document.
If everything goes well, a page meta-data-structure called page_cgroup is
updated. page_cgroup has its own LRU on cgroup.
(*) page_cgroup structure is allocated at boot/memory-hotplug time.

2.2.1 Accounting details

All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
Some pages which are never reclaimable and will not be on the LRU
are not accounted. We just account pages under usual VM management.

RSS pages are accounted at page_fault unless they've already been accounted
for earlier. A file page will be accounted for as Page Cache when it's
inserted into inode (radix-tree). While it's mapped into the page tables of
processes, duplicate accounting is carefully avoided.

An RSS page is unaccounted when it's fully unmapped. A PageCache page is
unaccounted when it's removed from radix-tree. Even if RSS pages are fully
unmapped (by kswapd), they may exist as SwapCache in the system until they
are really freed. Such SwapCaches are also accounted.
A swapped-in page is not accounted until it's mapped.

Note: The kernel does swapin-readahead and reads multiple swaps at once.
This means swapped-in pages may include pages belonging to tasks other than
the one causing the page fault. So, we avoid accounting at swap-in I/O.

At page migration, accounting information is kept.

Note: we just account pages-on-LRU because our purpose is to control the
amount of used pages; not-on-LRU pages tend to be out-of-control from VM view.

2.3 Shared Page Accounting

Shared pages are accounted on the basis of the first touch approach. The
cgroup that first touches a page is accounted for the page. The principle
behind this approach is that a cgroup that aggressively uses a shared
page will eventually get charged for it (once it is uncharged from
the cgroup that brought it in -- this will happen on memory pressure).

But see section 8.2: when moving a task to another cgroup, its pages may
be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.

Exception: If CONFIG_MEMCG_SWAP is not used.
When you do swapoff and force swapped-out pages of shmem(tmpfs) back
into memory, charges for those pages are accounted against the
caller of swapoff rather than the users of shmem.
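
The first-touch rule from the start of this section can be illustrated with
a small model. The sketch below is an assumption-laden toy, not kernel
logic: it takes a stream of "page cgroup" touch events and reports which
cgroup each page is charged to. It deliberately ignores uncharging under
memory pressure and move_charge_at_immigrate. The function name
first_touch_owners is made up for this example.

```shell
# Input (stdin): one "page cgroup" pair per line, in the order touches occur.
# Output: one "page owner" line per page; the owner is the first toucher.
first_touch_owners() {
    awk '!($1 in owner) { owner[$1] = $2; order[++n] = $1 }
         END { for (i = 1; i <= n; i++) print order[i], owner[order[i]] }'
}
```

For example, if cgroup A touches page p1 first and cgroup B touches it
afterwards, p1 stays charged to A in this model.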

2.4 Swap Extension (CONFIG_MEMCG_SWAP)

Swap Extension allows you to record charge for swap. A swapped-in page is
charged back to original page allocator if possible.

When swap is accounted, the following files are added.
 - memory.memsw.usage_in_bytes.
 - memory.memsw.limit_in_bytes.

memsw means memory+swap. Usage of memory+swap is limited by
memsw.limit_in_bytes.

Example: Assume a system with 4G of swap. A task which allocates 6G of memory
(by mistake) under 2G memory limitation will use all swap.
In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.
By using the memsw limit, you can avoid system OOM which can be caused by swap
shortage.
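
The arithmetic behind this example can be sketched as follows. This is a
rough model only: all values are whole gigabytes, -1 is used here as a
private convention for "no memsw limit", and swap-out caused by global
reclaim is ignored. The function name cgroup_swap_use is hypothetical.

```shell
# cgroup_swap_use ALLOC MEM_LIMIT MEMSW_LIMIT SWAP_TOTAL   (all in GiB)
# Prints how much swap the cgroup ends up using in this simplified model.
cgroup_swap_use() {
    alloc=$1; mem_limit=$2; memsw_limit=$3; swap_total=$4
    total=$alloc
    # A memory+swap limit caps the total the cgroup can be charged for.
    if [ "$memsw_limit" != "-1" ] && [ "$total" -gt "$memsw_limit" ]; then
        total=$memsw_limit
    fi
    # Whatever does not fit under the memory limit has to go to swap.
    swap=$((total - mem_limit))
    if [ "$swap" -lt 0 ]; then
        swap=0
    fi
    if [ "$swap" -gt "$swap_total" ]; then
        swap=$swap_total
    fi
    echo "$swap"
}
```

With the numbers from the example, "cgroup_swap_use 6 2 -1 4" prints 4 (all
swap is consumed), while "cgroup_swap_use 6 2 3 4" prints 1.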

* why 'memory+swap' rather than swap.
The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
to move account from memory to swap...there is no change in usage of
memory+swap. In other words, when we want to limit the usage of swap without
affecting global LRU, memory+swap limit is better than just limiting swap from
an OS point of view.

* What happens when a cgroup hits memory.memsw.limit_in_bytes
When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
in this cgroup. Then, swap-out will not be done by cgroup routine and file
caches are dropped. But as mentioned above, global LRU can still swap out
memory from it for sanity of the system's memory management state. You can't
forbid it by cgroup.

2.5 Reclaim

Each cgroup maintains a per cgroup LRU which has the same structure as
global VM. When a cgroup goes over its limit, we first try
to reclaim memory from the cgroup so as to make space for the new
pages that the cgroup has touched. If the reclaim is unsuccessful,
an OOM routine is invoked to select and kill the bulkiest task in the
cgroup. (See 10. OOM Control below.)

The reclaim algorithm has not been modified for cgroups, except that
pages that are selected for reclaiming come from the per-cgroup LRU
list.

NOTE: Reclaim does not work for the root cgroup, since we cannot set any
limits on the root cgroup.

Note2: When panic_on_oom is set to "2", the whole system will panic.

When oom event notifier is registered, event will be delivered.
(See oom_control section)
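
The decision flow described in this section (charge, then reclaim, then
OOM) can be modelled in a few lines. The sketch below is illustrative only:
charge_outcome and pick_bulkiest are invented names, "reclaimable" stands
for however many bytes reclaim could free, and "bulkiest" is approximated
as the task with the largest resident size, which is a simplification of
what the kernel's OOM routine does.

```shell
# charge_outcome USAGE LIMIT RECLAIMABLE NEED   (all in bytes)
# Prints "charged", "reclaimed" or "oom" for a new charge of NEED bytes.
charge_outcome() {
    usage=$1; limit=$2; reclaimable=$3; need=$4
    over=$((usage + need - limit))        # bytes above the limit, if any
    if [ "$over" -le 0 ]; then
        echo charged                      # fits without reclaim
    elif [ "$reclaimable" -ge "$over" ]; then
        echo reclaimed                    # reclaim frees enough room
    else
        echo oom                          # reclaim failed; OOM is invoked
    fi
}

# pick_bulkiest: read "pid rss_bytes" lines on stdin, print the largest pid.
pick_bulkiest() {
    awk 'NR == 1 || $2 + 0 > max { max = $2 + 0; pid = $1 }
         END { if (NR) print pid }'
}
```

For instance, "charge_outcome 180 200 10 50" prints oom, because the charge
overshoots the limit by 30 bytes and only 10 can be reclaimed.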

2.6 Locking

   lock_page_cgroup()/unlock_page_cgroup() should not be called under
   mapping->tree_lock.

   The remaining lock order is as follows:
   PG_locked.
   mm->page_table_lock
       zone->lru_lock
          lock_page_cgroup.
  In many cases, just lock_page_cgroup() is called.
  per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
  zone->lru_lock, it has no lock of its own.

2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)

With the Kernel memory extension, the Memory Controller is able to limit
the amount of kernel memory used by the system. Kernel memory is fundamentally
different from user memory, since it can't be swapped out, which makes it
possible to DoS the system by consuming too much of this precious resource.

Kernel memory won't be accounted at all until a limit on a group is set. This
allows for existing setups to continue working without disruption. The limit
cannot be set if the cgroup has children, or if there are already tasks in the
cgroup. Attempting to set the limit under those conditions will return -EBUSY.
When use_hierarchy == 1 and a group is accounted, its children will
automatically be accounted regardless of their limit value.

After a group is first limited, it will be kept being accounted until it
is removed. The memory limitation itself can of course be removed by writing
-1 to memory.kmem.limit_in_bytes. In this case, kmem will be accounted, but not
limited.

Kernel memory limits are not imposed for the root cgroup. Usage for the root
cgroup may or may not be accounted. The memory used is accumulated into
memory.kmem.usage_in_bytes, or in a separate counter when it makes sense
(currently only for tcp).
The main "kmem" counter is fed into the main counter, so kmem charges will
also be visible from the user counter.

Currently no soft limit is implemented for kernel memory. It is future work
to trigger slab reclaim when those limits are reached.
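
Because setting the limit fails with -EBUSY once a cgroup has children or
tasks, a script may want to check those two conditions first. The helper
below is a hedged sketch: kmem_limit_settable is a made-up name, the check
only mirrors the two conditions stated above, and it is inherently racy
since tasks or children can appear between the check and the write. The
kernel remains the authority.

```shell
# kmem_limit_settable DIR: return 0 if cgroup DIR has no child cgroups and
# no attached tasks, i.e. a first kmem limit is expected to be accepted.
kmem_limit_settable() {
    dir=$1
    # Child cgroups show up as subdirectories of the cgroup directory.
    for sub in "$dir"/*/; do
        if [ -d "$sub" ]; then
            return 1
        fi
    done
    # Attached tasks are listed one per line in the "tasks" file. Read the
    # content instead of testing the file size, which may be reported as 0.
    if [ -n "$(cat "$dir/tasks" 2>/dev/null)" ]; then
        return 1
    fi
    return 0
}
```

A possible use, assuming the mount point from section 3, is to run
"kmem_limit_settable /sys/fs/cgroup/memory/0" and only write a value to
memory.kmem.limit_in_bytes when it returns 0.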

2.7.1 Current Kernel Memory resources accounted

* stack pages: every process consumes some stack pages. By accounting into
kernel memory, we prevent new processes from being created when the kernel
memory usage is too high.

* slab pages: pages allocated by the SLAB or SLUB allocator are tracked. A copy
of each kmem_cache is created the first time the cache is touched from
inside the memcg. The creation is done lazily, so some objects can still be
skipped while the cache is being created. All objects in a slab page should
belong to the same memcg. This only fails to hold when a task is migrated to a
different memcg during the page allocation by the cache.

* sockets memory pressure: some sockets protocols have memory pressure
thresholds. The Memory Controller allows them to be controlled individually
per cgroup, instead of globally.

* tcp memory pressure: sockets memory pressure for the tcp protocol.

309

316

310

2.7.3 Common use cases

317

2.7.3 Common use cases

311

318

312

Because the "kmem" counter is fed to the main user counter, kernel memory can

319

Because the "kmem" counter is fed to the main user counter, kernel memory can

never be limited completely independently of user memory. Say "U" is the user
limit, and "K" the kernel limit. There are three possible ways limits can be
set:

    U != 0, K = unlimited:
    This is the standard memcg limitation mechanism already present before kmem
    accounting. Kernel memory is completely ignored.

    U != 0, K < U:
    Kernel memory is a subset of the user memory. This setup is useful in
    deployments where the total amount of memory per-cgroup is overcommitted.
    Overcommitting kernel memory limits is definitely not recommended, since the
    box can still run out of non-reclaimable memory.
    In this case, the admin could set up K so that the sum of all groups is
    never greater than the total memory, and freely set U at the cost of their
    QoS.

    U != 0, K >= U:
    Since kmem charges are also fed to the user counter, reclaim will be
    triggered for the cgroup for both kinds of memory. This setup gives the
    admin a unified view of memory, and it is also useful for people who just
    want to track kernel memory usage.
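
The three cases above can be told apart mechanically. The following is a
minimal sketch, not part of the kernel interface: classify_limits and its
output labels are hypothetical names chosen for illustration, and both limits
are assumed to be given in bytes, with "unlimited" (or -1) meaning no kernel
limit.

```shell
# Classify a (U, K) pair of limits, both given in bytes.
# K may be the word "unlimited" (or -1) when no kernel limit is set.
# The function name and the labels it prints are illustrative only.
classify_limits() {
    u=$1
    k=$2
    if [ "$k" = "unlimited" ] || [ "$k" = "-1" ]; then
        echo "user-only"        # U != 0, K = unlimited
    elif [ "$k" -lt "$u" ]; then
        echo "kernel-subset"    # U != 0, K < U
    else
        echo "unified"          # U != 0, K >= U
    fi
}

classify_limits 4194304 1048576
```

The last line prints the label for a 4 MiB user limit with a 1 MiB kernel
limit, which falls into the "K < U" case.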

3. User Interface

0. Configuration

a. Enable CONFIG_CGROUPS
b. Enable CONFIG_RESOURCE_COUNTERS
c. Enable CONFIG_MEMCG
d. Enable CONFIG_MEMCG_SWAP (to use swap extension)
e. Enable CONFIG_MEMCG_KMEM (to use kmem extension)

1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
# mount -t tmpfs none /sys/fs/cgroup
# mkdir /sys/fs/cgroup/memory
# mount -t cgroup none /sys/fs/cgroup/memory -o memory

2. Make the new group and move bash into it
# mkdir /sys/fs/cgroup/memory/0
# echo $$ > /sys/fs/cgroup/memory/0/tasks

Since now we're in the 0 cgroup, we can alter the memory limit:
# echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes

NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.)

NOTE: We can write "-1" to reset the *.limit_in_bytes(unlimited).
NOTE: We cannot set limits on the root cgroup any more.

# cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
4194304
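
The value read back is 4194304 because the suffixes are binary multiples.
The helper below is a small sketch of that conversion done in user space;
to_bytes is a hypothetical name, and the kernel does this parsing itself, so
the function is only useful for checking expected values in scripts.

```shell
# Convert a size such as 4M, 512k or 1G to bytes using binary multiples
# (1k = 1024 bytes). Values without a suffix are passed through unchanged.
to_bytes() {
    value=$1
    case "$value" in
        *[kK]) echo $(( ${value%?} * 1024 )) ;;
        *[mM]) echo $(( ${value%?} * 1024 * 1024 )) ;;
        *[gG]) echo $(( ${value%?} * 1024 * 1024 * 1024 )) ;;
        *)     echo "$value" ;;
    esac
}

to_bytes 4M
```

Running it with 4M prints the same number that memory.limit_in_bytes reported
above.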

We can check the usage:
# cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
1216512

A successful write to this file does not guarantee a successful setting of
this limit to the value written into the file. This can be due to a
number of factors, such as rounding up to page boundaries or the total
availability of memory on the system. The user is required to re-read
this file after a write to guarantee the value committed by the kernel.

# echo 1 > memory.limit_in_bytes
# cat memory.limit_in_bytes
4096
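
The 1 -> 4096 result above is consistent with rounding up to a page boundary.
As a rough sketch of that arithmetic only (round_up_to_page is a hypothetical
helper, and the 4096-byte default is an assumption taken from the example;
other factors mentioned above are not modelled):

```shell
# Round a byte count up to the next multiple of the page size.
# The second argument is the page size; 4096 is assumed when it is omitted
# (`getconf PAGESIZE` reports the value on a real system).
round_up_to_page() {
    bytes=$1
    page=${2:-4096}
    echo $(( (bytes + page - 1) / page * page ))
}

round_up_to_page 1
```

A script that sets a limit can compare the value it re-reads from
memory.limit_in_bytes against this expectation instead of against the value it
wrote.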

The memory.failcnt field gives the number of times that the cgroup limit was
exceeded.

The memory.stat file gives accounting information. Now, the number of
caches, RSS and Active pages/Inactive pages are shown.

4. Testing

For testing features and implementation, see memcg_test.txt.

Performance testing is also important. To see the pure overhead of the memory
controller, testing on tmpfs will give you good numbers for small overheads.
Example: do kernel make on tmpfs.

Page-fault scalability is also important. When measuring parallel
page faults, a multi-process test may be better than a multi-thread
test because the latter has noise from shared objects/status.

But the above two test extreme situations.
Running your usual tests under the memory controller is always helpful.

4.1 Troubleshooting

Sometimes a user might find that the application under a cgroup is
terminated by the OOM killer. There are several causes for this:

1. The cgroup limit is too low (just too low to do anything useful)
2. The user is using anonymous memory and swap is turned off or too low

A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
some of the pages cached in the cgroup (page cache pages).

To find out what is happening, it helps to disable the OOM killer as per
"10. OOM Control" (below) and observe the result.

4.2 Task migration

When a task migrates from one cgroup to another, its charge is not
carried forward by default. The pages allocated from the original cgroup still
remain charged to it; the charge is dropped when the page is freed or
reclaimed.

You can move charges of a task along with task migration.
See 8. "Move charges at task migration"

4.3 Removing a cgroup

A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
cgroup might have some charge associated with it, even though all
tasks have migrated away from it. (because we charge against pages, not
against tasks.)

We move the stats to root (if use_hierarchy==0) or parent (if
use_hierarchy==1), and no change on the charge except uncharging
from the child.

Charges recorded in swap information are not updated at removal of cgroup.
Recorded information is discarded and a cgroup which uses swap (swapcache)
will be charged as a new owner of it.

About use_hierarchy, see Section 6.

5. Misc. interfaces.

5.1 force_empty
  memory.force_empty interface is provided to make cgroup's memory usage empty.
  You can use this interface only when the cgroup has no tasks.
  When writing anything to this

  # echo 0 > memory.force_empty

  Almost all pages tracked by this memory cgroup will be unmapped and freed.
  Some pages cannot be freed because they are locked or in-use. Such pages are
  moved to parent (if use_hierarchy==1) or root (if use_hierarchy==0) and this
  cgroup will be empty.

  The typical use case for this interface is before calling rmdir().
  Because rmdir() moves all pages to parent, some out-of-use page caches can be
  moved to the parent. If you want to avoid that, force_empty will be useful.

  Also, note that when memory.kmem.limit_in_bytes is set the charges due to
  kernel pages will still be seen. This is not considered a failure and the
  write will still return success. In this case, it is expected that
  memory.kmem.usage_in_bytes == memory.usage_in_bytes.

  About use_hierarchy, see Section 6.

5.2 stat file

memory.stat file includes following statistics

# per-memory cgroup local status
cache           - # of bytes of page cache memory.
rss             - # of bytes of anonymous and swap cache memory.
mapped_file     - # of bytes of mapped file (includes tmpfs/shmem)
pgpgin          - # of charging events to the memory cgroup. The charging
                event happens each time a page is accounted as either mapped
                anon page(RSS) or cache page(Page Cache) to the cgroup.
pgpgout         - # of uncharging events to the memory cgroup. The uncharging
                event happens each time a page is unaccounted from the cgroup.
swap            - # of bytes of swap usage
inactive_anon   - # of bytes of anonymous and swap cache memory on inactive
                LRU list.
active_anon     - # of bytes of anonymous and swap cache memory on active
                LRU list.
inactive_file   - # of bytes of file-backed memory on inactive LRU list.
active_file     - # of bytes of file-backed memory on active LRU list.
unevictable     - # of bytes of memory that cannot be reclaimed (mlocked etc).

# status considering hierarchy (see memory.use_hierarchy settings)

hierarchical_memory_limit - # of bytes of memory limit with regard to hierarchy
                        under which the memory cgroup is
hierarchical_memsw_limit - # of bytes of memory+swap limit with regard to
                        hierarchy under which memory cgroup is.

total_<counter>         - # hierarchical version of <counter>, which in
                        addition to the cgroup's own value includes the
                        sum of all hierarchical children's values of
                        <counter>, e.g. total_cache

# The following additional stats are dependent on CONFIG_DEBUG_VM.

recent_rotated_anon     - VM internal parameter. (see mm/vmscan.c)
recent_rotated_file     - VM internal parameter. (see mm/vmscan.c)
recent_scanned_anon     - VM internal parameter. (see mm/vmscan.c)
recent_scanned_file     - VM internal parameter. (see mm/vmscan.c)

Memo:
        recent_rotated means recent frequency of LRU rotation.
        recent_scanned means recent # of scans to LRU.
        These are shown for better debugging; please see the code for meanings.

Note:
        Only anonymous and swap cache memory is listed as part of 'rss' stat.
        This should not be confused with the true 'resident set size' or the
        amount of physical memory used by the cgroup.
        'rss + mapped_file' will give you resident set size of cgroup.
        (Note: file and shmem may be shared among other cgroups. In that case,
         mapped_file is accounted only when the memory cgroup is owner of page
         cache.)

5.3 swappiness

Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
Please note that unlike the global swappiness, memcg knob set to 0
really prevents any swapping even if there is swap storage
available. This might lead to memcg OOM killer if there are no file
pages to reclaim.

The swappiness of the following cgroups can't be changed:
- root cgroup (uses /proc/sys/vm/swappiness).
- a cgroup which uses hierarchy and it has other cgroup(s) below it.
- a cgroup which uses hierarchy and not the root of hierarchy.

5.4 failcnt

A memory cgroup provides memory.failcnt and memory.memsw.failcnt files.
This failcnt(== failure count) shows the number of times that a usage counter
hit its limit. When a memory cgroup hits a limit, failcnt increases and
memory under it will be reclaimed.

You can reset failcnt by writing 0 to failcnt file.
# echo 0 > .../memory.failcnt

5.5 usage_in_bytes

For efficiency, like other kernel components, memory cgroup uses some
optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is
affected by this method and doesn't show the 'exact' value of memory (and swap)
usage; it is a fuzz value for efficient access. (Of course, when necessary,
it's synchronized.)
If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP)
value in memory.stat(see 5.2).
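
As a minimal sketch of computing RSS+CACHE(+SWAP) from memory.stat: the
stat_sum helper below is a hypothetical name, and it assumes that each line of
memory.stat is a counter name followed by its value, separated by whitespace.

```shell
# Sum the named counters from memory.stat-formatted text on stdin.
# Only exact name matches are counted, so total_* lines are ignored
# unless they are requested by name.
stat_sum() {
    awk -v keys="$*" '
        BEGIN { n = split(keys, k, " "); for (i = 1; i <= n; i++) want[k[i]] = 1 }
        ($1 in want) { sum += $2 }
        END { printf "%.0f\n", sum }
    '
}

# Sample input standing in for a real memory.stat file.
printf 'cache 4096\nrss 8192\nswap 1024\ntotal_cache 4096\n' | stat_sum cache rss swap
```

On a live system the input would come from the cgroup itself, for example
"stat_sum cache rss swap < /sys/fs/cgroup/memory/0/memory.stat". The same
helper gives the resident set size described in 5.2 when called with
"rss mapped_file".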

5.6 numa_stat

This is similar to numa_maps but operates on a per-memcg basis. This is
useful for providing visibility into the numa locality information within
a memcg since the pages are allowed to be allocated from any physical
node. One of the use cases is evaluating application performance by
combining this information with the application's CPU allocation.

We export "total", "file", "anon" and "unevictable" pages per-node for
each memcg. The output format of memory.numa_stat is:

total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
unevictable=<total unevictable pages> N0=<node 0 pages> N1=<node 1 pages> ...

And we have total = file + anon + unevictable.
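
The per-node layout is easy to process with awk. The sketch below uses a
hypothetical helper, numa_sums, and assumes only the line format shown above;
for each line it prints the counter name, the reported total, and the sum of
the per-node values so the two can be compared.

```shell
# For each line of memory.numa_stat-formatted text on stdin, print:
#   <counter name> <reported total> <sum of per-node values>
numa_sums() {
    awk '{
        split($1, head, "=")            # "total=300" -> head[1]="total", head[2]="300"
        sum = 0
        for (i = 2; i <= NF; i++) {     # remaining fields look like N0=100
            split($i, node, "=")
            sum += node[2]
        }
        printf "%s %s %d\n", head[1], head[2], sum
    }'
}

# Sample input standing in for a real memory.numa_stat file.
printf 'total=300 N0=100 N1=200\nfile=120 N0=20 N1=100\n' | numa_sums
```

On a live system the input would be the memory.numa_stat file of the cgroup
of interest.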

6. Hierarchy support

The memory controller supports a deep hierarchy and hierarchical accounting.
The hierarchy is created by creating the appropriate cgroups in the
cgroup filesystem. Consider for example, the following cgroup filesystem
hierarchy

               root
             /  |   \
            /   |    \
           a    b     c
                      | \
                      |  \
                      d   e

In the diagram above, with hierarchical accounting enabled, all memory
usage of e is accounted to its ancestors up to the root (i.e., c and root)
that have memory.use_hierarchy enabled. If one of the ancestors goes over its
limit, the reclaim algorithm reclaims from the tasks in the ancestor and the
children of the ancestor.

6.1 Enabling hierarchical accounting and reclaim

A memory cgroup by default disables the hierarchy feature. Support
can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup

# echo 1 > memory.use_hierarchy

The feature can be disabled by

# echo 0 > memory.use_hierarchy

NOTE1: Enabling/disabling will fail if either the cgroup already has other
       cgroups created below it, or if the parent cgroup has use_hierarchy
       enabled.

NOTE2: When panic_on_oom is set to "2", the whole system will panic in
       case of an OOM event in any cgroup.

7. Soft limits

Soft limits allow for greater sharing of memory. The idea behind soft limits
is to allow control groups to use as much of the memory as needed, provided

a. There is no memory contention
b. They do not exceed their hard limit

When the system detects memory contention or low memory, control groups
are pushed back to their soft limits. If the soft limit of each control
group is very high, they are pushed back as much as possible to make
sure that one control group does not starve the others of memory.

Please note that soft limits are a best-effort feature; they come with
no guarantees, but do their best to make sure that when memory is
heavily contended for, memory is allocated based on the soft limit
hints/setup. Currently soft limit based reclaim is set up such that
it gets invoked from balance_pgdat (kswapd).

7.1 Interface

Soft limits can be set up by using the following commands (in this example we
assume a soft limit of 256 MiB)

# echo 256M > memory.soft_limit_in_bytes

If we want to change this to 1G, we can at any time use

# echo 1G > memory.soft_limit_in_bytes

NOTE1: Soft limits take effect over a long period of time, since they involve
       reclaiming memory for balancing between memory cgroups
NOTE2: It is recommended to set the soft limit always below the hard limit,
       otherwise the hard limit will take precedence.
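
The recommendation in NOTE2 can be checked before writing a new soft limit.
This is only an illustrative sketch: check_soft_limit and its messages are
hypothetical, and both arguments are assumed to be plain byte counts such as
those read from memory.soft_limit_in_bytes and memory.limit_in_bytes.

```shell
# Report whether a soft limit lies below the hard limit.
# Both arguments are byte counts.
check_soft_limit() {
    soft=$1
    hard=$2
    if [ "$soft" -lt "$hard" ]; then
        echo "ok"
    else
        echo "soft limit is not below the hard limit"
    fi
}

# 256 MiB soft limit against a 1 GiB hard limit.
check_soft_limit 268435456 1073741824
```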

8. Move charges at task migration

Users can move charges associated with a task along with task migration, that
is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
This feature is not supported in !CONFIG_MMU environments because of lack of
page tables.

8.1 Interface

This feature is disabled by default. It can be enabled (and disabled again) by
writing to memory.move_charge_at_immigrate of the destination cgroup.

If you want to enable it:

# echo (some positive value) > memory.move_charge_at_immigrate

Note: Each bit of move_charge_at_immigrate has its own meaning about what type
      of charges should be moved. See 8.2 for details.
Note: Charges are moved only when you move mm->owner, in other words,
      a leader of a thread group.
Note: If we cannot find enough space for the task in the destination cgroup, we
      try to make space by reclaiming memory. Task migration may fail if we
      cannot make enough space.
Note: It can take several seconds if you move many charges.

And if you want to disable it again:

# echo 0 > memory.move_charge_at_immigrate

8.2 Type of charges which can be moved

Each bit in move_charge_at_immigrate has its own meaning about what type of
charges should be moved. But in any case, it must be noted that an account of
a page or a swap can be moved only when it is charged to the task's current
(old) memory cgroup.

  bit | what type of charges would be moved ?
 -----+------------------------------------------------------------------------
   0  | A charge of an anonymous page (or swap of it) used by the target task.
      | You must enable Swap Extension (see 2.4) to enable move of swap charges.
 -----+------------------------------------------------------------------------
   1  | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory)
      | and swaps of tmpfs file) mmapped by the target task. Unlike the case of
      | anonymous pages, file pages (and swaps) in the range mmapped by the task
      | will be moved even if the task hasn't done page fault, i.e. they might
      | not be the task's "RSS", but other task's "RSS" that maps the same file.
      | And mapcount of the page is ignored (the page can be moved even if
      | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to
      | enable move of swap charges.

696

703

697
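The bit layout above can be turned into the integer written to
memory.move_charge_at_immigrate. The following is a minimal sketch; the
constant and function names are illustrative and do not come from the kernel.

```python
# Bit positions as listed in the table above.
MOVE_ANON = 1 << 0  # bit 0: anonymous pages (and their swap)
MOVE_FILE = 1 << 1  # bit 1: mmapped file pages (and swap of tmpfs files)


def move_charge_value(anon, file):
    """Return the integer to write to memory.move_charge_at_immigrate."""
    value = 0
    if anon:
        value |= MOVE_ANON
    if file:
        value |= MOVE_FILE
    return value


def describe_move_charge(value):
    """Decode a value read back from memory.move_charge_at_immigrate."""
    kinds = []
    if value & MOVE_ANON:
        kinds.append("anon")
    if value & MOVE_FILE:
        kinds.append("file")
    return kinds
```

Writing the value 3, for example, asks for both anonymous and file charges to
be moved along with the task.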

8.3 TODO

- All charge-moving operations are done under cgroup_mutex. Holding the mutex
  for too long is not good behavior, so we may need some trick.

9. Memory thresholds

Memory cgroup implements memory thresholds using the cgroups notification
API (see cgroups.txt). It allows an application to register multiple memory
and memsw thresholds and to get a notification when usage crosses one of them.

To register a threshold, an application must:
- create an eventfd using eventfd(2);
- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
- write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
  cgroup.event_control.

The application will be notified through eventfd when memory usage crosses
the threshold in either direction.

This is applicable to both root and non-root cgroups.
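The three registration steps for a threshold can be sketched as follows. This
is an illustration only: it assumes Python 3.10 or later (for os.eventfd) and
a mounted cgroup v1 memory hierarchy, and the function names and the
cgroup_dir argument are hypothetical.

```python
import os


def threshold_event_line(event_fd, usage_fd, threshold):
    """Format the string expected by cgroup.event_control."""
    return f"{event_fd} {usage_fd} {threshold}"


def wait_for_threshold(cgroup_dir, threshold):
    """Block until memory usage crosses `threshold` bytes in either direction.

    `cgroup_dir` is a hypothetical path such as /sys/fs/cgroup/memory/0.
    """
    event_fd = os.eventfd(0)  # step 1: create an eventfd
    try:
        # step 2: open the usage file whose value is being watched
        usage_fd = os.open(
            os.path.join(cgroup_dir, "memory.usage_in_bytes"), os.O_RDONLY
        )
        try:
            # step 3: register the pair of descriptors and the threshold
            line = threshold_event_line(event_fd, usage_fd, threshold)
            with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as ctl:
                ctl.write(line)
            # Reading the eventfd blocks until the kernel signals it.
            return os.eventfd_read(event_fd)
        finally:
            os.close(usage_fd)
    finally:
        os.close(event_fd)
```

Both descriptors must stay open in the registering process while it waits,
because the kernel looks them up by number when the control string is written.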

10. OOM Control

memory.oom_control file is for OOM notification and other controls.

Memory cgroup implements OOM notifier using the cgroup notification
API (See cgroups.txt). It allows an application to register multiple OOM
notification deliveries and to get a notification when OOM happens.

To register a notifier, an application must:
 - create an eventfd using eventfd(2)
 - open memory.oom_control file
 - write string like "<event_fd> <fd of memory.oom_control>" to
   cgroup.event_control

The application will be notified through eventfd when OOM happens.
OOM notification doesn't work for the root cgroup.

You can disable the OOM-killer by writing "1" to memory.oom_control file, as:

	#echo 1 > memory.oom_control

This operation is only allowed to the top cgroup of a sub-hierarchy.
If OOM-killer is disabled, tasks under cgroup will hang/sleep
in memory cgroup's OOM-waitqueue when they request accountable memory.

To let them run again, you have to relax the memory cgroup's OOM status by
	* enlarging the limit or reducing usage.
To reduce usage, you can
	* kill some tasks.
	* move some tasks to other group with account migration.
	* remove some files (on tmpfs?)

Then, stopped tasks will work again.

At reading, current status of OOM is shown.
	oom_kill_disable 0 or 1 (if 1, oom-killer is disabled)
	under_oom	 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
				 be stopped.)
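The status shown when reading memory.oom_control can be interpreted with a
small helper. This is a sketch that assumes the two-field "name value" layout
listed above; the function names are illustrative.

```python
def parse_oom_control(text):
    """Parse the contents of memory.oom_control into a dict of integers."""
    status = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            status[parts[0]] = int(parts[1])
    return status


def tasks_may_be_stalled(status):
    """True when the OOM killer is disabled and the group is under OOM."""
    return status.get("oom_kill_disable") == 1 and status.get("under_oom") == 1
```

When tasks_may_be_stalled() returns True, the tasks are likely sleeping in the
OOM-waitqueue and will continue only after the limit is enlarged or usage is
reduced.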

11. TODO

1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first
3. Teach controller to account for shared-pages
4. Start reclamation in the background when the limit is
   not yet hit but the usage is getting closer

Summary

Overall, the memory controller has been a stable controller and has been
commented and discussed quite extensively in the community.

References

1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
2. Singh, Balbir. Memory Controller (RSS Control),
   http://lwn.net/Articles/222762/
3. Emelianov, Pavel. Resource controllers based on process cgroups,
   http://lkml.org/lkml/2007/3/6/198
4. Emelianov, Pavel. RSS controller based on process cgroups (v2),
   http://lkml.org/lkml/2007/4/9/78
5. Emelianov, Pavel. RSS controller based on process cgroups (v3),
   http://lkml.org/lkml/2007/5/30/244
6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/
7. Vaidyanathan, Srinivasan. Control Groups: Pagecache accounting and control
   subsystem (v3), http://lwn.net/Articles/235534/
8. Singh, Balbir. RSS controller v2 test results (lmbench),
   http://lkml.org/lkml/2007/5/17/232
9. Singh, Balbir. RSS controller v2 AIM9 results,
   http://lkml.org/lkml/2007/5/18/1
10. Singh, Balbir. Memory controller v6 test results,
    http://lkml.org/lkml/2007/8/19/36
11. Singh, Balbir. Memory controller introduction (v6),
    http://lkml.org/lkml/2007/8/17/69
12. Corbet, Jonathan. Controlling memory use in cgroups,
    http://lwn.net/Articles/243795/


kmem: add slab-specific documentation about the kmem controller

 Memory Resource Controller
 NOTE: The Memory Resource Controller has generically been referred to as the
       memory controller in this document. Do not confuse memory controller
       used here with the memory controller that is used in hardware.
 (For editors)
 In this document:
       When we mention a cgroup (cgroupfs's directory) with memory controller,
       we call it "memory cgroup". When you see git-log and source code, you'll
       see patch's title and function names tend to use "memcg".
       In this document, we avoid using it.
 Benefits and Purpose of the memory controller
 The memory controller isolates the memory behaviour of a group of tasks
 from the rest of the system. The article on LWN [12] mentions some probable
 uses of the memory controller. The memory controller can be used to
 a. Isolate an application or a group of applications
    Memory-hungry applications can be isolated and limited to a smaller
    amount of memory.
 b. Create a cgroup with a limited amount of memory; this can be used
    as a good alternative to booting with mem=XXXX.
 c. Virtualization solutions can control the amount of memory they want
    to assign to a virtual machine instance.
 d. A CD/DVD burner could control the amount of memory used by the
    rest of the system to ensure that burning does not fail due to lack
    of available memory.
 e. There are several other use cases; find one or use the controller just
    for fun (to learn and hack on the VM subsystem).
 Current Status: linux-2.6.34-mmotm(development version of 2010/April)
 Features:
  - accounting anonymous pages, file caches, swap caches usage and limiting them.
  - pages are linked to per-memcg LRU exclusively, and there is no global LRU.
  - optionally, memory+swap usage can be accounted and limited.
  - hierarchical accounting
  - soft limit
  - moving (recharging) account at moving a task is selectable.
  - usage threshold notifier
  - oom-killer disable knob and oom-notifier
  - Root cgroup has no limit controls.
  Kernel memory support is a work in progress, and the current version provides
  basic functionality. (See Section 2.7)
 Brief summary of control files.
  tasks				 # attach a task(thread) and show list of threads
  cgroup.procs			 # show list of processes
  cgroup.event_control		 # an interface for eventfd()
  memory.usage_in_bytes		 # show current res_counter usage for memory
 				 (See 5.5 for details)
  memory.memsw.usage_in_bytes	 # show current res_counter usage for memory+Swap
 				 (See 5.5 for details)
  memory.limit_in_bytes		 # set/show limit of memory usage
  memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
  memory.failcnt			 # show the number of memory usage hits limits
  memory.memsw.failcnt		 # show the number of memory+Swap hits limits
  memory.max_usage_in_bytes	 # show max memory usage recorded
  memory.memsw.max_usage_in_bytes # show max memory+Swap usage recorded
  memory.soft_limit_in_bytes	 # set/show soft limit of memory usage
  memory.stat			 # show various statistics
  memory.use_hierarchy		 # set/show hierarchical account enabled
  memory.force_empty		 # trigger forced move charge to parent
  memory.swappiness		 # set/show swappiness parameter of vmscan
 				 (See sysctl's vm.swappiness)
  memory.move_charge_at_immigrate # set/show controls of moving charges
  memory.oom_control		 # set/show oom controls.
  memory.numa_stat		 # show the number of memory usage per numa node
  memory.kmem.limit_in_bytes      # set/show hard limit for kernel memory
  memory.kmem.usage_in_bytes      # show current kernel memory allocation
  memory.kmem.failcnt             # show the number of kernel memory usage hits limits
  memory.kmem.max_usage_in_bytes  # show max kernel memory usage recorded
  memory.kmem.tcp.limit_in_bytes  # set/show hard limit for tcp buf memory
  memory.kmem.tcp.usage_in_bytes  # show current tcp buf memory allocation
  memory.kmem.tcp.failcnt            # show the number of tcp buf memory usage hits limits
  memory.kmem.tcp.max_usage_in_bytes # show max tcp buf memory usage recorded
 1. History
 The memory controller has a long history. A request for comments for the memory
 controller was posted by Balbir Singh [1]. At the time the RFC was posted
 there were several implementations for memory control. The goal of the
 RFC was to build consensus and agreement for the minimal features required
 for memory control. The first RSS controller was posted by Balbir Singh[2]
 in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
 RSS controller. At OLS, at the resource management BoF, everyone suggested
 that we handle both page cache and RSS together. Another request was raised
 to allow user space handling of OOM. The current memory controller is
 at version 6; it combines both mapped (RSS) and unmapped Page
 Cache Control [11].
 2. Memory Control
 Memory is a unique resource in the sense that it is present in a limited
 amount. If a task requires a lot of CPU processing, the task can spread
 its processing over a period of hours, days, months or years, but with
 memory, the same physical memory needs to be reused to accomplish the task.
 The memory controller implementation has been divided into phases. These
 are:
 1. Memory controller
 2. mlock(2) controller
 3. Kernel user memory accounting and slab control
 4. user mappings length controller
 The memory controller is the first controller developed.
 2.1. Design
 The core of the design is a counter called the res_counter. The res_counter
 tracks the current memory usage and limit of the group of processes associated
 with the controller. Each cgroup has a memory controller specific data
 structure (mem_cgroup) associated with it.
 2.2. Accounting
 		+--------------------+
 		|  mem_cgroup        |
 		|  (res_counter)     |
 		+--------------------+
 		 /            ^      \
 		/             |       \
            +---------------+  |        +---------------+
            | mm_struct     |  |....    | mm_struct     |
            |               |  |        |               |
            +---------------+  |        +---------------+
                               |
                               + --------------+
                                               |
            +---------------+           +------+--------+
            | page          +---------->| page_cgroup   |
            |               |           |               |
            +---------------+           +---------------+
              (Figure 1: Hierarchy of Accounting)
 Figure 1 shows the important aspects of the controller
 1. Accounting happens per cgroup
 2. Each mm_struct knows about which cgroup it belongs to
 3. Each page has a pointer to the page_cgroup, which in turn knows the
    cgroup it belongs to
 The accounting is done as follows: mem_cgroup_charge_common() is invoked to
 set up the necessary data structures and check if the cgroup that is being
 charged is over its limit. If it is, then reclaim is invoked on the cgroup.
 More details can be found in the reclaim section of this document.
 If everything goes well, a page meta-data-structure called page_cgroup is
 updated. page_cgroup has its own LRU on cgroup.
 (*) page_cgroup structure is allocated at boot/memory-hotplug time.
 2.2.1 Accounting details
 All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
 Some pages which are never reclaimable and will not be on the LRU
 are not accounted. We just account pages under usual VM management.
 RSS pages are accounted at page_fault unless they've already been accounted
 for earlier. A file page will be accounted for as Page Cache when it's
 inserted into inode (radix-tree). While it's mapped into the page tables of
 processes, duplicate accounting is carefully avoided.
 An RSS page is unaccounted when it's fully unmapped. A PageCache page is
 unaccounted when it's removed from radix-tree. Even if RSS pages are fully
 unmapped (by kswapd), they may exist as SwapCache in the system until they
 are really freed. Such SwapCaches are also accounted.
 A swapped-in page is not accounted until it's mapped.
 Note: The kernel does swapin-readahead and reads multiple swaps at once.
 This means swapped-in pages may contain pages for other tasks than a task
 causing page fault. So, we avoid accounting at swap-in I/O.
 At page migration, accounting information is kept.
 Note: we just account pages-on-LRU because our purpose is to control amount
 of used pages; not-on-LRU pages tend to be out-of-control from VM view.
 2.3 Shared Page Accounting
 Shared pages are accounted on the basis of the first touch approach. The
 cgroup that first touches a page is accounted for the page. The principle
 behind this approach is that a cgroup that aggressively uses a shared
 page will eventually get charged for it (once it is uncharged from
 the cgroup that brought it in -- this will happen on memory pressure).
 But see section 8.2: when moving a task to another cgroup, its pages may
 be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.
 Exception: If CONFIG_MEMCG_SWAP is not used.
 When you do swapoff and force swapped-out pages of shmem(tmpfs) to be
 brought back into memory, charges for those pages are accounted against the
 caller of swapoff rather than the users of shmem.
 2.4 Swap Extension (CONFIG_MEMCG_SWAP)
 Swap Extension allows you to record charge for swap. A swapped-in page is
 charged back to original page allocator if possible.
 When swap is accounted, the following files are added.
  - memory.memsw.usage_in_bytes.
  - memory.memsw.limit_in_bytes.
 memsw means memory+swap. Usage of memory+swap is limited by
 memsw.limit_in_bytes.
 Example: Assume a system with 4G of swap. A task which allocates 6G of memory
 (by mistake) under 2G memory limitation will use all swap.
 In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.
 By using the memsw limit, you can avoid system OOM which can be caused by swap
 shortage.
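The arithmetic behind the example above can be checked with a deliberately
simplified model. It assumes that resident memory fills up to the memory limit
first and that everything beyond it goes to swap; the function name and the
model itself are illustrative, not a description of the kernel's reclaim path.

```python
GIB = 1 << 30


def swap_consumed(alloc, mem_limit, memsw_limit, swap_total):
    """Bytes of swap a runaway allocation ends up using.

    Simplified model: memory is charged up to mem_limit, the remainder is
    swapped out until memsw_limit (None means unlimited) or swap_total is hit.
    """
    wanted_swap = max(0, alloc - mem_limit)
    allowed_swap = swap_total
    if memsw_limit is not None:
        allowed_swap = min(allowed_swap, max(0, memsw_limit - mem_limit))
    return min(wanted_swap, allowed_swap)
```

With the numbers from the example, swap_consumed(6 * GIB, 2 * GIB, None,
4 * GIB) gives 4 * GIB (all swap is used), while a memsw limit of 3 * GIB caps
the result at 1 * GIB.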
 * why 'memory+swap' rather than swap.
 The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
 to move account from memory to swap...there is no change in usage of
 memory+swap. In other words, when we want to limit the usage of swap without
 affecting global LRU, memory+swap limit is better than just limiting swap from
 an OS point of view.
 * What happens when a cgroup hits memory.memsw.limit_in_bytes
 When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
 in this cgroup. Then, swap-out will not be done by cgroup routine and file
 caches are dropped. But as mentioned above, global LRU can still swap out
 memory from it to keep the system's memory management state sane. You can't
 forbid that from within a cgroup.
 2.5 Reclaim
 Each cgroup maintains a per cgroup LRU which has the same structure as
 global VM. When a cgroup goes over its limit, we first try
 to reclaim memory from the cgroup so as to make space for the new
 pages that the cgroup has touched. If the reclaim is unsuccessful,
 an OOM routine is invoked to select and kill the bulkiest task in the
 cgroup. (See 10. OOM Control below.)
 The reclaim algorithm has not been modified for cgroups, except that
 pages that are selected for reclaiming come from the per-cgroup LRU
 list.
 NOTE: Reclaim does not work for the root cgroup, since we cannot set any
 limits on the root cgroup.
 Note2: When panic_on_oom is set to "2", the whole system will panic.
 When oom event notifier is registered, event will be delivered.
 (See oom_control section)
 2.6 Locking
    lock_page_cgroup()/unlock_page_cgroup() should not be called under
    mapping->tree_lock.
    Other lock order is following:
    PG_locked.
    mm->page_table_lock
        zone->lru_lock
 	  lock_page_cgroup.
   In many cases, just lock_page_cgroup() is called.
   per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
   zone->lru_lock, it has no lock of its own.
 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
 With the Kernel memory extension, the Memory Controller is able to limit
 the amount of kernel memory used by the system. Kernel memory is fundamentally
 different than user memory, since it can't be swapped out, which makes it
 possible to DoS the system by consuming too much of this precious resource.
 Kernel memory won't be accounted at all until limit on a group is set. This
 allows for existing setups to continue working without disruption.  The limit
 cannot be set if the cgroup has children, or if there are already tasks in the
 cgroup. Attempting to set the limit under those conditions will return -EBUSY.
 When use_hierarchy == 1 and a group is accounted, its children will
 automatically be accounted regardless of their limit value.
 After a group is first limited, it will keep being accounted until it
 is removed. The memory limitation itself can of course be removed by writing
 -1 to memory.kmem.limit_in_bytes. In this case, kmem will be accounted, but not
 limited.
 Kernel memory limits are not imposed for the root cgroup. Usage for the root
 cgroup may or may not be accounted. The memory used is accumulated into
 memory.kmem.usage_in_bytes, or in a separate counter when it makes sense.
 (currently only for tcp).
 The main "kmem" counter is fed into the main counter, so kmem charges will
 also be visible from the user counter.
 Currently no soft limit is implemented for kernel memory. It is future work
 to trigger slab reclaim when those limits are reached.
 2.7.1 Current Kernel Memory resources accounted
 * stack pages: every process consumes some stack pages. By accounting into
 kernel memory, we prevent new processes from being created when the kernel
 memory usage is too high.
+* slab pages: pages allocated by the SLAB or SLUB allocator are tracked. A copy
+of each kmem_cache is created the first time the cache is touched
+from inside the memcg. The creation is done lazily, so some objects can still be
+skipped while the cache is being created. All objects in a slab page should
+belong to the same memcg. This only fails to hold when a task is migrated to a
+different memcg during the page allocation by the cache.
 * sockets memory pressure: some sockets protocols have memory pressure
 thresholds. The Memory Controller allows them to be controlled individually
 per cgroup, instead of globally.
 * tcp memory pressure: sockets memory pressure for the tcp protocol.
 2.7.2 Common use cases
 Because the "kmem" counter is fed to the main user counter, kernel memory can
 never be limited completely independently of user memory. Say "U" is the user
 limit, and "K" the kernel limit. There are three possible ways limits can be
 set:
     U != 0, K = unlimited:
     This is the standard memcg limitation mechanism already present before kmem
     accounting. Kernel memory is completely ignored.
     U != 0, K < U:
     Kernel memory is a subset of the user memory. This setup is useful in
     deployments where the total amount of memory per-cgroup is overcommitted.
     Overcommitting kernel memory limits is definitely not recommended, since the
     box can still run out of non-reclaimable memory.
     In this case, the admin could set up K so that the sum of all groups is
     never greater than the total memory, and freely set U at the cost of his
     QoS.
     U != 0, K >= U:
     Since kmem charges will also be fed to the user counter, reclaim will be
     triggered for the cgroup for both kinds of memory. This setup gives the
     admin a unified view of memory, and it is also useful for people who just
     want to track kernel memory usage.
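The three cases above can be told apart mechanically. The sketch below is only
an illustration; the function name and the returned labels are made up here,
and None stands for an unlimited kernel limit.

```python
def classify_limits(user_limit, kmem_limit):
    """Name the setup formed by the user limit U and the kernel limit K."""
    if kmem_limit is None:
        # K = unlimited: kernel memory is not limited separately.
        return "kmem-ignored"
    if kmem_limit < user_limit:
        # K < U: kernel memory is a subset of the user memory.
        return "kmem-subset"
    # K >= U: kmem charges hit the user limit first, giving a unified view.
    return "unified"
```

For instance, classify_limits(512 << 20, 64 << 20) falls into the "kmem-subset"
case, which is the one suited to overcommitted deployments.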
 3. User Interface
 0. Configuration
 a. Enable CONFIG_CGROUPS
 b. Enable CONFIG_RESOURCE_COUNTERS
 c. Enable CONFIG_MEMCG
 d. Enable CONFIG_MEMCG_SWAP (to use swap extension)
 e. Enable CONFIG_MEMCG_KMEM (to use kmem extension)
 1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
 # mount -t tmpfs none /sys/fs/cgroup
 # mkdir /sys/fs/cgroup/memory
 # mount -t cgroup none /sys/fs/cgroup/memory -o memory
 2. Make the new group and move bash into it
 # mkdir /sys/fs/cgroup/memory/0
 # echo $$ > /sys/fs/cgroup/memory/0/tasks
 Since now we're in the 0 cgroup, we can alter the memory limit:
 # echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
 NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
 mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.)
 NOTE: We can write "-1" to reset the *.limit_in_bytes(unlimited).
 NOTE: We cannot set limits on the root cgroup any more.
 # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
 4194304
 We can check the usage:
 # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
 1216512
 A successful write to this file does not guarantee a successful setting of
 this limit to the value written into the file. This can be due to a
 number of factors, such as rounding up to page boundaries or the total
 availability of memory on the system. The user is required to re-read
 this file after a write to guarantee the value committed by the kernel.
 # echo 1 > memory.limit_in_bytes
 # cat memory.limit_in_bytes
 4096
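The suffix handling and the page rounding seen above can be reproduced in user
space to predict the value that will be read back. This is a minimal sketch
assuming 4 KiB pages; the helper names are illustrative and the kernel's own
parser may accept more forms than this one does.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages

_SUFFIXES = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30}


def parse_limit(text):
    """Convert a string such as "4M" to bytes; "-1" means unlimited (None)."""
    text = text.strip()
    if text == "-1":
        return None
    multiplier = 1
    if text and text[-1].lower() in _SUFFIXES:
        multiplier = _SUFFIXES[text[-1].lower()]
        text = text[:-1]
    return int(text) * multiplier


def round_up_to_page(nbytes, page_size=PAGE_SIZE):
    """Round a byte count up to the next page boundary."""
    return -(-nbytes // page_size) * page_size
```

Under these assumptions, round_up_to_page(parse_limit("4M")) is 4194304 and
round_up_to_page(parse_limit("1")) is 4096, matching the two reads shown above.
Re-reading the file remains the only way to know what the kernel committed.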
 The memory.failcnt field gives the number of times that the cgroup limit was
 exceeded.
 The memory.stat file gives accounting information. Now, the number of
 caches, RSS and Active pages/Inactive pages are shown.
 4. Testing
 For testing features and implementation, see memcg_test.txt.
 Performance testing is also important. To see the pure overhead of the memory
 controller, testing on tmpfs gives good numbers because other overheads are small.
 Example: do kernel make on tmpfs.
 Page-fault scalability is also important. When measuring parallel
 page faults, a multi-process test may be better than a multi-thread
 test because the latter has noise from shared objects/status.
 But the above two are testing extreme situations.
 Trying usual tests under the memory controller is always helpful.
 4.1 Troubleshooting
 Sometimes a user might find that the application under a cgroup is
 terminated by the OOM killer. There are several causes for this:
 1. The cgroup limit is too low (just too low to do anything useful)
 2. The user is using anonymous memory and swap is turned off or too low
 A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
 some of the pages cached in the cgroup (page cache pages).
 To know what happens, disabling OOM_Kill as per "10. OOM Control" (below) and
 seeing what happens will be helpful.
 4.2 Task migration
 When a task migrates from one cgroup to another, its charge is not
 carried forward by default. The pages allocated from the original cgroup still
 remain charged to it; the charge is dropped when the page is freed or
 reclaimed.
 You can move charges of a task along with task migration.
 See 8. "Move charges at task migration"
 4.3 Removing a cgroup
 A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
 cgroup might have some charge associated with it, even though all
 tasks have migrated away from it. (because we charge against pages, not
 against tasks.)
 We move the stats to root (if use_hierarchy==0) or parent (if
 use_hierarchy==1), and no change on the charge except uncharging
 from the child.
 Charges recorded in swap information are not updated at removal of cgroup.
 Recorded information is discarded and a cgroup which uses swap (swapcache)
 will be charged as a new owner of it.
 About use_hierarchy, see Section 6.
 5. Misc. interfaces.
 5.1 force_empty
   memory.force_empty interface is provided to make cgroup's memory usage empty.
   You can use this interface only when the cgroup has no tasks.
   When anything is written to this file, for example
   # echo 0 > memory.force_empty
   almost all pages tracked by this memory cgroup will be unmapped and freed.
   Some pages cannot be freed because they are locked or in-use. Such pages are
   moved to parent (if use_hierarchy==1) or root (if use_hierarchy==0) and this
   cgroup will be empty.
   The typical use case for this interface is before calling rmdir().
   Because rmdir() moves all pages to parent, some out-of-use page caches can be
   moved to the parent. If you want to avoid that, force_empty will be useful.
   Also, note that when memory.kmem.limit_in_bytes is set the charges due to
   kernel pages will still be seen. This is not considered a failure and the
   write will still return success. In this case, it is expected that
   memory.kmem.usage_in_bytes == memory.usage_in_bytes.
   About use_hierarchy, see Section 6.
 5.2 stat file
 memory.stat file includes following statistics
 # per-memory cgroup local status
 cache		- # of bytes of page cache memory.
 rss		- # of bytes of anonymous and swap cache memory.
 mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
 pgpgin		- # of charging events to the memory cgroup. The charging
 		event happens each time a page is accounted as either mapped
 		anon page(RSS) or cache page(Page Cache) to the cgroup.
 pgpgout		- # of uncharging events to the memory cgroup. The uncharging
 		event happens each time a page is unaccounted from the cgroup.
 swap		- # of bytes of swap usage
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
 		inactive LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
 		LRU list.
 inactive_file	- # of bytes of file-backed memory on inactive LRU list.
 active_file	- # of bytes of file-backed memory on active LRU list.
 unevictable	- # of bytes of memory that cannot be reclaimed (mlocked etc).
 # status considering hierarchy (see memory.use_hierarchy settings)
 hierarchical_memory_limit - # of bytes of memory limit with regard to hierarchy
 			under which the memory cgroup is
 hierarchical_memsw_limit - # of bytes of memory+swap limit with regard to
 			hierarchy under which memory cgroup is.
 total_<counter>		- # hierarchical version of <counter>, which in
 			addition to the cgroup's own value includes the
 			sum of all hierarchical children's values of
 			<counter>, i.e. total_cache
 # The following additional stats are dependent on CONFIG_DEBUG_VM.
 recent_rotated_anon	- VM internal parameter. (see mm/vmscan.c)
 recent_rotated_file	- VM internal parameter. (see mm/vmscan.c)
 recent_scanned_anon	- VM internal parameter. (see mm/vmscan.c)
 recent_scanned_file	- VM internal parameter. (see mm/vmscan.c)
 Memo:
 	recent_rotated means recent frequency of LRU rotation.
 	recent_scanned means recent # of scans to LRU.
 	These are shown for better debugging; please see the code for meanings.
 Note:
 	Only anonymous and swap cache memory is listed as part of 'rss' stat.
 	This should not be confused with the true 'resident set size' or the
 	amount of physical memory used by the cgroup.
 	'rss + mapped_file' will give you resident set size of cgroup.
 	(Note: file and shmem may be shared among other cgroups. In that case,
 	 mapped_file is accounted only when the memory cgroup is owner of page
 	 cache.)
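A short helper can read the statistics and apply the note above. This sketch
assumes each line of memory.stat is a "name value" pair and uses the rss and
mapped_file counters from the list in this section; the function names are
illustrative.

```python
def parse_memory_stat(text):
    """Parse the contents of memory.stat into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            stats[fields[0]] = int(fields[1])
    return stats


def resident_set_size(stats):
    """Approximate the cgroup's resident set size in bytes.

    Anonymous and swap cache memory (rss) plus mapped file memory.
    """
    return stats.get("rss", 0) + stats.get("mapped_file", 0)
```

The same dictionary also carries the total_<counter> entries, so hierarchical
values can be looked up with keys such as "total_cache".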
 5.3 swappiness
 Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
 Please note that, unlike the global swappiness, setting the memory cgroup
 knob to 0 really prevents any swapping, even if swap storage is
 available. This might invoke the memory cgroup OOM killer if there are no
 file pages to reclaim.
 The swappiness of the following cgroups can't be changed.
 - root cgroup (uses /proc/sys/vm/swappiness).
 - a cgroup which uses hierarchy and it has other cgroup(s) below it.
 - a cgroup which uses hierarchy and not the root of hierarchy.
 5.4 failcnt
 A memory cgroup provides memory.failcnt and memory.memsw.failcnt files.
 This failcnt(== failure count) shows the number of times that a usage counter
 hit its limit. When a memory cgroup hits a limit, failcnt increases and
 memory under it will be reclaimed.
 You can reset failcnt by writing 0 to failcnt file.
 # echo 0 > .../memory.failcnt
 5.5 usage_in_bytes
 For efficiency, like other kernel components, the memory cgroup uses some
 optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is
 affected by this method and doesn't show the 'exact' value of memory (and swap)
 usage; it's a fuzzy value for efficient access. (Of course, when necessary, it's
 synchronized.) If you want to know the more exact memory usage, you should use
 the RSS+CACHE(+SWAP) value in memory.stat (see 5.2).
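The RSS+CACHE(+SWAP) sum can be sketched in shell as follows. The function
name memcg_stat_usage is hypothetical. The "mem" value is meant to be compared
with memory.usage_in_bytes and the "memsw" value with
memory.memsw.usage_in_bytes; the "swap" counter is assumed to be absent when
swap accounting is not enabled, in which case it contributes 0.

```shell
# Hypothetical helper: derive usage figures from the memory.stat-style
# file given as $1.
memcg_stat_usage() {
    # Only the cgroup's own counters are used; the hierarchical
    # total_* counters have different names and are skipped.
    awk '$1 == "rss" || $1 == "cache" { mem += $2 }
         $1 == "swap"                  { swap = $2 }
         END { printf "mem=%.0f memsw=%.0f\n", mem, mem + swap }' "$1"
}
```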
 5.6 numa_stat
 This is similar to numa_maps but operates on a per-memory-cgroup basis.  This
 is useful for providing visibility into the NUMA locality information within
 a memory cgroup since the pages are allowed to be allocated from any physical
 node.  One of the use cases is evaluating application performance by
 combining this information with the application's CPU allocation.
 We export "total", "file", "anon" and "unevictable" pages per-node for
 each memory cgroup.  The output format of memory.numa_stat is:
 total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
 file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
 anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
 unevictable=<total unevictable pages> N0=<node 0 pages> N1=<node 1 pages> ...
 And we have total = file + anon + unevictable.
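The relation above can be checked with a short shell sketch. The function
name memcg_numa_check is hypothetical, and the sketch assumes the file given
as its argument follows exactly the output format shown above.

```shell
# Hypothetical helper: check total = file + anon + unevictable for the
# memory.numa_stat-style file given as $1.
memcg_numa_check() {
    # Split on '=' and ' ' so that $1 is the counter name and $2 is
    # the all-node value that follows it.
    awk -F'[= ]' '
        $1 == "total" { total = $2 }
        $1 == "file" || $1 == "anon" || $1 == "unevictable" { sum += $2 }
        END {
            printf "total=%.0f sum=%.0f\n", total, sum
            exit (total == sum ? 0 : 1)
        }' "$1"
}
```

The function prints both values and returns non-zero when they differ.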
 6. Hierarchy support
 The memory controller supports a deep hierarchy and hierarchical accounting.
 The hierarchy is created by creating the appropriate cgroups in the
 cgroup filesystem. Consider, for example, the following cgroup filesystem
 hierarchy:
                root
              /  |   \
             /   |    \
            a    b     c
                       | \
                       |  \
                       d   e
 In the diagram above, with hierarchical accounting enabled, all memory
 usage of e is accounted to its ancestors up to the root (i.e., c and root)
 that have memory.use_hierarchy enabled. If one of the ancestors goes over its
 limit, the reclaim algorithm reclaims from the tasks in the ancestor and the
 children of the ancestor.
 6.1 Enabling hierarchical accounting and reclaim
 A memory cgroup by default disables the hierarchy feature. Support
 can be enabled by writing 1 to the memory.use_hierarchy file of the root cgroup:
 # echo 1 > memory.use_hierarchy
 The feature can be disabled by
 # echo 0 > memory.use_hierarchy
 NOTE1: Enabling/disabling will fail if either the cgroup already has other
        cgroups created below it, or if the parent cgroup has use_hierarchy
        enabled.
 NOTE2: When panic_on_oom is set to "2", the whole system will panic in
        case of an OOM event in any cgroup.
 7. Soft limits
 Soft limits allow for greater sharing of memory. The idea behind soft limits
 is to allow control groups to use as much of the memory as needed, provided
 a. There is no memory contention
 b. They do not exceed their hard limit
 When the system detects memory contention or low memory, control groups
 are pushed back to their soft limits. If the soft limit of each control
 group is very high, they are pushed back as much as possible to make
 sure that one control group does not starve the others of memory.
 Please note that soft limits is a best-effort feature; it comes with
 no guarantees, but it does its best to make sure that when memory is
 heavily contended for, memory is allocated based on the soft limit
 hints/setup. Currently soft limit based reclaim is set up such that
 it gets invoked from balance_pgdat (kswapd).
 7.1 Interface
 Soft limits can be set up by using the following commands (in this example we
 assume a soft limit of 256 MiB):
 # echo 256M > memory.soft_limit_in_bytes
 If we want to change this to 1G, we can at any time use
 # echo 1G > memory.soft_limit_in_bytes
 NOTE1: Soft limits take effect over a long period of time, since they involve
        reclaiming memory for balancing between memory cgroups.
 NOTE2: It is recommended to set the soft limit always below the hard limit,
        otherwise the hard limit will take precedence.
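The recommendation in NOTE2 can be checked with a small shell sketch. The
function name memcg_check_soft_limit is hypothetical; it assumes its argument
is a memory cgroup directory containing memory.soft_limit_in_bytes and
memory.limit_in_bytes (the hard limit), both holding plain byte counts.

```shell
# Hypothetical helper: report whether the soft limit of the cgroup
# directory $1 is below its hard limit.
memcg_check_soft_limit() {
    soft=$(cat "$1/memory.soft_limit_in_bytes")
    hard=$(cat "$1/memory.limit_in_bytes")
    if [ "$soft" -lt "$hard" ]; then
        echo "ok: soft limit $soft < hard limit $hard"
    else
        # The hard limit takes precedence in this case.
        echo "warning: soft limit $soft >= hard limit $hard"
        return 1
    fi
}
```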
 8. Move charges at task migration
 Users can move charges associated with a task along with task migration, that
 is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
 This feature is not supported in !CONFIG_MMU environments because of lack of
 page tables.
 8.1 Interface
 This feature is disabled by default. It can be enabled (and disabled again) by
 writing to memory.move_charge_at_immigrate of the destination cgroup.
 If you want to enable it:
 # echo (some positive value) > memory.move_charge_at_immigrate
 Note: Each bit of move_charge_at_immigrate has its own meaning about what type
       of charges should be moved. See 8.2 for details.
 Note: Charges are moved only when you move mm->owner, in other words,
       a leader of a thread group.
 Note: If we cannot find enough space for the task in the destination cgroup, we
       try to make space by reclaiming memory. Task migration may fail if we
       cannot make enough space.
 Note: Moving a large number of charges can take several seconds.
 And if you want to disable it again:
 # echo 0 > memory.move_charge_at_immigrate
 8.2 Type of charges which can be moved
 Each bit in move_charge_at_immigrate has its own meaning about what type of
 charges should be moved. But in any case, it must be noted that an account of
 a page or a swap can be moved only when it is charged to the task's current
 (old) memory cgroup.
   bit | what type of charges would be moved ?
  -----+------------------------------------------------------------------------
    0  | A charge of an anonymous page (or swap of it) used by the target task.
       | You must enable Swap Extension (see 2.4) to enable move of swap charges.
  -----+------------------------------------------------------------------------
    1  | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory)
       | and swaps of tmpfs file) mmapped by the target task. Unlike the case of
       | anonymous pages, file pages (and swaps) in the range mmapped by the task
       | will be moved even if the task hasn't done page fault, i.e. they might
       | not be the task's "RSS", but other task's "RSS" that maps the same file.
       | And mapcount of the page is ignored (the page can be moved even if
       | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to
       | enable move of swap charges.
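As a sketch of how the two bits in the table combine into the value written
to memory.move_charge_at_immigrate, consider the following shell function.
The name move_charge_mask and its two 0/1 arguments are hypothetical.

```shell
# Hypothetical helper: build a move_charge_at_immigrate value.
#   $1 = 1 to move anonymous page charges (bit 0), else 0
#   $2 = 1 to move mmapped file page charges (bit 1), else 0
move_charge_mask() {
    echo $(( ($1 << 0) | ($2 << 1) ))
}
```

For example, "move_charge_mask 1 1" prints 3, which selects both types of
charges when written to memory.move_charge_at_immigrate of the destination
cgroup.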
 8.3 TODO
 - All charge-moving operations are done under cgroup_mutex. Holding the mutex
   for too long is not good behavior, so we may need some trick.
 9. Memory thresholds
 Memory cgroup implements memory thresholds using the cgroups notification
 API (see cgroups.txt). It allows an application to register multiple memory
 and memsw thresholds and to get notified when usage crosses them.
 To register a threshold, an application must:
 - create an eventfd using eventfd(2);
 - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
 - write a string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
   cgroup.event_control.
 The application will be notified through eventfd when memory usage crosses
 the threshold in either direction.
 This is applicable to both root and non-root cgroups.
 10. OOM Control
 The memory.oom_control file is for OOM notification and other controls.
 Memory cgroup implements an OOM notifier using the cgroup notification
 API (see cgroups.txt). It allows an application to register multiple OOM
 notification deliveries and to get notified when OOM happens.
 To register a notifier, an application must:
  - create an eventfd using eventfd(2)
  - open memory.oom_control file
  - write a string like "<event_fd> <fd of memory.oom_control>" to
    cgroup.event_control
 The application will be notified through eventfd when OOM happens.
 OOM notification doesn't work for the root cgroup.
 You can disable the OOM-killer by writing "1" to memory.oom_control file, as:
 	# echo 1 > memory.oom_control
 This operation is only allowed for the top cgroup of a sub-hierarchy.
 If the OOM-killer is disabled, tasks under the cgroup will hang/sleep
 in the memory cgroup's OOM-waitqueue when they request accountable memory.
 To let them run again, you have to relax the memory cgroup's OOM status by
 	* enlarging the limit or reducing usage.
 To reduce usage,
 	* kill some tasks.
 	* move some tasks to another group with account migration.
 	* remove some files (on tmpfs?)
 Then, stopped tasks will work again.
 When read, the file shows the current OOM status.
 	oom_kill_disable 0 or 1 (if 1, oom-killer is disabled)
 	under_oom	 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
 				 be stopped.)
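The two fields described above can be turned into a one-line summary with a
shell sketch like the following. The function name memcg_oom_status is
hypothetical, and the sketch assumes its argument is a file in the
"<name> <0 or 1>" format shown above.

```shell
# Hypothetical helper: summarize the memory.oom_control-style file
# given as $1.
memcg_oom_status() {
    awk '$1 == "oom_kill_disable" { kill_disabled = $2 }
         $1 == "under_oom"        { under_oom = $2 }
         END {
             k = (kill_disabled == 1) ? "disabled" : "enabled"
             u = (under_oom == 1) ? "under OOM" : "not under OOM"
             print "oom-killer " k ", " u
         }' "$1"
}
```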
 11. TODO
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first
 3. Teach controller to account for shared-pages
 4. Start reclamation in the background when the limit is
    not yet hit but the usage is getting closer
 Summary
 Overall, the memory controller has been a stable controller and has been
 commented and discussed quite extensively in the community.
 References
 1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
 2. Singh, Balbir. Memory Controller (RSS Control),
    http://lwn.net/Articles/222762/
 3. Emelianov, Pavel. Resource controllers based on process cgroups,
    http://lkml.org/lkml/2007/3/6/198
 4. Emelianov, Pavel. RSS controller based on process cgroups (v2),
    http://lkml.org/lkml/2007/4/9/78
 5. Emelianov, Pavel. RSS controller based on process cgroups (v3),
    http://lkml.org/lkml/2007/5/30/244
 6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/
 7. Vaidyanathan, Srinivasan. Control Groups: Pagecache accounting and control
    subsystem (v3), http://lwn.net/Articles/235534/
 8. Singh, Balbir. RSS controller v2 test results (lmbench),
    http://lkml.org/lkml/2007/5/17/232
 9. Singh, Balbir. RSS controller v2 AIM9 results,
    http://lkml.org/lkml/2007/5/18/1
 10. Singh, Balbir. Memory controller v6 test results,
     http://lkml.org/lkml/2007/8/19/36
 11. Singh, Balbir. Memory controller introduction (v6),
     http://lkml.org/lkml/2007/8/17/69
 12. Corbet, Jonathan. Controlling memory use in cgroups,
     http://lwn.net/Articles/243795/