Memory Resource Controller

NOTE: The Memory Resource Controller has generically been referred to as the
      memory controller in this document. Do not confuse memory controller
      used here with the memory controller that is used in hardware.

(For editors)
In this document:
      When we mention a cgroup (cgroupfs's directory) with memory controller,
      we call it "memory cgroup". When you see git-log and source code, you'll
      see patch's title and function names tend to use "memcg".
      In this document, we avoid using it.

Benefits and Purpose of the memory controller

The memory controller isolates the memory behaviour of a group of tasks
from the rest of the system. The article on LWN [12] mentions some probable
uses of the memory controller. The memory controller can be used to

a. Isolate an application or a group of applications
   Memory-hungry applications can be isolated and limited to a smaller
   amount of memory.
b. Create a cgroup with a limited amount of memory; this can be used
   as a good alternative to booting with mem=XXXX.
c. Virtualization solutions can control the amount of memory they want
   to assign to a virtual machine instance.
d. A CD/DVD burner could control the amount of memory used by the
   rest of the system to ensure that burning does not fail due to lack
   of available memory.
e. There are several other use cases; find one or use the controller just
   for fun (to learn and hack on the VM subsystem).

Current Status: linux-2.6.34-mmotm(development version of 2010/April)

Features:
 - accounting anonymous pages, file caches, swap caches usage and limiting them.
 - pages are linked to per-memcg LRU exclusively, and there is no global LRU.
 - optionally, memory+swap usage can be accounted and limited.
 - hierarchical accounting
 - soft limit
 - moving (recharging) account at moving a task is selectable.
 - usage threshold notifier
 - oom-killer disable knob and oom-notifier
 - Root cgroup has no limit controls.

 Kernel memory support is a work in progress, and the current version provides
 basic functionality. (See Section 2.7)

Brief summary of control files.

 tasks                               # attach a task (thread) and show list of threads
 cgroup.procs                        # show list of processes
 cgroup.event_control                # an interface for eventfd()
 memory.usage_in_bytes               # show current res_counter usage for memory
                                       (See 5.5 for details)
 memory.memsw.usage_in_bytes         # show current res_counter usage for memory+Swap
                                       (See 5.5 for details)
 memory.limit_in_bytes               # set/show limit of memory usage
 memory.memsw.limit_in_bytes         # set/show limit of memory+Swap usage
 memory.failcnt                      # show the number of memory usage hits limits
 memory.memsw.failcnt                # show the number of memory+Swap hits limits
 memory.max_usage_in_bytes           # show max memory usage recorded
 memory.memsw.max_usage_in_bytes     # show max memory+Swap usage recorded
 memory.soft_limit_in_bytes          # set/show soft limit of memory usage
 memory.stat                         # show various statistics
 memory.use_hierarchy                # set/show hierarchical account enabled
 memory.force_empty                  # trigger forced move charge to parent
 memory.swappiness                   # set/show swappiness parameter of vmscan
                                       (See sysctl's vm.swappiness)
 memory.move_charge_at_immigrate     # set/show controls of moving charges
 memory.oom_control                  # set/show oom controls.
 memory.numa_stat                    # show the number of memory usage per numa node

 memory.kmem.limit_in_bytes          # set/show hard limit for kernel memory
 memory.kmem.usage_in_bytes          # show current kernel memory allocation
 memory.kmem.failcnt                 # show the number of kernel memory usage hits limits
 memory.kmem.max_usage_in_bytes      # show max kernel memory usage recorded

 memory.kmem.tcp.limit_in_bytes      # set/show hard limit for tcp buf memory
 memory.kmem.tcp.usage_in_bytes      # show current tcp buf memory allocation
 memory.kmem.tcp.failcnt             # show the number of tcp buf memory usage hits limits
 memory.kmem.tcp.max_usage_in_bytes  # show max tcp buf memory usage recorded
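
As a rough illustration of how these files can be read together, the sketch
below prints usage, limit and failcnt for one memory cgroup. The function name
memcg_summary and its output format are made up for this example; only the
three control file names come from the summary above.

```shell
#!/bin/sh
# memcg_summary DIR
# Hypothetical helper: print usage, limit and failcnt for the memory
# cgroup whose cgroupfs directory is DIR. Each control file holds a
# single number, so plain cat is enough.
memcg_summary() {
    dir=$1
    usage=$(cat "$dir/memory.usage_in_bytes") || return 1
    limit=$(cat "$dir/memory.limit_in_bytes") || return 1
    failcnt=$(cat "$dir/memory.failcnt") || return 1
    printf 'usage=%s limit=%s failcnt=%s\n' "$usage" "$limit" "$failcnt"
}
```

On a system where the controller is mounted as in section 3, this could be
called as "memcg_summary /sys/fs/cgroup/memory/0".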

1. History

The memory controller has a long history. A request for comments for the memory
controller was posted by Balbir Singh [1]. At the time the RFC was posted
there were several implementations for memory control. The goal of the
RFC was to build consensus and agreement for the minimal features required
for memory control. The first RSS controller was posted by Balbir Singh [2]
in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
RSS controller. At OLS, at the resource management BoF, everyone suggested
that we handle both page cache and RSS together. Another request was raised
to allow user space handling of OOM. The current memory controller is
at version 6; it combines both mapped (RSS) and unmapped Page
Cache Control [11].

2. Memory Control

Memory is a unique resource in the sense that it is present in a limited
amount. If a task requires a lot of CPU processing, the task can spread
its processing over a period of hours, days, months or years, but with
memory, the same physical memory needs to be reused to accomplish the task.

The memory controller implementation has been divided into phases. These
are:

1. Memory controller
2. mlock(2) controller
3. Kernel user memory accounting and slab control
4. user mappings length controller

The memory controller is the first controller developed.

2.1. Design

The core of the design is a counter called the res_counter. The res_counter
tracks the current memory usage and limit of the group of processes associated
with the controller. Each cgroup has a memory controller specific data
structure (mem_cgroup) associated with it.

2.2. Accounting

                +--------------------+
                |  mem_cgroup        |
                |  (res_counter)     |
                +--------------------+
                 /            ^      \
                /             |       \
           +---------------+  |        +---------------+
           | mm_struct     |  |....    | mm_struct     |
           |               |  |        |               |
           +---------------+  |        +---------------+
                              |
                              + --------------+
                                              |
           +---------------+           +------+--------+
           | page          +---------->  page_cgroup|
           |               |           |               |
           +---------------+           +---------------+

             (Figure 1: Hierarchy of Accounting)


Figure 1 shows the important aspects of the controller

1. Accounting happens per cgroup
2. Each mm_struct knows about which cgroup it belongs to
3. Each page has a pointer to the page_cgroup, which in turn knows the
   cgroup it belongs to

The accounting is done as follows: mem_cgroup_charge_common() is invoked to
set up the necessary data structures and check if the cgroup that is being
charged is over its limit. If it is, then reclaim is invoked on the cgroup.
More details can be found in the reclaim section of this document.
If everything goes well, a page meta-data-structure called page_cgroup is
updated. page_cgroup has its own LRU on cgroup.
(*) page_cgroup structure is allocated at boot/memory-hotplug time.
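
The limit check described above can be pictured with a toy model. This is
only a sketch of the idea, not the kernel code: try_charge is a hypothetical
name, and the real res_counter also handles locking, hierarchy and the
reclaim retry loop, none of which is modelled here.

```shell
#!/bin/sh
# try_charge USAGE LIMIT NR_BYTES
# Toy model of a res_counter charge: if the new usage would exceed the
# limit, report that reclaim is needed and fail; otherwise print the
# updated usage. All values are in bytes.
try_charge() {
    usage=$1
    limit=$2
    nr_bytes=$3
    new_usage=$(( $usage + $nr_bytes ))
    if [ "$new_usage" -gt "$limit" ]; then
        echo "reclaim"
        return 1
    fi
    echo "$new_usage"
}

try_charge 4096 8192 4096    # fits exactly: prints 8192
try_charge 8192 8192 4096    # over the limit: prints reclaim
```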

2.2.1 Accounting details

All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
Some pages which are never reclaimable and will not be on the LRU
are not accounted. We just account pages under usual VM management.

RSS pages are accounted at page_fault unless they've already been accounted
for earlier. A file page will be accounted for as Page Cache when it's
inserted into inode (radix-tree). While it's mapped into the page tables of
processes, duplicate accounting is carefully avoided.

An RSS page is unaccounted when it's fully unmapped. A PageCache page is
unaccounted when it's removed from radix-tree. Even if RSS pages are fully
unmapped (by kswapd), they may exist as SwapCache in the system until they
are really freed. Such SwapCaches are also accounted.
A swapped-in page is not accounted until it's mapped.

Note: The kernel does swapin-readahead and reads multiple swaps at once.
This means swapped-in pages may contain pages for other tasks than a task
causing page fault. So, we avoid accounting at swap-in I/O.

At page migration, accounting information is kept.

Note: we just account pages-on-LRU because our purpose is to control amount
of used pages; not-on-LRU pages tend to be out-of-control from VM view.

2.3 Shared Page Accounting

Shared pages are accounted on the basis of the first touch approach. The
cgroup that first touches a page is accounted for the page. The principle
behind this approach is that a cgroup that aggressively uses a shared
page will eventually get charged for it (once it is uncharged from
the cgroup that brought it in -- this will happen on memory pressure).

But see section 8.2: when moving a task to another cgroup, its pages may
be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.

Exception: If CONFIG_MEMCG_SWAP is not used.
When you do swapoff and force swapped-out pages of shmem (tmpfs) back
into memory, charges for those pages are accounted against the
caller of swapoff rather than the users of shmem.

2.4 Swap Extension (CONFIG_MEMCG_SWAP)

Swap Extension allows you to record charge for swap. A swapped-in page is
charged back to original page allocator if possible.

When swap is accounted, the following files are added:
 - memory.memsw.usage_in_bytes.
 - memory.memsw.limit_in_bytes.

memsw means memory+swap. Usage of memory+swap is limited by
memsw.limit_in_bytes.

Example: Assume a system with 4G of swap. A task which allocates 6G of memory
(by mistake) under 2G memory limitation will use all swap.
In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.
By using the memsw limit, you can avoid system OOM which can be caused by swap
shortage.
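
The arithmetic behind this example can be sketched as follows. The helper
name swap_headroom is hypothetical. It computes how much swap a cgroup can
still consume while its memory usage sits at the memory limit, which is the
memsw limit minus the memory limit. The check that the memsw limit is not
below the memory limit is an assumption about sane configurations, not
something stated in this section.

```shell
#!/bin/sh
# swap_headroom MEM_LIMIT MEMSW_LIMIT   (both in bytes)
# With memory usage at MEM_LIMIT, memory+swap may grow up to MEMSW_LIMIT,
# so at most MEMSW_LIMIT - MEM_LIMIT bytes of swap are usable.
swap_headroom() {
    mem_limit=$1
    memsw_limit=$2
    if [ "$memsw_limit" -lt "$mem_limit" ]; then
        echo "memsw limit is below the memory limit" >&2
        return 1
    fi
    echo $(( $memsw_limit - $mem_limit ))
}

G=$(( 1024 * 1024 * 1024 ))
# 2G memory limit, 3G memory+swap limit: 1G of swap instead of all 4G.
swap_headroom $(( 2 * G )) $(( 3 * G ))    # prints 1073741824
```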

* why 'memory+swap' rather than swap.
The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
to move account from memory to swap...there is no change in usage of
memory+swap. In other words, when we want to limit the usage of swap without
affecting global LRU, memory+swap limit is better than just limiting swap from
an OS point of view.

* What happens when a cgroup hits memory.memsw.limit_in_bytes
When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
in this cgroup. Then, swap-out will not be done by cgroup routine and file
caches are dropped. But as mentioned above, the global LRU can still swap out
memory from it for sanity of the system's memory management state. You can't
forbid it by cgroup.

2.5 Reclaim

Each cgroup maintains a per cgroup LRU which has the same structure as
global VM. When a cgroup goes over its limit, we first try
to reclaim memory from the cgroup so as to make space for the new
pages that the cgroup has touched. If the reclaim is unsuccessful,
an OOM routine is invoked to select and kill the bulkiest task in the
cgroup. (See 10. OOM Control below.)

The reclaim algorithm has not been modified for cgroups, except that
pages that are selected for reclaiming come from the per-cgroup LRU
list.

NOTE: Reclaim does not work for the root cgroup, since we cannot set any
limits on the root cgroup.

Note2: When panic_on_oom is set to "2", the whole system will panic.

When oom event notifier is registered, event will be delivered.
(See oom_control section)

2.6 Locking

   lock_page_cgroup()/unlock_page_cgroup() should not be called under
   mapping->tree_lock.

   The remaining lock order is as follows:
   PG_locked.
   mm->page_table_lock
       zone->lru_lock
          lock_page_cgroup.
  In many cases, just lock_page_cgroup() is called.
  per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
  zone->lru_lock, it has no lock of its own.

2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)

With the Kernel memory extension, the Memory Controller is able to limit
the amount of kernel memory used by the system. Kernel memory is fundamentally
different from user memory, since it can't be swapped out, which makes it
possible to DoS the system by consuming too much of this precious resource.

Kernel memory won't be accounted at all until a limit on a group is set. This
allows for existing setups to continue working without disruption. The limit
cannot be set if the cgroup has children, or if there are already tasks in the
cgroup. Attempting to set the limit under those conditions will return -EBUSY.
When use_hierarchy == 1 and a group is accounted, its children will
automatically be accounted regardless of their limit value.

After a group is first limited, it will continue to be accounted until it
is removed. The memory limitation itself can, of course, be removed by writing
-1 to memory.kmem.limit_in_bytes. In this case, kmem will be accounted, but not
limited.

Kernel memory limits are not imposed for the root cgroup. Usage for the root
cgroup may or may not be accounted. The memory used is accumulated into
memory.kmem.usage_in_bytes, or in a separate counter when it makes sense
(currently only for tcp).
The main "kmem" counter is fed into the main counter, so kmem charges will
also be visible from the user counter.

Currently no soft limit is implemented for kernel memory. It is future work
to trigger slab reclaim when those limits are reached.

2.7.1 Current Kernel Memory resources accounted

* stack pages: every process consumes some stack pages. By accounting into
kernel memory, we prevent new processes from being created when the kernel
memory usage is too high.

* sockets memory pressure: some sockets protocols have memory pressure
thresholds. The Memory Controller allows them to be controlled individually
per cgroup, instead of globally.

* tcp memory pressure: sockets memory pressure for the tcp protocol.

2.7.3 Common use cases

Because the "kmem" counter is fed to the main user counter, kernel memory can
never be limited completely independently of user memory. Say "U" is the user
limit, and "K" the kernel limit. There are three possible ways limits can be
set:

    U != 0, K = unlimited:
    This is the standard memcg limitation mechanism already present before kmem
    accounting. Kernel memory is completely ignored.

    U != 0, K < U:
    Kernel memory is a subset of the user memory. This setup is useful in
    deployments where the total amount of memory per-cgroup is overcommitted.
    Overcommitting kernel memory limits is definitely not recommended, since the
    box can still run out of non-reclaimable memory.
    In this case, the admin could set up K so that the sum of all groups is
    never greater than the total memory, and freely set U at the cost of his
    QoS.

    U != 0, K >= U:
    Since kmem charges will also be fed to the user counter, reclaim will be
    triggered for the cgroup for both kinds of memory. This setup gives the
    admin a unified view of memory, and it is also useful for people who just
    want to track kernel memory usage.
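
The three cases can be told apart mechanically. The sketch below is only an
illustration: classify_kmem_setup is a hypothetical helper, the word
"unlimited" stands in for whatever value the kernel reports for an unset
limit, and the printed descriptions paraphrase the text above.

```shell
#!/bin/sh
# classify_kmem_setup U K
# U is the user limit in bytes (non-zero); K is the kernel limit in
# bytes, or the literal word "unlimited".
classify_kmem_setup() {
    u=$1
    k=$2
    if [ "$k" = "unlimited" ]; then
        echo "K unlimited: kernel memory is ignored"
    elif [ "$k" -lt "$u" ]; then
        echo "K < U: kernel memory is a subset of user memory"
    else
        echo "K >= U: unified view, reclaim covers both kinds of memory"
    fi
}

classify_kmem_setup 4194304 unlimited
classify_kmem_setup 4194304 1048576
classify_kmem_setup 4194304 4194304
```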

3. User Interface

0. Configuration

a. Enable CONFIG_CGROUPS
b. Enable CONFIG_RESOURCE_COUNTERS
c. Enable CONFIG_MEMCG
d. Enable CONFIG_MEMCG_SWAP (to use swap extension)
e. Enable CONFIG_MEMCG_KMEM (to use kmem extension)

1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
# mount -t tmpfs none /sys/fs/cgroup
# mkdir /sys/fs/cgroup/memory
# mount -t cgroup none /sys/fs/cgroup/memory -o memory

2. Make the new group and move bash into it
# mkdir /sys/fs/cgroup/memory/0
# echo $$ > /sys/fs/cgroup/memory/0/tasks

Since now we're in the 0 cgroup, we can alter the memory limit:
# echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes

NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.)

NOTE: We can write "-1" to reset the *.limit_in_bytes(unlimited).
NOTE: We cannot set limits on the root cgroup any more.

# cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
4194304
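
To make the suffix rule concrete, here is a small user-space sketch that
expands a value the same way, using powers of 1024. The function name
to_bytes is hypothetical, and this is not the kernel's own parser; values
without a suffix (including "-1") are passed through unchanged.

```shell
#!/bin/sh
# to_bytes VALUE
# Expand an optional k/K, m/M or g/G suffix into bytes using binary
# (1024-based) multipliers, as described in the NOTE above.
to_bytes() {
    value=$1
    case $value in
        *[kK]) printf '%s\n' $(( ${value%?} * 1024 )) ;;
        *[mM]) printf '%s\n' $(( ${value%?} * 1024 * 1024 )) ;;
        *[gG]) printf '%s\n' $(( ${value%?} * 1024 * 1024 * 1024 )) ;;
        *)     printf '%s\n' "$value" ;;
    esac
}

to_bytes 4M    # prints 4194304, matching the limit read back above
```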

314

366

315

We can check the usage:

367

We can check the usage:

316

# cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes

368

# cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes

317

1216512

369

1216512

A successful write to this file does not guarantee that the limit is set to
exactly the value written. This can be due to a number of factors, such as
rounding up to page boundaries or the total availability of memory on the
system. The user is required to re-read this file after a write to learn the
value committed by the kernel.

# echo 1 > memory.limit_in_bytes
# cat memory.limit_in_bytes
4096
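
The rounding in the example above can be reproduced with integer arithmetic.
This is only a sketch: the helper name round_up_to_page is hypothetical, the
4096-byte default page size is an assumption (use "getconf PAGESIZE" to query
the real one), and the kernel may adjust the value for other reasons as well,
so re-reading the file remains the only reliable check.

```shell
#!/bin/sh
# round_up_to_page BYTES [PAGE_SIZE]: hypothetical helper.
# Rounds BYTES up to the next multiple of PAGE_SIZE (assumed 4096 if omitted).
round_up_to_page() {
    bytes=$1
    page=${2:-4096}
    echo $(( ($bytes + $page - 1) / $page * $page ))
}

round_up_to_page 1    # prints 4096
```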

The memory.failcnt field gives the number of times that the cgroup limit was
exceeded.

The memory.stat file gives accounting information. Currently, the amounts of
cache, RSS and active/inactive pages are shown.

4. Testing

For testing features and implementation, see memcg_test.txt.

Performance testing is also important. To see the pure overhead of the memory
controller, testing on tmpfs will give you good numbers for small overheads.
Example: do a kernel make on tmpfs.

Page-fault scalability is also important. When measuring parallel page
faults, a multi-process test may be better than a multi-thread test because
the latter has noise from shared objects/status.

But the above two test extreme situations.
Running your usual tests under the memory controller is always helpful.

4.1 Troubleshooting

Sometimes a user might find that the application under a cgroup is
terminated by the OOM killer. There are several causes for this:

1. The cgroup limit is too low (just too low to do anything useful)
2. The user is using anonymous memory and swap is turned off or too low

A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
some of the pages cached in the cgroup (page cache pages).

To find out what is happening, it helps to disable the OOM killer as per
"10. OOM Control" (below) and observe the cgroup.
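
When investigating such a case, it can help to look at the limit, the usage
and the failure count side by side. The following is a minimal sketch; the
function name memcg_report is hypothetical, and it only reads the
memory.limit_in_bytes, memory.usage_in_bytes and memory.failcnt files
described in this document from the directory it is given.

```shell
#!/bin/sh
# memcg_report DIR: hypothetical helper.
# Prints limit, usage and failcnt of the memory cgroup directory DIR and
# points out when the limit has been hit at least once.
memcg_report() {
    dir=$1
    limit=$(cat "$dir/memory.limit_in_bytes")
    usage=$(cat "$dir/memory.usage_in_bytes")
    failcnt=$(cat "$dir/memory.failcnt")
    echo "limit=$limit usage=$usage failcnt=$failcnt"
    if [ "$failcnt" -gt 0 ]; then
        echo "limit was hit $failcnt time(s)"
    fi
}

# Example (requires a mounted memory cgroup):
#   memcg_report /sys/fs/cgroup/memory/0
```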

4.2 Task migration

When a task migrates from one cgroup to another, its charge is not
carried forward by default. The pages allocated from the original cgroup still
remain charged to it; the charge is dropped when the page is freed or
reclaimed.

You can move charges of a task along with task migration.
See 8. "Move charges at task migration"

4.3 Removing a cgroup

A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
cgroup might have some charge associated with it, even though all
tasks have migrated away from it (because we charge against pages, not
against tasks).

We move the stats to root (if use_hierarchy==0) or parent (if
use_hierarchy==1), and no change on the charge except uncharging
from the child.

Charges recorded in swap information are not updated at removal of a cgroup.
The recorded information is discarded, and a cgroup which uses the swap
(swapcache) will be charged as its new owner.

About use_hierarchy, see Section 6.

5. Misc. interfaces.

5.1 force_empty
The memory.force_empty interface is provided to make the cgroup's memory usage
empty. You can use this interface only when the cgroup has no tasks.
When anything is written to this file, for example

# echo 0 > memory.force_empty

almost all pages tracked by this memory cgroup will be unmapped and freed.
Some pages cannot be freed because they are locked or in-use. Such pages are
moved to parent (if use_hierarchy==1) or root (if use_hierarchy==0) and this
cgroup will be empty.

The typical use case for this interface is before calling rmdir().
Because rmdir() moves all pages to parent, some out-of-use page caches can be
moved to the parent. If you want to avoid that, force_empty will be useful.

Also, note that when memory.kmem.limit_in_bytes is set the charges due to
kernel pages will still be seen. This is not considered a failure and the
write will still return success. In this case, it is expected that
memory.kmem.usage_in_bytes == memory.usage_in_bytes.

About use_hierarchy, see Section 6.

5.2 stat file

memory.stat file includes following statistics

# per-memory cgroup local status
cache           - # of bytes of page cache memory.
rss             - # of bytes of anonymous and swap cache memory.
mapped_file     - # of bytes of mapped file (includes tmpfs/shmem)
pgpgin          - # of charging events to the memory cgroup. The charging
                  event happens each time a page is accounted as either mapped
                  anon page(RSS) or cache page(Page Cache) to the cgroup.
pgpgout         - # of uncharging events to the memory cgroup. The uncharging
                  event happens each time a page is unaccounted from the cgroup.
swap            - # of bytes of swap usage
inactive_anon   - # of bytes of anonymous memory and swap cache memory on
                  inactive LRU list.
active_anon     - # of bytes of anonymous and swap cache memory on active
                  LRU list.
inactive_file   - # of bytes of file-backed memory on inactive LRU list.
active_file     - # of bytes of file-backed memory on active LRU list.
unevictable     - # of bytes of memory that cannot be reclaimed (mlocked etc).

# status considering hierarchy (see memory.use_hierarchy settings)

hierarchical_memory_limit - # of bytes of memory limit with regard to hierarchy
                            under which the memory cgroup is
hierarchical_memsw_limit  - # of bytes of memory+swap limit with regard to
                            hierarchy under which memory cgroup is.

total_<counter>           - # hierarchical version of <counter>, which in
                            addition to the cgroup's own value includes the
                            sum of all hierarchical children's values of
                            <counter>, e.g. total_cache

# The following additional stats are dependent on CONFIG_DEBUG_VM.

recent_rotated_anon - VM internal parameter. (see mm/vmscan.c)
recent_rotated_file - VM internal parameter. (see mm/vmscan.c)
recent_scanned_anon - VM internal parameter. (see mm/vmscan.c)
recent_scanned_file - VM internal parameter. (see mm/vmscan.c)

Memo:
        recent_rotated means recent frequency of LRU rotation.
        recent_scanned means recent # of scans to LRU.
        These are shown for better debugging; please see the code for their
        meanings.

Note:
        Only anonymous and swap cache memory is listed as part of 'rss' stat.
        This should not be confused with the true 'resident set size' or the
        amount of physical memory used by the cgroup.
        'rss + mapped_file' will give you resident set size of cgroup.
        (Note: file and shmem may be shared among other cgroups. In that case,
         mapped_file is accounted only when the memory cgroup is owner of page
         cache.)
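
The resident set size mentioned in the note can be computed from the rss and
mapped_file counters listed above. This is a minimal sketch that assumes
memory.stat contains one "<name> <value>" pair per line; the helper names
memstat_get and resident_bytes are hypothetical.

```shell
#!/bin/sh
# memstat_get FILE KEY: hypothetical helper.
# Prints the value of the counter KEY from a memory.stat style FILE.
# The match is exact, so "rss" does not match "total_rss".
memstat_get() {
    awk -v key="$2" '$1 == key { print $2 }' "$1"
}

# resident_bytes FILE: prints rss + mapped_file, treating a missing
# counter as 0.
resident_bytes() {
    rss=$(memstat_get "$1" rss)
    mapped=$(memstat_get "$1" mapped_file)
    echo $(( ${rss:-0} + ${mapped:-0} ))
}

# Example (requires a mounted memory cgroup):
#   resident_bytes /sys/fs/cgroup/memory/0/memory.stat
```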

5.3 swappiness

Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
Please note that unlike the global swappiness, the memcg knob set to 0
really prevents any swapping even if there is swap storage
available. This might lead to the memcg OOM killer if there are no file
pages to reclaim.

The swappiness of the following cgroups can't be changed:
- root cgroup (uses /proc/sys/vm/swappiness).
- a cgroup which uses hierarchy and it has other cgroup(s) below it.
- a cgroup which uses hierarchy and not the root of hierarchy.

5.4 failcnt

A memory cgroup provides memory.failcnt and memory.memsw.failcnt files.
This failcnt (== failure count) shows the number of times that a usage counter
hit its limit. When a memory cgroup hits a limit, failcnt increases and
memory under it will be reclaimed.

You can reset failcnt by writing 0 to failcnt file.
# echo 0 > .../memory.failcnt

5.5 usage_in_bytes

For efficiency, like other kernel components, memory cgroup uses some
optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is
affected by this method and doesn't show the 'exact' value of memory (and swap)
usage; it is a fuzz value for efficient access. (Of course, when necessary,
it's synchronized.) If you want to know more exact memory usage, you should use
the RSS+CACHE(+SWAP) value in memory.stat (see 5.2).
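
The RSS+CACHE(+SWAP) sum can be taken in a single pass over memory.stat. This
is a sketch under the assumption that the file holds one "<name> <value>" pair
per line; the function name stat_usage and its "yes" switch for including swap
are hypothetical.

```shell
#!/bin/sh
# stat_usage DIR [yes]: hypothetical helper.
# Prints rss + cache from DIR/memory.stat. When the second argument is
# "yes", the swap counter is added as well.
stat_usage() {
    dir=$1
    with_swap=${2:-no}
    awk -v swap="$with_swap" '
        $1 == "rss" || $1 == "cache"  { sum += $2 }
        $1 == "swap" && swap == "yes" { sum += $2 }
        END { printf "%.0f\n", sum }
    ' "$dir/memory.stat"
}

# Example (requires a mounted memory cgroup):
#   stat_usage /sys/fs/cgroup/memory/0
#   stat_usage /sys/fs/cgroup/memory/0 yes
```

The first form corresponds to memory.usage_in_bytes and the second to
memory.memsw.usage_in_bytes; either result may differ slightly from the file
it is compared with, for the reason given above.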

5.6 numa_stat

This is similar to numa_maps but operates on a per-memcg basis. This is
useful for providing visibility into the numa locality information within
a memcg since the pages are allowed to be allocated from any physical
node. One of the use cases is evaluating application performance by
combining this information with the application's CPU allocation.

We export "total", "file", "anon" and "unevictable" pages per-node for
each memcg. The output format of memory.numa_stat is:

total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
unevictable=<total unevictable pages> N0=<node 0 pages> N1=<node 1 pages> ...

And we have total = file + anon + unevictable.
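
The relation above can be verified against a numa_stat file. The following is
a minimal sketch that relies only on the output format shown above; the
function name numa_stat_check is hypothetical, and only the first field of
each line (the per-cgroup total) is examined.

```shell
#!/bin/sh
# numa_stat_check FILE: hypothetical helper.
# Reads the leading "<name>=<pages>" field of each line in FILE and checks
# that total equals file + anon + unevictable.
numa_stat_check() {
    awk '
        {
            split($1, kv, "=")
            val[kv[1]] = kv[2]
        }
        END {
            sum = val["file"] + val["anon"] + val["unevictable"]
            if (sum == val["total"])
                print "consistent"
            else
                print "mismatch: total=" val["total"] " sum=" sum
        }
    ' "$1"
}

# Example (requires a mounted memory cgroup):
#   numa_stat_check /sys/fs/cgroup/memory/0/memory.numa_stat
```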

6. Hierarchy support

The memory controller supports a deep hierarchy and hierarchical accounting.
The hierarchy is created by creating the appropriate cgroups in the
cgroup filesystem. Consider for example, the following cgroup filesystem
hierarchy

               root
             /  |   \
            /   |    \
           a    b     c
                      | \
                      |  \
                      d   e

In the diagram above, with hierarchical accounting enabled, all memory
usage of e is accounted to its ancestors up until the root (i.e., c and root)
that have memory.use_hierarchy enabled. If one of the ancestors goes over its
limit, the reclaim algorithm reclaims from the tasks in the ancestor and the
children of the ancestor.

6.1 Enabling hierarchical accounting and reclaim

A memory cgroup by default disables the hierarchy feature. Support
can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup

# echo 1 > memory.use_hierarchy

The feature can be disabled by

# echo 0 > memory.use_hierarchy

NOTE1: Enabling/disabling will fail if either the cgroup already has other
       cgroups created below it, or if the parent cgroup has use_hierarchy
       enabled.

NOTE2: When panic_on_oom is set to "2", the whole system will panic in
       case of an OOM event in any cgroup.

7. Soft limits

Soft limits allow for greater sharing of memory. The idea behind soft limits
is to allow control groups to use as much of the memory as needed, provided

a. There is no memory contention
b. They do not exceed their hard limit

When the system detects memory contention or low memory, control groups
are pushed back to their soft limits. If the soft limit of each control
group is very high, they are pushed back as much as possible to make
sure that one control group does not starve the others of memory.

Please note that soft limits is a best-effort feature; it comes with
no guarantees, but it does its best to make sure that when memory is
heavily contended for, memory is allocated based on the soft limit
hints/setup. Currently soft limit based reclaim is set up such that
it gets invoked from balance_pgdat (kswapd).

7.1 Interface

Soft limits can be set up by using the following commands (in this example we
assume a soft limit of 256 MiB)

# echo 256M > memory.soft_limit_in_bytes

If we want to change this to 1G, we can at any time use

# echo 1G > memory.soft_limit_in_bytes

NOTE1: Soft limits take effect over a long period of time, since they involve
       reclaiming memory for balancing between memory cgroups
NOTE2: It is recommended to set the soft limit always below the hard limit,
       otherwise the hard limit will take precedence.
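
The recommendation in NOTE2 can be checked with a small script. This is only a
sketch: the function name check_soft_limit is hypothetical, and it compares
the memory.soft_limit_in_bytes and memory.limit_in_bytes files of the
directory it is given, assuming both hold plain byte counts.

```shell
#!/bin/sh
# check_soft_limit DIR: hypothetical helper.
# Prints "ok" when the soft limit of the cgroup directory DIR is below its
# hard limit, and a warning otherwise.
check_soft_limit() {
    dir=$1
    soft=$(cat "$dir/memory.soft_limit_in_bytes")
    hard=$(cat "$dir/memory.limit_in_bytes")
    if [ "$soft" -lt "$hard" ]; then
        echo "ok"
    else
        echo "warning: soft limit $soft is not below hard limit $hard"
    fi
}

# Example (requires a mounted memory cgroup):
#   check_soft_limit /sys/fs/cgroup/memory/0
```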

8. Move charges at task migration

Users can move charges associated with a task along with task migration, that
is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
This feature is not supported in !CONFIG_MMU environments because of lack of
page tables.

8.1 Interface

This feature is disabled by default. It can be enabled (and disabled again) by
writing to memory.move_charge_at_immigrate of the destination cgroup.

If you want to enable it:

# echo (some positive value) > memory.move_charge_at_immigrate

Note: Each bit of move_charge_at_immigrate has its own meaning about what type
      of charges should be moved. See 8.2 for details.
Note: Charges are moved only when you move mm->owner, in other words,
      a leader of a thread group.
Note: If we cannot find enough space for the task in the destination cgroup, we
      try to make space by reclaiming memory. Task migration may fail if we
      cannot make enough space.
Note: It can take several seconds if you move many charges.

And if you want to disable it again:

# echo 0 > memory.move_charge_at_immigrate

8.2 Type of charges which can be moved

Each bit in move_charge_at_immigrate has its own meaning about what type of
charges should be moved. But in any case, it must be noted that an account of
a page or a swap can be moved only when it is charged to the task's current
(old) memory cgroup.

  bit | what type of charges would be moved ?
 -----+------------------------------------------------------------------------
   0  | A charge of an anonymous page (or swap of it) used by the target task.
      | You must enable Swap Extension (see 2.4) to enable move of swap charges.
 -----+------------------------------------------------------------------------
   1  | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory)
      | and swaps of tmpfs file) mmapped by the target task. Unlike the case of
      | anonymous pages, file pages (and swaps) in the range mmapped by the task
      | will be moved even if the task hasn't done page fault, i.e. they might
      | not be the task's "RSS", but other task's "RSS" that maps the same file.
      | And mapcount of the page is ignored (the page can be moved even if
      | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to
      | enable move of swap charges.
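
The value written to memory.move_charge_at_immigrate is the bitwise OR of the
bits in the table above. The following sketch builds that value from symbolic
names; the function name move_charge_value and the names "anon" and "file" are
hypothetical labels for bit 0 and bit 1 and are not understood by the kernel.

```shell
#!/bin/sh
# move_charge_value TYPE...: hypothetical helper.
# Maps "anon" to bit 0 and "file" to bit 1 and prints the combined value.
move_charge_value() {
    value=0
    for type in "$@"; do
        case $type in
            anon) value=$(( $value | (1 << 0) )) ;;
            file) value=$(( $value | (1 << 1) )) ;;
            *)    echo "unknown charge type: $type" >&2; return 1 ;;
        esac
    done
    echo "$value"
}

move_charge_value anon file    # prints 3

# Example (requires a mounted memory cgroup):
#   move_charge_value anon file > memory.move_charge_at_immigrate
```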

8.3 TODO

- All of moving charge operations are done under cgroup_mutex. It's not good
  behavior to hold the mutex too long, so we may need some trick.

9. Memory thresholds

Memory cgroup implements memory thresholds using the cgroups notification
API (see cgroups.txt). It allows an application to register multiple memory
and memsw thresholds and to get a notification when usage crosses one of them.

To register a threshold, an application must:
- create an eventfd using eventfd(2);
- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
- write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
  cgroup.event_control.

Application will be notified through eventfd when memory usage crosses
threshold in any direction.

It's applicable for root and non-root cgroup.

10. OOM Control

memory.oom_control file is for OOM notification and other controls.

Memory cgroup implements an OOM notifier using the cgroup notification
API (see cgroups.txt). Multiple OOM notification deliveries can be
registered, and each one is notified when OOM happens.

To register a notifier, an application must:
 - create an eventfd using eventfd(2)
 - open memory.oom_control file
 - write a string like "<event_fd> <fd of memory.oom_control>" to
   cgroup.event_control

The application will be notified through eventfd when OOM happens.
OOM notification doesn't work for the root cgroup.
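A sketch of the notifier side, under the same assumptions as any eventfd consumer: the function names and the cgroup directory are hypothetical, and os.eventfd() requires Linux with Python 3.10 or later. The only kernel-facing facts used are the two-field registration string described above and the eventfd(2) rule that a read returns an 8-byte unsigned counter in host byte order.

```python
import os
import struct


def oom_event_line(event_fd: int, oom_control_fd: int) -> str:
    """Build "<event_fd> <fd of memory.oom_control>"."""
    return f"{event_fd} {oom_control_fd}"


def decode_eventfd_counter(raw: bytes) -> int:
    """Decode the 8-byte host-endian counter returned by an eventfd read."""
    (count,) = struct.unpack("=Q", raw)
    return count


def wait_for_oom(cgroup_dir: str) -> int:
    """Block until the cgroup reports OOM; return the eventfd counter value.

    cgroup_dir is an assumed non-root memory cgroup directory.
    """
    event_fd = os.eventfd(0)
    oom_fd = os.open(os.path.join(cgroup_dir, "memory.oom_control"), os.O_RDONLY)
    try:
        with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as control:
            control.write(oom_event_line(event_fd, oom_fd))
        # Blocks here until at least one notification has been delivered.
        return decode_eventfd_counter(os.read(event_fd, 8))
    finally:
        os.close(oom_fd)
        os.close(event_fd)


print(oom_event_line(3, 4))
```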

You can disable the OOM-killer by writing "1" to memory.oom_control file, as:

	# echo 1 > memory.oom_control

This operation is only allowed for the top cgroup of a sub-hierarchy.
If the OOM-killer is disabled, tasks under the cgroup will hang/sleep
in the memory cgroup's OOM-waitqueue when they request accountable memory.

To get them running again, you have to relax the memory cgroup's OOM status by
	* enlarging the limit or reducing usage.
To reduce usage,
	* kill some tasks.
	* move some tasks to another group with account migration.
	* remove some files (on tmpfs?)

Then, stopped tasks will work again.

On reading, the current status of OOM is shown.
	oom_kill_disable 0 or 1 (if 1, oom-killer is disabled)
	under_oom	 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
				 be stopped.)
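Reading the status back can be sketched like this. The parser assumes the file holds one "name value" pair per line, as the two fields listed above suggest; the function name and the sample text are illustrative only.

```python
def parse_oom_control(text: str) -> dict:
    """Parse "name value" lines from memory.oom_control into a dict of ints."""
    status = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            status[fields[0]] = int(fields[1])
    return status


# Sample content for a cgroup with the OOM-killer disabled and no OOM pending.
sample = "oom_kill_disable 1\nunder_oom 0\n"
status = parse_oom_control(sample)
if status["oom_kill_disable"] and status["under_oom"]:
    print("tasks may be sleeping in the OOM-waitqueue")
else:
    print("no tasks are blocked by OOM")
```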

11. TODO

1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first
3. Teach controller to account for shared-pages
4. Start reclamation in the background when the limit is
   not yet hit but the usage is getting closer

Summary

Overall, the memory controller has been a stable controller and has been
commented and discussed quite extensively in the community.

References

1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
2. Singh, Balbir. Memory Controller (RSS Control),
   http://lwn.net/Articles/222762/
3. Emelianov, Pavel. Resource controllers based on process cgroups
   http://lkml.org/lkml/2007/3/6/198
4. Emelianov, Pavel. RSS controller based on process cgroups (v2)
   http://lkml.org/lkml/2007/4/9/78
5. Emelianov, Pavel. RSS controller based on process cgroups (v3)
   http://lkml.org/lkml/2007/5/30/244
6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/
7. Vaidyanathan, Srinivasan, Control Groups: Pagecache accounting and control
   subsystem (v3), http://lwn.net/Articles/235534/
8. Singh, Balbir. RSS controller v2 test results (lmbench),
   http://lkml.org/lkml/2007/5/17/232
9. Singh, Balbir. RSS controller v2 AIM9 results
   http://lkml.org/lkml/2007/5/18/1
10. Singh, Balbir. Memory controller v6 test results,
    http://lkml.org/lkml/2007/8/19/36
11. Singh, Balbir. Memory controller introduction (v6),
    http://lkml.org/lkml/2007/8/17/69
12. Corbet, Jonathan, Controlling memory use in cgroups,
    http://lwn.net/Articles/243795/


memcg: add documentation about the kmem controller

 Memory Resource Controller
 NOTE: The Memory Resource Controller has generically been referred to as the
       memory controller in this document. Do not confuse memory controller
       used here with the memory controller that is used in hardware.
 (For editors)
 In this document:
       When we mention a cgroup (cgroupfs's directory) with memory controller,
       we call it "memory cgroup". When you see git-log and source code, you'll
       see patch's title and function names tend to use "memcg".
       In this document, we avoid using it.
 Benefits and Purpose of the memory controller
 The memory controller isolates the memory behaviour of a group of tasks
 from the rest of the system. The article on LWN [12] mentions some probable
 uses of the memory controller. The memory controller can be used to
 a. Isolate an application or a group of applications
    Memory-hungry applications can be isolated and limited to a smaller
    amount of memory.
 b. Create a cgroup with a limited amount of memory; this can be used
    as a good alternative to booting with mem=XXXX.
 c. Virtualization solutions can control the amount of memory they want
    to assign to a virtual machine instance.
 d. A CD/DVD burner could control the amount of memory used by the
    rest of the system to ensure that burning does not fail due to lack
    of available memory.
 e. There are several other use cases; find one or use the controller just
    for fun (to learn and hack on the VM subsystem).
 Current Status: linux-2.6.34-mmotm(development version of 2010/April)
 Features:
  - accounting anonymous pages, file caches, swap caches usage and limiting them.
  - pages are linked to per-memcg LRU exclusively, and there is no global LRU.
  - optionally, memory+swap usage can be accounted and limited.
  - hierarchical accounting
  - soft limit
  - moving (recharging) account at moving a task is selectable.
  - usage threshold notifier
  - oom-killer disable knob and oom-notifier
  - Root cgroup has no limit controls.
  Kernel memory support is a work in progress, and the current version provides
  basic functionality. (See Section 2.7)
 Brief summary of control files.
  tasks				 # attach a task(thread) and show list of threads
  cgroup.procs			 # show list of processes
  cgroup.event_control		 # an interface for eventfd()
  memory.usage_in_bytes		 # show current res_counter usage for memory
 				 (See 5.5 for details)
  memory.memsw.usage_in_bytes	 # show current res_counter usage for memory+Swap
 				 (See 5.5 for details)
  memory.limit_in_bytes		 # set/show limit of memory usage
  memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
  memory.failcnt			 # show the number of memory usage hits limits
  memory.memsw.failcnt		 # show the number of memory+Swap hits limits
  memory.max_usage_in_bytes	 # show max memory usage recorded
  memory.memsw.max_usage_in_bytes # show max memory+Swap usage recorded
  memory.soft_limit_in_bytes	 # set/show soft limit of memory usage
  memory.stat			 # show various statistics
  memory.use_hierarchy		 # set/show hierarchical account enabled
  memory.force_empty		 # trigger forced move charge to parent
  memory.swappiness		 # set/show swappiness parameter of vmscan
 				 (See sysctl's vm.swappiness)
  memory.move_charge_at_immigrate # set/show controls of moving charges
  memory.oom_control		 # set/show oom controls.
  memory.numa_stat		 # show the number of memory usage per numa node
+ memory.kmem.limit_in_bytes      # set/show hard limit for kernel memory
+ memory.kmem.usage_in_bytes      # show current kernel memory allocation
+ memory.kmem.failcnt             # show the number of kernel memory usage hits limits
+ memory.kmem.max_usage_in_bytes  # show max kernel memory usage recorded
  memory.kmem.tcp.limit_in_bytes  # set/show hard limit for tcp buf memory
  memory.kmem.tcp.usage_in_bytes  # show current tcp buf memory allocation
  memory.kmem.tcp.failcnt            # show the number of tcp buf memory usage hits limits
  memory.kmem.tcp.max_usage_in_bytes # show max tcp buf memory usage recorded
 1. History
 The memory controller has a long history. A request for comments for the memory
 controller was posted by Balbir Singh [1]. At the time the RFC was posted
 there were several implementations for memory control. The goal of the
 RFC was to build consensus and agreement for the minimal features required
 for memory control. The first RSS controller was posted by Balbir Singh[2]
 in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
 RSS controller. At OLS, at the resource management BoF, everyone suggested
 that we handle both page cache and RSS together. Another request was raised
 to allow user space handling of OOM. The current memory controller is
 at version 6; it combines both mapped (RSS) and unmapped Page
 Cache Control [11].
 2. Memory Control
 Memory is a unique resource in the sense that it is present in a limited
 amount. If a task requires a lot of CPU processing, the task can spread
 its processing over a period of hours, days, months or years, but with
 memory, the same physical memory needs to be reused to accomplish the task.
 The memory controller implementation has been divided into phases. These
 are:
 1. Memory controller
 2. mlock(2) controller
 3. Kernel user memory accounting and slab control
 4. user mappings length controller
 The memory controller is the first controller developed.
 2.1. Design
 The core of the design is a counter called the res_counter. The res_counter
 tracks the current memory usage and limit of the group of processes associated
 with the controller. Each cgroup has a memory controller specific data
 structure (mem_cgroup) associated with it.
 2.2. Accounting
                 +--------------------+
                 |  mem_cgroup        |
                 |  (res_counter)     |
                 +--------------------+
                  /            ^      \
                 /             |       \
            +---------------+  |        +---------------+
            | mm_struct     |  |....    | mm_struct     |
            |               |  |        |               |
            +---------------+  |        +---------------+
                               |
                               + --------------+
                                               |
            +---------------+           +------+--------+
            | page          +---------->  page_cgroup   |
            |               |           |               |
            +---------------+           +---------------+
              (Figure 1: Hierarchy of Accounting)
 Figure 1 shows the important aspects of the controller
 1. Accounting happens per cgroup
 2. Each mm_struct knows about which cgroup it belongs to
 3. Each page has a pointer to the page_cgroup, which in turn knows the
    cgroup it belongs to
 The accounting is done as follows: mem_cgroup_charge_common() is invoked to
 set up the necessary data structures and check if the cgroup that is being
 charged is over its limit. If it is, then reclaim is invoked on the cgroup.
 More details can be found in the reclaim section of this document.
 If everything goes well, a page meta-data-structure called page_cgroup is
 updated. page_cgroup has its own LRU on cgroup.
 (*) page_cgroup structure is allocated at boot/memory-hotplug time.
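The charge path described above (check the limit, reclaim if over, then retry) can be illustrated with a toy model. This is not the kernel's res_counter or mem_cgroup_charge_common(); the class, the single-retry policy and the reclaim callback are assumptions made only to show the control flow.

```python
class ResCounter:
    """Toy usage/limit counter with a failure count."""

    def __init__(self, limit: int):
        self.limit = limit
        self.usage = 0
        self.failcnt = 0

    def charge(self, nbytes: int) -> bool:
        """Account nbytes, or record a failure if that would exceed the limit."""
        if self.usage + nbytes > self.limit:
            self.failcnt += 1
            return False
        self.usage += nbytes
        return True

    def uncharge(self, nbytes: int) -> None:
        self.usage = max(0, self.usage - nbytes)


def charge_with_reclaim(counter: ResCounter, nbytes: int, reclaim) -> bool:
    """Try to charge; on failure run reclaim(counter, nbytes) once and retry."""
    if counter.charge(nbytes):
        return True
    reclaim(counter, nbytes)
    return counter.charge(nbytes)


counter = ResCounter(limit=2 * 4096)
counter.charge(4096)
counter.charge(4096)
# The third page goes over the limit, so reclaim frees one page first.
ok = charge_with_reclaim(counter, 4096, lambda c, n: c.uncharge(n))
print(ok, counter.usage, counter.failcnt)
```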
 2.2.1 Accounting details
 All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
 Some pages which are never reclaimable and will not be on the LRU
 are not accounted. We just account pages under usual VM management.
 RSS pages are accounted at page_fault unless they've already been accounted
 for earlier. A file page will be accounted for as Page Cache when it's
 inserted into inode (radix-tree). While it's mapped into the page tables of
 processes, duplicate accounting is carefully avoided.
 An RSS page is unaccounted when it's fully unmapped. A PageCache page is
 unaccounted when it's removed from radix-tree. Even if RSS pages are fully
 unmapped (by kswapd), they may exist as SwapCache in the system until they
 are really freed. Such SwapCaches are also accounted.
 A swapped-in page is not accounted until it's mapped.
 Note: The kernel does swapin-readahead and reads multiple swaps at once.
 This means swapped-in pages may contain pages for other tasks than a task
 causing page fault. So, we avoid accounting at swap-in I/O.
 At page migration, accounting information is kept.
 Note: we just account pages-on-LRU because our purpose is to control amount
 of used pages; not-on-LRU pages tend to be out-of-control from VM view.
 2.3 Shared Page Accounting
 Shared pages are accounted on the basis of the first touch approach. The
 cgroup that first touches a page is accounted for the page. The principle
 behind this approach is that a cgroup that aggressively uses a shared
 page will eventually get charged for it (once it is uncharged from
 the cgroup that brought it in -- this will happen on memory pressure).
 But see section 8.2: when moving a task to another cgroup, its pages may
 be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.
 Exception: If CONFIG_MEMCG_SWAP is not used.
 When you do swapoff and force swapped-out pages of shmem (tmpfs) to
 be brought back into memory, charges for the pages are accounted against the
 caller of swapoff rather than the users of shmem.
 2.4 Swap Extension (CONFIG_MEMCG_SWAP)
 Swap Extension allows you to record charge for swap. A swapped-in page is
 charged back to original page allocator if possible.
 When swap is accounted, the following files are added.
  - memory.memsw.usage_in_bytes.
  - memory.memsw.limit_in_bytes.
 memsw means memory+swap. Usage of memory+swap is limited by
 memsw.limit_in_bytes.
 Example: Assume a system with 4G of swap. A task which allocates 6G of memory
 (by mistake) under 2G memory limitation will use all swap.
 In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.
 By using the memsw limit, you can avoid system OOM which can be caused by swap
 shortage.
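The arithmetic behind this example can be written out as a small sketch. It assumes the task's memory is all anonymous and that resident memory stays pinned at the memory limit while the remainder goes to swap; the function name and this simplified model are illustrative, not a description of kernel behaviour.

```python
GIB = 1 << 30


def swap_used(alloc, mem_limit, swap_total, memsw_limit=None):
    """Swap consumed before the allocation can no longer proceed."""
    overflow = max(0, alloc - mem_limit)  # what does not fit under the memory limit
    cap = swap_total
    if memsw_limit is not None:
        # memory+swap is limited, so swap can only hold the difference.
        cap = min(cap, max(0, memsw_limit - mem_limit))
    return min(overflow, cap)


# 6G allocation, 2G memory limit, 4G of swap, no memsw limit: all swap is used.
print(swap_used(6 * GIB, 2 * GIB, 4 * GIB) // GIB)
# Same task with memsw.limit_in_bytes=3G: at most 1G of swap is used.
print(swap_used(6 * GIB, 2 * GIB, 4 * GIB, memsw_limit=3 * GIB) // GIB)
```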
 * why 'memory+swap' rather than swap.
 The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
 to move account from memory to swap...there is no change in usage of
 memory+swap. In other words, when we want to limit the usage of swap without
 affecting global LRU, memory+swap limit is better than just limiting swap from
 an OS point of view.
 * What happens when a cgroup hits memory.memsw.limit_in_bytes
 When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
 in this cgroup. Then, swap-out will not be done by the cgroup routine and file
 caches are dropped. But as mentioned above, the global LRU can still swap out
 memory from it for the sanity of the system's memory management state. You can't
 forbid it by cgroup.
 2.5 Reclaim
 Each cgroup maintains a per cgroup LRU which has the same structure as
 global VM. When a cgroup goes over its limit, we first try
 to reclaim memory from the cgroup so as to make space for the new
 pages that the cgroup has touched. If the reclaim is unsuccessful,
 an OOM routine is invoked to select and kill the bulkiest task in the
 cgroup. (See 10. OOM Control below.)
 The reclaim algorithm has not been modified for cgroups, except that
 pages that are selected for reclaiming come from the per-cgroup LRU
 list.
 NOTE: Reclaim does not work for the root cgroup, since we cannot set any
 limits on the root cgroup.
 Note2: When panic_on_oom is set to "2", the whole system will panic.
 When oom event notifier is registered, event will be delivered.
 (See oom_control section)
 2.6 Locking
    lock_page_cgroup()/unlock_page_cgroup() should not be called under
    mapping->tree_lock.
    The other lock order is as follows:
    PG_locked.
    mm->page_table_lock
        zone->lru_lock
 	  lock_page_cgroup.
   In many cases, just lock_page_cgroup() is called.
   per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
   zone->lru_lock, it has no lock of its own.
 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
 With the Kernel memory extension, the Memory Controller is able to limit
 the amount of kernel memory used by the system. Kernel memory is fundamentally
 different than user memory, since it can't be swapped out, which makes it
 possible to DoS the system by consuming too much of this precious resource.
+Kernel memory won't be accounted at all until a limit on a group is set. This
+allows for existing setups to continue working without disruption.  The limit
+cannot be set if the cgroup has children, or if there are already tasks in the
+cgroup. Attempting to set the limit under those conditions will return -EBUSY.
+When use_hierarchy == 1 and a group is accounted, its children will
+automatically be accounted regardless of their limit value.
+After a group is first limited, it will keep being accounted until it
+is removed. The memory limitation itself can of course be removed by writing
+-1 to memory.kmem.limit_in_bytes. In this case, kmem will be accounted, but not
+limited.
 Kernel memory limits are not imposed for the root cgroup. Usage for the root
-cgroup may or may not be accounted.
+cgroup may or may not be accounted. The memory used is accumulated into
+memory.kmem.usage_in_bytes, or in a separate counter when it makes sense
+(currently only for tcp).
+The main "kmem" counter is fed into the main counter, so kmem charges will
+also be visible from the user counter.
 Currently no soft limit is implemented for kernel memory. It is future work
 to trigger slab reclaim when those limits are reached.
 2.7.1 Current Kernel Memory resources accounted
+* stack pages: every process consumes some stack pages. By accounting into
+kernel memory, we prevent new processes from being created when the kernel
+memory usage is too high.
 * sockets memory pressure: some sockets protocols have memory pressure
 thresholds. The Memory Controller allows them to be controlled individually
 per cgroup, instead of globally.
 * tcp memory pressure: sockets memory pressure for the tcp protocol.
+2.7.3 Common use cases
+Because the "kmem" counter is fed to the main user counter, kernel memory can
+never be limited completely independently of user memory. Say "U" is the user
+limit, and "K" the kernel limit. There are three possible ways limits can be
+set:
+    U != 0, K = unlimited:
+    This is the standard memcg limitation mechanism already present before kmem
+    accounting. Kernel memory is completely ignored.
+    U != 0, K < U:
+    Kernel memory is a subset of the user memory. This setup is useful in
+    deployments where the total amount of memory per-cgroup is overcommitted.
+    Overcommitting kernel memory limits is definitely not recommended, since the
+    box can still run out of non-reclaimable memory.
+    In this case, the admin could set up K so that the sum of all groups is
+    never greater than the total memory, and freely set U at the cost of his
+    QoS.
+    U != 0, K >= U:
+    Since kmem charges will also be fed to the user counter, reclaim will be
+    triggered for the cgroup for both kinds of memory. This setup gives the
+    admin a unified view of memory, and it is also useful for people who just
+    want to track kernel memory usage.
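The three configurations can be told apart mechanically. The helper below is only a restatement of the cases above as code; the function name, the use of None for "unlimited" and the returned labels are hypothetical.

```python
UNLIMITED = None  # stands in for a kmem limit that was never set or was reset with -1


def kmem_setup(user_limit: int, kmem_limit) -> str:
    """Classify a (U, K) pair into one of the three setups described above."""
    if kmem_limit is UNLIMITED:
        return "K unlimited: kernel memory is ignored"
    if kmem_limit < user_limit:
        return "K < U: kernel memory is a subset of user memory"
    return "K >= U: unified view, reclaim covers both kinds of memory"


MIB = 1 << 20
print(kmem_setup(512 * MIB, UNLIMITED))
print(kmem_setup(512 * MIB, 64 * MIB))
print(kmem_setup(512 * MIB, 512 * MIB))
```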
 3. User Interface
 0. Configuration
 a. Enable CONFIG_CGROUPS
 b. Enable CONFIG_RESOURCE_COUNTERS
 c. Enable CONFIG_MEMCG
 d. Enable CONFIG_MEMCG_SWAP (to use swap extension)
+e. Enable CONFIG_MEMCG_KMEM (to use kmem extension)
 1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
 # mount -t tmpfs none /sys/fs/cgroup
 # mkdir /sys/fs/cgroup/memory
 # mount -t cgroup none /sys/fs/cgroup/memory -o memory
 2. Make the new group and move bash into it
 # mkdir /sys/fs/cgroup/memory/0
 # echo $$ > /sys/fs/cgroup/memory/0/tasks
 Since now we're in the 0 cgroup, we can alter the memory limit:
 # echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
 NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
 mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.)
 NOTE: We can write "-1" to reset the *.limit_in_bytes(unlimited).
 NOTE: We cannot set limits on the root cgroup any more.
 # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
 4194304
 We can check the usage:
 # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
 1216512
 A successful write to this file does not guarantee a successful setting of
 this limit to the value written into the file. This can be due to a
 number of factors, such as rounding up to page boundaries or the total
 availability of memory on the system. The user is required to re-read
 this file after a write to guarantee the value committed by the kernel.
 # echo 1 > memory.limit_in_bytes
 # cat memory.limit_in_bytes
 4096
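The two read-back values shown above (4194304 for "4M" and 4096 for "1") are consistent with binary suffixes plus rounding up to a page boundary. The sketch below reproduces that arithmetic; the 4 KiB page size and the helper name are assumptions, and the kernel may adjust values for other reasons too, which is why re-reading the file remains necessary.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, matching the example output

SUFFIXES = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30}


def parse_limit(text: str):
    """Convert a limit string to bytes, rounded up to a whole page.

    Returns None for "-1", which resets the limit to unlimited.
    """
    text = text.strip()
    if text == "-1":
        return None
    unit = SUFFIXES.get(text[-1].lower())
    value = int(text[:-1]) * unit if unit else int(text)
    # Round up to the next multiple of PAGE_SIZE.
    return -(-value // PAGE_SIZE) * PAGE_SIZE


print(parse_limit("4M"))
print(parse_limit("1"))
```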
 The memory.failcnt field gives the number of times that the cgroup limit was
 exceeded.
 The memory.stat file gives accounting information. Now, the number of
 caches, RSS and Active pages/Inactive pages are shown.
 4. Testing
 For testing features and implementation, see memcg_test.txt.
 Performance test is also important. To see pure memory controller's overhead,
 testing on tmpfs will give you good numbers of small overheads.
 Example: do kernel make on tmpfs.
 Page-fault scalability is also important. When measuring parallel
 page faults, a multi-process test may be better than a multi-thread
 test, because the latter has noise from shared objects/status.
 But the above two are testing extreme situations.
 Trying usual test under memory controller is always helpful.
 4.1 Troubleshooting
 Sometimes a user might find that the application under a cgroup is
 terminated by the OOM killer. There are several causes for this:
 1. The cgroup limit is too low (just too low to do anything useful)
 2. The user is using anonymous memory and swap is turned off or too low
 A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
 some of the pages cached in the cgroup (page cache pages).
 To know what happens, disabling OOM_Kill as per "10. OOM Control" (below) and
 seeing what happens will be helpful.
 4.2 Task migration
 When a task migrates from one cgroup to another, its charge is not
 carried forward by default. The pages allocated from the original cgroup still
 remain charged to it; the charge is dropped when the page is freed or
 reclaimed.
 You can move charges of a task along with task migration.
 See 8. "Move charges at task migration"
 4.3 Removing a cgroup
 A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
 cgroup might have some charge associated with it, even though all
 tasks have migrated away from it. (because we charge against pages, not
 against tasks.)
 We move the stats to root (if use_hierarchy==0) or parent (if
 use_hierarchy==1), and no change on the charge except uncharging
 from the child.
 Charges recorded in swap information are not updated at removal of cgroup.
 Recorded information is discarded and a cgroup which uses swap (swapcache)
 will be charged as a new owner of it.
 About use_hierarchy, see Section 6.
 5. Misc. interfaces.
 5.1 force_empty
   memory.force_empty interface is provided to make cgroup's memory usage empty.
   You can use this interface only when the cgroup has no tasks.
   When writing anything to this
   # echo 0 > memory.force_empty
   Almost all pages tracked by this memory cgroup will be unmapped and freed.
   Some pages cannot be freed because they are locked or in-use. Such pages are
   moved to parent (if use_hierarchy==1) or root (if use_hierarchy==0) and this
   cgroup will be empty.
   The typical use case for this interface is before calling rmdir().
   Because rmdir() moves all pages to parent, some out-of-use page caches can be
   moved to the parent. If you want to avoid that, force_empty will be useful.
+  Also, note that when memory.kmem.limit_in_bytes is set the charges due to
+  kernel pages will still be seen. This is not considered a failure and the
+  write will still return success. In this case, it is expected that
+  memory.kmem.usage_in_bytes == memory.usage_in_bytes.
   About use_hierarchy, see Section 6.
 5.2 stat file
 memory.stat file includes following statistics
 # per-memory cgroup local status
 cache		- # of bytes of page cache memory.
 rss		- # of bytes of anonymous and swap cache memory.
 mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
 pgpgin		- # of charging events to the memory cgroup. The charging
 		event happens each time a page is accounted as either mapped
 		anon page(RSS) or cache page(Page Cache) to the cgroup.
 pgpgout		- # of uncharging events to the memory cgroup. The uncharging
 		event happens each time a page is unaccounted from the cgroup.
 swap		- # of bytes of swap usage
 inactive_anon	- # of bytes of anonymous and swap cache memory on inactive
 		LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
 		LRU list.
 inactive_file	- # of bytes of file-backed memory on inactive LRU list.
 active_file	- # of bytes of file-backed memory on active LRU list.
 unevictable	- # of bytes of memory that cannot be reclaimed (mlocked etc).
 # status considering hierarchy (see memory.use_hierarchy settings)
 hierarchical_memory_limit - # of bytes of memory limit with regard to hierarchy
 			under which the memory cgroup is
 hierarchical_memsw_limit - # of bytes of memory+swap limit with regard to
 			hierarchy under which memory cgroup is.
 total_<counter>		- # hierarchical version of <counter>, which in
 			addition to the cgroup's own value includes the
 			sum of all hierarchical children's values of
 			<counter>, e.g. total_cache
 # The following additional stats are dependent on CONFIG_DEBUG_VM.
 recent_rotated_anon	- VM internal parameter. (see mm/vmscan.c)
 recent_rotated_file	- VM internal parameter. (see mm/vmscan.c)
 recent_scanned_anon	- VM internal parameter. (see mm/vmscan.c)
 recent_scanned_file	- VM internal parameter. (see mm/vmscan.c)
 Memo:
 	recent_rotated means recent frequency of LRU rotation.
 	recent_scanned means recent # of scans to LRU.
 	These are shown for better debugging; please see the code for their meanings.
 Note:
 	Only anonymous and swap cache memory is listed as part of 'rss' stat.
 	This should not be confused with the true 'resident set size' or the
 	amount of physical memory used by the cgroup.
 	'rss + mapped_file' will give you resident set size of cgroup.
 	(Note: file and shmem may be shared among other cgroups. In that case,
 	 mapped_file is accounted only when the memory cgroup is owner of page
 	 cache.)
 5.3 swappiness
 Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
 Please note that unlike the global swappiness, memcg knob set to 0
 really prevents from any swapping even if there is a swap storage
 available. This might lead to memcg OOM killer if there are no file
 pages to reclaim.
 Following cgroups' swappiness can't be changed.
 - root cgroup (uses /proc/sys/vm/swappiness).
 - a cgroup which uses hierarchy and it has other cgroup(s) below it.
 - a cgroup which uses hierarchy and not the root of hierarchy.
 5.4 failcnt
 A memory cgroup provides memory.failcnt and memory.memsw.failcnt files.
 This failcnt(== failure count) shows the number of times that a usage counter
 hit its limit. When a memory cgroup hits a limit, failcnt increases and
 memory under it will be reclaimed.
 You can reset failcnt by writing 0 to failcnt file.
 # echo 0 > .../memory.failcnt
 5.5 usage_in_bytes
 For efficiency, like other kernel components, memory cgroup uses some optimization
 to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by this
 method and doesn't show the 'exact' value of memory (and swap) usage; it's a fuzzy
 value for efficient access. (Of course, when necessary, it's synchronized.)
 If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP)
 value in memory.stat(see 5.2).
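Both derived figures mentioned in 5.2 and 5.5 can be computed from memory.stat. The sketch below assumes the file contains one "name value" pair per line with the counter names listed in 5.2; the function names and the sample numbers are illustrative.

```python
def parse_memory_stat(text: str) -> dict:
    """Parse "name value" lines from memory.stat into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if value.strip().isdigit():
            stats[name] = int(value)
    return stats


def resident_set_size(stats: dict) -> int:
    """rss + mapped_file, the resident set size described in 5.2."""
    return stats["rss"] + stats["mapped_file"]


def more_exact_usage(stats: dict) -> int:
    """RSS + CACHE (+ SWAP when swap accounting is present), as in 5.5."""
    return stats["rss"] + stats["cache"] + stats.get("swap", 0)


# Made-up sample values, in bytes.
sample = "cache 8192\nrss 4096\nmapped_file 4096\nswap 0\n"
stats = parse_memory_stat(sample)
print(resident_set_size(stats), more_exact_usage(stats))
```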
 5.6 numa_stat
 This is similar to numa_maps but operates on a per-memcg basis.  This is
 useful for providing visibility into the numa locality information within
 a memcg since the pages are allowed to be allocated from any physical
 node.  One of the use cases is evaluating application performance by
 combining this information with the application's CPU allocation.
 We export "total", "file", "anon" and "unevictable" pages per-node for
 each memcg.  The output format of memory.numa_stat is:
 total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
 file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
 anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
 unevictable=<total unevictable pages> N0=<node 0 pages> N1=<node 1 pages> ...
 And we have total = file + anon + unevictable.
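A parser for this format, together with a check of the total = file + anon + unevictable relation, might look like the sketch below. The function name and the two-node sample are made up for illustration.

```python
def parse_numa_stat(text: str) -> dict:
    """Map each counter name to (total_pages, {node_id: pages})."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        name, _, total = fields[0].partition("=")
        nodes = {}
        for field in fields[1:]:
            node, _, pages = field.partition("=")
            nodes[int(node[1:])] = int(pages)  # "N0" -> 0
        result[name] = (int(total), nodes)
    return result


sample = (
    "total=30 N0=20 N1=10\n"
    "file=12 N0=8 N1=4\n"
    "anon=15 N0=10 N1=5\n"
    "unevictable=3 N0=2 N1=1\n"
)
stat = parse_numa_stat(sample)
parts = sum(stat[name][0] for name in ("file", "anon", "unevictable"))
print(stat["total"][0] == parts)
```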
 6. Hierarchy support
 The memory controller supports a deep hierarchy and hierarchical accounting.
 The hierarchy is created by creating the appropriate cgroups in the
 cgroup filesystem. Consider for example, the following cgroup filesystem
 hierarchy
 	       root
 	     /  |   \
             /	|    \
 	   a	b     c
 		      | \
 		      |  \
 		      d   e
 In the diagram above, with hierarchical accounting enabled, all memory
 usage of e is accounted to its ancestors up until the root (i.e., c and root)
 that have memory.use_hierarchy enabled. If one of the ancestors goes over its
 limit, the reclaim algorithm reclaims from the tasks in the ancestor and the
 children of the ancestor.
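The propagation of a charge up the tree can be modelled in a few lines. This toy example assumes memory.use_hierarchy is enabled on every level of the diagram above; the PARENT table and the charge() helper are hypothetical and only mirror the picture.

```python
# Parent of each cgroup in the example tree; root has no parent.
PARENT = {"a": "root", "b": "root", "c": "root", "d": "c", "e": "c"}


def charge(usage: dict, group: str, nbytes: int) -> dict:
    """Add nbytes to the group and to every ancestor up to the root."""
    while group is not None:
        usage[group] = usage.get(group, 0) + nbytes
        group = PARENT.get(group)
    return usage


usage = charge({}, "e", 4096)
# e, c and root are charged; a, b and d are untouched.
print(sorted(usage))
```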
 6.1 Enabling hierarchical accounting and reclaim
 A memory cgroup by default disables the hierarchy feature. Support
 can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup
 # echo 1 > memory.use_hierarchy
 The feature can be disabled by
 # echo 0 > memory.use_hierarchy
 NOTE1: Enabling/disabling will fail if either the cgroup already has other
        cgroups created below it, or if the parent cgroup has use_hierarchy
        enabled.
 NOTE2: When panic_on_oom is set to "2", the whole system will panic in
        case of an OOM event in any cgroup.
 7. Soft limits
 Soft limits allow for greater sharing of memory. The idea behind soft limits
 is to allow control groups to use as much of the memory as needed, provided
 a. There is no memory contention
 b. They do not exceed their hard limit
 When the system detects memory contention or low memory, control groups
 are pushed back to their soft limits. If the soft limit of each control
 group is very high, they are pushed back as much as possible to make
 sure that one control group does not starve the others of memory.
 Please note that soft limits are a best-effort feature; they come with
 no guarantees, but when memory is heavily contended for, the controller
 tries to allocate memory according to the soft limit hints/setup.
 Currently, soft limit based reclaim is set up such that it gets invoked
 from balance_pgdat (kswapd).
 7.1 Interface
 Soft limits can be set up using the following commands (in this example we
 assume a soft limit of 256 MiB):
 # echo 256M > memory.soft_limit_in_bytes
 If we want to change this to 1G, we can at any time use
 # echo 1G > memory.soft_limit_in_bytes
 NOTE1: Soft limits take effect over a long period of time, since they involve
        reclaiming memory to balance usage between memory cgroups.
 NOTE2: It is recommended to set the soft limit always below the hard limit,
        otherwise the hard limit will take precedence.
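 The same update can be scripted. The sketch below is an illustration only:
 size_to_bytes() and set_soft_limit() are hypothetical helpers, the K/M/G
 suffixes are assumed to be binary multiples (as the 256 MiB example
 suggests), and the check against the hard limit is a policy choice of this
 helper that follows NOTE2, not something enforced by the interface.

```python
import os

_SUFFIXES = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def size_to_bytes(text):
    """Convert strings such as '256M' or '1G' to a byte count."""
    text = text.strip()
    suffix = text[-1:].upper()
    if suffix in _SUFFIXES:
        return int(text[:-1]) * _SUFFIXES[suffix]
    return int(text)


def set_soft_limit(cgroup_dir, soft, hard_bytes=None):
    """Write a soft limit into cgroup_dir/memory.soft_limit_in_bytes.

    If the hard limit is known, refuse a soft limit at or above it,
    since the hard limit would take precedence anyway.
    """
    soft_bytes = size_to_bytes(soft)
    if hard_bytes is not None and soft_bytes >= hard_bytes:
        raise ValueError("soft limit should stay below the hard limit")
    path = os.path.join(cgroup_dir, "memory.soft_limit_in_bytes")
    with open(path, "w") as f:
        f.write(str(soft_bytes))
    return soft_bytes
```

 For example, set_soft_limit(cgroup_dir, "256M") writes 268435456, where
 cgroup_dir is the path of the memory cgroup directory.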
 8. Move charges at task migration
 Users can move charges associated with a task along with task migration, that
 is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
 This feature is not supported in !CONFIG_MMU environments because of lack of
 page tables.
 8.1 Interface
 This feature is disabled by default. It can be enabled (and disabled again) by
 writing to memory.move_charge_at_immigrate of the destination cgroup.
 If you want to enable it:
 # echo (some positive value) > memory.move_charge_at_immigrate
 Note: Each bit of move_charge_at_immigrate has its own meaning about what type
       of charges should be moved. See 8.2 for details.
 Note: Charges are moved only when you move mm->owner, in other words,
       a leader of a thread group.
 Note: If we cannot find enough space for the task in the destination cgroup, we
       try to make space by reclaiming memory. Task migration may fail if we
       cannot make enough space.
 Note: Moving a large amount of charges can take several seconds.
 And if you want to disable it again:
 # echo 0 > memory.move_charge_at_immigrate
 8.2 Type of charges which can be moved
 Each bit in move_charge_at_immigrate has its own meaning about what type of
 charges should be moved. But in any case, it must be noted that an account of
 a page or a swap can be moved only when it is charged to the task's current
 (old) memory cgroup.
   bit | what type of charges would be moved ?
  -----+------------------------------------------------------------------------
    0  | A charge of an anonymous page (or swap of it) used by the target task.
       | You must enable Swap Extension (see 2.4) to enable move of swap charges.
  -----+------------------------------------------------------------------------
    1  | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory)
       | and swaps of tmpfs file) mmapped by the target task. Unlike the case of
       | anonymous pages, file pages (and swaps) in the range mmapped by the task
       | will be moved even if the task hasn't done page fault, i.e. they might
       | not be the task's "RSS", but other task's "RSS" that maps the same file.
       | And mapcount of the page is ignored (the page can be moved even if
       | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to
       | enable move of swap charges.
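 The value written to memory.move_charge_at_immigrate is the bitwise OR of
 the bits in the table above. A minimal sketch follows; the constant and
 function names are hypothetical and exist only for this illustration.

```python
MOVE_ANON = 1 << 0   # bit 0: anonymous pages (and their swap)
MOVE_FILE = 1 << 1   # bit 1: mmapped file pages (and swap of tmpfs files)


def move_charge_value(anon=False, file=False):
    """Build the integer to write to memory.move_charge_at_immigrate."""
    value = 0
    if anon:
        value |= MOVE_ANON
    if file:
        value |= MOVE_FILE
    return value


def describe(value):
    """List which charge types a given value would move."""
    kinds = []
    if value & MOVE_ANON:
        kinds.append("anon")
    if value & MOVE_FILE:
        kinds.append("file")
    return kinds
```

 Moving both anonymous and file charges therefore means writing 3, and
 writing 0 disables the feature.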
 8.3 TODO
 - All charge-moving operations are done under cgroup_mutex. Holding the
   mutex for too long is not good behavior, so we may need some trick.
 9. Memory thresholds
 Memory cgroup implements memory thresholds using the cgroups notification
 API (see cgroups.txt). It allows an application to register multiple memory
 and memsw thresholds and to get a notification when usage crosses one of them.
 To register a threshold, an application must:
 - create an eventfd using eventfd(2);
 - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
 - write a string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
   cgroup.event_control.
 The application will be notified through eventfd when memory usage crosses
 the threshold in either direction.
 This is applicable to both root and non-root cgroups.
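 The registration steps can be sketched from user space as follows. This is
 an illustration under stated assumptions: it relies on os.eventfd() and
 os.eventfd_read(), which need Python 3.10 or later on Linux, and the
 function names and the cgroup_dir parameter are hypothetical.

```python
import os


def threshold_control_line(event_fd, usage_fd, threshold_bytes):
    """Format the string expected by cgroup.event_control."""
    return "%d %d %d" % (event_fd, usage_fd, threshold_bytes)


def wait_for_threshold(cgroup_dir, threshold_bytes):
    """Register one threshold and block until usage crosses it."""
    efd = os.eventfd(0)
    usage_fd = os.open(os.path.join(cgroup_dir, "memory.usage_in_bytes"),
                       os.O_RDONLY)
    try:
        line = threshold_control_line(efd, usage_fd, threshold_bytes)
        with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as ctl:
            ctl.write(line)
        # Blocks until the kernel signals the eventfd, then returns
        # the eventfd counter value.
        return os.eventfd_read(efd)
    finally:
        os.close(usage_fd)
        os.close(efd)
```

 To watch memory+swap usage instead, the same sketch would open
 memory.memsw.usage_in_bytes.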
 10. OOM Control
 The memory.oom_control file is used for OOM notification and other controls.
 Memory cgroup implements an OOM notifier using the cgroup notification
 API (see cgroups.txt). It allows an application to register multiple OOM
 notification deliveries and to get a notification when OOM happens.
 To register a notifier, an application must:
  - create an eventfd using eventfd(2)
  - open memory.oom_control file
  - write a string like "<event_fd> <fd of memory.oom_control>" to
    cgroup.event_control
 The application will be notified through eventfd when OOM happens.
 OOM notification doesn't work for the root cgroup.
 You can disable the OOM-killer by writing "1" to the memory.oom_control file, as:
 	# echo 1 > memory.oom_control
 This operation is only allowed for the top cgroup of a sub-hierarchy.
 If the OOM-killer is disabled, tasks under the cgroup will hang/sleep
 in the memory cgroup's OOM-waitqueue when they request accountable memory.
 To let them run again, you have to relax the memory cgroup's OOM status by
 	* enlarging the limit or reducing usage.
 To reduce usage,
 	* kill some tasks.
 	* move some tasks to another group with account migration.
 	* remove some files (on tmpfs?)
 Then, stopped tasks will work again.
 Reading the file shows the current OOM status:
 	oom_kill_disable 0 or 1 (if 1, oom-killer is disabled)
 	under_oom	 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
 				 be stopped.)
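 A monitoring script could interpret this output as sketched below. The
 helper names are hypothetical, and the text is assumed to contain one
 "<name> <value>" pair per line as shown above.

```python
def parse_oom_control(text):
    """Parse memory.oom_control output into a {name: int} dictionary."""
    status = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            status[fields[0]] = int(fields[1])
    return status


def needs_intervention(status):
    """True when the OOM-killer is disabled and the cgroup is under OOM.

    In that state tasks wait in the OOM-waitqueue until the limit is
    enlarged or usage is reduced.
    """
    return status.get("oom_kill_disable") == 1 and status.get("under_oom") == 1
```

 When needs_intervention() returns True, one of the remedies listed above
 (enlarging the limit or reducing usage) is required before the stopped
 tasks can continue.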
 11. TODO
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first
 3. Teach controller to account for shared-pages
 4. Start reclamation in the background when the limit is
    not yet hit but the usage is getting closer
 Summary
 Overall, the memory controller has been a stable controller and has been
 commented on and discussed quite extensively in the community.
 References
 1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
 2. Singh, Balbir. Memory Controller (RSS Control),
    http://lwn.net/Articles/222762/
 3. Emelianov, Pavel. Resource controllers based on process cgroups,
    http://lkml.org/lkml/2007/3/6/198
 4. Emelianov, Pavel. RSS controller based on process cgroups (v2),
    http://lkml.org/lkml/2007/4/9/78
 5. Emelianov, Pavel. RSS controller based on process cgroups (v3),
    http://lkml.org/lkml/2007/5/30/244
 6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/
 7. Vaidyanathan, Srinivasan. Control Groups: Pagecache accounting and control
    subsystem (v3), http://lwn.net/Articles/235534/
 8. Singh, Balbir. RSS controller v2 test results (lmbench),
    http://lkml.org/lkml/2007/5/17/232
 9. Singh, Balbir. RSS controller v2 AIM9 results,
    http://lkml.org/lkml/2007/5/18/1
 10. Singh, Balbir. Memory controller v6 test results,
     http://lkml.org/lkml/2007/8/19/36
 11. Singh, Balbir. Memory controller introduction (v6),
     http://lkml.org/lkml/2007/8/17/69
 12. Corbet, Jonathan. Controlling memory use in cgroups,
     http://lwn.net/Articles/243795/