03 Aug, 2016

1 commit

  • We've had a report about soft lockups caused by lock bouncing in the
    soft reclaim path:

    BUG: soft lockup - CPU#0 stuck for 22s! [kav4proxy-kavic:3128]
    RIP: 0010:[] [] _raw_spin_lock+0x18/0x20
    Call Trace:
    mem_cgroup_soft_limit_reclaim+0x25a/0x280
    shrink_zones+0xed/0x200
    do_try_to_free_pages+0x74/0x320
    try_to_free_pages+0x112/0x180
    __alloc_pages_slowpath+0x3ff/0x820
    __alloc_pages_nodemask+0x1e9/0x200
    alloc_pages_vma+0xe1/0x290
    do_wp_page+0x19f/0x840
    handle_pte_fault+0x1cd/0x230
    do_page_fault+0x1fd/0x4c0
    page_fault+0x25/0x30

    There are no memcgs created so there cannot be any in the soft limit
    excess obviously:

    [...]
    memory 0 1 1

    so all this just seems to be mem_cgroup_largest_soft_limit_node trying
    to get spin_lock_irq(&mctz->lock) just to find out that the soft limit
    excess tree is empty. This is a pointless waste of cycles and cache
    line bouncing during heavy parallel reclaim on large machines. The
    particular machine wasn't very healthy and was most probably suffering
    from a memory leak which caused the memory reclaim to thrash heavily.
    But bouncing on the lock certainly didn't help...

    Fix this by an optimistic lockless check and bail out early if the tree
    is empty (sketched at the end of this entry). This is theoretically
    racy but that shouldn't matter much: soft limit is a best-effort
    feature, it is slowly getting deprecated, and its usage should be rare.
    Bouncing on a lock without a good reason is surely a much bigger
    problem, especially on large CPU machines.

    Link: http://lkml.kernel.org/r/1470073277-1056-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
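
    A minimal sketch of the check, assuming the 4.7-era
    mem_cgroup_soft_limit_reclaim() entry point; placement and naming are
    approximate rather than the exact hunk:

        /*
         * Peek at the rb-tree without taking mctz->lock and bail out of
         * soft limit reclaim early when nobody is in excess.  The race is
         * benign because soft limit reclaim is best effort anyway.
         */
        mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone));
        if (!mctz || RB_EMPTY_ROOT(&mctz->rb_root))
                return 0;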
     

29 Jul, 2016

8 commits

  • We should account for stacks regardless of stack size, and we need to
    account in sub-page units if THREAD_SIZE < PAGE_SIZE. Change the units
    to kilobytes and move the accounting into account_kernel_stack() (see
    the sketch at the end of this entry).

    Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
    Link: http://lkml.kernel.org/r/9b5314e3ee5eda61b0317ec1563768602c1ef438.1468523549.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Cc: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Reviewed-by: Josh Poimboeuf
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
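
    Conceptually the accounting ends up looking something like the sketch
    below (simplified; the cgroup2 memory.stat side is omitted and details
    may differ from the actual patch):

        static void account_kernel_stack(unsigned long *stack, int account)
        {
                struct zone *zone = page_zone(virt_to_page(stack));

                /* Account in KiB so THREAD_SIZE < PAGE_SIZE works too. */
                mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
                                    THREAD_SIZE / 1024 * account);
        }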
     
  • Minchan Kim reported hitting the following warning on a 32-bit system,
    although it can affect 64-bit systems too.

    WARNING: CPU: 4 PID: 1322 at mm/memcontrol.c:998 mem_cgroup_update_lru_size+0x103/0x110
    mem_cgroup_update_lru_size(f44b4000, 1, -7): zid 1 lru_size 1 but empty
    Modules linked in:
    CPU: 4 PID: 1322 Comm: cp Not tainted 4.7.0-rc4-mm1+ #143
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x76/0xaf
    __warn+0xea/0x110
    ? mem_cgroup_update_lru_size+0x103/0x110
    warn_slowpath_fmt+0x3b/0x40
    mem_cgroup_update_lru_size+0x103/0x110
    isolate_lru_pages.isra.61+0x2e2/0x360
    shrink_active_list+0xac/0x2a0
    ? __delay+0xe/0x10
    shrink_node_memcg+0x53c/0x7a0
    shrink_node+0xab/0x2a0
    do_try_to_free_pages+0xc6/0x390
    try_to_free_pages+0x245/0x590

    LRU list contents and counts are updated separately. Counts are updated
    before pages are added to the LRU and after pages are removed. The
    warning above is from a check in mem_cgroup_update_lru_size that
    ensures that list sizes of zero are empty.

    The problem is that node-lru needs to account for highmem pages if
    CONFIG_HIGHMEM is set. One impact of the implementation is that the
    sizes are updated in multiple passes when pages from multiple zones are
    isolated. This happens whether HIGHMEM is set or not. When multiple
    zones are isolated, it's possible for a debugging check in memcg to be
    tripped.

    This patch forces all the zone counts to be updated before the memcg
    function is called (a sketch of the idea follows at the end of this
    entry).

    Link: http://lkml.kernel.org/r/1468588165-12461-6-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Tested-by: Minchan Kim
    Reported-by: Minchan Kim
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
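
    A hypothetical sketch of the idea (the helper name is illustrative and
    the real patch restructures isolate_lru_pages() differently): flush all
    per-zone counters first, then make a single memcg-visible update, so
    the debug check never sees a half-updated state.

        static void flush_isolated_counts(struct lruvec *lruvec, enum lru_list lru,
                                          const unsigned long *nr_zone_taken,
                                          unsigned long nr_taken)
        {
                int zid;

                /* per-zone vmstat updates for every zone we isolated from */
                for (zid = 0; zid < MAX_NR_ZONES; zid++)
                        if (nr_zone_taken[zid])
                                __mod_zone_page_state(
                                        &lruvec_pgdat(lruvec)->node_zones[zid],
                                        NR_ZONE_LRU_BASE + lru,
                                        -(long)nr_zone_taken[zid]);

                /* only now let memcg see the (fully consistent) total */
                mem_cgroup_update_lru_size(lruvec, lru, -(long)nr_taken);
        }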
     
  • Memcg needs adjustment after moving LRUs to the node. Limits are
    tracked per memcg but the soft-limit excess is tracked per zone. As
    global page reclaim is based on the node, it is easy to imagine a
    situation where a zone soft limit is exceeded even though the memcg
    limit is fine.

    This patch moves the soft limit tree to the node (see the sketch at the
    end of this entry). Technically, all the variable names should also
    change, but people are already familiar with the meaning of "mz" even
    if "mn" would be a more appropriate name now.

    Link: http://lkml.kernel.org/r/1467970510-21195-15-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
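
    After the change the lookup keys off the node id alone; roughly (names
    approximate the post-patch code):

        static struct mem_cgroup_tree_per_node *soft_limit_tree_node(int nid)
        {
                return soft_limit_tree.rb_tree_per_node[nid];
        }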
     
  • Earlier patches focused on having direct reclaim and kswapd use data
    that is node-centric for reclaiming but shrink_node() itself still uses
    too much zone information. This patch removes unnecessary zone-based
    information with the most important decision being whether to continue
    reclaim or not. Some memcg APIs are adjusted as a result even though
    memcg itself still uses some zone information.

    [mgorman@techsingularity.net: optimization]
    Link: http://lkml.kernel.org/r/1468588165-12461-2-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-14-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both the zone and
    the node level. Most reclaim logic is based on the node counters, but
    the retry logic uses the zone counters, which do not distinguish
    inactive and active sizes. It would be possible to keep the LRU
    counters on a per-zone basis, but that is a heavier calculation across
    multiple cache lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch, but
    note that it introduces a number of anomalies. For example, the scans
    are per-zone but use per-node counters. We also mark a node as
    congested when a zone is congested. This causes weird problems that
    are fixed later, but it makes the series easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the LRU lists being node-based, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case being
    that highmem pages have to be scanned and reclaimed to free lowmem slab
    pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Node-based reclaim requires node-based LRUs and locking. This is a
    preparation patch that just moves the lru_lock to the node so later
    patches are easier to review; a sketch of the resulting helper is at
    the end of this entry. It is a mechanical change, but note that this
    patch makes contention worse because the LRU lock is hotter and direct
    reclaim and kswapd can contend on the same lock even when reclaiming
    from different zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
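
    The mechanical part boils down to redirecting zone callers to a lock
    that now lives in pglist_data, along the lines of:

        static inline spinlock_t *zone_lru_lock(struct zone *zone)
        {
                return &zone->zone_pgdat->lru_lock;
        }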
     
  • Commit 23047a96d7cf ("mm: workingset: per-cgroup cache thrash
    detection") added a page->mem_cgroup lookup to the cache eviction,
    refault, and activation paths, as well as locking to the activation
    path, and the vm-scalability tests showed a regression of -23%.

    While the test in question is an artificial worst-case scenario that
    doesn't occur in real workloads - reading two sparse files in parallel
    at full CPU speed just to hammer the LRU paths - there are still some
    optimizations that can be done in those paths.

    Inline the lookup functions to eliminate calls (roughly as sketched at
    the end of this entry). Also, page->mem_cgroup doesn't need to be
    stabilized when counting an activation; we merely need to hold the RCU
    lock to prevent the memcg from being freed.

    This cuts down on overhead quite a bit:

    23047a96d7cfcfca                063f6715e77a7be5770d6081fe
    ----------------                --------------------------
         %stddev                         %change        %stddev
    21621405 +- 0%                       +11.3%    24069657 +- 2%  vm-scalability.throughput

    [linux@roeck-us.net: drop unnecessary include file]
    [hannes@cmpxchg.org: add WARN_ON_ONCE()s]
    Link: http://lkml.kernel.org/r/20160707194024.GA26580@cmpxchg.org
    Link: http://lkml.kernel.org/r/20160624175101.GA3024@cmpxchg.org
    Reported-by: Ye Xiaolong
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Guenter Roeck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
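
    The inlined lookups are roughly of the following shape (the
    WARN_ON_ONCE corresponds to the note in the tags below; treat the exact
    definitions as approximate):

        static inline struct mem_cgroup *page_memcg(struct page *page)
        {
                return page->mem_cgroup;
        }

        /*
         * Only valid under rcu_read_lock(): the memcg may be offlined but
         * cannot be freed while the RCU read-side section is held, which
         * is all the activation counter needs.
         */
        static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
        {
                WARN_ON_ONCE(!rcu_read_lock_held());
                return READ_ONCE(page->mem_cgroup);
        }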
     
  • task_will_free_mem is rather weak. It doesn't really tell whether the
    task has a chance to drop its mm. 98748bd72200 ("oom: consider
    multi-threaded tasks in task_will_free_mem") made a first step toward
    making it more robust for multi-threaded applications, so now we know
    that the whole process is going down and will probably drop the mm.

    This patch builds on top of that for more complex scenarios where the
    mm is shared between different processes - CLONE_VM without
    CLONE_SIGHAND, or in-kernel use_mm().

    Make sure that all processes sharing the mm are killed or exiting (a
    sketch of the rule is at the end of this entry). This will allow us to
    replace try_oom_reaper with wake_oom_reaper, because task_will_free_mem
    now implies the task is reapable. Therefore all paths which bypass the
    oom killer are reapable and so they shouldn't lock up the oom killer.

    Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
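
    A hypothetical sketch of the rule (the helper is illustrative, not the
    exact function from the patch): an mm can be shared via CLONE_VM
    without CLONE_SIGHAND or via use_mm(), so every process using it must
    already be exiting or killed.

        static bool mm_users_will_free(struct task_struct *task,
                                       struct mm_struct *mm)
        {
                struct task_struct *p;
                bool ret = true;

                rcu_read_lock();
                for_each_process(p) {
                        if (!process_shares_mm(p, mm))
                                continue;
                        if (same_thread_group(task, p))
                                continue;       /* already handled above */
                        if (fatal_signal_pending(p) || (p->flags & PF_EXITING))
                                continue;
                        ret = false;
                        break;
                }
                rcu_read_unlock();
                return ret;
        }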
     

27 Jul, 2016

7 commits

  • Commit f627c2f53786 ("memcg: adjust to support new THP refcounting")
    added a compound parameter to several functions, and switched
    mem_cgroup_move_account over to it as well, but it did not update the
    comments.

    Link: http://lkml.kernel.org/r/1465368216-9393-1-git-send-email-roy.qing.li@gmail.com
    Signed-off-by: Li RongQing
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li RongQing
     
  • When calling uncharge_list, if a page is transparent huge we don't
    need to BUG_ON about it not being transparent huge, since nobody
    should be able to see the page at this stage and it cannot be raced
    against by a THP split.

    This check became unneeded after 0a31bc97c80c ("mm: memcontrol: rewrite
    uncharge API").

    [mhocko@suse.com: changelog enhancements]
    Link: http://lkml.kernel.org/r/1465369248-13865-1-git-send-email-roy.qing.li@gmail.com
    Signed-off-by: Li RongQing
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li RongQing
     
  • oom_scan_process_thread() does not use its totalpages argument;
    oom_badness() does.

    Link: http://lkml.kernel.org/r/1463796041-7889-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Page table pages are batched-freed in release_pages on most
    architectures. If we want to charge them to kmemcg (this is what is
    done later in this series), we need to teach mem_cgroup_uncharge_list to
    handle kmem pages.

    Link: http://lkml.kernel.org/r/18d5c09e97f80074ed25b97a7d0f32b95d875717.1464079538.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • - Move the memcg_kmem_enabled check out to the caller. This reduces
    the number of function definitions, making the code easier to follow.
    At the same time it doesn't result in code bloat, because all of these
    functions are used only in one or two places.

    - Move the __GFP_ACCOUNT check to the caller as well so that one
    wouldn't have to dive deep into the memcg implementation to see which
    allocations are charged and which are not (the resulting caller-side
    pattern is sketched at the end of this entry).

    - Refresh comments.

    Link: http://lkml.kernel.org/r/52882a28b542c1979fd9a033b4dc8637fc347399.1464079537.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
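
    The resulting caller-side pattern looks roughly like this (modelled on
    the page allocator path; exact call sites differ):

        if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
            unlikely(memcg_kmem_charge(page, gfp_mask, order) != 0)) {
                __free_pages(page, order);
                page = NULL;
        }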
     
  • It's part of the oom context, just like the allocation order and
    nodemask, so let's move it to oom_control instead of passing it in the
    argument list.

    Link: http://lkml.kernel.org/r/40e03fd7aaf1f55c75d787128d6d17c5a71226c2.1464358556.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • It seems like this parameter has never been used since being introduced
    by 90254a65833b ("memcg: clean up move charge"). Not a big deal because
    I assume the function would get inlined into the caller anyway but why
    not get rid of it.

    [mhocko@suse.com: wrote changelog]
    Link: http://lkml.kernel.org/r/20160525151831.GJ20132@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/1464145026-26693-1-git-send-email-roy.qing.li@gmail.com
    Signed-off-by: Li RongQing
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li RongQing
     

23 Jul, 2016

1 commit

  • The memory controller has quite a bit of state that usually outlives
    the cgroup and pins its CSS until said state disappears. At the same
    time it imposes a 16-bit limit on the CSS ID space to economically
    store IDs in the wild. Consequently, when we use cgroups to contain
    frequent but small and short-lived jobs that leave behind some page
    cache, we quickly run into the 64k limit on outstanding CSSs. Creating
    a new cgroup fails with -ENOSPC while there are only a few, or even no,
    user-visible cgroups in existence.

    Although pinning CSSs past cgroup removal is common, there are only two
    instances that actually need an ID after a cgroup is deleted: cache
    shadow entries and swapout records.

    Cache shadow entries reference the ID weakly and can deal with the CSS
    having disappeared when it's looked up later. They pose no hurdle.

    Swap-out records do need to pin the css to hierarchically attribute
    swapins after the cgroup has been deleted; however, the only pages that
    remain swapped out after offlining are tmpfs/shmem pages, and those
    references are under the user's control, so they are manageable.

    This patch introduces a private 16-bit memcg ID and switches swap and
    cache shadow entries over to using it (a rough sketch is at the end of
    this entry). This ID can then be recycled after offlining, when the
    CSS remains pinned only by objects that don't specifically need it.

    This script demonstrates the problem by faulting one cache page in a new
    cgroup and deleting it again:

    set -e
    mkdir -p pages
    for x in `seq 128000`; do
        [ $((x % 1000)) -eq 0 ] && echo $x
        mkdir /cgroup/foo
        echo $$ >/cgroup/foo/cgroup.procs
        echo trex >pages/$x
        echo $$ >/cgroup/cgroup.procs
        rmdir /cgroup/foo
    done

    When run on an unpatched kernel, we eventually run out of possible IDs
    even though there are no visible cgroups:

    [root@ham ~]# ./cssidstress.sh
    [...]
    65000
    mkdir: cannot create directory '/cgroup/foo': No space left on device

    After this patch, the IDs get released upon cgroup destruction and the
    cache and css objects get released once memory reclaim kicks in.

    [hannes@cmpxchg.org: init the IDR]
    Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org
    Fixes: b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups")
    Link: http://lkml.kernel.org/r/20160617162516.GD19084@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: John Garcia
    Reviewed-by: Vladimir Davydov
    Acked-by: Tejun Heo
    Cc: Nikolay Borisov
    Cc: [3.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
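
    A rough sketch of the scheme (simplified; reference counting of the
    private ID and the swap/shadow-entry users are omitted):

        static DEFINE_IDR(mem_cgroup_idr);

        /* at css allocation: a 16-bit private ID, independent of the CSS ID */
        memcg->id.id = idr_alloc(&mem_cgroup_idr, NULL,
                                 1, MEM_CGROUP_ID_MAX, GFP_KERNEL);

        /* once the last swap/shadow reference is gone (at the latest on
         * offline), the ID returns to the pool even though the CSS itself
         * may stay pinned for longer */
        idr_remove(&mem_cgroup_idr, memcg->id.id);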
     

25 Jun, 2016

2 commits

  • mem_cgroup_css_alloc() was returning NULL on failure while cgroup core
    expected it to return an ERR_PTR value, leading to the following NULL
    deref after a css allocation failure. Fix it by returning
    ERR_PTR(-ENOMEM) instead (see the sketch at the end of this entry).
    I'll also update cgroup core so that it can handle NULL returns.

    mkdir: page allocation failure: order:6, mode:0x240c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO)
    CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123
    ...
    Call Trace:
    dump_stack+0x68/0xa1
    warn_alloc_failed+0xd6/0x130
    __alloc_pages_nodemask+0x4c6/0xf20
    alloc_pages_current+0x66/0xe0
    alloc_kmem_pages+0x14/0x80
    kmalloc_order_trace+0x2a/0x1a0
    __kmalloc+0x291/0x310
    memcg_update_all_caches+0x6c/0x130
    mem_cgroup_css_alloc+0x590/0x610
    cgroup_apply_control_enable+0x18b/0x370
    cgroup_mkdir+0x1de/0x2e0
    kernfs_iop_mkdir+0x55/0x80
    vfs_mkdir+0xb9/0x150
    SyS_mkdir+0x66/0xd0
    do_syscall_64+0x53/0x120
    entry_SYSCALL64_slow_path+0x25/0x25
    ...
    BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
    IP: init_and_link_css+0x37/0x220
    PGD 34b1e067 PUD 3a109067 PMD 0
    Oops: 0002 [#1] SMP
    Modules linked in:
    CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.2-20160422_131301-anatol 04/01/2014
    task: ffff88007cbc5200 ti: ffff8800666d4000 task.ti: ffff8800666d4000
    RIP: 0010:[] [] init_and_link_css+0x37/0x220
    RSP: 0018:ffff8800666d7d90 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
    RDX: ffffffff810f2499 RSI: 0000000000000000 RDI: 0000000000000008
    RBP: ffff8800666d7db8 R08: 0000000000000003 R09: 0000000000000000
    R10: 0000000000000001 R11: 0000000000000000 R12: ffff88005a5fb400
    R13: ffffffff81f0f8a0 R14: ffff88005a5fb400 R15: 0000000000000010
    FS: 00007fc944689700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f3aed0d2b80 CR3: 000000003a1e8000 CR4: 00000000000006f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    cgroup_apply_control_enable+0x1ac/0x370
    cgroup_mkdir+0x1de/0x2e0
    kernfs_iop_mkdir+0x55/0x80
    vfs_mkdir+0xb9/0x150
    SyS_mkdir+0x66/0xd0
    do_syscall_64+0x53/0x120
    entry_SYSCALL64_slow_path+0x25/0x25
    Code: 89 f5 48 89 fb 49 89 d4 48 83 ec 08 8b 05 72 3b d8 00 85 c0 0f 85 60 01 00 00 4c 89 e7 e8 72 f7 ff ff 48 8d 7b 08 48 89 d9 31 c0 c7 83 d0 00 00 00 00 00 00 00 48 83 e7 f8 48 29 f9 81 c1 d8
    RIP init_and_link_css+0x37/0x220
    RSP
    CR2: 00000000000000d0
    ---[ end trace a2d8836ae1e852d1 ]---

    Link: http://lkml.kernel.org/r/20160621165740.GJ3262@mtj.duckdns.org
    Signed-off-by: Tejun Heo
    Reported-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
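
    The fix itself is a one-liner in the failure path of
    mem_cgroup_css_alloc(); roughly (label and cleanup shown for context,
    details assumed):

        fail:
                mem_cgroup_free(memcg);
                return ERR_PTR(-ENOMEM);        /* was: return NULL */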
     
  • mem_cgroup_migrate() uses local_irq_disable/enable() but can be called
    with irqs disabled from migrate_page_copy(). This ends up enabling
    irqs while holding an irq-context lock, triggering the following
    lockdep warning. Fix it by using irq_save/restore instead (sketched at
    the end of this entry).

    =================================
    [ INFO: inconsistent lock state ]
    4.7.0-rc1+ #52 Tainted: G W
    ---------------------------------
    inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
    kcompactd0/151 [HC0[0]:SC0[0]:HE1:SE1] takes:
    (&(&ctx->completion_lock)->rlock){+.?.-.}, at: [] aio_migratepage+0x156/0x1e8
    {IN-SOFTIRQ-W} state was registered at:
    __lock_acquire+0x5b6/0x1930
    lock_acquire+0xee/0x270
    _raw_spin_lock_irqsave+0x66/0xb0
    aio_complete+0x98/0x328
    dio_complete+0xe4/0x1e0
    blk_update_request+0xd4/0x450
    scsi_end_request+0x48/0x1c8
    scsi_io_completion+0x272/0x698
    blk_done_softirq+0xca/0xe8
    __do_softirq+0xc8/0x518
    irq_exit+0xee/0x110
    do_IRQ+0x6a/0x88
    io_int_handler+0x11a/0x25c
    __mutex_unlock_slowpath+0x144/0x1d8
    __mutex_unlock_slowpath+0x140/0x1d8
    kernfs_iop_permission+0x64/0x80
    __inode_permission+0x9e/0xf0
    link_path_walk+0x6e/0x510
    path_lookupat+0xc4/0x1a8
    filename_lookup+0x9c/0x160
    user_path_at_empty+0x5c/0x70
    SyS_readlinkat+0x68/0x140
    system_call+0xd6/0x270
    irq event stamp: 971410
    hardirqs last enabled at (971409): migrate_page_move_mapping+0x3ea/0x588
    hardirqs last disabled at (971410): _raw_spin_lock_irqsave+0x3c/0xb0
    softirqs last enabled at (970526): __do_softirq+0x460/0x518
    softirqs last disabled at (970519): irq_exit+0xee/0x110

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&(&ctx->completion_lock)->rlock);

    lock(&(&ctx->completion_lock)->rlock);

    *** DEADLOCK ***

    3 locks held by kcompactd0/151:
    #0: (&(&mapping->private_lock)->rlock){+.+.-.}, at: aio_migratepage+0x42/0x1e8
    #1: (&ctx->ring_lock){+.+.+.}, at: aio_migratepage+0x5a/0x1e8
    #2: (&(&ctx->completion_lock)->rlock){+.?.-.}, at: aio_migratepage+0x156/0x1e8

    stack backtrace:
    CPU: 20 PID: 151 Comm: kcompactd0 Tainted: G W 4.7.0-rc1+ #52
    Call Trace:
    show_trace+0xea/0xf0
    show_stack+0x72/0xf0
    dump_stack+0x9a/0xd8
    print_usage_bug.part.27+0x2d4/0x2e8
    mark_lock+0x17e/0x758
    mark_held_locks+0xa2/0xd0
    trace_hardirqs_on_caller+0x140/0x1c0
    mem_cgroup_migrate+0x266/0x370
    aio_migratepage+0x16a/0x1e8
    move_to_new_page+0xb0/0x260
    migrate_pages+0x8f4/0x9f0
    compact_zone+0x4dc/0xdc8
    kcompactd_do_work+0x1aa/0x358
    kcompactd+0xba/0x2c8
    kthread+0x10a/0x110
    kernel_thread_starter+0x6/0xc
    kernel_thread_starter+0x0/0xc
    INFO: lockdep is turned off.

    Link: http://lkml.kernel.org/r/20160620184158.GO3262@mtj.duckdns.org
    Link: http://lkml.kernel.org/g/5767CFE5.7080904@de.ibm.com
    Fixes: 74485cf2bc85 ("mm: migrate: consolidate mem_cgroup_migrate() calls")
    Signed-off-by: Tejun Heo
    Reported-by: Christian Borntraeger
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Reviewed-by: Vladimir Davydov
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
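
    The fix is roughly the following pattern inside mem_cgroup_migrate()
    (statistics calls abbreviated):

        unsigned long flags;

        /*
         * The caller may already have IRQs disabled (e.g. aio holding
         * ctx->completion_lock via migrate_page_copy()), so save and
         * restore the flags instead of blindly re-enabling interrupts.
         */
        local_irq_save(flags);
        mem_cgroup_charge_statistics(memcg, newpage, compound, nr_pages);
        memcg_check_events(memcg, newpage);
        local_irq_restore(flags);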
     

04 Jun, 2016

1 commit

  • memcg_offline_kmem() may be called from memcg_free_kmem() after a css
    init failure. memcg_free_kmem() is a ->css_free callback which is
    called without cgroup_mutex, and memcg_offline_kmem() ends up using
    css_for_each_descendant_pre() without any locking. Fix it by adding
    RCU read locking around the iteration (sketched at the end of this
    entry).

    mkdir: cannot create directory `65530': No space left on device
    ===============================
    [ INFO: suspicious RCU usage. ]
    4.6.0-work+ #321 Not tainted
    -------------------------------
    kernel/cgroup.c:4008 cgroup_mutex or RCU read lock required!
    [ 527.243970] other info that might help us debug this:
    [ 527.244715]
    rcu_scheduler_active = 1, debug_locks = 0
    2 locks held by kworker/0:5/1664:
    #0: ("cgroup_destroy"){.+.+..}, at: [] process_one_work+0x165/0x4a0
    #1: ((&css->destroy_work)#3){+.+...}, at: [] process_one_work+0x165/0x4a0
    [ 527.248098] stack backtrace:
    CPU: 0 PID: 1664 Comm: kworker/0:5 Not tainted 4.6.0-work+ #321
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
    Workqueue: cgroup_destroy css_free_work_fn
    Call Trace:
    dump_stack+0x68/0xa1
    lockdep_rcu_suspicious+0xd7/0x110
    css_next_descendant_pre+0x7d/0xb0
    memcg_offline_kmem.part.44+0x4a/0xc0
    mem_cgroup_css_free+0x1ec/0x200
    css_free_work_fn+0x49/0x5e0
    process_one_work+0x1c5/0x4a0
    worker_thread+0x49/0x490
    kthread+0xea/0x100
    ret_from_fork+0x1f/0x40

    Link: http://lkml.kernel.org/r/20160526203018.GG23194@mtj.duckdns.org
    Signed-off-by: Tejun Heo
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
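
    Sketch of the fix inside memcg_offline_kmem() (loop body elided): the
    descendant iteration requires cgroup_mutex or an RCU read lock, and
    the ->css_free path holds neither.

        rcu_read_lock();
        css_for_each_descendant_pre(css, &memcg->css) {
                /* ... per-child kmem teardown via mem_cgroup_from_css(css) ... */
        }
        rcu_read_unlock();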
     

27 May, 2016

1 commit

  • mem_cgroup_out_of_memory() returns "true" if it finds a TIF_MEMDIE
    task after an eligible task has been found, and "false" if it finds a
    TIF_MEMDIE task before an eligible task is found.

    This difference confuses memory_max_write(), which checks the return
    value of mem_cgroup_out_of_memory(). Since memory_max_write() wants to
    continue looping, mem_cgroup_out_of_memory() should return "true" in
    this case.

    This patch sets a dummy pointer in order to return "true".

    Link: http://lkml.kernel.org/r/1463753327-5170-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

24 May, 2016

1 commit

  • mem_cgroup_oom may be invoked multiple times while a process is handling
    a page fault, in which case current->memcg_in_oom will be overwritten
    leaking the previously taken css reference.

    Link: http://lkml.kernel.org/r/1464019330-7579-1-git-send-email-vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

21 May, 2016

1 commit

  • Commit f61c42a7d911 ("memcg: remove tasks/children test from
    mem_cgroup_force_empty()") removed memory reparenting from the function.

    Fix the function's comment.

    Link: http://lkml.kernel.org/r/1462569810-54496-1-git-send-email-gthelen@google.com
    Signed-off-by: Greg Thelen
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

20 May, 2016

5 commits

  • If the current task is already killed or PF_EXITING, or a selected
    task is PF_EXITING, then the oom killer is suppressed and so is the
    oom reaper. This patch adds try_oom_reaper, which checks the given
    task and queues it for the oom reaper if that is safe to do, meaning
    that the task doesn't share the mm with an alive process.

    This might help to release the memory pressure while the task tries to
    exit.

    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Michal Hocko
    Cc: Raushaniya Maksudova
    Cc: Michael S. Tsirkin
    Cc: Paul E. McKenney
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Cc: Daniel Vetter
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Konstantin Khlebnikov pointed out (nearly four years ago, when lumpy
    reclaim was removed) that lru_size can be updated by -nr_taken once per
    call to isolate_lru_pages(), instead of page by page.

    Update it inside isolate_lru_pages(), or at its two callsites? I chose
    to update it at the callsites, rearranging and grouping the updates by
    nr_taken and nr_scanned together in both.

    With one exception, mem_cgroup_update_lru_size(,lru,) is then used
    where __mod_zone_page_state(,NR_LRU_BASE+lru,) is used; and we shall be
    adding some more calls in a future commit. Make the code a little
    smaller and simpler by incorporating the stat update into the lru_size
    update.

    The exception was move_active_pages_to_lru(), which aggregated the
    pgmoved stat update separately from the individual lru_size updates;
    but I still think this is a simplification worth making.

    However, __mod_zone_page_state is not peculiar to mem_cgroups, so
    better to use the name update_lru_size, which calls
    mem_cgroup_update_lru_size when CONFIG_MEMCG is enabled (a sketch of
    the wrapper is at the end of this entry).

    Signed-off-by: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Andres Lagar-Cavilla
    Cc: Yang Shi
    Cc: Ning Qu
    Cc: Mel Gorman
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
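
    A sketch of the wrapper described above (pre-node-lru era, so the zone
    vmstat update is what gets folded in; the exact in-tree form may
    differ):

        static __always_inline void update_lru_size(struct lruvec *lruvec,
                                                    enum lru_list lru,
                                                    int nr_pages)
        {
        #ifdef CONFIG_MEMCG
                /* also folds in the __mod_zone_page_state() update */
                mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
        #else
                __mod_zone_page_state(lruvec_zone(lruvec),
                                      NR_LRU_BASE + lru, nr_pages);
        #endif
        }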
     
  • Though debug kernels have a VM_BUG_ON to help protect from misaccounting
    lru_size, non-debug kernels are liable to wrap it around: and then the
    vast unsigned long size draws page reclaim into a loop of repeatedly
    doing nothing on an empty list, without even a cond_resched().

    That soft lockup looks confusingly like an over-busy reclaim scenario,
    with lots of contention on the lru_lock in shrink_inactive_list(): yet
    has a totally different origin.

    Help differentiate with a custom warning in
    mem_cgroup_update_lru_size(), even in non-debug kernels; and reset the
    size to avoid the lockup. But the particular bug which suggested this
    change was mine alone, and since fixed.

    Make it a WARN_ONCE: the first occurrence is the most informative, a
    flurry may follow, yet even when rate-limited little more is learnt.

    Signed-off-by: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Andres Lagar-Cavilla
    Cc: Yang Shi
    Cc: Ning Qu
    Cc: Mel Gorman
    Cc: Andres Lagar-Cavilla
    Cc: Konstantin Khlebnikov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • > The comment seems to have not much to do with the code?

    I guess the comment tries to say that the code path is triggered when
    we charge the page, which happens _before_ it is added to the LRU list,
    and so last_scanned_node might contain stale data.

    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Lots of code does

        node = next_node(node, XXX);
        if (node == MAX_NUMNODES)
                node = first_node(XXX);

    so create next_node_in() to do this and use it in various places (a
    sketch of the helper is at the end of this entry).

    [mhocko@suse.com: use next_node_in() helper]
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Signed-off-by: Michal Hocko
    Cc: Xishi Qiu
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Laura Abbott
    Cc: Hui Zhu
    Cc: Wang Xiaoqiang
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
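
    A sketch of the helper (the in-tree version is a next_node_in() macro
    wrapping an out-of-line __next_node_in(); this just captures the
    semantics):

        int __next_node_in(int node, const nodemask_t *srcp)
        {
                int ret = __next_node(node, srcp);

                if (ret == MAX_NUMNODES)
                        ret = __first_node(srcp);
                return ret;
        }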
     

26 Apr, 2016

1 commit

  • Hello,

    So, this ended up a lot simpler than I originally expected. I tested
    it lightly and it seems to work fine. Petr, can you please test these
    two patches w/o the lru drain drop patch and see whether the problem
    is gone?

    Thanks.
    ------ 8< ------
    If charge moving is used, memcg performs relabeling of the affected
    pages from its ->attach callback, which is called under
    cgroup_threadgroup_rwsem and thus can't create new kthreads. This is
    fragile, as various operations may depend on workqueues making forward
    progress, which relies on the ability to create new kthreads.

    There's no reason to perform charge moving from ->attach which is deep
    in the task migration path. Move it to ->post_attach which is called
    after the actual migration is finished and cgroup_threadgroup_rwsem is
    dropped.

    * move_charge_struct->mm is added and ->can_attach is now responsible
    for pinning and recording the target mm. mem_cgroup_clear_mc() is
    updated accordingly. This also simplifies mem_cgroup_move_task().

    * mem_cgroup_move_task() is now called from ->post_attach instead of
    ->attach.

    Signed-off-by: Tejun Heo
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Debugged-and-tested-by: Petr Mladek
    Reported-by: Cyril Hrubis
    Reported-by: Johannes Weiner
    Fixes: 1ed1328792ff ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem")
    Cc: # 4.4+

    Tejun Heo
     

18 Mar, 2016

8 commits

  • mem_cgroup_print_oom_info is always called under oom_lock, so
    oom_info_lock is redundant.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • uncharge_list() does an unusual list walk because the function can take
    regular lists with dedicated list_heads as well as singleton lists where
    a single page is passed via the page->lru list node.

    This can sometimes lead to confusion as well as suggestions to replace
    the loop with a list_for_each_entry(), which wouldn't work.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Setting the original memory.limit_in_bytes hardlimit is subject to a
    race condition when the desired value is below the current usage. The
    code tries a few times to first reclaim and then see if the usage has
    dropped to where we would like it to be, but there is no locking, and
    the workload is free to continue making new charges up to the old limit.
    Thus, attempting to shrink a workload relies on pure luck and hope that
    the workload happens to cooperate.

    To fix this in the cgroup2 memory.max knob, do it the other way round:
    set the limit first, then try enforcement (a simplified sketch is at
    the end of this entry). And if reclaim is not able to succeed, trigger
    OOM kills in the group. Keep going until the new limit is met, we run
    out of OOM victims and there's only unreclaimable memory left, or the
    task writing to memory.max is killed. This allows users to shrink
    groups reliably, and the behavior is consistent with what happens when
    new charges are attempted in excess of memory.max.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
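
    A simplified sketch of the new write-side loop (stock draining, bounded
    reclaim retries and error handling omitted):

        xchg(&memcg->memory.limit, max);

        for (;;) {
                unsigned long nr_pages = page_counter_read(&memcg->memory);

                if (nr_pages <= max)
                        break;
                if (signal_pending(current))
                        break;
                if (try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
                                                 GFP_KERNEL, true))
                        continue;
                /* reclaim made no progress: start OOM killing in the group */
                if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
                        break;
        }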
     
  • When setting memory.high below usage, nothing happens until the next
    charge comes along, and then it will only reclaim its own charge and not
    the now potentially huge excess of the new memory.high. This can cause
    groups to stay in excess of their memory.high indefinitely.

    To fix that, when shrinking memory.high, kick off a reclaim cycle that
    goes after the delta (sketched at the end of this entry).

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
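
    A sketch of the reclaim kick (simplified from the memory.high write
    handler):

        memcg->high = high;

        nr_pages = page_counter_read(&memcg->memory);
        if (nr_pages > high)
                try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
                                             GFP_KERNEL, true);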
     
  • - Do not take memcg_limit_mutex for resetting limits - the cgroup
    cannot be altered from userspace anymore, so there is no need to
    protect them.

    - Use plain page_counter_limit() for resetting ->memory and ->memsw
    limits instead of the mem_cgroup_resize_* helpers - we enlarge the
    limits, so no special handling is needed.

    - Reset ->swap and ->tcpmem limits as well.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Workingset code was recently made memcg aware, but shadow node shrinker
    is still global. As a result, one small cgroup can consume all memory
    available for shadow nodes, possibly hurting other cgroups by reclaiming
    their shadow nodes, even though reclaim distances stored in its shadow
    nodes have no effect. To avoid this, we need to make shadow node
    shrinker memcg aware.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • As kmem accounting is now either enabled for all cgroups or disabled
    system-wide, there's no point in having memcg_kmem_online() helper -
    instead one can use memcg_kmem_enabled() and mem_cgroup_online(), as
    shrink_slab() now does.

    There are only two places left where this helper is used -
    __memcg_kmem_charge() and memcg_create_kmem_cache(). The former can
    only be called if memcg_kmem_enabled() returned true. Since the cgroup
    it operates on is online, mem_cgroup_is_root() check will be enough.

    memcg_create_kmem_cache() can't use mem_cgroup_online() helper instead
    of memcg_kmem_online(), because it relies on the fact that in
    memcg_offline_kmem() memcg->kmem_state is changed before
    memcg_deactivate_kmem_caches() is called, but there we can just
    open-code the check.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Workingset code was recently made memcg aware, but shadow node shrinker
    is still global. As a result, one small cgroup can consume all memory
    available for shadow nodes, possibly hurting other cgroups by reclaiming
    their shadow nodes, even though reclaim distances stored in its shadow
    nodes have no effect. To avoid this, we need to make shadow node
    shrinker memcg aware.

    The actual work is done in patch 6 of the series. Patches 1 and 2
    prepare memcg/shrinker infrastructure for the change. Patch 3 is just a
    collateral cleanup. Patch 4 makes radix_tree_node accounted, which is
    necessary for making shadow node shrinker memcg aware. Patch 5 reduces
    shadow nodes overhead in case workload mostly uses anonymous pages.

    This patch:

    Currently, in the legacy hierarchy kmem accounting is off for all
    cgroups by default and must be enabled explicitly by writing something
    to memory.kmem.limit_in_bytes. Since we don't support reclaim on
    hitting kmem limit, nor do we have any plans to implement it, this is
    likely to be -1, just to enable kmem accounting and limit kernel memory
    consumption by the memory.limit_in_bytes along with user memory.

    This user API was introduced when the implementation of kmem accounting
    lacked slab shrinker support and hence was useless in practice. Things
    have changed since then - slab shrinkers were made memcg aware, the
    accounting overhead seems to be negligible, and a failure to charge a
    kmem allocation should not have critical consequences, because we only
    account those kernel objects that should be safe to fail. That's why
    kmem accounting is enabled by default for all cgroups in the default
    hierarchy, which will eventually replace the legacy one.

    The ability to enable kmem accounting for some cgroups while keeping it
    disabled for others is getting difficult to maintain. E.g. to make
    shadow node shrinker memcg aware (see mm/workingset.c), we need to know
    the relationship between the number of shadow nodes allocated for a
    cgroup and the size of its lru list. If kmem accounting is enabled for
    all cgroups there is no problem, but what should we do if kmem
    accounting is enabled only for half of the cgroups? We have no choice
    but to use global lru stats while scanning the root cgroup's shadow
    nodes, but that would be wrong if kmem accounting were enabled for all
    cgroups (which is the case if the unified hierarchy is used), in which
    case we should use the lru stats of the root cgroup's lruvec.

    That being said, let's enable kmem accounting for all memory cgroups by
    default. If one finds it unstable or too costly, it can always be
    disabled system-wide by passing cgroup.memory=nokmem to the kernel at
    boot time.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov