20 Jan, 2017

1 commit

  • commit b4536f0c829c8586544c94735c343f9b5070bd01 upstream.

    Nils Holland and Klaus Ethgen have reported unexpected OOM killer
    invocations with 32-bit kernels starting with 4.8

    kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
    kworker/u4:5 cpuset=/ mems_allowed=0
    CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
    [...]
    Mem-Info:
    active_anon:58685 inactive_anon:90 isolated_anon:0
    active_file:274324 inactive_file:281962 isolated_file:0
    unevictable:0 dirty:649 writeback:0 unstable:0
    slab_reclaimable:40662 slab_unreclaimable:17754
    mapped:7382 shmem:202 pagetables:351 bounce:0
    free:206736 free_pcp:332 free_cma:0
    Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
    DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 813 3474 3474
    Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
    lowmem_reserve[]: 0 0 21292 21292
    HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB

    The OOM killer is clearly premature because there is still a lot of
    page cache in zone Normal which should satisfy this lowmem request.
    Further debugging has shown that reclaim cannot make any forward
    progress because the page cache is hidden in the active list, which
    doesn't get rotated because inactive_list_is_low is not memcg aware.

    The code simply subtracts per-zone highmem counters from the respective
    memcg's lru sizes, which doesn't make any sense. We can easily end up
    always seeing the resulting active and inactive counts as 0 and
    returning false. This issue is not limited to 32-bit kernels, but in
    practice the effect on systems without CONFIG_HIGHMEM would be much
    harder to notice because we do not invoke the OOM killer for allocation
    requests targeting < ZONE_NORMAL.

    Fix the issue by tracking per-zone LRU page counts in mem_cgroup_per_node
    and subtracting the per-memcg highmem counts when memcg is enabled.
    Introduce the helper lruvec_zone_lru_size, which redirects to either the
    zone counters or mem_cgroup_get_zone_lru_size as appropriate.
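
    A minimal sketch of that split, assuming mem_cgroup_per_node gains a
    per-zone lru_zone_size[][] array kept in sync by
    mem_cgroup_update_lru_size() (simplified from the upstream patch):

    static unsigned long
    mem_cgroup_get_zone_lru_size(struct lruvec *lruvec, enum lru_list lru,
                                 int zone_idx)
    {
            struct mem_cgroup_per_node *mz;

            mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
            return mz->lru_zone_size[zone_idx][lru];
    }

    unsigned long lruvec_zone_lru_size(struct lruvec *lruvec, enum lru_list lru,
                                       int zone_idx)
    {
            if (!mem_cgroup_disabled())
                    return mem_cgroup_get_zone_lru_size(lruvec, lru, zone_idx);

            return zone_page_state(&lruvec_pgdat(lruvec)->node_zones[zone_idx],
                                   NR_ZONE_LRU_BASE + lru);
    }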

    We lose the empty-LRU-but-nonzero-lru_size detection introduced by
    ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size") because
    of the inherent zone vs. node discrepancy.

    Fixes: f8d1a31163fc ("mm: consider whether to decivate based on eligible zones inactive ratio")
    Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Nils Holland
    Tested-by: Nils Holland
    Reported-by: Klaus Ethgen
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

08 Oct, 2016

2 commits

  • The cgroup core and the memory controller need to track socket ownership
    for different purposes, but the tracking sites being entirely different
    is kind of ugly.

    Be a better citizen and rename the memory controller callbacks to match
    the cgroup core callbacks, then move them to the same place.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20160914194846.11153-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Tejun Heo
    Cc: "David S. Miller"
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When selecting an oom victim, we use the same heuristic for both memory
    cgroup and global oom. The only difference is the scope of tasks to
    select the victim from. So we could just export an iterator over all
    memcg tasks and keep all oom related logic in oom_kill.c, but instead we
    duplicate pieces of it in memcontrol.c reusing some initially private
    functions of oom_kill.c in order to not duplicate all of it. That looks
    ugly and error prone, because any modification of select_bad_process
    should also be propagated to mem_cgroup_out_of_memory.

    Let's rework this as follows: keep all oom heuristic related code private
    to oom_kill.c and make oom_kill.c use exported memcg functions when it's
    really necessary (like in case of iterating over memcg tasks).

    Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

29 Jul, 2016

6 commits

  • We should account for stacks regardless of stack size, and we need to
    account in sub-page units if THREAD_SIZE < PAGE_SIZE. Change the units
    to kilobytes and move the accounting into account_kernel_stack().
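
    A sketch of the resulting accounting, with the global and per-memcg
    counters both kept in kilobytes (helper and counter names are
    approximate, simplified from the upstream fix):

    static void account_kernel_stack(unsigned long *stack, int account)
    {
            /* all pages of a stack live in the same zone and memcg */
            struct page *first_page = virt_to_page(stack);

            /* global counter, now in KiB */
            mod_zone_page_state(page_zone(first_page), NR_KERNEL_STACK_KB,
                                THREAD_SIZE / 1024 * account);

            /* per-memcg counter, also in KiB */
            memcg_kmem_update_page_stat(first_page, MEMCG_KERNEL_STACK_KB,
                                        account * (THREAD_SIZE / 1024));
    }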

    Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
    Link: http://lkml.kernel.org/r/9b5314e3ee5eda61b0317ec1563768602c1ef438.1468523549.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Cc: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Reviewed-by: Josh Poimboeuf
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Minchan Kim reported hitting the following warning on a 32-bit system,
    although it can also affect 64-bit systems.

    WARNING: CPU: 4 PID: 1322 at mm/memcontrol.c:998 mem_cgroup_update_lru_size+0x103/0x110
    mem_cgroup_update_lru_size(f44b4000, 1, -7): zid 1 lru_size 1 but empty
    Modules linked in:
    CPU: 4 PID: 1322 Comm: cp Not tainted 4.7.0-rc4-mm1+ #143
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x76/0xaf
    __warn+0xea/0x110
    ? mem_cgroup_update_lru_size+0x103/0x110
    warn_slowpath_fmt+0x3b/0x40
    mem_cgroup_update_lru_size+0x103/0x110
    isolate_lru_pages.isra.61+0x2e2/0x360
    shrink_active_list+0xac/0x2a0
    ? __delay+0xe/0x10
    shrink_node_memcg+0x53c/0x7a0
    shrink_node+0xab/0x2a0
    do_try_to_free_pages+0xc6/0x390
    try_to_free_pages+0x245/0x590

    LRU list contents and counts are updated separately. Counts are updated
    before pages are added to the LRU and updated after pages are removed.
    The warning above is from a check in mem_cgroup_update_lru_size that
    ensures that list sizes of zero are empty.

    The problem is that node-lru needs to account for highmem pages if
    CONFIG_HIGHMEM is set. One impact of the implementation is that the
    sizes are updated in multiple passes when pages from multiple zones are
    isolated. This happens whether HIGHMEM is set or not. When multiple
    zones are isolated, it's possible for a debugging check in memcg to be
    tripped.

    This patch forces all the zone counts to be updated before the memcg
    function is called.
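
    A simplified sketch of that ordering; __update_lru_size stands in for
    the combined zone/node counter update, and the memcg update is done
    once for the whole batch:

    static __always_inline void update_lru_sizes(struct lruvec *lruvec,
                            enum lru_list lru, unsigned long *nr_zone_taken,
                            unsigned long nr_taken)
    {
            int zid;

            /* update every zone's LRU count first ... */
            for (zid = 0; zid < MAX_NR_ZONES; zid++) {
                    if (!nr_zone_taken[zid])
                            continue;
                    __update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
            }

    #ifdef CONFIG_MEMCG
            /* ... then tell memcg about the whole batch exactly once */
            mem_cgroup_update_lru_size(lruvec, lru, -nr_taken);
    #endif
    }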

    Link: http://lkml.kernel.org/r/1468588165-12461-6-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Tested-by: Minchan Kim
    Reported-by: Minchan Kim
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Memcg needs adjustment after moving LRUs to the node. Limits are
    tracked per memcg but the soft-limit excess is tracked per zone. As
    global page reclaim is based on the node, it is easy to imagine a
    situation where a zone soft limit is exceeded even though the memcg
    limit is fine.

    This patch moves the soft limit tree to the node. Technically, all the
    variable names should also change, but people are already familiar with
    the meaning of "mz" even if "mn" would be a more appropriate name now.

    Link: http://lkml.kernel.org/r/1467970510-21195-15-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Earlier patches focused on having direct reclaim and kswapd use data
    that is node-centric for reclaiming but shrink_node() itself still uses
    too much zone information. This patch removes unnecessary zone-based
    information with the most important decision being whether to continue
    reclaim or not. Some memcg APIs are adjusted as a result even though
    memcg itself still uses some zone information.

    [mgorman@techsingularity.net: optimization]
    Link: http://lkml.kernel.org/r/1468588165-12461-2-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-14-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both the zone and
    the node level. Most reclaim logic is based on the node counters, but
    the retry logic uses the zone counters, which do not distinguish
    inactive and active sizes. It would be possible to leave the LRU
    counters on a per-zone basis, but summing them during reclaim would be
    a heavier calculation across multiple cache lines, and it would run far
    more frequently than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch, but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but use per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later, but it keeps this patch easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the LRU lists being node-based, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate; the potential corner case is that
    highmem pages would have to be scanned and reclaimed to free lowmem slab
    pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 23047a96d7cf ("mm: workingset: per-cgroup cache thrash
    detection") added a page->mem_cgroup lookup to the cache eviction,
    refault, and activation paths, as well as locking to the activation
    path, and the vm-scalability tests showed a regression of -23%.

    While the test in question is an artificial worst-case scenario that
    doesn't occur in real workloads - reading two sparse files in parallel
    at full CPU speed just to hammer the LRU paths - there are still some
    optimizations that can be done in those paths.

    Inline the lookup functions to eliminate calls. Also, page->mem_cgroup
    doesn't need to be stabilized when counting an activation; we merely
    need to hold the RCU lock to prevent the memcg from being freed.
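
    A sketch of the activation path after the change (the direct
    page->mem_cgroup read is shown for brevity; mainline wraps it in a
    small helper, and the lruvec lookup name reflects the zone-based API of
    the time):

    void workingset_activation(struct page *page)
    {
            struct mem_cgroup *memcg;
            struct lruvec *lruvec;

            rcu_read_lock();
            memcg = READ_ONCE(page->mem_cgroup);
            if (!mem_cgroup_disabled() && !memcg)
                    goto out;
            lruvec = mem_cgroup_zone_lruvec(page_zone(page), memcg);
            atomic_long_inc(&lruvec->inactive_age);
    out:
            rcu_read_unlock();
    }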

    This cuts down on overhead quite a bit:

    vm-scalability.throughput (%stddev):
      23047a96d7cfcfca (before):           21621405 +- 0%
      063f6715e77a7be5770d6081fe (after):  24069657 +- 2%   (+11.3%)

    [linux@roeck-us.net: drop unnecessary include file]
    [hannes@cmpxchg.org: add WARN_ON_ONCE()s]
    Link: http://lkml.kernel.org/r/20160707194024.GA26580@cmpxchg.org
    Link: http://lkml.kernel.org/r/20160624175101.GA3024@cmpxchg.org
    Reported-by: Ye Xiaolong
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Guenter Roeck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

27 Jul, 2016

1 commit

  • - Hand the memcg_kmem_enabled() check out to the caller. This reduces
    the number of function definitions, making the code easier to follow.
    At the same time it doesn't result in code bloat, because all of these
    functions are used only in one or two places.

    - Move the __GFP_ACCOUNT check to the caller as well, so that one
    doesn't have to dive deep into the memcg implementation to see which
    allocations are charged and which are not (see the sketch after this
    list).

    - Refresh comments.
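
    A sketch of the caller-side pattern after this cleanup, as seen from
    the page allocator (simplified):

    /* after allocating 'page' with 'gfp_mask' and 'order': */
    if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
        unlikely(memcg_kmem_charge(page, gfp_mask, order) != 0)) {
            __free_pages(page, order);
            page = NULL;
    }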

    Link: http://lkml.kernel.org/r/52882a28b542c1979fd9a033b4dc8637fc347399.1464079537.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

23 Jul, 2016

1 commit

  • The memory controller has quite a bit of state that usually outlives the
    cgroup and pins its CSS until said state disappears. At the same time
    it imposes a 16-bit limit on the CSS ID space to economically store IDs
    in the wild. Consequently, when we use cgroups to contain frequent but
    small and short-lived jobs that leave behind some page cache, we quickly
    run into the 64k limit on outstanding CSSs. Creating a new cgroup then
    fails with -ENOSPC while there are only a few, or even no, user-visible
    cgroups in existence.

    Although pinning CSSs past cgroup removal is common, there are only two
    instances that actually need an ID after a cgroup is deleted: cache
    shadow entries and swapout records.

    Cache shadow entries reference the ID weakly and can deal with the CSS
    having disappeared when it's looked up later. They pose no hurdle.

    Swap-out records do need to pin the css to hierarchically attribute
    swapins after the cgroup has been deleted; though the only pages that
    remain swapped out after offlining are tmpfs/shmem pages. And those
    references are under the user's control, so they are manageable.

    This patch introduces a private 16-bit memcg ID and switches swap and
    cache shadow entries over to using that. This ID can then be recycled
    after offlining when the CSS remains pinned only by objects that don't
    specifically need it.
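
    A sketch of the mechanism, assuming a separate refcount on the private
    ID so swapout records and shadow entries pin the ID rather than a raw
    CSS ID:

    static DEFINE_IDR(mem_cgroup_idr);

    static void mem_cgroup_id_get(struct mem_cgroup *memcg)
    {
            atomic_inc(&memcg->id.ref);
    }

    static void mem_cgroup_id_put(struct mem_cgroup *memcg)
    {
            if (atomic_dec_and_test(&memcg->id.ref)) {
                    idr_remove(&mem_cgroup_idr, memcg->id.id);
                    memcg->id.id = 0;
                    /* the ID is gone; drop the css reference it was pinning */
                    css_put(&memcg->css);
            }
    }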

    This script demonstrates the problem by faulting one cache page in a new
    cgroup and deleting it again:

    set -e
    mkdir -p pages
    for x in `seq 128000`; do
            [ $((x % 1000)) -eq 0 ] && echo $x
            mkdir /cgroup/foo
            echo $$ >/cgroup/foo/cgroup.procs
            echo trex >pages/$x
            echo $$ >/cgroup/cgroup.procs
            rmdir /cgroup/foo
    done

    When run on an unpatched kernel, we eventually run out of possible IDs
    even though there are no visible cgroups:

    [root@ham ~]# ./cssidstress.sh
    [...]
    65000
    mkdir: cannot create directory '/cgroup/foo': No space left on device

    After this patch, the IDs get released upon cgroup destruction and the
    cache and css objects get released once memory reclaim kicks in.

    [hannes@cmpxchg.org: init the IDR]
    Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org
    Fixes: b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups")
    Link: http://lkml.kernel.org/r/20160617162516.GD19084@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: John Garcia
    Reviewed-by: Vladimir Davydov
    Acked-by: Tejun Heo
    Cc: Nikolay Borisov
    Cc: [3.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

21 May, 2016

1 commit

  • The inactive file list should still be large enough to contain readahead
    windows and freshly written file data, but it no longer is the only
    source for detecting multiple accesses to file pages. The workingset
    refault measurement code causes recently evicted file pages that get
    accessed again after a shorter interval to be promoted directly to the
    active list.

    With that mechanism in place, we can afford to (on a larger system)
    dedicate more memory to the active file list, so we can actually cache
    more of the frequently used file pages in memory, and not have them
    pushed out by streaming writes, once-used streaming file reads, etc.

    This can help things like database workloads, where only half the page
    cache can currently be used to cache the database working set. This
    patch automatically increases that fraction on larger systems, using the
    same ratio that has already been used for anonymous memory.
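
    A sketch of the sizing rule that the file LRU now shares with anon
    (names approximate; the effective active:inactive target grows with the
    square root of memory size, e.g. about 3:1 at 1GB and 10:1 at 10GB):

    static bool inactive_file_is_low(unsigned long inactive, unsigned long active)
    {
            unsigned long inactive_ratio;
            unsigned long gb = (inactive + active) >> (30 - PAGE_SHIFT);

            inactive_ratio = gb ? int_sqrt(10 * gb) : 1;

            return inactive * inactive_ratio < active;
    }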

    [hannes@cmpxchg.org: cgroup-awareness]
    Signed-off-by: Rik van Riel
    Signed-off-by: Johannes Weiner
    Reported-by: Andres Freund
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

20 May, 2016

1 commit

  • Konstantin Khlebnikov pointed out (nearly four years ago, when lumpy
    reclaim was removed) that lru_size can be updated by -nr_taken once per
    call to isolate_lru_pages(), instead of page by page.

    Update it inside isolate_lru_pages(), or at its two callsites? I chose
    to update it at the callsites, rearranging and grouping the updates by
    nr_taken and nr_scanned together in both.

    With one exception, mem_cgroup_update_lru_size(,lru,) is then used where
    __mod_zone_page_state(,NR_LRU_BASE+lru,) is used; and we shall be adding
    some more calls in a future commit. Make the code a little smaller and
    simpler by incorporating stat update in lru_size update.

    The exception was move_active_pages_to_lru(), which aggregated the
    pgmoved stat update separately from the individual lru_size updates; but
    I still think this a simplification worth making.

    However, __mod_zone_page_state is not peculiar to memcg: so better to
    use the name update_lru_size, which calls mem_cgroup_update_lru_size
    when CONFIG_MEMCG is enabled.
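
    A sketch of the resulting wrapper, in the style of
    include/linux/mm_inline.h; under CONFIG_MEMCG the memcg helper also
    folds in the zone stat update:

    static __always_inline void __update_lru_size(struct lruvec *lruvec,
                                    enum lru_list lru, int nr_pages)
    {
            __mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru,
                                  nr_pages);
    }

    static __always_inline void update_lru_size(struct lruvec *lruvec,
                                    enum lru_list lru, int nr_pages)
    {
    #ifdef CONFIG_MEMCG
            mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
    #else
            __update_lru_size(lruvec, lru, nr_pages);
    #endif
    }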

    Signed-off-by: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Andres Lagar-Cavilla
    Cc: Yang Shi
    Cc: Ning Qu
    Cc: Mel Gorman
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

18 Mar, 2016

4 commits

  • Workingset code was recently made memcg aware, but shadow node shrinker
    is still global. As a result, one small cgroup can consume all memory
    available for shadow nodes, possibly hurting other cgroups by reclaiming
    their shadow nodes, even though reclaim distances stored in its shadow
    nodes have no effect. To avoid this, we need to make shadow node
    shrinker memcg aware.
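
    A sketch of how the shrinker registration looks once it is flagged as
    memcg aware:

    static struct shrinker workingset_shadow_shrinker = {
            .count_objects = count_shadow_nodes,
            .scan_objects = scan_shadow_nodes,
            .seeks = DEFAULT_SEEKS,
            .flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
    };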

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • As kmem accounting is now either enabled for all cgroups or disabled
    system-wide, there's no point in having the memcg_kmem_online() helper -
    instead one can use memcg_kmem_enabled() and mem_cgroup_online(), as
    shrink_slab() now does.

    There are only two places left where this helper is used -
    __memcg_kmem_charge() and memcg_create_kmem_cache(). The former can
    only be called if memcg_kmem_enabled() returned true. Since the cgroup
    it operates on is online, mem_cgroup_is_root() check will be enough.

    memcg_create_kmem_cache() can't use mem_cgroup_online() helper instead
    of memcg_kmem_online(), because it relies on the fact that in
    memcg_offline_kmem() memcg->kmem_state is changed before
    memcg_deactivate_kmem_caches() is called, but there we can just
    open-code the check.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Show how much memory is allocated to kernel stacks.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Show how much memory is used for storing reclaimable and unreclaimable
    in-kernel data structures allocated from slab caches.
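
    Together with the kernel_stack entry above, the cgroup2 memory.stat
    file then carries lines along the following (values are illustrative
    only):

    anon 167772160
    file 3825852416
    kernel_stack 2097152
    slab 302841856
    ...
    slab_reclaimable 271933440
    slab_unreclaimable 30908416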

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

16 Mar, 2016

6 commits

  • There are several users that nest lock_page_memcg() inside lock_page()
    to prevent page->mem_cgroup from changing. But the page lock prevents
    pages from moving between cgroups, so that is unnecessary overhead.

    Remove lock_page_memcg() in contexts where the page is already locked,
    and fix the debug code in the page stat functions to be okay with the
    page lock.

    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Now that migration doesn't clear page->mem_cgroup of live pages anymore,
    it's safe to make lock_page_memcg() and the memcg stat functions take
    pages, and spare the callers from memcg objects.

    [akpm@linux-foundation.org: fix warnings]
    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Changing a page's memcg association complicates dealing with the page,
    so we want to limit this as much as possible. Page migration e.g. does
    not have to do that. Just like page cache replacement, it can forcibly
    charge a replacement page, and then uncharge the old page when it gets
    freed. Temporarily overcharging the cgroup by a single page is not an
    issue in practice, and charging is so cheap nowadays that this is much
    preferable to the headache of messing with live pages.

    The only place that still changes the page->mem_cgroup binding of live
    pages is when pages move along with a task to another cgroup. But that
    path isolates the page from the LRU, takes the page lock, and the move
    lock (lock_page_memcg()). That means page->mem_cgroup is always stable
    in callers that have the page isolated from the LRU or locked. Lighter
    unlocked paths, like writeback accounting, can use lock_page_memcg().
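
    A sketch of the resulting rule for readers of page->mem_cgroup:

    /* page isolated from the LRU, or page-locked: */
    memcg = page->mem_cgroup;       /* stable, no extra locking needed */

    /* lighter unlocked path, e.g. writeback accounting: */
    lock_page_memcg(page);
    memcg = page->mem_cgroup;       /* stable until unlock_page_memcg() */
    /* ... update per-memcg statistics ... */
    unlock_page_memcg(page);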

    [akpm@linux-foundation.org: fix build]
    [vdavydov@virtuozzo.com: fix lockdep splat]
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Cache thrash detection (see a528910e12ec "mm: thrash detection-based
    file cache sizing" for details) currently only works on the system
    level, not inside cgroups. Worse, as the refaults are compared to the
    global number of active cache, cgroups might wrongfully get all their
    refaults activated when their pages are hotter than those of others.

    Move the refault machinery from the zone to the lruvec, and then tag
    eviction entries with the memcg ID. This makes the thrash detection
    work correctly inside cgroups.
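
    A sketch of the shadow entry packing after tagging (zone-based at the
    time of this patch): memcg ID, node, zone and the eviction counter are
    packed into a radix tree exceptional entry:

    static void *pack_shadow(int memcgid, struct zone *zone,
                             unsigned long eviction)
    {
            eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
            eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
            eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
            eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);

            return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
    }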

    [sergey.senozhatsky@gmail.com: do not return from workingset_activation() with locked rcu and page]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Sergey Senozhatsky
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These patches tag the page cache radix tree eviction entries with the
    memcg an evicted page belonged to, thus making per-cgroup LRU reclaim
    work properly and be as adaptive to new cache workingsets as global
    reclaim already is.

    This should have been part of the original thrash detection patch
    series, but was deferred due to the complexity of those patches.

    This patch (of 5):

    So far the only sites that needed to exclude charge migration to
    stabilize page->mem_cgroup have been per-cgroup page statistics, hence
    the name mem_cgroup_begin_page_stat(). But per-cgroup thrash detection
    will add another site that needs to ensure page->mem_cgroup lifetime.

    Rename these locking functions to the more generic lock_page_memcg() and
    unlock_page_memcg(). Since charge migration is a cgroup1 feature only,
    we might be able to delete it at some point, along with these now
    easy-to-identify locking sites.

    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Fix up trivial spelling errors, noticed while reading the code.

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     

04 Feb, 2016

1 commit

  • MEM_CGROUP_STAT_NSTATS is just a delimiter for cgroup1 statistics, not
    an actual array entry. Reuse it for the first cgroup2 stat entry, like
    in the event array.
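
    A sketch of the resulting enum layout, with the cgroup1 delimiter
    doubling as the first cgroup2 slot:

    enum mem_cgroup_stat_index {
            /* ... cgroup1 statistics ... */
            MEM_CGROUP_STAT_NSTATS,
            /* default hierarchy statistics start here, no slot is wasted */
            MEMCG_SOCK = MEM_CGROUP_STAT_NSTATS,
            MEMCG_NR_STAT,
    };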

    Fixes: b2807f07f4f8 ("mm: memcontrol: add "sock" to cgroup2 memory.stat")
    Signed-off-by: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

21 Jan, 2016

9 commits

  • Provide statistics on how much of a cgroup's memory footprint is made up
    of socket buffers from network connections owned by the group.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • mem_cgroup_lruvec_online() takes a lruvec, but it only needs the memcg.
    Since get_scan_count(), which is the only user of this function, now has
    a pointer to the memcg, let's pass the memcg directly to
    mem_cgroup_online() instead of picking it out of the lruvec, and rename
    the function accordingly.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This patchset introduces swap accounting to cgroup2.

    This patch (of 7):

    In the legacy hierarchy we charge memsw, which is dubious, because:

    - memsw.limit must be >= memory.limit, so it is impossible to limit
      swap usage to less than memory usage. Taking into account the fact
      that the primary limiting mechanism in the unified hierarchy is
      memory.high while memory.limit is either left unset or set to a very
      large value, moving the memsw.limit knob to the unified hierarchy
      would effectively make it impossible to limit swap usage according to
      the user preference.

    - memsw.usage != memory.usage + swap.usage, because a page occupying
      both a swap entry and a swap cache page is charged only once to the
      memsw counter. As a result, it is possible to effectively eat up to
      memory.limit of memory pages *and* memsw.limit of swap entries, which
      looks unexpected.

    That said, we should provide a different swap limiting mechanism for
    cgroup2.

    This patch adds mem_cgroup->swap counter, which charges the actual number
    of swap entries used by a cgroup. It is only charged in the unified
    hierarchy, while the legacy hierarchy memsw logic is left intact.

    The swap usage can be monitored using new memory.swap.current file and
    limited using memory.swap.max.

    Note, to charge swap resource properly in the unified hierarchy, we have
    to make swap_entry_free uncharge swap only when ->usage reaches zero, not
    just ->count, i.e. when all references to a swap entry, including the one
    taken by swap cache, are gone. This is necessary, because otherwise
    swap-in could result in uncharging swap even if the page is still in swap
    cache and hence still occupies a swap entry. At the same time, this
    shouldn't break memsw counter logic, where a page is never charged twice
    for using both memory and swap, because in case of legacy hierarchy we
    uncharge swap on commit (see mem_cgroup_commit_charge).

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • The creation and teardown of struct mem_cgroup is fairly messy and
    that has attracted mistakes and subtle bugs before.

    The main cause for this is that there is no clear model about what
    needs to happen when, and that attracts more chaos. So create one:

    1. mem_cgroup_alloc() should allocate struct mem_cgroup and its
    auxiliary members and initialize work items, locks etc. so that the
    object it returns is fully initialized and in a neutral state.

    2. mem_cgroup_css_alloc() will use mem_cgroup_alloc() to obtain a new
    memcg object and configure it and the system according to the role
    of the new memory-controlled cgroup in the hierarchy.

    3. mem_cgroup_css_online() is no longer needed to synchronize with
    iterators, but it verifies css->id which isn't available earlier.

    4. mem_cgroup_css_offline() implements stuff that needs to happen upon
    the user-visible destruction of a cgroup, which includes stopping
    all user interfacing as well as releasing certain structures when
    continued memory consumption would be unexpected at that point.

    5. mem_cgroup_css_free() prepares the system and the memcg object for
    the object's disappearance, neutralizes its state, and then gives
    it back to mem_cgroup_free().

    6. mem_cgroup_free() releases struct mem_cgroup and auxiliary memory.

    [arnd@arndb.de: fix SLOB build regression]
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There are no more external users of struct cg_proto, flatten the
    structure into struct mem_cgroup.

    Since using those struct members doesn't stand out as much anymore,
    add cgroup2 static branches to make it clearer which code is legacy.

    Suggested-by: Vladimir Davydov
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • What CONFIG_INET and CONFIG_LEGACY_KMEM guard inside the memory
    controller code is insignificant; having these conditionals is not
    worth the complication and fragility that comes with them.
    [akpm@linux-foundation.org: rework mem_cgroup_css_free() statement ordering]
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Let the user know that CONFIG_MEMCG_KMEM does not apply to the cgroup2
    interface. This also makes legacy-only code sections stand out better.

    [arnd@arndb.de: mm: memcontrol: only manage socket pressure for CONFIG_INET]
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Acked-by: Vladimir Davydov
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The cgroup2 memory controller will account important in-kernel memory
    consumers by default. Move all necessary components to CONFIG_MEMCG.

    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • On any given memcg, the kmem accounting feature has three separate
    states: not initialized, structures allocated, and actively accounting
    slab memory. These are represented through a combination of the
    kmem_acct_activated and kmem_acct_active flags, which is confusing.

    Convert to a kmem_state enum with the states NONE, ALLOCATED, and
    ONLINE. Then rename the functions to modify the state accordingly.
    This follows the nomenclature of css object states more closely.
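
    A sketch of the replacement state machine:

    enum memcg_kmem_state {
            KMEM_NONE,
            KMEM_ALLOCATED,
            KMEM_ONLINE,
    };

    struct mem_cgroup {
            /* ... */
            enum memcg_kmem_state kmem_state;
    };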

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

16 Jan, 2016

1 commit

  • As with rmap, with the new refcounting we cannot rely on PageTransHuge()
    to check whether we need to charge the size of a huge page to the
    cgroup. We need to get the information from the caller to know whether
    it was mapped with a PMD or a PTE.

    We uncharge when the last reference on the page is gone. At that point,
    if we see PageTransHuge() it means we need to uncharge the whole huge
    page.

    The tricky part is partial unmap -- when we try to unmap part of a huge
    page. We don't do any special handling of this situation, meaning we
    don't uncharge the part of the huge page unless the last user is gone or
    split_huge_page() is triggered. If cgroup memory pressure happens, the
    partially unmapped page will be split through the shrinker. This should
    be good enough.
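
    A sketch of the resulting charge interface, with the caller passing the
    compound hint (signature as in the upstream patch, body elided):

    int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
                              gfp_t gfp_mask, struct mem_cgroup **memcgp,
                              bool compound)
    {
            unsigned int nr_pages = compound ? HPAGE_PMD_NR : 1;

            /* ... charge nr_pages to the memcg ... */
            return 0;
    }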

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

6 commits

  • According to <linux/jump_label.h>, the direct use of struct static_key
    is deprecated. Update the socket and slab accounting code accordingly.
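
    In practice this means switching patterns like the following, shown
    here for the socket-accounting key as an example:

    /* deprecated: */
    struct static_key memcg_sockets_enabled_key;
    /* ... if (static_key_false(&memcg_sockets_enabled_key)) ... */

    /* updated: */
    DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
    /* ... if (static_branch_unlikely(&memcg_sockets_enabled_key)) ... */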

    Signed-off-by: Johannes Weiner
    Acked-by: David S. Miller
    Reported-by: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Let the networking stack know when a memcg is under reclaim pressure so
    that it can clamp its transmit windows accordingly.

    Whenever the reclaim efficiency of a cgroup's LRU lists drops low enough
    for a MEDIUM or HIGH vmpressure event to occur, assert a pressure state
    in the socket and tcp memory code that tells it to curb consumption
    growth from sockets associated with said control group.

    Traditionally, vmpressure reports for the entire subtree of a memcg
    under pressure, which drops useful information on the individual groups
    reclaimed. However, it's too late to change the user interface, so add
    a second reporting mode that reports on the level of reclaim instead of
    at the level of pressure, and use that report for sockets.

    vmpressure events are naturally edge triggered, so for hysteresis assert
    socket pressure for a second to allow for subsequent vmpressure events
    to occur before letting the socket code return to normal.

    This will likely need finetuning for a wider variety of workloads, but
    for now stick to the vmpressure presets and keep hysteresis simple.
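
    A sketch of the hysteresis, assuming a socket_pressure deadline stored
    in the memcg; vmpressure pushes the deadline out by one second, and the
    socket code backs off while the deadline lies in the future (the helper
    wrapping the assertion is named here only for illustration):

    /* asserted by the vmpressure code on MEDIUM/HIGH reclaim pressure */
    static void mem_cgroup_assert_socket_pressure(struct mem_cgroup *memcg)
    {
            memcg->socket_pressure = jiffies + HZ;
    }

    /* consulted from the network stack */
    static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
    {
            do {
                    if (time_before(jiffies, memcg->socket_pressure))
                            return true;
            } while ((memcg = parent_mem_cgroup(memcg)));
            return false;
    }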

    Signed-off-by: Johannes Weiner
    Acked-by: David S. Miller
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Socket memory can be a significant share of overall memory consumed by
    common workloads. In order to provide reasonable resource isolation in
    the unified hierarchy, this type of memory needs to be included in the
    tracking/accounting of a cgroup under active memory resource control.

    Overhead is only incurred when a non-root control group is created AND
    the memory controller is instructed to track and account the memory
    footprint of that group. cgroup.memory=nosocket can be specified on the
    boot commandline to override any runtime configuration and forcibly
    exclude socket memory from active memory resource control.

    Signed-off-by: Johannes Weiner
    Acked-by: David S. Miller
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The unified hierarchy memory controller is going to use this jump label
    as well to control the networking callbacks. Move it to the memory
    controller code and give it a more generic name.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Reviewed-by: Vladimir Davydov
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There won't be any separate counters for socket memory consumed by
    protocols other than TCP in the future. Remove the indirection and link
    sockets directly to their owning memory cgroup.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There won't be a tcp control soft limit, so integrating the memcg code
    into the global skmem limiting scheme complicates things unnecessarily.
    Replace this with simple and clear charge and uncharge calls--hidden
    behind a jump label--to account skb memory.

    Note that this is not purely aesthetic: as a result of shoehorning the
    per-memcg code into the same memory accounting functions that handle the
    global level, the old code would compare the per-memcg consumption
    against the smaller of the per-memcg limit and the global limit. This
    allowed the total consumption of multiple sockets to exceed the global
    limit, as long as the individual sockets stayed within bounds. After
    this change, the code will always compare the per-memcg consumption to
    the per-memcg limit, and the global consumption to the global limit, and
    thus close this loophole.

    Without a soft limit, the per-memcg memory pressure state in sockets is
    generally questionable. However, we did it until now, so we continue to
    enter it when the hard limit is hit, and packets are dropped, to let
    other sockets in the cgroup know that they shouldn't grow their transmit
    windows, either. However, keep it simple in the new callback model and
    leave memory pressure lazily when the next packet is accepted (as
    opposed to doing it synchronously when packets are processed). When
    packets are dropped, network performance will already be in the toilet,
    so that should be a reasonable trade-off.

    As described above, consumption is now checked on the per-memcg level
    and the global level separately. Likewise, memory pressure states are
    maintained on both the per-memcg level and the global level, and a
    socket is considered under pressure when either level asserts as much.
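
    A sketch of the resulting call sites, hidden behind the jump label
    (names as in mainline; the exact context in the socket code is
    simplified):

    /* when raising a socket's memory allocation: */
    if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
        !mem_cgroup_charge_skmem(sk->sk_memcg, nr_pages))
            goto suppress_allocation;

    /* ... and when releasing it: */
    if (mem_cgroup_sockets_enabled && sk->sk_memcg)
            mem_cgroup_uncharge_skmem(sk->sk_memcg, nr_pages);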

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner