07 Dec, 2020

1 commit

  • When investigating a slab cache bloat problem, a significant amount of
    negative dentry cache was seen. Confusingly, the dentries were neither
    shrunk by the reclaimer (the host was under very tight memory pressure)
    nor by dropping caches. The vmcore shows over 14M negative dentry
    objects on the lru, but tracing shows they were not even scanned.

    Further investigation shows that the memcg's vfs shrinker_map bit is
    not set, so the reclaimer and drop_caches simply skip calling the vfs
    shrinker. The only way to get the memory back was to reboot the hosts.

    I didn't manage to come up with a reproducer in a test environment, and
    the problem can't be reproduced after rebooting. But code inspection
    suggests a race between clearing the shrinker map bit and reparenting.
    The hypothesis is elaborated below.

    The memcg hierarchy on our production environment looks like:

         root
        /    \
    system   user

    The main workloads run under the user slice's children, and memcgs are
    created and removed there frequently. So reparenting happens very often
    under the user slice, but no task is under the user slice directly.

    With the frequent reparenting and tight memory pressure, the below
    hypothetical race condition may happen:

    CPU A                                   CPU B
    reparent
        dst->nr_items == 0
                                            shrinker:
                                                total_objects == 0
        add src->nr_items to dst
        set_bit
                                                return SHRINK_EMPTY
                                                clear_bit
    child memcg offline
        replace child's kmemcg_id with
        parent's (in memcg_offline_kmem())
                                            list_lru_del() between shrinker runs
                                                see parent's kmemcg_id
                                                dec dst->nr_items
    reparent again
        dst->nr_items may go negative
        due to concurrent list_lru_del()
                                            The second run of shrinker:
                                                read nr_items without any
                                                synchronization, so it may
                                                see intermediate negative
                                                nr_items then total_objects
                                                may return 0 coincidently

                                                keep the bit cleared
        dst->nr_items != 0
        skip set_bit
        add src->nr_items to dst

    After this point dst->nr_items may never go back to zero, so
    reparenting will never set the shrinker_map bit again. And since there
    is no task under the user slice directly, no new object will be added
    to its lru to set the shrinker map bit either. The bit stays cleared
    forever.

    How does list_lru_del() race with reparenting? Reparenting replaces a
    child's kmemcg_id with the parent's without holding nlru->lock, so
    list_lru_del() may see the parent's kmemcg_id while it is actually
    deleting items from the child's lru, and it then decrements the
    parent's nr_items. Hence the parent's nr_items may go negative, as
    commit 2788cf0c401c ("memcg: reparent list_lrus and free kmemcg_id on
    css offline") says.

    Since it is impossible for dst->nr_items to go negative while
    src->nr_items is zero, it seems we could set the shrinker map bit iff
    src->nr_items != 0. We could synchronize list_lru_count_one() and
    reparenting with nlru->lock, but checking src->nr_items during
    reparenting is the simplest fix and avoids lock contention.
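
    A minimal sketch of how the reparenting path could apply this check.
    Function and field names (memcg_drain_list_lru_node(),
    list_lru_from_memcg_idx(), memcg_set_shrinker_bit(), lru_shrinker_id())
    are assumptions based on kernels of that era, not a quote of the fix:

    static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
                                          int src_idx, struct mem_cgroup *dst_memcg)
    {
            struct list_lru_node *nlru = &lru->node[nid];
            int dst_idx = dst_memcg->kmemcg_id;
            struct list_lru_one *src, *dst;

            spin_lock_irq(&nlru->lock);

            src = list_lru_from_memcg_idx(nlru, src_idx);
            dst = list_lru_from_memcg_idx(nlru, dst_idx);

            list_splice_init(&src->list, &dst->list);

            /* Only touch dst->nr_items and the shrinker map bit when the
             * source list actually held items. */
            if (src->nr_items) {
                    dst->nr_items += src->nr_items;
                    memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
                    src->nr_items = 0;
            }

            spin_unlock_irq(&nlru->lock);
    }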

    Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
    Suggested-by: Roman Gushchin
    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Kirill Tkhai
    Cc: Vladimir Davydov
    Cc: [4.19]
    Link: https://lkml.kernel.org/r/20201202171749.264354-1-shy828301@gmail.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     

15 Aug, 2020

1 commit

  • struct list_lru_one l.nr_items could be accessed concurrently as noticed
    by KCSAN,

    BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move

    write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39:
    list_lru_isolate_move+0xf9/0x130
    list_lru_isolate_move at mm/list_lru.c:180
    inode_lru_isolate+0x12b/0x2a0
    __list_lru_walk_one+0x122/0x3d0
    list_lru_walk_one+0x75/0xa0
    prune_icache_sb+0x8b/0xc0
    super_cache_scan+0x1b8/0x250
    do_shrink_slab+0x256/0x6d0
    shrink_slab+0x41b/0x4a0
    shrink_node+0x35c/0xd80
    balance_pgdat+0x652/0xd90
    kswapd+0x396/0x8d0
    kthread+0x1e0/0x200
    ret_from_fork+0x27/0x50

    read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56:
    list_lru_count_one+0x116/0x2f0
    list_lru_count_one at mm/list_lru.c:193
    super_cache_count+0xe8/0x170
    do_shrink_slab+0x95/0x6d0
    shrink_slab+0x41b/0x4a0
    shrink_node+0x35c/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_vma+0x8a/0x2c0
    do_anonymous_page+0x170/0x700
    __handle_mm_fault+0xc9f/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 56 PID: 6345 Comm: oom01 Tainted: G W L 5.5.0-next-20200205+ #4
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    A torn load of l.nr_items could affect the shrinker behaviour due to
    this data race. Fix it by adding READ_ONCE() for the read. Since the
    writes are aligned and up to word-size, assume they are safe from data
    races, to avoid the readability cost of writing WRITE_ONCE(var, var + val).
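
    A rough sketch of the read side of the fix; the surrounding lookup
    (list_lru_from_memcg_idx(), memcg_cache_id(), the RCU protection)
    follows the list_lru code of that period and is not quoted verbatim:

    unsigned long list_lru_count_one(struct list_lru *lru,
                                     int nid, struct mem_cgroup *memcg)
    {
            struct list_lru_node *nlru = &lru->node[nid];
            struct list_lru_one *l;
            unsigned long count;

            rcu_read_lock();
            l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
            /* READ_ONCE() avoids a torn load racing with updates done
             * under nlru->lock on other CPUs. */
            count = READ_ONCE(l->nr_items);
            rcu_read_unlock();

            return count;
    }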

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Cc: Konrad Rzeszutek Wilk
    Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     

30 Jun, 2020

1 commit

  • Rename the local kvfree_rcu() function to kvfree_rcu_local().
    The purpose is to prevent a conflict between two identical function
    declarations: kvfree_rcu() is about to become globally visible,
    which would otherwise lead to a build error. No functional change.
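
    A sketch of what the renamed local helper plausibly looks like in
    mm/list_lru.c; the struct and field names are assumptions, not quoted
    from the patch:

    /* Local RCU callback, renamed so it no longer collides with the
     * soon-to-be-global kvfree_rcu(). */
    static void kvfree_rcu_local(struct rcu_head *head)
    {
            struct list_lru_memcg *mlru;

            mlru = container_of(head, struct list_lru_memcg, rcu);
            kvfree(mlru);
    }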

    Cc: linux-mm@kvack.org
    Cc: rcu@vger.kernel.org
    Cc: Andrew Morton
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Joel Fernandes (Google)
    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Paul E. McKenney

    Uladzislau Rezki (Sony)
     

05 Jun, 2020

1 commit


08 Apr, 2020

1 commit

  • Convert the various /* fallthrough */ comments to the pseudo-keyword
    fallthrough;

    Done via script:
    https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/
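
    For illustration, the pattern being converted looks roughly like this
    (a generic switch, not necessarily the exact hunk in this file):

    switch (ret) {
    case LRU_REMOVED_RETRY:
            assert_spin_locked(&nlru->lock);
            fallthrough;            /* was: a bare "fall through" comment */
    case LRU_REMOVED:
            isolated++;
            break;
    default:
            break;
    }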

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Reviewed-by: Gustavo A. R. Silva
    Link: http://lkml.kernel.org/r/f62fea5d10eb0ccfc05d87c242a620c261219b66.camel@perches.com
    Signed-off-by: Linus Torvalds

    Joe Perches
     

03 Apr, 2020

1 commit

  • Sometimes we need to get a memcg pointer from a charged kernel object.
    The right way to get it depends on whether it's a proper slab object or
    whether it's backed by raw pages (e.g. it's a vmalloc allocation). In the
    first case the kmem_cache->memcg_params.memcg indirection should be used;
    in other cases it's just page->mem_cgroup.

    To simplify this task and hide the implementation details let's use the
    mem_cgroup_from_obj() helper, which takes a pointer to any kernel object
    and returns a valid memcg pointer or NULL.

    Passing a kernel address rather than a pointer to a page will make it
    possible to use this helper for per-object (rather than per-page) tracked
    objects in the future.

    The caller is still responsible for ensuring that the returned memcg isn't
    going away underneath it: take the rcu read lock, the cgroup mutex, etc.,
    depending on the context.

    mem_cgroup_from_kmem() defined in mm/list_lru.c is now obsolete and can be
    removed.
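
    A sketch of such a helper under the dispatch described above;
    memcg_from_slab_page() is an assumed name for the slab-side lookup and
    the body is written from memory, not quoted from the patch:

    struct mem_cgroup *mem_cgroup_from_obj(void *p)
    {
            struct page *page;

            if (mem_cgroup_disabled())
                    return NULL;

            page = virt_to_head_page(p);

            /* Slab object: go through kmem_cache->memcg_params.memcg. */
            if (PageSlab(page))
                    return memcg_from_slab_page(page);

            /* All other pages use page->mem_cgroup (may be NULL). */
            return page->mem_cgroup;
    }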

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Yafang Shao
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Link: http://lkml.kernel.org/r/20200117203609.3146239-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

13 Jul, 2019

1 commit

  • Every slab page charged to a non-root memory cgroup has a pointer to the
    memory cgroup and holds a reference to it, which protects a non-empty
    memory cgroup from being released. At the same time the page has a
    pointer to the corresponding kmem_cache and also holds a reference to
    that kmem_cache. And the kmem_cache by itself holds a reference to the
    cgroup.

    So there is clearly some redundancy, which allows us to stop setting the
    page->mem_cgroup pointer and to rely on getting the memcg pointer
    indirectly via the kmem_cache. Further, it will make it easier to change
    this pointer later, without a need to go over all charged pages.

    So let's stop setting page->mem_cgroup pointer for slab pages, and stop
    using the css refcounter directly for protecting the memory cgroup from
    going away. Instead rely on kmem_cache as an intermediate object.

    Make sure that vmstats and shrinker lists are working as previously, as
    well as /proc/kpagecgroup interface.

    Link: http://lkml.kernel.org/r/20190611231813.3148843-10-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

14 Jun, 2019

1 commit

  • Syzbot reported following memory leak:

    ffffffffda RBX: 0000000000000003 RCX: 0000000000441f79
    BUG: memory leak
    unreferenced object 0xffff888114f26040 (size 32):
    comm "syz-executor626", pid 7056, jiffies 4294948701 (age 39.410s)
    hex dump (first 32 bytes):
    40 60 f2 14 81 88 ff ff 40 60 f2 14 81 88 ff ff @`......@`......
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    slab_post_alloc_hook mm/slab.h:439 [inline]
    slab_alloc mm/slab.c:3326 [inline]
    kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
    kmalloc include/linux/slab.h:547 [inline]
    __memcg_init_list_lru_node+0x58/0xf0 mm/list_lru.c:352
    memcg_init_list_lru_node mm/list_lru.c:375 [inline]
    memcg_init_list_lru mm/list_lru.c:459 [inline]
    __list_lru_init+0x193/0x2a0 mm/list_lru.c:626
    alloc_super+0x2e0/0x310 fs/super.c:269
    sget_userns+0x94/0x2a0 fs/super.c:609
    sget+0x8d/0xb0 fs/super.c:660
    mount_nodev+0x31/0xb0 fs/super.c:1387
    fuse_mount+0x2d/0x40 fs/fuse/inode.c:1236
    legacy_get_tree+0x27/0x80 fs/fs_context.c:661
    vfs_get_tree+0x2e/0x120 fs/super.c:1476
    do_new_mount fs/namespace.c:2790 [inline]
    do_mount+0x932/0xc50 fs/namespace.c:3110
    ksys_mount+0xab/0x120 fs/namespace.c:3319
    __do_sys_mount fs/namespace.c:3333 [inline]
    __se_sys_mount fs/namespace.c:3330 [inline]
    __x64_sys_mount+0x26/0x30 fs/namespace.c:3330
    do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    This is a simple off by one bug on the error path.
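
    A hedged sketch of the kind of error-path off-by-one this describes;
    the function shape mirrors __memcg_init_list_lru_node() from the
    backtrace, but init_one_lru() and the exact cleanup call are assumptions:

    static int __memcg_init_list_lru_node(struct list_lru_memcg *memcg_lrus,
                                          int begin, int end)
    {
            int i;

            for (i = begin; i < end; i++) {
                    struct list_lru_one *l = kmalloc(sizeof(*l), GFP_KERNEL);

                    if (!l)
                            goto fail;
                    init_one_lru(l);
                    memcg_lrus->lru[i] = l;
            }
            return 0;
    fail:
            /* Entries begin .. i-1 were allocated. Unwinding up to
             * "i - 1" with an exclusive end leaked the last one, so
             * pass "i" here. */
            __memcg_destroy_list_lru_node(memcg_lrus, begin, i);
            return -ENOMEM;
    }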

    Link: http://lkml.kernel.org/r/20190528043202.99980-1-shakeelb@google.com
    Fixes: 60d3fd32a7a9 ("list_lru: introduce per-memcg lists")
    Reported-by: syzbot+f90a420dfe2b1b03cb2c@syzkaller.appspotmail.com
    Signed-off-by: Shakeel Butt
    Acked-by: Michal Hocko
    Reviewed-by: Kirill Tkhai
    Cc: [4.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

02 Jun, 2019

1 commit

  • We have a single node system with node 0 disabled:
    Scanning NUMA topology in Northbridge 24
    Number of physical nodes 2
    Skipping disabled node 0
    Node 1 MemBase 0000000000000000 Limit 00000000fbff0000
    NODE_DATA(1) allocated [mem 0xfbfda000-0xfbfeffff]

    This causes crashes in memcg when system boots:
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    #PF error: [normal kernel read fault]
    ...
    RIP: 0010:list_lru_add+0x94/0x170
    ...
    Call Trace:
    d_lru_add+0x44/0x50
    dput.part.34+0xfc/0x110
    __fput+0x108/0x230
    task_work_run+0x9f/0xc0
    exit_to_usermode_loop+0xf5/0x100

    It is reproducible as far back as 4.12; I did not try older kernels. You
    have to have a new enough systemd, e.g. 241 (the reason is unknown -- it
    was not investigated). It cannot be reproduced with systemd 234.

    The system crashes because the size of lru array is never updated in
    memcg_update_all_list_lrus and the reads are past the zero-sized array,
    causing dereferences of random memory.

    The root cause are list_lru_memcg_aware checks in the list_lru code. The
    test in list_lru_memcg_aware is broken: it assumes node 0 is always
    present, but it is not true on some systems as can be seen above.

    So fix this by avoiding checks on node 0. Remember the memcg-awareness by
    a bool flag in struct list_lru.
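
    A sketch of the described fix: record memcg-awareness explicitly
    instead of deriving it from node 0 (field placement and the CONFIG
    guard are assumptions):

    struct list_lru {
            struct list_lru_node    *node;
    #ifdef CONFIG_MEMCG_KMEM
            struct list_head        list;
            int                     shrinker_id;
            bool                    memcg_aware;    /* new flag */
    #endif
    };

    static inline bool list_lru_memcg_aware(struct list_lru *lru)
    {
            /* Was derived from lru->node[0].memcg_lrus, which breaks
             * when node 0 is not present. */
            return lru->memcg_aware;
    }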

    Link: http://lkml.kernel.org/r/20190522091940.3615-1-jslaby@suse.cz
    Fixes: 60d3fd32a7a9 ("list_lru: introduce per-memcg lists")
    Signed-off-by: Jiri Slaby
    Acked-by: Michal Hocko
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Raghavendra K T
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only
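
    In a C source file this is a single comment line at the top, e.g.:

    // SPDX-License-Identifier: GPL-2.0-only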

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

06 Mar, 2019

1 commit

  • Number of NUMA nodes can't be negative.

    This saves a few bytes on x86_64:

    add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238)
    Function old new delta
    hv_synic_alloc.cold 88 110 +22
    prealloc_shrinker 260 262 +2
    bootstrap 249 251 +2
    sched_init_numa 1566 1567 +1
    show_slab_objects 778 777 -1
    s_show 1201 1200 -1
    kmem_cache_init 346 345 -1
    __alloc_workqueue_key 1146 1145 -1
    mem_cgroup_css_alloc 1614 1612 -2
    __do_sys_swapon 4702 4699 -3
    __list_lru_init 655 651 -4
    nic_probe 2379 2374 -5
    store_user_store 118 111 -7
    red_zone_store 106 99 -7
    poison_store 106 99 -7
    wq_numa_init 348 338 -10
    __kmem_cache_empty 75 65 -10
    task_numa_free 186 173 -13
    merge_across_nodes_store 351 336 -15
    irq_create_affinity_masks 1261 1246 -15
    do_numa_crng_init 343 321 -22
    task_numa_fault 4760 4737 -23
    swapfile_init 179 156 -23
    hv_synic_alloc 536 492 -44
    apply_wqattrs_prepare 746 695 -51

    Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

18 Aug, 2018

12 commits

  • Provide list_lru_shrink_walk_irq() and let it behave like
    list_lru_walk_one() except that it locks the spinlock with
    spin_lock_irq(). This is used by scan_shadow_nodes() because its lock
    nests within the i_pages lock, which is acquired with IRQs disabled.
    This change allows proper locking primitives to be used instead of a
    hand-crafted local_irq_disable() plus spin_lock().

    There is no EXPORT_SYMBOL provided because the current user is in-kernel
    only.

    Add list_lru_shrink_walk_irq() which acquires the spinlock with the
    proper locking primitives.
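
    A sketch of the IRQ-disabling walk variant; apart from spin_lock_irq()
    it mirrors the non-IRQ walk, and the helper names are assumed from the
    surrounding list_lru code rather than quoted from the patch:

    static inline unsigned long
    list_lru_shrink_walk_irq(struct list_lru *lru, struct shrink_control *sc,
                             list_lru_walk_cb isolate, void *cb_arg)
    {
            return list_lru_walk_one_irq(lru, sc->nid, sc->memcg, isolate,
                                         cb_arg, &sc->nr_to_scan);
    }

    unsigned long
    list_lru_walk_one_irq(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
                          list_lru_walk_cb isolate, void *cb_arg,
                          unsigned long *nr_to_walk)
    {
            struct list_lru_node *nlru = &lru->node[nid];
            unsigned long ret;

            spin_lock_irq(&nlru->lock);     /* IRQ-safe variant of the lock */
            ret = __list_lru_walk_one(nlru, memcg_cache_id(memcg),
                                      isolate, cb_arg, nr_to_walk);
            spin_unlock_irq(&nlru->lock);
            return ret;
    }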

    Link: http://lkml.kernel.org/r/20180716111921.5365-5-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Vladimir Davydov
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • __list_lru_walk_one() is invoked with struct list_lru *lru, int nid as
    the first two arguments. Those two are only used to retrieve the struct
    list_lru_node. Since the caller already looks it up for the locking, we
    can pass the struct list_lru_node * directly and avoid the dance around
    it.

    Link: http://lkml.kernel.org/r/20180716111921.5365-4-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Vladimir Davydov
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • Move the locking inside __list_lru_walk_one() to its caller. This is a
    preparation step in order to introduce list_lru_walk_one_irq() which
    does spin_lock_irq() instead of spin_lock() for the locking.

    Link: http://lkml.kernel.org/r/20180716111921.5365-3-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Vladimir Davydov
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • Patch series "mm/list_lru: Add list_lru_shrink_walk_irq() and a user".

    This series removes the local_irq_disable() around
    list_lru_shrink_walk() (as used by mm/workingset) by adding
    list_lru_shrink_walk_irq().

    Vladimir Davydov preferred this over `irq' argument which I added to
    struct list_lru.

    The initial post (of this series) received a Reviewed-by tag by Vladimir
    Davydov which I added to each patch of the series. The series applies
    on top of akpm's tree which has Kirill's shrink_slab series and does not
    clash with it (akpm asked me to wait a week or so and repost it then).

    I tested the code paths by triggering the OOM-killer via memory
    overcommit, and lockdep did not complain (nor did I see any warnings).

    This patch (of 4):

    list_lru_walk_node() invokes __list_lru_walk_one() with -1 as the
    memcg_idx parameter. The same can be achieved by list_lru_walk_one(),
    passing NULL as the memcg argument, which then gets converted into -1.
    This is a preparation step for lifting the spin_lock() to the caller
    of __list_lru_walk_one(). Invoke list_lru_walk_one() instead of
    __list_lru_walk_one() when possible.

    Link: http://lkml.kernel.org/r/20180716111921.5365-2-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Vladimir Davydov
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • Introduce a set_shrinker_bit() function to set the shrinker-related bit
    in the memcg shrinker bitmap, and set the bit after the first item is
    added and when a destroyed memcg's items are reparented.

    This will allow the next patch to call shrinkers only when they have
    charged objects at the moment, and so to improve shrink_slab()
    performance.
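
    A sketch of how the lru add path uses this, setting the bit when the
    first element appears on a per-memcg list. Helper names such as
    list_lru_from_kmem() and lru_shrinker_id() follow the series but are
    written from memory:

    bool list_lru_add(struct list_lru *lru, struct list_head *item)
    {
            int nid = page_to_nid(virt_to_page(item));
            struct list_lru_node *nlru = &lru->node[nid];
            struct mem_cgroup *memcg;
            struct list_lru_one *l;

            spin_lock(&nlru->lock);
            if (list_empty(item)) {
                    l = list_lru_from_kmem(nlru, item, &memcg);
                    list_add_tail(item, &l->list);
                    /* The list was empty: tell the memcg's shrinker bitmap
                     * that this shrinker now has something to do. */
                    if (!l->nr_items++)
                            memcg_set_shrinker_bit(memcg, nid,
                                                   lru_shrinker_id(lru));
                    nlru->nr_items++;
                    spin_unlock(&nlru->lock);
                    return true;
            }
            spin_unlock(&nlru->lock);
            return false;
    }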

    [ktkhai@virtuozzo.com: v9]
    Link: http://lkml.kernel.org/r/153112557572.4097.17315791419810749985.stgit@localhost.localdomain
    Link: http://lkml.kernel.org/r/153063065671.1818.15914674956134687268.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • This is just refactoring to allow next patches to have lru pointer in
    memcg_drain_list_lru_node().

    Link: http://lkml.kernel.org/r/153063063164.1818.55009531386089350.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • This is just refactoring to allow the next patches to have dst_memcg
    pointer in memcg_drain_list_lru_node().

    Link: http://lkml.kernel.org/r/153063062118.1818.2761273817739499749.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • This is just refactoring to allow the next patches to have memcg pointer
    in list_lru_from_kmem().

    Link: http://lkml.kernel.org/r/153063060664.1818.9541345386733498582.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Add a list_lru::shrinker_id field and populate it with the registered
    shrinker id.

    This will be used by the lru code in the next patches to set the correct
    bit in the memcg shrinkers map once the first memcg-related element
    appears in the list_lru.

    Link: http://lkml.kernel.org/r/153063059758.1818.14866596416857717800.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Introduce a new config option to replace the repeated
    CONFIG_MEMCG && !CONFIG_SLOB pattern. The next patches add a little more
    memcg+kmem related code, so let's keep the ifdefs cleaner.
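
    Presumably this is the CONFIG_MEMCG_KMEM symbol; a Kconfig sketch of an
    internal option that simply encodes the old compound condition:

    config MEMCG_KMEM
            bool
            depends on MEMCG && !SLOB
            default y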

    Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Patch series "Improve shrink_slab() scalability (old complexity was O(n^2), new is O(n))", v8.

    This patchset solves the problem of slow shrink_slab() occurring on
    machines with many shrinkers and memory cgroups (i.e., with many
    containers). The problem is that the complexity of shrink_slab() is
    O(n^2), and it grows too fast with the number of containers.

    Let us have 200 containers, and every container has 10 mounts and 10
    cgroups. All container tasks are isolated, and they don't touch foreign
    containers mounts.

    In case of global reclaim, a task has to iterate over all the memcgs and
    call all the memcg-aware shrinkers for each of them. This means the task
    has to visit 200 * 10 = 2000 shrinkers for every memcg, and since there
    are 2000 memcgs, the total number of do_shrink_slab() calls is 2000 *
    2000 = 4000000.

    4 million calls are not the kind of operation that takes one cpu cycle
    each. E.g., super_cache_count() accesses at least two lists and does
    arithmetic. Even if there are no charged objects, we do these
    calculations and displace cpu caches with memory reads. I observed
    nodes spending almost 100% of their time in the kernel under intensive
    writing and global reclaim. The writer consumes pages fast, but it has
    to go through shrink_slab() before the reclaimer reaches the
    page-shrinking code (and frees SWAP_CLUSTER_MAX pages). Even if there is
    no writing, the iterations just waste time and slow reclaim down.

    Let's see the small test below:

    $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
    $mkdir /sys/fs/cgroup/memory/ct
    $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
    $for i in `seq 0 4000`;
    do mkdir /sys/fs/cgroup/memory/ct/$i;
    echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
    mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file;
    done

    Then, let's see drop caches time (5 sequential calls):

    $time echo 3 > /proc/sys/vm/drop_caches

    0.00user 13.78system 0:13.78elapsed 99%CPU
    0.00user 5.59system 0:05.60elapsed 99%CPU
    0.00user 5.48system 0:05.48elapsed 99%CPU
    0.00user 8.35system 0:08.35elapsed 99%CPU
    0.00user 8.34system 0:08.35elapsed 99%CPU

    The last four calls don't actually shrink anything. So, the iterations
    over slab shrinkers take 5.48 seconds. Not so good for scalability.

    The patchset solves the problem by making shrink_slab() of O(n)
    complexity. There are following functional actions:

    1) Assign an id to every registered memcg-aware shrinker.

    2) Maintain a per-memcg bitmap of memcg-aware shrinkers, and set a
    shrinker-related bit after the first element is added to the lru list
    (also when a removed child memcg's elements are reparented).

    3) Split memcg-aware and !memcg-aware shrinkers, and call a shrinker
    only if its bit is set in the memcg's shrinker bitmap (there is also
    functionality to clear the bit after the last element is shrunk); see
    the sketch after this list.
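
    A simplified sketch of the bitmap walk from point 3. Locking, the
    SHRINK_EMPTY re-check and other details of the real shrink_slab_memcg()
    are omitted, and shrinker_idr / shrinker_nr_max are names assumed from
    the series:

    static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
                                           struct mem_cgroup *memcg, int priority)
    {
            struct memcg_shrinker_map *map;
            unsigned long freed = 0;
            int i;

            map = rcu_dereference(memcg->nodeinfo[nid]->shrinker_map);
            if (unlikely(!map))
                    return 0;

            /* Visit only shrinkers that may have charged objects. */
            for_each_set_bit(i, map->map, shrinker_nr_max) {
                    struct shrink_control sc = {
                            .gfp_mask = gfp_mask,
                            .nid = nid,
                            .memcg = memcg,
                    };
                    struct shrinker *shrinker = idr_find(&shrinker_idr, i);
                    unsigned long ret;

                    if (!shrinker)
                            continue;

                    ret = do_shrink_slab(&sc, shrinker, priority);
                    if (ret == SHRINK_EMPTY)
                            clear_bit(i, map->map); /* nothing left to shrink */
                    else
                            freed += ret;
            }
            return freed;
    }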

    This gives significant performance increase. The result after patchset
    is applied:

    $time echo 3 > /proc/sys/vm/drop_caches

    0.00user 1.10system 0:01.10elapsed 99%CPU
    0.00user 0.00system 0:00.01elapsed 64%CPU
    0.00user 0.01system 0:00.01elapsed 82%CPU
    0.00user 0.00system 0:00.01elapsed 64%CPU
    0.00user 0.01system 0:00.01elapsed 82%CPU

    The results show a performance increase of at least 548 times.

    So, the patchset reduces the complexity of shrink_slab() and improves
    performance for the kind of load described above. It will also help the
    !global (per-memcg) reclaim case, since there will be fewer
    do_shrink_slab() calls there as well.

    This patch (of 17):

    These two pairs of blocks of code are under the same #ifdef #else
    #endif.

    Link: http://lkml.kernel.org/r/153063052519.1818.9393587113056959488.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Thomas Gleixner
    Cc: Philippe Ombredanne
    Cc: Sahitya Tummala
    Cc: Greg Kroah-Hartman
    Cc: Stephen Rothwell
    Cc: Roman Gushchin
    Cc: Matthias Kaehlcke
    Cc: Tetsuo Handa
    Cc: Chris Wilson
    Cc: Waiman Long
    Cc: Minchan Kim
    Cc: "Huang, Ying"
    Cc: Mel Gorman
    Cc: Josef Bacik
    Cc: Guenter Roeck
    Cc: Matthew Wilcox
    Cc: Li RongQing
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • __list_lru_count_one() has a single callsite.

    Acked-by: Vladimir Davydov
    Cc: Sebastian Andrzej Siewior
    Cc: Kirill Tkhai
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

06 Apr, 2018

1 commit

  • When reclaiming the slab of a memcg, shrink_slab iterates over all
    registered shrinkers in the system and tries to count and consume
    objects related to the cgroup. Under memory pressure this behaves badly:
    I observe high system time, much of it spent in list_lru_count_one(),
    for many processes on a RHEL7 kernel.

    This patch makes list_lru_node::memcg_lrus RCU protected, which allows
    us to skip taking the spinlock in list_lru_count_one().
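
    A sketch of the lookup once memcg_lrus is RCU protected; the
    rcu_dereference_check()/lockdep annotation is an assumption about how
    the protection is expressed:

    static inline struct list_lru_one *
    list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx)
    {
            struct list_lru_memcg *memcg_lrus;

            /* Either nlru->lock or RCU protects the array from being
             * reallocated by memcg_update_list_lru_node(). */
            memcg_lrus = rcu_dereference_check(nlru->memcg_lrus,
                                               lockdep_is_held(&nlru->lock));
            if (memcg_lrus && idx >= 0)
                    return memcg_lrus->lru[idx];

            return &nlru->lru;
    }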

    With the patch, Shakeel Butt observes a significant change in the perf
    profile. He says:

    ========================================================================
    Setup: running a fork-bomb in a memcg of 200MiB on a 8GiB and 4 vcpu
    VM and recording the trace with 'perf record -g -a'.

    The trace without the patch:

    + 34.19% fb.sh [kernel.kallsyms] [k] queued_spin_lock_slowpath
    + 30.77% fb.sh [kernel.kallsyms] [k] _raw_spin_lock
    + 3.53% fb.sh [kernel.kallsyms] [k] list_lru_count_one
    + 2.26% fb.sh [kernel.kallsyms] [k] super_cache_count
    + 1.68% fb.sh [kernel.kallsyms] [k] shrink_slab
    + 0.59% fb.sh [kernel.kallsyms] [k] down_read_trylock
    + 0.48% fb.sh [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
    + 0.38% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
    + 0.32% fb.sh [kernel.kallsyms] [k] queue_work_on
    + 0.26% fb.sh [kernel.kallsyms] [k] count_shadow_nodes

    With the patch:

    + 0.16% swapper [kernel.kallsyms] [k] default_idle
    + 0.13% oom_reaper [kernel.kallsyms] [k] mutex_spin_on_owner
    + 0.05% perf [kernel.kallsyms] [k] copy_user_generic_string
    + 0.05% init.real [kernel.kallsyms] [k] wait_consider_task
    + 0.05% kworker/0:0 [kernel.kallsyms] [k] finish_task_switch
    + 0.04% kworker/2:1 [kernel.kallsyms] [k] finish_task_switch
    + 0.04% kworker/3:1 [kernel.kallsyms] [k] finish_task_switch
    + 0.04% kworker/1:0 [kernel.kallsyms] [k] finish_task_switch
    + 0.03% binary [kernel.kallsyms] [k] copy_page
    ========================================================================

    Thanks Shakeel for the testing.

    [ktkhai@virtuozzo.com: v2]
    Link: http://lkml.kernel.org/r/151203869520.3915.2587549826865799173.stgit@localhost.localdomain
    Link: http://lkml.kernel.org/r/150583358557.26700.8490036563698102569.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Tested-by: Shakeel Butt
    Acked-by: Vladimir Davydov
    Cc: Andrey Ryabinin
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     

16 Nov, 2017

1 commit


04 Oct, 2017

1 commit

  • For quick per-memcg indexing, slab caches and list_lru structures
    maintain linear arrays of descriptors. As the number of concurrent
    memory cgroups in the system goes up, this requires large contiguous
    allocations (8k cgroups = order-5, 16k cgroups = order-6 etc.) for every
    existing slab cache and list_lru, which can easily fail on loaded
    systems. E.g.:

    mkdir: page allocation failure: order:5, mode:0x14040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null)
    CPU: 1 PID: 6399 Comm: mkdir Not tainted 4.13.0-mm1-00065-g720bbe532b7c-dirty #481
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
    Call Trace:
    ? __alloc_pages_direct_compact+0x4c/0x110
    __alloc_pages_nodemask+0xf50/0x1430
    alloc_pages_current+0x60/0xc0
    kmalloc_order_trace+0x29/0x1b0
    __kmalloc+0x1f4/0x320
    memcg_update_all_list_lrus+0xca/0x2e0
    mem_cgroup_css_alloc+0x612/0x670
    cgroup_apply_control_enable+0x19e/0x360
    cgroup_mkdir+0x322/0x490
    kernfs_iop_mkdir+0x55/0x80
    vfs_mkdir+0xd0/0x120
    SyS_mkdirat+0x6c/0xe0
    SyS_mkdir+0x14/0x20
    entry_SYSCALL_64_fastpath+0x18/0xad
    Mem-Info:
    active_anon:2965 inactive_anon:19 isolated_anon:0
    active_file:100270 inactive_file:98846 isolated_file:0
    unevictable:0 dirty:0 writeback:0 unstable:0
    slab_reclaimable:7328 slab_unreclaimable:16402
    mapped:771 shmem:52 pagetables:278 bounce:0
    free:13718 free_pcp:0 free_cma:0

    This output is from an artificial reproducer, but we have repeatedly
    observed order-7 failures in production in the Facebook fleet. These
    systems become useless as they cannot run more jobs, even though there
    is plenty of memory to allocate 128 individual pages.

    Use kvmalloc and kvzalloc to fall back to vmalloc space if these arrays
    prove too large to be allocated physically contiguously.
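
    A sketch of the shape of the change in the list_lru array-resize path;
    the function is modelled on memcg_update_list_lru_node(), and the sizes
    and copy details are illustrative rather than quoted:

    static int memcg_update_list_lru_node(struct list_lru_node *nlru,
                                          int old_size, int new_size)
    {
            struct list_lru_memcg *old, *new;

            /* kvmalloc falls back to vmalloc when a large physically
             * contiguous allocation cannot be satisfied. */
            new = kvmalloc(sizeof(*new) + new_size * sizeof(void *), GFP_KERNEL);
            if (!new)
                    return -ENOMEM;

            old = nlru->memcg_lrus;
            memcpy(&new->lru, &old->lru, old_size * sizeof(void *));
            memset(&new->lru[old_size], 0,
                   (new_size - old_size) * sizeof(void *));

            nlru->memcg_lrus = new;
            kvfree(old);
            return 0;
    }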

    Link: http://lkml.kernel.org/r/20170918184919.20644-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Josef Bacik
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

11 Jul, 2017

1 commit

  • list_lru_count_node() iterates over all memcgs to get the total number of
    entries on the node, but it can race with memcg_drain_all_list_lrus(),
    which migrates the entries from a dead cgroup to another one. As a
    result, list_lru_count_node() can return an incorrect number of entries.

    Fix this by keeping track of entries per node and simply return it in
    list_lru_count_node().
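
    A sketch of the described approach: keep a per-node counter next to the
    per-memcg lists and return it directly (field placement is assumed):

    struct list_lru_node {
            spinlock_t              lock;   /* protects all lists on the node */
            struct list_lru_one     lru;    /* global (non-memcg) list */
            struct list_lru_memcg   *memcg_lrus;
            long                    nr_items;       /* total for this node */
    } ____cacheline_aligned_in_smp;

    unsigned long list_lru_count_node(struct list_lru *lru, int nid)
    {
            struct list_lru_node *nlru = &lru->node[nid];

            /* nr_items is updated under nlru->lock by add/del/drain, so
             * there is no per-memcg iteration to race with draining. */
            return nlru->nr_items;
    }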

    Link: http://lkml.kernel.org/r/1498707555-30525-1-git-send-email-stummala@codeaurora.org
    Signed-off-by: Sahitya Tummala
    Acked-by: Vladimir Davydov
    Cc: Jan Kara
    Cc: Alexander Polakov
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sahitya Tummala
     

28 Oct, 2016

1 commit

  • As described in https://bugzilla.kernel.org/show_bug.cgi?id=177821:

    After some analysis it seems to be that the problem is in alloc_super().
    In case list_lru_init_memcg() fails it goes into destroy_super(), which
    calls list_lru_destroy().

    And in list_lru_init() we see that in case memcg_init_list_lru() fails,
    lru->node is freed but not set to NULL, which leads list_lru_destroy()
    to believe the lru is initialized and to call memcg_destroy_list_lru().
    memcg_destroy_list_lru() in turn can access lru->node[i].memcg_lrus,
    which is NULL.
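
    A sketch of the kind of fix this implies in the __list_lru_init() error
    path (the exact hunk may differ):

    err = memcg_init_list_lru(lru, memcg_aware);
    if (err) {
            kfree(lru->node);
            /* Ensure a later list_lru_destroy() sees the lru as
             * uninitialized instead of touching freed memory. */
            lru->node = NULL;
            goto out;
    }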

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Alexander Polakov
    Acked-by: Vladimir Davydov
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Polakov
     

21 Jan, 2016

1 commit


06 Nov, 2015

2 commits

  • Before the previous patch ("memcg: unify slab and other kmem pages
    charging"), __mem_cgroup_from_kmem had to handle two types of kmem - slab
    pages and pages allocated with alloc_kmem_pages - because only the latter
    kept the memcg in the page struct. Now both do, so the two cases can be
    unified. Since the function then becomes tiny, we can fold it into
    mem_cgroup_from_kmem.

    [hughd@google.com: move mem_cgroup_from_kmem into list_lru.c]
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • The functions used in the patch are in the slowpath, which gets called
    whenever alloc_super is called during mounts.

    Though this should not make a difference for architectures with
    sequential numa node ids, for powerpc, which can have sparse node ids
    (e.g. a 4-node system with numa ids 0, 1, 16, 17 is common), this patch
    saves some unnecessary allocations for non-existent numa nodes.

    Even without that saving, the patch perhaps makes the code more readable.
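
    A sketch of the pattern change this implies; init_one_node() is a
    placeholder for the per-node setup, and whether the patch touched
    exactly these loops is not shown here:

    /* Before: walks every possible node id, including holes in a sparse
     * numbering such as 0, 1, 16, 17. */
    for (i = 0; i < nr_node_ids; i++)
            if (!init_one_node(lru, i))
                    goto fail;

    /* After: only nodes that actually exist are visited. */
    for_each_node(i)
            if (!init_one_node(lru, i))
                    goto fail;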

    [vdavydov@parallels.com: take memcg_aware check outside for_each loop]
    Signed-off-by: Raghavendra K T
    Reviewed-by: Vladimir Davydov
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Anton Blanchard
    Cc: Nishanth Aravamudan
    Cc: Greg Kurz
    Cc: Grant Likely
    Cc: Nikunj A Dadhania
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raghavendra K T
     

09 Sep, 2015

1 commit


13 Feb, 2015

5 commits

  • Now, the only reason to keep kmemcg_id till css free is list_lru, which
    uses it to distribute elements between per-memcg lists. However, it can
    be easily sorted out - we only need to change kmemcg_id of an offline
    cgroup to its parent's id, making further list_lru_add()'s add elements to
    the parent's list, and then move all elements from the offline cgroup's
    list to the one of its parent. It will work, because a racing
    list_lru_del() does not need to know the list it is deleting the element
    from. It can decrement the wrong nr_items counter, but the ongoing
    reparenting will fix it. After list_lru reparenting is done we are free
    to release kmemcg_id saving a valuable slot in a per-memcg array for new
    cgroups.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, the isolate callback passed to the list_lru_walk family of
    functions is supposed to just delete an item from the list upon returning
    LRU_REMOVED or LRU_REMOVED_RETRY, while nr_items counter is fixed by
    __list_lru_walk_one after the callback returns. Since the callback is
    allowed to drop the lock after removing an item (it has to return
    LRU_REMOVED_RETRY then), the nr_items can be less than the actual number
    of elements on the list even if we check them under the lock. This makes
    it difficult to move items from one list_lru_one to another, which is
    required for per-memcg list_lru reparenting - we can't just splice the
    lists, we have to move entries one by one.

    This patch therefore introduces helpers that must be used by callback
    functions to isolate items instead of raw list_del/list_move. These are
    list_lru_isolate and list_lru_isolate_move. They not only remove the
    entry from the list, but also fix the nr_items counter, making sure
    nr_items always reflects the actual number of elements on the list if
    checked under the appropriate lock.
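
    The helpers are small; a sketch of what they plausibly look like:

    /* Remove an item on behalf of an isolate callback, keeping nr_items
     * in sync with the list under the appropriate lock. */
    void list_lru_isolate(struct list_lru_one *list, struct list_head *item)
    {
            list_del_init(item);
            list->nr_items--;
    }

    void list_lru_isolate_move(struct list_lru_one *list, struct list_head *item,
                               struct list_head *head)
    {
            list_move(item, head);
            list->nr_items--;
    }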

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • There are several FS shrinkers, including super_block::s_shrink, that
    keep reclaimable objects in the list_lru structure. Hence to turn them
    to memcg-aware shrinkers, it is enough to make list_lru per-memcg.

    This patch does the trick. It adds an array of lru lists to the
    list_lru_node structure (per-node part of the list_lru), one for each
    kmem-active memcg, and dispatches every item addition or removal to the
    list corresponding to the memcg which the item is accounted to. So now
    the list_lru structure is not just per node, but per node and per memcg.
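
    A sketch of the resulting data layout, written from the description
    above; exact field names and config guards in the original patch may
    differ:

    struct list_lru_one {
            struct list_head        list;
            long                    nr_items;
    };

    struct list_lru_memcg {
            /* One list per kmem-active memcg, indexed by memcg_cache_id. */
            struct list_lru_one     *lru[0];
    };

    struct list_lru_node {
            spinlock_t              lock;   /* protects all lists on the node */
            struct list_lru_one     lru;    /* global (root) list */
    #ifdef CONFIG_MEMCG_KMEM
            struct list_lru_memcg   *memcg_lrus;
    #endif
    };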

    Not all list_lrus need this feature, so this patch also adds a new
    method, list_lru_init_memcg, which initializes a list_lru as memcg
    aware. Otherwise (i.e. if initialized with old list_lru_init), the
    list_lru won't have per memcg lists.

    Just like per memcg caches arrays, the arrays of per-memcg lists are
    indexed by memcg_cache_id, so we must grow them whenever
    memcg_nr_cache_ids is increased. So we introduce a callback,
    memcg_update_all_list_lrus, invoked by memcg_alloc_cache_id if the id
    space is full.

    The locking is implemented in a manner similar to lruvecs, i.e. we have
    one lock per node that protects all lists (both global and per cgroup) on
    the node.

    Signed-off-by: Vladimir Davydov
    Cc: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • To make list_lru memcg aware, we need all list_lrus to be kept on a list
    protected by a mutex, so that we could sleep while walking over the
    list.

    Therefore, after this change list_lru_destroy may sleep. Fortunately,
    there is only one user that calls it from an atomic context - put_super -
    and we can easily fix it by calling list_lru_destroy before put_super in
    destroy_locked_super; we no longer need the lrus by that time anyway.

    Another point that should be noted is that list_lru_destroy is allowed
    to be called on an uninitialized zeroed-out object, in which case it is
    a no-op. Before this patch this was guaranteed by kfree, but now we
    need an explicit check there.
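
    A sketch of list_lru_destroy() with the global registration and the
    explicit no-op check described above; list_lrus_mutex and the exact
    teardown order are assumptions:

    void list_lru_destroy(struct list_lru *lru)
    {
            /* No-op on an uninitialized (zeroed) or already destroyed lru. */
            if (!lru->node)
                    return;

            mutex_lock(&list_lrus_mutex);   /* may sleep */
            list_del(&lru->list);
            mutex_unlock(&list_lrus_mutex);

            memcg_destroy_list_lru(lru);
            kfree(lru->node);
            lru->node = NULL;
    }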

    Signed-off-by: Vladimir Davydov
    Cc: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • The active_nodes mask allows us to skip empty nodes when walking over
    list_lru items from all nodes in list_lru_count/walk. However, these
    functions are never called from hot paths, so it doesn't seem we need
    such kind of optimization there. OTOH, removing the mask will make it
    easier to make list_lru per-memcg.

    Signed-off-by: Vladimir Davydov
    Cc: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

04 Apr, 2014

1 commit

  • Previously, page cache radix tree nodes were freed after reclaim emptied
    out their page pointers. But now reclaim stores shadow entries in their
    place, which are only reclaimed when the inodes themselves are
    reclaimed. This is problematic for bigger files that are still in use
    after they have a significant amount of their cache reclaimed, without
    any of those pages actually refaulting. The shadow entries will just
    sit there and waste memory. In the worst case, the shadow entries will
    accumulate until the machine runs out of memory.

    To get this under control, the VM will track radix tree nodes
    exclusively containing shadow entries on a per-NUMA node list. Per-NUMA
    rather than global because we expect the radix tree nodes themselves to
    be allocated node-locally and we want to reduce cross-node references of
    otherwise independent cache workloads. A simple shrinker will then
    reclaim these nodes on memory pressure.

    A few things need to be stored in the radix tree node to implement the
    shadow node LRU and allow tree deletions coming from the list:

    1. There is no index available that would describe the reverse path
    from the node up to the tree root, which is needed to perform a
    deletion. To solve this, encode in each node its offset inside the
    parent. This can be stored in the unused upper bits of the same
    member that stores the node's height at no extra space cost.

    2. The number of shadow entries needs to be counted in addition to the
    regular entries, to quickly detect when the node is ready to go to
    the shadow node LRU list. The current entry count is an unsigned
    int but the maximum number of entries is 64, so a shadow counter
    can easily be stored in the unused upper bits.

    3. Tree modification needs tree lock and tree root, which are located
    in the address space, so store an address_space backpointer in the
    node. The parent pointer of the node is in a union with the 2-word
    rcu_head, so the backpointer comes at no extra cost as well.

    4. The node needs to be linked to an LRU list, which requires a list
    head inside the node. This does increase the size of the node, but
    it does not change the number of objects that fit into a slab page.

    [akpm@linux-foundation.org: export the right function]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

31 Oct, 2013

1 commit

  • I've seen a fair number of issues with kswapd and other processes
    appearing to get stuck in v3.12-rc. Using sysrq-p many times seems to
    indicate that it gets stuck somewhere in list_lru_walk_node(), called
    from prune_icache_sb() and super_cache_scan().

    I never seem to be able to trigger a calltrace for functions above that
    point.

    So I decided to add the following to super_cache_scan():

    @@ -81,10 +81,14 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
    inodes = list_lru_count_node(&sb->s_inode_lru, sc->nid);
    dentries = list_lru_count_node(&sb->s_dentry_lru, sc->nid);
    total_objects = dentries + inodes + fs_objects + 1;
    +printk("%s:%u: %s: dentries %lu inodes %lu total %lu\n", current->comm, current->pid, __func__, dentries, inodes, total_objects);

    /* proportion the scan between the caches */
    dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
    inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
    +printk("%s:%u: %s: dentries %lu inodes %lu\n", current->comm, current->pid, __func__, dentries, inodes);
    +BUG_ON(dentries == 0);
    +BUG_ON(inodes == 0);

    /*
    * prune the dcache first as the icache is pinned by it, then
    @@ -99,7 +103,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
    freed += sb->s_op->free_cached_objects(sb, fs_objects,
    sc->nid);
    }
    -
    +printk("%s:%u: %s: dentries %lu inodes %lu freed %lu\n", current->comm, current->pid, __func__, dentries, inodes, freed);
    drop_super(sb);
    return freed;
    }

    and shortly thereafter, having applied some pressure, I got this:

    update-apt-xapi:1616: super_cache_scan: dentries 25632 inodes 2 total 25635
    update-apt-xapi:1616: super_cache_scan: dentries 1023 inodes 0
    ------------[ cut here ]------------
    Kernel BUG at c0101994 [verbose debug info unavailable]
    Internal error: Oops - BUG: 0 [#3] SMP ARM
    Modules linked in: fuse rfcomm bnep bluetooth hid_cypress
    CPU: 0 PID: 1616 Comm: update-apt-xapi Tainted: G D 3.12.0-rc7+ #154
    task: daea1200 ti: c3bf8000 task.ti: c3bf8000
    PC is at super_cache_scan+0x1c0/0x278
    LR is at trace_hardirqs_on+0x14/0x18
    Process update-apt-xapi (pid: 1616, stack limit = 0xc3bf8240)
    ...
    Backtrace:
    (super_cache_scan) from [] (shrink_slab+0x254/0x4c8)
    (shrink_slab) from [] (try_to_free_pages+0x3a0/0x5e0)
    (try_to_free_pages) from [] (__alloc_pages_nodemask+0x5)
    (__alloc_pages_nodemask) from [] (__pte_alloc+0x2c/0x13)
    (__pte_alloc) from [] (handle_mm_fault+0x84c/0x914)
    (handle_mm_fault) from [] (do_page_fault+0x1f0/0x3bc)
    (do_page_fault) from [] (do_translation_fault+0xac/0xb8)
    (do_translation_fault) from [] (do_DataAbort+0x38/0xa0)
    (do_DataAbort) from [] (__dabt_usr+0x38/0x40)

    Notice that we had a very low number of inodes, which was reduced to
    zero by mult_frac().

    Now, prune_icache_sb() calls list_lru_walk_node() passing that number of
    inodes (0) into that as the number of objects to scan:

    long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
                         int nid)
    {
            LIST_HEAD(freeable);
            long freed;

            freed = list_lru_walk_node(&sb->s_inode_lru, nid, inode_lru_isolate,
                                       &freeable, &nr_to_scan);

    which does:

    unsigned long
    list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate,
                       void *cb_arg, unsigned long *nr_to_walk)
    {

            struct list_lru_node *nlru = &lru->node[nid];
            struct list_head *item, *n;
            unsigned long isolated = 0;

            spin_lock(&nlru->lock);
    restart:
            list_for_each_safe(item, n, &nlru->list) {
                    enum lru_status ret;

                    /*
                     * decrement nr_to_walk first so that we don't livelock if we
                     * get stuck on large numbesr of LRU_RETRY items
                     */
                    if (--(*nr_to_walk) == 0)
                            break;

    So, if *nr_to_walk was zero when this function was entered, that means
    we're wanting to operate on (~0UL)+1 objects - which might as well be
    infinite.

    Clearly this is not correct behaviour. If we think about the behaviour
    of this function when *nr_to_walk is 1, then clearly it's wrong - we
    decrement first and then test for zero - which results in us doing
    nothing at all. A post-decrement would give the desired behaviour -
    we'd try to walk one object and one object only if *nr_to_walk were one.

    It also gives the correct behaviour for zero - we exit at this point.
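
    A sketch of the check-before-decrement form this implies (the mainline
    fix, per the bracketed note in the sign-off area, also makes sure the
    count never underflows):

    /*
     * Check before decrementing: walk exactly *nr_to_walk items and do
     * nothing at all when the caller asks for zero, instead of wrapping
     * around to (~0UL) + 1.
     */
    if (!*nr_to_walk)
            break;
    --*nr_to_walk;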

    Fixes: 5cedf721a7cd ("list_lru: fix broken LRU_RETRY behaviour")
    Signed-off-by: Russell King
    Cc: Dave Chinner
    Cc: Al Viro
    Cc: Andrew Morton
    [ Modified to make sure we never underflow the count: this function gets
    called in a loop, so the 0 -> ~0ul transition is dangerous - Linus ]
    Signed-off-by: Linus Torvalds

    Russell King
     

11 Sep, 2013

1 commit

  • We currently use a compile-time constant to size the node array for the
    list_lru structure. Due to this, we don't need to allocate any memory at
    initialization time. But as a consequence, the structures that contain
    embedded list_lru lists can become way too big (the superblock for
    instance contains two of them).

    This patch aims at ameliorating this situation by dynamically allocating
    the node arrays with the firmware provided nr_node_ids.
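
    A sketch of the dynamic initialization this describes; error handling
    and the later memcg hooks are omitted:

    int list_lru_init(struct list_lru *lru)
    {
            int i;
            size_t size = sizeof(*lru->node) * nr_node_ids;

            /* Previously a fixed MAX_NUMNODES-sized array embedded in
             * struct list_lru; now sized by what the platform reports. */
            lru->node = kzalloc(size, GFP_KERNEL);
            if (!lru->node)
                    return -ENOMEM;

            for (i = 0; i < nr_node_ids; i++) {
                    spin_lock_init(&lru->node[i].lock);
                    INIT_LIST_HEAD(&lru->node[i].list);
                    lru->node[i].nr_items = 0;
            }
            return 0;
    }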

    Signed-off-by: Glauber Costa
    Cc: Dave Chinner
    Cc: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Glauber Costa