05 Sep, 2018

1 commit

  • [ Upstream commit 7e97de0b033bcac4fa9a35cef72e0c06e6a22c67 ]

    In case of memcg_online_kmem() failure, mem_cgroup::id remains hashed
    in mem_cgroup_idr even after the memcg memory is freed. This leads to a
    leak of the ID in mem_cgroup_idr.

    This patch adds removal into mem_cgroup_css_alloc(), which fixes the
    problem. For better readability, it adds a generic helper which is used
    in mem_cgroup_alloc() and mem_cgroup_id_put_many() as well.
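
    A rough sketch of what such a helper can look like (the body is
    reconstructed from the description above, not copied from the patch):

    static void mem_cgroup_id_remove(struct mem_cgroup *memcg)
    {
            /* unhash the ID exactly once; safe to call on error paths */
            if (memcg->id.id > 0) {
                    idr_remove(&mem_cgroup_idr, memcg->id.id);
                    memcg->id.id = 0;
            }
    }

    With one helper, the failure path of mem_cgroup_css_alloc(),
    mem_cgroup_alloc() and mem_cgroup_id_put_many() can all unhash the ID
    through the same code instead of open-coding idr_remove().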

    Link: http://lkml.kernel.org/r/152354470916.22460.14397070748001974638.stgit@localhost.localdomain
    Fixes: 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
    Signed-off-by: Kirill Tkhai
    Acked-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Kirill Tkhai
     

25 Jul, 2018

1 commit

  • commit 9f15bde671355c351cf20d9f879004b234353100 upstream.

    It was reported that a kernel crash happened in mem_cgroup_iter(), which
    can be triggered if the legacy cgroup-v1 non-hierarchical mode is used.

    Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b8f
    ......
    Call trace:
    mem_cgroup_iter+0x2e0/0x6d4
    shrink_zone+0x8c/0x324
    balance_pgdat+0x450/0x640
    kswapd+0x130/0x4b8
    kthread+0xe8/0xfc
    ret_from_fork+0x10/0x20

    The crash happens at the css_tryget(css) call in mem_cgroup_iter(): the
    iterator dereferences the memcg cached in iter->position, which has been
    freed before and filled with POISON_FREE (0x6b).

    The root cause of the use-after-free is that invalidate_reclaim_iterators()
    fails to reset iter->position to NULL when the css of the memcg is released
    in non-hierarchical mode.
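
    A simplified sketch of what this implies (per-node iterator layout
    abbreviated): in non-hierarchical mode each memcg is its own reclaim
    root, so the stale iter->position lives in the dying memcg's own
    iterators and the invalidation walk has to start at the dying memcg
    itself, not at its parent.

    static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
    {
            struct mem_cgroup *memcg = dead_memcg;
            int nid, i;

            /* walk the dying memcg and all of its ancestors */
            for (; memcg; memcg = parent_mem_cgroup(memcg)) {
                    for_each_node(nid) {
                            struct mem_cgroup_per_node *mz = memcg->nodeinfo[nid];

                            for (i = 0; i <= DEF_PRIORITY; i++)
                                    /* drop a stale cached position, if any */
                                    cmpxchg(&mz->iter[i].position, dead_memcg, NULL);
                    }
            }
    }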

    Link: http://lkml.kernel.org/r/1531994807-25639-1-git-send-email-jing.xia@unisoc.com
    Fixes: 6df38689e0e9 ("mm: memcontrol: fix possible memcg leak due to interrupted reclaim")
    Signed-off-by: Jing Xia
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jing Xia
     

21 Jun, 2018

1 commit

  • [ Upstream commit c892fd82cc0632d425ae011a4dd75eb59e9f84ee ]

    Under heavy memory pressure, a page allocation with __GFP_NOWAIT fails
    easily even though it is an order-0 request. I got the warning below 9
    times during a normal boot.

    : page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
    .. snip ..
    Call trace:
    dump_backtrace+0x0/0x4
    dump_stack+0xa4/0xc0
    warn_alloc+0xd4/0x15c
    __alloc_pages_nodemask+0xf88/0x10fc
    alloc_slab_page+0x40/0x18c
    new_slab+0x2b8/0x2e0
    ___slab_alloc+0x25c/0x464
    __kmalloc+0x394/0x498
    memcg_kmem_get_cache+0x114/0x2b8
    kmem_cache_alloc+0x98/0x3e8
    mmap_region+0x3bc/0x8c0
    do_mmap+0x40c/0x43c
    vm_mmap_pgoff+0x15c/0x1e4
    sys_mmap+0xb0/0xc8
    el0_svc_naked+0x24/0x28
    Mem-Info:
    active_anon:17124 inactive_anon:193 isolated_anon:0
    active_file:7898 inactive_file:712955 isolated_file:55
    unevictable:0 dirty:27 writeback:18 unstable:0
    slab_reclaimable:12250 slab_unreclaimable:23334
    mapped:19310 shmem:212 pagetables:816 bounce:0
    free:36561 free_pcp:1205 free_cma:35615
    Node 0 active_anon:68496kB inactive_anon:772kB active_file:31592kB inactive_file:2851820kB unevictable:0kB isolated(anon):0kB isolated(file):220kB mapped:77240kB dirty:108kB writeback:72kB shmem:848kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
    DMA free:142188kB min:3056kB low:3820kB high:4584kB active_anon:10052kB inactive_anon:12kB active_file:312kB inactive_file:1412620kB unevictable:0kB writepending:0kB present:1781412kB managed:1604728kB mlocked:0kB slab_reclaimable:3592kB slab_unreclaimable:876kB kernel_stack:400kB pagetables:52kB bounce:0kB free_pcp:1436kB local_pcp:124kB free_cma:142492kB
    lowmem_reserve[]: 0 1842 1842
    Normal free:4056kB min:4172kB low:5212kB high:6252kB active_anon:58376kB inactive_anon:760kB active_file:31348kB inactive_file:1439040kB unevictable:0kB writepending:180kB present:2000636kB managed:1923688kB mlocked:0kB slab_reclaimable:45408kB slab_unreclaimable:92460kB kernel_stack:9680kB pagetables:3212kB bounce:0kB free_pcp:3392kB local_pcp:688kB free_cma:0kB
    lowmem_reserve[]: 0 0 0
    DMA: 0*4kB 0*8kB 1*16kB (C) 0*32kB 0*64kB 0*128kB 1*256kB (C) 1*512kB (C) 0*1024kB 1*2048kB (C) 34*4096kB (C) = 142096kB
    Normal: 228*4kB (UMEH) 172*8kB (UMH) 23*16kB (UH) 24*32kB (H) 5*64kB (H) 1*128kB (H) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3872kB
    721350 total pagecache pages
    0 pages in swap cache
    Swap cache stats: add 0, delete 0, find 0/0
    Free swap = 0kB
    Total swap = 0kB
    945512 pages RAM
    0 pages HighMem/MovableOnly
    63408 pages reserved
    51200 pages cma reserved

    __memcg_schedule_kmem_cache_create() tries to create a shadow slab cache
    and the worker allocation failure is not really critical because we will
    retry on the next kmem charge. We might miss some charges but that
    shouldn't be critical. The excessive allocation failure report is not
    very helpful.
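
    The usual remedy for a best-effort allocation that is retried later is
    __GFP_NOWARN. A sketch of the allocation site with that annotation
    (the worker setup is shown schematically):

    static void __memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
                                                   struct kmem_cache *cachep)
    {
            struct memcg_kmem_cache_create_work *cw;

            /* best effort: silence the failure report, we simply retry on
             * the next kmem charge */
            cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
            if (!cw)
                    return;

            cw->memcg = memcg;
            cw->cachep = cachep;
            INIT_WORK(&cw->work, memcg_kmem_cache_create_func);
            queue_work(memcg_kmem_cache_wq, &cw->work);
    }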

    [mhocko@kernel.org: changelog update]
    Link: http://lkml.kernel.org/r/20180418022912.248417-1-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Minchan Kim
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Minchan Kim
     

16 May, 2018

1 commit

  • commit 4eaf431f6f71bbed40a4c733ffe93a7e8cedf9d9 upstream.

    syzbot has triggered a NULL ptr dereference when allocation fault
    injection enforces a failure and alloc_mem_cgroup_per_node_info()
    initializes memcg->nodeinfo only halfway through.

    But __mem_cgroup_free() still tries to free all per-node data and
    dereferences pn->lruvec_stat_cpu unconditionally even if the specific
    per-node data hasn't been initialized.

    The bug is quite unlikely to hit because small allocations do not fail
    and we would need quite some numa nodes to make struct
    mem_cgroup_per_node large enough to cross the costly order.
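
    The straightforward hardening is a NULL check on the per-node pointer
    before touching it; a sketch of the free path:

    static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
    {
            struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];

            /* the allocation may have failed before this node was set up */
            if (!pn)
                    return;

            free_percpu(pn->lruvec_stat_cpu);
            kfree(pn);
    }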

    Link: http://lkml.kernel.org/r/20180406100906.17790-1-mhocko@kernel.org
    Reported-by: syzbot+8a5de3cce7cdc70e9ebe@syzkaller.appspotmail.com
    Fixes: 00f3ca2c2d66 ("mm: memcontrol: per-lruvec stats infrastructure")
    Signed-off-by: Michal Hocko
    Reviewed-by: Andrey Ryabinin
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

13 Feb, 2018

1 commit

  • [ Upstream commit edbe69ef2c90fc86998a74b08319a01c508bd497 ]

    This patch effectively reverts commit 9f1c2674b328 ("net: memcontrol:
    defer call to mem_cgroup_sk_alloc()").

    Moving mem_cgroup_sk_alloc() to the inet_csk_accept() completely breaks
    memcg socket memory accounting, as packets received before memcg
    pointer initialization are not accounted and are causing refcounting
    underflow on socket release.

    Actually the free-after-use problem was fixed by
    commit c0576e397508 ("net: call cgroup_sk_alloc() earlier in
    sk_clone_lock()") for the cgroup pointer.

    So, let's revert it and call mem_cgroup_sk_alloc() just before
    cgroup_sk_alloc(). This is safe, as we hold a reference to the socket
    we're cloning, and it holds a reference to the memcg.

    Also, let's drop the BUG_ON(mem_cgroup_is_root()) check from
    mem_cgroup_sk_alloc(). Bumping the root memcg counter is not a good
    reason to panic, and there is no realistic way to hit it.

    Signed-off-by: Roman Gushchin
    Cc: Eric Dumazet
    Cc: David S. Miller
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     

05 Dec, 2017

1 commit

  • commit d08afa149acfd00871484ada6dabc3880524cd1c upstream.

    Commit d6810d730022 ("memcg, THP, swap: make mem_cgroup_swapout()
    support THP") changed mem_cgroup_swapout() to support transparent huge
    page (THP).

    However the patch missed one location which should be changed for
    correctly handling THPs. The resulting bug will cause the memory
    cgroups whose THPs were swapped out to become zombies on deletion.
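
    The general shape of the fix: every per-swap-entry operation in
    mem_cgroup_swapout() has to scale with the number of entries backing
    the page. An illustrative fragment (the wrapper name
    memcg_swapout_put_refs is made up for the sketch):

    /* a THP occupies hpage_nr_pages(page) swap slots, so the memcg ID
     * references taken per swap entry must be dropped the same number of
     * times, not just once */
    static void memcg_swapout_put_refs(struct mem_cgroup *memcg, struct page *page)
    {
            unsigned int nr_entries = hpage_nr_pages(page);

            mem_cgroup_id_put_many(memcg, nr_entries);
    }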

    Link: http://lkml.kernel.org/r/20171128161941.20931-1-shakeelb@google.com
    Fixes: d6810d730022 ("memcg, THP, swap: make mem_cgroup_swapout() support THP")
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Huang Ying
    Cc: Vladimir Davydov
    Cc: Greg Thelen
    Signed-off-by: Greg Kroah-Hartman

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

10 Oct, 2017

1 commit

  • Instead of calling mem_cgroup_sk_alloc() from BH context,
    it is better to call it from inet_csk_accept() in process context.

    Not only does this remove code from mem_cgroup_sk_alloc(), it also
    fixes a bug: the listener might have been dismantled already, and
    css_get() might cause a use-after-free.

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Signed-off-by: Eric Dumazet
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Oct, 2017

2 commits

    Fix for 4.14: zone device pages always have an elevated refcount of one,
    and thus the page count sanity check in uncharge_page() is inappropriate
    for them.
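
    In practice the sanity check has to tolerate the permanent reference a
    ZONE_DEVICE page carries; a sketch of the adjusted assertion:

    static void uncharge_page(struct page *page, struct uncharge_gather *ug)
    {
            /* a non-zero refcount is only suspicious for ordinary pages;
             * zone device pages always keep one reference */
            VM_BUG_ON_PAGE(page_count(page) && !is_zone_device_page(page), page);

            /* ... the uncharge bookkeeping itself is unchanged ... */
    }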

    [mhocko@suse.com: nano-optimize VM_BUG_ON in uncharge_page]
    Link: http://lkml.kernel.org/r/20170914190011.5217-1-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Michal Hocko
    Reported-by: Evgeny Baskakov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • The following lockdep splat has been noticed during LTP testing

    ======================================================
    WARNING: possible circular locking dependency detected
    4.13.0-rc3-next-20170807 #12 Not tainted
    ------------------------------------------------------
    a.out/4771 is trying to acquire lock:
    (cpu_hotplug_lock.rw_sem){++++++}, at: [] drain_all_stock.part.35+0x18/0x140

    but task is already holding lock:
    (&mm->mmap_sem){++++++}, at: [] __do_page_fault+0x175/0x530

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #3 (&mm->mmap_sem){++++++}:
    lock_acquire+0xc9/0x230
    __might_fault+0x70/0xa0
    _copy_to_user+0x23/0x70
    filldir+0xa7/0x110
    xfs_dir2_sf_getdents.isra.10+0x20c/0x2c0 [xfs]
    xfs_readdir+0x1fa/0x2c0 [xfs]
    xfs_file_readdir+0x30/0x40 [xfs]
    iterate_dir+0x17a/0x1a0
    SyS_getdents+0xb0/0x160
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    -> #2 (&type->i_mutex_dir_key#3){++++++}:
    lock_acquire+0xc9/0x230
    down_read+0x51/0xb0
    lookup_slow+0xde/0x210
    walk_component+0x160/0x250
    link_path_walk+0x1a6/0x610
    path_openat+0xe4/0xd50
    do_filp_open+0x91/0x100
    file_open_name+0xf5/0x130
    filp_open+0x33/0x50
    kernel_read_file_from_path+0x39/0x80
    _request_firmware+0x39f/0x880
    request_firmware_direct+0x37/0x50
    request_microcode_fw+0x64/0xe0
    reload_store+0xf7/0x180
    dev_attr_store+0x18/0x30
    sysfs_kf_write+0x44/0x60
    kernfs_fop_write+0x113/0x1a0
    __vfs_write+0x37/0x170
    vfs_write+0xc7/0x1c0
    SyS_write+0x58/0xc0
    do_syscall_64+0x6c/0x1f0
    return_from_SYSCALL_64+0x0/0x7a

    -> #1 (microcode_mutex){+.+.+.}:
    lock_acquire+0xc9/0x230
    __mutex_lock+0x88/0x960
    mutex_lock_nested+0x1b/0x20
    microcode_init+0xbb/0x208
    do_one_initcall+0x51/0x1a9
    kernel_init_freeable+0x208/0x2a7
    kernel_init+0xe/0x104
    ret_from_fork+0x2a/0x40

    -> #0 (cpu_hotplug_lock.rw_sem){++++++}:
    __lock_acquire+0x153c/0x1550
    lock_acquire+0xc9/0x230
    cpus_read_lock+0x4b/0x90
    drain_all_stock.part.35+0x18/0x140
    try_charge+0x3ab/0x6e0
    mem_cgroup_try_charge+0x7f/0x2c0
    shmem_getpage_gfp+0x25f/0x1050
    shmem_fault+0x96/0x200
    __do_fault+0x1e/0xa0
    __handle_mm_fault+0x9c3/0xe00
    handle_mm_fault+0x16e/0x380
    __do_page_fault+0x24a/0x530
    do_page_fault+0x30/0x80
    page_fault+0x28/0x30

    other info that might help us debug this:

    Chain exists of:
    cpu_hotplug_lock.rw_sem --> &type->i_mutex_dir_key#3 --> &mm->mmap_sem

    Possible unsafe locking scenario:

    CPU0                               CPU1
    ----                               ----
    lock(&mm->mmap_sem);
                                       lock(&type->i_mutex_dir_key#3);
                                       lock(&mm->mmap_sem);
    lock(cpu_hotplug_lock.rw_sem);

    *** DEADLOCK ***

    2 locks held by a.out/4771:
    #0: (&mm->mmap_sem){++++++}, at: [] __do_page_fault+0x175/0x530
    #1: (percpu_charge_mutex){+.+...}, at: [] try_charge+0x397/0x6e0

    The problem is very similar to the one fixed by commit a459eeb7b852
    ("mm, page_alloc: do not depend on cpu hotplug locks inside the
    allocator"). We are taking hotplug locks while we can be sitting on top
    of basically arbitrary locks. This just calls for problems.

    We can get rid of {get,put}_online_cpus, fortunately. We do not have to
    be worried about races with memory hotplug because drain_local_stock,
    which is called from both the WQ draining and the memory hotplug
    contexts, is always operating on the local cpu stock with IRQs disabled.

    The only thing to be careful about is that the target memcg doesn't
    vanish while we are still in drain_all_stock, so take a reference on it.
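
    A condensed sketch of drain_all_stock() along those lines: no
    {get,put}_online_cpus(), and each cached memcg is pinned with
    css_tryget() before it is inspected (comments trimmed):

    static void drain_all_stock(struct mem_cgroup *root_memcg)
    {
            int cpu, curcpu;

            if (!mutex_trylock(&percpu_charge_mutex))
                    return;

            curcpu = get_cpu();
            for_each_online_cpu(cpu) {
                    struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
                    struct mem_cgroup *memcg = stock->cached;

                    /* pin the cached memcg so it cannot vanish under us */
                    if (!memcg || !stock->nr_pages || !css_tryget(&memcg->css))
                            continue;

                    if (mem_cgroup_is_descendant(memcg, root_memcg) &&
                        !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
                            if (cpu == curcpu)
                                    drain_local_stock(&stock->work);
                            else
                                    schedule_work_on(cpu, &stock->work);
                    }
                    css_put(&memcg->css);
            }
            put_cpu();
            mutex_unlock(&percpu_charge_mutex);
    }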

    Link: http://lkml.kernel.org/r/20170913090023.28322-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Artem Savkov
    Tested-by: Artem Savkov
    Cc: Johannes Weiner
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

09 Sep, 2017

6 commits

    Cache the rightmost node of the soft-limit tree such that we can
    optimize __mem_cgroup_largest_soft_limit_node(). The only overhead is
    the extra footprint for the cached pointer, but this should not be an
    issue for mem_cgroup_tree_per_node.
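
    The idea as a sketch (the field name rb_rightmost is illustrative):
    keep a cached pointer to the rightmost node, i.e. the memcg with the
    largest soft-limit excess, maintained on insert/erase, so the lookup
    becomes O(1):

    struct mem_cgroup_tree_per_node {
            struct rb_root rb_root;
            struct rb_node *rb_rightmost;   /* cached largest-excess node */
            spinlock_t lock;
    };

    static struct mem_cgroup_per_node *
    __mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
    {
            if (!mctz->rb_rightmost)
                    return NULL;    /* tree is empty */

            return rb_entry(mctz->rb_rightmost,
                            struct mem_cgroup_per_node, tree_node);
    }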

    [dave@stgolabs.net: brain fart #2]
    Link: http://lkml.kernel.org/r/20170731160114.GE21328@linux-80c1.suse
    Link: http://lkml.kernel.org/r/20170719014603.19029-17-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
    We've noticed quite a noticeable performance overhead on some hosts with
    significant network traffic when socket memory accounting is enabled.

    Perf top shows that the socket memory uncharge path is hot:
    2.13% [kernel] [k] page_counter_cancel
    1.14% [kernel] [k] __sk_mem_reduce_allocated
    1.14% [kernel] [k] _raw_spin_lock
    0.87% [kernel] [k] _raw_spin_lock_irqsave
    0.84% [kernel] [k] tcp_ack
    0.84% [kernel] [k] ixgbe_poll
    0.83% < workload >
    0.82% [kernel] [k] enqueue_entity
    0.68% [kernel] [k] __fget
    0.68% [kernel] [k] tcp_delack_timer_handler
    0.67% [kernel] [k] __schedule
    0.60% < workload >
    0.59% [kernel] [k] __inet6_lookup_established
    0.55% [kernel] [k] __switch_to
    0.55% [kernel] [k] menu_select
    0.54% libc-2.20.so [.] __memcpy_avx_unaligned

    To address this issue, the existing per-cpu stock infrastructure can be
    used.

    refill_stock() can be called from mem_cgroup_uncharge_skmem() to move
    charge to a per-cpu stock instead of calling atomic
    page_counter_uncharge().

    To prevent the uncontrolled growth of per-cpu stocks, refill_stock()
    will explicitly drain the cached charge, if the cached value exceeds
    CHARGE_BATCH.
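
    A sketch of the two sides of that change (cgroup-v1 tcpmem handling and
    some locking details omitted):

    static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
    {
            struct memcg_stock_pcp *stock;
            unsigned long flags;

            local_irq_save(flags);

            stock = this_cpu_ptr(&memcg_stock);
            if (stock->cached != memcg) {
                    drain_stock(stock);
                    stock->cached = memcg;
            }
            stock->nr_pages += nr_pages;

            /* keep the per-cpu stocks bounded */
            if (stock->nr_pages > CHARGE_BATCH)
                    drain_stock(stock);

            local_irq_restore(flags);
    }

    void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
    {
            mod_memcg_state(memcg, MEMCG_SOCK, -nr_pages);

            /* return pages to the cheap per-cpu stock instead of hitting
             * the atomic page counter */
            refill_stock(memcg, nr_pages);
    }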

    This allows us to significantly optimize the load:
    1.21% [kernel] [k] _raw_spin_lock
    1.01% [kernel] [k] ixgbe_poll
    0.92% [kernel] [k] _raw_spin_lock_irqsave
    0.90% [kernel] [k] enqueue_entity
    0.86% [kernel] [k] tcp_ack
    0.85% < workload >
    0.74% perf-11120.map [.] 0x000000000061bf24
    0.73% [kernel] [k] __schedule
    0.67% [kernel] [k] __fget
    0.63% [kernel] [k] __inet6_lookup_established
    0.62% [kernel] [k] menu_select
    0.59% < workload >
    0.59% [kernel] [k] __switch_to
    0.57% libc-2.20.so [.] __memcpy_avx_unaligned

    Link: http://lkml.kernel.org/r/20170829100150.4580-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
    Platforms with an advanced system bus (like CAPI or CCIX) allow device
    memory to be accessible from the CPU in a cache-coherent fashion. Add a
    new type of ZONE_DEVICE to represent such memory. The use cases are the
    same as for un-addressable device memory, but without all the corner
    cases.

    Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Aneesh Kumar
    Cc: Paul E. McKenney
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Balbir Singh
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
    HMM pages (private or public device pages) are ZONE_DEVICE pages and
    thus need special handling when it comes to lru or refcount. This patch
    makes sure that memcontrol handles them properly when it faces them.
    These pages are used like regular pages in a process address space,
    either as anonymous pages or as file-backed pages, so from the memcg
    point of view we want to handle them like regular pages, for now at
    least.

    Link: http://lkml.kernel.org/r/20170817000548.32038-11-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Balbir Singh
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Aneesh Kumar
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
    HMM pages (private or public device pages) are ZONE_DEVICE pages and
    thus the page->lru fields of those pages cannot be used. This patch
    rearranges the uncharge path to allow a single page to be uncharged
    without modifying the lru field of the struct page.

    There is no change to the memcontrol logic; it is the same as it was
    before this patch.

    Link: http://lkml.kernel.org/r/20170817000548.32038-10-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • When THP migration is being used, memory management code needs to handle
    pmd migration entries properly. This patch uses !pmd_present() or
    is_swap_pmd() (depending on whether pmd_none() needs separate code or
    not) to check pmd migration entries at the places where a pmd entry is
    present.

    Since pmd-related code uses split_huge_page(), split_huge_pmd(),
    pmd_trans_huge(), pmd_trans_unstable(), or
    pmd_none_or_trans_huge_or_clear_bad(), this patch:

    1. adds pmd migration entry split code in split_huge_pmd(),

    2. takes care of pmd migration entries whenever pmd_trans_huge() is present,

    3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.

    Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
    is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
    them.

    Until this commit, a pmd entry should be:
    1. pointing to a pte page,
    2. is_swap_pmd(),
    3. pmd_trans_huge(),
    4. pmd_devmap(), or
    5. pmd_none().
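
    For reference, the check named in item 2 boils down to "populated but
    not present", which is exactly what a pmd migration entry looks like; a
    sketch of such a helper:

    static inline int is_swap_pmd(pmd_t pmd)
    {
            /* a swap/migration pmd is neither empty nor present */
            return !pmd_none(pmd) && !pmd_present(pmd);
    }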

    Signed-off-by: Zi Yan
    Cc: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan
     

07 Sep, 2017

7 commits

  • Pull cgroup updates from Tejun Heo:
    "Several notable changes this cycle:

    - Thread mode was merged. This will be used for cgroup2 support for
    CPU and possibly other controllers. Unfortunately, CPU controller
    cgroup2 support didn't make this pull request but most contentions
    have been resolved and the support is likely to be merged before
    the next merge window.

    - cgroup.stat now shows the number of descendant cgroups.

    - cpuset now can enable the easier-to-configure v2 behavior on v1
    hierarchy"

    * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
    cpuset: Allow v2 behavior in v1 cgroup
    cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup
    cgroup: remove unneeded checks
    cgroup: misc changes
    cgroup: short-circuit cset_cgroup_from_root() on the default hierarchy
    cgroup: re-use the parent pointer in cgroup_destroy_locked()
    cgroup: add cgroup.stat interface with basic hierarchy stats
    cgroup: implement hierarchy limits
    cgroup: keep track of number of descent cgroups
    cgroup: add comment to cgroup_enable_threaded()
    cgroup: remove unnecessary empty check when enabling threaded mode
    cgroup: update debug controller to print out thread mode information
    cgroup: implement cgroup v2 thread support
    cgroup: implement CSS_TASK_ITER_THREADED
    cgroup: introduce cgroup->dom_cgrp and threaded css_set handling
    cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS
    cgroup: reorganize cgroup.procs / task write path
    cgroup: replace css_set walking populated test with testing cgrp->nr_populated_csets
    cgroup: distinguish local and children populated states
    cgroup: remove now unused list_head @pending in cgroup_apply_cftypes()
    ...

    Linus Torvalds
     
    TIF_MEMDIE is set only on tasks which were either directly selected by
    the OOM killer or passed through mark_oom_victim from the allocator
    path. tsk_is_oom_victim is more generic and allows identifying all
    tasks (threads) which share the mm with the oom victim.

    Please note that the freezer still needs to check TIF_MEMDIE because we
    cannot thaw tasks which do not participate in oom_victims counting,
    otherwise a !TIF_MEMDIE task could interfere after oom_disable returns.
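
    For reference, the helper is roughly of this shape (it looks at the
    oom_mm marker on the shared signal struct, so every thread of the
    victim is covered, not only the one carrying TIF_MEMDIE):

    static inline bool tsk_is_oom_victim(struct task_struct *tsk)
    {
            return tsk->signal->oom_mm;
    }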

    Link: http://lkml.kernel.org/r/20170810075019.28998-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    This patch makes mem_cgroup_swapout() work for transparent huge pages
    (THP), moving the memory cgroup charge from memory to swap for a whole
    THP.

    This will be used for the THP swap support. Where a THP may be swapped
    out as a whole to a set of (HPAGE_PMD_NR) continuous swap slots on the
    swap device.

    Link: http://lkml.kernel.org/r/20170724051840.2309-11-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Dan Williams
    Cc: Hugh Dickins
    Cc: Jens Axboe
    Cc: Rik van Riel
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Shaohua Li
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    For a THP (Transparent Huge Page), tail_page->mem_cgroup is NULL. So to
    check whether the page is charged already, we need to check the head
    page. This was not an issue before because it was impossible for a THP
    to be in the swap cache. But after we add support for delaying THP
    splitting until after swap-out, it becomes possible.
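
    In code this is the difference between looking at page->mem_cgroup and
    looking at compound_head(page)->mem_cgroup; a tiny illustrative helper
    (the name is made up for the sketch):

    /* charge state of any sub-page of a THP is recorded on the head page */
    static inline bool page_is_charged(struct page *page)
    {
            return compound_head(page)->mem_cgroup != NULL;
    }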

    Link: http://lkml.kernel.org/r/20170724051840.2309-10-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Dan Williams
    Cc: Hugh Dickins
    Cc: Jens Axboe
    Cc: Rik van Riel
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Shaohua Li
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    A PTE-mapped THP (Transparent Huge Page) is ignored when moving memory
    cgroup charges. But for a THP in the swap cache, the current
    implementation may move the memory cgroup charge for the swap entry of
    a tail page. That isn't correct, because the swap charge for all
    sub-pages of a THP should be moved together. Following the handling of
    PTE-mapped THPs, charge moving for the swap entry of a tail page of a
    THP is now ignored as well.

    Link: http://lkml.kernel.org/r/20170724051840.2309-9-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Dan Williams
    Cc: Hugh Dickins
    Cc: Jens Axboe
    Cc: Rik van Riel
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Shaohua Li
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Several functions use an enum type as parameter for an event/state, but
    are called in some locations with an argument of a different enum type.
    Adjust the interface of these functions to reality by changing the
    parameter to int.

    This fixes a ton of enum-conversion warnings that are generated when
    building the kernel with clang.

    [mka@chromium.org: also change parameter type of inc/dec/mod_memcg_page_state()]
    Link: http://lkml.kernel.org/r/20170728213442.93823-1-mka@chromium.org
    Link: http://lkml.kernel.org/r/20170727211004.34435-1-mka@chromium.org
    Signed-off-by: Matthias Kaehlcke
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Doug Anderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthias Kaehlcke
     
    A removed memory cgroup with a defined memory.low and some pagecache
    belonging to it has a very low chance of being freed.

    If a cgroup has been removed, there is likely no memory pressure inside
    the cgroup, and the pagecache is protected from external pressure by
    the defined low limit. The cgroup will be freed only after all of its
    pages have been reclaimed, and that will not happen while there is any
    other reclaimable memory in the system. That means there is a good
    chance that a cold pagecache will reside in memory for an undefined
    amount of time, wasting system resources.

    This problem was fixed earlier by fa06235b8eb0 ("cgroup: reset css on
    destruction"), but it's not the best way to do it, as we can't really
    reset all limits/counters during cgroup offlining.

    Link: http://lkml.kernel.org/r/20170727130428.28856-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

19 Aug, 2017

1 commit

  • Jaegeuk and Brad report a NULL pointer crash when writeback ending tries
    to update the memcg stats:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0
    IP: test_clear_page_writeback+0x12e/0x2c0
    [...]
    RIP: 0010:test_clear_page_writeback+0x12e/0x2c0
    Call Trace:

    end_page_writeback+0x47/0x70
    f2fs_write_end_io+0x76/0x180 [f2fs]
    bio_endio+0x9f/0x120
    blk_update_request+0xa8/0x2f0
    scsi_end_request+0x39/0x1d0
    scsi_io_completion+0x211/0x690
    scsi_finish_command+0xd9/0x120
    scsi_softirq_done+0x127/0x150
    __blk_mq_complete_request_remote+0x13/0x20
    flush_smp_call_function_queue+0x56/0x110
    generic_smp_call_function_single_interrupt+0x13/0x30
    smp_call_function_single_interrupt+0x27/0x40
    call_function_single_interrupt+0x89/0x90
    RIP: 0010:native_safe_halt+0x6/0x10

    (gdb) l *(test_clear_page_writeback+0x12e)
    0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619).
    614 mod_node_page_state(page_pgdat(page), idx, val);
    615 if (mem_cgroup_disabled() || !page->mem_cgroup)
    616 return;
    617 mod_memcg_state(page->mem_cgroup, idx, val);
    618 pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
    619 this_cpu_add(pn->lruvec_stat->count[idx], val);
    620 }
    621
    622 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
    623 gfp_t gfp_mask,

    The issue is that writeback doesn't hold a page reference and the page
    might get freed after PG_writeback is cleared (and the mapping is
    unlocked) in test_clear_page_writeback(). The stat functions looking up
    the page's node or zone are safe, as those attributes are static across
    allocation and free cycles. But page->mem_cgroup is not, and it will
    get cleared if we race with truncation or migration.

    It appears this race window has been around for a while, but it was less
    likely to trigger when the memcg stats were updated first thing after
    PG_writeback was cleared. Recent changes reshuffled this code to update
    the global node stats before the memcg ones, though, stretching the race
    window out to the point where people can reproduce the problem.

    Update test_clear_page_writeback() to look up and pin page->mem_cgroup
    before clearing PG_writeback, then not use that pointer afterward. It
    is a partial revert of 62cccb8c8e7a ("mm: simplify lock_page_memcg()")
    but leaves the pageref-holding callsites that aren't affected alone.
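
    A condensed sketch of that ordering, assuming lock_page_memcg() is made
    to return the pinned memcg and a matching __unlock_page_memcg() takes
    it (stat updates simplified; the real function also adjusts node and
    writeback statistics):

    int test_clear_page_writeback(struct page *page)
    {
            struct mem_cgroup *memcg;
            int ret;

            /* pin page->mem_cgroup while the page is still ours */
            memcg = lock_page_memcg(page);

            ret = TestClearPageWriteback(page);
            /* from here on the page may be freed at any time */
            if (ret && memcg)
                    mod_memcg_state(memcg, NR_WRITEBACK, -1);

            /* unlock via the pinned memcg, not via the page */
            __unlock_page_memcg(memcg);
            return ret;
    }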

    Link: http://lkml.kernel.org/r/20170809183825.GA26387@cmpxchg.org
    Fixes: 62cccb8c8e7a ("mm: simplify lock_page_memcg()")
    Signed-off-by: Johannes Weiner
    Reported-by: Jaegeuk Kim
    Tested-by: Jaegeuk Kim
    Reported-by: Bradley Bolen
    Tested-by: Brad Bolen
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Cc: [4.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

21 Jul, 2017

1 commit

  • css_task_iter currently always walks all tasks. With the scheduled
    cgroup v2 thread support, the iterator would need to handle multiple
    types of iteration. As a preparation, add @flags to
    css_task_iter_start() and implement CSS_TASK_ITER_PROCS. If the flag
    is not specified, it walks all tasks as before. When asserted, the
    iterator only walks the group leaders.

    For now, the only user of the flag is cgroup v2 "cgroup.procs" file
    which no longer needs to skip non-leader tasks in cgroup_procs_next().
    Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
    v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
    cgroup" but "list all thread group id's with any threads in the
    cgroup".

    While at it, update cgroup_procs_show() to use task_pid_vnr() instead
    of task_tgid_vnr(). As the iteration guarantees that the function
    only sees group leaders, this doesn't change the output and will allow
    sharing the function for thread iteration.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

11 Jul, 2017

2 commits

  • Alice has reported the following UBSAN splat:

    UBSAN: Undefined behaviour in mm/memcontrol.c:661:17
    signed integer overflow:
    -2147483644 - 2147483525 cannot be represented in type 'long int'
    CPU: 1 PID: 11758 Comm: mybibtex2filena Tainted: P O 4.9.25-gentoo #4
    Hardware name: XXXXXX, BIOS YYYYYY
    Call Trace:
    dump_stack+0x59/0x87
    ubsan_epilogue+0xe/0x40
    handle_overflow+0xbb/0xf0
    __ubsan_handle_sub_overflow+0x12/0x20
    memcg_check_events.isra.36+0x223/0x360
    mem_cgroup_commit_charge+0x55/0x140
    wp_page_copy+0x34e/0xb80
    do_wp_page+0x1e6/0x1300
    handle_mm_fault+0x88b/0x1990
    __do_page_fault+0x2de/0x8a0
    do_page_fault+0x1a/0x20
    error_code+0x67/0x6c

    The reason is that we subtract two signed values. Let's fix this by
    truly mimicking time_after() and casting the result of the subtraction.
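
    The time_after() idiom does the subtraction on unsigned values and only
    then interprets the sign, which is well defined; schematically (the
    helper name is illustrative):

    /* true once "val" has passed "next", without signed-overflow UB */
    static inline bool threshold_passed(unsigned long val, unsigned long next)
    {
            return (long)(next - val) < 0;
    }

    The threshold check in mem_cgroup_event_ratelimit() takes this form
    after the fix.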

    Link: http://lkml.kernel.org/r/20170616150057.GQ30580@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Alice Ferrazzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Make @root exclusive in mem_cgroup_low; it is never considered low when
    looked at directly and is not checked when traversing the tree. In
    effect, @root is handled identically to how root_mem_cgroup was
    previously handled by mem_cgroup_low.

    If @root is not excluded from the checks, a cgroup underneath @root will
    never be considered low during targeted reclaim of @root, e.g. due to
    memory.current > memory.high, unless @root is misconfigured to have
    memory.low > memory.high.

    Excluding @root enables using memory.low to prioritize memory usage
    between cgroups within a subtree of the hierarchy that is limited by
    memory.high or memory.max, e.g. when ROOT owns @root's controls but
    delegates the @root directory to a USER so that USER can create and
    administer children of @root.

    For example, given cgroup A with children B and C:

         A
        / \
       B   C

    and

    1. A/memory.current > A/memory.high
    2. A/B/memory.current < A/B/memory.low
    3. A/C/memory.current >= A/C/memory.low

    As 'A' is high, i.e. triggers reclaim from 'A', and 'B' is low, we
    should reclaim from 'C' until 'A' is no longer high or until we can no
    longer reclaim from 'C'. If 'A', i.e. @root, isn't excluded by
    mem_cgroup_low when reclaiming from 'A', then 'B' won't be considered low
    and we will reclaim indiscriminately from both 'B' and 'C'.
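
    A sketch of mem_cgroup_low() with @root excluded, reconstructed from
    the description above:

    bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
    {
            if (mem_cgroup_disabled())
                    return false;

            if (!root)
                    root = root_mem_cgroup;

            /* @root itself is never low and is not checked while walking up */
            if (memcg == root)
                    return false;

            for (; memcg != root; memcg = parent_mem_cgroup(memcg)) {
                    if (page_counter_read(&memcg->memory) >= memcg->low)
                            return false;
            }

            return true;
    }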

    Here is the test I used to confirm the bug and the patch.

    20:00:55@sjchrist-vm ? ~ $ cat ~/.bin/memcg_low_test
    #!/bin/bash

    x62mb=$((62<<20))
    x66mb=$((66<<20))
    x94mb=$((94<<20))
    x98mb=$((98<<20))

    setup() {
    set -e

    trap teardown EXIT HUP INT TERM

    if [[ ! -e /mnt/1gb.swap ]]; then
    sudo fallocate -l 1G /mnt/1gb.swap > /dev/null
    sudo mkswap /mnt/1gb.swap > /dev/null
    fi
    if ! swapon --show=NAME | grep -q "/mnt/1gb.swap"; then
    sudo swapon /mnt/1gb.swap
    fi

    if [[ ! -e /cgroup/cgroup.controllers ]]; then
    sudo mount -t cgroup2 none /cgroup
    fi

    grep -q memory /cgroup/cgroup.controllers

    sudo sh -c "echo '+memory' > /cgroup/cgroup.subtree_control"

    sudo mkdir /cgroup/A && sudo chown $USER:$USER /cgroup/A
    sudo sh -c "echo '+memory' > /cgroup/A/cgroup.subtree_control"
    sudo sh -c "echo '96m' > /cgroup/A/memory.high"

    mkdir /cgroup/A/0
    mkdir /cgroup/A/1

    echo 64m > /cgroup/A/0/memory.low
    }

    teardown() {
    set +e

    trap - EXIT HUP INT TERM

    if [[ -z $1 ]]; then
    printf "\n"
    printf "%0.s*" {1..35}
    printf "\nFAILED!\n\n"
    tail /cgroup/A/**/memory.current
    printf "%0.s*" {1..35}
    printf "\n\n"
    fi

    ps | grep stress | tr -s ' ' | cut -f 2 -d ' ' | xargs -I % kill %

    sleep 2

    if [[ -e /cgroup/A/0 ]]; then
    rmdir /cgroup/A/0
    fi
    if [[ -e /cgroup/A/1 ]]; then
    rmdir /cgroup/A/1
    fi
    if [[ -e /cgroup/A ]]; then
    sudo rmdir /cgroup/A
    fi
    }

    stress_test() {
    sudo sh -c "echo $$ > /cgroup/A/$1/cgroup.procs"
    stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null &

    sudo sh -c "echo $$ > /cgroup/A/$2/cgroup.procs"
    stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null &

    sudo sh -c "echo $$ > /cgroup/cgroup.procs"

    sleep 1

    # A/0 should be consuming more memory than A/1
    [[ $(cat /cgroup/A/0/memory.current) -ge $(cat /cgroup/A/1/memory.current) ]]

    # A/0 should be consuming ~64mb
    [[ $(cat /cgroup/A/0/memory.current) -ge $x62mb ]] && [[ $(cat /cgroup/A/0/memory.current) -le $x66mb ]]

    # A should cumulatively be consuming ~96mb
    [[ $(cat /cgroup/A/memory.current) -ge $x94mb ]] && [[ $(cat /cgroup/A/memory.current) -le $x98mb ]]

    # Stop the stressors
    ps | grep stress | tr -s ' ' | cut -f 2 -d ' ' | xargs -I % kill %
    }

    teardown 1
    setup

    for ((i=1;i
    Acked-by: Vladimir Davydov
    Acked-by: Balbir Singh
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sean Christopherson
     

07 Jul, 2017

5 commits

  • lruvecs are at the intersection of the NUMA node and memcg, which is the
    scope for most paging activity.

    Introduce a convenient accounting infrastructure that maintains
    statistics per node, per memcg, and the lruvec itself.

    Then convert over accounting sites for statistics that are already
    tracked in both nodes and memcgs and can be easily switched.

    [hannes@cmpxchg.org: fix crash in the new cgroup stat keeping code]
    Link: http://lkml.kernel.org/r/20170531171450.GA10481@cmpxchg.org
    [hannes@cmpxchg.org: don't track uncharged pages at all]
    Link: http://lkml.kernel.org/r/20170605175254.GA8547@cmpxchg.org
    [hannes@cmpxchg.org: add missing free_percpu()]
    Link: http://lkml.kernel.org/r/20170605175354.GB8547@cmpxchg.org
    [linux@roeck-us.net: hexagon: fix build error caused by include file order]
    Link: http://lkml.kernel.org/r/20170617153721.GA4382@roeck-us.net
    Link: http://lkml.kernel.org/r/20170530181724.27197-6-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Guenter Roeck
    Acked-by: Vladimir Davydov
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Now that the slab counters are moved from the zone to the node level we
    can drop the private memcg node stats and use the official ones.

    Link: http://lkml.kernel.org/r/20170530181724.27197-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    Show the count of oom killer invocations in /proc/vmstat and the count of
    processes killed in a memory cgroup in the knob "memory.events" (in
    memory.oom_control for cgroup v1).

    Also describe the difference between "oom" and "oom_kill" in the memory
    cgroup documentation. Currently, oom in a memory cgroup kills tasks if and
    only if the shortage has happened inside a page fault.

    These counters help in monitoring oom kills; until now the only way was
    grepping for magic words in the kernel log.

    [akpm@linux-foundation.org: fix for mem_cgroup_count_vm_event() rename]
    [akpm@linux-foundation.org: fix comment, per Konstantin]
    Link: http://lkml.kernel.org/r/149570810989.203600.9492483715840752937.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Roman Guschin
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Track the following reclaim counters for every memory cgroup: PGREFILL,
    PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED.

    These values are exposed using the memory.stats interface of cgroup v2.

    The meaning of each value is the same as for global counters, available
    using /proc/vmstat.

    Also, for consistency, rename mem_cgroup_count_vm_event() to
    count_memcg_event_mm().

    Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com
    Signed-off-by: Roman Gushchin
    Suggested-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Patch series "THP swap: Delay splitting THP during swapping out", v11.

    This patchset is to optimize the performance of Transparent Huge Page
    (THP) swap.

    Recently, the performance of storage devices has improved so fast that
    we cannot saturate the disk bandwidth with a single logical CPU when
    doing page swap-out, even on a high-end server machine, because storage
    performance has improved faster than that of a single logical CPU, and
    it seems that this trend will not change in the near future. On the
    other hand, THP becomes more and more popular because of increased
    memory sizes. So it becomes necessary to optimize THP swap performance.

    The advantages of the THP swap support include:

    - Batch the swap operations for the THP to reduce lock
    acquiring/releasing, including allocating/freeing the swap space,
    adding/deleting to/from the swap cache, and writing/reading the swap
    space, etc. This will help improve the performance of the THP swap.

    - The THP swap space read/write will be 2M sequential IO. This is
    particularly helpful for swap reads, which are usually 4k random IO.
    This will improve the performance of THP swap too.

    - It will help with memory fragmentation, especially when THP is
    heavily used by applications. The 2M of continuous pages will be
    freed up after the THP is swapped out.

    - It will improve THP utilization on systems with swap turned on,
    because the speed at which khugepaged collapses normal pages into a
    THP is quite slow. After a THP is split during swap-out, it will take
    quite a long time for the normal pages to collapse back into a THP
    after being swapped in. High THP utilization also helps the efficiency
    of page-based memory management.

    There are some concerns regarding THP swap in, mainly because possible
    enlarged read/write IO size (for swap in/out) may put more overhead on
    the storage device. To deal with that, the THP swap in should be turned
    on only when necessary. For example, it can be selected via
    "always/never/madvise" logic, to be turned on globally, turned off
    globally, or turned on only for VMA with MADV_HUGEPAGE, etc.

    This patchset is the first step for the THP swap support. The plan is
    to delay splitting THP step by step, finally avoid splitting THP during
    the THP swapping out and swap out/in the THP as a whole.

    As the first step, in this patchset, the splitting huge page is delayed
    from almost the first step of swapping out to after allocating the swap
    space for the THP and adding the THP into the swap cache. This will
    reduce lock acquiring/releasing for the locks used for the swap cache
    management.

    With the patchset, the swap out throughput improves 15.5% (from about
    3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
    with 8 processes. The test is done on a Xeon E5 v3 system. The swap
    device used is a RAM simulated PMEM (persistent memory) device. To test
    the sequential swapping out, the test case creates 8 processes, which
    sequentially allocate and write to the anonymous pages until the RAM and
    part of the swap device is used up.

    This patch (of 5):

    In this patch, splitting huge page is delayed from almost the first step
    of swapping out to after allocating the swap space for the THP
    (Transparent Huge Page) and adding the THP into the swap cache. This
    will batch the corresponding operation, thus improve THP swap out
    throughput.

    This is the first step for the THP swap optimization. The plan is to
    delay splitting the THP step by step and avoid splitting the THP
    finally.

    In this patch, one swap cluster is used to hold the contents of each THP
    swapped out. So, the size of the swap cluster is changed to that of the
    THP (Transparent Huge Page) on x86_64 architecture (512). For other
    architectures which want such THP swap optimization,
    ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
    the architecture. In effect, this will enlarge the swap cluster size by
    2 times on x86_64, which may make it harder to find a free cluster when
    the swap space becomes fragmented. So this may reduce continuous swap
    space allocation and sequential writes in theory; the performance tests
    in 0day show no regressions caused by this.

    In the future of THP swap optimization, some information of the swapped
    out THP (such as compound map count) will be recorded in the
    swap_cluster_info data structure.

    The mem cgroup swap accounting functions are enhanced to support charge
    or uncharge a swap cluster backing a THP as a whole.
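
    As an illustration of the interface shape this implies (simplified and
    partly reconstructed; cgroup-v1 memsw handling and stat updates left
    out), the uncharge side releases all of the cluster's entries in one
    call:

    void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
    {
            struct mem_cgroup *memcg;
            unsigned short id;

            /* clear the ownership records for all nr_pages entries */
            id = swap_cgroup_record(entry, 0, nr_pages);

            rcu_read_lock();
            memcg = mem_cgroup_from_id(id);
            if (memcg) {
                    if (!mem_cgroup_is_root(memcg))
                            page_counter_uncharge(&memcg->swap, nr_pages);
                    mem_cgroup_id_put_many(memcg, nr_pages);
            }
            rcu_read_unlock();
    }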

    Swap cluster allocate/free functions are added to allocate/free a swap
    cluster for a THP. A fairly simple algorithm is used for swap cluster
    allocation: only the first swap device in the priority list is tried
    when allocating a swap cluster. If that fails, the caller falls back to
    allocating a single swap slot instead. This works well enough for
    normal cases. If the difference in the number of free swap clusters
    among multiple swap devices is significant, it is possible that some
    THPs are split earlier than necessary, for example because of a big
    size difference among multiple swap devices.

    The swap cache functions are enhanced to support adding/deleting a THP
    to/from the swap cache as a set of (HPAGE_PMD_NR) sub-pages. This may
    be enhanced in the future with a multi-order radix tree, but because we
    will split the THP soon during swap-out, that optimization doesn't make
    much sense for this first step.

    The THP splitting functions are enhanced to support splitting a THP in
    the swap cache during swap-out. The page lock will be held while
    allocating the swap cluster, adding the THP to the swap cache and
    splitting the THP. So in code paths other than swap-out, if the THP
    needs to be split, PageSwapCache(THP) will always be false.

    The swap cluster is only available for SSD, so the THP swap optimization
    in this patchset has no effect for HDD.

    [ying.huang@intel.com: fix two issues in THP optimize patch]
    Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
    [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
    Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Johannes Weiner
    Suggested-by: Andrew Morton [for config option]
    Acked-by: Kirill A. Shutemov [for changes in huge_memory.c and huge_mm.h]
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

20 Jun, 2017

2 commits

  • So I've noticed a number of instances where it was not obvious from the
    code whether ->task_list was for a wait-queue head or a wait-queue entry.

    Furthermore, there's a number of wait-queue users where the lists are
    not for 'tasks' but other entities (poll tables, etc.), in which case
    the 'task_list' name is actively confusing.

    To clear this all up, name the wait-queue head and entry list structure
    fields unambiguously:

    struct wait_queue_head::task_list => ::head
    struct wait_queue_entry::task_list => ::entry

    For example, this code:

    rqw->wait.task_list.next != &wait->task_list

    ... was pretty unclear (to me) as to what it's doing, while now it's written this way:

    rqw->wait.head.next != &wait->entry

    ... which makes it pretty clear that we are iterating a list until we see the head.

    Other examples are:

    list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
    list_for_each_entry(wq, &fence->wait.task_list, task_list) {

    ... where it's unclear (to me) what we are iterating, and during review it's
    hard to tell whether it's trying to walk a wait-queue entry (which would be
    a bug), while now it's written as:

    list_for_each_entry_safe(pos, next, &x->head, entry) {
    list_for_each_entry(wq, &fence->wait.head, entry) {

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

13 May, 2017

1 commit

    Laurent Dufour has noticed that hwpoisoned pages are kept charged. In
    his particular case he hit a bad_page("page still charged to cgroup")
    when onlining a hwpoison page. While this looks like something that
    shouldn't happen in the first place, because onlining hwpoison pages and
    returning them to the page allocator makes little sense, it shows a
    real problem.

    hwpoison pages do not get freed usually so we do not uncharge them (at
    least not since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge
    API")). Each charge pins memcg (since e8ea14cc6ead ("mm: memcontrol:
    take a css reference for each charged page")) as well and so the
    mem_cgroup and the associated state will never go away. Fix this leak
    by forcibly uncharging a LRU hwpoisoned page in delete_from_lru_cache().
    We also have to tweak uncharge_list because it cannot rely on zero ref
    count for these pages.
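
    A sketch of the delete_from_lru_cache() side of the fix (the comments
    paraphrase the reasoning above):

    static int delete_from_lru_cache(struct page *p)
    {
            if (!isolate_lru_page(p)) {
                    ClearPageActive(p);
                    ClearPageUnevictable(p);

                    /* a poisoned page may never drop its refcount to zero,
                     * so uncharge it from its memcg by hand */
                    mem_cgroup_uncharge(p);

                    /* drop the reference taken by isolate_lru_page() */
                    put_page(p);
                    return 0;
            }
            return -EIO;
    }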

    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API")
    Link: http://lkml.kernel.org/r/20170502185507.GB19165@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Laurent Dufour
    Tested-by: Laurent Dufour
    Reviewed-by: Balbir Singh
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

04 May, 2017

6 commits

    The memory controller's stat function names are awkwardly long and
    arbitrarily different from the zone and node stat functions.

    The current interface is named:

    mem_cgroup_read_stat()
    mem_cgroup_update_stat()
    mem_cgroup_inc_stat()
    mem_cgroup_dec_stat()
    mem_cgroup_update_page_stat()
    mem_cgroup_inc_page_stat()
    mem_cgroup_dec_page_stat()

    This patch renames it to match the corresponding node stat functions:

    memcg_page_state() [node_page_state()]
    mod_memcg_state() [mod_node_state()]
    inc_memcg_state() [inc_node_state()]
    dec_memcg_state() [dec_node_state()]
    mod_memcg_page_state() [mod_node_page_state()]
    inc_memcg_page_state() [inc_node_page_state()]
    dec_memcg_page_state() [dec_node_page_state()]

    Link: http://lkml.kernel.org/r/20170404220148.28338-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The current duplication is a high-maintenance mess, and it's painful to
    add new items or query memcg state from the rest of the VM.

    This increases the size of the stat array marginally, but we should aim
    to track all these stats on a per-cgroup level anyway.

    Link: http://lkml.kernel.org/r/20170404220148.28338-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The current duplication is a high-maintenance mess, and it's painful to
    add new items.

    This increases the size of the event array, but we'll eventually want
    most of the VM events tracked on a per-cgroup basis anyway.

    Link: http://lkml.kernel.org/r/20170404220148.28338-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We only ever count single events, drop the @nr parameter. Rename the
    function accordingly. Remove low-information kerneldoc.

    Link: http://lkml.kernel.org/r/20170404220148.28338-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
    list") we noticed bigger IO spikes during changes in cache access
    patterns.

    The patch in question shrunk the inactive list size to leave more room
    for the current workingset in the presence of streaming IO. However,
    workingset transitions that previously happened on the inactive list are
    now pushed out of memory and incur more refaults to complete.

    This patch disables active list protection when refaults are being
    observed. This accelerates workingset transitions, and allows more of
    the new set to establish itself from memory, without eating into the
    ability to protect the established workingset during stable periods.

    The workloads that were measurably affected for us were hit pretty badly
    by it, with refault/majfault rates doubling and tripling during cache
    transitions, and the machines sustaining half-hour periods of 100% IO
    utilization, where they'd previously have sub-minute peaks at 60-90%.

    Stateful services that handle user data tend to be more conservative
    with kernel upgrades. As a result we hit most page cache issues with
    some delay, as was the case here.

    The severity seemed to warrant a stable tag.

    Fixes: 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list")
    Link: http://lkml.kernel.org/r/20170404220052.27593-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: [4.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Cgroups currently don't report how much shmem they use, which can be
    useful data to have, in particular since shmem is included in the
    cache/file item while being reclaimed like anonymous memory.

    Add a counter to track shmem pages during charging and uncharging.

    Link: http://lkml.kernel.org/r/20170221164343.32252-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: Chris Down
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner