30 Dec, 2020

2 commits

  • [ Upstream commit eefbfa7fd678805b38a46293e78543f98f353d3e ]

    The rcu_read_lock()/rcu_read_unlock() pair can only guarantee that the memcg
    will not be freed; it cannot guarantee that a css reference can still be
    taken safely with css_get().

    If the whole process of a cgroup offlining completes between reading the
    objcg->memcg pointer and bumping the css reference on another CPU, and there
    are exactly 0 external references to this memory cgroup (so how did we get
    to obj_cgroup_charge() in the first place?), css_get() can change the ref
    counter from 0 back to 1.
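
    For illustration, the safe lookup boils down to roughly the following (an
    illustrative sketch, not the verbatim upstream change): re-read
    objcg->memcg under RCU and loop on css_tryget(), which fails once the
    refcount has dropped to zero, instead of resurrecting the counter with
    css_get():

        static struct mem_cgroup *memcg_from_objcg_sketch(struct obj_cgroup *objcg)
        {
                struct mem_cgroup *memcg;

                rcu_read_lock();
        retry:
                memcg = obj_cgroup_memcg(objcg);
                /*
                 * A failed tryget means the memcg went offline; objcg->memcg
                 * is (or is about to be) reparented, so simply read it again.
                 */
                if (unlikely(!css_tryget(&memcg->css)))
                        goto retry;
                rcu_read_unlock();

                return memcg;
        }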

    Link: https://lkml.kernel.org/r/20201028035013.99711-2-songmuchun@bytedance.com
    Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
    Signed-off-by: Muchun Song
    Acked-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Joonsoo Kim
    Cc: Yafang Shao
    Cc: Chris Down
    Cc: Christian Brauner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Muchun Song
     
  • [ Upstream commit 2f7659a314736b32b66273dbf91c19874a052fde ]

    Consider the following memcg hierarchy.

       root
       /  \
      A    B

    If we fail to get a reference on the objcg of memcg A,
    get_obj_cgroup_from_current() can end up returning the wrong objcg for the
    root memcg.
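
    A rough sketch of the fixed lookup (simplified for illustration; assume
    memcg starts at the current task's memcg and objcg is NULL): walk up the
    hierarchy and reset objcg whenever obj_cgroup_tryget() fails, so a stale
    pointer to A's objcg can never be returned once the loop terminates at the
    root memcg:

        for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
                objcg = rcu_dereference(memcg->objcg);
                if (objcg && obj_cgroup_tryget(objcg))
                        break;
                objcg = NULL;   /* don't carry a failed child objcg upwards */
        }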

    Link: https://lkml.kernel.org/r/20201029164429.58703-1-songmuchun@bytedance.com
    Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
    Signed-off-by: Muchun Song
    Acked-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Shakeel Butt
    Cc: Joonsoo Kim
    Cc: Yafang Shao
    Cc: Chris Down
    Cc: Christian Brauner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Kees Cook
    Cc: Thomas Gleixner
    Cc: Eugene Syromiatnikov
    Cc: Suren Baghdasaryan
    Cc: Adrian Reber
    Cc: Marco Elver
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Muchun Song
     

23 Nov, 2020

1 commit

    If we reparent slab objects to the root memcg, then freeing such an object
    must still update the per-memcg vmstats so that they stay correct for the
    root memcg. Right now this at least affects the NR_KERNEL_STACK_KB vmstat
    for !CONFIG_VMAP_STACK kernels whose thread stack size is smaller than
    PAGE_SIZE.

    David said:
    "I assume that without this fix that the root memcg's vmstat would
    always be inflated if we reparented"
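
    A simplified sketch of the idea (the exact upstream helper may differ in
    detail, and RCU locking is omitted): fall back to the node-only counter
    just for untracked objects, and let objects owned by the root memcg, e.g.
    after reparenting, go through the regular per-memcg path:

        struct page *page = virt_to_head_page(p);
        struct mem_cgroup *memcg = mem_cgroup_from_obj(p);
        struct lruvec *lruvec;

        if (!memcg) {
                /* untracked object: update the node counter only */
                __mod_node_page_state(page_pgdat(page), idx, val);
        } else {
                /* includes the root memcg, so reparented objects stay correct */
                lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
                __mod_lruvec_state(lruvec, idx, val);
        }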

    Fixes: ec9f02384f60 ("mm: workingset: fix vmstat counters for shadow nodes")
    Signed-off-by: Muchun Song
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: David Rientjes
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Christopher Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Roman Gushchin
    Cc: Vlastimil Babka
    Cc: Yafang Shao
    Cc: Chris Down
    Cc: [5.3+]
    Link: https://lkml.kernel.org/r/20201110031015.15715-1-songmuchun@bytedance.com
    Signed-off-by: Linus Torvalds

    Muchun Song
     

03 Nov, 2020

2 commits

  • Richard reported a warning which can be reproduced by running the LTP
    madvise6 test (cgroup v1 in the non-hierarchical mode should be used):

    WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
    Modules linked in:
    CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
    Workqueue: events drain_local_stock
    RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
    Call Trace:
    __memcg_kmem_uncharge (mm/memcontrol.c:3022)
    drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
    drain_local_stock (mm/memcontrol.c:2255)
    process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
    worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
    kthread (kernel/kthread.c:292)
    ret_from_fork (arch/x86/entry/entry_64.S:300)

    The problem occurs because in the non-hierarchical mode non-root page
    counters are not linked to root page counters, so the charge is not
    propagated to the root memory cgroup.

    After the removal of the original memory cgroup and reparenting of the
    object cgroup, the root cgroup might be uncharged by draining an objcg
    stock, for example. This leads to an eventual underflow of the charge and
    triggers the warning.

    Fix it by linking all page counters to corresponding root page counters
    in the non-hierarchical mode.
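
    Conceptually, the fix amounts to initializing each counter with the
    corresponding root counter as its parent instead of NULL in the
    non-hierarchical branch (a sketch of the shape of the change, assuming the
    counters are set up in mem_cgroup_css_alloc()):

        /* non-hierarchical mode: link to the root counters instead of NULL */
        page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
        page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
        page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
        page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);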

    Please note, that in the non-hierarchical mode all objcgs are always
    reparented to the root memory cgroup, even if the hierarchy has more
    than 1 level. This patch doesn't change it.

    The patch also doesn't affect how the hierarchical mode is working,
    which is the only sane and truly supported mode now.

    Thanks to Richard for reporting, debugging and providing an alternative
    version of the fix!

    Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
    Reported-by: Richard Palethorpe
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Reviewed-by: Michal Koutný
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Link: https://lkml.kernel.org/r/20201026231326.3212225-1-guro@fb.com
    Debugged-by: Richard Palethorpe
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
    memcg_page_state() returns the count for the specified item in the
    hierarchical memcg. When the item is NR_ANON_THPS the counter is kept in
    units of huge pages, so it should be multiplied by HPAGE_PMD_NR rather than
    treated as single pages.
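
    For illustration, the v1 show path needs something along these lines (a
    sketch, not the exact upstream hunk; it assumes the memcg1_stats[] /
    memcg1_stat_names[] arrays used there): scale the NR_ANON_THPS value by
    HPAGE_PMD_NR before converting pages to bytes:

        unsigned long nr = memcg_page_state(memcg, memcg1_stats[i]);

        if (memcg1_stats[i] == NR_ANON_THPS)
                nr *= HPAGE_PMD_NR;     /* this counter is in huge pages */
        seq_printf(m, "total_%s %llu\n", memcg1_stat_names[i],
                   (u64)nr * PAGE_SIZE);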

    [akpm@linux-foundation.org: fix printk warning]
    [akpm@linux-foundation.org: use u64 cast, per Michal]

    Fixes: 468c398233da ("mm: memcontrol: switch to native NR_ANON_THPS counter")
    Signed-off-by: zhongjiang-ali
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Link: https://lkml.kernel.org/r/1603722395-72443-1-git-send-email-zhongjiang-ali@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    zhongjiang-ali
     

19 Oct, 2020

5 commits

    If a memcg to charge can be determined (using the remote charging API),
    there is no reason to exclude allocations made from an interrupt context
    from the accounting.

    Such allocations will still succeed even if they push the memcg size over
    the hard limit, but they do add to the memcg's memory pressure, and an
    inability to bring the workload back under the limit will eventually
    trigger the OOM killer.

    To be able to use the active_memcg() helper, memcg_kmem_bypass() is moved
    back to memcontrol.c.
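
    The bypass check then reduces to something like the following sketch
    (simplified; it relies on the active_memcg() helper introduced elsewhere
    in this series): only bypass accounting when no memcg to charge can be
    determined at all:

        static bool memcg_kmem_bypass(void)
        {
                /* A remote memcg is always a valid charge target. */
                if (unlikely(active_memcg()))
                        return false;

                /* Otherwise the memcg to charge can't be determined. */
                if (in_interrupt() || !current->mm ||
                    (current->flags & PF_KTHREAD))
                        return true;

                return false;
        }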

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-5-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
    The remote memcg charging API uses current->active_memcg to store the
    currently active memory cgroup, which overrides the memory cgroup of the
    current process. It works well for normal contexts but not for interrupt
    contexts: if an interrupt occurs while a section with an active memcg is
    executing, all allocations inside the interrupt will be charged to that
    active memcg (once we enable accounting for allocations from an interrupt
    context). But because the interrupt may have no relation to the active
    memcg set outside of it, this is obviously wrong from the accounting
    perspective.

    To resolve this problem, let's add a global percpu int_active_memcg
    variable, which will be used to store an active memory cgroup which will
    be used from interrupt contexts. set_active_memcg() will transparently
    use current->active_memcg or int_active_memcg depending on the context.

    To make the read part simple and transparent for the caller, let's
    introduce two new functions:
    - struct mem_cgroup *active_memcg(void),
    - struct mem_cgroup *get_active_memcg(void).

    They return the active memcg if one is set, hiding the implementation
    detail of where to read it from in the current context.
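
    A sketch of what the read side then looks like (close to the description
    above; exact details may differ):

        static __always_inline struct mem_cgroup *active_memcg(void)
        {
                if (in_interrupt())
                        return this_cpu_read(int_active_memcg);
                else
                        return current->active_memcg;
        }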

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-4-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
    There are checks for current->mm and current->active_memcg in
    get_obj_cgroup_from_current(), but these checks are redundant:
    memcg_kmem_bypass(), called just above, performs the same checks.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-3-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Patch series "mm: kmem: kernel memory accounting in an interrupt context".

    This patchset implements memcg-based memory accounting of allocations made
    from an interrupt context.

    Historically, such allocations went unaccounted mostly because charging
    the memory cgroup of the current process wasn't an option; performance was
    likely a concern as well.

    The remote charging API allows temporarily overriding the currently active
    memory cgroup, so that all memory allocations are accounted towards some
    specified memory cgroup instead of the memory cgroup of the current
    process.

    This patchset extends the remote charging API so that it can be used from
    an interrupt context. Then it removes the fence that prevented the
    accounting of allocations made from an interrupt context. It also
    contains a couple of optimizations/code refactorings.

    This patchset doesn't directly enable accounting for any specific
    allocations, but prepares the code base for it. The bpf memory accounting
    will likely be the first user: a typical example is a bpf program parsing
    an incoming network packet and allocating an entry in a hashmap to store
    some information.

    This patch (of 4):

    Currently memcg_kmem_bypass() is called before obtaining the current
    memory/obj cgroup using get_mem/obj_cgroup_from_current(). Moving
    memcg_kmem_bypass() into get_mem/obj_cgroup_from_current() reduces the
    number of call sites and allows further code simplifications.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-1-guro@fb.com
    Link: http://lkml.kernel.org/r/20200827225843.1270629-2-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
    Currently the remote memcg charging API consists of two functions:
    memalloc_use_memcg() and memalloc_unuse_memcg(), which set and clear the
    memcg value that overrides the memcg of the current task.

    memalloc_use_memcg(target_memcg);

    memalloc_unuse_memcg();

    It works perfectly for allocations performed from a normal context, however
    an attempt to call it from an interrupt context, or simply nesting two
    remote charging blocks, leads to incorrect accounting: on exit from the
    inner block the active memcg is cleared instead of being restored.

    memalloc_use_memcg(target_memcg);

    memalloc_use_memcg(target_memcg_2);

    memalloc_unuse_memcg();

    Error: allocations here are charged to the memcg of the current
    process instead of target_memcg.

    memalloc_unuse_memcg();

    This patch extends the remote charging API by switching to a single
    function: struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg),
    which sets the new value and returns the old one. So a remote charging
    block will look like:

    old_memcg = set_active_memcg(target_memcg);

    set_active_memcg(old_memcg);

    This patch is heavily based on the patch by Johannes Weiner, which can be
    found here: https://lkml.org/lkml/2020/5/28/806 .
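
    In its simplest form, covering only normal task context as in this patch
    (an interrupt-aware variant was added later, see the 19 Oct entries above),
    the helper is just a swap of current->active_memcg, roughly:

        static inline struct mem_cgroup *
        set_active_memcg(struct mem_cgroup *memcg)
        {
                struct mem_cgroup *old = current->active_memcg;

                current->active_memcg = memcg;
                return old;
        }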

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Dan Schatzberg
    Link: https://lkml.kernel.org/r/20200821212056.3769116-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

14 Oct, 2020

12 commits

  • The code in mc_handle_swap_pte() checks for non_swap_entry() and returns
    NULL before checking is_device_private_entry() so device private pages are
    never handled. Fix this by checking for non_swap_entry() after handling
    device private swap PTEs.

    I assume the memory cgroup accounting would be off somehow when moving
    a process to another memory cgroup. Currently, the device private page
    is charged like a normal anonymous page when allocated and is uncharged
    when the page is freed so I think that path is OK.
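
    The reordered checks look roughly like this (a sketch; the page reference
    counting and MOVE_ANON handling of the real function are omitted):

        swp_entry_t ent = pte_to_swp_entry(ptent);

        /* device private pages are stored as special swap entries */
        if (is_device_private_entry(ent))
                return device_private_entry_to_page(ent);

        /* only now reject the remaining non-swap entries */
        if (non_swap_entry(ent))
                return NULL;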

    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Jerome Glisse
    Cc: Balbir Singh
    Cc: Ira Weiny
    Link: https://lkml.kernel.org/r/20201009215952.2726-1-rcampbell@nvidia.com
    xFixes: c733a82874a7 ("mm/memcontrol: support MEMORY_DEVICE_PRIVATE")
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     
    Since commit 79dfdaccd1d5 ("memcg: make oom_lock 0 and 1 based rather than
    counter"), mem_cgroup_unmark_under_oom() was added and the comment of
    mem_cgroup_oom_unlock() was moved here. But this comment makes no sense
    here because mem_cgroup_oom_lock() does not operate on the under_oom field,
    so reword the comment. [Thanks to Michal Hocko for rewording this comment.]

    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Link: https://lkml.kernel.org/r/20200930095336.21323-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
    Cgroup v1 has a numa_stat interface. This is useful for providing
    visibility into the NUMA locality information within a memcg, since pages
    are allowed to be allocated from any physical node. One of the use cases is
    evaluating application performance by combining this information with the
    application's CPU allocation. Cgroup v2 has no such interface, so this
    patch adds the missing information.

    Suggested-by: Shakeel Butt
    Signed-off-by: Muchun Song
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Zefan Li
    Cc: Johannes Weiner
    Cc: Jonathan Corbet
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Randy Dunlap
    Link: https://lkml.kernel.org/r/20200916100030.71698-2-songmuchun@bytedance.com
    Signed-off-by: Linus Torvalds

    Muchun Song
     
    The swap page counter is v2 only, while memsw is v1 only. As the v1 and v2
    controllers cannot be active at the same time, there is no point in keeping
    both the swap and the memsw page counters in mem_cgroup. The previous patch
    has made sure that the memsw page counter is updated and accessed only in
    v1 code paths, so it is now safe to alias the v1 memsw page counter to the
    v2 swap page counter. This shaves 14 longs off struct mem_cgroup, a saving
    of 112 bytes on 64-bit archs.

    While at it, also document which page counters are used in v1 and/or v2.
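
    The aliasing boils down to an anonymous union (shown here as a stand-alone
    sketch rather than the full struct mem_cgroup definition):

        /* sketch: the two counters share storage, only one is ever used */
        struct mem_cgroup_counters_sketch {
                struct page_counter memory;             /* both v1 and v2 */
                union {
                        struct page_counter swap;       /* v2 only */
                        struct page_counter memsw;      /* v1 only */
                };
        };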

    Signed-off-by: Waiman Long
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Chris Down
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Cc: Yafang Shao
    Link: https://lkml.kernel.org/r/20200914024452.19167-4-longman@redhat.com
    Signed-off-by: Linus Torvalds

    Waiman Long
     
    mem_cgroup_get_max() used to compute the memory+swap maximum from both the
    v1 memsw and the v2 memory+swap page counters and return the larger of the
    two values. This is redundant; it is more efficient to read just the v1 or
    the v2 value, depending on which one is currently in use.

    [longman@redhat.com: v4]
    Link: https://lkml.kernel.org/r/20200914150928.7841-1-longman@redhat.com

    Signed-off-by: Waiman Long
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Chris Down
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Cc: Yafang Shao
    Link: https://lkml.kernel.org/r/20200914024452.19167-3-longman@redhat.com
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • Patch series "mm/memcg: Miscellaneous cleanups and streamlining", v2.

    This patch (of 3):

    Since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API") and
    commit 00501b531c47 ("mm: memcontrol: rewrite charge API") in v3.17, the
    enum charge_type was no longer used anywhere. However, the enum itself
    was not removed at that time. Remove the obsolete enum charge_type now.

    Signed-off-by: Waiman Long
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Chris Down
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Cc: Yafang Shao
    Link: https://lkml.kernel.org/r/20200914024452.19167-1-longman@redhat.com
    Link: https://lkml.kernel.org/r/20200914024452.19167-2-longman@redhat.com
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • Since commit bbec2e15170a ("mm: rename page_counter's count/limit into
    usage/max"), the arg @reclaim has no priority field anymore.

    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Link: https://lkml.kernel.org/r/20200913094129.44558-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
    mem_cgroup_from_obj() checks the lowest bit of the page->mem_cgroup pointer
    to determine whether the page has an attached obj_cgroup vector instead of
    a regular memcg pointer. If the bit is not set, it simply returns the
    page->mem_cgroup value as a struct mem_cgroup pointer.

    Commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for
    all allocations") changed the moment when this bit is set: previously it
    was set when the slab page was allocated; now it can be set well after
    that, when the first accounted object is allocated on the page.

    It opened a race: if page->mem_cgroup is set concurrently after the first
    page_has_obj_cgroups(page) check, a pointer to the obj_cgroups array can
    be returned as a memory cgroup pointer.

    A simple NULL check on the page->mem_cgroup pointer before the
    page_has_obj_cgroups() check fixes the race. Indeed, if the pointer is not
    NULL, it's either a plain mem_cgroup pointer or a pointer to an obj_cgroup
    vector. The pointer can be asynchronously changed from NULL to
    (obj_cgroup_vec | 0x1UL), but can't be changed from a valid memcg pointer
    to an objcg vector or back.

    If the object passed to mem_cgroup_from_obj() is a slab object and
    page->mem_cgroup is NULL, it means that the object is not accounted, so
    the function must return NULL.
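
    A minimal sketch of the race-free read (illustration only; the per-object
    objcg resolution that the real function performs for slab pages is
    omitted): read page->mem_cgroup once, treat NULL as "not accounted", and
    only then interpret the low bit:

        struct page *page = virt_to_head_page(p);
        unsigned long val = (unsigned long)READ_ONCE(page->mem_cgroup);

        if (!val)
                return NULL;            /* not accounted (yet) */

        if (val & 0x1UL)
                /* obj_cgroup vector: resolve the objcg per object (omitted) */
                return NULL;

        return (struct mem_cgroup *)val;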

    I discovered the race by reading the code; so far I haven't seen it in the
    wild.

    Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations")
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Link: https://lkml.kernel.org/r/20200910022435.2773735-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Use the preferred form for passing the size of a structure type. The
    alternative form where the structure type is spelled out hurts readability
    and introduces an opportunity for a bug when the object type is changed
    but the corresponding object identifier to which the sizeof operator is
    applied is not.
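
    For example (struct foo is a hypothetical type, purely for illustration):

        struct foo *ptr;

        /* preferred: the size always follows ptr's type */
        ptr = kmalloc(sizeof(*ptr), GFP_KERNEL);

        /* discouraged: silently wrong if ptr's type is ever changed */
        ptr = kmalloc(sizeof(struct foo), GFP_KERNEL);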

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Link: https://lkml.kernel.org/r/773e013ff2f07fe2a0b47153f14dea054c0c04f1.1596214831.git.gustavoars@kernel.org
    Signed-off-by: Linus Torvalds

    Gustavo A. R. Silva
     
  • Make use of the flex_array_size() helper to calculate the size of a
    flexible array member within an enclosing structure.

    This helper offers defense-in-depth against potential integer overflows,
    while at the same time makes it explicitly clear that we are dealing with
    a flexible array member.

    Also, remove unnecessary braces.
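
    For example (hypothetical container structure, for illustration only):
    flex_array_size(p, member, count) evaluates to count * sizeof(*p->member),
    saturating instead of wrapping on overflow:

        struct thresholds_sketch {
                unsigned int size;
                struct mem_cgroup_threshold entries[];
        };

        /* copy old->size trailing elements without an open-coded multiply */
        memcpy(new->entries, old->entries,
               flex_array_size(old, entries, old->size));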

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Link: https://lkml.kernel.org/r/ddd60dae2d9aea1ccdd2be66634815c93696125e.1596214831.git.gustavoars@kernel.org
    Signed-off-by: Linus Torvalds

    Gustavo A. R. Silva
     
  • The current code does not protect against swapoff of the underlying
    swap device, so this is a bug fix as well as a worthwhile reduction in
    code complexity.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Alexey Dobriyan
    Cc: Chris Wilson
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Jani Nikula
    Cc: Johannes Weiner
    Cc: Matthew Auld
    Cc: William Kucharski
    Link: https://lkml.kernel.org/r/20200910183318.20139-3-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Pull block updates from Jens Axboe:

    - Series of merge handling cleanups (Baolin, Christoph)

    - Series of blk-throttle fixes and cleanups (Baolin)

    - Series cleaning up BDI, separating the block device from the
    backing_dev_info (Christoph)

    - Removal of bdget() as a generic API (Christoph)

    - Removal of blkdev_get() as a generic API (Christoph)

    - Cleanup of is-partition checks (Christoph)

    - Series reworking disk revalidation (Christoph)

    - Series cleaning up bio flags (Christoph)

    - bio crypt fixes (Eric)

    - IO stats inflight tweak (Gabriel)

    - blk-mq tags fixes (Hannes)

    - Buffer invalidation fixes (Jan)

    - Allow soft limits for zone append (Johannes)

    - Shared tag set improvements (John, Kashyap)

    - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

    - DM no-wait support (Mike, Konstantin)

    - Request allocation improvements (Ming)

    - Allow md/dm/bcache to use IO stat helpers (Song)

    - Series improving blk-iocost (Tejun)

    - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
    Xianting, Yang, Yufen, yangerkun)

    * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
    block: fix uapi blkzoned.h comments
    blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
    blk-mq: get rid of the dead flush handle code path
    block: get rid of unnecessary local variable
    block: fix comment and add lockdep assert
    blk-mq: use helper function to test hw stopped
    block: use helper function to test queue register
    block: remove redundant mq check
    block: invoke blk_mq_exit_sched no matter whether have .exit_sched
    percpu_ref: don't refer to ref->data if it isn't allocated
    block: ratelimit handle_bad_sector() message
    blk-throttle: Re-use the throtl_set_slice_end()
    blk-throttle: Open code __throtl_de/enqueue_tg()
    blk-throttle: Move service tree validation out of the throtl_rb_first()
    blk-throttle: Move the list operation after list validation
    blk-throttle: Fix IO hang for a corner case
    blk-throttle: Avoid tracking latency if low limit is invalid
    blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
    blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
    block: Remove redundant 'return' statement
    ...

    Linus Torvalds
     

27 Sep, 2020

1 commit

    We forgot to add the suffix to the workingset_restore string, so fix it.

    Also update the cgroup-v2.rst documentation accordingly.

    Fixes: 170b04b7ae49 ("mm/workingset: prepare the workingset detection infrastructure for anon LRU")
    Signed-off-by: Muchun Song
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Tejun Heo
    Cc: Zefan Li
    Cc: Jonathan Corbet
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Randy Dunlap
    Link: https://lkml.kernel.org/r/20200916100030.71698-1-songmuchun@bytedance.com
    Signed-off-by: Linus Torvalds

    Muchun Song
     

25 Sep, 2020

1 commit


06 Sep, 2020

1 commit

    syzbot has reported a use-after-free in the uncharge_batch path:

    BUG: KASAN: use-after-free in instrument_atomic_write include/linux/instrumented.h:71 [inline]
    BUG: KASAN: use-after-free in atomic64_sub_return include/asm-generic/atomic-instrumented.h:970 [inline]
    BUG: KASAN: use-after-free in atomic_long_sub_return include/asm-generic/atomic-long.h:113 [inline]
    BUG: KASAN: use-after-free in page_counter_cancel mm/page_counter.c:54 [inline]
    BUG: KASAN: use-after-free in page_counter_uncharge+0x3d/0xc0 mm/page_counter.c:155
    Write of size 8 at addr ffff8880371c0148 by task syz-executor.0/9304

    CPU: 0 PID: 9304 Comm: syz-executor.0 Not tainted 5.8.0-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x1f0/0x31e lib/dump_stack.c:118
    print_address_description+0x66/0x620 mm/kasan/report.c:383
    __kasan_report mm/kasan/report.c:513 [inline]
    kasan_report+0x132/0x1d0 mm/kasan/report.c:530
    check_memory_region_inline mm/kasan/generic.c:183 [inline]
    check_memory_region+0x2b5/0x2f0 mm/kasan/generic.c:192
    instrument_atomic_write include/linux/instrumented.h:71 [inline]
    atomic64_sub_return include/asm-generic/atomic-instrumented.h:970 [inline]
    atomic_long_sub_return include/asm-generic/atomic-long.h:113 [inline]
    page_counter_cancel mm/page_counter.c:54 [inline]
    page_counter_uncharge+0x3d/0xc0 mm/page_counter.c:155
    uncharge_batch+0x6c/0x350 mm/memcontrol.c:6764
    uncharge_page+0x115/0x430 mm/memcontrol.c:6796
    uncharge_list mm/memcontrol.c:6835 [inline]
    mem_cgroup_uncharge_list+0x70/0xe0 mm/memcontrol.c:6877
    release_pages+0x13a2/0x1550 mm/swap.c:911
    tlb_batch_pages_flush mm/mmu_gather.c:49 [inline]
    tlb_flush_mmu_free mm/mmu_gather.c:242 [inline]
    tlb_flush_mmu+0x780/0x910 mm/mmu_gather.c:249
    tlb_finish_mmu+0xcb/0x200 mm/mmu_gather.c:328
    exit_mmap+0x296/0x550 mm/mmap.c:3185
    __mmput+0x113/0x370 kernel/fork.c:1076
    exit_mm+0x4cd/0x550 kernel/exit.c:483
    do_exit+0x576/0x1f20 kernel/exit.c:793
    do_group_exit+0x161/0x2d0 kernel/exit.c:903
    get_signal+0x139b/0x1d30 kernel/signal.c:2743
    arch_do_signal+0x33/0x610 arch/x86/kernel/signal.c:811
    exit_to_user_mode_loop kernel/entry/common.c:135 [inline]
    exit_to_user_mode_prepare+0x8d/0x1b0 kernel/entry/common.c:166
    syscall_exit_to_user_mode+0x5e/0x1a0 kernel/entry/common.c:241
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Commit 1a3e1f40962c ("mm: memcontrol: decouple reference counting from
    page accounting") reworked the memcg lifetime to be bound to the struct
    page rather than to charges. It also removed the css_put_many() from
    uncharge_batch(), and that is causing the above splat.

    uncharge_batch() is supposed to uncharge accumulated charges for all pages
    freed from the same memcg. The queuing is done by uncharge_page(), which
    however drops the memcg reference after it adds the charges to the batch.
    If the current page happens to be the last one holding a reference to its
    memcg, the memcg can go away, and the next page to be freed will trigger a
    batched uncharge that needs to access a memcg which is gone already.

    Fix the issue by taking a reference for the memcg in the current batch.

    Fixes: 1a3e1f40962c ("mm: memcontrol: decouple reference counting from page accounting")
    Reported-by: syzbot+b305848212deec86eabe@syzkaller.appspotmail.com
    Reported-by: syzbot+b5ea6fb6f139c8b9482b@syzkaller.appspotmail.com
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Roman Gushchin
    Cc: Hugh Dickins
    Link: https://lkml.kernel.org/r/20200820090341.GC5033@dhcp22.suse.cz
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

15 Aug, 2020

1 commit

  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

14 Aug, 2020

1 commit

  • Commit 3e38e0aaca9e ("mm: memcg: charge memcg percpu memory to the
    parent cgroup") adds memory tracking to the memcg kernel structures
    themselves to make cgroups liable for the memory they are consuming
    through the allocation of child groups (which can be significant).

    This code is a bit awkward as it's spread out through several functions:
    The outermost function does memalloc_use_memcg(parent) to set up
    current->active_memcg, which designates which cgroup to charge, and the
    inner functions pass GFP_ACCOUNT to request charging for specific
    allocations. To make sure this dependency is satisfied at all times -
    to make sure we don't randomly charge whoever is calling the functions -
    the inner functions warn on !current->active_memcg.

    However, this triggers a false warning when the root memcg itself is
    allocated. No parent exists in this case, and so current->active_memcg
    is rightfully NULL. It's a false positive, not indicative of a bug.

    Delete the warnings for now, we can revisit this later.

    Fixes: 3e38e0aaca9e ("mm: memcg: charge memcg percpu memory to the parent cgroup")
    Signed-off-by: Johannes Weiner
    Reported-by: Stephen Rothwell
    Acked-by: Roman Gushchin
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

13 Aug, 2020

4 commits

  • Drop the repeated word "down".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-6-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • To prepare the workingset detection for anon LRU, this patch splits
    workingset event counters for refault, activate and restore into anon and
    file variants, as well as the refaults counter in struct lruvec.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/1595490560-15117-4-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Memory cgroups are using large chunks of percpu memory to store vmstat
    data. Yet this memory is not accounted at all, so in the case when there
    are many (dying) cgroups, it's not exactly clear where all the memory is.

    Because the size of the memory cgroup internal structures can dramatically
    exceed the size of the object or page that is pinning them in memory, it's
    not a good idea to simply ignore it. It actually breaks the isolation
    between cgroups.

    Let's account the consumed percpu memory to the parent cgroup.

    [guro@fb.com: add WARN_ON_ONCE()s, per Johannes]
    Link: http://lkml.kernel.org/r/20200811170611.GB1507044@carbon.DHCP.thefacebook.com

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Dennis Zhou
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Tobin C. Harding
    Cc: Vlastimil Babka
    Cc: Waiman Long
    Cc: Bixuan Cui
    Cc: Michal Koutný
    Cc: Stephen Rothwell
    Link: http://lkml.kernel.org/r/20200623184515.4132564-5-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Percpu memory can represent a noticeable chunk of the total memory
    consumption, especially on big machines with many CPUs. Let's track
    percpu memory usage for each memcg and display it in memory.stat.

    A percpu allocation is usually scattered over multiple pages (and nodes),
    and can be significantly smaller than a page. So let's add a byte-sized
    counter on the memcg level: MEMCG_PERCPU_B. The byte-sized vmstat
    infrastructure created for slabs can be reused perfectly for the percpu
    case.

    [guro@fb.com: v3]
    Link: http://lkml.kernel.org/r/20200623184515.4132564-4-guro@fb.com

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Dennis Zhou
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Tobin C. Harding
    Cc: Vlastimil Babka
    Cc: Waiman Long
    Cc: Bixuan Cui
    Cc: Michal Koutný
    Cc: Stephen Rothwell
    Link: http://lkml.kernel.org/r/20200608230819.832349-4-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

08 Aug, 2020

9 commits

  • When an outside process lowers one of the memory limits of a cgroup (or
    uses the force_empty knob in cgroup1), direct reclaim is performed in the
    context of the write(), in order to directly enforce the new limit and
    have it being met by the time the write() returns.

    Currently, this reclaim activity is accounted as memory pressure in the
    cgroup that the writer(!) belongs to. This is unexpected. It
    specifically causes problems for senpai
    (https://github.com/facebookincubator/senpai), which is an agent that
    routinely adjusts the memory limits and performs associated reclaim work
    in tens or even hundreds of cgroups running on the host. The cgroup that
    senpai is running in itself will report elevated levels of memory
    pressure, even though it itself is under no memory shortage or any sort of
    distress.

    Move the psi annotation from the central cgroup reclaim function to the
    callsites in the allocation context, and thereby no longer count any
    limit-setting reclaim as memory pressure. If the newly set limit pushes
    the workload inside the cgroup into direct reclaim, that of course will
    continue to count as memory pressure.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Acked-by: Chris Down
    Acked-by: Michal Hocko
    Link: http://lkml.kernel.org/r/20200728135210.379885-2-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 8c8c383c04f6 ("mm: memcontrol: try harder to set a new
    memory.high") inadvertently removed a callback to recalculate the
    writeback cache size in light of a newly configured memory.high limit.

    Without letting the writeback cache know about a potentially heavily
    reduced limit, it may permit too many dirty pages, which can cause
    unnecessary reclaim latencies or even avoidable OOM situations.

    This was spotted while reading the code; it isn't known to have caused any
    problems in practice so far.

    Fixes: 8c8c383c04f6 ("mm: memcontrol: try harder to set a new memory.high")
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Chris Down
    Acked-by: Michal Hocko
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/20200728135210.379885-1-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    Memcg oom killer invocation is synchronized by the global oom_lock, and
    tasks sleep on the lock while somebody is selecting the victim; they can
    also race with the oom_reaper releasing the victim's memory. This can
    result in a pointless oom killer invocation because a waiter might be
    racing with the oom_reaper:

    P1                     oom_reaper                 P2
                           oom_reap_task              mutex_lock(oom_lock)
                                                      out_of_memory
                                                        # no victim because we have one already
                           __oom_reap_task_mm         mutex_unlock(oom_lock)
    mutex_lock(oom_lock)
                           set MMF_OOM_SKIP
    select_bad_process
      # finds a new victim

    The page allocator prevents this race by attempting the allocation again
    after the lock can be acquired (in __alloc_pages_may_oom()), which acts as
    a last minute check. Moreover, the page allocator doesn't block on the
    oom_lock and simply retries the whole reclaim process.

    Memcg oom killer should do the last minute check as well. Call
    mem_cgroup_margin to do that. Trylock on the oom_lock could be done as
    well but this doesn't seem to be necessary at this stage.
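
    The last minute check then amounts to something like this sketch, placed
    right after the oom_lock is taken:

        mutex_lock(&oom_lock);

        /*
         * A bit of memory might have been freed (by the oom_reaper or by a
         * parallel oom killer) while we were waiting for the lock, so
         * recheck the margin before killing anything.
         */
        if (mem_cgroup_margin(memcg) >= (1 << order))
                goto unlock;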

    [mhocko@kernel.org: commit log]

    Suggested-by: Michal Hocko
    Signed-off-by: Yafang Shao
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Chris Down
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Johannes Weiner
    Link: http://lkml.kernel.org/r/1594735034-19190-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • mem_cgroup_protected currently is both used to set effective low and min
    and return a mem_cgroup_protection based on the result. As a user, this
    can be a little unexpected: it appears to be a simple predicate function,
    if not for the big warning in the comment above about the order in which
    it must be executed.

    This change makes it so that we separate the state mutations from the
    actual protection checks, which makes it more obvious where we need to be
    careful mutating internal state, and where we are simply checking and
    don't need to worry about that.

    [mhocko@suse.com - don't check protection on root memcgs]

    Suggested-by: Johannes Weiner
    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Roman Gushchin
    Cc: Yafang Shao
    Link: http://lkml.kernel.org/r/ff3f915097fcee9f6d7041c084ef92d16aaeb56a.1594638158.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • Patch series "mm, memcg: memory.{low,min} reclaim fix & cleanup", v4.

    This series contains a fix for an edge case in my earlier protection
    calculation patches, and a patch to make the area overall a little more
    robust, to hopefully help avoid this in future.

    This patch (of 2):

    A cgroup can have both memory protection and a memory limit to isolate it
    from its siblings in both directions - for example, to prevent it from
    being shrunk below 2G under high pressure from outside, but also from
    growing beyond 4G under low pressure.

    Commit 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
    implemented proportional scan pressure so that multiple siblings in excess
    of their protection settings don't get reclaimed equally but instead in
    accordance to their unprotected portion.

    During limit reclaim, this proportionality shouldn't apply of course:
    there is no competition, all pressure is from within the cgroup and should
    be applied as such. Reclaim should operate at full efficiency.

    However, mem_cgroup_protected() never expected anybody to look at the
    effective protection values when it indicated that the cgroup is above its
    protection. As a result, a query during limit reclaim may return stale
    protection values that were calculated by a previous reclaim cycle in
    which the cgroup did have siblings.

    When this happens, reclaim is unnecessarily hesitant and potentially slow
    to meet the desired limit. In theory this could lead to premature OOM
    kills, although it's not obvious this has occurred in practice.

    Work around the problem by special-casing reclaim roots in
    mem_cgroup_protection(). These memcgs never participate in the reclaim
    protection because the reclaim is internal.

    We have to ignore effective protection values for reclaim roots because
    mem_cgroup_protected() might be called from racing reclaim contexts with
    different roots. The calculation relies on a root -> leaf tree traversal,
    therefore top-down reclaim protection invariants should hold. The only
    exception is the reclaim root, which should have its effective protection
    set to 0, but that would be problematic for the following setup:

    Let's have global and A's reclaim in parallel:
    |
    A (low=2G, usage = 3G, max = 3G, children_low_usage = 1.5G)
    |\
    | C (low = 1G, usage = 2.5G)
    B (low = 1G, usage = 0.5G)

    for A reclaim we have
    B.elow = B.low
    C.elow = C.low

    For the global reclaim
    A.elow = A.low
    B.elow = min(B.usage, B.low) because children_low_usage <= A.elow

    Which means that protected memcgs would get reclaimed.

    In future we would like to make mem_cgroup_protected more robust against
    racing reclaim contexts but that is likely more complex solution than this
    simple workaround.

    [hannes@cmpxchg.org - large part of the changelog]
    [mhocko@suse.com - workaround explanation]
    [chris@chrisdown.name - retitle]

    Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
    Signed-off-by: Yafang Shao
    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Chris Down
    Acked-by: Roman Gushchin
    Link: http://lkml.kernel.org/r/cover.1594638158.git.chris@chrisdown.name
    Link: http://lkml.kernel.org/r/044fb8ecffd001c7905d27c0c2ad998069fdc396.1594638158.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • Reclaim retries have been set to 5 since the beginning of time in
    commit 66e1707bc346 ("Memory controller: add per cgroup LRU and
    reclaim"). However, we now have a generally agreed-upon standard for
    page reclaim: MAX_RECLAIM_RETRIES (currently 16), added many years later
    in commit 0a0337e0d1d1 ("mm, oom: rework oom detection").

    In the absence of a compelling reason to declare an OOM earlier in memcg
    context than page allocator context, it seems reasonable to supplant
    MEM_CGROUP_RECLAIM_RETRIES with MAX_RECLAIM_RETRIES, making the page
    allocator and memcg internals more similar in semantics when reclaim
    fails to produce results, avoiding premature OOMs or throttling.

    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/da557856c9c7654308eaff4eedc1952a95e8df5f.1594640214.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • Patch series "mm, memcg: reclaim harder before high throttling", v2.

    This patch (of 2):

    In Facebook production, we've seen cases where cgroups have been put into
    allocator throttling even when they appear to have a lot of slack file
    caches which should be trivially reclaimable.

    Looking more closely, the problem is that we only try a single cgroup
    reclaim walk for each return to usermode before calculating whether or not
    we should throttle. This single attempt doesn't produce enough pressure to
    shrink cgroups whose file caches grow rapidly prior to entering allocator
    throttling.

    As an example, we see that threads in an affected cgroup are stuck in
    allocator throttling:

    # for i in $(cat cgroup.threads); do
    > grep over_high "/proc/$i/stack"
    > done
    [] mem_cgroup_handle_over_high+0x10b/0x150
    [] mem_cgroup_handle_over_high+0x10b/0x150
    [] mem_cgroup_handle_over_high+0x10b/0x150

    ...however, there is no I/O pressure reported by PSI, despite a lot of
    slack file pages:

    # cat memory.pressure
    some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903
    full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959
    # cat io.pressure
    some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391
    full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640
    # grep _file memory.stat
    inactive_file 1370939392
    active_file 661635072

    This patch changes the behaviour to retry reclaim either until the current
    task goes below the 10ms grace period, or we are making no reclaim
    progress at all. In the latter case, we enter reclaim throttling as
    before.

    To a user, there's no intuitive reason for the reclaim behaviour to differ
    from hitting memory.high as part of a new allocation, as opposed to
    hitting memory.high because someone lowered its value. As such this also
    brings an added benefit: it unifies the reclaim behaviour between the two.

    There's precedent for this behaviour: we already do reclaim retries when
    writing to memory.{high,max}, in max reclaim, and in the page allocator
    itself.

    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/cover.1594640214.git.chris@chrisdown.name
    Link: http://lkml.kernel.org/r/a4e23b59e9ef499b575ae73a8120ee089b7d3373.1594640214.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Chris Down
     
    The memory.high limit is implemented in a way such that the kernel
    penalizes all threads which are allocating memory over the limit. Forcing
    all threads into synchronous reclaim and adding some artificial delays
    slows down memory consumption and potentially gives userspace oom
    handlers/resource control agents some time to react.

    It works nicely if memory usage is hitting the limit from below, but it
    works sub-optimally if a user adjusts memory.high to a value way below the
    current memory usage. It basically forces all workload threads (doing any
    memory allocations) into synchronous reclaim and sleep. This makes the
    workload completely unresponsive for a long period of time and can also
    lead to system-wide contention on lru locks. It can happen even if the
    workload is not actually tight on memory and has, for example, a ton of
    cold pagecache.

    In the current implementation writing to memory.high causes an atomic
    update of page counter's high value followed by an attempt to reclaim
    enough memory to fit into the new limit. To fix the problem described
    above, all we need is to change the order of execution: try to push the
    memory usage under the limit first, and only then set the new high limit.

    Reported-by: Domas Mituzas
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Chris Down
    Link: http://lkml.kernel.org/r/20200709194718.189231-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently the kernel stack is being accounted per-zone. There is no need
    to do that. In addition due to being per-zone, memcg has to keep a
    separate MEMCG_KERNEL_STACK_KB. Make the stat per-node and deprecate
    MEMCG_KERNEL_STACK_KB as memcg_stat_item is an extension of
    node_stat_item. In addition localize the kernel stack stats updates to
    account_kernel_stack().

    Signed-off-by: Shakeel Butt
    Signed-off-by: Andrew Morton
    Reviewed-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200630161539.1759185-1-shakeelb@google.com
    Signed-off-by: Linus Torvalds

    Shakeel Butt