08 Aug, 2020

40 commits

  • While checking a performance change for the will-it-scale scalability mmap
    test [1], we found very high lock contention on the spinlock of the percpu
    counter 'vm_committed_as':

    94.14% 0.35% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
    48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
    45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;

    This heavy lock contention is not always necessary: 'vm_committed_as' only
    needs to be very precise when the strict OVERCOMMIT_NEVER policy is set,
    which requires a rather small batch number for the percpu counter.

    So keep the 'batch' number unchanged for the strict OVERCOMMIT_NEVER
    policy, and lift it to 64X for the OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS
    policies. Also add a sysctl handler to adjust it when the policy is
    reconfigured.
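
    For illustration, the policy-aware batch computation could look roughly
    like the sketch below (modelled on the description above; the exact
    divisors are an assumption derived from the 64X figure, not a quote of the
    upstream code):

    /* illustrative sketch, not the exact upstream implementation */
    static void mm_compute_batch(int overcommit_policy)
    {
            int nr = num_present_cpus();
            unsigned long ram_pages = totalram_pages();
            s32 batch = max_t(s32, nr * 2, 32);
            u64 memsized_batch;

            if (overcommit_policy == OVERCOMMIT_NEVER)
                    /* strict policy: keep the small, precise batch */
                    memsized_batch = min_t(u64, ram_pages / nr / 256, INT_MAX);
            else
                    /* OVERCOMMIT_ALWAYS / OVERCOMMIT_GUESS: 64X larger */
                    memsized_batch = min_t(u64, ram_pages / nr / 4, INT_MAX);

            vm_committed_as_batch = max_t(s32, memsized_batch, batch);
    }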

    A benchmark with the same testcase in [1] shows a 53% improvement on an
    8C/16T desktop and 2097% (20X) on a 4S/72C/144T server. We tested with the
    0day test platforms (server, desktop and laptop), and 80%+ of the platforms
    show improvements with that test; whether a platform improves depends on
    whether the test's mmap size is bigger than the computed batch number.

    With a 16X lift, only 1/3 of the platforms show improvements, though the
    change should help mmap/unmap usage generally, as Michal Hocko mentioned:

    : I believe that there are non-synthetic workloads which would benefit from
    : a larger batch. E.g. large in memory databases which do large mmaps
    : during startups from multiple threads.

    [1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/

    Signed-off-by: Feng Tang
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Matthew Wilcox (Oracle)
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Qian Cai
    Cc: Kees Cook
    Cc: Andi Kleen
    Cc: Tim Chen
    Cc: Dave Hansen
    Cc: Huang Ying
    Cc: Christoph Lameter
    Cc: Dennis Zhou
    Cc: Haiyang Zhang
    Cc: kernel test robot
    Cc: "K. Y. Srinivasan"
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/1589611660-89854-4-git-send-email-feng.tang@intel.com
    Link: http://lkml.kernel.org/r/1592725000-73486-4-git-send-email-feng.tang@intel.com
    Link: http://lkml.kernel.org/r/1594389708-60781-5-git-send-email-feng.tang@intel.com
    Signed-off-by: Linus Torvalds

    Feng Tang
     
  • A percpu_counter's accuracy is related to its batch size. For a
    percpu_counter with a big batch, the deviation could be big, so when the
    counter's batch is changed at runtime to a smaller value for better
    accuracy, there may also be a need to flush the accumulated deviation.

    So add a percpu-counter sync function to be run on each CPU.
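
    For reference, such a sync can be as simple as folding the local per-CPU
    delta into the global count under the counter's lock, roughly along these
    lines (a sketch, meant to be run on each CPU, e.g. via on_each_cpu()):

    void percpu_counter_sync(struct percpu_counter *fbc)
    {
            unsigned long flags;
            s64 count;

            raw_spin_lock_irqsave(&fbc->lock, flags);
            /* move this CPU's accumulated delta into fbc->count */
            count = __this_cpu_read(*fbc->counters);
            fbc->count += count;
            __this_cpu_sub(*fbc->counters, count);
            raw_spin_unlock_irqrestore(&fbc->lock, flags);
    }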

    Reported-by: kernel test robot
    Signed-off-by: Feng Tang
    Signed-off-by: Andrew Morton
    Cc: Dennis Zhou
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc: Qian Cai
    Cc: Andi Kleen
    Cc: Huang Ying
    Cc: Dave Hansen
    Cc: Haiyang Zhang
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: "K. Y. Srinivasan"
    Cc: Matthew Wilcox (Oracle)
    Cc: Mel Gorman
    Cc: Tim Chen
    Link: http://lkml.kernel.org/r/1594389708-60781-4-git-send-email-feng.tang@intel.com
    Signed-off-by: Linus Torvalds

    Feng Tang
     
  • percpu_counter_sum_positive() will provide more accurate info.

    With percpu_counter_read_positive(), the worst-case deviation could be
    'batch * nr_cpus', which is totalram_pages/256 for now, and will grow when
    the batch gets enlarged.
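
    For context, the extra cost comes from the sum walking every CPU's local
    delta instead of just reading the possibly stale global count; roughly (a
    simplified sketch of the existing lib/percpu_counter.c logic):

    s64 __percpu_counter_sum(struct percpu_counter *fbc)
    {
            s64 ret;
            int cpu;
            unsigned long flags;

            raw_spin_lock_irqsave(&fbc->lock, flags);
            ret = fbc->count;
            for_each_online_cpu(cpu)
                    ret += *per_cpu_ptr(fbc->counters, cpu);
            raw_spin_unlock_irqrestore(&fbc->lock, flags);
            return ret;
    }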

    Its time cost is about 800 nanoseconds on a 2C/4T platform and 2~3
    microseconds on a 2S/36C/72T Skylake server in the normal case; in the
    worst case, where vm_committed_as's spinlock is under severe contention, it
    costs 30~40 microseconds on the same Skylake server. That should be fine
    for its only two users: /proc/meminfo and the Hyper-V balloon driver's
    once-per-second status trace.

    Signed-off-by: Feng Tang
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko # for /proc/meminfo
    Cc: "K. Y. Srinivasan"
    Cc: Haiyang Zhang
    Cc: Matthew Wilcox (Oracle)
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Qian Cai
    Cc: Andi Kleen
    Cc: Tim Chen
    Cc: Dave Hansen
    Cc: Huang Ying
    Cc: Christoph Lameter
    Cc: Dennis Zhou
    Cc: Kees Cook
    Cc: kernel test robot
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/1592725000-73486-3-git-send-email-feng.tang@intel.com
    Link: http://lkml.kernel.org/r/1594389708-60781-3-git-send-email-feng.tang@intel.com
    Signed-off-by: Linus Torvalds

    Feng Tang
     
  • Patch series "make vm_committed_as_batch aware of vm overcommit policy", v6.

    While checking a performance change for the will-it-scale scalability mmap
    test [1], we found very high lock contention on the spinlock of the percpu
    counter 'vm_committed_as':

    94.14% 0.35% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
    48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
    45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;

    This heavy lock contention is not always necessary: 'vm_committed_as' only
    needs to be very precise when the strict OVERCOMMIT_NEVER policy is set,
    which requires a rather small batch number for the percpu counter.

    So keep the 'batch' number unchanged for the strict OVERCOMMIT_NEVER
    policy, and enlarge it for the not-so-strict OVERCOMMIT_ALWAYS and
    OVERCOMMIT_GUESS policies.

    A benchmark with the same testcase in [1] shows a 53% improvement on an
    8C/16T desktop and 2097% (20X) on a 4S/72C/144T server. For that case,
    whether a platform shows an improvement depends on whether the test's mmap
    size is bigger than the computed batch number.

    We tested 10+ platforms in 0day (server, desktop and laptop). With a 64X
    lift, 80%+ of the platforms show improvements, and with a 16X lift, 1/3 of
    the platforms show improvements.

    And generally it should help mmap/unmap usage, as Michal Hocko mentioned:

    : I believe that there are non-synthetic workloads which would benefit
    : from a larger batch. E.g. large in memory databases which do large
    : mmaps during startups from multiple threads.

    Note: there are some style complaints from checkpatch for patch 4, as the
    sysctl handler declaration follows the same format as its sibling functions.

    [1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/

    This patch (of 4):

    Use the existing vm_memory_committed() instead, which is also convenient
    for future changes.

    Signed-off-by: Feng Tang
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Matthew Wilcox (Oracle)
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Qian Cai
    Cc: Kees Cook
    Cc: Andi Kleen
    Cc: Tim Chen
    Cc: Dave Hansen
    Cc: Huang Ying
    Cc: Christoph Lameter
    Cc: Dennis Zhou
    Cc: Haiyang Zhang
    Cc: kernel test robot
    Cc: "K. Y. Srinivasan"
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/1594389708-60781-1-git-send-email-feng.tang@intel.com
    Link: http://lkml.kernel.org/r/1594389708-60781-2-git-send-email-feng.tang@intel.com
    Signed-off-by: Linus Torvalds

    Feng Tang
     
  • Look at the pseudo code below. The check "!is_file_hugepages(file)" at 3)
    duplicates the one at 1), so we can use "else if" to avoid it. And the
    assignment "retval = -EINVAL" at 2) is only needed by branch 3), because
    "retval" will be overwritten at 4).

    No functional change, but it reduces the code size and is arguably a bit
    clearer; a sketch of the restructured flow follows the pseudo code below.
    Before:
       text    data     bss     dec     hex filename
      28733    1590       1   30324    7674 mm/mmap.o

    After:
       text    data     bss     dec     hex filename
      28701    1590       1   30292    7654 mm/mmap.o

    ====pseudo code====:
    if (!(flags & MAP_ANONYMOUS)) {
            ...
    1)      if (is_file_hugepages(file))
                    len = ALIGN(len, huge_page_size(hstate_file(file)));
    2)      retval = -EINVAL;
    3)      if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
                    goto out_fput;
    } else if (flags & MAP_HUGETLB) {
            ...
    }
    ...

    4) retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
    out_fput:
    ...
    return retval;
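
    For illustration, the restructured flow could look roughly like this (a
    sketch of the intent, not the exact diff):

    if (!(flags & MAP_ANONYMOUS)) {
            ...
            if (is_file_hugepages(file)) {
                    len = ALIGN(len, huge_page_size(hstate_file(file)));
            } else if (unlikely(flags & MAP_HUGETLB)) {
                    retval = -EINVAL;
                    goto out_fput;
            }
    } else if (flags & MAP_HUGETLB) {
            ...
    }
    ...

    retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
    out_fput:
    ...
    return retval;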

    Signed-off-by: Zhen Lei
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200705080112.1405-1-thunder.leizhen@huawei.com
    Signed-off-by: Linus Torvalds

    Zhen Lei
     
  • The functions are only used in two source files, so there is no need for
    them to be in the global <linux/mm.h> header. Move them to the new header
    and include it only where needed.

    Signed-off-by: Joerg Roedel
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Cc: Peter Zijlstra (Intel)
    Cc: Andy Lutomirski
    Cc: Abdul Haleem
    Cc: Satheesh Rajendran
    Cc: Stephen Rothwell
    Cc: Steven Rostedt (VMware)
    Cc: Mike Rapoport
    Cc: Christophe Leroy
    Cc: Arnd Bergmann
    Cc: Max Filippov
    Cc: Stafford Horne
    Cc: Geert Uytterhoeven
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200609120533.25867-1-joro@8bytes.org
    Signed-off-by: Linus Torvalds

    Joerg Roedel
     
  • The functionality in lib/ioremap.c deals with pagetables, vmalloc and
    caches, so it naturally belongs to mm/. Moving it there will also allow
    declaring the p?d_alloc_track() functions in a header file inside mm/
    rather than having those declarations in include/linux/mm.h.

    Suggested-by: Andrew Morton
    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Cc: Abdul Haleem
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Christophe Leroy
    Cc: Joerg Roedel
    Cc: Joerg Roedel
    Cc: Max Filippov
    Cc: Peter Zijlstra (Intel)
    Cc: Satheesh Rajendran
    Cc: Stafford Horne
    Cc: Stephen Rothwell
    Cc: Steven Rostedt
    Cc: Geert Uytterhoeven
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200627143453.31835-8-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Most architectures define pgd_free() as a wrapper for free_page().

    Provide a generic version in asm-generic/pgalloc.h and enable its use for
    most architectures.

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Acked-by: Geert Uytterhoeven [m68k]
    Cc: Abdul Haleem
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Christophe Leroy
    Cc: Joerg Roedel
    Cc: Joerg Roedel
    Cc: Max Filippov
    Cc: Peter Zijlstra (Intel)
    Cc: Satheesh Rajendran
    Cc: Stafford Horne
    Cc: Stephen Rothwell
    Cc: Steven Rostedt
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200627143453.31835-7-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Several architectures define pud_alloc_one() as a wrapper for
    __get_free_page() and pud_free() as a wrapper for free_page().

    Provide a generic implementation in asm-generic/pgalloc.h and use it where
    appropriate.
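
    For reference, the generic helpers could look roughly like this (a sketch
    following the existing asm-generic pgalloc style; the gfp handling is an
    assumption):

    static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
    {
            gfp_t gfp = GFP_PGTABLE_USER;

            if (mm == &init_mm)
                    gfp = GFP_PGTABLE_KERNEL;
            return (pud_t *)get_zeroed_page(gfp);
    }

    static inline void pud_free(struct mm_struct *mm, pud_t *pud)
    {
            free_page((unsigned long)pud);
    }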

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Cc: Abdul Haleem
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Christophe Leroy
    Cc: Joerg Roedel
    Cc: Joerg Roedel
    Cc: Max Filippov
    Cc: Peter Zijlstra (Intel)
    Cc: Satheesh Rajendran
    Cc: Stafford Horne
    Cc: Stephen Rothwell
    Cc: Steven Rostedt
    Cc: Geert Uytterhoeven
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200627143453.31835-6-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • For most architectures that support >2 levels of page tables,
    pmd_alloc_one() is a wrapper for __get_free_pages(), sometimes with
    __GFP_ZERO and sometimes followed by memset(0) instead.

    More elaborate versions on arm64 and x86 account memory for the user page
    tables and call to pgtable_pmd_page_ctor() as the part of PMD page
    initialization.

    Move the arm64 version to include/asm-generic/pgalloc.h and use the
    generic version on several architectures.

    The pgtable_pmd_page_ctor() is a NOP when ARCH_ENABLE_SPLIT_PMD_PTLOCK is
    not enabled, so there is no functional change for most architectures except
    for the addition of __GFP_ACCOUNT for the allocation of user page tables.

    pmd_free() is a wrapper for free_page() in all cases, so there is no
    functional change there.
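
    Roughly, the generic version looks like the following sketch (simplified;
    the real code lives in include/asm-generic/pgalloc.h):

    static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
    {
            struct page *page;
            gfp_t gfp = GFP_PGTABLE_USER;   /* includes __GFP_ACCOUNT */

            if (mm == &init_mm)
                    gfp = GFP_PGTABLE_KERNEL;
            page = alloc_pages(gfp, 0);
            if (!page)
                    return NULL;
            if (!pgtable_pmd_page_ctor(page)) {
                    __free_pages(page, 0);
                    return NULL;
            }
            return (pmd_t *)page_address(page);
    }

    static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
    {
            pgtable_pmd_page_dtor(virt_to_page(pmd));
            free_page((unsigned long)pmd);
    }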

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Cc: Matthew Wilcox
    Cc: Abdul Haleem
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Christophe Leroy
    Cc: Joerg Roedel
    Cc: Joerg Roedel
    Cc: Max Filippov
    Cc: Peter Zijlstra (Intel)
    Cc: Satheesh Rajendran
    Cc: Stafford Horne
    Cc: Stephen Rothwell
    Cc: Steven Rostedt
    Cc: Geert Uytterhoeven
    Link: http://lkml.kernel.org/r/20200627143453.31835-5-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • xtensa clears PTEs during allocation of the page tables and pte_clear()
    sets the PTE to a non-zero value. Splitting ptes_clear() helper out of
    pte_alloc_one() and pte_alloc_one_kernel() allows reuse of base generic
    allocation methods (__pte_alloc_one() and __pte_alloc_one_kernel()) and
    the common GFP mask for page table allocations.

    The pte_free() and pte_free_kernel() implementations on xtensa are
    identical to the generic ones and can be dropped.

    [jcmvbkbc@gmail.com: xtensa: fix closing endif comment]
    Link: http://lkml.kernel.org/r/20200721024751.1257-1-jcmvbkbc@gmail.com

    Signed-off-by: Mike Rapoport
    Signed-off-by: Max Filippov
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Cc: Abdul Haleem
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Christophe Leroy
    Cc: Joerg Roedel
    Cc: Joerg Roedel
    Cc: Max Filippov
    Cc: Peter Zijlstra (Intel)
    Cc: Satheesh Rajendran
    Cc: Stafford Horne
    Cc: Stephen Rothwell
    Cc: Steven Rostedt
    Cc: Geert Uytterhoeven
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200627143453.31835-4-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Replace pte_alloc_one(), pte_free() and pte_free_kernel() with the generic
    implementation. The only actual functional change is the addition of
    __GFP_ACCOUNT for the allocation of the user page tables.

    pte_alloc_one_kernel() is kept because its implementation on openrisc
    differs from the generic one.

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Acked-by: Stafford Horne
    Cc: Abdul Haleem
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Christophe Leroy
    Cc: Joerg Roedel
    Cc: Joerg Roedel
    Cc: Max Filippov
    Cc: Peter Zijlstra (Intel)
    Cc: Satheesh Rajendran
    Cc: Stephen Rothwell
    Cc: Steven Rostedt
    Cc: Geert Uytterhoeven
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200627143453.31835-3-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Patch series "mm: cleanup usage of "

    Most architectures have very similar versions of pXd_alloc_one() and
    pXd_free_one() for intermediate levels of page table. These patches add
    generic versions of these functions in <asm-generic/pgalloc.h> and enable
    use of the generic functions where appropriate.

    In addition, functions declared and defined in <asm/pgalloc.h> are used
    mostly by core mm and early mm initialization in arch code, and there is no
    actual reason to have <asm/pgalloc.h> included all over the place. The
    first patch in this series removes unneeded includes of <asm/pgalloc.h>.
    In the end it didn't work out as neatly as I hoped and moving
    pXd_alloc_track() definitions to <asm-generic/pgalloc.h> would require
    unnecessary changes to arches that have custom page table allocations, so
    I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local
    to mm/.

    This patch (of 8):

    In most cases the <asm/pgalloc.h> header is required only for allocations
    of page table memory. Most of the .c files that include that header do not
    use symbols declared in <asm/pgalloc.h> and do not require that header.

    As for the other header files that used to include <asm/pgalloc.h>, it is
    possible to move that include into the .c file that actually uses symbols
    from <asm/pgalloc.h> and drop the include from the header file.

    The process was somewhat automated using

    sed -i -E '/[<"]asm\/pgalloc\.h/d'

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Acked-by: Geert Uytterhoeven [m68k]
    Cc: Abdul Haleem
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Christophe Leroy
    Cc: Joerg Roedel
    Cc: Max Filippov
    Cc: Peter Zijlstra
    Cc: Satheesh Rajendran
    Cc: Stafford Horne
    Cc: Stephen Rothwell
    Cc: Steven Rostedt
    Cc: Joerg Roedel
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • This function implicitly assumes that the addr passed in is page aligned.
    A non-page-aligned addr could ultimately cause a kernel bug in
    remap_pte_range(), as the exit condition of its loop may never be
    satisfied. This patch documents the requirement and explicitly adds a
    check for it.
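
    The added guard can be as small as the following sketch at the top of the
    function (the exact form is an assumption):

    if (WARN_ON_ONCE(!PAGE_ALIGNED(addr)))
            return -EINVAL;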

    Signed-off-by: Alex Zhang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200617233512.177519-1-zhangalex@google.com
    Signed-off-by: Linus Torvalds

    Alex Zhang
     
  • In zap_pte_range(), the check for non_swap_entry() and
    is_device_private_entry() is unnecessary since the latter is sufficient to
    determine if the page is a device private page. Remove the test for
    non_swap_entry() to simplify the code and for clarity.

    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Gunthorpe
    Acked-by: David Hildenbrand
    Link: http://lkml.kernel.org/r/20200615175405.4613-1-rcampbell@nvidia.com
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     
  • When a workload runs in cgroups that aren't directly below the root cgroup
    and their parent specifies reclaim protection, the protection may end up
    ineffective.

    The reason is that propagate_protected_usage() is not called all the way up
    the hierarchy, so all the protected usage is incorrectly accumulated in the
    workload's parent. This means that siblings_low_usage is overestimated and
    the effective protection underestimated. Even though it is a transitional
    phenomenon (the uncharge path does propagate correctly and fixes the wrong
    children_low_usage), it can undermine the intended protection unexpectedly.

    We have noticed this problem while seeing a swap out in a descendant of a
    protected memcg (intermediate node) while the parent was conveniently
    under its protection limit and the memory pressure was external to that
    hierarchy. Michal has pinpointed this down to the wrong
    siblings_low_usage which led to the unwanted reclaim.

    The fix is simply to update children_low_usage in the respective ancestors
    on the charging path as well.
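
    Concretely, the charge path gains the same propagation the uncharge path
    already does; a hedged sketch:

    void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
    {
            struct page_counter *c;

            for (c = counter; c; c = c->parent) {
                    long new;

                    new = atomic_long_add_return(nr_pages, &c->usage);
                    /*
                     * The added call: keep children_low_usage in the
                     * ancestors up to date on charge, not only on uncharge.
                     */
                    propagate_protected_usage(c, new);
                    /* watermark tracking etc. unchanged */
            }
    }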

    Fixes: 230671533d64 ("mm: memory.low hierarchical behavior")
    Signed-off-by: Michal Koutný
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: [4.18+]
    Link: http://lkml.kernel.org/r/20200803153231.15477-1-mhocko@kernel.org
    Signed-off-by: Linus Torvalds

    Michal Koutný
     
  • When an outside process lowers one of the memory limits of a cgroup (or
    uses the force_empty knob in cgroup1), direct reclaim is performed in the
    context of the write(), in order to directly enforce the new limit and
    have it met by the time the write() returns.

    Currently, this reclaim activity is accounted as memory pressure in the
    cgroup that the writer(!) belongs to. This is unexpected. It
    specifically causes problems for senpai
    (https://github.com/facebookincubator/senpai), which is an agent that
    routinely adjusts the memory limits and performs associated reclaim work
    in tens or even hundreds of cgroups running on the host. The cgroup that
    senpai is running in itself will report elevated levels of memory
    pressure, even though it itself is under no memory shortage or any sort of
    distress.

    Move the psi annotation from the central cgroup reclaim function to
    callsites in the allocation context, and thereby no longer count any
    limit-setting reclaim as memory pressure. If the newly set limit pushes
    the workload inside the cgroup into direct reclaim, that of course will
    continue to count as memory pressure.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Acked-by: Chris Down
    Acked-by: Michal Hocko
    Link: http://lkml.kernel.org/r/20200728135210.379885-2-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 8c8c383c04f6 ("mm: memcontrol: try harder to set a new
    memory.high") inadvertently removed a callback to recalculate the
    writeback cache size in light of a newly configured memory.high limit.

    Without letting the writeback cache know about a potentially heavily
    reduced limit, it may permit too many dirty pages, which can cause
    unnecessary reclaim latencies or even avoidable OOM situations.

    This was spotted while reading the code; it isn't known to have caused any
    problems in practice so far.

    Fixes: 8c8c383c04f6 ("mm: memcontrol: try harder to set a new memory.high")
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Chris Down
    Acked-by: Michal Hocko
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/20200728135210.379885-1-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memcg oom killer invocation is synchronized by the global oom_lock, and
    tasks sleep on the lock while somebody is selecting the victim, potentially
    racing with the oom_reaper which is releasing the victim's memory. This
    can result in a pointless oom killer invocation because a waiter might be
    racing with the oom_reaper:

    P1                     oom_reaper              P2
                           oom_reap_task           mutex_lock(oom_lock)
                                                   out_of_memory
                                                     # no victim because we
                                                     # have one already
                           __oom_reap_task_mm      mutex_unlock(oom_lock)
    mutex_lock(oom_lock)
                           set MMF_OOM_SKIP
    select_bad_process
      # finds a new victim

    The page allocator prevents this race by trying the allocation again once
    the lock can be acquired (in __alloc_pages_may_oom), which acts as a
    last-minute check. Moreover, the page allocator doesn't block on the
    oom_lock and simply retries the whole reclaim process.

    The memcg oom killer should do the last-minute check as well; call
    mem_cgroup_margin() to do that. A trylock on the oom_lock could be done as
    well, but this doesn't seem to be necessary at this stage.
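
    A hedged sketch of that check in mem_cgroup_out_of_memory(), after the
    oom_lock is taken and before a victim is selected:

    mutex_lock(&oom_lock);
    /*
     * A racing oom_reaper (or another task's uncharge) may have freed
     * enough memory while we were waiting for the lock.
     */
    if (mem_cgroup_margin(memcg) >= (1 << order))
            goto unlock;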

    [mhocko@kernel.org: commit log]

    Suggested-by: Michal Hocko
    Signed-off-by: Yafang Shao
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Chris Down
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Johannes Weiner
    Link: http://lkml.kernel.org/r/1594735034-19190-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • mem_cgroup_protected() is currently used both to set the effective low and
    min and to return a mem_cgroup_protection based on the result. As a user,
    this can be a little unexpected: it appears to be a simple predicate function,
    if not for the big warning in the comment above about the order in which
    it must be executed.

    This change makes it so that we separate the state mutations from the
    actual protection checks, which makes it more obvious where we need to be
    careful mutating internal state, and where we are simply checking and
    don't need to worry about that.
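
    The reworked shape, roughly (treat the exact names and signatures as
    approximate, not a quote of the final code):

    /* mutate the effective protection state once per reclaimed memcg */
    mem_cgroup_calculate_protection(target_memcg, memcg);

    /* ...then purely read-only checks decide how to treat it */
    if (mem_cgroup_below_min(memcg)) {
            /* hard protection: skip unless nothing else is reclaimable */
    } else if (mem_cgroup_below_low(memcg)) {
            /* soft protection: skip unless low reclaim is allowed */
    }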

    [mhocko@suse.com - don't check protection on root memcgs]

    Suggested-by: Johannes Weiner
    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Roman Gushchin
    Cc: Yafang Shao
    Link: http://lkml.kernel.org/r/ff3f915097fcee9f6d7041c084ef92d16aaeb56a.1594638158.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • Patch series "mm, memcg: memory.{low,min} reclaim fix & cleanup", v4.

    This series contains a fix for an edge case in my earlier protection
    calculation patches, and a patch to make the area overall a little more
    robust to hopefully help avoid this in future.

    This patch (of 2):

    A cgroup can have both memory protection and a memory limit to isolate it
    from its siblings in both directions - for example, to prevent it from
    being shrunk below 2G under high pressure from outside, but also from
    growing beyond 4G under low pressure.

    Commit 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
    implemented proportional scan pressure so that multiple siblings in excess
    of their protection settings don't get reclaimed equally but instead in
    accordance to their unprotected portion.

    During limit reclaim, this proportionality shouldn't apply of course:
    there is no competition, all pressure is from within the cgroup and should
    be applied as such. Reclaim should operate at full efficiency.

    However, mem_cgroup_protected() never expected anybody to look at the
    effective protection values when it indicated that the cgroup is above its
    protection. As a result, a query during limit reclaim may return stale
    protection values that were calculated by a previous reclaim cycle in
    which the cgroup did have siblings.

    When this happens, reclaim is unnecessarily hesitant and potentially slow
    to meet the desired limit. In theory this could lead to premature OOM
    kills, although it's not obvious this has occurred in practice.

    Work around the problem by special-casing reclaim roots in
    mem_cgroup_protection(). These memcgs never participate in the reclaim
    protection because the reclaim is internal.

    We have to ignore effective protection values for reclaim roots because
    mem_cgroup_protected might be called from racing reclaim contexts with
    different roots. The calculation relies on a root -> leaf tree traversal,
    therefore top-down reclaim protection invariants should hold. The only
    exception is the reclaim root, which should have its effective protection
    set to 0, but that would be problematic for the following setup:

    Let's have global and A's reclaim in parallel:
    |
    A (low=2G, usage = 3G, max = 3G, children_low_usage = 1.5G)
    |\
    | C (low = 1G, usage = 2.5G)
    B (low = 1G, usage = 0.5G)

    for A reclaim we have
    B.elow = B.low
    C.elow = C.low

    For the global reclaim
    A.elow = A.low
    B.elow = min(B.usage, B.low) because children_low_usage <= A.elow
    C.elow = min(C.usage, C.low)

    With the effective values resetting we have A reclaim
    A.elow = 0
    B.elow = B.low
    C.elow = C.low

    and global reclaim could see the above and then
    B.elow = C.elow = 0 because children_low_usage > A.elow

    Which means that protected memcgs would get reclaimed.

    In the future we would like to make mem_cgroup_protected() more robust
    against racing reclaim contexts, but that is likely a more complex solution
    than this simple workaround.

    [hannes@cmpxchg.org - large part of the changelog]
    [mhocko@suse.com - workaround explanation]
    [chris@chrisdown.name - retitle]

    Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
    Signed-off-by: Yafang Shao
    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Chris Down
    Acked-by: Roman Gushchin
    Link: http://lkml.kernel.org/r/cover.1594638158.git.chris@chrisdown.name
    Link: http://lkml.kernel.org/r/044fb8ecffd001c7905d27c0c2ad998069fdc396.1594638158.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • Reclaim retries have been set to 5 since the beginning of time in
    commit 66e1707bc346 ("Memory controller: add per cgroup LRU and
    reclaim"). However, we now have a generally agreed-upon standard for
    page reclaim: MAX_RECLAIM_RETRIES (currently 16), added many years later
    in commit 0a0337e0d1d1 ("mm, oom: rework oom detection").

    In the absence of a compelling reason to declare an OOM earlier in memcg
    context than page allocator context, it seems reasonable to supplant
    MEM_CGROUP_RECLAIM_RETRIES with MAX_RECLAIM_RETRIES, making the page
    allocator and memcg internals more similar in semantics when reclaim
    fails to produce results, avoiding premature OOMs or throttling.

    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/da557856c9c7654308eaff4eedc1952a95e8df5f.1594640214.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • Patch series "mm, memcg: reclaim harder before high throttling", v2.

    This patch (of 2):

    In Facebook production, we've seen cases where cgroups have been put into
    allocator throttling even when they appear to have a lot of slack file
    caches which should be trivially reclaimable.

    Looking more closely, the problem is that we only try a single cgroup
    reclaim walk for each return to usermode before calculating whether or not
    we should throttle. This single attempt doesn't produce enough pressure
    to shrink for cgroups with a rapidly growing amount of file caches prior
    to entering allocator throttling.

    As an example, we see that threads in an affected cgroup are stuck in
    allocator throttling:

    # for i in $(cat cgroup.threads); do
    > grep over_high "/proc/$i/stack"
    > done
    [] mem_cgroup_handle_over_high+0x10b/0x150
    [] mem_cgroup_handle_over_high+0x10b/0x150
    [] mem_cgroup_handle_over_high+0x10b/0x150

    ...however, there is no I/O pressure reported by PSI, despite a lot of
    slack file pages:

    # cat memory.pressure
    some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903
    full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959
    # cat io.pressure
    some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391
    full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640
    # grep _file memory.stat
    inactive_file 1370939392
    active_file 661635072

    This patch changes the behaviour to retry reclaim either until the current
    task goes below the 10ms grace period, or we are making no reclaim
    progress at all. In the latter case, we enter reclaim throttling as
    before.
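
    In pseudo code, the reworked mem_cgroup_handle_over_high() loop looks
    roughly like this (helper names and signatures simplified):

    retry_reclaim:
            nr_reclaimed = reclaim_high(memcg, nr_pages, GFP_KERNEL);
            penalty_jiffies = calculate_high_delay(memcg, nr_pages);

            if (penalty_jiffies <= HZ / 100)
                    goto out;               /* under the ~10ms grace period */

            if (nr_reclaimed || nr_retries--)
                    goto retry_reclaim;     /* still making forward progress */

            /* no progress at all: throttle as before */
            schedule_timeout_killable(penalty_jiffies);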

    To a user, there's no intuitive reason for the reclaim behaviour to differ
    from hitting memory.high as part of a new allocation, as opposed to
    hitting memory.high because someone lowered its value. As such this also
    brings an added benefit: it unifies the reclaim behaviour between the two.

    There's precedent for this behaviour: we already do reclaim retries when
    writing to memory.{high,max}, in max reclaim, and in the page allocator
    itself.

    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/cover.1594640214.git.chris@chrisdown.name
    Link: http://lkml.kernel.org/r/a4e23b59e9ef499b575ae73a8120ee089b7d3373.1594640214.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • The memory.high limit is implemented in a way such that the kernel
    penalizes all threads which are allocating memory over the limit. Forcing
    all threads into synchronous reclaim and adding some artificial delays
    allows slowing down the memory consumption and potentially gives some time
    for userspace oom handlers/resource control agents to react.

    It works nicely if the memory usage is hitting the limit from below, but
    it works sub-optimally if a user adjusts memory.high to a value way below
    the current memory usage. It basically forces all workload threads (doing
    any memory allocations) into synchronous reclaim and sleep. This makes the
    workload completely unresponsive for a long period of time and can also
    lead to a system-wide contention on lru locks. It can happen even if the
    workload is not actually tight on memory and has, for example, a ton of
    cold pagecache.

    In the current implementation, writing to memory.high causes an atomic
    update of the page counter's high value followed by an attempt to reclaim
    enough memory to fit into the new limit. To fix the problem described
    above, all we need is to change the order of execution: try to push the
    memory usage under the limit first, and only then set the new high limit.
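
    A hedged sketch of the reordered write path (simplified from the
    description; not the exact upstream code):

    /* push usage toward the new target first ... */
    for (;;) {
            unsigned long nr_pages = page_counter_read(&memcg->memory);

            if (nr_pages <= high || signal_pending(current))
                    break;
            if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
                                              GFP_KERNEL, true) &&
                !nr_retries--)
                    break;
    }
    /*
     * ... and only then publish the new limit, so allocators don't all
     * pile into synchronous reclaim against a target far below usage.
     */
    page_counter_set_high(&memcg->memory, high);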

    Reported-by: Domas Mituzas
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Chris Down
    Link: http://lkml.kernel.org/r/20200709194718.189231-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently memcg_kmem_enabled() is optimized for the kernel memory
    accounting being off. It was so for a long time, and arguably the reason
    behind was that the kernel memory accounting was initially an opt-in
    feature. However, now it's on by default on both cgroup v1 and cgroup v2,
    and it's on for all cgroups. So let's switch over to
    static_branch_likely() to reflect this fact.

    There is unlikely to be a significant performance difference, as the cost
    of a memory allocation and its accounting significantly exceeds the cost of
    a jump. However, the conversion makes the code more logical.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200707173612.124425-3-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • charge_slab_page() and uncharge_slab_page() are not related anymore to
    memcg charging and uncharging. In order to make their names less
    confusing, let's rename them to account_slab_page() and
    unaccount_slab_page() respectively.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200707173612.124425-2-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • charge_slab_page() is not using the gfp argument anymore, so remove it.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200707173612.124425-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently the kernel stack is being accounted per-zone. There is no need
    to do that. In addition due to being per-zone, memcg has to keep a
    separate MEMCG_KERNEL_STACK_KB. Make the stat per-node and deprecate
    MEMCG_KERNEL_STACK_KB as memcg_stat_item is an extension of
    node_stat_item. In addition localize the kernel stack stats updates to
    account_kernel_stack().

    Signed-off-by: Shakeel Butt
    Signed-off-by: Andrew Morton
    Reviewed-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200630161539.1759185-1-shakeelb@google.com
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • Add a drgn-based tool to display slab information for a given memcg. It
    can replace the cgroup v1 memory.kmem.slabinfo interface on cgroup v2, but
    in a more flexible way.

    Currently it supports only the SLUB configuration, but SLAB support can be
    trivially added later.

    Output example:
    $ sudo ./tools/cgroup/memcg_slabinfo.py /sys/fs/cgroup/user.slice/user-111017.slice/user\@111017.service
    shmem_inode_cache 92 92 704 46 8 : tunables 0 0 0 : slabdata 2 2 0
    eventpoll_pwq 56 56 72 56 1 : tunables 0 0 0 : slabdata 1 1 0
    eventpoll_epi 32 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
    kmalloc-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0
    kmalloc-96 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
    kmalloc-2048 0 0 2048 16 8 : tunables 0 0 0 : slabdata 0 0 0
    kmalloc-64 128 128 64 64 1 : tunables 0 0 0 : slabdata 2 2 0
    mm_struct 160 160 1024 32 8 : tunables 0 0 0 : slabdata 5 5 0
    signal_cache 96 96 1024 32 8 : tunables 0 0 0 : slabdata 3 3 0
    sighand_cache 45 45 2112 15 8 : tunables 0 0 0 : slabdata 3 3 0
    files_cache 138 138 704 46 8 : tunables 0 0 0 : slabdata 3 3 0
    task_delay_info 153 153 80 51 1 : tunables 0 0 0 : slabdata 3 3 0
    task_struct 27 27 3520 9 8 : tunables 0 0 0 : slabdata 3 3 0
    radix_tree_node 56 56 584 28 4 : tunables 0 0 0 : slabdata 2 2 0
    btrfs_inode 140 140 1136 28 8 : tunables 0 0 0 : slabdata 5 5 0
    kmalloc-1024 64 64 1024 32 8 : tunables 0 0 0 : slabdata 2 2 0
    kmalloc-192 84 84 192 42 2 : tunables 0 0 0 : slabdata 2 2 0
    inode_cache 54 54 600 27 4 : tunables 0 0 0 : slabdata 2 2 0
    kmalloc-128 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
    kmalloc-512 32 32 512 32 4 : tunables 0 0 0 : slabdata 1 1 0
    skbuff_head_cache 32 32 256 32 2 : tunables 0 0 0 : slabdata 1 1 0
    sock_inode_cache 46 46 704 46 8 : tunables 0 0 0 : slabdata 1 1 0
    cred_jar 378 378 192 42 2 : tunables 0 0 0 : slabdata 9 9 0
    proc_inode_cache 96 96 672 24 4 : tunables 0 0 0 : slabdata 4 4 0
    dentry 336 336 192 42 2 : tunables 0 0 0 : slabdata 8 8 0
    filp 697 864 256 32 2 : tunables 0 0 0 : slabdata 27 27 0
    anon_vma 644 644 88 46 1 : tunables 0 0 0 : slabdata 14 14 0
    pid 1408 1408 64 64 1 : tunables 0 0 0 : slabdata 22 22 0
    vm_area_struct 1200 1200 200 40 2 : tunables 0 0 0 : slabdata 30 30 0

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Acked-by: Tejun Heo
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Shakeel Butt
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20200623174037.3951353-20-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Add some tests to cover the kernel memory accounting functionality. These
    cover some issues (and changes) we had recently.

    1) A test which allocates a lot of negative dentries, checks memcg slab
    statistics, creates memory pressure by setting memory.max to some low
    value and checks that some number of slabs was reclaimed.

    2) A test which covers side effects of memcg destruction: it creates
    and destroys a large number of sub-cgroups, each containing a
    multi-threaded workload which allocates and releases some kernel
    memory. Then it checks that the charges and memory.stat values do add up on
    the parent level.

    3) A test which reads /proc/kpagecgroup and implicitly checks that it
    doesn't crash the system.

    4) A test which spawns a large number of threads and checks that the
    kernel stacks accounting works as expected.

    5) A test which checks that live charged slab objects are not
    preventing the memory cgroup from being released after it has been deleted
    by a user.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Shakeel Butt
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20200623174037.3951353-19-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Instead of having two sets of kmem_caches: one for system-wide and
    non-accounted allocations and the second one shared by all accounted
    allocations, we can use just one.

    The idea is simple: space for obj_cgroup metadata can be allocated on
    demand and filled only for accounted allocations.

    It allows removing a bunch of code which is required to handle kmem_cache
    clones for accounted allocations. There is no more need to create them,
    accumulate statistics, propagate attributes, etc. It's quite a significant
    simplification.

    Also, because the total number of slab_caches is almost halved (not all
    kmem_caches have a memcg clone), some additional memory savings are
    expected. On my devvm it additionally saves about 3.5% of slab memory.

    [guro@fb.com: fix build on MIPS]
    Link: http://lkml.kernel.org/r/20200717214810.3733082-1-guro@fb.com

    Suggested-by: Johannes Weiner
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Naresh Kamboju
    Link: http://lkml.kernel.org/r/20200623174037.3951353-18-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • memcg_accumulate_slabinfo() is never called with a non-root kmem_cache as
    a first argument, so the is_root_cache(s) check is redundant and can be
    removed without any functional change.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-17-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently there are two lists of kmem_caches:
    1) slab_caches, which contains all kmem_caches,
    2) slab_root_caches, which contains only root kmem_caches.

    And there is some preprocessor magic to have a single list if
    CONFIG_MEMCG_KMEM isn't enabled.

    It was required earlier because the number of non-root kmem_caches was
    proportional to the number of memory cgroups and could reach really big
    values. Now, when it cannot exceed the number of root kmem_caches, there
    is really no reason to maintain two lists.

    We never iterate over the slab_root_caches list on any hot paths, so it's
    perfectly fine to iterate over slab_caches and filter out non-root
    kmem_caches.

    It allows removing a lot of config-dependent code and two pointers from
    the kmem_cache structure.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-16-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • The memcg_kmem_get_cache() function became really trivial, so let's just
    inline it into the single call point: memcg_slab_pre_alloc_hook().

    It will make the code less bulky and can also help the compiler to
    generate better code.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-15-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Because the number of non-root kmem_caches doesn't depend on the number of
    memory cgroups anymore and is generally not very big, there is no more
    need for a dedicated workqueue.

    Also, as there is no more need to pass any arguments to the
    memcg_create_kmem_cache() except the root kmem_cache, it's possible to
    just embed the work structure into the kmem_cache and avoid the dynamic
    allocation of the work structure.

    This will also simplify the synchronization: for each root kmem_cache
    there is only one work. So there will be no more concurrent attempts to
    create a non-root kmem_cache for a root kmem_cache: the second and all
    following attempts to queue the work will fail.

    On the kmem_cache destruction path there is no more need to call the
    expensive flush_workqueue() and wait for all pending works to be finished.
    Instead, cancel_work_sync() can be used to cancel/wait for only one work.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-14-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This is a fairly big but mostly red patch, which makes all accounted slab
    allocations use a single set of kmem_caches instead of creating a separate
    set for each memory cgroup.

    Because the number of non-root kmem_caches is now capped by the number of
    root kmem_caches, there is no need to shrink or destroy them prematurely.
    They can be perfectly destroyed together with their root counterparts.
    This allows to dramatically simplify the management of non-root
    kmem_caches and delete a ton of code.

    This patch performs the following changes:
    1) introduces memcg_params.memcg_cache pointer to represent the
    kmem_cache which will be used for all non-root allocations
    2) reuses the existing memcg kmem_cache creation mechanism
    to create memcg kmem_cache on the first allocation attempt
    3) memcg kmem_caches are named <root kmem_cache name>-memcg,
    e.g. dentry-memcg
    4) simplifies memcg_kmem_get_cache() to just return the memcg kmem_cache
    or schedule its creation and return the root cache
    5) removes almost all non-root kmem_cache management code
    (separate refcounter, reparenting, shrinking, etc)
    6) makes the slab debugfs display the root_mem_cgroup css id and never
    show the :dead and :deact flags in the memcg_slabinfo attribute.

    Following patches in the series will simplify the kmem_cache creation.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-13-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • To make the memcg_kmem_bypass() function available outside of
    memcontrol.c, let's move it to memcontrol.h. The function is small and
    fits nicely as a static inline.

    It will be used from the slab code.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-12-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Deprecate memory.kmem.slabinfo.

    An empty file will be presented if corresponding config options are
    enabled.

    The interface is implementation dependent, isn't present in cgroup v2, and
    is generally useful only for core mm debugging purposes. In other words,
    it doesn't provide any value for the absolute majority of users.

    A drgn-based replacement can be found in
    tools/cgroup/memcg_slabinfo.py. It supports cgroup v1 and v2, mimics the
    memory.kmem.slabinfo output and also allows getting any additional
    information without the need to recompile the kernel.

    If a drgn-based solution is too slow for a task, a bpf-based tracing tool
    can be used, which can easily keep track of all slab allocations belonging
    to a memory cgroup.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-11-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Switch to per-object accounting of non-root slab objects.

    Charging is performed using the obj_cgroup API in the pre_alloc hook. The
    obj_cgroup is charged with the size of the object plus the size of the
    metadata: as of now, that is the size of an obj_cgroup pointer. If the
    amount of memory has been charged successfully, the actual allocation code
    is executed. Otherwise, -ENOMEM is returned.

    In the post_alloc hook, if the actual allocation succeeded, the
    corresponding vmstats are bumped and the obj_cgroup pointer is saved.
    Otherwise, the charge is canceled.

    On the free path the obj_cgroup pointer is obtained and used to uncharge
    the size of the object being released.

    Memcg and lruvec counters are now representing only memory used by active
    slab objects and do not include the free space. The free space is shared
    and doesn't belong to any specific cgroup.

    Global per-node slab vmstats are still modified from
    (un)charge_slab_page() functions. The idea is to keep all slab pages
    accounted as slab pages on system level.
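
    A hedged sketch of the hook flow (helper names approximate, simplified
    from the description above):

    /* pre_alloc: charge object size plus the obj_cgroup pointer metadata */
    objcg = get_obj_cgroup_from_current();
    if (objcg && obj_cgroup_charge(objcg, flags,
                                   size + sizeof(struct obj_cgroup *)))
            return NULL;                    /* -ENOMEM */

    /* post_alloc (on success): remember which objcg owns this object */
    page_obj_cgroups(page)[obj_to_index(s, page, p)] = objcg;

    /* free path: look the owner up and uncharge the same amount */
    obj_cgroup_uncharge(objcg, size + sizeof(struct obj_cgroup *));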

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-10-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Store the obj_cgroup pointer in the corresponding place of
    page->obj_cgroups for each allocated non-root slab object. Make sure that
    each allocated object holds a reference to obj_cgroup.

    The objcg pointer is obtained by dereferencing memcg->objcg in
    memcg_kmem_get_cache() and is passed from the pre_alloc hook to the
    post_alloc hook. Then, in case of successful allocation(s), it gets stored
    in the page->obj_cgroups vector.

    The objcg-obtaining part looks a bit bulky now, but it will be simplified
    by later commits in the series.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-9-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin