17 Apr, 2019

1 commit

  • commit 0b3d6e6f2dd0a7b697b1aa8c167265908940624b upstream.

    Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
    memory.stat reporting") memcg dirty and writeback counters are managed
    as:

    1) per-memcg per-cpu values in range of [-32..32]

    2) per-memcg atomic counter

    When a per-cpu counter cannot fit in [-32..32] it's flushed to the
    atomic. Stat readers only check the atomic. Thus readers such as
    balance_dirty_pages() may see a nontrivial error margin: 32 pages per
    cpu.

    Assuming 100 cpus:
    4k x86 page_size: 13 MiB error per memcg
    64k ppc page_size: 200 MiB error per memcg

    Considering that dirty+writeback are used together for some decisions the
    errors double.
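
    For reference, those numbers follow directly from the per-cpu batch
    size (a worked calculation, assuming the 100 cpus above):

    	100 cpus * 32 pages *  4 KiB/page = 12800 KiB  ~= 13 MiB
    	100 cpus * 32 pages * 64 KiB/page = 204800 KiB  = 200 MiB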

    This inaccuracy can lead to undeserved oom kills. One nasty case is
    when all per-cpu counters hold positive values offsetting an atomic
    negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
    balance_dirty_pages() only consults the atomic and does not consider
    throttling the next n_cpu*32 dirty pages. If the file_lru is in the
    13..200 MiB range then there's absolutely no dirty throttling, which
    burdens vmscan with only dirty+writeback pages thus resorting to oom
    kill.

    It could be argued that tiny containers are not supported, but it's more
    subtle. It's the amount of space available for the file lru that matters.
    If a container has memory.max-200MiB of non-reclaimable memory, then it
    will also suffer such oom kills on a 100 cpu machine.

    The following test reliably ooms without this patch. This patch avoids
    oom kills.

    $ cat test
    mount -t cgroup2 none /dev/cgroup
    cd /dev/cgroup
    echo +io +memory > cgroup.subtree_control
    mkdir test
    cd test
    echo 10M > memory.max
    (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
    (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)

    $ cat memcg-writeback-stress.c
    /*
     * Dirty pages from all but one cpu.
     * Clean pages from the non dirtying cpu.
     * This is to stress per cpu counter imbalance.
     * On a 100 cpu machine:
     * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
     * - per memcg atomic is -99*32 pages
     * - thus the complete dirty limit: sum of all counters 0
     * - balance_dirty_pages() only sees atomic count -99*32 pages, which
     *   it max()s to 0.
     * - So a workload can dirty -99*32 pages before balance_dirty_pages()
     *   cares.
     */
    #define _GNU_SOURCE
    #include <err.h>
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/sysinfo.h>
    #include <sys/types.h>
    #include <unistd.h>

    static char *buf;
    static int bufSize;

    static void set_affinity(int cpu)
    {
    	cpu_set_t affinity;

    	CPU_ZERO(&affinity);
    	CPU_SET(cpu, &affinity);
    	if (sched_setaffinity(0, sizeof(affinity), &affinity))
    		err(1, "sched_setaffinity");
    }

    static void dirty_on(int output_fd, int cpu)
    {
    	int i, wrote;

    	set_affinity(cpu);
    	for (i = 0; i < 32; i++) {
    		for (wrote = 0; wrote < bufSize; ) {
    			int ret = write(output_fd, buf+wrote, bufSize-wrote);
    			if (ret == -1)
    				err(1, "write");
    			wrote += ret;
    		}
    	}
    }

    int main(int argc, char **argv)
    {
    	int cpu, flush_cpu = 1, output_fd;
    	const char *output;

    	if (argc != 2)
    		errx(1, "usage: output_file");

    	output = argv[1];
    	bufSize = getpagesize();
    	buf = malloc(getpagesize());
    	if (buf == NULL)
    		errx(1, "malloc failed");

    	output_fd = open(output, O_CREAT|O_RDWR);
    	if (output_fd == -1)
    		err(1, "open(%s)", output);

    	for (cpu = 0; cpu < get_nprocs(); cpu++) {
    		if (cpu != flush_cpu)
    			dirty_on(output_fd, cpu);
    	}

    	set_affinity(flush_cpu);
    	if (fsync(output_fd))
    		err(1, "fsync(%s)", output);
    	if (close(output_fd))
    		err(1, "close(%s)", output);
    	free(buf);
    }

    Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
    collect exact per memcg counters. This avoids the aforementioned oom
    kills.

    This does not affect the overhead of memory.stat, which still reads the
    single atomic counter.

    Why not use percpu_counter? memcg already handles cpus going offline, so
    no need for that overhead from percpu_counter. And the percpu_counter
    spinlocks are more heavyweight than is required.
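
    To make the error margin concrete, here is a minimal userspace model
    of the two-level counter described above (a sketch only; all names and
    structure are made up for illustration, this is not the kernel patch):

    /* Model: per-cpu deltas in [-32..32] are only flushed to a shared
     * atomic. An "exact" reader folds the unflushed deltas back in
     * instead of trusting the atomic alone. */
    #include <stdatomic.h>
    #include <stdio.h>

    #define NCPU  100
    #define BATCH 32

    static atomic_long total;          /* flushed, shared counter */
    static long pcpu_delta[NCPU];      /* unflushed per-cpu deltas */

    static long approx_read(void)      /* what stat readers saw before */
    {
    	return atomic_load(&total);
    }

    static long exact_read(void)       /* what balance_dirty_pages() wants */
    {
    	long val = atomic_load(&total);
    	int cpu;

    	for (cpu = 0; cpu < NCPU; cpu++)
    		val += pcpu_delta[cpu];
    	return val < 0 ? 0 : val;      /* transient negatives clamp to 0 */
    }

    int main(void)
    {
    	int cpu;

    	/* Every cpu holds +32 dirty pages that have not been flushed. */
    	for (cpu = 0; cpu < NCPU; cpu++)
    		pcpu_delta[cpu] = BATCH;

    	/* Prints approx=0 exact=3200: the nr_cpus * 32 page margin. */
    	printf("approx=%ld exact=%ld\n", approx_read(), exact_read());
    	return 0;
    }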

    It probably also makes sense to use exact dirty and writeback counters
    in memcg oom reports. But that is saved for later.

    Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.com
    Signed-off-by: Greg Thelen
    Reviewed-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Cc: [4.16+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Greg Thelen
     

06 Apr, 2019

1 commit

  • [ Upstream commit 7775face207922ea62a4e96b9cd45abfdc7b9840 ]

    If a memory cgroup contains a single process with many threads
    (including different process groups sharing the mm) then it is possible
    to trigger a race where the oom killer complains into the log that
    there are no oom eligible tasks, which is both annoying and confusing
    because there is no actual problem. The race looks as follows:

    P1                         oom_reaper                P2
    try_charge                                           try_charge
      mem_cgroup_out_of_memory
        mutex_lock(oom_lock)
          out_of_memory
            oom_kill_process(P1,P2)
              wake_oom_reaper
        mutex_unlock(oom_lock)
                               oom_reap_task
                                                         mutex_lock(oom_lock)
                                                           select_bad_process # no victim

    The problem is more visible with many threads.

    Fix this by checking for fatal_signal_pending from
    mem_cgroup_out_of_memory when the oom_lock is already held.

    The oom bypass is safe because we do the same early in the try_charge
    path already; the situation might have changed in the meantime. It
    should be safe to check for fatal_signal_pending and tsk_is_oom_victim,
    but for better code readability abstract the current charge bypass
    condition into should_force_charge and reuse it from that path.
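
    The abstracted bypass condition has roughly this shape (a sketch based
    on the description above, not the literal patch):

    	static bool should_force_charge(void)
    	{
    		/* The task is already dying or was already selected as an
    		 * OOM victim: let the charge proceed instead of invoking
    		 * the OOM killer again. */
    		return tsk_is_oom_victim(current) ||
    			fatal_signal_pending(current);
    	}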

    Link: http://lkml.kernel.org/r/01370f70-e1f6-ebe4-b95e-0df21a0bc15e@i-love.sakura.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: David Rientjes
    Cc: Kirill Tkhai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Tetsuo Handa
     

13 Jan, 2019

1 commit

  • commit 7056d3a37d2c6aaaab10c13e8e69adc67ec1fc65 upstream.

    Burt Holzman has noticed that memcg v1 doesn't notify about OOM events via
    eventfd anymore. The reason is that 29ef680ae7c2 ("memcg, oom: move
    out_of_memory back to the charge path") has moved the oom handling back to
    the charge path. While doing so the notification was left behind in
    mem_cgroup_oom_synchronize.

    Fix the issue by replicating the oom hierarchy locking and the
    notification.

    Link: http://lkml.kernel.org/r/20181224091107.18354-1-mhocko@kernel.org
    Fixes: 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
    Signed-off-by: Michal Hocko
    Reported-by: Burt Holzman
    Acked-by: Johannes Weiner
    Cc: Vladimir Davydov [4.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

05 Sep, 2018

1 commit

  • When the memcg OOM killer runs out of killable tasks, it currently
    prints a WARN with no further OOM context. This has caused some user
    confusion.

    Warnings indicate a kernel problem. In a reported case, however, the
    situation was triggered by a nonsensical memcg configuration (hard limit
    set to 0). But without any VM context this wasn't obvious from the
    report, and it took some back and forth on the mailing list to identify
    what is actually a trivial issue.

    Handle this OOM condition like we handle it in the global OOM killer:
    dump the full OOM context and tell the user we ran out of tasks.

    This way the user can identify misconfigurations easily by themselves
    and rectify the problem - without having to go through the hassle of
    running into an obscure but unsettling warning, finding the appropriate
    kernel mailing list and waiting for a kernel developer to remote-analyze
    that the memcg configuration caused this.

    If users cannot make sense of why the OOM killer was triggered or why it
    failed, they will still report it to the mailing list, we know that from
    experience. So in case there is an actual kernel bug causing this,
    kernel developers will very likely hear about it.

    Link: http://lkml.kernel.org/r/20180821160406.22578-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

23 Aug, 2018

2 commits

  • For some workloads an intervention from the OOM killer can be painful.
    Killing a random task can bring the workload into an inconsistent state.

    Historically, there are two common solutions for this
    problem:
    1) enabling panic_on_oom,
    2) using a userspace daemon to monitor OOMs and kill
    all outstanding processes.

    Both approaches have their downsides: rebooting on each OOM is an obvious
    waste of capacity, and handling all in userspace is tricky and requires a
    userspace agent, which will monitor all cgroups for OOMs.

    In most cases an in-kernel after-OOM cleaning-up mechanism can eliminate
    the necessity of enabling panic_on_oom. Also, it can simplify the cgroup
    management for userspace applications.

    This commit introduces a new knob for cgroup v2 memory controller:
    memory.oom.group. The knob determines whether the cgroup should be
    treated as an indivisible workload by the OOM killer. If set, all tasks
    belonging to the cgroup or to its descendants (if the memory cgroup is not
    a leaf cgroup) are killed together or not at all.

    To determine which cgroup has to be killed, we traverse the cgroup
    hierarchy from the victim task's cgroup up to the OOMing cgroup (or
    root), looking for the highest-level cgroup with memory.oom.group set.
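
    A sketch of that walk (simplified; the variable names are illustrative
    only):

    	struct mem_cgroup *oom_group = NULL, *memcg;

    	for (memcg = victim_memcg; memcg; memcg = parent_mem_cgroup(memcg)) {
    		/* Remember the highest level with memory.oom.group set. */
    		if (memcg->oom_group)
    			oom_group = memcg;
    		/* Stop at the OOMing memcg (or the root). */
    		if (memcg == oom_domain)
    			break;
    	}
    	/* If oom_group is set, kill every task in that subtree together. */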

    Tasks with the OOM protection (oom_score_adj set to -1000) are treated as
    an exception and are never killed.

    This patch doesn't change the OOM victim selection algorithm.

    Link: http://lkml.kernel.org/r/20180802003201.817-4-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
    Currently cgroup-v1's memcg_stat_show traverses the memcg tree ~17 times
    to collect the stats while cgroup-v2's memory_stat_show traverses the
    memcg tree thrice. On a large machine, a couple thousand memcgs is very
    normal and if the churn is high and memcgs stick around due to several
    reasons, tens of thousands of nodes can exist in the memcg tree. This
    patch refactors and shares the stat collection code between cgroup-v1
    and cgroup-v2 and reduces the tree traversal to just one.

    I ran a simple benchmark which reads the root_mem_cgroup's stat file
    1000 times in the presence of 2500 memcgs on cgroup-v1. The results are:

    Without the patch:
    $ time ./read-root-stat-1000-times

    real 0m1.663s
    user 0m0.000s
    sys 0m1.660s

    With the patch:
    $ time ./read-root-stat-1000-times

    real 0m0.468s
    user 0m0.000s
    sys 0m0.467s

    Link: http://lkml.kernel.org/r/20180724224635.143944-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Greg Thelen
    Cc: Bruce Merry
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

18 Aug, 2018

10 commits

    To avoid further unneeded calls of do_shrink_slab() for shrinkers which
    already do not have any charged objects in a memcg, their bits have to
    be cleared.

    This patch introduces a lockless mechanism to do that without races
    with parallel list_lru adds. After do_shrink_slab() returns
    SHRINK_EMPTY the first time, we clear the bit and call it once again.
    Then we restore the bit, if the new return value is different.
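
    In code, the clear-and-recheck step looks roughly like this (a
    simplified sketch of the logic described above, not the exact patch):

    	ret = do_shrink_slab(&sc, shrinker, priority);
    	if (ret == SHRINK_EMPTY) {
    		/* Nothing charged: clear our bit, then probe once more in
    		 * case an object showed up in between. */
    		clear_bit(i, map->map);
    		smp_mb__after_atomic();	/* pairs with the set_bit() side */
    		ret = do_shrink_slab(&sc, shrinker, priority);
    		if (ret == SHRINK_EMPTY)
    			ret = 0;
    		else
    			set_shrinker_bit(memcg, nid, i); /* restore the bit */
    	}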

    Note that the single smp_mb__after_atomic() in shrink_slab_memcg()
    covers two situations:

    1) list_lru_add()            shrink_slab_memcg()
         list_add_tail()           for_each_set_bit()
                                     do_shrink_slab()
         <MB>                        <MB>
         set_bit()                   do_shrink_slab()

       Here the first do_shrink_slab() can see the set bit but miss the
       list update (the race with the first element being queued). This
       case is rare, so we do not add a barrier before the first call of
       do_shrink_slab(), to not slow down the generic case; instead the
       second call is needed, as seen below in (2).

    2) list_lru_add()            shrink_slab_memcg()
         list_add_tail()           ...
         set_bit()                 ...
       ...                         for_each_set_bit()
       do_shrink_slab()              do_shrink_slab()
         clear_bit()                 ...
       ...                           ...
       list_lru_add()                ...
         list_add_tail()             clear_bit()
         <MB>                        <MB>
         set_bit()                   do_shrink_slab()

    The barriers guarantee that the second do_shrink_slab() in the
    right-side task sees the list update if it really cleared the bit.
    This case is drawn in the code comment.

    [Results/performance of the patchset]

    After the whole patchset is applied, the test below shows a significant
    increase in performance:

    $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
    $mkdir /sys/fs/cgroup/memory/ct
    $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
    $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i;
    echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
    mkdir -p s/$i; mount -t tmpfs $i s/$i;
    touch s/$i/file; done

    Then, 5 sequential calls of drop caches:

    $time echo 3 > /proc/sys/vm/drop_caches

    1)Before:
    0.00user 13.78system 0:13.78elapsed 99%CPU
    0.00user 5.59system 0:05.60elapsed 99%CPU
    0.00user 5.48system 0:05.48elapsed 99%CPU
    0.00user 8.35system 0:08.35elapsed 99%CPU
    0.00user 8.34system 0:08.35elapsed 99%CPU

    2)After
    0.00user 1.10system 0:01.10elapsed 99%CPU
    0.00user 0.00system 0:00.01elapsed 64%CPU
    0.00user 0.01system 0:00.01elapsed 82%CPU
    0.00user 0.00system 0:00.01elapsed 64%CPU
    0.00user 0.01system 0:00.01elapsed 82%CPU

    The results show that performance increases by at least 548 times.

    Shakeel Butt tested this patchset with fork-bomb on his configuration:

    > I created 255 memcgs, 255 ext4 mounts and made each memcg create a
    > file containing few KiBs on corresponding mount. Then in a separate
    > memcg of 200 MiB limit ran a fork-bomb.
    >
    > I ran the "perf record -ag -- sleep 60" and below are the results:
    >
    > Without the patch series:
    > Samples: 4M of event 'cycles', Event count (approx.): 3279403076005
    > + 36.40% fb.sh [kernel.kallsyms] [k] shrink_slab
    > + 18.97% fb.sh [kernel.kallsyms] [k] list_lru_count_one
    > + 6.75% fb.sh [kernel.kallsyms] [k] super_cache_count
    > + 0.49% fb.sh [kernel.kallsyms] [k] down_read_trylock
    > + 0.44% fb.sh [kernel.kallsyms] [k] mem_cgroup_iter
    > + 0.27% fb.sh [kernel.kallsyms] [k] up_read
    > + 0.21% fb.sh [kernel.kallsyms] [k] osq_lock
    > + 0.13% fb.sh [kernel.kallsyms] [k] shmem_unused_huge_count
    > + 0.08% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
    > + 0.08% fb.sh [kernel.kallsyms] [k] shrink_node
    >
    > With the patch series:
    > Samples: 4M of event 'cycles', Event count (approx.): 2756866824946
    > + 47.49% fb.sh [kernel.kallsyms] [k] down_read_trylock
    > + 30.72% fb.sh [kernel.kallsyms] [k] up_read
    > + 9.51% fb.sh [kernel.kallsyms] [k] mem_cgroup_iter
    > + 1.69% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
    > + 1.35% fb.sh [kernel.kallsyms] [k] mem_cgroup_protected
    > + 1.05% fb.sh [kernel.kallsyms] [k] queued_spin_lock_slowpath
    > + 0.85% fb.sh [kernel.kallsyms] [k] _raw_spin_lock
    > + 0.78% fb.sh [kernel.kallsyms] [k] lruvec_lru_size
    > + 0.57% fb.sh [kernel.kallsyms] [k] shrink_node
    > + 0.54% fb.sh [kernel.kallsyms] [k] queue_work_on
    > + 0.46% fb.sh [kernel.kallsyms] [k] shrink_slab_memcg

    [ktkhai@virtuozzo.com: v9]
    Link: http://lkml.kernel.org/r/153112561772.4097.11011071937553113003.stgit@localhost.localdomain
    Link: http://lkml.kernel.org/r/153063070859.1818.11870882950920963480.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
    Introduce the set_shrinker_bit() function to set a shrinker-related bit
    in the memcg shrinker bitmap; the bit is set after the first item is
    added and when reparenting a destroyed memcg's items.

    This will allow the next patch to call shrinkers only when they have
    charged objects at the moment, and to improve shrink_slab()
    performance.

    [ktkhai@virtuozzo.com: v9]
    Link: http://lkml.kernel.org/r/153112557572.4097.17315791419810749985.stgit@localhost.localdomain
    Link: http://lkml.kernel.org/r/153063065671.1818.15914674956134687268.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
    This will be used in the next patch.

    Link: http://lkml.kernel.org/r/153063064347.1818.1987011484100392706.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
    This is just refactoring to allow the next patches to have a dst_memcg
    pointer in memcg_drain_list_lru_node().

    Link: http://lkml.kernel.org/r/153063062118.1818.2761273817739499749.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
    Imagine a big node with many cpus, memory cgroups and containers. Say
    we have 200 containers, every container has 10 mounts and 10 cgroups.
    Container tasks don't touch other containers' mounts. If there is
    intensive page writing and global reclaim happens, a writing task has
    to iterate over all memcgs to shrink slab before it's able to go to
    shrink_page_list().

    Iteration over all the memcg slabs is very expensive: the task has to
    visit 200 * 10 = 2000 shrinkers for every memcg, and since there are
    2000 memcgs, the total number of calls is 2000 * 2000 = 4000000.

    So, the shrinker makes 4 million do_shrink_slab() calls just to try to
    isolate SWAP_CLUSTER_MAX pages in one of the actively writing memcgs
    via shrink_page_list(). I've observed a node spending almost 100% of
    its time in the kernel, making useless iterations over already shrunk
    slabs.

    This patch adds a bitmap of memcg-aware shrinkers to the memcg. The
    size of the bitmap depends on bitmap_nr_ids, and during the memcg's
    lifetime it is maintained to be large enough to fit bitmap_nr_ids
    shrinkers. Every bit in the map corresponds to a shrinker id.

    The next patches will keep a bit set only for memcgs that really have
    charged objects. This will allow shrink_slab() to improve its
    performance significantly. See the last patch for the numbers.

    [ktkhai@virtuozzo.com: v9]
    Link: http://lkml.kernel.org/r/153112549031.4097.3576147070498769979.stgit@localhost.localdomain
    [ktkhai@virtuozzo.com: add comment to mem_cgroup_css_online()]
    Link: http://lkml.kernel.org/r/521f9e5f-c436-b388-fe83-4dc870bfb489@virtuozzo.com
    Link: http://lkml.kernel.org/r/153063056619.1818.12550500883688681076.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
    The next patch requires these defines to be above their current
    position, so they are moved up to the declarations.

    Link: http://lkml.kernel.org/r/153063055665.1818.5200425793649695598.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
    Introduce a new config option to replace the repeated CONFIG_MEMCG &&
    !CONFIG_SLOB pattern. The next patches add a little more memcg+kmem
    related code, so let's keep the related defines tidy.

    Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
    Commit 3812c8c8f395 ("mm: memcg: do not trap chargers with full
    callstack on OOM") has changed the ENOMEM semantics of memcg charges.
    Rather than invoking the oom killer from the charging context it delays
    the oom killer to the page fault path (pagefault_out_of_memory). This
    in turn means that many users (e.g. slab or g-u-p) will get ENOMEM when
    the corresponding memcg hits the hard limit and the memcg is OOM.
    This behavior is inconsistent with the !memcg case where the oom killer
    is invoked from the allocation context and the allocator keeps retrying
    until it succeeds.

    The difference in the behavior is user visible. mmap(MAP_POPULATE)
    might result in not fully populated ranges while the mmap return code
    doesn't tell that to the userspace. Random syscalls might fail with
    ENOMEM etc.

    The primary motivation of the different memcg oom semantic was the
    deadlock avoidance. Things have changed since then, though. We have an
    async oom teardown by the oom reaper now and so we do not have to rely
    on the victim to tear down its memory anymore. Therefore we can return
    to the original semantic as long as the memcg oom killer is not handed
    over to user space.

    There is still one thing to be careful about here though. If the oom
    killer is not able to make any forward progress - e.g. because there is
    no eligible task to kill - then we have to bail out of the charge path
    to prevent the same class of deadlocks. We have basically two options
    here. Either we fail the charge with ENOMEM or force the charge and
    allow overcharge. The first option has been considered more harmful
    than useful because rare inconsistencies in the ENOMEM behavior are hard
    to test for and error prone. Basically the same reason why the page
    allocator doesn't fail allocations under such conditions. The latter
    might allow runaways but those should be really unlikely unless somebody
    misconfigures the system. E.g. allowing to migrate tasks away from the
    memcg to a different unlimited memcg with move_charge_at_immigrate
    disabled.

    Link: http://lkml.kernel.org/r/20180628151101.25307-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    The buffer_head can consume a significant amount of system memory and is
    directly related to the amount of page cache. In our production
    environment we have observed that a lot of machines are spending a
    significant amount of memory as buffer_heads, and this cannot simply be
    written off as system memory overhead.

    Charging buffer_head is not as simple as adding __GFP_ACCOUNT to the
    allocation. The buffer_heads can be allocated in a memcg different from
    the memcg of the page for which buffer_heads are being allocated. One
    concrete example is memory reclaim. The reclaim can trigger I/O of
    pages of any memcg on the system. So, the right way to charge
    buffer_head is to extract the memcg from the page for which buffer_heads
    are being allocated and then use targeted memcg charging API.

    [shakeelb@google.com: use __GFP_ACCOUNT for directed memcg charging]
    Link: http://lkml.kernel.org/r/20180702220208.213380-1-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20180627191250.209150-3-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jan Kara
    Cc: Amir Goldstein
    Cc: Greg Thelen
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • Patch series "Directed kmem charging", v8.

    The Linux kernel's memory cgroup allows limiting the memory usage of the
    jobs running on the system to provide isolation between the jobs. All
    the kernel memory allocated in the context of the job and marked with
    __GFP_ACCOUNT will also be included in the memory usage and be limited
    by the job's limit.

    The kernel memory can only be charged to the memcg of the process in
    whose context the kernel memory was allocated. However there are cases
    where the allocated kernel memory should be charged to a memcg different
    from the current process's memcg. This patch series contains two such
    concrete use-cases i.e. fsnotify and buffer_head.

    The fsnotify event objects can consume a lot of system memory for large
    or unlimited queues if there is either no or slow listener. The events
    are allocated in the context of the event producer. However they should
    be charged to the event consumer. Similarly the buffer_head objects can
    be allocated in a memcg different from the memcg of the page for which
    buffer_head objects are being allocated.

    To solve this issue, this patch series introduces mechanism to charge
    kernel memory to a given memcg. In case of fsnotify events, the memcg
    of the consumer can be used for charging and for buffer_head, the memcg
    of the page can be charged. For directed charging, the caller can use
    the scope API memalloc_[un]use_memcg() to specify the memcg to charge
    for all the __GFP_ACCOUNT allocations within the scope.
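
    Roughly, a producer then brackets its allocation with the scope API (a
    usage sketch; the cache name and memcg field are illustrative):

    	memalloc_use_memcg(group->memcg);	/* listener's memcg */
    	event = kmem_cache_alloc(event_cachep, GFP_KERNEL_ACCOUNT);
    	memalloc_unuse_memcg();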

    This patch (of 2):

    A lot of memory can be consumed by the events generated for the huge or
    unlimited queues if there is either no or slow listener. This can cause
    system level memory pressure or OOMs. So, it's better to account the
    fsnotify kmem caches to the memcg of the listener.

    However the listener can be in a different memcg than the memcg of the
    producer and these allocations happen in the context of the event
    producer. This patch introduces remote memcg charging API which the
    producer can use to charge the allocations to the memcg of the listener.

    There are seven fsnotify kmem caches and among them allocations from
    dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
    inotify_inode_mark_cachep happens in the context of syscall from the
    listener. So, SLAB_ACCOUNT is enough for these caches.

    The objects from fsnotify_mark_connector_cachep are not accounted as
    they are small compared to the notification mark or events and it is
    unclear whom to account connector to since it is shared by all events
    attached to the inode.

    The allocations from the event caches happen in the context of the event
    producer. For such caches we will need to remote charge the allocations
    to the listener's memcg. Thus we save the memcg reference in the
    fsnotify_group structure of the listener.

    This patch also rearranges the members of fsnotify_group, filling holes
    so that the structure size stays the same (at least for 64-bit builds)
    even with the additional member.

    [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
    Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jan Kara
    Cc: Amir Goldstein
    Cc: Greg Thelen
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

15 Aug, 2018

1 commit

  • Pull block updates from Jens Axboe:
    "First pull request for this merge window, there will also be a
    followup request with some stragglers.

    This pull request contains:

    - Fix for a thundering herd issue in the wbt block code (Anchal
    Agarwal)

    - A few NVMe pull requests:
    * Improved tracepoints (Keith)
    * Larger inline data support for RDMA (Steve Wise)
    * RDMA setup/teardown fixes (Sagi)
    * Effects log support for NVMe target (Chaitanya Kulkarni)
    * Buffered IO support for NVMe target (Chaitanya Kulkarni)
    * TP4004 (ANA) support (Christoph)
    * Various NVMe fixes

    - Block io-latency controller support. Much needed support for
    properly containing block devices. (Josef)

    - Series improving how we handle sense information on the stack
    (Kees)

    - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

    - Zoned device support for null_blk (Matias)

    - AIX partition fixes (Mauricio Faria de Oliveira)

    - DIF checksum code made generic (Max Gurtovoy)

    - Add support for discard in iostats (Michael Callahan / Tejun)

    - Set of updates for BFQ (Paolo)

    - Removal of async write support for bsg (Christoph)

    - Bio page dirtying and clone fixups (Christoph)

    - Set of bcache fix/changes (via Coly)

    - Series improving blk-mq queue setup/teardown speed (Ming)

    - Series improving merging performance on blk-mq (Ming)

    - Lots of other fixes and cleanups from a slew of folks"

    * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
    blkcg: Make blkg_root_lookup() work for queues in bypass mode
    bcache: fix error setting writeback_rate through sysfs interface
    null_blk: add lock drop/acquire annotation
    Blk-throttle: reduce tail io latency when iops limit is enforced
    block: paride: pd: mark expected switch fall-throughs
    block: Ensure that a request queue is dissociated from the cgroup controller
    block: Introduce blk_exit_queue()
    blkcg: Introduce blkg_root_lookup()
    block: Remove two superfluous #include directives
    blk-mq: count the hctx as active before allocating tag
    block: bvec_nr_vecs() returns value for wrong slab
    bcache: trivial - remove tailing backslash in macro BTREE_FLAG
    bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
    bcache: set max writeback rate when I/O request is idle
    bcache: add code comments for bset.c
    bcache: fix mistaken comments in request.c
    bcache: fix mistaken code comments in bcache.h
    bcache: add a comment in super.c
    bcache: avoid unncessary cache prefetch bch_btree_node_get()
    bcache: display rate debug parameters to 0 when writeback is not running
    ...

    Linus Torvalds
     

06 Aug, 2018

1 commit


03 Aug, 2018

1 commit

    In case of memcg_online_kmem() failure, memcg_cgroup::id remains hashed
    in mem_cgroup_idr even after the memcg memory is freed. This leads to a
    leak of the ID in mem_cgroup_idr.

    This patch adds removal into mem_cgroup_css_alloc(), which fixes the
    problem. For better readability, it adds a generic helper which is used
    in mem_cgroup_alloc() and mem_cgroup_id_put_many() as well.

    Link: http://lkml.kernel.org/r/152354470916.22460.14397070748001974638.stgit@localhost.localdomain
    Fixes: 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
    Signed-off-by: Kirill Tkhai
    Acked-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     

22 Jul, 2018

1 commit

  • It was reported that a kernel crash happened in mem_cgroup_iter(), which
    can be triggered if the legacy cgroup-v1 non-hierarchical mode is used.

    Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b8f
    ......
    Call trace:
    mem_cgroup_iter+0x2e0/0x6d4
    shrink_zone+0x8c/0x324
    balance_pgdat+0x450/0x640
    kswapd+0x130/0x4b8
    kthread+0xe8/0xfc
    ret_from_fork+0x10/0x20

    mem_cgroup_iter():
        ...
        if (css_tryget(css))    <-- crash here
            break;
        ...

    The crash happens because mem_cgroup_iter() uses the memcg object whose
    pointer is stored in iter->position, which has been freed before and
    filled with POISON_FREE(0x6b).

    And the root cause of the use-after-free issue is that
    invalidate_reclaim_iterators() fails to reset the value of
    iter->position to NULL when the css of the memcg is released in
    non-hierarchical mode.

    Link: http://lkml.kernel.org/r/1531994807-25639-1-git-send-email-jing.xia@unisoc.com
    Fixes: 6df38689e0e9 ("mm: memcontrol: fix possible memcg leak due to interrupted reclaim")
    Signed-off-by: Jing Xia
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc:
    Cc: Shakeel Butt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jing Xia
     

09 Jul, 2018

1 commit

  • Memory allocations can induce swapping via kswapd or direct reclaim. If
    we are having IO done for us by kswapd and don't actually go into direct
    reclaim we may never get scheduled for throttling. So instead check to
    see if our cgroup is congested, and if so schedule the throttling.
    Before we return to user space the throttling stuff will only throttle
    if we actually required it.

    Signed-off-by: Tejun Heo
    Signed-off-by: Josef Bacik
    Acked-by: Johannes Weiner
    Acked-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Jun, 2018

2 commits

  • Commit e27be240df53 ("mm: memcg: make sure memory.events is uptodate
    when waking pollers") converted most of memcg event counters to
    per-memcg atomics, which made them less confusing for a user. The
    "oom_kill" counter remained untouched, so now it behaves differently
    than other counters (including "oom"). This adds nothing but confusion.

    Let's fix this by adding the MEMCG_OOM_KILL event, and follow the
    MEMCG_OOM approach.

    This also removes a hack from count_memcg_event_mm(), introduced earlier
    specially for the OOM_KILL counter.

    [akpm@linux-foundation.org: fix for droppage of memcg-replace-mm-owner-with-mm-memcg.patch]
    Link: http://lkml.kernel.org/r/20180508124637.29984-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Konstantin Khlebnikov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Shakeel reported a crash in mem_cgroup_protected(), which can be triggered
    by memcg reclaim if the legacy cgroup v1 use_hierarchy=0 mode is used:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000120
    PGD 8000001ff55da067 P4D 8000001ff55da067 PUD 1fdc7df067 PMD 0
    Oops: 0000 [#4] SMP PTI
    CPU: 0 PID: 15581 Comm: bash Tainted: G D 4.17.0-smp-clean #5
    Hardware name: ...
    RIP: 0010:mem_cgroup_protected+0x54/0x130
    Code: 4c 8b 8e 00 01 00 00 4c 8b 86 08 01 00 00 48 8d 8a 08 ff ff ff 48 85 d2 ba 00 00 00 00 48 0f 44 ca 48 39 c8 0f 84 cf 00 00 00 8b 81 20 01 00 00 4d 89 ca 4c 39 c8 4c 0f 46 d0 4d 85 d2 74 05
    RSP: 0000:ffffabe64dfafa58 EFLAGS: 00010286
    RAX: ffff9fb6ff03d000 RBX: ffff9fb6f5b1b000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffff9fb6f5b1b000 RDI: ffff9fb6f5b1b000
    RBP: ffffabe64dfafb08 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 000000000000c800 R12: ffffabe64dfafb88
    R13: ffff9fb6f5b1b000 R14: ffffabe64dfafb88 R15: ffff9fb77fffe000
    FS: 00007fed1f8ac700(0000) GS:ffff9fb6ff400000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000120 CR3: 0000001fdcf86003 CR4: 00000000001606f0
    Call Trace:
    ? shrink_node+0x194/0x510
    do_try_to_free_pages+0xfd/0x390
    try_to_free_mem_cgroup_pages+0x123/0x210
    try_charge+0x19e/0x700
    mem_cgroup_try_charge+0x10b/0x1a0
    wp_page_copy+0x134/0x5b0
    do_wp_page+0x90/0x460
    __handle_mm_fault+0x8e3/0xf30
    handle_mm_fault+0xfe/0x220
    __do_page_fault+0x262/0x500
    do_page_fault+0x28/0xd0
    ? page_fault+0x8/0x30
    page_fault+0x1e/0x30
    RIP: 0033:0x485b72

    The problem happens because parent_mem_cgroup() returns a NULL pointer,
    which is dereferenced later without a check.

    As cgroup v1 has no memory guarantee support, let's make
    mem_cgroup_protected() immediately return MEMCG_PROT_NONE, if the given
    cgroup has no parent (non-hierarchical mode is used).
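
    In code the fix is essentially an early bail-out at the top of
    mem_cgroup_protected() (sketched from the description above):

    	struct mem_cgroup *parent = parent_mem_cgroup(memcg);

    	/* No parent: root, or cgroup v1 non-hierarchical mode. */
    	if (!parent)
    		return MEMCG_PROT_NONE;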

    Link: http://lkml.kernel.org/r/20180611175418.7007-2-guro@fb.com
    Fixes: bf8d5d52ffe8 ("memcg: introduce memory.min")
    Signed-off-by: Roman Gushchin
    Reported-by: Shakeel Butt
    Tested-by: Shakeel Butt
    Tested-by: John Stultz
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

08 Jun, 2018

11 commits

  • Currently an attempt to set swap.max into a value lower than the actual
    swap usage fails, which causes configuration problems as there's no way
    of lowering the configuration below the current usage short of turning
    off swap entirely. This makes swap.max difficult to use and allows
    delegatees to lock the delegator out of reducing swap allocation.

    This patch updates swap_max_write() so that the limit can be lowered
    below the current usage. It doesn't implement active reclaiming of swap
    entries for the following reasons.

    * mem_cgroup_swap_full() already tells the swap machinery to
    aggressively reclaim swap entries if the usage is above 50% of the
    limit, so simply lowering the limit automatically triggers gradual
    reclaim.

    * Forcing back swapped out pages is likely to heavily impact the
    workload and mess up the working set. Given that swap usually is a
    lot less valuable and less scarce, letting the existing usage
    dissipate over time through the above gradual reclaim and as they're
    faulted back in is likely the better behavior.

    Link: http://lkml.kernel.org/r/20180523185041.GR1718769@devbig577.frc2.facebook.com
    Signed-off-by: Tejun Heo
    Acked-by: Roman Gushchin
    Acked-by: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Memory controller implements the memory.low best-effort memory
    protection mechanism, which works perfectly in many cases and allows
    protecting working sets of important workloads from sudden reclaim.

    But its semantics has a significant limitation: it works only as long as
    there is a supply of reclaimable memory. This makes it pretty useless
    against any sort of slow memory leaks or memory usage increases. This
    is especially true for swapless systems. If swap is enabled, memory
    soft protection effectively postpones problems, allowing a leaking
    application to fill all swap area, which makes no sense. The only
    effective way to guarantee the memory protection in this case is to
    invoke the OOM killer.

    It's possible to handle this case in userspace by reacting on MEMCG_LOW
    events; but there is still a place for a fail-safe in-kernel mechanism
    to provide stronger guarantees.

    This patch introduces the memory.min interface for cgroup v2 memory
    controller. It works very similarly to memory.low (sharing the same
    hierarchical behavior), except that it's not disabled if there is no
    more reclaimable memory in the system.

    If cgroup is not populated, its memory.min is ignored, because otherwise
    even the OOM killer wouldn't be able to reclaim the protected memory,
    and the system can stall.

    [guro@fb.com: s/low/min/ in docs]
    Link: http://lkml.kernel.org/r/20180510130758.GA9129@castle.DHCP.thefacebook.com
    Link: http://lkml.kernel.org/r/20180509180734.GA4856@castle.DHCP.thefacebook.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Randy Dunlap
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
    The per-cpu memcg stock can retain a charge of up to 32 pages. On a
    machine with a large number of cpus, this can amount to a decent amount
    of memory. Additionally, the force_empty interface might be triggering
    unneeded memcg reclaims.

    Link: http://lkml.kernel.org/r/20180507201651.165879-1-shakeelb@google.com
    Signed-off-by: Junaid Shahid
    Signed-off-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junaid Shahid
     
  • Resizing the memcg limit for cgroup-v2 drains the stocks before
    triggering the memcg reclaim. Do the same for cgroup-v1 to make the
    behavior consistent.

    Link: http://lkml.kernel.org/r/20180504205548.110696-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Greg Thelen
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • Mark memcg1_events static: it's only used by memcontrol.c. And mark it
    const: it's not modified.

    Link: http://lkml.kernel.org/r/20180503192940.94971-1-gthelen@google.com
    Signed-off-by: Greg Thelen
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • mem_cgroup_cgwb_list is a very simple wrapper and it will never be used
    outside of code under CONFIG_CGROUP_WRITEBACK. so use memcg->cgwb_list
    directly.

    Link: http://lkml.kernel.org/r/1524406173-212182-1-git-send-email-wanglong19@meituan.com
    Signed-off-by: Wang Long
    Reviewed-by: Jan Kara
    Acked-by: Tejun Heo
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Long
     
  • If memcg's usage is equal to the memory.low value, avoid reclaiming from
    this cgroup while there is a surplus of reclaimable memory.

    This sounds more logical and also matches memory.high and memory.max
    behavior: both are inclusive.

    Empty cgroups are not considered protected, so MEMCG_LOW events are not
    emitted for empty cgroups, if there is no more reclaimable memory in the
    system.

    Link: http://lkml.kernel.org/r/20180406122132.GA7185@castle
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This patch aims to address an issue in current memory.low semantics,
    which makes it hard to use it in a hierarchy, where some leaf memory
    cgroups are more valuable than others.

    For example, there are memcgs A, A/B, A/C, A/D and A/E:

          A      A/memory.low = 2G, A/memory.current = 6G
         //\\
        BC  DE   B/memory.low = 3G  B/memory.current = 2G
                 C/memory.low = 1G  C/memory.current = 2G
                 D/memory.low = 0   D/memory.current = 2G
                 E/memory.low = 10G E/memory.current = 0

    If we apply memory pressure, B, C and D are reclaimed at the same pace
    while A's usage exceeds 2G. This is obviously wrong, as B's usage is
    fully below B's memory.low, and C has 1G of protection as well. Also, A
    is pushed to a size which is less than A's 2G memory.low, which is also
    wrong.

    A simple bash script (provided below) can be used to reproduce
    the problem. Current results are:
    A: 1430097920
    A/B: 711929856
    A/C: 717426688
    A/D: 741376
    A/E: 0

    To address the issue a concept of effective memory.low is introduced.
    Effective memory.low is always equal to or less than the original
    memory.low. In the case when there is no memory.low overcommitment (and
    also for top-level cgroups), these two values are equal.

    Otherwise it's a part of the parent's effective memory.low, calculated
    as the cgroup's memory.low usage divided by the sum of the siblings'
    memory.low usages (by memory.low usage I mean the size of actually
    protected memory: memory.current if memory.current < memory.low, 0
    otherwise). It's necessary to track the actual usage, because otherwise
    an empty cgroup with memory.low set (A/E in my example) will affect the
    actual memory distribution, which makes no sense. To avoid traversing
    the cgroup tree twice, page_counters code is reused.
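
    Plugging the numbers from the example above into that rule gives
    roughly (protected usage = memory.current clamped to memory.low):

    	low usage:  B = 2G, C = 1G, D = 0, E = 0   (E is empty)
    	A's effective low = 2G (top level)
    	B's effective low = 2G * 2G / (2G + 1G) ~= 1.33G
    	C's effective low = 2G * 1G / (2G + 1G) ~= 0.67G
    	D's and E's effective low = 0

    which matches the post-patch memory.current numbers below.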

    Calculating effective memory.low can be done in the reclaim path, as we
    conveniently traversing the cgroup tree from top to bottom and check
    memory.low on each level. So, it's a perfect place to calculate
    effective memory low and save it to use it for children cgroups.

    This also eliminates a need to traverse the cgroup tree from bottom to
    top each time to check if parent's guarantee is not exceeded.

    Setting/resetting effective memory.low is intentionally racy, but it's
    fine and shouldn't lead to any significant differences in actual memory
    distribution.

    With this patch applied results are matching the expectations:
    A: 2147930112
    A/B: 1428721664
    A/C: 718393344
    A/D: 815104
    A/E: 0

    Test script:
    #!/bin/bash

    CGPATH="/sys/fs/cgroup"

    truncate /file1 --size 2G
    truncate /file2 --size 2G
    truncate /file3 --size 2G
    truncate /file4 --size 50G

    mkdir "${CGPATH}/A"
    echo "+memory" > "${CGPATH}/A/cgroup.subtree_control"
    mkdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"

    echo 2G > "${CGPATH}/A/memory.low"
    echo 3G > "${CGPATH}/A/B/memory.low"
    echo 1G > "${CGPATH}/A/C/memory.low"
    echo 0 > "${CGPATH}/A/D/memory.low"
    echo 10G > "${CGPATH}/A/E/memory.low"

    echo $$ > "${CGPATH}/A/B/cgroup.procs" && vmtouch -qt /file1
    echo $$ > "${CGPATH}/A/C/cgroup.procs" && vmtouch -qt /file2
    echo $$ > "${CGPATH}/A/D/cgroup.procs" && vmtouch -qt /file3
    echo $$ > "${CGPATH}/cgroup.procs" && vmtouch -qt /file4

    echo "A: " `cat "${CGPATH}/A/memory.current"`
    echo "A/B: " `cat "${CGPATH}/A/B/memory.current"`
    echo "A/C: " `cat "${CGPATH}/A/C/memory.current"`
    echo "A/D: " `cat "${CGPATH}/A/D/memory.current"`
    echo "A/E: " `cat "${CGPATH}/A/E/memory.current"`

    rmdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"
    rmdir "${CGPATH}/A"
    rm /file1 /file2 /file3 /file4

    Link: http://lkml.kernel.org/r/20180405185921.4942-2-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This patch renames struct page_counter fields:
    count -> usage
    limit -> max

    and the corresponding functions:
    page_counter_limit() -> page_counter_set_max()
    mem_cgroup_get_limit() -> mem_cgroup_get_max()
    mem_cgroup_resize_limit() -> mem_cgroup_resize_max()
    memcg_update_kmem_limit() -> memcg_update_kmem_max()
    memcg_update_tcp_limit() -> memcg_update_tcp_max()

    The idea behind this renaming is to have the direct matching
    between memory cgroup knobs (low, high, max) and page_counters API.

    This is pure renaming, this patch doesn't bring any functional change.

    Link: http://lkml.kernel.org/r/20180405185921.4942-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Add swap max and fail events so that userland can monitor and respond to
    running out of swap.

    I'm not too sure about the fail event. Right now, it's a bit confusing
    which stats / events are recursive and which aren't, and also which ones
    reflect events that originate from a given cgroup and which ones target
    the cgroup. No idea what the right long term solution is and it could
    just be that growing them organically is actually the only right thing
    to do.

    Link: http://lkml.kernel.org/r/20180416231151.GI1911913@devbig577.frc2.facebook.com
    Signed-off-by: Tejun Heo
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Patch series "mm, memcontrol: Implement memory.swap.events", v2.

    This patchset implements memory.swap.events which contains max and fail
    events so that userland can monitor and respond to swap running out.

    This patch (of 2):

    get_swap_page() is always followed by mem_cgroup_try_charge_swap().
    This patch moves mem_cgroup_try_charge_swap() into get_swap_page() and
    makes get_swap_page() call the function even after swap allocation
    failure.

    This simplifies the callers and consolidates memcg related logic and
    will ease adding swap related memcg events.

    Link: http://lkml.kernel.org/r/20180416230934.GH1911913@devbig577.frc2.facebook.com
    Signed-off-by: Tejun Heo
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

26 May, 2018

1 commit


21 Apr, 2018

1 commit

    If there is heavy memory pressure, page allocation with __GFP_NOWAIT
    fails easily although it's an order-0 request. I got the below warning
    9 times during a normal boot.

    : page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
    .. snip ..
    Call trace:
    dump_backtrace+0x0/0x4
    dump_stack+0xa4/0xc0
    warn_alloc+0xd4/0x15c
    __alloc_pages_nodemask+0xf88/0x10fc
    alloc_slab_page+0x40/0x18c
    new_slab+0x2b8/0x2e0
    ___slab_alloc+0x25c/0x464
    __kmalloc+0x394/0x498
    memcg_kmem_get_cache+0x114/0x2b8
    kmem_cache_alloc+0x98/0x3e8
    mmap_region+0x3bc/0x8c0
    do_mmap+0x40c/0x43c
    vm_mmap_pgoff+0x15c/0x1e4
    sys_mmap+0xb0/0xc8
    el0_svc_naked+0x24/0x28
    Mem-Info:
    active_anon:17124 inactive_anon:193 isolated_anon:0
    active_file:7898 inactive_file:712955 isolated_file:55
    unevictable:0 dirty:27 writeback:18 unstable:0
    slab_reclaimable:12250 slab_unreclaimable:23334
    mapped:19310 shmem:212 pagetables:816 bounce:0
    free:36561 free_pcp:1205 free_cma:35615
    Node 0 active_anon:68496kB inactive_anon:772kB active_file:31592kB inactive_file:2851820kB unevictable:0kB isolated(anon):0kB isolated(file):220kB mapped:77240kB dirty:108kB writeback:72kB shmem:848kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
    DMA free:142188kB min:3056kB low:3820kB high:4584kB active_anon:10052kB inactive_anon:12kB active_file:312kB inactive_file:1412620kB unevictable:0kB writepending:0kB present:1781412kB managed:1604728kB mlocked:0kB slab_reclaimable:3592kB slab_unreclaimable:876kB kernel_stack:400kB pagetables:52kB bounce:0kB free_pcp:1436kB local_pcp:124kB free_cma:142492kB
    lowmem_reserve[]: 0 1842 1842
    Normal free:4056kB min:4172kB low:5212kB high:6252kB active_anon:58376kB inactive_anon:760kB active_file:31348kB inactive_file:1439040kB unevictable:0kB writepending:180kB present:2000636kB managed:1923688kB mlocked:0kB slab_reclaimable:45408kB slab_unreclaimable:92460kB kernel_stack:9680kB pagetables:3212kB bounce:0kB free_pcp:3392kB local_pcp:688kB free_cma:0kB
    lowmem_reserve[]: 0 0 0
    DMA: 0*4kB 0*8kB 1*16kB (C) 0*32kB 0*64kB 0*128kB 1*256kB (C) 1*512kB (C) 0*1024kB 1*2048kB (C) 34*4096kB (C) = 142096kB
    Normal: 228*4kB (UMEH) 172*8kB (UMH) 23*16kB (UH) 24*32kB (H) 5*64kB (H) 1*128kB (H) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3872kB
    721350 total pagecache pages
    0 pages in swap cache
    Swap cache stats: add 0, delete 0, find 0/0
    Free swap = 0kB
    Total swap = 0kB
    945512 pages RAM
    0 pages HighMem/MovableOnly
    63408 pages reserved
    51200 pages cma reserved

    __memcg_schedule_kmem_cache_create() tries to create a shadow slab cache
    and the worker allocation failure is not really critical because we will
    retry on the next kmem charge. We might miss some charges but that
    shouldn't be critical. The excessive allocation failure report is not
    very helpful.
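
    The cure is simply to mark that best-effort allocation as not worth a
    warning, roughly like this (a sketch of the idea only; the surrounding
    worker-setup code is omitted):

    	/* Best effort: on failure we just retry on the next kmem charge,
    	 * so suppress the allocation-failure splat. */
    	cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
    	if (!cw)
    		return;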

    [mhocko@kernel.org: changelog update]
    Link: http://lkml.kernel.org/r/20180418022912.248417-1-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Minchan Kim
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

12 Apr, 2018

4 commits

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • syzbot has triggered a NULL ptr dereference when allocation fault
    injection enforces a failure and alloc_mem_cgroup_per_node_info
    initializes memcg->nodeinfo only half way through.

    But __mem_cgroup_free still tries to free all per-node data and
    dereferences pn->lruvec_stat_cpu unconditionally even if the specific
    per-node data hasn't been initialized.

    The bug is quite unlikely to hit because small allocations do not fail
    and we would need quite some numa nodes to make struct
    mem_cgroup_per_node large enough to cross the costly order.

    Link: http://lkml.kernel.org/r/20180406100906.17790-1-mhocko@kernel.org
    Reported-by: syzbot+8a5de3cce7cdc70e9ebe@syzkaller.appspotmail.com
    Fixes: 00f3ca2c2d66 ("mm: memcontrol: per-lruvec stats infrastructure")
    Signed-off-by: Michal Hocko
    Reviewed-by: Andrey Ryabinin
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
    memory.stat reporting") added per-cpu drift to all memory cgroup stats
    and events shown in memory.stat and memory.events.

    For memory.stat this is acceptable. But memory.events issues file
    notifications, and somebody polling the file for changes will be
    confused when the counters in it are unchanged after a wakeup.

    Luckily, the events in memory.events - MEMCG_LOW, MEMCG_HIGH, MEMCG_MAX,
    MEMCG_OOM - are sufficiently rare and high-level that we don't need
    per-cpu buffering for them: MEMCG_HIGH and MEMCG_MAX would be the most
    frequent, but they're counting invocations of reclaim, which is a
    complex operation that touches many shared cachelines.

    This splits memory.events from the generic VM events and tracks them in
    their own, unbuffered atomic counters. That's also cleaner, as it
    eliminates the ugly enum nesting of VM and cgroup events.

    [hannes@cmpxchg.org: "array subscript is above array bounds"]
    Link: http://lkml.kernel.org/r/20180406155441.GA20806@cmpxchg.org
    Link: http://lkml.kernel.org/r/20180405175507.GA24817@cmpxchg.org
    Fixes: a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Acked-by: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Rik van Riel
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    A THP memcg charge can trigger the oom killer since 2516035499b9 ("mm,
    thp: remove __GFP_NORETRY from khugepaged and madvised allocations").
    We have used an explicit __GFP_NORETRY previously which ruled out the
    OOM killer automagically.

    The memcg charge path should be semantically compliant with the
    allocation path, and that means that if we do not trigger the OOM
    killer for costly orders there, we should do the same in the memcg
    charge path as well. Otherwise we are forcing callers to distinguish
    the two and use different gfp masks which is both non-intuitive and bug
    prone. As soon as we get a costly high order kmalloc user we even do
    not have any means to tell the memcg specific gfp mask to prevent from
    OOM because the charging is deep within guts of the slab allocator.

    The unexpected memcg OOM on THP has already been fixed upstream by
    9d3c3354bb85 ("mm, thp: do not cause memcg oom for thp") but this is a
    one-off fix rather than a generic solution. Teach mem_cgroup_oom to
    bail out on costly order requests to fix the THP issue as well as any
    other costly OOM eligible allocations to be added in future.
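
    The bail-out itself is a small check in mem_cgroup_oom() (a sketch;
    PAGE_ALLOC_COSTLY_ORDER is the same threshold the page allocator uses):

    	/* Costly orders are not worth an OOM kill; fail the charge instead. */
    	if (order > PAGE_ALLOC_COSTLY_ORDER)
    		return;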

    Also revert 9d3c3354bb85 because special gfp for THP is no longer
    needed.

    Link: http://lkml.kernel.org/r/20180403193129.22146-1-mhocko@kernel.org
    Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko