20 Feb, 2019
1 commit
-
commit a9a238e83fbb0df31c3b9b67003f8f9d1d1b6c96 upstream.
This reverts commit 172b06c32b9497 ("mm: slowly shrink slabs with a
relatively small number of objects").

That change altered the aggressiveness of shrinker reclaim, causing
small-cache and low-priority reclaim to greatly increase scanning
pressure on small caches. As a result, light memory pressure has a
disproportionate effect on small caches, and causes large caches to be
reclaimed much faster than previously.

It also greatly perturbs the delicate balance of the VFS caches
(dentry/inode vs file page cache) such that the inode/dentry caches are
reclaimed much, much faster than the page cache, and this drives us into
several other caching-imbalance-related problems.

As such, this is a bad change and needs to be reverted.

[ Needs some massaging to retain the later seekless shrinker
  modifications. ]

Link: http://lkml.kernel.org/r/20190130041707.27750-3-david@fromorbit.com
Fixes: 172b06c32b9497 ("mm: slowly shrink slabs with a relatively small number of objects")
Signed-off-by: Dave Chinner
Cc: Wolfgang Walter
Cc: Roman Gushchin
Cc: Spock
Cc: Rik van Riel
Cc: Michal Hocko
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
29 Dec, 2018
1 commit
-
commit 68600f623d69da428c6163275f97ca126e1a8ec5 upstream.
I've noticed that dying memory cgroups are often pinned in memory by a
single pagecache page. Even under moderate memory pressure they sometimes
stayed in such a state for a long time. That looked strange.

My investigation showed that the problem is caused by applying the LRU
pressure balancing math:

  scan = div64_u64(scan * fraction[lru], denominator),

where

  denominator = fraction[anon] + fraction[file] + 1.

Because fraction[lru] is always less than denominator, if the initial scan
size is 1, the result is always 0. This means the last page is not scanned
and has no chance of being reclaimed.

Fix this by rounding up the result of the division. In practice this
change significantly improves the speed of dying cgroup reclaim.
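For illustration, a minimal userspace sketch of that arithmetic (not the
kernel code; div_round_up() stands in for the kernel's rounding helper):

#include <stdio.h>
#include <stdint.h>

/* Stand-in for a DIV64_U64_ROUND_UP()-style helper. */
static uint64_t div_round_up(uint64_t a, uint64_t b)
{
        return (a + b - 1) / b;
}

int main(void)
{
        uint64_t scan = 1;                  /* one last page on the dying LRU */
        uint64_t fraction = 50;             /* always less than denominator  */
        uint64_t denominator = 50 + 50 + 1;

        /* Truncating division: the last page is never scanned. */
        printf("truncated:  %llu\n",
               (unsigned long long)(scan * fraction / denominator));
        /* Rounding up: at least one page gets scanned and can be reclaimed. */
        printf("rounded up: %llu\n",
               (unsigned long long)div_round_up(scan * fraction, denominator));
        return 0;
}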
[guro@fb.com: prevent double calculation of DIV64_U64_ROUND_UP() arguments]
Link: http://lkml.kernel.org/r/20180829213311.GA13501@castle
Link: http://lkml.kernel.org/r/20180827162621.30187-3-guro@fb.com
Signed-off-by: Roman Gushchin
Reviewed-by: Andrew Morton
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Tejun Heo
Cc: Rik van Riel
Cc: Konstantin Khlebnikov
Cc: Matthew Wilcox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
06 Oct, 2018
1 commit
-
do_shrink_slab() returns an unsigned long value, and placing it into an
int variable cuts the high bytes off. Then we compare ret against
0xfffffffe (since SHRINK_EMPTY is converted to ret's type).

Thus a large number of objects returned by do_shrink_slab() may be
interpreted as SHRINK_EMPTY, if the low bytes of the value are equal to
0xfffffffe. Fix that by declaring ret as unsigned long in these
functions.
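A hedged userspace illustration of the truncation (the SHRINK_EMPTY value
below is modelled on the ~0UL - 1 style constant implied by the 0xfffffffe
comparison; this is not the kernel source and assumes a 64-bit unsigned
long):

#include <stdio.h>

#define SHRINK_EMPTY (~0UL - 1)            /* 0xfffffffffffffffe on 64-bit  */

int main(void)
{
        /* Pretend do_shrink_slab() freed a huge number of objects whose
         * low 32 bits happen to be 0xfffffffe. */
        unsigned long freed = (5UL << 32) | 0xfffffffeUL;

        int bad = (int)freed;              /* high bytes cut off: bad == -2 */
        unsigned long good = freed;        /* the fix: keep the full value  */

        /* In "bad == SHRINK_EMPTY" the int is sign-extended to unsigned
         * long, so -2 becomes 0xfffffffffffffffe and matches by accident. */
        printf("int ret misread as SHRINK_EMPTY:           %s\n",
               bad == SHRINK_EMPTY ? "yes" : "no");
        printf("unsigned long ret misread as SHRINK_EMPTY: %s\n",
               good == SHRINK_EMPTY ? "yes" : "no");
        return 0;
}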
Link: http://lkml.kernel.org/r/153813407177.17544.14888305435570723973.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai
Reported-by: Cyrill Gorcunov
Acked-by: Cyrill Gorcunov
Reviewed-by: Josef Bacik
Cc: Michal Hocko
Cc: Andrey Ryabinin
Cc: Johannes Weiner
Cc: Tetsuo Handa
Cc: Shakeel Butt
Signed-off-by: Andrew Morton
Signed-off-by: Greg Kroah-Hartman
21 Sep, 2018
1 commit
-
9092c71bb724 ("mm: use sc->priority for slab shrink targets") changed the
way that the target slab pressure is calculated and made it
priority-based:delta = freeable >> priority;
delta *= 4;
do_div(delta, shrinker->seeks);The problem is that on a default priority (which is 12) no pressure is
applied at all, if the number of potentially reclaimable objects is less
than 4096 (1<
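A hedged userspace rendering of that formula (shrinker->seeks is assumed
to be 2, the usual default), showing that a small cache gets a scan target
of zero at the default priority:

#include <stdio.h>

int main(void)
{
        unsigned long long freeable = 3000;   /* a small cache               */
        int priority = 12;                    /* default reclaim priority    */
        int seeks = 2;                        /* assumed default seeks value */

        unsigned long long delta = freeable >> priority;  /* 3000 >> 12 == 0 */
        delta *= 4;
        delta /= seeks;                       /* do_div() stand-in           */

        printf("scan target for %llu objects at priority %d: %llu\n",
               freeable, priority, delta);    /* prints 0: no pressure       */
        return 0;
}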
Acked-by: Rik van Riel
Cc: Josef Bacik
Cc: Johannes Weiner
Cc: Shakeel Butt
Cc: Michal Hocko
Cc: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Greg Kroah-Hartman
23 Aug, 2018
2 commits
-
page_freeze_refs/page_unfreeze_refs have already been replaced by
page_ref_freeze/page_ref_unfreeze, but the comments still refer to the
old names.

Link: http://lkml.kernel.org/r/1532590226-106038-1-git-send-email-jiang.biao2@zte.com.cn
Signed-off-by: Jiang Biao
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
There is a sad BUG introduced in the patch adding SHRINKER_REGISTERING.
The shrinker_idr business is only for memcg-aware shrinkers. Only that
type of shrinker has an id, and only those must be finally installed via
idr_replace() in this function. For !memcg-aware shrinkers we never
initialize the shrinker->id field.

But shrinkers of all types were passed to idr_replace(), and every
!memcg-aware shrinker with a random ID (most probably, its id is 0)
replaces the memcg-aware shrinker pointed to by that ID in the IDR.

This patch fixes the problem.
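A hedged userspace sketch of the guard the fix adds (a plain array stands
in for the kernel IDR; names and flags are simplified):

#include <stdio.h>

#define SHRINKER_MEMCG_AWARE  (1 << 0)

struct shrinker {
        unsigned int flags;
        int id;                       /* valid only if SHRINKER_MEMCG_AWARE */
};

static struct shrinker *shrinker_idr[16];   /* stand-in for the kernel IDR */

static void register_shrinker_prepared(struct shrinker *s)
{
        /* The fix: only memcg-aware shrinkers own an ID, so only they may
         * be installed into the IDR slot. */
        if (s->flags & SHRINKER_MEMCG_AWARE)
                shrinker_idr[s->id] = s;
}

int main(void)
{
        struct shrinker memcg_aware = { .flags = SHRINKER_MEMCG_AWARE, .id = 0 };
        struct shrinker plain = { .flags = 0, .id = 0 };   /* id is garbage  */

        register_shrinker_prepared(&memcg_aware);
        register_shrinker_prepared(&plain);     /* must not clobber slot 0   */

        printf("slot 0 still points at the memcg-aware shrinker: %s\n",
               shrinker_idr[0] == &memcg_aware ? "yes" : "no");
        return 0;
}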
Link: http://lkml.kernel.org/r/8ff8a793-8211-713a-4ed9-d6e52390c2fc@virtuozzo.com
Fixes: 7e010df53c80 ("mm: use special value SHRINKER_REGISTERING instead of list_empty() check")
Signed-off-by: Kirill Tkhai
Reported-by:
Cc: Andrey Ryabinin
Cc: Johannes Weiner
Cc: Josef Bacik
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Tetsuo Handa
Cc: Shakeel Butt
Cc:
Cc: Huang Ying
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
18 Aug, 2018
9 commits
-
The patch introduces a special value, SHRINKER_REGISTERING, to use instead
of list_empty() to distinguish a registering shrinker from an unregistered
shrinker. Why do we need that at all?

Shrinker registration is split in two parts. The first one is
prealloc_shrinker(), which allocates shrinker memory and reserves an ID in
shrinker_idr. This function can fail. The second is
register_shrinker_prepared(), and it finalizes the registration. This
function actually makes the shrinker available to be used from
shrink_slab(), and it can't fail.

One shrinker may be based on more than one LRU list. So, we never clear
the bit in the memcg shrinker maps when (one of) the corresponding LRU
lists becomes empty, since the other LRU lists may be non-empty. See the
superblock shrinker for example: it is based on two LRU lists,
s_inode_lru and s_dentry_lru. We do not want to clear the shrinker bit
when there are no inodes in s_inode_lru, as s_dentry_lru may contain
dentries.

Instead of that, we use a special algorithm to detect shrinkers having no
elements at all on their LRU lists, and this is done in
shrink_slab_memcg(). See the comment in this function for the details.

Also, in shrink_slab_memcg() we clear the shrinker bit in the map when we
meet an unregistered shrinker (the bit is set, while there is no shrinker
in the IDR). Otherwise, we would have had to do that at the moment of
shrinker unregistration for all memcgs (and this looks worse, since
iteration over all memcgs may take much time). Also this would have
imposed restrictions on shrinker unregistration order for its users: they
would have had to guarantee that there are no new elements after
unregister_shrinker() (otherwise, a newly added element would have set a
bit).

So, if we meet a set bit in the map and no shrinker in the IDR while
iterating over the map in shrink_slab_memcg(), this means the
corresponding shrinker is unregistered, and we must clear the bit.

Another case is shrinker registration. We want two things there:
1) do_shrink_slab() can be called only for completely registered
   shrinkers;

2) shrinker internal lists may be populated in any order with respect to
   register_shrinker_prepared() (let's take the example of sb). Both of:

a) list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru);  [cpu0]
   memcg_set_shrinker_bit();                                [cpu0]
   ...
   register_shrinker_prepared();                            [cpu1]

and

b) register_shrinker_prepared();                            [cpu0]
   ...
   list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru);  [cpu1]
   memcg_set_shrinker_bit();                                [cpu1]

are legitimate. We don't want to impose a restriction here and
force people to use only the (b) variant. We don't want to force people
to care whether there are elements on the LRU lists before the shrinker is
completely registered. Internal users of the LRU lists and the shrinker
code are two different subsystems, and they should be self-contained with
respect to each other.

In case (a) we have the bit set before the shrinker is completely
registered. We don't want do_shrink_slab() to be called at that moment,
so we have to detect such registering shrinkers.

Before this patch, a list_empty() check (shrinker is not linked to the
list) was used for that. So, in (a) there could be a bit set, but we
don't call do_shrink_slab() unless the shrinker is linked to the list.
It's just an indicator; linking to the list was simply overloaded for
this purpose.

This was not the best solution, since it's better not to touch the
shrinker memory from shrink_slab_memcg() before it's completely
registered (this will also be useful in the future for making
shrink_slab() completely lockless).

So, this patch introduces a better way to detect a registering shrinker,
which does not require dereferencing shrinker memory. It's just a ~0UL
value, which we insert into the IDR during ID allocation. After the
shrinker is ready to be used, we insert the actual shrinker pointer in
the IDR, and it becomes available to shrink_slab_memcg().

We can't use NULL instead of this new value for this purpose, as
shrink_slab_memcg() already uses NULL to detect unregistered shrinkers,
and we don't want the function to see NULL and clear the bit, otherwise
(a) won't work.

This is the only thing the patch does: a better way to detect a
registering shrinker. Nothing else changes.
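For illustration, a hedged userspace sketch of the resulting three-state
check (a plain array stands in for the kernel IDR; names simplified, not
the actual shrink_slab_memcg() code):

#include <stdio.h>

struct shrinker {
        const char *name;
};

#define SHRINKER_REGISTERING ((struct shrinker *)~0UL)

static struct shrinker *shrinker_idr[4];     /* stand-in for shrinker_idr */

static void scan_one(int id, unsigned long *map)
{
        struct shrinker *s = shrinker_idr[id];

        if (!s || s == SHRINKER_REGISTERING) {
                if (!s)                      /* really unregistered        */
                        *map &= ~(1UL << id);
                return;                      /* registering: keep the bit  */
        }
        printf("would call do_shrink_slab() for '%s'\n", s->name);
}

int main(void)
{
        unsigned long map = (1UL << 0) | (1UL << 1) | (1UL << 2);
        struct shrinker sb = { "super_block" };

        shrinker_idr[0] = &sb;                  /* fully registered        */
        shrinker_idr[1] = SHRINKER_REGISTERING; /* mid-registration        */
        shrinker_idr[2] = NULL;                 /* unregistered            */

        for (int id = 0; id < 3; id++)
                scan_one(id, &map);

        printf("map after the scan: %#lx\n", map);   /* bit 2 cleared      */
        return 0;
}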
Also this gives better assembly, but that's a minor side of the patch:

Before:
callq
mov %rax,%r15
test %rax,%rax
je
mov 0x20(%rax),%rax
lea 0x20(%r15),%rdx
cmp %rax,%rdx
je
mov 0x8(%rsp),%edx
mov %r15,%rsi
lea 0x10(%rsp),%rdi
callq

After:
callq
mov %rax,%r15
lea -0x1(%rax),%rax
cmp $0xfffffffffffffffd,%rax
ja
mov 0x8(%rsp),%edx
mov %r15,%rsi
lea 0x10(%rsp),%rdi
callq ffffffff810cefd0

[ktkhai@virtuozzo.com: add #ifdef CONFIG_MEMCG_KMEM around idr_replace()]
Link: http://lkml.kernel.org/r/758b8fec-7573-47eb-b26a-7b2847ae7b8c@virtuozzo.com
Link: http://lkml.kernel.org/r/153355467546.11522.4518015068123480218.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai
Reviewed-by: Andrew Morton
Cc: Vladimir Davydov
Cc: Michal Hocko
Cc: Andrey Ryabinin
Cc: "Huang, Ying"
Cc: Tetsuo Handa
Cc: Matthew Wilcox
Cc: Shakeel Butt
Cc: Josef Bacik
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
In the case of shrink_slab_memcg() we do not zero nid when the shrinker
is not numa-aware. This is not a real problem, since currently all
memcg-aware shrinkers are numa-aware too (we have two: the super_block
shrinker and the workingset shrinker), but something may change in the
future.

Link: http://lkml.kernel.org/r/153320759911.18959.8842396230157677671.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai
Reviewed-by: Andrew Morton
Cc: Vladimir Davydov
Cc: Michal Hocko
Cc: Andrey Ryabinin
Cc: "Huang, Ying"
Cc: Tetsuo Handa
Cc: Matthew Wilcox
Cc: Shakeel Butt
Cc: Josef Bacik
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
To avoid further unneeded calls of do_shrink_slab() for shrinkers which
no longer have any charged objects in a memcg, their bits have to be
cleared.

This patch introduces a lockless mechanism to do that without races with
a parallel list_lru add. After do_shrink_slab() returns SHRINK_EMPTY the
first time, we clear the bit and call it once again. Then we restore the
bit, if the new return value is different.
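A hedged userspace sketch of that clear/re-check sequence (do_shrink_slab()
and the per-memcg bitmap are stubbed; the real logic lives in
shrink_slab_memcg()):

#include <stdio.h>

#define SHRINK_EMPTY (~0UL - 1)

static unsigned long map;                       /* per-memcg shrinker bitmap */
static unsigned long charged;                   /* simulated LRU population  */

static unsigned long do_shrink_slab_stub(void)  /* stand-in for do_shrink_slab() */
{
        return charged ? charged : SHRINK_EMPTY;
}

int main(void)
{
        int id = 3;
        map |= 1UL << id;                       /* bit set, but LRU is empty */

        unsigned long ret = do_shrink_slab_stub();
        if (ret == SHRINK_EMPTY) {
                map &= ~(1UL << id);            /* clear_bit()               */
                /* smp_mb__after_atomic() sits here in the kernel            */
                ret = do_shrink_slab_stub();    /* the second call           */
                if (ret != SHRINK_EMPTY)
                        map |= 1UL << id;       /* restore the bit           */
        }
        printf("bit %d is now %s\n", id, (map >> id) & 1 ? "set" : "clear");
        return 0;
}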
Note that the single smp_mb__after_atomic() in shrink_slab_memcg() covers
two situations:

1) list_lru_add()                     shrink_slab_memcg()
     list_add_tail()                    for_each_set_bit()
     set_bit()                          do_shrink_slab()

   We don't add a barrier before the first call of do_shrink_slab(), so
   as not to slow down the generic case. Also, this is why the second
   call is needed, as seen below in (2).

2) list_lru_add()                     shrink_slab_memcg()
     list_add_tail()                    ...
     set_bit()                          ...
   ...                                  for_each_set_bit()
   do_shrink_slab()                       do_shrink_slab()
     clear_bit()                          ...
   ...                                    ...
   list_lru_add()                         ...
     list_add_tail()                      clear_bit()
     set_bit()                            do_shrink_slab()

The barriers guarantee that the second do_shrink_slab() in the right-side
task sees the list update if the left side really cleared the bit. This
case is drawn in the code comment.

[Results/performance of the patchset]
After the whole patchset is applied, the test below shows a significant
increase in performance:

$echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
$mkdir /sys/fs/cgroup/memory/ct
$echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
$for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i;
    echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
    mkdir -p s/$i; mount -t tmpfs $i s/$i;
    touch s/$i/file; done

Then, 5 sequential calls of drop caches:

$time echo 3 > /proc/sys/vm/drop_caches

1) Before:
0.00user 13.78system 0:13.78elapsed 99%CPU
0.00user 5.59system 0:05.60elapsed 99%CPU
0.00user 5.48system 0:05.48elapsed 99%CPU
0.00user 8.35system 0:08.35elapsed 99%CPU
0.00user 8.34system 0:08.35elapsed 99%CPU

2) After:
0.00user 1.10system 0:01.10elapsed 99%CPU
0.00user 0.00system 0:00.01elapsed 64%CPU
0.00user 0.01system 0:00.01elapsed 82%CPU
0.00user 0.00system 0:00.01elapsed 64%CPU
0.00user 0.01system 0:00.01elapsed 82%CPU

The results show a performance increase of at least 548 times.
Shakeel Butt tested this patchset with fork-bomb on his configuration:
> I created 255 memcgs, 255 ext4 mounts and made each memcg create a
> file containing few KiBs on corresponding mount. Then in a separate
> memcg of 200 MiB limit ran a fork-bomb.
>
> I ran the "perf record -ag -- sleep 60" and below are the results:
>
> Without the patch series:
> Samples: 4M of event 'cycles', Event count (approx.): 3279403076005
> + 36.40% fb.sh [kernel.kallsyms] [k] shrink_slab
> + 18.97% fb.sh [kernel.kallsyms] [k] list_lru_count_one
> + 6.75% fb.sh [kernel.kallsyms] [k] super_cache_count
> + 0.49% fb.sh [kernel.kallsyms] [k] down_read_trylock
> + 0.44% fb.sh [kernel.kallsyms] [k] mem_cgroup_iter
> + 0.27% fb.sh [kernel.kallsyms] [k] up_read
> + 0.21% fb.sh [kernel.kallsyms] [k] osq_lock
> + 0.13% fb.sh [kernel.kallsyms] [k] shmem_unused_huge_count
> + 0.08% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
> + 0.08% fb.sh [kernel.kallsyms] [k] shrink_node
>
> With the patch series:
> Samples: 4M of event 'cycles', Event count (approx.): 2756866824946
> + 47.49% fb.sh [kernel.kallsyms] [k] down_read_trylock
> + 30.72% fb.sh [kernel.kallsyms] [k] up_read
> + 9.51% fb.sh [kernel.kallsyms] [k] mem_cgroup_iter
> + 1.69% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
> + 1.35% fb.sh [kernel.kallsyms] [k] mem_cgroup_protected
> + 1.05% fb.sh [kernel.kallsyms] [k] queued_spin_lock_slowpath
> + 0.85% fb.sh [kernel.kallsyms] [k] _raw_spin_lock
> + 0.78% fb.sh [kernel.kallsyms] [k] lruvec_lru_size
> + 0.57% fb.sh [kernel.kallsyms] [k] shrink_node
> + 0.54% fb.sh [kernel.kallsyms] [k] queue_work_on
> + 0.46% fb.sh [kernel.kallsyms] [k] shrink_slab_memcg

[ktkhai@virtuozzo.com: v9]
Link: http://lkml.kernel.org/r/153112561772.4097.11011071937553113003.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/153063070859.1818.11870882950920963480.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai
Acked-by: Vladimir Davydov
Tested-by: Shakeel Butt
Cc: Al Viro
Cc: Andrey Ryabinin
Cc: Chris Wilson
Cc: Greg Kroah-Hartman
Cc: Guenter Roeck
Cc: "Huang, Ying"
Cc: Johannes Weiner
Cc: Josef Bacik
Cc: Li RongQing
Cc: Matthew Wilcox
Cc: Matthias Kaehlcke
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Philippe Ombredanne
Cc: Roman Gushchin
Cc: Sahitya Tummala
Cc: Stephen Rothwell
Cc: Tetsuo Handa
Cc: Thomas Gleixner
Cc: Waiman Long
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
We need to distinguish the situation when a shrinker has a very small
number of objects (see vfs_pressure_ratio() called from
super_cache_count()) from the situation when it has no objects at all.
Currently, in both of these cases, shrinker::count_objects() returns 0.

The patch introduces a new SHRINK_EMPTY return value, which will be used
for the "no objects at all" case. It is mostly a refactoring, as
SHRINK_EMPTY is replaced by 0 by all callers of do_shrink_slab() in this
patch; all the magic will happen in further patches.

Link: http://lkml.kernel.org/r/153063069574.1818.11037751256699341813.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai
Acked-by: Vladimir Davydov
Tested-by: Shakeel Butt
Cc: Al Viro
Cc: Andrey Ryabinin
Cc: Chris Wilson
Cc: Greg Kroah-Hartman
Cc: Guenter Roeck
Cc: "Huang, Ying"
Cc: Johannes Weiner
Cc: Josef Bacik
Cc: Li RongQing
Cc: Matthew Wilcox
Cc: Matthias Kaehlcke
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Philippe Ombredanne
Cc: Roman Gushchin
Cc: Sahitya Tummala
Cc: Stephen Rothwell
Cc: Tetsuo Handa
Cc: Thomas Gleixner
Cc: Waiman Long
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The patch makes shrink_slab() get called for root_mem_cgroup in the same
way as it's called for the rest of the cgroups. This simplifies the
logic and improves readability.

[ktkhai@virtuozzo.com: wrote changelog]
Link: http://lkml.kernel.org/r/153063068338.1818.11496084754797453962.stgit@localhost.localdomain
Signed-off-by: Vladimir Davydov
Signed-off-by: Kirill Tkhai
Tested-by: Shakeel Butt
Cc: Al Viro
Cc: Andrey Ryabinin
Cc: Chris Wilson
Cc: Greg Kroah-Hartman
Cc: Guenter Roeck
Cc: "Huang, Ying"
Cc: Johannes Weiner
Cc: Josef Bacik
Cc: Li RongQing
Cc: Matthew Wilcox
Cc: Matthias Kaehlcke
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Philippe Ombredanne
Cc: Roman Gushchin
Cc: Sahitya Tummala
Cc: Stephen Rothwell
Cc: Tetsuo Handa
Cc: Thomas Gleixner
Cc: Waiman Long
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Using the preparations made in the previous patches, in the case of memcg
shrink we may avoid shrinkers which are not set in the memcg's shrinker
bitmap. To do that, we separate the iterations over memcg-aware and
!memcg-aware shrinkers, and memcg-aware shrinkers are chosen via
for_each_set_bit() from the bitmap (see the sketch below). On big nodes
with many isolated environments, this gives significant performance
growth. See the next patches for the details.

Note that the patch does not yet handle empty memcg shrinkers, since we
never clear a bitmap bit once it has been set. Their shrinkers will be
called again, with no shrunk objects as a result. This functionality is
provided by the next patches.
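A hedged userspace sketch of that iteration idea (for_each_set_bit() is
the kernel helper; here it is just a loop over a single bitmap word):

#include <stdio.h>

int main(void)
{
        unsigned long map = 0;          /* per-memcg shrinker bitmap (1 word) */

        map |= 1UL << 3;                /* shrinker id 3 has charged objects  */
        map |= 1UL << 7;                /* shrinker id 7 has charged objects  */

        /* Visit only the shrinkers whose bit is set, instead of walking
         * every registered shrinker for this memcg. */
        for (int id = 0; id < 64; id++)
                if (map & (1UL << id))
                        printf("would call do_shrink_slab() for shrinker %d\n", id);
        return 0;
}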
[ktkhai@virtuozzo.com: v9]
Link: http://lkml.kernel.org/r/153112558507.4097.12713813335683345488.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/153063066653.1818.976035462801487910.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai
Acked-by: Vladimir Davydov
Tested-by: Shakeel Butt
Cc: Al Viro
Cc: Andrey Ryabinin
Cc: Chris Wilson
Cc: Greg Kroah-Hartman
Cc: Guenter Roeck
Cc: "Huang, Ying"
Cc: Johannes Weiner
Cc: Josef Bacik
Cc: Li RongQing
Cc: Matthew Wilcox
Cc: Matthias Kaehlcke
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Philippe Ombredanne
Cc: Roman Gushchin
Cc: Sahitya Tummala
Cc: Stephen Rothwell
Cc: Tetsuo Handa
Cc: Thomas Gleixner
Cc: Waiman Long
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Imagine a big node with many cpus, memory cgroups and containers. Let's
say we have 200 containers; every container has 10 mounts and 10 cgroups.
All container tasks don't touch foreign containers' mounts. If there is
intensive page writing and global reclaim happens, a writing task has to
iterate over all memcgs to shrink slab before it's able to go to
shrink_page_list().

Iteration over all the memcg slabs is very expensive: the task has to
visit 200 * 10 = 2000 shrinkers for every memcg, and since there are
2000 memcgs, the total number of calls is 2000 * 2000 = 4000000.

So, the shrinker makes 4 million do_shrink_slab() calls just to try to
isolate SWAP_CLUSTER_MAX pages in one of the actively writing memcgs via
shrink_page_list(). I've observed a node spending almost 100% of its time
in the kernel, making useless iteration over already shrunk slabs.

This patch adds a bitmap of memcg-aware shrinkers to the memcg. The size
of the bitmap depends on bitmap_nr_ids, and during the memcg's life it is
maintained to be large enough to fit bitmap_nr_ids shrinkers. Every bit
in the map is related to the corresponding shrinker id.

Next patches will maintain a set bit only for really charged memcgs. This
will allow shrink_slab() to increase its performance significantly. See
the last patch for the numbers.

[ktkhai@virtuozzo.com: v9]
Link: http://lkml.kernel.org/r/153112549031.4097.3576147070498769979.stgit@localhost.localdomain
[ktkhai@virtuozzo.com: add comment to mem_cgroup_css_online()]
Link: http://lkml.kernel.org/r/521f9e5f-c436-b388-fe83-4dc870bfb489@virtuozzo.com
Link: http://lkml.kernel.org/r/153063056619.1818.12550500883688681076.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai
Acked-by: Vladimir Davydov
Tested-by: Shakeel Butt
Cc: Al Viro
Cc: Andrey Ryabinin
Cc: Chris Wilson
Cc: Greg Kroah-Hartman
Cc: Guenter Roeck
Cc: "Huang, Ying"
Cc: Johannes Weiner
Cc: Josef Bacik
Cc: Li RongQing
Cc: Matthew Wilcox
Cc: Matthias Kaehlcke
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Philippe Ombredanne
Cc: Roman Gushchin
Cc: Sahitya Tummala
Cc: Stephen Rothwell
Cc: Tetsuo Handa
Cc: Thomas Gleixner
Cc: Waiman Long
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Introduce a shrinker::id number, which is used to enumerate memcg-aware
shrinkers. The numbers start from 0, and the code tries to keep them as
small as possible.

This will be used to represent memcg-aware shrinkers in the memcg
shrinkers map.

Since all memcg-aware shrinkers are based on list_lru, which is per-memcg
in the CONFIG_MEMCG_KMEM case only, the new functionality will be under
this config option.

[ktkhai@virtuozzo.com: v9]
Link: http://lkml.kernel.org/r/153112546435.4097.10607140323811756557.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/153063054586.1818.6041047871606697364.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai
Acked-by: Vladimir Davydov
Tested-by: Shakeel Butt
Cc: Al Viro
Cc: Andrey Ryabinin
Cc: Chris Wilson
Cc: Greg Kroah-Hartman
Cc: Guenter Roeck
Cc: "Huang, Ying"
Cc: Johannes Weiner
Cc: Josef Bacik
Cc: Li RongQing
Cc: Matthew Wilcox
Cc: Matthias Kaehlcke
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Philippe Ombredanne
Cc: Roman Gushchin
Cc: Sahitya Tummala
Cc: Stephen Rothwell
Cc: Tetsuo Handa
Cc: Thomas Gleixner
Cc: Waiman Long
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Use smaller scan_control fields for order, priority, and reclaim_idx.
Convert the fields from int => s8. All easily fit within a byte:

- allocation order range: 0..MAX_ORDER(64?)
- priority range: 0..12(DEF_PRIORITY)
- reclaim_idx range: 0..6(__MAX_NR_ZONES)

Since 6538b8ea886e ("x86_64: expand kernel stack to 16K") x86_64 stack
overflows are not an issue. But it's inefficient to use ints.

Use s8 (signed byte) rather than u8 to allow for loops like:

do {
...
} while (--sc.priority >= 0);

Add a BUILD_BUG_ON to verify that s8 is capable of storing the max values
(see the sketch below).

This reduces sizeof(struct scan_control):
- 96 => 80 bytes (x86_64)
- 68 => 56 bytes (i386)

The scan_control structure field order is changed to utilize padding.
After this patch there is 1 bit of scan_control padding.

akpm: makes my vmscan.o's .text 572 bytes smaller as well.
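A minimal userspace sketch of that range check (the MAX_ORDER and
__MAX_NR_ZONES values below are illustrative assumptions; the kernel
patch uses BUILD_BUG_ON, and _Static_assert stands in for it here):

#include <limits.h>

#define DEF_PRIORITY    12
#define MAX_ORDER       11              /* assumed value, varies by config */
#define __MAX_NR_ZONES  6               /* assumed value, varies by config */

/* s8 must be able to hold the largest value each field can take. */
_Static_assert(DEF_PRIORITY <= SCHAR_MAX, "priority fits in s8");
_Static_assert(MAX_ORDER <= SCHAR_MAX, "order fits in s8");
_Static_assert(__MAX_NR_ZONES - 1 <= SCHAR_MAX, "reclaim_idx fits in s8");

int main(void)
{
        return 0;
}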
Link: http://lkml.kernel.org/r/20180530061212.84915-1-gthelen@google.com
Signed-off-by: Greg Thelen
Suggested-by: Matthew Wilcox
Reviewed-by: Andrew Morton
Cc: Michal Hocko
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
08 Jun, 2018
2 commits
-
The memory controller implements the memory.low best-effort memory
protection mechanism, which works perfectly in many cases and allows
protecting the working sets of important workloads from sudden reclaim.

But its semantics have a significant limitation: it works only as long as
there is a supply of reclaimable memory. This makes it pretty useless
against any sort of slow memory leak or memory usage increase. This is
especially true for swapless systems. If swap is enabled, memory soft
protection effectively postpones the problem, allowing a leaking
application to fill the entire swap area, which makes no sense. The only
effective way to guarantee memory protection in this case is to invoke
the OOM killer.

It's possible to handle this case in userspace by reacting to MEMCG_LOW
events; but there is still a place for a fail-safe in-kernel mechanism to
provide stronger guarantees.

This patch introduces the memory.min interface for the cgroup v2 memory
controller. It works very similarly to memory.low (sharing the same
hierarchical behavior), except that it's not disabled if there is no more
reclaimable memory in the system.

If a cgroup is not populated, its memory.min is ignored, because
otherwise even the OOM killer wouldn't be able to reclaim the protected
memory, and the system could stall.

[guro@fb.com: s/low/min/ in docs]
Link: http://lkml.kernel.org/r/20180510130758.GA9129@castle.DHCP.thefacebook.com
Link: http://lkml.kernel.org/r/20180509180734.GA4856@castle.DHCP.thefacebook.com
Signed-off-by: Roman Gushchin
Reviewed-by: Randy Dunlap
Acked-by: Johannes Weiner
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
While revisiting my Btrfs swapfile series [1], I introduced a situation
in which reclaim would lock i_rwsem, and even though the swapon() path
clearly made GFP_KERNEL allocations while holding i_rwsem, I got no
complaints from lockdep. It turns out that the rework of the fs_reclaim
annotation was broken: if the current task has PF_MEMALLOC set, we don't
acquire the dummy fs_reclaim lock, but when reclaiming we always check
this _after_ we've just set the PF_MEMALLOC flag. In most cases, we can
fix this by moving the fs_reclaim_{acquire,release}() outside of the
memalloc_noreclaim_{save,restore}(), although kswapd is slightly
different. After applying this, I got the expected lockdep splats.

1: https://lwn.net/Articles/625412/
Link: http://lkml.kernel.org/r/9f8aa70652a98e98d7c4de0fc96a4addcee13efe.1523778026.git.osandov@fb.com
Fixes: d92a8cfcb37e ("locking/lockdep: Rework FS_RECLAIM annotation")
Signed-off-by: Omar Sandoval
Reviewed-by: Andrew Morton
Cc: Peter Zijlstra
Cc: Tetsuo Handa
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
03 Jun, 2018
1 commit
-
George Boole would have noticed a slight error in 4.16 commit
69d763fc6d3a ("mm: pin address_space before dereferencing it while
isolating an LRU page"). Fix it, to match both the comment above it,
and the original behaviour.

Although anonymous pages are not marked PageDirty at first, we have an
old habit of calling SetPageDirty when a page is removed from swap
cache: so there's a category of ex-swap pages that are easily
migratable, but were inadvertently excluded from compaction's async
migration in 4.16.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805302014001.12558@eggly.anvils
Fixes: 69d763fc6d3a ("mm: pin address_space before dereferencing it while isolating an LRU page")
Signed-off-by: Hugh Dickins
Acked-by: Minchan Kim
Acked-by: Mel Gorman
Reported-by: Ivan Kalvachev
Cc: "Huang, Ying"
Cc: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
16 Apr, 2018
1 commit
-
syzbot is catching so many bugs triggered by commit 9ee332d99e4d5a97
("sget(): handle failures of register_shrinker()"). That commit expected
that calling kill_sb() from deactivate_locked_super() without a
successful fill_super() is safe, but the reality was different; some
callers assign attributes which are needed for kill_sb() after sget()
succeeds.

For example, [1] is a report where sb->s_mode (which seems to be either
FMODE_READ | FMODE_EXCL | FMODE_WRITE or FMODE_READ | FMODE_EXCL) is not
assigned unless sget() succeeds. But it is not worth complicating sget()
so that the register_shrinker() failure path can safely call
kill_block_super() via kill_sb(). Making alloc_super() fail if memory
allocation for register_shrinker() failed is much simpler. Let's avoid
calling deactivate_locked_super() from sget_userns() by preallocating
memory for the shrinker and making register_shrinker() in sget_userns()
never fail.

[1] https://syzkaller.appspot.com/bug?id=588996a25a2587be2e3a54e8646728fb9cae44e7
Signed-off-by: Tetsuo Handa
Reported-by: syzbot
Cc: Al Viro
Cc: Michal Hocko
Signed-off-by: Al Viro
12 Apr, 2018
7 commits
-
Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.

[willy@infradead.org: fix nds32, fs/dax.c]
Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox
Acked-by: Jeff Layton
Cc: Darrick J. Wong
Cc: Dave Chinner
Cc: Ryusuke Konishi
Cc: Will Deacon
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting") added per-cpu drift to all memory cgroup stats
and events shown in memory.stat and memory.events.For memory.stat this is acceptable. But memory.events issues file
notifications, and somebody polling the file for changes will be
confused when the counters in it are unchanged after a wakeup.Luckily, the events in memory.events - MEMCG_LOW, MEMCG_HIGH, MEMCG_MAX,
MEMCG_OOM - are sufficiently rare and high-level that we don't need
per-cpu buffering for them: MEMCG_HIGH and MEMCG_MAX would be the most
frequent, but they're counting invocations of reclaim, which is a
complex operation that touches many shared cachelines.This splits memory.events from the generic VM events and tracks them in
their own, unbuffered atomic counters. That's also cleaner, as it
eliminates the ugly enum nesting of VM and cgroup events.[hannes@cmpxchg.org: "array subscript is above array bounds"]
Link: http://lkml.kernel.org/r/20180406155441.GA20806@cmpxchg.org
Link: http://lkml.kernel.org/r/20180405175507.GA24817@cmpxchg.org
Fixes: a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
Signed-off-by: Johannes Weiner
Reported-by: Tejun Heo
Acked-by: Tejun Heo
Acked-by: Michal Hocko
Cc: Vladimir Davydov
Cc: Roman Gushchin
Cc: Rik van Riel
Cc: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The trace event trace_mm_vmscan_lru_shrink_inactive() currently has 12
parameters! Seven of them are from the reclaim_stat structure. This
structure is currently local to mm/vmscan.c. By moving it to the global
vmstat.h header, we can also reference it from the vmscan tracepoints.
In moving it, it brings down the overhead of passing so many arguments
to the trace event. In the future, we may limit the number of arguments
that a trace event may pass (ideally just 6, but more realistically it
may be 8).

Before this patch, the code to call the trace event is this:
0f 83 aa fe ff ff jae ffffffff811e6261
48 8b 45 a0 mov -0x60(%rbp),%rax
45 8b 64 24 20 mov 0x20(%r12),%r12d
44 8b 6d d4 mov -0x2c(%rbp),%r13d
8b 4d d0 mov -0x30(%rbp),%ecx
44 8b 75 cc mov -0x34(%rbp),%r14d
44 8b 7d c8 mov -0x38(%rbp),%r15d
48 89 45 90 mov %rax,-0x70(%rbp)
8b 83 b8 fe ff ff mov -0x148(%rbx),%eax
8b 55 c0 mov -0x40(%rbp),%edx
8b 7d c4 mov -0x3c(%rbp),%edi
8b 75 b8 mov -0x48(%rbp),%esi
89 45 80 mov %eax,-0x80(%rbp)
65 ff 05 e4 f7 e2 7e incl %gs:0x7ee2f7e4(%rip) # 15bd0
48 8b 05 75 5b 13 01 mov 0x1135b75(%rip),%rax # ffffffff8231bf68
48 85 c0 test %rax,%rax
74 72 je ffffffff811e646a
48 89 c3 mov %rax,%rbx
4c 8b 10 mov (%rax),%r10
89 f8 mov %edi,%eax
48 89 85 68 ff ff ff mov %rax,-0x98(%rbp)
89 f0 mov %esi,%eax
48 89 85 60 ff ff ff mov %rax,-0xa0(%rbp)
89 c8 mov %ecx,%eax
48 89 85 78 ff ff ff mov %rax,-0x88(%rbp)
89 d0 mov %edx,%eax
48 89 85 70 ff ff ff mov %rax,-0x90(%rbp)
8b 45 8c mov -0x74(%rbp),%eax
48 8b 7b 08 mov 0x8(%rbx),%rdi
48 83 c3 18 add $0x18,%rbx
50 push %rax
41 54 push %r12
41 55 push %r13
ff b5 78 ff ff ff pushq -0x88(%rbp)
41 56 push %r14
41 57 push %r15
ff b5 70 ff ff ff pushq -0x90(%rbp)
4c 8b 8d 68 ff ff ff mov -0x98(%rbp),%r9
4c 8b 85 60 ff ff ff mov -0xa0(%rbp),%r8
48 8b 4d 98 mov -0x68(%rbp),%rcx
48 8b 55 90 mov -0x70(%rbp),%rdx
8b 75 80 mov -0x80(%rbp),%esi
41 ff d2 callq *%r10

After the patch:
0f 83 a8 fe ff ff jae ffffffff811e626d
8b 9b b8 fe ff ff mov -0x148(%rbx),%ebx
45 8b 64 24 20 mov 0x20(%r12),%r12d
4c 8b 6d a0 mov -0x60(%rbp),%r13
65 ff 05 f5 f7 e2 7e incl %gs:0x7ee2f7f5(%rip) # 15bd0
4c 8b 35 86 5b 13 01 mov 0x1135b86(%rip),%r14 # ffffffff8231bf68
4d 85 f6 test %r14,%r14
74 2a je ffffffff811e6411
49 8b 06 mov (%r14),%rax
8b 4d 8c mov -0x74(%rbp),%ecx
49 8b 7e 08 mov 0x8(%r14),%rdi
49 83 c6 18 add $0x18,%r14
4c 89 ea mov %r13,%rdx
45 89 e1 mov %r12d,%r9d
4c 8d 45 b8 lea -0x48(%rbp),%r8
89 de mov %ebx,%esi
51 push %rcx
48 8b 4d 98 mov -0x68(%rbp),%rcx
ff d0 callq *%rax

Link: http://lkml.kernel.org/r/2559d7cb-ec60-1200-2362-04fa34fd02bb@fb.com
Link: http://lkml.kernel.org/r/20180322121003.4177af15@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware)
Reported-by: Alexei Starovoitov
Acked-by: David Rientjes
Acked-by: Michal Hocko
Cc: Mel Gorman
Cc: Vlastimil Babka
Cc: Andrey Ryabinin
Cc: Alexei Starovoitov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
memcg reclaim may alter pgdat->flags based on the state of the LRU lists
in a cgroup and its children. PGDAT_WRITEBACK may force kswapd to sleep
in congested_wait(), PGDAT_DIRTY may force kswapd to write back
filesystem pages. But the worst here is PGDAT_CONGESTED, since it may
force all direct reclaims to stall in wait_iff_congested(). Note that
only kswapd has the power to clear any of these bits. This might just
never happen if cgroup limits are configured that way. So all direct
reclaims will stall as long as we have some congested bdi in the system.

Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole
pgdat, and only kswapd can clear pgdat->flags once the node is balanced,
thus it's reasonable to leave all decisions about node state to kswapd.

Why only kswapd? Why not allow global direct reclaim to change these
flags? It is because currently only kswapd can clear these flags. I'm
less worried about the case when PGDAT_CONGESTED is falsely not set, and
more worried about the case when it is falsely set. If a direct
reclaimer sets PGDAT_CONGESTED, do we have a guarantee that after the
congestion problem is sorted out, kswapd will be woken up and clear the
flag? It seems like there is no such guarantee. E.g. direct reclaimers
may eventually balance the pgdat and kswapd simply won't wake up (see
wakeup_kswapd()).

Moving pgdat->flags manipulation to kswapd means that cgroup2 reclaim
now loses its congestion throttling mechanism. Add per-cgroup
congestion state and throttle cgroup2 reclaimers if the memcg is in
congestion state.

Currently there is no need for per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
bits since they alter only kswapd behavior.

The problem can be easily demonstrated by creating heavy congestion in
one cgroup:

echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
mkdir -p /sys/fs/cgroup/congester
echo 512M > /sys/fs/cgroup/congester/memory.max
echo $$ > /sys/fs/cgroup/congester/cgroup.procs
/* generate a lot of dirty data on a slow HDD */
while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
....
while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &

and some job in another cgroup:

mkdir /sys/fs/cgroup/victim
echo 128M > /sys/fs/cgroup/victim/memory.max

# time cat /dev/sda > /dev/null
real 10m15.054s
user 0m0.487s
sys 1m8.505s

According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
of the time sleeping there.

With the patch, cat doesn't waste time anymore:

# time cat /dev/sda > /dev/null
real 5m32.911s
user 0m0.411s
sys 0m56.664s

[aryabinin@virtuozzo.com: congestion state should be per-node]
Link: http://lkml.kernel.org/r/20180406135215.10057-1-aryabinin@virtuozzo.com
[ayabinin@virtuozzo.com: make congestion state per-cgroup-per-node instead of just per-cgroup]
Link: http://lkml.kernel.org/r/20180406180254.8970-2-aryabinin@virtuozzo.com
Link: http://lkml.kernel.org/r/20180323152029.11084-5-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin
Reviewed-by: Shakeel Butt
Acked-by: Johannes Weiner
Cc: Mel Gorman
Cc: Tejun Heo
Cc: Michal Hocko
Cc: Steven Rostedt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
We have a separate LRU list for each memory cgroup. Memory reclaim
iterates over the cgroups and calls shrink_inactive_list() on every
inactive LRU list. Based on the state of a single LRU,
shrink_inactive_list() may flag the whole node as dirty, congested or
under writeback. This is obviously wrong and hurtful. It's especially
hurtful when we have a possibly small congested cgroup in the system.
Then *all* direct reclaims waste time by sleeping in
wait_iff_congested(). And the more memcgs there are in the system, the
longer the memory allocation stall is, because wait_iff_congested() is
called on each lru-list scan.

Sum the reclaim stats across all visited LRUs on the node and flag the
node as dirty, congested or under writeback based on that sum. Also call
congestion_wait(), wait_iff_congested() once per pgdat scan, instead of
once per lru-list scan.

This only fixes the problem for the global reclaim case. Per-cgroup
reclaim may alter global pgdat flags too, which is wrong. But that is a
separate issue and will be addressed in the next patch.

This change will not have any effect on systems with all workload
concentrated in a single cgroup.

[aryabinin@virtuozzo.com: check nr_writeback against all nr_taken, not just file]
Link: http://lkml.kernel.org/r/20180406180254.8970-1-aryabinin@virtuozzo.com
Link: http://lkml.kernel.org/r/20180323152029.11084-4-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin
Reviewed-by: Shakeel Butt
Cc: Mel Gorman
Cc: Tejun Heo
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Steven Rostedt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Only kswapd can have non-zero nr_immediate, and current_may_throttle()
is always true for kswapd (PF_LESS_THROTTLE bit is never set) thus it's
enough to check stat.nr_immediate only.

Link: http://lkml.kernel.org/r/20180315164553.17856-4-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin
Acked-by: Michal Hocko
Cc: Shakeel Butt
Cc: Mel Gorman
Cc: Tejun Heo
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Update some comments that became stale since the transition from per-zone
to per-node reclaim.

Link: http://lkml.kernel.org/r/20180315164553.17856-2-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin
Acked-by: Michal Hocko
Cc: Shakeel Butt
Cc: Mel Gorman
Cc: Tejun Heo
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
06 Apr, 2018
3 commits
-
Kswapd will not wake up if per-zone watermarks are not failing or if too
many previous attempts at background reclaim have failed.

This can be true if there is a lot of free memory available. For
high-order allocations, kswapd is responsible for waking up kcompactd for
background compaction. If the zone is not below its watermarks or
reclaim has recently failed (lots of free memory, nothing left to
reclaim), kcompactd does not get woken up.

When __GFP_DIRECT_RECLAIM is not allowed, allow kcompactd to still be
woken up even if kswapd will not reclaim. This allows high-order
allocations, such as thp, to still trigger background compaction even
when the zone has an abundance of free memory.

Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803111659420.209721@chino.kir.corp.google.com
Signed-off-by: David Rientjes
Acked-by: Vlastimil Babka
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Since we no longer use the return value of shrink_slab() for normal
reclaim, the comment is no longer true. If some do_shrink_slab() call
takes unexpectedly long (the root cause of the stall is currently
unknown) when register_shrinker()/unregister_shrinker() is pending,
trying to drop caches via /proc/sys/vm/drop_caches could become an
infinite cond_resched() loop if many mem_cgroups are defined. For
safety, let's not pretend forward progress.

Link: http://lkml.kernel.org/r/201802202229.GGF26507.LVFtMSOOHFJOQF@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa
Acked-by: Michal Hocko
Reviewed-by: Andrew Morton
Cc: Dave Chinner
Cc: Glauber Costa
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
When page_mapping() is called and the mapping is dereferenced in
page_evictable() through shrink_active_list(), it is possible for the
inode to be truncated and the embedded address space to be freed at the
same time. This may lead to the following race.

CPU1                                    CPU2

truncate(inode)                         shrink_active_list()
  ...                                     page_evictable(page)
  truncate_inode_page(mapping, page);
    delete_from_page_cache(page)
      spin_lock_irqsave(&mapping->tree_lock, flags);
        __delete_from_page_cache(page, NULL)
          page_cache_tree_delete(..)
            ...                             mapping = page_mapping(page);
            page->mapping = NULL;
            ...
      spin_unlock_irqrestore(&mapping->tree_lock, flags);
      page_cache_free_page(mapping, page)
        put_page(page)
          if (put_page_testzero(page)) -> false
- inode now has no pages and can be freed including embedded address_space

                                          mapping_unevictable(mapping)
                                            test_bit(AS_UNEVICTABLE, &mapping->flags);
- we've dereferenced mapping which is potentially already free.

A similar race exists between swap cache freeing and page_evictable()
too.

The address_space in the inode and swap cache will be freed after an RCU
grace period. So the races are fixed via enclosing the page_mapping()
and address_space usage in rcu_read_lock/unlock(). Some comments are
added in the code to make it clear what is protected by the RCU read
lock.

Link: http://lkml.kernel.org/r/20180212081227.1940-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying"
Reviewed-by: Jan Kara
Reviewed-by: Andrew Morton
Cc: Mel Gorman
Cc: Minchan Kim
Cc: "Huang, Ying"
Cc: Johannes Weiner
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
23 Mar, 2018
1 commit
-
Commit 726d061fbd36 ("mm: vmscan: kick flushers when we encounter dirty
pages on the LRU") added flusher invocation to shrink_inactive_list()
when many dirty pages on the LRU are encountered.

However, shrink_inactive_list() doesn't wake up flushers for legacy
cgroup reclaim, so the next commit bbef938429f5 ("mm: vmscan: remove old
flusher wakeup from direct reclaim path") removed the only source of
flusher wakeup in the legacy mem cgroup reclaim path.

This leads to premature OOM if there are too many dirty pages in a
cgroup:

# mkdir /sys/fs/cgroup/memory/test
# echo $$ > /sys/fs/cgroup/memory/test/tasks
# echo 50M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
# dd if=/dev/zero of=tmp_file bs=1M count=100
Killed

dd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
Call Trace:
dump_stack+0x46/0x65
dump_header+0x6b/0x2ac
oom_kill_process+0x21c/0x4a0
out_of_memory+0x2a5/0x4b0
mem_cgroup_out_of_memory+0x3b/0x60
mem_cgroup_oom_synchronize+0x2ed/0x330
pagefault_out_of_memory+0x24/0x54
__do_page_fault+0x521/0x540
page_fault+0x45/0x50

Task in /test killed as a result of limit of /test
memory: usage 51200kB, limit 51200kB, failcnt 73
memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
kmem: usage 296kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /test: cache:49632KB rss:1056KB rss_huge:0KB shmem:0KB
mapped_file:0KB dirty:49500KB writeback:0KB swap:0KB inactive_anon:0KB
active_anon:1168KB inactive_file:24760KB active_file:24960KB unevictable:0KB
Memory cgroup out of memory: Kill process 3861 (bash) score 88 or sacrifice child
Killed process 3876 (dd) total-vm:8484kB, anon-rss:1052kB, file-rss:1720kB, shmem-rss:0kB
oom_reaper: reaped process 3876 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Wake up flushers in legacy cgroup reclaim too.
Link: http://lkml.kernel.org/r/20180315164553.17856-1-aryabinin@virtuozzo.com
Fixes: bbef938429f5 ("mm: vmscan: remove old flusher wakeup from direct reclaim path")
Signed-off-by: Andrey Ryabinin
Tested-by: Shakeel Butt
Acked-by: Michal Hocko
Cc: Mel Gorman
Cc: Tejun Heo
Cc: Johannes Weiner
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
22 Feb, 2018
1 commit
-
When a thread mlocks an address space backed either by file pages which
are currently not present in memory or by swapped out anon pages (not in
swapcache), a new page is allocated and added to the local pagevec
(lru_add_pvec), I/O is triggered and the thread then sleeps on the page.
On I/O completion, the thread can wake on a different CPU; the mlock
syscall will then set the PageMlocked() bit of the page but will not be
able to put that page on the unevictable LRU as the page is on the
pagevec of a different CPU. Even on drain, that page will go to the
evictable LRU because the PageMlocked() bit is not checked on pagevec
drain.

The page will eventually go to the right LRU on reclaim but the LRU stats
will remain skewed for a long time.

This patch puts all the pages, even unevictable ones, on the pagevecs,
and on drain the pages will be added to their LRUs correctly by checking
their evictability. This resolves the issue of mlocked pages on pagevecs
of other CPUs because when those pagevecs are drained, the mlocked file
pages will go to the unevictable LRU. Also this makes the race with
munlock easier to resolve because the pagevec drains happen under the LRU
lock.

However there is still one place which makes a page evictable and does a
PageLRU check on that page without the LRU lock and needs special
attention: TestClearPageMlocked() and isolate_lru_page() in
clear_page_mlock().

#0: __pagevec_lru_add_fn        #1: clear_page_mlock
SetPageLRU()                    if (!TestClearPageMlocked())
                                  return
smp_mb() //
Acked-by: Vlastimil Babka
Cc: Jérôme Glisse
Cc: Huang Ying
Cc: Tim Chen
Cc: Michal Hocko
Cc: Greg Thelen
Cc: Johannes Weiner
Cc: Balbir Singh
Cc: Minchan Kim
Cc: Shaohua Li
Cc: Jan Kara
Cc: Nicholas Piggin
Cc: Dan Williams
Cc: Mel Gorman
Cc: Hugh Dickins
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Feb, 2018
1 commit
-
Link: http://lkml.kernel.org/r/1516700871-22279-4-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
01 Feb, 2018
4 commits
-
Minchan Kim asked the following question -- what locks protect
address_space destruction when a race happens between inode truncation
and __isolate_lru_page? Jan Kara clarified by describing the race as
follows

CPU1                                    CPU2

truncate(inode)                         __isolate_lru_page()
  ...
  truncate_inode_page(mapping, page);
    delete_from_page_cache(page)
      spin_lock_irqsave(&mapping->tree_lock, flags);
        __delete_from_page_cache(page, NULL)
          page_cache_tree_delete(..)
            ...                           mapping = page_mapping(page);
            page->mapping = NULL;
            ...
      spin_unlock_irqrestore(&mapping->tree_lock, flags);
      page_cache_free_page(mapping, page)
        put_page(page)
          if (put_page_testzero(page)) -> false
- inode now has no pages and can be freed including embedded address_space

                                          if (mapping && !mapping->a_ops->migratepage)
- we've dereferenced mapping which is potentially already free.

The race is theoretically possible but unlikely. Before the
delete_from_page_cache, truncate_cleanup_page is called so the page is
likely to be !PageDirty or PageWriteback, which gets skipped by the only
caller that checks the mapping in __isolate_lru_page. Even if the race
occurs, a substantial amount of work has to happen during a tiny window
with no preemption, but it could potentially be done using a virtual
machine to artificially slow one CPU or halt it during the critical
window.

This patch should eliminate the race with truncation by try-locking the
page before dereferencing mapping and aborting if the lock was not
acquired. There was a suggestion from Huang Ying to use RCU as a
side-effect to prevent mapping being freed. However, I do not like the
solution as it's an unconventional means of preserving a mapping and
it's not a context where rcu_read_lock is obviously protecting rcu data.

Link: http://lkml.kernel.org/r/20180104102512.2qos3h5vqzeisrek@techsingularity.net
Fixes: c82449352854 ("mm: compaction: make isolate_lru_page() filter-aware again")
Signed-off-by: Mel Gorman
Acked-by: Minchan Kim
Cc: "Huang, Ying"
Cc: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Remove unused function pgdat_reclaimable_pages() and
node_page_state_snapshot() which becomes unused as well.

Link: http://lkml.kernel.org/r/20171122094416.26019-1-jack@suse.cz
Signed-off-by: Jan Kara
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Shakeel Butt reported he has observed in production systems that the job
loader gets stuck for 10s of seconds while doing a mount operation. It
turns out that it was stuck in register_shrinker() because some
unrelated job was under memory pressure and was spending time in
shrink_slab(). Machines have a lot of shrinkers registered and jobs
under memory pressure have to traverse all of those memcg-aware
shrinkers and affect unrelated jobs which want to register their own
shrinkers.

To solve the issue, this patch simply bails out of slab shrinking if it
is found that someone wants to register a shrinker in parallel. A
downside is it could cause unfair shrinking between shrinkers. However,
it should be rare and we can add complicated logic if we find it's not
enough.

[akpm@linux-foundation.org: tweak code comment]
Link: http://lkml.kernel.org/r/20171115005602.GB23810@bbox
Link: http://lkml.kernel.org/r/1511481899-20335-1-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Signed-off-by: Shakeel Butt
Reported-by: Shakeel Butt
Tested-by: Shakeel Butt
Acked-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: Tetsuo Handa
Cc: Anshuman Khandual
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Previously we were using the ratio of the number of lru pages scanned to
the number of eligible lru pages to determine the number of slab objects
to scan. The problem with this is that these two things have nothing to
do with each other, so in slab heavy work loads where there is little to
no page cache we can end up with the pages scanned being a very low
number. This means that we reclaim next to no slab pages and waste a
lot of time reclaiming small amounts of space.

Consider the following scenario, where we have the following values and
the rest of the memory usage is in slab

Active: 58840 kB
Inactive: 46860 kB

Every time we do a get_scan_count() we do this

scan = size >> sc->priority

where sc->priority starts at DEF_PRIORITY, which is 12. The first loop
through reclaim would result in a scan target of 2 pages to 11715 total
inactive pages, and 3 pages to 14710 total active pages. This is a
really really small target for a system that is entirely slab pages.
And this is super optimistic; this assumes we even get to scan these
pages. We don't increment sc->nr_scanned unless we 1) isolate the page,
which assumes it's not in use, and 2) can lock the page. Under pressure
these numbers could probably go down; I'm sure there's some random pages
from daemons that aren't actually in use, so the targets get even
smaller.

Instead use sc->priority in the same way we use it to determine scan
amounts for the lru's. This generally equates to pages. Consider the
following

slab_pages = (nr_objects * object_size) / PAGE_SIZE

What we would like to do is

scan = slab_pages >> sc->priority

but we don't know the number of slab pages each shrinker controls, only
the objects. However say that theoretically we knew how many pages a
shrinker controlled, we'd still have to convert this to objects, which
would look like the following

scan = shrinker_pages >> sc->priority
scan_objects = (PAGE_SIZE / object_size) * scan

or written another way

scan_objects = (shrinker_pages >> sc->priority) *
               (PAGE_SIZE / object_size)

which can thus be written

scan_objects = ((shrinker_pages * PAGE_SIZE) / object_size) >>
               sc->priority

which is just

scan_objects = nr_objects >> sc->priority

We don't need to know exactly how many pages each shrinker represents;
its objects are all the information we need. Making this change allows
us to place an appropriate amount of pressure on the shrinker pools for
their relative size.
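A hedged numeric check of that algebra (arbitrary example values, not
from the patch): computing the scan target via hypothetical per-shrinker
page counts and computing it directly from nr_objects give the same
answer, so object counts are all the information needed:

#include <stdio.h>

int main(void)
{
        unsigned long nr_objects = 1UL << 20;   /* 1M objects, arbitrary */
        unsigned long object_size = 256;        /* bytes, arbitrary      */
        unsigned long page_size = 4096;
        int priority = 12;                      /* DEF_PRIORITY          */

        unsigned long shrinker_pages = nr_objects * object_size / page_size;
        unsigned long via_pages = (shrinker_pages >> priority) *
                                  (page_size / object_size);
        unsigned long direct = nr_objects >> priority;

        /* Both print 256. */
        printf("via pages: %lu, direct: %lu\n", via_pages, direct);
        return 0;
}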
Link: http://lkml.kernel.org/r/1510780549-6812-1-git-send-email-josef@toxicpanda.com
Signed-off-by: Josef Bacik
Acked-by: Johannes Weiner
Acked-by: Dave Chinner
Acked-by: Andrey Ryabinin
Cc: Michal Hocko
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
19 Dec, 2017
1 commit
-
Syzbot caught an oops at unregister_shrinker() because a combination of
commit 1d3d4437eae1bb29 ("vmscan: per-node deferred work") and fault
injection made register_shrinker() fail and the caller of
register_shrinker() did not check for failure.

----------
[ 554.881422] FAULT_INJECTION: forcing a failure.
[ 554.881422] name failslab, interval 1, probability 0, space 0, times 0
[ 554.881438] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82
[ 554.881443] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 554.881445] Call Trace:
[ 554.881459] dump_stack+0x194/0x257
[ 554.881474] ? arch_local_irq_restore+0x53/0x53
[ 554.881486] ? find_held_lock+0x35/0x1d0
[ 554.881507] should_fail+0x8c0/0xa40
[ 554.881522] ? fault_create_debugfs_attr+0x1f0/0x1f0
[ 554.881537] ? check_noncircular+0x20/0x20
[ 554.881546] ? find_next_zero_bit+0x2c/0x40
[ 554.881560] ? ida_get_new_above+0x421/0x9d0
[ 554.881577] ? find_held_lock+0x35/0x1d0
[ 554.881594] ? __lock_is_held+0xb6/0x140
[ 554.881628] ? check_same_owner+0x320/0x320
[ 554.881634] ? lock_downgrade+0x990/0x990
[ 554.881649] ? find_held_lock+0x35/0x1d0
[ 554.881672] should_failslab+0xec/0x120
[ 554.881684] __kmalloc+0x63/0x760
[ 554.881692] ? lock_downgrade+0x990/0x990
[ 554.881712] ? register_shrinker+0x10e/0x2d0
[ 554.881721] ? trace_event_raw_event_module_request+0x320/0x320
[ 554.881737] register_shrinker+0x10e/0x2d0
[ 554.881747] ? prepare_kswapd_sleep+0x1f0/0x1f0
[ 554.881755] ? _down_write_nest_lock+0x120/0x120
[ 554.881765] ? memcpy+0x45/0x50
[ 554.881785] sget_userns+0xbcd/0xe20
(...snipped...)
[ 554.898693] kasan: CONFIG_KASAN_INLINE enabled
[ 554.898724] kasan: GPF could be caused by NULL-ptr deref or user memory access
[ 554.898732] general protection fault: 0000 [#1] SMP KASAN
[ 554.898737] Dumping ftrace buffer:
[ 554.898741] (ftrace buffer empty)
[ 554.898743] Modules linked in:
[ 554.898752] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82
[ 554.898755] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 554.898760] task: ffff8801d1dbe5c0 task.stack: ffff8801c9e38000
[ 554.898772] RIP: 0010:__list_del_entry_valid+0x7e/0x150
[ 554.898775] RSP: 0018:ffff8801c9e3f108 EFLAGS: 00010246
[ 554.898780] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 554.898784] RDX: 0000000000000000 RSI: ffff8801c53c6f98 RDI: ffff8801c53c6fa0
[ 554.898788] RBP: ffff8801c9e3f120 R08: 1ffff100393c7d55 R09: 0000000000000004
[ 554.898791] R10: ffff8801c9e3ef70 R11: 0000000000000000 R12: 0000000000000000
[ 554.898795] R13: dffffc0000000000 R14: 1ffff100393c7e45 R15: ffff8801c53c6f98
[ 554.898800] FS: 0000000000000000(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
[ 554.898804] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 554.898807] CR2: 00000000dbc23000 CR3: 00000001c7269000 CR4: 00000000001406e0
[ 554.898813] DR0: 0000000020000000 DR1: 0000000020000000 DR2: 0000000000000000
[ 554.898816] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[ 554.898818] Call Trace:
[ 554.898828] unregister_shrinker+0x79/0x300
[ 554.898837] ? perf_trace_mm_vmscan_writepage+0x750/0x750
[ 554.898844] ? down_write+0x87/0x120
[ 554.898851] ? deactivate_super+0x139/0x1b0
[ 554.898857] ? down_read+0x150/0x150
[ 554.898864] ? check_same_owner+0x320/0x320
[ 554.898875] deactivate_locked_super+0x64/0xd0
[ 554.898883] deactivate_super+0x141/0x1b0
----------

Since allowing register_shrinker() callers to call unregister_shrinker()
when register_shrinker() failed can simplify the error recovery path,
this patch makes unregister_shrinker() a no-op when register_shrinker()
failed. Also, reset shrinker->nr_deferred in case unregister_shrinker()
was called twice by error.

Signed-off-by: Tetsuo Handa
Signed-off-by: Aliaksei Karaliou
Reported-by: syzbot
Cc: Glauber Costa
Cc: Al Viro
Signed-off-by: Al Viro
16 Nov, 2017
2 commits
-
Most callers of free_hot_cold_page claim the pages being released are
cache hot. The exception is the page reclaim paths where it is likely
that enough pages will be freed in the near future that the per-cpu
lists are going to be recycled and the cache hotness information is
lost. As no one really cares about the hotness of pages being released
to the allocator, just ditch the parameter.

The APIs are renamed to indicate that it's no longer about hot/cold
pages. It should also be less confusing as there are subtle differences
between them. __free_pages drops a reference and frees a page when the
refcount reaches zero. free_hot_cold_page handled pages whose refcount
was already zero, which is non-obvious from the name. free_unref_page
should be more obvious.

No performance impact is expected as the overhead is marginal. The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.

[mgorman@techsingularity.net: add pages to head, not tail]
Link: http://lkml.kernel.org/r/20171019154321.qtpzaeftoyyw4iey@techsingularity.net
Link: http://lkml.kernel.org/r/20171018075952.10627-8-mgorman@techsingularity.net
Signed-off-by: Mel Gorman
Acked-by: Vlastimil Babka
Cc: Andi Kleen
Cc: Dave Chinner
Cc: Dave Hansen
Cc: Jan Kara
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
list") 'pgdat->inactive_ratio' is not used, except for printing
"node_inactive_ratio: 0" in /proc/zoneinfo output.Remove it.
Link: http://lkml.kernel.org/r/20171003152611.27483-1-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin
Reviewed-by: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
15 Nov, 2017
1 commit
-
Pull core block layer updates from Jens Axboe:
"This is the main pull request for block storage for 4.15-rc1.Nothing out of the ordinary in here, and no API changes or anything
like that. Just various new features for drivers, core changes, etc.
In particular, this pull request contains:- A patch series from Bart, closing the whole on blk/scsi-mq queue
quescing.- A series from Christoph, building towards hidden gendisks (for
multipath) and ability to move bio chains around.- NVMe
- Support for native multipath for NVMe (Christoph).
- Userspace notifications for AENs (Keith).
- Command side-effects support (Keith).
- SGL support (Chaitanya Kulkarni)
- FC fixes and improvements (James Smart)
- Lots of fixes and tweaks (Various)- bcache
- New maintainer (Michael Lyle)
- Writeback control improvements (Michael)
- Various fixes (Coly, Elena, Eric, Liang, et al)- lightnvm updates, mostly centered around the pblk interface
(Javier, Hans, and Rakesh).- Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)
- Writeback series that fix the much discussed hundreds of millions
of sync-all units. This goes all the way, as discussed previously
(me).- Fix for missing wakeup on writeback timer adjustments (Yafang
Shao).- Fix laptop mode on blk-mq (me).
- {mq,name} tupple lookup for IO schedulers, allowing us to have
alias names. This means you can use 'deadline' on both !mq and on
mq (where it's called mq-deadline). (me).- blktrace race fix, oopsing on sg load (me).
- blk-mq optimizations (me).
- Obscure waitqueue race fix for kyber (Omar).
- NBD fixes (Josef).
- Disable writeback throttling by default on bfq, like we do on cfq
(Luca Miccio).- Series from Ming that enable us to treat flush requests on blk-mq
like any other request. This is a really nice cleanup.- Series from Ming that improves merging on blk-mq with schedulers,
getting us closer to flipping the switch on scsi-mq again.- BFQ updates (Paolo).
- blk-mq atomic flags memory ordering fixes (Peter Z).
- Loop cgroup support (Shaohua).
- Lots of minor fixes from lots of different folks, both for core and
driver code"* 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
nvme: fix visibility of "uuid" ns attribute
blk-mq: fixup some comment typos and lengths
ide: ide-atapi: fix compile error with defining macro DEBUG
blk-mq: improve tag waiting setup for non-shared tags
brd: remove unused brd_mutex
blk-mq: only run the hardware queue if IO is pending
block: avoid null pointer dereference on null disk
fs: guard_bio_eod() needs to consider partitions
xtensa/simdisk: fix compile error
nvme: expose subsys attribute to sysfs
nvme: create 'slaves' and 'holders' entries for hidden controllers
block: create 'slaves' and 'holders' entries for hidden gendisks
nvme: also expose the namespace identification sysfs files for mpath nodes
nvme: implement multipath access to nvme subsystems
nvme: track shared namespaces
nvme: introduce a nvme_ns_ids structure
nvme: track subsystems
block, nvme: Introduce blk_mq_req_flags_t
block, scsi: Make SCSI quiesce and resume work reliably
block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
...