Eric Lee / smarc-fsl-linux-kernel

24 May, 2019

1 commit

8607a9652 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 98 ... Browse Code »

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your optional any later version of the license

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3 file(s).

Signed-off-by: Thomas Gleixner
Reviewed-by: Richard Fontana
Reviewed-by: Allison Randal
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520075212.713472955@linutronix.de
Signed-off-by: Greg Kroah-Hartman

Thomas Gleixner
2019-05-24 23:37:54 +0800

21 May, 2019

3 commits

ec8f24b7f treewide: Add SPDX license identifier - Makefile/Kconfig ... Browse Code »

Add SPDX license identifiers to all Make/Kconfig files which:

- Have no license information of any form

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

GPL-2.0-only

Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman

Thomas Gleixner
2019-05-21 16:50:46 +0800
09c434b8a treewide: Add SPDX license identifier for more missed files ... Browse Code »

Add SPDX license identifiers to all files which:

- Have no license information of any form

- Have MODULE_LICENCE("GPL*") inside which was used in the initial
scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

GPL-2.0-only

Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman

Thomas Gleixner
2019-05-21 16:50:45 +0800
457c89965 treewide: Add SPDX license identifier for missed files ... Browse Code »

Add SPDX license identifiers to all files which:

- Have no license information of any form

- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

GPL-2.0-only

Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman

Thomas Gleixner
2019-05-21 16:50:45 +0800

20 May, 2019

2 commits

cb6f8739f Merge branch 'akpm' (patches from Andrew) ... Browse Code »

Merge yet more updates from Andrew Morton:
"A few final bits:

- large changes to vmalloc, yielding large performance benefits

- tweak the console-flush-on-panic code

- a few fixes"

* emailed patches from Andrew Morton :
panic: add an option to replay all the printk message in buffer
initramfs: don't free a non-existent initrd
fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount
mm/compaction.c: correct zone boundary handling when isolating pages from a pageblock
mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro
mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro
mm/vmalloc.c: keep track of free blocks for vmap allocation

Linus Torvalds
2019-05-20 03:15:32 +0800
1335d9a1f Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull core fixes from Ingo Molnar:
"This fixes a particularly thorny munmap() bug with MPX, plus fixes a
host build environment assumption in objtool"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Allow AR to be overridden with HOSTAR
x86/mpx, mm/core: Fix recursive munmap() corruption

Linus Torvalds
2019-05-20 01:23:24 +0800

19 May, 2019

4 commits

60fce36af mm/compaction.c: correct zone boundary handling when isolating pages from a pageblock ... Browse Code »

syzbot reported the following error from a tree with a head commit of
baf76f0c58ae ("slip: make slhc_free() silently accept an error pointer")

BUG: unable to handle kernel paging request at ffffea0003348000
#PF error: [normal kernel read fault]
PGD 12c3f9067 P4D 12c3f9067 PUD 12c3f8067 PMD 0
Oops: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 28916 Comm: syz-executor.2 Not tainted 5.1.0-rc6+ #89
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:constant_test_bit arch/x86/include/asm/bitops.h:314 [inline]
RIP: 0010:PageCompound include/linux/page-flags.h:186 [inline]
RIP: 0010:isolate_freepages_block+0x1c0/0xd40 mm/compaction.c:579
Code: 01 d8 ff 4d 85 ed 0f 84 ef 07 00 00 e8 29 00 d8 ff 4c 89 e0 83 85 38 ff
ff ff 01 48 c1 e8 03 42 80 3c 38 00 0f 85 31 0a 00 00 8b 2c 24 31 ff 49
c1 ed 10 41 83 e5 01 44 89 ee e8 3a 01 d8 ff
RSP: 0018:ffff88802b31eab8 EFLAGS: 00010246
RAX: 1ffffd4000669000 RBX: 00000000000cd200 RCX: ffffc9000a235000
RDX: 000000000001ca5e RSI: ffffffff81988cc7 RDI: 0000000000000001
RBP: ffff88802b31ebd8 R08: ffff88805af700c0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffea0003348000
R13: 0000000000000000 R14: ffff88802b31f030 R15: dffffc0000000000
FS: 00007f61648dc700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffea0003348000 CR3: 0000000037c64000 CR4: 00000000001426e0
Call Trace:
fast_isolate_around mm/compaction.c:1243 [inline]
fast_isolate_freepages mm/compaction.c:1418 [inline]
isolate_freepages mm/compaction.c:1438 [inline]
compaction_alloc+0x1aee/0x22e0 mm/compaction.c:1550

There is no reproducer and it is difficult to hit -- 1 crash every few
days. The issue is very similar to the fix in commit 6b0868c820ff
("mm/compaction.c: correct zone boundary handling when resetting pageblock
skip hints"). When isolating free pages around a target pageblock, the
boundary handling is off by one and can stray into the next pageblock.
Triggering the syzbot error requires that the end of pageblock is section
or zone aligned, and that the next section is unpopulated.

A more subtle consequence of the bug is that pageblocks were being
improperly used as migration targets which potentially hurts fragmentation
avoidance in the long-term one page at a time.

A debugging patch revealed that it's definitely possible to stray outside
of a pageblock which is not intended. While syzbot cannot be used to
verify this patch, it was confirmed that the debugging warning no longer
triggers with this patch applied. It has also been confirmed that the THP
allocation stress tests are not degraded by this patch.

Link: http://lkml.kernel.org/r/20190510182124.GI18914@techsingularity.net
Fixes: e332f741a8dd ("mm, compaction: be selective about what pageblocks to clear skip hints")
Signed-off-by: Mel Gorman
Reported-by: syzbot+d84c80f9fe26a0f7a734@syzkaller.appspotmail.com
Cc: Dmitry Vyukov
Cc: Andrey Ryabinin
Cc: Qian Cai
Cc: Michal Hocko
Cc: Vlastimil Babka
Cc: # v5.1+
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2019-05-19 06:52:26 +0800
a6cf4e0fe mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro ... Browse Code »

This macro adds some debug code to check that vmap allocations are
happened in ascending order.

By default this option is set to 0 and not active. It requires
recompilation of the kernel to activate it. Set to 1, compile the
kernel.

[urezki@gmail.com: v4]
Link: http://lkml.kernel.org/r/20190406183508.25273-4-urezki@gmail.com
Link: http://lkml.kernel.org/r/20190402162531.10888-4-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony)
Reviewed-by: Roman Gushchin
Cc: Ingo Molnar
Cc: Joel Fernandes
Cc: Matthew Wilcox
Cc: Michal Hocko
Cc: Oleksiy Avramchenko
Cc: Steven Rostedt
Cc: Tejun Heo
Cc: Thomas Garnier
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Uladzislau Rezki (Sony)
2019-05-19 06:52:26 +0800
bb850f4da mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro ... Browse Code »

This macro adds some debug code to check that the augment tree is
maintained correctly, meaning that every node contains valid
subtree_max_size value.

By default this option is set to 0 and not active. It requires
recompilation of the kernel to activate it. Set to 1, compile the
kernel.

[urezki@gmail.com: v4]
Link: http://lkml.kernel.org/r/20190406183508.25273-3-urezki@gmail.com
Link: http://lkml.kernel.org/r/20190402162531.10888-3-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony)
Reviewed-by: Roman Gushchin
Cc: Ingo Molnar
Cc: Joel Fernandes
Cc: Matthew Wilcox
Cc: Michal Hocko
Cc: Oleksiy Avramchenko
Cc: Steven Rostedt
Cc: Tejun Heo
Cc: Thomas Garnier
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Uladzislau Rezki (Sony)
2019-05-19 06:52:26 +0800
68ad4a330 mm/vmalloc.c: keep track of free blocks for vmap allocation ... Browse Code »

Patch series "improve vmap allocation", v3.

Objective
---------

Please have a look for the description at:

https://lkml.org/lkml/2018/10/19/786

but let me also summarize it a bit here as well.

The current implementation has O(N) complexity. Requests with different
permissive parameters can lead to long allocation time. When i say
"long" i mean milliseconds.

Description
-----------

This approach organizes the KVA memory layout into free areas of the
1-ULONG_MAX range, i.e. an allocation is done over free areas lookups,
instead of finding a hole between two busy blocks. It allows to have
lower number of objects which represent the free space, therefore to have
less fragmented memory allocator. Because free blocks are always as large
as possible.

It uses the augment tree where all free areas are sorted in ascending
order of va->va_start address in pair with linked list that provides
O(1) access to prev/next elements.

Since the tree is augment, we also maintain the "subtree_max_size" of VA
that reflects a maximum available free block in its left or right
sub-tree. Knowing that, we can easily traversal toward the lowest (left
most path) free area.

Allocation: ~O(log(N)) complexity. It is sequential allocation method
therefore tends to maximize locality. The search is done until a first
suitable block is large enough to encompass the requested parameters.
Bigger areas are split.

I copy paste here the description of how the area is split, since i
described it in https://lkml.org/lkml/2018/10/19/786

A free block can be split by three different ways. Their names are
FL_FIT_TYPE, LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e. they
correspond to how requested size and alignment fit to a free block.

FL_FIT_TYPE - in this case a free block is just removed from the free
list/tree because it fully fits. Comparing with current design there is
an extra work with rb-tree updating.

LE_FIT_TYPE/RE_FIT_TYPE - left/right edges fit. In this case what we do
is just cutting a free block. It is as fast as a current design. Most of
the vmalloc allocations just end up with this case, because the edge is
always aligned to 1.

NE_FIT_TYPE - Is much less common case. Basically it happens when
requested size and alignment does not fit left nor right edges, i.e. it
is between them. In this case during splitting we have to build a
remaining left free area and place it back to the free list/tree.

Comparing with current design there are two extra steps. First one is we
have to allocate a new vmap_area structure. Second one we have to insert
that remaining free block to the address sorted list/tree.

In order to optimize a first case there is a cache with free_vmap objects.
Instead of allocating from slab we just take an object from the cache and
reuse it.

Second one is pretty optimized. Since we know a start point in the tree
we do not do a search from the top. Instead a traversal begins from a
rb-tree node we split.

De-allocation. ~O(log(N)) complexity. An area is not inserted straight
away to the tree/list, instead we identify the spot first, checking if it
can be merged around neighbors. The list provides O(1) access to
prev/next, so it is pretty fast to check it. Summarizing. If merged then
large coalesced areas are created, if not the area is just linked making
more fragments.

There is one more thing that i should mention here. After modification of
VA node, its subtree_max_size is updated if it was/is the biggest area in
its left or right sub-tree. Apart of that it can also be populated back
to upper levels to fix the tree. For more details please have a look at
the __augment_tree_propagate_from() function and the description.

Tests and stressing
-------------------

I use the "test_vmalloc.sh" test driver available under
"tools/testing/selftests/vm/" since 5.1-rc1 kernel. Just trigger "sudo
./test_vmalloc.sh" to find out how to deal with it.

Tested on different platforms including x86_64/i686/ARM64/x86_64_NUMA.
Regarding last one, i do not have any physical access to NUMA system,
therefore i emulated it. The time of stressing is days.

If you run the test driver in "stress mode", you also need the patch that
is in Andrew's tree but not in Linux 5.1-rc1. So, please apply it:

http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=e0cf7749bade6da318e98e934a24d8b62fab512c

After massive testing, i have not identified any problems like memory
leaks, crashes or kernel panics. I find it stable, but more testing would
be good.

Performance analysis
--------------------

I have used two systems to test. One is i5-3320M CPU @ 2.60GHz and
another is HiKey960(arm64) board. i5-3320M runs on 4.20 kernel, whereas
Hikey960 uses 4.15 kernel. I have both system which could run on 5.1-rc1
as well, but the results have not been ready by time i an writing this.

Currently it consist of 8 tests. There are three of them which correspond
to different types of splitting(to compare with default). We have 3
ones(see above). Another 5 do allocations in different conditions.

a) sudo ./test_vmalloc.sh performance

When the test driver is run in "performance" mode, it runs all available
tests pinned to first online CPU with sequential execution test order. We
do it in order to get stable and repeatable results. Take a look at time
difference in "long_busy_list_alloc_test". It is not surprising because
the worst case is O(N).

# i5-3320M
How many cycles all tests took:
CPU0=646919905370(default) cycles vs CPU0=193290498550(patched) cycles

# See detailed table with results here:
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_default.txt
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_patched.txt

# Hikey960 8x CPUs
How many cycles all tests took:
CPU0=3478683207 cycles vs CPU0=463767978 cycles

# See detailed table with results here:
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_default.txt
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_patched.txt

b) time sudo ./test_vmalloc.sh test_repeat_count=1

With this configuration, all tests are run on all available online CPUs.
Before running each CPU shuffles its tests execution order. It gives
random allocation behaviour. So it is rough comparison, but it puts in
the picture for sure.

# i5-3320M
vs
real 101m22.813s real 0m56.805s
user 0m0.011s user 0m0.015s
sys 0m5.076s sys 0m0.023s

# See detailed table with results here:
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_default.txt
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_patched.txt

# Hikey960 8x CPUs
vs
real unknown real 4m25.214s
user unknown user 0m0.011s
sys unknown sys 0m0.670s

I did not manage to complete this test on "default Hikey960" kernel
version. After 24 hours it was still running, therefore i had to cancel
it. That is why real/user/sys are "unknown".

This patch (of 3):

Currently an allocation of the new vmap area is done over busy list
iteration(complexity O(n)) until a suitable hole is found between two busy
areas. Therefore each new allocation causes the list being grown. Due to
over fragmented list and different permissive parameters an allocation can
take a long time. For example on embedded devices it is milliseconds.

This patch organizes the KVA memory layout into free areas of the
1-ULONG_MAX range. It uses an augment red-black tree that keeps blocks
sorted by their offsets in pair with linked list keeping the free space in
order of increasing addresses.

Nodes are augmented with the size of the maximum available free block in
its left or right sub-tree. Thus, that allows to take a decision and
traversal toward the block that will fit and will have the lowest start
address, i.e. it is sequential allocation.

Allocation: to allocate a new block a search is done over the tree until a
suitable lowest(left most) block is large enough to encompass: the
requested size, alignment and vstart point. If the block is bigger than
requested size - it is split.

De-allocation: when a busy vmap area is freed it can either be merged or
inserted to the tree. Red-black tree allows efficiently find a spot
whereas a linked list provides a constant-time access to previous and next
blocks to check if merging can be done. In case of merging of
de-allocated memory chunk a large coalesced area is created.

Complexity: ~O(log(N))

[urezki@gmail.com: v3]
Link: http://lkml.kernel.org/r/20190402162531.10888-2-urezki@gmail.com
[urezki@gmail.com: v4]
Link: http://lkml.kernel.org/r/20190406183508.25273-2-urezki@gmail.com
Link: http://lkml.kernel.org/r/20190321190327.11813-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony)
Reviewed-by: Roman Gushchin
Cc: Michal Hocko
Cc: Matthew Wilcox
Cc: Thomas Garnier
Cc: Oleksiy Avramchenko
Cc: Steven Rostedt
Cc: Joel Fernandes
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Uladzislau Rezki (Sony)
2019-05-19 06:52:26 +0800

17 May, 2019

1 commit

7878c231d slab: remove /proc/slab_allocators ... Browse Code »

It turned out that DEBUG_SLAB_LEAK is still broken even after recent
recue efforts that when there is a large number of objects like
kmemleak_object which is normal on a debug kernel,

# grep kmemleak /proc/slabinfo
kmemleak_object 2243606 3436210 ...

reading /proc/slab_allocators could easily loop forever while processing
the kmemleak_object cache and any additional freeing or allocating
objects will trigger a reprocessing. To make a situation worse,
soft-lockups could easily happen in this sitatuion which will call
printk() to allocate more kmemleak objects to guarantee an infinite
loop.

Also, since it seems no one had noticed when it was totally broken
more than 2-year ago - see the commit fcf88917dd43 ("slab: fix a crash
by reading /proc/slab_allocators"), probably nobody cares about it
anymore due to the decline of the SLAB. Just remove it entirely.

Suggested-by: Vlastimil Babka
Suggested-by: Linus Torvalds
Signed-off-by: Qian Cai
Signed-off-by: Linus Torvalds

Qian Cai
2019-05-17 06:51:55 +0800

15 May, 2019

29 commits

def0fdae8 mm: memcontrol: fix NUMA round-robin reclaim at intermediate level ... Browse Code »

When a cgroup is reclaimed on behalf of a configured limit, reclaim
needs to round-robin through all NUMA nodes that hold pages of the memcg
in question. However, when assembling the mask of candidate NUMA nodes,
the code only consults the *local* cgroup LRU counters, not the
recursive counters for the entire subtree. Cgroup limits are frequently
configured against intermediate cgroups that do not have memory on their
own LRUs. In this case, the node mask will always come up empty and
reclaim falls back to scanning only the current node.

If a cgroup subtree has some memory on one node but the processes are
bound to another node afterwards, the limit reclaim will never age or
reclaim that memory anymore.

To fix this, use the recursive LRU counts for a cgroup subtree to
determine which nodes hold memory of that cgroup.

The code has been broken like this forever, so it doesn't seem to be a
problem in practice. I just noticed it while reviewing the way the LRU
counters are used in general.

Link: http://lkml.kernel.org/r/20190412151507.2769-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Reviewed-by: Shakeel Butt
Reviewed-by: Roman Gushchin
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2019-05-15 10:52:53 +0800
42a300353 mm: memcontrol: fix recursive statistics correctness & scalabilty ... Browse Code »

Right now, when somebody needs to know the recursive memory statistics
and events of a cgroup subtree, they need to walk the entire subtree and
sum up the counters manually.

There are two issues with this:

1. When a cgroup gets deleted, its stats are lost. The state counters
should all be 0 at that point, of course, but the events are not.
When this happens, the event counters, which are supposed to be
monotonic, can go backwards in the parent cgroups.

2. During regular operation, we always have a certain number of lazily
freed cgroups sitting around that have been deleted, have no tasks,
but have a few cache pages remaining. These groups' statistics do not
change until we eventually hit memory pressure, but somebody
watching, say, memory.stat on an ancestor has to iterate those every
time.

This patch addresses both issues by introducing recursive counters at
each level that are propagated from the write side when stats change.

Upward propagation happens when the per-cpu caches spill over into the
local atomic counter. This is the same thing we do during charge and
uncharge, except that the latter uses atomic RMWs, which are more
expensive; stat changes happen at around the same rate. In a sparse
file test (page faults and reclaim at maximum CPU speed) with 5 cgroup
nesting levels, perf shows __mod_memcg_page state at ~1%.

Link: http://lkml.kernel.org/r/20190412151507.2769-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Reviewed-by: Shakeel Butt
Reviewed-by: Roman Gushchin
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2019-05-15 10:52:53 +0800
db9adbcbe mm: memcontrol: move stat/event counting functions out-of-line ... Browse Code »

These are getting too big to be inlined in every callsite. They were
stolen from vmstat.c, which already out-of-lines them, and they have
only been growing since. The callsites aren't that hot, either.

Move __mod_memcg_state()
__mod_lruvec_state() and
__count_memcg_events() out of line and add kerneldoc comments.

Link: http://lkml.kernel.org/r/20190412151507.2769-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Reviewed-by: Shakeel Butt
Reviewed-by: Roman Gushchin
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2019-05-15 10:52:53 +0800
205b20cc5 mm: memcontrol: make cgroup stats and events query API explicitly local ... Browse Code »

Patch series "mm: memcontrol: memory.stat cost & correctness".

The cgroup memory.stat file holds recursive statistics for the entire
subtree. The current implementation does this tree walk on-demand
whenever the file is read. This is giving us problems in production.

1. The cost of aggregating the statistics on-demand is high. A lot of
system service cgroups are mostly idle and their stats don't change
between reads, yet we always have to check them. There are also always
some lazily-dying cgroups sitting around that are pinned by a handful
of remaining page cache; the same applies to them.

In an application that periodically monitors memory.stat in our
fleet, we have seen the aggregation consume up to 5% CPU time.

2. When cgroups die and disappear from the cgroup tree, so do their
accumulated vm events. The result is that the event counters at
higher-level cgroups can go backwards and confuse some of our
automation, let alone people looking at the graphs over time.

To address both issues, this patch series changes the stat
implementation to spill counts upwards when the counters change.

The upward spilling is batched using the existing per-cpu cache. In a
sparse file stress test with 5 level cgroup nesting, the additional cost
of the flushing was negligible (a little under 1% of CPU at 100% CPU
utilization, compared to the 5% of reading memory.stat during regular
operation).

This patch (of 4):

memcg_page_state(), lruvec_page_state(), memcg_sum_events() are
currently returning the state of the local memcg or lruvec, not the
recursive state.

In practice there is a demand for both versions, although the callers
that want the recursive counts currently sum them up by hand.

Per default, cgroups are considered recursive entities and generally we
expect more users of the recursive counters, with the local counts being
special cases. To reflect that in the name, add a _local suffix to the
current implementations.

The following patch will re-incarnate these functions with recursive
semantics, but with an O(1) implementation.

[hannes@cmpxchg.org: fix bisection hole]
Link: http://lkml.kernel.org/r/20190417160347.GC23013@cmpxchg.org
Link: http://lkml.kernel.org/r/20190412151507.2769-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Reviewed-by: Shakeel Butt
Reviewed-by: Roman Gushchin
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2019-05-15 10:52:53 +0800
871789d4a mm, memcg: rename ambiguously named memory.stat counters and functions ... Browse Code »

I spent literally an hour trying to work out why an earlier version of
my memory.events aggregation code doesn't work properly, only to find
out I was calling memcg->events instead of memcg->memory_events, which
is fairly confusing.

This naming seems in need of reworking, so make it harder to do the
wrong thing by using vmevents instead of events, which makes it more
clear that these are vm counters rather than memcg-specific counters.

There are also a few other inconsistent names in both the percpu and
aggregated structs, so these are all cleaned up to be more coherent and
easy to understand.

This commit contains code cleanup only: there are no logic changes.

[akpm@linux-foundation.org: fix it for preceding changes]
Link: http://lkml.kernel.org/r/20190208224319.GA23801@chrisdown.name
Signed-off-by: Chris Down
Acked-by: Johannes Weiner
Cc: Michal Hocko
Cc: Tejun Heo
Cc: Roman Gushchin
Cc: Dennis Zhou
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Chris Down
2019-05-15 10:52:52 +0800
134fca906 mm/mincore.c: make mincore() more conservative ... Browse Code »

The semantics of what mincore() considers to be resident is not
completely clear, but Linux has always (since 2.3.52, which is when
mincore() was initially done) treated it as "page is available in page
cache".

That's potentially a problem, as that [in]directly exposes
meta-information about pagecache / memory mapping state even about
memory not strictly belonging to the process executing the syscall,
opening possibilities for sidechannel attacks.

Change the semantics of mincore() so that it only reveals pagecache
information for non-anonymous mappings that belog to files that the
calling process could (if it tried to) successfully open for writing;
otherwise we'd be including shared non-exclusive mappings, which

- is the sidechannel

- is not the usecase for mincore(), as that's primarily used for data,
not (shared) text

[jkosina@suse.cz: v2]
Link: http://lkml.kernel.org/r/20190312141708.6652-2-vbabka@suse.cz
[mhocko@suse.com: restructure can_do_mincore() conditions]
Link: http://lkml.kernel.org/r/nycvar.YFH.7.76.1903062342020.19912@cbobk.fhfr.pm
Signed-off-by: Jiri Kosina
Signed-off-by: Vlastimil Babka
Acked-by: Josh Snyder
Acked-by: Michal Hocko
Originally-by: Linus Torvalds
Originally-by: Dominique Martinet
Cc: Andy Lutomirski
Cc: Dave Chinner
Cc: Kevin Easton
Cc: Matthew Wilcox
Cc: Cyril Hrubis
Cc: Tejun Heo
Cc: Kirill A. Shutemov
Cc: Daniel Gruss
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jiri Kosina
2019-05-15 10:52:48 +0800
97500a4a5 mm: maintain randomization of page free lists ... Browse Code »

When freeing a page with an order >= shuffle_page_order randomly select
the front or back of the list for insertion.

While the mm tries to defragment physical pages into huge pages this can
tend to make the page allocator more predictable over time. Inject the
front-back randomness to preserve the initial randomness established by
shuffle_free_memory() when the kernel was booted.

The overhead of this manipulation is constrained by only being applied
for MAX_ORDER sized pages by default.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/154899812788.3165233.9066631950746578517.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams
Reviewed-by: Kees Cook
Cc: Michal Hocko
Cc: Dave Hansen
Cc: Keith Busch
Cc: Robert Elliott
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dan Williams
2019-05-15 10:52:48 +0800
b03641af6 mm: move buddy list manipulations into helpers ... Browse Code »

In preparation for runtime randomization of the zone lists, take all
(well, most of) the list_*() functions in the buddy allocator and put
them in helper functions. Provide a common control point for injecting
additional behavior when freeing pages.

[dan.j.williams@intel.com: fix buddy list helpers]
Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com
[vbabka@suse.cz: remove del_page_from_free_area() migratetype parameter]
Link: http://lkml.kernel.org/r/4672701b-6775-6efd-0797-b6242591419e@suse.cz
Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams
Signed-off-by: Vlastimil Babka
Tested-by: Tetsuo Handa
Acked-by: Michal Hocko
Cc: Vlastimil Babka
Cc: Dave Hansen
Cc: Kees Cook
Cc: Keith Busch
Cc: Robert Elliott
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dan Williams
2019-05-15 10:52:48 +0800
e900a918b mm: shuffle initial free memory to improve memory-side-cache utilization ... Browse Code »

Patch series "mm: Randomize free memory", v10.

This patch (of 3):

Randomization of the page allocator improves the average utilization of
a direct-mapped memory-side-cache. Memory side caching is a platform
capability that Linux has been previously exposed to in HPC
(high-performance computing) environments on specialty platforms. In
that instance it was a smaller pool of high-bandwidth-memory relative to
higher-capacity / lower-bandwidth DRAM. Now, this capability is going
to be found on general purpose server platforms where DRAM is a cache in
front of higher latency persistent memory [1].

Robert offered an explanation of the state of the art of Linux
interactions with memory-side-caches [2], and I copy it here:

It's been a problem in the HPC space:
http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

A kernel module called zonesort is available to try to help:
https://software.intel.com/en-us/articles/xeon-phi-software

and this abandoned patch series proposed that for the kernel:
https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

Dan's patch series doesn't attempt to ensure buffers won't conflict, but
also reduces the chance that the buffers will. This will make performance
more consistent, albeit slower than "optimal" (which is near impossible
to attain in a general-purpose kernel). That's better than forcing
users to deploy remedies like:
"To eliminate this gradual degradation, we have added a Stream
measurement to the Node Health Check that follows each job;
nodes are rebooted whenever their measured memory bandwidth
falls below 300 GB/s."

A replacement for zonesort was merged upstream in commit cc9aec03e58f
("x86/numa_emulation: Introduce uniform split capability"). With this
numa_emulation capability, memory can be split into cache sized
("near-memory" sized) numa nodes. A bind operation to such a node, and
disabling workloads on other nodes, enables full cache performance.
However, once the workload exceeds the cache size then cache conflicts
are unavoidable. While HPC environments might be able to tolerate
time-scheduling of cache sized workloads, for general purpose server
platforms, the oversubscribed cache case will be the common case.

The worst case scenario is that a server system owner benchmarks a
workload at boot with an un-contended cache only to see that performance
degrade over time, even below the average cache performance due to
excessive conflicts. Randomization clips the peaks and fills in the
valleys of cache utilization to yield steady average performance.

Here are some performance impact details of the patches:

1/ An Intel internal synthetic memory bandwidth measurement tool, saw a
3X speedup in a contrived case that tries to force cache conflicts.
The contrived cased used the numa_emulation capability to force an
instance of the benchmark to be run in two of the near-memory sized
numa nodes. If both instances were placed on the same emulated they
would fit and cause zero conflicts. While on separate emulated nodes
without randomization they underutilized the cache and conflicted
unnecessarily due to the in-order allocation per node.

2/ A well known Java server application benchmark was run with a heap
size that exceeded cache size by 3X. The cache conflict rate was 8%
for the first run and degraded to 21% after page allocator aging. With
randomization enabled the rate levelled out at 11%.

3/ A MongoDB workload did not observe measurable difference in
cache-conflict rates, but the overall throughput dropped by 7% with
randomization in one case.

4/ Mel Gorman ran his suite of performance workloads with randomization
enabled on platforms without a memory-side-cache and saw a mix of some
improvements and some losses [3].

While there is potentially significant improvement for applications that
depend on low latency access across a wide working-set, the performance
may be negligible to negative for other workloads. For this reason the
shuffle capability defaults to off unless a direct-mapped
memory-side-cache is detected. Even then, the page_alloc.shuffle=0
parameter can be specified to disable the randomization on those systems.

Outside of memory-side-cache utilization concerns there is potentially
security benefit from randomization. Some data exfiltration and
return-oriented-programming attacks rely on the ability to infer the
location of sensitive data objects. The kernel page allocator, especially
early in system boot, has predictable first-in-first out behavior for
physical pages. Pages are freed in physical address order when first
onlined.

Quoting Kees:
"While we already have a base-address randomization
(CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
memory layouts would certainly be using the predictability of
allocation ordering (i.e. for attacks where the base address isn't
important: only the relative positions between allocated memory).
This is common in lots of heap-style attacks. They try to gain
control over ordering by spraying allocations, etc.

I'd really like to see this because it gives us something similar
to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
caches it leaves vast bulk of memory to be predictably in order allocated.
However, it should be noted, the concrete security benefits are hard to
quantify, and no known CVE is mitigated by this randomization.

Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
are initially populated with free memory at boot and at hotplug time. Do
this based on either the presence of a page_alloc.shuffle=Y command line
parameter, or autodetection of a memory-side-cache (to be added in a
follow-on patch).

The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1 i.e. 10,
4MB this trades off randomization granularity for time spent shuffling.
MAX_ORDER-1 was chosen to be minimally invasive to the page allocator
while still showing memory-side cache behavior improvements, and the
expectation that the security implications of finer granularity
randomization is mitigated by CONFIG_SLAB_FREELIST_RANDOM. The
performance impact of the shuffling appears to be in the noise compared to
other memory initialization work.

This initial randomization can be undone over time so a follow-on patch is
introduced to inject entropy on page free decisions. It is reasonable to
ask if the page free entropy is sufficient, but it is not enough due to
the in-order initial freeing of pages. At the start of that process
putting page1 in front or behind page0 still keeps them close together,
page2 is still near page1 and has a high chance of being adjacent. As
more pages are added ordering diversity improves, but there is still high
page locality for the low address pages and this leads to no significant
impact to the cache conflict rate.

[1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
[2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
[3]: https://lkml.org/lkml/2018/10/12/309

[dan.j.williams@intel.com: fix shuffle enable]
Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
[cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams
Signed-off-by: Qian Cai
Reviewed-by: Kees Cook
Acked-by: Michal Hocko
Cc: Dave Hansen
Cc: Keith Busch
Cc: Robert Elliott
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dan Williams
2019-05-15 10:52:48 +0800
4d36e6f80 mm/vmalloc.c: convert vmap_lazy_nr to atomic_long_t ... Browse Code »

vmap_lazy_nr variable has atomic_t type that is 4 bytes integer value on
both 32 and 64 bit systems. lazy_max_pages() deals with "unsigned long"
that is 8 bytes on 64 bit system, thus vmap_lazy_nr should be 8 bytes on
64 bit as well.

Link: http://lkml.kernel.org/r/20190131162452.25879-1-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony)
Reviewed-by: Andrew Morton
Reviewed-by: William Kucharski
Cc: Michal Hocko
Cc: Matthew Wilcox
Cc: Thomas Garnier
Cc: Oleksiy Avramchenko
Cc: Joel Fernandes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Uladzislau Rezki (Sony)
2019-05-15 10:52:48 +0800
68571be99 mm/vmalloc.c: add priority threshold to __purge_vmap_area_lazy() ... Browse Code »

Commit 763b218ddfaf ("mm: add preempt points into __purge_vmap_area_lazy()")
introduced some preempt points, one of those is making an allocation
more prioritized over lazy free of vmap areas.

Prioritizing an allocation over freeing does not work well all the time,
i.e. it should be rather a compromise.

1) Number of lazy pages directly influences the busy list length thus
on operations like: allocation, lookup, unmap, remove, etc.

2) Under heavy stress of vmalloc subsystem I run into a situation when
memory usage gets increased hitting out_of_memory -> panic state due to
completely blocking of logic that frees vmap areas in the
__purge_vmap_area_lazy() function.

Establish a threshold passing which the freeing is prioritized back over
allocation creating a balance between each other.

Using vmalloc test driver in "stress mode", i.e. When all available
test cases are run simultaneously on all online CPUs applying a
pressure on the vmalloc subsystem, my HiKey 960 board runs out of
memory due to the fact that __purge_vmap_area_lazy() logic simply is
not able to free pages in time.

How I run it:

1) You should build your kernel with CONFIG_TEST_VMALLOC=m
2) ./tools/testing/selftests/vm/test_vmalloc.sh stress

During this test "vmap_lazy_nr" pages will go far beyond acceptable
lazy_max_pages() threshold, that will lead to enormous busy list size
and other problems including allocation time and so on.

Link: http://lkml.kernel.org/r/20190124115648.9433-3-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony)
Reviewed-by: Andrew Morton
Cc: Michal Hocko
Cc: Matthew Wilcox
Cc: Thomas Garnier
Cc: Oleksiy Avramchenko
Cc: Steven Rostedt
Cc: Joel Fernandes
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Tejun Heo
Cc: Joel Fernandes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Uladzislau Rezki (Sony)
2019-05-15 10:52:48 +0800
136ac591f mm: update references to page _refcount ... Browse Code »

Commit 0139aa7b7fa ("mm: rename _count, field of the struct page, to
_refcount") left out a couple of references to the old field name. Fix
that.

Link: http://lkml.kernel.org/r/cedf87b02eb8a6b3eac57e8e91da53fb15c3c44c.1556537475.git.baruch@tkos.co.il
Fixes: 0139aa7b7fa ("mm: rename _count, field of the struct page, to _refcount")
Signed-off-by: Baruch Siach
Reviewed-by: Andrew Morton
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Baruch Siach
2019-05-15 10:52:47 +0800
318222a35 Merge branch 'akpm' (patches from Andrew) ... Browse Code »

Merge misc updates from Andrew Morton:

- a few misc things and hotfixes

- ocfs2

- almost all of MM

* emailed patches from Andrew Morton : (139 commits)
kernel/memremap.c: remove the unused device_private_entry_fault() export
mm: delete find_get_entries_tag
mm/huge_memory.c: make __thp_get_unmapped_area static
mm/mprotect.c: fix compilation warning because of unused 'mm' variable
mm/page-writeback: introduce tracepoint for wait_on_page_writeback()
mm/vmscan: simplify trace_reclaim_flags and trace_shrink_flags
mm/Kconfig: update "Memory Model" help text
mm/vmscan.c: don't disable irq again when count pgrefill for memcg
mm: memblock: make keeping memblock memory opt-in rather than opt-out
hugetlbfs: always use address space in inode for resv_map pointer
mm/z3fold.c: support page migration
mm/z3fold.c: add structure for buddy handles
mm/z3fold.c: improve compression by extending search
mm/z3fold.c: introduce helper functions
mm/page_alloc.c: remove unnecessary parameter in rmqueue_pcplist
mm/hmm: add ARCH_HAS_HMM_MIRROR ARCH_HAS_HMM_DEVICE Kconfig
mm/vmscan.c: simplify shrink_inactive_list()
fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
xen/privcmd-buf.c: convert to use vm_map_pages_zero()
xen/gntdev.c: convert to use vm_map_pages()
...

Linus Torvalds
2019-05-15 01:10:55 +0800
a1b8e6abf mm: delete find_get_entries_tag ... Browse Code »

I removed the only user of this and hadn't noticed it was now unused.

Link: http://lkml.kernel.org/r/20190430152929.21813-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle)
Reviewed-by: Ross Zwisler
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Matthew Wilcox (Oracle)
2019-05-15 00:47:51 +0800
b3b07077b mm/huge_memory.c: make __thp_get_unmapped_area static ... Browse Code »

__thp_get_unmapped_area is only used in mm/huge_memory.c. Make it static.
Tested by building and booting the kernel.

Link: http://lkml.kernel.org/r/20190504102353.GA22525@bharath12345-Inspiron-5559
Signed-off-by: Bharath Vedartham
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Bharath Vedartham
2019-05-15 00:47:51 +0800
94393c789 mm/mprotect.c: fix compilation warning because of unused 'mm' variable ... Browse Code »

Since 0cbe3e26abe0 ("mm: update ptep_modify_prot_start/commit to take
vm_area_struct as arg") the only place that uses the local 'mm' variable
in change_pte_range() is the call to set_pte_at().

Many architectures define set_pte_at() as macro that does not use the 'mm'
parameter, which generates the following compilation warning:

CC mm/mprotect.o
mm/mprotect.c: In function 'change_pte_range':
mm/mprotect.c:42:20: warning: unused variable 'mm' [-Wunused-variable]
struct mm_struct *mm = vma->vm_mm;
^~

Fix it by passing vma->mm to set_pte_at() and dropping the local 'mm'
variable in change_pte_range().

[liu.song.a23@gmail.com: fix missed conversions]
Link: http://lkml.kernel.org/r/CAPhsuW6wcQgYLHNdBdw6m0YiR4RWsS4XzfpSKU7wBLLeOCTbpw@mail.gmail.comLink: http://lkml.kernel.org/r/1557305432-4940-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport
Reviewed-by: Andrew Morton
Cc: Song Liu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Rapoport
2019-05-15 00:47:51 +0800
19343b5bd mm/page-writeback: introduce tracepoint for wait_on_page_writeback() ... Browse Code »

Recently there have been some hung tasks on our server due to
wait_on_page_writeback(), and we want to know the details of this
PG_writeback, i.e. this page is writing back to which device. But it is
not so convenient to get the details.

I think it would be better to introduce a tracepoint for diagnosing the
writeback details.

Link: http://lkml.kernel.org/r/1556274402-19018-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao
Reviewed-by: Andrew Morton
Cc: Jan Kara
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yafang Shao
2019-05-15 00:47:51 +0800
d66d109d3 mm/Kconfig: update "Memory Model" help text ... Browse Code »

The help describing the memory model selection is outdated. It still says
that SPARSEMEM is experimental and DISCONTIGMEM is a preferred over
SPARSEMEM.

Update the help text for the relevant options:
* add a generic help for the "Memory Model" prompt
* add description for FLATMEM
* reduce the description of DISCONTIGMEM and add a deprecation note
* prefer SPARSEMEM over DISCONTIGMEM

Link: http://lkml.kernel.org/r/1556188531-20728-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Rapoport
2019-05-15 00:47:51 +0800
2fa2690ca mm/vmscan.c: don't disable irq again when count pgrefill for memcg ... Browse Code »

We can use __count_memcg_events() directly because this callsite is alreay
protected by spin_lock_irq().

Link: http://lkml.kernel.org/r/1556093494-30798-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao
Reviewed-by: Andrew Morton
Acked-by: Michal Hocko
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yafang Shao
2019-05-15 00:47:51 +0800
350e88bad mm: memblock: make keeping memblock memory opt-in rather than opt-out ... Browse Code »

Most architectures do not need the memblock memory after the page
allocator is initialized, but only few enable ARCH_DISCARD_MEMBLOCK in the
arch Kconfig.

Replacing ARCH_DISCARD_MEMBLOCK with ARCH_KEEP_MEMBLOCK and inverting the
logic makes it clear which architectures actually use memblock after
system initialization and skips the necessity to add ARCH_DISCARD_MEMBLOCK
to the architectures that are still missing that option.

Link: http://lkml.kernel.org/r/1556102150-32517-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport
Acked-by: Michael Ellerman (powerpc)
Cc: Russell King
Cc: Catalin Marinas
Cc: Will Deacon
Cc: Richard Kuo
Cc: Tony Luck
Cc: Fenghua Yu
Cc: Geert Uytterhoeven
Cc: Ralf Baechle
Cc: Paul Burton
Cc: James Hogan
Cc: Ley Foon Tan
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: Yoshinori Sato
Cc: Rich Felker
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: "H. Peter Anvin"
Cc: Eric Biederman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Rapoport
2019-05-15 00:47:50 +0800
f27a5136f hugetlbfs: always use address space in inode for resv_map pointer ... Browse Code »

Continuing discussion about 58b6e5e8f1ad ("hugetlbfs: fix memory leak for
resv_map") brought up the issue that inode->i_mapping may not point to the
address space embedded within the inode at inode eviction time. The
hugetlbfs truncate routine handles this by explicitly using inode->i_data.
However, code cleaning up the resv_map will still use the address space
pointed to by inode->i_mapping. Luckily, private_data is NULL for address
spaces in all such cases today but, there is no guarantee this will
continue.

Change all hugetlbfs code getting a resv_map pointer to explicitly get it
from the address space embedded within the inode. In addition, add more
comments in the code to indicate why this is being done.

Link: http://lkml.kernel.org/r/20190419204435.16984-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz
Reported-by: Yufen Yu
Cc: Michal Hocko
Cc: Naoya Horiguchi
Cc: "Kirill A . Shutemov"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Kravetz
2019-05-15 00:47:50 +0800
1f862989b mm/z3fold.c: support page migration ... Browse Code »

Now that we are not using page address in handles directly, we can make
z3fold pages movable to decrease the memory fragmentation z3fold may
create over time.

This patch starts advertising non-headless z3fold pages as movable and
uses the existing kernel infrastructure to implement moving of such pages
per memory management subsystem's request. It thus implements 3 required
callbacks for page migration:

* isolation callback: z3fold_page_isolate(): try to isolate the page by
removing it from all lists. Pages scheduled for some activity and
mapped pages will not be isolated. Return true if isolation was
successful or false otherwise

* migration callback: z3fold_page_migrate(): re-check critical
conditions and migrate page contents to the new page provided by the
memory subsystem. Returns 0 on success or negative error code otherwise

* putback callback: z3fold_page_putback(): put back the page if
z3fold_page_migrate() for it failed permanently (i. e. not with
-EAGAIN code).

[lkp@intel.com: z3fold_page_isolate() can be static]
Link: http://lkml.kernel.org/r/20190419130924.GA161478@ivb42
Link: http://lkml.kernel.org/r/20190417103922.31253da5c366c4ebe0419cfc@gmail.com
Signed-off-by: Vitaly Wool
Signed-off-by: kbuild test robot
Cc: Bartlomiej Zolnierkiewicz
Cc: Dan Streetman
Cc: Krzysztof Kozlowski
Cc: Oleksiy Avramchenko
Cc: Uladzislau Rezki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vitaly Wool
2019-05-15 00:47:50 +0800
7c2b8baa6 mm/z3fold.c: add structure for buddy handles ... Browse Code »

For z3fold to be able to move its pages per request of the memory
subsystem, it should not use direct object addresses in handles. Instead,
it will create abstract handles (3 per page) which will contain pointers
to z3fold objects. Thus, it will be possible to change these pointers
when z3fold page is moved.

Link: http://lkml.kernel.org/r/20190417103826.484eaf18c1294d682769880f@gmail.com
Signed-off-by: Vitaly Wool
Cc: Bartlomiej Zolnierkiewicz
Cc: Dan Streetman
Cc: Krzysztof Kozlowski
Cc: Oleksiy Avramchenko
Cc: Uladzislau Rezki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vitaly Wool
2019-05-15 00:47:50 +0800
351618b20 mm/z3fold.c: improve compression by extending search ... Browse Code »

The current z3fold implementation only searches this CPU's page lists for
a fitting page to put a new object into. This patch adds quick search for
very well fitting pages (i. e. those having exactly the required number
of free space) on other CPUs too, before allocating a new page for that
object.

Link: http://lkml.kernel.org/r/20190417103733.72ae81abe1552397c95a008e@gmail.com
Signed-off-by: Vitaly Wool
Cc: Bartlomiej Zolnierkiewicz
Cc: Dan Streetman
Cc: Krzysztof Kozlowski
Cc: Oleksiy Avramchenko
Cc: Uladzislau Rezki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vitaly Wool
2019-05-15 00:47:50 +0800
9050cce10 mm/z3fold.c: introduce helper functions ... Browse Code »

Patch series "z3fold: support page migration", v2.

This patchset implements page migration support and slightly better buddy
search. To implement page migration support, z3fold has to move away from
the current scheme of handle encoding. i. e. stop encoding page address
in handles. Instead, a small per-page structure is created which will
contain actual addresses for z3fold objects, while pointers to fields of
that structure will be used as handles.

Thus, it will be possible to change the underlying addresses to reflect
page migration.

To support migration itself, 3 callbacks will be implemented:

1: isolation callback: z3fold_page_isolate(): try to isolate the page
by removing it from all lists. Pages scheduled for some activity and
mapped pages will not be isolated. Return true if isolation was
successful or false otherwise

2: migration callback: z3fold_page_migrate(): re-check critical
conditions and migrate page contents to the new page provided by the
system. Returns 0 on success or negative error code otherwise

3: putback callback: z3fold_page_putback(): put back the page if
z3fold_page_migrate() for it failed permanently (i. e. not with
-EAGAIN code).

To make sure an isolated page doesn't get freed, its kref is incremented
in z3fold_page_isolate() and decremented during post-migration compaction,
if migration was successful, or by z3fold_page_putback() in the other
case.

Since the new handle encoding scheme implies slight memory consumption
increase, better buddy search (which decreases memory consumption) is
included in this patchset.

This patch (of 4):

Introduce a separate helper function for object allocation, as well as 2
smaller helpers to add a buddy to the list and to get a pointer to the
pool from the z3fold header. No functional changes here.

Link: http://lkml.kernel.org/r/20190417103633.a4bb770b5bf0fb7e43ce1666@gmail.com
Signed-off-by: Vitaly Wool
Cc: Dan Streetman
Cc: Bartlomiej Zolnierkiewicz
Cc: Krzysztof Kozlowski
Cc: Oleksiy Avramchenko
Cc: Uladzislau Rezki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vitaly Wool
2019-05-15 00:47:50 +0800
1c52e6d06 mm/page_alloc.c: remove unnecessary parameter in rmqueue_pcplist ... Browse Code »

Because rmqueue_pcplist() is only called when order is 0, we don't need to
use order as a parameter.

Link: http://lkml.kernel.org/r/1555591709-11744-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao
Acked-by: Michal Hocko
Acked-by: Pankaj Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yafang Shao
2019-05-15 00:47:50 +0800
2c8fc3dcf mm/hmm: add ARCH_HAS_HMM_MIRROR ARCH_HAS_HMM_DEVICE Kconfig ... Browse Code »

Add 2 new Kconfig variables that are not used by anyone. I check that
various make ARCH=somearch allmodconfig do work and do not complain. This
new Kconfig needs to be added first so that device drivers that depend on
HMM can be updated.

Once drivers are updated then I can update the HMM Kconfig to depend on
this new Kconfig in a followup patch.

This is about solving Kconfig for HMM given that device driver are
going through their own tree we want to avoid changing them from the mm
tree. So plan is:

1 - Kernel release N add the new Kconfig to mm/Kconfig (this patch)
2 - Kernel release N+1 update driver to depend on new Kconfig ie
stop using ARCH_HASH_HMM and start using ARCH_HAS_HMM_MIRROR
and ARCH_HAS_HMM_DEVICE (one or the other or both depending
on the driver)
3 - Kernel release N+2 remove ARCH_HASH_HMM and do final Kconfig
update in mm/Kconfig

Link: http://lkml.kernel.org/r/20190417211141.17580-1-jglisse@redhat.com
Signed-off-by: Jérôme Glisse
Cc: Guenter Roeck
Cc: Leon Romanovsky
Cc: Jason Gunthorpe
Cc: Ralph Campbell
Cc: John Hubbard
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jérôme Glisse
2019-05-15 00:47:50 +0800
f46b79120 mm/vmscan.c: simplify shrink_inactive_list() ... Browse Code »

This merges together duplicated patterns of code. Also, replace
count_memcg_events() with its irq-careless namesake, because they are
already called in interrupts disabled context.

Link: http://lkml.kernel.org/r/2ece1df4-2989-bc9b-6172-61e9fdde5bfd@virtuozzo.com
Signed-off-by: Kirill Tkhai
Acked-by: Michal Hocko
Reviewed-by: Daniel Jordan
Acked-by: Johannes Weiner
Cc: Baoquan He
Cc: Davidlohr Bueso

Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill Tkhai
2019-05-15 00:47:50 +0800
a667d7456 mm: introduce new vm_map_pages() and vm_map_pages_zero() API ... Browse Code »

Patch series "mm: Use vm_map_pages() and vm_map_pages_zero() API", v5.

This patch (of 5):

Previouly drivers have their own way of mapping range of kernel
pages/memory into user vma and this was done by invoking vm_insert_page()
within a loop.

As this pattern is common across different drivers, it can be generalized
by creating new functions and using them across the drivers.

vm_map_pages() is the API which can be used to map kernel memory/pages in
drivers which have considered vm_pgoff

vm_map_pages_zero() is the API which can be used to map a range of kernel
memory/pages in drivers which have not considered vm_pgoff. vm_pgoff is
passed as default 0 for those drivers.

We _could_ then at a later "fix" these drivers which are using
vm_map_pages_zero() to behave according to the normal vm_pgoff offsetting
simply by removing the _zero suffix on the function name and if that
causes regressions, it gives us an easy way to revert.

Tested on Rockchip hardware and display is working, including talking to
Lima via prime.

Link: http://lkml.kernel.org/r/751cb8a0f4c3e67e95c58a3b072937617f338eea.1552921225.git.jrdr.linux@gmail.com
Signed-off-by: Souptick Joarder
Suggested-by: Russell King
Suggested-by: Matthew Wilcox
Reviewed-by: Mike Rapoport
Tested-by: Heiko Stuebner
Cc: Michal Hocko
Cc: "Kirill A. Shutemov"
Cc: Vlastimil Babka
Cc: Rik van Riel
Cc: Stephen Rothwell
Cc: Peter Zijlstra
Cc: Robin Murphy
Cc: Joonsoo Kim
Cc: Thierry Reding
Cc: Kees Cook
Cc: Marek Szyprowski
Cc: Stefan Richter
Cc: Sandy Huang
Cc: David Airlie
Cc: Oleksandr Andrushchenko
Cc: Joerg Roedel
Cc: Pawel Osciak
Cc: Kyungmin Park
Cc: Boris Ostrovsky
Cc: Juergen Gross
Cc: Mauro Carvalho Chehab
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Souptick Joarder
2019-05-15 00:47:50 +0800