08 Oct, 2016

40 commits

  • So they are CONFIG_DEBUG_VM-only and more informative.

    Cc: Al Viro
    Cc: David S. Miller
    Cc: Hugh Dickins
    Cc: Jens Axboe
    Cc: Joe Perches
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Santosh Shilimkar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Commit c32b3cbe0d06 ("oom, PM: make OOM detection in the freezer path
    raceless") inserted a WARN_ON() into pagefault_out_of_memory() in order
    to warn when we raced with disabling the OOM killer.

    Now, patch "oom, suspend: fix oom_killer_disable vs. pm suspend
    properly" introduced a timeout for oom_killer_disable(). Even if we
    raced with disabling the OOM killer and the system is OOM livelocked,
    the OOM killer will be enabled eventually (in 20 seconds by default) and
    the OOM livelock will be solved. Therefore, we no longer need to warn
    when we raced with disabling the OOM killer.

    Link: http://lkml.kernel.org/r/1473442120-7246-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
    Fragmentation index and the vm.extfrag_threshold sysctl are meant as a
    heuristic to prevent excessive compaction for costly orders (i.e. THP).
    It's unlikely to make any difference for non-costly orders, especially
    with the default threshold. But we cannot afford any uncertainty for
    the non-costly orders where the only alternative to successful
    reclaim/compaction is OOM. After the recent patches we are guaranteed
    maximum effort without heuristics from compaction before deciding OOM,
    and fragindex is the last remaining heuristic. Therefore skip fragindex
    altogether for non-costly orders.

    Suggested-by: Michal Hocko
    Link: http://lkml.kernel.org/r/20160926162025.21555-5-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
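
    A minimal sketch of the resulting check (simplified from the
    compaction-suitability test; the exact hunk may differ):

    if (order > PAGE_ALLOC_COSTLY_ORDER) {
        int fragindex = fragmentation_index(zone, order);

        /* only costly orders back off based on the heuristic */
        if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
            return COMPACT_NOT_SUITABLE_ZONE;
    }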
     
  • The compaction_zonelist_suitable() function tries to determine if
    compaction will be able to proceed after sufficient reclaim, i.e.
    whether there are enough reclaimable pages to provide enough order-0
    freepages for compaction.

    This addition of reclaimable pages to the free pages works well for the
    order-0 watermark check, but in the fragmentation index check we only
    consider truly free pages. Thus we can get a fragindex value close to
    0, which indicates failure due to lack of memory, and wrongly decide
    that compaction won't be suitable even after reclaim.

    Instead of trying to somehow adjust fragindex for reclaimable pages,
    let's just skip it from compaction_zonelist_suitable().

    Link: http://lkml.kernel.org/r/20160926162025.21555-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    should_reclaim_retry() makes decisions based on no_progress_loops,
    so it makes sense to also update the counter there. This is also
    consistent with should_compact_retry() and compaction_retries. No
    functional change.

    [hillf.zj@alibaba-inc.com: fix missing pointer dereferences]
    Link: http://lkml.kernel.org/r/20160926162025.21555-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Hillf Danton
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    Several people have reported premature OOMs for order-2 allocations
    (stack) due to the OOM rework in 4.7. In the reported scenario (a
    parallel kernel build and dd writing to two drives) many pageblocks get
    marked as Unmovable and the compaction free scanner struggles to
    isolate free pages. Joonsoo Kim pointed out that the free scanner
    skips pageblocks that are not movable to prevent filling them and
    forcing non-movable allocations to fall back to other pageblocks. Such
    a heuristic makes sense to help prevent long-term fragmentation, but
    premature OOMs are the relatively more urgent problem. As a
    compromise, this patch disables the heuristic only for the ultimate
    compaction priority.

    Link: http://lkml.kernel.org/r/20160906135258.18335-5-vbabka@suse.cz
    Reported-by: Ralf-Peter Rohbeck
    Reported-by: Arkadiusz Miskiewicz
    Reported-by: Olaf Hering
    Suggested-by: Joonsoo Kim
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The new ultimate compaction priority disables some heuristics, which may
    result in excessive cost. This is fine for non-costly orders where we
    want to try hard before resorting to OOM, but might be disruptive for
    costly orders which do not trigger OOM and should generally have some
    fallback. Thus, we disable the full priority for costly orders.

    Suggested-by: Michal Hocko
    Link: http://lkml.kernel.org/r/20160906135258.18335-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • During reclaim/compaction loop, compaction priority can be increased by
    the should_compact_retry() function, but the current code is not
    optimal. Priority is only increased when compaction_failed() is true,
    which means that compaction has scanned the whole zone. This may not
    happen even after multiple attempts with a lower priority due to
    parallel activity, so we might needlessly struggle on the lower
    priorities and possibly run out of compaction retry attempts in the
    process.

    After this patch we are guaranteed at least one attempt at the highest
    compaction priority even if we exhaust all retries at the lower
    priorities.

    Link: http://lkml.kernel.org/r/20160906135258.18335-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
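
    The policy described above, modeled as a small self-contained helper
    (hypothetical function, not the kernel's should_compact_retry(); note
    that a higher compaction priority is the numerically lower
    COMPACT_PRIO_* value):

    enum compact_priority {
        COMPACT_PRIO_SYNC_FULL,     /* highest priority, heuristics disabled */
        COMPACT_PRIO_SYNC_LIGHT,
        COMPACT_PRIO_ASYNC,         /* lowest priority */
    };

    static int should_retry_compaction(int *retries, int max_retries,
                                       enum compact_priority *prio)
    {
        if (++*retries <= max_retries)
            return 1;                       /* keep trying at this priority */
        if (*prio > COMPACT_PRIO_SYNC_FULL) {
            (*prio)--;                      /* escalate the priority ...     */
            *retries = 0;                   /* ... and reset the budget      */
            return 1;
        }
        return 0;                           /* highest priority tried, give up */
    }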
     
  • Patch series "reintroduce compaction feedback for OOM decisions".

    After several people reported OOMs for order-2 allocations in 4.7 due
    to Michal Hocko's OOM rework, he reverted the part that considered
    compaction feedback [1] in the decisions to retry reclaim/compaction.
    This was to provide a fix quickly for 4.8 rc and 4.7 stable series,
    while mmotm had an almost complete solution that instead improved
    compaction reliability.

    This series completes the mmotm solution and reintroduces the compaction
    feedback into OOM decisions. The first two patches restore the state of
    mmotm before the temporary solution was merged, the last patch should be
    the missing piece for reliability. The third patch restricts the
    hardened compaction to non-costly orders, since costly orders don't
    result in OOMs in the first place.

    [1] http://marc.info/?i=20160822093249.GA14916%40dhcp22.suse.cz%3E

    This patch (of 4):

    Commit 6b4e3181d7bd ("mm, oom: prevent premature OOM killer invocation
    for high order request") was intended as a quick fix of OOM regressions
    for 4.8 and stable 4.7.x kernels. For a better long-term solution, we
    still want to consider compaction feedback, which should be possible
    after some more improvements in the following patches.

    This reverts commit 6b4e3181d7bd5ca5ab6f45929e4a5ffa7ab4ab7f.

    Link: http://lkml.kernel.org/r/20160906135258.18335-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    After using the offset of the swap entry as the key of the swap cache,
    page_index() becomes exactly the same as page_file_index(). So
    page_file_index() is removed and the callers are changed to use
    page_index() instead.

    Link: http://lkml.kernel.org/r/1473270649-27229-2-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Trond Myklebust
    Cc: Anna Schumaker
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Dan Williams
    Cc: Joonsoo Kim
    Cc: Ross Zwisler
    Cc: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • This patch is to improve the performance of swap cache operations when
    the type of the swap device is not 0. Originally, the whole swap entry
    value is used as the key of the swap cache, even though there is one
    radix tree for each swap device. If the type of the swap device is not
    0, the height of the radix tree of the swap cache will be increased
    unnecessarily, especially on 64-bit architectures. For example, for a 1GB
    swap device on the x86_64 architecture, the height of the radix tree of
    the swap cache is 11. But if the offset of the swap entry is used as
    the key of the swap cache, the height of the radix tree of the swap
    cache is 4. The increased height causes unnecessary radix tree
    descending and increased cache footprint.

    This patch reduces the height of the radix tree of the swap cache via
    using the offset of the swap entry instead of the whole swap entry value
    as the key of the swap cache. In 32 processes sequential swap out test
    case on a Xeon E5 v3 system with RAM disk as swap, the lock contention
    for the spinlock of the swap cache is reduced from 20.15% to 12.19%,
    when the type of the swap device is 1.

    Use the whole swap entry as key,

    perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 10.37,
    perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 9.78,

    Use the swap offset as key,

    perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 6.25,
    perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 5.94,

    Link: http://lkml.kernel.org/r/1473270649-27229-1-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Cc: Dan Williams
    Cc: Joonsoo Kim
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Aaron Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
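
    A sketch of the central idea (illustrative fragment, not the full
    patch): index the per-type radix tree by the swap offset rather than
    by the whole swap entry value.

    /* e.g. in __add_to_swap_cache() */
    struct address_space *address_space = swap_address_space(entry);

    /* before: radix_tree_insert(&address_space->page_tree, entry.val, page); */
    int error = radix_tree_insert(&address_space->page_tree,
                                  swp_offset(entry), page);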
     
    vm_insert_mixed(), unlike vm_insert_pfn_prot() and vmf_insert_pfn_pmd(),
    fails to check the pgprot_t it uses for the mapping against the one
    recorded in the memtype tracking tree. Add the missing call to
    track_pfn_insert() to preclude cases where incompatible aliased mappings
    are established for a given physical address range.

    Link: http://lkml.kernel.org/r/147328717909.35069.14256589123570653697.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Cc: David Airlie
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • If store_mem_state() is called to online memory which is already online,
    it will return 1, the value it got from device_online().

    This is wrong because store_mem_state() is a device_attribute .store
    function. Thus a non-negative return value represents input bytes read.

    Set the return value to -EINVAL in this case.

    Link: http://lkml.kernel.org/r/1472743777-24266-1-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Cc: Greg Kroah-Hartman
    Cc: Vlastimil Babka
    Cc: Vitaly Kuznetsov
    Cc: David Rientjes
    Cc: Yaowei Bai
    Cc: Joonsoo Kim
    Cc: Dan Williams
    Cc: Xishi Qiu
    Cc: David Vrabel
    Cc: Chen Yucong
    Cc: Andrew Banman
    Cc: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
     
    mem_cgroup_count_precharge() and mem_cgroup_move_charge() both call
    walk_page_range() on the range 0 to ~0UL; neither provides a pte_hole
    callback, which causes the current implementation to skip non-vma
    regions. This is all fine, but follow-up changes would like to make
    walk_page_range() more generic, so it is better to be explicit about
    which range to traverse. Let's use highest_vm_end to explicitly
    traverse only user mmaped memory.

    [mhocko@kernel.org: rewrote changelog]
    Link: http://lkml.kernel.org/r/1472655897-22532-1-git-send-email-james.morse@arm.com
    Signed-off-by: James Morse
    Acked-by: Naoya Horiguchi
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morse
     
  • The global zero page is used to satisfy an anonymous read fault. If
    THP(Transparent HugePage) is enabled then the global huge zero page is
    used. The global huge zero page uses an atomic counter for reference
    counting and is allocated/freed dynamically according to its counter
    value.

    CPU time spent on that counter will greatly increase if there are a lot
    of processes doing anonymous read faults. This patch proposes a way to
    reduce the access to the global counter so that the CPU load can be
    reduced accordingly.

    To do this, a new flag of the mm_struct is introduced:
    MMF_USED_HUGE_ZERO_PAGE. With this flag, a process only needs to touch
    the global counter in two cases:

    1. the first time it uses the global huge zero page;
    2. when mm_users of its mm_struct reaches zero.

    Note that right now, the huge zero page is eligible to be freed as soon
    as its last use goes away. With this patch, the page will not be
    eligible to be freed until the exit of the last process from which it
    was ever used.

    And with the use of mm_users, kthreads are not eligible to use the
    huge zero page either. Since no kthread uses the huge zero page today,
    there is no difference after applying this patch. But if that is not
    desired, this can be changed to trigger when mm_count reaches zero
    instead.

    Case used for test on Haswell EP:

    usemem -n 72 --readonly -j 0x200000 100G

    Which spawns 72 processes and each will mmap 100G anonymous space and
    then do read only access to that space sequentially with a step of 2MB.

    CPU cycles from perf report for base commit:
    54.03% usemem [kernel.kallsyms] [k] get_huge_zero_page
    CPU cycles from perf report for this commit:
    0.11% usemem [kernel.kallsyms] [k] mm_get_huge_zero_page

    Performance(throughput) of the workload for base commit: 1784430792
    Performance(throughput) of the workload for this commit: 4726928591
    164% increase.

    Runtime of the workload for base commit: 707592 us
    Runtime of the workload for this commit: 303970 us
    50% drop.

    Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
    Signed-off-by: Aaron Lu
    Cc: Sergey Senozhatsky
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Cc: Tim Chen
    Cc: Huang Ying
    Cc: Vlastimil Babka
    Cc: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Ebru Akagunduz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
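
    A sketch of the per-mm fast path (close in spirit to the patch but not
    guaranteed to match it line for line; get/put_huge_zero_page stand for
    the pre-existing global-refcount helpers):

    struct page *mm_get_huge_zero_page(struct mm_struct *mm)
    {
        /* fast path: this mm already holds a reference */
        if (test_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
            return READ_ONCE(huge_zero_page);

        if (!get_huge_zero_page())
            return NULL;

        /* lost a race with another thread of this mm: drop the extra ref */
        if (test_and_set_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
            put_huge_zero_page();

        return READ_ONCE(huge_zero_page);
    }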
     
  • Trying to walk all of virtual memory requires architecture specific
    knowledge. On x86_64, addresses must be sign extended from bit 48,
    whereas on arm64 the top VA_BITS of address space have their own set of
    page tables.

    clear_refs_write() calls walk_page_range() on the range 0 to ~0UL, it
    provides a test_walk() callback that only expects to be walking over
    VMAs. Currently walk_pmd_range() will skip memory regions that don't
    have a VMA, reporting them as a hole.

    As this call only expects to walk user address space, make it walk 0 to
    'highest_vm_end'.

    Link: http://lkml.kernel.org/r/1472655792-22439-1-git-send-email-james.morse@arm.com
    Signed-off-by: James Morse
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morse
     
    In the current kernel code, we only call node_set_state(cpu_to_node(cpu),
    N_CPU) when a cpu is hotplugged. But we do not set the node state for
    N_CPU when the cpus are brought online during boot.

    So this could lead to failure when we check whether a node contains a
    cpu with node_state(node_id, N_CPU).

    One use case is in the node_reclaim() function:

    /*
     * Only run node reclaim on the local node or on nodes that do not
     * have associated processors. This will favor the local processor
     * over remote processors and spread off node memory allocations
     * as wide as possible.
     */
    if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id != numa_node_id())
        return NODE_RECLAIM_NOSCAN;

    I instrumented the kernel to call this function after boot and it always
    returns 0 on an x86 desktop machine until I apply the attached patch.

    int num_cpu_node(void)
    {
        int i, nr_cpu_nodes = 0;

        for_each_node(i) {
            if (node_state(i, N_CPU))
                ++nr_cpu_nodes;
        }

        return nr_cpu_nodes;
    }

    Fix this by checking each node for online CPUs when we initialize
    vmstat, which is responsible for maintaining node state.

    Link: http://lkml.kernel.org/r/20160829175922.GA21775@linux.intel.com
    Signed-off-by: Tim Chen
    Acked-by: David Rientjes
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Cc: Tim Chen
    Cc: Ying
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Dan Williams
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
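
    A sketch of one way to express the fix (the helper name is
    illustrative, and the actual patch may iterate nodes rather than
    CPUs): mark the node of every CPU that is already online when vmstat
    is initialized.

    static void __init init_cpu_node_state(void)
    {
        int cpu;

        for_each_online_cpu(cpu)
            node_set_state(cpu_to_node(cpu), N_CPU);
    }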
     
  • To support DAX pmd mappings with unmodified applications, filesystems
    need to align an mmap address by the pmd size.

    Call thp_get_unmapped_area() from f_op->get_unmapped_area.

    Note, there is no change in behavior for a non-DAX file.

    Link: http://lkml.kernel.org/r/1472497881-9323-3-git-send-email-toshi.kani@hpe.com
    Signed-off-by: Toshi Kani
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Kirill A. Shutemov
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Theodore Ts'o
    Cc: Andreas Dilger
    Cc: Mike Kravetz
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
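
    Illustrative wiring (a sketch of what a DAX-capable filesystem's
    file_operations ends up containing, not an exact hunk from the patch):

    static const struct file_operations dax_capable_file_operations = {
        .mmap              = generic_file_mmap,
        .get_unmapped_area = thp_get_unmapped_area,  /* pmd-align DAX mmaps */
        /* ...read/write/fsync ops as usual... */
    };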
     
  • When CONFIG_FS_DAX_PMD is set, DAX supports mmap() using pmd page size.
    This feature relies on both mmap virtual address and FS block (i.e.
    physical address) to be aligned by the pmd page size. Users can use
    mkfs options to specify FS to align block allocations. However,
    aligning mmap address requires code changes to existing applications for
    providing a pmd-aligned address to mmap().

    For instance, fio with "ioengine=mmap" performs I/Os with mmap() [1].
    It calls mmap() with a NULL address, which needs to be changed to
    provide a pmd-aligned address for testing with DAX pmd mappings.
    Changing all applications that call mmap() with NULL is undesirable.

    Add thp_get_unmapped_area(), which can be called by filesystem's
    get_unmapped_area to align an mmap address by the pmd size for a DAX
    file. It calls the default handler, mm->get_unmapped_area(), to find a
    range and then aligns it for a DAX file.

    The patch is based on Matthew Wilcox's change that allows adding support
    of the pud page size easily.

    [1]: https://github.com/axboe/fio/blob/master/engines/mmap.c
    Link: http://lkml.kernel.org/r/1472497881-9323-2-git-send-email-toshi.kani@hpe.com
    Signed-off-by: Toshi Kani
    Reviewed-by: Dan Williams
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Kirill A. Shutemov
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Theodore Ts'o
    Cc: Andreas Dilger
    Cc: Mike Kravetz
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
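
    A simplified model of the alignment trick (hypothetical helper name;
    the real thp_get_unmapped_area() adds more overflow and error
    checking): get a PMD-size-padded range from the default handler, then
    shift the start so the address is congruent with the file offset
    modulo PMD_SIZE.

    static unsigned long pmd_aligned_unmapped_area(struct file *filp,
            unsigned long len, unsigned long pgoff, unsigned long flags)
    {
        unsigned long off = pgoff << PAGE_SHIFT;
        unsigned long addr;

        /* over-allocate by one PMD so there is room to slide forward */
        addr = current->mm->get_unmapped_area(filp, 0, len + PMD_SIZE,
                                              pgoff, flags);
        if (IS_ERR_VALUE(addr))
            return addr;

        /* make (addr - off) a multiple of PMD_SIZE */
        addr += (off - addr) & (PMD_SIZE - 1);
        return addr;
    }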
     
    This patch randomly performs mlock/mlock2 on a given memory region and
    verifies that the RLIMIT_MEMLOCK limitation works properly.

    Suggested-by: David Rientjes
    Link: http://lkml.kernel.org/r/1473325970-11393-4-git-send-email-wei.guo.simon@gmail.com
    Signed-off-by: Simon Guo
    Cc: Shuah Khan
    Cc: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Eric B Munson
    Cc: Simon Guo
    Cc: Mel Gorman
    Cc: Alexey Klimov
    Cc: Andrea Arcangeli
    Cc: Thierry Reding
    Cc: Mike Kravetz
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Simon Guo
     
  • Function seek_to_smaps_entry() can be useful for other selftest
    functionalities, so move it out to header file.

    Link: http://lkml.kernel.org/r/1473325970-11393-3-git-send-email-wei.guo.simon@gmail.com
    Signed-off-by: Simon Guo
    Cc: Shuah Khan
    Cc: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Eric B Munson
    Cc: Simon Guo
    Cc: Mel Gorman
    Cc: Alexey Klimov
    Cc: Andrea Arcangeli
    Cc: Thierry Reding
    Cc: Mike Kravetz
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Simon Guo
     
    This patch adds an mlock() test for multiple invocations on the same
    address area, and verifies that this does not break the RLIMIT_MEMLOCK
    accounting.

    Link: http://lkml.kernel.org/r/1472554781-9835-5-git-send-email-wei.guo.simon@gmail.com
    Signed-off-by: Simon Guo
    Cc: Alexey Klimov
    Cc: Eric B Munson
    Cc: Geert Uytterhoeven
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Shuah Khan
    Cc: Simon Guo
    Cc: Thierry Reding
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Simon Guo
     
    Prepare mlock2.h so that its functionality can be reused.

    Link: http://lkml.kernel.org/r/1472554781-9835-4-git-send-email-wei.guo.simon@gmail.com
    Signed-off-by: Simon Guo
    Cc: Alexey Klimov
    Cc: Eric B Munson
    Cc: Geert Uytterhoeven
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Shuah Khan
    Cc: Simon Guo
    Cc: Thierry Reding
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Simon Guo
     
    When a vma already has the flags VM_LOCKED|VM_LOCKONFAULT (set by
    invoking mlock2(,MLOCK_ONFAULT)), it can again be populated with
    mlock(), which sets VM_LOCKED only.

    There is a hole in mlock_fixup() which increases mm->locked_vm twice
    even though the two operations are on the same vma and both set
    VM_LOCKED.

    The issue can be reproduced by the following code:

    mlock2(p, 1024 * 64, MLOCK_ONFAULT); //VM_LOCKED|VM_LOCKONFAULT
    mlock(p, 1024 * 64); //VM_LOCKED

    Then check the VmLck field in /proc/pid/status: it grows to 128k for a
    single 64k region.

    When a vma's vm_flags change and the new vm_flags include VM_LOCKED, it
    is not necessarily a "newly locked" vma. This patch corrects the bug
    by preventing mm->locked_vm from being incremented when the old
    vm_flags already include VM_LOCKED.

    Link: http://lkml.kernel.org/r/1472554781-9835-3-git-send-email-wei.guo.simon@gmail.com
    Signed-off-by: Simon Guo
    Acked-by: Kirill A. Shutemov
    Cc: Alexey Klimov
    Cc: Eric B Munson
    Cc: Geert Uytterhoeven
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Shuah Khan
    Cc: Simon Guo
    Cc: Thierry Reding
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Simon Guo
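
    A self-contained user-space version of the reproducer above
    (illustrative; assumes kernel/libc headers new enough to define
    SYS_mlock2, and skips error handling). Without the fix, VmLck reports
    128 kB for this single 64 kB region; with it, 64 kB.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef MLOCK_ONFAULT
    #define MLOCK_ONFAULT 0x01
    #endif

    int main(void)
    {
        size_t len = 64 * 1024;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        char cmd[64];

        syscall(SYS_mlock2, p, len, MLOCK_ONFAULT); /* VM_LOCKED|VM_LOCKONFAULT */
        mlock(p, len);                              /* VM_LOCKED, same region   */

        snprintf(cmd, sizeof(cmd), "grep VmLck /proc/%d/status", getpid());
        return system(cmd);
    }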
     
    In do_mlock(), the check against the locked memory limit has a hole
    which makes the following sequence fail at step 3):

    1) The user has a memory chunk at addressA of 50k, and the user's
    memlock rlimit is 64k.
    2) mlock(addressA, 30k)
    3) mlock(addressA, 40k)

    The 3rd step should have been allowed, since the 40k request intersects
    the 30k locked at step 2), so the 3rd call effectively mlocks only the
    extra 10k of memory.

    This patch checks the vma to calculate the actual "new" mlock size, if
    necessary, and adjusts the logic to fix this issue.

    [akpm@linux-foundation.org: clean up comment layout]
    [wei.guo.simon@gmail.com: correct a typo in count_mm_mlocked_page_nr()]
    Link: http://lkml.kernel.org/r/1473325970-11393-2-git-send-email-wei.guo.simon@gmail.com
    Link: http://lkml.kernel.org/r/1472554781-9835-2-git-send-email-wei.guo.simon@gmail.com
    Signed-off-by: Simon Guo
    Cc: Alexey Klimov
    Cc: Eric B Munson
    Cc: Geert Uytterhoeven
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Shuah Khan
    Cc: Simon Guo
    Cc: Thierry Reding
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Simon Guo
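
    The three steps above, as a runnable user-space check (illustrative,
    minimal error handling). Before the fix the second mlock() fails with
    ENOMEM even though only about 10k of it is new; after the fix it
    succeeds.

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl = { 64 * 1024, 64 * 1024 };
        char *p;

        setrlimit(RLIMIT_MEMLOCK, &rl);          /* step 1: 64k memlock limit */
        p = mmap(NULL, 50 * 1024, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        printf("mlock 30k: %d\n", mlock(p, 30 * 1024));           /* step 2 */
        printf("mlock 40k: %d (%s)\n", mlock(p, 40 * 1024),
               strerror(errno));                                  /* step 3 */
        return 0;
    }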
     
    Since lumpy reclaim is gone there is no source of higher order pages
    if CONFIG_COMPACTION=n except for order-0 page reclaim, which is
    unreliable for that purpose to say the least. Hitting an OOM for
    !costly higher order requests is therefore not that hard to imagine.
    We are trying hard not to invoke the OOM killer as much as possible but
    there is simply no reliable way to detect whether more reclaim retries
    make sense.

    Disabling COMPACTION is not widespread but it seems that some users
    might have disabled the feature without realizing the full consequences
    (mostly along with disabling THP, because compaction used to be mainly
    a THP thing). This patch just adds a note if the OOM killer was
    triggered by a higher order request with compaction disabled. This
    will help us identify possible misconfiguration right from the oom
    report, which is easier than always having to keep in mind that
    somebody might have disabled COMPACTION without a good reason.

    Link: http://lkml.kernel.org/r/20160830111632.GD23963@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • File pages use a set of radix tree tags (DIRTY, TOWRITE, WRITEBACK,
    etc.) to accelerate finding the pages with a specific tag in the radix
    tree during inode writeback. But for anonymous pages in the swap cache,
    there is no inode writeback. So there is no need to find the pages with
    some writeback tags in the radix tree. It is not necessary to touch
    radix tree writeback tags for pages in the swap cache.

    Per Rik van Riel's suggestion, a new flag AS_NO_WRITEBACK_TAGS is
    introduced for address spaces which don't need to update the writeback
    tags. The flag is set for swap caches. It may be used for DAX file
    systems, etc.

    With this patch, the swap out bandwidth improved 22.3% (from ~1.2GB/s to
    ~1.48GB/s) in the vm-scalability swap-w-seq test case with 8 processes.
    The test is done on a Xeon E5 v3 system. The swap device used is a RAM
    simulated PMEM (persistent memory) device. The improvement comes from
    the reduced contention on the swap cache radix tree lock. To test
    sequential swapping out, the test case uses 8 processes, which
    sequentially allocate and write to the anonymous pages until RAM and
    part of the swap device is used up.

    Details of the comparison are as follows:

                 base                 base+patch
         ----------------     --------------------------
              %stddev          %change       %stddev
        2506952 ±  2%          +28.1%   3212076 ±  7%  vm-scalability.throughput
        1207402 ±  7%          +22.3%   1476578 ±  6%  vmstat.swap.so
          10.86 ± 12%          -23.4%      8.31 ± 16%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list
          10.82 ± 13%          -33.1%      7.24 ± 14%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_zone_memcg
          10.36 ± 11%         -100.0%      0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__test_set_page_writeback.bdev_write_page.__swap_writepage.swap_writepage
          10.52 ± 12%         -100.0%      0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.test_clear_page_writeback.end_page_writeback.page_endio.pmem_rw_page

    Link: http://lkml.kernel.org/r/1472578089-5560-1-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Wu Fengguang
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
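
    A sketch of how such a flag can be plumbed through (the helper names
    here are assumptions made for illustration; the changelog itself only
    names the AS_NO_WRITEBACK_TAGS flag):

    static inline void mapping_set_no_writeback_tags(struct address_space *m)
    {
        set_bit(AS_NO_WRITEBACK_TAGS, &m->flags);
    }

    static inline bool mapping_use_writeback_tags(struct address_space *m)
    {
        return !test_bit(AS_NO_WRITEBACK_TAGS, &m->flags);
    }

    /* Swap cache address spaces opt out at init time; the
     * test_set/test_clear_page_writeback() paths then only take the tree
     * lock and touch the WRITEBACK radix tree tag when
     * mapping_use_writeback_tags(mapping) is true. */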
     
    In ___alloc_bootmem_node_nopanic(), replace kzalloc() with
    kzalloc_node() in order to preferentially allocate memory within the
    given node when slab is available.

    Link: http://lkml.kernel.org/r/1f487f12-6af4-5e4f-a28c-1de2361cdcd8@zoho.com
    Signed-off-by: zijun_hu
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zijun_hu
     
  • Fix the following bugs:

    - the same ARCH_LOW_ADDRESS_LIMIT statements are duplicated between the
    header and the relevant source file

    - ARCH_LOW_ADDRESS_LIMIT, which may be defined by the architecture in
    asm/processor.h, is not reliably preferred over the default in
    linux/bootmem.h, since the former header isn't included by the latter

    Link: http://lkml.kernel.org/r/e046aeaa-e160-6d9e-dc1b-e084c2fd999f@zoho.com
    Signed-off-by: zijun_hu
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zijun_hu
     
    Currently a significant amount of memory is reserved only in a kernel
    booted to capture a kernel dump using the fa_dump method.

    Kernels compiled with CONFIG_DEFERRED_STRUCT_PAGE_INIT will initialize
    only a certain amount of memory per node at boot. That amount takes
    into account the dentry and inode cache sizes. Currently the cache
    sizes are calculated based on the total system memory, including the
    reserved memory. However, such a kernel, when booted as the fadump
    kernel, will not be able to allocate the required amount of memory for
    the dentry and inode caches. This results in crashes like the one
    shown below.

    Hence only implement arch_reserved_kernel_pages() for CONFIG_FA_DUMP
    configurations. The amount reserved will be subtracted while sizing
    the large caches, which avoids crashes like the one below on large
    systems such as 32 TB systems.

    Dentry cache hash table entries: 536870912 (order: 16, 4294967296 bytes)
    vmalloc: allocation failure, allocated 4097114112 of 17179934720 bytes
    swapper/0: page allocation failure: order:0, mode:0x2080020(GFP_ATOMIC)
    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.6-master+ #3
    Call Trace:
    dump_stack+0xb0/0xf0 (unreliable)
    warn_alloc_failed+0x114/0x160
    __vmalloc_node_range+0x304/0x340
    __vmalloc+0x6c/0x90
    alloc_large_system_hash+0x1b8/0x2c0
    inode_init+0x94/0xe4
    vfs_caches_init+0x8c/0x13c
    start_kernel+0x50c/0x578
    start_here_common+0x20/0xa8

    Link: http://lkml.kernel.org/r/1472476010-4709-4-git-send-email-srikar@linux.vnet.ibm.com
    Signed-off-by: Srikar Dronamraju
    Suggested-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Michael Ellerman
    Cc: Mahesh Salgaonkar
    Cc: Hari Bathini
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srikar Dronamraju
     
    The total reserved memory in a system is accounted but not available
    for use outside mm/memblock.c. By exposing the total reserved memory,
    systems can better calculate the size of large hashes.

    Link: http://lkml.kernel.org/r/1472476010-4709-3-git-send-email-srikar@linux.vnet.ibm.com
    Signed-off-by: Srikar Dronamraju
    Suggested-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Michael Ellerman
    Cc: Mahesh Salgaonkar
    Cc: Hari Bathini
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srikar Dronamraju
     
  • Currently arch specific code can reserve memory blocks but
    alloc_large_system_hash() may not take it into consideration when sizing
    the hashes. This can lead to bigger hashes than required and leave no
    memory available for other purposes. This is specifically true for
    systems with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.

    One approach to solve this problem would be to walk through the memblock
    regions and calculate the available memory and base the size of hash
    system on the available memory.

    The other approach would be to depend on the architecture to provide the
    number of pages that are reserved. This change provides hooks to allow
    the architecture to provide the required info.

    Link: http://lkml.kernel.org/r/1472476010-4709-2-git-send-email-srikar@linux.vnet.ibm.com
    Signed-off-by: Srikar Dronamraju
    Suggested-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Michael Ellerman
    Cc: Mahesh Salgaonkar
    Cc: Hari Bathini
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srikar Dronamraju
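
    The shape of such a hook, sketched (whether the generic default is a
    weak symbol or an #ifdef fallback is a detail of the actual patch): an
    architecture overrides it to report the pages it reserved, and
    alloc_large_system_hash() subtracts that amount before sizing the
    caches.

    /* generic fallback: no arch-specific reservation to account for */
    unsigned long arch_reserved_kernel_pages(void)
    {
        return 0;
    }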
     
  • Use the existing enums instead of hardcoded index when looking at the
    zonelist. This makes it more readable. No functionality change by this
    patch.

    Link: http://lkml.kernel.org/r/1472227078-24852-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
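
    The kind of change this makes, shown on a typical call site
    (illustrative before/after, not an exact hunk):

    zonelist = &NODE_DATA(nid)->node_zonelists[0];                  /* before */
    zonelist = &NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK];  /* after  */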
     
    The oom reaper was skipped for an mm which is shared with a kernel
    thread (via use_mm()). The primary concern was that such a kthread
    might want to read from the userspace memory and see a zero page as a
    result of the oom reaper's action. This is no longer a problem after
    "mm: make sure that kthreads will not refault oom reaped memory"
    because any attempt to fault in while MMF_UNSTABLE is set will result
    in SIGBUS, so the target user should see an error. This means that we
    can finally allow the oom reaper also for tasks which share their mm
    with kthreads.

    Link: http://lkml.kernel.org/r/1472119394-11342-10-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • There are only few use_mm() users in the kernel right now. Most of them
    write to the target memory, but the vhost driver relies on
    copy_from_user/get_user from a kernel thread context. This makes it
    impossible to reap the memory of an oom victim which shares the mm with
    the vhost kernel thread because it could see a zero page unexpectedly
    and theoretically make an incorrect decision visible outside of the
    killed task context.

    To quote Michael S. Tsirkin:
    : Getting an error from __get_user and friends is handled gracefully.
    : Getting zero instead of a real value will cause userspace
    : memory corruption.

    The vhost kernel thread is bound to an open fd of the vhost device,
    which is not tied to the mm owner's life cycle in general. The device
    fd can be inherited or passed over to another process, which means that
    we really have to be careful about unexpected memory corruption because
    unlike for normal oom victims the result will be visible outside of the
    oom victim context.

    Make sure that no kthread context (users of use_mm) can ever see
    corrupted data because of the oom reaper; hook into the page fault
    path by checking the MMF_UNSTABLE mm flag. __oom_reap_task_mm will set
    the flag before it starts unmapping the address space, while the flag
    is checked after the page fault has been handled. If the flag is set
    then SIGBUS is triggered so any g-u-p user will get an error code.

    Regular tasks do not need this protection because all which share the mm
    are killed when the mm is reaped and so the corruption will not outlive
    them.

    This patch shouldn't have any visible effect at this moment because the
    OOM killer doesn't invoke oom reaper for tasks with mm shared with
    kthreads yet.

    Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: "Michael S. Tsirkin"
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
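
    A sketch of the check described above (simplified; the exact condition
    and placement inside the fault path may differ): a kthread faulting
    into an mm that the oom reaper has started to unmap gets SIGBUS
    instead of silently seeing a zeroed page.

    if (unlikely((current->flags & PF_KTHREAD) &&
                 !(ret & VM_FAULT_ERROR) &&
                 test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
        ret = VM_FAULT_SIGBUS;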
     
  • There are no users of exit_oom_victim on !current task anymore so enforce
    the API to always work on the current.

    Link: http://lkml.kernel.org/r/1472119394-11342-8-git-send-email-mhocko@kernel.org
    Signed-off-by: Tetsuo Handa
    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
    Commit 74070542099c ("oom, suspend: fix oom_reaper vs.
    oom_killer_disable race") worked around an existing race between
    oom_killer_disable and oom_reaper by adding another round of
    try_to_freeze_tasks after the oom killer was disabled. This was the
    easiest thing to do for a late 4.7 fix. Let's fix it properly now.

    After "oom: keep mm of the killed task available" we no longer have to
    call exit_oom_victim from the oom reaper because we have a stable mm
    available and hide the oom-reaped mm with the MMF_OOM_SKIP flag. So
    let's remove exit_oom_victim; the race described in the above commit
    then no longer exists.

    Unfortunately this alone is not sufficient for the oom_killer_disable
    usecase because now we do not have any reliable way to reach
    exit_oom_victim (the victim might get stuck on a way to exit for an
    unbounded amount of time). OOM killer can cope with that by checking mm
    flags and move on to another victim but we cannot do the same for
    oom_killer_disable as we would lose the guarantee of no further
    interference of the victim with the rest of the system. What we can do
    instead is to cap the maximum time the oom_killer_disable waits for
    victims. The only current user of this function (pm suspend) already
    has a concept of timeout for back off so we can reuse the same value
    there.

    Let's drop set_freezable for the oom_reaper kthread because it is no
    longer needed as the reaper doesn't wake or thaw any processes.

    Link: http://lkml.kernel.org/r/1472119394-11342-7-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • After "oom: keep mm of the killed task available" we can safely detect
    an oom victim by checking task->signal->oom_mm so we do not need the
    signal_struct counter anymore so let's get rid of it.

    This alone wouldn't be sufficient for nommu archs because
    exit_oom_victim doesn't hide the process from the oom killer anymore.
    We can, however, mark the mm with a MMF flag in __mmput. We can reuse
    MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP.

    Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Lockdep complains that __mmdrop is not safe from the softirq context:

    =================================
    [ INFO: inconsistent lock state ]
    4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949 Tainted: G W
    ---------------------------------
    inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
    (pgd_lock){+.?...}, at: pgd_free+0x19/0x6b
    {SOFTIRQ-ON-W} state was registered at:
    __lock_acquire+0xa06/0x196e
    lock_acquire+0x139/0x1e1
    _raw_spin_lock+0x32/0x41
    __change_page_attr_set_clr+0x2a5/0xacd
    change_page_attr_set_clr+0x16f/0x32c
    set_memory_nx+0x37/0x3a
    free_init_pages+0x9e/0xc7
    alternative_instructions+0xa2/0xb3
    check_bugs+0xe/0x2d
    start_kernel+0x3ce/0x3ea
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0x17a/0x18d
    irq event stamp: 105916
    hardirqs last enabled at (105916): free_hot_cold_page+0x37e/0x390
    hardirqs last disabled at (105915): free_hot_cold_page+0x2c1/0x390
    softirqs last enabled at (105878): _local_bh_enable+0x42/0x44
    softirqs last disabled at (105879): irq_exit+0x6f/0xd1

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(pgd_lock);

    lock(pgd_lock);

    *** DEADLOCK ***

    1 lock held by swapper/1/0:
    #0: (rcu_callback){......}, at: rcu_process_callbacks+0x390/0x800

    stack backtrace:
    CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W 4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
    Call Trace:

    print_usage_bug.part.25+0x259/0x268
    mark_lock+0x381/0x567
    __lock_acquire+0x993/0x196e
    lock_acquire+0x139/0x1e1
    _raw_spin_lock+0x32/0x41
    pgd_free+0x19/0x6b
    __mmdrop+0x25/0xb9
    __put_task_struct+0x103/0x11e
    delayed_put_task_struct+0x157/0x15e
    rcu_process_callbacks+0x660/0x800
    __do_softirq+0x1ec/0x4d5
    irq_exit+0x6f/0xd1
    smp_apic_timer_interrupt+0x42/0x4d
    apic_timer_interrupt+0x8e/0xa0

    arch_cpu_idle+0xf/0x11
    default_idle_call+0x32/0x34
    cpu_startup_entry+0x20c/0x399
    start_secondary+0xfe/0x101

    Moreover, commit a79e53d85683 ("x86/mm: Fix pgd_lock deadlock") was
    explicit that pgd_lock must not be taken from irq context. This means
    that __mmdrop called from free_signal_struct has to be postponed to a
    user context. We already have a similar mechanism for mmput_async so
    we can use it here as well. This is safe because mm_count is pinned by
    mm_users.

    This fixes a bug introduced by "oom: keep mm of the killed task available".

    Link: http://lkml.kernel.org/r/1472119394-11342-5-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
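
    A sketch of the deferral, following the existing mmput_async pattern
    (field and helper names are assumptions based on that pattern): the
    final mm_count drop is punted to a workqueue so pgd_free() never runs
    in softirq context.

    static void mmdrop_async_fn(struct work_struct *work)
    {
        struct mm_struct *mm = container_of(work, struct mm_struct,
                                            async_put_work);
        __mmdrop(mm);
    }

    static void mmdrop_async(struct mm_struct *mm)
    {
        if (unlikely(atomic_dec_and_test(&mm->mm_count))) {
            INIT_WORK(&mm->async_put_work, mmdrop_async_fn);
            schedule_work(&mm->async_put_work);
        }
    }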
     
    oom_reap_task has to call exit_oom_victim in order to make sure that
    the oom victim will not block the oom killer forever. This is,
    however, opening new problems (e.g. oom_killer_disable exclusion - see
    commit 74070542099c ("oom, suspend: fix oom_reaper vs.
    oom_killer_disable race")). Ideally, exit_oom_victim should only be
    called from the victim's context.

    One way to achieve this would be to rely on per mm_struct flags. We
    already have MMF_OOM_REAPED to hide a task from the oom killer since
    "mm, oom: hide mm which is shared with kthread or global init". The
    problem is that the exit path:

    do_exit
    exit_mm
    tsk->mm = NULL;
    mmput
    __mmput
    exit_oom_victim

    doesn't guarantee that exit_oom_victim will get called in a bounded
    amount of time. At least exit_aio depends on IO which might get blocked
    due to lack of memory and who knows what else is lurking there.

    This patch takes a different approach. We remember tsk->mm into the
    signal_struct and bind it to the signal struct life time for all oom
    victims. __oom_reap_task_mm as well as oom_scan_process_thread do not
    have to rely on find_lock_task_mm anymore and they will have a reliable
    reference to the mm struct. As a result all the oom specific
    communication inside the OOM killer can be done via tsk->signal->oom_mm.

    Increasing the signal_struct for something as unlikely as the oom killer
    is far from ideal but this approach will make the code much more
    reasonable and long term we even might want to move task->mm into the
    signal_struct anyway. In the next step we might want to make the oom
    killer exclusion and access to memory reserves completely independent
    which would be also nice.

    Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko