07 Apr, 2018

1 commit

  • Merge updates from Andrew Morton:

    - a few misc things

    - ocfs2 updates

    - the v9fs maintainers have been missing for a long time. I've taken
    over v9fs patch slinging.

    - most of MM

    * emailed patches from Andrew Morton: (116 commits)
    mm,oom_reaper: check for MMF_OOM_SKIP before complaining
    mm/ksm: fix interaction with THP
    mm/memblock.c: cast constant ULLONG_MAX to phys_addr_t
    headers: untangle kmemleak.h from mm.h
    include/linux/mmdebug.h: make VM_WARN* non-rvals
    mm/page_isolation.c: make start_isolate_page_range() fail if already isolated
    mm: change return type to vm_fault_t
    mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes
    mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memory
    kernel/fork.c: detect early free of a live mm
    mm: make counting of list_lru_one::nr_items lockless
    mm/swap_state.c: make bool enable_vma_readahead and swap_vma_readahead() static
    block_invalidatepage(): only release page if the full page was invalidated
    mm: kernel-doc: add missing parameter descriptions
    mm/swap.c: remove @cold parameter description for release_pages()
    mm/nommu: remove description of alloc_vm_area
    zram: drop max_zpage_size and use zs_huge_class_size()
    zsmalloc: introduce zs_huge_class_size()
    mm: fix races between swapoff and flush dcache
    fs/direct-io.c: minor cleanups in do_blockdev_direct_IO
    ...

    Linus Torvalds
     

06 Apr, 2018

39 commits

  • I got "oom_reaper: unable to reap pid:" messages when the victim thread
    was blocked inside free_pgtables() (which occurred after returning from
    unmap_vmas() and setting MMF_OOM_SKIP). We don't need to complain when
    exit_mmap() already set MMF_OOM_SKIP.

    Killed process 7558 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
    oom_reaper: unable to reap pid:7558 (a.out)
    a.out D13272 7558 6931 0x00100084
    Call Trace:
    schedule+0x2d/0x80
    rwsem_down_write_failed+0x2bb/0x440
    call_rwsem_down_write_failed+0x13/0x20
    down_write+0x49/0x60
    unlink_file_vma+0x28/0x50
    free_pgtables+0x36/0x100
    exit_mmap+0xbb/0x180
    mmput+0x50/0x110
    copy_process.part.41+0xb61/0x1fe0
    _do_fork+0xe6/0x560
    do_syscall_64+0x74/0x230
    entry_SYSCALL_64_after_hwframe+0x42/0xb7

    Link: http://lkml.kernel.org/r/201803221946.DHG65638.VFJHFtOSQLOMOF@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • This patch fixes a corner case for KSM. When two pages belong or
    belonged to the same transparent hugepage, and they should be merged,
    KSM fails to split the page, and therefore no merging happens.

    This bug can be reproduced by:
    * making sure ksm is running (disabling ksmtuned if necessary)
    * enabling transparent hugepages
    * allocating a THP-aligned, 1-THP-sized buffer
      e.g. on amd64: posix_memalign(&p, 1<<21, 1<<21)
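
    Putting these steps together, a minimal userspace reproducer along the
    lines described above could look like the sketch below (assumptions:
    x86_64 with 2MB THP and KSM enabled; the sleep needed for ksmd to scan
    is system dependent).

    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            size_t thp = 1UL << 21;            /* 2MB THP on amd64 */
            void *p;

            if (posix_memalign(&p, thp, thp))  /* THP-aligned, THP-sized buffer */
                    return 1;
            memset(p, 42, thp);                /* identical contents -> candidates for merging */
            madvise(p, thp, MADV_MERGEABLE);   /* hand the range to KSM */
            sleep(60);                         /* give ksmd time to scan and merge */
            return 0;
    }
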
    Co-authored-by: Gerald Schaefer
    Reviewed-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Claudio Imbrenda
     
    This fixes a warning shown when phys_addr_t is a 32-bit int and the
    kernel is compiled with clang:

    mm/memblock.c:927:15: warning: implicit conversion from 'unsigned long long'
    to 'phys_addr_t' (aka 'unsigned int') changes value from
    18446744073709551615 to 4294967295 [-Wconstant-conversion]
    r->base : ULLONG_MAX;
    ^~~~~~~~~~
    ./include/linux/kernel.h:30:21: note: expanded from macro 'ULLONG_MAX'
    #define ULLONG_MAX (~0ULL)
    ^~~~~
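
    The narrowing and the cast that silences it can be reproduced in a few
    lines of userspace C (a sketch: phys_addr_t here is just a local
    typedef standing in for the 32-bit configuration, not the memblock
    code):

    #include <stdio.h>

    typedef unsigned int phys_addr_t;   /* 32-bit phys_addr_t configuration */
    #define ULLONG_MAX (~0ULL)

    int main(void)
    {
            /* clang's -Wconstant-conversion warns here: the 64-bit constant
             * is silently narrowed when assigned to a 32-bit type... */
            phys_addr_t implicit = ULLONG_MAX;

            /* ...while an explicit cast documents the intent and silences it. */
            phys_addr_t with_cast = (phys_addr_t)ULLONG_MAX;

            printf("%u %u\n", implicit, with_cast);   /* both print 4294967295 */
            return 0;
    }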

    Link: http://lkml.kernel.org/r/20180319005645.29051-1-stefan@agner.ch
    Signed-off-by: Stefan Agner
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Catalin Marinas
    Cc: Pavel Tatashin
    Cc: Ard Biesheuvel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefan Agner
     
    Currently <linux/slab.h> #includes <linux/kmemleak.h> for no obvious
    reason. It looks like it's only a convenience, so remove kmemleak.h
    from slab.h and add <linux/kmemleak.h> to any users of kmemleak_* that
    don't already #include it. Also remove <linux/kmemleak.h> from source
    files that do not use it.

    This is tested on i386 allmodconfig and x86_64 allmodconfig. It would
    be good to run it through the 0day bot for other $ARCHes. I have
    neither the horsepower nor the storage space for the other $ARCHes.

    Update: This patch has been extensively build-tested by both the 0day
    bot & kisskb/ozlabs build farms. Both of them reported 2 build failures
    for which patches are included here (in v2).

    [ slab.h is the second most used header file after module.h; kernel.h is
    right there with slab.h. There could be some minor error in the
    counting due to some #includes having comments after them and I didn't
    combine all of those. ]

    [akpm@linux-foundation.org: security/keys/big_key.c needs vmalloc.h, per sfr]
    Link: http://lkml.kernel.org/r/e4309f98-3749-93e1-4bb7-d9501a39d015@infradead.org
    Link: http://kisskb.ellerman.id.au/kisskb/head/13396/
    Signed-off-by: Randy Dunlap
    Reviewed-by: Ingo Molnar
    Reported-by: Michael Ellerman [2 build failures]
    Reported-by: Fengguang Wu [2 build failures]
    Reviewed-by: Andrew Morton
    Cc: Wei Yongjun
    Cc: Luis R. Rodriguez
    Cc: Greg Kroah-Hartman
    Cc: Mimi Zohar
    Cc: John Johansen
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • start_isolate_page_range() is used to set the migrate type of a set of
    pageblocks to MIGRATE_ISOLATE while attempting to start a migration
    operation. It assumes that only one thread is calling it for the
    specified range. This routine is used by CMA, memory hotplug and
    gigantic huge pages. Each of these users synchronize access to the
    range within their subsystem. However, two subsystems (CMA and gigantic
    huge pages for example) could attempt operations on the same range. If
    this happens, one thread may 'undo' the work another thread is doing.
    This can result in pageblocks being incorrectly left marked as
    MIGRATE_ISOLATE and therefore not available for page allocation.

    What is ideally needed is a way to synchronize access to a set of
    pageblocks that are undergoing isolation and migration. The only thing
    we know about these pageblocks is that they are all in the same zone. A
    per-node mutex is too coarse as we want to allow multiple operations on
    different ranges within the same zone concurrently. Instead, we will
    use the migration type of the pageblocks themselves as a form of
    synchronization.

    start_isolate_page_range sets the migration type on a set of
    pageblocks, going in order from the one associated with the smallest
    pfn to the largest pfn. The zone lock is acquired to check and set the
    migration type. When going through the list of pageblocks, check if
    MIGRATE_ISOLATE is already set. If so, this indicates another thread
    is working on this pageblock. We know exactly which pageblocks we set,
    so clean up by undoing those and return -EBUSY.

    This allows start_isolate_page_range to serve as a synchronization
    mechanism and will allow more general use by callers of these
    interfaces. Update comments in alloc_contig_range to reflect this new
    functionality.

    Each CPU holds the associated zone lock to modify or examine the
    migration type of a pageblock. And, it will only examine/update a
    single pageblock per lock acquire/release cycle.
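
    A small userspace model of the undo-on-conflict logic described above
    (illustrative only: migratetypes are modelled as a plain array and the
    zone lock as a pthread mutex, not the kernel's data structures):

    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>

    #define NBLOCKS 16
    enum { MIGRATE_MOVABLE, MIGRATE_ISOLATE };

    static int migratetype[NBLOCKS];      /* one entry per pageblock */
    static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Mark pageblocks [start, end) MIGRATE_ISOLATE in pfn order; if one is
     * already isolated, another caller owns it: undo our own marks, -EBUSY. */
    static int start_isolate_range(int start, int end)
    {
            int i, undo;

            for (i = start; i < end; i++) {
                    pthread_mutex_lock(&zone_lock);    /* one block per lock hold */
                    if (migratetype[i] == MIGRATE_ISOLATE) {
                            pthread_mutex_unlock(&zone_lock);
                            goto undo;
                    }
                    migratetype[i] = MIGRATE_ISOLATE;
                    pthread_mutex_unlock(&zone_lock);
            }
            return 0;
    undo:
            for (undo = start; undo < i; undo++) {     /* only what we set ourselves */
                    pthread_mutex_lock(&zone_lock);
                    migratetype[undo] = MIGRATE_MOVABLE;
                    pthread_mutex_unlock(&zone_lock);
            }
            return -EBUSY;
    }

    int main(void)
    {
            printf("first caller:  %d\n", start_isolate_range(2, 8));   /* 0 */
            printf("second caller: %d\n", start_isolate_range(4, 10));  /* -EBUSY */
            return 0;
    }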

    Link: http://lkml.kernel.org/r/20180309224731.16978-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reviewed-by: Andrew Morton
    Cc: KAMEZAWA Hiroyuki
    Cc: Luiz Capitulino
    Cc: Michal Nazarewicz
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
    Since the 2.6 kernel, the oom killer has been slightly biased away from
    CAP_SYS_ADMIN processes by discounting some of their memory usage in
    comparison to other processes.

    This has always been implicit and nothing exactly relies on the
    behavior.

    Gaurav notices that __task_cred() can dereference a potentially freed
    pointer if the task under consideration is exiting because a reference
    to the task_struct is not held.

    Remove the CAP_SYS_ADMIN bias so that all processes are treated equally.

    If any CAP_SYS_ADMIN process would like to be biased against, it is
    always allowed to adjust /proc/pid/oom_score_adj.
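
    For reference, a tiny userspace sketch of adjusting that knob (values
    range from -1000 to 1000; positive values make the process more likely
    to be chosen, and lowering the value generally requires privilege such
    as CAP_SYS_RESOURCE):

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/self/oom_score_adj", "w");

            if (!f) {
                    perror("fopen");
                    return 1;
            }
            /* A privileged daemon that relied on the old implicit bonus can
             * request an explicit bias instead. */
            fprintf(f, "%d\n", -100);
            fclose(f);
            return 0;
    }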

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803071548510.6996@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Reported-by: Gaurav Kohli
    Acked-by: Michal Hocko
    Cc: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
    Kswapd will not wake up if per-zone watermarks are not failing or if
    too many previous attempts at background reclaim have failed.

    This can be true if there is a lot of free memory available. For high-
    order allocations, kswapd is responsible for waking up kcompactd for
    background compaction. If the zone is not below its watermarks or
    reclaim has recently failed (lots of free memory, nothing left to
    reclaim), kcompactd does not get woken up.

    When __GFP_DIRECT_RECLAIM is not allowed, allow kcompactd to still be
    woken up even if kswapd will not reclaim. This allows high-order
    allocations, such as thp, to still trigger background compaction even
    when the zone has an abundance of free memory.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803111659420.209721@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
    During slab reclaim for a memcg, shrink_slab iterates over all
    registered shrinkers in the system and tries to count and consume
    objects related to the cgroup. Under memory pressure this behaves
    badly: I observe high system time and much time spent in
    list_lru_count_one() for many processes on a RHEL7 kernel.

    This patch makes list_lru_node::memcg_lrus RCU protected, which allows
    us to skip taking the spinlock in list_lru_count_one().

    With the patch, Shakeel Butt observes a significant change in the perf
    graph. He says:

    ========================================================================
    Setup: running a fork-bomb in a memcg of 200MiB on a 8GiB and 4 vcpu
    VM and recording the trace with 'perf record -g -a'.

    The trace without the patch:

    + 34.19% fb.sh [kernel.kallsyms] [k] queued_spin_lock_slowpath
    + 30.77% fb.sh [kernel.kallsyms] [k] _raw_spin_lock
    + 3.53% fb.sh [kernel.kallsyms] [k] list_lru_count_one
    + 2.26% fb.sh [kernel.kallsyms] [k] super_cache_count
    + 1.68% fb.sh [kernel.kallsyms] [k] shrink_slab
    + 0.59% fb.sh [kernel.kallsyms] [k] down_read_trylock
    + 0.48% fb.sh [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
    + 0.38% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
    + 0.32% fb.sh [kernel.kallsyms] [k] queue_work_on
    + 0.26% fb.sh [kernel.kallsyms] [k] count_shadow_nodes

    With the patch:

    + 0.16% swapper [kernel.kallsyms] [k] default_idle
    + 0.13% oom_reaper [kernel.kallsyms] [k] mutex_spin_on_owner
    + 0.05% perf [kernel.kallsyms] [k] copy_user_generic_string
    + 0.05% init.real [kernel.kallsyms] [k] wait_consider_task
    + 0.05% kworker/0:0 [kernel.kallsyms] [k] finish_task_switch
    + 0.04% kworker/2:1 [kernel.kallsyms] [k] finish_task_switch
    + 0.04% kworker/3:1 [kernel.kallsyms] [k] finish_task_switch
    + 0.04% kworker/1:0 [kernel.kallsyms] [k] finish_task_switch
    + 0.03% binary [kernel.kallsyms] [k] copy_page
    ========================================================================

    Thanks Shakeel for the testing.

    [ktkhai@virtuozzo.com: v2]
    Link: http://lkml.kernel.org/r/151203869520.3915.2587549826865799173.stgit@localhost.localdomain
    Link: http://lkml.kernel.org/r/150583358557.26700.8490036563698102569.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Tested-by: Shakeel Butt
    Acked-by: Vladimir Davydov
    Cc: Andrey Ryabinin
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • The bool enable_vma_readahead and swap_vma_readahead() are local to the
    source and do not need to be in global scope, so make them static.

    Cleans up sparse warnings:

    mm/swap_state.c:41:6: warning: symbol 'enable_vma_readahead' was not declared. Should it be static?
    mm/swap_state.c:742:13: warning: symbol 'swap_vma_readahead' was not declared. Should it be static?

    Link: http://lkml.kernel.org/r/20180223164852.5159-1-colin.king@canonical.com
    Signed-off-by: Colin Ian King
    Reviewed-by: Andrew Morton
    Acked-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     
  • Link: http://lkml.kernel.org/r/1519585191-10180-4-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • The 'cold' parameter was removed from release_pages function by commit
    c6f92f9fbe7d ("mm: remove cold parameter for release_pages").

    Update the description to match the code.

    Link: http://lkml.kernel.org/r/1519585191-10180-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
    The alloc_vm_area() in nommu is a stub, but its description states it
    allocates kernel address space. Remove the description to make the
    code and the documentation agree.

    Link: http://lkml.kernel.org/r/1519585191-10180-2-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Patch series "zsmalloc/zram: drop zram's max_zpage_size", v3.

    ZRAM's max_zpage_size is a bad thing. It forces zsmalloc to store
    normal objects as huge ones, which results in bigger zsmalloc memory
    usage. Drop it and use the actual zsmalloc huge-class value when
    deciding whether an object is huge or not.

    This patch (of 2):

    Not every object can share its zspage with other objects, e.g. when
    the object is as big as a zspage or nearly as big as a zspage. For
    such objects zsmalloc has a so-called huge class: every object which
    belongs to the huge class consumes the entire zspage (which consists
    of a physical page). On an x86_64, PAGE_SHIFT 12 box, the first
    non-huge class size is 3264, so starting down from size 3264, objects
    can share page(-s) and thus minimize memory wastage.

    ZRAM, however, has its own statically defined watermark for huge
    objects, namely "3 * PAGE_SIZE / 4 = 3072", and forcibly stores every
    object larger than this watermark (3072) as a PAGE_SIZE object, in other
    words, to a huge class, while zsmalloc can keep some of those objects in
    non-huge classes. This results in increased memory consumption.

    zsmalloc knows better whether the object is huge or not. Introduce a
    zs_huge_class_size() function which tells whether the given object can
    be stored in one of the non-huge classes. This will let us drop ZRAM's
    huge object watermark and fully rely on zsmalloc when we decide
    whether the object is huge.
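
    For illustration, a tiny userspace calculation of the gap described
    above (assumes the quoted numbers: PAGE_SIZE of 4096 and a first
    non-huge class size of 3264):

    #include <stdio.h>

    int main(void)
    {
            const int page_size = 4096;                      /* x86_64, PAGE_SHIFT 12 */
            const int zram_watermark = 3 * page_size / 4;    /* 3072, zram's static cutoff */
            const int first_non_huge = 3264;                 /* value quoted in the changelog */

            /* Objects in (3072, 3264] were forced into the huge class by zram
             * even though zsmalloc could still pack them into shared zspages. */
            printf("zram watermark: %d\n", zram_watermark);
            printf("sizes forced huge unnecessarily: %d..%d\n",
                   zram_watermark + 1, first_non_huge);
            return 0;
    }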

    [sergey.senozhatsky.work@gmail.com: add pool param to zs_huge_class_size()]
    Link: http://lkml.kernel.org/r/20180314081833.1096-2-sergey.senozhatsky@gmail.com
    Link: http://lkml.kernel.org/r/20180306070639.7389-2-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Thanks to commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB
    trunks"), after swapoff the address_space associated with the swap
    device will be freed. So page_mapping() users which may touch the
    address_space need some kind of mechanism to prevent the address_space
    from being freed during accessing.

    The dcache flushing functions (flush_dcache_page(), etc) in architecture
    specific code may access the address_space of swap device for anonymous
    pages in swap cache via page_mapping() function. But in some cases
    there are no mechanisms to prevent the swap device from being swapoff,
    for example,

    CPU1                               CPU2
    __get_user_pages()                 swapoff()
      flush_dcache_page()
        mapping = page_mapping()
          ...                            exit_swap_address_space()
          ...                              kvfree(spaces)
          mapping_mapped(mapping)

    The address space may be accessed after being freed.

    But per cachetlb.txt and Russell King, flush_dcache_page() only cares
    about file cache pages; for anonymous pages, flush_anon_page() should
    be used. The implementation of flush_dcache_page() in all
    architectures follows this too. They will check whether page_mapping()
    is NULL and whether mapping_mapped() is true to determine whether to
    flush the dcache immediately, and they will use the interval tree
    (mapping->i_mmap) to find all user space mappings. But
    mapping_mapped() and mapping->i_mmap aren't used by anonymous pages in
    swap cache at all.

    So, to fix the race between swapoff and flush dcache,
    page_mapping_file() is added to return the address_space for file
    cache pages and NULL otherwise. All page_mapping() calls in the flush
    dcache functions are replaced with page_mapping_file().

    [akpm@linux-foundation.org: simplify page_mapping_file(), per Mike]
    Link: http://lkml.kernel.org/r/20180305083634.15174-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Andrew Morton
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Dave Hansen
    Cc: Chen Liqin
    Cc: Russell King
    Cc: Yoshinori Sato
    Cc: "James E.J. Bottomley"
    Cc: Guan Xuetao
    Cc: "David S. Miller"
    Cc: Chris Zankel
    Cc: Vineet Gupta
    Cc: Ley Foon Tan
    Cc: Ralf Baechle
    Cc: Andi Kleen
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • When device-dax is operating in huge-page mode we want it to behave like
    hugetlbfs and report the MMU page mapping size that is being enforced by
    the vma.

    Similar to commit 31383c6865a5 "mm, hugetlbfs: introduce ->split() to
    vm_operations_struct" it would be messy to teach vma_mmu_pagesize()
    about device-dax page mapping sizes in the same (hstate) way that
    hugetlbfs communicates this attribute. Instead, these patches introduce
    a new ->pagesize() vm operation.

    Link: http://lkml.kernel.org/r/151996254734.27922.15813097401404359642.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Jane Chu
    Reviewed-by: Andrew Morton
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Patch series "mm, smaps: MMUPageSize for device-dax", v3.

    Similar to commit 31383c6865a5 ("mm, hugetlbfs: introduce ->split() to
    vm_operations_struct") here is another occasion where we want
    special-case hugetlbfs/hstate enabling to also apply to device-dax.

    This prompts the question what other hstate conversions we might do
    beyond ->split() and ->pagesize(), but this appears to be the last of
    the usages of hstate_vma() in generic/non-hugetlbfs specific code paths.

    This patch (of 3):

    The current powerpc definition of vma_mmu_pagesize() open codes looking
    up the page size via hstate. It is identical to the generic
    vma_kernel_pagesize() implementation.

    Now, vma_kernel_pagesize() is growing support for determining the page
    size of Device-DAX vmas in addition to the existing Hugetlbfs page size
    determination.

    Ideally, if the powerpc vma_mmu_pagesize() used vma_kernel_pagesize() it
    would automatically benefit from any new vma-type support that is added
    to vma_kernel_pagesize(). However, the powerpc vma_mmu_pagesize() is
    prevented from calling vma_kernel_pagesize() due to a circular header
    dependency that requires vma_mmu_pagesize() to be defined before
    including <linux/hugetlb.h>.

    Break this circular dependency by defining the default vma_mmu_pagesize()
    as a __weak symbol to be overridden by the powerpc version.

    Link: http://lkml.kernel.org/r/151996254179.27922.2213728278535578744.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Andrew Morton
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Jane Chu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • - Fixed style error: 8 spaces -> 1 tab.
    - Fixed style warning: Corrected misleading indentation.

    Link: http://lkml.kernel.org/r/20180302210254.31888-1-marioleinweber@web.de
    Signed-off-by: Mario Leinweber
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mario Leinweber
     
    When a page is freed back to the global pool, its buddy will be
    checked to see if it's possible to do a merge. This requires accessing
    the buddy's page structure, and that access could take a long time if
    it's cache cold.

    This patch adds a prefetch of the to-be-freed page's buddy outside of
    zone->lock, in the hope that accessing the buddy's page structure
    later under zone->lock will be faster. Since we *always* do buddy
    merging and check an order-0 page's buddy to try to merge it when it
    goes into the main allocator, the cacheline will always come in, i.e.
    the prefetched data will never be unused.

    Normally the number of prefetches will be pcp->batch (default 31, with
    an upper limit of (PAGE_SHIFT * 8) = 96 on x86_64), but in the case
    where the pcp's pages all get drained it will be pcp->count, which has
    an upper limit of pcp->high. pcp->high, although it has a default
    value of 186 (pcp->batch = 31 * 6), can be changed by the user through
    /proc/sys/vm/percpu_pagelist_fraction, and there is no software upper
    limit, so it could be large, like several thousand. For this reason,
    only the buddies of the first pcp->batch pages are prefetched, to
    avoid excessive prefetching.

    In the meantime, there are two concerns:

    1. the prefetch could potentially evict existing cachelines, especially
    for L1D cache since it is not huge

    2. there is some additional instruction overhead, namely calculating
    buddy pfn twice

    For 1, it's hard to say; this microbenchmark shows a good result, but
    the actual benefit of this patch will be workload/CPU dependent.

    For 2, since the calculation is an XOR on two local variables, it's
    expected that in many cases the cycles spent will be offset by the
    reduced memory latency later. This is especially true for NUMA
    machines where multiple CPUs are contending on zone->lock, and the
    most time-consuming part under zone->lock is the wait for the 'struct
    page' cachelines of the to-be-freed pages and their buddies.
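
    As a sketch of the two pieces mentioned above, the buddy pfn is just
    an XOR on a local variable and the prefetch is a single hint
    (__builtin_prefetch stands in for the kernel's prefetch(), and the
    struct page here is a stand-in, not the kernel's):

    #include <stdio.h>

    struct page { unsigned long flags; };    /* stand-in for struct page */
    static struct page pages[1 << 10];

    /* The buddy of a pfn at a given order differs only in bit 'order',
     * so the "calculate buddy pfn twice" overhead is one XOR per page. */
    static unsigned long buddy_pfn(unsigned long pfn, unsigned int order)
    {
            return pfn ^ (1UL << order);
    }

    int main(void)
    {
            unsigned long pfn = 42;
            unsigned long buddy = buddy_pfn(pfn, 0);

            /* Warm the cacheline of the buddy's page structure before it is
             * actually touched later (in the kernel, under zone->lock). */
            __builtin_prefetch(&pages[buddy]);

            printf("pfn %lu, order-0 buddy %lu\n", pfn, buddy);
            return 0;
    }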

    Test with will-it-scale/page_fault1 full load:

    kernel       Broadwell(2S)   Skylake(2S)     Broadwell(4S)   Skylake(4S)
    v4.16-rc2+   9034215         7971818         13667135        15677465
    patch2/3     9536374 +5.6%   8314710 +4.3%   14070408 +3.0%  16675866 +6.4%
    this patch   10180856 +6.8%  8506369 +2.3%   14756865 +4.9%  17325324 +3.9%

    Note: this patch's performance improvement percent is against patch2/3.

    (Changelog stolen from Dave Hansen and Mel Gorman's comments at
    http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8ccd@intel.com)

    [aaron.lu@intel.com: use helper function, avoid disordering pages]
    Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
    Link: http://lkml.kernel.org/r/20180320113146.GB24737@intel.com
    [aaron.lu@intel.com: v4]
    Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
    Link: http://lkml.kernel.org/r/20180309082431.GB30868@intel.com
    Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
    Signed-off-by: Aaron Lu
    Suggested-by: Ying Huang
    Reviewed-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Kemi Wang
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     
    When freeing a batch of pages from Per-CPU-Pages (PCP) back to buddy,
    zone->lock is held and then pages are chosen from the PCP's
    migratetype list. There is actually no need to do this 'choose' part
    under the lock: these are PCP pages, so the only CPU that can touch
    them is us, and irqs are also disabled.

    Moving this part outside could reduce lock held time and improve
    performance. Test with will-it-scale/page_fault1 full load:

    kernel       Broadwell(2S)   Skylake(2S)     Broadwell(4S)   Skylake(4S)
    v4.16-rc2+   9034215         7971818         13667135        15677465
    this patch   9536374 +5.6%   8314710 +4.3%   14070408 +3.0%  16675866 +6.4%

    What the test does is: start $nr_cpu processes, each of which
    repeatedly does the following for 5 minutes:

    - mmap 128M of anonymous space

    - write-access that space

    - munmap it.

    The score is the aggregated iteration count.

    https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
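
    A stripped-down sketch of one such iteration (the real benchmark at
    the URL above runs this loop in $nr_cpu processes and counts
    iterations):

    #include <string.h>
    #include <sys/mman.h>

    #define SIZE (128UL << 20)               /* 128M of anonymous memory */

    int main(void)
    {
            char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;
            memset(p, 1, SIZE);              /* write-fault every page in */
            munmap(p, SIZE);                 /* pages go back through the pcp lists */
            return 0;
    }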

    Link: http://lkml.kernel.org/r/20180301062845.26038-3-aaron.lu@intel.com
    Signed-off-by: Aaron Lu
    Acked-by: Mel Gorman
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Huang Ying
    Cc: Kemi Wang
    Cc: Matthew Wilcox
    Cc: Tim Chen
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     
    Matthew Wilcox found that all callers of free_pcppages_bulk()
    currently update pcp->count immediately afterwards, so it's natural to
    do it inside free_pcppages_bulk().

    No functionality or performance change is expected from this patch.

    Link: http://lkml.kernel.org/r/20180301062845.26038-2-aaron.lu@intel.com
    Signed-off-by: Aaron Lu
    Suggested-by: Matthew Wilcox
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Huang Ying
    Cc: Dave Hansen
    Cc: Kemi Wang
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     
  • It's possible for free pages to become stranded on per-cpu pagesets
    (pcps) that, if drained, could be merged with buddy pages on the zone's
    free area to form large order pages, including up to MAX_ORDER.

    Consider a verbose example using the tools/vm/page-types tool at the
    beginning of a ZONE_NORMAL ('B' indicates a buddy page and 'S' indicates
    a slab page). Pages on pcps do not have any page flags set.

    109954 1 _______S________________________________________________________
    109955 2 __________B_____________________________________________________
    109957 1 ________________________________________________________________
    109958 1 __________B_____________________________________________________
    109959 7 ________________________________________________________________
    109960 1 __________B_____________________________________________________
    109961 9 ________________________________________________________________
    10996a 1 __________B_____________________________________________________
    10996b 3 ________________________________________________________________
    10996e 1 __________B_____________________________________________________
    10996f 1 ________________________________________________________________
    ...
    109f8c 1 __________B_____________________________________________________
    109f8d 2 ________________________________________________________________
    109f8f 2 __________B_____________________________________________________
    109f91 f ________________________________________________________________
    109fa0 1 __________B_____________________________________________________
    109fa1 7 ________________________________________________________________
    109fa8 1 __________B_____________________________________________________
    109fa9 1 ________________________________________________________________
    109faa 1 __________B_____________________________________________________
    109fab 1 _______S________________________________________________________

    The compaction migration scanner is attempting to defragment this memory
    since it is at the beginning of the zone. It has done so quite well,
    all movable pages have been migrated. From pfn [0x109955, 0x109fab),
    there are only buddy pages and pages without flags set.

    These pages may be stranded on pcps that could otherwise allow this
    memory to be coalesced if freed back to the zone free area. It is
    possible that some of these pages may not be on pcps and that something
    has called alloc_pages() and used the memory directly, but we rely on
    the absence of __GFP_MOVABLE in these cases to allocate from
    MIGRATE_UNMOVABLE pageblocks to try to keep these MIGRATE_MOVABLE
    pageblocks as free as possible.

    These buddy and pcp pages, spanning 1,621 pages, could be coalesced and
    allow for three transparent hugepages to be dynamically allocated.
    Running the numbers for all such spans on the system, it was found that
    there were over 400 such spans of only buddy pages and pages without
    flags set at the time this /proc/kpageflags sample was collected.
    Without this support, there were _no_ order-9 or order-10 pages free.

    When kcompactd fails to defragment memory such that a cc.order page can
    be allocated, drain all pcps for the zone back to the buddy allocator so
    this stranding cannot occur. Compaction for that order will
    subsequently be deferred, which acts as a ratelimit on this drain.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803010340100.88270@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • should_failslab() is a convenient function to hook into for directed
    error injection into kmalloc(). However, it is only available if a
    config flag is set.

    The following BCC script, for example, fails kmalloc() calls after a
    btrfs umount:

    from bcc import BPF

    prog = r"""
    BPF_HASH(flag);

    #include

    int kprobe__btrfs_close_devices(void *ctx) {
            u64 key = 1;
            flag.update(&key, &key);
            return 0;
    }

    int kprobe__should_failslab(struct pt_regs *ctx) {
            u64 key = 1;
            u64 *res;
            res = flag.lookup(&key);
            if (res != 0) {
                    bpf_override_return(ctx, -ENOMEM);
            }
            return 0;
    }
    """
    b = BPF(text=prog)

    while 1:
            b.kprobe_poll()

    This patch refactors the should_failslab implementation so that the
    function is always available for error injection, independent of flags.

    This change would be similar in nature to commit f5490d3ec921 ("block:
    Add should_fail_bio() for bpf error injection").

    Link: http://lkml.kernel.org/r/20180222020320.6944-1-hmclauchlan@fb.com
    Signed-off-by: Howard McLauchlan
    Reviewed-by: Andrew Morton
    Cc: Akinobu Mita
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Josef Bacik
    Cc: Johannes Weiner
    Cc: Alexei Starovoitov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Howard McLauchlan
     
    early_param() functions are only called during kernel initialization,
    so Linux marks them with the __init macro to save memory.

    But early_page_poison_param() was missed, so make it __init as well.

    Link: http://lkml.kernel.org/r/20180117034757.27024-1-douly.fnst@cn.fujitsu.com
    Signed-off-by: Dou Liyang
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Philippe Ombredanne
    Cc: Kate Stewart
    Cc: Michal Hocko
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dou Liyang
     
    early_param() functions are only called during kernel initialization,
    so Linux marks them with the __init macro to save memory.

    But early_page_owner_param() was missed, so make it __init as well.

    Link: http://lkml.kernel.org/r/20180117034736.26963-1-douly.fnst@cn.fujitsu.com
    Signed-off-by: Dou Liyang
    Reviewed-by: Andrew Morton
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dou Liyang
     
    early_param() functions are only called during kernel initialization,
    so Linux marks them with the __init macro to save memory.

    But kmemleak_boot_config() was missed, so make it __init as well.

    Link: http://lkml.kernel.org/r/20180117034720.26897-1-douly.fnst@cn.fujitsu.com
    Signed-off-by: Dou Liyang
    Reviewed-by: Andrew Morton
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dou Liyang
     
    This patch makes do_swap_page() not need to be aware of the two
    different swap readahead algorithms, by unifying the cluster-based and
    vma-based readahead function calls.

    Link: http://lkml.kernel.org/r/1509520520-32367-3-git-send-email-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20180220085249.151400-3-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    Looking at the recent changes to swap readahead, I am very unhappy
    about the current code structure, which diverges into two swap
    readahead algorithms in do_swap_page. This patch is to clean it up.

    The main motivation is that the fault handler doesn't need to be aware
    of the readahead algorithms; it should just call swapin_readahead.

    As a first step, this patch cleans up a little bit but is not perfect
    (I just separated it to make review easier), so the next patch will
    complete the goal.

    [minchan@kernel.org: do not check readahead flag with THP anon]
    Link: http://lkml.kernel.org/r/874lm83zho.fsf@yhuang-dev.intel.com
    Link: http://lkml.kernel.org/r/20180227232611.169883-1-minchan@kernel.org
    Link: http://lkml.kernel.org/r/1509520520-32367-2-git-send-email-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20180220085249.151400-2-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    Since we no longer use the return value of shrink_slab() for normal
    reclaim, the comment is no longer true. If some do_shrink_slab() call
    takes unexpectedly long (the root cause of the stall is currently
    unknown) while register_shrinker()/unregister_shrinker() is pending,
    trying to drop caches via /proc/sys/vm/drop_caches could become an
    infinite cond_resched() loop if many mem_cgroups are defined. For
    safety, let's not pretend forward progress.

    Link: http://lkml.kernel.org/r/201802202229.GGF26507.LVFtMSOOHFJOQF@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Dave Chinner
    Cc: Glauber Costa
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Currently if z3fold couldn't find an unbuddied page it would first try
    to pull a page off the stale list. The problem with this approach is
    that we can't 100% guarantee that the page is not processed by the
    workqueue thread at the same time unless we run cancel_work_sync() on
    it, which we can't do if we're in an atomic context. So let's just
    limit stale list usage to non-atomic contexts only.

    Link: http://lkml.kernel.org/r/47ab51e7-e9c1-d30e-ab17-f734dbc3abce@gmail.com
    Signed-off-by: Vitaly Vul
    Reviewed-by: Andrew Morton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
    THP split makes a non-atomic change to tail page flags. This is almost
    ok because tail pages are locked and isolated, but it breaks recent
    changes in page locking: the non-atomic operation could clear bit
    PG_waiters.

    As a result the concurrent sequence get_page_unless_zero() ->
    lock_page() might block forever, especially if this page was truncated
    later.

    Fix is trivial: clone flags before unfreezing page reference counter.

    This race has existed since commit 62906027091f ("mm: add PageWaiters
    indicating tasks are waiting for a page bit"), while the unsafe
    unfreeze itself was added in commit 8df651c7059e ("thp: cleanup
    split_huge_page()").

    clear_compound_head() also must be called before unfreezing page
    reference because after successful get_page_unless_zero() might follow
    put_page() which needs correct compound_head().

    And replace page_ref_inc()/page_ref_add() with page_ref_unfreeze(),
    which is made especially for that and has the semantics of
    smp_store_release().
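
    As a rough userspace analogy of that release semantic (not the kernel
    code; names here are illustrative), a C11 release store pairs with an
    acquire load so that writes made before publishing the reference count
    are visible to whoever observes the count, which is the role
    page_ref_unfreeze() plays for the cloned flags:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int flags;                 /* stands in for the tail page flags */
    static atomic_int refcount;       /* stands in for the frozen refcount */

    static void *writer(void *arg)
    {
            flags = 0x42;                                   /* clone/set flags first... */
            atomic_store_explicit(&refcount, 1,
                                  memory_order_release);    /* ...then publish the count */
            return NULL;
    }

    static void *reader(void *arg)
    {
            /* like get_page_unless_zero(): only proceed once the count is live */
            while (atomic_load_explicit(&refcount, memory_order_acquire) == 0)
                    ;
            printf("flags seen by reader: 0x%x\n", flags);  /* guaranteed 0x42 */
            return NULL;
    }

    int main(void)
    {
            pthread_t w, r;

            pthread_create(&r, NULL, reader, NULL);
            pthread_create(&w, NULL, writer, NULL);
            pthread_join(w, NULL);
            pthread_join(r, NULL);
            return 0;
    }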

    Link: http://lkml.kernel.org/r/151844393341.210639.13162088407980624477.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Nicholas Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
    When page_mapping() is called and the mapping is dereferenced in
    page_evictable() through shrink_active_list(), it is possible for the
    inode to be truncated and the embedded address space to be freed at
    the same time. This may lead to the following race.

    CPU1                                                  CPU2

    truncate(inode)                                       shrink_active_list()
      ...                                                   page_evictable(page)
      truncate_inode_page(mapping, page);
        delete_from_page_cache(page)
          spin_lock_irqsave(&mapping->tree_lock, flags);
            __delete_from_page_cache(page, NULL)
              page_cache_tree_delete(..)
                ...                                           mapping = page_mapping(page);
                page->mapping = NULL;
                ...
          spin_unlock_irqrestore(&mapping->tree_lock, flags);
          page_cache_free_page(mapping, page)
            put_page(page)
              if (put_page_testzero(page)) -> false
    - inode now has no pages and can be freed including embedded address_space

                                                          mapping_unevictable(mapping)
                                                            test_bit(AS_UNEVICTABLE, &mapping->flags);
    - we've dereferenced mapping which is potentially already free.

    A similar race exists between swap cache freeing and page_evictable()
    too.

    The address_space in inode and swap cache will be freed after a RCU
    grace period. So the races are fixed via enclosing the page_mapping()
    and address_space usage in rcu_read_lock/unlock(). Some comments are
    added in code to make it clear what is protected by the RCU read lock.

    Link: http://lkml.kernel.org/r/20180212081227.1940-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Jan Kara
    Reviewed-by: Andrew Morton
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • ...instead of open coding file operations followed by custom ->open()
    callbacks per each attribute.

    [andriy.shevchenko@linux.intel.com: add tags, fix compilation issue]
    Link: http://lkml.kernel.org/r/20180217144253.58604-1-andriy.shevchenko@linux.intel.com
    Link: http://lkml.kernel.org/r/20180214154644.54505-1-andriy.shevchenko@linux.intel.com
    Signed-off-by: Andy Shevchenko
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Reviewed-by: Sergey Senozhatsky
    Acked-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Dennis Zhou
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • mirrored_kernelcore can be in __meminitdata, so move it there.

    At the same time, fixup section specifiers to be after the name of the
    variable per checkpatch.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802121623280.179479@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Reviewed-by: Andrew Morton
    Cc: Mike Kravetz
    Cc: Jonathan Corbet
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Both kernelcore= and movablecore= can be used to define the amount of
    ZONE_NORMAL and ZONE_MOVABLE on a system, respectively. This requires
    the system memory capacity to be known when specifying the command line,
    however.

    This introduces the ability to define both kernelcore= and movablecore=
    as a percentage of total system memory. This is convenient for systems
    software that wants to define the amount of ZONE_MOVABLE, for example,
    as a proportion of a system's memory rather than a hardcoded byte value.

    To define the percentage, the final character of the parameter should be
    a '%'.
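
    A minimal userspace sketch of that percentage handling (the helper and
    its name here are illustrative, not the kernel's command line parser,
    and suffixes such as K/M/G are ignored):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* A trailing '%' means a percentage of total memory,
     * otherwise the value is taken as a raw byte count. */
    static unsigned long long parse_core(const char *arg,
                                         unsigned long long total_bytes)
    {
            size_t len = strlen(arg);

            if (len && arg[len - 1] == '%')
                    return total_bytes * strtoull(arg, NULL, 10) / 100;
            return strtoull(arg, NULL, 10);
    }

    int main(void)
    {
            unsigned long long total = 128ULL << 30;   /* pretend a 128GB machine */

            printf("movablecore=10%%        -> %llu bytes\n",
                   parse_core("10%", total));
            printf("movablecore=4294967296 -> %llu bytes\n",
                   parse_core("4294967296", total));
            return 0;
    }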

    mhocko: "why is anyone using these options nowadays?"

    rientjes:
    :
    : Fragmentation of non-__GFP_MOVABLE pages due to low on memory
    : situations can pollute most pageblocks on the system, as much as 1GB of
    : slab being fragmented over 128GB of memory, for example. When the
    : amount of kernel memory is well bounded for certain systems, it is
    : better to aggressively reclaim from existing MIGRATE_UNMOVABLE
    : pageblocks rather than eagerly fallback to others.
    :
    : We have additional patches that help with this fragmentation if you're
    : interested, specifically kcompactd compaction of MIGRATE_UNMOVABLE
    : pageblocks triggered by fallback of non-__GFP_MOVABLE allocations and
    : draining of pcp lists back to the zone free area to prevent stranding.

    [rientjes@google.com: updates]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802131700160.71590@chino.kir.corp.google.com
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802121622470.179479@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Reviewed-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Cc: Jonathan Corbet
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Recently the following BUG was reported:

    Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
    Memory failure: 0x3c0000: recovery action for huge page: Recovered
    BUG: unable to handle kernel paging request at ffff8dfcc0003000
    IP: gup_pgd_range+0x1f0/0xc20
    PGD 17ae72067 P4D 17ae72067 PUD 0
    Oops: 0000 [#1] SMP PTI
    ...
    CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014

    You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
    a 1GB hugepage. This happens because get_user_pages_fast() is not aware
    of a migration entry on pud that was created in the 1st madvise() event.

    I think that the conversion to a pud-aligned migration entry is
    working, but other MM code walking over page tables isn't prepared for
    it. We need some time and effort to make all this work properly, so
    this patch avoids the reported bug by just disabling error handling
    for 1GB hugepages.

    [n-horiguchi@ah.jp.nec.com: v2]
    Link: http://lkml.kernel.org/r/1517284444-18149-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Link: http://lkml.kernel.org/r/1517207283-15769-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: Punit Agrawal
    Tested-by: Michael Ellerman
    Cc: Anshuman Khandual
    Cc: "Aneesh Kumar K.V"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
    During memory hotplugging we traverse struct pages three times:

    1. memset(0) in sparse_add_one_section()
    2. loop in __add_section() to set set_page_node(page, nid); and
       SetPageReserved(page);
    3. loop in memmap_init_zone() to call __init_single_pfn()

    This patch removes the first two loops, and leaves only loop 3. All
    struct pages are initialized in one place, the same as it is done during
    boot.

    The benefits:

    - We improve memory hotplug performance because we are not evicting the
    cache several times and also reduce loop branching overhead.

    - Remove condition from hotpath in __init_single_pfn(), that was added
    in order to fix the problem that was reported by Bharata in the above
    email thread, thus also improve performance during normal boot.

    - Make memory hotplug more similar to the boot memory initialization
    path because we zero and initialize struct pages only in one
    function.

    - Simplifies memory hotplug struct page initialization code, and thus
    enables future improvements, such as multi-threading the
    initialization of struct pages in order to improve hotplug
    performance even further on larger machines.

    [pasha.tatashin@oracle.com: v5]
    Link: http://lkml.kernel.org/r/20180228030308.1116-7-pasha.tatashin@oracle.com
    Link: http://lkml.kernel.org/r/20180215165920.8570-7-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: Ingo Molnar
    Cc: Michal Hocko
    Cc: Baoquan He
    Cc: Bharata B Rao
    Cc: Daniel Jordan
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Steven Sistare
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • During memory hotplugging the probe routine will leave struct pages
    uninitialized, the same as it is currently done during boot. Therefore,
    we do not want to access the inside of struct pages before
    __init_single_page() is called during onlining.

    Because during hotplug we know that pages in one memory block belong to
    the same numa node, we can skip the checking. We should keep checking
    for the boot case.

    [pasha.tatashin@oracle.com: s/register_new_memory()/hotplug_memory_register()]
    Link: http://lkml.kernel.org/r/20180228030308.1116-6-pasha.tatashin@oracle.com
    Link: http://lkml.kernel.org/r/20180215165920.8570-6-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Reviewed-by: Ingo Molnar
    Cc: Baoquan He
    Cc: Bharata B Rao
    Cc: Daniel Jordan
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Steven Sistare
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • During boot we poison struct page memory in order to ensure that no one
    is accessing this memory until the struct pages are initialized in
    __init_single_page().

    This patch adds more scrutiny to this checking by making sure that flags
    do not equal the poison pattern when they are accessed. The pattern is
    all ones.

    Since node id is also stored in struct page, and may be accessed quite
    early, we add this enforcement into page_to_nid() function as well.
    Note, this is applicable only when NODE_NOT_IN_PAGE_FLAGS=n

    [pasha.tatashin@oracle.com: v4]
    Link: http://lkml.kernel.org/r/20180215165920.8570-4-pasha.tatashin@oracle.com
    Link: http://lkml.kernel.org/r/20180213193159.14606-4-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: Ingo Molnar
    Acked-by: Michal Hocko
    Cc: Baoquan He
    Cc: Bharata B Rao
    Cc: Daniel Jordan
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Steven Sistare
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • Patch series "optimize memory hotplug", v3.

    This patchset:

    - Improves hotplug performance by eliminating a number of struct page
    traverses during memory hotplug.

    - Fixes some issues with hotplugging, where boundaries were not
    properly checked, and on x86 the block size was not properly aligned
    with the end of memory.

    - Also, potentially improves boot performance by eliminating a
    condition from __init_single_page().

    - Adds robustness by verifying that struct pages are correctly
    poisoned when flags are accessed.

    The following experiments were performed on Xeon(R) CPU E7-8895 v3 @
    2.60GHz with 1T RAM:

    booting in qemu with 960G of memory, time to initialize struct pages:

    no-kvm:
                 TRY1         TRY2
    BEFORE:      39.433668    39.39705
    AFTER:       36.903781    36.989329

    with-kvm:
    BEFORE:      10.977447    11.103164
    AFTER:       10.929072    10.751885

    Hotplug 896G memory:

    no-kvm:
                 TRY1         TRY2
    BEFORE:      848.740000   846.910000
    AFTER:       783.070000   786.560000

    with-kvm:
                 TRY1         TRY2
    BEFORE:      34.410000    33.57
    AFTER:       29.810000    29.580000

    This patch (of 6):

    Start qemu with the following arguments:

    -m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G

    Which: boots machine with 64G, and adds a device mem1 with 2G which can
    be hotplugged later.

    Also make sure that config has the following turned on:
    CONFIG_MEMORY_HOTPLUG
    CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
    CONFIG_ACPI_HOTPLUG_MEMORY

    Using the qemu monitor, hotplug the memory (make sure the config
    options above are enabled):

    (qemu) device_add pc-dimm,id=dimm1,memdev=mem1

    The operation will fail with the following trace:

    WARNING: CPU: 0 PID: 91 at drivers/base/memory.c:205
    pages_correctly_reserved+0xe6/0x110
    Modules linked in:
    CPU: 0 PID: 91 Comm: systemd-udevd Not tainted 4.16.0-rc1_pt_master #29
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
    BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
    RIP: 0010:pages_correctly_reserved+0xe6/0x110
    Call Trace:
    memory_subsys_online+0x44/0xa0
    device_online+0x51/0x80
    store_mem_state+0x5e/0xe0
    kernfs_fop_write+0xfa/0x170
    __vfs_write+0x2e/0x150
    vfs_write+0xa8/0x1a0
    SyS_write+0x4d/0xb0
    do_syscall_64+0x5d/0x110
    entry_SYSCALL_64_after_hwframe+0x21/0x86
    ---[ end trace 6203bc4f1a5d30e8 ]---

    The problem is detected in: drivers/base/memory.c

    static bool pages_correctly_reserved(unsigned long start_pfn)
    205 if (WARN_ON_ONCE(!pfn_valid(pfn)))

    This function loops through every section in the newly added memory
    block and verifies that the first pfn is valid, meaning section exists,
    has mapping (struct page array), and is online.

    The block size on x86 is usually 128M, but when the machine is booted
    with more than 64G of memory, the block size is changed to 2G:

    $ cat /sys/devices/system/memory/block_size_bytes
    80000000

    or

    $ dmesg | grep "block size"
    [ 0.086469] x86/mm: Memory block size: 2048MB

    During memory hotplug, and hotremove we verify that the range is section
    size aligned, but we actually must verify that it is block size aligned,
    because that is the proper unit for hotplug operations. See:
    Documentation/memory-hotplug.txt

    So, when the start_pfn of newly added memory is not block size aligned,
    we can get a memory block that has only part of it with properly
    populated sections.

    In our case the start_pfn starts from the last_pfn (end of physical
    memory).

    $ dmesg | grep last_pfn
    [ 0.000000] e820: last_pfn = 0x1040000 max_arch_pfn = 0x400000000

    0x1040000 == 65G, and so is not 2G aligned!
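
    A quick userspace check of that arithmetic (0x1040000 is a pfn, so the
    byte address is pfn << 12 with 4K pages; the block size is 0x80000000,
    i.e. 2G):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            uint64_t pfn = 0x1040000;          /* last_pfn from the dmesg output above */
            uint64_t start = pfn << 12;        /* byte address: 0x1040000000 */
            uint64_t block_size = 0x80000000;  /* 2G memory block size */

            printf("start = %#llx (%llu GB)\n",
                   (unsigned long long)start,
                   (unsigned long long)(start >> 30));
            printf("2G aligned? %s\n",
                   (start & (block_size - 1)) == 0 ? "yes" : "no");  /* prints "no" */
            return 0;
    }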

    The fix is to enforce that memory that is hotplugged and hotremoved is
    block size aligned.

    With this fix, running the above sequence yield to the following result:

    (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
    Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
    size 0x80000000
    acpi PNP0C80:00: add_memory failed
    acpi PNP0C80:00: acpi_memory_enable_device() error
    acpi PNP0C80:00: Enumeration failure

    Link: http://lkml.kernel.org/r/20180213193159.14606-2-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: Ingo Molnar
    Acked-by: Michal Hocko
    Cc: Baoquan He
    Cc: Bharata B Rao
    Cc: Daniel Jordan
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Steven Sistare
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin