13 Jan, 2019

1 commit

  • commit b15c87263a69272423771118c653e9a1d0672caa upstream.

    We have received a bug report that an injected MCE about faulty memory
    prevents memory offline from succeeding on a 4.4 based kernel. The
    underlying reason was that the HWPoison page has an elevated reference
    count and the migration keeps failing. There are two problems with that.
    First of all it is dubious to migrate the poisoned page because we know
    that accessing that memory can fail. Secondly it doesn't make any sense
    to migrate potentially broken content and carry the memory corruption
    over to a new location.

    Oscar has found out that 4.4 and the current upstream kernels behave
    slightly differently with his simple testcase:

    ===

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    /* userspace stand-in for the kernel's PAGE_ALIGN (round up to 4kB) */
    #define PAGE_SIZE 4096UL
    #define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

    int main(void)
    {
            int ret;
            int i;
            int fd;
            char *array = malloc(4096);
            char *array_locked = malloc(4096);

            fd = open("/tmp/data", O_RDONLY);
            read(fd, array, 4095);

            for (i = 0; i < 4096; i++)
                    array_locked[i] = 'd';

            ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked),
                        sizeof(array_locked));
            if (ret)
                    perror("mlock");

            sleep(20);

            /* poison the page backing array_locked */
            ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked),
                          4096, MADV_HWPOISON);
            if (ret)
                    perror("madvise");

            /* touch the poisoned page again */
            for (i = 0; i < 4096; i++)
                    array_locked[i] = 'd';

            return 0;
    }
    ===

    ... and then offline the memory backing the poisoned page.

    In 4.4 kernels he saw the hwpoisoned page being returned back to the LRU
    list:
    kernel: [] dump_trace+0x59/0x340
    kernel: [] show_stack_log_lvl+0xea/0x170
    kernel: [] show_stack+0x21/0x40
    kernel: [] dump_stack+0x5c/0x7c
    kernel: [] warn_slowpath_common+0x81/0xb0
    kernel: [] __pagevec_lru_add_fn+0x14c/0x160
    kernel: [] pagevec_lru_move_fn+0xad/0x100
    kernel: [] __lru_cache_add+0x6c/0xb0
    kernel: [] add_to_page_cache_lru+0x46/0x70
    kernel: [] extent_readpages+0xc3/0x1a0 [btrfs]
    kernel: [] __do_page_cache_readahead+0x177/0x200
    kernel: [] ondemand_readahead+0x168/0x2a0
    kernel: [] generic_file_read_iter+0x41f/0x660
    kernel: [] __vfs_read+0xcd/0x140
    kernel: [] vfs_read+0x7a/0x120
    kernel: [] kernel_read+0x3b/0x50
    kernel: [] do_execveat_common.isra.29+0x490/0x6f0
    kernel: [] do_execve+0x28/0x30
    kernel: [] call_usermodehelper_exec_async+0xfb/0x130
    kernel: [] ret_from_fork+0x55/0x80

    And the latter confuses the hotremove path because an LRU page is
    attempted to be migrated and that fails due to an elevated reference
    count. It is quite possible that the reuse of the HWPoisoned page is the
    result of some race condition which has been fixed since, but I am not
    really sure about that.

    With the upstream kernel the failure is slightly different. The page
    doesn't seem to have the LRU bit set, but isolate_movable_page simply
    fails and do_migrate_range puts all the isolated pages back to the LRU,
    therefore no progress is made and scan_movable_pages finds the same set
    of pages over and over again.

    Fix both cases by explicitly checking for HWPoisoned pages before we even
    try to get a reference on the page, and try to unmap it if it is still
    mapped. As explained by Naoya:

    : Hwpoison code never unmapped those for no big reason because
    : Ksm pages never dominate memory, so we simply didn't have strong
    : motivation to save the pages.

    Also add a WARN_ON(PageLRU) in case there is a race and we can hit LRU
    HWPoison pages, which shouldn't happen but I couldn't convince myself
    about that. Naoya has noted the following:

    : Theoretically no such guarantee, because try_to_unmap() doesn't have a
    : guarantee of success and then memory_failure() returns immediately
    : when hwpoison_user_mappings fails.
    : Or the following code (comes after hwpoison_user_mappings block) also implies
    : that the target page can still have PageLRU flag.
    :
    : /*
    :  * Torn down by someone else?
    :  */
    : if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
    :         action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
    :         res = -EBUSY;
    :         goto out;
    : }
    :
    : So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
    : the current version of your patch.
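
    The check described above boils down to something like the following in
    the hotremove migration loop (a sketch based on this description; the
    exact flags and placement in the upstream patch may differ in detail):

        if (PageHWPoison(page)) {
                /* LRU state is unexpected here, but tolerated; see above */
                if (WARN_ON(PageLRU(page)))
                        isolate_lru_page(page);
                /* do not preserve the corrupted content, just unmap it */
                if (page_mapped(page))
                        try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
                /* and skip migrating the poisoned page altogether */
                continue;
        }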

    Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Debugged-by: Oscar Salvador
    Tested-by: Oscar Salvador
    Acked-by: David Hildenbrand
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

04 Oct, 2017

3 commits

    find_{smallest|biggest}_section_pfn() find the smallest/biggest section
    in a pfn range and return the pfn of that section. But the functions are
    defined as int, so they can only return values in the range 0x00000000 -
    0xffffffff. This means that if the memory address is over 16TB, the
    functions do not work correctly.

    To handle 64 bit values, the patch defines
    find_{smallest|biggest}_section_pfn() as unsigned long.
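
    A minimal userspace illustration of the truncation (not kernel code; the
    pfn value and the 4kB page size are just assumptions for the demo):

        #include <stdio.h>

        /* returning a 64 bit pfn through an int chops it to 32 bits */
        static int narrow_pfn(unsigned long pfn)
        {
                return pfn;
        }

        int main(void)
        {
                /* pfn of the first byte at 16TB with 4kB pages: 2^44 >> 12 = 2^32 */
                unsigned long pfn = 1UL << 32;

                printf("real pfn:       %#lx\n", pfn);
                printf("through an int: %#lx\n", (unsigned long)narrow_pfn(pfn));
                return 0;
        }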

    Fixes: 815121d2b5cd ("memory_hotplug: clear zone when removing the memory")
    Link: http://lkml.kernel.org/r/d9d5593a-d0a4-c4be-ab08-493df59a85c6@gmail.com
    Signed-off-by: Yasuaki Ishimatsu
    Acked-by: Michal Hocko
    Cc: Xishi Qiu
    Cc: Reza Arbab
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    YASUAKI ISHIMATSU
     
    pfn_to_section_nr() and section_nr_to_pfn() are defined as macros.
    pfn_to_section_nr() has no issue even if it is defined as a macro. But
    section_nr_to_pfn() has an overflow issue if sec is passed as an int.

    section_nr_to_pfn() just shifts sec by PFN_SECTION_SHIFT. If sec is an
    unsigned long, section_nr_to_pfn() returns the pfn as a 64 bit value.
    But if sec is an int, section_nr_to_pfn() returns the pfn as a 32 bit
    value.

    __remove_section() calculates start_pfn using section_nr_to_pfn() with
    scn_nr defined as an int. So if the hot-removed memory address is over
    16TB, the overflow occurs and section_nr_to_pfn() does not calculate the
    correct pfn.

    To make callers pass a properly typed argument, the patch changes the
    macros to inline functions.
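
    Roughly, the conversion looks like this (PFN_SECTION_SHIFT is 15 on
    x86_64 with 128MB sections and 4kB pages; treat this as a sketch of the
    change rather than the exact upstream diff):

        /* before: the shift is evaluated in the caller's (possibly int) type */
        #define section_nr_to_pfn(sec) ((sec) << PFN_SECTION_SHIFT)

        /* after: the argument is always widened to unsigned long */
        static inline unsigned long section_nr_to_pfn(unsigned long sec)
        {
                return sec << PFN_SECTION_SHIFT;
        }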

    Fixes: 815121d2b5cd ("memory_hotplug: clear zone when removing the memory")
    Link: http://lkml.kernel.org/r/e643a387-e573-6bbf-d418-c60c8ee3d15e@gmail.com
    Signed-off-by: Yasuaki Ishimatsu
    Acked-by: Michal Hocko
    Cc: Xishi Qiu
    Cc: Reza Arbab
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    YASUAKI ISHIMATSU
     
  • Patch series "mm, memory_hotplug: fix few soft lockups in memory
    hotadd".

    Johannes has noticed a few soft lockups when adding a large nvdimm
    device. All of them were caused by long loops without any explicit
    cond_resched, which is a problem for !PREEMPT kernels.

    The fix is quite straightforward. Just make sure that cond_resched gets
    called from time to time.

    This patch (of 3):

    __add_pages gets a pfn range to add and there is no upper bound on the
    size of a single call. This is usually a memory block aligned size for
    regular memory hotplug - smaller sizes are usual for memory ballooning
    drivers, or the whole NUMA node for physical memory online. There is no
    explicit scheduling point in that code path though.

    This can lead to long latencies while __add_pages is executed and we
    have even seen a soft lockup report during nvdimm initialization with a
    !PREEMPT kernel:

    NMI watchdog: BUG: soft lockup - CPU#11 stuck for 23s! [kworker/u641:3:832]
    [...]
    Workqueue: events_unbound async_run_entry_fn
    task: ffff881809270f40 ti: ffff881809274000 task.ti: ffff881809274000
    RIP: _raw_spin_unlock_irqrestore+0x11/0x20
    RSP: 0018:ffff881809277b10 EFLAGS: 00000286
    [...]
    Call Trace:
    sparse_add_one_section+0x13d/0x18e
    __add_pages+0x10a/0x1d0
    arch_add_memory+0x4a/0xc0
    devm_memremap_pages+0x29d/0x430
    pmem_attach_disk+0x2fd/0x3f0 [nd_pmem]
    nvdimm_bus_probe+0x64/0x110 [libnvdimm]
    driver_probe_device+0x1f7/0x420
    bus_for_each_drv+0x52/0x80
    __device_attach+0xb0/0x130
    bus_probe_device+0x87/0xa0
    device_add+0x3fc/0x5f0
    nd_async_device_register+0xe/0x40 [libnvdimm]
    async_run_entry_fn+0x43/0x150
    process_one_work+0x14e/0x410
    worker_thread+0x116/0x490
    kthread+0xc7/0xe0
    ret_from_fork+0x3f/0x70
    DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70

    Fix this by adding cond_resched() once per memory section in the given
    pfn range. Each section is a constant amount of work which itself is
    not too expensive, but many of them add up.
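
    In the __add_pages() loop over sections this amounts to something like
    the sketch below (the __add_section() signature and error handling are
    abbreviated from the kernel of that era and may differ in detail):

        for (i = start_sec; i <= end_sec; i++) {
                err = __add_section(nid, section_nr_to_pfn(i), want_memblock);
                if (err && err != -EEXIST)
                        break;
                err = 0;
                /* the fix: give the scheduler a chance once per section */
                cond_resched();
        }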

    Link: http://lkml.kernel.org/r/20170918121410.24466-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Johannes Thumshirn
    Tested-by: Johannes Thumshirn
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

09 Sep, 2017

2 commits

    HMM (heterogeneous memory management) needs struct pages to support
    migration from system main memory to device memory. The reasons for HMM
    and for migration to device memory are explained in the HMM core patch.

    This patch deals with device memory that is un-addressable (i.e. the CPU
    cannot access it). Hence we do not want those struct pages to be managed
    like regular memory. That is why we extend ZONE_DEVICE to support
    different types of memory.

    A persistent memory type is defined for the existing users of ZONE_DEVICE
    and a new device un-addressable type is added for the un-addressable
    memory. There is a clear separation between what is expected from each
    memory type, and existing users of ZONE_DEVICE are unaffected by the new
    requirements and the new use of the un-addressable type. All type
    specific code paths are protected with a test against the memory type.

    Because the memory is un-addressable we use a new special swap type for
    when a page is migrated to device memory (this reduces the maximum number
    of swap files).

    The two main additions beside the memory type to ZONE_DEVICE are two
    callbacks. The first one, page_free(), is called whenever the page
    refcount reaches 1 (which means the page is free, as a ZONE_DEVICE page
    never reaches a refcount of 0). This allows the device driver to manage
    its memory and the associated struct pages.

    The second callback, page_fault(), happens when there is a CPU access to
    an address that is backed by a device page (which is un-addressable by
    the CPU). This callback is responsible for migrating the page back to
    system main memory. The device driver cannot block migration back to
    system memory; HMM makes sure that such a page cannot be pinned into
    device memory.

    If the device is in some error condition and cannot migrate memory back,
    then a CPU page fault on device memory should end with SIGBUS.
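
    Conceptually, a driver providing un-addressable device memory hooks the
    two callbacks described above; the struct and callback signatures below
    are purely illustrative (not the actual HMM API of that kernel):

        /* hypothetical driver-side hooks mirroring the description above */
        struct example_devmem_ops {
                /* invoked when a ZONE_DEVICE page refcount drops to 1 (page is free) */
                void (*page_free)(struct page *page, void *driver_private);

                /* invoked on a CPU fault on un-addressable device memory; must
                 * migrate the page back to system memory or the fault ends
                 * with SIGBUS */
                int (*page_fault)(struct vm_area_struct *vma, unsigned long addr,
                                  struct page *device_page, void *driver_private);
        };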

    [arnd@arndb.de: fix warning]
    Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Arnd Bergmann
    Acked-by: Dan Williams
    Cc: Ross Zwisler
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • This patch enables thp migration for memory hotremove.

    Link: http://lkml.kernel.org/r/20170717193955.20207-11-zi.yan@sent.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Zi Yan
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

07 Sep, 2017

5 commits

  • zonelists_mutex was introduced by commit 4eaf3f64397c ("mem-hotplug: fix
    potential race while building zonelist for new populated zone") to
    protect zonelist building from races. This is no longer needed though
    because both memory online and offline are fully serialized. New users
    have grown since then.

    Notably setup_per_zone_wmarks wants to prevent races between memory
    hotplug, khugepaged setup and manual min_free_kbytes updates via sysctl
    (see cfd3da1e49bb ("mm: Serialize access to min_free_kbytes")). Let's
    add a private lock for that purpose. This will not prevent us from
    seeing a halfway-through memory hotplug operation but that shouldn't be
    a big deal because memory hotplug will update the watermarks explicitly,
    so we will eventually get the full picture. The lock just makes sure we
    won't race when updating the watermarks, which could lead to weird
    results.

    Also __build_all_zonelists manipulates global data, so add a private
    lock for it as well. This doesn't seem to be necessary today but it is
    more robust to have a lock there.

    While we are at it make sure we document that memory online/offline
    depends on a full serialization either via mem_hotplug_begin() or
    device_lock.
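
    The "private lock" idea for the watermark update boils down to something
    like this sketch (the helper split into __setup_per_zone_wmarks() and the
    lock type are illustrative rather than the exact upstream diff):

        /* serialize watermark updates with a lock private to this path */
        static DEFINE_SPINLOCK(wmark_lock);

        void setup_per_zone_wmarks(void)
        {
                spin_lock(&wmark_lock);
                __setup_per_zone_wmarks();      /* does the actual per-zone update */
                spin_unlock(&wmark_lock);
        }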

    Link: http://lkml.kernel.org/r/20170721143915.14161-9-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Haicheng Li
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    try_online_node calls hotadd_new_pgdat which already calls
    build_all_zonelists, so the additional call is redundant. Even though
    hotadd_new_pgdat will only initialize the zonelists of the new node,
    this is the right thing to do because such a node doesn't have any
    memory, so other zonelists would ignore all the zones from this node
    anyway.

    Link: http://lkml.kernel.org/r/20170721143915.14161-6-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Toshi Kani
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • build_all_zonelists gets a zone parameter to initialize zone's pagesets.
    There is only a single user which gives a non-NULL zone parameter and
    that one doesn't really need the rest of the build_all_zonelists (see
    commit 6dcd73d7011b ("memory-hotplug: allocate zone's pcp before
    onlining pages")).

    Therefore remove setup_zone_pageset from build_all_zonelists and call it
    from its only user directly. This also removes a pointless zonelists
    rebuild, which is always good.

    Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Toshi Kani
    Cc: Wen Congyang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Historically we have enforced that any kernel zone (e.g ZONE_NORMAL) has
    to precede the Movable zone in the physical memory range. The purpose
    of the movable zone is, however, not bound to any physical memory
    restriction. It merely defines a class of migratable and reclaimable
    memory.

    There are users (e.g. CMA) who might want to reserve specific physical
    memory ranges for their own purpose. Moreover our pfn walkers have to
    be prepared for zones overlapping in the physical range already because
    we do support interleaving NUMA nodes and therefore zones can interleave
    as well. This means we can allow each memory block to be associated
    with a different zone.

    Loosen the current onlining semantic and allow an explicit onlining type
    on any memblock. That means that online_{kernel,movable} will be allowed
    regardless of the physical address of the memblock, as long as it is
    offline of course. This might result in the movable zone overlapping
    with other kernel zones. Default onlining then becomes a bit tricky but
    still sensible. echo online > memoryXY/state will online the given
    block to

    1) the default zone if the given range is outside of any zone
    2) the enclosing zone if such a zone doesn't interleave with
       any other zone
    3) the default zone if more zones interleave for this range

    where the default zone is the movable zone only if movable_node is
    enabled, otherwise it is a kernel zone.

    Here is an example of the semantics (movable_node is not present but it
    works in an analogous way). We start with the following memblocks, all
    of them offline:

    memory34/valid_zones:Normal Movable
    memory35/valid_zones:Normal Movable
    memory36/valid_zones:Normal Movable
    memory37/valid_zones:Normal Movable
    memory38/valid_zones:Normal Movable
    memory39/valid_zones:Normal Movable
    memory40/valid_zones:Normal Movable
    memory41/valid_zones:Normal Movable

    Now, we online block 34 in default mode and block 37 as movable

    root@test1:/sys/devices/system/node/node1# echo online > memory34/state
    root@test1:/sys/devices/system/node/node1# echo online_movable > memory37/state
    memory34/valid_zones:Normal
    memory35/valid_zones:Normal Movable
    memory36/valid_zones:Normal Movable
    memory37/valid_zones:Movable
    memory38/valid_zones:Normal Movable
    memory39/valid_zones:Normal Movable
    memory40/valid_zones:Normal Movable
    memory41/valid_zones:Normal Movable

    As we can see, all other blocks can still be onlined into both the
    Normal and Movable zones, and Normal is the default because the Movable
    zone spans only block 37 now.

    root@test1:/sys/devices/system/node/node1# echo online_movable > memory41/state
    memory34/valid_zones:Normal
    memory35/valid_zones:Normal Movable
    memory36/valid_zones:Normal Movable
    memory37/valid_zones:Movable
    memory38/valid_zones:Movable Normal
    memory39/valid_zones:Movable Normal
    memory40/valid_zones:Movable Normal
    memory41/valid_zones:Movable

    Now the default zone for blocks 37-41 has changed because movable zone
    spans that range.

    root@test1:/sys/devices/system/node/node1# echo online_kernel > memory39/state
    memory34/valid_zones:Normal
    memory35/valid_zones:Normal Movable
    memory36/valid_zones:Normal Movable
    memory37/valid_zones:Movable
    memory38/valid_zones:Normal Movable
    memory39/valid_zones:Normal
    memory40/valid_zones:Movable Normal
    memory41/valid_zones:Movable

    Note that the block 39 now belongs to the zone Normal and so block38
    falls into Normal by default as well.

    For completeness:

    root@test1:/sys/devices/system/node/node1# for i in memory[34]?
    do
    echo online > $i/state 2>/dev/null
    done

    memory34/valid_zones:Normal
    memory35/valid_zones:Normal
    memory36/valid_zones:Normal
    memory37/valid_zones:Movable
    memory38/valid_zones:Normal
    memory39/valid_zones:Normal
    memory40/valid_zones:Movable
    memory41/valid_zones:Movable

    Implementation-wise the change is quite straightforward. We can get rid
    of allow_online_pfn_range altogether. online_pages allows only offline
    nodes already. The original default_zone_for_pfn becomes
    default_kernel_zone_for_pfn. The new default_zone_for_pfn implements the
    above semantics. zone_for_pfn_range is slightly reorganized to implement
    the kernel and movable online types explicitly, and MMOP_ONLINE_KEEP
    becomes a catch-all default behavior.
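
    A sketch of the new default_zone_for_pfn decision (the helper names
    follow the ones mentioned above; details of the upstream implementation
    may differ):

        static struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
                        unsigned long nr_pages)
        {
                struct zone *kernel_zone = default_kernel_zone_for_pfn(nid,
                                start_pfn, nr_pages);
                struct zone *movable_zone = &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
                bool in_kernel = zone_intersects(kernel_zone, start_pfn, nr_pages);
                bool in_movable = zone_intersects(movable_zone, start_pfn, nr_pages);

                /* exactly one zone encloses the range: reuse it (rule 2) */
                if (in_kernel ^ in_movable)
                        return in_kernel ? kernel_zone : movable_zone;

                /*
                 * No zone or several zones intersect the range (rules 1 and 3):
                 * fall back to the default zone, which is movable only when
                 * movable_node is enabled.
                 */
                return movable_node_enabled ? movable_zone : kernel_zone;
        }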

    Link: http://lkml.kernel.org/r/20170714121233.16861-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Acked-by: Reza Arbab
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Yasuaki Ishimatsu
    Cc: Xishi Qiu
    Cc: Kani Toshimitsu
    Cc: Daniel Kiper
    Cc: Igor Mammedov
    Cc: Vitaly Kuznetsov
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Prior to commit f1dd2cd13c4b ("mm, memory_hotplug: do not associate
    hotadded memory to zones until online") we used to allow to change the
    valid zone types of a memory block if it is adjacent to a different zone
    type.

    This fact was reflected in memoryNN/valid_zones by the ordering of
    printed zones. The first one was default (echo online > memoryNN/state)
    and the other one could be onlined explicitly by online_{movable,kernel}.

    This behavior was removed by the said patch and as such the ordering was
    not all that important. In most cases a kernel zone would be default
    anyway. The only exception is movable_node handled by "mm,
    memory_hotplug: support movable_node for hotpluggable nodes".

    Let's reintroduce this behavior again because a later patch will remove
    the zone overlap restriction and so users will be allowed to online a
    kernel resp. movable block regardless of its placement. The original
    behavior will then become significant again because it would be
    non-trivial for users to see what the default zone to online into is.

    The implementation is really simple. Pull the zone selection out of
    move_pfn_range into a zone_for_pfn_range helper and use it in
    show_valid_zones to display the zone for default onlining, and then both
    kernel and movable if they are allowed. The default online zone is not
    duplicated.

    Link: http://lkml.kernel.org/r/20170714121233.16861-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Reza Arbab
    Cc: Yasuaki Ishimatsu
    Cc: Xishi Qiu
    Cc: Kani Toshimitsu
    Cc: Daniel Kiper
    Cc: Igor Mammedov
    Cc: Vitaly Kuznetsov
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

11 Jul, 2017

7 commits

  • Andrey reported a potential deadlock with the memory hotplug lock and
    the cpu hotplug lock.

    The reason is that memory hotplug takes the memory hotplug lock and then
    calls stop_machine() which calls get_online_cpus(). That's the reverse
    lock order to get_online_cpus(); get_online_mems(); in mm/slab_common.c.

    The problem has been there forever. The reason why this was never
    reported is that the cpu hotplug locking had this homebrewed recursive
    reader writer semaphore construct which, due to the recursion, evaded
    full lockdep coverage. The memory hotplug code copied that construct
    verbatim and therefore has similar issues.

    Three steps to fix this:

    1) Convert the memory hotplug locking to a per cpu rwsem so the
    potential issues get reported properly by lockdep.

    2) Lock the online cpus in mem_hotplug_begin() before taking the memory
    hotplug rwsem and use stop_machine_cpuslocked() in the page_alloc
    code to avoid recursive locking.

    3) The cpu hotplug locking in #2 causes a recursive locking of the cpu
    hotplug lock via __offline_pages() -> lru_add_drain_all(). Solve this
    by invoking lru_add_drain_all_cpuslocked() instead.
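
    Steps 1 and 2 amount to roughly the following in the memory hotplug code
    (a sketch, not the exact upstream diff):

        DEFINE_STATIC_PERCPU_RWSEM(mem_hotplug_lock);

        void mem_hotplug_begin(void)
        {
                cpus_read_lock();                       /* cpu hotplug lock first ... */
                percpu_down_write(&mem_hotplug_lock);   /* ... then the mem hotplug rwsem */
        }

        void mem_hotplug_done(void)
        {
                percpu_up_write(&mem_hotplug_lock);
                cpus_read_unlock();
        }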

    Link: http://lkml.kernel.org/r/20170704093421.506836322@linutronix.de
    Reported-by: Andrey Ryabinin
    Signed-off-by: Thomas Gleixner
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Vladimir Davydov
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
    __remove_zone() sets up zone_type, but never uses it for anything.
    This does not cause a warning, due to the (necessary) use of
    -Wno-unused-but-set-variable. However, it's noise, so just delete it.

    Link: http://lkml.kernel.org/r/20170624043421.24465-2-jhubbard@nvidia.com
    Signed-off-by: John Hubbard
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • Commit 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest
    neighbor node when mem-offline") has duplicated a large part of
    alloc_migrate_target with some hotplug specific special casing.

    To be more precise it tried to enforce the allocation from a different
    node than the original page. As a result the two functions diverged in
    their shared logic, e.g. the hugetlb allocation strategy.

    Let's unify the two and express different NUMA requirements by the given
    nodemask. new_node_page will simply exclude the node it doesn't care
    about and alloc_migrate_target will use all the available nodes.
    alloc_migrate_target will then learn to migrate hugetlb pages more
    sanely and use preallocated pool when possible.

    Please note that alloc_migrate_target used to call alloc_page resp.
    alloc_pages_current and thus used the memory policy of the current
    context, which is quite strange when we consider that it is used in the
    context of alloc_contig_range, which just tries to migrate pages which
    stand in the way.

    Link: http://lkml.kernel.org/r/20170608074553.22152-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Naoya Horiguchi
    Cc: Xishi Qiu
    Cc: zhong jiang
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    new_node_page will try to use the origin's next NUMA node as the
    migration destination for hugetlb pages. If such a node doesn't have
    any preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol
    to allocate a surplus page instead. This is quite suboptimal for any
    configuration where hugetlb pages are not distributed to all NUMA nodes
    evenly. Say we have a hotplugable node 4 and the spare hugetlb pages are
    on node 0:

    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000
    /sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0

    Now we consume the whole pool on node 4 and try to offline this node.
    All the allocated pages should be moved to node0 which has enough
    preallocated pages to hold them. With the current implementation
    offlining very likely fails because hugetlb allocations during runtime
    are much less reliable.

    Fix this by reusing the nodemask which excludes the migration source,
    first trying to find a node which has a page in the preallocated pool,
    and falling back to __alloc_buddy_huge_page_no_mpol only when the whole
    pool is consumed.

    [akpm@linux-foundation.org: remove bogus arg from alloc_huge_page_nodemask() stub]
    Link: http://lkml.kernel.org/r/20170608074553.22152-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Naoya Horiguchi
    Cc: Xishi Qiu
    Cc: zhong jiang
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    new_node_page tries to allocate the target page on a different NUMA node
    than the source page. This makes sense in most cases during hotplug
    because we are likely to offline the whole numa node. But there are
    cases where there are no other nodes to fall back to (e.g. when
    offlining parts of the only existing node) and we have to fall back to
    allocating from the source node. The current code does that but it can
    be simplified by checking the nmask and updating it before we even try
    to allocate, rather than special casing it.

    This patch shouldn't introduce any functional change.
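
    The nodemask handling described above boils down to roughly this
    fragment inside new_node_page() (the actual allocation call is elided;
    treat it as a sketch rather than the exact upstream code):

        int nid = page_to_nid(page);
        nodemask_t nmask = node_states[N_MEMORY];

        /* prefer any node with memory other than the source node ... */
        node_clear(nid, nmask);
        /* ... unless the source node is the only one left */
        if (nodes_empty(nmask))
                node_set(nid, nmask);

        /* the allocation below then simply honours nmask */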

    Link: http://lkml.kernel.org/r/20170608074553.22152-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Naoya Horiguchi
    Cc: Xishi Qiu
    Cc: zhong jiang
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    The movable_node kernel parameter allows hotpluggable NUMA nodes to put
    all of their hotplugable memory into the movable zone, which allows more
    or less reliable memory hotremove. At least this is the case for the
    NUMA nodes present during boot (see find_zone_movable_pfns_for_nodes).

    This is not the case for the memory hotplug, though.

    echo online > /sys/devices/system/memory/memoryXYZ/state

    will default to a kernel zone (usually ZONE_NORMAL) unless the
    particular memblock is already in the movable zone range which is not
    the case normally when onlining the memory from the udev rule context
    for a freshly hotadded NUMA node. The only option currently is to have
    a special udev rule to echo online_movable to all memblocks belonging to
    such a node which is rather clumsy. Not to mention this is inconsistent
    as well because what ended up in the movable zone during the boot will
    end up in a kernel zone after hotremove & hotadd without special care.

    It would be nice to reuse memblock_is_hotpluggable but the runtime
    hotplug doesn't have that information available because the boot and
    hotplug paths are not shared and it would be really non trivial to make
    them use the same code path because the runtime hotplug doesn't play
    with the memblock allocator at all.

    Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
    movable_node is enabled and the range doesn't overlap with the existing
    normal zone. This should provide a reasonable default onlining
    strategy.

    Strictly speaking the semantic is not identical with the boot time
    initialization because find_zone_movable_pfns_for_nodes covers only the
    hotplugable range as described by the BIOS/FW. From my experience this
    is usually a full node though (except for Node0 which is special and
    never goes away completely). If this turns out to be a problem in the
    real life we can tweak the code to store hotplug flag into memblocks but
    let's keep this simple now.

    Link: http://lkml.kernel.org/r/20170612111227.GI7476@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Reza Arbab
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Yasuaki Ishimatsu
    Cc: Kani Toshimitsu
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Cc: Daniel Kiper
    Cc: Igor Mammedov
    Cc: Vitaly Kuznetsov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The NULL check at line 1226: if (!pgdat), implies that pointer pgdat
    might be NULL.

    rollback_node_hotadd() dereferences this pointer. Add a NULL check to
    avoid a potential NULL pointer dereference.

    Addresses-Coverity-ID: 1369133
    Link: http://lkml.kernel.org/r/20170530212436.GA6195@embeddedgus
    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gustavo A. R. Silva
     

07 Jul, 2017

15 commits

    movable_node_is_enabled is defined in memblock proper while it is
    initialized from memory hotplug proper. This is quite messy and it
    creates a dependency between the two, so move movable_node along with
    the helper functions to memory_hotplug.

    To make it more entertaining the kernel parameter is ignored unless
    CONFIG_HAVE_MEMBLOCK_NODE_MAP=y because we do not have the node
    information for each memblock otherwise. So let's warn when the option
    is disabled.

    Link: http://lkml.kernel.org/r/20170529114141.536-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Reza Arbab
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Jerome Glisse
    Cc: Yasuaki Ishimatsu
    Cc: Xishi Qiu
    Cc: Kani Toshimitsu
    Cc: Chen Yucong
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Cc: Daniel Kiper
    Cc: Igor Mammedov
    Cc: Vitaly Kuznetsov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    Commit 20b2f52b73fe ("numa: add CONFIG_MOVABLE_NODE for
    movable-dedicated node") has introduced CONFIG_MOVABLE_NODE without a
    good explanation of why it is actually useful.

    It makes a lot of sense to make movable node semantic opt in but we
    already have that because the feature has to be explicitly enabled on
    the kernel command line. A config option on top only makes the
    configuration space larger without a good reason. It also adds an
    additional ifdefery that pollutes the code.

    Just drop the config option and make it de-facto always enabled. This
    shouldn't introduce any change to the semantic.

    Link: http://lkml.kernel.org/r/20170529114141.536-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Reza Arbab
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Jerome Glisse
    Cc: Yasuaki Ishimatsu
    Cc: Xishi Qiu
    Cc: Kani Toshimitsu
    Cc: Chen Yucong
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Cc: Daniel Kiper
    Cc: Igor Mammedov
    Cc: Vitaly Kuznetsov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "remove CONFIG_MOVABLE_NODE".

    I am continuing to clean up the memory hotplug code and
    CONFIG_MOVABLE_NODE seems dubious at best. The following two patches
    simply remove the flag and make it de-facto always enabled.

    The current semantic of the config option is twofold: 1) it automatically
    binds hotplugable nodes to have memory in zone_movable by default when
    movable_node is enabled, and 2) it forbids memory hotplug from onlining
    all the memory as movable when !CONFIG_MOVABLE_NODE.

    The latter restriction is quite dubious because there is no clear cut of
    how much normal memory we need for reasonable system operation. A
    single memory block which is sufficient to allow further movable onlines
    is far from sufficient (e.g. a node with >2GB and 128MB memblocks will
    fill up this zone with struct pages leaving nothing for other
    allocations). Removing the config option not only reduces the
    configuration space, it also removes quite some code.

    The semantic of the movable_node command line parameter is preserved.

    The first patch removes the restriction mentioned above and the second
    one simply removes all the CONFIG_MOVABLE_NODE related stuff. The last
    patch moves movable_node flag handling to memory_hotplug proper where it
    belongs.

    [1] http://lkml.kernel.org/r/20170524122411.25212-1-mhocko@kernel.org

    This patch (of 3):

    Commit 74d42d8fe146 ("memory_hotplug: ensure every online node has
    NORMAL memory") has introduced a restriction that every numa node has to
    have at least some memory in !movable zones before a first movable
    memory can be onlined if !CONFIG_MOVABLE_NODE.

    Likewise can_offline_normal checks the amount of normal memory in
    !movable zones and disallows offlining memory if there is no normal
    memory left, with the justification that "memory-management acts bad
    when we have nodes which is online but don't have any normal memory".

    While it is true that not having _any_ memory for kernel allocations on
    a NUMA node is far from great, and such a node would be quite suboptimal
    because all kernel allocations would have to fall back to another NUMA
    node, there is no reason to disallow such a configuration in principle.

    Besides that, there is not really a big difference between having one
    memblock for ZONE_NORMAL available or none. With 128MB sized memblocks
    the system might thrash on kernel allocation requests anyway. It is
    really hard to draw a line on how much normal memory is really
    sufficient, so we have to rely on the administrator to configure the
    system sanely. Therefore drop the artificial restriction and remove
    can_offline_normal and can_online_high_movable altogether.

    Link: http://lkml.kernel.org/r/20170529114141.536-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Reza Arbab
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Jerome Glisse
    Cc: Yasuaki Ishimatsu
    Cc: Xishi Qiu
    Cc: Kani Toshimitsu
    Cc: Chen Yucong
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Cc: Daniel Kiper
    Cc: Igor Mammedov
    Cc: Vitaly Kuznetsov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The main allocator function __alloc_pages_nodemask() takes a zonelist
    pointer as one of its parameters. All of its callers directly or
    indirectly obtain the zonelist via node_zonelist() using a preferred
    node id and gfp_mask. We can make the code a bit simpler by doing the
    zonelist lookup in __alloc_pages_nodemask(), passing it a preferred node
    id instead (gfp_mask is already another parameter).

    There are some code size benefits thanks to removal of inlined
    node_zonelist():

    bloat-o-meter add/remove: 2/2 grow/shrink: 4/36 up/down: 399/-1351 (-952)

    This will also make things simpler if we proceed with converting cpusets
    to zonelists.
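
    Roughly, the entry point changes from taking a zonelist to taking a
    preferred node id and doing the lookup itself (a sketch of the before
    and after prototypes; details may differ):

        /* before: every caller had to do node_zonelist(nid, gfp_mask) itself */
        struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                        struct zonelist *zonelist, nodemask_t *nodemask);

        /* after: pass the preferred node id, the zonelist lookup happens inside */
        struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                        int preferred_nid, nodemask_t *nodemask);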

    Link: http://lkml.kernel.org/r/20170517081140.30654-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Christoph Lameter
    Acked-by: Michal Hocko
    Cc: Dimitri Sivanich
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Anshuman Khandual
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • zone_for_memory doesn't have any user anymore as well as the whole zone
    shifting infrastructure so drop them all.

    This shouldn't introduce any functional changes.

    Link: http://lkml.kernel.org/r/20170515085827.16474-15-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Dan Williams
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    Tobias has reported the following section mismatches introduced by "mm,
    memory_hotplug: do not associate hotadded memory to zones until online".

    WARNING: mm/built-in.o(.text+0x5a1c2): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:memmap_init_zone()
    The function move_pfn_range_to_zone() references
    the function __meminit memmap_init_zone().
    This is often because move_pfn_range_to_zone lacks a __meminit
    annotation or the annotation of memmap_init_zone is wrong.

    WARNING: mm/built-in.o(.text+0x5a25b): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:init_currently_empty_zone()
    The function move_pfn_range_to_zone() references
    the function __meminit init_currently_empty_zone().
    This is often because move_pfn_range_to_zone lacks a __meminit
    annotation or the annotation of init_currently_empty_zone is wrong.

    WARNING: vmlinux.o(.text+0x188aa2): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:memmap_init_zone()
    The function move_pfn_range_to_zone() references
    the function __meminit memmap_init_zone().
    This is often because move_pfn_range_to_zone lacks a __meminit
    annotation or the annotation of memmap_init_zone is wrong.

    WARNING: vmlinux.o(.text+0x188b3b): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:init_currently_empty_zone()
    The function move_pfn_range_to_zone() references
    the function __meminit init_currently_empty_zone().
    This is often because move_pfn_range_to_zone lacks a __meminit
    annotation or the annotation of init_currently_empty_zone is wrong.

    Both memmap_init_zone and init_currently_empty_zone are marked __meminit
    but move_pfn_range_to_zone is used outside of __meminit sections (e.g.
    devm_memremap_pages) so we have to hide it from the checker with a __ref
    annotation.
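
    The fix is essentially just adding the marker to the definition, along
    the lines of the sketch below (argument list abbreviated to the one used
    at the time):

        /* __ref: the references into .meminit.text are intentional */
        void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
                        unsigned long nr_pages)
        {
                /* body unchanged */
        }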

    Link: http://lkml.kernel.org/r/20170515085827.16474-14-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tobias Regnery
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Dan Williams
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Vlastimil Babka
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    arch_add_memory gets a for_device argument which then controls whether
    we want to create memblocks for the created memory sections. Simplify
    the logic by saying whether we want memblocks directly rather than going
    through a pointless negation. This also makes the api easier to
    understand because it is clear what we want, rather than a
    nothing-telling for_device which can mean anything.

    This shouldn't introduce any functional change.

    Link: http://lkml.kernel.org/r/20170515085827.16474-13-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Tested-by: Dan Williams
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Heiko Carstens has noticed that he can generate overlapping zones for
    ZONE_DMA and ZONE_NORMAL:

    DMA [mem 0x0000000000000000-0x000000007fffffff]
    Normal [mem 0x0000000080000000-0x000000017fffffff]

    $ cat /sys/devices/system/memory/block_size_bytes
    10000000
    $ cat /sys/devices/system/memory/memory5/valid_zones
    DMA
    $ echo 0 > /sys/devices/system/memory/memory5/online
    $ cat /sys/devices/system/memory/memory5/valid_zones
    Normal
    $ echo 1 > /sys/devices/system/memory/memory5/online
    Normal

    $ cat /proc/zoneinfo
    Node 0, zone DMA
    spanned 524288
    Reported-by: Heiko Carstens
    Tested-by: Heiko Carstens
    Acked-by: Vlastimil Babka
    Cc: Dan Williams
    Cc: Reza Arbab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Heiko Carstens has noticed that the MMOP_ONLINE_KEEP is broken currently

    $ grep . memory3?/valid_zones
    memory34/valid_zones:Normal Movable
    memory35/valid_zones:Normal Movable
    memory36/valid_zones:Normal Movable
    memory37/valid_zones:Normal Movable

    $ echo online_movable > memory34/state
    $ grep . memory3?/valid_zones
    memory34/valid_zones:Movable
    memory35/valid_zones:Movable
    memory36/valid_zones:Movable
    memory37/valid_zones:Movable

    $ echo online > memory36/state
    $ grep . memory3?/valid_zones
    memory34/valid_zones:Movable
    memory36/valid_zones:Normal
    memory37/valid_zones:Movable

    so we have effectively punched a hole into the movable zone.

    The problem is that the move_pfn_range() check for MMOP_ONLINE_KEEP is
    wrong. It only checks whether the given range is already part of the
    movable zone, which is not the case here as only memory34 is in the
    zone. Fix this by using allow_online_pfn_range(..., MMOP_ONLINE_KERNEL);
    if that is false then we can be sure that movable onlining is the right
    thing to do.

    Fixes: "mm, memory_hotplug: do not associate hotadded memory to zones until online"
    Link: http://lkml.kernel.org/r/20170601083746.4924-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Heiko Carstens
    Tested-by: Heiko Carstens
    Acked-by: Vlastimil Babka
    Cc: Dan Williams
    Cc: Reza Arbab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The current memory hotplug implementation relies on having all the
    struct pages associate with a zone/node during the physical hotplug
    phase (arch_add_memory->__add_pages->__add_section->__add_zone). In the
    vast majority of cases this means that they are added to ZONE_NORMAL.
    This has been so since 9d99aaa31f59 ("[PATCH] x86_64: Support memory
    hotadd without sparsemem") and it wasn't a big deal back then because
    movable onlining didn't exist yet.

    Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable
    onlining 511c2aba8f07 ("mm, memory-hotplug: dynamic configure movable
    memory and portion memory") and then things got more complicated.
    Rather than reconsidering the zone association which was no longer
    needed (because the memory hotplug already depended on SPARSEMEM) a
    convoluted semantic of zone shifting has been developed. Only the
    currently last memblock or the one adjacent to the zone_movable can be
    onlined movable. This essentially means that the online type changes as
    the new memblocks are added.

    Let's simulate memory hot online manually
    $ echo 0x100000000 > /sys/devices/system/memory/probe
    $ grep . /sys/devices/system/memory/memory32/valid_zones
    Normal Movable

    $ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
    $ grep . /sys/devices/system/memory/memory3?/valid_zones
    /sys/devices/system/memory/memory32/valid_zones:Normal
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable

    $ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
    $ grep . /sys/devices/system/memory/memory3?/valid_zones
    /sys/devices/system/memory/memory32/valid_zones:Normal
    /sys/devices/system/memory/memory33/valid_zones:Normal
    /sys/devices/system/memory/memory34/valid_zones:Normal Movable

    $ echo online_movable > /sys/devices/system/memory/memory34/state
    $ grep . /sys/devices/system/memory/memory3?/valid_zones
    /sys/devices/system/memory/memory32/valid_zones:Normal
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Movable Normal

    This is an awkward semantic because a udev event is sent as soon as the
    block is onlined and a udev handler might want to online it based on
    some policy (e.g. association with a node), but it will inherently race
    with new blocks showing up.

    This patch changes the physical online phase to not associate pages with
    any zone at all. All the pages are just marked reserved and wait for
    the onlining phase to be associated with the zone as per the online
    request. There are only two requirements

    - existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap

    - ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses

    the latter one is not an inherent requirement and can be changed in the
    future. It preserves the current behavior and made the code slightly
    simpler. This is subject to change in future.

    This means that the same physical online steps as above will lead to the
    following state: Normal Movable

    /sys/devices/system/memory/memory32/valid_zones:Normal Movable
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable

    /sys/devices/system/memory/memory32/valid_zones:Normal Movable
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Normal Movable

    /sys/devices/system/memory/memory32/valid_zones:Normal Movable
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Movable

    Implementation:
    The current move_pfn_range is reimplemented to check the above
    requirements (allow_online_pfn_range) and then updates the respective
    zone (move_pfn_range_to_zone), the pgdat and links all the pages in the
    pfn range with the zone/node. __add_pages is updated to not require the
    zone and only initializes sections in the range. This allowed to
    simplify the arch_add_memory code (s390 could get rid of quite some of
    code).

    devm_memremap_pages is the only user of arch_add_memory which relies on
    the zone association, because it hooks into the memory hotplug only
    half way. It uses it to associate the new memory with ZONE_DEVICE but
    doesn't allow it to be {on,off}lined via sysfs. This means that this
    particular code path has to call move_pfn_range_to_zone explicitly.

    The original zone shifting code is kept in place and will be removed in
    the follow up patch for an easier review.

    Please note that this patch also changes the original behavior, where
    offlining a memory block adjacent to another zone (Normal vs. Movable)
    used to allow changing its movable type. This will be handled later.

    [richard.weiyang@gmail.com: simplify zone_intersects()]
    Link: http://lkml.kernel.org/r/20170616092335.5177-1-richard.weiyang@gmail.com
    [richard.weiyang@gmail.com: remove duplicate call for set_page_links]
    Link: http://lkml.kernel.org/r/20170616092335.5177-2-richard.weiyang@gmail.com
    [akpm@linux-foundation.org: remove unused local `i']
    Link: http://lkml.kernel.org/r/20170515085827.16474-12-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Wei Yang
    Tested-by: Dan Williams
    Tested-by: Reza Arbab
    Acked-by: Heiko Carstens # For s390 bits
    Acked-by: Vlastimil Babka
    Cc: Martin Schwidefsky
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __pageblock_pfn_to_page has two users currently, set_zone_contiguous
    which checks whether the given zone contains holes and
    pageblock_pfn_to_page which then carefully returns a first valid page
    from the given pfn range for the given zone. This doesn't handle zones
    which are not fully populated though. Memory pageblocks can be offlined
    or might not have been onlined yet. In such a case the zone should be
    considered to have holes otherwise pfn walkers can touch and play with
    offline pages.

    Current callers of pageblock_pfn_to_page in compaction seem to work
    properly right now because they only isolate PageBuddy
    (isolate_freepages_block) or PageLRU resp. __PageMovable
    (isolate_migratepages_block) which will be always false for these pages.
    It would be safer to skip these pages altogether, though.

    In order to do this patch adds a new memory section state
    (SECTION_IS_ONLINE) which is set in memory_present (during boot time) or
    in online_pages_range during the memory hotplug. Similarly
    offline_mem_sections clears the bit and it is called when the memory
    range is offlined.

    A pfn_to_online_page helper is then added which checks the mem section
    and only returns a page if it is onlined already.

    Use the new helper in __pageblock_pfn_to_page and skip the whole page
    block in such a case.
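
    The helper's logic is roughly the following (the upstream version is a
    macro; rendering it as an inline function here is just for readability
    and the exact checks may differ slightly):

        /* return the struct page only when its memory section is online */
        static inline struct page *pfn_to_online_page(unsigned long pfn)
        {
                unsigned long nr = pfn_to_section_nr(pfn);

                if (nr < NR_MEM_SECTIONS && online_section_nr(nr))
                        return pfn_to_page(pfn);
                return NULL;
        }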

    [mhocko@suse.com: check valid section number in pfn_to_online_page (Vlastimil),
    mark sections online after all struct pages are initialized in
    online_pages_range (Vlastimil)]
    Link: http://lkml.kernel.org/r/20170518164210.GD18333@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170515085827.16474-8-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Dan Williams
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Memory hotplug (add_memory_resource) has to reinitialize node
    infrastructure if the node is offline (one which went through the
    complete add_memory(); remove_memory() cycle). That involves node
    registration to the kobj infrastructure (register_node), the proper
    association with cpus (register_cpu_under_node) and finally creation of
    nodememblock symlinks (link_mem_sections).

    The last part requires knowing node_start_pfn and node_spanned_pages,
    which we currently have, but a later patch will postpone this
    initialization to the onlining phase which happens later. In fact we do
    not need to rely on the early pgdat initialization even now because the
    hot added pfn range is already known.

    Split register_one_node into core which does all the common work for the
    boot time NUMA initialization and the hotplug (__register_one_node).
    register_one_node keeps the full initialization while hotplug calls
    __register_one_node and manually calls link_mem_sections for the proper
    range.

    This shouldn't introduce any functional change.

    Link: http://lkml.kernel.org/r/20170515085827.16474-6-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Dan Williams
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Device memory hotplug hooks into regular memory hotplug only half way.
    It needs memory sections to track struct pages but there is no
    need/desire to associate those sections with memory blocks and export
    them to the userspace via sysfs because they cannot be onlined anyway.

    This is currently expressed by for_device argument to arch_add_memory
    which then makes sure to associate the given memory range with
    ZONE_DEVICE. register_new_memory then relies on is_zone_device_section
    to distinguish special memory hotplug from the regular one. While this
    works now, later patches in this series want to move __add_zone outside
    of arch_add_memory path so we have to come up with something else.

    Pass want_memblock down the __add_pages path and use it to control
    whether the section->memblock association should be done.
    arch_add_memory then just trivially wants a memblock for everything but
    for_device hotplug.

    remove_memory_section doesn't need is_zone_device_section either. We
    can simply skip all the memblock specific cleanup if there is no
    memblock for the given section.

    This shouldn't introduce any functional change.

    Link: http://lkml.kernel.org/r/20170515085827.16474-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Tested-by: Dan Williams
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The primary purpose of this helper is to query the node state so use the
    node id directly. This is a preparatory patch for later changes.

    This shouldn't introduce any functional change.

    Link: http://lkml.kernel.org/r/20170515085827.16474-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Dan Williams
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "mm: make movable onlining suck less", v4.

    Movable onlining is a real hack with many downsides - mainly the
    reintroduction of the lowmem/highmem issues we used to have on 32b systems -
    but it is the only way to make memory hotremove more reliable, which
    is something that people are asking for.

    The current semantics of movable memory onlining are really cumbersome,
    however. The main reason for this is that the udev driven approach is
    basically unusable because udev races with the memory probing, while only
    the last memory block or the one adjacent to the existing zone_movable
    is allowed to be onlined movable. In short, the criterion for a
    successful online_movable changes under udev's feet. A reliable udev
    approach would require two phases, where the first successful
    movable online would have to check all the previous blocks and online
    them in descending order. This can hardly be considered sane.

    This patchset aims at making the onlining semantics more usable. First
    of all it allows onlining memory as movable as long as it doesn't clash
    with the existing ZONE_NORMAL. That means that ZONE_NORMAL and
    ZONE_MOVABLE cannot overlap. Currently I preserve the original ordering
    semantics so the normal zone always precedes the movable zone, but I have
    plans to remove this restriction in future because it is not really necessary.

    First 3 patches are cleanups which should be ready to be merged right
    away (unless I have missed something subtle of course).

    Patch 4 deals with ZONE_DEVICE dependencies down the __add_pages path.

    Patch 5 deals with implicit assumptions of register_one_node on pgdat
    initialization.

    Patches 6-10 deal with offline holes in the zone for pfn walkers. I
    hope I got all of them right but people familiar with compaction should
    double check this.

    Patch 11 is the core of the change. In order to make it easier to
    review I have tried to keep it as minimalistic as possible; the large
    code removal is moved to patch 14.

    Patch 12 is a trivial follow up cleanup. Patch 13 fixes sparse warnings
    and finally patch 14 removes the unused code.

    I have tested the patches in kvm:
    # qemu-system-x86_64 -enable-kvm -monitor pty -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G ...

    and then probed the additional memory by
    (qemu) object_add memory-backend-ram,id=mem1,size=1G
    (qemu) device_add pc-dimm,id=dimm1,memdev=mem1

    Then I have used this simple script to probe the memory block by hand
    # cat probe_memblock.sh
    #!/bin/sh

    BLOCK_NR=$1

    echo $((0x100000000+$BLOCK_NR*(128<<20))) > /sys/devices/system/memory/probe

    # for i in $(seq 10); do sh probe_memblock.sh $i; done
    # grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Normal Movable
    /sys/devices/system/memory/memory35/valid_zones:Normal Movable
    /sys/devices/system/memory/memory36/valid_zones:Normal Movable
    /sys/devices/system/memory/memory37/valid_zones:Normal Movable
    /sys/devices/system/memory/memory38/valid_zones:Normal Movable
    /sys/devices/system/memory/memory39/valid_zones:Normal Movable

    The main difference to the original implementation is that all new
    memblocks can be both online_kernel and online_movable initially because
    there is obviously no clash. For comparison, the original
    implementation would have

    /sys/devices/system/memory/memory33/valid_zones:Normal
    /sys/devices/system/memory/memory34/valid_zones:Normal
    /sys/devices/system/memory/memory35/valid_zones:Normal
    /sys/devices/system/memory/memory36/valid_zones:Normal
    /sys/devices/system/memory/memory37/valid_zones:Normal
    /sys/devices/system/memory/memory38/valid_zones:Normal
    /sys/devices/system/memory/memory39/valid_zones:Normal Movable

    Now
    # echo online_movable > /sys/devices/system/memory/memory34/state
    # grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Movable
    /sys/devices/system/memory/memory35/valid_zones:Movable
    /sys/devices/system/memory/memory36/valid_zones:Movable
    /sys/devices/system/memory/memory37/valid_zones:Movable
    /sys/devices/system/memory/memory38/valid_zones:Movable
    /sys/devices/system/memory/memory39/valid_zones:Movable

    Block 33 can still be onlined both as kernel and as movable while all
    the remaining blocks can only be onlined movable.

    /proc/zoneinfo says
    Node 0, zone Normal
    pages free 0
    min 0
    low 0
    high 0
    spanned 0
    present 0
    --
    Node 0, zone Movable
    pages free 32753
    min 85
    low 117
    high 149
    spanned 32768
    present 32768

    Probing a memblock at a lower address will result in a new memblock (32)
    which will still allow both Normal and Movable.

    # sh probe_memblock.sh 0
    # grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null
    /sys/devices/system/memory/memory32/valid_zones:Normal Movable
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Movable
    /sys/devices/system/memory/memory35/valid_zones:Movable

    and online_kernel will properly convert it to ZONE_NORMAL
    while 33 can still be onlined both ways.

    # echo online_kernel > /sys/devices/system/memory/memory32/state
    # grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null
    /sys/devices/system/memory/memory32/valid_zones:Normal
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Movable
    /sys/devices/system/memory/memory35/valid_zones:Movable

    /proc/zoneinfo now shows
    Node 0, zone Normal
    pages free 65441
    min 165
    low 230
    high 295
    spanned 65536
    present 65536
    --
    Node 0, zone Movable
    pages free 32740
    min 82
    low 114
    high 146
    spanned 32768
    present 32768

    so both zones have one memblock spanned and present.

    Onlining 39 should associate this block with the movable zone

    # echo online > /sys/devices/system/memory/memory39/state

    /proc/zoneinfo now shows
    Node 0, zone Normal
    pages free 32765
    min 80
    low 112
    high 144
    spanned 32768
    present 32768
    --
    Node 0, zone Movable
    pages free 65501
    min 160
    low 225
    high 290
    spanned 196608
    present 65536

    so we will have a movable zone which spans 6 memblocks, 2 present and 4
    representing a hole.

    Offlining both movable blocks will lead to a zone with no present
    pages, which I believe is the expected behavior.

    # echo offline > /sys/devices/system/memory/memory39/state
    # echo offline > /sys/devices/system/memory/memory34/state
    # grep -A6 "Movable\|Normal" /proc/zoneinfo
    Node 0, zone Normal
    pages free 32735
    min 90
    low 122
    high 154
    spanned 32768
    present 32768
    --
    Node 0, zone Movable
    pages free 0
    min 0
    low 0
    high 0
    spanned 196608
    present 0

    As a bonus we will get a nice cleanup in the memory hotplug codebase.

    This patch (of 16):

    init_currently_empty_zone doesn't have any error to return, yet it
    still returns an int and callers defensively try to handle a potential
    error. Remove this nonsense and simplify all callers.

    This patch shouldn't have any visible effect.
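
    The shape of the cleanup, shown as a generic userspace illustration rather
    than the kernel code itself: a function that can only succeed keeps
    returning int, so every caller carries a dead error branch; making it void
    lets callers drop that branch.

    /* cleanup_pattern.c - generic illustration of the int-to-void conversion
     * described above (not the kernel function itself). */
    #include <stdio.h>

    static int setup_thing_old(void)
    {
            /* no failure path exists, yet the signature promises one */
            return 0;
    }

    static void setup_thing(void)
    {
            /* same work, honest signature */
    }

    int main(void)
    {
            if (setup_thing_old())          /* dead error handling */
                    printf("cannot happen\n");
            setup_thing();                  /* nothing bogus to check */
            return 0;
    }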

    Link: http://lkml.kernel.org/r/20170515085827.16474-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: Balbir Singh
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

04 May, 2017

1 commit

  • kswapd is woken to reclaim a node based on a failed allocation request
    from any eligible zone. Once reclaiming in balance_pgdat(), it will
    continue reclaiming until there is an eligible zone available for the
    zone it was woken for. kswapd tracks what zone it was recently woken
    for in pgdat->kswapd_classzone_idx. If it has not been woken recently,
    this zone will be 0.

    However, the decision on whether to sleep is made on
    kswapd_classzone_idx, which is 0 without a recent wakeup request, and that
    classzone does not account for lowmem reserves. This allows kswapd to
    sleep when a small low zone such as ZONE_DMA is balanced for a GFP_DMA
    request even if a stream of allocations cannot use that zone. While
    kswapd may be woken again in the near future there are two
    consequences -- the pgdat bits that control congestion are cleared
    prematurely and direct reclaim is more likely as kswapd slept
    prematurely.

    This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an
    invalid index) when there have been no recent wakeups. If there are no
    wakeups, it'll decide whether to sleep based on the highest possible
    zone available (MAX_NR_ZONES - 1). It then becomes critical that the
    "pgdat balanced" decisions during reclaim and when deciding to sleep are
    the same. If there is a mismatch, kswapd can stay awake continually
    trying to balance tiny zones.
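
    The idea can be sketched in plain C; this is an illustration of the
    sentinel pattern only, not the kernel implementation, and MAX_NR_ZONES
    here is a made-up constant:

    #include <stdio.h>

    #define MAX_NR_ZONES 4  /* placeholder value for the illustration */

    static int effective_classzone_idx(int kswapd_classzone_idx)
    {
            /* no recent wakeup: balance against the highest possible zone */
            if (kswapd_classzone_idx == MAX_NR_ZONES)
                    return MAX_NR_ZONES - 1;
            return kswapd_classzone_idx;
    }

    int main(void)
    {
            printf("%d\n", effective_classzone_idx(MAX_NR_ZONES));  /* 3 */
            printf("%d\n", effective_classzone_idx(1));             /* 1 */
            return 0;
    }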

    simoop was used to evaluate it again. Two of the preparation patches
    regressed the workload so they are included as the second set of
    results. Otherwise this patch looks artificially excellent.

    4.11.0-rc1 4.11.0-rc1 4.11.0-rc1
    vanilla clear-v2 keepawake-v2
    Amean p50-Read 21670074.18 ( 0.00%) 19786774.76 ( 8.69%) 22668332.52 ( -4.61%)
    Amean p95-Read 25456267.64 ( 0.00%) 24101956.27 ( 5.32%) 26738688.00 ( -5.04%)
    Amean p99-Read 29369064.73 ( 0.00%) 27691872.71 ( 5.71%) 30991404.52 ( -5.52%)
    Amean p50-Write 1390.30 ( 0.00%) 1011.91 ( 27.22%) 924.91 ( 33.47%)
    Amean p95-Write 412901.57 ( 0.00%) 34874.98 ( 91.55%) 1362.62 ( 99.67%)
    Amean p99-Write 6668722.09 ( 0.00%) 575449.60 ( 91.37%) 16854.04 ( 99.75%)
    Amean p50-Allocation 78714.31 ( 0.00%) 84246.26 ( -7.03%) 74729.74 ( 5.06%)
    Amean p95-Allocation 175533.51 ( 0.00%) 400058.43 (-127.91%) 101609.74 ( 42.11%)
    Amean p99-Allocation 247003.02 ( 0.00%) 10905600.00 (-4315.17%) 125765.57 ( 49.08%)

    With this patch on top, write and allocation latencies are massively
    improved. The read latencies are slightly impaired but it's worth
    noting that this is mostly due to the IO scheduler and not directly
    related to reclaim. The vmstats are a bit of a mix but the relevant
    ones are as follows;

    4.10.0-rc7 4.10.0-rc7 4.10.0-rc7
    mmots-20170209 clear-v1r25 keepawake-v1r25
    Swap Ins 0 0 0
    Swap Outs 0 608 0
    Direct pages scanned 6910672 3132699 6357298
    Kswapd pages scanned 57036946 82488665 56986286
    Kswapd pages reclaimed 55993488 63474329 55939113
    Direct pages reclaimed 6905990 2964843 6352115
    Kswapd efficiency 98% 76% 98%
    Kswapd velocity 12494.375 17597.507 12488.065
    Direct efficiency 99% 94% 99%
    Direct velocity 1513.835 668.306 1393.148
    Page writes by reclaim 0.000 4410243.000 0.000
    Page writes file 0 4409635 0
    Page writes anon 0 608 0
    Page reclaim immediate 1036792 14175203 1042571

    4.11.0-rc1 4.11.0-rc1 4.11.0-rc1
    vanilla clear-v2 keepawake-v2
    Swap Ins 0 12 0
    Swap Outs 0 838 0
    Direct pages scanned 6579706 3237270 6256811
    Kswapd pages scanned 61853702 79961486 54837791
    Kswapd pages reclaimed 60768764 60755788 53849586
    Direct pages reclaimed 6579055 2987453 6256151
    Kswapd efficiency 98% 75% 98%
    Page writes by reclaim 0.000 4389496.000 0.000
    Page writes file 0 4388658 0
    Page writes anon 0 838 0
    Page reclaim immediate 1073573 14473009 982507

    Swap-outs are equivalent to baseline.

    Direct reclaim is reduced but not eliminated. It's worth noting that
    there are two periods of direct reclaim for this workload. The first is
    when the workload switches from preparing the files to the actual test
    itself; it's a lot of file IO followed by a lot of allocations that
    reclaim heavily for a brief window. While direct reclaim is lower with
    clear-v2, it is due to kswapd scanning aggressively and trying to
    reclaim the world, which is not the right thing to do. With the patches
    applied, there is still direct reclaim, but it is limited to the phase
    change from "creating work files" to starting multiple threads that
    allocate a lot of anonymous memory faster than kswapd can reclaim.

    Scanning/reclaim efficiency is restored by this patch.

    Page writes from reclaim context are back at 0 which is ideal.

    The number of pages immediately reclaimed after IO completes is slightly
    improved, but it is expected that this will vary.

    On UMA, there is almost no change so this is not expected to be a
    universal win.

    [mgorman@suse.de: fix ->kswapd_classzone_idx initialization]
    Link: http://lkml.kernel.org/r/20170406174538.5msrznj6nt6qpbx5@suse.de
    Link: http://lkml.kernel.org/r/20170309075657.25121-4-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Shantanu Goel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Mar, 2017

1 commit

  • Commit bfc8c90139eb ("mem-hotplug: implement get/put_online_mems")
    introduced new functions get/put_online_mems() and mem_hotplug_begin/end()
    in order to allow similar semantics for memory hotplug like for cpu
    hotplug.

    The corresponding functions for cpu hotplug are get/put_online_cpus()
    and cpu_hotplug_begin/done().

    The commit however failed to introduce functions that would serialize
    memory hotplug operations the way it is done for cpu hotplug with
    cpu_maps_update_begin/done().

    This basically leaves mem_hotplug.active_writer unprotected and allows
    concurrent writers to modify it, which may lead to problems as outlined
    by commit f931ab479dd2 ("mm: fix devm_memremap_pages crash, use
    mem_hotplug_{begin, done}").

    That commit was extended again with commit b5d24fda9c3d ("mm,
    devm_memremap_pages: hold device_hotplug lock over mem_hotplug_{begin,
    done}") which serializes memory hotplug operations for some call sites
    by using the device_hotplug lock.

    In addition with commit 3fc21924100b ("mm: validate device_hotplug is held
    for memory hotplug") a sanity check was added to mem_hotplug_begin() to
    verify that the device_hotplug lock is held.

    This in turn triggers the following warning on s390:

    WARNING: CPU: 6 PID: 1 at drivers/base/core.c:643 assert_held_device_hotplug+0x4a/0x58
    Call Trace:
    assert_held_device_hotplug+0x40/0x58)
    mem_hotplug_begin+0x34/0xc8
    add_memory_resource+0x7e/0x1f8
    add_memory+0xda/0x130
    add_memory_merged+0x15c/0x178
    sclp_detect_standby_memory+0x2ae/0x2f8
    do_one_initcall+0xa2/0x150
    kernel_init_freeable+0x228/0x2d8
    kernel_init+0x2a/0x140
    kernel_thread_starter+0x6/0xc

    One possible fix would be to add more lock_device_hotplug() and
    unlock_device_hotplug() calls around each call site of
    mem_hotplug_begin/end(). But that would give the device_hotplug lock
    additional semantics it should better not have (serializing memory
    hotplug operations).

    Instead add a new memory_add_remove_lock which has semantics similar
    to cpu_add_remove_lock for cpu hotplug.

    To hopefully keep things a bit simpler, the lock is taken and released
    within the mem_hotplug_begin/end() functions.
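
    The locking pattern, as a hedged userspace analogue rather than the
    actual kernel diff: a single global mutex owned by the begin/end helpers,
    so every caller of the add/remove paths is serialized without giving the
    device_hotplug lock extra meaning.

    #include <pthread.h>
    #include <stdio.h>

    /* Userspace analogue of the memory_add_remove_lock idea: begin/end own
     * the lock so callers cannot forget it or interleave add/remove paths. */
    static pthread_mutex_t memory_add_remove_lock = PTHREAD_MUTEX_INITIALIZER;

    static void mem_hotplug_begin(void)
    {
            pthread_mutex_lock(&memory_add_remove_lock);
            /* ...the original begin work (writer accounting etc.) goes here... */
    }

    static void mem_hotplug_end(void)
    {
            /* ...the original end work goes here... */
            pthread_mutex_unlock(&memory_add_remove_lock);
    }

    int main(void)
    {
            mem_hotplug_begin();
            printf("add_memory()/remove_memory() critical section\n");
            mem_hotplug_end();
            return 0;
    }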

    Link: http://lkml.kernel.org/r/20170314125226.16779-2-heiko.carstens@de.ibm.com
    Signed-off-by: Heiko Carstens
    Reported-by: Sebastian Ott
    Acked-by: Dan Williams
    Acked-by: Rafael J. Wysocki
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Ben Hutchings
    Cc: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     

25 Feb, 2017

4 commits

  • Commit 31bc3858ea3e ("add automatic onlining policy for the newly added
    memory") provides the capability to have added memory automatically
    onlined during add, but this appears to be slightly broken.

    The current implementation uses walk_memory_range() to call
    online_memory_block, which uses memory_block_change_state() to online
    the memory. Instead, we should be calling device_online() for the
    memory block in online_memory_block(). This would online the memory
    (device_online() invokes the memory bus online routine
    memory_subsys_online(), which calls memory_block_change_state()) and
    properly update the device struct's offline flag.

    As a result of the current implementation, attempting to remove a memory
    block after adding it using auto online fails. This is because doing a
    remove, for instance

    echo offline > /sys/devices/system/memory/memoryXXX/state

    uses device_offline() which checks the dev->offline flag.
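
    From userspace the whole thing boils down to the state attribute. A small
    sketch that offlines and re-onlines a block; memory34 is just a placeholder
    block number, and the program needs root:

    /* toggle_memblock.c - offline and re-online one memory block via sysfs.
     * memory34 is a placeholder; adjust the path and run as root. */
    #include <stdio.h>

    static void write_state(const char *path, const char *state)
    {
            FILE *f = fopen(path, "w");

            if (!f) {
                    perror(path);
                    return;
            }
            if (fputs(state, f) == EOF)
                    perror("write");
            fclose(f);
    }

    int main(void)
    {
            const char *state = "/sys/devices/system/memory/memory34/state";

            write_state(state, "offline"); /* goes through device_offline() */
            write_state(state, "online");  /* with the fix, dev->offline matches */
            return 0;
    }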

    Link: http://lkml.kernel.org/r/20170222220744.8119.19687.stgit@ltcalpine2-lp14.aus.stglabs.ibm.com
    Signed-off-by: Nathan Fontenot
    Cc: Michael Ellerman
    Cc: Michael Roth
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Fontenot
     
    When mainline introduced commit a96dfddbcc04 ("base/memory, hotplug: fix
    a kernel oops in show_valid_zones()"), it obtained the valid start and
    end pfn from the given pfn range. The valid start pfn can fix the
    actual issue, but it introduced another issue: the valid end pfn
    may exceed the given end_pfn.

    Although the incorrect overflow does not result in an actual problem at
    present, I think it needs to be fixed.

    [toshi.kani@hpe.com: remove assumption that end_pfn is aligned by MAX_ORDER_NR_PAGES]
    Fixes: a96dfddbcc04 ("base/memory, hotplug: fix a kernel oops in show_valid_zones()")
    Link: http://lkml.kernel.org/r/1486467299-22648-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Signed-off-by: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     
  • We had considered all of the non-lru pages as unmovable before commit
    bda807d44454 ("mm: migrate: support non-lru movable page migration").
    But now some non-lru pages, like zsmalloc and virtio-balloon pages, have
    also become movable. So we can offline such blocks by using non-lru page
    migration.

    This patch straightforwardly adds non-lru migration code, which means
    adding non-lru related code to the functions that scan over pfns,
    collect pages to be migrated, and isolate them before migration.

    Signed-off-by: Yisheng Xie
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Hanjun Guo
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Taku Izumi
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yisheng Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     
  • It has no modular callers.

    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton