07 Oct, 2020

2 commits

  • commit f85086f95fa36194eb0db5cd5c12e56801b98523 upstream.

    In register_mem_sect_under_node() the system_state value is checked to
    detect whether the call is made during boot time or during a hot-plug
    operation. Unfortunately, that check against SYSTEM_BOOTING is wrong
    because regular memory is registered at the SYSTEM_SCHEDULING state. In
    addition, memory hot-plug operations can be triggered at this system
    state by ACPI [1]. So checking against the system state is not enough.

    The consequence is that on systems with interleaved node ranges like this:

    Early memory node ranges
    node 1: [mem 0x0000000000000000-0x000000011fffffff]
    node 2: [mem 0x0000000120000000-0x000000014fffffff]
    node 1: [mem 0x0000000150000000-0x00000001ffffffff]
    node 0: [mem 0x0000000200000000-0x000000048fffffff]
    node 2: [mem 0x0000000490000000-0x00000007ffffffff]

    This can be seen on a PowerPC LPAR after multiple memory hot-plug and
    hot-unplug operations are done. At the next reboot the node's memory
    ranges can be interleaved, and since the call to link_mem_sections() is
    made in topology_init() while the system is in the SYSTEM_SCHEDULING
    state, the node id is not checked and the sections are registered to
    multiple nodes:

    $ ls -l /sys/devices/system/memory/memory21/node*
    total 0
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2

    In that case, the system is able to boot, but if one of these memory
    blocks is later hot-unplugged and then hot-plugged, the sysfs
    inconsistency is detected and triggers a BUG_ON():

    kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
    Oops: Exception in kernel mode, sig: 5 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
    CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
    Call Trace:
    add_memory_resource+0x23c/0x340 (unreliable)
    __add_memory+0x5c/0xf0
    dlpar_add_lmb+0x1b4/0x500
    dlpar_memory+0x1f8/0xb80
    handle_dlpar_errorlog+0xc0/0x190
    dlpar_store+0x198/0x4a0
    kobj_attr_store+0x30/0x50
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1b0/0x290
    vfs_write+0xe8/0x290
    ksys_write+0xdc/0x130
    system_call_exception+0x160/0x270
    system_call_common+0xf0/0x27c

    This patch addresses the root cause by no longer relying on the
    system_state value to detect whether the call is due to a hot-plug
    operation. Instead, an extra parameter is added to link_mem_sections()
    stating explicitly whether the operation is a hot-plug one.
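
    For illustration, a rough sketch of the interface change (the enumerator
    names here are assumptions, not taken verbatim from the patch):

    enum meminit_context { MEMINIT_EARLY, MEMINIT_HOTPLUG };

    /* before: the callee guessed the context from system_state */
    int link_mem_sections(int nid, unsigned long start_pfn,
                          unsigned long end_pfn);

    /* after: the caller states the context explicitly, so the node id of
     * each section can be verified for hot-plugged memory even while the
     * system is still in the SYSTEM_SCHEDULING state */
    int link_mem_sections(int nid, unsigned long start_pfn,
                          unsigned long end_pfn,
                          enum meminit_context context);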

    [1] According to Oscar Salvador, using this qemu command line, ACPI
    memory hotplug operations are raised at SYSTEM_SCHEDULING state:

    $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
    -m size=$MEM,slots=255,maxmem=4294967296k \
    -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
    -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
    -object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
    -object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
    -object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
    -object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
    -object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
    -object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \

    Fixes: 4fbce633910e ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
    Signed-off-by: Laurent Dufour
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Fenghua Yu
    Cc: Nathan Lynch
    Cc: Scott Cheloha
    Cc: Tony Luck
    Cc:
    Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Laurent Dufour
     
  • commit c1d0da83358a2316d9be7f229f26126dbaa07468 upstream.

    Patch series "mm: fix memory to node bad links in sysfs", v3.

    Sometimes, firmware may expose interleaved memory layout like this:

    Early memory node ranges
    node 1: [mem 0x0000000000000000-0x000000011fffffff]
    node 2: [mem 0x0000000120000000-0x000000014fffffff]
    node 1: [mem 0x0000000150000000-0x00000001ffffffff]
    node 0: [mem 0x0000000200000000-0x000000048fffffff]
    node 2: [mem 0x0000000490000000-0x00000007ffffffff]

    In that case, we can see memory blocks assigned to multiple nodes in
    sysfs:

    $ ls -l /sys/devices/system/memory/memory21
    total 0
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2
    -rw-r--r-- 1 root root 65536 Aug 24 05:27 online
    -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
    -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
    drwxr-xr-x 2 root root 0 Aug 24 05:27 power
    -r--r--r-- 1 root root 65536 Aug 24 05:27 removable
    -rw-r--r-- 1 root root 65536 Aug 24 05:27 state
    lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory
    -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
    -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones

    The same applies to the node directories, with a memory21 link present in
    both the node1 and node2 directories.

    This is wrong but doesn't prevent the system from running. However, when
    one of these memory blocks is later hot-unplugged and then hot-plugged,
    the system detects an inconsistency in the sysfs layout and a BUG_ON() is
    raised:

    kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
    CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
    Call Trace:
    add_memory_resource+0x23c/0x340 (unreliable)
    __add_memory+0x5c/0xf0
    dlpar_add_lmb+0x1b4/0x500
    dlpar_memory+0x1f8/0xb80
    handle_dlpar_errorlog+0xc0/0x190
    dlpar_store+0x198/0x4a0
    kobj_attr_store+0x30/0x50
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1b0/0x290
    vfs_write+0xe8/0x290
    ksys_write+0xdc/0x130
    system_call_exception+0x160/0x270
    system_call_common+0xf0/0x27c

    This has been seen on PowerPC LPAR.

    The root cause of this issue is that when a node's memory is registered,
    the range used can overlap another node's range, so the memory block ends
    up registered to multiple nodes in sysfs.

    There are two issues here:

    (a) The sysfs memory and node's layouts are broken due to these
    multiple links

    (b) The link errors in link_mem_sections() should not lead to a system
    panic.

    To address (a), register_mem_sect_under_node() should not rely on the
    system state to detect whether the link operation is triggered by a
    hot-plug operation. This is addressed by patches 1 and 2 of this series.

    Issue (b) will be addressed separately.

    This patch (of 2):

    The memmap_context enum is used to detect whether a memory operation is
    due to a hot-add operation or happening at boot time.

    Generalize it to hotplug operations and rename it to meminit_context.

    There is no functional change introduced by this patch.
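
    For illustration, a minimal before/after sketch of the rename (the
    enumerator names are assumptions, not necessarily the exact identifiers
    used by the patch):

    /* before: named after the memmap initialization path only */
    enum memmap_context {
            MEMMAP_EARLY,
            MEMMAP_HOTPLUG,
    };

    /* after: a general "why is this memory being initialized" context */
    enum meminit_context {
            MEMINIT_EARLY,          /* memory described at boot time */
            MEMINIT_HOTPLUG,        /* memory added by a hotplug operation */
    };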

    Suggested-by: David Hildenbrand
    Signed-off-by: Laurent Dufour
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J . Wysocki"
    Cc: Nathan Lynch
    Cc: Scott Cheloha
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc:
    Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
    Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Laurent Dufour
     

23 Sep, 2020

1 commit

  • commit 9683182612214aa5f5e709fad49444b847cd866a upstream.

    There is a race during page offline that can lead to an infinite loop:
    a page never ends up on a buddy list and __offline_pages() keeps
    retrying indefinitely or until a termination signal is received.

    Thread#1 - a new process:

    load_elf_binary
    begin_new_exec
    exec_mmap
    mmput
    exit_mmap
    tlb_finish_mmu
    tlb_flush_mmu
    release_pages
    free_unref_page_list
    free_unref_page_prepare
    set_pcppage_migratetype(page, migratetype);
    // Set page->index migration type below MIGRATE_PCPTYPES

    Thread#2 - hot-removes memory
    __offline_pages
    start_isolate_page_range
    set_migratetype_isolate
    set_pageblock_migratetype(page, MIGRATE_ISOLATE);
    // set migration type to MIGRATE_ISOLATE
    drain_all_pages(zone);
    // drain per-cpu page lists to buddy allocator.

    Thread#1 - continue
    free_unref_page_commit
    migratetype = get_pcppage_migratetype(page);
    // get old migration type
    list_add(&page->lru, &pcp->lists[migratetype]);
    // add new page to already drained pcp list

    Thread#2
    Never drains pcp again, and therefore gets stuck in the loop.

    The fix is to try to drain per-cpu lists again after
    check_pages_isolated_cb() fails.
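
    Roughly, the retry takes the following shape (a simplified sketch, not
    the literal patch):

    do {
            ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
                                        NULL, check_pages_isolated_cb);
            if (ret) {
                    /*
                     * A racing free_unref_page_commit() may have put a page
                     * back on an already-drained per-cpu list, so drain once
                     * more before retrying instead of spinning forever.
                     */
                    drain_all_pages(zone);
            }
    } while (ret);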

    Fixes: c52e75935f8d ("mm: remove extra drain pages on pcp list")
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: David Hildenbrand
    Cc: Oscar Salvador
    Cc: Wei Yang
    Cc:
    Link: https://lkml.kernel.org/r/20200903140032.380431-1-pasha.tatashin@soleen.com
    Link: https://lkml.kernel.org/r/20200904151448.100489-2-pasha.tatashin@soleen.com
    Link: http://lkml.kernel.org/r/20200904070235.GA15277@dhcp22.suse.cz
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Pavel Tatashin
     

21 Aug, 2020

1 commit

  • commit b4223a510e2ab1bf0f971d50af7c1431014b25ad upstream.

    When check_memblock_offlined_cb() returns a failing rc (e.g. the memblock
    is still online at that time), mem_hotplug_begin/done is left unpaired.

    This results in the following warning:
    Call Trace:
    percpu_up_write+0x33/0x40
    try_remove_memory+0x66/0x120
    ? _cond_resched+0x19/0x30
    remove_memory+0x2b/0x40
    dev_dax_kmem_remove+0x36/0x72 [kmem]
    device_release_driver_internal+0xf0/0x1c0
    device_release_driver+0x12/0x20
    bus_remove_device+0xe1/0x150
    device_del+0x17b/0x3e0
    unregister_dev_dax+0x29/0x60
    devm_action_release+0x15/0x20
    release_nodes+0x19a/0x1e0
    devres_release_all+0x3f/0x50
    device_release_driver_internal+0x100/0x1c0
    driver_detach+0x4c/0x8f
    bus_remove_driver+0x5c/0xd0
    driver_unregister+0x31/0x50
    dax_pmem_exit+0x10/0xfe0 [dax_pmem]
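
    One way to keep the lock usage paired is to perform the offline check
    before taking the hotplug lock; a rough sketch of that shape (an
    assumption about the approach, not the literal patch):

    static int try_remove_memory(int nid, u64 start, u64 size)
    {
            int rc;

            /* verify all memblocks are offline before taking the lock */
            rc = walk_memory_blocks(start, size, NULL,
                                    check_memblock_offlined_cb);
            if (rc)
                    return rc;      /* nothing taken yet, nothing to undo */

            mem_hotplug_begin();
            /* ... actual removal happens here ... */
            mem_hotplug_done();     /* always pairs with mem_hotplug_begin() */
            return 0;
    }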

    Fixes: f1037ec0cc8a ("mm/memory_hotplug: fix remove_memory() lockdep splat")
    Signed-off-by: Jia He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Dan Williams
    Cc: [5.6+]
    Cc: Andy Lutomirski
    Cc: Baoquan He
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Chuhong Yuan
    Cc: Dave Hansen
    Cc: Dave Jiang
    Cc: Fenghua Yu
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Kaly Xin
    Cc: Logan Gunthorpe
    Cc: Masahiro Yamada
    Cc: Mike Rapoport
    Cc: Peter Zijlstra
    Cc: Rich Felker
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vishal Verma
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200710031619.18762-3-justin.he@arm.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jia He
     

12 Mar, 2020

1 commit

  • commit c87cbc1f007c4b46165f05ceca04e1973cda0b9c upstream.

    Commit cd02cf1aceea ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
    fixed memory hotplug with debug_pagealloc enabled, where onlining a page
    goes through page freeing, which removes the direct mapping. Some arches
    don't like it when the page is not mapped in the first place, so
    generic_online_page() maps it first. This is somewhat wasteful, but
    better than special-casing the page freeing fast paths.

    The commit however missed that DEBUG_PAGEALLOC configured doesn't mean
    it's actually enabled. One has to test debug_pagealloc_enabled() since
    031bc5743f15 ("mm/debug-pagealloc: make debug-pagealloc boottime
    configurable"), or alternatively debug_pagealloc_enabled_static() since
    8e57f8acbbd1 ("mm, debug_pagealloc: don't rely on static keys too early"),
    but this is not done.

    As a result, a s390 kernel with DEBUG_PAGEALLOC configured but not enabled
    will crash:

    Unable to handle kernel pointer dereference in virtual kernel address space
    Failing address: 0000000000000000 TEID: 0000000000000483
    Fault in home space mode while using kernel ASCE.
    AS:0000001ece13400b R2:000003fff7fd000b R3:000003fff7fcc007 S:000003fff7fd7000 P:000000000000013d
    Oops: 0004 ilc:2 [#1] SMP
    CPU: 1 PID: 26015 Comm: chmem Kdump: loaded Tainted: GX 5.3.18-5-default #1 SLE15-SP2 (unreleased)
    Krnl PSW : 0704e00180000000 0000001ecd281b9e (__kernel_map_pages+0x166/0x188)
    R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
    Krnl GPRS: 0000000000000000 0000000000000800 0000400b00000000 0000000000000100
    0000000000000001 0000000000000000 0000000000000002 0000000000000100
    0000001ece139230 0000001ecdd98d40 0000400b00000100 0000000000000000
    000003ffa17e4000 001fffe0114f7d08 0000001ecd4d93ea 001fffe0114f7b20
    Krnl Code: 0000001ecd281b8e: ec17ffff00d8 ahik %r1,%r7,-1
    0000001ecd281b94: ec111dbc0355 risbg %r1,%r1,29,188,3
    >0000001ecd281b9e: 94fb5006 ni 6(%r5),251
    0000001ecd281ba2: 41505008 la %r5,8(%r5)
    0000001ecd281ba6: ec51fffc6064 cgrj %r5,%r1,6,1ecd281b9e
    0000001ecd281bac: 1a07 ar %r0,%r7
    0000001ecd281bae: ec03ff584076 crj %r0,%r3,4,1ecd281a5e
    Call Trace:
    [] __kernel_map_pages+0x166/0x188
    [] online_pages_range+0xf6/0x128
    [] walk_system_ram_range+0x7e/0xd8
    [] online_pages+0x2fe/0x3f0
    [] memory_subsys_online+0x8e/0xc0
    [] device_online+0x5a/0xc8
    [] state_store+0x88/0x118
    [] kernfs_fop_write+0xc2/0x200
    [] vfs_write+0x176/0x1e0
    [] ksys_write+0xa2/0x100
    [] system_call+0xd8/0x2c8

    Fix this by checking debug_pagealloc_enabled_static() before calling
    kernel_map_pages(). Backports for kernel before 5.5 should use
    debug_pagealloc_enabled() instead. Also add comments.
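
    A sketch of the fixed callback (simplified; kernels before 5.5 would test
    debug_pagealloc_enabled() instead):

    void generic_online_page(struct page *page, unsigned int order)
    {
            /*
             * Freeing the page with debug_pagealloc enabled will try to
             * unmap it from the direct map, so map it first, but only when
             * debug_pagealloc is actually enabled at runtime.
             */
            if (debug_pagealloc_enabled_static())
                    kernel_map_pages(page, 1 << order, 1);
            __free_pages_core(page, order);
            /* totalram/highmem accounting omitted in this sketch */
    }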

    Fixes: cd02cf1aceea ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
    Reported-by: Gerald Schaefer
    Signed-off-by: Andrew Morton
    Signed-off-by: Vlastimil Babka
    Reviewed-by: David Hildenbrand
    Cc:
    Cc: Joonsoo Kim
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/20200224094651.18257-1-vbabka@suse.cz
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

11 Feb, 2020

1 commit

  • commit f1037ec0cc8ac1a450974ad9754e991f72884f48 upstream.

    The daxctl unit test for the dax_kmem driver currently triggers the
    (false positive) lockdep splat below. It results from the fact that
    remove_memory_block_devices() is invoked under the mem_hotplug_lock()
    causing lockdep entanglements with cpu_hotplug_lock() and sysfs (kernfs
    active state tracking). It is a false positive because the sysfs
    attribute path triggering the memory remove is not the same attribute
    path associated with memory-block device.

    sysfs_break_active_protection() is not applicable since there is no real
    deadlock conflict; instead, move memory-block device removal outside the
    lock. The mem_hotplug_lock() is not needed to synchronize the
    memory-block device removal vs the page online state, that is already
    handled by lock_device_hotplug(). Specifically, lock_device_hotplug()
    is sufficient to allow try_remove_memory() to check the offline state of
    the memblocks and be assured that any in progress online attempts are
    flushed / blocked by kernfs_drain() / attribute removal.

    The add_memory() path safely creates memblock devices under the
    mem_hotplug_lock(). There is no kernfs active state synchronization in
    the memblock device_register() path, so nothing to fix there.

    This change is only possible thanks to the recent change that refactored
    memory block device removal out of arch_remove_memory() (commit
    4c4b7f9ba948 "mm/memory_hotplug: remove memory block devices before
    arch_remove_memory()"), and David's due diligence tracking down the
    guarantees afforded by kernfs_drain(). Not flagged for -stable since
    this only impacts ongoing development and lockdep validation, not a
    runtime issue.

    ======================================================
    WARNING: possible circular locking dependency detected
    5.5.0-rc3+ #230 Tainted: G OE
    ------------------------------------------------------
    lt-daxctl/6459 is trying to acquire lock:
    ffff99c7f0003510 (kn->count#241){++++}, at: kernfs_remove_by_name_ns+0x41/0x80

    but task is already holding lock:
    ffffffffa76a5450 (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x20/0xe0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (mem_hotplug_lock.rw_sem){++++}:
    __lock_acquire+0x39c/0x790
    lock_acquire+0xa2/0x1b0
    get_online_mems+0x3e/0xb0
    kmem_cache_create_usercopy+0x2e/0x260
    kmem_cache_create+0x12/0x20
    ptlock_cache_init+0x20/0x28
    start_kernel+0x243/0x547
    secondary_startup_64+0xb6/0xc0

    -> #1 (cpu_hotplug_lock.rw_sem){++++}:
    __lock_acquire+0x39c/0x790
    lock_acquire+0xa2/0x1b0
    cpus_read_lock+0x3e/0xb0
    online_pages+0x37/0x300
    memory_subsys_online+0x17d/0x1c0
    device_online+0x60/0x80
    state_store+0x65/0xd0
    kernfs_fop_write+0xcf/0x1c0
    vfs_write+0xdb/0x1d0
    ksys_write+0x65/0xe0
    do_syscall_64+0x5c/0xa0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #0 (kn->count#241){++++}:
    check_prev_add+0x98/0xa40
    validate_chain+0x576/0x860
    __lock_acquire+0x39c/0x790
    lock_acquire+0xa2/0x1b0
    __kernfs_remove+0x25f/0x2e0
    kernfs_remove_by_name_ns+0x41/0x80
    remove_files.isra.0+0x30/0x70
    sysfs_remove_group+0x3d/0x80
    sysfs_remove_groups+0x29/0x40
    device_remove_attrs+0x39/0x70
    device_del+0x16a/0x3f0
    device_unregister+0x16/0x60
    remove_memory_block_devices+0x82/0xb0
    try_remove_memory+0xb5/0x130
    remove_memory+0x26/0x40
    dev_dax_kmem_remove+0x44/0x6a [kmem]
    device_release_driver_internal+0xe4/0x1c0
    unbind_store+0xef/0x120
    kernfs_fop_write+0xcf/0x1c0
    vfs_write+0xdb/0x1d0
    ksys_write+0x65/0xe0
    do_syscall_64+0x5c/0xa0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    other info that might help us debug this:

    Chain exists of:
    kn->count#241 --> cpu_hotplug_lock.rw_sem --> mem_hotplug_lock.rw_sem

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(mem_hotplug_lock.rw_sem);
    lock(cpu_hotplug_lock.rw_sem);
    lock(mem_hotplug_lock.rw_sem);
    lock(kn->count#241);

    *** DEADLOCK ***

    No fixes tag as this has been a long standing issue that predated the
    addition of kernfs lockdep annotations.

    Link: http://lkml.kernel.org/r/157991441887.2763922.4770790047389427325.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Cc: Vishal Verma
    Cc: Pavel Tatashin
    Cc: Dave Hansen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

09 Jan, 2020

1 commit

  • commit feee6b2989165631b17ac6d4ccdbf6759254e85a upstream.

    We currently try to shrink a single zone when removing memory. We use
    the zone of the first page of the memory we are removing. If that
    memmap was never initialized (e.g., memory was never onlined), we will
    read garbage and can trigger kernel BUGs (due to a stale pointer):

    BUG: unable to handle page fault for address: 000000000000353d
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP PTI
    CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    RIP: 0010:clear_zone_contiguous+0x5/0x10
    Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
    RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
    RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
    RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
    R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
    FS: 0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    __remove_pages+0x4b/0x640
    arch_remove_memory+0x63/0x8d
    try_remove_memory+0xdb/0x130
    __remove_memory+0xa/0x11
    acpi_memory_device_remove+0x70/0x100
    acpi_bus_trim+0x55/0x90
    acpi_device_hotplug+0x227/0x3a0
    acpi_hotplug_work_fn+0x1a/0x30
    process_one_work+0x221/0x550
    worker_thread+0x50/0x3b0
    kthread+0x105/0x140
    ret_from_fork+0x3a/0x50
    Modules linked in:
    CR2: 000000000000353d

    Instead, shrink the zones when offlining memory or when onlining failed.
    Introduce and use remove_pfn_range_from_zone() for that. We now
    properly shrink the zones, even if we have DIMMs whereby

    - Some memory blocks fall into no zone (never onlined)

    - Some memory blocks fall into multiple zones (offlined+re-onlined)

    - Multiple memory blocks that fall into different zones

    Drop the zone parameter (with a potentially dubious value) from
    __remove_pages() and __remove_section().
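
    Conceptually, the zone association becomes a symmetric pair; a rough
    sketch of the resulting calls (illustrative, not the literal patch):

    /* onlining (or mapping device memory) associates the range with a zone */
    move_pfn_range_to_zone(zone, start_pfn, nr_pages, altmap);

    /* offlining, or a failed online attempt, undoes exactly that range */
    remove_pfn_range_from_zone(zone, start_pfn, nr_pages);

    /* ...and __remove_pages() no longer takes a zone argument at all */
    __remove_pages(start_pfn, nr_pages, altmap);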

    Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com
    Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
    Signed-off-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: "Matthew Wilcox (Oracle)"
    Cc: "Aneesh Kumar K.V"
    Cc: Pavel Tatashin
    Cc: Greg Kroah-Hartman
    Cc: Dan Williams
    Cc: Logan Gunthorpe
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     

23 Nov, 2019

1 commit

  • Let's limit shrinking to !ZONE_DEVICE so we can fix the current code.
    We should never try to touch the memmap of offline sections where we
    could have uninitialized memmaps and could trigger BUGs when calling
    page_to_nid() on poisoned pages.

    There is no reliable way to distinguish an uninitialized memmap from an
    initialized memmap that belongs to ZONE_DEVICE, as we don't have
    anything like SECTION_IS_ONLINE we can use similar to
    pfn_to_online_section() for !ZONE_DEVICE memory.

    E.g., set_zone_contiguous() similarly relies on pfn_to_online_section()
    and will therefore never mark a ZONE_DEVICE zone as contiguous. No longer
    shrinking the ZONE_DEVICE zone therefore results in no observable changes,
    besides /proc/zoneinfo indicating different boundaries - something we can
    totally live with.
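
    The guard itself is small; a sketch of the idea (not the literal patch):

    static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
                                 unsigned long end_pfn)
    {
            /*
             * ZONE_DEVICE memmaps may still be uninitialized and there is no
             * SECTION_IS_ONLINE-like marker to tell, so do not touch them
             * and leave the zone span alone.
             */
            if (zone_idx(zone) == ZONE_DEVICE)
                    return;

            /* ... shrink zone_start_pfn / spanned_pages as before ... */
    }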

    Before commit d0dc12e86b31 ("mm/memory_hotplug: optimize memory
    hotplug"), the memmap was initialized with 0 and the node with the right
    value. So the zone might be wrong but not garbage. After that commit,
    both the zone and the node will be garbage when touching uninitialized
    memmaps.

    Toshiki reported a BUG (race between delayed initialization of
    ZONE_DEVICE memmaps without holding the memory hotplug lock and
    concurrent zone shrinking).

    https://lkml.org/lkml/2019/11/14/1040

    "Iteration of create and destroy namespace causes the panic as below:

    kernel BUG at mm/page_alloc.c:535!
    CPU: 7 PID: 2766 Comm: ndctl Not tainted 5.4.0-rc4 #6
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
    RIP: 0010:set_pfnblock_flags_mask+0x95/0xf0
    Call Trace:
    memmap_init_zone_device+0x165/0x17c
    memremap_pages+0x4c1/0x540
    devm_memremap_pages+0x1d/0x60
    pmem_attach_disk+0x16b/0x600 [nd_pmem]
    nvdimm_bus_probe+0x69/0x1c0
    really_probe+0x1c2/0x3e0
    driver_probe_device+0xb4/0x100
    device_driver_attach+0x4f/0x60
    bind_store+0xc9/0x110
    kernfs_fop_write+0x116/0x190
    vfs_write+0xa5/0x1a0
    ksys_write+0x59/0xd0
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    While creating a namespace and initializing memmap, if you destroy the
    namespace and shrink the zone, it will initialize the memmap outside
    the zone and trigger VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page),
    pfn), page) in set_pfnblock_flags_mask()."

    This BUG is also mitigated by this commit, where for now we stop
    shrinking the ZONE_DEVICE zone until we can do it in a safe and clean way.

    Link: http://lkml.kernel.org/r/20191006085646.5768-5-david@redhat.com
    Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
    Signed-off-by: David Hildenbrand
    Reported-by: Aneesh Kumar K.V
    Reported-by: Toshiki Fukasawa
    Cc: Oscar Salvador
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Dan Williams
    Cc: Alexander Duyck
    Cc: Alexander Potapenko
    Cc: Andy Lutomirski
    Cc: Anshuman Khandual
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Christophe Leroy
    Cc: Damian Tometzki
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: Gerald Schaefer
    Cc: Greg Kroah-Hartman
    Cc: Halil Pasic
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Ira Weiny
    Cc: Jason Gunthorpe
    Cc: Jun Yao
    Cc: Logan Gunthorpe
    Cc: Mark Rutland
    Cc: Masahiro Yamada
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Pankaj Gupta
    Cc: Paul Mackerras
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: Qian Cai
    Cc: Rich Felker
    Cc: Robin Murphy
    Cc: Steve Capper
    Cc: Thomas Gleixner
    Cc: Tom Lendacky
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Wei Yang
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Cc: Yu Zhao
    Cc: [4.13+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

16 Nov, 2019

1 commit

  • try_offline_node() is pretty much broken right now:

    - The node span is updated when onlining memory, not when adding it. We
    ignore memory that was never onlined. Bad.

    - We touch possible garbage memmaps. The pfn_to_nid(pfn) can easily
    trigger a kernel panic. Bad for memory that is offline but also bad
    for subsection hotadd with ZONE_DEVICE, whereby the memmap of the
    first PFN of a section might contain garbage.

    - Sections belonging to mixed nodes are not properly considered.

    As memory blocks might belong to multiple nodes, we would have to walk
    all pageblocks (or at least subsections) within present sections.
    However, we don't have a way to identify whether a memmap that is not
    online was initialized (relevant for ZONE_DEVICE). This makes things
    more complicated.

    Luckily, we can piggyback on the node span and the nid stored in memory
    blocks. Currently, the node span is grown when calling
    move_pfn_range_to_zone() - e.g., when onlining memory, and shrunk when
    removing memory, before calling try_offline_node(). Sysfs links are
    created via link_mem_sections(), e.g., during boot or when adding
    memory.

    If the node still spans memory or if any memory block belongs to the
    nid, we don't set the node offline. As memory blocks that span multiple
    nodes cannot get offlined, the nid stored in memory blocks is reliable
    enough (for such online memory blocks, the node still spans the memory).

    Introduce for_each_memory_block() to efficiently walk all memory blocks.
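
    A sketch of how the walk can back the nid check (illustrative; the exact
    code may differ):

    static int check_no_memblock_for_node_cb(struct memory_block *mem,
                                             void *arg)
    {
            int nid = *(int *)arg;

            /* a non-zero return aborts the walk: this node still owns a block */
            return mem->nid == nid;
    }

    /* in try_offline_node(), after checking that the node spans no pages */
    if (for_each_memory_block(&nid, check_no_memblock_for_node_cb))
            return;         /* some memory block still belongs to this nid */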

    Note: We will soon stop shrinking the ZONE_DEVICE zone and the node span
    when removing ZONE_DEVICE memory to fix similar issues (access of
    garbage memmaps) - until we have a reliable way to identify whether
    these memmaps were properly initialized. This implies later, that once
    a node had ZONE_DEVICE memory, we won't be able to set a node offline -
    which should be acceptable.

    Since commit f1dd2cd13c4b ("mm, memory_hotplug: do not associate
    hotadded memory to zones until online") memory that is added is not
    associated with a zone/node (memmap not initialized). The introducing
    commit 60a5a19e7419 ("memory-hotplug: remove sysfs file of node")
    already missed that we could have multiple nodes for a section and that
    the zone/node span is updated when onlining pages, not when adding them.

    I tested this by hotplugging two DIMMs to a memory-less and cpu-less
    NUMA node. The node is properly onlined when adding the DIMMs. When
    removing the DIMMs, the node is properly offlined.

    Masayoshi Mizuma reported:

    : Without this patch, memory hotplug fails as panic:
    :
    : BUG: kernel NULL pointer dereference, address: 0000000000000000
    : ...
    : Call Trace:
    : remove_memory_block_devices+0x81/0xc0
    : try_remove_memory+0xb4/0x130
    : __remove_memory+0xa/0x20
    : acpi_memory_device_remove+0x84/0x100
    : acpi_bus_trim+0x57/0x90
    : acpi_bus_trim+0x2e/0x90
    : acpi_device_hotplug+0x2b2/0x4d0
    : acpi_hotplug_work_fn+0x1a/0x30
    : process_one_work+0x171/0x380
    : worker_thread+0x49/0x3f0
    : kthread+0xf8/0x130
    : ret_from_fork+0x35/0x40

    [david@redhat.com: v3]
    Link: http://lkml.kernel.org/r/20191102120221.7553-1-david@redhat.com
    Link: http://lkml.kernel.org/r/20191028105458.28320-1-david@redhat.com
    Fixes: 60a5a19e7419 ("memory-hotplug: remove sysfs file of node")
    Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e86b319
    Signed-off-by: David Hildenbrand
    Tested-by: Masayoshi Mizuma
    Cc: Tang Chen
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Keith Busch
    Cc: Jiri Olsa
    Cc: "Peter Zijlstra (Intel)"
    Cc: Jani Nikula
    Cc: Nayna Jain
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Stephen Rothwell
    Cc: Dan Williams
    Cc: Pavel Tatashin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

07 Nov, 2019

1 commit

  • We recently started updating the node span based on the zone span to
    avoid touching uninitialized memmaps.

    Currently, we will always detect the node span to start at 0, meaning a
    node can easily span too many pages. pgdat_is_empty() will still work
    correctly if all zones span no pages. We should skip over all zones
    without spanned pages and properly handle the first detected zone that
    spans pages.
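
    A sketch of the recomputation described above (simplified; not the
    literal patch):

    static void update_pgdat_span(struct pglist_data *pgdat)
    {
            unsigned long node_start_pfn = 0, node_end_pfn = 0;
            struct zone *zone;

            for (zone = pgdat->node_zones;
                 zone < pgdat->node_zones + MAX_NR_ZONES; zone++) {
                    unsigned long end_pfn = zone->zone_start_pfn +
                                            zone->spanned_pages;

                    /* skip zones that currently span no pages at all */
                    if (!zone->spanned_pages)
                            continue;
                    if (!node_end_pfn) {
                            /* first zone with pages defines the span */
                            node_start_pfn = zone->zone_start_pfn;
                            node_end_pfn = end_pfn;
                            continue;
                    }
                    if (end_pfn > node_end_pfn)
                            node_end_pfn = end_pfn;
                    if (zone->zone_start_pfn < node_start_pfn)
                            node_start_pfn = zone->zone_start_pfn;
            }

            pgdat->node_start_pfn = node_start_pfn;
            pgdat->node_spanned_pages = node_end_pfn - node_start_pfn;
    }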

    Unfortunately, in contrast to the zone span (/proc/zoneinfo), the node
    span cannot easily be inspected and tested. The node span gives no real
    guarantees when an architecture supports memory hotplug, meaning it can
    easily contain holes or span pages of different nodes.

    The node span is not really used after init on architectures that
    support memory hotplug.

    E.g., we use it in mm/memory_hotplug.c:try_offline_node() and in
    mm/kmemleak.c:kmemleak_scan(). These users seem to be fine.

    Link: http://lkml.kernel.org/r/20191027222714.5313-1-david@redhat.com
    Fixes: 00d6c019b5bc ("mm/memory_hotplug: don't access uninitialized memmaps in shrink_pgdat_span()")
    Signed-off-by: David Hildenbrand
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Stephen Rothwell
    Cc: Dan Williams
    Cc: Pavel Tatashin
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

19 Oct, 2019

1 commit

  • We might use the nid of memmaps that were never initialized. For
    example, if the memmap was poisoned, we will crash the kernel in
    pfn_to_nid() right now. Let's use the calculated boundaries of the
    separate zones instead. This now also avoids having to iterate over a
    whole bunch of subsections again, after shrinking one zone.

    Before commit d0dc12e86b31 ("mm/memory_hotplug: optimize memory
    hotplug"), the memmap was initialized to 0 and the node was set to the
    right value. After that commit, the node might be garbage.

    We'll have to fix shrink_zone_span() next.

    Link: http://lkml.kernel.org/r/20191006085646.5768-4-david@redhat.com
    Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [d0dc12e86b319]
    Signed-off-by: David Hildenbrand
    Reported-by: Aneesh Kumar K.V
    Cc: Oscar Salvador
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Dan Williams
    Cc: Wei Yang
    Cc: Alexander Duyck
    Cc: Alexander Potapenko
    Cc: Andy Lutomirski
    Cc: Anshuman Khandual
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Christophe Leroy
    Cc: Damian Tometzki
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: Gerald Schaefer
    Cc: Greg Kroah-Hartman
    Cc: Halil Pasic
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Ira Weiny
    Cc: Jason Gunthorpe
    Cc: Jun Yao
    Cc: Logan Gunthorpe
    Cc: Mark Rutland
    Cc: Masahiro Yamada
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Pankaj Gupta
    Cc: Paul Mackerras
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: Qian Cai
    Cc: Rich Felker
    Cc: Robin Murphy
    Cc: Steve Capper
    Cc: Thomas Gleixner
    Cc: Tom Lendacky
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Cc: Yu Zhao
    Cc: [4.13+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

25 Sep, 2019

9 commits

  • Correct typo in comment.

    Link: http://lkml.kernel.org/r/1568233954-3913-1-git-send-email-jrdr.linux@gmail.com
    Signed-off-by: Souptick Joarder
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     
  • walk_system_ram_range() will fail with -EINVAL in case
    online_pages_range() was never called (== no resource applicable in the
    range). Otherwise, we will always call online_pages_range() with nr_pages
    > 0 and, therefore, have online_pages > 0.

    Remove that special handling.

    Link: http://lkml.kernel.org/r/20190814154109.3448-6-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Dan Williams
    Cc: Arun KS
    Cc: Bjorn Helgaas
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Nadav Amit
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
    Commit a9cd410a3d29 ("mm/page_alloc.c: memory hotplug: free pages as
    higher order") assumed that any PFN we get via memory resources is aligned
    to MAX_ORDER - 1, but I am not convinced that is always true. Let's play
    it safe, check the alignment, and fall back to single pages.

    akpm: warn in this situation so we get to find out if and why this ever
    occurs.
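
    The resulting onlining loop roughly takes this shape (a sketch under
    those assumptions, not the literal patch):

    for (pfn = start_pfn; pfn < end_pfn; pfn += 1ul << order) {
            order = min(MAX_ORDER - 1, get_order(PFN_PHYS(end_pfn - pfn)));
            /*
             * __free_pages_core() wants pfns aligned to the order; warn and
             * fall back to order-0 if a resource hands us one that is not.
             */
            if (WARN_ON_ONCE(!IS_ALIGNED(pfn, 1ul << order)))
                    order = 0;
            (*online_page_callback)(pfn_to_page(pfn), order);
    }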

    [akpm@linux-foundation.org: add WARN_ON_ONCE()]
    Link: http://lkml.kernel.org/r/20190814154109.3448-5-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Arun KS
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Dan Williams
    Cc: Bjorn Helgaas
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Nadav Amit
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • online_pages always corresponds to nr_pages. Simplify the code, getting
    rid of online_pages_blocks(). Add some comments.

    Link: http://lkml.kernel.org/r/20190814154109.3448-4-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Dan Williams
    Cc: Arun KS
    Cc: Bjorn Helgaas
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Nadav Amit
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • move_pfn_range_to_zone() will set all pages to PG_reserved via
    memmap_init_zone(). The only way a page could no longer be reserved would
    be if a MEM_GOING_ONLINE notifier would clear PG_reserved - which is not
    done (the online_page callback is used for that purpose by e.g., Hyper-V
    instead). walk_system_ram_range() will never call online_pages_range()
    with duplicate PFNs, so drop the PageReserved() check.

    This seems to be a leftover from ancient times where the memmap was
    initialized when adding memory and we wanted to check for already onlined
    memory.

    Link: http://lkml.kernel.org/r/20190814154109.3448-3-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Dan Williams
    Cc: Arun KS
    Cc: Bjorn Helgaas
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Nadav Amit
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
    When offlining a node in try_offline_node(), the pgdat is not released,
    so it can be reused later by hotadd_new_pgdat(). However,
    pgdat->per_cpu_nodestats is reallocated even when the pgdat is reused,
    leaking the previous allocation.

    This patch prevents the memory leak by just allocating per_cpu_nodestats
    when it is a new pgdat.
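
    A sketch of the idea in hotadd_new_pgdat() (illustrative, not the
    literal patch):

    pg_data_t *pgdat = NODE_DATA(nid);

    if (!pgdat) {
            pgdat = arch_alloc_nodedata(nid);
            /*
             * Only a freshly allocated pgdat gets per-cpu node stats; a
             * pgdat left behind by try_offline_node() already has them, and
             * reallocating would leak the old percpu area.
             */
            pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
    }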

    Link: http://lkml.kernel.org/r/20190813020608.10194-1-richardw.yang@linux.intel.com
    Signed-off-by: Wei Yang
    Acked-by: Michal Hocko
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Each memory block spans the same amount of sections/pages/bytes. The size
    is determined before the first memory block is created. No need to store
    what we can easily calculate - and the calculations even look simpler now.

    Michal brought up the idea of variable-sized memory blocks. However, if
    we ever implement something like this, we will need an API compatibility
    switch and reworks at various places (most code assumes a fixed memory
    block size). So let's cleanup what we have right now.

    While at it, fix the variable naming in register_mem_sect_under_node() -
    we no longer talk about a single section.

    Link: http://lkml.kernel.org/r/20190809110200.2746-1-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Pavel Tatashin
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Let's remove this indirection. We need the zone in the caller either way,
    so let's just detect it there. Add some documentation for
    move_pfn_range_to_zone() instead.

    [akpm@linux-foundation.org: restore newline, per David]
    Link: http://lkml.kernel.org/r/20190724142324.3686-1-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Cc: David Hildenbrand
    Cc: Pavel Tatashin
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Replace 1 << compound_order(page) with compound_nr(page). Minor
    improvements in readability.
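
    For example (a trivial illustration, with "head" standing for any struct
    page pointer):

    /* before */
    pfn += 1 << compound_order(head);
    /* after: same value, clearer intent */
    pfn += compound_nr(head);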

    Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

03 Aug, 2019

1 commit


19 Jul, 2019

14 commits

  • David points out that there is a mixture of 'int' and 'unsigned long'
    usage for section number data types. Update the memory hotplug path to
    use 'unsigned long' consistently for section numbers.

    [akpm@linux-foundation.org: fix printk format]
    Link: http://lkml.kernel.org/r/156107543656.1329419.11505835211949439815.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: David Hildenbrand
    Reviewed-by: David Hildenbrand
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • The libnvdimm sub-system has suffered a series of hacks and broken
    workarounds for the memory-hotplug implementation's awkward
    section-aligned (128MB) granularity.

    For example the following backtrace is emitted when attempting
    arch_add_memory() with physical address ranges that intersect 'System
    RAM' (RAM) with 'Persistent Memory' (PMEM) within a given section:

    # cat /proc/iomem | grep -A1 -B1 Persistent\ Memory
    100000000-1ffffffff : System RAM
    200000000-303ffffff : Persistent Memory (legacy)
    304000000-43fffffff : System RAM
    440000000-23ffffffff : Persistent Memory
    2400000000-43bfffffff : Persistent Memory
    2400000000-43bfffffff : namespace2.0

    WARNING: CPU: 38 PID: 928 at arch/x86/mm/init_64.c:850 add_pages+0x5c/0x60
    [..]
    RIP: 0010:add_pages+0x5c/0x60
    [..]
    Call Trace:
    devm_memremap_pages+0x460/0x6e0
    pmem_attach_disk+0x29e/0x680 [nd_pmem]
    ? nd_dax_probe+0xfc/0x120 [libnvdimm]
    nvdimm_bus_probe+0x66/0x160 [libnvdimm]

    It was discovered that the problem goes beyond RAM vs PMEM collisions, as
    some platforms produce PMEM vs PMEM collisions within a given section.
    The libnvdimm workaround for that case revealed that the libnvdimm
    section-alignment-padding implementation has been broken for a long
    while.

    A fix for that long-standing breakage introduces as many problems as it
    solves as it would require a backward-incompatible change to the
    namespace metadata interpretation. Instead of that dubious route [1],
    address the root problem in the memory-hotplug implementation.

    Note that EEXIST is no longer treated as success as that is how
    sparse_add_section() reports subsection collisions, it was also obviated
    by recent changes to perform the request_region() for 'System RAM'
    before arch_add_memory() in the add_memory() sequence.

    [1] https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com

    [osalvador@suse.de: fix deactivate_section for early sections]
    Link: http://lkml.kernel.org/r/20190715081549.32577-2-osalvador@suse.de
    Link: http://lkml.kernel.org/r/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Oscar Salvador
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: Pavel Tatashin
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Prepare the memory hot-{add,remove} paths for handling sub-section
    ranges by plumbing the starting page frame and number of pages being
    handled through arch_{add,remove}_memory() to
    sparse_{add,remove}_one_section().

    This is simply plumbing, small cleanups, and some identifier renames.
    No intended functional changes.

    Link: http://lkml.kernel.org/r/156092353780.979959.9713046515562743194.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Pavel Tatashin
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • The zone type check was a leftover from the cleanup that plumbed altmap
    through the memory hotplug path, i.e. commit da024512a1fa "mm: pass the
    vmem_altmap to arch_remove_memory and __remove_pages".

    Link: http://lkml.kernel.org/r/156092352642.979959.6664333788149363039.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Tested-by: Aneesh Kumar K.V [ppc64]
    Cc: Michal Hocko
    Cc: Logan Gunthorpe
    Cc: Pavel Tatashin
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Sub-section hotplug support reduces the unit of operation of hotplug
    from section-sized-units (PAGES_PER_SECTION) to sub-section-sized units
    (PAGES_PER_SUBSECTION). Teach shrink_{zone,pgdat}_span() to consider
    PAGES_PER_SUBSECTION boundaries as the points where pfn_valid(), not
    valid_section(), can toggle.

    [osalvador@suse.de: fix shrink_{zone,node}_span]
    Link: http://lkml.kernel.org/r/20190717090725.23618-3-osalvador@suse.de
    Link: http://lkml.kernel.org/r/156092351496.979959.12703722803097017492.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Oscar Salvador
    Reviewed-by: Pavel Tatashin
    Reviewed-by: Oscar Salvador
    Tested-by: Aneesh Kumar K.V [ppc64]
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Patch series "mm: Sub-section memory hotplug support", v10.

    The memory hotplug section is an arbitrary / convenient unit for memory
    hotplug. 'Section-size' units have bled into the user interface
    ('memblock' sysfs) and can not be changed without breaking existing
    userspace. The section-size constraint, while mostly benign for typical
    memory hotplug, has and continues to wreak havoc with 'device-memory'
    use cases, persistent memory (pmem) in particular. Recall that pmem
    uses devm_memremap_pages(), and subsequently arch_add_memory(), to
    allocate a 'struct page' memmap for pmem. However, it does not use the
    'bottom half' of memory hotplug, i.e. never marks pmem pages online and
    never exposes the userspace memblock interface for pmem. This leaves an
    opening to redress the section-size constraint.

    To date, the libnvdimm subsystem has attempted to inject padding to
    satisfy the internal constraints of arch_add_memory(). Beyond
    complicating the code, leading to bugs [2], wasting memory, and limiting
    configuration flexibility, the padding hack is broken when the platform
    changes this physical memory alignment of pmem from one boot to the
    next. Device failure (intermittent or permanent) and physical
    reconfiguration are events that can cause the platform firmware to
    change the physical placement of pmem on a subsequent boot, and device
    failure is an everyday event in a data-center.

    It turns out that sections are only a hard requirement of the
    user-facing interface for memory hotplug and with a bit more
    infrastructure sub-section arch_add_memory() support can be added for
    kernel internal usages like devm_memremap_pages(). Here is an analysis
    of the current design assumptions in the current code and how they are
    addressed in the new implementation:

    Current design assumptions:

    - Sections that describe boot memory (early sections) are never
    unplugged / removed.

    - pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y, case devolves to a
    valid_section() check

    - __add_pages() and helper routines assume all operations occur in
    PAGES_PER_SECTION units.

    - The memblock sysfs interface only comprehends full sections

    New design assumptions:

    - Sections are instrumented with a sub-section bitmask to track (on
    x86) individual 2MB sub-divisions of a 128MB section.

    - Partially populated early sections can be extended with additional
    sub-sections, and those sub-sections can be removed with
    arch_remove_memory(). With this in place we no longer lose usable
    memory capacity to padding.

    - pfn_valid() is updated to look deeper than valid_section() to also
    check the active-sub-section mask. This indication is in the same
    cacheline as the valid_section() so the performance impact is
    expected to be negligible. So far the lkp robot has not reported any
    regressions.

    - Outside of the core vmemmap population routines which are replaced,
    other helper routines like shrink_{zone,pgdat}_span() are updated to
    handle the smaller granularity. Core memory hotplug routines that
    deal with online memory are not touched.

    - The existing memblock sysfs user api guarantees / assumptions are not
    touched since this capability is limited to !online
    !memblock-sysfs-accessible sections.

    Meanwhile the issue reports continue to roll in from users that do not
    understand when and how the 128MB constraint will bite them. The current
    implementation relied on being able to support at least one misaligned
    namespace, but that immediately falls over on any moderately complex
    namespace creation attempt. Beyond the initial problem of 'System RAM'
    colliding with pmem, and the unsolvable problem of physical alignment
    changes, Linux is now being exposed to platforms that collide pmem ranges
    with other pmem ranges by default [3]. In short, devm_memremap_pages()
    has pushed the venerable section-size constraint past the breaking point,
    and the simplicity of section-aligned arch_add_memory() is no longer
    tenable.

    These patches are exposed to the kbuild robot on a subsection-v10 branch
    [4], and a preview of the unit test for this functionality is available
    on the 'subsection-pending' branch of ndctl [5].

    [2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
    [3]: https://github.com/pmem/ndctl/issues/76
    [4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10
    [5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c

    This patch (of 13):

    Towards enabling memory hotplug to track partial population of a section,
    introduce 'struct mem_section_usage'.

    A pointer to a 'struct mem_section_usage' instance replaces the existing
    pointer to a 'pageblock_flags' bitmap. Effectively it adds one more
    'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house
    a new 'subsection_map' bitmap. The new bitmap enables the memory
    hot{plug,remove} implementation to act on incremental sub-divisions of a
    section.
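
    A sketch of the new structure, following the description above (the exact
    layout in the patch may differ):

    struct mem_section_usage {
            /* one bit per sub-section, e.g. per 2MB of a 128MB x86 section */
            DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION);
            /* the pre-existing pageblock_flags (usemap) bitmap lives here */
            unsigned long pageblock_flags[0];
    };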

    SUBSECTION_SHIFT is defined as global constant instead of per-architecture
    value like SECTION_SIZE_BITS in order to allow cross-arch compatibility of
    subsection users. Specifically a common subsection size allows for the
    possibility that persistent memory namespace configurations be made
    compatible across architectures.

    The primary motivation for this functionality is to support platforms that
    mix "System RAM" and "Persistent Memory" within a single section, or
    multiple PMEM ranges with different mapping lifetimes within a single
    section. The section restriction for hotplug has caused an ongoing saga
    of hacks and bugs for devm_memremap_pages() users.

    Beyond the fixups to teach existing paths how to retrieve the 'usemap'
    from a section, and updates to usemap allocation path, there are no
    expected behavior changes.

    Link: http://lkml.kernel.org/r/156092349845.979959.73333291612799019.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Oscar Salvador
    Reviewed-by: Wei Yang
    Tested-by: Aneesh Kumar K.V [ppc64]
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: Pavel Tatashin
    Cc: David Hildenbrand
    Cc: Jérôme Glisse
    Cc: Mike Rapoport
    Cc: Jane Chu
    Cc: Pavel Tatashin
    Cc: Jonathan Corbet
    Cc: Qian Cai
    Cc: Logan Gunthorpe
    Cc: Toshi Kani
    Cc: Jeff Moyer
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Let's move walk_memory_blocks() to the place where memory block logic
    resides and simplify it. While at it, add a type for the callback
    function.
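
    A sketch of the resulting callback type and walker signature
    (illustrative; the exact prototype may differ):

    typedef int (*walk_memory_blocks_func_t)(struct memory_block *, void *);

    int walk_memory_blocks(unsigned long start, unsigned long size,
                           void *arg, walk_memory_blocks_func_t func);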

    Link: http://lkml.kernel.org/r/20190614100114.311-6-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Andrew Morton
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: David Hildenbrand
    Cc: Stephen Rothwell
    Cc: Pavel Tatashin
    Cc: Andrew Banman
    Cc: Mike Travis
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Wei Yang
    Cc: Arun KS
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • walk_memory_range() was once used to iterate over sections. Now, it
    iterates over memory blocks. Rename the function, fixup the
    documentation.

    Also, pass start+size instead of PFNs, which is what most callers
    already have at hand. (we'll rework link_mem_sections() most probably
    soon)

    Follow-up patches will rework, simplify, and move walk_memory_blocks()
    to drivers/base/memory.c.

    Note: walk_memory_blocks() only works correctly right now if the
    start_pfn is aligned to a section start. This is the case right now,
    but we'll generalize the function in a follow up patch so the semantics
    match the documentation.

    [akpm@linux-foundation.org: remove unused variable]
    Link: http://lkml.kernel.org/r/20190614100114.311-5-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Andrew Morton
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: "Rafael J. Wysocki"
    Cc: Len Brown
    Cc: Greg Kroah-Hartman
    Cc: David Hildenbrand
    Cc: Rashmica Gupta
    Cc: Pavel Tatashin
    Cc: Anshuman Khandual
    Cc: Michael Neuling
    Cc: Thomas Gleixner
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Wei Yang
    Cc: Juergen Gross
    Cc: Qian Cai
    Cc: Arun KS
    Cc: Nick Desaulniers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • The parameter is unused, so let's drop it. Memory removal paths should
    never care about zones. This is the job of memory offlining and will
    require more refactoring.

    Link: http://lkml.kernel.org/r/20190527111152.16324-12-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Dan Williams
    Reviewed-by: Wei Yang
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Alex Deucher
    Cc: Andrew Banman
    Cc: Andy Lutomirski
    Cc: Anshuman Khandual
    Cc: Ard Biesheuvel
    Cc: Arun KS
    Cc: Baoquan He
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Chintan Pandya
    Cc: Christophe Leroy
    Cc: Chris Wilson
    Cc: Dave Hansen
    Cc: "David S. Miller"
    Cc: Fenghua Yu
    Cc: Greg Kroah-Hartman
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Joonsoo Kim
    Cc: Jun Yao
    Cc: "Kirill A. Shutemov"
    Cc: Logan Gunthorpe
    Cc: Mark Brown
    Cc: Mark Rutland
    Cc: Masahiro Yamada
    Cc: Mathieu Malaterre
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: "mike.travis@hpe.com"
    Cc: Nicholas Piggin
    Cc: Paul Mackerras
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: Qian Cai
    Cc: "Rafael J. Wysocki"
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Robin Murphy
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Cc: Yu Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Let's factor out the removal of memory block devices, which is only
    necessary for memory added via add_memory() and friends that created
    memory block devices. Remove the devices before calling
    arch_remove_memory().

    This finishes factoring out memory block device handling from
    arch_add_memory() and arch_remove_memory().
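
    The resulting removal ordering, roughly (a simplified sketch; locking,
    resource handling and the altmap argument are glossed over):

        /* Sketch: the sysfs memory block devices disappear before the
         * pages and mappings do. */
        remove_memory_block_devices(start, size);
        arch_remove_memory(nid, start, size, altmap);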

    Link: http://lkml.kernel.org/r/20190527111152.16324-10-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Dan Williams
    Acked-by: Michal Hocko
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: David Hildenbrand
    Cc: "mike.travis@hpe.com"
    Cc: Andrew Banman
    Cc: Ingo Molnar
    Cc: Alex Deucher
    Cc: "David S. Miller"
    Cc: Mark Brown
    Cc: Chris Wilson
    Cc: Oscar Salvador
    Cc: Jonathan Cameron
    Cc: Arun KS
    Cc: Mathieu Malaterre
    Cc: Andy Lutomirski
    Cc: Anshuman Khandual
    Cc: Ard Biesheuvel
    Cc: Baoquan He
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Chintan Pandya
    Cc: Christophe Leroy
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Joonsoo Kim
    Cc: Jun Yao
    Cc: "Kirill A. Shutemov"
    Cc: Logan Gunthorpe
    Cc: Mark Rutland
    Cc: Masahiro Yamada
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Nicholas Piggin
    Cc: Oscar Salvador
    Cc: Paul Mackerras
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: Qian Cai
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Robin Murphy
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Wei Yang
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Cc: Yu Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • No longer needed; the callers of arch_add_memory() can handle this
    manually.

    Link: http://lkml.kernel.org/r/20190527111152.16324-9-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Wei Yang
    Acked-by: Michal Hocko
    Cc: David Hildenbrand
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Joonsoo Kim
    Cc: Qian Cai
    Cc: Arun KS
    Cc: Mathieu Malaterre
    Cc: Mike Rapoport
    Cc: Alex Deucher
    Cc: Andrew Banman
    Cc: Andy Lutomirski
    Cc: Anshuman Khandual
    Cc: Ard Biesheuvel
    Cc: Baoquan He
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Chintan Pandya
    Cc: Christophe Leroy
    Cc: Chris Wilson
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: "David S. Miller"
    Cc: Fenghua Yu
    Cc: Greg Kroah-Hartman
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Jun Yao
    Cc: "Kirill A. Shutemov"
    Cc: Logan Gunthorpe
    Cc: Mark Brown
    Cc: Mark Rutland
    Cc: Masahiro Yamada
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: "mike.travis@hpe.com"
    Cc: Nicholas Piggin
    Cc: Oscar Salvador
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Robin Murphy
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Cc: Yu Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Only memory to be added to the buddy and to be onlined/offlined by user
    space using /sys/devices/system/memory/... needs (and should have!)
    memory block devices.

    Factor out creation of memory block devices. Create all devices after
    arch_add_memory() succeeded. We can later drop the want_memblock
    parameter, because it is now effectively stale.

    Only after memory block devices have been added can memory be onlined by
    user space. This implies that memory is not visible to user space at all
    before arch_add_memory() has succeeded.

    While at it:
    - use WARN_ON_ONCE instead of BUG_ON in the moved unregister_memory()
    - introduce find_memory_block_by_id() to search via block id
    - use find_memory_block_by_id() in init_memory_block() to catch
      duplicates
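
    A rough sketch of the resulting ordering in the add path (error handling
    trimmed; the exact arch_add_memory() arguments of that kernel version are
    assumed for illustration):

        /* Sketch: memory block devices - and thus user-space visibility -
         * only appear once arch_add_memory() has succeeded. */
        ret = arch_add_memory(nid, start, size, &restrictions);
        if (ret < 0)
                goto error;

        ret = create_memory_block_devices(start, size);
        if (ret) {
                arch_remove_memory(nid, start, size, NULL);
                goto error;
        }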

    Link: http://lkml.kernel.org/r/20190527111152.16324-8-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: David Hildenbrand
    Cc: "mike.travis@hpe.com"
    Cc: Ingo Molnar
    Cc: Andrew Banman
    Cc: Oscar Salvador
    Cc: Qian Cai
    Cc: Wei Yang
    Cc: Arun KS
    Cc: Mathieu Malaterre
    Cc: Alex Deucher
    Cc: Andy Lutomirski
    Cc: Anshuman Khandual
    Cc: Ard Biesheuvel
    Cc: Baoquan He
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Chintan Pandya
    Cc: Christophe Leroy
    Cc: Chris Wilson
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: "David S. Miller"
    Cc: Fenghua Yu
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Jonathan Cameron
    Cc: Joonsoo Kim
    Cc: Jun Yao
    Cc: "Kirill A. Shutemov"
    Cc: Logan Gunthorpe
    Cc: Mark Brown
    Cc: Mark Rutland
    Cc: Masahiro Yamada
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Nicholas Piggin
    Cc: Oscar Salvador
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Robin Murphy
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Cc: Yu Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • We want to improve error handling while adding memory by allowing the use
    of arch_remove_memory() and __remove_pages() even if
    CONFIG_MEMORY_HOTREMOVE is not set, to e.g. implement something like:

    arch_add_memory()
    rc = do_something();
    if (rc) {
            arch_remove_memory();
    }

    We won't get rid of CONFIG_MEMORY_HOTREMOVE for now, as it will require
    quite some dependencies for memory offlining.

    Link: http://lkml.kernel.org/r/20190527111152.16324-7-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Pavel Tatashin
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Cc: Oscar Salvador
    Cc: "Kirill A. Shutemov"
    Cc: Alex Deucher
    Cc: "David S. Miller"
    Cc: Mark Brown
    Cc: Chris Wilson
    Cc: Christophe Leroy
    Cc: Nicholas Piggin
    Cc: Vasily Gorbik
    Cc: Rob Herring
    Cc: Masahiro Yamada
    Cc: "mike.travis@hpe.com"
    Cc: Andrew Banman
    Cc: Arun KS
    Cc: Qian Cai
    Cc: Mathieu Malaterre
    Cc: Baoquan He
    Cc: Logan Gunthorpe
    Cc: Anshuman Khandual
    Cc: Ard Biesheuvel
    Cc: Catalin Marinas
    Cc: Chintan Pandya
    Cc: Dan Williams
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Joonsoo Kim
    Cc: Jun Yao
    Cc: Mark Rutland
    Cc: Mike Rapoport
    Cc: Oscar Salvador
    Cc: Robin Murphy
    Cc: Wei Yang
    Cc: Will Deacon
    Cc: Yu Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Patch series "mm/memory_hotplug: Factor out memory block device handling", v3.

    We only want memory block devices for memory to be onlined/offlined
    (add/remove from the buddy). This is required so user space can
    online/offline memory and kdump gets notified about newly onlined
    memory.

    Let's factor out creation/removal of memory block devices. This helps to
    further clean up arch_add_memory()/arch_remove_memory() and to make
    implementing new features easier - especially Dan's sub-section memory
    hot add.

    Anshuman Khandual is currently working on arch_remove_memory(). I added
    a temporary solution via "arm64/mm: Add temporary arch_remove_memory()
    implementation", which is sufficient as a first step in the context of
    this series. (We already don't clean up page tables in case anything
    goes wrong.)

    Did a quick sanity test with DIMM plug/unplug, making sure all devices
    and sysfs links properly get added/removed. Compile tested on s390x and
    x86-64.

    This patch (of 11):

    By converting start and size to page granularity, we actually ignore
    unaligned parts within a page instead of properly bailing out with an
    error.
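
    A hedged sketch of the intended behavior (the granularity used in the
    check is an assumption for illustration; the point is to fail instead of
    silently truncating the request):

        /* Sketch: bail out with an error rather than ignoring the unaligned
         * remainder of the request. */
        if (!IS_ALIGNED(start, PAGE_SIZE) || !IS_ALIGNED(size, PAGE_SIZE))
                return -EINVAL;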

    Link: http://lkml.kernel.org/r/20190527111152.16324-2-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Dan Williams
    Reviewed-by: Wei Yang
    Reviewed-by: Pavel Tatashin
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: David Hildenbrand
    Cc: Qian Cai
    Cc: Arun KS
    Cc: Mathieu Malaterre
    Cc: Alex Deucher
    Cc: Andrew Banman
    Cc: Andy Lutomirski
    Cc: Anshuman Khandual
    Cc: Ard Biesheuvel
    Cc: Baoquan He
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Chintan Pandya
    Cc: Christophe Leroy
    Cc: Chris Wilson
    Cc: Dave Hansen
    Cc: "David S. Miller"
    Cc: Fenghua Yu
    Cc: Greg Kroah-Hartman
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Joonsoo Kim
    Cc: Jun Yao
    Cc: "Kirill A. Shutemov"
    Cc: Logan Gunthorpe
    Cc: Mark Brown
    Cc: Mark Rutland
    Cc: Masahiro Yamada
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: "mike.travis@hpe.com"
    Cc: Nicholas Piggin
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Robin Murphy
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Cc: Yu Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

17 Jul, 2019

1 commit

  • Presently the remove_memory() interface is inherently broken. It tries
    to remove memory but panics if some memory is not offline. The problem
    is that it is impossible to ensure that all memory blocks are offline as
    this function also takes lock_device_hotplug that is required to change
    memory state via sysfs.

    So, between calling this function and offlining all memory blocks there
    is always a window when lock_device_hotplug is released, and therefore,
    there is always a chance for a panic during this window.

    Make this interface return an error if memory removal fails. This way it
    is safe to call this function without panicking the machine, and it also
    makes it symmetric to add_memory(), which already returns an error.
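
    A hedged usage sketch of the changed interface (the exact signature of
    that kernel version is assumed):

        /* Sketch: callers can now react to failure instead of the kernel
         * panicking when a memory block is still online. */
        ret = remove_memory(nid, start, size);
        if (ret)
                pr_warn("removing memory range failed: %d\n", ret);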

    Link: http://lkml.kernel.org/r/20190517215438.6487-3-pasha.tatashin@soleen.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Bjorn Helgaas
    Cc: Borislav Petkov
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Dave Hansen
    Cc: Dave Jiang
    Cc: Fengguang Wu
    Cc: Huang Ying
    Cc: James Morris
    Cc: Jérôme Glisse
    Cc: Keith Busch
    Cc: Ross Zwisler
    Cc: Sasha Levin
    Cc: Takashi Iwai
    Cc: Tom Lendacky
    Cc: Vishal Verma
    Cc: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

03 Jul, 2019

1 commit


21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only
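
    For files that previously carried no license information, the change
    boils down to adding the SPDX comment at the top of each file, e.g.:

        // SPDX-License-Identifier: GPL-2.0-only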

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

2 commits

  • Patch series "mm: Randomize free memory", v10.

    This patch (of 3):

    Randomization of the page allocator improves the average utilization of
    a direct-mapped memory-side-cache. Memory side caching is a platform
    capability that Linux has been previously exposed to in HPC
    (high-performance computing) environments on specialty platforms. In
    that instance it was a smaller pool of high-bandwidth-memory relative to
    higher-capacity / lower-bandwidth DRAM. Now, this capability is going
    to be found on general purpose server platforms where DRAM is a cache in
    front of higher latency persistent memory [1].

    Robert offered an explanation of the state of the art of Linux
    interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    also reduces the chance that the buffers will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel). That's better than forcing
    users to deploy remedies like:
    "To eliminate this gradual degradation, we have added a Stream
    measurement to the Node Health Check that follows each job;
    nodes are rebooted whenever their measured memory bandwidth
    falls below 300 GB/s."

    A replacement for zonesort was merged upstream in commit cc9aec03e58f
    ("x86/numa_emulation: Introduce uniform split capability"). With this
    numa_emulation capability, memory can be split into cache sized
    ("near-memory" sized) numa nodes. A bind operation to such a node, and
    disabling workloads on other nodes, enables full cache performance.
    However, once the workload exceeds the cache size then cache conflicts
    are unavoidable. While HPC environments might be able to tolerate
    time-scheduling of cache sized workloads, for general purpose server
    platforms, the oversubscribed cache case will be the common case.

    The worst case scenario is that a server system owner benchmarks a
    workload at boot with an un-contended cache only to see that performance
    degrade over time, even below the average cache performance due to
    excessive conflicts. Randomization clips the peaks and fills in the
    valleys of cache utilization to yield steady average performance.

    Here are some performance impact details of the patches:

    1/ An Intel internal synthetic memory bandwidth measurement tool saw a
    3X speedup in a contrived case that tries to force cache conflicts. The
    contrived case used the numa_emulation capability to force an instance
    of the benchmark to be run in two of the near-memory sized numa nodes.
    If both instances were placed on the same emulated node they would fit
    and cause zero conflicts. While on separate emulated nodes without
    randomization they underutilized the cache and conflicted unnecessarily
    due to the in-order allocation per node.

    2/ A well known Java server application benchmark was run with a heap
    size that exceeded cache size by 3X. The cache conflict rate was 8%
    for the first run and degraded to 21% after page allocator aging. With
    randomization enabled the rate levelled out at 11%.

    3/ A MongoDB workload did not observe measurable difference in
    cache-conflict rates, but the overall throughput dropped by 7% with
    randomization in one case.

    4/ Mel Gorman ran his suite of performance workloads with randomization
    enabled on platforms without a memory-side-cache and saw a mix of some
    improvements and some losses [3].

    While there is potentially significant improvement for applications that
    depend on low latency access across a wide working-set, the performance
    may be negligible to negative for other workloads. For this reason the
    shuffle capability defaults to off unless a direct-mapped
    memory-side-cache is detected. Even then, the page_alloc.shuffle=0
    parameter can be specified to disable the randomization on those systems.

    Outside of memory-side-cache utilization concerns there is potentially
    security benefit from randomization. Some data exfiltration and
    return-oriented-programming attacks rely on the ability to infer the
    location of sensitive data objects. The kernel page allocator, especially
    early in system boot, has predictable first-in-first-out behavior for
    physical pages. Pages are freed in physical address order when first
    onlined.

    Quoting Kees:
    "While we already have a base-address randomization
    (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
    memory layouts would certainly be using the predictability of
    allocation ordering (i.e. for attacks where the base address isn't
    important: only the relative positions between allocated memory).
    This is common in lots of heap-style attacks. They try to gain
    control over ordering by spraying allocations, etc.

    I'd really like to see this because it gives us something similar
    to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

    While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
    caches, it leaves the vast bulk of memory to be predictably allocated in
    order.
    However, it should be noted, the concrete security benefits are hard to
    quantify, and no known CVE is mitigated by this randomization.

    Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
    a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
    are initially populated with free memory at boot and at hotplug time. Do
    this based on either the presence of a page_alloc.shuffle=Y command line
    parameter, or autodetection of a memory-side-cache (to be added in a
    follow-on patch).

    The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
    pages, where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e.
    order 10 (4MB); this trades off randomization granularity for time spent
    shuffling. MAX_ORDER-1 was chosen to be minimally invasive to the page
    allocator while still showing memory-side cache behavior improvements,
    with the expectation that the security implications of finer granularity
    randomization are mitigated by CONFIG_SLAB_FREELIST_RANDOM. The
    performance impact of the shuffling appears to be in the noise compared
    to other memory initialization work.
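
    As a self-contained illustration of the underlying technique (this is not
    the kernel's shuffle_zone() implementation; a plain PRNG stands in for
    the kernel's random helpers):

        #include <stdlib.h>

        /* Fisher-Yates: walk the array from the end and swap each element
         * with a uniformly chosen earlier (or same) position. */
        static void fisher_yates_shuffle(unsigned long *entries, size_t n)
        {
                for (size_t i = n - 1; i > 0; i--) {
                        size_t j = (size_t)rand() % (i + 1);
                        unsigned long tmp = entries[i];

                        entries[i] = entries[j];
                        entries[j] = tmp;
                }
        }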

    This initial randomization can be undone over time so a follow-on patch is
    introduced to inject entropy on page free decisions. It is reasonable to
    ask if the page free entropy is sufficient, but it is not enough due to
    the in-order initial freeing of pages. At the start of that process
    putting page1 in front of or behind page0 still keeps them close together,
    page2 is still near page1 and has a high chance of being adjacent. As
    more pages are added ordering diversity improves, but there is still high
    page locality for the low address pages and this leads to no significant
    impact to the cache conflict rate.

    [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
    [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
    [3]: https://lkml.org/lkml/2018/10/12/309

    [dan.j.williams@intel.com: fix shuffle enable]
    Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
    [cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
    Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
    Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Qian Cai
    Reviewed-by: Kees Cook
    Acked-by: Michal Hocko
    Cc: Dave Hansen
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • All callers of arch_remove_memory() ignore errors, and we should really
    try to remove any errors from the memory removal path. No more errors
    are reported from __remove_pages(). The s390x code now BUG()s in case
    arch_remove_memory() is triggered; we may implement that properly later.
    The powerpc code WARNs in case it fails to remove the section mapping,
    which is better than ignoring the error completely right now.
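
    A hedged illustration of the new policy on the removal path (the call
    site is simplified, and the function name is taken from the powerpc code
    as described above):

        /* Sketch: report a failed section-mapping removal, but do not
         * propagate the error up the now error-free removal path. */
        ret = remove_section_mapping(start, start + size);
        WARN_ON_ONCE(ret);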

    Link: http://lkml.kernel.org/r/20190409100148.24703-5-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Michal Hocko
    Cc: Mike Rapoport
    Cc: Oscar Salvador
    Cc: "Kirill A. Shutemov"
    Cc: Christophe Leroy
    Cc: Stefan Agner
    Cc: Nicholas Piggin
    Cc: Pavel Tatashin
    Cc: Vasily Gorbik
    Cc: Arun KS
    Cc: Geert Uytterhoeven
    Cc: Masahiro Yamada
    Cc: Rob Herring
    Cc: Joonsoo Kim
    Cc: Wei Yang
    Cc: Qian Cai
    Cc: Mathieu Malaterre
    Cc: Andrew Banman
    Cc: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Cc: Mike Travis
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand