17 Oct, 2020

1 commit

  • We soon want to pass flags via a new type to add_memory() and friends.
    That revealed that we currently don't guard some declarations by
    CONFIG_MEMORY_HOTPLUG.

    While some definitions could be moved to different places, let's keep it
    minimal for now and use CONFIG_MEMORY_HOTPLUG for all functions only
    compiled with CONFIG_MEMORY_HOTPLUG.

    Wrap sparse_decode_mem_map() in CONFIG_MEMORY_HOTPLUG, as it is only
    called from CONFIG_MEMORY_HOTPLUG code.

    While at it, remove the leftover declaration of allow_online_pfn_range(),
    whose implementation is no longer around, and mhp_notimplemented(), which
    is unused.
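
    As a minimal sketch of the pattern (not the exact hunk from this patch;
    the prototype shown is just an example of a hotplug-only symbol):

    #ifdef CONFIG_MEMORY_HOTPLUG
    /* only referenced from memory hotplug code */
    struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
                                       unsigned long pnum);
    #endif /* CONFIG_MEMORY_HOTPLUG */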

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Pankaj Gupta
    Cc: Baoquan He
    Cc: Wei Yang
    Cc: Anton Blanchard
    Cc: Ard Biesheuvel
    Cc: Benjamin Herrenschmidt
    Cc: Boris Ostrovsky
    Cc: Christian Borntraeger
    Cc: Dave Jiang
    Cc: Eric Biederman
    Cc: Greg Kroah-Hartman
    Cc: Haiyang Zhang
    Cc: Heiko Carstens
    Cc: Jason Gunthorpe
    Cc: Jason Wang
    Cc: Juergen Gross
    Cc: Julien Grall
    Cc: Kees Cook
    Cc: "K. Y. Srinivasan"
    Cc: Len Brown
    Cc: Leonardo Bras
    Cc: Libor Pechacek
    Cc: Michael Ellerman
    Cc: "Michael S. Tsirkin"
    Cc: Nathan Lynch
    Cc: "Oliver O'Halloran"
    Cc: Paul Mackerras
    Cc: Pingfan Liu
    Cc: "Rafael J. Wysocki"
    Cc: Roger Pau Monné
    Cc: Stefano Stabellini
    Cc: Stephen Hemminger
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vishal Verma
    Cc: Wei Liu
    Link: https://lkml.kernel.org/r/20200911103459.10306-4-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

14 Oct, 2020

1 commit

  • There are several occurrences of the following pattern:

    for_each_memblock(memory, reg) {
            start_pfn = memblock_region_memory_base_pfn(reg);
            end_pfn = memblock_region_memory_end_pfn(reg);

            /* do something with start_pfn and end_pfn */
    }

    Rather than iterating over all memblock.memory regions and querying their
    start and end PFNs each time, use the for_each_mem_pfn_range() iterator to
    get simpler and clearer code.
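
    For illustration, the converted loop becomes (a sketch with generic
    variable names, not a hunk from a specific architecture):

    for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
            /* do something with start_pfn and end_pfn */
    }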

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Reviewed-by: Baoquan He
    Acked-by: Miguel Ojeda [.clang-format]
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: Daniel Axtens
    Cc: Dave Hansen
    Cc: Emil Renner Berthing
    Cc: Hari Bathini
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Marek Szyprowski
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: https://lkml.kernel.org/r/20200818151634.14343-12-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

08 Aug, 2020

3 commits

  • After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP we have two equivalent
    functions that call memory_present() for each region in memblock.memory:
    sparse_memory_present_with_active_regions() and memblocks_present().

    Moreover, all architectures have a call to either of these functions
    preceding the call to sparse_init() and in most cases they are called
    one after the other.

    Mark the regions from memblock.memory as present during sparse_init() by
    making sparse_init() call memblocks_present(), make memblocks_present()
    and memory_present() static, and remove the redundant
    sparse_memory_present_with_active_regions() function.

    Also remove the no longer required HAVE_MEMORY_PRESENT configuration
    option.
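
    As a rough sketch of the resulting flow (simplified; details as in
    mm/sparse.c):

    void __init sparse_init(void)
    {
            /* mark all memblock.memory regions as present up front */
            memblocks_present();

            /* ... then build the per-section memmaps as before ... */
    }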

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200712083130.22919-1-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • For early sections, the memmap is handled specially even when sub-section
    hotplug is enabled: the memmap can only be populated as a whole.

    Quoted from the comment of section_activate():

    * The early init code does not consider partially populated
    * initial sections, it simply assumes that memory will never be
    * referenced. If we hot-add memory into such a section then we
    * do not need to populate the memmap and can simply reuse what
    * is already there.

    The current section_deactivate() breaks this rule: when hot-removing a
    sub-section, it depopulates the sub-section's memmap. The consequence is
    that if we hot-add this sub-section again, its memmap never gets properly
    populated.
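
    The rule we want in section_deactivate() is roughly the following (a
    simplified sketch, not necessarily the exact hunk of the fix):

    /* the memmap of early sections is always fully populated */
    if (!section_is_early)
            depopulate_section_memmap(pfn, nr_pages, altmap);
    else if (memmap)
            free_map_bootmem(memmap);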

    We can reproduce the case with the following steps:

    1. Hacking qemu to allow sub-section early section

    : diff --git a/hw/i386/pc.c b/hw/i386/pc.c
    : index 51b3050d01..c6a78d83c0 100644
    : --- a/hw/i386/pc.c
    : +++ b/hw/i386/pc.c
    : @@ -1010,7 +1010,7 @@ void pc_memory_init(PCMachineState *pcms,
    :      }
    :
    :      machine->device_memory->base =
    : -        ROUND_UP(0x100000000ULL + x86ms->above_4g_mem_size, 1 * GiB);
    : +        0x100000000ULL + x86ms->above_4g_mem_size;
    :
    :      if (pcmc->enforce_aligned_dimm) {
    :          /* size device region assuming 1G page max alignment per slot */

    2. Bootup qemu with PSE disabled and a sub-section aligned memory size

    Part of the qemu command would look like this:

    sudo x86_64-softmmu/qemu-system-x86_64 \
    --enable-kvm -cpu host,pse=off \
    -m 4160M,maxmem=20G,slots=1 \
    -smp sockets=2,cores=16 \
    -numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 \
    -machine pc,nvdimm \
    -nographic \
    -object memory-backend-ram,id=mem0,size=8G \
    -device nvdimm,id=vm0,memdev=mem0,node=0,addr=0x144000000,label-size=128k

    3. Re-config a pmem device with sub-section size in guest

    ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M

    Then you would see the following call trace:

    pmem0: detected capacity change from 0 to 16777216
    BUG: unable to handle page fault for address: ffffec73c51000b4
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 81ff8067 P4D 81ff8067 PUD 81ff7067 PMD 1437cb067 PTE 0
    Oops: 0002 [#1] SMP NOPTI
    CPU: 16 PID: 1348 Comm: ndctl Kdump: loaded Tainted: G W 5.8.0-rc2+ #24
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
    RIP: 0010:memmap_init_zone+0x154/0x1c2
    Code: 77 16 f6 40 10 02 74 10 48 03 48 08 48 89 cb 48 c1 eb 0c e9 3a ff ff ff 48 89 df 48 c1 e7 06 48f
    RSP: 0018:ffffbdc7011a39b0 EFLAGS: 00010282
    RAX: ffffec73c5100088 RBX: 0000000000144002 RCX: 0000000000144000
    RDX: 0000000000000004 RSI: 007ffe0000000000 RDI: ffffec73c5100080
    RBP: 027ffe0000000000 R08: 0000000000000001 R09: ffff9f8d38f6d708
    R10: ffffec73c0000000 R11: 0000000000000000 R12: 0000000000000004
    R13: 0000000000000001 R14: 0000000000144200 R15: 0000000000000000
    FS: 00007efe6b65d780(0000) GS:ffff9f8d3f780000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffec73c51000b4 CR3: 000000007d718000 CR4: 0000000000340ee0
    Call Trace:
    move_pfn_range_to_zone+0x128/0x150
    memremap_pages+0x4e4/0x5a0
    devm_memremap_pages+0x1e/0x60
    dev_dax_probe+0x69/0x160 [device_dax]
    really_probe+0x298/0x3c0
    driver_probe_device+0xe1/0x150
    ? driver_allows_async_probing+0x50/0x50
    bus_for_each_drv+0x7e/0xc0
    __device_attach+0xdf/0x160
    bus_probe_device+0x8e/0xa0
    device_add+0x3b9/0x740
    __devm_create_dev_dax+0x127/0x1c0
    __dax_pmem_probe+0x1f2/0x219 [dax_pmem_core]
    dax_pmem_probe+0xc/0x1b [dax_pmem]
    nvdimm_bus_probe+0x69/0x1c0 [libnvdimm]
    really_probe+0x147/0x3c0
    driver_probe_device+0xe1/0x150
    device_driver_attach+0x53/0x60
    bind_store+0xd1/0x110
    kernfs_fop_write+0xce/0x1b0
    vfs_write+0xb6/0x1a0
    ksys_write+0x5f/0xe0
    do_syscall_64+0x4d/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Acked-by: David Hildenbrand
    Cc: Oscar Salvador
    Cc: Dan Williams
    Link: http://lkml.kernel.org/r/20200625223534.18024-1-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Patch series "mm: cleanup usage of <asm/pgalloc.h>"

    Most architectures have very similar versions of pXd_alloc_one() and
    pXd_free_one() for intermediate levels of page table. These patches add
    generic versions of these functions in <asm-generic/pgalloc.h> and enable
    use of the generic functions where appropriate.
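
    For context, such a generic helper reads roughly like this (an abridged
    sketch of what the series adds to <asm-generic/pgalloc.h>):

    static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
    {
            struct page *page;
            gfp_t gfp = GFP_PGTABLE_USER;

            if (mm == &init_mm)
                    gfp = GFP_PGTABLE_KERNEL;
            page = alloc_pages(gfp, 0);
            if (!page)
                    return NULL;
            if (!pgtable_pmd_page_ctor(page)) {
                    __free_pages(page, 0);
                    return NULL;
            }
            return (pmd_t *)page_address(page);
    }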

    In addition, functions declared and defined in <asm/pgalloc.h> are
    used mostly by core mm and early mm initialization in arch and there is no
    actual reason to have <asm/pgalloc.h> included all over the place.
    The first patch in this series removes unneeded includes of
    <asm/pgalloc.h>.

    In the end it didn't work out as neatly as I hoped and moving
    pXd_alloc_track() definitions to <asm-generic/pgalloc.h> would require
    unnecessary changes to arches that have custom page table allocations, so
    I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local
    to mm/.

    This patch (of 8):

    In most cases the <asm/pgalloc.h> header is required only for allocations
    of page table memory. Most of the .c files that include that header do
    not use symbols declared in <asm/pgalloc.h> and do not require that
    header.

    As for the other header files that used to include <asm/pgalloc.h>, it is
    possible to move that include into the .c file that actually uses symbols
    from <asm/pgalloc.h> and drop the include from the header file.

    The process was somewhat automated using

    sed -i -E '/[<"]asm\/pgalloc\.h/d' \
            $(grep -L -w -f /tmp/xx \
                    $(git grep -E -l '\<p[4um]?d_(alloc|free)'))

    where /tmp/xx contains all the symbols defined in
    arch/*/include/asm/pgalloc.h.

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Acked-by: Geert Uytterhoeven [m68k]
    Cc: Abdul Haleem
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Christophe Leroy
    Cc: Joerg Roedel
    Cc: Max Filippov
    Cc: Peter Zijlstra
    Cc: Satheesh Rajendran
    Cc: Stafford Horne
    Cc: Stephen Rothwell
    Cc: Steven Rostedt
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

10 Jun, 2020

1 commit

  • Patch series "mm: consolidate definitions of page table accessors", v2.

    The low level page table accessors (pXY_index(), pXY_offset()) are
    duplicated across all architectures and sometimes more than once. For
    instance, we have 31 definitions of pgd_offset() for 25 supported
    architectures.

    Most of these definitions are actually identical and typically it boils
    down to, e.g.

    static inline unsigned long pmd_index(unsigned long address)
    {
            return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
            return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
    }

    These definitions can be shared among 90% of the arches provided
    XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

    For architectures that really need a custom version there is always
    possibility to override the generic version with the usual ifdefs magic.

    These patches introduce include/linux/pgtable.h that replaces
    include/asm-generic/pgtable.h and add the definitions of the page table
    accessors to the new header.

    This patch (of 12):

    The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
    functions involving page table manipulations, e.g. pte_alloc() and
    pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
    in the files that include <linux/mm.h>.

    The include statements in such cases are removed with a simple loop:

    for f in $(git grep -l "include <asm/pgtable.h>") ; do
            sed -i -e '/include <asm\/pgtable.h>/ d' $f
    done

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mike Rapoport
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

05 Jun, 2020

1 commit


08 Apr, 2020

5 commits

  • No functional change.

    [bhe@redhat.com: move functions into CONFIG_MEMORY_HOTPLUG ifdeffery scope]
    Link: http://lkml.kernel.org/r/20200316045804.GC3486@MiWiFi-R3L-srv
    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Cc: Wei Yang
    Cc: Dan Williams
    Cc: Pankaj Gupta
    Cc: Stephen Rothwell
    Link: http://lkml.kernel.org/r/20200312124414.439-6-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • And have check_pfn_span() gate the proper alignment and size of the
    hot-added memory region.

    Also move the code comments from inside section_deactivate() to above it.
    The comments apply to the whole function, and moving them makes the code
    cleaner.

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Dan Williams
    Cc: Pankaj Gupta
    Cc: Wei Yang
    Link: http://lkml.kernel.org/r/20200312124414.439-5-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Currently, to support subsection aligned memory region adding for pmem,
    subsection map is added to track which subsection is present.

    However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP. It means
    subsection map only makes sense when SPARSEMEM_VMEMMAP enabled. For the
    classic sparse, it's meaningless. Even worse, it may confuse people when
    checking code related to the classic sparse.

    About the classic sparse which doesn't support subsection hotplug, Dan
    said it's more because the effort and maintenance burden outweighs the
    benefit. Besides, the current 64 bit ARCHes all enable
    SPARSEMEM_VMEMMAP_ENABLE by default.

    Combining the above reasons, no need to provide subsection map and the
    relevant handling for the classic sparse. Let's remove them.

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Michal Hocko
    Cc: Pankaj Gupta
    Cc: Wei Yang
    Link: http://lkml.kernel.org/r/20200312124414.439-4-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Factor out the code which clear subsection map of one memory region from
    section_deactivate() into clear_subsection_map().

    And also add helper function is_subsection_map_empty() to check if the
    current subsection map is empty or not.
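
    A rough sketch of the two helpers (simplified; the error handling of the
    actual patch is omitted):

    static bool is_subsection_map_empty(struct mem_section *ms)
    {
            return bitmap_empty(&ms->usage->subsection_map[0],
                                SUBSECTIONS_PER_SECTION);
    }

    static int clear_subsection_map(unsigned long pfn, unsigned long nr_pages)
    {
            struct mem_section *ms = __pfn_to_section(pfn);
            unsigned long *subsection_map = &ms->usage->subsection_map[0];
            DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 };

            subsection_mask_set(map, pfn, nr_pages);
            /* clear exactly the sub-sections covered by [pfn, pfn + nr_pages) */
            bitmap_andnot(subsection_map, subsection_map, map,
                          SUBSECTIONS_PER_SECTION);
            return 0;
    }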

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Pankaj Gupta
    Cc: Dan Williams
    Cc: Michal Hocko
    Cc: Wei Yang
    Link: http://lkml.kernel.org/r/20200312124414.439-3-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Patch series "mm/hotplug: Only use subsection map for VMEMMAP", v4.

    Memory sub-section hotplug was added to fix the issue that nvdimm could be
    mapped at non-section aligned starting address. A subsection map is added
    into struct mem_section_usage to implement it.

    However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP. It means
    subsection map only makes sense when SPARSEMEM_VMEMMAP enabled. For the
    classic sparse, subsection map is meaningless and confusing.

    About the classic sparse which doesn't support subsection hotplug, Dan
    said it's more because the effort and maintenance burden outweighs the
    benefit. Besides, the current 64 bit ARCHes all enable
    SPARSEMEM_VMEMMAP_ENABLE by default.

    This patch (of 5):

    Factor out the code that fills the subsection map from section_activate()
    into fill_subsection_map(), this makes section_activate() cleaner and
    easier to follow.

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: David Hildenbrand
    Acked-by: Pankaj Gupta
    Cc: Dan Williams
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200312124414.439-2-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     

03 Apr, 2020

3 commits

  • When allocating memmap for hot added memory with the classic sparse, the
    specified 'nid' is ignored in populate_section_memmap().

    In contrast, when allocating the memmap for the classic sparse during
    boot, the node given by 'nid' is preferred. And VMEMMAP prefers the node
    of 'nid' in both the boot stage and memory hot adding. So there seems to
    be no reason not to respect the node of 'nid' for the classic sparse when
    hot adding memory.

    Use kvmalloc_node() instead so that the passed-in 'nid' is used.
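
    With that, the classic sparse hot-add path ends up roughly as follows (a
    sketch of the resulting helper):

    struct page * __meminit populate_section_memmap(unsigned long pfn,
                    unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
    {
            return kvmalloc_node(array_size(sizeof(struct page),
                                            PAGES_PER_SECTION), GFP_KERNEL, nid);
    }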

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: David Hildenbrand
    Reviewed-by: Wei Yang
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Link: http://lkml.kernel.org/r/20200316125625.GH3486@MiWiFi-R3L-srv
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • This change makes populate_section_memmap()/depopulate_section_memmap()
    much simpler.

    Suggested-by: Michal Hocko
    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Reviewed-by: Pankaj Gupta
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Wei Yang
    Acked-by: Michal Hocko
    Link: http://lkml.kernel.org/r/20200316125450.GG3486@MiWiFi-R3L-srv
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • memmap should be the address of the page struct instead of an address
    derived from the pfn.

    As mentioned by David, if system memory and devmem sit within a section,
    the mismatched address would lead kdump to dump unexpected memory.

    Since sub-section only works for SPARSEMEM_VMEMMAP, pfn_to_page() is
    valid for getting the page struct address at this point.
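
    In other words, the intent is roughly (sketch, not the literal hunk):

    /* record the struct page address, not a pfn-based value */
    memmap = pfn_to_page(pfn);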

    Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Acked-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Baoquan He
    Link: http://lkml.kernel.org/r/20200210005048.10437-1-richardw.yang@linux.intel.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     

30 Mar, 2020

1 commit

  • Fix the crash like this:

    BUG: Kernel NULL pointer dereference on read at 0x00000000
    Faulting instruction address: 0xc000000000c3447c
    Oops: Kernel access of bad area, sig: 11 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
    ...
    NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0
    LR [c000000000088354] vmemmap_free+0x144/0x320
    Call Trace:
    section_deactivate+0x220/0x240
    __remove_pages+0x118/0x170
    arch_remove_memory+0x3c/0x150
    memunmap_pages+0x1cc/0x2f0
    devm_action_release+0x30/0x50
    release_nodes+0x2f8/0x3e0
    device_release_driver_internal+0x168/0x270
    unbind_store+0x130/0x170
    drv_attr_store+0x44/0x60
    sysfs_kf_write+0x68/0x80
    kernfs_fop_write+0x100/0x290
    __vfs_write+0x3c/0x70
    vfs_write+0xcc/0x240
    ksys_write+0x7c/0x140
    system_call+0x5c/0x68

    The crash is due to NULL dereference at

    test_bit(idx, ms->usage->subsection_map);

    due to ms->usage = NULL in pfn_section_valid()

    With commit d41e2f3bd546 ("mm/hotplug: fix hot remove failure in
    SPARSEMEM|!VMEMMAP case") section_mem_map is set to NULL after
    depopulate_section_memmap(). This was done so that pfn_to_page() can work
    correctly with a kernel config that disables SPARSEMEM_VMEMMAP. With that
    config pfn_to_page() does

    __section_mem_map_addr(__sec) + __pfn;

    where

    static inline struct page *__section_mem_map_addr(struct mem_section *section)
    {
            unsigned long map = section->section_mem_map;
            map &= SECTION_MAP_MASK;
            return (struct page *)map;
    }

    Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map is
    used to check pfn validity (pfn_valid()). Since section_deactivate()
    releases mem_section->usage if a section is fully deactivated, a
    pfn_valid() check after a sub-section deactivate causes a kernel crash.

    static inline int pfn_valid(unsigned long pfn)
    {
            ...
            return early_section(ms) || pfn_section_valid(ms, pfn);
    }

    where

    static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
    {
            int idx = subsection_map_index(pfn);

            return test_bit(idx, ms->usage->subsection_map);
    }

    Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is
    freed. For architectures like ppc64 where large pages are used for
    vmemmap mapping (16MB), a specific vmemmap mapping can cover multiple
    sections. Hence before a vmemmap mapping page can be freed, the kernel
    needs to make sure there are no valid sections within that mapping.
    Clearing the section valid bit before depopulate_section_memmap() enables
    this.
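
    A simplified sketch of the change in section_deactivate() (the actual
    patch also adds a comment explaining the ordering):

    /*
     * Mark the section invalid so that valid_section() returns false
     * before the memmap is freed.
     */
    ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;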

    [aneesh.kumar@linux.ibm.com: add comment]
    Link: http://lkml.kernel.org/r/20200326133235.343616-1-aneesh.kumar@linux.ibm.com
    Link: http://lkml.kernel.org/r/20200325031914.107660-1-aneesh.kumar@linux.ibm.com
    Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
    Reported-by: Sachin Sant
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Tested-by: Sachin Sant
    Reviewed-by: Baoquan He
    Reviewed-by: Wei Yang
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Cc: Michael Ellerman
    Cc: Dan Williams
    Cc: David Hildenbrand
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc:
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

22 Mar, 2020

1 commit

  • In section_deactivate(), pfn_to_page() doesn't work any more after
    ms->section_mem_map is reset to NULL in the SPARSEMEM|!VMEMMAP case. It
    causes a hot remove failure:

    kernel BUG at mm/page_alloc.c:4806!
    invalid opcode: 0000 [#1] SMP PTI
    CPU: 3 PID: 8 Comm: kworker/u16:0 Tainted: G W 5.5.0-next-20200205+ #340
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    RIP: 0010:free_pages+0x85/0xa0
    Call Trace:
    __remove_pages+0x99/0xc0
    arch_remove_memory+0x23/0x4d
    try_remove_memory+0xc8/0x130
    __remove_memory+0xa/0x11
    acpi_memory_device_remove+0x72/0x100
    acpi_bus_trim+0x55/0x90
    acpi_device_hotplug+0x2eb/0x3d0
    acpi_hotplug_work_fn+0x1a/0x30
    process_one_work+0x1a7/0x370
    worker_thread+0x30/0x380
    kthread+0x112/0x130
    ret_from_fork+0x35/0x40

    Let's move the ->section_mem_map reset to after
    depopulate_section_memmap() to fix it.
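
    The resulting ordering in section_deactivate() is roughly (a sketch,
    using the local variable names of that function):

    if (section_is_early && memmap)
            free_map_bootmem(memmap);
    else
            depopulate_section_memmap(pfn, nr_pages, altmap);

    /* only now forget the mapping */
    if (empty)
            ms->section_mem_map = (unsigned long)NULL;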

    [akpm@linux-foundation.org: remove unneeded initialization, per David]
    Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: Pankaj Gupta
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Wei Yang
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc:
    Link: http://lkml.kernel.org/r/20200307084229.28251-2-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     

22 Feb, 2020

1 commit

  • When we use SPARSEMEM instead of SPARSEMEM_VMEMMAP, pfn_to_page()
    doesn't work before sparse_init_one_section() is called.

    This leads to a crash when hotplug memory:

    BUG: unable to handle page fault for address: 0000000006400000
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP PTI
    CPU: 3 PID: 221 Comm: kworker/u16:1 Tainted: G W 5.5.0-next-20200205+ #343
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    RIP: 0010:__memset+0x24/0x30
    Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
    RSP: 0018:ffffb43ac0373c80 EFLAGS: 00010a87
    RAX: ffffffffffffffff RBX: ffff8a1518800000 RCX: 0000000000050000
    RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000006400000
    RBP: 0000000000140000 R08: 0000000000100000 R09: 0000000006400000
    R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
    R13: 0000000000000028 R14: 0000000000000000 R15: ffff8a153ffd9280
    FS: 0000000000000000(0000) GS:ffff8a153ab00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000006400000 CR3: 0000000136fca000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    sparse_add_section+0x1c9/0x26a
    __add_pages+0xbf/0x150
    add_pages+0x12/0x60
    add_memory_resource+0xc8/0x210
    __add_memory+0x62/0xb0
    acpi_memory_device_add+0x13f/0x300
    acpi_bus_attach+0xf6/0x200
    acpi_bus_scan+0x43/0x90
    acpi_device_hotplug+0x275/0x3d0
    acpi_hotplug_work_fn+0x1a/0x30
    process_one_work+0x1a7/0x370
    worker_thread+0x30/0x380
    kthread+0x112/0x130
    ret_from_fork+0x35/0x40

    We should use the memmap pointer directly, as the code did before.
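
    Concretely, the poisoning in sparse_add_section() should operate on the
    memmap returned by section_activate(), roughly (sketch):

    /* was: page_init_poison(pfn_to_page(start_pfn), ...) */
    page_init_poison(memmap, sizeof(struct page) * nr_pages);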

    On x86 the impact is limited to x86_32 builds, or x86_64 configurations
    that override the default setting for SPARSEMEM_VMEMMAP.

    Other memory hotplug archs (arm64, ia64, and ppc) also default to
    SPARSEMEM_VMEMMAP=y.

    [dan.j.williams@intel.com: changelog update]
    [rppt@linux.ibm.com: changelog update]
    Link: http://lkml.kernel.org/r/20200219030454.4844-1-bhe@redhat.com
    Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
    Signed-off-by: Wei Yang
    Signed-off-by: Baoquan He
    Acked-by: David Hildenbrand
    Reviewed-by: Baoquan He
    Reviewed-by: Dan Williams
    Acked-by: Michal Hocko
    Cc: Mike Rapoport
    Cc: Oscar Salvador
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     

04 Feb, 2020

1 commit

  • Let's move next_present_section_nr() to the header and use the shorter
    variant from mm/page_alloc.c (the original one will also check
    "__highest_present_section_nr + 1", which is not necessary). While at
    it, make the section_nr in next_pfn() const.

    In next_pfn(), we now return section_nr_to_pfn(-1) instead of -1 once we
    exceed __highest_present_section_nr, which doesn't make a difference in
    the caller as it is big enough (>= all sane end_pfn).
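
    The shared helper ends up looking roughly like this (sketch of the
    variant that is moved to the header):

    static inline unsigned long next_present_section_nr(unsigned long section_nr)
    {
            while (++section_nr <= __highest_present_section_nr) {
                    if (present_section_nr(section_nr))
                            return section_nr;
            }
            return -1;
    }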

    Link: http://lkml.kernel.org/r/20200113144035.10848-3-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Kirill A. Shutemov
    Cc: Baoquan He
    Cc: Dan Williams
    Cc: "Jin, Zhi"
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

01 Feb, 2020

1 commit

  • After commit ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug"),
    when a mem section is fully deactivated, section_mem_map still records
    the section's start pfn, which is not used any more and will be
    reassigned during re-addition.

    In analogy with the alloc/free pattern, it is better to clear all fields
    of section_mem_map.

    Besides this, it breaks the user space tool "makedumpfile" [1], which
    assumes that a hot-removed section has its mem_map set to NULL instead of
    checking directly against the SECTION_MARKED_PRESENT bit. (It would be
    better for makedumpfile to change that assumption; that needs a separate
    patch.)

    The bug can be reproduced on IBM POWERVM by running "drmgr -c mem -r -q 5",
    triggering a crash, and saving the vmcore with makedumpfile.

    [1]: makedumpfile, commit e73016540293 ("[v1.6.7] Update version")

    Link: http://lkml.kernel.org/r/1579487594-28889-1-git-send-email-kernelfans@gmail.com
    Signed-off-by: Pingfan Liu
    Acked-by: Michal Hocko
    Acked-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Oscar Salvador
    Cc: Baoquan He
    Cc: Qian Cai
    Cc: Kazuhito Hagio
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pingfan Liu
     

14 Jan, 2020

1 commit

  • When we remove an early section, we don't free the usage map, as the
    usage maps of other sections are placed into the same page. Once the
    section is removed, it is no longer an early section (especially, the
    memmap is freed). When we re-add that section, the usage map is reused,
    however, it is no longer an early section. When removing that section
    again, we try to kfree() a usage map that was allocated during early
    boot - bad.

    Let's check against PageReserved() to see if we are dealing with a usage
    map that was allocated during boot. We could also check against
    !(PageSlab(usage_page) || PageCompound(usage_page)), but PageReserved() is
    cleaner.
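
    A sketch of the resulting check in section_deactivate() (simplified):

    /* usage maps allocated during boot live in a PageReserved() page */
    if (!PageReserved(virt_to_page(ms->usage))) {
            kfree(ms->usage);
            ms->usage = NULL;
    }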

    Can be triggered using memtrace under ppc64/powernv:

    $ mount -t debugfs none /sys/kernel/debug/
    $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
    $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
    ------------[ cut here ]------------
    kernel BUG at mm/slub.c:3969!
    Oops: Exception in kernel mode, sig: 5 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
    Modules linked in:
    CPU: 0 PID: 154 Comm: sh Not tainted 5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0 #61
    NIP kfree+0x338/0x3b0
    LR section_deactivate+0x138/0x200
    Call Trace:
    section_deactivate+0x138/0x200
    __remove_pages+0x114/0x150
    arch_remove_memory+0x3c/0x160
    try_remove_memory+0x114/0x1a0
    __remove_memory+0x20/0x40
    memtrace_enable_set+0x254/0x850
    simple_attr_write+0x138/0x160
    full_proxy_write+0x8c/0x110
    __vfs_write+0x38/0x70
    vfs_write+0x11c/0x2a0
    ksys_write+0x84/0x140
    system_call+0x5c/0x68
    ---[ end trace 4b053cbd84e0db62 ]---

    The first invocation will offline+remove memory blocks. The second
    invocation will first add+online them again, in order to offline+remove
    them again (usually we are lucky and the exact same memory blocks will
    get "reallocated").

    Tested on powernv with boot memory: The usage map will not get freed.
    Tested on x86-64 with DIMMs: The usage map will get freed.

    Using Dynamic Memory under a Power DLPAR can trigger it easily.

    Triggering removal (I assume after previously removed+re-added) of
    memory from the HMC GUI can crash the kernel with the same call trace
    and is fixed by this patch.

    Link: http://lkml.kernel.org/r/20191217104637.5509-1-david@redhat.com
    Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
    Signed-off-by: David Hildenbrand
    Tested-by: Pingfan Liu
    Cc: Dan Williams
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

02 Dec, 2019

4 commits

  • sparse_buffer_init() uses memblock_alloc_try_nid_raw() to allocate memory
    for the page management structure; if the allocation from the specified
    node fails, it falls back to allocating from other nodes.

    Normally, the page management structure will not exceed 2% of the total
    memory, but it needs a large contiguous block of memory. In most cases,
    allocation from the specified node will succeed, but it will fail on a
    node whose memory has become highly fragmented. In that case we would
    rather allocate section by section than allocate a large block of memory
    from other NUMA nodes.

    Add memblock_alloc_exact_nid_raw() for this situation, which allocates
    boot memory blocks on the exact node. If a large contiguous block
    allocation fails in sparse_buffer_init(), it will fall back to allocating
    smaller blocks, one memory section at a time.
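
    A sketch of the intended use in sparse_buffer_init() ('addr' stands for
    the usual lower address bound passed to memblock):

    sparsemap_buf = memblock_alloc_exact_nid_raw(size, section_map_size(),
                                                 addr,
                                                 MEMBLOCK_ALLOC_ACCESSIBLE,
                                                 nid);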

    Link: http://lkml.kernel.org/r/66755ea7-ab10-8882-36fd-3e02b03775d5@huawei.com
    Signed-off-by: Yunfeng Ye
    Reviewed-by: Mike Rapoport
    Cc: Wei Yang
    Cc: Oscar Salvador
    Cc: Dan Williams
    Cc: David Hildenbrand
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yunfeng Ye
     
  • Vincent has noticed [1] that there is something unusual with the memmap
    allocations going on on his platform

    : I noticed this because on my ARM64 platform, with 1 GiB of memory the
    : first [and only] section is allocated from the zeroing path while with
    : 2 GiB of memory the first 1 GiB section is allocated from the
    : non-zeroing path.

    The underlying problem is that although sparse_buffer_init() allocates
    enough memory for all sections on the node, sparse_buffer_alloc() is not
    able to consume it due to a mismatch in the expected allocation
    alignment. While the sparse_buffer_init() preallocation uses PAGE_SIZE
    alignment, the real memmap has to be aligned to section_map_size(). This
    results in a wasted initial chunk of the preallocated memmap and an
    unnecessary fallback allocation for a section.

    While we are at it, also change __populate_section_memmap() to align to
    the requested size because at least VMEMMAP has constraints to have the
    memmap properly aligned.

    [1] http://lkml.kernel.org/r/20191030131122.8256-1-vincent.whitchurch@axis.com

    [akpm@linux-foundation.org: tweak layout, per David]
    Link: http://lkml.kernel.org/r/20191119092642.31799-1-mhocko@kernel.org
    Fixes: 35fd1eb1e821 ("mm/sparse: abstract sparse buffer allocations")
    Signed-off-by: Michal Hocko
    Reported-by: Vincent Whitchurch
    Debugged-by: Vincent Whitchurch
    Acked-by: David Hildenbrand
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Building the kernel on s390 with -Og produces the following warning:

    WARNING: vmlinux.o(.text+0x28dabe): Section mismatch in reference from the function populate_section_memmap() to the function .meminit.text:__populate_section_memmap()
    The function populate_section_memmap() references
    the function __meminit __populate_section_memmap().
    This is often because populate_section_memmap lacks a __meminit
    annotation or the annotation of __populate_section_memmap is wrong.

    While -Og is not supported, in theory this might still happen with
    another compiler or on another architecture. So fix this by using the
    correct section annotations.

    [iii@linux.ibm.com: v2]
    Link: http://lkml.kernel.org/r/20191030151639.41486-1-iii@linux.ibm.com
    Link: http://lkml.kernel.org/r/20191028165549.14478-1-iii@linux.ibm.com
    Signed-off-by: Ilya Leoshkevich
    Acked-by: David Hildenbrand
    Cc: Heiko Carstens
    Cc: Vasily Gorbik
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ilya Leoshkevich
     
  • sparsemem without VMEMMAP has two allocation paths to allocate the
    memory needed for its memmap (done in sparse_mem_map_populate()).

    In one allocation path (sparse_buffer_alloc() succeeds), the memory is
    not zeroed (since it was previously allocated with
    memblock_alloc_try_nid_raw()).

    In the other allocation path (sparse_buffer_alloc() fails and
    sparse_mem_map_populate() falls back to memblock_alloc_try_nid()), the
    memory is zeroed.

    AFAICS this difference does not appear to be on purpose. If the code is
    supposed to work with non-initialized memory (__init_single_page() takes
    care of zeroing the struct pages which are actually used), we should
    consistently not zero the memory, to avoid masking bugs.

    ( I noticed this because on my ARM64 platform, with 1 GiB of memory the
    first [and only] section is allocated from the zeroing path while with
    2 GiB of memory the first 1 GiB section is allocated from the
    non-zeroing path. )

    Michal:
    "the main user visible problem is a memory wastage. The overal amount
    of memory should be small. I wouldn't call it stable material."

    Link: http://lkml.kernel.org/r/20191030131122.8256-1-vincent.whitchurch@axis.com
    Signed-off-by: Vincent Whitchurch
    Acked-by: Michal Hocko
    Acked-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Reviewed-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vincent Whitchurch
     

08 Oct, 2019

1 commit

  • We get two warnings when build kernel W=1:

    mm/shuffle.c:36:12: warning: no previous prototype for `shuffle_show' [-Wmissing-prototypes]
    mm/sparse.c:220:6: warning: no previous prototype for `subsection_mask_set' [-Wmissing-prototypes]

    Make the functions static to fix this.

    Link: http://lkml.kernel.org/r/1566978161-7293-1-git-send-email-wang.yi59@zte.com.cn
    Signed-off-by: Yi Wang
    Reviewed-by: David Hildenbrand
    Reviewed-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yi Wang
     

25 Sep, 2019

5 commits

  • There is no possibility for memmap to be NULL in the current codebase.

    This check was added in commit 95a4774d055c ("memory-hotplug: update
    mce_bad_pages when removing the memory") where memmap was originally
    inited to NULL, and only conditionally given a value.

    The code that could have passed a NULL has been removed by commit
    ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug"), so there is no
    longer a possibility that memmap can be NULL.

    Link: http://lkml.kernel.org/r/20190829035151.20975-1-alastair@d-silva.org
    Signed-off-by: Alastair D'Silva
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Cc: Mike Rapoport
    Cc: Wei Yang
    Cc: Qian Cai
    Cc: Alexander Duyck
    Cc: Logan Gunthorpe
    Cc: Baoquan He
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alastair D'Silva
     
  • Use the function written to do it instead.

    Link: http://lkml.kernel.org/r/20190827053656.32191-2-alastair@au1.ibm.com
    Signed-off-by: Alastair D'Silva
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Acked-by: Mike Rapoport
    Reviewed-by: Wei Yang
    Reviewed-by: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alastair D'Silva
     
  • __pfn_to_section() is defined as __nr_to_section(pfn_to_section_nr(pfn)).

    Since we already have section_nr, it is not necessary to derive the
    mem_section from start_pfn. By doing so, we remove one redundant
    operation.
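
    That is, roughly:

    /* was:  struct mem_section *ms = __pfn_to_section(start_pfn); */
    struct mem_section *ms = __nr_to_section(section_nr);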

    Link: http://lkml.kernel.org/r/20190809010242.29797-1-richardw.yang@linux.intel.com
    Signed-off-by: Wei Yang
    Reviewed-by: Anshuman Khandual
    Tested-by: Anshuman Khandual
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • The size argument passed into sparse_buffer_alloc() has already been
    aligned to PAGE_SIZE or PMD_SIZE.

    If the aligned size is not a power of 2 (e.g. 0x480000), PTR_ALIGN() will
    return the wrong value: it rounds up using the bitmask (size - 1), which
    only yields a multiple of size when size is a power of two. Use roundup()
    to round sparsemap_buf up to the next multiple of size.

    Link: http://lkml.kernel.org/r/20190705114826.28586-1-lecopzer.chen@mediatek.com
    Signed-off-by: Lecopzer Chen
    Signed-off-by: Mark-PK Tsai
    Cc: YJ Chiang
    Cc: Lecopzer Chen
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lecopzer Chen
     
  • sparse_buffer_alloc(size) hands out memory from sparsemap_buf after
    aligning to the requested size. However, the size is at least
    PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION) and usually larger
    than PAGE_SIZE.

    Also, sparse_buffer_fini() only frees memory between sparsemap_buf and
    sparsemap_buf_end. Since sparsemap_buf may have been advanced by
    PTR_ALIGN() first, the aligned space before sparsemap_buf is wasted and
    no one will touch it.

    In our ARM32 platform (without SPARSEMEM_VMEMMAP):
      sparse_buffer_init
        Reserve d359c000 - d3e9c000 (9M)
      sparse_buffer_alloc
        Alloc   d3a00000 - d3e80000 (4.5M)
      sparse_buffer_fini
        Free    d3e80000 - d3e9c000 (~=100k)
    The reserved memory between d359c000 - d3a00000 (~=4.4M) is unfreed.

    In an ARM64 platform (with SPARSEMEM_VMEMMAP):
      sparse_buffer_init
        Reserve ffffffc07d623000 - ffffffc07f623000 (32M)
      sparse_buffer_alloc
        Alloc   ffffffc07d800000 - ffffffc07f600000 (30M)
      sparse_buffer_fini
        Free    ffffffc07f600000 - ffffffc07f623000 (140K)
    The reserved memory between ffffffc07d623000 - ffffffc07d800000
    (~=1.9M) is unfreed.

    Let's explicitly free the redundant aligned memory.
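
    A sketch of the resulting allocator (combined with the roundup()
    alignment from the related patch above):

    void * __meminit sparse_buffer_alloc(unsigned long size)
    {
            void *ptr = NULL;

            if (sparsemap_buf) {
                    ptr = (void *) roundup((unsigned long)sparsemap_buf, size);
                    if (ptr + size > sparsemap_buf_end)
                            ptr = NULL;
                    else {
                            /* free the chunk skipped over by the alignment */
                            if ((unsigned long)(ptr - sparsemap_buf) > 0)
                                    sparse_buffer_free((unsigned long)(ptr - sparsemap_buf));
                            sparsemap_buf = ptr + size;
                    }
            }
            return ptr;
    }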

    [arnd@arndb.de: mark sparse_buffer_free as __meminit]
    Link: http://lkml.kernel.org/r/20190709185528.3251709-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20190705114730.28534-1-lecopzer.chen@mediatek.com
    Signed-off-by: Lecopzer Chen
    Signed-off-by: Mark-PK Tsai
    Signed-off-by: Arnd Bergmann
    Cc: YJ Chiang
    Cc: Lecopzer Chen
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lecopzer Chen
     

19 Jul, 2019

9 commits

  • David points out that there is a mixture of 'int' and 'unsigned long'
    usage for section number data types. Update the memory hotplug path to
    use 'unsigned long' consistently for section numbers.

    [akpm@linux-foundation.org: fix printk format]
    Link: http://lkml.kernel.org/r/156107543656.1329419.11505835211949439815.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: David Hildenbrand
    Reviewed-by: David Hildenbrand
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • The libnvdimm sub-system has suffered a series of hacks and broken
    workarounds for the memory-hotplug implementation's awkward
    section-aligned (128MB) granularity.

    For example the following backtrace is emitted when attempting
    arch_add_memory() with physical address ranges that intersect 'System
    RAM' (RAM) with 'Persistent Memory' (PMEM) within a given section:

    # cat /proc/iomem | grep -A1 -B1 Persistent\ Memory
    100000000-1ffffffff : System RAM
    200000000-303ffffff : Persistent Memory (legacy)
    304000000-43fffffff : System RAM
    440000000-23ffffffff : Persistent Memory
    2400000000-43bfffffff : Persistent Memory
    2400000000-43bfffffff : namespace2.0

    WARNING: CPU: 38 PID: 928 at arch/x86/mm/init_64.c:850 add_pages+0x5c/0x60
    [..]
    RIP: 0010:add_pages+0x5c/0x60
    [..]
    Call Trace:
    devm_memremap_pages+0x460/0x6e0
    pmem_attach_disk+0x29e/0x680 [nd_pmem]
    ? nd_dax_probe+0xfc/0x120 [libnvdimm]
    nvdimm_bus_probe+0x66/0x160 [libnvdimm]

    It was discovered that the problem goes beyond RAM vs PMEM collisions as
    some platform produce PMEM vs PMEM collisions within a given section.
    The libnvdimm workaround for that case revealed that the libnvdimm
    section-alignment-padding implementation has been broken for a long
    while.

    A fix for that long-standing breakage introduces as many problems as it
    solves as it would require a backward-incompatible change to the
    namespace metadata interpretation. Instead of that dubious route [1],
    address the root problem in the memory-hotplug implementation.

    Note that EEXIST is no longer treated as success, as that is how
    sparse_add_section() reports subsection collisions; it was also obviated
    by recent changes to perform the request_region() for 'System RAM'
    before arch_add_memory() in the add_memory() sequence.

    [1] https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com

    [osalvador@suse.de: fix deactivate_section for early sections]
    Link: http://lkml.kernel.org/r/20190715081549.32577-2-osalvador@suse.de
    Link: http://lkml.kernel.org/r/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Oscar Salvador
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: Pavel Tatashin
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Prepare the memory hot-{add,remove} paths for handling sub-section
    ranges by plumbing the starting page frame and number of pages being
    handled through arch_{add,remove}_memory() to
    sparse_{add,remove}_one_section().

    This is simply plumbing, small cleanups, and some identifier renames.
    No intended functional changes.

    Link: http://lkml.kernel.org/r/156092353780.979959.9713046515562743194.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Pavel Tatashin
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Allow sub-section sized ranges to be added to the memmap.

    populate_section_memmap() takes an explicit pfn range rather than
    assuming a full section, and those parameters are plumbed all the way
    through to vmemmap_populate(). There should be no sub-section usage in
    current deployments. New warnings are added to clarify which memmap
    allocation paths are sub-section capable.
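
    The interface change amounts to passing the range explicitly, roughly:

    struct page * __meminit __populate_section_memmap(unsigned long pfn,
                    unsigned long nr_pages, int nid, struct vmem_altmap *altmap);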

    Link: http://lkml.kernel.org/r/156092352058.979959.6551283472062305149.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Pavel Tatashin
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Cc: Logan Gunthorpe
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
    sub-section active bitmask, each bit representing a PMD_SIZE span of the
    architecture's memory hotplug section size.

    The implications of a partially populated section is that pfn_valid()
    needs to go beyond a valid_section() check and either determine that the
    section is an "early section", or read the sub-section active ranges
    from the bitmask. The expectation is that the bitmask (subsection_map)
    fits in the same cacheline as the valid_section() / early_section()
    data, so the incremental performance overhead to pfn_valid() should be
    negligible.

    The rationale for using early_section() to short-circuit the
    subsection_map check is that there are legacy code paths that use
    pfn_valid() at section granularity before validating the pfn against
    pgdat data. So, the early_section() check allows those traditional
    assumptions to persist while also permitting subsection_map to tell the
    truth for purposes of populating the unused portions of early sections
    with PMEM and other ZONE_DEVICE mappings.

    Link: http://lkml.kernel.org/r/156092350874.979959.18185938451405518285.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Qian Cai
    Tested-by: Jane Chu
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: Pavel Tatashin
    Cc: David Hildenbrand
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • In preparation for sub-section hotplug, track whether a given section
    was created during early memory initialization, or later via memory
    hotplug. This distinction is needed to maintain the coarse expectation
    that pfn_valid() returns true for any pfn within a given section even if
    that section has pages that are reserved from the page allocator.

    For example, one of the goals of subsection hotplug is to support
    cases where the system physical memory layout collides System RAM and
    PMEM within a section. Several pfn_valid() users expect to just check
    if a section is valid, but they are not careful to check if the given
    pfn is within a "System RAM" boundary and instead expect pgdat
    information to further validate the pfn.

    Rather than unwind those paths to make their pfn_valid() queries more
    precise, a follow-on patch uses the SECTION_IS_EARLY flag to maintain the
    traditional expectation that pfn_valid() returns true for all early
    sections.
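
    A sketch of the flag and its accessor (roughly as added to the section
    flags in mmzone.h):

    #define SECTION_IS_EARLY        (1UL << 3)

    static inline int early_section(struct mem_section *section)
    {
            return (section && (section->section_mem_map & SECTION_IS_EARLY));
    }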

    Link: https://lore.kernel.org/lkml/1560366952-10660-1-git-send-email-cai@lca.pw/
    Link: http://lkml.kernel.org/r/156092350358.979959.5817209875548072819.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Qian Cai
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Logan Gunthorpe
    Cc: David Hildenbrand
    Cc: Pavel Tatashin
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Patch series "mm: Sub-section memory hotplug support", v10.

    The memory hotplug section is an arbitrary / convenient unit for memory
    hotplug. 'Section-size' units have bled into the user interface
    ('memblock' sysfs) and can not be changed without breaking existing
    userspace. The section-size constraint, while mostly benign for typical
    memory hotplug, has and continues to wreak havoc with 'device-memory'
    use cases, persistent memory (pmem) in particular. Recall that pmem
    uses devm_memremap_pages(), and subsequently arch_add_memory(), to
    allocate a 'struct page' memmap for pmem. However, it does not use the
    'bottom half' of memory hotplug, i.e. never marks pmem pages online and
    never exposes the userspace memblock interface for pmem. This leaves an
    opening to redress the section-size constraint.

    To date, the libnvdimm subsystem has attempted to inject padding to
    satisfy the internal constraints of arch_add_memory(). Beyond
    complicating the code, leading to bugs [2], wasting memory, and limiting
    configuration flexibility, the padding hack is broken when the platform
    changes this physical memory alignment of pmem from one boot to the
    next. Device failure (intermittent or permanent) and physical
    reconfiguration are events that can cause the platform firmware to
    change the physical placement of pmem on a subsequent boot, and device
    failure is an everyday event in a data-center.

    It turns out that sections are only a hard requirement of the
    user-facing interface for memory hotplug and with a bit more
    infrastructure sub-section arch_add_memory() support can be added for
    kernel internal usages like devm_memremap_pages(). Here is an analysis
    of the current design assumptions in the current code and how they are
    addressed in the new implementation:

    Current design assumptions:

    - Sections that describe boot memory (early sections) are never
    unplugged / removed.

    - pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y case, devolves to a
    valid_section() check

    - __add_pages() and helper routines assume all operations occur in
    PAGES_PER_SECTION units.

    - The memblock sysfs interface only comprehends full sections

    New design assumptions:

    - Sections are instrumented with a sub-section bitmask to track (on
    x86) individual 2MB sub-divisions of a 128MB section.

    - Partially populated early sections can be extended with additional
    sub-sections, and those sub-sections can be removed with
    arch_remove_memory(). With this in place we no longer lose usable
    memory capacity to padding.

    - pfn_valid() is updated to look deeper than valid_section() to also
    check the active-sub-section mask. This indication is in the same
    cacheline as the valid_section() so the performance impact is
    expected to be negligible. So far the lkp robot has not reported any
    regressions.

    - Outside of the core vmemmap population routines which are replaced,
    other helper routines like shrink_{zone,pgdat}_span() are updated to
    handle the smaller granularity. Core memory hotplug routines that
    deal with online memory are not touched.

    - The existing memblock sysfs user api guarantees / assumptions are not
    touched since this capability is limited to !online
    !memblock-sysfs-accessible sections.

    Meanwhile the issue reports continue to roll in from users that do not
    understand when and how the 128MB constraint will bite them. The current
    implementation relied on being able to support at least one misaligned
    namespace, but that immediately falls over on any moderately complex
    namespace creation attempt. Beyond the initial problem of 'System RAM'
    colliding with pmem, and the unsolvable problem of physical alignment
    changes, Linux is now being exposed to platforms that collide pmem ranges
    with other pmem ranges by default [3]. In short, devm_memremap_pages()
    has pushed the venerable section-size constraint past the breaking point,
    and the simplicity of section-aligned arch_add_memory() is no longer
    tenable.

    These patches are exposed to the kbuild robot on a subsection-v10 branch
    [4], and a preview of the unit test for this functionality is available
    on the 'subsection-pending' branch of ndctl [5].

    [2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
    [3]: https://github.com/pmem/ndctl/issues/76
    [4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10
    [5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c

    This patch (of 13):

    Towards enabling memory hotplug to track partial population of a section,
    introduce 'struct mem_section_usage'.

    A pointer to a 'struct mem_section_usage' instance replaces the existing
    pointer to a 'pageblock_flags' bitmap. Effectively it adds one more
    'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house
    a new 'subsection_map' bitmap. The new bitmap enables the memory
    hot{plug,remove} implementation to act on incremental sub-divisions of a
    section.
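
    The new structure is roughly:

    struct mem_section_usage {
            DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION);
            /* See declaration of similar field in struct zone */
            unsigned long pageblock_flags[0];
    };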

    SUBSECTION_SHIFT is defined as a global constant instead of a
    per-architecture value like SECTION_SIZE_BITS in order to allow cross-arch
    compatibility of subsection users. Specifically, a common subsection size
    allows for the possibility that persistent memory namespace configurations
    can be made compatible across architectures.

    The primary motivation for this functionality is to support platforms that
    mix "System RAM" and "Persistent Memory" within a single section, or
    multiple PMEM ranges with different mapping lifetimes within a single
    section. The section restriction for hotplug has caused an ongoing saga
    of hacks and bugs for devm_memremap_pages() users.

    Beyond the fixups to teach existing paths how to retrieve the 'usemap'
    from a section, and updates to usemap allocation path, there are no
    expected behavior changes.

    Link: http://lkml.kernel.org/r/156092349845.979959.73333291612799019.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Oscar Salvador
    Reviewed-by: Wei Yang
    Tested-by: Aneesh Kumar K.V [ppc64]
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: Pavel Tatashin
    Cc: David Hildenbrand
    Cc: Jérôme Glisse
    Cc: Mike Rapoport
    Cc: Jane Chu
    Cc: Pavel Tatashin
    Cc: Jonathan Corbet
    Cc: Qian Cai
    Cc: Logan Gunthorpe
    Cc: Toshi Kani
    Cc: Jeff Moyer
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Patch series "mm: Further memory block device cleanups", v1.

    Some further cleanups around memory block devices. Especially, clean up
    and simplify walk_memory_range(). Including some other minor cleanups.

    This patch (of 6):

    We are using a mixture of "int" and "unsigned long". Let's make this
    consistent by using "unsigned long" everywhere. We'll do the same with
    memory block ids next.

    While at it, turn the "unsigned long i" in removable_show() into an int
    - sections_per_block is an int.

    [akpm@linux-foundation.org: s/unsigned long i/unsigned long nr/]
    [david@redhat.com: v3]
    Link: http://lkml.kernel.org/r/20190620183139.4352-2-david@redhat.com
    Link: http://lkml.kernel.org/r/20190614100114.311-2-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Andrew Morton
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Mel Gorman
    Cc: Wei Yang
    Cc: Johannes Weiner
    Cc: Arun KS
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Cc: Stephen Rothwell
    Cc: Mike Rapoport
    Cc: Baoquan He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • When NODE_NOT_IN_PAGE_FLAGS is set, we store a section's node id in
    section_to_node_table[]. For hot-added memory, however, this step is
    missed. Without this information, page_to_nid() may not give the right
    node id.

    BTW, the current online_pages() works because it leverages the nid in
    memory_block. But the granularity of the node id should be mem_section
    wide.
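
    A sketch of the helper that the hot-add path should also call
    (simplified; the real code stubs it out when NODE_NOT_IN_PAGE_FLAGS is
    not defined):

    static void set_section_nid(unsigned long section_nr, int nid)
    {
    #ifdef NODE_NOT_IN_PAGE_FLAGS
            section_to_node_table[section_nr] = nid;
    #endif
    }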

    Link: http://lkml.kernel.org/r/20190618005537.18878-1-richardw.yang@linux.intel.com
    Signed-off-by: Wei Yang
    Reviewed-by: Oscar Salvador
    Reviewed-by: David Hildenbrand
    Reviewed-by: Anshuman Khandual
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang