06 Jan, 2021
1 commit
-
commit dc2da7b45ffe954a0090f5d0310ed7b0b37d2bd2 upstream.
VMware observed a performance regression during memmap init on their
platform, and bisected to commit 73a6e474cb376 ("mm: memmap_init:
iterate over memblock regions rather that check each PFN") causing it.

Before the commit:
[0.033176] Normal zone: 1445888 pages used for memmap
[0.033176] Normal zone: 89391104 pages, LIFO batch:63
[0.035851] ACPI: PM-Timer IO Port: 0x448

With the commit:
[0.026874] Normal zone: 1445888 pages used for memmap
[0.026875] Normal zone: 89391104 pages, LIFO batch:63
[2.028450] ACPI: PM-Timer IO Port: 0x448

The root cause is that the current memmap defer init doesn't work as expected.
Before, memmap_init_zone() was used to do memmap init of one whole zone,
to initialize all low zones of one numa node, but defer memmap init of
the last zone in that numa node. However, since commit 73a6e474cb376,
function memmap_init() is adapted to iterate over memblock regions
inside one zone, then call memmap_init_zone() to do memmap init for each
region.

E.g., on VMware's system, the memory layout is as below; there are two
memory regions in node 2. The current code will mistakenly initialize the
whole 1st region [mem 0xab00000000-0xfcffffffff], then do memmap defer to
initialize only one memory section on the 2nd region [mem
0x10000000000-0x1033fffffff]. In fact, we only expect one memory
section's memmap to be initialized. That's why more time is spent.

[ 0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[ 0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
[ 0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x55ffffffff]
[ 0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x5600000000-0xaaffffffff]
[ 0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab00000000-0xfcffffffff]
[ 0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x10000000000-0x1033fffffff]

Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to pass
down the real zone end pfn so that defer_init() can use it to judge
whether deferred init should be taken zone-wide.

Link: https://lkml.kernel.org/r/20201223080811.16211-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20201223080811.16211-2-bhe@redhat.com
Fixes: 73a6e474cb376 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
Signed-off-by: Baoquan He
Reported-by: Rahul Gopakumar
Reviewed-by: Mike Rapoport
Cc: David Hildenbrand
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
30 Dec, 2020
1 commit
-
[ Upstream commit 013339df116c2ee0d796dd8bfb8f293a2030c063 ]
Since commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic
v2"), the code to check the secondary MMU's page table access bit is
broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the
secondary MMU's page table before the check. More specifically for those
secondary MMUs which unmap the memory in
mmu_notifier_invalidate_range_start(), like KVM.

However, memory reclaim is the only user of !(TTU_IGNORE_ACCESS), i.e. the
absence of TTU_IGNORE_ACCESS, and it explicitly performs the page table
access check before trying to unmap the page. So, at worst the reclaim
will miss accesses in a very short window if we remove page table access
check in unmapping code.

There is an unintended consequence of !(TTU_IGNORE_ACCESS) for memcg
reclaim. From memcg reclaim, page_referenced() only accounts the
accesses from the processes which are in the same memcg of the target page
but the unmapping code considers accesses from all processes, thus
decreasing the effectiveness of memcg reclaim.

The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping
code.

Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com
Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
Signed-off-by: Shakeel Butt
Acked-by: Johannes Weiner
Cc: Hugh Dickins
Cc: Jerome Glisse
Cc: Vlastimil Babka
Cc: Michal Hocko
Cc: Andrea Arcangeli
Cc: Dan Williams
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
23 Nov, 2020
1 commit
-
The core-mm has a default __weak implementation of phys_to_target_node()
to mirror the weak definition of memory_add_physaddr_to_nid(). That
symbol is exported for modules. However, while the export in
mm/memory_hotplug.c exported the symbol in the configuration cases of:

CONFIG_NUMA_KEEP_MEMINFO=y
CONFIG_MEMORY_HOTPLUG=y

...and:

CONFIG_NUMA_KEEP_MEMINFO=n
CONFIG_MEMORY_HOTPLUG=y

...it failed to export the symbol in the case of:

CONFIG_NUMA_KEEP_MEMINFO=y
CONFIG_MEMORY_HOTPLUG=n

Not only is that broken, but Christoph points out that the kernel should
not be exporting any __weak symbol, which means that
memory_add_physaddr_to_nid() example that phys_to_target_node() copied
is broken too.

Rework the definition of phys_to_target_node() and
memory_add_physaddr_to_nid() to not require weak symbols. Move to the
common arch override design-pattern of an asm header defining a symbol
to replace the default implementation.

The only common header that all memory_add_physaddr_to_nid()-producing
architectures implement is asm/sparsemem.h. In fact, powerpc already
defines its memory_add_physaddr_to_nid() helper in sparsemem.h.
Double-down on that observation and define phys_to_target_node() where
necessary in asm/sparsemem.h. An alternate consideration that was
discarded was to put this override in asm/numa.h, but that entangles
with the definition of MAX_NUMNODES relative to the inclusion of
linux/nodemask.h, and requires powerpc to grow a new header.

The dependency on NUMA_KEEP_MEMINFO for DEV_DAX_HMEM_DEVICES is invalid
now that the symbol is properly exported / stubbed in all combinations
of CONFIG_NUMA_KEEP_MEMINFO and CONFIG_MEMORY_HOTPLUG.

[dan.j.williams@intel.com: v4]
Link: https://lkml.kernel.org/r/160461461867.1505359.5301571728749534585.stgit@dwillia2-desk3.amr.corp.intel.com
[dan.j.williams@intel.com: powerpc: fix create_section_mapping compile warning]
Link: https://lkml.kernel.org/r/160558386174.2948926.2740149041249041764.stgit@dwillia2-desk3.amr.corp.intel.com

Fixes: a035b6bf863e ("mm/memory_hotplug: introduce default phys_to_target_node() implementation")
Reported-by: Randy Dunlap
Reported-by: Thomas Gleixner
Reported-by: kernel test robot
Reported-by: Christoph Hellwig
Signed-off-by: Dan Williams
Signed-off-by: Andrew Morton
Tested-by: Randy Dunlap
Tested-by: Thomas Gleixner
Reviewed-by: Thomas Gleixner
Reviewed-by: Christoph Hellwig
Cc: Joao Martins
Cc: Tony Luck
Cc: Fenghua Yu
Cc: Michael Ellerman
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Vishal Verma
Cc: Stephen Rothwell
Link: https://lkml.kernel.org/r/160447639846.1133764.7044090803980177548.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Linus Torvalds
19 Oct, 2020
1 commit
-
To calculate the correct node to migrate the page for hotplug, we need to
check node id of the page. Wrapper for alloc_migration_target() exists
for this purpose.

However, Vlastimil informs that all migration source pages come from a
single node. In this case, we don't need to check the node id for each
page and we don't need to re-set the target nodemask for each page by
using the wrapper. Set up the migration_target_control once and use it
for all pages.

Signed-off-by: Joonsoo Kim
Signed-off-by: Andrew Morton
Acked-by: Vlastimil Babka
Acked-by: Michal Hocko
Cc: Christoph Hellwig
Cc: Mike Kravetz
Cc: Naoya Horiguchi
Cc: Roman Gushchin
Link: http://lkml.kernel.org/r/1594622517-20681-10-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds
17 Oct, 2020
15 commits
-
As we no longer shuffle via generic_online_page() and when undoing
isolation, we can simplify the comment.

We now effectively shuffle only once (properly) when onlining new memory.
Signed-off-by: David Hildenbrand
Signed-off-by: Andrew Morton
Reviewed-by: Wei Yang
Acked-by: Michal Hocko
Cc: Alexander Duyck
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Dave Hansen
Cc: Vlastimil Babka
Cc: Wei Yang
Cc: Oscar Salvador
Cc: Mike Rapoport
Cc: Pankaj Gupta
Cc: Haiyang Zhang
Cc: "K. Y. Srinivasan"
Cc: Matthew Wilcox
Cc: Michael Ellerman
Cc: Scott Cheloha
Cc: Stephen Hemminger
Cc: Wei Liu
Link: https://lkml.kernel.org/r/20201005121534.15649-6-david@redhat.com
Signed-off-by: Linus Torvalds
-
At boot time, or when doing memory hot-add operations, if the links in
sysfs can't be created, the system is still able to run, so just report
the error in the kernel log rather than BUG_ON() and potentially make the
system unusable, because the call path can be called with locks held.

Since the number of memory blocks managed could be high, the messages are
rate limited.

As a consequence, link_mem_sections() has no status to report anymore.
Signed-off-by: Laurent Dufour
Signed-off-by: Andrew Morton
Reviewed-by: Oscar Salvador
Acked-by: Michal Hocko
Acked-by: David Hildenbrand
Cc: Greg Kroah-Hartman
Cc: Fenghua Yu
Cc: Nathan Lynch
Cc: "Rafael J . Wysocki"
Cc: Scott Cheloha
Cc: Tony Luck
Link: https://lkml.kernel.org/r/20200915094143.79181-4-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds
-
"mem" in the name already indicates the root, similar to
release_mem_region() and devm_request_mem_region(). Make it implicit.
The single caller always passes iomem_resource; other parents are not
applicable.

Suggested-by: Wei Yang
Signed-off-by: David Hildenbrand
Signed-off-by: Andrew Morton
Reviewed-by: Wei Yang
Cc: Michal Hocko
Cc: Dan Williams
Cc: Jason Gunthorpe
Cc: Kees Cook
Cc: Ard Biesheuvel
Cc: Pankaj Gupta
Cc: Baoquan He
Link: https://lkml.kernel.org/r/20200916073041.10355-1-david@redhat.com
Signed-off-by: Linus Torvalds
-
Some add_memory*() users add memory in small, contiguous memory blocks.
Examples include virtio-mem, the Hyper-V balloon, and the Xen balloon.

This can quickly result in a lot of memory resources, whereby the actual
resource boundaries are not of interest (e.g., it might be relevant for
DIMMs, exposed via /proc/iomem to user space). We really want to merge
added resources in this scenario where possible.

Let's provide a flag (MEMHP_MERGE_RESOURCE) to specify that a resource
either created within add_memory*() or passed via add_memory_resource()
shall be marked mergeable and merged with applicable siblings.

To implement that, we need a kernel/resource interface to mark selected
System RAM resources mergeable (IORESOURCE_SYSRAM_MERGEABLE) and trigger
merging.

Note: We really want to merge after the whole operation succeeded, not
directly when adding a resource to the resource tree (it would break
add_memory_resource() and require splitting resources again when the
operation failed - e.g., due to -ENOMEM).

Signed-off-by: David Hildenbrand
Signed-off-by: Andrew Morton
Reviewed-by: Pankaj Gupta
Cc: Michal Hocko
Cc: Dan Williams
Cc: Jason Gunthorpe
Cc: Kees Cook
Cc: Ard Biesheuvel
Cc: Thomas Gleixner
Cc: "K. Y. Srinivasan"
Cc: Haiyang Zhang
Cc: Stephen Hemminger
Cc: Wei Liu
Cc: Boris Ostrovsky
Cc: Juergen Gross
Cc: Stefano Stabellini
Cc: Roger Pau Monné
Cc: Julien Grall
Cc: Baoquan He
Cc: Wei Yang
Cc: Anton Blanchard
Cc: Benjamin Herrenschmidt
Cc: Christian Borntraeger
Cc: Dave Jiang
Cc: Eric Biederman
Cc: Greg Kroah-Hartman
Cc: Heiko Carstens
Cc: Jason Wang
Cc: Len Brown
Cc: Leonardo Bras
Cc: Libor Pechacek
Cc: Michael Ellerman
Cc: "Michael S. Tsirkin"
Cc: Nathan Lynch
Cc: "Oliver O'Halloran"
Cc: Paul Mackerras
Cc: Pingfan Liu
Cc: "Rafael J. Wysocki"
Cc: Vasily Gorbik
Cc: Vishal Verma
Link: https://lkml.kernel.org/r/20200911103459.10306-6-david@redhat.com
Signed-off-by: Linus Torvalds
-
We soon want to pass flags, e.g., to mark added System RAM resources
mergeable. Prepare for that.

This patch is based on a similar patch by Oscar Salvador:
https://lkml.kernel.org/r/20190625075227.15193-3-osalvador@suse.de
Signed-off-by: David Hildenbrand
Signed-off-by: Andrew Morton
Reviewed-by: Juergen Gross # Xen related part
Reviewed-by: Pankaj Gupta
Acked-by: Wei Liu
Cc: Michal Hocko
Cc: Dan Williams
Cc: Jason Gunthorpe
Cc: Baoquan He
Cc: Michael Ellerman
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: "Rafael J. Wysocki"
Cc: Len Brown
Cc: Greg Kroah-Hartman
Cc: Vishal Verma
Cc: Dave Jiang
Cc: "K. Y. Srinivasan"
Cc: Haiyang Zhang
Cc: Stephen Hemminger
Cc: Wei Liu
Cc: Heiko Carstens
Cc: Vasily Gorbik
Cc: Christian Borntraeger
Cc: David Hildenbrand
Cc: "Michael S. Tsirkin"
Cc: Jason Wang
Cc: Boris Ostrovsky
Cc: Stefano Stabellini
Cc: "Oliver O'Halloran"
Cc: Pingfan Liu
Cc: Nathan Lynch
Cc: Libor Pechacek
Cc: Anton Blanchard
Cc: Leonardo Bras
Cc: Ard Biesheuvel
Cc: Eric Biederman
Cc: Julien Grall
Cc: Kees Cook
Cc: Roger Pau Monné
Cc: Thomas Gleixner
Cc: Wei Yang
Link: https://lkml.kernel.org/r/20200911103459.10306-5-david@redhat.com
Signed-off-by: Linus Torvalds
-
IORESOURCE_MEM_DRIVER_MANAGED currently uses an unused PnP bit, which is
always set to 0 by hardware. This is far from beautiful (and confusing),
and the bit only applies to SYSRAM. So let's move it out of the
bus-specific (PnP) defined bits.

We'll add another SYSRAM-specific bit soon. If we ever need more bits for
other purposes, we can steal some from "desc", or reshuffle/regroup what
we have.

Signed-off-by: David Hildenbrand
Signed-off-by: Andrew Morton
Cc: Michal Hocko
Cc: Dan Williams
Cc: Jason Gunthorpe
Cc: Kees Cook
Cc: Ard Biesheuvel
Cc: Pankaj Gupta
Cc: Baoquan He
Cc: Wei Yang
Cc: Eric Biederman
Cc: Thomas Gleixner
Cc: Greg Kroah-Hartman
Cc: Anton Blanchard
Cc: Benjamin Herrenschmidt
Cc: Boris Ostrovsky
Cc: Christian Borntraeger
Cc: Dave Jiang
Cc: Haiyang Zhang
Cc: Heiko Carstens
Cc: Jason Wang
Cc: Juergen Gross
Cc: Julien Grall
Cc: "K. Y. Srinivasan"
Cc: Len Brown
Cc: Leonardo Bras
Cc: Libor Pechacek
Cc: Michael Ellerman
Cc: "Michael S. Tsirkin"
Cc: Nathan Lynch
Cc: "Oliver O'Halloran"
Cc: Paul Mackerras
Cc: Pingfan Liu
Cc: "Rafael J. Wysocki"
Cc: Roger Pau Monné
Cc: Stefano Stabellini
Cc: Stephen Hemminger
Cc: Vasily Gorbik
Cc: Vishal Verma
Cc: Wei Liu
Link: https://lkml.kernel.org/r/20200911103459.10306-3-david@redhat.com
Signed-off-by: Linus Torvalds
-
Patch series "selective merging of system ram resources", v4.
Some add_memory*() users add memory in small, contiguous memory blocks.
Examples include virtio-mem, the Hyper-V balloon, and the Xen balloon.

This can quickly result in a lot of memory resources, whereby the actual
resource boundaries are not of interest (e.g., it might be relevant for
DIMMs, exposed via /proc/iomem to user space). We really want to merge
added resources in this scenario where possible.

Resources are effectively stored in a list-based tree. Having a lot of
resources not only wastes memory, it also makes traversing that tree more
expensive, and makes /proc/iomem explode in size (e.g., requiring
kexec-tools to manually merge resources when creating a kdump header. The
current kexec-tools resource count limit does not allow for more than
~100GB of memory with a memory block size of 128MB on x86-64).

Let's allow selectively merging System RAM resources by specifying a new
flag for add_memory*(). Patch #5 contains a /proc/iomem example. Only
tested with virtio-mem.

This patch (of 8):
Let's make sure splitting a resource on memory hotunplug will never fail.
This will become more relevant once we merge selected System RAM resources
- then, we'll trigger that case more often on memory hotunplug.

In general, this function is already unlikely to fail. When we remove
memory, we free up quite a lot of metadata (memmap, page tables, memory
block device, etc.). The only reason it could really fail would be when
injecting allocation errors.

All other error cases inside release_mem_region_adjustable() seem to be
sanity checks for the function being abused in a different context - let's
add WARN_ON_ONCE() in these cases so we can catch them.

[natechancellor@gmail.com: fix use of ternary condition in release_mem_region_adjustable]
Link: https://lkml.kernel.org/r/20200922060748.2452056-1-natechancellor@gmail.com
Link: https://github.com/ClangBuiltLinux/linux/issues/1159

Signed-off-by: David Hildenbrand
Signed-off-by: Nathan Chancellor
Signed-off-by: Andrew Morton
Cc: Michal Hocko
Cc: Dan Williams
Cc: Jason Gunthorpe
Cc: Kees Cook
Cc: Ard Biesheuvel
Cc: Pankaj Gupta
Cc: Baoquan He
Cc: Wei Yang
Cc: Anton Blanchard
Cc: Benjamin Herrenschmidt
Cc: Boris Ostrovsky
Cc: Christian Borntraeger
Cc: Dave Jiang
Cc: Eric Biederman
Cc: Greg Kroah-Hartman
Cc: Haiyang Zhang
Cc: Heiko Carstens
Cc: Jason Wang
Cc: Juergen Gross
Cc: Julien Grall
Cc: "K. Y. Srinivasan"
Cc: Len Brown
Cc: Leonardo Bras
Cc: Libor Pechacek
Cc: Michael Ellerman
Cc: "Michael S. Tsirkin"
Cc: Nathan Lynch
Cc: "Oliver O'Halloran"
Cc: Paul Mackerras
Cc: Pingfan Liu
Cc: "Rafael J. Wysocki"
Cc: Roger Pau Monné
Cc: Stefano Stabellini
Cc: Stephen Hemminger
Cc: Thomas Gleixner
Cc: Vasily Gorbik
Cc: Vishal Verma
Cc: Wei Liu
Link: https://lkml.kernel.org/r/20200911103459.10306-2-david@redhat.com
Signed-off-by: Linus Torvalds
-
Currently, it can happen that pages are allocated (and freed) via the
buddy before we finished basic memory onlining.

For example, pages are exposed to the buddy and can be allocated before we
actually mark the sections online. Allocated pages could suddenly fail
pfn_to_online_page() checks. We had similar issues with pcp handling,
when pages are allocated+freed before we reach zone_pcp_update() in
online_pages() [1].

Instead, mark all pageblocks MIGRATE_ISOLATE, such that allocations are
impossible. Once done with the heavy lifting, use
undo_isolate_page_range() to move the pages to the MIGRATE_MOVABLE
freelist, marking them ready for allocation. Similar to offline_pages(),
we have to manually adjust zone->nr_isolate_pageblock.

[1] https://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.org
Signed-off-by: David Hildenbrand
Signed-off-by: Andrew Morton
Reviewed-by: Oscar Salvador
Acked-by: Michal Hocko
Cc: Wei Yang
Cc: Baoquan He
Cc: Pankaj Gupta
Cc: Charan Teja Reddy
Cc: Dan Williams
Cc: Fenghua Yu
Cc: Logan Gunthorpe
Cc: "Matthew Wilcox (Oracle)"
Cc: Mel Gorman
Cc: Mel Gorman
Cc: Michel Lespinasse
Cc: Mike Rapoport
Cc: Tony Luck
Link: https://lkml.kernel.org/r/20200819175957.28465-11-david@redhat.com
Signed-off-by: Linus Torvalds
-
On the memory onlining path, we want to start with MIGRATE_ISOLATE, to
un-isolate the pages after memory onlining is complete. Let's allow
passing in the migratetype.

Signed-off-by: David Hildenbrand
Signed-off-by: Andrew Morton
Reviewed-by: Oscar Salvador
Acked-by: Michal Hocko
Cc: Wei Yang
Cc: Baoquan He
Cc: Pankaj Gupta
Cc: Tony Luck
Cc: Fenghua Yu
Cc: Logan Gunthorpe
Cc: Dan Williams
Cc: Mike Rapoport
Cc: "Matthew Wilcox (Oracle)"
Cc: Michel Lespinasse
Cc: Charan Teja Reddy
Cc: Mel Gorman
Link: https://lkml.kernel.org/r/20200819175957.28465-10-david@redhat.com
Signed-off-by: Linus Torvalds
-
We don't allow offlining memory with holes, all boot memory is online,
and hotplugged memory cannot have holes.

We can now simplify onlining of pages. As we only allow onlining/offlining
full sections, and sections always span full MAX_ORDER_NR_PAGES, we can
just process MAX_ORDER - 1 pages without further special handling.

The number of onlined pages simply corresponds to the number of pages we
were requested to online.

While at it, refine the comment regarding the callback not exposing all
pages to the buddy.

Signed-off-by: David Hildenbrand
Signed-off-by: Andrew Morton
Reviewed-by: Oscar Salvador
Acked-by: Michal Hocko
Cc: Wei Yang
Cc: Baoquan He
Cc: Pankaj Gupta
Cc: Charan Teja Reddy
Cc: Dan Williams
Cc: Fenghua Yu
Cc: Logan Gunthorpe
Cc: "Matthew Wilcox (Oracle)"
Cc: Mel Gorman
Cc: Mel Gorman
Cc: Michel Lespinasse
Cc: Mike Rapoport
Cc: Tony Luck
Link: https://lkml.kernel.org/r/20200819175957.28465-8-david@redhat.com
Signed-off-by: Linus Torvalds
-
Callers no longer need the number of isolated pageblocks. Let's simplify.
Signed-off-by: David Hildenbrand
Signed-off-by: Andrew Morton
Reviewed-by: Oscar Salvador
Acked-by: Michal Hocko
Cc: Wei Yang
Cc: Baoquan He
Cc: Pankaj Gupta
Cc: Charan Teja Reddy
Cc: Dan Williams
Cc: Fenghua Yu
Cc: Logan Gunthorpe
Cc: "Matthew Wilcox (Oracle)"
Cc: Mel Gorman
Cc: Mel Gorman
Cc: Michel Lespinasse
Cc: Mike Rapoport
Cc: Tony Luck
Link: https://lkml.kernel.org/r/20200819175957.28465-7-david@redhat.com
Signed-off-by: Linus Torvalds
-
We make sure that we cannot have any memory holes right at the beginning
of offline_pages(), and we only support onlining/offlining full sections.
Both sections and pageblocks are a power of two in size, and sections
always span full pageblocks.

We can directly calculate the number of isolated pageblocks from nr_pages.
Signed-off-by: David Hildenbrand
Signed-off-by: Andrew Morton
Reviewed-by: Oscar Salvador
Acked-by: Michal Hocko
Cc: Wei Yang
Cc: Baoquan He
Cc: Pankaj Gupta
Cc: Charan Teja Reddy
Cc: Dan Williams
Cc: Fenghua Yu
Cc: Logan Gunthorpe
Cc: "Matthew Wilcox (Oracle)"
Cc: Mel Gorman
Cc: Mel Gorman
Cc: Michel Lespinasse
Cc: Mike Rapoport
Cc: Tony Luck
Link: https://lkml.kernel.org/r/20200819175957.28465-6-david@redhat.com
Signed-off-by: Linus Torvalds
-
We make sure that we cannot have any memory holes right at the beginning
of offline_pages(). We no longer need walk_system_ram_range() and can
call test_pages_isolated() and __offline_isolated_pages() directly.

offlined_pages always corresponds to nr_pages, so we can simplify that.
[akpm@linux-foundation.org: patch conflict resolution]
Signed-off-by: David Hildenbrand
Signed-off-by: Andrew Morton
Reviewed-by: Oscar Salvador
Acked-by: Michal Hocko
Cc: Wei Yang
Cc: Baoquan He
Cc: Pankaj Gupta
Cc: Charan Teja Reddy
Cc: Dan Williams
Cc: Fenghua Yu
Cc: Logan Gunthorpe
Cc: "Matthew Wilcox (Oracle)"
Cc: Mel Gorman
Cc: Mel Gorman
Cc: Michel Lespinasse
Cc: Mike Rapoport
Cc: Tony Luck
Link: https://lkml.kernel.org/r/20200819175957.28465-4-david@redhat.com
Signed-off-by: Linus Torvalds
-
Already two people (including me) tried to offline subsections, because
the function looks like it can deal with it. But we really can only
online/offline full sections that are properly aligned (e.g., we can only
mark full sections online/offline via SECTION_IS_ONLINE).

Add a simple safety net to document the restriction now. Current users
(core and powernv/memtrace) respect these restrictions.

Signed-off-by: David Hildenbrand
Signed-off-by: Andrew Morton
Reviewed-by: Oscar Salvador
Acked-by: Michal Hocko
Cc: Wei Yang
Cc: Baoquan He
Cc: Pankaj Gupta
Cc: Charan Teja Reddy
Cc: Dan Williams
Cc: Fenghua Yu
Cc: Logan Gunthorpe
Cc: "Matthew Wilcox (Oracle)"
Cc: Mel Gorman
Cc: Mel Gorman
Cc: Michel Lespinasse
Cc: Mike Rapoport
Cc: Tony Luck
Link: https://lkml.kernel.org/r/20200819175957.28465-3-david@redhat.com
Signed-off-by: Linus Torvalds
-
Patch series "mm/memory_hotplug: online_pages()/offline_pages() cleanups", v2.
These are a bunch of cleanups for online_pages()/offline_pages() and
related code, mostly getting rid of memory hole handling that is no longer
necessary. There is only a single walk_system_ram_range() call left in
offline_pages(), to make sure we don't have any memory holes. I had some
of these patches lying around for a longer time but didn't have time to
polish them.

In addition, the last patch marks all pageblocks of memory to get onlined
MIGRATE_ISOLATE, so pages that have just been exposed to the buddy cannot
get allocated before onlining is complete. Once heavy lifting is done,
the pageblocks are set to MIGRATE_MOVABLE, such that allocations are
possible.

I played with DIMMs and virtio-mem on x86-64 and didn't spot any
surprises. I verified that the number of isolated pageblocks is correctly
handled when onlining/offlining.

This patch (of 10):
There is only a single user, offline_pages(). Let's inline, to make
it look more similar to online_pages().

Signed-off-by: David Hildenbrand
Signed-off-by: Andrew Morton
Reviewed-by: Oscar Salvador
Reviewed-by: Pankaj Gupta
Acked-by: Michal Hocko
Cc: Wei Yang
Cc: Baoquan He
Cc: Pankaj Gupta
Cc: Charan Teja Reddy
Cc: Dan Williams
Cc: Fenghua Yu
Cc: Logan Gunthorpe
Cc: "Matthew Wilcox (Oracle)"
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Michel Lespinasse
Cc: Mike Rapoport
Cc: Tony Luck
Cc: Mel Gorman
Link: https://lkml.kernel.org/r/20200819175957.28465-1-david@redhat.com
Link: https://lkml.kernel.org/r/20200819175957.28465-2-david@redhat.com
Signed-off-by: Linus Torvalds
14 Oct, 2020
1 commit
-
In preparation to set a fallback value for dev_dax->target_node, introduce
generic fallback helpers for phys_to_target_node().

A generic implementation based on node-data or memblock was proposed, but
as noted by Mike:

"Here again, I would prefer to add a weak default for
phys_to_target_node() because the "generic" implementation is not really
generic.

The fallback to reserved ranges is x86 specific because on x86 most of
the reserved areas are not in memblock.memory. AFAIK, no other
architecture does this."

The info message in the generic memory_add_physaddr_to_nid()
implementation is fixed up to properly reflect that
memory_add_physaddr_to_nid() communicates "online" node info and
phys_to_target_node() indicates "target / to-be-onlined" node info.

[akpm@linux-foundation.org: fix CONFIG_MEMORY_HOTPLUG=n build]
Link: https://lkml.kernel.org/r/202008252130.7YrHIyMI%25lkp@intel.com

Signed-off-by: Dan Williams
Signed-off-by: Andrew Morton
Cc: David Hildenbrand
Cc: Mike Rapoport
Cc: Jia He
Cc: Andy Lutomirski
Cc: Ard Biesheuvel
Cc: Benjamin Herrenschmidt
Cc: Ben Skeggs
Cc: Borislav Petkov
Cc: Brice Goglin
Cc: Catalin Marinas
Cc: Daniel Vetter
Cc: Dave Hansen
Cc: Dave Jiang
Cc: David Airlie
Cc: Greg Kroah-Hartman
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: Ira Weiny
Cc: Jason Gunthorpe
Cc: Jeff Moyer
Cc: Joao Martins
Cc: Jonathan Cameron
Cc: Michael Ellerman
Cc: Paul Mackerras
Cc: Pavel Tatashin
Cc: Peter Zijlstra
Cc: Rafael J. Wysocki
Cc: Thomas Gleixner
Cc: Tom Lendacky
Cc: Vishal Verma
Cc: Wei Yang
Cc: Will Deacon
Cc: Ard Biesheuvel
Cc: Bjorn Helgaas
Cc: Boris Ostrovsky
Cc: Hulk Robot
Cc: Jason Yan
Cc: "Jérôme Glisse"
Cc: Juergen Gross
Cc: kernel test robot
Cc: Randy Dunlap
Cc: Stefano Stabellini
Cc: Vivek Goyal
Link: https://lkml.kernel.org/r/159643097768.4062302.3135192588966888630.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Linus Torvalds
27 Sep, 2020
2 commits
-
In register_mem_sect_under_node() the system_state's value is checked to
detect whether the call is made during boot time or during an hot-plug
operation. Unfortunately, that check against SYSTEM_BOOTING is wrong
because regular memory is registered at SYSTEM_SCHEDULING state. In
addition, a memory hot-plug operation can be triggered at this system
state by ACPI [1]. So checking against the system state is not
enough.

The consequence is that, on a system with interleaved node ranges like this:
Early memory node ranges
node 1: [mem 0x0000000000000000-0x000000011fffffff]
node 2: [mem 0x0000000120000000-0x000000014fffffff]
node 1: [mem 0x0000000150000000-0x00000001ffffffff]
node 0: [mem 0x0000000200000000-0x000000048fffffff]
node 2: [mem 0x0000000490000000-0x00000007ffffffff]

This can be seen on a PowerPC LPAR after multiple memory hot-plug and
hot-unplug operations are done. At the next reboot the node's memory
ranges can be interleaved and since the call to link_mem_sections() is
made in topology_init() while the system is in the SYSTEM_SCHEDULING
state, the node's id is not checked, and the sections are registered to
multiple nodes:

$ ls -l /sys/devices/system/memory/memory21/node*
total 0
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2

In that case, the system is able to boot, but if one of these memory
blocks is later hot-unplugged and then hot-plugged, the sysfs
inconsistency is detected, triggering a BUG_ON():

kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
Call Trace:
add_memory_resource+0x23c/0x340 (unreliable)
__add_memory+0x5c/0xf0
dlpar_add_lmb+0x1b4/0x500
dlpar_memory+0x1f8/0xb80
handle_dlpar_errorlog+0xc0/0x190
dlpar_store+0x198/0x4a0
kobj_attr_store+0x30/0x50
sysfs_kf_write+0x64/0x90
kernfs_fop_write+0x1b0/0x290
vfs_write+0xe8/0x290
ksys_write+0xdc/0x130
system_call_exception+0x160/0x270
system_call_common+0xf0/0x27c

This patch addresses the root cause by not relying on the system_state
value to detect whether the call is due to a hot-plug operation. An
extra parameter is added to link_mem_sections() detailing whether the
operation is due to a hot-plug operation.

[1] According to Oscar Salvador, using this qemu command line, ACPI
memory hotplug operations are raised at the SYSTEM_SCHEDULING state:

$QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
-m size=$MEM,slots=255,maxmem=4294967296k \
-numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
-object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
-object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
-object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
-object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
-object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
-object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
-object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \

Fixes: 4fbce633910e ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
Signed-off-by: Laurent Dufour
Signed-off-by: Andrew Morton
Reviewed-by: David Hildenbrand
Reviewed-by: Oscar Salvador
Acked-by: Michal Hocko
Cc: Greg Kroah-Hartman
Cc: "Rafael J. Wysocki"
Cc: Fenghua Yu
Cc: Nathan Lynch
Cc: Scott Cheloha
Cc: Tony Luck
Cc:
Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds
-
Patch series "mm: fix memory to node bad links in sysfs", v3.
Sometimes, firmware may expose interleaved memory layout like this:
Early memory node ranges
node 1: [mem 0x0000000000000000-0x000000011fffffff]
node 2: [mem 0x0000000120000000-0x000000014fffffff]
node 1: [mem 0x0000000150000000-0x00000001ffffffff]
node 0: [mem 0x0000000200000000-0x000000048fffffff]
node 2: [mem 0x0000000490000000-0x00000007ffffffff]

In that case, we can see memory blocks assigned to multiple nodes in
sysfs:

$ ls -l /sys/devices/system/memory/memory21
total 0
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2
-rw-r--r-- 1 root root 65536 Aug 24 05:27 online
-r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
-r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
drwxr-xr-x 2 root root 0 Aug 24 05:27 power
-r--r--r-- 1 root root 65536 Aug 24 05:27 removable
-rw-r--r-- 1 root root 65536 Aug 24 05:27 state
lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory
-rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
-r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones

The same applies in the node's directory, with a memory21 link in both
the node1 and node2 directories.

This is wrong but doesn't prevent the system from running. However, when
one of these memory blocks is later hot-unplugged and then hot-plugged,
the system detects an inconsistency in the sysfs layout and a
BUG_ON() is raised:

kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
Call Trace:
add_memory_resource+0x23c/0x340 (unreliable)
__add_memory+0x5c/0xf0
dlpar_add_lmb+0x1b4/0x500
dlpar_memory+0x1f8/0xb80
handle_dlpar_errorlog+0xc0/0x190
dlpar_store+0x198/0x4a0
kobj_attr_store+0x30/0x50
sysfs_kf_write+0x64/0x90
kernfs_fop_write+0x1b0/0x290
vfs_write+0xe8/0x290
ksys_write+0xdc/0x130
system_call_exception+0x160/0x270
system_call_common+0xf0/0x27c

This has been seen on PowerPC LPAR.
The root cause of this issue is that when node's memory is registered,
the range used can overlap another node's range, thus the memory block
is registered to multiple nodes in sysfs.

There are two issues here:
(a) The sysfs memory and node's layouts are broken due to these
multiple links

(b) The link errors in link_mem_sections() should not lead to a system
panic.

To address (a), register_mem_sect_under_node should not rely on the
system state to detect whether the link operation is triggered by a
hot-plug operation or not. This is addressed by patches 1 and 2 of this
series.

Issue (b) will be addressed separately.
This patch (of 2):
The memmap_context enum is used to detect whether a memory operation is
due to a hot-add operation or happening at boot time.

Make it general to the hotplug operation and rename it as
meminit_context.

There is no functional change introduced by this patch.
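As a sketch, the generalized enum could look like this (the enumerator names follow the MEMINIT_ naming of the rename; the comments and exact placement are illustrative, not quoted from the patch):

```c
#include <assert.h>

/* Illustrative sketch of the rename: the context enum is generalized
 * from memmap init to any memory-init path. */
enum meminit_context {
	MEMINIT_EARLY,		/* boot-time (early) initialization */
	MEMINIT_HOTPLUG,	/* memory hot(un)plug */
};
```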
Suggested-by: David Hildenbrand
Signed-off-by: Laurent Dufour
Signed-off-by: Andrew Morton
Reviewed-by: David Hildenbrand
Reviewed-by: Oscar Salvador
Acked-by: Michal Hocko
Cc: Greg Kroah-Hartman
Cc: "Rafael J. Wysocki"
Cc: Nathan Lynch
Cc: Scott Cheloha
Cc: Tony Luck
Cc: Fenghua Yu
Cc:
Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds
20 Sep, 2020
1 commit
-
There is a race during page offline that can lead to infinite loop:
a page never ends up on a buddy list and __offline_pages() keeps
retrying infinitely or until a termination signal is received.

Thread#1 - a new process:
load_elf_binary
begin_new_exec
exec_mmap
mmput
exit_mmap
tlb_finish_mmu
tlb_flush_mmu
release_pages
free_unref_page_list
free_unref_page_prepare
set_pcppage_migratetype(page, migratetype);
// Set page->index migration type below MIGRATE_PCPTYPES

Thread#2 - hot-removes memory
__offline_pages
start_isolate_page_range
set_migratetype_isolate
set_pageblock_migratetype(page, MIGRATE_ISOLATE);
Set migration type to MIGRATE_ISOLATE, then
drain_all_pages(zone);
// drain per-cpu page lists to buddy allocator.

Thread#1 - continues
free_unref_page_commit
migratetype = get_pcppage_migratetype(page);
// get old migration type
list_add(&page->lru, &pcp->lists[migratetype]);
// add new page to already drained pcp list

Thread#2
Never drains pcp again, and therefore gets stuck in the loop.

The fix is to try to drain the per-cpu lists again after
check_pages_isolated_cb() fails.

Fixes: c52e75935f8d ("mm: remove extra drain pages on pcp list")
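The race and the fix can be modeled in userspace; all names below are stand-ins for the kernel functions mentioned above, not the real implementations:

```c
#include <assert.h>
#include <stdbool.h>

static int pcp_count;			/* pages sitting on per-cpu lists */

static void drain_all_pages(void)	/* flush pcp lists to the buddy */
{
	pcp_count = 0;
}

static bool check_pages_isolated(void)	/* are all pages in the buddy? */
{
	return pcp_count == 0;
}

static bool offline_pages(void)
{
	drain_all_pages();
	pcp_count++;	/* racing thread frees a page onto the drained list */

	for (int retries = 0; !check_pages_isolated(); retries++) {
		if (retries > 5)
			return false;	/* without the fix: loops forever */
		drain_all_pages();	/* the fix: drain again on failure */
	}
	return true;
}
```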
Signed-off-by: Pavel Tatashin
Signed-off-by: Andrew Morton
Acked-by: David Rientjes
Acked-by: Vlastimil Babka
Acked-by: Michal Hocko
Acked-by: David Hildenbrand
Cc: Oscar Salvador
Cc: Wei Yang
Cc:
Link: https://lkml.kernel.org/r/20200903140032.380431-1-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20200904151448.100489-2-pasha.tatashin@soleen.com
Link: http://lkml.kernel.org/r/20200904070235.GA15277@dhcp22.suse.cz
Signed-off-by: Linus Torvalds
15 Aug, 2020
1 commit
-
The thp prefix is more frequently used than hpage and we should be
consistent between the various functions.

[akpm@linux-foundation.org: fix mm/migrate.c]
Signed-off-by: Matthew Wilcox (Oracle)
Signed-off-by: Andrew Morton
Reviewed-by: William Kucharski
Reviewed-by: Zi Yan
Cc: Mike Kravetz
Cc: David Hildenbrand
Cc: "Kirill A. Shutemov"
Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
Signed-off-by: Linus Torvalds
13 Aug, 2020
4 commits
-
There are some similar functions for migration target allocation. Since
there is no fundamental difference, it's better to keep just one rather
than keeping all variants. This patch implements base migration target
allocation function. In the following patches, variants will be converted
to use this function.

Changes should be mechanical but, unfortunately, there are some
differences. First, some callers' nodemask is assigned NULL, since a
NULL nodemask is considered to mean all available nodes, that is,
&node_states[N_MEMORY]. Second, for hugetlb page allocation, gfp_mask
is redefined as the regular hugetlb allocation gfp_mask plus
__GFP_THISNODE if the user-provided gfp_mask has it. This is because a
future caller of this function requires this node constraint to be set.
Lastly, if the provided nodeid is NUMA_NO_NODE, nodeid is set to the
node where the migration source lives. This helps remove simple
wrappers for setting up the nodeid.

Note that the PageHighmem() call in the previous function is changed to
open-coded "is_highmem_idx()" since it provides more readability.

[akpm@linux-foundation.org: tweak patch title, per Vlastimil]
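The NUMA_NO_NODE fallback described above can be sketched with a hypothetical helper (resolve_target_nid() is a made-up name for illustration):

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)

/* Minimal sketch of the nodeid fallback: when a caller passes
 * NUMA_NO_NODE, allocate on the node of the migration source page,
 * which removes the need for per-caller wrapper helpers. */
static int resolve_target_nid(int requested_nid, int source_nid)
{
	return requested_nid == NUMA_NO_NODE ? source_nid : requested_nid;
}
```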
[akpm@linux-foundation.org: fix typo in comment]

Signed-off-by: Joonsoo Kim
Signed-off-by: Andrew Morton
Acked-by: Vlastimil Babka
Acked-by: Michal Hocko
Cc: Christoph Hellwig
Cc: Mike Kravetz
Cc: Naoya Horiguchi
Cc: Roman Gushchin
Link: http://lkml.kernel.org/r/1594622517-20681-6-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds
-
When onlining the first memory block in a zone, the pcp lists are not
updated, thus the pcp struct will have the default setting of
->high = 0, ->batch = 1.

This means that until the second memory block in a zone (if it has one)
is onlined, the pcp lists of this zone will not contain any pages,
because pcp's ->count is always greater than ->high, and thus
free_pcppages_bulk() is called to free batch-size (= 1) pages every time
the system wants to add a page to the pcp list through
free_unref_page().

In short, the system does not benefit from the pcp lists when there is a
single onlineable memory block in a zone. Correct this by always
updating the pcp lists when a memory block is onlined.

Fixes: 1f522509c77a ("mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset")
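A userspace model (hypothetical struct and helper) of why the default ->high = 0, ->batch = 1 setting keeps the pcp list empty:

```c
#include <assert.h>

struct pcp {
	int count, high, batch;
};

/* Returns how many pages stay cached after freeing one page: with the
 * boot defaults, count > high holds after every free, so the batch is
 * immediately flushed to the buddy and nothing accumulates. */
static int pcp_cached_after_free(struct pcp *p)
{
	p->count++;			/* free_unref_page() adds a page */
	if (p->count > p->high)
		p->count -= p->batch;	/* immediate bulk free */
	return p->count;
}

/* Helper: free one page into a pcp with the given ->high, ->batch = 1. */
static int cached_with(int high)
{
	struct pcp p = { 0, high, 1 };
	return pcp_cached_after_free(&p);
}
```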
Signed-off-by: Charan Teja Reddy
Signed-off-by: Andrew Morton
Reviewed-by: David Hildenbrand
Acked-by: Vlastimil Babka
Acked-by: Michal Hocko
Cc: Vinayak Menon
Link: http://lkml.kernel.org/r/1596372896-15336-1-git-send-email-charante@codeaurora.org
Signed-off-by: Linus Torvalds
-
When check_memblock_offlined_cb() returns a failed rc (e.g. the memblock
is online at that time), mem_hotplug_begin/done is left unpaired.

Therefore a warning:
Call Trace:
percpu_up_write+0x33/0x40
try_remove_memory+0x66/0x120
? _cond_resched+0x19/0x30
remove_memory+0x2b/0x40
dev_dax_kmem_remove+0x36/0x72 [kmem]
device_release_driver_internal+0xf0/0x1c0
device_release_driver+0x12/0x20
bus_remove_device+0xe1/0x150
device_del+0x17b/0x3e0
unregister_dev_dax+0x29/0x60
devm_action_release+0x15/0x20
release_nodes+0x19a/0x1e0
devres_release_all+0x3f/0x50
device_release_driver_internal+0x100/0x1c0
driver_detach+0x4c/0x8f
bus_remove_driver+0x5c/0xd0
driver_unregister+0x31/0x50
dax_pmem_exit+0x10/0xfe0 [dax_pmem]

Fixes: f1037ec0cc8a ("mm/memory_hotplug: fix remove_memory() lockdep splat")
Signed-off-by: Jia He
Signed-off-by: Andrew Morton
Reviewed-by: David Hildenbrand
Acked-by: Michal Hocko
Acked-by: Dan Williams
Cc: [5.6+]
Cc: Andy Lutomirski
Cc: Baoquan He
Cc: Borislav Petkov
Cc: Catalin Marinas
Cc: Chuhong Yuan
Cc: Dave Hansen
Cc: Dave Jiang
Cc: Fenghua Yu
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: Jonathan Cameron
Cc: Kaly Xin
Cc: Logan Gunthorpe
Cc: Masahiro Yamada
Cc: Mike Rapoport
Cc: Peter Zijlstra
Cc: Rich Felker
Cc: Thomas Gleixner
Cc: Tony Luck
Cc: Vishal Verma
Cc: Will Deacon
Cc: Yoshinori Sato
Link: http://lkml.kernel.org/r/20200710031619.18762-3-justin.he@arm.com
Signed-off-by: Linus Torvalds
-
This is to introduce a general dummy helper. memory_add_physaddr_to_nid()
is a fallback option to get the nid in case NUMA_NO_NID is detected.

After this patch, arm64/sh/s390 can simply use the general dummy
version; PowerPC/x86/ia64 will still use their specific versions.

This is preparation for setting a fallback value for dev_dax->target_node.
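A minimal sketch of such a general dummy helper, using a weak symbol so arch code can override it (the signature is simplified here; the kernel version takes a u64 physical address):

```c
#include <assert.h>

/* Sketch of the generic fallback: a weak default returning node 0,
 * which arch-specific code (e.g. PowerPC/x86/ia64) can override with a
 * real physaddr-to-nid lookup. */
int __attribute__((weak)) memory_add_physaddr_to_nid(unsigned long long start)
{
	return 0;
}
```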
Signed-off-by: Jia He
Signed-off-by: Andrew Morton
Reviewed-by: David Hildenbrand
Cc: Dan Williams
Cc: Michal Hocko
Cc: Catalin Marinas
Cc: Will Deacon
Cc: Tony Luck
Cc: Fenghua Yu
Cc: Yoshinori Sato
Cc: Rich Felker
Cc: Dave Hansen
Cc: Andy Lutomirski
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: "H. Peter Anvin"
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Baoquan He
Cc: Chuhong Yuan
Cc: Mike Rapoport
Cc: Logan Gunthorpe
Cc: Masahiro Yamada
Cc: Jonathan Cameron
Cc: Kaly Xin
Link: http://lkml.kernel.org/r/20200710031619.18762-2-justin.he@arm.com
Signed-off-by: Linus Torvalds
08 Aug, 2020
2 commits
-
It's not completely obvious why we have to shuffle the complete zone -
introduced in commit e900a918b098 ("mm: shuffle initial free memory to
improve memory-side-cache utilization") - because some sort of shuffling
is already performed when onlining pages via __free_one_page(), placing
MAX_ORDER-1 pages either to the head or the tail of the freelist. Let's
document why we have to shuffle the complete zone when exposing larger,
contiguous physical memory areas to the buddy.

Signed-off-by: David Hildenbrand
Signed-off-by: Andrew Morton
Acked-by: Dan Williams
Acked-by: Michal Hocko
Cc: Alexander Duyck
Cc: Dan Williams
Cc: Michal Hocko
Link: http://lkml.kernel.org/r/20200624094741.9918-3-david@redhat.com
Signed-off-by: Linus Torvalds
-
The global variable "vm_total_pages" is a relic from older days. There is
only a single user that reads the variable - build_all_zonelists() - and
the first thing it does is update it.

Use a local variable in build_all_zonelists() instead and remove the
global variable.

Signed-off-by: David Hildenbrand
Signed-off-by: Andrew Morton
Reviewed-by: Wei Yang
Reviewed-by: Pankaj Gupta
Reviewed-by: Mike Rapoport
Acked-by: Michal Hocko
Cc: Johannes Weiner
Cc: Huang Ying
Cc: Minchan Kim
Link: http://lkml.kernel.org/r/20200619132410.23859-2-david@redhat.com
Signed-off-by: Linus Torvalds
26 Jun, 2020
1 commit
-
When working with very large nodes, poisoning the struct pages (for which
there will be very many) can take a very long time. If the system is
using voluntary preemptions, the software watchdog will not be able to
detect forward progress. This patch addresses this issue by offering to
give up time like __remove_pages() does. This behavior was introduced in
v5.6 with commit d33695b16a9f ("mm/memory_hotplug: poison memmap in
remove_pfn_range_from_zone()").

Alternately, init_page_poison() could do this cond_resched(), but it
seems to me that the caller of init_page_poison() is what actually knows
whether or not it should relax its own priority.

Based on Dan's notes, I think this is perfectly safe: commit f931ab479dd2
("mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}")

Aside from fixing the lockup, it is also a friendlier thing to do on
lower-core systems that might wipe out large chunks of hotplug memory
(probably not a very common case).

Fixes this kind of splat:
watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [daxctl:9922]
irq event stamp: 138450
hardirqs last enabled at (138449): [] trace_hardirqs_on_thunk+0x1a/0x1c
hardirqs last disabled at (138450): [] trace_hardirqs_off_thunk+0x1a/0x1c
softirqs last enabled at (138448): [] __do_softirq+0x347/0x456
softirqs last disabled at (138443): [] irq_exit+0x7d/0xb0
CPU: 46 PID: 9922 Comm: daxctl Not tainted 5.7.0-BEN-14238-g373c6049b336 #30
Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYXCRB1.86B.0578.D07.1902280810 02/28/2019
RIP: 0010:memset_erms+0x9/0x10
Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
Call Trace:
remove_pfn_range_from_zone+0x3a/0x380
memunmap_pages+0x17f/0x280
release_nodes+0x22a/0x260
__device_release_driver+0x172/0x220
device_driver_detach+0x3e/0xa0
unbind_store+0x113/0x130
kernfs_fop_write+0xdc/0x1c0
vfs_write+0xde/0x1d0
ksys_write+0x58/0xd0
do_syscall_64+0x5a/0x120
entry_SYSCALL_64_after_hwframe+0x49/0xb3
Built 2 zonelists, mobility grouping on. Total pages: 49050381
Policy zone: Normal
Built 3 zonelists, mobility grouping on. Total pages: 49312525
Policy zone: Normal

David said: "It really only is an issue for devmem. Ordinary
hotplugged system memory is not affected (onlined/offlined in memory
block granularity)."

Link: http://lkml.kernel.org/r/20200619231213.1160351-1-ben.widawsky@intel.com
Fixes: d33695b16a9f ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()")
Signed-off-by: Ben Widawsky
Reported-by: "Scargall, Steve"
Reported-by: Ben Widawsky
Acked-by: David Hildenbrand
Cc: Dan Williams
Cc: Vishal Verma
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 Jun, 2020
1 commit
-
Pull virtio updates from Michael Tsirkin:
- virtio-mem: paravirtualized memory hotplug
- support doorbell mapping for vdpa
- config interrupt support in ifc
- fixes all over the place
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (40 commits)
vhost/test: fix up after API change
virtio_mem: convert device block size into 64bit
virtio-mem: drop unnecessary initialization
ifcvf: implement config interrupt in IFCVF
vhost: replace -1 with VHOST_FILE_UNBIND in ioctls
vhost_vdpa: Support config interrupt in vdpa
ifcvf: ignore continuous setting same status value
virtio-mem: Don't rely on implicit compiler padding for requests
virtio-mem: Try to unplug the complete online memory block first
virtio-mem: Use -ETXTBSY as error code if the device is busy
virtio-mem: Unplug subblocks right-to-left
virtio-mem: Drop manual check for already present memory
virtio-mem: Add parent resource for all added "System RAM"
virtio-mem: Better retry handling
virtio-mem: Offline and remove completely unplugged memory blocks
mm/memory_hotplug: Introduce offline_and_remove_memory()
virtio-mem: Allow to offline partially unplugged memory blocks
mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE
virtio-mem: Paravirtualized memory hotunplug part 2
virtio-mem: Paravirtualized memory hotunplug part 1
...
05 Jun, 2020
8 commits
-
There is a typo in a comment; fix it.

s/recoreded/recorded

Signed-off-by: Ethon Paul
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Reviewed-by: Ralph Campbell
Link: http://lkml.kernel.org/r/20200410160328.13843-1-ethp@qq.com
Signed-off-by: Linus Torvalds
-
Patch series "mm/memory_hotplug: Interface to add driver-managed system
ram", v4.

kexec (via kexec_load()) currently cannot properly handle memory added
via dax/kmem, and will have similar issues with virtio-mem. kexec-tools
will currently add all memory to the fixed-up initial firmware memmap. In
case of dax/kmem, this means that - in contrast to a proper reboot - how
that persistent memory will be used can no longer be configured by the
kexec'd kernel. In case of virtio-mem it will be harmful, because that
memory might contain inaccessible pieces that require coordination with
hypervisor first.

In both cases, we want to let the driver in the kexec'd kernel handle
detecting and adding the memory, like during an ordinary reboot.
Introduce add_memory_driver_managed(). More on the semantics can be
found in patch #1.

In the future, we might want to make this behavior configurable for
dax/kmem- either by configuring it in the kernel (which would then also
allow to configure kexec_file_load()) or in kexec-tools by also adding
"System RAM (kmem)" memory from /proc/iomem to the fixed-up initial
firmware memmap.

More on the motivation can be found in [1] and [2].
[1] https://lkml.kernel.org/r/20200429160803.109056-1-david@redhat.com
[2] https://lkml.kernel.org/r/20200430102908.10107-1-david@redhat.com

This patch (of 3):
Some device drivers rely on the memory they manage not getting added to
the initial (firmware) memmap as system RAM - so it's not used as
initial system RAM by the kernel and remains under the driver's
control. While this is
the case during cold boot and after a reboot, kexec is not aware of that
and might add such memory to the initial (firmware) memmap of the kexec
kernel. We need ways to teach the kernel and userspace that this system
RAM is different.

For example, dax/kmem allows deciding at runtime if persistent memory is
to be used as system ram. Another future user is virtio-mem, which has to
coordinate with its hypervisor to deal with inaccessible parts within
memory resources.

We want to let users in the kernel (esp. kexec) but also user space
(esp. kexec-tools) know that this memory has different semantics and
needs to be handled differently:
1. Don't create entries in /sys/firmware/memmap/
2. Name the memory resource "System RAM ($DRIVER)" (exposed via
/proc/iomem) ($DRIVER might be "kmem", "virtio_mem").
3. Flag the memory resource IORESOURCE_MEM_DRIVER_MANAGED

/sys/firmware/memmap/ [1] represents the "raw firmware-provided memory
map" because "on most architectures that firmware-provided memory map is
modified afterwards by the kernel itself". The primary user is kexec on
x86-64. Since commit d96ae5309165 ("memory-hotplug: create
/sys/firmware/memmap entry for new memory"), we add all hotplugged memory
to that firmware memmap - which makes perfect sense for traditional memory
hotplug on x86-64, where real HW will also add hotplugged DIMMs to the
firmware memmap. We replicate what the "raw firmware-provided memory map"
looks like after hot(un)plug.

To keep things simple, let the user provide the full resource name instead
of only the driver name - this way, we don't have to manually
allocate/craft strings for memory resources. Also use the resource name
to make decisions, to avoid passing additional flags. In case the name
isn't "System RAM", it's special.

We don't have to worry about firmware_map_remove() on the removal path.
If there is no entry, it will simply return with -EINVAL.

We'll adapt dax/kmem in a follow-up patch.
[1] https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-firmware-memmap
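The naming rule can be modeled with a tiny helper (userspace sketch; is_driver_managed() is a made-up name):

```c
#include <assert.h>
#include <string.h>

/* Model of the naming convention: the caller passes the full resource
 * name, and any name other than plain "System RAM" marks the range as
 * special (driver-managed): no /sys/firmware/memmap entry, flagged
 * IORESOURCE_MEM_DRIVER_MANAGED. */
static int is_driver_managed(const char *resource_name)
{
	return strcmp(resource_name, "System RAM") != 0;
}
```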
Signed-off-by: David Hildenbrand
Signed-off-by: Andrew Morton
Acked-by: Pankaj Gupta
Cc: Michal Hocko
Cc: Pankaj Gupta
Cc: Wei Yang
Cc: Baoquan He
Cc: Dave Hansen
Cc: Eric Biederman
Cc: Pavel Tatashin
Cc: Dan Williams
Link: http://lkml.kernel.org/r/20200508084217.9160-1-david@redhat.com
Link: http://lkml.kernel.org/r/20200508084217.9160-3-david@redhat.com
Signed-off-by: Linus Torvalds
-
The comment in add_memory_resource() is stale: hotadd_new_pgdat() will no
longer call get_pfn_range_for_nid(), as a hotadded pgdat will simply span
no pages at all, until memory is moved to the zone/node via
move_pfn_range_to_zone() - e.g., when onlining memory blocks.

The only archs that care about memblocks for hotplugged memory (either for
iterating over all system RAM or testing for memory validity) are arm64,
s390x, and powerpc - due to CONFIG_ARCH_KEEP_MEMBLOCK. Without
CONFIG_ARCH_KEEP_MEMBLOCK, we can simply stop messing with memblocks.

Signed-off-by: David Hildenbrand
Signed-off-by: Andrew Morton
Acked-by: Mike Rapoport
Acked-by: Michal Hocko
Cc: Michal Hocko
Cc: Baoquan He
Cc: Oscar Salvador
Cc: Pankaj Gupta
Cc: Mike Rapoport
Cc: Anshuman Khandual
Link: http://lkml.kernel.org/r/20200422155353.25381-3-david@redhat.com
Signed-off-by: Linus Torvalds
-
Patch series "mm/memory_hotplug: handle memblocks only with
CONFIG_ARCH_KEEP_MEMBLOCK", v1.

A hotadded node/pgdat will span no pages at all, until memory is moved to
the zone/node via move_pfn_range_to_zone() -> resize_pgdat_range - e.g.,
when onlining memory blocks. We don't have to initialize the
node_start_pfn to the memory we are adding.

This patch (of 2):
Especially, there is an inconsistency:
- Hotplugging memory to a memory-less node with cpus: node_start_pfn == 0
- Offlining and removing last memory from a node: node_start_pfn == 0
- Hotplugging memory to a memory-less node without cpus: node_start_pfn != 0

As soon as memory is onlined, node_start_pfn is overwritten with the
actual start. E.g., when adding two DIMMs but only onlining one of both,
only that DIMM (with online memory blocks) is spanned by the node.

Currently, the validity of node_start_pfn really is linked to
node_spanned_pages != 0. With node_spanned_pages == 0 (e.g., before
onlining memory), it has no meaning.

So let's stop setting node_start_pfn, just to be overwritten via
move_pfn_range_to_zone(). This avoids confusion when looking at the code,
wondering which magic will be performed with the node_start_pfn in this
function, when hotadding a pgdat.

Signed-off-by: David Hildenbrand
Signed-off-by: Andrew Morton
Acked-by: Pankaj Gupta
Cc: Michal Hocko
Cc: Baoquan He
Cc: Oscar Salvador
Cc: Pankaj Gupta
Cc: Anshuman Khandual
Cc: Mike Rapoport
Cc: Michal Hocko
Link: http://lkml.kernel.org/r/20200422155353.25381-1-david@redhat.com
Link: http://lkml.kernel.org/r/20200422155353.25381-2-david@redhat.com
Signed-off-by: Linus Torvalds
-
Fortunately, all users of is_mem_section_removable() are gone. Get rid of
it, including some now unnecessary functions.

Signed-off-by: David Hildenbrand
Signed-off-by: Andrew Morton
Reviewed-by: Wei Yang
Reviewed-by: Baoquan He
Acked-by: Michal Hocko
Cc: Michael Ellerman
Cc: Benjamin Herrenschmidt
Cc: Oscar Salvador
Link: http://lkml.kernel.org/r/20200407135416.24093-3-david@redhat.com
Signed-off-by: Linus Torvalds
-
A misbehaving qemu created a situation where the ACPI SRAT table
advertised one fewer proximity domains than intended. The NFIT table did
describe all the expected proximity domains. This caused the device dax
driver to assign an impossible target_node to the device, and when
hotplugged as system memory, this would fail with the following signature:

BUG: kernel NULL pointer dereference, address: 0000000000000088
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 80000001767d4067 P4D 80000001767d4067 PUD 10e0c4067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 4 PID: 22737 Comm: kswapd3 Tainted: G O 5.6.0-rc5 #9
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:prepare_kswapd_sleep+0x7c/0xc0
Code: 89 df e8 87 fd ff ff 89 c2 31 c0 84 d2 74 e6 0f 1f 44 00 00 48 8b 05 fb af 7a 01 48 63 93 88 1d 01 00 48 8b 84 d0 20 0f 00 00 3b 98 88 00 00 00 75 28 f0 80 a0 80 00 00 00 fe f0 80 a3 38 20
RSP: 0018:ffffc900017a3e78 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff8881209e0000 RCX: 0000000000000000
RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8881209e0e80
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000008000
R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000003
R13: 0000000000000003 R14: 0000000000000000 R15: ffffc900017a3ec8
FS: 0000000000000000(0000) GS:ffff888318c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000088 CR3: 0000000120b50002 CR4: 00000000001606e0
Call Trace:
kswapd+0x103/0x520
kthread+0x120/0x140
ret_from_fork+0x3a/0x50

Add a check in the add_memory path to fail if the node to which we are
adding memory is not in the node_possible_map.

Signed-off-by: Vishal Verma
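A userspace model of the added check (the bitmask, node count, and error value are illustrative):

```c
#include <assert.h>

#define MAX_NUMNODES	64

/* Nodes 0-2 possible in this model (bit per node). */
static unsigned long long node_possible_map = 0x7;

/* Hypothetical sketch: refuse to add memory for a node id that is not
 * in node_possible_map, instead of dereferencing a nonexistent pgdat
 * later. */
static int add_memory_checked(int nid)
{
	if (nid < 0 || nid >= MAX_NUMNODES ||
	    !(node_possible_map & (1ULL << nid)))
		return -22;	/* -EINVAL */
	return 0;		/* proceed with the actual hot-add */
}
```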
Signed-off-by: Andrew Morton
Acked-by: David Hildenbrand
Acked-by: Michal Hocko
Cc: David Hildenbrand
Cc: Dan Williams
Cc: Dave Hansen
Link: http://lkml.kernel.org/r/20200416225438.15208-1-vishal.l.verma@intel.com
Signed-off-by: Linus Torvalds
-
virtio-mem wants to offline and remove a memory block once it unplugged
all subblocks (e.g., using alloc_contig_range()). Let's provide
an interface to do that from a driver. virtio-mem already supports
offlining partially unplugged memory blocks. Offlining a fully unplugged
memory block will not require migrating any pages. All unplugged
subblocks are PageOffline() and have a reference count of 0 - so
offlining code will simply skip them.

All we need is an interface to offline and remove the memory from kernel
module context, where we don't have access to the memory block devices
(esp. find_memory_block() and device_offline()) and the device hotplug
lock.

To keep things simple, only allow working on a single memory block.
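The single-block contract can be sketched as follows (block size and error value are illustrative; the real interface checks alignment against memory_block_size_bytes()):

```c
#include <assert.h>

#define MEMORY_BLOCK_SIZE	(128ULL << 20)	/* example block size */

/* Sketch: the interface keeps things simple by accepting exactly one
 * aligned memory block and rejecting anything else. */
static int offline_and_remove_memory(unsigned long long start,
				     unsigned long long size)
{
	if (size != MEMORY_BLOCK_SIZE || start % MEMORY_BLOCK_SIZE)
		return -22;	/* -EINVAL */
	return 0;		/* would offline, then remove, the block */
}
```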
Acked-by: Michal Hocko
Tested-by: Pankaj Gupta
Acked-by: Andrew Morton
Cc: Andrew Morton
Cc: David Hildenbrand
Cc: Oscar Salvador
Cc: Michal Hocko
Cc: Pavel Tatashin
Cc: Wei Yang
Cc: Dan Williams
Cc: Qian Cai
Signed-off-by: David Hildenbrand
Link: https://lore.kernel.org/r/20200507140139.17083-9-david@redhat.com
Signed-off-by: Michael S. Tsirkin
-
virtio-mem wants to allow offlining memory blocks of which some parts
were unplugged (allocated via alloc_contig_range()), especially, to later
offline and remove completely unplugged memory blocks. The important part
is that PageOffline() has to remain set until the section is offline, so
these pages will never get accessed (e.g., when dumping). The pages should
not be handed back to the buddy (which would require clearing PageOffline()
and result in issues if offlining fails and the pages are suddenly in the
buddy).

Let's allow that by permitting isolation of any PageOffline() page
when offlining. This way, we can reach the memory hotplug notifier
MEM_GOING_OFFLINE, where the driver can signal that it is fine with
offlining this page by dropping its reference count. PageOffline() pages
with a reference count of 0 can then be skipped when offlining the
pages (as if they were free, although they are not in the buddy).

Anybody who uses PageOffline() pages and does not agree to offline them
(e.g., Hyper-V balloon, XEN balloon, VMWare balloon for 2MB pages) will not
decrement the reference count and make offlining fail when trying to
migrate such an unmovable page. So there should be no observable change.
Same applies to balloon compaction users (movable PageOffline() pages), the
pages will simply be migrated.

Note 1: If offlining fails, a driver has to increment the reference
count again in MEM_CANCEL_OFFLINE.

Note 2: A driver that makes use of this has to be aware that re-onlining
the memory block has to be handled by hooking into onlining code
(online_page_callback_t), resetting the page PageOffline() and
not giving them to the buddy.

Reviewed-by: Alexander Duyck
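The refcount protocol can be modeled in userspace; all types and helpers below are illustrative stand-ins, not the kernel implementation:

```c
#include <assert.h>

struct page {
	int page_offline;	/* models PageOffline() */
	int refcount;
};

/* Offlining may skip exactly those PageOffline() pages whose refcount
 * the driver dropped to 0; a still-referenced PageOffline() page is
 * unmovable and makes offlining fail. */
static int can_skip_when_offlining(const struct page *p)
{
	return p->page_offline && p->refcount == 0;
}

/* A cooperating driver drops the refcount in MEM_GOING_OFFLINE. */
static void mem_going_offline(struct page *p)
{
	if (p->page_offline)
		p->refcount = 0;
}

/* Helper: does offlining succeed for a PageOffline() page, depending on
 * whether the owning driver cooperates? */
static int offline_allowed(int cooperative)
{
	struct page p = { .page_offline = 1, .refcount = 1 };

	if (cooperative)
		mem_going_offline(&p);
	return can_skip_when_offlining(&p);
}
```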
Acked-by: Michal Hocko
Tested-by: Pankaj Gupta
Acked-by: Andrew Morton
Cc: Andrew Morton
Cc: Juergen Gross
Cc: Konrad Rzeszutek Wilk
Cc: Pavel Tatashin
Cc: Alexander Duyck
Cc: Vlastimil Babka
Cc: Johannes Weiner
Cc: Anthony Yznaga
Cc: Michal Hocko
Cc: Oscar Salvador
Cc: Mel Gorman
Cc: Mike Rapoport
Cc: Dan Williams
Cc: Anshuman Khandual
Cc: Qian Cai
Cc: Pingfan Liu
Signed-off-by: David Hildenbrand
Link: https://lore.kernel.org/r/20200507140139.17083-7-david@redhat.com
Signed-off-by: Michael S. Tsirkin