06 Jan, 2021

1 commit

  • commit dc2da7b45ffe954a0090f5d0310ed7b0b37d2bd2 upstream.

    VMware observed a performance regression during memmap init on their
    platform, and bisected to commit 73a6e474cb376 ("mm: memmap_init:
    iterate over memblock regions rather that check each PFN") causing it.

    Before the commit:

    [0.033176] Normal zone: 1445888 pages used for memmap
    [0.033176] Normal zone: 89391104 pages, LIFO batch:63
    [0.035851] ACPI: PM-Timer IO Port: 0x448

    With the commit:

    [0.026874] Normal zone: 1445888 pages used for memmap
    [0.026875] Normal zone: 89391104 pages, LIFO batch:63
    [2.028450] ACPI: PM-Timer IO Port: 0x448

    The root cause is that the current deferred memmap init does not work
    as expected.

    Before, memmap_init_zone() was used to do the memmap init of one whole
    zone: all low zones of a NUMA node were initialized eagerly, while
    memmap init of the node's last zone was deferred. However, since
    commit 73a6e474cb376, memmap_init() iterates over the memblock regions
    inside one zone, then calls memmap_init_zone() to do the memmap init
    for each region.

    E.g., on VMware's system, the memory layout is as below; there are two
    memory regions in node 2. The current code mistakenly initializes the
    whole 1st region [mem 0xab00000000-0xfcffffffff] eagerly, and only
    applies deferred init on the 2nd region [mem
    0x10000000000-0x1033fffffff], initializing a single memory section
    there. In fact, we expect only a single memory section's memmap to be
    initialized up front for the whole zone. That is where the extra boot
    time is spent.

    [ 0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
    [ 0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
    [ 0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x55ffffffff]
    [ 0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x5600000000-0xaaffffffff]
    [ 0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab00000000-0xfcffffffff]
    [ 0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x10000000000-0x1033fffffff]

    Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to pass
    down the real zone end pfn, so that defer_init() can use it to judge
    zone-wide whether deferred initialization should be applied.
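    A condensed sketch of the idea behind the reworked defer_init()
    (simplified; the upstream version also tracks the previous zone end
    and the node's first deferred pfn):

    static inline bool defer_init(int nid, unsigned long pfn,
                                  unsigned long zone_end_pfn)
    {
            static unsigned long nr_initialised;

            /*
             * Always populate low zones eagerly: only the node's last
             * zone may be deferred. Comparing against the real zone
             * end, rather than a memblock region end, is the fix.
             */
            if (zone_end_pfn < pgdat_end_pfn(NODE_DATA(nid)))
                    return false;

            /* Defer everything past the first memory section. */
            if (++nr_initialised > PAGES_PER_SECTION &&
                (pfn & (PAGES_PER_SECTION - 1)) == 0)
                    return true;
            return false;
    }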

    Link: https://lkml.kernel.org/r/20201223080811.16211-1-bhe@redhat.com
    Link: https://lkml.kernel.org/r/20201223080811.16211-2-bhe@redhat.com
    Fixes: commit 73a6e474cb376 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
    Signed-off-by: Baoquan He
    Reported-by: Rahul Gopakumar
    Reviewed-by: Mike Rapoport
    Cc: David Hildenbrand
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Baoquan He
     

30 Dec, 2020

1 commit

  • [ Upstream commit 013339df116c2ee0d796dd8bfb8f293a2030c063 ]

    Since commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic
    v2"), the code that checks the secondary MMU's page table access bit is
    broken for !(TTU_IGNORE_ACCESS), because the page is unmapped from the
    secondary MMU's page table before the check. This specifically affects
    secondary MMUs which unmap the memory in
    mmu_notifier_invalidate_range_start(), like KVM.

    However, memory reclaim is the only user of !(TTU_IGNORE_ACCESS), i.e.
    of the absence of TTU_IGNORE_ACCESS, and it explicitly performs the
    page table access check before trying to unmap the page. So, at worst,
    reclaim will miss accesses in a very short window if we remove the page
    table access check from the unmapping code.

    There is also an unintended consequence of !(TTU_IGNORE_ACCESS) for
    memcg reclaim: from memcg reclaim, page_referenced() only accounts
    accesses from processes in the same memcg as the target page, but the
    unmapping code considers accesses from all processes, decreasing the
    effectiveness of memcg reclaim.

    The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping
    code.
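    For illustration, the kind of check this removes from
    try_to_unmap_one() looks roughly like this (a simplified sketch, not
    the exact diff):

    if (!(flags & TTU_IGNORE_ACCESS)) {
            /*
             * Too late for secondary MMUs that unmap in
             * mmu_notifier_invalidate_range_start(): their accesses
             * can no longer be observed here anyway.
             */
            if (ptep_clear_flush_young_notify(vma, address, pvmw.pte)) {
                    ret = false;
                    page_vma_mapped_walk_done(&pvmw);
                    break;
            }
    }

    With TTU_IGNORE_ACCESS assumed unconditionally, this block - and
    eventually the flag itself - can go away.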

    Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com
    Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Shakeel Butt
     

23 Nov, 2020

1 commit

    The core-mm has a default __weak implementation of phys_to_target_node()
    to mirror the weak definition of memory_add_physaddr_to_nid(). That
    symbol is exported for modules. However, while mm/memory_hotplug.c
    exported the symbol in the configuration cases of:

    CONFIG_NUMA_KEEP_MEMINFO=y
    CONFIG_MEMORY_HOTPLUG=y

    ...and:

    CONFIG_NUMA_KEEP_MEMINFO=n
    CONFIG_MEMORY_HOTPLUG=y

    ...it failed to export the symbol in the case of:

    CONFIG_NUMA_KEEP_MEMINFO=y
    CONFIG_MEMORY_HOTPLUG=n

    Not only is that broken, but Christoph points out that the kernel should
    not be exporting any __weak symbol, which means that the
    memory_add_physaddr_to_nid() example that phys_to_target_node() copied
    is broken too.

    Rework the definition of phys_to_target_node() and
    memory_add_physaddr_to_nid() to not require weak symbols. Move to the
    common arch override design-pattern of an asm header defining a symbol
    to replace the default implementation.

    The only common header that all memory_add_physaddr_to_nid() producing
    architectures implement is asm/sparsemem.h. In fact, powerpc already
    defines its memory_add_physaddr_to_nid() helper in sparsemem.h.
    Double-down on that observation and define phys_to_target_node() where
    necessary in asm/sparsemem.h. An alternate consideration that was
    discarded was to put this override in asm/numa.h, but that entangles
    with the definition of MAX_NUMNODES relative to the inclusion of
    linux/nodemask.h, and requires powerpc to grow a new header.
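    A sketch of the resulting override pattern (simplified; the exact
    stubs and pr_info_once() wording follow the upstream headers):

    /* Arch override, e.g. in arch/powerpc/include/asm/sparsemem.h: */
    int phys_to_target_node(u64 start);
    #define phys_to_target_node phys_to_target_node

    /* Generic fallback in the common header: */
    #ifndef phys_to_target_node
    static inline int phys_to_target_node(u64 start)
    {
            pr_info_once("Unknown target node for memory at 0x%llx, assuming node 0\n",
                         start);
            return 0;
    }
    #endif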

    The dependency on NUMA_KEEP_MEMINFO for DEV_DAX_HMEM_DEVICES is invalid
    now that the symbol is properly exported / stubbed in all combinations
    of CONFIG_NUMA_KEEP_MEMINFO and CONFIG_MEMORY_HOTPLUG.

    [dan.j.williams@intel.com: v4]
    Link: https://lkml.kernel.org/r/160461461867.1505359.5301571728749534585.stgit@dwillia2-desk3.amr.corp.intel.com
    [dan.j.williams@intel.com: powerpc: fix create_section_mapping compile warning]
    Link: https://lkml.kernel.org/r/160558386174.2948926.2740149041249041764.stgit@dwillia2-desk3.amr.corp.intel.com

    Fixes: a035b6bf863e ("mm/memory_hotplug: introduce default phys_to_target_node() implementation")
    Reported-by: Randy Dunlap
    Reported-by: Thomas Gleixner
    Reported-by: kernel test robot
    Reported-by: Christoph Hellwig
    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Tested-by: Randy Dunlap
    Tested-by: Thomas Gleixner
    Reviewed-by: Thomas Gleixner
    Reviewed-by: Christoph Hellwig
    Cc: Joao Martins
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Vishal Verma
    Cc: Stephen Rothwell
    Link: https://lkml.kernel.org/r/160447639846.1133764.7044090803980177548.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Linus Torvalds

    Dan Williams
     

19 Oct, 2020

1 commit

    To calculate the correct node to migrate a page to during hotplug, we
    need to check the node id of the page. A wrapper for
    alloc_migration_target() exists for this purpose.

    However, Vlastimil informs that all migration source pages come from a
    single node. In this case, we don't need to check the node id for each
    page and we don't need to re-set the target nodemask for each page by
    using the wrapper. Set up the migration_target_control once and use it
    for all pages.
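    Sketched from the patch, the setup now happens once per call instead
    of once per page (simplified):

    nodemask_t nmask = node_states[N_MEMORY];
    struct migration_target_control mtc = {
            .nmask = &nmask,
            .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
    };

    /*
     * All pages come from a single node; prefer migrating away from
     * it, falling back to it only if it is the sole node with memory.
     */
    mtc.nid = page_to_nid(list_first_entry(&source, struct page, lru));
    node_clear(mtc.nid, nmask);
    if (nodes_empty(nmask))
            node_set(mtc.nid, nmask);
    ret = migrate_pages(&source, alloc_migration_target, NULL,
                        (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG);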

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Christoph Hellwig
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-10-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

17 Oct, 2020

15 commits

  • As we no longer shuffle via generic_online_page() and when undoing
    isolation, we can simplify the comment.

    We now effectively shuffle only once (properly) when onlining new memory.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Acked-by: Michal Hocko
    Cc: Alexander Duyck
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc: Pankaj Gupta
    Cc: Haiyang Zhang
    Cc: "K. Y. Srinivasan"
    Cc: Matthew Wilcox
    Cc: Michael Ellerman
    Cc: Scott Cheloha
    Cc: Stephen Hemminger
    Cc: Wei Liu
    Link: https://lkml.kernel.org/r/20201005121534.15649-6-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
    At boot time, or when doing memory hot-add operations, if the links in
    sysfs can't be created, the system is still able to run, so just report
    the error in the kernel log rather than BUG_ON() and potentially making
    the system unusable; the call path can be reached with locks held.

    Since the number of memory blocks managed could be high, the messages are
    rate limited.

    As a consequence, link_mem_sections() has no status to report anymore.

    Signed-off-by: Laurent Dufour
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Acked-by: David Hildenbrand
    Cc: Greg Kroah-Hartman
    Cc: Fenghua Yu
    Cc: Nathan Lynch
    Cc: "Rafael J . Wysocki"
    Cc: Scott Cheloha
    Cc: Tony Luck
    Link: https://lkml.kernel.org/r/20200915094143.79181-4-ldufour@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
  • "mem" in the name already indicates the root, similar to
    release_mem_region() and devm_request_mem_region(). Make it implicit.
    The only single caller always passes iomem_resource, other parents are not
    applicable.

    Suggested-by: Wei Yang
    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Cc: Kees Cook
    Cc: Ard Biesheuvel
    Cc: Pankaj Gupta
    Cc: Baoquan He
    Link: https://lkml.kernel.org/r/20200916073041.10355-1-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Some add_memory*() users add memory in small, contiguous memory blocks.
    Examples include virtio-mem, hyper-v balloon, and the XEN balloon.

    This can quickly result in a lot of memory resources whose exact
    boundaries are not of interest (in contrast to, e.g., DIMMs, where the
    boundaries exposed via /proc/iomem are relevant to user space). We
    really want to merge added resources in this scenario where possible.

    Let's provide a flag (MEMHP_MERGE_RESOURCE) to specify that a resource
    either created within add_memory*() or passed via add_memory_resource()
    shall be marked mergeable and merged with applicable siblings.

    To implement that, we need a kernel/resource interface to mark selected
    System RAM resources mergeable (IORESOURCE_SYSRAM_MERGEABLE) and trigger
    merging.

    Note: We really want to merge after the whole operation succeeded, not
    directly when adding a resource to the resource tree (it would break
    add_memory_resource() and require splitting resources again when the
    operation failed - e.g., due to -ENOMEM).
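    A sketch of how a driver opts in (virtio-mem-style usage, simplified):

    /*
     * Add one memory block and ask for the new "System RAM (virtio_mem)"
     * resource to be marked mergeable and merged with suitable siblings.
     */
    rc = add_memory_driver_managed(nid, addr, memory_block_size_bytes(),
                                   "System RAM (virtio_mem)",
                                   MEMHP_MERGE_RESOURCE);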

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Pankaj Gupta
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Cc: Kees Cook
    Cc: Ard Biesheuvel
    Cc: Thomas Gleixner
    Cc: "K. Y. Srinivasan"
    Cc: Haiyang Zhang
    Cc: Stephen Hemminger
    Cc: Wei Liu
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: Stefano Stabellini
    Cc: Roger Pau Monné
    Cc: Julien Grall
    Cc: Baoquan He
    Cc: Wei Yang
    Cc: Anton Blanchard
    Cc: Benjamin Herrenschmidt
    Cc: Christian Borntraeger
    Cc: Dave Jiang
    Cc: Eric Biederman
    Cc: Greg Kroah-Hartman
    Cc: Heiko Carstens
    Cc: Jason Wang
    Cc: Len Brown
    Cc: Leonardo Bras
    Cc: Libor Pechacek
    Cc: Michael Ellerman
    Cc: "Michael S. Tsirkin"
    Cc: Nathan Lynch
    Cc: "Oliver O'Halloran"
    Cc: Paul Mackerras
    Cc: Pingfan Liu
    Cc: "Rafael J. Wysocki"
    Cc: Vasily Gorbik
    Cc: Vishal Verma
    Link: https://lkml.kernel.org/r/20200911103459.10306-6-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
    We soon want to pass flags, e.g., to mark added System RAM resources
    mergeable. Prepare for that.

    This patch is based on a similar patch by Oscar Salvador:

    https://lkml.kernel.org/r/20190625075227.15193-3-osalvador@suse.de

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Juergen Gross # Xen related part
    Reviewed-by: Pankaj Gupta
    Acked-by: Wei Liu
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Cc: Baoquan He
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Rafael J. Wysocki"
    Cc: Len Brown
    Cc: Greg Kroah-Hartman
    Cc: Vishal Verma
    Cc: Dave Jiang
    Cc: "K. Y. Srinivasan"
    Cc: Haiyang Zhang
    Cc: Stephen Hemminger
    Cc: Wei Liu
    Cc: Heiko Carstens
    Cc: Vasily Gorbik
    Cc: Christian Borntraeger
    Cc: David Hildenbrand
    Cc: "Michael S. Tsirkin"
    Cc: Jason Wang
    Cc: Boris Ostrovsky
    Cc: Stefano Stabellini
    Cc: "Oliver O'Halloran"
    Cc: Pingfan Liu
    Cc: Nathan Lynch
    Cc: Libor Pechacek
    Cc: Anton Blanchard
    Cc: Leonardo Bras
    Cc: Ard Biesheuvel
    Cc: Eric Biederman
    Cc: Julien Grall
    Cc: Kees Cook
    Cc: Roger Pau Monné
    Cc: Thomas Gleixner
    Cc: Wei Yang
    Link: https://lkml.kernel.org/r/20200911103459.10306-5-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • IORESOURCE_MEM_DRIVER_MANAGED currently uses an unused PnP bit, which is
    always set to 0 by hardware. This is far from beautiful (and confusing),
    and the bit only applies to SYSRAM. So let's move it out of the
    bus-specific (PnP) defined bits.

    We'll add another SYSRAM specific bit soon. If we ever need more bits for
    other purposes, we can steal some from "desc", or reshuffle/regroup what
    we have.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Cc: Kees Cook
    Cc: Ard Biesheuvel
    Cc: Pankaj Gupta
    Cc: Baoquan He
    Cc: Wei Yang
    Cc: Eric Biederman
    Cc: Thomas Gleixner
    Cc: Greg Kroah-Hartman
    Cc: Anton Blanchard
    Cc: Benjamin Herrenschmidt
    Cc: Boris Ostrovsky
    Cc: Christian Borntraeger
    Cc: Dave Jiang
    Cc: Haiyang Zhang
    Cc: Heiko Carstens
    Cc: Jason Wang
    Cc: Juergen Gross
    Cc: Julien Grall
    Cc: "K. Y. Srinivasan"
    Cc: Len Brown
    Cc: Leonardo Bras
    Cc: Libor Pechacek
    Cc: Michael Ellerman
    Cc: "Michael S. Tsirkin"
    Cc: Nathan Lynch
    Cc: "Oliver O'Halloran"
    Cc: Paul Mackerras
    Cc: Pingfan Liu
    Cc: "Rafael J. Wysocki"
    Cc: Roger Pau Monné
    Cc: Stefano Stabellini
    Cc: Stephen Hemminger
    Cc: Vasily Gorbik
    Cc: Vishal Verma
    Cc: Wei Liu
    Link: https://lkml.kernel.org/r/20200911103459.10306-3-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Patch series "selective merging of system ram resources", v4.

    Some add_memory*() users add memory in small, contiguous memory blocks.
    Examples include virtio-mem, hyper-v balloon, and the XEN balloon.

    This can quickly result in a lot of memory resources whose exact
    boundaries are not of interest (in contrast to, e.g., DIMMs, where the
    boundaries exposed via /proc/iomem are relevant to user space). We
    really want to merge added resources in this scenario where possible.

    Resources are effectively stored in a list-based tree. Having a lot of
    resources not only wastes memory, it also makes traversing that tree more
    expensive, and makes /proc/iomem explode in size (e.g., requiring
    kexec-tools to manually merge resources when creating a kdump header. The
    current kexec-tools resource count limit does not allow for more than
    ~100GB of memory with a memory block size of 128MB on x86-64).

    Let's allow to selectively merge system ram resources by specifying a new
    flag for add_memory*(). Patch #5 contains a /proc/iomem example. Only
    tested with virtio-mem.

    This patch (of 8):

    Let's make sure splitting a resource on memory hotunplug will never fail.
    This will become more relevant once we merge selected System RAM resources
    - then, we'll trigger that case more often on memory hotunplug.

    In general, this function is already unlikely to fail. When we remove
    memory, we free up quite a lot of metadata (memmap, page tables, memory
    block device, etc.). The only reason it could really fail would be when
    injecting allocation errors.

    All other error cases inside release_mem_region_adjustable() seem to be
    sanity checks in case the function is abused in a different context -
    let's add WARN_ON_ONCE() in these cases so we can catch any misuse.

    [natechancellor@gmail.com: fix use of ternary condition in release_mem_region_adjustable]
    Link: https://lkml.kernel.org/r/20200922060748.2452056-1-natechancellor@gmail.com
    Link: https://github.com/ClangBuiltLinux/linux/issues/1159

    Signed-off-by: David Hildenbrand
    Signed-off-by: Nathan Chancellor
    Signed-off-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Cc: Kees Cook
    Cc: Ard Biesheuvel
    Cc: Pankaj Gupta
    Cc: Baoquan He
    Cc: Wei Yang
    Cc: Anton Blanchard
    Cc: Benjamin Herrenschmidt
    Cc: Boris Ostrovsky
    Cc: Christian Borntraeger
    Cc: Dave Jiang
    Cc: Eric Biederman
    Cc: Greg Kroah-Hartman
    Cc: Haiyang Zhang
    Cc: Heiko Carstens
    Cc: Jason Wang
    Cc: Juergen Gross
    Cc: Julien Grall
    Cc: "K. Y. Srinivasan"
    Cc: Len Brown
    Cc: Leonardo Bras
    Cc: Libor Pechacek
    Cc: Michael Ellerman
    Cc: "Michael S. Tsirkin"
    Cc: Nathan Lynch
    Cc: "Oliver O'Halloran"
    Cc: Paul Mackerras
    Cc: Pingfan Liu
    Cc: "Rafael J. Wysocki"
    Cc: Roger Pau Monné
    Cc: Stefano Stabellini
    Cc: Stephen Hemminger
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vishal Verma
    Cc: Wei Liu
    Link: https://lkml.kernel.org/r/20200911103459.10306-2-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Currently, it can happen that pages are allocated (and freed) via the
    buddy before we finished basic memory onlining.

    For example, pages are exposed to the buddy and can be allocated before we
    actually mark the sections online. Allocated pages could suddenly fail
    pfn_to_online_page() checks. We had similar issues with pcp handling,
    when pages are allocated+freed before we reach zone_pcp_update() in
    online_pages() [1].

    Instead, mark all pageblocks MIGRATE_ISOLATE, such that allocations are
    impossible. Once done with the heavy lifting, use
    undo_isolate_page_range() to move the pages to the MIGRATE_MOVABLE
    freelist, marking them ready for allocation. Similar to offline_pages(),
    we have to manually adjust zone->nr_isolate_pageblock.

    [1] https://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.org
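    Roughly, the reworked online_pages() flow looks like this (a heavily
    condensed sketch, error handling omitted):

    /* Associate the pfn range with the zone, isolated from the start. */
    move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);

    online_pages_range(pfn, nr_pages);      /* expose to the buddy */
    zone->nr_isolate_pageblock += nr_pages / pageblock_nr_pages;

    /* ... mark sections online, rebuild zonelists, update pcp lists ... */

    /* Finally make the pages allocatable. */
    undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);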

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Wei Yang
    Cc: Baoquan He
    Cc: Pankaj Gupta
    Cc: Charan Teja Reddy
    Cc: Dan Williams
    Cc: Fenghua Yu
    Cc: Logan Gunthorpe
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Mike Rapoport
    Cc: Tony Luck
    Link: https://lkml.kernel.org/r/20200819175957.28465-11-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • On the memory onlining path, we want to start with MIGRATE_ISOLATE, to
    un-isolate the pages after memory onlining is complete. Let's allow
    passing in the migratetype.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Wei Yang
    Cc: Baoquan He
    Cc: Pankaj Gupta
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Logan Gunthorpe
    Cc: Dan Williams
    Cc: Mike Rapoport
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Michel Lespinasse
    Cc: Charan Teja Reddy
    Cc: Mel Gorman
    Link: https://lkml.kernel.org/r/20200819175957.28465-10-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
    We don't allow offlining memory with holes, all boot memory is online,
    and hotplugged memory cannot have holes.

    We can now simplify the onlining of pages. As we only allow
    onlining/offlining full sections, and sections always span full
    MAX_ORDER_NR_PAGES, we can just process the range in chunks of order
    MAX_ORDER - 1 without further special handling.

    The number of onlined pages simply corresponds to the number of pages we
    were requested to online.

    While at it, refine the comment regarding the callback not exposing all
    pages to the buddy.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Wei Yang
    Cc: Baoquan He
    Cc: Pankaj Gupta
    Cc: Charan Teja Reddy
    Cc: Dan Williams
    Cc: Fenghua Yu
    Cc: Logan Gunthorpe
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Mike Rapoport
    Cc: Tony Luck
    Link: https://lkml.kernel.org/r/20200819175957.28465-8-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Callers no longer need the number of isolated pageblocks. Let's simplify.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Wei Yang
    Cc: Baoquan He
    Cc: Pankaj Gupta
    Cc: Charan Teja Reddy
    Cc: Dan Williams
    Cc: Fenghua Yu
    Cc: Logan Gunthorpe
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Mike Rapoport
    Cc: Tony Luck
    Link: https://lkml.kernel.org/r/20200819175957.28465-7-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
    We make sure that we cannot have any memory holes right at the beginning
    of offline_pages(), and we only support onlining/offlining full
    sections. Both sections and pageblocks are a power of two in size, and
    sections always span full pageblocks.

    We can directly calculate the number of isolated pageblocks from nr_pages.
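    The fixup of the isolated pageblock count in offline_pages() then
    reduces to (sketch based on the patch):

    spin_lock_irqsave(&zone->lock, flags);
    zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
    spin_unlock_irqrestore(&zone->lock, flags);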

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Wei Yang
    Cc: Baoquan He
    Cc: Pankaj Gupta
    Cc: Charan Teja Reddy
    Cc: Dan Williams
    Cc: Fenghua Yu
    Cc: Logan Gunthorpe
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Mike Rapoport
    Cc: Tony Luck
    Link: https://lkml.kernel.org/r/20200819175957.28465-6-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • We make sure that we cannot have any memory holes right at the beginning
    of offline_pages(). We no longer need walk_system_ram_range() and can
    call test_pages_isolated() and __offline_isolated_pages() directly.

    offlined_pages always corresponds to nr_pages, so we can simplify that.

    [akpm@linux-foundation.org: patch conflict resolution]

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Wei Yang
    Cc: Baoquan He
    Cc: Pankaj Gupta
    Cc: Charan Teja Reddy
    Cc: Dan Williams
    Cc: Fenghua Yu
    Cc: Logan Gunthorpe
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Mike Rapoport
    Cc: Tony Luck
    Link: https://lkml.kernel.org/r/20200819175957.28465-4-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
    Two people already (including me) have tried to offline subsections,
    because the function looks like it can deal with them. But we really
    can only online/offline full sections that are properly aligned (e.g.,
    we can only mark full sections online/offline via SECTION_IS_ONLINE).

    Add a simple safety net to document the restriction now. Current users
    (core and powernv/memtrace) respect these restrictions.
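    The safety net amounts to an alignment check at the top of
    online_pages()/offline_pages() (sketch):

    /* Only full, properly aligned sections can be onlined/offlined. */
    if (WARN_ON_ONCE(!nr_pages ||
                     !IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION)))
            return -EINVAL;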

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Wei Yang
    Cc: Baoquan He
    Cc: Pankaj Gupta
    Cc: Charan Teja Reddy
    Cc: Dan Williams
    Cc: Fenghua Yu
    Cc: Logan Gunthorpe
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Mike Rapoport
    Cc: Tony Luck
    Link: https://lkml.kernel.org/r/20200819175957.28465-3-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Patch series "mm/memory_hotplug: online_pages()/offline_pages() cleanups", v2.

    These are a bunch of cleanups for online_pages()/offline_pages() and
    related code, mostly getting rid of memory hole handling that is no longer
    necessary. There is only a single walk_system_ram_range() call left in
    offline_pages(), to make sure we don't have any memory holes. I had some
    of these patches lying around for a longer time but didn't have time to
    polish them.

    In addition, the last patch marks all pageblocks of memory to get onlined
    MIGRATE_ISOLATE, so pages that have just been exposed to the buddy cannot
    get allocated before onlining is complete. Once heavy lifting is done,
    the pageblocks are set to MIGRATE_MOVABLE, such that allocations are
    possible.

    I played with DIMMs and virtio-mem on x86-64 and didn't spot any
    surprises. I verified that the number of isolated pageblocks is
    correctly handled when onlining/offlining.

    This patch (of 10):

    There is only a single user, offline_pages(). Let's inline, to make
    it look more similar to online_pages().

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Reviewed-by: Pankaj Gupta
    Acked-by: Michal Hocko
    Cc: Wei Yang
    Cc: Baoquan He
    Cc: Pankaj Gupta
    Cc: Charan Teja Reddy
    Cc: Dan Williams
    Cc: Fenghua Yu
    Cc: Logan Gunthorpe
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Michel Lespinasse
    Cc: Mike Rapoport
    Cc: Tony Luck
    Cc: Mel Gorman
    Link: https://lkml.kernel.org/r/20200819175957.28465-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20200819175957.28465-2-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

14 Oct, 2020

1 commit

    In preparation for setting a fallback value for dev_dax->target_node,
    introduce generic fallback helpers for phys_to_target_node().

    A generic implementation based on node-data or memblock was proposed, but
    as noted by Mike:

    "Here again, I would prefer to add a weak default for
    phys_to_target_node() because the "generic" implementation is not really
    generic.

    The fallback to reserved ranges is x86 specific because on x86 most of
    the reserved areas are not in memblock.memory. AFAIK, no other
    architecture does this."

    The info message in the generic memory_add_physaddr_to_nid()
    implementation is fixed up to properly reflect that
    memory_add_physaddr_to_nid() communicates "online" node info and
    phys_to_target_node() indicates "target / to-be-onlined" node info.

    [akpm@linux-foundation.org: fix CONFIG_MEMORY_HOTPLUG=n build]
    Link: https://lkml.kernel.org/r/202008252130.7YrHIyMI%25lkp@intel.com

    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Cc: David Hildenbrand
    Cc: Mike Rapoport
    Cc: Jia He
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Benjamin Herrenschmidt
    Cc: Ben Skeggs
    Cc: Borislav Petkov
    Cc: Brice Goglin
    Cc: Catalin Marinas
    Cc: Daniel Vetter
    Cc: Dave Hansen
    Cc: Dave Jiang
    Cc: David Airlie
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Ira Weiny
    Cc: Jason Gunthorpe
    Cc: Jeff Moyer
    Cc: Joao Martins
    Cc: Jonathan Cameron
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: Rafael J. Wysocki
    Cc: Thomas Gleixner
    Cc: Tom Lendacky
    Cc: Vishal Verma
    Cc: Wei Yang
    Cc: Will Deacon
    Cc: Ard Biesheuvel
    Cc: Bjorn Helgaas
    Cc: Boris Ostrovsky
    Cc: Hulk Robot
    Cc: Jason Yan
    Cc: "Jérôme Glisse"
    Cc: Juergen Gross
    Cc: kernel test robot
    Cc: Randy Dunlap
    Cc: Stefano Stabellini
    Cc: Vivek Goyal
    Link: https://lkml.kernel.org/r/159643097768.4062302.3135192588966888630.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Linus Torvalds

    Dan Williams
     

27 Sep, 2020

2 commits

    In register_mem_sect_under_node() the system_state value is checked to
    detect whether the call is made during boot time or during a hot-plug
    operation. Unfortunately, that check against SYSTEM_BOOTING is wrong
    because regular memory is registered at the SYSTEM_SCHEDULING state.
    In addition, memory hot-plug operations can be triggered at this system
    state by the ACPI [1]. So checking against the system state is not
    enough.

    The consequence is that on systems with interleaved node ranges like
    this:

    Early memory node ranges
    node 1: [mem 0x0000000000000000-0x000000011fffffff]
    node 2: [mem 0x0000000120000000-0x000000014fffffff]
    node 1: [mem 0x0000000150000000-0x00000001ffffffff]
    node 0: [mem 0x0000000200000000-0x000000048fffffff]
    node 2: [mem 0x0000000490000000-0x00000007ffffffff]

    This can be seen on a PowerPC LPAR after multiple memory hot-plug and
    hot-unplug operations are done. At the next reboot, the node's memory
    ranges can be interleaved, and since the call to link_mem_sections() is
    made in topology_init() while the system is in the SYSTEM_SCHEDULING
    state, the node id is not checked and the sections get registered to
    multiple nodes:

    $ ls -l /sys/devices/system/memory/memory21/node*
    total 0
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2

    In that case, the system is able to boot, but if one of these memory
    blocks is later hot-unplugged and then hot-plugged again, the sysfs
    inconsistency is detected, triggering a BUG_ON():

    kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
    Oops: Exception in kernel mode, sig: 5 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
    CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
    Call Trace:
    add_memory_resource+0x23c/0x340 (unreliable)
    __add_memory+0x5c/0xf0
    dlpar_add_lmb+0x1b4/0x500
    dlpar_memory+0x1f8/0xb80
    handle_dlpar_errorlog+0xc0/0x190
    dlpar_store+0x198/0x4a0
    kobj_attr_store+0x30/0x50
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1b0/0x290
    vfs_write+0xe8/0x290
    ksys_write+0xdc/0x130
    system_call_exception+0x160/0x270
    system_call_common+0xf0/0x27c

    This patch addresses the root cause by not relying on the system_state
    value to detect whether the call is due to a hot-plug operation. An
    extra parameter is added to link_mem_sections() stating whether the
    operation is due to a hot-plug operation.
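    Sketch of the changed interface - callers now state the context
    explicitly instead of the code guessing from system_state:

    int link_mem_sections(int nid, unsigned long start_pfn,
                          unsigned long end_pfn,
                          enum meminit_context context);

    /* Boot path, e.g. from topology_init(): */
    link_mem_sections(nid, start_pfn, end_pfn, MEMINIT_EARLY);

    /* Hot-plug path, e.g. from add_memory_resource(): */
    link_mem_sections(nid, start_pfn, end_pfn, MEMINIT_HOTPLUG);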

    [1] According to Oscar Salvador, using this qemu command line, ACPI
    memory hotplug operations are raised at SYSTEM_SCHEDULING state:

    $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
    -m size=$MEM,slots=255,maxmem=4294967296k \
    -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
    -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
    -object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
    -object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
    -object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
    -object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
    -object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
    -object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \

    Fixes: 4fbce633910e ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
    Signed-off-by: Laurent Dufour
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Fenghua Yu
    Cc: Nathan Lynch
    Cc: Scott Cheloha
    Cc: Tony Luck
    Cc:
    Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
  • Patch series "mm: fix memory to node bad links in sysfs", v3.

    Sometimes, firmware may expose interleaved memory layout like this:

    Early memory node ranges
    node 1: [mem 0x0000000000000000-0x000000011fffffff]
    node 2: [mem 0x0000000120000000-0x000000014fffffff]
    node 1: [mem 0x0000000150000000-0x00000001ffffffff]
    node 0: [mem 0x0000000200000000-0x000000048fffffff]
    node 2: [mem 0x0000000490000000-0x00000007ffffffff]

    In that case, we can see memory blocks assigned to multiple nodes in
    sysfs:

    $ ls -l /sys/devices/system/memory/memory21
    total 0
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2
    -rw-r--r-- 1 root root 65536 Aug 24 05:27 online
    -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
    -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
    drwxr-xr-x 2 root root 0 Aug 24 05:27 power
    -r--r--r-- 1 root root 65536 Aug 24 05:27 removable
    -rw-r--r-- 1 root root 65536 Aug 24 05:27 state
    lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory
    -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
    -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones

    The same applies in the node's directory with a memory21 link in both
    the node1 and node2's directory.

    This is wrong, but doesn't prevent the system from running. However,
    when one of these memory blocks is later hot-unplugged and then
    hot-plugged, the system detects an inconsistency in the sysfs layout
    and a BUG_ON() is raised:

    kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
    CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
    Call Trace:
    add_memory_resource+0x23c/0x340 (unreliable)
    __add_memory+0x5c/0xf0
    dlpar_add_lmb+0x1b4/0x500
    dlpar_memory+0x1f8/0xb80
    handle_dlpar_errorlog+0xc0/0x190
    dlpar_store+0x198/0x4a0
    kobj_attr_store+0x30/0x50
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1b0/0x290
    vfs_write+0xe8/0x290
    ksys_write+0xdc/0x130
    system_call_exception+0x160/0x270
    system_call_common+0xf0/0x27c

    This has been seen on PowerPC LPAR.

    The root cause of this issue is that when a node's memory is registered,
    the range used can overlap another node's range, and thus the memory
    block is registered to multiple nodes in sysfs.

    There are two issues here:

    (a) The sysfs memory and node's layouts are broken due to these
    multiple links

    (b) The link errors in link_mem_sections() should not lead to a system
    panic.

    To address (a), register_mem_sect_under_node() should not rely on the
    system state to detect whether the link operation is triggered by a
    hot-plug operation. This is addressed by patches 1 and 2 of this
    series.

    Issue (b) will be addressed separately.

    This patch (of 2):

    The memmap_context enum is used to detect whether a memory operation is
    due to a hot-add operation or is happening at boot time.

    Make it generic to any hotplug operation and rename it to
    meminit_context.

    There is no functional change introduced by this patch.
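    The rename boils down to (sketch):

    /* was: enum memmap_context { MEMMAP_EARLY, MEMMAP_HOTPLUG }; */
    enum meminit_context {
            MEMINIT_EARLY,
            MEMINIT_HOTPLUG,
    };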

    Suggested-by: David Hildenbrand
    Signed-off-by: Laurent Dufour
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J . Wysocki"
    Cc: Nathan Lynch
    Cc: Scott Cheloha
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc:
    Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
    Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     

20 Sep, 2020

1 commit

    There is a race during page offline that can lead to an infinite loop:
    a page never ends up on a buddy list and __offline_pages() keeps
    retrying infinitely, or until a termination signal is received.

    Thread#1 - a new process:

    load_elf_binary
    begin_new_exec
    exec_mmap
    mmput
    exit_mmap
    tlb_finish_mmu
    tlb_flush_mmu
    release_pages
    free_unref_page_list
    free_unref_page_prepare
    set_pcppage_migratetype(page, migratetype);
    // Set page->index migration type below MIGRATE_PCPTYPES

    Thread#2 - hot-removes memory
    __offline_pages
    start_isolate_page_range
    set_migratetype_isolate
    set_pageblock_migratetype(page, MIGRATE_ISOLATE);
    // Set pageblock migration type to MIGRATE_ISOLATE
    drain_all_pages(zone);
    // drain per-cpu page lists to buddy allocator.

    Thread#1 - continue
    free_unref_page_commit
    migratetype = get_pcppage_migratetype(page);
    // get old migration type
    list_add(&page->lru, &pcp->lists[migratetype]);
    // add new page to already drained pcp list

    Thread#2
    Never drains pcp again, and therefore gets stuck in the loop.

    The fix is to try to drain per-cpu lists again after
    check_pages_isolated_cb() fails.
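    Sketch of the fix in __offline_pages() (simplified):

    /* Check again; drain once more if pages raced onto the pcp lists. */
    do {
            ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
                                        NULL, check_pages_isolated_cb);
            if (ret)
                    drain_all_pages(zone);
    } while (ret);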

    Fixes: c52e75935f8d ("mm: remove extra drain pages on pcp list")
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: David Hildenbrand
    Cc: Oscar Salvador
    Cc: Wei Yang
    Cc:
    Link: https://lkml.kernel.org/r/20200903140032.380431-1-pasha.tatashin@soleen.com
    Link: https://lkml.kernel.org/r/20200904151448.100489-2-pasha.tatashin@soleen.com
    Link: http://lkml.kernel.org/r/20200904070235.GA15277@dhcp22.suse.cz
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

15 Aug, 2020

1 commit

  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Aug, 2020

4 commits

  • There are some similar functions for migration target allocation. Since
    there is no fundamental difference, it's better to keep just one rather
    than keeping all variants. This patch implements base migration target
    allocation function. In the following patches, variants will be converted
    to use this function.

    Changes should be mechanical but, unfortunately, there are some
    differences. First, some callers' nodemask is assigned NULL, since a
    NULL nodemask is considered as all available nodes, that is,
    &node_states[N_MEMORY]. Second, for hugetlb page allocation, gfp_mask
    is redefined as the regular hugetlb allocation gfp_mask plus
    __GFP_THISNODE if the user-provided gfp_mask has it. This is because a
    future caller of this function requires this node constraint to be set.
    Lastly, if the provided nodeid is NUMA_NO_NODE, nodeid is set to the
    node where the migration source lives. This helps remove the simple
    wrappers for setting up the nodeid.

    Note that the PageHighMem() call in the previous function is changed to
    the open-coded "is_highmem_idx()" since it provides more readability.

    [akpm@linux-foundation.org: tweak patch title, per Vlastimil]
    [akpm@linux-foundation.org: fix typo in comment]

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Christoph Hellwig
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-6-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    When onlining the first memory block in a zone, the pcp lists are not
    updated, so the pcp struct will have the default setting of
    ->high = 0, ->batch = 1.

    This means that until a second memory block in the zone (if there is
    one) is onlined, the pcp lists of this zone will not contain any pages,
    because pcp->count is always greater than ->high, and thus
    free_pcppages_bulk() is called to free batch-size (= 1) pages every
    time the system wants to add a page to the pcp list through
    free_unref_page().

    In a word, the system does not get the benefits offered by the pcp
    lists when there is a single onlineable memory block in a zone.
    Correct this by always updating the pcp lists when a memory block is
    onlined.
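    Sketch of the fix in online_pages() (simplified):

    node_states_set_node(nid, &arg);
    if (need_zonelists_rebuild)
            build_all_zonelists(NULL);

    /*
     * Previously skipped when the zonelists had just been rebuilt
     * (i.e. when onlining the first memory of a zone); now the pcp
     * ->high/->batch values are recomputed on every online.
     */
    zone_pcp_update(zone);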

    Fixes: 1f522509c77a ("mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset")
    Signed-off-by: Charan Teja Reddy
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Vinayak Menon
    Link: http://lkml.kernel.org/r/1596372896-15336-1-git-send-email-charante@codeaurora.org
    Signed-off-by: Linus Torvalds

    Charan Teja Reddy
     
    When check_memblock_offlined_cb() returns a failing rc (e.g. because
    the memblock is still online at that time), mem_hotplug_begin/done is
    left unpaired.

    Therefore a warning is triggered:
    Call Trace:
    percpu_up_write+0x33/0x40
    try_remove_memory+0x66/0x120
    ? _cond_resched+0x19/0x30
    remove_memory+0x2b/0x40
    dev_dax_kmem_remove+0x36/0x72 [kmem]
    device_release_driver_internal+0xf0/0x1c0
    device_release_driver+0x12/0x20
    bus_remove_device+0xe1/0x150
    device_del+0x17b/0x3e0
    unregister_dev_dax+0x29/0x60
    devm_action_release+0x15/0x20
    release_nodes+0x19a/0x1e0
    devres_release_all+0x3f/0x50
    device_release_driver_internal+0x100/0x1c0
    driver_detach+0x4c/0x8f
    bus_remove_driver+0x5c/0xd0
    driver_unregister+0x31/0x50
    dax_pmem_exit+0x10/0xfe0 [dax_pmem]

    Fixes: f1037ec0cc8a ("mm/memory_hotplug: fix remove_memory() lockdep splat")
    Signed-off-by: Jia He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Dan Williams
    Cc: [5.6+]
    Cc: Andy Lutomirski
    Cc: Baoquan He
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Chuhong Yuan
    Cc: Dave Hansen
    Cc: Dave Jiang
    Cc: Fenghua Yu
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Kaly Xin
    Cc: Logan Gunthorpe
    Cc: Masahiro Yamada
    Cc: Mike Rapoport
    Cc: Peter Zijlstra
    Cc: Rich Felker
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vishal Verma
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200710031619.18762-3-justin.he@arm.com
    Signed-off-by: Linus Torvalds

    Jia He
     
    This is to introduce a general dummy helper. memory_add_physaddr_to_nid()
    is a fallback option to get the nid in case NUMA_NO_NODE is detected.

    After this patch, arm64/sh/s390 can simply use the general dummy
    version. PowerPC/x86/ia64 will still use their specific versions.

    This is the preparation to set a fallback value for dev_dax->target_node.
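    The general dummy version amounts to (sketch of the added fallback):

    /*
     * Generic weak fallback; arm64/sh/s390 drop their own copies and
     * use this, while powerpc/x86/ia64 keep their specific versions.
     */
    int __weak memory_add_physaddr_to_nid(u64 start)
    {
            pr_info_once("Unknown online node for memory at 0x%llx, assuming node 0\n",
                         start);
            return 0;
    }
    EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);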

    Signed-off-by: Jia He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Michal Hocko
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Vishal Verma
    Cc: Dave Jiang
    Cc: Baoquan He
    Cc: Chuhong Yuan
    Cc: Mike Rapoport
    Cc: Logan Gunthorpe
    Cc: Masahiro Yamada
    Cc: Jonathan Cameron
    Cc: Kaly Xin
    Link: http://lkml.kernel.org/r/20200710031619.18762-2-justin.he@arm.com
    Signed-off-by: Linus Torvalds

    Jia He
     

08 Aug, 2020

2 commits

  • It's not completely obvious why we have to shuffle the complete zone -
    introduced in commit e900a918b098 ("mm: shuffle initial free memory to
    improve memory-side-cache utilization") - because some sort of shuffling
    is already performed when onlining pages via __free_one_page(), placing
    MAX_ORDER-1 pages either to the head or the tail of the freelist. Let's
    document why we have to shuffle the complete zone when exposing larger,
    contiguous physical memory areas to the buddy.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Acked-by: Dan Williams
    Acked-by: Michal Hocko
    Cc: Alexander Duyck
    Cc: Dan Williams
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200624094741.9918-3-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • The global variable "vm_total_pages" is a relic from older days. There is
    only a single user that reads the variable - build_all_zonelists() - and
    the first thing it does is update it.

    Use a local variable in build_all_zonelists() instead and remove the
    global variable.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Pankaj Gupta
    Reviewed-by: Mike Rapoport
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Huang Ying
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/20200619132410.23859-2-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

26 Jun, 2020

1 commit

    When working with very large nodes, poisoning the struct pages (of
    which there will be very many) can take a very long time. If the
    system is using voluntary preemption, the soft-lockup watchdog will not
    be able to detect forward progress. This patch addresses the issue by
    giving up time, like __remove_pages() does. This behavior was
    introduced in v5.6 with commit d33695b16a9f ("mm/memory_hotplug: poison
    memmap in remove_pfn_range_from_zone()").

    Alternately, init_page_poison could do this cond_resched(), but it seems
    to me that the caller of init_page_poison() is what actually knows whether
    or not it should relax its own priority.
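    Sketch of the reworked poisoning loop in remove_pfn_range_from_zone()
    (simplified from the patch):

    for (pfn = start_pfn; pfn < end_pfn; pfn += cur_nr_pages) {
            cond_resched();         /* give up time, as __remove_pages() does */

            /* Poison at most up to the next section boundary. */
            cur_nr_pages = min(end_pfn - pfn,
                               SECTION_ALIGN_UP(pfn + 1) - pfn);
            page_init_poison(pfn_to_page(pfn),
                             sizeof(struct page) * cur_nr_pages);
    }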

    Based on Dan's notes, I think this is perfectly safe: commit f931ab479dd2
    ("mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}")

    Aside from fixing the lockup, it is also a friendlier thing to do on lower
    core systems that might wipe out large chunks of hotplug memory (probably
    not a very common case).

    Fixes this kind of splat:

    watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [daxctl:9922]
    irq event stamp: 138450
    hardirqs last enabled at (138449): [] trace_hardirqs_on_thunk+0x1a/0x1c
    hardirqs last disabled at (138450): [] trace_hardirqs_off_thunk+0x1a/0x1c
    softirqs last enabled at (138448): [] __do_softirq+0x347/0x456
    softirqs last disabled at (138443): [] irq_exit+0x7d/0xb0
    CPU: 46 PID: 9922 Comm: daxctl Not tainted 5.7.0-BEN-14238-g373c6049b336 #30
    Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYXCRB1.86B.0578.D07.1902280810 02/28/2019
    RIP: 0010:memset_erms+0x9/0x10
    Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
    Call Trace:
    remove_pfn_range_from_zone+0x3a/0x380
    memunmap_pages+0x17f/0x280
    release_nodes+0x22a/0x260
    __device_release_driver+0x172/0x220
    device_driver_detach+0x3e/0xa0
    unbind_store+0x113/0x130
    kernfs_fop_write+0xdc/0x1c0
    vfs_write+0xde/0x1d0
    ksys_write+0x58/0xd0
    do_syscall_64+0x5a/0x120
    entry_SYSCALL_64_after_hwframe+0x49/0xb3
    Built 2 zonelists, mobility grouping on. Total pages: 49050381
    Policy zone: Normal
    Built 3 zonelists, mobility grouping on. Total pages: 49312525
    Policy zone: Normal

    David said: "It really only is an issue for devmem. Ordinary
    hotplugged system memory is not affected (onlined/offlined in memory
    block granularity)."

    Link: http://lkml.kernel.org/r/20200619231213.1160351-1-ben.widawsky@intel.com
    Fixes: commit d33695b16a9f ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()")
    Signed-off-by: Ben Widawsky
    Reported-by: "Scargall, Steve"
    Reported-by: Ben Widawsky
    Acked-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Vishal Verma
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Widawsky
     

11 Jun, 2020

1 commit

  • Pull virtio updates from Michael Tsirkin:

    - virtio-mem: paravirtualized memory hotplug

    - support doorbell mapping for vdpa

    - config interrupt support in ifc

    - fixes all over the place

    * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (40 commits)
    vhost/test: fix up after API change
    virtio_mem: convert device block size into 64bit
    virtio-mem: drop unnecessary initialization
    ifcvf: implement config interrupt in IFCVF
    vhost: replace -1 with VHOST_FILE_UNBIND in ioctls
    vhost_vdpa: Support config interrupt in vdpa
    ifcvf: ignore continuous setting same status value
    virtio-mem: Don't rely on implicit compiler padding for requests
    virtio-mem: Try to unplug the complete online memory block first
    virtio-mem: Use -ETXTBSY as error code if the device is busy
    virtio-mem: Unplug subblocks right-to-left
    virtio-mem: Drop manual check for already present memory
    virtio-mem: Add parent resource for all added "System RAM"
    virtio-mem: Better retry handling
    virtio-mem: Offline and remove completely unplugged memory blocks
    mm/memory_hotplug: Introduce offline_and_remove_memory()
    virtio-mem: Allow to offline partially unplugged memory blocks
    mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE
    virtio-mem: Paravirtualized memory hotunplug part 2
    virtio-mem: Paravirtualized memory hotunplug part 1
    ...

    Linus Torvalds
     

05 Jun, 2020

8 commits

  • There is a typo in comment, fix it.
    s/recoreded/recorded

    Signed-off-by: Ethon Paul
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Ralph Campbell
    Link: http://lkml.kernel.org/r/20200410160328.13843-1-ethp@qq.com
    Signed-off-by: Linus Torvalds

    Ethon Paul
     
  • Patch series "mm/memory_hotplug: Interface to add driver-managed system
    ram", v4.

    kexec (via kexec_load()) currently cannot properly handle memory added
    via dax/kmem, and will have similar issues with virtio-mem. kexec-tools
    will currently add all memory to the fixed-up initial firmware memmap.
    In the case of dax/kmem, this means that - in contrast to a proper
    reboot - how that persistent memory will be used can no longer be
    configured by the kexec'd kernel. In the case of virtio-mem it will be
    harmful, because that memory might contain inaccessible pieces that
    require coordination with the hypervisor first.

    In both cases, we want to let the driver in the kexec'd kernel handle
    detecting and adding the memory, like during an ordinary reboot.
    Introduce add_memory_driver_managed(). More on the semantics is in
    patch #1.

    In the future, we might want to make this behavior configurable for
    dax/kmem- either by configuring it in the kernel (which would then also
    allow to configure kexec_file_load()) or in kexec-tools by also adding
    "System RAM (kmem)" memory from /proc/iomem to the fixed-up initial
    firmware memmap.

    More on the motivation can be found in [1] and [2].

    [1] https://lkml.kernel.org/r/20200429160803.109056-1-david@redhat.com
    [2] https://lkml.kernel.org/r/20200430102908.10107-1-david@redhat.com

    This patch (of 3):

    Some device drivers rely on memory they manage not being added to the
    initial (firmware) memmap as system RAM - so it's not used as initial
    system RAM by the kernel and the driver stays in control. While this is
    the case during cold boot and after a reboot, kexec is not aware of that
    and might add such memory to the initial (firmware) memmap of the kexec
    kernel. We need ways to teach the kernel and user space that this
    system RAM is different.

    For example, dax/kmem allows to decide at runtime if persistent memory is
    to be used as system ram. Another future user is virtio-mem, which has to
    coordinate with its hypervisor to deal with inaccessible parts within
    memory resources.

    We want to let users in the kernel (esp. kexec) but also user space
    (esp. kexec-tools) know that this memory has different semantics and
    needs to be handled differently:
    1. Don't create entries in /sys/firmware/memmap/
    2. Name the memory resource "System RAM ($DRIVER)" (exposed via
    /proc/iomem) ($DRIVER might be "kmem", "virtio_mem").
    3. Flag the memory resource IORESOURCE_MEM_DRIVER_MANAGED

    /sys/firmware/memmap/ [1] represents the "raw firmware-provided memory
    map" because "on most architectures that firmware-provided memory map is
    modified afterwards by the kernel itself". The primary user is kexec on
    x86-64. Since commit d96ae5309165 ("memory-hotplug: create
    /sys/firmware/memmap entry for new memory"), we add all hotplugged memory
    to that firmware memmap - which makes perfect sense for traditional memory
    hotplug on x86-64, where real HW will also add hotplugged DIMMs to the
    firmware memmap. We replicate what the "raw firmware-provided memory map"
    looks like after hot(un)plug.

    To keep things simple, let the user provide the full resource name instead
    of only the driver name - this way, we don't have to manually
    allocate/craft strings for memory resources. Also use the resource name
    to make decisions, to avoid passing additional flags. In case the name
    isn't "System RAM", it's special.
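    Sketch of the intended usage - roughly how dax/kmem would call it in
    the follow-up patch:

    /*
     * No /sys/firmware/memmap entry is created; /proc/iomem will show
     * "System RAM (kmem)", flagged IORESOURCE_MEM_DRIVER_MANAGED.
     */
    rc = add_memory_driver_managed(numa_node, res->start,
                                   resource_size(res),
                                   "System RAM (kmem)");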

    We don't have to worry about firmware_map_remove() on the removal path.
    If there is no entry, it will simply return with -EINVAL.

    We'll adapt dax/kmem in a follow-up patch.

    [1] https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-firmware-memmap

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Acked-by: Pankaj Gupta
    Cc: Michal Hocko
    Cc: Pankaj Gupta
    Cc: Wei Yang
    Cc: Baoquan He
    Cc: Dave Hansen
    Cc: Eric Biederman
    Cc: Pavel Tatashin
    Cc: Dan Williams
    Link: http://lkml.kernel.org/r/20200508084217.9160-1-david@redhat.com
    Link: http://lkml.kernel.org/r/20200508084217.9160-3-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • The comment in add_memory_resource() is stale: hotadd_new_pgdat() no
    longer calls get_pfn_range_for_nid(), as a hotadded pgdat simply spans
    no pages at all until memory is moved to the zone/node via
    move_pfn_range_to_zone() - e.g., when onlining memory blocks.

    The only archs that care about memblock for hotplugged memory (either
    for iterating over all system RAM or for testing memory validity) are
    arm64, s390x, and powerpc - due to CONFIG_ARCH_KEEP_MEMBLOCK. Without
    CONFIG_ARCH_KEEP_MEMBLOCK, we can simply stop messing with memblock
    (see the sketch below).
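
    A sketch of the resulting guard in the add_memory path (the
    surrounding context and the error label are assumptions; the removal
    path is guarded analogously):

    #include <linux/memblock.h>

    /* Only architectures that keep memblock alive after boot need the
     * hotplugged range tracked there. */
    if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
            ret = memblock_add_node(start, size, nid);
            if (ret)
                    goto error;
    }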

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Acked-by: Mike Rapoport
    Acked-by: Michal Hocko
    Cc: Michal Hocko
    Cc: Baoquan He
    Cc: Oscar Salvador
    Cc: Pankaj Gupta
    Cc: Mike Rapoport
    Cc: Anshuman Khandual
    Link: http://lkml.kernel.org/r/20200422155353.25381-3-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Patch series "mm/memory_hotplug: handle memblocks only with
    CONFIG_ARCH_KEEP_MEMBLOCK", v1.

    A hotadded node/pgdat will span no pages at all until memory is moved
    to the zone/node via move_pfn_range_to_zone() -> resize_pgdat_range()
    - e.g., when onlining memory blocks. We don't have to initialize
    node_start_pfn to the memory we are adding.

    This patch (of 2):

    In particular, there is an inconsistency:
    - Hotplugging memory to a memory-less node with cpus: node_start_pfn == 0
    - Offlining and removing last memory from a node: node_start_pfn == 0
    - Hotplugging memory to a memory-less node without cpus: node_start_pfn != 0

    As soon as memory is onlined, node_start_pfn is overwritten with the
    actual start. E.g., when adding two DIMMs but onlining only one of
    them, only that DIMM (with online memory blocks) is spanned by the
    node.

    Currently, the validity of node_start_pfn really is linked to
    node_spanned_pages != 0. With node_spanned_pages == 0 (e.g., before
    onlining memory), it has no meaning.

    So let's stop setting node_start_pfn only for it to be overwritten via
    move_pfn_range_to_zone(). This avoids confusion when reading the code
    and wondering what magic is performed with node_start_pfn in this
    function when hotadding a pgdat (see the sketch below).
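
    As a sketch of the intended state (the helper is hypothetical; the
    actual change simply drops the assignment in hotadd_new_pgdat()):

    #include <linux/mmzone.h>

    /* A freshly hotadded pgdat starts out empty: node_start_pfn stays 0
     * and only becomes meaningful once memory is onlined and
     * move_pfn_range_to_zone() resizes the node. */
    static void example_init_hotadded_pgdat(pg_data_t *pgdat)
    {
            pgdat->node_start_pfn = 0;
            pgdat->node_spanned_pages = 0;
    }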

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Acked-by: Pankaj Gupta
    Cc: Michal Hocko
    Cc: Baoquan He
    Cc: Oscar Salvador
    Cc: Pankaj Gupta
    Cc: Anshuman Khandual
    Cc: Mike Rapoport
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200422155353.25381-1-david@redhat.com
    Link: http://lkml.kernel.org/r/20200422155353.25381-2-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Fortunately, all users of is_mem_section_removable() are gone. Get rid
    of it, including some now-unnecessary functions.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Baoquan He
    Acked-by: Michal Hocko
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Oscar Salvador
    Link: http://lkml.kernel.org/r/20200407135416.24093-3-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • A misbehaving qemu created a situation where the ACPI SRAT table
    advertised one fewer proximity domain than intended. The NFIT table
    did describe all the expected proximity domains. This caused the
    device dax driver to assign an impossible target_node to the device,
    and when hotplugged as system memory, this would fail with the
    following signature:

    BUG: kernel NULL pointer dereference, address: 0000000000000088
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 80000001767d4067 P4D 80000001767d4067 PUD 10e0c4067 PMD 0
    Oops: 0000 [#1] SMP PTI
    CPU: 4 PID: 22737 Comm: kswapd3 Tainted: G O 5.6.0-rc5 #9
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    RIP: 0010:prepare_kswapd_sleep+0x7c/0xc0
    Code: 89 df e8 87 fd ff ff 89 c2 31 c0 84 d2 74 e6 0f 1f 44 00 00 48 8b 05 fb af 7a 01 48 63 93 88 1d 01 00 48 8b 84 d0 20 0f 00 00 3b 98 88 00 00 00 75 28 f0 80 a0 80 00 00 00 fe f0 80 a3 38 20
    RSP: 0018:ffffc900017a3e78 EFLAGS: 00010202
    RAX: 0000000000000000 RBX: ffff8881209e0000 RCX: 0000000000000000
    RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8881209e0e80
    RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000008000
    R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000003
    R13: 0000000000000003 R14: 0000000000000000 R15: ffffc900017a3ec8
    FS: 0000000000000000(0000) GS:ffff888318c00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000088 CR3: 0000000120b50002 CR4: 00000000001606e0
    Call Trace:
    kswapd+0x103/0x520
    kthread+0x120/0x140
    ret_from_fork+0x3a/0x50

    Add a check in the add_memory path to fail if the node to which we are
    adding memory is not in the node_possible_map (see the sketch below).
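
    The check is roughly of the following shape (a sketch; the exact
    message and surrounding context are assumptions):

    #include <linux/nodemask.h>

    /* In the add_memory path: refuse impossible nodes up front. */
    if (WARN_ON_ONCE(!node_possible(nid))) {
            pr_err("node %d is not in node_possible_map\n", nid);
            return -EINVAL;
    }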

    Signed-off-by: Vishal Verma
    Signed-off-by: Andrew Morton
    Acked-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: David Hildenbrand
    Cc: Dan Williams
    Cc: Dave Hansen
    Link: http://lkml.kernel.org/r/20200416225438.15208-1-vishal.l.verma@intel.com
    Signed-off-by: Linus Torvalds

    Vishal Verma
     
  • virtio-mem wants to offline and remove a memory block once it has
    unplugged all subblocks (e.g., using alloc_contig_range()). Let's
    provide an interface to do that from a driver. virtio-mem already
    supports offlining partially unplugged memory blocks. Offlining a
    fully unplugged memory block will not require migrating any pages.
    All unplugged subblocks are PageOffline() and have a reference count
    of 0 - so offlining code will simply skip them.

    All we need is an interface to offline and remove the memory from kernel
    module context, where we don't have access to the memory block devices
    (esp. find_memory_block() and device_offline()) and the device hotplug
    lock.

    To keep things simple, allow working on only a single memory block (a
    usage sketch follows below).
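
    A minimal sketch of a driver using the interface as introduced here
    (the wrapper name is hypothetical; the range must cover exactly one
    memory block):

    #include <linux/memory.h>
    #include <linux/memory_hotplug.h>

    /* Offline and remove a single, fully unplugged memory block. */
    static int example_remove_block(int nid, u64 block_start)
    {
            return offline_and_remove_memory(nid, block_start,
                                             memory_block_size_bytes());
    }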

    Acked-by: Michal Hocko
    Tested-by: Pankaj Gupta
    Acked-by: Andrew Morton
    Cc: Andrew Morton
    Cc: David Hildenbrand
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Wei Yang
    Cc: Dan Williams
    Cc: Qian Cai
    Signed-off-by: David Hildenbrand
    Link: https://lore.kernel.org/r/20200507140139.17083-9-david@redhat.com
    Signed-off-by: Michael S. Tsirkin

    David Hildenbrand
     
  • virtio-mem wants to allow offlining memory blocks of which some parts
    were unplugged (allocated via alloc_contig_range()), especially in
    order to later offline and remove completely unplugged memory blocks.
    The important part is that PageOffline() has to remain set until the
    section is offline, so these pages will never get accessed (e.g., when
    dumping). The pages should not be handed back to the buddy (which
    would require clearing PageOffline() and would result in issues if
    offlining fails and the pages are suddenly in the buddy).

    Let's allow that by permitting any PageOffline() page to be isolated
    when offlining. This way, we can reach the memory hotplug notifier
    MEM_GOING_OFFLINE, where the driver can signal that it is fine with
    offlining this page by dropping its reference count. PageOffline()
    pages with a reference count of 0 can then be skipped when offlining
    the pages (as if they were free, although they are not in the buddy).

    Anybody who uses PageOffline() pages and does not agree to offline
    them (e.g., the Hyper-V balloon, the Xen balloon, the VMware balloon
    for 2MB pages) will not decrement the reference count, making
    offlining fail when trying to migrate such an unmovable page. So
    there should be no observable change. The same applies to balloon
    compaction users (movable PageOffline() pages); those pages will
    simply be migrated.

    Note 1: If offlining fails, a driver has to increment the reference
    count again in MEM_CANCEL_OFFLINE (see the notifier sketch below).

    Note 2: A driver that makes use of this has to be aware that
    re-onlining the memory block has to be handled by hooking into the
    onlining code (online_page_callback_t), setting the pages PageOffline()
    again and not giving them to the buddy.
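
    A sketch of the driver side (the callback is hypothetical; how a
    driver recognizes its own pages is driver-specific and only hinted at
    via PageOffline() here):

    #include <linux/memory.h>
    #include <linux/mm.h>
    #include <linux/notifier.h>
    #include <linux/page-flags.h>
    #include <linux/page_ref.h>

    static int example_memory_notifier_cb(struct notifier_block *nb,
                                          unsigned long action, void *arg)
    {
            struct memory_notify *mhp = arg;
            unsigned long pfn;

            for (pfn = mhp->start_pfn;
                 pfn < mhp->start_pfn + mhp->nr_pages; pfn++) {
                    struct page *page = pfn_to_page(pfn);

                    if (!PageOffline(page))
                            continue;
                    if (action == MEM_GOING_OFFLINE)
                            page_ref_dec(page);  /* agree to offlining */
                    else if (action == MEM_CANCEL_OFFLINE)
                            page_ref_inc(page);  /* offlining failed, undo */
            }
            return NOTIFY_OK;
    }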

    Reviewed-by: Alexander Duyck
    Acked-by: Michal Hocko
    Tested-by: Pankaj Gupta
    Acked-by: Andrew Morton
    Cc: Andrew Morton
    Cc: Juergen Gross
    Cc: Konrad Rzeszutek Wilk
    Cc: Pavel Tatashin
    Cc: Alexander Duyck
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Anthony Yznaga
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Mel Gorman
    Cc: Mike Rapoport
    Cc: Dan Williams
    Cc: Anshuman Khandual
    Cc: Qian Cai
    Cc: Pingfan Liu
    Signed-off-by: David Hildenbrand
    Link: https://lore.kernel.org/r/20200507140139.17083-7-david@redhat.com
    Signed-off-by: Michael S. Tsirkin

    David Hildenbrand