17 Oct, 2020

1 commit

  • We soon want to pass flags via a new type to add_memory() and friends.
    That revealed that we currently don't guard some declarations by
    CONFIG_MEMORY_HOTPLUG.

    While some definitions could be moved to different places, let's keep it
    minimal for now and use CONFIG_MEMORY_HOTPLUG for all functions only
    compiled with CONFIG_MEMORY_HOTPLUG.

    Wrap sparse_decode_mem_map() in CONFIG_MEMORY_HOTPLUG, as it is only
    called from CONFIG_MEMORY_HOTPLUG code.

    While at it, remove the leftover declaration of allow_online_pfn_range(),
    whose implementation is no longer around, and mhp_notimplemented(), which
    is unused.
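
    As a minimal sketch of the pattern (not the exact hunk from this patch;
    the prototype shown is just an example of a hotplug-only symbol):

    #ifdef CONFIG_MEMORY_HOTPLUG
    /* only referenced from memory hotplug code */
    struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
                                       unsigned long pnum);
    #endif /* CONFIG_MEMORY_HOTPLUG */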

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Pankaj Gupta
    Cc: Baoquan He
    Cc: Wei Yang
    Cc: Anton Blanchard
    Cc: Ard Biesheuvel
    Cc: Benjamin Herrenschmidt
    Cc: Boris Ostrovsky
    Cc: Christian Borntraeger
    Cc: Dave Jiang
    Cc: Eric Biederman
    Cc: Greg Kroah-Hartman
    Cc: Haiyang Zhang
    Cc: Heiko Carstens
    Cc: Jason Gunthorpe
    Cc: Jason Wang
    Cc: Juergen Gross
    Cc: Julien Grall
    Cc: Kees Cook
    Cc: "K. Y. Srinivasan"
    Cc: Len Brown
    Cc: Leonardo Bras
    Cc: Libor Pechacek
    Cc: Michael Ellerman
    Cc: "Michael S. Tsirkin"
    Cc: Nathan Lynch
    Cc: "Oliver O'Halloran"
    Cc: Paul Mackerras
    Cc: Pingfan Liu
    Cc: "Rafael J. Wysocki"
    Cc: Roger Pau Monné
    Cc: Stefano Stabellini
    Cc: Stephen Hemminger
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vishal Verma
    Cc: Wei Liu
    Link: https://lkml.kernel.org/r/20200911103459.10306-4-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

14 Oct, 2020

1 commit

  • There are several occurrences of the following pattern:

    for_each_memblock(memory, reg) {
            start_pfn = memblock_region_memory_base_pfn(reg);
            end_pfn = memblock_region_memory_end_pfn(reg);

            /* do something with start_pfn and end_pfn */
    }

    Rather than iterating over all memblock.memory regions and querying their
    start and end PFNs each time, use the for_each_mem_pfn_range() iterator to
    get simpler and clearer code.
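
    For illustration, the converted loop becomes (a sketch with generic
    variable names, not a hunk from a specific architecture):

    for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
            /* do something with start_pfn and end_pfn */
    }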

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Reviewed-by: Baoquan He
    Acked-by: Miguel Ojeda [.clang-format]
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: Daniel Axtens
    Cc: Dave Hansen
    Cc: Emil Renner Berthing
    Cc: Hari Bathini
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Marek Szyprowski
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: https://lkml.kernel.org/r/20200818151634.14343-12-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

08 Aug, 2020

3 commits

  • After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP we have two equivalent
    functions that call memory_present() for each region in memblock.memory:
    sparse_memory_present_with_active_regions() and memblocks_present().

    Moreover, all architectures have a call to either of these functions
    preceding the call to sparse_init() and in most cases they are called
    one after the other.

    Mark the regions from memblock.memory as present during sparse_init() by
    making sparse_init() call memblocks_present(), make memblocks_present()
    and memory_present() static, and remove the redundant
    sparse_memory_present_with_active_regions() function.

    Also remove the no longer required HAVE_MEMORY_PRESENT configuration
    option.
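
    As a rough sketch of the resulting flow (simplified; details as in
    mm/sparse.c):

    void __init sparse_init(void)
    {
            /* mark all memblock.memory regions as present up front */
            memblocks_present();

            /* ... then build the per-section memmaps as before ... */
    }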

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200712083130.22919-1-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • For early sections, the memmap is handled specially even when sub-section
    hotplug is enabled: the memmap can only be populated as a whole.

    Quoted from the comment of section_activate():

    * The early init code does not consider partially populated
    * initial sections, it simply assumes that memory will never be
    * referenced. If we hot-add memory into such a section then we
    * do not need to populate the memmap and can simply reuse what
    * is already there.

    The current section_deactivate() breaks this rule: when hot-removing a
    sub-section, it depopulates the sub-section's memmap. The consequence is
    that if we hot-add this sub-section again, its memmap never gets properly
    populated.
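
    The rule we want in section_deactivate() is roughly the following (a
    simplified sketch, not necessarily the exact hunk of the fix):

    /* the memmap of early sections is always fully populated */
    if (!section_is_early)
            depopulate_section_memmap(pfn, nr_pages, altmap);
    else if (memmap)
            free_map_bootmem(memmap);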

    We can reproduce the case with the following steps:

    1. Hacking qemu to allow sub-section early section

    : diff --git a/hw/i386/pc.c b/hw/i386/pc.c
    : index 51b3050d01..c6a78d83c0 100644
    : --- a/hw/i386/pc.c
    : +++ b/hw/i386/pc.c
    : @@ -1010,7 +1010,7 @@ void pc_memory_init(PCMachineState *pcms,
    :      }
    :
    :      machine->device_memory->base =
    : -        ROUND_UP(0x100000000ULL + x86ms->above_4g_mem_size, 1 * GiB);
    : +        0x100000000ULL + x86ms->above_4g_mem_size;
    :
    :      if (pcmc->enforce_aligned_dimm) {
    :          /* size device region assuming 1G page max alignment per slot */

    2. Bootup qemu with PSE disabled and a sub-section aligned memory size

    Part of the qemu command would look like this:

    sudo x86_64-softmmu/qemu-system-x86_64 \
    --enable-kvm -cpu host,pse=off \
    -m 4160M,maxmem=20G,slots=1 \
    -smp sockets=2,cores=16 \
    -numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 \
    -machine pc,nvdimm \
    -nographic \
    -object memory-backend-ram,id=mem0,size=8G \
    -device nvdimm,id=vm0,memdev=mem0,node=0,addr=0x144000000,label-size=128k

    3. Re-config a pmem device with sub-section size in guest

    ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M

    Then you would see the following call trace:

    pmem0: detected capacity change from 0 to 16777216
    BUG: unable to handle page fault for address: ffffec73c51000b4
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 81ff8067 P4D 81ff8067 PUD 81ff7067 PMD 1437cb067 PTE 0
    Oops: 0002 [#1] SMP NOPTI
    CPU: 16 PID: 1348 Comm: ndctl Kdump: loaded Tainted: G W 5.8.0-rc2+ #24
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
    RIP: 0010:memmap_init_zone+0x154/0x1c2
    Code: 77 16 f6 40 10 02 74 10 48 03 48 08 48 89 cb 48 c1 eb 0c e9 3a ff ff ff 48 89 df 48 c1 e7 06 48f
    RSP: 0018:ffffbdc7011a39b0 EFLAGS: 00010282
    RAX: ffffec73c5100088 RBX: 0000000000144002 RCX: 0000000000144000
    RDX: 0000000000000004 RSI: 007ffe0000000000 RDI: ffffec73c5100080
    RBP: 027ffe0000000000 R08: 0000000000000001 R09: ffff9f8d38f6d708
    R10: ffffec73c0000000 R11: 0000000000000000 R12: 0000000000000004
    R13: 0000000000000001 R14: 0000000000144200 R15: 0000000000000000
    FS: 00007efe6b65d780(0000) GS:ffff9f8d3f780000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffec73c51000b4 CR3: 000000007d718000 CR4: 0000000000340ee0
    Call Trace:
    move_pfn_range_to_zone+0x128/0x150
    memremap_pages+0x4e4/0x5a0
    devm_memremap_pages+0x1e/0x60
    dev_dax_probe+0x69/0x160 [device_dax]
    really_probe+0x298/0x3c0
    driver_probe_device+0xe1/0x150
    ? driver_allows_async_probing+0x50/0x50
    bus_for_each_drv+0x7e/0xc0
    __device_attach+0xdf/0x160
    bus_probe_device+0x8e/0xa0
    device_add+0x3b9/0x740
    __devm_create_dev_dax+0x127/0x1c0
    __dax_pmem_probe+0x1f2/0x219 [dax_pmem_core]
    dax_pmem_probe+0xc/0x1b [dax_pmem]
    nvdimm_bus_probe+0x69/0x1c0 [libnvdimm]
    really_probe+0x147/0x3c0
    driver_probe_device+0xe1/0x150
    device_driver_attach+0x53/0x60
    bind_store+0xd1/0x110
    kernfs_fop_write+0xce/0x1b0
    vfs_write+0xb6/0x1a0
    ksys_write+0x5f/0xe0
    do_syscall_64+0x4d/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Acked-by: David Hildenbrand
    Cc: Oscar Salvador
    Cc: Dan Williams
    Link: http://lkml.kernel.org/r/20200625223534.18024-1-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Patch series "mm: cleanup usage of <asm/pgalloc.h>"

    Most architectures have very similar versions of pXd_alloc_one() and
    pXd_free_one() for intermediate levels of page table. These patches add
    generic versions of these functions in <asm-generic/pgalloc.h> and enable
    use of the generic functions where appropriate.
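
    For context, such a generic helper reads roughly like this (an abridged
    sketch of what the series adds to <asm-generic/pgalloc.h>):

    static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
    {
            struct page *page;
            gfp_t gfp = GFP_PGTABLE_USER;

            if (mm == &init_mm)
                    gfp = GFP_PGTABLE_KERNEL;
            page = alloc_pages(gfp, 0);
            if (!page)
                    return NULL;
            if (!pgtable_pmd_page_ctor(page)) {
                    __free_pages(page, 0);
                    return NULL;
            }
            return (pmd_t *)page_address(page);
    }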

    In addition, functions declared and defined in <asm/pgalloc.h> are
    used mostly by core mm and early mm initialization in arch and there is no
    actual reason to have <asm/pgalloc.h> included all over the place.
    The first patch in this series removes unneeded includes of
    <asm/pgalloc.h>.

    In the end it didn't work out as neatly as I hoped and moving
    pXd_alloc_track() definitions to <asm-generic/pgalloc.h> would require
    unnecessary changes to arches that have custom page table allocations, so
    I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local
    to mm/.

    This patch (of 8):

    In most cases the <asm/pgalloc.h> header is required only for allocations
    of page table memory. Most of the .c files that include that header do
    not use symbols declared in <asm/pgalloc.h> and do not require that
    header.

    As for the other header files that used to include <asm/pgalloc.h>, it is
    possible to move that include into the .c file that actually uses symbols
    from <asm/pgalloc.h> and drop the include from the header file.

    The process was somewhat automated using

    sed -i -E '/[<"]asm\/pgalloc\.h/d' \
            $(grep -L -w -f /tmp/xx \
                    $(git grep -E -l '\<p[4um]?d_(alloc|free)'))

    where /tmp/xx contains all the symbols defined in
    arch/*/include/asm/pgalloc.h.

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Acked-by: Geert Uytterhoeven [m68k]
    Cc: Abdul Haleem
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Christophe Leroy
    Cc: Joerg Roedel
    Cc: Max Filippov
    Cc: Peter Zijlstra
    Cc: Satheesh Rajendran
    Cc: Stafford Horne
    Cc: Stephen Rothwell
    Cc: Steven Rostedt
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

10 Jun, 2020

1 commit

  • Patch series "mm: consolidate definitions of page table accessors", v2.

    The low level page table accessors (pXY_index(), pXY_offset()) are
    duplicated across all architectures and sometimes more than once. For
    instance, we have 31 definitions of pgd_offset() for 25 supported
    architectures.

    Most of these definitions are actually identical and typically it boils
    down to, e.g.

    static inline unsigned long pmd_index(unsigned long address)
    {
            return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
            return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
    }

    These definitions can be shared among 90% of the arches provided
    XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

    For architectures that really need a custom version there is always
    possibility to override the generic version with the usual ifdefs magic.

    These patches introduce include/linux/pgtable.h that replaces
    include/asm-generic/pgtable.h and add the definitions of the page table
    accessors to the new header.

    This patch (of 12):

    The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
    functions involving page table manipulations, e.g. pte_alloc() and
    pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
    in the files that include <linux/mm.h>.

    The include statements in such cases are removed with a simple loop:

    for f in $(git grep -l "include <asm/pgtable.h>") ; do
            sed -i -e '/include <asm\/pgtable.h>/ d' $f
    done

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mike Rapoport
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

05 Jun, 2020

1 commit


08 Apr, 2020

5 commits

  • No functional change.

    [bhe@redhat.com: move functions into CONFIG_MEMORY_HOTPLUG ifdeffery scope]
    Link: http://lkml.kernel.org/r/20200316045804.GC3486@MiWiFi-R3L-srv
    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Cc: Wei Yang
    Cc: Dan Williams
    Cc: Pankaj Gupta
    Cc: Stephen Rothwell
    Link: http://lkml.kernel.org/r/20200312124414.439-6-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • And have check_pfn_span() gate the proper alignment and size of the
    hot-added memory region.

    Also move the code comments from inside section_deactivate() to above it.
    The comments apply to the whole function, and moving them makes the code
    cleaner.

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Dan Williams
    Cc: Pankaj Gupta
    Cc: Wei Yang
    Link: http://lkml.kernel.org/r/20200312124414.439-5-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Currently, to support subsection aligned memory region adding for pmem,
    subsection map is added to track which subsection is present.

    However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP. It means
    subsection map only makes sense when SPARSEMEM_VMEMMAP enabled. For the
    classic sparse, it's meaningless. Even worse, it may confuse people when
    checking code related to the classic sparse.

    About the classic sparse which doesn't support subsection hotplug, Dan
    said it's more because the effort and maintenance burden outweighs the
    benefit. Besides, the current 64 bit ARCHes all enable
    SPARSEMEM_VMEMMAP_ENABLE by default.

    Combining the above reasons, no need to provide subsection map and the
    relevant handling for the classic sparse. Let's remove them.

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Michal Hocko
    Cc: Pankaj Gupta
    Cc: Wei Yang
    Link: http://lkml.kernel.org/r/20200312124414.439-4-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Factor out the code which clear subsection map of one memory region from
    section_deactivate() into clear_subsection_map().

    And also add helper function is_subsection_map_empty() to check if the
    current subsection map is empty or not.
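
    A rough sketch of the two helpers (simplified; the error handling of the
    actual patch is omitted):

    static bool is_subsection_map_empty(struct mem_section *ms)
    {
            return bitmap_empty(&ms->usage->subsection_map[0],
                                SUBSECTIONS_PER_SECTION);
    }

    static int clear_subsection_map(unsigned long pfn, unsigned long nr_pages)
    {
            struct mem_section *ms = __pfn_to_section(pfn);
            unsigned long *subsection_map = &ms->usage->subsection_map[0];
            DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 };

            subsection_mask_set(map, pfn, nr_pages);
            /* clear exactly the sub-sections covered by [pfn, pfn + nr_pages) */
            bitmap_andnot(subsection_map, subsection_map, map,
                          SUBSECTIONS_PER_SECTION);
            return 0;
    }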

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Pankaj Gupta
    Cc: Dan Williams
    Cc: Michal Hocko
    Cc: Wei Yang
    Link: http://lkml.kernel.org/r/20200312124414.439-3-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Patch series "mm/hotplug: Only use subsection map for VMEMMAP", v4.

    Memory sub-section hotplug was added to fix the issue that nvdimm could be
    mapped at non-section aligned starting address. A subsection map is added
    into struct mem_section_usage to implement it.

    However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP. It means
    subsection map only makes sense when SPARSEMEM_VMEMMAP enabled. For the
    classic sparse, subsection map is meaningless and confusing.

    About the classic sparse which doesn't support subsection hotplug, Dan
    said it's more because the effort and maintenance burden outweighs the
    benefit. Besides, the current 64 bit ARCHes all enable
    SPARSEMEM_VMEMMAP_ENABLE by default.

    This patch (of 5):

    Factor out the code that fills the subsection map from section_activate()
    into fill_subsection_map(), this makes section_activate() cleaner and
    easier to follow.

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: David Hildenbrand
    Acked-by: Pankaj Gupta
    Cc: Dan Williams
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200312124414.439-2-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     

03 Apr, 2020

3 commits

  • When allocating memmap for hot added memory with the classic sparse, the
    specified 'nid' is ignored in populate_section_memmap().

    In contrast, when allocating the memmap for the classic sparse during
    boot, the node given by 'nid' is preferred. And VMEMMAP prefers the node
    of 'nid' in both the boot stage and memory hot adding. So there seems to
    be no reason not to respect the node of 'nid' for the classic sparse when
    hot adding memory.

    Use kvmalloc_node() instead so that the passed-in 'nid' is used.
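
    With that, the classic sparse hot-add path ends up roughly as follows (a
    sketch of the resulting helper):

    struct page * __meminit populate_section_memmap(unsigned long pfn,
                    unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
    {
            return kvmalloc_node(array_size(sizeof(struct page),
                                            PAGES_PER_SECTION), GFP_KERNEL, nid);
    }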

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: David Hildenbrand
    Reviewed-by: Wei Yang
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Link: http://lkml.kernel.org/r/20200316125625.GH3486@MiWiFi-R3L-srv
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • This change makes populate_section_memmap()/depopulate_section_memmap()
    much simpler.

    Suggested-by: Michal Hocko
    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Reviewed-by: Pankaj Gupta
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Wei Yang
    Acked-by: Michal Hocko
    Link: http://lkml.kernel.org/r/20200316125450.GG3486@MiWiFi-R3L-srv
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • memmap should be the address of the page struct instead of an address
    derived from the pfn.

    As mentioned by David, if system memory and devmem sit within a section,
    the mismatched address would lead kdump to dump unexpected memory.

    Since sub-section only works for SPARSEMEM_VMEMMAP, pfn_to_page() is
    valid for getting the page struct address at this point.
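
    In other words, the intent is roughly (sketch, not the literal hunk):

    /* record the struct page address, not a pfn-based value */
    memmap = pfn_to_page(pfn);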

    Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Acked-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Baoquan He
    Link: http://lkml.kernel.org/r/20200210005048.10437-1-richardw.yang@linux.intel.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     

30 Mar, 2020

1 commit

  • Fix the crash like this:

    BUG: Kernel NULL pointer dereference on read at 0x00000000
    Faulting instruction address: 0xc000000000c3447c
    Oops: Kernel access of bad area, sig: 11 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
    ...
    NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0
    LR [c000000000088354] vmemmap_free+0x144/0x320
    Call Trace:
    section_deactivate+0x220/0x240
    __remove_pages+0x118/0x170
    arch_remove_memory+0x3c/0x150
    memunmap_pages+0x1cc/0x2f0
    devm_action_release+0x30/0x50
    release_nodes+0x2f8/0x3e0
    device_release_driver_internal+0x168/0x270
    unbind_store+0x130/0x170
    drv_attr_store+0x44/0x60
    sysfs_kf_write+0x68/0x80
    kernfs_fop_write+0x100/0x290
    __vfs_write+0x3c/0x70
    vfs_write+0xcc/0x240
    ksys_write+0x7c/0x140
    system_call+0x5c/0x68

    The crash is due to NULL dereference at

    test_bit(idx, ms->usage->subsection_map);

    due to ms->usage = NULL in pfn_section_valid()

    With commit d41e2f3bd546 ("mm/hotplug: fix hot remove failure in
    SPARSEMEM|!VMEMMAP case") section_mem_map is set to NULL after
    depopulate_section_memmap(). This was done so that pfn_to_page() can work
    correctly with a kernel config that disables SPARSEMEM_VMEMMAP. With that
    config pfn_to_page() does

    __section_mem_map_addr(__sec) + __pfn;

    where

    static inline struct page *__section_mem_map_addr(struct mem_section *section)
    {
            unsigned long map = section->section_mem_map;
            map &= SECTION_MAP_MASK;
            return (struct page *)map;
    }

    Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map is
    used to check pfn validity (pfn_valid()). Since section_deactivate()
    releases mem_section->usage if a section is fully deactivated, a
    pfn_valid() check after a sub-section deactivate causes a kernel crash.

    static inline int pfn_valid(unsigned long pfn)
    {
            ...
            return early_section(ms) || pfn_section_valid(ms, pfn);
    }

    where

    static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
    {
            int idx = subsection_map_index(pfn);

            return test_bit(idx, ms->usage->subsection_map);
    }

    Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is
    freed. For architectures like ppc64 where large pages are used for
    vmemmap mapping (16MB), a specific vmemmap mapping can cover multiple
    sections. Hence before a vmemmap mapping page can be freed, the kernel
    needs to make sure there are no valid sections within that mapping.
    Clearing the section valid bit before depopulate_section_memmap() enables
    this.
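
    A simplified sketch of the change in section_deactivate() (the actual
    patch also adds a comment explaining the ordering):

    /*
     * Mark the section invalid so that valid_section() returns false
     * before the memmap is freed.
     */
    ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;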

    [aneesh.kumar@linux.ibm.com: add comment]
    Link: http://lkml.kernel.org/r/20200326133235.343616-1-aneesh.kumar@linux.ibm.com
    Link: http://lkml.kernel.org/r/20200325031914.107660-1-aneesh.kumar@linux.ibm.com
    Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
    Reported-by: Sachin Sant
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Tested-by: Sachin Sant
    Reviewed-by: Baoquan He
    Reviewed-by: Wei Yang
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Cc: Michael Ellerman
    Cc: Dan Williams
    Cc: David Hildenbrand
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc:
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

22 Mar, 2020

1 commit

  • In section_deactivate(), pfn_to_page() doesn't work any more after
    ms->section_mem_map is reset to NULL in the SPARSEMEM|!VMEMMAP case. It
    causes a hot remove failure:

    kernel BUG at mm/page_alloc.c:4806!
    invalid opcode: 0000 [#1] SMP PTI
    CPU: 3 PID: 8 Comm: kworker/u16:0 Tainted: G W 5.5.0-next-20200205+ #340
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    RIP: 0010:free_pages+0x85/0xa0
    Call Trace:
    __remove_pages+0x99/0xc0
    arch_remove_memory+0x23/0x4d
    try_remove_memory+0xc8/0x130
    __remove_memory+0xa/0x11
    acpi_memory_device_remove+0x72/0x100
    acpi_bus_trim+0x55/0x90
    acpi_device_hotplug+0x2eb/0x3d0
    acpi_hotplug_work_fn+0x1a/0x30
    process_one_work+0x1a7/0x370
    worker_thread+0x30/0x380
    kthread+0x112/0x130
    ret_from_fork+0x35/0x40

    Let's move the ->section_mem_map reset to after
    depopulate_section_memmap() to fix it.
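
    The resulting ordering in section_deactivate() is roughly (a sketch,
    using the local variable names of that function):

    if (section_is_early && memmap)
            free_map_bootmem(memmap);
    else
            depopulate_section_memmap(pfn, nr_pages, altmap);

    /* only now forget the mapping */
    if (empty)
            ms->section_mem_map = (unsigned long)NULL;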

    [akpm@linux-foundation.org: remove unneeded initialization, per David]
    Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: Pankaj Gupta
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Wei Yang
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc:
    Link: http://lkml.kernel.org/r/20200307084229.28251-2-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     

22 Feb, 2020

1 commit

  • When we use SPARSEMEM instead of SPARSEMEM_VMEMMAP, pfn_to_page()
    doesn't work before sparse_init_one_section() is called.

    This leads to a crash when hotplug memory:

    BUG: unable to handle page fault for address: 0000000006400000
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP PTI
    CPU: 3 PID: 221 Comm: kworker/u16:1 Tainted: G W 5.5.0-next-20200205+ #343
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    RIP: 0010:__memset+0x24/0x30
    Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
    RSP: 0018:ffffb43ac0373c80 EFLAGS: 00010a87
    RAX: ffffffffffffffff RBX: ffff8a1518800000 RCX: 0000000000050000
    RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000006400000
    RBP: 0000000000140000 R08: 0000000000100000 R09: 0000000006400000
    R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
    R13: 0000000000000028 R14: 0000000000000000 R15: ffff8a153ffd9280
    FS: 0000000000000000(0000) GS:ffff8a153ab00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000006400000 CR3: 0000000136fca000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    sparse_add_section+0x1c9/0x26a
    __add_pages+0xbf/0x150
    add_pages+0x12/0x60
    add_memory_resource+0xc8/0x210
    __add_memory+0x62/0xb0
    acpi_memory_device_add+0x13f/0x300
    acpi_bus_attach+0xf6/0x200
    acpi_bus_scan+0x43/0x90
    acpi_device_hotplug+0x275/0x3d0
    acpi_hotplug_work_fn+0x1a/0x30
    process_one_work+0x1a7/0x370
    worker_thread+0x30/0x380
    kthread+0x112/0x130
    ret_from_fork+0x35/0x40

    We should use the memmap pointer directly, as the code did before.
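
    Concretely, the poisoning in sparse_add_section() should operate on the
    memmap returned by section_activate(), roughly (sketch):

    /* was: page_init_poison(pfn_to_page(start_pfn), ...) */
    page_init_poison(memmap, sizeof(struct page) * nr_pages);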

    On x86 the impact is limited to x86_32 builds, or x86_64 configurations
    that override the default setting for SPARSEMEM_VMEMMAP.

    Other memory hotplug archs (arm64, ia64, and ppc) also default to
    SPARSEMEM_VMEMMAP=y.

    [dan.j.williams@intel.com: changelog update]
    [rppt@linux.ibm.com: changelog update]
    Link: http://lkml.kernel.org/r/20200219030454.4844-1-bhe@redhat.com
    Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
    Signed-off-by: Wei Yang
    Signed-off-by: Baoquan He
    Acked-by: David Hildenbrand
    Reviewed-by: Baoquan He
    Reviewed-by: Dan Williams
    Acked-by: Michal Hocko
    Cc: Mike Rapoport
    Cc: Oscar Salvador
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     

04 Feb, 2020

1 commit

  • Let's move next_present_section_nr() to the header and use the shorter
    variant from mm/page_alloc.c (the original one will also check
    "__highest_present_section_nr + 1", which is not necessary). While at
    it, make the section_nr in next_pfn() const.

    In next_pfn(), we now return section_nr_to_pfn(-1) instead of -1 once we
    exceed __highest_present_section_nr, which doesn't make a difference in
    the caller as it is big enough (>= all sane end_pfn).
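
    The shared helper ends up looking roughly like this (sketch of the
    variant that is moved to the header):

    static inline unsigned long next_present_section_nr(unsigned long section_nr)
    {
            while (++section_nr <= __highest_present_section_nr) {
                    if (present_section_nr(section_nr))
                            return section_nr;
            }
            return -1;
    }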

    Link: http://lkml.kernel.org/r/20200113144035.10848-3-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Kirill A. Shutemov
    Cc: Baoquan He
    Cc: Dan Williams
    Cc: "Jin, Zhi"
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

01 Feb, 2020

1 commit

  • After commit ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug"),
    when a mem section is fully deactivated, section_mem_map still records
    the section's start pfn, which is not used any more and will be
    reassigned during re-addition.

    In analogy with the alloc/free pattern, it is better to clear all fields
    of section_mem_map.

    Besides this, it breaks the user space tool "makedumpfile" [1], which
    assumes that a hot-removed section has its mem_map set to NULL instead of
    checking directly against the SECTION_MARKED_PRESENT bit. (It would be
    better for makedumpfile to change that assumption; that needs a separate
    patch.)

    The bug can be reproduced on IBM POWERVM by running "drmgr -c mem -r -q 5",
    triggering a crash, and saving the vmcore with makedumpfile.

    [1]: makedumpfile, commit e73016540293 ("[v1.6.7] Update version")

    Link: http://lkml.kernel.org/r/1579487594-28889-1-git-send-email-kernelfans@gmail.com
    Signed-off-by: Pingfan Liu
    Acked-by: Michal Hocko
    Acked-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Oscar Salvador
    Cc: Baoquan He
    Cc: Qian Cai
    Cc: Kazuhito Hagio
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pingfan Liu
     

14 Jan, 2020

1 commit

  • When we remove an early section, we don't free the usage map, as the
    usage maps of other sections are placed into the same page. Once the
    section is removed, it is no longer an early section (especially, the
    memmap is freed). When we re-add that section, the usage map is reused,
    however, it is no longer an early section. When removing that section
    again, we try to kfree() a usage map that was allocated during early
    boot - bad.

    Let's check against PageReserved() to see if we are dealing with a usage
    map that was allocated during boot. We could also check against
    !(PageSlab(usage_page) || PageCompound(usage_page)), but PageReserved() is
    cleaner.
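
    A sketch of the resulting check in section_deactivate() (simplified):

    /* usage maps allocated during boot live in a PageReserved() page */
    if (!PageReserved(virt_to_page(ms->usage))) {
            kfree(ms->usage);
            ms->usage = NULL;
    }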

    Can be triggered using memtrace under ppc64/powernv:

    $ mount -t debugfs none /sys/kernel/debug/
    $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
    $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
    ------------[ cut here ]------------
    kernel BUG at mm/slub.c:3969!
    Oops: Exception in kernel mode, sig: 5 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
    Modules linked in:
    CPU: 0 PID: 154 Comm: sh Not tainted 5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0 #61
    NIP kfree+0x338/0x3b0
    LR section_deactivate+0x138/0x200
    Call Trace:
    section_deactivate+0x138/0x200
    __remove_pages+0x114/0x150
    arch_remove_memory+0x3c/0x160
    try_remove_memory+0x114/0x1a0
    __remove_memory+0x20/0x40
    memtrace_enable_set+0x254/0x850
    simple_attr_write+0x138/0x160
    full_proxy_write+0x8c/0x110
    __vfs_write+0x38/0x70
    vfs_write+0x11c/0x2a0
    ksys_write+0x84/0x140
    system_call+0x5c/0x68
    ---[ end trace 4b053cbd84e0db62 ]---

    The first invocation will offline+remove memory blocks. The second
    invocation will first add+online them again, in order to offline+remove
    them again (usually we are lucky and the exact same memory blocks will
    get "reallocated").

    Tested on powernv with boot memory: The usage map will not get freed.
    Tested on x86-64 with DIMMs: The usage map will get freed.

    Using Dynamic Memory under a Power DLPAR can trigger it easily.

    Triggering removal (I assume after previously removed+re-added) of
    memory from the HMC GUI can crash the kernel with the same call trace
    and is fixed by this patch.

    Link: http://lkml.kernel.org/r/20191217104637.5509-1-david@redhat.com
    Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
    Signed-off-by: David Hildenbrand
    Tested-by: Pingfan Liu
    Cc: Dan Williams
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

02 Dec, 2019

4 commits

  • sparse_buffer_init() uses memblock_alloc_try_nid_raw() to allocate memory
    for the page management structure; if the allocation from the specified
    node fails, it falls back to allocating from other nodes.

    Normally, the page management structure will not exceed 2% of the total
    memory, but it needs a large contiguous block of memory. In most cases,
    allocation from the specified node will succeed, but it will fail on a
    node whose memory has become highly fragmented. In that case we would
    rather allocate section by section than allocate a large block of memory
    from other NUMA nodes.

    Add memblock_alloc_exact_nid_raw() for this situation, which allocates
    boot memory blocks on the exact node. If a large contiguous block
    allocation fails in sparse_buffer_init(), it will fall back to allocating
    smaller blocks, one memory section at a time.
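
    A sketch of the intended use in sparse_buffer_init() ('addr' stands for
    the usual lower address bound passed to memblock):

    sparsemap_buf = memblock_alloc_exact_nid_raw(size, section_map_size(),
                                                 addr,
                                                 MEMBLOCK_ALLOC_ACCESSIBLE,
                                                 nid);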

    Link: http://lkml.kernel.org/r/66755ea7-ab10-8882-36fd-3e02b03775d5@huawei.com
    Signed-off-by: Yunfeng Ye
    Reviewed-by: Mike Rapoport
    Cc: Wei Yang
    Cc: Oscar Salvador
    Cc: Dan Williams
    Cc: David Hildenbrand
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yunfeng Ye
     
  • Vincent has noticed [1] that there is something unusual with the memmap
    allocations going on on his platform

    : I noticed this because on my ARM64 platform, with 1 GiB of memory the
    : first [and only] section is allocated from the zeroing path while with
    : 2 GiB of memory the first 1 GiB section is allocated from the
    : non-zeroing path.

    The underlying problem is that although sparse_buffer_init() allocates
    enough memory for all sections on the node, sparse_buffer_alloc() is not
    able to consume it due to a mismatch in the expected allocation
    alignment. While the sparse_buffer_init() preallocation uses PAGE_SIZE
    alignment, the real memmap has to be aligned to section_map_size(). This
    results in a wasted initial chunk of the preallocated memmap and an
    unnecessary fallback allocation for a section.

    While we are at it, also change __populate_section_memmap() to align to
    the requested size because at least VMEMMAP has constraints to have the
    memmap properly aligned.

    [1] http://lkml.kernel.org/r/20191030131122.8256-1-vincent.whitchurch@axis.com

    [akpm@linux-foundation.org: tweak layout, per David]
    Link: http://lkml.kernel.org/r/20191119092642.31799-1-mhocko@kernel.org
    Fixes: 35fd1eb1e821 ("mm/sparse: abstract sparse buffer allocations")
    Signed-off-by: Michal Hocko
    Reported-by: Vincent Whitchurch
    Debugged-by: Vincent Whitchurch
    Acked-by: David Hildenbrand
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Building the kernel on s390 with -Og produces the following warning:

    WARNING: vmlinux.o(.text+0x28dabe): Section mismatch in reference from the function populate_section_memmap() to the function .meminit.text:__populate_section_memmap()
    The function populate_section_memmap() references
    the function __meminit __populate_section_memmap().
    This is often because populate_section_memmap lacks a __meminit
    annotation or the annotation of __populate_section_memmap is wrong.

    While -Og is not supported, in theory this might still happen with
    another compiler or on another architecture. So fix this by using the
    correct section annotations.

    [iii@linux.ibm.com: v2]
    Link: http://lkml.kernel.org/r/20191030151639.41486-1-iii@linux.ibm.com
    Link: http://lkml.kernel.org/r/20191028165549.14478-1-iii@linux.ibm.com
    Signed-off-by: Ilya Leoshkevich
    Acked-by: David Hildenbrand
    Cc: Heiko Carstens
    Cc: Vasily Gorbik
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ilya Leoshkevich
     
  • sparsemem without VMEMMAP has two allocation paths to allocate the
    memory needed for its memmap (done in sparse_mem_map_populate()).

    In one allocation path (sparse_buffer_alloc() succeeds), the memory is
    not zeroed (since it was previously allocated with
    memblock_alloc_try_nid_raw()).

    In the other allocation path (sparse_buffer_alloc() fails and
    sparse_mem_map_populate() falls back to memblock_alloc_try_nid()), the
    memory is zeroed.

    AFAICS this difference does not appear to be on purpose. If the code is
    supposed to work with non-initialized memory (__init_single_page() takes
    care of zeroing the struct pages which are actually used), we should
    consistently not zero the memory, to avoid masking bugs.

    ( I noticed this because on my ARM64 platform, with 1 GiB of memory the
    first [and only] section is allocated from the zeroing path while with
    2 GiB of memory the first 1 GiB section is allocated from the
    non-zeroing path. )

    Michal:
    "the main user visible problem is a memory wastage. The overal amount
    of memory should be small. I wouldn't call it stable material."

    Link: http://lkml.kernel.org/r/20191030131122.8256-1-vincent.whitchurch@axis.com
    Signed-off-by: Vincent Whitchurch
    Acked-by: Michal Hocko
    Acked-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Reviewed-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vincent Whitchurch
     

08 Oct, 2019

1 commit

  • We get two warnings when build kernel W=1:

    mm/shuffle.c:36:12: warning: no previous prototype for `shuffle_show' [-Wmissing-prototypes]
    mm/sparse.c:220:6: warning: no previous prototype for `subsection_mask_set' [-Wmissing-prototypes]

    Make the functions static to fix this.

    Link: http://lkml.kernel.org/r/1566978161-7293-1-git-send-email-wang.yi59@zte.com.cn
    Signed-off-by: Yi Wang
    Reviewed-by: David Hildenbrand
    Reviewed-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yi Wang
     

25 Sep, 2019

5 commits

  • There is no possibility for memmap to be NULL in the current codebase.

    This check was added in commit 95a4774d055c ("memory-hotplug: update
    mce_bad_pages when removing the memory") where memmap was originally
    inited to NULL, and only conditionally given a value.

    The code that could have passed a NULL has been removed by commit
    ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug"), so there is no
    longer a possibility that memmap can be NULL.

    Link: http://lkml.kernel.org/r/20190829035151.20975-1-alastair@d-silva.org
    Signed-off-by: Alastair D'Silva
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Cc: Mike Rapoport
    Cc: Wei Yang
    Cc: Qian Cai
    Cc: Alexander Duyck
    Cc: Logan Gunthorpe
    Cc: Baoquan He
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alastair D'Silva
     
  • Use the function written to do it instead.

    Link: http://lkml.kernel.org/r/20190827053656.32191-2-alastair@au1.ibm.com
    Signed-off-by: Alastair D'Silva
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Acked-by: Mike Rapoport
    Reviewed-by: Wei Yang
    Reviewed-by: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alastair D'Silva
     
  • __pfn_to_section() is defined as __nr_to_section(pfn_to_section_nr(pfn)).

    Since we already have section_nr, it is not necessary to derive the
    mem_section from start_pfn. By doing so, we remove one redundant
    operation.
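
    That is, roughly:

    /* was:  struct mem_section *ms = __pfn_to_section(start_pfn); */
    struct mem_section *ms = __nr_to_section(section_nr);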

    Link: http://lkml.kernel.org/r/20190809010242.29797-1-richardw.yang@linux.intel.com
    Signed-off-by: Wei Yang
    Reviewed-by: Anshuman Khandual
    Tested-by: Anshuman Khandual
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • The size argument passed into sparse_buffer_alloc() has already been
    aligned to PAGE_SIZE or PMD_SIZE.

    If the aligned size is not a power of 2 (e.g. 0x480000), PTR_ALIGN() will
    return the wrong value: it rounds up using the bitmask (size - 1), which
    only yields a multiple of size when size is a power of two. Use roundup()
    to round sparsemap_buf up to the next multiple of size.

    Link: http://lkml.kernel.org/r/20190705114826.28586-1-lecopzer.chen@mediatek.com
    Signed-off-by: Lecopzer Chen
    Signed-off-by: Mark-PK Tsai
    Cc: YJ Chiang
    Cc: Lecopzer Chen
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lecopzer Chen
     
  • sparse_buffer_alloc(size) hands out memory from sparsemap_buf after
    aligning to the requested size. However, the size is at least
    PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION) and usually larger
    than PAGE_SIZE.

    Also, sparse_buffer_fini() only frees memory between sparsemap_buf and
    sparsemap_buf_end. Since sparsemap_buf may have been advanced by
    PTR_ALIGN() first, the aligned space before sparsemap_buf is wasted and
    no one will touch it.

    In our ARM32 platform (without SPARSEMEM_VMEMMAP):
      sparse_buffer_init
        Reserve d359c000 - d3e9c000 (9M)
      sparse_buffer_alloc
        Alloc   d3a00000 - d3e80000 (4.5M)
      sparse_buffer_fini
        Free    d3e80000 - d3e9c000 (~=100k)
    The reserved memory between d359c000 - d3a00000 (~=4.4M) is unfreed.

    In an ARM64 platform (with SPARSEMEM_VMEMMAP):
      sparse_buffer_init
        Reserve ffffffc07d623000 - ffffffc07f623000 (32M)
      sparse_buffer_alloc
        Alloc   ffffffc07d800000 - ffffffc07f600000 (30M)
      sparse_buffer_fini
        Free    ffffffc07f600000 - ffffffc07f623000 (140K)
    The reserved memory between ffffffc07d623000 - ffffffc07d800000
    (~=1.9M) is unfreed.

    Let's explicitly free the redundant aligned memory.
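
    A sketch of the resulting allocator (combined with the roundup()
    alignment from the related patch above):

    void * __meminit sparse_buffer_alloc(unsigned long size)
    {
            void *ptr = NULL;

            if (sparsemap_buf) {
                    ptr = (void *) roundup((unsigned long)sparsemap_buf, size);
                    if (ptr + size > sparsemap_buf_end)
                            ptr = NULL;
                    else {
                            /* free the chunk skipped over by the alignment */
                            if ((unsigned long)(ptr - sparsemap_buf) > 0)
                                    sparse_buffer_free((unsigned long)(ptr - sparsemap_buf));
                            sparsemap_buf = ptr + size;
                    }
            }
            return ptr;
    }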

    [arnd@arndb.de: mark sparse_buffer_free as __meminit]
    Link: http://lkml.kernel.org/r/20190709185528.3251709-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20190705114730.28534-1-lecopzer.chen@mediatek.com
    Signed-off-by: Lecopzer Chen
    Signed-off-by: Mark-PK Tsai
    Signed-off-by: Arnd Bergmann
    Cc: YJ Chiang
    Cc: Lecopzer Chen
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lecopzer Chen
     

19 Jul, 2019

9 commits

  • David points out that there is a mixture of 'int' and 'unsigned long'
    usage for section number data types. Update the memory hotplug path to
    use 'unsigned long' consistently for section numbers.

    [akpm@linux-foundation.org: fix printk format]
    Link: http://lkml.kernel.org/r/156107543656.1329419.11505835211949439815.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: David Hildenbrand
    Reviewed-by: David Hildenbrand
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • The libnvdimm sub-system has suffered a series of hacks and broken
    workarounds for the memory-hotplug implementation's awkward
    section-aligned (128MB) granularity.

    For example the following backtrace is emitted when attempting
    arch_add_memory() with physical address ranges that intersect 'System
    RAM' (RAM) with 'Persistent Memory' (PMEM) within a given section:

    # cat /proc/iomem | grep -A1 -B1 Persistent\ Memory
    100000000-1ffffffff : System RAM
    200000000-303ffffff : Persistent Memory (legacy)
    304000000-43fffffff : System RAM
    440000000-23ffffffff : Persistent Memory
    2400000000-43bfffffff : Persistent Memory
    2400000000-43bfffffff : namespace2.0

    WARNING: CPU: 38 PID: 928 at arch/x86/mm/init_64.c:850 add_pages+0x5c/0x60
    [..]
    RIP: 0010:add_pages+0x5c/0x60
    [..]
    Call Trace:
    devm_memremap_pages+0x460/0x6e0
    pmem_attach_disk+0x29e/0x680 [nd_pmem]
    ? nd_dax_probe+0xfc/0x120 [libnvdimm]
    nvdimm_bus_probe+0x66/0x160 [libnvdimm]

    It was discovered that the problem goes beyond RAM vs PMEM collisions as
    some platform produce PMEM vs PMEM collisions within a given section.
    The libnvdimm workaround for that case revealed that the libnvdimm
    section-alignment-padding implementation has been broken for a long
    while.

    A fix for that long-standing breakage introduces as many problems as it
    solves as it would require a backward-incompatible change to the
    namespace metadata interpretation. Instead of that dubious route [1],
    address the root problem in the memory-hotplug implementation.

    Note that EEXIST is no longer treated as success, as that is how
    sparse_add_section() reports subsection collisions; it was also obviated
    by recent changes to perform the request_region() for 'System RAM'
    before arch_add_memory() in the add_memory() sequence.

    [1] https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com

    [osalvador@suse.de: fix deactivate_section for early sections]
    Link: http://lkml.kernel.org/r/20190715081549.32577-2-osalvador@suse.de
    Link: http://lkml.kernel.org/r/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Oscar Salvador
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: Pavel Tatashin
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Prepare the memory hot-{add,remove} paths for handling sub-section
    ranges by plumbing the starting page frame and number of pages being
    handled through arch_{add,remove}_memory() to
    sparse_{add,remove}_one_section().

    This is simply plumbing, small cleanups, and some identifier renames.
    No intended functional changes.

    Link: http://lkml.kernel.org/r/156092353780.979959.9713046515562743194.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Pavel Tatashin
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Allow sub-section sized ranges to be added to the memmap.

    populate_section_memmap() takes an explicit pfn range rather than
    assuming a full section, and those parameters are plumbed all the way
    through to vmemmap_populate(). There should be no sub-section usage in
    current deployments. New warnings are added to clarify which memmap
    allocation paths are sub-section capable.
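
    The interface change amounts to passing the range explicitly, roughly:

    struct page * __meminit __populate_section_memmap(unsigned long pfn,
                    unsigned long nr_pages, int nid, struct vmem_altmap *altmap);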

    Link: http://lkml.kernel.org/r/156092352058.979959.6551283472062305149.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Pavel Tatashin
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Cc: Logan Gunthorpe
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
    sub-section active bitmask, each bit representing a PMD_SIZE span of the
    architecture's memory hotplug section size.

    The implications of a partially populated section is that pfn_valid()
    needs to go beyond a valid_section() check and either determine that the
    section is an "early section", or read the sub-section active ranges
    from the bitmask. The expectation is that the bitmask (subsection_map)
    fits in the same cacheline as the valid_section() / early_section()
    data, so the incremental performance overhead to pfn_valid() should be
    negligible.

    The rationale for using early_section() to short-circuit the
    subsection_map check is that there are legacy code paths that use
    pfn_valid() at section granularity before validating the pfn against
    pgdat data. So, the early_section() check allows those traditional
    assumptions to persist while also permitting subsection_map to tell the
    truth for purposes of populating the unused portions of early sections
    with PMEM and other ZONE_DEVICE mappings.

    Link: http://lkml.kernel.org/r/156092350874.979959.18185938451405518285.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Qian Cai
    Tested-by: Jane Chu
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: Pavel Tatashin
    Cc: David Hildenbrand
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • In preparation for sub-section hotplug, track whether a given section
    was created during early memory initialization, or later via memory
    hotplug. This distinction is needed to maintain the coarse expectation
    that pfn_valid() returns true for any pfn within a given section even if
    that section has pages that are reserved from the page allocator.

    For example, one of the goals of subsection hotplug is to support
    cases where the system physical memory layout collides System RAM and
    PMEM within a section. Several pfn_valid() users expect to just check
    if a section is valid, but they are not careful to check if the given
    pfn is within a "System RAM" boundary and instead expect pgdat
    information to further validate the pfn.

    Rather than unwind those paths to make their pfn_valid() queries more
    precise, a follow-on patch uses the SECTION_IS_EARLY flag to maintain the
    traditional expectation that pfn_valid() returns true for all early
    sections.
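
    A sketch of the flag and its accessor (roughly as added to the section
    flags in mmzone.h):

    #define SECTION_IS_EARLY        (1UL << 3)

    static inline int early_section(struct mem_section *section)
    {
            return (section && (section->section_mem_map & SECTION_IS_EARLY));
    }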

    Link: https://lore.kernel.org/lkml/1560366952-10660-1-git-send-email-cai@lca.pw/
    Link: http://lkml.kernel.org/r/156092350358.979959.5817209875548072819.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Qian Cai
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Logan Gunthorpe
    Cc: David Hildenbrand
    Cc: Pavel Tatashin
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Patch series "mm: Sub-section memory hotplug support", v10.

    The memory hotplug section is an arbitrary / convenient unit for memory
    hotplug. 'Section-size' units have bled into the user interface
    ('memblock' sysfs) and can not be changed without breaking existing
    userspace. The section-size constraint, while mostly benign for typical
    memory hotplug, has and continues to wreak havoc with 'device-memory'
    use cases, persistent memory (pmem) in particular. Recall that pmem
    uses devm_memremap_pages(), and subsequently arch_add_memory(), to
    allocate a 'struct page' memmap for pmem. However, it does not use the
    'bottom half' of memory hotplug, i.e. never marks pmem pages online and
    never exposes the userspace memblock interface for pmem. This leaves an
    opening to redress the section-size constraint.

    To date, the libnvdimm subsystem has attempted to inject padding to
    satisfy the internal constraints of arch_add_memory(). Beyond
    complicating the code, leading to bugs [2], wasting memory, and limiting
    configuration flexibility, the padding hack is broken when the platform
    changes this physical memory alignment of pmem from one boot to the
    next. Device failure (intermittent or permanent) and physical
    reconfiguration are events that can cause the platform firmware to
    change the physical placement of pmem on a subsequent boot, and device
    failure is an everyday event in a data-center.

    It turns out that sections are only a hard requirement of the
    user-facing interface for memory hotplug and with a bit more
    infrastructure sub-section arch_add_memory() support can be added for
    kernel internal usages like devm_memremap_pages(). Here is an analysis
    of the current design assumptions in the current code and how they are
    addressed in the new implementation:

    Current design assumptions:

    - Sections that describe boot memory (early sections) are never
    unplugged / removed.

    - pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y case, devolves to a
    valid_section() check

    - __add_pages() and helper routines assume all operations occur in
    PAGES_PER_SECTION units.

    - The memblock sysfs interface only comprehends full sections

    New design assumptions:

    - Sections are instrumented with a sub-section bitmask to track (on
    x86) individual 2MB sub-divisions of a 128MB section.

    - Partially populated early sections can be extended with additional
    sub-sections, and those sub-sections can be removed with
    arch_remove_memory(). With this in place we no longer lose usable
    memory capacity to padding.

    - pfn_valid() is updated to look deeper than valid_section() to also
    check the active-sub-section mask. This indication is in the same
    cacheline as the valid_section() so the performance impact is
    expected to be negligible. So far the lkp robot has not reported any
    regressions.

    - Outside of the core vmemmap population routines which are replaced,
    other helper routines like shrink_{zone,pgdat}_span() are updated to
    handle the smaller granularity. Core memory hotplug routines that
    deal with online memory are not touched.

    - The existing memblock sysfs user api guarantees / assumptions are not
    touched since this capability is limited to !online
    !memblock-sysfs-accessible sections.

    Meanwhile the issue reports continue to roll in from users that do not
    understand when and how the 128MB constraint will bite them. The current
    implementation relied on being able to support at least one misaligned
    namespace, but that immediately falls over on any moderately complex
    namespace creation attempt. Beyond the initial problem of 'System RAM'
    colliding with pmem, and the unsolvable problem of physical alignment
    changes, Linux is now being exposed to platforms that collide pmem ranges
    with other pmem ranges by default [3]. In short, devm_memremap_pages()
    has pushed the venerable section-size constraint past the breaking point,
    and the simplicity of section-aligned arch_add_memory() is no longer
    tenable.

    These patches are exposed to the kbuild robot on a subsection-v10 branch
    [4], and a preview of the unit test for this functionality is available
    on the 'subsection-pending' branch of ndctl [5].

    [2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
    [3]: https://github.com/pmem/ndctl/issues/76
    [4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10
    [5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c

    This patch (of 13):

    Towards enabling memory hotplug to track partial population of a section,
    introduce 'struct mem_section_usage'.

    A pointer to a 'struct mem_section_usage' instance replaces the existing
    pointer to a 'pageblock_flags' bitmap. Effectively it adds one more
    'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house
    a new 'subsection_map' bitmap. The new bitmap enables the memory
    hot{plug,remove} implementation to act on incremental sub-divisions of a
    section.
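
    The new structure is roughly:

    struct mem_section_usage {
            DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION);
            /* See declaration of similar field in struct zone */
            unsigned long pageblock_flags[0];
    };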

    SUBSECTION_SHIFT is defined as a global constant instead of a
    per-architecture value like SECTION_SIZE_BITS in order to allow cross-arch
    compatibility of subsection users. Specifically, a common subsection size
    allows for the possibility that persistent memory namespace configurations
    can be made compatible across architectures.

    The primary motivation for this functionality is to support platforms that
    mix "System RAM" and "Persistent Memory" within a single section, or
    multiple PMEM ranges with different mapping lifetimes within a single
    section. The section restriction for hotplug has caused an ongoing saga
    of hacks and bugs for devm_memremap_pages() users.

    Beyond the fixups to teach existing paths how to retrieve the 'usemap'
    from a section, and updates to usemap allocation path, there are no
    expected behavior changes.

    Link: http://lkml.kernel.org/r/156092349845.979959.73333291612799019.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Oscar Salvador
    Reviewed-by: Wei Yang
    Tested-by: Aneesh Kumar K.V [ppc64]
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: Pavel Tatashin
    Cc: David Hildenbrand
    Cc: Jérôme Glisse
    Cc: Mike Rapoport
    Cc: Jane Chu
    Cc: Pavel Tatashin
    Cc: Jonathan Corbet
    Cc: Qian Cai
    Cc: Logan Gunthorpe
    Cc: Toshi Kani
    Cc: Jeff Moyer
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Patch series "mm: Further memory block device cleanups", v1.

    Some further cleanups around memory block devices. Especially, clean up
    and simplify walk_memory_range(). Including some other minor cleanups.

    This patch (of 6):

    We are using a mixture of "int" and "unsigned long". Let's make this
    consistent by using "unsigned long" everywhere. We'll do the same with
    memory block ids next.

    While at it, turn the "unsigned long i" in removable_show() into an int
    - sections_per_block is an int.

    [akpm@linux-foundation.org: s/unsigned long i/unsigned long nr/]
    [david@redhat.com: v3]
    Link: http://lkml.kernel.org/r/20190620183139.4352-2-david@redhat.com
    Link: http://lkml.kernel.org/r/20190614100114.311-2-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Andrew Morton
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Mel Gorman
    Cc: Wei Yang
    Cc: Johannes Weiner
    Cc: Arun KS
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Cc: Stephen Rothwell
    Cc: Mike Rapoport
    Cc: Baoquan He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • When NODE_NOT_IN_PAGE_FLAGS is set, we store a section's node id in
    section_to_node_table[]. For hot-added memory, however, this step is
    missed. Without this information, page_to_nid() may not give the right
    node id.

    BTW, the current online_pages() works because it leverages the nid in
    memory_block. But the granularity of the node id should be mem_section
    wide.
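
    A sketch of the helper that the hot-add path should also call
    (simplified; the real code stubs it out when NODE_NOT_IN_PAGE_FLAGS is
    not defined):

    static void set_section_nid(unsigned long section_nr, int nid)
    {
    #ifdef NODE_NOT_IN_PAGE_FLAGS
            section_to_node_table[section_nr] = nid;
    #endif
    }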

    Link: http://lkml.kernel.org/r/20190618005537.18878-1-richardw.yang@linux.intel.com
    Signed-off-by: Wei Yang
    Reviewed-by: Oscar Salvador
    Reviewed-by: David Hildenbrand
    Reviewed-by: Anshuman Khandual
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang