04 Sep, 2021

5 commits

  • There are several places that allocate memory for the memory map:
    alloc_node_mem_map() for FLATMEM, sparse_buffer_init() and
    __populate_section_memmap() for SPARSEMEM.

    The memory allocated in the FLATMEM case is zeroed and it is never
    poisoned, regardless of CONFIG_PAGE_POISON setting.

    The memory allocated in the SPARSEMEM cases is not zeroed and it is
    implicitly poisoned inside memblock if CONFIG_PAGE_POISON is set.

    Introduce a memmap_alloc() wrapper for the memblock allocators that will
    be used for both the FLATMEM and SPARSEMEM cases and will make memory map
    zeroing and poisoning consistent across memory models.
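
    A minimal sketch of such a wrapper, assuming the helper is named
    memmap_alloc() and builds on the existing memblock_alloc_try_nid_raw(),
    memblock_alloc_exact_nid_raw() and page_init_poison() interfaces; the
    in-tree version may differ in detail:

    void *__init memmap_alloc(phys_addr_t size, phys_addr_t align,
                              phys_addr_t min_addr, int nid, bool exact_nid)
    {
            void *ptr;

            /* raw memblock allocations are neither zeroed nor poisoned */
            if (exact_nid)
                    ptr = memblock_alloc_exact_nid_raw(size, align, min_addr,
                                            MEMBLOCK_ALLOC_ACCESSIBLE, nid);
            else
                    ptr = memblock_alloc_try_nid_raw(size, align, min_addr,
                                            MEMBLOCK_ALLOC_ACCESSIBLE, nid);

            /* poison the memory map uniformly for all memory models */
            if (ptr && size > 0)
                    page_init_poison(ptr, size);

            return ptr;
    }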

    Link: https://lkml.kernel.org/r/20210714123739.16493-4-rppt@kernel.org
    Signed-off-by: Mike Rapoport
    Cc: Michal Simek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Clarify pgdat_to_phys() by testing if
    pgdat == &contig_page_data when CONFIG_NUMA=n.

    We only expect contig_page_data in that case, so we
    use &contig_page_data directly instead of pgdat.

    No functional change intended when CONFIG_DEBUG_VM=n.

    Comment from Mark [1]:
    "
    ... and I reckon it'd be clearer and more robust to define
    pgdat_to_phys() in the same ifdefs as contig_page_data so
    that these, stay in-sync. e.g. have:

    | #ifdef CONFIG_NUMA
    | #define pgdat_to_phys(x) virt_to_phys(x)
    | #else /* CONFIG_NUMA */
    |
    | extern struct pglist_data contig_page_data;
    | ...
    | #define pgdat_to_phys(x) __pa_symbol(&contig_page_data)
    |
    | #endif /* CONFIG_NUMA */
    "

    [1] https://lore.kernel.org/linux-arm-kernel/20210615131902.GB47121@C02TD0UTHF1T.local/

    Link: https://lkml.kernel.org/r/20210723123342.26406-1-miles.chen@mediatek.com
    Signed-off-by: Miles Chen
    Reviewed-by: David Hildenbrand
    Acked-by: Mike Rapoport
    Cc: Mark Rutland
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miles Chen
     
  • cppcheck warns that we're possibly losing information by shifting an int.
    It's a false positive, because we don't allow for a NUMA node ID that
    large, but if we ever change SECTION_NID_SHIFT, it could become a problem,
    and in any case this is usually a legitimate warning. Fix it by adding
    the necessary cast, which makes the compiler generate the right code.
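
    A minimal sketch of the fix, assuming the warning points at the early nid
    encoding helper in mm/sparse.c; the exact context may differ:

    static inline unsigned long sparse_encode_early_nid(int nid)
    {
            /* cast before shifting so the shift happens in unsigned long */
            return ((unsigned long)nid << SECTION_NID_SHIFT);
    }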

    Link: https://lkml.kernel.org/r/YOya+aBZFFmC476e@casper.infradead.org
    Link: https://lkml.kernel.org/r/202107130348.6LsVT9Nc-lkp@intel.com
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • As the last users of __section_nr() are gone, let's remove unused function
    __section_nr().

    Link: https://lkml.kernel.org/r/20210707150212.855-4-ohoono.kwon@samsung.com
    Signed-off-by: Ohhoon Kwon
    Acked-by: Michal Hocko
    Acked-by: Mike Rapoport
    Reviewed-by: David Hildenbrand
    Cc: Baoquan He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ohhoon Kwon
     
  • Patch series "mm: sparse: remove __section_nr() function", v4.

    This patch (of 3):

    With CONFIG_SPARSEMEM_EXTREME enabled, __section_nr() which converts
    mem_section to section_nr could be costly since it iterates all section
    roots to check if the given mem_section is in its range.

    Since both callers of section_mark_present() already know the section_nr,
    let's pass section_nr in addition to the mem_section in order to avoid the
    costly translation.
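
    A minimal sketch of the resulting helper, assuming it is renamed to take
    the section number while keeping the existing bookkeeping; the actual
    patch may differ slightly:

    static void __section_mark_present(struct mem_section *ms,
                                       unsigned long section_nr)
    {
            /* caller passes section_nr, so no __section_nr() lookup is needed */
            if (section_nr > __highest_present_section_nr)
                    __highest_present_section_nr = section_nr;

            ms->section_mem_map |= SECTION_MARKED_PRESENT;
    }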

    Link: https://lkml.kernel.org/r/20210707150212.855-1-ohoono.kwon@samsung.com
    Link: https://lkml.kernel.org/r/20210707150212.855-2-ohoono.kwon@samsung.com
    Signed-off-by: Ohhoon Kwon
    Acked-by: Mike Rapoport
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Cc: Baoquan He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ohhoon Kwon
     

01 Jul, 2021

1 commit

  • Patch series "Free some vmemmap pages of HugeTLB page", v23.

    This patch series will free some vmemmap pages(struct page structures)
    associated with each HugeTLB page when preallocated to save memory.

    To reduce the difficulty of reviewing the first version of this code, this
    version disables the PMD/huge page mapping of vmemmap when the feature is
    enabled. This eliminates a bunch of complex code doing page table
    manipulation. Once this patch series is solid, we can add the vmemmap page
    table manipulation code in the future.

    The struct page structures (page structs) are used to describe a physical
    page frame. By default, there is a one-to-one mapping from a page frame
    to its corresponding page struct.

    HugeTLB pages consist of multiple base page size pages and are supported
    by many architectures. See hugetlbpage.rst in the Documentation
    directory for more details. On the x86 architecture, HugeTLB pages of
    size 2MB and 1GB are currently supported. Since the base page size on x86
    is 4KB, a 2MB HugeTLB page consists of 512 base pages and a 1GB HugeTLB
    page consists of 4096 base pages. For each base page, there is a
    corresponding page struct.

    Within the HugeTLB subsystem, only the first 4 page structs are used to
    contain unique information about a HugeTLB page. HUGETLB_CGROUP_MIN_ORDER
    provides this upper limit. The only 'useful' information in the remaining
    page structs is the compound_head field, and this field is the same for
    all tail pages.

    By removing redundant page structs for HugeTLB pages, memory can be
    returned to the buddy allocator for other uses.

    When the system boots up, every 2MB HugeTLB page has 512 struct page
    structs which occupy 8 pages (sizeof(struct page) * 512 / PAGE_SIZE).

    HugeTLB                    struct pages(8 pages)          page frame(8 pages)
    +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
    |           |                     |     0     | -------------> |     0     |
    |           |                     +-----------+                +-----------+
    |           |                     |     1     | -------------> |     1     |
    |           |                     +-----------+                +-----------+
    |           |                     |     2     | -------------> |     2     |
    |           |                     +-----------+                +-----------+
    |           |                     |     3     | -------------> |     3     |
    |           |                     +-----------+                +-----------+
    |           |                     |     4     | -------------> |     4     |
    |    2MB    |                     +-----------+                +-----------+
    |           |                     |     5     | -------------> |     5     |
    |           |                     +-----------+                +-----------+
    |           |                     |     6     | -------------> |     6     |
    |           |                     +-----------+                +-----------+
    |           |                     |     7     | -------------> |     7     |
    |           |                     +-----------+                +-----------+
    |           |
    |           |
    |           |
    +-----------+

    The value of page->compound_head is the same for all tail pages. The
    first page of page structs (page 0) associated with the HugeTLB page
    contains the 4 page structs necessary to describe the HugeTLB. The only
    use of the remaining pages of page structs (page 1 to page 7) is to point
    to page->compound_head. Therefore, we can remap pages 2 to 7 to page 1.
    Only 2 pages of page structs will be used for each HugeTLB page. This
    will allow us to free the remaining 6 pages to the buddy allocator.

    Here is how things look after remapping.

    HugeTLB                    struct pages(8 pages)          page frame(8 pages)
    +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
    |           |                     |     0     | -------------> |     0     |
    |           |                     +-----------+                +-----------+
    |           |                     |     1     | -------------> |     1     |
    |           |                     +-----------+                +-----------+
    |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
    |           |                     +-----------+                   | | | | |
    |           |                     |     3     | ------------------+ | | | |
    |           |                     +-----------+                     | | | |
    |           |                     |     4     | --------------------+ | | |
    |    2MB    |                     +-----------+                       | | |
    |           |                     |     5     | ----------------------+ | |
    |           |                     +-----------+                         | |
    |           |                     |     6     | ------------------------+ |
    |           |                     +-----------+                           |
    |           |                     |     7     | --------------------------+
    |           |                     +-----------+
    |           |
    |           |
    |           |
    +-----------+

    When a HugeTLB is freed to the buddy system, we should allocate 6 pages
    for vmemmap pages and restore the previous mapping relationship.

    Apart from the 2MB HugeTLB page, we also have the 1GB HugeTLB page. It is
    similar to the 2MB HugeTLB page, and we can also use this approach to free
    its vmemmap pages.

    In this case, for the 1GB HugeTLB page, we can save 4094 pages. This is a
    very substantial gain. On our server, we run some SPDK/QEMU applications
    which use 1024GB of HugeTLB pages. With this feature enabled, we can
    save ~16GB (1GB hugepages) / ~12GB (2MB hugepages) of memory.

    Because the vmemmap page tables are reconstructed on the
    freeing/allocating path, this adds some overhead. Here is an overhead
    analysis.

    1) Allocating 10240 2MB HugeTLB pages.

    a) With this patch series applied:
    # time echo 10240 > /proc/sys/vm/nr_hugepages

    real 0m0.166s
    user 0m0.000s
    sys 0m0.166s

    # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; }
    kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs -
    @start[tid]); delete(@start[tid]); }'
    Attaching 2 probes...

    @latency:
    [8K, 16K) 5476 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [16K, 32K) 4760 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
    [32K, 64K) 4 | |

    b) Without this patch series:
    # time echo 10240 > /proc/sys/vm/nr_hugepages

    real 0m0.067s
    user 0m0.000s
    sys 0m0.067s

    # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; }
    kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs -
    @start[tid]); delete(@start[tid]); }'
    Attaching 2 probes...

    @latency:
    [4K, 8K) 10147 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [8K, 16K) 93 | |

    Summary: allocation with this feature is about ~2x slower than before.

    2) Freeing 10240 2MB HugeTLB pages.

    a) With this patch series applied:
    # time echo 0 > /proc/sys/vm/nr_hugepages

    real 0m0.213s
    user 0m0.000s
    sys 0m0.213s

    # bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; }
    kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs -
    @start[tid]); delete(@start[tid]); }'
    Attaching 2 probes...

    @latency:
    [8K, 16K) 6 | |
    [16K, 32K) 10227 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [32K, 64K) 7 | |

    b) Without this patch series:
    # time echo 0 > /proc/sys/vm/nr_hugepages

    real 0m0.081s
    user 0m0.000s
    sys 0m0.081s

    # bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; }
    kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs -
    @start[tid]); delete(@start[tid]); }'
    Attaching 2 probes...

    @latency:
    [4K, 8K) 6805 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [8K, 16K) 3427 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
    [16K, 32K) 8 | |

    Summary: freeing via __free_hugepage is about ~2-3x slower than before.

    Although the overhead has increased, the overhead is not significant.
    Like Mike said, "However, remember that the majority of use cases create
    HugeTLB pages at or shortly after boot time and add them to the pool. So,
    additional overhead is at pool creation time. There is no change to
    'normal run time' operations of getting a page from or returning a page to
    the pool (think page fault/unmap)".

    Despite the overhead, and in addition to the memory gains from this
    series, there is another benefit. The following data was obtained by Joao
    Martins; many thanks for his effort.

    Page (un)pinners will see an improvement, presumably, as Joao notes,
    because there are fewer memmap pages and thus the tail/head pages stay in
    cache more often.

    Out of the box Joao saw (when comparing linux-next against linux-next +
    this series) with gup_test and pinning a 16G HugeTLB file (with 1G pages):

    get_user_pages(): ~32k -> ~9k
    unpin_user_pages(): ~75k -> ~70k

    Usually any tight loop fetching compound_head(), or reading tail page
    data (e.g. compound_head), benefits a lot. There were some unpinning
    inefficiencies Joao was fixing[2], and with those fixes added it shows
    even more:

    unpin_user_pages(): ~27k -> ~3.8k

    [1] https://lore.kernel.org/linux-mm/20210409205254.242291-1-mike.kravetz@oracle.com/
    [2] https://lore.kernel.org/linux-mm/20210204202500.26474-1-joao.m.martins@oracle.com/

    This patch (of 9):

    Move the common bootmem info registration API into a separate
    bootmem_info.c. A later patch will use {get,put}_page_bootmem() to
    initialize the pages for the vmemmap or to free the vmemmap pages to the
    buddy allocator, so move them out of CONFIG_MEMORY_HOTPLUG_SPARSE. This is
    just code movement without any functional change.
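
    For reference, a rough sketch of the pair being moved, as it looked around
    this time (simplified; the exact fields and checks may differ):

    void get_page_bootmem(unsigned long info, struct page *page,
                          unsigned long type)
    {
            page->freelist = (void *)type;
            SetPagePrivate(page);
            set_page_private(page, info);
            page_ref_inc(page);
    }

    void put_page_bootmem(struct page *page)
    {
            unsigned long type = (unsigned long)page->freelist;

            BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
                   type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);

            if (page_ref_dec_return(page) == 1) {
                    /* last ref: hand the bootmem page back to the buddy */
                    page->freelist = NULL;
                    ClearPagePrivate(page);
                    set_page_private(page, 0);
                    INIT_LIST_HEAD(&page->lru);
                    free_reserved_page(page);
            }
    }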

    Link: https://lkml.kernel.org/r/20210510030027.56044-1-songmuchun@bytedance.com
    Link: https://lkml.kernel.org/r/20210510030027.56044-2-songmuchun@bytedance.com
    Signed-off-by: Muchun Song
    Acked-by: Mike Kravetz
    Reviewed-by: Oscar Salvador
    Reviewed-by: David Hildenbrand
    Reviewed-by: Miaohe Lin
    Tested-by: Chen Huang
    Tested-by: Bodeddula Balasubramaniam
    Cc: Jonathan Corbet
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: x86@kernel.org
    Cc: "H. Peter Anvin"
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Alexander Viro
    Cc: Paul E. McKenney
    Cc: Pawan Gupta
    Cc: Randy Dunlap
    Cc: Oliver Neukum
    Cc: Anshuman Khandual
    Cc: Joerg Roedel
    Cc: Mina Almasry
    Cc: David Rientjes
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Barry Song
    Cc: HORIGUCHI NAOYA
    Cc: Joao Martins
    Cc: Xiongchun Duan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Muchun Song
     

30 Jun, 2021

1 commit

  • After the removal of DISCONTIGMEM the NEED_MULTIPLE_NODES and NUMA
    configuration options are equivalent.

    Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.

    Done with

    $ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
    $(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
    $ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
    $(git grep -wl NEED_MULTIPLE_NODES)

    with manual tweaks afterwards.

    [rppt@linux.ibm.com: fix arm boot crash]
    Link: https://lkml.kernel.org/r/YMj9vHhHOiCVN4BF@linux.ibm.com

    Link: https://lkml.kernel.org/r/20210608091316.3622-9-rppt@kernel.org
    Signed-off-by: Mike Rapoport
    Acked-by: Arnd Bergmann
    Acked-by: David Hildenbrand
    Cc: Geert Uytterhoeven
    Cc: Ivan Kokshaysky
    Cc: Jonathan Corbet
    Cc: Matt Turner
    Cc: Richard Henderson
    Cc: Vineet Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

17 Jun, 2021

1 commit

  • I see a "virt_to_phys used for non-linear address" warning from
    check_usemap_section_nr() on arm64 platforms.

    In current implementation of NODE_DATA, if CONFIG_NEED_MULTIPLE_NODES=y,
    pglist_data is dynamically allocated and assigned to node_data[].

    For example, in arch/arm64/include/asm/mmzone.h:

    extern struct pglist_data *node_data[];
    #define NODE_DATA(nid) (node_data[(nid)])

    If CONFIG_NEED_MULTIPLE_NODES=n, pglist_data is defined as a global
    variable named "contig_page_data".

    For example, in include/linux/mmzone.h:

    extern struct pglist_data contig_page_data;
    #define NODE_DATA(nid) (&contig_page_data)

    If CONFIG_DEBUG_VIRTUAL is not enabled, __pa() can handle both
    dynamically allocated linear addresses and symbol addresses. However,
    if (CONFIG_DEBUG_VIRTUAL=y && CONFIG_NEED_MULTIPLE_NODES=n) we can see
    the "virt_to_phys used for non-linear address" warning because that
    &contig_page_data is not a linear address on arm64.

    Warning message:

    virt_to_phys used for non-linear address: (contig_page_data+0x0/0x1c00)
    WARNING: CPU: 0 PID: 0 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x58/0x68
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper Tainted: G W 5.13.0-rc1-00074-g1140ab592e2e #3
    Hardware name: linux,dummy-virt (DT)
    pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO BTYPE=--)
    Call trace:
    __virt_to_phys+0x58/0x68
    check_usemap_section_nr+0x50/0xfc
    sparse_init_nid+0x1ac/0x28c
    sparse_init+0x1c4/0x1e0
    bootmem_init+0x60/0x90
    setup_arch+0x184/0x1f0
    start_kernel+0x78/0x488

    To fix it, create a small function to handle both translations.
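
    A minimal sketch of such a helper, assuming it is keyed off
    CONFIG_NEED_MULTIPLE_NODES as in the surrounding code; the in-tree version
    may differ:

    static inline phys_addr_t pgdat_to_phys(struct pglist_data *pgdat)
    {
    #ifndef CONFIG_NEED_MULTIPLE_NODES
            /* pgdat is the statically allocated contig_page_data symbol */
            return __pa_symbol(&contig_page_data);
    #else
            /* pgdat was dynamically allocated, so it has a linear address */
            return __pa(pgdat);
    #endif
    }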

    Link: https://lkml.kernel.org/r/1623058729-27264-1-git-send-email-miles.chen@mediatek.com
    Signed-off-by: Miles Chen
    Cc: Mike Rapoport
    Cc: Baoquan He
    Cc: Kazu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miles Chen
     

06 May, 2021

2 commits

  • Various coding style tweaks to various files under mm/

    [daizhiyuan@phytium.com.cn: mm/swapfile: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614223624-16055-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/sparse: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614227288-19363-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/vmscan: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614227649-19853-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/compaction: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228218-20770-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/oom_kill: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228360-21168-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/shmem: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228504-21491-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/page_alloc: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228613-21754-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/filemap: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228936-22337-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/mlock: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613956588-2453-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/frontswap: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613962668-15045-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/vmalloc: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613963379-15988-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/memory_hotplug: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613971784-24878-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/mempolicy: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613972228-25501-1-git-send-email-daizhiyuan@phytium.com.cn

    Link: https://lkml.kernel.org/r/1614222374-13805-1-git-send-email-daizhiyuan@phytium.com.cn
    Signed-off-by: Zhiyuan Dai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhiyuan Dai
     
  • Physical memory hotadd has to allocate a memmap (struct page array) for
    the newly added memory section. Currently, alloc_pages_node() is used
    for those allocations.

    This has some disadvantages:
    a) existing memory is consumed for that purpose
       (eg: ~2MB per 128MB memory section on x86_64)
       This can even lead to extreme cases where the system goes OOM because
       the physically hotplugged memory depletes the available memory before
       it is onlined.
    b) if the whole node is movable then we have off-node struct pages
       which has performance drawbacks.
    c) it might be that there are no PMD_ALIGNED chunks so the memmap array
       gets populated with base pages.

    This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.

    Vmemmap page tables can map arbitrary memory. That means that we can
    reserve a part of the physically hotadded memory to back vmemmap page
    tables. This implementation uses the beginning of the hotplugged memory
    for that purpose.

    There are some non-obvious things to consider though.

    Vmemmap pages are allocated/freed during the memory hotplug events
    (add_memory_resource(), try_remove_memory()) when the memory is
    added/removed. This means that the reserved physical range is not
    online although it is used. The most obvious side effect is that
    pfn_to_online_page() returns NULL for those pfns. The current design
    expects that this should be OK as the hotplugged memory is considered
    garbage until it is onlined. For example, hibernation wouldn't save the
    content of those vmemmaps into the image, so it wouldn't be restored on
    resume, but this should be OK as there is no real content to recover
    anyway while the metadata is reachable from other data structures (e.g.
    vmemmap page tables).

    The reserved space is therefore (de)initialized during the {on,off}line
    events (mhp_{de}init_memmap_on_memory). That is done by extracting page
    allocator independent initialization from the regular onlining path.
    The primary reason to handle the reserved space outside of
    {on,off}line_pages is to make each initialization specific to the
    purpose rather than special case them in a single function.

    As per above, the functions that are introduced are:

    - mhp_init_memmap_on_memory:
    Initializes vmemmap pages by calling move_pfn_range_to_zone(), calls
    kasan_add_zero_shadow(), and onlines as many sections as vmemmap pages
    fully span.

    - mhp_deinit_memmap_on_memory:
    Offlines as many sections as vmemmap pages fully span, removes the
    range from the zone by remove_pfn_range_from_zone(), and calls
    kasan_remove_zero_shadow() for the range.

    The new function memory_block_online() calls mhp_init_memmap_on_memory()
    before doing the actual online_pages(). Should online_pages() fail, we
    clean up by calling mhp_deinit_memmap_on_memory(). Adjusting of
    present_pages is done at the end once we know that online_pages()
    succeeded.
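
    A rough sketch of that online flow (simplified, with some arguments
    elided; names such as nr_vmemmap_pages follow the description above and
    the in-tree code differs in detail):

    static int memory_block_online(struct memory_block *mem)
    {
            unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
            unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
            unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
            int ret;

            /* initialize the pages backing the vmemmap first */
            if (nr_vmemmap_pages) {
                    ret = mhp_init_memmap_on_memory(start_pfn,
                                                    nr_vmemmap_pages, ...);
                    if (ret)
                            return ret;
            }

            /* online the rest of the block, skipping the vmemmap pages */
            ret = online_pages(start_pfn + nr_vmemmap_pages,
                               nr_pages - nr_vmemmap_pages, ...);
            if (ret) {
                    if (nr_vmemmap_pages)
                            mhp_deinit_memmap_on_memory(start_pfn,
                                                        nr_vmemmap_pages);
                    return ret;
            }

            /* account vmemmap pages as present only after onlining succeeded */
            ...

            return 0;
    }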

    On offline, memory_block_offline() needs to unaccount vmemmap pages from
    present_pages() before calling offline_pages(). This is necessary because
    offline_pages() tears down some structures based on the fact whether the
    node or the zone become empty. If offline_pages() fails, we account back
    vmemmap pages. If it succeeds, we call mhp_deinit_memmap_on_memory().

    Hot-remove:

    We need to be careful when removing memory, as adding and
    removing memory needs to be done with the same granularity.
    To check that this assumption is not violated, we check the
    memory range we want to remove and if a) any memory block has
    vmemmap pages and b) the range spans more than a single memory
    block, we scream out loud and refuse to proceed.

    If all is good and the range was using memmap on memory (aka vmemmap pages),
    we construct an altmap structure so free_hugepage_table does the right
    thing and calls vmem_altmap_free instead of free_pagetable.

    Link: https://lkml.kernel.org/r/20210421102701.25051-5-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Anshuman Khandual
    Cc: Pavel Tatashin
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     

01 May, 2021

1 commit

  • sparse_buffer_init() and sparse_buffer_fini() should appear in pairs, or a
    WARN would be issued the next time sparse_buffer_init() runs.

    Add the missing sparse_buffer_fini() in the error branch.
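
    A minimal sketch of the fix in the sparse_init_nid() error branch,
    assuming the surrounding code looks roughly like the current allocation
    failure path:

            map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
                                            nid, NULL);
            if (!map) {
                    pr_err("%s: node[%d] memory map backing failed. Some memory will not be available.",
                           __func__, nid);
                    pnum_begin = pnum;
                    sparse_buffer_fini();   /* release the buffer before bailing out */
                    goto failed;
            }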

    Link: https://lkml.kernel.org/r/20210325113155.118574-1-wangwensheng4@huawei.com
    Fixes: 85c77f791390 ("mm/sparse: add new sparse_init_nid() and sparse_init()")
    Signed-off-by: Wang Wensheng
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Cc: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Wensheng
     

17 Oct, 2020

1 commit

  • We soon want to pass flags via a new type to add_memory() and friends.
    That revealed that we currently don't guard some declarations by
    CONFIG_MEMORY_HOTPLUG.

    While some definitions could be moved to different places, let's keep it
    minimal for now and use CONFIG_MEMORY_HOTPLUG for all functions only
    compiled with CONFIG_MEMORY_HOTPLUG.

    Wrap sparse_decode_mem_map() in CONFIG_MEMORY_HOTPLUG; it's only called
    from CONFIG_MEMORY_HOTPLUG code.

    While at it, remove allow_online_pfn_range(), which is no longer around,
    and mhp_notimplemented(), which is unused.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Pankaj Gupta
    Cc: Baoquan He
    Cc: Wei Yang
    Cc: Anton Blanchard
    Cc: Ard Biesheuvel
    Cc: Benjamin Herrenschmidt
    Cc: Boris Ostrovsky
    Cc: Christian Borntraeger
    Cc: Dave Jiang
    Cc: Eric Biederman
    Cc: Greg Kroah-Hartman
    Cc: Haiyang Zhang
    Cc: Heiko Carstens
    Cc: Jason Gunthorpe
    Cc: Jason Wang
    Cc: Juergen Gross
    Cc: Julien Grall
    Cc: Kees Cook
    Cc: "K. Y. Srinivasan"
    Cc: Len Brown
    Cc: Leonardo Bras
    Cc: Libor Pechacek
    Cc: Michael Ellerman
    Cc: "Michael S. Tsirkin"
    Cc: Nathan Lynch
    Cc: "Oliver O'Halloran"
    Cc: Paul Mackerras
    Cc: Pingfan Liu
    Cc: "Rafael J. Wysocki"
    Cc: Roger Pau Monné
    Cc: Stefano Stabellini
    Cc: Stephen Hemminger
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vishal Verma
    Cc: Wei Liu
    Link: https://lkml.kernel.org/r/20200911103459.10306-4-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

14 Oct, 2020

1 commit

  • There are several occurrences of the following pattern:

    for_each_memblock(memory, reg) {
            start_pfn = memblock_region_memory_base_pfn(reg);
            end_pfn = memblock_region_memory_end_pfn(reg);

            /* do something with start_pfn and end_pfn */
    }

    Rather than iterate over all memblock.memory regions and each time query
    for their start and end PFNs, use for_each_mem_pfn_range() iterator to get
    simpler and clearer code.
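
    As an illustration of the converted pattern (a sketch; individual call
    sites also pick up the node id where they need it):

    unsigned long start_pfn, end_pfn;
    int i, nid;

    for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
            /* do something with start_pfn and end_pfn (and nid) */
    }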

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Reviewed-by: Baoquan He
    Acked-by: Miguel Ojeda [.clang-format]
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: Daniel Axtens
    Cc: Dave Hansen
    Cc: Emil Renner Berthing
    Cc: Hari Bathini
    Cc: Ingo Molnar
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Marek Szyprowski
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: https://lkml.kernel.org/r/20200818151634.14343-12-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

08 Aug, 2020

3 commits

  • After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP we have two equivalent
    functions that call memory_present() for each region in memblock.memory:
    sparse_memory_present_with_active_regions() and memblocks_present().

    Moreover, all architectures have a call to either of these functions
    preceding the call to sparse_init() and in most cases they are called
    one after the other.

    Mark the regions from memblock.memory as present during sparse_init() by
    making sparse_init() call memblocks_present(), make memblocks_present()
    and memory_present() functions static and remove the redundant
    sparse_memory_present_with_active_regions() function.

    Also remove the no longer required HAVE_MEMORY_PRESENT configuration
    option.
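
    Roughly, after this change sparse_init() takes care of marking the regions
    itself (a sketch; the rest of the function is unchanged):

    void __init sparse_init(void)
    {
            unsigned long pnum_end, pnum_begin, map_count = 1;
            int nid_begin;

            /* mark every region in memblock.memory as present up front */
            memblocks_present();

            pnum_begin = first_present_section_nr();
            nid_begin = sparse_early_nid(__nr_to_section(pnum_begin));
            ...
    }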

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200712083130.22919-1-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • For early sections, the memmap is handled specially even when sub-section
    hotplug is enabled: the memmap can only be populated as a whole.

    Quoted from the comment of section_activate():

    * The early init code does not consider partially populated
    * initial sections, it simply assumes that memory will never be
    * referenced. If we hot-add memory into such a section then we
    * do not need to populate the memmap and can simply reuse what
    * is already there.

    However, the current section_deactivate() breaks this rule. When
    hot-removing a sub-section, section_deactivate() depopulates its memmap.
    The consequence is that if we hot-add this sub-section again, its memmap
    never gets properly populated.

    We can reproduce the case with the following steps:

    1. Hacking qemu to allow sub-section early section

    : diff --git a/hw/i386/pc.c b/hw/i386/pc.c
    : index 51b3050d01..c6a78d83c0 100644
    : --- a/hw/i386/pc.c
    : +++ b/hw/i386/pc.c
    : @@ -1010,7 +1010,7 @@ void pc_memory_init(PCMachineState *pcms,
    : }
    :
    : machine->device_memory->base =
    : - ROUND_UP(0x100000000ULL + x86ms->above_4g_mem_size, 1 * GiB);
    : + 0x100000000ULL + x86ms->above_4g_mem_size;
    :
    : if (pcmc->enforce_aligned_dimm) {
    : /* size device region assuming 1G page max alignment per slot */

    2. Bootup qemu with PSE disabled and a sub-section aligned memory size

    Part of the qemu command would look like this:

    sudo x86_64-softmmu/qemu-system-x86_64 \
    --enable-kvm -cpu host,pse=off \
    -m 4160M,maxmem=20G,slots=1 \
    -smp sockets=2,cores=16 \
    -numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 \
    -machine pc,nvdimm \
    -nographic \
    -object memory-backend-ram,id=mem0,size=8G \
    -device nvdimm,id=vm0,memdev=mem0,node=0,addr=0x144000000,label-size=128k

    3. Re-config a pmem device with sub-section size in guest

    ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M

    Then you would see the following call trace:

    pmem0: detected capacity change from 0 to 16777216
    BUG: unable to handle page fault for address: ffffec73c51000b4
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 81ff8067 P4D 81ff8067 PUD 81ff7067 PMD 1437cb067 PTE 0
    Oops: 0002 [#1] SMP NOPTI
    CPU: 16 PID: 1348 Comm: ndctl Kdump: loaded Tainted: G W 5.8.0-rc2+ #24
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
    RIP: 0010:memmap_init_zone+0x154/0x1c2
    Code: 77 16 f6 40 10 02 74 10 48 03 48 08 48 89 cb 48 c1 eb 0c e9 3a ff ff ff 48 89 df 48 c1 e7 06 48f
    RSP: 0018:ffffbdc7011a39b0 EFLAGS: 00010282
    RAX: ffffec73c5100088 RBX: 0000000000144002 RCX: 0000000000144000
    RDX: 0000000000000004 RSI: 007ffe0000000000 RDI: ffffec73c5100080
    RBP: 027ffe0000000000 R08: 0000000000000001 R09: ffff9f8d38f6d708
    R10: ffffec73c0000000 R11: 0000000000000000 R12: 0000000000000004
    R13: 0000000000000001 R14: 0000000000144200 R15: 0000000000000000
    FS: 00007efe6b65d780(0000) GS:ffff9f8d3f780000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffec73c51000b4 CR3: 000000007d718000 CR4: 0000000000340ee0
    Call Trace:
    move_pfn_range_to_zone+0x128/0x150
    memremap_pages+0x4e4/0x5a0
    devm_memremap_pages+0x1e/0x60
    dev_dax_probe+0x69/0x160 [device_dax]
    really_probe+0x298/0x3c0
    driver_probe_device+0xe1/0x150
    ? driver_allows_async_probing+0x50/0x50
    bus_for_each_drv+0x7e/0xc0
    __device_attach+0xdf/0x160
    bus_probe_device+0x8e/0xa0
    device_add+0x3b9/0x740
    __devm_create_dev_dax+0x127/0x1c0
    __dax_pmem_probe+0x1f2/0x219 [dax_pmem_core]
    dax_pmem_probe+0xc/0x1b [dax_pmem]
    nvdimm_bus_probe+0x69/0x1c0 [libnvdimm]
    really_probe+0x147/0x3c0
    driver_probe_device+0xe1/0x150
    device_driver_attach+0x53/0x60
    bind_store+0xd1/0x110
    kernfs_fop_write+0xce/0x1b0
    vfs_write+0xb6/0x1a0
    ksys_write+0x5f/0xe0
    do_syscall_64+0x4d/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
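
    The fix, in sketch form, is to never depopulate the memmap of an early
    section in section_deactivate() (assuming the existing
    section_is_early/free_map_bootmem() structure; the actual patch may
    differ):

            /*
             * The memmap of early sections is always fully populated, so
             * only depopulate the memmap of sections that were hot-added.
             */
            if (!section_is_early)
                    depopulate_section_memmap(pfn, nr_pages, altmap);
            else if (memmap)
                    free_map_bootmem(memmap);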

    Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Acked-by: David Hildenbrand
    Cc: Oscar Salvador
    Cc: Dan Williams
    Link: http://lkml.kernel.org/r/20200625223534.18024-1-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Patch series "mm: cleanup usage of "

    Most architectures have very similar versions of pXd_alloc_one() and
    pXd_free_one() for intermediate levels of page table. These patches add
    generic versions of these functions in and enable
    use of the generic functions where appropriate.

    In addition, functions declared and defined in headers are
    used mostly by core mm and early mm initialization in arch and there is no
    actual reason to have the included all over the place.
    The first patch in this series removes unneeded includes of

    In the end it didn't work out as neatly as I hoped and moving
    pXd_alloc_track() definitions to would require
    unnecessary changes to arches that have custom page table allocations, so
    I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local
    to mm/.

    This patch (of 8):

    In most cases the <asm/pgalloc.h> header is required only for allocations
    of page table memory. Most of the .c files that include that header do not
    use symbols declared in <asm/pgalloc.h> and do not require that header.

    As for the other header files that used to include <asm/pgalloc.h>, it is
    possible to move that include into the .c file that actually uses symbols
    from <asm/pgalloc.h> and drop the include from the header file.

    The process was somewhat automated using

    sed -i -E '/[
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Acked-by: Geert Uytterhoeven [m68k]
    Cc: Abdul Haleem
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Christophe Leroy
    Cc: Joerg Roedel
    Cc: Max Filippov
    Cc: Peter Zijlstra
    Cc: Satheesh Rajendran
    Cc: Stafford Horne
    Cc: Stephen Rothwell
    Cc: Steven Rostedt
    Cc: Joerg Roedel
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

10 Jun, 2020

1 commit

  • Patch series "mm: consolidate definitions of page table accessors", v2.

    The low level page table accessors (pXY_index(), pXY_offset()) are
    duplicated across all architectures and sometimes more than once. For
    instance, we have 31 definition of pgd_offset() for 25 supported
    architectures.

    Most of these definitions are actually identical and typically it boils
    down to, e.g.

    static inline unsigned long pmd_index(unsigned long address)
    {
            return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
            return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
    }

    These definitions can be shared among 90% of the arches provided
    XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

    For architectures that really need a custom version there is always the
    possibility to override the generic version with the usual ifdefs magic.

    These patches introduce include/linux/pgtable.h that replaces
    include/asm-generic/pgtable.h and add the definitions of the page table
    accessors to the new header.

    This patch (of 12):

    The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
    functions involving page table manipulations, e.g. pte_alloc() and
    pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
    in the files that include <linux/mm.h>.

    The include statements in such cases are removed with a simple loop:

    for f in $(git grep -l "include <asm/pgtable.h>") ; do
            sed -i -e '/include <asm\/pgtable.h>/ d' $f
    done

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mike Rapoport
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

05 Jun, 2020

1 commit


08 Apr, 2020

5 commits

  • No functional change.

    [bhe@redhat.com: move functions into CONFIG_MEMORY_HOTPLUG ifdeffery scope]
    Link: http://lkml.kernel.org/r/20200316045804.GC3486@MiWiFi-R3L-srv
    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Cc: Wei Yang
    Cc: Dan Williams
    Cc: Pankaj Gupta
    Cc: Stephen Rothwell
    Link: http://lkml.kernel.org/r/20200312124414.439-6-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • And note that check_pfn_span() gates the proper alignment and size of the
    hot added memory region.

    And also move the code comments from inside section_deactivate() to being
    above it. The code comments are reasonable for the whole function, and
    the moving makes code cleaner.

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Dan Williams
    Cc: Pankaj Gupta
    Cc: Wei Yang
    Link: http://lkml.kernel.org/r/20200312124414.439-5-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Currently, to support subsection aligned memory region adding for pmem,
    subsection map is added to track which subsection is present.

    However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP. It means
    subsection map only makes sense when SPARSEMEM_VMEMMAP enabled. For the
    classic sparse, it's meaningless. Even worse, it may confuse people when
    checking code related to the classic sparse.

    About the classic sparse which doesn't support subsection hotplug, Dan
    said it's more because the effort and maintenance burden outweighs the
    benefit. Besides, the current 64 bit ARCHes all enable
    SPARSEMEM_VMEMMAP_ENABLE by default.

    Combining the above reasons, no need to provide subsection map and the
    relevant handling for the classic sparse. Let's remove them.

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Michal Hocko
    Cc: Pankaj Gupta
    Cc: Wei Yang
    Link: http://lkml.kernel.org/r/20200312124414.439-4-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Factor out the code which clears the subsection map of one memory region
    from section_deactivate() into clear_subsection_map().

    And also add helper function is_subsection_map_empty() to check if the
    current subsection map is empty or not.

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Pankaj Gupta
    Cc: Dan Williams
    Cc: Michal Hocko
    Cc: Wei Yang
    Link: http://lkml.kernel.org/r/20200312124414.439-3-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Patch series "mm/hotplug: Only use subsection map for VMEMMAP", v4.

    Memory sub-section hotplug was added to fix the issue that nvdimm could be
    mapped at non-section aligned starting address. A subsection map is added
    into struct mem_section_usage to implement it.

    However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP. It means
    subsection map only makes sense when SPARSEMEM_VMEMMAP enabled. For the
    classic sparse, subsection map is meaningless and confusing.

    About the classic sparse which doesn't support subsection hotplug, Dan
    said it's more because the effort and maintenance burden outweighs the
    benefit. Besides, the current 64 bit ARCHes all enable
    SPARSEMEM_VMEMMAP_ENABLE by default.

    This patch (of 5):

    Factor out the code that fills the subsection map from section_activate()
    into fill_subsection_map(), this makes section_activate() cleaner and
    easier to follow.
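
    A sketch of the factored-out helper, assuming it keeps the existing
    subsection bitmap handling (subsection_mask_set() and the
    SUBSECTIONS_PER_SECTION bitmap); the actual code may differ slightly:

    static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages)
    {
            struct mem_section *ms = __pfn_to_section(pfn);
            DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 };
            unsigned long *subsection_map;
            int rc = 0;

            subsection_mask_set(map, pfn, nr_pages);

            subsection_map = &ms->usage->subsection_map[0];

            if (bitmap_empty(map, SUBSECTIONS_PER_SECTION))
                    rc = -EINVAL;
            else if (bitmap_intersects(map, subsection_map,
                                       SUBSECTIONS_PER_SECTION))
                    rc = -EEXIST;   /* part of the range is already present */
            else
                    bitmap_or(subsection_map, map, subsection_map,
                              SUBSECTIONS_PER_SECTION);

            return rc;
    }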

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: David Hildenbrand
    Acked-by: Pankaj Gupta
    Cc: Dan Williams
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200312124414.439-2-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     

03 Apr, 2020

3 commits

  • When allocating memmap for hot added memory with the classic sparse, the
    specified 'nid' is ignored in populate_section_memmap().

    Meanwhile, when allocating memmap for the classic sparse during boot, the
    node given by 'nid' is preferred. And VMEMMAP prefers the node of 'nid'
    in both the boot stage and memory hot adding. So there seems to be no
    reason not to respect the node of 'nid' for the classic sparse when hot
    adding memory.

    Use kvmalloc_node instead to use the passed in 'nid'.
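
    For the !VMEMMAP case that would look roughly like this (a sketch; the
    exact allocation flags may differ):

    struct page * __meminit populate_section_memmap(unsigned long pfn,
            unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
    {
            /* allocate the section memmap on the requested node */
            return kvmalloc_node(array_size(sizeof(struct page),
                                            PAGES_PER_SECTION),
                                 GFP_KERNEL, nid);
    }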

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: David Hildenbrand
    Reviewed-by: Wei Yang
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Link: http://lkml.kernel.org/r/20200316125625.GH3486@MiWiFi-R3L-srv
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • This change makes populate_section_memmap()/depopulate_section_memmap
    much simpler.

    Suggested-by: Michal Hocko
    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Reviewed-by: Pankaj Gupta
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Wei Yang
    Acked-by: Michal Hocko
    Link: http://lkml.kernel.org/r/20200316125450.GG3486@MiWiFi-R3L-srv
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • memmap should be the address of the page struct instead of the address of
    the pfn.

    As mentioned by David, if system memory and devmem sit within a section,
    the mismatched address would lead kdump to dump unexpected memory.

    Since sub-section only works for SPARSEMEM_VMEMMAP, pfn_to_page() is valid
    to get the page struct address at this point.

    Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Acked-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Baoquan He
    Link: http://lkml.kernel.org/r/20200210005048.10437-1-richardw.yang@linux.intel.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     

30 Mar, 2020

1 commit

  • Fix the crash like this:

    BUG: Kernel NULL pointer dereference on read at 0x00000000
    Faulting instruction address: 0xc000000000c3447c
    Oops: Kernel access of bad area, sig: 11 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
    ...
    NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0
    LR [c000000000088354] vmemmap_free+0x144/0x320
    Call Trace:
    section_deactivate+0x220/0x240
    __remove_pages+0x118/0x170
    arch_remove_memory+0x3c/0x150
    memunmap_pages+0x1cc/0x2f0
    devm_action_release+0x30/0x50
    release_nodes+0x2f8/0x3e0
    device_release_driver_internal+0x168/0x270
    unbind_store+0x130/0x170
    drv_attr_store+0x44/0x60
    sysfs_kf_write+0x68/0x80
    kernfs_fop_write+0x100/0x290
    __vfs_write+0x3c/0x70
    vfs_write+0xcc/0x240
    ksys_write+0x7c/0x140
    system_call+0x5c/0x68

    The crash is due to NULL dereference at

    test_bit(idx, ms->usage->subsection_map);

    due to ms->usage = NULL in pfn_section_valid()

    With commit d41e2f3bd546 ("mm/hotplug: fix hot remove failure in
    SPARSEMEM|!VMEMMAP case") section_mem_map is set to NULL after
    depopulate_section_mem(). This was done so that pfn_page() can work
    correctly with kernel config that disables SPARSEMEM_VMEMMAP. With that
    config pfn_to_page does

    __section_mem_map_addr(__sec) + __pfn;

    where

    static inline struct page *__section_mem_map_addr(struct mem_section *section)
    {
            unsigned long map = section->section_mem_map;
            map &= SECTION_MAP_MASK;
            return (struct page *)map;
    }

    Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map is
    used to check the pfn validity (pfn_valid()). Since section_deactivate()
    releases mem_section->usage if a section is fully deactivated, a
    pfn_valid() check after a subsection deactivate causes a kernel crash.

    static inline int pfn_valid(unsigned long pfn)
    {
            ...
            return early_section(ms) || pfn_section_valid(ms, pfn);
    }

    where

    static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
    {
            int idx = subsection_map_index(pfn);

            return test_bit(idx, ms->usage->subsection_map);
    }

    Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is
    freed. For architectures like ppc64 where large pages are used for
    vmemmap mapping (16MB), a specific vmemmap mapping can cover multiple
    sections. Hence before a vmemmap mapping page can be freed, the kernel
    needs to make sure there are no valid sections within that mapping.
    Clearing the section valid bit before depopulate_section_memmap() enables
    this.
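
    In sketch form, the teardown path clears the flag once the usage map is
    released (assuming the existing section_deactivate() structure; details
    may differ):

            if (!PageReserved(virt_to_page(ms->usage))) {
                    kfree(ms->usage);
                    ms->usage = NULL;
            }
            memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
            /*
             * Mark the section invalid so that valid_section() returns
             * false, which prevents pfn_valid() from dereferencing
             * ms->usage.
             */
            ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;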

    [aneesh.kumar@linux.ibm.com: add comment]
    Link: http://lkml.kernel.org/r/20200326133235.343616-1-aneesh.kumar@linux.ibm.com
    Link: http://lkml.kernel.org/r/20200325031914.107660-1-aneesh.kumar@linux.ibm.com
    Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
    Reported-by: Sachin Sant
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Tested-by: Sachin Sant
    Reviewed-by: Baoquan He
    Reviewed-by: Wei Yang
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Cc: Michael Ellerman
    Cc: Dan Williams
    Cc: David Hildenbrand
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc:
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

22 Mar, 2020

1 commit

  • In section_deactivate(), pfn_to_page() doesn't work any more after
    ms->section_mem_map is reset to NULL in the SPARSEMEM|!VMEMMAP case. It
    causes a hot remove failure:

    kernel BUG at mm/page_alloc.c:4806!
    invalid opcode: 0000 [#1] SMP PTI
    CPU: 3 PID: 8 Comm: kworker/u16:0 Tainted: G W 5.5.0-next-20200205+ #340
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    RIP: 0010:free_pages+0x85/0xa0
    Call Trace:
    __remove_pages+0x99/0xc0
    arch_remove_memory+0x23/0x4d
    try_remove_memory+0xc8/0x130
    __remove_memory+0xa/0x11
    acpi_memory_device_remove+0x72/0x100
    acpi_bus_trim+0x55/0x90
    acpi_device_hotplug+0x2eb/0x3d0
    acpi_hotplug_work_fn+0x1a/0x30
    process_one_work+0x1a7/0x370
    worker_thread+0x30/0x380
    kthread+0x112/0x130
    ret_from_fork+0x35/0x40

    Let's move the ->section_mem_map resetting after
    depopulate_section_memmap() to fix it.
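
    In sketch form (assuming an 'empty' flag is tracked in
    section_deactivate(); the actual patch may differ in detail):

            if (section_is_early && memmap)
                    free_map_bootmem(memmap);
            else
                    depopulate_section_memmap(pfn, nr_pages, altmap);

            /* only reset section_mem_map once the memmap has been torn down */
            if (empty)
                    ms->section_mem_map = (unsigned long)NULL;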

    [akpm@linux-foundation.org: remove unneeded initialization, per David]
    Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: Pankaj Gupta
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Wei Yang
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc:
    Link: http://lkml.kernel.org/r/20200307084229.28251-2-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     

22 Feb, 2020

1 commit

  • When we use SPARSEMEM instead of SPARSEMEM_VMEMMAP, pfn_to_page()
    doesn't work before sparse_init_one_section() is called.

    This leads to a crash when hotplug memory:

    BUG: unable to handle page fault for address: 0000000006400000
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP PTI
    CPU: 3 PID: 221 Comm: kworker/u16:1 Tainted: G W 5.5.0-next-20200205+ #343
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    RIP: 0010:__memset+0x24/0x30
    Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
    RSP: 0018:ffffb43ac0373c80 EFLAGS: 00010a87
    RAX: ffffffffffffffff RBX: ffff8a1518800000 RCX: 0000000000050000
    RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000006400000
    RBP: 0000000000140000 R08: 0000000000100000 R09: 0000000006400000
    R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
    R13: 0000000000000028 R14: 0000000000000000 R15: ffff8a153ffd9280
    FS: 0000000000000000(0000) GS:ffff8a153ab00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000006400000 CR3: 0000000136fca000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    sparse_add_section+0x1c9/0x26a
    __add_pages+0xbf/0x150
    add_pages+0x12/0x60
    add_memory_resource+0xc8/0x210
    __add_memory+0x62/0xb0
    acpi_memory_device_add+0x13f/0x300
    acpi_bus_attach+0xf6/0x200
    acpi_bus_scan+0x43/0x90
    acpi_device_hotplug+0x275/0x3d0
    acpi_hotplug_work_fn+0x1a/0x30
    process_one_work+0x1a7/0x370
    worker_thread+0x30/0x380
    kthread+0x112/0x130
    ret_from_fork+0x35/0x40

    We should use memmap here, as the original code did.

    On x86 the impact is limited to x86_32 builds, or x86_64 configurations
    that override the default setting for SPARSEMEM_VMEMMAP.

    Other memory hotplug archs (arm64, ia64, and ppc) also default to
    SPARSEMEM_VMEMMAP=y.
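
    Concretely, the poisoning in sparse_add_section() should use the returned
    memmap rather than pfn_to_page() (a sketch of the fix):

            /*
             * Poison uninitialized struct pages in order to catch invalid
             * flags combinations; pfn_to_page() is not usable yet here.
             */
            page_init_poison(memmap, sizeof(struct page) * nr_pages);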

    [dan.j.williams@intel.com: changelog update]
    [rppt@linux.ibm.com: changelog update]
    Link: http://lkml.kernel.org/r/20200219030454.4844-1-bhe@redhat.com
    Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
    Signed-off-by: Wei Yang
    Signed-off-by: Baoquan He
    Acked-by: David Hildenbrand
    Reviewed-by: Baoquan He
    Reviewed-by: Dan Williams
    Acked-by: Michal Hocko
    Cc: Mike Rapoport
    Cc: Oscar Salvador
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     

04 Feb, 2020

1 commit

  • Let's move it to the header and use the shorter variant from
    mm/page_alloc.c (the original one will also check
    "__highest_present_section_nr + 1", which is not necessary). While at
    it, make the section_nr in next_pfn() const.

    In next_pfn(), we now return section_nr_to_pfn(-1) instead of -1 once we
    exceed __highest_present_section_nr, which doesn't make a difference in
    the caller as it is big enough (>= all sane end_pfn).
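
    For reference, the shorter variant being moved looks roughly like this
    (a sketch based on the mm/page_alloc.c version):

    static inline unsigned long next_present_section_nr(unsigned long section_nr)
    {
            while (++section_nr <= __highest_present_section_nr) {
                    if (present_section_nr(section_nr))
                            return section_nr;
            }

            return -1;
    }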

    Link: http://lkml.kernel.org/r/20200113144035.10848-3-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Kirill A. Shutemov
    Cc: Baoquan He
    Cc: Dan Williams
    Cc: "Jin, Zhi"
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

01 Feb, 2020

1 commit

  • After commit ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug"),
    when a mem section is fully deactivated, section_mem_map still records
    the section's start pfn, which is not used any more and will be
    reassigned during re-addition.

    In analogy with alloc/free pattern, it is better to clear all fields of
    section_mem_map.

    Besides this, it breaks the user space tool "makedumpfile" [1], which
    makes the assumption that a hot-removed section has mem_map as NULL,
    instead of checking directly against the SECTION_MARKED_PRESENT bit.
    (makedumpfile would be better off changing the assumption, and needs a
    patch)

    The bug can be reproduced on IBM POWERVM by running "drmgr -c mem -r -q 5",
    triggering a crash, and saving the vmcore with makedumpfile.

    [1]: makedumpfile, commit e73016540293 ("[v1.6.7] Update version")

    Link: http://lkml.kernel.org/r/1579487594-28889-1-git-send-email-kernelfans@gmail.com
    Signed-off-by: Pingfan Liu
    Acked-by: Michal Hocko
    Acked-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Oscar Salvador
    Cc: Baoquan He
    Cc: Qian Cai
    Cc: Kazuhito Hagio
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pingfan Liu
     

14 Jan, 2020

1 commit

  • When we remove an early section, we don't free the usage map, as the
    usage maps of other sections are placed into the same page. Once the
    section is removed, it is no longer an early section (especially, the
    memmap is freed). When we re-add that section, the usage map is reused,
    however, it is no longer an early section. When removing that section
    again, we try to kfree() a usage map that was allocated during early
    boot - bad.

    Let's check against PageReserved() to see if we are dealing with a
    usage map that was allocated during boot. We could also check against
    !(PageSlab(usage_page) || PageCompound(usage_page)), but PageReserved() is
    cleaner.
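
    In sketch form, the deactivation path then only frees dynamically
    allocated usage maps (assuming the current section_deactivate()
    structure):

            /*
             * Usage maps of early sections were allocated from bootmem and
             * are marked PageReserved; they must not be handed to kfree().
             */
            if (!PageReserved(virt_to_page(ms->usage))) {
                    kfree(ms->usage);
                    ms->usage = NULL;
            }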

    Can be triggered using memtrace under ppc64/powernv:

    $ mount -t debugfs none /sys/kernel/debug/
    $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
    $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
    ------------[ cut here ]------------
    kernel BUG at mm/slub.c:3969!
    Oops: Exception in kernel mode, sig: 5 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
    Modules linked in:
    CPU: 0 PID: 154 Comm: sh Not tainted 5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0 #61
    NIP kfree+0x338/0x3b0
    LR section_deactivate+0x138/0x200
    Call Trace:
    section_deactivate+0x138/0x200
    __remove_pages+0x114/0x150
    arch_remove_memory+0x3c/0x160
    try_remove_memory+0x114/0x1a0
    __remove_memory+0x20/0x40
    memtrace_enable_set+0x254/0x850
    simple_attr_write+0x138/0x160
    full_proxy_write+0x8c/0x110
    __vfs_write+0x38/0x70
    vfs_write+0x11c/0x2a0
    ksys_write+0x84/0x140
    system_call+0x5c/0x68
    ---[ end trace 4b053cbd84e0db62 ]---

    The first invocation will offline+remove memory blocks. The second
    invocation will first add+online them again, in order to offline+remove
    them again (usually we are lucky and the exact same memory blocks will
    get "reallocated").

    Tested on powernv with boot memory: The usage map will not get freed.
    Tested on x86-64 with DIMMs: The usage map will get freed.

    Using Dynamic Memory under a Power DLPAR can trigger it easily.

    Triggering removal (I assume after previously removed+re-added) of
    memory from the HMC GUI can crash the kernel with the same call trace
    and is fixed by this patch.

    Link: http://lkml.kernel.org/r/20191217104637.5509-1-david@redhat.com
    Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
    Signed-off-by: David Hildenbrand
    Tested-by: Pingfan Liu
    Cc: Dan Williams
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

02 Dec, 2019

4 commits

  • sparse_buffer_init() uses memblock_alloc_try_nid_raw() to allocate the
    memory for the page management structures; if the allocation from the
    specified node fails, it falls back to allocating from other nodes.

    Normally the page management structures will not exceed 2% of total
    memory, but they require one large contiguous allocation. In most cases
    the allocation from the specified node succeeds, but it can fail when
    that node's memory is highly fragmented. In that case we would rather
    allocate the memory map section by section on the requested node than
    pull a large block of memory from other NUMA nodes.

    Add memblock_alloc_exact_nid_raw() for this situation, which allocates
    boot memory only on the exact node. If the large contiguous allocation
    fails in sparse_buffer_init(), it falls back to allocating smaller,
    per-section blocks.
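
    For reference, the new helper mirrors the existing raw memblock
    allocators but never falls back to another node; the signature sketched
    below is an assumption based on its memblock_alloc_try_nid_raw() sibling:

        void * __init memblock_alloc_exact_nid_raw(phys_addr_t size,
                                                   phys_addr_t align,
                                                   phys_addr_t min_addr,
                                                   phys_addr_t max_addr,
                                                   int nid);

        /*
         * In sparse_buffer_init(): attempt one big node-exact allocation
         * first; if it fails, the memmap is later allocated one section at
         * a time on that node instead of spilling onto other NUMA nodes.
         */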

    Link: http://lkml.kernel.org/r/66755ea7-ab10-8882-36fd-3e02b03775d5@huawei.com
    Signed-off-by: Yunfeng Ye
    Reviewed-by: Mike Rapoport
    Cc: Wei Yang
    Cc: Oscar Salvador
    Cc: Dan Williams
    Cc: David Hildenbrand
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yunfeng Ye
     
  • Vincent has noticed [1] that there is something unusual with the memmap
    allocations going on on his platform

    : I noticed this because on my ARM64 platform, with 1 GiB of memory the
    : first [and only] section is allocated from the zeroing path while with
    : 2 GiB of memory the first 1 GiB section is allocated from the
    : non-zeroing path.

    The underlying problem is that although sparse_buffer_init allocates
    enough memory for all sections on the node, sparse_buffer_alloc is not
    able to consume it due to a mismatch in the expected allocation
    alignment. While the sparse_buffer_init preallocation uses PAGE_SIZE
    alignment, the real memmap has to be aligned to section_map_size().
    This results in a wasted initial chunk of the preallocated memmap and
    an unnecessary fallback allocation for a section.

    While we are at it, also change __populate_section_memmap to align to
    the requested size, because at least VMEMMAP has constraints that
    require the memmap to be properly aligned.
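
    In code terms, the core of the fix is just the alignment passed when the
    per-node buffer is preallocated (a sketch of the idea, not the literal
    diff):

        /* before: the preallocated buffer is only PAGE_SIZE aligned ... */
        sparsemap_buf = memblock_alloc_exact_nid_raw(size, PAGE_SIZE, addr,
                                                     MEMBLOCK_ALLOC_ACCESSIBLE,
                                                     nid);

        /*
         * after: align it to section_map_size(), the granularity that
         * sparse_buffer_alloc() hands out, so no initial chunk is wasted
         */
        sparsemap_buf = memblock_alloc_exact_nid_raw(size, section_map_size(),
                                                     addr,
                                                     MEMBLOCK_ALLOC_ACCESSIBLE,
                                                     nid);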

    [1] http://lkml.kernel.org/r/20191030131122.8256-1-vincent.whitchurch@axis.com

    [akpm@linux-foundation.org: tweak layout, per David]
    Link: http://lkml.kernel.org/r/20191119092642.31799-1-mhocko@kernel.org
    Fixes: 35fd1eb1e821 ("mm/sparse: abstract sparse buffer allocations")
    Signed-off-by: Michal Hocko
    Reported-by: Vincent Whitchurch
    Debugged-by: Vincent Whitchurch
    Acked-by: David Hildenbrand
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Building the kernel on s390 with -Og produces the following warning:

    WARNING: vmlinux.o(.text+0x28dabe): Section mismatch in reference from the function populate_section_memmap() to the function .meminit.text:__populate_section_memmap()
    The function populate_section_memmap() references
    the function __meminit __populate_section_memmap().
    This is often because populate_section_memmap lacks a __meminit
    annotation or the annotation of __populate_section_memmap is wrong.

    While -Og is not supported, in theory this might still happen with
    another compiler or on another architecture. So fix this by using the
    correct section annotations.
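
    The fix boils down to giving the wrapper the same placement as the
    function it calls; the parameter lists below follow kernels of that era
    and may differ elsewhere:

        /* __populate_section_memmap() lives in .meminit.text ... */
        struct page * __meminit __populate_section_memmap(unsigned long pfn,
                        unsigned long nr_pages, int nid,
                        struct vmem_altmap *altmap);

        /*
         * ... so its wrapper must be __meminit as well, otherwise the
         * section-mismatch checker trips whenever the call is not inlined.
         */
        static struct page * __meminit populate_section_memmap(unsigned long pfn,
                        unsigned long nr_pages, int nid,
                        struct vmem_altmap *altmap)
        {
                return __populate_section_memmap(pfn, nr_pages, nid, altmap);
        }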

    [iii@linux.ibm.com: v2]
    Link: http://lkml.kernel.org/r/20191030151639.41486-1-iii@linux.ibm.com
    Link: http://lkml.kernel.org/r/20191028165549.14478-1-iii@linux.ibm.com
    Signed-off-by: Ilya Leoshkevich
    Acked-by: David Hildenbrand
    Cc: Heiko Carstens
    Cc: Vasily Gorbik
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ilya Leoshkevich
     
  • sparsemem without VMEMMAP has two allocation paths to allocate the
    memory needed for its memmap (done in sparse_mem_map_populate()).

    In one allocation path (sparse_buffer_alloc() succeeds), the memory is
    not zeroed (since it was previously allocated with
    memblock_alloc_try_nid_raw()).

    In the other allocation path (sparse_buffer_alloc() fails and
    sparse_mem_map_populate() falls back to memblock_alloc_try_nid()), the
    memory is zeroed.

    AFAICS this difference does not appear to be on purpose. If the code is
    supposed to work with non-initialized memory (__init_single_page() takes
    care of zeroing the struct pages which are actually used), we should
    consistently not zero the memory, to avoid masking bugs.

    ( I noticed this because on my ARM64 platform, with 1 GiB of memory the
    first [and only] section is allocated from the zeroing path while with
    2 GiB of memory the first 1 GiB section is allocated from the
    non-zeroing path. )

    Michal:
    "the main user visible problem is a memory wastage. The overall amount
    of memory should be small. I wouldn't call it stable material."
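
    The fix, sketched below with simplified surroundings, is to make the
    fallback path of the non-VMEMMAP memmap population use the same
    non-zeroing allocator as the preallocated buffer:

        map = sparse_buffer_alloc(size);
        if (map)
                return map;     /* came from the raw, non-zeroed buffer */

        /*
         * Fall back to a raw allocation as well (previously this used the
         * zeroing memblock_alloc_try_nid()), so both paths consistently
         * hand back uninitialized memory for __init_single_page() to set up.
         */
        map = memblock_alloc_try_nid_raw(size, size, addr,
                                         MEMBLOCK_ALLOC_ACCESSIBLE, nid);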

    Link: http://lkml.kernel.org/r/20191030131122.8256-1-vincent.whitchurch@axis.com
    Signed-off-by: Vincent Whitchurch
    Acked-by: Michal Hocko
    Acked-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Reviewed-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vincent Whitchurch
     

08 Oct, 2019

1 commit

  • We get two warnings when building the kernel with W=1:

    mm/shuffle.c:36:12: warning: no previous prototype for `shuffle_show' [-Wmissing-prototypes]
    mm/sparse.c:220:6: warning: no previous prototype for `subsection_mask_set' [-Wmissing-prototypes]

    Make the functions static to fix this.
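
    A tiny, self-contained illustration of this warning class; the file and
    function names below are made up, not the kernel's:

        /* helper.c, built with -Wmissing-prototypes (part of the W=1 set) */
        static int local_mask_set(int x)   /* file-local: no prototype needed */
        {
                return x | 1;
        }

        int exported_helper(int x)         /* warns unless a prototype is
                                              declared in an included header */
        {
                return local_mask_set(x);
        }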

    Link: http://lkml.kernel.org/r/1566978161-7293-1-git-send-email-wang.yi59@zte.com.cn
    Signed-off-by: Yi Wang
    Reviewed-by: David Hildenbrand
    Reviewed-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yi Wang
     

25 Sep, 2019

3 commits

  • There is no possibility for memmap to be NULL in the current codebase.

    This check was added in commit 95a4774d055c ("memory-hotplug: update
    mce_bad_pages when removing the memory") where memmap was originally
    initialized to NULL, and only conditionally given a value.

    The code that could have passed a NULL has been removed by commit
    ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug"), so there is no
    longer a possibility that memmap can be NULL.
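
    For context, the dropped code is just an early return on a pointer that
    can no longer be NULL (a sketch; the call site is the poisoned-page
    accounting in mm/sparse.c):

        /* removed: memmap is guaranteed non-NULL since sub-section hotplug */
        if (!memmap)
                return;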

    Link: http://lkml.kernel.org/r/20190829035151.20975-1-alastair@d-silva.org
    Signed-off-by: Alastair D'Silva
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Cc: Mike Rapoport
    Cc: Wei Yang
    Cc: Qian Cai
    Cc: Alexander Duyck
    Cc: Logan Gunthorpe
    Cc: Baoquan He
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alastair D'Silva
     
  • Use the function written to do it instead.

    Link: http://lkml.kernel.org/r/20190827053656.32191-2-alastair@au1.ibm.com
    Signed-off-by: Alastair D'Silva
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Acked-by: Mike Rapoport
    Reviewed-by: Wei Yang
    Reviewed-by: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alastair D'Silva
     
  • __pfn_to_section() is defined as __nr_to_section(pfn_to_section_nr(pfn)).

    Since we already have section_nr, there is no need to derive the
    mem_section from start_pfn again; this removes one redundant conversion.
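
    In code terms, the simplification looks roughly like this (context
    around the call site omitted):

        /* before: recompute the section from the pfn ... */
        struct mem_section *ms = __pfn_to_section(start_pfn);

        /* after: we already have section_nr, so index the section directly */
        struct mem_section *ms = __nr_to_section(section_nr);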

    Link: http://lkml.kernel.org/r/20190809010242.29797-1-richardw.yang@linux.intel.com
    Signed-off-by: Wei Yang
    Reviewed-by: Anshuman Khandual
    Tested-by: Anshuman Khandual
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang