01 Apr, 2020

1 commit

  • commit b943f045a9af9fd02f923e43fe8d7517e9961701 upstream.

    Fix the crash like this:

    BUG: Kernel NULL pointer dereference on read at 0x00000000
    Faulting instruction address: 0xc000000000c3447c
    Oops: Kernel access of bad area, sig: 11 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
    ...
    NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0
    LR [c000000000088354] vmemmap_free+0x144/0x320
    Call Trace:
    section_deactivate+0x220/0x240
    __remove_pages+0x118/0x170
    arch_remove_memory+0x3c/0x150
    memunmap_pages+0x1cc/0x2f0
    devm_action_release+0x30/0x50
    release_nodes+0x2f8/0x3e0
    device_release_driver_internal+0x168/0x270
    unbind_store+0x130/0x170
    drv_attr_store+0x44/0x60
    sysfs_kf_write+0x68/0x80
    kernfs_fop_write+0x100/0x290
    __vfs_write+0x3c/0x70
    vfs_write+0xcc/0x240
    ksys_write+0x7c/0x140
    system_call+0x5c/0x68

    The crash is a NULL pointer dereference at

    test_bit(idx, ms->usage->subsection_map);

    in pfn_section_valid(), because ms->usage is NULL.

    With commit d41e2f3bd546 ("mm/hotplug: fix hot remove failure in
    SPARSEMEM|!VMEMMAP case"), section_mem_map is set to NULL after
    depopulate_section_memmap(). This was done so that pfn_to_page() can
    work correctly with kernel configs that disable SPARSEMEM_VMEMMAP.
    With such a config, pfn_to_page() does

    __section_mem_map_addr(__sec) + __pfn;

    where

    static inline struct page *__section_mem_map_addr(struct mem_section *section)
    {
            unsigned long map = section->section_mem_map;
            map &= SECTION_MAP_MASK;
            return (struct page *)map;
    }

    Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map
    is used to check pfn validity (pfn_valid()). Since section_deactivate()
    releases mem_section->usage when a section is fully deactivated, a
    pfn_valid() check after a subsection deactivation causes a kernel crash.

    static inline int pfn_valid(unsigned long pfn)
    {
            ...
            return early_section(ms) || pfn_section_valid(ms, pfn);
    }

    where

    static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
    {
            int idx = subsection_map_index(pfn);

            return test_bit(idx, ms->usage->subsection_map);
    }

    Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is
    freed. For architectures like ppc64, where large pages (16MB) are used
    for the vmemmap mapping, a single vmemmap mapping can cover multiple
    sections. Hence, before a vmemmap mapping page can be freed, the kernel
    needs to make sure there are no valid sections within that mapping.
    Clearing the section valid bit before depopulate_section_memmap()
    enables this.
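
    A minimal sketch of that ordering in section_deactivate() (names follow
    mm/sparse.c, but treat this as an approximation rather than the literal
    upstream diff):

        /*
         * Mark the section invalid before tearing down its memmap so a
         * racing pfn_valid()/pfn_section_valid() no longer dereferences
         * the freed ms->usage->subsection_map.
         */
        ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;

        if (section_is_early && memmap)
                free_map_bootmem(memmap);
        else
                depopulate_section_memmap(pfn, nr_pages, altmap);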

    [aneesh.kumar@linux.ibm.com: add comment]
    Link: http://lkml.kernel.org/r/20200326133235.343616-1-aneesh.kumar@linux.ibm.com
    Link: http://lkml.kernel.org/r/20200325031914.107660-1-aneesh.kumar@linux.ibm.com
    Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
    Reported-by: Sachin Sant
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Tested-by: Sachin Sant
    Reviewed-by: Baoquan He
    Reviewed-by: Wei Yang
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Cc: Michael Ellerman
    Cc: Dan Williams
    Cc: David Hildenbrand
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc:
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
     

25 Mar, 2020

1 commit

  • commit d41e2f3bd54699f85b3d6f45abd09fa24a222cb9 upstream.

    In section_deactivate(), pfn_to_page() no longer works once
    ms->section_mem_map has been reset to NULL in the SPARSEMEM|!VMEMMAP
    case. It causes a hot remove failure:

    kernel BUG at mm/page_alloc.c:4806!
    invalid opcode: 0000 [#1] SMP PTI
    CPU: 3 PID: 8 Comm: kworker/u16:0 Tainted: G W 5.5.0-next-20200205+ #340
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    RIP: 0010:free_pages+0x85/0xa0
    Call Trace:
    __remove_pages+0x99/0xc0
    arch_remove_memory+0x23/0x4d
    try_remove_memory+0xc8/0x130
    __remove_memory+0xa/0x11
    acpi_memory_device_remove+0x72/0x100
    acpi_bus_trim+0x55/0x90
    acpi_device_hotplug+0x2eb/0x3d0
    acpi_hotplug_work_fn+0x1a/0x30
    process_one_work+0x1a7/0x370
    worker_thread+0x30/0x380
    kthread+0x112/0x130
    ret_from_fork+0x35/0x40

    Let's move the ->section_mem_map reset to after
    depopulate_section_memmap() to fix it.
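
    A hedged sketch of the new ordering (names follow mm/sparse.c; the
    exact upstream diff may differ in detail):

        /* decode the memmap while ->section_mem_map is still valid */
        memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);

        if (section_is_early && memmap)
                free_map_bootmem(memmap);
        else
                depopulate_section_memmap(pfn, nr_pages, altmap);

        /* only now forget the mapping */
        ms->section_mem_map = (unsigned long)NULL;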

    [akpm@linux-foundation.org: remove unneeded initialization, per David]
    Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: Pankaj Gupta
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Wei Yang
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc:
    Link: http://lkml.kernel.org/r/20200307084229.28251-2-bhe@redhat.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Baoquan He
     

29 Feb, 2020

1 commit

  • commit 18e19f195cd888f65643a77a0c6aee8f5be6439a upstream.

    When we use SPARSEMEM instead of SPARSEMEM_VMEMMAP, pfn_to_page()
    doesn't work before sparse_init_one_section() is called.

    This leads to a crash when hotplugging memory:

    BUG: unable to handle page fault for address: 0000000006400000
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP PTI
    CPU: 3 PID: 221 Comm: kworker/u16:1 Tainted: G W 5.5.0-next-20200205+ #343
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    RIP: 0010:__memset+0x24/0x30
    Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
    RSP: 0018:ffffb43ac0373c80 EFLAGS: 00010a87
    RAX: ffffffffffffffff RBX: ffff8a1518800000 RCX: 0000000000050000
    RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000006400000
    RBP: 0000000000140000 R08: 0000000000100000 R09: 0000000006400000
    R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
    R13: 0000000000000028 R14: 0000000000000000 R15: ffff8a153ffd9280
    FS: 0000000000000000(0000) GS:ffff8a153ab00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000006400000 CR3: 0000000136fca000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    sparse_add_section+0x1c9/0x26a
    __add_pages+0xbf/0x150
    add_pages+0x12/0x60
    add_memory_resource+0xc8/0x210
    __add_memory+0x62/0xb0
    acpi_memory_device_add+0x13f/0x300
    acpi_bus_attach+0xf6/0x200
    acpi_bus_scan+0x43/0x90
    acpi_device_hotplug+0x275/0x3d0
    acpi_hotplug_work_fn+0x1a/0x30
    process_one_work+0x1a7/0x370
    worker_thread+0x30/0x380
    kthread+0x112/0x130
    ret_from_fork+0x35/0x40

    We should use the memmap pointer directly, as the code did before.

    On x86 the impact is limited to x86_32 builds, or x86_64 configurations
    that override the default setting for SPARSEMEM_VMEMMAP.

    Other memory hotplug archs (arm64, ia64, and ppc) also default to
    SPARSEMEM_VMEMMAP=y.
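
    A hedged sketch of using the memmap pointer directly, as described
    above (names follow mm/sparse.c; treat the exact call site as
    approximate):

        memmap = section_activate(nid, start_pfn, nr_pages, altmap);
        if (IS_ERR(memmap))
                return PTR_ERR(memmap);

        /*
         * Poison the freshly built memmap through the returned pointer
         * rather than via pfn_to_page(), which is not usable yet in the
         * SPARSEMEM|!VMEMMAP case.
         */
        page_init_poison(memmap, sizeof(struct page) * nr_pages);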

    [dan.j.williams@intel.com: changelog update]
    [rppt@linux.ibm.com: changelog update]
    Link: http://lkml.kernel.org/r/20200219030454.4844-1-bhe@redhat.com
    Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
    Signed-off-by: Wei Yang
    Signed-off-by: Baoquan He
    Acked-by: David Hildenbrand
    Reviewed-by: Baoquan He
    Reviewed-by: Dan Williams
    Acked-by: Michal Hocko
    Cc: Mike Rapoport
    Cc: Oscar Salvador
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Wei Yang
     

11 Feb, 2020

1 commit

  • commit 1f503443e7df8dc8366608b4d810ce2d6669827c upstream.

    After commit ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug"),
    when a mem section is fully deactivated, section_mem_map still records
    the section's start pfn, which is not used any more and will be
    reassigned during re-addition.

    In analogy with alloc/free pattern, it is better to clear all fields of
    section_mem_map.
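
    A minimal sketch of the idea (treat the exact placement inside
    section_deactivate() as an assumption, not the literal diff):

        /* wipe the whole encoded value, not just individual flag bits */
        ms->section_mem_map = (unsigned long)NULL;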

    Besides this, it breaks the userspace tool "makedumpfile" [1], which
    assumes that a hot-removed section has a NULL mem_map instead of
    checking directly against the SECTION_MARKED_PRESENT bit.
    (makedumpfile would do better to drop that assumption, but that needs
    a separate patch.)

    The bug can be reproduced on IBM PowerVM by running
    "drmgr -c mem -r -q 5", triggering a crash, and saving the vmcore with
    makedumpfile.

    [1]: makedumpfile, commit e73016540293 ("[v1.6.7] Update version")

    Link: http://lkml.kernel.org/r/1579487594-28889-1-git-send-email-kernelfans@gmail.com
    Signed-off-by: Pingfan Liu
    Acked-by: Michal Hocko
    Acked-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Oscar Salvador
    Cc: Baoquan He
    Cc: Qian Cai
    Cc: Kazuhito Hagio
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Pingfan Liu
     

23 Jan, 2020

1 commit

  • commit 8068df3b60373c390198f660574ea14c8098de57 upstream.

    When we remove an early section, we don't free the usage map, as the
    usage maps of other sections are placed into the same page. Once the
    section is removed, it is no longer an early section (in particular,
    the memmap is freed). When we re-add that section, the usage map is
    reused; however, it is no longer an early section. When removing that
    section again, we try to kfree() a usage map that was allocated during
    early boot - bad.

    Let's check against PageReserved() to see if we are dealing with a
    usage map that was allocated during boot. We could also check against
    !(PageSlab(usage_page) || PageCompound(usage_page)), but PageReserved()
    is cleaner.
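
    A hedged sketch of the check in the usage-map freeing path (the
    variable names here are assumptions for illustration):

        struct page *usage_page = virt_to_page(usage);

        /*
         * An early-boot (memblock) allocation is marked PageReserved and
         * must never be handed to kfree(); only free usage maps that came
         * from the slab allocator.
         */
        if (!PageReserved(usage_page))
                kfree(usage);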

    Can be triggered using memtrace under ppc64/powernv:

    $ mount -t debugfs none /sys/kernel/debug/
    $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
    $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
    ------------[ cut here ]------------
    kernel BUG at mm/slub.c:3969!
    Oops: Exception in kernel mode, sig: 5 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
    Modules linked in:
    CPU: 0 PID: 154 Comm: sh Not tainted 5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0 #61
    NIP kfree+0x338/0x3b0
    LR section_deactivate+0x138/0x200
    Call Trace:
    section_deactivate+0x138/0x200
    __remove_pages+0x114/0x150
    arch_remove_memory+0x3c/0x160
    try_remove_memory+0x114/0x1a0
    __remove_memory+0x20/0x40
    memtrace_enable_set+0x254/0x850
    simple_attr_write+0x138/0x160
    full_proxy_write+0x8c/0x110
    __vfs_write+0x38/0x70
    vfs_write+0x11c/0x2a0
    ksys_write+0x84/0x140
    system_call+0x5c/0x68
    ---[ end trace 4b053cbd84e0db62 ]---

    The first invocation will offline+remove memory blocks. The second
    invocation will first add+online them again, in order to offline+remove
    them again (usually we are lucky and the exact same memory blocks will
    get "reallocated").

    Tested on powernv with boot memory: The usage map will not get freed.
    Tested on x86-64 with DIMMs: The usage map will get freed.

    Using Dynamic Memory under a Power DLPAR can trigger it easily.

    Triggering removal (I assume after previously removed+re-added) of
    memory from the HMC GUI can crash the kernel with the same call trace
    and is fixed by this patch.

    Link: http://lkml.kernel.org/r/20191217104637.5509-1-david@redhat.com
    Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
    Signed-off-by: David Hildenbrand
    Tested-by: Pingfan Liu
    Cc: Dan Williams
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     

09 Jan, 2020

1 commit

  • [ Upstream commit 030eab4f9ffb469344c10a46bc02c5149db0a2a9 ]

    Building the kernel on s390 with -Og produces the following warning:

    WARNING: vmlinux.o(.text+0x28dabe): Section mismatch in reference from the function populate_section_memmap() to the function .meminit.text:__populate_section_memmap()
    The function populate_section_memmap() references
    the function __meminit __populate_section_memmap().
    This is often because populate_section_memmap lacks a __meminit
    annotation or the annotation of __populate_section_memmap is wrong.

    While -Og is not supported, in theory this might still happen with
    another compiler or on another architecture. So fix this by using the
    correct section annotations.
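
    A hedged sketch of the annotation (the !SPARSEMEM_VMEMMAP variant of
    the function is shown; the exact set of functions annotated upstream
    may differ):

        struct page * __meminit populate_section_memmap(unsigned long pfn,
                        unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
        {
                /* body unchanged; only the section annotation is added */
                ...
        }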

    [iii@linux.ibm.com: v2]
    Link: http://lkml.kernel.org/r/20191030151639.41486-1-iii@linux.ibm.com
    Link: http://lkml.kernel.org/r/20191028165549.14478-1-iii@linux.ibm.com
    Signed-off-by: Ilya Leoshkevich
    Acked-by: David Hildenbrand
    Cc: Heiko Carstens
    Cc: Vasily Gorbik
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Ilya Leoshkevich
     

08 Oct, 2019

1 commit

    We get two warnings when building the kernel with W=1:

    mm/shuffle.c:36:12: warning: no previous prototype for `shuffle_show' [-Wmissing-prototypes]
    mm/sparse.c:220:6: warning: no previous prototype for `subsection_mask_set' [-Wmissing-prototypes]

    Make the functions static to fix this.
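
    A hedged illustration of the fix: giving the functions internal
    linkage satisfies -Wmissing-prototypes (the signature below follows
    mm/sparse.c but treat it as approximate):

        static void subsection_mask_set(unsigned long *map, unsigned long pfn,
                        unsigned long nr_pages)
        {
                ...
        }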

    Link: http://lkml.kernel.org/r/1566978161-7293-1-git-send-email-wang.yi59@zte.com.cn
    Signed-off-by: Yi Wang
    Reviewed-by: David Hildenbrand
    Reviewed-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yi Wang
     

25 Sep, 2019

5 commits

  • There is no possibility for memmap to be NULL in the current codebase.

    This check was added in commit 95a4774d055c ("memory-hotplug: update
    mce_bad_pages when removing the memory"), where memmap was originally
    initialized to NULL and only conditionally given a value.

    The code that could have passed a NULL has been removed by commit
    ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug"), so there is no
    longer a possibility that memmap can be NULL.

    Link: http://lkml.kernel.org/r/20190829035151.20975-1-alastair@d-silva.org
    Signed-off-by: Alastair D'Silva
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Cc: Mike Rapoport
    Cc: Wei Yang
    Cc: Qian Cai
    Cc: Alexander Duyck
    Cc: Logan Gunthorpe
    Cc: Baoquan He
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alastair D'Silva
     
  • Use the function written to do it instead.

    Link: http://lkml.kernel.org/r/20190827053656.32191-2-alastair@au1.ibm.com
    Signed-off-by: Alastair D'Silva
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Acked-by: Mike Rapoport
    Reviewed-by: Wei Yang
    Reviewed-by: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alastair D'Silva
     
    __pfn_to_section() is defined as __nr_to_section(pfn_to_section_nr(pfn)).

    Since we already have section_nr, it is not necessary to derive the
    mem_section from start_pfn again. Calling __nr_to_section() directly
    removes one redundant operation.
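
    A hedged sketch of the change (the exact call site is an assumption):

        -       struct mem_section *ms = __pfn_to_section(start_pfn);
        +       struct mem_section *ms = __nr_to_section(section_nr);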

    Link: http://lkml.kernel.org/r/20190809010242.29797-1-richardw.yang@linux.intel.com
    Signed-off-by: Wei Yang
    Reviewed-by: Anshuman Khandual
    Tested-by: Anshuman Khandual
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
    The size argument passed into sparse_buffer_alloc() has already been
    aligned to PAGE_SIZE or PMD_SIZE.

    If that aligned size is not a power of 2 (e.g. 0x480000), PTR_ALIGN()
    will return the wrong value. Use roundup() to round sparsemap_buf up
    to the next multiple of size.
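
    A hedged sketch of the fix in sparse_buffer_alloc() (roundup() works
    for any multiple, not just powers of two; the exact line is
    approximate):

        -       ptr = PTR_ALIGN(sparsemap_buf, size);
        +       ptr = (void *)roundup((unsigned long)sparsemap_buf, size);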

    Link: http://lkml.kernel.org/r/20190705114826.28586-1-lecopzer.chen@mediatek.com
    Signed-off-by: Lecopzer Chen
    Signed-off-by: Mark-PK Tsai
    Cc: YJ Chiang
    Cc: Lecopzer Chen
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lecopzer Chen
     
    sparse_buffer_alloc(size) hands out memory from sparsemap_buf after
    aligning sparsemap_buf to the requested size. However, the size is at
    least PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION) and usually
    larger than PAGE_SIZE.

    Also, sparse_buffer_fini() only frees memory between sparsemap_buf and
    sparsemap_buf_end; since sparsemap_buf may have been advanced by
    PTR_ALIGN() first, the aligned-away space before sparsemap_buf is
    wasted and no one will touch it.

    On our ARM32 platform (without SPARSEMEM_VMEMMAP):
        sparse_buffer_init    reserves  d359c000 - d3e9c000 (9M)
        sparse_buffer_alloc   allocates d3a00000 - d3e80000 (4.5M)
        sparse_buffer_fini    frees     d3e80000 - d3e9c000 (~100K)
    The reserved memory between d359c000 and d3a00000 (~4.4M) is never
    freed.

    On an ARM64 platform (with SPARSEMEM_VMEMMAP):
        sparse_buffer_init    reserves  ffffffc07d623000 - ffffffc07f623000 (32M)
        sparse_buffer_alloc   allocates ffffffc07d800000 - ffffffc07f600000 (30M)
        sparse_buffer_fini    frees     ffffffc07f600000 - ffffffc07f623000 (140K)
    The reserved memory between ffffffc07d623000 and ffffffc07d800000
    (~1.9M) is never freed.

    Let's explicitly free the redundant memory left over from the
    alignment.
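
    A hedged sketch of the idea, using the sparse_buffer_free() helper
    mentioned below (the bookkeeping details are assumptions):

        static void __meminit sparse_buffer_free(unsigned long size)
        {
                WARN_ON(!sparsemap_buf || size == 0);
                memblock_free_early(__pa(sparsemap_buf), size);
        }

        void * __meminit sparse_buffer_alloc(unsigned long size)
        {
                void *ptr = NULL;

                if (sparsemap_buf) {
                        ptr = (void *)roundup((unsigned long)sparsemap_buf, size);
                        if (ptr + size > sparsemap_buf_end)
                                ptr = NULL;
                        else {
                                /* return the aligned-away gap to memblock */
                                sparse_buffer_free((unsigned long)(ptr - sparsemap_buf));
                                sparsemap_buf = ptr + size;
                        }
                }
                return ptr;
        }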

    [arnd@arndb.de: mark sparse_buffer_free as __meminit]
    Link: http://lkml.kernel.org/r/20190709185528.3251709-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20190705114730.28534-1-lecopzer.chen@mediatek.com
    Signed-off-by: Lecopzer Chen
    Signed-off-by: Mark-PK Tsai
    Signed-off-by: Arnd Bergmann
    Cc: YJ Chiang
    Cc: Lecopzer Chen
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lecopzer Chen
     

19 Jul, 2019

11 commits

  • David points out that there is a mixture of 'int' and 'unsigned long'
    usage for section number data types. Update the memory hotplug path to
    use 'unsigned long' consistently for section numbers.

    [akpm@linux-foundation.org: fix printk format]
    Link: http://lkml.kernel.org/r/156107543656.1329419.11505835211949439815.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: David Hildenbrand
    Reviewed-by: David Hildenbrand
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • The libnvdimm sub-system has suffered a series of hacks and broken
    workarounds for the memory-hotplug implementation's awkward
    section-aligned (128MB) granularity.

    For example the following backtrace is emitted when attempting
    arch_add_memory() with physical address ranges that intersect 'System
    RAM' (RAM) with 'Persistent Memory' (PMEM) within a given section:

    # cat /proc/iomem | grep -A1 -B1 Persistent\ Memory
    100000000-1ffffffff : System RAM
    200000000-303ffffff : Persistent Memory (legacy)
    304000000-43fffffff : System RAM
    440000000-23ffffffff : Persistent Memory
    2400000000-43bfffffff : Persistent Memory
    2400000000-43bfffffff : namespace2.0

    WARNING: CPU: 38 PID: 928 at arch/x86/mm/init_64.c:850 add_pages+0x5c/0x60
    [..]
    RIP: 0010:add_pages+0x5c/0x60
    [..]
    Call Trace:
    devm_memremap_pages+0x460/0x6e0
    pmem_attach_disk+0x29e/0x680 [nd_pmem]
    ? nd_dax_probe+0xfc/0x120 [libnvdimm]
    nvdimm_bus_probe+0x66/0x160 [libnvdimm]

    It was discovered that the problem goes beyond RAM vs PMEM collisions,
    as some platforms produce PMEM vs PMEM collisions within a given
    section. The libnvdimm workaround for that case revealed that the
    libnvdimm section-alignment-padding implementation has been broken for
    a long while.

    A fix for that long-standing breakage introduces as many problems as it
    solves as it would require a backward-incompatible change to the
    namespace metadata interpretation. Instead of that dubious route [1],
    address the root problem in the memory-hotplug implementation.

    Note that EEXIST is no longer treated as success, since that is how
    sparse_add_section() reports subsection collisions; it was also
    obviated by recent changes that perform the request_region() for
    'System RAM' before arch_add_memory() in the add_memory() sequence.

    [1] https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com

    [osalvador@suse.de: fix deactivate_section for early sections]
    Link: http://lkml.kernel.org/r/20190715081549.32577-2-osalvador@suse.de
    Link: http://lkml.kernel.org/r/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Oscar Salvador
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: Pavel Tatashin
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Prepare the memory hot-{add,remove} paths for handling sub-section
    ranges by plumbing the starting page frame and number of pages being
    handled through arch_{add,remove}_memory() to
    sparse_{add,remove}_one_section().

    This is simply plumbing, small cleanups, and some identifier renames.
    No intended functional changes.

    Link: http://lkml.kernel.org/r/156092353780.979959.9713046515562743194.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Pavel Tatashin
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Allow sub-section sized ranges to be added to the memmap.

    populate_section_memmap() takes an explicit pfn range rather than
    assuming a full section, and those parameters are plumbed all the way
    through to vmemmap_populate(). There should be no sub-section usage in
    current deployments. New warnings are added to clarify which memmap
    allocation paths are sub-section capable.

    Link: http://lkml.kernel.org/r/156092352058.979959.6551283472062305149.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Pavel Tatashin
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Cc: Logan Gunthorpe
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
    sub-section active bitmask, each bit representing a PMD_SIZE span of the
    architecture's memory hotplug section size.

    The implication of a partially populated section is that pfn_valid()
    needs to go beyond a valid_section() check and either determine that
    the section is an "early section", or read the sub-section active
    ranges from the bitmask. The expectation is that the bitmask
    (subsection_map) fits in the same cacheline as the valid_section() /
    early_section() data, so the incremental performance overhead to
    pfn_valid() should be negligible.

    The rationale for using early_section() to short-circuit the
    subsection_map check is that there are legacy code paths that use
    pfn_valid() at section granularity before validating the pfn against
    pgdat data. So, the early_section() check allows those traditional
    assumptions to persist while also permitting subsection_map to tell
    the truth for purposes of populating the unused portions of early
    sections with PMEM and other ZONE_DEVICE mappings.
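
    A hedged sketch of the subsection bookkeeping (constants and the
    helper mirror include/linux/mmzone.h, but treat the exact definitions
    as approximate):

        #define SUBSECTION_SHIFT        21      /* 2MB sub-sections */
        #define PFN_SUBSECTION_SHIFT    (SUBSECTION_SHIFT - PAGE_SHIFT)
        #define PAGES_PER_SUBSECTION    (1UL << PFN_SUBSECTION_SHIFT)
        #define SUBSECTIONS_PER_SECTION (1UL << (SECTION_SIZE_BITS - SUBSECTION_SHIFT))

        static inline int subsection_map_index(unsigned long pfn)
        {
                /* which PMD_SIZE chunk of the section does this pfn fall in? */
                return (pfn & ~PAGE_SECTION_MASK) / PAGES_PER_SUBSECTION;
        }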

    Link: http://lkml.kernel.org/r/156092350874.979959.18185938451405518285.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Qian Cai
    Tested-by: Jane Chu
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: Pavel Tatashin
    Cc: David Hildenbrand
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • In preparation for sub-section hotplug, track whether a given section
    was created during early memory initialization, or later via memory
    hotplug. This distinction is needed to maintain the coarse expectation
    that pfn_valid() returns true for any pfn within a given section even if
    that section has pages that are reserved from the page allocator.

    For example, one of the goals of subsection hotplug is to support
    cases where the system physical memory layout collides System RAM and
    PMEM within a section. Several pfn_valid() users expect to just check
    if a section is valid, but they are not careful to check if the given
    pfn is within a "System RAM" boundary and instead expect pgdat
    information to further validate the pfn.

    Rather than unwind those paths to make their pfn_valid() queries more
    precise a follow on patch uses the SECTION_IS_EARLY flag to maintain the
    traditional expectation that pfn_valid() returns true for all early
    sections.
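
    A hedged sketch of the flag and its helper, mirroring the style of the
    existing SECTION_* flags (bit position and exact form are assumptions):

        #define SECTION_IS_EARLY        (1UL << 3)

        static inline int early_section(struct mem_section *section)
        {
                return (section && (section->section_mem_map & SECTION_IS_EARLY));
        }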

    Link: https://lore.kernel.org/lkml/1560366952-10660-1-git-send-email-cai@lca.pw/
    Link: http://lkml.kernel.org/r/156092350358.979959.5817209875548072819.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Qian Cai
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Logan Gunthorpe
    Cc: David Hildenbrand
    Cc: Pavel Tatashin
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Patch series "mm: Sub-section memory hotplug support", v10.

    The memory hotplug section is an arbitrary / convenient unit for memory
    hotplug. 'Section-size' units have bled into the user interface
    ('memblock' sysfs) and cannot be changed without breaking existing
    userspace. The section-size constraint, while mostly benign for typical
    memory hotplug, has wreaked and continues to wreak havoc with
    'device-memory' use cases, persistent memory (pmem) in particular.
    Recall that pmem uses devm_memremap_pages(), and subsequently
    arch_add_memory(), to allocate a 'struct page' memmap for pmem.
    However, it does not use the 'bottom half' of memory hotplug, i.e. it
    never marks pmem pages online and never exposes the userspace memblock
    interface for pmem. This leaves an opening to redress the section-size
    constraint.

    To date, the libnvdimm subsystem has attempted to inject padding to
    satisfy the internal constraints of arch_add_memory(). Beyond
    complicating the code, leading to bugs [2], wasting memory, and limiting
    configuration flexibility, the padding hack is broken when the platform
    changes this physical memory alignment of pmem from one boot to the
    next. Device failure (intermittent or permanent) and physical
    reconfiguration are events that can cause the platform firmware to
    change the physical placement of pmem on a subsequent boot, and device
    failure is an everyday event in a data-center.

    It turns out that sections are only a hard requirement of the
    user-facing interface for memory hotplug and with a bit more
    infrastructure sub-section arch_add_memory() support can be added for
    kernel internal usages like devm_memremap_pages(). Here is an analysis
    of the current design assumptions in the current code and how they are
    addressed in the new implementation:

    Current design assumptions:

    - Sections that describe boot memory (early sections) are never
    unplugged / removed.

    - pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y case, devolves to a
    valid_section() check

    - __add_pages() and helper routines assume all operations occur in
    PAGES_PER_SECTION units.

    - The memblock sysfs interface only comprehends full sections

    New design assumptions:

    - Sections are instrumented with a sub-section bitmask to track (on
    x86) individual 2MB sub-divisions of a 128MB section.

    - Partially populated early sections can be extended with additional
    sub-sections, and those sub-sections can be removed with
    arch_remove_memory(). With this in place we no longer lose usable
    memory capacity to padding.

    - pfn_valid() is updated to look deeper than valid_section() to also
    check the active-sub-section mask. This indication is in the same
    cacheline as the valid_section() so the performance impact is
    expected to be negligible. So far the lkp robot has not reported any
    regressions.

    - Outside of the core vmemmap population routines which are replaced,
    other helper routines like shrink_{zone,pgdat}_span() are updated to
    handle the smaller granularity. Core memory hotplug routines that
    deal with online memory are not touched.

    - The existing memblock sysfs user api guarantees / assumptions are not
    touched since this capability is limited to !online
    !memblock-sysfs-accessible sections.

    Meanwhile the issue reports continue to roll in from users that do not
    understand when and how the 128MB constraint will bite them. The current
    implementation relied on being able to support at least one misaligned
    namespace, but that immediately falls over on any moderately complex
    namespace creation attempt. Beyond the initial problem of 'System RAM'
    colliding with pmem, and the unsolvable problem of physical alignment
    changes, Linux is now being exposed to platforms that collide pmem ranges
    with other pmem ranges by default [3]. In short, devm_memremap_pages()
    has pushed the venerable section-size constraint past the breaking point,
    and the simplicity of section-aligned arch_add_memory() is no longer
    tenable.

    These patches are exposed to the kbuild robot on a subsection-v10 branch
    [4], and a preview of the unit test for this functionality is available
    on the 'subsection-pending' branch of ndctl [5].

    [2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
    [3]: https://github.com/pmem/ndctl/issues/76
    [4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10
    [5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c

    This patch (of 13):

    Towards enabling memory hotplug to track partial population of a section,
    introduce 'struct mem_section_usage'.

    A pointer to a 'struct mem_section_usage' instance replaces the existing
    pointer to a 'pageblock_flags' bitmap. Effectively it adds one more
    'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house
    a new 'subsection_map' bitmap. The new bitmap enables the memory
    hot{plug,remove} implementation to act on incremental sub-divisions of a
    section.
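
    A hedged sketch of the new structure as described above (the exact
    field declarations are assumptions):

        struct mem_section_usage {
                DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION);
                /* See declaration of similar field in struct zone */
                unsigned long pageblock_flags[0];
        };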

    SUBSECTION_SHIFT is defined as a global constant instead of a
    per-architecture value like SECTION_SIZE_BITS, in order to allow
    cross-arch compatibility of subsection users. Specifically, a common
    subsection size allows for the possibility that persistent memory
    namespace configurations can be made compatible across architectures.

    The primary motivation for this functionality is to support platforms that
    mix "System RAM" and "Persistent Memory" within a single section, or
    multiple PMEM ranges with different mapping lifetimes within a single
    section. The section restriction for hotplug has caused an ongoing saga
    of hacks and bugs for devm_memremap_pages() users.

    Beyond the fixups to teach existing paths how to retrieve the 'usemap'
    from a section, and updates to usemap allocation path, there are no
    expected behavior changes.

    Link: http://lkml.kernel.org/r/156092349845.979959.73333291612799019.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Oscar Salvador
    Reviewed-by: Wei Yang
    Tested-by: Aneesh Kumar K.V [ppc64]
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: Pavel Tatashin
    Cc: David Hildenbrand
    Cc: Jérôme Glisse
    Cc: Mike Rapoport
    Cc: Jane Chu
    Cc: Pavel Tatashin
    Cc: Jonathan Corbet
    Cc: Qian Cai
    Cc: Logan Gunthorpe
    Cc: Toshi Kani
    Cc: Jeff Moyer
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Patch series "mm: Further memory block device cleanups", v1.

    Some further cleanups around memory block devices. Especially, clean up
    and simplify walk_memory_range(). Including some other minor cleanups.

    This patch (of 6):

    We are using a mixture of "int" and "unsigned long". Let's make this
    consistent by using "unsigned long" everywhere. We'll do the same with
    memory block ids next.

    While at it, turn the "unsigned long i" in removable_show() into an int
    - sections_per_block is an int.

    [akpm@linux-foundation.org: s/unsigned long i/unsigned long nr/]
    [david@redhat.com: v3]
    Link: http://lkml.kernel.org/r/20190620183139.4352-2-david@redhat.com
    Link: http://lkml.kernel.org/r/20190614100114.311-2-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Andrew Morton
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Mel Gorman
    Cc: Wei Yang
    Cc: Johannes Weiner
    Cc: Arun KS
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Cc: Stephen Rothwell
    Cc: Mike Rapoport
    Cc: Baoquan He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
    When NODE_NOT_IN_PAGE_FLAGS is set, we store a section's node id in
    section_to_node_table[], but for hot-added memory this step is missed.
    Without this information, page_to_nid() may not give the right node id.

    BTW, the current online_pages() works because it leverages the nid in
    the memory_block. But the granularity of the node id should be
    mem_section wide.
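
    A hedged sketch of the fix in the hot-add path (the exact location in
    sparse_add_one_section() is an assumption):

        /* record the node id so page_to_nid() works without page flags */
        set_section_nid(section_nr, nid);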

    Link: http://lkml.kernel.org/r/20190618005537.18878-1-richardw.yang@linux.intel.com
    Signed-off-by: Wei Yang
    Reviewed-by: Oscar Salvador
    Reviewed-by: David Hildenbrand
    Reviewed-by: Anshuman Khandual
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • The parameter is unused, so let's drop it. Memory removal paths should
    never care about zones. This is the job of memory offlining and will
    require more refactorings.

    Link: http://lkml.kernel.org/r/20190527111152.16324-12-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Dan Williams
    Reviewed-by: Wei Yang
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Alex Deucher
    Cc: Andrew Banman
    Cc: Andy Lutomirski
    Cc: Anshuman Khandual
    Cc: Ard Biesheuvel
    Cc: Arun KS
    Cc: Baoquan He
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Chintan Pandya
    Cc: Christophe Leroy
    Cc: Chris Wilson
    Cc: Dave Hansen
    Cc: "David S. Miller"
    Cc: Fenghua Yu
    Cc: Greg Kroah-Hartman
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Joonsoo Kim
    Cc: Jun Yao
    Cc: "Kirill A. Shutemov"
    Cc: Logan Gunthorpe
    Cc: Mark Brown
    Cc: Mark Rutland
    Cc: Masahiro Yamada
    Cc: Mathieu Malaterre
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: "mike.travis@hpe.com"
    Cc: Nicholas Piggin
    Cc: Paul Mackerras
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: Qian Cai
    Cc: "Rafael J. Wysocki"
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Robin Murphy
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Cc: Yu Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
    We want to improve error handling while adding memory by allowing the
    use of arch_remove_memory() and __remove_pages() even if
    CONFIG_MEMORY_HOTREMOVE is not set, e.g., to implement something like:

    arch_add_memory()
    rc = do_something();
    if (rc) {
            arch_remove_memory();
    }

    We won't get rid of CONFIG_MEMORY_HOTREMOVE for now, as it will require
    quite some dependencies for memory offlining.

    Link: http://lkml.kernel.org/r/20190527111152.16324-7-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Pavel Tatashin
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Cc: Oscar Salvador
    Cc: "Kirill A. Shutemov"
    Cc: Alex Deucher
    Cc: "David S. Miller"
    Cc: Mark Brown
    Cc: Chris Wilson
    Cc: Christophe Leroy
    Cc: Nicholas Piggin
    Cc: Vasily Gorbik
    Cc: Rob Herring
    Cc: Masahiro Yamada
    Cc: "mike.travis@hpe.com"
    Cc: Andrew Banman
    Cc: Arun KS
    Cc: Qian Cai
    Cc: Mathieu Malaterre
    Cc: Baoquan He
    Cc: Logan Gunthorpe
    Cc: Anshuman Khandual
    Cc: Ard Biesheuvel
    Cc: Catalin Marinas
    Cc: Chintan Pandya
    Cc: Dan Williams
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Joonsoo Kim
    Cc: Jun Yao
    Cc: Mark Rutland
    Cc: Mike Rapoport
    Cc: Oscar Salvador
    Cc: Robin Murphy
    Cc: Wei Yang
    Cc: Will Deacon
    Cc: Yu Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

15 May, 2019

1 commit

  • The code comment above sparse_add_one_section() is obsolete and incorrect.
    Clean it up and write a new one.

    Link: http://lkml.kernel.org/r/20190329144250.14315-1-bhe@redhat.com
    Signed-off-by: Baoquan He
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Reviewed-by: Mukesh Ojha
    Reviewed-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Baoquan He
     

30 Mar, 2019

1 commit

  • Commit f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded
    memory to zones until online") introduced move_pfn_range_to_zone() which
    calls memmap_init_zone() during onlining a memory block.
    memmap_init_zone() will reset pagetype flags and makes migrate type to
    be MOVABLE.

    However, in __offline_pages(), it also calls undo_isolate_page_range()
    after offline_isolated_pages() to do the same thing. Since commit
    2ce13640b3f4 ("mm: __first_valid_page skip over offline pages") changed
    __first_valid_page() to skip offline pages, undo_isolate_page_range()
    here just wastes CPU cycles looping around the offlining PFN range
    while doing nothing, because __first_valid_page() will return NULL, as
    offline_isolated_pages() has already marked all memory sections within
    the pfn range as offline via offline_mem_sections().

    Also, after calling the "useless" undo_isolate_page_range() here, we
    have passed the point of no return by notifying MEM_OFFLINE. Those
    pages will be marked as MIGRATE_MOVABLE again once onlined. The only
    thing left to do is to decrease the zone's counter of isolated
    pageblocks, the counter that makes some paths of the page allocation
    slower, which the above commit introduced.

    Even if alloc_contig_range() can be used to isolate 16GB hugetlb pages
    on ppc64, an "int" should still be enough to represent the number of
    pageblocks there. Fix an incorrect comment along the way.

    [cai@lca.pw: v4]
    Link: http://lkml.kernel.org/r/20190314150641.59358-1-cai@lca.pw
    Link: http://lkml.kernel.org/r/20190313143133.46200-1-cai@lca.pw
    Fixes: 2ce13640b3f4 ("mm: __first_valid_page skip over offline pages")
    Signed-off-by: Qian Cai
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Cc: Vlastimil Babka
    Cc: [4.13+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     

13 Mar, 2019

2 commits

  • As all the memblock allocation functions return NULL in case of error
    rather than panic(), the duplicates with _nopanic suffix can be removed.

    Link: http://lkml.kernel.org/r/1548057848-15136-22-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Greg Kroah-Hartman
    Reviewed-by: Petr Mladek [printk]
    Cc: Catalin Marinas
    Cc: Christophe Leroy
    Cc: Christoph Hellwig
    Cc: "David S. Miller"
    Cc: Dennis Zhou
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Guo Ren [c-sky]
    Cc: Heiko Carstens
    Cc: Juergen Gross [Xen]
    Cc: Mark Salter
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Paul Burton
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Rob Herring
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
    Add a check for the return value of memblock_alloc*() functions and
    call panic() in case of error. The panic message repeats the one used
    by panicking memblock allocators, with the parameters adjusted to
    include only the relevant ones.

    The replacement was mostly automated with semantic patches like the one
    below with manual massaging of format strings.

    @@
    expression ptr, size, align;
    @@
    ptr = memblock_alloc(size, align);
    + if (!ptr)
    + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__, size, align);

    [anders.roxell@linaro.org: use '%pa' with 'phys_addr_t' type]
    Link: http://lkml.kernel.org/r/20190131161046.21886-1-anders.roxell@linaro.org
    [rppt@linux.ibm.com: fix format strings for panics after memblock_alloc]
    Link: http://lkml.kernel.org/r/1548950940-15145-1-git-send-email-rppt@linux.ibm.com
    [rppt@linux.ibm.com: don't panic if the allocation in sparse_buffer_init fails]
    Link: http://lkml.kernel.org/r/20190131074018.GD28876@rapoport-lnx
    [akpm@linux-foundation.org: fix xtensa printk warning]
    Link: http://lkml.kernel.org/r/1548057848-15136-20-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Anders Roxell
    Reviewed-by: Guo Ren [c-sky]
    Acked-by: Paul Burton [MIPS]
    Acked-by: Heiko Carstens [s390]
    Reviewed-by: Juergen Gross [Xen]
    Reviewed-by: Geert Uytterhoeven [m68k]
    Acked-by: Max Filippov [xtensa]
    Cc: Catalin Marinas
    Cc: Christophe Leroy
    Cc: Christoph Hellwig
    Cc: "David S. Miller"
    Cc: Dennis Zhou
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Mark Salter
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Petr Mladek
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Rob Herring
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

06 Mar, 2019

1 commit

    next_present_section_nr() returns an unsigned number, so the only
    "negative" value it can produce is the -1 sentinel; check for that
    specifically and let the compiler convert -1 to unsigned where needed.

    mm/sparse.c: In function 'sparse_init_nid':
    mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
    ((section_nr >= 0) && \
    ^~
    mm/sparse.c:478:2: note: in expansion of macro
    'for_each_present_section_nr'
    for_each_present_section_nr(pnum_begin, pnum) {
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~
    mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
    ((section_nr >= 0) && \
    ^~
    mm/sparse.c:497:2: note: in expansion of macro
    'for_each_present_section_nr'
    for_each_present_section_nr(pnum_begin, pnum) {
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~
    mm/sparse.c: In function 'sparse_init':
    mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
    ((section_nr >= 0) && \
    ^~
    mm/sparse.c:520:2: note: in expansion of macro
    'for_each_present_section_nr'
    for_each_present_section_nr(pnum_begin + 1, pnum_end) {
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~
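
    A hedged sketch of the macro after the fix (the layout follows
    mm/sparse.c but the exact form is an assumption):

        #define for_each_present_section_nr(start, section_nr)         \
                for (section_nr = next_present_section_nr(start - 1);  \
                     ((section_nr != -1) &&                            \
                      (section_nr <= __highest_present_section_nr));   \
                     section_nr = next_present_section_nr(section_nr))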

    Link: http://lkml.kernel.org/r/20190228181839.86504-1-cai@lca.pw
    Fixes: c4e1be9ec113 ("mm, sparsemem: break out of loops early")
    Signed-off-by: Qian Cai
    Reviewed-by: Andrew Morton
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     

29 Dec, 2018

3 commits

    Since the only information sparse_add_one_section() needs from its
    pgdat argument is the node id used to allocate proper memory, it is not
    necessary to pass the pgdat itself.

    This patch changes the prototype of sparse_add_one_section() to pass
    the node id directly. This is intended to avoid the misleading
    impression that sparse_add_one_section() touches the pgdat.
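
    A hedged sketch of the prototype change (the parameter order and the
    altmap argument are assumptions):

        -int sparse_add_one_section(struct pglist_data *pgdat,
        -               unsigned long start_pfn, struct vmem_altmap *altmap);
        +int sparse_add_one_section(int nid, unsigned long start_pfn,
        +               struct vmem_altmap *altmap);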

    Link: http://lkml.kernel.org/r/20181204085657.20472-2-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Dave Hansen
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
    pgdat_resize_lock is used to protect pgdat's memory region information
    like node_start_pfn, node_present_pages, etc. In
    sparse_add/remove_one_section(), however, pgdat_resize_lock is used to
    protect the initialization/release of one mem_section. This does not
    look proper.

    These code paths are currently protected by mem_hotplug_lock, but
    should there ever be any reason for locking at the sparse layer, a
    dedicated lock should be introduced.

    Following is the current call trace of sparse_add/remove_one_section():

        mem_hotplug_begin()
          arch_add_memory()
            add_pages()
              __add_pages()
                __add_section()
                  sparse_add_one_section()
        mem_hotplug_done()

        mem_hotplug_begin()
          arch_remove_memory()
            __remove_pages()
              __remove_section()
                sparse_remove_one_section()
        mem_hotplug_done()

    The comment above pgdat_resize_lock also mentions "Holding this will
    also guarantee that any pfn_valid() stays that way.", which would be
    true with the current implementation and false after this patch; but
    the current implementation doesn't actually live up to this comment
    anyway. There aren't any pfn walkers that take the lock, so this looks
    like a relic from the past. This patch also removes this comment.

    [richard.weiyang@gmail.com: v4]
    Link: http://lkml.kernel.org/r/20181204085657.20472-1-richard.weiyang@gmail.com
    [mhocko@suse.com: changelog suggestion]
    Link: http://lkml.kernel.org/r/20181128091243.19249-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Dave Hansen
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
    In hot remove, we try to clear poisoned pages; a small optimization
    that checks whether num_poisoned_pages is 0 lets us skip the iteration
    through nr_pages entirely.
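
    A hedged sketch of the shortcut (the exact function it lands in is an
    assumption):

        /* nothing to clear if no pages are poisoned system-wide */
        if (atomic_long_read(&num_poisoned_pages) == 0)
                return;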

    [akpm@linux-foundation.org: tweak comment text]
    Link: http://lkml.kernel.org/r/20181102120001.4526-1-bsingharora@gmail.com
    Signed-off-by: Balbir Singh
    Acked-by: Michal Hocko
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

15 Dec, 2018

1 commit

    Presently the arches arm64, arm and sh have a function which loops
    through each memblock and calls memory_present(). riscv will require a
    similar function.

    Introduce a common memblocks_present() function that can be used by
    all the arches. Subsequent patches will clean up the arches that make
    use of this.

    Link: http://lkml.kernel.org/r/20181107205433.3875-3-logang@deltatee.com
    Signed-off-by: Logan Gunthorpe
    Acked-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Logan Gunthorpe
     

31 Oct, 2018

5 commits

    When memblock allocation APIs are called with align = 0, the alignment
    is implicitly set to SMP_CACHE_BYTES.

    Implicit alignment is done deep in the memblock allocator and it can
    come as a surprise. Not that such an alignment would be wrong even
    when used incorrectly, but it is better to be explicit for the sake of
    clarity and the principle of least surprise.

    Replace all such uses of memblock APIs with the 'align' parameter
    explicitly set to SMP_CACHE_BYTES and stop implicit alignment assignment
    in the memblock internal allocation functions.

    For the case when memblock APIs are used via helper functions, e.g. like
    iommu_arena_new_node() in Alpha, the helper functions were detected with
    Coccinelle's help and then manually examined and updated where
    appropriate.

    The direct memblock APIs users were updated using the semantic patch below:

    @@
    expression size, min_addr, max_addr, nid;
    @@
    (
    |
    - memblock_alloc_try_nid_raw(size, 0, min_addr, max_addr, nid)
    + memblock_alloc_try_nid_raw(size, SMP_CACHE_BYTES, min_addr, max_addr,
    nid)
    |
    - memblock_alloc_try_nid_nopanic(size, 0, min_addr, max_addr, nid)
    + memblock_alloc_try_nid_nopanic(size, SMP_CACHE_BYTES, min_addr, max_addr,
    nid)
    |
    - memblock_alloc_try_nid(size, 0, min_addr, max_addr, nid)
    + memblock_alloc_try_nid(size, SMP_CACHE_BYTES, min_addr, max_addr, nid)
    |
    - memblock_alloc(size, 0)
    + memblock_alloc(size, SMP_CACHE_BYTES)
    |
    - memblock_alloc_raw(size, 0)
    + memblock_alloc_raw(size, SMP_CACHE_BYTES)
    |
    - memblock_alloc_from(size, 0, min_addr)
    + memblock_alloc_from(size, SMP_CACHE_BYTES, min_addr)
    |
    - memblock_alloc_nopanic(size, 0)
    + memblock_alloc_nopanic(size, SMP_CACHE_BYTES)
    |
    - memblock_alloc_low(size, 0)
    + memblock_alloc_low(size, SMP_CACHE_BYTES)
    |
    - memblock_alloc_low_nopanic(size, 0)
    + memblock_alloc_low_nopanic(size, SMP_CACHE_BYTES)
    |
    - memblock_alloc_from_nopanic(size, 0, min_addr)
    + memblock_alloc_from_nopanic(size, SMP_CACHE_BYTES, min_addr)
    |
    - memblock_alloc_node(size, 0, nid)
    + memblock_alloc_node(size, SMP_CACHE_BYTES, nid)
    )

    [mhocko@suse.com: changelog update]
    [akpm@linux-foundation.org: coding-style fixes]
    [rppt@linux.ibm.com: fix missed uses of implicit alignment]
    Link: http://lkml.kernel.org/r/20181016133656.GA10925@rapoport-lnx
    Link: http://lkml.kernel.org/r/1538687224-17535-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Suggested-by: Michal Hocko
    Acked-by: Paul Burton [MIPS]
    Acked-by: Michael Ellerman [powerpc]
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: Geert Uytterhoeven
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: Matt Turner
    Cc: Michal Simek
    Cc: Richard Weinberger
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Move remaining definitions and declarations from include/linux/bootmem.h
    into include/linux/memblock.h and remove the redundant header.

    The includes were replaced with the semantic patch below, followed by
    semi-automated removal of duplicated '#include <linux/memblock.h>'
    lines:

    @@
    @@
    - #include <linux/bootmem.h>
    + #include <linux/memblock.h>

    [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
    [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
    [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
    Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
    Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Stephen Rothwell
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Drop BOOTMEM_ALLOC_ACCESSIBLE and BOOTMEM_ALLOC_ANYWHERE in favor of
    identical MEMBLOCK definitions.

    Link: http://lkml.kernel.org/r/1536927045-23536-29-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
    With the align parameter, memblock_alloc_node() can be used as a
    drop-in replacement for alloc_bootmem_pages_node() and
    __alloc_bootmem_node(), which is done in the following patches.

    Link: http://lkml.kernel.org/r/1536927045-23536-15-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • The conversion is done using

    sed -i 's@memblock_virt_alloc@memblock_alloc@g' \
    $(git grep -l memblock_virt_alloc)

    Link: http://lkml.kernel.org/r/1536927045-23536-8-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

27 Oct, 2018

1 commit

  • Patch series "Address issues slowing persistent memory initialization", v5.

    The main thing this patch set achieves is that it allows us to
    initialize each node's worth of persistent memory independently. As a
    result we reduce
    page init time by about 2 minutes because instead of taking 30 to 40
    seconds per node and going through each node one at a time, we process all
    4 nodes in parallel in the case of a 12TB persistent memory setup spread
    evenly over 4 nodes.

    This patch (of 3):

    On systems with a large amount of memory it can take a significant amount
    of time to initialize all of the page structs with the PAGE_POISON_PATTERN
    value. I have seen it take over 2 minutes to initialize a system with
    over 12TB of RAM.

    In order to work around the issue I had to disable CONFIG_DEBUG_VM and
    then the boot time returned to something much more reasonable as the
    arch_add_memory call completed in milliseconds versus seconds. However in
    doing that I had to disable all of the other VM debugging on the system.

    In order to work around a kernel that might have CONFIG_DEBUG_VM enabled
    on a system that has a large amount of memory I have added a new kernel
    parameter named "vm_debug" that can be set to "-" in order to disable it.

    Link: http://lkml.kernel.org/r/20180925201921.3576.84239.stgit@localhost.localdomain
    Reviewed-by: Pavel Tatashin
    Signed-off-by: Alexander Duyck
    Cc: Dave Hansen
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Duyck
     

18 Aug, 2018

2 commits

    Rename new_sparse_init() to sparse_init(), which enables it. Delete
    the old sparse_init() and all the code that became obsolete with it.

    [pasha.tatashin@oracle.com: remove unused sparse_mem_maps_populate_node()]
    Link: http://lkml.kernel.org/r/20180716174447.14529-6-pasha.tatashin@oracle.com
    Link: http://lkml.kernel.org/r/20180712203730.8703-6-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Tested-by: Michael Ellerman [powerpc]
    Tested-by: Oscar Salvador
    Reviewed-by: Oscar Salvador
    Cc: Pasha Tatashin
    Cc: Abdul Haleem
    Cc: Baoquan He
    Cc: Daniel Jordan
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jérôme Glisse
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Souptick Joarder
    Cc: Steven Sistare
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
    sparse_init() requires temporarily allocating two large buffers:
    usemap_map and map_map. Baoquan He has identified that these buffers
    are so large that Linux is not bootable on small-memory machines, such
    as a kdump boot. The buffers are especially large when
    CONFIG_X86_5LEVEL is set, as they are scaled to the maximum physical
    memory size.

    Baoquan provided a fix which reduces the sizes of these buffers, but it
    is much better to get rid of them entirely.

    Add a new way to initialize sparse memory: sparse_init_nid(), which
    only operates within one memory node, and thus allocates memory either
    in one large contiguous block or section by section. This eliminates
    the need for the temporary buffers.

    For simplified bisecting and review, temporarily call sparse_init()
    new_sparse_init(); the new interface is going to be enabled, and the
    old code removed, in the next patch.

    Link: http://lkml.kernel.org/r/20180712203730.8703-5-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: Oscar Salvador
    Tested-by: Oscar Salvador
    Tested-by: Michael Ellerman [powerpc]
    Cc: Pasha Tatashin
    Cc: Abdul Haleem
    Cc: Baoquan He
    Cc: Daniel Jordan
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jérôme Glisse
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Souptick Joarder
    Cc: Steven Sistare
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin