07 Oct, 2020

2 commits

  • commit f85086f95fa36194eb0db5cd5c12e56801b98523 upstream.

    In register_mem_sect_under_node() the system_state value is checked to
    detect whether the call is made during boot time or during a hot-plug
    operation. Unfortunately, that check against SYSTEM_BOOTING is wrong
    because regular memory is registered at the SYSTEM_SCHEDULING state. In
    addition, memory hot-plug operations can be triggered at this system
    state by ACPI [1]. So checking against the system state is not enough.

    The consequence is that on systems with interleaved node ranges like this:

    Early memory node ranges
    node 1: [mem 0x0000000000000000-0x000000011fffffff]
    node 2: [mem 0x0000000120000000-0x000000014fffffff]
    node 1: [mem 0x0000000150000000-0x00000001ffffffff]
    node 0: [mem 0x0000000200000000-0x000000048fffffff]
    node 2: [mem 0x0000000490000000-0x00000007ffffffff]

    This can be seen on a PowerPC LPAR after multiple memory hot-plug and
    hot-unplug operations are done. At the next reboot the node's memory
    ranges can be interleaved, and since the call to link_mem_sections() is
    made in topology_init() while the system is in the SYSTEM_SCHEDULING
    state, the node id is not checked and the sections are registered to
    multiple nodes:

    $ ls -l /sys/devices/system/memory/memory21/node*
    total 0
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2

    In that case, the system is able to boot, but if one of these memory
    blocks is later hot-unplugged and then hot-plugged, the sysfs
    inconsistency is detected and triggers a BUG_ON():

    kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
    Oops: Exception in kernel mode, sig: 5 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
    CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
    Call Trace:
    add_memory_resource+0x23c/0x340 (unreliable)
    __add_memory+0x5c/0xf0
    dlpar_add_lmb+0x1b4/0x500
    dlpar_memory+0x1f8/0xb80
    handle_dlpar_errorlog+0xc0/0x190
    dlpar_store+0x198/0x4a0
    kobj_attr_store+0x30/0x50
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1b0/0x290
    vfs_write+0xe8/0x290
    ksys_write+0xdc/0x130
    system_call_exception+0x160/0x270
    system_call_common+0xf0/0x27c

    This patch addresses the root cause by no longer relying on the
    system_state value to detect whether the call is due to a hot-plug
    operation. Instead, an extra parameter is added to link_mem_sections()
    stating explicitly whether the operation is a hot-plug one.
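
    For illustration, a rough sketch of the interface change (the enumerator
    names here are assumptions, not taken verbatim from the patch):

    enum meminit_context { MEMINIT_EARLY, MEMINIT_HOTPLUG };

    /* before: the callee guessed the context from system_state */
    int link_mem_sections(int nid, unsigned long start_pfn,
                          unsigned long end_pfn);

    /* after: the caller states the context explicitly, so the node id of
     * each section can be verified for hot-plugged memory even while the
     * system is still in the SYSTEM_SCHEDULING state */
    int link_mem_sections(int nid, unsigned long start_pfn,
                          unsigned long end_pfn,
                          enum meminit_context context);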

    [1] According to Oscar Salvador, using this qemu command line, ACPI
    memory hotplug operations are raised at SYSTEM_SCHEDULING state:

    $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
    -m size=$MEM,slots=255,maxmem=4294967296k \
    -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
    -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
    -object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
    -object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
    -object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
    -object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
    -object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
    -object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \

    Fixes: 4fbce633910e ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
    Signed-off-by: Laurent Dufour
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Fenghua Yu
    Cc: Nathan Lynch
    Cc: Scott Cheloha
    Cc: Tony Luck
    Cc:
    Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Laurent Dufour
     
  • commit c1d0da83358a2316d9be7f229f26126dbaa07468 upstream.

    Patch series "mm: fix memory to node bad links in sysfs", v3.

    Sometimes, firmware may expose interleaved memory layout like this:

    Early memory node ranges
    node 1: [mem 0x0000000000000000-0x000000011fffffff]
    node 2: [mem 0x0000000120000000-0x000000014fffffff]
    node 1: [mem 0x0000000150000000-0x00000001ffffffff]
    node 0: [mem 0x0000000200000000-0x000000048fffffff]
    node 2: [mem 0x0000000490000000-0x00000007ffffffff]

    In that case, we can see memory blocks assigned to multiple nodes in
    sysfs:

    $ ls -l /sys/devices/system/memory/memory21
    total 0
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2
    -rw-r--r-- 1 root root 65536 Aug 24 05:27 online
    -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
    -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
    drwxr-xr-x 2 root root 0 Aug 24 05:27 power
    -r--r--r-- 1 root root 65536 Aug 24 05:27 removable
    -rw-r--r-- 1 root root 65536 Aug 24 05:27 state
    lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory
    -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
    -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones

    The same applies to the node directories, with a memory21 link present in
    both the node1 and node2 directories.

    This is wrong but doesn't prevent the system from running. However, when
    one of these memory blocks is later hot-unplugged and then hot-plugged,
    the system detects an inconsistency in the sysfs layout and a BUG_ON() is
    raised:

    kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
    CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
    Call Trace:
    add_memory_resource+0x23c/0x340 (unreliable)
    __add_memory+0x5c/0xf0
    dlpar_add_lmb+0x1b4/0x500
    dlpar_memory+0x1f8/0xb80
    handle_dlpar_errorlog+0xc0/0x190
    dlpar_store+0x198/0x4a0
    kobj_attr_store+0x30/0x50
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1b0/0x290
    vfs_write+0xe8/0x290
    ksys_write+0xdc/0x130
    system_call_exception+0x160/0x270
    system_call_common+0xf0/0x27c

    This has been seen on PowerPC LPAR.

    The root cause of this issue is that when a node's memory is registered,
    the range used can overlap another node's range, so the memory block ends
    up registered to multiple nodes in sysfs.

    There are two issues here:

    (a) The sysfs memory and node's layouts are broken due to these
    multiple links

    (b) The link errors in link_mem_sections() should not lead to a system
    panic.

    To address (a), register_mem_sect_under_node() should not rely on the
    system state to detect whether the link operation is triggered by a
    hot-plug operation. This is addressed by patches 1 and 2 of this series.

    Issue (b) will be addressed separately.

    This patch (of 2):

    The memmap_context enum is used to detect whether a memory operation is
    due to a hot-add operation or happening at boot time.

    Generalize it to hotplug operations and rename it to meminit_context.

    There is no functional change introduced by this patch.
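
    For illustration, a minimal before/after sketch of the rename (the
    enumerator names are assumptions, not necessarily the exact identifiers
    used by the patch):

    /* before: named after the memmap initialization path only */
    enum memmap_context {
            MEMMAP_EARLY,
            MEMMAP_HOTPLUG,
    };

    /* after: a general "why is this memory being initialized" context */
    enum meminit_context {
            MEMINIT_EARLY,          /* memory described at boot time */
            MEMINIT_HOTPLUG,        /* memory added by a hotplug operation */
    };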

    Suggested-by: David Hildenbrand
    Signed-off-by: Laurent Dufour
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J . Wysocki"
    Cc: Nathan Lynch
    Cc: Scott Cheloha
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc:
    Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
    Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Laurent Dufour
     

23 Sep, 2020

1 commit

  • commit 9683182612214aa5f5e709fad49444b847cd866a upstream.

    There is a race during page offline that can lead to an infinite loop:
    a page never ends up on a buddy list and __offline_pages() keeps
    retrying indefinitely or until a termination signal is received.

    Thread#1 - a new process:

    load_elf_binary
    begin_new_exec
    exec_mmap
    mmput
    exit_mmap
    tlb_finish_mmu
    tlb_flush_mmu
    release_pages
    free_unref_page_list
    free_unref_page_prepare
    set_pcppage_migratetype(page, migratetype);
    // Set page->index migration type below MIGRATE_PCPTYPES

    Thread#2 - hot-removes memory
    __offline_pages
    start_isolate_page_range
    set_migratetype_isolate
    set_pageblock_migratetype(page, MIGRATE_ISOLATE);
    // set migration type to MIGRATE_ISOLATE
    drain_all_pages(zone);
    // drain per-cpu page lists to buddy allocator.

    Thread#1 - continue
    free_unref_page_commit
    migratetype = get_pcppage_migratetype(page);
    // get old migration type
    list_add(&page->lru, &pcp->lists[migratetype]);
    // add new page to already drained pcp list

    Thread#2
    Never drains pcp again, and therefore gets stuck in the loop.

    The fix is to try to drain per-cpu lists again after
    check_pages_isolated_cb() fails.
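
    Roughly, the retry takes the following shape (a simplified sketch, not
    the literal patch):

    do {
            ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
                                        NULL, check_pages_isolated_cb);
            if (ret) {
                    /*
                     * A racing free_unref_page_commit() may have put a page
                     * back on an already-drained per-cpu list, so drain once
                     * more before retrying instead of spinning forever.
                     */
                    drain_all_pages(zone);
            }
    } while (ret);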

    Fixes: c52e75935f8d ("mm: remove extra drain pages on pcp list")
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: David Hildenbrand
    Cc: Oscar Salvador
    Cc: Wei Yang
    Cc:
    Link: https://lkml.kernel.org/r/20200903140032.380431-1-pasha.tatashin@soleen.com
    Link: https://lkml.kernel.org/r/20200904151448.100489-2-pasha.tatashin@soleen.com
    Link: http://lkml.kernel.org/r/20200904070235.GA15277@dhcp22.suse.cz
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Pavel Tatashin
     

21 Aug, 2020

1 commit

  • commit b4223a510e2ab1bf0f971d50af7c1431014b25ad upstream.

    When check_memblock_offlined_cb() returns a failing rc (e.g. the memblock
    is still online at that time), mem_hotplug_begin/done is left unpaired.

    This results in the following warning:
    Call Trace:
    percpu_up_write+0x33/0x40
    try_remove_memory+0x66/0x120
    ? _cond_resched+0x19/0x30
    remove_memory+0x2b/0x40
    dev_dax_kmem_remove+0x36/0x72 [kmem]
    device_release_driver_internal+0xf0/0x1c0
    device_release_driver+0x12/0x20
    bus_remove_device+0xe1/0x150
    device_del+0x17b/0x3e0
    unregister_dev_dax+0x29/0x60
    devm_action_release+0x15/0x20
    release_nodes+0x19a/0x1e0
    devres_release_all+0x3f/0x50
    device_release_driver_internal+0x100/0x1c0
    driver_detach+0x4c/0x8f
    bus_remove_driver+0x5c/0xd0
    driver_unregister+0x31/0x50
    dax_pmem_exit+0x10/0xfe0 [dax_pmem]
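
    One way to keep the lock usage paired is to perform the offline check
    before taking the hotplug lock; a rough sketch of that shape (an
    assumption about the approach, not the literal patch):

    static int try_remove_memory(int nid, u64 start, u64 size)
    {
            int rc;

            /* verify all memblocks are offline before taking the lock */
            rc = walk_memory_blocks(start, size, NULL,
                                    check_memblock_offlined_cb);
            if (rc)
                    return rc;      /* nothing taken yet, nothing to undo */

            mem_hotplug_begin();
            /* ... actual removal happens here ... */
            mem_hotplug_done();     /* always pairs with mem_hotplug_begin() */
            return 0;
    }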

    Fixes: f1037ec0cc8a ("mm/memory_hotplug: fix remove_memory() lockdep splat")
    Signed-off-by: Jia He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Dan Williams
    Cc: [5.6+]
    Cc: Andy Lutomirski
    Cc: Baoquan He
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Chuhong Yuan
    Cc: Dave Hansen
    Cc: Dave Jiang
    Cc: Fenghua Yu
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Kaly Xin
    Cc: Logan Gunthorpe
    Cc: Masahiro Yamada
    Cc: Mike Rapoport
    Cc: Peter Zijlstra
    Cc: Rich Felker
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vishal Verma
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200710031619.18762-3-justin.he@arm.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jia He
     

12 Mar, 2020

1 commit

  • commit c87cbc1f007c4b46165f05ceca04e1973cda0b9c upstream.

    Commit cd02cf1aceea ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
    fixed memory hotplug with debug_pagealloc enabled, where onlining a page
    goes through page freeing, which removes the direct mapping. Some arches
    don't like it when the page is not mapped in the first place, so
    generic_online_page() maps it first. This is somewhat wasteful, but
    better than special-casing the page freeing fast paths.

    The commit however missed that DEBUG_PAGEALLOC configured doesn't mean
    it's actually enabled. One has to test debug_pagealloc_enabled() since
    031bc5743f15 ("mm/debug-pagealloc: make debug-pagealloc boottime
    configurable"), or alternatively debug_pagealloc_enabled_static() since
    8e57f8acbbd1 ("mm, debug_pagealloc: don't rely on static keys too early"),
    but this is not done.

    As a result, a s390 kernel with DEBUG_PAGEALLOC configured but not enabled
    will crash:

    Unable to handle kernel pointer dereference in virtual kernel address space
    Failing address: 0000000000000000 TEID: 0000000000000483
    Fault in home space mode while using kernel ASCE.
    AS:0000001ece13400b R2:000003fff7fd000b R3:000003fff7fcc007 S:000003fff7fd7000 P:000000000000013d
    Oops: 0004 ilc:2 [#1] SMP
    CPU: 1 PID: 26015 Comm: chmem Kdump: loaded Tainted: GX 5.3.18-5-default #1 SLE15-SP2 (unreleased)
    Krnl PSW : 0704e00180000000 0000001ecd281b9e (__kernel_map_pages+0x166/0x188)
    R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
    Krnl GPRS: 0000000000000000 0000000000000800 0000400b00000000 0000000000000100
    0000000000000001 0000000000000000 0000000000000002 0000000000000100
    0000001ece139230 0000001ecdd98d40 0000400b00000100 0000000000000000
    000003ffa17e4000 001fffe0114f7d08 0000001ecd4d93ea 001fffe0114f7b20
    Krnl Code: 0000001ecd281b8e: ec17ffff00d8 ahik %r1,%r7,-1
    0000001ecd281b94: ec111dbc0355 risbg %r1,%r1,29,188,3
    >0000001ecd281b9e: 94fb5006 ni 6(%r5),251
    0000001ecd281ba2: 41505008 la %r5,8(%r5)
    0000001ecd281ba6: ec51fffc6064 cgrj %r5,%r1,6,1ecd281b9e
    0000001ecd281bac: 1a07 ar %r0,%r7
    0000001ecd281bae: ec03ff584076 crj %r0,%r3,4,1ecd281a5e
    Call Trace:
    [] __kernel_map_pages+0x166/0x188
    [] online_pages_range+0xf6/0x128
    [] walk_system_ram_range+0x7e/0xd8
    [] online_pages+0x2fe/0x3f0
    [] memory_subsys_online+0x8e/0xc0
    [] device_online+0x5a/0xc8
    [] state_store+0x88/0x118
    [] kernfs_fop_write+0xc2/0x200
    [] vfs_write+0x176/0x1e0
    [] ksys_write+0xa2/0x100
    [] system_call+0xd8/0x2c8

    Fix this by checking debug_pagealloc_enabled_static() before calling
    kernel_map_pages(). Backports for kernel before 5.5 should use
    debug_pagealloc_enabled() instead. Also add comments.
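
    A sketch of the fixed callback (simplified; kernels before 5.5 would test
    debug_pagealloc_enabled() instead):

    void generic_online_page(struct page *page, unsigned int order)
    {
            /*
             * Freeing the page with debug_pagealloc enabled will try to
             * unmap it from the direct map, so map it first, but only when
             * debug_pagealloc is actually enabled at runtime.
             */
            if (debug_pagealloc_enabled_static())
                    kernel_map_pages(page, 1 << order, 1);
            __free_pages_core(page, order);
            /* totalram/highmem accounting omitted in this sketch */
    }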

    Fixes: cd02cf1aceea ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
    Reported-by: Gerald Schaefer
    Signed-off-by: Andrew Morton
    Signed-off-by: Vlastimil Babka
    Reviewed-by: David Hildenbrand
    Cc:
    Cc: Joonsoo Kim
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/20200224094651.18257-1-vbabka@suse.cz
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

11 Feb, 2020

1 commit

  • commit f1037ec0cc8ac1a450974ad9754e991f72884f48 upstream.

    The daxctl unit test for the dax_kmem driver currently triggers the
    (false positive) lockdep splat below. It results from the fact that
    remove_memory_block_devices() is invoked under the mem_hotplug_lock()
    causing lockdep entanglements with cpu_hotplug_lock() and sysfs (kernfs
    active state tracking). It is a false positive because the sysfs
    attribute path triggering the memory remove is not the same attribute
    path associated with memory-block device.

    sysfs_break_active_protection() is not applicable since there is no real
    deadlock conflict; instead, move memory-block device removal outside the
    lock. The mem_hotplug_lock() is not needed to synchronize the
    memory-block device removal vs the page online state, that is already
    handled by lock_device_hotplug(). Specifically, lock_device_hotplug()
    is sufficient to allow try_remove_memory() to check the offline state of
    the memblocks and be assured that any in progress online attempts are
    flushed / blocked by kernfs_drain() / attribute removal.

    The add_memory() path safely creates memblock devices under the
    mem_hotplug_lock(). There is no kernfs active state synchronization in
    the memblock device_register() path, so nothing to fix there.

    This change is only possible thanks to the recent change that refactored
    memory block device removal out of arch_remove_memory() (commit
    4c4b7f9ba948 "mm/memory_hotplug: remove memory block devices before
    arch_remove_memory()"), and David's due diligence tracking down the
    guarantees afforded by kernfs_drain(). Not flagged for -stable since
    this only impacts ongoing development and lockdep validation, not a
    runtime issue.

    ======================================================
    WARNING: possible circular locking dependency detected
    5.5.0-rc3+ #230 Tainted: G OE
    ------------------------------------------------------
    lt-daxctl/6459 is trying to acquire lock:
    ffff99c7f0003510 (kn->count#241){++++}, at: kernfs_remove_by_name_ns+0x41/0x80

    but task is already holding lock:
    ffffffffa76a5450 (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x20/0xe0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (mem_hotplug_lock.rw_sem){++++}:
    __lock_acquire+0x39c/0x790
    lock_acquire+0xa2/0x1b0
    get_online_mems+0x3e/0xb0
    kmem_cache_create_usercopy+0x2e/0x260
    kmem_cache_create+0x12/0x20
    ptlock_cache_init+0x20/0x28
    start_kernel+0x243/0x547
    secondary_startup_64+0xb6/0xc0

    -> #1 (cpu_hotplug_lock.rw_sem){++++}:
    __lock_acquire+0x39c/0x790
    lock_acquire+0xa2/0x1b0
    cpus_read_lock+0x3e/0xb0
    online_pages+0x37/0x300
    memory_subsys_online+0x17d/0x1c0
    device_online+0x60/0x80
    state_store+0x65/0xd0
    kernfs_fop_write+0xcf/0x1c0
    vfs_write+0xdb/0x1d0
    ksys_write+0x65/0xe0
    do_syscall_64+0x5c/0xa0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #0 (kn->count#241){++++}:
    check_prev_add+0x98/0xa40
    validate_chain+0x576/0x860
    __lock_acquire+0x39c/0x790
    lock_acquire+0xa2/0x1b0
    __kernfs_remove+0x25f/0x2e0
    kernfs_remove_by_name_ns+0x41/0x80
    remove_files.isra.0+0x30/0x70
    sysfs_remove_group+0x3d/0x80
    sysfs_remove_groups+0x29/0x40
    device_remove_attrs+0x39/0x70
    device_del+0x16a/0x3f0
    device_unregister+0x16/0x60
    remove_memory_block_devices+0x82/0xb0
    try_remove_memory+0xb5/0x130
    remove_memory+0x26/0x40
    dev_dax_kmem_remove+0x44/0x6a [kmem]
    device_release_driver_internal+0xe4/0x1c0
    unbind_store+0xef/0x120
    kernfs_fop_write+0xcf/0x1c0
    vfs_write+0xdb/0x1d0
    ksys_write+0x65/0xe0
    do_syscall_64+0x5c/0xa0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    other info that might help us debug this:

    Chain exists of:
    kn->count#241 --> cpu_hotplug_lock.rw_sem --> mem_hotplug_lock.rw_sem

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(mem_hotplug_lock.rw_sem);
    lock(cpu_hotplug_lock.rw_sem);
    lock(mem_hotplug_lock.rw_sem);
    lock(kn->count#241);

    *** DEADLOCK ***

    No fixes tag as this has been a long standing issue that predated the
    addition of kernfs lockdep annotations.

    Link: http://lkml.kernel.org/r/157991441887.2763922.4770790047389427325.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Cc: Vishal Verma
    Cc: Pavel Tatashin
    Cc: Dave Hansen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

09 Jan, 2020

1 commit

  • commit feee6b2989165631b17ac6d4ccdbf6759254e85a upstream.

    We currently try to shrink a single zone when removing memory. We use
    the zone of the first page of the memory we are removing. If that
    memmap was never initialized (e.g., memory was never onlined), we will
    read garbage and can trigger kernel BUGs (due to a stale pointer):

    BUG: unable to handle page fault for address: 000000000000353d
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP PTI
    CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    RIP: 0010:clear_zone_contiguous+0x5/0x10
    Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
    RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
    RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
    RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
    R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
    FS: 0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    __remove_pages+0x4b/0x640
    arch_remove_memory+0x63/0x8d
    try_remove_memory+0xdb/0x130
    __remove_memory+0xa/0x11
    acpi_memory_device_remove+0x70/0x100
    acpi_bus_trim+0x55/0x90
    acpi_device_hotplug+0x227/0x3a0
    acpi_hotplug_work_fn+0x1a/0x30
    process_one_work+0x221/0x550
    worker_thread+0x50/0x3b0
    kthread+0x105/0x140
    ret_from_fork+0x3a/0x50
    Modules linked in:
    CR2: 000000000000353d

    Instead, shrink the zones when offlining memory or when onlining failed.
    Introduce and use remove_pfn_range_from_zone() for that. We now
    properly shrink the zones, even if we have DIMMs whereby

    - Some memory blocks fall into no zone (never onlined)

    - Some memory blocks fall into multiple zones (offlined+re-onlined)

    - Multiple memory blocks that fall into different zones

    Drop the zone parameter (with a potentially dubious value) from
    __remove_pages() and __remove_section().
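
    Conceptually, the zone association becomes a symmetric pair; a rough
    sketch of the resulting calls (illustrative, not the literal patch):

    /* onlining (or mapping device memory) associates the range with a zone */
    move_pfn_range_to_zone(zone, start_pfn, nr_pages, altmap);

    /* offlining, or a failed online attempt, undoes exactly that range */
    remove_pfn_range_from_zone(zone, start_pfn, nr_pages);

    /* ...and __remove_pages() no longer takes a zone argument at all */
    __remove_pages(start_pfn, nr_pages, altmap);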

    Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com
    Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
    Signed-off-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: "Matthew Wilcox (Oracle)"
    Cc: "Aneesh Kumar K.V"
    Cc: Pavel Tatashin
    Cc: Greg Kroah-Hartman
    Cc: Dan Williams
    Cc: Logan Gunthorpe
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     

23 Nov, 2019

1 commit

  • Let's limit shrinking to !ZONE_DEVICE so we can fix the current code.
    We should never try to touch the memmap of offline sections where we
    could have uninitialized memmaps and could trigger BUGs when calling
    page_to_nid() on poisoned pages.

    There is no reliable way to distinguish an uninitialized memmap from an
    initialized memmap that belongs to ZONE_DEVICE, as we don't have
    anything like SECTION_IS_ONLINE we can use similar to
    pfn_to_online_section() for !ZONE_DEVICE memory.

    E.g., set_zone_contiguous() similarly relies on pfn_to_online_section()
    and will therefore never mark a ZONE_DEVICE zone as contiguous. No longer
    shrinking the ZONE_DEVICE zone therefore results in no observable changes,
    besides /proc/zoneinfo indicating different boundaries - something we can
    totally live with.
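
    The guard itself is small; a sketch of the idea (not the literal patch):

    static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
                                 unsigned long end_pfn)
    {
            /*
             * ZONE_DEVICE memmaps may still be uninitialized and there is no
             * SECTION_IS_ONLINE-like marker to tell, so do not touch them
             * and leave the zone span alone.
             */
            if (zone_idx(zone) == ZONE_DEVICE)
                    return;

            /* ... shrink zone_start_pfn / spanned_pages as before ... */
    }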

    Before commit d0dc12e86b31 ("mm/memory_hotplug: optimize memory
    hotplug"), the memmap was initialized with 0 and the node with the right
    value. So the zone might be wrong but not garbage. After that commit,
    both the zone and the node will be garbage when touching uninitialized
    memmaps.

    Toshiki reported a BUG (race between delayed initialization of
    ZONE_DEVICE memmaps without holding the memory hotplug lock and
    concurrent zone shrinking).

    https://lkml.org/lkml/2019/11/14/1040

    "Iteration of create and destroy namespace causes the panic as below:

    kernel BUG at mm/page_alloc.c:535!
    CPU: 7 PID: 2766 Comm: ndctl Not tainted 5.4.0-rc4 #6
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
    RIP: 0010:set_pfnblock_flags_mask+0x95/0xf0
    Call Trace:
    memmap_init_zone_device+0x165/0x17c
    memremap_pages+0x4c1/0x540
    devm_memremap_pages+0x1d/0x60
    pmem_attach_disk+0x16b/0x600 [nd_pmem]
    nvdimm_bus_probe+0x69/0x1c0
    really_probe+0x1c2/0x3e0
    driver_probe_device+0xb4/0x100
    device_driver_attach+0x4f/0x60
    bind_store+0xc9/0x110
    kernfs_fop_write+0x116/0x190
    vfs_write+0xa5/0x1a0
    ksys_write+0x59/0xd0
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    While creating a namespace and initializing memmap, if you destroy the
    namespace and shrink the zone, it will initialize the memmap outside
    the zone and trigger VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page),
    pfn), page) in set_pfnblock_flags_mask()."

    This BUG is also mitigated by this commit, where for now we stop
    shrinking the ZONE_DEVICE zone until we can do it in a safe and clean way.

    Link: http://lkml.kernel.org/r/20191006085646.5768-5-david@redhat.com
    Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
    Signed-off-by: David Hildenbrand
    Reported-by: Aneesh Kumar K.V
    Reported-by: Toshiki Fukasawa
    Cc: Oscar Salvador
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Dan Williams
    Cc: Alexander Duyck
    Cc: Alexander Potapenko
    Cc: Andy Lutomirski
    Cc: Anshuman Khandual
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Christophe Leroy
    Cc: Damian Tometzki
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: Gerald Schaefer
    Cc: Greg Kroah-Hartman
    Cc: Halil Pasic
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Ira Weiny
    Cc: Jason Gunthorpe
    Cc: Jun Yao
    Cc: Logan Gunthorpe
    Cc: Mark Rutland
    Cc: Masahiro Yamada
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Pankaj Gupta
    Cc: Paul Mackerras
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: Qian Cai
    Cc: Rich Felker
    Cc: Robin Murphy
    Cc: Steve Capper
    Cc: Thomas Gleixner
    Cc: Tom Lendacky
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Wei Yang
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Cc: Yu Zhao
    Cc: [4.13+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

16 Nov, 2019

1 commit

  • try_offline_node() is pretty much broken right now:

    - The node span is updated when onlining memory, not when adding it. We
    ignore memory that was never onlined. Bad.

    - We touch possible garbage memmaps. The pfn_to_nid(pfn) can easily
    trigger a kernel panic. Bad for memory that is offline but also bad
    for subsection hotadd with ZONE_DEVICE, whereby the memmap of the
    first PFN of a section might contain garbage.

    - Sections belonging to mixed nodes are not properly considered.

    As memory blocks might belong to multiple nodes, we would have to walk
    all pageblocks (or at least subsections) within present sections.
    However, we don't have a way to identify whether a memmap that is not
    online was initialized (relevant for ZONE_DEVICE). This makes things
    more complicated.

    Luckily, we can piggyback on the node span and the nid stored in memory
    blocks. Currently, the node span is grown when calling
    move_pfn_range_to_zone() - e.g., when onlining memory, and shrunk when
    removing memory, before calling try_offline_node(). Sysfs links are
    created via link_mem_sections(), e.g., during boot or when adding
    memory.

    If the node still spans memory or if any memory block belongs to the
    nid, we don't set the node offline. As memory blocks that span multiple
    nodes cannot get offlined, the nid stored in memory blocks is reliable
    enough (for such online memory blocks, the node still spans the memory).

    Introduce for_each_memory_block() to efficiently walk all memory blocks.
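
    A sketch of how the walk can back the nid check (illustrative; the exact
    code may differ):

    static int check_no_memblock_for_node_cb(struct memory_block *mem,
                                             void *arg)
    {
            int nid = *(int *)arg;

            /* a non-zero return aborts the walk: this node still owns a block */
            return mem->nid == nid;
    }

    /* in try_offline_node(), after checking that the node spans no pages */
    if (for_each_memory_block(&nid, check_no_memblock_for_node_cb))
            return;         /* some memory block still belongs to this nid */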

    Note: We will soon stop shrinking the ZONE_DEVICE zone and the node span
    when removing ZONE_DEVICE memory to fix similar issues (access of
    garbage memmaps) - until we have a reliable way to identify whether
    these memmaps were properly initialized. This implies later, that once
    a node had ZONE_DEVICE memory, we won't be able to set a node offline -
    which should be acceptable.

    Since commit f1dd2cd13c4b ("mm, memory_hotplug: do not associate
    hotadded memory to zones until online") memory that is added is not
    associated with a zone/node (memmap not initialized). The introducing
    commit 60a5a19e7419 ("memory-hotplug: remove sysfs file of node")
    already missed that we could have multiple nodes for a section and that
    the zone/node span is updated when onlining pages, not when adding them.

    I tested this by hotplugging two DIMMs to a memory-less and cpu-less
    NUMA node. The node is properly onlined when adding the DIMMs. When
    removing the DIMMs, the node is properly offlined.

    Masayoshi Mizuma reported:

    : Without this patch, memory hotplug fails as panic:
    :
    : BUG: kernel NULL pointer dereference, address: 0000000000000000
    : ...
    : Call Trace:
    : remove_memory_block_devices+0x81/0xc0
    : try_remove_memory+0xb4/0x130
    : __remove_memory+0xa/0x20
    : acpi_memory_device_remove+0x84/0x100
    : acpi_bus_trim+0x57/0x90
    : acpi_bus_trim+0x2e/0x90
    : acpi_device_hotplug+0x2b2/0x4d0
    : acpi_hotplug_work_fn+0x1a/0x30
    : process_one_work+0x171/0x380
    : worker_thread+0x49/0x3f0
    : kthread+0xf8/0x130
    : ret_from_fork+0x35/0x40

    [david@redhat.com: v3]
    Link: http://lkml.kernel.org/r/20191102120221.7553-1-david@redhat.com
    Link: http://lkml.kernel.org/r/20191028105458.28320-1-david@redhat.com
    Fixes: 60a5a19e7419 ("memory-hotplug: remove sysfs file of node")
    Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e86b319
    Signed-off-by: David Hildenbrand
    Tested-by: Masayoshi Mizuma
    Cc: Tang Chen
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Keith Busch
    Cc: Jiri Olsa
    Cc: "Peter Zijlstra (Intel)"
    Cc: Jani Nikula
    Cc: Nayna Jain
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Stephen Rothwell
    Cc: Dan Williams
    Cc: Pavel Tatashin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

07 Nov, 2019

1 commit

  • We recently started updating the node span based on the zone span to
    avoid touching uninitialized memmaps.

    Currently, we will always detect the node span to start at 0, meaning a
    node can easily span too many pages. pgdat_is_empty() will still work
    correctly if all zones span no pages. We should skip over all zones
    without spanned pages and properly handle the first detected zone that
    spans pages.
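
    A sketch of the recomputation described above (simplified; not the
    literal patch):

    static void update_pgdat_span(struct pglist_data *pgdat)
    {
            unsigned long node_start_pfn = 0, node_end_pfn = 0;
            struct zone *zone;

            for (zone = pgdat->node_zones;
                 zone < pgdat->node_zones + MAX_NR_ZONES; zone++) {
                    unsigned long end_pfn = zone->zone_start_pfn +
                                            zone->spanned_pages;

                    /* skip zones that currently span no pages at all */
                    if (!zone->spanned_pages)
                            continue;
                    if (!node_end_pfn) {
                            /* first zone with pages defines the span */
                            node_start_pfn = zone->zone_start_pfn;
                            node_end_pfn = end_pfn;
                            continue;
                    }
                    if (end_pfn > node_end_pfn)
                            node_end_pfn = end_pfn;
                    if (zone->zone_start_pfn < node_start_pfn)
                            node_start_pfn = zone->zone_start_pfn;
            }

            pgdat->node_start_pfn = node_start_pfn;
            pgdat->node_spanned_pages = node_end_pfn - node_start_pfn;
    }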

    Unfortunately, in contrast to the zone span (/proc/zoneinfo), the node
    span cannot easily be inspected and tested. The node span gives no real
    guarantees when an architecture supports memory hotplug, meaning it can
    easily contain holes or span pages of different nodes.

    The node span is not really used after init on architectures that
    support memory hotplug.

    E.g., we use it in mm/memory_hotplug.c:try_offline_node() and in
    mm/kmemleak.c:kmemleak_scan(). These users seem to be fine.

    Link: http://lkml.kernel.org/r/20191027222714.5313-1-david@redhat.com
    Fixes: 00d6c019b5bc ("mm/memory_hotplug: don't access uninitialized memmaps in shrink_pgdat_span()")
    Signed-off-by: David Hildenbrand
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Stephen Rothwell
    Cc: Dan Williams
    Cc: Pavel Tatashin
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

19 Oct, 2019

1 commit

  • We might use the nid of memmaps that were never initialized. For
    example, if the memmap was poisoned, we will crash the kernel in
    pfn_to_nid() right now. Let's use the calculated boundaries of the
    separate zones instead. This now also avoids having to iterate over a
    whole bunch of subsections again, after shrinking one zone.

    Before commit d0dc12e86b31 ("mm/memory_hotplug: optimize memory
    hotplug"), the memmap was initialized to 0 and the node was set to the
    right value. After that commit, the node might be garbage.

    We'll have to fix shrink_zone_span() next.

    Link: http://lkml.kernel.org/r/20191006085646.5768-4-david@redhat.com
    Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [d0dc12e86b319]
    Signed-off-by: David Hildenbrand
    Reported-by: Aneesh Kumar K.V
    Cc: Oscar Salvador
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Dan Williams
    Cc: Wei Yang
    Cc: Alexander Duyck
    Cc: Alexander Potapenko
    Cc: Andy Lutomirski
    Cc: Anshuman Khandual
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Christophe Leroy
    Cc: Damian Tometzki
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: Gerald Schaefer
    Cc: Greg Kroah-Hartman
    Cc: Halil Pasic
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Ira Weiny
    Cc: Jason Gunthorpe
    Cc: Jun Yao
    Cc: Logan Gunthorpe
    Cc: Mark Rutland
    Cc: Masahiro Yamada
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Pankaj Gupta
    Cc: Paul Mackerras
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: Qian Cai
    Cc: Rich Felker
    Cc: Robin Murphy
    Cc: Steve Capper
    Cc: Thomas Gleixner
    Cc: Tom Lendacky
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Cc: Yu Zhao
    Cc: [4.13+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

25 Sep, 2019

9 commits

  • Correct typo in comment.

    Link: http://lkml.kernel.org/r/1568233954-3913-1-git-send-email-jrdr.linux@gmail.com
    Signed-off-by: Souptick Joarder
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     
  • walk_system_ram_range() will fail with -EINVAL in case
    online_pages_range() was never called (== no resource applicable in the
    range). Otherwise, we will always call online_pages_range() with nr_pages
    > 0 and, therefore, have online_pages > 0.

    Remove that special handling.

    Link: http://lkml.kernel.org/r/20190814154109.3448-6-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Dan Williams
    Cc: Arun KS
    Cc: Bjorn Helgaas
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Nadav Amit
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
    Commit a9cd410a3d29 ("mm/page_alloc.c: memory hotplug: free pages as
    higher order") assumed that any PFN we get via memory resources is aligned
    to MAX_ORDER - 1, but I am not convinced that is always true. Let's play
    it safe, check the alignment, and fall back to single pages.

    akpm: warn in this situation so we get to find out if and why this ever
    occurs.
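
    The resulting onlining loop roughly takes this shape (a sketch under
    those assumptions, not the literal patch):

    for (pfn = start_pfn; pfn < end_pfn; pfn += 1ul << order) {
            order = min(MAX_ORDER - 1, get_order(PFN_PHYS(end_pfn - pfn)));
            /*
             * __free_pages_core() wants pfns aligned to the order; warn and
             * fall back to order-0 if a resource hands us one that is not.
             */
            if (WARN_ON_ONCE(!IS_ALIGNED(pfn, 1ul << order)))
                    order = 0;
            (*online_page_callback)(pfn_to_page(pfn), order);
    }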

    [akpm@linux-foundation.org: add WARN_ON_ONCE()]
    Link: http://lkml.kernel.org/r/20190814154109.3448-5-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Arun KS
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Dan Williams
    Cc: Bjorn Helgaas
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Nadav Amit
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • online_pages always corresponds to nr_pages. Simplify the code, getting
    rid of online_pages_blocks(). Add some comments.

    Link: http://lkml.kernel.org/r/20190814154109.3448-4-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Dan Williams
    Cc: Arun KS
    Cc: Bjorn Helgaas
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Nadav Amit
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • move_pfn_range_to_zone() will set all pages to PG_reserved via
    memmap_init_zone(). The only way a page could no longer be reserved would
    be if a MEM_GOING_ONLINE notifier would clear PG_reserved - which is not
    done (the online_page callback is used for that purpose by e.g., Hyper-V
    instead). walk_system_ram_range() will never call online_pages_range()
    with duplicate PFNs, so drop the PageReserved() check.

    This seems to be a leftover from ancient times where the memmap was
    initialized when adding memory and we wanted to check for already onlined
    memory.

    Link: http://lkml.kernel.org/r/20190814154109.3448-3-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Dan Williams
    Cc: Arun KS
    Cc: Bjorn Helgaas
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Nadav Amit
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
    When offlining a node in try_offline_node(), the pgdat is not released,
    so it can be reused later by hotadd_new_pgdat(). However,
    pgdat->per_cpu_nodestats is reallocated even when the pgdat is reused,
    leaking the previous allocation.

    This patch prevents the memory leak by just allocating per_cpu_nodestats
    when it is a new pgdat.
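
    A sketch of the idea in hotadd_new_pgdat() (illustrative, not the
    literal patch):

    pg_data_t *pgdat = NODE_DATA(nid);

    if (!pgdat) {
            pgdat = arch_alloc_nodedata(nid);
            /*
             * Only a freshly allocated pgdat gets per-cpu node stats; a
             * pgdat left behind by try_offline_node() already has them, and
             * reallocating would leak the old percpu area.
             */
            pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
    }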

    Link: http://lkml.kernel.org/r/20190813020608.10194-1-richardw.yang@linux.intel.com
    Signed-off-by: Wei Yang
    Acked-by: Michal Hocko
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Each memory block spans the same amount of sections/pages/bytes. The size
    is determined before the first memory block is created. No need to store
    what we can easily calculate - and the calculations even look simpler now.

    Michal brought up the idea of variable-sized memory blocks. However, if
    we ever implement something like this, we will need an API compatibility
    switch and reworks at various places (most code assumes a fixed memory
    block size). So let's cleanup what we have right now.

    While at it, fix the variable naming in register_mem_sect_under_node() -
    we no longer talk about a single section.

    Link: http://lkml.kernel.org/r/20190809110200.2746-1-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Pavel Tatashin
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Let's remove this indirection. We need the zone in the caller either way,
    so let's just detect it there. Add some documentation for
    move_pfn_range_to_zone() instead.

    [akpm@linux-foundation.org: restore newline, per David]
    Link: http://lkml.kernel.org/r/20190724142324.3686-1-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Cc: David Hildenbrand
    Cc: Pavel Tatashin
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Replace 1 << compound_order(page) with compound_nr(page). Minor
    improvements in readability.
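
    For example (a trivial illustration, with "head" standing for any struct
    page pointer):

    /* before */
    pfn += 1 << compound_order(head);
    /* after: same value, clearer intent */
    pfn += compound_nr(head);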

    Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

03 Aug, 2019

1 commit


19 Jul, 2019

14 commits

  • David points out that there is a mixture of 'int' and 'unsigned long'
    usage for section number data types. Update the memory hotplug path to
    use 'unsigned long' consistently for section numbers.

    [akpm@linux-foundation.org: fix printk format]
    Link: http://lkml.kernel.org/r/156107543656.1329419.11505835211949439815.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: David Hildenbrand
    Reviewed-by: David Hildenbrand
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • The libnvdimm sub-system has suffered a series of hacks and broken
    workarounds for the memory-hotplug implementation's awkward
    section-aligned (128MB) granularity.

    For example the following backtrace is emitted when attempting
    arch_add_memory() with physical address ranges that intersect 'System
    RAM' (RAM) with 'Persistent Memory' (PMEM) within a given section:

    # cat /proc/iomem | grep -A1 -B1 Persistent\ Memory
    100000000-1ffffffff : System RAM
    200000000-303ffffff : Persistent Memory (legacy)
    304000000-43fffffff : System RAM
    440000000-23ffffffff : Persistent Memory
    2400000000-43bfffffff : Persistent Memory
    2400000000-43bfffffff : namespace2.0

    WARNING: CPU: 38 PID: 928 at arch/x86/mm/init_64.c:850 add_pages+0x5c/0x60
    [..]
    RIP: 0010:add_pages+0x5c/0x60
    [..]
    Call Trace:
    devm_memremap_pages+0x460/0x6e0
    pmem_attach_disk+0x29e/0x680 [nd_pmem]
    ? nd_dax_probe+0xfc/0x120 [libnvdimm]
    nvdimm_bus_probe+0x66/0x160 [libnvdimm]

    It was discovered that the problem goes beyond RAM vs PMEM collisions, as
    some platforms produce PMEM vs PMEM collisions within a given section.
    The libnvdimm workaround for that case revealed that the libnvdimm
    section-alignment-padding implementation has been broken for a long
    while.

    A fix for that long-standing breakage introduces as many problems as it
    solves as it would require a backward-incompatible change to the
    namespace metadata interpretation. Instead of that dubious route [1],
    address the root problem in the memory-hotplug implementation.

    Note that EEXIST is no longer treated as success as that is how
    sparse_add_section() reports subsection collisions, it was also obviated
    by recent changes to perform the request_region() for 'System RAM'
    before arch_add_memory() in the add_memory() sequence.

    [1] https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com

    [osalvador@suse.de: fix deactivate_section for early sections]
    Link: http://lkml.kernel.org/r/20190715081549.32577-2-osalvador@suse.de
    Link: http://lkml.kernel.org/r/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Oscar Salvador
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: Pavel Tatashin
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Prepare the memory hot-{add,remove} paths for handling sub-section
    ranges by plumbing the starting page frame and number of pages being
    handled through arch_{add,remove}_memory() to
    sparse_{add,remove}_one_section().

    This is simply plumbing, small cleanups, and some identifier renames.
    No intended functional changes.

    Link: http://lkml.kernel.org/r/156092353780.979959.9713046515562743194.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Pavel Tatashin
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • The zone type check was a leftover from the cleanup that plumbed altmap
    through the memory hotplug path, i.e. commit da024512a1fa "mm: pass the
    vmem_altmap to arch_remove_memory and __remove_pages".

    Link: http://lkml.kernel.org/r/156092352642.979959.6664333788149363039.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Tested-by: Aneesh Kumar K.V [ppc64]
    Cc: Michal Hocko
    Cc: Logan Gunthorpe
    Cc: Pavel Tatashin
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Sub-section hotplug support reduces the unit of operation of hotplug
    from section-sized-units (PAGES_PER_SECTION) to sub-section-sized units
    (PAGES_PER_SUBSECTION). Teach shrink_{zone,pgdat}_span() to consider
    PAGES_PER_SUBSECTION boundaries as the points where pfn_valid(), not
    valid_section(), can toggle.

    [osalvador@suse.de: fix shrink_{zone,node}_span]
    Link: http://lkml.kernel.org/r/20190717090725.23618-3-osalvador@suse.de
    Link: http://lkml.kernel.org/r/156092351496.979959.12703722803097017492.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Oscar Salvador
    Reviewed-by: Pavel Tatashin
    Reviewed-by: Oscar Salvador
    Tested-by: Aneesh Kumar K.V [ppc64]
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Patch series "mm: Sub-section memory hotplug support", v10.

    The memory hotplug section is an arbitrary / convenient unit for memory
    hotplug. 'Section-size' units have bled into the user interface
    ('memblock' sysfs) and can not be changed without breaking existing
    userspace. The section-size constraint, while mostly benign for typical
    memory hotplug, has and continues to wreak havoc with 'device-memory'
    use cases, persistent memory (pmem) in particular. Recall that pmem
    uses devm_memremap_pages(), and subsequently arch_add_memory(), to
    allocate a 'struct page' memmap for pmem. However, it does not use the
    'bottom half' of memory hotplug, i.e. never marks pmem pages online and
    never exposes the userspace memblock interface for pmem. This leaves an
    opening to redress the section-size constraint.

    To date, the libnvdimm subsystem has attempted to inject padding to
    satisfy the internal constraints of arch_add_memory(). Beyond
    complicating the code, leading to bugs [2], wasting memory, and limiting
    configuration flexibility, the padding hack is broken when the platform
    changes this physical memory alignment of pmem from one boot to the
    next. Device failure (intermittent or permanent) and physical
    reconfiguration are events that can cause the platform firmware to
    change the physical placement of pmem on a subsequent boot, and device
    failure is an everyday event in a data-center.

    It turns out that sections are only a hard requirement of the
    user-facing interface for memory hotplug and with a bit more
    infrastructure sub-section arch_add_memory() support can be added for
    kernel internal usages like devm_memremap_pages(). Here is an analysis
    of the current design assumptions in the current code and how they are
    addressed in the new implementation:

    Current design assumptions:

    - Sections that describe boot memory (early sections) are never
    unplugged / removed.

    - pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y, case devolves to a
    valid_section() check

    - __add_pages() and helper routines assume all operations occur in
    PAGES_PER_SECTION units.

    - The memblock sysfs interface only comprehends full sections

    New design assumptions:

    - Sections are instrumented with a sub-section bitmask to track (on
    x86) individual 2MB sub-divisions of a 128MB section.

    - Partially populated early sections can be extended with additional
    sub-sections, and those sub-sections can be removed with
    arch_remove_memory(). With this in place we no longer lose usable
    memory capacity to padding.

    - pfn_valid() is updated to look deeper than valid_section() to also
    check the active-sub-section mask. This indication is in the same
    cacheline as the valid_section() so the performance impact is
    expected to be negligible. So far the lkp robot has not reported any
    regressions.

    - Outside of the core vmemmap population routines which are replaced,
    other helper routines like shrink_{zone,pgdat}_span() are updated to
    handle the smaller granularity. Core memory hotplug routines that
    deal with online memory are not touched.

    - The existing memblock sysfs user api guarantees / assumptions are not
    touched since this capability is limited to !online
    !memblock-sysfs-accessible sections.

    Meanwhile the issue reports continue to roll in from users that do not
    understand when and how the 128MB constraint will bite them. The current
    implementation relied on being able to support at least one misaligned
    namespace, but that immediately falls over on any moderately complex
    namespace creation attempt. Beyond the initial problem of 'System RAM'
    colliding with pmem, and the unsolvable problem of physical alignment
    changes, Linux is now being exposed to platforms that collide pmem ranges
    with other pmem ranges by default [3]. In short, devm_memremap_pages()
    has pushed the venerable section-size constraint past the breaking point,
    and the simplicity of section-aligned arch_add_memory() is no longer
    tenable.

    These patches are exposed to the kbuild robot on a subsection-v10 branch
    [4], and a preview of the unit test for this functionality is available
    on the 'subsection-pending' branch of ndctl [5].

    [2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
    [3]: https://github.com/pmem/ndctl/issues/76
    [4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10
    [5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c

    This patch (of 13):

    Towards enabling memory hotplug to track partial population of a section,
    introduce 'struct mem_section_usage'.

    A pointer to a 'struct mem_section_usage' instance replaces the existing
    pointer to a 'pageblock_flags' bitmap. Effectively it adds one more
    'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house
    a new 'subsection_map' bitmap. The new bitmap enables the memory
    hot{plug,remove} implementation to act on incremental sub-divisions of a
    section.
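
    A sketch of the new structure, following the description above (the exact
    layout in the patch may differ):

    struct mem_section_usage {
            /* one bit per sub-section, e.g. per 2MB of a 128MB x86 section */
            DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION);
            /* the pre-existing pageblock_flags (usemap) bitmap lives here */
            unsigned long pageblock_flags[0];
    };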

    SUBSECTION_SHIFT is defined as global constant instead of per-architecture
    value like SECTION_SIZE_BITS in order to allow cross-arch compatibility of
    subsection users. Specifically a common subsection size allows for the
    possibility that persistent memory namespace configurations be made
    compatible across architectures.

    The primary motivation for this functionality is to support platforms that
    mix "System RAM" and "Persistent Memory" within a single section, or
    multiple PMEM ranges with different mapping lifetimes within a single
    section. The section restriction for hotplug has caused an ongoing saga
    of hacks and bugs for devm_memremap_pages() users.

    Beyond the fixups to teach existing paths how to retrieve the 'usemap'
    from a section, and updates to usemap allocation path, there are no
    expected behavior changes.

    Link: http://lkml.kernel.org/r/156092349845.979959.73333291612799019.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Oscar Salvador
    Reviewed-by: Wei Yang
    Tested-by: Aneesh Kumar K.V [ppc64]
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: Pavel Tatashin
    Cc: David Hildenbrand
    Cc: Jérôme Glisse
    Cc: Mike Rapoport
    Cc: Jane Chu
    Cc: Pavel Tatashin
    Cc: Jonathan Corbet
    Cc: Qian Cai
    Cc: Logan Gunthorpe
    Cc: Toshi Kani
    Cc: Jeff Moyer
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Let's move walk_memory_blocks() to the place where memory block logic
    resides and simplify it. While at it, add a type for the callback
    function.
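
    A sketch of the resulting callback type and walker signature
    (illustrative; the exact prototype may differ):

    typedef int (*walk_memory_blocks_func_t)(struct memory_block *, void *);

    int walk_memory_blocks(unsigned long start, unsigned long size,
                           void *arg, walk_memory_blocks_func_t func);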

    Link: http://lkml.kernel.org/r/20190614100114.311-6-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Andrew Morton
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: David Hildenbrand
    Cc: Stephen Rothwell
    Cc: Pavel Tatashin
    Cc: Andrew Banman
    Cc: Mike Travis
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Wei Yang
    Cc: Arun KS
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • walk_memory_range() was once used to iterate over sections. Now, it
    iterates over memory blocks. Rename the function, fixup the
    documentation.

    Also, pass start+size instead of PFNs, which is what most callers
    already have at hand. (we'll rework link_mem_sections() most probably
    soon)

    Follow-up patches will rework, simplify, and move walk_memory_blocks()
    to drivers/base/memory.c.

    Note: walk_memory_blocks() only works correctly right now if the
    start_pfn is aligned to a section start. This is the case right now,
    but we'll generalize the function in a follow up patch so the semantics
    match the documentation.

    [akpm@linux-foundation.org: remove unused variable]
    Link: http://lkml.kernel.org/r/20190614100114.311-5-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Andrew Morton
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: "Rafael J. Wysocki"
    Cc: Len Brown
    Cc: Greg Kroah-Hartman
    Cc: David Hildenbrand
    Cc: Rashmica Gupta
    Cc: Pavel Tatashin
    Cc: Anshuman Khandual
    Cc: Michael Neuling
    Cc: Thomas Gleixner
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Wei Yang
    Cc: Juergen Gross
    Cc: Qian Cai
    Cc: Arun KS
    Cc: Nick Desaulniers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • The parameter is unused, so let's drop it. Memory removal paths should
    never care about zones. This is the job of memory offlining and will
    require more refactoring.

    Link: http://lkml.kernel.org/r/20190527111152.16324-12-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Dan Williams
    Reviewed-by: Wei Yang
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Alex Deucher
    Cc: Andrew Banman
    Cc: Andy Lutomirski
    Cc: Anshuman Khandual
    Cc: Ard Biesheuvel
    Cc: Arun KS
    Cc: Baoquan He
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Chintan Pandya
    Cc: Christophe Leroy
    Cc: Chris Wilson
    Cc: Dave Hansen
    Cc: "David S. Miller"
    Cc: Fenghua Yu
    Cc: Greg Kroah-Hartman
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Joonsoo Kim
    Cc: Jun Yao
    Cc: "Kirill A. Shutemov"
    Cc: Logan Gunthorpe
    Cc: Mark Brown
    Cc: Mark Rutland
    Cc: Masahiro Yamada
    Cc: Mathieu Malaterre
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: "mike.travis@hpe.com"
    Cc: Nicholas Piggin
    Cc: Paul Mackerras
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: Qian Cai
    Cc: "Rafael J. Wysocki"
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Robin Murphy
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Cc: Yu Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Let's factor out the removal of memory block devices, which is only
    necessary for memory added via add_memory() and friends that created
    memory block devices. Remove the devices before calling
    arch_remove_memory().

    This finishes factoring out memory block device handling from
    arch_add_memory() and arch_remove_memory().
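
    The resulting removal ordering, roughly (a simplified sketch; locking,
    resource handling and the altmap argument are glossed over):

        /* Sketch: the sysfs memory block devices disappear before the
         * pages and mappings do. */
        remove_memory_block_devices(start, size);
        arch_remove_memory(nid, start, size, altmap);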

    Link: http://lkml.kernel.org/r/20190527111152.16324-10-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Dan Williams
    Acked-by: Michal Hocko
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: David Hildenbrand
    Cc: "mike.travis@hpe.com"
    Cc: Andrew Banman
    Cc: Ingo Molnar
    Cc: Alex Deucher
    Cc: "David S. Miller"
    Cc: Mark Brown
    Cc: Chris Wilson
    Cc: Oscar Salvador
    Cc: Jonathan Cameron
    Cc: Arun KS
    Cc: Mathieu Malaterre
    Cc: Andy Lutomirski
    Cc: Anshuman Khandual
    Cc: Ard Biesheuvel
    Cc: Baoquan He
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Chintan Pandya
    Cc: Christophe Leroy
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Joonsoo Kim
    Cc: Jun Yao
    Cc: "Kirill A. Shutemov"
    Cc: Logan Gunthorpe
    Cc: Mark Rutland
    Cc: Masahiro Yamada
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Nicholas Piggin
    Cc: Oscar Salvador
    Cc: Paul Mackerras
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: Qian Cai
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Robin Murphy
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Wei Yang
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Cc: Yu Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • No longer needed; the callers of arch_add_memory() can handle this
    manually.

    Link: http://lkml.kernel.org/r/20190527111152.16324-9-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Wei Yang
    Acked-by: Michal Hocko
    Cc: David Hildenbrand
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Joonsoo Kim
    Cc: Qian Cai
    Cc: Arun KS
    Cc: Mathieu Malaterre
    Cc: Mike Rapoport
    Cc: Alex Deucher
    Cc: Andrew Banman
    Cc: Andy Lutomirski
    Cc: Anshuman Khandual
    Cc: Ard Biesheuvel
    Cc: Baoquan He
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Chintan Pandya
    Cc: Christophe Leroy
    Cc: Chris Wilson
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: "David S. Miller"
    Cc: Fenghua Yu
    Cc: Greg Kroah-Hartman
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Jun Yao
    Cc: "Kirill A. Shutemov"
    Cc: Logan Gunthorpe
    Cc: Mark Brown
    Cc: Mark Rutland
    Cc: Masahiro Yamada
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: "mike.travis@hpe.com"
    Cc: Nicholas Piggin
    Cc: Oscar Salvador
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Robin Murphy
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Cc: Yu Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Only memory to be added to the buddy and to be onlined/offlined by user
    space using /sys/devices/system/memory/... needs (and should have!)
    memory block devices.

    Factor out creation of memory block devices. Create all devices after
    arch_add_memory() succeeded. We can later drop the want_memblock
    parameter, because it is now effectively stale.

    Only after memory block devices have been added can memory be onlined by
    user space. This implies that memory is not visible to user space at all
    before arch_add_memory() has succeeded.

    While at it:
    - use WARN_ON_ONCE instead of BUG_ON in the moved unregister_memory()
    - introduce find_memory_block_by_id() to search via block id
    - use find_memory_block_by_id() in init_memory_block() to catch
      duplicates
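
    A rough sketch of the resulting ordering in the add path (error handling
    trimmed; the exact arch_add_memory() arguments of that kernel version are
    assumed for illustration):

        /* Sketch: memory block devices - and thus user-space visibility -
         * only appear once arch_add_memory() has succeeded. */
        ret = arch_add_memory(nid, start, size, &restrictions);
        if (ret < 0)
                goto error;

        ret = create_memory_block_devices(start, size);
        if (ret) {
                arch_remove_memory(nid, start, size, NULL);
                goto error;
        }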

    Link: http://lkml.kernel.org/r/20190527111152.16324-8-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: David Hildenbrand
    Cc: "mike.travis@hpe.com"
    Cc: Ingo Molnar
    Cc: Andrew Banman
    Cc: Oscar Salvador
    Cc: Qian Cai
    Cc: Wei Yang
    Cc: Arun KS
    Cc: Mathieu Malaterre
    Cc: Alex Deucher
    Cc: Andy Lutomirski
    Cc: Anshuman Khandual
    Cc: Ard Biesheuvel
    Cc: Baoquan He
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Chintan Pandya
    Cc: Christophe Leroy
    Cc: Chris Wilson
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: "David S. Miller"
    Cc: Fenghua Yu
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Jonathan Cameron
    Cc: Joonsoo Kim
    Cc: Jun Yao
    Cc: "Kirill A. Shutemov"
    Cc: Logan Gunthorpe
    Cc: Mark Brown
    Cc: Mark Rutland
    Cc: Masahiro Yamada
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Nicholas Piggin
    Cc: Oscar Salvador
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Robin Murphy
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Cc: Yu Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • We want to improve error handling while adding memory by allowing the use
    of arch_remove_memory() and __remove_pages() even if
    CONFIG_MEMORY_HOTREMOVE is not set, to e.g. implement something like:

    arch_add_memory()
    rc = do_something();
    if (rc) {
            arch_remove_memory();
    }

    We won't get rid of CONFIG_MEMORY_HOTREMOVE for now, as it will require
    quite some dependencies for memory offlining.

    Link: http://lkml.kernel.org/r/20190527111152.16324-7-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Pavel Tatashin
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Cc: Oscar Salvador
    Cc: "Kirill A. Shutemov"
    Cc: Alex Deucher
    Cc: "David S. Miller"
    Cc: Mark Brown
    Cc: Chris Wilson
    Cc: Christophe Leroy
    Cc: Nicholas Piggin
    Cc: Vasily Gorbik
    Cc: Rob Herring
    Cc: Masahiro Yamada
    Cc: "mike.travis@hpe.com"
    Cc: Andrew Banman
    Cc: Arun KS
    Cc: Qian Cai
    Cc: Mathieu Malaterre
    Cc: Baoquan He
    Cc: Logan Gunthorpe
    Cc: Anshuman Khandual
    Cc: Ard Biesheuvel
    Cc: Catalin Marinas
    Cc: Chintan Pandya
    Cc: Dan Williams
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Joonsoo Kim
    Cc: Jun Yao
    Cc: Mark Rutland
    Cc: Mike Rapoport
    Cc: Oscar Salvador
    Cc: Robin Murphy
    Cc: Wei Yang
    Cc: Will Deacon
    Cc: Yu Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Patch series "mm/memory_hotplug: Factor out memory block device handling", v3.

    We only want memory block devices for memory to be onlined/offlined
    (add/remove from the buddy). This is required so user space can
    online/offline memory and kdump gets notified about newly onlined
    memory.

    Let's factor out creation/removal of memory block devices. This helps to
    further clean up arch_add_memory()/arch_remove_memory() and to make
    implementing new features easier - especially Dan's sub-section memory
    hot add.

    Anshuman Khandual is currently working on arch_remove_memory(). I added
    a temporary solution via "arm64/mm: Add temporary arch_remove_memory()
    implementation", which is sufficient as a first step in the context of
    this series. (We already don't clean up page tables in case anything
    goes wrong.)

    Did a quick sanity test with DIMM plug/unplug, making sure all devices
    and sysfs links properly get added/removed. Compile tested on s390x and
    x86-64.

    This patch (of 11):

    By converting start and size to page granularity, we actually ignore
    unaligned parts within a page instead of properly bailing out with an
    error.
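
    A hedged sketch of the intended behavior (the granularity used in the
    check is an assumption for illustration; the point is to fail instead of
    silently truncating the request):

        /* Sketch: bail out with an error rather than ignoring the unaligned
         * remainder of the request. */
        if (!IS_ALIGNED(start, PAGE_SIZE) || !IS_ALIGNED(size, PAGE_SIZE))
                return -EINVAL;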

    Link: http://lkml.kernel.org/r/20190527111152.16324-2-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Dan Williams
    Reviewed-by: Wei Yang
    Reviewed-by: Pavel Tatashin
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: David Hildenbrand
    Cc: Qian Cai
    Cc: Arun KS
    Cc: Mathieu Malaterre
    Cc: Alex Deucher
    Cc: Andrew Banman
    Cc: Andy Lutomirski
    Cc: Anshuman Khandual
    Cc: Ard Biesheuvel
    Cc: Baoquan He
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Chintan Pandya
    Cc: Christophe Leroy
    Cc: Chris Wilson
    Cc: Dave Hansen
    Cc: "David S. Miller"
    Cc: Fenghua Yu
    Cc: Greg Kroah-Hartman
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Joonsoo Kim
    Cc: Jun Yao
    Cc: "Kirill A. Shutemov"
    Cc: Logan Gunthorpe
    Cc: Mark Brown
    Cc: Mark Rutland
    Cc: Masahiro Yamada
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: "mike.travis@hpe.com"
    Cc: Nicholas Piggin
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Robin Murphy
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Cc: Yu Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

17 Jul, 2019

1 commit

  • Presently the remove_memory() interface is inherently broken. It tries
    to remove memory but panics if some memory is not offline. The problem
    is that it is impossible to ensure that all memory blocks are offline as
    this function also takes lock_device_hotplug that is required to change
    memory state via sysfs.

    So, between calling this function and offlining all memory blocks there
    is always a window when lock_device_hotplug is released, and therefore,
    there is always a chance for a panic during this window.

    Make this interface return an error if memory removal fails. This way it
    is safe to call this function without panicking the machine, and it also
    makes it symmetric to add_memory(), which already returns an error.
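
    A hedged usage sketch of the changed interface (the exact signature of
    that kernel version is assumed):

        /* Sketch: callers can now react to failure instead of the kernel
         * panicking when a memory block is still online. */
        ret = remove_memory(nid, start, size);
        if (ret)
                pr_warn("removing memory range failed: %d\n", ret);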

    Link: http://lkml.kernel.org/r/20190517215438.6487-3-pasha.tatashin@soleen.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Bjorn Helgaas
    Cc: Borislav Petkov
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Dave Hansen
    Cc: Dave Jiang
    Cc: Fengguang Wu
    Cc: Huang Ying
    Cc: James Morris
    Cc: Jérôme Glisse
    Cc: Keith Busch
    Cc: Ross Zwisler
    Cc: Sasha Levin
    Cc: Takashi Iwai
    Cc: Tom Lendacky
    Cc: Vishal Verma
    Cc: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

03 Jul, 2019

1 commit


21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only
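
    For files that previously carried no license information, the change
    boils down to adding the SPDX comment at the top of each file, e.g.:

        // SPDX-License-Identifier: GPL-2.0-only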

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

2 commits

  • Patch series "mm: Randomize free memory", v10.

    This patch (of 3):

    Randomization of the page allocator improves the average utilization of
    a direct-mapped memory-side-cache. Memory side caching is a platform
    capability that Linux has been previously exposed to in HPC
    (high-performance computing) environments on specialty platforms. In
    that instance it was a smaller pool of high-bandwidth-memory relative to
    higher-capacity / lower-bandwidth DRAM. Now, this capability is going
    to be found on general purpose server platforms where DRAM is a cache in
    front of higher latency persistent memory [1].

    Robert offered an explanation of the state of the art of Linux
    interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    also reduces the chance that the buffers will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel). That's better than forcing
    users to deploy remedies like:
    "To eliminate this gradual degradation, we have added a Stream
    measurement to the Node Health Check that follows each job;
    nodes are rebooted whenever their measured memory bandwidth
    falls below 300 GB/s."

    A replacement for zonesort was merged upstream in commit cc9aec03e58f
    ("x86/numa_emulation: Introduce uniform split capability"). With this
    numa_emulation capability, memory can be split into cache sized
    ("near-memory" sized) numa nodes. A bind operation to such a node, and
    disabling workloads on other nodes, enables full cache performance.
    However, once the workload exceeds the cache size then cache conflicts
    are unavoidable. While HPC environments might be able to tolerate
    time-scheduling of cache sized workloads, for general purpose server
    platforms, the oversubscribed cache case will be the common case.

    The worst case scenario is that a server system owner benchmarks a
    workload at boot with an un-contended cache only to see that performance
    degrade over time, even below the average cache performance due to
    excessive conflicts. Randomization clips the peaks and fills in the
    valleys of cache utilization to yield steady average performance.

    Here are some performance impact details of the patches:

    1/ An Intel internal synthetic memory bandwidth measurement tool saw a
    3X speedup in a contrived case that tries to force cache conflicts. The
    contrived case used the numa_emulation capability to force an instance
    of the benchmark to be run in two of the near-memory sized numa nodes.
    If both instances were placed on the same emulated node they would fit
    and cause zero conflicts. While on separate emulated nodes without
    randomization they underutilized the cache and conflicted unnecessarily
    due to the in-order allocation per node.

    2/ A well known Java server application benchmark was run with a heap
    size that exceeded cache size by 3X. The cache conflict rate was 8%
    for the first run and degraded to 21% after page allocator aging. With
    randomization enabled the rate levelled out at 11%.

    3/ A MongoDB workload did not observe measurable difference in
    cache-conflict rates, but the overall throughput dropped by 7% with
    randomization in one case.

    4/ Mel Gorman ran his suite of performance workloads with randomization
    enabled on platforms without a memory-side-cache and saw a mix of some
    improvements and some losses [3].

    While there is potentially significant improvement for applications that
    depend on low latency access across a wide working-set, the performance
    may be negligible to negative for other workloads. For this reason the
    shuffle capability defaults to off unless a direct-mapped
    memory-side-cache is detected. Even then, the page_alloc.shuffle=0
    parameter can be specified to disable the randomization on those systems.

    Outside of memory-side-cache utilization concerns there is potentially
    security benefit from randomization. Some data exfiltration and
    return-oriented-programming attacks rely on the ability to infer the
    location of sensitive data objects. The kernel page allocator, especially
    early in system boot, has predictable first-in-first-out behavior for
    physical pages. Pages are freed in physical address order when first
    onlined.

    Quoting Kees:
    "While we already have a base-address randomization
    (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
    memory layouts would certainly be using the predictability of
    allocation ordering (i.e. for attacks where the base address isn't
    important: only the relative positions between allocated memory).
    This is common in lots of heap-style attacks. They try to gain
    control over ordering by spraying allocations, etc.

    I'd really like to see this because it gives us something similar
    to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

    While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
    caches, it leaves the vast bulk of memory to be predictably allocated in
    order.
    However, it should be noted, the concrete security benefits are hard to
    quantify, and no known CVE is mitigated by this randomization.

    Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
    a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
    are initially populated with free memory at boot and at hotplug time. Do
    this based on either the presence of a page_alloc.shuffle=Y command line
    parameter, or autodetection of a memory-side-cache (to be added in a
    follow-on patch).

    The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
    pages, where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e.
    order 10 (4MB); this trades off randomization granularity for time spent
    shuffling. MAX_ORDER-1 was chosen to be minimally invasive to the page
    allocator while still showing memory-side cache behavior improvements,
    with the expectation that the security implications of finer granularity
    randomization are mitigated by CONFIG_SLAB_FREELIST_RANDOM. The
    performance impact of the shuffling appears to be in the noise compared
    to other memory initialization work.
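
    As a self-contained illustration of the underlying technique (this is not
    the kernel's shuffle_zone() implementation; a plain PRNG stands in for
    the kernel's random helpers):

        #include <stdlib.h>

        /* Fisher-Yates: walk the array from the end and swap each element
         * with a uniformly chosen earlier (or same) position. */
        static void fisher_yates_shuffle(unsigned long *entries, size_t n)
        {
                for (size_t i = n - 1; i > 0; i--) {
                        size_t j = (size_t)rand() % (i + 1);
                        unsigned long tmp = entries[i];

                        entries[i] = entries[j];
                        entries[j] = tmp;
                }
        }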

    This initial randomization can be undone over time so a follow-on patch is
    introduced to inject entropy on page free decisions. It is reasonable to
    ask if the page free entropy is sufficient, but it is not enough due to
    the in-order initial freeing of pages. At the start of that process
    putting page1 in front of or behind page0 still keeps them close together,
    page2 is still near page1 and has a high chance of being adjacent. As
    more pages are added ordering diversity improves, but there is still high
    page locality for the low address pages and this leads to no significant
    impact to the cache conflict rate.

    [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
    [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
    [3]: https://lkml.org/lkml/2018/10/12/309

    [dan.j.williams@intel.com: fix shuffle enable]
    Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
    [cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
    Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
    Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Qian Cai
    Reviewed-by: Kees Cook
    Acked-by: Michal Hocko
    Cc: Dave Hansen
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • All callers of arch_remove_memory() ignore errors, and we should really
    try to remove any errors from the memory removal path. No more errors
    are reported from __remove_pages(). The s390x code now BUG()s in case
    arch_remove_memory() is triggered; we may implement that properly later.
    The powerpc code WARNs in case it fails to remove the section mapping,
    which is better than ignoring the error completely right now.
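
    A hedged illustration of the new policy on the removal path (the call
    site is simplified, and the function name is taken from the powerpc code
    as described above):

        /* Sketch: report a failed section-mapping removal, but do not
         * propagate the error up the now error-free removal path. */
        ret = remove_section_mapping(start, start + size);
        WARN_ON_ONCE(ret);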

    Link: http://lkml.kernel.org/r/20190409100148.24703-5-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Michal Hocko
    Cc: Mike Rapoport
    Cc: Oscar Salvador
    Cc: "Kirill A. Shutemov"
    Cc: Christophe Leroy
    Cc: Stefan Agner
    Cc: Nicholas Piggin
    Cc: Pavel Tatashin
    Cc: Vasily Gorbik
    Cc: Arun KS
    Cc: Geert Uytterhoeven
    Cc: Masahiro Yamada
    Cc: Rob Herring
    Cc: Joonsoo Kim
    Cc: Wei Yang
    Cc: Qian Cai
    Cc: Mathieu Malaterre
    Cc: Andrew Banman
    Cc: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Cc: Mike Travis
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand