21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

10 commits

  • Patch series "mm: Randomize free memory", v10.

    This patch (of 3):

    Randomization of the page allocator improves the average utilization of
    a direct-mapped memory-side-cache. Memory side caching is a platform
    capability that Linux has been previously exposed to in HPC
    (high-performance computing) environments on specialty platforms. In
    that instance it was a smaller pool of high-bandwidth-memory relative to
    higher-capacity / lower-bandwidth DRAM. Now, this capability is going
    to be found on general purpose server platforms where DRAM is a cache in
    front of higher latency persistent memory [1].

    Robert offered an explanation of the state of the art of Linux
    interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    it does reduce the chance that they will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel). That's better than forcing
    users to deploy remedies like:
    "To eliminate this gradual degradation, we have added a Stream
    measurement to the Node Health Check that follows each job;
    nodes are rebooted whenever their measured memory bandwidth
    falls below 300 GB/s."

    A replacement for zonesort was merged upstream in commit cc9aec03e58f
    ("x86/numa_emulation: Introduce uniform split capability"). With this
    numa_emulation capability, memory can be split into cache sized
    ("near-memory" sized) numa nodes. A bind operation to such a node, and
    disabling workloads on other nodes, enables full cache performance.
    However, once the workload exceeds the cache size then cache conflicts
    are unavoidable. While HPC environments might be able to tolerate
    time-scheduling of cache sized workloads, for general purpose server
    platforms, the oversubscribed cache case will be the common case.

    The worst case scenario is that a server system owner benchmarks a
    workload at boot with an un-contended cache only to see that performance
    degrade over time, even below the average cache performance due to
    excessive conflicts. Randomization clips the peaks and fills in the
    valleys of cache utilization to yield steady average performance.

    Here are some performance impact details of the patches:

    1/ An Intel internal synthetic memory bandwidth measurement tool, saw a
    3X speedup in a contrived case that tries to force cache conflicts.
    The contrived case used the numa_emulation capability to force an
    instance of the benchmark to be run in two of the near-memory sized
    numa nodes. If both instances were placed on the same emulated node they
    would fit and cause zero conflicts. While on separate emulated nodes
    without randomization they underutilized the cache and conflicted
    unnecessarily due to the in-order allocation per node.

    2/ A well known Java server application benchmark was run with a heap
    size that exceeded cache size by 3X. The cache conflict rate was 8%
    for the first run and degraded to 21% after page allocator aging. With
    randomization enabled the rate levelled out at 11%.

    3/ A MongoDB workload did not observe measurable difference in
    cache-conflict rates, but the overall throughput dropped by 7% with
    randomization in one case.

    4/ Mel Gorman ran his suite of performance workloads with randomization
    enabled on platforms without a memory-side-cache and saw a mix of some
    improvements and some losses [3].

    While there is potentially significant improvement for applications that
    depend on low latency access across a wide working-set, the performance
    may be negligible to negative for other workloads. For this reason the
    shuffle capability defaults to off unless a direct-mapped
    memory-side-cache is detected. Even then, the page_alloc.shuffle=0
    parameter can be specified to disable the randomization on those systems.

    Outside of memory-side-cache utilization concerns there is potentially a
    security benefit from randomization. Some data exfiltration and
    return-oriented-programming attacks rely on the ability to infer the
    location of sensitive data objects. The kernel page allocator, especially
    early in system boot, has predictable first-in-first-out behavior for
    physical pages. Pages are freed in physical address order when first
    onlined.

    Quoting Kees:
    "While we already have a base-address randomization
    (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
    memory layouts would certainly be using the predictability of
    allocation ordering (i.e. for attacks where the base address isn't
    important: only the relative positions between allocated memory).
    This is common in lots of heap-style attacks. They try to gain
    control over ordering by spraying allocations, etc.

    I'd really like to see this because it gives us something similar
    to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

    While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
    caches, it leaves the vast bulk of memory to be predictably allocated in order.
    However, it should be noted, the concrete security benefits are hard to
    quantify, and no known CVE is mitigated by this randomization.

    Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
    a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
    are initially populated with free memory at boot and at hotplug time. Do
    this based on either the presence of a page_alloc.shuffle=Y command line
    parameter, or autodetection of a memory-side-cache (to be added in a
    follow-on patch).
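
    For illustration only, here is a minimal standalone Fisher-Yates shuffle
    over an array of block indices. This is not the kernel implementation,
    which operates on the buddy free_area lists and uses the kernel RNG:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Shuffle an array of "free block" indices in place, the way
     * shuffle_zone() randomizes the order of MAX_ORDER-1 sized free areas.
     * The kernel version swaps entries on the free_area lists and uses the
     * kernel RNG rather than rand(). */
    static void fisher_yates(unsigned long *blocks, size_t n)
    {
            for (size_t i = n - 1; i > 0; i--) {
                    size_t j = (size_t)rand() % (i + 1);
                    unsigned long tmp = blocks[i];

                    blocks[i] = blocks[j];
                    blocks[j] = tmp;
            }
    }

    int main(void)
    {
            unsigned long blocks[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };

            srand((unsigned int)time(NULL));
            fisher_yates(blocks, sizeof(blocks) / sizeof(blocks[0]));
            for (size_t i = 0; i < sizeof(blocks) / sizeof(blocks[0]); i++)
                    printf("%lu ", blocks[i]);
            printf("\n");
            return 0;
    }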

    The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
    pages, where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e.
    order 10 (4MB); this trades off randomization granularity for time spent
    shuffling. MAX_ORDER-1 was chosen to be minimally invasive to the page
    allocator while still showing memory-side cache behavior improvements,
    with the expectation that the security implications of finer granularity
    randomization are mitigated by CONFIG_SLAB_FREELIST_RANDOM. The
    performance impact of the shuffling appears to be in the noise compared to
    other memory initialization work.

    This initial randomization can be undone over time so a follow-on patch is
    introduced to inject entropy on page free decisions. It is reasonable to
    ask whether the page-free entropy alone is sufficient, but it is not, due
    to the in-order initial freeing of pages. At the start of that process,
    putting page1 in front of or behind page0 still keeps them close together;
    page2 is still near page1 and has a high chance of being adjacent. As
    more pages are added ordering diversity improves, but there is still high
    page locality for the low address pages and this leads to no significant
    impact to the cache conflict rate.

    [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
    [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
    [3]: https://lkml.org/lkml/2018/10/12/309

    [dan.j.williams@intel.com: fix shuffle enable]
    Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
    [cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
    Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
    Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Qian Cai
    Reviewed-by: Kees Cook
    Acked-by: Michal Hocko
    Cc: Dave Hansen
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • All callers of arch_remove_memory() ignore errors, and we should really
    try to remove any errors from the memory removal path. No more errors are
    reported from __remove_pages(). BUG() in the s390x code in case
    arch_remove_memory() is ever triggered there; we may implement that
    properly later. WARN in case the powerpc code fails to remove the section
    mapping, which is better than ignoring the error completely right now.

    Link: http://lkml.kernel.org/r/20190409100148.24703-5-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Michal Hocko
    Cc: Mike Rapoport
    Cc: Oscar Salvador
    Cc: "Kirill A. Shutemov"
    Cc: Christophe Leroy
    Cc: Stefan Agner
    Cc: Nicholas Piggin
    Cc: Pavel Tatashin
    Cc: Vasily Gorbik
    Cc: Arun KS
    Cc: Geert Uytterhoeven
    Cc: Masahiro Yamada
    Cc: Rob Herring
    Cc: Joonsoo Kim
    Cc: Wei Yang
    Cc: Qian Cai
    Cc: Mathieu Malaterre
    Cc: Andrew Banman
    Cc: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Cc: Mike Travis
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Let's just warn in case a section is not valid instead of failing to
    remove somewhere in the middle of the process, returning an error that
    will be mostly ignored by callers.

    Link: http://lkml.kernel.org/r/20190409100148.24703-4-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Cc: Pavel Tatashin
    Cc: Qian Cai
    Cc: Wei Yang
    Cc: Arun KS
    Cc: Mathieu Malaterre
    Cc: Andrew Banman
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Christophe Leroy
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Greg Kroah-Hartman
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Masahiro Yamada
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Mike Travis
    Cc: Nicholas Piggin
    Cc: Oscar Salvador
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Stefan Agner
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Failing while removing memory is mostly ignored and cannot really be
    handled. Let's treat errors in unregister_memory_section() in a nice way,
    warning, but continuing.

    Link: http://lkml.kernel.org/r/20190409100148.24703-3-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Ingo Molnar
    Cc: Andrew Banman
    Cc: Mike Travis
    Cc: David Hildenbrand
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Qian Cai
    Cc: Wei Yang
    Cc: Arun KS
    Cc: Mathieu Malaterre
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Christophe Leroy
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Masahiro Yamada
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Nicholas Piggin
    Cc: Oscar Salvador
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Stefan Agner
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Patch series "mm/memory_hotplug: Better error handling when removing
    memory", v1.

    Error handling when removing memory is somewhat messed up right now. Some
    errors result in warnings, others are completely ignored. Memory unplug
    code can essentially not deal with errors properly as of now.
    remove_memory() will never fail.

    We have basically two choices:
    1. Allow arch_remove_memory() and friends to fail, propagating errors via
    remove_memory(). Might be problematic (e.g. DIMMs consisting of multiple
    pieces added/removed separately).
    2. Don't allow the functions to fail, handling errors in a nicer way.

    It seems like most errors that can theoretically happen are really corner
    cases and mostly theoretical (e.g. "section not valid"). However, e.g.
    aborting the removal of sections partway through while all callers simply
    continue in case of errors is not nice.

    If we can guarantee that removal of memory always works (and WARN/skip in
    case of theoretical errors so we can figure out what is going on), we can
    go ahead and implement better error handling when adding memory.

    E.g. via add_memory():

    arch_add_memory()
    ret = do_stuff()
    if (ret) {
            arch_remove_memory();
            goto error;
    }

    Handling a failure of arch_remove_memory() here is basically impossible.
    So I suggest: let's avoid reporting errors while removing memory, instead
    warning on theoretical errors and continuing rather than aborting.

    This patch (of 4):

    __add_pages() doesn't add the memory resource, so __remove_pages()
    shouldn't remove it. Let's factor it out. Especially as it is a special
    case for memory used as system memory, added via add_memory() and friends.

    We now remove the resource after removing the sections instead of doing it
    the other way around. I don't think this change is problematic.

    add_memory()
            register memory resource
            arch_add_memory()

    remove_memory()
            arch_remove_memory()
            release memory resource

    While at it, explain why we ignore errors and that this only happens if
    we remove memory at a different granularity than we added it.

    [david@redhat.com: fix printk warning]
    Link: http://lkml.kernel.org/r/20190417120204.6997-1-david@redhat.com
    Link: http://lkml.kernel.org/r/20190409100148.24703-2-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Cc: Pavel Tatashin
    Cc: Wei Yang
    Cc: Qian Cai
    Cc: Arun KS
    Cc: Mathieu Malaterre
    Cc: Andrew Banman
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Christophe Leroy
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Greg Kroah-Hartman
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Masahiro Yamada
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Mike Travis
    Cc: Nicholas Piggin
    Cc: Oscar Salvador
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Stefan Agner
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • arch_add_memory() and __add_pages() take a want_memblock parameter which
    controls whether the newly added memory should get the sysfs memblock
    user API (e.g. ZONE_DEVICE users do not want/need this interface). Some
    callers even want to control where the memmap is allocated from by
    configuring an altmap.

    Add a more generic hotplug context for arch_add_memory() and
    __add_pages(). struct mhp_restrictions contains flags for additional
    features to be enabled by memory hotplug (MHP_MEMBLOCK_API currently) and
    an altmap for an alternative memmap allocator.

    This patch shouldn't introduce any functional change.
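
    In sketch form, the hotplug context described above looks roughly like
    this (treat the exact layout as an assumption based on this changelog):

    /*
     * Sketch of the hotplug restrictions context:
     * flags:  MHP_ feature flags (e.g. MHP_MEMBLOCK_API)
     * altmap: alternative allocator for the memmap array
     */
    struct mhp_restrictions {
            unsigned long flags;
            struct vmem_altmap *altmap;
    };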

    [akpm@linux-foundation.org: build fix]
    Link: http://lkml.kernel.org/r/20190408082633.2864-3-osalvador@suse.de
    Signed-off-by: Michal Hocko
    Signed-off-by: Oscar Salvador
    Cc: Dan Williams
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • check_pages_isolated_cb currently accounts the whole pfn range as being
    offlined if test_pages_isolated succeeds on the range. This is based on
    the assumption that all pages in the range are freed, which is currently
    the case most of the time, but it won't be with later changes, as pages
    marked as vmemmap won't be isolated.

    Move the offlined pages counting to offline_isolated_pages_cb and rely on
    __offline_isolated_pages to return the correct value.
    check_pages_isolated_cb will still do its primary job and check the pfn
    range.

    While we are at it, remove check_pages_isolated and offline_isolated_pages
    and use walk_system_ram_range directly, as done in online_pages.

    Link: http://lkml.kernel.org/r/20190408082633.2864-2-osalvador@suse.de
    Reviewed-by: David Hildenbrand
    Signed-off-by: Michal Hocko
    Signed-off-by: Oscar Salvador
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • In node_states_check_changes_online(), N_HIGH_MEMORY is used to substitute
    ZONE_HIGHMEM directly. This is not right. N_HIGH_MEMORY is meant to mark
    the memory state of a node. Here a zone index is being checked, so it
    should be compared against ZONE_HIGHMEM accordingly.

    Replace it with ZONE_HIGHMEM.

    This is a code cleanup - no known runtime effects.
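
    In sketch form, the corrected comparison looks roughly like this
    (surrounding code in node_states_check_changes_online() elided; the
    exact fields are an assumption):

    #ifdef CONFIG_HIGHMEM
            /* Compare the zone index against a zone constant, not a
             * node-state constant; N_HIGH_MEMORY describes node state and
             * only happens to have a compatible numeric value on some
             * configurations. */
            if (zone_idx(zone) <= ZONE_HIGHMEM && !node_state(nid, N_HIGH_MEMORY))
                    arg->status_change_nid_high = nid;
    #endif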

    Link: http://lkml.kernel.org/r/20190320080732.14933-1-bhe@redhat.com
    Fixes: 8efe33f40f3e ("mm/memory_hotplug.c: simplify node_states_check_changes_online")
    Signed-off-by: Baoquan He
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Cc: Wei Yang
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • has_unmovable_pages() already checks whether the hugetlb page supports
    migration, so all non-migratable hugetlb pages should have been caught
    there. Let us drop the check from scan_movable_pages() as it is redundant.

    Link: http://lkml.kernel.org/r/20190320152658.10855-3-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • On x86_64, 1GB-hugetlb pages could never be offlined due to the fact
    that hugepage_migration_supported() returned false for PUD_SHIFT.
    So whenever we wanted to offline a memblock containing a gigantic
    hugetlb page, we never got past the has_unmovable_pages() check.
    This changed with [1], where now we also return true for PUD_SHIFT.

    After that patch, the check in has_unmovable_pages() and scan_movable_pages()
    returned true, but we still had a final barrier in do_migrate_range():

    if (compound_order(head) > PFN_SECTION_SHIFT) {
            ret = -EBUSY;
            break;
    }

    This is not really nice, and we do not really need it.
    It is perfectly possible to migrate a gigantic page as long as another node has
    a spare gigantic page for us.
    In alloc_huge_page_nodemask(), we calculate the __real__ number of free pages,
    and if any, we try to dequeue one from another node.

    This all works fine when we do have another node with a spare gigantic page,
    but if that is not the case, alloc_huge_page_nodemask() ends up calling
    alloc_migrate_huge_page() which bails out if the wanted page is gigantic.
    That is mainly because finding a 1GB (or even 16GB on powerpc) contiguous
    chunk of memory is quite unlikely when the system has been running for a while.

    In that situation, we will keep looping forever because scan_movable_pages()
    will give us the same page and we will fail again because there is no node
    where we can dequeue a gigantic page from.
    This is not nice, and it has been raised that we might want to treat -ENOMEM
    as a fatal error in do_migrate_range(), but this has to be checked further.

    Anyway, I would tend to say that this is the administrator's job, to make sure
    that the system can keep up with the memory to be offlined, so that would mean
    that if we want to use gigantic pages, make sure that the other nodes have at
    least enough gigantic pages to keep up in case we need to offline memory.

    Just for the sake of completeness, this is one of the tests done:

    # echo 1 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
    # echo 1 > /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/nr_hugepages

    # cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
    1
    # cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
    1

    # cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/nr_hugepages
    1
    # cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/free_hugepages
    1

    (hugetlb1gb is a program that maps a 1GB region using MAP_HUGE_1GB)

    # numactl -m 1 ./hugetlb1gb
    # cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
    0
    # cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/free_hugepages
    1

    # offline node1 memory
    # cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/free_hugepages
    0

    [1] https://lore.kernel.org/patchwork/patch/998796/

    Link: http://lkml.kernel.org/r/20190320152658.10855-2-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: David Hildenbrand
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     

27 Apr, 2019

1 commit

  • Right now we are using find_memory_block() to get the node id for the
    pfn range to online, but we fail to drop the reference to the memory
    block device. While the device still gets unregistered via
    device_unregister(), resulting in no user visible problem, the device is
    never released via device_release(), resulting in a memory leak. Fix
    that by properly using put_device().
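
    A sketch of the described fix in the onlining path (surrounding code
    elided; variable names illustrative):

    struct memory_block *mem;

    mem = find_memory_block(__pfn_to_section(pfn));
    if (mem) {
            nid = mem->nid;
            /* find_memory_block() acquired a device reference; drop it so
             * the device can be released once it is unregistered. */
            put_device(&mem->dev);
    }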

    Link: http://lkml.kernel.org/r/20190411110955.1430-1-david@redhat.com
    Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")
    Signed-off-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Reviewed-by: Wei Yang
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Cc: David Hildenbrand
    Cc: Pavel Tatashin
    Cc: Qian Cai
    Cc: Arun KS
    Cc: Mathieu Malaterre
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

30 Mar, 2019

2 commits

  • When start_isolate_page_range() returns -EBUSY in __offline_pages(), it
    calls memory_notify(MEM_CANCEL_OFFLINE, &arg) with an uninitialized
    "arg". As a result, it triggers the warning below. Also, it is only
    necessary to notify MEM_CANCEL_OFFLINE after MEM_GOING_OFFLINE.

    page:ffffea0001200000 count:1 mapcount:0 mapping:0000000000000000
    index:0x0
    flags: 0x3fffe000001000(reserved)
    raw: 003fffe000001000 ffffea0001200008 ffffea0001200008 0000000000000000
    raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
    page dumped because: unmovable page
    WARNING: CPU: 25 PID: 1665 at mm/kasan/common.c:665
    kasan_mem_notifier+0x34/0x23b
    CPU: 25 PID: 1665 Comm: bash Tainted: G W 5.0.0+ #94
    Hardware name: HP ProLiant DL180 Gen9/ProLiant DL180 Gen9, BIOS U20
    10/25/2017
    RIP: 0010:kasan_mem_notifier+0x34/0x23b
    RSP: 0018:ffff8883ec737890 EFLAGS: 00010206
    RAX: 0000000000000246 RBX: ff10f0f4435f1000 RCX: f887a7a21af88000
    RDX: dffffc0000000000 RSI: 0000000000000020 RDI: ffff8881f221af88
    RBP: ffff8883ec737898 R08: ffff888000000000 R09: ffffffffb0bddcd0
    R10: ffffed103e857088 R11: ffff8881f42b8443 R12: dffffc0000000000
    R13: 00000000fffffff9 R14: dffffc0000000000 R15: 0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000560fbd31d730 CR3: 00000004049c6003 CR4: 00000000001606a0
    Call Trace:
    notifier_call_chain+0xbf/0x130
    __blocking_notifier_call_chain+0x76/0xc0
    blocking_notifier_call_chain+0x16/0x20
    memory_notify+0x1b/0x20
    __offline_pages+0x3e2/0x1210
    offline_pages+0x11/0x20
    memory_block_action+0x144/0x300
    memory_subsys_offline+0xe5/0x170
    device_offline+0x13f/0x1e0
    state_store+0xeb/0x110
    dev_attr_store+0x3f/0x70
    sysfs_kf_write+0x104/0x150
    kernfs_fop_write+0x25c/0x410
    __vfs_write+0x66/0x120
    vfs_write+0x15a/0x4f0
    ksys_write+0xd2/0x1b0
    __x64_sys_write+0x73/0xb0
    do_syscall_64+0xeb/0xb78
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f14f75cc3b8
    RSP: 002b:00007ffe84d01d68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f14f75cc3b8
    RDX: 0000000000000008 RSI: 0000563f8e433d70 RDI: 0000000000000001
    RBP: 0000563f8e433d70 R08: 000000000000000a R09: 00007ffe84d018f0
    R10: 000000000000000a R11: 0000000000000246 R12: 00007f14f789e780
    R13: 0000000000000008 R14: 00007f14f7899740 R15: 0000000000000008

    Link: http://lkml.kernel.org/r/20190320204255.53571-1-cai@lca.pw
    Fixes: 7960509329c2 ("mm, memory_hotplug: print reason for the offlining failure")
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Signed-off-by: Qian Cai
    Reviewed-by: Andrew Morton
    Cc: [5.0.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • Commit f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded
    memory to zones until online") introduced move_pfn_range_to_zone() which
    calls memmap_init_zone() during onlining a memory block.
    memmap_init_zone() will reset pagetype flags and makes migrate type to
    be MOVABLE.

    However, in __offline_pages(), it also calls undo_isolate_page_range()
    after offline_isolated_pages() to do the same thing. Because commit
    2ce13640b3f4 ("mm: __first_valid_page skip over offline pages") changed
    __first_valid_page() to skip offline pages, undo_isolate_page_range()
    here just wastes CPU cycles looping over the offlining PFN range while
    doing nothing, because __first_valid_page() will return NULL as
    offline_isolated_pages() has already marked all memory sections within
    the pfn range as offline via offline_mem_sections().

    Also, after calling the "useless" undo_isolate_page_range() here, it
    reaches the point of no return by notifying MEM_OFFLINE. Those pages
    will be marked as MIGRATE_MOVABLE again once onlined. The only thing
    left to do is to decrease the zone counter of isolated pageblocks, which
    would otherwise slow down some of the page allocation paths that the
    above commit introduced.

    Even if alloc_contig_range() can be used to isolate 16GB-hugetlb pages
    on ppc64, an "int" should still be enough to represent the number of
    pageblocks there. Fix an incorrect comment along the way.

    [cai@lca.pw: v4]
    Link: http://lkml.kernel.org/r/20190314150641.59358-1-cai@lca.pw
    Link: http://lkml.kernel.org/r/20190313143133.46200-1-cai@lca.pw
    Fixes: 2ce13640b3f4 ("mm: __first_valid_page skip over offline pages")
    Signed-off-by: Qian Cai
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Cc: Vlastimil Babka
    Cc: [4.13+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     

17 Mar, 2019

1 commit

  • Pull device-dax updates from Dan Williams:
    "New device-dax infrastructure to allow persistent memory and other
    "reserved" / performance differentiated memories, to be assigned to
    the core-mm as "System RAM".

    Some users want to use persistent memory as additional volatile
    memory. They are willing to cope with potential performance
    differences, for example between DRAM and 3D Xpoint, and want to use
    typical Linux memory management apis rather than a userspace memory
    allocator layered over an mmap() of a dax file. The administration
    model is to decide how much Persistent Memory (pmem) to use as System
    RAM, create a device-dax-mode namespace of that size, and then assign
    it to the core-mm. The rationale for device-dax is that it is a
    generic memory-mapping driver that can be layered over any "special
    purpose" memory, not just pmem. On subsequent boots udev rules can be
    used to restore the memory assignment.

    One implication of using pmem as RAM is that mlock() no longer keeps
    data off persistent media. For this reason it is recommended to enable
    NVDIMM Security (previously merged for 5.0) to encrypt pmem contents
    at rest. We considered making this recommendation an actively enforced
    requirement, but in the end decided to leave it as a distribution /
    administrator policy to allow for emulation and test environments that
    lack security capable NVDIMMs.

    Summary:

    - Replace the /sys/class/dax device model with /sys/bus/dax, and
    include a compat driver so distributions can opt-in to the new ABI.

    - Allow for an alternative driver for the device-dax address-range

    - Introduce the 'kmem' driver to hotplug / assign a device-dax
    address-range to the core-mm.

    - Arrange for the device-dax target-node to be onlined so that the
    newly added memory range can be uniquely referenced by numa apis"

    NOTE! I'm not entirely happy with the whole "PMEM as RAM" model because
    we currently have special - and very annoying rules in the kernel about
    accessing PMEM only with the "MC safe" accessors, because machine checks
    inside the regular repeat string copy functions can be fatal in some
    (not described) circumstances.

    And apparently the PMEM modules can cause that a lot more than regular
    RAM. The argument is that this happens because PMEM doesn't necessarily
    get scrubbed at boot like RAM does, but that is planned to be added for
    the user space tooling.

    Quoting Dan from another email:
    "The exposure can be reduced in the volatile-RAM case by scanning for
    and clearing errors before it is onlined as RAM. The userspace tooling
    for that can be in place before v5.1-final. There's also runtime
    notifications of errors via acpi_nfit_uc_error_notify() from
    background scrubbers on the DIMM devices. With that mechanism the
    kernel could proactively clear newly discovered poison in the volatile
    case, but that would be additional development more suitable for v5.2.

    I understand the concern, and the need to highlight this issue by
    tapping the brakes on feature development, but I don't see PMEM as RAM
    making the situation worse when the exposure is also there via DAX in
    the PMEM case. Volatile-RAM is arguably a safer use case since it's
    possible to repair pages where the persistent case needs active
    application coordination"

    * tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    device-dax: "Hotplug" persistent memory for use like normal RAM
    mm/resource: Let walk_system_ram_range() search child resources
    mm/memory-hotplug: Allow memory resources to be children
    mm/resource: Move HMM pr_debug() deeper into resource code
    mm/resource: Return real error codes from walk failures
    device-dax: Add a 'modalias' attribute to DAX 'bus' devices
    device-dax: Add a 'target_node' attribute
    device-dax: Auto-bind device after successful new_id
    acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node
    device-dax: Add /sys/class/dax backwards compatibility
    device-dax: Add support for a dax override driver
    device-dax: Move resource pinning+mapping into the common driver
    device-dax: Introduce bus + driver model
    device-dax: Start defining a dax bus model
    device-dax: Remove multi-resource infrastructure
    device-dax: Kill dax_region base
    device-dax: Kill dax_region ida

    Linus Torvalds
     

12 Mar, 2019

1 commit

  • Pull xen updates from Juergen Gross:
    "xen fixes and features:

    - remove fallback code for very old Xen hypervisors

    - three patches for fixing Xen dom0 boot regressions

    - an old patch for Xen PCI passthrough which was never applied for
    unknown reasons

    - some more minor fixes and cleanup patches"

    * tag 'for-linus-5.1a-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
    xen: fix dom0 boot on huge systems
    xen, cpu_hotplug: Prevent an out of bounds access
    xen: remove pre-xen3 fallback handlers
    xen/ACPI: Switch to bitmap_zalloc()
    x86/xen: dont add memory above max allowed allocation
    x86: respect memory size limiting via mem= parameter
    xen/gntdev: Check and release imported dma-bufs on close
    xen/gntdev: Do not destroy context while dma-bufs are in use
    xen/pciback: Don't disable PCI_COMMAND on PCI device reset.
    xen-scsiback: mark expected switch fall-through
    xen: mark expected switch fall-through

    Linus Torvalds
     

06 Mar, 2019

5 commits

  • When onlining a memory block with DEBUG_PAGEALLOC, the kernel unmaps the
    pages in the block from the kernel mapping. However, it does not map
    those pages at the beginning while offlining. As a result, it triggers a
    panic (below) while onlining on ppc64le, as it checks whether the pages
    are mapped before unmapping them. However, the imbalance exists for all
    arches where double-unmappings could happen. Therefore, let the kernel
    map those pages in generic_online_page() before they are freed into the
    page allocator for the first time, where the page count is set to one
    (a sketch follows the oops trace below).

    On the other hand, it works fine during the boot, because at least for
    IBM POWER8, it does,

    early_setup
      early_init_mmu
        hash__early_init_mmu
          htab_initialize [1]
            htab_bolt_mapping [2]

    where it effectively maps all memblock regions just like
    kernel_map_linear_page(), so the later mem_init() -> memblock_free_all()
    will unmap them just fine without any imbalance. On other arches
    without this imbalance checking, it still unmaps them at most once.

    [1]
    for_each_memblock(memory, reg) {
            base = (unsigned long)__va(reg->base);
            size = reg->size;

            DBG("creating mapping for region: %lx..%lx (prot: %lx)\n",
                base, size, prot);

            BUG_ON(htab_bolt_mapping(base, base + size, __pa(base),
                                     prot, mmu_linear_psize, mmu_kernel_ssize));
    }

    [2] linear_map_hash_slots[paddr >> PAGE_SHIFT] = ret | 0x80;
    kernel BUG at arch/powerpc/mm/hash_utils_64.c:1815!
    Oops: Exception in kernel mode, sig: 5 [#1]
    LE SMP NR_CPUS=256 DEBUG_PAGEALLOC NUMA pSeries
    CPU: 2 PID: 4298 Comm: bash Not tainted 5.0.0-rc7+ #15
    NIP: c000000000062670 LR: c00000000006265c CTR: 0000000000000000
    REGS: c0000005bf8a75b0 TRAP: 0700 Not tainted (5.0.0-rc7+)
    MSR: 800000000282b033 CR: 28422842
    XER: 00000000
    CFAR: c000000000804f44 IRQMASK: 1
    NIP [c000000000062670] __kernel_map_pages+0x2e0/0x4f0
    LR [c00000000006265c] __kernel_map_pages+0x2cc/0x4f0
    Call Trace:
    __kernel_map_pages+0x2cc/0x4f0
    free_unref_page_prepare+0x2f0/0x4d0
    free_unref_page+0x44/0x90
    __online_page_free+0x84/0x110
    online_pages_range+0xc0/0x150
    walk_system_ram_range+0xc8/0x120
    online_pages+0x280/0x5a0
    memory_subsys_online+0x1b4/0x270
    device_online+0xc0/0xf0
    state_store+0xc0/0x180
    dev_attr_store+0x3c/0x60
    sysfs_kf_write+0x70/0xb0
    kernfs_fop_write+0x10c/0x250
    __vfs_write+0x48/0x240
    vfs_write+0xd8/0x210
    ksys_write+0x70/0x120
    system_call+0x5c/0x70
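
    A hedged sketch of the described change, mapping the pages in
    generic_online_page() before handing them to the allocator (the rest of
    the function and its exact shape are assumptions):

    static void generic_online_page(struct page *page, unsigned int order)
    {
            /* Re-establish the kernel mapping before the pages are freed, so
             * a later DEBUG_PAGEALLOC unmap on free does not trip over an
             * already-unmapped (never-mapped) page. */
            kernel_map_pages(page, 1 << order, 1);

            /* ... hand the pages to the page allocator as before ... */
    }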

    Link: http://lkml.kernel.org/r/20190301220814.97339-1-cai@lca.pw
    Signed-off-by: Qian Cai
    Reviewed-by: Andrew Morton
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman [powerpc]
    Cc: Michal Hocko
    Cc: Souptick Joarder
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • isolate_huge_page() expects to be passed the head of a hugetlb page:

    bool isolate_huge_page(...)
    {
            ...
            VM_BUG_ON_PAGE(!PageHead(page), page);
            ...
    }

    While I really cannot think of any situation where we end up with a
    non-head page on our hands in do_migrate_range(), let us make sure the
    code is as sane as possible by explicitly passing the head page. Since
    we already got the pointer, it does not take us extra effort.
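
    In sketch form, the hugetlb branch of do_migrate_range() then passes the
    head page explicitly (surrounding code elided):

    if (PageHuge(page)) {
            struct page *head = compound_head(page);

            /* isolate_huge_page() expects the head page */
            isolate_huge_page(head, &source);
            continue;
    }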

    Link: http://lkml.kernel.org/r/20190208090604.975-1-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Reviewed-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Anthony Yznaga
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • In the current implementation, there are two places to isolate a range
    of page: __offline_pages() and alloc_contig_range(). During this
    procedure, it will drain pages on pcp list.

    Below is a brief call flow:

    __offline_pages()/alloc_contig_range()
            start_isolate_page_range()
                    set_migratetype_isolate()
                            drain_all_pages()
            drain_all_pages()
    Acked-by: Michal Hocko
    Acked-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

    All these places for replacement were found by running the following
    grep patterns on the entire kernel code. Please let me know if this
    might have missed some instances. This might also have replaced some
    false positives. I will appreciate suggestions, inputs and review.

    1. git grep "nid == -1"
    2. git grep "node == -1"
    3. git grep "nid = -1"
    4. git grep "node = -1"

    This patch (of 2):

    At present there are multiple places where an invalid node number is
    encoded as -1. Even though this is implicitly understood, it is better to
    have a macro for it. Replace these open encodings for an invalid node
    number with the global macro NUMA_NO_NODE. This helps remove NUMA
    related assumptions like 'invalid node' from various places redirecting
    them to a common definition.
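
    The replacement itself is mechanical, for example:

    /* before: open-coded invalid node */
    int nid = -1;

    /* after: use the common definition from <linux/numa.h> */
    int nid = NUMA_NO_NODE;    /* NUMA_NO_NODE is (-1) */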

    Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: David Hildenbrand
    Acked-by: Jeff Kirsher [ixgbe]
    Acked-by: Jens Axboe [mtip32xx]
    Acked-by: Vinod Koul [dmaengine.c]
    Acked-by: Michael Ellerman [powerpc]
    Acked-by: Doug Ledford [drivers/infiniband]
    Cc: Joseph Qi
    Cc: Hans Verkuil
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • When pages are freed at a higher order, the time the buddy allocator
    spends coalescing pages can be reduced. With a section size of 256MB,
    the hot-add latency of a single section improves from 50-60 ms to less
    than 1 ms, i.e. by about 60 times. Modify external providers of the
    online callback to align with the change.
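
    A sketch of the idea, assuming the pfn range is aligned to and a multiple
    of MAX_ORDER-1 sized blocks, as memory sections are (names follow the
    changelog but the details are an assumption):

    /* Hand the range to the online callback in MAX_ORDER-1 sized chunks
     * instead of page by page, so the buddy allocator does not have to
     * re-coalesce individual pages. */
    while (start_pfn < end_pfn) {
            (*online_page_callback)(pfn_to_page(start_pfn), MAX_ORDER - 1);
            start_pfn += 1UL << (MAX_ORDER - 1);
    }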

    [arunks@codeaurora.org: v11]
    Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org
    [akpm@linux-foundation.org: remove unused local, per Arun]
    [akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar]
    [akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch]
    [arunks@codeaurora.org: v8]
    Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org
    [arunks@codeaurora.org: v9]
    Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org
    Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Reviewed-by: Alexander Duyck
    Cc: K. Y. Srinivasan
    Cc: Haiyang Zhang
    Cc: Stephen Hemminger
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: Dan Williams
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Greg Kroah-Hartman
    Cc: Mathieu Malaterre
    Cc: "Kirill A. Shutemov"
    Cc: Souptick Joarder
    Cc: Mel Gorman
    Cc: Aaron Lu
    Cc: Srivatsa Vaddagiri
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
     

01 Mar, 2019

2 commits

  • The mm/resource.c code is used to manage the physical address
    space. The current resource configuration can be viewed in
    /proc/iomem. An example of this is at the bottom of this
    description.

    The nvdimm subsystem "owns" the physical address resources which
    map to persistent memory and has resources inserted for them as
    "Persistent Memory". The best way to repurpose this for volatile
    use is to leave the existing resource in place, but add a "System
    RAM" resource underneath it. This clearly communicates the
    ownership relationship of this memory.

    The request_resource_conflict() API only deals with the
    top-level resources. Replace it with __request_region() which
    will search for !IORESOURCE_BUSY areas lower in the resource
    tree than the top level.

    We *could* also simply truncate the existing top-level
    "Persistent Memory" resource and take over the released address
    space. But, this means that if we ever decide to hot-unplug the
    "RAM" and give it back, we need to recreate the original setup,
    which may mean going back to the BIOS tables.

    This should have no real effect on the existing collision
    detection because the areas that truly conflict should be marked
    IORESOURCE_BUSY.

    00000000-00000fff : Reserved
    00001000-0009fbff : System RAM
    0009fc00-0009ffff : Reserved
    000a0000-000bffff : PCI Bus 0000:00
    000c0000-000c97ff : Video ROM
    000c9800-000ca5ff : Adapter ROM
    000f0000-000fffff : Reserved
      000f0000-000fffff : System ROM
    00100000-9fffffff : System RAM
      01000000-01e071d0 : Kernel code
      01e071d1-027dfdff : Kernel data
      02dc6000-0305dfff : Kernel bss
    a0000000-afffffff : Persistent Memory (legacy)
      a0000000-a7ffffff : System RAM
    b0000000-bffdffff : System RAM
    bffe0000-bfffffff : Reserved
    c0000000-febfffff : PCI Bus 0000:00
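
    In sketch form, the __request_region()-based registration described
    above becomes something like this (variable names illustrative):

    struct resource *res;

    /* Search below the top level as well, so the new "System RAM" resource
     * can become a child of e.g. "Persistent Memory (legacy)". */
    res = __request_region(&iomem_resource, start, size, resource_name,
                           IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY);
    if (!res)
            return ERR_PTR(-EEXIST);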

    Signed-off-by: Dave Hansen
    Reviewed-by: Dan Williams
    Reviewed-by: Vishal Verma
    Cc: Dave Jiang
    Cc: Ross Zwisler
    Cc: Vishal Verma
    Cc: Tom Lendacky
    Cc: Andrew Morton
    Cc: Michal Hocko
    Cc: linux-nvdimm@lists.01.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: Huang Ying
    Cc: Fengguang Wu
    Cc: Borislav Petkov
    Cc: Bjorn Helgaas
    Cc: Yaowei Bai
    Cc: Takashi Iwai
    Cc: Jerome Glisse
    Cc: Keith Busch
    Signed-off-by: Dan Williams

    Dave Hansen
     
  • HMM consumes physical address space for its own use, even
    though nothing is mapped or accessible there. It uses a
    special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY)
    to uniquely identify these areas.

    When HMM consumes address space, it makes a best guess about
    what to consume. However, it is possible that a future memory
    or device hotplug can collide with the reserved area. In the
    case of these conflicts, there is an error message in
    register_memory_resource().

    Later patches in this series move register_memory_resource()
    from using request_resource_conflict() to __request_region().
    Unfortunately, __request_region() does not return the conflict
    like the previous function did, which makes it impossible to
    check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting
    resource.

    Instead of warning in register_memory_resource(), move the
    check into the core resource code itself (__request_region())
    where the conflicting resource _is_ available. This has the
    added bonus of producing a warning in case of HMM conflicts
    with devices *or* RAM address space, as opposed to the RAM-
    only warnings that were there previously.
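
    A sketch of the kind of check described; its placement inside
    __request_region() and the message wording are assumptions:

    /* If the new region overlaps address space reserved for device
     * private (HMM) memory, warn about the collision. */
    if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
            pr_warn("Unaddressable device %s %pR conflicts with %pR\n",
                    conflict->name, conflict, res);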

    Signed-off-by: Dave Hansen
    Reviewed-by: Jerome Glisse
    Cc: Dan Williams
    Cc: Dave Jiang
    Cc: Ross Zwisler
    Cc: Vishal Verma
    Cc: Tom Lendacky
    Cc: Andrew Morton
    Cc: Michal Hocko
    Cc: linux-nvdimm@lists.01.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: Huang Ying
    Cc: Fengguang Wu
    Cc: Keith Busch
    Signed-off-by: Dan Williams

    Dave Hansen
     

22 Feb, 2019

1 commit

  • Rong Chen has reported the following boot crash:

    PGD 0 P4D 0
    Oops: 0000 [#1] PREEMPT SMP PTI
    CPU: 1 PID: 239 Comm: udevd Not tainted 5.0.0-rc4-00149-gefad4e4 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    RIP: 0010:page_mapping+0x12/0x80
    Code: 5d c3 48 89 df e8 0e ad 02 00 85 c0 75 da 89 e8 5b 5d c3 0f 1f 44 00 00 53 48 89 fb 48 8b 43 08 48 8d 50 ff a8 01 48 0f 45 da 8b 53 08 48 8d 42 ff 83 e2 01 48 0f 44 c3 48 83 38 ff 74 2f 48
    RSP: 0018:ffff88801fa87cd8 EFLAGS: 00010202
    RAX: ffffffffffffffff RBX: fffffffffffffffe RCX: 000000000000000a
    RDX: fffffffffffffffe RSI: ffffffff820b9a20 RDI: ffff88801e5c0000
    RBP: 6db6db6db6db6db7 R08: ffff88801e8bb000 R09: 0000000001b64d13
    R10: ffff88801fa87cf8 R11: 0000000000000001 R12: ffff88801e640000
    R13: ffffffff820b9a20 R14: ffff88801f145258 R15: 0000000000000001
    FS: 00007fb2079817c0(0000) GS:ffff88801dd00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000006 CR3: 000000001fa82000 CR4: 00000000000006a0
    Call Trace:
    __dump_page+0x14/0x2c0
    is_mem_section_removable+0x24c/0x2c0
    removable_show+0x87/0xa0
    dev_attr_show+0x25/0x60
    sysfs_kf_seq_show+0xba/0x110
    seq_read+0x196/0x3f0
    __vfs_read+0x34/0x180
    vfs_read+0xa0/0x150
    ksys_read+0x44/0xb0
    do_syscall_64+0x5e/0x4a0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    and bisected it down to commit efad4e475c31 ("mm, memory_hotplug:
    is_mem_section_removable do not pass the end of a zone").

    The reason for the crash is that the mapping is garbage for a poisoned
    (uninitialized) page. This shouldn't happen, as all pages within the zone's
    boundary should be initialized.

    Later debugging revealed that the actual problem is an off-by-one when
    evaluating the end_page. 'start_pfn + nr_pages' resp. 'zone_end_pfn'
    refers to a pfn after the range, and as such it might belong to a
    different memory section.

    This along with CONFIG_SPARSEMEM then makes the loop condition
    completely bogus because pointer arithmetic doesn't work for pages
    from two different sections in that memory model.

    Fix the issue by reworking is_pageblock_removable to be pfn based and
    only use struct page where necessary. This makes the code slightly
    easier to follow and we will remove the problematic pointer arithmetic
    completely.

    Link: http://lkml.kernel.org/r/20190218181544.14616-1-mhocko@kernel.org
    Fixes: efad4e475c31 ("mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone")
    Signed-off-by: Michal Hocko
    Reported-by:
    Tested-by:
    Acked-by: Mike Rapoport
    Reviewed-by: Oscar Salvador
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

18 Feb, 2019

1 commit

  • When limiting memory size via kernel parameter "mem=" this should be
    respected even in case of memory made accessible via a PCI card.

    Today this kind of memory won't be made usable in initial memory
    setup as the memory won't be visible in E820 map, but it might be
    added when adding PCI devices due to corresponding ACPI table entries.

    Not respecting "mem=" can be corrected by adding a global max_mem_size
    variable set by parse_memopt() which will result in rejecting adding
    memory areas resulting in a memory size above the allowed limit.

    Signed-off-by: Juergen Gross
    Acked-by: Ingo Molnar
    Reviewed-by: William Kucharski
    Signed-off-by: Juergen Gross

    Juergen Gross
     

02 Feb, 2019

5 commits

  • Jan has noticed that we do double unlock on some failure paths when
    offlining a page range. This is indeed the case when
    test_pages_in_a_zone resp. start_isolate_page_range fail. This was an
    omission when forward porting the debugging patch from an older kernel.

    Fix the issue by dropping mem_hotplug_done from the failure condition
    and keeping the single unlock in the catch all failure path.

    Link: http://lkml.kernel.org/r/20190115120307.22768-1-mhocko@kernel.org
    Fixes: 7960509329c2 ("mm, memory_hotplug: print reason for the offlining failure")
    Signed-off-by: Michal Hocko
    Reported-by: Jan Kara
    Reviewed-by: Jan Kara
    Tested-by: Jan Kara
    Reviewed-by: Oscar Salvador
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This is the same sort of error we saw in commit 17e2e7d7e1b8 ("mm,
    page_alloc: fix has_unmovable_pages for HugePages").

    Gigantic hugepages cross several memblocks, so it can be that the page
    we get in scan_movable_pages() is a page-tail belonging to a
    1G-hugepage. If that happens, page_hstate()->size_to_hstate() will
    return NULL, and we will blow up in hugepage_migration_supported().

    The splat is as follows:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    #PF error: [normal kernel read fault]
    PGD 0 P4D 0
    Oops: 0000 [#1] SMP PTI
    CPU: 1 PID: 1350 Comm: bash Tainted: G E 5.0.0-rc1-mm1-1-default+ #27
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
    RIP: 0010:__offline_pages+0x6ae/0x900
    Call Trace:
    memory_subsys_offline+0x42/0x60
    device_offline+0x80/0xa0
    state_store+0xab/0xc0
    kernfs_fop_write+0x102/0x180
    __vfs_write+0x26/0x190
    vfs_write+0xad/0x1b0
    ksys_write+0x42/0x90
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    Modules linked in: af_packet(E) xt_tcpudp(E) ipt_REJECT(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv4(E) ip_set(E) nfnetlink(E) ebtable_nat(E) ebtable_broute(E) bridge(E) stp(E) llc(E) iptable_mangle(E) iptable_raw(E) iptable_security(E) ebtable_filter(E) ebtables(E) iptable_filter(E) ip_tables(E) x_tables(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) bochs_drm(E) ttm(E) aesni_intel(E) drm_kms_helper(E) aes_x86_64(E) crypto_simd(E) cryptd(E) glue_helper(E) drm(E) virtio_net(E) syscopyarea(E) sysfillrect(E) net_failover(E) sysimgblt(E) pcspkr(E) failover(E) i2c_piix4(E) fb_sys_fops(E) parport_pc(E) parport(E) button(E) btrfs(E) libcrc32c(E) xor(E) zstd_decompress(E) zstd_compress(E) xxhash(E) raid6_pq(E) sd_mod(E) ata_generic(E) ata_piix(E) ahci(E) libahci(E) libata(E) crc32c_intel(E) serio_raw(E) virtio_pci(E) virtio_ring(E) virtio(E) sg(E) scsi_mod(E) autofs4(E)

    [akpm@linux-foundation.org: fix brace layout, per David. Reduce indentation]
    Link: http://lkml.kernel.org/r/20190122154407.18417-1-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Reviewed-by: Anthony Yznaga
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • If memory end is not aligned with the sparse memory section boundary,
    the mapping of such a section is only partly initialized. This may lead
    to VM_BUG_ON due to uninitialized struct pages access from
    test_pages_in_a_zone() function triggered by memory_hotplug sysfs
    handlers.

    Here are the panic examples:
    CONFIG_DEBUG_VM_PGFLAGS=y
    kernel parameter mem=2050M
    --------------------------
    page:000003d082008000 is uninitialized and poisoned
    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
    Call Trace:
    test_pages_in_a_zone+0xde/0x160
    show_valid_zones+0x5c/0x190
    dev_attr_show+0x34/0x70
    sysfs_kf_seq_show+0xc8/0x148
    seq_read+0x204/0x480
    __vfs_read+0x32/0x178
    vfs_read+0x82/0x138
    ksys_read+0x5a/0xb0
    system_call+0xdc/0x2d8
    Last Breaking-Event-Address:
    test_pages_in_a_zone+0xde/0x160
    Kernel panic - not syncing: Fatal exception: panic_on_oops

    Fix this by checking whether the pfn to check is within the zone.
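
    In sketch form, the check described above (its exact placement in
    test_pages_in_a_zone() is an assumption):

    /* The struct pages beyond the zone may be uninitialized, so stop as
     * soon as the walk leaves the zone. */
    if (zone && !zone_spans_pfn(zone, pfn))
            return 0;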

    [mhocko@suse.com: separated this change from http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com]
    Link: http://lkml.kernel.org/r/20190128144506.15603-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Mikhail Zaslonko
    Tested-by: Mikhail Gavrilov
    Reviewed-by: Oscar Salvador
    Tested-by: Gerald Schaefer
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Mikhail Gavrilov
    Cc: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikhail Zaslonko
     
  • Patch series "mm, memory_hotplug: fix uninitialized pages fallouts", v2.

    Mikhail Zaslonko posted fixes for the two bugs quite some time ago [1].
    I have pushed back on those fixes because I believed that it is
    much better to plug the problem at the initialization time rather than
    play whack-a-mole all over the hotplug code and find all the places
    which expect the full memory section to be initialized.

    We have ended up with commit 2830bf6f05fb ("mm, memory_hotplug:
    initialize struct pages for the full memory section") merged and it caused a
    regression [2][3]. The reason is that there might be memory layouts
    when two NUMA nodes share the same memory section so the merged fix is
    simply incorrect.

    In order to plug this hole we really have to be zone range aware in
    those handlers. I have split up the original patch into two. One is
    unchanged (patch 2) and I took a different approach for `removable'
    crash.

    [1] http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com
    [2] https://bugzilla.redhat.com/show_bug.cgi?id=1666948
    [3] http://lkml.kernel.org/r/20190125163938.GA20411@dhcp22.suse.cz

    This patch (of 2):

    Mikhail has reported the following VM_BUG_ON triggered when reading sysfs
    removable state of a memory block:

    page:000003d08300c000 is uninitialized and poisoned
    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
    Call Trace:
    is_mem_section_removable+0xb4/0x190
    show_mem_removable+0x9a/0xd8
    dev_attr_show+0x34/0x70
    sysfs_kf_seq_show+0xc8/0x148
    seq_read+0x204/0x480
    __vfs_read+0x32/0x178
    vfs_read+0x82/0x138
    ksys_read+0x5a/0xb0
    system_call+0xdc/0x2d8
    Last Breaking-Event-Address:
    is_mem_section_removable+0xb4/0x190
    Kernel panic - not syncing: Fatal exception: panic_on_oops

    The reason is that the memory block spans the zone boundary and we are
    stumbling over an uninitialized struct page. Fix this by enforcing the zone
    range in is_mem_section_removable so that we never run away from a zone.

    Link: http://lkml.kernel.org/r/20190128144506.15603-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Mikhail Zaslonko
    Debugged-by: Mikhail Zaslonko
    Tested-by: Gerald Schaefer
    Tested-by: Mikhail Gavrilov
    Reviewed-by: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • do_migrate_range() takes a memory range and tries to isolate the pages
    to put them into a list. This list will be later on used in
    migrate_pages() to know the pages we need to migrate.

    Currently, if we fail to isolate a single page, we put all the already
    isolated pages back on their LRU and bail out of the function.
    This is quite suboptimal, as this will force us to start over again
    because scan_movable_pages will give us the same range. If there is no
    chance that we can isolate that page, we will loop here forever.

    The issue debugged in [1] proved exactly that. During the debugging of
    that issue, it was noticed that if do_migrate_range() fails to isolate a
    single page, we just discard the work we have done so far and bail out,
    which means that scan_movable_pages() will find the same set of pages
    again.

    Instead, we can just skip the error, keep isolating as many pages as
    possible and then proceed with the call to migrate_pages().

    This will allow us to do as much work as possible at once.
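    Schematically, the per-page handling in do_migrate_range() becomes the
    following (not the exact upstream hunk; isolate_lru_page(),
    isolate_movable_page() and ISOLATE_UNEVICTABLE are existing mm helpers):

        if (PageLRU(page))
                ret = isolate_lru_page(page);
        else
                ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);
        if (ret) {
                /* previously: put everything back and bail out, forcing
                 * scan_movable_pages() to hand us the same range again */
                continue;       /* skip this page, keep isolating the rest */
        }
        list_add_tail(&page->lru, &source);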

    [1] https://lkml.org/lkml/2018/12/6/324

    Michal said:

    : I still think that this doesn't give us a whole picture. Looping
    : forever is a bug. Failing the isolation is quite possible and it should
    : be an ephemeral condition (e.g. a race with freeing the page or
    : somebody else isolating the page for whatever reason). And here comes
    : the disadvantage of the current implementation. We simply throw
    : everything on the floor just because of an ephemeral condition. The
    : racy page_count check is quite a dubious way to prevent that.

    Link: http://lkml.kernel.org/r/20181211135312.27034-1-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: David Hildenbrand
    Cc: Dan Williams
    Cc: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: William Kucharski
    Cc: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     

29 Dec, 2018

10 commits

    Memory migration might fail during offlining and we keep retrying in that
    case. This is currently obfuscated by a goto-based retry loop. The code
    is hard to follow and, as a result, it is even suboptimal because each
    retry round scans the full range from start_pfn even though we have
    already successfully scanned/migrated the [start_pfn, pfn] range. This is
    all only because a check_pages_isolated failure has to rescan the full
    range again.

    De-obfuscate the migration retry loop by promoting it to a real for loop.
    In fact, remove the goto altogether by making it a proper double loop
    (yeah, gotos are nasty in this specific case). In the end we get slightly
    more optimal code which is also more readable.
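    The resulting control flow has roughly this shape (return-value
    conventions simplified; a sketch of the structure, not the exact code):

        do {
                for (pfn = start_pfn; pfn; ) {
                        /* returns the next movable pfn in [pfn, end_pfn), or 0;
                         * note that we resume from pfn, not from start_pfn */
                        pfn = scan_movable_pages(pfn, end_pfn);
                        if (pfn)
                                do_migrate_range(pfn, end_pfn);
                }
                /* only a failed isolation check forces another full pass */
        } while (check_pages_isolated(start_pfn, end_pfn) < 0);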

    [akpm@linux-foundation.org: reflow comments to 80 cols]
    Link: http://lkml.kernel.org/r/20181211142741.2607-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Pavel Tatashin
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "few memory offlining enhancements".

    I have been chasing memory offlining not making progress recently. On
    the way I have noticed a few weird decisions in the code. The migration
    itself is restricted without a reasonable justification, and the retry
    loop around the migration is quite messy. This is addressed by patch 1
    and patch 2.

    Patch 3 targets the faultaround code, which has been a hot candidate for
    the initial issue reported upstream [2] and which I am debugging
    internally. It turned out not to be the main contributor in the end, but
    I believe we should address it regardless. See the patch description for
    more details.

    [1] http://lkml.kernel.org/r/20181120134323.13007-1-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/20181114070909.GB2653@MiWiFi-R3L-srv

    This patch (of 3):

    do_migrate_range has been limiting the number of pages to migrate to 256
    for some reason which is not documented. Even if the limit made some
    sense back when it was introduced, it doesn't really serve a good purpose
    these days. If the range contains huge pages then we break out of the
    loop too early and go through the LRU and pcp cache draining and
    scan_movable_pages, which is quite suboptimal.

    The only reason to limit the number of pages I can think of is to reduce
    the potential time to react to a fatal signal. But even then the number
    of pages is a questionable metric because even a single page migration
    might block in a non-killable state (e.g. __unmap_and_move).

    Remove the limit and offline the full requested range (this is one
    memblock worth of pages with the current code). Should we ever get a
    report that offlining takes too long to react to a fatal signal, then we
    should rather fix the core migration to use killable waits and bail out
    on a signal.
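    Illustratively, the change removes the arbitrary batch size from the page
    walk (the old constant name is quoted from memory, so treat it as
    approximate):

        /* before: stop after NR_OFFLINE_AT_ONCE_PAGES (256) isolated pages */
        for (pfn = start_pfn; pfn < end_pfn && move_pages > 0; pfn++) {
                /* isolate the page at pfn */
                move_pages--;
        }

        /* after: walk and isolate the whole requested range in one go */
        for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                /* isolate the page at pfn */
        }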

    Link: http://lkml.kernel.org/r/20181211142741.2607-1-mhocko@kernel.org
    Link: http://lkml.kernel.org/r/20181211142741.2607-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Reviewed-by: Pavel Tatashin
    Reviewed-by: Oscar Salvador
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    We have received a bug report that an injected MCE about faulty memory
    prevents memory offlining from succeeding on a 4.4-based kernel. The
    underlying reason was that the HWPoison page has an elevated reference
    count and the migration keeps failing. There are two problems with that.
    First of all, it is dubious to migrate the poisoned page because we know
    that accessing that memory may fail. Secondly, it doesn't make any sense
    to migrate potentially broken content and preserve the memory corruption
    in a new location.

    Oscar has found out that 4.4 and the current upstream kernels behave
    slightly differently with his simple testcase:

    ===

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/mman.h>

    /* userspace stand-in for the kernel macro; a 4k page size is assumed */
    #define PAGE_SIZE	4096UL
    #define PAGE_ALIGN(addr)	(((addr) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

    #ifndef MADV_HWPOISON
    #define MADV_HWPOISON	100	/* from <asm-generic/mman-common.h> */
    #endif

    int main(void)
    {
    	int ret;
    	int i;
    	int fd;
    	char *array = malloc(4096);
    	char *array_locked = malloc(4096);

    	fd = open("/tmp/data", O_RDONLY);
    	read(fd, array, 4095);

    	for (i = 0; i < 4096; i++)
    		array_locked[i] = 'd';

    	/* lock the page containing the aligned address; note that
    	 * sizeof(array_locked) is the pointer size, not 4096 */
    	ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked),
    		    sizeof(array_locked));
    	if (ret)
    		perror("mlock");

    	sleep(20);

    	/* software-inject a memory failure on the locked page */
    	ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096,
    		      MADV_HWPOISON);
    	if (ret)
    		perror("madvise");

    	/* touch the poisoned page again */
    	for (i = 0; i < 4096; i++)
    		array_locked[i] = 'd';

    	return 0;
    }
    ===

    ... and then offline this memory.

    In 4.4 kernels he saw the hwpoisoned page being returned to the LRU
    list:
    kernel: [] dump_trace+0x59/0x340
    kernel: [] show_stack_log_lvl+0xea/0x170
    kernel: [] show_stack+0x21/0x40
    kernel: [] dump_stack+0x5c/0x7c
    kernel: [] warn_slowpath_common+0x81/0xb0
    kernel: [] __pagevec_lru_add_fn+0x14c/0x160
    kernel: [] pagevec_lru_move_fn+0xad/0x100
    kernel: [] __lru_cache_add+0x6c/0xb0
    kernel: [] add_to_page_cache_lru+0x46/0x70
    kernel: [] extent_readpages+0xc3/0x1a0 [btrfs]
    kernel: [] __do_page_cache_readahead+0x177/0x200
    kernel: [] ondemand_readahead+0x168/0x2a0
    kernel: [] generic_file_read_iter+0x41f/0x660
    kernel: [] __vfs_read+0xcd/0x140
    kernel: [] vfs_read+0x7a/0x120
    kernel: [] kernel_read+0x3b/0x50
    kernel: [] do_execveat_common.isra.29+0x490/0x6f0
    kernel: [] do_execve+0x28/0x30
    kernel: [] call_usermodehelper_exec_async+0xfb/0x130
    kernel: [] ret_from_fork+0x55/0x80

    And the latter confuses the hot-remove path, because an attempt is made
    to migrate an LRU page and that fails due to an elevated reference count.
    It is quite possible that the reuse of the HWPoisoned page is some kind
    of race condition that has been fixed since, but I am not really sure
    about that.

    With the upstream kernel the failure is slightly different. The page
    doesn't seem to have the LRU bit set, but isolate_movable_page simply
    fails, and do_migrate_range then puts all the isolated pages back to the
    LRU; therefore no progress is made and scan_movable_pages finds the same
    set of pages over and over again.

    Fix both cases by explicitly checking for HWPoisoned pages before we even
    try to get a reference on the page, and by trying to unmap such a page if
    it is still mapped. As explained by Naoya:

    : Hwpoison code never unmapped those for no big reason because
    : Ksm pages never dominate memory, so we simply didn't have strong
    : motivation to save the pages.

    Also add a WARN_ON(PageLRU) in case there is a race and we can hit LRU
    HWPoison pages, which shouldn't happen, but I couldn't convince myself of
    that. Naoya has noted the following:

    : Theoretically no such guarantee, because try_to_unmap() doesn't have a
    : guarantee of success and then memory_failure() returns immediately
    : when hwpoison_user_mappings fails.
    : Or the following code (comes after the hwpoison_user_mappings block)
    : also implies that the target page can still have the PageLRU flag.
    :
    : /*
    :  * Torn down by someone else?
    :  */
    : if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
    :         action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
    :         res = -EBUSY;
    :         goto out;
    : }
    :
    : So I think it's OK to keep the "if (WARN_ON(PageLRU(page)))" block in
    : the current version of your patch.
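    A sketch of the resulting check at the top of the per-page loop in
    do_migrate_range() (helper and flag names as found in kernels of that era;
    illustrative rather than the exact hunk):

        if (PageHWPoison(page)) {
                /* never migrate a poisoned page: its content is suspect and
                 * migrating it would only carry the corruption along */
                if (WARN_ON(PageLRU(page)))
                        isolate_lru_page(page);
                if (page_mapped(page))
                        try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
                continue;
        }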

    Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Debugged-by: Oscar Salvador
    Tested-by: Oscar Salvador
    Acked-by: David Hildenbrand
    Acked-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    During the online_pages phase, pgdat->nr_zones will be updated in case
    the zone was previously empty.

    Currently the online_pages phase is protected by the global locks
    (device_hotplug_lock and mem_hotplug_lock), which ensure there is no
    contention during the update of nr_zones.

    These global locks introduce scalability issues (especially the second
    one), which slow down code relying on get_online_mems(). This is also a
    preparation for not having to rely on get_online_mems() but instead on
    some more fine-grained locks.

    The patch moves init_currently_empty_zone under both zone_span_writelock
    and pgdat_resize_lock because both the pgdat state (nr_zones) and the
    zone's start_pfn are changed. This patch also updates the documentation
    of node_size_lock to include the protection of nr_zones.
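    Schematically, the resulting lock nesting in the online path looks like
    this (helper names from mm/memory_hotplug.c; surrounding code elided):

        pgdat_resize_lock(pgdat, &flags);
        zone_span_writelock(zone);
        if (zone_is_empty(zone))
                init_currently_empty_zone(zone, start_pfn, nr_pages);
        resize_zone_range(zone, start_pfn, nr_pages);    /* zone->zone_start_pfn */
        zone_span_writeunlock(zone);
        resize_pgdat_range(pgdat, start_pfn, nr_pages);  /* node span */
        pgdat_resize_unlock(pgdat, &flags);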

    Link: http://lkml.kernel.org/r/20181203205016.14123-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
    Since the only information sparse_add_one_section() needs is the node id
    (to allocate memory on the proper node), it is not necessary to pass the
    whole pgdat.

    This patch changes the prototype of sparse_add_one_section() to pass the
    node id directly. This avoids giving the misleading impression that
    sparse_add_one_section() touches the pgdat.
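    The interface change then looks roughly like this (qualifiers and the
    exact parameter list are reproduced from memory, so treat them as
    approximate):

        /* before: the pgdat is passed only so the node id can be derived */
        int sparse_add_one_section(struct pglist_data *pgdat,
                                   unsigned long start_pfn,
                                   struct vmem_altmap *altmap);

        /* after: pass the node id directly */
        int sparse_add_one_section(int nid, unsigned long start_pfn,
                                   struct vmem_altmap *altmap);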

    Link: http://lkml.kernel.org/r/20181204085657.20472-2-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Dave Hansen
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Patch series "Do not touch pages in hot-remove path", v2.

    This patchset aims for two things:

    1) A better definition of the offline and hot-remove stages
    2) Fixing bugs where we can access uninitialized pages
    during hot-remove operations [2] [3].

    This is achieved by moving all page/zone handling to the offline
    stage, so we do not need to access pages when hot-removing memory.

    [1] https://patchwork.kernel.org/cover/10691415/
    [2] https://patchwork.kernel.org/patch/10547445/
    [3] https://www.spinics.net/lists/linux-mm/msg161316.html

    This patch (of 5):

    This is a preparation for the follow-up patches. The idea of passing the
    nid is that it will allow us to get rid of the zone parameter
    afterwards.

    Link: http://lkml.kernel.org/r/20181127162005.15833-2-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Reviewed-by: David Hildenbrand
    Reviewed-by: Pavel Tatashin
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Jerome Glisse
    Cc: Jonathan Cameron
    Cc: "Rafael J. Wysocki"

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
    Userspace should always be in charge of how to online memory and of
    whether memory should be onlined automatically in the kernel. Let's drop
    the parameter that overrides this - XEN passes memhp_auto_online, just
    like add_memory(), so we can use that directly internally.
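    Sketched from the description above, the change amounts to dropping the
    caller-controlled flag (signature reproduced from memory, so approximate):

        /* before: callers could force immediate onlining */
        int add_memory_resource(int nid, struct resource *res, bool online);

        /* after: the onlining policy comes from memhp_auto_online /
         * userspace (udev rules) only */
        int add_memory_resource(int nid, struct resource *res);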

    Link: http://lkml.kernel.org/r/20181123123740.27652-1-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Acked-by: Juergen Gross
    Cc: Boris Ostrovsky
    Cc: Stefano Stabellini
    Cc: Dan Williams
    Cc: Pavel Tatashin
    Cc: David Hildenbrand
    Cc: Joonsoo Kim
    Cc: Arun KS
    Cc: Mathieu Malaterre
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
    Per-cpu numa_node provides a default node for each possible cpu. The
    association gets initialized during boot when the architecture-specific
    code explores the cpu->NUMA affinity. When the whole NUMA node is
    removed, though, we clear this association:

    try_offline_node
    check_and_unmap_cpu_on_node
    unmap_cpu_on_node
    numa_clear_node
    numa_set_node(cpu, NUMA_NO_NODE)

    This means that whoever calls cpu_to_node for a cpu associated with such
    a node will get NUMA_NO_NODE. This is problematic for two reasons.
    First, it is fragile because __alloc_pages_node would simply blow up on
    an out-of-bounds access. We have encountered this when loading the kvm
    module:

    BUG: unable to handle kernel paging request at 00000000000021c0
    IP: __alloc_pages_nodemask+0x93/0xb70
    PGD 800000ffe853e067 PUD 7336bbc067 PMD 0
    Oops: 0000 [#1] SMP
    [...]
    CPU: 88 PID: 1223749 Comm: modprobe Tainted: G W 4.4.156-94.64-default #1
    RIP: __alloc_pages_nodemask+0x93/0xb70
    RSP: 0018:ffff887354493b40 EFLAGS: 00010202
    RAX: 00000000000021c0 RBX: 0000000000000000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000002 RDI: 00000000014000c0
    RBP: 00000000014000c0 R08: ffffffffffffffff R09: 0000000000000000
    R10: ffff88fffc89e790 R11: 0000000000014000 R12: 0000000000000101
    R13: ffffffffa0772cd4 R14: ffffffffa0769ac0 R15: 0000000000000000
    FS: 00007fdf2f2f1700(0000) GS:ffff88fffc880000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000000021c0 CR3: 00000077205ee000 CR4: 0000000000360670
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    alloc_vmcs_cpu+0x3d/0x90 [kvm_intel]
    hardware_setup+0x781/0x849 [kvm_intel]
    kvm_arch_hardware_setup+0x28/0x190 [kvm]
    kvm_init+0x7c/0x2d0 [kvm]
    vmx_init+0x1e/0x32c [kvm_intel]
    do_one_initcall+0xca/0x1f0
    do_init_module+0x5a/0x1d7
    load_module+0x1393/0x1c90
    SYSC_finit_module+0x70/0xa0
    entry_SYSCALL_64_fastpath+0x1e/0xb7
    DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb7

    This was on an older kernel, but the code is basically the same in the
    current Linus tree as well. alloc_vmcs_cpu could use
    alloc_pages_nodemask, which would recognize NUMA_NO_NODE, or use
    alloc_pages_node, which would translate it to numa_mem_id, but that is
    wrong as well because it would use the affinity of the local CPU, which
    might be quite far from the original node. It is also reasonable to
    expect that cpu_to_node will provide a sane value, and there might be
    many more callers like that.
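    Schematically, the failing pattern is the following (illustrative only,
    not the actual kvm_intel code):

        int nid = cpu_to_node(cpu);     /* NUMA_NO_NODE (-1) once the node has
                                         * been cleared by try_offline_node */
        /* NODE_DATA(nid) indexes node_data[nid]; with nid == -1 that is an
         * out-of-bounds read, and the allocator chases a wild pointer */
        struct page *page = __alloc_pages_node(nid, GFP_KERNEL, 0);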

    The second problem is that __register_one_node relies on cpu_to_node to
    properly associate cpus back to the node when it is onlined. We do not
    want to lose that link, as there is no arch-independent way to get it
    back after early boot, AFAICS.

    Drop the whole check_and_unmap_cpu_on_node machinery and keep the
    association to fix both issues. NODE_DATA(nid) is not deallocated, so it
    will stay in place, and if anybody wants to allocate from that node then
    a fallback node will be used.

    Thanks to Vlastimil Babka for his live system debugging skills that
    helped debug the issue.

    Link: http://lkml.kernel.org/r/20181108100413.966-1-mhocko@kernel.org
    Fixes: e13fe8695c57 ("cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node")
    Signed-off-by: Michal Hocko
    Debugged-by: Vlastimil Babka
    Reported-by: Miroslav Benes
    Acked-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Heiko has complained that his log is swamped by warnings from
    has_unmovable_pages

    [ 20.536664] page dumped because: has_unmovable_pages
    [ 20.536792] page:000003d081ff4080 count:1 mapcount:0 mapping:000000008ff88600 index:0x0 compound_mapcount: 0
    [ 20.536794] flags: 0x3fffe0000010200(slab|head)
    [ 20.536795] raw: 03fffe0000010200 0000000000000100 0000000000000200 000000008ff88600
    [ 20.536796] raw: 0000000000000000 0020004100000000 ffffffff00000001 0000000000000000
    [ 20.536797] page dumped because: has_unmovable_pages
    [ 20.536814] page:000003d0823b0000 count:1 mapcount:0 mapping:0000000000000000 index:0x0
    [ 20.536815] flags: 0x7fffe0000000000()
    [ 20.536817] raw: 07fffe0000000000 0000000000000100 0000000000000200 0000000000000000
    [ 20.536818] raw: 0000000000000000 0000000000000000 ffffffff00000001 0000000000000000

    which are triggered not by memory hotplug but rather by the CMA
    allocator. The original idea behind dumping the page state for all call
    paths was that these messages would be helpful for debugging failures.
    From the above it seems that this is not the case for the CMA path,
    because we are lacking much more context. E.g. the second reported page
    might be a CMA-allocated page. It is still interesting to see a slab
    page in the CMA area, but it is hard to tell whether this is a bug from
    the above output alone.

    Address this issue by dumping the page state only on request. Both
    start_isolate_page_range and has_unmovable_pages already have an argument
    to ignore hwpoison pages, so make this argument more generic, turn it
    into flags, and allow callers to combine non-default modes into a mask.
    While we are at it, it is questionable for the has_unmovable_pages call
    from is_pageblock_removable_nolock (the sysfs removable file) to report
    the failure, so drop the reporting from there as well.
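    The reworked interface then looks roughly like this (flag names as in
    include/linux/page-isolation.h of that era; treat as approximate):

        #define SKIP_HWPOISON   0x1     /* was: bool skip_hwpoisoned_pages */
        #define REPORT_FAILURE  0x2     /* dump the state of the offending page */

        bool has_unmovable_pages(struct zone *zone, struct page *page,
                                 int count, int migratetype, int flags);
        int start_isolate_page_range(unsigned long start_pfn,
                                     unsigned long end_pfn,
                                     unsigned migratetype, int flags);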

    Link: http://lkml.kernel.org/r/20181218092802.31429-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Heiko Carstens
    Reviewed-by: Oscar Salvador
    Cc: Anshuman Khandual
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • There is only very limited information printed when the memory offlining
    fails:

    [ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff

    This tells us that the failure was triggered by userspace intervention,
    but it doesn't tell us much more about the underlying reason. It might
    be that the page migration fails repeatedly and the userspace timeout
    expires and sends a signal, or it might be that some of the earlier steps
    (isolation, memory notifier) take too long.

    If the migration fails then it would be really helpful to see which page
    that was and what its state is. The same applies to the isolation phase.
    If we fail to isolate a page from the allocator then knowing the state of
    the page would be helpful as well.

    Dump the state of any page that fails to get isolated or migrated. This
    will tell us more about the failure and what to focus on during
    debugging.
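    For example, the failure paths can report the offending page along the
    lines of the following sketch (dump_page() is the existing mm helper; the
    exact messages and call sites are illustrative):

        /* isolation path in do_migrate_range() */
        ret = isolate_lru_page(page);
        if (ret) {
                pr_warn("failed to isolate pfn %lx\n", pfn);
                dump_page(page, "isolation failed");
        }

        /* migration path */
        ret = migrate_pages(&source, new_node_page, NULL, 0,
                            MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
        if (ret) {
                list_for_each_entry(page, &source, lru) {
                        pr_warn("migrating pfn %lx failed ret:%d\n",
                                page_to_pfn(page), ret);
                        dump_page(page, "migration failure");
                }
                putback_movable_pages(&source);
        }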

    [akpm@linux-foundation.org: add missing printk arg]
    [mhocko@suse.com: tweak dump_page() `reason' text]
    Link: http://lkml.kernel.org/r/20181116083020.20260-6-mhocko@kernel.org
    Link: http://lkml.kernel.org/r/20181107101830.17405-6-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Reviewed-by: Anshuman Khandual
    Cc: Baoquan He
    Cc: Oscar Salvador
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko