13 Jan, 2019

1 commit

  • commit b15c87263a69272423771118c653e9a1d0672caa upstream.

    We have received a bug report that an injected MCE about faulty memory
    prevents memory offline from succeeding on a 4.4 based kernel. The
    underlying reason was that the HWPoison page has an elevated reference
    count and the migration keeps failing. There are two problems with that.
    First of all it is dubious to migrate the poisoned page because we know
    that accessing that memory can fail. Secondly it doesn't make any sense
    to migrate potentially broken content and carry the memory corruption
    over to a new location.

    Oscar has found out that 4.4 and the current upstream kernels behave
    slightly differently with his simple testcase:

    ===

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    /* userspace stand-in for the kernel's PAGE_ALIGN (round up to 4kB) */
    #define PAGE_SIZE 4096UL
    #define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

    int main(void)
    {
            int ret;
            int i;
            int fd;
            char *array = malloc(4096);
            char *array_locked = malloc(4096);

            fd = open("/tmp/data", O_RDONLY);
            read(fd, array, 4095);

            for (i = 0; i < 4096; i++)
                    array_locked[i] = 'd';

            ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked),
                        sizeof(array_locked));
            if (ret)
                    perror("mlock");

            sleep(20);

            /* poison the page backing array_locked */
            ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked),
                          4096, MADV_HWPOISON);
            if (ret)
                    perror("madvise");

            /* touch the poisoned page again */
            for (i = 0; i < 4096; i++)
                    array_locked[i] = 'd';

            return 0;
    }
    ===

    ... and then offline the memory backing the poisoned page.

    In 4.4 kernels he saw the hwpoisoned page being returned back to the LRU
    list:
    kernel: [] dump_trace+0x59/0x340
    kernel: [] show_stack_log_lvl+0xea/0x170
    kernel: [] show_stack+0x21/0x40
    kernel: [] dump_stack+0x5c/0x7c
    kernel: [] warn_slowpath_common+0x81/0xb0
    kernel: [] __pagevec_lru_add_fn+0x14c/0x160
    kernel: [] pagevec_lru_move_fn+0xad/0x100
    kernel: [] __lru_cache_add+0x6c/0xb0
    kernel: [] add_to_page_cache_lru+0x46/0x70
    kernel: [] extent_readpages+0xc3/0x1a0 [btrfs]
    kernel: [] __do_page_cache_readahead+0x177/0x200
    kernel: [] ondemand_readahead+0x168/0x2a0
    kernel: [] generic_file_read_iter+0x41f/0x660
    kernel: [] __vfs_read+0xcd/0x140
    kernel: [] vfs_read+0x7a/0x120
    kernel: [] kernel_read+0x3b/0x50
    kernel: [] do_execveat_common.isra.29+0x490/0x6f0
    kernel: [] do_execve+0x28/0x30
    kernel: [] call_usermodehelper_exec_async+0xfb/0x130
    kernel: [] ret_from_fork+0x55/0x80

    And the latter confuses the hotremove path because an LRU page is
    attempted to be migrated and that fails due to an elevated reference
    count. It is quite possible that the reuse of the HWPoisoned page is the
    result of some race condition which has been fixed since, but I am not
    really sure about that.

    With the upstream kernel the failure is slightly different. The page
    doesn't seem to have the LRU bit set, but isolate_movable_page simply
    fails and do_migrate_range puts all the isolated pages back to the LRU,
    therefore no progress is made and scan_movable_pages finds the same set
    of pages over and over again.

    Fix both cases by explicitly checking for HWPoisoned pages before we even
    try to get a reference on the page, and try to unmap it if it is still
    mapped. As explained by Naoya:

    : Hwpoison code never unmapped those for no big reason because
    : Ksm pages never dominate memory, so we simply didn't have strong
    : motivation to save the pages.

    Also add a WARN_ON(PageLRU) in case there is a race and we can hit LRU
    HWPoison pages, which shouldn't happen but I couldn't convince myself
    about that. Naoya has noted the following:

    : Theoretically no such guarantee, because try_to_unmap() doesn't have a
    : guarantee of success and then memory_failure() returns immediately
    : when hwpoison_user_mappings fails.
    : Or the following code (comes after hwpoison_user_mappings block) also implies
    : that the target page can still have PageLRU flag.
    :
    : /*
    :  * Torn down by someone else?
    :  */
    : if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
    :         action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
    :         res = -EBUSY;
    :         goto out;
    : }
    :
    : So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
    : the current version of your patch.
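
    The check described above boils down to something like the following in
    the hotremove migration loop (a sketch based on this description; the
    exact flags and placement in the upstream patch may differ in detail):

        if (PageHWPoison(page)) {
                /* LRU state is unexpected here, but tolerated; see above */
                if (WARN_ON(PageLRU(page)))
                        isolate_lru_page(page);
                /* do not preserve the corrupted content, just unmap it */
                if (page_mapped(page))
                        try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
                /* and skip migrating the poisoned page altogether */
                continue;
        }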

    Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Debugged-by: Oscar Salvador
    Tested-by: Oscar Salvador
    Acked-by: David Hildenbrand
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

04 Oct, 2017

3 commits

    find_{smallest|biggest}_section_pfn() find the smallest/biggest section
    in a pfn range and return the pfn of that section. But the functions are
    defined as int, so they can only return values in the range 0x00000000 -
    0xffffffff. This means that if the memory address is over 16TB, the
    functions do not work correctly.

    To handle 64 bit values, the patch defines
    find_{smallest|biggest}_section_pfn() as unsigned long.
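
    A minimal userspace illustration of the truncation (not kernel code; the
    pfn value and the 4kB page size are just assumptions for the demo):

        #include <stdio.h>

        /* returning a 64 bit pfn through an int chops it to 32 bits */
        static int narrow_pfn(unsigned long pfn)
        {
                return pfn;
        }

        int main(void)
        {
                /* pfn of the first byte at 16TB with 4kB pages: 2^44 >> 12 = 2^32 */
                unsigned long pfn = 1UL << 32;

                printf("real pfn:       %#lx\n", pfn);
                printf("through an int: %#lx\n", (unsigned long)narrow_pfn(pfn));
                return 0;
        }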

    Fixes: 815121d2b5cd ("memory_hotplug: clear zone when removing the memory")
    Link: http://lkml.kernel.org/r/d9d5593a-d0a4-c4be-ab08-493df59a85c6@gmail.com
    Signed-off-by: Yasuaki Ishimatsu
    Acked-by: Michal Hocko
    Cc: Xishi Qiu
    Cc: Reza Arbab
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    YASUAKI ISHIMATSU
     
    pfn_to_section_nr() and section_nr_to_pfn() are defined as macros.
    pfn_to_section_nr() has no issue even if it is defined as a macro. But
    section_nr_to_pfn() has an overflow issue if sec is passed as an int.

    section_nr_to_pfn() just shifts sec by PFN_SECTION_SHIFT. If sec is an
    unsigned long, section_nr_to_pfn() returns the pfn as a 64 bit value.
    But if sec is an int, section_nr_to_pfn() returns the pfn as a 32 bit
    value.

    __remove_section() calculates start_pfn using section_nr_to_pfn() with
    scn_nr defined as an int. So if the hot-removed memory address is over
    16TB, the overflow occurs and section_nr_to_pfn() does not calculate the
    correct pfn.

    To make callers pass a properly typed argument, the patch changes the
    macros to inline functions.
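
    Roughly, the conversion looks like this (PFN_SECTION_SHIFT is 15 on
    x86_64 with 128MB sections and 4kB pages; treat this as a sketch of the
    change rather than the exact upstream diff):

        /* before: the shift is evaluated in the caller's (possibly int) type */
        #define section_nr_to_pfn(sec) ((sec) << PFN_SECTION_SHIFT)

        /* after: the argument is always widened to unsigned long */
        static inline unsigned long section_nr_to_pfn(unsigned long sec)
        {
                return sec << PFN_SECTION_SHIFT;
        }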

    Fixes: 815121d2b5cd ("memory_hotplug: clear zone when removing the memory")
    Link: http://lkml.kernel.org/r/e643a387-e573-6bbf-d418-c60c8ee3d15e@gmail.com
    Signed-off-by: Yasuaki Ishimatsu
    Acked-by: Michal Hocko
    Cc: Xishi Qiu
    Cc: Reza Arbab
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    YASUAKI ISHIMATSU
     
  • Patch series "mm, memory_hotplug: fix few soft lockups in memory
    hotadd".

    Johannes has noticed a few soft lockups when adding a large nvdimm
    device. All of them were caused by long loops without any explicit
    cond_resched, which is a problem for !PREEMPT kernels.

    The fix is quite straightforward. Just make sure that cond_resched gets
    called from time to time.

    This patch (of 3):

    __add_pages gets a pfn range to add and there is no upper bound on the
    size of a single call. This is usually a memory block aligned size for
    regular memory hotplug - smaller sizes are usual for memory ballooning
    drivers, or the whole NUMA node for physical memory online. There is no
    explicit scheduling point in that code path though.

    This can lead to long latencies while __add_pages is executed and we
    have even seen a soft lockup report during nvdimm initialization with a
    !PREEMPT kernel:

    NMI watchdog: BUG: soft lockup - CPU#11 stuck for 23s! [kworker/u641:3:832]
    [...]
    Workqueue: events_unbound async_run_entry_fn
    task: ffff881809270f40 ti: ffff881809274000 task.ti: ffff881809274000
    RIP: _raw_spin_unlock_irqrestore+0x11/0x20
    RSP: 0018:ffff881809277b10 EFLAGS: 00000286
    [...]
    Call Trace:
    sparse_add_one_section+0x13d/0x18e
    __add_pages+0x10a/0x1d0
    arch_add_memory+0x4a/0xc0
    devm_memremap_pages+0x29d/0x430
    pmem_attach_disk+0x2fd/0x3f0 [nd_pmem]
    nvdimm_bus_probe+0x64/0x110 [libnvdimm]
    driver_probe_device+0x1f7/0x420
    bus_for_each_drv+0x52/0x80
    __device_attach+0xb0/0x130
    bus_probe_device+0x87/0xa0
    device_add+0x3fc/0x5f0
    nd_async_device_register+0xe/0x40 [libnvdimm]
    async_run_entry_fn+0x43/0x150
    process_one_work+0x14e/0x410
    worker_thread+0x116/0x490
    kthread+0xc7/0xe0
    ret_from_fork+0x3f/0x70
    DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70

    Fix this by adding cond_resched() once per memory section in the given
    pfn range. Each section is a constant amount of work which itself is
    not too expensive, but many of them add up.
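
    In the __add_pages() loop over sections this amounts to something like
    the sketch below (the __add_section() signature and error handling are
    abbreviated from the kernel of that era and may differ in detail):

        for (i = start_sec; i <= end_sec; i++) {
                err = __add_section(nid, section_nr_to_pfn(i), want_memblock);
                if (err && err != -EEXIST)
                        break;
                err = 0;
                /* the fix: give the scheduler a chance once per section */
                cond_resched();
        }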

    Link: http://lkml.kernel.org/r/20170918121410.24466-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Johannes Thumshirn
    Tested-by: Johannes Thumshirn
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

09 Sep, 2017

2 commits

    HMM (heterogeneous memory management) needs struct pages to support
    migration from system main memory to device memory. The reasons for HMM
    and for migration to device memory are explained in the HMM core patch.

    This patch deals with device memory that is un-addressable (i.e. the CPU
    cannot access it). Hence we do not want those struct pages to be managed
    like regular memory. That is why we extend ZONE_DEVICE to support
    different types of memory.

    A persistent memory type is defined for the existing users of ZONE_DEVICE
    and a new device un-addressable type is added for the un-addressable
    memory. There is a clear separation between what is expected from each
    memory type, and existing users of ZONE_DEVICE are unaffected by the new
    requirements and the new use of the un-addressable type. All type
    specific code paths are protected with a test against the memory type.

    Because the memory is un-addressable we use a new special swap type for
    when a page is migrated to device memory (this reduces the maximum number
    of swap files).

    The two main additions beside the memory type to ZONE_DEVICE are two
    callbacks. The first one, page_free(), is called whenever the page
    refcount reaches 1 (which means the page is free, as a ZONE_DEVICE page
    never reaches a refcount of 0). This allows the device driver to manage
    its memory and the associated struct pages.

    The second callback, page_fault(), happens when there is a CPU access to
    an address that is backed by a device page (which is un-addressable by
    the CPU). This callback is responsible for migrating the page back to
    system main memory. The device driver cannot block migration back to
    system memory; HMM makes sure that such a page cannot be pinned into
    device memory.

    If the device is in some error condition and cannot migrate memory back,
    then a CPU page fault on device memory should end with SIGBUS.
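
    Conceptually, a driver providing un-addressable device memory hooks the
    two callbacks described above; the struct and callback signatures below
    are purely illustrative (not the actual HMM API of that kernel):

        /* hypothetical driver-side hooks mirroring the description above */
        struct example_devmem_ops {
                /* invoked when a ZONE_DEVICE page refcount drops to 1 (page is free) */
                void (*page_free)(struct page *page, void *driver_private);

                /* invoked on a CPU fault on un-addressable device memory; must
                 * migrate the page back to system memory or the fault ends
                 * with SIGBUS */
                int (*page_fault)(struct vm_area_struct *vma, unsigned long addr,
                                  struct page *device_page, void *driver_private);
        };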

    [arnd@arndb.de: fix warning]
    Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Arnd Bergmann
    Acked-by: Dan Williams
    Cc: Ross Zwisler
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • This patch enables thp migration for memory hotremove.

    Link: http://lkml.kernel.org/r/20170717193955.20207-11-zi.yan@sent.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Zi Yan
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

07 Sep, 2017

5 commits

  • zonelists_mutex was introduced by commit 4eaf3f64397c ("mem-hotplug: fix
    potential race while building zonelist for new populated zone") to
    protect zonelist building from races. This is no longer needed though
    because both memory online and offline are fully serialized. New users
    have grown since then.

    Notably setup_per_zone_wmarks wants to prevent races between memory
    hotplug, khugepaged setup and manual min_free_kbytes updates via sysctl
    (see cfd3da1e49bb ("mm: Serialize access to min_free_kbytes")). Let's
    add a private lock for that purpose. This will not prevent us from
    seeing a halfway-through memory hotplug operation but that shouldn't be
    a big deal because memory hotplug will update the watermarks explicitly,
    so we will eventually get the full picture. The lock just makes sure we
    won't race when updating the watermarks, which could lead to weird
    results.

    Also __build_all_zonelists manipulates global data, so add a private
    lock for it as well. This doesn't seem to be necessary today but it is
    more robust to have a lock there.

    While we are at it make sure we document that memory online/offline
    depends on a full serialization either via mem_hotplug_begin() or
    device_lock.
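
    The "private lock" idea for the watermark update boils down to something
    like this sketch (the helper split into __setup_per_zone_wmarks() and the
    lock type are illustrative rather than the exact upstream diff):

        /* serialize watermark updates with a lock private to this path */
        static DEFINE_SPINLOCK(wmark_lock);

        void setup_per_zone_wmarks(void)
        {
                spin_lock(&wmark_lock);
                __setup_per_zone_wmarks();      /* does the actual per-zone update */
                spin_unlock(&wmark_lock);
        }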

    Link: http://lkml.kernel.org/r/20170721143915.14161-9-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Haicheng Li
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    try_online_node calls hotadd_new_pgdat which already calls
    build_all_zonelists, so the additional call is redundant. Even though
    hotadd_new_pgdat will only initialize the zonelists of the new node,
    this is the right thing to do because such a node doesn't have any
    memory, so other zonelists would ignore all the zones from this node
    anyway.

    Link: http://lkml.kernel.org/r/20170721143915.14161-6-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Toshi Kani
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • build_all_zonelists gets a zone parameter to initialize zone's pagesets.
    There is only a single user which gives a non-NULL zone parameter and
    that one doesn't really need the rest of the build_all_zonelists (see
    commit 6dcd73d7011b ("memory-hotplug: allocate zone's pcp before
    onlining pages")).

    Therefore remove setup_zone_pageset from build_all_zonelists and call it
    from its only user directly. This also removes a pointless zonelists
    rebuild, which is always good.

    Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Toshi Kani
    Cc: Wen Congyang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Historically we have enforced that any kernel zone (e.g ZONE_NORMAL) has
    to precede the Movable zone in the physical memory range. The purpose
    of the movable zone is, however, not bound to any physical memory
    restriction. It merely defines a class of migratable and reclaimable
    memory.

    There are users (e.g. CMA) who might want to reserve specific physical
    memory ranges for their own purpose. Moreover our pfn walkers have to
    be prepared for zones overlapping in the physical range already because
    we do support interleaving NUMA nodes and therefore zones can interleave
    as well. This means we can allow each memory block to be associated
    with a different zone.

    Loosen the current onlining semantic and allow an explicit onlining type
    on any memblock. That means that online_{kernel,movable} will be allowed
    regardless of the physical address of the memblock, as long as it is
    offline of course. This might result in the movable zone overlapping
    with other kernel zones. Default onlining then becomes a bit tricky but
    still sensible. echo online > memoryXY/state will online the given
    block to

    1) the default zone if the given range is outside of any zone
    2) the enclosing zone if such a zone doesn't interleave with
       any other zone
    3) the default zone if more zones interleave for this range

    where the default zone is the movable zone only if movable_node is
    enabled, otherwise it is a kernel zone.

    Here is an example of the semantics (movable_node is not present but it
    works in an analogous way). We start with the following memblocks, all
    of them offline:

    memory34/valid_zones:Normal Movable
    memory35/valid_zones:Normal Movable
    memory36/valid_zones:Normal Movable
    memory37/valid_zones:Normal Movable
    memory38/valid_zones:Normal Movable
    memory39/valid_zones:Normal Movable
    memory40/valid_zones:Normal Movable
    memory41/valid_zones:Normal Movable

    Now, we online block 34 in default mode and block 37 as movable

    root@test1:/sys/devices/system/node/node1# echo online > memory34/state
    root@test1:/sys/devices/system/node/node1# echo online_movable > memory37/state
    memory34/valid_zones:Normal
    memory35/valid_zones:Normal Movable
    memory36/valid_zones:Normal Movable
    memory37/valid_zones:Movable
    memory38/valid_zones:Normal Movable
    memory39/valid_zones:Normal Movable
    memory40/valid_zones:Normal Movable
    memory41/valid_zones:Normal Movable

    As we can see, all other blocks can still be onlined into both the
    Normal and Movable zones, and Normal is the default because the Movable
    zone spans only block 37 now.

    root@test1:/sys/devices/system/node/node1# echo online_movable > memory41/state
    memory34/valid_zones:Normal
    memory35/valid_zones:Normal Movable
    memory36/valid_zones:Normal Movable
    memory37/valid_zones:Movable
    memory38/valid_zones:Movable Normal
    memory39/valid_zones:Movable Normal
    memory40/valid_zones:Movable Normal
    memory41/valid_zones:Movable

    Now the default zone for blocks 37-41 has changed because movable zone
    spans that range.

    root@test1:/sys/devices/system/node/node1# echo online_kernel > memory39/state
    memory34/valid_zones:Normal
    memory35/valid_zones:Normal Movable
    memory36/valid_zones:Normal Movable
    memory37/valid_zones:Movable
    memory38/valid_zones:Normal Movable
    memory39/valid_zones:Normal
    memory40/valid_zones:Movable Normal
    memory41/valid_zones:Movable

    Note that the block 39 now belongs to the zone Normal and so block38
    falls into Normal by default as well.

    For completeness:

    root@test1:/sys/devices/system/node/node1# for i in memory[34]?
    do
    echo online > $i/state 2>/dev/null
    done

    memory34/valid_zones:Normal
    memory35/valid_zones:Normal
    memory36/valid_zones:Normal
    memory37/valid_zones:Movable
    memory38/valid_zones:Normal
    memory39/valid_zones:Normal
    memory40/valid_zones:Movable
    memory41/valid_zones:Movable

    Implementation-wise the change is quite straightforward. We can get rid
    of allow_online_pfn_range altogether. online_pages allows only offline
    nodes already. The original default_zone_for_pfn becomes
    default_kernel_zone_for_pfn. The new default_zone_for_pfn implements the
    above semantics. zone_for_pfn_range is slightly reorganized to implement
    the kernel and movable online types explicitly, and MMOP_ONLINE_KEEP
    becomes a catch-all default behavior.
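
    A sketch of the new default_zone_for_pfn decision (the helper names
    follow the ones mentioned above; details of the upstream implementation
    may differ):

        static struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
                        unsigned long nr_pages)
        {
                struct zone *kernel_zone = default_kernel_zone_for_pfn(nid,
                                start_pfn, nr_pages);
                struct zone *movable_zone = &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
                bool in_kernel = zone_intersects(kernel_zone, start_pfn, nr_pages);
                bool in_movable = zone_intersects(movable_zone, start_pfn, nr_pages);

                /* exactly one zone encloses the range: reuse it (rule 2) */
                if (in_kernel ^ in_movable)
                        return in_kernel ? kernel_zone : movable_zone;

                /*
                 * No zone or several zones intersect the range (rules 1 and 3):
                 * fall back to the default zone, which is movable only when
                 * movable_node is enabled.
                 */
                return movable_node_enabled ? movable_zone : kernel_zone;
        }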

    Link: http://lkml.kernel.org/r/20170714121233.16861-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Acked-by: Reza Arbab
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Yasuaki Ishimatsu
    Cc: Xishi Qiu
    Cc: Kani Toshimitsu
    Cc: Daniel Kiper
    Cc: Igor Mammedov
    Cc: Vitaly Kuznetsov
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Prior to commit f1dd2cd13c4b ("mm, memory_hotplug: do not associate
    hotadded memory to zones until online") we used to allow to change the
    valid zone types of a memory block if it is adjacent to a different zone
    type.

    This fact was reflected in memoryNN/valid_zones by the ordering of
    printed zones. The first one was default (echo online > memoryNN/state)
    and the other one could be onlined explicitly by online_{movable,kernel}.

    This behavior was removed by the said patch and as such the ordering was
    not all that important. In most cases a kernel zone would be default
    anyway. The only exception is movable_node handled by "mm,
    memory_hotplug: support movable_node for hotpluggable nodes".

    Let's reintroduce this behavior again because a later patch will remove
    the zone overlap restriction and so users will be allowed to online a
    kernel resp. movable block regardless of its placement. The original
    behavior will then become significant again because it would be
    non-trivial for users to see what the default zone to online into is.

    The implementation is really simple. Pull the zone selection out of
    move_pfn_range into a zone_for_pfn_range helper and use it in
    show_valid_zones to display the zone for default onlining, and then both
    kernel and movable if they are allowed. The default online zone is not
    duplicated.

    Link: http://lkml.kernel.org/r/20170714121233.16861-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Reza Arbab
    Cc: Yasuaki Ishimatsu
    Cc: Xishi Qiu
    Cc: Kani Toshimitsu
    Cc: Daniel Kiper
    Cc: Igor Mammedov
    Cc: Vitaly Kuznetsov
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

11 Jul, 2017

7 commits

  • Andrey reported a potential deadlock with the memory hotplug lock and
    the cpu hotplug lock.

    The reason is that memory hotplug takes the memory hotplug lock and then
    calls stop_machine() which calls get_online_cpus(). That's the reverse
    lock order to get_online_cpus(); get_online_mems(); in mm/slab_common.c.

    The problem has been there forever. The reason why this was never
    reported is that the cpu hotplug locking had this homebrewed recursive
    reader writer semaphore construct which, due to the recursion, evaded
    full lockdep coverage. The memory hotplug code copied that construct
    verbatim and therefore has similar issues.

    Three steps to fix this:

    1) Convert the memory hotplug locking to a per cpu rwsem so the
    potential issues get reported properly by lockdep.

    2) Lock the online cpus in mem_hotplug_begin() before taking the memory
    hotplug rwsem and use stop_machine_cpuslocked() in the page_alloc
    code to avoid recursive locking.

    3) The cpu hotplug locking in #2 causes a recursive locking of the cpu
    hotplug lock via __offline_pages() -> lru_add_drain_all(). Solve this
    by invoking lru_add_drain_all_cpuslocked() instead.
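
    Steps 1 and 2 amount to roughly the following in the memory hotplug code
    (a sketch, not the exact upstream diff):

        DEFINE_STATIC_PERCPU_RWSEM(mem_hotplug_lock);

        void mem_hotplug_begin(void)
        {
                cpus_read_lock();                       /* cpu hotplug lock first ... */
                percpu_down_write(&mem_hotplug_lock);   /* ... then the mem hotplug rwsem */
        }

        void mem_hotplug_done(void)
        {
                percpu_up_write(&mem_hotplug_lock);
                cpus_read_unlock();
        }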

    Link: http://lkml.kernel.org/r/20170704093421.506836322@linutronix.de
    Reported-by: Andrey Ryabinin
    Signed-off-by: Thomas Gleixner
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Vladimir Davydov
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
    __remove_zone() sets up zone_type, but never uses it for anything.
    This does not cause a warning, due to the (necessary) use of
    -Wno-unused-but-set-variable. However, it's noise, so just delete it.

    Link: http://lkml.kernel.org/r/20170624043421.24465-2-jhubbard@nvidia.com
    Signed-off-by: John Hubbard
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • Commit 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest
    neighbor node when mem-offline") has duplicated a large part of
    alloc_migrate_target with some hotplug specific special casing.

    To be more precise it tried to enforce the allocation from a different
    node than the original page. As a result the two functions diverged in
    their shared logic, e.g. the hugetlb allocation strategy.

    Let's unify the two and express different NUMA requirements by the given
    nodemask. new_node_page will simply exclude the node it doesn't care
    about and alloc_migrate_target will use all the available nodes.
    alloc_migrate_target will then learn to migrate hugetlb pages more
    sanely and use preallocated pool when possible.

    Please note that alloc_migrate_target used to call alloc_page resp.
    alloc_pages_current and thus used the memory policy of the current
    context, which is quite strange when we consider that it is used in the
    context of alloc_contig_range, which just tries to migrate pages which
    stand in the way.

    Link: http://lkml.kernel.org/r/20170608074553.22152-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Naoya Horiguchi
    Cc: Xishi Qiu
    Cc: zhong jiang
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    new_node_page will try to use the origin's next NUMA node as the
    migration destination for hugetlb pages. If such a node doesn't have
    any preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol
    to allocate a surplus page instead. This is quite suboptimal for any
    configuration where hugetlb pages are not distributed to all NUMA nodes
    evenly. Say we have a hotplugable node 4 and the spare hugetlb pages are
    on node 0:

    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000
    /sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0

    Now we consume the whole pool on node 4 and try to offline this node.
    All the allocated pages should be moved to node0 which has enough
    preallocated pages to hold them. With the current implementation
    offlining very likely fails because hugetlb allocations during runtime
    are much less reliable.

    Fix this by reusing the nodemask which excludes the migration source,
    first trying to find a node which has a page in the preallocated pool,
    and falling back to __alloc_buddy_huge_page_no_mpol only when the whole
    pool is consumed.

    [akpm@linux-foundation.org: remove bogus arg from alloc_huge_page_nodemask() stub]
    Link: http://lkml.kernel.org/r/20170608074553.22152-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Naoya Horiguchi
    Cc: Xishi Qiu
    Cc: zhong jiang
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    new_node_page tries to allocate the target page on a different NUMA node
    than the source page. This makes sense in most cases during hotplug
    because we are likely to offline the whole numa node. But there are
    cases where there are no other nodes to fall back to (e.g. when
    offlining parts of the only existing node) and we have to fall back to
    allocating from the source node. The current code does that but it can
    be simplified by checking the nmask and updating it before we even try
    to allocate, rather than special casing it.

    This patch shouldn't introduce any functional change.
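
    The nodemask handling described above boils down to roughly this
    fragment inside new_node_page() (the actual allocation call is elided;
    treat it as a sketch rather than the exact upstream code):

        int nid = page_to_nid(page);
        nodemask_t nmask = node_states[N_MEMORY];

        /* prefer any node with memory other than the source node ... */
        node_clear(nid, nmask);
        /* ... unless the source node is the only one left */
        if (nodes_empty(nmask))
                node_set(nid, nmask);

        /* the allocation below then simply honours nmask */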

    Link: http://lkml.kernel.org/r/20170608074553.22152-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Naoya Horiguchi
    Cc: Xishi Qiu
    Cc: zhong jiang
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    The movable_node kernel parameter allows hotpluggable NUMA nodes to put
    all of their hotplugable memory into the movable zone, which allows more
    or less reliable memory hotremove. At least this is the case for the
    NUMA nodes present during boot (see find_zone_movable_pfns_for_nodes).

    This is not the case for the memory hotplug, though.

    echo online > /sys/devices/system/memory/memoryXYZ/state

    will default to a kernel zone (usually ZONE_NORMAL) unless the
    particular memblock is already in the movable zone range which is not
    the case normally when onlining the memory from the udev rule context
    for a freshly hotadded NUMA node. The only option currently is to have
    a special udev rule to echo online_movable to all memblocks belonging to
    such a node which is rather clumsy. Not to mention this is inconsistent
    as well because what ended up in the movable zone during the boot will
    end up in a kernel zone after hotremove & hotadd without special care.

    It would be nice to reuse memblock_is_hotpluggable but the runtime
    hotplug doesn't have that information available because the boot and
    hotplug paths are not shared and it would be really non trivial to make
    them use the same code path because the runtime hotplug doesn't play
    with the memblock allocator at all.

    Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
    movable_node is enabled and the range doesn't overlap with the existing
    normal zone. This should provide a reasonable default onlining
    strategy.

    Strictly speaking the semantic is not identical with the boot time
    initialization because find_zone_movable_pfns_for_nodes covers only the
    hotplugable range as described by the BIOS/FW. From my experience this
    is usually a full node though (except for Node0 which is special and
    never goes away completely). If this turns out to be a problem in the
    real life we can tweak the code to store hotplug flag into memblocks but
    let's keep this simple now.

    Link: http://lkml.kernel.org/r/20170612111227.GI7476@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Reza Arbab
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Yasuaki Ishimatsu
    Cc: Kani Toshimitsu
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Cc: Daniel Kiper
    Cc: Igor Mammedov
    Cc: Vitaly Kuznetsov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The NULL check at line 1226: if (!pgdat), implies that pointer pgdat
    might be NULL.

    rollback_node_hotadd() dereferences this pointer. Add a NULL check to
    avoid a potential NULL pointer dereference.

    Addresses-Coverity-ID: 1369133
    Link: http://lkml.kernel.org/r/20170530212436.GA6195@embeddedgus
    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gustavo A. R. Silva
     

07 Jul, 2017

15 commits

    movable_node_is_enabled is defined in memblock proper while it is
    initialized from memory hotplug proper. This is quite messy and it
    creates a dependency between the two, so move movable_node along with
    the helper functions to memory_hotplug.

    To make it more entertaining the kernel parameter is ignored unless
    CONFIG_HAVE_MEMBLOCK_NODE_MAP=y because we do not have the node
    information for each memblock otherwise. So let's warn when the option
    is disabled.

    Link: http://lkml.kernel.org/r/20170529114141.536-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Reza Arbab
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Jerome Glisse
    Cc: Yasuaki Ishimatsu
    Cc: Xishi Qiu
    Cc: Kani Toshimitsu
    Cc: Chen Yucong
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Cc: Daniel Kiper
    Cc: Igor Mammedov
    Cc: Vitaly Kuznetsov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    Commit 20b2f52b73fe ("numa: add CONFIG_MOVABLE_NODE for
    movable-dedicated node") has introduced CONFIG_MOVABLE_NODE without a
    good explanation of why it is actually useful.

    It makes a lot of sense to make movable node semantic opt in but we
    already have that because the feature has to be explicitly enabled on
    the kernel command line. A config option on top only makes the
    configuration space larger without a good reason. It also adds an
    additional ifdefery that pollutes the code.

    Just drop the config option and make it de-facto always enabled. This
    shouldn't introduce any change to the semantic.

    Link: http://lkml.kernel.org/r/20170529114141.536-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Reza Arbab
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Jerome Glisse
    Cc: Yasuaki Ishimatsu
    Cc: Xishi Qiu
    Cc: Kani Toshimitsu
    Cc: Chen Yucong
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Cc: Daniel Kiper
    Cc: Igor Mammedov
    Cc: Vitaly Kuznetsov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "remove CONFIG_MOVABLE_NODE".

    I am continuing to clean up the memory hotplug code and
    CONFIG_MOVABLE_NODE seems dubious at best. The following two patches
    simply remove the flag and make it de-facto always enabled.

    The current semantic of the config option is twofold: 1) it automatically
    binds hotplugable nodes to have memory in zone_movable by default when
    movable_node is enabled, and 2) it forbids memory hotplug from onlining
    all the memory as movable when !CONFIG_MOVABLE_NODE.

    The latter restriction is quite dubious because there is no clear cut of
    how much normal memory we need for reasonable system operation. A
    single memory block which is sufficient to allow further movable onlines
    is far from sufficient (e.g. a node with >2GB and 128MB memblocks will
    fill up this zone with struct pages leaving nothing for other
    allocations). Removing the config option not only reduces the
    configuration space, it also removes quite some code.

    The semantic of the movable_node command line parameter is preserved.

    The first patch removes the restriction mentioned above and the second
    one simply removes all the CONFIG_MOVABLE_NODE related stuff. The last
    patch moves movable_node flag handling to memory_hotplug proper where it
    belongs.

    [1] http://lkml.kernel.org/r/20170524122411.25212-1-mhocko@kernel.org

    This patch (of 3):

    Commit 74d42d8fe146 ("memory_hotplug: ensure every online node has
    NORMAL memory") has introduced a restriction that every numa node has to
    have at least some memory in !movable zones before a first movable
    memory can be onlined if !CONFIG_MOVABLE_NODE.

    Likewise can_offline_normal checks the amount of normal memory in
    !movable zones and disallows offlining memory if there is no normal
    memory left, with the justification that "memory-management acts bad
    when we have nodes which is online but don't have any normal memory".

    While it is true that not having _any_ memory for kernel allocations on
    a NUMA node is far from great, and such a node would be quite suboptimal
    because all kernel allocations would have to fall back to another NUMA
    node, there is no reason to disallow such a configuration in principle.

    Besides that, there is not really a big difference between having one
    memblock for ZONE_NORMAL available or none. With 128MB sized memblocks
    the system might thrash on kernel allocation requests anyway. It is
    really hard to draw a line on how much normal memory is really
    sufficient, so we have to rely on the administrator to configure the
    system sanely. Therefore drop the artificial restriction and remove
    can_offline_normal and can_online_high_movable altogether.

    Link: http://lkml.kernel.org/r/20170529114141.536-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Reza Arbab
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Jerome Glisse
    Cc: Yasuaki Ishimatsu
    Cc: Xishi Qiu
    Cc: Kani Toshimitsu
    Cc: Chen Yucong
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Cc: Daniel Kiper
    Cc: Igor Mammedov
    Cc: Vitaly Kuznetsov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The main allocator function __alloc_pages_nodemask() takes a zonelist
    pointer as one of its parameters. All of its callers directly or
    indirectly obtain the zonelist via node_zonelist() using a preferred
    node id and gfp_mask. We can make the code a bit simpler by doing the
    zonelist lookup in __alloc_pages_nodemask(), passing it a preferred node
    id instead (gfp_mask is already another parameter).

    There are some code size benefits thanks to removal of inlined
    node_zonelist():

    bloat-o-meter add/remove: 2/2 grow/shrink: 4/36 up/down: 399/-1351 (-952)

    This will also make things simpler if we proceed with converting cpusets
    to zonelists.
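
    Roughly, the entry point changes from taking a zonelist to taking a
    preferred node id and doing the lookup itself (a sketch of the before
    and after prototypes; details may differ):

        /* before: every caller had to do node_zonelist(nid, gfp_mask) itself */
        struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                        struct zonelist *zonelist, nodemask_t *nodemask);

        /* after: pass the preferred node id, the zonelist lookup happens inside */
        struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                        int preferred_nid, nodemask_t *nodemask);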

    Link: http://lkml.kernel.org/r/20170517081140.30654-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Christoph Lameter
    Acked-by: Michal Hocko
    Cc: Dimitri Sivanich
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Anshuman Khandual
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • zone_for_memory doesn't have any user anymore as well as the whole zone
    shifting infrastructure so drop them all.

    This shouldn't introduce any functional changes.

    Link: http://lkml.kernel.org/r/20170515085827.16474-15-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Dan Williams
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    Tobias has reported the following section mismatches introduced by "mm,
    memory_hotplug: do not associate hotadded memory to zones until online".

    WARNING: mm/built-in.o(.text+0x5a1c2): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:memmap_init_zone()
    The function move_pfn_range_to_zone() references
    the function __meminit memmap_init_zone().
    This is often because move_pfn_range_to_zone lacks a __meminit
    annotation or the annotation of memmap_init_zone is wrong.

    WARNING: mm/built-in.o(.text+0x5a25b): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:init_currently_empty_zone()
    The function move_pfn_range_to_zone() references
    the function __meminit init_currently_empty_zone().
    This is often because move_pfn_range_to_zone lacks a __meminit
    annotation or the annotation of init_currently_empty_zone is wrong.

    WARNING: vmlinux.o(.text+0x188aa2): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:memmap_init_zone()
    The function move_pfn_range_to_zone() references
    the function __meminit memmap_init_zone().
    This is often because move_pfn_range_to_zone lacks a __meminit
    annotation or the annotation of memmap_init_zone is wrong.

    WARNING: vmlinux.o(.text+0x188b3b): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:init_currently_empty_zone()
    The function move_pfn_range_to_zone() references
    the function __meminit init_currently_empty_zone().
    This is often because move_pfn_range_to_zone lacks a __meminit
    annotation or the annotation of init_currently_empty_zone is wrong.

    Both memmap_init_zone and init_currently_empty_zone are marked __meminit
    but move_pfn_range_to_zone is used outside of __meminit sections (e.g.
    devm_memremap_pages) so we have to hide it from the checker with a __ref
    annotation.
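
    The fix is essentially just adding the marker to the definition, along
    the lines of the sketch below (argument list abbreviated to the one used
    at the time):

        /* __ref: the references into .meminit.text are intentional */
        void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
                        unsigned long nr_pages)
        {
                /* body unchanged */
        }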

    Link: http://lkml.kernel.org/r/20170515085827.16474-14-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tobias Regnery
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Dan Williams
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Vlastimil Babka
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    arch_add_memory gets a for_device argument which then controls whether
    we want to create memblocks for the created memory sections. Simplify
    the logic by saying whether we want memblocks directly rather than going
    through a pointless negation. This also makes the api easier to
    understand because it is clear what we want, rather than a
    nothing-telling for_device which can mean anything.

    This shouldn't introduce any functional change.

    Link: http://lkml.kernel.org/r/20170515085827.16474-13-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Tested-by: Dan Williams
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Heiko Carstens has noticed that he can generate overlapping zones for
    ZONE_DMA and ZONE_NORMAL:

    DMA [mem 0x0000000000000000-0x000000007fffffff]
    Normal [mem 0x0000000080000000-0x000000017fffffff]

    $ cat /sys/devices/system/memory/block_size_bytes
    10000000
    $ cat /sys/devices/system/memory/memory5/valid_zones
    DMA
    $ echo 0 > /sys/devices/system/memory/memory5/online
    $ cat /sys/devices/system/memory/memory5/valid_zones
    Normal
    $ echo 1 > /sys/devices/system/memory/memory5/online
    Normal

    $ cat /proc/zoneinfo
    Node 0, zone DMA
    spanned 524288
    Reported-by: Heiko Carstens
    Tested-by: Heiko Carstens
    Acked-by: Vlastimil Babka
    Cc: Dan Williams
    Cc: Reza Arbab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Heiko Carstens has noticed that the MMOP_ONLINE_KEEP is broken currently

    $ grep . memory3?/valid_zones
    memory34/valid_zones:Normal Movable
    memory35/valid_zones:Normal Movable
    memory36/valid_zones:Normal Movable
    memory37/valid_zones:Normal Movable

    $ echo online_movable > memory34/state
    $ grep . memory3?/valid_zones
    memory34/valid_zones:Movable
    memory35/valid_zones:Movable
    memory36/valid_zones:Movable
    memory37/valid_zones:Movable

    $ echo online > memory36/state
    $ grep . memory3?/valid_zones
    memory34/valid_zones:Movable
    memory36/valid_zones:Normal
    memory37/valid_zones:Movable

    so we have effectively punched a hole into the movable zone.

    The problem is that the move_pfn_range() check for MMOP_ONLINE_KEEP is
    wrong. It only checks whether the given range is already part of the
    movable zone, which is not the case here as only memory34 is in the
    zone. Fix this by using allow_online_pfn_range(..., MMOP_ONLINE_KERNEL);
    if that is false then we can be sure that movable onlining is the right
    thing to do.

    Fixes: "mm, memory_hotplug: do not associate hotadded memory to zones until online"
    Link: http://lkml.kernel.org/r/20170601083746.4924-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Heiko Carstens
    Tested-by: Heiko Carstens
    Acked-by: Vlastimil Babka
    Cc: Dan Williams
    Cc: Reza Arbab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The current memory hotplug implementation relies on having all the
    struct pages associate with a zone/node during the physical hotplug
    phase (arch_add_memory->__add_pages->__add_section->__add_zone). In the
    vast majority of cases this means that they are added to ZONE_NORMAL.
    This has been so since 9d99aaa31f59 ("[PATCH] x86_64: Support memory
    hotadd without sparsemem") and it wasn't a big deal back then because
    movable onlining didn't exist yet.

    Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable
    onlining 511c2aba8f07 ("mm, memory-hotplug: dynamic configure movable
    memory and portion memory") and then things got more complicated.
    Rather than reconsidering the zone association which was no longer
    needed (because the memory hotplug already depended on SPARSEMEM) a
    convoluted semantic of zone shifting has been developed. Only the
    currently last memblock or the one adjacent to the zone_movable can be
    onlined movable. This essentially means that the online type changes as
    the new memblocks are added.

    Let's simulate memory hot online manually
    $ echo 0x100000000 > /sys/devices/system/memory/probe
    $ grep . /sys/devices/system/memory/memory32/valid_zones
    Normal Movable

    $ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
    $ grep . /sys/devices/system/memory/memory3?/valid_zones
    /sys/devices/system/memory/memory32/valid_zones:Normal
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable

    $ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
    $ grep . /sys/devices/system/memory/memory3?/valid_zones
    /sys/devices/system/memory/memory32/valid_zones:Normal
    /sys/devices/system/memory/memory33/valid_zones:Normal
    /sys/devices/system/memory/memory34/valid_zones:Normal Movable

    $ echo online_movable > /sys/devices/system/memory/memory34/state
    $ grep . /sys/devices/system/memory/memory3?/valid_zones
    /sys/devices/system/memory/memory32/valid_zones:Normal
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Movable Normal

    This is an awkward semantic because a udev event is sent as soon as the
    block is onlined and a udev handler might want to online it based on
    some policy (e.g. association with a node), but it will inherently race
    with new blocks showing up.

    This patch changes the physical online phase to not associate pages with
    any zone at all. All the pages are just marked reserved and wait for
    the onlining phase to be associated with the zone as per the online
    request. There are only two requirements

    - existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap

    - ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses

    the latter one is not an inherent requirement and can be changed in the
    future. It preserves the current behavior and made the code slightly
    simpler. This is subject to change in future.

    This means that the same physical online steps as above will lead to the
    following state: Normal Movable

    /sys/devices/system/memory/memory32/valid_zones:Normal Movable
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable

    /sys/devices/system/memory/memory32/valid_zones:Normal Movable
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Normal Movable

    /sys/devices/system/memory/memory32/valid_zones:Normal Movable
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Movable

    Implementation:
    The current move_pfn_range is reimplemented to check the above
    requirements (allow_online_pfn_range) and then updates the respective
    zone (move_pfn_range_to_zone), the pgdat and links all the pages in the
    pfn range with the zone/node. __add_pages is updated to not require the
    zone and only initializes sections in the range. This allowed to
    simplify the arch_add_memory code (s390 could get rid of quite some of
    code).

    devm_memremap_pages is the only user of arch_add_memory which relies on
    the zone association, because it hooks into the memory hotplug only
    half way. It uses it to associate the new memory with ZONE_DEVICE but
    doesn't allow it to be {on,off}lined via sysfs. This means that this
    particular code path has to call move_pfn_range_to_zone explicitly.

    The original zone shifting code is kept in place and will be removed in
    the follow up patch for an easier review.

    Please note that this patch also changes the original behavior, where
    offlining a memory block adjacent to another zone (Normal vs. Movable)
    used to allow changing its movable type. This will be handled later.

    [richard.weiyang@gmail.com: simplify zone_intersects()]
    Link: http://lkml.kernel.org/r/20170616092335.5177-1-richard.weiyang@gmail.com
    [richard.weiyang@gmail.com: remove duplicate call for set_page_links]
    Link: http://lkml.kernel.org/r/20170616092335.5177-2-richard.weiyang@gmail.com
    [akpm@linux-foundation.org: remove unused local `i']
    Link: http://lkml.kernel.org/r/20170515085827.16474-12-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Wei Yang
    Tested-by: Dan Williams
    Tested-by: Reza Arbab
    Acked-by: Heiko Carstens # For s390 bits
    Acked-by: Vlastimil Babka
    Cc: Martin Schwidefsky
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __pageblock_pfn_to_page has two users currently, set_zone_contiguous
    which checks whether the given zone contains holes and
    pageblock_pfn_to_page which then carefully returns a first valid page
    from the given pfn range for the given zone. This doesn't handle zones
    which are not fully populated though. Memory pageblocks can be offlined
    or might not have been onlined yet. In such a case the zone should be
    considered to have holes otherwise pfn walkers can touch and play with
    offline pages.

    Current callers of pageblock_pfn_to_page in compaction seem to work
    properly right now because they only isolate PageBuddy
    (isolate_freepages_block) or PageLRU resp. __PageMovable
    (isolate_migratepages_block) which will be always false for these pages.
    It would be safer to skip these pages altogether, though.

    In order to do this patch adds a new memory section state
    (SECTION_IS_ONLINE) which is set in memory_present (during boot time) or
    in online_pages_range during the memory hotplug. Similarly
    offline_mem_sections clears the bit and it is called when the memory
    range is offlined.

    A pfn_to_online_page helper is then added which checks the mem section
    and only returns a page if it is onlined already.

    Use the new helper in __pageblock_pfn_to_page and skip the whole page
    block in such a case.
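
    The helper's logic is roughly the following (the upstream version is a
    macro; rendering it as an inline function here is just for readability
    and the exact checks may differ slightly):

        /* return the struct page only when its memory section is online */
        static inline struct page *pfn_to_online_page(unsigned long pfn)
        {
                unsigned long nr = pfn_to_section_nr(pfn);

                if (nr < NR_MEM_SECTIONS && online_section_nr(nr))
                        return pfn_to_page(pfn);
                return NULL;
        }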

    [mhocko@suse.com: check valid section number in pfn_to_online_page (Vlastimil),
    mark sections online after all struct pages are initialized in
    online_pages_range (Vlastimil)]
    Link: http://lkml.kernel.org/r/20170518164210.GD18333@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170515085827.16474-8-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Dan Williams
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Memory hotplug (add_memory_resource) has to reinitialize node
    infrastructure if the node is offline (one which went through the
    complete add_memory(); remove_memory() cycle). That involves node
    registration to the kobj infrastructure (register_node), the proper
    association with cpus (register_cpu_under_node) and finally creation of
    nodememblock symlinks (link_mem_sections).

    The last part requires knowing node_start_pfn and node_spanned_pages,
    which we currently have, but a later patch will postpone this
    initialization to the onlining phase which happens later. In fact we do
    not need to rely on the early pgdat initialization even now because the
    hot added pfn range is already known.

    Split register_one_node into core which does all the common work for the
    boot time NUMA initialization and the hotplug (__register_one_node).
    register_one_node keeps the full initialization while hotplug calls
    __register_one_node and manually calls link_mem_sections for the proper
    range.

    This shouldn't introduce any functional change.

    Link: http://lkml.kernel.org/r/20170515085827.16474-6-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Dan Williams
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Device memory hotplug hooks into regular memory hotplug only half way.
    It needs memory sections to track struct pages but there is no
    need/desire to associate those sections with memory blocks and export
    them to the userspace via sysfs because they cannot be onlined anyway.

    This is currently expressed by for_device argument to arch_add_memory
    which then makes sure to associate the given memory range with
    ZONE_DEVICE. register_new_memory then relies on is_zone_device_section
    to distinguish special memory hotplug from the regular one. While this
    works now, later patches in this series want to move __add_zone outside
    of arch_add_memory path so we have to come up with something else.

    Pass want_memblock down the __add_pages path and use it to control
    whether the section->memblock association should be done.
    arch_add_memory then just trivially wants a memblock for everything but
    for_device hotplug.

    remove_memory_section doesn't need is_zone_device_section either. We
    can simply skip all the memblock specific cleanup if there is no
    memblock for the given section.

    This shouldn't introduce any functional change.

    Link: http://lkml.kernel.org/r/20170515085827.16474-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Tested-by: Dan Williams
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The primary purpose of this helper is to query the node state so use the
    node id directly. This is a preparatory patch for later changes.

    This shouldn't introduce any functional change.

    Link: http://lkml.kernel.org/r/20170515085827.16474-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Dan Williams
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "mm: make movable onlining suck less", v4.

    Movable onlining is a real hack with many downsides - mainly the
    reintroduction of the lowmem/highmem issues we used to have on 32b systems -
    but it is the only way to make memory hotremove more reliable, which
    is something that people are asking for.

    The current semantics of movable memory onlining are really cumbersome,
    however. The main reason for this is that the udev driven approach is
    basically unusable because udev races with the memory probing, while only
    the last memory block or the one adjacent to the existing zone_movable
    is allowed to be onlined movable. In short, the criterion for a
    successful online_movable changes under udev's feet. A reliable udev
    approach would require two phases, where the first successful
    movable online would have to check all the previous blocks and online
    them in descending order. This can hardly be considered sane.

    This patchset aims at making the onlining semantics more usable. First
    of all it allows onlining memory as movable as long as it doesn't clash
    with the existing ZONE_NORMAL. That means that ZONE_NORMAL and
    ZONE_MOVABLE cannot overlap. Currently I preserve the original ordering
    semantics so the normal zone always precedes the movable zone, but I have
    plans to remove this restriction in future because it is not really necessary.

    First 3 patches are cleanups which should be ready to be merged right
    away (unless I have missed something subtle of course).

    Patch 4 deals with ZONE_DEVICE dependencies down the __add_pages path.

    Patch 5 deals with implicit assumptions of register_one_node on pgdat
    initialization.

    Patches 6-10 deal with offline holes in the zone for pfn walkers. I
    hope I got all of them right but people familiar with compaction should
    double check this.

    Patch 11 is the core of the change. In order to make it easier to
    review I have tried to keep it as minimalistic as possible; the large
    code removal is moved to patch 14.

    Patch 12 is a trivial follow up cleanup. Patch 13 fixes sparse warnings
    and finally patch 14 removes the unused code.

    I have tested the patches in kvm:
    # qemu-system-x86_64 -enable-kvm -monitor pty -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G ...

    and then probed the additional memory by
    (qemu) object_add memory-backend-ram,id=mem1,size=1G
    (qemu) device_add pc-dimm,id=dimm1,memdev=mem1

    Then I have used this simple script to probe the memory block by hand
    # cat probe_memblock.sh
    #!/bin/sh

    BLOCK_NR=$1

    echo $((0x100000000+$BLOCK_NR*(128<<20))) > /sys/devices/system/memory/probe

    # for i in $(seq 10); do sh probe_memblock.sh $i; done
    # grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Normal Movable
    /sys/devices/system/memory/memory35/valid_zones:Normal Movable
    /sys/devices/system/memory/memory36/valid_zones:Normal Movable
    /sys/devices/system/memory/memory37/valid_zones:Normal Movable
    /sys/devices/system/memory/memory38/valid_zones:Normal Movable
    /sys/devices/system/memory/memory39/valid_zones:Normal Movable

    The main difference to the original implementation is that all new
    memblocks can be both online_kernel and online_movable initially because
    there is obviously no clash. For comparison, the original
    implementation would have

    /sys/devices/system/memory/memory33/valid_zones:Normal
    /sys/devices/system/memory/memory34/valid_zones:Normal
    /sys/devices/system/memory/memory35/valid_zones:Normal
    /sys/devices/system/memory/memory36/valid_zones:Normal
    /sys/devices/system/memory/memory37/valid_zones:Normal
    /sys/devices/system/memory/memory38/valid_zones:Normal
    /sys/devices/system/memory/memory39/valid_zones:Normal Movable

    Now
    # echo online_movable > /sys/devices/system/memory/memory34/state
    # grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Movable
    /sys/devices/system/memory/memory35/valid_zones:Movable
    /sys/devices/system/memory/memory36/valid_zones:Movable
    /sys/devices/system/memory/memory37/valid_zones:Movable
    /sys/devices/system/memory/memory38/valid_zones:Movable
    /sys/devices/system/memory/memory39/valid_zones:Movable

    Block 33 can still be onlined both as kernel and as movable while all
    the remaining blocks can only be onlined movable.

    /proc/zoneinfo says
    Node 0, zone Normal
    pages free 0
    min 0
    low 0
    high 0
    spanned 0
    present 0
    --
    Node 0, zone Movable
    pages free 32753
    min 85
    low 117
    high 149
    spanned 32768
    present 32768

    Probing a memblock at a lower address will result in a new memblock (32)
    which will still allow both Normal and Movable.

    # sh probe_memblock.sh 0
    # grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null
    /sys/devices/system/memory/memory32/valid_zones:Normal Movable
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Movable
    /sys/devices/system/memory/memory35/valid_zones:Movable

    and online_kernel will properly convert it to ZONE_NORMAL
    while 33 can still be onlined both ways.

    # echo online_kernel > /sys/devices/system/memory/memory32/state
    # grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null
    /sys/devices/system/memory/memory32/valid_zones:Normal
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Movable
    /sys/devices/system/memory/memory35/valid_zones:Movable

    /proc/zoneinfo now shows
    Node 0, zone Normal
    pages free 65441
    min 165
    low 230
    high 295
    spanned 65536
    present 65536
    --
    Node 0, zone Movable
    pages free 32740
    min 82
    low 114
    high 146
    spanned 32768
    present 32768

    so both zones have one memblock spanned and present.

    Onlining 39 should associate this block with the movable zone

    # echo online > /sys/devices/system/memory/memory39/state

    /proc/zoneinfo now shows
    Node 0, zone Normal
    pages free 32765
    min 80
    low 112
    high 144
    spanned 32768
    present 32768
    --
    Node 0, zone Movable
    pages free 65501
    min 160
    low 225
    high 290
    spanned 196608
    present 65536

    so we will have a movable zone which spans 6 memblocks, 2 present and 4
    representing a hole.

    Offlining both movable blocks will lead to a zone with no present
    pages, which I believe is the expected behavior.

    # echo offline > /sys/devices/system/memory/memory39/state
    # echo offline > /sys/devices/system/memory/memory34/state
    # grep -A6 "Movable\|Normal" /proc/zoneinfo
    Node 0, zone Normal
    pages free 32735
    min 90
    low 122
    high 154
    spanned 32768
    present 32768
    --
    Node 0, zone Movable
    pages free 0
    min 0
    low 0
    high 0
    spanned 196608
    present 0

    As a bonus we will get a nice cleanup in the memory hotplug codebase.

    This patch (of 16):

    init_currently_empty_zone doesn't have any error to return, yet it
    still returns an int and callers defensively try to handle a potential
    error. Remove this nonsense and simplify all callers.

    This patch shouldn't have any visible effect.
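
    The shape of the cleanup, shown as a generic userspace illustration rather
    than the kernel code itself: a function that can only succeed keeps
    returning int, so every caller carries a dead error branch; making it void
    lets callers drop that branch.

    /* cleanup_pattern.c - generic illustration of the int-to-void conversion
     * described above (not the kernel function itself). */
    #include <stdio.h>

    static int setup_thing_old(void)
    {
            /* no failure path exists, yet the signature promises one */
            return 0;
    }

    static void setup_thing(void)
    {
            /* same work, honest signature */
    }

    int main(void)
    {
            if (setup_thing_old())          /* dead error handling */
                    printf("cannot happen\n");
            setup_thing();                  /* nothing bogus to check */
            return 0;
    }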

    Link: http://lkml.kernel.org/r/20170515085827.16474-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: Balbir Singh
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

04 May, 2017

1 commit

  • kswapd is woken to reclaim a node based on a failed allocation request
    from any eligible zone. Once reclaiming in balance_pgdat(), it will
    continue reclaiming until there is an eligible zone available for the
    zone it was woken for. kswapd tracks what zone it was recently woken
    for in pgdat->kswapd_classzone_idx. If it has not been woken recently,
    this zone will be 0.

    However, the decision on whether to sleep is made on
    kswapd_classzone_idx, which is 0 without a recent wakeup request, and that
    classzone does not account for lowmem reserves. This allows kswapd to
    sleep when a small low zone such as ZONE_DMA is balanced for a GFP_DMA
    request even if a stream of allocations cannot use that zone. While
    kswapd may be woken again in the near future there are two
    consequences -- the pgdat bits that control congestion are cleared
    prematurely and direct reclaim is more likely as kswapd slept
    prematurely.

    This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an
    invalid index) when there have been no recent wakeups. If there are no
    wakeups, it'll decide whether to sleep based on the highest possible
    zone available (MAX_NR_ZONES - 1). It then becomes critical that the
    "pgdat balanced" decisions during reclaim and when deciding to sleep are
    the same. If there is a mismatch, kswapd can stay awake continually
    trying to balance tiny zones.
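
    The idea can be sketched in plain C; this is an illustration of the
    sentinel pattern only, not the kernel implementation, and MAX_NR_ZONES
    here is a made-up constant:

    #include <stdio.h>

    #define MAX_NR_ZONES 4  /* placeholder value for the illustration */

    static int effective_classzone_idx(int kswapd_classzone_idx)
    {
            /* no recent wakeup: balance against the highest possible zone */
            if (kswapd_classzone_idx == MAX_NR_ZONES)
                    return MAX_NR_ZONES - 1;
            return kswapd_classzone_idx;
    }

    int main(void)
    {
            printf("%d\n", effective_classzone_idx(MAX_NR_ZONES));  /* 3 */
            printf("%d\n", effective_classzone_idx(1));             /* 1 */
            return 0;
    }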

    simoop was used to evaluate it again. Two of the preparation patches
    regressed the workload so they are included as the second set of
    results. Otherwise this patch looks artificially excellent.

    4.11.0-rc1 4.11.0-rc1 4.11.0-rc1
    vanilla clear-v2 keepawake-v2
    Amean p50-Read 21670074.18 ( 0.00%) 19786774.76 ( 8.69%) 22668332.52 ( -4.61%)
    Amean p95-Read 25456267.64 ( 0.00%) 24101956.27 ( 5.32%) 26738688.00 ( -5.04%)
    Amean p99-Read 29369064.73 ( 0.00%) 27691872.71 ( 5.71%) 30991404.52 ( -5.52%)
    Amean p50-Write 1390.30 ( 0.00%) 1011.91 ( 27.22%) 924.91 ( 33.47%)
    Amean p95-Write 412901.57 ( 0.00%) 34874.98 ( 91.55%) 1362.62 ( 99.67%)
    Amean p99-Write 6668722.09 ( 0.00%) 575449.60 ( 91.37%) 16854.04 ( 99.75%)
    Amean p50-Allocation 78714.31 ( 0.00%) 84246.26 ( -7.03%) 74729.74 ( 5.06%)
    Amean p95-Allocation 175533.51 ( 0.00%) 400058.43 (-127.91%) 101609.74 ( 42.11%)
    Amean p99-Allocation 247003.02 ( 0.00%) 10905600.00 (-4315.17%) 125765.57 ( 49.08%)

    With this patch on top, write and allocation latencies are massively
    improved. The read latencies are slightly impaired but it's worth
    noting that this is mostly due to the IO scheduler and not directly
    related to reclaim. The vmstats are a bit of a mix but the relevant
    ones are as follows;

    4.10.0-rc7 4.10.0-rc7 4.10.0-rc7
    mmots-20170209 clear-v1r25 keepawake-v1r25
    Swap Ins 0 0 0
    Swap Outs 0 608 0
    Direct pages scanned 6910672 3132699 6357298
    Kswapd pages scanned 57036946 82488665 56986286
    Kswapd pages reclaimed 55993488 63474329 55939113
    Direct pages reclaimed 6905990 2964843 6352115
    Kswapd efficiency 98% 76% 98%
    Kswapd velocity 12494.375 17597.507 12488.065
    Direct efficiency 99% 94% 99%
    Direct velocity 1513.835 668.306 1393.148
    Page writes by reclaim 0.000 4410243.000 0.000
    Page writes file 0 4409635 0
    Page writes anon 0 608 0
    Page reclaim immediate 1036792 14175203 1042571

    4.11.0-rc1 4.11.0-rc1 4.11.0-rc1
    vanilla clear-v2 keepawake-v2
    Swap Ins 0 12 0
    Swap Outs 0 838 0
    Direct pages scanned 6579706 3237270 6256811
    Kswapd pages scanned 61853702 79961486 54837791
    Kswapd pages reclaimed 60768764 60755788 53849586
    Direct pages reclaimed 6579055 2987453 6256151
    Kswapd efficiency 98% 75% 98%
    Page writes by reclaim 0.000 4389496.000 0.000
    Page writes file 0 4388658 0
    Page writes anon 0 838 0
    Page reclaim immediate 1073573 14473009 982507

    Swap-outs are equivalent to baseline.

    Direct reclaim is reduced but not eliminated. It's worth noting that
    there are two periods of direct reclaim for this workload. The first is
    when the workload switches from preparing the files to the actual test
    itself; it's a lot of file IO followed by a lot of allocations that
    reclaim heavily for a brief window. While direct reclaim is lower with
    clear-v2, it is due to kswapd scanning aggressively and trying to
    reclaim the world, which is not the right thing to do. With the patches
    applied, there is still direct reclaim, but it is limited to the phase
    change from "creating work files" to starting multiple threads that
    allocate a lot of anonymous memory faster than kswapd can reclaim.

    Scanning/reclaim efficiency is restored by this patch.

    Page writes from reclaim context are back at 0 which is ideal.

    The number of pages immediately reclaimed after IO completes is slightly
    improved, but it is expected that this will vary.

    On UMA, there is almost no change so this is not expected to be a
    universal win.

    [mgorman@suse.de: fix ->kswapd_classzone_idx initialization]
    Link: http://lkml.kernel.org/r/20170406174538.5msrznj6nt6qpbx5@suse.de
    Link: http://lkml.kernel.org/r/20170309075657.25121-4-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Shantanu Goel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Mar, 2017

1 commit

  • Commit bfc8c90139eb ("mem-hotplug: implement get/put_online_mems")
    introduced new functions get/put_online_mems() and mem_hotplug_begin/end()
    in order to allow similar semantics for memory hotplug like for cpu
    hotplug.

    The corresponding functions for cpu hotplug are get/put_online_cpus()
    and cpu_hotplug_begin/done().

    The commit however failed to introduce functions that would serialize
    memory hotplug operations the way it is done for cpu hotplug with
    cpu_maps_update_begin/done().

    This basically leaves mem_hotplug.active_writer unprotected and allows
    concurrent writers to modify it, which may lead to problems as outlined
    by commit f931ab479dd2 ("mm: fix devm_memremap_pages crash, use
    mem_hotplug_{begin, done}").

    That commit was extended again with commit b5d24fda9c3d ("mm,
    devm_memremap_pages: hold device_hotplug lock over mem_hotplug_{begin,
    done}") which serializes memory hotplug operations for some call sites
    by using the device_hotplug lock.

    In addition with commit 3fc21924100b ("mm: validate device_hotplug is held
    for memory hotplug") a sanity check was added to mem_hotplug_begin() to
    verify that the device_hotplug lock is held.

    This in turn triggers the following warning on s390:

    WARNING: CPU: 6 PID: 1 at drivers/base/core.c:643 assert_held_device_hotplug+0x4a/0x58
    Call Trace:
    assert_held_device_hotplug+0x40/0x58)
    mem_hotplug_begin+0x34/0xc8
    add_memory_resource+0x7e/0x1f8
    add_memory+0xda/0x130
    add_memory_merged+0x15c/0x178
    sclp_detect_standby_memory+0x2ae/0x2f8
    do_one_initcall+0xa2/0x150
    kernel_init_freeable+0x228/0x2d8
    kernel_init+0x2a/0x140
    kernel_thread_starter+0x6/0xc

    One possible fix would be to add more lock_device_hotplug() and
    unlock_device_hotplug() calls around each call site of
    mem_hotplug_begin/end(). But that would give the device_hotplug lock
    additional semantics it should better not have (serializing memory
    hotplug operations).

    Instead add a new memory_add_remove_lock which has semantics similar
    to cpu_add_remove_lock for cpu hotplug.

    To hopefully keep things a bit simpler, the lock is taken and released
    within the mem_hotplug_begin/end() functions.
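
    The locking pattern, as a hedged userspace analogue rather than the
    actual kernel diff: a single global mutex owned by the begin/end helpers,
    so every caller of the add/remove paths is serialized without giving the
    device_hotplug lock extra meaning.

    #include <pthread.h>
    #include <stdio.h>

    /* Userspace analogue of the memory_add_remove_lock idea: begin/end own
     * the lock so callers cannot forget it or interleave add/remove paths. */
    static pthread_mutex_t memory_add_remove_lock = PTHREAD_MUTEX_INITIALIZER;

    static void mem_hotplug_begin(void)
    {
            pthread_mutex_lock(&memory_add_remove_lock);
            /* ...the original begin work (writer accounting etc.) goes here... */
    }

    static void mem_hotplug_end(void)
    {
            /* ...the original end work goes here... */
            pthread_mutex_unlock(&memory_add_remove_lock);
    }

    int main(void)
    {
            mem_hotplug_begin();
            printf("add_memory()/remove_memory() critical section\n");
            mem_hotplug_end();
            return 0;
    }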

    Link: http://lkml.kernel.org/r/20170314125226.16779-2-heiko.carstens@de.ibm.com
    Signed-off-by: Heiko Carstens
    Reported-by: Sebastian Ott
    Acked-by: Dan Williams
    Acked-by: Rafael J. Wysocki
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Ben Hutchings
    Cc: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     

25 Feb, 2017

4 commits

  • Commit 31bc3858ea3e ("add automatic onlining policy for the newly added
    memory") provides the capability to have added memory automatically
    onlined during add, but this appears to be slightly broken.

    The current implementation uses walk_memory_range() to call
    online_memory_block, which uses memory_block_change_state() to online
    the memory. Instead, we should be calling device_online() for the
    memory block in online_memory_block(). This would online the memory
    (device_online() invokes the memory bus online routine
    memory_subsys_online(), which calls memory_block_change_state()) and
    properly update the device struct's offline flag.

    As a result of the current implementation, attempting to remove a memory
    block after adding it using auto online fails. This is because doing a
    remove, for instance

    echo offline > /sys/devices/system/memory/memoryXXX/state

    uses device_offline() which checks the dev->offline flag.
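
    From userspace the whole thing boils down to the state attribute. A small
    sketch that offlines and re-onlines a block; memory34 is just a placeholder
    block number, and the program needs root:

    /* toggle_memblock.c - offline and re-online one memory block via sysfs.
     * memory34 is a placeholder; adjust the path and run as root. */
    #include <stdio.h>

    static void write_state(const char *path, const char *state)
    {
            FILE *f = fopen(path, "w");

            if (!f) {
                    perror(path);
                    return;
            }
            if (fputs(state, f) == EOF)
                    perror("write");
            fclose(f);
    }

    int main(void)
    {
            const char *state = "/sys/devices/system/memory/memory34/state";

            write_state(state, "offline"); /* goes through device_offline() */
            write_state(state, "online");  /* with the fix, dev->offline matches */
            return 0;
    }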

    Link: http://lkml.kernel.org/r/20170222220744.8119.19687.stgit@ltcalpine2-lp14.aus.stglabs.ibm.com
    Signed-off-by: Nathan Fontenot
    Cc: Michael Ellerman
    Cc: Michael Roth
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Fontenot
     
    When mainline introduced commit a96dfddbcc04 ("base/memory, hotplug: fix
    a kernel oops in show_valid_zones()"), it obtained the valid start and
    end pfn from the given pfn range. The valid start pfn can fix the
    actual issue, but it introduced another issue: the valid end pfn
    may exceed the given end_pfn.

    Although the incorrect overflow does not result in an actual problem at
    present, I think it needs to be fixed.

    [toshi.kani@hpe.com: remove assumption that end_pfn is aligned by MAX_ORDER_NR_PAGES]
    Fixes: a96dfddbcc04 ("base/memory, hotplug: fix a kernel oops in show_valid_zones()")
    Link: http://lkml.kernel.org/r/1486467299-22648-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Signed-off-by: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     
  • We had considered all of the non-lru pages as unmovable before commit
    bda807d44454 ("mm: migrate: support non-lru movable page migration").
    But now some non-lru pages, like zsmalloc and virtio-balloon pages, have
    also become movable. So we can offline such blocks by using non-lru page
    migration.

    This patch straightforwardly adds non-lru migration code, which means
    adding non-lru related code to the functions that scan over pfns,
    collect pages to be migrated, and isolate them before migration.

    Signed-off-by: Yisheng Xie
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Hanjun Guo
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Taku Izumi
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yisheng Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     
  • It has no modular callers.

    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton