14 Mar, 2019

3 commits

  • [ Upstream commit 891cb2a72d821f930a39d5900cb7a3aa752c1d5b ]

    Rong Chen has reported the following boot crash:

    PGD 0 P4D 0
    Oops: 0000 [#1] PREEMPT SMP PTI
    CPU: 1 PID: 239 Comm: udevd Not tainted 5.0.0-rc4-00149-gefad4e4 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    RIP: 0010:page_mapping+0x12/0x80
    Code: 5d c3 48 89 df e8 0e ad 02 00 85 c0 75 da 89 e8 5b 5d c3 0f 1f 44 00 00 53 48 89 fb 48 8b 43 08 48 8d 50 ff a8 01 48 0f 45 da 8b 53 08 48 8d 42 ff 83 e2 01 48 0f 44 c3 48 83 38 ff 74 2f 48
    RSP: 0018:ffff88801fa87cd8 EFLAGS: 00010202
    RAX: ffffffffffffffff RBX: fffffffffffffffe RCX: 000000000000000a
    RDX: fffffffffffffffe RSI: ffffffff820b9a20 RDI: ffff88801e5c0000
    RBP: 6db6db6db6db6db7 R08: ffff88801e8bb000 R09: 0000000001b64d13
    R10: ffff88801fa87cf8 R11: 0000000000000001 R12: ffff88801e640000
    R13: ffffffff820b9a20 R14: ffff88801f145258 R15: 0000000000000001
    FS: 00007fb2079817c0(0000) GS:ffff88801dd00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000006 CR3: 000000001fa82000 CR4: 00000000000006a0
    Call Trace:
    __dump_page+0x14/0x2c0
    is_mem_section_removable+0x24c/0x2c0
    removable_show+0x87/0xa0
    dev_attr_show+0x25/0x60
    sysfs_kf_seq_show+0xba/0x110
    seq_read+0x196/0x3f0
    __vfs_read+0x34/0x180
    vfs_read+0xa0/0x150
    ksys_read+0x44/0xb0
    do_syscall_64+0x5e/0x4a0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    and bisected it down to commit efad4e475c31 ("mm, memory_hotplug:
    is_mem_section_removable do not pass the end of a zone").

    The reason for the crash is that the mapping is garbage for a poisoned
    (uninitialized) page. This shouldn't happen, as all pages within the
    zone's boundary should be initialized.

    Later debugging revealed that the actual problem is an off-by-one when
    evaluating the end_page. 'start_pfn + nr_pages', resp. 'zone_end_pfn',
    refers to a pfn one past the range and as such it might belong to a
    different memory section.

    This, along with CONFIG_SPARSEMEM, then makes the loop condition
    completely bogus because pointer arithmetic doesn't work for pages
    from two different sections in that memory model.

    Fix the issue by reworking is_pageblock_removable to be pfn based and
    only use struct page where necessary. This makes the code slightly
    easier to follow and removes the problematic pointer arithmetic
    completely.
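
    For illustration, the reworked walk boils down to something like the
    following condensed sketch (not the literal patch; helper and variable
    names are approximations of the upstream code):

    bool is_mem_section_removable(unsigned long start_pfn, unsigned long nr_pages)
    {
            unsigned long end_pfn, pfn;

            /* Clamp to the zone so we never look past its last page. */
            end_pfn = min(start_pfn + nr_pages,
                          zone_end_pfn(page_zone(pfn_to_page(start_pfn))));

            /* Walk by pfn, one pageblock at a time - no struct page arithmetic. */
            for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages)
                    if (!is_pageblock_removable_nolock(pfn))
                            return false;

            return true;
    }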

    Link: http://lkml.kernel.org/r/20190218181544.14616-1-mhocko@kernel.org
    Fixes: efad4e475c31 ("mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone")
    Signed-off-by: Michal Hocko
    Reported-by:
    Tested-by:
    Acked-by: Mike Rapoport
    Reviewed-by: Oscar Salvador
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Michal Hocko
     
  • [ Upstream commit 24feb47c5fa5b825efb0151f28906dfdad027e61 ]

    If memory end is not aligned with the sparse memory section boundary,
    the mapping of such a section is only partly initialized. This may lead
    to VM_BUG_ON due to uninitialized struct pages access from
    test_pages_in_a_zone() function triggered by memory_hotplug sysfs
    handlers.

    Here are the panic examples:
    CONFIG_DEBUG_VM_PGFLAGS=y
    kernel parameter mem=2050M
    --------------------------
    page:000003d082008000 is uninitialized and poisoned
    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
    Call Trace:
    test_pages_in_a_zone+0xde/0x160
    show_valid_zones+0x5c/0x190
    dev_attr_show+0x34/0x70
    sysfs_kf_seq_show+0xc8/0x148
    seq_read+0x204/0x480
    __vfs_read+0x32/0x178
    vfs_read+0x82/0x138
    ksys_read+0x5a/0xb0
    system_call+0xdc/0x2d8
    Last Breaking-Event-Address:
    test_pages_in_a_zone+0xde/0x160
    Kernel panic - not syncing: Fatal exception: panic_on_oops

    Fix this by checking whether the pfn being examined is within the zone.
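
    For illustration only, the guard amounts to something like the following
    fragment inside the per-pfn loop of test_pages_in_a_zone() (placement and
    surrounding code simplified; not the literal diff):

    /* Bail out if this pfn has already left the zone found so far. */
    if (zone && !zone_spans_pfn(zone, pfn))
            return 0;
    page = pfn_to_page(pfn);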

    [mhocko@suse.com: separated this change from http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com]
    Link: http://lkml.kernel.org/r/20190128144506.15603-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Mikhail Zaslonko
    Tested-by: Mikhail Gavrilov
    Reviewed-by: Oscar Salvador
    Tested-by: Gerald Schaefer
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Mikhail Gavrilov
    Cc: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Mikhail Zaslonko
     
  • [ Upstream commit efad4e475c312456edb3c789d0996d12ed744c13 ]

    Patch series "mm, memory_hotplug: fix uninitialized pages fallouts", v2.

    Mikhail Zaslonko posted fixes for these two bugs quite some time ago [1].
    I pushed back on those fixes because I believed it was much better to
    plug the problem at initialization time rather than play whack-a-mole
    all over the hotplug code and find all the places which expect the full
    memory section to be initialized.

    We have ended up with commit 2830bf6f05fb ("mm, memory_hotplug:
    initialize struct pages for the full memory section") merged and cause a
    regression [2][3]. The reason is that there might be memory layouts
    when two NUMA nodes share the same memory section so the merged fix is
    simply incorrect.

    In order to plug this hole we really have to be zone range aware in
    those handlers. I have split up the original patch into two. One is
    unchanged (patch 2) and I took a different approach for the `removable'
    crash.

    [1] http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com
    [2] https://bugzilla.redhat.com/show_bug.cgi?id=1666948
    [3] http://lkml.kernel.org/r/20190125163938.GA20411@dhcp22.suse.cz

    This patch (of 2):

    Mikhail has reported the following VM_BUG_ON triggered when reading sysfs
    removable state of a memory block:

    page:000003d08300c000 is uninitialized and poisoned
    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
    Call Trace:
    is_mem_section_removable+0xb4/0x190
    show_mem_removable+0x9a/0xd8
    dev_attr_show+0x34/0x70
    sysfs_kf_seq_show+0xc8/0x148
    seq_read+0x204/0x480
    __vfs_read+0x32/0x178
    vfs_read+0x82/0x138
    ksys_read+0x5a/0xb0
    system_call+0xdc/0x2d8
    Last Breaking-Event-Address:
    is_mem_section_removable+0xb4/0x190
    Kernel panic - not syncing: Fatal exception: panic_on_oops

    The reason is that the memory block spans the zone boundary and we are
    stumbling over an uninitialized struct page. Fix this by enforcing the
    zone range in is_mem_section_removable so that we never run away from a
    zone.
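
    Condensed, the enforcement amounts to clamping the end of the walk inside
    is_mem_section_removable to the zone end, roughly as in the following
    sketch (not the literal diff):

    struct page *page = pfn_to_page(start_pfn);
    unsigned long end_pfn = min(start_pfn + nr_pages,
                                zone_end_pfn(page_zone(page)));
    struct page *end_page = pfn_to_page(end_pfn);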

    Link: http://lkml.kernel.org/r/20190128144506.15603-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Mikhail Zaslonko
    Debugged-by: Mikhail Zaslonko
    Tested-by: Gerald Schaefer
    Tested-by: Mikhail Gavrilov
    Reviewed-by: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Michal Hocko
     

07 Feb, 2019

1 commit

  • commit eeb0efd071d821a88da3fbd35f2d478f40d3b2ea upstream.

    This is the same sort of error we saw in commit 17e2e7d7e1b8 ("mm,
    page_alloc: fix has_unmovable_pages for HugePages").

    Gigantic hugepages cross several memblocks, so it can be that the page
    we get in scan_movable_pages() is a page-tail belonging to a
    1G-hugepage. If that happens, page_hstate()->size_to_hstate() will
    return NULL, and we will blow up in hugepage_migration_supported().

    The splat is as follows:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    #PF error: [normal kernel read fault]
    PGD 0 P4D 0
    Oops: 0000 [#1] SMP PTI
    CPU: 1 PID: 1350 Comm: bash Tainted: G E 5.0.0-rc1-mm1-1-default+ #27
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
    RIP: 0010:__offline_pages+0x6ae/0x900
    Call Trace:
    memory_subsys_offline+0x42/0x60
    device_offline+0x80/0xa0
    state_store+0xab/0xc0
    kernfs_fop_write+0x102/0x180
    __vfs_write+0x26/0x190
    vfs_write+0xad/0x1b0
    ksys_write+0x42/0x90
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    Modules linked in: af_packet(E) xt_tcpudp(E) ipt_REJECT(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv4(E) ip_set(E) nfnetlink(E) ebtable_nat(E) ebtable_broute(E) bridge(E) stp(E) llc(E) iptable_mangle(E) iptable_raw(E) iptable_security(E) ebtable_filter(E) ebtables(E) iptable_filter(E) ip_tables(E) x_tables(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) bochs_drm(E) ttm(E) aesni_intel(E) drm_kms_helper(E) aes_x86_64(E) crypto_simd(E) cryptd(E) glue_helper(E) drm(E) virtio_net(E) syscopyarea(E) sysfillrect(E) net_failover(E) sysimgblt(E) pcspkr(E) failover(E) i2c_piix4(E) fb_sys_fops(E) parport_pc(E) parport(E) button(E) btrfs(E) libcrc32c(E) xor(E) zstd_decompress(E) zstd_compress(E) xxhash(E) raid6_pq(E) sd_mod(E) ata_generic(E) ata_piix(E) ahci(E) libahci(E) libata(E) crc32c_intel(E) serio_raw(E) virtio_pci(E) virtio_ring(E) virtio(E) sg(E) scsi_mod(E) autofs4(E)
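
    The guard this implies in scan_movable_pages() can be sketched roughly as
    follows (a fragment of the per-pfn loop; helper names are from that era's
    kernel and the actual patch may differ in details): always resolve the
    head page first so page_hstate() is valid even when the pfn points at a
    tail page, and skip the rest of the hugepage in one step.

    if (PageHuge(page)) {
            struct page *head = compound_head(page);

            if (hugepage_migration_supported(page_hstate(head)) &&
                page_huge_active(head))
                    return pfn;
            /* Skip the remaining tail pages of this (possibly gigantic) hugepage. */
            pfn += (1UL << compound_order(head)) - (page - head) - 1;
    }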

    [akpm@linux-foundation.org: fix brace layout, per David. Reduce indentation]
    Link: http://lkml.kernel.org/r/20190122154407.18417-1-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Reviewed-by: Anthony Yznaga
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oscar Salvador
     

13 Jan, 2019

1 commit

  • commit b15c87263a69272423771118c653e9a1d0672caa upstream.

    We have received a bug report that an injected MCE about faulty memory
    prevents memory offline from succeeding on a 4.4 based kernel. The
    underlying reason was that the HWPoison page has an elevated reference
    count and the migration keeps failing. There are two problems with that.
    First of all it is dubious to migrate the poisoned page because we know
    that accessing that memory may fail. Secondly it doesn't make any sense
    to migrate potentially broken content and preserve the memory corruption
    over to a new location.

    Oscar has found out that 4.4 and the current upstream kernels behave
    slightly differently with his simple testcase

    ===

    int main(void)
    {
            int ret;
            int i;
            int fd;
            char *array = malloc(4096);
            char *array_locked = malloc(4096);

            fd = open("/tmp/data", O_RDONLY);
            read(fd, array, 4095);

            for (i = 0; i < 4096; i++)
                    array_locked[i] = 'd';

            ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), sizeof(array_locked));
            if (ret)
                    perror("mlock");

            sleep(20);

            ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON);
            if (ret)
                    perror("madvise");

            for (i = 0; i < 4096; i++)
                    array_locked[i] = 'd';

            return 0;
    }
    ===

    + offline this memory.

    In 4.4 kernels he saw the hwpoisoned page being returned back to the LRU
    list:
    kernel: [] dump_trace+0x59/0x340
    kernel: [] show_stack_log_lvl+0xea/0x170
    kernel: [] show_stack+0x21/0x40
    kernel: [] dump_stack+0x5c/0x7c
    kernel: [] warn_slowpath_common+0x81/0xb0
    kernel: [] __pagevec_lru_add_fn+0x14c/0x160
    kernel: [] pagevec_lru_move_fn+0xad/0x100
    kernel: [] __lru_cache_add+0x6c/0xb0
    kernel: [] add_to_page_cache_lru+0x46/0x70
    kernel: [] extent_readpages+0xc3/0x1a0 [btrfs]
    kernel: [] __do_page_cache_readahead+0x177/0x200
    kernel: [] ondemand_readahead+0x168/0x2a0
    kernel: [] generic_file_read_iter+0x41f/0x660
    kernel: [] __vfs_read+0xcd/0x140
    kernel: [] vfs_read+0x7a/0x120
    kernel: [] kernel_read+0x3b/0x50
    kernel: [] do_execveat_common.isra.29+0x490/0x6f0
    kernel: [] do_execve+0x28/0x30
    kernel: [] call_usermodehelper_exec_async+0xfb/0x130
    kernel: [] ret_from_fork+0x55/0x80

    And the latter confuses the hotremove path because an LRU page is
    attempted to be migrated and that fails due to an elevated reference
    count. It is quite possible that the reuse of the HWPoisoned page is some
    kind of fixed race condition but I am not really sure about that.

    With the upstream kernel the failure is slightly different. The page
    doesn't seem to have the LRU bit set but isolate_movable_page simply
    fails and do_migrate_range simply puts all the isolated pages back to
    the LRU, and therefore no progress is made and scan_movable_pages finds
    the same set of pages over and over again.

    Fix both cases by explicitly checking for HWPoisoned pages before we even
    try to get a reference on the page, and by trying to unmap it if it is
    still mapped. As explained by Naoya:

    : Hwpoison code never unmapped those for no big reason because
    : Ksm pages never dominate memory, so we simply didn't have strong
    : motivation to save the pages.

    Also put a WARN_ON(PageLRU) in case there is a race and we can hit LRU
    HWPoison pages, which shouldn't happen but I couldn't convince myself
    about that. Naoya has noted the following:

    : Theoretically no such guarantee, because try_to_unmap() doesn't have a
    : guarantee of success and then memory_failure() returns immediately
    : when hwpoison_user_mappings fails.
    : Or the following code (comes after hwpoison_user_mappings block) also
    : implies that the target page can still have PageLRU flag.
    :
    : /*
    :  * Torn down by someone else?
    :  */
    : if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
    :         action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
    :         res = -EBUSY;
    :         goto out;
    : }
    :
    : So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
    : the current version of your patch.
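
    Condensed, the added handling in do_migrate_range() looks roughly like
    the sketch below (a fragment of the per-pfn loop; exact unmap flags and
    placement are approximations, not the literal diff):

    if (PageHWPoison(page)) {
            if (WARN_ON(PageLRU(page)))
                    isolate_lru_page(page);
            if (page_mapped(page))
                    try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
            continue;       /* never try to migrate a poisoned page */
    }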

    Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Debugged-by: Oscar Salvador
    Tested-by: Oscar Salvador
    Acked-by: David Hildenbrand
    Acked-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

21 Nov, 2018

1 commit

  • commit dd33ad7b251f900481701b2a82d25de583867708 upstream.

    We have received a bug report that unbinding a large pmem (>1TB) can
    result in a soft lockup:

    NMI watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [ndctl:4365]
    [...]
    Supported: Yes
    CPU: 9 PID: 4365 Comm: ndctl Not tainted 4.12.14-94.40-default #1 SLE12-SP4
    Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.01.00.0833.051120182255 05/11/2018
    task: ffff9cce7d4410c0 task.stack: ffffbe9eb1bc4000
    RIP: 0010:__put_page+0x62/0x80
    Call Trace:
    devm_memremap_pages_release+0x152/0x260
    release_nodes+0x18d/0x1d0
    device_release_driver_internal+0x160/0x210
    unbind_store+0xb3/0xe0
    kernfs_fop_write+0x102/0x180
    __vfs_write+0x26/0x150
    vfs_write+0xad/0x1a0
    SyS_write+0x42/0x90
    do_syscall_64+0x74/0x150
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    RIP: 0033:0x7fd13166b3d0

    It has been reported on an older (4.12) kernel but the current upstream
    code doesn't cond_resched in the hot remove code at all and the given
    range to remove might be really large. Fix the issue by calling
    cond_resched once per memory section.
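
    Sketched, the change boils down to yielding once per iteration of the
    section loop in the removal path (condensed; names approximate the code
    of that era, not the literal diff):

    for (i = 0; i < sections_to_remove; i++) {
            unsigned long pfn = phys_start_pfn + i * PAGES_PER_SECTION;

            cond_resched();
            ret = __remove_section(zone, __pfn_to_section(pfn), map_offset,
                                   altmap);
            if (ret)
                    break;
    }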

    Link: http://lkml.kernel.org/r/20181031125840.23982-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Thumshirn
    Cc: Dan Williams
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

05 Sep, 2018

1 commit

    When scanning for movable pages, filter out Hugetlb pages if hugepage
    migration is not supported. Without this we hit an infinite loop in
    __offline_pages() where we do

    pfn = scan_movable_pages(start_pfn, end_pfn);
    if (pfn) { /* We have movable pages */
            ret = do_migrate_range(pfn, end_pfn);
            goto repeat;
    }

    Fix this by checking hugepage_migration_supported both in
    has_unmovable_pages, which is the primary backoff mechanism for page
    offlining, and, for consistency reasons, also in scan_movable_pages,
    because it doesn't make any sense to return a pfn pointing to a
    non-migratable huge page.
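
    The has_unmovable_pages() side of the check can be sketched as follows
    (a fragment of its page loop, condensed; the early bail-out in the real
    code may differ):

    if (PageHuge(page)) {
            /* A hugepage that cannot be migrated makes the block unmovable. */
            if (!hugepage_migration_supported(page_hstate(page)))
                    return true;

            iter = round_up(iter + 1, 1 << compound_order(page)) - 1;
            continue;
    }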

    This issue was revealed by, but not caused by 72b39cfc4d75 ("mm,
    memory_hotplug: do not fail offlining too early").

    Link: http://lkml.kernel.org/r/20180824063314.21981-1-aneesh.kumar@linux.ibm.com
    Fixes: 72b39cfc4d75 ("mm, memory_hotplug: do not fail offlining too early")
    Signed-off-by: Aneesh Kumar K.V
    Reported-by: Haren Myneni
    Acked-by: Michal Hocko
    Reviewed-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

23 Aug, 2018

1 commit

  • Currently, whenever a new node is created/re-used from the memhotplug
    path, we call free_area_init_node()->free_area_init_core(). But there is
    some code that we do not really need to run when we are coming from that
    path.

    free_area_init_core() performs the following actions:

    1) Initializes pgdat internals, such as spinlock, waitqueues and more.
    2) Account # nr_all_pages and # nr_kernel_pages. These values are used later on
    when creating hash tables.
    3) Account number of managed_pages per zone, subtracting dma_reserved and
    memmap pages.
    4) Initializes some fields of the zone structure data
    5) Calls init_currently_empty_zone to initialize all the freelists
    6) Calls memmap_init to initialize all pages belonging to certain zone

    When called from memhotplug path, free_area_init_core() only performs
    actions #1 and #4.

    Action #2 is pointless as the zones do not have any pages since either the
    node was freed, or we are re-using it; either way all zones belonging to
    this node should have 0 pages. For the same reason, action #3 always
    results in managed_pages being 0.

    Actions #5 and #6 are performed later on when onlining the pages:
    online_pages()->move_pfn_range_to_zone()->init_currently_empty_zone()
    online_pages()->move_pfn_range_to_zone()->memmap_init_zone()

    This patch does two things:

    First, it moves the node/zone initialization to their own functions, so it
    allows us to create a small version of free_area_init_core, where we only
    perform:

    1) Initialization of pgdat internals, such as spinlock, waitqueues and more
    4) Initialization of some fields of the zone structure data

    These two functions are: pgdat_init_internals() and zone_init_internals().

    The second thing this patch does is to introduce
    free_area_init_core_hotplug(), the memhotplug version of
    free_area_init_core():

    Currently, we call free_area_init_node() from the memhotplug path. In
    there, we set some pgdat's fields, and call calculate_node_totalpages().
    calculate_node_totalpages() calculates the # of pages the node has.

    Since the node is either new, or we are re-using it, the zones belonging
    to this node should not have any pages, so there is no point to calculate
    this now.

    Actually, we re-set these values to 0 later on with the calls to:

    reset_node_managed_pages()
    reset_node_present_pages()

    The # of pages per node and the # of pages per zone will be calculated when
    onlining the pages:

    online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_zone_range()
    online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_pgdat_range()

    Also, since free_area_init_core/free_area_init_node will now only get called during early init, let us replace
    __paginginit with __init, so their code gets freed up.
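
    Condensed, the hotplug-only initializer ends up looking roughly like the
    following sketch (signatures and the zero "remaining pages" argument are
    approximations of the resulting code, not the literal patch):

    void __ref free_area_init_core_hotplug(int nid)
    {
            enum zone_type z;
            pg_data_t *pgdat = NODE_DATA(nid);

            pgdat_init_internals(pgdat);            /* spinlock, waitqueues, ... */
            for (z = 0; z < MAX_NR_ZONES; z++)
                    zone_init_internals(&pgdat->node_zones[z], z, nid, 0);
    }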

    [osalvador@techadventures.net: fix section usage]
    Link: http://lkml.kernel.org/r/20180731101752.GA473@techadventures.net
    [osalvador@suse.de: v6]
    Link: http://lkml.kernel.org/r/20180801122348.21588-6-osalvador@techadventures.net
    Link: http://lkml.kernel.org/r/20180730101757.28058-5-osalvador@techadventures.net
    Signed-off-by: Oscar Salvador
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Pasha Tatashin
    Cc: Aaron Lu
    Cc: Dan Williams
    Cc: David Hildenbrand
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     

18 Aug, 2018

3 commits

    link_mem_sections() and walk_memory_range() share most of the code, so
    we can convert link_mem_sections() into a dummy function that calls
    walk_memory_range() with a callback to register_mem_sect_under_node().

    This patch converts register_mem_sect_under_node() in order to match
    walk_memory_range()'s callback signature, getting rid of the check_nid
    argument and checking instead if the system is still booting, since we
    only have to check for the nid if the system is in that state.
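
    With that, link_mem_sections() reduces to roughly the following wrapper
    (sketch; the exact parameters of walk_memory_range() are assumed):

    int link_mem_sections(int nid, unsigned long start_pfn, unsigned long end_pfn)
    {
            return walk_memory_range(start_pfn, end_pfn, (void *)&nid,
                                     register_mem_sect_under_node);
    }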

    Link: http://lkml.kernel.org/r/20180622111839.10071-4-osalvador@techadventures.net
    Signed-off-by: Oscar Salvador
    Suggested-by: Pavel Tatashin
    Tested-by: Reza Arbab
    Tested-by: Jonathan Cameron
    Reviewed-by: Pavel Tatashin
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • When hotplugging memory, it is possible that two calls are being made to
    register_mem_sect_under_node().

    One comes from __add_section()->hotplug_memory_register() and the other
    from add_memory_resource()->link_mem_sections() if we had to register a
    new node.

    In case we had to register a new node, hotplug_memory_register() will
    only handle/allocate the memory_block's, since
    register_mem_sect_under_node() will return right away because the node
    is not online yet.

    I think it is better if we leave hotplug_memory_register() to
    handle/allocate only memory_block's and make link_mem_sections() call
    register_mem_sect_under_node().

    So this patch removes the call to register_mem_sect_under_node() from
    hotplug_memory_register(), and moves the call to link_mem_sections() out
    of the condition, so it will always be called. In this way we only have
    one place where the memory sections are registered.

    Link: http://lkml.kernel.org/r/20180622111839.10071-3-osalvador@techadventures.net
    Signed-off-by: Oscar Salvador
    Reviewed-by: Pavel Tatashin
    Tested-by: Reza Arbab
    Tested-by: Jonathan Cameron
    Cc: Pasha Tatashin
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • This is a small cleanup for the memhotplug code. A lot more could be
    done, but it is better to start somewhere. I tried to unify/remove
    duplicated code.

    The following is what this patchset does:

    1) add_memory_resource() has code to allocate a node in case it was
    offline. Since try_online_node has some code for that as well, I just
    made add_memory_resource() use that so we can remove the duplicated
    code. This is better explained in patch 1/4.

    2) register_mem_sect_under_node() will be called only from
    link_mem_sections()

    3) Make register_mem_sect_under_node() a callback of
    walk_memory_range()

    4) Drop unnecessary checks from register_mem_sect_under_node()

    I have done some tests and I could not see anything broken because of
    this patchset.

    add_memory_resource() contains code to allocate a new node in case it is
    necessary. Since try_online_node() also has some code for this purpose,
    let us make use of that and remove duplicate code.

    This introduces __try_online_node(), which is called by
    add_memory_resource() and try_online_node(). __try_online_node() has
    two new parameters: the start_addr of the node, and whether the node
    should be onlined and registered right away. The latter is always wanted
    when calling from do_cpu_up(), but not when calling from memhotplug
    code. Nothing changes from the point of view of the users of
    try_online_node(), since try_online_node passes start_addr=0 and
    online_node=true to __try_online_node().
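
    A condensed sketch of how the split ends up looking (error handling and
    exact names approximate the resulting code, not the literal patch):

    static int __try_online_node(int nid, u64 start, bool set_node_online)
    {
            pg_data_t *pgdat;

            if (node_online(nid))
                    return 0;

            pgdat = hotadd_new_pgdat(nid, start);
            if (!pgdat)
                    return -ENOMEM;

            if (set_node_online) {
                    node_set_online(nid);
                    register_one_node(nid);
            }
            return 1;
    }

    int try_online_node(int nid)
    {
            int ret;

            mem_hotplug_begin();
            ret = __try_online_node(nid, 0, true);  /* cpu-up path: online now */
            mem_hotplug_done();
            return ret;
    }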

    Link: http://lkml.kernel.org/r/20180622111839.10071-2-osalvador@techadventures.net
    Signed-off-by: Oscar Salvador
    Reviewed-by: Pavel Tatashin
    Tested-by: Reza Arbab
    Tested-by: Jonathan Cameron
    Cc: Pasha Tatashin
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     

08 Jun, 2018

1 commit

    is_pageblock_removable_nolock() is not used outside of
    mm/memory_hotplug.c. Move it next to its unique caller,
    is_mem_section_removable(), and make it static.

    Remove the prototype to silence a gcc warning (W=1):

    mm/page_alloc.c:7704:6: warning: no previous prototype for `is_pageblock_removable_nolock' [-Wmissing-prototypes]

    Link: http://lkml.kernel.org/r/20180509190001.24789-1-malat@debian.org
    Signed-off-by: Mathieu Malaterre
    Suggested-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Malaterre
     

26 May, 2018

1 commit

    The case of a new numa node got missed in avoiding using the node info
    from struct page during hotplug. In this path we have a call to
    register_mem_sect_under_node (which allows us to specify that it is
    hotplug, so don't change the node), via link_mem_sections, which
    unfortunately does not.

    The fix is to pass check_nid through link_mem_sections as well and disable
    it in the new numa node path.

    Note the bug only 'sometimes' manifests depending on what happens to be
    in the struct page structures - there are lots of them and it only needs
    to match one of them.

    The result of the bug is that (with a new memory-only node) we never
    successfully call register_mem_sect_under_node, so we don't get the memory
    associated with the node in sysfs, and meminfo for the node doesn't
    report it.

    It came up whilst testing some arm64 hotplug patches, but appears to be
    universal. Whilst I'm triggering it by removing then reinserting memory
    to a node with no other elements (thus making the node disappear then
    appear again), it appears it would happen on hotplugging memory where
    there was none before, and it doesn't seem to be related to the arm64
    patches.

    These patches call __add_pages (where most of the issue was fixed by
    Pavel's patch). If there is a node at the time of the __add_pages call
    then all is well as it calls register_mem_sect_under_node from there
    with check_nid set to false. Without a node that function returns
    having not done the sysfs related stuff as there is no node to use.
    This is expected but it is the resulting path that fails...

    Exact path to the problem is as follows:

    mm/memory_hotplug.c: add_memory_resource()

    The node is not online so we enter the 'if (new_node)' twice, on the
    second such block there is a call to link_mem_sections which calls
    into

    drivers/base/node.c: link_mem_sections() which calls

    drivers/base/node.c: register_mem_sect_under_node() which calls
    get_nid_for_pfn and keeps trying until the output of that matches
    the expected node (passed all the way down from
    add_memory_resource)

    It is effectively the same fix as the one referred to in the fixes tag,
    just in the code path for a new node, where the comments point out that we
    have to rerun the link creation because it will have failed in
    register_new_memory (as there was no node at the time). (Actually that
    comment is wrong now: we don't have register_new_memory any more, it
    got renamed to hotplug_memory_register in Pavel's patch.)

    Link: http://lkml.kernel.org/r/20180504085311.1240-1-Jonathan.Cameron@huawei.com
    Fixes: fc44f7f9231a ("mm/memory_hotplug: don't read nid from struct page during hotplug")
    Signed-off-by: Jonathan Cameron
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jonathan Cameron
     

12 Apr, 2018

2 commits

    THP migration is hacked into the generic migration with rather
    surprising semantics. The migration allocation callback is supposed to
    check whether the THP can be migrated at once and if that is not the
    case then it allocates a simple page to migrate. unmap_and_move then
    fixes that up by splitting the THP into small pages while moving the head
    page to the newly allocated order-0 page. Remaining pages are moved to
    the LRU list by split_huge_page. The same happens if the THP allocation
    fails. This is really ugly and error prone [1].

    I also believe that split_huge_page to the LRU lists is inherently wrong
    because the tail pages are not migrated. Some callers will just work
    around that by retrying (e.g. memory hotplug). There are other pfn
    walkers which are simply broken though, e.g. madvise_inject_error will
    migrate the head and then advance the next pfn by the huge page size.
    do_move_page_to_node_array and queue_pages_range (migrate_pages, mbind)
    will simply split the THP before migration if THP migration is not
    supported and then fall back to single page migration, but they do not
    handle tail pages if the THP migration path is not able to allocate a
    fresh THP, so we end up with ENOMEM and fail the whole migration, which
    is questionable behavior. Page compaction doesn't try to migrate large
    pages so it should be immune.

    This patch tries to unclutter the situation by moving the special THP
    handling up to the migrate_pages layer where it actually belongs. We
    simply split the THP page into the existing list if unmap_and_move fails
    with ENOMEM and retry. So we will _always_ migrate all THP subpages and
    specific migrate_pages users do not have to deal with this case in a
    special way.
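
    The retry path can be sketched roughly as the following fragment of the
    switch on the per-page migration result in migrate_pages() (condensed;
    bookkeeping and labels approximate the resulting code):

    case -ENOMEM:
            /*
             * THP migration is unsupported or the target allocation failed:
             * split the THP into base pages that stay on the original list
             * and retry them as order-0 migrations.
             */
            if (PageTransHuge(page)) {
                    lock_page(page);
                    rc = split_huge_page_to_list(page, from);
                    unlock_page(page);
                    if (!rc) {
                            list_safe_reset_next(page, page2, lru);
                            goto retry;
                    }
            }
            nr_failed++;
            goto out;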

    [1] http://lkml.kernel.org/r/20171121021855.50525-1-zi.yan@sent.com

    Link: http://lkml.kernel.org/r/20180103082555.14592-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zi Yan
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • No allocation callback is using this argument anymore. new_page_node
    used to use this parameter to convey node_id resp. migration error up
    to move_pages code (do_move_page_to_node_array). The error status never
    made it into the final status field and we have a better way to
    communicate node id to the status field now. All other allocation
    callbacks simply ignored the argument so we can drop it finally.

    [mhocko@suse.com: fix migration callback]
    Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz
    [akpm@linux-foundation.org: fix alloc_misplaced_dst_page()]
    [mhocko@kernel.org: fix build]
    Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Zi Yan
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Kirill A. Shutemov
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

06 Apr, 2018

4 commits

  • Link: http://lkml.kernel.org/r/1519585191-10180-4-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • During memory hotplugging we traverse struct pages three times:

    1. memset(0) in sparse_add_one_section()
    2. loop in __add_section() to do: set_page_node(page, nid); and
    SetPageReserved(page);
    3. loop in memmap_init_zone() to call __init_single_pfn()

    This patch removes the first two loops, and leaves only loop 3. All
    struct pages are initialized in one place, the same as it is done during
    boot.

    The benefits:

    - We improve memory hotplug performance because we are not evicting the
    cache several times and also reduce loop branching overhead.

    - Remove a condition from the hotpath in __init_single_pfn() that was
    added in order to fix the problem that was reported by Bharata in the
    above email thread, thus also improving performance during normal boot.

    - Make memory hotplug more similar to the boot memory initialization
    path because we zero and initialize struct pages only in one
    function.

    - Simplifies memory hotplug struct page initialization code, and thus
    enables future improvements, such as multi-threading the
    initialization of struct pages in order to improve hotplug
    performance even further on larger machines.

    [pasha.tatashin@oracle.com: v5]
    Link: http://lkml.kernel.org/r/20180228030308.1116-7-pasha.tatashin@oracle.com
    Link: http://lkml.kernel.org/r/20180215165920.8570-7-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: Ingo Molnar
    Cc: Michal Hocko
    Cc: Baoquan He
    Cc: Bharata B Rao
    Cc: Daniel Jordan
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Steven Sistare
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • During memory hotplugging the probe routine will leave struct pages
    uninitialized, the same as it is currently done during boot. Therefore,
    we do not want to access the inside of struct pages before
    __init_single_page() is called during onlining.

    Because during hotplug we know that pages in one memory block belong to
    the same numa node, we can skip the checking. We should keep checking
    for the boot case.

    [pasha.tatashin@oracle.com: s/register_new_memory()/hotplug_memory_register()]
    Link: http://lkml.kernel.org/r/20180228030308.1116-6-pasha.tatashin@oracle.com
    Link: http://lkml.kernel.org/r/20180215165920.8570-6-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Reviewed-by: Ingo Molnar
    Cc: Baoquan He
    Cc: Bharata B Rao
    Cc: Daniel Jordan
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Steven Sistare
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • Patch series "optimize memory hotplug", v3.

    This patchset:

    - Improves hotplug performance by eliminating a number of struct page
    traverses during memory hotplug.

    - Fixes some issues with hotplugging, where boundaries were not
    properly checked, and on x86 the block size was not properly aligned
    with the end of memory.

    - Also, potentially improves boot performance by eliminating condition
    from __init_single_page().

    - Adds robustness by verifying that struct pages are correctly
    poisoned when flags are accessed.

    The following experiments were performed on Xeon(R) CPU E7-8895 v3 @
    2.60GHz with 1T RAM:

    booting in qemu with 960G of memory, time to initialize struct pages:

    no-kvm:
                 TRY1         TRY2
    BEFORE:  39.433668    39.39705
    AFTER:   36.903781    36.989329

    with-kvm:
                 TRY1         TRY2
    BEFORE:  10.977447    11.103164
    AFTER:   10.929072    10.751885

    Hotplug 896G memory:
    no-kvm:
                 TRY1         TRY2
    BEFORE: 848.740000   846.910000
    AFTER:  783.070000   786.560000

    with-kvm:
                 TRY1         TRY2
    BEFORE:  34.410000    33.57
    AFTER:   29.810000    29.580000

    This patch (of 6):

    Start qemu with the following arguments:

    -m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G

    This boots the machine with 64G, and adds a device mem1 with 2G which can
    be hotplugged later.

    Also make sure that config has the following turned on:
    CONFIG_MEMORY_HOTPLUG
    CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
    CONFIG_ACPI_HOTPLUG_MEMORY

    Using the qemu monitor, hotplug the memory (make sure the config options
    listed above are enabled):

    (qemu) device_add pc-dimm,id=dimm1,memdev=mem1

    The operation will fail with the following trace:

    WARNING: CPU: 0 PID: 91 at drivers/base/memory.c:205
    pages_correctly_reserved+0xe6/0x110
    Modules linked in:
    CPU: 0 PID: 91 Comm: systemd-udevd Not tainted 4.16.0-rc1_pt_master #29
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
    BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
    RIP: 0010:pages_correctly_reserved+0xe6/0x110
    Call Trace:
    memory_subsys_online+0x44/0xa0
    device_online+0x51/0x80
    store_mem_state+0x5e/0xe0
    kernfs_fop_write+0xfa/0x170
    __vfs_write+0x2e/0x150
    vfs_write+0xa8/0x1a0
    SyS_write+0x4d/0xb0
    do_syscall_64+0x5d/0x110
    entry_SYSCALL_64_after_hwframe+0x21/0x86
    ---[ end trace 6203bc4f1a5d30e8 ]---

    The problem is detected in: drivers/base/memory.c

    static bool pages_correctly_reserved(unsigned long start_pfn)
    205 if (WARN_ON_ONCE(!pfn_valid(pfn)))

    This function loops through every section in the newly added memory
    block and verifies that the first pfn is valid, meaning the section
    exists, has a mapping (struct page array), and is online.

    The block size on x86 is usually 128M, but when the machine is booted
    with more than 64G of memory, the block size is changed to 2G:

    $ cat /sys/devices/system/memory/block_size_bytes
    80000000

    or

    $ dmesg | grep "block size"
    [ 0.086469] x86/mm: Memory block size: 2048MB

    During memory hotplug and hotremove we verify that the range is section
    size aligned, but we actually must verify that it is block size aligned,
    because that is the proper unit for hotplug operations. See
    Documentation/memory-hotplug.txt.

    So, when the start_pfn of newly added memory is not block size aligned,
    we can get a memory block that has only part of its sections properly
    populated.

    In our case the start_pfn starts from the last_pfn (end of physical
    memory).

    $ dmesg | grep last_pfn
    [ 0.000000] e820: last_pfn = 0x1040000 max_arch_pfn = 0x400000000

    0x1040000 == 65G, and so is not 2G aligned!

    The fix is to enforce that memory that is hotplugged and hotremoved is
    block size aligned.
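
    A condensed sketch of the alignment check this introduces (names and the
    exact error message approximate the resulting helper, not the literal
    patch):

    static int check_hotplug_memory_range(u64 start, u64 size)
    {
            unsigned long block_sz = memory_block_size_bytes();
            u64 block_nr_pages = block_sz >> PAGE_SHIFT;
            u64 nr_pages = size >> PAGE_SHIFT;

            /* memory range must be block size aligned */
            if (!nr_pages || !IS_ALIGNED(PFN_DOWN(start), block_nr_pages) ||
                !IS_ALIGNED(nr_pages, block_nr_pages)) {
                    pr_err("Block size [%#lx] unaligned hotplug range: start %#llx, size %#llx",
                           block_sz, start, size);
                    return -EINVAL;
            }
            return 0;
    }

    Both the hotplug and hotremove paths would then call this helper before
    doing any work, which is what produces the error shown in the example
    below.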

    With this fix, running the above sequence yields the following result:

    (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
    Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
    size 0x80000000
    acpi PNP0C80:00: add_memory failed
    acpi PNP0C80:00: acpi_memory_enable_device() error
    acpi PNP0C80:00: Enumeration failure

    Link: http://lkml.kernel.org/r/20180213193159.14606-2-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: Ingo Molnar
    Acked-by: Michal Hocko
    Cc: Baoquan He
    Cc: Bharata B Rao
    Cc: Daniel Jordan
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Steven Sistare
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

07 Feb, 2018

1 commit

  • Pull libnvdimm updates from Ross Zwisler:

    - Require struct page by default for filesystem DAX to remove a number
    of surprising failure cases. This includes failures with direct I/O,
    gdb and fork(2).

    - Add support for the new Platform Capabilities Structure added to the
    NFIT in ACPI 6.2a. This new table tells us whether the platform
    supports flushing of CPU and memory controller caches on unexpected
    power loss events.

    - Revamp vmem_altmap and dev_pagemap handling to clean up code and
    better support future PCI P2P uses.

    - Deprecate the ND_IOCTL_SMART_THRESHOLD command whose payload has
    become out-of-sync with recent versions of the NVDIMM_FAMILY_INTEL
    spec, and instead rely on the generic ND_CMD_CALL approach used by
    the two other IOCTL families, NVDIMM_FAMILY_{HPE,MSFT}.

    - Enhance nfit_test so we can test some of the new things added in
    version 1.6 of the DSM specification. This includes testing firmware
    download and simulating the Last Shutdown State (LSS) status.

    * tag 'libnvdimm-for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (37 commits)
    libnvdimm, namespace: remove redundant initialization of 'nd_mapping'
    acpi, nfit: fix register dimm error handling
    libnvdimm, namespace: make min namespace size 4K
    tools/testing/nvdimm: force nfit_test to depend on instrumented modules
    libnvdimm/nfit_test: adding support for unit testing enable LSS status
    libnvdimm/nfit_test: add firmware download emulation
    nfit-test: Add platform cap support from ACPI 6.2a to test
    libnvdimm: expose platform persistence attribute for nd_region
    acpi: nfit: add persistent memory control flag for nd_region
    acpi: nfit: Add support for detect platform CPU cache flush on power loss
    device-dax: Fix trailing semicolon
    libnvdimm, btt: fix uninitialized err_lock
    dax: require 'struct page' by default for filesystem dax
    ext2: auto disable dax instead of failing mount
    ext4: auto disable dax instead of failing mount
    mm, dax: introduce pfn_t_special()
    mm: Fix devm_memremap_pages() collision handling
    mm: Fix memory size alignment in devm_memremap_pages_release()
    memremap: merge find_dev_pagemap into get_dev_pagemap
    memremap: change devm_memremap_pages interface to use struct dev_pagemap
    ...

    Linus Torvalds
     

01 Feb, 2018

3 commits

  • In register_page_bootmem_info_section() we call __nr_to_section() in
    order to get the mem_section struct at the beginning of the function.
    Since we already got it, there is no need for a second call to
    __nr_to_section().

    Link: http://lkml.kernel.org/r/20171207102914.GA12396@techadventures.net
    Signed-off-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • When we call register_page_bootmem_info_section() having
    CONFIG_SPARSEMEM_VMEMMAP enabled, we check if the pfn is valid.

    This check is redundant as we already checked this in
    register_page_bootmem_info_node() before calling
    register_page_bootmem_info_section(), so let's get rid of it.

    Link: http://lkml.kernel.org/r/20171205143422.GA31458@techadventures.net
    Signed-off-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
    Pulling cpu hotplug locks inside an mm core function like
    lru_add_drain_all just asks for problems and the recent lockdep splat
    [1] just proves this. While the usage in that particular case might be
    wrong we should avoid the locking as lru_add_drain_all() is used in many
    places. It seems that this is not all that hard to achieve actually.

    We have done the same thing for the analogous drain_all_pages in
    commit a459eeb7b852 ("mm, page_alloc: do not depend on cpu hotplug locks
    inside the allocator"). All we have to care about is to handle

    - the work item might be executed on a different cpu by a worker from
    an unbound pool, so it doesn't run pinned on the cpu

    - we have to make sure that we do not race with page_alloc_cpu_dead
    calling lru_add_drain_cpu

    The first part is already handled because the worker calls lru_add_drain
    which disables preemption when calling lru_add_drain_cpu on the local
    cpu it is draining. The latter is true because page_alloc_cpu_dead is
    called on the controlling CPU after the hotplugged CPU vanished
    completely.

    [1] http://lkml.kernel.org/r/089e0825eec8955c1f055c83d476@google.com

    [add a cpu hotplug locking interaction as per tglx]
    Link: http://lkml.kernel.org/r/20171116120535.23765-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Peter Zijlstra
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

09 Jan, 2018

6 commits


16 Nov, 2017

3 commits

  • Here, pfn_to_node should be page_to_nid.

    Link: http://lkml.kernel.org/r/1510735205-22540-1-git-send-email-fan.du@intel.com
    Signed-off-by: Fan Du
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fan Du
     
    We have had a hardcoded 120s timeout after which the memory offline fails,
    basically since hot remove was introduced. This is essentially
    a policy implemented in the kernel. Moreover there is no way to adjust
    the timeout and so we are sometimes facing memory offline failures if
    the system is under heavy memory pressure or a very intensive CPU
    workload on large machines.

    It is not very clear what purpose the timeout actually serves. The
    offline operation is interruptible by a signal so if userspace wants
    some timeout based termination this can be done trivially by sending a
    signal.

    If there is a strong usecase to do this from the kernel then we should
    do it properly and have it tunable from userspace, with the timeout
    disabled by default, along with an explanation of who uses it and for
    what purpose.

    Link: http://lkml.kernel.org/r/20170918070834.13083-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: KAMEZAWA Hiroyuki
    Cc: Reza Arbab
    Cc: Yasuaki Ishimatsu
    Cc: Xishi Qiu
    Cc: Igor Mammedov
    Cc: Vitaly Kuznetsov
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "mm, memory_hotplug: redefine memory offline retry logic", v2.

    While testing memory hotplug on a large 4TB machine we have noticed that
    memory offlining is just too eager to fail. The primary reason is that
    the retry logic is just too easy to give up. We have 4 ways out of the
    offline

    - we have a permanent failure (isolation or memory notifiers fail,
    or hugetlb pages cannot be dropped)
    - userspace sends a signal
    - a hardcoded 120s timeout expires
    - page migration fails 5 times

    This is way too convoluted and it doesn't scale very well. We have seen
    both temporary migration failures as well as the 120s timeout being
    triggered. After removing those restrictions we were able to pass stress
    testing during memory hot remove without any other negative side effects
    observed. Therefore I suggest dropping both hard coded policies. I
    couldn't find any specific reason for them in the changelog, and I didn't
    get any response [1] from Kamezawa either. If we need some
    upper bound - e.g. timeout based - then we should have a proper and
    user defined policy for that. In any case there should be a clear use
    case when introducing it.

    This patch (of 2):

    Memory offlining can fail too eagerly under heavy memory pressure.

    page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
    flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
    page dumped because: isolation failed
    page->mem_cgroup:ffff8801cd662000
    memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed

    Isolation has failed here because the page is not on the LRU. Most
    probably because it was on the pcp LRU cache or it has been removed from
    the LRU already but it hasn't been freed yet. In both cases the page
    doesn't look non-migratable, so retrying more makes sense.

    __offline_pages seems rather cluttered when it comes to the retry logic.
    We have 5 retries at maximum and a timeout. We could argue whether the
    timeout makes sense, but failing just because of a race when somebody
    isolates a page from the LRU or puts it on a pcp LRU list is just wrong.
    It only takes a race with a process which unmaps some pages and removes
    them from the LRU list and we can fail the whole offline because of
    something that is a temporary condition and actually not harmful for the
    offline.

    Please note that unmovable pages should be already excluded during
    start_isolate_page_range. We could argue that has_unmovable_pages is
    racy and the MIGRATE_MOVABLE check doesn't provide any hard guarantee
    either, but kernel zones (aka < ZONE_MOVABLE) will very likely detect
    unmovable pages in most cases and the movable zone shouldn't contain
    unmovable pages at all. Some of those pages might be pinned but not
    forever because
    that would be a bug on its own. In any case the context is still
    interruptible and so the userspace can easily bail out when the
    operation takes too long. This is certainly better behavior than a
    hardcoded retry loop which is racy.

    Fix this by removing the max retry count and relying only on the timeout,
    resp. interruption by a signal from userspace. Also retry rather
    than fail when check_pages_isolated sees some !free pages, because those
    could be a result of the race as well.

    Link: http://lkml.kernel.org/r/20170918070834.13083-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: KAMEZAWA Hiroyuki
    Cc: Reza Arbab
    Cc: Yasuaki Ishimatsu
    Cc: Xishi Qiu
    Cc: Igor Mammedov
    Cc: Vitaly Kuznetsov
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

04 Oct, 2017

3 commits

    find_{smallest|biggest}_section_pfn() find the smallest/biggest section
    and return the pfn of the section. But the functions are defined as int,
    so they can only return values in the range 0x00000000 - 0xffffffff. This
    means that if the memory address is over 16TB, the functions do not work
    correctly.

    To handle 64 bit value, the patch defines
    find_{smallest|biggest}_section_pfn() as unsigned long.

    Fixes: 815121d2b5cd ("memory_hotplug: clear zone when removing the memory")
    Link: http://lkml.kernel.org/r/d9d5593a-d0a4-c4be-ab08-493df59a85c6@gmail.com
    Signed-off-by: Yasuaki Ishimatsu
    Acked-by: Michal Hocko
    Cc: Xishi Qiu
    Cc: Reza Arbab
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    YASUAKI ISHIMATSU
     
    pfn_to_section_nr() and section_nr_to_pfn() are defined as macros.
    pfn_to_section_nr() has no issue even if it is defined as a macro. But
    section_nr_to_pfn() has an overflow issue if sec is defined as int.

    section_nr_to_pfn() just shifts sec by PFN_SECTION_SHIFT. If sec is
    defined as unsigned long, section_nr_to_pfn() returns the pfn as a 64-bit
    value. But if sec is defined as int, section_nr_to_pfn() returns the pfn
    as a 32-bit value.

    __remove_section() calculates start_pfn using section_nr_to_pfn() and a
    scn_nr defined as int. So if the hot-removed memory address is over 16TB,
    an overflow occurs and section_nr_to_pfn() does not calculate the correct
    pfn.

    To make callers use a properly typed argument, the patch changes the
    macros to inline functions.
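
    The conversion amounts to roughly the following (sketch of the resulting
    definitions):

    static inline unsigned long pfn_to_section_nr(unsigned long pfn)
    {
            return pfn >> PFN_SECTION_SHIFT;
    }

    static inline unsigned long section_nr_to_pfn(unsigned long sec)
    {
            return sec << PFN_SECTION_SHIFT;
    }

    With unsigned long parameters and return values, passing an int section
    number can no longer truncate the resulting pfn to 32 bits.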

    Fixes: 815121d2b5cd ("memory_hotplug: clear zone when removing the memory")
    Link: http://lkml.kernel.org/r/e643a387-e573-6bbf-d418-c60c8ee3d15e@gmail.com
    Signed-off-by: Yasuaki Ishimatsu
    Acked-by: Michal Hocko
    Cc: Xishi Qiu
    Cc: Reza Arbab
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    YASUAKI ISHIMATSU
     
  • Patch series "mm, memory_hotplug: fix few soft lockups in memory
    hotadd".

    Johannes has noticed a few soft lockups when adding a large nvdimm device.
    All of them were caused by a long loop without any explicit cond_resched,
    which is a problem for !PREEMPT kernels.

    The fix is quite straightforward. Just make sure that cond_resched gets
    called from time to time.

    This patch (of 3):

    __add_pages gets a pfn range to add and there is no upper bound for a
    single call. This is usually a memory block aligned size for the
    regular memory hotplug - smaller sizes are usual for memory ballooning
    drivers, or the whole NUMA node for physical memory online. There is no
    explicit scheduling point in that code path though.

    This can lead to long latencies while __add_pages is executed and we
    have even seen a soft lockup report during nvdimm initialization with
    !PREEMPT kernel

    NMI watchdog: BUG: soft lockup - CPU#11 stuck for 23s! [kworker/u641:3:832]
    [...]
    Workqueue: events_unbound async_run_entry_fn
    task: ffff881809270f40 ti: ffff881809274000 task.ti: ffff881809274000
    RIP: _raw_spin_unlock_irqrestore+0x11/0x20
    RSP: 0018:ffff881809277b10 EFLAGS: 00000286
    [...]
    Call Trace:
    sparse_add_one_section+0x13d/0x18e
    __add_pages+0x10a/0x1d0
    arch_add_memory+0x4a/0xc0
    devm_memremap_pages+0x29d/0x430
    pmem_attach_disk+0x2fd/0x3f0 [nd_pmem]
    nvdimm_bus_probe+0x64/0x110 [libnvdimm]
    driver_probe_device+0x1f7/0x420
    bus_for_each_drv+0x52/0x80
    __device_attach+0xb0/0x130
    bus_probe_device+0x87/0xa0
    device_add+0x3fc/0x5f0
    nd_async_device_register+0xe/0x40 [libnvdimm]
    async_run_entry_fn+0x43/0x150
    process_one_work+0x14e/0x410
    worker_thread+0x116/0x490
    kthread+0xc7/0xe0
    ret_from_fork+0x3f/0x70
    DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70

    Fix this by adding cond_resched once per memory section in the
    given pfn range. Each section is a constant amount of work which itself
    is not too expensive, but many of them will just add up.

    Link: http://lkml.kernel.org/r/20170918121410.24466-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Johannes Thumshirn
    Tested-by: Johannes Thumshirn
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

09 Sep, 2017

2 commits

    HMM (heterogeneous memory management) needs struct page to support
    migration from system main memory to device memory. The reasons for HMM
    and for migration to device memory are explained with the HMM core patch.

    This patch deals with device memory that is un-addressable memory (ie the
    CPU cannot access it). Hence we do not want those struct pages to be
    managed like regular memory. That is why we extend ZONE_DEVICE to support
    different types of memory.

    A persistent memory type is defined for existing users of ZONE_DEVICE and
    a new device un-addressable type is added for the un-addressable memory
    type. There is a clear separation between what is expected from each
    memory type, and existing users of ZONE_DEVICE are unaffected by the new
    requirement and the new use of the un-addressable type. All specific code
    paths are protected with a test against the memory type.

    Because the memory is un-addressable we use a new special swap type for
    when a page is migrated to device memory (this reduces the maximum number
    of swap files).

    The two main additions to ZONE_DEVICE, beside the memory type, are two
    callbacks. The first one, page_free(), is called whenever the page
    refcount reaches 1 (which means the page is free, as a ZONE_DEVICE page
    never reaches a refcount of 0). This allows the device driver to manage
    its memory and the associated struct page.

    The second callback, page_fault(), happens when there is a CPU access to
    an address that is backed by a device page (which is un-addressable by
    the CPU). This callback is responsible for migrating the page back to
    system main memory. The device driver cannot block migration back to
    system memory; HMM makes sure that such a page cannot be pinned into
    device memory.

    If the device is in some error condition and cannot migrate memory back,
    then a CPU page fault to device memory should end with SIGBUS.
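
    For orientation, the additions sketched above take roughly the following
    shape (illustrative only; field names, callback signatures and the set of
    memory types are approximations of that era's include/linux/memremap.h):

    enum memory_type {
            MEMORY_DEVICE_HOST = 0,     /* addressable, persistent-memory style */
            MEMORY_DEVICE_PRIVATE,      /* device memory the CPU cannot access */
    };

    struct dev_pagemap {
            /* called on a CPU fault to an address backed by a device page */
            int (*page_fault)(struct vm_area_struct *vma, unsigned long addr,
                              const struct page *page, unsigned int flags,
                              pmd_t *pmdp);
            /* called when a ZONE_DEVICE page's refcount drops to 1 (free) */
            void (*page_free)(struct page *page, void *data);
            enum memory_type type;
            /* ... resource, percpu ref, driver private data, ... */
    };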

    [arnd@arndb.de: fix warning]
    Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Arnd Bergmann
    Acked-by: Dan Williams
    Cc: Ross Zwisler
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • This patch enables thp migration for memory hotremove.

    Link: http://lkml.kernel.org/r/20170717193955.20207-11-zi.yan@sent.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Zi Yan
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

07 Sep, 2017

3 commits

  • zonelists_mutex was introduced by commit 4eaf3f64397c ("mem-hotplug: fix
    potential race while building zonelist for new populated zone") to
    protect zonelist building from races. This is no longer needed though
    because both memory online and offline are fully serialized. New users
    have grown since then.

    Notably setup_per_zone_wmarks wants to prevent races between memory
    hotplug, khugepaged setup and manual min_free_kbytes update via sysctl
    (see cfd3da1e49bb ("mm: Serialize access to min_free_kbytes")). Let's
    add a private lock for that purpose. This will not prevent seeing a
    halfway-through memory hotplug operation but that shouldn't be a big
    deal because memory hotplug will update watermarks explicitly so we will
    eventually get a full picture. The lock just makes sure we won't race
    when updating watermarks, leading to weird results.

    Also __build_all_zonelists manipulates global data so add a private lock
    for it as well. This doesn't seem to be necessary today but it is more
    robust to have a lock there.

    While we are at it make sure we document that memory online/offline
    depends on a full serialization either via mem_hotplug_begin() or
    device_lock.

    Link: http://lkml.kernel.org/r/20170721143915.14161-9-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Haicheng Li
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • try_online_node calls hotadd_new_pgdat which already calls
    build_all_zonelists. So the additional call is redundant. Even though
    hotadd_new_pgdat will only initialize zonelists of the new node this is
    the right thing to do because such a node doesn't have any memory so
    other zonelists would ignore all the zones from this node anyway.

    Link: http://lkml.kernel.org/r/20170721143915.14161-6-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Toshi Kani
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • build_all_zonelists gets a zone parameter to initialize zone's pagesets.
    There is only a single user which gives a non-NULL zone parameter and
    that one doesn't really need the rest of the build_all_zonelists (see
    commit 6dcd73d7011b ("memory-hotplug: allocate zone's pcp before
    onlining pages")).

    Therefore remove setup_zone_pageset from build_all_zonelists and call it
    from its only user directly. This will also remove a pointless zonelists
    rebuilding, which is always good.

    Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Toshi Kani
    Cc: Wen Congyang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko