09 Jan, 2020

1 commit

  • commit feee6b2989165631b17ac6d4ccdbf6759254e85a upstream.

    We currently try to shrink a single zone when removing memory. We use
    the zone of the first page of the memory we are removing. If that
    memmap was never initialized (e.g., memory was never onlined), we will
    read garbage and can trigger kernel BUGs (due to a stale pointer):

    BUG: unable to handle page fault for address: 000000000000353d
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP PTI
    CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    RIP: 0010:clear_zone_contiguous+0x5/0x10
    Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
    RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
    RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
    RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
    R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
    FS: 0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    __remove_pages+0x4b/0x640
    arch_remove_memory+0x63/0x8d
    try_remove_memory+0xdb/0x130
    __remove_memory+0xa/0x11
    acpi_memory_device_remove+0x70/0x100
    acpi_bus_trim+0x55/0x90
    acpi_device_hotplug+0x227/0x3a0
    acpi_hotplug_work_fn+0x1a/0x30
    process_one_work+0x221/0x550
    worker_thread+0x50/0x3b0
    kthread+0x105/0x140
    ret_from_fork+0x3a/0x50
    Modules linked in:
    CR2: 000000000000353d

    Instead, shrink the zones when offlining memory or when onlining failed.
    Introduce and use remove_pfn_range_from_zone() for that. We now
    properly shrink the zones, even if we have DIMMs whereby

    - Some memory blocks fall into no zone (never onlined)

    - Some memory blocks fall into multiple zones (offlined+re-onlined)

    - Multiple memory blocks that fall into different zones

    Drop the zone parameter (with a potentially dubious value) from
    __remove_pages() and __remove_section().
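
    The idea of looking up the zone only while it is known to be valid can
    be sketched in userspace. This is a simplified model, not kernel code:
    struct block_model, ZONE_NONE and the three helpers are hypothetical
    names introduced for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

#define ZONE_NONE (-1) /* hypothetical marker: no valid zone */

/* Hypothetical per-block state; not a kernel structure. */
struct block_model {
    bool online;  /* zone_id is only meaningful while this is true */
    int zone_id;
};

/* Zone lookup that refuses to trust the state of a block that is not online. */
int block_zone(const struct block_model *b)
{
    return b->online ? b->zone_id : ZONE_NONE;
}

/* Onlining is the point where the block gets associated with a zone. */
void block_online(struct block_model *b, int zone_id)
{
    b->zone_id = zone_id;
    b->online = true;
}

/*
 * Offlining is the last point where the zone is known to be valid, so the
 * zone to shrink is returned here instead of being looked up at removal.
 */
int block_offline(struct block_model *b)
{
    int zone = block_zone(b);

    assert(b->online); /* offlining a block that is not online is a caller bug */
    b->online = false;
    return zone;
}
```

    In this model a block that was never onlined reports ZONE_NONE, so a
    removal path has nothing to shrink and never reads a stale zone value.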

    Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com
    Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
    Signed-off-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: "Matthew Wilcox (Oracle)"
    Cc: "Aneesh Kumar K.V"
    Cc: Pavel Tatashin
    Cc: Greg Kroah-Hartman
    Cc: Dan Williams
    Cc: Logan Gunthorpe
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     

19 Oct, 2019

1 commit

  • Patch series "mm/memory_hotplug: Shrink zones before removing memory",
    v6.

    This series fixes the access of uninitialized memmaps when shrinking
    zones/nodes and when removing memory. Also, it contains all fixes for
    crashes that can be triggered when removing certain namespaces using
    memunmap_pages() - ZONE_DEVICE, reported by Aneesh.

    We stop trying to shrink ZONE_DEVICE, as it's buggy, fixing it would be
    more involved (we don't have SECTION_IS_ONLINE as an indicator), and
    shrinking is only of limited use (set_zone_contiguous() cannot detect
    the ZONE_DEVICE as contiguous).

    We continue shrinking !ZONE_DEVICE zones; however, I reduced the amount
    of code to a minimum. Shrinking is necessary to keep zone->contiguous
    set where possible, especially on memory unplug of DIMMs at zone
    boundaries.

    --------------------------------------------------------------------------

    Zones are now properly shrunk when offlining memory blocks or when
    onlining failed. This allows zones to be shrunk properly on memory unplug
    even if the separate memory blocks of a DIMM were onlined to different
    zones or re-onlined to a different zone after offlining.

    Example:

    :/# cat /proc/zoneinfo
    Node 1, zone Movable
            spanned 0
            present 0
            managed 0
    :/# echo "online_movable" > /sys/devices/system/memory/memory41/state
    :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
    :/# cat /proc/zoneinfo
    Node 1, zone Movable
            spanned 98304
            present 65536
            managed 65536
    :/# echo 0 > /sys/devices/system/memory/memory43/online
    :/# cat /proc/zoneinfo
    Node 1, zone Movable
            spanned 32768
            present 32768
            managed 32768
    :/# echo 0 > /sys/devices/system/memory/memory41/online
    :/# cat /proc/zoneinfo
    Node 1, zone Movable
            spanned 0
            present 0
            managed 0
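
    The counters in the example can be reproduced with a small userspace
    model. It assumes 32768 pages per memory block (128 MiB blocks with
    4 KiB pages, inferred from the numbers above) and recomputes the
    counters from scratch, which is simpler than what the kernel does.
    struct zone_model and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGES_PER_BLOCK 32768UL /* assumption: 128 MiB blocks, 4 KiB pages */
#define NR_BLOCKS 64            /* hypothetical size of the model */

struct zone_model {
    bool online[NR_BLOCKS];  /* memory blocks currently online in this zone */
    unsigned long spanned;   /* pages from the first to the last online block */
    unsigned long present;   /* pages that belong to online blocks */
};

/* Recompute both counters from the online map (simplified shrink/grow). */
void zone_recalc(struct zone_model *z)
{
    int first = -1, last = -1;
    unsigned long count = 0;

    for (int i = 0; i < NR_BLOCKS; i++) {
        if (!z->online[i])
            continue;
        if (first < 0)
            first = i;
        last = i;
        count++;
    }
    z->present = count * PAGES_PER_BLOCK;
    z->spanned = first < 0 ? 0 :
                 (unsigned long)(last - first + 1) * PAGES_PER_BLOCK;
}

void zone_block_online(struct zone_model *z, int block)
{
    z->online[block] = true;
    zone_recalc(z);
}

void zone_block_offline(struct zone_model *z, int block)
{
    z->online[block] = false;
    zone_recalc(z);
}
```

    With blocks 41 and 43 online the span covers three blocks (41, the
    hole at 42, and 43), which gives spanned 98304 and present 65536.
    Offlining block 43 shrinks the span to the single remaining block.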

    This patch (of 10):

    With an altmap, the memmaps falling into the reserved altmap space are
    not initialized and, therefore, contain a garbage NID and a garbage zone.
    Make sure to read the NID/zone from a memmap that was initialized.
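
    A minimal userspace sketch of that rule follows. It assumes the
    uninitialized entries sit at the start of the range and are described
    by a single count. struct page_model, struct altmap_model, POISON and
    both helpers are hypothetical names, not the kernel's types.

```c
#include <assert.h>
#include <stddef.h>

#define POISON (-1) /* hypothetical stand-in for garbage in an uninitialized entry */

/* Reduced model of one memmap entry. */
struct page_model {
    int nid;
    int zone_id;
};

/* Reduced model of an altmap: only the count of leading reserved entries. */
struct altmap_model {
    unsigned long reserved;
};

/* Index of the first memmap entry that was initialized. */
unsigned long first_initialized_index(const struct altmap_model *altmap)
{
    return altmap ? altmap->reserved : 0;
}

/* Read the node id from an initialized entry instead of from entry 0. */
int range_nid(const struct page_model *memmap,
              const struct altmap_model *altmap)
{
    return memmap[first_initialized_index(altmap)].nid;
}
```

    Without an altmap the helper falls back to index 0, so the common case
    is unchanged.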

    This fixes a kernel crash that is observed when destroying a namespace:

    kernel BUG at include/linux/mm.h:1107!
    cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
    pc: c0000000004b9728: memunmap_pages+0x238/0x340
    lr: c0000000004b9724: memunmap_pages+0x234/0x340
    ...
    pid = 3669, comm = ndctl
    kernel BUG at include/linux/mm.h:1107!
    devm_action_release+0x30/0x50
    release_nodes+0x268/0x2d0
    device_release_driver_internal+0x174/0x240
    unbind_store+0x13c/0x190
    drv_attr_store+0x44/0x60
    sysfs_kf_write+0x70/0xa0
    kernfs_fop_write+0x1ac/0x290
    __vfs_write+0x3c/0x70
    vfs_write+0xe4/0x200
    ksys_write+0x7c/0x140
    system_call+0x5c/0x68

    The "page_zone(pfn_to_page(pfn))" call was introduced by 69324b8f4833
    ("mm, devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support"). However,
    I think we will never have driver-reserved memory with
    MEMORY_DEVICE_PRIVATE (no altmap AFAIKS).

    [david@redhat.com: minimize code changes, rephrase description]
    Link: http://lkml.kernel.org/r/20191006085646.5768-2-david@redhat.com
    Fixes: 2c2a5af6fed2 ("mm, memory_hotplug: add nid parameter to arch_remove_memory")
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Cc: Logan Gunthorpe
    Cc: Ira Weiny
    Cc: Damian Tometzki
    Cc: Alexander Duyck
    Cc: Alexander Potapenko
    Cc: Andy Lutomirski
    Cc: Anshuman Khandual
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Christophe Leroy
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: Gerald Schaefer
    Cc: Greg Kroah-Hartman
    Cc: Halil Pasic
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jun Yao
    Cc: Mark Rutland
    Cc: Masahiro Yamada
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Mike Rapoport
    Cc: Oscar Salvador
    Cc: Pankaj Gupta
    Cc: Paul Mackerras
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: Qian Cai
    Cc: Rich Felker
    Cc: Robin Murphy
    Cc: Steve Capper
    Cc: Thomas Gleixner
    Cc: Tom Lendacky
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Cc: Yu Zhao
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

08 Oct, 2019

1 commit

  • The SECTION_SIZE and SECTION_MASK macros are no longer used, but they
    conflict with existing definitions on the arm64 platform, causing the
    following warnings during build. Let's drop these unused macros.

    mm/memremap.c:16: warning: "SECTION_MASK" redefined
    #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
    arch/arm64/include/asm/pgtable-hwdef.h:79: note: this is the location of the previous definition
    #define SECTION_MASK (~(SECTION_SIZE-1))

    mm/memremap.c:17: warning: "SECTION_SIZE" redefined
    #define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
    arch/arm64/include/asm/pgtable-hwdef.h:78: note: this is the location of the previous definition
    #define SECTION_SIZE (_AC(1, UL) << SECTION_SHIFT)

    Link: http://lkml.kernel.org/r/1569312010-31313-1-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Reported-by: kbuild test robot
    Reviewed-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Cc: Logan Gunthorpe
    Cc: Ira Weiny
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

22 Aug, 2019

1 commit

  • From rdma.git

    Jason Gunthorpe says:

    ====================
    This is a collection of general cleanups for ODP to clarify some of the
    flows around umem creation and use of the interval tree.
    ====================

    The branch is based on v5.3-rc5 due to dependencies, and is being taken
    into hmm.git due to dependencies in the next patches.

    * odp_fixes:
    RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
    RDMA/mlx5: Use ib_umem_start instead of umem.address
    RDMA/core: Make invalidate_range a device operation
    RDMA/odp: Use kvcalloc for the dma_list and page_list
    RDMA/odp: Check for overflow when computing the umem_odp end
    RDMA/odp: Provide ib_umem_odp_release() to undo the allocs
    RDMA/odp: Split creating a umem_odp from ib_umem_get
    RDMA/odp: Make the three ways to create a umem_odp clear
    RMDA/odp: Consolidate umem_odp initialization
    RDMA/odp: Make it clearer when a umem is an implicit ODP umem
    RDMA/odp: Iterate over the whole rbtree directly
    RDMA/odp: Use the common interval tree library instead of generic
    RDMA/mlx5: Fix MR npages calculation for IB_ACCESS_HUGETLB

    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

14 Aug, 2019

1 commit

  • When a ZONE_DEVICE private page is freed, the page->mapping field can be
    set. If this page is reused as an anonymous page, the previous value
    can prevent the page from being inserted into the CPU's anon rmap table.
    For example, when migrating a pte_none() page to device memory:

    migrate_vma(ops, vma, start, end, src, dst, private)
      migrate_vma_collect()
        src[] = MIGRATE_PFN_MIGRATE
      migrate_vma_prepare()
        /* no page to lock or isolate so OK */
      migrate_vma_unmap()
        /* no page to unmap so OK */
      ops->alloc_and_copy()
        /* driver allocates ZONE_DEVICE page for dst[] */
      migrate_vma_pages()
        migrate_vma_insert_page()
          page_add_new_anon_rmap()
            __page_set_anon_rmap()
              /* This check sees the page's stale mapping field */
              if (PageAnon(page))
                return
              /* page->mapping is not updated */

    The result is that the migration appears to succeed but a subsequent CPU
    fault will be unable to migrate the page back to system memory or worse.

    Clear the page->mapping field when freeing the ZONE_DEVICE page so stale
    pointer data doesn't affect future page use.
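
    The effect of the stale field can be modelled in userspace. This is a
    sketch under assumptions: the low bit of the mapping word is taken to
    tag an anonymous mapping, and struct page_model plus all function
    names are hypothetical stand-ins for the kernel paths named above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAPPING_ANON_BIT 0x1UL /* assumption: low bit tags an anon mapping */

/* Reduced model of a page: only the mapping word matters here. */
struct page_model {
    uintptr_t mapping;
};

bool page_is_anon(const struct page_model *page)
{
    return (page->mapping & MAPPING_ANON_BIT) != 0;
}

/* Mirrors the early return: a page that already looks anon is left alone. */
void set_anon_rmap(struct page_model *page, uintptr_t anon_vma)
{
    if (page_is_anon(page))
        return;
    page->mapping = anon_vma | MAPPING_ANON_BIT;
}

/* Free path before the fix: the mapping word survives the free. */
void free_device_page_stale(struct page_model *page)
{
    (void)page;
}

/* Free path after the fix: the mapping word is cleared. */
void free_device_page_cleared(struct page_model *page)
{
    page->mapping = 0;
}
```

    After free_device_page_stale(), a later set_anon_rmap() with a new
    value leaves the old mapping in place. After
    free_device_page_cleared(), the new value is stored.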

    Link: http://lkml.kernel.org/r/20190719192955.30462-3-rcampbell@nvidia.com
    Fixes: b7a523109fb5c9d2d6dd ("mm: don't clear ->mapping in hmm_devmem_free")
    Signed-off-by: Ralph Campbell
    Reviewed-by: John Hubbard
    Reviewed-by: Christoph Hellwig
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Cc: Logan Gunthorpe
    Cc: Ira Weiny
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Jan Kara
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: "Jérôme Glisse"
    Cc: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     

10 Aug, 2019

1 commit

  • Currently, attempts to shut down and re-enable a device-dax instance
    trigger:

    Missing reference count teardown definition
    WARNING: CPU: 37 PID: 1608 at mm/memremap.c:211 devm_memremap_pages+0x234/0x850
    [..]
    RIP: 0010:devm_memremap_pages+0x234/0x850
    [..]
    Call Trace:
    dev_dax_probe+0x66/0x190 [device_dax]
    really_probe+0xef/0x390
    driver_probe_device+0xb4/0x100
    device_driver_attach+0x4f/0x60

    Given that the setup path initializes pgmap->ref, arrange for it to be
    also torn down so devm_memremap_pages() is ready to be called again and
    not be mistaken for the 3rd-party per-cpu-ref case.
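
    The setup/teardown asymmetry can be sketched in userspace as follows.
    This is a simplified model, not the kernel implementation: struct
    pagemap_model, its fields and both functions are hypothetical, and
    the reset_internal_ref flag exists only to contrast the old and the
    fixed behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced model of a pagemap with an optional internal refcount. */
struct pagemap_model {
    int *ref;              /* NULL means "set up the internal refcount" */
    int internal_ref;
    bool has_teardown_ops; /* stands in for caller-provided teardown callbacks */
};

/* Returns 0 on success, -1 for the "missing teardown definition" case. */
int pagemap_setup(struct pagemap_model *pgmap)
{
    if (!pgmap->ref) {
        pgmap->internal_ref = 1;
        pgmap->ref = &pgmap->internal_ref;
        return 0;
    }
    /* A preset ref is treated as caller-owned and needs teardown callbacks. */
    return pgmap->has_teardown_ops ? 0 : -1;
}

/* With reset_internal_ref set, teardown undoes what setup initialized. */
void pagemap_teardown(struct pagemap_model *pgmap, bool reset_internal_ref)
{
    pgmap->internal_ref = 0;
    if (reset_internal_ref && pgmap->ref == &pgmap->internal_ref)
        pgmap->ref = NULL;
}
```

    Without the reset, a second pagemap_setup() sees the leftover pointer,
    treats it as a caller-owned reference and fails. With the reset, the
    second setup takes the internal-refcount path again.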

    Fixes: 24917f6b1041 ("memremap: provide an optional internal refcount in struct dev_pagemap")
    Reported-by: Fan Du
    Tested-by: Vishal Verma
    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Ira Weiny
    Cc: Jason Gunthorpe
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/156530042781.2068700.8733813683117819799.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     

03 Aug, 2019

1 commit

  • memremap.c implements MM functionality for ZONE_DEVICE, so it really
    should be in the mm/ directory, not the kernel/ one.

    Link: http://lkml.kernel.org/r/20190722094143.18387-1-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Anshuman Khandual
    Acked-by: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig