03 Nov, 2020

1 commit

  • commit 6f42193fd86e ("memremap: don't use a separate devm action for
    devmap_managed_enable_get") changed the static key updates such that we
    now call devmap_managed_enable_put() without doing the equivalent
    devmap_managed_enable_get().

    devmap_managed_enable_get() is only called for MEMORY_DEVICE_PRIVATE and
    MEMORY_DEVICE_FS_DAX, but memunmap_pages() gets called for other pgmap
    types too. This results in the warning below when switching a devdax
    namespace between system-ram and devdax mode; a sketch of the guard
    follows this entry.

    jump label: negative count!
    WARNING: CPU: 52 PID: 1335 at kernel/jump_label.c:235 static_key_slow_try_dec+0x88/0xa0
    Modules linked in:
    ....

    NIP static_key_slow_try_dec+0x88/0xa0
    LR static_key_slow_try_dec+0x84/0xa0
    Call Trace:
    static_key_slow_try_dec+0x84/0xa0
    __static_key_slow_dec_cpuslocked+0x34/0xd0
    static_key_slow_dec+0x54/0xf0
    memunmap_pages+0x36c/0x500
    devm_action_release+0x30/0x50
    release_nodes+0x2f4/0x3e0
    device_release_driver_internal+0x17c/0x280
    bus_remove_device+0x124/0x210
    device_del+0x1d4/0x530
    unregister_dev_dax+0x48/0xe0
    devm_action_release+0x30/0x50
    release_nodes+0x2f4/0x3e0
    device_release_driver_internal+0x17c/0x280
    unbind_store+0x130/0x170
    drv_attr_store+0x40/0x60
    sysfs_kf_write+0x6c/0xb0
    kernfs_fop_write+0x118/0x280
    vfs_write+0xe8/0x2a0
    ksys_write+0x84/0x140
    system_call_exception+0x120/0x270
    system_call_common+0xf0/0x27c

    Reported-by: Aneesh Kumar K.V
    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Tested-by: Sachin Sant
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Ira Weiny
    Reviewed-by: Christoph Hellwig
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Link: https://lkml.kernel.org/r/20201023183222.13186-1-rcampbell@nvidia.com
    Signed-off-by: Linus Torvalds

    Ralph Campbell
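
    A minimal sketch of the guard described above, assuming the 5.10-era
    helper names in mm/memremap.c (devmap_managed_enable_put(),
    devmap_managed_key); illustrative only, not the literal upstream diff:

    static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
    {
            /* Only the types that called devmap_managed_enable_get() drop the key. */
            if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
                pgmap->type == MEMORY_DEVICE_FS_DAX)
                    static_branch_dec(&devmap_managed_key);
    }

    memunmap_pages() can then keep calling the helper unconditionally for
    every pgmap type without underflowing the static key.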
     

17 Oct, 2020

1 commit

  • On the memory onlining path, we want to start with MIGRATE_ISOLATE, to
    un-isolate the pages after memory onlining is complete. Let's allow
    passing in the migratetype.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Wei Yang
    Cc: Baoquan He
    Cc: Pankaj Gupta
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Logan Gunthorpe
    Cc: Dan Williams
    Cc: Mike Rapoport
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Michel Lespinasse
    Cc: Charan Teja Reddy
    Cc: Mel Gorman
    Link: https://lkml.kernel.org/r/20200819175957.28465-10-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
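
    A hedged sketch of what the new parameter looks like at the two call
    sites the entry alludes to (move_pfn_range_to_zone() per the 5.10-era
    memory-hotplug code; treat the exact signature as illustrative):

    /* Onlining path: expose the pages isolated, un-isolate once onlining completes. */
    move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);

    /* ZONE_DEVICE path (memremap_pages()): no isolation dance is needed. */
    move_pfn_range_to_zone(zone, pfn, nr_pages, altmap, MIGRATE_MOVABLE);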
     

14 Oct, 2020

4 commits

  • While reviewing Protection Key Supervisor support it was pointed out that
    using a counter to track static branch enablement is an anti-pattern,
    better solved with the provided static_branch_{inc,dec} functions.[1]

    Fix up devmap_managed_key to work the same way; a sketch follows this
    entry. This should also be safer because it closes a very small (very
    unlikely) race window when multiple callers try to enable the key at the
    same time.

    [1] https://lore.kernel.org/lkml/20200714194031.GI5523@worktop.programming.kicks-ass.net/

    Signed-off-by: Ira Weiny
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Cc: Dan Williams
    Cc: Vishal Verma
    Link: https://lkml.kernel.org/r/20200810235319.2796597-1-ira.weiny@intel.com
    Signed-off-by: Linus Torvalds

    Ira Weiny
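
    The sketch below contrasts the open-coded counter with the
    static_branch_{inc,dec} approach (illustrative, not the exact upstream
    diff; the _old suffix only exists to keep both variants in one listing):

    DEFINE_STATIC_KEY_FALSE(devmap_managed_key);

    /* Before: an extra atomic counter around static_branch_enable(). */
    static atomic_t devmap_managed_enable;

    static void devmap_managed_enable_get_old(void)
    {
            if (atomic_inc_return(&devmap_managed_enable) == 1)
                    static_branch_enable(&devmap_managed_key);  /* two callers can race here */
    }

    /* After: the key's own reference count does the bookkeeping atomically. */
    static void devmap_managed_enable_get(void)
    {
            static_branch_inc(&devmap_managed_key);
    }

    static void devmap_managed_enable_put(void)
    {
            static_branch_dec(&devmap_managed_key);
    }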
     
  • To activate a page, mark_page_accessed() always holds a reference on it.
    It either gets a new reference when adding a page to
    lru_pvecs.activate_page or reuses an existing one it previously got when
    it added a page to lru_pvecs.lru_add. So it doesn't call SetPageActive()
    on a page that doesn't have any reference left. Therefore, the race is
    impossible these days (I didn't bother to dig into its history).

    For other paths, namely reclaim and migration, a reference count is always
    held while calling SetPageActive() on a page.

    SetPageSlabPfmemalloc() also uses SetPageActive(), but it's irrelevant to
    LRU pages.

    Signed-off-by: Yu Zhao
    Signed-off-by: Andrew Morton
    Reviewed-by: Yang Shi
    Cc: Alexander Duyck
    Cc: David Hildenbrand
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Nicholas Piggin
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/20200818184704.3625199-2-yuzhao@google.com
    Signed-off-by: Linus Torvalds

    Yu Zhao
     
  • In support of device-dax growing the ability to front physically
    discontiguous ranges of memory, update devm_memremap_pages() to track
    multiple ranges with a single reference counter and devm instance.

    Convert all [devm_]memremap_pages() users to specify the number of ranges
    they are mapping in their 'struct dev_pagemap' instance.

    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Vishal Verma
    Cc: Vivek Goyal
    Cc: Dave Jiang
    Cc: Ben Skeggs
    Cc: David Airlie
    Cc: Daniel Vetter
    Cc: Ira Weiny
    Cc: Bjorn Helgaas
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: Stefano Stabellini
    Cc: "Jérôme Glisse"
    Cc: Ard Biesheuvel
    Cc: Ard Biesheuvel
    Cc: Borislav Petkov
    Cc: Brice Goglin
    Cc: Catalin Marinas
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Hulk Robot
    Cc: Ingo Molnar
    Cc: Jason Gunthorpe
    Cc: Jason Yan
    Cc: Jeff Moyer
    Cc: "Jérôme Glisse"
    Cc: Jia He
    Cc: Joao Martins
    Cc: Jonathan Cameron
    Cc: kernel test robot
    Cc: Mike Rapoport
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Randy Dunlap
    Cc: Thomas Gleixner
    Cc: Tom Lendacky
    Cc: Wei Yang
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/159643103789.4062302.18426128170217903785.stgit@dwillia2-desk3.amr.corp.intel.com
    Link: https://lkml.kernel.org/r/160106116293.30709.13350662794915396198.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Linus Torvalds

    Dan Williams
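
    Roughly what a multi-range user looks like after this change, as a
    hedged sketch (modelled loosely on device-dax; r0, r1 and dev are
    placeholders, and the union of .range/.ranges[] follows the 5.10 layout):

    struct dev_pagemap *pgmap;
    void *addr;

    /* Two ranges: the first lives in the union'd .range member, so only one
     * extra slot needs allocation space. */
    pgmap = kvzalloc(struct_size(pgmap, ranges, 1), GFP_KERNEL);
    pgmap->type = MEMORY_DEVICE_GENERIC;
    pgmap->nr_range = 2;
    pgmap->ranges[0] = (struct range){ .start = r0.start, .end = r0.end };
    pgmap->ranges[1] = (struct range){ .start = r1.start, .end = r1.end };

    addr = devm_memremap_pages(dev, pgmap);   /* one refcount, one devm instance */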
     
  • The 'struct resource' in 'struct dev_pagemap' is only used for holding
    resource span information. The other fields ('name', 'flags', 'desc',
    'parent', 'sibling', and 'child') are all unused, wasted space.

    This is in preparation for introducing a multi-range extension of
    devm_memremap_pages().

    The bulk of this change is unwinding all the places internal to libnvdimm
    that used 'struct resource' unnecessarily, and replacing instances of
    'struct dev_pagemap'.res with 'struct dev_pagemap'.range.

    P2PDMA had a minor usage of the resource flags field, but only to report
    failures with "%pR". That is replaced with an open coded print of the
    range.

    [dan.carpenter@oracle.com: mm/hmm/test: use after free in dmirror_allocate_chunk()]
    Link: https://lkml.kernel.org/r/20200926121402.GA7467@kadam

    Signed-off-by: Dan Williams
    Signed-off-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Reviewed-by: Boris Ostrovsky [xen]
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Vishal Verma
    Cc: Vivek Goyal
    Cc: Dave Jiang
    Cc: Ben Skeggs
    Cc: David Airlie
    Cc: Daniel Vetter
    Cc: Ira Weiny
    Cc: Bjorn Helgaas
    Cc: Juergen Gross
    Cc: Stefano Stabellini
    Cc: "Jérôme Glisse"
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Ard Biesheuvel
    Cc: Borislav Petkov
    Cc: Brice Goglin
    Cc: Catalin Marinas
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Hulk Robot
    Cc: Ingo Molnar
    Cc: Jason Gunthorpe
    Cc: Jason Yan
    Cc: Jeff Moyer
    Cc: Jia He
    Cc: Joao Martins
    Cc: Jonathan Cameron
    Cc: kernel test robot
    Cc: Mike Rapoport
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Randy Dunlap
    Cc: Thomas Gleixner
    Cc: Tom Lendacky
    Cc: Wei Yang
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/159643103173.4062302.768998885691711532.stgit@dwillia2-desk3.amr.corp.intel.com
    Link: https://lkml.kernel.org/r/160106115761.30709.13539840236873663620.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Linus Torvalds

    Dan Williams
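
    A hedged before/after sketch of what the conversion looks like at a call
    site (base, size and dev are placeholders, not an exact libnvdimm
    excerpt):

    /* Before: the span was carried in a full struct resource. */
    pgmap->res.start = base;
    pgmap->res.end   = base + size - 1;
    dev_err(dev, "cannot remap %pR\n", &pgmap->res);

    /* After: a bare struct range carries the same information, and the one
     * "%pR" user prints the span by hand. */
    pgmap->range = (struct range){ .start = base, .end = base + size - 1 };
    dev_err(dev, "cannot remap %#llx-%#llx\n",
            pgmap->range.start, pgmap->range.end);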
     

11 Apr, 2020

3 commits

  • PCI BAR IO memory should never be mapped as WB; however, prior to this
    change the PAT bits were set to WB and the mapping was typically
    overridden by MTRR registers set by the firmware.

    Set PCI P2PDMA memory to be UC, as this is what it currently, typically,
    ends up being mapped as on x86 once the MTRR registers override the
    cache setting; a sketch of the selection follows this entry.

    Future use-cases may need to generalize this by adding flags to select
    the caching type, as some P2PDMA cases may not want UC. However, those
    use-cases are not upstream yet and this can be changed when they arrive.

    Signed-off-by: Logan Gunthorpe
    Signed-off-by: Andrew Morton
    Reviewed-by: Dan Williams
    Cc: Christoph Hellwig
    Cc: Jason Gunthorpe
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Eric Badger
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20200306170846.9333-8-logang@deltatee.com
    Signed-off-by: Linus Torvalds

    Logan Gunthorpe
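
    A hedged sketch of the selection the entry describes, roughly as it
    would sit in memremap_pages() (mhp_params and pgprot_noncached() per the
    5.7-era APIs; illustrative only):

    struct mhp_params params = { .pgprot = PAGE_KERNEL };

    switch (pgmap->type) {
    case MEMORY_DEVICE_PCI_P2PDMA:
            /* PCI BAR memory: map uncached instead of relying on MTRRs. */
            params.pgprot = pgprot_noncached(params.pgprot);
            break;
    default:
            break;          /* everything else keeps cached PAGE_KERNEL */
    }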
     
  • devm_memremap_pages() is currently used by the PCI P2PDMA code to create
    struct page mappings for IO memory. At present, these mappings are
    created with PAGE_KERNEL, which implies setting the PAT bits to WB.
    However, on x86 an MTRR register will typically override this and force
    the cache type to UC-. If the firmware doesn't set this register, the
    mapping is effectively WB and will typically result in a machine check
    exception when it's accessed.

    Other arches are not currently likely to function correctly either,
    since they don't have MTRR registers to fall back on.

    To solve this, provide a way to specify the pgprot value explicitly to
    arch_add_memory(); a sketch of the plumbing follows this entry.

    Of the arches that support MEMORY_HOTPLUG, x86_64 and arm64 need a
    simple change to pass the pgprot_t down to their respective functions
    that set up the page tables. For x86_32, set the page tables explicitly
    using _set_memory_prot() (since they are already mapped).

    For ia64, s390 and sh, reject anything but PAGE_KERNEL settings -- this
    should be fine, for now, since these architectures don't support
    ZONE_DEVICE.

    A check in __add_pages() is also added to ensure the pgprot parameter
    was set for all arches.

    Signed-off-by: Logan Gunthorpe
    Signed-off-by: Andrew Morton
    Acked-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Dan Williams
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: Dave Hansen
    Cc: Eric Badger
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jason Gunthorpe
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20200306170846.9333-7-logang@deltatee.com
    Signed-off-by: Linus Torvalds

    Logan Gunthorpe
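
    A hedged sketch of the new plumbing (5.7-era signatures; nid, start,
    size and ret are placeholders): the caller picks a pgprot, the hotplug
    core hands it to the arch code, and __add_pages() refuses callers that
    forgot to set one.

    struct mhp_params params = {
            .pgprot = pgprot_noncached(PAGE_KERNEL),   /* e.g. P2PDMA IO memory */
    };

    ret = arch_add_memory(nid, start, size, &params);

    /* __add_pages() then sanity-checks that a pgprot was actually provided, roughly: */
    if (WARN_ON_ONCE(!pgprot_val(params.pgprot)))
            return -EINVAL;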
     
  • The mhp_restrictions struct really doesn't specify anything resembling a
    restriction anymore, so rename it to mhp_params, as it is now a list of
    extended parameters.

    Signed-off-by: Logan Gunthorpe
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Reviewed-by: Dan Williams
    Acked-by: Michal Hocko
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: Dave Hansen
    Cc: Eric Badger
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jason Gunthorpe
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20200306170846.9333-3-logang@deltatee.com
    Signed-off-by: Linus Torvalds

    Logan Gunthorpe
     

09 Apr, 2020

1 commit

  • Pull libnvdimm and dax updates from Dan Williams:
    "There were multiple touches outside of drivers/nvdimm/ this round to
    add cross arch compatibility to the devm_memremap_pages() interface,
    enhance numa information for persistent memory ranges, and add a
    zero_page_range() dax operation.

    This cycle I switched from the patchwork api to Konstantin's b4 script
    for collecting tags (from x86, PowerPC, filesystem, and device-mapper
    folks), and everything looks to have gone ok there. This has all
    appeared in -next with no reported issues.

    Summary:

    - Add support for region alignment configuration and enforcement to
    fix compatibility across architectures and PowerPC page size
    configurations.

    - Introduce 'zero_page_range' as a dax operation. This facilitates
    filesystem-dax operation without a block-device.

    - Introduce phys_to_target_node() to facilitate drivers that want to
    know resulting numa node if a given reserved address range was
    onlined.

    - Advertise a persistence-domain for of_pmem and papr_scm. The
    persistence domain indicates where cpu-store cycles need to reach
    in the platform-memory subsystem before the platform will consider
    them power-fail protected.

    - Promote numa_map_to_online_node() to a cross-kernel generic
    facility.

    - Save x86 numa information to allow for node-id lookups for reserved
    memory ranges, deploy that capability for the e820-pmem driver.

    - Pick up some miscellaneous minor fixes that missed v5.6-final,
    including some smatch reports in the ioctl path and some unit
    test compilation fixups.

    - Fixup some flexible-array declarations"

    * tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (29 commits)
    dax: Move mandatory ->zero_page_range() check in alloc_dax()
    dax,iomap: Add helper dax_iomap_zero() to zero a range
    dax: Use new dax zero page method for zeroing a page
    dm,dax: Add dax zero_page_range operation
    s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver
    dax, pmem: Add a dax operation zero_page_range
    pmem: Add functions for reading/writing page to/from pmem
    libnvdimm: Update persistence domain value for of_pmem and papr_scm device
    tools/test/nvdimm: Fix out of tree build
    libnvdimm/region: Fix build error
    libnvdimm/region: Replace zero-length array with flexible-array member
    libnvdimm/label: Replace zero-length array with flexible-array member
    ACPI: NFIT: Replace zero-length array with flexible-array member
    libnvdimm/region: Introduce an 'align' attribute
    libnvdimm/region: Introduce NDD_LABELING
    libnvdimm/namespace: Enforce memremap_compat_align()
    libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid
    libnvdimm: Out of bounds read in __nd_ioctl()
    acpi/nfit: improve bounds checking for 'func'
    mm/memremap_pages: Introduce memremap_compat_align()
    ...

    Linus Torvalds
     

27 Mar, 2020

1 commit

  • Add a new opaque owner field to struct dev_pagemap, which will allow the
    hmm and migrate_vma code to identify who owns ZONE_DEVICE memory, and
    refuse to work on mappings not owned by the calling entity.

    Link: https://lore.kernel.org/r/20200316193216.920734-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ralph Campbell
    Tested-by: Bharata B Rao
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
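
    A hedged sketch of how the owner cookie is meant to be used (my_drv is a
    placeholder, and the consumer-side check is only roughly what the
    hmm/migrate_vma code does):

    static int my_drv;                      /* any unique address works as a cookie */

    pgmap->type  = MEMORY_DEVICE_PRIVATE;
    pgmap->owner = &my_drv;                 /* "this ZONE_DEVICE memory is mine" */

    /* Consumer side, roughly: skip pages owned by somebody else. */
    if (is_device_private_page(page) && page->pgmap->owner != &my_drv)
            return -EBUSY;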
     

21 Feb, 2020

1 commit

  • The "sub-section memory hotplug" facility allows memremap_pages() users
    like libnvdimm to compensate for hardware platforms like x86 that have a
    section size larger than their hardware memory mapping granularity. The
    compensation that sub-section support affords is being tolerant of
    physical memory resources shifting by units smaller (64MiB on x86) than
    the memory-hotplug section size (128 MiB), where the platform
    physical-memory mapping granularity is limited by the number and
    capability of address-decode registers in the memory controller.

    While the sub-section support allows memremap_pages() to operate on
    sub-section (2MiB) granularity, the Power architecture may still
    require 16MiB alignment on "!radix_enabled()" platforms.

    In order for libnvdimm to be able to detect and manage this per-arch
    limitation, introduce memremap_compat_align() as a common minimum
    alignment across all driver-facing memory-mapping interfaces, and let
    Power override it to 16MiB in the "!radix_enabled()" case.

    The assumption / requirement for 16MiB to be a viable
    memremap_compat_align() value is that Power does not have platforms
    where its equivalent of address-decode registers would remap a
    persistent memory resource on boundaries smaller than 16MiB. Note that I
    tried my best to not add a new Kconfig symbol, but header include
    entanglements defeated the #ifndef memremap_compat_align design pattern
    and the need to export it defeats the __weak design pattern for arch
    overrides.

    Based on an initial patch by Aneesh.

    Link: http://lore.kernel.org/r/CAPcyv4gBGNP95APYaBcsocEa50tQj9b5h__83vgngjq3ouGX_Q@mail.gmail.com
    Reported-by: Aneesh Kumar K.V
    Reported-by: Jeff Moyer
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Michael Ellerman (powerpc)
    Signed-off-by: Dan Williams

    Dan Williams
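
    A hedged sketch of the two definitions the entry describes (bodies are
    illustrative, not upstream verbatim; upstream the powerpc version
    replaces the generic one rather than having its own name):

    /* Generic default (mm/memremap.c): the ZONE_DEVICE sub-section granularity. */
    unsigned long memremap_compat_align(void)
    {
            return SUBSECTION_SIZE;
    }

    /* What the powerpc override boils down to: hash-MMU ("!radix_enabled()")
     * platforms map the linear range in 16MiB chunks, so demand 16MiB there. */
    unsigned long powerpc_memremap_compat_align(void)
    {
            return radix_enabled() ? SUBSECTION_SIZE : SZ_16M;
    }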
     

04 Feb, 2020

1 commit

  • Let's poison the pages similar to when adding new memory in
    sparse_add_section(). Also call remove_pfn_range_from_zone() from
    memunmap_pages(), so we can poison the memmap from there as well.

    Link: http://lkml.kernel.org/r/20191006085646.5768-7-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Michal Hocko
    Cc: "Matthew Wilcox (Oracle)"
    Cc: "Aneesh Kumar K.V"
    Cc: Pavel Tatashin
    Cc: Greg Kroah-Hartman
    Cc: Dan Williams
    Cc: Logan Gunthorpe
    Cc: Oscar Salvador
    Cc: Pankaj Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
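
    A hedged sketch of where the poisoning lands (remove_pfn_range_from_zone()
    and page_init_poison() per the 5.6-era code; the body is illustrative and
    elides the span shrinking):

    void remove_pfn_range_from_zone(struct zone *zone, unsigned long start_pfn,
                                    unsigned long nr_pages)
    {
            /* Make later accesses to the now-stale memmap stand out. */
            page_init_poison(pfn_to_page(start_pfn),
                             sizeof(struct page) * nr_pages);

            /* ... shrink the zone/node spans ... */
    }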
     

01 Feb, 2020

2 commits

  • An upcoming patch changes and complicates the refcounting and especially
    the "put page" aspects of it. In order to keep everything clean,
    refactor the devmap page release routines:

    * Rename put_devmap_managed_page() to page_is_devmap_managed(), and
    limit the functionality to "read only": return a bool, with no side
    effects.

    * Add a new routine, put_devmap_managed_page(), to handle decrementing
    the refcount for ZONE_DEVICE pages.

    * Change callers (just release_pages() and put_page()) to check
    page_is_devmap_managed() before calling the new
    put_devmap_managed_page() routine. This is a performance point:
    put_page() is a hot path, so we need to avoid non-inline function calls
    where possible.

    * Rename __put_devmap_managed_page() to free_devmap_managed_page(), and
    limit the functionality to unconditionally freeing a devmap page.

    This is originally based on a separate patch by Ira Weiny, which applied
    to an early version of the put_user_page() experiments. Since then,
    Jérôme Glisse suggested the refactoring described above.

    Link: http://lkml.kernel.org/r/20200107224558.2362728-5-jhubbard@nvidia.com
    Signed-off-by: Ira Weiny
    Signed-off-by: John Hubbard
    Suggested-by: Jérôme Glisse
    Reviewed-by: Dan Williams
    Reviewed-by: Jan Kara
    Cc: Christoph Hellwig
    Cc: Kirill A. Shutemov
    Cc: Alex Williamson
    Cc: Aneesh Kumar K.V
    Cc: Björn Töpel
    Cc: Daniel Vetter
    Cc: Hans Verkuil
    Cc: Jason Gunthorpe
    Cc: Jason Gunthorpe
    Cc: Jens Axboe
    Cc: Jonathan Corbet
    Cc: Leon Romanovsky
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Hubbard
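
    A hedged sketch of the refactored hot path, following the bullet points
    above (put_page() as described in the commit text; illustrative, not
    upstream verbatim):

    static inline void put_page(struct page *page)
    {
            page = compound_head(page);

            /* Cheap, side-effect-free check keeps the common case inline... */
            if (page_is_devmap_managed(page)) {
                    /* ...only ZONE_DEVICE pages take the out-of-line call. */
                    put_devmap_managed_page(page);
                    return;
            }

            if (put_page_testzero(page))
                    __put_page(page);
    }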
     
  • After the removal of the device-public infrastructure there are only 2
    ->page_free() callbacks in the kernel. One of those is a
    device-private callback in the nouveau driver, the other is a generic
    wakeup needed in the DAX case. In the hopes that all ->page_free()
    callbacks can be migrated to common core kernel functionality, move the
    device-private specific actions in __put_devmap_managed_page() under the
    is_device_private_page() conditional, including the ->page_free()
    callback. For the other page types just open-code the generic wakeup.

    Yes, the wakeup is only needed in the MEMORY_DEVICE_FSDAX case, but it
    does no harm in the MEMORY_DEVICE_DEVDAX and MEMORY_DEVICE_PCI_P2PDMA
    cases.

    Link: http://lkml.kernel.org/r/20200107224558.2362728-4-jhubbard@nvidia.com
    Signed-off-by: Dan Williams
    Signed-off-by: John Hubbard
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jérôme Glisse
    Cc: Jan Kara
    Cc: Ira Weiny
    Cc: Alex Williamson
    Cc: Aneesh Kumar K.V
    Cc: Björn Töpel
    Cc: Daniel Vetter
    Cc: Hans Verkuil
    Cc: Jason Gunthorpe
    Cc: Jason Gunthorpe
    Cc: Jens Axboe
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Leon Romanovsky
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

05 Jan, 2020

1 commit

  • We currently try to shrink a single zone when removing memory. We use
    the zone of the first page of the memory we are removing. If that
    memmap was never initialized (e.g., memory was never onlined), we will
    read garbage and can trigger kernel BUGs (due to a stale pointer):

    BUG: unable to handle page fault for address: 000000000000353d
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP PTI
    CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    RIP: 0010:clear_zone_contiguous+0x5/0x10
    Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
    RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
    RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
    RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
    R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
    FS: 0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    __remove_pages+0x4b/0x640
    arch_remove_memory+0x63/0x8d
    try_remove_memory+0xdb/0x130
    __remove_memory+0xa/0x11
    acpi_memory_device_remove+0x70/0x100
    acpi_bus_trim+0x55/0x90
    acpi_device_hotplug+0x227/0x3a0
    acpi_hotplug_work_fn+0x1a/0x30
    process_one_work+0x221/0x550
    worker_thread+0x50/0x3b0
    kthread+0x105/0x140
    ret_from_fork+0x3a/0x50
    Modules linked in:
    CR2: 000000000000353d

    Instead, shrink the zones when offlining memory or when onlining failed.
    Introduce and use remove_pfn_range_from_zone() for that. We now
    properly shrink the zones, even if we have DIMMs whereby

    - Some memory blocks fall into no zone (never onlined)

    - Some memory blocks fall into multiple zones (offlined+re-onlined)

    - Multiple memory blocks that fall into different zones

    Drop the zone parameter (with a potential dubious value) from
    __remove_pages() and __remove_section().

    Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com
    Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
    Signed-off-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: "Matthew Wilcox (Oracle)"
    Cc: "Aneesh Kumar K.V"
    Cc: Pavel Tatashin
    Cc: Greg Kroah-Hartman
    Cc: Dan Williams
    Cc: Logan Gunthorpe
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

19 Oct, 2019

1 commit

  • Patch series "mm/memory_hotplug: Shrink zones before removing memory",
    v6.

    This series fixes the access of uninitialized memmaps when shrinking
    zones/nodes and when removing memory. Also, it contains all fixes for
    crashes that can be triggered when removing certain namespaces using
    memunmap_pages() (ZONE_DEVICE), as reported by Aneesh.

    We stop trying to shrink ZONE_DEVICE, as it's buggy; fixing it would be
    more involved (we don't have SECTION_IS_ONLINE as an indicator), and
    shrinking is only of limited use (set_zone_contiguous() cannot detect
    ZONE_DEVICE as contiguous).

    We continue shrinking !ZONE_DEVICE zones; however, I reduced the amount
    of code to a minimum. Shrinking is especially necessary to keep
    zone->contiguous set where possible, especially on memory unplug of
    DIMMs at zone boundaries.

    --------------------------------------------------------------------------

    Zones are now properly shrunk when offlining memory blocks or when
    onlining failed. This allows zones to be shrunk properly on memory unplug
    even if the separate memory blocks of a DIMM were onlined to different
    zones or re-onlined to a different zone after offlining.

    Example:

    :/# cat /proc/zoneinfo
    Node 1, zone Movable
    spanned 0
    present 0
    managed 0
    :/# echo "online_movable" > /sys/devices/system/memory/memory41/state
    :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
    :/# cat /proc/zoneinfo
    Node 1, zone Movable
    spanned 98304
    present 65536
    managed 65536
    :/# echo 0 > /sys/devices/system/memory/memory43/online
    :/# cat /proc/zoneinfo
    Node 1, zone Movable
    spanned 32768
    present 32768
    managed 32768
    :/# echo 0 > /sys/devices/system/memory/memory41/online
    :/# cat /proc/zoneinfo
    Node 1, zone Movable
    spanned 0
    present 0
    managed 0

    This patch (of 10):

    With an altmap, the memmap pages falling into the reserved altmap space
    are not initialized and, therefore, contain a garbage NID and a garbage
    zone.
    Make sure to read the NID/zone from a memmap that was initialized.

    This fixes a kernel crash that is observed when destroying a namespace:

    kernel BUG at include/linux/mm.h:1107!
    cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
    pc: c0000000004b9728: memunmap_pages+0x238/0x340
    lr: c0000000004b9724: memunmap_pages+0x234/0x340
    ...
    pid = 3669, comm = ndctl
    kernel BUG at include/linux/mm.h:1107!
    devm_action_release+0x30/0x50
    release_nodes+0x268/0x2d0
    device_release_driver_internal+0x174/0x240
    unbind_store+0x13c/0x190
    drv_attr_store+0x44/0x60
    sysfs_kf_write+0x70/0xa0
    kernfs_fop_write+0x1ac/0x290
    __vfs_write+0x3c/0x70
    vfs_write+0xe4/0x200
    ksys_write+0x7c/0x140
    system_call+0x5c/0x68

    The "page_zone(pfn_to_page(pfn)" was introduced by 69324b8f4833 ("mm,
    devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support"), however, I
    think we will never have driver reserved memory with
    MEMORY_DEVICE_PRIVATE (no altmap AFAIKS).

    [david@redhat.com: minimize code changes, rephrase description]
    Link: http://lkml.kernel.org/r/20191006085646.5768-2-david@redhat.com
    Fixes: 2c2a5af6fed2 ("mm, memory_hotplug: add nid parameter to arch_remove_memory")
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Cc: Logan Gunthorpe
    Cc: Ira Weiny
    Cc: Damian Tometzki
    Cc: Alexander Duyck
    Cc: Alexander Potapenko
    Cc: Andy Lutomirski
    Cc: Anshuman Khandual
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Christophe Leroy
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: Gerald Schaefer
    Cc: Greg Kroah-Hartman
    Cc: Halil Pasic
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jun Yao
    Cc: Mark Rutland
    Cc: Masahiro Yamada
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Mike Rapoport
    Cc: Oscar Salvador
    Cc: Pankaj Gupta
    Cc: Paul Mackerras
    Cc: Pavel Tatashin
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: Qian Cai
    Cc: Rich Felker
    Cc: Robin Murphy
    Cc: Steve Capper
    Cc: Thomas Gleixner
    Cc: Tom Lendacky
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Wei Yang
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Cc: Yu Zhao
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

08 Oct, 2019

1 commit

  • The SECTION_SIZE and SECTION_MASK macros are not used anymore, but they
    conflict with existing definitions on the arm64 platform, causing the
    following warning during build. Let's drop these unused macros.

    mm/memremap.c:16: warning: "SECTION_MASK" redefined
    #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
    arch/arm64/include/asm/pgtable-hwdef.h:79: note: this is the location of the previous definition
    #define SECTION_MASK (~(SECTION_SIZE-1))

    mm/memremap.c:17: warning: "SECTION_SIZE" redefined
    #define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
    arch/arm64/include/asm/pgtable-hwdef.h:78: note: this is the location of the previous definition
    #define SECTION_SIZE (_AC(1, UL) << SECTION_SHIFT)

    Link: http://lkml.kernel.org/r/1569312010-31313-1-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Reported-by: kbuild test robot
    Reviewed-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Cc: Logan Gunthorpe
    Cc: Ira Weiny
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

22 Aug, 2019

1 commit

  • From rdma.git

    Jason Gunthorpe says:

    ====================
    This is a collection of general cleanups for ODP to clarify some of the
    flows around umem creation and use of the interval tree.
    ====================

    The branch is based on v5.3-rc5 due to dependencies, and is being taken
    into hmm.git due to dependencies in the next patches.

    * odp_fixes:
    RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
    RDMA/mlx5: Use ib_umem_start instead of umem.address
    RDMA/core: Make invalidate_range a device operation
    RDMA/odp: Use kvcalloc for the dma_list and page_list
    RDMA/odp: Check for overflow when computing the umem_odp end
    RDMA/odp: Provide ib_umem_odp_release() to undo the allocs
    RDMA/odp: Split creating a umem_odp from ib_umem_get
    RDMA/odp: Make the three ways to create a umem_odp clear
    RMDA/odp: Consolidate umem_odp initialization
    RDMA/odp: Make it clearer when a umem is an implicit ODP umem
    RDMA/odp: Iterate over the whole rbtree directly
    RDMA/odp: Use the common interval tree library instead of generic
    RDMA/mlx5: Fix MR npages calculation for IB_ACCESS_HUGETLB

    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

14 Aug, 2019

1 commit

  • When a ZONE_DEVICE private page is freed, the page->mapping field can be
    set. If this page is reused as an anonymous page, the previous value
    can prevent the page from being inserted into the CPU's anon rmap table.
    For example, when migrating a pte_none() page to device memory:

    migrate_vma(ops, vma, start, end, src, dst, private)
    migrate_vma_collect()
    src[] = MIGRATE_PFN_MIGRATE
    migrate_vma_prepare()
    /* no page to lock or isolate so OK */
    migrate_vma_unmap()
    /* no page to unmap so OK */
    ops->alloc_and_copy()
    /* driver allocates ZONE_DEVICE page for dst[] */
    migrate_vma_pages()
    migrate_vma_insert_page()
    page_add_new_anon_rmap()
    __page_set_anon_rmap()
    /* This check sees the page's stale mapping field */
    if (PageAnon(page))
    return
    /* page->mapping is not updated */

    The result is that the migration appears to succeed but a subsequent CPU
    fault will be unable to migrate the page back to system memory or worse.

    Clear the page->mapping field when freeing the ZONE_DEVICE page so stale
    pointer data doesn't affect future page use.

    Link: http://lkml.kernel.org/r/20190719192955.30462-3-rcampbell@nvidia.com
    Fixes: b7a523109fb5c9d2d6dd ("mm: don't clear ->mapping in hmm_devmem_free")
    Signed-off-by: Ralph Campbell
    Reviewed-by: John Hubbard
    Reviewed-by: Christoph Hellwig
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Cc: Logan Gunthorpe
    Cc: Ira Weiny
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Jan Kara
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: "Jérôme Glisse"
    Cc: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralph Campbell
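
    A hedged, excerpt-style sketch of the one-line fix in the ZONE_DEVICE
    free path (__put_devmap_managed_page() in mm/memremap.c, roughly):

            mem_cgroup_uncharge(page);

            /* Scrub the stale address_space pointer so a later reuse of the
             * page as anonymous memory starts with a clean ->mapping. */
            page->mapping = NULL;

            page->pgmap->ops->page_free(page);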
     

10 Aug, 2019

1 commit

  • Currently, attempts to shutdown and re-enable a device-dax instance
    trigger:

    Missing reference count teardown definition
    WARNING: CPU: 37 PID: 1608 at mm/memremap.c:211 devm_memremap_pages+0x234/0x850
    [..]
    RIP: 0010:devm_memremap_pages+0x234/0x850
    [..]
    Call Trace:
    dev_dax_probe+0x66/0x190 [device_dax]
    really_probe+0xef/0x390
    driver_probe_device+0xb4/0x100
    device_driver_attach+0x4f/0x60

    Given that the setup path initializes pgmap->ref, arrange for it to also
    be torn down so devm_memremap_pages() is ready to be called again and is
    not mistaken for the 3rd-party per-cpu-ref case.

    Fixes: 24917f6b1041 ("memremap: provide an optional internal refcount in struct dev_pagemap")
    Reported-by: Fan Du
    Tested-by: Vishal Verma
    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Ira Weiny
    Cc: Jason Gunthorpe
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/156530042781.2068700.8733813683117819799.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
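
    A hedged sketch of the cleanup-path idea (dev_pagemap_cleanup() per the
    5.3-era mm/memremap.c; treat the body as illustrative):

    static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
    {
            if (pgmap->ops && pgmap->ops->cleanup) {
                    pgmap->ops->cleanup(pgmap);
            } else {
                    wait_for_completion(&pgmap->done);
                    percpu_ref_exit(pgmap->ref);
            }

            /* Undo the internal-ref assignment so the same pgmap can be passed
             * to devm_memremap_pages() again without tripping the warning. */
            if (pgmap->ref == &pgmap->internal_ref)
                    pgmap->ref = NULL;
    }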
     

03 Aug, 2019

1 commit

  • memremap.c implements MM functionality for ZONE_DEVICE, so it really
    should be in the mm/ directory, not the kernel/ one.

    Link: http://lkml.kernel.org/r/20190722094143.18387-1-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Anshuman Khandual
    Acked-by: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig