13 Jan, 2019

2 commits

  • commit 06489cfbd915ff36c8e36df27f1c2dc60f97ca56 upstream.

    Given the fact that devm_memremap_pages() requires a percpu_ref that is
    torn down by devm_memremap_pages_release(), the current support for mapping
    RAM is broken.

    Support for remapping "System RAM" has been broken since the beginning and
    there is no existing user of this code path, so just kill the support
    and make it an explicit error.

    This cleanup also simplifies a follow-on patch to fix the error path when
    setting a devm release action for devm_memremap_pages_release() fails.

    Link: http://lkml.kernel.org/r/154275557997.76910.14689813630968180480.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: "Jérôme Glisse"
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Logan Gunthorpe
    Cc: Balbir Singh
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit 808153e1187fa77ac7d7dad261ff476888dcf398 upstream.

    devm_memremap_pages() is a facility that can create struct page entries
    for any arbitrary range and give drivers the ability to subvert core
    aspects of page management.

    Specifically the facility is tightly integrated with the kernel's memory
    hotplug functionality. It injects an altmap argument deep into the
    architecture specific vmemmap implementation to allow allocating from
    specific reserved pages, and it has Linux specific assumptions about page
    structure reference counting relative to get_user_pages() and
    get_user_pages_fast(). It was an oversight and a mistake that this was
    not marked EXPORT_SYMBOL_GPL from the outset.

    Again, devm_memremap_pages() exposes and relies upon core kernel internal
    assumptions and will continue to evolve along with 'struct page', memory
    hotplug, and support for new memory types / topologies. Only an in-kernel
    GPL-only driver is expected to keep up with this ongoing evolution. This
    interface, and functionality derived from this interface, is not suitable
    for kernel-external drivers.
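
    In code terms the change is essentially the export annotation (sketch of
    the diff):

        -EXPORT_SYMBOL(devm_memremap_pages);
        +EXPORT_SYMBOL_GPL(devm_memremap_pages);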

    Link: http://lkml.kernel.org/r/154275557457.76910.16923571232582744134.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Christoph Hellwig
    Acked-by: Michal Hocko
    Cc: "Jérôme Glisse"
    Cc: Balbir Singh
    Cc: Logan Gunthorpe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

20 Oct, 2018

1 commit

  • commit 15d36fecd0bdc7510b70a0e5ec6671140b3fce0c upstream.

    When pmem namespaces are created smaller than the section size, this can
    cause an issue during removal, and a general protection fault was observed:

    general protection fault: 0000 [#1] SMP PTI
    CPU: 36 PID: 3941 Comm: ndctl Tainted: G W 4.14.28-1.el7uek.x86_64 #2
    task: ffff88acda150000 task.stack: ffffc900233a4000
    RIP: 0010:__put_page+0x56/0x79
    Call Trace:
    devm_memremap_pages_release+0x155/0x23a
    release_nodes+0x21e/0x260
    devres_release_all+0x3c/0x48
    device_release_driver_internal+0x15c/0x207
    device_release_driver+0x12/0x14
    unbind_store+0xba/0xd8
    drv_attr_store+0x27/0x31
    sysfs_kf_write+0x3f/0x46
    kernfs_fop_write+0x10f/0x18b
    __vfs_write+0x3a/0x16d
    vfs_write+0xb2/0x1a1
    SyS_write+0x55/0xb9
    do_syscall_64+0x79/0x1ae
    entry_SYSCALL_64_after_hwframe+0x3d/0x0

    Add code to check whether we have a mapping already in the same section
    and prevent additional mappings from being created if that is the case.
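
    The added check is roughly the following (simplified; the actual patch
    checks both ends of the section-aligned range):

        conflict_pgmap = get_dev_pagemap(PHYS_PFN(align_start), NULL);
        if (conflict_pgmap) {
                dev_WARN(dev, "Conflicting mapping in same section\n");
                put_dev_pagemap(conflict_pgmap);
                return ERR_PTR(-ENOMEM);
        }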

    Link: http://lkml.kernel.org/r/152909478401.50143.312364396244072931.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Cc: Dan Williams
    Cc: Robert Elliott
    Cc: Jeff Moyer
    Cc: Matthew Wilcox
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Dave Jiang
     

15 Sep, 2018

1 commit

  • commit 77dd66a3c67c93ab401ccc15efff25578be281fd upstream.

    If devm_memremap_pages() detects a collision while adding entries
    to the radix-tree, we call pgmap_radix_release(). Unfortunately,
    the function removes *all* entries for the range -- including the
    entries that caused the collision in the first place.

    Modify pgmap_radix_release() to take an additional argument to
    indicate where to stop, so that only newly added entries are removed
    from the tree.
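
    A sketch of the resulting helper (simplified from the patch):

        static void pgmap_radix_release(struct resource *res, unsigned long end_pgoff)
        {
                unsigned long pgoff, order;

                mutex_lock(&pgmap_lock);
                foreach_order_pgoff(res, order, pgoff) {
                        if (pgoff >= end_pgoff)
                                break;
                        radix_tree_delete(&pgmap_radix, PHYS_PFN(res->start) + pgoff);
                }
                mutex_unlock(&pgmap_lock);
        }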

    Cc:
    Fixes: 9476df7d80df ("mm: introduce find_dev_pagemap()")
    Signed-off-by: Jan H. Schönherr
    Signed-off-by: Dan Williams
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Jan H. Schönherr
     

22 Feb, 2018

1 commit

  • commit 10a0cd6e4932b5078215b1ec2c896597eec0eff9 upstream.

    The functions devm_memremap_pages() and devm_memremap_pages_release() use
    different ways to calculate the section-aligned amount of memory. The
    latter function may use an incorrect size if the memory region is small
    but straddles a section border.

    Use the same code for both.
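
    Both paths now derive the section-aligned range the same way, roughly:

        align_start = res->start & ~(SECTION_SIZE - 1);
        align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
                     - align_start;
        /* not ALIGN(resource_size(res), SECTION_SIZE), which under-counts
         * when a small region straddles a section border */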

    Cc:
    Fixes: 5f29a77cd957 ("mm: fix mixed zone detection in devm_memremap_pages")
    Signed-off-by: Jan H. Schönherr
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Jan H. Schönherr
     

04 Oct, 2017

1 commit

    devm_memremap_pages() is initializing struct pages in for_each_device_pfn()
    and that can take quite some time. We have even seen a soft lockup
    triggering on a non-preemptive kernel:

    NMI watchdog: BUG: soft lockup - CPU#61 stuck for 22s! [kworker/u641:11:1808]
    [...]
    RIP: 0010:[] [] devm_memremap_pages+0x327/0x430
    [...]
    Call Trace:
    pmem_attach_disk+0x2fd/0x3f0 [nd_pmem]
    nvdimm_bus_probe+0x64/0x110 [libnvdimm]
    driver_probe_device+0x1f7/0x420
    bus_for_each_drv+0x52/0x80
    __device_attach+0xb0/0x130
    bus_probe_device+0x87/0xa0
    device_add+0x3fc/0x5f0
    nd_async_device_register+0xe/0x40 [libnvdimm]
    async_run_entry_fn+0x43/0x150
    process_one_work+0x14e/0x410
    worker_thread+0x116/0x490
    kthread+0xc7/0xe0
    ret_from_fork+0x3f/0x70

    Fix this by adding a cond_resched() call every 1024 pages.
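
    The change amounts to a scheduling point in the initialization loop,
    roughly (argument names approximate):

        for_each_device_pfn(pfn, pgmap) {
                struct page *page = pfn_to_page(pfn);

                page->pgmap = pgmap;
                if (!(pfn % 1024))
                        cond_resched();
        }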

    Link: http://lkml.kernel.org/r/20170918121410.24466-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Johannes Thumshirn
    Tested-by: Johannes Thumshirn
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

09 Sep, 2017

4 commits

    Platforms with an advanced system bus (like CAPI or CCIX) allow device
    memory to be accessed by the CPU in a cache-coherent fashion. Add a new
    type of ZONE_DEVICE to represent such memory. The use cases are the same
    as for the un-addressable device memory, but without all the corner cases.

    Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Aneesh Kumar
    Cc: Paul E. McKenney
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Balbir Singh
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
    HMM pages (private or public device pages) are ZONE_DEVICE pages and thus
    need special handling when it comes to the lru or refcounting. This patch
    makes sure that memcontrol handles them properly when it encounters them.
    Those pages are used like regular pages in a process address space, either
    as anonymous pages or as file-backed pages, so from the memcg point of
    view we want to handle them like regular pages, at least for now.

    Link: http://lkml.kernel.org/r/20170817000548.32038-11-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Balbir Singh
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Aneesh Kumar
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
    A ZONE_DEVICE page that reaches a refcount of 1 is free, i.e. it no longer
    has any user. For device-private pages it is important to catch this, so
    we need to special-case put_page() for them.
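
    A sketch of what the special case looks like in put_page() (helper names
    approximate to this series):

        static inline void put_page(struct page *page)
        {
                page = compound_head(page);

                /* device-private pages are "free" at refcount 1; hand them
                 * back to the driver instead of the page allocator */
                if (unlikely(is_device_private_page(page))) {
                        put_zone_device_private_page(page);
                        return;
                }

                if (put_page_testzero(page))
                        __put_page(page);
        }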

    Link: http://lkml.kernel.org/r/20170817000548.32038-9-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Kirill A. Shutemov
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
    HMM (heterogeneous memory management) needs struct page to support
    migration from system main memory to device memory. The reasons for HMM
    and for migration to device memory are explained in the HMM core patch.

    This patch deals with device memory that is un-addressable (i.e. the CPU
    cannot access it). Hence we do not want those struct pages to be managed
    like regular memory. That is why we extend ZONE_DEVICE to support
    different types of memory.

    A persistent memory type is defined for the existing users of ZONE_DEVICE,
    and a new device un-addressable type is added for the un-addressable
    memory. There is a clear separation between what is expected from each
    memory type, and existing users of ZONE_DEVICE are unaffected by the new
    requirements and the new use of the un-addressable type. All type-specific
    code paths are protected by a test against the memory type.

    Because the memory is un-addressable, we use a new special swap type when
    a page is migrated to device memory (this reduces the maximum number of
    swap files).

    Besides the memory type, the main additions to ZONE_DEVICE are two
    callbacks. The first one, page_free(), is called whenever the page
    refcount reaches 1 (which means the page is free, as ZONE_DEVICE pages
    never reach a refcount of 0). This allows the device driver to manage its
    memory and the associated struct pages.

    The second callback, page_fault(), runs when the CPU accesses an address
    that is backed by a device page (which is un-addressable by the CPU).
    This callback is responsible for migrating the page back to system main
    memory. The device driver cannot block this migration; HMM makes sure
    that such pages cannot be pinned in device memory.

    If the device is in an error condition and cannot migrate the memory
    back, a CPU page fault on device memory should end with SIGBUS.
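
    The shape of the two callbacks in struct dev_pagemap (signatures
    approximate to this series):

        struct dev_pagemap {
                /* called when a device page's refcount drops to 1 (page is free) */
                void (*page_free)(struct page *page, void *data);
                /* called on a CPU fault to an un-addressable device page */
                int (*page_fault)(struct vm_area_struct *vma, unsigned long addr,
                                  const struct page *page, unsigned int flags,
                                  pmd_t *pmdp);
                void *data;
                enum memory_type type;  /* MEMORY_DEVICE_PRIVATE, ... */
                /* existing fields: res, ref, altmap, ... */
        };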

    [arnd@arndb.de: fix warning]
    Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Arnd Bergmann
    Acked-by: Dan Williams
    Cc: Ross Zwisler
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

07 Sep, 2017

1 commit

  • devm_memremap_pages() records mapped ranges in pgmap_radix with an entry
    per section's worth of memory (128MB). The key for each of those
    entries is a section number.

    This leads to false positives when devm_memremap_pages() is passed a
    section-unaligned range, as lookups in the misaligned portion fail to
    return NULL. We can close this hole by using the pfn as the key for entries in
    the tree. The number of entries required to describe a remapped range
    is reduced by leveraging multi-order entries.

    In practice this approach usually yields just one entry in the tree if
    the size and starting address are of the same power-of-2 alignment.
    Previously we always needed nr_entries = mapping_size / 128MB.
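
    Insertion now keys the tree by pfn and uses the largest power-of-two order
    that fits the remaining range (sketch):

        foreach_order_pgoff(res, order, pgoff) {
                error = __radix_tree_insert(&pgmap_radix,
                                PHYS_PFN(res->start) + pgoff, order, pgmap);
                if (error)
                        break;
        }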

    Link: https://lists.01.org/pipermail/linux-nvdimm/2016-August/006666.html
    Link: http://lkml.kernel.org/r/150215410565.39310.13767886055248249438.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Toshi Kani
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

18 Jul, 2017

1 commit

  • Boot data (such as EFI related data) is not encrypted when the system is
    booted because UEFI/BIOS does not run with SME active. In order to access
    this data properly it needs to be mapped decrypted.

    Update early_memremap() to provide an arch specific routine to modify the
    pagetable protection attributes before they are applied to the new
    mapping. This is used to remove the encryption mask for boot related data.

    Update memremap() to provide an arch specific routine to determine if RAM
    remapping is allowed. RAM remapping will cause an encrypted mapping to be
    generated. By preventing RAM remapping, ioremap_cache() will be used
    instead, which will provide a decrypted mapping of the boot related data.
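
    The RAM-remap hook ends up looking roughly like this, with a weak default
    that preserves the old behaviour:

        bool __weak arch_memremap_can_ram_remap(resource_size_t offset,
                                                unsigned long size,
                                                unsigned long flags)
        {
                return true;    /* x86 overrides this when SME is active */
        }

        /* in memremap(): only serve MEMREMAP_WB from the linear map if allowed */
        if (is_ram == REGION_INTERSECTS &&
            arch_memremap_can_ram_remap(offset, size, flags))
                addr = try_ram_remap(offset, size, flags);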

    Signed-off-by: Tom Lendacky
    Reviewed-by: Thomas Gleixner
    Reviewed-by: Matt Fleming
    Reviewed-by: Borislav Petkov
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brijesh Singh
    Cc: Dave Young
    Cc: Dmitry Vyukov
    Cc: Jonathan Corbet
    Cc: Konrad Rzeszutek Wilk
    Cc: Larry Woodman
    Cc: Linus Torvalds
    Cc: Michael S. Tsirkin
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Radim Krčmář
    Cc: Rik van Riel
    Cc: Toshimitsu Kani
    Cc: kasan-dev@googlegroups.com
    Cc: kvm@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-doc@vger.kernel.org
    Cc: linux-efi@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/81fb6b4117a5df6b9f2eda342f81bbef4b23d2e5.1500319216.git.thomas.lendacky@amd.com
    Signed-off-by: Ingo Molnar

    Tom Lendacky
     

07 Jul, 2017

2 commits

    arch_add_memory gets a for_device argument which then controls whether we
    want to create memblocks for created memory sections. Simplify the
    logic by telling whether we want memblocks directly rather than going
    through pointless negation. This also makes the api easier to
    understand, because it is clear what we want rather than an opaque
    for_device argument which can mean anything.

    This shouldn't introduce any functional change.

    Link: http://lkml.kernel.org/r/20170515085827.16474-13-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Tested-by: Dan Williams
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The current memory hotplug implementation relies on having all the
    struct pages associated with a zone/node during the physical hotplug
    phase (arch_add_memory->__add_pages->__add_section->__add_zone). In the
    vast majority of cases this means that they are added to ZONE_NORMAL.
    This has been so since 9d99aaa31f59 ("[PATCH] x86_64: Support memory
    hotadd without sparsemem") and it wasn't a big deal back then because
    movable onlining didn't exist yet.

    Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable
    onlining 511c2aba8f07 ("mm, memory-hotplug: dynamic configure movable
    memory and portion memory") and then things got more complicated.
    Rather than reconsidering the zone association which was no longer
    needed (because the memory hotplug already depended on SPARSEMEM) a
    convoluted semantic of zone shifting has been developed. Only the
    currently last memblock or the one adjacent to the zone_movable can be
    onlined movable. This essentially means that the online type changes as
    the new memblocks are added.

    Let's simulate memory hot online manually
    $ echo 0x100000000 > /sys/devices/system/memory/probe
    $ grep . /sys/devices/system/memory/memory32/valid_zones
    Normal Movable

    $ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
    $ grep . /sys/devices/system/memory/memory3?/valid_zones
    /sys/devices/system/memory/memory32/valid_zones:Normal
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable

    $ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
    $ grep . /sys/devices/system/memory/memory3?/valid_zones
    /sys/devices/system/memory/memory32/valid_zones:Normal
    /sys/devices/system/memory/memory33/valid_zones:Normal
    /sys/devices/system/memory/memory34/valid_zones:Normal Movable

    $ echo online_movable > /sys/devices/system/memory/memory34/state
    $ grep . /sys/devices/system/memory/memory3?/valid_zones
    /sys/devices/system/memory/memory32/valid_zones:Normal
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Movable Normal

    This is an awkward semantic because a udev event is sent as soon as the
    block is onlined and a udev handler might want to online it based on
    some policy (e.g. association with a node) but it will inherently race
    with new blocks showing up.

    This patch changes the physical online phase to not associate pages with
    any zone at all. All the pages are just marked reserved and wait for
    the onlining phase to be associated with the zone as per the online
    request. There are only two requirements:

    - existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap

    - ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses

    The latter is not an inherent requirement and can be changed in the
    future. It preserves the current behavior and makes the code slightly
    simpler. This is subject to change.

    This means that the same physical online steps as above will lead to the
    following state:

    Normal Movable

    /sys/devices/system/memory/memory32/valid_zones:Normal Movable
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable

    /sys/devices/system/memory/memory32/valid_zones:Normal Movable
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Normal Movable

    /sys/devices/system/memory/memory32/valid_zones:Normal Movable
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Movable

    Implementation:
    The current move_pfn_range is reimplemented to check the above
    requirements (allow_online_pfn_range) and then updates the respective
    zone (move_pfn_range_to_zone), the pgdat and links all the pages in the
    pfn range with the zone/node. __add_pages is updated to not require the
    zone and only initializes sections in the range. This allowed the
    arch_add_memory code to be simplified (s390 could get rid of quite some
    code).

    devm_memremap_pages is the only user of arch_add_memory which relies on
    the zone association, because it only hooks into memory hotplug half
    way. It uses it to associate the new memory with ZONE_DEVICE but
    doesn't allow it to be {on,off}lined via sysfs. This means that this
    particular code path has to call move_pfn_range_to_zone explicitly.

    The original zone shifting code is kept in place and will be removed in
    the follow up patch for an easier review.

    Please note that this patch also changes the original behavior: offlining
    a memory block adjacent to another zone (Normal vs. Movable) used to allow
    changing its movable type. This will be handled later.
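
    For devm_memremap_pages() the explicit zone association ends up roughly
    as (sketch):

        mem_hotplug_begin();
        error = arch_add_memory(nid, align_start, align_size, false);
        if (!error)
                move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
                                       align_start >> PAGE_SHIFT,
                                       align_size >> PAGE_SHIFT);
        mem_hotplug_done();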

    [richard.weiyang@gmail.com: simplify zone_intersects()]
    Link: http://lkml.kernel.org/r/20170616092335.5177-1-richard.weiyang@gmail.com
    [richard.weiyang@gmail.com: remove duplicate call for set_page_links]
    Link: http://lkml.kernel.org/r/20170616092335.5177-2-richard.weiyang@gmail.com
    [akpm@linux-foundation.org: remove unused local `i']
    Link: http://lkml.kernel.org/r/20170515085827.16474-12-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Wei Yang
    Tested-by: Dan Williams
    Tested-by: Reza Arbab
    Acked-by: Heiko Carstens # For s390 bits
    Acked-by: Vlastimil Babka
    Cc: Martin Schwidefsky
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

01 May, 2017

1 commit

  • The x86 conversion to the generic GUP code included a small change which causes
    crashes and data corruption in the pmem code - not good.

    The root cause is that the /dev/pmem driver code implicitly relies on the x86
    get_user_pages() implementation doing a get_page() on the page refcount, because
    get_page() does a get_zone_device_page() which properly refcounts pmem's separate
    page struct arrays that are not present in the regular page struct structures.
    (The pmem driver does this because it can cover huge memory areas.)

    But the x86 conversion to the generic GUP code changed the get_page() to
    page_cache_get_speculative() which is faster but doesn't do the
    get_zone_device_page() call the pmem code relies on.

    One way to solve the regression would be to change the generic GUP code to use
    get_page(), but that would slow things down a bit and punish other generic-GUP
    using architectures for an x86-ism they did not care about. (Arguably the pmem
    driver was probably not working reliably for them: but nvdimm is an Intel
    feature, so non-x86 exposure is probably still limited.)

    So restructure the pmem code's interface with the MM instead: get rid of the
    get/put_zone_device_page() distinction, integrate put_zone_device_page() into
    __put_page() and restructure the pmem completion-wait and teardown machinery:

    Kirill points out that the calls to {get,put}_dev_pagemap() can be
    removed from the mm fast path if we take a single get_dev_pagemap()
    reference to signify that the page is alive and use the final put of the
    page to drop that reference.

    This does require some care to make sure that any waits for the
    percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
    since it now maintains its own elevated reference.

    This speeds up things while also making the pmem refcounting more robust going
    forward.
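
    Conceptually, the final-put path becomes (sketch):

        void __put_page(struct page *page)
        {
                if (is_zone_device_page(page)) {
                        /* the page belongs to the device that created pgmap;
                         * never return it to the page allocator */
                        put_dev_pagemap(page->pgmap);
                        return;
                }
                /* ... regular compound-page / allocator teardown ... */
        }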

    Suggested-by: Kirill Shutemov
    Tested-by: Kirill Shutemov
    Signed-off-by: Dan Williams
    Reviewed-by: Logan Gunthorpe
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Jérôme Glisse
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Ingo Molnar

    Dan Williams
     

17 Mar, 2017

1 commit

  • Commit bfc8c90139eb ("mem-hotplug: implement get/put_online_mems")
    introduced new functions get/put_online_mems() and mem_hotplug_begin/end()
    in order to allow similar semantics for memory hotplug like for cpu
    hotplug.

    The corresponding functions for cpu hotplug are get/put_online_cpus()
    and cpu_hotplug_begin/done().

    The commit, however, failed to introduce functions that would serialize
    memory hotplug operations the way cpu_maps_update_begin/done() does for
    cpu hotplug.

    This basically leaves mem_hotplug.active_writer unprotected and allows
    concurrent writers to modify it, which may lead to problems as outlined
    by commit f931ab479dd2 ("mm: fix devm_memremap_pages crash, use
    mem_hotplug_{begin, done}").

    That commit was extended again with commit b5d24fda9c3d ("mm,
    devm_memremap_pages: hold device_hotplug lock over mem_hotplug_{begin,
    done}") which serializes memory hotplug operations for some call sites
    by using the device_hotplug lock.

    In addition with commit 3fc21924100b ("mm: validate device_hotplug is held
    for memory hotplug") a sanity check was added to mem_hotplug_begin() to
    verify that the device_hotplug lock is held.

    This in turn triggers the following warning on s390:

    WARNING: CPU: 6 PID: 1 at drivers/base/core.c:643 assert_held_device_hotplug+0x4a/0x58
    Call Trace:
    assert_held_device_hotplug+0x40/0x58)
    mem_hotplug_begin+0x34/0xc8
    add_memory_resource+0x7e/0x1f8
    add_memory+0xda/0x130
    add_memory_merged+0x15c/0x178
    sclp_detect_standby_memory+0x2ae/0x2f8
    do_one_initcall+0xa2/0x150
    kernel_init_freeable+0x228/0x2d8
    kernel_init+0x2a/0x140
    kernel_thread_starter+0x6/0xc

    One possible fix would be to add more lock_device_hotplug() and
    unlock_device_hotplug() calls around each call site of
    mem_hotplug_begin/end(). But that would give the device_hotplug lock
    additional semantics it better should not have (serialize memory hotplug
    operations).

    Instead add a new memory_add_remove_lock which has similar semantics to
    cpu_add_remove_lock for cpu hotplug.

    To keep things hopefully a bit easier the lock will be locked and unlocked
    within the mem_hotplug_begin/end() functions.
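
    Sketch of the added serialization:

        static DEFINE_MUTEX(memory_add_remove_lock);

        void mem_hotplug_begin(void)
        {
                mutex_lock(&memory_add_remove_lock);
                /* ... existing active_writer / refcount handling ... */
        }

        void mem_hotplug_done(void)
        {
                /* ... existing teardown ... */
                mutex_unlock(&memory_add_remove_lock);
        }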

    Link: http://lkml.kernel.org/r/20170314125226.16779-2-heiko.carstens@de.ibm.com
    Signed-off-by: Heiko Carstens
    Reported-by: Sebastian Ott
    Acked-by: Dan Williams
    Acked-by: Rafael J. Wysocki
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Ben Hutchings
    Cc: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     

25 Feb, 2017

1 commit

  • The mem_hotplug_{begin,done} lock coordinates with {get,put}_online_mems()
    to hold off "readers" of the current state of memory from new hotplug
    actions. mem_hotplug_begin() expects exclusive access, via the
    device_hotplug lock, to set mem_hotplug.active_writer. Calling
    mem_hotplug_begin() without locking device_hotplug can lead to
    corrupting mem_hotplug.refcount and missed wakeups / soft lockups.
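
    The fix is to take the device_hotplug lock around the hotplug section in
    the affected call sites, roughly:

        lock_device_hotplug();
        mem_hotplug_begin();
        /* ... arch_add_memory() / arch_remove_memory() work ... */
        mem_hotplug_done();
        unlock_device_hotplug();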

    [dan.j.williams@intel.com: v2]
    Link: http://lkml.kernel.org/r/148728203365.38457.17804568297887708345.stgit@dwillia2-desk3.amr.corp.intel.com
    Link: http://lkml.kernel.org/r/148693885680.16345.17802627926777862337.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: f931ab479dd2 ("mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}")
    Signed-off-by: Dan Williams
    Reported-by: Ben Hutchings
    Cc: Michal Hocko
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: Masayoshi Mizuma
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

11 Jan, 2017

1 commit

  • Both arch_add_memory() and arch_remove_memory() expect a single threaded
    context.

    For example, arch/x86/mm/init_64.c::kernel_physical_mapping_init() does
    not hold any locks over this check and branch:

    if (pgd_val(*pgd)) {
            pud = (pud_t *)pgd_page_vaddr(*pgd);
            paddr_last = phys_pud_init(pud, __pa(vaddr),
                                       __pa(vaddr_end),
                                       page_size_mask);
            continue;
    }

    pud = alloc_low_page();
    paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
                               page_size_mask);

    The result is that two threads calling devm_memremap_pages()
    simultaneously can end up colliding on pgd initialization. This leads
    to crash signatures like the following where the loser of the race
    initializes the wrong pgd entry:

    BUG: unable to handle kernel paging request at ffff888ebfff0000
    IP: memcpy_erms+0x6/0x10
    PGD 2f8e8fc067 PUD 0
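
    The fix (sketch) is to serialize the arch_add_memory() call in
    devm_memremap_pages() behind the memory hotplug lock:

        mem_hotplug_begin();
        error = arch_add_memory(nid, align_start, align_size, true);
        mem_hotplug_done();
        if (error)
                goto err_add_memory;
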
    Cc: Christoph Hellwig
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

10 Sep, 2016

1 commit

  • track_pfn_insert() in vmf_insert_pfn_pmd() is marking dax mappings as
    uncacheable, rendering them impractical for application usage. DAX-pte
    mappings are cached and the goal of establishing DAX-pmd mappings is to
    attain more performance, not dramatically less (3 orders of magnitude).

    track_pfn_insert() relies on a previous call to reserve_memtype() to
    establish the expected page_cache_mode for the range. While memremap()
    arranges for reserve_memtype() to be called, devm_memremap_pages() does
    not. So, teach track_pfn_insert() and untrack_pfn() how to handle
    tracking without a vma, and arrange for devm_memremap_pages() to
    establish the write-back-cache reservation in the memtype tree.
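
    On the devm_memremap_pages() side this amounts to a memtype reservation
    with no vma (sketch):

        pgprot_t pgprot = PAGE_KERNEL;

        /* reserve the range as write-back in the memtype tree */
        error = track_pfn_remap(NULL, &pgprot, PHYS_PFN(align_start),
                                0, align_size);
        if (error)
                goto err_pfn_remap;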

    Cc:
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Nilesh Choudhury
    Cc: Kirill A. Shutemov
    Reported-by: Toshi Kani
    Reported-by: Kai Zhang
    Acked-by: Andrew Morton
    Signed-off-by: Dan Williams

    Dan Williams
     

29 Jul, 2016

2 commits

  • Pull libnvdimm updates from Dan Williams:

    - Replace pcommit with ADR / directed-flushing.

    The pcommit instruction, which has not shipped on any product, is
    deprecated. Instead, the requirement is that platforms implement
    either ADR, or provide one or more flush addresses per nvdimm.

    ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers
    to the memory controller on a power-fail event.

    Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware
    Interface Table (NFIT) sub-structure: "Flush Hint Address Structure".
    A flush hint is an mmio address that when written and fenced assures
    that all previous posted writes targeting a given dimm have been
    flushed to media.

    - On-demand ARS (address range scrub).

    Linux uses the results of the ACPI ARS commands to track bad blocks
    in pmem devices. When latent errors are detected we re-scrub the
    media to refresh the bad block list, userspace can also request a
    re-scrub at any time.

    - Support for the Microsoft DSM (device specific method) command
    format.

    - Support for EDK2/OVMF virtual disk device memory ranges.

    - Various fixes and cleanups across the subsystem.

    * tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits)
    libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register"
    nfit: do an ARS scrub on hitting a latent media error
    nfit: move to nfit/ sub-directory
    nfit, libnvdimm: allow an ARS scrub to be triggered on demand
    libnvdimm: register nvdimm_bus devices with an nd_bus driver
    pmem: clarify a debug print in pmem_clear_poison
    x86/insn: remove pcommit
    Revert "KVM: x86: add pcommit support"
    nfit, tools/testing/nvdimm/: unify shutdown paths
    libnvdimm: move ->module to struct nvdimm_bus_descriptor
    nfit: cleanup acpi_nfit_init calling convention
    nfit: fix _FIT evaluation memory leak + use after free
    tools/testing/nvdimm: add manufacturing_{date|location} dimm properties
    tools/testing/nvdimm: add virtual ramdisk range
    acpi, nfit: treat virtual ramdisk SPA as pmem region
    pmem: kill __pmem address space
    pmem: kill wmb_pmem()
    libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes
    fs/dax: remove wmb_pmem()
    libnvdimm, pmem: flush posted-write queues on shutdown
    ...

    Linus Torvalds
     
  • Now that ZONE_DEVICE depends on SPARSEMEM_VMEMMAP we can simplify some
    ifdef guards to just ZONE_DEVICE.

    Link: http://lkml.kernel.org/r/146687646788.39261.8020536391978771940.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Vlastimil Babka
    Cc: Eric Sandeen
    Cc: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

25 Jun, 2016

1 commit

  • Currently phys_to_pfn_t() is an exported symbol to allow nfit_test to
    override it and indicate that nfit_test-pmem is not device-mapped. Now,
    we want to enable nfit_test to operate without DMA_CMA and the pmem it
    provides will no longer be physically contiguous, i.e. won't be capable
    of supporting direct_access requests larger than a page. Make
    pmem_direct_access() a weak symbol so that it can be replaced by the
    tools/testing/nvdimm/ version, and move phys_to_pfn_t() to a static
    inline now that it no longer needs to be overridden.

    Acked-by: Johannes Thumshirn
    Signed-off-by: Dan Williams

    Dan Williams
     

04 Apr, 2016

1 commit

  • Currently, the memremap code serves MEMREMAP_WB mappings directly from
    the kernel direct mapping, unless the region is in high memory, in which
    case it falls back to using ioremap_cache(). However, the semantics of
    ioremap_cache() are not unambiguously defined, and on ARM, it will
    actually result in a mapping type that differs from the attributes used
    for the linear mapping, and for this reason, the ioremap_cache() call
    fails if the region is part of the memory managed by the kernel.

    So instead, implement an optional hook 'arch_memremap_wb' whose default
    implementation calls ioremap_cache() as before, but which can be
    overridden by the architecture to do what is appropriate for it.
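
    The hook and its default (this is essentially the whole generic-side
    change):

        #ifndef arch_memremap_wb
        static void *arch_memremap_wb(resource_size_t offset, unsigned long size)
        {
                return (__force void *)ioremap_cache(offset, size);
        }
        #endif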

    Acked-by: Dan Williams
    Signed-off-by: Ard Biesheuvel

    Ard Biesheuvel
     

23 Mar, 2016

2 commits

  • Add a flag to memremap() for writecombine mappings. Mappings satisfied
    by this flag will not be cached, however writes may be delayed or
    combined into more efficient bursts. This is most suitable for buffers
    written sequentially by the CPU for use by other DMA devices.

    Signed-off-by: Brian Starkey
    Reviewed-by: Catalin Marinas
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brian Starkey
     
  • These patches implement a MEMREMAP_WC flag for memremap(), which can be
    used to obtain writecombine mappings. This is then used for setting up
    dma_coherent_mem regions which use the DMA_MEMORY_MAP flag.

    The motivation is to fix an alignment fault on arm64, and the suggestion
    to implement MEMREMAP_WC for this case was made at [1]. That particular
    issue is handled in patch 4, which makes sure that the appropriate
    memset function is used when zeroing allocations mapped as IO memory.

    This patch (of 4):

    Don't modify the flags input argument to memremap(). MEMREMAP_WB is
    already a special case so we can check for it directly instead of
    clearing flag bits in each mapper.

    Signed-off-by: Brian Starkey
    Cc: Catalin Marinas
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brian Starkey
     

16 Mar, 2016

1 commit

  • Commit 4b94ffdc4163 ("x86, mm: introduce vmem_altmap to augment
    vmemmap_populate()"), introduced the to_vmem_altmap() function.

    The comments in this function contain two typos (one misspelling of the
    Kconfig option CONFIG_SPARSEMEM_VMEMMAP, and one missing letter 'n');
    let's fix them up.

    Signed-off-by: Andreas Ziegler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Ziegler
     

15 Mar, 2016

1 commit

  • Pull ram resource handling changes from Ingo Molnar:
    "Core kernel resource handling changes to support NVDIMM error
    injection.

    This tree introduces a new I/O resource type, IORESOURCE_SYSTEM_RAM,
    for System RAM while keeping the current IORESOURCE_MEM type bit set
    for all memory-mapped ranges (including System RAM) for backward
    compatibility.

    With this resource flag it no longer takes a strcmp() loop through the
    resource tree to find "System RAM" resources.

    The new resource type is then used to extend ACPI/APEI error injection
    facility to also support NVDIMM"

    * 'core-resources-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    ACPI/EINJ: Allow memory error injection to NVDIMM
    resource: Kill walk_iomem_res()
    x86/kexec: Remove walk_iomem_res() call with GART type
    x86, kexec, nvdimm: Use walk_iomem_res_desc() for iomem search
    resource: Add walk_iomem_res_desc()
    memremap: Change region_intersects() to take @flags and @desc
    arm/samsung: Change s3c_pm_run_res() to use System RAM type
    resource: Change walk_system_ram() to use System RAM type
    drivers: Initialize resource entry to zero
    xen, mm: Set IORESOURCE_SYSTEM_RAM to System RAM
    kexec: Set IORESOURCE_SYSTEM_RAM for System RAM
    arch: Set IORESOURCE_SYSTEM_RAM flag for System RAM
    ia64: Set System RAM type and descriptor
    x86/e820: Set System RAM type and descriptor
    resource: Add I/O resource descriptor
    resource: Handle resource flags properly
    resource: Add System RAM resource type

    Linus Torvalds
     

10 Mar, 2016

3 commits

  • In memremap's helper function try_ram_remap(), we dereference a struct
    page pointer that was derived from a PFN that is known to be covered by
    a 'System RAM' iomem region, and is thus assumed to be a 'valid' PFN,
    i.e., a PFN that has a struct page associated with it and is covered by
    the kernel direct mapping.

    However, the assumption that there is a 1:1 relation between the System
    RAM iomem region and the kernel direct mapping is not universally valid
    on all architectures, and on ARM and arm64, 'System RAM' may include
    regions for which pfn_valid() returns false.

    Generally speaking, both __va() and pfn_to_page() should only ever be
    called on PFNs/physical addresses for which pfn_valid() returns true, so
    add that check to try_ram_remap().
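
    Sketch of try_ram_remap() with the added check:

        static void *try_ram_remap(resource_size_t offset, size_t size)
        {
                unsigned long pfn = PHYS_PFN(offset);

                /* In the simple case just return the existing linear address */
                if (pfn_valid(pfn) && !PageHighMem(pfn_to_page(pfn)))
                        return __va(offset);
                return NULL;    /* fall back to ioremap_cache() */
        }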

    Signed-off-by: Ard Biesheuvel
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ard Biesheuvel
     
  • The check for whether we overlap "System RAM" needs to be done at
    section granularity. For example a system with the following mapping:

    100000000-37bffffff : System RAM
    37c000000-837ffffff : Persistent Memory

    ...is unable to use devm_memremap_pages() as it would result in two
    zones colliding within a given section.

    Signed-off-by: Dan Williams
    Cc: Ross Zwisler
    Reviewed-by: Toshi Kani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Given we have uninitialized list_heads being passed to list_add() it
    will always be the case that those uninitialized values randomly trigger
    the poison value. Especially since a list_add() operation will seed the
    stack with the poison value for later stack allocations to trip over.

    For example, see these two false positive reports:

    list_add attempted on force-poisoned entry
    WARNING: at lib/list_debug.c:34
    [..]
    NIP [c00000000043c390] __list_add+0xb0/0x150
    LR [c00000000043c38c] __list_add+0xac/0x150
    Call Trace:
    __list_add+0xac/0x150 (unreliable)
    __down+0x4c/0xf8
    down+0x68/0x70
    xfs_buf_lock+0x4c/0x150 [xfs]

    list_add attempted on force-poisoned entry(0000000000000500),
    new->next == d0000000059ecdb0, new->prev == 0000000000000500
    WARNING: at lib/list_debug.c:33
    [..]
    NIP [c00000000042db78] __list_add+0xa8/0x140
    LR [c00000000042db74] __list_add+0xa4/0x140
    Call Trace:
    __list_add+0xa4/0x140 (unreliable)
    rwsem_down_read_failed+0x6c/0x1a0
    down_read+0x58/0x60
    xfs_log_commit_cil+0x7c/0x600 [xfs]

    Fixes: commit 5c2c2587b132 ("mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup")
    Signed-off-by: Dan Williams
    Reported-by: Eryu Guan
    Tested-by: Eryu Guan
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

26 Feb, 2016

1 commit

  • Pull libnvdimm fixes from Dan Williams:

    - Two fixes for compatibility with the ACPI 6.1 specification.

    Without these fixes multi-interface DIMMs will fail to be probed, and
    address range scrub commands to find memory errors will give results
    that the kernel will mis-interpret. For multi-interface DIMMs Linux
    will accept either the original 6.0 implementation or 6.1.

    For address range scrub we'll only support 6.1 since ACPI formalized
    this DSM differently than the original example [1] implemented in
    v4.2. The expectation is that production systems will only ever ship
    the ACPI 6.1 address range scrub command definition.

    - The wider async address range scrub work targeting 4.6 discovered
    that the original synchronous implementation in 4.5 is not sizing its
    return buffer correctly.

    - Arnd caught that my recent fix to the size of the pfn_t flags missed
    updating the flags variable used in the pmem driver.

    - Toshi found that we mishandle the memremap() return value in
    devm_memremap().

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    nvdimm: use 'u64' for pfn flags
    devm_memremap: Fix error value when memremap failed
    nfit: update address range scrub commands to the acpi 6.1 format
    libnvdimm, tools/testing/nvdimm: fix 'ars_status' output buffer sizing
    nfit: fix multi-interface dimm handling, acpi6.1 compatibility

    Linus Torvalds
     

24 Feb, 2016

1 commit

  • devm_memremap() returns an ERR_PTR() value in case of error.
    However, it returns NULL when memremap() failed. This causes
    the caller, such as the pmem driver, to proceed and oops later.

    Change devm_memremap() to return ERR_PTR(-ENXIO) when memremap()
    failed.
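
    The fix in devm_memremap() boils down to (sketch):

        addr = memremap(offset, size, flags);
        if (!addr) {
                devres_free(ptr);
                return ERR_PTR(-ENXIO);
        }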

    Signed-off-by: Toshi Kani
    Cc: Andrew Morton
    Cc:
    Reviewed-by: Ross Zwisler
    Signed-off-by: Dan Williams

    Toshi Kani
     

19 Feb, 2016

1 commit

  • The pmem driver calls devm_memremap() to map a persistent memory range.
    When the pmem driver is unloaded, this memremap'd range is not released
    so the kernel will leak a vma.

    Fix devm_memremap_release() to handle a given memremap'd address
    properly.
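
    The release callback now undoes the mapping with memunmap(), which picks
    the right teardown for both the linear-map and ioremap cases (sketch):

        static void devm_memremap_release(struct device *dev, void *res)
        {
                memunmap(*(void **)res);
        }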

    Signed-off-by: Toshi Kani
    Acked-by: Dan Williams
    Cc: Christoph Hellwig
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     

12 Feb, 2016

1 commit

  • The pfn_t type uses an unsigned long to store a pfn + flags value. On a
    64-bit platform the upper 12 bits of an unsigned long are never used for
    storing the value of a pfn. However, this is not true on highmem
    platforms: all 32 bits of a pfn value are used to address a 44-bit
    physical address space. A pfn_t needs to store a 64-bit value.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=112211
    Fixes: 01c8f1c44b83 ("mm, dax, gpu: convert vm_insert_mixed to pfn_t")
    Signed-off-by: Dan Williams
    Reported-by: Stuart Foster
    Reported-by: Julian Margetson
    Tested-by: Julian Margetson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

01 Feb, 2016

1 commit

  • A dma_addr_t is potentially smaller than a phys_addr_t on some archs.
    Don't truncate the address when doing the pfn conversion.

    Cc: Ross Zwisler
    Reported-by: Matthew Wilcox
    [willy: fix pfn_t_to_phys as well]
    Signed-off-by: Dan Williams

    Dan Williams
     

30 Jan, 2016

2 commits

  • Change region_intersects() to identify a target with @flags and
    @desc, instead of @name with strcmp().

    Change the callers of region_intersects(), memremap() and
    devm_memremap(), to set IORESOURCE_SYSTEM_RAM in @flags and
    IORES_DESC_NONE in @desc when searching System RAM.

    Also, export region_intersects() so that the ACPI EINJ error
    injection driver can call this function in a later patch.

    Signed-off-by: Toshi Kani
    Signed-off-by: Borislav Petkov
    Acked-by: Dan Williams
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jakub Sitnicki
    Cc: Jan Kara
    Cc: Jiang Liu
    Cc: Kees Cook
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Linus Torvalds
    Cc: Luis R. Rodriguez
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: linux-arch@vger.kernel.org
    Cc: linux-mm
    Link: http://lkml.kernel.org/r/1453841853-11383-13-git-send-email-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Toshi Kani
     
  • to_vmem_altmap() needs to return valid results until
    arch_remove_memory() completes. It also needs to be valid for any pfn
    in a section regardless of whether that pfn maps to data. This escape
    was a result of a bug in the unit test.

    The signature of this bug is that free_pagetable() fails to retrieve a
    vmem_altmap and goes off into the weeds:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] get_pfnblock_flags_mask+0x49/0x60
    [..]
    Call Trace:
    [] free_hot_cold_page+0x97/0x1d0
    [] __free_pages+0x2a/0x40
    [] free_pagetable+0x8c/0xd4
    [] remove_pagetable+0x37a/0x808
    [] vmemmap_free+0x10/0x20

    Fixes: 4b94ffdc4163 ("x86, mm: introduce vmem_altmap to augment vmemmap_populate()")
    Cc: Andrew Morton
    Reported-by: Jeff Moyer
    Signed-off-by: Dan Williams

    Dan Williams
     

16 Jan, 2016

2 commits

  • A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
    has established a devm_memremap_pages() mapping, i.e. when the pfn_t
    return from ->direct_access() has PFN_DEV and PFN_MAP set. Later, when
    encountering _PAGE_DEVMAP during a page table walk we lookup and pin a
    struct dev_pagemap instance to keep the result of pfn_to_page() valid
    until put_page().

    Signed-off-by: Dan Williams
    Tested-by: Logan Gunthorpe
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
    get_dev_pagemap() enables paths like get_user_pages() to pin a dynamically
    mapped pfn-range (devm_memremap_pages()) while the resulting struct page
    objects are in use. Unlike get_page() it may fail if the device is, or
    is in the process of being, disabled. While the initial lookup of the
    range may be an expensive list walk, the result is cached to speed up
    subsequent lookups which are likely to be in the same mapped range.

    devm_memremap_pages() now requires a reference counter to be specified
    at init time. For pmem this means moving request_queue allocation into
    pmem_alloc() so the existing queue usage counter can track "device
    pages".

    ZONE_DEVICE pages always have an elevated count and will never be on an
    lru reclaim list. That space in 'struct page' can be redirected for
    other uses, but for safety introduce a poison value that will always
    trip __list_add() to assert. This allows half of the struct list_head
    storage to be reclaimed with some assurance to back up the assumption
    that the page count never goes to zero and a list_add() is never
    attempted.
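
    After this change the interface carries the driver-supplied reference
    counter (signatures approximate to this series):

        void *devm_memremap_pages(struct device *dev, struct resource *res,
                                  struct percpu_ref *ref,
                                  struct vmem_altmap *altmap);

        struct dev_pagemap *get_dev_pagemap(resource_size_t phys,
                                            struct dev_pagemap *pgmap);
        void put_dev_pagemap(struct dev_pagemap *pgmap);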

    Signed-off-by: Dan Williams
    Tested-by: Logan Gunthorpe
    Cc: Dave Hansen
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams