13 Jan, 2019

2 commits

  • commit 02917e9f8676207a4c577d4d94eae12bf348e9d7 upstream.

    At Maintainer Summit, Greg brought up a topic I proposed around
    EXPORT_SYMBOL_GPL usage. The motivation was considerations for when
    EXPORT_SYMBOL_GPL is warranted and the criteria for taking the exceptional
    step of reclassifying an existing export. Specifically, I wanted to make
    the case that although the line is fuzzy and hard to specify in abstract
    terms, it is nonetheless clear that devm_memremap_pages() and HMM
    (Heterogeneous Memory Management) have crossed it. The
    devm_memremap_pages() facility should have been EXPORT_SYMBOL_GPL from the
    beginning, and HMM as a derivative of that functionality should have
    naturally picked up that designation as well.

    Contrary to typical rules, the HMM infrastructure was merged upstream with
    zero in-tree consumers. There was a promise at the time that those users
    would be merged "soon", but it has been over a year with no drivers
    arriving. While the Nouveau driver is about to belatedly make good on
    that promise, it is clear that HMM was targeted first and foremost at an
    out-of-tree consumer.

    HMM is derived from devm_memremap_pages(), a facility Christoph and I
    spearheaded to support persistent memory. It combines a device lifetime
    model with a dynamically created 'struct page' / memmap array for any
    physical address range. It enables coordination and control of the many
    code paths in the kernel built to interact with memory via 'struct page'
    objects. With HMM the integration goes even deeper by allowing device
    drivers to hook and manipulate page fault and page free events.
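
    A minimal sketch of how a driver hands such a range to the core (calling
    convention from the post-rework API, where a driver-owned struct
    dev_pagemap describes the range; field setup is elided, as the struct
    layout changed several times in this period):

        /* illustrative only: give a physical range a memmap, torn down
         * automatically with the device */
        void *addr = devm_memremap_pages(dev, pgmap);

        if (IS_ERR(addr))
                return PTR_ERR(addr);
        /* 'addr' now maps memory backed by real struct page objects */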

    One interpretation of when EXPORT_SYMBOL is suitable is when it is
    exporting stable and generic leaf functionality. The
    devm_memremap_pages() facility continues to see expanding use cases,
    peer-to-peer DMA being the most recent, with no clear end date when it
    will stop attracting reworks and semantic changes. It is not suitable to
    export devm_memremap_pages() as a stable 3rd-party driver API because it
    is still changing and manipulates core behavior. Moreover,
    it is not in the best interest of the long term development of the core
    memory management subsystem to permit any external driver to effectively
    define its own system-wide memory management policies with no
    encouragement to engage with upstream.

    I am also concerned that HMM was designed in a way to minimize further
    engagement with the core-MM and that, with these hooks in place,
    device drivers are free to implement their own policies without much
    consideration for whether and how the core-MM could grow to meet that
    need. Going forward not only should HMM be EXPORT_SYMBOL_GPL, but the
    core-MM should be allowed the opportunity and stimulus to change and
    address these new use cases as first class functionality.

    Original changelog:

    hmm_devmem_add() and hmm_devmem_add_resource() duplicated
    devm_memremap_pages() and are now simple wrappers around the core
    facility to inject a dev_pagemap instance into the global pgmap_radix and
    hook page-idle events. The devm_memremap_pages() interface is base
    infrastructure for HMM. HMM has more and deeper ties into the kernel
    memory management implementation than base ZONE_DEVICE, which is itself an
    EXPORT_SYMBOL_GPL facility.

    Originally, the HMM page structure creation routines copied the
    devm_memremap_pages() code and reused ZONE_DEVICE. A cleanup to unify the
    implementations was discussed during the initial review:
    http://lkml.iu.edu/hypermail/linux/kernel/1701.2/00812.html Recent work to
    extend devm_memremap_pages() for the peer-to-peer-DMA facility enabled
    this cleanup to move forward.

    In addition to the integration with devm_memremap_pages() HMM depends on
    other GPL-only symbols:

    mmu_notifier_unregister_no_release
    percpu_ref
    region_intersects
    __class_create

    It goes further to consume / indirectly expose functionality that is not
    exported to any other driver:

    alloc_pages_vma
    walk_page_range

    HMM is derived from devm_memremap_pages(), and extends deep core-kernel
    fundamentals. Similar to devm_memremap_pages(), mark its entry points
    EXPORT_SYMBOL_GPL().
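
    Mechanically, the reclassification is just the export annotation on each
    entry point, along these lines (illustrative, not the literal hunks):

        EXPORT_SYMBOL_GPL(hmm_devmem_add);            /* was EXPORT_SYMBOL() */
        EXPORT_SYMBOL_GPL(hmm_devmem_add_resource);   /* was EXPORT_SYMBOL() */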

    [logang@deltatee.com: PCI/P2PDMA: match interface changes to devm_memremap_pages()]
    Link: http://lkml.kernel.org/r/20181130225911.2900-1-logang@deltatee.com
    Link: http://lkml.kernel.org/r/154275560565.76910.15919297436557795278.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Logan Gunthorpe
    Reviewed-by: Christoph Hellwig
    Cc: Logan Gunthorpe
    Cc: "Jérôme Glisse"
    Cc: Balbir Singh
    Cc: Michal Hocko
    Cc: Benjamin Herrenschmidt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit 58ef15b765af0d2cbe6799ec564f1dc485010ab8 upstream.

    devm semantics arrange for resources to be torn down when
    device-driver-probe fails or when device-driver-release completes.
    Similar to devm_memremap_pages() there is no need to support an explicit
    remove operation when the users properly adhere to devm semantics.

    Note that devm_kzalloc() automatically handles allocating node-local
    memory.
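
    A hedged sketch of what "properly adhering to devm semantics" looks like
    from a driver's probe path (driver and function names hypothetical):

        static int my_probe(struct platform_device *pdev)
        {
                struct dev_pagemap *pgmap;

                pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
                if (!pgmap)
                        return -ENOMEM;
                /* ... describe the device memory range in *pgmap ... */

                /* no explicit remove: devres unwinds the mapping on probe
                 * failure and on driver release */
                return PTR_ERR_OR_ZERO(devm_memremap_pages(&pdev->dev, pgmap));
        }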

    Link: http://lkml.kernel.org/r/154275559545.76910.9186690723515469051.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jérôme Glisse
    Cc: "Jérôme Glisse"
    Cc: Logan Gunthorpe
    Cc: Balbir Singh
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

14 Nov, 2018

1 commit

  • commit 86a2d59841ab0b147ffc1b7b3041af87927cf312 upstream.

    In hmm_mirror_unregister(), mm->hmm is set to NULL and then
    mmu_notifier_unregister_no_release() is called. That creates a small
    window where mmu_notifier can call mmu_notifier_ops with mm->hmm equal to
    NULL. Fix this by first unregistering mmu notifier callbacks and then
    setting mm->hmm to NULL.

    Similarly in hmm_register(), set mm->hmm before registering mmu_notifier
    callbacks so callback functions always see mm->hmm set.
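
    The fixed ordering, distilled (field names illustrative, not the literal
    hmm.c code):

        /* unregister: tear down notifier callbacks first ... */
        mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
        /* ... and only then clear the pointer those callbacks dereference */
        mm->hmm = NULL;

        /* register: publish mm->hmm before callbacks can possibly run */
        mm->hmm = hmm;
        mmu_notifier_register(&hmm->mmu_notifier, mm);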

    Link: http://lkml.kernel.org/r/20181019160442.18723-4-jglisse@redhat.com
    Signed-off-by: Ralph Campbell
    Signed-off-by: Jérôme Glisse
    Reviewed-by: John Hubbard
    Reviewed-by: Jérôme Glisse
    Reviewed-by: Balbir Singh
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Ralph Campbell
     

26 Aug, 2018

1 commit

  • …/linux/kernel/git/nvdimm/nvdimm

    Pull libnvdimm memory-failure update from Dave Jiang:
    "As it stands, memory_failure() gets thoroughly confused by dev_pagemap
    backed mappings. The recovery code has specific enabling for several
    possible page states and needs new enabling to handle poison in dax
    mappings.

    In order to support reliable reverse mapping of user space addresses:

    1/ Add new locking in the memory_failure() rmap path to prevent races
    that would typically be handled by the page lock.

    2/ Since dev_pagemap pages are hidden from the page allocator and the
    "compound page" accounting machinery, add a mechanism to determine
    the size of the mapping that encompasses a given poisoned pfn.

    3/ Given pmem errors can be repaired, change the speculatively
    accessed poison protection, mce_unmap_kpfn(), to be reversible and
    otherwise allow ongoing access from the kernel.

    A side effect of this enabling is that MADV_HWPOISON becomes usable
    for dax mappings, however the primary motivation is to allow the
    system to survive userspace consumption of hardware-poison via dax.
    Specifically the current behavior is:

    mce: Uncorrected hardware memory error in user-access at af34214200
    {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    mce: [Hardware Error]: Machine check events logged
    {1}[Hardware Error]: event severity: corrected
    Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
    [..]
    Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
    mce: Memory error not recovered
    <reboot>

    ...and with these changes:

    Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
    Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
    Memory failure: 0x20cb00: recovery action for dax page: Recovered

    Given all the cross dependencies I propose taking this through
    nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
    folks"

    * tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm, pmem: Restore page attributes when clearing errors
    x86/memory_failure: Introduce {set, clear}_mce_nospec()
    x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
    mm, memory_failure: Teach memory_failure() about dev_pagemap pages
    filesystem-dax: Introduce dax_lock_mapping_entry()
    mm, memory_failure: Collect mapping size in collect_procs()
    mm, madvise_inject_error: Let memory_failure() optionally take a page reference
    mm, dev_pagemap: Do not clear ->mapping on final put
    mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
    filesystem-dax: Set page->index
    device-dax: Set page->index
    device-dax: Enable page_mapping()
    device-dax: Convert to vmf_insert_mixed and vm_fault_t

    Linus Torvalds
     

23 Aug, 2018

1 commit

  • There are several blockable mmu notifiers which might sleep in
    mmu_notifier_invalidate_range_start and that is a problem for the
    oom_reaper because it needs to guarantee a forward progress so it cannot
    depend on any sleepable locks.

    Currently we simply back off and mark an oom victim with blockable mmu
    notifiers as done after a short sleep. That can result in selecting a new
    oom victim prematurely because the previous one still hasn't torn its
    memory down yet.

    We can do much better though. Even if mmu notifiers use sleepable locks,
    there is no reason to automatically assume those locks are held. Moreover,
    the majority of notifiers only care about a portion of the address space,
    and there is absolutely zero reason to fail when we are unmapping an
    unrelated range. Many notifiers do really block and wait for HW, which is
    harder to handle, and in those cases we have to bail out.

    This patch handles the low-hanging fruit.
    __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
    are not allowed to sleep if the flag is set to false. This is achieved by
    using trylock instead of the sleepable lock for most callbacks and
    continuing as long as we do not block down the call chain.
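
    A sketch of the resulting callback pattern (the context struct and its
    mutex are illustrative; the signature is the one introduced here):

        static int my_invalidate_range_start(struct mmu_notifier *mn,
                                             struct mm_struct *mm,
                                             unsigned long start,
                                             unsigned long end,
                                             bool blockable)
        {
                struct my_mirror *m = container_of(mn, struct my_mirror, mn);

                if (blockable)
                        mutex_lock(&m->lock);
                else if (!mutex_trylock(&m->lock))
                        return -EAGAIN; /* tell the core we would have blocked */

                /* ... invalidate device mappings for [start, end) ... */
                mutex_unlock(&m->lock);
                return 0;
        }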

    I think we can improve that even further because there is a common pattern
    to do a range lookup first and then do something about that. The first
    part can be done without a sleeping lock in most cases AFAICS.

    The oom_reaper end then simply retries if there is at least one notifier
    which couldn't make any progress in !blockable mode. A retry loop is
    already implemented to wait for the mmap_sem and this is basically the
    same thing.

    The simplest way for driver developers to test this code path is to wrap
    userspace code which uses these notifiers into a memcg and set the hard
    limit to hit the oom. This can be done e.g. after the test faults in all
    the mmu-notifier-managed memory and sets the hard limit to something
    really small. Then we are looking for a proper process teardown.

    [akpm@linux-foundation.org: coding style fixes]
    [akpm@linux-foundation.org: minor code simplification]
    Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Christian König # AMD notifiers
    Acked-by: Leon Romanovsky # mlx and umem_odp
    Reported-by: David Rientjes
    Cc: "David (ChunMing) Zhou"
    Cc: Paolo Bonzini
    Cc: Alex Deucher
    Cc: David Airlie
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Rodrigo Vivi
    Cc: Doug Ledford
    Cc: Jason Gunthorpe
    Cc: Mike Marciniszyn
    Cc: Dennis Dalessandro
    Cc: Sudeep Dutt
    Cc: Ashutosh Dixit
    Cc: Dimitri Sivanich
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: "Jérôme Glisse"
    Cc: Andrea Arcangeli
    Cc: Felix Kuehling
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

18 Aug, 2018

3 commits

  • Variables align_start and align_size are being assigned but are never
    used, hence they are redundant and can be removed.

    Cleans up clang warnings:
    warning: variable 'align_start' set but not used [-Wunused-but-set-variable]
    warning: variable 'align_size' set but not used [-Wunused-but-set-variable]

    Link: http://lkml.kernel.org/r/20180714161124.3923-1-colin.king@canonical.com
    Signed-off-by: Colin Ian King
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     
  • Use the new return type vm_fault_t for fault handlers. For now, this is
    just documenting that the function returns a VM_FAULT value rather than an
    errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    Ref: commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    In this patch all the callers of handle_mm_fault() are changed to return
    the vm_fault_t type.
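
    An illustrative fault handler in the new style (the lookup helper is
    hypothetical):

        static vm_fault_t my_fault(struct vm_fault *vmf)
        {
                struct page *page = my_lookup_page(vmf->vma, vmf->pgoff);

                if (!page)
                        return VM_FAULT_SIGBUS;

                get_page(page);
                vmf->page = page;
                return 0;       /* VM_FAULT_* codes, never -errno */
        }

        static const struct vm_operations_struct my_vm_ops = {
                .fault = my_fault,
        };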

    Link: http://lkml.kernel.org/r/20180617084810.GA6730@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Cc: Matthew Wilcox
    Cc: Richard Henderson
    Cc: Tony Luck
    Cc: Matt Turner
    Cc: Vineet Gupta
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Richard Kuo
    Cc: Geert Uytterhoeven
    Cc: Michal Simek
    Cc: James Hogan
    Cc: Ley Foon Tan
    Cc: Jonas Bonn
    Cc: James E.J. Bottomley
    Cc: Benjamin Herrenschmidt
    Cc: Palmer Dabbelt
    Cc: Yoshinori Sato
    Cc: David S. Miller
    Cc: Richard Weinberger
    Cc: Guan Xuetao
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Levin, Alexander (Sasha Levin)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     
  • This patch is reworked from an earlier patch that Dan has posted:
    https://patchwork.kernel.org/patch/10131727/

    VM_MIXEDMAP is used by dax to tell mm paths like vm_normal_page() that
    the page they are dealing with is not typical memory from the linear
    map. The get_user_pages_fast() path, since it does not resolve the vma,
    is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
    use that as a VM_MIXEDMAP replacement in some locations. In the cases
    where there is no pte to consult we fall back to using vma_is_dax() to
    detect the VM_MIXEDMAP special case.
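
    Roughly, the two stand-ins look like this (a sketch, not an actual call
    site; the real checks are spread across mm/ and the dax code):

        static bool my_is_devmem(struct vm_area_struct *vma, pte_t *ptep)
        {
                if (ptep)                       /* e.g. gup-fast: no vma */
                        return pte_devmap(*ptep);
                return vma_is_dax(vma);         /* no pte to consult */
        }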

    Now that we have explicit driver pfn_t-flag opt-in/opt-out for
    get_user_pages() support for DAX we can stop setting VM_MIXEDMAP. This
    also means we no longer need to worry about safely manipulating vm_flags
    in a future where we support dynamically changing the dax mode of a
    file.

    DAX should also now be supported with madvise_behavior(), vma_merge(),
    and copy_page_range().

    This patch has been tested against the ndctl unit tests. It has also been
    tested against xfstests commit 625515d using fake pmem created by
    memmap, and no additional issues have been observed.

    Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Acked-by: Dan Williams
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

24 Jul, 2018

1 commit

  • MEMORY_DEVICE_FS_DAX relies on typical page semantics whereby ->mapping
    is only ever cleared by truncation, not final put.

    Without this fix dax pages may forget their mapping association at the
    end of every page pin event.

    Move this atypical behavior that HMM wants into the HMM ->page_free()
    callback.
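
    A sketch of where the clear now lives (callback shape per the hmm_devmem
    ops of this era; body illustrative):

        static void my_devmem_free(struct page *page, void *data)
        {
                page->mapping = NULL;   /* the HMM-only, atypical behavior */
                /* ... hand the page back to the device driver's allocator ... */
        }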

    Cc:
    Cc: Jan Kara
    Cc: Jérôme Glisse
    Cc: Andrew Morton
    Cc: Ross Zwisler
    Fixes: d2c997c0f145 ("fs, dax: use page->mapping...")
    Signed-off-by: Dan Williams
    Acked-by: Jérôme Glisse
    Signed-off-by: Dave Jiang

    Dan Williams
     

22 May, 2018

1 commit

  • In preparation for fixing dax-dma-vs-unmap issues, filesystems need to
    be able to rely on the fact that they will get wakeups on dev_pagemap
    page-idle events. Introduce MEMORY_DEVICE_FS_DAX and
    generic_dax_page_free() as a common indicator / infrastructure for dax
    filesystems to require. With this change there are no users of the
    MEMORY_DEVICE_HOST designation, so remove it.

    The HMM sub-system extended dev_pagemap to arrange a callback when a
    dev_pagemap-managed page is freed. Since a dev_pagemap page is free /
    idle when its reference count is 1, it requires an additional branch to
    check the page type at put_page() time. Given put_page() is a hot path,
    we do not want to incur that check if HMM is not in use, so a static
    branch is used to avoid that overhead when not necessary.
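
    The distilled pattern (names approximate, not the literal mm.h code):

        DEFINE_STATIC_KEY_FALSE(devmap_managed_key);

        static inline void my_put_page(struct page *page)
        {
                /* the page-type check costs nothing until a dev_pagemap user
                 * (HMM, and now FS_DAX) flips the static key */
                if (static_branch_unlikely(&devmap_managed_key) &&
                    put_devmap_managed_page(page))
                        return;

                if (put_page_testzero(page))
                        __put_page(page);
        }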

    Now, the FS_DAX implementation wants to reuse this mechanism for
    receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
    static-key into a generic mechanism that either HMM or FS_DAX code paths
    can enable.

    For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
    care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
    However, we still need to support FS_DAX in the FS_DAX_LIMITED case
    implemented by the s390/dcssblk driver.

    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Michal Hocko
    Reported-by: kbuild test robot
    Reported-by: Thomas Meyer
    Reported-by: Dave Jiang
    Cc: "Jérôme Glisse"
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

12 Apr, 2018

14 commits

  • hmm_devmem_find() requires rcu_read_lock_held() but there's nothing which
    actually uses the RCU protection. The only caller is
    hmm_devmem_pages_create() which already grabs the mutex and does
    superfluous rcu_read_lock/unlock() around the function.

    This doesn't add anything and just adds to confusion. Remove the RCU
    protection and open-code the radix tree lookup. If this needs to become
    more sophisticated in the future, let's add them back when necessary.
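
    The open-coded lookup then amounts to something like the following
    (identifier names approximated for this era):

        /* caller already holds hmm_devmem_lock */
        devmem = radix_tree_lookup(&hmm_devmem_radix, addr >> PAGE_SHIFT);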

    Link: http://lkml.kernel.org/r/20180314194515.1661824-4-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reviewed-by: Jérôme Glisse
    Cc: Paul E. McKenney
    Cc: Benjamin LaHaise
    Cc: Al Viro
    Cc: Kent Overstreet
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Users of hmm_vma_fault() and hmm_vma_get_pfns() provide a flags array and
    a pfn shift value, allowing them to define their own encoding for the HMM
    pfns that are filled into the pfns array of the hmm_range struct. With
    this, device drivers can get pfns that match their own private encoding
    out of HMM without having to do any conversion.
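
    An illustrative driver-side table (the driver's own flag values are
    hypothetical; the HMM_PFN_* indices are the hmm_range flags-array indices
    of this era):

        static const uint64_t my_hmm_flags[HMM_PFN_FLAG_MAX] = {
                [HMM_PFN_VALID]          = MY_PFN_VALID,
                [HMM_PFN_WRITE]          = MY_PFN_WRITE,
                [HMM_PFN_DEVICE_PRIVATE] = MY_PFN_DEVICE_PRIVATE,
        };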

    [rcampbell@nvidia.com: don't ignore specific pte fault flag in hmm_vma_fault()]
    Link: http://lkml.kernel.org/r/20180326213009.2460-2-jglisse@redhat.com
    [rcampbell@nvidia.com: clarify fault logic for device private memory]
    Link: http://lkml.kernel.org/r/20180326213009.2460-3-jglisse@redhat.com
    Link: http://lkml.kernel.org/r/20180323005527.758-16-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Ralph Campbell
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • This changes hmm_vma_fault() to not take a global write fault flag for a
    range, but instead to rely on the caller to populate the HMM pfns array
    with the proper fault flags, i.e. HMM_PFN_VALID if the driver wants a read
    fault for that address, or HMM_PFN_VALID and HMM_PFN_WRITE for a write
    fault.

    Moreover, by setting HMM_PFN_DEVICE_PRIVATE the device driver can ask for
    device-private memory to be migrated back to system memory through a page
    fault.
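
    In other words, the caller now expresses its intent per address, roughly
    like this (a sketch of the idea as described above; HMM_PFN_DEVICE_PRIVATE
    can likewise be set to request migration back to system memory):

        pfns[0] = HMM_PFN_VALID;                    /* read fault wanted */
        pfns[1] = HMM_PFN_VALID | HMM_PFN_WRITE;    /* write fault wanted */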

    This is a more flexible API and it better reflects how devices handle and
    report faults.

    Link: http://lkml.kernel.org/r/20180323005527.758-15-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • No functional change, just create one function to handle pmd and one to
    handle pte (hmm_vma_handle_pmd() and hmm_vma_handle_pte()).

    Link: http://lkml.kernel.org/r/20180323005527.758-14-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: John Hubbard
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Move hmm_pfns_clear() closer to where it is used to make it clear it is
    not used by page table walkers.

    Link: http://lkml.kernel.org/r/20180323005527.758-13-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: John Hubbard
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Make naming consistent across the code: DEVICE_PRIVATE is the name used
    outside HMM code, so use that one.

    Link: http://lkml.kernel.org/r/20180323005527.758-12-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: John Hubbard
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • There is no point in differentiating between a range for which there is
    not even a page directory (and thus no entries) and an empty entry
    (pte_none() or pmd_none() returns true).

    Simply drop the distinction, i.e. remove the HMM_PFN_EMPTY flag, and merge
    the now-duplicate hmm_vma_walk_hole() and hmm_vma_walk_clear() functions.

    Link: http://lkml.kernel.org/r/20180323005527.758-11-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: John Hubbard
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Special vmas (ones with any of the VM_SPECIAL flags) cannot be accessed by
    the device because there is no consistent model across device drivers for
    those vmas and their backing memory.

    This patch passes the hmm_range struct directly as the hmm_pfns_special()
    argument, as it always affects the whole vma and thus the whole range.

    It also makes the behavior consistent: after this patch both
    hmm_vma_fault() and hmm_vma_get_pfns() return -EINVAL when facing such a
    vma. Previously hmm_vma_fault() returned 0 and hmm_vma_get_pfns() returned
    -EINVAL, but both were filling the HMM pfn array with the special entry.

    Link: http://lkml.kernel.org/r/20180323005527.758-10-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: John Hubbard
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • All the device drivers we care about use 64-bit page table entries. In
    order to match this and to avoid a useless define, convert all HMM pfns to
    use uint64_t directly. It is a first step on the road to allowing drivers
    to directly use the pfn values returned by HMM (saving memory and CPU
    cycles used for conversion between the two).

    Link: http://lkml.kernel.org/r/20180323005527.758-9-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: John Hubbard
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Only peculiar architectures allow write without read, thus assume that any
    valid pfn allows read. Note that we do not care about the write-only case:
    with things like atomic compare-and-exchange, or any other operation that
    lets you get the memory value through it, write access effectively implies
    the ability to read anyway.

    Link: http://lkml.kernel.org/r/20180323005527.758-8-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: John Hubbard
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Both hmm_vma_fault() and hmm_vma_get_pfns() were taking an hmm_range
    struct as a parameter and were initializing that struct from their other
    parameters. Have the callers of those functions do this instead, as they
    likely already do, and only pass this struct to both functions. This
    shortens the function signatures and makes it easier to add new parameters
    in the future by simply adding them to the structure.
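
    The resulting calling convention, roughly (field names paraphrased from
    the description; setup abridged):

        struct hmm_range range = {
                .vma   = vma,
                .start = start,
                .end   = end,
                .pfns  = pfns,
        };

        ret = hmm_vma_get_pfns(&range);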

    Link: http://lkml.kernel.org/r/20180323005527.758-7-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: John Hubbard
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • The private field of the mm_walk struct points to an hmm_vma_walk struct,
    not to the hmm_range struct that is desired. Fix it to get the proper
    struct pointer.
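
    The shape of the fix (per the description above): walk->private carries
    the hmm_vma_walk, which in turn points at the range:

        struct hmm_vma_walk *hmm_vma_walk = walk->private;
        struct hmm_range *range = hmm_vma_walk->range;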

    Link: http://lkml.kernel.org/r/20180323005527.758-6-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Cc: John Hubbard
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • This code was lost in translation at one point. This properly calls
    mmu_notifier_unregister_no_release() once the last user is gone. This
    fixes the zombie mm_struct, as without this patch we do not drop the
    refcount we hold on it.

    Link: http://lkml.kernel.org/r/20180323005527.758-5-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • hmm_mirror_register() registers a callback for when the CPU pagetable is
    modified. Normally, the device driver will call hmm_mirror_unregister()
    when the process using the device is finished. However, if the process
    exits uncleanly, the mm_struct can be destroyed with no warning to the
    device driver.

    Link: http://lkml.kernel.org/r/20180323005527.758-4-jglisse@redhat.com
    Signed-off-by: Ralph Campbell
    Signed-off-by: Jérôme Glisse
    Reviewed-by: John Hubbard
    Cc: Evgeny Baskakov
    Cc: Mark Hairgrove
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     

07 Feb, 2018

1 commit

  • Pull libnvdimm updates from Ross Zwisler:

    - Require struct page by default for filesystem DAX to remove a number
    of surprising failure cases. This includes failures with direct I/O,
    gdb and fork(2).

    - Add support for the new Platform Capabilities Structure added to the
    NFIT in ACPI 6.2a. This new table tells us whether the platform
    supports flushing of CPU and memory controller caches on unexpected
    power loss events.

    - Revamp vmem_altmap and dev_pagemap handling to clean up code and
    better support future PCI P2P uses.

    - Deprecate the ND_IOCTL_SMART_THRESHOLD command whose payload has
    become out-of-sync with recent versions of the NVDIMM_FAMILY_INTEL
    spec, and instead rely on the generic ND_CMD_CALL approach used by
    the two other IOCTL families, NVDIMM_FAMILY_{HPE,MSFT}.

    - Enhance nfit_test so we can test some of the new things added in
    version 1.6 of the DSM specification. This includes testing firmware
    download and simulating the Last Shutdown State (LSS) status.

    * tag 'libnvdimm-for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (37 commits)
    libnvdimm, namespace: remove redundant initialization of 'nd_mapping'
    acpi, nfit: fix register dimm error handling
    libnvdimm, namespace: make min namespace size 4K
    tools/testing/nvdimm: force nfit_test to depend on instrumented modules
    libnvdimm/nfit_test: adding support for unit testing enable LSS status
    libnvdimm/nfit_test: add firmware download emulation
    nfit-test: Add platform cap support from ACPI 6.2a to test
    libnvdimm: expose platform persistence attribute for nd_region
    acpi: nfit: add persistent memory control flag for nd_region
    acpi: nfit: Add support for detect platform CPU cache flush on power loss
    device-dax: Fix trailing semicolon
    libnvdimm, btt: fix uninitialized err_lock
    dax: require 'struct page' by default for filesystem dax
    ext2: auto disable dax instead of failing mount
    ext4: auto disable dax instead of failing mount
    mm, dax: introduce pfn_t_special()
    mm: Fix devm_memremap_pages() collision handling
    mm: Fix memory size alignment in devm_memremap_pages_release()
    memremap: merge find_dev_pagemap into get_dev_pagemap
    memremap: change devm_memremap_pages interface to use struct dev_pagemap
    ...

    Linus Torvalds
     

01 Feb, 2018

1 commit

  • The variable 'entry' is used before being initialized in
    hmm_vma_walk_pmd().

    There is no bad effect (besides a performance hit): !non_swap_entry(0)
    evaluates to true, which triggers a fault as if the CPU were trying to
    access migrated memory, and migrates the memory back from device memory
    to regular memory.

    This function (hmm_vma_walk_pmd()) is called when a device driver tries
    to populate its own page table. For migrated memory this should not
    happen, as the device driver should already have populated its page table
    correctly during the migration.

    The only case I can think of is multi-GPU, where a second GPU triggers
    migration back to regular memory. Again, this would just result in a
    performance hit; nothing bad would happen.

    Link: http://lkml.kernel.org/r/20180122185759.26286-1-jglisse@redhat.com
    Signed-off-by: Ralph Campbell
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     

09 Jan, 2018

4 commits


16 Dec, 2017

1 commit

  • This reverts commits 5c9d2d5c269c, c7da82b894e9, and e7fe7b5cae90.

    We'll probably need to revisit this, but basically we should not
    complicate the get_user_pages_fast() case, and checking the actual page
    table protection key bits will require more care anyway, since the
    protection keys depend on the exact state of the VM in question.

    Particularly when doing a "remote" page lookup (i.e. in somebody else's
    VM, not your own), you need to be much more careful than this was. Dave
    Hansen says:

    "So, the underlying bug here is that we now a get_user_pages_remote()
    and then go ahead and do the p*_access_permitted() checks against the
    current PKRU. This was introduced recently with the addition of the
    new p??_access_permitted() calls.

    We have checks in the VMA path for the "remote" gups and we avoid
    consulting PKRU for them. This got missed in the pkeys selftests
    because I did a ptrace read, but not a *write*. I also didn't
    explicitly test it against something where a COW needed to be done"

    It's also not entirely clear that it makes sense to check the protection
    key bits at this level at all. But one possible eventual solution is to
    make the get_user_pages_fast() case just abort if it sees protection key
    bits set, which makes us fall back to the regular get_user_pages() case,
    which then has a vma and can do the check there if we want to.

    We'll see.

    Somewhat related to this all: what we _do_ want to do some day is to
    check the PAGE_USER bit - it should obviously always be set for user
    pages, but it would be a good check to have back. Because we have no
    generic way to test for it, we lost it as part of moving over from the
    architecture-specific x86 GUP implementation to the generic one in
    commit e585513b76f7 ("x86/mm/gup: Switch GUP to the generic
    get_user_page_fast() implementation").

    Cc: Peter Zijlstra
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: "Jérôme Glisse"
    Cc: Andrew Morton
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

30 Nov, 2017

2 commits

  • The 'access_permitted' helper is used in the gup-fast path and goes
    beyond the simple _PAGE_RW check to also:

    - validate that the mapping is writable from a protection keys
    standpoint

    - validate that the pte has _PAGE_USER set, since all fault paths where
    pte_write() is used must be referencing user memory.
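
    In the gup-fast path that amounts to a check of roughly this shape (a
    sketch, close to but not necessarily verbatim from gup.c):

        if (!pte_access_permitted(pte, flags & FOLL_WRITE))
                return 0;       /* fall back to the slow, vma-aware path */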

    Link: http://lkml.kernel.org/r/151043111604.2842.8051684481794973100.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: "Jérôme Glisse"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • The 'access_permitted' helper is used in the gup-fast path and goes
    beyond the simple _PAGE_RW check to also:

    - validate that the mapping is writable from a protection keys
    standpoint

    - validate that the pmd has _PAGE_USER set, since all fault paths where
    pmd_write() is used must be referencing user memory.

    Link: http://lkml.kernel.org/r/151043111049.2842.15241454964150083466.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: "Jérôme Glisse"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

16 Nov, 2017

1 commit

  • Variable align_end is assigned a value but it is never read, so the
    variable is redundant and can be removed. Cleans up the clang warning:
    Value stored to 'align_end' is never read

    Link: http://lkml.kernel.org/r/20171017143837.23207-1-colin.king@canonical.com
    Signed-off-by: Colin Ian King
    Reviewed-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     

09 Sep, 2017

6 commits

  • This moves all the new code, including the new page migration helpers,
    behind a kernel Kconfig option so that there is no code bloat for arches
    or users that do not want to use HMM or any of its associated features.

    arm allyesconfig (first without this patchset, then with it and this patch):
    text data bss dec hex filename
    83721896 46511131 27582964 157815991 96814b7 ../without/vmlinux
    83722364 46511131 27582964 157816459 968168b vmlinux

    [jglisse@redhat.com: struct hmm is only use by HMM mirror functionality]
    Link: http://lkml.kernel.org/r/20170825213133.27286-1-jglisse@redhat.com
    [sfr@canb.auug.org.au: fix build (arm multi_v7_defconfig)]
    Link: http://lkml.kernel.org/r/20170828181849.323ab81b@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20170818032858.7447-1-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Stephen Rothwell
    Cc: Dan Williams
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Unlike unaddressable memory, coherent device memory has a real resource
    associated with it on the system (as the CPU can address it). Add a new
    helper to hotplug such memory within the HMM framework.
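
    Illustrative use (signature approximated for this era; the ops, device and
    resource are the driver's own):

        devmem = hmm_devmem_add_resource(&my_devmem_ops, &pdev->dev, res);
        if (IS_ERR(devmem))
                return PTR_ERR(devmem);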

    Link: http://lkml.kernel.org/r/20170817000548.32038-20-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Balbir Singh
    Cc: Aneesh Kumar
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Platforms with an advanced system bus (like CAPI or CCIX) allow device
    memory to be accessible from the CPU in a cache-coherent fashion. Add a
    new type of ZONE_DEVICE to represent such memory. The use cases are the
    same as for un-addressable device memory, but without all the corner
    cases.

    Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Aneesh Kumar
    Cc: Paul E. McKenney
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Balbir Singh
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • This introduces a dummy HMM device class so that a device driver can use
    it to create an hmm_device for the sole purpose of registering device
    memory. It is useful for device drivers that want to manage multiple
    physical device memories under the same struct device umbrella.

    Link: http://lkml.kernel.org/r/20170817000548.32038-13-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Evgeny Baskakov
    Signed-off-by: John Hubbard
    Signed-off-by: Mark Hairgrove
    Signed-off-by: Sherry Cheung
    Signed-off-by: Subhash Gutti
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • This introduces a simple struct and associated helpers for device drivers
    to use when hotplugging un-addressable device memory as ZONE_DEVICE. It
    will find an unused physical address range and trigger memory hotplug for
    it, which allocates and initializes struct pages for the device memory.

    Device drivers should use this helper during device initialization to
    hotplug the device memory. They should only need to remove the memory once
    the device is going offline (shutdown or hot-remove). There should not be
    any userspace API to hotplug memory, except maybe for a host device driver
    to allow adding more memory to a guest device driver.

    The device's memory is managed by the device driver and HMM only provides
    helpers to that effect.
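
    Illustrative driver-initialization use (signature approximated for this
    era; the ops, device and size are the driver's own):

        devmem = hmm_devmem_add(&my_devmem_ops, &pdev->dev, device_mem_size);
        if (IS_ERR(devmem))
                return PTR_ERR(devmem);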

    Link: http://lkml.kernel.org/r/20170817000548.32038-12-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Evgeny Baskakov
    Signed-off-by: John Hubbard
    Signed-off-by: Mark Hairgrove
    Signed-off-by: Sherry Cheung
    Signed-off-by: Subhash Gutti
    Signed-off-by: Balbir Singh
    Cc: Aneesh Kumar
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • A ZONE_DEVICE page that reaches a refcount of 1 is free, i.e. it no longer
    has any user. For device-private pages this is important to catch, and
    thus we need to special-case put_page() for this.

    Link: http://lkml.kernel.org/r/20170817000548.32038-9-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Kirill A. Shutemov
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse