07 Sep, 2019

2 commits

  • The mm_walk structure currently mixes data and code. Split out the
    operations vectors into a new mm_walk_ops structure and, while changing
    the API anyway, declare the mm_walk structure inside the
    walk_page_range and walk_page_vma functions (a sketch of the resulting
    API shape follows this list).

    Based on patch from Linus Torvalds.

    Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • Add a new header for the handful of users of the walk_page_range /
    walk_page_vma interface instead of polluting all users of mm.h with it.

    Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
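  A minimal sketch of the API shape after these two commits, as referenced
  above; the callback my_pte_entry(), the ops table my_walk_ops and the
  wrapper scan_range() are made-up names, while the const mm_walk_ops table
  and the walk_page_range() prototype follow the split described here.

    /* Callbacks move into a const, code-only ops table; the mutable
     * mm_walk iterator is now declared inside walk_page_range() itself. */
    #include <linux/pagewalk.h>     /* the new header split out of mm.h */

    static int my_pte_entry(pte_t *pte, unsigned long addr,
                            unsigned long next, struct mm_walk *walk)
    {
            /* per-pte work; per-walk data arrives through walk->private */
            return 0;
    }

    static const struct mm_walk_ops my_walk_ops = {
            .pte_entry = my_pte_entry,
    };

    static int scan_range(struct mm_struct *mm, unsigned long start,
                          unsigned long end, void *priv)
    {
            /* callers no longer fill in a struct mm_walk by hand */
            return walk_page_range(mm, start, end, &my_walk_ops, priv);
    }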

28 Aug, 2019

2 commits

  • Normally, callers to handle_mm_fault() are supposed to check the
    vma->vm_flags first. hmm_range_fault() checks for VM_READ but doesn't
    check for VM_WRITE if the caller requests a page to be faulted in with
    write permission (via the hmm_range.pfns[] value). If the vma is write
    protected, this can result in an infinite loop:

    hmm_range_fault()
      walk_page_range()
        ...
          hmm_vma_walk_hole()
            hmm_vma_walk_hole_()
              hmm_vma_do_fault()
                handle_mm_fault(FAULT_FLAG_WRITE)
                /* returns VM_FAULT_WRITE */
              /* returns -EBUSY */
            /* returns -EBUSY */
          /* returns -EBUSY */
      /* loops on -EBUSY and range->valid */

    Prevent this by checking for vma->vm_flags & VM_WRITE before calling
    handle_mm_fault() (see the sketch after this list).

    Link: https://lore.kernel.org/r/20190823221753.2514-3-rcampbell@nvidia.com
    Signed-off-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Ralph Campbell
     
  • Although hmm_range_fault() calls find_vma() to make sure that a vma
    exists before calling walk_page_range(), hmm_vma_walk_hole() can still
    be called with walk->vma == NULL if the start and end addresses are not
    contained within the vma range.

    hmm_range_fault()            /* calls find_vma() but no range check */
      walk_page_range()          /* calls find_vma(), sets walk->vma = NULL */
        __walk_page_range()
          walk_pgd_range()
            walk_p4d_range()
              walk_pud_range()
                hmm_vma_walk_hole()
                  hmm_vma_walk_hole_()
                    hmm_vma_do_fault()
                      handle_mm_fault(vma=0)

    Link: https://lore.kernel.org/r/20190823221753.2514-2-rcampbell@nvidia.com
    Signed-off-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Ralph Campbell
     
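  A sketch of the shape of both guards described in the two commits above,
  not the exact upstream diff; the hmm_vma_walk_hole_() signature and the
  -EFAULT/-EPERM return values are illustrative assumptions.

    static int hmm_vma_walk_hole_(unsigned long addr, unsigned long end,
                                  bool fault, bool write_fault,
                                  struct mm_walk *walk)
    {
            struct vm_area_struct *vma = walk->vma;

            /* Second commit: walk_page_range() leaves walk->vma == NULL for
             * addresses outside any vma, so never call the fault path then. */
            if (!vma)
                    return -EFAULT;

            /* First commit: honour vma->vm_flags before requesting a write
             * fault, otherwise the walk keeps retrying and hmm_range_fault()
             * loops on -EBUSY forever. */
            if (write_fault && !(vma->vm_flags & VM_WRITE))
                    return -EPERM;

            /* ... otherwise fall through to hmm_vma_do_fault() as before ... */
            return 0;
    }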

23 Aug, 2019

1 commit

  • hmm_range_fault() may return NULL pages because some of the pfns are
    equal to HMM_PFN_NONE. This happens randomly under memory pressure. The
    reason is that on the swapped-out pte path, hmm_vma_handle_pte() doesn't
    update the fault variable from cpu_flags, so it fails to call
    hmm_vma_do_fault() to swap the page in.

    The fix is to call hmm_pte_need_fault() to update the fault variable
    (see the sketch after this list).

    Fixes: 74eee180b935 ("mm/hmm/mirror: device page fault handler")
    Link: https://lore.kernel.org/r/20190815205227.7949-1-Philip.Yang@amd.com
    Signed-off-by: Philip Yang
    Reviewed-by: "Jérôme Glisse"
    Signed-off-by: Jason Gunthorpe

    Yang, Philip
     
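  An illustrative fragment of the non-present pte path in
  hmm_vma_handle_pte() showing where the missing hmm_pte_need_fault() call
  goes, as referenced above; the local names (orig_pfn, cpu_flags, fault,
  write_fault) and the surrounding control flow are assumptions based on
  the description, not a verbatim copy of the fix.

    if (!pte_present(pte)) {
            swp_entry_t entry = pte_to_swp_entry(pte);

            if (!is_device_private_entry(entry)) {
                    /* The missing step: derive fault/write_fault from the
                     * requested pfns[] value so a swapped-out page gets
                     * faulted back in instead of being reported as
                     * HMM_PFN_NONE. */
                    hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags,
                                       &fault, &write_fault);
                    if (fault || write_fault)
                            goto fault;
            }
            /* ... existing swap/migration entry handling ... */
    }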

20 Aug, 2019

1 commit

  • This is a significant simplification: it eliminates all the remaining
    'hmm' stuff in mm_struct, eliminates krefing along the critical notifier
    paths, and takes away all the ugly locking and abuse of page_table_lock.

    mmu_notifier_get() provides the single struct hmm per struct mm, which
    eliminates mm->hmm (a sketch of the get/put pattern follows this list).

    It also directly guarantees that no mmu_notifier op callback is callable
    while a concurrent free is possible, which eliminates all the krefs
    inside the mmu_notifier callbacks.

    The remaining krefs in the range code were overly cautious; drivers are
    already not permitted to free the mirror while a range exists.

    Link: https://lore.kernel.org/r/20190806231548.25242-6-jgg@ziepe.ca
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ralph Campbell
    Tested-by: Ralph Campbell
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
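  A sketch of the mmu_notifier_get()/mmu_notifier_put() pattern the commit
  above builds on; struct my_mirror_core and the my_* functions are
  hypothetical, while the .alloc_notifier/.free_notifier ops and the get
  call are the interface being used. The shared object is released later
  with mmu_notifier_put().

    #include <linux/mmu_notifier.h>
    #include <linux/slab.h>

    struct my_mirror_core {
            struct mmu_notifier notifier;   /* embedded, one per mm */
            /* mirror bookkeeping ... */
    };

    /* Called only for the first mmu_notifier_get() on a given mm, so all
     * users of that mm share one object: no mm->hmm pointer and no krefs
     * inside the notifier callbacks. */
    static struct mmu_notifier *my_alloc_notifier(struct mm_struct *mm)
    {
            struct my_mirror_core *core = kzalloc(sizeof(*core), GFP_KERNEL);

            return core ? &core->notifier : ERR_PTR(-ENOMEM);
    }

    static void my_free_notifier(struct mmu_notifier *mn)
    {
            kfree(container_of(mn, struct my_mirror_core, notifier));
    }

    static const struct mmu_notifier_ops my_notifier_ops = {
            .alloc_notifier = my_alloc_notifier,
            .free_notifier  = my_free_notifier,
            /* .release, .invalidate_range_start, ... as before */
    };

    /* Driver side: look up (or create) the single per-mm notifier object. */
    static struct my_mirror_core *my_mirror_get(struct mm_struct *mm)
    {
            struct mmu_notifier *mn = mmu_notifier_get(&my_notifier_ops, mm);

            if (IS_ERR(mn))
                    return ERR_CAST(mn);
            return container_of(mn, struct my_mirror_core, notifier);
    }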

08 Aug, 2019

8 commits

  • Stub out the whole function and assign NULL to the .hugetlb_entry method
    if CONFIG_HUGETLB_PAGE is not set, as the method won't ever be called in
    that case.

    Link: https://lore.kernel.org/r/20190806160554.14046-13-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • Stub out the whole function when CONFIG_TRANSPARENT_HUGEPAGE is not set to
    make the function easier to read.

    Link: https://lore.kernel.org/r/20190806160554.14046-12-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • We only need the special pud_entry walker if PUD-sized hugepages and
    pte mappings are supported; otherwise the common pagewalk code will take
    care of the iteration. Not implementing this callback reduces the amount
    of code compiled for non-x86 platforms, and also fixes compile failures
    on other architectures when helpers like pud_pfn are not implemented.

    Link: https://lore.kernel.org/r/20190806160554.14046-11-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • pte_index is an internal arch helper in various architectures, without
    consistent semantics. Open-code the calculation based on the virtual
    address instead (see the sketch after this list).

    Link: https://lore.kernel.org/r/20190806160554.14046-10-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • The pagewalk code already passes the value as the hmask parameter.

    Link: https://lore.kernel.org/r/20190806160554.14046-9-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • All users pass PAGE_SIZE here. If we wanted to support single entries
    for huge pages, we should really just add a HMM_FAULT_HUGEPAGE flag
    that uses the huge page size, instead of having the caller calculate
    that size only for the hmm code to verify it.

    Link: https://lore.kernel.org/r/20190806160554.14046-8-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Acked-by: Felix Kuehling
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • The start, end and page_shift values are all saved in the range structure,
    so we might as well use that for argument passing.

    Link: https://lore.kernel.org/r/20190806160554.14046-7-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Felix Kuehling
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • Link: https://lore.kernel.org/r/20190806160554.14046-6-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
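  Two small sketches for the config-stub and pte_index commits above; the
  helper name pte_index_in_pmd() is made up, while the NULL #else branch
  and the open-coded calculation follow the descriptions.

    /* Stub pattern: with the config option off, the method compiles down
     * to NULL and the pagewalk core never calls it. */
    #ifdef CONFIG_HUGETLB_PAGE
    static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
                                          unsigned long start,
                                          unsigned long end,
                                          struct mm_walk *walk)
    {
            /* real hugetlb handling */
            return 0;
    }
    #else
    #define hmm_vma_walk_hugetlb_entry NULL
    #endif

    /* Open-coded index of a pte within its pmd, replacing the
     * arch-specific pte_index(); equivalent to
     * (addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1) on the usual layouts. */
    static unsigned long pte_index_in_pmd(unsigned long addr)
    {
            return (addr & ~PMD_MASK) >> PAGE_SHIFT;
    }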

26 Jul, 2019

8 commits

  • Since hmm_range_fault() doesn't use the struct hmm_range vma field, remove
    it.

    Link: https://lore.kernel.org/r/20190726005650.2566-8-rcampbell@nvidia.com
    Suggested-by: Jason Gunthorpe
    Signed-off-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Ralph Campbell
     
  • walk_page_range() will only call hmm_vma_walk_hugetlb_entry() for
    hugetlbfs pages and doesn't call hmm_vma_walk_pmd() in this case.

    Therefore, it is safe to remove the check for vma->vm_flags & VM_HUGETLB
    in hmm_vma_walk_pmd().

    Link: https://lore.kernel.org/r/20190726005650.2566-7-rcampbell@nvidia.com
    Signed-off-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Ralph Campbell
     
  • Add a HMM_FAULT_SNAPSHOT flag so that hmm_range_snapshot() can be
    merged into the almost identical hmm_range_fault() function (see the
    sketch after this list).

    Link: https://lore.kernel.org/r/20190726005650.2566-5-rcampbell@nvidia.com
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ralph Campbell
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • This allows easier expansion to other flags, and also makes the callers a
    little easier to read.

    Link: https://lore.kernel.org/r/20190726005650.2566-4-rcampbell@nvidia.com
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ralph Campbell
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • A few more comments and minor programming style clean ups. There should
    be no functional changes.

    Link: https://lore.kernel.org/r/20190726005650.2566-3-rcampbell@nvidia.com
    Signed-off-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Ralph Campbell
     
  • The hmm_mirror_ops callback function sync_cpu_device_pagetables()
    passes a struct hmm_update, which is a simplified version of struct
    mmu_notifier_range. This is unnecessary, so replace hmm_update with
    mmu_notifier_range directly.

    Link: https://lore.kernel.org/r/20190726005650.2566-2-rcampbell@nvidia.com
    Signed-off-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    [jgg: white space tuning]
    Signed-off-by: Jason Gunthorpe

    Ralph Campbell
     
  • The magic dropping of mmap_sem when handle_mm_fault returns
    VM_FAULT_RETRY is rather subtle. Add a comment explaining it.

    Link: https://lore.kernel.org/r/20190724065258.16603-8-hch@lst.de
    Tested-by: Ralph Campbell
    Signed-off-by: Jason Gunthorpe
    [hch: wrote a changelog]
    Signed-off-by: Christoph Hellwig

    Jason Gunthorpe
     
  • We should not have two different error codes for the same
    condition. EAGAIN must be reserved for the FAULT_FLAG_ALLOW_RETRY retry
    case and signals to the caller that the mmap_sem has been unlocked.

    Use EBUSY for the !valid case so that callers can get the locking right.

    Link: https://lore.kernel.org/r/20190724065258.16603-2-hch@lst.de
    Tested-by: Ralph Campbell
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ralph Campbell
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Felix Kuehling
    [jgg: elaborated commit message]
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
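  An illustrative caller-side loop tying the flag and error-code commits in
  this group together, as referenced above. The wrapper
  my_snapshot_or_fault() is hypothetical and the retry handling is
  simplified (real callers also wait for range->valid between attempts);
  the single hmm_range_fault(range, flags) entry point with
  HMM_FAULT_SNAPSHOT and the -EBUSY vs -EAGAIN split follow the commit
  descriptions.

    static long my_snapshot_or_fault(struct mm_struct *mm,
                                     struct hmm_range *range, bool snapshot)
    {
            unsigned int flags = snapshot ? HMM_FAULT_SNAPSHOT : 0;
            long ret;

            do {
                    down_read(&mm->mmap_sem);
                    ret = hmm_range_fault(range, flags);
                    up_read(&mm->mmap_sem);
                    /* -EBUSY: a concurrent invalidation marked the range
                     * invalid; retake the lock and walk again.  -EAGAIN is
                     * reserved for callers that allow retry and means
                     * mmap_sem was already dropped for them; this caller
                     * does not opt into that, so the unlock above is
                     * always balanced. */
            } while (ret == -EBUSY);

            return ret;
    }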

15 Jul, 2019

1 commit

  • Pull HMM updates from Jason Gunthorpe:
    "Improvements and bug fixes for the hmm interface in the kernel:

    - Improve clarity, locking and APIs related to the 'hmm mirror'
      feature merged last cycle. In linux-next we now see AMDGPU and
      nouveau using this API.

    - Remove old or transitional hmm APIs. These are holdovers from the
      past with no users, or APIs that existed only to manage cross-tree
      conflicts. There are still a few more of these cleanups that didn't
      make the merge window cutoff.

    - Improve some core mm APIs:
        - export alloc_pages_vma() for driver use
        - refactor into devm_request_free_mem_region() to manage
          DEVICE_PRIVATE resource reservations
        - refactor duplicative driver code into the core dev_pagemap
          struct

    - Remove hmm wrappers of improved core mm APIs, instead have drivers
    use the simplified API directly

    - Remove DEVICE_PUBLIC

    - Simplify the kconfig flow for the hmm users and core code"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
    mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
    mm: remove the HMM config option
    mm: sort out the DEVICE_PRIVATE Kconfig mess
    mm: simplify ZONE_DEVICE page private data
    mm: remove hmm_devmem_add
    mm: remove hmm_vma_alloc_locked_page
    nouveau: use devm_memremap_pages directly
    nouveau: use alloc_page_vma directly
    PCI/P2PDMA: use the dev_pagemap internal refcount
    device-dax: use the dev_pagemap internal refcount
    memremap: provide an optional internal refcount in struct dev_pagemap
    memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
    memremap: remove the data field in struct dev_pagemap
    memremap: add a migrate_to_ram method to struct dev_pagemap_ops
    memremap: lift the devmap_enable manipulation into devm_memremap_pages
    memremap: pass a struct dev_pagemap to ->kill and ->cleanup
    memremap: move dev_pagemap callbacks into a separate structure
    memremap: validate the pagemap type passed to devm_memremap_pages
    mm: factor out a devm_request_free_mem_region helper
    mm: export alloc_pages_vma
    ...

    Linus Torvalds
     

03 Jul, 2019

14 commits

  • Christoph Hellwig says:

    ====================
    Below is a series that cleans up the dev_pagemap interface so that it is
    more easily usable, which removes the need to wrap it in hmm and thus
    allows a lot of code to be killed.

    Changes since v3:
    - pull in "mm/swap: Fix release_pages() when releasing devmap pages" and
    rebase the other patches on top of that
    - fold the hmm_devmem_add_resource into the DEVICE_PUBLIC memory removal
    patch
    - remove _vm_normal_page as it isn't needed without DEVICE_PUBLIC memory
    - pick up various ACKs

    Changes since v2:
    - fix nvdimm kunit build
    - add a new memory type for device dax
    - fix a few issues in intermediate patches that didn't show up in the end
    result
    - incorporate feedback from Michal Hocko, including killing of
    the DEVICE_PUBLIC memory type entirely

    Changes since v1:
    - rebase
    - also switch p2pdma to the internal refcount
    - add type checking for pgmap->type
    - rename the migrate method to migrate_to_ram
    - cleanup the altmap_valid flag
    - various tidbits from the reviews
    ====================

    Conflicts resolved by:
    - Keeping Ira's version of the code in swap.c
    - Using the delete for the section in hmm.rst
    - Using the delete for the devmap code in hmm.c and .h

    * branch 'hmm-devmem-cleanup.4': (24 commits)
    mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
    mm: remove the HMM config option
    mm: sort out the DEVICE_PRIVATE Kconfig mess
    mm: simplify ZONE_DEVICE page private data
    mm: remove hmm_devmem_add
    mm: remove hmm_vma_alloc_locked_page
    nouveau: use devm_memremap_pages directly
    nouveau: use alloc_page_vma directly
    PCI/P2PDMA: use the dev_pagemap internal refcount
    device-dax: use the dev_pagemap internal refcount
    memremap: provide an optional internal refcount in struct dev_pagemap
    memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
    memremap: remove the data field in struct dev_pagemap
    memremap: add a migrate_to_ram method to struct dev_pagemap_ops
    memremap: lift the devmap_enable manipulation into devm_memremap_pages
    memremap: pass a struct dev_pagemap to ->kill and ->cleanup
    memremap: move dev_pagemap callbacks into a separate structure
    memremap: validate the pagemap type passed to devm_memremap_pages
    mm: factor out a devm_request_free_mem_region helper
    mm: export alloc_pages_vma
    ...

    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • Required for dependencies in the next patches.

    Jason Gunthorpe
     
  • All the mm/hmm.c code is better keyed off HMM_MIRROR. Also let nouveau
    depend on it instead of the mix of a dummy dependency symbol plus the
    actually selected one. Drop various odd dependencies, as the code is
    pretty portable.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ira Weiny
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • There isn't really much value add in the hmm_devmem_add wrapper
    anymore, as using devm_memremap_pages directly is now just as simple.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • The only user of it has just been removed, and there wasn't really any need
    to wrap a basic memory allocator to start with.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • Add a flags field to struct dev_pagemap to replace the altmap_valid
    boolean and be a little more extensible. Also add a pgmap_altmap()
    helper to find the optional altmap, and use it to clean up the code
    that accesses the altmap.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ira Weiny
    Reviewed-by: Dan Williams
    Tested-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • struct dev_pagemap is always embedded into a containing structure, so
    there is no need for an additional private data field.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Dan Williams
    Tested-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • This replaces the hacky ->fault callback, which is currently directly
    called from common code through a hmm specific data structure as an
    exercise in layering violations.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ralph Campbell
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Dan Williams
    Tested-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • Just check if there is a ->page_free operation set and take care of the
    static key enable, as well as the put, using device-managed resources.
    Also check that a ->page_free is provided for the pgmap types that
    require it, and check for a valid type as well while we are at it.

    Note that this also fixes the fact that hmm never called
    dev_pagemap_put_ops and thus would leave the slow path enabled forever,
    even after a device driver unload or disable.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ira Weiny
    Reviewed-by: Dan Williams
    Tested-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • Passing the actual typed structure leads to more understandable code
    vs just passing the ref member.

    Reported-by: Logan Gunthorpe
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Logan Gunthorpe
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Dan Williams
    Tested-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • struct dev_pagemap is growing too many callbacks. Move them into a
    separate ops structure so that they are not duplicated for multiple
    instances and an attacker can't easily overwrite them (a sketch of the
    resulting driver-side setup follows this list).

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Logan Gunthorpe
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Dan Williams
    Tested-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • Keep the physical address allocation that hmm_add_device does with the
    rest of the resource code, and allow future reuse of it without the hmm
    wrapper.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: John Hubbard
    Reviewed-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • ->mapping isn't even used by HMM users, and the field at the same
    offset in the zone_device part of the union is declared as pad. (Which,
    by the way, is rather confusing, as DAX uses ->pgmap and ->mapping from
    two different sides of the union, but DAX doesn't use hmm_devmem_free.)

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: John Hubbard
    Reviewed-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • The code hasn't been used since it was added to the tree, and doesn't
    appear to actually be usable.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Acked-by: Michal Hocko
    Reviewed-by: Dan Williams
    Tested-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
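  A sketch of the driver-side setup once this series is applied, as
  referenced above. The my_* names are hypothetical and error handling is
  trimmed; the const dev_pagemap_ops table, the MEMORY_DEVICE_PRIVATE type,
  the optional internal refcount and devm_request_free_mem_region() are the
  interfaces these commits describe.

    #include <linux/memremap.h>

    static void my_page_free(struct page *page)
    {
            /* return the backing device page to the driver's allocator */
    }

    static vm_fault_t my_migrate_to_ram(struct vm_fault *vmf)
    {
            /* copy the DEVICE_PRIVATE page back to system RAM on CPU fault */
            return 0;
    }

    /* Shared, const ops table instead of per-instance callback pointers. */
    static const struct dev_pagemap_ops my_pagemap_ops = {
            .page_free      = my_page_free,
            .migrate_to_ram = my_migrate_to_ram,
    };

    static void *my_add_device_memory(struct device *dev,
                                      struct dev_pagemap *pgmap,
                                      unsigned long size)
    {
            /* the resource reservation that used to hide inside the hmm
             * wrapper */
            struct resource *res =
                    devm_request_free_mem_region(dev, &iomem_resource, size);

            if (IS_ERR(res))
                    return ERR_CAST(res);

            pgmap->type = MEMORY_DEVICE_PRIVATE;
            pgmap->res = *res;
            pgmap->ops = &my_pagemap_ops;
            /* no pgmap->ref supplied: the internal refcount is used when
             * the driver does not provide its own */
            return devm_memremap_pages(dev, pgmap);
    }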

02 Jul, 2019

1 commit

  • This code is a trivial wrapper around device model helpers, which
    should have been integrated into the driver device model usage from
    the start, assuming it actually had any users, which it never has in
    the more than one and a half years since the code was added.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: John Hubbard
    Reviewed-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

28 Jun, 2019

1 commit

  • If the trylock on hmm->mirrors_sem fails, the function will return
    without decrementing the notifiers that were previously incremented.
    Since the caller will not call invalidate_range_end() on EAGAIN, this
    will result in notifiers becoming permanently incremented and deadlock.

    If sync_cpu_device_pagetables() required blocking, the function will
    not return EAGAIN even though the device continues to touch the
    pages. This is a violation of the mmu notifier contract.

    Switch, and rename, the ranges_lock to a spin lock so we can reliably
    obtain it without blocking during error unwind (a sketch of the unwind
    follows this list).

    The error unwind is necessary since the notifiers count must be held
    incremented across the call to sync_cpu_device_pagetables() as we cannot
    allow the range to become marked valid by a parallel
    invalidate_start/end() pair while doing sync_cpu_device_pagetables().

    Signed-off-by: Jason Gunthorpe
    Reviewed-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Tested-by: Philip Yang

    Jason Gunthorpe
     
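  An illustrative shape of the invalidate_range_start path after this
  change, showing the spinlock-protected notifier count and the error
  unwind mentioned above; the structure layout, field names and locking
  details are simplified assumptions based on the description, not the
  exact upstream code.

    static int hmm_invalidate_range_start(struct mmu_notifier *mn,
                                          const struct mmu_notifier_range *nrange)
    {
            struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
            struct hmm_mirror *mirror;
            int ret = 0;

            /* Bump the count under the (now) spin lock: ranges are treated
             * as invalid until the matching range_end(). */
            spin_lock(&hmm->ranges_lock);
            hmm->notifiers++;
            spin_unlock(&hmm->ranges_lock);

            if (mmu_notifier_range_blockable(nrange)) {
                    down_read(&hmm->mirrors_sem);
            } else if (!down_read_trylock(&hmm->mirrors_sem)) {
                    ret = -EAGAIN;
                    goto out;
            }

            list_for_each_entry(mirror, &hmm->mirrors, list) {
                    ret = mirror->ops->sync_cpu_device_pagetables(mirror,
                                                                  nrange);
                    if (ret)
                            break;  /* e.g. -EAGAIN when blocking was needed */
            }
            up_read(&hmm->mirrors_sem);

    out:
            if (ret) {
                    /* The unwind the spin lock makes reliable: the caller
                     * will not call range_end() on error, so drop the
                     * count taken above. */
                    spin_lock(&hmm->ranges_lock);
                    hmm->notifiers--;
                    spin_unlock(&hmm->ranges_lock);
            }
            return ret;
    }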

25 Jun, 2019

1 commit

  • hmm_release() is called exactly once per hmm. ops->release() cannot
    accidentally trigger any action that would recurse back onto
    hmm->mirrors_sem.

    This fixes a use-after-free race of the form:

    CPU0                                     CPU1
                                             hmm_release()
                                               up_write(&hmm->mirrors_sem);
    hmm_mirror_unregister(mirror)
      down_write(&hmm->mirrors_sem);
      up_write(&hmm->mirrors_sem);
      kfree(mirror)
                                               mirror->ops->release(mirror)

    The only user we have today for ops->release is an empty function, so this
    is unambiguously safe.

    As a consequence of plugging this race, drivers are not allowed to
    register/unregister mirrors from within a release op.

    Signed-off-by: Jason Gunthorpe
    Reviewed-by: Christoph Hellwig
    Tested-by: Philip Yang

    Jason Gunthorpe