13 Aug, 2020

2 commits

  • Patch series "mm: Page fault accounting cleanups", v5.

    This is v5 of the page fault accounting cleanup series. It originates from
    Gerald Schaefer's report a week ago of incorrect page fault accounting for
    retried page faults after commit 4064b9827063 ("mm: allow VM_FAULT_RETRY
    for multiple times"):

    https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/

    What this series did:

    - Correct page fault accounting: account a page fault (no matter whether
    it comes from #PF handling, gup, or anything else) only once, with the
    attempt that completes the fault. For example, page fault retries should
    not be counted in the page fault counters. The same applies to the perf
    events.

    - Unify the definition of PERF_COUNT_SW_PAGE_FAULTS: currently this perf
    event is used in an ad-hoc way across different archs.

    Case (1): for many archs it is done at the entry of the page fault
    handler, so it also covers e.g. erroneous faults.

    Case (2): for some other archs, it is only accounted when the page
    fault is resolved successfully.

    Case (3): there are still quite a few archs that have not enabled
    this perf event.

    Since this series touches nearly all the archs, we unify this perf event
    to always follow case (1), which is the one that makes the most sense.
    And since we moved the accounting into handle_mm_fault(), the other two
    MAJ/MIN perf events are taken care of naturally.

    - Unify definition of "major faults": the definition of "major
    fault" is slightly changed when used in accounting (not
    VM_FAULT_MAJOR). More information in patch 1.

    - Always account the page fault onto the one that triggered the page
    fault. This does not matter much for #PF handlings, but mostly for
    gup. More information on this in patch 25.

    Patchset layout:

    Patch 1: Introduce the accounting in handle_mm_fault(), not yet enabled.
    Patch 2-23: Enable the new accounting for arch #PF handlers one by one.
    Patch 24: Enable the new accounting for the remaining outliers (gup, iommu, etc.)
    Patch 25: Clean up the GUP task_struct pointer since it is no longer needed

    This patch (of 25):

    This is a preparation patch to move page fault accounting into the
    general code in handle_mm_fault(). This includes both the per-task
    flt_maj/flt_min counters and the major/minor page fault perf events. To
    do this, the pt_regs pointer is passed into handle_mm_fault().

    PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault
    handlers.

    So far, all the pt_regs pointers passed into handle_mm_fault() are NULL,
    which means this patch should have no intended functional change.
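
    The shape of the centralized helper this series adds in mm/memory.c is
    roughly the following (a simplified sketch, not the exact upstream code):

        /*
         * Count the fault only once, on the attempt that completes it, and
         * only when the arch #PF handler supplied a pt_regs pointer.
         */
        static void mm_account_fault(struct pt_regs *regs, unsigned long address,
                                     vm_fault_t ret, bool major)
        {
                /* Retried or failed attempts are not counted here. */
                if (!regs || (ret & (VM_FAULT_RETRY | VM_FAULT_ERROR)))
                        return;

                if (major) {
                        current->maj_flt++;
                        perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
                } else {
                        current->min_flt++;
                        perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
                }
        }

    PERF_COUNT_SW_PAGE_FAULTS itself stays at the top of each arch handler,
    typically as perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address).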

    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Albert Ou
    Cc: Alexander Gordeev
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Chris Zankel
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Geert Uytterhoeven
    Cc: Gerald Schaefer
    Cc: Greentime Hu
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: James E.J. Bottomley
    Cc: John Hubbard
    Cc: Jonas Bonn
    Cc: Ley Foon Tan
    Cc: "Luck, Tony"
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Nick Hu
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Richard Henderson
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Stefan Kristiansson
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200707225021.200906-1-peterx@redhat.com
    Link: http://lkml.kernel.org/r/20200707225021.200906-2-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Drop the repeated word "pages".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-4-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

11 Jul, 2020

1 commit

  • hmm_range_fault() returns an array of page frame numbers and flags for how
    the pages are mapped in the requested process' page tables. The PFN can be
    used to get the struct page with hmm_pfn_to_page() and the page size order
    can be determined with compound_order(page).

    However, if the page is larger than order 0 (PAGE_SIZE), there is no
    indication that a compound page is mapped by the CPU using a larger page
    size. Without this information, the caller can't safely use a large device
    PTE to map the compound page because the CPU might be using smaller PTEs
    with different read/write permissions.

    Add a new function hmm_pfn_to_map_order() to return the mapping size order
    so that callers know the pages are being mapped with consistent
    permissions and a large device page table mapping can be used if one is
    available.

    This will allow devices to optimize mapping the page into HW by avoiding
    or batching work for huge pages. For instance the dma_map can be done with
    a high order directly.
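
    A hypothetical driver loop using the new helper could look like this
    (sketch only; my_map_huge()/my_map_single() are made-up placeholders and
    alignment handling is omitted):

        unsigned long i = 0;

        while (i < npages) {
                unsigned long hmm_pfn = range->hmm_pfns[i];
                unsigned int order = hmm_pfn_to_map_order(hmm_pfn);
                struct page *page = hmm_pfn_to_page(hmm_pfn);

                if (order)      /* CPU maps this range with a larger page size */
                        my_map_huge(dev, page, order);
                else
                        my_map_single(dev, page);
                i += 1UL << order;
        }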

    Link: https://lore.kernel.org/r/20200701225352.9649-3-rcampbell@nvidia.com
    Signed-off-by: Ralph Campbell
    Signed-off-by: Jason Gunthorpe

    Ralph Campbell
     

10 Jun, 2020

1 commit

  • Add new APIs to assert that mmap_sem is held.

    Using this instead of rwsem_is_locked and lockdep_assert_held[_write]
    makes the assertions more tolerant of future changes to the lock type.
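
    The new helpers look roughly like this (sketch based on the description
    above; mmap_sem was later renamed to mmap_lock):

        static inline void mmap_assert_locked(struct mm_struct *mm)
        {
                lockdep_assert_held(&mm->mmap_sem);
                VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
        }

        static inline void mmap_assert_write_locked(struct mm_struct *mm)
        {
                lockdep_assert_held_write(&mm->mmap_sem);
                VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
        }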

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-10-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

11 May, 2020

3 commits

  • Presumably the intent here was that hmm_range_fault() could put the data
    into some HW specific format and thus avoid some work. However, nothing
    actually does that, and it isn't clear how anything actually could do that
    as hmm_range_fault() provides CPU addresses which must be DMA mapped.

    Perhaps there is some special HW that does not need DMA mapping, but we
    don't have any examples of this, and the theoretical performance win of
    avoiding an extra scan over the pfns array doesn't seem worth the
    complexity. Plus pfns needs to be scanned anyhow to sort out any
    DEVICE_PRIVATE pages.

    This version replaces the uint64_t with an unsigned long containing a pfn
    and fixed flags. On input, flags is filled with the HMM_PFN_REQ_* values;
    on successful output it is filled with HMM_PFN_* values describing the
    state of the pages.
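
    As a rough illustration of the new calling convention (a sketch only:
    notifier registration, locking and retry are omitted, and NPAGES/addr are
    placeholders):

        unsigned long pfns[NPAGES];
        struct hmm_range range = {
                .hmm_pfns      = pfns,
                .start         = addr,
                .end           = addr + NPAGES * PAGE_SIZE,
                .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
        };
        int ret;

        ret = hmm_range_fault(&range);
        if (!ret && (pfns[0] & HMM_PFN_VALID)) {
                struct page *page = hmm_pfn_to_page(pfns[0]);
                bool writable = pfns[0] & HMM_PFN_WRITE;

                /* ... map 'page' into the device, honouring 'writable' ... */
        }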

    amdgpu is simple to convert, it doesn't use snapshot and doesn't use
    per-page flags.

    nouveau uses only 16 hmm_pte entries at most (i.e. it fits in a few cache
    lines), and it sweeps over its pfns array a couple of times anyhow. It
    also has a nasty call chain before it reaches the dma map and hardware,
    suggesting performance isn't important:

    nouveau_svm_fault():
      args.i.m.method = NVIF_VMM_V0_PFNMAP
      nouveau_range_fault()
        nvif_object_ioctl()
          client->driver->ioctl()
            struct nvif_driver nvif_driver_nvkm:
              .ioctl = nvkm_client_ioctl
            nvkm_ioctl()
              nvkm_ioctl_path()
                nvkm_ioctl_v0[type].func(..)
                  nvkm_ioctl_mthd()
                    nvkm_object_mthd()
                      struct nvkm_object_func nvkm_uvmm:
                        .mthd = nvkm_uvmm_mthd
                      nvkm_uvmm_mthd()
                        nvkm_uvmm_mthd_pfnmap()
                          nvkm_vmm_pfn_map()
                            nvkm_vmm_ptes_get_map()
                              func == gp100_vmm_pgt_pfn
                              struct nvkm_vmm_desc_func gp100_vmm_desc_spt:
                                .pfn = gp100_vmm_pgt_pfn
                              nvkm_vmm_iter()
                                REF_PTES == func == gp100_vmm_pgt_pfn()
                                  dma_map_page()

    Link: https://lore.kernel.org/r/5-v2-b4e84f444c7d+24f57-hmm_no_flags_jgg@mellanox.com
    Acked-by: Felix Kuehling
    Tested-by: Ralph Campbell
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
    This is just an alias for HMM_PFN_ERROR; nothing cares that the error was
    because of a special page vs any other error case.

    Link: https://lore.kernel.org/r/4-v2-b4e84f444c7d+24f57-hmm_no_flags_jgg@mellanox.com
    Acked-by: Felix Kuehling
    Reviewed-by: Christoph Hellwig
    Reviewed-by: John Hubbard
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • hmm_vma_walk->last is supposed to be updated after every write to the
    pfns, so that it can be returned by hmm_range_fault(). However, this is
    not done consistently. Fortunately nothing checks the return code of
    hmm_range_fault() for anything other than error.

    More importantly, last must be set before returning -EBUSY as it is used
    to prevent reading an output pfn as input flags when the loop restarts.

    For clarity and simplicity make hmm_range_fault() return 0 or -ERRNO. Only
    set last when returning -EBUSY.
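
    A caller then only needs to look at the error code, e.g. (sketch):

        ret = hmm_range_fault(range);
        if (ret == -EBUSY)
                goto again;     /* raced with an invalidation, retry */
        if (ret)
                return ret;
        /* success: every requested entry in range->pfns[] is valid */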

    Link: https://lore.kernel.org/r/2-v2-b4e84f444c7d+24f57-hmm_no_flags_jgg@mellanox.com
    Acked-by: Felix Kuehling
    Tested-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

31 Mar, 2020

4 commits

    The pagewalker does not call most ops with a NULL vma; those are all
    routed to hmm_vma_walk_hole() via ops->pte_hole instead.

    Thus hmm_vma_fault() is only called with a NULL vma from
    hmm_vma_walk_hole(), so hoist the NULL vma check to there.

    Now it is clear that snapshotting with no vma is a HMM_PFN_ERROR as
    without a vma we have no path to call hmm_vma_fault().

    Link: https://lore.kernel.org/r/20200327200021.29372-10-jgg@ziepe.ca
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • Most places that return an error code, like -EFAULT, do not set
    HMM_PFN_ERROR, only two places do this.

    Resolve this inconsistency by never setting the pfns on an error
    exit. This doesn't seem like a worthwhile thing to do anyhow.

    If for some reason it becomes important, it makes more sense to directly
    return the address of the failing page rather than have the caller scan
    for the HMM_PFN_ERROR.

    No caller inspects the pfns output array if hmm_range_fault() fails.

    Link: https://lore.kernel.org/r/20200327200021.29372-9-jgg@ziepe.ca
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
    In hmm_vma_handle_pte() and hmm_vma_walk_hugetlb_entry(), if a fault
    happens then -EBUSY will be returned and the pfns input flags will have
    been destroyed.

    For hmm_vma_handle_pte() set HMM_PFN_NONE only on the success returns that
    don't otherwise store to pfns.

    For hmm_vma_walk_hugetlb_entry() all exit paths already set pfns, so
    remove the redundant store.

    Fixes: 2aee09d8c116 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis")
    Link: https://lore.kernel.org/r/20200327200021.29372-8-jgg@ziepe.ca
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • swp_offset() should not be called directly, the wrappers are supposed to
    abstract away the encoding of the device_private specific information in
    the swap entry.
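
    In other words, the device-private information should come out of the
    swapops.h wrappers rather than open-coded swp_offset() arithmetic,
    roughly:

        swp_entry_t entry = pte_to_swp_entry(pte);

        if (is_device_private_entry(entry)) {
                /* Use the wrapper instead of decoding swp_offset() by hand. */
                struct page *page = device_private_entry_to_page(entry);

                /* ... page->pgmap etc. are now available ... */
        }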

    Link: https://lore.kernel.org/r/20200327200021.29372-7-jgg@ziepe.ca
    Reviewed-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Tested-by: Ralph Campbell
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

28 Mar, 2020

4 commits

  • Now that flags are handled on a fine-grained per-page basis this global
    flag is redundant and has a confusing overlap with the pfn_flags_mask and
    default_flags.

    Normalize the HMM_FAULT_SNAPSHOT behavior into one place. Callers needing
    the SNAPSHOT behavior should set a pfn_flags_mask and default_flags that
    always results in a cleared HMM_PFN_VALID. Then no pages will be faulted,
    and HMM_FAULT_SNAPSHOT is not a special flow that overrides the masking
    mechanism.

    As this is the last flag, also remove the flags argument. If future flags
    are needed they can be part of the struct hmm_range function arguments.
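
    A snapshot-only caller now expresses that intent purely through the
    masking mechanism, roughly:

        /* Snapshot: never request HMM_PFN_VALID, so nothing is faulted. */
        range->default_flags = 0;
        range->pfn_flags_mask = 0;
        ret = hmm_range_fault(range);   /* note: no flags argument any more */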

    Link: https://lore.kernel.org/r/20200327200021.29372-5-jgg@ziepe.ca
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • Delete several functions that are never called, fix some desync between
    comments and structure content, toss the now out of date top of file
    header, and move one function only used by hmm.c into hmm.c

    Link: https://lore.kernel.org/r/20200327200021.29372-4-jgg@ziepe.ca
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
    Using two bools instead of a flags return is unnecessary and leads to
    bugs. Returning a value is easier for the compiler to check and easier to
    pass around the code flow.

    Convert the two bools into flags and push the change to all callers.
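
    The resulting pattern is roughly the following (names approximate, shown
    for illustration):

        enum {
                HMM_NEED_FAULT       = 1 << 0,
                HMM_NEED_WRITE_FAULT = 1 << 1,
        };

        unsigned int required_fault;

        required_fault = hmm_pte_need_fault(hmm_vma_walk, pfns, cpu_flags);
        if (required_fault)
                return hmm_vma_fault(addr, end, required_fault, walk);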

    Link: https://lore.kernel.org/r/20200327200021.29372-3-jgg@ziepe.ca
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • The checking boils down to some racy check if the pagemap is still
    available or not. Instead of checking this, rely entirely on the
    notifiers, if a pagemap is destroyed then all pages that belong to it must
    be removed from the tables and the notifiers triggered.

    Link: https://lore.kernel.org/r/20200327200021.29372-2-jgg@ziepe.ca
    Reviewed-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Tested-by: Ralph Campbell
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

27 Mar, 2020

14 commits

  • hmm_range_fault() will succeed for any kind of device private memory, even
    if it doesn't belong to the calling entity. While nouveau has some crude
    checks for that, they are broken because they assume nouveau is the only
    user of device private memory. Fix this by passing in an expected pgmap
    owner in the hmm_range_fault structure.

    If a device_private page is found and doesn't match the owner, then it is
    treated as a non-present and non-faultable page.

    This prevents a bug in amdgpu, where it doesn't know how to handle
    device_private pages, but hmm_range_fault would return them anyhow.
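
    The matching is done with an opaque owner pointer supplied on both sides;
    a hedged sketch (my_drm_dev is a placeholder):

        /* The driver that created the device-private pagemap: */
        pgmap->owner = my_drm_dev;              /* before devm_memremap_pages() */

        /* The driver calling hmm_range_fault(): */
        range.dev_private_owner = my_drm_dev;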

    Fixes: 4ef589dc9b10 ("mm/hmm/devmem: device memory hotplug using ZONE_DEVICE")
    Link: https://lore.kernel.org/r/20200316193216.920734-5-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Ralph Campbell
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • Remove the HMM_PFN_DEVICE_PRIVATE flag, no driver has ever set this flag
    on input, and the only place that uses it on output can be trivially
    changed to use is_device_private_page().

    This removes the ability to request that device_private pages are faulted
    back into system memory.

    Link: https://lore.kernel.org/r/20200316193216.920734-4-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
    There is no good reason for this split, as it just obfuscates the flow.

    Link: https://lore.kernel.org/r/20200316135310.899364-6-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • Setting a pfns entry to NONE before returning -EBUSY is a bug that will
    cause corruption of the input flags on the next loop.

    There is just a single caller using hmm_vma_walk_hole_() for the non-fault
    case. Use hmm_pfns_fill() to fill the whole pfn array with zeroes in the
    only caller for the non-fault case and remove the non-fault path from
    hmm_vma_walk_hole_(). This avoids setting NONE before returning -EBUSY.

    Also rename the function to hmm_vma_fault() to better describe what it
    does.

    Fixes: 2aee09d8c116 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis")
    Link: https://lore.kernel.org/r/20200316135310.899364-5-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • Remove the rather confusing goto label and just handle the fault case
    directly in the branch checking for it.

    Link: https://lore.kernel.org/r/20200316135310.899364-4-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • The HMM_FAULT_ALLOW_RETRY isn't used anywhere in the tree. Remove it and
    the weird -EAGAIN handling where handle_mm_fault() drops the mmap_sem.

    Link: https://lore.kernel.org/r/20200316135310.899364-3-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • pmd_to_hmm_pfn_flags() already checks it and makes the cpu flags 0. If no
    fault is requested then the pfns should be returned with the not valid
    flags.

    It should not unconditionally fault if faulting is not requested.

    Fixes: 2aee09d8c116 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis")
    Reviewed-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
    Currently, if a special PTE is encountered, hmm_range_fault() immediately
    returns -EFAULT and sets the HMM_PFN_SPECIAL error output (which nothing
    uses).

    -EFAULT should only be returned after testing with hmm_pte_need_fault().

    Also pte_devmap() and pte_special() are exclusive, and there is no need to
    check IS_ENABLED, pte_special() is stubbed out to return false on
    unsupported architectures.

    Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem")
    Reviewed-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • hmm_range_fault() should never return 0 if the caller requested a valid
    page, but the pfns output for that page would be HMM_PFN_ERROR.

    hmm_pte_need_fault() must always be called before setting HMM_PFN_ERROR to
    detect if the page is in faulting mode or not.

    Fix two cases in hmm_vma_walk_pmd() and reorganize some of the duplicated
    code.

    Fixes: d08faca018c4 ("mm/hmm: properly handle migration pmd")
    Fixes: da4c3c735ea4 ("mm/hmm/mirror: helper to snapshot CPU page table")
    Reviewed-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • The intention with this code is to determine if the caller required the
    pages to be valid, and if so, then take some action to make them valid.
    The action varies depending on the page type.

    In all cases, if the caller doesn't ask for the page, then
    hmm_range_fault() should not return an error.

    Revise the implementation to be clearer, and fix some bugs:

    - hmm_pte_need_fault() must always be called before testing fault or
    write_fault, otherwise the defaults of false apply and the if()'s don't
    work. This was missed on the is_migration_entry() branch.

    - -EFAULT should not be returned unless hmm_pte_need_fault() indicates a
    fault is required, i.e. snapshotting should not fail.

    - For !pte_present() the cpu_flags are always 0, except in the special
    case of is_device_private_entry(), so calling pte_to_hmm_pfn_flags() is
    confusing.

    Reorganize the flow so that it always follows the pattern of calling
    hmm_pte_need_fault() and then checking fault || write_fault.

    Fixes: 2aee09d8c116 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis")
    Reviewed-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
    All return paths that return -EFAULT must call hmm_range_need_fault() to
    determine if the user requires this page to be valid.

    If the page can never be made valid, even if the user later requires it
    (due to the vma flags in this case), then the return should be
    HMM_PFN_ERROR.

    Fixes: a3e0d41c2b1f ("mm/hmm: improve driver API to work and wait over a range")
    Reviewed-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • All success exit paths from the walker functions must set the pfns array.

    A migration entry with no required fault is a HMM_PFN_NONE return, just
    like the pte case.

    Fixes: d08faca018c4 ("mm/hmm: properly handle migration pmd")
    Reviewed-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • This eventually calls into handle_mm_fault() which is a sleeping function.
    Release the lock first.

    hmm_vma_walk_hole() does not touch the contents of the PUD, so it does not
    need the lock.

    Fixes: 3afc423632a1 ("mm: pagewalk: add p4d_entry() and pgd_entry()")
    Cc: Steven Price
    Reviewed-by: Ralph Campbell
    Reviewed-by: Steven Price
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
    Many of the direct error returns skipped doing the pte_unmap(). All
    non-zero exit paths must unmap the pte.

    The pte_unmap() is split unnaturally like this because some of the error
    exit paths trigger a sleep and must release the lock before sleeping.

    Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem")
    Fixes: 53f5c3f489ec ("mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd()")
    Reviewed-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

04 Feb, 2020

2 commits

    The pte_hole() callback is called at multiple levels of the page tables.
    Code dumping the kernel page tables needs to know at what depth the
    missing entry is. Add this as an extra parameter to pte_hole(). When the
    depth isn't known (e.g. when processing a vma), -1 is passed.

    The depth that is reported is the actual level where the entry is missing
    (ignoring any folding that is in place), i.e. any levels where
    PTRS_PER_P?D is set to 1 are ignored.

    Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their
    natural numbers as levels 2/3/4.
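
    After this change the callback signature carries the depth, using the
    numbering described above:

        /* depth: 0 == PGD, 1 == P4D, 2 == PUD, 3 == PMD, 4 == PTE,
         * or -1 when the level is unknown (e.g. when walking by vma). */
        int (*pte_hole)(unsigned long addr, unsigned long next,
                        int depth, struct mm_walk *walk);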

    Link: http://lkml.kernel.org/r/20191218162402.45610-16-steven.price@arm.com
    Signed-off-by: Steven Price
    Tested-by: Zong Li
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • pgd_entry() and pud_entry() were removed by commit 0b1fbfe50006c410
    ("mm/pagewalk: remove pgd_entry() and pud_entry()") because there were no
    users. We're about to add users so reintroduce them, along with
    p4d_entry() as we now have 5 levels of tables.

    Note that commit a00cc7d9dd93d66a ("mm, x86: add support for PUD-sized
    transparent hugepages") already re-added pud_entry() but with different
    semantics to the other callbacks. This commit reverts the semantics back
    to match the other callbacks.

    To support hmm.c which now uses the new semantics of pud_entry() a new
    member ('action') of struct mm_walk is added which allows the callbacks to
    either descend (ACTION_SUBTREE, the default), skip (ACTION_CONTINUE) or
    repeat the callback (ACTION_AGAIN). hmm.c is then updated to call
    pud_trans_huge_lock() itself and make use of the splitting/retry logic of
    the core code.

    After this change pud_entry() is called for all entries, not just
    transparent huge pages.
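
    A pud_entry() callback using the new 'action' mechanism might look like
    this (illustrative sketch; my_pud_entry is hypothetical):

        static int my_pud_entry(pud_t *pudp, unsigned long start,
                                unsigned long end, struct mm_walk *walk)
        {
                spinlock_t *ptl = pud_trans_huge_lock(pudp, walk->vma);

                if (!ptl) {
                        /* Not a huge PUD: let the core walk the PMDs below. */
                        walk->action = ACTION_SUBTREE;
                        return 0;
                }
                /* Handle the huge PUD here, then skip the lower levels. */
                spin_unlock(ptl);
                walk->action = ACTION_CONTINUE;
                return 0;
        }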

    [arnd@arndb.de: fix unused variable warning]
    Link: http://lkml.kernel.org/r/20200107204607.1533842-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20191218162402.45610-12-steven.price@arm.com
    Signed-off-by: Steven Price
    Signed-off-by: Arnd Bergmann
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     

24 Nov, 2019

4 commits

  • These two functions have never been used since they were added.

    Link: https://lore.kernel.org/r/20191113134528.21187-1-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: John Hubbard
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • hmm_range_fault() calls find_vma() and walk_page_range() in a loop. This
    is unnecessary duplication since walk_page_range() calls find_vma() in a
    loop already.

    Simplify hmm_range_fault() by defining a walk_test() callback function to
    filter unhandled vmas.

    This also fixes a bug where hmm_range_fault() was not checking start >=
    vma->vm_start before checking vma->vm_flags so hmm_range_fault() could
    return an error based on the wrong vma for the requested range.

    It also fixes a bug where an error was returned when the vma has no read
    access and the caller did not request a fault; there shouldn't be any
    error return code in that case.
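
    A simplified sketch of such a test_walk()-style filter (the real hmm
    callback does more than this):

        static int hmm_vma_walk_test(unsigned long start, unsigned long end,
                                     struct mm_walk *walk)
        {
                struct vm_area_struct *vma = walk->vma;

                if (!(vma->vm_flags & VM_READ))
                        return 1;       /* non-zero: skip this vma entirely */
                return 0;               /* 0: walk this vma normally */
        }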

    Link: https://lore.kernel.org/r/20191104222141.5173-2-rcampbell@nvidia.com
    Signed-off-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Ralph Campbell
     
  • The only two users of this are now converted to use mmu_interval_notifier,
    delete all the code and update hmm.rst.

    Link: https://lore.kernel.org/r/20191112202231.3856-14-jgg@ziepe.ca
    Reviewed-by: Jérôme Glisse
    Tested-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • hmm_mirror's handling of ranges does not use a sequence count which
    results in this bug:

    CPU0                                  CPU1
                                          hmm_range_wait_until_valid(range)
                                            valid == true
                                          hmm_range_fault(range)
    hmm_invalidate_range_start()
      range->valid = false
    hmm_invalidate_range_end()
      range->valid = true
                                          hmm_range_valid(range)
                                            valid == true

    Where the hmm_range_valid() should not have succeeded.

    Adding the required sequence count would make it nearly identical to the
    new mmu_interval_notifier. Instead replace the hmm_mirror stuff with
    mmu_interval_notifier.

    Co-existence of the two APIs is the first step.

    Link: https://lore.kernel.org/r/20191112202231.3856-4-jgg@ziepe.ca
    Reviewed-by: Jérôme Glisse
    Tested-by: Philip Yang
    Tested-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

30 Oct, 2019

1 commit

  • If a device driver like nouveau tries to use hmm_range_fault() to access
    the special shared zero page in system memory, hmm_range_fault() will
    return -EFAULT and kill the process.

    Allow hmm_range_fault() to return success (0) when the CPU pagetable entry
    points to the special shared zero page.

    page_to_pfn() and pfn_to_page() are defined on the zero page so just
    handle it like any other page.
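
    The fix boils down to excluding the zero page from the special-PTE error
    path, roughly (a sketch of the shape of the check, details abridged):

        if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {
                /* Still an error for any other special PTE. */
                *pfn = range->values[HMM_PFN_SPECIAL];
                return -EFAULT;
        }
        /* The zero page falls through and is handled like any other page. */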

    Link: https://lore.kernel.org/r/20191023195515.13168-3-rcampbell@nvidia.com
    Signed-off-by: Ralph Campbell
    Reviewed-by: "Jérôme Glisse"
    Acked-by: David Hildenbrand
    Signed-off-by: Jason Gunthorpe

    Ralph Campbell
     

07 Sep, 2019

2 commits

  • The mm_walk structure currently mixed data and code. Split out the
    operations vectors into a new mm_walk_ops structure, and while we are
    changing the API also declare the mm_walk structure inside the
    walk_page_range and walk_page_vma functions.

    Based on patch from Linus Torvalds.
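
    After the split, a user supplies a const ops table and the walk state
    stays internal to the core, roughly (my_pmd_entry/my_pte_hole are
    placeholders):

        static const struct mm_walk_ops my_walk_ops = {
                .pmd_entry = my_pmd_entry,
                .pte_hole  = my_pte_hole,
        };

        ret = walk_page_range(mm, start, end, &my_walk_ops, private_data);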

    Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
    Add a new header for the handful of users of the walk_page_range /
    walk_page_vma interface instead of polluting all users of mm.h with it.

    Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

28 Aug, 2019

2 commits

  • Normally, callers to handle_mm_fault() are supposed to check the
    vma->vm_flags first. hmm_range_fault() checks for VM_READ but doesn't
    check for VM_WRITE if the caller requests a page to be faulted in with
    write permission (via the hmm_range.pfns[] value). If the vma is write
    protected, this can result in an infinite loop:

    hmm_range_fault()
      walk_page_range()
        ...
        hmm_vma_walk_hole()
          hmm_vma_walk_hole_()
            hmm_vma_do_fault()
              handle_mm_fault(FAULT_FLAG_WRITE)
                /* returns VM_FAULT_WRITE */
              /* returns -EBUSY */
            /* returns -EBUSY */
          /* returns -EBUSY */
      /* loops on -EBUSY and range->valid */

    Prevent this by checking for vma->vm_flags & VM_WRITE before calling
    handle_mm_fault().
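
    The guard is essentially the following (simplified sketch of the fix
    described above):

        if (write_fault && walk->vma && !(walk->vma->vm_flags & VM_WRITE))
                return -EPERM;  /* never loop faulting a write-protected vma */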

    Link: https://lore.kernel.org/r/20190823221753.2514-3-rcampbell@nvidia.com
    Signed-off-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Ralph Campbell
     
  • Although hmm_range_fault() calls find_vma() to make sure that a vma exists
    before calling walk_page_range(), hmm_vma_walk_hole() can still be called
    with walk->vma == NULL if the start and end address are not contained
    within the vma range.

    hmm_range_fault()                 /* calls find_vma() but no range check */
      walk_page_range()               /* calls find_vma(), sets walk->vma = NULL */
        __walk_page_range()
          walk_pgd_range()
            walk_p4d_range()
              walk_pud_range()
                hmm_vma_walk_hole()
                  hmm_vma_walk_hole_()
                    hmm_vma_do_fault()
                      handle_mm_fault(vma=0)

    Link: https://lore.kernel.org/r/20190823221753.2514-2-rcampbell@nvidia.com
    Signed-off-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Ralph Campbell