02 Jul, 2021

2 commits

  • Some devices require exclusive write access to shared virtual memory (SVM)
    ranges to perform atomic operations on that memory. This requires CPU
    page tables to be updated to deny access whilst atomic operations are
    occurring.

    In order to do this introduce a new swap entry type
    (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for exclusive
    access by a device all page table mappings for the particular range are
    replaced with device exclusive swap entries. This causes any CPU access
    to the page to result in a fault.

    Faults are resolved by replacing the faulting entry with the original
    mapping. This results in MMU notifiers being called which a driver uses
    to update access permissions such as revoking atomic access. After
    notifiers have been called the device will no longer have exclusive access
    to the region.

    Walking of the page tables to find the target pages is handled by
    get_user_pages() rather than a direct page table walk. A direct page
    table walk similar to what migrate_vma_collect()/unmap() does could also
    have been utilised. However, this resulted in more code that largely
    duplicated functionality get_user_pages() already provides, since page
    faulting is required to make the PTEs present and to break COW.
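
    As a rough illustration of the driver-side flow, assuming the
    make_device_exclusive_range() interface added by this patch (variable
    names and error handling here are invented for the sketch):

        struct page *page;
        int npages;

        /*
         * Replace the CPU mappings of one page with device exclusive
         * entries. Any later CPU access faults, restores the mappings
         * and fires MMU notifiers so the driver can revoke device access.
         */
        npages = make_device_exclusive_range(mm, addr, addr + PAGE_SIZE,
                                             &page, driver_owner_cookie);
        if (npages != 1)
                return -EBUSY;          /* illustrative error handling */

        /* ... program the device's atomic access to the page ... */

        unlock_page(page);
        put_page(page);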

    [dan.carpenter@oracle.com: fix signedness bug in make_device_exclusive_range()]
    Link: https://lkml.kernel.org/r/YNIz5NVnZ5GiZ3u1@mwanda

    Link: https://lkml.kernel.org/r/20210616105937.23201-8-apopple@nvidia.com
    Signed-off-by: Alistair Popple
    Signed-off-by: Dan Carpenter
    Reviewed-by: Christoph Hellwig
    Cc: Ben Skeggs
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alistair Popple
     
  • Patch series "Add support for SVM atomics in Nouveau", v11.

    Introduction
    ============

    Some devices have features such as atomic PTE bits that can be used to
    implement atomic access to system memory. To support atomic operations
    on a shared virtual memory page, such a device needs access to that page
    which is exclusive of the CPU. This series introduces a mechanism to
    temporarily unmap pages, granting exclusive access to a device.

    These changes are required to support OpenCL atomic operations in Nouveau
    to shared virtual memory (SVM) regions allocated with the
    CL_MEM_SVM_ATOMICS clSVMAlloc flag. A more complete description of the
    OpenCL SVM feature is available at
    https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/
    OpenCL_API.html#_shared_virtual_memory .

    Implementation
    ==============

    Exclusive device access is implemented by adding a new swap entry type
    (SWP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main
    difference is that on fault the original entry is immediately restored by
    the fault handler instead of waiting.

    Restoring the entry triggers calls to MMU notifiers which allows a device
    driver to revoke the atomic access permission from the GPU prior to the
    CPU finalising the entry.

    Patches
    =======

    Patches 1 & 2 refactor existing migration and device private entry
    functions.

    Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
    functionality into separate functions - try_to_migrate_one() and
    try_to_munlock_one().

    Patch 5 renames some existing code but does not introduce functionality.

    Patch 6 is a small clean-up to swap entry handling in copy_pte_range().

    Patch 7 contains the bulk of the implementation for device exclusive
    memory.

    Patch 8 contains some additions to the HMM selftests to ensure everything
    works as expected.

    Patch 9 is a cleanup for the Nouveau SVM implementation.

    Patch 10 contains the implementation of atomic access for the Nouveau
    driver.

    Testing
    =======

    This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
    which checks that GPU atomic accesses to system memory are atomic.
    Without this series the test fails as there is no way of write-protecting
    the page mapping which results in the device clobbering CPU writes. For
    reference the test is available at
    https://ozlabs.org/~apopple/opencl_svm_atomics/

    Further testing has been performed by adding support for testing exclusive
    access to the hmm-tests kselftests.

    This patch (of 10):

    Remove multiple similar inline functions for dealing with different types
    of special swap entries.

    Both migration and device private swap entries use the swap offset to
    store a pfn. Instead of multiple inline functions to obtain a struct page
    for each swap entry type use a common function pfn_swap_entry_to_page().
    Also open-code the various entry_to_pfn() functions, as this results in
    shorter code that is easier to understand.
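
    The common helper is essentially of this shape (a sketch; the BUG_ON
    reflects the rule that migration entries are only used while the
    corresponding page is locked):

        static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
        {
                struct page *p = pfn_to_page(swp_offset(entry));

                /*
                 * Any use of migration entries may only occur while the
                 * corresponding page is locked.
                 */
                BUG_ON(is_migration_entry(entry) && !PageLocked(p));

                return p;
        }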

    Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com
    Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com
    Signed-off-by: Alistair Popple
    Reviewed-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Hugh Dickins
    Cc: Peter Xu
    Cc: Shakeel Butt
    Cc: Ben Skeggs
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alistair Popple
     

25 Jun, 2021

10 commits

  • Aha! Shouldn't that quick scan over pte_none()s make sure that it holds
    ptlock in the PVMW_SYNC case? That too might have been responsible for
    BUGs or WARNs in split_huge_page_to_list() or its unmap_page(), though
    I've never seen any.
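
    In code terms, the fix is roughly to take the pte lock while stepping
    over pte_none() entries when PVMW_SYNC is set, along these lines (a
    sketch, not the exact hunk):

        do {
                pvmw->address += PAGE_SIZE;
                /* ... step to the next pte, crossing tables as needed ... */
                pvmw->pte++;
                if ((pvmw->flags & PVMW_SYNC) && !pvmw->ptl) {
                        pvmw->ptl = pte_lockptr(mm, pvmw->pmd);
                        spin_lock(pvmw->ptl);
                }
        } while (pte_none(*pvmw->pte));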

    Link: https://lkml.kernel.org/r/1bdf384c-8137-a149-2a1e-475a4791c3c@google.com
    Link: https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/
    Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Tested-by: Wang Yugui
    Cc: Alistair Popple
    Cc: Matthew Wilcox
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Will Deacon
    Cc: Yang Shi
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Running certain tests with a DEBUG_VM kernel would crash within hours,
    on the total_mapcount BUG() in split_huge_page_to_list(), while trying
    to free up some memory by punching a hole in a shmem huge page: split's
    try_to_unmap() was unable to find all the mappings of the page (which,
    on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).

    Crash dumps showed two tail pages of a shmem huge page remained mapped
    by pte: ptes in a non-huge-aligned vma of a gVisor process, at the end
    of a long unmapped range; and no page table had yet been allocated for
    the head of the huge page to be mapped into.

    Although designed to handle these odd misaligned huge-page-mapped-by-pte
    cases, page_vma_mapped_walk() falls short by returning false prematurely
    when !pmd_present or !pud_present or !p4d_present or !pgd_present: there
    are cases when a huge page may span the boundary, with ptes present in
    the next page table.

    Restructure page_vma_mapped_walk() as a loop to continue in these cases,
    while keeping its layout much as before. Add a step_forward() helper to
    advance pvmw->address across those boundaries: originally I tried to use
    mm's standard p?d_addr_end() macros, but hit the same crash 512 times
    less often: because of the way redundant levels are folded together, but
    folded differently in different configurations, it was just too
    difficult to use them correctly; and step_forward() is simpler anyway.
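
    The helper is small; roughly (a sketch matching the description above):

        static void step_forward(struct page_vma_mapped_walk *pvmw,
                                 unsigned long size)
        {
                /* advance to the next 'size'-aligned boundary */
                pvmw->address = (pvmw->address + size) & ~(size - 1);
                /* saturate rather than wrap around on overflow */
                if (!pvmw->address)
                        pvmw->address = ULONG_MAX;
        }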

    Link: https://lkml.kernel.org/r/fedb8632-1798-de42-f39e-873551d5bc81@google.com
    Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Alistair Popple
    Cc: Matthew Wilcox
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Wang Yugui
    Cc: Will Deacon
    Cc: Yang Shi
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page_vma_mapped_walk() cleanup: get THP's vma_address_end() at the
    start, rather than later at next_pte.

    It's a little unnecessary overhead on the first call, but makes for a
    simpler loop in the following commit.

    Link: https://lkml.kernel.org/r/4542b34d-862f-7cb4-bb22-e0df6ce830a2@google.com
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Alistair Popple
    Cc: Matthew Wilcox
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Wang Yugui
    Cc: Will Deacon
    Cc: Yang Shi
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page_vma_mapped_walk() cleanup: add a label this_pte, matching next_pte,
    and use "goto this_pte", in place of the "while (1)" loop at the end.

    Link: https://lkml.kernel.org/r/a52b234a-851-3616-2525-f42736e8934@google.com
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Alistair Popple
    Cc: Matthew Wilcox
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Wang Yugui
    Cc: Will Deacon
    Cc: Yang Shi
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page_vma_mapped_walk() cleanup: add a level of indentation to much of
    the body, making no functional change in this commit, but reducing the
    later diff when this is all converted to a loop.

    [hughd@google.com: page_vma_mapped_walk(): add a level of indentation fix]
    Link: https://lkml.kernel.org/r/7f817555-3ce1-c785-e438-87d8efdcaf26@google.com

    Link: https://lkml.kernel.org/r/efde211-f3e2-fe54-977-ef481419e7f3@google.com
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Alistair Popple
    Cc: Matthew Wilcox
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Wang Yugui
    Cc: Will Deacon
    Cc: Yang Shi
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page_vma_mapped_walk() cleanup: adjust the test for crossing page table
    boundary - I believe pvmw->address is always page-aligned, but nothing
    else here assumed that; and remember to reset pvmw->pte to NULL after
    unmapping the page table, though I never saw any bug from that.

    Link: https://lkml.kernel.org/r/799b3f9c-2a9e-dfef-5d89-26e9f76fd97@google.com
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Alistair Popple
    Cc: Matthew Wilcox
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Wang Yugui
    Cc: Will Deacon
    Cc: Yang Shi
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page_vma_mapped_walk() cleanup: rearrange the !pmd_present() block to
    follow the same "return not_found, return not_found, return true"
    pattern as the block above it (note: returning not_found there is never
    premature, since existence or prior existence of huge pmd guarantees
    good alignment).

    Link: https://lkml.kernel.org/r/378c8650-1488-2edf-9647-32a53cf2e21@google.com
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Peter Xu
    Cc: Alistair Popple
    Cc: Matthew Wilcox
    Cc: Ralph Campbell
    Cc: Wang Yugui
    Cc: Will Deacon
    Cc: Yang Shi
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page_vma_mapped_walk() cleanup: re-evaluate pmde after taking lock, then
    use it in subsequent tests, instead of repeatedly dereferencing pointer.

    Link: https://lkml.kernel.org/r/53fbc9d-891e-46b2-cb4b-468c3b19238e@google.com
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Peter Xu
    Cc: Alistair Popple
    Cc: Matthew Wilcox
    Cc: Ralph Campbell
    Cc: Wang Yugui
    Cc: Will Deacon
    Cc: Yang Shi
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page_vma_mapped_walk() cleanup: get the hugetlbfs PageHuge case out of
    the way at the start, so no need to worry about it later.

    Link: https://lkml.kernel.org/r/e31a483c-6d73-a6bb-26c5-43c3b880a2@google.com
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Peter Xu
    Cc: Alistair Popple
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Ralph Campbell
    Cc: Wang Yugui
    Cc: Will Deacon
    Cc: Yang Shi
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Patch series "mm: page_vma_mapped_walk() cleanup and THP fixes".

    I've marked all of these for stable: many are merely cleanups, but I
    think they are much better before the main fix than after.

    This patch (of 11):

    page_vma_mapped_walk() cleanup: sometimes the local copy of pvmw->page
    was used, sometimes pvmw->page itself: use the local copy "page"
    throughout.

    Link: https://lkml.kernel.org/r/589b358c-febc-c88e-d4c2-7834b37fa7bf@google.com
    Link: https://lkml.kernel.org/r/88e67645-f467-c279-bf5e-af4b5c6b13eb@google.com
    Signed-off-by: Hugh Dickins
    Reviewed-by: Alistair Popple
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Peter Xu
    Cc: Yang Shi
    Cc: Wang Yugui
    Cc: Matthew Wilcox
    Cc: Ralph Campbell
    Cc: Zi Yan
    Cc: Will Deacon
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 Jun, 2021

2 commits

  • Running certain tests with a DEBUG_VM kernel would crash within hours,
    on the total_mapcount BUG() in split_huge_page_to_list(), while trying
    to free up some memory by punching a hole in a shmem huge page: split's
    try_to_unmap() was unable to find all the mappings of the page (which,
    on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).

    When that BUG() was changed to a WARN(), it would later crash on the
    VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma) in
    mm/internal.h:vma_address(), used by rmap_walk_file() for
    try_to_unmap().

    vma_address() is usually correct, but there's a wraparound case when the
    vm_start address is unusually low, but vm_pgoff not so low:
    vma_address() chooses max(start, vma->vm_start), but that decides on the
    wrong address, because start has become almost ULONG_MAX.

    Rewrite vma_address() to be more careful about vm_pgoff; move the
    VM_BUG_ON_VMA() out of it, returning -EFAULT for errors, so that it can
    be safely used from page_mapped_in_vma() and page_address_in_vma() too.

    Add vma_address_end() to apply similar care to end address calculation,
    in page_vma_mapped_walk() and page_mkclean_one() and try_to_unmap_one();
    though it raises a question of whether callers would do better to supply
    pvmw->end to page_vma_mapped_walk() - I chose not, for a smaller patch.

    An irritation is that their apparent generality breaks down on KSM
    pages, which cannot be located by the page->index that page_to_pgoff()
    uses: as commit 4b0ece6fa016 ("mm: migrate: fix remove_migration_pte()
    for ksm pages") once discovered. I dithered over the best thing to do
    about that, and have ended up with a VM_BUG_ON_PAGE(PageKsm) in both
    vma_address() and vma_address_end(); though the only place in danger of
    using it on them was try_to_unmap_one().

    Sidenote: vma_address() and vma_address_end() now use compound_nr() on a
    head page, instead of thp_size(): to make the right calculation on a
    hugetlbfs page, whether or not THPs are configured. try_to_unmap() is
    used on hugetlbfs pages, but perhaps the wrong calculation never
    mattered.
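
    A sketch of the reworked calculation (illustrative only; the real helper
    also handles the case of a compound head page whose tail pages overlap
    the vma, and the KSM VM_BUG_ON mentioned above):

        static inline unsigned long
        vma_address(struct page *page, struct vm_area_struct *vma)
        {
                pgoff_t pgoff = page_to_pgoff(page);
                unsigned long address = -EFAULT;

                if (pgoff >= vma->vm_pgoff) {
                        address = vma->vm_start +
                                ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
                        /* beyond the vma, or wrapped through 0? */
                        if (address < vma->vm_start ||
                            address >= vma->vm_end)
                                address = -EFAULT;
                }
                return address;
        }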

    Link: https://lkml.kernel.org/r/caf1c1a3-7cfb-7f8f-1beb-ba816e932825@google.com
    Fixes: a8fa41ad2f6f ("mm, rmap: check all VMAs that PTE-mapped THP can be part of")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Alistair Popple
    Cc: Jan Kara
    Cc: Jue Wang
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Cc: Wang Yugui
    Cc: Yang Shi
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE
    (!unmap_success): with dump_page() showing mapcount:1, but then its raw
    struct page output showing _mapcount ffffffff i.e. mapcount 0.

    And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed,
    it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)),
    and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG():
    all indicative of some mapcount difficulty in development here perhaps.
    But the !CONFIG_DEBUG_VM path handles the failures correctly and
    silently.

    I believe the problem is that once a racing unmap has cleared pte or
    pmd, try_to_unmap_one() may skip taking the page table lock, and emerge
    from try_to_unmap() before the racing task has reached decrementing
    mapcount.

    Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that
    follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding
    TTU_SYNC to the options, and passing that from unmap_page().

    When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same
    for both: the slight overhead added should rarely matter, except perhaps
    if splitting sparsely-populated multiply-mapped shmem. Once confident
    that bugs are fixed, TTU_SYNC here can be removed, and the race
    tolerated.
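
    The shape of the change is roughly this (a sketch, not the exact diff):
    unmap_page() adds TTU_SYNC to its flags, and try_to_unmap_one() maps it
    onto PVMW_SYNC:

        /* in unmap_page(); the full flag set here is illustrative */
        enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
                                   TTU_SYNC | TTU_IGNORE_MLOCK;

        /* in try_to_unmap_one() */
        if (flags & TTU_SYNC)
                pvmw.flags = PVMW_SYNC;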

    Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com
    Fixes: fec89c109f3a ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers")
    Signed-off-by: Hugh Dickins
    Cc: Alistair Popple
    Cc: Jan Kara
    Cc: Jue Wang
    Cc: Kirill A. Shutemov
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Cc: Wang Yugui
    Cc: Yang Shi
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

07 May, 2021

1 commit

  • succed -> succeed in mm/hugetlb.c
    wil -> will in mm/mempolicy.c
    wit -> with in mm/page_alloc.c
    Retruns -> Returns in mm/page_vma_mapped.c
    confict -> conflict in mm/secretmem.c
    No functionality changed.

    Link: https://lkml.kernel.org/r/20210408140027.60623-1-lujialin4@huawei.com
    Signed-off-by: Lu Jialin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lu Jialin
     

16 Dec, 2020

1 commit

  • check_pte() needs a correct colon for kernel-doc markup; otherwise, with
    W=1, gcc emits the following warning:

        mm/page_vma_mapped.c:86: warning: Function parameter or member 'pvmw' not described in 'check_pte'
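
    For reference, kernel-doc expects each parameter to be introduced as
    "@name: description"; the fixed comment looks something like this
    (wording illustrative):

        /**
         * check_pte - check if the page is mapped at the given pte
         * @pvmw: page_vma_mapped_walk struct, holding the pte and page to check
         *
         * ...
         */
        static bool check_pte(struct page_vma_mapped_walk *pvmw)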

    Link: https://lkml.kernel.org/r/1605597167-25145-1-git-send-email-alex.shi@linux.alibaba.com
    Signed-off-by: Alex Shi
    Cc: Randy Dunlap
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Shi
     

15 Aug, 2020

2 commits

  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • This function returns the number of bytes in a THP. It is like
    page_size(), but compiles to just PAGE_SIZE if CONFIG_TRANSPARENT_HUGEPAGE
    is disabled.
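
    A sketch of the helper as described, assuming a thp_order() that is 0
    when THP is compiled out, so the expression folds to PAGE_SIZE:

        static inline unsigned long thp_size(struct page *page)
        {
                return PAGE_SIZE << thp_order(page);
        }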

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: David Hildenbrand
    Cc: Mike Kravetz
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-5-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

01 Feb, 2020

1 commit

  • In check_pte(), the pfn of a normal, hugetlbfs or THP page needs to be
    compared. The current implementation applies the comparison as

    - normal 4K page: page_pfn <= pfn < page_pfn + 1
    - hugetlbfs page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
    - THP page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR

    in pfn_in_hpage(). For a hugetlbfs page, it should be pfn == page_pfn.

    Now, rename pfn_in_hpage() to pfn_is_match() to highlight that the
    comparison is not only for THP, and compare these cases explicitly.

    There is no impact on current behaviour; this just makes the code
    clearer. Comparing a hugetlbfs page against the range
    page_pfn <= pfn < page_pfn + HPAGE_PMD_NR is confusing.
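
    The explicit form is roughly (a sketch of the renamed helper):

        static inline bool pfn_is_match(struct page *page, unsigned long pfn)
        {
                unsigned long page_pfn = page_to_pfn(page);

                /* normal page and hugetlbfs page */
                if (!PageTransCompound(page) || PageHuge(page))
                        return page_pfn == pfn;

                /* THP can be referenced by any subpage */
                return pfn >= page_pfn && pfn - page_pfn < HPAGE_PMD_NR;
        }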

    Link: http://lkml.kernel.org/r/1578737885-8890-1-git-send-email-lixinhai.lxh@gmail.com
    Signed-off-by: Li Xinhai
    Acked-by: Kirill A. Shutemov
    Acked-by: Mike Kravetz
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Xinhai
     

25 Sep, 2019

1 commit

  • Patch series "Make working with compound pages easier", v2.

    These three patches add three helpers and convert the appropriate
    places to use them.

    This patch (of 3):

    It's unnecessarily hard to find out the size of a potentially huge page.
    Replace 'PAGE_SIZE << compound_order(page)' with page_size(page).
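
    The helper is essentially a named wrapper around that expression:

        static inline unsigned long page_size(struct page *page)
        {
                return PAGE_SIZE << compound_order(page);
        }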

    Link: http://lkml.kernel.org/r/20190721104612.19120-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

31 Oct, 2018

1 commit

  • Private ZONE_DEVICE pages use a special pte entry and thus are not
    present. Properly handle this case in map_pte(); it is already handled
    in check_pte(), and the map_pte() part was most probably lost in some
    rebase.

    Without this patch the slow migration path cannot migrate private
    ZONE_DEVICE memory back to regular memory. This was found after stress
    testing migration back to system memory, and it can ultimately lead to
    the CPU page faulting in a loop on the special swap entry.
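
    The non-migration path of map_pte() ends up looking roughly like this
    (sketch of the handling described above):

        if (is_swap_pte(*pvmw->pte)) {
                swp_entry_t entry;

                /* Handle un-addressable ZONE_DEVICE memory */
                entry = pte_to_swp_entry(*pvmw->pte);
                if (!is_device_private_entry(entry))
                        return false;
        } else if (!pte_present(*pvmw->pte)) {
                return false;
        }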

    Link: http://lkml.kernel.org/r/20181019160442.18723-3-jglisse@redhat.com
    Signed-off-by: Ralph Campbell
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Balbir Singh
    Cc: Andrew Morton
    Cc: Kirill A. Shutemov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     

23 Jan, 2018

1 commit

  • The new helper checks whether a pfn belongs to the page. For huge pages
    it checks whether the PFN is within the range covered by the huge page.

    The helper is used in check_pte(). The original code that the helper
    replaces had two calls to page_to_pfn(), and page_to_pfn() is relatively
    costly.

    Although current GCC is able to optimize the code down to one call, it's
    better to do this explicitly.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

22 Jan, 2018

1 commit

  • Tetsuo reported random crashes under memory pressure on a 32-bit x86
    system and tracked them down to the change that introduced
    page_vma_mapped_walk().

    The root cause of the issue is the faulty pointer math in check_pte().
    As ->pte may point to an arbitrary page, we have to check that it
    belongs to the section before doing the math; otherwise it may lead to
    weird results.

    It wasn't noticed until now because mem_map[] is virtually contiguous
    on flatmem or vmemmap sparsemem, so pointer arithmetic just works
    against all 'struct page' pointers. But with classic sparsemem it
    doesn't, because each section's memmap is allocated separately, and so
    consecutive pfns crossing two sections might have struct pages at
    completely unrelated addresses.

    Let's restructure the code a bit and replace the pointer arithmetic
    with operations on pfns.
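
    Illustratively, check_pte() then derives a pfn from the pte and compares
    pfns rather than struct page pointers (a sketch; the real code
    distinguishes migration and device-private swap entries explicitly):

        unsigned long page_pfn = page_to_pfn(pvmw->page);
        unsigned long pfn;

        if (pte_present(*pvmw->pte))
                pfn = pte_pfn(*pvmw->pte);
        else
                /* migration/device entries encode the pfn in the offset */
                pfn = swp_offset(pte_to_swp_entry(*pvmw->pte));

        /* compare pfns; pointer math across mem_map sections is invalid */
        return pfn >= page_pfn &&
               pfn - page_pfn < hpage_nr_pages(pvmw->page);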

    Signed-off-by: Kirill A. Shutemov
    Reported-and-tested-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

14 Oct, 2017

2 commits

  • Loading the pmd without holding the pmd_lock exposes us to races with
    concurrent updaters of the page tables but, worse still, it also allows
    the compiler to cache the pmd value in a register and reuse it later on,
    even if we've performed a READ_ONCE in between and seen a more recent
    value.

    In the case of page_vma_mapped_walk, this leads to the following crash
    when the pmd loaded for the initial pmd_trans_huge check is all zeroes
    and a subsequent valid table entry is loaded by check_pmd. We then
    proceed into map_pte, but the compiler re-uses the zero entry inside
    pte_offset_map, resulting in a junk pointer being installed in
    pvmw->pte:

    PC is at check_pte+0x20/0x170
    LR is at page_vma_mapped_walk+0x2e0/0x540
    [...]
    Process doio (pid: 2463, stack limit = 0xffff00000f2e8000)
    Call trace:
    check_pte+0x20/0x170
    page_vma_mapped_walk+0x2e0/0x540
    page_mkclean_one+0xac/0x278
    rmap_walk_file+0xf0/0x238
    rmap_walk+0x64/0xa0
    page_mkclean+0x90/0xa8
    clear_page_dirty_for_io+0x84/0x2a8
    mpage_submit_page+0x34/0x98
    mpage_process_page_bufs+0x164/0x170
    mpage_prepare_extent_to_map+0x134/0x2b8
    ext4_writepages+0x484/0xe30
    do_writepages+0x44/0xe8
    __filemap_fdatawrite_range+0xbc/0x110
    file_write_and_wait_range+0x48/0xd8
    ext4_sync_file+0x80/0x4b8
    vfs_fsync_range+0x64/0xc0
    SyS_msync+0x194/0x1e8

    This patch fixes the problem by ensuring that READ_ONCE is used before
    the initial checks on the pmd, and this value is subsequently used when
    checking whether or not the pmd is present. pmd_check is removed and
    the pmd_present check is inlined directly.
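
    The fixed pattern is roughly (a sketch, not the exact diff):

        pvmw->pmd = pmd_offset(pud, pvmw->address);
        /*
         * Read the pmd once, so the compiler cannot reload a stale
         * value after the page table has been concurrently updated.
         */
        pmde = READ_ONCE(*pvmw->pmd);
        if (pmd_trans_huge(pmde)) {
                pvmw->ptl = pmd_lock(mm, pvmw->pmd);
                /* re-check *pvmw->pmd under the lock before using it */
        } else if (!pmd_present(pmde)) {
                return false;
        }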

    Link: http://lkml.kernel.org/r/1507222630-5839-1-git-send-email-will.deacon@arm.com
    Fixes: f27176cfc363 ("mm: convert page_mkclean_one() to use page_vma_mapped_walk()")
    Signed-off-by: Will Deacon
    Tested-by: Yury Norov
    Tested-by: Richard Ruigrok
    Acked-by: Kirill A. Shutemov
    Cc: "Paul E. McKenney"
    Cc: Peter Zijlstra
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
     
  • A non present pmd entry can appear after pmd_lock is taken in
    page_vma_mapped_walk(), even if THP migration is not enabled. The
    WARN_ONCE is unnecessary.

    Link: http://lkml.kernel.org/r/20171003142606.12324-1-zi.yan@sent.com
    Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
    Signed-off-by: Zi Yan
    Reported-by: Abdul Haleem
    Tested-by: Abdul Haleem
    Acked-by: Kirill A. Shutemov
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan
     

09 Sep, 2017

2 commits

  • Allow unmapping and restoring the special swap entry used for
    un-addressable ZONE_DEVICE memory.

    Link: http://lkml.kernel.org/r/20170817000548.32038-17-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Kirill A. Shutemov
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Add thp migration's core code, including conversions between a PMD entry
    and a swap entry, setting PMD migration entry, removing PMD migration
    entry, and waiting on PMD migration entries.

    This patch makes it possible to support thp migration. If you fail to
    allocate a destination page as a thp, you just split the source thp as
    we do now, and then enter the normal page migration. If you succeed in
    allocating a destination thp, you enter thp migration. Subsequent
    patches actually enable thp migration for each caller of page migration
    by allowing its get_new_page() callback to allocate thps.
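
    A rough sketch of the helpers involved (illustrative; the real code is
    guarded by CONFIG_ARCH_ENABLE_THP_MIGRATION):

        /* encode: turn a mapped huge page into a PMD migration entry */
        swp_entry_t entry = make_migration_entry(page, pmd_write(pmdval));
        pmd_t pmdswp = swp_entry_to_pmd(entry);

        /* decode: recover the swap entry from a migration pmd */
        entry = pmd_to_swp_entry(*pmd);

        /* a CPU faulting on a migration pmd just waits for it to finish */
        if (is_pmd_migration_entry(*pmd))
                pmd_migration_entry_wait(mm, pmd);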

    [zi.yan@cs.rutgers.edu: fix gcc-4.9.0 -Wmissing-braces warning]
    Link: http://lkml.kernel.org/r/A0ABA698-7486-46C3-B209-E95A9048B22C@cs.rutgers.edu
    [akpm@linux-foundation.org: fix x86_64 allnoconfig warning]
    Signed-off-by: Zi Yan
    Acked-by: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan
     

07 Jul, 2017

1 commit

  • A poisoned or migrated hugepage is stored as a swap entry in the page
    tables. On architectures that support hugepages consisting of
    contiguous page table entries (such as on arm64) this leads to ambiguity
    in determining the page table entry to return in huge_pte_offset() when
    a poisoned entry is encountered.

    Let's remove the ambiguity by adding a size parameter to convey
    additional information about the requested address. Also fixup the
    definition/usage of huge_pte_offset() throughout the tree.
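
    The resulting prototype gains the size argument; callers typically pass
    huge_page_size() of the relevant hstate:

        pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr,
                               unsigned long sz);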

    Link: http://lkml.kernel.org/r/20170522133604.11392-4-punit.agrawal@arm.com
    Signed-off-by: Punit Agrawal
    Acked-by: Steve Capper
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: James Hogan (odd fixer:METAG ARCHITECTURE)
    Cc: Ralf Baechle (supporter:MIPS)
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K.V"
    Cc: "Kirill A. Shutemov"
    Cc: Hillf Danton
    Cc: Mark Rutland
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Punit Agrawal
     

08 Apr, 2017

1 commit

  • Doug Smythies reports oops with KSM in this backtrace, I've been seeing
    the same:

    page_vma_mapped_walk+0xe6/0x5b0
    page_referenced_one+0x91/0x1a0
    rmap_walk_ksm+0x100/0x190
    rmap_walk+0x4f/0x60
    page_referenced+0x149/0x170
    shrink_active_list+0x1c2/0x430
    shrink_node_memcg+0x67a/0x7a0
    shrink_node+0xe1/0x320
    kswapd+0x34b/0x720

    Just as observed in commit 4b0ece6fa016 ("mm: migrate: fix
    remove_migration_pte() for ksm pages"), you cannot use page->index
    calculations on ksm pages.

    page_vma_mapped_walk() is relying on __vma_address(), where a ksm page
    can lead it off the end of the page table, and into whatever nonsense is
    in the next page, ending as an oops inside check_pte()'s pte_page().

    KSM tells page_vma_mapped_walk() exactly where to look for the page, it
    does not need any page->index calculation: and that's so also for all
    the normal and file and anon pages - just not for THPs and their
    subpages. Get out early in most cases: instead of a PageKsm test, move
    down the earlier not-THP-page test, as suggested by Kirill.

    I'm also slightly worried that this loop can stray into other vmas, so
    added a vm_end test to prevent surprises; though I have not imagined
    anything worse than a very contrived case, in which a page mlocked in
    the next vma might be reclaimed because it is not mlocked in this vma.

    Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1704031104400.1118@eggly.anvils
    Signed-off-by: Hugh Dickins
    Reported-by: Doug Smythies
    Tested-by: Doug Smythies
    Reviewed-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

25 Feb, 2017

2 commits

  • For consistency, it is worth converting all page_check_address() callers
    to page_vma_mapped_walk(), so that we can drop the former.

    Link: http://lkml.kernel.org/r/20170129173858.45174-11-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Introduce a new interface to check if a page is mapped into a vma. It
    aims to address the shortcomings of page_check_address{,_transhuge}.

    The existing interface is not able to handle PTE-mapped THPs: it only
    finds the first PTE, and the rest are left unnoticed.

    page_vma_mapped_walk() iterates over all possible mappings of the page
    in the vma.
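
    Typical caller usage looks like this (a sketch of the pattern):

        struct page_vma_mapped_walk pvmw = {
                .page = page,
                .vma = vma,
                .address = address,
        };

        while (page_vma_mapped_walk(&pvmw)) {
                /*
                 * Here pvmw.pte (or pvmw.pmd for a PMD-mapped THP) points
                 * at one mapping of the page, with the matching page table
                 * lock held for this iteration.
                 */
        }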

    Link: http://lkml.kernel.org/r/20170129173858.45174-3-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov