19 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this work is licensed under the terms of the gnu gpl version 2 see
    the copying file in the top level directory

    Extracted by the scancode license scanner, the SPDX license identifier
    GPL-2.0-only has been chosen to replace the boilerplate/reference in
    35 file(s).
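    As an illustration (the file shown is hypothetical, not one of the 35
    touched here), the conversion replaces a boilerplate license comment
    with a single machine-readable SPDX tag:

    /* Before: free-form license boilerplate at the top of the file. */
    /*
     * This work is licensed under the terms of the GNU GPL, version 2.
     * See the COPYING file in the top-level directory.
     */

    /* After: one SPDX line. */
    // SPDX-License-Identifier: GPL-2.0-only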

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Kate Stewart
    Reviewed-by: Enrico Weigelt
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081206.797835076@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner

15 May, 2019

1 commit

  • hugetlb uses a fault mutex hash table to prevent concurrent page
    faults on the same pages. The key for shared and private mappings is
    different: shared mappings key off the address_space and file index,
    while private mappings key off the mm and virtual address. Consider a
    private mapping of a populated hugetlbfs file. A fault will map the
    page from the file and, if needed, do a COW to map a writable page.

    Hugetlbfs hole punch uses the fault mutex to prevent mappings of file
    pages, keying off the address_space and file index. However, private
    mappings use a different key and can race with this code to map the
    file page. This causes problems (a BUG) for the page cache removal
    code, which expects the page to be unmapped. A sample stack is:

    page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
    kernel BUG at mm/filemap.c:169!
    ...
    RIP: 0010:unaccount_page_cache_page+0x1b8/0x200
    ...
    Call Trace:
    __delete_from_page_cache+0x39/0x220
    delete_from_page_cache+0x45/0x70
    remove_inode_hugepages+0x13c/0x380
    ? __add_to_page_cache_locked+0x162/0x380
    hugetlbfs_fallocate+0x403/0x540
    ? _cond_resched+0x15/0x30
    ? __inode_security_revalidate+0x5d/0x70
    ? selinux_file_permission+0x100/0x130
    vfs_fallocate+0x13f/0x270
    ksys_fallocate+0x3c/0x80
    __x64_sys_fallocate+0x1a/0x20
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    There seems to be another potential COW issue/race with this approach
    of different private and shared keys, as noted in commit 8382d914ebf7
    ("mm, hugetlb: improve page-fault scalability").

    Since every hugetlb mapping (even anon and private) is actually a file
    mapping, just use the address_space index key for all mappings. This
    results in potentially more hash collisions. However, this should not
    be the common case.
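    A minimal standalone sketch of the idea (illustrative: the kernel's
    actual helper, hugetlb_fault_mutex_hash, hashes with jhash2 and takes
    different parameters; NUM_FAULT_MUTEXES and the mixing function below
    are assumptions):

    #include <stdint.h>
    #include <stddef.h>

    /* Must be a power of two so the mask below selects a valid slot. */
    #define NUM_FAULT_MUTEXES 64

    /* Key off the mapping and file index for ALL mappings, shared or
     * private, so both fault paths serialize on the same mutex. */
    static size_t fault_mutex_hash(const void *mapping, uint64_t idx)
    {
            uint64_t h = (uintptr_t)mapping * 0x9e3779b97f4a7c15ULL;

            h ^= idx + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2);
            return (size_t)(h & (NUM_FAULT_MUTEXES - 1));
    }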

    Link: http://lkml.kernel.org/r/20190328234704.27083-3-mike.kravetz@oracle.com
    Link: http://lkml.kernel.org/r/20190412165235.t4sscoujczfhuiyt@linux-r8p5
    Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Davidlohr Bueso
    Cc: Joonsoo Kim
    Cc: "Kirill A . Shutemov"
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz

09 Jan, 2019

1 commit

  • This reverts commit b43a9990055958e70347c56f90ea2ae32c67334c.

    The reverted commit caused issues with migration and poisoning of anon
    huge pages. The LTP move_pages12 test triggers an "unable to handle
    kernel NULL pointer" BUG with a stack similar to:

    RIP: 0010:down_write+0x1b/0x40
    Call Trace:
    migrate_pages+0x81f/0xb90
    __ia32_compat_sys_migrate_pages+0x190/0x190
    do_move_pages_to_node.isra.53.part.54+0x2a/0x50
    kernel_move_pages+0x566/0x7b0
    __x64_sys_move_pages+0x24/0x30
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The purpose of the reverted patch was to fix some long-existing races
    with huge pmd sharing. It used i_mmap_rwsem for this purpose, with the
    idea that this could also be used to address truncate/page fault races
    with another patch. Further analysis has determined that i_mmap_rwsem
    cannot be used to address all of these hugetlbfs synchronization
    issues. Therefore, revert this patch while working on another approach
    to the underlying issues.

    Link: http://lkml.kernel.org/r/20190103235452.29335-2-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Jan Stancek
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz

05 Jan, 2019

1 commit

  • Patch series "Add support for fast mremap".

    This series speeds up the mremap(2) syscall by copying page tables at
    the PMD level even for non-THP systems. There is concern that the
    extra 'address' argument that mremap passes to pte_alloc may do
    something subtle and architecture-related in the future that may make
    the scheme not work. Also, we find that there is no point in passing
    the 'address' to pte_alloc since it is unused. This patch therefore
    removes this argument tree-wide, resulting in a nice negative diff as
    well. It also ensures along the way that the enabled architectures do
    not do anything funky with the 'address' argument that goes unnoticed
    by the optimization.

    Build and boot tested on x86-64. Build tested on arm64. The config
    enablement patch for arm64 will be posted in the future after more
    testing.
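    For illustration, a hedged before/after sketch of the resulting
    signature change, using __pte_alloc as the example (the exact
    parameter lists vary by helper and architecture):

    /* Before: the unused 'address' argument is threaded through. */
    int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address);

    /* After: the argument is dropped tree-wide. */
    int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);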

    The changes were obtained by applying the following Coccinelle script
    (thanks to Julia for answering all Coccinelle questions!). The
    following fixups were done manually:
    * Removal of the address argument from pte_fragment_alloc.
    * Removal of the pte_alloc_one_fast definitions from m68k and
    microblaze.

    // Options: --include-headers --no-includes

    virtual patch

    @pte_alloc_func_def depends on patch exists@
    identifier E2;
    identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    type T2;
    @@

    fn(...
    - , T2 E2
    )
    { ... }

    @pte_alloc_func_proto_noarg depends on patch exists@
    type T1, T2, T3, T4;
    identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    (
    - T3 fn(T1, T2);
    + T3 fn(T1);
    |
    - T3 fn(T1, T2, T4);
    + T3 fn(T1, T2);
    )

    @pte_alloc_func_proto depends on patch exists@
    identifier E1, E2, E4;
    type T1, T2, T3, T4;
    identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    (
    - T3 fn(T1 E1, T2 E2);
    + T3 fn(T1 E1);
    |
    - T3 fn(T1 E1, T2 E2, T4 E4);
    + T3 fn(T1 E1, T2 E2);
    )

    @pte_alloc_func_call depends on patch exists@
    expression E2;
    identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    fn(...
    -, E2
    )

    @pte_alloc_macro depends on patch exists@
    identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    identifier a, b, c;
    expression e;
    position p;
    @@

    (
    - #define fn(a, b, c) e
    + #define fn(a, b) e
    |
    - #define fn(a, b) e
    + #define fn(a) e
    )

    Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
    Signed-off-by: Joel Fernandes (Google)
    Suggested-by: Kirill A. Shutemov
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Julia Lawall
    Cc: Kirill A. Shutemov
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Fernandes (Google)

29 Dec, 2018

1 commit

  • While looking at BUGs associated with invalid huge page map counts, it
    was discovered that a huge pte pointer could become 'invalid' and
    point to another task's page table. Consider the following:

    A task takes a page fault on a shared hugetlbfs file and calls
    huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
    shared pmd.

    Now, another task truncates the hugetlbfs file. As part of truncation, it
    unmaps everyone who has the file mapped. If the range being truncated is
    covered by a shared pmd, huge_pmd_unshare will be called. For all but the
    last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
    to the pmd. If the task in the middle of the page fault is not the last
    user, the ptep returned by huge_pte_alloc now points to another task's
    page table or worse. This leads to bad things such as incorrect page
    map/reference counts or invalid memory references.

    To fix, expand the use of i_mmap_rwsem as follows (a locking sketch
    follows this list):

    - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
    huge_pmd_share is only called via huge_pte_alloc, so callers of
    huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
    of huge_pte_alloc continue to hold the semaphore until finished with the
    ptep.

    - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
    called.
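    A hedged sketch of the resulting locking protocol (the function names
    match the kernel, but this is an illustration, not the verbatim fault
    path):

    /* Fault path: hold i_mmap_rwsem for read while the ptep is in use,
     * so huge_pmd_unshare() (which now requires the rwsem held for
     * write) cannot pull the shared pmd out from under us. */
    i_mmap_lock_read(mapping);
    ptep = huge_pte_alloc(mm, address, huge_page_size(h));
    /* ... fault handling using ptep ... */
    i_mmap_unlock_read(mapping);

    /* Truncate/unmap path: unshare only with the rwsem held for write. */
    i_mmap_lock_write(mapping);
    huge_pmd_unshare(mm, &address, ptep);
    i_mmap_unlock_write(mapping);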

    [mike.kravetz@oracle.com: add explicit check for mapping != null]
    Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
    Fixes: 39dde65c9940 ("shared page table for hugetlb page")
    Signed-off-by: Mike Kravetz
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Cc: Colin Ian King
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz

01 Dec, 2018

4 commits

  • With MAP_SHARED: recheck the i_size after taking the PT lock, to
    serialize against truncate with the PT lock. Delete the page from the
    pagecache if the i_size_read check fails.

    With MAP_PRIVATE: check the i_size after the PT lock before mapping
    anonymous memory or zeropages into the MAP_PRIVATE shmem mapping.

    A mostly irrelevant cleanup: just as the delete_from_page_cache()
    pagecache removal is done after dropping the PT lock, drop the PT lock
    (a spinlock) before taking the sleepable page lock.
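    A hedged sketch of the MAP_SHARED ordering (illustrative, close to
    but not verbatim shmem_mcopy_atomic_pte; the error label is an
    assumption):

    dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
    ret = -EFAULT;
    max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
    /* Recheck i_size under the PT lock: if truncate won the race, back
     * out so the caller can delete the page from the pagecache. */
    if (unlikely(offset >= max_off))
            goto out_release_unlock;   /* hypothetical label */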

    Link: http://lkml.kernel.org/r/20181126173452.26955-5-aarcange@redhat.com
    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Mike Rapoport
    Reviewed-by: Hugh Dickins
    Reported-by: Jann Horn
    Cc:
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Peter Xu
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
  • After the VMA to register the uffd onto is found, check that it has
    VM_MAYWRITE set before allowing registration. This way we inherit all
    common-code checks before allowing UFFDIO_COPY to fill file holes in
    shmem and hugetlbfs.

    The userfaultfd memory model is not applicable to readonly files
    unless the mapping is MAP_PRIVATE.
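    A hedged sketch of the registration-time check (close to the patch in
    spirit, though illustrative): a VMA without VM_MAYWRITE, e.g. a
    MAP_SHARED mapping of a file opened O_RDONLY, is rejected:

    /* In the userfaultfd_register() VMA loop: UFFDIO_COPY can fill file
     * holes even without PROT_WRITE, so require that the process could
     * write to the backing file (MAP_PRIVATE always has VM_MAYWRITE). */
    ret = -EPERM;
    if (unlikely(!(vma->vm_flags & VM_MAYWRITE)))
            goto out_unlock;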

    Link: http://lkml.kernel.org/r/20181126173452.26955-4-aarcange@redhat.com
    Fixes: ff62a3421044 ("hugetlb: implement memfd sealing")
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Mike Rapoport
    Reviewed-by: Hugh Dickins
    Reported-by: Jann Horn
    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Cc:
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Peter Xu
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
  • Userfaultfd did not create private memory when UFFDIO_COPY was invoked
    on a MAP_PRIVATE shmem mapping. Instead it wrote to the shmem file,
    even when that had not been opened for writing. Though, fortunately,
    that could only happen where there was a hole in the file.

    Fix the shmem-backed implementation of UFFDIO_COPY to create private
    memory for MAP_PRIVATE mappings. The hugetlbfs-backed implementation
    was already correct.

    This change is visible to userland, if userfaultfd has been used in
    unintended ways: so it introduces a small risk of incompatibility, but
    is necessary in order to respect file permissions.

    An app that uses UFFDIO_COPY for anything like postcopy live migration
    won't notice the difference, and in fact it'll run faster because there
    will be no copy-on-write and memory waste in the tmpfs pagecache
    anymore.

    Userfaults on MAP_PRIVATE shmem keep triggering only on file holes like
    before.

    The real zeropage can also be built on a MAP_PRIVATE shmem mapping
    through UFFDIO_ZEROPAGE, and that's safe because the zeropage pte is
    never dirty; in turn, even an mprotect upgrading the vma permission
    from PROT_READ to PROT_READ|PROT_WRITE won't make the zeropage pte
    writable.
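    A hedged userspace sketch of resolving a fault on such a mapping with
    UFFDIO_COPY (uffd is a userfaultfd descriptor already registered on
    the faulting range; fault_addr, src_page and page_size are assumed to
    be set up elsewhere):

    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <stdio.h>

    struct uffdio_copy copy = {
            /* dst must be page-aligned inside the registered range. */
            .dst = (unsigned long)fault_addr & ~(page_size - 1),
            .src = (unsigned long)src_page,  /* page of data to install */
            .len = page_size,
            .mode = 0,
    };
    if (ioctl(uffd, UFFDIO_COPY, &copy) == -1)
            perror("UFFDIO_COPY");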

    Link: http://lkml.kernel.org/r/20181126173452.26955-3-aarcange@redhat.com
    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Signed-off-by: Andrea Arcangeli
    Reported-by: Mike Rapoport
    Reviewed-by: Hugh Dickins
    Cc:
    Cc: "Dr. David Alan Gilbert"
    Cc: Jann Horn
    Cc: Mike Kravetz
    Cc: Peter Xu
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
  • Patch series "userfaultfd shmem updates".

    Jann found two bugs in the userfaultfd shmem MAP_SHARED backend: the
    lack of the VM_MAYWRITE check and the lack of i_size checks.

    Then looking into the above we also fixed the MAP_PRIVATE case.

    By source review, Hugh also found a data-loss case when UFFDIO_COPY is
    used on shmem MAP_SHARED PROT_READ mappings (the production usages
    incidentally run with PROT_READ|PROT_WRITE, so the data loss couldn't
    happen in those production usages, such as with QEMU).

    The whole patchset is marked for stable.

    We verified that QEMU postcopy live migration, with the guest running
    on shmem MAP_PRIVATE, runs as well after the fix of shmem MAP_PRIVATE
    as it did before. Regardless of whether it's shmem or hugetlbfs,
    MAP_PRIVATE or MAP_SHARED, QEMU unconditionally invokes a hole punch
    if the guest mapping is file-backed, and a MADV_DONTNEED too (needed
    to get rid of the MAP_PRIVATE COWs and for the anon backend).

    This patch (of 5):

    We internally used EFAULT to communicate with the caller; switch to
    ENOENT so that EFAULT can be used as a non-internal retval.
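    A hedged sketch of the resulting protocol (function names approximate
    __mcopy_atomic and its helper; illustrative, not verbatim): the inner
    helper returns -ENOENT when the atomic copy of the source page fails,
    and the outer loop then faults the page in outside mmap_sem and
    retries, leaving -EFAULT for genuine errors:

    err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
                           src_addr, &page);     /* may return -ENOENT */
    if (unlikely(err == -ENOENT)) {
            up_read(&dst_mm->mmap_sem);
            /* Sleeping copy to fault the source page in, then retry. */
            if (copy_from_user(page_kaddr,
                               (const void __user *)src_addr, PAGE_SIZE))
                    err = -EFAULT;  /* now a real, non-internal error */
            else
                    goto retry;
    }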

    Link: http://lkml.kernel.org/r/20181126173452.26955-2-aarcange@redhat.com
    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Mike Rapoport
    Reviewed-by: Hugh Dickins
    Cc: Mike Kravetz
    Cc: Jann Horn
    Cc: Peter Xu
    Cc: "Dr. David Alan Gilbert"
    Cc:
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli