21 Dec, 2018

1 commit

  • commit 01e881f5a1fca4677e82733061868c6d6ea05ca7 upstream.

    Calling UFFDIO_UNREGISTER on virtual ranges not yet registered in uffd
    could trigger a harmless false positive WARN_ON. Check the vma is
    already registered before checking VM_MAYWRITE to shut off the false
    positive warning.

    Link: http://lkml.kernel.org/r/20181206212028.18726-2-aarcange@redhat.com
    Cc:
    Fixes: 29ec90660d68 ("userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas")
    Signed-off-by: Andrea Arcangeli
    Reported-by: syzbot+06c7092e7d71218a2c16@syzkaller.appspotmail.com
    Acked-by: Mike Rapoport
    Acked-by: Hugh Dickins
    Acked-by: Peter Xu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
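
    The check ordering described in the commit above can be sketched in
    user space. This is only a model under stated assumptions: the
    model_* names and flag values are hypothetical stand-ins, not the
    kernel's definitions. It shows why skipping unregistered vmas first
    avoids the false positive.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in flag bits; the values are illustrative, not the kernel's. */
#define MODEL_VM_MAYWRITE  0x1u
#define MODEL_VM_UFFD_MASK 0x6u /* models the "registered in uffd" bits */

struct model_vma {
    unsigned int vm_flags;
    bool has_uffd_ctx; /* models vma->vm_userfaultfd_ctx.ctx != NULL */
};

/*
 * Returns true if the unregister path should warn about this vma.
 * Unregistered vmas are skipped before the VM_MAYWRITE check, so a
 * read-only vma that was never registered cannot trigger the warning.
 */
static bool model_unregister_should_warn(const struct model_vma *vma)
{
    /* Assumed invariant of the model: ctx and uffd flags agree. */
    assert(vma->has_uffd_ctx ==
           ((vma->vm_flags & MODEL_VM_UFFD_MASK) != 0));

    if (!vma->has_uffd_ctx)
        return false; /* nothing to unregister, nothing to warn about */

    return !(vma->vm_flags & MODEL_VM_MAYWRITE);
}
```

    In this model a never-registered read-only vma yields no warning,
    while a registered vma without the may-write bit still does.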
     

08 Dec, 2018

1 commit

  • commit 29ec90660d68bbdd69507c1c8b4e33aa299278b1 upstream.

    After the VMA to register the uffd onto is found, check that it has
    VM_MAYWRITE set before allowing registration. This way we inherit all
    common code checks before allowing to fill file holes in shmem and
    hugetlbfs with UFFDIO_COPY.

    The userfaultfd memory model is not applicable for readonly files unless
    it's a MAP_PRIVATE.

    Link: http://lkml.kernel.org/r/20181126173452.26955-4-aarcange@redhat.com
    Fixes: ff62a3421044 ("hugetlb: implement memfd sealing")
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Mike Rapoport
    Reviewed-by: Hugh Dickins
    Reported-by: Jann Horn
    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Cc:
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Peter Xu
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     

06 Aug, 2018

1 commit

  • commit 31e810aa1033a7db50a2746cd34a2432237f6420 upstream.

    The fix in commit 0cbb4b4f4c44 ("userfaultfd: clear the
    vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails") cleared the
    vma->vm_userfaultfd_ctx but kept userfaultfd flags in vma->vm_flags
    that were copied from the parent process VMA.

    As the result, there is an inconsistency between the values of
    vma->vm_userfaultfd_ctx.ctx and vma->vm_flags which triggers BUG_ON
    in userfaultfd_release().

    Clearing the uffd flags from vma->vm_flags in case of UFFD_EVENT_FORK
    failure resolves the issue.

    Link: http://lkml.kernel.org/r/1532931975-25473-1-git-send-email-rppt@linux.vnet.ibm.com
    Fixes: 0cbb4b4f4c44 ("userfaultfd: clear the vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails")
    Signed-off-by: Mike Rapoport
    Reported-by: syzbot+121be635a7a35ddb7dcb@syzkaller.appspotmail.com
    Cc: Andrea Arcangeli
    Cc: Eric Biggers
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Rapoport
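
    A minimal user-space sketch of the inconsistency fixed above, with
    hypothetical model_* types and flag values (not the kernel's): on
    fork-event failure both the context pointer and the uffd flag bits
    are cleared, so the two stay in agreement.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in flag bits; values are illustrative only. */
#define MODEL_VM_UFFD_MISSING 0x1u
#define MODEL_VM_UFFD_WP      0x2u
#define MODEL_VM_OTHER        0x8u
#define MODEL_VM_UFFD_BITS    (MODEL_VM_UFFD_MISSING | MODEL_VM_UFFD_WP)

struct model_ctx { int placeholder; };

struct model_vma {
    unsigned int vm_flags;
    struct model_ctx *uffd_ctx; /* models vma->vm_userfaultfd_ctx.ctx */
};

/* Clear the ctx pointer AND the uffd bits copied from the parent. */
static void model_fork_event_failed(struct model_vma *vma)
{
    vma->uffd_ctx = NULL;
    vma->vm_flags &= ~MODEL_VM_UFFD_BITS;
}

/* The consistency condition a release path would rely on in this model. */
static int model_vma_consistent(const struct model_vma *vma)
{
    int has_flags = (vma->vm_flags & MODEL_VM_UFFD_BITS) != 0;
    return has_flags == (vma->uffd_ctx != NULL);
}
```

    Clearing only the pointer would leave model_vma_consistent() false,
    which is the state the commit message describes as triggering the
    BUG_ON.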
     

11 Jul, 2018

1 commit

  • commit 1e2c043628c7736dd56536d16c0ce009bc834ae7 upstream.

    Use huge_ptep_get() to translate huge ptes to normal ptes so we can
    check them with the huge_pte_* functions. Otherwise some architectures
    will check the wrong values and will not wait for userspace to bring in
    the memory.

    Link: http://lkml.kernel.org/r/20180626132421.78084-1-frankja@linux.ibm.com
    Fixes: 369cd2121be4 ("userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd ranges")
    Signed-off-by: Janosch Frank
    Reviewed-by: David Hildenbrand
    Reviewed-by: Mike Kravetz
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Janosch Frank
     

10 Jan, 2018

1 commit

  • commit 0cbb4b4f4c44f54af268969b18d8deda63aded59 upstream.

    The previous fix in commit 384632e67e08 ("userfaultfd: non-cooperative:
    fix fork use after free") corrected the refcounting in case of
    UFFD_EVENT_FORK failure for the fork userfault paths.

    That still didn't clear the vma->vm_userfaultfd_ctx of the vmas that
    were set to point to the aborted new uffd ctx earlier in
    dup_userfaultfd.

    Link: http://lkml.kernel.org/r/20171223002505.593-2-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: syzbot
    Reviewed-by: Mike Rapoport
    Cc: Eric Biggers
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     

04 Oct, 2017

1 commit

  • When reading the event from the uffd, we put it on a temporary
    fork_event list to detect if we can still access it after releasing and
    retaking the event_wqh.lock.

    If fork aborts and removes the event from the fork_event all is fine as
    long as we're still in the userfault read context and fork_event head is
    still alive.

    We have to put the event allocated in the fork kernel stack, back from
    fork_event list-head to the event_wqh head, before returning from
    userfaultfd_ctx_read, because the fork_event head lifetime is limited to
    the userfaultfd_ctx_read stack lifetime.

    Forgetting to move the event back to its event_wqh place then results in
    __remove_wait_queue(&ctx->event_wqh, &ewq->wq); in
    userfaultfd_event_wait_completion to remove it from a head that has been
    already freed from the reader stack.

    This could only happen if resolve_userfault_fork failed (for example if
    there are no file descriptors available to allocate the fork uffd). If
    it succeeded it was put back correctly.

    Furthermore, after find_userfault_evt receives a fork event, the forked
    userfault context in fork_nctx and uwq->msg.arg.reserved.reserved1 can
    be released by the fork thread as soon as the event_wqh.lock is
    released. Taking a reference on the fork_nctx before dropping the lock
    prevents a use after free in resolve_userfault_fork().

    If the fork side aborted and it already released everything, we still
    try to succeed resolve_userfault_fork(), if possible.

    Fixes: 893e26e61d04eac9 ("userfaultfd: non-cooperative: Add fork() event")
    Link: http://lkml.kernel.org/r/20170920180413.26713-1-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Mark Rutland
    Tested-by: Mark Rutland
    Cc: Pavel Emelyanov
    Cc: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

09 Sep, 2017

1 commit

  • This is an enhancement to avoid a non cooperative userfaultfd manager
    having to unregister all regions before it can close the uffd after all
    userfaultfd activity completed.

    The UFFDIO_UNREGISTER would serialize against the handle_userfault by
    taking the mmap_sem for writing, but we can simply repeat the page fault
    if we detect the uffd was closed and so the regular page fault paths
    should take over.

    Link: http://lkml.kernel.org/r/20170823181227.19926-1-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Mike Rapoport
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

07 Sep, 2017

4 commits

  • No ABI change, but this will make it more explicit to software that ptid
    is only available if requested by passing UFFD_FEATURE_THREAD_ID to
    UFFDIO_API. The fact it's a union will also self document it shouldn't
    be taken for granted there's a ptid there.

    Link: http://lkml.kernel.org/r/20170802165145.22628-7-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Alexey Perevalov
    Cc: Maxime Coquelin
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • It could be useful for calculating downtime during postcopy live
    migration per vCPU. Side observer or application itself will be
    informed about proper task's sleep during userfaultfd processing.

    The process's thread id is provided when the user requests it by setting
    the UFFD_FEATURE_THREAD_ID bit in uffdio_api.features.

    Link: http://lkml.kernel.org/r/20170802165145.22628-6-aarcange@redhat.com
    Signed-off-by: Alexey Perevalov
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Maxime Coquelin
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Perevalov
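
    The two commits above (the feature bit and the union around ptid)
    suggest a reader-side pattern: only look at the thread id if the
    feature was negotiated. The sketch below is a user-space model; the
    model_* names and the bit value are assumptions for illustration,
    not copied from the kernel header.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the negotiated feature bit. */
#define MODEL_FEATURE_THREAD_ID (1u << 8)

/* Models the page-fault part of a uffd message, with ptid in a union. */
struct model_pagefault {
    uint64_t flags;
    uint64_t address;
    union {
        uint32_t ptid; /* only meaningful if the feature was requested */
    } feat;
};

/*
 * Stores the faulting thread id in *ptid_out and returns 1 if the
 * feature was negotiated; returns 0 and leaves *ptid_out untouched
 * otherwise.
 */
static int model_fault_thread_id(uint64_t negotiated_features,
                                 const struct model_pagefault *pf,
                                 uint32_t *ptid_out)
{
    if (!(negotiated_features & MODEL_FEATURE_THREAD_ID))
        return 0;
    *ptid_out = pf->feat.ptid;
    return 1;
}
```

    Keeping the field inside a union makes it visible at the call site
    that the value is feature-dependent.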
     
  • In some cases, userfaultfd mechanism should just deliver a SIGBUS signal
    to the faulting process, instead of the page-fault event. Dealing with
    page-fault event using a monitor thread can be an overhead in these
    cases. For example applications like the database could use the
    signaling mechanism for robustness purpose.

    Database uses hugetlbfs for performance reason. Files on hugetlbfs
    filesystem are created and huge pages allocated using fallocate() API.
    Pages are deallocated/freed using fallocate() hole punching support.
    These files are mmapped and accessed by many processes as shared memory.
    The database keeps track of which offsets in the hugetlbfs file have
    pages allocated.

    Any access to a mapped address over holes in the file, which can occur
    due to bugs in the application, is considered invalid, and the process is
    expected to simply receive a SIGBUS. However, currently when a hole in the file
    is accessed via the mapped address, kernel/mm attempts to automatically
    allocate a page at page fault time, resulting in implicitly filling the
    hole in the file. This may not be the desired behavior for applications
    like the database that want to explicitly manage page allocations of
    hugetlbfs files.

    Using userfaultfd mechanism with this support to get a signal, database
    application can prevent pages from being allocated implicitly when
    processes access mapped address over holes in the file.

    This patch adds the UFFD_FEATURE_SIGBUS feature to the userfaultfd
    mechanism to request a SIGBUS signal.

    See following for previous discussion about the database requirement
    leading to this proposal as suggested by Andrea.

    http://www.spinics.net/lists/linux-mm/msg129224.html

    Link: http://lkml.kernel.org/r/1501552446-748335-2-git-send-email-prakash.sangappa@oracle.com
    Signed-off-by: Prakash Sangappa
    Reviewed-by: Mike Rapoport
    Reviewed-by: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Prakash Sangappa
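
    The behaviour described above reduces to a single decision at fault
    time. A rough user-space model, assuming hypothetical model_* names
    and an illustrative bit value (not the kernel's implementation):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the SIGBUS feature bit. */
#define MODEL_FEATURE_SIGBUS (1u << 7)

enum model_fault_result {
    MODEL_FAULT_QUEUED, /* fault is reported to the monitor thread */
    MODEL_FAULT_SIGBUS  /* faulting process gets SIGBUS instead */
};

/*
 * Decide how a missing-page fault in a registered range is handled,
 * given the features requested for this userfaultfd context.
 */
static enum model_fault_result model_handle_missing(uint64_t ctx_features)
{
    if (ctx_features & MODEL_FEATURE_SIGBUS)
        return MODEL_FAULT_SIGBUS; /* no implicit page allocation */
    return MODEL_FAULT_QUEUED;
}
```

    With the bit set, an access over a hole never reaches the monitor
    and never allocates a page in this model, which matches the database
    use case in the commit message.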
     
  • Now when shmem VMAs can be filled with zero page via userfaultfd we can
    report that UFFDIO_ZEROPAGE is available for those VMAs

    Link: http://lkml.kernel.org/r/1497939652-16528-7-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

11 Aug, 2017

2 commits

  • Conflicts:
    include/linux/mm_types.h
    mm/huge_memory.c

    I removed the smp_mb__before_spinlock() like the following commit does:

    8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")

    and fixed up the affected commits.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • When the process exit races with outstanding mcopy_atomic, it would be
    better to return an ESRCH error. When such a race occurs the process and
    its mm are going away, and returning "no such process" to the uffd
    monitor seems a better fit than ENOSPC.

    Link: http://lkml.kernel.org/r/1502111545-32305-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Pavel Emelyanov
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
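
    The error-code choice above can be illustrated with a
    single-threaded user-space model. The model_* helpers are
    hypothetical and use a plain int where the kernel uses an atomic
    counter; only the control flow is the point.

```c
#include <assert.h>
#include <errno.h>

/* Models the user count of an address space; NOT atomic in this sketch. */
struct model_mm_users {
    int mm_users;
};

/* Take a reference only if the address space is still alive. */
static int model_mmget_not_zero(struct model_mm_users *mm)
{
    if (mm->mm_users == 0)
        return 0;
    mm->mm_users++;
    return 1;
}

static void model_mmput(struct model_mm_users *mm)
{
    mm->mm_users--;
}

/* Models a copy request from the monitor racing with process exit. */
static int model_uffd_copy(struct model_mm_users *mm)
{
    if (!model_mmget_not_zero(mm))
        return -ESRCH; /* the process and its mm are gone */

    /* ... the actual copy would happen here ... */

    model_mmput(mm);
    return 0;
}
```

    A monitor seeing -ESRCH can then treat the target as exited rather
    than as a generic failure.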
     

10 Aug, 2017

1 commit


03 Aug, 2017

2 commits

  • There may still be threads waiting on event_wqh at the time the
    userfault file descriptor is closed. Flush the events wait-queue to
    prevent waiting threads from hanging.

    Link: http://lkml.kernel.org/r/1501398127-30419-1-git-send-email-rppt@linux.vnet.ibm.com
    Fixes: 9cd75c3cd4c3d ("userfaultfd: non-cooperative: add ability to report
    non-PF events from uffd descriptor")
    Signed-off-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Pavel Emelyanov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • In the non-cooperative userfaultfd case, the process exit may race with
    outstanding mcopy_atomic called by the uffd monitor. Returning -ENOSPC
    instead of -EINVAL when mm is already gone will allow uffd monitor to
    distinguish this case from other error conditions.

    Unfortunately I overlooked userfaultfd_zeropage when updating
    userfaultfd_copy().

    Link: http://lkml.kernel.org/r/1501136819-21857-1-git-send-email-rppt@linux.vnet.ibm.com
    Fixes: 96333187ab162 ("userfaultfd_copy: return -ENOSPC in case mm has gone")
    Signed-off-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Pavel Emelyanov
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

07 Jul, 2017

2 commits

  • A poisoned or migrated hugepage is stored as a swap entry in the page
    tables. On architectures that support hugepages consisting of
    contiguous page table entries (such as on arm64) this leads to ambiguity
    in determining the page table entry to return in huge_pte_offset() when
    a poisoned entry is encountered.

    Let's remove the ambiguity by adding a size parameter to convey
    additional information about the requested address. Also fixup the
    definition/usage of huge_pte_offset() throughout the tree.

    Link: http://lkml.kernel.org/r/20170522133604.11392-4-punit.agrawal@arm.com
    Signed-off-by: Punit Agrawal
    Acked-by: Steve Capper
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: James Hogan (odd fixer:METAG ARCHITECTURE)
    Cc: Ralf Baechle (supporter:MIPS)
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K.V"
    Cc: "Kirill A. Shutemov"
    Cc: Hillf Danton
    Cc: Mark Rutland
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Punit Agrawal
     
  • The calculations of start and end in the __wake_userfault function are
    not used and can be removed.

    Link: http://lkml.kernel.org/r/1494930917-3134-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

20 Jun, 2017

2 commits

  • So I've noticed a number of instances where it was not obvious from the
    code whether ->task_list was for a wait-queue head or a wait-queue entry.

    Furthermore, there's a number of wait-queue users where the lists are
    not for 'tasks' but other entities (poll tables, etc.), in which case
    the 'task_list' name is actively confusing.

    To clear this all up, name the wait-queue head and entry list structure
    fields unambiguously:

    struct wait_queue_head::task_list => ::head
    struct wait_queue_entry::task_list => ::entry

    For example, this code:

    rqw->wait.task_list.next != &wait->task_list

    ... was pretty unclear (to me) what it's doing, while now it's written this way:

    rqw->wait.head.next != &wait->entry

    ... which makes it pretty clear that we are iterating a list until we see the head.

    Other examples are:

    list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
    list_for_each_entry(wq, &fence->wait.task_list, task_list) {

    ... where it's unclear (to me) what we are iterating, and during review it's
    hard to tell whether it's trying to walk a wait-queue entry (which would be
    a bug), while now it's written as:

    list_for_each_entry_safe(pos, next, &x->head, entry) {
    list_for_each_entry(wq, &fence->wait.head, entry) {

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
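
    To make the head/entry distinction above concrete, here is a small
    user-space model of an intrusive circular list. The model_* types
    are hypothetical simplifications of the kernel's list and wait-queue
    structures; only the field names head and entry follow the rename.

```c
#include <assert.h>
#include <stddef.h>

struct model_list_head {
    struct model_list_head *next, *prev;
};

/* The queue itself: owns the list head. */
struct model_wait_queue_head {
    struct model_list_head head;
};

/* One waiter: linked into the queue through its entry field. */
struct model_wait_queue_entry {
    int id;
    struct model_list_head entry;
};

static void model_init_head(struct model_wait_queue_head *wq)
{
    wq->head.next = wq->head.prev = &wq->head;
}

/* Append an entry at the tail of the queue. */
static void model_add_entry(struct model_wait_queue_head *wq,
                            struct model_wait_queue_entry *e)
{
    e->entry.prev = wq->head.prev;
    e->entry.next = &wq->head;
    wq->head.prev->next = &e->entry;
    wq->head.prev = &e->entry;
}

/* Walk entries until we are back at the head. */
static int model_count_waiters(const struct model_wait_queue_head *wq)
{
    int n = 0;
    const struct model_list_head *p;

    for (p = wq->head.next; p != &wq->head; p = p->next)
        n++;
    return n;
}
```

    The loop condition p != &wq->head reads the same way as the renamed
    example in the commit message: iterate until the head is seen again.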
     
  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

17 Jun, 2017

1 commit

  • Anon and hugetlbfs handle FOLL_DUMP set by get_dump_page() internally to
    __get_user_pages().

    shmem, by contrast, has no special FOLL_DUMP handling there so
    handle_mm_fault() is invoked without mmap_sem and ends up calling
    handle_userfault() that isn't expecting to be invoked without mmap_sem
    held.

    This makes handle_userfault() fail immediately if invoked through
    shmem_vm_ops->fault during coredumping and solves the problem.

    The side effect is a BUG_ON with no lock held triggered by the
    coredumping process which exits. Only 4.11 is affected, pre-4.11 anon
    memory holes are skipped in __get_user_pages by checking FOLL_DUMP
    explicitly against empty pagetables (mm/gup.c:no_page_table()).

    It's zero cost as we already had a check for current->flags to prevent
    futex from triggering userfaults during exit (PF_EXITING).

    Link: http://lkml.kernel.org/r/20170615214838.27429-1-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: "Dr. David Alan Gilbert"
    Cc: [4.11+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

08 Apr, 2017

1 commit

  • fdinfo for userfault file descriptor reports UFFD_API_FEATURES. Up
    until recently, the UFFD_API_FEATURES was defined as 0, therefore
    corresponding field in fdinfo always contained zero. Now, with
    introduction of several additional features, UFFD_API_FEATURES is no
    longer 0 and it seems better to report actual features requested for the
    userfaultfd object described by the fdinfo.

    First, the applications that were using userfault will still see zero at
    the features field in fdinfo. Next, reporting actual features rather
    than available features, gives clear indication of what userfault
    features are used by an application.

    Link: http://lkml.kernel.org/r/1491140181-22121-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrea Arcangeli
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

11 Mar, 2017

1 commit

  • Merge 5-level page table prep from Kirill Shutemov:
    "Here's relatively low-risk part of 5-level paging patchset. Merging it
    now will make x86 5-level paging enabling in v4.12 easier.

    The first patch is actually x86-specific: detect 5-level paging
    support. It boils down to single define.

    The rest of patchset converts Linux MMU abstraction from 4- to 5-level
    paging.

    Enabling of new abstraction in most cases requires adding single line
    of code in arch-specific code. The rest is taken care by asm-generic/.

    Changes to mm/ code are mostly mechanical: add support for new page
    table level -- p4d_t -- where we deal with pud_t now.

    v2:
    - fix build on microblaze (Michal);
    - comment for __ARCH_HAS_5LEVEL_HACK in kasan_populate_zero_shadow();
    - acks from Michal"

    * emailed patches from Kirill A Shutemov :
    mm: introduce __p4d_alloc()
    mm: convert generic code to 5-level paging
    asm-generic: introduce
    arch, mm: convert all architectures to use 5level-fixup.h
    asm-generic: introduce __ARCH_USE_5LEVEL_HACK
    asm-generic: introduce 5level-fixup.h
    x86/cpufeature: Add 5-level paging detection

    Linus Torvalds
     

10 Mar, 2017

8 commits

  • It's a void function, so there is no return value;

    Link: http://lkml.kernel.org/r/20170309150817.7510-1-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • userfaultfd_remove() has to be executed before zapping the pagetables or
    UFFDIO_COPY could keep filling pages after zap_page_range returned,
    which would result in non zero data after a MADV_DONTNEED.

    However userfaultfd_remove() may have to release the mmap_sem. This was
    handled correctly in MADV_REMOVE, but MADV_DONTNEED accessed a
    potentially stale vma (the very vma passed to zap_page_range(vma, ...)).

    The fix consists in revalidating the vma in case userfaultfd_remove()
    had to release the mmap_sem.

    This also optimizes away an unnecessary down_read/up_read in the
    MADV_REMOVE case if UFFD_EVENT_FORK had to be delivered.

    It all remains zero runtime cost in case CONFIG_USERFAULTFD=n as
    userfaultfd_remove() will be defined as "true" at build time.

    Link: http://lkml.kernel.org/r/20170302173738.18994-3-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • We have a memleak in the ->new ctx if the uffd of the parent is closed
    before the fork event is read, nothing frees the new context.

    Link: http://lkml.kernel.org/r/20170302173738.18994-2-aarcange@redhat.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrea Arcangeli
    Reported-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Don't stop running dup_fctx() even if userfaultfd_event_wait_completion
    fails as it has to run userfaultfd_ctx_put on all ctx to pair against
    the userfaultfd_ctx_get that was run on all fctx->orig in
    dup_userfaultfd.

    Link: http://lkml.kernel.org/r/20170224181957.19736-4-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Similar to the handle_userfault() case, also make sure to never attempt
    to send any event past the PF_EXITING point of no return.

    This is purely a robustness check.

    Link: http://lkml.kernel.org/r/20170224181957.19736-3-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Patch series "userfaultfd non-cooperative further update for 4.11 merge
    window".

    Unfortunately I noticed one relevant bug in userfaultfd_exit while doing
    more testing. I've been doing testing before and this was also tested
    by kbuild bot and exercised by the selftest, but this bug never
    reproduced before.

    I dropped userfaultfd_exit as result. I dropped it because of
    implementation difficulty in receiving signals in __mmput and because I
    think -ENOSPC as result from the background UFFDIO_COPY should be enough
    already.

    Before I decided to remove userfaultfd_exit, I noticed userfaultfd_exit
    wasn't exercised by the selftest and when I tried to exercise it, after
    moving it to a more correct place in __mmput where it would make more
    sense and where the vma list is stable, it resulted in the
    event_wait_completion in D state. So then I added the second patch to
    be sure even if we call userfaultfd_event_wait_completion too late
    during task exit(), we won't risk to generate tasks in D state. The
    same check exists in handle_userfault() for the same reason, except it
    makes a difference there, while here is just a robustness check and it's
    run under WARN_ON_ONCE.

    While looking at the userfaultfd_event_wait_completion() function I
    looked back at its callers too while at it and I think it's not ok to
    stop executing dup_fctx on the fcs list because we rely on
    userfaultfd_event_wait_completion to execute
    userfaultfd_ctx_put(fctx->orig) which is paired against
    userfaultfd_ctx_get(fctx->orig) in dup_userfault just before
    list_add(fcs). This change only takes care of fctx->orig but this area
    also needs further review looking for similar problems in fctx->new.

    The only patch that is urgent is the first because it's a use after
    free during a SMP race condition that affects all processes if
    CONFIG_USERFAULTFD=y. Very hard to reproduce though and probably
    impossible without SLUB poisoning enabled.

    This patch (of 3):

    I once reproduced this oops with the userfaultfd selftest, it's not
    easily reproducible and it requires SLUB poisoning to reproduce.

    general protection fault: 0000 [#1] SMP
    Modules linked in:
    CPU: 2 PID: 18421 Comm: userfaultfd Tainted: G ------------ T 3.10.0+ #15
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
    task: ffff8801f83b9440 ti: ffff8801f833c000 task.ti: ffff8801f833c000
    RIP: 0010:[] [] userfaultfd_exit+0x29/0xa0
    RSP: 0018:ffff8801f833fe80 EFLAGS: 00010202
    RAX: ffff8801f833ffd8 RBX: 6b6b6b6b6b6b6b6b RCX: ffff8801f83b9440
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800baf18600
    RBP: ffff8801f833fee8 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: ffffffff8127ceb3 R12: 0000000000000000
    R13: ffff8800baf186b0 R14: ffff8801f83b99f8 R15: 00007faed746c700
    FS: 0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007faf0966f028 CR3: 0000000001bc6000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Call Trace:
    do_exit+0x297/0xd10
    SyS_exit+0x17/0x20
    tracesys+0xdd/0xe2
    Code: 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 48 83 ec 58 48 8b 1f 48 85 db 75 11 eb 73 66 0f 1f 44 00 00 48 8b 5b 10 48 85 db 74 64 8b a3 b8 00 00 00 4d 85 e4 74 eb 41 f6 84 24 2c 01 00 00 80
    RIP [] userfaultfd_exit+0x29/0xa0
    RSP
    ---[ end trace 9fecd6dcb442846a ]---

    In the debugger I located the "mm" pointer in the stack and walking
    mm->mmap->vm_next through the end shows the vma->vm_next list is fully
    consistent and it is null terminated list as expected. So this has to
    be an SMP race condition where userfaultfd_exit was running while the
    vma list was being modified by another CPU.

    When userfaultfd_exit() ran, one of the ->vm_next pointers pointed to
    SLAB_POISON (RBX is the vma pointer and is 0x6b6b..).

    The reason is that it's not running in __mmput but while there are still
    other threads running and it's not holding the mmap_sem (it can't as it
    has to wait for the event to be received by the manager). So this is a use
    after free that was happening for all processes.

    One more implementation problem aside from the race condition:
    userfaultfd_exit has really to check a flag in mm->flags before walking
    the vma or it's going to slowdown the exit() path for regular tasks.

    One more implementation problem: at that point signals can't be
    delivered so it would also create a task in D state if the manager
    doesn't read the event.

    The major design issue: it overall looks superfluous as the manager can
    check for -ENOSPC in the background transfer:

    if (mmget_not_zero(ctx->mm)) {
    [..]
    } else {
    return -ENOSPC;
    }

    It's safer to roll it back and re-introduce it later if at all.

    [rppt@linux.vnet.ibm.com: documentation fixup after removal of UFFD_EVENT_EXIT]
    Link: http://lkml.kernel.org/r/1488345437-4364-1-git-send-email-rppt@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/20170224181957.19736-2-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mike Rapoport
    Acked-by: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • __do_fault assumes vmf->page has been initialized and is valid if
    VM_FAULT_NOPAGE is not returned by vma->vm_ops->fault(vma, vmf).

    handle_userfault() in turn should return VM_FAULT_NOPAGE if it doesn't
    return VM_FAULT_SIGBUS or VM_FAULT_RETRY (the other two possibilities).

    This VM_FAULT_NOPAGE case is only invoked when signals are pending and it
    didn't matter for anonymous memory before. It only started to matter
    since shmem was introduced. hugetlbfs also takes a different path and
    doesn't exercise __do_fault.

    Link: http://lkml.kernel.org/r/20170228154201.GH5816@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Dmitry Vyukov
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Convert all non-architecture-specific code to 5-level paging.

    It's mostly mechanical adding handling one more page table level in
    places where we deal with pud_t.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

02 Mar, 2017

2 commits

  • …hed.h> into <linux/sched/signal.h>

    Fix up affected files that include this signal functionality via sched.h.

    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

28 Feb, 2017

2 commits

  • Apart from adding the helper function itself, the rest of the kernel is
    converted mechanically using:

    git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
    git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'

    This is needed for a later patch that hooks into the helper, but might
    be a worthwhile cleanup on its own.
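
    As a rough illustration of what the conversion produces, the sketch
    below shows a helper with the same shape in plain C11. The struct
    here is a simplified stand-in for the kernel's mm_struct, and
    atomic_fetch_add() stands in for the kernel's atomic_inc(); both are
    assumptions made so the example compiles outside the kernel tree.

```c
#include <stdatomic.h>

/* Simplified stand-in for the kernel's struct mm_struct (illustration only). */
struct mm_struct {
    atomic_int mm_count;    /* number of references to the mm_struct itself */
};

/* Helper that replaces the open-coded increment of mm_count. */
static inline void mmgrab(struct mm_struct *mm)
{
    /* C11 equivalent of the kernel's atomic_inc(&mm->mm_count). */
    atomic_fetch_add(&mm->mm_count, 1);
}

/*
 * Effect of the two sed expressions on call sites:
 *
 *   atomic_inc(&mm->mm_count);   becomes   mmgrab(mm);
 *   atomic_inc(&obj.mm_count);   becomes   mmgrab(&obj);
 */
```

    Routing every reference grab through one function gives a later
    patch a single place to hook into.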

    (Michal Hocko provided most of the kerneldoc comment.)

    Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com
    Signed-off-by: Vegard Nossum
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     
  • Fix typos and add the following to the scripts/spelling.txt:

    an user||a user
    an userspace||a userspace

    I also added "userspace" to the list since it is a common word in Linux.
    I found some instances of "an userfaultfd", but I did not add it to the
    list. The set of words that start with "user", such as "userland", is
    endless, so I had to draw a line somewhere.

    Link: http://lkml.kernel.org/r/1481573103-11329-4-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     

25 Feb, 2017

4 commits

    In the non-cooperative userfaultfd case, the process exit may race with
    an outstanding mcopy_atomic called by the uffd monitor. Returning -ENOSPC
    instead of -EINVAL when the mm is already gone allows the uffd monitor to
    distinguish this case from other error conditions.
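
    A monitor could use the distinct error code along the lines of the
    sketch below. The enum and the function name are hypothetical, and
    the errno reported for this case may differ in other kernel versions,
    so treat the mapping as an assumption tied to this commit.

```c
#include <errno.h>

/* Hypothetical outcomes a uffd monitor might act on after a copy attempt. */
enum copy_outcome {
    COPY_OK,            /* the copy succeeded */
    COPY_TARGET_GONE,   /* the monitored mm no longer exists */
    COPY_FAILED         /* any other error, e.g. EINVAL for bad arguments */
};

/* Map the errno seen after a copy request to an outcome (0 means success). */
static enum copy_outcome classify_copy_errno(int err)
{
    switch (err) {
    case 0:
        return COPY_OK;
    case ENOSPC:
        /* Per this commit: the process exited while the copy was in flight. */
        return COPY_TARGET_GONE;
    default:
        return COPY_FAILED;
    }
}
```

    With this split, the monitor can stop servicing an exited process
    quietly while still reporting genuine failures.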

    Link: http://lkml.kernel.org/r/1485542673-24387-6-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
    Allow the userfaultfd monitor to track termination of processes that
    have memory backed by the uffd.

    [rppt@linux.vnet.ibm.com: add comment]
    Link: http://lkml.kernel.org/r/20170202135448.GB19804@rapoport-lnx
    Link: http://lkml.kernel.org/r/1485542673-24387-4-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
    When a non-cooperative userfaultfd monitor copies pages in the
    background, it may encounter regions that were already unmapped.
    The addition of UFFD_EVENT_UNMAP allows the uffd monitor to precisely
    track changes in the virtual memory layout.

    Since there might be different uffd contexts for the affected VMAs, we
    first should create a temporary representation for the unmap event for
    each uffd context and then notify them one by one to the appropriate
    userfault file descriptors.

    The event notification occurs after the mmap_sem has been released.
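
    On the monitor side, the bookkeeping this enables might look like
    the sketch below. The region structure, the fixed page count, the
    4096-byte page size and both function names are hypothetical; in a
    real monitor the start and end addresses would come from the unmap
    event read from the userfault file descriptor.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ 4096u   /* assumed page size for the illustration */
#define NPAGES  16      /* hypothetical size of the tracked region */

/* Hypothetical monitor-side view of one tracked region. */
struct region {
    uint64_t base;              /* address of the first tracked page */
    bool     unmapped[NPAGES];  /* true once an unmap event covered the page */
};

/* Record an unmap notification covering [start, end). */
static void handle_unmap(struct region *r, uint64_t start, uint64_t end)
{
    for (size_t i = 0; i < NPAGES; i++) {
        uint64_t addr = r->base + (uint64_t)i * PAGE_SZ;

        if (addr >= start && addr < end)
            r->unmapped[i] = true;
    }
}

/* The background copier skips pages that are outside the region or unmapped. */
static bool should_copy(const struct region *r, uint64_t addr)
{
    uint64_t idx;

    if (addr < r->base)
        return false;
    idx = (addr - r->base) / PAGE_SZ;
    return idx < NPAGES && !r->unmapped[idx];
}
```

    Checking should_copy() before each background copy keeps the monitor
    from filling pages in ranges that no longer exist.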

    [arnd@arndb.de: fix nommu build]
    Link: http://lkml.kernel.org/r/20170203165141.3665284-1-arnd@arndb.de
    [mhocko@suse.com: fix nommu build]
    Link: http://lkml.kernel.org/r/20170202091503.GA22823@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/1485542673-24387-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Michal Hocko
    Signed-off-by: Arnd Bergmann
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Patch series "userfaultfd: non-cooperative: add madvise() event for
    MADV_REMOVE request".

    These patches add notification of the madvise(MADV_REMOVE) event to
    the non-cooperative userfaultfd monitor.

    The first patch renames EVENT_MADVDONTNEED to EVENT_REMOVE along with
    the relevant functions and structures. Using _REMOVE instead of
    _MADVDONTNEED describes the event semantics more clearly and I hope it's
    not too late for such a change in the ABI.

    This patch (of 3):

    The purpose of UFFD_EVENT_MADVDONTNEED is to notify the uffd monitor
    about removal of a certain range from the address space tracked by
    userfaultfd. Hence, UFFD_EVENT_REMOVE seems to better reflect the
    operation semantics. Accordingly, the 'madv_dn' field of uffd_msg is
    renamed to 'remove' and the madvise_userfault_dontneed callback is
    renamed to userfaultfd_remove.

    Link: http://lkml.kernel.org/r/1484814154-1557-2-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrea Arcangeli
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

23 Feb, 2017

1 commit

  • Expand the userfaultfd_register/unregister routines to allow shared
    memory VMAs.

    Currently, there is no UFFDIO_ZEROPAGE and write-protection support for
    shared memory VMAs, which is reflected in ioctl methods supported by
    uffdio_register.
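
    Userspace can inspect the supported methods through the ioctls
    bitmask that UFFDIO_REGISTER fills in. The sketch below shows only
    the bit test; the function name is hypothetical, the mask in the
    usage note is built by hand for illustration, and the _UFFDIO_*
    numbers are assumed to come from the installed <linux/userfaultfd.h>.

```c
#include <stdbool.h>
#include <stdint.h>
#include <linux/userfaultfd.h>

/*
 * Return true if the ioctl number nr (e.g. _UFFDIO_COPY) is set in the
 * ioctls bitmask reported for a registered range.
 */
static bool range_supports(uint64_t ioctls, unsigned int nr)
{
    return (ioctls >> nr) & 1;
}
```

    For example, given a mask with only the _UFFDIO_WAKE and
    _UFFDIO_COPY bits set, range_supports(mask, _UFFDIO_COPY) is true
    and range_supports(mask, _UFFDIO_ZEROPAGE) is false, so a monitor
    could fall back to copying a zero-filled buffer instead.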

    Link: http://lkml.kernel.org/r/20161216144821.5183-34-aarcange@redhat.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Hillf Danton
    Cc: Michael Rapoport
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport