26 Jan, 2019

1 commit

  • [ Upstream commit 3cfd22be0ad663248fadfc8f6ffa3e255c394552 ]

    When the process being tracked does mremap() without
    UFFD_FEATURE_EVENT_REMAP on the corresponding tracking uffd file handle,
    we should not generate the remap event, and at the same time we should
    clear all the uffd flags on the new VMA. Without this patch, we can still
    have the VM_UFFD_MISSING|VM_UFFD_WP flags on the new VMA even though the
    fault handling process does not even know of the existence of the VMA.
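
The behaviour described above can be modelled in userspace as a rough sketch. The structures and the function name below are hypothetical stand-ins rather than the kernel's own code, and the flag values are only illustrative.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values; the kernel defines the real ones itself. */
#define VM_UFFD_MISSING 0x00000200UL
#define VM_UFFD_WP      0x00001000UL

/* Hypothetical stand-ins for the kernel structures. */
struct uffd_ctx_model {
    bool event_remap_enabled;        /* UFFD_FEATURE_EVENT_REMAP negotiated? */
};

struct vma_model {
    unsigned long vm_flags;
    struct uffd_ctx_model *uffd_ctx; /* NULL when no uffd tracks this VMA */
};

/*
 * Model of the intended handling of the VMA created by mremap():
 * returns true if a remap event should be queued for the monitor.
 */
static bool mremap_prep_model(struct vma_model *new_vma)
{
    struct uffd_ctx_model *ctx = new_vma->uffd_ctx;

    if (ctx == NULL)
        return false;
    if (ctx->event_remap_enabled)
        return true;                 /* the monitor is told about the new VMA */

    /* The monitor will never learn about this VMA: detach it completely. */
    new_vma->uffd_ctx = NULL;
    new_vma->vm_flags &= ~(VM_UFFD_MISSING | VM_UFFD_WP);
    return false;
}
```

Without the feature bit, the model leaves the new VMA with neither a context pointer nor uffd flags, which is the state the message asks for.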

    Link: http://lkml.kernel.org/r/20181211053409.20317-1-peterx@redhat.com
    Signed-off-by: Peter Xu
    Reviewed-by: Andrea Arcangeli
    Acked-by: Mike Rapoport
    Reviewed-by: William Kucharski
    Cc: Andrea Arcangeli
    Cc: Mike Rapoport
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Pavel Emelyanov
    Cc: Pravin Shedge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Peter Xu
     

20 Dec, 2018

1 commit

  • commit 01e881f5a1fca4677e82733061868c6d6ea05ca7 upstream.

    Calling UFFDIO_UNREGISTER on virtual ranges not yet registered in uffd
    could trigger a harmless false positive WARN_ON. Check the vma is
    already registered before checking VM_MAYWRITE to shut off the false
    positive warning.

    Link: http://lkml.kernel.org/r/20181206212028.18726-2-aarcange@redhat.com
    Cc:
    Fixes: 29ec90660d68 ("userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas")
    Signed-off-by: Andrea Arcangeli
    Reported-by: syzbot+06c7092e7d71218a2c16@syzkaller.appspotmail.com
    Acked-by: Mike Rapoport
    Acked-by: Hugh Dickins
    Acked-by: Peter Xu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     

06 Dec, 2018

1 commit

  • commit 29ec90660d68bbdd69507c1c8b4e33aa299278b1 upstream.

    After the VMA to register the uffd onto is found, check that it has
    VM_MAYWRITE set before allowing registration. This way we inherit all
    common code checks before allowing to fill file holes in shmem and
    hugetlbfs with UFFDIO_COPY.

    The userfaultfd memory model is not applicable for readonly files unless
    it's a MAP_PRIVATE.
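
As a minimal sketch of the rule above, the check can be modelled on plain flag words. The helper names are hypothetical, the flag value is illustrative, and the -EPERM return is an assumption of this sketch rather than something stated in the message.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative value; the kernel defines the real flag itself. */
#define VM_MAYWRITE 0x00000020UL

/*
 * Hypothetical model of how a mapping ends up with VM_MAYWRITE:
 * private mappings can always be written (copy-on-write), shared
 * mappings only when the backing file was opened for writing.
 */
static unsigned long mapping_flags_model(bool map_shared, bool file_writable)
{
    if (!map_shared || file_writable)
        return VM_MAYWRITE;
    return 0;
}

/* Model of the registration check: refuse VMAs that can never be written. */
static int uffd_register_check_model(unsigned long vm_flags)
{
    return (vm_flags & VM_MAYWRITE) ? 0 : -EPERM;
}
```

In this model a MAP_PRIVATE mapping of a read-only file still passes, while a MAP_SHARED mapping of the same file is rejected.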

    Link: http://lkml.kernel.org/r/20181126173452.26955-4-aarcange@redhat.com
    Fixes: ff62a3421044 ("hugetlb: implement memfd sealing")
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Mike Rapoport
    Reviewed-by: Hugh Dickins
    Reported-by: Jann Horn
    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Cc:
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Peter Xu
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     

14 Nov, 2018

1 commit

  • commit ae62c16e105a869524afcf8a07ee85c5ae5d0479 upstream.

    userfaultfd contains home-grown locking of the waitqueue lock, and does
    not disable interrupts. This relies on the fact that no one else takes it
    from interrupt context and violates an invariant of the normal waitqueue
    locking scheme. With aio poll it is easy to trigger other locks that
    disable interrupts (or are called from interrupt context).

    Link: http://lkml.kernel.org/r/20181018154101.18750-1-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Andrew Morton
    Cc: [4.19.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     

24 Aug, 2018

1 commit

  • Use new return type vm_fault_t for fault handler. For now, this is just
    documenting that the function returns a VM_FAULT value rather than an
    errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    The aim is to change the return type of finish_fault() and
    handle_mm_fault() to vm_fault_t. As part of that cleanup, the return
    types of all other recursively called functions have been changed to
    vm_fault_t.

    The places from which handle_mm_fault() is invoked will be changed to
    vm_fault_t as well, but in a separate patch.

    vmf_error() is the inline function newly introduced in 4.17-rc6.
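
A userspace sketch of the errno-to-VM_FAULT translation mentioned above might look as follows. The typedef, the constants and the `_model` names are stand-ins chosen for this example, not the kernel definitions.

```c
#include <errno.h>

/* Userspace stand-ins; values are illustrative. */
typedef unsigned int vm_fault_t;
#define VM_FAULT_OOM    0x0001u
#define VM_FAULT_SIGBUS 0x0002u

/* Translate a negative errno into a VM_FAULT code. */
static vm_fault_t vmf_error_model(int err)
{
    if (err == -ENOMEM)
        return VM_FAULT_OOM;
    return VM_FAULT_SIGBUS;
}

/*
 * Hypothetical fault handler: prep_err is the errno-style result of some
 * preparation step. The handler itself only ever returns VM_FAULT codes,
 * which is the point of the vm_fault_t return type.
 */
static vm_fault_t fault_handler_model(int prep_err)
{
    if (prep_err)
        return vmf_error_model(prep_err);
    return 0;   /* fault handled */
}
```

Keeping errno values and VM_FAULT codes in separate types lets the compiler or sparse flag a handler that returns the wrong kind of value.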

    [akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
    Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     

23 Aug, 2018

1 commit

  • The userfaultfd code currently uses the unlocked waitqueue helpers for
    managing fault_wqh, but instead of holding the waitqueue lock for this
    waitqueue around these calls, it holds the waitqueue lock of
    fault_pending_wqh, which is a different waitqueue instance. Given that
    the waitqueue is not exposed to the rest of the kernel this actually
    works ok at the moment, but prevents the userfaultfd locking rules from
    being enforced using lockdep.

    Switch to the internally locked waitqueue helpers instead. This means
    that the lock inside fault_wqh now nests inside the fault_pending_wqh
    lock, but that's not a problem since it was entirely unused before.

    [hch@lst.de: slight changelog updates]
    [rppt@linux.vnet.ibm.com: spotted changelog spellos]
    Link: http://lkml.kernel.org/r/20171214152344.6880-3-hch@lst.de
    Signed-off-by: Matthew Wilcox
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mike Rapoport
    Cc: Al Viro
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Jason Baron
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

18 Aug, 2018

1 commit

  • Pointer uwq is being assigned but is never used hence it is redundant
    and can be removed.

    Cleans up clang warning:
    warning: variable 'uwq' set but not used [-Wunused-but-set-variable]

    Link: http://lkml.kernel.org/r/20180717090802.18357-1-colin.king@canonical.com
    Signed-off-by: Colin Ian King
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     

03 Aug, 2018

1 commit

  • The fix in commit 0cbb4b4f4c44 ("userfaultfd: clear the
    vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails") cleared the
    vma->vm_userfaultfd_ctx but kept userfaultfd flags in vma->vm_flags
    that were copied from the parent process VMA.

    As a result, there is an inconsistency between the values of
    vma->vm_userfaultfd_ctx.ctx and vma->vm_flags which triggers BUG_ON
    in userfaultfd_release().

    Clearing the uffd flags from vma->vm_flags in case of UFFD_EVENT_FORK
    failure resolves the issue.

    Link: http://lkml.kernel.org/r/1532931975-25473-1-git-send-email-rppt@linux.vnet.ibm.com
    Fixes: 0cbb4b4f4c44 ("userfaultfd: clear the vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails")
    Signed-off-by: Mike Rapoport
    Reported-by: syzbot+121be635a7a35ddb7dcb@syzkaller.appspotmail.com
    Cc: Andrea Arcangeli
    Cc: Eric Biggers
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

04 Jul, 2018

1 commit

  • Use huge_ptep_get() to translate huge ptes to normal ptes so we can
    check them with the huge_pte_* functions. Otherwise some architectures
    will check the wrong values and will not wait for userspace to bring in
    the memory.

    Link: http://lkml.kernel.org/r/20180626132421.78084-1-frankja@linux.ibm.com
    Fixes: 369cd2121be4 ("userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd ranges")
    Signed-off-by: Janosch Frank
    Reviewed-by: David Hildenbrand
    Reviewed-by: Mike Kravetz
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Janosch Frank
     

08 Jun, 2018

1 commit

  • If a process monitored with userfaultfd changes its memory mappings or
    forks at the same time as the uffd monitor fills the process memory with
    UFFDIO_COPY, the actual creation of page table entries and copying of
    the data in mcopy_atomic may happen either before or after the memory
    mapping modifications, and there is no way for the uffd monitor to
    maintain a consistent view of the process memory layout.

    For instance, let's consider fork() running in parallel with
    userfaultfd_copy():

    process                          | uffd monitor
    ---------------------------------+------------------------------
    fork()                           | userfaultfd_copy()
    ...                              | ...
        dup_mmap()                   |     down_read(mmap_sem)
        down_write(mmap_sem)         |     /* create PTEs, copy data */
            dup_uffd()               |     up_read(mmap_sem)
            copy_page_range()        |
            up_write(mmap_sem)       |
            dup_uffd_complete()      |
                /* notify monitor */ |

    If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will
    be present by the time copy_page_range() is called and they will appear
    in the child's memory mappings. However, if the fork() is the first to
    take the mmap_sem, the new pages won't be mapped in the child's address
    space.

    If the pages are not present and the child tries to access them, the
    monitor will get a page fault notification and everything is fine.
    However, if the pages *are present*, the child can access them without
    uffd noticing. And if we copy them into the child, it'll see the wrong
    data. Since we are talking about background copy, we'd need to decide
    whether the pages should be copied or not regardless of #PF
    notifications.

    Since the userfaultfd monitor has no way to determine what the order was,
    let's disallow userfaultfd_copy in parallel with the non-cooperative
    events. In such a case we return -EAGAIN and the uffd monitor can
    understand that userfaultfd_copy() clashed with a non-cooperative event
    and take an appropriate action.
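
One possible "appropriate action" on the monitor side is to process the queued events and then retry the copy. The sketch below assumes that policy. The callback types, `copy_with_event_retry()` and the `fake_*` helpers are all hypothetical and exist only to make the loop self-contained; a real monitor would issue the UFFDIO_COPY ioctl and read events from the uffd instead.

```c
#include <errno.h>

/* Hypothetical callback types, assumed for this sketch. */
typedef int  (*uffd_copy_fn)(void *state);   /* 0 or a negative errno */
typedef void (*uffd_drain_fn)(void *state);  /* handle queued events */

/* Retry the copy while it clashes with non-cooperative events. */
static int copy_with_event_retry(uffd_copy_fn do_copy,
                                 uffd_drain_fn drain_events,
                                 void *state, int max_attempts)
{
    int ret = -EAGAIN;

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        ret = do_copy(state);
        if (ret != -EAGAIN)
            return ret;          /* success or an unrelated error */
        drain_events(state);     /* catch up on fork/remap/... first */
    }
    return ret;
}

/* Stand-in monitor state used only to exercise the loop. */
struct fake_monitor {
    int pending_events;
    int copies_done;
};

static int fake_copy(void *state)
{
    struct fake_monitor *m = state;

    if (m->pending_events > 0)
        return -EAGAIN;          /* a layout change is still in flight */
    m->copies_done++;
    return 0;
}

static void fake_drain(void *state)
{
    struct fake_monitor *m = state;

    if (m->pending_events > 0)
        m->pending_events--;
}
```

Bounding the number of attempts keeps the monitor from spinning if events keep arriving; what to do once the bound is hit is left to the caller.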

    Link: http://lkml.kernel.org/r/1527061324-19949-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Pavel Emelyanov
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Andrei Vagin
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

12 Feb, 2018

1 commit

  • This is the mindless scripted replacement of kernel use of POLL*
    variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

    with de-mangling cleanups yet to come.

    NOTE! On almost all architectures, the EPOLL* constants have the same
    values as the POLL* constants do. But the keyword here is "almost".
    For various bad reasons they aren't the same, and epoll() doesn't
    actually work quite correctly in some cases due to this on Sparc et al.

    The next patch from Al will sort out the final differences, and we
    should be all done.

    Scripted-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Feb, 2018

2 commits

  • Nothing actually calls userfaultfd_file_create() besides the
    userfaultfd() system call itself. So simplify things by folding it into
    the system call and using anon_inode_getfd() instead of
    anon_inode_getfile(). Do the same in resolve_userfault_fork() as well.

    This removes over 50 lines with no change in functionality.

    Link: http://lkml.kernel.org/r/20171229212403.22800-1-ebiggers3@gmail.com
    Signed-off-by: Eric Biggers
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
     
  • If THP migration is enabled, for a VMA handled by userfaultfd, consider
    the following situation,

    do_page_fault()
      __do_huge_pmd_anonymous_page()
        handle_userfault()
          userfault_msg()
            /* a huge page is allocated and mapped at fault address */
            /* the huge page is under migration, leaves migration entry
               in page table */
          userfaultfd_must_wait()
            /* return true because !pmd_present() */
          /* may wait in loop until fatal signal */

    That is, it may be possible for userfaultfd_must_wait() to encounter a
    PMD entry which is !pmd_none() && !pmd_present(). In the current
    implementation, we will wait for such PMD entries, which may cause
    unnecessary waiting and a potential soft lockup.

    This is fixed by not waiting when !pmd_none() && !pmd_present() and
    only waiting when pmd_none().

    This may not be a problem in practice, because userfaultfd_must_wait()
    is always called with mm->mmap_sem read-locked. mremap() will
    write-lock mm->mmap_sem. And UFFDIO_COPY doesn't support copying THP
    mappings. But the change still makes the code more correct, and makes
    the PMD and PTE code more consistent.
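
The decision described above can be written down as a small truth table. The enum and function below are a hypothetical userspace model of the post-fix logic, not the kernel function itself.

```c
#include <stdbool.h>

/* Hypothetical classification of what the PMD slot currently holds. */
enum pmd_state_model {
    PMD_NONE,          /* pmd_none(): nothing mapped yet */
    PMD_NOT_PRESENT,   /* !pmd_none() && !pmd_present(), e.g. migration entry */
    PMD_PRESENT_HUGE,  /* a present transparent huge page */
    PMD_PRESENT_TABLE  /* points to a page table, so the PTE decides */
};

/* Model of the wait decision after the fix. */
static bool must_wait_model(enum pmd_state_model pmd, bool pte_is_none)
{
    switch (pmd) {
    case PMD_NONE:
        return true;            /* still missing: wait for userspace */
    case PMD_NOT_PRESENT:
        return false;           /* under migration: retry the fault instead */
    case PMD_PRESENT_HUGE:
        return false;           /* already mapped */
    case PMD_PRESENT_TABLE:
        return pte_is_none;     /* wait only if the PTE is still empty */
    }
    return false;
}
```

Only the two "nothing there yet" cases lead to waiting, which is what makes the PMD and PTE paths consistent in this model.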

    Link: http://lkml.kernel.org/r/20171207011752.3292-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: "Kirill A. Shutemov"
    Cc: Alexander Viro
    Cc: Zi Yan
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

31 Jan, 2018

1 commit

  • Pull poll annotations from Al Viro:
    "This introduces a __bitwise type for POLL### bitmap, and propagates
    the annotations through the tree. Most of that stuff is as simple as
    'make ->poll() instances return __poll_t and do the same to local
    variables used to hold the future return value'.

    Some of the obvious brainos found in process are fixed (e.g. POLLIN
    misspelled as POLL_IN). At that point the amount of sparse warnings is
    low and most of them are for genuine bugs - e.g. ->poll() instance
    deciding to return -EINVAL instead of a bitmap. I hadn't touched those
    in this series - it's large enough as it is.

    Another problem it has caught was eventpoll() ABI mess; select.c and
    eventpoll.c assumed that corresponding POLL### and EPOLL### were
    equal. That's true for some, but not all of them - EPOLL### are
    arch-independent, but POLL### are not.

    The last commit in this series separates userland POLL### values from
    the (now arch-independent) kernel-side ones, converting between them
    in the few places where they are copied to/from userland. AFAICS, this
    is the least disruptive fix preserving poll(2) ABI and making epoll()
    work on all architectures.

    As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
    it will trigger only on what would've triggered EPOLLWRBAND on other
    architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
    at all on sparc. With this patch they should work consistently on all
    architectures"

    * 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
    make kernel-side POLL... arch-independent
    eventpoll: no need to mask the result of epi_item_poll() again
    eventpoll: constify struct epoll_event pointers
    debugging printk in sg_poll() uses %x to print POLL... bitmap
    annotate poll(2) guts
    9p: untangle ->poll() mess
    ->si_band gets POLL... bitmap stored into a user-visible long field
    ring_buffer_poll_wait() return value used as return value of ->poll()
    the rest of drivers/*: annotate ->poll() instances
    media: annotate ->poll() instances
    fs: annotate ->poll() instances
    ipc, kernel, mm: annotate ->poll() instances
    net: annotate ->poll() instances
    apparmor: annotate ->poll() instances
    tomoyo: annotate ->poll() instances
    sound: annotate ->poll() instances
    acpi: annotate ->poll() instances
    crypto: annotate ->poll() instances
    block: annotate ->poll() instances
    x86: annotate ->poll() instances
    ...

    Linus Torvalds
     

05 Jan, 2018

1 commit

  • The previous fix in commit 384632e67e08 ("userfaultfd: non-cooperative:
    fix fork use after free") corrected the refcounting in case of
    UFFD_EVENT_FORK failure for the fork userfault paths.

    That still didn't clear the vma->vm_userfaultfd_ctx of the vmas that
    were set to point to the aborted new uffd ctx earlier in
    dup_userfaultfd.

    Link: http://lkml.kernel.org/r/20171223002505.593-2-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: syzbot
    Reviewed-by: Mike Rapoport
    Cc: Eric Biggers
    Cc: Dmitry Vyukov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

28 Nov, 2017

1 commit


16 Nov, 2017

1 commit


25 Oct, 2017

1 commit

  • …READ_ONCE()/WRITE_ONCE()

    Please do not apply this to mainline directly, instead please re-run the
    coccinelle script shown below and apply its output.

    For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
    preference to ACCESS_ONCE(), and new code is expected to use one of the
    former. So far, there's been no reason to change most existing uses of
    ACCESS_ONCE(), as these aren't harmful, and changing them results in
    churn.

    However, for some features, the read/write distinction is critical to
    correct operation. To distinguish these cases, separate read/write
    accessors must be used. This patch migrates (most) remaining
    ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
    coccinelle script:

    ----
    // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
    // WRITE_ONCE()

    // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch

    virtual patch

    @ depends on patch @
    expression E1, E2;
    @@

    - ACCESS_ONCE(E1) = E2
    + WRITE_ONCE(E1, E2)

    @ depends on patch @
    expression E;
    @@

    - ACCESS_ONCE(E)
    + READ_ONCE(E)
    ----
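
To illustrate what the two rewrite rules produce, here is a simplified userspace approximation of the accessors. The `_MODEL` macros are assumptions of this sketch (the kernel versions are more elaborate) and rely on the `__typeof__` extension supported by GCC and Clang.

```c
/* Simplified stand-ins: force exactly one volatile access to x. */
#define READ_ONCE_MODEL(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE_MODEL(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

static int shared_flag;

/* Was: ACCESS_ONCE(shared_flag) = v;  matches the first rule. */
static void publish(int v)
{
    WRITE_ONCE_MODEL(shared_flag, v);
}

/* Was: return ACCESS_ONCE(shared_flag);  matches the second rule. */
static int observe(void)
{
    return READ_ONCE_MODEL(shared_flag);
}
```

After the conversion each call site states whether it reads or writes, which is the distinction the message says some features depend on.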

    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: davem@davemloft.net
    Cc: linux-arch@vger.kernel.org
    Cc: mpe@ellerman.id.au
    Cc: shuah@kernel.org
    Cc: snitzer@redhat.com
    Cc: thor.thayer@linux.intel.com
    Cc: tj@kernel.org
    Cc: viro@zeniv.linux.org.uk
    Cc: will.deacon@arm.com
    Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Mark Rutland
     

04 Oct, 2017

1 commit

  • When reading the event from the uffd, we put it on a temporary
    fork_event list to detect if we can still access it after releasing and
    retaking the event_wqh.lock.

    If fork aborts and removes the event from the fork_event all is fine as
    long as we're still in the userfault read context and fork_event head is
    still alive.

    We have to put the event, allocated on the fork kernel stack, back from
    the fork_event list-head to the event_wqh head before returning from
    userfaultfd_ctx_read, because the fork_event head lifetime is limited to
    the userfaultfd_ctx_read stack lifetime.

    Forgetting to move the event back to its event_wqh place then results in
    __remove_wait_queue(&ctx->event_wqh, &ewq->wq) in
    userfaultfd_event_wait_completion removing it from a head that has
    already been freed from the reader stack.

    This could only happen if resolve_userfault_fork failed (for example if
    there are no file descriptors available to allocate the fork uffd). If
    it succeeded it was put back correctly.

    Furthermore, after find_userfault_evt receives a fork event, the forked
    userfault context in fork_nctx and uwq->msg.arg.reserved.reserved1 can
    be released by the fork thread as soon as the event_wqh.lock is
    released. Taking a reference on the fork_nctx before dropping the lock
    prevents a use after free in resolve_userfault_fork().

    If the fork side aborted and it already released everything, we still
    try to succeed resolve_userfault_fork(), if possible.

    Fixes: 893e26e61d04eac9 ("userfaultfd: non-cooperative: Add fork() event")
    Link: http://lkml.kernel.org/r/20170920180413.26713-1-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Mark Rutland
    Tested-by: Mark Rutland
    Cc: Pavel Emelyanov
    Cc: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

09 Sep, 2017

1 commit

  • This is an enhancement to avoid a non-cooperative userfaultfd manager
    having to unregister all regions before it can close the uffd after all
    userfaultfd activity has completed.

    The UFFDIO_UNREGISTER would serialize against the handle_userfault by
    taking the mmap_sem for writing, but we can simply repeat the page fault
    if we detect the uffd was closed and so the regular page fault paths
    should take over.

    Link: http://lkml.kernel.org/r/20170823181227.19926-1-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Mike Rapoport
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

07 Sep, 2017

4 commits

  • No ABI change, but this will make it more explicit to software that ptid
    is only available if requested by passing UFFD_FEATURE_THREAD_ID to
    UFFDIO_API. The fact that it's a union will also self-document that it
    shouldn't be taken for granted that there's a ptid there.
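
A minimal sketch of how a monitor might honour this: only read ptid when the feature was negotiated. The struct, the constant and the helper are hypothetical models for illustration (the real layout lives in the uapi header), and the bit value is illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit; the uapi header defines the real feature flag. */
#define UFFD_FEATURE_THREAD_ID_MODEL (1u << 8)

/* Hypothetical model of the page-fault part of a uffd message. */
struct pagefault_msg_model {
    uint64_t flags;
    uint64_t address;
    union {
        uint32_t ptid;   /* meaningful only with the thread-ID feature */
    } feat;
};

/*
 * Store the faulting thread ID in *ptid_out and return true, or return
 * false when the feature was not negotiated and the field means nothing.
 */
static bool pagefault_ptid(const struct pagefault_msg_model *msg,
                           uint64_t negotiated_features,
                           uint32_t *ptid_out)
{
    if (!(negotiated_features & UFFD_FEATURE_THREAD_ID_MODEL))
        return false;
    *ptid_out = msg->feat.ptid;
    return true;
}
```

Routing every access through one helper keeps the "only if requested" rule in a single place.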

    Link: http://lkml.kernel.org/r/20170802165145.22628-7-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Alexey Perevalov
    Cc: Maxime Coquelin
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • It could be useful for calculating downtime during postcopy live
    migration per vCPU. A side observer or the application itself will be
    informed about the proper task's sleep during userfaultfd processing.

    The process's thread ID is provided when the user requests it by setting
    the UFFD_FEATURE_THREAD_ID bit in uffdio_api.features.

    Link: http://lkml.kernel.org/r/20170802165145.22628-6-aarcange@redhat.com
    Signed-off-by: Alexey Perevalov
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Maxime Coquelin
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Perevalov
     
  • In some cases, the userfaultfd mechanism should just deliver a SIGBUS
    signal to the faulting process instead of the page-fault event. Dealing
    with page-fault events using a monitor thread can be an overhead in
    these cases. For example, applications like a database could use the
    signaling mechanism for robustness purposes.

    The database uses hugetlbfs for performance reasons. Files on the
    hugetlbfs filesystem are created and huge pages allocated using the
    fallocate() API. Pages are deallocated/freed using fallocate() hole
    punching support. These files are mmapped and accessed by many
    processes as shared memory. The database keeps track of which offsets
    in the hugetlbfs file have pages allocated.

    Any access to a mapped address over holes in the file, which can occur
    due to bugs in the application, is considered invalid, and the process
    is expected to simply receive a SIGBUS. However, currently when a hole
    in the file is accessed via the mapped address, kernel/mm attempts to
    automatically allocate a page at page fault time, implicitly filling
    the hole in the file. This may not be the desired behavior for
    applications like the database that want to explicitly manage page
    allocations of hugetlbfs files.

    Using the userfaultfd mechanism with this support to get a signal, a
    database application can prevent pages from being allocated implicitly
    when processes access mapped addresses over holes in the file.

    This patch adds the UFFD_FEATURE_SIGBUS feature to the userfaultfd
    mechanism to request a SIGBUS signal.

    See following for previous discussion about the database requirement
    leading to this proposal as suggested by Andrea.

    http://www.spinics.net/lists/linux-mm/msg129224.html

    Link: http://lkml.kernel.org/r/1501552446-748335-2-git-send-email-prakash.sangappa@oracle.com
    Signed-off-by: Prakash Sangappa
    Reviewed-by: Mike Rapoport
    Reviewed-by: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Prakash Sangappa
     
  • Now that shmem VMAs can be filled with the zero page via userfaultfd, we
    can report that UFFDIO_ZEROPAGE is available for those VMAs.

    Link: http://lkml.kernel.org/r/1497939652-16528-7-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

11 Aug, 2017

2 commits

  • Conflicts:
    include/linux/mm_types.h
    mm/huge_memory.c

    I removed the smp_mb__before_spinlock() like the following commit does:

    8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")

    and fixed up the affected commits.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • When the process exit races with an outstanding mcopy_atomic, it would be
    better to return the ESRCH error. When such a race occurs the process and
    its mm are going away, and returning "no such process" to the uffd
    monitor seems a better fit than ENOSPC.

    Link: http://lkml.kernel.org/r/1502111545-32305-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Pavel Emelyanov
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

10 Aug, 2017

1 commit


03 Aug, 2017

2 commits

  • There may still be threads waiting on event_wqh at the time the
    userfault file descriptor is closed. Flush the events wait-queue to
    prevent waiting threads from hanging.

    Link: http://lkml.kernel.org/r/1501398127-30419-1-git-send-email-rppt@linux.vnet.ibm.com
    Fixes: 9cd75c3cd4c3d ("userfaultfd: non-cooperative: add ability to report
    non-PF events from uffd descriptor")
    Signed-off-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Pavel Emelyanov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • In the non-cooperative userfaultfd case, the process exit may race with
    an outstanding mcopy_atomic called by the uffd monitor. Returning -ENOSPC
    instead of -EINVAL when the mm is already gone will allow the uffd
    monitor to distinguish this case from other error conditions.

    Unfortunately I overlooked userfaultfd_zeropage when updating
    userfaultfd_copy().

    Link: http://lkml.kernel.org/r/1501136819-21857-1-git-send-email-rppt@linux.vnet.ibm.com
    Fixes: 96333187ab162 ("userfaultfd_copy: return -ENOSPC in case mm has gone")
    Signed-off-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Pavel Emelyanov
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

07 Jul, 2017

2 commits

  • A poisoned or migrated hugepage is stored as a swap entry in the page
    tables. On architectures that support hugepages consisting of
    contiguous page table entries (such as on arm64) this leads to ambiguity
    in determining the page table entry to return in huge_pte_offset() when
    a poisoned entry is encountered.

    Let's remove the ambiguity by adding a size parameter to convey
    additional information about the requested address. Also fixup the
    definition/usage of huge_pte_offset() throughout the tree.

    Link: http://lkml.kernel.org/r/20170522133604.11392-4-punit.agrawal@arm.com
    Signed-off-by: Punit Agrawal
    Acked-by: Steve Capper
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: James Hogan (odd fixer:METAG ARCHITECTURE)
    Cc: Ralf Baechle (supporter:MIPS)
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K.V"
    Cc: "Kirill A. Shutemov"
    Cc: Hillf Danton
    Cc: Mark Rutland
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Punit Agrawal
     
  • The calculations of start and end in the __wake_userfault function are
    not used and can be removed.

    Link: http://lkml.kernel.org/r/1494930917-3134-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

20 Jun, 2017

2 commits

  • So I've noticed a number of instances where it was not obvious from the
    code whether ->task_list was for a wait-queue head or a wait-queue entry.

    Furthermore, there's a number of wait-queue users where the lists are
    not for 'tasks' but other entities (poll tables, etc.), in which case
    the 'task_list' name is actively confusing.

    To clear this all up, name the wait-queue head and entry list structure
    fields unambiguously:

    struct wait_queue_head::task_list => ::head
    struct wait_queue_entry::task_list => ::entry

    For example, this code:

    rqw->wait.task_list.next != &wait->task_list

    ... it was pretty unclear (to me) what it's doing, while now it's written this way:

    rqw->wait.head.next != &wait->entry

    ... which makes it pretty clear that we are iterating a list until we see the head.

    Other examples are:

    list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
    list_for_each_entry(wq, &fence->wait.task_list, task_list) {

    ... where it's unclear (to me) what we are iterating, and during review it's
    hard to tell whether it's trying to walk a wait-queue entry (which would be
    a bug), while now it's written as:

    list_for_each_entry_safe(pos, next, &x->head, entry) {
    list_for_each_entry(wq, &fence->wait.head, entry) {

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

17 Jun, 2017

1 commit

  • Anon and hugetlbfs handle FOLL_DUMP set by get_dump_page() internally to
    __get_user_pages().

    shmem, by contrast, has no special FOLL_DUMP handling there, so
    handle_mm_fault() is invoked without mmap_sem and ends up calling
    handle_userfault(), which isn't expecting to be invoked without mmap_sem
    held.

    This makes handle_userfault() fail immediately if invoked through
    shmem_vm_ops->fault during coredumping and solves the problem.

    The side effect is a BUG_ON with no lock held triggered by the
    coredumping process which exits. Only 4.11 is affected; pre-4.11, anon
    memory holes are skipped in __get_user_pages by checking FOLL_DUMP
    explicitly against empty pagetables (mm/gup.c:no_page_table()).

    It's zero cost as we already had a check for current->flags to prevent
    futex from triggering userfaults during exit (PF_EXITING).

    Link: http://lkml.kernel.org/r/20170615214838.27429-1-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: "Dr. David Alan Gilbert"
    Cc: [4.11+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
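
The early-exit check described above can be sketched in userspace as follows. This is only a model under stated assumptions: the flag values are illustrative, `struct demo_task` and `demo_may_handle_userfault()` are hypothetical, and the commit message names only PF_EXITING, so the coredump flag here is a stand-in.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits; the real values are defined by the kernel. */
#define DEMO_PF_EXITING   0x1u
#define DEMO_PF_DUMPCORE  0x2u  /* assumed name for the coredump state */

/* Hypothetical stand-in for the task state the check looks at. */
struct demo_task {
	unsigned int flags;
};

/*
 * Returns true when a userfault may be queued for this task, false when
 * the handler should fail immediately (task is exiting or dumping core).
 * Both conditions are tested with a single mask, so adding the coredump
 * case costs nothing extra on the normal path.
 */
static bool demo_may_handle_userfault(const struct demo_task *t)
{
	assert(t);
	if (t->flags & (DEMO_PF_EXITING | DEMO_PF_DUMPCORE))
		return false;
	return true;
}
```

Folding the new condition into the existing flags test is what makes the fix "zero cost" in the sense the message describes.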
     

08 Apr, 2017

1 commit

  • The fdinfo for a userfault file descriptor reports UFFD_API_FEATURES. Up
    until recently, UFFD_API_FEATURES was defined as 0, therefore the
    corresponding field in fdinfo always contained zero. Now, with the
    introduction of several additional features, UFFD_API_FEATURES is no
    longer 0 and it seems better to report the actual features requested for
    the userfaultfd object described by the fdinfo.

    First, the applications that were using userfault will still see zero in
    the features field in fdinfo. Next, reporting actual features rather
    than available features gives a clear indication of what userfault
    features are used by an application.

    Link: http://lkml.kernel.org/r/1491140181-22121-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrea Arcangeli
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
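
A small sketch of the difference between reporting available and requested features. Everything here is hypothetical: the `demo_*` names, the feature mask value, and the output layout are invented for illustration and do not reproduce the real fdinfo format.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical mask of every feature the kernel could offer. */
#define DEMO_API_FEATURES 0x7u

/* Hypothetical per-descriptor context: features the application asked for. */
struct demo_uffd_ctx {
	unsigned int features;
};

/* Old behaviour: every descriptor reports the same global mask. */
static unsigned int demo_reported_old(const struct demo_uffd_ctx *ctx)
{
	(void)ctx;
	return DEMO_API_FEATURES;
}

/* New behaviour: report what this particular descriptor requested. */
static unsigned int demo_reported_new(const struct demo_uffd_ctx *ctx)
{
	assert(ctx);
	return ctx->features;
}

/* Format the value as one fdinfo-style line; returns the snprintf result. */
static int demo_show_features(const struct demo_uffd_ctx *ctx, char *buf, size_t len)
{
	return snprintf(buf, len, "features:\t%x\n", demo_reported_new(ctx));
}
```

An application that requested no features keeps seeing zero, which matches the compatibility argument in the message.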
     

11 Mar, 2017

1 commit

  • Merge 5-level page table prep from Kirill Shutemov:
    "Here's the relatively low-risk part of the 5-level paging patchset.
    Merging it now will make x86 5-level paging enabling in v4.12 easier.

    The first patch is actually x86-specific: detect 5-level paging
    support. It boils down to single define.

    The rest of patchset converts Linux MMU abstraction from 4- to 5-level
    paging.

    Enabling the new abstraction in most cases requires adding a single line
    of code in arch-specific code. The rest is taken care of by asm-generic/.

    Changes to mm/ code are mostly mechanical: add support for new page
    table level -- p4d_t -- where we deal with pud_t now.

    v2:
    - fix build on microblaze (Michal);
    - comment for __ARCH_HAS_5LEVEL_HACK in kasan_populate_zero_shadow();
    - acks from Michal"

    * emailed patches from Kirill A Shutemov:
    mm: introduce __p4d_alloc()
    mm: convert generic code to 5-level paging
    asm-generic: introduce
    arch, mm: convert all architectures to use 5level-fixup.h
    asm-generic: introduce __ARCH_USE_5LEVEL_HACK
    asm-generic: introduce 5level-fixup.h
    x86/cpufeature: Add 5-level paging detection

    Linus Torvalds
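
To make the extra p4d level concrete, the sketch below splits a virtual address into one table index per level. It assumes the x86-64 layout with 4 KiB pages and 9 index bits per level; the `DEMO_*` names and `demo_index()` are hypothetical and are not kernel macros.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 layout: 4 KiB pages, 9 index bits per table level. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_LEVEL_BITS 9
#define DEMO_LEVEL_MASK ((1u << DEMO_LEVEL_BITS) - 1)

/* Levels from the leaf upwards; P4D is the level added by 5-level paging. */
enum demo_level { DEMO_PTE, DEMO_PMD, DEMO_PUD, DEMO_P4D, DEMO_PGD };

/* Index into the table at the given level for virtual address va. */
static unsigned int demo_index(uint64_t va, enum demo_level level)
{
	unsigned int shift;

	assert(level <= DEMO_PGD);
	shift = DEMO_PAGE_SHIFT + DEMO_LEVEL_BITS * (unsigned int)level;
	return (unsigned int)(va >> shift) & DEMO_LEVEL_MASK;
}
```

Under these assumptions the pgd index starts at bit 48 and the p4d index at bit 39, so code that used to step from pgd straight to pud gains one intermediate lookup.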
     

10 Mar, 2017

4 commits

  • It's a void function, so there is no return value.

    Link: http://lkml.kernel.org/r/20170309150817.7510-1-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • userfaultfd_remove() has to be executed before zapping the pagetables or
    UFFDIO_COPY could keep filling pages after zap_page_range returned,
    which would result in non-zero data after a MADV_DONTNEED.

    However userfaultfd_remove() may have to release the mmap_sem. This was
    handled correctly in MADV_REMOVE, but MADV_DONTNEED accessed a
    potentially stale vma (the very vma passed to zap_page_range(vma, ...)).

    The fix consists in revalidating the vma in case userfaultfd_remove()
    had to release the mmap_sem.

    This also optimizes away an unnecessary down_read/up_read in the
    MADV_REMOVE case if UFFD_EVENT_FORK had to be delivered.

    It all remains zero runtime cost in case CONFIG_USERFAULTFD=n as
    userfaultfd_remove() will be defined as "true" at build time.

    Link: http://lkml.kernel.org/r/20170302173738.18994-3-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
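
The revalidation pattern from the MADV_DONTNEED fix above can be modelled in userspace like this. All names are hypothetical (`demo_*`), the "lock" is only simulated by a flag, and the concurrent unmap is faked; the one detail taken from the message is that the remove step reports "true" when the lock was not released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_vma { unsigned long start, end; };

struct demo_mm {
	struct demo_vma *vmas;  /* sorted, non-overlapping mappings */
	size_t nr;
	bool drop_lock;         /* demo knob: pretend the lock gets released */
};

/* First mapping whose end lies above addr, or NULL if there is none. */
static struct demo_vma *demo_find_vma(struct demo_mm *mm, unsigned long addr)
{
	for (size_t i = 0; i < mm->nr; i++)
		if (addr < mm->vmas[i].end)
			return &mm->vmas[i];
	return NULL;
}

/* Returns true if the lock stayed held, false if it had to be released. */
static bool demo_uffd_remove(struct demo_mm *mm)
{
	if (mm->drop_lock) {
		mm->nr = 0;  /* simulate a concurrent unmap while unlocked */
		return false;
	}
	return true;
}

/* Returns 0 on success, -1 if the range is not (or no longer) mapped. */
static int demo_dontneed(struct demo_mm *mm, unsigned long start)
{
	struct demo_vma *vma = demo_find_vma(mm, start);

	if (!vma || start < vma->start)
		return -1;
	if (!demo_uffd_remove(mm)) {
		/* The lock was dropped: the old pointer may be stale, look it up again. */
		vma = demo_find_vma(mm, start);
		if (!vma || start < vma->start)
			return -1;
	}
	/* ... the pages of the validated vma would be zapped here ... */
	return 0;
}
```

The key point is that the second lookup happens only on the path where the lock was released, so the common path pays nothing for it.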
     
  • We have a memleak in the ->new ctx if the uffd of the parent is closed
    before the fork event is read: nothing frees the new context.

    Link: http://lkml.kernel.org/r/20170302173738.18994-2-aarcange@redhat.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrea Arcangeli
    Reported-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Don't stop running dup_fctx() even if userfaultfd_event_wait_completion
    fails as it has to run userfaultfd_ctx_put on all ctx to pair against
    the userfaultfd_ctx_get that was run on all fctx->orig in
    dup_userfaultfd.

    Link: http://lkml.kernel.org/r/20170224181957.19736-4-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
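
The get/put pairing rule from the dup_fctx() fix above can be sketched as a loop that keeps going after a failure. This is a userspace model with hypothetical `demo_*` names; returning the first error is a choice made for this sketch, not something the commit message specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted context; fail_wait forces a delivery failure. */
struct demo_ctx {
	int refcount;
	int fail_wait;
};

static void demo_ctx_get(struct demo_ctx *c)
{
	c->refcount++;
}

static void demo_ctx_put(struct demo_ctx *c)
{
	assert(c->refcount > 0);
	c->refcount--;
}

/* Pretend event delivery; returns nonzero on failure. */
static int demo_wait_completion(const struct demo_ctx *c)
{
	return c->fail_wait ? -1 : 0;
}

/* First phase: take one reference on every context. */
static void demo_get_all(struct demo_ctx *ctxs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		demo_ctx_get(&ctxs[i]);
}

/*
 * Second phase: deliver each event and drop each reference. The loop never
 * stops early, so every get from the first phase is paired with a put even
 * when a delivery fails. Returns the first error seen, or 0.
 */
static int demo_complete_all(struct demo_ctx *ctxs, size_t n)
{
	int ret = 0;

	for (size_t i = 0; i < n; i++) {
		int err = demo_wait_completion(&ctxs[i]);

		if (err && !ret)
			ret = err;
		demo_ctx_put(&ctxs[i]);
	}
	return ret;
}
```

Had the loop returned on the first failure, every context after the failing one would have kept its extra reference.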