Eric Lee / smarc-fsl-linux-kernel

26 Jan, 2019

1 commit

2011eb741 userfaultfd: clear flag if remap event not enabled ... Browse Code »

[ Upstream commit 3cfd22be0ad663248fadfc8f6ffa3e255c394552 ]

When the process being tracked does mremap() without
UFFD_FEATURE_EVENT_REMAP on the corresponding tracking uffd file handle,
we should not generate the remap event, and at the same time we should
clear all the uffd flags on the new VMA. Without this patch, we can still
have the VM_UFFD_MISSING|VM_UFFD_WP flags on the new VMA even the fault
handling process does not even know the existance of the VMA.

Link: http://lkml.kernel.org/r/20181211053409.20317-1-peterx@redhat.com
Signed-off-by: Peter Xu
Reviewed-by: Andrea Arcangeli
Acked-by: Mike Rapoport
Reviewed-by: William Kucharski
Cc: Andrea Arcangeli
Cc: Mike Rapoport
Cc: Kirill A. Shutemov
Cc: Hugh Dickins
Cc: Pavel Emelyanov
Cc: Pravin Shedge
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin

Peter Xu
2019-01-26 16:32:43 +0800

20 Dec, 2018

1 commit

d41c49daf userfaultfd: check VM_MAYWRITE was set after verifying the uffd is registered ... Browse Code »

commit 01e881f5a1fca4677e82733061868c6d6ea05ca7 upstream.

Calling UFFDIO_UNREGISTER on virtual ranges not yet registered in uffd
could trigger an harmless false positive WARN_ON. Check the vma is
already registered before checking VM_MAYWRITE to shut off the false
positive warning.

Link: http://lkml.kernel.org/r/20181206212028.18726-2-aarcange@redhat.com
Cc:
Fixes: 29ec90660d68 ("userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas")
Signed-off-by: Andrea Arcangeli
Reported-by: syzbot+06c7092e7d71218a2c16@syzkaller.appspotmail.com
Acked-by: Mike Rapoport
Acked-by: Hugh Dickins
Acked-by: Peter Xu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Andrea Arcangeli
2018-12-20 02:19:50 +0800

06 Dec, 2018

1 commit

34b7a7cc5 userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas ... Browse Code »

commit 29ec90660d68bbdd69507c1c8b4e33aa299278b1 upstream.

After the VMA to register the uffd onto is found, check that it has
VM_MAYWRITE set before allowing registration. This way we inherit all
common code checks before allowing to fill file holes in shmem and
hugetlbfs with UFFDIO_COPY.

The userfaultfd memory model is not applicable for readonly files unless
it's a MAP_PRIVATE.

Link: http://lkml.kernel.org/r/20181126173452.26955-4-aarcange@redhat.com
Fixes: ff62a3421044 ("hugetlb: implement memfd sealing")
Signed-off-by: Andrea Arcangeli
Reviewed-by: Mike Rapoport
Reviewed-by: Hugh Dickins
Reported-by: Jann Horn
Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
Cc:
Cc: "Dr. David Alan Gilbert"
Cc: Mike Kravetz
Cc: Peter Xu
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Andrea Arcangeli
2018-12-06 02:32:04 +0800

14 Nov, 2018

1 commit

d2e97f026 userfaultfd: disable irqs when taking the waitqueue lock ... Browse Code »

commit ae62c16e105a869524afcf8a07ee85c5ae5d0479 upstream.

userfaultfd contains howe-grown locking of the waitqueue lock, and does
not disable interrupts. This relies on the fact that no one else takes it
from interrupt context and violates an invariat of the normal waitqueue
locking scheme. With aio poll it is easy to trigger other locks that
disable interrupts (or are called from interrupt context).

Link: http://lkml.kernel.org/r/20181018154101.18750-1-hch@lst.de
Signed-off-by: Christoph Hellwig
Reviewed-by: Andrea Arcangeli
Reviewed-by: Andrew Morton
Cc: [4.19.x]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Christoph Hellwig
2018-11-14 03:08:46 +0800

24 Aug, 2018

1 commit

2b7403035 mm: Change return type int to vm_fault_t for fault handlers ... Browse Code »

Use new return type vm_fault_t for fault handler. For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno. Once all instances are converted, vm_fault_t will become a
distinct type.

Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t")

The aim is to change the return type of finish_fault() and
handle_mm_fault() to vm_fault_t type. As part of that clean up return
type of all other recursively called functions have been changed to
vm_fault_t type.

The places from where handle_mm_fault() is getting invoked will be
change to vm_fault_t type but in a separate patch.

vmf_error() is the newly introduce inline function in 4.17-rc6.

[akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder
Reviewed-by: Matthew Wilcox
Reviewed-by: Andrew Morton
Cc: Matthew Wilcox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Souptick Joarder
2018-08-24 09:48:44 +0800

23 Aug, 2018

1 commit

c430d1e84 userfaultfd: use fault_wqh lock ... Browse Code »

The userfaultfd code currently uses the unlocked waitqueue helpers for
managing fault_wqh, but instead of holding the waitqueue lock for this
waitqueue around these calls, it the waitqueue lock of
fault_pending_wq, which is a different waitqueue instance. Given that
the waitqueue is not exposed to the rest of the kernel this actually
works ok at the moment, but prevents the userfaultfd locking rules from
being enforced using lockdep.

Switch to the internally locked waitqueue helpers instead. This means
that the lock inside fault_wqh now nests inside the fault_pending_wqh
lock, but that's not a problem since it was entirely unused before.

[hch@lst.de: slight changelog updates]
[rppt@linux.vnet.ibm.com: spotted changelog spellos]
Link: http://lkml.kernel.org/r/20171214152344.6880-3-hch@lst.de
Signed-off-by: Matthew Wilcox
Signed-off-by: Christoph Hellwig
Reviewed-by: Mike Rapoport
Cc: Al Viro
Cc: Andrea Arcangeli
Cc: Ingo Molnar
Cc: Jason Baron
Cc: Peter Zijlstra
Cc: Davidlohr Bueso
Cc: Matthew Wilcox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Matthew Wilcox
2018-08-23 01:52:47 +0800

18 Aug, 2018

1 commit

5241d4727 fs/userfaultfd.c: remove redundant pointer uwq ... Browse Code »

Pointer uwq is being assigned but is never used hence it is redundant
and can be removed.

Cleans up clang warning:
warning: variable 'uwq' set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/20180717090802.18357-1-colin.king@canonical.com
Signed-off-by: Colin Ian King
Reviewed-by: Andrew Morton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Colin Ian King
2018-08-18 07:20:32 +0800

03 Aug, 2018

1 commit

31e810aa1 userfaultfd: remove uffd flags from vma->vm_flags if UFFD_EVENT_FORK fails ... Browse Code »

The fix in commit 0cbb4b4f4c44 ("userfaultfd: clear the
vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails") cleared the
vma->vm_userfaultfd_ctx but kept userfaultfd flags in vma->vm_flags
that were copied from the parent process VMA.

As the result, there is an inconsistency between the values of
vma->vm_userfaultfd_ctx.ctx and vma->vm_flags which triggers BUG_ON
in userfaultfd_release().

Clearing the uffd flags from vma->vm_flags in case of UFFD_EVENT_FORK
failure resolves the issue.

Link: http://lkml.kernel.org/r/1532931975-25473-1-git-send-email-rppt@linux.vnet.ibm.com
Fixes: 0cbb4b4f4c44 ("userfaultfd: clear the vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails")
Signed-off-by: Mike Rapoport
Reported-by: syzbot+121be635a7a35ddb7dcb@syzkaller.appspotmail.com
Cc: Andrea Arcangeli
Cc: Eric Biggers
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Rapoport
2018-08-03 07:03:40 +0800

04 Jul, 2018

1 commit

1e2c04362 userfaultfd: hugetlbfs: fix userfaultfd_huge_must_wait() pte access ... Browse Code »

Use huge_ptep_get() to translate huge ptes to normal ptes so we can
check them with the huge_pte_* functions. Otherwise some architectures
will check the wrong values and will not wait for userspace to bring in
the memory.

Link: http://lkml.kernel.org/r/20180626132421.78084-1-frankja@linux.ibm.com
Fixes: 369cd2121be4 ("userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd ranges")
Signed-off-by: Janosch Frank
Reviewed-by: David Hildenbrand
Reviewed-by: Mike Kravetz
Cc: Andrea Arcangeli
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Janosch Frank
2018-07-04 08:32:18 +0800

08 Jun, 2018

1 commit

df2cc96e7 userfaultfd: prevent non-cooperative events vs mcopy_atomic races ... Browse Code »

If a process monitored with userfaultfd changes it's memory mappings or
forks() at the same time as uffd monitor fills the process memory with
UFFDIO_COPY, the actual creation of page table entries and copying of
the data in mcopy_atomic may happen either before of after the memory
mapping modifications and there is no way for the uffd monitor to
maintain consistent view of the process memory layout.

For instance, let's consider fork() running in parallel with
userfaultfd_copy():

process | uffd monitor
---------------------------------+------------------------------
fork() | userfaultfd_copy()
... | ...
dup_mmap() | down_read(mmap_sem)
down_write(mmap_sem) | /* create PTEs, copy data */
dup_uffd() | up_read(mmap_sem)
copy_page_range() |
up_write(mmap_sem) |
dup_uffd_complete() |
/* notify monitor */ |

If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will
be present by the time copy_page_range() is called and they will appear
in the child's memory mappings. However, if the fork() is the first to
take the mmap_sem, the new pages won't be mapped in the child's address
space.

If the pages are not present and child tries to access them, the monitor
will get page fault notification and everything is fine. However, if
the pages *are present*, the child can access them without uffd
noticing. And if we copy them into child it'll see the wrong data.
Since we are talking about background copy, we'd need to decide whether
the pages should be copied or not regardless #PF notifications.

Since userfaultfd monitor has no way to determine what was the order,
let's disallow userfaultfd_copy in parallel with the non-cooperative
events. In such case we return -EAGAIN and the uffd monitor can
understand that userfaultfd_copy() clashed with a non-cooperative event
and take an appropriate action.

Link: http://lkml.kernel.org/r/1527061324-19949-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport
Acked-by: Pavel Emelyanov
Cc: Andrea Arcangeli
Cc: Mike Kravetz
Cc: Andrei Vagin
Signed-off-by: Andrew Morton

Signed-off-by: Linus Torvalds

Mike Rapoport
2018-06-08 08:34:38 +0800

12 Feb, 2018

1 commit

a9a08845e vfs: do bulk POLL* -> EPOLL* replacement ... Browse Code »

This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:

for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
for f in $L; do sed -i "-es/^$[^\"]*$$\$/\\1E\\2/" $f; done
done

with de-mangling cleanups yet to come.

NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do. But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.

The next patch from Al will sort out the final differences, and we
should be all done.

Scripted-by: Al Viro
Signed-off-by: Linus Torvalds

Linus Torvalds
2018-02-12 06:34:03 +0800

01 Feb, 2018

2 commits

284cd241a userfaultfd: convert to use anon_inode_getfd() ... Browse Code »

Nothing actually calls userfaultfd_file_create() besides the
userfaultfd() system call itself. So simplify things by folding it into
the system call and using anon_inode_getfd() instead of
anon_inode_getfile(). Do the same in resolve_userfault_fork() as well.

This removes over 50 lines with no change in functionality.

Link: http://lkml.kernel.org/r/20171229212403.22800-1-ebiggers3@gmail.com
Signed-off-by: Eric Biggers
Reviewed-by: Mike Rapoport
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Biggers
2018-02-01 09:18:39 +0800
a365ac09d mm, userfaultfd, THP: avoid waiting when PMD under THP migration ... Browse Code »

If THP migration is enabled, for a VMA handled by userfaultfd, consider
the following situation,

do_page_fault()
__do_huge_pmd_anonymous_page()
handle_userfault()
userfault_msg()
/* a huge page is allocated and mapped at fault address */
/* the huge page is under migration, leaves migration entry
in page table */
userfaultfd_must_wait()
/* return true because !pmd_present() */
/* may wait in loop until fatal signal */

That is, it may be possible for userfaultfd_must_wait() encounters a PMD
entry which is !pmd_none() && !pmd_present(). In the current
implementation, we will wait for such PMD entries, which may cause
unnecessary waiting, and potential soft lockup.

This is fixed via avoiding to wait when !pmd_none() && !pmd_present(),
only wait when pmd_none().

This may be not a problem in practice, because userfaultfd_must_wait()
is always called with mm->mmap_sem read-locked. mremap() will
write-lock mm->mmap_sem. And UFFDIO_COPY doesn't support to copy THP
mapping. But the change introduced still makes the code more correct,
and makes the PMD and PTE code more consistent.

Link: http://lkml.kernel.org/r/20171207011752.3292-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying"
Reviewed-by: Andrea Arcangeli
Cc: Mike Kravetz
Cc: Mike Rapoport
Cc: "Kirill A. Shutemov"
Cc: Alexander Viro
Cc: Zi Yan
Cc: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Huang Ying
2018-02-01 09:18:37 +0800

31 Jan, 2018

1 commit

168fe32a0 Merge branch 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull poll annotations from Al Viro:
"This introduces a __bitwise type for POLL### bitmap, and propagates
the annotations through the tree. Most of that stuff is as simple as
'make ->poll() instances return __poll_t and do the same to local
variables used to hold the future return value'.

Some of the obvious brainos found in process are fixed (e.g. POLLIN
misspelled as POLL_IN). At that point the amount of sparse warnings is
low and most of them are for genuine bugs - e.g. ->poll() instance
deciding to return -EINVAL instead of a bitmap. I hadn't touched those
in this series - it's large enough as it is.

Another problem it has caught was eventpoll() ABI mess; select.c and
eventpoll.c assumed that corresponding POLL### and EPOLL### were
equal. That's true for some, but not all of them - EPOLL### are
arch-independent, but POLL### are not.

The last commit in this series separates userland POLL### values from
the (now arch-independent) kernel-side ones, converting between them
in the few places where they are copied to/from userland. AFAICS, this
is the least disruptive fix preserving poll(2) ABI and making epoll()
work on all architectures.

As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
it will trigger only on what would've triggered EPOLLWRBAND on other
architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
at all on sparc. With this patch they should work consistently on all
architectures"

* 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
make kernel-side POLL... arch-independent
eventpoll: no need to mask the result of epi_item_poll() again
eventpoll: constify struct epoll_event pointers
debugging printk in sg_poll() uses %x to print POLL... bitmap
annotate poll(2) guts
9p: untangle ->poll() mess
->si_band gets POLL... bitmap stored into a user-visible long field
ring_buffer_poll_wait() return value used as return value of ->poll()
the rest of drivers/*: annotate ->poll() instances
media: annotate ->poll() instances
fs: annotate ->poll() instances
ipc, kernel, mm: annotate ->poll() instances
net: annotate ->poll() instances
apparmor: annotate ->poll() instances
tomoyo: annotate ->poll() instances
sound: annotate ->poll() instances
acpi: annotate ->poll() instances
crypto: annotate ->poll() instances
block: annotate ->poll() instances
x86: annotate ->poll() instances
...

Linus Torvalds
2018-01-31 09:58:07 +0800

05 Jan, 2018

1 commit

0cbb4b4f4 userfaultfd: clear the vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails ... Browse Code »

The previous fix in commit 384632e67e08 ("userfaultfd: non-cooperative:
fix fork use after free") corrected the refcounting in case of
UFFD_EVENT_FORK failure for the fork userfault paths.

That still didn't clear the vma->vm_userfaultfd_ctx of the vmas that
were set to point to the aborted new uffd ctx earlier in
dup_userfaultfd.

Link: http://lkml.kernel.org/r/20171223002505.593-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli
Reported-by: syzbot
Reviewed-by: Mike Rapoport
Cc: Eric Biggers
Cc: Dmitry Vyukov
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2018-01-05 08:45:09 +0800

28 Nov, 2017

1 commit

076ccb76e fs: annotate ->poll() instances ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2017-11-28 05:20:05 +0800

16 Nov, 2017

1 commit

00bb31fa4 userfaultfd: use mmgrab instead of open-coded increment of mm_count ... Browse Code »

Link: http://lkml.kernel.org/r/1508132478-7738-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport
Cc: Andrea Arcangeli
Cc: "Dr . David Alan Gilbert"
Cc: Mike Kravetz
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Rapoport
2017-11-16 10:21:05 +0800

25 Oct, 2017

1 commit

6aa7de059 locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to … ... Browse Code »

…READ_ONCE()/WRITE_ONCE()

Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.

For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.

However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:

----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()

// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch

virtual patch

@ depends on patch @
expression E1, E2;
@@

- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)

@ depends on patch @
expression E;
@@

- ACCESS_ONCE(E)
+ READ_ONCE(E)
----

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Mark Rutland
2017-10-25 17:01:08 +0800

04 Oct, 2017

1 commit

384632e67 userfaultfd: non-cooperative: fix fork use after free ... Browse Code »

When reading the event from the uffd, we put it on a temporary
fork_event list to detect if we can still access it after releasing and
retaking the event_wqh.lock.

If fork aborts and removes the event from the fork_event all is fine as
long as we're still in the userfault read context and fork_event head is
still alive.

We've to put the event allocated in the fork kernel stack, back from
fork_event list-head to the event_wqh head, before returning from
userfaultfd_ctx_read, because the fork_event head lifetime is limited to
the userfaultfd_ctx_read stack lifetime.

Forgetting to move the event back to its event_wqh place then results in
__remove_wait_queue(&ctx->event_wqh, &ewq->wq); in
userfaultfd_event_wait_completion to remove it from a head that has been
already freed from the reader stack.

This could only happen if resolve_userfault_fork failed (for example if
there are no file descriptors available to allocate the fork uffd). If
it succeeded it was put back correctly.

Furthermore, after find_userfault_evt receives a fork event, the forked
userfault context in fork_nctx and uwq->msg.arg.reserved.reserved1 can
be released by the fork thread as soon as the event_wqh.lock is
released. Taking a reference on the fork_nctx before dropping the lock
prevents an use after free in resolve_userfault_fork().

If the fork side aborted and it already released everything, we still
try to succeed resolve_userfault_fork(), if possible.

Fixes: 893e26e61d04eac9 ("userfaultfd: non-cooperative: Add fork() event")
Link: http://lkml.kernel.org/r/20170920180413.26713-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli
Reported-by: Mark Rutland
Tested-by: Mark Rutland
Cc: Pavel Emelyanov
Cc: Mike Rapoport
Cc: "Dr. David Alan Gilbert"
Cc: Mike Kravetz
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2017-10-04 08:54:25 +0800

09 Sep, 2017

1 commit

656710a60 userfaultfd: non-cooperative: closing the uffd without triggering SIGBUS ... Browse Code »

This is an enhancement to avoid a non cooperative userfaultfd manager
having to unregister all regions before it can close the uffd after all
userfaultfd activity completed.

The UFFDIO_UNREGISTER would serialize against the handle_userfault by
taking the mmap_sem for writing, but we can simply repeat the page fault
if we detect the uffd was closed and so the regular page fault paths
should takeover.

Link: http://lkml.kernel.org/r/20170823181227.19926-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli
Acked-by: Mike Rapoport
Cc: Mike Kravetz
Cc: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2017-09-09 09:26:47 +0800

07 Sep, 2017

4 commits

a36985d31 userfaultfd: provide pid in userfault msg - add feat union ... Browse Code »

No ABI change, but this will make it more explicit to software that ptid
is only available if requested by passing UFFD_FEATURE_THREAD_ID to
UFFDIO_API. The fact it's a union will also self document it shouldn't
be taken for granted there's a tpid there.

Link: http://lkml.kernel.org/r/20170802165145.22628-7-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli
Cc: "Dr. David Alan Gilbert"
Cc: Alexey Perevalov
Cc: Maxime Coquelin
Cc: Mike Kravetz
Cc: Mike Rapoport
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2017-09-07 08:27:29 +0800
9d4ac9348 userfaultfd: provide pid in userfault msg ... Browse Code »

It could be useful for calculating downtime during postcopy live
migration per vCPU. Side observer or application itself will be
informed about proper task's sleep during userfaultfd processing.

Process's thread id is being provided when user requeste it by setting
UFFD_FEATURE_THREAD_ID bit into uffdio_api.features.

Link: http://lkml.kernel.org/r/20170802165145.22628-6-aarcange@redhat.com
Signed-off-by: Alexey Perevalov
Signed-off-by: Andrea Arcangeli
Cc: "Dr. David Alan Gilbert"
Cc: Maxime Coquelin
Cc: Mike Kravetz
Cc: Mike Rapoport
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Perevalov
2017-09-07 08:27:29 +0800
2d6d6f5a0 mm: userfaultfd: add feature to request for a signal delivery ... Browse Code »

In some cases, userfaultfd mechanism should just deliver a SIGBUS signal
to the faulting process, instead of the page-fault event. Dealing with
page-fault event using a monitor thread can be an overhead in these
cases. For example applications like the database could use the
signaling mechanism for robustness purpose.

Database uses hugetlbfs for performance reason. Files on hugetlbfs
filesystem are created and huge pages allocated using fallocate() API.
Pages are deallocated/freed using fallocate() hole punching support.
These files are mmapped and accessed by many processes as shared memory.
The database keeps track of which offsets in the hugetlbfs file have
pages allocated.

Any access to mapped address over holes in the file, which can occur due
to bugs in the application, is considered invalid and expect the process
to simply receive a SIGBUS. However, currently when a hole in the file
is accessed via the mapped address, kernel/mm attempts to automatically
allocate a page at page fault time, resulting in implicitly filling the
hole in the file. This may not be the desired behavior for applications
like the database that want to explicitly manage page allocations of
hugetlbfs files.

Using userfaultfd mechanism with this support to get a signal, database
application can prevent pages from being allocated implicitly when
processes access mapped address over holes in the file.

This patch adds UFFD_FEATURE_SIGBUS feature to userfaultfd mechnism to
request for a SIGBUS signal.

See following for previous discussion about the database requirement
leading to this proposal as suggested by Andrea.

http://www.spinics.net/lists/linux-mm/msg129224.html

Link: http://lkml.kernel.org/r/1501552446-748335-2-git-send-email-prakash.sangappa@oracle.com
Signed-off-by: Prakash Sangappa
Reviewed-by: Mike Rapoport
Reviewed-by: Andrea Arcangeli
Cc: Mike Kravetz
Cc: Shuah Khan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Prakash Sangappa
2017-09-07 08:27:29 +0800
ce53e8e6f userfaultfd: report UFFDIO_ZEROPAGE as available for shmem VMAs ... Browse Code »

Now when shmem VMAs can be filled with zero page via userfaultfd we can
report that UFFDIO_ZEROPAGE is available for those VMAs

Link: http://lkml.kernel.org/r/1497939652-16528-7-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport
Cc: "Kirill A. Shutemov"
Cc: Andrea Arcangeli
Cc: Hillf Danton
Cc: Hugh Dickins
Cc: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Rapoport
2017-09-07 08:27:28 +0800

11 Aug, 2017

2 commits

040cca3ab Merge branch 'linus' into locking/core, to resolve conflicts ... Browse Code »

Conflicts:
include/linux/mm_types.h
mm/huge_memory.c

I removed the smp_mb__before_spinlock() like the following commit does:

8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")

and fixed up the affected commits.

Signed-off-by: Ingo Molnar

Ingo Molnar
2017-08-11 19:51:59 +0800
e86b298be userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage ... Browse Code »

When the process exit races with outstanding mcopy_atomic, it would be
better to return ESRCH error. When such race occurs the process and
it's mm are going away and returning "no such process" to the uffd
monitor seems better fit than ENOSPC.

Link: http://lkml.kernel.org/r/1502111545-32305-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport
Suggested-by: Michal Hocko
Acked-by: Michal Hocko
Cc: Andrea Arcangeli
Cc: "Dr. David Alan Gilbert"
Cc: Pavel Emelyanov
Cc: Mike Kravetz
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Rapoport
2017-08-11 06:54:07 +0800

10 Aug, 2017

1 commit

a9668cd6e locking: Remove smp_mb__before_spinlock() ... Browse Code »

Now that there are no users of smp_mb__before_spinlock() left, remove
it entirely.

Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Signed-off-by: Ingo Molnar

Peter Zijlstra
2017-08-10 18:29:03 +0800

03 Aug, 2017

2 commits

5a18b64e3 userfaultfd: non-cooperative: flush event_wqh at release time ... Browse Code »

There may still be threads waiting on event_wqh at the time the
userfault file descriptor is closed. Flush the events wait-queue to
prevent waiting threads from hanging.

Link: http://lkml.kernel.org/r/1501398127-30419-1-git-send-email-rppt@linux.vnet.ibm.com
Fixes: 9cd75c3cd4c3d ("userfaultfd: non-cooperative: add ability to report
non-PF events from uffd descriptor")
Signed-off-by: Mike Rapoport
Cc: Andrea Arcangeli
Cc: "Dr. David Alan Gilbert"
Cc: Pavel Emelyanov
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Rapoport
2017-08-03 08:16:13 +0800
9d95aa4ba userfaultfd_zeropage: return -ENOSPC in case mm has gone ... Browse Code »

In the non-cooperative userfaultfd case, the process exit may race with
outstanding mcopy_atomic called by the uffd monitor. Returning -ENOSPC
instead of -EINVAL when mm is already gone will allow uffd monitor to
distinguish this case from other error conditions.

Unfortunately I overlooked userfaultfd_zeropage when updating
userfaultd_copy().

Link: http://lkml.kernel.org/r/1501136819-21857-1-git-send-email-rppt@linux.vnet.ibm.com
Fixes: 96333187ab162 ("userfaultfd_copy: return -ENOSPC in case mm has gone")
Signed-off-by: Mike Rapoport
Cc: Andrea Arcangeli
Cc: "Dr. David Alan Gilbert"
Cc: Pavel Emelyanov
Cc: Michal Hocko
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Rapoport
2017-08-03 08:16:12 +0800

07 Jul, 2017

2 commits

7868a2087 mm/hugetlb: add size parameter to huge_pte_offset() ... Browse Code »

A poisoned or migrated hugepage is stored as a swap entry in the page
tables. On architectures that support hugepages consisting of
contiguous page table entries (such as on arm64) this leads to ambiguity
in determining the page table entry to return in huge_pte_offset() when
a poisoned entry is encountered.

Let's remove the ambiguity by adding a size parameter to convey
additional information about the requested address. Also fixup the
definition/usage of huge_pte_offset() throughout the tree.

Link: http://lkml.kernel.org/r/20170522133604.11392-4-punit.agrawal@arm.com
Signed-off-by: Punit Agrawal
Acked-by: Steve Capper
Cc: Catalin Marinas
Cc: Will Deacon
Cc: Tony Luck
Cc: Fenghua Yu
Cc: James Hogan (odd fixer:METAG ARCHITECTURE)
Cc: Ralf Baechle (supporter:MIPS)
Cc: "James E.J. Bottomley"
Cc: Helge Deller
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Michael Ellerman
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: Yoshinori Sato
Cc: Rich Felker
Cc: "David S. Miller"
Cc: Chris Metcalf
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Alexander Viro
Cc: Michal Hocko
Cc: Mike Kravetz
Cc: Naoya Horiguchi
Cc: "Aneesh Kumar K.V"
Cc: "Kirill A. Shutemov"
Cc: Hillf Danton
Cc: Mark Rutland
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Punit Agrawal
2017-07-07 07:24:34 +0800
f93ae3646 fs/userfaultfd.c: drop dead code ... Browse Code »

Calculation of start end end in __wake_userfault function are not used
and can be removed.

Link: http://lkml.kernel.org/r/1494930917-3134-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Rapoport
2017-07-07 07:24:33 +0800

20 Jun, 2017

2 commits

2055da973 sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming ... Browse Code »

So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.

Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.

To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:

struct wait_queue_head::task_list => ::head
struct wait_queue_entry::task_list => ::entry

For example, this code:

rqw->wait.task_list.next != &wait->task_list

... is was pretty unclear (to me) what it's doing, while now it's written this way:

rqw->wait.head.next != &wait->entry

... which makes it pretty clear that we are iterating a list until we see the head.

Other examples are:

list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
list_for_each_entry(wq, &fence->wait.task_list, task_list) {

... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:

list_for_each_entry_safe(pos, next, &x->head, entry) {
list_for_each_entry(wq, &fence->wait.head, entry) {

Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar

Ingo Molnar
2017-06-20 18:19:14 +0800
ac6424b98 sched/wait: Rename wait_queue_t => wait_queue_entry_t ... Browse Code »

Rename:

wait_queue_t => wait_queue_entry_t

'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.

Start sorting this out by renaming it to 'wait_queue_entry_t'.

This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.

Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar

Ingo Molnar
2017-06-20 18:18:27 +0800

17 Jun, 2017

1 commit

64c2b2030 userfaultfd: shmem: handle coredumping in handle_userfault() ... Browse Code »

Anon and hugetlbfs handle FOLL_DUMP set by get_dump_page() internally to
__get_user_pages().

shmem as opposed has no special FOLL_DUMP handling there so
handle_mm_fault() is invoked without mmap_sem and ends up calling
handle_userfault() that isn't expecting to be invoked without mmap_sem
held.

This makes handle_userfault() fail immediately if invoked through
shmem_vm_ops->fault during coredumping and solves the problem.

The side effect is a BUG_ON with no lock held triggered by the
coredumping process which exits. Only 4.11 is affected, pre-4.11 anon
memory holes are skipped in __get_user_pages by checking FOLL_DUMP
explicitly against empty pagetables (mm/gup.c:no_page_table()).

It's zero cost as we already had a check for current->flags to prevent
futex to trigger userfaults during exit (PF_EXITING).

Link: http://lkml.kernel.org/r/20170615214838.27429-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli
Reported-by: "Dr. David Alan Gilbert"
Cc: [4.11+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2017-06-17 05:37:05 +0800

08 Apr, 2017

1 commit

045098e94 userfaultfd: report actual registered features in fdinfo ... Browse Code »

fdinfo for userfault file descriptor reports UFFD_API_FEATURES. Up
until recently, the UFFD_API_FEATURES was defined as 0, therefore
corresponding field in fdinfo always contained zero. Now, with
introduction of several additional features, UFFD_API_FEATURES is not
longer 0 and it seems better to report actual features requested for the
userfaultfd object described by the fdinfo.

First, the applications that were using userfault will still see zero at
the features field in fdinfo. Next, reporting actual features rather
than available features, gives clear indication of what userfault
features are used by an application.

Link: http://lkml.kernel.org/r/1491140181-22121-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport
Reviewed-by: Andrea Arcangeli
Cc: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Rapoport
2017-04-08 15:47:48 +0800

11 Mar, 2017

1 commit

baeedc715 Merge branch 'prep-for-5level' ... Browse Code »

Merge 5-level page table prep from Kirill Shutemov:
"Here's relatively low-risk part of 5-level paging patchset. Merging it
now will make x86 5-level paging enabling in v4.12 easier.

The first patch is actually x86-specific: detect 5-level paging
support. It boils down to single define.

The rest of patchset converts Linux MMU abstraction from 4- to 5-level
paging.

Enabling of new abstraction in most cases requires adding single line
of code in arch-specific code. The rest is taken care by asm-generic/.

Changes to mm/ code are mostly mechanical: add support for new page
table level -- p4d_t -- where we deal with pud_t now.

v2:
- fix build on microblaze (Michal);
- comment for __ARCH_HAS_5LEVEL_HACK in kasan_populate_zero_shadow();
- acks from Michal"

* emailed patches from Kirill A Shutemov :
mm: introduce __p4d_alloc()
mm: convert generic code to 5-level paging
asm-generic: introduce
arch, mm: convert all architectures to use 5level-fixup.h
asm-generic: introduce __ARCH_USE_5LEVEL_HACK
asm-generic: introduce 5level-fixup.h
x86/cpufeature: Add 5-level paging detection

Linus Torvalds
2017-03-11 00:59:07 +0800

10 Mar, 2017

4 commits

2378cd618 userfaultfd: remove wrong comment from userfaultfd_ctx_get() ... Browse Code »

It's a void function, so there is no return value;

Link: http://lkml.kernel.org/r/20170309150817.7510-1-david@redhat.com
Signed-off-by: David Hildenbrand
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Hildenbrand
2017-03-10 09:01:10 +0800
70ccb92fd userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED ... Browse Code »

userfaultfd_remove() has to be execute before zapping the pagetables or
UFFDIO_COPY could keep filling pages after zap_page_range returned,
which would result in non zero data after a MADV_DONTNEED.

However userfaultfd_remove() may have to release the mmap_sem. This was
handled correctly in MADV_REMOVE, but MADV_DONTNEED accessed a
potentially stale vma (the very vma passed to zap_page_range(vma, ...)).

The fix consists in revalidating the vma in case userfaultfd_remove()
had to release the mmap_sem.

This also optimizes away an unnecessary down_read/up_read in the
MADV_REMOVE case if UFFD_EVENT_FORK had to be delivered.

It all remains zero runtime cost in case CONFIG_USERFAULTFD=n as
userfaultfd_remove() will be defined as "true" at build time.

Link: http://lkml.kernel.org/r/20170302173738.18994-3-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli
Acked-by: Mike Rapoport
Cc: "Dr. David Alan Gilbert"
Cc: Mike Kravetz
Cc: Pavel Emelyanov
Cc: Hillf Danton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2017-03-10 09:01:10 +0800
7eb76d457 userfaultfd: non-cooperative: fix fork fctx->new memleak ... Browse Code »

We have a memleak in the ->new ctx if the uffd of the parent is closed
before the fork event is read, nothing frees the new context.

Link: http://lkml.kernel.org/r/20170302173738.18994-2-aarcange@redhat.com
Signed-off-by: Mike Rapoport
Signed-off-by: Andrea Arcangeli
Reported-by: Andrea Arcangeli
Cc: "Dr. David Alan Gilbert"
Cc: Mike Kravetz
Cc: Pavel Emelyanov
Cc: Hillf Danton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Rapoport
2017-03-10 09:01:10 +0800
8c9e7bb7a userfaultfd: non-cooperative: release all ctx in dup_userfaultfd_complete ... Browse Code »

Don't stop running dup_fctx() even if userfaultfd_event_wait_completion
fails as it has to run userfaultfd_ctx_put on all ctx to pair against
the userfaultfd_ctx_get that was run on all fctx->orig in
dup_userfaultfd.

Link: http://lkml.kernel.org/r/20170224181957.19736-4-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli
Acked-by: Mike Rapoport
Cc: "Dr. David Alan Gilbert"
Cc: Mike Kravetz
Cc: Pavel Emelyanov
Cc: Hillf Danton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2017-03-10 09:01:09 +0800