02 Dec, 2019

3 commits

  • Merge updates from Andrew Morton:
    "Incoming:

    - a small number of updates to scripts/, ocfs2 and fs/buffer.c

    - most of MM

    I still have quite a lot of material (mostly not MM) staged after
    linux-next due to -next dependencies. I'll send those across next week
    as the prerequisites get merged up"

    * emailed patches from Andrew Morton: (135 commits)
    mm/page_io.c: annotate refault stalls from swap_readpage
    mm/Kconfig: fix trivial help text punctuation
    mm/Kconfig: fix indentation
    mm/memory_hotplug.c: remove __online_page_set_limits()
    mm: fix typos in comments when calling __SetPageUptodate()
    mm: fix struct member name in function comments
    mm/shmem.c: cast the type of unmap_start to u64
    mm: shmem: use proper gfp flags for shmem_writepage()
    mm/shmem.c: make array 'values' static const, makes object smaller
    userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK
    fs/userfaultfd.c: wp: clear VM_UFFD_MISSING or VM_UFFD_WP during userfaultfd_register()
    userfaultfd: wrap the common dst_vma check into an inlined function
    userfaultfd: remove unnecessary WARN_ON() in __mcopy_atomic_hugetlb()
    userfaultfd: use vma_pagesize for all huge page size calculation
    mm/madvise.c: use PAGE_ALIGN[ED] for range checking
    mm/madvise.c: replace with page_size() in madvise_inject_error()
    mm/mmap.c: make vma_merge() comment more easy to understand
    mm/hwpoison-inject: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops
    autonuma: reduce cache footprint when scanning page tables
    autonuma: fix watermark checking in migrate_balanced_pgdat()
    ...

    Linus Torvalds
     
  • A while ago Andy noticed
    (http://lkml.kernel.org/r/CALCETrWY+5ynDct7eU_nDUqx=okQvjm=Y5wJvA4ahBja=CQXGw@mail.gmail.com)
    that UFFD_FEATURE_EVENT_FORK used by an unprivileged user may have
    security implications.

    As the first step of the solution, the following patch limits the
    availability of UFFD_FEATURE_EVENT_FORK to callers that have
    CAP_SYS_PTRACE.

    The usage of CAP_SYS_PTRACE ensures compatibility with CRIU.

    Yet, if there are other users of non-cooperative userfaultfd that run
    without CAP_SYS_PTRACE, they would be broken :(

    Current implementation of UFFD_FEATURE_EVENT_FORK modifies the file
    descriptor table from the read() implementation of uffd, which may have
    security implications for unprivileged use of the userfaultfd.

    Limit availability of UFFD_FEATURE_EVENT_FORK only for callers that have
    CAP_SYS_PTRACE.

    Link: http://lkml.kernel.org/r/1572967777-8812-2-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrea Arcangeli
    Cc: Daniel Colascione
    Cc: Jann Horn
    Cc: Lokesh Gidra
    Cc: Nick Kralevich
    Cc: Nosh Minwalla
    Cc: Pavel Emelyanov
    Cc: Tim Murray
    Cc: Aleksa Sarai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
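
The restriction described in this commit can be modelled in user space. The following is a minimal sketch, not the kernel's code: sketch_check_api_features() is a hypothetical helper, has_cap_sys_ptrace stands in for the result of capable(CAP_SYS_PTRACE), and the feature bit is a local copy of the value assumed from <linux/userfaultfd.h>.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in for the UAPI feature bit (assumed to be bit 1). */
#define SKETCH_UFFD_FEATURE_EVENT_FORK (UINT64_C(1) << 1)

/*
 * Hypothetical helper: decide whether a UFFDIO_API request may proceed.
 * Returns 0 on success, or -EPERM when fork events are requested by a
 * caller that lacks the capability.
 */
static int sketch_check_api_features(uint64_t features,
                                     bool has_cap_sys_ptrace)
{
    if ((features & SKETCH_UFFD_FEATURE_EVENT_FORK) && !has_cap_sys_ptrace)
        return -EPERM;
    return 0;
}
```

Requests that do not ask for fork events are unaffected in this model, which matches the commit's intent of limiting only UFFD_FEATURE_EVENT_FORK.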
     
  • If the registration is repeated without VM_UFFD_MISSING or VM_UFFD_WP they
    need to be cleared. Currently setting UFFDIO_REGISTER_MODE_WP returns
    -EINVAL, so this patch is a noop until the UFFDIO_REGISTER_MODE_WP support
    is applied.

    Link: http://lkml.kernel.org/r/20191004232834.GP13922@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Wei Yang
    Reviewed-by: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
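
The flag handling described in this commit amounts to clearing both uffd bits before applying the newly requested ones. A minimal sketch, assuming illustrative flag values and a hypothetical helper name (the real definitions live in the kernel's mm headers):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag bits for this sketch only. */
#define SKETCH_VM_WRITE        UINT64_C(0x0002)
#define SKETCH_VM_UFFD_MISSING UINT64_C(0x0200)
#define SKETCH_VM_UFFD_WP      UINT64_C(0x1000)

/*
 * Hypothetical helper: compute the VMA flags after a repeated
 * registration. Stale uffd bits are dropped first, so a mode that is
 * not requested again does not survive the re-registration.
 */
static uint64_t sketch_reregister_flags(uint64_t old_vm_flags,
                                        uint64_t requested)
{
    const uint64_t uffd_mask = SKETCH_VM_UFFD_MISSING | SKETCH_VM_UFFD_WP;

    return (old_vm_flags & ~uffd_mask) | (requested & uffd_mask);
}
```

Non-uffd bits such as SKETCH_VM_WRITE pass through unchanged in this model.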
     

23 Oct, 2019

1 commit

  • The .ioctl and .compat_ioctl file operations have the same prototype so
    they can both point to the same function, which works great almost all
    the time when all the commands are compatible.

    One exception is the s390 architecture, where a compat pointer is only
    31 bits wide, and converting it into a 64-bit pointer requires calling
    compat_ptr(). Most drivers here will never run on s390, but since we now
    have a generic helper for it, it's easy enough to use it consistently.

    I double-checked all these drivers to ensure that all ioctl arguments
    are used as pointers or are ignored, but are not interpreted as integer
    values.

    Acked-by: Jason Gunthorpe
    Acked-by: Daniel Vetter
    Acked-by: Mauro Carvalho Chehab
    Acked-by: Greg Kroah-Hartman
    Acked-by: David Sterba
    Acked-by: Darren Hart (VMware)
    Acked-by: Jonathan Cameron
    Acked-by: Bjorn Andersson
    Acked-by: Dan Williams
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
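
The pointer conversion mentioned in this commit can be illustrated in user space. This is a sketch under stated assumptions, not the kernel's compat_ptr(): both function names are hypothetical, and the 0x7fffffff mask models the 31-bit compat address space on s390.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the s390 case: only the low 31 bits form the address. */
static uint64_t sketch_s390_compat_ptr(uint32_t uptr)
{
    return (uint64_t)(uptr & UINT32_C(0x7fffffff));
}

/* Model of most other architectures: plain zero extension. */
static uint64_t sketch_generic_compat_ptr(uint32_t uptr)
{
    return (uint64_t)uptr;
}
```

The two models differ only when bit 31 is set, which is why skipping the conversion goes unnoticed on most architectures.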
     

26 Sep, 2019

1 commit

  • This patch is a part of a series that extends the kernel ABI to allow
    passing tagged user pointers (with the top byte set to something other
    than 0x00) as syscall arguments.

    The userfaultfd code uses the provided user pointers for vma lookups,
    which can only be done with untagged pointers.

    Untag user pointers in validate_range().

    Link: http://lkml.kernel.org/r/cdc59ddd7011012ca2e689bc88c3b65b1ea7e413.1563904656.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Mike Rapoport
    Reviewed-by: Vincenzo Frascino
    Reviewed-by: Catalin Marinas
    Reviewed-by: Kees Cook
    Cc: Al Viro
    Cc: Dave Hansen
    Cc: Eric Auger
    Cc: Felix Kuehling
    Cc: Jens Wiklander
    Cc: Khalid Aziz
    Cc: Mauro Carvalho Chehab
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
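
As a rough illustration of what untagging means here, the sketch below clears the top byte of a 64-bit user address. It is a simplified model with a hypothetical function name; the kernel's own helper is architecture specific and, on arm64, is understood to use sign extension so that kernel addresses are left intact.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model: drop the tag stored in bits 63..56 of a user
 * address so the result can be used for a vma lookup.
 */
static uint64_t sketch_untag_user_addr(uint64_t addr)
{
    return addr & ~(UINT64_C(0xff) << 56);
}
```

An address whose top byte is already 0x00 is returned unchanged.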
     

25 Aug, 2019

1 commit

  • userfaultfd_release() should clear vm_flags/vm_userfaultfd_ctx even if
    mm->core_state != NULL.

    Otherwise a page fault can see userfaultfd_missing() == T and use an
    already freed userfaultfd_ctx.

    Link: http://lkml.kernel.org/r/20190820160237.GB4983@redhat.com
    Fixes: 04f5866e41fb ("coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping")
    Signed-off-by: Oleg Nesterov
    Reported-by: Kefeng Wang
    Reviewed-by: Andrea Arcangeli
    Tested-by: Kefeng Wang
    Cc: Peter Xu
    Cc: Mike Rapoport
    Cc: Jann Horn
    Cc: Jason Gunthorpe
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

05 Jul, 2019

1 commit

  • When IOCB_CMD_POLL is used on a userfaultfd, aio_poll() disables IRQs
    and takes kioctx::ctx_lock, then userfaultfd_ctx::fd_wqh.lock.

    This may have to wait for userfaultfd_ctx::fd_wqh.lock to be released by
    userfaultfd_ctx_read(), which in turn can be waiting for
    userfaultfd_ctx::fault_pending_wqh.lock or
    userfaultfd_ctx::event_wqh.lock.

    But elsewhere the fault_pending_wqh and event_wqh locks are taken with
    IRQs enabled. Since the IRQ handler may take kioctx::ctx_lock, lockdep
    reports that a deadlock is possible.

    Fix it by always disabling IRQs when taking the fault_pending_wqh and
    event_wqh locks.

    Commit ae62c16e105a ("userfaultfd: disable irqs when taking the
    waitqueue lock") didn't fix this because it only accounted for the
    fd_wqh lock, not the other locks nested inside it.

    Link: http://lkml.kernel.org/r/20190627075004.21259-1-ebiggers@kernel.org
    Fixes: bfe4037e722e ("aio: implement IOCB_CMD_POLL")
    Signed-off-by: Eric Biggers
    Reported-by: syzbot+fab6de82892b6b9c6191@syzkaller.appspotmail.com
    Reported-by: syzbot+53c0b767f7ca0dc0c451@syzkaller.appspotmail.com
    Reported-by: syzbot+a3accb352f9c22041cfa@syzkaller.appspotmail.com
    Reviewed-by: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Andrea Arcangeli
    Cc: [4.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
     

19 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this work is licensed under the terms of the gnu gpl version 2 see
    the copying file in the top level directory

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 35 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Kate Stewart
    Reviewed-by: Enrico Weigelt
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081206.797835076@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

1 commit

  • Userfaultfd can be misused to make it easier to exploit existing
    use-after-free (and similar) bugs that might otherwise only make a
    short window or race condition available. By using userfaultfd to
    stall a kernel thread, a malicious program can keep some state that it
    wrote, stable for an extended period, which it can then access using an
    existing exploit. While it doesn't cause the exploit itself, and while
    it's not the only thing that can stall a kernel thread when accessing a
    memory location, it's one of the few that never needs privilege.

    We can add a flag, allowing userfaultfd to be restricted, so that in
    general it won't be useable by arbitrary user programs, but in
    environments that require userfaultfd it can be turned back on.

    Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
    whether userfaultfd is allowed by unprivileged users. When this is
    set to zero, only privileged users (root user, or users with the
    CAP_SYS_PTRACE capability) will be able to use the userfaultfd
    syscalls.

    Andrea said:

    : The only difference between the bpf sysctl and the userfaultfd sysctl
    : this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
    : requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
    : because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
    : already if it's doing other kind of tracking on processes runtime, in
    : addition of userfaultfd. In other words both syscalls works only for
    : root, when the two sysctl are opt-in set to 1.

    [dgilbert@redhat.com: changelog additions]
    [akpm@linux-foundation.org: documentation tweak, per Mike]
    Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com
    Signed-off-by: Peter Xu
    Suggested-by: Andrea Arcangeli
    Suggested-by: Mike Rapoport
    Reviewed-by: Mike Rapoport
    Reviewed-by: Andrea Arcangeli
    Cc: Paolo Bonzini
    Cc: Hugh Dickins
    Cc: Luis Chamberlain
    Cc: Maxime Coquelin
    Cc: Maya Gokhale
    Cc: Jerome Glisse
    Cc: Pavel Emelyanov
    Cc: Johannes Weiner
    Cc: Martin Cracauer
    Cc: Denis Plotnikov
    Cc: Marty McFadden
    Cc: Mike Kravetz
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: "Kirill A . Shutemov"
    Cc: "Dr . David Alan Gilbert"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Xu
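
The policy introduced by this sysctl can be summarised as a small decision function. This is a sketch, not the kernel's implementation: the helper name is hypothetical, the first parameter models the value of vm.unprivileged_userfaultfd, and has_cap_sys_ptrace models capable(CAP_SYS_PTRACE).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: may this caller create a userfaultfd?
 * With the knob set to zero, only callers holding the capability are
 * allowed; with a non-zero knob, everyone is.
 */
static int sketch_userfaultfd_allowed(int sysctl_unprivileged_userfaultfd,
                                      bool has_cap_sys_ptrace)
{
    if (!sysctl_unprivileged_userfaultfd && !has_cap_sys_ptrace)
        return -EPERM;
    return 0;
}
```

On a running system the knob would typically be inspected or changed through the sysctl interface, for example under /proc/sys/vm/.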
     

20 Apr, 2019

1 commit

  • The core dumping code has always run without holding the mmap_sem for
    writing, even though that is the only way to ensure that the entire vma
    layout will not change from under it. Only using some signal
    serialization on the processes belonging to the mm is not nearly enough.
    This was pointed out earlier. For example in Hugh's post from Jul 2017:

    https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils

    "Not strictly relevant here, but a related note: I was very surprised
    to discover, only quite recently, how handle_mm_fault() may be called
    without down_read(mmap_sem) - when core dumping. That seems a
    misguided optimization to me, which would also be nice to correct"

    In particular, because growsdown and growsup can move
    vm_start/vm_end, the various loops the core dump does around the vma
    will not be consistent if page faults can happen concurrently.

    Pretty much all users calling mmget_not_zero()/get_task_mm() and then
    taking the mmap_sem had the potential to introduce unexpected side
    effects in the core dumping code.

    Adding mmap_sem for writing around the ->core_dump invocation is a
    viable long term fix, but it requires removing all copy user and page
    faults and replacing them with get_dump_page() for all binary formats,
    which is not suitable as a short term fix.

    For the time being this solution manually covers the places that can
    confuse the core dump either by altering the vma layout or the vma flags
    while it runs. Once ->core_dump runs under mmap_sem for writing the
    function mmget_still_valid() can be dropped.

    Allowing mmap_sem protected sections to run in parallel with the
    coredump provides some minor parallelism advantage to the swapoff code
    (which seems to be safe enough by never mangling any vma field and can
    keep doing swapins in parallel to the core dumping) and to some other
    corner case.

    In order to facilitate the backporting I added "Fixes: 86039bd3b4e6"
    however the side effect of this same race condition in /proc/pid/mem
    should be reproducible since before 2.6.12-rc2 so I couldn't add any
    other "Fixes:" because there's no hash beyond the git genesis commit.

    Because find_extend_vma() is the only location outside of the process
    context that could modify the "mm" structures under mmap_sem for
    reading, by adding the mmget_still_valid() check to it, all other cases
    that take the mmap_sem for reading don't need the new check after
    mmget_not_zero()/get_task_mm(). The expand_stack() in page fault
    context also doesn't need the new check, because all tasks under core
    dumping are frozen.

    Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
    Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")
    Signed-off-by: Andrea Arcangeli
    Reported-by: Jann Horn
    Suggested-by: Oleg Nesterov
    Acked-by: Peter Xu
    Reviewed-by: Mike Rapoport
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Jann Horn
    Acked-by: Jason Gunthorpe
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

29 Dec, 2018

2 commits

  • When the process being tracked does mremap() without
    UFFD_FEATURE_EVENT_REMAP on the corresponding tracking uffd file handle,
    we should not generate the remap event, and at the same time we should
    clear all the uffd flags on the new VMA. Without this patch, we can still
    have the VM_UFFD_MISSING|VM_UFFD_WP flags on the new VMA even though the
    fault handling process does not know of the existence of the VMA.

    Link: http://lkml.kernel.org/r/20181211053409.20317-1-peterx@redhat.com
    Signed-off-by: Peter Xu
    Reviewed-by: Andrea Arcangeli
    Acked-by: Mike Rapoport
    Reviewed-by: William Kucharski
    Cc: Andrea Arcangeli
    Cc: Mike Rapoport
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Pavel Emelyanov
    Cc: Pravin Shedge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Reference counters should use refcount_t rather than atomic_t, since the
    refcount_t implementation can prevent overflows, reducing the
    exploitability of reference leak bugs. userfaultfd_ctx::refcount is a
    reference counter with the usual semantics, so convert it to refcount_t.

    Note: I replaced the BUG() on incrementing a 0 refcount with just
    refcount_inc(), since part of the semantics of refcount_t is that
    incrementing a 0 refcount is not allowed; with CONFIG_REFCOUNT_FULL,
    refcount_inc() already checks for it and warns.

    Link: http://lkml.kernel.org/r/20181115003916.63381-1-ebiggers@kernel.org
    Signed-off-by: Eric Biggers
    Reviewed-by: Andrew Morton
    Cc: Andrea Arcangeli
    Reviewed-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
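
To show why a dedicated reference-count type helps, here is a single-threaded toy model of the two properties this commit relies on: an increment from zero is refused, and the counter saturates instead of wrapping. The names are hypothetical and the model uses no atomics, so it is only an illustration of the semantics, not of refcount_t itself.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

struct sketch_refcount {
    unsigned int refs;
};

/*
 * Toy increment: returns false when asked to resurrect an object whose
 * count already dropped to zero (a likely use-after-free). A counter at
 * UINT_MAX stays there rather than overflowing to zero.
 */
static bool sketch_refcount_inc(struct sketch_refcount *r)
{
    if (r->refs == 0)
        return false;
    if (r->refs == UINT_MAX)
        return true;    /* saturated: leak the object, never wrap */
    r->refs++;
    return true;
}
```

Saturation turns a potential overflow into a memory leak, which is far harder to exploit than a counter that wraps and frees a live object.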
     

27 Dec, 2018

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The biggest RCU changes in this cycle were:

    - Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.

    - Replace calls of RCU-bh and RCU-sched update-side functions to
    their vanilla RCU counterparts. This series is a step towards
    complete removal of the RCU-bh and RCU-sched update-side functions.

    ( Note that some of these conversions are going upstream via their
    respective maintainers. )

    - Documentation updates, including a number of flavor-consolidation
    updates from Joel Fernandes.

    - Miscellaneous fixes.

    - Automate generation of the initrd filesystem used for rcutorture
    testing.

    - Convert spin_is_locked() assertions to instead use lockdep.

    ( Note that some of these conversions are going upstream via their
    respective maintainers. )

    - SRCU updates, especially including a fix from Dennis Krein for a
    bag-on-head-class bug.

    - RCU torture-test updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
    rcutorture: Don't do busted forward-progress testing
    rcutorture: Use 100ms buckets for forward-progress callback histograms
    rcutorture: Recover from OOM during forward-progress tests
    rcutorture: Print forward-progress test age upon failure
    rcutorture: Print time since GP end upon forward-progress failure
    rcutorture: Print histogram of CB invocation at OOM time
    rcutorture: Print GP age upon forward-progress failure
    rcu: Print per-CPU callback counts for forward-progress failures
    rcu: Account for nocb-CPU callback counts in RCU CPU stall warnings
    rcutorture: Dump grace-period diagnostics upon forward-progress OOM
    rcutorture: Prepare for asynchronous access to rcu_fwd_startat
    torture: Remove unnecessary "ret" variables
    rcutorture: Affinity forward-progress test to avoid housekeeping CPUs
    rcutorture: Break up too-long rcu_torture_fwd_prog() function
    rcutorture: Remove cbflood facility
    torture: Bring any extra CPUs online during kernel startup
    rcutorture: Add call_rcu() flooding forward-progress tests
    rcutorture/formal: Replace synchronize_sched() with synchronize_rcu()
    tools/kernel.h: Replace synchronize_sched() with synchronize_rcu()
    net/decnet: Replace rcu_barrier_bh() with rcu_barrier()
    ...

    Linus Torvalds
     

15 Dec, 2018

1 commit

  • Calling UFFDIO_UNREGISTER on virtual ranges not yet registered in uffd
    could trigger a harmless false positive WARN_ON. Check the vma is
    already registered before checking VM_MAYWRITE to shut off the false
    positive warning.

    Link: http://lkml.kernel.org/r/20181206212028.18726-2-aarcange@redhat.com
    Fixes: 29ec90660d68 ("userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas")
    Signed-off-by: Andrea Arcangeli
    Reported-by: syzbot+06c7092e7d71218a2c16@syzkaller.appspotmail.com
    Acked-by: Mike Rapoport
    Acked-by: Hugh Dickins
    Acked-by: Peter Xu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

04 Dec, 2018

1 commit

  • …k/linux-rcu into core/rcu

    Pull RCU changes from Paul E. McKenney:

    - Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.

    - Replace calls of RCU-bh and RCU-sched update-side functions
    to their vanilla RCU counterparts. This series is a step
    towards complete removal of the RCU-bh and RCU-sched update-side
    functions.

    ( Note that some of these conversions are going upstream via their
    respective maintainers. )

    - Documentation updates, including a number of flavor-consolidation
    updates from Joel Fernandes.

    - Miscellaneous fixes.

    - Automate generation of the initrd filesystem used for
    rcutorture testing.

    - Convert spin_is_locked() assertions to instead use lockdep.

    ( Note that some of these conversions are going upstream via their
    respective maintainers. )

    - SRCU updates, especially including a fix from Dennis Krein
    for a bag-on-head-class bug.

    - RCU torture-test updates.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

01 Dec, 2018

1 commit

  • After the VMA to register the uffd onto is found, check that it has
    VM_MAYWRITE set before allowing registration. This way we inherit all
    common code checks before allowing file holes in shmem and hugetlbfs
    to be filled with UFFDIO_COPY.

    The userfaultfd memory model is not applicable to read-only files
    unless the mapping is MAP_PRIVATE.

    Link: http://lkml.kernel.org/r/20181126173452.26955-4-aarcange@redhat.com
    Fixes: ff62a3421044 ("hugetlb: implement memfd sealing")
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Mike Rapoport
    Reviewed-by: Hugh Dickins
    Reported-by: Jann Horn
    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Peter Xu
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
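
The registration check added by this commit can be pictured as a simple flag test. The sketch below is a user-space model with a hypothetical helper name and an illustrative flag value; the -EPERM return value is an assumption about how the refusal is reported.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative flag bit for this sketch only. */
#define SKETCH_VM_MAYWRITE UINT64_C(0x20)

/*
 * Hypothetical helper: refuse to register a VMA that could never be
 * made writable, such as a shared mapping of a read-only file.
 */
static int sketch_register_allowed(uint64_t vm_flags)
{
    if (!(vm_flags & SKETCH_VM_MAYWRITE))
        return -EPERM;
    return 0;
}
```

A MAP_PRIVATE mapping of a read-only file would normally still carry the may-write bit, so it passes this model's check.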
     

13 Nov, 2018

1 commit

  • lockdep_assert_held() is better suited to checking locking requirements,
    since it only checks if the current thread holds the lock regardless of
    whether someone else does. This is also a step towards possibly removing
    spin_is_locked().

    Signed-off-by: Lance Roy
    Cc: Alexander Viro
    Signed-off-by: Paul E. McKenney

    Lance Roy
     

27 Oct, 2018

1 commit

  • userfaultfd contains home-grown locking of the waitqueue lock, and does
    not disable interrupts. This relies on the fact that no one else takes it
    from interrupt context and violates an invariant of the normal waitqueue
    locking scheme. With aio poll it is easy to trigger other locks that
    disable interrupts (or are called from interrupt context).

    Link: http://lkml.kernel.org/r/20181018154101.18750-1-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Andrew Morton
    Cc: [4.19.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

24 Aug, 2018

1 commit

  • Use new return type vm_fault_t for fault handler. For now, this is just
    documenting that the function returns a VM_FAULT value rather than an
    errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    The aim is to change the return type of finish_fault() and
    handle_mm_fault() to vm_fault_t. As part of that cleanup, the return
    types of all other recursively called functions have been changed to
    vm_fault_t.

    The places from where handle_mm_fault() is invoked will be changed to
    vm_fault_t as well, but in a separate patch.

    vmf_error() is the inline function newly introduced in 4.17-rc6.

    [akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
    Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
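
The distinction between an errno and a VM_FAULT value, and the role of a helper like vmf_error(), can be sketched as follows. This is a user-space model: the type, constants and function carry a sketch_ prefix because they are local stand-ins, and the mapping shown (out-of-memory versus everything else) is an assumption about the helper's behaviour.

```c
#include <assert.h>
#include <errno.h>

/* Local stand-ins; the kernel defines its own type and codes. */
typedef unsigned int sketch_vm_fault_t;
#define SKETCH_VM_FAULT_OOM    0x0001u
#define SKETCH_VM_FAULT_SIGBUS 0x0002u

/*
 * Translate a negative errno into a fault code, so a fault handler
 * never returns an errno where a VM_FAULT value is expected.
 */
static sketch_vm_fault_t sketch_vmf_error(int err)
{
    if (err == -ENOMEM)
        return SKETCH_VM_FAULT_OOM;
    return SKETCH_VM_FAULT_SIGBUS;
}
```

Giving the return value its own type lets static checkers flag handlers that mix the two kinds of codes.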
     

23 Aug, 2018

1 commit

  • The userfaultfd code currently uses the unlocked waitqueue helpers for
    managing fault_wqh, but instead of holding the waitqueue lock for this
    waitqueue around these calls, it holds the waitqueue lock of
    fault_pending_wqh, which is a different waitqueue instance. Given that
    the waitqueue is not exposed to the rest of the kernel this actually
    works ok at the moment, but prevents the userfaultfd locking rules from
    being enforced using lockdep.

    Switch to the internally locked waitqueue helpers instead. This means
    that the lock inside fault_wqh now nests inside the fault_pending_wqh
    lock, but that's not a problem since it was entirely unused before.

    [hch@lst.de: slight changelog updates]
    [rppt@linux.vnet.ibm.com: spotted changelog spellos]
    Link: http://lkml.kernel.org/r/20171214152344.6880-3-hch@lst.de
    Signed-off-by: Matthew Wilcox
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mike Rapoport
    Cc: Al Viro
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Jason Baron
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

18 Aug, 2018

1 commit

  • Pointer uwq is being assigned but is never used hence it is redundant
    and can be removed.

    Cleans up clang warning:
    warning: variable 'uwq' set but not used [-Wunused-but-set-variable]

    Link: http://lkml.kernel.org/r/20180717090802.18357-1-colin.king@canonical.com
    Signed-off-by: Colin Ian King
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     

03 Aug, 2018

1 commit

  • The fix in commit 0cbb4b4f4c44 ("userfaultfd: clear the
    vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails") cleared the
    vma->vm_userfaultfd_ctx but kept userfaultfd flags in vma->vm_flags
    that were copied from the parent process VMA.

    As a result, there is an inconsistency between the values of
    vma->vm_userfaultfd_ctx.ctx and vma->vm_flags which triggers BUG_ON
    in userfaultfd_release().

    Clearing the uffd flags from vma->vm_flags in case of UFFD_EVENT_FORK
    failure resolves the issue.

    Link: http://lkml.kernel.org/r/1532931975-25473-1-git-send-email-rppt@linux.vnet.ibm.com
    Fixes: 0cbb4b4f4c44 ("userfaultfd: clear the vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails")
    Signed-off-by: Mike Rapoport
    Reported-by: syzbot+121be635a7a35ddb7dcb@syzkaller.appspotmail.com
    Cc: Andrea Arcangeli
    Cc: Eric Biggers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

04 Jul, 2018

1 commit

  • Use huge_ptep_get() to translate huge ptes to normal ptes so we can
    check them with the huge_pte_* functions. Otherwise some architectures
    will check the wrong values and will not wait for userspace to bring in
    the memory.

    Link: http://lkml.kernel.org/r/20180626132421.78084-1-frankja@linux.ibm.com
    Fixes: 369cd2121be4 ("userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd ranges")
    Signed-off-by: Janosch Frank
    Reviewed-by: David Hildenbrand
    Reviewed-by: Mike Kravetz
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Janosch Frank
     

08 Jun, 2018

1 commit

  • If a process monitored with userfaultfd changes its memory mappings or
    calls fork() at the same time as the uffd monitor fills the process
    memory with UFFDIO_COPY, the actual creation of page table entries and
    copying of the data in mcopy_atomic may happen either before or after
    the memory mapping modifications, and there is no way for the uffd
    monitor to maintain a consistent view of the process memory layout.

    For instance, let's consider fork() running in parallel with
    userfaultfd_copy():

    process                          | uffd monitor
    ---------------------------------+------------------------------
    fork()                           | userfaultfd_copy()
    ...                              | ...
    dup_mmap()                       | down_read(mmap_sem)
    down_write(mmap_sem)             | /* create PTEs, copy data */
    dup_uffd()                       | up_read(mmap_sem)
    copy_page_range()                |
    up_write(mmap_sem)               |
    dup_uffd_complete()              |
    /* notify monitor */             |

    If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will
    be present by the time copy_page_range() is called and they will appear
    in the child's memory mappings. However, if the fork() is the first to
    take the mmap_sem, the new pages won't be mapped in the child's address
    space.

    If the pages are not present and the child tries to access them, the
    monitor will get a page fault notification and everything is fine.
    However, if the pages *are present*, the child can access them without
    uffd noticing. And if we copy them into the child it'll see the wrong
    data. Since we are talking about background copy, we'd need to decide
    whether the pages should be copied or not regardless of #PF
    notifications.

    Since the userfaultfd monitor has no way to determine the order,
    let's disallow userfaultfd_copy in parallel with the non-cooperative
    events. In such a case we return -EAGAIN and the uffd monitor can
    understand that userfaultfd_copy() clashed with a non-cooperative event
    and take an appropriate action.

    Link: http://lkml.kernel.org/r/1527061324-19949-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Pavel Emelyanov
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Andrei Vagin
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Mike Rapoport
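
The behaviour this commit introduces can be modelled with a single flag that is set while a non-cooperative event is in flight. The sketch below is a user-space illustration with hypothetical names; it only shows the return-value contract (-EAGAIN on a clash, otherwise the number of bytes handled), not the actual copy.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy context: true while fork(), mremap() or similar is in progress. */
struct sketch_uffd_ctx {
    bool mmap_changing;
};

/*
 * Hypothetical copy entry point: refuse to race with a layout change
 * so the monitor can first consume the pending event and then retry.
 */
static long sketch_uffdio_copy(const struct sketch_uffd_ctx *ctx, long len)
{
    if (ctx->mmap_changing)
        return -EAGAIN;
    return len;    /* pretend all requested bytes were copied */
}
```

A monitor built on this contract would typically read and handle the queued event when it sees -EAGAIN, then decide whether the copy is still wanted.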
     

12 Feb, 2018

1 commit

  • This is the mindless scripted replacement of kernel use of POLL*
    variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

    with de-mangling cleanups yet to come.

    NOTE! On almost all architectures, the EPOLL* constants have the same
    values as the POLL* constants do. But the keyword here is "almost".
    For various bad reasons they aren't the same, and epoll() doesn't
    actually work quite correctly in some cases due to this on Sparc et al.

    The next patch from Al will sort out the final differences, and we
    should be all done.

    Scripted-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Feb, 2018

2 commits

  • Nothing actually calls userfaultfd_file_create() besides the
    userfaultfd() system call itself. So simplify things by folding it into
    the system call and using anon_inode_getfd() instead of
    anon_inode_getfile(). Do the same in resolve_userfault_fork() as well.

    This removes over 50 lines with no change in functionality.

    Link: http://lkml.kernel.org/r/20171229212403.22800-1-ebiggers3@gmail.com
    Signed-off-by: Eric Biggers
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
     
  • If THP migration is enabled, for a VMA handled by userfaultfd, consider
    the following situation,

    do_page_fault()
      __do_huge_pmd_anonymous_page()
        handle_userfault()
          userfault_msg()
            /* a huge page is allocated and mapped at fault address */
            /* the huge page is under migration, leaves migration entry
               in page table */
          userfaultfd_must_wait()
            /* return true because !pmd_present() */
          /* may wait in loop until fatal signal */

    That is, it may be possible for userfaultfd_must_wait() to encounter a
    PMD entry which is !pmd_none() && !pmd_present(). In the current
    implementation, we will wait for such PMD entries, which may cause
    unnecessary waiting and a potential soft lockup.

    This is fixed by not waiting when !pmd_none() && !pmd_present(), and
    waiting only when pmd_none().

    This may not be a problem in practice, because userfaultfd_must_wait()
    is always called with mm->mmap_sem read-locked. mremap() will
    write-lock mm->mmap_sem. And UFFDIO_COPY doesn't support copying THP
    mappings. But the change introduced still makes the code more correct,
    and makes the PMD and PTE code more consistent.

    Link: http://lkml.kernel.org/r/20171207011752.3292-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: "Kirill A. Shutemov"
    Cc: Alexander Viro
    Cc: Zi Yan
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
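
The decision described in this commit can be written as a three-way case on the PMD state. This is a simplified model with hypothetical names: it ignores huge-page PMDs and reduces the PTE-level check to a single boolean.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_pmd_state {
    SKETCH_PMD_NONE,        /* nothing mapped yet */
    SKETCH_PMD_NOT_PRESENT, /* e.g. a migration entry */
    SKETCH_PMD_PRESENT      /* points to a page table */
};

/*
 * Should the faulting thread sleep until the monitor resolves the
 * fault? Only a truly empty entry requires waiting; a non-present
 * entry means something is already mapped there and the fault can
 * simply be retried.
 */
static bool sketch_must_wait(enum sketch_pmd_state pmd, bool pte_is_none)
{
    if (pmd == SKETCH_PMD_NONE)
        return true;
    if (pmd == SKETCH_PMD_NOT_PRESENT)
        return false;
    return pte_is_none;
}
```

Before the fix, the middle case would have returned true in this model, which is the unnecessary wait the commit message describes.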
     

31 Jan, 2018

1 commit

  • Pull poll annotations from Al Viro:
    "This introduces a __bitwise type for POLL### bitmap, and propagates
    the annotations through the tree. Most of that stuff is as simple as
    'make ->poll() instances return __poll_t and do the same to local
    variables used to hold the future return value'.

    Some of the obvious brainos found in the process are fixed (e.g. POLLIN
    misspelled as POLL_IN). At that point the number of sparse warnings is
    low and most of them are for genuine bugs - e.g. ->poll() instance
    deciding to return -EINVAL instead of a bitmap. I hadn't touched those
    in this series - it's large enough as it is.

    Another problem it has caught was the eventpoll() ABI mess; select.c and
    eventpoll.c assumed that corresponding POLL### and EPOLL### were
    equal. That's true for some, but not all of them - EPOLL### are
    arch-independent, but POLL### are not.

    The last commit in this series separates userland POLL### values from
    the (now arch-independent) kernel-side ones, converting between them
    in the few places where they are copied to/from userland. AFAICS, this
    is the least disruptive fix preserving poll(2) ABI and making epoll()
    work on all architectures.

    As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
    it will trigger only on what would've triggered EPOLLWRBAND on other
    architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
    at all on sparc. With this patch they should work consistently on all
    architectures"

    * 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
    make kernel-side POLL... arch-independent
    eventpoll: no need to mask the result of epi_item_poll() again
    eventpoll: constify struct epoll_event pointers
    debugging printk in sg_poll() uses %x to print POLL... bitmap
    annotate poll(2) guts
    9p: untangle ->poll() mess
    ->si_band gets POLL... bitmap stored into a user-visible long field
    ring_buffer_poll_wait() return value used as return value of ->poll()
    the rest of drivers/*: annotate ->poll() instances
    media: annotate ->poll() instances
    fs: annotate ->poll() instances
    ipc, kernel, mm: annotate ->poll() instances
    net: annotate ->poll() instances
    apparmor: annotate ->poll() instances
    tomoyo: annotate ->poll() instances
    sound: annotate ->poll() instances
    acpi: annotate ->poll() instances
    crypto: annotate ->poll() instances
    block: annotate ->poll() instances
    x86: annotate ->poll() instances
    ...

    Linus Torvalds
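
    One class of bug mentioned in this pull request, a ->poll() instance
    returning a negative errno instead of an event bitmap, can be sketched
    in user space as follows. The names demo_poll_t, struct demo_dev and
    demo_poll() are hypothetical; demo_poll_t merely stands in for the
    kernel's __poll_t, and the POLL* constants come from the user-space
    <poll.h>.

```c
#include <assert.h>
#include <poll.h>

typedef unsigned int demo_poll_t;  /* hypothetical stand-in for __poll_t */

struct demo_dev {                  /* hypothetical device state */
    int broken;
    int readable;
    int writable;
};

/* A poll-style handler always returns a bitmap of events, never an errno. */
static demo_poll_t demo_poll(const struct demo_dev *dev)
{
    demo_poll_t mask = 0;

    assert(dev);
    if (dev->broken)
        return POLLERR;            /* the buggy variant would be: return -EINVAL; */
    if (dev->readable)
        mask |= POLLIN;
    if (dev->writable)
        mask |= POLLOUT;
    return mask;
}
```

    With a distinct bitmap type, a checker such as sparse can flag the
    buggy variant because -EINVAL is not a value of that type.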
     

05 Jan, 2018

1 commit

  • The previous fix in commit 384632e67e08 ("userfaultfd: non-cooperative:
    fix fork use after free") corrected the refcounting in case of
    UFFD_EVENT_FORK failure for the fork userfault paths.

    That still didn't clear the vma->vm_userfaultfd_ctx of the vmas that
    were set to point to the aborted new uffd ctx earlier in
    dup_userfaultfd.

    Link: http://lkml.kernel.org/r/20171223002505.593-2-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: syzbot
    Reviewed-by: Mike Rapoport
    Cc: Eric Biggers
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

28 Nov, 2017

1 commit


16 Nov, 2017

1 commit


25 Oct, 2017

1 commit

  • …READ_ONCE()/WRITE_ONCE()

    Please do not apply this to mainline directly, instead please re-run the
    coccinelle script shown below and apply its output.

    For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
    preference to ACCESS_ONCE(), and new code is expected to use one of the
    former. So far, there's been no reason to change most existing uses of
    ACCESS_ONCE(), as these aren't harmful, and changing them results in
    churn.

    However, for some features, the read/write distinction is critical to
    correct operation. To distinguish these cases, separate read/write
    accessors must be used. This patch migrates (most) remaining
    ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
    coccinelle script:

    ----
    // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
    // WRITE_ONCE()

    // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch

    virtual patch

    @ depends on patch @
    expression E1, E2;
    @@

    - ACCESS_ONCE(E1) = E2
    + WRITE_ONCE(E1, E2)

    @ depends on patch @
    expression E;
    @@

    - ACCESS_ONCE(E)
    + READ_ONCE(E)
    ----

    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: davem@davemloft.net
    Cc: linux-arch@vger.kernel.org
    Cc: mpe@ellerman.id.au
    Cc: shuah@kernel.org
    Cc: snitzer@redhat.com
    Cc: thor.thayer@linux.intel.com
    Cc: tj@kernel.org
    Cc: viro@zeniv.linux.org.uk
    Cc: will.deacon@arm.com
    Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Mark Rutland
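
    The effect of the two coccinelle rules can be illustrated with a small
    user-space sketch. The macros below are simplified stand-ins built on
    the GCC/Clang __typeof__ extension, not the kernel's definitions, and
    struct demo_ctx with its helpers is hypothetical.

```c
#include <assert.h>

/* Simplified user-space stand-ins; the kernel's real macros are more elaborate. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct demo_ctx {
    int released;  /* hypothetical flag shared between threads */
};

static void demo_mark_released(struct demo_ctx *ctx)
{
    assert(ctx);
    /* Before the conversion: ACCESS_ONCE(ctx->released) = 1; */
    WRITE_ONCE(ctx->released, 1);
}

static int demo_is_released(struct demo_ctx *ctx)
{
    assert(ctx);
    /* Before the conversion: return ACCESS_ONCE(ctx->released); */
    return READ_ONCE(ctx->released);
}
```

    The first rule matches assignments through ACCESS_ONCE() and the
    second matches every remaining use, which is why the store and the
    load end up with different accessors.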
     

04 Oct, 2017

1 commit

  • When reading the event from the uffd, we put it on a temporary
    fork_event list to detect if we can still access it after releasing and
    retaking the event_wqh.lock.

    If fork aborts and removes the event from the fork_event all is fine as
    long as we're still in the userfault read context and fork_event head is
    still alive.

    We have to move the event, which is allocated on the fork kernel stack,
    back from the fork_event list head to the event_wqh head before
    returning from userfaultfd_ctx_read, because the lifetime of the
    fork_event head is limited to the userfaultfd_ctx_read stack lifetime.

    If the event is not moved back to event_wqh, the call
    __remove_wait_queue(&ctx->event_wqh, &ewq->wq); in
    userfaultfd_event_wait_completion removes it from a head that has
    already been freed together with the reader stack.

    This could only happen if resolve_userfault_fork failed (for example if
    there are no file descriptors available to allocate the fork uffd). If
    it succeeded it was put back correctly.

    Furthermore, after find_userfault_evt receives a fork event, the forked
    userfault context in fork_nctx and uwq->msg.arg.reserved.reserved1 can
    be released by the fork thread as soon as the event_wqh.lock is
    released. Taking a reference on the fork_nctx before dropping the lock
    prevents a use-after-free in resolve_userfault_fork().

    If the fork side aborted and has already released everything, we still
    try to complete resolve_userfault_fork() successfully, if possible.

    Fixes: 893e26e61d04eac9 ("userfaultfd: non-cooperative: Add fork() event")
    Link: http://lkml.kernel.org/r/20170920180413.26713-1-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Mark Rutland
    Tested-by: Mark Rutland
    Cc: Pavel Emelyanov
    Cc: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
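
    The "take a reference before dropping the lock" pattern from this
    commit can be sketched with a single-threaded model. Everything here
    is hypothetical (struct demo_ctx, the demo_* helpers) and the counter
    is a plain int; the kernel code uses its own context type and atomic
    reference counting under a real lock.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_ctx {
    int refcount;  /* hypothetical, non-atomic reference counter */
};

static struct demo_ctx *demo_ctx_new(void)
{
    struct demo_ctx *ctx = calloc(1, sizeof(*ctx));

    if (ctx)
        ctx->refcount = 1;  /* reference owned by the fork side */
    return ctx;
}

static void demo_ctx_get(struct demo_ctx *ctx)
{
    assert(ctx && ctx->refcount > 0);
    ctx->refcount++;
}

/* Drops one reference; returns 1 if this call freed the context. */
static int demo_ctx_put(struct demo_ctx *ctx)
{
    assert(ctx && ctx->refcount > 0);
    if (--ctx->refcount == 0) {
        free(ctx);
        return 1;
    }
    return 0;
}

/* Returns 1 if the reader's put, not the fork side's, freed the context. */
static int demo_reader_outlives_fork_abort(void)
{
    struct demo_ctx *ctx = demo_ctx_new();
    int freed_by_fork, freed_by_reader;

    if (!ctx)
        return -1;
    demo_ctx_get(ctx);                    /* reader pins ctx while "holding the lock" */
    freed_by_fork = demo_ctx_put(ctx);    /* lock dropped, fork side aborts and releases */
    /* the reader may still dereference ctx here */
    freed_by_reader = demo_ctx_put(ctx);  /* reader is done */
    return !freed_by_fork && freed_by_reader;
}
```

    Without the demo_ctx_get() call, the fork side's put would free the
    context while the reader still holds a pointer to it.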
     

09 Sep, 2017

1 commit

  • This is an enhancement so that a non-cooperative userfaultfd manager
    does not have to unregister all regions before it can close the uffd
    after all userfaultfd activity has completed.

    UFFDIO_UNREGISTER would serialize against handle_userfault by taking
    the mmap_sem for writing, but we can simply repeat the page fault if
    we detect that the uffd was closed, so that the regular page fault
    paths take over.

    Link: http://lkml.kernel.org/r/20170823181227.19926-1-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Mike Rapoport
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

07 Sep, 2017

4 commits

  • No ABI change, but this makes it more explicit to software that ptid
    is only available if requested by passing UFFD_FEATURE_THREAD_ID to
    UFFDIO_API. The fact that it is a union also self-documents that the
    presence of a ptid there should not be taken for granted.

    Link: http://lkml.kernel.org/r/20170802165145.22628-7-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Alexey Perevalov
    Cc: Maxime Coquelin
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This could be useful for calculating per-vCPU downtime during postcopy
    live migration. A side observer, or the application itself, can then
    attribute the sleep during userfaultfd processing to the right task.

    The thread id of the faulting task is provided when the user requests
    it by setting the UFFD_FEATURE_THREAD_ID bit in uffdio_api.features.

    Link: http://lkml.kernel.org/r/20170802165145.22628-6-aarcange@redhat.com
    Signed-off-by: Alexey Perevalov
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Maxime Coquelin
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Perevalov
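
    On the monitor side, the thread id might be read along the following
    lines. This is a sketch that assumes <linux/userfaultfd.h> from a
    kernel with UFFD_FEATURE_THREAD_ID; demo_fault_tid() and its
    negotiated_features parameter (the features value the kernel returned
    from UFFDIO_API) are hypothetical.

```c
#include <assert.h>
#include <linux/userfaultfd.h>

/*
 * Stores the faulting thread id in *tid and returns 0, or returns -1
 * when the message is not a page fault or the feature was not negotiated.
 */
static int demo_fault_tid(const struct uffd_msg *msg,
                          __u64 negotiated_features, __u32 *tid)
{
    assert(msg && tid);
    if (msg->event != UFFD_EVENT_PAGEFAULT)
        return -1;
    if (!(negotiated_features & UFFD_FEATURE_THREAD_ID))
        return -1;  /* the feat union carries no ptid in this case */
    *tid = msg->arg.pagefault.feat.ptid;
    return 0;
}
```

    Checking the negotiated features first reflects the point made above:
    the ptid field is only meaningful when it was explicitly requested.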
     
  • In some cases, the userfaultfd mechanism should just deliver a SIGBUS
    signal to the faulting process instead of a page-fault event. Handling
    page-fault events in a monitor thread can be unnecessary overhead in
    these cases. For example, applications such as a database could use
    the signaling mechanism for robustness.

    The database uses hugetlbfs for performance reasons. Files on the
    hugetlbfs filesystem are created, and huge pages allocated, using the
    fallocate() API. Pages are deallocated/freed using fallocate() hole
    punching support. These files are mmapped and accessed by many
    processes as shared memory. The database keeps track of which offsets
    in the hugetlbfs file have pages allocated.

    Any access to a mapped address over a hole in the file, which can occur
    due to bugs in the application, is considered invalid, and the process
    is expected to simply receive a SIGBUS. However, currently when a hole
    in the file is accessed via the mapped address, kernel/mm attempts to
    automatically allocate a page at page fault time, implicitly filling
    the hole in the file. This may not be the desired behavior for
    applications like the database that want to explicitly manage page
    allocations of hugetlbfs files.

    By using the userfaultfd mechanism with this support to get a signal,
    the database application can prevent pages from being allocated
    implicitly when processes access mapped addresses over holes in the
    file.

    This patch adds the UFFD_FEATURE_SIGBUS feature to the userfaultfd
    mechanism to request a SIGBUS signal.

    See the following for the previous discussion about the database
    requirement leading to this proposal, as suggested by Andrea.

    http://www.spinics.net/lists/linux-mm/msg129224.html

    Link: http://lkml.kernel.org/r/1501552446-748335-2-git-send-email-prakash.sangappa@oracle.com
    Signed-off-by: Prakash Sangappa
    Reviewed-by: Mike Rapoport
    Reviewed-by: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Prakash Sangappa
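
    Requesting the feature from user space might look like the sketch
    below. It assumes <linux/userfaultfd.h> from a kernel that defines
    UFFD_FEATURE_SIGBUS and a descriptor obtained from the userfaultfd
    system call; demo_fill_api() and demo_enable_sigbus() are hypothetical
    helper names.

```c
#include <assert.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Prepares the UFFDIO_API handshake request with the wanted feature bits. */
static void demo_fill_api(struct uffdio_api *api, __u64 features)
{
    assert(api);
    memset(api, 0, sizeof(*api));
    api->api = UFFD_API;
    api->features = features;
}

/* Returns 0 if the handshake succeeded and SIGBUS delivery was granted. */
static int demo_enable_sigbus(int uffd)
{
    struct uffdio_api api;

    demo_fill_api(&api, UFFD_FEATURE_SIGBUS);
    if (ioctl(uffd, UFFDIO_API, &api) == -1)
        return -1;  /* errno describes the failure */
    return (api.features & UFFD_FEATURE_SIGBUS) ? 0 : -1;
}
```

    After a successful handshake, ranges still have to be registered with
    UFFDIO_REGISTER as usual; that step is omitted here.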
     
  • Now that shmem VMAs can be filled with the zero page via userfaultfd,
    we can report that UFFDIO_ZEROPAGE is available for those VMAs.

    Link: http://lkml.kernel.org/r/1497939652-16528-7-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
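
    A monitor can check whether UFFDIO_ZEROPAGE was reported for a range
    by testing the ioctls bitmask that UFFDIO_REGISTER fills in. This is a
    minimal sketch assuming <linux/userfaultfd.h>; demo_supports_zeropage()
    is a hypothetical helper.

```c
#include <assert.h>
#include <linux/userfaultfd.h>

/* Returns 1 if the registered range reports UFFDIO_ZEROPAGE support. */
static int demo_supports_zeropage(const struct uffdio_register *reg)
{
    assert(reg);
    return (reg->ioctls & ((__u64)1 << _UFFDIO_ZEROPAGE)) != 0;
}
```

    The helper is meant to be called on the struct uffdio_register that
    was passed to a successful UFFDIO_REGISTER ioctl.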
     

11 Aug, 2017

2 commits

  • Conflicts:
    include/linux/mm_types.h
    mm/huge_memory.c

    I removed the smp_mb__before_spinlock() like the following commit does:

    8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")

    and fixed up the affected commits.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • When process exit races with an outstanding mcopy_atomic, it is better
    to return an ESRCH error. When such a race occurs, the process and its
    mm are going away, and returning "no such process" to the uffd monitor
    seems a better fit than ENOSPC.

    Link: http://lkml.kernel.org/r/1502111545-32305-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Pavel Emelyanov
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
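
    A uffd monitor could treat the new error code roughly as sketched
    below. The policy, the enum and demo_classify_copy_errno() are
    hypothetical; the function takes the errno value observed after a
    failed UFFDIO_COPY, or 0 on success.

```c
#include <assert.h>
#include <errno.h>

enum demo_action {
    DEMO_CONTINUE,          /* keep servicing faults */
    DEMO_STOP_TARGET_GONE,  /* monitored process is exiting, stop quietly */
    DEMO_FAIL               /* unexpected error, report it */
};

/* Maps the result of a copy attempt to what the monitor does next. */
static enum demo_action demo_classify_copy_errno(int err)
{
    assert(err >= 0);
    switch (err) {
    case 0:
        return DEMO_CONTINUE;
    case ESRCH:
        return DEMO_STOP_TARGET_GONE;  /* "no such process": the mm is going away */
    default:
        return DEMO_FAIL;
    }
}
```

    Having a dedicated code for the exit race lets the monitor tell an
    expected shutdown apart from genuine failures.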