27 Jul, 2016

1 commit

  • The idea is borrowed from Peter's patch from the patchset on speculative
    page faults[1]:

    Instead of passing around the endless list of function arguments,
    replace the lot with a single structure so we can change context without
    endless function signature changes.

    The changes are mostly mechanical, with the exception of the faultaround
    code: filemap_map_pages() got reworked a bit.

    This patch is preparation for the next one.

    [1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org

    Link: http://lkml.kernel.org/r/1466021202-61880-9-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
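The refactoring described in this commit can be illustrated with a small userspace sketch. It is not the kernel's code: `struct fault_ctx`, its fields, and both helper names are hypothetical and exist only to show why passing one structure avoids repeated signature changes.

```c
#include <assert.h>

/* Hypothetical fault context: the field names are illustrative only. */
struct fault_ctx {
    unsigned long vma_start;   /* start of the mapping being faulted */
    unsigned long address;     /* faulting virtual address */
    unsigned int  page_shift;  /* log2 of the page size */
    unsigned int  flags;       /* fault flags, unused in this sketch */
};

/* Before: each helper takes the individual values as arguments. */
unsigned long fault_pgoff_args(unsigned long vma_start, unsigned long address,
                               unsigned int page_shift)
{
    assert(address >= vma_start);
    return (address - vma_start) >> page_shift;
}

/* After: helpers take one pointer, so adding a field changes no signature. */
unsigned long fault_pgoff_ctx(const struct fault_ctx *ctx)
{
    assert(ctx && ctx->address >= ctx->vma_start);
    return (ctx->address - ctx->vma_start) >> ctx->page_shift;
}
```

Both helpers compute the same page offset; only the second keeps its signature stable when more context has to be threaded through the call chain.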
     

21 May, 2016

1 commit

  • userfaultfd_file_create() increments mm->mm_users; this means that the
    memory won't be unmapped/freed if mm owner exits/execs, and UFFDIO_COPY
    after that can populate the orphaned mm more.

    Change userfaultfd_file_create() and userfaultfd_ctx_put() to use
    mm->mm_count to pin mm_struct. This means that
    atomic_inc_not_zero(mm->mm_users) is needed when we are going to
    actually play with this memory. The handle_userfault() path is the
    exception: it doesn't need this because the caller must already hold a
    reference.

    The patch adds a new trivial helper, mmget_not_zero(), which can have
    more users.

    Link: http://lkml.kernel.org/r/20160516172254.GA8595@redhat.com
    Signed-off-by: Oleg Nesterov
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
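The "increment unless zero" operation that this commit relies on can be sketched in userspace with C11 atomics. This is an illustration under stated assumptions, not the kernel helper: `struct mm_sketch`, `inc_not_zero()`, and `mm_sketch_get_users()` are hypothetical names, with `users` standing in for mm->mm_users and `count` for mm->mm_count.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two reference counts of an mm_struct. */
struct mm_sketch {
    atomic_int users;  /* like mm_users: keeps the address space alive */
    atomic_int count;  /* like mm_count: keeps only the structure allocated */
};

/* Take a reference only if the counter has not already dropped to zero. */
bool inc_not_zero(atomic_int *v)
{
    int old = atomic_load(v);

    while (old != 0) {
        /* On failure, old is refreshed with the current value. */
        if (atomic_compare_exchange_weak(v, &old, old + 1))
            return true;
    }
    return false;
}

/* Pin the address space before touching its memory, if it still exists. */
bool mm_sketch_get_users(struct mm_sketch *mm)
{
    return inc_not_zero(&mm->users);
}
```

A holder of `count` can therefore keep the structure around indefinitely while only borrowing `users` for the duration of an actual memory operation.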
     

03 Mar, 2016

1 commit

  • The exit path will do some final updates to the VM of an exiting process
    to inform others of the fact that the process is going away.

    That happens, for example, for robust futex state cleanup, but also if
    the parent has asked for a TID update when the process exits (we clear
    the child tid field in user space).

    However, at the time we do those final VM accesses, we've already
    stopped accepting signals, so the usual "stop waiting for userfaults on
    signal" code in fs/userfaultfd.c no longer works, and the process can
    become an unkillable zombie waiting for something that will never
    happen.

    To solve this, just make handle_userfault() abort any user fault
    handling if we're already in the exit path past the signal handling
    state being dead (marked by PF_EXITING).

    This VM special case is pretty ugly, and it is possible that we should
    look at finalizing signals later (or move the VM final accesses
    earlier). But in the meantime this is a fairly minimally intrusive fix.

    Reported-and-tested-by: Dmitry Vyukov
    Acked-by: Andrea Arcangeli
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

23 Sep, 2015

1 commit

  • This reverts commit 51360155eccb907ff8635bd10fc7de876408c2e0 and adapts
    fs/userfaultfd.c to use the old version of that function.

    It didn't look robust to call __wake_up_common with "nr == 1" when we
    absolutely require wakeall semantics, but we have full control of what
    we insert in the two waitqueue heads of the blocked userfaults. No
    exclusive waitqueue entry can end up in those two waitqueue heads, so
    we may as well stick to the "nr == 1" of the old code and rely purely
    on the fact that no entry inserted in either of the two waitqueue
    heads, which must behave as wakeall, has WQ_FLAG_EXCLUSIVE set in
    wait->flags.

    Signed-off-by: Andrea Arcangeli
    Cc: Dr. David Alan Gilbert
    Cc: Michael Ellerman
    Cc: Shuah Khan
    Cc: Thierry Reding
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
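Why "nr == 1" still wakes everyone when no exclusive waiter is queued can be seen in a simplified userspace model of the wake loop. The names `struct waiter_sketch`, `SKETCH_WQ_EXCLUSIVE`, and `wake_up_sketch()` are hypothetical; the real kernel loop also invokes a per-entry callback, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_WQ_EXCLUSIVE 0x01  /* stands in for WQ_FLAG_EXCLUSIVE */

/* Hypothetical queue entry: only the flag and a "woken" marker. */
struct waiter_sketch {
    unsigned int flags;
    int woken;
};

/*
 * Wake entries in queue order. The walk stops only after nr_exclusive
 * exclusive waiters have been woken, so a queue without exclusive
 * entries is always woken completely. Returns the number woken.
 */
size_t wake_up_sketch(struct waiter_sketch *q, size_t n, int nr_exclusive)
{
    size_t woken = 0;

    for (size_t i = 0; i < n; i++) {
        q[i].woken = 1;
        woken++;
        if ((q[i].flags & SKETCH_WQ_EXCLUSIVE) && --nr_exclusive == 0)
            break;
    }
    return woken;
}
```

With three non-exclusive waiters and `nr_exclusive` set to 1, all three are woken; the walk only stops early if an exclusive entry is present, which is the case the commit message rules out.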
     

18 Sep, 2015

1 commit


05 Sep, 2015

11 commits

  • During the refile in userfaultfd_read both waitqueues could look empty to
    the lockless wake_userfault(). Use a seqcount to prevent this false
    negative that could leave a userfault blocked.

    Signed-off-by: Andrea Arcangeli
    Cc: Pavel Emelyanov
    Cc: Dave Hansen
    Cc: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
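The false negative and its seqcount fix can be modelled in userspace with C11 atomics. This is a sketch, not the kernel's seqcount_t: `struct uffd_queues_sketch` and all function names are hypothetical, and the two waitqueues are reduced to two counters.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: two queues reduced to counters, plus a sequence. */
struct uffd_queues_sketch {
    atomic_uint seq;      /* even: stable, odd: refile in progress */
    atomic_int  pending;  /* userfaults not yet read() */
    atomic_int  reading;  /* userfaults already read(), awaiting wakeup */
};

void queues_init(struct uffd_queues_sketch *q, int pending)
{
    atomic_init(&q->seq, 0);
    atomic_init(&q->pending, pending);
    atomic_init(&q->reading, 0);
}

/* Writer: move one userfault from the pending queue to the read queue. */
void refile_one(struct uffd_queues_sketch *q)
{
    atomic_fetch_add(&q->seq, 1);      /* sequence becomes odd */
    atomic_fetch_sub(&q->pending, 1);
    /* Here both counters may read as zero: the false negative window. */
    atomic_fetch_add(&q->reading, 1);
    atomic_fetch_add(&q->seq, 1);      /* sequence becomes even again */
}

/* Lockless reader: retry if a refile overlapped with the two loads. */
bool any_waiter(struct uffd_queues_sketch *q)
{
    unsigned int s;
    bool busy;

    do {
        while ((s = atomic_load(&q->seq)) & 1)
            ;  /* a refile is in progress, wait for it to finish */
        busy = atomic_load(&q->pending) > 0 || atomic_load(&q->reading) > 0;
    } while (atomic_load(&q->seq) != s);

    return busy;
}
```

The reader never takes a lock; it only repeats its two loads when the sequence shows that a refile ran concurrently.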
     
  • This is only simple to achieve if the userfault is going to return to
    userland (not to the kernel) because we can avoid returning VM_FAULT_RETRY
    even though we temporarily released the mmap_sem. The fault would just be
    retried by userland then. This is safe at least on x86 and powerpc (the
    two archs with the syscall implemented so far).

    Hint to verify for which archs this is safe: after handle_mm_fault
    returns, no access to data structures protected by the mmap_sem must be
    done by the fault code in arch/*/mm/fault.c until up_read(&mm->mmap_sem)
    is called.

    This has two main benefits: signals can run with lower latency in
    production (signals aren't blocked by userfaults and userfaults are
    immediately repeated after signal processing) and gdb can then trivially
    debug the threads blocked in this kind of userfaults coming directly from
    userland.

    On a side note: while gdb needs to get signals processed, coredumps
    always worked perfectly with userfaults, no matter if the userfault is
    triggered by GUP, by a kernel copy_user, or directly from userland.

    Signed-off-by: Andrea Arcangeli
    Cc: Pavel Emelyanov
    Cc: Dave Hansen
    Cc: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • UFFDIO_API was already forced before read/poll could work. This makes the
    code more strict to force it also for all other ioctls.

    All users would already have been required to call UFFDIO_API before
    invoking other ioctls but this makes it more explicit.

    This will ensure we can change all ioctls (all but UFFDIO_API/struct
    uffdio_api) with a bump of uffdio_api.api.

    There's no actual plan or need to change the API or the ioctl. The current
    API should already cover even the non-cooperative usage fine, but this is
    for the longer term future, just in case.

    Signed-off-by: Andrea Arcangeli
    Cc: Pavel Emelyanov
    Cc: Dave Hansen
    Cc: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • These two ioctls allow either atomically copying pages or mapping
    zeropages into the virtual address space. This is used by the thread that
    opened the userfaultfd to resolve the userfaults.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
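From userland, the two ioctls described in this commit might be wrapped as in the following sketch. It assumes a Linux system with the <linux/userfaultfd.h> UAPI header installed; the wrapper names `resolve_with_copy()` and `resolve_with_zeropage()` are hypothetical, and error handling is left to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Atomically copy len bytes from src into the faulting range at dst. */
int resolve_with_copy(int ufd, void *dst, const void *src, size_t len)
{
    struct uffdio_copy copy;

    memset(&copy, 0, sizeof(copy));
    copy.dst = (unsigned long)dst;
    copy.src = (unsigned long)src;
    copy.len = len;
    copy.mode = 0;  /* 0: also wake the threads blocked on this range */
    return ioctl(ufd, UFFDIO_COPY, &copy);
}

/* Map zeropages over the faulting range at dst. */
int resolve_with_zeropage(int ufd, void *dst, size_t len)
{
    struct uffdio_zeropage zp;

    memset(&zp, 0, sizeof(zp));
    zp.range.start = (unsigned long)dst;
    zp.range.len = len;
    zp.mode = 0;
    return ioctl(ufd, UFFDIO_ZEROPAGE, &zp);
}
```

Both wrappers return the ioctl result unchanged, so a return of -1 with errno set indicates failure; dst and len are expected to be page aligned.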
     
  • Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
    userfaultfd_read if they are run on different threads simultaneously.

    Until now qemu solved the race in userland: the race was explicitly
    and intentionally left for userland to solve. However we can also
    solve it in kernel.

    Requiring all users to solve this race if they use two threads (one
    for the background transfer and one for the userfault reads) isn't
    very attractive from an API perspective. Furthermore, this allows
    removing a whole bunch of mutex and bitmap code from qemu, making it
    faster. The cost of __get_user_pages_fast should be insignificant
    considering it scales perfectly and the pagetables are already hot in
    the CPU cache, compared to the overhead in userland to maintain those
    structures.

    Applying this patch is backwards compatible with respect to the
    userfaultfd userland API, however reverting this change wouldn't be
    backwards compatible anymore.

    Without this patch, qemu in the background transfer thread has to read
    the old state and do UFFDIO_WAKE if old_state is MISSING but has
    become REQUESTED by the time it tries to set it to RECEIVED (signaling
    that the other side received a userfault).

    vcpu                background_thr      userfault_thr
    -----               -----               -----
    vcpu0 handle_mm_fault()

                        postcopy_place_page
                        read old_state -> MISSING
                        UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)

    vcpu0 fault at 0x7fb76a139000 enters handle_userfault
    poll() is kicked

                                            poll() -> POLLIN
                                            read() -> 0x7fb76a139000
                                            postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED

                        tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
                        /* check that no userfault raced with UFFDIO_COPY */
                        if (old_state == MISSING && tmp_state == REQUESTED)
                                UFFDIO_WAKE from background thread

    And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:

    vcpu                background_thr      userfault_thr
    -----               -----               -----
    vcpu0 handle_mm_fault()

                        postcopy_place_page
                        read old_state -> MISSING
                        UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
                        tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED

    vcpu0 fault at 0x7fb76a139000 enters handle_userfault
    poll() is kicked

                                            poll() -> POLLIN
                                            read() -> 0x7fb76a139000

                                            if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
                                                    UFFDIO_WAKE from userfault thread

    This patch removes the need for both UFFDIO_WAKE and the associated
    per-page tristate.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
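The per-page tristate that userland needed before this patch can be sketched with a C11 compare-and-swap. This is one reading of the two interleavings in the commit message, not qemu's actual code: `change_state()` is assumed to return the state the page holds after the attempted transition, and all names here are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum page_state { PAGE_MISSING, PAGE_REQUESTED, PAGE_RECEIVED };

/*
 * Try to move the page from expected to desired. Returns the state the
 * page holds afterwards: desired on success, the unchanged current state
 * if another thread got there first.
 */
enum page_state change_state(atomic_int *state, enum page_state expected,
                             enum page_state desired)
{
    int cur = expected;

    if (atomic_compare_exchange_strong(state, &cur, desired))
        return desired;
    return (enum page_state)cur;
}

/* Background thread: a userfault raced with the copy, so wake it. */
bool background_needs_wake(enum page_state old_state, enum page_state tmp_state)
{
    return old_state == PAGE_MISSING && tmp_state == PAGE_REQUESTED;
}

/* Userfault thread: the page arrived before the fault was read. */
bool userfault_needs_wake(atomic_int *state)
{
    return change_state(state, PAGE_MISSING, PAGE_REQUESTED) == PAGE_RECEIVED;
}
```

Each function corresponds to one of the two cases in which an explicit UFFDIO_WAKE was required; with the race solved in the kernel, neither check is needed.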
     
  • Use proper slab to guarantee alignment.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This makes read O(1) and poll that was already O(1) becomes lockless.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This is an optimization but it's a userland visible one and it affects
    the API.

    The downside of this optimization is that if you call poll() and you
    get POLLIN, read(ufd) may still return -EAGAIN. The blocked userfault
    may be woken by a different thread before read(ufd) comes
    around. In short, this means that poll() isn't really usable if the
    userfaultfd is opened in blocking mode.

    Userfaults won't wait in "pending" state to be read anymore, and any
    UFFDIO_WAKE or similar operation whose objective is to wake
    userfaults after their resolution will wake all blocked userfaults
    for the resolved range, including those that haven't been read() by
    userland yet.

    The behavior of poll() becomes non-standard, but this obviates the
    need for "spurious" UFFDIO_WAKE and it lets the userland threads
    restart immediately without requiring a UFFDIO_WAKE. This is even
    more significant in case of repeated faults on the same address from
    multiple threads.

    This optimization is justified by the measurement that spurious
    UFFDIO_WAKE calls account for between 5% and 10% of the total
    userfaults for heavy workloads, so it's worth optimizing those away.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • I had requests to return the full address (not the page aligned one) to
    userland.

    It's not entirely clear how the page offset could be relevant, because
    userfaults aren't like SIGBUS, which can sigjump to a different place and
    actually skip resolving the fault depending on a page offset. There's
    currently no real way to skip the fault, especially because after a
    UFFDIO_COPY|ZEROPAGE the fault is optimized to be retried within the
    kernel without having to return to userland first (not even self-modifying
    code replacing the .text that touched the faulting address would prevent
    the fault from being repeated). Userland cannot skip repeating the fault,
    even more so if the fault was triggered by a KVM secondary page fault, by
    any get_user_pages, or by any copy-user inside some syscall which will
    return to kernel code. The second time FAULT_FLAG_RETRY_NOWAIT won't be
    set, leading to a SIGBUS being raised because the userfault can't wait if
    it cannot release the mmap_sem first (and FAULT_FLAG_RETRY_NOWAIT is
    required for that).

    Still, returning a proper structure to userland during the read() on the
    uffd allows the current UFFD_API to be used for the future non-cooperative
    extensions too, and it looks cleaner as well. Once we get additional
    fields there's no point in returning the fault address page aligned
    anymore just to reuse the bits below PAGE_SHIFT.

    The only downside is that the read() syscall will read 32 bytes instead
    of 8 bytes, but that's not going to be measurable overhead.

    The total number of new events that can be extended or of new future bits
    for already shipped events, is limited to 64 by the features field of the
    uffdio_api structure. If more will be needed a bump of UFFD_API will be
    required.

    [akpm@linux-foundation.org: use __packed]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
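Reading the structure described in this commit from userland might look like the sketch below. It assumes a Linux system with the <linux/userfaultfd.h> UAPI header and a userfaultfd opened in non-blocking mode; `read_fault_address()` is a hypothetical helper name, and the return convention is chosen only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/*
 * Read one message from the userfaultfd.
 * Returns 1 and stores the full (not page aligned) fault address,
 * 0 if nothing is pending (EAGAIN), and -1 on any other failure.
 */
int read_fault_address(int ufd, unsigned long long *address)
{
    struct uffd_msg msg;
    ssize_t n = read(ufd, &msg, sizeof(msg));

    if (n < 0)
        return errno == EAGAIN ? 0 : -1;
    if (n != (ssize_t)sizeof(msg) || msg.event != UFFD_EVENT_PAGEFAULT)
        return -1;

    *address = msg.arg.pagefault.address;
    return 1;
}
```

Treating EAGAIN as "nothing to do" rather than as an error matters because a userfault reported by poll() may already have been woken by another thread before the read() happens.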
     
  • This is (seems to be) the minimal thing that is required to unblock
    standard uffd usage from the non-cooperative one. Now more bits can be
    added to the features field indicating e.g. UFFD_FEATURE_FORK and others
    needed for the latter use-case.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Andrea Arcangeli
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Once a userfaultfd has been created and certain regions of the process
    virtual address space have been registered into it, the thread responsible
    for doing the memory externalization can manage the page faults in
    userland by talking to the kernel using the userfaultfd protocol.

    poll() can be used to know when there are new pending userfaults to be
    read (POLLIN).

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli