16 Apr, 2015

1 commit

  • Commit 38c5ce936a08 ("mm/gup: Replace ACCESS_ONCE with READ_ONCE")
    converted ACCESS_ONCE usage in gup_pmd_range() to READ_ONCE, since
    ACCESS_ONCE doesn't work reliably on non-scalar types.

    This patch also fixes the other ACCESS_ONCE usages in gup_pte_range()
    and __get_user_pages_fast() in mm/gup.c.
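
    A minimal sketch of the conversion (illustrative variable names, not
    the exact mm/gup.c code):

        /* before: may be miscompiled when pte_t is a non-scalar type */
        pte_t pte = ACCESS_ONCE(*ptep);

        /* after: READ_ONCE copes with aggregate types as well */
        pte_t pte = READ_ONCE(*ptep);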

    Signed-off-by: Jason Low
    Acked-by: Michal Hocko
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Reviewed-by: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Low
     

15 Apr, 2015

2 commits

  • It's odd that we have populate_vma_page_range() and __mm_populate() in
    mm/mlock.c. They implement generic memory population, and mlocking is
    only one possible side effect, applied when VM_LOCKED is set.

    __get_user_pages() is the core of the implementation. Let's move the
    code into mm/gup.c.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Acked-by: David Rientjes
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • After commit a1fde08c74e9 ("VM: skip the stack guard page lookup in
    get_user_pages only for mlock") FOLL_MLOCK has lost its original
    meaning: we don't necessarily mlock the page if the flag is set -- we
    also take VM_LOCKED into consideration.

    Since we use the same codepath for __mm_populate(), let's rename
    FOLL_MLOCK to FOLL_POPULATE.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Acked-by: David Rientjes
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Feb, 2015

1 commit

  • Pull ACCESS_ONCE() rule tightening from Christian Borntraeger:
    "Tighten rules for ACCESS_ONCE

    This series tightens the rules for ACCESS_ONCE to only work on scalar
    types. It also contains the necessary fixups as indicated by build
    bots of linux-next. Now everything is in place to prevent new
    non-scalar users of ACCESS_ONCE and we can continue to convert code to
    READ_ONCE/WRITE_ONCE"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux:
    kernel: Fix sparse warning for ACCESS_ONCE
    next: sh: Fix compile error
    kernel: tighten rules for ACCESS ONCE
    mm/gup: Replace ACCESS_ONCE with READ_ONCE
    x86/spinlock: Leftover conversion ACCESS_ONCE->READ_ONCE
    x86/xen/p2m: Replace ACCESS_ONCE with READ_ONCE
    ppc/hugetlbfs: Replace ACCESS_ONCE with READ_ONCE
    ppc/kvm: Replace ACCESS_ONCE with READ_ONCE

    Linus Torvalds
     

13 Feb, 2015

1 commit

  • Convert existing users of pte_numa and friends to the new helper. Note
    that the kernel is broken after this patch is applied until the other page
    table modifiers are also altered. This patch layout is to make review
    easier.
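
    As an illustration, the conversions are of this shape (a sketch,
    assuming the new helper is pte_protnone() as introduced by this
    series; the exact call sites differ):

        /* before */
        if ((flags & FOLL_NUMA) && pte_numa(pte))
                goto no_page;

        /* after */
        if ((flags & FOLL_NUMA) && pte_protnone(pte))
                goto no_page;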

    Signed-off-by: Mel Gorman
    Acked-by: Linus Torvalds
    Acked-by: Aneesh Kumar
    Acked-by: Benjamin Herrenschmidt
    Tested-by: Sasha Levin
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

12 Feb, 2015

4 commits

  • This allows the get_user_pages_fast slow path to release the mmap_sem
    before blocking.
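
    A sketch of the idea, assuming the tsk/mm-taking signature of
    get_user_pages_unlocked() from this era (the real call sites differ):

        /* slow path: fall back to the _unlocked variant so the mmap_sem
           can be released while blocking on faults/IO */
        ret = get_user_pages_unlocked(current, mm, start, nr_pages - nr,
                                      write, 0, pages + nr);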

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Kirill A. Shutemov
    Cc: Andres Lagar-Cavilla
    Cc: Peter Feiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Some callers (like KVM) may want to set gup_flags such as FOLL_HWPOISON
    to get a proper -EHWPOISON retval instead of -EFAULT, so they can take
    a more appropriate action if get_user_pages runs into a memory failure.
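
    A hypothetical caller sketch, assuming the __get_user_pages_unlocked()
    form this patch introduces (the helper in the last line is
    illustrative, not an existing function):

        npages = __get_user_pages_unlocked(current, current->mm, addr, 1,
                                           write, 0, &page, FOLL_HWPOISON);
        if (npages == -EHWPOISON)
                take_poison_aware_action(addr); /* e.g. inject an MCE */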

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Kirill A. Shutemov
    Cc: Andres Lagar-Cavilla
    Cc: Peter Feiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • FAULT_FLAG_ALLOW_RETRY allows the page fault to drop the mmap_sem for
    reading to reduce the mmap_sem contention (for writing), for example
    while waiting for I/O completion. The problem is that right now
    practically no get_user_pages call uses FAULT_FLAG_ALLOW_RETRY, so
    we're not leveraging that nifty feature.

    Andres fixed it for the KVM page fault. However get_user_pages_fast
    remains uncovered, and 99% of other get_user_pages callers aren't
    using it either (the only exception being FOLL_NOWAIT in KVM, which is
    really nonblocking and in fact doesn't even release the mmap_sem).

    So this patchset extends the optimization Andres did in the KVM page
    fault to the whole kernel. It makes the most important places
    (including gup_fast) use FAULT_FLAG_ALLOW_RETRY to reduce the mmap_sem
    hold times during I/O.

    The few places that remain uncovered are drivers like v4l and other
    exceptions that tend to work on their own memory rather than on random
    user memory (unlike, for example, O_DIRECT, which uses gup_fast and is
    fully covered by this patch).

    A follow-up patch should probably also add a printk_once warning to
    get_user_pages, which should become obsolete and be phased out
    eventually. The "vmas" parameter of get_user_pages makes it
    fundamentally incompatible with FAULT_FLAG_ALLOW_RETRY (the vmas array
    becomes meaningless the moment the mmap_sem is released).

    While this is just an optimization, it becomes an absolute requirement
    for the userfaultfd feature (http://lwn.net/Articles/615086/).

    userfaultfd allows blocking the page fault, and in order to do so the
    mmap_sem must be dropped first. So this patch also ensures that, for
    all memory where userfaultfd could be registered by KVM, the very
    first fault (no matter if it is a regular page fault or a
    get_user_pages) always has FAULT_FLAG_ALLOW_RETRY set. The userfaultfd
    then blocks, and is woken only when the page table is already mapped.
    The second fault attempt after the wakeup doesn't need
    FAULT_FLAG_ALLOW_RETRY, so it's ok to retry without it.

    This patch (of 5):

    We can leverage the VM_FAULT_RETRY functionality in the page fault paths
    better by using either get_user_pages_locked or get_user_pages_unlocked.

    The former allows conversion of get_user_pages invocations that pass a
    "&locked" parameter to learn whether the mmap_sem was dropped during
    the call. Example, converting:

        down_read(&mm->mmap_sem);
        do_something();
        get_user_pages(tsk, mm, ..., pages, NULL);
        up_read(&mm->mmap_sem);

    to:

        int locked = 1;

        down_read(&mm->mmap_sem);
        do_something();
        get_user_pages_locked(tsk, mm, ..., pages, &locked);
        if (locked)
                up_read(&mm->mmap_sem);

    The latter is suitable only as a drop-in replacement of the form:

        down_read(&mm->mmap_sem);
        get_user_pages(tsk, mm, ..., pages, NULL);
        up_read(&mm->mmap_sem);

    into:

        get_user_pages_unlocked(tsk, mm, ..., pages);

    Here tsk, mm, the intermediate "..." parameters, and "pages" can take
    any value as before. Only the last parameter of get_user_pages (vmas)
    must be NULL for get_user_pages_locked|unlocked to be usable (the
    original unlocked form wouldn't have been safe anyway if vmas wasn't
    NULL; for the locked form we just make it explicit by dropping the
    parameter).

    If vmas is not NULL these two methods cannot be used.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Andres Lagar-Cavilla
    Reviewed-by: Peter Feiner
    Reviewed-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • We have a race condition between move_pages() and freeing hugepages,
    where move_pages() calls follow_page(FOLL_GET) for hugepages
    internally and tries to get their refcount without preventing
    concurrent freeing. This race crashes the kernel, so this patch fixes
    it by moving the FOLL_GET code for hugepages into follow_huge_pmd(),
    taking the page table lock there.

    This patch intentionally removes the page==NULL check after pte_page.
    This is justified because pte_page() never returns NULL on any
    architecture or configuration.

    This patch changes the behavior of follow_huge_pmd() for tail pages,
    so that tail pages can now be pinned and returned. Callers must
    therefore be changed to handle the returned tail pages properly.

    We could add similar locking to follow_huge_(addr|pud) for
    consistency, but it's not necessary because these functions currently
    don't support the FOLL_GET flag, so let's leave it for future
    development.
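
    Roughly, the shape of the fix (a sketch; see the actual
    follow_huge_pmd() for the real code):

        ptl = pmd_lockptr(mm, pmd);
        spin_lock(ptl);
        if (pmd_present(*pmd)) {
                page = pte_page(*(pte_t *)pmd) +
                        ((address & ~PMD_MASK) >> PAGE_SHIFT);
                if (flags & FOLL_GET)
                        get_page(page); /* refcount taken under the lock */
        }
        spin_unlock(ptl);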

    Here is the reproducer:

    $ cat movepages.c
    #include <stdlib.h>
    #include <sys/types.h>
    #include <err.h>
    #include <numa.h>
    #include <numaif.h>

    #define ADDR_INPUT 0x700000000000UL
    #define HPS 0x200000
    #define PS 0x1000

    int main(int argc, char *argv[]) {
            int i;
            int nr_hp = strtol(argv[1], NULL, 0);
            int nr_p = nr_hp * HPS / PS;
            int ret;
            void **addrs;
            int *status;
            int *nodes;
            pid_t pid;

            pid = strtol(argv[2], NULL, 0);
            addrs = malloc(sizeof(void *) * nr_p);
            status = malloc(sizeof(int) * nr_p);
            nodes = malloc(sizeof(int) * nr_p);

            while (1) {
                    for (i = 0; i < nr_p; i++) {
                            addrs[i] = (void *)ADDR_INPUT + i * PS;
                            nodes[i] = 1;
                            status[i] = 0;
                    }
                    ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                          MPOL_MF_MOVE_ALL);
                    if (ret == -1)
                            err(1, "move_pages");

                    for (i = 0; i < nr_p; i++) {
                            addrs[i] = (void *)ADDR_INPUT + i * PS;
                            nodes[i] = 0;
                            status[i] = 0;
                    }
                    ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                          MPOL_MF_MOVE_ALL);
                    if (ret == -1)
                            err(1, "move_pages");
            }
            return 0;
    }

    $ cat hugepage.c
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define ADDR_INPUT 0x700000000000UL
    #define HPS 0x200000

    int main(int argc, char *argv[]) {
            int nr_hp = strtol(argv[1], NULL, 0);
            char *p;

            while (1) {
                    p = mmap((void *)ADDR_INPUT, nr_hp * HPS,
                             PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                             -1, 0);
                    if (p != (void *)ADDR_INPUT) {
                            perror("mmap");
                            break;
                    }
                    memset(p, 0, nr_hp * HPS);
                    munmap(p, nr_hp * HPS);
            }
            return 0;
    }

    $ sysctl vm.nr_hugepages=40
    $ ./hugepage 10 &
    $ ./movepages 10 $(pgrep -f hugepage)

    Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()")
    Signed-off-by: Naoya Horiguchi
    Reported-by: Hugh Dickins
    Cc: James Hogan
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Luiz Capitulino
    Cc: Nishanth Aravamudan
    Cc: Lee Schermerhorn
    Cc: Steve Capper
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

30 Jan, 2015

1 commit

  • The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
    "you should SIGSEGV" error, because the SIGSEGV case was generally
    handled by the caller - usually the architecture fault handler.

    That results in lots of duplication - all the architecture fault
    handlers end up doing very similar "look up vma, check permissions, do
    retries etc" - but it generally works. However, there are cases where
    the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.

    In particular, when accessing the stack guard page, libsigsegv expects a
    SIGSEGV. And it usually got one, because the stack growth is handled by
    that duplicated architecture fault handler.

    However, when the generic VM layer started propagating the error return
    from the stack expansion in commit fee7e49d4514 ("mm: propagate error
    from stack expansion even for guard page"), that now exposed the
    existing VM_FAULT_SIGBUS result to user space. And user space really
    expected SIGSEGV, not SIGBUS.

    To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all
    those duplicate architecture fault handlers about it. They all already
    have the code to handle SIGSEGV, so it's just a matter of tying that
    new return value to the existing code, but it's all a bit annoying.

    This is the mindless minimal patch to do this. A more extensive patch
    would be to try to gather up the mostly shared fault handling logic into
    one generic helper routine, and long-term we really should do that
    cleanup.

    Just from this patch, you can generally see that most architectures just
    copied (directly or indirectly) the old x86 way of doing things, but in
    the meantime that original x86 model has been improved to hold the VM
    semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
    "newer" things, so it would be a good idea to bring all those
    improvements to the generic case and teach other architectures about
    them too.
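
    In each architecture's fault handler the change boils down to
    something like this (a sketch; the label names vary per architecture):

        fault = handle_mm_fault(mm, vma, address, flags);
        if (fault & VM_FAULT_ERROR) {
                if (fault & VM_FAULT_OOM)
                        goto out_of_memory;
                else if (fault & VM_FAULT_SIGSEGV)
                        goto bad_area;          /* deliver SIGSEGV */
                else if (fault & VM_FAULT_SIGBUS)
                        goto do_sigbus;         /* deliver SIGBUS */
                BUG();
        }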

    Reported-and-tested-by: Takashi Iwai
    Tested-by: Jan Engelhardt
    Acked-by: Heiko Carstens # "s390 still compiles and boots"
    Cc: linux-arch@vger.kernel.org
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

19 Jan, 2015

1 commit

  • ACCESS_ONCE does not work reliably on non-scalar types. For
    example gcc 4.6 and 4.7 might remove the volatile tag for such
    accesses during the SRA (scalar replacement of aggregates) step
    (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)

    Fix up gup_pmd_range().

    Signed-off-by: Christian Borntraeger

    Christian Borntraeger
     

21 Dec, 2014

1 commit

  • Pull ACCESS_ONCE cleanup preparation from Christian Borntraeger:
    "kernel: Provide READ_ONCE and ASSIGN_ONCE

    As discussed on LKML http://marc.info/?i=54611D86.4040306%40de.ibm.com
    ACCESS_ONCE might fail with specific compilers for non-scalar
    accesses.

    Here is a set of patches to tackle that problem.

    The first patch introduces READ_ONCE and ASSIGN_ONCE. If the data
    structure is larger than the machine word size, memcpy is used and a
    warning is emitted. The next patches fix up several in-tree users of
    ACCESS_ONCE on non-scalar types.

    This does not yet contain a patch that forces ACCESS_ONCE to work only
    on scalar types. That is targeted for the next merge window, as
    linux-next already contains new offenders regarding ACCESS_ONCE vs.
    non-scalar types"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux:
    s390/kvm: REPLACE barrier fixup with READ_ONCE
    arm/spinlock: Replace ACCESS_ONCE with READ_ONCE
    arm64/spinlock: Replace ACCESS_ONCE READ_ONCE
    mips/gup: Replace ACCESS_ONCE with READ_ONCE
    x86/gup: Replace ACCESS_ONCE with READ_ONCE
    x86/spinlock: Replace ACCESS_ONCE with READ_ONCE
    mm: replace ACCESS_ONCE with READ_ONCE or barriers
    kernel: Provide READ_ONCE and ASSIGN_ONCE

    Linus Torvalds
     

18 Dec, 2014

1 commit

  • ACCESS_ONCE does not work reliably on non-scalar types. For
    example gcc 4.6 and 4.7 might remove the volatile tag for such
    accesses during the SRA (scalar replacement of aggregates) step
    (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)

    Let's change the gup code to access the page table elements with
    READ_ONCE, which does implicit scalar accesses.

    mm_find_pmd is tricky, because m68k and sparc (32-bit) define pmd_t as
    an array of longs. This code only requires that the pmd_present and
    pmd_trans_huge checks are done on the same value, so a barrier is
    sufficient.

    A similar case is in handle_pte_fault. On ppc44x the word size is
    32 bits, but a pte is 64 bits. A barrier is ok there as well.
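
    Sketches of the two flavors (illustrative, not the exact code):

        /* gup path: read the pmd once, as a scalar */
        pmd_t pmd = READ_ONCE(*pmdp);

        /* mm_find_pmd style: pmd_t may be an array of longs, so a
           compiler barrier is enough to keep both checks on one value */
        pmd_t pmde = *pmdp;
        barrier();
        if (!pmd_present(pmde) || pmd_trans_huge(pmde))
                return NULL;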

    Signed-off-by: Christian Borntraeger
    Cc: linux-mm@kvack.org
    Acked-by: Paul E. McKenney

    Christian Borntraeger
     

10 Oct, 2014

1 commit

  • This series implements general forms of get_user_pages_fast and
    __get_user_pages_fast in core code and activates them for arm and arm64.

    These are required for Transparent HugePages to function correctly, as a
    futex on a THP tail will otherwise result in an infinite loop (due to the
    core implementation of __get_user_pages_fast always returning 0).

    Unfortunately, a futex on THP tail can be quite common for certain
    workloads; thus THP is unreliable without a __get_user_pages_fast
    implementation.

    This series may also be beneficial for direct-IO heavy workloads and
    certain KVM workloads.

    This patch (of 6):

    get_user_pages_fast() attempts to pin user pages by walking the page
    tables directly and avoids taking locks. Thus the walker needs to be
    protected from page table pages being freed from under it, and needs to
    block any THP splits.

    One way to achieve this is to have the walker disable interrupts, and rely
    on IPIs from the TLB flushing code blocking before the page table pages
    are freed.

    On some platforms we have hardware broadcast of TLB invalidations, thus
    the TLB flushing code doesn't necessarily need to broadcast IPIs; and
    spuriously broadcasting IPIs can hurt system performance if done too
    often.

    This problem has been solved on PowerPC and Sparc by batching up page
    table pages belonging to more than one mm_user, then scheduling an
    rcu_sched callback to free the pages. This RCU page table free logic has
    been promoted to core code and is activated when one enables
    HAVE_RCU_TABLE_FREE. Unfortunately, these architectures implement their
    own get_user_pages_fast routines.

    The RCU page table free logic coupled with an IPI broadcast on THP split
    (which is a rare event), allows one to protect a page table walker by
    merely disabling the interrupts during the walk.

    This patch provides a general RCU implementation of get_user_pages_fast
    that can be used by architectures that perform hardware broadcast of TLB
    invalidations.

    It is based heavily on the PowerPC implementation by Nick Piggin.
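
    The heart of the generic fast walk is roughly this (a sketch;
    gup_pgd_range() is an illustrative name for the top-level walker, not
    necessarily the function this patch adds):

        /* Disabling interrupts blocks the rcu_sched callback that frees
         * page table pages, and the IPI broadcast on THP split, for the
         * duration of the lockless walk. */
        local_irq_save(flags);
        gup_pgd_range(start, end, write, pages, &nr);
        local_irq_restore(flags);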

    [akpm@linux-foundation.org: various comment fixes]
    Signed-off-by: Steve Capper
    Tested-by: Dann Frazier
    Reviewed-by: Catalin Marinas
    Acked-by: Hugh Dickins
    Cc: Russell King
    Cc: Mark Rutland
    Cc: Mel Gorman
    Cc: Will Deacon
    Cc: Christoffer Dall
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steve Capper
     

24 Sep, 2014

1 commit

  • When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
    has been swapped out or is behind a filemap, this will trigger async
    readahead and return immediately. The rationale is that KVM will kick
    back the guest with an "async page fault" and allow for some other
    guest process to take over.

    If async PFs are enabled the fault is retried asap from an async
    workqueue. If not, it's retried immediately in the same code path. In
    either case the retry will not relinquish the mmap semaphore and will
    block on the IO. This is a bad thing, as other mmap semaphore users
    now stall as a function of swap or filemap latency.

    This patch ensures both the regular and async PF path re-enter the
    fault allowing for the mmap semaphore to be relinquished in the case
    of IO wait.

    Reviewed-by: Radim Krčmář
    Signed-off-by: Andres Lagar-Cavilla
    Acked-by: Andrew Morton
    Signed-off-by: Paolo Bonzini

    Andres Lagar-Cavilla
     

07 Aug, 2014

1 commit

  • Add a comment describing the circumstances in which
    __lock_page_or_retry() will or will not release the mmap_sem when
    returning 0.

    Add comments to lock_page_or_retry()'s callers (filemap_fault(),
    do_swap_page()) noting the impact on VM_FAULT_RETRY returns.

    Add comments further up the call tree, particularly replacing the
    false "We return with mmap_sem still held" comments.
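
    For example, the new comment above __lock_page_or_retry() reads
    roughly as follows (paraphrased, not the verbatim text):

        /*
         * Return values:
         * 1 - page is locked; mmap_sem is still held.
         * 0 - page is not locked.
         *     mmap_sem has been released (up_read()), unless flags had
         *     both FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT
         *     set, in which case mmap_sem is still held.
         */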

    Signed-off-by: Paul Cassella
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Cassella
     

05 Jun, 2014

5 commits