30 Dec, 2020

3 commits

  • [ Upstream commit 4509b42c38963f495b49aa50209c34337286ecbe ]

    These functions accomplish the same thing but have different
    implementations.

    unpin_user_page() has a bug where it calls mod_node_page_state() after
    calling put_page(), which creates a risk that the page could have been
    hot-unplugged from the system.

    Fix this by using put_compound_head() as the only implementation.

    __unpin_devmap_managed_user_page() and related can be deleted as well in
    favour of the simpler, but slower, version in put_compound_head() that has
    an extra atomic page_ref_sub, but always calls put_page() which internally
    contains the special devmap code.

    Move put_compound_head() to be directly after try_grab_compound_head() so
    people can find it in future.
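
    As a hedged illustration of the ordering constraint above, a minimal
    sketch of what a combined put_compound_head() looks like (names follow
    the commit text; the exact in-tree body differs):

    static void put_compound_head(struct page *page, int refs, unsigned int flags)
    {
            if (flags & FOLL_PIN)
                    mod_node_page_state(page_pgdat(page),
                                        NR_FOLL_PIN_RELEASED, refs);

            /* Drop the extra pin references first ... */
            if (refs > 1)
                    page_ref_sub(page, refs - 1);
            /*
             * ... then drop the final reference via put_page(), which also
             * handles devmap pages internally. Touching the page *after*
             * put_page() is what risked racing with memory hot-unplug.
             */
            put_page(page);
    }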

    Link: https://lkml.kernel.org/r/0-v1-6730d4ee0d32+40e6-gup_combine_put_jgg@nvidia.com
    Fixes: 1970dc6f5226 ("mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting")
    Signed-off-by: Jason Gunthorpe
    Reviewed-by: John Hubbard
    Reviewed-by: Ira Weiny
    Reviewed-by: Jan Kara
    CC: Joao Martins
    CC: Jonathan Corbet
    CC: Dan Williams
    CC: Dave Chinner
    CC: Christoph Hellwig
    CC: Jane Chu
    CC: "Kirill A. Shutemov"
    CC: Michal Hocko
    CC: Mike Kravetz
    CC: Shuah Khan
    CC: Muchun Song
    CC: Vlastimil Babka
    CC: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Jason Gunthorpe
     
  • [ Upstream commit 57efa1fe5957694fa541c9062de0a127f0b9acb0 ]

    Since commit 70e806e4e645 ("mm: Do early cow for pinned pages during
    fork() for ptes") pages under a FOLL_PIN will not be write protected
    during COW for fork. This means that pages returned from
    pin_user_pages(FOLL_WRITE) should not become write protected while the pin
    is active.

    However, there is a small race where get_user_pages_fast(FOLL_PIN) can
    establish a FOLL_PIN at the same time copy_present_page() is write
    protecting it:

    CPU 0                                   CPU 1
    get_user_pages_fast()
     internal_get_user_pages_fast()
                                            copy_page_range()
                                              pte_alloc_map_lock()
                                                copy_present_page()
                                                  atomic_read(has_pinned) == 0
                                                  page_maybe_dma_pinned() == false
     atomic_set(has_pinned, 1);
     gup_pgd_range()
      gup_pte_range()
       pte_t pte = gup_get_pte(ptep)
       pte_access_permitted(pte)
       try_grab_compound_head()
                                                  pte = pte_wrprotect(pte)
                                                  set_pte_at();
                                                pte_unmap_unlock()
     // GUP now returns with a write protected page

    The first attempt to resolve this by using the write protect caused
    problems (and was missing a barrier); see commit f3c64eda3e50 ("mm: avoid
    early COW write protect games during fork()").

    Instead wrap copy_p4d_range() with the write side of a seqcount and check
    the read side around gup_pgd_range(). If there is a collision then
    get_user_pages_fast() fails and falls back to slow GUP.

    Slow GUP is safe against this race because copy_page_range() is only
    called while holding the exclusive side of the mmap_lock on the src
    mm_struct.
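
    A hedged sketch of the shape of the fix inside the fast path, assuming
    the new mm_struct field is the seqcount named write_protect_seq and that
    this fragment sits in internal_get_user_pages_fast(); the in-tree code
    differs in detail:

    unsigned int seq = 0;
    int nr_pinned = 0;

    if (gup_flags & FOLL_PIN) {
            seq = raw_read_seqcount(&current->mm->write_protect_seq);
            if (seq & 1)            /* fork()'s copy_page_range() is running */
                    return 0;       /* caller falls back to slow GUP */
    }

    /* ... lockless walk with interrupts disabled ... */
    gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);

    if ((gup_flags & FOLL_PIN) &&
        read_seqcount_retry(&current->mm->write_protect_seq, seq)) {
            /* Collided with fork(): undo the pins and retry via slow GUP. */
            unpin_user_pages(pages, nr_pinned);
            nr_pinned = 0;
    }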

    [akpm@linux-foundation.org: coding style fixes]
    Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com

    Link: https://lkml.kernel.org/r/2-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
    Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()")
    Signed-off-by: Jason Gunthorpe
    Suggested-by: Linus Torvalds
    Reviewed-by: John Hubbard
    Reviewed-by: Jan Kara
    Reviewed-by: Peter Xu
    Acked-by: "Ahmed S. Darwish" [seqcount_t parts]
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Cc: Jann Horn
    Cc: Kirill Shutemov
    Cc: Kirill Tkhai
    Cc: Leon Romanovsky
    Cc: Michal Hocko
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Jason Gunthorpe
     
  • [ Upstream commit c28b1fc70390df32e29991eedd52bd86e7aba080 ]

    Patch series "Add a seqcount between gup_fast and copy_page_range()", v4.

    As discussed and suggested by Linus use a seqcount to close the small race
    between gup_fast and copy_page_range().

    Ahmed confirms that raw_write_seqcount_begin() is the correct API to use
    in this case and it doesn't trigger any lockdeps.

    I was able to test it using two threads, one forking and the other using
    ibv_reg_mr() to trigger GUP fast. Modifying copy_page_range() to sleep
    made the race window large enough to hit reliably and exercise the logic.

    This patch (of 2):

    The next patch in this series makes the lockless flow a little more
    complex, so move the entire block into a new function and remove a level
    of indentation. Tidy a bit of cruft:

    - addr is always the same as start, so use start

    - Use the modern check_add_overflow() for computing end = start + len
    (see the sketch below)

    - nr_pinned/pages << PAGE_SHIFT needs the LHS to be unsigned long to
    avoid shift overflow, make the variables unsigned long to avoid coding
    casts in both places. nr_pinned was missing its cast

    - The handling of ret and nr_pinned can be streamlined a bit

    No functional change.
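
    As an illustration of the check_add_overflow() and shift points above, a
    sketch of the tidied length/end computation (a fragment, not the exact
    resulting code):

    unsigned long len, end;

    /*
     * end = start + nr_pages * PAGE_SIZE, with overflow detected up front.
     * The cast keeps the shift in unsigned long, as noted above.
     */
    len = (unsigned long)nr_pages << PAGE_SHIFT;
    if (check_add_overflow(start, len, &end))
            return 0;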

    Link: https://lkml.kernel.org/r/0-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
    Link: https://lkml.kernel.org/r/1-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
    Signed-off-by: Jason Gunthorpe
    Reviewed-by: Jan Kara
    Reviewed-by: John Hubbard
    Reviewed-by: Peter Xu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Jason Gunthorpe
     

15 Nov, 2020

1 commit

  • When FOLL_PIN is passed to __get_user_pages() the page list must be put
    back using unpin_user_pages() otherwise the page pin reference persists
    in a corrupted state.

    There are two places in the unwind of __gup_longterm_locked() that put
    the pages back without checking. Normally on error this function would
    return the partial page list making this the caller's responsibility,
    but in these two cases the caller is not allowed to see these pages at
    all.

    Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages")
    Reported-by: Ira Weiny
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Reviewed-by: John Hubbard
    Cc: Aneesh Kumar K.V
    Cc: Dan Williams
    Link: https://lkml.kernel.org/r/0-v2-3ae7d9d162e2+2a7-gup_cma_fix_jgg@nvidia.com
    Signed-off-by: Linus Torvalds

    Jason Gunthorpe
     

17 Oct, 2020

2 commits

  • Properly take the mmap_lock before calling into the GUP code from
    get_dump_page(); and play nice, allowing the GUP code to drop the
    mmap_lock if it has to sleep.

    As Linus pointed out, we don't actually need the VMA because
    __get_user_pages() will flush the dcache for us if necessary.
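
    A hedged sketch of the locking shape described, assuming the current
    __get_user_pages_locked() argument order; the real get_dump_page() may
    differ in detail:

    struct page *get_dump_page(unsigned long addr)
    {
            struct mm_struct *mm = current->mm;
            struct page *page;
            int locked = 1;
            int ret;

            if (mmap_read_lock_killable(mm))
                    return NULL;
            ret = __get_user_pages_locked(mm, addr, 1, &page, NULL, &locked,
                                          FOLL_FORCE | FOLL_DUMP | FOLL_GET);
            if (locked)
                    mmap_read_unlock(mm);
            return (ret == 1) ? page : NULL;
    }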

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-7-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • Patch series "Fix ELF / FDPIC ELF core dumping, and use mmap_lock properly in there", v5.

    At the moment, we have that rather ugly mmget_still_valid() helper to work
    around the following issue: ELF core dumping doesn't
    take the mmap_sem while traversing the task's VMAs, and if anything (like
    userfaultfd) then remotely messes with the VMA tree, fireworks ensue. So
    at the moment we use mmget_still_valid() to bail out in any writers that
    might be operating on a remote mm's VMAs.

    With this series, I'm trying to get rid of the need for that as cleanly as
    possible. ("cleanly" meaning "avoid holding the mmap_lock across
    unbounded sleeps".)

    Patches 1, 2, 3 and 4 are relatively unrelated cleanups in the core
    dumping code.

    Patches 5 and 6 implement the main change: Instead of repeatedly accessing
    the VMA list with sleeps in between, we snapshot it at the start with
    proper locking, and then later we just use our copy of the VMA list. This
    ensures that the kernel won't crash, that VMA metadata in the coredump is
    consistent even in the presence of concurrent modifications, and that any
    virtual addresses that aren't being concurrently modified have their
    contents show up in the core dump properly.

    The disadvantage of this approach is that we need a bit more memory during
    core dumping for storing metadata about all VMAs.

    At the end of the series, patch 7 removes the old workaround for this
    issue (mmget_still_valid()).

    I have tested:

    - Creating a simple core dump on X86-64 still works.
    - The created coredump on X86-64 opens in GDB and looks plausible.
    - X86-64 core dumps contain the first page for executable mappings at
    offset 0, and don't contain the first page for non-executable file
    mappings or executable mappings at offset !=0.
    - NOMMU 32-bit ARM can still generate plausible-looking core dumps
    through the FDPIC implementation. (I can't test this with GDB because
    GDB is missing some structure definition for nommu ARM, but I've
    poked around in the hexdump and it looked decent.)

    This patch (of 7):

    dump_emit() is for kernel pointers, and VMAs describe userspace memory.
    Let's be tidy here and avoid accessing userspace pointers under KERNEL_DS,
    even if it probably doesn't matter much on !MMU systems - especially given
    that it looks like we can just use the same get_dump_page() as on MMU if
    we move it out of the CONFIG_MMU block.

    One small change we have to make in get_dump_page() is to use
    __get_user_pages_locked() instead of __get_user_pages(), since the latter
    doesn't exist on nommu. On mmu builds, __get_user_pages_locked() will
    just call __get_user_pages() for us.

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-1-jannh@google.com
    Link: http://lkml.kernel.org/r/20200827114932.3572699-2-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     

14 Oct, 2020

2 commits

  • As suggested by Dan Carpenter, fortify unpin_user_pages() just a bit,
    against a typical caller mistake: check if the npages arg is really a
    -ERRNO value, which would blow up the unpinning loop: WARN and return.

    If this new WARN_ON() fires, then the system *might* be leaking pages (by
    leaving them pinned), but probably not. More likely, gup/pup returned a
    hard -ERRNO error to the caller, who erroneously passed it here.
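
    A sketch of the check being described, on a simplified loop form of
    unpin_user_pages(); the in-tree function has more to it:

    void unpin_user_pages(struct page **pages, unsigned long npages)
    {
            unsigned long i;

            /*
             * A caller that mistakenly passes a gup/pup error code here
             * would turn npages into a huge unsigned value; catch that.
             */
            if (WARN_ON(IS_ERR_VALUE(npages)))
                    return;

            for (i = 0; i < npages; i++)
                    unpin_user_page(pages[i]);
    }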

    Signed-off-by: John Hubbard
    Signed-off-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Cc: Ira Weiny
    Cc: Souptick Joarder
    Link: https://lkml.kernel.org/r/20200917065706.409079-1-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     
    gup prohibits users from calling get_user_pages() with FOLL_PIN, but it
    allows users to call get_user_pages() with FOLL_LONGTERM alone. That is
    inconsistent.

    Since FOLL_LONGTERM is a stricter case of FOLL_PIN, we should prohibit
    users from calling get_user_pages() with FOLL_LONGTERM while not with
    FOLL_PIN.

    mm/gup_benchmark.c used to be the only user who did this improperly.
    But it has been fixed by moving to use pin_user_pages().

    [akpm@linux-foundation.org: fix CONFIG_MMU=n build]
    Link: https://lkml.kernel.org/r/CA+G9fYuNS3k0DVT62twfV746pfNhCSrk5sVMcOcQ1PGGnEseyw@mail.gmail.com

    Signed-off-by: Barry Song
    Signed-off-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Cc: John Hubbard
    Cc: Jan Kara
    Cc: Jérôme Glisse
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Jason Gunthorpe
    Cc: Jonathan Corbet
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Shuah Khan
    Cc: Vlastimil Babka
    Cc: Naresh Kamboju
    Link: http://lkml.kernel.org/r/20200819110100.23504-1-song.bao.hua@hisilicon.com
    Signed-off-by: Linus Torvalds

    Barry Song
     

29 Sep, 2020

1 commit

  • It seems likely this block was pasted from internal_get_user_pages_fast,
    which is not passed an mm struct and therefore uses current's. But
    __get_user_pages_locked is passed an explicit mm, and current->mm is not
    always valid. This was hit when being called from i915, which uses:

    pin_user_pages_remote ->
      __get_user_pages_remote ->
        __gup_longterm_locked ->
          __get_user_pages_locked

    Before, this would lead to an OOPS:

    BUG: kernel NULL pointer dereference, address: 0000000000000064
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    CPU: 10 PID: 1431 Comm: kworker/u33:1 Tainted: P S U O 5.9.0-rc7+ #140
    Hardware name: LENOVO 20QTCTO1WW/20QTCTO1WW, BIOS N2OET47W (1.34 ) 08/06/2020
    Workqueue: i915-userptr-acquire __i915_gem_userptr_get_pages_worker [i915]
    RIP: 0010:__get_user_pages_remote+0xd7/0x310
    Call Trace:
    __i915_gem_userptr_get_pages_worker+0xc8/0x260 [i915]
    process_one_work+0x1ca/0x390
    worker_thread+0x48/0x3c0
    kthread+0x114/0x130
    ret_from_fork+0x1f/0x30
    CR2: 0000000000000064

    This commit fixes the problem by using the mm pointer passed to the
    function rather than the bogus one in current.
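
    The shape of the fix, as a sketch of the fragment in
    __get_user_pages_locked() (variable names as in the commit text):

    if (flags & FOLL_PIN)
            atomic_set(&mm->has_pinned, 1);   /* was: &current->mm->has_pinned */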

    Fixes: 008cfe4418b3 ("mm: Introduce mm_struct.has_pinned")
    Tested-by: Chris Wilson
    Reported-by: Harald Arnesen
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Peter Xu
    Signed-off-by: Jason A. Donenfeld
    Signed-off-by: Linus Torvalds

    Jason A. Donenfeld
     

28 Sep, 2020

1 commit

  • (Commit message majorly collected from Jason Gunthorpe)

    Reduce the chance of false positives from page_maybe_dma_pinned() by
    keeping track of whether the mm_struct has ever been used with
    pin_user_pages().
    This allows cases that might drive up the page ref_count to avoid any
    penalty from handling dma_pinned pages.

    Future work is planned, to provide a more sophisticated solution, likely
    to turn it into a real counter. For now, make it atomic_t but use it as
    a boolean for simplicity.
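
    As a hedged illustration of how such a flag is meant to be consumed, a
    hypothetical wrapper (the helper name below is illustrative, not the
    kernel's):

    static inline bool page_might_be_dma_pinned(struct mm_struct *mm,
                                                struct page *page)
    {
            if (!atomic_read(&mm->has_pinned))
                    return false;   /* pin_user_pages() never used on this mm */
            return page_maybe_dma_pinned(page);
    }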

    Suggested-by: Jason Gunthorpe
    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     

27 Sep, 2020

1 commit

    Currently, to make sure that every page table entry is read just once,
    gup_fast walks perform READ_ONCE() and pass the pXd value down to the next
    gup_pXd_range function by value, e.g.:

    static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                             unsigned int flags, struct page **pages, int *nr)
    ...
            pudp = pud_offset(&p4d, addr);

    This function passes a reference on that local value copy to pXd_offset,
    and might get the very same pointer in return. This happens when the
    level is folded (on most arches), and that pointer should not be
    iterated.

    On s390, because each task might have a different 5-, 4- or 3-level
    address translation, and hence different levels folded, the logic is more
    complex, and a non-iterable pointer to a local copy leads to severe
    problems.

    Here is an example of what happens with gup_fast on s390, for a task
    with 3-level paging, crossing a 2 GB pud boundary:

    // addr = 0x1007ffff000, end = 0x10080001000
    static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                             unsigned int flags, struct page **pages, int *nr)
    {
            unsigned long next;
            pud_t *pudp;

            // pud_offset returns &p4d itself (a pointer to a value on stack)
            pudp = pud_offset(&p4d, addr);
            do {
                    // on the second iteration, reading a "random" stack value
                    pud_t pud = READ_ONCE(*pudp);

                    // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
                    next = pud_addr_end(addr, end);
                    ...
            } while (pudp++, addr = next, addr != end); // pudp++ iterating over the stack

            return 1;
    }

    This happens since s390 moved to common gup code with commit
    d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust") and
    commit 1a42010cdc26 ("s390/mm: convert to the generic
    get_user_pages_fast code").

    s390 tried to mimic static level folding by changing pXd_offset
    primitives to always calculate top level page table offset in pgd_offset
    and just return the value passed when pXd_offset has to act as folded.

    What is crucial for gup_fast and what has been overlooked is that
    PxD_SIZE/MASK and thus pXd_addr_end should also change correspondingly.
    And the latter is not possible with dynamic folding.

    To fix the issue, in addition to the pXd values, pass the original pXdp
    pointers down to the gup_pXd_range functions, and introduce
    pXd_offset_lockless helpers, which take an additional pXd entry value
    parameter. This has already been discussed in

    https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
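
    The generic fallback for such a helper is expected to look roughly like
    this (a sketch for one level; the actual patch defines the full set and
    the s390 override):

    #ifndef pud_offset_lockless
    #define pud_offset_lockless(p4dp, p4d, address) \
            pud_offset(&(p4d), (address))
    #endif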

    Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Andrew Morton
    Reviewed-by: Gerald Schaefer
    Reviewed-by: Alexander Gordeev
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Mike Rapoport
    Reviewed-by: John Hubbard
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Dave Hansen
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: Arnd Bergmann
    Cc: Andrey Ryabinin
    Cc: Heiko Carstens
    Cc: Christian Borntraeger
    Cc: Claudio Imbrenda
    Cc: [5.2+]
    Link: https://lkml.kernel.org/r/patch.git-943f1e5dcff2.your-ad-here.call-01599856292-ext-8676@work.hours
    Signed-off-by: Linus Torvalds

    Vasily Gorbik
     

05 Sep, 2020

2 commits

  • Merge emailed patches from Peter Xu:
    "This is a small series that I picked up from Linus's suggestion to
    simplify cow handling (and also make it more strict) by checking
    against page refcounts rather than mapcounts.

    This makes uffd-wp work again (verified by running upmapsort)"

    Note: this is horrendously bad timing, and making this kind of
    fundamental vm change after -rc3 is not at all how things should work.
    The saving grace is that it really is a nice simplification:

    8 files changed, 29 insertions(+), 120 deletions(-)

    The reason for the bad timing is that it turns out that commit
    17839856fd58 ("gup: document and work around 'COW can break either way'
    issue" broke not just UFFD functionality (as Peter noticed), but Mikulas
    Patocka also reports that it caused issues for strace when running in a
    DAX environment with ext4 on a persistent memory setup.

    And we can't just revert that commit without re-introducing the original
    issue that is a potential security hole, so making COW stricter (and in
    the process much simpler) is a step to then undoing the forced COW that
    broke other uses.

    Link: https://lore.kernel.org/lkml/alpine.LRH.2.02.2009031328040.6929@file01.intranet.prod.int.rdu2.redhat.com/

    * emailed patches from Peter Xu:
    mm: Add PGREUSE counter
    mm/gup: Remove enfornced COW mechanism
    mm/ksm: Remove reuse_ksm_page()
    mm: do_wp_page() simplification

    Linus Torvalds
     
  • With the more strict (but greatly simplified) page reuse logic in
    do_wp_page(), we can safely go back to the world where cow is not
    enforced with writes.

    This essentially reverts commit 17839856fd58 ("gup: document and work
    around 'COW can break either way' issue"). There are some context
    differences due to some changes later on around it:

    2170ecfa7688 ("drm/i915: convert get_user_pages() --> pin_user_pages()", 2020-06-03)
    376a34efa4ee ("mm/gup: refactor and de-duplicate gup_fast() code", 2020-06-03)

    Some lines moved back and forth with those, but this revert patch should
    have stripped out and covered all the enforced cow bits anyway.

    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     

04 Sep, 2020

2 commits

  • Merge gate page refcount fix from Dave Hansen:
    "During the conversion over to pin_user_pages(), gate pages were missed.

    The fix is pretty simple, and is accompanied by a new test from Andy
    which probably would have caught this earlier"

    * emailed patches from Dave Hansen:
    selftests/x86/test_vsyscall: Improve the process_vm_readv() test
    mm: fix pin vs. gup mismatch with gate pages

    Linus Torvalds
     
  • Gate pages were missed when converting from get to pin_user_pages().
    This can lead to refcount imbalances. This is reliably and quickly
    reproducible running the x86 selftests when vsyscall=emulate is enabled
    (the default). Fix by using try_grab_page() with appropriate flags
    passed.

    The long story:

    Today, pin_user_pages() and get_user_pages() are similar interfaces for
    manipulating page reference counts. However, "pins" use a "bias" value
    and manipulate the actual reference count by 1024 instead of 1 used by
    plain "gets".

    That means that pin_user_pages() must be matched with unpin_user_pages()
    and can't be mixed with a plain put_user_pages() or put_page().

    Enter gate pages, like the vsyscall page. They are pages usually in the
    kernel image, but which are mapped to userspace. Userspace is allowed
    access to them, including interfaces using get/pin_user_pages(). The
    refcount of these kernel pages is manipulated just like a normal user
    page on the get/pin side so that the put/unpin side can work the same
    for normal user pages or gate pages.

    get_gate_page() uses try_get_page() which only bumps the refcount by
    1, not 1024, even if called in the pin_user_pages() path. If someone
    pins a gate page, this happens:

    pin_user_pages()
      get_gate_page()
        try_get_page()          // bump refcount +1
    ... some time later
    unpin_user_pages()
      page_ref_sub_and_test(page, 1024)

    ... and boom, we get a refcount off by 1023. This is reliably and
    quickly reproducible running the x86 selftests when booted with
    vsyscall=emulate (the default). The selftests use ptrace(), but I
    suspect anything using pin_user_pages() on gate pages could hit this.

    To fix it, simply use try_grab_page() instead of try_get_page(), and
    pass 'gup_flags' in so that FOLL_PIN can be respected.
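
    The fix boils down to a change of roughly this shape in get_gate_page()
    (a sketch; the error label is illustrative):

    /* was: if (unlikely(!try_get_page(*page))) { ... } */
    if (unlikely(!try_grab_page(*page, gup_flags))) {
            ret = -ENOMEM;
            goto unmap;
    }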

    This bug traces back to the very beginning of the FOLL_PIN support in
    commit 3faa52c03f44 ("mm/gup: track FOLL_PIN pages"), which showed up in
    the 5.7 release.

    Signed-off-by: Dave Hansen
    Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages")
    Reported-by: Peter Zijlstra
    Reviewed-by: John Hubbard
    Acked-by: Andy Lutomirski
    Cc: x86@kernel.org
    Cc: Jann Horn
    Cc: Andrew Morton
    Cc: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

15 Aug, 2020

1 commit

  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Aug, 2020

6 commits

  • After the cleanup of page fault accounting, gup does not need to pass
    task_struct around any more. Remove that parameter in the whole gup
    stack.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Here're the last pieces of page fault accounting that were still done
    outside handle_mm_fault() where we still have regs==NULL when calling
    handle_mm_fault():

    arch/powerpc/mm/copro_fault.c:  copro_handle_mm_fault
    arch/sparc/mm/fault_32.c:       force_user_fault
    arch/um/kernel/trap.c:          handle_page_fault
    mm/gup.c:                       faultin_page
                                    fixup_user_fault
    mm/hmm.c:                       hmm_vma_fault
    mm/ksm.c:                       break_ksm

    Some of them have the issue of duplicated accounting for page fault
    retries. Some of them didn't do the accounting at all.

    This patch cleans all these up by letting handle_mm_fault() do per-task
    page fault accounting even if regs==NULL (though we'll still skip the perf
    event accounting). With that, we can safely remove all the outliers now.

    There's another functional change in that now we account the page faults
    to the caller of gup, rather than the task_struct that was passed into the
    gup code. More information on this can be found at [1].

    After this patch, below things should never be touched again outside
    handle_mm_fault():

    - task_struct.[maj|min]_flt
    - PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN]

    [1] https://lore.kernel.org/lkml/CAHk-=wj_V2Tps2QrMn20_W0OJF9xqNh52XSGA42s-ZJ8Y+GyKw@mail.gmail.com/

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Albert Ou
    Cc: Alexander Gordeev
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Chris Zankel
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Geert Uytterhoeven
    Cc: Gerald Schaefer
    Cc: Greentime Hu
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: James E.J. Bottomley
    Cc: John Hubbard
    Cc: Jonas Bonn
    Cc: Ley Foon Tan
    Cc: "Luck, Tony"
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Nick Hu
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Richard Henderson
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Stefan Kristiansson
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200707225021.200906-25-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Patch series "mm: Page fault accounting cleanups", v5.

    This is v5 of the pf accounting cleanup series. It originates from Gerald
    Schaefer's report on an issue a week ago regarding to incorrect page fault
    accountings for retried page fault after commit 4064b9827063 ("mm: allow
    VM_FAULT_RETRY for multiple times"):

    https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/

    What this series did:

    - Correct page fault accounting: we do accounting for a page fault
    (no matter whether it's from #PF handling, or gup, or anything else)
    only with the one that completed the fault. For example, page fault
    retries should not be counted in page fault counters. Same to the
    perf events.

    - Unify definition of PERF_COUNT_SW_PAGE_FAULTS: currently this perf
    event is used in an ad-hoc way across different archs.

    Case (1): for many archs it's done at the entry of a page fault
    handler, so that it will also cover e.g. erroneous faults.

    Case (2): for some other archs, it is only accounted when the page
    fault is resolved successfully.

    Case (3): there're still quite some archs that have not enabled
    this perf event.

    Since this series will touch nearly all the archs, we unify this
    perf event to always follow case (1), which is the one that makes the most
    sense. And since we moved the accounting into handle_mm_fault(), the
    other two MAJ/MIN perf events are well taken care of naturally.

    - Unify definition of "major faults": the definition of "major
    fault" is slightly changed when used in accounting (not
    VM_FAULT_MAJOR). More information in patch 1.

    - Always account the page fault onto the one that triggered the page
    fault. This does not matter much for #PF handlings, but mostly for
    gup. More information on this in patch 25.

    Patchset layout:

    Patch 1: Introduced the accounting in handle_mm_fault(), not enabled.
    Patch 2-23: Enable the new accounting for arch #PF handlers one by one.
    Patch 24: Enable the new accounting for the rest outliers (gup, iommu, etc.)
    Patch 25: Cleanup GUP task_struct pointer since it's not needed any more

    This patch (of 25):

    This is a preparation patch to move page fault accounting into the
    general code in handle_mm_fault(). This includes both the per-task
    maj_flt/min_flt counters, and the major/minor page fault perf events. To
    do this, the pt_regs pointer is passed into handle_mm_fault().

    PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault
    handlers.

    So far, all the pt_regs pointers passed into handle_mm_fault() are NULL,
    which means this patch should have no intended functional change.
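
    After the full series, arch #PF handlers end up passing their register
    state down, roughly like this (sketch of the new call shape):

    /* arch/<arch>/mm/fault.c, inside the page fault handler */
    fault = handle_mm_fault(vma, address, flags, regs);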

    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Albert Ou
    Cc: Alexander Gordeev
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Chris Zankel
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Geert Uytterhoeven
    Cc: Gerald Schaefer
    Cc: Greentime Hu
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: James E.J. Bottomley
    Cc: John Hubbard
    Cc: Jonas Bonn
    Cc: Ley Foon Tan
    Cc: "Luck, Tony"
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Nick Hu
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Richard Henderson
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Stefan Kristiansson
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200707225021.200906-1-peterx@redhat.com
    Link: http://lkml.kernel.org/r/20200707225021.200906-2-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • There is a well-defined migration target allocation callback. Use it.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: "Aneesh Kumar K . V"
    Cc: Christoph Hellwig
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1596180906-8442-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    new_non_cma_page() in gup.c needs to allocate a new page that is not in
    the CMA area. new_non_cma_page() implements this by using the allocation
    scope APIs.

    However, there is a work-around for hugetlb. The normal hugetlb page
    allocation API for migration is alloc_huge_page_nodemask(). It consists
    of two steps. The first is dequeuing from the pool. The second is, if
    there is no available page in the pool, allocating with the page allocator.

    new_non_cma_page() can't use this API since the first step (dequeue) isn't
    aware of the scope API used to exclude the CMA area. So, new_non_cma_page()
    exports the hugetlb-internal function for the second step,
    alloc_migrate_huge_page(), to global scope and uses it directly. This is
    suboptimal since hugetlb pages in the pool cannot be utilized.

    This patch fixes this situation by making the dequeue function in
    hugetlb CMA aware. In the dequeue function, CMA memory is skipped if
    the PF_MEMALLOC_NOCMA flag is found.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Mike Kravetz
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: "Aneesh Kumar K . V"
    Cc: Christoph Hellwig
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1596180906-8442-2-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    We have a well-defined scope API to exclude the CMA region. Use it rather
    than manipulating gfp_mask manually. With this change, we can now restore
    __GFP_MOVABLE for gfp_mask like the usual migration target allocation. As
    a result, ZONE_MOVABLE is also searched by the page allocator. For
    hugetlb, gfp_mask is redefined since it has a regular allocation mask
    filter for migration targets. __GFP_NOWARN is added to the hugetlb
    gfp_mask filter since a new user of the gfp_mask filter, gup, wants to be
    silent when allocation fails.

    Note that this can be considered as a fix for the commit 9a4e9f3b2d73
    ("mm: update get_user_pages_longterm to migrate pages allocated from CMA
    region"). However, a "Fixes" tag isn't added here since the previous
    behaviour is merely suboptimal and doesn't cause any problem.

    Suggested-by: Michal Hocko
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Christoph Hellwig
    Cc: Roman Gushchin
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Link: http://lkml.kernel.org/r/1596180906-8442-1-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

08 Aug, 2020

1 commit


20 Jun, 2020

2 commits

  • Since commit 9e343b467c70 ("READ_ONCE: Enforce atomicity for
    {READ,WRITE}_ONCE() memory accesses") it is not possible anymore to
    use READ_ONCE() to access complex page table entries like the one
    defined for powerpc 8xx with 16k size pages.

    Define a ptep_get() helper that architectures can override instead
    of performing a READ_ONCE() on the page table entry pointer.
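
    The generic helper is expected to be a trivial READ_ONCE() wrapper,
    roughly (a sketch; the in-tree guard macro may differ):

    #ifndef ptep_get
    static inline pte_t ptep_get(pte_t *ptep)
    {
            return READ_ONCE(*ptep);
    }
    #endif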

    Fixes: 9e343b467c70 ("READ_ONCE: Enforce atomicity for {READ,WRITE}_ONCE() memory accesses")
    Signed-off-by: Christophe Leroy
    Acked-by: Will Deacon
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/087fa12b6e920e32315136b998aa834f99242695.1592225558.git.christophe.leroy@csgroup.eu

    Christophe Leroy
     
    gup_hugepte() reads hugepage table entries, but it can't read them
    directly; huge_ptep_get() must be used.

    Fixes: 9e343b467c70 ("READ_ONCE: Enforce atomicity for {READ,WRITE}_ONCE() memory accesses")
    Signed-off-by: Christophe Leroy
    Acked-by: Will Deacon
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/ffc3714334c3bfaca6f13788ad039e8759ae413f.1592225558.git.christophe.leroy@csgroup.eu

    Christophe Leroy
     

10 Jun, 2020

6 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Convert comments that reference old mmap_sem APIs to reference
    corresponding new mmap locking APIs instead.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Davidlohr Bueso
    Reviewed-by: Daniel Jordan
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Rename the mmap_sem field to mmap_lock. Any new uses of this lock should
    now go through the new mmap locking api. The mmap_lock is still
    implemented as a rwsem, though this could change in the future.

    [akpm@linux-foundation.org: fix it for mm-gup-might_lock_readmmap_sem-in-get_user_pages_fast.patch]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Davidlohr Bueso
    Reviewed-by: Daniel Jordan
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-11-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Add new APIs to assert that mmap_sem is held.

    Using this instead of rwsem_is_locked and lockdep_assert_held[_write]
    makes the assertions more tolerant of future changes to the lock type.
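
    A sketch of what such an assertion helper looks like, assuming the lock
    field is named mmap_lock as in the rest of this series:

    static inline void mmap_assert_locked(struct mm_struct *mm)
    {
            lockdep_assert_held(&mm->mmap_lock);
            VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
    }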

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-10-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Patch series "mm: consolidate definitions of page table accessors", v2.

    The low level page table accessors (pXY_index(), pXY_offset()) are
    duplicated across all architectures and sometimes more than once. For
    instance, we have 31 definition of pgd_offset() for 25 supported
    architectures.

    Most of these definitions are actually identical and typically it boils
    down to, e.g.

    static inline unsigned long pmd_index(unsigned long address)
    {
            return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
            return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
    }

    These definitions can be shared among 90% of the arches provided
    XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

    For architectures that really need a custom version there is always the
    possibility to override the generic version with the usual ifdef magic.

    These patches introduce include/linux/pgtable.h that replaces
    include/asm-generic/pgtable.h and add the definitions of the page table
    accessors to the new header.

    This patch (of 12):

    The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
    functions involving page table manipulations, e.g. pte_alloc() and
    pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
    in the files that include <linux/mm.h>.

    The include statements in such cases are removed with a simple loop:

    for f in $(git grep -l "include <asm/pgtable.h>") ; do
            sed -i -e '/include <asm\/pgtable.h>/ d' $f
    done

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mike Rapoport
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

09 Jun, 2020

3 commits

    All of the pin_user_pages*() API calls will cause pages to be
    dma-pinned. As such, they are all suitable for DMA, RDMA, and/or
    Direct IO.

    The documentation should say so, but it was instead saying that three of
    the API calls were only suitable for Direct IO. This was discovered
    when a reviewer wondered why an API call that specifically recommended
    against Case 2 (DMA/RDMA) was being used in a DMA situation [1].

    Fix this by simply deleting those claims. The gup.c comments already
    refer to the more extensive Documentation/core-api/pin_user_pages.rst,
    which does have the correct guidance. So let's just write it once,
    there.

    [1] https://lore.kernel.org/r/20200529074658.GM30374@kadam

    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Pankaj Gupta
    Acked-by: Souptick Joarder
    Cc: Dan Carpenter
    Cc: Jan Kara
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20200529084515.46259-1-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • Patch series "mm/gup: introduce pin_user_pages_locked(), use it in frame_vector.c", v2.

    This adds yet one more pin_user_pages*() variant, and uses that to
    convert mm/frame_vector.c.

    With this, along with maybe 20 or 30 other recent patches in various
    trees, we are close to having the relevant gup call sites
    converted--with the notable exception of the bio/block layer.

    This patch (of 2):

    Introduce pin_user_pages_locked(), which is nearly identical to
    get_user_pages_locked() except that it sets FOLL_PIN and rejects
    FOLL_GET.

    As with other pairs of get_user_pages*() and pin_user_pages() API calls,
    it's prudent to assert that FOLL_PIN is *not* set in the
    get_user_pages*() call, so add that as part of this.
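
    A hedged sketch of the new variant's expected shape, assuming the
    internal locked helper's current argument order; the in-tree version
    carries additional checks and documentation:

    long pin_user_pages_locked(unsigned long start, unsigned long nr_pages,
                               unsigned int gup_flags, struct page **pages,
                               int *locked)
    {
            /* FOLL_GET and FOLL_PIN are mutually exclusive. */
            if (WARN_ON_ONCE(gup_flags & FOLL_GET))
                    return -EINVAL;

            gup_flags |= FOLL_PIN;
            return __get_user_pages_locked(current, current->mm, start,
                                           nr_pages, pages, NULL, locked,
                                           gup_flags | FOLL_TOUCH);
    }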

    [jhubbard@nvidia.com: v2]
    Link: http://lkml.kernel.org/r/20200531234131.770697-2-jhubbard@nvidia.com

    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Pankaj Gupta
    Cc: Daniel Vetter
    Cc: Jérôme Glisse
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Dave Chinner
    Cc: Souptick Joarder
    Link: http://lkml.kernel.org/r/20200531234131.770697-1-jhubbard@nvidia.com
    Link: http://lkml.kernel.org/r/20200527223243.884385-1-jhubbard@nvidia.com
    Link: http://lkml.kernel.org/r/20200527223243.884385-2-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • API __get_user_pages_fast() renamed to get_user_pages_fast_only() to
    align with pin_user_pages_fast_only().

    As part of this we will get rid of the write parameter. Instead, the
    caller will pass FOLL_WRITE to get_user_pages_fast_only(). This will not
    change any existing functionality of the API.

    All the callers are changed to pass FOLL_WRITE.

    Also introduce get_user_page_fast_only(), and use it in a few places
    that hard-code nr_pages to 1.

    Updated the documentation of the API.
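
    A before/after sketch of a typical caller (the argument values are
    illustrative):

    /* before: boolean "write" argument */
    nr = __get_user_pages_fast(addr, 1, 1, &page);

    /* after: gup_flags, so callers pass FOLL_WRITE explicitly */
    nr = get_user_pages_fast_only(addr, 1, FOLL_WRITE, &page);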

    Signed-off-by: Souptick Joarder
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Reviewed-by: Paul Mackerras [arch/powerpc/kvm]
    Cc: Matthew Wilcox
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Mark Rutland
    Cc: Alexander Shishkin
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Paolo Bonzini
    Cc: Stephen Rothwell
    Cc: Mike Rapoport
    Cc: Aneesh Kumar K.V
    Cc: Michal Suchanek
    Link: http://lkml.kernel.org/r/1590396812-31277-1-git-send-email-jrdr.linux@gmail.com
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     

04 Jun, 2020

5 commits

  • ba841078cd05 ("mm/mempolicy: Allow lookup_node() to handle fatal signal")
    has added a special casing for 0 return value because that was a possible
    gup return value when interrupted by fatal signal. This has been fixed by
    ae46d2aa6a7f ("mm/gup: Let __get_user_pages_locked() return -EINTR for
    fatal signal") in the meantime, so ba841078cd05 can be reverted.

    This patch however doesn't go all the way to revert it because the check
    for 0 is wrong and confusing here. Firstly it is inherently unsafe to
    access the page when get_user_pages_locked returns 0 (aka no page
    returned).

    Fortunately this will not happen because get_user_pages_locked will not
    return 0 when nr_pages > 0 unless FOLL_NOWAIT is specified, which is not
    the case here. Document this potential error code in the gup code while we
    are at it.
    are at it.

    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Cc: Peter Xu
    Link: http://lkml.kernel.org/r/20200421071026.18394-1-mhocko@kernel.org
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Instead of scattering these assertions across the drivers, do this
    assertion inside the core of get_user_pages_fast*() functions. That also
    includes pin_user_pages_fast*() routines.

    Add a might_lock_read(mmap_sem) call to internal_get_user_pages_fast().
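
    The added assertion looks roughly like this, near the top of
    internal_get_user_pages_fast() (a sketch; at this point the lock was
    still called mmap_sem):

    /*
     * Unless the caller insists on the atomic-only fast path
     * (FOLL_FAST_ONLY), record that mmap_sem may be taken for reading.
     */
    if (!(gup_flags & FOLL_FAST_ONLY))
            might_lock_read(&current->mm->mmap_sem);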

    Suggested-by: Matthew Wilcox
    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox
    Cc: Michel Lespinasse
    Cc: Jason Gunthorpe
    Link: http://lkml.kernel.org/r/20200522010443.1290485-1-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • This is the FOLL_PIN equivalent of __get_user_pages_fast(), except with a
    more descriptive name, and gup_flags instead of a boolean "write" in the
    argument list.
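
    A sketch of the new routine's expected shape (the in-tree version
    differs in minor details):

    int pin_user_pages_fast_only(unsigned long start, int nr_pages,
                                 unsigned int gup_flags, struct page **pages)
    {
            /* FOLL_GET and FOLL_PIN are mutually exclusive. */
            if (WARN_ON_ONCE(gup_flags & FOLL_GET))
                    return 0;

            gup_flags |= FOLL_PIN | FOLL_FAST_ONLY;
            return internal_get_user_pages_fast(start, nr_pages,
                                                gup_flags, pages);
    }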

    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Reviewed-by: Chris Wilson
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Jani Nikula
    Cc: "Joonas Lahtinen"
    Cc: Matthew Auld
    Cc: Matthew Wilcox
    Cc: Rodrigo Vivi
    Cc: Souptick Joarder
    Cc: Tvrtko Ursulin
    Link: http://lkml.kernel.org/r/20200519002124.2025955-4-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     
    There were two nearly identical sets of code for the gup_fast() style of
    walking the page tables with interrupts disabled. This has led to the
    usual maintenance problems that arise from having duplicated code.

    There is already a core internal routine in gup.c for gup_fast(), so just
    enhance it very slightly: allow skipping the fall-back to "slow" (regular)
    get_user_pages(), via the new FOLL_FAST_ONLY flag. Then, just call
    internal_get_user_pages_fast() from __get_user_pages_fast(), and adjust
    the API to match pre-existing API behavior.

    There is a change in behavior from this refactoring: the nested form of
    interrupt disabling is used in all gup_fast() variants now. That's
    because there is only one place that interrupt disabling for page walking
    is done, and so the safer form is required. This should, if anything,
    eliminate possible (rare) bugs, because the non-nested form of enabling
    interrupts was fragile at best.

    [jhubbard@nvidia.com: fixup]
    Link: http://lkml.kernel.org/r/20200521233841.1279742-1-jhubbard@nvidia.com
    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Reviewed-by: Chris Wilson
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Jani Nikula
    Cc: "Joonas Lahtinen"
    Cc: Matthew Auld
    Cc: Matthew Wilcox
    Cc: Rodrigo Vivi
    Cc: Souptick Joarder
    Cc: Tvrtko Ursulin
    Link: http://lkml.kernel.org/r/20200519002124.2025955-3-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • Patch series "mm/gup, drm/i915: refactor gup_fast, convert to pin_user_pages()", v2.

    In order to convert the drm/i915 driver from get_user_pages() to
    pin_user_pages(), a FOLL_PIN equivalent of __get_user_pages_fast() was
    required. That led to refactoring __get_user_pages_fast(), with the
    following goals:

    1) As above: provide a pin_user_pages*() routine for drm/i915 to call,
    in place of __get_user_pages_fast(),

    2) Get rid of the gup.c duplicate code for walking page tables with
    interrupts disabled. This duplicate code is a minor maintenance
    problem anyway.

    3) Make it easy for an upcoming patch from Souptick, which aims to
    convert __get_user_pages_fast() to use a gup_flags argument, instead
    of a bool writeable arg. Also, if this series looks good, we can
    ask Souptick to change the name as well, to whatever the consensus
    is. My initial recommendation is: get_user_pages_fast_only(), to
    match the new pin_user_pages_fast_only().

    This patch (of 4):

    This is in order to avoid a forward declaration of
    internal_get_user_pages_fast(), in the next patch.

    This is code movement only--all generated code should be identical.

    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Reviewed-by: Chris Wilson
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Jani Nikula
    Cc: "Joonas Lahtinen"
    Cc: Matthew Auld
    Cc: Matthew Wilcox
    Cc: Rodrigo Vivi
    Cc: Souptick Joarder
    Cc: Tvrtko Ursulin
    Link: http://lkml.kernel.org/r/20200522051931.54191-1-jhubbard@nvidia.com
    Link: http://lkml.kernel.org/r/20200519002124.2025955-1-jhubbard@nvidia.com
    Link: http://lkml.kernel.org/r/20200519002124.2025955-2-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     

03 Jun, 2020

1 commit

  • Merge updates from Andrew Morton:
    "A few little subsystems and a start of a lot of MM patches.

    Subsystems affected by this patch series: squashfs, ocfs2, parisc,
    vfs. With mm subsystems: slab-generic, slub, debug, pagecache, gup,
    swap, memcg, pagemap, memory-failure, vmalloc, kasan"

    * emailed patches from Andrew Morton: (128 commits)
    kasan: move kasan_report() into report.c
    mm/mm_init.c: report kasan-tag information stored in page->flags
    ubsan: entirely disable alignment checks under UBSAN_TRAP
    kasan: fix clang compilation warning due to stack protector
    x86/mm: remove vmalloc faulting
    mm: remove vmalloc_sync_(un)mappings()
    x86/mm/32: implement arch_sync_kernel_mappings()
    x86/mm/64: implement arch_sync_kernel_mappings()
    mm/ioremap: track which page-table levels were modified
    mm/vmalloc: track which page-table levels were modified
    mm: add functions to track page directory modifications
    s390: use __vmalloc_node in stack_alloc
    powerpc: use __vmalloc_node in alloc_vm_stack
    arm64: use __vmalloc_node in arch_alloc_vmap_stack
    mm: remove vmalloc_user_node_flags
    mm: switch the test_vmalloc module to use __vmalloc_node
    mm: remove __vmalloc_node_flags_caller
    mm: remove both instances of __vmalloc_node_flags
    mm: remove the prot argument to __vmalloc_node
    mm: remove the pgprot argument to __vmalloc
    ...

    Linus Torvalds