30 Dec, 2020

3 commits

  • [ Upstream commit 4509b42c38963f495b49aa50209c34337286ecbe ]

    These functions accomplish the same thing but have different
    implementations.

    unpin_user_page() has a bug where it calls mod_node_page_state() after
    calling put_page(), which creates a risk that the page could have been
    hot-unplugged from the system.

    Fix this by using put_compound_head() as the only implementation.

    __unpin_devmap_managed_user_page() and related can be deleted as well in
    favour of the simpler, but slower, version in put_compound_head() that has
    an extra atomic page_ref_sub, but always calls put_page() which internally
    contains the special devmap code.

    Move put_compound_head() to be directly after try_grab_compound_head() so
    people can find it in future.
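
    As a hedged illustration of the ordering constraint above, a minimal
    sketch of what a combined put_compound_head() looks like (names follow
    the commit text; the exact in-tree body differs):

    static void put_compound_head(struct page *page, int refs, unsigned int flags)
    {
            if (flags & FOLL_PIN)
                    mod_node_page_state(page_pgdat(page),
                                        NR_FOLL_PIN_RELEASED, refs);

            /* Drop the extra pin references first ... */
            if (refs > 1)
                    page_ref_sub(page, refs - 1);
            /*
             * ... then drop the final reference via put_page(), which also
             * handles devmap pages internally. Touching the page *after*
             * put_page() is what risked racing with memory hot-unplug.
             */
            put_page(page);
    }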

    Link: https://lkml.kernel.org/r/0-v1-6730d4ee0d32+40e6-gup_combine_put_jgg@nvidia.com
    Fixes: 1970dc6f5226 ("mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting")
    Signed-off-by: Jason Gunthorpe
    Reviewed-by: John Hubbard
    Reviewed-by: Ira Weiny
    Reviewed-by: Jan Kara
    CC: Joao Martins
    CC: Jonathan Corbet
    CC: Dan Williams
    CC: Dave Chinner
    CC: Christoph Hellwig
    CC: Jane Chu
    CC: "Kirill A. Shutemov"
    CC: Michal Hocko
    CC: Mike Kravetz
    CC: Shuah Khan
    CC: Muchun Song
    CC: Vlastimil Babka
    CC: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Jason Gunthorpe
     
  • [ Upstream commit 57efa1fe5957694fa541c9062de0a127f0b9acb0 ]

    Since commit 70e806e4e645 ("mm: Do early cow for pinned pages during
    fork() for ptes") pages under a FOLL_PIN will not be write protected
    during COW for fork. This means that pages returned from
    pin_user_pages(FOLL_WRITE) should not become write protected while the pin
    is active.

    However, there is a small race where get_user_pages_fast(FOLL_PIN) can
    establish a FOLL_PIN at the same time copy_present_page() is write
    protecting it:

    CPU 0                                   CPU 1
    get_user_pages_fast()
     internal_get_user_pages_fast()
                                            copy_page_range()
                                              pte_alloc_map_lock()
                                                copy_present_page()
                                                  atomic_read(has_pinned) == 0
                                                  page_maybe_dma_pinned() == false
     atomic_set(has_pinned, 1);
     gup_pgd_range()
      gup_pte_range()
       pte_t pte = gup_get_pte(ptep)
       pte_access_permitted(pte)
       try_grab_compound_head()
                                                  pte = pte_wrprotect(pte)
                                                  set_pte_at();
                                                pte_unmap_unlock()
     // GUP now returns with a write protected page

    The first attempt to resolve this by using the write protect caused
    problems (and was missing a barrier); see commit f3c64eda3e50 ("mm: avoid
    early COW write protect games during fork()").

    Instead wrap copy_p4d_range() with the write side of a seqcount and check
    the read side around gup_pgd_range(). If there is a collision then
    get_user_pages_fast() fails and falls back to slow GUP.

    Slow GUP is safe against this race because copy_page_range() is only
    called while holding the exclusive side of the mmap_lock on the src
    mm_struct.
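
    A hedged sketch of the shape of the fix inside the fast path, assuming
    the new mm_struct field is the seqcount named write_protect_seq and that
    this fragment sits in internal_get_user_pages_fast(); the in-tree code
    differs in detail:

    unsigned int seq = 0;
    int nr_pinned = 0;

    if (gup_flags & FOLL_PIN) {
            seq = raw_read_seqcount(&current->mm->write_protect_seq);
            if (seq & 1)            /* fork()'s copy_page_range() is running */
                    return 0;       /* caller falls back to slow GUP */
    }

    /* ... lockless walk with interrupts disabled ... */
    gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);

    if ((gup_flags & FOLL_PIN) &&
        read_seqcount_retry(&current->mm->write_protect_seq, seq)) {
            /* Collided with fork(): undo the pins and retry via slow GUP. */
            unpin_user_pages(pages, nr_pinned);
            nr_pinned = 0;
    }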

    [akpm@linux-foundation.org: coding style fixes]
    Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com

    Link: https://lkml.kernel.org/r/2-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
    Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()")
    Signed-off-by: Jason Gunthorpe
    Suggested-by: Linus Torvalds
    Reviewed-by: John Hubbard
    Reviewed-by: Jan Kara
    Reviewed-by: Peter Xu
    Acked-by: "Ahmed S. Darwish" [seqcount_t parts]
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Cc: Jann Horn
    Cc: Kirill Shutemov
    Cc: Kirill Tkhai
    Cc: Leon Romanovsky
    Cc: Michal Hocko
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Jason Gunthorpe
     
  • [ Upstream commit c28b1fc70390df32e29991eedd52bd86e7aba080 ]

    Patch series "Add a seqcount between gup_fast and copy_page_range()", v4.

    As discussed and suggested by Linus use a seqcount to close the small race
    between gup_fast and copy_page_range().

    Ahmed confirms that raw_write_seqcount_begin() is the correct API to use
    in this case and it doesn't trigger any lockdeps.

    I was able to test it using two threads, one forking and the other using
    ibv_reg_mr() to trigger GUP fast. Modifying copy_page_range() to sleep
    made the race window large enough to hit reliably and exercise the logic.

    This patch (of 2):

    The next patch in this series makes the lockless flow a little more
    complex, so move the entire block into a new function and remove a level
    of indentation. Tidy a bit of cruft:

    - addr is always the same as start, so use start

    - Use the modern check_add_overflow() for computing end = start + len
    (see the sketch below)

    - nr_pinned/pages << PAGE_SHIFT needs the LHS to be unsigned long to
    avoid shift overflow, make the variables unsigned long to avoid coding
    casts in both places. nr_pinned was missing its cast

    - The handling of ret and nr_pinned can be streamlined a bit

    No functional change.
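
    As an illustration of the check_add_overflow() and shift points above, a
    sketch of the tidied length/end computation (a fragment, not the exact
    resulting code):

    unsigned long len, end;

    /*
     * end = start + nr_pages * PAGE_SIZE, with overflow detected up front.
     * The cast keeps the shift in unsigned long, as noted above.
     */
    len = (unsigned long)nr_pages << PAGE_SHIFT;
    if (check_add_overflow(start, len, &end))
            return 0;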

    Link: https://lkml.kernel.org/r/0-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
    Link: https://lkml.kernel.org/r/1-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
    Signed-off-by: Jason Gunthorpe
    Reviewed-by: Jan Kara
    Reviewed-by: John Hubbard
    Reviewed-by: Peter Xu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Jason Gunthorpe
     

15 Nov, 2020

1 commit

  • When FOLL_PIN is passed to __get_user_pages() the page list must be put
    back using unpin_user_pages() otherwise the page pin reference persists
    in a corrupted state.

    There are two places in the unwind of __gup_longterm_locked() that put
    the pages back without checking. Normally on error this function would
    return the partial page list making this the caller's responsibility,
    but in these two cases the caller is not allowed to see these pages at
    all.

    Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages")
    Reported-by: Ira Weiny
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Reviewed-by: John Hubbard
    Cc: Aneesh Kumar K.V
    Cc: Dan Williams
    Link: https://lkml.kernel.org/r/0-v2-3ae7d9d162e2+2a7-gup_cma_fix_jgg@nvidia.com
    Signed-off-by: Linus Torvalds

    Jason Gunthorpe
     

17 Oct, 2020

2 commits

  • Properly take the mmap_lock before calling into the GUP code from
    get_dump_page(); and play nice, allowing the GUP code to drop the
    mmap_lock if it has to sleep.

    As Linus pointed out, we don't actually need the VMA because
    __get_user_pages() will flush the dcache for us if necessary.
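
    A hedged sketch of the locking shape described, assuming the current
    __get_user_pages_locked() argument order; the real get_dump_page() may
    differ in detail:

    struct page *get_dump_page(unsigned long addr)
    {
            struct mm_struct *mm = current->mm;
            struct page *page;
            int locked = 1;
            int ret;

            if (mmap_read_lock_killable(mm))
                    return NULL;
            ret = __get_user_pages_locked(mm, addr, 1, &page, NULL, &locked,
                                          FOLL_FORCE | FOLL_DUMP | FOLL_GET);
            if (locked)
                    mmap_read_unlock(mm);
            return (ret == 1) ? page : NULL;
    }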

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-7-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • Patch series "Fix ELF / FDPIC ELF core dumping, and use mmap_lock properly in there", v5.

    At the moment, we have that rather ugly mmget_still_valid() helper to work
    around the following issue: ELF core dumping doesn't
    take the mmap_sem while traversing the task's VMAs, and if anything (like
    userfaultfd) then remotely messes with the VMA tree, fireworks ensue. So
    at the moment we use mmget_still_valid() to bail out in any writers that
    might be operating on a remote mm's VMAs.

    With this series, I'm trying to get rid of the need for that as cleanly as
    possible. ("cleanly" meaning "avoid holding the mmap_lock across
    unbounded sleeps".)

    Patches 1, 2, 3 and 4 are relatively unrelated cleanups in the core
    dumping code.

    Patches 5 and 6 implement the main change: Instead of repeatedly accessing
    the VMA list with sleeps in between, we snapshot it at the start with
    proper locking, and then later we just use our copy of the VMA list. This
    ensures that the kernel won't crash, that VMA metadata in the coredump is
    consistent even in the presence of concurrent modifications, and that any
    virtual addresses that aren't being concurrently modified have their
    contents show up in the core dump properly.

    The disadvantage of this approach is that we need a bit more memory during
    core dumping for storing metadata about all VMAs.

    At the end of the series, patch 7 removes the old workaround for this
    issue (mmget_still_valid()).

    I have tested:

    - Creating a simple core dump on X86-64 still works.
    - The created coredump on X86-64 opens in GDB and looks plausible.
    - X86-64 core dumps contain the first page for executable mappings at
    offset 0, and don't contain the first page for non-executable file
    mappings or executable mappings at offset !=0.
    - NOMMU 32-bit ARM can still generate plausible-looking core dumps
    through the FDPIC implementation. (I can't test this with GDB because
    GDB is missing some structure definition for nommu ARM, but I've
    poked around in the hexdump and it looked decent.)

    This patch (of 7):

    dump_emit() is for kernel pointers, and VMAs describe userspace memory.
    Let's be tidy here and avoid accessing userspace pointers under KERNEL_DS,
    even if it probably doesn't matter much on !MMU systems - especially given
    that it looks like we can just use the same get_dump_page() as on MMU if
    we move it out of the CONFIG_MMU block.

    One small change we have to make in get_dump_page() is to use
    __get_user_pages_locked() instead of __get_user_pages(), since the latter
    doesn't exist on nommu. On mmu builds, __get_user_pages_locked() will
    just call __get_user_pages() for us.

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-1-jannh@google.com
    Link: http://lkml.kernel.org/r/20200827114932.3572699-2-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     

14 Oct, 2020

2 commits

  • As suggested by Dan Carpenter, fortify unpin_user_pages() just a bit,
    against a typical caller mistake: check if the npages arg is really a
    -ERRNO value, which would blow up the unpinning loop: WARN and return.

    If this new WARN_ON() fires, then the system *might* be leaking pages (by
    leaving them pinned), but probably not. More likely, gup/pup returned a
    hard -ERRNO error to the caller, who erroneously passed it here.
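
    A sketch of the check being described, on a simplified loop form of
    unpin_user_pages(); the in-tree function has more to it:

    void unpin_user_pages(struct page **pages, unsigned long npages)
    {
            unsigned long i;

            /*
             * A caller that mistakenly passes a gup/pup error code here
             * would turn npages into a huge unsigned value; catch that.
             */
            if (WARN_ON(IS_ERR_VALUE(npages)))
                    return;

            for (i = 0; i < npages; i++)
                    unpin_user_page(pages[i]);
    }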

    Signed-off-by: John Hubbard
    Signed-off-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Cc: Ira Weiny
    Cc: Souptick Joarder
    Link: https://lkml.kernel.org/r/20200917065706.409079-1-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     
    gup prohibits users from calling get_user_pages() with FOLL_PIN, but it
    allows users to call get_user_pages() with FOLL_LONGTERM alone. That is
    inconsistent.

    Since FOLL_LONGTERM is a stricter case of FOLL_PIN, we should prohibit
    users from calling get_user_pages() with FOLL_LONGTERM while not with
    FOLL_PIN.

    mm/gup_benchmark.c used to be the only user who did this improperly.
    But it has been fixed by moving to use pin_user_pages().

    [akpm@linux-foundation.org: fix CONFIG_MMU=n build]
    Link: https://lkml.kernel.org/r/CA+G9fYuNS3k0DVT62twfV746pfNhCSrk5sVMcOcQ1PGGnEseyw@mail.gmail.com

    Signed-off-by: Barry Song
    Signed-off-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Cc: John Hubbard
    Cc: Jan Kara
    Cc: Jérôme Glisse
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Jason Gunthorpe
    Cc: Jonathan Corbet
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Shuah Khan
    Cc: Vlastimil Babka
    Cc: Naresh Kamboju
    Link: http://lkml.kernel.org/r/20200819110100.23504-1-song.bao.hua@hisilicon.com
    Signed-off-by: Linus Torvalds

    Barry Song
     

29 Sep, 2020

1 commit

  • It seems likely this block was pasted from internal_get_user_pages_fast,
    which is not passed an mm struct and therefore uses current's. But
    __get_user_pages_locked is passed an explicit mm, and current->mm is not
    always valid. This was hit when being called from i915, which uses:

    pin_user_pages_remote ->
      __get_user_pages_remote ->
        __gup_longterm_locked ->
          __get_user_pages_locked

    Before, this would lead to an OOPS:

    BUG: kernel NULL pointer dereference, address: 0000000000000064
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    CPU: 10 PID: 1431 Comm: kworker/u33:1 Tainted: P S U O 5.9.0-rc7+ #140
    Hardware name: LENOVO 20QTCTO1WW/20QTCTO1WW, BIOS N2OET47W (1.34 ) 08/06/2020
    Workqueue: i915-userptr-acquire __i915_gem_userptr_get_pages_worker [i915]
    RIP: 0010:__get_user_pages_remote+0xd7/0x310
    Call Trace:
    __i915_gem_userptr_get_pages_worker+0xc8/0x260 [i915]
    process_one_work+0x1ca/0x390
    worker_thread+0x48/0x3c0
    kthread+0x114/0x130
    ret_from_fork+0x1f/0x30
    CR2: 0000000000000064

    This commit fixes the problem by using the mm pointer passed to the
    function rather than the bogus one in current.
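
    The shape of the fix, as a sketch of the fragment in
    __get_user_pages_locked() (variable names as in the commit text):

    if (flags & FOLL_PIN)
            atomic_set(&mm->has_pinned, 1);   /* was: &current->mm->has_pinned */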

    Fixes: 008cfe4418b3 ("mm: Introduce mm_struct.has_pinned")
    Tested-by: Chris Wilson
    Reported-by: Harald Arnesen
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Peter Xu
    Signed-off-by: Jason A. Donenfeld
    Signed-off-by: Linus Torvalds

    Jason A. Donenfeld
     

28 Sep, 2020

1 commit

  • (Commit message majorly collected from Jason Gunthorpe)

    Reduce the chance of false positives from page_maybe_dma_pinned() by
    keeping track of whether the mm_struct has ever been used with
    pin_user_pages().
    This allows cases that might drive up the page ref_count to avoid any
    penalty from handling dma_pinned pages.

    Future work is planned, to provide a more sophisticated solution, likely
    to turn it into a real counter. For now, make it atomic_t but use it as
    a boolean for simplicity.
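
    As a hedged illustration of how such a flag is meant to be consumed, a
    hypothetical wrapper (the helper name below is illustrative, not the
    kernel's):

    static inline bool page_might_be_dma_pinned(struct mm_struct *mm,
                                                struct page *page)
    {
            if (!atomic_read(&mm->has_pinned))
                    return false;   /* pin_user_pages() never used on this mm */
            return page_maybe_dma_pinned(page);
    }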

    Suggested-by: Jason Gunthorpe
    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     

27 Sep, 2020

1 commit

    Currently, to make sure that every page table entry is read just once,
    gup_fast walks perform READ_ONCE() and pass the pXd value down to the next
    gup_pXd_range function by value, e.g.:

    static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                             unsigned int flags, struct page **pages, int *nr)
    ...
            pudp = pud_offset(&p4d, addr);

    This function passes a reference on that local value copy to pXd_offset,
    and might get the very same pointer in return. This happens when the
    level is folded (on most arches), and that pointer should not be
    iterated.

    On s390, because each task might have a different 5-, 4- or 3-level
    address translation, and hence different levels folded, the logic is more
    complex, and a non-iterable pointer to a local copy leads to severe
    problems.

    Here is an example of what happens with gup_fast on s390, for a task
    with 3-level paging, crossing a 2 GB pud boundary:

    // addr = 0x1007ffff000, end = 0x10080001000
    static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                             unsigned int flags, struct page **pages, int *nr)
    {
            unsigned long next;
            pud_t *pudp;

            // pud_offset returns &p4d itself (a pointer to a value on stack)
            pudp = pud_offset(&p4d, addr);
            do {
                    // on the second iteration, reading a "random" stack value
                    pud_t pud = READ_ONCE(*pudp);

                    // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
                    next = pud_addr_end(addr, end);
                    ...
            } while (pudp++, addr = next, addr != end); // pudp++ iterating over the stack

            return 1;
    }

    This happens since s390 moved to common gup code with commit
    d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust") and
    commit 1a42010cdc26 ("s390/mm: convert to the generic
    get_user_pages_fast code").

    s390 tried to mimic static level folding by changing pXd_offset
    primitives to always calculate top level page table offset in pgd_offset
    and just return the value passed when pXd_offset has to act as folded.

    What is crucial for gup_fast and what has been overlooked is that
    PxD_SIZE/MASK and thus pXd_addr_end should also change correspondingly.
    And the latter is not possible with dynamic folding.

    To fix the issue, in addition to the pXd values, pass the original pXdp
    pointers down to the gup_pXd_range functions, and introduce
    pXd_offset_lockless helpers, which take an additional pXd entry value
    parameter. This has already been discussed in

    https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
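
    The generic fallback for such a helper is expected to look roughly like
    this (a sketch for one level; the actual patch defines the full set and
    the s390 override):

    #ifndef pud_offset_lockless
    #define pud_offset_lockless(p4dp, p4d, address) \
            pud_offset(&(p4d), (address))
    #endif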

    Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Andrew Morton
    Reviewed-by: Gerald Schaefer
    Reviewed-by: Alexander Gordeev
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Mike Rapoport
    Reviewed-by: John Hubbard
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Dave Hansen
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: Arnd Bergmann
    Cc: Andrey Ryabinin
    Cc: Heiko Carstens
    Cc: Christian Borntraeger
    Cc: Claudio Imbrenda
    Cc: [5.2+]
    Link: https://lkml.kernel.org/r/patch.git-943f1e5dcff2.your-ad-here.call-01599856292-ext-8676@work.hours
    Signed-off-by: Linus Torvalds

    Vasily Gorbik
     

05 Sep, 2020

2 commits

  • Merge emailed patches from Peter Xu:
    "This is a small series that I picked up from Linus's suggestion to
    simplify cow handling (and also make it more strict) by checking
    against page refcounts rather than mapcounts.

    This makes uffd-wp work again (verified by running upmapsort)"

    Note: this is horrendously bad timing, and making this kind of
    fundamental vm change after -rc3 is not at all how things should work.
    The saving grace is that it really is a nice simplification:

    8 files changed, 29 insertions(+), 120 deletions(-)

    The reason for the bad timing is that it turns out that commit
    17839856fd58 ("gup: document and work around 'COW can break either way'
    issue" broke not just UFFD functionality (as Peter noticed), but Mikulas
    Patocka also reports that it caused issues for strace when running in a
    DAX environment with ext4 on a persistent memory setup.

    And we can't just revert that commit without re-introducing the original
    issue that is a potential security hole, so making COW stricter (and in
    the process much simpler) is a step to then undoing the forced COW that
    broke other uses.

    Link: https://lore.kernel.org/lkml/alpine.LRH.2.02.2009031328040.6929@file01.intranet.prod.int.rdu2.redhat.com/

    * emailed patches from Peter Xu:
    mm: Add PGREUSE counter
    mm/gup: Remove enfornced COW mechanism
    mm/ksm: Remove reuse_ksm_page()
    mm: do_wp_page() simplification

    Linus Torvalds
     
  • With the more strict (but greatly simplified) page reuse logic in
    do_wp_page(), we can safely go back to the world where cow is not
    enforced with writes.

    This essentially reverts commit 17839856fd58 ("gup: document and work
    around 'COW can break either way' issue"). There are some context
    differences due to some changes later on around it:

    2170ecfa7688 ("drm/i915: convert get_user_pages() --> pin_user_pages()", 2020-06-03)
    376a34efa4ee ("mm/gup: refactor and de-duplicate gup_fast() code", 2020-06-03)

    Some lines moved back and forth with those, but this revert patch should
    have stripped out and covered all the enforced cow bits anyway.

    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     

04 Sep, 2020

2 commits

  • Merge gate page refcount fix from Dave Hansen:
    "During the conversion over to pin_user_pages(), gate pages were missed.

    The fix is pretty simple, and is accompanied by a new test from Andy
    which probably would have caught this earlier"

    * emailed patches from Dave Hansen:
    selftests/x86/test_vsyscall: Improve the process_vm_readv() test
    mm: fix pin vs. gup mismatch with gate pages

    Linus Torvalds
     
  • Gate pages were missed when converting from get to pin_user_pages().
    This can lead to refcount imbalances. This is reliably and quickly
    reproducible running the x86 selftests when vsyscall=emulate is enabled
    (the default). Fix by using try_grab_page() with appropriate flags
    passed.

    The long story:

    Today, pin_user_pages() and get_user_pages() are similar interfaces for
    manipulating page reference counts. However, "pins" use a "bias" value
    and manipulate the actual reference count by 1024 instead of 1 used by
    plain "gets".

    That means that pin_user_pages() must be matched with unpin_user_pages()
    and can't be mixed with a plain put_user_pages() or put_page().

    Enter gate pages, like the vsyscall page. They are pages usually in the
    kernel image, but which are mapped to userspace. Userspace is allowed
    access to them, including interfaces using get/pin_user_pages(). The
    refcount of these kernel pages is manipulated just like a normal user
    page on the get/pin side so that the put/unpin side can work the same
    for normal user pages or gate pages.

    get_gate_page() uses try_get_page() which only bumps the refcount by
    1, not 1024, even if called in the pin_user_pages() path. If someone
    pins a gate page, this happens:

    pin_user_pages()
      get_gate_page()
        try_get_page()          // bump refcount +1
    ... some time later
    unpin_user_pages()
      page_ref_sub_and_test(page, 1024)

    ... and boom, we get a refcount off by 1023. This is reliably and
    quickly reproducible running the x86 selftests when booted with
    vsyscall=emulate (the default). The selftests use ptrace(), but I
    suspect anything using pin_user_pages() on gate pages could hit this.

    To fix it, simply use try_grab_page() instead of try_get_page(), and
    pass 'gup_flags' in so that FOLL_PIN can be respected.
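
    The fix boils down to a change of roughly this shape in get_gate_page()
    (a sketch; the error label is illustrative):

    /* was: if (unlikely(!try_get_page(*page))) { ... } */
    if (unlikely(!try_grab_page(*page, gup_flags))) {
            ret = -ENOMEM;
            goto unmap;
    }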

    This bug traces back to the very beginning of the FOLL_PIN support in
    commit 3faa52c03f44 ("mm/gup: track FOLL_PIN pages"), which showed up in
    the 5.7 release.

    Signed-off-by: Dave Hansen
    Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages")
    Reported-by: Peter Zijlstra
    Reviewed-by: John Hubbard
    Acked-by: Andy Lutomirski
    Cc: x86@kernel.org
    Cc: Jann Horn
    Cc: Andrew Morton
    Cc: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

15 Aug, 2020

1 commit

  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Aug, 2020

6 commits

  • After the cleanup of page fault accounting, gup does not need to pass
    task_struct around any more. Remove that parameter in the whole gup
    stack.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Here're the last pieces of page fault accounting that were still done
    outside handle_mm_fault() where we still have regs==NULL when calling
    handle_mm_fault():

    arch/powerpc/mm/copro_fault.c:  copro_handle_mm_fault
    arch/sparc/mm/fault_32.c:       force_user_fault
    arch/um/kernel/trap.c:          handle_page_fault
    mm/gup.c:                       faultin_page
                                    fixup_user_fault
    mm/hmm.c:                       hmm_vma_fault
    mm/ksm.c:                       break_ksm

    Some of them have the issue of duplicated accounting for page fault
    retries. Some of them didn't do the accounting at all.

    This patch cleans all these up by letting handle_mm_fault() do per-task
    page fault accounting even if regs==NULL (though we'll still skip the perf
    event accounting). With that, we can safely remove all the outliers now.

    There's another functional change in that now we account the page faults
    to the caller of gup, rather than the task_struct that was passed into the
    gup code. More information on this can be found at [1].

    After this patch, below things should never be touched again outside
    handle_mm_fault():

    - task_struct.[maj|min]_flt
    - PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN]

    [1] https://lore.kernel.org/lkml/CAHk-=wj_V2Tps2QrMn20_W0OJF9xqNh52XSGA42s-ZJ8Y+GyKw@mail.gmail.com/

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Albert Ou
    Cc: Alexander Gordeev
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Chris Zankel
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Geert Uytterhoeven
    Cc: Gerald Schaefer
    Cc: Greentime Hu
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: James E.J. Bottomley
    Cc: John Hubbard
    Cc: Jonas Bonn
    Cc: Ley Foon Tan
    Cc: "Luck, Tony"
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Nick Hu
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Richard Henderson
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Stefan Kristiansson
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200707225021.200906-25-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Patch series "mm: Page fault accounting cleanups", v5.

    This is v5 of the pf accounting cleanup series. It originates from Gerald
    Schaefer's report on an issue a week ago regarding to incorrect page fault
    accountings for retried page fault after commit 4064b9827063 ("mm: allow
    VM_FAULT_RETRY for multiple times"):

    https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/

    What this series did:

    - Correct page fault accounting: we do accounting for a page fault
    (no matter whether it's from #PF handling, or gup, or anything else)
    only with the one that completed the fault. For example, page fault
    retries should not be counted in page fault counters. Same to the
    perf events.

    - Unify definition of PERF_COUNT_SW_PAGE_FAULTS: currently this perf
    event is used in an ad-hoc way across different archs.

    Case (1): for many archs it's done at the entry of a page fault
    handler, so that it will also cover e.g. erroneous faults.

    Case (2): for some other archs, it is only accounted when the page
    fault is resolved successfully.

    Case (3): there're still quite some archs that have not enabled
    this perf event.

    Since this series will touch nearly all the archs, we unify this
    perf event to always follow case (1), which is the one that makes the most
    sense. And since we moved the accounting into handle_mm_fault(), the
    other two MAJ/MIN perf events are well taken care of naturally.

    - Unify definition of "major faults": the definition of "major
    fault" is slightly changed when used in accounting (not
    VM_FAULT_MAJOR). More information in patch 1.

    - Always account the page fault onto the one that triggered the page
    fault. This does not matter much for #PF handlings, but mostly for
    gup. More information on this in patch 25.

    Patchset layout:

    Patch 1: Introduced the accounting in handle_mm_fault(), not enabled.
    Patch 2-23: Enable the new accounting for arch #PF handlers one by one.
    Patch 24: Enable the new accounting for the rest outliers (gup, iommu, etc.)
    Patch 25: Cleanup GUP task_struct pointer since it's not needed any more

    This patch (of 25):

    This is a preparation patch to move page fault accounting into the
    general code in handle_mm_fault(). This includes both the per-task
    maj_flt/min_flt counters, and the major/minor page fault perf events. To
    do this, the pt_regs pointer is passed into handle_mm_fault().

    PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault
    handlers.

    So far, all the pt_regs pointers passed into handle_mm_fault() are NULL,
    which means this patch should have no intended functional change.
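
    After the full series, arch #PF handlers end up passing their register
    state down, roughly like this (sketch of the new call shape):

    /* arch/<arch>/mm/fault.c, inside the page fault handler */
    fault = handle_mm_fault(vma, address, flags, regs);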

    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Albert Ou
    Cc: Alexander Gordeev
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Chris Zankel
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Geert Uytterhoeven
    Cc: Gerald Schaefer
    Cc: Greentime Hu
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: James E.J. Bottomley
    Cc: John Hubbard
    Cc: Jonas Bonn
    Cc: Ley Foon Tan
    Cc: "Luck, Tony"
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Nick Hu
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Richard Henderson
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Stefan Kristiansson
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200707225021.200906-1-peterx@redhat.com
    Link: http://lkml.kernel.org/r/20200707225021.200906-2-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • There is a well-defined migration target allocation callback. Use it.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: "Aneesh Kumar K . V"
    Cc: Christoph Hellwig
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1596180906-8442-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    new_non_cma_page() in gup.c needs to allocate a new page that is not in
    the CMA area. new_non_cma_page() implements this by using the allocation
    scope APIs.

    However, there is a work-around for hugetlb. The normal hugetlb page
    allocation API for migration is alloc_huge_page_nodemask(). It consists
    of two steps. The first is dequeuing from the pool. The second is, if
    there is no available page in the pool, allocating with the page allocator.

    new_non_cma_page() can't use this API since the first step (dequeue) isn't
    aware of the scope API used to exclude the CMA area. So, new_non_cma_page()
    exports the hugetlb-internal function for the second step,
    alloc_migrate_huge_page(), to global scope and uses it directly. This is
    suboptimal since hugetlb pages in the pool cannot be utilized.

    This patch fixes this situation by making the dequeue function in
    hugetlb CMA aware. In the dequeue function, CMA memory is skipped if
    the PF_MEMALLOC_NOCMA flag is found.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Mike Kravetz
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: "Aneesh Kumar K . V"
    Cc: Christoph Hellwig
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1596180906-8442-2-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    We have a well-defined scope API to exclude the CMA region. Use it rather
    than manipulating gfp_mask manually. With this change, we can now restore
    __GFP_MOVABLE for gfp_mask like the usual migration target allocation. As
    a result, ZONE_MOVABLE is also searched by the page allocator. For
    hugetlb, gfp_mask is redefined since it has a regular allocation mask
    filter for migration targets. __GFP_NOWARN is added to the hugetlb
    gfp_mask filter since a new user of the gfp_mask filter, gup, wants to be
    silent when allocation fails.

    Note that this can be considered as a fix for the commit 9a4e9f3b2d73
    ("mm: update get_user_pages_longterm to migrate pages allocated from CMA
    region"). However, a "Fixes" tag isn't added here since the previous
    behaviour is merely suboptimal and doesn't cause any problem.

    Suggested-by: Michal Hocko
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Christoph Hellwig
    Cc: Roman Gushchin
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Link: http://lkml.kernel.org/r/1596180906-8442-1-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

08 Aug, 2020

1 commit


20 Jun, 2020

2 commits

  • Since commit 9e343b467c70 ("READ_ONCE: Enforce atomicity for
    {READ,WRITE}_ONCE() memory accesses") it is not possible anymore to
    use READ_ONCE() to access complex page table entries like the one
    defined for powerpc 8xx with 16k size pages.

    Define a ptep_get() helper that architectures can override instead
    of performing a READ_ONCE() on the page table entry pointer.
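
    The generic helper is expected to be a trivial READ_ONCE() wrapper,
    roughly (a sketch; the in-tree guard macro may differ):

    #ifndef ptep_get
    static inline pte_t ptep_get(pte_t *ptep)
    {
            return READ_ONCE(*ptep);
    }
    #endif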

    Fixes: 9e343b467c70 ("READ_ONCE: Enforce atomicity for {READ,WRITE}_ONCE() memory accesses")
    Signed-off-by: Christophe Leroy
    Acked-by: Will Deacon
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/087fa12b6e920e32315136b998aa834f99242695.1592225558.git.christophe.leroy@csgroup.eu

    Christophe Leroy
     
    gup_hugepte() reads hugepage table entries, but it can't read them
    directly; huge_ptep_get() must be used.

    Fixes: 9e343b467c70 ("READ_ONCE: Enforce atomicity for {READ,WRITE}_ONCE() memory accesses")
    Signed-off-by: Christophe Leroy
    Acked-by: Will Deacon
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/ffc3714334c3bfaca6f13788ad039e8759ae413f.1592225558.git.christophe.leroy@csgroup.eu

    Christophe Leroy
     

10 Jun, 2020

6 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Convert comments that reference old mmap_sem APIs to reference
    corresponding new mmap locking APIs instead.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Davidlohr Bueso
    Reviewed-by: Daniel Jordan
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Rename the mmap_sem field to mmap_lock. Any new uses of this lock should
    now go through the new mmap locking api. The mmap_lock is still
    implemented as a rwsem, though this could change in the future.

    [akpm@linux-foundation.org: fix it for mm-gup-might_lock_readmmap_sem-in-get_user_pages_fast.patch]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Davidlohr Bueso
    Reviewed-by: Daniel Jordan
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-11-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Add new APIs to assert that mmap_sem is held.

    Using this instead of rwsem_is_locked and lockdep_assert_held[_write]
    makes the assertions more tolerant of future changes to the lock type.
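
    A sketch of what such an assertion helper looks like, assuming the lock
    field is named mmap_lock as in the rest of this series:

    static inline void mmap_assert_locked(struct mm_struct *mm)
    {
            lockdep_assert_held(&mm->mmap_lock);
            VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
    }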

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-10-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Patch series "mm: consolidate definitions of page table accessors", v2.

    The low level page table accessors (pXY_index(), pXY_offset()) are
    duplicated across all architectures and sometimes more than once. For
    instance, we have 31 definition of pgd_offset() for 25 supported
    architectures.

    Most of these definitions are actually identical and typically it boils
    down to, e.g.

    static inline unsigned long pmd_index(unsigned long address)
    {
            return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
            return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
    }

    These definitions can be shared among 90% of the arches provided
    XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

    For architectures that really need a custom version there is always the
    possibility to override the generic version with the usual ifdef magic.

    These patches introduce include/linux/pgtable.h that replaces
    include/asm-generic/pgtable.h and add the definitions of the page table
    accessors to the new header.

    This patch (of 12):

    The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
    functions involving page table manipulations, e.g. pte_alloc() and
    pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
    in the files that include <linux/mm.h>.

    The include statements in such cases are removed with a simple loop:

    for f in $(git grep -l "include <asm/pgtable.h>") ; do
            sed -i -e '/include <asm\/pgtable.h>/ d' $f
    done

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mike Rapoport
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

09 Jun, 2020

3 commits

    All of the pin_user_pages*() API calls will cause pages to be
    dma-pinned. As such, they are all suitable for DMA, RDMA, and/or
    Direct IO.

    The documentation should say so, but it was instead saying that three of
    the API calls were only suitable for Direct IO. This was discovered
    when a reviewer wondered why an API call that specifically recommended
    against Case 2 (DMA/RDMA) was being used in a DMA situation [1].

    Fix this by simply deleting those claims. The gup.c comments already
    refer to the more extensive Documentation/core-api/pin_user_pages.rst,
    which does have the correct guidance. So let's just write it once,
    there.

    [1] https://lore.kernel.org/r/20200529074658.GM30374@kadam

    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Pankaj Gupta
    Acked-by: Souptick Joarder
    Cc: Dan Carpenter
    Cc: Jan Kara
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20200529084515.46259-1-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • Patch series "mm/gup: introduce pin_user_pages_locked(), use it in frame_vector.c", v2.

    This adds yet one more pin_user_pages*() variant, and uses that to
    convert mm/frame_vector.c.

    With this, along with maybe 20 or 30 other recent patches in various
    trees, we are close to having the relevant gup call sites
    converted--with the notable exception of the bio/block layer.

    This patch (of 2):

    Introduce pin_user_pages_locked(), which is nearly identical to
    get_user_pages_locked() except that it sets FOLL_PIN and rejects
    FOLL_GET.

    As with other pairs of get_user_pages*() and pin_user_pages() API calls,
    it's prudent to assert that FOLL_PIN is *not* set in the
    get_user_pages*() call, so add that as part of this.
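
    A hedged sketch of the new variant's expected shape, assuming the
    internal locked helper's current argument order; the in-tree version
    carries additional checks and documentation:

    long pin_user_pages_locked(unsigned long start, unsigned long nr_pages,
                               unsigned int gup_flags, struct page **pages,
                               int *locked)
    {
            /* FOLL_GET and FOLL_PIN are mutually exclusive. */
            if (WARN_ON_ONCE(gup_flags & FOLL_GET))
                    return -EINVAL;

            gup_flags |= FOLL_PIN;
            return __get_user_pages_locked(current, current->mm, start,
                                           nr_pages, pages, NULL, locked,
                                           gup_flags | FOLL_TOUCH);
    }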

    [jhubbard@nvidia.com: v2]
    Link: http://lkml.kernel.org/r/20200531234131.770697-2-jhubbard@nvidia.com

    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Pankaj Gupta
    Cc: Daniel Vetter
    Cc: Jérôme Glisse
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Dave Chinner
    Cc: Souptick Joarder
    Link: http://lkml.kernel.org/r/20200531234131.770697-1-jhubbard@nvidia.com
    Link: http://lkml.kernel.org/r/20200527223243.884385-1-jhubbard@nvidia.com
    Link: http://lkml.kernel.org/r/20200527223243.884385-2-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • API __get_user_pages_fast() renamed to get_user_pages_fast_only() to
    align with pin_user_pages_fast_only().

    As part of this we will get rid of the write parameter. Instead, the
    caller will pass FOLL_WRITE to get_user_pages_fast_only(). This will not
    change any existing functionality of the API.

    All the callers are changed to pass FOLL_WRITE.

    Also introduce get_user_page_fast_only(), and use it in a few places
    that hard-code nr_pages to 1.

    Updated the documentation of the API.
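
    A before/after sketch of a typical caller (the argument values are
    illustrative):

    /* before: boolean "write" argument */
    nr = __get_user_pages_fast(addr, 1, 1, &page);

    /* after: gup_flags, so callers pass FOLL_WRITE explicitly */
    nr = get_user_pages_fast_only(addr, 1, FOLL_WRITE, &page);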

    Signed-off-by: Souptick Joarder
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Reviewed-by: Paul Mackerras [arch/powerpc/kvm]
    Cc: Matthew Wilcox
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Mark Rutland
    Cc: Alexander Shishkin
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Paolo Bonzini
    Cc: Stephen Rothwell
    Cc: Mike Rapoport
    Cc: Aneesh Kumar K.V
    Cc: Michal Suchanek
    Link: http://lkml.kernel.org/r/1590396812-31277-1-git-send-email-jrdr.linux@gmail.com
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     

04 Jun, 2020

5 commits

  • ba841078cd05 ("mm/mempolicy: Allow lookup_node() to handle fatal signal")
    has added a special casing for 0 return value because that was a possible
    gup return value when interrupted by fatal signal. This has been fixed by
    ae46d2aa6a7f ("mm/gup: Let __get_user_pages_locked() return -EINTR for
    fatal signal") in the meantime, so ba841078cd05 can be reverted.

    This patch however doesn't go all the way to revert it because the check
    for 0 is wrong and confusing here. Firstly it is inherently unsafe to
    access the page when get_user_pages_locked returns 0 (aka no page
    returned).

    Fortunately this will not happen because get_user_pages_locked will not
    return 0 when nr_pages > 0 unless FOLL_NOWAIT is specified, which is not
    the case here. Document this potential error code in the gup code while we
    are at it.
    are at it.

    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Cc: Peter Xu
    Link: http://lkml.kernel.org/r/20200421071026.18394-1-mhocko@kernel.org
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Instead of scattering these assertions across the drivers, do this
    assertion inside the core of get_user_pages_fast*() functions. That also
    includes pin_user_pages_fast*() routines.

    Add a might_lock_read(mmap_sem) call to internal_get_user_pages_fast().
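
    The added assertion looks roughly like this, near the top of
    internal_get_user_pages_fast() (a sketch; at this point the lock was
    still called mmap_sem):

    /*
     * Unless the caller insists on the atomic-only fast path
     * (FOLL_FAST_ONLY), record that mmap_sem may be taken for reading.
     */
    if (!(gup_flags & FOLL_FAST_ONLY))
            might_lock_read(&current->mm->mmap_sem);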

    Suggested-by: Matthew Wilcox
    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox
    Cc: Michel Lespinasse
    Cc: Jason Gunthorpe
    Link: http://lkml.kernel.org/r/20200522010443.1290485-1-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • This is the FOLL_PIN equivalent of __get_user_pages_fast(), except with a
    more descriptive name, and gup_flags instead of a boolean "write" in the
    argument list.
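
    A sketch of the new routine's expected shape (the in-tree version
    differs in minor details):

    int pin_user_pages_fast_only(unsigned long start, int nr_pages,
                                 unsigned int gup_flags, struct page **pages)
    {
            /* FOLL_GET and FOLL_PIN are mutually exclusive. */
            if (WARN_ON_ONCE(gup_flags & FOLL_GET))
                    return 0;

            gup_flags |= FOLL_PIN | FOLL_FAST_ONLY;
            return internal_get_user_pages_fast(start, nr_pages,
                                                gup_flags, pages);
    }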

    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Reviewed-by: Chris Wilson
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Jani Nikula
    Cc: "Joonas Lahtinen"
    Cc: Matthew Auld
    Cc: Matthew Wilcox
    Cc: Rodrigo Vivi
    Cc: Souptick Joarder
    Cc: Tvrtko Ursulin
    Link: http://lkml.kernel.org/r/20200519002124.2025955-4-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     
    There were two nearly identical sets of code for the gup_fast() style of
    walking the page tables with interrupts disabled. This has led to the
    usual maintenance problems that arise from having duplicated code.

    There is already a core internal routine in gup.c for gup_fast(), so just
    enhance it very slightly: allow skipping the fall-back to "slow" (regular)
    get_user_pages(), via the new FOLL_FAST_ONLY flag. Then, just call
    internal_get_user_pages_fast() from __get_user_pages_fast(), and adjust
    the API to match pre-existing API behavior.

    There is a change in behavior from this refactoring: the nested form of
    interrupt disabling is used in all gup_fast() variants now. That's
    because there is only one place that interrupt disabling for page walking
    is done, and so the safer form is required. This should, if anything,
    eliminate possible (rare) bugs, because the non-nested form of enabling
    interrupts was fragile at best.

    [jhubbard@nvidia.com: fixup]
    Link: http://lkml.kernel.org/r/20200521233841.1279742-1-jhubbard@nvidia.com
    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Reviewed-by: Chris Wilson
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Jani Nikula
    Cc: "Joonas Lahtinen"
    Cc: Matthew Auld
    Cc: Matthew Wilcox
    Cc: Rodrigo Vivi
    Cc: Souptick Joarder
    Cc: Tvrtko Ursulin
    Link: http://lkml.kernel.org/r/20200519002124.2025955-3-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • Patch series "mm/gup, drm/i915: refactor gup_fast, convert to pin_user_pages()", v2.

    In order to convert the drm/i915 driver from get_user_pages() to
    pin_user_pages(), a FOLL_PIN equivalent of __get_user_pages_fast() was
    required. That led to refactoring __get_user_pages_fast(), with the
    following goals:

    1) As above: provide a pin_user_pages*() routine for drm/i915 to call,
    in place of __get_user_pages_fast(),

    2) Get rid of the gup.c duplicate code for walking page tables with
    interrupts disabled. This duplicate code is a minor maintenance
    problem anyway.

    3) Make it easy for an upcoming patch from Souptick, which aims to
    convert __get_user_pages_fast() to use a gup_flags argument, instead
    of a bool writeable arg. Also, if this series looks good, we can
    ask Souptick to change the name as well, to whatever the consensus
    is. My initial recommendation is: get_user_pages_fast_only(), to
    match the new pin_user_pages_fast_only().

    This patch (of 4):

    This is in order to avoid a forward declaration of
    internal_get_user_pages_fast(), in the next patch.

    This is code movement only--all generated code should be identical.

    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Reviewed-by: Chris Wilson
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Jani Nikula
    Cc: "Joonas Lahtinen"
    Cc: Matthew Auld
    Cc: Matthew Wilcox
    Cc: Rodrigo Vivi
    Cc: Souptick Joarder
    Cc: Tvrtko Ursulin
    Link: http://lkml.kernel.org/r/20200522051931.54191-1-jhubbard@nvidia.com
    Link: http://lkml.kernel.org/r/20200519002124.2025955-1-jhubbard@nvidia.com
    Link: http://lkml.kernel.org/r/20200519002124.2025955-2-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     

03 Jun, 2020

1 commit

  • Merge updates from Andrew Morton:
    "A few little subsystems and a start of a lot of MM patches.

    Subsystems affected by this patch series: squashfs, ocfs2, parisc,
    vfs. With mm subsystems: slab-generic, slub, debug, pagecache, gup,
    swap, memcg, pagemap, memory-failure, vmalloc, kasan"

    * emailed patches from Andrew Morton: (128 commits)
    kasan: move kasan_report() into report.c
    mm/mm_init.c: report kasan-tag information stored in page->flags
    ubsan: entirely disable alignment checks under UBSAN_TRAP
    kasan: fix clang compilation warning due to stack protector
    x86/mm: remove vmalloc faulting
    mm: remove vmalloc_sync_(un)mappings()
    x86/mm/32: implement arch_sync_kernel_mappings()
    x86/mm/64: implement arch_sync_kernel_mappings()
    mm/ioremap: track which page-table levels were modified
    mm/vmalloc: track which page-table levels were modified
    mm: add functions to track page directory modifications
    s390: use __vmalloc_node in stack_alloc
    powerpc: use __vmalloc_node in alloc_vm_stack
    arm64: use __vmalloc_node in arch_alloc_vmap_stack
    mm: remove vmalloc_user_node_flags
    mm: switch the test_vmalloc module to use __vmalloc_node
    mm: remove __vmalloc_node_flags_caller
    mm: remove both instances of __vmalloc_node_flags
    mm: remove the prot argument to __vmalloc_node
    mm: remove the pgprot argument to __vmalloc
    ...

    Linus Torvalds