16 Apr, 2015
1 commit
-
Commit 38c5ce936a08 ("mm/gup: Replace ACCESS_ONCE with READ_ONCE")
converted ACCESS_ONCE usage in gup_pmd_range() to READ_ONCE, since
ACCESS_ONCE doesn't work reliably on non-scalar types.This patch also fixes the other ACCESS_ONCE usages in gup_pte_range()
and __get_user_pages_fast() in mm/gup.cSigned-off-by: Jason Low
Acked-by: Michal Hocko
Acked-by: Davidlohr Bueso
Acked-by: Rik van Riel
Reviewed-by: Christian Borntraeger
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
15 Apr, 2015
2 commits
-
It's odd that we have populate_vma_page_range() and __mm_populate() in
mm/mlock.c. It's implementation of generic memory population and mlocking
is one of possible side effect, if VM_LOCKED is set.__get_user_pages() is core of the implementation. Let's move the code
into mm/gup.c.Signed-off-by: Kirill A. Shutemov
Acked-by: Linus Torvalds
Acked-by: David Rientjes
Cc: Michel Lespinasse
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
After commit a1fde08c74e9 ("VM: skip the stack guard page lookup in
get_user_pages only for mlock") FOLL_MLOCK has lost its original
meaning: we don't necessarily mlock the page if the flags is set -- we
also take VM_LOCKED into consideration.Since we use the same codepath for __mm_populate(), let's rename
FOLL_MLOCK to FOLL_POPULATE.Signed-off-by: Kirill A. Shutemov
Acked-by: Linus Torvalds
Acked-by: David Rientjes
Cc: Michel Lespinasse
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
15 Feb, 2015
1 commit
-
Pull ACCESS_ONCE() rule tightening from Christian Borntraeger:
"Tighten rules for ACCESS_ONCEThis series tightens the rules for ACCESS_ONCE to only work on scalar
types. It also contains the necessary fixups as indicated by build
bots of linux-next. Now everything is in place to prevent new
non-scalar users of ACCESS_ONCE and we can continue to convert code to
READ_ONCE/WRITE_ONCE"* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux:
kernel: Fix sparse warning for ACCESS_ONCE
next: sh: Fix compile error
kernel: tighten rules for ACCESS ONCE
mm/gup: Replace ACCESS_ONCE with READ_ONCE
x86/spinlock: Leftover conversion ACCESS_ONCE->READ_ONCE
x86/xen/p2m: Replace ACCESS_ONCE with READ_ONCE
ppc/hugetlbfs: Replace ACCESS_ONCE with READ_ONCE
ppc/kvm: Replace ACCESS_ONCE with READ_ONCE
13 Feb, 2015
1 commit
-
Convert existing users of pte_numa and friends to the new helper. Note
that the kernel is broken after this patch is applied until the other page
table modifiers are also altered. This patch layout is to make review
easier.Signed-off-by: Mel Gorman
Acked-by: Linus Torvalds
Acked-by: Aneesh Kumar
Acked-by: Benjamin Herrenschmidt
Tested-by: Sasha Levin
Cc: Dave Jones
Cc: Hugh Dickins
Cc: Ingo Molnar
Cc: Kirill Shutemov
Cc: Paul Mackerras
Cc: Rik van Riel
Cc: Sasha Levin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Feb, 2015
4 commits
-
This allows the get_user_pages_fast slow path to release the mmap_sem
before blocking.Signed-off-by: Andrea Arcangeli
Reviewed-by: Kirill A. Shutemov
Cc: Andres Lagar-Cavilla
Cc: Peter Feiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Some callers (like KVM) may want to set the gup_flags like FOLL_HWPOSION
to get a proper -EHWPOSION retval instead of -EFAULT to take a more
appropriate action if get_user_pages runs into a memory failure.Signed-off-by: Andrea Arcangeli
Reviewed-by: Kirill A. Shutemov
Cc: Andres Lagar-Cavilla
Cc: Peter Feiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
FAULT_FOLL_ALLOW_RETRY allows the page fault to drop the mmap_sem for
reading to reduce the mmap_sem contention (for writing), like while
waiting for I/O completion. The problem is that right now practically no
get_user_pages call uses FAULT_FOLL_ALLOW_RETRY, so we're not leveraging
that nifty feature.Andres fixed it for the KVM page fault. However get_user_pages_fast
remains uncovered, and 99% of other get_user_pages aren't using it either
(the only exception being FOLL_NOWAIT in KVM which is really nonblocking
and in fact it doesn't even release the mmap_sem).So this patchsets extends the optimization Andres did in the KVM page
fault to the whole kernel. It makes most important places (including
gup_fast) to use FAULT_FOLL_ALLOW_RETRY to reduce the mmap_sem hold times
during I/O.The only few places that remains uncovered are drivers like v4l and other
exceptions that tends to work on their own memory and they're not working
on random user memory (for example like O_DIRECT that uses gup_fast and is
fully covered by this patch).A follow up patch should probably also add a printk_once warning to
get_user_pages that should go obsolete and be phased out eventually. The
"vmas" parameter of get_user_pages makes it fundamentally incompatible
with FAULT_FOLL_ALLOW_RETRY (vmas array becomes meaningless the moment the
mmap_sem is released).While this is just an optimization, this becomes an absolute requirement
for the userfaultfd feature http://lwn.net/Articles/615086/ .The userfaultfd allows to block the page fault, and in order to do so I
need to drop the mmap_sem first. So this patch also ensures that all
memory where userfaultfd could be registered by KVM, the very first fault
(no matter if it is a regular page fault, or a get_user_pages) always has
FAULT_FOLL_ALLOW_RETRY set. Then the userfaultfd blocks and it is waken
only when the pagetable is already mapped. The second fault attempt after
the wakeup doesn't need FAULT_FOLL_ALLOW_RETRY, so it's ok to retry
without it.This patch (of 5):
We can leverage the VM_FAULT_RETRY functionality in the page fault paths
better by using either get_user_pages_locked or get_user_pages_unlocked.The former allows conversion of get_user_pages invocations that will have
to pass a "&locked" parameter to know if the mmap_sem was dropped during
the call. Example from:down_read(&mm->mmap_sem);
do_something()
get_user_pages(tsk, mm, ..., pages, NULL);
up_read(&mm->mmap_sem);to:
int locked = 1;
down_read(&mm->mmap_sem);
do_something()
get_user_pages_locked(tsk, mm, ..., pages, &locked);
if (locked)
up_read(&mm->mmap_sem);The latter is suitable only as a drop in replacement of the form:
down_read(&mm->mmap_sem);
get_user_pages(tsk, mm, ..., pages, NULL);
up_read(&mm->mmap_sem);into:
get_user_pages_unlocked(tsk, mm, ..., pages);
Where tsk, mm, the intermediate "..." paramters and "pages" can be any
value as before. Just the last parameter of get_user_pages (vmas) must be
NULL for get_user_pages_locked|unlocked to be usable (the latter original
form wouldn't have been safe anyway if vmas wasn't null, for the former we
just make it explicit by dropping the parameter).If vmas is not NULL these two methods cannot be used.
Signed-off-by: Andrea Arcangeli
Reviewed-by: Andres Lagar-Cavilla
Reviewed-by: Peter Feiner
Reviewed-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
We have a race condition between move_pages() and freeing hugepages, where
move_pages() calls follow_page(FOLL_GET) for hugepages internally and
tries to get its refcount without preventing concurrent freeing. This
race crashes the kernel, so this patch fixes it by moving FOLL_GET code
for hugepages into follow_huge_pmd() with taking the page table lock.This patch intentionally removes page==NULL check after pte_page.
This is justified because pte_page() never returns NULL for any
architectures or configurations.This patch changes the behavior of follow_huge_pmd() for tail pages and
then tail pages can be pinned/returned. So the caller must be changed to
properly handle the returned tail pages.We could have a choice to add the similar locking to
follow_huge_(addr|pud) for consistency, but it's not necessary because
currently these functions don't support FOLL_GET flag, so let's leave it
for future development.Here is the reproducer:
$ cat movepages.c
#include
#include
#include#define ADDR_INPUT 0x700000000000UL
#define HPS 0x200000
#define PS 0x1000int main(int argc, char *argv[]) {
int i;
int nr_hp = strtol(argv[1], NULL, 0);
int nr_p = nr_hp * HPS / PS;
int ret;
void **addrs;
int *status;
int *nodes;
pid_t pid;pid = strtol(argv[2], NULL, 0);
addrs = malloc(sizeof(char *) * nr_p + 1);
status = malloc(sizeof(char *) * nr_p + 1);
nodes = malloc(sizeof(char *) * nr_p + 1);while (1) {
for (i = 0; i < nr_p; i++) {
addrs[i] = (void *)ADDR_INPUT + i * PS;
nodes[i] = 1;
status[i] = 0;
}
ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
MPOL_MF_MOVE_ALL);
if (ret == -1)
err("move_pages");for (i = 0; i < nr_p; i++) {
addrs[i] = (void *)ADDR_INPUT + i * PS;
nodes[i] = 0;
status[i] = 0;
}
ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
MPOL_MF_MOVE_ALL);
if (ret == -1)
err("move_pages");
}
return 0;
}$ cat hugepage.c
#include
#include
#include#define ADDR_INPUT 0x700000000000UL
#define HPS 0x200000int main(int argc, char *argv[]) {
int nr_hp = strtol(argv[1], NULL, 0);
char *p;while (1) {
p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
if (p != (void *)ADDR_INPUT) {
perror("mmap");
break;
}
memset(p, 0, nr_hp * HPS);
munmap(p, nr_hp * HPS);
}
}$ sysctl vm.nr_hugepages=40
$ ./hugepage 10 &
$ ./movepages 10 $(pgrep -f hugepage)Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()")
Signed-off-by: Naoya Horiguchi
Reported-by: Hugh Dickins
Cc: James Hogan
Cc: David Rientjes
Cc: Mel Gorman
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Rik van Riel
Cc: Andrea Arcangeli
Cc: Luiz Capitulino
Cc: Nishanth Aravamudan
Cc: Lee Schermerhorn
Cc: Steve Capper
Cc: [3.12+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 Feb, 2015
1 commit
-
One bit in ->vm_flags is unused now!
Signed-off-by: Kirill A. Shutemov
Cc: Dan Carpenter
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
30 Jan, 2015
1 commit
-
The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
"you should SIGSEGV" error, because the SIGSEGV case was generally
handled by the caller - usually the architecture fault handler.That results in lots of duplication - all the architecture fault
handlers end up doing very similar "look up vma, check permissions, do
retries etc" - but it generally works. However, there are cases where
the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.In particular, when accessing the stack guard page, libsigsegv expects a
SIGSEGV. And it usually got one, because the stack growth is handled by
that duplicated architecture fault handler.However, when the generic VM layer started propagating the error return
from the stack expansion in commit fee7e49d4514 ("mm: propagate error
from stack expansion even for guard page"), that now exposed the
existing VM_FAULT_SIGBUS result to user space. And user space really
expected SIGSEGV, not SIGBUS.To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
duplicate architecture fault handlers about it. They all already have
the code to handle SIGSEGV, so it's about just tying that new return
value to the existing code, but it's all a bit annoying.This is the mindless minimal patch to do this. A more extensive patch
would be to try to gather up the mostly shared fault handling logic into
one generic helper routine, and long-term we really should do that
cleanup.Just from this patch, you can generally see that most architectures just
copied (directly or indirectly) the old x86 way of doing things, but in
the meantime that original x86 model has been improved to hold the VM
semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
"newer" things, so it would be a good idea to bring all those
improvements to the generic case and teach other architectures about
them too.Reported-and-tested-by: Takashi Iwai
Tested-by: Jan Engelhardt
Acked-by: Heiko Carstens # "s390 still compiles and boots"
Cc: linux-arch@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds
19 Jan, 2015
1 commit
-
ACCESS_ONCE does not work reliably on non-scalar types. For
example gcc 4.6 and 4.7 might remove the volatile tag for such
accesses during the SRA (scalar replacement of aggregates) step
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)Fixup gup_pmd_range.
Signed-off-by: Christian Borntraeger
21 Dec, 2014
1 commit
-
Pull ACCESS_ONCE cleanup preparation from Christian Borntraeger:
"kernel: Provide READ_ONCE and ASSIGN_ONCEAs discussed on LKML http://marc.info/?i=54611D86.4040306%40de.ibm.com
ACCESS_ONCE might fail with specific compilers for non-scalar
accesses.Here is a set of patches to tackle that problem.
The first patch introduce READ_ONCE and ASSIGN_ONCE. If the data
structure is larger than the machine word size memcpy is used and a
warning is emitted. The next patches fix up several in-tree users of
ACCESS_ONCE on non-scalar types.This does not yet contain a patch that forces ACCESS_ONCE to work only
on scalar types. This is targetted for the next merge window as Linux
next already contains new offenders regarding ACCESS_ONCE vs.
non-scalar types"* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux:
s390/kvm: REPLACE barrier fixup with READ_ONCE
arm/spinlock: Replace ACCESS_ONCE with READ_ONCE
arm64/spinlock: Replace ACCESS_ONCE READ_ONCE
mips/gup: Replace ACCESS_ONCE with READ_ONCE
x86/gup: Replace ACCESS_ONCE with READ_ONCE
x86/spinlock: Replace ACCESS_ONCE with READ_ONCE
mm: replace ACCESS_ONCE with READ_ONCE or barriers
kernel: Provide READ_ONCE and ASSIGN_ONCE
18 Dec, 2014
1 commit
-
ACCESS_ONCE does not work reliably on non-scalar types. For
example gcc 4.6 and 4.7 might remove the volatile tag for such
accesses during the SRA (scalar replacement of aggregates) step
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)Let's change the code to access the page table elements with
READ_ONCE that does implicit scalar accesses for the gup code.mm_find_pmd is tricky, because m68k and sparc(32bit) define pmd_t
as array of longs. This code requires just that the pmd_present
and pmd_trans_huge check are done on the same value, so a barrier
is sufficent.A similar case is in handle_pte_fault. On ppc44x the word size is
32 bit, but a pte is 64 bit. A barrier is ok as well.Signed-off-by: Christian Borntraeger
Cc: linux-mm@kvack.org
Acked-by: Paul E. McKenney
14 Nov, 2014
1 commit
-
Update generic gup implementation with powerpc specific details.
On powerpc at pmd level we can have hugepte, normal pmd pointer
or a pointer to the hugepage directory.Tested-by: Steve Capper
Acked-by: Steve Capper
Signed-off-by: Aneesh Kumar K.V
Signed-off-by: Michael Ellerman
10 Oct, 2014
1 commit
-
This series implements general forms of get_user_pages_fast and
__get_user_pages_fast in core code and activates them for arm and arm64.These are required for Transparent HugePages to function correctly, as a
futex on a THP tail will otherwise result in an infinite loop (due to the
core implementation of __get_user_pages_fast always returning 0).Unfortunately, a futex on THP tail can be quite common for certain
workloads; thus THP is unreliable without a __get_user_pages_fast
implementation.This series may also be beneficial for direct-IO heavy workloads and
certain KVM workloads.This patch (of 6):
get_user_pages_fast() attempts to pin user pages by walking the page
tables directly and avoids taking locks. Thus the walker needs to be
protected from page table pages being freed from under it, and needs to
block any THP splits.One way to achieve this is to have the walker disable interrupts, and rely
on IPIs from the TLB flushing code blocking before the page table pages
are freed.On some platforms we have hardware broadcast of TLB invalidations, thus
the TLB flushing code doesn't necessarily need to broadcast IPIs; and
spuriously broadcasting IPIs can hurt system performance if done too
often.This problem has been solved on PowerPC and Sparc by batching up page
table pages belonging to more than one mm_user, then scheduling an
rcu_sched callback to free the pages. This RCU page table free logic has
been promoted to core code and is activated when one enables
HAVE_RCU_TABLE_FREE. Unfortunately, these architectures implement their
own get_user_pages_fast routines.The RCU page table free logic coupled with an IPI broadcast on THP split
(which is a rare event), allows one to protect a page table walker by
merely disabling the interrupts during the walk.This patch provides a general RCU implementation of get_user_pages_fast
that can be used by architectures that perform hardware broadcast of TLB
invalidations.It is based heavily on the PowerPC implementation by Nick Piggin.
[akpm@linux-foundation.org: various comment fixes]
Signed-off-by: Steve Capper
Tested-by: Dann Frazier
Reviewed-by: Catalin Marinas
Acked-by: Hugh Dickins
Cc: Russell King
Cc: Mark Rutland
Cc: Mel Gorman
Cc: Will Deacon
Cc: Christoffer Dall
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 Sep, 2014
1 commit
-
When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
has been swapped out or is behind a filemap, this will trigger async
readahead and return immediately. The rationale is that KVM will kick
back the guest with an "async page fault" and allow for some other
guest process to take over.If async PFs are enabled the fault is retried asap from an async
workqueue. If not, it's retried immediately in the same code path. In
either case the retry will not relinquish the mmap semaphore and will
block on the IO. This is a bad thing, as other mmap semaphore users
now stall as a function of swap or filemap latency.This patch ensures both the regular and async PF path re-enter the
fault allowing for the mmap semaphore to be relinquished in the case
of IO wait.Reviewed-by: Radim Krčmář
Signed-off-by: Andres Lagar-Cavilla
Acked-by: Andrew Morton
Signed-off-by: Paolo Bonzini
07 Aug, 2014
1 commit
-
Add a comment describing the circumstances in which
__lock_page_or_retry() will or will not release the mmap_sem when
returning 0.Add comments to lock_page_or_retry()'s callers (filemap_fault(),
do_swap_page()) noting the impact on VM_FAULT_RETRY returns.Add comments on up the call tree, particularly replacing the false "We
return with mmap_sem still held" comments.Signed-off-by: Paul Cassella
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Jun, 2014
5 commits
-
Get rid of two nested loops over nr_pages, extract vma flags checking to
separate function and other random cleanups.Signed-off-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Nesting level in __get_user_pages() is just insane. Let's try to fix it
a bit.Signed-off-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Cleanups:
- move pte-related code to separate function. It's about half of the
function;
- get rid of some goto-logic;
- use 'return NULL' instead of 'return page' where page can only be
NULL;Signed-off-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The case is special and disturb from reading main __get_user_pages()
code path. Let's move it to separate function.Signed-off-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
mm/memory.c is overloaded: over 4k lines. get_user_pages() code is
pretty much self-contained let's move it to separate file.No other changes made.
Signed-off-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds