11 Jan, 2012

1 commit

  • migrate was doing an rmap_walk with speculative lock-less access on
    pagetables. That could lead it to fail to serialize properly against the
    mremap PT locks. But a second problem remains in the order of vmas in the
    same_anon_vma list used by the rmap_walk.

    If vma_merge succeeds in copy_vma, the src vma could be placed after the
    dst vma in the same_anon_vma list. That could still lead migrate to miss
    some ptes.

    This patch adds an anon_vma_moveto_tail() function that forces the dst vma
    to the end of the list before mremap starts, to solve the problem.
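
    A hedged sketch of what such a helper looks like (modelled on the shape of
    the function this patch adds; not necessarily the exact mainline code):

    void anon_vma_moveto_tail(struct vm_area_struct *dst)
    {
        struct anon_vma_chain *pavc;
        struct anon_vma *root = NULL;

        /*
         * Move every link of dst to the tail of its same_anon_vma list, so
         * rmap_walk visits the src vma before the dst vma and cannot miss a
         * pte that mremap moves between the two checks.
         */
        list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
            struct anon_vma *anon_vma = pavc->anon_vma;
            root = lock_anon_vma_root(root, anon_vma);
            list_move_tail(&pavc->same_anon_vma, &anon_vma->head);
        }
        unlock_anon_vma_root(root);
    }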

    If the mremap is very large and there are lots of parents or children
    sharing the anon_vma root lock, this should still scale better than taking
    the anon_vma root lock around every pte copy, practically for the whole
    duration of mremap.

    Update: Hugh noticed that special care is needed in the error path, where
    move_page_tables goes in the reverse direction; a second
    anon_vma_moveto_tail() call is needed there.

    This program exercises anon_vma_moveto_tail():

    ===

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    /* SIZE value assumed here; any large 2MB multiple works for this test. */
    #define SIZE (1024UL*1024*1024)

    int main()
    {
        char *p, *p2, *p3, *p4;

        if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);

        memset(p, 0xff, SIZE);
        printf("%p\n", p);
        memset(p2, 0xff, SIZE);
        memset(p3, 0x77, 4096);
        if (memcmp(p, p2, SIZE))
            printf("error\n");

        /* Bounce the second half of p out to p3 and back again. */
        p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
        if (p4 != p3)
            perror("mremap"), exit(1);
        p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
        if (p4 != p+SIZE/2)
            perror("mremap"), exit(1);

        if (memcmp(p, p2, SIZE))
            printf("error\n");
        printf("ok\n");

        return 0;
    }
    ===

    $ perf probe -a anon_vma_moveto_tail
    Add new event:
    probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)

    You can now use it on all perf tools, such as:

    perf record -e probe:anon_vma_moveto_tail -aR sleep 1

    $ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
    0x7f2ca2800000
    ok
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
    $ perf report --stdio
    100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail

    Signed-off-by: Andrea Arcangeli
    Reported-by: Nai Xia
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Pawel Sikora
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

01 Nov, 2011

3 commits

  • This adds THP support to mremap (decreases the number of split_huge_page()
    calls).

    Here are also some benchmarks with a proggy like this:

    ===
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/time.h>

    #define SIZE (5UL*1024*1024*1024)

    int main()
    {
        static struct timeval oldstamp, newstamp;
        long diffsec;
        char *p, *p2, *p3, *p4;

        if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p3, 2*1024*1024, 4096))
            perror("memalign"), exit(1);

        memset(p, 0xff, SIZE);
        memset(p2, 0xff, SIZE);
        memset(p3, 0x77, 4096);

        /* Time only the mremap itself. */
        gettimeofday(&oldstamp, NULL);
        p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
        gettimeofday(&newstamp, NULL);
        diffsec = newstamp.tv_sec - oldstamp.tv_sec;
        diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec;
        printf("usec %ld\n", diffsec);

        if (p4 == MAP_FAILED || p4 != p3)
            perror("mremap"), exit(1);
        if (memcmp(p4, p2, SIZE))
            printf("mremap bug\n"), exit(1);
        printf("ok\n");

        return 0;
    }
    ===

    THP on

    Performance counter stats for './largepage13' (3 runs):

    69195836 dTLB-loads ( +- 3.546% ) (scaled from 50.30%)
    60708 dTLB-load-misses ( +- 11.776% ) (scaled from 52.62%)
    676266476 dTLB-stores ( +- 5.654% ) (scaled from 69.54%)
    29856 dTLB-store-misses ( +- 4.081% ) (scaled from 89.22%)
    1055848782 iTLB-loads ( +- 4.526% ) (scaled from 80.18%)
    8689 iTLB-load-misses ( +- 2.987% ) (scaled from 58.20%)

    7.314454164 seconds time elapsed ( +- 0.023% )

    THP off

    Performance counter stats for './largepage13' (3 runs):

    1967379311 dTLB-loads ( +- 0.506% ) (scaled from 60.59%)
    9238687 dTLB-load-misses ( +- 22.547% ) (scaled from 61.87%)
    2014239444 dTLB-stores ( +- 0.692% ) (scaled from 60.40%)
    3312335 dTLB-store-misses ( +- 7.304% ) (scaled from 67.60%)
    6764372065 iTLB-loads ( +- 0.925% ) (scaled from 79.00%)
    8202 iTLB-load-misses ( +- 0.475% ) (scaled from 70.55%)

    9.693655243 seconds time elapsed ( +- 0.069% )

    grep thp /proc/vmstat
    thp_fault_alloc 35849
    thp_fault_fallback 0
    thp_collapse_alloc 3
    thp_collapse_alloc_failed 0
    thp_split 0

    thp_split 0 confirms no thp split despite plenty of hugepages allocated.

    Measuring only the mremap time (so excluding the three long memsets and the
    final long 10GB memory-accessing memcmp):

    THP on

    usec 14824
    usec 14862
    usec 14859

    THP off

    usec 256416
    usec 255981
    usec 255847

    With an older kernel, without the mremap optimizations (the patch below
    optimizes the non-THP version too):

    THP on

    usec 392107
    usec 390237
    usec 404124

    THP off

    usec 444294
    usec 445237
    usec 445820

    I guess a threaded program that sends more IPIs on a large SMP system would
    show an even larger difference.

    All debug options are off except DEBUG_VM to avoid skewing the
    results.

    The only problem for native 2M mremap, as it happens above, is that both
    the source and destination addresses must be 2M aligned or the hugepmd
    can't be moved without a split; but that is a hardware limitation.

    [akpm@linux-foundation.org: coding-style nitpicking]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This replaces ptep_clear_flush() with ptep_get_and_clear() and a single
    flush_tlb_range() at the end of the loop, to avoid sending one IPI for
    each page.
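
    A hedged sketch of the resulting pattern (illustrative only; the helper
    name move_range_sketch is made up, this is not the actual
    move_ptes()/move_page_tables() code):

    static void move_range_sketch(struct vm_area_struct *vma, pte_t *old_pte,
                                  pte_t *new_pte, unsigned long old_addr,
                                  unsigned long old_end, unsigned long new_addr)
    {
        struct mm_struct *mm = vma->vm_mm;
        unsigned long start = old_addr;
        pte_t pte;

        for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
                                   new_pte++, new_addr += PAGE_SIZE) {
            if (pte_none(*old_pte))
                continue;
            /* Clear without flushing: no per-page IPI any more. */
            pte = ptep_get_and_clear(mm, old_addr, old_pte);
            set_pte_at(mm, new_addr, new_pte, pte);
        }
        /* One range flush (one IPI burst) covers the whole region. */
        flush_tlb_range(vma, start, old_end);
    }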

    The mmu_notifier_invalidate_range_start/end section is enlarged
    accordingly but this is not going to fundamentally change things. It was
    more by accident that the region under mremap was for the most part still
    available for secondary MMUs: the primary MMU was never allowed to
    reliably access that region for the duration of the mremap (modulo
    trapping SIGSEGV on the old address range, which sounds impractical and
    flaky). If users want secondary MMUs not to lose access to a large
    region under mremap, they should reduce the mremap size accordingly in
    userland and run multiple calls. Overall this will run faster, so it's
    actually going to reduce the time the region is under mremap for the
    primary MMU which should provide a net benefit to apps.

    For KVM this is a noop because the guest physical memory is never
    mremapped; there's just no point in ever moving it while the guest runs. One
    target of this optimization is JVM GC (so unrelated to the mmu notifier
    logic).

    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Using "- 1" relies on the old_end to be page aligned and PAGE_SIZE > 1,
    those are reasonable requirements but the check remains obscure and it
    looks more like an off by one error than an overflow check. This I feel
    will improve readability.
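
    Roughly, the two styles of check look like this (a hedged reconstruction
    for illustration, not the exact diff):

    /* Old style: only safe because old_end is page aligned and PAGE_SIZE > 1,
     * and it reads like an off-by-one. */
    next = (old_addr + PMD_SIZE) & PMD_MASK;
    if (next - 1 > old_end)
        next = old_end;
    extent = next - old_addr;

    /* New style: clamp the delta, which stays correct even if next wrapped. */
    next = (old_addr + PMD_SIZE) & PMD_MASK;
    extent = next - old_addr;
    if (extent > old_end - old_addr)
        extent = old_end - old_addr;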

    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 May, 2011

2 commits

  • Straightforward conversion of i_mmap_lock to a mutex.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Hugh says:
    "The only significant loser, I think, would be page reclaim (when
    concurrent with truncation): could spin for a long time waiting for
    the i_mmap_mutex it expects would soon be dropped?"

    Counterpoints:
    - cpu contention makes the spin stop (need_resched())
    - zap pages should be freeing pages at a higher rate than reclaim
    ever can

    I think the simplification of the truncate code is definitely worth it.

    Effectively reverts: 2aa15890f3c ("mm: prevent concurrent
    unmap_mapping_range() on the same inode") and takes out the code that
    caused its problem.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

07 Apr, 2011

1 commit

  • The normal mmap paths all avoid creating a mapping where the pgoff
    inside the mapping could wrap around due to overflow. However, an
    expanding mremap() can take such a non-wrapping mapping and make it
    bigger and cause a wrapping condition.
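
    A hedged sketch of the kind of check this calls for when growing a mapping
    (illustrative; the return convention is assumed, this is not necessarily
    the exact fix):

    if (new_len > old_len) {
        unsigned long pgoff;

        /* Refuse to grow if the pgoff of the last page would wrap. */
        pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
        pgoff += vma->vm_pgoff;
        if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
            return ERR_PTR(-EINVAL);
    }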

    Noticed by Robert Swiecki when running a system call fuzzer, where it
    caused a BUG_ON() due to terminally confusing the vma_prio_tree code. A
    vma dumping patch by Hugh then pinpointed the crazy wrapped case.

    Reported-and-tested-by: Robert Swiecki
    Acked-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

24 Feb, 2011

1 commit

  • Robert Swiecki reported a BUG_ON(page_mapped) from a fuzzer, punching
    a hole with madvise(,, MADV_REMOVE). That path is under mutex, and
    cannot be explained by lack of serialization in unmap_mapping_range().

    Reviewing the code, I found one place where vm_truncate_count handling
    should have been updated, when I switched at the last minute from one
    way of managing the restart_addr to another: mremap move changes the
    virtual addresses, so it ought to adjust the restart_addr.

    But rather than exporting the notion of restart_addr from memory.c, or
    converting to restart_pgoff throughout, simply reset vm_truncate_count
    to 0 to force a rescan if mremap move races with preempted truncation.

    We have no confirmation that this fixes Robert's BUG,
    but it is a fix that's worth making anyway.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

14 Jan, 2011

2 commits

  • split_huge_page_pmd compat code. Each one of those would need to be
    expanded to hundreds of lines of complex code without a fully reliable
    split_huge_page_pmd design.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • pte alloc routines must wait for split_huge_page if the pmd is not present
    and not null (i.e. pmd_trans_splitting). The additional branches are
    optimized away at compile time by pmd_trans_splitting if the config option
    is off. However we must pass the vma down in order to know the anon_vma
    lock to wait for.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

27 Oct, 2010

1 commit

  • Since we no longer need to provide KM_type, the whole pte_*map_nested()
    API is now redundant, remove it.

    Signed-off-by: Peter Zijlstra
    Acked-by: Chris Metcalf
    Cc: David Howells
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Steven Rostedt
    Cc: Russell King
    Cc: Ralf Baechle
    Cc: David Miller
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch a large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following:

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition, and for others adding it to an
    implementation .h or embedding .c file was more appropriate. This step
    added inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as a bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers, which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

07 Mar, 2010

2 commits

  • The old anon_vma code can lead to scalability issues with heavily forking
    workloads. Specifically, each anon_vma will be shared between the parent
    process and all its child processes.

    In a workload with 1000 child processes and a VMA with 1000 anonymous
    pages per process that get COWed, this leads to a system with a million
    anonymous pages in the same anon_vma, each of which is mapped in just one
    of the 1000 processes. However, the current rmap code needs to walk them
    all, leading to O(N) scanning complexity for each page.

    This can result in systems where one CPU is walking the page tables of
    1000 processes in page_referenced_one, while all other CPUs are stuck on
    the anon_vma lock. This leads to catastrophic failure for a benchmark
    like AIM7, where the total number of processes can reach in the tens of
    thousands. Real workloads are still a factor 10 less process intensive
    than AIM7, but they are catching up.

    This patch changes the way anon_vmas and VMAs are linked, which allows us
    to associate multiple anon_vmas with a VMA. At fork time, each child
    process gets its own anon_vmas, in which its COWed pages will be
    instantiated. The parents' anon_vma is also linked to the VMA, because
    non-COWed pages could be present in any of the children.
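
    The linking goes through a small chain object, roughly like this (a
    sketch; field names follow the mainline anon_vma_chain, simplified):

    struct anon_vma_chain {
        struct vm_area_struct *vma;     /* the VMA this link belongs to */
        struct anon_vma *anon_vma;      /* the anon_vma the VMA is linked into */
        struct list_head same_vma;      /* all anon_vmas of this VMA */
        struct list_head same_anon_vma; /* all VMAs sharing this anon_vma */
    };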

    This reduces rmap scanning complexity to O(1) for the pages of the 1000
    child processes, with O(N) complexity for at most 1/N pages in the system.
    This reduces the average scanning cost in heavily forking workloads from
    O(N) to 2.

    The only real complexity in this patch stems from the fact that linking a
    VMA to anon_vmas now involves memory allocations. This means vma_adjust
    can fail, if it needs to attach a VMA to anon_vma structures. This in
    turn means error handling needs to be added to the calling functions.

    A second source of complexity is that, because there can be multiple
    anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
    "the" anon_vma lock. To prevent the rmap code from walking up an
    incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
    flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
    to make sure it is impossible to compile a kernel that needs both symbolic
    values for the same bitflag.

    Some test results:

    Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
    box with 16GB RAM and not quite enough IO), the system ends up running
    >99% in system time, with every CPU on the same anon_vma lock in the
    pageout code.

    With these changes, AIM7 hits the cross-over point around 29.7k users.
    This happens with ~99% IO wait time, there never seems to be any spike in
    system time. The anon_vma lock contention appears to be resolved.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Make sure the compiler won't do weird things with limits, e.g. fetching
    them twice may return two different values after writable limits are
    implemented.

    I.e. either use rlimit helpers added in
    3e10e716abf3c71bdb5d86b8f507f9e72236c9cd ("resource: add helpers for
    fetching rlimits") or ACCESS_ONCE if not applicable.

    Signed-off-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     

11 Dec, 2009

7 commits


24 Sep, 2009

1 commit

  • Introduce new truncate helpers truncate_pagecache and inode_newsize_ok.
    vmtruncate is also consolidated from mm/memory.c and mm/nommu.c into
    mm/truncate.c.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     

22 Sep, 2009

2 commits

  • mremap move's use of ksm_madvise() was assuming -ENOMEM on failure,
    because ksm_madvise used to say -EAGAIN for that; but ksm_madvise now says
    -ENOMEM (letting madvise convert that to -EAGAIN), and can also say
    -ERESTARTSYS when signalled: so pass the error from ksm_madvise.
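
    At the move_vma() call site that amounts to roughly (a hedged sketch, not
    the exact hunk):

    err = ksm_madvise(vma, old_addr, old_addr + old_len,
                      MADV_UNMERGEABLE, &vm_flags);
    if (err)
        return err;     /* previously: any failure became -ENOMEM */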

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • KSM's scan allows for user pages to be COWed or unmapped at any time,
    without requiring any notification. But its stable tree does assume that
    when it finds a KSM page where it placed a KSM page, then it is the same
    KSM page that it placed there.

    mremap move could break that assumption: if an area containing a KSM page
    was unmapped, then an area containing a different KSM page was moved with
    mremap into the place of the original, before KSM's scan came around to
    notice. That could then poison a node of the stable tree, so that memcmps
    would "lie" and upset the ordering of the tree.

    Probably no one will ever need mremap move on a VM_MERGEABLE area; except
    that prohibiting it would make trouble for schemes in which we try making
    everything VM_MERGEABLE e.g. for testing: an mremap which normally works
    would then fail mysteriously.

    There's no need to go to any trouble, such as re-sorting KSM's list of
    rmap_items to match the new layout: simply unmerge the area to COW all its
    KSM pages before moving, but leave VM_MERGEABLE on so that they're
    remerged later.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Chris Wright
    Signed-off-by: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Avi Kivity
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

14 Jan, 2009

2 commits


06 Jan, 2009

1 commit


20 Oct, 2008

1 commit

  • Originally by Nick Piggin

    Remove mlocked pages from the LRU using "unevictable infrastructure"
    during mmap(), munmap(), mremap() and truncate(). Try to move back to
    normal LRU lists on munmap() when last mlocked mapping removed. Remove
    PageMlocked() status when page truncated from file.

    [akpm@linux-foundation.org: cleanup]
    [kamezawa.hiroyu@jp.fujitsu.com: fix double unlock_page()]
    [kosaki.motohiro@jp.fujitsu.com: split LRU: munlock rework]
    [lee.schermerhorn@hp.com: mlock: fix __mlock_vma_pages_range comment block]
    [akpm@linux-foundation.org: remove bogus kerneldoc token]
    Signed-off-by: Nick Piggin
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

29 Jul, 2008

1 commit

  • With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to pages.
    There are secondary MMUs (with secondary sptes and secondary tlbs) too.
    sptes in the kvm case are shadow pagetables, but when I say spte in
    mmu-notifier context, I mean "secondary pte". In GRU case there's no
    actual secondary pte and there's only a secondary tlb because the GRU
    secondary MMU has no knowledge about sptes and every secondary tlb miss
    event in the MMU always generates a page fault that has to be resolved by
    the CPU (this is not the case for KVM, where a secondary tlb miss will
    walk sptes in hardware and it will refill the secondary tlb transparently
    to software if the corresponding spte is present). The same way
    zap_page_range has to invalidate the pte before freeing the page, the spte
    (and secondary tlb) must also be invalidated before any page is freed and
    reused.

    Currently we take a page_count pin on every page mapped by sptes, but that
    means the pages can't be swapped whenever they're mapped by any spte
    because they're part of the guest working set. Furthermore a spte unmap
    event can immediately lead to a page to be freed when the pin is released
    (so requiring the same complex and relatively slow tlb_gather smp safe
    logic we have in zap_page_range and that can be avoided completely if the
    spte unmap event doesn't require an unpin of the page previously mapped in
    the secondary MMU).

    The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
    when the VM is swapping or freeing or doing anything on the primary MMU so
    that the secondary MMU code can drop sptes before the pages are freed,
    avoiding all page pinning and allowing 100% reliable swapping of guest
    physical address space. Furthermore it avoids the code that teardown the
    mappings of the secondary MMU, to implement a logic like tlb_gather in
    zap_page_range that would require many IPI to flush other cpu tlbs, for
    each fixed number of spte unmapped.

    To make an example: if what happens on the primary MMU is a protection
    downgrade (from writeable to wrprotect) the secondary MMU mappings will be
    invalidated, and the next secondary-mmu-page-fault will call
    get_user_pages and trigger a do_wp_page through get_user_pages if it
    called get_user_pages with write=1, and it'll re-establish an updated
    spte or secondary-tlb-mapping on the copied page. Or it will set up a
    readonly spte or readonly tlb mapping if it's a guest-read, if it calls
    get_user_pages with write=0. This is just an example.

    This allows mapping any page pointed to by any pte (and in turn visible in
    the primary CPU MMU) into a secondary MMU (be it a pure tlb like GRU, or a
    full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
    with kvm), or a remote DMA in software like XPMEM (hence needing of
    schedule in XPMEM code to send the invalidate to the remote node, while no
    need to schedule in kvm/gru as it's an immediate event like invalidating
    primary-mmu pte).

    At least for KVM without this patch it's impossible to swap guests
    reliably. And having this feature and removing the page pin allows
    several other optimizations that simplify life considerably.

    Dependencies:

    1) mm_take_all_locks() to register the mmu notifier when the whole VM
    isn't doing anything with "mm". This allows mmu notifier users to keep
    track if the VM is in the middle of the invalidate_range_begin/end
    critical section with an atomic counter increase in range_begin and
    decreased in range_end. No secondary MMU page fault is allowed to map
    any spte or secondary tlb reference, while the VM is in the middle of
    range_begin/end as any page returned by get_user_pages in that critical
    section could later immediately be freed without any further
    ->invalidate_page notification (invalidate_range_begin/end works on
    ranges and ->invalidate_page isn't called immediately before freeing
    the page). To stop all page freeing and pagetable overwrites the
    mmap_sem must be taken in write mode and all other anon_vma/i_mmap
    locks must be taken too.

    2) It'd be a waste to add branches in the VM if nobody could possibly
    run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled if
    CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
    mmu notifiers, but this already allows compiling a KVM external module
    against a kernel with mmu notifiers enabled and from the next pull from
    kvm.git we'll start using them. And GRU/XPMEM will also be able to
    continue the development by enabling KVM=m in their config, until they
    submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
    also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
    This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
    are all =n.

    The mmu_notifier_register call can fail because mm_take_all_locks may be
    interrupted by a signal and return -EINTR. Because mmu_notifier_register
    is used when a driver starts up, a failure can be gracefully handled. Here
    is an example of the change applied to kvm to register the mmu notifiers.
    Usually when a driver starts up other allocations are required anyway and
    -ENOMEM failure paths exist already.

    struct kvm *kvm_arch_create_vm(void)
    {
            struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
    +       int err;

            if (!kvm)
                    return ERR_PTR(-ENOMEM);

            INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

    +       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
    +       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
    +       if (err) {
    +               kfree(kvm);
    +               return ERR_PTR(err);
    +       }
    +
            return kvm;
    }

    mmu_notifier_unregister returns void and it's reliable.

    The patch also adds a few needed but missing includes that would prevent
    the kernel from compiling after these changes on non-x86 archs (x86 didn't
    need them by luck).

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
    [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Kanoj Sarcar
    Cc: Roland Dreier
    Cc: Steve Wise
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Rusty Russell
    Cc: Anthony Liguori
    Cc: Chris Wright
    Cc: Marcelo Tosatti
    Cc: Eric Dumazet
    Cc: "Paul E. McKenney"
    Cc: Izik Eidus
    Cc: Anthony Liguori
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

19 Oct, 2007

1 commit

  • Get rid of sparse related warnings from places that use integer as NULL
    pointer.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Stephen Hemminger
    Cc: Andi Kleen
    Cc: Jeff Garzik
    Cc: Matt Mackall
    Cc: Ian Kent
    Cc: Arnd Bergmann
    Cc: Davide Libenzi
    Cc: Stephen Smalley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Hemminger
     

20 Jul, 2007

1 commit

  • Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from
    the old mm into the new mm.

    We create the new mm before the binfmt code runs, and place the new stack at
    the very top of the address space. Once the binfmt code runs and figures out
    where the stack should be, we move it downwards.

    It is a bit peculiar in that we have one task with two mm's, one of which is
    inactive.

    [a.p.zijlstra@chello.nl: limit stack size]
    Signed-off-by: Ollie Wild
    Signed-off-by: Peter Zijlstra
    Cc:
    Cc: Hugh Dickins
    [bunk@stusta.de: unexport bprm_mm_init]
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ollie Wild
     

12 Jul, 2007

1 commit

  • Add a new security check on mmap operations to see if the user is attempting
    to mmap to low area of the address space. The amount of space protected is
    indicated by the new proc tunable /proc/sys/vm/mmap_min_addr and defaults to
    0, preserving existing behavior.

    This patch uses a new SELinux security class "memprotect." Policy already
    contains a number of allow rules like a_t self:process * (unconfined_t being
    one of them) which mean that putting this check in the process class (its
    best current fit) would make it useless as all user processes, which we also
    want to protect against, would be allowed. By taking the memprotect name of
    the new class it will also make it possible for us to move some of the other
    memory protect permissions out of 'process' and into the new class next time
    we bump the policy version number (which I also think is a good future
    idea).

    Acked-by: Stephen Smalley
    Acked-by: Chris Wright
    Signed-off-by: Eric Paris
    Signed-off-by: James Morris

    Eric Paris
     

31 Jan, 2007

1 commit

  • Nick Piggin points out that page accounting on MIPS multiple ZERO_PAGEs
    is not maintained by its move_pte, and could lead to freeing a ZERO_PAGE.

    Instead of complicating that move_pte, just forget the minor optimization
    when mremapping, and change the one thing which needed it for correctness:
    have filemap_xip use ZERO_PAGE(0) throughout instead of according to
    address.

    [ "There is no block device driver one could use for XIP on mips
    platforms" - Carsten Otte ]

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Andrew Morton
    Cc: Ralf Baechle
    Cc: Carsten Otte
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

01 Oct, 2006

1 commit

  • Implement lazy MMU update hooks which are SMP safe for both direct and shadow
    page tables. The idea is that PTE updates and page invalidations while in
    lazy mode can be batched into a single hypercall. We use this in VMI for
    shadow page table synchronization, and it is a win. It also can be used by
    PPC and for direct page tables on Xen.

    For SMP, the enter / leave must happen under protection of the page table
    locks for page tables which are being modified. This is because otherwise,
    you end up with stale state in the batched hypercall, which other CPUs can
    race ahead of. Doing this under the protection of the locks guarantees the
    synchronization is correct, and also means that spurious faults which are
    generated during this window by remote CPUs are properly handled, as the page
    fault handler must re-check the PTE under protection of the same lock.
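
    A hedged sketch of the usage pattern (illustrative; the point is that the
    enter/leave pair nests inside the page table lock, so remote CPUs cannot
    race ahead of still-queued updates):

    pte_t *pte, ptent;
    spinlock_t *ptl;

    pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
    arch_enter_lazy_mmu_mode();             /* PTE updates may now be batched */
    for (; addr < end; pte++, addr += PAGE_SIZE) {
        ptent = ptep_get_and_clear(mm, addr, pte);
        /* ... */
    }
    arch_leave_lazy_mmu_mode();             /* flush the batch in one hypercall */
    pte_unmap_unlock(pte - 1, ptl);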

    Signed-off-by: Zachary Amsden
    Signed-off-by: Jeremy Fitzhardinge
    Cc: Rusty Russell
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     

04 Jul, 2006

1 commit

  • Teach special (recursive) locking code to the lock validator. Has no effect
    on non-lockdep kernels.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

12 Jan, 2006

1 commit

  • - Move capable() from sched.h to capability.h;

    - Use <linux/capability.h> where capable() is used
    (in include/, block/, ipc/, kernel/, a few drivers/,
    mm/, security/, & sound/;
    many more drivers/ to go)

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy.Dunlap
     

17 Dec, 2005

1 commit


30 Oct, 2005

3 commits

  • Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
    a many-threaded application which concurrently initializes different parts of
    a large anonymous area.

    This patch corrects that, by using a separate spinlock per page table page, to
    guard the page table entries in that page, instead of using the mm's single
    page_table_lock. (But even then, page_table_lock is still used to guard page
    table allocation, and anon_vma allocation.)

    In this implementation, the spinlock is tucked inside the struct page of the
    page table page: with a BUILD_BUG_ON in case it overflows - which it would in
    the case of 32-bit PA-RISC with spinlock debugging enabled.

    Splitting the lock is not quite for free: another cacheline access. Ideally,
    I suppose we would use split ptlock only for multi-threaded processes on
    multi-cpu machines; but deciding that dynamically would have its own costs.
    So for now enable it by config, at some number of cpus - since the Kconfig
    language doesn't support inequalities, let preprocessor compare that with
    NR_CPUS. But I don't think it's worth being user-configurable: for good
    testing of both split and unsplit configs, split now at 4 cpus, and perhaps
    change that to 8 later.
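
    A hedged sketch of that config-time choice (not the exact mainline macros;
    whether the lock is reachable as page->ptl like this is an assumption of
    the sketch):

    #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
    /* Split: the guard lock lives in the struct page of the page table page. */
    #define pte_lockptr(mm, pmd)    (&pmd_page(*(pmd))->ptl)
    #else
    /* Unsplit: fall back to the single mm-wide page_table_lock. */
    #define pte_lockptr(mm, pmd)    (&(mm)->page_table_lock)
    #endif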

    There is a benefit even for singly threaded processes: kswapd can be attacking
    one part of the mm while another part is busy faulting.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Second step in pushing down the page_table_lock. Remove the temporary
    bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not
    to hold page_table_lock, whether it's on init_mm or a user mm; take
    page_table_lock internally to check if a racing task already allocated.

    Convert their callers from common code. But avoid coming back to change them
    again later: instead of moving the spin_lock(&mm->page_table_lock) down,
    switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which
    encapsulate the mapping+locking and unlocking+unmapping together, and in the
    end may use alternatives to the mm page_table_lock itself.
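
    Typical usage of the new pair looks roughly like this (a sketch, not a
    specific call site):

    pte_t *pte;
    spinlock_t *ptl;

    /* Allocate the page table if needed, map it and take its lock. */
    pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
    if (!pte)
        return -ENOMEM;
    /* ... examine or modify *pte ... */
    pte_unmap_unlock(pte, ptl);     /* drop the lock and the kmap together */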

    These callers all hold mmap_sem (some exclusively, some not), so at no level
    can a page table be whipped away from beneath them; and pte_alloc uses the
    "atomic" pmd_present to test whether it needs to allocate. It appears that on
    all arches we can safely descend without page_table_lock.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • It seems odd to me that, whereas pud_alloc and pmd_alloc test inline, only
    calling out-of-line __pud_alloc __pmd_alloc if allocation needed,
    pte_alloc_map and pte_alloc_kernel are entirely out-of-line. Though it does
    add a little to kernel size, change them to macros testing inline, calling
    __pte_alloc or __pte_alloc_kernel to allocate out-of-line. Mark none of them
    as fastcalls, leave that to CONFIG_REGPARM or not.

    It also seems more natural for the out-of-line functions to leave the offset
    calculation and map to the inline, which has to do it anyway for the common
    case. At least mremap move wants __pte_alloc without _map.

    Macros rather than inline functions, certainly to avoid the header file issues
    which arise from CONFIG_HIGHPTE needing kmap_types.h, but also in case any
    architectures I haven't built would have other such problems.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins