11 Jan, 2012

2 commits

  • commit 297c5eee37 ("mm: make the vma list be doubly linked") added the
    vm_prev member to vm_area_struct. We can simplify find_vma_prev() by
    using it. This also helps page fault performance, because following
    vm_prev has stronger locality of reference than a second tree lookup.
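
    For illustration, the simplified lookup could look like this sketch
    (assuming the find_vma()/vm_prev API of this era; not necessarily the
    exact patch):

    ===

    struct vm_area_struct *
    find_vma_prev(struct mm_struct *mm, unsigned long addr,
                  struct vm_area_struct **pprev)
    {
            struct vm_area_struct *vma;

            vma = find_vma(mm, addr);
            /* with the doubly linked list, prev is one pointer away */
            *pprev = vma ? vma->vm_prev : NULL;
            return vma;
    }

    ===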

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Peter Zijlstra
    Cc: Shaohua Li
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • migrate was doing an rmap_walk with speculative lock-less access on
    pagetables. That could lead it to not serialize properly against mremap
    PT locks. But a second problem remains in the order of vmas in the
    same_anon_vma list used by the rmap_walk.

    If vma_merge succeeds in copy_vma, the src vma could be placed after the
    dst vma in the same_anon_vma list. That could still lead to migrate
    missing some ptes.

    This patch adds an anon_vma_moveto_tail() function that forces the dst
    vma to the end of the list before mremap starts, to solve the problem.

    If the mremap is very large and there are lots of parents or children
    sharing the anon_vma root lock, this should still scale better than
    taking the anon_vma root lock around every pte copy for practically the
    whole duration of mremap.

    Update: Hugh noticed that special care is needed in the error path,
    where move_page_tables runs in the reverse direction; a second
    anon_vma_moveto_tail() call is needed there.

    This program exercises the anon_vma_moveto_tail path:

    ===

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    /* SIZE was left undefined in the original snippet; any multiple of
       4M keeps both halves 2M-aligned (an assumption, not from the patch) */
    #define SIZE (8UL*1024*1024)

    int main(void)
    {
            char *p, *p2, *p3, *p4;

            if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
                    perror("memalign"), exit(1);
            if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
                    perror("memalign"), exit(1);
            if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
                    perror("memalign"), exit(1);

            memset(p, 0xff, SIZE);
            printf("%p\n", p);
            memset(p2, 0xff, SIZE);
            memset(p3, 0x77, 4096);
            if (memcmp(p, p2, SIZE))
                    printf("error\n");

            /* move the second half of p on top of p3, then back again */
            p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2,
                        MREMAP_FIXED|MREMAP_MAYMOVE, p3);
            if (p4 != p3)
                    perror("mremap"), exit(1);
            p4 = mremap(p4, SIZE/2, SIZE/2,
                        MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
            if (p4 != p+SIZE/2)
                    perror("mremap"), exit(1);

            if (memcmp(p, p2, SIZE))
                    printf("error\n");
            printf("ok\n");

            return 0;
    }
    ===

    $ perf probe -a anon_vma_moveto_tail
    Add new event:
    probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)

    You can now use it on all perf tools, such as:

    perf record -e probe:anon_vma_moveto_tail -aR sleep 1

    $ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
    0x7f2ca2800000
    ok
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
    $ perf report --stdio
    100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail

    Signed-off-by: Andrea Arcangeli
    Reported-by: Nai Xia
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Pawel Sikora
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     


26 Jul, 2011

1 commit

  • - shmem pages are not immediately available, but they are not
    potentially available either: even if we swap them out, they just
    relocate from memory into swap, and the total amount of immediately and
    potentially available memory is not affected. So we shouldn't count
    them as potentially free in the first place.

    - nr_free_pages() is not an expensive operation anymore; there is no
    need to split the decision making in two halves and repeat code.
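
    A hedged sketch of the resulting accounting in the OVERCOMMIT_GUESS
    path of __vm_enough_memory() (counter names as remembered from this
    era, may differ):

    ===

    free  = global_page_state(NR_FREE_PAGES);   /* cheap to read now */
    free += global_page_state(NR_FILE_PAGES);
    /* swapping shmem out frees nothing, it only relocates the pages */
    free -= global_page_state(NR_SHMEM);
    free += nr_swap_pages;                      /* free swap slots count */

    ===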

    Signed-off-by: Dmitry Fink
    Reviewed-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Fink
     

16 Jun, 2011

1 commit

  • We have some users of this function that date back to before the vma
    list was doubly linked, and just are silly. These days, you can find
    the previous vma by just following the vma->vm_prev pointer.

    In some cases you don't need any find_vma() lookup at all, and in other
    cases you're better off with the regular "find_vma()" that uses the vma
    cache front-end lookup.

    Some "find_vma_prev()" users are still valid, though. For example, in
    the case of a stack that grows up, it can be the case that we don't find
    any 'vma' at all (because we're looking up an address that is past the
    last vma), and that the stack that we want to grow is the 'prev' vma.

    But that kind of special case aside, we generally should prefer to use
    'find_vma()'.

    Noticed due to a totally unrelated POWER memory corruption bug that just
    happened to hit in 'find_vma_prev()' and made me go "Hmm - why are we
    using that function here?".
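
    The typical conversion in such callers is trivial; a sketch:

    ===

    /* before: double lookup for no reason */
    vma = find_vma_prev(mm, addr, &prev);

    /* after: regular cached lookup, then follow the back pointer */
    vma = find_vma(mm, addr);
    if (vma)
            prev = vma->vm_prev;

    ===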

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 May, 2011

1 commit

  • The type of vma->vm_flags is 'unsigned long', neither 'int' nor
    'unsigned int'. This patch fixes such misuse.
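
    The typedef mentioned in the bracketed note below is, in sketch form
    (placement in mm_types.h is my assumption):

    ===

    /* may be extended to a 64-bit type later */
    typedef unsigned long vm_flags_t;

    ===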

    Signed-off-by: KOSAKI Motohiro
    [ Changed to use a typedef - we'll extend it to cover more cases
    later, since there has been discussion about making it a 64-bit
    type.. - Linus ]
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

25 May, 2011

9 commits

  • Straightforward conversion of anon_vma->lock to a mutex.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Straightforward conversion of i_mmap_lock to a mutex.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Hugh says:
    "The only significant loser, I think, would be page reclaim (when
    concurrent with truncation): could spin for a long time waiting for
    the i_mmap_mutex it expects would soon be dropped?"

    Counterpoints:
    - cpu contention makes the spin stop (need_resched())
    - zap pages should be freeing pages at a higher rate than reclaim
    ever can

    I think the simplification of the truncate code is definitely worth it.

    Effectively reverts: 2aa15890f3c ("mm: prevent concurrent
    unmap_mapping_range() on the same inode") and takes out the code that
    caused its problem.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Rework the existing mmu_gather infrastructure.

    The direct purpose of these patches was to allow preemptible mmu_gather,
    but even without that I think these patches provide an improvement to the
    status quo.

    The first 9 patches rework the mmu_gather infrastructure. For review
    purpose I've split them into generic and per-arch patches with the last of
    those a generic cleanup.

    The next patch provides generic RCU page-table freeing, and the followup
    is a patch converting s390 to use this. I've also got 4 patches from
    DaveM lined up (not included in this series) that uses this to implement
    gup_fast() for sparc64.

    Then there is one patch that extends the generic mmu_gather batching.

    After that follow the mm preemptibility patches; these make parts of the
    mm a lot more preemptible. They convert i_mmap_lock and anon_vma->lock
    to mutexes, which together with the mmu_gather rework makes mmu_gather
    preemptible as well.

    Making i_mmap_lock a mutex also enables a clean-up of the truncate code.

    This also allows for preemptible mmu_notifiers, something that XPMEM I
    think wants.

    Furthermore, it removes the new and universally detested unmap_mutex.

    This patch:

    Remove the first obstacle towards a fully preemptible mmu_gather.

    The current scheme assumes mmu_gather is always done with preemption
    disabled and uses per-cpu storage for the page batches. Change this to
    try and allocate a page for batching and in case of failure, use a small
    on-stack array to make some progress.

    Preemptible mmu_gather is desired in general and usable once i_mmap_lock
    becomes a mutex. Doing it before the mutex conversion saves us from
    having to rework the code by moving the mmu_gather bits inside the
    pte_lock.

    Also avoid flushing the tlb batches from under the pte lock; this is
    useful even without the i_mmap_lock conversion as it significantly
    reduces pte lock hold times.
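
    A sketch of the allocate-or-fall-back batching described above (close
    to, but not necessarily identical with, the final code):

    ===

    static int tlb_next_batch(struct mmu_gather *tlb)
    {
            struct mmu_gather_batch *batch;

            batch = tlb->active;
            if (batch->next) {
                    tlb->active = batch->next;
                    return 1;
            }

            /* GFP_NOWAIT: we may be under the pte lock, never sleep */
            batch = (void *)__get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
            if (!batch)
                    return 0;  /* caller falls back to the on-stack array */

            batch->next = NULL;
            batch->nr   = 0;
            batch->max  = MAX_GATHER_BATCH;

            tlb->active->next = batch;
            tlb->active = batch;
            return 1;
    }

    ===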

    [akpm@linux-foundation.org: fix comment tpyo]
    Signed-off-by: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Currently we have expand_upwards exported while expand_downwards is
    accessible only via expand_stack or expand_stack_downwards.

    check_stack_guard_page is a nice example of the asymmetry. It uses
    expand_stack for VM_GROWSDOWN while expand_upwards is called for
    VM_GROWSUP case.

    Let's clean this up by exporting both functions and making those names
    consistent. Let's use expand_{upwards,downwards} because expanding
    doesn't always involve stack manipulation (an example is
    ia64_do_page_fault which uses expand_upwards for registers backing store
    expansion). expand_downwards has to be defined for both
    CONFIG_STACK_GROWS{UP,DOWN} because get_arg_page calls the downwards
    version in the early process initialization phase for growsup
    configuration.
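
    After the cleanup both directions share a symmetric interface; in
    sketch form:

    ===

    int expand_upwards(struct vm_area_struct *vma, unsigned long address);
    int expand_downwards(struct vm_area_struct *vma, unsigned long address);

    ===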

    Signed-off-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc: James Bottomley
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • When I was reading the nommu code, I found that it handles the vma
    list/tree in an unusual way. IIUC, because there can be more than one
    identical/overlapping vma in the list/tree, it sorts the tree more
    strictly and does a linear search on the tree. But this is not applied
    to the list (i.e. the list can be constructed in a different order than
    the tree, so we can't use the list when finding the first vma in that
    order).

    Since inserting/sorting a vma in the tree and the list is done at the
    same time, we can easily construct both of them in the same order. And
    since a linear search on the tree can be more costly than doing it on
    the list, the search can be converted to use the list.

    Also, after commit 297c5eee3724 ("mm: make the vma list be doubly
    linked") made the list doubly linked, a couple of places needed to be
    fixed to construct the list properly.

    Patch 1/6 is a preparation. It keeps the list sorted in the same order
    as the tree and constructs the doubly-linked list properly. Patch 2/6
    is a simple optimization for vma deletion. Patches 3/6 and 4/6 convert
    tree traversal to list traversal, and the rest are simple fixes and
    cleanups.

    This patch:

    @vma added into @mm should be sorted by start addr, end addr and VMA
    struct addr in that order because we may get identical VMAs in the @mm.
    However this was true only for the rbtree, not for the list.

    This patch fixes this by remembering 'rb_prev' during the tree traversal
    like find_vma_prepare() does and linking the @vma via __vma_link_list().
    After this patch, we can iterate the whole VMAs in correct order simply by
    using @mm->mmap list.
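
    A sketch of the idea, simplified to a vm_start comparison (the real
    tie-breaking also compares vm_end and the VMA address), using the
    __vma_link_list() helper mentioned in the note below:

    ===

    /* walk the rbtree, remembering the last node left of the new vma */
    struct rb_node **p = &mm->mm_rb.rb_node, *parent = NULL, *rb_prev = NULL;
    struct vm_area_struct *pvma, *prev = NULL;

    while (*p) {
            parent = *p;
            pvma = rb_entry(parent, struct vm_area_struct, vm_rb);
            if (vma->vm_start < pvma->vm_start)
                    p = &(*p)->rb_left;
            else {
                    rb_prev = parent;       /* candidate list predecessor */
                    p = &(*p)->rb_right;
            }
    }
    rb_link_node(&vma->vm_rb, parent, p);
    rb_insert_color(&vma->vm_rb, &mm->mm_rb);

    if (rb_prev)
            prev = rb_entry(rb_prev, struct vm_area_struct, vm_rb);
    __vma_link_list(mm, vma, prev, parent);

    ===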

    [akpm@linux-foundation.org: avoid duplicating __vma_link_list()]
    Signed-off-by: Namhyung Kim
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • Avoid merging a VMA with another VMA which is cloned from the parent process.

    The cloned VMA shares the anon_vma lock with the parent process's VMA.
    If we do the merge, more VMAs (even when the new range belongs only to
    the current process) use the parent process's anon_vma lock. This
    introduces scalability issues. find_mergeable_anon_vma() already
    considers this case.
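
    A sketch of the check, modelled from memory on this era's
    is_mergeable_anon_vma() (details may differ):

    ===

    /* NULL anon_vmas may merge, but only if the existing one isn't
     * shared with a parent, i.e. the anon_vma_chain is a single entry */
    if ((!anon_vma1 || !anon_vma2) &&
        (!vma || list_is_singular(&vma->anon_vma_chain)))
            return 1;
    return anon_vma1 == anon_vma2;

    ===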

    Signed-off-by: Shaohua Li
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • If we only change vma->vm_end, we can avoid taking the anon_vma lock
    even if 'insert' isn't NULL, which is the case for split_vma.

    As I understand it, we needed the lock before because rmap must be able
    to find the 'insert' VMA while we adjust the old VMA's vm_end (the
    'insert' VMA used to be linked onto the anon_vma list in
    __insert_vm_struct).

    But now this isn't true any more. The 'insert' VMA is already linked
    onto the anon_vma list in __split_vma (with anon_vma_clone()) instead
    of __insert_vm_struct. There is no race in which rmap can't find the
    required VMAs. So the anon_vma lock is unnecessary; this removes one
    lock acquisition in the brk case and improves scalability.
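
    In sketch form, the lock is now only taken when rmap could actually
    observe an inconsistency (condition as remembered, may differ):

    ===

    /* changing only vma->vm_end needs no anon_vma lock */
    if (vma->anon_vma && (importer || start != vma->vm_start)) {
            anon_vma = vma->anon_vma;
            anon_vma_lock(anon_vma);
    }

    ===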

    Signed-off-by: Shaohua Li
    Cc: Rik van Riel
    Acked-by: Hugh Dickins
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Make some variables have correct alignment/section to avoid cache
    issues. In a workload which heavily does mmap/munmap, these variables
    are used frequently.
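
    The kind of annotation involved, as a hedged sketch (the exact set of
    variables is from memory):

    ===

    /* hot in mmap/munmap-heavy workloads; keep off false-shared lines */
    struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
    int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;

    ===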

    Signed-off-by: Shaohua Li
    Cc: Andi Kleen
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

10 May, 2011

1 commit

  • Commit a626ca6a6564 ("vm: fix vm_pgoff wrap in stack expansion") fixed
    the case of an expanding mapping causing vm_pgoff wrapping when you had
    downward stack expansion. But there was another case where IA64 and
    PA-RISC expand mappings: upward expansion.

    This fixes that case too.
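
    A sketch of the guard in the upward-expansion path, mirroring the
    earlier downward fix (exact form from memory):

    ===

    /* in expand_upwards(): refuse growth that would wrap vm_pgoff */
    unsigned long size = address - vma->vm_start;
    unsigned long grow = (address - vma->vm_end) >> PAGE_SHIFT;

    error = -ENOMEM;
    if (vma->vm_pgoff + (size >> PAGE_SHIFT) >= vma->vm_pgoff)
            error = acct_stack_growth(vma, size, grow);

    ===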

    Signed-off-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

15 Apr, 2011

1 commit

  • 5520e89 ("brk: fix min_brk lower bound computation for COMPAT_BRK")
    tried to get the whole logic of brk randomization for legacy
    (libc5-based) applications finally right.

    It turns out that the way that patch introduced to detect whether brk
    has actually been randomized still doesn't work for those binaries, as
    reported by Geert:

    : /sbin/init from my old m68k ramdisk exits prematurely.
    :
    : Before the patch:
    :
    : | brk(0x80005c8e) = 0x80006000
    :
    : After the patch:
    :
    : | brk(0x80005c8e) = 0x80005c8e
    :
    : Old libc5 considers brk() to have failed if the return value is not
    : identical to the requested value.

    I don't like it, but currently see no better option than a bit flag in
    task_struct to catch the CONFIG_COMPAT_BRK && randomize_va_space == 2
    case.
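
    A sketch of the resulting check in sys_brk(), assuming the bit flag is
    the brk_randomized bit this patch adds to task_struct:

    ===

    #ifdef CONFIG_COMPAT_BRK
            /* only honor the legacy lower bound if brk wasn't randomized */
            if (current->brk_randomized)
                    min_brk = mm->start_brk;
            else
                    min_brk = mm->end_data;
    #else
            min_brk = mm->start_brk;
    #endif

    ===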

    Signed-off-by: Jiri Kosina
    Tested-by: Geert Uytterhoeven
    Reported-by: Geert Uytterhoeven
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     

13 Apr, 2011

1 commit

  • Commit 982134ba6261 ("mm: avoid wrapping vm_pgoff in mremap()") fixed
    the case of an expanding mapping causing vm_pgoff wrapping when you used
    mremap. But there was another case where we expand mappings hiding in
    plain sight: the automatic stack expansion.

    This fixes that case too.

    This one also found by Robert Święcki, using his nasty system call
    fuzzer tool. Good job.
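
    The guard is analogous to the mremap one; a sketch of the
    expand_downwards() check (from memory, not necessarily verbatim):

    ===

    /* growing down by 'grow' pages must not underflow vm_pgoff */
    size = vma->vm_end - address;
    grow = (vma->vm_start - address) >> PAGE_SHIFT;

    error = -ENOMEM;
    if (grow <= vma->vm_pgoff)
            error = acct_stack_growth(vma, size, grow);

    ===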

    Reported-and-tested-by: Robert Święcki
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

14 Jan, 2011

3 commits

  • Even if CONFIG_COMPAT_BRK is set in the kernel configuration, it can
    still be overridden by the randomize_va_space sysctl.

    If this is the case, the min_brk computation in sys_brk() implementation
    is wrong, as it solely takes into account COMPAT_BRK setting, assuming
    that brk start is not randomized. But that might not be the case if
    randomize_va_space sysctl has been set to '2' at the time the binary has
    been loaded from disk.

    In such a case, the check has to be done in the same way as in the
    !CONFIG_COMPAT_BRK case.

    In addition to that, the check for the COMPAT_BRK case introduced back in
    a5b4592c ("brk: make sys_brk() honor COMPAT_BRK when computing lower
    bound") is slightly wrong -- the lower bound shouldn't be mm->end_code,
    but mm->end_data instead, as that's where the legacy applications expect
    brk section to start (i.e. immediately after last global variable).
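
    In sketch form, the fixed computation (heuristic as remembered from
    this patch; may not be verbatim):

    ===

    #ifdef CONFIG_COMPAT_BRK
            /*
             * CONFIG_COMPAT_BRK can still be overridden by setting
             * randomize_va_space to 2, which shifts mm->start_brk
             */
            if (mm->start_brk > PAGE_ALIGN(mm->end_data))
                    min_brk = mm->start_brk;    /* brk was randomized */
            else
                    min_brk = mm->end_data;     /* legacy lower bound */
    #else
            min_brk = mm->start_brk;
    #endif

    ===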

    [akpm@linux-foundation.org: fix comment]
    Signed-off-by: Jiri Kosina
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     
  • A huge pmd can only be mapped if the corresponding 2M virtual range is
    fully contained in the vma. At times the VM calls split_vma twice; if
    the first split_vma succeeds and the second fails, the first split_vma
    remains in effect and is not rolled back. For split_vma or vma_adjust
    to fail, an allocation failure is needed, so it's a very unlikely event
    (the out of memory killer would normally fire before any allocation
    failure is visible to kernel and userland, and if an out of memory
    condition happens it's unlikely to happen exactly here). Nevertheless
    it's safer to ensure that
    no huge pmd can be left around if the vma is adjusted in a way that can't
    fit hugepages anymore at the new vm_start/vm_end address.
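
    A sketch of the idea: split any huge pmd that would straddle the
    adjusted boundaries (names and checks from memory, likely not exact):

    ===

    static void vma_adjust_trans_huge(struct vm_area_struct *vma,
                                      unsigned long start, unsigned long end,
                                      long adjust_next)
    {
            /* a huge pmd can't survive a boundary that isn't 2M aligned */
            if (start & ~HPAGE_PMD_MASK &&
                (start & HPAGE_PMD_MASK) >= vma->vm_start &&
                (start & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
                    split_huge_page_address(vma->vm_mm, start);
            /* the same check is repeated for 'end' and the next vma */
    }

    ===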

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Register the vma in khugepaged if it grows.
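
    In sketch form, the growth paths gain a call like the following
    (assuming the khugepaged_enter_vma_merge() helper of this era):

    ===

    /* stack/brk grew: let khugepaged collapse the new range later */
    khugepaged_enter_vma_merge(vma);

    ===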

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

16 Dec, 2010

1 commit

  • The install_special_mapping routine (used, for example, to set up the
    vdso) skips the security check before insert_vm_struct, allowing a local
    attacker to bypass the mmap_min_addr security restriction by limiting
    the available pages for special mappings.

    bprm_mm_init() also skips the check, and although I don't think this can
    be used to bypass any restrictions, I don't see any reason not to have
    the security check.
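
    The fix is to perform the same check done on the normal mmap path; a
    sketch, assuming the security_file_mmap() signature of this era:

    ===

    /* in install_special_mapping(), before insert_vm_struct() */
    ret = security_file_mmap(NULL, 0, 0, 0, vma->vm_start, 1);
    if (ret)
            goto out;

    ===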

    $ uname -m
    x86_64
    $ cat /proc/sys/vm/mmap_min_addr
    65536
    $ cat install_special_mapping.s
    section .bss
    resb BSS_SIZE
    section .text
    global _start
    _start:
    mov eax, __NR_pause
    int 0x80
    $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
    $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
    $ ./install_special_mapping &
    [1] 14303
    $ cat /proc/14303/maps
    0000f000-00010000 r-xp 00000000 00:00 0 [vdso]
    00010000-00011000 r-xp 00001000 00:19 2453665 /home/taviso/install_special_mapping
    00011000-ffffe000 rwxp 00000000 00:00 0 [stack]

    It's worth noting that Red Hat are shipping with mmap_min_addr set to
    4096.

    Signed-off-by: Tavis Ormandy
    Acked-by: Kees Cook
    Acked-by: Robert Swiecki
    [ Changed to not drop the error code - akpm ]
    Reviewed-by: James Morris
    Signed-off-by: Linus Torvalds

    Tavis Ormandy
     

30 Oct, 2010

1 commit

  • Normal syscall audit doesn't catch the 5th argument of a syscall. It
    also doesn't catch the contents of userland structures pointed to by
    syscall arguments, so for both the old and new mmap(2) ABIs it doesn't
    record the descriptor we are mapping. For the old one it also misses
    the flags.
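
    A sketch of the remedy, assuming an audit helper along the lines this
    series introduces:

    ===

    /* in sys_mmap_pgoff() and the old-ABI wrapper alike */
    audit_mmap_fd(fd, flags);   /* record fd and flags for the audit log */

    ===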

    Signed-off-by: Al Viro

    Al Viro
     

23 Sep, 2010

1 commit

  • If __split_vma fails because of an out of memory condition, the
    anon_vma_chain isn't torn down and freed, potentially leading to rmap
    walks accessing freed vma information; in addition there's a memleak.
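
    In sketch form, the error path gains the missing teardown (helper
    names as remembered):

    ===

    /* on failure, undo the anon_vma_chain clone before freeing 'new' */
    if (err) {
            unlink_anon_vmas(new);
            kmem_cache_free(vm_area_cachep, new);
    }

    ===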

    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Hugh Dickins
    Cc: Marcelo Tosatti
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 Aug, 2010

1 commit

  • pa-risc and ia64 have stacks that grow upwards. Check that
    they do not run into other mappings. By making VM_GROWSUP
    0x0 on architectures that do not ever use it, we can avoid
    some unpleasant #ifdefs in check_stack_guard_page().
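
    The trick, in sketch form (constant value from memory):

    ===

    #if defined(CONFIG_STACK_GROWSUP) || defined(CONFIG_IA64)
    #define VM_GROWSUP      0x00000200
    #else
    #define VM_GROWSUP      0x00000000      /* dead code folds away */
    #endif

    ===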

    Signed-off-by: Tony Luck
    Signed-off-by: Linus Torvalds

    Luck, Tony
     

21 Aug, 2010

1 commit

  • It's a really simple list, and several of the users want to go backwards
    in it to find the previous vma. So rather than have to look up the
    previous entry with 'find_vma_prev()' or something similar, just make it
    doubly linked instead.
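
    The change itself is a one-field addition, in sketch form:

    ===

    struct vm_area_struct {
            /* ... other fields omitted ... */
            struct vm_area_struct *vm_next, *vm_prev; /* vm_prev is new */
    };

    ===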

    Tested-by: Ian Campbell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

10 Aug, 2010

4 commits

  • There's no anon-vma related mangling happening inside __vma_link anymore,
    so there is no need for anon_vma locking there.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Always (and only) lock the root (oldest) anon_vma whenever we do something
    in an anon_vma. The recently introduced anon_vma scalability is due to
    the rmap code scanning only the VMAs that need to be scanned. Many common
    operations still took the anon_vma lock on the root anon_vma, so always
    taking that lock is not expected to introduce any scalability issues.

    However, always taking the same lock does mean we only need to take one
    lock, which means rmap_walk on pages from any anon_vma in the vma is
    excluded from occurring during an munmap, expand_stack or other operation
    that needs to exclude rmap_walk and similar functions.

    Also add the proper locking to vma_adjust.
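
    In sketch form, locking always goes to the root:

    ===

    static inline void anon_vma_lock(struct anon_vma *anon_vma)
    {
            spin_lock(&anon_vma->root->lock);  /* always the oldest one */
    }

    ===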

    Signed-off-by: Rik van Riel
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Substitute a direct call of spin_lock(anon_vma->lock) with an inline
    function doing exactly the same.

    This makes it easier to do the substitution to the root anon_vma lock in a
    following patch.

    We will deal with the handful of special locks (nested, dec_and_lock, etc)
    separately.
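
    The inline introduced here is a direct wrapper; a sketch:

    ===

    static inline void anon_vma_lock(struct anon_vma *anon_vma)
    {
            spin_lock(&anon_vma->lock);
    }

    static inline void anon_vma_unlock(struct anon_vma *anon_vma)
    {
            spin_unlock(&anon_vma->lock);
    }

    ===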

    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Rename anon_vma_lock to vma_lock_anon_vma. This matches the naming style
    used in page_lock_anon_vma and will come in really handy further down in
    this patch series.

    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

09 Jun, 2010

1 commit

  • Add the capability to track data mmap()s. This can be used together
    with PERF_SAMPLE_ADDR for data profiling.
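
    From the user side this is a new attribute bit; a sketch (field name
    as remembered from the stable ABI update):

    ===

    struct perf_event_attr attr = { 0 };

    attr.size        = sizeof(attr);
    attr.type        = PERF_TYPE_HARDWARE;
    attr.config      = PERF_COUNT_HW_CACHE_MISSES;
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR;
    attr.mmap_data   = 1;   /* also report non-exec (data) mmaps */

    ===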

    Signed-off-by: Anton Blanchard
    [Updated code for stable perf ABI]
    Signed-off-by: Eric B Munson
    Signed-off-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Paul Mackerras
    Cc: Mike Galbraith
    Cc: Steven Rostedt
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Eric B Munson
     


13 Apr, 2010

2 commits

  • When we move the boundaries between two vma's due to things like
    mprotect, we need to make sure that the anon_vma of the pages that got
    moved from one vma to another gets properly copied around. And that was
    not always the case, in this rather hard-to-follow code sequence.

    Clarify the code, and fix it so that it copies the anon_vma from the
    right source.

    Reviewed-by: Rik van Riel
    Acked-by: Johannes Weiner
    Tested-by: Borislav Petkov [ "Yeah, not so much this one either" ]
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • This changes the anon_vma reuse case to require that we only reuse
    simple anon_vma's - ie the case when the vma only has a single anon_vma
    associated with it.

    This means that a reuse of an anon_vma from an adjacent vma will always
    guarantee that both vma's are associated not only with the same
    anon_vma, they will also have the same anon_vma chain (of just a single
    entry in this case).

    And since anon_vma re-use was the only case where the same anon_vma
    might be associated with different chains of anon_vma's, we now have the
    case that every vma that shares the same anon_vma will always also have
    the same chain. That makes it much easier to think about merging vma's
    that share the same anon_vma's: you can always just drop the other
    anon_vma chain in anon_vma_merge() since you know that they are always
    identical.

    This also splits up the function to validate the anon_vma re-use, and
    adds a lot of commentary about the possible races.
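
    A sketch of the validation after the split, modelled loosely on the
    code of this era (helper names are my reconstruction):

    ===

    static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old,
                    struct vm_area_struct *a, struct vm_area_struct *b)
    {
            if (anon_vma_compatible(a, b)) {
                    struct anon_vma *anon_vma = old->anon_vma;

                    /* only reuse a "simple" anon_vma: one chain entry */
                    if (anon_vma && list_is_singular(&old->anon_vma_chain))
                            return anon_vma;
            }
            return NULL;
    }

    ===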

    Reviewed-by: Rik van Riel
    Acked-by: Johannes Weiner
    Tested-by: Borislav Petkov [ "That didn't fix it" ]
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

13 Mar, 2010

1 commit

  • Add a generic implementation of the old mmap() syscall, which expects its
    argument in a memory block, and switch all architectures over to use it.
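
    A sketch of the generic implementation (close to the actual helper,
    from memory):

    ===

    struct mmap_arg_struct {
            unsigned long addr;
            unsigned long len;
            unsigned long prot;
            unsigned long flags;
            unsigned long fd;
            unsigned long offset;
    };

    SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
    {
            struct mmap_arg_struct a;

            if (copy_from_user(&a, arg, sizeof(a)))
                    return -EFAULT;
            if (a.offset & ~PAGE_MASK)
                    return -EINVAL;

            return sys_mmap_pgoff(a.addr, a.len, a.prot, a.flags, a.fd,
                                  a.offset >> PAGE_SHIFT);
    }

    ===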

    Signed-off-by: Christoph Hellwig
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Hirokazu Takata
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Reviewed-by: H. Peter Anvin
    Cc: Al Viro
    Cc: Arnd Bergmann
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: "Luck, Tony"
    Cc: James Morris
    Cc: Andreas Schwab
    Acked-by: Jesper Nilsson
    Acked-by: Russell King
    Acked-by: Greg Ungerer
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

07 Mar, 2010

3 commits

  • When a VMA is in an inconsistent state during setup or teardown, the worst
    that can happen is that the rmap code will not be able to find the page.

    The mapping is in the process of being torn down (PTEs just got
    invalidated by munmap), or set up (no PTEs have been instantiated yet).

    It is also impossible for the rmap code to follow a pointer to an already
    freed VMA, because the rmap code holds the anon_vma->lock, which the VMA
    teardown code needs to take before the VMA is removed from the anon_vma
    chain.

    Hence, we should not need the VM_LOCK_RMAP locking at all.

    Signed-off-by: Rik van Riel
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • The old anon_vma code can lead to scalability issues with heavily forking
    workloads. Specifically, each anon_vma will be shared between the parent
    process and all its child processes.

    In a workload with 1000 child processes and a VMA with 1000 anonymous
    pages per process that get COWed, this leads to a system with a million
    anonymous pages in the same anon_vma, each of which is mapped in just one
    of the 1000 processes. However, the current rmap code needs to walk them
    all, leading to O(N) scanning complexity for each page.

    This can result in systems where one CPU is walking the page tables of
    1000 processes in page_referenced_one, while all other CPUs are stuck on
    the anon_vma lock. This leads to catastrophic failure for a benchmark
    like AIM7, where the total number of processes can reach in the tens of
    thousands. Real workloads are still a factor 10 less process intensive
    than AIM7, but they are catching up.

    This patch changes the way anon_vmas and VMAs are linked, which allows us
    to associate multiple anon_vmas with a VMA. At fork time, each child
    process gets its own anon_vmas, in which its COWed pages will be
    instantiated. The parents' anon_vma is also linked to the VMA, because
    non-COWed pages could be present in any of the children.

    This reduces rmap scanning complexity to O(1) for the pages of the 1000
    child processes, with O(N) complexity for at most 1/N pages in the system.
    This reduces the average scanning cost in heavily forking workloads from
    O(N) to 2.

    The only real complexity in this patch stems from the fact that linking a
    VMA to anon_vmas now involves memory allocations. This means vma_adjust
    can fail, if it needs to attach a VMA to anon_vma structures. This in
    turn means error handling needs to be added to the calling functions.

    A second source of complexity is that, because there can be multiple
    anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
    "the" anon_vma lock. To prevent the rmap code from walking up an
    incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
    flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
    to make sure it is impossible to compile a kernel that needs both symbolic
    values for the same bitflag.

    Some test results:

    Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
    box with 16GB RAM and not quite enough IO), the system ends up running
    >99% in system time, with every CPU on the same anon_vma lock in the
    pageout code.

    With these changes, AIM7 hits the cross-over point around 29.7k users.
    This happens with ~99% IO wait time, there never seems to be any spike in
    system time. The anon_vma lock contention appears to be resolved.
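
    The new linking object, in sketch form (matching the description
    above; list names from memory):

    ===

    struct anon_vma_chain {
            struct vm_area_struct *vma;
            struct anon_vma *anon_vma;
            struct list_head same_vma;      /* all anon_vmas of one vma */
            struct list_head same_anon_vma; /* all vmas of one anon_vma */
    };

    ===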

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Make sure the compiler won't do weird things with limits. E.g. fetching
    them twice may return 2 different values after writable limits are
    implemented.

    I.e. either use rlimit helpers added in
    3e10e716abf3c71bdb5d86b8f507f9e72236c9cd ("resource: add helpers for
    fetching rlimits") or ACCESS_ONCE if not applicable.

    Signed-off-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby