16 Dec, 2010

1 commit

  • The install_special_mapping routine (used, for example, to set up the
    vdso) skips the security check before insert_vm_struct, allowing a
    local attacker to bypass the mmap_min_addr security restriction by
    limiting the available pages for special mappings.

    bprm_mm_init() also skips the check, and although I don't think this
    can be used to bypass any restrictions, I don't see any reason not to
    have the security check. (The missing check is sketched after this
    entry.)

    $ uname -m
    x86_64
    $ cat /proc/sys/vm/mmap_min_addr
    65536
    $ cat install_special_mapping.s
    section .bss
    resb BSS_SIZE
    section .text
    global _start
    _start:
    mov eax, __NR_pause
    int 0x80
    $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
    $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
    $ ./install_special_mapping &
    [1] 14303
    $ cat /proc/14303/maps
    0000f000-00010000 r-xp 00000000 00:00 0 [vdso]
    00010000-00011000 r-xp 00001000 00:19 2453665 /home/taviso/install_special_mapping
    00011000-ffffe000 rwxp 00000000 00:00 0 [stack]

    It's worth noting that Red Hat are shipping with mmap_min_addr set to
    4096.

    Signed-off-by: Tavis Ormandy
    Acked-by: Kees Cook
    Acked-by: Robert Swiecki
    [ Changed to not drop the error code - akpm ]
    Reviewed-by: James Morris
    Signed-off-by: Linus Torvalds

    Tavis Ormandy
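
    A minimal sketch of the shape of the fix, assuming the 2.6.36-era
    security_file_mmap() hook (the merged patch's exact call site and
    error handling may differ; the function name below is illustrative):

        /* mm/mmap.c: sketch of install_special_mapping() after the fix */
        static int install_special_mapping_sketch(struct mm_struct *mm,
                                                  struct vm_area_struct *vma)
        {
            int ret;

            /* The missing piece: consult the mmap_min_addr/LSM policy
             * before inserting the vma, just as do_mmap_pgoff() does,
             * and don't drop the error code. */
            ret = security_file_mmap(NULL, 0, 0, 0, vma->vm_start, 1);
            if (ret)
                return ret;

            return insert_vm_struct(mm, vma);
        }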
     

30 Oct, 2010

1 commit

  • The normal syscall audit doesn't catch the 5th argument of a syscall.
    It also doesn't catch the contents of userland structures pointed to
    by a syscall argument, so for both the old and new mmap(2) ABIs it
    doesn't record the descriptor we are mapping. For the old ABI it also
    misses the flags. (See the sketch after this entry.)

    Signed-off-by: Al Viro

    Al Viro
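
    A sketch of what this implies at the top of the common mmap path,
    assuming an audit_mmap_fd() helper added for this purpose (treat the
    exact placement as illustrative):

        SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
                        unsigned long, prot, unsigned long, flags,
                        unsigned long, fd, unsigned long, pgoff)
        {
            /* Record the descriptor and flags in the audit context:
             * the generic syscall-entry audit only captures the first
             * four register arguments and never follows user pointers. */
            audit_mmap_fd(fd, flags);

            /* ... the normal sys_mmap_pgoff() body follows unchanged ... */
        }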
     

23 Sep, 2010

1 commit

  • If __split_vma fails because of an out-of-memory condition, the
    anon_vma_chain isn't torn down and freed, potentially leading to rmap
    walks accessing freed vma information; there is also a memory leak.
    (The error-path fix is sketched after this entry.)

    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Hugh Dickins
    Cc: Marcelo Tosatti
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
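
    The shape of the fix, as a sketch of __split_vma()'s error path
    (helper and label names as of the 2.6.36 era; treat them as
    illustrative):

        /* vma_adjust() failed with -ENOMEM: unwind 'new' completely */
     out_free_mpol:
        mpol_put(pol);
     out_free_vma:
        /* 'new' may already be linked into anon_vma chains: unlink it
         * so rmap walks cannot reach the vma we are about to free, and
         * so the anon_vma_chain entries themselves are not leaked. */
        unlink_anon_vmas(new);
        kmem_cache_free(vm_area_cachep, new);
        return err;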
     

25 Aug, 2010

1 commit

  • pa-risc and ia64 have stacks that grow upwards. Check that
    they do not run into other mappings. By making VM_GROWSUP
    0x0 on architectures that do not ever use it, we can avoid
    some unpleasant #ifdefs in check_stack_guard_page(). (The
    trick is sketched after this entry.)

    Signed-off-by: Tony Luck
    Signed-off-by: Linus Torvalds

    Luck, Tony
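
    The trick, sketched (the non-zero flag value is the conventional one;
    only the 0x0 definition on non-growsup architectures matters):

        #if defined(CONFIG_STACK_GROWSUP) || defined(CONFIG_IA64)
        #define VM_GROWSUP  0x00000200
        #else
        #define VM_GROWSUP  0x00000000  /* never set on these arches */
        #endif

        /* check_stack_guard_page() can now test the flag unconditionally;
         * where VM_GROWSUP is 0 the branch is dead code the compiler
         * removes, with no #ifdef at the use site. */
        if (vma->vm_flags & VM_GROWSUP)
            /* ... check we don't grow into the next mapping ... */;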
     

21 Aug, 2010

1 commit

  • It's a really simple list, and several of the users want to go backwards
    in it to find the previous vma. So rather than have to look up the
    previous entry with 'find_vma_prev()' or something similar, just make it
    doubly linked instead. (Sketched after this entry.)

    Tested-by: Ian Campbell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
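
    What this amounts to in mm.h terms, sketched (field placement
    illustrative):

        struct vm_area_struct {
            /* ... */
            struct vm_area_struct *vm_next, *vm_prev;  /* vm_prev is new */
            /* ... */
        };

        /* callers no longer need a second lookup: */
        vma = find_vma_prev(mm, addr, &prev);  /* before: extra walk  */
        prev = vma->vm_prev;                   /* after: direct link  */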
     

10 Aug, 2010

4 commits

  • There's no anon_vma-related mangling happening inside __vma_link anymore,
    so there is no need for anon_vma locking there.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Always (and only) lock the root (oldest) anon_vma whenever we do something
    in an anon_vma. The recently introduced anon_vma scalability is due to
    the rmap code scanning only the VMAs that need to be scanned. Many common
    operations still took the anon_vma lock on the root anon_vma, so always
    taking that lock is not expected to introduce any scalability issues.

    However, always taking the same lock does mean we only need to take one
    lock, which means rmap_walk on pages from any anon_vma in the vma is
    excluded from occurring during an munmap, expand_stack or other operation
    that needs to exclude rmap_walk and similar functions.

    Also add the proper locking to vma_adjust. (The resulting helpers are
    sketched after this entry.)

    Signed-off-by: Rik van Riel
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
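
    The resulting lock helpers, sketched against the root pointer this
    series introduces:

        static inline void anon_vma_lock(struct anon_vma *anon_vma)
        {
            /* Always take the root (oldest) anon_vma's lock, so every
             * anon_vma hanging off the same root serializes on one lock. */
            spin_lock(&anon_vma->root->lock);
        }

        static inline void anon_vma_unlock(struct anon_vma *anon_vma)
        {
            spin_unlock(&anon_vma->root->lock);
        }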
     
  • Substitute a direct call of spin_lock(anon_vma->lock) with an inline
    function doing exactly the same.

    This makes it easier to do the substitution to the root anon_vma lock in a
    following patch.

    We will deal with the handful of special locks (nested, dec_and_lock, etc)
    separately.

    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Rename anon_vma_lock to vma_lock_anon_vma. This matches the naming style
    used in page_lock_anon_vma and will come in really handy further down in
    this patch series.

    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

09 Jun, 2010

1 commit

  • Add the capability to track data mmap()s. This can be used together
    with PERF_SAMPLE_ADDR for data profiling. (See the sketch after this
    entry.)

    Signed-off-by: Anton Blanchard
    [Updated code for stable perf ABI]
    Signed-off-by: Eric B Munson
    Signed-off-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Paul Mackerras
    Cc: Mike Galbraith
    Cc: Steven Rostedt
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Eric B Munson
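
    From userspace this surfaces as a new perf_event_attr bit next to the
    existing 'mmap' bit; a sketch (error handling elided, counter choice
    arbitrary):

        #include <linux/perf_event.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        struct perf_event_attr attr = {
            .type        = PERF_TYPE_HARDWARE,
            .config      = PERF_COUNT_HW_CPU_CYCLES,
            .size        = sizeof(struct perf_event_attr),
            .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR,
            .mmap        = 1,  /* record executable mmaps, as before */
            .mmap_data   = 1,  /* new: also record data (non-exec) mmaps */
        };

        /* then, e.g., inside main(): */
        int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);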
     

27 Apr, 2010

1 commit


13 Apr, 2010

2 commits

  • When we move the boundaries between two vma's due to things like
    mprotect, we need to make sure that the anon_vma of the pages that got
    moved from one vma to another gets properly copied around. And that was
    not always the case, in this rather hard-to-follow code sequence.

    Clarify the code, and fix it so that it copies the anon_vma from the
    right source.

    Reviewed-by: Rik van Riel
    Acked-by: Johannes Weiner
    Tested-by: Borislav Petkov [ "Yeah, not so much this one either" ]
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • This changes the anon_vma reuse case to require that we only reuse
    simple anon_vma's - ie the case when the vma only has a single anon_vma
    associated with it.

    This means that a reuse of an anon_vma from an adjacent vma will always
    guarantee that both vma's are associated not only with the same
    anon_vma, they will also have the same anon_vma chain (of just a single
    entry in this case).

    And since anon_vma re-use was the only case where the same anon_vma
    might be associated with different chains of anon_vma's, we now have the
    case that every vma that shares the same anon_vma will always also have
    the same chain. That makes it much easier to think about merging vma's
    that share the same anon_vma's: you can always just drop the other
    anon_vma chain in anon_vma_merge() since you know that they are always
    identical.

    This also splits up the function to validate the anon_vma re-use, and
    adds a lot of commentary about the possible races.

    Reviewed-by: Rik van Riel
    Acked-by: Johannes Weiner
    Tested-by: Borislav Petkov [ "That didn't fix it" ]
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

13 Mar, 2010

1 commit

  • Add a generic implementation of the old mmap() syscall, which expects
    its arguments in a memory block, and switch all architectures over to
    use it. (The implementation is sketched after this entry.)

    Signed-off-by: Christoph Hellwig
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Hirokazu Takata
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Reviewed-by: H. Peter Anvin
    Cc: Al Viro
    Cc: Arnd Bergmann
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: "Luck, Tony"
    Cc: James Morris
    Cc: Andreas Schwab
    Acked-by: Jesper Nilsson
    Acked-by: Russell King
    Acked-by: Greg Ungerer
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
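
    The generic implementation is essentially this shape (a sketch of the
    struct-in-memory ABI the entry describes):

        struct mmap_arg_struct {
            unsigned long addr;
            unsigned long len;
            unsigned long prot;
            unsigned long flags;
            unsigned long fd;
            unsigned long offset;
        };

        SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
        {
            struct mmap_arg_struct a;

            /* The old ABI passes a pointer to the six arguments in
             * user memory rather than passing them in registers. */
            if (copy_from_user(&a, arg, sizeof(a)))
                return -EFAULT;
            if (a.offset & ~PAGE_MASK)
                return -EINVAL;

            return sys_mmap_pgoff(a.addr, a.len, a.prot, a.flags, a.fd,
                                  a.offset >> PAGE_SHIFT);
        }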
     

07 Mar, 2010

5 commits

  • When a VMA is in an inconsistent state during setup or teardown, the worst
    that can happen is that the rmap code will not be able to find the page.

    The mapping is in the process of being torn down (PTEs just got
    invalidated by munmap), or set up (no PTEs have been instantiated yet).

    It is also impossible for the rmap code to follow a pointer to an already
    freed VMA, because the rmap code holds the anon_vma->lock, which the VMA
    teardown code needs to take before the VMA is removed from the anon_vma
    chain.

    Hence, we should not need the VM_LOCK_RMAP locking at all.

    Signed-off-by: Rik van Riel
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • The old anon_vma code can lead to scalability issues with heavily forking
    workloads. Specifically, each anon_vma will be shared between the parent
    process and all its child processes.

    In a workload with 1000 child processes and a VMA with 1000 anonymous
    pages per process that get COWed, this leads to a system with a million
    anonymous pages in the same anon_vma, each of which is mapped in just one
    of the 1000 processes. However, the current rmap code needs to walk them
    all, leading to O(N) scanning complexity for each page.

    This can result in systems where one CPU is walking the page tables of
    1000 processes in page_referenced_one, while all other CPUs are stuck on
    the anon_vma lock. This leads to catastrophic failure for a benchmark
    like AIM7, where the total number of processes can reach into the tens
    of thousands. Real workloads are still a factor of 10 less
    process-intensive than AIM7, but they are catching up.

    This patch changes the way anon_vmas and VMAs are linked, which allows
    us to associate multiple anon_vmas with a VMA (see the anon_vma_chain
    sketch after this entry). At fork time, each child process gets its own
    anon_vmas, in which its COWed pages will be instantiated. The parents'
    anon_vma is also linked to the VMA, because non-COWed pages could be
    present in any of the children.

    This reduces rmap scanning complexity to O(1) for the pages of the 1000
    child processes, with O(N) complexity for at most 1/N pages in the system.
    This reduces the average scanning cost in heavily forking workloads from
    O(N) to 2.

    The only real complexity in this patch stems from the fact that linking
    a VMA to anon_vmas now involves memory allocations. This means
    vma_adjust can fail if it needs to attach a VMA to anon_vma structures.
    This in turn means error handling needs to be added to the calling
    functions.

    A second source of complexity is that, because there can be multiple
    anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
    "the" anon_vma lock. To prevent the rmap code from walking up an
    incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
    flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
    to make sure it is impossible to compile a kernel that needs both symbolic
    values for the same bitflag.

    Some test results:

    Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
    box with 16GB RAM and not quite enough IO), the system ends up running
    >99% in system time, with every CPU on the same anon_vma lock in the
    pageout code.

    With these changes, AIM7 hits the cross-over point around 29.7k users.
    This happens with ~99% IO wait time, there never seems to be any spike in
    system time. The anon_vma lock contention appears to be resolved.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
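
    The linking object at the heart of this change, sketched from the
    description above (the comments paraphrase the locking rules):

        struct anon_vma_chain {
            struct vm_area_struct *vma;
            struct anon_vma *anon_vma;
            struct list_head same_vma;       /* all anon_vmas of one vma;
                                              * serialized by mmap_sem */
            struct list_head same_anon_vma;  /* all vmas sharing one
                                              * anon_vma; serialized by
                                              * anon_vma->lock */
        };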
     
  • Make sure the compiler won't do weird things with limits, e.g. fetching
    them twice, which may return 2 different values after writable limits
    are implemented.

    I.e. either use the rlimit helpers added in
    3e10e716abf3c71bdb5d86b8f507f9e72236c9cd ("resource: add helpers for
    fetching rlimits") or ACCESS_ONCE if not applicable. (See the sketch
    after this entry.)

    Signed-off-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
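
    Concretely, a sketch using RLIMIT_MEMLOCK as the example:

        /* Racy once limits become writable: the compiler may load the
         * limit twice and observe two different values. */
        if (len > current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur)
            return -EAGAIN;

        /* Safe: the helper fetches the limit exactly once. */
        unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK);

        if (len > lock_limit)
            return -EAGAIN;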
     
  • Currently, mlock_vma_pages_range() only returns len or 0, so the
    current error handling in mmap_region() is needlessly complex.

    This patch simplifies it and makes it consistent with the brk() code.

    Signed-off-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, mlock_vma_pages_range() never returns a negative value, so we
    can remove some worthless error checks.

    Signed-off-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

31 Dec, 2009

1 commit

  • Move sys_mmap_pgoff() from mm/util.c to mm/mmap.c and mm/nommu.c,
    where we'd expect to find such code: especially now that it contains
    the MAP_HUGETLB handling. Revert mm/util.c to how it was in 2.6.32.

    This patch just ignores MAP_HUGETLB in the nommu case, as in 2.6.32,
    whereas 2.6.33-rc2 reported -ENOSYS. Perhaps validate_mmap_request()
    should reject it with -EINVAL? Add that later if necessary.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

16 Dec, 2009

2 commits

  • Modify the generic mmap() code to keep the cache attribute in
    vma->vm_page_prot regardless of whether writenotify is enabled. Without
    this patch the cache configuration selected by f_op->mmap() is overwritten
    if writenotify is enabled, making it impossible to keep the vma uncached.

    Needed by drivers such as drivers/video/sh_mobile_lcdcfb.c which uses
    deferred io together with uncached memory.

    Signed-off-by: Magnus Damm
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Paul Mundt
    Cc: Jaya Kumar
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Magnus Damm
     
  • On ia64, the following test program exits abnormally, because the glibc
    thread library called abort().

    ========================================================
    (gdb) bt
    #0 0xa000000000010620 in __kernel_syscall_via_break ()
    #1 0x20000000003208e0 in raise () from /lib/libc.so.6.1
    #2 0x2000000000324090 in abort () from /lib/libc.so.6.1
    #3 0x200000000027c3e0 in __deallocate_stack () from /lib/libpthread.so.0
    #4 0x200000000027f7c0 in start_thread () from /lib/libpthread.so.0
    #5 0x200000000047ef60 in __clone2 () from /lib/libc.so.6.1
    ========================================================

    The fact is, glibc calls munmap() at thread-exit time to free the
    stack, and it assumes munmap() never fails. However, munmap() often
    has to split a vma, and when the process is near its map count limit
    that split fails with -ENOMEM.

    That is unreasonable, because unmapping a stack never increases the
    map count in the end: the limit is only exceeded temporarily, and an
    internal, temporary excess shouldn't produce ENOMEM.

    This patch fixes that.

    test_max_mapcount.c
    ==================================================================
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <pthread.h>

    #define THREAD_NUM 30000
    #define MAL_SIZE (8*1024*1024)

    void *wait_thread(void *args)
    {
        void *addr;

        addr = malloc(MAL_SIZE);
        sleep(10);

        return NULL;
    }

    void *wait_thread2(void *args)
    {
        sleep(60);

        return NULL;
    }

    int main(int argc, char *argv[])
    {
        int i;
        pthread_t thread[THREAD_NUM], th;
        int ret, count = 0;
        pthread_attr_t attr;

        ret = pthread_attr_init(&attr);
        if (ret)
            perror("pthread_attr_init");

        ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
        if (ret)
            perror("pthread_attr_setdetachstate");

        for (i = 0; i < THREAD_NUM; i++) {
            ret = pthread_create(&th, &attr, wait_thread, NULL);
            if (ret) {
                fprintf(stderr, "[%d] ", count);
                perror("pthread_create");
            } else {
                printf("[%d] create OK.\n", count);
            }
            count++;

            ret = pthread_create(&thread[i], &attr, wait_thread2, NULL);
            if (ret) {
                fprintf(stderr, "[%d] ", count);
                perror("pthread_create");
            } else {
                printf("[%d] create OK.\n", count);
            }
            count++;
        }

        sleep(3600);
        return 0;
    }
    ==================================================================

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

11 Dec, 2009

3 commits


25 Oct, 2009

1 commit


28 Sep, 2009

1 commit


22 Sep, 2009

8 commits

  • Add a flag for mmap that will be used to request a huge page region that
    will look like anonymous memory to userspace. This is accomplished by
    using a file on the internal vfsmount. MAP_HUGETLB is a modifier of
    MAP_ANONYMOUS and so must be specified with it. The region will behave
    the same as a MAP_ANONYMOUS region using small pages. (See the example
    after this entry.)

    [akpm@linux-foundation.org: fix arch definitions of MAP_HUGETLB]
    Signed-off-by: Eric B Munson
    Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Adam Litke
    Cc: David Gibson
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
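
    A minimal userspace example of the new flag (assumes a 2 MB huge page
    size and that huge pages have been reserved, e.g. via
    /proc/sys/vm/nr_hugepages; MAP_HUGETLB comes from the arch headers):

        #include <stdio.h>
        #include <sys/mman.h>

        int main(void)
        {
            size_t len = 2 * 1024 * 1024;   /* one huge page */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                           -1, 0);

            if (p == MAP_FAILED) {          /* e.g. no huge pages free */
                perror("mmap");
                return 1;
            }
            p[0] = 1;   /* touch it: behaves like plain MAP_ANONYMOUS */
            munmap(p, len);
            return 0;
        }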
     
  • shmem_zero_setup() does not change vm_start, pgoff or vm_flags; only some
    drivers change them (such as drivers/video/bfin-t350mcqb-fb.c).

    Move this code to a more appropriate place to save cycles for shared
    anonymous mappings.

    Signed-off-by: Huang Shijie
    Reviewed-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • We noticed very erratic behavior [throughput] with the AIM7 shared
    workload running on recent distro [SLES11] and mainline kernels on an
    8-socket, 32-core, 256GB x86_64 platform. On the SLES11 kernel
    [2.6.27.19+] with Barcelona processors, as we increased the load [10s of
    thousands of tasks], the throughput would vary between two "plateaus"--one
    at ~65K jobs per minute and one at ~130K jpm. The simple patch below
    causes the results to smooth out at the ~130k plateau.

    But wait, there's more:

    We do not see this behavior on smaller platforms--e.g., 4 socket/8 core.
    This could be the result of the larger number of cpus on the larger
    platform--a scalability issue--or it could be the result of the larger
    number of interconnect "hops" between some nodes in this platform and how
    the tasks for a given load end up distributed over the nodes' cpus and
    memories--a stochastic NUMA effect.

    The variability in the results is less pronounced [on the same platform]
    with Shanghai processors and with mainline kernels. With 31-rc6 on
    Shanghai processors and 288 file systems on 288 fibre attached storage
    volumes, the curves [jpm vs load] are both quite flat with the patched
    kernel consistently producing ~3.9% better throughput [~80K jpm vs ~77K
    jpm] than the unpatched kernel.

    Profiling indicated that the "slow" runs were incurring high[er]
    contention on an anon_vma lock in vma_adjust(), apparently called from the
    sbrk() system call.

    The patch:

    A comment in mm/mmap.c:vma_adjust() suggests that we don't really need the
    anon_vma lock when we're only adjusting the end of a vma, as is the case
    for brk(). The comment questions whether it's worth while to optimize for
    this case. Apparently, on the newer, larger x86_64 platforms, with
    interesting NUMA topologies, it is worth while--especially considering
    that the patch [if correct!] is quite simple.

    We can detect this condition--no overlap with next vma--by noting a NULL
    "importer". The anon_vma pointer will also be NULL in this case, so
    simply avoid loading vma->anon_vma to avoid the lock.

    However, we DO need to take the anon_vma lock when we're inserting a vma
    ['insert' non-NULL] even when we have no overlap [NULL "importer"], so we
    need to check for 'insert', as well. And Hugh points out that we should
    also take it when adjusting vm_start (so that rmap.c can rely upon
    vma_address() while it holds the anon_vma lock). The resulting condition
    is sketched after this entry.

    akpm: Zhang Yanmin reports a 150% throughput improvement with aim7, so it
    might be -stable material even though this isn't a regression: "this
    issue is not clear on dual socket Nehalem machine (2*4*2 cpu), but is
    severe on large machine (4*8*2 cpu)"

    [hugh.dickins@tiscali.co.uk: test vma start too]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Eric Whitney
    Tested-by: "Zhang, Yanmin"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
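
    The resulting condition, sketched with the 2.6.31-era vma_adjust()
    variable names:

        struct anon_vma *anon_vma = NULL;

        /* Take the anon_vma lock only when rmap could observe a
         * transient state: a vma being inserted, pages moving to a
         * neighbour (importer), or vm_start changing. A brk() that
         * merely extends vm_end has NULL 'insert' and 'importer' and
         * an unchanged vm_start, so it skips the lock entirely. */
        if (vma->anon_vma && (insert || importer || start != vma->vm_start))
            anon_vma = vma->anon_vma;
        if (anon_vma)
            spin_lock(&anon_vma->lock);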
     
  • If (flags & MAP_LOCKED) is true, it means vm_flags already contains the
    VM_LOCKED bit, which is set by calc_vm_flag_bits() (sketched after this
    entry).

    So there is no need to set it again; just remove the redundant code.

    Signed-off-by: Huang Shijie
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
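
    Why the reset was redundant, sketched from the mman.h translation
    helpers of that era:

        #define _calc_vm_trans(x, bit1, bit2) \
            ((bit1) <= (bit2) ? ((x) & (bit1)) * ((bit2) / (bit1)) \
                              : ((x) & (bit1)) / ((bit1) / (bit2)))

        static inline unsigned long calc_vm_flag_bits(unsigned long flags)
        {
            return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
                   _calc_vm_trans(flags, MAP_DENYWRITE,  VM_DENYWRITE ) |
                   _calc_vm_trans(flags, MAP_EXECUTABLE, VM_EXECUTABLE) |
                   _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    );
        }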
     
  • A few cleanups, given the munlock fix: the comment on ksm_test_exit() no
    longer applies, and it can be made private to ksm.c; there's no more
    reference to mmu_gather or tlb.h, and mmap.c doesn't need ksm.h.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • KSM originally stood for Kernel Shared Memory: but the kernel has long
    supported shared memory, and VM_SHARED and VM_MAYSHARE vmas, and KSM is
    something else. So we switched to saying "merge" instead of "share".

    But Chris Wright points out that this is confusing where mmap.c merges
    adjacent vmas: most especially in the name VM_MERGEABLE_FLAGS, used by
    is_mergeable_vma() to let vmas be merged despite flags being different.

    Call it VMA_MERGE_DESPITE_FLAGS? Perhaps, but at present it consists
    only of VM_CAN_NONLINEAR: so for now it's clearer on all sides to use
    that directly, with a comment on it in is_mergeable_vma() (sketched
    after this entry).

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
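
    The use site, sketched:

        static inline int is_mergeable_vma(struct vm_area_struct *vma,
                                           struct file *file,
                                           unsigned long vm_flags)
        {
            /* VM_CAN_NONLINEAR may only be set by f_op->mmap() after
             * the merge decision, so ignore it when comparing flags
             * rather than keeping a misleading VM_MERGEABLE_FLAGS. */
            if ((vma->vm_flags ^ vm_flags) & ~VM_CAN_NONLINEAR)
                return 0;
            if (vma->vm_file != file)
                return 0;
            if (vma->vm_ops && vma->vm_ops->close)
                return 0;
            return 1;
        }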
     
  • Rawhide users have reported a hang at startup when cryptsetup is run:
    the same problem can be reproduced simply by running a program such as
    int main() { mlockall(MCL_CURRENT | MCL_FUTURE); return 0; }

    The problem is that exit_mmap() applies munlock_vma_pages_all() to
    clean up VM_LOCKED areas, and its current implementation (stupidly)
    tries to fault in absent pages, for example where PROT_NONE prevented
    them being faulted in when mlocking. Whereas the "ksm: fix oom
    deadlock" patch, knowing there's a race by which KSM might try to fault
    in pages after exit_mmap() had finally zapped the range, backs out of
    such faults doing nothing when its ksm_test_exit() notices mm_users 0.

    So revert that part of "ksm: fix oom deadlock" which moved the
    ksm_exit() call from before exit_mmap() to the middle of exit_mmap();
    and remove those ksm_test_exit() checks from the page fault paths, so
    allowing the munlocking to proceed without interference.

    ksm_exit, if there are rmap_items still chained on this mm slot, takes
    mmap_sem write side: so preventing KSM from working on an mm while
    exit_mmap runs. And KSM will bail out as soon as it notices that
    mm_users is already zero, thanks to its internal ksm_test_exit checks.
    So that when a task is killed by OOM killer or the user, KSM will not
    indefinitely prevent it from running exit_mmap to release its memory.

    This does break a part of what "ksm: fix oom deadlock" was trying to
    achieve. When unmerging KSM (echo 2 >/sys/kernel/mm/ksm), and even
    when ksmd itself has to cancel a KSM page, it is possible that the
    first OOM-kill victim would be the KSM process being faulted: then its
    memory won't be freed until a second victim has been selected (freeing
    memory for the unmerging fault to complete).

    But the OOM killer is already liable to kill a second victim once the
    intended victim's p->mm goes to NULL: so there's not much point in
    rejecting this KSM patch before fixing that OOM behaviour. It is very
    much more important to allow KSM users to boot up, than to haggle over
    an unlikely and poorly supported OOM case.

    We also intend to fix munlocking to not fault pages: at which point
    this patch _could_ be reverted; though that would be controversial, so
    we hope to find a better solution.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Justin M. Forbes
    Acked-for-now-by: Hugh Dickins
    Cc: Izik Eidus
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • There's a now-obvious deadlock in KSM's out-of-memory handling:
    imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
    trying to allocate a page to break KSM in an mm which becomes the
    OOM victim (quite likely in the unmerge case): it's killed and goes
    to exit, and hangs there waiting to acquire ksm_thread_mutex.

    Clearly we must not require ksm_thread_mutex in __ksm_exit, simple
    though that made everything else: perhaps use mmap_sem somehow?
    And part of the answer lies in the comments on unmerge_ksm_pages:
    __ksm_exit should also leave all the rmap_item removal to ksmd.

    But there's a fundamental problem, that KSM relies upon mmap_sem to
    guarantee the consistency of the mm it's dealing with, yet exit_mmap
    tears down an mm without taking mmap_sem. And bumping mm_users won't
    help at all, that just ensures that the pages the OOM killer assumes
    are on their way to being freed will not be freed.

    The best answer seems to be, to move the ksm_exit callout from just
    before exit_mmap, to the middle of exit_mmap: after the mm's pages
    have been freed (if the mmu_gather is flushed), but before its page
    tables and vma structures have been freed; and down_write,up_write
    mmap_sem there to serialize with KSM's own reliance on mmap_sem.

    But KSM then needs to be careful, whenever it downs mmap_sem, to
    check that the mm is not already exiting: there's a danger of using
    find_vma on a layout that's being torn apart, or writing into page
    tables which have been freed for reuse; and even do_anonymous_page
    and __do_fault need to check they're not being called by break_ksm
    to reinstate a pte after zap_pte_range has zapped that page table.

    Though it might be clearer to add an exiting flag, set while holding
    mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating
    a zapped pte. All we need is to check whether mm_users is 0 - but we
    must remember that ksmd may detect that before __ksm_exit is reached.
    So ksm_test_exit(mm) is added to comment such checks on mm->mm_users
    (sketched after this entry).

    __ksm_exit now has to leave the clearing up of the rmap_items to ksmd,
    which needs ksm_thread_mutex; but it shifts the exiting mm just after
    the ksm_scan cursor so that it will soon be dealt with. __ksm_enter
    raises mm_count to hold the mm_struct, and ksmd's exit processing
    (exactly like its processing when it finds all VM_MERGEABLEs unmapped)
    mmdrops it, with a similar procedure for KSM_RUN_UNMERGE (which has
    stopped ksmd).

    But also give __ksm_exit a fast path: when there's no complication
    (no rmap_items attached to mm and it's not at the ksm_scan cursor),
    it can safely do all the exiting work itself. This is not just an
    optimization: when ksmd is not running, the raised mm_count would
    otherwise leak mm_structs.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
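
    The check itself is tiny; a sketch, with the comment paraphrasing the
    rules above:

        static inline int ksm_test_exit(struct mm_struct *mm)
        {
            /* True once exit_mmap() is tearing this mm down: KSM code
             * that has just taken mmap_sem, or is about to reinstate a
             * pte, must notice this and back out without doing anything. */
            return atomic_read(&mm->mm_users) == 0;
        }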
     

21 Sep, 2009

1 commit

  • Bye-bye Performance Counters, welcome Performance Events!

    In the past few months the perfcounters subsystem has grown out its
    initial role of counting hardware events, and has become (and is
    becoming) a much broader generic event enumeration, reporting, logging,
    monitoring, analysis facility.

    Naming its core object 'perf_counter' and naming the subsystem
    'perfcounters' has become more and more of a misnomer. With pending
    code like hw-breakpoints support the 'counter' name is less and
    less appropriate.

    All in one, we've decided to rename the subsystem to 'performance
    events' and to propagate this rename through all fields, variables
    and API names. (in an ABI compatible fashion)

    The word 'event' is also a bit shorter than 'counter' - which makes
    it slightly more convenient to write/handle as well.

    Thanks goes to Stephane Eranian who first observed this misnomer and
    suggested a rename.

    User-space tooling and ABI compatibility is not affected - this patch
    should be function-invariant. (Also, defconfigs were not touched to
    keep the size down.)

    This patch has been generated via the following script:

    FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

    sed -i \
    -e 's/PERF_EVENT_/PERF_RECORD_/g' \
    -e 's/PERF_COUNTER/PERF_EVENT/g' \
    -e 's/perf_counter/perf_event/g' \
    -e 's/nb_counters/nb_events/g' \
    -e 's/swcounter/swevent/g' \
    -e 's/tpcounter_event/tp_event/g' \
    $FILES

    for N in $(find . -name perf_counter.[ch]); do
        M=$(echo $N | sed 's/perf_counter/perf_event/g')
        mv $N $M
    done

    FILES=$(find . -name perf_event.*)

    sed -i \
    -e 's/COUNTER_MASK/REG_MASK/g' \
    -e 's/COUNTER/EVENT/g' \
    -e 's/\<event\>/event_id/g' \
    -e 's/counter/event/g' \
    -e 's/Counter/Event/g' \
    $FILES

    ... to keep it as correct as possible. This script can also be
    used by anyone who has pending perfcounters patches - it converts
    a Linux kernel tree over to the new naming. We tried to time this
    change to the point in time where the amount of pending patches
    is the smallest: the end of the merge window.

    Namespace clashes were fixed up in a preparatory patch - and some
    stylistic fallout will be fixed up in a subsequent patch.

    ( NOTE: 'counters' are still the proper terminology when we deal
    with hardware registers - and these sed scripts are a bit
    over-eager in renaming them. I've undone some of that, but
    in case there's something left where 'counter' would be
    better than 'event' we can undo that on an individual basis
    instead of touching an otherwise nicely automated patch. )

    Suggested-by: Stephane Eranian
    Acked-by: Peter Zijlstra
    Acked-by: Paul Mackerras
    Reviewed-by: Arjan van de Ven
    Cc: Mike Galbraith
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Kyle McMartin
    Cc: Martin Schwidefsky
    Cc: "David S. Miller"
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

19 Sep, 2009

1 commit


17 Aug, 2009

1 commit

  • Currently SELinux enforcement of controls on the ability to map low
    memory is determined by the mmap_min_addr tunable. This patch causes
    SELinux to ignore the tunable and instead use a separate Kconfig option
    specific to how much space the LSM should protect.

    The tunable will now only control the need for CAP_SYS_RAWIO and SELinux
    permissions will always protect the amount of low memory designated by
    CONFIG_LSM_MMAP_MIN_ADDR.

    This allows users who need to disable the mmap_min_addr controls (usual reason
    being they run WINE as a non-root user) to do so and still have SELinux
    controls preventing confined domains (like a web server) from being able to
    map some area of low memory.

    Signed-off-by: Eric Paris
    Signed-off-by: James Morris

    Eric Paris
     

12 Jun, 2009

1 commit

  • Merge branch 'perfcounters-for-linus' of
    git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (574 commits)
    perf_counter: Turn off by default
    perf_counter: Add counter->id to the throttle event
    perf_counter: Better align code
    perf_counter: Rename L2 to LL cache
    perf_counter: Standardize event names
    perf_counter: Rename enums
    perf_counter tools: Clean up u64 usage
    perf_counter: Rename perf_counter_limit sysctl
    perf_counter: More paranoia settings
    perf_counter: powerpc: Implement generalized cache events for POWER processors
    perf_counters: powerpc: Add support for POWER7 processors
    perf_counter: Accurate period data
    perf_counter: Introduce struct for sample data
    perf_counter tools: Normalize data using per sample period data
    perf_counter: Annotate exit ctx recursion
    perf_counter tools: Propagate signals properly
    perf_counter tools: Small frequency related fixes
    perf_counter: More aggressive frequency adjustment
    perf_counter/x86: Fix the model number of Intel Core2 processors
    perf_counter, x86: Correct some event and umask values for Intel processors
    ...

    Linus Torvalds
     

05 Jun, 2009

1 commit