Eric Lee / smarc-fsl-linux-kernel

07 Aug, 2008

1 commit

d6606683a Revert duplicate "mm/hugetlb.c must #include <asm/io.h>" ... Browse Code »

This reverts commit 7cb93181629c613ee2b8f4ffe3446f8003074842, since we
did that patch twice, and the problem was already fixed earlier by
78a34ae29bf1c9df62a5bd0f0798b6c62a54d520.

Reported-by: Andi Kleen
Signed-off-by: Linus Torvalds

Linus Torvalds
2008-08-07 03:04:54 +0800

06 Aug, 2008

2 commits

dfe195fb7 mm: fix uninitialized variables for find_vma_prepare callers ... Browse Code »

gcc 4.3.0 correctly emits the following warnings.
When a vma covering addr is found, find_vma_prepare indeed returns without
setting pprev, rb_link, and rb_parent.

mm/mmap.c: In function `insert_vm_struct':
mm/mmap.c:2085: warning: `rb_parent' may be used uninitialized in this function
mm/mmap.c:2085: warning: `rb_link' may be used uninitialized in this function
mm/mmap.c:2084: warning: `prev' may be used uninitialized in this function
mm/mmap.c: In function `copy_vma':
mm/mmap.c:2124: warning: `rb_parent' may be used uninitialized in this function
mm/mmap.c:2124: warning: `rb_link' may be used uninitialized in this function
mm/mmap.c:2123: warning: `prev' may be used uninitialized in this function
mm/mmap.c: In function `do_brk':
mm/mmap.c:1951: warning: `rb_parent' may be used uninitialized in this function
mm/mmap.c:1951: warning: `rb_link' may be used uninitialized in this function
mm/mmap.c:1949: warning: `prev' may be used uninitialized in this function
mm/mmap.c: In function `mmap_region':
mm/mmap.c:1092: warning: `rb_parent' may be used uninitialized in this function
mm/mmap.c:1092: warning: `rb_link' may be used uninitialized in this function
mm/mmap.c:1089: warning: `prev' may be used uninitialized in this function

Hugh adds: in fact, none of find_vma_prepare's callers use those values
when a vma is found to be already covering addr, it's either an error or
an occasion to munmap and repeat. Okay, let's quieten the compiler (but I
would prefer it if pprev, rb_link and rb_parent were meaningful in that
case, rather than whatever's in them from descending the tree).

Signed-off-by: Benny Halevy
Signed-off-by: Hugh Dickins
Cc: "Ryan Hope"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Benny Halevy
2008-08-06 05:33:50 +0800
5c9ffc9c3 mm_init.c: avoid ifdef-inside-macro-expansion ... Browse Code »

gcc-3.2:

mm/mm_init.c:77:1: directives may not be used inside a macro argument
mm/mm_init.c:76:47: unterminated argument list invoking macro "mminit_dprintk"
mm/mm_init.c: In function `mminit_verify_pageflags_layout':
mm/mm_init.c:80: `mminit_dprintk' undeclared (first use in this function)
mm/mm_init.c:80: (Each undeclared identifier is reported only once
mm/mm_init.c:80: for each function it appears in.)
mm/mm_init.c:80: syntax error before numeric constant

Also fix a typo in a comment.

Reported-by: Adrian Bunk
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2008-08-06 05:33:46 +0800

05 Aug, 2008

3 commits

529ae9aaa mm: rename page trylock ... Browse Code »

Converting page lock to new locking bitops requires a change of page flag
operation naming, so we might as well convert it to something nicer
(!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

This also facilitates lockdeping of page lock.

Signed-off-by: Nick Piggin
Acked-by: KOSAKI Motohiro
Acked-by: Peter Zijlstra
Acked-by: Andrew Morton
Acked-by: Benjamin Herrenschmidt
Signed-off-by: Linus Torvalds

Nick Piggin
2008-08-05 12:31:34 +0800
2e1e9212e Merge git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6 ... Browse Code »

* git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (29 commits)
sh: enable maple_keyb in dreamcast_defconfig.
SH2(A) cache update
nommu: Provide vmalloc_exec().
add addrespace definition for sh2a.
sh: Kill off ARCH_SUPPORTS_AOUT and remnants of a.out support.
sh: define GENERIC_HARDIRQS_NO__DO_IRQ.
sh: define GENERIC_LOCKBREAK.
sh: Save NUMA node data in vmcore for crash dumps.
sh: module_alloc() should be using vmalloc_exec().
sh: Fix up __bug_table handling in module loader.
sh: Add documentation and integrate into docbook build.
sh: Fix up broken kerneldoc comments.
maple: Kill useless private_data pointer.
maple: Clean up maple_driver_register/unregister routines.
input: Clean up maple keyboard driver
maple: allow removal and reinsertion of keyboard driver module
sh: /proc/asids depends on MMU.
arch/sh/boards/mach-se/7343/irq.c: removed duplicated #include
arch/sh/boards/board-ap325rxa.c: removed duplicated #include
sh/boards/Makefile typo fix
...

Linus Torvalds
2008-08-05 08:26:15 +0800
a477097d9 mlock() fix return values ... Browse Code »

Halesh says:

Please find the below testcase provide to test mlock.

Test Case :
===========================

#include
#include
#include
#include
#include
#include
#include
#include
#include

int main(void)
{
int fd,ret, i = 0;
char *addr, *addr1 = NULL;
unsigned int page_size;
struct rlimit rlim;

if (0 != geteuid())
{
printf("Execute this pgm as root\n");
exit(1);
}

/* create a file */
if ((fd = open("mmap_test.c",O_RDWR|O_CREAT,0755)) == -1)
{
printf("cant create test file\n");
exit(1);
}

page_size = sysconf(_SC_PAGE_SIZE);

/* set the MEMLOCK limit */
rlim.rlim_cur = 2000;
rlim.rlim_max = 2000;

if ((ret = setrlimit(RLIMIT_MEMLOCK,&rlim)) != 0)
{
printf("Cant change limit values\n");
exit(1);
}

addr = 0;
while (1)
{
/* map a page into memory each time*/
if ((addr = (char *) mmap(addr,page_size, PROT_READ |
PROT_WRITE,MAP_SHARED,fd,0)) == MAP_FAILED)
{
printf("cant do mmap on file\n");
exit(1);
}

if (0 == i)
addr1 = addr;
i++;
errno = 0;
/* lock the mapped memory pagewise*/
if ((ret = mlock((char *)addr, 1500)) == -1)
{
printf("errno value is %d\n", errno);
printf("cant lock maped region\n");
exit(1);
}
addr = addr + page_size;
}
}
======================================================

This testcase results in an mlock() failure with errno 14 that is EFAULT,
but it has nowhere been specified that mlock() will return EFAULT. When I
tested the same on older kernels like 2.6.18, I got the correct result i.e
errno 12 (ENOMEM).

I think in source code mlock(2), setting errno ENOMEM has been missed in
do_mlock() , on mlock_fixup() failure.

SUSv3 requires the following behavior frmo mlock(2).

[ENOMEM]
Some or all of the address range specified by the addr and
len arguments does not correspond to valid mapped pages
in the address space of the process.

[EAGAIN]
Some or all of the memory identified by the operation could not
be locked when the call was made.

This rule isn't so nice and slighly strange. but many people think
POSIX/SUS compliance is important.

Reported-by: Halesh Sadashiv
Tested-by: Halesh Sadashiv
Signed-off-by: KOSAKI Motohiro
Cc: [2.6.25.x, 2.6.26.x]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2008-08-05 07:58:45 +0800

04 Aug, 2008

1 commit

1af446edf nommu: Provide vmalloc_exec(). ... Browse Code »

Now that SH has switched to vmalloc_exec() for PAGE_KERNEL_EXEC usage,
it's apparent that nommu has no vmalloc_exec() definition of its own.
Stub in the one from mm/vmalloc.c.

Signed-off-by: Paul Mundt

Paul Mundt
2008-08-04 15:01:47 +0800

03 Aug, 2008

1 commit

84209e02d mm: dont clear PG_uptodate on truncate/invalidate ... Browse Code »

Brian Wang reported that a FUSE filesystem exported through NFS could
return I/O errors on read. This was traced to splice_direct_to_actor()
returning a short or zero count when racing with page invalidation.

However this is not FUSE or NFSD specific, other filesystems (notably
NFS) also call invalidate_inode_pages2() to purge stale data from the
cache.

If this happens while such pages are sitting in a pipe buffer, then
splice(2) from the pipe can return zero, and read(2) from the pipe can
return ENODATA.

The zero return is especially bad, since it implies end-of-file or
disconnected pipe/socket, and is documented as such for splice. But
returning an error for read() is also nasty, when in fact there was no
error (data becoming stale is not an error).

The same problems can be triggered by "hole punching" with
madvise(MADV_REMOVE).

Fix this by not clearing the PG_uptodate flag on truncation and
invalidation.

Signed-off-by: Miklos Szeredi
Acked-by: Nick Piggin
Cc: Andrew Morton
Cc: Jens Axboe
Signed-off-by: Linus Torvalds

Miklos Szeredi
2008-08-03 00:12:34 +0800

02 Aug, 2008

3 commits

3669bc143 Remove EXPORTS of follow_page & zap_page_range ... Browse Code »

Delete 2 EXPORTs that were accidentally sent upstream.

Signed-off-by: Jack Steiner
Signed-off-by: Linus Torvalds

Jack Steiner
2008-08-02 04:19:16 +0800
0ef89d25d mm/hugetlb: don't crash when HPAGE_SHIFT is 0 ... Browse Code »

Some platform decide whether they support huge pages at boot time. On
these, such as powerpc, HPAGE_SHIFT is a variable, not a constant, and is
set to 0 when there is no such support.

The patches to introduce multiple huge pages support broke that causing
the kernel to crash at boot time on machines such as POWER3 which lack
support for multiple page sizes.

Signed-off-by: Benjamin Herrenschmidt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Benjamin Herrenschmidt
2008-08-02 03:46:41 +0800
00e9028a9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6 ... Browse Code »

* git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (28 commits)
mm/hugetlb.c must #include
video: Fix up hp6xx driver build regressions.
sh: defconfig updates.
sh: Kill off stray mach-rsk7203 reference.
serial: sh-sci: Fix up SH7760/SH7780/SH7785 early printk regression.
sh: Move out individual boards without mach groups.
sh: Make sure AT_SYSINFO_EHDR is exposed to userspace in asm/auxvec.h.
sh: Allow SH-3 and SH-5 to use common headers.
sh: Provide common CPU headers, prune the SH-2 and SH-2A directories.
sh/maple: clean maple bus code
sh: More header path fixups for mach dir refactoring.
sh: Move out the solution engine headers to arch/sh/include/mach-se/
sh: I2C fix for AP325RXA and Migo-R
sh: Shuffle the board directories in to mach groups.
sh: dma-sh: Fix up dreamcast dma.h mach path.
sh: Switch KBUILD_DEFCONFIG to shx3_defconfig.
sh: Add ARCH_DEFCONFIG entries for sh and sh64.
sh: Fix compile error of Solution Engine
sh: Proper __put_user_asm() size mismatch fix.
sh: Stub in a dummy ENTRY_OFFSET for uImage offset calculation.
...

Linus Torvalds
2008-08-02 01:53:43 +0800

01 Aug, 2008

1 commit

a4b526b3b [S390] Optimize storage key operations for anon pages ... Browse Code »

For anonymous pages without a swap cache backing the check in
page_remove_rmap for the physical dirty bit in page_remove_rmap is
unnecessary. The instructions that are used to check and reset the dirty
bit are expensive. Removing the check noticably speeds up process exit.
In addition the clearing of the dirty bit in __SetPageUptodate is
pointless as well. With these two changes there is no storage key
operation for an anonymous page anymore if it does not hit the swap
space.

The micro benchmark which repeatedly executes an empty shell script
gets about 5% faster.

Signed-off-by: Martin Schwidefsky

Martin Schwidefsky
2008-08-01 22:39:30 +0800

31 Jul, 2008

9 commits

94ad374a0 Fix off-by-one error in iov_iter_advance() ... Browse Code »

The iov_iter_advance() function would look at the iov->iov_len entry
even though it might have iterated over the whole array, and iov was
pointing past the end. This would cause DEBUG_PAGEALLOC to trigger a
kernel page fault if the allocation was at the end of a page, and the
next page was unallocated.

The quick fix is to just change the order of the tests: check that there
is any iovec data left before we check the iov entry itself.

Thanks to Alexey Dobriyan for finding this case, and testing the fix.

Reported-and-tested-by: Alexey Dobriyan
Cc: Nick Piggin
Cc: Andrew Morton
Cc: [2.6.25.x, 2.6.26.x]
Signed-off-by: Linus Torvalds

Linus Torvalds
2008-07-31 05:50:18 +0800
0d39741a2 GRU Driver: export is_uv_system(), zap_page_range() & follow_page() ... Browse Code »

Exports needed by the GRU driver.

Signed-off-by: Jack Steiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jack Steiner
2008-07-31 00:41:48 +0800
c627f9cc0 mm: add zap_vma_ptes(): a library function to unmap driver ptes ... Browse Code »

zap_vma_ptes() is intended to be used by drivers to unmap ptes assigned to the
driver private vmas. This interface is similar to zap_page_range() but is
less general & less likely to be abused.

Needed by the GRU driver.

Signed-off-by: Jack Steiner
Cc: Nick Piggin
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jack Steiner
2008-07-31 00:41:47 +0800
87547ee95 do_try_to_free_page: update comments related to vmscan functions ... Browse Code »

Signed-off-by: Fernando Luis Vazquez Cao
Cc: Hugh Dickins
Cc: Nick Piggin
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Fernando Luis Vazquez Cao
2008-07-31 00:41:46 +0800
7d03431cf swapfile/vmscan: update comments related to vmscan functions ... Browse Code »

Signed-off-by: Fernando Luis Vazquez Cao
Cc: Hugh Dickins
Cc: Nick Piggin
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Fernando Luis Vazquez Cao
2008-07-31 00:41:46 +0800
ab33dc09a swap: update function comment of release_pages ... Browse Code »

Signed-off-by: Fernando Luis Vazquez Cao
Cc: Hugh Dickins
Cc: Nick Piggin
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Fernando Luis Vazquez Cao
2008-07-31 00:41:46 +0800
7e6cbea39 madvise: update function comment of madvise_dontneed ... Browse Code »

Signed-off-by: Fernando Luis Vazquez Cao
Cc: Hugh Dickins
Cc: Nick Piggin
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Fernando Luis Vazquez Cao
2008-07-31 00:41:45 +0800
4ef1b0fd6 memcg: remove redundant check in move_task() ... Browse Code »

It's guaranteed by cgroup that old_cgrp != cgrp.

Signed-off-by: Li Zefan
Cc: Paul Menage
Cc: Cedric Le Goater
Cc: Balbir Singh
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Li Zefan
2008-07-31 00:41:44 +0800
1d1958f05 mm: remove find_max_pfn_with_active_regions ... Browse Code »

It has no user now

Also print out info about adding/removing active regions.

Signed-off-by: Yinghai Lu
Acked-by: Mel Gorman
Acked-by: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yinghai Lu
2008-07-31 00:41:44 +0800

30 Jul, 2008

1 commit

7cb931816 mm/hugetlb.c must #include <asm/io.h> ... Browse Code »

This patch fixes the following build error on sh caused by
commit aa888a74977a8f2120ae9332376e179c39a6b07d
(hugetlb: support larger than MAX_ORDER):

...
CC mm/hugetlb.o
/home/bunk/linux/kernel-2.6/git/linux-2.6/mm/hugetlb.c: In function 'alloc_bootmem_huge_page':
/home/bunk/linux/kernel-2.6/git/linux-2.6/mm/hugetlb.c:958: error: implicit declaration of function 'virt_to_phys'
make[2]: *** [mm/hugetlb.o] Error 1

Reported-by: Adrian Bunk
Signed-off-by: Adrian Bunk
Signed-off-by: Paul Mundt

Adrian Bunk
2008-07-30 01:18:26 +0800

29 Jul, 2008

5 commits

8ab22b9ab vfs: pagecache usage optimization for pagesize!=blocksize ... Browse Code »

When we read some part of a file through pagecache, if there is a
pagecache of corresponding index but this page is not uptodate, read IO
is issued and this page will be uptodate.

I think this is good for pagesize == blocksize environment but there is
room for improvement on pagesize != blocksize environment. Because in
this case a page can have multiple buffers and even if a page is not
uptodate, some buffers can be uptodate.

So I suggest that when all buffers which correspond to a part of a file
that we want to read are uptodate, use this pagecache and copy data from
this pagecache to user buffer even if a page is not uptodate. This can
reduce read IO and improve system throughput.

I wrote a benchmark program and got result number with this program.

This benchmark do:

1: mount and open a test file.

2: create a 512MB file.

3: close a file and umount.

4: mount and again open a test file.

5: pwrite randomly 300000 times on a test file. offset is aligned
by IO size(1024bytes).

6: measure time of preading randomly 100000 times on a test file.

The result was:
2.6.26
330 sec

2.6.26-patched
226 sec

Arch:i386
Filesystem:ext3
Blocksize:1024 bytes
Memory: 1GB

On ext3/4, a file is written through buffer/block. So random read/write
mixed workloads or random read after random write workloads are optimized
with this patch under pagesize != blocksize environment. This test result
showed this.

The benchmark program is as follows:

#include
#include
#include
#include
#include
#include
#include
#include
#include

#define LEN 1024
#define LOOP 1024*512 /* 512MB */

main(void)
{
unsigned long i, offset, filesize;
int fd;
char buf[LEN];
time_t t1, t2;

if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
perror("cannot mount\n");
exit(1);
}
memset(buf, 0, LEN);
fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC);
if (fd < 0) {
perror("cannot open file\n");
exit(1);
}
for (i = 0; i < LOOP; i++)
write(fd, buf, LEN);
close(fd);
if (umount("/root/test1/") < 0) {
perror("cannot umount\n");
exit(1);
}
if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
perror("cannot mount\n");
exit(1);
}
fd = open("/root/test1/testfile", O_RDWR);
if (fd < 0) {
perror("cannot open file\n");
exit(1);
}

filesize = LEN * LOOP;
for (i = 0; i < 300000; i++){
offset = (random() % filesize) & (~(LEN - 1));
pwrite(fd, buf, LEN, offset);
}
printf("start test\n");
time(&t1);
for (i = 0; i < 100000; i++){
offset = (random() % filesize) & (~(LEN - 1));
pread(fd, buf, LEN, offset);
}
time(&t2);
printf("%ld sec\n", t2-t1);
close(fd);
if (umount("/root/test1/") < 0) {
perror("cannot umount\n");
exit(1);
}
}

Signed-off-by: Hisashi Hifumi
Cc: Nick Piggin
Cc: Christoph Hellwig
Cc: Jan Kara
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hisashi Hifumi
2008-07-29 07:30:21 +0800
78a34ae29 mm/hugetlb.c must #include <asm/io.h> ... Browse Code »

This patch fixes the following build error on sh caused by commit
aa888a74977a8f2120ae9332376e179c39a6b07d ("hugetlb: support larger than
MAX_ORDER"):

mm/hugetlb.c: In function 'alloc_bootmem_huge_page':
mm/hugetlb.c:958: error: implicit declaration of function 'virt_to_phys'

Signed-off-by: Adrian Bunk
Cc: Hirokazu Takata
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adrian Bunk
2008-07-29 07:30:21 +0800
cddb8a5c1 mmu-notifiers: core ... Browse Code »

With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages.
There are secondary MMUs (with secondary sptes and secondary tlbs) too.
sptes in the kvm case are shadow pagetables, but when I say spte in
mmu-notifier context, I mean "secondary pte". In GRU case there's no
actual secondary pte and there's only a secondary tlb because the GRU
secondary MMU has no knowledge about sptes and every secondary tlb miss
event in the MMU always generates a page fault that has to be resolved by
the CPU (this is not the case of KVM where the a secondary tlb miss will
walk sptes in hardware and it will refill the secondary tlb transparently
to software if the corresponding spte is present). The same way
zap_page_range has to invalidate the pte before freeing the page, the spte
(and secondary tlb) must also be invalidated before any page is freed and
reused.

Currently we take a page_count pin on every page mapped by sptes, but that
means the pages can't be swapped whenever they're mapped by any spte
because they're part of the guest working set. Furthermore a spte unmap
event can immediately lead to a page to be freed when the pin is released
(so requiring the same complex and relatively slow tlb_gather smp safe
logic we have in zap_page_range and that can be avoided completely if the
spte unmap event doesn't require an unpin of the page previously mapped in
the secondary MMU).

The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
when the VM is swapping or freeing or doing anything on the primary MMU so
that the secondary MMU code can drop sptes before the pages are freed,
avoiding all page pinning and allowing 100% reliable swapping of guest
physical address space. Furthermore it avoids the code that teardown the
mappings of the secondary MMU, to implement a logic like tlb_gather in
zap_page_range that would require many IPI to flush other cpu tlbs, for
each fixed number of spte unmapped.

To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings will be
invalidated, and the next secondary-mmu-page-fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establishing an updated
spte or secondary-tlb-mapping on the copied page. Or it will setup a
readonly spte or readonly tlb mapping if it's a guest-read, if it calls
get_user_pages with write=0. This is just an example.

This allows to map any page pointed by any pte (and in turn visible in the
primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an
full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
with kvm), or a remote DMA in software like XPMEM (hence needing of
schedule in XPMEM code to send the invalidate to the remote node, while no
need to schedule in kvm/gru as it's an immediate event like invalidating
primary-mmu pte).

At least for KVM without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.

Dependencies:

1) mm_take_all_locks() to register the mmu notifier when the whole VM
isn't doing anything with "mm". This allows mmu notifier users to keep
track if the VM is in the middle of the invalidate_range_begin/end
critical section with an atomic counter incraese in range_begin and
decreased in range_end. No secondary MMU page fault is allowed to map
any spte or secondary tlb reference, while the VM is in the middle of
range_begin/end as any page returned by get_user_pages in that critical
section could later immediately be freed without any further
->invalidate_page notification (invalidate_range_begin/end works on
ranges and ->invalidate_page isn't called immediately before freeing
the page). To stop all page freeing and pagetable overwrites the
mmap_sem must be taken in write mode and all other anon_vma/i_mmap
locks must be taken too.

2) It'd be a waste to add branches in the VM if nobody could possibly
run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if
CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
mmu notifiers, but this already allows to compile a KVM external module
against a kernel with mmu notifiers enabled and from the next pull from
kvm.git we'll start using them. And GRU/XPMEM will also be able to
continue the development by enabling KVM=m in their config, until they
submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
are all =n.

The mmu_notifier_register call can fail because mm_take_all_locks may be
interrupted by a signal and return -EINTR. Because mmu_notifier_reigster
is used when a driver startup, a failure can be gracefully handled. Here
an example of the change applied to kvm to register the mmu notifiers.
Usually when a driver startups other allocations are required anyway and
-ENOMEM failure paths exists already.

struct kvm *kvm_arch_create_vm(void)
{
struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+ int err;

if (!kvm)
return ERR_PTR(-ENOMEM);

INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

+ kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+ err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+ if (err) {
+ kfree(kvm);
+ return ERR_PTR(err);
+ }
+
return kvm;
}

mmu_notifier_unregister returns void and it's reliable.

The patch also adds a few needed but missing includes that would prevent
kernel to compile after these changes on non-x86 archs (x86 didn't need
them by luck).

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix mm/filemap_xip.c build]
[akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
Signed-off-by: Andrea Arcangeli
Signed-off-by: Nick Piggin
Signed-off-by: Christoph Lameter
Cc: Jack Steiner
Cc: Robin Holt
Cc: Nick Piggin
Cc: Peter Zijlstra
Cc: Kanoj Sarcar
Cc: Roland Dreier
Cc: Steve Wise
Cc: Avi Kivity
Cc: Hugh Dickins
Cc: Rusty Russell
Cc: Anthony Liguori
Cc: Chris Wright
Cc: Marcelo Tosatti
Cc: Eric Dumazet
Cc: "Paul E. McKenney"
Cc: Izik Eidus
Cc: Anthony Liguori
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2008-07-29 07:30:21 +0800
7906d00cd mmu-notifiers: add mm_take_all_locks() operation ... Browse Code »

mm_take_all_locks holds off reclaim from an entire mm_struct. This allows
mmu notifiers to register into the mm at any time with the guarantee that
no mmu operation is in progress on the mm.

This operation locks against the VM for all pte/vma/mm related operations
that could ever happen on a certain mm. This includes vmtruncate,
try_to_unmap, and all page faults.

The caller must take the mmap_sem in write mode before calling
mm_take_all_locks(). The caller isn't allowed to release the mmap_sem
until mm_drop_all_locks() returns.

mmap_sem in write mode is required in order to block all operations that
could modify pagetables and free pages without need of altering the vma
layout (for example populate_range() with nonlinear vmas). It's also
needed in write mode to avoid new anon_vmas to be associated with existing
vmas.

A single task can't take more than one mm_take_all_locks() in a row or it
would deadlock.

mm_take_all_locks() and mm_drop_all_locks are expensive operations that
may have to take thousand of locks.

mm_take_all_locks() can fail if it's interrupted by signals.

When mmu_notifier_register returns, we must be sure that the driver is
notified if some task is in the middle of a vmtruncate for the 'mm' where
the mmu notifier was registered (mmu_notifier_invalidate_range_start/end
is run around the vmtruncation but mmu_notifier_register can run after
mmu_notifier_invalidate_range_start and before
mmu_notifier_invalidate_range_end). Same problem for rmap paths. And
we've to remove page pinning to avoid replicating the tlb_gather logic
inside KVM (and GRU doesn't work well with page pinning regardless of
needing tlb_gather), so without mm_take_all_locks when vmtruncate frees
the page, kvm would have no way to notice that it mapped into sptes a page
that is going into the freelist without a chance of any further
mmu_notifier notification.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andrea Arcangeli
Acked-by: Linus Torvalds
Cc: Christoph Lameter
Cc: Jack Steiner
Cc: Robin Holt
Cc: Nick Piggin
Cc: Peter Zijlstra
Cc: Kanoj Sarcar
Cc: Roland Dreier
Cc: Steve Wise
Cc: Avi Kivity
Cc: Hugh Dickins
Cc: Rusty Russell
Cc: Anthony Liguori
Cc: Chris Wright
Cc: Marcelo Tosatti
Cc: Eric Dumazet
Cc: "Paul E. McKenney"
Cc: Izik Eidus
Cc: Anthony Liguori
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2008-07-29 07:30:21 +0800
14fcc23fd tmpfs: fix kernel BUG in shmem_delete_inode ... Browse Code »

SuSE's insserve initscript ordering program hits kernel BUG at mm/shmem.c:814
on 2.6.26. It's using posix_fadvise on directories, and the shmem_readpage
method added in 2.6.23 is letting POSIX_FADV_WILLNEED allocate useless pages
to a tmpfs directory, incrementing i_blocks count but never decrementing it.

Fix this by assigning shmem_aops (pointing to readpage and writepage and
set_page_dirty) only when it's needed, on a regular file or a long symlink.

Many thanks to Kel for outstanding bugreport and steps to reproduce it.

Reported-by: Kel Modderman
Tested-by: Kel Modderman
Signed-off-by: Hugh Dickins
Cc: [2.6.25.x, 2.6.26.x]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2008-07-29 07:30:20 +0800

28 Jul, 2008

1 commit

9b1a4d383 stop_machine: Wean existing callers off stop_machine_run() ... Browse Code »

Signed-off-by: Rusty Russell

Rusty Russell
2008-07-28 10:16:31 +0800

27 Jul, 2008

12 commits

4836e3007 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 ... Browse Code »

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (39 commits)
[PATCH] fix RLIM_NOFILE handling
[PATCH] get rid of corner case in dup3() entirely
[PATCH] remove remaining namei_{32,64}.h crap
[PATCH] get rid of indirect users of namei.h
[PATCH] get rid of __user_path_lookup_open
[PATCH] f_count may wrap around
[PATCH] dup3 fix
[PATCH] don't pass nameidata to __ncp_lookup_validate()
[PATCH] don't pass nameidata to gfs2_lookupi()
[PATCH] new (local) helper: user_path_parent()
[PATCH] sanitize __user_walk_fd() et.al.
[PATCH] preparation to __user_walk_fd cleanup
[PATCH] kill nameidata passing to permission(), rename to inode_permission()
[PATCH] take noexec checks to very few callers that care
Re: [PATCH 3/6] vfs: open_exec cleanup
[patch 4/4] vfs: immutable inode checking cleanup
[patch 3/4] fat: dont call notify_change
[patch 2/4] vfs: utimes cleanup
[patch 1/4] vfs: utimes: move owner check into inode_change_ok()
[PATCH] vfs: use kstrdup() and check failing allocation
...

Linus Torvalds
2008-07-27 11:23:44 +0800
228428428 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 ... Browse Code »

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
netns: fix ip_rt_frag_needed rt_is_expired
netfilter: nf_conntrack_extend: avoid unnecessary "ct->ext" dereferences
netfilter: fix double-free and use-after free
netfilter: arptables in netns for real
netfilter: ip{,6}tables_security: fix future section mismatch
selinux: use nf_register_hooks()
netfilter: ebtables: use nf_register_hooks()
Revert "pkt_sched: sch_sfq: dump a real number of flows"
qeth: use dev->ml_priv instead of dev->priv
syncookies: Make sure ECN is disabled
net: drop unused BUG_TRAP()
net: convert BUG_TRAP to generic WARN_ON
drivers/net: convert BUG_TRAP to generic WARN_ON

Linus Torvalds
2008-07-27 11:17:56 +0800
3b8f14b41 mm/util.c must #include <linux/sched.h> ... Browse Code »

mm/util.c: In function 'arch_pick_mmap_layout':
mm/util.c:144: error: dereferencing pointer to incomplete type
mm/util.c:145: error: 'arch_get_unmapped_area' undeclared (first use in this function)
mm/util.c:145: error: (Each undeclared identifier is reported only once
mm/util.c:145: error: for each function it appears in.)
mm/util.c:146: error: 'arch_unmap_area' undeclared (first use in this function)

Signed-off-by: Adrian Bunk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adrian Bunk
2008-07-27 11:16:47 +0800
2f1936b87 [patch 3/5] vfs: change remove_suid() to file_remove_suid() ... Browse Code »

All calls to remove_suid() are made with a file pointer, because
(similarly to file_update_time) it is called when the file is written.

Clean up callers by passing in a file instead of a dentry.

Signed-off-by: Miklos Szeredi

Miklos Szeredi
2008-07-27 08:53:16 +0800
e6305c43e [PATCH] sanitize ->permission() prototype ... Browse Code »

* kill nameidata * argument; map the 3 bits in ->flags anybody cares
about to new MAY_... ones and pass with the mask.
* kill redundant gfs2_iop_permission()
* sanitize ecryptfs_permission()
* fix remaining places where ->permission() instances might barf on new
MAY_... found in mask.

The obvious next target in that direction is permission(9)

folded fix for nfs_permission() breakage from Miklos Szeredi

Signed-off-by: Al Viro

Al Viro
2008-07-27 08:53:14 +0800
93bc4e89c netfilter: fix double-free and use-after free ... Browse Code »

As suggested by Patrick McHardy, introduce a __krealloc() that doesn't
free the original buffer to fix a double-free and use-after-free bug
introduced by me in netfilter that uses RCU.

Reported-by: Patrick McHardy
Signed-off-by: Pekka Enberg
Tested-by: Dieter Ries
Signed-off-by: Patrick McHardy
Signed-off-by: David S. Miller

Pekka Enberg
2008-07-27 08:49:33 +0800
7c363b8c6 mm/swapfile.c: make code static ... Browse Code »

This patch makes the following needlessly global code static:
- swap_lock
- nr_swapfiles
- struct swap_list

Signed-off-by: Adrian Bunk
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adrian Bunk
2008-07-27 03:00:12 +0800
15f59adae make mm/memory.c:print_bad_pte() static ... Browse Code »

This patch makes the needlessly global print_bad_pte() static.

Signed-off-by: Adrian Bunk
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adrian Bunk
2008-07-27 03:00:12 +0800
9d8fddfb1 mm/allocpercpu.c: make 4 functions static ... Browse Code »

This patch makes the following needlessly global functions static:
- percpu_depopulate()
- __percpu_depopulate_mask()
- percpu_populate()
- __percpu_populate_mask()

Signed-off-by: Adrian Bunk
Acked-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adrian Bunk
2008-07-27 03:00:12 +0800
9e5c6da71 make mm/sparse.c: make a function static ... Browse Code »

This patch makes the needlessly global sparse_early_mem_map_alloc()
static.

Signed-off-by: Adrian Bunk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adrian Bunk
2008-07-27 03:00:12 +0800
2c97b7fc0 mm: print swapcache page count in show_swap_cache_info() ... Browse Code »

Every arch implements its own show_mem() function. Most of them share
quite some code, some of them are completely identical.

This series implements a generic version of this function and migrates
almost all architectures to it.

This patch:

Most show_mem() implementations calculate the amount of pages within
the swapcache every time. Move the output to a more appropriate place
and use the anyway available total_swapcache_pages variable.

Signed-off-by: Johannes Weiner
Cc: Richard Henderson
Cc: Ivan Kokshaysky
Cc: Haavard Skinnemoen
Cc: Bryan Wu
Cc: Chris Zankel
Cc: Ingo Molnar
Cc: Jeff Dike
Cc: David S. Miller
Cc: Paul Mundt
Cc: Heiko Carstens
Cc: Martin Schwidefsky
Cc: David Howells
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Yoshinori Sato
Cc: Ralf Baechle
Cc: Greg Ungerer
Cc: Geert Uytterhoeven
Cc: Roman Zippel
Cc: Hirokazu Takata
Cc: Mikael Starvik
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2008-07-27 03:00:10 +0800
fa8e26ccd tracehook: tracehook_expect_breakpoints ... Browse Code »

This adds tracehook_expect_breakpoints() as a formal hook for the nommu
code to use for its, "Is text-poking likely?" check at mmap time. This
names the actual semantics the code means to test, and documents it.

Signed-off-by: Roland McGrath
Cc: Oleg Nesterov
Reviewed-by: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Roland McGrath
2008-07-27 03:00:09 +0800