Eric Lee / linux-smarc-t335x-v3.2

06 Jan, 2009

2 commits

520c85346 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 ... Browse Code »

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
inotify: fix type errors in interfaces
fix breakage in reiserfs_new_inode()
fix the treatment of jfs special inodes
vfs: remove duplicate code in get_fs_type()
add a vfs_fsync helper
sys_execve and sys_uselib do not call into fsnotify
zero i_uid/i_gid on inode allocation
inode->i_op is never NULL
ntfs: don't NULL i_op
isofs check for NULL ->i_op in root directory is dead code
affs: do not zero ->i_op
kill suid bit only for regular files
vfs: lseek(fd, 0, SEEK_CUR) race condition

Linus Torvalds
2009-01-06 10:32:06 +0800
7f5ff766a kill suid bit only for regular files ... Browse Code »

We don't have to do it because it is useless for non regular files.
In fact block device may trigger this path without dentry->d_inode->i_mutex.

(akpm: concerns were expressed (by me) about S_ISDIR inodes)

Signed-off-by: Dmitri Monakhov
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro

Dmitri Monakhov
2009-01-06 00:53:07 +0800

05 Jan, 2009

1 commit

54566b2c1 fs: symlink write_begin allocation context fix ... Browse Code »

With the write_begin/write_end aops, page_symlink was broken because it
could no longer pass a GFP_NOFS type mask into the point where the
allocations happened. They are done in write_begin, which would always
assume that the filesystem can be entered from reclaim. This bug could
cause filesystem deadlocks.

The funny thing with having a gfp_t mask there is that it doesn't really
allow the caller to arbitrarily tinker with the context in which it can be
called. It couldn't ever be GFP_ATOMIC, for example, because it needs to
take the page lock. The only thing any callers care about is __GFP_FS
anyway, so turn that into a single flag.

Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
this flag in their write_begin function. Change __grab_cache_page to
accept a nofs argument as well, to honour that flag (while we're there,
change the name to grab_cache_page_write_begin which is more instructive
and does away with random leading underscores).

This is really a more flexible way to go in the end anyway -- if a
filesystem happens to want any extra allocations aside from the pagecache
ones in ints write_begin function, it may now use GFP_KERNEL (rather than
GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a
random example).

[kosaki.motohiro@jp.fujitsu.com: fix ubifs]
[kosaki.motohiro@jp.fujitsu.com: fix fuse]
Signed-off-by: Nick Piggin
Reviewed-by: KOSAKI Motohiro
Cc: [2.6.28.x]
Signed-off-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
[ Cleaned up the calling convention: just pass in the AOP flags
untouched to the grab_cache_page_write_begin() function. That
just simplifies everybody, and may even allow future expansion of the
logic. - Linus ]
Signed-off-by: Linus Torvalds

Nick Piggin
2009-01-05 05:33:20 +0800

31 Oct, 2008

1 commit

4e02ed4b4 fs: remove prepare_write/commit_write ... Browse Code »

Nothing uses prepare_write or commit_write. Remove them from the tree
completely.

[akpm@linux-foundation.org: schedule simple_prepare_write() for unexporting]
Signed-off-by: Nick Piggin
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2008-10-31 02:38:45 +0800

20 Oct, 2008

3 commits

b7abea963 memcg: make page->mapping NULL before uncharge ... Browse Code »

This patch tries to make page->mapping to be NULL before
mem_cgroup_uncharge_cache_page() is called.

"page->mapping == NULL" is a good check for "whether the page is still
radix-tree or not". This patch also adds BUG_ON() to
mem_cgroup_uncharge_cache_page();

Signed-off-by: KAMEZAWA Hiroyuki
Reviewed-by: Daisuke Nishimura
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2008-10-20 23:52:38 +0800
8413ac9d8 mm: page lock use lock bitops ... Browse Code »

trylock_page, unlock_page open and close a critical section. Hence,
we can use the lock bitops to get the desired memory ordering.

Also, mark trylock as likely to succeed (and remove the annotation from
callers).

Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2008-10-20 23:52:32 +0800
4f98a2fee vmscan: split LRU lists into anon & file sets ... Browse Code »

Split the LRU lists in two, one set for pages that are backed by real file
systems ("file") and one for pages that are backed by memory and swap
("anon"). The latter includes tmpfs.

The advantage of doing this is that the VM will not have to scan over lots
of anonymous pages (which we generally do not want to swap out), just to
find the page cache pages that it should evict.

This patch has the infrastructure and a basic policy to balance how much
we scan the anon lists and how much we scan the file lists. The big
policy changes are in separate patches.

[lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
[kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
[kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
[hugh@veritas.com: memcg swapbacked pages active]
[hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
[akpm@linux-foundation.org: fix /proc/vmstat units]
[nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
[kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
[kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
Signed-off-by: Rik van Riel
Signed-off-by: Lee Schermerhorn
Signed-off-by: KOSAKI Motohiro
Signed-off-by: Hugh Dickins
Signed-off-by: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rik van Riel
2008-10-20 23:50:25 +0800

17 Oct, 2008

2 commits

e533b2270 Merge branch 'core-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip ... Browse Code »

* 'core-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
do_generic_file_read: s/EINTR/EIO/ if lock_page_killable() fails
softirq, warning fix: correct a format to avoid a warning
softirqs, debug: preemption check
x86, pci-hotplug, calgary / rio: fix EBDA ioremap()
IO resources, x86: ioremap sanity check to catch mapping requests exceeding, fix
IO resources, x86: ioremap sanity check to catch mapping requests exceeding the BAR sizes
softlockup: Documentation/sysctl/kernel.txt: fix softlockup_thresh description
dmi scan: warn about too early calls to dmi_check_system()
generic: redefine resource_size_t as phys_addr_t
generic: make PFN_PHYS explicitly return phys_addr_t
generic: add phys_addr_t for holding physical addresses
softirq: allocate less vectors
IO resources: fix/remove printk
printk: robustify printk, update comment
printk: robustify printk, fix #2
printk: robustify printk, fix
printk: robustify printk

Fixed up conflicts in:
arch/powerpc/include/asm/types.h
arch/powerpc/platforms/Kconfig.cputype
manually.

Linus Torvalds
2008-10-17 06:17:40 +0800
0c6aa2639 mm: do_generic_file_read() never gets a NULL 'filp' argument ... Browse Code »

The 'filp' argument to do_generic_file_read() is never NULL.

Signed-off-by: Krishna Kumar
Reviewed-by: KOSAKI Motohiro
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Krishna Kumar
2008-10-17 02:21:29 +0800

14 Oct, 2008

1 commit

854623235 do_generic_file_read: s/EINTR/EIO/ if lock_page_killable() fails ... Browse Code »

If lock_page_killable() fails because the task was killed by SIGKILL or
any other fatal signal, do_generic_file_read() returns -EIO.

This seems to be OK, because in fact the userspace won't see this error,
the task will dequeue SIGKILL and exit.

However, /sbin/init is different, it will dequeue SIGKILL, ignore it, and
return to the user-space with the bogus -EIO.

Change the code to return the error code from lock_page_killable(), -EINTR.
This doesn't fix the bug, but perhaps makes sense anyway. Imho, with this
change the code looks a bit more logical, and the "good" init should handle
the spurious EINTR or short read.

Afaics we can also change lock_page_killable() to return -ERESTARTNOINTR,
but this can't prevent the short reads.

Signed-off-by: Oleg Nesterov
Signed-off-by: Ingo Molnar

Oleg Nesterov
2008-10-14 23:15:33 +0800

03 Sep, 2008

1 commit

6ccfa806a VFS: fix dio write returning EIO when try_to_release_page fails ... Browse Code »

Dio write returns EIO when try_to_release_page fails because bh is
still referenced.

The patch

commit 3f31fddfa26b7594b44ff2b34f9a04ba409e0f91
Author: Mingming Cao
Date: Fri Jul 25 01:46:22 2008 -0700

jbd: fix race between free buffer and commit transaction

was merged into 2.6.27-rc1, but I noticed that this patch is not enough
to fix the race.

I did fsstress test heavily to 2.6.27-rc1, and found that dio write still
sometimes got EIO through this test.

The patch above fixed race between freeing buffer(dio) and committing
transaction(jbd) but I discovered that there is another race, freeing
buffer(dio) and ext3/4_ordered_writepage.

: background_writeout()
->write_cache_pages()
->ext3_ordered_writepage()
walk_page_buffers() -> take a bh ref
block_write_full_page() -> unlock_page
: try_to_release_page fails)
walk_page_buffers() ->release a bh ref

ext3_ordered_writepage holds bh ref and does unlock_page remaining
taking a bh ref, so this causes the race and failure of
try_to_release_page.

To fix this race, I used the approach of falling back to buffered
writes if try_to_release_page() fails on a page.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Hisashi Hifumi
Cc: Chris Mason
Cc: Jan Kara
Cc: Mingming Cao
Cc: Zach Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hisashi Hifumi
2008-09-03 10:21:37 +0800

05 Aug, 2008

1 commit

529ae9aaa mm: rename page trylock ... Browse Code »

Converting page lock to new locking bitops requires a change of page flag
operation naming, so we might as well convert it to something nicer
(!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

This also facilitates lockdeping of page lock.

Signed-off-by: Nick Piggin
Acked-by: KOSAKI Motohiro
Acked-by: Peter Zijlstra
Acked-by: Andrew Morton
Acked-by: Benjamin Herrenschmidt
Signed-off-by: Linus Torvalds

Nick Piggin
2008-08-05 12:31:34 +0800

31 Jul, 2008

1 commit

94ad374a0 Fix off-by-one error in iov_iter_advance() ... Browse Code »

The iov_iter_advance() function would look at the iov->iov_len entry
even though it might have iterated over the whole array, and iov was
pointing past the end. This would cause DEBUG_PAGEALLOC to trigger a
kernel page fault if the allocation was at the end of a page, and the
next page was unallocated.

The quick fix is to just change the order of the tests: check that there
is any iovec data left before we check the iov entry itself.

Thanks to Alexey Dobriyan for finding this case, and testing the fix.

Reported-and-tested-by: Alexey Dobriyan
Cc: Nick Piggin
Cc: Andrew Morton
Cc: [2.6.25.x, 2.6.26.x]
Signed-off-by: Linus Torvalds

Linus Torvalds
2008-07-31 05:50:18 +0800

29 Jul, 2008

1 commit

8ab22b9ab vfs: pagecache usage optimization for pagesize!=blocksize ... Browse Code »

When we read some part of a file through pagecache, if there is a
pagecache of corresponding index but this page is not uptodate, read IO
is issued and this page will be uptodate.

I think this is good for pagesize == blocksize environment but there is
room for improvement on pagesize != blocksize environment. Because in
this case a page can have multiple buffers and even if a page is not
uptodate, some buffers can be uptodate.

So I suggest that when all buffers which correspond to a part of a file
that we want to read are uptodate, use this pagecache and copy data from
this pagecache to user buffer even if a page is not uptodate. This can
reduce read IO and improve system throughput.

I wrote a benchmark program and got result number with this program.

This benchmark do:

1: mount and open a test file.

2: create a 512MB file.

3: close a file and umount.

4: mount and again open a test file.

5: pwrite randomly 300000 times on a test file. offset is aligned
by IO size(1024bytes).

6: measure time of preading randomly 100000 times on a test file.

The result was:
2.6.26
330 sec

2.6.26-patched
226 sec

Arch:i386
Filesystem:ext3
Blocksize:1024 bytes
Memory: 1GB

On ext3/4, a file is written through buffer/block. So random read/write
mixed workloads or random read after random write workloads are optimized
with this patch under pagesize != blocksize environment. This test result
showed this.

The benchmark program is as follows:

#include
#include
#include
#include
#include
#include
#include
#include
#include

#define LEN 1024
#define LOOP 1024*512 /* 512MB */

main(void)
{
unsigned long i, offset, filesize;
int fd;
char buf[LEN];
time_t t1, t2;

if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
perror("cannot mount\n");
exit(1);
}
memset(buf, 0, LEN);
fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC);
if (fd < 0) {
perror("cannot open file\n");
exit(1);
}
for (i = 0; i < LOOP; i++)
write(fd, buf, LEN);
close(fd);
if (umount("/root/test1/") < 0) {
perror("cannot umount\n");
exit(1);
}
if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
perror("cannot mount\n");
exit(1);
}
fd = open("/root/test1/testfile", O_RDWR);
if (fd < 0) {
perror("cannot open file\n");
exit(1);
}

filesize = LEN * LOOP;
for (i = 0; i < 300000; i++){
offset = (random() % filesize) & (~(LEN - 1));
pwrite(fd, buf, LEN, offset);
}
printf("start test\n");
time(&t1);
for (i = 0; i < 100000; i++){
offset = (random() % filesize) & (~(LEN - 1));
pread(fd, buf, LEN, offset);
}
time(&t2);
printf("%ld sec\n", t2-t1);
close(fd);
if (umount("/root/test1/") < 0) {
perror("cannot umount\n");
exit(1);
}
}

Signed-off-by: Hisashi Hifumi
Cc: Nick Piggin
Cc: Christoph Hellwig
Cc: Jan Kara
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hisashi Hifumi
2008-07-29 07:30:21 +0800

27 Jul, 2008

4 commits

2f1936b87 [patch 3/5] vfs: change remove_suid() to file_remove_suid() ... Browse Code »

All calls to remove_suid() are made with a file pointer, because
(similarly to file_update_time) it is called when the file is written.

Clean up callers by passing in a file instead of a dentry.

Signed-off-by: Miklos Szeredi

Miklos Szeredi
2008-07-27 08:53:16 +0800
19fd62312 mm: spinlock tree_lock ... Browse Code »

mapping->tree_lock has no read lockers. convert the lock from an rwlock
to a spinlock.

Signed-off-by: Nick Piggin
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Hugh Dickins
Cc: "Paul E. McKenney"
Reviewed-by: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2008-07-27 03:00:06 +0800
a60637c85 mm: lockless pagecache ... Browse Code »

Combine page_cache_get_speculative with lockless radix tree lookups to
introduce lockless page cache lookups (ie. no mapping->tree_lock on the
read-side).

The only atomicity changes this introduces is that the gang pagecache
lookup functions now behave as if they are implemented with multiple
find_get_page calls, rather than operating on a snapshot of the pages. In
practice, this atomicity guarantee is not used anyway, and it is to
replace individual lookups, so these semantics are natural.

Signed-off-by: Nick Piggin
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Hugh Dickins
Cc: "Paul E. McKenney"
Reviewed-by: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2008-07-27 03:00:06 +0800
e286781d5 mm: speculative page references ... Browse Code »

If we can be sure that elevating the page_count on a pagecache page will
pin it, we can speculatively run this operation, and subsequently check to
see if we hit the right page rather than relying on holding a lock or
otherwise pinning a reference to the page.

This can be done if get_page/put_page behaves consistently throughout the
whole tree (ie. if we "get" the page after it has been used for something
else, we must be able to free it with a put_page).

Actually, there is a period where the count behaves differently: when the
page is free or if it is a constituent page of a compound page. We need
an atomic_inc_not_zero operation to ensure we don't try to grab the page
in either case.

This patch introduces the core locking protocol to the pagecache (ie.
adds page_cache_get_speculative, and tweaks some update-side code to make
it work).

Thanks to Hugh for pointing out an improvement to the algorithm setting
page_count to zero when we have control of all references, in order to
hold off speculative getters.

[kamezawa.hiroyu@jp.fujitsu.com: fix migration_entry_wait()]
[hugh@veritas.com: fix add_to_page_cache]
[akpm@linux-foundation.org: repair a comment]
Signed-off-by: Nick Piggin
Cc: Jeff Garzik
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Hugh Dickins
Cc: "Paul E. McKenney"
Reviewed-by: Peter Zijlstra
Signed-off-by: Daisuke Nishimura
Signed-off-by: KAMEZAWA Hiroyuki
Signed-off-by: KOSAKI Motohiro
Signed-off-by: Hugh Dickins
Acked-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2008-07-27 03:00:06 +0800

26 Jul, 2008

2 commits

69029cd55 memcg: remove refcnt from page_cgroup ... Browse Code »

memcg: performance improvements

Patch Description
1/5 ... remove refcnt fron page_cgroup patch (shmem handling is fixed)
2/5 ... swapcache handling patch
3/5 ... add helper function for shmem's memory reclaim patch
4/5 ... optimize by likely/unlikely ppatch
5/5 ... remove redundunt check patch (shmem handling is fixed.)

Unix bench result.

== 2.6.26-rc2-mm1 + memory resource controller
Execl Throughput 2915.4 lps (29.6 secs, 3 samples)
C Compiler Throughput 1019.3 lpm (60.0 secs, 3 samples)
Shell Scripts (1 concurrent) 5796.0 lpm (60.0 secs, 3 samples)
Shell Scripts (8 concurrent) 1097.7 lpm (60.0 secs, 3 samples)
Shell Scripts (16 concurrent) 565.3 lpm (60.0 secs, 3 samples)
File Read 1024 bufsize 2000 maxblocks 1022128.0 KBps (30.0 secs, 3 samples)
File Write 1024 bufsize 2000 maxblocks 544057.0 KBps (30.0 secs, 3 samples)
File Copy 1024 bufsize 2000 maxblocks 346481.0 KBps (30.0 secs, 3 samples)
File Read 256 bufsize 500 maxblocks 319325.0 KBps (30.0 secs, 3 samples)
File Write 256 bufsize 500 maxblocks 148788.0 KBps (30.0 secs, 3 samples)
File Copy 256 bufsize 500 maxblocks 99051.0 KBps (30.0 secs, 3 samples)
File Read 4096 bufsize 8000 maxblocks 2058917.0 KBps (30.0 secs, 3 samples)
File Write 4096 bufsize 8000 maxblocks 1606109.0 KBps (30.0 secs, 3 samples)
File Copy 4096 bufsize 8000 maxblocks 854789.0 KBps (30.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places 126145.2 lpm (30.0 secs, 3 samples)

INDEX VALUES
TEST BASELINE RESULT INDEX

Execl Throughput 43.0 2915.4 678.0
File Copy 1024 bufsize 2000 maxblocks 3960.0 346481.0 875.0
File Copy 256 bufsize 500 maxblocks 1655.0 99051.0 598.5
File Copy 4096 bufsize 8000 maxblocks 5800.0 854789.0 1473.8
Shell Scripts (8 concurrent) 6.0 1097.7 1829.5
=========
FINAL SCORE 991.3

== 2.6.26-rc2-mm1 + this set ==
Execl Throughput 3012.9 lps (29.9 secs, 3 samples)
C Compiler Throughput 981.0 lpm (60.0 secs, 3 samples)
Shell Scripts (1 concurrent) 5872.0 lpm (60.0 secs, 3 samples)
Shell Scripts (8 concurrent) 1120.3 lpm (60.0 secs, 3 samples)
Shell Scripts (16 concurrent) 578.0 lpm (60.0 secs, 3 samples)
File Read 1024 bufsize 2000 maxblocks 1003993.0 KBps (30.0 secs, 3 samples)
File Write 1024 bufsize 2000 maxblocks 550452.0 KBps (30.0 secs, 3 samples)
File Copy 1024 bufsize 2000 maxblocks 347159.0 KBps (30.0 secs, 3 samples)
File Read 256 bufsize 500 maxblocks 314644.0 KBps (30.0 secs, 3 samples)
File Write 256 bufsize 500 maxblocks 151852.0 KBps (30.0 secs, 3 samples)
File Copy 256 bufsize 500 maxblocks 101000.0 KBps (30.0 secs, 3 samples)
File Read 4096 bufsize 8000 maxblocks 2033256.0 KBps (30.0 secs, 3 samples)
File Write 4096 bufsize 8000 maxblocks 1611814.0 KBps (30.0 secs, 3 samples)
File Copy 4096 bufsize 8000 maxblocks 847979.0 KBps (30.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places 128148.7 lpm (30.0 secs, 3 samples)

INDEX VALUES
TEST BASELINE RESULT INDEX

Execl Throughput 43.0 3012.9 700.7
File Copy 1024 bufsize 2000 maxblocks 3960.0 347159.0 876.7
File Copy 256 bufsize 500 maxblocks 1655.0 101000.0 610.3
File Copy 4096 bufsize 8000 maxblocks 5800.0 847979.0 1462.0
Shell Scripts (8 concurrent) 6.0 1120.3 1867.2
=========
FINAL SCORE 1004.6

This patch:

Remove refcnt from page_cgroup().

After this,

* A page is charged only when !page_mapped() && no page_cgroup is assigned.
* Anon page is newly mapped.
* File page is added to mapping->tree.

* A page is uncharged only when
* Anon page is fully unmapped.
* File page is removed from LRU.

There is no change in behavior from user's view.

This patch also removes unnecessary calls in rmap.c which was used only for
refcnt mangement.

[akpm@linux-foundation.org: fix warning]
[hugh@veritas.com: fix shmem_unuse_inode charging]
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Cc: "Eric W. Biederman"
Cc: Pavel Emelyanov
Cc: Li Zefan
Cc: Hugh Dickins
Cc: YAMAMOTO Takashi
Cc: Paul Menage
Cc: David Rientjes
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2008-07-26 01:53:37 +0800
3f31fddfa jbd: fix race between free buffer and commit transaction ... Browse Code »

journal_try_to_free_buffers() could race with jbd commit transaction when
the later is holding the buffer reference while waiting for the data
buffer to flush to disk. If the caller of journal_try_to_free_buffers()
request tries hard to release the buffers, it will treat the failure as
error and return back to the caller. We have seen the directo IO failed
due to this race. Some of the caller of releasepage() also expecting the
buffer to be dropped when passed with GFP_KERNEL mask to the
releasepage()->journal_try_to_free_buffers().

With this patch, if the caller is passing the __GFP_WAIT and __GFP_FS to
indicating this call could wait, in case of try_to_free_buffers() failed,
let's waiting for journal_commit_transaction() to finish commit the
current committing transaction, then try to free those buffers again.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mingming Cao
Reviewed-by: Badari Pulavarty
Acked-by: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mingming Cao
2008-07-26 01:53:32 +0800

25 Jul, 2008

2 commits

11fa977ec generic_file_aio_read() cleanups ... Browse Code »

As akpm points out, there's really no need for generic_file_aio_read to
make a special case of count 0: just loop through nr_segs doing nothing.
And as Harvey Harrison points out, there's no need to reset retval to 0
where it's already 0.

Setting count (or ocount) to 0 before calling generic_segment_checks is
unnecessary too; but reluctantly I'll leave that removal to someone with a
wider range of gcc versions to hand - 4.1.2 and 4.2.1 don't warn about it,
but perhaps others do - I forget which are the warniest versions.

Signed-off-by: Hugh Dickins
Tested-by: Lawrence Greenfield
Cc: Christoph Rohland
Cc: Badari Pulavarty
Cc: Zach Brown
Cc: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2008-07-25 01:47:16 +0800
a969e903a kill generic_file_direct_IO() ... Browse Code »

generic_file_direct_IO is a common helper around the invocation of
->direct_IO. But there's almost nothing shared between the read and write
side, so we're better off without this helper.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Hellwig
2008-07-25 01:47:14 +0800

12 Jul, 2008

1 commit

f4c0a0fdf vfs: export filemap_fdatawrite_range() ... Browse Code »

Make filemap_fdatawrite_range() function public, so that it can later
be used in ordered mode rewrite by JBD/JBD2.

Signed-off-by: Jan Kara

Jan Kara
2008-07-12 07:27:31 +0800

15 May, 2008

1 commit

3ef0f720e mm: fix infinite loop in filemap_fault ... Browse Code »

filemap_fault will go into an infinite loop if ->readpage() fails
asynchronously.

AFAICS the bug was introduced by this commit, which removed the wait after the
final readpage:

commit d00806b183152af6d24f46f0c33f14162ca1262a
Author: Nick Piggin
Date: Thu Jul 19 01:46:57 2007 -0700

mm: fix fault vs invalidate race for linear mappings

Fix by reintroducing the wait_on_page_locked() after ->readpage() to make sure
the page is up-to-date before jumping back to the beginning of the function.

I've noticed this while testing nfs exporting on fuse. The patch
fixes it.

Signed-off-by: Miklos Szeredi
Cc: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miklos Szeredi
2008-05-15 10:11:13 +0800

07 May, 2008

1 commit

7f3d4ee10 vfs: splice remove_suid() cleanup ... Browse Code »

generic_file_splice_write() duplicates remove_suid() just because it
doesn't hold i_mutex. But it grabs i_mutex inside splice_from_pipe()
anyway, so this is rather pointless.

Move locking to generic_file_splice_write() and call remove_suid() and
__splice_from_pipe() instead.

Signed-off-by: Miklos Szeredi
Signed-off-by: Jens Axboe

Miklos Szeredi
2008-05-07 15:29:00 +0800

28 Apr, 2008

1 commit

ac6aadb24 mm: rotate_reclaimable_page() cleanup ... Browse Code »

Clean up messy conditional calling of test_clear_page_writeback() from both
rotate_reclaimable_page() and end_page_writeback().

The only user of rotate_reclaimable_page() is end_page_writeback() so this is
OK.

Signed-off-by: Miklos Szeredi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miklos Szeredi
2008-04-28 23:58:20 +0800

20 Mar, 2008

1 commit

7682486b3 mm: fix various kernel-doc comments ... Browse Code »

Fix various kernel-doc notation in mm/:

filemap.c: add function short description; convert 2 to kernel-doc
fremap.c: change parameter 'prot' to @prot
pagewalk.c: change "-" in function parameters to ":"
slab.c: fix short description of kmem_ptr_validate()
swap.c: fix description & parameters of put_pages_list()
swap_state.c: fix function parameters
vmalloc.c: change "@returns" to "Returns:" since that is not a parameter

Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Randy Dunlap
2008-03-20 09:53:35 +0800

11 Mar, 2008

1 commit

f7009264c iov_iter_advance() fix ... Browse Code »

iov_iter_advance() skips over zero-length iovecs, however it does not properly
terminate at the end of the iovec array. Fix this by checking against
i->count before we skip a zero-length iov.

The bug was reproduced with a test program that continually randomly creates
iovs to writev. The fix was also verified with the same program and also it
could verify that the correct data was contained in the file after each
writev.

Signed-off-by: Nick Piggin
Tested-by: "Kevin Coffman"
Cc: "Alexey Dobriyan"
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2008-03-11 09:01:20 +0800

10 Mar, 2008

1 commit

3426fadfa Do not include linux/backing-dev.h twice ... Browse Code »

Don't include linux/backing-dev.h twice in mm/filemap.c, it's pointless.

Signed-off-by: Jesper Juhl
Signed-off-by: Linus Torvalds

Jesper Juhl
2008-03-10 13:21:52 +0800

14 Feb, 2008

1 commit

b5606c2d4 remove final fastcall users ... Browse Code »

fastcall always expands to empty, remove it.

Signed-off-by: Harvey Harrison
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Harvey Harrison
2008-02-14 08:21:18 +0800

09 Feb, 2008

2 commits

36e789144 kill do_generic_mapping_read ... Browse Code »

do_generic_mapping_read was used by gfs2 for internals reads, but this use
of the interface was rather suboptimal (as was the whole interface) and has
been replaced by an internal helper now. This patch kills
do_generic_mapping_read and surrounding damage in preparation of additional
cleanups for the buffered read path.

Signed-off-by: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Hellwig
2008-02-09 01:22:39 +0800
2004dc8ee Use pgoff_t instead of unsigned long ... Browse Code »

Convert variables containing page indexes to pgoff_t.

Signed-off-by: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2008-02-09 01:22:32 +0800

08 Feb, 2008

5 commits

4c6bc8dd5 mem-controller gfp-mask fix ... Browse Code »

Need to strip __GFP_HIGHMEM flag while passing to mem_container_cache_charge().

Signed-off-by: Badari Pulavarty
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Badari Pulavarty
2008-02-08 00:42:19 +0800
35c754d79 memory controller BUG_ON() ... Browse Code »

Move mem_controller_cache_charge() above radix_tree_preload().
radix_tree_preload() disables preemption, even though the gfp_mask passed
contains __GFP_WAIT, we cannot really do __GFP_WAIT allocations, thus we
hit a BUG_ON() in kmem_cache_alloc().

This patch moves mem_controller_cache_charge() to above radix_tree_preload()
for cache charging.

Signed-off-by: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Balbir Singh
2008-02-08 00:42:19 +0800
e1a1cd590 Memory controller: make charging gfp mask aware ... Browse Code »

Nick Piggin pointed out that swap cache and page cache addition routines
could be called from non GFP_KERNEL contexts. This patch makes the
charging routine aware of the gfp context. Charging might fail if the
cgroup is over it's limit, in which case a suitable error is returned.

This patch was tested on a Powerpc box. I am still looking at being able
to test the path, through which allocations happen in non GFP_KERNEL
contexts.

[kamezawa.hiroyu@jp.fujitsu.com: problem with ZONE_MOVABLE]
Signed-off-by: Balbir Singh
Cc: Pavel Emelianov
Cc: Paul Menage
Cc: Peter Zijlstra
Cc: "Eric W. Biederman"
Cc: Nick Piggin
Cc: Kirill Korotaev
Cc: Herbert Poetzl
Cc: David Rientjes
Cc: Vaidyanathan Srinivasan
Signed-off-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Balbir Singh
2008-02-08 00:42:19 +0800
8697d3319 Memory controller: add switch to control what type of pages to limit ... Browse Code »

Choose if we want cached pages to be accounted or not. By default both are
accounted for. A new set of tunables are added.

echo -n 1 > mem_control_type

switches the accounting to account for only mapped pages

echo -n 3 > mem_control_type

switches the behaviour back

[bunk@kernel.org: mm/memcontrol.c: clenups]
[akpm@linux-foundation.org: fix sparc32 build]
Signed-off-by: Balbir Singh
Cc: Pavel Emelianov
Cc: Paul Menage
Cc: Peter Zijlstra
Cc: "Eric W. Biederman"
Cc: Nick Piggin
Cc: Kirill Korotaev
Cc: Herbert Poetzl
Cc: David Rientjes
Cc: Vaidyanathan Srinivasan
Signed-off-by: Adrian Bunk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Balbir Singh
2008-02-08 00:42:19 +0800
8a9f3ccd2 Memory controller: memory accounting ... Browse Code »

Add the accounting hooks. The accounting is carried out for RSS and Page
Cache (unmapped) pages. There is now a common limit and accounting for both.
The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap()
time. Page cache is accounted at add_to_page_cache(),
__delete_from_page_cache(). Swap cache is also accounted for.

Each page's page_cgroup is protected with the last bit of the
page_cgroup pointer, this makes handling of race conditions involving
simultaneous mappings of a page easier. A reference count is kept in the
page_cgroup to deal with cases where a page might be unmapped from the RSS
of all tasks, but still lives in the page cache.

Credits go to Vaidyanathan Srinivasan for helping with reference counting work
of the page cgroup. Almost all of the page cache accounting code has help
from Vaidyanathan Srinivasan.

[hugh@veritas.com: fix swapoff breakage]
[akpm@linux-foundation.org: fix locking]
Signed-off-by: Vaidyanathan Srinivasan
Signed-off-by: Balbir Singh
Cc: Pavel Emelianov
Cc: Paul Menage
Cc: Peter Zijlstra
Cc: "Eric W. Biederman"
Cc: Nick Piggin
Cc: Kirill Korotaev
Cc: Herbert Poetzl
Cc: David Rientjes
Cc:
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Balbir Singh
2008-02-08 00:42:18 +0800

06 Feb, 2008

2 commits

920c7a5d0 mm: remove fastcall from mm/ ... Browse Code »

fastcall is always defined to be empty, remove it

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Harvey Harrison
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Harvey Harrison
2008-02-06 01:44:18 +0800
e2848a0ef radix-tree: avoid atomic allocations for preloaded insertions ... Browse Code »

Most pagecache (and some other) radix tree insertions have the great
opportunity to preallocate a few nodes with relaxed gfp flags. But the
preallocation is squandered when it comes time to allocate a node, we
default to first attempting a GFP_ATOMIC allocation -- that doesn't
normally fail, but it can eat into atomic memory reserves that we don't
need to be using.

Another upshot of this is that it removes the sometimes highly contended
zone->lock from underneath tree_lock. Pagecache insertions are always
performed with a radix tree preload, and after this change, such a
situation will never fall back to kmem_cache_alloc within
radix_tree_node_alloc.

David Miller reports seeing this allocation fail on a highly threaded
sparc64 system:

[527319.459981] dd: page allocation failure. order:0, mode:0x20
[527319.460403] Call Trace:
[527319.460568] [00000000004b71e0] __slab_alloc+0x1b0/0x6a8
[527319.460636] [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8
[527319.460698] [000000000055309c] radix_tree_node_alloc+0x20/0x90
[527319.460763] [0000000000553238] radix_tree_insert+0x12c/0x260
[527319.460830] [0000000000495cd0] add_to_page_cache+0x38/0xb0
[527319.460893] [00000000004e4794] mpage_readpages+0x6c/0x134
[527319.460955] [000000000049c7fc] __do_page_cache_readahead+0x170/0x280
[527319.461028] [000000000049cc88] ondemand_readahead+0x208/0x214
[527319.461094] [0000000000496018] do_generic_mapping_read+0xe8/0x428
[527319.461152] [0000000000497948] generic_file_aio_read+0x108/0x170
[527319.461217] [00000000004badac] do_sync_read+0x88/0xd0
[527319.461292] [00000000004bb5cc] vfs_read+0x78/0x10c
[527319.461361] [00000000004bb920] sys_read+0x34/0x60
[527319.461424] [0000000000406294] linux_sparc_syscall32+0x3c/0x40

The calltrace is significant: __do_page_cache_readahead allocates a number
of pages with GFP_KERNEL, and hence it should have reclaimed sufficient
memory to satisfy GFP_ATOMIC allocations. However after the list of pages
goes to mpage_readpages, there can be significant intervals (including disk
IO) before all the pages are inserted into the radix-tree. So the reserves
can easily be depleted at that point. The patch is confirmed to fix the
problem.

Signed-off-by: Nick Piggin
Cc: "David S. Miller"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2008-02-06 01:44:17 +0800

03 Feb, 2008

1 commit

124d3b704 fix writev regression: pan hanging unkillable and un-straceable ... Browse Code »

Frederik Himpe reported an unkillable and un-straceable pan process.

Zero length iovecs can go into an infinite loop in writev, because the
iovec iterator does not always advance over them.

The sequence required to trigger this is not trivial. I think it
requires that a zero-length iovec be followed by a non-zero-length iovec
which causes a pagefault in the atomic usercopy. This causes the writev
code to drop back into single-segment copy mode, which then tries to
copy the 0 bytes of the zero-length iovec; a zero length copy looks like
a failure though, so it loops.

Put a test into iov_iter_advance to catch zero-length iovecs. We could
just put the test in the fallback path, but I feel it is more robust to
skip over zero-length iovecs throughout the code (iovec iterator may be
used in filesystems too, so it should be robust).

Signed-off-by: Nick Piggin
Signed-off-by: Ingo Molnar
Signed-off-by: Linus Torvalds

Nick Piggin
2008-02-03 04:55:39 +0800