Eric Lee / smarc-fsl-linux-kernel

17 Oct, 2007

40 commits

55144768e fs: remove some AOP_TRUNCATED_PAGE

prepare/commit_write no longer returns AOP_TRUNCATED_PAGE since OCFS2 and
GFS2 were converted to the new aops, so we can make some simplifications
for that.

[michal.k.k.piotrowski@gmail.com: fix warning]
Signed-off-by: Nick Piggin
Cc: Michael Halcrow
Cc: Mark Fasheh
Cc: Steven Whitehouse
Signed-off-by: Michal Piotrowski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:58 +0800
03158cd7e fs: restore nobh

Implement nobh in new aops. This is a bit tricky. FWIW, nobh_truncate is
now implemented in a way that does not create blocks in sparse regions,
which is a silly thing for it to have been doing (isn't it?)

ext2 survives fsx and fsstress. jfs is converted as well... ext3
should be easy to do (but not done yet).

[akpm@linux-foundation.org: coding-style fixes]
Cc: Badari Pulavarty
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:58 +0800
b6af1bcd8 ocfs2: convert to new aops

Plug ocfs2 into the ->write_begin and ->write_end aops.

A bunch of custom code is now gone - the iovec iteration stuff during write
and the ocfs2 splice write actor.

Signed-off-by: Mark Fasheh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:58 +0800
f2b6a16eb fs: affs convert to new aops

Cc: Roman Zippel
Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:58 +0800
b4585729f fs: adfs convert to new aops

Acked-by: Russell King
Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:57 +0800
d5c5f84ba jfs: convert to new aops

Signed-off-by: Nick Piggin
Acked-by: Dave Kleikamp
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:57 +0800
4a66af9ea minixfs: convert to new aops

Signed-off-by: Nick Piggin
Cc: Andries Brouwer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:57 +0800
26a6441aa sysv: convert to new aops

Signed-off-by: Nick Piggin
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:57 +0800
be021ee41 udf: convert to new aops

Convert udf to new aops. Also seem to have fixed pagecache corruption in
udf_adinicb_commit_write -- page was marked uptodate when it is not. Also,
fixed the silly setup where prepare_write was doing a kmap to be used in
commit_write: just do kmap_atomic in write_end. Use libfs helpers to make
this easier.

Signed-off-by: Nick Piggin
Cc:
Cc: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:57 +0800
82b9d1d0d ufs: convert to new aops

Signed-off-by: Nick Piggin
Cc: Evgeniy Dushistov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:57 +0800
205c109a7 jffs2: convert to new aops

Signed-off-by: Nick Piggin
Cc: David Woodhouse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:57 +0800
ae361ff46 hostfs: convert to new aops

This also gets rid of a lot of useless read_file stuff. And also
optimises the full page write case by marking a !uptodate page uptodate.

Signed-off-by: Nick Piggin
Cc: Jeff Dike
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:57 +0800
5e6f58a1d fuse: convert to new aops

[mszeredi]
- don't send zero length write requests
- it is not legal for the filesystem to return with zero written bytes

Signed-off-by: Nick Piggin
Signed-off-by: Miklos Szeredi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:57 +0800
fb53b3094 smbfs: convert to new aops

Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:57 +0800
4899f9c85 nfs: convert to new aops

[akpm@linux-foundation.org: fix against git-nfs]
[peterz@infradead.org: fix against git-nfs]
Signed-off-by: Nick Piggin
Acked-by: Trond Myklebust
Cc: "J. Bruce Fields"
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:57 +0800
a20fa20c5 With reiserfs no longer using the weird generic_cont_expand, remove it completely.

Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:56 +0800
f7557e8f7 reiserfs: use generic_cont_expand_simple

This patch makes reiserfs use AOP_FLAG_CONT_EXPAND
in order to get rid of the special generic_cont_expand routine.

Signed-off-by: Vladimir Saveliev
Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Saveliev
2007-10-17 00:42:56 +0800
ba9d8cec6 reiserfs: convert to new aops

Convert reiserfs to new aops

Signed-off-by: Vladimir Saveliev
Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Saveliev
2007-10-17 00:42:56 +0800
797b4cffd reiserfs: use generic write

Make reiserfs write via generic routines.
The original reiserfs write path, optimized for big writes, is deadlock prone.

Signed-off-by: Vladimir Saveliev
Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Saveliev
2007-10-17 00:42:56 +0800
f87061842 qnx4: convert to new aops

Signed-off-by: Nick Piggin
Acked-by: Anders Larsen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:56 +0800
eedcbba5e bfs: convert to new aops

Signed-off-by: Nick Piggin
Cc: Tigran Aivazian
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:56 +0800
d6091b720 hpfs: convert to new aops

Signed-off-by: Nick Piggin
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:56 +0800
7c0efc627 hfsplus: convert to new aops

Signed-off-by: Nick Piggin
Cc: Roman Zippel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:56 +0800
7903d9eed hfs: convert to new aops

Signed-off-by: Nick Piggin
Cc: Roman Zippel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:56 +0800
d7777a25a fat: convert to new aops

Signed-off-by: Nick Piggin
Cc: OGAWA Hirofumi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:55 +0800
89e107877 fs: new cont helpers

Rework the generic block "cont" routines to handle the new aops. Supporting
cont_prepare_write would take quite a lot of code, so remove it
instead (and we later convert all filesystems to use it).

write_begin gets passed AOP_FLAG_CONT_EXPAND when called from
generic_cont_expand, so filesystems can avoid the old hacks they used.

Signed-off-by: Nick Piggin
Cc: OGAWA Hirofumi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:55 +0800
7765ec26a gfs2: convert to new aops

Cc: Nick Piggin
Cc: Steven Whitehouse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Steven Whitehouse
2007-10-17 00:42:55 +0800
d79689c70 xfs: convert to new aops

Signed-off-by: Nick Piggin
Cc: David Chinner
Cc: Timothy Shimmin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:55 +0800
bfc1af650 ext4: convert to new aops

Convert ext4 to use write_begin()/write_end() methods.

Signed-off-by: Badari Pulavarty
Signed-off-by: Nick Piggin
Cc: Dmitriy Monakhov
Cc: Mark Fasheh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:55 +0800
f4fc66a89 ext3: convert to new aops

Various fixes and improvements

Signed-off-by: Badari Pulavarty
Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:55 +0800
f34fb6ecc ext2: convert to new aops

Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:55 +0800
6272b5a58 block_dev: convert to new aops

Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:55 +0800
800d15a53 implement simple fs aops

Implement new aops for some of the simpler filesystems.

Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:55 +0800
afddba49d fs: introduce write_begin, write_end, and perform_write aops

These are intended to replace prepare_write and commit_write with more
flexible alternatives that are also able to avoid the buffered write
deadlock problems efficiently (which prepare_write is unable to do).
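
As a rough illustration of the two-phase shape described above, the following is a userspace model only, not the kernel interface. Every name in it (struct sketch_page, sketch_write_begin, sketch_write_end, sketch_buffered_write, the max_copy parameter) is hypothetical, and the short-copy retry is an assumption about why an end hook that is told how many bytes were really copied is more flexible than a plain prepare/commit pair.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical stand-in for one pagecache page. */
struct sketch_page {
    unsigned char data[SKETCH_PAGE_SIZE];
    int locked;       /* 1 between write_begin and write_end */
    int begin_calls;  /* how many times write_begin ran */
};

/* Prepare [off, off + len) for writing; 0 on success, -1 on error. */
static int sketch_write_begin(struct sketch_page *pg, size_t off, size_t len)
{
    if (off > SKETCH_PAGE_SIZE || len > SKETCH_PAGE_SIZE - off)
        return -1;
    assert(!pg->locked);
    pg->locked = 1;
    pg->begin_calls++;
    return 0;
}

/* Commit the bytes that were actually copied, which may be fewer than
 * the len passed to write_begin. Returns the number committed. */
static size_t sketch_write_end(struct sketch_page *pg, size_t len, size_t copied)
{
    assert(pg->locked && copied <= len);
    pg->locked = 0;
    return copied;
}

/* Copy len bytes from src into the page at off. max_copy models a copy
 * step that can come up short; the loop then retries the remainder with
 * a fresh begin/end pair instead of blocking while the page is held. */
static long sketch_buffered_write(struct sketch_page *pg, size_t off,
                                  const void *src, size_t len, size_t max_copy)
{
    const unsigned char *p = src;
    size_t done = 0;

    if (max_copy == 0)
        return -1;  /* no forward progress possible */

    while (done < len) {
        size_t want = len - done;
        size_t n = want < max_copy ? want : max_copy;

        if (sketch_write_begin(pg, off + done, want) != 0)
            return -1;
        memcpy(pg->data + off + done, p + done, n);  /* possibly short */
        done += sketch_write_end(pg, want, n);
    }
    return (long)done;
}
```

Calling sketch_buffered_write(&pg, 10, "hello world", 11, 4) on a zeroed page goes through three begin/end rounds (4, 4 and 3 bytes) and leaves the page unlocked between rounds.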

[mark.fasheh@oracle.com: API design contributions, code review and fixes]
[akpm@linux-foundation.org: various fixes]
[dmonakhov@sw.ru: new aop block_write_begin fix]
Signed-off-by: Nick Piggin
Signed-off-by: Mark Fasheh
Signed-off-by: Dmitriy Monakhov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:55 +0800
637aff46f fs: fix data-loss on error

New buffers against uptodate pages are simply marked uptodate, while the
buffer_new bit remains set. This causes error-case code to zero out parts of
those buffers because it thinks they contain stale data: wrong, they are
actually uptodate so this is a data loss situation.

Fix this by actually clearing buffer_new and marking the buffer dirty. It
makes sense to always clear buffer_new before setting a buffer uptodate.
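
The failure mode and the fix can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the kernel code: struct sketch_buffer and the sketch_* functions are hypothetical, and the three int fields merely stand in for the buffer_new, buffer_uptodate and dirty state bits.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_BLOCK_SIZE 512

/* Hypothetical stand-in for a buffer and its state bits. */
struct sketch_buffer {
    unsigned char data[SKETCH_BLOCK_SIZE];
    int is_new;       /* models buffer_new */
    int is_uptodate;  /* models buffer_uptodate */
    int is_dirty;     /* models the dirty bit */
};

/* Behaviour described as buggy: mark uptodate, leave the new bit set. */
static void sketch_mark_uptodate_buggy(struct sketch_buffer *b)
{
    b->is_uptodate = 1;
}

/* Behaviour described as the fix: clear new first, then set uptodate
 * and mark the buffer dirty. */
static void sketch_mark_uptodate_fixed(struct sketch_buffer *b)
{
    b->is_new = 0;
    b->is_uptodate = 1;
    b->is_dirty = 1;
}

/* Error path: any buffer still flagged new is assumed to hold stale
 * data and is zeroed. Returns 1 if the buffer was zeroed, else 0. */
static int sketch_error_path_zero_new(struct sketch_buffer *b)
{
    if (!b->is_new)
        return 0;
    memset(b->data, 0, sizeof b->data);
    b->is_new = 0;
    return 1;
}
```

With the buggy marking, a buffer that holds valid data but is still flagged new gets wiped by the error path; with the fixed marking the error path leaves it alone.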

Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:55 +0800
eb2be1893 mm: buffered write cleanup

Quite a bit of code is used in maintaining these "cached pages" that are
probably pretty unlikely to get used. It would require a narrow race where
the page is inserted concurrently while this process is allocating a page
in order to create the spare page. Then a multi-page write into an uncached
part of the file, to make use of it.

Next, the buffered write path (and others) uses its own LRU pagevec when it
should be just using the per-CPU LRU pagevec (which will cut down on both data
and code size cacheline footprint). Also, these private LRU pagevecs are
emptied after just a very short time, in contrast with the per-CPU pagevecs
that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required
to add the pages to pagecache for a bulk write (in 4K chunks).

[this gets rid of some cond_resched() calls in readahead.c and mpage.c due
to clashes in -mm. What put them there, and why? ]

Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:54 +0800
a4b0672db fs: fix nobh error handling

nobh mode error handling is not just pretty slack, it's wrong.

One cannot zero out the whole page to ensure new blocks are zeroed, because
it just brings the whole page "uptodate" with zeroes even if that may not
be the correct uptodate data. Also, other parts of the page may already
contain dirty data which would get lost by zeroing it out. Thirdly, the
writeback of zeroes to the new blocks will also erase existing blocks. All
these conditions are pagecache and/or filesystem corruption.

The problem comes about because we didn't keep track of which buffers
actually are new or old. However it is not enough just to keep only this
state, because at the point we start dirtying parts of the page (new
blocks, with zeroes), the handling of IO errors becomes impossible without
buffers because the page may only be partially uptodate, in which case the
page flags alone cannot capture the state of the parts of the page.

So allocate all buffers for the page upfront, but leave them unattached so
that they don't pick up any other references and can be freed when we're
done. If the error path is hit, then zero the new buffers as the regular
buffer path does, then attach the buffers to the page so that it can
actually be written out correctly and be subject to the normal IO error
handling paths.

As an upshot, we save 1K of kernel stack on ia64 or powerpc 64K page
systems.

Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:54 +0800
68671f35f mm: add end_buffer_read helper function

Move duplicated code from end_buffer_read_XXX methods to separate helper
function.

Signed-off-by: Dmitry Monakhov
Cc: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dmitry Monakhov
2007-10-17 00:42:53 +0800
557ed1fa2 remove ZERO_PAGE

The commit b5810039a54e5babf428e9a1e89fc1940fabff11 contains the note

    A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
    (and thus mapcounted and count towards shared rss). These writes to
    the struct page could cause excessive cacheline bouncing on big
    systems. There are a number of ways this could be addressed if it is
    an issue.

And indeed this cacheline bouncing has shown up on large SGI systems.
There was a situation where an Altix system was essentially livelocked
tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
This situation can be avoided in userspace, but it does highlight the
potential scalability problem with refcounting ZERO_PAGE, and corner
cases where it can really hurt (we don't want the system to livelock!).

There are several broad ways to fix this problem:
1. add back some special casing to avoid refcounting ZERO_PAGE
2. per-node or per-cpu ZERO_PAGES
3. remove the ZERO_PAGE completely

I will argue for 3. The others should also fix the problem, but they
result in more complex code than does 3, with little or no real benefit
that I can see.

Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
false optimisation: if an application is performance critical, it would
not be doing many read faults of new memory, or at least it could be
expected to write to that memory soon afterwards. If cache or memory use
is critical, it should not be working with a significant number of
ZERO_PAGEs anyway (a more compact representation of zeroes should be
used).

As a sanity check -- measuring on my desktop system, there are never many
mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
increase much without it.

When running a make -j4 kernel compile on my dual core system, there are
about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
is torn down without being COWed). So removing ZERO_PAGE will save 1,000
page faults per second when running kbuild, while keeping it only saves
less than 1 page clearing operation per second. 1 page clear is cheaper
than a thousand faults, presumably, so there isn't an obvious loss.

Neither the logical argument nor these basic tests give a guarantee of no
regressions. However, this is a reasonable opportunity to try to remove
the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
we can reintroduce it and just avoid refcounting it.

The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked. I don't see
much use to them except on benchmarks. All other users of ZERO_PAGE are
converted just to use ZERO_PAGE(0) for simplicity. We can look at
replacing them all and maybe ripping out ZERO_PAGE completely when we are
more satisfied with this solution.

Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus "snif" Torvalds

Nick Piggin
2007-10-17 00:42:53 +0800
f4e6b498d readahead: combine file_ra_state.prev_index/prev_offset into prev_pos

Combine the file_ra_state members
    unsigned long prev_index
    unsigned int prev_offset
into
    loff_t prev_pos

It is more consistent and better supports huge files.
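
A minimal sketch of how a page index and an in-page offset fold into a single byte position and back. The names (sketch_loff_t, sketch_make_pos and friends) are hypothetical, the 4 KiB page size is an assumption, and the widen-before-shift detail is shown as general good practice rather than as what the patch itself does.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: 4 KiB pages. The real value depends on the
 * architecture and configuration. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((int64_t)1 << SKETCH_PAGE_SHIFT)

/* Stand-in for loff_t, a signed 64-bit file offset. */
typedef int64_t sketch_loff_t;

/* Pack a page index and an in-page offset into one byte position. */
static sketch_loff_t sketch_make_pos(unsigned long index, unsigned int offset)
{
    assert(offset < SKETCH_PAGE_SIZE);
    /* Widen before shifting: where unsigned long is 32 bits, shifting
     * first would discard the high bits of a large index. */
    return ((sketch_loff_t)index << SKETCH_PAGE_SHIFT) | offset;
}

/* Recover the page index from a byte position. */
static unsigned long sketch_pos_index(sketch_loff_t pos)
{
    return (unsigned long)(pos >> SKETCH_PAGE_SHIFT);
}

/* Recover the in-page offset from a byte position. */
static unsigned int sketch_pos_offset(sketch_loff_t pos)
{
    return (unsigned int)(pos & (SKETCH_PAGE_SIZE - 1));
}
```

A single 64-bit position can describe byte offsets past 4 GiB even where unsigned long is 32 bits wide, which is one way to read "better supports huge files".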

Thanks to Peter for the nice proposal!

[akpm@linux-foundation.org: fix shift overflow]
Cc: Peter Zijlstra
Signed-off-by: Fengguang Wu
Cc: Rusty Russell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Fengguang Wu
2007-10-17 00:42:52 +0800