17 Oct, 2007

40 commits

  • Rework the generic block "cont" routines to handle the new aops. Supporting
    cont_prepare_write would take quite a lot of code, so remove it instead
    (we later convert all filesystems to use the new helpers).

    write_begin gets passed AOP_FLAG_CONT_EXPAND when called from
    generic_cont_expand, so filesystems can avoid the old hacks they used.
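
    A minimal sketch of how a "cont"-style filesystem then hooks into the new
    aop. The helper name cont_write_begin and the FAT specifics (fat_get_block,
    MSDOS_I(...)->mmu_private) are recalled from the kernel of that era rather
    than quoted from this changelog, so treat the details as illustrative:

        static int fat_write_begin(struct file *file, struct address_space *mapping,
                                   loff_t pos, unsigned len, unsigned flags,
                                   struct page **pagep, void **fsdata)
        {
                *pagep = NULL;
                /* cont_write_begin() zero-fills the gap between the old EOF
                 * (tracked in mmu_private) and the new write position. */
                return cont_write_begin(file, mapping, pos, len, flags,
                                        pagep, fsdata, fat_get_block,
                                        &MSDOS_I(mapping->host)->mmu_private);
        }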

    Signed-off-by: Nick Piggin
    Cc: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Implement new aops for some of the simpler filesystems.
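
    Roughly what such a conversion looks like for a trivial in-memory
    filesystem; "foofs" is a made-up name and the simple_* helpers are the
    libfs ones of that era, so read this as a sketch rather than a quote from
    the patch:

        /* before: the old prepare_write/commit_write pair */
        static const struct address_space_operations foofs_aops = {
                .readpage       = simple_readpage,
                .prepare_write  = simple_prepare_write,
                .commit_write   = simple_commit_write,
        };

        /* after: the new write_begin/write_end pair */
        static const struct address_space_operations foofs_aops = {
                .readpage       = simple_readpage,
                .write_begin    = simple_write_begin,
                .write_end      = simple_write_end,
        };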

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Restore the KERNEL_DS optimisation, especially helpful to the 2copy write
    path.

    This may be a pretty questionable gain in most cases, especially after the
    legacy 2copy write path is removed, but it doesn't cost much.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Introduce the new write_begin and write_end aops. These are intended to
    replace prepare_write and commit_write with more flexible alternatives that
    are also able to avoid the buffered write deadlock problems efficiently
    (which prepare_write is unable to do).
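
    For reference, the new methods in struct address_space_operations looked
    roughly like this at the time (a sketch from memory, not a quote from the
    patch):

        int (*write_begin)(struct file *, struct address_space *mapping,
                           loff_t pos, unsigned len, unsigned flags,
                           struct page **pagep, void **fsdata);
        int (*write_end)(struct file *, struct address_space *mapping,
                         loff_t pos, unsigned len, unsigned copied,
                         struct page *page, void *fsdata);

    write_begin hands back a locked (and possibly not uptodate) page through
    *pagep; write_end is told how many bytes were actually copied, which is
    what lets the generic code cope with short atomic usercopies.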

    [mark.fasheh@oracle.com: API design contributions, code review and fixes]
    [akpm@linux-foundation.org: various fixes]
    [dmonakhov@sw.ru: new aop block_write_begin fix]
    Signed-off-by: Nick Piggin
    Signed-off-by: Mark Fasheh
    Signed-off-by: Dmitriy Monakhov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Add an iterator data structure to operate over an iovec. Add usercopy
    operators needed by generic_file_buffered_write, and convert that function
    over.
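
    The iterator is the iov_iter; roughly as it appeared then (a sketch
    recalled from that kernel, details illustrative):

        struct iov_iter {
                const struct iovec *iov;   /* current segment */
                unsigned long nr_segs;     /* segments remaining */
                size_t iov_offset;         /* offset into current segment */
                size_t count;              /* bytes remaining overall */
        };

        size_t iov_iter_copy_from_user_atomic(struct page *page, struct iov_iter *i,
                                              unsigned long offset, size_t bytes);
        void iov_iter_advance(struct iov_iter *i, size_t bytes);
        int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);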

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Modify the core write() code so that it won't take a pagefault while holding a
    lock on the pagecache page. There are a number of different deadlocks possible
    if we try to do such a thing:

    1. generic_buffered_write
    2. lock_page
    3. prepare_write
    4. unlock_page+vmtruncate
    5. copy_from_user
    6. mmap_sem(r)
    7. handle_mm_fault
    8. lock_page (filemap_nopage)
    9. commit_write
    10. unlock_page

    a. sys_munmap / sys_mlock / others
    b. mmap_sem(w)
    c. make_pages_present
    d. get_user_pages
    e. handle_mm_fault
    f. lock_page (filemap_nopage)

    2,8 - recursive deadlock if page is same
    2,8;2,8 - ABBA deadlock if page is different
    2,6;b,f - ABBA deadlock if page is same

    The solution is as follows:
    1. If we find the destination page is uptodate, continue as normal, but use
    atomic usercopies which do not take pagefaults and do not zero the uncopied
    tail of the destination. The destination is already uptodate, so we can
    commit_write the full length even if there was a partial copy: it does not
    matter that the tail was not modified, because if it is dirtied and written
    back to disk it will not cause any problems (uptodate *means* that the
    destination page is as new or newer than the copy on disk).

    1a. The above requires that fault_in_pages_readable correctly returns access
    information, because atomic usercopies cannot distinguish non-present pages
    in a readable mapping from the lack of a readable mapping (a sketch of the
    atomic-copy pattern appears after this description).

    2. If we find the destination page is non uptodate, unlock it (this could be
    made slightly more optimal), then allocate a temporary page to copy the
    source data into. Relock the destination page and continue with the copy.
    However, instead of a usercopy (which might take a fault), copy the data
    from the pinned temporary page via the kernel address space.

    (also, rename maxlen to seglen, because it was confusing)

    This increases the CPU/memory copy cost by almost 50% on the affected
    workloads. That will be solved by introducing a new set of pagecache write
    aops in a subsequent patch.
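
    The atomic usercopy of step 1 boils down to the following pattern (a
    simplified sketch; the real code sits behind the iov_iter helpers):

        char *kaddr;
        size_t left, copied;

        pagefault_disable();
        kaddr = kmap_atomic(page, KM_USER0);
        /* returns the number of bytes NOT copied instead of faulting */
        left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
        kunmap_atomic(kaddr, KM_USER0);
        pagefault_enable();
        copied = bytes - left;   /* may be short; the tail is not zero-filled */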

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Hide some of the open-coded nr_segs tests into the iovec helpers. This is all
    to simplify generic_file_buffered_write, because that gets more complex in the
    next patch.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Quite a bit of code is used in maintaining these "cached pages" that are
    probably pretty unlikely to get used. It would require a narrow race where
    the page is inserted concurrently while this process is allocating a page
    in order to create the spare page, and then a multi-page write into an
    uncached part of the file to make use of it.

    Next, the buffered write path (and others) uses its own LRU pagevec when it
    should be just using the per-CPU LRU pagevec (which will cut down on both data
    and code size cacheline footprint). Also, these private LRU pagevecs are
    emptied after just a very short time, in contrast with the per-CPU pagevecs
    that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required
    to add the pages to pagecache for a bulk write (in 4K chunks).
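
    The change amounts to dropping the on-stack pagevec and feeding pages
    straight into the per-CPU one (a before/after sketch, not the literal
    diff):

        /* before: private pagevec, drained almost immediately */
        struct pagevec lru_pvec;
        pagevec_init(&lru_pvec, 0);
        if (!pagevec_add(&lru_pvec, page))
                __pagevec_lru_add(&lru_pvec);
        /* ... */
        pagevec_lru_add(&lru_pvec);

        /* after: per-CPU LRU pagevec, batched with everyone else's additions */
        lru_cache_add(page);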

    [this gets rid of some cond_resched() calls in readahead.c and mpage.c due
    to clashes in -mm. What put them there, and why? ]

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
    we may have failed the write operation despite prepare_write having
    instantiated blocks past i_size. Fix this, and consolidate the trimming into
    one place.
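
    The consolidated trim is essentially (a sketch of the idea, not the literal
    hunk):

        if (unlikely(status < 0) && pos + bytes > inode->i_size)
                /* prepare_write may have instantiated blocks past i_size;
                 * a failed write must not leave them exposed */
                vmtruncate(inode, inode->i_size);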

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Allow CONFIG_DEBUG_VM to switch off the prefaulting logic; this makes the
    race much easier to hit.

    This is useful for demonstration and testing purposes, but is removed in a
    subsequent patch.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Rename some variables and fix some types.

    Signed-off-by: Andrew Morton
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • This reverts commit 6527c2bdf1f833cc18e8f42bd97973d583e4aa83, which
    fixed the following bug:

    When prefaulting in the pages in generic_file_buffered_write(), we only
    faulted in the pages for the first segment of the iovec. If the second or a
    successive segment described an mmapping of the page into which we're
    write()ing, and that page is not up-to-date, the fault handler tries to lock
    the already-locked page (to bring it up to date) and deadlocks.

    An exploit for this bug is in writev-deadlock-demo.c, in
    http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

    (These demos assume blocksize < PAGE_CACHE_SIZE).

    The problem with this fix is that it takes the kernel back to doing a single
    prepare_write()/commit_write() per iovec segment. So in the worst case we'll
    run prepare_write+commit_write 1024 times where we previously would have run
    it once. The other problem with the fix is that it doesn't fix all the
    locking problems.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • This reverts commit 81b0c8713385ce1b1b9058e916edcf9561ad76d6, which was
    a bugfix against 6527c2bdf1f833cc18e8f42bd97973d583e4aa83 ("[PATCH]
    generic_file_buffered_write(): deadlock on vectored write"), which we
    also revert.

    Signed-off-by: Andrew Morton
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Revert the patch from Neil Brown to optimise NFSD writev handling.

    Cc: Neil Brown
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • While running some memory intensive load, system response deteriorated just
    after swap-out started.

    The cause of this problem is that when a PG_reclaim page is moved to the tail
    of the inactive LRU list in rotate_reclaimable_page(), the lru_lock spin lock
    is acquired on every page writeback. This deteriorates system performance and
    makes interrupt hold-off time longer when swap-out starts.

    The following patch solves this problem. I use a pagevec when rotating
    reclaimable pages, to mitigate LRU spin lock contention and reduce interrupt
    hold-off time.
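
    The shape of the batching (a simplified sketch of the resulting
    rotate_reclaimable_page(); field and helper names recalled from that
    kernel, treat them as illustrative):

        static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);

        void rotate_reclaimable_page(struct page *page)
        {
                if (!PageLocked(page) && !PageDirty(page) &&
                    !PageActive(page) && PageLRU(page)) {
                        struct pagevec *pvec;
                        unsigned long flags;

                        page_cache_get(page);
                        local_irq_save(flags);
                        pvec = &__get_cpu_var(lru_rotate_pvecs);
                        if (!pagevec_add(pvec, page))
                                /* one lru_lock round trip per pagevec, not per page */
                                pagevec_move_tail(pvec);
                        local_irq_restore(flags);
                }
        }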

    I did a test allocating and touching pages in multiple processes, while
    pinging the test machine in flooding mode to measure response under
    memory-intensive load.

    The test result is:

    -2.6.23-rc5
    --- testmachine ping statistics ---
    3000 packets transmitted, 3000 received, 0% packet loss, time 53222ms
    rtt min/avg/max/mdev = 0.074/0.652/172.228/7.176 ms, pipe 11, ipg/ewma
    17.746/0.092 ms

    -2.6.23-rc5-patched
    --- testmachine ping statistics ---
    3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms
    rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma
    17.314/0.091 ms

    Max round-trip-time was improved.

    The test machine spec: 4 CPUs (3.16GHz, Hyper-threading enabled), 8GB memory,
    8GB swap.

    I did ping test again to observe performance deterioration caused by taking
    a ref.

    -2.6.23-rc6-with-modifiedpatch
    --- testmachine ping statistics ---
    3000 packets transmitted, 3000 received, 0% packet loss, time 53386ms
    rtt min/avg/max/mdev = 0.074/0.110/4.716/0.147 ms, pipe 2, ipg/ewma 17.801/0.129 ms

    The result for my original patch is as follows.

    -2.6.23-rc5-with-originalpatch
    --- testmachine ping statistics ---
    3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms
    rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma 17.314/0.091 ms

    The influence to response was small.

    [akpm@linux-foundation.org: fix uninitalised var warning]
    [hugh@veritas.com: fix locking]
    [randy.dunlap@oracle.com: fix function declaration]
    [hugh@veritas.com: fix BUG at include/linux/mm.h:220!]
    [hugh@veritas.com: kill redundancy in rotate_reclaimable_page]
    [hugh@veritas.com: move_tail_pages into lru_add_drain]
    Signed-off-by: Hisashi Hifumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hisashi Hifumi
     
  • Allow an application to query the memories allowed by its context.

    Updated numa_memory_policy.txt to mention that applications can use this to
    obtain allowed memories for constructing valid policies.
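
    From userspace this ends up looking roughly like the call below; the flag
    name MPOL_F_MEMS_ALLOWED and the libnuma get_mempolicy() wrapper are my
    recollection of the resulting interface, not spelled out in the text above:

        #include <numaif.h>
        #include <stdio.h>

        int main(void)
        {
                unsigned long nodemask[16] = { 0 };

                /* ask for the set of nodes this task may allocate from */
                if (get_mempolicy(NULL, nodemask, 8 * sizeof(nodemask),
                                  NULL, MPOL_F_MEMS_ALLOWED) != 0) {
                        perror("get_mempolicy");
                        return 1;
                }
                printf("mems_allowed (first word): 0x%lx\n", nodemask[0]);
                return 0;
        }

    (Build with -lnuma.)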

    TODO: update out-of-tree libnuma wrapper[s], or maybe add a new
    wrapper--e.g., numa_get_mems_allowed() ?

    Also, update numa syscall man pages.

    Tested with memtoy V>=0.13.

    Signed-off-by: Lee Schermerhorn
    Acked-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • The current VM can get itself into trouble fairly easily on systems with a
    small ZONE_HIGHMEM, which is common on i686 computers with 1GB of memory.

    On one side, page_alloc() will allocate down to zone->pages_low, while on
    the other side, kswapd() and balance_pgdat() will try to free memory from
    every zone, until every zone has more free pages than zone->pages_high.

    Highmem can be filled up to zone->pages_low with page tables, ramfs,
    vmalloc allocations and other unswappable things quite easily and without
    many bad side effects, since we still have a huge ZONE_NORMAL to do future
    allocations from.

    However, as long as the number of free pages in the highmem zone is below
    zone->pages_high, kswapd will continue swapping things out from
    ZONE_NORMAL, too!

    Sami Farin managed to get his system into a stage where kswapd had freed
    about 700MB of low memory and was still "going strong".

    The attached patch will make kswapd stop paging out data from zones when
    there is more than enough memory free. We do go above zone->pages_high in
    order to keep pressure between zones equal in normal circumstances, but the
    patch should prevent the kind of excesses that made Sami's computer totally
    unusable.

    Signed-off-by: Rik van Riel
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • vmalloc() returns a void pointer, so there's no need to cast its
    return value in mm/page_alloc.c::zone_wait_table_init().
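
    i.e. (a before/after sketch of the one-liner; the field is the zone's
    wait_table):

        /* before */
        zone->wait_table = (wait_queue_head_t *)vmalloc(alloc_size);
        /* after: vmalloc() already returns void *, no cast needed */
        zone->wait_table = vmalloc(alloc_size);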

    Signed-off-by: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • A NULL pointer means that the object was not allocated. One cannot
    determine the size of an object that has not been allocated. Currently we
    return 0 but we really should BUG() on attempts to determine the size of
    something nonexistent.

    krealloc() interprets NULL to mean a zero sized object. Handle that
    separately in krealloc().

    Signed-off-by: Christoph Lameter
    Acked-by: Pekka Enberg
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The calculation of pgoff in do_linear_fault() should use PAGE_SHIFT and not
    PAGE_CACHE_SHIFT since vma->vm_pgoff is in units of PAGE_SIZE and not
    PAGE_CACHE_SIZE. At the moment linux/pagemap.h has PAGE_CACHE_SHIFT
    defined as PAGE_SHIFT, but should that ever change this calculation would
    break.
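
    The corrected calculation is roughly the shape of the line in
    do_linear_fault():

        /* vma->vm_pgoff is in PAGE_SIZE units, so shift by PAGE_SHIFT,
         * not PAGE_CACHE_SHIFT */
        pgoff_t pgoff = (((address & PAGE_MASK) - vma->vm_start) >> PAGE_SHIFT)
                        + vma->vm_pgoff;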

    Signed-off-by: Dean Nelson
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dean Nelson
     
  • kfree(NULL) would normally occur only in error paths, and kfree(ZERO_SIZE_PTR)
    is uncommon as well, so let's use unlikely() for the condition check in SLUB's
    and SLOB's kfree() to optimize for the common case. SLAB has this already.

    Signed-off-by: Satyam Sharma
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Satyam Sharma
     
  • __add_to_swap_cache unconditionally sets the page locked, which can be a bit
    alarming to the unsuspecting reader: in the code paths where the page is
    visible to other CPUs, the page should be (and is) already locked.

    Instead, just add a check to ensure the page is locked here, and teach the one
    path relying on the old behaviour to call SetPageLocked itself.
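
    So the function now just asserts the invariant, and the odd caller sets the
    bit itself (a sketch; exact call sites aside):

        /* in __add_to_swap_cache(): the page must already be locked */
        BUG_ON(!PageLocked(page));

        /* in the one path that relied on the old behaviour, before adding: */
        SetPageLocked(new_page);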

    [hugh@veritas.com: locking fix]
    Signed-off-by: Nick Piggin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • find_lock_page does not need to recheck ->index because if the page is in the
    right mapping then the index must be the same. Also, tree_lock does not need
    to be retaken after the page is locked in order to test that ->mapping has not
    changed, because holding the page lock pins its mapping.
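
    Which lets find_lock_page() collapse to roughly this (a sketch of the
    post-patch logic):

        struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
        {
                struct page *page;
        repeat:
                page = find_get_page(mapping, offset);
                if (page) {
                        lock_page(page);
                        /* holding the page lock pins page->mapping, so a single
                         * check is enough; ->index cannot differ if the mapping
                         * matches */
                        if (unlikely(page->mapping != mapping)) {
                                unlock_page(page);
                                page_cache_release(page);
                                goto repeat;
                        }
                        VM_BUG_ON(page->index != offset);
                }
                return page;
        }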

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Probing pages and radix_tree_tagged are lockless operations with the lockless
    radix-tree. Convert these users to RCU locking rather than using tree_lock.
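
    For example, mapping_tagged() goes from taking tree_lock to a plain RCU
    read-side section (a sketch of the converted form):

        int mapping_tagged(struct address_space *mapping, int tag)
        {
                int ret;

                rcu_read_lock();
                ret = radix_tree_tagged(&mapping->page_tree, tag);
                rcu_read_unlock();
                return ret;
        }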

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • The commit b5810039a54e5babf428e9a1e89fc1940fabff11 contains the note

    A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
    (and thus mapcounted and count towards shared rss). These writes to
    the struct page could cause excessive cacheline bouncing on big
    systems. There are a number of ways this could be addressed if it is
    an issue.

    And indeed this cacheline bouncing has shown up on large SGI systems.
    There was a situation where an Altix system was essentially livelocked
    tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
    This situation can be avoided in userspace, but it does highlight the
    potential scalability problem with refcounting ZERO_PAGE, and corner
    cases where it can really hurt (we don't want the system to livelock!).

    There are several broad ways to fix this problem:
    1. add back some special casing to avoid refcounting ZERO_PAGE
    2. per-node or per-cpu ZERO_PAGES
    3. remove the ZERO_PAGE completely

    I will argue for 3. The others should also fix the problem, but they
    result in more complex code than does 3, with little or no real benefit
    that I can see.

    Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
    false optimisation: if an application is performance critical, it would
    not be doing many read faults of new memory, or at least it could be
    expected to write to that memory soon afterwards. If cache or memory use
    is critical, it should not be working with a significant number of
    ZERO_PAGEs anyway (a more compact representation of zeroes should be
    used).

    As a sanity check -- measuring on my desktop system, there are never many
    mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
    increase much without it.

    When running a make -j4 kernel compile on my dual core system, there are
    about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
    ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
    is torn down without being COWed). So removing ZERO_PAGE will save 1,000
    page faults per second when running kbuild, while keeping it only saves
    less than 1 page clearing operation per second. 1 page clear is cheaper
    than a thousand faults, presumably, so there isn't an obvious loss.

    Neither the logical argument nor these basic tests give a guarantee of no
    regressions. However, this is a reasonable opportunity to try to remove
    the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
    we can reintroduce it and just avoid refcounting it.

    The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked. I don't see
    much use to them except on benchmarks. All other users of ZERO_PAGE are
    converted just to use ZERO_PAGE(0) for simplicity. We can look at
    replacing them all and maybe ripping out ZERO_PAGE completely when we are
    more satisfied with this solution.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus "snif" Torvalds

    Nick Piggin
     
  • This gets rid of all kmalloc caches larger than page size. A kmalloc
    request larger than PAGE_SIZE / 2 is going to be passed through to the page
    allocator. This works both inline, where we call __get_free_pages instead of
    kmem_cache_alloc, and in __kmalloc.

    kfree is modified to check if the object is in a slab page. If not then
    the page is freed via the page allocator instead. Roughly similar to what
    SLOB does.
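
    The inline part of the pass-through is roughly the following (a simplified
    sketch; the PAGE_SIZE / 2 cutoff is my recollection of that SLUB, and the
    real version also picks a kmalloc cache directly for small constant sizes):

        static __always_inline void *kmalloc(size_t size, gfp_t flags)
        {
                if (__builtin_constant_p(size) && size > PAGE_SIZE / 2)
                        /* too big for the kmalloc caches: straight to the
                         * page allocator, as a compound page */
                        return (void *)__get_free_pages(flags | __GFP_COMP,
                                                        get_order(size));
                return __kmalloc(size, flags);
        }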

    Advantages:
    - Reduces memory overhead for the kmalloc array.
    - Large kmalloc operations are faster since they do not need to pass
      through the slab allocator to get to the page allocator.
    - Performance increase of 10%-20% on alloc and 50% on free for
      PAGE_SIZEd allocations. SLUB must call the page allocator for each
      alloc anyway, since the higher-order pages that used to allow avoiding
      the page allocator calls are not available in a reliable way anymore.
      So we are basically removing useless slab allocator overhead.
    - Large kmallocs yield page-aligned objects, which is what SLAB did.
      Bad things like using page-sized kmalloc allocations to stand in for
      page allocator allocations can be transparently handled and are not
      distinguishable from page allocator uses.
    - Checking for too-large objects can be removed since it is done by the
      page allocator.

    Drawbacks:
    - No accounting for large kmalloc slab allocations anymore
    - No debugging of large kmalloc slab allocations.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Convert some 'unsigned long' to pgoff_t.

    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • - remove unused local next_index in do_generic_mapping_read()
    - remove a redundant page_cache_read() declaration

    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Remove VM_MAX_CACHE_HIT, MAX_RA_PAGES and MIN_RA_PAGES.

    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • The local copy of ra in do_generic_mapping_read() can now go away.

    It predates readahead(req_size), from a time when the readahead code was
    called on *every* single page; hence a local copy had to be made to reduce the
    chance of the readahead state being overwritten by a concurrent reader. More
    details in: Linux: Random File I/O Regressions In 2.6

    Cc: Nick Piggin
    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • This is a simplified version of the pagecache context based readahead. It
    handles the case of multiple threads reading on the same fd and invalidating
    each others' readahead state. It does the trick by scanning the pagecache and
    recovering the current read stream's readahead status.

    The algorithm works in an opportunistic way, in that it does not try to detect
    interleaved reads _actively_, which requires a probe into the page cache
    (which means a little more overhead for random reads). It only tries to
    handle a previously started sequential readahead whose state was overwritten
    by another concurrent stream, and it can do this job pretty well.

    Negative and positive examples(or what you can expect from it):

    1) it cannot detect and serve perfect request-by-request interleaved reads
    right:
    time    stream 1    stream 2
       0           1
       1                    1001
       2           2
       3                    1002
       4           3
       5                    1003
       6           4
       7                    1004
       8           5
       9                    1005

    Here no single readahead will be carried out.

    2) However, if it's two concurrent reads by two threads, the chance of the
    initial sequential readahead being started is huge. Once the first sequential
    readahead is started for a stream, this patch will ensure that the readahead
    window continues to ramp up and won't be disturbed by other streams.

    time    stream 1    stream 2
       0           1
       1           2
       2                    1001
       3           3
       4                    1002
       5                    1003
       6           4
       7           5
       8                    1004
       9           6
      10                    1005
      11           7
      12                    1006
      13                    1007

    Here stream 1 will start a readahead at page 2, and stream 2 will start its
    first readahead at page 1003. From then on the two streams will be served
    right.

    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Combine the file_ra_state members
    unsigned long prev_index
    unsigned int prev_offset
    into
    loff_t prev_pos

    It is more consistent and better supports huge files.
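
    The combining step needs a widening cast on 32-bit, which is presumably
    what the "fix shift overflow" note in the tags below refers to (a sketch
    with illustrative local names):

        /* without the cast the shift is done in unsigned long and overflows
         * on 32-bit for file positions beyond 4GB */
        ra->prev_pos = (loff_t)index << PAGE_CACHE_SHIFT;
        ra->prev_pos |= offset;     /* offset within the page */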

    Thanks to Peter for the nice proposal!

    [akpm@linux-foundation.org: fix shift overflow]
    Cc: Peter Zijlstra
    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Fold file_ra_state.mmap_hit into file_ra_state.mmap_miss and make it an int.

    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Use 'unsigned int' instead of 'unsigned long' for readahead sizes.

    This helps reduce memory consumption on 64bit CPU when a lot of files are
    opened.

    CC: Andi Kleen
    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • This patch cleans up duplicate includes in
    mm/

    Signed-off-by: Jesper Juhl
    Acked-by: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • WARNING: mm/built-in.o(.text+0x24bd3): Section mismatch: reference to .init.text:early_kmem_cache_node_alloc (between 'init_kmem_cache_nodes' and 'calculate_sizes')
    ...

    Signed-off-by: Adrian Bunk
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Convert the common vmemmap population into initialisation helpers for use by
    architecture vmemmap populators. All architectures implementing the
    SPARSEMEM_VMEMMAP variant supply an architecture specific vmemmap_populate()
    initialiser, which may make use of the helpers.

    This allows us to clean up and remove the initialisation Kconfig entries.
    With this patch there is a single SPARSEMEM_VMEMMAP_ENABLE Kconfig option to
    indicate use of that variant.

    Signed-off-by: Andy Whitcroft
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all
    the arches. It would be great if it could be the default so that we can get
    rid of various forms of DISCONTIG and other variations on memory maps. So far
    what has hindered this are the additional lookups that SPARSEMEM introduces
    for virt_to_page and page_address. This goes so far that the code to do this
    has to be kept in a separate function and cannot be used inline.

    This patch introduces a virtual memmap mode for SPARSEMEM, in which the memmap
    is mapped into a virtually contiguous area; only the active sections are
    physically backed. This allows virt_to_page, page_address and cohorts to
    become simple shift/add operations. No page flag fields, no table lookups,
    nothing involving memory is required.

    The two key operations pfn_to_page and page_to_pfn become:

    #define __pfn_to_page(pfn) (vmemmap + (pfn))
    #define __page_to_pfn(page) ((page) - vmemmap)

    By having a virtual mapping for the memmap we allow simple access without
    wasting physical memory. As kernel memory is typically already mapped 1:1
    this introduces no additional overhead. The virtual mapping must be big
    enough to allow a struct page to be allocated and mapped for all valid
    physical pages. This will make a virtual memmap difficult to use on 32 bit
    platforms that support 36 address bits.

    However, if there is enough virtual space available and the arch already maps
    its 1-1 kernel space using TLBs (f.e. true of IA64 and x86_64) then this
    technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM.
    FLATMEM needs to read the contents of the mem_map variable to get the start of
    the memmap and then add the offset to the required entry. vmemmap is a
    constant to which we can simply add the offset.
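
    Side by side (the FLATMEM definition is from asm-generic of that era; treat
    this as a reference sketch):

        /* FLATMEM: one memory read (mem_map) plus arithmetic */
        #define __pfn_to_page(pfn)   (mem_map + ((pfn) - ARCH_PFN_OFFSET))

        /* SPARSEMEM_VMEMMAP: vmemmap is a compile-time constant address */
        #define __pfn_to_page(pfn)   (vmemmap + (pfn))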

    This patch has the potential to allow us to make SPARSEMEM the default (and
    even the only) option for most systems. It should be optimal on UP, SMP and
    NUMA on most platforms. Then we may even be able to remove the other memory
    models: FLATMEM, DISCONTIG etc.

    [apw@shadowen.org: config cleanups, resplit code etc]
    [kamezawa.hiroyu@jp.fujitsu.com: Fix sparsemem_vmemmap init]
    [apw@shadowen.org: vmemmap: remove excess debugging]
    [apw@shadowen.org: simplify initialisation code and reduce duplication]
    [apw@shadowen.org: pull out the vmemmap code into its own file]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andy Whitcroft
    Acked-by: Mel Gorman
    Cc: "Luck, Tony"
    Cc: Andi Kleen
    Cc: "David S. Miller"
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We have flags to indicate whether a section actually has a valid mem_map
    associated with it. This is never set and we rely solely on the present bit
    to indicate a section is valid. By definition a section is not valid if it
    has no mem_map and there is a window during init where the present bit is set
    but there is no mem_map, during which pfn_valid() will return true
    incorrectly.

    Use the existing SECTION_HAS_MEM_MAP flag to indicate the presence of a valid
    mem_map. Switch valid_section{,_nr} and pfn_valid() to this bit. Add new
    present_section{,_nr} and pfn_present() interfaces for those users who care to
    know that a section is going to be valid.
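
    The distinction ends up as two separate predicates over the section flags
    (a sketch recalled from mmzone.h of that era):

        static inline int present_section(struct mem_section *section)
        {
                return (section &&
                        (section->section_mem_map & SECTION_MARKED_PRESENT));
        }

        static inline int valid_section(struct mem_section *section)
        {
                return (section &&
                        (section->section_mem_map & SECTION_HAS_MEM_MAP));
        }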

    [akpm@linux-foundation.org: coding-syle fixes]
    Signed-off-by: Andy Whitcroft
    Acked-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: "Luck, Tony"
    Cc: Andi Kleen
    Cc: "David S. Miller"
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all
    the arches. It would be great if it could be the default so that we can get
    rid of various forms of DISCONTIG and other variations on memory maps. So far
    what has hindered this are the additional lookups that SPARSEMEM introduces
    for virt_to_page and page_address. This goes so far that the code to do this
    has to be kept in a separate function and cannot be used inline.

    This patch introduces a virtual memmap mode for SPARSEMEM, in which the memmap
    is mapped into a virtually contiguous area; only the active sections are
    physically backed. This allows virt_to_page, page_address and cohorts to
    become simple shift/add operations. No page flag fields, no table lookups,
    nothing involving memory is required.

    The two key operations pfn_to_page and page_to_pfn become:

    #define __pfn_to_page(pfn) (vmemmap + (pfn))
    #define __page_to_pfn(page) ((page) - vmemmap)

    By having a virtual mapping for the memmap we allow simple access without
    wasting physical memory. As kernel memory is typically already mapped 1:1
    this introduces no additional overhead. The virtual mapping must be big
    enough to allow a struct page to be allocated and mapped for all valid
    physical pages. This will make a virtual memmap difficult to use on 32 bit
    platforms that support 36 address bits.

    However, if there is enough virtual space available and the arch already maps
    its 1-1 kernel space using TLBs (f.e. true of IA64 and x86_64) then this
    technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM.
    FLATMEM needs to read the contents of the mem_map variable to get the start of
    the memmap and then add the offset to the required entry. vmemmap is a
    constant to which we can simply add the offset.

    This patch has the potential to allow us to make SPARSEMEM the default (and
    even the only) option for most systems. It should be optimal on UP, SMP and
    NUMA on most platforms. Then we may even be able to remove the other memory
    models: FLATMEM, DISCONTIG etc.

    The current aim is to bring a common virtually mapped mem_map to all
    architectures. This should facilitate the removal of the bespoke
    implementations from the architectures. This also brings performance
    improvements for most architectures, making sparsemem vmemmap the more desirable
    memory model. The ultimate aim of this work is to expand sparsemem support to
    encompass all the features of the other memory models. This could allow us to
    drop support for and remove the other models in the longer term.

    Below are some comparative kernbench numbers for various architectures,
    comparing default memory model against SPARSEMEM VMEMMAP. All but ia64 show
    marginal improvement; we expect the ia64 figures to be sorted out when the
    larger mapping support returns.

    x86-64 non-NUMA
    Base VMEMAP % change (-ve good)
    User 85.07 84.84 -0.26
    System 34.32 33.84 -1.39
    Total 119.38 118.68 -0.59

    ia64
    Base VMEMAP % change (-ve good)
    User 1016.41 1016.93 0.05
    System 50.83 51.02 0.36
    Total 1067.25 1067.95 0.07

    x86-64 NUMA
    Base VMEMAP % change (-ve good)
    User 430.77 431.73 0.22
    System 45.39 43.98 -3.11
    Total 476.17 475.71 -0.10

    ppc64
    Base VMEMAP % change (-ve good)
    User 488.77 488.35 -0.09
    System 56.92 56.37 -0.97
    Total 545.69 544.72 -0.18

    Below are some AIM benchmarks on IA64 and x86-64 (thanks, Bob). These seem
    pretty much flat, as you would expect.

    ia64 results 2 cpu non-numa 4Gb SCSI disk

    Benchmark Version Machine Run Date
    AIM Multiuser Benchmark - Suite VII "1.1" extreme Jun 1 07:17:24 2007

    Tasks Jobs/Min JTI Real CPU Jobs/sec/task
    1 98.9 100 58.9 1.3 1.6482
    101 5547.1 95 106.0 79.4 0.9154
    201 6377.7 95 183.4 158.3 0.5288
    301 6932.2 95 252.7 237.3 0.3838
    401 7075.8 93 329.8 316.7 0.2941
    501 7235.6 94 403.0 396.2 0.2407
    600 7387.5 94 472.7 475.0 0.2052

    Benchmark Version Machine Run Date
    AIM Multiuser Benchmark - Suite VII "1.1" vmemmap Jun 1 09:59:04 2007

    Tasks Jobs/Min JTI Real CPU Jobs/sec/task
    1 99.1 100 58.8 1.2 1.6509
    101 5480.9 95 107.2 79.2 0.9044
    201 6490.3 95 180.2 157.8 0.5382
    301 6886.6 94 254.4 236.8 0.3813
    401 7078.2 94 329.7 316.0 0.2942
    501 7250.3 95 402.2 395.4 0.2412
    600 7399.1 94 471.9 473.9 0.2055

    open power 710 2 cpu, 4 Gb, SCSI and configured physically

    Benchmark Version Machine Run Date
    AIM Multiuser Benchmark - Suite VII "1.1" extreme May 29 15:42:53 2007

    Tasks Jobs/Min JTI Real CPU Jobs/sec/task
    1 25.7 100 226.3 4.3 0.4286
    101 1096.0 97 536.4 199.8 0.1809
    201 1236.4 96 946.1 389.1 0.1025
    301 1280.5 96 1368.0 582.3 0.0709
    401 1270.2 95 1837.4 771.0 0.0528
    501 1251.4 96 2330.1 955.9 0.0416
    601 1252.6 96 2792.4 1139.2 0.0347
    701 1245.2 96 3276.5 1334.6 0.0296
    918 1229.5 96 4345.4 1728.7 0.0223

    Benchmark Version Machine Run Date
    AIM Multiuser Benchmark - Suite VII "1.1" vmemmap May 30 07:28:26 2007

    Tasks Jobs/Min JTI Real CPU Jobs/sec/task
    1 25.6 100 226.9 4.3 0.4275
    101 1049.3 97 560.2 198.1 0.1731
    201 1199.1 97 975.6 390.7 0.0994
    301 1261.7 96 1388.5 591.5 0.0699
    401 1256.1 96 1858.1 771.9 0.0522
    501 1220.1 96 2389.7 955.3 0.0406
    601 1224.6 96 2856.3 1133.4 0.0340
    701 1252.0 96 3258.7 1314.1 0.0298
    915 1232.8 96 4319.7 1704.0 0.0225

    amd64 2 2-core, 4Gb and SATA

    Benchmark Version Machine Run Date
    AIM Multiuser Benchmark - Suite VII "1.1" extreme Jun 2 03:59:48 2007

    Tasks Jobs/Min JTI Real CPU Jobs/sec/task
    1 13.0 100 446.4 2.1 0.2173
    101 533.4 97 1102.0 110.2 0.0880
    201 578.3 97 2022.8 220.8 0.0480
    301 583.8 97 3000.6 332.3 0.0323
    401 580.5 97 4020.1 442.2 0.0241
    501 574.8 98 5072.8 558.8 0.0191
    600 566.5 98 6163.8 671.0 0.0157

    Benchmark Version Machine Run Date
    AIM Multiuser Benchmark - Suite VII "1.1" vmemmap Jun 3 04:19:31 2007

    Tasks Jobs/Min JTI Real CPU Jobs/sec/task
    1 13.0 100 447.8 2.0 0.2166
    101 536.5 97 1095.6 109.7 0.0885
    201 567.7 97 2060.5 219.3 0.0471
    301 582.1 96 3009.4 330.2 0.0322
    401 578.2 96 4036.4 442.4 0.0240
    501 585.1 98 4983.2 555.1 0.0195
    600 565.5 98 6175.2 660.6 0.0157

    This patch:

    Fix some spelling errors.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andy Whitcroft
    Acked-by: Mel Gorman
    Cc: "Luck, Tony"
    Cc: Andi Kleen
    Cc: "David S. Miller"
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft