17 Jun, 2009

40 commits

  • ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determine whether
    pages_min, pages_low or pages_high is used as the zone watermark when
    allocating the pages. Two branches in the allocator hotpath determine
    which watermark to use.

    This patch uses the flags as an array index into a watermark array that is
    indexed with WMARK_* defines accessed via helpers. All call sites that
    use zone->pages_* are updated to use the helpers for accessing the values
    and the array offsets for setting.
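
    For reference, a minimal sketch of the arrangement described above (the
    enum, array and helper names follow the changelog; the surrounding struct
    layout is abbreviated):

        enum zone_watermarks {
                WMARK_MIN,
                WMARK_LOW,
                WMARK_HIGH,
                NR_WMARK
        };

        struct zone {
                /* ... */
                unsigned long watermark[NR_WMARK];
                /* ... */
        };

        /* Call sites read the values via helpers instead of zone->pages_* */
        #define min_wmark_pages(z)   ((z)->watermark[WMARK_MIN])
        #define low_wmark_pages(z)   ((z)->watermark[WMARK_LOW])
        #define high_wmark_pages(z)  ((z)->watermark[WMARK_HIGH])

        /* In the allocator hot path the ALLOC_WMARK_* flag bits index the
         * array directly, removing the two branches. */
        mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];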

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • A number of sanity checks are made on each page allocation and free,
    including that the page count is zero. page_count() checks for compound
    pages and, if the page is compound, returns the count of the head page.
    However, in these paths we do not care whether the page is compound, as
    the count of each tail page should also be zero.

    This patch makes two changes to the use of page_count() in the free path.
    It converts one check of page_count() to a VM_BUG_ON() as the count should
    have been unconditionally checked earlier in the free path. It also
    avoids checking for compound pages.
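
    A sketch of the kind of check this leaves in the free path (the _count
    field name matches kernels of this era; the exact placement is an
    assumption):

        /* Every page being freed, head or tail, must have a zero reference
         * count, so the compound-aware page_count() is not needed here. */
        VM_BUG_ON(atomic_read(&page->_count) != 0);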

    [mel@csn.ul.ie: Wrote changelog]
    Signed-off-by: Mel Gorman
    Signed-off-by: Nick Piggin
    Reviewed-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • There is a zonelist cache which is used to track zones that are not in the
    allowed cpuset or found to be recently full. This is to reduce cache
    footprint on large machines. On smaller machines, it just incurs cost for
    no gain. This patch only uses the zonelist cache when there are NUMA
    nodes.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • free_page_mlock() tests and clears PG_mlocked using locked versions of the
    bit operations. If the bit is set, it disables interrupts to update
    counters, and this happens on every page free even though interrupts are
    disabled again very shortly afterwards. This is wasteful.

    This patch splits what free_page_mlock() does. The bit check is still
    made. However, the update of counters is delayed until the interrupts are
    disabled and the non-lock version for clearing the bit is used. One
    potential weirdness with this split is that the counters do not get
    updated if the bad_page() check is triggered but a system showing bad
    pages is getting screwed already.
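
    Roughly, the split looks like this (a sketch based on the description
    above, not the exact diff):

        /* Counter updates only; callers run this with interrupts already
         * disabled, so the non-atomic __ counter variants are safe. */
        static inline void free_page_mlock(struct page *page)
        {
                __dec_zone_page_state(page, NR_MLOCK);
                __count_vm_event(UNEVICTABLE_MLOCKFREED);
        }

        /* Free path: clear the bit without the lock prefix, remember the
         * result, and account for it only once IRQs are off anyway. */
        clearMlocked = __TestClearPageMlocked(page);
        ...
        local_irq_save(flags);
        if (unlikely(clearMlocked))
                free_page_mlock(page);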

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Reviewed-by: KOSAKI Motohiro
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Acked-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • get_pageblock_migratetype() is potentially called twice for every page
    free. Once, when being freed to the pcp lists and once when being freed
    back to buddy. When freeing from the pcp lists, it is known what the
    pageblock type was at the time of free so use it rather than rechecking.
    In low memory situations under memory pressure, this might skew
    anti-fragmentation slightly but the interference is minimal and decisions
    that are fragmenting memory are being made anyway.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • __rmqueue_fallback() is in the slow path but has only one call site.
    Because there is only one call-site, this function can then be inlined
    without causing text bloat. On an x86-based config, it made no difference
    as the savings were padded out by NOP instructions. Mileage varies, but
    text will either decrease in size or remain static.

    Signed-off-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • buffered_rmqueue() is in the fast path so inline it. Because it only has
    one call site, this function can then be inlined without causing text
    bloat. On an x86-based config, it made no difference as the savings were
    padded out by NOP instructions. Mileage varies, but text will either
    decrease in size or remain static.

    Signed-off-by: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Inline __rmqueue_smallest by altering flow very slightly so that there is
    only one call site. Because there is only one call-site, this function
    can then be inlined without causing text bloat. On an x86-based config,
    this patch reduces text by 16 bytes.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Allocations that specify __GFP_HIGH get the ALLOC_HIGH flag. If these
    flags are equal to each other, we can eliminate a branch.
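
    In other words, a sketch of the idea (how the equality would be enforced,
    e.g. with a build-time assertion, is left out here):

        /* Before: a test and a branch on every allocation */
        if (gfp_mask & __GFP_HIGH)
                alloc_flags |= ALLOC_HIGH;

        /* After: with ALLOC_HIGH defined equal to __GFP_HIGH, just mask it in */
        alloc_flags |= (gfp_mask & __GFP_HIGH);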

    [akpm@linux-foundation.org: Suggested the hack]
    Signed-off-by: Mel Gorman
    Reviewed-by: Pekka Enberg
    Reviewed-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Factor out the mapping between GFP and alloc_flags only once. Once
    factored out, it only needs to be calculated once but some care must be
    taken.

    [neilb@suse.de says]
    As the test:

    -    if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
    -                    && !in_interrupt()) {
    -            if (!(gfp_mask & __GFP_NOMEMALLOC)) {

    has been replaced with a slightly weaker one:

    + if (alloc_flags & ALLOC_NO_WATERMARKS) {

    Without care, this would allow recursion into the allocator via direct
    reclaim. This patch ensures we do not recurse when PF_MEMALLOC is set, but
    TIF_MEMDIE callers are now allowed to directly reclaim where they would
    have been prevented in the past.

    Signed-off-by: Peter Zijlstra
    Acked-by: Pekka Enberg
    Signed-off-by: Mel Gorman
    Cc: Neil Brown
    Cc: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • GFP mask is converted into a migratetype when deciding which pagelist to
    take a page from. However, it is happening multiple times per allocation,
    at least once per zone traversed. Calculate it once.

    Signed-off-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • get_page_from_freelist() can be called multiple times for an allocation.
    Part of this calculates the preferred_zone which is the first usable zone
    in the zonelist but the zone depends on the GFP flags specified at the
    beginning of the allocation call. This patch calculates preferred_zone
    once. It's safe to do this because if preferred_zone is NULL at the start
    of the call, no amount of direct reclaim or other actions will change the
    fact the allocation will fail.

    [akpm@linux-foundation.org: remove (void) casts]
    Signed-off-by: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • On low-memory systems, anti-fragmentation gets disabled as there is
    nothing it can do and it would just incur overhead shuffling pages between
    lists constantly. Currently the check is made in the free page fast path
    for every page. This patch moves it to a slow path. On machines with low
    memory, there will be a small amount of additional overhead as pages get
    shuffled between lists, but it should quickly settle.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The core of the page allocator is one giant function which allocates
    memory on the stack and makes calculations that may not be needed for
    every allocation. This patch breaks up the allocator path into fast and
    slow paths for clarity. Note the slow paths are still inlined but the
    entry is marked unlikely. If they were not inlined, text size would
    actually increase, as there is only one call site.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • It is possible with __GFP_THISNODE that no zones are suitable. This patch
    makes sure the check is only made once.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Callers of alloc_pages_node() can optionally specify -1 as a node to mean
    "allocate from the current node". However, a number of the callers in
    fast paths know for a fact their node is valid. To avoid a comparison and
    branch, this patch adds alloc_pages_exact_node() that only checks the nid
    with VM_BUG_ON(). Callers that know their node is valid are then
    converted.
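
    A sketch of the new helper as described (the signature mirrors
    alloc_pages_node(); the exact body is an approximation):

        static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
                                                          unsigned int order)
        {
                /* No runtime "nid == -1" fallback: the caller guarantees a
                 * valid node, checked only when VM debugging is enabled. */
                VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);

                return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
        }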

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Acked-by: Paul Mundt [for the SLOB NUMA bits]
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • No user of the allocator API should be passing in an order >= MAX_ORDER
    but we check for it on each and every allocation. Delete this check and
    make it a VM_BUG_ON check further down the call path.
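
    With the WARN_ON_ONCE substitution noted below, the relocated check in the
    slow path looks roughly like this (placement is an assumption):

        /* Only allocations that already fell out of the fast path pay for
         * this sanity check. */
        if (order >= MAX_ORDER) {
                WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
                return NULL;
        }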

    [akpm@linux-foundation.org: s/VM_BUG_ON/WARN_ON_ONCE/]
    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The start of a large patch series to clean up and optimise the page
    allocator.

    The performance improvements are in a wide range depending on the exact
    machine, but the results I've seen so far are approximately:

    kernbench:    0 to 0.12% (elapsed time), 0.49% to 3.20% (sys time)
    aim9:         -4% to 30% (for page_test and brk_test)
    tbench:       -1% to 4%
    hackbench:    -2.5% to 3.45% (mostly within the noise though)
    netperf-udp:  -1.34% to 4.06% (varies between machines a bit)
    netperf-tcp:  -0.44% to 5.22% (varies between machines a bit)

    I don't have sysbench figures at hand, but previously they were within the
    -0.5% to 2% range.

    On netperf, the client and server were bound to opposite number CPUs to
    maximise the problems with cache line bouncing of the struct pages so I
    expect different people to report different results for netperf depending
    on their exact machine and how they ran the test (different machines, same
    cpus client/server, shared cache but two threads client/server, different
    socket client/server etc).

    I also measured the vmlinux sizes for a single x86-based config with
    CONFIG_DEBUG_INFO enabled but not CONFIG_DEBUG_VM. The core of the
    .config is based on the Debian Lenny kernel config so I expect it to be
    reasonably typical.

    This patch:

    __alloc_pages_internal is the core page allocator function but essentially
    it is an alias of __alloc_pages_nodemask. Naming a publicly available and
    exported function "internal" is also a big ugly. This patch renames
    __alloc_pages_internal() to __alloc_pages_nodemask() and deletes the old
    nodemask function.

    Warning - This patch renames an exported symbol. No in-tree kernel driver
    is affected; external drivers calling __alloc_pages_internal() should
    change the call to __alloc_pages_nodemask() without any alteration of
    parameters.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • On an x86_64 with 4GB ram, tcp_init()'s call to alloc_large_system_hash(),
    to allocate tcp_hashinfo.ehash, is now triggering an mmotm WARN_ON_ONCE on
    order >= MAX_ORDER - it's hoping for order 11. alloc_large_system_hash()
    had better make its own check on the order.

    Signed-off-by: Hugh Dickins
    Cc: David Miller
    Cc: Mel Gorman
    Cc: Eric Dumazet
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Fix allocating page cache/slab object on the unallowed node when memory
    spread is set by updating tasks' mems_allowed after its cpuset's mems is
    changed.

    In order to update tasks' mems_allowed in time, we must modify the memory
    policy code, because the memory policy was originally applied in the
    process's own context. After this patch, one task directly manipulates
    another's mems_allowed, and we use alloc_lock in the task_struct to
    protect the task's mems_allowed and memory policy.

    In the fast path, however, we don't take a lock to protect them, because
    adding one may lead to a performance regression. Without a lock, though,
    a task might see an empty nodemask while its cpuset's mems_allowed is
    being changed to some non-overlapping set. To avoid this, we first set
    all newly allowed nodes, then clear the newly disallowed ones.
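
    The grow-then-shrink update, sketched (illustrative, not the exact diff):

        /*
         * Lockless readers in the fast path must never observe an empty
         * mems_allowed, so grow the mask first, then install the final one.
         */
        nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
        /* ... rebind the task's memory policy here ... */
        tsk->mems_allowed = *newmems;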

    [lee.schermerhorn@hp.com:
    The rework of mpol_new() to extract the adjusting of the node mask to
    apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
    with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
    allocation. Fix this by adding the check for MPOL_PREFERRED and an empty
    node mask to mpol_new_mempolicy().

    Remove the now unneeded 'nodes = NULL' from mpol_new().

    Note that mpol_new_mempolicy() is always called with a non-NULL
    'nodes' parameter now that it has been removed from mpol_new().
    Therefore, we don't need to test nodes for NULL before testing it for
    'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to
    verify this assumption.]
    [lee.schermerhorn@hp.com:

    I don't think the function name 'mpol_new_mempolicy' is descriptive
    enough to differentiate it from mpol_new().

    This function applies cpuset set context, usually constraining nodes
    to those allowed by the cpuset. However, when the 'RELATIVE_NODES' flag
    is set, it also translates the nodes. So I settled on
    'mpol_set_nodemask()', because the comment block for mpol_new() mentions
    that we need to call this function to "set nodes".

    Some additional minor line length, whitespace and typo cleanup.]
    Signed-off-by: Miao Xie
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Paul Menage
    Cc: Nick Piggin
    Cc: Yasunori Goto
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • Fix the bug that the kernel didn't spread page cache/slab object evenly
    over all the allowed nodes when spread flags were set by updating tasks'
    page/slab spread flags in time.

    Signed-off-by: Miao Xie
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Paul Menage
    Cc: Nick Piggin
    Cc: Yasunori Goto
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • The kernel still allocates the page cache on the old nodes after its
    cpuset's mems is modified when 'memory_spread_page' is set, or it doesn't
    spread the page cache evenly over all the nodes that the faulting task is
    allowed to use after memory_spread_page is set. This is caused by the
    task's stale mems_allowed and flags: the current kernel doesn't update
    them unless some function invokes cpuset_update_task_memory_state(), which
    is sometimes too late. We must update the mems_allowed and the flags of
    the tasks in time.

    Slab has the same problem.

    The following patches fix this bug by updating tasks' mems_allowed and
    spread flags after their cpuset's mems or spread flag is changed.

    This patch:

    Extract a function from cpuset_update_task_memory_state(). It will be
    used later to update tasks' page/slab spread flags after their cpuset's
    flag is changed.

    Signed-off-by: Miao Xie
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Paul Menage
    Cc: Nick Piggin
    Cc: Yasunori Goto
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • get_dirty_limits() calls clip_bdi_dirty_limit() and task_dirty_limit()
    with variable pbdi_dirty as one of the arguments. This variable is an
    unsigned long * but both functions expect it to be a long *. This causes
    the following sparse warnings:

    warning: incorrect type in argument 3 (different signedness)
    expected long *pbdi_dirty
    got unsigned long *pbdi_dirty
    warning: incorrect type in argument 2 (different signedness)
    expected long *pdirty
    got unsigned long *pbdi_dirty

    Fix the warnings by changing the long * to unsigned long * in both
    functions.

    Signed-off-by: H Hartley Sweeten
    Cc: Johannes Weiner
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     
  • Commit 33c120ed2843090e2bd316de1588b8bf8b96cbde ("more aggressively use
    lumpy reclaim") increased how aggressive lumpy reclaim was by isolating
    both active and inactive pages for asynchronous lumpy reclaim on
    costly-high-order pages and for cheap-high-order when memory pressure is
    high. However, if the system is under heavy pressure and there are dirty
    pages, asynchronous IO may not be sufficient to reclaim a suitable page in
    time.

    This patch causes the caller to enter synchronous lumpy reclaim for
    costly-high-order pages and for cheap-high-order pages when under memory
    pressure.

    Minchan.kim@gmail.com said:

    Andy added synchronous lumpy reclaim with
    c661b078fd62abe06fd11fab4ac5e4eeafe26b6d. At that time, lumpy reclaim was
    not aggressive. His intention was just for high-order users (above
    PAGE_ALLOC_COSTLY_ORDER).

    After some time, Rik added aggressive lumpy reclaim with
    33c120ed2843090e2bd316de1588b8bf8b96cbde. His intention was to do lumpy
    reclaim when high-order users have trouble getting a small set of
    contiguous pages.

    So we also have to add synchronous pageout for a small set of contiguous
    pages.

    Cc: Lee Schermerhorn
    Cc: Andy Whitcroft
    Acked-by: Peter Zijlstra
    Cc: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Reviewed-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Move more documentation for get_user_pages_fast into the new kerneldoc comment.
    Add some comments for get_user_pages as well.

    Also, move get_user_pages_fast declaration up to get_user_pages. It wasn't
    there initially because it was once a static inline function.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Nick Piggin
    Cc: Andy Grover
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Now that we do readahead for sequential mmap reads, here is a simple
    evaluation of the impacts, and one further optimization.

    It's an NFS-root debian desktop system, readahead size = 60 pages.
    The numbers are grabbed after a fresh boot into console.

    approach   pgmajfault   RA miss ratio   mmap IO count   avg IO size(pages)
       A           383           31.6%            383               11
       B           225           32.4%            390               11
       C           224           32.6%            307               13

    case A: mmap sync/async readahead disabled
    case B: mmap sync/async readahead enabled, with enforced full async readahead size
    case C: mmap sync/async readahead enabled, with enforced full sync/async readahead size
    or:
    A = vanilla 2.6.30-rc1
    B = A plus mmap readahead
    C = B plus this patch

    The numbers show that
    - there are good possibilities for random mmap reads to trigger readahead
    - 'pgmajfault' is reduced by 1/3, due to the _async_ nature of readahead
    - case C can further reduce IO count by 1/4
    - readahead miss ratios are not quite affected

    The theory is
    - readahead is _good_ for clustered random reads, and can perform
      _better_ than readaround because it could be _async_;
    - async readahead size is guaranteed to be larger than readaround
      size, and it is _async_, hence will mostly behave better.
    However, for B
    - sync readahead size could be smaller than readaround size, hence may
      make things worse by producing more, smaller IOs,
    which will be fixed by this patch.

    Final conclusion:
    - mmap readahead reduced major faults by 1/3 and no obvious overheads;
    - mmap io can be further reduced by 1/4 with this patch.

    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Introduce page cache context based readahead algorithm.
    This is to better support concurrent read streams in general.

    RATIONALE
    ---------
    The current readahead algorithm detects interleaved reads in a _passive_
    way. Given a sequence of interleaved streams 1,1001,2,1002,3,4,1003,5,1004,1005,6,...,
    by checking for (offset == prev_offset + 1) it will discover the
    sequentialness between 3,4 and between 1004,1005, and start doing
    sequential readahead for the individual streams from page 4 and page 1005
    onwards.

    The context readahead algorithm guarantees to discover the sequentialness
    no matter how the streams are interleaved. For the above example, it will
    start sequential readahead from page 2 and page 1002.

    The trick is to poke for page @offset-1 in the page cache when it has no
    other clues on the sequentialness of request @offset: if the current
    request belongs to a sequential stream, that stream must have accessed
    page @offset-1 recently, and the page will still be cached now. So if page
    @offset-1 is there, we can take request @offset as a sequential access.
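
    A minimal sketch of that probe (illustrative only, not the kernel's actual
    helper):

        /*
         * If the page just before @offset is still in the page cache, the
         * request at @offset most likely belongs to a sequential stream.
         */
        static int probe_sequential(struct address_space *mapping, pgoff_t offset)
        {
                struct page *page;

                if (!offset)
                        return 0;

                page = find_get_page(mapping, offset - 1);
                if (!page)
                        return 0;
                page_cache_release(page);  /* drop find_get_page()'s reference */
                return 1;
        }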

    BENEFICIARIES
    -------------
    - strictly interleaved reads i.e. 1,1001,2,1002,3,1003,...
    the current readahead will take them as silly random reads;
    the context readahead will take them as two sequential streams.

    - cooperative IO processes i.e. NFS and SCST
    They create a thread pool, farming off (sequential) IO requests to different
    threads which will be performing interleaved IO.

    It was not easy (or possible) to reliably tell from file->f_ra all those
    cooperative processes working on the same sequential stream, since they
    will have different file->f_ra instances. And NFSD's file->f_ra is
    particularly unusable, since its file objects are dynamically created for
    each request. The nfsd does have code trying to restore the f_ra bits, but
    it is not satisfactory.

    The new scheme is to detect the sequential pattern via looking up the page
    cache, which provides one single and consistent view of the pages recently
    accessed. That makes sequential detection for cooperative processes possible.

    USER REPORT
    -----------
    Vladislav recommends the addition of context readahead as a result of his SCST
    benchmarks. It leads to 6%~40% performance gains in various cases and achieves
    equal performance in others. http://lkml.org/lkml/2009/3/19/239

    OVERHEADS
    ---------
    In theory, it introduces one extra page cache lookup per random read. However
    the below benchmark shows context readahead to be slightly faster, wondering..

    Randomly reading 200MB amount of data on a sparse file, repeat 20 times for
    each block size. The average throughputs are:

                         original ra     context ra      gain
     4K random reads:    65.561MB/s      65.648MB/s      +0.1%
    16K random reads:   124.767MB/s     124.951MB/s      +0.1%
    64K random reads:   162.123MB/s     162.278MB/s      +0.1%

    Cc: Jens Axboe
    Cc: Jeff Moyer
    Tested-by: Vladislav Bolkhovitin
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Split all readahead cases, and move the random one to bottom.

    No behavior changes.

    This is to prepare for the introduction of context readahead, and make it
    easy for inserting accounting/tracing points for each case.

    Signed-off-by: Wu Fengguang
    Cc: Vladislav Bolkhovitin
    Cc: Jens Axboe
    Cc: Jeff Moyer
    Cc: Nick Piggin
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • The counterpart of radix_tree_next_hole(). To be used by context readahead.

    Signed-off-by: Wu Fengguang
    Cc: Vladislav Bolkhovitin
    Cc: Jens Axboe
    Cc: Jeff Moyer
    Cc: Nick Piggin
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Mmap read-around now shares the same code style and data structure with
    readahead code.

    This also removes do_page_cache_readahead(). Its last user, mmap
    read-around, has been changed to call ra_submit().

    The no-readahead-if-congested logic is dropped along the way. Users will
    be pretty sensitive about the slow loading of executables, so it's
    unfavorable to disable mmap read-around on a congested queue.

    [akpm@linux-foundation.org: coding-style fixes]
    Cc: Nick Piggin
    Signed-off-by: Fengguang Wu
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • We need this in one particular case and two more general ones.

    Now we do async readahead for sequential mmap reads, and do it with the
    help of PG_readahead. For normal reads, PG_readahead is the sufficient
    condition to do a sequential readahead. But unfortunately, for mmap
    reads, there is a tiny nuisance:

    [11736.998347] readahead-init0(process: sh/23926, file: sda1/w3m, offset=0:4503599627370495, ra=0+4-3) = 4
    [11737.014985] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=290+32-0) = 17
    [11737.019488] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=118+32-0) = 32
    [11737.024921] readahead-interleaved(process: w3m/23926, file: sda1/w3m, offset=0:2, ra=4+6-6) = 6
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~

    An unfavorably small readahead. The original dumb read-around size could
    be more efficient.

    That happened because ld-linux.so does a read(832) in L1 before mmap(),
    which triggers a 4-page readahead, with the second page tagged
    PG_readahead.

    L0: open("/lib/libc.so.6", O_RDONLY) = 3
    L1: read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\342"..., 832) = 832
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    L2: fstat(3, {st_mode=S_IFREG|0755, st_size=1420624, ...}) = 0
    L3: mmap(NULL, 3527256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fac6e51d000
    L4: mprotect(0x7fac6e671000, 2097152, PROT_NONE) = 0
    L5: mmap(0x7fac6e871000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x154000) = 0x7fac6e871000
    L6: mmap(0x7fac6e876000, 16984, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fac6e876000
    L7: close(3) = 0

    In general, the PG_readahead flag will also be hit in these cases:

    - sequential reads

    - clustered random reads

    A full readahead size is desirable in both cases.

    Cc: Nick Piggin
    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Auto-detect sequential mmap reads and do readahead for them.

    The sequential mmap readahead will be triggered when
    - sync readahead: it's a major fault and (prev_offset == offset-1);
    - async readahead: minor fault on PG_readahead page with valid readahead state.
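
    A sketch of the two triggers in a filemap fault handler (illustrative; the
    surrounding code and exact arguments are assumptions):

        if (!page) {
                /* major fault: sync readahead if the access looks sequential */
                if (ra->prev_pos >> PAGE_CACHE_SHIFT == offset - 1)
                        page_cache_sync_readahead(mapping, ra, file,
                                                  offset, ra->ra_pages);
        } else if (PageReadahead(page)) {
                /* minor fault on a PG_readahead page: async readahead */
                page_cache_async_readahead(mapping, ra, file, page,
                                           offset, ra->ra_pages);
        }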

    The benefits of doing readahead instead of read-around:
    - less I/O wait thanks to async readahead
    - double real I/O size and no more cache hits

    The single stream case is improved a little.
    For 100,000 sequential mmap reads:

                                            user    system    cpu      total
    (1-1) plain -mm, 128KB readaround:     3.224    2.554    48.40%   11.838
    (1-2) plain -mm, 256KB readaround:     3.170    2.392    46.20%   11.976
    (2)   patched -mm, 128KB readahead:    3.117    2.448    47.33%   11.607

    The patched kernel (2) has the smallest total time, since it has no
    cache-hit overhead and less I/O block time (thanks to async readahead).
    Here the I/O size makes little difference, since there is only a single
    stream.

    Note that (1-1)'s real I/O size is 64KB and (1-2)'s real I/O size is 128KB,
    since the half of the read-around pages will be readahead cache hits.

    This is going to make _real_ differences for _concurrent_ IO streams.

    Cc: Nick Piggin
    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • This shouldn't really change behavior all that much, but the single rather
    complex function with read-ahead inside a loop etc is broken up into more
    manageable pieces.

    The behaviour is also less subtle, with the read-ahead being done up-front
    rather than inside some subtle loop and thus avoiding the now unnecessary
    extra state variables (i.e. "did_readaround" is gone).

    Fengguang: the code split in fact fixed a bug reported by Pavel Levshin:
    the PGMAJFAULT accounting used to be bypassed when MADV_RANDOM is set, in
    which case the original code will directly jump to no_cached_page reading.

    Cc: Pavel Levshin
    Cc: Nick Piggin
    Signed-off-by: Wu Fengguang
    Signed-off-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The readahead call scheme is error-prone in that it expects the call sites
    to check for async readahead after doing a sync one. I.e.

        if (!page)
                page_cache_sync_readahead();
        page = find_get_page();
        if (page && PageReadahead(page))
                page_cache_async_readahead();

    This is because PG_readahead could be set by a sync readahead for the
    _current_ newly faulted-in page, and the readahead code simply expects one
    more callback on the same page to start the async readahead. If the
    caller fails to do so, it will miss the PG_readahead bits and will never
    be able to start an async readahead.

    Eliminate this insane constraint by piggy-backing the async part into the
    current readahead window.

    Now if an async readahead should be started immediately after a sync one,
    the readahead logic itself will do it. So the following code becomes
    valid: (the 'else' in particular)

        if (!page)
                page_cache_sync_readahead();
        else if (PageReadahead(page))
                page_cache_async_readahead();

    Cc: Nick Piggin
    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Make sure interleaved readahead size is larger than request size. This
    also makes the readahead window grow up more quickly.

    Reported-by: Xu Chenfeng
    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • (hit_readahead_marker != 0) means the page at @offset is present, so we
    can search for non-present page starting from @offset+1.

    Reported-by: Xu Chenfeng
    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Just in case someone aggressively sets a huge readahead size.

    Cc: Nick Piggin
    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Impact: code simplification.

    Cc: Nick Piggin
    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • This makes the performance impact of a possible mmap_miss wrap-around
    temporary and tolerable: i.e. MMAP_LOTSAMISS=100 extra readarounds.

    Otherwise, if mmap_miss ever wraps around to negative, it takes INT_MAX
    cache misses to bring it back to the normal state. During that time, mmap
    readaround will be _enabled_ for whatever wild random workload is running.
    That's an almost permanent performance impact.
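
    One plausible shape for such a fix (an assumption based on the changelog,
    not the literal diff): keep the miss counter unsigned so a wrap lands at
    zero, from which only MMAP_LOTSAMISS further misses are needed to disable
    readaround again:

        struct file_ra_state {
                /* ... */
                unsigned int mmap_miss;  /* wraps to 0, never to a negative value */
                /* ... */
        };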

    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang