17 Jun, 2009

40 commits

  • ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determine whether
    pages_min, pages_low or pages_high is used as the zone watermark when
    allocating the pages. Two branches in the allocator hotpath determine
    which watermark to use.

    This patch uses the flags as an array index into a watermark array that is
    indexed with WMARK_* defines accessed via helpers. All call sites that
    use zone->pages_* are updated to use the helpers for accessing the values
    and the array offsets for setting.
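
    For reference, a minimal sketch of the arrangement described above (the
    enum, array and helper names follow the changelog; the surrounding struct
    layout is abbreviated):

        enum zone_watermarks {
                WMARK_MIN,
                WMARK_LOW,
                WMARK_HIGH,
                NR_WMARK
        };

        struct zone {
                /* ... */
                unsigned long watermark[NR_WMARK];
                /* ... */
        };

        /* Call sites read the values via helpers instead of zone->pages_* */
        #define min_wmark_pages(z)   ((z)->watermark[WMARK_MIN])
        #define low_wmark_pages(z)   ((z)->watermark[WMARK_LOW])
        #define high_wmark_pages(z)  ((z)->watermark[WMARK_HIGH])

        /* In the allocator hot path the ALLOC_WMARK_* flag bits index the
         * array directly, removing the two branches. */
        mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];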

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • A number of sanity checks are made on each page allocation and free,
    including that the page count is zero. page_count() checks for compound
    pages and, if the page is compound, returns the count of the head page.
    However, in these paths we do not care whether the page is compound, as
    the count of each tail page should also be zero.

    This patch makes two changes to the use of page_count() in the free path.
    It converts one check of page_count() to a VM_BUG_ON() as the count should
    have been unconditionally checked earlier in the free path. It also
    avoids checking for compound pages.
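
    A sketch of the kind of check this leaves in the free path (the _count
    field name matches kernels of this era; the exact placement is an
    assumption):

        /* Every page being freed, head or tail, must have a zero reference
         * count, so the compound-aware page_count() is not needed here. */
        VM_BUG_ON(atomic_read(&page->_count) != 0);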

    [mel@csn.ul.ie: Wrote changelog]
    Signed-off-by: Mel Gorman
    Signed-off-by: Nick Piggin
    Reviewed-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • There is a zonelist cache which is used to track zones that are not in the
    allowed cpuset or found to be recently full. This is to reduce cache
    footprint on large machines. On smaller machines, it just incurs cost for
    no gain. This patch only uses the zonelist cache when there are NUMA
    nodes.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • free_page_mlock() tests and clears PG_mlocked using locked versions of the
    bit operations. If the bit is set, it disables interrupts to update
    counters, and this happens on every page free even though interrupts are
    disabled again very shortly afterwards. This is wasteful.

    This patch splits what free_page_mlock() does. The bit check is still
    made. However, the update of counters is delayed until the interrupts are
    disabled and the non-lock version for clearing the bit is used. One
    potential weirdness with this split is that the counters do not get
    updated if the bad_page() check is triggered but a system showing bad
    pages is getting screwed already.
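
    Roughly, the split looks like this (a sketch based on the description
    above, not the exact diff):

        /* Counter updates only; callers run this with interrupts already
         * disabled, so the non-atomic __ counter variants are safe. */
        static inline void free_page_mlock(struct page *page)
        {
                __dec_zone_page_state(page, NR_MLOCK);
                __count_vm_event(UNEVICTABLE_MLOCKFREED);
        }

        /* Free path: clear the bit without the lock prefix, remember the
         * result, and account for it only once IRQs are off anyway. */
        clearMlocked = __TestClearPageMlocked(page);
        ...
        local_irq_save(flags);
        if (unlikely(clearMlocked))
                free_page_mlock(page);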

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Reviewed-by: KOSAKI Motohiro
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Acked-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • get_pageblock_migratetype() is potentially called twice for every page
    free. Once, when being freed to the pcp lists and once when being freed
    back to buddy. When freeing from the pcp lists, it is known what the
    pageblock type was at the time of free so use it rather than rechecking.
    In low memory situations under memory pressure, this might skew
    anti-fragmentation slightly but the interference is minimal and decisions
    that are fragmenting memory are being made anyway.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • __rmqueue_fallback() is in the slow path but has only one call site.
    Because there is only one call-site, this function can then be inlined
    without causing text bloat. On an x86-based config, it made no difference
    as the savings were padded out by NOP instructions. Mileage varies, but
    text will either decrease in size or remain static.

    Signed-off-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • buffered_rmqueue() is in the fast path so inline it. Because it only has
    one call site, this function can then be inlined without causing text
    bloat. On an x86-based config, it made no difference as the savings were
    padded out by NOP instructions. Mileage varies, but text will either
    decrease in size or remain static.

    Signed-off-by: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Inline __rmqueue_smallest by altering flow very slightly so that there is
    only one call site. Because there is only one call-site, this function
    can then be inlined without causing text bloat. On an x86-based config,
    this patch reduces text by 16 bytes.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Allocations that specify __GFP_HIGH get the ALLOC_HIGH flag. If these
    flags are equal to each other, we can eliminate a branch.
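
    In other words, a sketch of the idea (how the equality would be enforced,
    e.g. with a build-time assertion, is left out here):

        /* Before: a test and a branch on every allocation */
        if (gfp_mask & __GFP_HIGH)
                alloc_flags |= ALLOC_HIGH;

        /* After: with ALLOC_HIGH defined equal to __GFP_HIGH, just mask it in */
        alloc_flags |= (gfp_mask & __GFP_HIGH);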

    [akpm@linux-foundation.org: Suggested the hack]
    Signed-off-by: Mel Gorman
    Reviewed-by: Pekka Enberg
    Reviewed-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Factor out the mapping between GFP and alloc_flags only once. Once
    factored out, it only needs to be calculated once but some care must be
    taken.

    [neilb@suse.de says]
    As the test:

    -    if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
    -                    && !in_interrupt()) {
    -            if (!(gfp_mask & __GFP_NOMEMALLOC)) {

    has been replaced with a slightly weaker one:

    + if (alloc_flags & ALLOC_NO_WATERMARKS) {

    Without care, this would allow recursion into the allocator via direct
    reclaim. This patch ensures we do not recurse when PF_MEMALLOC is set, but
    TIF_MEMDIE callers are now allowed to directly reclaim where they would
    have been prevented in the past.

    Signed-off-by: Peter Zijlstra
    Acked-by: Pekka Enberg
    Signed-off-by: Mel Gorman
    Cc: Neil Brown
    Cc: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • GFP mask is converted into a migratetype when deciding which pagelist to
    take a page from. However, it is happening multiple times per allocation,
    at least once per zone traversed. Calculate it once.

    Signed-off-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • get_page_from_freelist() can be called multiple times for an allocation.
    Part of this calculates the preferred_zone which is the first usable zone
    in the zonelist but the zone depends on the GFP flags specified at the
    beginning of the allocation call. This patch calculates preferred_zone
    once. It's safe to do this because if preferred_zone is NULL at the start
    of the call, no amount of direct reclaim or other actions will change the
    fact the allocation will fail.

    [akpm@linux-foundation.org: remove (void) casts]
    Signed-off-by: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • On low-memory systems, anti-fragmentation gets disabled as there is
    nothing it can do and it would just incur overhead shuffling pages between
    lists constantly. Currently the check is made in the free page fast path
    for every page. This patch moves it to a slow path. On machines with low
    memory, there will be a small amount of additional overhead as pages get
    shuffled between lists, but it should quickly settle.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The core of the page allocator is one giant function which allocates
    memory on the stack and makes calculations that may not be needed for
    every allocation. This patch breaks up the allocator path into fast and
    slow paths for clarity. Note the slow paths are still inlined but the
    entry is marked unlikely. If they were not inlined, text size would
    actually increase, as there is only one call site.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • It is possible with __GFP_THISNODE that no zones are suitable. This patch
    makes sure the check is only made once.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Callers of alloc_pages_node() can optionally specify -1 as a node to mean
    "allocate from the current node". However, a number of the callers in
    fast paths know for a fact their node is valid. To avoid a comparison and
    branch, this patch adds alloc_pages_exact_node() that only checks the nid
    with VM_BUG_ON(). Callers that know their node is valid are then
    converted.
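
    A sketch of the new helper as described (the signature mirrors
    alloc_pages_node(); the exact body is an approximation):

        static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
                                                          unsigned int order)
        {
                /* No runtime "nid == -1" fallback: the caller guarantees a
                 * valid node, checked only when VM debugging is enabled. */
                VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);

                return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
        }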

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Acked-by: Paul Mundt [for the SLOB NUMA bits]
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • No user of the allocator API should be passing in an order >= MAX_ORDER
    but we check for it on each and every allocation. Delete this check and
    make it a VM_BUG_ON check further down the call path.
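
    With the WARN_ON_ONCE substitution noted below, the relocated check in the
    slow path looks roughly like this (placement is an assumption):

        /* Only allocations that already fell out of the fast path pay for
         * this sanity check. */
        if (order >= MAX_ORDER) {
                WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
                return NULL;
        }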

    [akpm@linux-foundation.org: s/VM_BUG_ON/WARN_ON_ONCE/]
    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The start of a large patch series to clean up and optimise the page
    allocator.

    The performance improvements are in a wide range depending on the exact
    machine, but the results I've seen so far are approximately:

    kernbench:    0 to 0.12% (elapsed time), 0.49% to 3.20% (sys time)
    aim9:         -4% to 30% (for page_test and brk_test)
    tbench:       -1% to 4%
    hackbench:    -2.5% to 3.45% (mostly within the noise though)
    netperf-udp:  -1.34% to 4.06% (varies between machines a bit)
    netperf-tcp:  -0.44% to 5.22% (varies between machines a bit)

    I don't have sysbench figures at hand, but previously they were within the
    -0.5% to 2% range.

    On netperf, the client and server were bound to opposite number CPUs to
    maximise the problems with cache line bouncing of the struct pages so I
    expect different people to report different results for netperf depending
    on their exact machine and how they ran the test (different machines, same
    cpus client/server, shared cache but two threads client/server, different
    socket client/server etc).

    I also measured the vmlinux sizes for a single x86-based config with
    CONFIG_DEBUG_INFO enabled but not CONFIG_DEBUG_VM. The core of the
    .config is based on the Debian Lenny kernel config so I expect it to be
    reasonably typical.

    This patch:

    __alloc_pages_internal is the core page allocator function but essentially
    it is an alias of __alloc_pages_nodemask. Naming a publicly available and
    exported function "internal" is also a big ugly. This patch renames
    __alloc_pages_internal() to __alloc_pages_nodemask() and deletes the old
    nodemask function.

    Warning - This patch renames an exported symbol. No in-tree kernel driver
    is affected; external drivers calling __alloc_pages_internal() should
    change the call to __alloc_pages_nodemask() without any alteration of
    parameters.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • On an x86_64 with 4GB ram, tcp_init()'s call to alloc_large_system_hash(),
    to allocate tcp_hashinfo.ehash, is now triggering an mmotm WARN_ON_ONCE on
    order >= MAX_ORDER - it's hoping for order 11. alloc_large_system_hash()
    had better make its own check on the order.

    Signed-off-by: Hugh Dickins
    Cc: David Miller
    Cc: Mel Gorman
    Cc: Eric Dumazet
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Fix allocating page cache/slab object on the unallowed node when memory
    spread is set by updating tasks' mems_allowed after its cpuset's mems is
    changed.

    In order to update tasks' mems_allowed in time, we must modify the memory
    policy code, because the memory policy was originally applied in the
    process's own context. After this patch, one task directly manipulates
    another's mems_allowed, and we use alloc_lock in the task_struct to
    protect the task's mems_allowed and memory policy.

    In the fast path, however, we don't take a lock to protect them, because
    adding one may lead to a performance regression. Without a lock, though,
    a task might see an empty nodemask while its cpuset's mems_allowed is
    being changed to some non-overlapping set. To avoid this, we first set
    all newly allowed nodes, then clear the newly disallowed ones.
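
    The grow-then-shrink update, sketched (illustrative, not the exact diff):

        /*
         * Lockless readers in the fast path must never observe an empty
         * mems_allowed, so grow the mask first, then install the final one.
         */
        nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
        /* ... rebind the task's memory policy here ... */
        tsk->mems_allowed = *newmems;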

    [lee.schermerhorn@hp.com:
    The rework of mpol_new() to extract the adjusting of the node mask to
    apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
    with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
    allocation. Fix this by adding the check for MPOL_PREFERRED and an empty
    node mask to mpol_new_mempolicy().

    Remove the now unneeded 'nodes = NULL' from mpol_new().

    Note that mpol_new_mempolicy() is always called with a non-NULL
    'nodes' parameter now that it has been removed from mpol_new().
    Therefore, we don't need to test nodes for NULL before testing it for
    'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to
    verify this assumption.]
    [lee.schermerhorn@hp.com:

    I don't think the function name 'mpol_new_mempolicy' is descriptive
    enough to differentiate it from mpol_new().

    This function applies cpuset set context, usually constraining nodes
    to those allowed by the cpuset. However, when the 'RELATIVE_NODES' flag
    is set, it also translates the nodes. So I settled on
    'mpol_set_nodemask()', because the comment block for mpol_new() mentions
    that we need to call this function to "set nodes".

    Some additional minor line length, whitespace and typo cleanup.]
    Signed-off-by: Miao Xie
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Paul Menage
    Cc: Nick Piggin
    Cc: Yasunori Goto
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • Fix the bug that the kernel didn't spread page cache/slab object evenly
    over all the allowed nodes when spread flags were set by updating tasks'
    page/slab spread flags in time.

    Signed-off-by: Miao Xie
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Paul Menage
    Cc: Nick Piggin
    Cc: Yasunori Goto
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • The kernel still allocates the page cache on the old nodes after its
    cpuset's mems is modified when 'memory_spread_page' is set, or it doesn't
    spread the page cache evenly over all the nodes that the faulting task is
    allowed to use after memory_spread_page is set. This is caused by the
    task's stale mems_allowed and flags: the current kernel doesn't update
    them unless some function invokes cpuset_update_task_memory_state(), which
    is sometimes too late. We must update the mems_allowed and the flags of
    the tasks in time.

    Slab has the same problem.

    The following patches fix this bug by updating tasks' mems_allowed and
    spread flags after their cpuset's mems or spread flag is changed.

    This patch:

    Extract a function from cpuset_update_task_memory_state(). It will be
    used later to update tasks' page/slab spread flags after their cpuset's
    flag is changed.

    Signed-off-by: Miao Xie
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Paul Menage
    Cc: Nick Piggin
    Cc: Yasunori Goto
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • get_dirty_limits() calls clip_bdi_dirty_limit() and task_dirty_limit()
    with variable pbdi_dirty as one of the arguments. This variable is an
    unsigned long * but both functions expect it to be a long *. This causes
    the following sparse warnings:

    warning: incorrect type in argument 3 (different signedness)
    expected long *pbdi_dirty
    got unsigned long *pbdi_dirty
    warning: incorrect type in argument 2 (different signedness)
    expected long *pdirty
    got unsigned long *pbdi_dirty

    Fix the warnings by changing the long * to unsigned long * in both
    functions.

    Signed-off-by: H Hartley Sweeten
    Cc: Johannes Weiner
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     
  • Commit 33c120ed2843090e2bd316de1588b8bf8b96cbde ("more aggressively use
    lumpy reclaim") increased how aggressive lumpy reclaim was by isolating
    both active and inactive pages for asynchronous lumpy reclaim on
    costly-high-order pages and for cheap-high-order when memory pressure is
    high. However, if the system is under heavy pressure and there are dirty
    pages, asynchronous IO may not be sufficient to reclaim a suitable page in
    time.

    This patch causes the caller to enter synchronous lumpy reclaim for
    costly-high-order pages and for cheap-high-order pages when under memory
    pressure.

    Minchan.kim@gmail.com said:

    Andy added synchronous lumpy reclaim with
    c661b078fd62abe06fd11fab4ac5e4eeafe26b6d. At that time, lumpy reclaim was
    not aggressive. His intention was just for high-order users (above
    PAGE_ALLOC_COSTLY_ORDER).

    After some time, Rik added aggressive lumpy reclaim with
    33c120ed2843090e2bd316de1588b8bf8b96cbde. His intention was to do lumpy
    reclaim when high-order users have trouble getting a small set of
    contiguous pages.

    So we also have to add synchronous pageout for a small set of contiguous
    pages.

    Cc: Lee Schermerhorn
    Cc: Andy Whitcroft
    Acked-by: Peter Zijlstra
    Cc: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Reviewed-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Move more documentation for get_user_pages_fast into the new kerneldoc comment.
    Add some comments for get_user_pages as well.

    Also, move get_user_pages_fast declaration up to get_user_pages. It wasn't
    there initially because it was once a static inline function.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Nick Piggin
    Cc: Andy Grover
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Now that we do readahead for sequential mmap reads, here is a simple
    evaluation of the impacts, and one further optimization.

    It's an NFS-root debian desktop system, readahead size = 60 pages.
    The numbers are grabbed after a fresh boot into console.

    approach   pgmajfault   RA miss ratio   mmap IO count   avg IO size(pages)
       A           383           31.6%            383               11
       B           225           32.4%            390               11
       C           224           32.6%            307               13

    case A: mmap sync/async readahead disabled
    case B: mmap sync/async readahead enabled, with enforced full async readahead size
    case C: mmap sync/async readahead enabled, with enforced full sync/async readahead size
    or:
    A = vanilla 2.6.30-rc1
    B = A plus mmap readahead
    C = B plus this patch

    The numbers show that
    - there are good possibilities for random mmap reads to trigger readahead
    - 'pgmajfault' is reduced by 1/3, due to the _async_ nature of readahead
    - case C can further reduce IO count by 1/4
    - readahead miss ratios are not quite affected

    The theory is
    - readahead is _good_ for clustered random reads, and can perform
      _better_ than readaround because it could be _async_;
    - async readahead size is guaranteed to be larger than readaround
      size, and it is _async_, hence will mostly behave better.
    However, for B
    - sync readahead size could be smaller than readaround size, hence may
      make things worse by producing more, smaller IOs,
    which will be fixed by this patch.

    Final conclusion:
    - mmap readahead reduced major faults by 1/3 and no obvious overheads;
    - mmap io can be further reduced by 1/4 with this patch.

    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Introduce page cache context based readahead algorithm.
    This is to better support concurrent read streams in general.

    RATIONALE
    ---------
    The current readahead algorithm detects interleaved reads in a _passive_
    way. Given a sequence of interleaved streams 1,1001,2,1002,3,4,1003,5,1004,1005,6,...,
    by checking for (offset == prev_offset + 1) it will discover the
    sequentialness between 3,4 and between 1004,1005, and start doing
    sequential readahead for the individual streams from page 4 and page 1005
    onwards.

    The context readahead algorithm guarantees to discover the sequentialness
    no matter how the streams are interleaved. For the above example, it will
    start sequential readahead from page 2 and page 1002.

    The trick is to poke for page @offset-1 in the page cache when it has no
    other clues on the sequentialness of request @offset: if the current
    request belongs to a sequential stream, that stream must have accessed
    page @offset-1 recently, and the page will still be cached now. So if page
    @offset-1 is there, we can take request @offset as a sequential access.
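
    A minimal sketch of that probe (illustrative only, not the kernel's actual
    helper):

        /*
         * If the page just before @offset is still in the page cache, the
         * request at @offset most likely belongs to a sequential stream.
         */
        static int probe_sequential(struct address_space *mapping, pgoff_t offset)
        {
                struct page *page;

                if (!offset)
                        return 0;

                page = find_get_page(mapping, offset - 1);
                if (!page)
                        return 0;
                page_cache_release(page);  /* drop find_get_page()'s reference */
                return 1;
        }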

    BENEFICIARIES
    -------------
    - strictly interleaved reads i.e. 1,1001,2,1002,3,1003,...
    the current readahead will take them as silly random reads;
    the context readahead will take them as two sequential streams.

    - cooperative IO processes i.e. NFS and SCST
    They create a thread pool, farming off (sequential) IO requests to different
    threads which will be performing interleaved IO.

    It was not easy (or possible) to reliably tell from file->f_ra all those
    cooperative processes working on the same sequential stream, since they
    will have different file->f_ra instances. And NFSD's file->f_ra is
    particularly unusable, since its file objects are dynamically created for
    each request. The nfsd does have code trying to restore the f_ra bits, but
    it is not satisfactory.

    The new scheme is to detect the sequential pattern via looking up the page
    cache, which provides one single and consistent view of the pages recently
    accessed. That makes sequential detection for cooperative processes possible.

    USER REPORT
    -----------
    Vladislav recommends the addition of context readahead as a result of his SCST
    benchmarks. It leads to 6%~40% performance gains in various cases and achieves
    equal performance in others. http://lkml.org/lkml/2009/3/19/239

    OVERHEADS
    ---------
    In theory, it introduces one extra page cache lookup per random read. However
    the below benchmark shows context readahead to be slightly faster, wondering..

    Randomly reading 200MB amount of data on a sparse file, repeat 20 times for
    each block size. The average throughputs are:

                         original ra     context ra      gain
     4K random reads:    65.561MB/s      65.648MB/s      +0.1%
    16K random reads:   124.767MB/s     124.951MB/s      +0.1%
    64K random reads:   162.123MB/s     162.278MB/s      +0.1%

    Cc: Jens Axboe
    Cc: Jeff Moyer
    Tested-by: Vladislav Bolkhovitin
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Split all readahead cases, and move the random one to bottom.

    No behavior changes.

    This is to prepare for the introduction of context readahead, and make it
    easy for inserting accounting/tracing points for each case.

    Signed-off-by: Wu Fengguang
    Cc: Vladislav Bolkhovitin
    Cc: Jens Axboe
    Cc: Jeff Moyer
    Cc: Nick Piggin
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • The counterpart of radix_tree_next_hole(). To be used by context readahead.

    Signed-off-by: Wu Fengguang
    Cc: Vladislav Bolkhovitin
    Cc: Jens Axboe
    Cc: Jeff Moyer
    Cc: Nick Piggin
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Mmap read-around now shares the same code style and data structure with
    readahead code.

    This also removes do_page_cache_readahead(). Its last user, mmap
    read-around, has been changed to call ra_submit().

    The no-readahead-if-congested logic is dropped along the way. Users will
    be pretty sensitive about the slow loading of executables, so it's
    unfavorable to disable mmap read-around on a congested queue.

    [akpm@linux-foundation.org: coding-style fixes]
    Cc: Nick Piggin
    Signed-off-by: Fengguang Wu
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • We need this in one particular case and two more general ones.

    Now we do async readahead for sequential mmap reads, and do it with the
    help of PG_readahead. For normal reads, PG_readahead is the sufficient
    condition to do a sequential readahead. But unfortunately, for mmap
    reads, there is a tiny nuisance:

    [11736.998347] readahead-init0(process: sh/23926, file: sda1/w3m, offset=0:4503599627370495, ra=0+4-3) = 4
    [11737.014985] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=290+32-0) = 17
    [11737.019488] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=118+32-0) = 32
    [11737.024921] readahead-interleaved(process: w3m/23926, file: sda1/w3m, offset=0:2, ra=4+6-6) = 6
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~

    An unfavorably small readahead. The original dumb read-around size could
    be more efficient.

    That happened because ld-linux.so does a read(832) in L1 before mmap(),
    which triggers a 4-page readahead, with the second page tagged
    PG_readahead.

    L0: open("/lib/libc.so.6", O_RDONLY) = 3
    L1: read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\342"..., 832) = 832
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    L2: fstat(3, {st_mode=S_IFREG|0755, st_size=1420624, ...}) = 0
    L3: mmap(NULL, 3527256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fac6e51d000
    L4: mprotect(0x7fac6e671000, 2097152, PROT_NONE) = 0
    L5: mmap(0x7fac6e871000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x154000) = 0x7fac6e871000
    L6: mmap(0x7fac6e876000, 16984, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fac6e876000
    L7: close(3) = 0

    In general, the PG_readahead flag will also be hit in these cases:

    - sequential reads

    - clustered random reads

    A full readahead size is desirable in both cases.

    Cc: Nick Piggin
    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Auto-detect sequential mmap reads and do readahead for them.

    The sequential mmap readahead will be triggered when
    - sync readahead: it's a major fault and (prev_offset == offset-1);
    - async readahead: minor fault on PG_readahead page with valid readahead state.
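
    A sketch of the two triggers in a filemap fault handler (illustrative; the
    surrounding code and exact arguments are assumptions):

        if (!page) {
                /* major fault: sync readahead if the access looks sequential */
                if (ra->prev_pos >> PAGE_CACHE_SHIFT == offset - 1)
                        page_cache_sync_readahead(mapping, ra, file,
                                                  offset, ra->ra_pages);
        } else if (PageReadahead(page)) {
                /* minor fault on a PG_readahead page: async readahead */
                page_cache_async_readahead(mapping, ra, file, page,
                                           offset, ra->ra_pages);
        }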

    The benefits of doing readahead instead of read-around:
    - less I/O wait thanks to async readahead
    - double real I/O size and no more cache hits

    The single stream case is improved a little.
    For 100,000 sequential mmap reads:

                                            user    system    cpu      total
    (1-1) plain -mm, 128KB readaround:     3.224    2.554    48.40%   11.838
    (1-2) plain -mm, 256KB readaround:     3.170    2.392    46.20%   11.976
    (2)   patched -mm, 128KB readahead:    3.117    2.448    47.33%   11.607

    The patched kernel (2) has the smallest total time, since it has no
    cache-hit overhead and less I/O block time (thanks to async readahead).
    Here the I/O size makes little difference, since there is only a single
    stream.

    Note that (1-1)'s real I/O size is 64KB and (1-2)'s real I/O size is 128KB,
    since the half of the read-around pages will be readahead cache hits.

    This is going to make _real_ differences for _concurrent_ IO streams.

    Cc: Nick Piggin
    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • This shouldn't really change behavior all that much, but the single rather
    complex function with read-ahead inside a loop etc is broken up into more
    manageable pieces.

    The behaviour is also less subtle, with the read-ahead being done up-front
    rather than inside some subtle loop and thus avoiding the now unnecessary
    extra state variables (i.e. "did_readaround" is gone).

    Fengguang: the code split in fact fixed a bug reported by Pavel Levshin:
    the PGMAJFAULT accounting used to be bypassed when MADV_RANDOM is set, in
    which case the original code will directly jump to no_cached_page reading.

    Cc: Pavel Levshin
    Cc: Nick Piggin
    Signed-off-by: Wu Fengguang
    Signed-off-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The readahead call scheme is error-prone in that it expects the call sites
    to check for async readahead after doing a sync one. I.e.

        if (!page)
                page_cache_sync_readahead();
        page = find_get_page();
        if (page && PageReadahead(page))
                page_cache_async_readahead();

    This is because PG_readahead could be set by a sync readahead for the
    _current_ newly faulted-in page, and the readahead code simply expects one
    more callback on the same page to start the async readahead. If the
    caller fails to do so, it will miss the PG_readahead bits and will never
    be able to start an async readahead.

    Eliminate this insane constraint by piggy-backing the async part into the
    current readahead window.

    Now if an async readahead should be started immediately after a sync one,
    the readahead logic itself will do it. So the following code becomes
    valid: (the 'else' in particular)

        if (!page)
                page_cache_sync_readahead();
        else if (PageReadahead(page))
                page_cache_async_readahead();

    Cc: Nick Piggin
    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Make sure interleaved readahead size is larger than request size. This
    also makes the readahead window grow up more quickly.

    Reported-by: Xu Chenfeng
    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • (hit_readahead_marker != 0) means the page at @offset is present, so we
    can search for non-present page starting from @offset+1.

    Reported-by: Xu Chenfeng
    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Just in case someone aggressively sets a huge readahead size.

    Cc: Nick Piggin
    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Impact: code simplification.

    Cc: Nick Piggin
    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • This makes the performance impact of a possible mmap_miss wrap-around
    temporary and tolerable: i.e. MMAP_LOTSAMISS=100 extra readarounds.

    Otherwise, if mmap_miss ever wraps around to negative, it takes INT_MAX
    cache misses to bring it back to the normal state. During that time, mmap
    readaround will be _enabled_ for whatever wild random workload is running.
    That's an almost permanent performance impact.
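
    One plausible shape for such a fix (an assumption based on the changelog,
    not the literal diff): keep the miss counter unsigned so a wrap lands at
    zero, from which only MMAP_LOTSAMISS further misses are needed to disable
    readaround again:

        struct file_ra_state {
                /* ... */
                unsigned int mmap_miss;  /* wraps to 0, never to a negative value */
                /* ... */
        };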

    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang