20 Oct, 2008

1 commit

  • Split the LRU lists in two, one set for pages that are backed by real file
    systems ("file") and one for pages that are backed by memory and swap
    ("anon"). The latter includes tmpfs.

    The advantage of doing this is that the VM will not have to scan over lots
    of anonymous pages (which we generally do not want to swap out), just to
    find the page cache pages that it should evict.

    This patch has the infrastructure and a basic policy to balance how much
    we scan the anon lists and how much we scan the file lists. The big
    policy changes are in separate patches.
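
    The scanning argument above can be illustrated with a toy userspace
    model. This is only a sketch: the page mix (every tenth page is
    file-backed), the list sizes and the helper name scan_for_file_pages()
    are assumptions made for illustration, not taken from the patch.

```python
def scan_for_file_pages(lru, wanted):
    """Walk an LRU from its cold end; return (pages_scanned, victims)."""
    scanned = 0
    victims = []
    for page_id, kind in lru:
        if len(victims) == wanted:
            break
        scanned += 1
        if kind == "file":
            victims.append(page_id)
    return scanned, victims


# Hypothetical workload: 1000 pages, every tenth one is file-backed.
pages = [(i, "file" if i % 10 == 9 else "anon") for i in range(1000)]

# One combined list: anon pages must be stepped over to reach file pages.
mixed_scanned, mixed_victims = scan_for_file_pages(pages, wanted=32)

# Split lists: the file list holds only page cache pages.
file_lru = [p for p in pages if p[1] == "file"]
split_scanned, split_victims = scan_for_file_pages(file_lru, wanted=32)

print(mixed_scanned, split_scanned)  # 320 32
```

    Both walks pick the same 32 victims; the split list simply avoids
    visiting the anon pages in between.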

    [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
    [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
    [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
    [hugh@veritas.com: memcg swapbacked pages active]
    [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
    [akpm@linux-foundation.org: fix /proc/vmstat units]
    [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
    [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
    [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

17 Oct, 2008

1 commit


27 Jul, 2008

1 commit

  • radix_tree_next_hole() is implemented as a series of radix_tree_lookup()s.
    So it can be called locklessly, under rcu_read_lock().
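
    A minimal userspace model of "a series of lookups" is sketched below.
    A Python set stands in for the radix tree, and each membership test
    stands in for one independent lookup; the function name and signature
    are assumptions, and nothing here models RCU itself.

```python
def next_hole(present, index, max_scan):
    """Return the first index at or after `index` that is not in `present`.

    At most `max_scan` slots are examined. If no hole is found within that
    range, the index just past the scanned range is returned.
    """
    for _ in range(max_scan):
        if index not in present:   # one independent lookup per slot
            break
        index += 1
    return index


cached = {0, 1, 2, 4}
print(next_hole(cached, 0, 10))  # 3
```

    Because every step is a separate lookup with no state carried between
    steps, no lock has to be held across the whole scan.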

    Signed-off-by: Nick Piggin
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Hugh Dickins
    Cc: "Paul E. McKenney"
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

30 Apr, 2008

1 commit

  • Provide a place in sysfs (/sys/class/bdi) for the backing_dev_info object.
    This allows us to see and set the various BDI specific variables.

    In particular this properly exposes the read-ahead window for all relevant
    users and /sys/block//queue/read_ahead_kb should be deprecated.

    With patient help from Kay Sievers and Greg KH

    [mszeredi@suse.cz]

    - split off NFS and FUSE changes into separate patches
    - document new sysfs attributes under Documentation/ABI
    - do bdi_class_init as a core_initcall, otherwise the "default" BDI
    won't be initialized
    - remove bdi_init_fmt macro, it's not used very much

    [akpm@linux-foundation.org: fix ia64 warning]
    Signed-off-by: Peter Zijlstra
    Cc: Kay Sievers
    Acked-by: Greg KH
    Cc: Trond Myklebust
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

20 Mar, 2008

1 commit

  • Fix kernel-doc notation in mm/readahead.c.

    Change ":" to ";" so that it doesn't get treated as a doc section heading.
    Move the comment block ending "*/" to a line by itself so that the text on
    that last line is not lost (dropped).

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

17 Oct, 2007

7 commits

  • provide BDI constructor/destructor hooks

    [akpm@linux-foundation.org: compile fix]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Quite a bit of code is used in maintaining these "cached pages" that are
    probably pretty unlikely to get used. It would require a narrow race where
    the page is inserted concurrently while this process is allocating a page
    in order to create the spare page. Then a multi-page write into an uncached
    part of the file, to make use of it.

    Next, the buffered write path (and others) uses its own LRU pagevec when it
    should be just using the per-CPU LRU pagevec (which will cut down on both data
    and code size cacheline footprint). Also, these private LRU pagevecs are
    emptied after just a very short time, in contrast with the per-CPU pagevecs
    that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required
    to add the pages to pagecache for a bulk write (in 4K chunks).

    [this gets rid of some cond_resched() calls in readahead.c and mpage.c due
    to clashes in -mm. What put them there, and why? ]

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Probing pages and radix_tree_tagged are lockless operations with the lockless
    radix-tree. Convert these users to RCU locking rather than using tree_lock.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Remove VM_MAX_CACHE_HIT, MAX_RA_PAGES and MIN_RA_PAGES.

    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • This is a simplified version of the pagecache context based readahead. It
    handles the case of multiple threads reading on the same fd and invalidating
    each others' readahead state. It does the trick by scanning the pagecache and
    recovering the current read stream's readahead status.

    The algorithm works in an opportunistic way, in that it does not try to detect
    interleaved reads _actively_, which requires a probe into the page cache
    (which means a little more overhead for random reads). It only tries to
    handle a previously started sequential readahead whose state was overwritten
    by another concurrent stream, and it can do this job pretty well.

    Negative and positive examples (or what you can expect from it):

    1) it cannot detect and serve perfect request-by-request interleaved reads
    right:

        time    stream 1    stream 2
        0       1
        1                   1001
        2       2
        3                   1002
        4       3
        5                   1003
        6       4
        7                   1004
        8       5
        9                   1005

    Here no single readahead will be carried out.

    2) However, if it's two concurrent reads by two threads, the chance of the
    initial sequential readahead being started is huge. Once the first sequential
    readahead is started for a stream, this patch will ensure that the readahead
    window continues to ramp up and won't be disturbed by other streams.

        time    stream 1    stream 2
        0       1
        1       2
        2                   1001
        3       3
        4                   1002
        5                   1003
        6       4
        7       5
        8                   1004
        9       6
        10                  1005
        11      7
        12                  1006
        13                  1007

    Here stream 1 will start a readahead at page 2, and stream 2 will start its
    first readahead at page 1003. From then on the two streams will be served
    right.
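
    The difference between the two examples can be checked with a small
    model. This sketch only covers the initial detection step: a single
    shared "previous page" models the per-fd state that both streams
    overwrite, and an access counts as a sequential hit when it directly
    follows that page. The pagecache scan that the patch uses to recover a
    stream afterwards is not modelled, and the function name is an assumption.

```python
def sequential_hits(trace):
    """Return the (stream, page) accesses that directly follow the last page read.

    `trace` is a list of (stream, page) tuples in time order.
    """
    hits = []
    prev = None
    for stream, page in trace:
        if prev is not None and page == prev + 1:
            hits.append((stream, page))
        prev = page  # shared state: every stream overwrites it
    return hits


example1 = [(1, 1), (2, 1001), (1, 2), (2, 1002), (1, 3),
            (2, 1003), (1, 4), (2, 1004), (1, 5), (2, 1005)]
example2 = [(1, 1), (1, 2), (2, 1001), (1, 3), (2, 1002), (2, 1003), (1, 4),
            (1, 5), (2, 1004), (1, 6), (2, 1005), (1, 7), (2, 1006), (2, 1007)]

print(sequential_hits(example1))      # []
print(sequential_hits(example2)[:2])  # [(1, 2), (2, 1003)]
```

    The first hits in the second trace are page 2 for stream 1 and page
    1003 for stream 2, which matches the description above.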

    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Combine the file_ra_state members
        unsigned long prev_index
        unsigned int prev_offset
    into
        loff_t prev_pos

    It is more consistent and better supports huge files.
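
    The arithmetic behind the combined field can be sketched as follows.
    The 4 KiB page size and the helper names are assumptions. The last
    helper models what happens if the shift is performed in a 32-bit type
    before widening, which is presumably the kind of problem a shift
    overflow fix has to guard against.

```python
PAGE_SHIFT = 12                 # assumption: 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT


def to_prev_pos(prev_index, prev_offset):
    """Fold a page index and an in-page offset into one byte position."""
    return (prev_index << PAGE_SHIFT) | prev_offset


def from_prev_pos(prev_pos):
    """Split a byte position back into (page index, in-page offset)."""
    return prev_pos >> PAGE_SHIFT, prev_pos & (PAGE_SIZE - 1)


def to_prev_pos_32bit_shift(prev_index, prev_offset):
    """Same as to_prev_pos(), but the shift result is truncated to 32 bits."""
    return ((prev_index << PAGE_SHIFT) & 0xFFFFFFFF) | prev_offset


index, offset = 0x200000, 100   # a page 8 GiB into the file
print(to_prev_pos(index, offset))              # 8589934692
print(to_prev_pos_32bit_shift(index, offset))  # 100
```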

    Thanks to Peter for the nice proposal!

    [akpm@linux-foundation.org: fix shift overflow]
    Cc: Peter Zijlstra
    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Use 'unsigned int' instead of 'unsigned long' for readahead sizes.

    This helps reduce memory consumption on 64bit CPU when a lot of files are
    opened.

    CC: Andi Kleen
    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     

10 Oct, 2007

1 commit

  • Hide everything in blkdev.h when CONFIG_BLOCK isn't set, and fix up
    the (few) files that fail to build because they were relying on blkdev.h
    pulling in extra includes for them.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

20 Jul, 2007

8 commits

  • Rename some file_ra_state variables and remove some accessors.

    It results in much simpler code.
    Kudos to Rusty!

    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Split ondemand readahead interface into two functions. I think this makes it
    a little clearer for non-readahead experts (like Rusty).

    Internally they both call ondemand_readahead(), but the page argument is
    changed to an obvious boolean flag.

    Signed-off-by: Rusty Russell
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rusty Russell
     
  • Share the same page flag bit for PG_readahead and PG_reclaim.

    One is used only on file reads, another is only for emergency writes. One
    is used mostly for fresh/young pages, another is for old pages.

    Combinations of possible interactions are:

    a) clear PG_reclaim => implicit clear of PG_readahead
       it will delay an asynchronous readahead into a synchronous one
       it actually does _good_ for readahead:
       the pages will be reclaimed soon, it's readahead thrashing!
       in this case, synchronous readahead makes more sense.

    b) clear PG_readahead => implicit clear of PG_reclaim
       one (and only one) page will not be reclaimed in time
       it can be avoided by checking PageWriteback(page) in readahead first

    c) set PG_reclaim => implicit set of PG_readahead
       will confuse readahead and make it restart the size rampup process
       it's a trivial problem, and can mostly be avoided by checking
       PageWriteback(page) first in readahead

    d) set PG_readahead => implicit set of PG_reclaim
       PG_readahead will never be set on already cached pages.
       PG_reclaim will always be cleared on dirtying a page.
       so not a problem.

    In summary,
    a) we get better behavior
    b,d) possible interactions can be avoided
    c) racy condition exists that might affect readahead, but the chance
       is _really_ low, and the hurt on readahead is trivial.

    Compound pages also use PG_reclaim, but for now they do not interact with
    reclaim/readahead code.

    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Remove the old readahead algorithm.

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • This is a minimal readahead algorithm that aims to replace the current one.
    It is more flexible and reliable, while maintaining almost the same behavior
    and performance. Also it is fully integrated with adaptive readahead.

    It is designed to be called on demand:
    - on a missing page, to do synchronous readahead
    - on a lookahead page, to do asynchronous readahead

    In this way it eliminated the awkward workarounds for cache hit/miss,
    readahead thrashing, retried read, and unaligned read. It also adopts the
    data structure introduced by adaptive readahead, parameterizes readahead
    pipelining with `lookahead_index', and reduces the current/ahead windows to
    one single window.

    HEURISTICS

    The logic deals with four cases:

    - sequential-next
      found a consistent readahead window, so push it forward

    - random
      standalone small read, so read as is

    - sequential-first
      create a new readahead window for a sequential/oversize request

    - lookahead-clueless
      hit a lookahead page not associated with the readahead window,
      so create a new readahead window and ramp it up

    In each case, three parameters are determined:

    - readahead index: where the next readahead begins
    - readahead size: how much to readahead
    - lookahead size: when to do the next readahead (for pipelining)
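
    One way to read the four cases is as a decision function over the
    readahead state. The sketch below is a simplified model, not the
    patch's code: the RaState fields, the classify() name, the order of
    the checks and the initial prev_index of -1 are all assumptions chosen
    to match the descriptions above.

```python
from dataclasses import dataclass


@dataclass
class RaState:
    start: int = 0        # first page of the current readahead window
    size: int = 0         # window length in pages (0 means no window yet)
    lookahead: int = 0    # distance from the window end to the lookahead page
    prev_index: int = -1  # last page read through this file


def classify(ra, index, req_size, max_pages, hit_marker):
    """Name the case for a request of req_size pages starting at page index."""
    # The request lands on the lookahead page or just past the window.
    if ra.size and index in (ra.start + ra.size - ra.lookahead,
                             ra.start + ra.size):
        return "sequential-next"
    sequential = 0 <= index - ra.prev_index <= 1 or req_size > max_pages
    if not hit_marker and not sequential:
        return "random"
    if hit_marker:
        return "lookahead-clueless"
    return "sequential-first"


ra = RaState(start=0, size=4, lookahead=3, prev_index=3)
print(classify(RaState(), 0, 1, 32, False))  # sequential-first
print(classify(ra, 4, 1, 32, False))         # sequential-next
print(classify(ra, 500, 1, 32, False))       # random
print(classify(ra, 500, 1, 32, True))        # lookahead-clueless
```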

    BEHAVIORS

    The old behaviors are maximally preserved for trivial sequential/random reads.
    Notable changes are:

    - It no longer imposes strict sequential checks.
      It might help some interleaved cases, and clustered random reads.
      It does introduce risks of a random lookahead hit triggering an
      unexpected readahead. But in general it is more likely to do good
      than to do evil.

    - Interleaved reads are supported in a minimal way.
      Their chances of being detected and properly handled are still low.

    - Readahead thrashings are better handled.
      The current readahead leads to tiny average I/O sizes, because it
      never turns back for the thrashed pages. They have to be faulted in
      by do_generic_mapping_read() one by one. Whereas the on-demand
      readahead will redo readahead for them.

    OVERHEADS

    The new code reduced the overheads of

    - excessively calling the readahead routine on small sized reads
    (the current readahead code insists on seeing all requests)

    - doing a lot of pointless page-cache lookups for small cached files
    (the current readahead only turns itself off after 256 cache hits,
    unfortunately most files are < 1MB, so never see that chance)

    That accounts for speedup of
    - 0.3% on 1-page sequential reads on sparse file
    - 1.2% on 1-page cache hot sequential reads
    - 3.2% on 256-page cache hot sequential reads
    - 1.3% on cache hot `tar /lib`

    However, it does introduce one extra page-cache lookup per cache miss, which
    impacts random reads slightly. That's a 1% overhead for 1-page random reads
    on a sparse file.

    PERFORMANCE

    The basic benchmark setup is
    - 2.6.20 kernel with on-demand readahead
    - 1MB max readahead size
    - 2.9GHz Intel Core 2 CPU
    - 2GB memory
    - 160G/8M Hitachi SATA II 7200 RPM disk

    The benchmarks show that
    - it maintains the same performance for trivial sequential/random reads
    - sysbench/OLTP performance on MySQL gains up to 8%
    - performance on readahead thrashing gains up to 3 times

    iozone throughput (KB/s): roughly the same
    ==========================================
    iozone -c -t1 -s 4096m -r 64k

                            2.6.20      on-demand   gain
    first run
    " Initial write "       61437.27    64521.53    +5.0%
    " Rewrite "             47893.02    48335.20    +0.9%
    " Read "                62111.84    62141.49    +0.0%
    " Re-read "             62242.66    62193.17    -0.1%
    " Reverse Read "        50031.46    49989.79    -0.1%
    " Stride read "          8657.61     8652.81    -0.1%
    " Random read "         13914.28    13898.23    -0.1%
    " Mixed workload "      19069.27    19033.32    -0.2%
    " Random write "        14849.80    14104.38    -5.0%
    " Pwrite "              62955.30    65701.57    +4.4%
    " Pread "               62209.99    62256.26    +0.1%

    second run
    " Initial write "       60810.31    66258.69    +9.0%
    " Rewrite "             49373.89    57833.66    +17.1%
    " Read "                62059.39    62251.28    +0.3%
    " Re-read "             62264.32    62256.82    -0.0%
    " Reverse Read "        49970.96    50565.72    +1.2%
    " Stride read "          8654.81     8638.45    -0.2%
    " Random read "         13901.44    13949.91    +0.3%
    " Mixed workload "      19041.32    19092.04    +0.3%
    " Random write "        14019.99    14161.72    +1.0%
    " Pwrite "              64121.67    68224.17    +6.4%
    " Pread "               62225.08    62274.28    +0.1%

    In summary, writes are unstable, reads are pretty close on average:

    access pattern          2.6.20      on-demand   gain
    Read                    62085.61    62196.38    +0.2%
    Re-read                 62253.49    62224.99    -0.0%
    Reverse Read            50001.21    50277.75    +0.6%
    Stride read              8656.21     8645.63    -0.1%
    Random read             13907.86    13924.07    +0.1%
    Mixed workload          19055.29    19062.68    +0.0%
    Pread                   62217.53    62265.27    +0.1%

    aio-stress: roughly the same
    ============================
    aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
    aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso

                    2.6.20      on-demand   delta
    sequential      92.57s      92.54s      -0.0%
    random          311.87s     312.15s     +0.1%

    sysbench fileio: roughly the same
    =================================
    sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
    --file-total-size=4G --file-block-size=64K \
    --num-threads=001 --max-requests=10000 --max-time=900 run

    threads         2.6.20      on-demand   delta
    first run
    1               59.1974s    59.2262s    +0.0%
    2               58.0575s    58.2269s    +0.3%
    4               48.0545s    47.1164s    -2.0%
    8               41.0684s    41.2229s    +0.4%
    16              35.8817s    36.4448s    +1.6%
    32              32.6614s    32.8240s    +0.5%
    64              23.7601s    24.1481s    +1.6%
    128             24.3719s    23.8225s    -2.3%
    256             23.2366s    22.0488s    -5.1%

    second run
    1               59.6720s    59.5671s    -0.2%
    8               41.5158s    41.9541s    +1.1%
    64              25.0200s    23.9634s    -4.2%
    256             22.5491s    20.9486s    -7.1%

    Note that the numbers are not very stable because of the writes.
    The overall performance is close when we sum all seconds up:

    sum all up      495.046s    491.514s    -0.7%

    sysbench oltp (trans/sec): up to 8% gain
    ========================================
    sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
    --mysql-socket=/var/run/mysqld/mysqld.sock \
    --mysql-user=root --mysql-password=readahead \
    --num-threads=064 --max-requests=10000 --max-time=900 run

    10000-transactions run
    threads         2.6.20      on-demand   gain
    1               62.81       64.56       +2.8%
    2               67.97       70.93       +4.4%
    4               81.81       85.87       +5.0%
    8               94.60       97.89       +3.5%
    16              99.07       104.68      +5.7%
    32              95.93       104.28      +8.7%
    64              96.48       103.68      +7.5%
    5000-transactions run
    1               48.21       48.65       +0.9%
    8               68.60       70.19       +2.3%
    64              70.57       74.72       +5.9%
    2000-transactions run
    1               37.57       38.04       +1.3%
    2               38.43       38.99       +1.5%
    4               45.39       46.45       +2.3%
    8               51.64       52.36       +1.4%
    16              54.39       55.18       +1.5%
    32              52.13       54.49       +4.5%
    64              54.13       54.61       +0.9%

    Those are interesting results. Some investigation shows that
    - MySQL is accessing the db file non-uniformly: some parts are
      hotter than others
    - It is mostly doing 4-page random reads, and sometimes doing two
      reads in a row, the latter of which triggers a 16-page readahead.
    - The on-demand readahead leaves many lookahead pages (flagged
      PG_readahead) there. Many of them will be hit, and trigger
      more readahead pages. Which might save more seeks.
    - Naturally, the readahead windows tend to lie in hot areas,
      and the lookahead pages in hot areas are more likely to be hit.
    - The higher the overall read density, the larger the possible gain.

    That also explains the adaptive readahead tricks for clustered random reads.

    readahead thrashing: 3 times better
    ===================================
    We boot kernel with "mem=128m single", and start a 100KB/s stream on every
    second, until reaching 200 streams.

                    max throughput    min avg I/O size
    2.6.20:         5MB/s             16KB
    on-demand:      15MB/s            140KB

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Extend struct file_ra_state to support the on-demand readahead logic. Also
    define some helpers for it.

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Define two convenient macros for read-ahead:
    - MAX_RA_PAGES: rounded down counterpart of VM_MAX_READAHEAD
    - MIN_RA_PAGES: rounded _up_ counterpart of VM_MIN_READAHEAD

    Note that the rounded up MIN_RA_PAGES will work flawlessly with _large_
    page sizes like 64k.
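
    The effect of the two rounding directions can be sketched as below.
    The 128 KiB and 16 KiB values are assumed here for illustration, and
    the function names are not the kernel macros themselves.

```python
VM_MAX_READAHEAD = 128   # KiB, value assumed for illustration
VM_MIN_READAHEAD = 16    # KiB, value assumed for illustration


def max_ra_pages(page_size):
    """Maximum readahead in pages, rounded down."""
    return (VM_MAX_READAHEAD * 1024) // page_size


def min_ra_pages(page_size):
    """Minimum readahead in pages, rounded up so it never becomes zero."""
    return (VM_MIN_READAHEAD * 1024 + page_size - 1) // page_size


for page_size in (4096, 65536):
    print(page_size, max_ra_pages(page_size), min_ra_pages(page_size))
# 4096 32 4
# 65536 2 1
```

    With 64k pages, rounding the minimum down would give 0 pages, while
    rounding up gives 1.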

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Rusty Russell
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Add look-ahead support to __do_page_cache_readahead().

    It works by
    - marking the Nth page, counted backwards from the end of the batch,
      with PG_readahead
      (which instructs the page's first reader to invoke readahead)
    - and only doing the marking for newly allocated pages
      (to prevent blindly doing readahead on already cached pages).

    Look-ahead is a technique to achieve I/O pipelining:

    While the application is working through a chunk of cached pages, the kernel
    reads ahead the next chunk of pages _before_ time of need. It effectively
    hides low level I/O latencies from high level applications.
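
    The marking rule can be modelled in userspace as follows. This is a
    sketch: the function name, the set standing in for the page cache and
    the reading of "Nth backwards" as "lookahead_size pages before the end
    of the batch" are assumptions.

```python
def plan_readahead(cached, start, nr_to_read, lookahead_size):
    """Return (pages_to_read, marked_page) for one readahead batch.

    cached      set of page indexes already in the page cache
    marked_page index that would get PG_readahead, or None
    """
    to_read = []
    marked = None
    for page_idx in range(nr_to_read):
        index = start + page_idx
        if index in cached:
            continue                 # cached pages are skipped and never marked
        to_read.append(index)
        if page_idx == nr_to_read - lookahead_size:
            marked = index           # its first reader triggers the next batch
    return to_read, marked


print(plan_readahead(set(), 100, 8, 3))   # ([100, ..., 107], 105)
print(plan_readahead({105}, 100, 8, 3)[1])  # None
```

    A reader that reaches page 105 still has three cached pages ahead of
    it while the next batch is being read, which is the pipelining effect
    described above.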

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     

08 May, 2007

2 commits

  • Rename file_ra_state.prev_page to prev_index and file_ra_state.offset to
    prev_offset. Also, the update of prev_index in do_generic_mapping_read()
    is now moved close to the update of prev_offset.

    [wfg@mail.ustc.edu.cn: fix it]
    Signed-off-by: Jan Kara
    Cc: Nick Piggin
    Cc: WU Fengguang
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Introduce ra.offset and store in it an offset where the previous read
    ended. This way we can detect whether reads are really sequential (and
    thus we should not mark the page as accessed repeatedly) or whether they
    are random and just happen to be in the same page (and the page should
    really be marked accessed again).
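
    The check can be sketched as a small predicate. The function name and
    argument layout are assumptions; only the idea (compare the new read's
    start with the point where the previous read ended) comes from the
    description above.

```python
def should_mark_accessed(prev_index, prev_offset, index, offset):
    """Decide whether a read starting at (index, offset) is a fresh access.

    A read that resumes exactly where the previous one stopped, inside the
    same page, is treated as the continuation of one sequential access.
    """
    continues_previous = index == prev_index and offset == prev_offset
    return not continues_previous


# Previous read ended at byte 1024 of page 5.
print(should_mark_accessed(5, 1024, 5, 1024))  # False: sequential continuation
print(should_mark_accessed(5, 1024, 5, 0))     # True: random read in the same page
print(should_mark_accessed(5, 1024, 6, 0))     # True: a different page
```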

    Signed-off-by: Jan Kara
    Acked-by: Nick Piggin
    Cc: WU Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

12 Feb, 2007

1 commit


11 Dec, 2006

1 commit

  • nfs's ->readpages uses read_cache_pages(). Wire it up there.

    [wfg@mail.ustc.edu.cn: account only successful nfs/fuse reads]
    Cc: Jay Lan
    Cc: Shailabh Nagar
    Cc: Balbir Singh
    Cc: Chris Sturtivant
    Cc: Tony Ernst
    Cc: Guillaume Thouvenin
    Cc: David Wright
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

09 Dec, 2006

1 commit


08 Dec, 2006

1 commit


04 Nov, 2006

1 commit

  • Currently read_pages() assumes that ->readpages() frees the passed pages.

    This patch frees the pages in read_pages() if any remain in the
    pages_list. So ->readpages() can simply ignore the remaining pages in
    pages_list.

    Signed-off-by: OGAWA Hirofumi
    Cc: Steven French
    Cc: Miklos Szeredi
    Cc: Steven Whitehouse
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     

03 Jul, 2006

1 commit


27 Jun, 2006

1 commit

  • acquired (aquired)
    contiguous (contigious)
    successful (succesful, succesfull)
    surprise (suprise)
    whether (weather)
    some other misspellings

    Signed-off-by: Andreas Mohr
    Signed-off-by: Adrian Bunk

    Andreas Mohr
     

26 Jun, 2006

2 commits

  • Put short function description for read_cache_pages() on one line as needed
    by kernel-doc.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • AOP_TRUNCATED_PAGE victims in read_pages() belong in the LRU

    Nick Piggin rightly pointed out that the introduction of AOP_TRUNCATED_PAGE
    to read_pages() was wrong to leave A_T_P victim pages in the page cache but
    not put them in the LRU. Failing to do so hid them from the VM.

    A_T_P just means that the aop method unlocked the page rather than
    performing IO. It would be very rare that the page was truncated between
    the unlock and testing A_T_P. So we leave the pages in the LRU for likely
    reuse soon rather than backing them back out of the page cache. We do this
    by matching the behaviour before the A_T_P introduction which added pages
    to the LRU regardless of what ->readpage() did.

    This doesn't include the unrelated cleanup in Nick's initial fix which
    changed read_pages() to return void to match its only caller's behaviour of
    ignoring errors.

    Signed-off-by: Nick Piggin
    Signed-off-by: Zach Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zach Brown
     

01 Apr, 2006

1 commit


23 Mar, 2006

1 commit

  • Linus points out that ext3_readdir's readahead only cuts in when
    ext3_readdir() is operating at the very start of the directory. So for large
    directories we end up performing no readahead at all and we suck.

    So take it all out and use the core VM's page_cache_readahead(). This means
    that ext3 directory reads will use all of readahead's dynamic sizing goop.

    Note that we're using the directory's filp->f_ra to hold the readahead state,
    but readahead is actually being performed against the underlying blockdev's
    address_space. Fortunately the readahead code is all set up to handle this.

    Tested with printk. It works. I was struggling to find a real workload which
    actually cared.

    (The patch also exports page_cache_readahead() to GPL modules)

    Cc: "Stephen C. Tweedie"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

22 Mar, 2006

2 commits

  • The current get_init_ra_size is not optimal across different IO
    sizes and max_readahead values. Here is a quick summary of sizes computed
    under current design and under the attached patch. All of these assume 1st
    IO at offset 0, or 1st detected sequential IO.

    32k max, 4k request

        old     new
        -----------------
        8k      8k
        16k     16k
        32k     32k

    128k max, 4k request

        old     new
        -----------------
        32k     16k
        64k     32k
        128k    64k
        128k    128k

    128k max, 32k request

        old     new
        -----------------
        32k     64k

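
    One sizing rule that reproduces the "new" column is sketched below.
    The thresholds (max/32, max/4, max/16), the growth factors, the 4 KiB
    page size and the function names are assumptions chosen to fit the
    numbers in the tables; they are not quoted from the patch.

```python
PAGE_KB = 4  # assumption: 4 KiB pages


def roundup_pow_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def init_ra_size(req_pages, max_pages):
    """Initial window: scale small requests up, cap large ones at max."""
    size = roundup_pow_of_two(req_pages)
    if size <= max_pages // 32:
        return size * 4
    if size <= max_pages // 4:
        return size * 2
    return max_pages


def next_ra_size(cur, max_pages):
    """Grow the window on each sequential hit, never beyond max."""
    grown = 4 * cur if cur < max_pages // 16 else 2 * cur
    return min(grown, max_pages)


def ramp(req_kb, max_kb):
    """Window sizes in KiB from the first readahead until max is reached."""
    max_pages = max_kb // PAGE_KB
    size = init_ra_size(req_kb // PAGE_KB, max_pages)
    sizes = [size]
    while size < max_pages:
        size = next_ra_size(size, max_pages)
        sizes.append(size)
    return [s * PAGE_KB for s in sizes]


print(ramp(4, 32))    # [8, 16, 32]
print(ramp(4, 128))   # [16, 32, 64, 128]
print(ramp(32, 128))  # [64, 128]
```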
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Pratt
     
  • If get_next_ra_size() does not grow fast enough, ->prev_page can overrun
    the ahead window. This means the caller will read the pages from
    ->ahead_start + ->ahead_size to ->prev_page synchronously.

    Signed-off-by: Oleg Nesterov
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

30 Jan, 2006

1 commit


04 Jan, 2006

1 commit

  • readpage(), prepare_write(), and commit_write() callers are updated to
    understand the special return code AOP_TRUNCATED_PAGE in the style of
    writepage() and WRITEPAGE_ACTIVATE. AOP_TRUNCATED_PAGE tells the caller that
    the callee has unlocked the page and that the operation should be tried again
    with a new page. OCFS2 uses this to detect and work around a lock inversion in
    its aop methods. There should be no change in behaviour for methods that don't
    return AOP_TRUNCATED_PAGE.
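
    The caller-side convention can be sketched as a retry loop. This is a
    userspace model only: the string constant stands in for the enum
    value, and get_page, readpage and the max_attempts bound are
    hypothetical names introduced for illustration.

```python
AOP_TRUNCATED_PAGE = "AOP_TRUNCATED_PAGE"  # symbolic stand-in for the enum value


def read_page_with_retry(get_page, readpage, max_attempts=3):
    """Call readpage() on a fresh page until it does not ask for a retry.

    Returns (result, attempts). The callee is assumed to have unlocked the
    page whenever it returns AOP_TRUNCATED_PAGE, so the caller just looks
    up a new page and tries again.
    """
    for attempt in range(1, max_attempts + 1):
        page = get_page()
        result = readpage(page)
        if result != AOP_TRUNCATED_PAGE:
            return result, attempt
    raise RuntimeError("readpage kept returning AOP_TRUNCATED_PAGE")


# A fake method that asks for one retry and then succeeds with 0.
seen = []


def fake_readpage(page):
    seen.append(page)
    return AOP_TRUNCATED_PAGE if len(seen) == 1 else 0


pages = iter(["page-a", "page-b"])
print(read_page_with_retry(lambda: next(pages), fake_readpage))  # (0, 2)
```

    Methods that never return AOP_TRUNCATED_PAGE go through the loop
    exactly once, so their behaviour is unchanged.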

    WRITEPAGE_ACTIVATE is also prepended with AOP_ for consistency and they are
    made enums so that kerneldoc can be used to document their semantics.

    Signed-off-by: Zach Brown

    Zach Brown
     

07 Nov, 2005

1 commit

  • Add a few comments surrounding the generic readahead API.

    Also convert some ulongs into pgoff_t: the identifier for PAGE_CACHE_SIZE
    offsets into pagecache.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

08 Sep, 2005

1 commit