01 Aug, 2007

3 commits

  • Fix kernel-doc warning:
    Warning(linux-2.6.23-rc1-mm1//mm/filemap.c:864): No description found for parameter 'ra'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • In badness(), the automatic variable 'points' is unsigned long. Print it
    as such.

    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • out_of_memory() may be called when an allocation is failing and the direct
    reclaim is not making any progress. This does not take into account the
    requested order of the allocation. If the request is for an order larger
    than PAGE_ALLOC_COSTLY_ORDER, it is reasonable to fail the allocation
    because the kernel makes no guarantees about those allocations succeeding.

    This false OOM situation can occur if a user is trying to grow the hugepage
    pool in a script like:

    #!/bin/bash
    REQUIRED=$1
    echo 1 > /proc/sys/vm/hugepages_treat_as_movable
    echo $REQUIRED > /proc/sys/vm/nr_hugepages
    ACTUAL=`cat /proc/sys/vm/nr_hugepages`
    while [ $REQUIRED -ne $ACTUAL ]; do
        echo Huge page pool at $ACTUAL growing to $REQUIRED
        echo $REQUIRED > /proc/sys/vm/nr_hugepages
        ACTUAL=`cat /proc/sys/vm/nr_hugepages`
        sleep 1
    done

    This is a reasonable scenario when ZONE_MOVABLE is in use but triggers OOM
    easily on 2.6.23-rc1. This patch will fail an allocation for an order above
    PAGE_ALLOC_COSTLY_ORDER instead of killing processes and retrying.

    Signed-off-by: Mel Gorman
    Acked-by: Andy Whitcroft
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
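The decision described in the commit above can be sketched roughly as follows. This is only an illustration, not the kernel's code: the enum, the helper name on_reclaim_stall(), and the value given to PAGE_ALLOC_COSTLY_ORDER are assumptions made for the example.

```c
/* Assumed value for illustration; the kernel defines its own constant. */
#define PAGE_ALLOC_COSTLY_ORDER 3

/* Hypothetical outcome type, not a kernel enum. */
enum no_progress_action {
    FAIL_ALLOCATION,     /* give up and return NULL to the caller */
    OOM_KILL_AND_RETRY   /* invoke the OOM killer, then retry */
};

/*
 * Hypothetical helper: choose what to do when direct reclaim made no
 * progress for a request of the given order (2^order contiguous pages).
 */
enum no_progress_action on_reclaim_stall(unsigned int order)
{
    if (order > PAGE_ALLOC_COSTLY_ORDER)
        return FAIL_ALLOCATION;
    return OOM_KILL_AND_RETRY;
}
```

Under these assumptions, a huge page request (order 9 for 2MB pages on 4KB base pages) would fail quietly instead of killing processes, which is the behaviour the script in the commit message relies on.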
     

31 Jul, 2007

2 commits


30 Jul, 2007

3 commits

  • Remove fs.h from mm.h. For this,
    1) Uninline vma_wants_writenotify(). It's pretty huge anyway.
    2) Add back fs.h or less bloated headers (err.h) to files that need it.

    As a result, on x86_64 allyesconfig, the number of files rebuilt because of
    fs.h dependencies drops from 3929 to 3444 (-12.3%).

    Cross-compile tested without regressions on my two usual configs and (sigh):

    alpha arm-mx1ads mips-bigsur powerpc-ebony
    alpha-allnoconfig arm-neponset mips-capcella powerpc-g5
    alpha-defconfig arm-netwinder mips-cobalt powerpc-holly
    alpha-up arm-netx mips-db1000 powerpc-iseries
    arm arm-ns9xxx mips-db1100 powerpc-linkstation
    arm-assabet arm-omap_h2_1610 mips-db1200 powerpc-lite5200
    arm-at91rm9200dk arm-onearm mips-db1500 powerpc-maple
    arm-at91rm9200ek arm-picotux200 mips-db1550 powerpc-mpc7448_hpc2
    arm-at91sam9260ek arm-pleb mips-ddb5477 powerpc-mpc8272_ads
    arm-at91sam9261ek arm-pnx4008 mips-decstation powerpc-mpc8313_rdb
    arm-at91sam9263ek arm-pxa255-idp mips-e55 powerpc-mpc832x_mds
    arm-at91sam9rlek arm-realview mips-emma2rh powerpc-mpc832x_rdb
    arm-ateb9200 arm-realview-smp mips-excite powerpc-mpc834x_itx
    arm-badge4 arm-rpc mips-fulong powerpc-mpc834x_itxgp
    arm-carmeva arm-s3c2410 mips-ip22 powerpc-mpc834x_mds
    arm-cerfcube arm-shannon mips-ip27 powerpc-mpc836x_mds
    arm-clps7500 arm-shark mips-ip32 powerpc-mpc8540_ads
    arm-collie arm-simpad mips-jazz powerpc-mpc8544_ds
    arm-corgi arm-spitz mips-jmr3927 powerpc-mpc8560_ads
    arm-csb337 arm-trizeps4 mips-malta powerpc-mpc8568mds
    arm-csb637 arm-versatile mips-mipssim powerpc-mpc85xx_cds
    arm-ebsa110 i386 mips-mpc30x powerpc-mpc8641_hpcn
    arm-edb7211 i386-allnoconfig mips-msp71xx powerpc-mpc866_ads
    arm-em_x270 i386-defconfig mips-ocelot powerpc-mpc885_ads
    arm-ep93xx i386-up mips-pb1100 powerpc-pasemi
    arm-footbridge ia64 mips-pb1500 powerpc-pmac32
    arm-fortunet ia64-allnoconfig mips-pb1550 powerpc-ppc64
    arm-h3600 ia64-bigsur mips-pnx8550-jbs powerpc-prpmc2800
    arm-h7201 ia64-defconfig mips-pnx8550-stb810 powerpc-ps3
    arm-h7202 ia64-gensparse mips-qemu powerpc-pseries
    arm-hackkit ia64-sim mips-rbhma4200 powerpc-up
    arm-integrator ia64-sn2 mips-rbhma4500 s390
    arm-iop13xx ia64-tiger mips-rm200 s390-allnoconfig
    arm-iop32x ia64-up mips-sb1250-swarm s390-defconfig
    arm-iop33x ia64-zx1 mips-sead s390-up
    arm-ixp2000 m68k mips-tb0219 sparc
    arm-ixp23xx m68k-amiga mips-tb0226 sparc-allnoconfig
    arm-ixp4xx m68k-apollo mips-tb0287 sparc-defconfig
    arm-jornada720 m68k-atari mips-workpad sparc-up
    arm-kafa m68k-bvme6000 mips-wrppmc sparc64
    arm-kb9202 m68k-hp300 mips-yosemite sparc64-allnoconfig
    arm-ks8695 m68k-mac parisc sparc64-defconfig
    arm-lart m68k-mvme147 parisc-allnoconfig sparc64-up
    arm-lpd270 m68k-mvme16x parisc-defconfig um-x86_64
    arm-lpd7a400 m68k-q40 parisc-up x86_64
    arm-lpd7a404 m68k-sun3 powerpc x86_64-allnoconfig
    arm-lubbock m68k-sun3x powerpc-cell x86_64-defconfig
    arm-lusl7200 mips powerpc-celleb x86_64-up
    arm-mainstone mips-atlas powerpc-chrp32

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Introduce CONFIG_SUSPEND representing the ability to enter system sleep
    states, such as the ACPI S3 state, and allow the user to choose SUSPEND
    and HIBERNATION independently of each other.

    Make HOTPLUG_CPU be selected automatically if SUSPEND or HIBERNATION has
    been chosen and the kernel is intended for SMP systems.

    Also, introduce CONFIG_PM_SLEEP which is automatically selected if
    CONFIG_SUSPEND or CONFIG_HIBERNATION is set and use it to select the
    code needed for both suspend and hibernation.

    The top-level power management headers and the ACPI code related to
    suspend and hibernation are modified to use the new definitions (the
    changes in drivers/acpi/sleep/main.c are, mostly, moving code to reduce
    the number of ifdefs).

    There are many other files in which CONFIG_PM can be replaced with
    CONFIG_PM_SLEEP or even with CONFIG_SUSPEND, but they can be updated in
    the future.

    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Replace CONFIG_SOFTWARE_SUSPEND with CONFIG_HIBERNATION to avoid
    confusion (among other things, with CONFIG_SUSPEND introduced in the
    next patch).

    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     

27 Jul, 2007

3 commits

  • With the introduction of kernelcore=, a configurable zone is created on
    request. In some cases, this value will be small enough that some nodes
    contain only ZONE_MOVABLE. On some NUMA configurations when this occurs,
    arch-independent zone-sizing will get the size of the memory holes within
    the node incorrect. The value of present_pages goes negative and the boot
    fails.

    This patch fixes the bug in the calculation of the size of the hole. The
    test case is to boot test a NUMA machine with a low value of kernelcore=
    before and after the patch is applied. While this bug exists in earlier
    kernels, it cannot be triggered in practice.

    This patch has been boot-tested on a variety of machines with and without
    kernelcore= set.

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • release_pages() in mm/swap.c changes page_count() to 0 without clearing
    the PageLRU flag.

    This means isolate_lru_page() can see a page with PageLRU() &&
    page_count(page)==0. This is a bug: get_page() will be called against a
    page whose count is 0.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Usually, migrate_pages(page,,) is called from a system call with mm->sem held.
    (mm here is the mm_struct which maps the migration target page.)
    This semaphore helps avoid some race conditions.

    But if we want to migrate a page from other kernel code, we have to avoid
    those races ourselves. This patch adds checks for the following race conditions.

    1. A page with page->mapping==NULL can be a target of migration. So we have
    to check page->mapping before calling try_to_unmap().

    2. anon_vma can be freed while the page is unmapped, but page->mapping remains as
    it was. We drop page->mapcount to 0. Then we cannot trust page->mapping.
    So, use rcu_read_lock() to prevent the anon_vma pointed to by page->mapping from
    being freed during migration.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

25 Jul, 2007

3 commits

  • * 'request-queue-t' of git://git.kernel.dk/linux-2.6-block:
    [BLOCK] Add request_queue_t and mark it deprecated
    [BLOCK] Get rid of request_queue_t typedef

    Linus Torvalds
     
  • Use the correct local variable when calling into the page allocator. Local
    `flags' can have __GFP_ZERO set, which causes us to pass __GFP_ZERO into the
    page allocator, possibly from illegal contexts. The page allocator will later
    do prep_zero_page()->kmap_atomic(..., KM_USER0) from irq contexts and will
    then go BUG.

    Cc: Mike Galbraith
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • dequeue_huge_page() has a serious memory leak upon hugetlb page
    allocation. The for loop continues allocating hugetlb pages out of
    all allowable zones, whereas this function is supposed to dequeue one
    and only one page.

    Fix it by breaking out of the for loop once a hugetlb page is found.

    Signed-off-by: Ken Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
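A toy model of the fix in the commit above, assuming a simplified pool with one free-page counter per zone. The struct, the constant NR_ZONES, and the function name are hypothetical stand-ins and do not mirror the real hugetlb data structures.

```c
#define NR_ZONES 4

/* Toy stand-in for the per-zone free huge page lists. */
struct toy_pool {
    int free_pages[NR_ZONES];
};

/* Take one page from the first zone that has any; return its zone or -1. */
int toy_dequeue_huge_page(struct toy_pool *pool)
{
    int found = -1;

    for (int z = 0; z < NR_ZONES; z++) {
        if (pool->free_pages[z] > 0) {
            pool->free_pages[z]--;
            found = z;
            break;  /* without this, every later zone would lose a page too */
        }
    }
    return found;
}
```

In this model, dropping the break would decrement the counter of every non-empty zone while returning only the last page taken, which is the kind of leak the commit describes.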
     

24 Jul, 2007

1 commit

  • Some of the code has been gradually transitioned to using the proper
    struct request_queue, but there's lots left. So do a full sweep of
    the kernel and get rid of this typedef and replace its uses with
    the proper type.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

23 Jul, 2007

1 commit

  • Fix following warning:
    WARNING: vmlinux.o(.text+0x188ea): Section mismatch: reference to .init.text:__alloc_bootmem_core (between 'alloc_bootmem_high_node' and 'get_gate_vma')

    alloc_bootmem_high_node() is only used from __init scope so declare it __init.
    And in addition declare the weak variant __init too.

    Signed-off-by: Sam Ravnborg
    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Sam Ravnborg
     

22 Jul, 2007

3 commits

  • The version of SLOB in -mm always scans its free list from the beginning,
    which results in small allocations and free segments clustering at the
    beginning of the list over time. This causes the average search to scan
    over a large stretch at the beginning on each allocation.

    By starting each page search where the last one left off, we evenly
    distribute the allocations and greatly shorten the average search.

    Without this patch, kernel compiles on a 1.5G machine take a large amount
    of system time for list scanning. With this patch, compiles are within a
    few seconds of performance of a SLAB kernel with no notable change in
    system time.

    Signed-off-by: Matt Mackall
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
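The search strategy described in the commit above resembles a next-fit scan. A minimal sketch, assuming a fixed array of pages with a byte count of free space each; the struct, its fields, and the function name are hypothetical and far simpler than the real SLOB lists.

```c
#include <stddef.h>

#define NR_PAGES 8

/* Toy allocator state: free bytes per page plus a resume cursor. */
struct toy_slob {
    size_t free_bytes[NR_PAGES];
    size_t cursor;               /* page where the previous search stopped */
};

/*
 * Allocate size bytes from the first page that fits, scanning circularly
 * from the cursor. Returns the page index or -1; *scanned receives the
 * number of pages examined.
 */
int toy_slob_alloc(struct toy_slob *s, size_t size, size_t *scanned)
{
    for (size_t i = 0; i < NR_PAGES; i++) {
        size_t idx = (s->cursor + i) % NR_PAGES;

        if (s->free_bytes[idx] >= size) {
            s->free_bytes[idx] -= size;
            s->cursor = idx;     /* the next search resumes here */
            *scanned = i + 1;
            return (int)idx;
        }
    }
    *scanned = NR_PAGES;
    return -1;
}
```

With the first five pages full, the first allocation examines six pages, but later ones start at the cursor and examine only one or two, instead of rescanning the exhausted prefix each time.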
     
  • Now that arch/powerpc/platforms/cell/spufs/fault.c is always built in
    the kernel there is no need to export handle_mm_fault anymore.

    Signed-off-by: Christoph Hellwig
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Trying to survive an allmodconfig on a nommu platform results in many
    screen lengths of module unhappiness. Many of the mmap related things that
    binfmt_flat hooks in to are never exported despite being global, and there
    are also missing definitions for vmalloc_32_user() and vm_insert_page().

    I've implemented vmalloc_32_user() trying to stick as close to the
    mm/vmalloc.c implementation as possible, though we don't have any need for
    VM_USERMAP, so groveling for the VMA can be skipped. vm_insert_page() has
    been stubbed for now in order to keep the build happy.

    Signed-off-by: Paul Mundt
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Mundt
     

20 Jul, 2007

21 commits

  • zone_movable_pfn is presently marked as __initdata and referenced from
    adjust_zone_range_for_zone_movable(), which in turn is referenced by
    zone_spanned_pages_in_node(). Both of these are __meminit annotated. When
    memory hotplug is enabled, this will oops on a hot-add, due to
    zone_movable_pfn having been freed.

    __meminitdata annotation gives the desired behaviour.

    This will only impact platforms that enable both memory hotplug
    and ARCH_POPULATES_NODE_MAP.

    Signed-off-by: Paul Mundt
    Acked-by: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Mundt
     
  • Slab destructors were no longer supported after Christoph's
    c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
    BUGs for both slab and slub, and slob never supported them
    either.

    This rips out support for the dtor pointer from kmem_cache_create()
    completely and fixes up every single callsite in the kernel (there were
    about 224, not including the slab allocator definitions themselves,
    or the documentation references).

    Signed-off-by: Paul Mundt

    Paul Mundt
     
  • The slab and slob allocators already did this right, but slub would call
    "get_object_page()" on the magic ZERO_SIZE_PTR, with all kinds of nasty
    end results.

    Noted by Ingo Molnar.

    Cc: Ingo Molnar
    Cc: Christoph Lameter
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • I suspect Christoph tested his code only in the NUMA configuration, for
    the combination of SLAB+non-NUMA the zero-sized kmalloc's would not work.

    Of course, this would only trigger in configurations where those zero-
    sized allocations happen (not very common), so that may explain why it
    wasn't more widely noticed.

    Seen by Andi Kleen under qemu, and there seems to be a report by
    Michael Tsirkin on it too.

    Cc: Andi Kleen
    Cc: Roland Dreier
    Cc: Michael S. Tsirkin
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • lguest does some fairly lowlevel things to support a host, which
    normal modules don't need:

    math_state_restore:
    When the guest triggers a Device Not Available fault, we need
    to be able to restore the FPU

    __put_task_struct:
    We need to hold a reference to another task for inter-guest
    I/O, and put_task_struct() is an inline function which calls
    __put_task_struct.

    access_process_vm:
    We need to access another task for inter-guest I/O.

    map_vm_area & __get_vm_area:
    We need to map the switcher shim (ie. monitor) at 0xFFC01000.

    Signed-off-by: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rusty Russell
     
  • page-writeback accounting is presently performed in the page-flags macros.
    This is inconsistent and a bit ugly and makes it awkward to implement
    per-backing_dev under-writeback page accounting.

    So move this accounting down to the callsite(s).

    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Use appropriate accessor function to set compound page destructor
    function.

    Cc: William Irwin
    Signed-off-by: Akinobu Mita
    Acked-by: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • The fix to that race in alloc_fresh_huge_page() which could give an illegal
    node ID did not need nid_lock at all: the fix was to replace static int nid
    by static int prev_nid and do the work on local int nid. nid_lock did make
    sure that racers strictly roundrobin the nodes, but that's not something we
    need to enforce strictly. Kill nid_lock.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I've noticed lots of failures of vmalloc_32 on machines where it
    shouldn't have failed unless it was doing an atomic operation.

    Looking closely, I noticed that:

    #if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
    #define GFP_VMALLOC32 GFP_DMA32
    #elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
    #define GFP_VMALLOC32 GFP_DMA
    #else
    #define GFP_VMALLOC32 GFP_KERNEL
    #endif

    This seems to be incorrect: it should always -or- in the DMA flags
    on top of GFP_KERNEL, thus this patch.

    This fixes frequent errors launching X with the nouveau DRM for example.

    Signed-off-by: Benjamin Herrenschmidt
    Cc: Andi Kleen
    Cc: Dave Airlie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
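One way to read the fix described in the commit above is that each zone flag is combined with GFP_KERNEL rather than used alone. The sketch below assumes that reading; the numeric bit values and the forced CONFIG_ defines are purely illustrative, and has_kernel_bits() is a hypothetical helper for checking the result.

```c
/* Illustrative bit values only; the real masks are defined by the kernel. */
#define GFP_DMA    0x01u
#define GFP_DMA32  0x04u
#define GFP_KERNEL 0xd0u

/* Assumed configuration for this sketch: 64-bit with ZONE_DMA32. */
#define CONFIG_64BIT 1
#define CONFIG_ZONE_DMA32 1

#if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
#define GFP_VMALLOC32 (GFP_DMA32 | GFP_KERNEL)
#elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
#define GFP_VMALLOC32 (GFP_DMA | GFP_KERNEL)
#else
#define GFP_VMALLOC32 GFP_KERNEL
#endif

/* Returns 1 when mask carries every bit of GFP_KERNEL, else 0. */
int has_kernel_bits(unsigned int mask)
{
    return (mask & GFP_KERNEL) == GFP_KERNEL;
}
```

A bare zone flag carries none of the GFP_KERNEL bits, which would be consistent with the commit's observation that vmalloc_32 failed as if it were an atomic allocation.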
     
  • Work around a possible bug in the FRV compiler.

    What appears to be happening is that gcc resolves the
    __builtin_constant_p() in kmalloc() to true, but then fails to reduce the
    therefore constant conditions in the if-statements it guards to constant
    results.

    When compiling with -O2 or -Os, one single spurious error crops up in
    cpuup_callback() in mm/slab.c. This can be avoided by making the memsize
    variable const.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • mm/hugetlb.c: In function `dequeue_huge_page':
    mm/hugetlb.c:72: warning: 'nid' might be used uninitialized in this function

    Cc: Christoph Lameter
    Cc: Adam Litke
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from
    the old mm into the new mm.

    We create the new mm before the binfmt code runs, and place the new stack at
    the very top of the address space. Once the binfmt code runs and figures out
    where the stack should be, we move it downwards.

    It is a bit peculiar in that we have one task with two mm's, one of which is
    inactive.

    [a.p.zijlstra@chello.nl: limit stack size]
    Signed-off-by: Ollie Wild
    Signed-off-by: Peter Zijlstra
    Cc:
    Cc: Hugh Dickins
    [bunk@stusta.de: unexport bprm_mm_init]
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ollie Wild
     
  • Rename some file_ra_state variables and remove some accessors.

    It results in much simpler code.
    Kudos to Rusty!

    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Split ondemand readahead interface into two functions. I think this makes it
    a little clearer for non-readahead experts (like Rusty).

    Internally they both call ondemand_readahead(), but the page argument is
    changed to an obvious boolean flag.

    Signed-off-by: Rusty Russell
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rusty Russell
     
  • Share the same page flag bit for PG_readahead and PG_reclaim.

    One is used only on file reads, the other only for emergency writes. One
    is used mostly for fresh/young pages, the other for old pages.

    Combinations of possible interactions are:

    a) clear PG_reclaim => implicit clear of PG_readahead
    it will delay an asynchronous readahead into a synchronous one
    it actually does _good_ for readahead:
    the pages will be reclaimed soon, it's readahead thrashing!
    in this case, synchronous readahead makes more sense.

    b) clear PG_readahead => implicit clear of PG_reclaim
    one(and only one) page will not be reclaimed in time
    it can be avoided by checking PageWriteback(page) in readahead first

    c) set PG_reclaim => implicit set of PG_readahead
    will confuse readahead and make it restart the size rampup process
    it's a trivial problem, and can mostly be avoided by checking
    PageWriteback(page) first in readahead

    d) set PG_readahead => implicit set of PG_reclaim
    PG_readahead will never be set on already cached pages.
    PG_reclaim will always be cleared on dirtying a page.
    so not a problem.

    In summary,
    a) we get better behavior
    b,d) possible interactions can be avoided
    c) racy condition exists that might affect readahead, but the chance
    is _really_ low, and the hurt on readahead is trivial.

    Compound pages also use PG_reclaim, but for now they do not interact with
    reclaim/readahead code.

    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
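The implicit set/clear interactions listed in the commit above follow directly from two flag names aliasing one bit. A minimal sketch, assuming a toy page structure and an arbitrary bit number; the helpers are hypothetical and are not the kernel's page-flags macros.

```c
#include <stdbool.h>

/* Illustrative bit number; the real value is defined in the kernel headers. */
#define PG_reclaim   17
#define PG_readahead PG_reclaim   /* the two flags share one bit */

/* Toy stand-in for struct page, holding only the flags word. */
struct toy_page {
    unsigned long flags;
};

void toy_set_flag(struct toy_page *p, int bit)
{
    p->flags |= 1UL << bit;
}

void toy_clear_flag(struct toy_page *p, int bit)
{
    p->flags &= ~(1UL << bit);
}

bool toy_test_flag(const struct toy_page *p, int bit)
{
    return (p->flags >> bit) & 1UL;
}
```

Setting PG_readahead makes a PG_reclaim test succeed, and clearing PG_reclaim clears PG_readahead as well, which corresponds to cases a) to d) in the commit message.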
     
  • Remove the old readahead algorithm.

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Convert filemap reads to use on-demand readahead.

    The new call scheme is to
    - call readahead on non-cached page
    - call readahead on look-ahead page
    - update prev_index when finished with the read request

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • This is a minimal readahead algorithm that aims to replace the current one.
    It is more flexible and reliable, while maintaining almost the same behavior
    and performance. Also it is fully integrated with adaptive readahead.

    It is designed to be called on demand:
    - on a missing page, to do synchronous readahead
    - on a lookahead page, to do asynchronous readahead

    In this way it eliminates the awkward workarounds for cache hit/miss,
    readahead thrashing, retried read, and unaligned read. It also adopts the
    data structure introduced by adaptive readahead, parameterizes readahead
    pipelining with `lookahead_index', and reduces the current/ahead windows to
    one single window.

    HEURISTICS

    The logic deals with four cases:

    - sequential-next
    found a consistent readahead window, so push it forward

    - random
    standalone small read, so read as is

    - sequential-first
    create a new readahead window for a sequential/oversize request

    - lookahead-clueless
    hit a lookahead page not associated with the readahead window,
    so create a new readahead window and ramp it up

    In each case, three parameters are determined:

    - readahead index: where the next readahead begins
    - readahead size: how much to readahead
    - lookahead size: when to do the next readahead (for pipelining)
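The four cases above could be told apart roughly as in the sketch below. It is a simplified illustration under stated assumptions, not the patch's actual logic: the toy_ra structure, its prev_index field, the "sequential" test, and the order of the checks are all guesses made for the example.

```c
#include <stdbool.h>

enum ra_case {
    RA_SEQUENTIAL_NEXT,
    RA_RANDOM,
    RA_SEQUENTIAL_FIRST,
    RA_LOOKAHEAD_CLUELESS
};

/* Hypothetical readahead state: current window and last page read. */
struct toy_ra {
    unsigned long start;       /* first page of the readahead window */
    unsigned long size;        /* window length in pages */
    unsigned long prev_index;  /* last page index the reader touched */
};

enum ra_case toy_classify(const struct toy_ra *ra, unsigned long offset,
                          unsigned long req_size, unsigned long max,
                          bool hit_lookahead_page)
{
    /* Unsigned wraparound makes a backwards seek count as non-sequential. */
    bool sequential = (offset - ra->prev_index <= 1UL) || req_size > max;

    if (offset != 0 && offset == ra->start + ra->size)
        return RA_SEQUENTIAL_NEXT;      /* consistent window: push forward */
    if (hit_lookahead_page)
        return RA_LOOKAHEAD_CLUELESS;   /* marker hit with no matching window */
    if (!sequential)
        return RA_RANDOM;               /* standalone small read */
    return RA_SEQUENTIAL_FIRST;         /* start a new window */
}
```

Each returned case would then determine the three parameters (readahead index, readahead size, lookahead size), which this sketch leaves out.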

    BEHAVIORS

    The old behaviors are maximally preserved for trivial sequential/random reads.
    Notable changes are:

    - It no longer imposes strict sequential checks.
    It might help some interleaved cases, and clustered random reads.
    It does introduce risks of a random lookahead hit triggering an
    unexpected readahead. But in general it is more likely to do good
    than to do evil.

    - Interleaved reads are supported in a minimal way.
    Their chances of being detected and properly handled are still low.

    - Readahead thrashings are better handled.
    The current readahead leads to tiny average I/O sizes, because it
    never turns back for the thrashed pages. They have to be faulted in
    by do_generic_mapping_read() one by one. Whereas the on-demand
    readahead will redo readahead for them.

    OVERHEADS

    The new code reduced the overheads of

    - excessively calling the readahead routine on small sized reads
    (the current readahead code insists on seeing all requests)

    - doing a lot of pointless page-cache lookups for small cached files
    (the current readahead only turns itself off after 256 cache hits,
    unfortunately most files are < 1MB, so never see that chance)

    That accounts for speedup of
    - 0.3% on 1-page sequential reads on sparse file
    - 1.2% on 1-page cache hot sequential reads
    - 3.2% on 256-page cache hot sequential reads
    - 1.3% on cache hot `tar /lib`

    However, it does introduce one extra page-cache lookup per cache miss, which
    impacts random reads slightly. That's a 1% overhead for 1-page random reads on
    a sparse file.

    PERFORMANCE

    The basic benchmark setup is
    - 2.6.20 kernel with on-demand readahead
    - 1MB max readahead size
    - 2.9GHz Intel Core 2 CPU
    - 2GB memory
    - 160G/8M Hitachi SATA II 7200 RPM disk

    The benchmarks show that
    - it maintains the same performance for trivial sequential/random reads
    - sysbench/OLTP performance on MySQL gains up to 8%
    - performance on readahead thrashing gains up to 3 times

    iozone throughput (KB/s): roughly the same
    ==========================================
    iozone -c -t1 -s 4096m -r 64k

                            2.6.20   on-demand      gain
    first run
    " Initial write "     61437.27    64521.53     +5.0%
    " Rewrite "           47893.02    48335.20     +0.9%
    " Read "              62111.84    62141.49     +0.0%
    " Re-read "           62242.66    62193.17     -0.1%
    " Reverse Read "      50031.46    49989.79     -0.1%
    " Stride read "        8657.61     8652.81     -0.1%
    " Random read "       13914.28    13898.23     -0.1%
    " Mixed workload "    19069.27    19033.32     -0.2%
    " Random write "      14849.80    14104.38     -5.0%
    " Pwrite "            62955.30    65701.57     +4.4%
    " Pread "             62209.99    62256.26     +0.1%

    second run
    " Initial write "     60810.31    66258.69     +9.0%
    " Rewrite "           49373.89    57833.66    +17.1%
    " Read "              62059.39    62251.28     +0.3%
    " Re-read "           62264.32    62256.82     -0.0%
    " Reverse Read "      49970.96    50565.72     +1.2%
    " Stride read "        8654.81     8638.45     -0.2%
    " Random read "       13901.44    13949.91     +0.3%
    " Mixed workload "    19041.32    19092.04     +0.3%
    " Random write "      14019.99    14161.72     +1.0%
    " Pwrite "            64121.67    68224.17     +6.4%
    " Pread "             62225.08    62274.28     +0.1%

    In summary, writes are unstable, reads are pretty close on average:

    access pattern      2.6.20   on-demand      gain
    Read              62085.61    62196.38     +0.2%
    Re-read           62253.49    62224.99     -0.0%
    Reverse Read      50001.21    50277.75     +0.6%
    Stride read        8656.21     8645.63     -0.1%
    Random read       13907.86    13924.07     +0.1%
    Mixed workload    19055.29    19062.68     +0.0%
    Pread             62217.53    62265.27     +0.1%

    aio-stress: roughly the same
    ============================
    aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
    aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso

                   2.6.20    on-demand    delta
    sequential     92.57s       92.54s    -0.0%
    random        311.87s      312.15s    +0.1%

    sysbench fileio: roughly the same
    =================================
    sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
    --file-total-size=4G --file-block-size=64K \
    --num-threads=001 --max-requests=10000 --max-time=900 run

    threads      2.6.20    on-demand    delta
    first run
          1    59.1974s     59.2262s    +0.0%
          2    58.0575s     58.2269s    +0.3%
          4    48.0545s     47.1164s    -2.0%
          8    41.0684s     41.2229s    +0.4%
         16    35.8817s     36.4448s    +1.6%
         32    32.6614s     32.8240s    +0.5%
         64    23.7601s     24.1481s    +1.6%
        128    24.3719s     23.8225s    -2.3%
        256    23.2366s     22.0488s    -5.1%

    second run
          1    59.6720s     59.5671s    -0.2%
          8    41.5158s     41.9541s    +1.1%
         64    25.0200s     23.9634s    -4.2%
        256    22.5491s     20.9486s    -7.1%

    Note that the numbers are not very stable because of the writes.
    The overall performance is close when we sum all seconds up:

    sum all up 495.046s     491.514s    -0.7%

    sysbench oltp (trans/sec): up to 8% gain
    ========================================
    sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
    --mysql-socket=/var/run/mysqld/mysqld.sock \
    --mysql-user=root --mysql-password=readahead \
    --num-threads=064 --max-requests=10000 --max-time=900 run

    10000-transactions run
    threads    2.6.20    on-demand     gain
          1     62.81        64.56    +2.8%
          2     67.97        70.93    +4.4%
          4     81.81        85.87    +5.0%
          8     94.60        97.89    +3.5%
         16     99.07       104.68    +5.7%
         32     95.93       104.28    +8.7%
         64     96.48       103.68    +7.5%
    5000-transactions run
          1     48.21        48.65    +0.9%
          8     68.60        70.19    +2.3%
         64     70.57        74.72    +5.9%
    2000-transactions run
          1     37.57        38.04    +1.3%
          2     38.43        38.99    +1.5%
          4     45.39        46.45    +2.3%
          8     51.64        52.36    +1.4%
         16     54.39        55.18    +1.5%
         32     52.13        54.49    +4.5%
         64     54.13        54.61    +0.9%

    These are interesting results. Some investigation shows that
    - MySQL is accessing the db file non-uniformly: some parts are
    hotter than others
    - It is mostly doing 4-page random reads, and sometimes doing two
    reads in a row, the latter of which triggers a 16-page readahead.
    - The on-demand readahead leaves many lookahead pages (flagged
    PG_readahead) there. Many of them will be hit, and trigger
    more readahead pages, which might save more seeks.
    - Naturally, the readahead windows tend to lie in hot areas,
    and the lookahead pages in hot areas are more likely to be hit.
    - The higher the overall read density, the larger the possible gain.

    That also explains the adaptive readahead tricks for clustered random reads.

    readahead thrashing: 3 times better
    ===================================
    We boot kernel with "mem=128m single", and start a 100KB/s stream on every
    second, until reaching 200 streams.

                  max throughput    min avg I/O size
    2.6.20:                5MB/s                16KB
    on-demand:            15MB/s               140KB

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Extend struct file_ra_state to support the on-demand readahead logic. Also
    define some helpers for it.

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Define two convenient macros for read-ahead:
    - MAX_RA_PAGES: rounded down counterpart of VM_MAX_READAHEAD
    - MIN_RA_PAGES: rounded _up_ counterpart of VM_MIN_READAHEAD

    Note that the rounded up MIN_RA_PAGES will work flawlessly with _large_
    page sizes like 64k.

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Rusty Russell
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Add look-ahead support to __do_page_cache_readahead().

    It works by
    - marking the Nth page from the end with PG_readahead,
    (which instructs the page's first reader to invoke readahead)
    - and only doing the marking for newly allocated pages.
    (to prevent blindly doing readahead on already cached pages)

    Look-ahead is a technique to achieve I/O pipelining:

    While the application is working through a chunk of cached pages, the kernel
    reads ahead the next chunk of pages _before_ they are needed. This effectively
    hides low-level I/O latencies from high-level applications.

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
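The marking rule in the commit above can be illustrated with a small sketch. It assumes the readahead batch is modelled as an array of "already cached" flags and that the marker goes on the page lookahead_size pages before the end of the batch; the function name and this model are hypothetical, not the real __do_page_cache_readahead() interface.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return the index (within a batch of nr_to_read pages) that would get the
 * PG_readahead marker, or -1 if no page is marked. cached[i] is true when
 * page i was already in the page cache and so is not newly allocated.
 */
int toy_mark_lookahead(const bool *cached, size_t nr_to_read,
                       size_t lookahead_size)
{
    if (lookahead_size == 0 || lookahead_size > nr_to_read)
        return -1;

    size_t idx = nr_to_read - lookahead_size;

    if (cached[idx])
        return -1;  /* never mark a page that was already cached */
    return (int)idx;
}
```

For an 8-page batch with a lookahead size of 2, page 6 would carry the marker, so the first reader to reach it can start the next readahead while two cached pages still remain to be consumed.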