13 Mar, 2008

1 commit

  • is_vmalloc_addr() was introduced in commit-id 9e2779fa281cfda13ac060753d674bbcaa23367e and
    ifdef'ed out for nommu in 8ca3ed87db062201e1fa15b64a9214e193fc3a8a; both
    approaches end up breaking the nommu build in different ways. An
    impressive feat for a 2-liner.

    Current is_vmalloc_addr() users fall into two camps:

    - Determining whether to use vfree()/kfree()
    - Whether to do vmlist traversal (only /proc/kcore).

    Since we don't support /proc/kcore on nommu, that leaves the
    vfree()/kfree() determination use cases. nommu vfree() happens to be a
    wrapper around kfree() anyway, so is_vmalloc_addr() can always return 0
    and end up with the right behaviour.
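
    As a rough sketch (not quoted from either patch), the resulting helper in
    mm.h reduces to the MMU range check with a constant-0 fallback for !MMU:

    static inline int is_vmalloc_addr(const void *x)
    {
    #ifdef CONFIG_MMU
            unsigned long addr = (unsigned long)x;

            /* MMU kernels: vmalloc memory lives in [VMALLOC_START, VMALLOC_END) */
            return addr >= VMALLOC_START && addr < VMALLOC_END;
    #else
            /* nommu: vfree() is kfree(), so the answer can always be "no" */
            return 0;
    #endif
    }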

    Signed-off-by: Paul Mundt
    Signed-off-by: Linus Torvalds

    Paul Mundt
     

24 Feb, 2008

1 commit

  • Make is_vmalloc_addr() contingent on CONFIG_MMU=y, as it won't compile
    in !MMU mode.

    [ Bug introduced in commit 9e2779fa281cfda13ac060753d674bbcaa23367e:
    "is_vmalloc_addr(): Check if an address is within the vmalloc
    boundaries" ].

    Signed-off-by: David Howells
    Cc: Greg Ungerer
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

21 Feb, 2008

1 commit

  • Make hibernation work with CONFIG_DEBUG_PAGEALLOC set on x86, by
    checking if the pages to be copied are marked as present in the
    kernel mapping and temporarily marking them as present if that's not
    the case. No functional modifications are introduced if
    CONFIG_DEBUG_PAGEALLOC is unset.
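
    A hedged sketch of the idea (helper names as used in the swsusp code of
    that era; the copy routine is only illustrative):

    static void safe_copy_page(void *dst, struct page *s_page)
    {
            if (kernel_page_present(s_page)) {
                    copy_page(dst, page_address(s_page));
            } else {
                    /* DEBUG_PAGEALLOC unmapped it: map, copy, unmap again */
                    kernel_map_pages(s_page, 1, 1);
                    copy_page(dst, page_address(s_page));
                    kernel_map_pages(s_page, 1, 0);
            }
    }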

    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Len Brown

    Rafael J. Wysocki
     

14 Feb, 2008

1 commit


09 Feb, 2008

1 commit

  • Background: I've implemented 1K/2K page tables for s390. These sub-page
    page tables are required to properly support the s390 virtualization
    instruction with KVM. The SIE instruction requires that the page tables
    have 256 page table entries (pte) followed by 256 page status table entries
    (pgste). The pgstes are only required if the process is using the SIE
    instruction. The pgstes are updated by the hardware and by the hypervisor
    for a number of reasons, one of them is dirty and reference bit tracking.
    To avoid wasting memory the standard pte table allocation should return
    1K/2K (31/64 bit) and 2K/4K if the process is using SIE.

    Problem: Page size on s390 is 4K, page table size is 1K or 2K. That means
    the s390 version for pte_alloc_one cannot return a pointer to a struct
    page. Trouble is that with the CONFIG_HIGHPTE feature on x86 pte_alloc_one
    cannot return a pointer to a pte either, since that would require more than
    32 bits for the return value of pte_alloc_one (and the pte * would not be
    accessible since it's not kmapped).

    Solution: The only solution I found to this dilemma is a new typedef: a
    pgtable_t. For s390 pgtable_t will be a (pte *) - to be introduced with a
    later patch. For everybody else it will be a (struct page *). The
    additional problem with the initialization of the ptl lock and the
    NR_PAGETABLE accounting is solved with a constructor pgtable_page_ctor and
    a destructor pgtable_page_dtor. The page table allocation and free
    functions need to call these two whenever a page table page is allocated or
    freed. pmd_populate will get a pgtable_t instead of a struct page pointer.
    To get the pgtable_t back from a pmd entry that has been installed with
    pmd_populate a new function pmd_pgtable is added. It replaces the pmd_page
    call in free_pte_range and apply_to_pte_range.
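
    A sketch of what an architecture's allocation path looks like after this
    change, for the common struct-page-based pgtable_t (details hedged, not
    taken verbatim from any one arch):

    pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address)
    {
            struct page *pte;

            pte = alloc_page(GFP_KERNEL | __GFP_ZERO);
            if (pte)
                    pgtable_page_ctor(pte); /* ptl init + NR_PAGETABLE accounting */
            return pte;
    }

    void pte_free(struct mm_struct *mm, pgtable_t pte)
    {
            pgtable_page_dtor(pte);         /* undo the ctor before freeing */
            __free_page(pte);
    }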

    Signed-off-by: Martin Schwidefsky
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     

06 Feb, 2008

5 commits

  • Introduce a general page table walker
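
    Roughly, the interface is a callback set plus one entry point (signatures
    hedged from memory of the original version):

    struct mm_walk {
            int (*pgd_entry)(pgd_t *, unsigned long, unsigned long, void *);
            int (*pud_entry)(pud_t *, unsigned long, unsigned long, void *);
            int (*pmd_entry)(pmd_t *, unsigned long, unsigned long, void *);
            int (*pte_entry)(pte_t *, unsigned long, unsigned long, void *);
            int (*pte_hole)(unsigned long, unsigned long, void *);
    };

    int walk_page_range(const struct mm_struct *mm, unsigned long addr,
                        unsigned long end, const struct mm_walk *walk,
                        void *private);

    A caller fills in only the callbacks it cares about (e.g. pte_entry for a
    /proc/pid/pagemap-style walker) and leaves the rest NULL.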

    Signed-off-by: Matt Mackall
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • Both slab defrag and the large blocksize patches need the ability to take
    refcounts on compound pages. May be useful in other places as well.
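
    A hedged sketch of the mechanism: a reference taken on a tail page is
    redirected to the compound head, pinning the whole compound page (helper
    and field names as I recall them from this era):

    static inline struct page *compound_head(struct page *page)
    {
            if (unlikely(PageTail(page)))
                    return page->first_page;        /* tail pages point at the head */
            return page;
    }

    static inline void get_page(struct page *page)
    {
            page = compound_head(page);
            VM_BUG_ON(atomic_read(&page->_count) == 0);
            atomic_inc(&page->_count);
    }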

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Checking if an address is a vmalloc address is done in a couple of places.
    Define a common version in mm.h and replace the other checks.

    Again the include structures suck. The definition of VMALLOC_START and
    VMALLOC_END is not available in vmalloc.h since highmem.h cannot be included
    there.
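
    The common helper is essentially a range check; a sketch matching the
    description above (not quoted from the patch):

    static inline int is_vmalloc_addr(const void *x)
    {
            unsigned long addr = (unsigned long)x;

            return addr >= VMALLOC_START && addr < VMALLOC_END;
    }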

    Signed-off-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Make vmalloc functions work the same way as kfree() and friends that
    take a const void * argument.
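
    The affected prototypes end up taking const pointers, along the lines of
    (a sketch, not an exhaustive list):

    extern void vfree(const void *addr);
    extern void vunmap(const void *addr);
    extern struct page *vmalloc_to_page(const void *addr);
    extern unsigned long vmalloc_to_pfn(const void *addr);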

    [akpm@linux-foundation.org: fix consts, coding-style]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We already have page table manipulation for vmalloc in vmalloc.c. Move the
    vmalloc_to_page() function there as well.

    Move the definitions for vmalloc related functions in mm.h to a newly created
    section. A better place would be vmalloc.h but mm.h is basic and may depend
    on these functions. An alternative would be to include vmalloc.h in mm.h
    (like done for vmstat.h).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

30 Jan, 2008

2 commits

  • Get more testing of the c_p_a() code done by not turning off
    PSE on DEBUG_PAGEALLOC.

    This simplifies the early pagetable setup code, and tests
    the largepage-splitup code quite heavily.

    In the end, all the largepages will be split up pretty quickly,
    so there's no difference to how DEBUG_PAGEALLOC worked before.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner

    Ingo Molnar
     
  • Segfault messages now look like:

    hal-resmgr[13791]: segfault at 3c rip 2b9c8caec182 rsp 7fff1e825d30 error 4 in libacl.so.1.1.0[2b9c8caea000+6000]

    This makes it easier to pinpoint bugs to specific libraries.

    And printing the offset into a mapping also always allows finding the
    correct fault point in a library, even with randomized mappings. Previously
    there was no way to actually find the correct code address inside
    the randomized mapping.

    Relies on earlier patch to shorten the printk formats.

    They are often now longer than 80 characters, but I think that's worth it.

    [includes fix from Eric Dumazet to check d_path error value]

    Signed-off-by: Andi Kleen
    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner

    Andi Kleen
     

25 Jan, 2008

1 commit


05 Dec, 2007

1 commit

  • If mmap_min_addr is set and a process attempts to mmap (not fixed) with a
    non-null hint address less than mmap_min_addr the mapping will fail the
    security checks. Since this is just a hint address this patch will round
    such a hint address above mmap_min_addr.

    gcj was found to try to be very frugal with vm usage and give hint addresses
    in the 8k-32k range. Without this patch all such programs failed and with
    the patch they happily get a higher address.

    This patch is wrapped in CONFIG_SECURITY since mmap_min_addr doesn't exist
    without it and there would be no security check possible no matter what. So
    we should not bother compiling in this rounding if it is just a waste of
    time.
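
    A sketch of the rounding, with the helper name as I recall it from the
    patch (treat the details as approximate):

    #ifdef CONFIG_SECURITY
    static inline unsigned long round_hint_to_min(unsigned long hint)
    {
            hint &= PAGE_MASK;
            if (hint != 0 && hint < mmap_min_addr)
                    /* non-NULL hint below the floor: bump it up, don't fail */
                    return PAGE_ALIGN(mmap_min_addr);
            return hint;
    }
    #else
    static inline unsigned long round_hint_to_min(unsigned long hint)
    {
            return hint;
    }
    #endif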

    Signed-off-by: Eric Paris
    Signed-off-by: James Morris

    Eric Paris
     

17 Oct, 2007

9 commits

  • mm.h doesn't directly use anything from mutex.h and backing-dev.h, so
    remove them and add them back to the files which need them.

    Cross-compile tested on many configs and archs.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • This patch makes three needlessly global functions static.

    Signed-off-by: Adrian Bunk
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • This patch avoids a panic when memory hot-add is executed with
    sparsemem-vmemmap. The current vmemmap-sparsemem code doesn't support memory
    hot-add; the vmemmap must be populated on hot-add. This is for
    2.6.23-rc2-mm2.

    Todo: # Even if this patch is applied, the message "[xxxx-xxxx] potential
    offnode page_structs" is displayed. To allocate the memmap on its own node,
    the memmap (and pgdat) must already be initialized -- a chicken-and-egg
    relationship.

    # vmemmap_unpopulate will be necessary for the following:
    - cancelling a hot-add due to an error.
    - unplug.

    Signed-off-by: Yasunori Goto
    Cc: Andy Whitcroft
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • After moving the lockless_freelist to kmem_cache_cpu we no longer need
    page->lockless_freelist. Restructure the use of the struct page fields in
    such a way that we never touch the mapping field.

    This in turn allows us to remove the special casing of SLUB when determining
    the mapping of a page (needed for corner cases of virtually-cached machines
    that need to flush caches of processors mapping a page).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Move the definitions of struct mm_struct and struct vm_area_struct to
    include/mm_types.h. This allows more functions in asm/pgtable.h
    and friends to be defined with inline assemblies instead of macros. Compile
    tested on i386, powerpc, powerpc64, s390-32, s390-64 and x86_64.

    [aurelien@aurel32.net: build fix]
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Aurelien Jarno
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     
  • The commit b5810039a54e5babf428e9a1e89fc1940fabff11 contains the note

    A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
    (and thus mapcounted and count towards shared rss). These writes to
    the struct page could cause excessive cacheline bouncing on big
    systems. There are a number of ways this could be addressed if it is
    an issue.

    And indeed this cacheline bouncing has shown up on large SGI systems.
    There was a situation where an Altix system was essentially livelocked
    tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
    This situation can be avoided in userspace, but it does highlight the
    potential scalability problem with refcounting ZERO_PAGE, and corner
    cases where it can really hurt (we don't want the system to livelock!).

    There are several broad ways to fix this problem:
    1. add back some special casing to avoid refcounting ZERO_PAGE
    2. per-node or per-cpu ZERO_PAGES
    3. remove the ZERO_PAGE completely

    I will argue for 3. The others should also fix the problem, but they
    result in more complex code than does 3, with little or no real benefit
    that I can see.

    Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
    false optimisation: if an application is performance critical, it would
    not be doing many read faults of new memory, or at least it could be
    expected to write to that memory soon afterwards. If cache or memory use
    is critical, it should not be working with a significant number of
    ZERO_PAGEs anyway (a more compact representation of zeroes should be
    used).

    As a sanity check -- measuring on my desktop system, there are never many
    mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
    increase much without it.

    When running a make -j4 kernel compile on my dual core system, there are
    about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
    ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
    is torn down without being COWed). So removing ZERO_PAGE will save 1,000
    page faults per second when running kbuild, while keeping it only saves
    less than 1 page clearing operation per second. 1 page clear is cheaper
    than a thousand faults, presumably, so there isn't an obvious loss.

    Neither the logical argument nor these basic tests give a guarantee of no
    regressions. However, this is a reasonable opportunity to try to remove
    the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
    we can reintroduce it and just avoid refcounting it.

    The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked. I don't see
    much use to them except on benchmarks. All other users of ZERO_PAGE are
    converted just to use ZERO_PAGE(0) for simplicity. We can look at
    replacing them all and maybe ripping out ZERO_PAGE completely when we are
    more satisfied with this solution.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus "snif" Torvalds

    Nick Piggin
     
  • Remove VM_MAX_CACHE_HIT, MAX_RA_PAGES and MIN_RA_PAGES.

    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Convert the common vmemmap population into initialisation helpers for use by
    architecture vmemmap populators. All architecture implementing the
    SPARSEMEM_VMEMMAP variant supply an architecture specific vmemmap_populate()
    initialiser, which may make use of the helpers.

    This allows us to clean up and remove the initialisation Kconfig entries.
    With this patch there is a single SPARSEMEM_VMEMMAP_ENABLE Kconfig option to
    indicate use of that variant.

    Signed-off-by: Andy Whitcroft
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all
    the arches. It would be great if it could be the default so that we can get
    rid of various forms of DISCONTIG and other variations on memory maps. So far
    what has hindered this are the additional lookups that SPARSEMEM introduces
    for virt_to_page and page_address. This goes so far that the code to do this
    has to be kept in a separate function and cannot be used inline.

    This patch introduces a virtual memmap mode for SPARSEMEM, in which the memmap
    is mapped into a virtually contiguous area; only the active sections are
    physically backed. This allows virt_to_page, page_address and cohorts to
    become simple shift/add operations. No page flag fields, no table lookups,
    nothing involving memory is required.

    The two key operations pfn_to_page and page_to_pfn become:

    #define __pfn_to_page(pfn) (vmemmap + (pfn))
    #define __page_to_pfn(page) ((page) - vmemmap)

    By having a virtual mapping for the memmap we allow simple access without
    wasting physical memory. As kernel memory is typically already mapped 1:1
    this introduces no additional overhead. The virtual mapping must be big
    enough to allow a struct page to be allocated and mapped for all valid
    physical pages. This will make a virtual memmap difficult to use on 32 bit
    platforms that support 36 address bits.

    However, if there is enough virtual space available and the arch already maps
    its 1-1 kernel space using TLBs (f.e. true of IA64 and x86_64) then this
    technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM.
    FLATMEM needs to read the contents of the mem_map variable to get the start of
    the memmap and then add the offset to the required entry. vmemmap is a
    constant to which we can simply add the offset.

    This patch has the potential to allow us to make SPARSEMEM the default (and
    even the only) option for most systems. It should be optimal on UP, SMP and
    NUMA on most platforms. Then we may even be able to remove the other memory
    models: FLATMEM, DISCONTIG etc.

    [apw@shadowen.org: config cleanups, resplit code etc]
    [kamezawa.hiroyu@jp.fujitsu.com: Fix sparsemem_vmemmap init]
    [apw@shadowen.org: vmemmap: remove excess debugging]
    [apw@shadowen.org: simplify initialisation code and reduce duplication]
    [apw@shadowen.org: pull out the vmemmap code into its own file]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andy Whitcroft
    Acked-by: Mel Gorman
    Cc: "Luck, Tony"
    Cc: Andi Kleen
    Cc: "David S. Miller"
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

23 Aug, 2007

1 commit

  • The new exec code inserts an accounted vma into an mm struct which is not
    current->mm. The existing memory check code has a hard-coded assumption
    that this does not happen, as does the security code.

    As the correct mm is known we pass the mm to the security method and the
    helper function. A new security test is added for the case where we need
    to pass the mm and the existing one is modified to pass current->mm to
    avoid the need to change large amounts of code.

    (Thanks to Tobias for fixing rejects and testing)

    Signed-off-by: Alan Cox
    Cc: WU Fengguang
    Cc: James Morris
    Cc: Tobias Diedrich
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     

30 Jul, 2007

1 commit

  • Remove fs.h from mm.h. For this,
    1) Uninline vma_wants_writenotify(). It's pretty huge anyway.
    2) Add back fs.h or less bloated headers (err.h) to files that need it.

    As a result, on x86_64 allyesconfig, fs.h dependencies are cut from 3929 files
    rebuilt down to 3444 (-12.3%).

    Cross-compile tested without regressions on my two usual configs and (sigh):

    alpha arm-mx1ads mips-bigsur powerpc-ebony
    alpha-allnoconfig arm-neponset mips-capcella powerpc-g5
    alpha-defconfig arm-netwinder mips-cobalt powerpc-holly
    alpha-up arm-netx mips-db1000 powerpc-iseries
    arm arm-ns9xxx mips-db1100 powerpc-linkstation
    arm-assabet arm-omap_h2_1610 mips-db1200 powerpc-lite5200
    arm-at91rm9200dk arm-onearm mips-db1500 powerpc-maple
    arm-at91rm9200ek arm-picotux200 mips-db1550 powerpc-mpc7448_hpc2
    arm-at91sam9260ek arm-pleb mips-ddb5477 powerpc-mpc8272_ads
    arm-at91sam9261ek arm-pnx4008 mips-decstation powerpc-mpc8313_rdb
    arm-at91sam9263ek arm-pxa255-idp mips-e55 powerpc-mpc832x_mds
    arm-at91sam9rlek arm-realview mips-emma2rh powerpc-mpc832x_rdb
    arm-ateb9200 arm-realview-smp mips-excite powerpc-mpc834x_itx
    arm-badge4 arm-rpc mips-fulong powerpc-mpc834x_itxgp
    arm-carmeva arm-s3c2410 mips-ip22 powerpc-mpc834x_mds
    arm-cerfcube arm-shannon mips-ip27 powerpc-mpc836x_mds
    arm-clps7500 arm-shark mips-ip32 powerpc-mpc8540_ads
    arm-collie arm-simpad mips-jazz powerpc-mpc8544_ds
    arm-corgi arm-spitz mips-jmr3927 powerpc-mpc8560_ads
    arm-csb337 arm-trizeps4 mips-malta powerpc-mpc8568mds
    arm-csb637 arm-versatile mips-mipssim powerpc-mpc85xx_cds
    arm-ebsa110 i386 mips-mpc30x powerpc-mpc8641_hpcn
    arm-edb7211 i386-allnoconfig mips-msp71xx powerpc-mpc866_ads
    arm-em_x270 i386-defconfig mips-ocelot powerpc-mpc885_ads
    arm-ep93xx i386-up mips-pb1100 powerpc-pasemi
    arm-footbridge ia64 mips-pb1500 powerpc-pmac32
    arm-fortunet ia64-allnoconfig mips-pb1550 powerpc-ppc64
    arm-h3600 ia64-bigsur mips-pnx8550-jbs powerpc-prpmc2800
    arm-h7201 ia64-defconfig mips-pnx8550-stb810 powerpc-ps3
    arm-h7202 ia64-gensparse mips-qemu powerpc-pseries
    arm-hackkit ia64-sim mips-rbhma4200 powerpc-up
    arm-integrator ia64-sn2 mips-rbhma4500 s390
    arm-iop13xx ia64-tiger mips-rm200 s390-allnoconfig
    arm-iop32x ia64-up mips-sb1250-swarm s390-defconfig
    arm-iop33x ia64-zx1 mips-sead s390-up
    arm-ixp2000 m68k mips-tb0219 sparc
    arm-ixp23xx m68k-amiga mips-tb0226 sparc-allnoconfig
    arm-ixp4xx m68k-apollo mips-tb0287 sparc-defconfig
    arm-jornada720 m68k-atari mips-workpad sparc-up
    arm-kafa m68k-bvme6000 mips-wrppmc sparc64
    arm-kb9202 m68k-hp300 mips-yosemite sparc64-allnoconfig
    arm-ks8695 m68k-mac parisc sparc64-defconfig
    arm-lart m68k-mvme147 parisc-allnoconfig sparc64-up
    arm-lpd270 m68k-mvme16x parisc-defconfig um-x86_64
    arm-lpd7a400 m68k-q40 parisc-up x86_64
    arm-lpd7a404 m68k-sun3 powerpc x86_64-allnoconfig
    arm-lubbock m68k-sun3x powerpc-cell x86_64-defconfig
    arm-lusl7200 mips powerpc-celleb x86_64-up
    arm-mainstone mips-atlas powerpc-chrp32

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

27 Jul, 2007

1 commit

  • Nathan Lynch reported:
    2.6.23-rc1 breaks the build for 64-bit powerpc for me (using
    maple_defconfig):

    LD vmlinux.o
    powerpc64-unknown-linux-gnu-ld: dynreloc miscount for
    kernel/built-in.o, section .opd
    powerpc64-unknown-linux-gnu-ld: can not edit opd Bad value
    make: *** [vmlinux.o] Error 1

    However, I see a possibly related binutils patch:
    http://article.gmane.org/gmane.comp.gnu.binutils/33650

    It was tracked down to be caused by the weak prototype
    declaration in mm.h:
    __attribute__((weak)) const char *arch_vma_name(struct vm_area_struct *vma);

    But there is no need to make the declaration weak - only the definition
    needs to be marked weak. So drop the weak declaration. And in the process
    drop the duplicate definition in page.h for powerpc.
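
    In other words, the pattern becomes (a sketch, not the exact hunks): keep
    a plain prototype in mm.h,

    const char *arch_vma_name(struct vm_area_struct *vma);

    and mark only the generic fallback definition weak, so an arch can
    override it:

    __attribute__((weak)) const char *arch_vma_name(struct vm_area_struct *vma)
    {
            return NULL;
    }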

    Note: the arch_vma_name fix for x86_64 needs to be applied first to avoid
    breaking x86_64

    Signed-off-by: Sam Ravnborg
    Cc: Nathan Lynch
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sam Ravnborg
     

20 Jul, 2007

8 commits

  • Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from
    the old mm into the new mm.

    We create the new mm before the binfmt code runs, and place the new stack at
    the very top of the address space. Once the binfmt code runs and figures out
    where the stack should be, we move it downwards.

    It is a bit peculiar in that we have one task with two mm's, one of which is
    inactive.

    [a.p.zijlstra@chello.nl: limit stack size]
    Signed-off-by: Ollie Wild
    Signed-off-by: Peter Zijlstra
    Cc:
    Cc: Hugh Dickins
    [bunk@stusta.de: unexport bprm_mm_init]
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ollie Wild
     
  • Split ondemand readahead interface into two functions. I think this makes it
    a little clearer for non-readahead experts (like Rusty).

    Internally they both call ondemand_readahead(), but the page argument is
    changed to an obvious boolean flag.
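
    The two resulting entry points (prototypes as I recall them, so treat
    this as a sketch):

    void page_cache_sync_readahead(struct address_space *mapping,
                                   struct file_ra_state *ra, struct file *filp,
                                   pgoff_t offset, unsigned long req_size);

    void page_cache_async_readahead(struct address_space *mapping,
                                    struct file_ra_state *ra, struct file *filp,
                                    struct page *page, pgoff_t offset,
                                    unsigned long req_size);

    The sync variant is called on a cache miss, the async variant when a
    PG_readahead-marked page is hit, replacing the old single call that took
    a page argument.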

    Signed-off-by: Rusty Russell
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rusty Russell
     
  • Remove the old readahead algorithm.

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • This is a minimal readahead algorithm that aims to replace the current one.
    It is more flexible and reliable, while maintaining almost the same behavior
    and performance. It is also fully integrated with adaptive readahead.

    It is designed to be called on demand:
    - on a missing page, to do synchronous readahead
    - on a lookahead page, to do asynchronous readahead

    In this way it eliminates the awkward workarounds for cache hit/miss,
    readahead thrashing, retried read, and unaligned read. It also adopts the
    data structure introduced by adaptive readahead, parameterizes readahead
    pipelining with `lookahead_index', and reduces the current/ahead windows to
    one single window.

    HEURISTICS

    The logic deals with four cases:

    - sequential-next
    found a consistent readahead window, so push it forward

    - random
    standalone small read, so read as is

    - sequential-first
    create a new readahead window for a sequential/oversize request

    - lookahead-clueless
    hit a lookahead page not associated with the readahead window,
    so create a new readahead window and ramp it up

    In each case, three parameters are determined:

    - readahead index: where the next readahead begins
    - readahead size: how much to readahead
    - lookahead size: when to do the next readahead (for pipelining)

    BEHAVIORS

    The old behaviors are maximally preserved for trivial sequential/random reads.
    Notable changes are:

    - It no longer imposes strict sequential checks.
    It might help some interleaved cases, and clustered random reads.
    It does introduce risks of a random lookahead hit triggering an
    unexpected readahead. But in general it is more likely to do good
    than to do evil.

    - Interleaved reads are supported in a minimal way.
    Their chances of being detected and properly handled are still low.

    - Readahead thrashings are better handled.
    The current readahead leads to tiny average I/O sizes, because it
    never turns back for the thrashed pages. They have to be faulted in
    by do_generic_mapping_read() one by one. Whereas the on-demand
    readahead will redo readahead for them.

    OVERHEADS

    The new code reduced the overheads of

    - excessively calling the readahead routine on small sized reads
    (the current readahead code insists on seeing all requests)

    - doing a lot of pointless page-cache lookups for small cached files
    (the current readahead only turns itself off after 256 cache hits,
    unfortunately most files are < 1MB, so never see that chance)

    That accounts for speedup of
    - 0.3% on 1-page sequential reads on sparse file
    - 1.2% on 1-page cache hot sequential reads
    - 3.2% on 256-page cache hot sequential reads
    - 1.3% on cache hot `tar /lib`

    However, it does introduce one extra page-cache lookup per cache miss, which
    impacts random reads slightly. That's 1% overheads for 1-page random reads on
    sparse file.

    PERFORMANCE

    The basic benchmark setup is
    - 2.6.20 kernel with on-demand readahead
    - 1MB max readahead size
    - 2.9GHz Intel Core 2 CPU
    - 2GB memory
    - 160G/8M Hitachi SATA II 7200 RPM disk

    The benchmarks show that
    - it maintains the same performance for trivial sequential/random reads
    - sysbench/OLTP performance on MySQL gains up to 8%
    - performance on readahead thrashing gains up to 3 times

    iozone throughput (KB/s): roughly the same
    ==========================================
    iozone -c -t1 -s 4096m -r 64k

    2.6.20 on-demand gain
    first run
    " Initial write " 61437.27 64521.53 +5.0%
    " Rewrite " 47893.02 48335.20 +0.9%
    " Read " 62111.84 62141.49 +0.0%
    " Re-read " 62242.66 62193.17 -0.1%
    " Reverse Read " 50031.46 49989.79 -0.1%
    " Stride read " 8657.61 8652.81 -0.1%
    " Random read " 13914.28 13898.23 -0.1%
    " Mixed workload " 19069.27 19033.32 -0.2%
    " Random write " 14849.80 14104.38 -5.0%
    " Pwrite " 62955.30 65701.57 +4.4%
    " Pread " 62209.99 62256.26 +0.1%

    second run
    " Initial write " 60810.31 66258.69 +9.0%
    " Rewrite " 49373.89 57833.66 +17.1%
    " Read " 62059.39 62251.28 +0.3%
    " Re-read " 62264.32 62256.82 -0.0%
    " Reverse Read " 49970.96 50565.72 +1.2%
    " Stride read " 8654.81 8638.45 -0.2%
    " Random read " 13901.44 13949.91 +0.3%
    " Mixed workload " 19041.32 19092.04 +0.3%
    " Random write " 14019.99 14161.72 +1.0%
    " Pwrite " 64121.67 68224.17 +6.4%
    " Pread " 62225.08 62274.28 +0.1%

    In summary, writes are unstable, reads are pretty close on average:

    access pattern 2.6.20 on-demand gain
    Read 62085.61 62196.38 +0.2%
    Re-read 62253.49 62224.99 -0.0%
    Reverse Read 50001.21 50277.75 +0.6%
    Stride read 8656.21 8645.63 -0.1%
    Random read 13907.86 13924.07 +0.1%
    Mixed workload 19055.29 19062.68 +0.0%
    Pread 62217.53 62265.27 +0.1%

    aio-stress: roughly the same
    ============================
    aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
    aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso

    2.6.20 on-demand delta
    sequential 92.57s 92.54s -0.0%
    random 311.87s 312.15s +0.1%

    sysbench fileio: roughly the same
    =================================
    sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
    --file-total-size=4G --file-block-size=64K \
    --num-threads=001 --max-requests=10000 --max-time=900 run

    threads 2.6.20 on-demand delta
    first run
    1 59.1974s 59.2262s +0.0%
    2 58.0575s 58.2269s +0.3%
    4 48.0545s 47.1164s -2.0%
    8 41.0684s 41.2229s +0.4%
    16 35.8817s 36.4448s +1.6%
    32 32.6614s 32.8240s +0.5%
    64 23.7601s 24.1481s +1.6%
    128 24.3719s 23.8225s -2.3%
    256 23.2366s 22.0488s -5.1%

    second run
    1 59.6720s 59.5671s -0.2%
    8 41.5158s 41.9541s +1.1%
    64 25.0200s 23.9634s -4.2%
    256 22.5491s 20.9486s -7.1%

    Note that the numbers are not very stable because of the writes.
    The overall performance is close when we sum all seconds up:

    sum all up 495.046s 491.514s -0.7%

    sysbench oltp (trans/sec): up to 8% gain
    ========================================
    sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
    --mysql-socket=/var/run/mysqld/mysqld.sock \
    --mysql-user=root --mysql-password=readahead \
    --num-threads=064 --max-requests=10000 --max-time=900 run

    10000-transactions run
    threads 2.6.20 on-demand gain
    1 62.81 64.56 +2.8%
    2 67.97 70.93 +4.4%
    4 81.81 85.87 +5.0%
    8 94.60 97.89 +3.5%
    16 99.07 104.68 +5.7%
    32 95.93 104.28 +8.7%
    64 96.48 103.68 +7.5%
    5000-transactions run
    1 48.21 48.65 +0.9%
    8 68.60 70.19 +2.3%
    64 70.57 74.72 +5.9%
    2000-transactions run
    1 37.57 38.04 +1.3%
    2 38.43 38.99 +1.5%
    4 45.39 46.45 +2.3%
    8 51.64 52.36 +1.4%
    16 54.39 55.18 +1.5%
    32 52.13 54.49 +4.5%
    64 54.13 54.61 +0.9%

    Those are interesting results. Some investigations show that
    - MySQL is accessing the db file non-uniformly: some parts are
    hotter than others
    - It is mostly doing 4-page random reads, and sometimes doing two
    reads in a row, the latter one triggering a 16-page readahead.
    - The on-demand readahead leaves many lookahead pages (flagged
    PG_readahead) there. Many of them will be hit, and trigger
    more readahead pages. Which might save more seeks.
    - Naturally, the readahead windows tend to lie in hot areas,
    and the lookahead pages in hot areas are more likely to be hit.
    - The more overall read density, the more possible gain.

    That also explains the adaptive readahead tricks for clustered random reads.

    readahead thrashing: 3 times better
    ===================================
    We boot the kernel with "mem=128m single", and start a new 100KB/s stream
    every second, until reaching 200 streams.

    max throughput min avg I/O size
    2.6.20: 5MB/s 16KB
    on-demand: 15MB/s 140KB

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • This patch completes Linus's wish that the fault return codes be made into
    bit flags, which I agree makes everything nicer. This requires
    all handle_mm_fault callers to be modified (possibly the modifications
    should go further and do things like fault accounting in handle_mm_fault --
    however that would be for another patch).
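
    For reference, the bit-flag style this lands on looks roughly like the
    following (values recalled from that kernel, so treat them as a sketch):

    #define VM_FAULT_OOM            0x0001
    #define VM_FAULT_SIGBUS         0x0002
    #define VM_FAULT_MAJOR          0x0004
    #define VM_FAULT_WRITE          0x0008  /* special case for get_user_pages */

    #define VM_FAULT_NOPAGE         0x0100  /* ->fault installed the pte itself */
    #define VM_FAULT_LOCKED         0x0200  /* ->fault locked the returned page */

    #define VM_FAULT_ERROR          (VM_FAULT_OOM | VM_FAULT_SIGBUS)

    Callers can then test individual bits (fault accounting keys off
    VM_FAULT_MAJOR, for instance) instead of comparing against an enumeration.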

    [akpm@linux-foundation.org: fix alpha build]
    [akpm@linux-foundation.org: fix s390 build]
    [akpm@linux-foundation.org: fix sparc build]
    [akpm@linux-foundation.org: fix sparc64 build]
    [akpm@linux-foundation.org: fix ia64 build]
    Signed-off-by: Nick Piggin
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: Ian Molton
    Cc: Bryan Wu
    Cc: Mikael Starvik
    Cc: David Howells
    Cc: Yoshinori Sato
    Cc: "Luck, Tony"
    Cc: Hirokazu Takata
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Cc: Greg Ungerer
    Cc: Matthew Wilcox
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: Kazumoto Kojima
    Cc: Richard Curnow
    Cc: William Lee Irwin III
    Cc: "David S. Miller"
    Cc: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Miles Bader
    Cc: Chris Zankel
    Acked-by: Kyle McMartin
    Acked-by: Haavard Skinnemoen
    Acked-by: Ralf Baechle
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    [ Still apparently needs some ARM and PPC loving - Linus ]
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Change the ->fault prototype. We now return an int, which contains a
    VM_FAULT_xxx code in the low byte and a FAULT_RET_xxx code in the next byte.
    The FAULT_RET_ code tells the VM whether a page was found, whether it has
    been locked, and potentially other things. This is not quite the way Linus
    wanted it yet, but that's changed in the next patch (which requires changes
    to arch code).

    This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
    that a page is locked which requires filemap_nopage to go away (because we
    can no longer remain backward compatible without that flag), but we were
    going to do that anyway.

    struct fault_data is renamed to struct vm_fault as Linus asked. address
    is now a void __user * that we should firmly encourage drivers not to use
    without really good reason.

    The page is now returned via a page pointer in the vm_fault struct.
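
    A sketch of the new interface (field set as of this series, hedged from
    memory rather than quoted):

    struct vm_fault {
            unsigned int flags;             /* FAULT_FLAG_xxx flags */
            pgoff_t pgoff;                  /* logical page offset based on vma */
            void __user *virtual_address;   /* faulting virtual address */
            struct page *page;              /* set by the handler, unless it
                                               returns VM_FAULT_NOPAGE */
    };

    /* and vm_operations_struct gains the new entry point: */
    int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);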

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
    the virtual address -> file offset differently from linear mappings.

    ->populate is a layering violation because the filesystem/pagecache code
    should not need to know anything about the virtual memory mapping. The hitch here
    is that the ->nopage handler didn't pass down enough information (ie. pgoff).
    But it is more logical to pass pgoff rather than have the ->nopage function
    calculate it itself anyway (because that's a similar layering violation).

    Having the populate handler install the pte itself is likewise a nasty thing
    to be doing.

    This patch introduces a new fault handler that replaces ->nopage and
    ->populate and (later) ->nopfn. Most of the old mechanism is still in place
    so there is a lot of duplication and nice cleanups that can be removed if
    everyone switches over.

    The rationale for doing this in the first place is that nonlinear mappings are
    subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
    to duplicate the synchronisation logic rather than just consolidate the two.

    After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
    pagecache. Seems like a fringe functionality anyway.

    NOPAGE_REFAULT is removed. This should be implemented with ->fault, and no
    users have hit mainline yet.

    [akpm@linux-foundation.org: cleanup]
    [randy.dunlap@oracle.com: doc. fixes for readahead]
    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Nick Piggin
    Signed-off-by: Randy Dunlap
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Fix the race between invalidate_inode_pages and do_no_page.

    Andrea Arcangeli identified a subtle race between invalidation of pages from
    pagecache with userspace mappings, and do_no_page.

    The issue is that invalidation has to shoot down all mappings to the page,
    before it can be discarded from the pagecache. Between shooting down ptes to
    a particular page, and actually dropping the struct page from the pagecache,
    do_no_page from any process might fault on that page and establish a new
    mapping to the page just before it gets discarded from the pagecache.

    The most common case where such invalidation is used is in file truncation.
    This case was catered for by doing a sort of open-coded seqlock between the
    file's i_size, and its truncate_count.

    Truncation will decrease i_size, then increment truncate_count before
    unmapping userspace pages; do_no_page will read truncate_count, then find the
    page if it is within i_size, and then check truncate_count under the page
    table lock and back out and retry if it had subsequently been changed (ptl
    will serialise against unmapping, and ensure a potentially updated
    truncate_count is actually visible).

    Complexity and documentation issues aside, the locking protocol fails in the
    case where we would like to invalidate pagecache inside i_size. do_no_page
    can come in anytime and filemap_nopage is not aware of the invalidation in
    progress (as it is when it is outside i_size). The end result is that
    dangling (->mapping == NULL) pages that appear to be from a particular file
    may be mapped into userspace with nonsense data. Valid mappings to the same
    place will see a different page.

    Andrea implemented two working fixes, one using a real seqlock, another using
    a page->flags bit. He also proposed using the page lock in do_no_page, but
    that was initially considered too heavyweight. However, it is not a global or
    per-file lock, and the page cacheline is modified in do_no_page to increment
    _count and _mapcount anyway, so a further modification should not be a large
    performance hit. Scalability is not an issue.

    This patch implements this latter approach. ->nopage implementations return
    with the page locked if it is possible for their underlying file to be
    invalidated (in that case, they must set a special vm_flags bit to indicate
    so). do_no_page only unlocks the page after setting up the mapping
    completely. invalidation is excluded because it holds the page lock during
    invalidation of each page (and ensures that the page is not mapped while
    holding the lock).

    This also allows significant simplifications in do_no_page, because we have
    the page locked in the right place in the pagecache from the start.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

18 Jul, 2007

4 commits

  • Detect slab objects being passed to the page oriented functions of the VM.

    It is not sufficient to simply return NULL because the functions calling
    page_mapping may depend on other fields of the struct page also being set up
    properly. Moreover, a slab object may not be properly aligned. The page
    oriented functions of the VM expect to operate on page aligned, page sized
    objects. Operations on objects straddling page boundaries may only affect
    the objects partially, which may lead to surprising results.

    It is better to detect any remaining uses and eliminate them.
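
    The detection itself amounts to trapping slab pages in page_mapping()
    instead of letting them silently misbehave. Roughly (hedged, not the
    exact hunk):

    static inline struct address_space *page_mapping(struct page *page)
    {
            struct address_space *mapping = page->mapping;

            VM_BUG_ON(PageSlab(page));      /* slab objects have no page mapping */
            if (unlikely(PageSwapCache(page)))
                    mapping = &swapper_space;
            else if (unlikely((unsigned long)mapping & PAGE_MAPPING_ANON))
                    mapping = NULL;
            return mapping;
    }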

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • I can never remember what the function to register to receive VM pressure
    is called. I have to trace down from __alloc_pages() to find it.

    It's called "set_shrinker()", and it needs Your Help.

    1) Don't hide struct shrinker. It contains no magic.
    2) Don't allocate "struct shrinker". It's not helpful.
    3) Call them "register_shrinker" and "unregister_shrinker".
    4) Call the function "shrink" not "shrinker".
    5) Reduce the 17 lines of waffly comments to 13, but document it properly.
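
    The reworked interface then reads roughly as follows (2.6.23-era
    signatures, from memory):

    struct shrinker {
            int (*shrink)(int nr_to_scan, gfp_t gfp_mask);
            int seeks;      /* seeks to recreate an object */

            /* These are for internal use */
            struct list_head list;
            long nr;        /* objs pending delete */
    };

    extern void register_shrinker(struct shrinker *);
    extern void unregister_shrinker(struct shrinker *);

    A cache embeds a struct shrinker, fills in shrink and seeks, and calls
    register_shrinker() -- no hidden allocation, no opaque handle.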

    Signed-off-by: Rusty Russell
    Cc: David Chinner
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rusty Russell
     
  • This patch adds the kernelcore= parameter for x86.

    Once all patches are applied, a new command-line parameter and a new sysctl
    exist. This patch adds the necessary documentation.

    From: Yasunori Goto

    When "kernelcore" boot option is specified, kernel can't boot up on ia64
    because of an infinite loop. In addition, the parsing code can be handled
    in an architecture-independent manner.

    This patch uses common code to handle the kernelcore= parameter. It is
    only available to architectures that support arch-independent zone-sizing
    (i.e. define CONFIG_ARCH_POPULATES_NODE_MAP). Other architectures will
    ignore the boot parameter.

    [bunk@stusta.de: make cmdline_parse_kernelcore() static]
    Signed-off-by: Mel Gorman
    Signed-off-by: Yasunori Goto
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The following 8 patches against 2.6.20-mm2 create a zone called ZONE_MOVABLE
    that is only usable by allocations that specify both __GFP_HIGHMEM and
    __GFP_MOVABLE. This has the effect of keeping all non-movable pages within a
    single memory partition while allowing movable allocations to be satisfied
    from either partition. The patches may be applied with the list-based
    anti-fragmentation patches that group pages together based on mobility.
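
    A deliberately simplified sketch of the zone selection (the real
    gfp_zone() handles more zones and config options; the function name here
    is made up for illustration):

    static inline enum zone_type gfp_zone_movable_sketch(gfp_t flags)
    {
            /* Only allocations that are both highmem-capable and movable
             * may be satisfied from ZONE_MOVABLE. */
            if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
                            (__GFP_HIGHMEM | __GFP_MOVABLE))
                    return ZONE_MOVABLE;
            if (flags & __GFP_HIGHMEM)
                    return ZONE_HIGHMEM;
            return ZONE_NORMAL;
    }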

    The size of the zone is determined by a kernelcore= parameter specified at
    boot-time. This specifies how much memory is usable by non-movable
    allocations and the remainder is used for ZONE_MOVABLE. Any range of pages
    within ZONE_MOVABLE can be released by migrating the pages or by reclaiming.

    When selecting a zone to take pages from for ZONE_MOVABLE, there are two
    things to consider. First, only memory from the highest populated zone is
    used for ZONE_MOVABLE. On the x86, this is probably going to be ZONE_HIGHMEM
    but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64. Second,
    the amount of memory usable by the kernel will be spread evenly throughout
    NUMA nodes where possible. If the nodes are not of equal size, the amount of
    memory usable by the kernel on some nodes may be greater than others.

    By default, the zone is not as useful for hugetlb allocations because they are
    pinned and non-migratable (currently at least). A sysctl is provided that
    allows huge pages to be allocated from that zone. This means that the huge
    page pool can be resized to the size of ZONE_MOVABLE during the lifetime of
    the system assuming that pages are not mlocked. Despite huge pages being
    non-movable, we do not introduce additional external fragmentation of note as
    huge pages are always the largest contiguous block we care about.

    Credit goes to Andy Whitcroft for catching a large variety of problems during
    review of the patches.

    This patch creates an additional zone, ZONE_MOVABLE. This zone is only usable
    by allocations which specify both __GFP_HIGHMEM and __GFP_MOVABLE. Hot-added
    memory continues to be placed in its existing destination as there is no
    mechanism to redirect it to a specific zone.

    [y-goto@jp.fujitsu.com: Fix section mismatch of memory hotplug related code]
    [akpm@linux-foundation.org: various fixes]
    Signed-off-by: Mel Gorman
    Cc: Andy Whitcroft
    Signed-off-by: Yasunori Goto
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Jul, 2007

2 commits

  • I forgot to remove capability.h from mm.h while removing sched.h! This
    patch remedies that, because the only inline function which was using
    CAP_something was made out of line.

    Cross-compile tested without regressions on:

    all powerpc defconfigs
    all mips defconfigs
    all m68k defconfigs
    all arm defconfigs
    all ia64 defconfigs

    alpha alpha-allnoconfig alpha-defconfig alpha-up
    arm
    i386 i386-allnoconfig i386-defconfig i386-up
    ia64 ia64-allnoconfig ia64-defconfig ia64-up
    m68k
    mips
    parisc parisc-allnoconfig parisc-defconfig parisc-up
    powerpc powerpc-up
    s390 s390-allnoconfig s390-defconfig s390-up
    sparc sparc-allnoconfig sparc-defconfig sparc-up
    sparc64 sparc64-allnoconfig sparc64-defconfig sparc64-up
    um-x86_64
    x86_64 x86_64-allnoconfig x86_64-defconfig x86_64-up

    as well as my two usual configs.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • This is a straightforward split of do_mmap_pgoff() into two functions:

    - do_mmap_pgoff() checks the parameters, and calculates the vma
    flags. Then it calls

    - mmap_region(), which does the actual mapping
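
    Schematically (argument list hedged, not copied from the patch):

    unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
                                unsigned long len, unsigned long prot,
                                unsigned long flags, unsigned long pgoff)
    {
            unsigned long vm_flags;

            /* ... validate file/addr/len/prot, compute vm_flags ... */

            return mmap_region(file, addr, len, flags, vm_flags, pgoff);
    }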

    Signed-off-by: Miklos Szeredi
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi