20 Jul, 2007

7 commits

  • page_mkclean() doesn't re-protect ptes for non-linear mappings, so a later
    re-dirty through such a mapping will not generate a fault, PG_dirty will
    not reflect the dirty state and the dirty count will be skewed. This
    implies that msync() is also currently broken for nonlinear mappings.

    The easiest solution is to emulate remap_file_pages on non-linear mappings
    with a simple mmap() for non-ram-backed filesystems (a userspace sketch of
    the equivalence follows this entry). Applications continue to work (albeit
    more slowly), as long as the number of remappings remains below the
    maximum vma count.

    However, all currently known real uses of non-linear mappings are for
    ram-backed filesystems, which this patch doesn't affect.

    Signed-off-by: Miklos Szeredi
    Acked-by: Peter Zijlstra
    Cc: William Lee Irwin III
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
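
    A minimal userspace sketch of the equivalence being exploited here
    (illustrative only; the kernel-side emulation happens inside the
    remap_file_pages system call): rearranging one page of a shared mapping
    with remap_file_pages() can instead be expressed as a MAP_FIXED mmap() of
    the wanted file offset at the same address, at the cost of one extra vma.

    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Make the page at 'addr' inside an existing MAP_SHARED mapping of fd
     * show file page 'pgoff' instead. */
    static int remap_one_page(void *addr, int fd, int prot, size_t pgoff)
    {
            size_t pgsz = sysconf(_SC_PAGESIZE);

            /* nonlinear API: remap_file_pages(addr, pgsz, 0, pgoff, 0); */
            if (mmap(addr, pgsz, prot, MAP_SHARED | MAP_FIXED,
                     fd, (off_t)pgoff * pgsz) == MAP_FAILED)
                    return -1;
            return 0;
    }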
     
  • Fix msync data loss and (less importantly) dirty page accounting
    inaccuracies due to the race remaining in clear_page_dirty_for_io().

    The deleted comment explains what the race was, and the added comments
    explain how it is fixed.

    Signed-off-by: Nick Piggin
    Acked-by: Linus Torvalds
    Cc: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • This patch completes Linus's wish that the fault return codes be made into
    bit flags, which I agree makes everything nicer. It requires all
    handle_mm_fault callers to be modified (possibly the modifications should
    go further and do things like fault accounting in handle_mm_fault --
    however, that would be for another patch). A sketch of the resulting
    caller pattern follows this entry.

    [akpm@linux-foundation.org: fix alpha build]
    [akpm@linux-foundation.org: fix s390 build]
    [akpm@linux-foundation.org: fix sparc build]
    [akpm@linux-foundation.org: fix sparc64 build]
    [akpm@linux-foundation.org: fix ia64 build]
    Signed-off-by: Nick Piggin
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: Ian Molton
    Cc: Bryan Wu
    Cc: Mikael Starvik
    Cc: David Howells
    Cc: Yoshinori Sato
    Cc: "Luck, Tony"
    Cc: Hirokazu Takata
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Cc: Greg Ungerer
    Cc: Matthew Wilcox
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: Kazumoto Kojima
    Cc: Richard Curnow
    Cc: William Lee Irwin III
    Cc: "David S. Miller"
    Cc: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Miles Bader
    Cc: Chris Zankel
    Acked-by: Kyle McMartin
    Acked-by: Haavard Skinnemoen
    Acked-by: Ralf Baechle
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    [ Still apparently needs some ARM and PPC loving - Linus ]
    Signed-off-by: Linus Torvalds

    Nick Piggin
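
    A hedged sketch of the caller-side pattern this conversion introduces (not
    any particular architecture's actual code; error handling is simplified to
    plain returns):

    #include <linux/errno.h>
    #include <linux/mm.h>
    #include <linux/sched.h>

    static int example_arch_fault(struct mm_struct *mm,
                                  struct vm_area_struct *vma,
                                  unsigned long address, int write)
    {
            int fault = handle_mm_fault(mm, vma, address, write);

            if (unlikely(fault & VM_FAULT_ERROR)) {
                    if (fault & VM_FAULT_OOM)
                            return -ENOMEM; /* real code goes to OOM handling */
                    if (fault & VM_FAULT_SIGBUS)
                            return -EFAULT; /* real code delivers SIGBUS */
            }
            if (fault & VM_FAULT_MAJOR)     /* accounting stays in the caller */
                    current->maj_flt++;
            else
                    current->min_flt++;
            return 0;
    }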
     
  • Change the ->fault prototype. We now return an int, which contains a
    VM_FAULT_xxx code in the low byte and a FAULT_RET_xxx code in the next
    byte. The FAULT_RET_ code tells the VM whether a page was found, whether
    it has been locked, and potentially other things. This is not quite the
    way Linus wanted it yet, but that is changed in the next patch (which
    requires changes to arch code).

    This means we no longer set VM_CAN_INVALIDATE in the vma to indicate that
    a page is returned locked. That in turn requires filemap_nopage to go away
    (we can no longer remain backward compatible without the flag), but we
    were going to remove it anyway.

    struct fault_data is renamed to struct vm_fault, as Linus asked. The
    address is now a void __user * that we should firmly encourage drivers
    not to use without a really good reason.

    The page is now returned via a page pointer in the vm_fault struct; a
    sketch of the reworked structure follows this entry.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
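
    A hedged sketch of the reworked structure and return convention described
    above; field names beyond those mentioned in the text, and the exact
    constants, are assumptions rather than quotes from the kernel headers of
    the time:

    #include <linux/mm.h>

    /* Sketch only - the real definition lives in include/linux/mm.h. */
    struct vm_fault_sketch {
            unsigned int flags;             /* FAULT_FLAG_xxx (assumed) */
            pgoff_t pgoff;                  /* logical file offset of the fault */
            void __user *virtual_address;   /* faulting address; drivers should
                                             * not use it without good reason */
            struct page *page;              /* set by ->fault on success */
    };

    /* The handler's int return value packs two things:
     *   bits 0-7  : VM_FAULT_xxx status (minor/major/sigbus/oom)
     *   bits 8-15 : FAULT_RET_xxx info, e.g. "the returned page is locked"
     */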
     
  • __do_fault() was calling ->page_mkwrite() with the page lock held, which
    violates the locking rules for that callback. Release and retake the page
    lock around the callback to avoid deadlocking file systems which manually
    take it (a sketch of the fix follows this entry).

    Signed-off-by: Mark Fasheh
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Fasheh
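
    A hedged sketch of the fix, not the literal __do_fault() diff: run the
    callback without the page lock, then retake the lock and revalidate that
    the page was not truncated in the meantime:

    #include <linux/errno.h>
    #include <linux/mm.h>
    #include <linux/pagemap.h>

    static int call_page_mkwrite(struct vm_area_struct *vma, struct page *page)
    {
            int err;

            unlock_page(page);      /* never hold the page lock over the callback */
            err = vma->vm_ops->page_mkwrite(vma, page);
            lock_page(page);        /* retake it for the rest of the fault */

            if (err < 0)
                    return err;
            if (!page->mapping)     /* truncated while the lock was dropped */
                    return -EAGAIN; /* caller would retry the fault (assumed) */
            return 0;
    }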
     
  • Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
    the virtual address -> file offset mapping differently from linear
    mappings.

    ->populate is a layering violation because the filesystem/pagecache code
    should not need to know anything about the virtual memory mapping. The
    hitch here is that the ->nopage handler didn't pass down enough information
    (i.e. pgoff). But it is more logical to pass pgoff rather than have the
    ->nopage function calculate it itself anyway (because that is a similar
    layering violation).

    Having the populate handler install the pte itself is likewise a nasty thing
    to be doing.

    This patch introduces a new fault handler that replaces ->nopage and
    ->populate and (later) ->nopfn; a sketch of a converted handler follows
    this entry. Most of the old mechanism is still in place, so there is a lot
    of duplication, and nice cleanups become possible once everyone switches
    over.

    The rationale for doing this in the first place is that nonlinear mappings are
    subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
    to duplicate the synchronisation logic rather than just consolidate the two.

    After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
    pagecache. Seems like a fringe functionality anyway.

    NOPAGE_REFAULT is removed. This should be implemented with ->fault, and no
    users have hit mainline yet.

    [akpm@linux-foundation.org: cleanup]
    [randy.dunlap@oracle.com: doc. fixes for readahead]
    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Nick Piggin
    Signed-off-by: Randy Dunlap
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
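
    A hedged sketch of what a converted handler looks like; my_get_page() and
    the private-data layout are made-up placeholders, and the return values
    follow the bit-flag convention from the patch above:

    #include <linux/mm.h>

    /* Hypothetical lookup standing in for a driver's own page cache. */
    static struct page *my_get_page(void *private, pgoff_t pgoff)
    {
            return NULL;    /* a real driver would look its page up here */
    }

    static int mydrv_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
    {
            /* pgoff is handed to us - no address arithmetic in the driver */
            struct page *page = my_get_page(vma->vm_private_data, vmf->pgoff);

            if (!page)
                    return VM_FAULT_SIGBUS;
            get_page(page);
            vmf->page = page;       /* the page is returned through the struct */
            return 0;
    }

    static struct vm_operations_struct mydrv_vm_ops = {
            .fault  = mydrv_fault,  /* replaces .nopage and .populate */
    };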
     
  • Fix the race between invalidate_inode_pages and do_no_page.

    Andrea Arcangeli identified a subtle race between invalidation of pages from
    pagecache with userspace mappings, and do_no_page.

    The issue is that invalidation has to shoot down all mappings to the page,
    before it can be discarded from the pagecache. Between shooting down ptes to
    a particular page, and actually dropping the struct page from the pagecache,
    do_no_page from any process might fault on that page and establish a new
    mapping to the page just before it gets discarded from the pagecache.

    The most common case where such invalidation is used is in file truncation.
    This case was catered for by doing a sort of open-coded seqlock between the
    file's i_size, and its truncate_count.

    Truncation will decrease i_size, then increment truncate_count before
    unmapping userspace pages; do_no_page will read truncate_count, then find the
    page if it is within i_size, and then check truncate_count under the page
    table lock and back out and retry if it had subsequently been changed (ptl
    will serialise against unmapping, and ensure a potentially updated
    truncate_count is actually visible).

    Complexity and documentation issues aside, the locking protocol fails in the
    case where we would like to invalidate pagecache inside i_size. do_no_page
    can come in anytime and filemap_nopage is not aware of the invalidation in
    progress (as it is when it is outside i_size). The end result is that
    dangling (->mapping == NULL) pages that appear to be from a particular file
    may be mapped into userspace with nonsense data. Valid mappings to the same
    place will see a different page.

    Andrea implemented two working fixes, one using a real seqlock, another using
    a page->flags bit. He also proposed using the page lock in do_no_page, but
    that was initially considered too heavyweight. However, it is not a global or
    per-file lock, and the page cacheline is modified in do_no_page to increment
    _count and _mapcount anyway, so a further modification should not be a large
    performance hit. Scalability is not an issue.

    This patch implements the latter approach. ->nopage implementations return
    with the page locked if it is possible for their underlying file to be
    invalidated (in that case, they must set a special vm_flags bit to indicate
    so). do_no_page only unlocks the page after setting up the mapping
    completely. Invalidation is excluded because it holds the page lock during
    invalidation of each page (and ensures that the page is not mapped while
    the lock is held).

    This also allows significant simplifications in do_no_page, because we have
    the page locked in the right place in the pagecache from the start. A
    sketch of the resulting lock ordering follows this entry.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
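
    A hedged, schematic sketch of the ordering described above (not the actual
    do_no_page() or invalidation code):

    #include <linux/mm.h>
    #include <linux/pagemap.h>

    /* Fault side: the page comes back locked from ->nopage; the pte is set up
     * under the page table lock and only then is the page lock dropped. */
    static void fault_side(struct page *locked_page)
    {
            /* ... install the pte for locked_page under the page table lock ... */
            unlock_page(locked_page);       /* the mapping is complete here */
    }

    /* Invalidation side: lock each page first, which excludes concurrent
     * faults on it, then unmap it and drop it from the pagecache. */
    static void invalidate_side(struct page *page)
    {
            lock_page(page);
            /* ... unmap the page from all user mappings ... */
            /* ... remove the page from the pagecache ... */
            unlock_page(page);
    }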
     

18 Jul, 2007

33 commits

  • Allocate/release a chunk of vmalloc address space: alloc_vm_area reserves a
    chunk of address space and makes sure all the page tables are constructed
    for that address range - but no pages.

    free_vm_area releases the address space range (a usage sketch follows this
    entry).

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Ian Pratt
    Signed-off-by: Christian Limpach
    Signed-off-by: Chris Wright
    Cc: "Jan Beulich"
    Cc: "Andi Kleen"

    Jeremy Fitzhardinge
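
    A hedged usage sketch of the new pair of helpers (error handling trimmed;
    the "hypervisor backend" mentioned in the comment is just an example
    consumer):

    #include <linux/init.h>
    #include <linux/mm.h>
    #include <linux/vmalloc.h>

    static struct vm_struct *area;

    static int __init reserve_example(void)
    {
            /* Reserve 4 pages worth of vmalloc address space with the page
             * tables built, but with no pages behind it yet. */
            area = alloc_vm_area(4 * PAGE_SIZE);
            if (!area)
                    return -ENOMEM;
            /* area->addr .. area->addr + area->size is now ours; a consumer
             * such as a hypervisor backend maps real pages into it later. */
            return 0;
    }

    static void __exit release_example(void)
    {
            free_vm_area(area);     /* give the address space range back */
    }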
     
  • Add a kstrndup function, modelled on strndup. Like strndup, this returns a
    string copied into its own allocated memory, but it copies no more than
    the specified number of bytes from the source (a usage sketch follows this
    entry).

    Remove the private strndup() from the irda code.

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Chris Wright
    Cc: Andrew Morton
    Cc: Randy Dunlap
    Cc: YOSHIFUJI Hideaki
    Cc: Akinobu Mita
    Cc: Arnaldo Carvalho de Melo
    Cc: Al Viro
    Cc: Panagiotis Issaris
    Cc: Rene Scharfe

    Jeremy Fitzhardinge
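
    A hedged usage sketch, assuming a possibly unterminated source buffer of
    which at most 16 bytes are interesting:

    #include <linux/slab.h>
    #include <linux/string.h>

    static char *copy_short_name(const char *src)
    {
            /* Copies at most 16 bytes and always NUL-terminates the result;
             * the caller kfree()s the returned string. */
            return kstrndup(src, 16, GFP_KERNEL);
    }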
     
  • Our original NFSv4 delegation policy was to give out a read delegation on any
    open when it was possible to.

    Since the lifetime of a delegation isn't limited to that of an open, a client
    may quite reasonably hang on to a delegation as long as it has the inode
    cached. This becomes an obvious problem the first time a client's inode cache
    approaches the size of the server's total memory.

    Our first quick solution was to add a hard-coded limit. This patch makes a
    mild incremental improvement by varying that limit according to the server's
    total memory size, allowing at most 4 delegations per megabyte of RAM.

    My quick back-of-the-envelope calculation finds that in the worst case (where
    every delegation is for a different inode), a delegation could take about
    1.5K, which would make the worst case usage about 6% of memory. The new limit
    works out to be about the same as the old on a 1-gig server.

    [akpm@linux-foundation.org: Don't needlessly bloat vmlinux]
    [akpm@linux-foundation.org: Make it right for highmem machines]
    Signed-off-by: "J. Bruce Fields"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Meelap Shah
     
  • Currently the export_operation structure and helpers related to it are in
    fs.h. fs.h is already far too large, and there are very few places needing
    the export bits, so split them off into a separate header.

    [akpm@linux-foundation.org: fix cifs build]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Neil Brown
    Cc: Steven French
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • KSYM_NAME_LEN is peculiar in that it does not include the space for the
    trailing '\0', forcing all users to use KSYM_NAME_LEN + 1 when allocating
    a buffer. This is nonsense and error-prone. Moreover, when the caller
    forgets the + 1, this is very likely to subtly bite back by corrupting the
    stack, because the last position of the buffer is always cleared to zero.

    This patch increments KSYM_NAME_LEN by one and updates code accordingly (a
    caller sketch follows this entry).

    * off-by-one bug in asm-powerpc/kprobes.h::kprobe_lookup_name() macro
    is fixed.

    * Where MODULE_NAME_LEN and KSYM_NAME_LEN were used together,
    MODULE_NAME_LEN was treated as if it didn't include space for the
    trailing '\0'. Fix it.

    Signed-off-by: Tejun Heo
    Acked-by: Paulo Marques
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
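
    A hedged caller sketch of what the change means in practice;
    kallsyms_lookup() is used purely as an example consumer of the buffer:

    #include <linux/kallsyms.h>
    #include <linux/kernel.h>

    static void print_symbol_at(unsigned long addr)
    {
            /* Before this patch the correct size was KSYM_NAME_LEN + 1; now
             * the constant itself includes room for the trailing '\0'. */
            char namebuf[KSYM_NAME_LEN];
            unsigned long size, offset;
            char *modname;
            const char *name;

            name = kallsyms_lookup(addr, &size, &offset, &modname, namebuf);
            if (name)
                    printk(KERN_DEBUG "%#lx is %s\n", addr, name);
    }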
     
  • The bounce buffer logic is included on systems that do not need it. If a
    system does not have zones like ZONE_DMA and ZONE_HIGHMEM that can lead to
    the use of bounce buffers, then there is no need to reserve memory pools
    and so on. This is true, for example, for SGI Altix.

    Also tidies the Makefile and gets rid of the tricky "and" there.

    Signed-off-by: Christoph Lameter
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently, the freezer treats all tasks as freezable, except for the kernel
    threads that explicitly set the PF_NOFREEZE flag for themselves. This
    approach is problematic, since it requires every kernel thread to either
    set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't
    care for the freezing of tasks at all.

    It seems better to only require the kernel threads that want to or need to
    be frozen to use some freezer-related code and to remove any
    freezer-related code from the other (nonfreezable) kernel threads, which is
    done in this patch.

    The patch causes all kernel threads to be nonfreezable by default (i.e. to
    have PF_NOFREEZE set by default) and introduces the set_freezable()
    function that should be called by the freezable kernel threads in order to
    unset PF_NOFREEZE. It also makes all of the currently freezable kernel
    threads call set_freezable(), so it shouldn't cause any (intentional)
    change of behaviour. Additionally, it updates the documentation to
    describe the freezing of tasks more accurately. A sketch of a freezable
    kernel thread follows this entry.

    [akpm@linux-foundation.org: build fixes]
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Nigel Cunningham
    Cc: Pavel Machek
    Cc: Oleg Nesterov
    Cc: Gautham R Shenoy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
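
    A hedged sketch of a kernel thread under the new model: opt in with
    set_freezable(), then keep hitting try_to_freeze() at a safe point in the
    loop:

    #include <linux/freezer.h>
    #include <linux/kthread.h>
    #include <linux/sched.h>

    static int my_worker(void *unused)
    {
            set_freezable();        /* clear PF_NOFREEZE: we want to be frozen */

            while (!kthread_should_stop()) {
                    try_to_freeze();        /* park here during suspend */
                    /* ... do one unit of work ... */
                    schedule_timeout_interruptible(HZ);
            }
            return 0;
    }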
     
  • It is a bug to set a page dirty if it is not uptodate unless it has
    buffers. If the page has buffers, then the page may be dirty (some buffers
    dirty) but not uptodate (some buffers not uptodate). The exception to this
    rule is if the set_page_dirty caller is racing with truncate or invalidate.

    A buffer cannot be set dirty if it is not uptodate.

    If either of these situations occurs, it indicates there could be a data
    loss problem. Some of these warnings could be harmless cases where the
    page or buffer is set uptodate immediately after it is dirtied; however,
    we should fix those up and enforce this ordering.

    Bring the order of operations for truncate into line with those of
    invalidate. This will prevent a page from being able to go !uptodate while
    we're holding the tree_lock, which is probably a good thing anyway.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Robert P. J. Day
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     
  • We currently cannot disable CONFIG_SLUB_DEBUG for CONFIG_NUMA. Now that
    embedded systems are starting to use NUMA, we may need this.

    Put an #ifdef around places where NUMA-only code uses fields that are only
    valid with CONFIG_SLUB_DEBUG.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Sysfs can do a gazillion things when called. Make sure that we do not call
    any sysfs functions while holding the slub_lock.

    Just protect the essentials:

    1. The list of all slab caches
    2. The kmalloc_dma array
    3. The ref counters of the slabs.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The objects per slab increase with the current patches in mm, since we
    allow up to order-3 allocs by default. More patches in mm actually allow
    the use of 2M or larger slabs. For slab validation we need per-object
    bitmaps in order to check a slab. We end up with up to 64k objects per
    slab, resulting in a potential requirement of 8K of stack space. That does
    not look good.

    Allocate the bit arrays via kmalloc.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • kmalloc_node() and kmem_cache_alloc_node() were not available in a zeroing
    variant in the past. But with __GFP_ZERO it is now possible to zero while
    allocating.

    Use __GFP_ZERO to remove the explicit clearing of memory via memset
    wherever we can (a before/after sketch follows this entry).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
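
    A hedged before/after sketch of the kind of conversion this patch performs
    (the node-aware variant is shown because that is where no zeroing helper
    existed):

    #include <linux/slab.h>
    #include <linux/string.h>

    /* before: allocate on a node, then clear by hand */
    static void *old_style(size_t size, int node)
    {
            void *p = kmalloc_node(size, GFP_KERNEL, node);

            if (p)
                    memset(p, 0, size);
            return p;
    }

    /* after: let the allocator do the zeroing */
    static void *new_style(size_t size, int node)
    {
            return kmalloc_node(size, GFP_KERNEL | __GFP_ZERO, node);
    }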
     
  • It now becomes easy to support the zeroing allocs with generic inline
    functions in slab.h. Provide inline definitions to allow the continued use
    of kzalloc, kmem_cache_zalloc etc., but remove the other definitions of
    zeroing functions from the slab allocators and util.c (a sketch of the
    inlines follows this entry).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
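
    A hedged sketch of the generic inlines this makes possible; the bodies are
    reconstructed from the description, not quoted from slab.h, and carry a
    _sketch suffix to avoid clashing with the real names:

    #include <linux/slab.h>

    static inline void *kzalloc_sketch(size_t size, gfp_t flags)
    {
            return kmalloc(size, flags | __GFP_ZERO);
    }

    static inline void *kmem_cache_zalloc_sketch(struct kmem_cache *cache,
                                                 gfp_t flags)
    {
            return kmem_cache_alloc(cache, flags | __GFP_ZERO);
    }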
     
  • We can get the length of the object through the kmem_cache structure. The
    additional parameter does no good and causes the compiler to generate bad
    code.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Do proper spacing and we only need to do this in steps of 8.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Signed-off-by: Adrian Bunk
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • There is no need to calculate the dma slab size ourselves. We can simply
    look up the size of the corresponding non-dma slab.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • kmalloc_index is a long series of comparisons. The attempt to replace
    kmalloc_index with something more efficient like ilog2 failed due to
    compiler issues with constant folding on gcc 3.3 / powerpc.

    kmalloc_index()'s long list of comparisons works fine for constant folding,
    since all the comparisons are optimized away. However, SLUB also uses
    kmalloc_index to determine the slab to use for the __kmalloc_xxx functions.
    This leads to a large set of comparisons in get_slab().

    The patch here allows us to get rid of that list of comparisons in
    get_slab():

    1. If the requested size is larger than 192, then we can simply use
    fls to determine the slab index, since all larger slabs are
    power-of-two sized.

    2. If the requested size is smaller, then we cannot use fls, since there
    are non-power-of-two caches to be considered. However, the sizes are
    in a manageable range. So we divide the size by 8. Then we have only
    24 possibilities left, and we simply look up the kmalloc index
    in a table.

    The code size of slub.o decreases by more than 200 bytes with this patch.
    A sketch of the lookup follows this entry.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
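
    A hedged sketch of the lookup described above. The table contents are
    reconstructed under the usual kmalloc cache layout of the time (8, 16, 32,
    64, 96, 128, 192 bytes, then powers of two, with the 96- and 192-byte
    caches at indices 1 and 2); they are not quoted from slub.c:

    #include <linux/bitops.h>
    #include <linux/types.h>

    static const u8 size_index[24] = {
            3, 4, 5, 5, 6, 6, 6, 6,         /* sizes   1..64            */
            1, 1, 1, 1, 7, 7, 7, 7,         /* sizes  65..128 (96 = 1)  */
            2, 2, 2, 2, 2, 2, 2, 2,         /* sizes 129..192 (192 = 2) */
    };

    /* Zero-sized requests are handled separately (ZERO_SIZE_PTR). */
    static unsigned int kmalloc_cache_index(size_t size)
    {
            if (size > 192)
                    return fls(size - 1);   /* only power-of-two caches above 192 */
            return size_index[(size - 1) / 8];      /* table for the small range */
    }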
     
  • We modify the kmalloc_cache_dma[] array without proper locking. Do the proper
    locking and undo the dma cache creation if another processor has already
    created it.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The rarely used dma functionality in get_slab() makes the function too
    complex. The compiler begins to spill variables from the working set onto
    the stack. The newly created function is only used in extremely rare
    cases, so make sure that the compiler does not decide on its own to merge
    it back into get_slab().

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Add #ifdefs around data structures only needed if debugging is compiled into
    SLUB.

    Add inlines to small functions to reduce code size.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • A kernel convention for many allocators is that if __GFP_ZERO is passed to an
    allocator then the allocated memory should be zeroed.

    This is currently not supported by the slab allocators. The inconsistency
    makes it difficult to implement in derived allocators such as in the uncached
    allocator and the pool allocators.

    In addition, the existing support for zeroed allocations in the slab
    allocators does not have a consistent API. There are no zeroing allocator
    functions for NUMA node placement (kmalloc_node, kmem_cache_alloc_node).
    Zeroing allocations are only provided for the default allocs (kzalloc,
    kmem_cache_zalloc). __GFP_ZERO makes zeroing universally available and
    does not require any additional functions.

    So add the necessary logic to all slab allocators to support __GFP_ZERO (a
    sketch of the hot-path check follows this entry).

    The code is added to the hot path. The gfp flags are on the stack, and so
    the cacheline is readily available for checking whether we want a zeroed
    object.

    Zeroing while allocating is now a frequent operation and we seem to be
    gradually approaching a 1-1 parity between zeroing and not zeroing allocs.
    The current tree has 3476 uses of kmalloc vs 2731 uses of kzalloc.

    Signed-off-by: Christoph Lameter
    Acked-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
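
    A hedged sketch of the hot-path hook (the same idea is added to SLAB, SLOB
    and SLUB; this is not the literal code of any of them):

    #include <linux/compiler.h>
    #include <linux/gfp.h>
    #include <linux/string.h>

    static inline void *slab_post_alloc(void *object, gfp_t flags, size_t size)
    {
            /* The gfp flags are already at hand in the fast path, so the check
             * is cheap; only the rare __GFP_ZERO case pays for the memset. */
            if (object && unlikely(flags & __GFP_ZERO))
                    memset(object, 0, size);
            return object;
    }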
     
  • Define a ZERO_OR_NULL_PTR macro to be able to remove the checks from the
    allocators. Move the ZERO_SIZE_PTR related stuff into slab.h.

    Make ZERO_SIZE_PTR work for all slab allocators and get rid of the
    WARN_ON_ONCE(size == 0) that is still remaining in SLAB.

    Make SLUB return NULL like the other allocators if too large a memory
    segment is requested via __kmalloc (a sketch of the macros follows this
    entry).

    Signed-off-by: Christoph Lameter
    Acked-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
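
    A hedged sketch of the idea; the real definitions live in
    include/linux/slab.h and may differ in detail, so the names below carry a
    _SKETCH suffix:

    #include <linux/slab.h>

    /* A distinguished non-NULL pointer for size-0 allocations: it is not a
     * valid object, and dereferencing it faults immediately. */
    #define ZERO_SIZE_PTR_SKETCH ((void *)16)

    #define ZERO_OR_NULL_PTR_SKETCH(x) \
            ((unsigned long)(x) <= (unsigned long)ZERO_SIZE_PTR_SKETCH)

    /* kfree()/ksize() style callers can then bail out early: */
    static void kfree_sketch(const void *p)
    {
            if (ZERO_OR_NULL_PTR_SKETCH(p))
                    return;         /* nothing was really allocated */
            kfree(p);               /* hand real objects to the real allocator */
    }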
     
  • The size of a kmalloc object is readily available via ksize(). ksize is
    provided by all allocators, and thus we can implement krealloc in a
    generic way.

    Implement krealloc in mm/util.c and drop the slab-specific implementations
    of krealloc (a generic sketch follows this entry).

    Signed-off-by: Christoph Lameter
    Acked-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
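
    A hedged sketch of a generic krealloc built only on ksize(), kmalloc() and
    kfree(); the real mm/util.c version handles a few more corner cases (for
    example, the zero-size pointer):

    #include <linux/slab.h>
    #include <linux/string.h>

    static void *krealloc_sketch(const void *p, size_t new_size, gfp_t flags)
    {
            size_t ks;
            void *ret;

            if (!new_size) {
                    kfree(p);
                    return NULL;    /* the real code returns ZERO_SIZE_PTR here */
            }

            ks = p ? ksize(p) : 0;
            if (ks >= new_size)
                    return (void *)p;       /* existing object is big enough */

            ret = kmalloc(new_size, flags);
            if (ret && p)
                    memcpy(ret, p, ks);     /* ksize() bounds the copy */
            if (ret)
                    kfree(p);               /* drop the old object only on success */
            return ret;
    }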
     
  • The function we are calling to initialize object debug state during early NUMA
    bootstrap sets up an inactive object giving it the wrong redzone signature.
    The bootstrap nodes are active objects and should have active redzone
    signatures.

    Currently slab validation complains and reverts the object to active state.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently SLUB has no provision to deal with too high page orders that may
    be specified on the kernel boot line. If an order higher than 6 (on a 4k
    platform) is generated then we will BUG() because slabs get more than 65535
    objects.

    Add some logic that decreases the order for slabs that have too many
    objects. This allows booting with slab sizes up to MAX_ORDER.

    For example

    slub_min_order=10

    will boot with a default slab size of 4M and reduce slab sizes for small
    object sizes to lower orders if the number of objects becomes too big.
    Large slab sizes like that allow a concentration of objects of the same
    slab cache under as few TLB entries as possible and thus potentially
    reduce TLB pressure.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We currently have to do a GFP_ATOMIC allocation because the list_lock is
    already taken when we first allocate memory for tracking allocation
    information. It would be better if we could avoid atomic allocations.

    Allocate a tracking table of a size that is usually sufficient (one page)
    before we take the list lock. We then only do the atomic allocation if we
    need to resize the table to become larger than a page (mostly only needed
    on large NUMA systems, because of the tracking of cpus and nodes;
    otherwise the table stays small). A sketch of the pattern follows this
    entry.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
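
    A hedged sketch of the pattern, not the slub.c code: get a
    usually-sufficient table with GFP_KERNEL before taking the lock, and only
    fall back to GFP_ATOMIC in the rare case the table must grow while the
    lock is held:

    #include <linux/mm.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(list_lock);

    static int collect_tracking_info(void)
    {
            unsigned long flags;
            /* One page is normally enough; grab it while we may still sleep. */
            void *table = kmalloc(PAGE_SIZE, GFP_KERNEL);

            if (!table)
                    return -ENOMEM;

            spin_lock_irqsave(&list_lock, flags);
            /* ... walk the slab lists and record allocation sites into table;
             * only if it overflows would a larger table be allocated here,
             * under the lock, with GFP_ATOMIC ... */
            spin_unlock_irqrestore(&list_lock, flags);

            kfree(table);
            return 0;
    }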
     
  • Use list_for_each_entry() instead of list_for_each().

    Get rid of for_all_slabs(). It had only one user. So fold it into the
    callback. This also gets rid of cpu_slab_flush.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Changes the error reporting format to loosely follow lockdep.

    If data corruption is detected then we generate the following lines:

    ============================================
    BUG <slab cache affected>: <what went wrong>
    --------------------------------------------

    INFO: <details of the corruption> [possibly multiple times]

    FIX <slab cache affected>: <what was done about it>

    This also adds some more intelligence to the data corruption detection.
    It is now capable of figuring out the start and end of the corruption.

    Add a comment on how to configure SLUB so that a production system may
    continue to operate even though occasional slab corruption occurs through
    a misbehaving kernel component. See "Emergency operations" in
    Documentation/vm/slub.txt.

    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • I can never remember what the function to register to receive VM pressure
    is called. I have to trace down from __alloc_pages() to find it.

    It's called "set_shrinker()", and it needs Your Help.

    1) Don't hide struct shrinker. It contains no magic.
    2) Don't allocate "struct shrinker". It's not helpful.
    3) Call them "register_shrinker" and "unregister_shrinker".
    4) Call the function "shrink" not "shrinker".
    5) Reduce the 17 lines of waffly comments to 13, but document it properly.

    A sketch of registering a shrinker with the renamed API follows this
    entry.

    Signed-off-by: Rusty Russell
    Cc: David Chinner
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rusty Russell
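
    A hedged registration sketch; the struct shrinker fields shown follow the
    description above (a shrink callback plus a seeks weight), with the rest
    treated as internal:

    #include <linux/mm.h>

    static int my_cache_shrink(int nr_to_scan, gfp_t gfp_mask)
    {
            if (nr_to_scan) {
                    /* ... free up to nr_to_scan objects from our cache ... */
            }
            return 0;       /* report how many freeable objects remain */
    }

    static struct shrinker my_shrinker = {
            .shrink = my_cache_shrink,
            .seeks  = DEFAULT_SEEKS,        /* relative cost of recreating objects */
    };

    /* register_shrinker(&my_shrinker) at init time,
     * unregister_shrinker(&my_shrinker) on teardown. */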
     
  • When we are out of memory of a suitable size we enter reclaim. The current
    reclaim algorithm targets pages in LRU order, which is great for fairness at
    order-0 but highly unsuitable if you desire pages at higher orders. To get
    pages of higher order we must shoot down a very high proportion of memory;
    >95% in a lot of cases.

    This patch set adds a lumpy reclaim algorithm to the allocator. It targets
    groups of pages at the specified order anchored at the end of the active and
    inactive lists. This encourages groups of pages at the requested orders to
    move from active to inactive, and active to free lists. This behaviour is
    only triggered out of direct reclaim when higher order pages have been
    requested.

    This patch set is particularly effective when utilised with an
    anti-fragmentation scheme which groups pages of similar reclaimability
    together.

    This patch set is based on Peter Zijlstra's lumpy reclaim V2 patch, which
    forms the foundation. Credit to Mel Gorman for sanity checking.

    Mel said:

    The patches have an application with hugepage pool resizing.

    When lumpy-reclaim is used with ZONE_MOVABLE, the hugepages pool can
    be resized with greater reliability. Testing on a desktop machine with 2GB
    of RAM showed that growing the hugepage pool with ZONE_MOVABLE on its own
    was very slow, as the success rate was quite low. Without lumpy-reclaim,
    each attempt to grow the pool by 100 pages would yield 1 or 2 hugepages.
    With lumpy-reclaim, getting 40 to 70 hugepages on each attempt was typical.

    [akpm@osdl.org: ia64 pfn_to_nid fixes and loop cleanup]
    [bunk@stusta.de: static declarations for internal functions]
    [a.p.zijlstra@chello.nl: initial lumpy V2 implementation]
    Signed-off-by: Andy Whitcroft
    Acked-by: Peter Zijlstra
    Acked-by: Mel Gorman
    Cc: Bob Picco
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • This patch adds a new parameter for sizing ZONE_MOVABLE called
    movablecore=. While kernelcore= is used to specify the minimum amount of
    memory that must be available for all allocation types, movablecore= is
    used to specify the minimum amount of memory that is used for migratable
    allocations. The amount of memory used for migratable allocations
    determines, for example, how large the huge page pool can be dynamically
    resized to at runtime.

    movablecore is actually handled by calculating the total number of pages
    in the system and setting kernelcore to

    kernelcore == totalpages - movablecore

    Both kernelcore= and movablecore= can be safely specified at the same time
    (a worked example follows this entry).

    Signed-off-by: Mel Gorman
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
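
    A hedged worked example of the relationship above, assuming a machine with
    8G of memory:

    booting with:   movablecore=2G

    gives:          kernelcore == totalpages - movablecore
                               == 8G - 2G
                               == 6G

    so roughly 2G ends up in ZONE_MOVABLE (usable only for migratable
    allocations such as a dynamically grown huge page pool) and at least 6G
    remains usable for all allocation types.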