26 Sep, 2006

40 commits

  • Clean up mm/page_alloc.c#mark_free_pages() and make it avoid clearing
    PageNosaveFree for PageNosave pages. This allows us to get rid of an ugly
    hack in kernel/power/snapshot.c#copy_data_pages().

    Additionally, the page-copying loop in copy_data_pages() is moved to an
    inline function.

    Signed-off-by: Rafael J. Wysocki
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Implement async reads for swsusp resuming.

    Crufty old PIII testbox:
    15.7 MB/s -> 20.3 MB/s

    Sony Vaio:
    14.6 MB/s -> 33.3 MB/s

    I didn't implement the post-resume bio_set_pages_dirty(). I don't really
    understand why resume needs to run set_page_dirty() against these pages.

    It might be a worry that this code modifies PG_Uptodate, PG_Error and
    PG_Locked against the image pages. Can this possibly affect the resumed-into
    kernel? Hopefully not, if we're atomically restoring its mem_map?

    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Cc: Jens Axboe
    Cc: Laurent Riffard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Switch the swsusp writeout code from 4k-at-a-time to 4MB-at-a-time.

    Crufty old PIII testbox:
    12.9 MB/s -> 20.9 MB/s

    Sony Vaio:
    14.7 MB/s -> 26.5 MB/s

    The implementation is crude. A better one would use larger BIOs, but wouldn't
    gain any performance.

    The memcpys will be mostly pipelined with the IO and basically come for free.

    The ENOMEM path has not been tested. It should be.

    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • There are many places where we need to determine the node of a zone.
    Currently we use a difficult-to-read sequence of pointer dereferences.
    Put that into an inline function and use it throughout the VM. Maybe we
    can find a way to optimize the lookup in the future.
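
    As a rough sketch, the helper boils down to something like this (field
    names follow the mm structures of this era; treat it as illustrative
    rather than the literal patch):

    static inline int zone_to_nid(struct zone *zone)
    {
            /* replaces open-coded zone->zone_pgdat->node_id chains */
            return zone->zone_pgdat->node_id;
    }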

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • I found two locations in hugetlb.c where we chase pointers instead of
    using page_to_nid(). page_to_nid() is more efficient and gets the node
    directly from the page flags.
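
    Illustrative before/after (a sketch, not the exact hunks from the
    patch):

    /* before: chase pointers through the zone to reach the node id */
    nid = page_zone(page)->zone_pgdat->node_id;

    /* after: read the node id straight out of page->flags */
    nid = page_to_nid(page);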

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Update the comments for __oom_kill_task() to reflect the code changes.

    Signed-off-by: Ram Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ram Gupta
     
  • Minor performance fix.

    If we reclaimed enough slab pages from a zone then we can avoid going off
    node with the current allocation. Take care of updating nr_reclaimed when
    reclaiming from the slab.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently one can enable slab reclaim by setting an explicit option in
    /proc/sys/vm/zone_reclaim_mode. Slab reclaim is then used as a final
    option if freeing unmapped file-backed pages does not free enough pages
    to allow a local allocation.

    However, that means that the slab can grow excessively and that most memory
    of a node may be used by slabs. We have had a case where a machine with
    46GB of memory was using 40-42GB for slab. Zone reclaim was effective in
    dealing with pagecache pages. However, slab reclaim was only done during
    global reclaim (which is a bit rare on NUMA systems).

    This patch implements slab reclaim during zone reclaim. Zone reclaim
    occurs if there is a danger of an off node allocation. At that point we

    1. Shrink the per node page cache if the number of pagecache
    pages is more than min_unmapped_ratio percent of pages in a zone.

    2. Shrink the slab cache if the number of the node's reclaimable slab
    pages (this patch depends on an earlier one that implements that
    counter) is more than min_slab_ratio (a new /proc/sys/vm tunable).

    The shrinking of the slab cache is a bit problematic since it is not
    node specific. So we simply calculate what point in the slab we want to
    reach (current per-node slab use minus the number of pages that need to
    be allocated) and then repeatedly run global reclaim until that is
    unsuccessful or we have reached the limit. I hope we will have
    zone-based slab reclaim at some point, which will make this easier.

    The default for min_slab_ratio is 5%.

    Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.
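
    A sketch of the slab half of the new zone_reclaim() logic (the target
    and lru_pages values are computed earlier in the function; names follow
    the description above and are illustrative, not the literal patch):

    if (zone_page_state(zone, NR_SLAB_RECLAIMABLE) > zone->min_slab_pages) {
            /*
             * shrink_slab() is not node aware, so compute the point we
             * want to reach (current per-node slab use minus the pages
             * we need) and keep running global slab reclaim until we get
             * there or it stops making progress.
             */
            while (shrink_slab(sc.nr_scanned, gfp_mask, lru_pages) &&
                   zone_page_state(zone, NR_SLAB_RECLAIMABLE) > target)
                    ;
    }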

    [akpm@osdl.org: cleanups]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Remove the atomic counter for slab_reclaim_pages and replace the
    counter and NR_SLAB with two ZVC counters that account for
    unreclaimable and reclaimable slab pages: NR_SLAB_RECLAIMABLE and
    NR_SLAB_UNRECLAIMABLE.

    Change the check in vmscan.c to refer to NR_SLAB_RECLAIMABLE. The
    intent seems to be to check for slab pages that could be freed.
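
    Readers of the old atomic counter would now go through the ZVC
    interfaces, along these lines (sketch):

    /* pages in reclaimable slab caches, system wide */
    unsigned long reclaimable_slab = global_page_state(NR_SLAB_RECLAIMABLE);

    /* or for a particular zone */
    unsigned long zone_slab = zone_page_state(zone, NR_SLAB_RECLAIMABLE);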

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • *_pages is a better description of the role of the variable.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The allocpercpu functions __alloc_percpu() and __free_percpu() are
    heavy users of the slab allocator, but they are conceptually separate
    from it, so extract them from the slab allocator. This also simplifies
    SLOB (at this point SLOB may be broken in -mm; this should fix it).

    Signed-off-by: Christoph Lameter
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • If a zone is unpopulated then we do not need to check for pages to be
    drained, nor for vm counters that may need to be updated.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • free_one_page() currently adds the page to a fake list and calls
    free_pages_bulk(). free_pages_bulk() takes it off again and then calls
    __free_one_page().

    Make free_one_page() go directly to __free_one_page(). This saves the
    list add/remove and a temporary list in free_one_page() for
    higher-order pages.
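
    Roughly what the simplified path looks like afterwards (a sketch, not
    necessarily the literal result):

    static void free_one_page(struct zone *zone, struct page *page, int order)
    {
            spin_lock(&zone->lock);
            zone->all_unreclaimable = 0;
            zone->pages_scanned = 0;
            __free_one_page(page, zone, order);  /* no fake list, no bulk path */
            spin_unlock(&zone->lock);
    }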

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • On high-end systems (1024 or so CPUs) this can potentially cause a
    stack overflow. Fix the stack usage.

    Signed-off-by: Suresh Siddha
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Siddha, Suresh B
     
  • In many places we will need to use the same combination of flags. Specify
    a single GFP_THISNODE definition for ease of use in gfp.h.
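
    In gfp.h this amounts to something like the following (sketch; the
    exact flag combination is whatever the patch settles on):

    #ifdef CONFIG_NUMA
    #define GFP_THISNODE    (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
    #else
    #define GFP_THISNODE    ((__force gfp_t)0)
    #endif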

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • There are frequent references to *z in get_page_from_freelist.

    Add an explicit zone variable that can be used in all these places.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • If the user specified a node to move the page to, then we really do
    not want any other node.

    Signed-off-by: Christoph Lameter
    Cc: Andy Whitcroft
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • …mory policy restrictions

    Add a new gfp flag __GFP_THISNODE to avoid fallback to other nodes. This
    flag is essential if a kernel component requires memory to be located on a
    certain node. It will be needed for alloc_pages_node() to force allocation
    on the indicated node and for alloc_pages() to force allocation on the
    current node.
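
    Typical use, sketched:

    /* must come from node nid; never fall back to another node */
    page = alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, order);
    if (!page)
            return NULL;    /* the node is out of memory; caller decides */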

    Signed-off-by: Christoph Lameter <clameter@sgi.com>
    Cc: Andy Whitcroft <apw@shadowen.org>
    Cc: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

    Christoph Lameter
     
  • Place the alien array cache locks of on-slab malloc slab caches on a
    separate lockdep class. This avoids false positives from lockdep.

    [akpm@osdl.org: build fix]
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Cc: Thomas Gleixner
    Acked-by: Arjan van de Ven
    Cc: Ingo Molnar
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • It is fairly easy to get a system to oops by simply sizing a cache via
    /proc in such a way that one of the caches (shared is easiest) becomes
    bigger than the maximum allowed slab allocation size. This occurs
    because enable_cpucache() fails if it cannot reallocate some caches.

    However, enable_cpucache() is used for multiple purposes: resizing caches,
    cache creation and bootstrap.

    If the slab is already up then we already have working caches. The resize
    can fail without a problem. We just need to return the proper error code.
    F.e. after this patch:

    # echo "size-64 10000 50 1000" >/proc/slabinfo
    -bash: echo: write error: Cannot allocate memory

    notice no OOPS.

    If we are doing a kmem_cache_create() then we also should not panic but
    return -ENOMEM.

    If on the other hand we do not have a fully bootstrapped slab allocator yet
    then we should indeed panic since we are unable to bring up the slab to its
    full functionality.

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The ability to free memory allocated to a slab cache is also useful if an
    error occurs during setup of a slab. So extract the function.

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • [akpm@osdl.org: export fix]
    Signed-off-by: Christoph Hellwig
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Let's try to keep mm/ comments more useful and up to date. This is a start.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Also, check that we get a valid slabp_cache for off-slab slab
    descriptors. We should always get one. If we don't, we will have to
    disable off-slab descriptors for this cache and redo the calculations.
    This is a rare case, so add a BUG_ON for now, just in case.

    Signed-off-by: Alok N Kataria
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • Introduce ARCH_LOW_ADDRESS_LIMIT, which can be set per architecture to
    override the 4GB default limit used by the bootmem allocator within
    __alloc_bootmem_low() and __alloc_bootmem_low_node(). E.g. s390 needs a
    2GB limit instead of 4GB.
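
    The mechanism is a simple overridable constant, roughly as follows
    (sketch; the s390 value shown is an assumption based on the 2GB limit
    mentioned above):

    /* linux/bootmem.h: default limit, overridable per architecture */
    #ifndef ARCH_LOW_ADDRESS_LIMIT
    #define ARCH_LOW_ADDRESS_LIMIT  0xffffffffUL
    #endif

    /* s390 would then define, in its arch headers: */
    #define ARCH_LOW_ADDRESS_LIMIT  0x7fffffffUL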

    Acked-by: Ingo Molnar
    Cc: Martin Schwidefsky
    Signed-off-by: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • Print the name of the task invoking the OOM killer. Could make debugging
    easier.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Skip kernel threads, rather than having them return 0 from badness.
    Theoretically, badness might truncate all results to 0, thus a kernel thread
    might be picked first, causing an infinite loop.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • PF_SWAPOFF processes currently cause select_bad_process() to return
    straight away. Instead, give them high priority so that we kill them
    first; however, we also first ensure that no parallel OOM kills are in
    progress.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Having the oomkilladj == OOM_DISABLE check before the releasing check means
    that oomkilladj == OOM_DISABLE tasks exiting will not stop the OOM killer.

    Moving the test down will give the desired behaviour. Also: it will allow
    them to "OOM-kill" themselves if they are exiting. As per the previous patch,
    this is required to prevent OOM killer deadlocks (and they don't actually get
    killed, because they're already exiting -- they're simply allowed access to
    memory reserves).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • If current *is* exiting, it should actually be allowed to access
    reserved memory rather than OOM-kill something else. We can't do this
    via a straight check in page_alloc.c because that would allow multiple
    tasks to use up the reserves. Instead, cause current to OOM-kill
    itself, which will mark it with TIF_MEMDIE.

    The current procedure of simply aborting the OOM-kill if a task is exiting can
    lead to OOM deadlocks.

    In the case of killing a PF_EXITING task, don't make a lot of noise about it.
    This becomes more important in future patches, where we can "kill" OOM_DISABLE
    tasks.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • cpuset_excl_nodes_overlap() does not always indicate that killing a
    task will not free any memory for us. For example, we may be asking
    for an allocation from _anywhere_ in the machine, or the task in
    question may be pinning memory that is outside its cpuset. Fix this by
    just having cpuset_excl_nodes_overlap() reduce the badness rather than
    disallow the kill.

    Signed-off-by: Nick Piggin
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Potentially it takes several scans of the lru lists before we can even start
    reclaiming pages.

    Mapped pages with young ptes can take two passes on the active list
    plus one on the inactive list. But reclaim_mapped may not always kick
    in instantly, so it could take even more than that.

    Raise the threshold for marking a zone as all_unreclaimable from a
    factor of 4 times the pages in the zone to a factor of 6. Introduce a
    mechanism to force reclaim_mapped if we've reached a factor of 3 and
    still haven't made progress.

    Previously, a customer doing stress testing was able to easily OOM the box
    after using only a small fraction of its swap (~100MB). After the patches, it
    would only OOM after having used up all swap (~800MB).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • __alloc_pages currently starts shooting if page reclaim has failed to free up
    swap_cluster_max pages in one run through the priorities. This is not always
    a good indicator on its own, so make use of the all_unreclaimable logic as
    well: don't consider going OOM until all zones we're interested in are
    unreclaimable.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Currently we can silently drop data if the write to swap failed. It
    usually doesn't result in data-corruption because on page-in the process
    will receive SIGBUS (assuming write-failure implies read-failure).

    This assumption might or might not be valid.

    This patch avoids the page being discarded after a failed write, and
    prints a warning the sysadmin _should_ take to heart: if a lot of swap
    space becomes un-writeable, OOM is not far off.

    Tested by making the write fail 'randomly' once every 50 writes or so.

    [akpm@osdl.org: printk warning fix]
    Signed-off-by: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • As explained by Heiko, on s390 (32-bit) ARCH_KMALLOC_MINALIGN is set
    to eight because their common I/O layer allocates data structures that
    need eight-byte alignment. This does not work when CONFIG_SLAB_DEBUG
    is enabled, because kmem_cache_create() will override the alignment to
    BYTES_PER_WORD, which is four.

    So change kmem_cache_create to ensure cache alignment is always at minimum
    what the architecture or caller mandates even if slab debugging is enabled.
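
    Conceptually the fix clamps the computed alignment from below, e.g.
    (a sketch of the idea using the ralign/align names from
    kmem_cache_create(), not the exact hunk):

    /* after the debug code has chosen its preferred alignment */
    if (ralign < ARCH_SLAB_MINALIGN)
            ralign = ARCH_SLAB_MINALIGN;    /* arch mandated minimum */
    if (ralign < align)
            ralign = align;                 /* caller mandated minimum */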

    Cc: Heiko Carstens
    Cc: Christoph Lameter
    Signed-off-by: Manfred Spraul
    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • lock_page needs the caller to have a reference on the page->mapping inode
    due to sync_page, ergo set_page_dirty_lock is obviously buggy according to
    its comments.

    Solve it by introducing a new lock_page_nosync which does not do a sync_page.

    akpm: unpleasant solution to an unpleasant problem. If it goes wrong it could
    cause great slowdowns while the lock_page() caller waits for kblockd to
    perform the unplug. And if a filesystem has special sync_page() requirements
    (none presently do), permanent hangs are possible.

    otoh, set_page_dirty_lock() is usually (always?) called against userspace
    pages. They are always up-to-date, so there shouldn't be any pending read I/O
    against these pages.
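
    The new helper is essentially lock_page() minus the sync_page() call,
    along these lines (sketch):

    static inline void lock_page_nosync(struct page *page)
    {
            might_sleep();
            if (TestSetPageLocked(page))
                    /* sleep on PG_locked without calling ->sync_page() */
                    __lock_page_nosync(page);
    }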

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Some users of remove_mapping had been unsafe.

    Modify the remove_mapping precondition to ensure the caller has locked the
    page and obtained the correct mapping. Modify callers to ensure the
    mapping is the correct one.

    [hugh@veritas.com: swapper_space fix]
    Signed-off-by: Nick Piggin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • These functions are already documented quite well with long comments.
    Now add kerneldoc-style headers to make them turn up in everyone's
    favorite doc format.

    Signed-off-by: Rolf Eike Beer
    Cc: "Randy.Dunlap"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rolf Eike Beer
     
  • This patch splits alloc_percpu() up into two phases. Likewise for
    free_percpu(). This allows clients to limit initial allocations to online
    cpu's, and to populate or depopulate per-cpu data at run time as needed:

    struct my_struct *obj;
    struct my_struct *ptr;

    /* initial allocation for online cpu's */
    obj = percpu_alloc(sizeof(struct my_struct), GFP_KERNEL);

    ...

    /* populate per-cpu data for cpu coming online */
    ptr = percpu_populate(obj, sizeof(struct my_struct), GFP_KERNEL, cpu);

    ...

    /* access per-cpu object */
    ptr = percpu_ptr(obj, smp_processor_id());

    ...

    /* depopulate per-cpu data for cpu going offline */
    percpu_depopulate(obj, cpu);

    ...

    /* final removal */
    percpu_free(obj);

    Signed-off-by: Martin Peschke
    Cc: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Peschke
     
  • Add a notifer chain to the out of memory killer. If one of the registered
    callbacks could release some memory, do not kill the process but return and
    retry the allocation that forced the oom killer to run.

    The purpose of the notifier is to add a safety net in the presence of
    memory ballooners. If the resource manager inflated the balloon to a size
    where memory allocations can not be satisfied anymore, it is better to
    deflate the balloon a bit instead of killing processes.

    The implementation for the s390 ballooner is included.
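
    A balloon driver would hook in roughly like this (sketch;
    balloon_deflate_some() is a hypothetical helper that returns the
    number of pages it released):

    static int balloon_oom_notify(struct notifier_block *self,
                                  unsigned long dummy, void *parm)
    {
            unsigned long *freed = parm;

            /* hand back some pages before anything gets killed */
            *freed += balloon_deflate_some();    /* hypothetical */
            return NOTIFY_OK;
    }

    static struct notifier_block balloon_oom_nb = {
            .notifier_call = balloon_oom_notify,
    };

    /* at driver init time */
    register_oom_notifier(&balloon_oom_nb);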

    [akpm@osdl.org: cleanups]
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky