17 Oct, 2007

1 commit

  • Move the OOM killer's extern function prototypes to include/linux/oom.h and
    include it where necessary.
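
    A rough sketch of the kind of header this creates; the include guard and
    the exact prototype shown are illustrative, not the literal contents of
    include/linux/oom.h:

        /* include/linux/oom.h -- illustrative sketch */
        #ifndef __INCLUDE_LINUX_OOM_H
        #define __INCLUDE_LINUX_OOM_H

        struct zonelist;

        /* previously a bare extern repeated at each call site */
        extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
                                  int order);

        #endif /* __INCLUDE_LINUX_OOM_H */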

    [clg@fr.ibm.com: build fix]
    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

10 Oct, 2007

1 commit

  • As bi_end_io is only called once when the request is complete,
    the 'size' argument is now redundant. Remove it.

    Now there is no need for bio_endio to subtract the size completed
    from bi_size. So don't do that either.

    While we are at it, change bi_end_io to return void.
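
    The visible effect is the shape of the completion callback. The handler
    below is a hypothetical driver-side example; the typedef change itself
    follows the description above:

        /* old style: could be invoked per partial completion, was told how
         * many bytes were done, and returned a value */
        typedef int (bio_end_io_t) (struct bio *, unsigned int, int);

        /* new style: invoked exactly once when the whole bio completes */
        typedef void (bio_end_io_t) (struct bio *, int);

        static void my_end_io(struct bio *bio, int error)
        {
                struct completion *done = bio->bi_private;

                /* bi_size is no longer decremented behind our back */
                complete(done);
                bio_put(bio);
        }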

    Signed-off-by: Neil Brown
    Signed-off-by: Jens Axboe

    NeilBrown
     

18 Jul, 2007

1 commit

  • When we are out of memory of a suitable size we enter reclaim. The current
    reclaim algorithm targets pages in LRU order, which is great for fairness at
    order-0 but highly unsuitable if you desire pages at higher orders. To get
    pages of higher order we must shoot down a very high proportion of memory;
    >95% in a lot of cases.

    This patch set adds a lumpy reclaim algorithm to the allocator. It targets
    groups of pages at the specified order anchored at the end of the active and
    inactive lists. This encourages groups of pages at the requested orders to
    move from active to inactive, and active to free lists. This behaviour is
    only triggered out of direct reclaim when higher order pages have been
    requested.

    This patch set is particularly effective when utilised with an
    anti-fragmentation scheme which groups pages of similar reclaimability
    together.

    This patch set is based on Peter Zijlstra's lumpy reclaim V2 patch which forms
    the foundation. Credit to Mel Gorman for sanity checking.

    Mel said:

    The patches have an application with hugepage pool resizing.

    When lumpy-reclaim is used with ZONE_MOVABLE, the hugepages pool can
    be resized with greater reliability. Testing on a desktop machine with 2GB
    of RAM showed that growing the hugepage pool with ZONE_MOVABLE on its own
    was very slow as the success rate was quite low. Without lumpy-reclaim,
    each attempt to grow the pool by 100 pages would yield 1 or 2 hugepages.
    With lumpy-reclaim, getting 40 to 70 hugepages on each attempt was typical.
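
    A much simplified sketch of the central idea described above (groups of
    pages at the requested order, anchored at the tail of the LRU lists), with
    hypothetical helper names; the real vmscan.c code also has to respect zone
    boundaries, LRU locking and pages that cannot be isolated:

        /* Starting from a page taken off the tail of an LRU list, try to pull
         * in the rest of its naturally aligned order-N block as well, so that
         * reclaiming the group can produce one contiguous free area. */
        static int isolate_lumpy_block(struct page *anchor, int order,
                                       struct list_head *pagelist)
        {
                unsigned long start = page_to_pfn(anchor) & ~((1UL << order) - 1);
                unsigned long pfn;

                for (pfn = start; pfn < start + (1UL << order); pfn++) {
                        struct page *page;

                        if (!pfn_valid(pfn))
                                return -EBUSY;
                        page = pfn_to_page(pfn);
                        if (page == anchor || !PageLRU(page))
                                continue;       /* best effort on neighbours */
                        list_move(&page->lru, pagelist);
                }
                return 0;
        }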

    [akpm@osdl.org: ia64 pfn_to_nid fixes and loop cleanup]
    [bunk@stusta.de: static declarations for internal functions]
    [a.p.zijlstra@chello.nl: initial lumpy V2 implementation]
    Signed-off-by: Andy Whitcroft
    Acked-by: Peter Zijlstra
    Acked-by: Mel Gorman
    Cc: Bob Picco
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     

12 Feb, 2007

2 commits

  • Function is unnecessary now. We can use the summing features of the ZVCs to
    get the values we need.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • nr_free_pages is now a simple access to a global variable. Make it a macro
    instead of a function.

    nr_free_pages() now requires vmstat.h to be included. There is one
    occurrence in power management where we need to add the include. Directly
    refer to global_page_state() there to clarify why the #include was added.
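
    The conversion is essentially the following; the macro body is a sketch
    built from the global_page_state() interface mentioned above:

        /* before: a real function that walked the zones */
        extern unsigned int nr_free_pages(void);

        /* after: a macro reading the already-maintained global counter */
        #define nr_free_pages() global_page_state(NR_FREE_PAGES)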

    [akpm@osdl.org: arm build fix]
    [akpm@osdl.org: sparc64 build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

06 Jan, 2007

1 commit

  • In kernels later than 2.6.19 there is a regression that makes swsusp
    fail if the resume device is not explicitly specified.

    It can be fixed by adding an additional parameter to
    mm/swapfile.c:swap_type_of() allowing us to pass the (struct block_device
    *) corresponding to the first available swap back to the caller.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     

08 Dec, 2006

3 commits

  • Make swsusp use block device offsets instead of swap offsets to identify swap
    locations and make it use the same code paths for writing as well as for
    reading data.

    This allows us to use the same code for handling swap files and swap
    partitions and to simplify the code, eg. by dropping rw_swap_page_sync().

    Signed-off-by: Rafael J. Wysocki
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • The Linux kernel handles swap files almost in the same way as it handles swap
    partitions and there are only two differences between these two types of swap
    areas:

    (1) swap files need not be contiguous,

    (2) the header of a swap file is not in the first block of the partition
    that holds it. From swsusp's point of view (1) is not a problem,
    because it is already taken care of by the swap-handling code, but (2) has
    to be taken into consideration.

    In principle the location of a swap file's header may be determined with the
    help of the appropriate filesystem driver. Unfortunately, however, it requires
    the filesystem holding the swap file to be mounted, and if this filesystem is
    journaled, it cannot be mounted during a resume from disk. For this reason we
    need some other means by which swap areas can be identified.

    For example, to identify a swap area we can use the partition that holds the
    area and the offset from the beginning of this partition at which the swap
    header is located.

    The following patch allows swsusp to identify swap areas this way. It changes
    swap_type_of() so that it takes an additional argument representing an offset
    of the swap header within the partition represented by its first argument.
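
    In effect the lookup goes from "which swap type lives on this device" to
    "which swap type lives on this device at this offset". Roughly, with the
    signatures sketched from the description rather than copied from the tree:

        /* before */
        int swap_type_of(dev_t device);

        /* after: offset of the swap header within the device, so a swap file
         * and a swap partition can be identified the same way */
        int swap_type_of(dev_t device, sector_t offset);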

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • The new swap token patches replace the current token traversal algorithm.
    The old algorithm had a crude timeout parameter that was used to hand the
    token over from one task to another. This algorithm transfers the token to
    the tasks that are in need of it. The urgency for the token is based on the
    number of times a task is required to swap in pages. Accordingly, the
    priority of a task is incremented if it has been badly affected by swap-outs.
    To ensure that the token doesn't bounce around rapidly, the token holders
    are given a priority boost. The priority of tasks is also decremented if
    their rate of swap-ins keeps falling. This way, the condition to check
    whether to preempt the swap token is a matter of comparing two tasks'
    priority fields.
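
    A hedged model of that comparison; the type and helper below are
    illustrative only (in the patch the bookkeeping is attached to the mm):

        struct token_state {
                unsigned int token_priority;    /* bumped on swap-ins, decayed
                                                 * when the fault rate drops */
        };

        /* hand the token over only if the contender is hurting more than the
         * holder; the holder's priority boost keeps it from bouncing */
        static int should_preempt_token(struct token_state *holder,
                                        struct token_state *contender)
        {
                return contender->token_priority > holder->token_priority;
        }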

    [akpm@osdl.org: cleanups]
    Signed-off-by: Ashwin Chaugule
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ashwin Chaugule
     

26 Sep, 2006

5 commits

  • Implement async reads for swsusp resuming.

    Crufty old PIII testbox:
    15.7 MB/s -> 20.3 MB/s

    Sony Vaio:
    14.6 MB/s -> 33.3 MB/s

    I didn't implement the post-resume bio_set_pages_dirty(). I don't really
    understand why resume needs to run set_page_dirty() against these pages.

    It might be a worry that this code modifies PG_Uptodate, PG_Error and
    PG_Locked against the image pages. Can this possibly affect the resumed-into
    kernel? Hopefully not, if we're atomically restoring its mem_map?

    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Cc: Jens Axboe
    Cc: Laurent Riffard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Switch the swsusp writeout code from 4k-at-a-time to 4MB-at-a-time.

    Crufty old PIII testbox:
    12.9 MB/s -> 20.9 MB/s

    Sony Vaio:
    14.7 MB/s -> 26.5 MB/s

    The implementation is crude. A better one would use larger BIOs, but wouldn't
    gain any performance.

    The memcpys will be mostly pipelined with the IO and basically come for free.

    The ENOMEM path has not been tested. It should be.

    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Currently one can enable slab reclaim by setting an explicit option in
    /proc/sys/vm/zone_reclaim_mode. Slab reclaim is then used as a final
    option if freeing unmapped file-backed pages does not free enough pages
    to allow a local allocation.

    However, that means that the slab can grow excessively and that most memory
    of a node may be used by slabs. We have had a case where a machine with
    46GB of memory was using 40-42GB for slab. Zone reclaim was effective in
    dealing with pagecache pages. However, slab reclaim was only done during
    global reclaim (which is a bit rare on NUMA systems).

    This patch implements slab reclaim during zone reclaim. Zone reclaim
    occurs if there is a danger of an off node allocation. At that point we

    1. Shrink the per node page cache if the number of pagecache
    pages is more than min_unmapped_ratio percent of pages in a zone.

    2. Shrink the slab cache if the number of the node's reclaimable slab pages
    (this patch depends on an earlier one that implements that counter)
    is more than min_slab_ratio (a new /proc/sys/vm tunable).

    The shrinking of the slab cache is a bit problematic since it is not node
    specific. So we simply calculate what point in the slab we want to reach
    (current per-node slab use minus the number of pages that need to be
    allocated) and then repeatedly run the global reclaim, as sketched below,
    until that is unsuccessful or we have reached the limit. I hope we will
    have zone-based slab reclaim at some point which will make that easier.

    The default for min_slab_ratio is 5%.

    Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.
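
    The loop described above might look roughly like this; both helpers are
    placeholders standing in for the node's reclaimable-slab counter and the
    global slab shrinker:

        static void shrink_node_slab(struct zone *zone, unsigned long nr_pages)
        {
                unsigned long current_slab = node_slab_pages(zone);     /* placeholder */
                unsigned long target;

                target = current_slab > nr_pages ? current_slab - nr_pages : 0;

                /* keep running the global shrinker until the node's slab use
                 * has dropped by what this allocation needs, or no progress */
                while (node_slab_pages(zone) > target)
                        if (!run_global_slab_reclaim())                 /* placeholder */
                                break;
        }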

    [akpm@osdl.org: cleanups]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Add a notifier chain to the out of memory killer. If one of the registered
    callbacks could release some memory, do not kill the process but return and
    retry the allocation that forced the oom killer to run.

    The purpose of the notifier is to add a safety net in the presence of
    memory ballooners. If the resource manager inflated the balloon to a size
    where memory allocations can not be satisfied anymore, it is better to
    deflate the balloon a bit instead of killing processes.

    The implementation for the s390 ballooner is included.
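
    A sketch of how a balloon driver might hook in; register_oom_notifier() is
    the interface this patch adds, while the deflate helper is hypothetical:

        static int balloon_oom_notify(struct notifier_block *self,
                                      unsigned long dummy, void *parm)
        {
                unsigned long *freed = parm;

                /* hand some pages back instead of letting a task be killed */
                *freed += balloon_deflate_pages(256);   /* hypothetical helper */
                return NOTIFY_OK;
        }

        static struct notifier_block balloon_oom_nb = {
                .notifier_call = balloon_oom_notify,
        };

        /* at driver initialisation */
        register_oom_notifier(&balloon_oom_nb);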

    [akpm@osdl.org: cleanups]
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     
  • Move totalhigh_pages and nr_free_highpages() into highmem.c/.h

    Move the totalhigh_pages definition into highmem.c/.h. Move the
    nr_free_highpages function into highmem.c

    [yoichi_yuasa@tripeaks.co.jp: build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Yoichi Yuasa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

04 Jul, 2006

1 commit

  • It turns out that it is advantageous to leave a small portion of unmapped file
    backed pages if all of a zone's pages (or almost all pages) are allocated and
    so the page allocator has to go off-node.

    This allows recently used file I/O buffers to stay on the node and
    reduces the times that zone reclaim is invoked if file I/O occurs
    when we run out of memory in a zone.

    The problem is that zone reclaim runs too frequently when the page cache is
    used for file I/O (read/write and therefore unmapped pages!) alone and we have
    almost all pages of the zone allocated. Zone reclaim may remove 32 unmapped
    pages. File I/O will use these pages for the next read/write requests and the
    unmapped pages increase. After the zone has filled up again, zone reclaim will
    trigger again after only 32 pages. This cycle is too inefficient and there
    are potentially too many zone reclaim cycles.

    With the 1% boundary we may still remove all unmapped pages for file I/O in a
    zone reclaim pass. However, it will take a large number of reads and writes
    to get back to 1% again, where we trigger zone reclaim again.

    Zone reclaim in 2.6.16/17 does not show this behavior because we have a 30
    second timeout.

    [akpm@osdl.org: rename the /proc file and the variable]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

01 Jul, 2006

1 commit

  • The zone_reclaim_interval was necessary because we were not able to determine
    how many unmapped pages exist in a zone. Therefore we had to scan in
    intervals to figure out if any pages were unmapped.

    With the zoned counters and NR_ANON_PAGES we now know the number of pagecache
    pages and the number of mapped pages in a zone. So we can simply skip the
    reclaim if there is an insufficient number of unmapped pages. We use
    SWAP_CLUSTER_MAX as the boundary.

    Drop all support for /proc/sys/vm/zone_reclaim_interval.
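
    The interval test therefore becomes a direct counter test, roughly as
    below; the helper name is made up and the exact counters compared in the
    real patch may differ:

        static int zone_reclaim_worthwhile(struct zone *zone)
        {
                unsigned long unmapped =
                        zone_page_state(zone, NR_FILE_PAGES) -
                        zone_page_state(zone, NR_FILE_MAPPED);

                /* too little unmapped pagecache to be worth a scan */
                return unmapped > SWAP_CLUSTER_MAX;
        }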

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

23 Jun, 2006

5 commits

  • Initialise total_memory earlier in boot. Because if for some reason we run
    page reclaim early in boot, we don't want total_memory to be zero when we use
    it as a divisor.

    And rename total_memory to vm_total_pages to avoid naming clashes with
    architectures.

    Cc: Yasunori Goto
    Cc: KAMEZAWA Hiroyuki
    Cc: Martin Bligh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • If CONFIG_SWAP is not defined we get:

    mm/vmscan.c: In function ‘remove_mapping’:
    mm/vmscan.c:387: warning: unused variable ‘swap’

    Convert defines in swap.h into blank inline functions to fix this warning
    and be consistent.
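
    The pattern is simply to replace a macro stub with a typed empty inline so
    the compiler still sees the argument as used. Illustrative stub with a
    made-up name, not the exact swap.h hunk:

        /* before: with CONFIG_SWAP off the argument vanishes, so gcc warns
         * about the now-unused local in the caller */
        #define release_swap_entry(entry)       do { } while (0)

        /* after: still a no-op, but the argument is "used" and type-checked */
        static inline void release_swap_entry(swp_entry_t entry)
        {
        }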

    Signed-off-by: Con Kolivas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Con Kolivas
     
  • This implements the use of migration entries to preserve ptes of file backed
    pages during migration. Processes can therefore be migrated back and forth
    without losing their connection to pagecache pages.

    Note that we implement the migration entries only for linear mappings.
    Nonlinear mappings still require the unmapping of the ptes for migration.

    And another writepage() ugliness shows up. writepage() can drop the page
    lock. Therefore we have to remove migration ptes before calling writepages()
    in order to avoid having migration entries point to unlocked pages.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Implement read/write migration ptes

    We take the upper two swapfiles for the two types of migration ptes and define
    a series of macros in swapops.h.

    The VM is modified to handle the migration entries. Migration entries can
    only be encountered when the page they are pointing to is locked. This limits
    the number of places one has to fix. We also check in copy_pte_range and in
    mprotect_pte_range() for migration ptes.

    We check for migration ptes in do_swap_cache and call a function that will
    then wait on the page lock. This allows us to effectively stop all accesses
    to a page.

    Migration entries are created by try_to_unmap if called for migration and
    removed by local functions in migrate.c

    From: Hugh Dickins

    Several times while testing swapless page migration (I've no NUMA, just
    hacking it up to migrate recklessly while running load), I've hit the
    BUG_ON(!PageLocked(p)) in migration_entry_to_page.

    This comes from an orphaned migration entry, unrelated to the current
    correctly locked migration, but hit by remove_anon_migration_ptes as it
    checks an address in each vma of the anon_vma list.

    Such an orphan may be left behind if an earlier migration raced with fork:
    copy_one_pte can duplicate a migration entry from parent to child, after
    remove_anon_migration_ptes has checked the child vma, but before it has
    removed it from the parent vma. (If the process were later to fault on this
    orphaned entry, it would hit the same BUG from migration_entry_wait.)

    This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
    not. There's no such problem with file pages, because vma_prio_tree_add
    adds child vma after parent vma, and the page table locking at each end is
    enough to serialize. Follow that example with anon_vma: add new vmas to the
    tail instead of the head.

    (There's no corresponding problem when inserting migration entries,
    because a missed pte will leave the page count and mapcount high, which is
    allowed for. And there's no corresponding problem when migrating via swap,
    because a leftover swap entry will be correctly faulted. But the swapless
    method has no refcounting of its entries.)

    From: Ingo Molnar

    pte_unmap_unlock() takes the pte pointer as an argument.

    From: Hugh Dickins

    Several times while testing swapless page migration, gcc has tried to exec
    a pointer instead of a string: smells like COW mappings are not being
    properly write-protected on fork.

    The protection in copy_one_pte looks very convincing, until at last you
    realize that the second arg to make_migration_entry is a boolean "write",
    and SWP_MIGRATION_READ is 30.

    Anyway, it's better done like in change_pte_range, using
    is_write_migration_entry and make_migration_entry_read.

    From: Hugh Dickins

    Remove unnecessary obfuscation from sys_swapon's range check on swap type,
    which blew up causing memory corruption once swapless migration made
    MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.
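
    The write-protection fix Hugh describes amounts to converting the entry
    itself rather than passing a flag that happens to be non-zero. Roughly,
    following the change_pte_range pattern named above (surrounding
    copy_one_pte code omitted):

        swp_entry_t entry = pte_to_swp_entry(*src_pte);

        if (is_write_migration_entry(entry) && is_cow_mapping(vm_flags)) {
                /* the child must fault and COW, so downgrade to read */
                make_migration_entry_read(&entry);
                set_pte_at(src_mm, addr, src_pte, swp_entry_to_pte(entry));
        }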

    Signed-off-by: Hugh Dickins
    Acked-by: Martin Schwidefsky
    Signed-off-by: Hugh Dickins
    Signed-off-by: Christoph Lameter
    Signed-off-by: Ingo Molnar
    From: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Reserve space in the swap disk header for a LABEL and UUID to be specified.
    This has been possible with util-linux-2.12b (via e2fsprogs 1.36
    libblkid), and is used by at least FC3 and later. The kernel doesn't
    really care about this, but the space shouldn't accidentally be used by
    something else either.

    Also make the on-disk structures be fixed-size types, instead of "int",
    though I don't know of any architecture in use where an "int" isn't the
    same size as a "__u32" (all current kernel arches have it as "unsigned
    int").

    Signed-off-by: Andreas Dilger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Dilger
     

16 May, 2006

1 commit

  • can_share_swap_page() is used to check if the page has the last reference.
    This avoids allocating a new page for COW if it's the last page.

    However, if CONFIG_SWAP is not set, can_share_swap_page() is defined as 0,
    thus always causes a copy for the last COW page. The below simple patch
    fixes it.
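
    The fix essentially makes the no-swap stub answer the mapcount question
    instead of hard-coding zero; a sketch of the !CONFIG_SWAP definition:

        /* before: never claim the page can be shared */
        /*   #define can_share_swap_page(p)     0                        */

        /* after: reuse the page if we hold the only mapping */
        #define can_share_swap_page(page)       (page_mapcount(page) == 1)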

    Signed-off-by: Hua Zhong
    Cc: David Howells
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hua Zhong
     

11 Apr, 2006

1 commit

  • These patches are an enhancement of the OVERCOMMIT_GUESS algorithm in
    __vm_enough_memory().

    - why the kernel needed patching

    When the kernel can't allocate anonymous pages in practice, the current
    OVERCOMMIT_GUESS could return success. This implementation might be
    the cause of OOM kills in memory pressure situations.

    If Linux runs with page reservation features like
    /proc/sys/vm/lowmem_reserve_ratio and without a swap region, I think
    the OOM kill occurs easily.

    - the overall design approach in the patch

    When the OVERCOMMIT_GUESS algorithm calculates the number of free pages,
    the reserved free pages are regarded as non-free pages.

    This change helps to avoid the pitfall that the number of free pages
    becomes less than the number which the kernel tries to keep free.

    - testing results

    I tested the patches using my test kernel module.

    If the patches aren't applied to the kernel, __vm_enough_memory()
    returns success in this situation but the actual page allocation
    fails.

    On the other hand, if the patches are applied to the kernel, memory
    allocation failure is avoided since __vm_enough_memory() returns
    failure in this situation.

    I checked that on i386 SMP 16GB memory machine. I haven't tested on
    nommu environment currently.

    This patch adds totalreserve_pages for __vm_enough_memory().

    calculate_totalreserve_pages() checks the maximum lowmem_reserve pages and
    pages_high in each zone. Finally, the function stores the sum over all
    zones in totalreserve_pages.

    totalreserve_pages is calculated when the VM is initialized, and the
    variable is updated when /proc/sys/vm/lowmem_reserve_ratio
    or /proc/sys/vm/min_free_kbytes is changed.
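
    A sketch of the calculation as described; the helper name comes from the
    text above, and the zone fields assumed are the usual lowmem_reserve[]
    array and pages_high watermark:

        static void calculate_totalreserve_pages(void)
        {
                struct pglist_data *pgdat;
                unsigned long reserve_pages = 0;

                for_each_online_pgdat(pgdat) {
                        int i, j;

                        for (i = 0; i < MAX_NR_ZONES; i++) {
                                struct zone *zone = pgdat->node_zones + i;
                                unsigned long max = 0;

                                /* largest reserve this zone keeps for
                                 * allocations aimed at higher zones */
                                for (j = i; j < MAX_NR_ZONES; j++)
                                        if (zone->lowmem_reserve[j] > max)
                                                max = zone->lowmem_reserve[j];

                                /* plus the watermark it tries to stay above */
                                max += zone->pages_high;
                                reserve_pages += max;
                        }
                }
                totalreserve_pages = reserve_pages;
        }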

    Signed-off-by: Hideo Aoki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hideo AOKI
     

23 Mar, 2006

1 commit

  • Introduce the low level interface that can be used for handling the
    snapshot of the system memory by the in-kernel swap-writing/reading code of
    swsusp and the userland interface code (to be introduced shortly).

    Also change the way in which swsusp records the allocated swap pages and,
    consequently, simplify the in-kernel swap-writing/reading code (this is
    necessary for the userland interface too). To this end, it introduces two
    helper functions in mm/swapfile.c, so that the swsusp code does not refer
    directly to the swap internals.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     

22 Mar, 2006

2 commits

  • Centralize the page migration functions in anticipation of additional
    tinkering. Creates a new file mm/migrate.c

    1. Extract buffer_migrate_page() from fs/buffer.c

    2. Extract central migration code from vmscan.c

    3. Extract some components from mempolicy.c

    4. Export pageout() and remove_from_swap() from vmscan.c

    5. Make it possible to configure NUMA systems without page migration
    and non-NUMA systems with page migration.

    I had to do some #ifdeffing in mempolicy.c that may need a cleanup.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Turn basically everything in vmscan.c into `unsigned long'. This is to avoid
    the possibility that some piece of code in there might decide to operate upon
    more than 4G (or even 2G) of pages in one hit.

    This might be silly, but we'll need it one day.

    Cc: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

21 Feb, 2006

1 commit

  • Some allocations are restricted to a limited set of nodes (due to memory
    policies or cpuset constraints). If the page allocator is not able to find
    enough memory then that does not mean that overall system memory is low.

    In particular going postal and more or less randomly shooting at processes
    is not likely to help the situation but may just lead to suicide (the
    whole system coming down).

    It is better to signal to the process that no memory exists given the
    constraints that the process (or the configuration of the process) has
    placed on the allocation behavior. The process may be killed but then the
    sysadmin or developer can investigate the situation. The solution is
    similar to what we do when running out of hugepages.

    This patch adds a check before we kill processes. At that point
    performance considerations do not matter much so we just scan the zonelist
    and reconstruct a list of nodes. If the list of nodes does not contain all
    online nodes then this is a constrained allocation and we should kill the
    current process.
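
    The check amounts to: collect the nodes the zonelist is allowed to use and
    compare them with the set of online nodes. A sketch using the nodemask
    helpers; the function name here is illustrative, and the real check also
    has to account for cpusets and GFP zone restrictions:

        /* non-zero means the allocation was constrained by a memory policy
         * or cpuset, so kill current instead of hunting for a victim */
        static int is_constrained_alloc(struct zonelist *zonelist)
        {
                nodemask_t nodes = NODE_MASK_NONE;
                struct zone **z;

                for (z = zonelist->zones; *z; z++)
                        node_set((*z)->zone_pgdat->node_id, nodes);

                return !nodes_subset(node_online_map, nodes);
        }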

    Signed-off-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

02 Feb, 2006

4 commits

  • Migrate a page with buffers without requiring writeback

    This introduces a new address space operation migratepage() that may be used
    by a filesystem to implement its own version of page migration.

    A version is provided that migrates buffers attached to pages. Some
    filesystems (ext2, ext3, xfs) are modified to utilize this feature.

    The swapper address space operations are modified so that a regular
    migrate_page() will occur for anonymous pages without writeback (migrate_pages
    forces every anonymous page to have a swap entry).
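
    For a filesystem, opting in is then a single address_space_operations
    entry. A sketch with a made-up filesystem; buffer_migrate_page is the
    buffer-aware helper mentioned elsewhere in this log:

        static struct address_space_operations myfs_aops = {
                .readpage       = myfs_readpage,        /* hypothetical fs ops */
                .writepage      = myfs_writepage,
                .migratepage    = buffer_migrate_page,  /* migrate w/ buffers */
        };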

    Signed-off-by: Mike Kravetz
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Add remove_from_swap

    remove_from_swap() allows the restoration of the pte entries that existed
    before page migration occurred for anonymous pages by walking the reverse
    maps. This reduces swap use and establishes regular ptes without the need
    for page faults.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Add direct migration support with fall back to swap.

    Direct migration support on top of the swap based page migration facility.

    This allows the direct migration of anonymous pages and the migration of file
    backed pages by dropping the associated buffers (requires writeout).

    Fall back to swap out if necessary.

    The patch is based on lots of patches from the hotplug project but the code
    was restructured, documented and simplified as much as possible.

    Note that an additional patch that defines the migrate_page() method for
    filesystems is necessary in order to avoid writeback for anonymous and file
    backed pages.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Mike Kravetz
    Signed-off-by: Christoph Lameter
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently the zone_reclaim code has a fixed window of 30 seconds of off node
    allocations should a local zone have no unused pagecache pages left. Reclaim
    will be attempted again after this timeout period to avoid repeated useless
    scans for memory. This is also useful to establish sufficiently large off
    node allocation chunks to relieve the local node.

    It may be beneficial to adjust that time period for some special situations.
    For example if memory use was exceeding node capacity one may want to give up
    for longer periods of time. If memory spikes intermittently then one may want
    to shorten the time period to reduce the number of off node allocations.

    This patch allows just that....

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

19 Jan, 2006

2 commits

  • Some bits for zone reclaim exist in 2.6.15 but they are not usable. This
    patch fixes them up, removes unused code and makes zone reclaim usable.

    Zone reclaim allows the reclaiming of pages from a zone if the number of
    free pages falls below the watermarks even if other zones still have enough
    pages available. Zone reclaim is of particular importance for NUMA
    machines. It can be more beneficial to reclaim a page than taking the
    performance penalties that come with allocating a page on a remote zone.

    Zone reclaim is enabled if the maximum distance to another node is higher
    than RECLAIM_DISTANCE, which may be defined by an arch. By default
    RECLAIM_DISTANCE is 20. 20 is the distance to another node in the same
    component (enclosure or motherboard) on IA64. The meaning of the NUMA
    distance information seems to vary by arch.

    If zone reclaim is not successful then no further reclaim attempts will
    occur for a certain time period (ZONE_RECLAIM_INTERVAL).

    This patch was discussed before. See

    http://marc.theaimsgroup.com/?l=linux-kernel&m=113519961504207&w=2
    http://marc.theaimsgroup.com/?l=linux-kernel&m=113408418232531&w=2
    http://marc.theaimsgroup.com/?l=linux-kernel&m=113389027420032&w=2
    http://marc.theaimsgroup.com/?l=linux-kernel&m=113380938612205&w=2

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Migration code currently does not take a reference to the target page
    properly, so between unlocking the pte and trying to take a new
    reference to the page with isolate_lru_page, anything could happen to
    it.

    Fix this by holding the pte lock until we get a chance to elevate the
    refcount.

    Other small cleanups while we're here.

    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

15 Jan, 2006

1 commit

  • Some people apparently run CONFIG_NUMA without CONFIG_SWAP. The migration
    code currently depends on swap. This patch provides a set of inline
    fallback functions so that the kernel properly compiles. However, calls to
    migration functions will fail.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

09 Jan, 2006

3 commits

  • Extend the parameters of migrate_pages() to allow the caller control over the
    fate of successfully migrated or impossible to migrate pages.

    Swap migration and direct migration will have the same interface after this
    patch so that patches can be independently applied to the policy layer and the
    core migration code.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Add gfp_mask to add_to_swap

    add_to_swap does allocations with GFP_ATOMIC in order not to interfere with
    swapping. During migration we may use add_to_swap extensively, which may
    lead to out-of-memory errors.

    This patch makes add_to_swap take a parameter that specifies the gfp mask.
    The page migration code can then make add_to_swap use GFP_KERNEL.
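
    The signature change, sketched from the description:

        /* before: allocations hard-wired to GFP_ATOMIC */
        int add_to_swap(struct page *page);

        /* after: the caller chooses; the migration path can afford to sleep */
        int add_to_swap(struct page *page, gfp_t gfp_mask);

        /* e.g. from the migration code */
        if (!add_to_swap(page, GFP_KERNEL))
                goto unlock_and_fail;           /* hypothetical error label */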

    Signed-off-by: Hirokazu Takahashi
    Signed-off-by: Dave Hansen
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Move move_to_lru, putback_lru_pages and isolate_lru into a section surrounded
    by CONFIG_MIGRATION, saving some code size for single-processor kernels.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter