09 Jan, 2006

37 commits

  • The patch makes posix_fadvise return ESPIPE on FIFO/pipe in order to be
    fully POSIX-compliant.
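
    A minimal userspace sketch of the behaviour being enforced (illustration
    only, not part of the patch):

        #include <errno.h>
        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
            int fds[2], err;

            if (pipe(fds))
                return 1;
            /* posix_fadvise() returns the error number directly; on a
             * pipe or FIFO a POSIX-compliant kernel reports ESPIPE. */
            err = posix_fadvise(fds[0], 0, 0, POSIX_FADV_SEQUENTIAL);
            printf("posix_fadvise on a pipe: %s\n",
                   err == ESPIPE ? "ESPIPE" : strerror(err));
            return 0;
        }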

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Valentine Barshak
     
  • This patch adds EXPORT_SYMBOL(filemap_write_and_wait) and uses it; see
    mm/filemap.c. It also changes filemap_write_and_wait() and
    filemap_write_and_wait_range().

    The current filemap_write_and_wait() doesn't wait if filemap_fdatawrite()
    returns an error. However, even if filemap_fdatawrite() returned an
    error, it may already have submitted some of the data pages to the device
    (e.g. in the case of -ENOSPC).

    Andrew Morton writes,

    If filemap_fdatawrite() returns an error, this might be due to some
    I/O problem: dead disk, unplugged cable, etc. Given the generally
    crappy quality of the kernel's handling of such exceptions, there's a
    good chance that the filemap_fdatawait() will get stuck in D state
    forever.

    So, this patch doesn't wait if filemap_fdatawrite() returns -EIO.

    Trond, could you please review the nfs part? In particular, I'm not sure
    whether nfs must use the "filemap_fdatawrite(inode->i_mapping) == 0"
    check or not.
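
    A simplified sketch of the resulting behaviour (not the literal patch):

        int filemap_write_and_wait(struct address_space *mapping)
        {
            int err = filemap_fdatawrite(mapping);

            /*
             * Even on error some pages may already have been submitted
             * (e.g. the -ENOSPC case), so still wait for them -- but not
             * on -EIO, where the device may never complete the I/O.
             */
            if (err != -EIO) {
                int err2 = filemap_fdatawait(mapping);
                if (!err)
                    err = err2;
            }
            return err;
        }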

    Acked-by: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • This exports and changes sync_page_range()/_nolock(). The fatfs needs
    sync_page_range()/_nolock() for expanding truncate, and the patch changes
    "size_t count" to "loff_t count".

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • Fix more of a longstanding bug in the cpuset/mempolicy interaction.

    NUMA mempolicies (mm/mempolicy.c) are constrained by the current task's
    cpuset to just the Memory Nodes allowed by that cpuset. The kernel
    maintains internal state for each mempolicy, tracking what nodes are used
    for the MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.

    When a task's cpuset memory placement changes, whether because the cpuset
    changed or because the task was attached to a different cpuset, the
    task's mempolicies have to be rebound to the new cpuset placement, so as
    to preserve the cpuset-relative numbering of the nodes in that policy.

    An earlier fix handled such mempolicy rebinding for mempolicies attached to a
    task.

    This fix rebinds mempolicies attached to vma's (address ranges in a
    task's address space). Due to the need to hold the task->mm->mmap_sem
    semaphore while updating vma's, the rebinding of vma mempolicies has to
    be done when the cpuset memory placement is changed, at which time
    mmap_sem can be safely acquired. The task's mempolicy is rebound later,
    when the task next attempts to allocate memory and notices that its
    task->cpuset_mems_generation is out of date with respect to its cpuset's
    mems_generation.

    Because walking the tasklist to find all tasks attached to a changing cpuset
    requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
    affected tasks while doing the tasklist scan. In general, one cannot acquire
    a semaphore (which can sleep) while already holding a spinlock (such as
    tasklist_lock). So a list of mm references has to be built up during the
    tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
    acquired, and the vma's in that mm rebound.

    Once the tasklist lock is dropped, affected tasks may fork new tasks, before
    their mm's are rebound. A kernel global 'cpuset_being_rebound' is set to
    point to the cpuset being rebound (there can only be one; cpuset modifications
    are done under a global 'manage_sem' semaphore), and the mpol_copy code that
    is used to copy a task's mempolicies during fork catches such forking tasks,
    and ensures their children are also rebound.

    When a task is moved to a different cpuset, it is easier, as there is only
    one task involved. Its mm->vma's are scanned, using the same
    mpol_rebind_policy() as used above.

    It may happen that both the mpol_copy hook and the update done via the
    tasklist scan update the same mm twice. This is ok, as the mempolicies of
    each vma in an mm keep track of what mems_allowed they are relative to, and
    safely no-op a second request to rebind to the same nodes.
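
    In outline, the two-phase update described above reads something like the
    following heavily simplified sketch (the direct p->cpuset test and the
    collect_mm()/next_collected_mm() list helpers are illustrative stand-ins,
    and the cpuset_being_rebound handling is omitted):

        struct task_struct *p;
        struct mm_struct *mm;
        struct vm_area_struct *vma;

        /* Phase 1: collect mm references under tasklist_lock (cannot sleep). */
        read_lock(&tasklist_lock);
        for_each_process(p) {
            if (p->cpuset != cs || !p->mm)
                continue;
            atomic_inc(&p->mm->mm_users);
            collect_mm(p->mm, &mmlist);
        }
        read_unlock(&tasklist_lock);

        /* Phase 2: tasklist_lock is dropped, so sleeping is now allowed. */
        while ((mm = next_collected_mm(&mmlist)) != NULL) {
            down_write(&mm->mmap_sem);
            for (vma = mm->mmap; vma; vma = vma->vm_next)
                mpol_rebind_policy(vma->vm_policy, &cs->mems_allowed);
            up_write(&mm->mmap_sem);
            mmput(mm);
        }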

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Clean up, reorganize and make more robust the mempolicy.c code that
    rebinds mempolicies relative to the containing cpuset after a task's
    memory placement changes.

    The real motivator for this cleanup patch is to lay more groundwork for the
    upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's
    after the containing cpuset memory placement changes.

    NUMA mempolicies are constrained by the cpuset their task is a member of.
    When either (1) a task is moved to a different cpuset, or (2) the 'mems'
    mems_allowed of a cpuset is changed, then any NUMA mempolicies with
    embedded node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED)
    need to be recalculated relative to their new cpuset placement.

    The old code used an unreliable method of determining what was the old
    mems_allowed constraining the mempolicy. It just looked at the task's
    mems_allowed value. This sort of worked with the present code, which just
    rebinds the -task- mempolicy and leaves any -vma- mempolicies broken,
    referring to the old nodes. But in an upcoming patch, the vma mempolicies
    will be rebound as well. Then the order in which the various task and vma
    mempolicies are updated will no longer be deterministic, and one can no
    longer count on task->mems_allowed holding the old value for as long as
    needed. It's not even clear that the current code was guaranteed to work
    reliably for task mempolicies.

    So I added a mems_allowed field to each mempolicy, stating exactly what
    mems_allowed the policy is relative to, and updated synchronously and reliably
    anytime that the mempolicy is rebound.

    Also removed a useless wrapper routine, numa_policy_rebind(), and had its
    caller, cpuset_update_task_memory_state(), call directly to the rewritten
    policy_rebind() routine, and made that rebind routine extern instead of
    static, and added a "mpol_" prefix to its name, making it
    mpol_rebind_policy().
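
    Conceptually the rebind then becomes a remap from the recorded old
    placement to the new one. A sketch, showing only the nodemask-carrying
    policy kinds and with illustrative field handling:

        struct mempolicy {
            /* ... existing fields ... */
            nodemask_t cpuset_mems_allowed; /* placement the embedded
                                               node numbers refer to */
        };

        void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask)
        {
            nodemask_t tmp;

            if (!pol || nodes_equal(pol->cpuset_mems_allowed, *newmask))
                return;         /* already relative to newmask */

            /* renumber nodes from the recorded old placement to the new one */
            nodes_remap(tmp, pol->v.nodes, pol->cpuset_mems_allowed, *newmask);
            pol->v.nodes = tmp;
            pol->cpuset_mems_allowed = *newmask;
        }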

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Provide a cpuset_mems_allowed() method, which the sys_migrate_pages() code
    needs in order to obtain the mems_allowed vector of a cpuset, and replace
    the workaround in sys_migrate_pages() with a call to this new method.
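
    The new interface is a small accessor along these lines (a sketch of the
    prototype):

        /* Return the memory nodes the given task's cpuset allows; takes the
         * locks it needs internally, so callers need no cpuset knowledge. */
        nodemask_t cpuset_mems_allowed(struct task_struct *p);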

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • The important code paths through alloc_pages_current() and alloc_page_vma(),
    by which most kernel page allocations go, both called
    cpuset_update_current_mems_allowed(), which in turn called refresh_mems().
    -Both- of these latter two routines took the task lock, got the task's
    cpuset pointer, and checked for an out-of-date cpuset->mems_generation.

    That was a silly duplication of code and waste of CPU cycles on an important
    code path.

    Consolidated those two routines into a single routine, called
    cpuset_update_task_memory_state(), since it updates more than just
    mems_allowed.

    Changed all callers of either routine to call the new consolidated routine.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
    at which the tasks in a cpuset call try_to_free_pages(), the synchronous
    (direct) memory reclaim code.

    This enables batch managers monitoring jobs running in dedicated cpusets to
    efficiently detect what level of memory pressure that job is causing.

    This is useful both on tightly managed systems running a wide mix of
    submitted jobs, which may choose to terminate or reprioritize jobs that
    are trying to use more memory than allowed on the nodes assigned to them,
    and with tightly coupled, long-running, massively parallel scientific
    computing jobs that will dramatically fail to meet required performance
    goals if they start to use more memory than is allowed to them.

    This patch just provides a very economical way for the batch manager to
    monitor a cpuset for signs of memory pressure. It's up to the batch
    manager or other user code to decide what to do about it and take action.

    ==> Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero. So only
    systems that enable this feature will compute the metric.

    Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm, the
    system load imposed by a batch scheduler monitoring this metric is
    sharply reduced on large systems, because a scan of the tasklist can be
    avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a single
    read, instead of having to read and accumulate results for a period of
    time.

    Because this meter is per-cpuset rather than per-task or mm, the
    batch scheduler can obtain the key information, memory pressure in a
    cpuset, with a single read, rather than having to query and accumulate
    results over all the (dynamically changing) set of tasks in the cpuset.

    A per-cpuset simple digital filter (requires a spinlock and 3 words of data
    per-cpuset) is kept, and updated by any task attached to that cpuset, if it
    enters the synchronous (direct) page reclaim code.

    A per-cpuset file provides an integer number representing the recent
    (half-life of 10 seconds) rate of direct page reclaims caused by the tasks
    in the cpuset, in units of reclaims attempted per second, times 1000.
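
    An illustrative sketch of such a decaying rate meter (constants and field
    names are made up for illustration and the scaling is simplified; this is
    not necessarily the kernel's exact filter):

        #define FM_COEF  933    /* 0.933^10 ~= 0.5  =>  ~10 second half-life */
        #define FM_SCALE 1000   /* faux fixed-point scale */

        struct fmeter {
            int        cnt;     /* events counted since the last decay step */
            int        val;     /* filtered event rate (scaled) */
            time_t     time;    /* time of the last decay step */
            spinlock_t lock;    /* guards the fields above */
        };

        /* Decay the old value once per elapsed second, then fold in new events. */
        static void fmeter_update(struct fmeter *fmp)
        {
            time_t now = get_seconds();
            time_t ticks = now - fmp->time;

            while (ticks-- > 0)
                fmp->val = (FM_COEF * fmp->val) / FM_SCALE;
            fmp->time = now;

            fmp->val += ((FM_SCALE - FM_COEF) * fmp->cnt) / FM_SCALE;
            fmp->cnt = 0;
        }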

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Finish converting mm/mempolicy.c from bitmaps to nodemasks. The previous
    conversion had left one routine using bitmaps, since it involved a
    corresponding change to kernel/cpuset.c.

    Fix that interface by replacing it with a simple macro that calls
    nodes_subset() or, if !CONFIG_CPUSET, returns (1).
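
    A sketch of the replacement (the macro name is reconstructed from the
    description and may not match the tree exactly):

        #ifdef CONFIG_CPUSETS
        #define cpuset_nodes_subset_current_mems_allowed(nodes) \
                nodes_subset((nodes), current->mems_allowed)
        #else
        #define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
        #endif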

    Signed-off-by: Paul Jackson
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • configurable replacement for slab allocator

    This adds a CONFIG_SLAB option under CONFIG_EMBEDDED. When CONFIG_SLAB is
    disabled, the kernel falls back to using the 'SLOB' allocator.

    SLOB is a traditional K&R/UNIX allocator with a SLAB emulation layer,
    similar to the original Linux kmalloc allocator that SLAB replaced. Its
    code is significantly smaller and it is more memory efficient. But like
    all similar allocators, it scales poorly and suffers from fragmentation
    more than SLAB, so it's only appropriate for small systems.
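
    To illustrate the "SLAB emulation layer" idea: under such an allocator a
    kmem_cache can be little more than a record of the object size and
    constructor, with allocation falling through to the simple general-purpose
    allocator underneath. A much-simplified sketch, not the actual slob.c code:

        struct kmem_cache {
            unsigned int size;
            const char *name;
            void (*ctor)(void *obj);
        };

        void *kmem_cache_alloc(struct kmem_cache *c, gfp_t flags)
        {
            /* no per-cache slabs: just take memory from the simple
             * first-fit kmalloc underneath */
            void *obj = kmalloc(c->size, flags);

            if (obj && c->ctor)
                c->ctor(obj);
            return obj;
        }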

    It's been tested extensively in the Linux-tiny tree. I've also
    stress-tested it with make -j 8 compiles on a 3G SMP+PREEMPT box (not
    recommended).

    Here's a comparison for otherwise identical builds, showing SLOB saving
    nearly half a megabyte of RAM:

    $ size vmlinux*
       text    data    bss      dec    hex filename
    3336372  529360 190812  4056544 3de5e0 vmlinux-slab
    3323208  527948 190684  4041840 3dac70 vmlinux-slob

    $ size mm/{slab,slob}.o
       text    data    bss      dec    hex filename
      13221     752     48    14021   36c5 mm/slab.o
       1896      52      8     1956    7a4 mm/slob.o

    /proc/meminfo:
                          SLAB        SLOB      delta
    MemTotal:         27964 kB    27980 kB     +16 kB
    MemFree:          24596 kB    25092 kB    +496 kB
    Buffers:             36 kB       36 kB       0 kB
    Cached:            1188 kB     1188 kB       0 kB
    SwapCached:           0 kB        0 kB       0 kB
    Active:             608 kB      600 kB      -8 kB
    Inactive:           808 kB      812 kB      +4 kB
    HighTotal:            0 kB        0 kB       0 kB
    HighFree:             0 kB        0 kB       0 kB
    LowTotal:         27964 kB    27980 kB     +16 kB
    LowFree:          24596 kB    25092 kB    +496 kB
    SwapTotal:            0 kB        0 kB       0 kB
    SwapFree:             0 kB        0 kB       0 kB
    Dirty:                4 kB       12 kB      +8 kB
    Writeback:            0 kB        0 kB       0 kB
    Mapped:             560 kB      556 kB      -4 kB
    Slab:              1756 kB        0 kB   -1756 kB
    CommitLimit:      13980 kB    13988 kB      +8 kB
    Committed_AS:      4208 kB     4208 kB       0 kB
    PageTables:          28 kB       28 kB       0 kB
    VmallocTotal:   1007312 kB  1007312 kB       0 kB
    VmallocUsed:         48 kB       48 kB       0 kB
    VmallocChunk:   1007264 kB  1007264 kB       0 kB

    (this work has been sponsored in part by CELF)

    From: Ingo Molnar

    Fix 32-bitness bugs in mm/slob.c.

    Signed-off-by: Matt Mackall
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • Add mm/util.c for functions common between SLAB and SLOB.

    Signed-off-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • ____cacheline_maxaligned_in_smp is currently used to align critical structures
    and avoid false sharing. It uses per-arch L1_CACHE_SHIFT_MAX and people find
    L1_CACHE_SHIFT_MAX useless.

    However, we have been using ____cacheline_maxaligned_in_smp to align
    structures on the internode cacheline size. As per Andi's suggestion,
    the following patch kills ____cacheline_maxaligned_in_smp and introduces
    INTERNODE_CACHE_SHIFT, which defaults to L1_CACHE_SHIFT for all arches.
    Arches needing L3/internode cacheline alignment can define
    INTERNODE_CACHE_SHIFT in the arch asm/cache.h. The patch replaces
    ____cacheline_maxaligned_in_smp with ____cacheline_internodealigned_in_smp.

    With this patch, L1_CACHE_SHIFT_MAX can be killed.
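
    The resulting definitions look roughly like this sketch:

        /* Arches with a larger internode (e.g. L3) cacheline can override
         * this in asm/cache.h; everyone else falls back to L1_CACHE_SHIFT. */
        #ifndef INTERNODE_CACHE_SHIFT
        #define INTERNODE_CACHE_SHIFT L1_CACHE_SHIFT
        #endif

        #if defined(CONFIG_SMP)
        #define ____cacheline_internodealigned_in_smp \
                __attribute__((__aligned__(1 << (INTERNODE_CACHE_SHIFT))))
        #else
        #define ____cacheline_internodealigned_in_smp
        #endif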

    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Signed-off-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • When oom_killer kills current there's no need to call
    schedule_timeout_interruptible() since the task must die ASAP.

    Signed-Off-By: Pavel Emelianov
    Signed-Off-By: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     
  • Group page migration functions in mempolicy.c

    Add a forward declaration for migrate_page_add (like gather_stats()) and
    use our new-found mobility to group all page-migration-related functions
    around do_migrate_pages().

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Since the numa_maps functionality is now in mempolicy.c we no longer need to
    export get_vma_policy().

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • migrate_page_add cannot be called with a spinlock held (it calls
    isolate_lru_page, which calls schedule_on_each_cpu). Drop the ptl lock in
    check_pte_range before calling migrate_page_add().

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • First discussed at http://marc.theaimsgroup.com/?t=113149255100001&r=1&w=2

    - Use the check_range() in mempolicy.c to gather statistics.

    - Improve the numa_maps code in general and fix some comments.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This was first posted at
    http://marc.theaimsgroup.com/?l=linux-mm&m=113149240227584&w=2

    (Part of this functionality is also contained in the direct migration
    patchset. The functionality here is more generic and independent of that
    patchset.)

    - Add an internal flag, MPOL_MF_INVERT, to control check_range() behavior.

    - Replace the pagelist passed through check_range() with a general
    private pointer that may be used for other purposes.
    (The following patches will use that to merge numa_maps into
    mempolicy.c and to better group the page migration code in
    the policy layer.)

    - Improve some comments.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We seem to be hitting this assertion failure too often for it to be a
    hardware bug.

    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     
  • Clean up a local variable with the same name as a variable in a larger
    block. Also move a variable into the block where it's actually used.

    Spotted by http://linuxicc.sourceforge.net/

    Signed-off-by: Tobias Klauser
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tobias Klauser
     
  • See http://marc.theaimsgroup.com/?l=linux-kernel&m=113167000201265&w=2
    http://marc.theaimsgroup.com/?l=linux-mm&m=113167267527312&w=2

    Make hugepages obey cpusets.

    Signed-off-by: Christoph Lameter
    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Use -Exxx error codes instead of numeric return codes and clean up the
    code in migrate_pages() accordingly.

    Consolidate successful migration handling.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Extend the parameters of migrate_pages() to allow the caller control over
    the fate of successfully migrated pages and of pages that are impossible
    to migrate.

    Swap migration and direct migration will have the same interface after this
    patch so that patches can be independently applied to the policy layer and the
    core migration code.
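
    After this change the interface is roughly the following (a sketch based
    on the description; exact parameter names may differ):

        /*
         * Try to migrate (or swap out) the pages on list 'l'; new target
         * pages may be taken from 't'.  Pages handled successfully are moved
         * to 'moved', pages that cannot be migrated to 'failed', leaving the
         * caller in control of their fate.
         */
        int migrate_pages(struct list_head *l, struct list_head *t,
                          struct list_head *moved, struct list_head *failed);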

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Drop unused pages immediately

    If a page is encountered that is only referenced by the migration code then
    there is no reason to swap or migrate the page. Release the page by calling
    move_to_lru().
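
    Inside the migrate_pages() loop this amounts to roughly:

        if (page_count(page) == 1) {
            /* only the migration code still references the page, so there
             * is nothing to swap or migrate; just release it via the LRU */
            move_to_lru(page);
            continue;
        }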

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Add gfp_mask to add_to_swap

    add_to_swap does allocations with GFP_ATOMIC in order not to interfere
    with swapping. During migration we may use add_to_swap extensively, which
    may lead to out-of-memory errors.

    This patch makes add_to_swap take a parameter that specifies the gfp mask.
    The page migration code can then make add_to_swap use GFP_KERNEL.
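
    The interface change is roughly:

        int add_to_swap(struct page *page, gfp_t gfp_mask);

        /* regular reclaim keeps the old, non-intrusive behaviour ... */
        add_to_swap(page, GFP_ATOMIC);

        /* ... while the migration path can afford to sleep and reclaim */
        add_to_swap(page, GFP_KERNEL);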

    Signed-off-by: Hirokazu Takahashi
    Signed-off-by: Dave Hansen
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Move move_to_lru, putback_lru_pages and isolate_lru into a section
    surrounded by CONFIG_MIGRATION, saving some code size for single-processor
    kernels.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • sys_migrate_pages implementation using swap based page migration

    This is the original API proposed by Ray Bryant in his posts during the first
    half of 2005 on linux-mm@kvack.org and linux-kernel@vger.kernel.org.

    The intent of sys_migrate_pages is to migrate the memory of a process. A
    process may have migrated to another node. Memory was allocated optimally
    for the prior context. sys_migrate_pages allows that memory to be shifted
    to the new node.

    sys_migrate_pages is also useful for manually moving a process's memory
    if the process's available memory nodes have changed through cpuset
    operations. Paul Jackson is working on an automated mechanism that will
    allow an automatic migration if the cpuset of a process is changed.
    However, a user may decide to manually control the migration.

    This implementation is put into the policy layer since it uses concepts and
    functions that are also needed for mbind and friends. The patch also provides
    a do_migrate_pages function that may be useful for cpusets to automatically
    move memory. sys_migrate_pages does not modify policies in contrast to Ray's
    implementation.

    The current code here is based on the swap based page migration capability
    and thus is not able to preserve the physical layout relative to its
    containing nodeset (which may be a cpuset). When direct page migration
    becomes available then the implementation needs to be changed to do an
    isomorphic move of pages between different nodesets. The current
    implementation simply evicts all pages in the source nodeset that are not
    in the target nodeset.

    Patch supports ia64, i386 and x86_64.
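
    The new system call takes the target task and the two node sets; a sketch
    of the prototype (argument names are illustrative):

        asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
                                          const unsigned long __user *old_nodes,
                                          const unsigned long __user *new_nodes);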

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Add page migration support via swap to the NUMA policy layer

    This patch adds page migration support to the NUMA policy layer. An
    additional flag, MPOL_MF_MOVE, is introduced for mbind. If MPOL_MF_MOVE
    is specified then pages that do not conform to the memory policy will be
    evicted from memory. When the pages are faulted back in, new pages will
    be allocated following the numa policy.
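
    From userspace this would be used along these lines (a sketch; it assumes
    libnuma's <numaif.h> for the mbind() wrapper and the MPOL_MF_MOVE flag,
    and move_to_node1() is a hypothetical helper):

        #include <numaif.h>
        #include <stdio.h>

        /* Bind an existing mapping [addr, addr+len) to node 1 and evict any
         * non-conforming pages so they are faulted back in on that node. */
        static int move_to_node1(void *addr, unsigned long len)
        {
            unsigned long nodemask = 1UL << 1;

            if (mbind(addr, len, MPOL_BIND, &nodemask,
                      sizeof(nodemask) * 8, MPOL_MF_MOVE) != 0) {
                perror("mbind");
                return -1;
            }
            return 0;
        }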

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Include page migration if the system is NUMA or has a memory model that
    allows distinct areas of memory (SPARSEMEM, DISCONTIGMEM).

    And:
    - Only include lru_add_drain_per_cpu if building for an SMP system.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This adds the basic page migration function with a minimal implementation that
    only allows the eviction of pages to swap space.

    Page eviction and migration may be useful for migrating pages, for
    suspending programs, or for remapping single pages (useful for faulty
    pages or pages with soft ECC failures).

    The process is as follows:

    The function wanting to migrate pages must first build a list of pages to
    be migrated or evicted and take them off the LRU lists via
    isolate_lru_page(). isolate_lru_page determines that a page is freeable
    based on the LRU bit being set.

    Then the actual migration or swapout can happen by calling migrate_pages().

    migrate_pages does its best to migrate or swapout the pages and does multiple
    passes over the list. Some pages may only be swappable if they are not dirty.
    migrate_pages may start writing out dirty pages in the initial passes over
    the pages. However, migrate_pages may not be able to migrate or evict all
    pages for a variety of reasons.

    The remaining pages may be returned to the LRU lists using putback_lru_pages().
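
    The resulting caller pattern is roughly the following sketch (argument
    conventions and return values are simplified, and isolate_lru_page() is
    assumed here to return nonzero on success):

        LIST_HEAD(pagelist);
        LIST_HEAD(newlist);

        /* 1. the caller takes candidate pages off the LRU lists */
        if (isolate_lru_page(page))
            list_add_tail(&page->lru, &pagelist);

        /* 2. try to migrate or swap out everything on the list; several
         *    passes are made and dirty pages may be written back first */
        migrate_pages(&pagelist, &newlist);

        /* 3. whatever could not be moved goes back onto the LRU lists */
        putback_lru_pages(&pagelist);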

    Changelog V4->V5:
    - Use the lru caches to return pages to the LRU

    Changelog V3->V4:
    - Restructure code so that applying patches to support full migration
    requires only minimal changes. Rename swapout_pages() to migrate_pages().

    Changelog V2->V3:
    - Extract common code from shrink_list() and swapout_pages()

    Signed-off-by: Mike Kravetz
    Signed-off-by: Christoph Lameter
    Cc: "Michael Kerrisk"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Add PF_SWAPWRITE to control a process's permission to write to swap.

    - Use PF_SWAPWRITE in may_write_to_queue() instead of checking for kswapd
    and pdflush

    - Set PF_SWAPWRITE flag for kswapd and pdflush
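
    In may_write_to_queue() the check then reads roughly:

        static int may_write_to_queue(struct backing_dev_info *bdi)
        {
            /* kswapd and pdflush now simply carry PF_SWAPWRITE instead of
             * being special-cased here */
            if (current->flags & PF_SWAPWRITE)
                return 1;
            if (!bdi_write_congested(bdi))
                return 1;
            if (bdi == current->backing_dev_info)
                return 1;
            return 0;
        }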

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This is the start of the `swap migration' patch series.

    Swap migration allows the moving of the physical location of pages between
    nodes in a numa system while the process is running. This means that the
    virtual addresses that the process sees do not change. However, the system
    rearranges the physical location of those pages.

    The main intent of page migration patches here is to reduce the latency of
    memory access by moving pages near to the processor where the process
    accessing that memory is running.

    The patchset allows a process to manually relocate the node on which its
    pages are located through the MF_MOVE and MF_MOVE_ALL options while
    setting a new memory policy.

    The pages of a process can also be relocated from another process using
    the sys_migrate_pages() function call, which requires CAP_SYS_ADMIN. The
    migrate_pages function call takes two sets of nodes and moves the pages of
    a process that are located on the 'from' nodes to the destination nodes.

    Manual migration is very useful if for example the scheduler has relocated a
    process to a processor on a distant node. A batch scheduler or an
    administrator can detect the situation and move the pages of the process
    nearer to the new processor.

    sys_migrate_pages() could be used on non-numa machines as well, to force
    all of a particular process's pages out to swap, if someone thinks that's
    useful.

    Larger installations usually partition the system using cpusets into sections
    of nodes. Paul has equipped cpusets with the ability to move pages when a
    task is moved to another cpuset. This allows automatic control over locality
    of a process. If a task is moved to a new cpuset then all its pages are
    also moved with it so that the performance of the process does not sink
    dramatically (as is the case today).

    Swap migration works by simply evicting the page. The pages must be faulted
    back in. The pages are then typically reallocated by the system near the node
    where the process is executing.

    For swap migration the destination of the move is controlled by the allocation
    policy. Cpusets set the allocation policy before calling sys_migrate_pages()
    in order to move the pages as intended.

    No allocation policy changes are performed for sys_migrate_pages(). This
    means that the pages may not be faulted in to the specified nodes if no
    allocation policy was set by other means. The pages will just end up near
    the node where the fault occurred.

    There's another patch series in the pipeline which implements "direct
    migration".

    The direct migration patchset extends the migration functionality to avoid
    going through swap. The destination node of the relocation is controllable
    during the actual moving of pages. The crutch of using the allocation
    policy to relocate is not necessary and the pages are moved directly to
    the target. It's also faster since swap is not used.

    And sys_migrate_pages() can then move pages directly to the specified node.
    Implement functions to isolate pages from the LRU and put them back later.

    This patch:

    An earlier implementation was provided by Hirokazu Takahashi
    and IWAMOTO Toshihiro for the
    memory hotplug project.

    From: Magnus

    This breaks out isolate_lru_page() and putback_lru_page(). Needed for swap
    migration.

    Signed-off-by: Magnus Damm
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Try to streamline free_pages_bulk by ensuring callers don't pass in a
    'count' that exceeds the list size.

    Some cleanups:
    Rename __free_pages_bulk to __free_one_page.
    Put the page list manipulation from __free_pages_ok into free_one_page.
    Make __free_pages_ok static.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Use zone_pcp everywhere even though NUMA code "knows" the internal details
    of the zone. It stops other people from trying to copy that, and it looks
    nicer.

    Also, only print the pagesets of online cpus in zoneinfo.
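
    The accessor hides the NUMA/non-NUMA difference, roughly:

        #ifdef CONFIG_NUMA
        #define zone_pcp(__z, __cpu)  ((__z)->pageset[(__cpu)])   /* per-node allocation */
        #else
        #define zone_pcp(__z, __cpu)  (&(__z)->pageset[(__cpu)])  /* static array */
        #endif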

    Signed-off-by: Nick Piggin
    Cc: "Seth, Rohit"
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Recently there has been a lot of traffic about the right values for the
    batch and high water marks for per_cpu_pagelists. This patch makes these
    two variables configurable through a /proc interface.

    A new tunable, /proc/sys/vm/percpu_pagelist_fraction, is added. This entry
    controls, for each zone, the maximum fraction of pages that may be
    allocated to each per-cpu page list. The minimum value for this is 8,
    meaning that we don't allow more than 1/8th of the pages in each zone to
    be allocated in any single per_cpu_pagelist.

    The batch value of each per-cpu pagelist is also updated as a result. It
    is set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8).
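
    The resulting sizing is, in outline (a sketch of the arithmetic described
    above):

        unsigned long high, batch;

        /* at most 1/percpu_pagelist_fraction of the zone on any one list */
        high = zone->present_pages / percpu_pagelist_fraction;

        /* refill/drain in chunks of a quarter of that, with an upper bound */
        batch = max(1UL, high / 4);
        if (batch > (PAGE_SHIFT * 8))
            batch = PAGE_SHIFT * 8;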

    Signed-off-by: Rohit Seth
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rohit Seth
     
  • Add /proc/sys/vm/drop_caches. When written to, this will cause the kernel
    to discard as much pagecache and/or reclaimable slab objects as it can.
    This operation requires root permissions.

    It won't drop dirty data, so the user should run `sync' first.

    Caveats:

    a) Holds inode_lock for exorbitant amounts of time.

    b) Needs to be taught about NUMA nodes: propagate these all the way through
    so the discarding can be controlled on a per-node basis.

    This is a debugging feature: useful for getting consistent results between
    filesystem benchmarks. We could possibly put it under a config option, but
    it's less than 300 bytes.
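
    For example, a benchmark harness might do the equivalent of the following
    (a userspace sketch; the value 3 requesting both pagecache and slab is an
    assumption about the interface):

        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            FILE *f;

            sync();  /* write back dirty data first; drop_caches won't */

            f = fopen("/proc/sys/vm/drop_caches", "w");
            if (!f)
                return 1;
            fputs("3\n", f);  /* assumed: 1 = pagecache, 2 = slab, 3 = both */
            return fclose(f) ? 1 : 0;
        }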

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • For some reason there is an #ifdef CONFIG_NUMA within another #ifdef
    CONFIG_NUMA in the page allocator. Remove the innermost #ifdef
    CONFIG_NUMA.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter