26 Jun, 2006

1 commit

  • Hooks for calling vma specific migration functions

    With this patch a vma may define a vma->vm_ops->migrate function. That
    function may perform page migration on its own (some vmas may not contain
    page structs and therefore cannot be handled by regular page migration;
    pages in a vma may require special preparatory treatment before migration
    is possible; etc.). Only mmap_sem is held when the migration function is
    called. The migrate() function gets passed two nodemasks describing the
    source and the target of the migration. The flags parameter contains either

    MPOL_MF_MOVE which means that only pages used exclusively by
    the specified mm should be moved

    or

    MPOL_MF_MOVE_ALL which means that pages shared with other processes
    should also be moved.

    The migration function returns 0 on success or an error condition. An error
    condition will prevent regular page migration from occurring.

    On its own this patch cannot be included since there are no users for this
    functionality. But it seems that the uncached allocator will need this
    functionality at some point.
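
    As a rough sketch of the hook's shape (the log above only describes what
    information is passed, so the exact prototype and parameter names below
    are assumptions):

        /* Assumed prototype of the new vm_ops hook.  Called with mmap_sem
         * held; migrates the vma's pages from the 'from' nodes to the 'to'
         * nodes.  'flags' carries MPOL_MF_MOVE or MPOL_MF_MOVE_ALL.  Returns
         * 0 on success or a negative error code, which suppresses regular
         * page migration for this vma. */
        int (*migrate)(struct vm_area_struct *vma,
                       const nodemask_t *from, const nodemask_t *to,
                       unsigned long flags);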

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

23 Jun, 2006

4 commits

  • This patch inserts security_task_movememory hook calls into memory management
    code to enable security modules to mediate this operation between tasks.

    Since the last posting, the hook has been renamed following feedback from
    Christoph Lameter.
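
    For illustration, the usual pattern for consulting such an LSM hook before
    acting on another task's memory looks roughly like this (the actual call
    sites and error labels in the memory management code are not shown here):

        /* Sketch only: bail out if a loaded security module objects to the
         * current task moving the target task's memory. */
        err = security_task_movememory(task);
        if (err)
                goto out;       /* propagate the module's error to the caller */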

    Signed-off-by: David Quigley
    Acked-by: Stephen Smalley
    Signed-off-by: James Morris
    Cc: Andi Kleen
    Acked-by: Christoph Lameter
    Acked-by: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Quigley
     
  • move_pages() is used to move individual pages of a process. The function can
    be used to determine the location of pages and to move them onto the desired
    node. move_pages() returns status information for each page.

    long move_pages(pid, number_of_pages_to_move,
                    addresses_of_pages[],
                    nodes[] or NULL,
                    status[],
                    flags);

    The addresses_of_pages argument is an array of void * pointing to the
    pages to be moved.

    The nodes array contains the node numbers that the pages should be moved
    to. If a NULL is passed instead of an array then no pages are moved but
    the status array is updated. The status request may be used to determine
    the page state before issuing another move_pages() to move pages.

    The status array will contain the state of all individual page migration
    attempts when the function terminates. The status array is only valid if
    move_pages() completed successfully.

    Possible page states in status[]:

    0..MAX_NUMNODES The page is now on the indicated node.

    -ENOENT Page is not present

    -EACCES Page is mapped by multiple processes and can only
    be moved if MPOL_MF_MOVE_ALL is specified.

    -EPERM The page has been mlocked by a process/driver and
    cannot be moved.

    -EBUSY Page is busy and cannot be moved. Try again later.

    -EFAULT Invalid address (no VMA or zero page).

    -ENOMEM Unable to allocate memory on target node.

    -EIO Unable to write back page. The page must be written
    back in order to move it since the page is dirty and the
    filesystem does not provide a migration function that
    would allow the moving of dirty pages.

    -EINVAL A dirty page cannot be moved. The filesystem does not provide
    a migration function and has no ability to write back pages.

    The flags parameter indicates what types of pages to move:

    MPOL_MF_MOVE Move pages that are only mapped by the process.

    MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes.
    Requires sufficient capabilities.

    Possible return codes from move_pages():

    -ENOENT No pages found that would require moving. All pages
    are either already on the target node, not present, had an
    invalid address or could not be moved because they were
    mapped by multiple processes.

    -EINVAL Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt
    to migrate pages of a kernel thread.

    -EPERM MPOL_MF_MOVE_ALL specified without sufficient privileges,
    or an attempt to move pages of a process belonging to another user.

    -EACCES One of the target nodes is not allowed by the current cpuset.

    -ENODEV One of the target nodes is not online.

    -ESRCH Process does not exist.

    -E2BIG Too many pages to move.

    -ENOMEM Not enough memory to allocate control array.

    -EFAULT Parameters could not be accessed.

    A test program for move_pages() may be found with the patches
    on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3
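
    A minimal user-space sketch of the interface described above (this assumes
    the libnuma wrapper declared in <numaif.h>; link with -lnuma; error
    handling is trimmed):

        #include <numaif.h>          /* move_pages(), MPOL_MF_MOVE */
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
                long pagesize = sysconf(_SC_PAGESIZE);
                void *pages[1];
                int nodes[1] = { 1 };        /* ask for a move to node 1 */
                int status[1];

                /* One anonymous page, touched so that it is actually present. */
                void *buf = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                memset(buf, 0, pagesize);
                pages[0] = buf;

                /* pid 0 means the calling process; passing NULL instead of
                 * nodes[] would only query the current placement. */
                if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) != 0)
                        perror("move_pages");
                else
                        printf("status[0] = %d (node or negative error)\n",
                               status[0]);
                return 0;
        }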

    From: Christoph Lameter

    Detailed results for sys_move_pages()

    Pass a pointer to an integer to get_new_page() that may be used to
    indicate where the completion status of a migration operation should be
    placed. This allows sys_move_pages() to report back exactly what happened to
    each page.

    Wish there would be a better way to do this. Looks a bit hacky.
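
    A sketch of what that looks like (the bookkeeping structure and its fields
    are assumed here, not necessarily the actual ones):

        /* Allocation callback for migrate_pages(): besides returning a new
         * page on the requested node, it hands back a pointer into the
         * per-page bookkeeping so the migration core can record the outcome
         * (target node number or a negative error) for sys_move_pages(). */
        static struct page *new_page_node(struct page *p, unsigned long private,
                                          int **result)
        {
                struct page_to_node *pm = (struct page_to_node *)private;

                *result = &pm->status;
                return alloc_pages_node(pm->node, GFP_HIGHUSER, 0);
        }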

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Jes Sorensen
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Instead of passing a list of new pages, pass a function to allocate a new
    page. This allows the correct placement of MPOL_INTERLEAVE pages during page
    migration. It also further simplifies the callers of migrate_pages().
    migrate_pages() becomes similar to migrate_pages_to(), so drop
    migrate_pages_to(). The batching of new page allocations becomes unnecessary.
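
    In rough terms the reworked interface has this shape (exact types and
    parameter names are assumed):

        /* The caller supplies an allocation callback that is invoked once per
         * page to be migrated, so the placement decision (e.g. for
         * MPOL_INTERLEAVE) is made at the moment each new page is needed. */
        typedef struct page *new_page_t(struct page *page, unsigned long private,
                                        int **result);

        int migrate_pages(struct list_head *from, new_page_t get_new_page,
                          unsigned long private);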

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Jes Sorensen
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Do not leave pages on the lists passed to migrate_pages(). Seems that we will
    not need any postprocessing of pages. This will simplify the handling of
    pages by the callers of migrate_pages().

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Jes Sorensen
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

24 Mar, 2006

1 commit

  • The hooks in the slab cache allocator code path for support of NUMA
    mempolicies and cpuset memory spreading are in an important code path. Many
    systems will use neither feature.

    This patch optimizes those hooks down to a single check of some bits in the
    current task's task_struct flags. For non-NUMA systems, these hooks and the
    related code are already ifdef'd out.

    The optimization is done by using another task flag, set if the task is using
    a non-default NUMA mempolicy. Taking this flag bit along with the
    PF_SPREAD_PAGE and PF_SPREAD_SLAB flag bits added earlier in this 'cpuset
    memory spreading' patch set, one can check for the combination of any of these
    special case memory placement mechanisms with a single test of the current
    task's task_struct flags.

    This patch also tightens up the code, to save a few bytes of kernel text
    space, and moves some of it out of line. Due to the nested inlines called
    from multiple places, we were ending up with three copies of this code, which
    once we get off the main code path (for local node allocation) seems a bit
    wasteful of instruction memory.
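
    The fast-path test then reduces to something along these lines (the new
    flag name and the out-of-line helper are assumptions):

        /* Slab allocation path: a single flags test decides whether any
         * special placement mechanism is active for this task (the page
         * cache path tests PF_SPREAD_PAGE analogously). */
        if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
                return alternate_node_alloc(cachep, flags);  /* out of line */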

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

22 Mar, 2006

2 commits

  • Centralize the page migration functions in anticipation of additional
    tinkering. Creates a new file mm/migrate.c

    1. Extract buffer_migrate_page() from fs/buffer.c

    2. Extract central migration code from vmscan.c

    3. Extract some components from mempolicy.c

    4. Export pageout() and remove_from_swap() from vmscan.c

    5. Make it possible to configure NUMA systems without page migration
    and non-NUMA systems with page migration.

    I had to do some #ifdeffing in mempolicy.c that may need a cleanup.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We have struct kmem_cache now so use it instead of the old typedef.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     

15 Mar, 2006

1 commit

  • It seems that setting scheduling policy and priorities is also the kind of
    thing that might be performed in apps that also use the NUMA API, so it
    would seem consistent to use CAP_SYS_NICE for NUMA also.

    So use CAP_SYS_NICE for controlling migration permissions.
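
    In other words, the permission test becomes a capability check of roughly
    this form (the surrounding ownership checks are omitted):

        /* Moving another process's pages now requires CAP_SYS_NICE, the same
         * capability used for setting scheduling policy and priorities. */
        if (!capable(CAP_SYS_NICE))
                return -EPERM;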

    Signed-off-by: Christoph Lameter
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

07 Mar, 2006

1 commit

  • Change the format of numa_maps to be more compact and contain additional
    information that is useful for managing and troubleshooting memory on a
    NUMA system. Numa_maps can now also support huge pages.

    Fixes:

    1. More compact format. Only display fields if they contain additional
    information.

    2. Always display information for all vmas. The old numa_maps did not display
    vmas with no mapped entries. This was a bit confusing because page
    migration removes ptes for file backed vmas, so after page migration
    some of the vmas seemed to vanish.

    3. Rename maxref to mapmax. This is the maximum mapcount of all the pages
    in a vma and may be used as an indicator of how many processes
    may be using a certain vma.

    4. Include the ability to scan over huge page vmas.

    New items shown:

    dirty
    Number of pages in a vma that have either the dirty bit set in the
    page_struct or in the pte.

    file=
    The file backing the pages if any

    stack
    Stack area

    heap
    Heap area

    huge
    Huge page area. The page count shown is the number of huge
    pages, not the number of regular sized pages.

    swapcache
    Number of pages with swap references. Must be >0 in order to
    be shown.

    active
    Number of active pages. Only displayed if different from the number
    of pages mapped.

    writeback
    Number of pages under writeback. Only displayed if >0.

    Sample output of a process using huge pages:

    00000000 default
    2000000000000000 default file=/lib/ld-2.3.90.so mapped=13 mapmax=30 N0=13
    2000000000044000 default file=/lib/ld-2.3.90.so anon=2 dirty=2 swapcache=2 N2=2
    2000000000064000 default file=/lib/librt-2.3.90.so mapped=2 active=1 N1=1 N3=1
    2000000000074000 default file=/lib/librt-2.3.90.so
    2000000000080000 default file=/lib/librt-2.3.90.so anon=1 swapcache=1 N2=1
    2000000000084000 default
    2000000000088000 default file=/lib/libc-2.3.90.so mapped=52 mapmax=32 active=48 N0=52
    20000000002bc000 default file=/lib/libc-2.3.90.so
    20000000002c8000 default file=/lib/libc-2.3.90.so anon=3 dirty=2 swapcache=3 active=2 N1=1 N2=2
    20000000002d4000 default anon=1 swapcache=1 N1=1
    20000000002d8000 default file=/lib/libpthread-2.3.90.so mapped=8 mapmax=3 active=7 N2=2 N3=6
    20000000002fc000 default file=/lib/libpthread-2.3.90.so
    2000000000308000 default file=/lib/libpthread-2.3.90.so anon=1 dirty=1 swapcache=1 N1=1
    200000000030c000 default anon=1 dirty=1 swapcache=1 N1=1
    2000000000320000 default anon=1 dirty=1 N1=1
    200000000071c000 default
    2000000000720000 default anon=2 dirty=2 swapcache=1 N1=1 N2=1
    2000000000f1c000 default
    2000000000f20000 default anon=2 dirty=2 swapcache=1 active=1 N2=1 N3=1
    200000000171c000 default
    2000000001720000 default anon=1 dirty=1 swapcache=1 N1=1
    2000000001b20000 default
    2000000001b38000 default file=/lib/libgcc_s.so.1 mapped=2 N1=2
    2000000001b48000 default file=/lib/libgcc_s.so.1
    2000000001b54000 default file=/lib/libgcc_s.so.1 anon=1 dirty=1 active=0 N1=1
    2000000001b58000 default file=/lib/libunwind.so.7.0.0 mapped=2 active=1 N1=2
    2000000001b74000 default file=/lib/libunwind.so.7.0.0
    2000000001b80000 default file=/lib/libunwind.so.7.0.0
    2000000001b84000 default
    4000000000000000 default file=/media/huge/test9 mapped=1 N1=1
    6000000000000000 default file=/media/huge/test9 anon=1 dirty=1 active=0 N1=1
    6000000000004000 default heap
    607fffff7fffc000 default anon=1 dirty=1 swapcache=1 N2=1
    607fffffff06c000 default stack anon=1 dirty=1 active=0 N1=1
    8000000060000000 default file=/mnt/huge/test0 huge dirty=3 N1=3
    8000000090000000 default file=/mnt/huge/test1 huge dirty=3 N0=1 N2=2
    80000000c0000000 default file=/mnt/huge/test2 huge dirty=3 N1=1 N3=2

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

03 Mar, 2006

1 commit

  • numa_maps should not scan over huge vmas in order not to cause problems for
    non-IA64 platforms that may have pte entries pointing to huge pages in a
    variety of ways in their page tables. Add a simple check to ignore vmas
    containing huge pages.
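
    The guard is essentially of this form (is_vm_hugetlb_page() is the usual
    predicate for this; the exact place in the scanning code is not shown):

        /* Do not walk the ptes of huge page vmas: their page table layout is
         * architecture specific and not safe to interpret generically. */
        if (is_vm_hugetlb_page(vma))
                return;         /* skip this vma entirely */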

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

25 Feb, 2006

1 commit

  • migrate_pages_to() allocates a list of new pages on the intended target
    node or with the intended policy and then uses the list of new pages as
    targets for the migration of a list of pages out of place.

    When the pages are allocated it is not clear which of the out of place
    pages will be moved to the new pages, so we cannot specify an address as
    needed by alloc_page_vma(). This causes a problem for MPOL_INTERLEAVE,
    which will currently allocate all the pages on the first node of the set.
    If mbind is used on a vma that has an MPOL_INTERLEAVE policy, the
    interleaving of pages may be destroyed.

    This patch fixes that by generating a fake address for each alloc_page_vma()
    call, which results in a distribution of pages as prescribed by
    MPOL_INTERLEAVE.

    Lee also noted that the sequence of nodes for the new pages seems to be
    inverted. So we also invert the way the lists of pages for migration are
    built.
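
    A sketch of the idea (variable names assumed): each new page is allocated
    with a different synthetic address, so the interleave hash in
    alloc_page_vma() advances from node to node instead of always resolving to
    the first node of the set.

        unsigned long fake_addr = vma->vm_start;

        list_for_each_entry(page, pagelist, lru) {
                /* a distinct "address" per page => a distinct interleave node */
                new = alloc_page_vma(GFP_HIGHUSER, vma, fake_addr);
                fake_addr += PAGE_SIZE;
                if (new)
                        list_add_tail(&new->lru, &newlist);
        }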

    Signed-off-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Looks-ok-to: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

18 Feb, 2006

2 commits

  • Make sure maxnodes is a safe size before calculating nlongs in
    get_nodes().
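
    The shape of the check (the limit shown is illustrative):

        /* Reject absurd maxnode values before computing the number of longs
         * to copy, so nlongs cannot overflow or overrun the kernel buffer. */
        if (maxnode > PAGE_SIZE * BITS_PER_BYTE)
                return -EINVAL;
        nlongs = BITS_TO_LONGS(maxnode);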

    Signed-off-by: Chris Wright
    Signed-off-by: Linus Torvalds

    Chris Wright
     
  • The memory allocator doesn't like empty zones (which have an
    uninitialized freelist), so an x86-64 system with a node that lies entirely
    in GFP_DMA32 would crash on mbind.

    Fix that up by putting all possible zones as fallback into the zonelist
    and skipping the empty ones.

    In fact the code always allocated enough space for all zones, but only
    used it for the highest one. This change just uses all the memory that
    was allocated before.

    This should work fine for now, but whoever implements node hot removal
    needs to fix this somewhere else too (or make sure the zone data structures
    themselves never go away, only their memory).
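
    The construction of the MPOL_BIND zonelist then looks roughly like this
    (iteration details simplified):

        for (k = policy_zone; k >= 0; k--) {            /* every zone type */
                for_each_node_mask(nid, *nodes) {
                        struct zone *z = &NODE_DATA(nid)->node_zones[k];

                        /* append as fallback, but skip zones with no pages:
                         * their freelists were never initialized */
                        if (z->present_pages > 0)
                                zl->zones[num++] = z;
                }
        }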

    Signed-off-by: Andi Kleen
    Acked-by: Christoph Lameter
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

05 Feb, 2006

1 commit

  • > mm/mempolicy.c: In function `huge_zonelist':
    > mm/mempolicy.c:1045: error: `HPAGE_SHIFT' undeclared (first use in this function)
    > mm/mempolicy.c:1045: error: (Each undeclared identifier is reported only once
    > mm/mempolicy.c:1045: error: for each function it appears in.)
    > make[1]: *** [mm/mempolicy.o] Error 1

    Need to wrap huge_zonelist function with CONFIG_HUGETLBFS.
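
    The fix therefore has this shape (assuming the huge_zonelist() prototype
    of that era; the function body itself is unchanged):

        #ifdef CONFIG_HUGETLBFS
        /* huge_zonelist() uses HPAGE_SHIFT, which only exists when hugetlbfs
         * support is configured in, so only build it in that case. */
        struct zonelist *huge_zonelist(struct vm_area_struct *vma,
                                       unsigned long addr)
        {
                /* ... unchanged body ... */
        }
        #endif /* CONFIG_HUGETLBFS */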

    Signed-off-by: Ken Chen
    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     

02 Feb, 2006

1 commit

  • Modify policy layer to support direct page migration

    - Add migrate_pages_to(), allowing the migration of a list of pages to a
    specified node or to a vma with a specific allocation policy, in sets of
    MIGRATE_CHUNK_SIZE pages

    - Modify do_migrate_pages() to do a staged move of pages from the source
    nodes to the target nodes.

    Signed-off-by: Paul Jackson
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

19 Jan, 2006

4 commits

  • Move the interrupt check from slab_node into ___cache_alloc and add an
    "unlikely()" to avoid pipeline stalls on some architectures.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patch fixes a regression in 2.6.14 against 2.6.13 that causes an
    imbalance in memory allocation during bootup.

    The slab allocator in 2.6.13 is not NUMA aware and simply calls
    alloc_pages(). This means that memory policies may control the behavior of
    alloc_pages(). During bootup the memory policy is set to MPOL_INTERLEAVE,
    resulting in the spreading out of allocations during bootup over all
    available nodes. The slab allocator in 2.6.13 has only a single list of
    slab pages. As a result the per cpu slab cache and the spinlock controlled
    page lists may contain slab entries from off-node memory. The slab
    allocator in 2.6.13 makes no effort to discern the locality of an entry on
    its lists.

    The NUMA aware slab allocator in 2.6.14 controls locality of the slab pages
    explicitly by calling alloc_pages_node(). The NUMA slab allocator manages
    slab entries by having lists of available slab pages for each node. The
    per cpu slab cache can only contain slab entries associated with the node
    local to the processor. This guarantees that the default allocation mode
    of the slab allocator always assigns local memory if available.

    Setting MPOL_INTERLEAVE as a default policy during bootup has no effect
    anymore. In 2.6.14 all node-unspecific slab allocations are performed on
    the boot processor. This means that most of the key data structures are
    allocated on one node. Most processors will have to refer to these
    structures, making the boot node a potential bottleneck. This may reduce
    performance and cause unnecessary memory pressure on the boot node.

    This patch implements NUMA policies in the slab layer. The slab allocator
    itself needs to apply NUMA memory policies explicitly, since the NUMA slab
    allocator no longer lets the page allocator control locality.

    The check for policies is made directly at the beginning of __cache_alloc
    using current->mempolicy. The memory policy is already frequently checked
    by the page allocator (alloc_page_vma() and alloc_pages_current()). So it
    is highly likely that the cacheline is present. For MPOL_INTERLEAVE
    kmalloc() will spread out each request to one node after another so that an
    equal distribution of allocations can be obtained during bootup.

    It is not possible to push the policy check into lower layers of the NUMA
    slab allocator since the per cpu caches now only contain slab entries from
    the current node. If the policy says that the local node is not preferred
    or is forbidden, then there is no point in checking the per cpu slab cache
    or the local list of slab pages. The allocation is better directed
    immediately to the lists containing slab entries for the allowed set of
    nodes.

    This way of applying policy also fixes another strange behavior in 2.6.13.
    alloc_pages() is controlled by the memory allocation policy of the current
    process. It could therefore be that one process is running with
    MPOL_INTERLEAVE and would, for example, obtain a new page following that
    policy because no slab entries are left in the lists. A page can typically
    hold multiple slab entries, but let's say that the current process is only
    using one. The other entries are then added to the slab lists. These are
    now non-local entries in the slab lists despite the possible availability
    of local pages that would provide faster access and increase the
    performance of the application.

    Another process without MPOL_INTERLEAVE may now run and expect a local slab
    entry from kmalloc(). However, the cache still holds the free slab entries
    from the off-node page that the other process obtained via MPOL_INTERLEAVE.
    The process will then get an off-node slab entry although other slab
    entries may be available that are local to it. This means that the policy
    of one process may contaminate the locality of the slab caches seen by
    other processes.

    This patch in effect ensures that a per-process policy is followed for the
    allocation of slab entries and that one process's memory policy cannot
    influence another. A process with the default policy will always get a
    local slab entry if one is available, and a process using memory policies
    will get its memory arranged as requested. Off-node slab allocation
    requires the use of spinlocks and rules out the per cpu caches, so a
    process using memory policies to redirect allocations off-node has to cope
    with additional lock overhead on top of the latency added by accessing a
    remote slab entry.
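
    The check at the top of __cache_alloc is, in outline (helper names as
    described above, details simplified):

        static inline void *__cache_alloc(struct kmem_cache *cachep, gfp_t flags)
        {
                /* Apply the task's mempolicy before touching the per cpu
                 * cache; interrupts have no process context and keep the
                 * default local-node behavior. */
                if (unlikely(current->mempolicy && !in_interrupt())) {
                        int nid = slab_node(current->mempolicy);

                        if (nid != numa_node_id())
                                return __cache_alloc_node(cachep, flags, nid);
                }
                /* ... normal per cpu fast path ... */
        }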

    Changes V1->V2
    - Remove #ifdef CONFIG_NUMA by moving forward declaration into
    prior #ifdef CONFIG_NUMA section.

    - Give the function determining the node number to use a saner
    name.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Simplify migrate_page_add after feedback from Hugh. This also allows us to
    drop one parameter from migrate_page_add.

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The migration code currently does not take a reference to the target page
    properly, so between unlocking the pte and trying to take a new
    reference to the page with isolate_lru_page, anything could happen to
    it.

    Fix this by holding the pte lock until we get a chance to elevate the
    refcount.

    Other small cleanups while we're here.
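
    The ordering that the fix establishes, in simplified form:

        /* Take the reference while the pte lock still pins the page; only
         * then drop the lock and pull the page off the LRU. */
        get_page(page);                 /* under pte_lock: page cannot go away */
        pte_unmap_unlock(pte, ptl);
        isolate_lru_page(page);         /* safe now that we hold a reference */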

    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

15 Jan, 2006

1 commit

  • Anything that writes into a tmpfs filesystem is liable to disproportionately
    decrease the available memory on a particular node. Since there's no telling
    what sort of application (e.g. dd/cp/cat) might be dropping large files
    there, this lets the admin choose the appropriate default behavior for their
    site's situation.

    Introduce a tmpfs mount option which allows specifying a memory policy and
    a second option to specify the nodelist for that policy. With the default
    policy, tmpfs will behave as it does today. This patch adds support for
    preferred, bind, and interleave policies.

    The default policy will cause pages to be added to tmpfs files on the node
    which is doing the writing. Some jobs expect a single process to create
    and manage the tmpfs files. This results in a node which has a
    significantly reduced number of free pages.

    With this patch, the administrator can specify the policy and nodes for
    that policy where they would prefer allocations.

    This patch was originally written by Brent Casavant and Hugh Dickins. I
    added support for the bind and preferred policies and the mpol_nodelist
    mount option.

    Signed-off-by: Brent Casavant
    Signed-off-by: Hugh Dickins
    Signed-off-by: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

09 Jan, 2006

11 commits

  • Fix more of a longstanding bug in the cpuset/mempolicy interaction.

    NUMA mempolicies (mm/mempolicy.c) are constrained by the current task's
    cpuset to just the Memory Nodes allowed by that cpuset. The kernel maintains
    internal state for each mempolicy, tracking what nodes are used for the
    MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.

    When a task's cpuset memory placement changes, whether because the cpuset
    changed or because the task was attached to a different cpuset, the task's
    mempolicies have to be rebound to the new cpuset placement, so as to
    preserve the cpuset-relative numbering of the nodes in that policy.

    An earlier fix handled such mempolicy rebinding for mempolicies attached to a
    task.

    This fix rebinds mempolicies attached to vma's (address ranges in a task's
    address space). Due to the need to hold the task->mm->mmap_sem semaphore while
    updating vma's, the rebinding of vma mempolicies has to be done when the
    cpuset memory placement is changed, at which time mmap_sem can be safely
    acquired. The task's mempolicy is rebound later, when the task next attempts
    to allocate memory and notices that its task->cpuset_mems_generation is
    out-of-date with its cpuset's mems_generation.

    Because walking the tasklist to find all tasks attached to a changing cpuset
    requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
    affected tasks while doing the tasklist scan. In general, one cannot acquire
    a semaphore (which can sleep) while already holding a spinlock (such as
    tasklist_lock). So a list of mm references has to be built up during the
    tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
    acquired, and the vma's in that mm rebound.

    Once the tasklist lock is dropped, affected tasks may fork new tasks, before
    their mm's are rebound. A kernel global 'cpuset_being_rebound' is set to
    point to the cpuset being rebound (there can only be one; cpuset modifications
    are done under a global 'manage_sem' semaphore), and the mpol_copy code that
    is used to copy a task's mempolicies during fork catches such forking tasks,
    and ensures their children are also rebound.

    When a task is moved to a different cpuset, it is easier, as there is only one
    task involved. Its mm->vma's are scanned, using the same
    mpol_rebind_policy() as used above.

    It may happen that both the mpol_copy hook and the update done via the
    tasklist scan update the same mm twice. This is ok, as the mempolicies of
    each vma in an mm keep track of what mems_allowed they are relative to, and
    safely no-op a second request to rebind to the same nodes.
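
    The two-phase update described above looks roughly like this (helper and
    field names are simplified, not the exact code):

        /* Phase 1: under the tasklist spinlock, only collect mm references. */
        read_lock(&tasklist_lock);
        do_each_thread(g, p) {
                if (p->cpuset == cs) {
                        struct mm_struct *mm = get_task_mm(p);  /* no sleeping */
                        if (mm)
                                mmarray[n++] = mm;
                }
        } while_each_thread(g, p);
        read_unlock(&tasklist_lock);

        /* Phase 2: the spinlock is dropped, so taking mmap_sem (which may
         * sleep) and walking each vma is now safe. */
        for (i = 0; i < n; i++) {
                struct mm_struct *mm = mmarray[i];

                down_write(&mm->mmap_sem);
                for (vma = mm->mmap; vma; vma = vma->vm_next)
                        mpol_rebind_policy(vma->vm_policy, &cs->mems_allowed);
                up_write(&mm->mmap_sem);
                mmput(mm);
        }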

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Clean up, reorganize and make more robust the mempolicy.c code that rebinds
    mempolicies relative to the containing cpuset after a task's memory placement
    changes.

    The real motivator for this cleanup patch is to lay more groundwork for the
    upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's
    after the containing cpuset memory placement changes.

    NUMA mempolicies are constrained by the cpuset their task is a member of.
    When either (1) a task is moved to a different cpuset, or (2) the 'mems'
    mems_allowed of a cpuset is changed, then the NUMA mempolicies have embedded
    node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED) that need to
    be recalculated, relative to their new cpuset placement.

    The old code used an unreliable method of determining which old
    mems_allowed was constraining the mempolicy: it just looked at the task's
    mems_allowed value. This sort of worked with the present code, which just
    rebinds the -task- mempolicy and leaves any -vma- mempolicies broken,
    referring to the old nodes. But in an upcoming patch, the vma mempolicies
    will be rebound as well. Then the order in which the various task and vma
    mempolicies are updated will no longer be deterministic, and one can no longer
    count on task->mems_allowed holding the old value for as long as needed.
    It's not even clear that the current code was guaranteed to work reliably for
    task mempolicies.

    So I added a mems_allowed field to each mempolicy, stating exactly what
    mems_allowed the policy is relative to, and updated synchronously and reliably
    anytime that the mempolicy is rebound.

    Also removed a useless wrapper routine, numa_policy_rebind(), and had its
    caller, cpuset_update_task_memory_state(), call directly to the rewritten
    policy_rebind() routine, and made that rebind routine extern instead of
    static, and added a "mpol_" prefix to its name, making it
    mpol_rebind_policy().
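
    A simplified picture of the data structure change (the field layout shown
    is illustrative, not the exact struct):

        struct mempolicy {
                atomic_t   refcnt;
                short      policy;              /* MPOL_DEFAULT, MPOL_BIND, ... */
                nodemask_t nodes;               /* simplified: the policy's node set */
                nodemask_t cpuset_mems_allowed; /* new: the mems_allowed this policy
                                                 * was last computed relative to */
        };

        /* Remap the policy's nodes from the recorded old placement to 'newmask'. */
        void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask);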

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Provide a cpuset_mems_allowed() method, which the sys_migrate_pages() code
    needed, to obtain the mems_allowed vector of a cpuset, and replaced the
    workaround in sys_migrate_pages() to call this new method.
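
    The new accessor is essentially (prototype assumed):

        /* Return the mems_allowed of the cpuset the given task is attached
         * to, taking whatever locks are needed to read it safely. */
        nodemask_t cpuset_mems_allowed(struct task_struct *p);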

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • The important code paths through alloc_pages_current() and alloc_page_vma(),
    by which most kernel page allocations go, both called
    cpuset_update_current_mems_allowed(), which in turn called refresh_mems().
    -Both- of these latter two routines did a task_lock, got the task's cpuset
    pointer, and checked for an out of date cpuset->mems_generation.

    That was a silly duplication of code and waste of CPU cycles on an important
    code path.

    Consolidated those two routines into a single routine, called
    cpuset_update_task_memory_state(), since it updates more than just
    mems_allowed.

    Changed all callers of either routine to call the new consolidated routine.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Finish converting mm/mempolicy.c from bitmaps to nodemasks. The previous
    conversion had left one routine using bitmaps, since it involved a
    corresponding change to kernel/cpuset.c

    Fix that interface by replacing it with a simple macro that calls
    nodes_subset(), or, if !CONFIG_CPUSETS, returns (1).
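
    The macro has roughly this shape (the macro name and config symbol are
    assumed):

        #ifdef CONFIG_CPUSETS
        #define cpuset_nodes_subset_current(nodes) \
                        nodes_subset((nodes), current->mems_allowed)
        #else
        #define cpuset_nodes_subset_current(nodes) (1)
        #endif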

    Signed-off-by: Paul Jackson
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Group page migration functions in mempolicy.c

    Add a forward declaration for migrate_page_add (like gather_stats()) and use
    our newfound mobility to group all page migration related functions around
    do_migrate_pages().

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Since the numa_maps functionality is now in mempolicy.c we no longer need to
    export get_vma_policy().

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • migrate_page_add cannot be called with a spinlock held (it calls
    isolate_lru_page, which calls schedule_on_each_cpu). Drop the ptl lock in
    check_pte_range before calling migrate_page_add().
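
    In simplified form, the call site becomes (argument lists abbreviated):

        pte_unmap_unlock(orig_pte, ptl);    /* drop the page table lock first  */
        migrate_page_add(page, ...);        /* ... because this call may sleep */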

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • First discussed at http://marc.theaimsgroup.com/?t=113149255100001&r=1&w=2

    - Use the check_range() in mempolicy.c to gather statistics.

    - Improve the numa_maps code in general and fix some comments.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This was first posted at
    http://marc.theaimsgroup.com/?l=linux-mm&m=113149240227584&w=2

    (Part of this functionality is also contained in the direct migration
    pathset. The functionality here is more generic and independent of that
    patchset.)

    - Add an internal flag MPOL_MF_INVERT to control check_range() behavior.

    - Replace the pagelist passed through check_range() with a general
    private pointer that may be used for other purposes.
    (The following patches will use that to merge numa_maps into
    mempolicy.c and to better group the page migration code in
    the policy layer.)

    - Improve some comments.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Extend the parameters of migrate_pages() to allow the caller control over the
    fate of successfully migrated or impossible to migrate pages.

    Swap migration and direct migration will have the same interface after this
    patch so that patches can be independently applied to the policy layer and the
    core migration code.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter