26 Jun, 2006

1 commit

  • Hooks for calling vma specific migration functions

    With this patch a vma may define a vma->vm_ops->migrate function. That
    function may perform page migration on its own (some vmas may not contain
    page structs and therefore cannot be handled by regular page migration;
    pages in a vma may require special preparatory treatment before migration
    is possible; etc.). Only mmap_sem is held when the migration function is
    called. The migrate() function gets passed two nodemasks describing the
    source and the target of the migration. The flags parameter contains either

    MPOL_MF_MOVE which means that only pages used exclusively by
    the specified mm should be moved

    or

    MPOL_MF_MOVE_ALL which means that pages shared with other processes
    should also be moved.

    The migration function returns 0 on success or an error condition. An error
    condition will prevent regular page migration from occurring.

    On its own this patch cannot be included since there are no users for this
    functionality. But it seems that the uncached allocator will need this
    functionality at some point.
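
    As a rough sketch of the hook's shape (the log above only describes what
    information is passed, so the exact prototype and parameter names below
    are assumptions):

        /* Assumed prototype of the new vm_ops hook.  Called with mmap_sem
         * held; migrates the vma's pages from the 'from' nodes to the 'to'
         * nodes.  'flags' carries MPOL_MF_MOVE or MPOL_MF_MOVE_ALL.  Returns
         * 0 on success or a negative error code, which suppresses regular
         * page migration for this vma. */
        int (*migrate)(struct vm_area_struct *vma,
                       const nodemask_t *from, const nodemask_t *to,
                       unsigned long flags);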

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

23 Jun, 2006

4 commits

  • This patch inserts security_task_movememory hook calls into memory management
    code to enable security modules to mediate this operation between tasks.

    Since the last posting, the hook has been renamed following feedback from
    Christoph Lameter.
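
    For illustration, the usual pattern for consulting such an LSM hook before
    acting on another task's memory looks roughly like this (the actual call
    sites and error labels in the memory management code are not shown here):

        /* Sketch only: bail out if a loaded security module objects to the
         * current task moving the target task's memory. */
        err = security_task_movememory(task);
        if (err)
                goto out;       /* propagate the module's error to the caller */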

    Signed-off-by: David Quigley
    Acked-by: Stephen Smalley
    Signed-off-by: James Morris
    Cc: Andi Kleen
    Acked-by: Christoph Lameter
    Acked-by: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Quigley
     
  • move_pages() is used to move individual pages of a process. The function can
    be used to determine the location of pages and to move them onto the desired
    node. move_pages() returns status information for each page.

    long move_pages(pid, number_of_pages_to_move,
                    addresses_of_pages[],
                    nodes[] or NULL,
                    status[],
                    flags);

    The addresses_of_pages argument is an array of void * pointing to the
    pages to be moved.

    The nodes array contains the node numbers that the pages should be moved
    to. If a NULL is passed instead of an array then no pages are moved but
    the status array is updated. The status request may be used to determine
    the page state before issuing another move_pages() to move pages.

    The status array will contain the state of all individual page migration
    attempts when the function terminates. The status array is only valid if
    move_pages() completed successfully.

    Possible page states in status[]:

    0..MAX_NUMNODES The page is now on the indicated node.

    -ENOENT Page is not present

    -EACCES Page is mapped by multiple processes and can only
    be moved if MPOL_MF_MOVE_ALL is specified.

    -EPERM The page has been mlocked by a process/driver and
    cannot be moved.

    -EBUSY Page is busy and cannot be moved. Try again later.

    -EFAULT Invalid address (no VMA or zero page).

    -ENOMEM Unable to allocate memory on target node.

    -EIO Unable to write back page. The page must be written
    back in order to move it since the page is dirty and the
    filesystem does not provide a migration function that
    would allow the moving of dirty pages.

    -EINVAL A dirty page cannot be moved. The filesystem does not provide
    a migration function and has no ability to write back pages.

    The flags parameter indicates what types of pages to move:

    MPOL_MF_MOVE Move pages that are only mapped by the process.

    MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes.
    Requires sufficient capabilities.

    Possible return codes from move_pages():

    -ENOENT No pages found that would require moving. All pages
    are either already on the target node, not present, had an
    invalid address or could not be moved because they were
    mapped by multiple processes.

    -EINVAL Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt
    to migrate pages of a kernel thread.

    -EPERM MPOL_MF_MOVE_ALL specified without sufficient privileges,
    or an attempt to move pages of a process belonging to another user.

    -EACCES One of the target nodes is not allowed by the current cpuset.

    -ENODEV One of the target nodes is not online.

    -ESRCH Process does not exist.

    -E2BIG Too many pages to move.

    -ENOMEM Not enough memory to allocate control array.

    -EFAULT Parameters could not be accessed.

    A test program for move_pages() may be found with the patches
    on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3
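
    A minimal user-space sketch of the interface described above (this assumes
    the libnuma wrapper declared in <numaif.h>; link with -lnuma; error
    handling is trimmed):

        #include <numaif.h>          /* move_pages(), MPOL_MF_MOVE */
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
                long pagesize = sysconf(_SC_PAGESIZE);
                void *pages[1];
                int nodes[1] = { 1 };        /* ask for a move to node 1 */
                int status[1];

                /* One anonymous page, touched so that it is actually present. */
                void *buf = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                memset(buf, 0, pagesize);
                pages[0] = buf;

                /* pid 0 means the calling process; passing NULL instead of
                 * nodes[] would only query the current placement. */
                if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) != 0)
                        perror("move_pages");
                else
                        printf("status[0] = %d (node or negative error)\n",
                               status[0]);
                return 0;
        }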

    From: Christoph Lameter

    Detailed results for sys_move_pages()

    Pass a pointer to an integer to get_new_page() that may be used to
    indicate where the completion status of a migration operation should be
    placed. This allows sys_move_pages() to report back exactly what happened to
    each page.

    Wish there would be a better way to do this. Looks a bit hacky.
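
    A sketch of what that looks like (the bookkeeping structure and its fields
    are assumed here, not necessarily the actual ones):

        /* Allocation callback for migrate_pages(): besides returning a new
         * page on the requested node, it hands back a pointer into the
         * per-page bookkeeping so the migration core can record the outcome
         * (target node number or a negative error) for sys_move_pages(). */
        static struct page *new_page_node(struct page *p, unsigned long private,
                                          int **result)
        {
                struct page_to_node *pm = (struct page_to_node *)private;

                *result = &pm->status;
                return alloc_pages_node(pm->node, GFP_HIGHUSER, 0);
        }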

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Jes Sorensen
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Instead of passing a list of new pages, pass a function to allocate a new
    page. This allows the correct placement of MPOL_INTERLEAVE pages during page
    migration. It also further simplifies the callers of migrate_pages().
    migrate_pages() becomes similar to migrate_pages_to(), so drop
    migrate_pages_to(). The batching of new page allocations becomes unnecessary.
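
    In rough terms the reworked interface has this shape (exact types and
    parameter names are assumed):

        /* The caller supplies an allocation callback that is invoked once per
         * page to be migrated, so the placement decision (e.g. for
         * MPOL_INTERLEAVE) is made at the moment each new page is needed. */
        typedef struct page *new_page_t(struct page *page, unsigned long private,
                                        int **result);

        int migrate_pages(struct list_head *from, new_page_t get_new_page,
                          unsigned long private);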

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Jes Sorensen
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Do not leave pages on the lists passed to migrate_pages(). Seems that we will
    not need any postprocessing of pages. This will simplify the handling of
    pages by the callers of migrate_pages().

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Jes Sorensen
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

24 Mar, 2006

1 commit

  • The hooks in the slab cache allocator code path for support of NUMA
    mempolicies and cpuset memory spreading are in an important code path. Many
    systems will use neither feature.

    This patch optimizes those hooks down to a single check of some bits in the
    current task's task_struct flags. For non-NUMA systems, these hooks and the
    related code are already ifdef'd out.

    The optimization is done by using another task flag, set if the task is using
    a non-default NUMA mempolicy. Taking this flag bit along with the
    PF_SPREAD_PAGE and PF_SPREAD_SLAB flag bits added earlier in this 'cpuset
    memory spreading' patch set, one can check for the combination of any of these
    special case memory placement mechanisms with a single test of the current
    task's task_struct flags.

    This patch also tightens up the code, to save a few bytes of kernel text
    space, and moves some of it out of line. Due to the nested inlines called
    from multiple places, we were ending up with three copies of this code, which
    once we get off the main code path (for local node allocation) seems a bit
    wasteful of instruction memory.
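
    The fast-path test then reduces to something along these lines (the new
    flag name and the out-of-line helper are assumptions):

        /* Slab allocation path: a single flags test decides whether any
         * special placement mechanism is active for this task (the page
         * cache path tests PF_SPREAD_PAGE analogously). */
        if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
                return alternate_node_alloc(cachep, flags);  /* out of line */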

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

22 Mar, 2006

2 commits

  • Centralize the page migration functions in anticipation of additional
    tinkering. Creates a new file mm/migrate.c

    1. Extract buffer_migrate_page() from fs/buffer.c

    2. Extract central migration code from vmscan.c

    3. Extract some components from mempolicy.c

    4. Export pageout() and remove_from_swap() from vmscan.c

    5. Make it possible to configure NUMA systems without page migration
    and non-NUMA systems with page migration.

    I had to do some #ifdeffing in mempolicy.c that may need a cleanup.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We have struct kmem_cache now so use it instead of the old typedef.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     

15 Mar, 2006

1 commit

  • It seems that setting scheduling policy and priorities is also the kind of
    thing that might be performed in apps that also use the NUMA API, so it
    would seem consistent to use CAP_SYS_NICE for NUMA also.

    So use CAP_SYS_NICE for controlling migration permissions.
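
    In other words, the permission test becomes a capability check of roughly
    this form (the surrounding ownership checks are omitted):

        /* Moving another process's pages now requires CAP_SYS_NICE, the same
         * capability used for setting scheduling policy and priorities. */
        if (!capable(CAP_SYS_NICE))
                return -EPERM;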

    Signed-off-by: Christoph Lameter
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

07 Mar, 2006

1 commit

  • Change the format of numa_maps to be more compact and contain additional
    information that is useful for managing and troubleshooting memory on a
    NUMA system. Numa_maps can now also support huge pages.

    Fixes:

    1. More compact format. Only display fields if they contain additional
    information.

    2. Always display information for all vmas. The old numa_maps did not display
    vmas with no mapped entries. This was a bit confusing because page
    migration removes ptes for file backed vmas, so after page migration
    some of the vmas seemed to vanish.

    3. Rename maxref to mapmax. This is the maximum mapcount of all the pages
    in a vma and may be used as an indicator of how many processes
    may be using a certain vma.

    4. Include the ability to scan over huge page vmas.

    New items shown:

    dirty
    Number of pages in a vma that have either the dirty bit set in the
    page_struct or in the pte.

    file=
    The file backing the pages if any

    stack
    Stack area

    heap
    Heap area

    huge
    Huge page area. The page count shown is the number of huge
    pages, not the number of regular sized pages.

    swapcache
    Number of pages with swap references. Must be >0 in order to
    be shown.

    active
    Number of active pages. Only displayed if different from the number
    of pages mapped.

    writeback
    Number of pages under writeback. Only displayed if >0.

    Sample output of a process using huge pages:

    00000000 default
    2000000000000000 default file=/lib/ld-2.3.90.so mapped=13 mapmax=30 N0=13
    2000000000044000 default file=/lib/ld-2.3.90.so anon=2 dirty=2 swapcache=2 N2=2
    2000000000064000 default file=/lib/librt-2.3.90.so mapped=2 active=1 N1=1 N3=1
    2000000000074000 default file=/lib/librt-2.3.90.so
    2000000000080000 default file=/lib/librt-2.3.90.so anon=1 swapcache=1 N2=1
    2000000000084000 default
    2000000000088000 default file=/lib/libc-2.3.90.so mapped=52 mapmax=32 active=48 N0=52
    20000000002bc000 default file=/lib/libc-2.3.90.so
    20000000002c8000 default file=/lib/libc-2.3.90.so anon=3 dirty=2 swapcache=3 active=2 N1=1 N2=2
    20000000002d4000 default anon=1 swapcache=1 N1=1
    20000000002d8000 default file=/lib/libpthread-2.3.90.so mapped=8 mapmax=3 active=7 N2=2 N3=6
    20000000002fc000 default file=/lib/libpthread-2.3.90.so
    2000000000308000 default file=/lib/libpthread-2.3.90.so anon=1 dirty=1 swapcache=1 N1=1
    200000000030c000 default anon=1 dirty=1 swapcache=1 N1=1
    2000000000320000 default anon=1 dirty=1 N1=1
    200000000071c000 default
    2000000000720000 default anon=2 dirty=2 swapcache=1 N1=1 N2=1
    2000000000f1c000 default
    2000000000f20000 default anon=2 dirty=2 swapcache=1 active=1 N2=1 N3=1
    200000000171c000 default
    2000000001720000 default anon=1 dirty=1 swapcache=1 N1=1
    2000000001b20000 default
    2000000001b38000 default file=/lib/libgcc_s.so.1 mapped=2 N1=2
    2000000001b48000 default file=/lib/libgcc_s.so.1
    2000000001b54000 default file=/lib/libgcc_s.so.1 anon=1 dirty=1 active=0 N1=1
    2000000001b58000 default file=/lib/libunwind.so.7.0.0 mapped=2 active=1 N1=2
    2000000001b74000 default file=/lib/libunwind.so.7.0.0
    2000000001b80000 default file=/lib/libunwind.so.7.0.0
    2000000001b84000 default
    4000000000000000 default file=/media/huge/test9 mapped=1 N1=1
    6000000000000000 default file=/media/huge/test9 anon=1 dirty=1 active=0 N1=1
    6000000000004000 default heap
    607fffff7fffc000 default anon=1 dirty=1 swapcache=1 N2=1
    607fffffff06c000 default stack anon=1 dirty=1 active=0 N1=1
    8000000060000000 default file=/mnt/huge/test0 huge dirty=3 N1=3
    8000000090000000 default file=/mnt/huge/test1 huge dirty=3 N0=1 N2=2
    80000000c0000000 default file=/mnt/huge/test2 huge dirty=3 N1=1 N3=2

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

03 Mar, 2006

1 commit

  • numa_maps should not scan over huge vmas in order not to cause problems for
    non-IA64 platforms that may have pte entries pointing to huge pages in a
    variety of ways in their page tables. Add a simple check to ignore vmas
    containing huge pages.
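
    The guard is essentially of this form (is_vm_hugetlb_page() is the usual
    predicate for this; the exact place in the scanning code is not shown):

        /* Do not walk the ptes of huge page vmas: their page table layout is
         * architecture specific and not safe to interpret generically. */
        if (is_vm_hugetlb_page(vma))
                return;         /* skip this vma entirely */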

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

25 Feb, 2006

1 commit

  • migrate_pages_to() allocates a list of new pages on the intended target
    node or with the intended policy and then uses the list of new pages as
    targets for the migration of a list of pages out of place.

    When the pages are allocated it is not clear which of the out of place
    pages will be moved to the new pages, so we cannot specify an address as
    needed by alloc_page_vma(). This causes a problem for MPOL_INTERLEAVE,
    which will currently allocate all the pages on the first node of the set.
    If mbind is used on a vma that has an MPOL_INTERLEAVE policy, the
    interleaving of pages may be destroyed.

    This patch fixes that by generating a fake address for each alloc_page_vma()
    call, which results in a distribution of pages as prescribed by
    MPOL_INTERLEAVE.

    Lee also noted that the sequence of nodes for the new pages seems to be
    inverted. So we also invert the way the lists of pages for migration are
    built.
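
    A sketch of the idea (variable names assumed): each new page is allocated
    with a different synthetic address, so the interleave hash in
    alloc_page_vma() advances from node to node instead of always resolving to
    the first node of the set.

        unsigned long fake_addr = vma->vm_start;

        list_for_each_entry(page, pagelist, lru) {
                /* a distinct "address" per page => a distinct interleave node */
                new = alloc_page_vma(GFP_HIGHUSER, vma, fake_addr);
                fake_addr += PAGE_SIZE;
                if (new)
                        list_add_tail(&new->lru, &newlist);
        }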

    Signed-off-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Looks-ok-to: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

18 Feb, 2006

2 commits

  • Make sure maxnodes is a safe size before calculating nlongs in
    get_nodes().
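
    The shape of the check (the limit shown is illustrative):

        /* Reject absurd maxnode values before computing the number of longs
         * to copy, so nlongs cannot overflow or overrun the kernel buffer. */
        if (maxnode > PAGE_SIZE * BITS_PER_BYTE)
                return -EINVAL;
        nlongs = BITS_TO_LONGS(maxnode);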

    Signed-off-by: Chris Wright
    Signed-off-by: Linus Torvalds

    Chris Wright
     
  • The memory allocator doesn't like empty zones (which have an
    uninitialized freelist), so an x86-64 system with a node that lies entirely
    in GFP_DMA32 would crash on mbind.

    Fix that up by putting all possible zones as fallback into the zonelist
    and skipping the empty ones.

    In fact the code always allocated enough space for all zones, but only
    used it for the highest one. This change just uses all the memory that
    was allocated before.

    This should work fine for now, but whoever implements node hot removal
    needs to fix this somewhere else too (or make sure the zone data structures
    themselves never go away, only their memory).
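
    The construction of the MPOL_BIND zonelist then looks roughly like this
    (iteration details simplified):

        for (k = policy_zone; k >= 0; k--) {            /* every zone type */
                for_each_node_mask(nid, *nodes) {
                        struct zone *z = &NODE_DATA(nid)->node_zones[k];

                        /* append as fallback, but skip zones with no pages:
                         * their freelists were never initialized */
                        if (z->present_pages > 0)
                                zl->zones[num++] = z;
                }
        }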

    Signed-off-by: Andi Kleen
    Acked-by: Christoph Lameter
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

05 Feb, 2006

1 commit

  • > mm/mempolicy.c: In function `huge_zonelist':
    > mm/mempolicy.c:1045: error: `HPAGE_SHIFT' undeclared (first use in this function)
    > mm/mempolicy.c:1045: error: (Each undeclared identifier is reported only once
    > mm/mempolicy.c:1045: error: for each function it appears in.)
    > make[1]: *** [mm/mempolicy.o] Error 1

    Need to wrap huge_zonelist function with CONFIG_HUGETLBFS.
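
    The fix therefore has this shape (assuming the huge_zonelist() prototype
    of that era; the function body itself is unchanged):

        #ifdef CONFIG_HUGETLBFS
        /* huge_zonelist() uses HPAGE_SHIFT, which only exists when hugetlbfs
         * support is configured in, so only build it in that case. */
        struct zonelist *huge_zonelist(struct vm_area_struct *vma,
                                       unsigned long addr)
        {
                /* ... unchanged body ... */
        }
        #endif /* CONFIG_HUGETLBFS */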

    Signed-off-by: Ken Chen
    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     

02 Feb, 2006

1 commit

  • Modify policy layer to support direct page migration

    - Add migrate_pages_to(), allowing the migration of a list of pages to a
    specified node or to a vma with a specific allocation policy, in sets of
    MIGRATE_CHUNK_SIZE pages

    - Modify do_migrate_pages() to do a staged move of pages from the source
    nodes to the target nodes.

    Signed-off-by: Paul Jackson
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

19 Jan, 2006

4 commits

  • Move the interrupt check from slab_node into ___cache_alloc and add an
    "unlikely()" to avoid pipeline stalls on some architectures.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patch fixes a regression in 2.6.14 against 2.6.13 that causes an
    imbalance in memory allocation during bootup.

    The slab allocator in 2.6.13 is not NUMA aware and simply calls
    alloc_pages(). This means that memory policies may control the behavior of
    alloc_pages(). During bootup the memory policy is set to MPOL_INTERLEAVE,
    resulting in the spreading out of allocations during bootup over all
    available nodes. The slab allocator in 2.6.13 has only a single list of
    slab pages. As a result the per cpu slab cache and the spinlock controlled
    page lists may contain slab entries from off-node memory. The slab
    allocator in 2.6.13 makes no effort to discern the locality of an entry on
    its lists.

    The NUMA aware slab allocator in 2.6.14 controls locality of the slab pages
    explicitly by calling alloc_pages_node(). The NUMA slab allocator manages
    slab entries by having lists of available slab pages for each node. The
    per cpu slab cache can only contain slab entries associated with the node
    local to the processor. This guarantees that the default allocation mode
    of the slab allocator always assigns local memory if available.

    Setting MPOL_INTERLEAVE as a default policy during bootup has no effect
    anymore. In 2.6.14 all node-unspecific slab allocations are performed on
    the boot processor. This means that most of the key data structures are
    allocated on one node. Most processors will have to refer to these
    structures, making the boot node a potential bottleneck. This may reduce
    performance and cause unnecessary memory pressure on the boot node.

    This patch implements NUMA policies in the slab layer. The slab allocator
    itself needs to apply NUMA memory policies explicitly, since the NUMA slab
    allocator no longer lets the page allocator control locality.

    The check for policies is made directly at the beginning of __cache_alloc
    using current->mempolicy. The memory policy is already frequently checked
    by the page allocator (alloc_page_vma() and alloc_pages_current()). So it
    is highly likely that the cacheline is present. For MPOL_INTERLEAVE
    kmalloc() will spread out each request to one node after another so that an
    equal distribution of allocations can be obtained during bootup.

    It is not possible to push the policy check into lower layers of the NUMA
    slab allocator since the per cpu caches now only contain slab entries from
    the current node. If the policy says that the local node is not preferred
    or is forbidden, then there is no point in checking the per cpu slab cache
    or the local list of slab pages. The allocation is better directed
    immediately to the lists containing slab entries for the allowed set of
    nodes.

    This way of applying policy also fixes another strange behavior in 2.6.13.
    alloc_pages() is controlled by the memory allocation policy of the current
    process. It could therefore be that one process is running with
    MPOL_INTERLEAVE and would, for example, obtain a new page following that
    policy because no slab entries are left in the lists. A page can typically
    hold multiple slab entries, but let's say that the current process is only
    using one. The other entries are then added to the slab lists. These are
    now non-local entries in the slab lists despite the possible availability
    of local pages that would provide faster access and increase the
    performance of the application.

    Another process without MPOL_INTERLEAVE may now run and expect a local slab
    entry from kmalloc(). However, the cache still holds the free slab entries
    from the off-node page that the other process obtained via MPOL_INTERLEAVE.
    The process will then get an off-node slab entry although other slab
    entries may be available that are local to it. This means that the policy
    of one process may contaminate the locality of the slab caches seen by
    other processes.

    This patch in effect ensures that a per-process policy is followed for the
    allocation of slab entries and that one process's memory policy cannot
    influence another. A process with the default policy will always get a
    local slab entry if one is available, and a process using memory policies
    will get its memory arranged as requested. Off-node slab allocation
    requires the use of spinlocks and rules out the per cpu caches, so a
    process using memory policies to redirect allocations off-node has to cope
    with additional lock overhead on top of the latency added by accessing a
    remote slab entry.
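
    The check at the top of __cache_alloc is, in outline (helper names as
    described above, details simplified):

        static inline void *__cache_alloc(struct kmem_cache *cachep, gfp_t flags)
        {
                /* Apply the task's mempolicy before touching the per cpu
                 * cache; interrupts have no process context and keep the
                 * default local-node behavior. */
                if (unlikely(current->mempolicy && !in_interrupt())) {
                        int nid = slab_node(current->mempolicy);

                        if (nid != numa_node_id())
                                return __cache_alloc_node(cachep, flags, nid);
                }
                /* ... normal per cpu fast path ... */
        }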

    Changes V1->V2
    - Remove #ifdef CONFIG_NUMA by moving forward declaration into
    prior #ifdef CONFIG_NUMA section.

    - Give the function determining the node number to use a saner
    name.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Simplify migrate_page_add after feedback from Hugh. This also allows us to
    drop one parameter from migrate_page_add.

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The migration code currently does not take a reference to the target page
    properly, so between unlocking the pte and trying to take a new
    reference to the page with isolate_lru_page, anything could happen to
    it.

    Fix this by holding the pte lock until we get a chance to elevate the
    refcount.

    Other small cleanups while we're here.
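
    The ordering that the fix establishes, in simplified form:

        /* Take the reference while the pte lock still pins the page; only
         * then drop the lock and pull the page off the LRU. */
        get_page(page);                 /* under pte_lock: page cannot go away */
        pte_unmap_unlock(pte, ptl);
        isolate_lru_page(page);         /* safe now that we hold a reference */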

    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

15 Jan, 2006

1 commit

  • Anything that writes into a tmpfs filesystem is liable to disproportionately
    decrease the available memory on a particular node. Since there's no telling
    what sort of application (e.g. dd/cp/cat) might be dropping large files
    there, this lets the admin choose the appropriate default behavior for their
    site's situation.

    Introduce a tmpfs mount option which allows specifying a memory policy and
    a second option to specify the nodelist for that policy. With the default
    policy, tmpfs will behave as it does today. This patch adds support for
    preferred, bind, and interleave policies.

    The default policy will cause pages to be added to tmpfs files on the node
    which is doing the writing. Some jobs expect a single process to create
    and manage the tmpfs files. This results in a node which has a
    significantly reduced number of free pages.

    With this patch, the administrator can specify the policy and nodes for
    that policy where they would prefer allocations.

    This patch was originally written by Brent Casavant and Hugh Dickins. I
    added support for the bind and preferred policies and the mpol_nodelist
    mount option.

    Signed-off-by: Brent Casavant
    Signed-off-by: Hugh Dickins
    Signed-off-by: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

09 Jan, 2006

11 commits

  • Fix more of a longstanding bug in the cpuset/mempolicy interaction.

    NUMA mempolicies (mm/mempolicy.c) are constrained by the current task's
    cpuset to just the Memory Nodes allowed by that cpuset. The kernel maintains
    internal state for each mempolicy, tracking what nodes are used for the
    MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.

    When a task's cpuset memory placement changes, whether because the cpuset
    changed or because the task was attached to a different cpuset, the task's
    mempolicies have to be rebound to the new cpuset placement, so as to
    preserve the cpuset-relative numbering of the nodes in that policy.

    An earlier fix handled such mempolicy rebinding for mempolicies attached to a
    task.

    This fix rebinds mempolicies attached to vma's (address ranges in a task's
    address space). Due to the need to hold the task->mm->mmap_sem semaphore while
    updating vma's, the rebinding of vma mempolicies has to be done when the
    cpuset memory placement is changed, at which time mmap_sem can be safely
    acquired. The task's mempolicy is rebound later, when the task next attempts
    to allocate memory and notices that its task->cpuset_mems_generation is
    out-of-date with its cpuset's mems_generation.

    Because walking the tasklist to find all tasks attached to a changing cpuset
    requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
    affected tasks while doing the tasklist scan. In general, one cannot acquire
    a semaphore (which can sleep) while already holding a spinlock (such as
    tasklist_lock). So a list of mm references has to be built up during the
    tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
    acquired, and the vma's in that mm rebound.

    Once the tasklist lock is dropped, affected tasks may fork new tasks, before
    their mm's are rebound. A kernel global 'cpuset_being_rebound' is set to
    point to the cpuset being rebound (there can only be one; cpuset modifications
    are done under a global 'manage_sem' semaphore), and the mpol_copy code that
    is used to copy a task's mempolicies during fork catches such forking tasks,
    and ensures their children are also rebound.

    When a task is moved to a different cpuset, it is easier, as there is only one
    task involved. Its mm->vma's are scanned, using the same
    mpol_rebind_policy() as used above.

    It may happen that both the mpol_copy hook and the update done via the
    tasklist scan update the same mm twice. This is ok, as the mempolicies of
    each vma in an mm keep track of what mems_allowed they are relative to, and
    safely no-op a second request to rebind to the same nodes.
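
    The two-phase update described above looks roughly like this (helper and
    field names are simplified, not the exact code):

        /* Phase 1: under the tasklist spinlock, only collect mm references. */
        read_lock(&tasklist_lock);
        do_each_thread(g, p) {
                if (p->cpuset == cs) {
                        struct mm_struct *mm = get_task_mm(p);  /* no sleeping */
                        if (mm)
                                mmarray[n++] = mm;
                }
        } while_each_thread(g, p);
        read_unlock(&tasklist_lock);

        /* Phase 2: the spinlock is dropped, so taking mmap_sem (which may
         * sleep) and walking each vma is now safe. */
        for (i = 0; i < n; i++) {
                struct mm_struct *mm = mmarray[i];

                down_write(&mm->mmap_sem);
                for (vma = mm->mmap; vma; vma = vma->vm_next)
                        mpol_rebind_policy(vma->vm_policy, &cs->mems_allowed);
                up_write(&mm->mmap_sem);
                mmput(mm);
        }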

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Clean up, reorganize and make more robust the mempolicy.c code that rebinds
    mempolicies relative to the containing cpuset after a task's memory placement
    changes.

    The real motivator for this cleanup patch is to lay more groundwork for the
    upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's
    after the containing cpuset memory placement changes.

    NUMA mempolicies are constrained by the cpuset their task is a member of.
    When either (1) a task is moved to a different cpuset, or (2) the 'mems'
    mems_allowed of a cpuset is changed, then the NUMA mempolicies have embedded
    node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED) that need to
    be recalculated, relative to their new cpuset placement.

    The old code used an unreliable method of determining which old
    mems_allowed was constraining the mempolicy: it just looked at the task's
    mems_allowed value. This sort of worked with the present code, which just
    rebinds the -task- mempolicy and leaves any -vma- mempolicies broken,
    referring to the old nodes. But in an upcoming patch, the vma mempolicies
    will be rebound as well. Then the order in which the various task and vma
    mempolicies are updated will no longer be deterministic, and one can no longer
    count on task->mems_allowed holding the old value for as long as needed.
    It's not even clear that the current code was guaranteed to work reliably for
    task mempolicies.

    So I added a mems_allowed field to each mempolicy, stating exactly what
    mems_allowed the policy is relative to, and updated synchronously and reliably
    anytime that the mempolicy is rebound.

    Also removed a useless wrapper routine, numa_policy_rebind(), and had its
    caller, cpuset_update_task_memory_state(), call directly to the rewritten
    policy_rebind() routine, and made that rebind routine extern instead of
    static, and added a "mpol_" prefix to its name, making it
    mpol_rebind_policy().
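
    A simplified picture of the data structure change (the field layout shown
    is illustrative, not the exact struct):

        struct mempolicy {
                atomic_t   refcnt;
                short      policy;              /* MPOL_DEFAULT, MPOL_BIND, ... */
                nodemask_t nodes;               /* simplified: the policy's node set */
                nodemask_t cpuset_mems_allowed; /* new: the mems_allowed this policy
                                                 * was last computed relative to */
        };

        /* Remap the policy's nodes from the recorded old placement to 'newmask'. */
        void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask);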

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Provide a cpuset_mems_allowed() method, which the sys_migrate_pages() code
    needed, to obtain the mems_allowed vector of a cpuset, and replaced the
    workaround in sys_migrate_pages() to call this new method.
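
    The new accessor is essentially (prototype assumed):

        /* Return the mems_allowed of the cpuset the given task is attached
         * to, taking whatever locks are needed to read it safely. */
        nodemask_t cpuset_mems_allowed(struct task_struct *p);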

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • The important code paths through alloc_pages_current() and alloc_page_vma(),
    by which most kernel page allocations go, both called
    cpuset_update_current_mems_allowed(), which in turn called refresh_mems().
    -Both- of these latter two routines did a task_lock, got the task's cpuset
    pointer, and checked for an out of date cpuset->mems_generation.

    That was a silly duplication of code and waste of CPU cycles on an important
    code path.

    Consolidated those two routines into a single routine, called
    cpuset_update_task_memory_state(), since it updates more than just
    mems_allowed.

    Changed all callers of either routine to call the new consolidated routine.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Finish converting mm/mempolicy.c from bitmaps to nodemasks. The previous
    conversion had left one routine using bitmaps, since it involved a
    corresponding change to kernel/cpuset.c

    Fix that interface by replacing it with a simple macro that calls
    nodes_subset(), or, if !CONFIG_CPUSETS, returns (1).
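
    The macro has roughly this shape (the macro name and config symbol are
    assumed):

        #ifdef CONFIG_CPUSETS
        #define cpuset_nodes_subset_current(nodes) \
                        nodes_subset((nodes), current->mems_allowed)
        #else
        #define cpuset_nodes_subset_current(nodes) (1)
        #endif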

    Signed-off-by: Paul Jackson
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Group page migration functions in mempolicy.c

    Add a forward declaration for migrate_page_add (like gather_stats()) and use
    our newfound mobility to group all page migration related functions around
    do_migrate_pages().

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Since the numa_maps functionality is now in mempolicy.c we no longer need to
    export get_vma_policy().

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • migrate_page_add cannot be called with a spinlock held (it calls
    isolate_lru_page, which calls schedule_on_each_cpu). Drop the ptl lock in
    check_pte_range before calling migrate_page_add().
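
    In simplified form, the call site becomes (argument lists abbreviated):

        pte_unmap_unlock(orig_pte, ptl);    /* drop the page table lock first  */
        migrate_page_add(page, ...);        /* ... because this call may sleep */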

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • First discussed at http://marc.theaimsgroup.com/?t=113149255100001&r=1&w=2

    - Use the check_range() in mempolicy.c to gather statistics.

    - Improve the numa_maps code in general and fix some comments.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This was first posted at
    http://marc.theaimsgroup.com/?l=linux-mm&m=113149240227584&w=2

    (Part of this functionality is also contained in the direct migration
    pathset. The functionality here is more generic and independent of that
    patchset.)

    - Add an internal flag MPOL_MF_INVERT to control check_range() behavior.

    - Replace the pagelist passed through check_range() with a general
    private pointer that may be used for other purposes.
    (The following patches will use that to merge numa_maps into
    mempolicy.c and to better group the page migration code in
    the policy layer.)

    - Improve some comments.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Extend the parameters of migrate_pages() to allow the caller control over the
    fate of successfully migrated or impossible to migrate pages.

    Swap migration and direct migration will have the same interface after this
    patch so that patches can be independently applied to the policy layer and the
    core migration code.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter