09 Jan, 2006

37 commits

  • The patch makes posix_fadvise return ESPIPE on FIFO/pipe in order to be
    fully POSIX-compliant.
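
    A minimal userspace sketch of the behaviour being enforced (illustration
    only, not part of the patch):

        #include <errno.h>
        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
            int fds[2], err;

            if (pipe(fds))
                return 1;
            /* posix_fadvise() returns the error number directly; on a
             * pipe or FIFO a POSIX-compliant kernel reports ESPIPE. */
            err = posix_fadvise(fds[0], 0, 0, POSIX_FADV_SEQUENTIAL);
            printf("posix_fadvise on a pipe: %s\n",
                   err == ESPIPE ? "ESPIPE" : strerror(err));
            return 0;
        }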

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Valentine Barshak
     
  • This patch adds EXPORT_SYMBOL(filemap_write_and_wait) and uses it; see
    mm/filemap.c. It also changes filemap_write_and_wait() and
    filemap_write_and_wait_range().

    The current filemap_write_and_wait() doesn't wait if filemap_fdatawrite()
    returns an error. However, even if filemap_fdatawrite() returned an
    error, it may already have submitted some of the data pages to the device
    (e.g. in the case of -ENOSPC).

    Andrew Morton writes,

    If filemap_fdatawrite() returns an error, this might be due to some
    I/O problem: dead disk, unplugged cable, etc. Given the generally
    crappy quality of the kernel's handling of such exceptions, there's a
    good chance that the filemap_fdatawait() will get stuck in D state
    forever.

    So, this patch doesn't wait if filemap_fdatawrite() returns -EIO.

    Trond, could you please review the nfs part? In particular, I'm not sure
    whether nfs must use the "filemap_fdatawrite(inode->i_mapping) == 0"
    check or not.
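
    A simplified sketch of the resulting behaviour (not the literal patch):

        int filemap_write_and_wait(struct address_space *mapping)
        {
            int err = filemap_fdatawrite(mapping);

            /*
             * Even on error some pages may already have been submitted
             * (e.g. the -ENOSPC case), so still wait for them -- but not
             * on -EIO, where the device may never complete the I/O.
             */
            if (err != -EIO) {
                int err2 = filemap_fdatawait(mapping);
                if (!err)
                    err = err2;
            }
            return err;
        }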

    Acked-by: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • This exports and changes sync_page_range()/_nolock(). The fatfs needs
    sync_page_range()/_nolock() for expanding truncate, and the patch changes
    "size_t count" to "loff_t count".

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • Fix more of a longstanding bug in the cpuset/mempolicy interaction.

    NUMA mempolicies (mm/mempolicy.c) are constrained by the current task's
    cpuset to just the Memory Nodes allowed by that cpuset. The kernel
    maintains internal state for each mempolicy, tracking what nodes are used
    for the MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.

    When a task's cpuset memory placement changes, whether because the cpuset
    changed or because the task was attached to a different cpuset, the
    task's mempolicies have to be rebound to the new cpuset placement, so as
    to preserve the cpuset-relative numbering of the nodes in that policy.

    An earlier fix handled such mempolicy rebinding for mempolicies attached to a
    task.

    This fix rebinds mempolicies attached to vma's (address ranges in a
    task's address space). Due to the need to hold the task->mm->mmap_sem
    semaphore while updating vma's, the rebinding of vma mempolicies has to
    be done when the cpuset memory placement is changed, at which time
    mmap_sem can be safely acquired. The task's mempolicy is rebound later,
    when the task next attempts to allocate memory and notices that its
    task->cpuset_mems_generation is out of date with respect to its cpuset's
    mems_generation.

    Because walking the tasklist to find all tasks attached to a changing cpuset
    requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
    affected tasks while doing the tasklist scan. In general, one cannot acquire
    a semaphore (which can sleep) while already holding a spinlock (such as
    tasklist_lock). So a list of mm references has to be built up during the
    tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
    acquired, and the vma's in that mm rebound.

    Once the tasklist lock is dropped, affected tasks may fork new tasks, before
    their mm's are rebound. A kernel global 'cpuset_being_rebound' is set to
    point to the cpuset being rebound (there can only be one; cpuset modifications
    are done under a global 'manage_sem' semaphore), and the mpol_copy code that
    is used to copy a task's mempolicies during fork catches such forking tasks,
    and ensures their children are also rebound.

    When a task is moved to a different cpuset, it is easier, as there is only
    one task involved. Its mm->vma's are scanned, using the same
    mpol_rebind_policy() as used above.

    It may happen that both the mpol_copy hook and the update done via the
    tasklist scan update the same mm twice. This is ok, as the mempolicies of
    each vma in an mm keep track of what mems_allowed they are relative to, and
    safely no-op a second request to rebind to the same nodes.
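
    In outline, the two-phase update described above reads something like the
    following heavily simplified sketch (the direct p->cpuset test and the
    collect_mm()/next_collected_mm() list helpers are illustrative stand-ins,
    and the cpuset_being_rebound handling is omitted):

        struct task_struct *p;
        struct mm_struct *mm;
        struct vm_area_struct *vma;

        /* Phase 1: collect mm references under tasklist_lock (cannot sleep). */
        read_lock(&tasklist_lock);
        for_each_process(p) {
            if (p->cpuset != cs || !p->mm)
                continue;
            atomic_inc(&p->mm->mm_users);
            collect_mm(p->mm, &mmlist);
        }
        read_unlock(&tasklist_lock);

        /* Phase 2: tasklist_lock is dropped, so sleeping is now allowed. */
        while ((mm = next_collected_mm(&mmlist)) != NULL) {
            down_write(&mm->mmap_sem);
            for (vma = mm->mmap; vma; vma = vma->vm_next)
                mpol_rebind_policy(vma->vm_policy, &cs->mems_allowed);
            up_write(&mm->mmap_sem);
            mmput(mm);
        }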

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Clean up, reorganize and make more robust the mempolicy.c code that
    rebinds mempolicies relative to the containing cpuset after a task's
    memory placement changes.

    The real motivator for this cleanup patch is to lay more groundwork for the
    upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's
    after the containing cpuset memory placement changes.

    NUMA mempolicies are constrained by the cpuset their task is a member of.
    When either (1) a task is moved to a different cpuset, or (2) the 'mems'
    mems_allowed of a cpuset is changed, then any NUMA mempolicies with
    embedded node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED)
    need to be recalculated relative to their new cpuset placement.

    The old code used an unreliable method of determining what was the old
    mems_allowed constraining the mempolicy. It just looked at the task's
    mems_allowed value. This sort of worked with the present code, which just
    rebinds the -task- mempolicy and leaves any -vma- mempolicies broken,
    referring to the old nodes. But in an upcoming patch, the vma mempolicies
    will be rebound as well. Then the order in which the various task and vma
    mempolicies are updated will no longer be deterministic, and one can no
    longer count on task->mems_allowed holding the old value for as long as
    needed. It's not even clear that the current code was guaranteed to work
    reliably for task mempolicies.

    So I added a mems_allowed field to each mempolicy, stating exactly what
    mems_allowed the policy is relative to, and updated synchronously and reliably
    anytime that the mempolicy is rebound.

    Also removed a useless wrapper routine, numa_policy_rebind(), and had its
    caller, cpuset_update_task_memory_state(), call directly to the rewritten
    policy_rebind() routine, and made that rebind routine extern instead of
    static, and added a "mpol_" prefix to its name, making it
    mpol_rebind_policy().
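
    Conceptually the rebind then becomes a remap from the recorded old
    placement to the new one. A sketch, showing only the nodemask-carrying
    policy kinds and with illustrative field handling:

        struct mempolicy {
            /* ... existing fields ... */
            nodemask_t cpuset_mems_allowed; /* placement the embedded
                                               node numbers refer to */
        };

        void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask)
        {
            nodemask_t tmp;

            if (!pol || nodes_equal(pol->cpuset_mems_allowed, *newmask))
                return;         /* already relative to newmask */

            /* renumber nodes from the recorded old placement to the new one */
            nodes_remap(tmp, pol->v.nodes, pol->cpuset_mems_allowed, *newmask);
            pol->v.nodes = tmp;
            pol->cpuset_mems_allowed = *newmask;
        }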

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Provide a cpuset_mems_allowed() method, which the sys_migrate_pages() code
    needs in order to obtain the mems_allowed vector of a cpuset, and replace
    the workaround in sys_migrate_pages() with a call to this new method.
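
    The new interface is a small accessor along these lines (a sketch of the
    prototype):

        /* Return the memory nodes the given task's cpuset allows; takes the
         * locks it needs internally, so callers need no cpuset knowledge. */
        nodemask_t cpuset_mems_allowed(struct task_struct *p);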

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • The important code paths through alloc_pages_current() and alloc_page_vma(),
    by which most kernel page allocations go, both called
    cpuset_update_current_mems_allowed(), which in turn called refresh_mems().
    -Both- of these latter two routines took the task lock, got the task's
    cpuset pointer, and checked for an out-of-date cpuset->mems_generation.

    That was a silly duplication of code and waste of CPU cycles on an important
    code path.

    Consolidated those two routines into a single routine, called
    cpuset_update_task_memory_state(), since it updates more than just
    mems_allowed.

    Changed all callers of either routine to call the new consolidated routine.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
    at which the tasks in a cpuset call try_to_free_pages(), the synchronous
    (direct) memory reclaim code.

    This enables batch managers monitoring jobs running in dedicated cpusets to
    efficiently detect what level of memory pressure that job is causing.

    This is useful both on tightly managed systems running a wide mix of
    submitted jobs, which may choose to terminate or reprioritize jobs that
    are trying to use more memory than allowed on the nodes assigned to them,
    and with tightly coupled, long-running, massively parallel scientific
    computing jobs that will dramatically fail to meet required performance
    goals if they start to use more memory than is allowed to them.

    This patch just provides a very economical way for the batch manager to
    monitor a cpuset for signs of memory pressure. It's up to the batch
    manager or other user code to decide what to do about it and take action.

    ==> Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero. So only
    systems that enable this feature will compute the metric.

    Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm, the
    system load imposed by a batch scheduler monitoring this metric is
    sharply reduced on large systems, because a scan of the tasklist can be
    avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a single
    read, instead of having to read and accumulate results for a period of
    time.

    Because this meter is per-cpuset rather than per-task or mm, the
    batch scheduler can obtain the key information, memory pressure in a
    cpuset, with a single read, rather than having to query and accumulate
    results over all the (dynamically changing) set of tasks in the cpuset.

    A per-cpuset simple digital filter (requires a spinlock and 3 words of data
    per-cpuset) is kept, and updated by any task attached to that cpuset, if it
    enters the synchronous (direct) page reclaim code.

    A per-cpuset file provides an integer number representing the recent
    (half-life of 10 seconds) rate of direct page reclaims caused by the tasks
    in the cpuset, in units of reclaims attempted per second, times 1000.
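
    An illustrative sketch of such a decaying rate meter (constants and field
    names are made up for illustration and the scaling is simplified; this is
    not necessarily the kernel's exact filter):

        #define FM_COEF  933    /* 0.933^10 ~= 0.5  =>  ~10 second half-life */
        #define FM_SCALE 1000   /* faux fixed-point scale */

        struct fmeter {
            int        cnt;     /* events counted since the last decay step */
            int        val;     /* filtered event rate (scaled) */
            time_t     time;    /* time of the last decay step */
            spinlock_t lock;    /* guards the fields above */
        };

        /* Decay the old value once per elapsed second, then fold in new events. */
        static void fmeter_update(struct fmeter *fmp)
        {
            time_t now = get_seconds();
            time_t ticks = now - fmp->time;

            while (ticks-- > 0)
                fmp->val = (FM_COEF * fmp->val) / FM_SCALE;
            fmp->time = now;

            fmp->val += ((FM_SCALE - FM_COEF) * fmp->cnt) / FM_SCALE;
            fmp->cnt = 0;
        }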

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Finish converting mm/mempolicy.c from bitmaps to nodemasks. The previous
    conversion had left one routine using bitmaps, since it involved a
    corresponding change to kernel/cpuset.c.

    Fix that interface by replacing it with a simple macro that calls
    nodes_subset() or, if !CONFIG_CPUSET, returns (1).
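
    A sketch of the replacement (the macro name is reconstructed from the
    description and may not match the tree exactly):

        #ifdef CONFIG_CPUSETS
        #define cpuset_nodes_subset_current_mems_allowed(nodes) \
                nodes_subset((nodes), current->mems_allowed)
        #else
        #define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
        #endif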

    Signed-off-by: Paul Jackson
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • configurable replacement for slab allocator

    This adds a CONFIG_SLAB option under CONFIG_EMBEDDED. When CONFIG_SLAB is
    disabled, the kernel falls back to using the 'SLOB' allocator.

    SLOB is a traditional K&R/UNIX allocator with a SLAB emulation layer,
    similar to the original Linux kmalloc allocator that SLAB replaced. Its
    code is significantly smaller and it is more memory efficient. But like
    all similar allocators, it scales poorly and suffers from fragmentation
    more than SLAB, so it's only appropriate for small systems.
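
    To illustrate the "SLAB emulation layer" idea: under such an allocator a
    kmem_cache can be little more than a record of the object size and
    constructor, with allocation falling through to the simple general-purpose
    allocator underneath. A much-simplified sketch, not the actual slob.c code:

        struct kmem_cache {
            unsigned int size;
            const char *name;
            void (*ctor)(void *obj);
        };

        void *kmem_cache_alloc(struct kmem_cache *c, gfp_t flags)
        {
            /* no per-cache slabs: just take memory from the simple
             * first-fit kmalloc underneath */
            void *obj = kmalloc(c->size, flags);

            if (obj && c->ctor)
                c->ctor(obj);
            return obj;
        }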

    It's been tested extensively in the Linux-tiny tree. I've also
    stress-tested it with make -j 8 compiles on a 3G SMP+PREEMPT box (not
    recommended).

    Here's a comparison for otherwise identical builds, showing SLOB saving
    nearly half a megabyte of RAM:

    $ size vmlinux*
       text    data    bss      dec    hex filename
    3336372  529360 190812  4056544 3de5e0 vmlinux-slab
    3323208  527948 190684  4041840 3dac70 vmlinux-slob

    $ size mm/{slab,slob}.o
       text    data    bss      dec    hex filename
      13221     752     48    14021   36c5 mm/slab.o
       1896      52      8     1956    7a4 mm/slob.o

    /proc/meminfo:
                          SLAB        SLOB      delta
    MemTotal:         27964 kB    27980 kB     +16 kB
    MemFree:          24596 kB    25092 kB    +496 kB
    Buffers:             36 kB       36 kB       0 kB
    Cached:            1188 kB     1188 kB       0 kB
    SwapCached:           0 kB        0 kB       0 kB
    Active:             608 kB      600 kB      -8 kB
    Inactive:           808 kB      812 kB      +4 kB
    HighTotal:            0 kB        0 kB       0 kB
    HighFree:             0 kB        0 kB       0 kB
    LowTotal:         27964 kB    27980 kB     +16 kB
    LowFree:          24596 kB    25092 kB    +496 kB
    SwapTotal:            0 kB        0 kB       0 kB
    SwapFree:             0 kB        0 kB       0 kB
    Dirty:                4 kB       12 kB      +8 kB
    Writeback:            0 kB        0 kB       0 kB
    Mapped:             560 kB      556 kB      -4 kB
    Slab:              1756 kB        0 kB   -1756 kB
    CommitLimit:      13980 kB    13988 kB      +8 kB
    Committed_AS:      4208 kB     4208 kB       0 kB
    PageTables:          28 kB       28 kB       0 kB
    VmallocTotal:   1007312 kB  1007312 kB       0 kB
    VmallocUsed:         48 kB       48 kB       0 kB
    VmallocChunk:   1007264 kB  1007264 kB       0 kB

    (this work has been sponsored in part by CELF)

    From: Ingo Molnar

    Fix 32-bitness bugs in mm/slob.c.

    Signed-off-by: Matt Mackall
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • Add mm/util.c for functions common between SLAB and SLOB.

    Signed-off-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • ____cacheline_maxaligned_in_smp is currently used to align critical structures
    and avoid false sharing. It uses per-arch L1_CACHE_SHIFT_MAX and people find
    L1_CACHE_SHIFT_MAX useless.

    However, we have been using ____cacheline_maxaligned_in_smp to align
    structures on the internode cacheline size. As per Andi's suggestion,
    the following patch kills ____cacheline_maxaligned_in_smp and introduces
    INTERNODE_CACHE_SHIFT, which defaults to L1_CACHE_SHIFT for all arches.
    Arches needing L3/internode cacheline alignment can define
    INTERNODE_CACHE_SHIFT in the arch asm/cache.h. The patch replaces
    ____cacheline_maxaligned_in_smp with ____cacheline_internodealigned_in_smp.

    With this patch, L1_CACHE_SHIFT_MAX can be killed.
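
    The resulting definitions look roughly like this sketch:

        /* Arches with a larger internode (e.g. L3) cacheline can override
         * this in asm/cache.h; everyone else falls back to L1_CACHE_SHIFT. */
        #ifndef INTERNODE_CACHE_SHIFT
        #define INTERNODE_CACHE_SHIFT L1_CACHE_SHIFT
        #endif

        #if defined(CONFIG_SMP)
        #define ____cacheline_internodealigned_in_smp \
                __attribute__((__aligned__(1 << (INTERNODE_CACHE_SHIFT))))
        #else
        #define ____cacheline_internodealigned_in_smp
        #endif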

    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Signed-off-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • When oom_killer kills current there's no need to call
    schedule_timeout_interruptible() since the task must die ASAP.

    Signed-Off-By: Pavel Emelianov
    Signed-Off-By: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     
  • Group page migration functions in mempolicy.c

    Add a forward declaration for migrate_page_add (like gather_stats()) and
    use our new-found mobility to group all page-migration-related functions
    around do_migrate_pages().

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Since the numa_maps functionality is now in mempolicy.c we no longer need to
    export get_vma_policy().

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • migrate_page_add cannot be called with a spinlock held (it calls
    isolate_lru_page, which calls schedule_on_each_cpu). Drop the ptl lock in
    check_pte_range before calling migrate_page_add().

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • First discussed at http://marc.theaimsgroup.com/?t=113149255100001&r=1&w=2

    - Use the check_range() in mempolicy.c to gather statistics.

    - Improve the numa_maps code in general and fix some comments.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This was first posted at
    http://marc.theaimsgroup.com/?l=linux-mm&m=113149240227584&w=2

    (Part of this functionality is also contained in the direct migration
    patchset. The functionality here is more generic and independent of that
    patchset.)

    - Add an internal flag, MPOL_MF_INVERT, to control check_range() behavior.

    - Replace the pagelist passed through check_range() with a general
    private pointer that may be used for other purposes.
    (The following patches will use that to merge numa_maps into
    mempolicy.c and to better group the page migration code in
    the policy layer.)

    - Improve some comments.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We seem to be hitting this assertion failure too often for it to be a
    hardware bug.

    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     
  • Clean up a local variable with the same name as a variable in a larger
    block. Also move a variable into the block where it's actually used.

    Spotted by http://linuxicc.sourceforge.net/

    Signed-off-by: Tobias Klauser
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tobias Klauser
     
  • See http://marc.theaimsgroup.com/?l=linux-kernel&m=113167000201265&w=2
    http://marc.theaimsgroup.com/?l=linux-mm&m=113167267527312&w=2

    Make hugepages obey cpusets.

    Signed-off-by: Christoph Lameter
    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Use -Exxx error codes instead of numeric return codes and clean up the
    code in migrate_pages() accordingly.

    Consolidate successful migration handling.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Extend the parameters of migrate_pages() to allow the caller control over
    the fate of successfully migrated pages and of pages that are impossible
    to migrate.

    Swap migration and direct migration will have the same interface after this
    patch so that patches can be independently applied to the policy layer and the
    core migration code.
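
    After this change the interface is roughly the following (a sketch based
    on the description; exact parameter names may differ):

        /*
         * Try to migrate (or swap out) the pages on list 'l'; new target
         * pages may be taken from 't'.  Pages handled successfully are moved
         * to 'moved', pages that cannot be migrated to 'failed', leaving the
         * caller in control of their fate.
         */
        int migrate_pages(struct list_head *l, struct list_head *t,
                          struct list_head *moved, struct list_head *failed);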

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Drop unused pages immediately

    If a page is encountered that is only referenced by the migration code then
    there is no reason to swap or migrate the page. Release the page by calling
    move_to_lru().
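
    Inside the migrate_pages() loop this amounts to roughly:

        if (page_count(page) == 1) {
            /* only the migration code still references the page, so there
             * is nothing to swap or migrate; just release it via the LRU */
            move_to_lru(page);
            continue;
        }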

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Add gfp_mask to add_to_swap

    add_to_swap does allocations with GFP_ATOMIC in order not to interfere
    with swapping. During migration we may use add_to_swap extensively, which
    may lead to out-of-memory errors.

    This patch makes add_to_swap take a parameter that specifies the gfp mask.
    The page migration code can then make add_to_swap use GFP_KERNEL.
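
    The interface change is roughly:

        int add_to_swap(struct page *page, gfp_t gfp_mask);

        /* regular reclaim keeps the old, non-intrusive behaviour ... */
        add_to_swap(page, GFP_ATOMIC);

        /* ... while the migration path can afford to sleep and reclaim */
        add_to_swap(page, GFP_KERNEL);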

    Signed-off-by: Hirokazu Takahashi
    Signed-off-by: Dave Hansen
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Move move_to_lru, putback_lru_pages and isolate_lru into a section
    surrounded by CONFIG_MIGRATION, saving some code size for single-processor
    kernels.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • sys_migrate_pages implementation using swap based page migration

    This is the original API proposed by Ray Bryant in his posts during the first
    half of 2005 on linux-mm@kvack.org and linux-kernel@vger.kernel.org.

    The intent of sys_migrate_pages is to migrate the memory of a process. A
    process may have migrated to another node. Memory was allocated optimally
    for the prior context. sys_migrate_pages allows that memory to be shifted
    to the new node.

    sys_migrate_pages is also useful for manually moving a process's memory
    if the process's available memory nodes have changed through cpuset
    operations. Paul Jackson is working on an automated mechanism that will
    allow an automatic migration if the cpuset of a process is changed.
    However, a user may decide to manually control the migration.

    This implementation is put into the policy layer since it uses concepts and
    functions that are also needed for mbind and friends. The patch also provides
    a do_migrate_pages function that may be useful for cpusets to automatically
    move memory. sys_migrate_pages does not modify policies in contrast to Ray's
    implementation.

    The current code here is based on the swap based page migration capability
    and thus is not able to preserve the physical layout relative to its
    containing nodeset (which may be a cpuset). When direct page migration
    becomes available then the implementation needs to be changed to do an
    isomorphic move of pages between different nodesets. The current
    implementation simply evicts all pages in the source nodeset that are not
    in the target nodeset.

    Patch supports ia64, i386 and x86_64.
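
    The new system call takes the target task and the two node sets; a sketch
    of the prototype (argument names are illustrative):

        asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
                                          const unsigned long __user *old_nodes,
                                          const unsigned long __user *new_nodes);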

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Add page migration support via swap to the NUMA policy layer

    This patch adds page migration support to the NUMA policy layer. An
    additional flag, MPOL_MF_MOVE, is introduced for mbind. If MPOL_MF_MOVE
    is specified then pages that do not conform to the memory policy will be
    evicted from memory. When the pages are faulted back in, new pages will
    be allocated following the numa policy.
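
    From userspace this would be used along these lines (a sketch; it assumes
    libnuma's <numaif.h> for the mbind() wrapper and the MPOL_MF_MOVE flag,
    and move_to_node1() is a hypothetical helper):

        #include <numaif.h>
        #include <stdio.h>

        /* Bind an existing mapping [addr, addr+len) to node 1 and evict any
         * non-conforming pages so they are faulted back in on that node. */
        static int move_to_node1(void *addr, unsigned long len)
        {
            unsigned long nodemask = 1UL << 1;

            if (mbind(addr, len, MPOL_BIND, &nodemask,
                      sizeof(nodemask) * 8, MPOL_MF_MOVE) != 0) {
                perror("mbind");
                return -1;
            }
            return 0;
        }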

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Include page migration if the system is NUMA or has a memory model that
    allows distinct areas of memory (SPARSEMEM, DISCONTIGMEM).

    And:
    - Only include lru_add_drain_per_cpu if building for an SMP system.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This adds the basic page migration function with a minimal implementation that
    only allows the eviction of pages to swap space.

    Page eviction and migration may be useful for migrating pages, for
    suspending programs, or for remapping single pages (useful for faulty
    pages or pages with soft ECC failures).

    The process is as follows:

    The function wanting to migrate pages must first build a list of pages to
    be migrated or evicted and take them off the LRU lists via
    isolate_lru_page(). isolate_lru_page determines that a page is freeable
    based on the LRU bit being set.

    Then the actual migration or swapout can happen by calling migrate_pages().

    migrate_pages does its best to migrate or swapout the pages and does multiple
    passes over the list. Some pages may only be swappable if they are not dirty.
    migrate_pages may start writing out dirty pages in the initial passes over
    the pages. However, migrate_pages may not be able to migrate or evict all
    pages for a variety of reasons.

    The remaining pages may be returned to the LRU lists using putback_lru_pages().
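
    The resulting caller pattern is roughly the following sketch (argument
    conventions and return values are simplified, and isolate_lru_page() is
    assumed here to return nonzero on success):

        LIST_HEAD(pagelist);
        LIST_HEAD(newlist);

        /* 1. the caller takes candidate pages off the LRU lists */
        if (isolate_lru_page(page))
            list_add_tail(&page->lru, &pagelist);

        /* 2. try to migrate or swap out everything on the list; several
         *    passes are made and dirty pages may be written back first */
        migrate_pages(&pagelist, &newlist);

        /* 3. whatever could not be moved goes back onto the LRU lists */
        putback_lru_pages(&pagelist);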

    Changelog V4->V5:
    - Use the lru caches to return pages to the LRU

    Changelog V3->V4:
    - Restructure code so that applying patches to support full migration
    requires only minimal changes. Rename swapout_pages() to migrate_pages().

    Changelog V2->V3:
    - Extract common code from shrink_list() and swapout_pages()

    Signed-off-by: Mike Kravetz
    Signed-off-by: Christoph Lameter
    Cc: "Michael Kerrisk"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Add PF_SWAPWRITE to control a process's permission to write to swap.

    - Use PF_SWAPWRITE in may_write_to_queue() instead of checking for kswapd
    and pdflush

    - Set PF_SWAPWRITE flag for kswapd and pdflush
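
    In may_write_to_queue() the check then reads roughly:

        static int may_write_to_queue(struct backing_dev_info *bdi)
        {
            /* kswapd and pdflush now simply carry PF_SWAPWRITE instead of
             * being special-cased here */
            if (current->flags & PF_SWAPWRITE)
                return 1;
            if (!bdi_write_congested(bdi))
                return 1;
            if (bdi == current->backing_dev_info)
                return 1;
            return 0;
        }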

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This is the start of the `swap migration' patch series.

    Swap migration allows the moving of the physical location of pages between
    nodes in a numa system while the process is running. This means that the
    virtual addresses that the process sees do not change. However, the system
    rearranges the physical location of those pages.

    The main intent of page migration patches here is to reduce the latency of
    memory access by moving pages near to the processor where the process
    accessing that memory is running.

    The patchset allows a process to manually relocate the node on which its
    pages are located through the MF_MOVE and MF_MOVE_ALL options while
    setting a new memory policy.

    The pages of a process can also be relocated from another process using
    the sys_migrate_pages() function call, which requires CAP_SYS_ADMIN. The
    migrate_pages function call takes two sets of nodes and moves the pages of
    a process that are located on the 'from' nodes to the destination nodes.

    Manual migration is very useful if for example the scheduler has relocated a
    process to a processor on a distant node. A batch scheduler or an
    administrator can detect the situation and move the pages of the process
    nearer to the new processor.

    sys_migrate_pages() could be used on non-numa machines as well, to force
    all of a particular process's pages out to swap, if someone thinks that's
    useful.

    Larger installations usually partition the system using cpusets into sections
    of nodes. Paul has equipped cpusets with the ability to move pages when a
    task is moved to another cpuset. This allows automatic control over locality
    of a process. If a task is moved to a new cpuset then all its pages are
    also moved with it so that the performance of the process does not sink
    dramatically (as is the case today).

    Swap migration works by simply evicting the page. The pages must be faulted
    back in. The pages are then typically reallocated by the system near the node
    where the process is executing.

    For swap migration the destination of the move is controlled by the allocation
    policy. Cpusets set the allocation policy before calling sys_migrate_pages()
    in order to move the pages as intended.

    No allocation policy changes are performed for sys_migrate_pages(). This
    means that the pages may not be faulted in to the specified nodes if no
    allocation policy was set by other means. The pages will just end up near
    the node where the fault occurred.

    There's another patch series in the pipeline which implements "direct
    migration".

    The direct migration patchset extends the migration functionality to avoid
    going through swap. The destination node of the relocation is controllable
    during the actual moving of pages. The crutch of using the allocation
    policy to relocate is not necessary and the pages are moved directly to
    the target. It's also faster since swap is not used.

    And sys_migrate_pages() can then move pages directly to the specified node.
    Implement functions to isolate pages from the LRU and put them back later.

    This patch:

    An earlier implementation was provided by Hirokazu Takahashi
    and IWAMOTO Toshihiro for the
    memory hotplug project.

    From: Magnus

    This breaks out isolate_lru_page() and putback_lru_page(). Needed for swap
    migration.

    Signed-off-by: Magnus Damm
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Try to streamline free_pages_bulk by ensuring callers don't pass in a
    'count' that exceeds the list size.

    Some cleanups:
    Rename __free_pages_bulk to __free_one_page.
    Put the page list manipulation from __free_pages_ok into free_one_page.
    Make __free_pages_ok static.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Use zone_pcp everywhere even though NUMA code "knows" the internal details
    of the zone. It stops other people from trying to copy that, and it looks
    nicer.

    Also, only print the pagesets of online cpus in zoneinfo.
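
    The accessor hides the NUMA/non-NUMA difference, roughly:

        #ifdef CONFIG_NUMA
        #define zone_pcp(__z, __cpu)  ((__z)->pageset[(__cpu)])   /* per-node allocation */
        #else
        #define zone_pcp(__z, __cpu)  (&(__z)->pageset[(__cpu)])  /* static array */
        #endif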

    Signed-off-by: Nick Piggin
    Cc: "Seth, Rohit"
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Recently there has been a lot of traffic about the right values for the
    batch and high water marks for per_cpu_pagelists. This patch makes these
    two variables configurable through a /proc interface.

    A new tunable, /proc/sys/vm/percpu_pagelist_fraction, is added. This entry
    controls, for each zone, the maximum fraction of pages that may be
    allocated to each per-cpu page list. The minimum value for this is 8,
    meaning that we don't allow more than 1/8th of the pages in each zone to
    be allocated in any single per_cpu_pagelist.

    The batch value of each per-cpu pagelist is also updated as a result. It
    is set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8).
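
    The resulting sizing is, in outline (a sketch of the arithmetic described
    above):

        unsigned long high, batch;

        /* at most 1/percpu_pagelist_fraction of the zone on any one list */
        high = zone->present_pages / percpu_pagelist_fraction;

        /* refill/drain in chunks of a quarter of that, with an upper bound */
        batch = max(1UL, high / 4);
        if (batch > (PAGE_SHIFT * 8))
            batch = PAGE_SHIFT * 8;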

    Signed-off-by: Rohit Seth
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rohit Seth
     
  • Add /proc/sys/vm/drop_caches. When written to, this will cause the kernel
    to discard as much pagecache and/or reclaimable slab objects as it can.
    This operation requires root permissions.

    It won't drop dirty data, so the user should run `sync' first.

    Caveats:

    a) Holds inode_lock for exorbitant amounts of time.

    b) Needs to be taught about NUMA nodes: propagate these all the way through
    so the discarding can be controlled on a per-node basis.

    This is a debugging feature: useful for getting consistent results between
    filesystem benchmarks. We could possibly put it under a config option, but
    it's less than 300 bytes.
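
    For example, a benchmark harness might do the equivalent of the following
    (a userspace sketch; the value 3 requesting both pagecache and slab is an
    assumption about the interface):

        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            FILE *f;

            sync();  /* write back dirty data first; drop_caches won't */

            f = fopen("/proc/sys/vm/drop_caches", "w");
            if (!f)
                return 1;
            fputs("3\n", f);  /* assumed: 1 = pagecache, 2 = slab, 3 = both */
            return fclose(f) ? 1 : 0;
        }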

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • For some reason there is an #ifdef CONFIG_NUMA within another #ifdef
    CONFIG_NUMA in the page allocator. Remove the innermost #ifdef
    CONFIG_NUMA.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter