25 May, 2010

40 commits

  • Add global mutex zonelists_mutex to fix the possible race:

        CPU0                                    CPU1                    CPU2
    (1) zone->present_pages += online_pages;
    (2)                                         build_all_zonelists();
    (3)                                                                 alloc_page();
    (4)                                                                 free_page();
    (5) build_all_zonelists();
    (6)   __build_all_zonelists();
    (7)     zone->pageset = alloc_percpu();

    In steps (3) and (4), zone->pageset still points to boot_pageset, so
    bad things may happen if two or more nodes are in this state. Even if
    only one node is accessing the boot_pageset, (3) may still consume
    enough memory to make the allocation in step (7) fail.

    Besides, making steps (6) and (7) atomic ensures that alloc_percpu()
    in step (7) will never fail, since a fresh memory block was just
    added in step (6).
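
    A minimal sketch of the serialization, assuming the mutex wraps the
    whole rebuild (not the exact mm/ code; rebuild details omitted):

    static DEFINE_MUTEX(zonelists_mutex);

    /* Sketch: one global mutex makes the present_pages update, the
     * zonelist rebuild and the pageset swap appear atomic to any
     * concurrent onliner, closing the race diagrammed above. */
    static int online_pages_sketch(struct zone *zone, unsigned long nr_pages)
    {
            mutex_lock(&zonelists_mutex);
            zone->present_pages += nr_pages;
            /* rebuild zonelists; detaches zone->pageset from boot_pageset */
            build_all_zonelists(NULL);
            mutex_unlock(&zonelists_mutex);
            return 0;
    }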

    [haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists]
    Signed-off-by: Haicheng Li
    Signed-off-by: Wu Fengguang
    Reviewed-by: Andi Kleen
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haicheng Li
     
    For each newly populated zone of a hot-added node, we need to update
    its pagesets with dynamically allocated per_cpu_pageset structs for
    all possible CPUs:

    1) Detach zone->pageset from the shared boot_pageset
       at the end of __build_all_zonelists().

    2) Use a mutex to protect zone->pageset while it is still
       shared in onlined_pages().

    Otherwise, multiple zones of different nodes would share the same
    bootstrapping boot_pageset for the same CPU, which will finally
    cause the kernel panic below:

    ------------[ cut here ]------------
    kernel BUG at mm/page_alloc.c:1239!
    invalid opcode: 0000 [#1] SMP
    ...
    Call Trace:
    [] __alloc_pages_nodemask+0x131/0x7b0
    [] alloc_pages_current+0x87/0xd0
    [] __page_cache_alloc+0x67/0x70
    [] __do_page_cache_readahead+0x120/0x260
    [] ra_submit+0x21/0x30
    [] ondemand_readahead+0x166/0x2c0
    [] page_cache_async_readahead+0x80/0xa0
    [] generic_file_aio_read+0x364/0x670
    [] nfs_file_read+0xca/0x130
    [] do_sync_read+0xfa/0x140
    [] vfs_read+0xb5/0x1a0
    [] sys_read+0x51/0x80
    [] system_call_fastpath+0x16/0x1b
    RIP [] get_page_from_freelist+0x883/0x900
    RSP
    ---[ end trace 4bda28328b9990db ]

    [akpm@linux-foundation.org: merge fix]
    Signed-off-by: Haicheng Li
    Signed-off-by: Wu Fengguang
    Reviewed-by: Andi Kleen
    Reviewed-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haicheng Li
     
  • No behavior change here.

    Move some of setup_per_cpu_pageset() code into a new function
    setup_zone_pageset() that will be useful for memory hotplug.
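
    The factored-out helper looks roughly like this (simplified sketch;
    the pcp high-watermark tuning is omitted):

    static void setup_zone_pageset(struct zone *zone)
    {
            int cpu;

            /* Give this zone its own pagesets instead of boot_pageset */
            zone->pageset = alloc_percpu(struct per_cpu_pageset);
            for_each_possible_cpu(cpu) {
                    struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);

                    setup_pageset(pcp, zone_batchsize(zone));
            }
    }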

    Signed-off-by: Wu Fengguang
    Signed-off-by: Haicheng Li
    Reviewed-by: Andi Kleen
    Reviewed-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • In f4112de6b679d84bd9b9681c7504be7bdfb7c7d5 ("mm: introduce
    debug_kmap_atomic") I said that debug_kmap_atomic() needs
    CONFIG_TRACE_IRQFLAGS_SUPPORT.

    That was wrong. (I thought irqs_disabled() was only available when
    the architecture has CONFIG_TRACE_IRQFLAGS_SUPPORT.)

    Remove the #ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT check to enable
    kmap_atomic() debugging for the architectures which do not have
    CONFIG_TRACE_IRQFLAGS_SUPPORT.

    Reported-by: Andrew Morton
    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
    Enable users to online CPUs even if the CPUs belong to a numa node
    which doesn't have onlined local memory.

    The zonelists (pg_data_t.node_zonelists[]) of a numa node are created
    either during system boot/init, or at the time its local memory is
    onlined. For a numa node without onlined local memory, its zonelists
    are currently not initialized. As a result, any memory allocation
    executed by CPUs within this node will fail. In fact, an
    out-of-memory error is triggered when attempting to online CPUs
    before memory comes online.

    This patch creates zonelists for such numa nodes, so that memory
    allocations on such a node can fall back to other nodes.
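
    Roughly, the added check on the CPU-online path (simplified sketch;
    mem_online_node() is the helper the patch introduces):

    /* In cpu_up(): if the CPU's node has never had memory onlined,
     * bring the node online now so its zonelists exist and memory
     * allocations can fall back to other nodes. */
    nid = cpu_to_node(cpu);
    if (!node_online(nid)) {
            err = mem_online_node(nid);
            if (err)
                    return err;
    }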

    [akpm@linux-foundation.org: remove unneeded export]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: minskey guo
    Cc: Minchan Kim
    Cc: Yasunori Goto
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    minskey guo
     
    For now we have global isolation vs. memory control group isolation.
    Do not allow the reclaim entry function to set an arbitrary page
    isolation callback; we do not need that flexibility.

    And since we already pass around the group descriptor for the memory
    control group isolation case, just use it to decide which one of the two
    isolator functions to use.

    The decisions can be merged into nearby branches, so no extra cost there.
    In fact, we save the indirect calls.

    Signed-off-by: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This scan control is abused to communicate a return value from
    shrink_zones(). Write this idiomatically and remove the knob.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Signed-off-by: Johannes Weiner
    Cc: Dan Carpenter
    Cc: Rik van Riel
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • free_hot_cold_page() and __free_pages_ok() have very similar freeing
    preparation. Consolidate them.
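
    A sketch of what the consolidated helper can look like (checks
    abbreviated; shown here as free_pages_prepare(), following the shape
    the consolidation suggests):

    /* Common preparation for freeing a block of 2^order pages:
     * validate every page, then run the arch and debug hooks. */
    static bool free_pages_prepare(struct page *page, unsigned int order)
    {
            int i;

            for (i = 0; i < (1 << order); i++)
                    if (free_pages_check(page + i))
                            return false;   /* bad page, refuse to free */

            arch_free_page(page, order);
            kernel_map_pages(page, 1 << order, 0);
            return true;
    }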

    [akpm@linux-foundation.org: fix busted coding style]
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
    If vmscan is in lumpy reclaim mode, it has to ignore the referenced
    bit in order to create contiguous free pages, but the current
    page_check_references() doesn't.

    Fix it.
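
    The fix amounts to a short-circuit near the top of
    page_check_references(), roughly (sketch; sc->lumpy_reclaim_mode is
    the scan_control field of that era):

    /* Lumpy reclaim - the page is part of a contiguous range being
     * freed, so its referenced bit must not protect it. */
    if (sc->lumpy_reclaim_mode)
            return PAGEREF_RECLAIM;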

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Fix a wrong comment over page_cache_async_readahead().

    Signed-off-by: Huang Shijie
    Acked-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
    get_scan_ratio() calculates a percentage and, if the percentage is
    < 1%, rounds it down to 0%, which causes us to completely skip
    scanning anon/file pages to reclaim memory, even when the total
    number of anon/file pages is very large.

    To avoid the underflow, we don't use a percentage; instead we
    directly calculate how many pages should be scanned. This way, we
    still get several pages scanned when the ratio is below 1%.

    This has some benefits:

    1. it increases our calculation precision

    2. it makes our scanning smoother. Without this, if percent[x]
    underflows, shrink_zone() doesn't scan any pages and then suddenly
    scans all pages when priority reaches zero. With this, shrink_zone()
    gets a chance to scan some pages even when priority isn't zero.

    Note, this patch doesn't really change the logic, it just increases
    precision. For a system with a lot of memory, this might slightly
    change behavior. For example, in a sequential file read workload,
    without the patch we don't swap any anon pages; with it, if the anon
    memory size is bigger than 16G, we will see one anon page swapped.
    The 16G is calculated as PAGE_SIZE * priority(4096) * (fp/ap), where
    fp/ap is assumed to be 1024, which is common in this workload. So the
    impact is not a big deal.
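
    For reference, the arithmetic behind the 16G figure: 4096 bytes
    (PAGE_SIZE) * 4096 (priority) * 1024 (fp/ap) = 2^12 * 2^12 * 2^10
    bytes = 2^34 bytes = 16G.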

    Signed-off-by: Shaohua Li
    Cc: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
    Use mm->task_size instead of TASK_SIZE to ensure that the entire user
    address space is migrated. mm->task_size is independent of the
    calling task's context, whereas TASK_SIZE may depend on the address
    space size of the calling process. Use of TASK_SIZE can lead to
    partial address space migration if the calling process is 32-bit and
    the migrating process is 64-bit.

    Here is the test script used on a 64-bit system with a 32-bit echo
    process:

    mount -t cgroup none /cgroup -o cpuset
    cd /cgroup

    mkdir 0
    echo 1 > 0/cpuset.cpus
    echo 0 > 0/cpuset.mems
    echo 1 > 0/cpuset.memory_migrate

    mkdir 1
    echo 1 > 1/cpuset.cpus
    echo 1 > 1/cpuset.mems
    echo 1 > 1/cpuset.memory_migrate

    echo $$ > 0/tasks
    64_bit_process &
    pid=$!

    echo $pid > 1/tasks # This does not migrate all process pages without
    # this patch. If 64 bit echo is used or this patch is
    # applied, then the full address space of $pid is
    # migrated.

    To check memory migration, I watched:
    grep MemUsed /sys/devices/system/node/node*/meminfo

    Signed-off-by: Greg Thelen
    Acked-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
    The fragmentation index may indicate that a failure is due to
    external fragmentation, but after a compaction run completes it is
    still possible for an allocation to fail. There are two obvious
    reasons why:

    o Page migration cannot move all pages so fragmentation remains
    o A suitable page may exist but watermarks are not met

    In the event of compaction followed by an allocation failure, this patch
    defers further compaction in the zone (1 << compact_defer_shift) times.
    If the next compaction attempt also fails, compact_defer_shift is
    increased up to a maximum of 6. If compaction succeeds, the defer
    counters are reset again.

    The zone that is deferred is the first zone in the zonelist - i.e. the
    preferred zone. To defer compaction in the other zones, the information
    would need to be stored in the zonelist or implemented similar to the
    zonelist_cache. This would impact the fast-paths and is not justified at
    this time.
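
    The bookkeeping is small. Roughly, per the changelog (sketch; field
    names assumed):

    #define COMPACT_MAX_DEFER_SHIFT 6

    /* Called when compaction ran but the allocation still failed */
    static inline void defer_compaction(struct zone *zone)
    {
            zone->compact_considered = 0;
            if (++zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
                    zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
    }

    /* True while this zone should keep skipping compaction attempts */
    static inline bool compaction_deferred(struct zone *zone)
    {
            unsigned long defer_limit = 1UL << zone->compact_defer_shift;

            return ++zone->compact_considered < defer_limit;
    }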

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …hen it should be reclaimed

    The kernel applies some heuristics when deciding if memory should be
    compacted or reclaimed to satisfy a high-order allocation. One of
    these is based on the fragmentation index: if the index is below 500,
    memory will not be compacted. This choice is arbitrary and not based
    on data. To help optimise the system and set a sensible default for
    this value, this patch adds a sysctl extfrag_threshold. The kernel
    will only compact memory if the fragmentation index is above the
    extfrag_threshold.

    [randy.dunlap@oracle.com: Fix build errors when proc fs is not configured]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • Ordinarily when a high-order allocation fails, direct reclaim is entered
    to free pages to satisfy the allocation. With this patch, it is
    determined if an allocation failed due to external fragmentation instead
    of low memory and if so, the calling process will compact until a suitable
    page is freed. Compaction by moving pages in memory is considerably
    cheaper than paging out to disk and works where there are locked pages or
    no swap. If compaction fails to free a page of a suitable size, then
    reclaim will still occur.

    Direct compaction returns as soon as possible. As each block is
    compacted, a check is made to see whether a suitable page has been
    freed and, if so, compaction returns.

    [akpm@linux-foundation.org: Fix build errors]
    [aarcange@redhat.com: fix count_vm_event preempt in memory compaction direct reclaim]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    Add a per-node sysfs file called compact. When the file is written
    to, each zone in that node is compacted. The intention is that this
    would be used by something like a job scheduler in a batch system
    before a job starts, so that the job can allocate the maximum number
    of hugepages without significant start-up cost.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Add a proc file /proc/sys/vm/compact_memory. When an arbitrary value is
    written to the file, all zones are compacted. The expected user of such a
    trigger is a job scheduler that prepares the system before the target
    application runs.
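
    For illustration, a minimal userspace trigger (any written value
    works; requires CONFIG_COMPACTION and root):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/proc/sys/vm/compact_memory", O_WRONLY);

            if (fd < 0)
                    return 1;
            write(fd, "1", 1);  /* value is ignored; the write triggers compaction */
            close(fd);
            return 0;
    }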

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch is the core of a mechanism which compacts memory in a zone by
    relocating movable pages towards the end of the zone.

    A single compaction run involves a migration scanner and a free
    scanner. Both scanners operate on pageblock-sized areas in the zone.
    The migration scanner starts at the bottom of the zone and searches
    for all movable pages within each area, isolating them onto a private
    list called migratelist. The free scanner starts at the top of the
    zone and searches for suitable areas, consuming the free pages within
    them and making them available to the migration scanner. The pages
    isolated for migration are then migrated to the newly isolated free
    pages.
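
    Stripped to its core, a run has this shape (heavily simplified
    sketch; cc stands for the compact_control tracking both scanners'
    positions):

    /* Migration scanner walks up from the zone start, free scanner
     * walks down from the end; the run completes when they meet.
     * compaction_alloc() hands out pages the free scanner isolated. */
    while (cc->migrate_pfn < cc->free_pfn) {
            if (!isolate_migratepages(zone, cc))    /* fills cc->migratepages */
                    break;
            migrate_pages(&cc->migratepages, compaction_alloc,
                          (unsigned long)cc, 0);
    }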

    [aarcange@redhat.com: Fix unsafe optimisation]
    [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently, vmscan.c defines the isolation modes for __isolate_lru_page().
    Memory compaction needs access to these modes for isolating pages for
    migration. This patch exports them.
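
    The moved definitions are just the three mode constants:

    /* Moved out of mm/vmscan.c so mm/compaction.c can use them */
    #define ISOLATE_INACTIVE 0      /* Isolate inactive pages */
    #define ISOLATE_ACTIVE 1        /* Isolate active pages */
    #define ISOLATE_BOTH 2          /* Isolate both active and inactive pages */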

    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    The fragmentation index is only meaningful if an allocation would
    fail, and indicates what the failure is due to. A value of -1, as in
    many of the entries in the example below, states that the allocation
    would succeed. If it would fail, the value is between 0 and 1. A
    value tending towards 0 implies the allocation failed due to a lack
    of memory. A value tending towards 1 implies it failed due to
    external fragmentation.

    For the most part the huge page size will be the size of interest,
    but not necessarily, so it is exported on a per-order and per-zone
    basis via /sys/kernel/debug/extfrag/extfrag_index.

    > cat /sys/kernel/debug/extfrag/extfrag_index
    Node 0, zone DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.00
    Node 0, zone Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.954

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Acked-by: Rik van Riel
    Reviewed-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    The unusable free space index measures how much of the available free
    memory cannot be used to satisfy an allocation of a given size; it is
    a value between 0 and 1. The higher the value, the more of free
    memory is unusable and, by implication, the worse the external
    fragmentation is. For the most part the huge page size will be the
    size of interest, but not necessarily, so it is exported on a
    per-order and per-zone basis via
    /sys/kernel/debug/extfrag/unusable_index.

    > cat /sys/kernel/debug/extfrag/unusable_index
    Node 0, zone DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
    Node 0, zone Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
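
    As a rough formula (an approximation of what the kernel computes),
    for a request of order o:

        unusable_index(o) = (free_pages - 2^o * free_blocks_suitable(o))
                            / free_pages

    where free_blocks_suitable(o) is the number of order-o blocks that
    can be carved out of the currently free areas.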

    [akpm@linux-foundation.org: Fix allnoconfig]
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …tion by not migrating temporary stacks

    Page migration requires rmap to be able to find all ptes mapping a
    page at all times, otherwise a migration entry can be instantiated
    but left behind if the second rmap_walk fails to find the page. If
    this page is later faulted, migration_entry_to_page() will call BUG
    because the page is locked, indicating that the page was migrated by
    a migration PTE that was not cleaned up. For example:

    kernel BUG at include/linux/swapops.h:105!
    invalid opcode: 0000 [#1] PREEMPT SMP
    ...
    Call Trace:
    [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
    [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e
    [<ffffffff813099b5>] page_fault+0x25/0x30
    [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b
    [<ffffffff8111329b>] search_binary_handler+0x173/0x313
    [<ffffffff81114896>] do_execve+0x219/0x30a
    [<ffffffff8100a5c6>] sys_execve+0x43/0x5e
    [<ffffffff8100320a>] stub_execve+0x6a/0xc0
    RIP [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129

    There is a race between shift_arg_pages and migration that triggers
    this bug. A temporary stack is set up during exec and later moved.
    If migration moves a page in the temporary stack and the VMA is then
    removed before migration completes, the migration PTE may not be
    found, leading to a BUG when the stack is faulted.

    This patch causes pages within the temporary stack during exec to be
    skipped by migration. It does this by marking the VMA covering the
    temporary stack with an otherwise impossible combination of VMA flags.
    These flags are cleared when the temporary stack is moved to its final
    location.
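
    A sketch of the marking (the mainline patch builds the marker from
    the mutually exclusive madvise read hints; helper name approximate):

    /* Bits set in the VMA until the stack is in its final location.
     * VM_RAND_READ and VM_SEQ_READ never legitimately co-exist, so
     * together they form an otherwise impossible combination. */
    #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ)

    /* rmap walkers skip VMAs carrying this marker during migration */
    static bool is_vma_temporary_stack(struct vm_area_struct *vma)
    {
            return (vma->vm_flags & VM_STACK_INCOMPLETE_SETUP) ==
                                            VM_STACK_INCOMPLETE_SETUP;
    }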

    [kamezawa.hiroyu@jp.fujitsu.com: idea for having migration skip temporary stacks]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • CONFIG_MIGRATION currently depends on CONFIG_NUMA or on the architecture
    being able to hot-remove memory. The main users of page migration such as
    sys_move_pages(), sys_migrate_pages() and cpuset process migration are
    only beneficial on NUMA so it makes sense.

    As memory compaction will operate within a zone and is useful on both NUMA
    and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
    user selects CONFIG_COMPACTION as an option.

    [akpm@linux-foundation.org: Depend on CONFIG_HUGETLB_PAGE]
    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: Rik van Riel
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    PageAnon pages that are unmapped may or may not have an anon_vma, so
    they are not currently migrated. However, a swap cache page can be
    migrated and fits this description. This patch identifies such swap
    cache pages and allows them to be migrated, while ensuring that no
    attempt to remap the pages is made that would potentially access an
    already freed anon_vma.

    Signed-off-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    rmap_walk_anon() was triggering errors in memory compaction that look
    like use-after-free errors. The problem is that between the page
    being isolated from the LRU and rcu_read_lock() being taken, the
    mapcount of the page dropped to 0 and the anon_vma was freed. This
    can happen during memory compaction if pages being migrated belong to
    a process that exits before migration completes. Hence, the
    use-after-free race looks like:

    1. Page isolated for migration
    2. Process exits
    3. page_mapcount(page) drops to zero, so the anon_vma is no longer
       reliable
    4. unmap_and_move() takes the rcu_lock but the anon_vma is already
       garbage
    5. try_to_unmap() is called, looks up the anon_vma and "locks" it,
       but the lock is garbage.

    This patch checks the mapcount after the rcu lock is taken. If the
    mapcount is zero, the anon_vma is assumed to be freed and no further
    action is taken.
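
    The guard is roughly (sketch of the unmap_and_move() path; reference
    counting elided):

    /* Only dereference the anon_vma while the page is still mapped;
     * once mapcount hits zero it may already have been freed. */
    if (PageAnon(page)) {
            rcu_read_lock();
            if (page_mapped(page)) {
                    anon_vma = page_anon_vma(page);
                    /* take a reference, then proceed with migration */
            }
            /* else: assume the anon_vma is gone and take no action */
    }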

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • For clarity of review, KSM and page migration have separate refcounts on
    the anon_vma. While clear, this is a waste of memory. This patch gets
    KSM and page migration to share their toys in a spirit of harmony.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    This patchset is a memory compaction mechanism that reduces external
    fragmentation by moving GFP_MOVABLE pages into a smaller number of
    pageblocks. The term "compaction" was chosen because there are a
    number of mechanisms, not mutually exclusive, that can be used to
    defragment memory. For example, lumpy reclaim is a form of
    defragmentation, as was slub "defragmentation" (really a form of
    targeted reclaim). Hence, this is called "compaction" to distinguish
    it from other forms of defragmentation.

    In this implementation, a full compaction run involves two scanners
    operating within a zone - a migration and a free scanner. The migration
    scanner starts at the beginning of a zone and finds all movable pages
    within one pageblock_nr_pages-sized area and isolates them on a
    migratepages list. The free scanner begins at the end of the zone and
    searches on a per-area basis for enough free pages to migrate all the
    pages on the migratepages list. As each area is respectively migrated or
    exhausted of free pages, the scanners are advanced one area. A compaction
    run completes within a zone when the two scanners meet.

    This method is a bit primitive but is easy to understand and greater
    sophistication would require maintenance of counters on a per-pageblock
    basis. This would have a big impact on allocator fast-paths to improve
    compaction which is a poor trade-off.

    It also does not try to relocate virtually contiguous pages to make
    them physically contiguous. However, assuming transparent hugepages
    were in use, a hypothetical khugepaged might reuse the compaction
    code to isolate free pages, split them and relocate userspace pages
    for promotion.

    Memory compaction can be triggered in one of three ways. It may be
    triggered explicitly by writing any value to
    /proc/sys/vm/compact_memory, which compacts all of memory. It can be
    triggered on a per-node basis by writing any value to
    /sys/devices/system/node/nodeN/compact where N is the node ID to be
    compacted. When a process fails to allocate a high-order page, it may
    compact memory in an attempt to satisfy the allocation instead of
    entering direct reclaim. Explicit compaction does not finish until
    the two scanners meet, and direct compaction ends if a suitable page
    becomes available that would meet watermarks.

    The series is in 14 patches. The first three are not "core" to the series
    but are important pre-requisites.

    Patch 1 reference counts anon_vma for rmap_walk_anon(). Without this
    patch, it's possible to use anon_vma after free if the caller is
    not holding a VMA or mmap_sem for the pages in question. While
    there should be no existing user that causes this problem,
    it's a requirement for memory compaction to be stable. The patch
    is at the start of the series for bisection reasons.
    Patch 2 merges the KSM and migrate counts. It could be merged with patch 1
    but would be slightly harder to review.
    Patch 3 skips over unmapped anon pages during migration as there are no
    guarantees about the anon_vma existing. There is a window between
    when a page was isolated and migration started during which anon_vma
    could disappear.
    Patch 4 notes that PageSwapCache pages can still be migrated even if they
    are unmapped.
    Patch 5 allows CONFIG_MIGRATION to be set without CONFIG_NUMA
    Patch 6 exports an "unusable free space index" via debugfs. It's
    a measure of external fragmentation that takes the size of the
    allocation request into account. It can also be calculated from
    userspace, so it can be dropped if requested
    Patch 7 exports a "fragmentation index" which only has meaning when an
    allocation request fails. It determines if an allocation failure
    would be due to a lack of memory or external fragmentation.
    Patch 8 moves the definition for LRU isolation modes for use by compaction
    Patch 9 is the compaction mechanism although it's unreachable at this point
    Patch 10 adds a means of compacting all of memory with a proc trigger
    Patch 11 adds a means of compacting a specific node with a sysfs trigger
    Patch 12 adds "direct compaction" before "direct reclaim" if it is
    determined there is a good chance of success.
    Patch 13 adds a sysctl that allows tuning of the threshold at which the
    kernel will compact or direct reclaim
    Patch 14 temporarily disables compaction if an allocation failure occurs
    after compaction.

    Testing of compaction was in three stages. For the test, debugging,
    preempt, the sleep watchdog and lockdep were all enabled but nothing nasty
    popped out. min_free_kbytes was tuned as recommended by hugeadm to help
    fragmentation avoidance and high-order allocations. It was tested on X86,
    X86-64 and PPC64.

    The first test represents one of the easiest cases that can be faced
    by lumpy reclaim or memory compaction.

    1. Machine freshly booted and configured for hugepage usage with
    a) hugeadm --create-global-mounts
    b) hugeadm --pool-pages-max DEFAULT:8G
    c) hugeadm --set-recommended-min_free_kbytes
    d) hugeadm --set-recommended-shmmax

    The min_free_kbytes here is important. Anti-fragmentation works best
    when pageblocks don't mix. hugeadm knows how to calculate a value that
    will significantly reduce the worst of external-fragmentation-related
    events as reported by the mm_page_alloc_extfrag tracepoint.

    2. Load up memory
    a) Start updatedb
    b) Create, in parallel, X files of pagesize*128 in size. Wait
    until the files are created. By parallel, I mean that 4096
    instances of dd were launched, one after the other, using &.
    The crude objective being to mix filesystem metadata allocations
    with the buffer cache.
    c) Delete every second file so that pageblocks are likely to
    have holes
    d) kill updatedb if it's still running

    At this point, the system is quiet, memory is full but it's full with
    clean filesystem metadata and clean buffer cache that is unmapped.
    This is readily migrated or discarded so you'd expect lumpy reclaim
    to have no significant advantage over compaction but this is at
    the POC stage.

    3. In increments, attempt to allocate 5% of memory as hugepages.
    Measure how long it took, how successful it was, how many
    direct reclaims took place and how many compactions. Note that
    the compaction figures might not fully add up, as compactions
    can take place for orders other than the hugepage size

    X86 vanilla compaction
    Final page count 913 916 (attempted 1002)
    pages reclaimed 68296 9791

    X86-64 vanilla compaction
    Final page count: 901 902 (attempted 1002)
    Total pages reclaimed: 112599 53234

    PPC64 vanilla compaction
    Final page count: 93 94 (attempted 110)
    Total pages reclaimed: 103216 61838

    There was not a dramatic improvement in success rates but it wouldn't be
    expected in this case either. What was important is that fewer pages were
    reclaimed in all cases reducing the amount of IO required to satisfy a
    huge page allocation.

    The second tests were all performance related - kernbench, netperf, iozone
    and sysbench. None showed anything too remarkable.

    The last test was a high-order allocation stress test. Many kernel
    compiles are started to fill memory with a pressured mix of unmovable and
    movable allocations. During this, an attempt is made to allocate 90% of
    memory as huge pages - one at a time with small delays between attempts to
    avoid flooding the IO queue.

    vanilla compaction
    Percentage of request allocated X86 98 99
    Percentage of request allocated X86-64 95 98
    Percentage of request allocated PPC64 55 70

    This patch:

    rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
    locking an anon_vma and it does not appear to have sufficient locking to
    ensure the anon_vma does not disappear from under it.

    This patch copies an approach used by KSM to take a reference on the
    anon_vma while pages are being migrated. This should prevent rmap_walk()
    running into nasty surprises later because anon_vma has been freed.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There are two types of zonelist ordering methodologies:

    - node order, preferring that allocations stay local to the node
    they are made from, and

    - zone order, preferring that allocations come from a higher zone to
    avoid allocating in lowmem zones even though they may not be local.

    The ordering technique used by the kernel is configurable on the command
    line, but also has some logic to determine what the default should be.

    This logic currently lacks knowledge of systems where a node may only have
    lowmem. For such systems, it is necessary to use node order so that
    GFP_KERNEL allocations may be satisfied by nodes consisting of only
    lowmem.

    If zone order is used, GFP_KERNEL allocations to such nodes are actually
    allocated on a node with local affinity that includes ZONE_NORMAL.

    This change defaults to node zonelist ordering if any node lacks
    ZONE_NORMAL.

    To force zone order, append 'numa_zonelist_order=zone' to the kernel
    command line.

    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Do page table walks with the well-known nested loops we use in several
    other places already.

    This avoids doing full page table walks after every pte range and
    also allows handling unmapped areas bigger than one pte range in one
    go.
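
    The well-known walk shape, schematically (walk_pud_range() stands in
    for the next-level helper):

    /* Each level clips `next' to its own boundary, so an unmapped
     * hole larger than one pte range is skipped in a single step. */
    pgd = pgd_offset(mm, addr);
    do {
            next = pgd_addr_end(addr, end);
            if (pgd_none_or_clear_bad(pgd))
                    continue;       /* whole pgd range is unmapped */
            walk_pud_range(pgd, addr, next, private);
    } while (pgd++, addr = next, addr != end);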

    Signed-off-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Instead of passing a start address and a number of pages into the helper
    functions, convert them to use a start and an end address.

    Signed-off-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Split out functions to handle hugetlb ranges, pte ranges and unmapped
    ranges, to improve readability but also to prepare the file structure for
    nested page table walks.

    No semantic changes intended.

    Signed-off-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This fixes some minor issues that bugged me while going over the code:

    o adjust argument order of do_mincore() to match the syscall
    o simplify range length calculation
    o drop superfluous shift in huge tlb calculation, address is page aligned
    o drop dead nr_huge calculation
    o check pte_none() before pte_present()
    o comment and whitespace fixes

    No semantic changes intended.

    Signed-off-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    Before applying this patch, cpuset updates task->mems_allowed and
    mempolicy by setting all new bits in the nodemask first, and clearing
    all old disallowed bits later. But along the way, the allocator may
    find no node from which to allocate memory.

    The reason is that when cpuset rebinds the task's mempolicy, it
    clears the nodes which the allocator can allocate pages from, for
    example:

    (mpol: mempolicy)
    task1                    task1's mpol    task2
    alloc page               1
      alloc on node0? NO     1
                             1               change mems from 1 to 0
                             1               rebind task1's mpol
                             0-1               set new bits
                             0                 clear disallowed bits
      alloc on node1? NO     0
      ...
    can't alloc page
      goto oom

    This patch fixes the problem by expanding the node range first (set
    newly allowed bits) and shrinking it lazily (clear newly disallowed
    bits). We use a variable to tell the write-side task that a read-side
    task is reading the nodemask, and the write-side task clears newly
    disallowed nodes after the read-side task finishes its current memory
    allocation.
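
    Schematically, with the kernel's nodemask helpers (tsk is the task
    being rebound; the read-side signalling variable is elided):

    /* Step 1: expand. Readers always see a superset, never an empty
     * mask, because old|new is installed before anything is cleared. */
    nodes_or(tsk->mems_allowed, tsk->mems_allowed, *new_mems);

    /* ... wait until no reader is inside a memory allocation ... */

    /* Step 2: shrink. Now it is safe to drop the old-only bits. */
    tsk->mems_allowed = *new_mems;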

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
    Nick Piggin reported that the allocator may see an empty nodemask
    when changing cpuset's mems[1]. It happens only on kernels that do
    not do atomic nodemask_t stores (MAX_NUMNODES > BITS_PER_LONG).

    But I found that there is also a problem on kernels that can do
    atomic nodemask_t stores. The problem is that the allocator can't
    find a node to allocate pages from when changing cpuset's mems, even
    though there is a lot of free memory. The reason is like this:

    (mpol: mempolicy)
    task1                    task1's mpol    task2
    alloc page               1
      alloc on node0? NO     1
                             1               change mems from 1 to 0
                             1               rebind task1's mpol
                             0-1               set new bits
                             0                 clear disallowed bits
      alloc on node1? NO     0
      ...
    can't alloc page
      goto oom

    I can use the attached program reproduce it by the following step:

    # mkdir /dev/cpuset
    # mount -t cpuset cpuset /dev/cpuset
    # mkdir /dev/cpuset/1
    # echo `cat /dev/cpuset/cpus` > /dev/cpuset/1/cpus
    # echo `cat /dev/cpuset/mems` > /dev/cpuset/1/mems
    # echo $$ > /dev/cpuset/1/tasks
    # numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog <nr_tasks> &
      <nr_tasks> = max(nr_cpus - 1, 1)
    # killall -s SIGUSR1 cpuset_mem_hog
    # ./change_mems.sh

    several hours later, oom will happen though there is a lot of free memory.

    This patchset fixes the problem by expanding the node range first
    (set newly allowed bits) and shrinking it lazily (clear newly
    disallowed bits). We use a variable to tell the write-side task that
    a read-side task is reading the nodemask, and the write-side task
    clears newly disallowed nodes after the read-side task finishes its
    current memory allocation.

    This patch:

    In order to fix the "no node to allocate memory from" problem, when
    we want to update mempolicy and mems_allowed, we expand the set of
    nodes first (set all the newly allowed nodes) and shrink the set of
    nodes lazily (clear the disallowed nodes). But the mempolicy's rebind
    functions may break the expanding.

    So we restructure the mempolicy's rebind functions and split the
    rebind work into two steps, just like the update of cpuset's mems:
    the 1st step expands the set of the mempolicy's nodes, and the 2nd
    step shrinks it. The two-step form is used when there is no real lock
    to protect the mempolicy on the read side; otherwise we can do the
    rebind work at once.

    In order to implement it, we define

    enum mpol_rebind_step {
            MPOL_REBIND_ONCE,
            MPOL_REBIND_STEP1,
            MPOL_REBIND_STEP2,
            MPOL_REBIND_NSTEP,
    };

    If the mempolicy needn't be updated in two steps, we can pass
    MPOL_REBIND_ONCE to the rebind functions. Otherwise we can pass
    MPOL_REBIND_STEP1 to do the first step of the rebind work and
    MPOL_REBIND_STEP2 to do the second.

    Besides that, a long time may pass between the two steps, during
    which we have to release the lock that protects the mempolicy and
    mems_allowed. If we take the lock once again, we must check whether
    the current mempolicy is in the middle of a rebind (the first step
    has been done but not the second), because the task may have
    allocated a new mempolicy while we didn't hold the lock. So we define
    the following flag to identify it:

    #define MPOL_F_REBINDING (1 << 2)

    The new functions will be used in the next patch.

    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • Factor out duplicate put/frees in mpol_shared_policy_init() to a common
    return path.

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
    Rename 'policy_types[]' to 'policy_modes[]' to better match the
    array contents.

    Use designated initializer syntax for policy_modes[].
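
    With designated initializers the table reads roughly like this
    (entry set assumed from the standard MPOL_* modes):

    static const char * const policy_modes[] = {
            [MPOL_DEFAULT]    = "default",
            [MPOL_PREFERRED]  = "prefer",
            [MPOL_BIND]       = "bind",
            [MPOL_INTERLEAVE] = "interleave",
    };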

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
    We don't really need the extra variable 'i' in mpol_parse_str(). Its
    only use is as the loop variable, after which it's assigned to
    'mode'. Just use 'mode' directly, and lose the 'uninitialized_var()'
    macro.

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • No need to call mpol_set_nodemask() when we have no context for the
    mempolicy. This can occur when we're parsing a tmpfs 'mpol' mount option.
    Just save the raw nodemask in the mempolicy's w.user_nodemask member for
    use when a tmpfs/shmem file is created. mpol_shared_policy_init() will
    "contextualize" the policy for the new file based on the creating task's
    context.

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
    Lee's patch "mempolicy: use MPOL_PREFERRED for system-wide default
    policy" has made MPOL_DEFAULT used only in the memory policy APIs, so
    there is no need to check for it in __mpol_equal() either. Also get
    rid of mpol_match_intent() and move its logic directly into
    __mpol_equal().

    Signed-off-by: Bob Liu
    Acked-by: David Rientjes
    Cc: Andi Kleen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu