14 Jan, 2011

13 commits

  • It's old-fashioned and unneeded.

    akpm:/usr/src/25> size mm/page_alloc.o
       text     data   bss     dec    hex filename
      39884  1241317 18808 1300009 13d629 mm/page_alloc.o (before)
      39838  1241317 18808 1299963 13d5fb mm/page_alloc.o (after)

    Acked-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The previous approach for calculating the combined index was

    page_idx & ~(1 << order)

    but we get the same result with

    page_idx & buddy_idx

    This reduces instructions slightly as well as enhancing readability.
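
    The equivalence holds because buddy_idx differs from page_idx only in
    bit 'order'. A small illustrative sketch (names follow the buddy
    allocator, but the snippet is not the literal patch):

    unsigned long buddy_idx = page_idx ^ (1UL << order); /* flip the order bit */
    /* page_idx and buddy_idx agree on every other bit, so ANDing them
     * clears bit 'order' and keeps the rest: */
    unsigned long combined_idx = page_idx & buddy_idx; /* == page_idx & ~(1UL << order) */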

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix used-uninitialised warning]
    Signed-off-by: KyongHo Cho
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KyongHo Cho
     
  • PG_buddy can be converted to _mapcount == -2. The PG_compound_lock can
    then be added to page->flags without overflowing (because of the sparse
    section bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y. This
    also requires moving the memory hotplug code from _mapcount to lru.next
    to avoid any risk of clashes. We can't use lru.next for the PG_buddy
    removal, but memory hotplug can use lru.next even more easily than
    _mapcount.
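
    A minimal sketch of the encoding this describes (abridged from the
    helpers the change introduces):

    /* _mapcount is -1 for a page with no mappings, so -2 is free to mean
     * "this page is in the buddy allocator". */
    static inline int PageBuddy(struct page *page)
    {
        return atomic_read(&page->_mapcount) == -2;
    }

    static inline void __SetPageBuddy(struct page *page)
    {
        VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
        atomic_set(&page->_mapcount, -2);
    }

    static inline void __ClearPageBuddy(struct page *page)
    {
        VM_BUG_ON(!PageBuddy(page));
        atomic_set(&page->_mapcount, -1);
    }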

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • It's not worth throwing away the precious reserved free memory pool for
    allocations that can fail gracefully (either through mempool or because
    they're transhuge allocations that later fall back to 4k allocations).

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Transparent hugepage allocations must be allowed not to invoke kswapd or
    any other kind of indirect reclaim (especially when the defrag sysfs
    control is disabled). It's unacceptable to swap out anonymous pages
    (potentially anonymous transparent hugepages) in order to create new
    transparent hugepages. This is true for the MADV_HUGEPAGE areas too: it
    makes no sense to swap out a KVM virtual machine, making it suffer an
    unbearable slowdown, just so another one with guest physical memory
    marked MADV_HUGEPAGE can run 30% faster on memory-intensive workloads.
    If a transparent hugepage allocation fails, the slowdown is minor and
    there is total fallback, so kswapd should never be asked to swap out
    memory to allow the high-order allocation to succeed.
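
    A simplified sketch of the mechanism (assuming a __GFP_NO_KSWAPD gfp
    flag and an abridged slow-path call; signatures are from memory):

    /* Allocator slow path: only wake kswapd when the caller allows it;
     * transparent hugepage allocations pass __GFP_NO_KSWAPD. */
    if (!(gfp_mask & __GFP_NO_KSWAPD))
        wake_all_kswapd(order, zonelist, high_zoneidx);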

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Warn in destroy_compound_page() that __split_huge_page_refcount() is
    heavily dependent on its internal behavior.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Clear the compound mapping for anonymous compound pages, as already
    happens for regular anonymous pages. But crash if a mapping is set for
    any tail page; the PageAnon check is meaningless for tail pages anyway.
    The check only makes sense for the head page; for tail pages it can only
    hide bugs, and we definitely don't want to hide bugs.
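
    A hedged sketch of the free path this describes (simplified;
    free_pages_check() stands in for the per-page sanity check):

    /* Clear the head's anon mapping, as for regular anon pages; a tail
     * page with a non-NULL mapping should trip the check, not be hidden. */
    if (PageAnon(page))
        page->mapping = NULL;
    for (i = 0; i < (1 << order); i++)
        bad += free_pages_check(page + i);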

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • page_count shows the count of the head page, but the actual check is done
    on the tail page, so show what is really being checked.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • When the numa_zonelist_order parameter is set to "node" or "zone" on the
    command line, it still shows up as "default" in sysctl. That's because
    the early_param parsing function changes only the user_zonelist_order
    variable. Fix this by copying the user-provided string to
    numa_zonelist_order if it was successfully parsed.
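
    A sketch of the fixed handler (close to the eventual code, though the
    exact names and buffer length are from memory):

    static int __init setup_numa_zonelist_order(char *s)
    {
        int ret;

        if (!s)
            return 0;

        ret = __parse_numa_zonelist_order(s);
        if (ret == 0)
            /* keep the sysctl-visible string in sync with what was parsed */
            strlcpy(numa_zonelist_order, s, NUMA_ZONELIST_ORDER_LEN);

        return ret;
    }
    early_param("numa_zonelist_order", setup_numa_zonelist_order);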

    Signed-off-by: Volodymyr G Lukiianyk
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Volodymyr G. Lukiianyk
     
  • Simon Kirby reported the following problem

    We're seeing cases on a number of servers where cache never fully
    grows to use all available memory. Sometimes we see servers with 4 GB
    of memory that never seem to have less than 1.5 GB free, even with a
    constantly-active VM. In some cases, these servers also swap out while
    this happens, even though they are constantly reading the working set
    into memory. We have been seeing this happening for a long time; I
    don't think it's anything recent, and it still happens on 2.6.36.

    After some debugging work by Simon, Dave Hansen and others, the prevailing
    theory became that kswapd is reclaiming order-3 pages requested by SLUB
    and is too aggressive about it.

    There are two apparent problems here. On the target machine, there is a
    small Normal zone in comparison to DMA32. As kswapd tries to balance all
    zones, it would continually try reclaiming for Normal even though DMA32
    was balanced enough for callers. The second problem is that
    sleeping_prematurely() does not use the same logic as balance_pgdat() when
    deciding whether to sleep or not. This keeps kswapd artificially awake.

    A number of tests were run and the figures from previous postings will
    look very different for a few reasons. One, the old figures were forcing
    my network card to use GFP_ATOMIC in an attempt to replicate Simon's
    problem. Second, I previously specified slub_min_order=3, again in an
    attempt to reproduce Simon's problem. In this posting, I'm depending on
    Simon to say whether his problem is fixed or not, and these figures are
    to show the impact on the ordinary cases. Finally, the "vmscan" figures
    are taken from /proc/vmstat instead of the tracepoints. There is less
    information but recording is less disruptive.

    The first test of relevance was postmark with a process running in the
    background reading a large amount of anonymous memory in blocks. The
    objective was to vaguely simulate what was happening on Simon's machine
    and it's memory intensive enough to have kswapd awake.

    POSTMARK
                                             traceonly          kanyzone
    Transactions per second:           156.00 ( 0.00%)   153.00 (-1.96%)
    Data megabytes read per second:     21.51 ( 0.00%)    21.52 ( 0.05%)
    Data megabytes written per second:  29.28 ( 0.00%)    29.11 (-0.58%)
    Files created alone per second:    250.00 ( 0.00%)   416.00 (39.90%)
    Files create/transact per second:   79.00 ( 0.00%)    76.00 (-3.95%)
    Files deleted alone per second:    520.00 ( 0.00%)   420.00 (-23.81%)
    Files delete/transact per second:   79.00 ( 0.00%)    76.00 (-3.95%)

    MMTests Statistics: duration
    User/Sys Time Running Test (seconds)     16.58     17.4
    Total Elapsed Time (seconds)            218.48   222.47

    VMstat Reclaim Statistics: vmscan
    Direct reclaims                          0        4
    Direct reclaim pages scanned             0      203
    Direct reclaim pages reclaimed           0      184
    Kswapd pages scanned                326631   322018
    Kswapd pages reclaimed              312632   309784
    Kswapd low wmark quickly                 1        4
    Kswapd high wmark quickly              122      475
    Kswapd skip congestion_wait              1        0
    Pages activated                     700040   705317
    Pages deactivated                   212113   203922
    Pages written                         9875     6363

    Total pages scanned                 326631   322221
    Total pages reclaimed               312632   309968
    %age total pages scanned/reclaimed  95.71%   96.20%
    %age total pages scanned/written     3.02%    1.97%

    proc vmstat: Faults
    Major Faults                           300      254
    Minor Faults                        645183   660284
    Page ins                            493588   486704
    Page outs                          4960088  4986704
    Swap ins                              1230      661
    Swap outs                             9869     6355

    Performance is mildly affected because kswapd is no longer doing as much
    work and the background memory consumer process is getting in the way.
    Note that kswapd scanned and reclaimed fewer pages, as it's less
    aggressive, and overall fewer pages were scanned and reclaimed. Swap
    in/out is particularly reduced, again reflecting kswapd throwing out
    fewer pages.

    The slight performance impact is unfortunate here but it looks like a
    direct result of kswapd being less aggressive. As the bug report is about
    too many pages being freed by kswapd, it may have to be accepted for now.

    The second test is a streaming IO benchmark that was previously used by
    Johannes to show regressions in page reclaim.

    MICRO
                                          traceonly  kanyzone
    User/Sys Time Running Test (seconds)      29.29     28.87
    Total Elapsed Time (seconds)             492.18    488.79

    VMstat Reclaim Statistics: vmscan
    Direct reclaims                            2128      1460
    Direct reclaim pages scanned            2284822   1496067
    Direct reclaim pages reclaimed           148919    110937
    Kswapd pages scanned                   15450014  16202876
    Kswapd pages reclaimed                  8503697   8537897
    Kswapd low wmark quickly                   3100      3397
    Kswapd high wmark quickly                  1860      7243
    Kswapd skip congestion_wait                 708       801
    Pages activated                            9635      9573
    Pages deactivated                          1432      1271
    Pages written                               223      1130

    Total pages scanned                    17734836  17698943
    Total pages reclaimed                   8652616   8648834
    %age total pages scanned/reclaimed       48.79%    48.87%
    %age total pages scanned/written          0.00%     0.01%

    proc vmstat: Faults
    Major Faults                                165       221
    Minor Faults                            9655785   9656506
    Page ins                                   3880      7228
    Page outs                              37692940  37480076
    Swap ins                                      0        69
    Swap outs                                    19        15

    Again fewer pages are scanned and reclaimed as expected and this time the
    test completed faster. Note that kswapd is hitting its watermarks faster
    (low and high wmark quickly) which I expect is due to kswapd reclaiming
    fewer pages.

    I also ran fs-mark, iozone and sysbench but there is nothing interesting
    to report in the figures. Performance is not significantly changed and
    the reclaim statistics look reasonable.

    This patch:

    When the allocator enters its slow path, kswapd is woken up to balance the
    node. It continues working until all zones within the node are balanced.
    For order-0 allocations, this makes perfect sense but for higher orders it
    can have unintended side-effects. If the zone sizes are imbalanced,
    kswapd may reclaim heavily within a smaller zone discarding an excessive
    number of pages. The user-visible behaviour is that kswapd is awake and
    reclaiming even though plenty of pages are free from a suitable zone.

    This patch alters the "balance" logic for high-order reclaim allowing
    kswapd to stop if any suitable zone becomes balanced to reduce the number
    of pages it reclaims from other zones. kswapd still tries to ensure that
    order-0 watermarks for all zones are met before sleeping.
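
    A rough sketch of the altered check in balance_pgdat() (illustrative
    only; the real logic tracks more per-zone state):

    /* For high-order reclaim, one suitable balanced zone is enough:
     * drop back to order-0 balancing instead of pounding small zones. */
    if (order && zone_watermark_ok(zone, order,
                                   high_wmark_pages(zone), 0, 0))
        order = 0;      /* finish the pass as an order-0 balance */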

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Eric B Munson
    Cc: Simon Kirby
    Cc: KOSAKI Motohiro
    Cc: Shaohua Li
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …ompaction in the faster path

    Migration synchronously waits for writeback if the initial passes fail.
    Callers of memory compaction do not necessarily want this behaviour if the
    caller is latency sensitive or expects that synchronous migration is not
    going to have a significantly better success rate.

    This patch adds a sync parameter to migrate_pages() allowing the caller to
    indicate if wait_on_page_writeback() is allowed within migration or not.
    For reclaim/compaction, try_to_compact_pages() is first called
    asynchronously, direct reclaim runs and then try_to_compact_pages() is
    called synchronously as there is a greater expectation that it'll succeed.
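
    A hedged sketch of how the flag is threaded through (signatures and
    call sites abridged):

    /* migrate_pages() grows a sync parameter: */
    int migrate_pages(struct list_head *from, new_page_t get_new_page,
                      unsigned long private, bool offlining, bool sync);

    /* The allocator slow path then compacts twice: */
    bool sync_migration = false;

    page = __alloc_pages_direct_compact(/* other args elided */ sync_migration);
    if (!page) {
        page = __alloc_pages_direct_reclaim(/* args elided */);
        sync_migration = true;
        if (!page)      /* second, more patient attempt */
            page = __alloc_pages_direct_compact(/* other args elided */ sync_migration);
    }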

    [akpm@linux-foundation.org: build/merge fix]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Rik van Riel <riel@redhat.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Andy Whitcroft <apw@shadowen.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • Lumpy reclaim is disruptive. It reclaims a large number of pages and
    ignores the age of the pages it reclaims. This can incur significant
    stalls and potentially increase the number of major faults.

    Compaction has reached the point where it is considered reasonably stable
    (meaning it has passed a lot of testing) and is a potential candidate for
    displacing lumpy reclaim. This patch introduces an alternative to lumpy
    reclaim, used when compaction is available, called reclaim/compaction.
    The basic operation is very simple: instead of selecting a contiguous
    range of pages to reclaim, a number of order-0 pages are reclaimed and
    compaction is applied later, by either kswapd (compact_zone_order()) or
    direct compaction (__alloc_pages_direct_compact()).
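
    A hedged sketch of the decision this implies in the reclaim loop (the
    names and the margin are illustrative, not the literal patch):

    /* Keep reclaiming order-0 pages until compaction has enough slack
     * to assemble a page of the requested order. */
    unsigned long pages_for_compaction = 2UL << sc->order;

    if (sc->nr_reclaimed < pages_for_compaction)
        return true;            /* continue order-0 reclaim */

    /* Enough free pages: let compaction form the high-order page. */
    compact_zone_order(zone, sc->order, sc->gfp_mask);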

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: use conventional task_struct naming]
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit aa45484 ("calculate a better estimate of NR_FREE_PAGES when memory
    is low") noted that watermarks were based on the vmstat NR_FREE_PAGES. To
    avoid synchronization overhead, these counters are maintained on a per-cpu
    basis and drained both periodically and when the delta is above a
    threshold. On large CPU systems, the difference between the estimate and
    real value of NR_FREE_PAGES can be very high. The system can get into a
    case where pages are allocated far below the min watermark potentially
    causing livelock issues. The commit solved the problem by taking a better
    reading of NR_FREE_PAGES when memory was low.

    Unfortunately, as reported by Shaohua Li, this accurate reading can consume a
    large amount of CPU time on systems with many sockets due to cache line
    bouncing. This patch takes a different approach. For large machines
    where counter drift might be unsafe and while kswapd is awake, the per-cpu
    thresholds for the target pgdat are reduced to limit the level of drift to
    what should be a safe level. This incurs a performance penalty in heavy
    memory pressure by a factor that depends on the workload and the machine
    but the machine should function correctly without accidentally exhausting
    all memory on a node. There is an additional cost when kswapd wakes and
    sleeps but the event is not expected to be frequent - in Shaohua's test
    case, there was one recorded sleep and wake event at least.

    To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
    introduced that takes a more accurate reading of NR_FREE_PAGES when called
    from wakeup_kswapd, when deciding whether it is really safe to go back to
    sleep in sleeping_prematurely() and when deciding if a zone is really
    balanced or not in balance_pgdat(). We are still using an expensive
    function but limiting how often it is called.
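
    A sketch of the safe variant (close to the helper described above,
    reproduced from memory):

    bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
                                int classzone_idx, int alloc_flags)
    {
        long free_pages = zone_page_state(z, NR_FREE_PAGES);

        /* Only pay for the accurate per-cpu summing read when the cheap
         * estimate is close enough to the mark to be unsafe. */
        if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
            free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);

        return __zone_watermark_ok(z, order, mark, classzone_idx,
                                   alloc_flags, free_pages);
    }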

    When the test case is reproduced, the time spent in the watermark
    functions is reduced. The following report is on the percentage of time
    cumulatively spent in the functions zone_nr_free_pages(),
    zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
    zone_page_state_snapshot() and zone_page_state():

    vanilla 11.6615%
    disable-threshold 0.2584%

    David said:

    : We had to pull aa454840 "mm: page allocator: calculate a better estimate
    : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
    : internally because tests showed that it would cause the machine to stall
    : as the result of heavy kswapd activity. I merged it back with this fix as
    : it is pending in the -mm tree and it solves the issue we were seeing, so I
    : definitely think this should be pushed to -stable (and I would seriously
    : consider it for 2.6.37 inclusion even at this late date).

    Signed-off-by: Mel Gorman
    Reported-by: Shaohua Li
    Reviewed-by: Christoph Lameter
    Tested-by: Nicolas Bareil
    Cc: David Rientjes
    Cc: Kyle McMartin
    Cc: [2.6.37.1, 2.6.36.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

07 Dec, 2010

1 commit

  • There is a problem that swap pages allocated before the creation of
    a hibernation image can be released and used for storing the contents
    of different memory pages while the image is being saved. Since the
    kernel stored in the image doesn't know of that, it causes memory
    corruption to occur after resume from hibernation, especially on
    systems with relatively small RAM that need to swap often.

    This issue can be addressed by keeping the GFP_IOFS bits clear
    in gfp_allowed_mask during the entire hibernation, including the
    saving of the image, until the system is finally turned off or
    the hibernation is aborted. Unfortunately, for this purpose
    it's necessary to rework the way in which the hibernate and
    suspend code manipulates gfp_allowed_mask.

    This change is based on an earlier patch from Hugh Dickins.
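
    The reworked manipulation looks roughly like this (a sketch of the
    helpers this change introduces):

    static gfp_t saved_gfp_mask;

    void pm_restrict_gfp_mask(void)
    {
        WARN_ON(saved_gfp_mask);
        saved_gfp_mask = gfp_allowed_mask;
        gfp_allowed_mask &= ~GFP_IOFS;  /* no IO/FS allocations from here on */
    }

    void pm_restore_gfp_mask(void)
    {
        if (saved_gfp_mask) {
            gfp_allowed_mask = saved_gfp_mask;
            saved_gfp_mask = 0;
        }
    }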

    Signed-off-by: Rafael J. Wysocki
    Reported-by: Ondrej Zary
    Acked-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: stable@kernel.org

    Rafael J. Wysocki
     

29 Nov, 2010

1 commit

  • These warnings are spewed during a build of an 'allnoconfig' kernel
    (especially the ones from u64_stats_sync.h show up a lot) when building
    with -Wextra (which I often do). They are

    a) annoying
    b) easy to get rid of.

    This patch kills them off.

    include/linux/u64_stats_sync.h:70:1: warning: ‘inline’ is not at beginning of declaration
    include/linux/u64_stats_sync.h:77:1: warning: ‘inline’ is not at beginning of declaration
    include/linux/u64_stats_sync.h:84:1: warning: ‘inline’ is not at beginning of declaration
    include/linux/u64_stats_sync.h:96:1: warning: ‘inline’ is not at beginning of declaration
    include/linux/u64_stats_sync.h:115:1: warning: ‘inline’ is not at beginning of declaration
    include/linux/u64_stats_sync.h:127:1: warning: ‘inline’ is not at beginning of declaration
    kernel/time.c:241:1: warning: ‘inline’ is not at beginning of declaration
    kernel/time.c:257:1: warning: ‘inline’ is not at beginning of declaration
    kernel/perf_event.c:4513:1: warning: ‘inline’ is not at beginning of declaration
    mm/page_alloc.c:4012:1: warning: ‘inline’ is not at beginning of declaration
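
    The warning and its fix in miniature (illustrative names):

    /* warns with gcc -Wextra: 'inline' is not at beginning of declaration */
    static void inline bad_example(void) { }

    /* fixed: put 'inline' before the return type */
    static inline void good_example(void) { }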

    Signed-off-by: Jesper Juhl
    Signed-off-by: Jiri Kosina

    Jesper Juhl
     

25 Nov, 2010

1 commit

  • … under stop_machine_run()

    During memory hotplug, build_all_zonelists() may be called under
    stop_machine_run(). In this function, setup_zone_pageset() is called.
    But it's a bug because it will do page allocation under
    stop_machine_run().

    Here is a report from Alok Kataria.

    BUG: sleeping function called from invalid context at kernel/mutex.c:94
    in_atomic(): 0, irqs_disabled(): 1, pid: 4, name: migration/0
    Pid: 4, comm: migration/0 Not tainted 2.6.35.6-45.fc14.x86_64 #1
    Call Trace:
    [<ffffffff8103d12b>] __might_sleep+0xeb/0xf0
    [<ffffffff81468245>] mutex_lock+0x24/0x50
    [<ffffffff8110eaa6>] pcpu_alloc+0x6d/0x7ee
    [<ffffffff81048888>] ? load_balance+0xbe/0x60e
    [<ffffffff8103a1b3>] ? rt_se_boosted+0x21/0x2f
    [<ffffffff8103e1cf>] ? dequeue_rt_stack+0x18b/0x1ed
    [<ffffffff8110f237>] __alloc_percpu+0x10/0x12
    [<ffffffff81465e22>] setup_zone_pageset+0x38/0xbe
    [<ffffffff810d6d81>] ? build_zonelists_node.clone.58+0x79/0x8c
    [<ffffffff81452539>] __build_all_zonelists+0x419/0x46c
    [<ffffffff8108ef01>] ? cpu_stopper_thread+0xb2/0x198
    [<ffffffff8108f075>] stop_machine_cpu_stop+0x8e/0xc5
    [<ffffffff8108efe7>] ? stop_machine_cpu_stop+0x0/0xc5
    [<ffffffff8108ef57>] cpu_stopper_thread+0x108/0x198
    [<ffffffff81467a37>] ? schedule+0x5b2/0x5cc
    [<ffffffff8108ee4f>] ? cpu_stopper_thread+0x0/0x198
    [<ffffffff81065f29>] kthread+0x7f/0x87
    [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
    [<ffffffff81065eaa>] ? kthread+0x0/0x87
    [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
    Built 5 zonelists in Node order, mobility grouping on. Total pages: 289456
    Policy zone: Normal

    This patch tries to fix the issue by moving setup_zone_pageset() out from
    under stop_machine_run(). It's obviously not necessary for it to be
    called under stop_machine_run().
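
    Roughly, the fix has this shape (a sketch; details abridged):

    /* Runs in a sleepable context during memory hotplug. */
    void build_all_zonelists(void *data)
    {
        /* Allocate the per-cpu pagesets here, where sleeping is fine... */
        if (data)
            setup_zone_pageset((struct zone *)data);

        /* ...and only then rebuild the zonelists atomically. */
        stop_machine(__build_all_zonelists, NULL, NULL);
    }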

    [akpm@linux-foundation.org: remove unneeded local]
    Reported-by: Alok Kataria <akataria@vmware.com>
    Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Petr Vandrovec <petr@vmware.com>
    Cc: Pekka Enberg <penberg@cs.helsinki.fi>
    Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    KAMEZAWA Hiroyuki
     

27 Oct, 2010

5 commits

  • This removes the following warning from sparse:

    mm/page_alloc.c:1934:9: warning: restricted gfp_t degrades to integer

    Signed-off-by: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • …r if significant congestion is not being encountered in the current zone

    If congestion_wait() is called with no BDI congested, the caller will
    sleep for the full timeout, and this may be an unnecessary sleep. This
    patch adds a wait_iff_congested() that checks congestion and only sleeps
    if a BDI is congested; otherwise it calls cond_resched() to ensure the
    caller is not hogging the CPU longer than its quota, but it does not
    sleep.

    This is aimed at reducing some of the major desktop stalls reported during
    IO. For example, while kswapd is operating, it calls congestion_wait()
    but it could just have been reclaiming clean page cache pages with no
    congestion. Without this patch, it would sleep for a full timeout but
    after this patch, it'll just call schedule() if it has been on the CPU too
    long. Similar logic applies to direct reclaimers that are not making
    enough progress.
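
    In outline (a simplified sketch of the new helper):

    long wait_iff_congested(struct zone *zone, int sync, long timeout)
    {
        /* No backing device is congested: don't sleep, just be polite. */
        if (atomic_read(&nr_bdi_congested[sync]) == 0) {
            cond_resched();
            return 0;
        }

        /* Otherwise behave like congestion_wait(). */
        return congestion_wait(sync, timeout);
    }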

    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • Now, the sysfs interface of memory hotplug shows whether a section is
    removable or not, but it checks only the migratetype of the pages and
    doesn't check the details of the cluster of pages.

    Next, memory hotplug's set_migratetype_isolate() has the same kind of
    check, too.

    This patch adds the function __count_unmovable_pages() and makes the
    above two checks use the same logic. Then, is_removable and the hotremove
    code use the same logic. No changes in the hotremove logic itself.

    TODO: we need to find a way to check RECLAIMABLE. But, considering it a
    bit, calling shrink_slab() against a range before starting memory
    hotremove sounds better. If so, this patch's logic doesn't need to be
    changed.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Reported-by: Michal Hocko
    Cc: Wu Fengguang
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Even if the notifier cannot find any pages, it doesn't mean that no pages
    are available... And if there are no notifiers registered, this condition
    will always be true and memory hotplug will show -EBUSY.

    This is a bug, but not a critical one.

    In most cases, a pageblock which will be offlined is MIGRATE_MOVABLE.
    This "notifier" is called only when the pageblock is _not_
    MIGRATE_MOVABLE. But if it's not MIGRATE_MOVABLE, it's a common case that
    memory hotplug will fail. So, no one noticed this bug.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Wu Fengguang
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • There is a bug in commit 6dda9d55 ("page allocator: reduce fragmentation
    in buddy allocator by adding buddies that are merging to the tail of the
    free lists") that means a buddy at order MAX_ORDER is checked for merging.
    A page of this order never exists so at times, an effectively random
    piece of memory is being checked.

    Alan Curry has reported that this is causing memory corruption in
    userspace data on a PPC32 platform (http://lkml.org/lkml/2010/10/9/32).
    It is not clear why this is happening. It could be a cache coherency
    problem where pages mapped in both user and kernel space are getting
    different cache lines due to the bad read from kernel space
    (http://lkml.org/lkml/2010/10/13/179). It could also be that there are
    some special registers being io-remapped at the end of the memmap array
    and that a read has special meaning on them. Compiler bugs have been
    ruled out because the assembly before and after the patch looks relatively
    harmless.

    This patch fixes the problem by ensuring we are not reading a possibly
    invalid location of memory. It's not clear why the read causes corruption
    but one way or the other it is a buggy read.
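
    The shape of the fix (a sketch of the guard in __free_one_page(); the
    surrounding code is abridged):

    /* Only peek at the next-higher buddy when such an order can exist and
     * the buddy's pfn is valid to read. */
    if ((order < MAX_ORDER - 2) && pfn_valid_within(page_to_pfn(buddy))) {
        /* check page_is_buddy(higher_page, higher_buddy, order + 1) and,
         * if it holds, add the freed page to the tail of the free list */
    }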

    Signed-off-by: Mel Gorman
    Cc: Corrado Zoccolo
    Reported-by: Alan Curry
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

10 Sep, 2010

3 commits

  • When under significant memory pressure, a process enters direct reclaim
    and immediately afterwards tries to allocate a page. If it fails and no
    further progress is made, it's possible the system will go OOM. However,
    on systems with large amounts of memory, it's possible that a significant
    number of pages are on per-cpu lists and inaccessible to the calling
    process. This leads to a process entering direct reclaim more often than
    it should, increasing the pressure on the system and compounding the
    problem.

    This patch notes that if direct reclaim is making progress but allocations
    are still failing, the system is already under heavy pressure. In
    this case, it drains the per-cpu lists and tries the allocation a second
    time before continuing.
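
    In sketch form (simplified from the direct-reclaim slow path; the real
    call carries more arguments):

    /* Direct reclaim made progress but the allocation still failed: */
    page = get_page_from_freelist(/* gfp, order, zonelist: args elided */);
    if (!page && !drained) {
        drain_all_pages();   /* flush per-cpu free lists back to the buddy lists */
        drained = true;
        page = get_page_from_freelist(/* args elided */);  /* one retry */
    }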

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Cc: Dave Chinner
    Cc: Wu Fengguang
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …low and kswapd is awake

    Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
    cheaper than scanning a number of lists. To avoid synchronization
    overhead, counter deltas are maintained on a per-cpu basis and drained
    both periodically and when the delta is above a threshold. On large CPU
    systems, the difference between the estimated and real value of
    NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than the
    number of really free pages in the buddy lists, the VM can allocate pages
    below the min watermark, at worst reducing the real number of free pages
    to zero. Even if the OOM killer kills some victim to free memory, it may
    not free memory if the exit path requires a new page, resulting in
    livelock.

    This patch introduces a zone_page_state_snapshot() function (courtesy of
    Christoph) that takes a slightly more accurate view of an arbitrary vmstat
    counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid
    the watermark being accidentally broken. The estimate is not perfect and
    may result in cache line bounces but is expected to be lighter than the
    IPI calls necessary to continually drain the per-cpu counters while kswapd
    is awake.
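
    A sketch of the snapshot function (close to the helper described above):

    static inline unsigned long
    zone_page_state_snapshot(struct zone *zone, enum zone_stat_item item)
    {
        long x = atomic_long_read(&zone->vm_stat[item]);
    #ifdef CONFIG_SMP
        int cpu;

        /* Fold in the not-yet-drained per-cpu deltas for a closer estimate. */
        for_each_online_cpu(cpu)
            x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

        if (x < 0)
            x = 0;
    #endif
        return x;
    }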

    Signed-off-by: Christoph Lameter <cl@linux.com>
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Christoph Lameter
     
  • When allocating a page, the system uses NR_FREE_PAGES counters to
    determine if watermarks would remain intact after the allocation was made.
    This check is made without interrupts disabled or the zone lock held and
    so is race-prone by nature. Unfortunately, when pages are being freed in
    batch, the counters are updated before the pages are added on the list.
    During this window, the counters are misleading as the pages do not exist
    yet. When under significant pressure on systems with large numbers of
    CPUs, it's possible for processes to make progress even though they should
    have been stalled. This is particularly problematic if a number of the
    processes are using GFP_ATOMIC as the min watermark can be accidentally
    breached and in extreme cases, the system can livelock.

    This patch updates the counters after the pages have been added to the
    list. This makes the allocator more cautious with respect to preserving
    the watermarks and mitigates livelock possibilities.
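
    Schematically, the change is an ordering fix in the bulk free path (a
    sketch, not the literal diff):

    /* Before: __mod_zone_page_state(zone, NR_FREE_PAGES, count) ran here,
     * so the racy watermark check could see pages that were not yet on
     * any free list. */
    while (to_free--) {
        /* pick the next page off the per-cpu list ... */
        __free_one_page(page, zone, 0, migratetype);
    }
    /* After: account the pages only once they really exist on the lists. */
    __mod_zone_page_state(zone, NR_FREE_PAGES, count);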

    [akpm@linux-foundation.org: avoid modifying incoming args]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

28 Aug, 2010

2 commits

  • 1. Replace find_e820_area with memblock_find_in_range.
    2. Replace reserve_early with memblock_x86_reserve_range.
    3. Replace free_early with memblock_x86_free_range.
    4. NO_BOOTMEM will switch to use memblock too.
    5. Use _e820/_early wrappers in this patch; a following patch will
       replace them all.
    6. Because memblock_x86_free_range supports partial free, we can remove
       some special-casing.
    7. Make sure that memblock_find_in_range() is called after
       memblock_x86_fill(), so adjust some calls later in
       setup.c::setup_arch() -- corruption_check and mptable_update.

    -v2: Move reserve_brk() early, before fill_memblock_area(), to avoid an
         overlap between brk and memblock_find_in_range(). That could happen
         when we have more than 128 RAM entries in the E820 table and
         memblock_x86_fill() uses memblock_find_in_range() to find a new
         place for the memblock.memory.region array. We also don't need to
         use extend_brk() after fill_memblock_area(), so move reserve_brk()
         early, before fill_memblock_area().
    -v3: Move find_smp_config early, to make sure memblock_find_in_range
         doesn't find the wrong place if the BIOS doesn't put the mptable in
         the right place.
    -v4: Treat RESERVED_KERN as RAM in memblock.memory; those ranges are
         already in memblock.reserved.
         Use __NOT_KEEP_MEMBLOCK to make sure memblock-related code can be
         freed later.
    -v5: The generic version of __memblock_find_in_range() goes from high to
         low, and active_region for 32bit does include high pages, so we need
         to replace the limit with memblock.default_alloc_limit, aka
         get_max_mapped().
    -v6: Use current_limit instead.
    -v7: Check with MEMBLOCK_ERROR instead of -1ULL or -1L.
    -v8: Set memblock_can_resize early to handle EFI with more RAM entries.
    -v9: Update after the kmemleak changes in mainline.

    Suggested-by: David S. Miller
    Suggested-by: Benjamin Herrenschmidt
    Suggested-by: Thomas Gleixner
    Signed-off-by: Yinghai Lu
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     
  • Use the node ranges in early_node_map[] with __memblock_find_in_range()
    to find a free range.

    Will be used by memblock_x86_find_in_range_node().

    memblock_x86_find_in_range_node() will be used to find the right buffer
    for NODE_DATA.

    Signed-off-by: Yinghai Lu
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     

10 Aug, 2010

3 commits

  • Since 2.6.28 zone->prev_priority is unused, so it can be removed safely.
    This reduces stack usage slightly.

    Now I have to say that I'm sorry. Two years ago I thought prev_priority
    could be integrated again and would be useful, but four (or more)
    attempts haven't produced good performance numbers. Thus I give up on
    that approach.

    The rest of this changelog consists of notes on prev_priority: why it
    existed in the first place and why it might not be necessary any more.
    This information is based heavily on discussions between Andrew Morton,
    Rik van Riel and Kosaki Motohiro, who is heavily quoted below.

    Historically prev_priority was important because it determined when the VM
    would start unmapping PTE pages. i.e. there are no balances of note within
    the VM, Anon vs File and Mapped vs Unmapped. Without prev_priority, there
    is a potential risk of unnecessarily increasing minor faults as a large
    amount of read activity of use-once pages could push mapped pages to the
    end of the LRU and get unmapped.

    There is no proof this is still a problem, but currently it is not
    considered to be one. Active files are not deactivated if the active file
    list is smaller than the inactive list, reducing the likelihood that
    file-mapped pages are being pushed off the LRU, and referenced executable
    pages are kept on the active list to avoid them getting pushed out by
    read activity.

    Even if it is a problem, prev_priority wouldn't work nowadays. First of
    all, current vmscan still has a lot of UP-centric code; it exposes some
    weaknesses on machines with dozens of CPUs. I think we need more and more
    improvement.

    The problem is that current vmscan mixes up per-system pressure, per-zone
    pressure and per-task pressure a bit. For example, prev_priority tries to
    boost the priority of other concurrent reclaimers, but if another task
    has a mempolicy restriction this is unnecessary, and it also causes
    wrongly large latencies and excessive reclaim. Per-task-based priority
    plus prev_priority adjustment emulates per-system pressure, but it has
    two issues: 1) the emulation is too rough and brutal, and 2) we need
    per-zone pressure, not per-system.

    Another example: currently DEF_PRIORITY is 12, which means the LRU is
    rotated about 2 cycles (1/4096 + 1/2048 + 1/1024 + ... + 1) before
    invoking the OOM killer. But if 10,000 threads enter DEF_PRIORITY reclaim
    at the same time, the system has higher memory pressure than priority==0
    (1/4096 * 10,000 > 2). prev_priority can't solve such multithreaded
    workload issues. In other words, the prev_priority concept assumes the
    system doesn't have lots of threads.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Dave Chinner
    Cc: Chris Mason
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Andrea Arcangeli
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • We have been using the names try_set_zone_oom and clear_zonelist_oom.
    The role of these functions is to lock the zonelist to prevent parallel
    OOM handling. So clear_zonelist_oom makes sense, but try_set_zone_oom is
    rather awkward and doesn't match clear_zonelist_oom.

    Let's rename it to try_set_zonelist_oom.

    Signed-off-by: Minchan Kim
    Acked-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • If memory has been depleted in lowmem zones even with the protection
    afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
    killing current users will help. The memory is either reclaimable (or
    migratable) already, in which case we should not invoke the oom killer at
    all, or it is pinned by an application for I/O. Killing such an
    application may leave the hardware in an unspecified state and there is no
    guarantee that it will be able to make a timely exit.

    Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is
    not used so that the task can perhaps recover or try again later.

    Previously, the heuristic provided some protection for those tasks with
    CAP_SYS_RAWIO, but this is no longer necessary since we will not be
    killing tasks for the purposes of ISA allocations.

    high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
    default for all allocations that are not __GFP_DMA, __GFP_DMA32,
    __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
    flags. Testing for high_zoneidx being less than ZONE_NORMAL will only
    return true for allocations that have either __GFP_DMA or __GFP_DMA32.
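
    The resulting check is tiny (a sketch of the bail-out in the OOM path):

    /* The OOM killer does not needlessly kill tasks for lowmem: only
     * __GFP_DMA/__GFP_DMA32 allocations arrive here with high_zoneidx
     * below ZONE_NORMAL, and killing tasks will not help them. */
    if (high_zoneidx < ZONE_NORMAL)
        goto out;       /* fail the allocation instead */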

    Acked-by: KOSAKI Motohiro
    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

21 Jul, 2010

1 commit

  • Borislav Petkov reported that his 32-bit NUMA system has a problem:

    [ 0.000000] Reserving total of 4c00 pages for numa KVA remap
    [ 0.000000] kva_start_pfn ~ 32800 max_low_pfn ~ 375fe
    [ 0.000000] max_pfn = 238000
    [ 0.000000] 8202MB HIGHMEM available.
    [ 0.000000] 885MB LOWMEM available.
    [ 0.000000] mapped low ram: 0 - 375fe000
    [ 0.000000] low ram: 0 - 375fe000
    [ 0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 1000 1000 => 34e7000
    [ 0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 200 40 => 34c9d80
    [ 0.000000] alloc (nid=0 100000 - 7ee00000) (1000000 - ffffffffffffffff) 180 40 => 34e6140
    [ 0.000000] alloc (nid=1 80000000 - c7e60000) (1000000 - ffffffffffffffff) 240 40 => 80000000
    [ 0.000000] BUG: unable to handle kernel paging request at 40000000
    [ 0.000000] IP: [] __alloc_memory_core_early+0x147/0x1d6
    [ 0.000000] *pdpt = 0000000000000000 *pde = f000ff53f000ff00
    ...
    [ 0.000000] Call Trace:
    [ 0.000000] [] ? __alloc_bootmem_node+0x216/0x22f
    [ 0.000000] [] ? sparse_early_usemaps_alloc_node+0x5a/0x10b
    [ 0.000000] [] ? sparse_init+0x1dc/0x499
    [ 0.000000] [] ? paging_init+0x168/0x1df
    [ 0.000000] [] ? native_pagetable_setup_start+0xef/0x1bb

    It looks like it allocates too high an address for bootmem.

    Try to cut the limit with get_max_mapped().

    Reported-by: Borislav Petkov
    Tested-by: Conny Seidel
    Signed-off-by: Yinghai Lu
    Cc: [2.6.34.x]
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Johannes Weiner
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     

19 Jul, 2010

1 commit

  • With commits 08677214 and 59be5a8e, alloc_bootmem()/free_bootmem() and
    friends use the early_res functions for memory management when
    NO_BOOTMEM is enabled. This patch adds the kmemleak calls in the
    corresponding code paths for bootmem allocations.

    Signed-off-by: Catalin Marinas
    Acked-by: Pekka Enberg
    Acked-by: Yinghai Lu
    Cc: H. Peter Anvin
    Cc: stable@kernel.org

    Catalin Marinas
     

28 May, 2010

2 commits

  • Introduce numa_mem_id(), based on the generic percpu variable
    infrastructure, to track the "nearest node with memory" for archs that
    support memoryless nodes.

    Define the API when CONFIG_HAVE_MEMORYLESS_NODES is defined; otherwise
    provide stubs. Architectures will define HAVE_MEMORYLESS_NODES if/when
    they support memoryless nodes.

    Archs can override definitions of:

    numa_mem_id() - returns node number of "local memory" node
    set_numa_mem() - initialize [this cpu's] per cpu variable 'numa_mem'
    cpu_to_mem() - return numa_mem for specified cpu; may be used as lvalue

    Generic initialization of 'numa_mem' occurs in __build_all_zonelists().
    This will initialize the boot cpu at boot time, and all cpus on change of
    numa_zonelist_order, or when node or memory hot-plug requires zonelist
    rebuild. Archs that support memoryless nodes will need to initialize
    'numa_mem' for secondary cpus as they're brought on-line.
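
    The API boils down to something like this (a sketch; the exact guards
    and percpu accessors may differ):

    #ifdef CONFIG_HAVE_MEMORYLESS_NODES
    DECLARE_PER_CPU(int, _numa_mem_);          /* nearest node with memory */

    static inline int numa_mem_id(void)
    {
        return __this_cpu_read(_numa_mem_);
    }

    static inline void set_numa_mem(int node)
    {
        __this_cpu_write(_numa_mem_, node);
    }
    #else
    /* Without memoryless nodes, local memory is simply the local node. */
    static inline int numa_mem_id(void)
    {
        return numa_node_id();
    }
    #endif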

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Cc: David Rientjes
    Cc: Eric Whitney
    Cc: KAMEZAWA Hiroyuki
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Rework the generic version of the numa_node_id() function to use the new
    generic percpu variable infrastructure.

    Guard the new implementation with a new config option:

    CONFIG_USE_PERCPU_NUMA_NODE_ID.

    Archs which support this new implementation will default this option to 'y'
    when NUMA is configured. This config option could be removed if/when all
    archs switch over to the generic percpu implementation of numa_node_id().
    Arch support involves:

    1) converting any existing per cpu variable implementations to use
    this implementation. x86_64 is an instance of such an arch.
    2) archs that don't use a per cpu variable for numa_node_id() will
    need to initialize the new per cpu variable "numa_node" as cpus
    are brought on-line. ia64 is an example.
    3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g.,
    when NUMA is configured. This is required because I have
    retained the old implementation by default to allow archs to
    be modified incrementally, as desired.

    Subsequent patches will convert x86_64 and ia64 to use this implementation.

    Signed-off-by: Lee Schermerhorn
    Cc: Tejun Heo
    Cc: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: David Rientjes
    Cc: Eric Whitney
    Cc: KAMEZAWA Hiroyuki
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

25 May, 2010

2 commits

  • Add global mutex zonelists_mutex to fix the possible race:

    CPU0                               CPU1                   CPU2
    (1) zone->present_pages += online_pages;
    (2)                                build_all_zonelists();
    (3)                                                       alloc_page();
    (4)                                                       free_page();
    (5) build_all_zonelists();
    (6)   __build_all_zonelists();
    (7)     zone->pageset = alloc_percpu();

    In steps (3, 4), zone->pageset still points to boot_pageset, so bad
    things may happen if 2+ nodes are in this state. Even if only 1 node
    is accessing the boot_pageset, (3) may still consume too much memory,
    causing the memory allocation in step (7) to fail.

    Besides, the atomic operation ensures alloc_percpu() in step (7) will
    never fail, since there is a fresh new memory block added in step (6).

    [haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists]
    Signed-off-by: Haicheng Li
    Signed-off-by: Wu Fengguang
    Reviewed-by: Andi Kleen
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haicheng Li
     
  • For each newly populated zone of a hotadded node, we need to update its
    pagesets with dynamically allocated per_cpu_pageset structs for all
    possible CPUs:

    1) Detach zone->pageset from the shared boot_pageset
       at the end of __build_all_zonelists().

    2) Use a mutex to protect zone->pageset when it's still
       shared in onlined_pages().

    Otherwise, multiple zones of different nodes would share the same
    bootstrapping boot_pageset for the same CPU, which will finally cause
    the kernel panic below:

    ------------[ cut here ]------------
    kernel BUG at mm/page_alloc.c:1239!
    invalid opcode: 0000 [#1] SMP
    ...
    Call Trace:
    [] __alloc_pages_nodemask+0x131/0x7b0
    [] alloc_pages_current+0x87/0xd0
    [] __page_cache_alloc+0x67/0x70
    [] __do_page_cache_readahead+0x120/0x260
    [] ra_submit+0x21/0x30
    [] ondemand_readahead+0x166/0x2c0
    [] page_cache_async_readahead+0x80/0xa0
    [] generic_file_aio_read+0x364/0x670
    [] nfs_file_read+0xca/0x130
    [] do_sync_read+0xfa/0x140
    [] vfs_read+0xb5/0x1a0
    [] sys_read+0x51/0x80
    [] system_call_fastpath+0x16/0x1b
    RIP [] get_page_from_freelist+0x883/0x900
    RSP
    ---[ end trace 4bda28328b9990db ]

    [akpm@linux-foundation.org: merge fix]
    Signed-off-by: Haicheng Li
    Signed-off-by: Wu Fengguang
    Reviewed-by: Andi Kleen
    Reviewed-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haicheng Li