28 May, 2010
2 commits
-
Introduce numa_mem_id(), based on generic percpu variable infrastructure
to track "nearest node with memory" for archs that support memoryless
nodes.Define API in when CONFIG_HAVE_MEMORYLESS_NODES
defined, else stubs. Architectures will define HAVE_MEMORYLESS_NODES
if/when they support them.Archs can override definitions of:
numa_mem_id() - returns node number of "local memory" node
set_numa_mem() - initialize [this cpus'] per cpu variable 'numa_mem'
cpu_to_mem() - return numa_mem for specified cpu; may be used as lvalueGeneric initialization of 'numa_mem' occurs in __build_all_zonelists().
This will initialize the boot cpu at boot time, and all cpus on change of
numa_zonelist_order, or when node or memory hot-plug requires zonelist
rebuild. Archs that support memoryless nodes will need to initialize
'numa_mem' for secondary cpus as they're brought on-line.[akpm@linux-foundation.org: fix build]
Signed-off-by: Lee Schermerhorn
Signed-off-by: Christoph Lameter
Cc: Tejun Heo
Cc: Mel Gorman
Cc: Christoph Lameter
Cc: Nick Piggin
Cc: David Rientjes
Cc: Eric Whitney
Cc: KAMEZAWA Hiroyuki
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Cc: "Luck, Tony"
Cc: Pekka Enberg
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Rework the generic version of the numa_node_id() function to use the new
generic percpu variable infrastructure.Guard the new implementation with a new config option:
CONFIG_USE_PERCPU_NUMA_NODE_ID.
Archs which support this new implemention will default this option to 'y'
when NUMA is configured. This config option could be removed if/when all
archs switch over to the generic percpu implementation of numa_node_id().
Arch support involves:1) converting any existing per cpu variable implementations to use
this implementation. x86_64 is an instance of such an arch.
2) archs that don't use a per cpu variable for numa_node_id() will
need to initialize the new per cpu variable "numa_node" as cpus
are brought on-line. ia64 is an example.
3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g.,
when NUMA is configured. This is required because I have
retained the old implementation by default to allow archs to
be modified incrementally, as desired.Subsequent patches will convert x86_64 and ia64 to use this implemenation.
Signed-off-by: Lee Schermerhorn
Cc: Tejun Heo
Cc: Mel Gorman
Reviewed-by: Christoph Lameter
Cc: Nick Piggin
Cc: David Rientjes
Cc: Eric Whitney
Cc: KAMEZAWA Hiroyuki
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Cc: "Luck, Tony"
Cc: Pekka Enberg
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
25 May, 2010
10 commits
-
Add global mutex zonelists_mutex to fix the possible race:
CPU0 CPU1 CPU2
(1) zone->present_pages += online_pages;
(2) build_all_zonelists();
(3) alloc_page();
(4) free_page();
(5) build_all_zonelists();
(6) __build_all_zonelists();
(7) zone->pageset = alloc_percpu();In step (3,4), zone->pageset still points to boot_pageset, so bad
things may happen if 2+ nodes are in this state. Even if only 1 node
is accessing the boot_pageset, (3) may still consume too much memory
to fail the memory allocations in step (7).Besides, atomic operation ensures alloc_percpu() in step (7) will never fail
since there is a new fresh memory block added in step(6).[haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists]
Signed-off-by: Haicheng Li
Signed-off-by: Wu Fengguang
Reviewed-by: Andi Kleen
Cc: Christoph Lameter
Cc: Mel Gorman
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
For each new populated zone of hotadded node, need to update its pagesets
with dynamically allocated per_cpu_pageset struct for all possible CPUs:1) Detach zone->pageset from the shared boot_pageset
at end of __build_all_zonelists().2) Use mutex to protect zone->pageset when it's still
shared in onlined_pages()Otherwises, multiple zones of different nodes would share same boot strapping
boot_pageset for same CPU, which will finally cause below kernel panic:------------[ cut here ]------------
kernel BUG at mm/page_alloc.c:1239!
invalid opcode: 0000 [#1] SMP
...
Call Trace:
[] __alloc_pages_nodemask+0x131/0x7b0
[] alloc_pages_current+0x87/0xd0
[] __page_cache_alloc+0x67/0x70
[] __do_page_cache_readahead+0x120/0x260
[] ra_submit+0x21/0x30
[] ondemand_readahead+0x166/0x2c0
[] page_cache_async_readahead+0x80/0xa0
[] generic_file_aio_read+0x364/0x670
[] nfs_file_read+0xca/0x130
[] do_sync_read+0xfa/0x140
[] vfs_read+0xb5/0x1a0
[] sys_read+0x51/0x80
[] system_call_fastpath+0x16/0x1b
RIP [] get_page_from_freelist+0x883/0x900
RSP
---[ end trace 4bda28328b9990db ][akpm@linux-foundation.org: merge fix]
Signed-off-by: Haicheng Li
Signed-off-by: Wu Fengguang
Reviewed-by: Andi Kleen
Reviewed-by: Christoph Lameter
Cc: Mel Gorman
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
No behavior change here.
Move some of setup_per_cpu_pageset() code into a new function
setup_zone_pageset() that will be useful for memory hotplug.Signed-off-by: Wu Fengguang
Signed-off-by: Haicheng Li
Reviewed-by: Andi Kleen
Reviewed-by: Christoph Lameter
Cc: Mel Gorman
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
free_hot_cold_page() and __free_pages_ok() have very similar freeing
preparation. Consolidate them.[akpm@linux-foundation.org: fix busted coding style]
Signed-off-by: KOSAKI Motohiro
Acked-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The fragmentation index may indicate that a failure is due to external
fragmentation but after a compaction run completes, it is still possible
for an allocation to fail. There are two obvious reasons as to whyo Page migration cannot move all pages so fragmentation remains
o A suitable page may exist but watermarks are not metIn the event of compaction followed by an allocation failure, this patch
defers further compaction in the zone (1 << compact_defer_shift) times.
If the next compaction attempt also fails, compact_defer_shift is
increased up to a maximum of 6. If compaction succeeds, the defer
counters are reset again.The zone that is deferred is the first zone in the zonelist - i.e. the
preferred zone. To defer compaction in the other zones, the information
would need to be stored in the zonelist or implemented similar to the
zonelist_cache. This would impact the fast-paths and is not justified at
this time.Signed-off-by: Mel Gorman
Cc: Rik van Riel
Cc: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Christoph Lameter
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Ordinarily when a high-order allocation fails, direct reclaim is entered
to free pages to satisfy the allocation. With this patch, it is
determined if an allocation failed due to external fragmentation instead
of low memory and if so, the calling process will compact until a suitable
page is freed. Compaction by moving pages in memory is considerably
cheaper than paging out to disk and works where there are locked pages or
no swap. If compaction fails to free a page of a suitable size, then
reclaim will still occur.Direct compaction returns as soon as possible. As each block is
compacted, it is checked if a suitable page has been freed and if so, it
returns.[akpm@linux-foundation.org: Fix build errors]
[aarcange@redhat.com: fix count_vm_event preempt in memory compaction direct reclaim]
Signed-off-by: Mel Gorman
Acked-by: Rik van Riel
Reviewed-by: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Christoph Lameter
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch is the core of a mechanism which compacts memory in a zone by
relocating movable pages towards the end of the zone.A single compaction run involves a migration scanner and a free scanner.
Both scanners operate on pageblock-sized areas in the zone. The migration
scanner starts at the bottom of the zone and searches for all movable
pages within each area, isolating them onto a private list called
migratelist. The free scanner starts at the top of the zone and searches
for suitable areas and consumes the free pages within making them
available for the migration scanner. The pages isolated for migration are
then migrated to the newly isolated free pages.[aarcange@redhat.com: Fix unsafe optimisation]
[mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
Signed-off-by: Mel Gorman
Acked-by: Rik van Riel
Reviewed-by: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Christoph Lameter
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There are two types of zonelist ordering methodologies:
- node order, preferring allocations on a node to stay local to and
- zone order, preferring allocations come from a higher zone to avoid
allocating in lowmem zones even though they may not be local.The ordering technique used by the kernel is configurable on the command
line, but also has some logic to determine what the default should be.This logic currently lacks knowledge of systems where a node may only have
lowmem. For such systems, it is necessary to use node order so that
GFP_KERNEL allocations may be satisfied by nodes consisting of only
lowmem.If zone order is used, GFP_KERNEL allocations to such nodes are actually
allocated on a node with local affinity that includes ZONE_NORMAL.This change defaults to node zonelist ordering if any node lacks
ZONE_NORMAL.To force zone order, append 'numa_zonelist_order=zone' to the kernel
command line.Signed-off-by: David Rientjes
Acked-by: Mel Gorman
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Before applying this patch, cpuset updates task->mems_allowed and
mempolicy by setting all new bits in the nodemask first, and clearing all
old unallowed bits later. But in the way, the allocator may find that
there is no node to alloc memory.The reason is that cpuset rebinds the task's mempolicy, it cleans the
nodes which the allocater can alloc pages on, for example:(mpol: mempolicy)
task1 task1's mpol task2
alloc page 1
alloc on node0? NO 1
1 change mems from 1 to 0
1 rebind task1's mpol
0-1 set new bits
0 clear disallowed bits
alloc on node1? NO 0
...
can't alloc page
goto oomThis patch fixes this problem by expanding the nodes range first(set newly
allowed bits) and shrink it lazily(clear newly disallowed bits). So we
use a variable to tell the write-side task that read-side task is reading
nodemask, and the write-side task clears newly disallowed nodes after
read-side task ends the current memory allocation.[akpm@linux-foundation.org: fix spello]
Signed-off-by: Miao Xie
Cc: David Rientjes
Cc: Nick Piggin
Cc: Paul Menage
Cc: Lee Schermerhorn
Cc: Hugh Dickins
Cc: Ravikiran Thirumalai
Cc: KOSAKI Motohiro
Cc: Christoph Lameter
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
…re merging to the tail of the free lists
In order to reduce fragmentation, this patch classifies freed pages in two
groups according to their probability of being part of a high order merge.
Pages belonging to a compound whose next-highest buddy is free are more
likely to be part of a high order merge in the near future, so they will
be added at the tail of the freelist. The remaining pages are put at the
front of the freelist.In this way, the pages that are more likely to cause a big merge are kept
free longer. Consequently there is a tendency to aggregate the
long-living allocations on a subset of the compounds, reducing the
fragmentation.This heuristic was tested on three machines, x86, x86-64 and ppc64 with
3GB of RAM in each machine. The tests were kernbench, netperf, sysbench
and STREAM for performance and a high-order stress test for huge page
allocations.KernBench X86
Elapsed mean 374.77 ( 0.00%) 375.10 (-0.09%)
User mean 649.53 ( 0.00%) 650.44 (-0.14%)
System mean 54.75 ( 0.00%) 54.18 ( 1.05%)
CPU mean 187.75 ( 0.00%) 187.25 ( 0.27%)KernBench X86-64
Elapsed mean 94.45 ( 0.00%) 94.01 ( 0.47%)
User mean 323.27 ( 0.00%) 322.66 ( 0.19%)
System mean 36.71 ( 0.00%) 36.50 ( 0.57%)
CPU mean 380.75 ( 0.00%) 381.75 (-0.26%)KernBench PPC64
Elapsed mean 173.45 ( 0.00%) 173.74 (-0.17%)
User mean 587.99 ( 0.00%) 587.95 ( 0.01%)
System mean 60.60 ( 0.00%) 60.57 ( 0.05%)
CPU mean 373.50 ( 0.00%) 372.75 ( 0.20%)Nothing notable for kernbench.
NetPerf UDP X86
64 42.68 ( 0.00%) 42.77 ( 0.21%)
128 85.62 ( 0.00%) 85.32 (-0.35%)
256 170.01 ( 0.00%) 168.76 (-0.74%)
1024 655.68 ( 0.00%) 652.33 (-0.51%)
2048 1262.39 ( 0.00%) 1248.61 (-1.10%)
3312 1958.41 ( 0.00%) 1944.61 (-0.71%)
4096 2345.63 ( 0.00%) 2318.83 (-1.16%)
8192 4132.90 ( 0.00%) 4089.50 (-1.06%)
16384 6770.88 ( 0.00%) 6642.05 (-1.94%)*NetPerf UDP X86-64
64 148.82 ( 0.00%) 154.92 ( 3.94%)
128 298.96 ( 0.00%) 312.95 ( 4.47%)
256 583.67 ( 0.00%) 626.39 ( 6.82%)
1024 2293.18 ( 0.00%) 2371.10 ( 3.29%)
2048 4274.16 ( 0.00%) 4396.83 ( 2.79%)
3312 6356.94 ( 0.00%) 6571.35 ( 3.26%)
4096 7422.68 ( 0.00%) 7635.42 ( 2.79%)*
8192 12114.81 ( 0.00%)* 12346.88 ( 1.88%)
16384 17022.28 ( 0.00%)* 17033.19 ( 0.06%)*
1.64% 2.73%NetPerf UDP PPC64
64 49.98 ( 0.00%) 50.25 ( 0.54%)
128 98.66 ( 0.00%) 100.95 ( 2.27%)
256 197.33 ( 0.00%) 191.03 (-3.30%)
1024 761.98 ( 0.00%) 785.07 ( 2.94%)
2048 1493.50 ( 0.00%) 1510.85 ( 1.15%)
3312 2303.95 ( 0.00%) 2271.72 (-1.42%)
4096 2774.56 ( 0.00%) 2773.06 (-0.05%)
8192 4918.31 ( 0.00%) 4793.59 (-2.60%)
16384 7497.98 ( 0.00%) 7749.52 ( 3.25%)The tests are run to have confidence limits within 1%. Results marked
with a * were not confident although in this case, it's only outside by
small amounts. Even with some results that were not confident, the
netperf UDP results were generally positive.NetPerf TCP X86
64 652.25 ( 0.00%)* 648.12 (-0.64%)*
23.80% 22.82%
128 1229.98 ( 0.00%)* 1220.56 (-0.77%)*
21.03% 18.90%
256 2105.88 ( 0.00%) 1872.03 (-12.49%)*
1.00% 16.46%
1024 3476.46 ( 0.00%)* 3548.28 ( 2.02%)*
13.37% 11.39%
2048 4023.44 ( 0.00%)* 4231.45 ( 4.92%)*
9.76% 12.48%
3312 4348.88 ( 0.00%)* 4396.96 ( 1.09%)*
6.49% 8.75%
4096 4726.56 ( 0.00%)* 4877.71 ( 3.10%)*
9.85% 8.50%
8192 4732.28 ( 0.00%)* 5777.77 (18.10%)*
9.13% 13.04%
16384 5543.05 ( 0.00%)* 5906.24 ( 6.15%)*
7.73% 8.68%NETPERF TCP X86-64
netperf-tcp-vanilla-netperf netperf-tcp
tcp-vanilla pgalloc-delay
64 1895.87 ( 0.00%)* 1775.07 (-6.81%)*
5.79% 4.78%
128 3571.03 ( 0.00%)* 3342.20 (-6.85%)*
3.68% 6.06%
256 5097.21 ( 0.00%)* 4859.43 (-4.89%)*
3.02% 2.10%
1024 8919.10 ( 0.00%)* 8892.49 (-0.30%)*
5.89% 6.55%
2048 10255.46 ( 0.00%)* 10449.39 ( 1.86%)*
7.08% 7.44%
3312 10839.90 ( 0.00%)* 10740.15 (-0.93%)*
6.87% 7.33%
4096 10814.84 ( 0.00%)* 10766.97 (-0.44%)*
6.86% 8.18%
8192 11606.89 ( 0.00%)* 11189.28 (-3.73%)*
7.49% 5.55%
16384 12554.88 ( 0.00%)* 12361.22 (-1.57%)*
7.36% 6.49%NETPERF TCP PPC64
netperf-tcp-vanilla-netperf netperf-tcp
tcp-vanilla pgalloc-delay
64 594.17 ( 0.00%) 596.04 ( 0.31%)*
1.00% 2.29%
128 1064.87 ( 0.00%)* 1074.77 ( 0.92%)*
1.30% 1.40%
256 1852.46 ( 0.00%)* 1856.95 ( 0.24%)
1.25% 1.00%
1024 3839.46 ( 0.00%)* 3813.05 (-0.69%)
1.02% 1.00%
2048 4885.04 ( 0.00%)* 4881.97 (-0.06%)*
1.15% 1.04%
3312 5506.90 ( 0.00%) 5459.72 (-0.86%)
4096 6449.19 ( 0.00%) 6345.46 (-1.63%)
8192 7501.17 ( 0.00%) 7508.79 ( 0.10%)
16384 9618.65 ( 0.00%) 9490.10 (-1.35%)There was a distinct lack of confidence in the X86* figures so I included
what the devation was where the results were not confident. Many of the
results, whether gains or losses were within the standard deviation so no
solid conclusion can be reached on performance impact. Looking at the
figures, only the X86-64 ones look suspicious with a few losses that were
outside the noise. However, the results were so unstable that without
knowing why they vary so much, a solid conclusion cannot be reached.SYSBENCH X86
sysbench-vanilla pgalloc-delay
1 7722.85 ( 0.00%) 7756.79 ( 0.44%)
2 14901.11 ( 0.00%) 13683.44 (-8.90%)
3 15171.71 ( 0.00%) 14888.25 (-1.90%)
4 14966.98 ( 0.00%) 15029.67 ( 0.42%)
5 14370.47 ( 0.00%) 14865.00 ( 3.33%)
6 14870.33 ( 0.00%) 14845.57 (-0.17%)
7 14429.45 ( 0.00%) 14520.85 ( 0.63%)
8 14354.35 ( 0.00%) 14362.31 ( 0.06%)SYSBENCH X86-64
1 17448.70 ( 0.00%) 17484.41 ( 0.20%)
2 34276.39 ( 0.00%) 34251.00 (-0.07%)
3 50805.25 ( 0.00%) 50854.80 ( 0.10%)
4 66667.10 ( 0.00%) 66174.69 (-0.74%)
5 66003.91 ( 0.00%) 65685.25 (-0.49%)
6 64981.90 ( 0.00%) 65125.60 ( 0.22%)
7 64933.16 ( 0.00%) 64379.23 (-0.86%)
8 63353.30 ( 0.00%) 63281.22 (-0.11%)
9 63511.84 ( 0.00%) 63570.37 ( 0.09%)
10 62708.27 ( 0.00%) 63166.25 ( 0.73%)
11 62092.81 ( 0.00%) 61787.75 (-0.49%)
12 61330.11 ( 0.00%) 61036.34 (-0.48%)
13 61438.37 ( 0.00%) 61994.47 ( 0.90%)
14 62304.48 ( 0.00%) 62064.90 (-0.39%)
15 63296.48 ( 0.00%) 62875.16 (-0.67%)
16 63951.76 ( 0.00%) 63769.09 (-0.29%)SYSBENCH PPC64
-sysbench-pgalloc-delay-sysbench
sysbench-vanilla pgalloc-delay
1 7645.08 ( 0.00%) 7467.43 (-2.38%)
2 14856.67 ( 0.00%) 14558.73 (-2.05%)
3 21952.31 ( 0.00%) 21683.64 (-1.24%)
4 27946.09 ( 0.00%) 28623.29 ( 2.37%)
5 28045.11 ( 0.00%) 28143.69 ( 0.35%)
6 27477.10 ( 0.00%) 27337.45 (-0.51%)
7 26489.17 ( 0.00%) 26590.06 ( 0.38%)
8 26642.91 ( 0.00%) 25274.33 (-5.41%)
9 25137.27 ( 0.00%) 24810.06 (-1.32%)
10 24451.99 ( 0.00%) 24275.85 (-0.73%)
11 23262.20 ( 0.00%) 23674.88 ( 1.74%)
12 24234.81 ( 0.00%) 23640.89 (-2.51%)
13 24577.75 ( 0.00%) 24433.50 (-0.59%)
14 25640.19 ( 0.00%) 25116.52 (-2.08%)
15 26188.84 ( 0.00%) 26181.36 (-0.03%)
16 26782.37 ( 0.00%) 26255.99 (-2.00%)Again, there is little to conclude here. While there are a few losses,
the results vary by +/- 8% in some cases. They are the results of most
concern as there are some large losses but it's also within the variance
typically seen between kernel releases.The STREAM results varied so little and are so verbose that I didn't
include them here.The final test stressed how many huge pages can be allocated. The
absolute number of huge pages allocated are the same with or without the
page. However, the "unusability free space index" which is a measure of
external fragmentation was slightly lower (lower is better) throughout the
lifetime of the system. I also measured the latency of how long it took
to successfully allocate a huge page. The latency was slightly lower and
on X86 and PPC64, more huge pages were allocated almost immediately from
the free lists. The improvement is slight but there.[mel@csn.ul.ie: Tested, reworked for less branches]
[czoccolo@gmail.com: fix oops by checking pfn_valid_within()]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 Mar, 2010
1 commit
-
[Ss]ytem => [Ss]ystem
udpate => update
paramters => parameters
orginal => originalSigned-off-by: Thomas Weber
Acked-by: Randy Dunlap
Signed-off-by: Jiri Kosina
13 Mar, 2010
2 commits
-
- introduce dump_page() to print the page info for debugging some error
condition.- convert three mm users: bad_page(), print_bad_pte() and memory offline
failure.- print an extra field: the symbolic names of page->flags
Example dump_page() output:
[ 157.521694] page:ffffea0000a7cba8 count:2 mapcount:1 mapping:ffff88001c901791 index:0x147
[ 157.525570] page flags: 0x100000000100068(uptodate|lru|active|swapbacked)Signed-off-by: Wu Fengguang
Cc: Ingo Molnar
Cc: Alex Chiang
Cc: Rik van Riel
Cc: Andi Kleen
Cc: Mel Gorman
Cc: Christoph Lameter
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
__zone_pcp_update() iterates over NR_CPUS instead of limiting the access
to the possible cpus. This might result in access to uninitialized areas
as the per cpu allocator only populates the per cpu memory for possible
cpus.This problem was created as a result of the dynamic allocation of pagesets
from percpu memory that went in during the merge window - commit
99dcc3e5a94ed491fbef402831d8c0bbb267f995 ("this_cpu: Page allocator
conversion").Signed-off-by: Thomas Gleixner
Acked-by: Pekka Enberg
Acked-by: Tejun Heo
Acked-by: Christoph Lameter
Acked-by: Mel Gorman
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Mar, 2010
6 commits
-
free_area_init_nodes() emits pfn ranges for all zones on the system.
There may be no pages on a higher zone, however, due to memory limitations
or the use of the mem= kernel parameter. For example:Zone PFN ranges:
DMA 0x00000001 -> 0x00001000
DMA32 0x00001000 -> 0x00100000
Normal 0x00100000 -> 0x00100000The implementation copies the previous zone's highest pfn, if any, as the
next zone's lowest pfn. If its highest pfn is then greater than the
amount of addressable memory, the upper memory limit is used instead.
Thus, both the lowest and highest possible pfn for higher zones without
memory may be the same.The pfn range for zones without memory is now shown as "empty" instead.
Signed-off-by: David Rientjes
Cc: Mel Gorman
Reviewed-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There are quite a few GFP_KERNEL memory allocations made during
suspend/hibernation and resume that may cause the system to hang, because
the I/O operations they depend on cannot be completed due to the
underlying devices being suspended.Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in
gfp_allowed_mask before suspend/hibernation and restoring the original
values of these bits in gfp_allowed_mask durig the subsequent resume.[akpm@linux-foundation.org: fix CONFIG_PM=n linkage]
Signed-off-by: Rafael J. Wysocki
Reported-by: Maxim Levitsky
Cc: Sebastian Ott
Cc: Benjamin Herrenschmidt
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
commit e815af95 ("change all_unreclaimable zone member to flags") changed
all_unreclaimable member to bit flag. But it had an undesireble side
effect. free_one_page() is one of most hot path in linux kernel and
increasing atomic ops in it can reduce kernel performance a bit.Thus, this patch revert such commit partially. at least
all_unreclaimable shouldn't share memory word with other zone flags.[akpm@linux-foundation.org: fix patch interaction]
Signed-off-by: KOSAKI Motohiro
Cc: David Rientjes
Cc: Wu Fengguang
Cc: KAMEZAWA Hiroyuki
Cc: Minchan Kim
Cc: Huang Shijie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
free_hot_page() is just a wrapper around free_hot_cold_page() with
parameter 'cold = 0'. After adding a clear comment for
free_hot_cold_page(), it is reasonable to remove a level of call.[akpm@linux-foundation.org: fix build]
Signed-off-by: Li Hong
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Ingo Molnar
Cc: Larry Woodman
Cc: Peter Zijlstra
Cc: Li Ming Chun
Cc: KOSAKI Motohiro
Cc: Americo Wang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Move a call of trace_mm_page_free_direct() from free_hot_page() to
free_hot_cold_page(). It is clearer and close to kmemcheck_free_shadow(),
as it is done in function __free_pages_ok().Signed-off-by: Li Hong
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Ingo Molnar
Cc: Larry Woodman
Cc: Peter Zijlstra
Cc: Li Ming Chun
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
trace_mm_page_free_direct() is called in function __free_pages(). But it
is called again in free_hot_page() if order == 0 and produce duplicate
records in trace file for mm_page_free_direct event. As below:K-PID CPU# TIMESTAMP FUNCTION
gnome-terminal-1567 [000] 4415.246466: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
gnome-terminal-1567 [000] 4415.246468: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
gnome-terminal-1567 [000] 4415.246506: mm_page_alloc: page=ffffea0003db9f40 pfn=1155800 order=0 migratetype=0 gfp_flags=GFP_KERNEL
gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0This patch removes the first call and adds a call to
trace_mm_page_free_direct() in __free_pages_ok().Signed-off-by: Li Hong
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Ingo Molnar
Cc: Larry Woodman
Cc: Peter Zijlstra
Cc: Li Ming Chun
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
04 Mar, 2010
1 commit
-
…l/git/tip/linux-2.6-tip
* 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
early_res: Need to save the allocation name in drop_range_partial()
sparsemem: Fix compilation on PowerPC
early_res: Add free_early_partial()
x86: Fix non-bootmem compilation on PowerPC
core: Move early_res from arch/x86 to kernel/
x86: Add find_fw_memmap_area
Move round_up/down to kernel.h
x86: Make 32bit support NO_BOOTMEM
early_res: Enhance check_and_double_early_res
x86: Move back find_e820_area to e820.c
x86: Add find_early_area_size
x86: Separate early_res related code from e820.c
x86: Move bios page reserve early to head32/64.c
sparsemem: Put mem map for one node together.
sparsemem: Put usemap for one node together
x86: Make 64 bit use early_res instead of bootmem before slab
x86: Only call dma32_reserve_bootmem 64bit !CONFIG_NUMA
x86: Make early_node_mem get mem > 4 GB if possible
x86: Dynamically increase early_res array size
x86: Introduce max_early_res and early_res_count
...
22 Feb, 2010
1 commit
-
These build errors on some non-x86 platforms (PowerPC for example):
mm/page_alloc.c: In function '__alloc_memory_core_early':
mm/page_alloc.c:3468: error: implicit declaration of function 'find_early_area'
mm/page_alloc.c:3483: error: implicit declaration of function 'reserve_early_without_check'The function is only needed on CONFIG_NO_BOOTMEM.
Signed-off-by: Yinghai Lu
Cc: Andrew Morton
Cc: Johannes Weiner
Cc: Mel Gorman
LKML-Reference:
Signed-off-by: Ingo Molnar
13 Feb, 2010
1 commit
-
Finally we can use early_res to replace bootmem for x86_64 now.
Still can use CONFIG_NO_BOOTMEM to enable it or not.
-v2: fix 32bit compiling about MAX_DMA32_PFN
-v3: folded bug fix from LKML message belowSigned-off-by: Yinghai Lu
LKML-Reference:
Signed-off-by: H. Peter Anvin
02 Feb, 2010
1 commit
30 Jan, 2010
1 commit
-
After memory pressure has forced it to dip into the reserves, 2.6.32's
5f8dcc21211a3d4e3a7a5ca366b469fb88117f61 "page-allocator: split per-cpu
list into one-list-per-migrate-type" has been returning MIGRATE_RESERVE
pages to the MIGRATE_MOVABLE free_list: in some sense depleting reserves.Fix that in the most straightforward way (which, considering the overheads
of alternative approaches, is Mel's preference): the right migratetype is
already in page_private(page), but free_pcppages_bulk() wasn't using it.How did this bug show up? As a 20% slowdown in my tmpfs loop kbuild
swapping tests, on PowerMac G5 with SLUB allocator. Bisecting to that
commit was easy, but explaining the magnitude of the slowdown not easy.The same effect appears, but much less markedly, with SLAB, and even
less markedly on other machines (the PowerMac divides into fewer zones
than x86, I think that may be a factor). We guess that lumpy reclaim
of short-lived high-order pages is implicated in some way, and probably
this bug has been tickling a poor decision somewhere in page reclaim.But instrumentation hasn't told me much, I've run out of time and
imagination to determine exactly what's going on, and shouldn't hold up
the fix any longer: it's valid, and might even fix other misbehaviours.Signed-off-by: Hugh Dickins
Acked-by: Mel Gorman
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds
17 Jan, 2010
2 commits
-
commit f2260e6b (page allocator: update NR_FREE_PAGES only as necessary)
made one minor regression. if __rmqueue() was failed, NR_FREE_PAGES stat
go wrong. this patch fixes it.Signed-off-by: KOSAKI Motohiro
Cc: Mel Gorman
Reviewed-by: Minchan Kim
Reported-by: Huang Shijie
Reviewed-by: Christoph Lameter
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The current check for 'backward merging' within add_active_range() does
not seem correct. start_pfn must be compared against
early_node_map[i].start_pfn (and NOT against .end_pfn) to find out whether
the new region is backward-mergeable with the existing range.Signed-off-by: Kazuhisa Ichikawa
Acked-by: David Rientjes
Cc: KOSAKI Motohiro
Cc: Mel Gorman
Cc: Christoph Lameter
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Jan, 2010
1 commit
-
Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.
This drastically reduces the size of struct zone for systems with large
amounts of processors and allows placement of critical variables of struct
zone in one cacheline even on very large systems.Another effect is that the pagesets of one processor are placed near one
another. If multiple pagesets from different zones fit into one cacheline
then additional cacheline fetches can be avoided on the hot paths when
allocating memory from multiple zones.Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
are reduced and we can drop the zone_pcp macro.Hotplug handling is also simplified since cpu alloc can bring up and
shut down cpu areas for a specific cpu as a whole. So there is no need to
allocate or free individual pagesets.V7-V8:
- Explain chicken egg dilemmna with percpu allocator.V4-V5:
- Fix up cases where per_cpu_ptr is called before irq disable
- Integrate the bootstrap logic that was separate before.tj: Build failure in pageset_cpuup_callback() due to missing ret
variable fixed.Reviewed-by: Mel Gorman
Signed-off-by: Christoph Lameter
Signed-off-by: Tejun Heo
24 Dec, 2009
1 commit
-
The zone list code clearly cannot tolerate concurrent writers (I couldn't
find any locks for that), so simply add a global mutex. No need for RCU
in this case.Signed-off-by: Andi Kleen
23 Dec, 2009
1 commit
-
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (36 commits)
powerpc/gc/wii: Remove get_irq_desc()
powerpc/gc/wii: hlwd-pic: convert irq_desc.lock to raw_spinlock
powerpc/gamecube/wii: Fix off-by-one error in ugecon/usbgecko_udbg
powerpc/mpic: Fix problem that affinity is not updated
powerpc/mm: Fix stupid bug in subpge protection handling
powerpc/iseries: use DECLARE_COMPLETION_ONSTACK for non-constant completion
powerpc: Fix MSI support on U4 bridge PCIe slot
powerpc: Handle VSX alignment faults correctly in little-endian mode
powerpc/mm: Fix typo of cpumask_clear_cpu()
powerpc/mm: Fix hash_utils_64.c compile errors with DEBUG enabled.
powerpc: Convert BUG() to use unreachable()
powerpc/pseries: Make declarations of cpu_hotplug_driver_lock() ANSI compatible.
powerpc/pseries: Don't panic when H_PROD fails during cpu-online.
powerpc/mm: Fix a WARN_ON() with CONFIG_DEBUG_PAGEALLOC and CONFIG_DEBUG_VM
powerpc/defconfigs: Set HZ=100 on pseries and ppc64 defconfigs
powerpc/defconfigs: Disable token ring in powerpc defconfigs
powerpc/defconfigs: Reduce 64bit vmlinux by making acenic and cramfs modules
powerpc/pseries: Select XICS and PCI_MSI PSERIES
powerpc/85xx: Wrong variable returned on error
powerpc/iseries: Convert to proc_fops
...
20 Dec, 2009
1 commit
-
…git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, irq: Allow 0xff for /proc/irq/[n]/smp_affinity on an 8-cpu system
Makefile: Unexport LC_ALL instead of clearing it
x86: Fix objdump version check in arch/x86/tools/chkobjdump.awk
x86: Reenable TSC sync check at boot, even with NONSTOP_TSC
x86: Don't use POSIX character classes in gen-insn-attr-x86.awk
Makefile: set LC_CTYPE, LC_COLLATE, LC_NUMERIC to C
x86: Increase MAX_EARLY_RES; insufficient on 32-bit NUMA
x86: Fix checking of SRAT when node 0 ram is not from 0
x86, cpuid: Add "volatile" to asm in native_cpuid()
x86, msr: msrs_alloc/free for CONFIG_SMP=n
x86, amd: Get multi-node CPU info from NodeId MSR instead of PCI config space
x86: Add IA32_TSC_AUX MSR and use it
x86, msr/cpuid: Register enough minors for the MSR and CPUID drivers
initramfs: add missing decompressor error check
bzip2: Add missing checks for malloc returning NULL
bzip2/lzma/gzip: pre-boot malloc doesn't return NULL on failure
18 Dec, 2009
1 commit
-
Memory balloon drivers can allocate a large amount of memory which is not
movable but could be freed to accomodate memory hotplug remove.Prior to calling the memory hotplug notifier chain the memory in the
pageblock is isolated. Currently, if the migrate type is not
MIGRATE_MOVABLE the isolation will not proceed, causing the memory removal
for that page range to fail.Rather than failing pageblock isolation if the migrateteype is not
MIGRATE_MOVABLE, this patch checks if all of the pages in the pageblock,
and not on the LRU, are owned by a registered balloon driver (or other
entity) using a notifier chain. If all of the non-movable pages are owned
by a balloon, they can be freed later through the memory notifier chain
and the range can still be isolated in set_migratetype_isolate().Signed-off-by: Robert Jennings
Cc: Mel Gorman
Cc: Ingo Molnar
Cc: Brian King
Cc: Paul Mackerras
Cc: Martin Schwidefsky
Cc: Gerald Schaefer
Cc: KAMEZAWA Hiroyuki
Cc: Benjamin Herrenschmidt
Signed-off-by: Andrew Morton
Signed-off-by: Benjamin Herrenschmidt
17 Dec, 2009
2 commits
-
Found one system that boot from socket1 instead of socket0, SRAT get rejected...
[ 0.000000] SRAT: Node 1 PXM 0 0-a0000
[ 0.000000] SRAT: Node 1 PXM 0 100000-80000000
[ 0.000000] SRAT: Node 1 PXM 0 100000000-2080000000
[ 0.000000] SRAT: Node 0 PXM 1 2080000000-4080000000
[ 0.000000] SRAT: Node 2 PXM 2 4080000000-6080000000
[ 0.000000] SRAT: Node 3 PXM 3 6080000000-8080000000
[ 0.000000] SRAT: Node 4 PXM 4 8080000000-a080000000
[ 0.000000] SRAT: Node 5 PXM 5 a080000000-c080000000
[ 0.000000] SRAT: Node 6 PXM 6 c080000000-e080000000
[ 0.000000] SRAT: Node 7 PXM 7 e080000000-10080000000
...
[ 0.000000] NUMA: Allocated memnodemap from 500000 - 701040
[ 0.000000] NUMA: Using 20 for the hash shift.
[ 0.000000] Adding active range (0, 0x2080000, 0x4080000) 0 entries of 3200 used
[ 0.000000] Adding active range (1, 0x0, 0x96) 1 entries of 3200 used
[ 0.000000] Adding active range (1, 0x100, 0x7f750) 2 entries of 3200 used
[ 0.000000] Adding active range (1, 0x100000, 0x2080000) 3 entries of 3200 used
[ 0.000000] Adding active range (2, 0x4080000, 0x6080000) 4 entries of 3200 used
[ 0.000000] Adding active range (3, 0x6080000, 0x8080000) 5 entries of 3200 used
[ 0.000000] Adding active range (4, 0x8080000, 0xa080000) 6 entries of 3200 used
[ 0.000000] Adding active range (5, 0xa080000, 0xc080000) 7 entries of 3200 used
[ 0.000000] Adding active range (6, 0xc080000, 0xe080000) 8 entries of 3200 used
[ 0.000000] Adding active range (7, 0xe080000, 0x10080000) 9 entries of 3200 used
[ 0.000000] SRAT: PXMs only cover 917504MB of your 1048566MB e820 RAM. Not used.
[ 0.000000] SRAT: SRAT not used.the early_node_map is not sorted because node0 with non zero start come first.
so try to sort it right away after all regions are registered.
also fixs refression by 8716273c (x86: Export srat physical topology)
-v2: make it more solid to handle cross node case like node0 [0,4g), [8,12g) and node1 [4g, 8g), [12g, 16g)
-v3: update comments.Reported-and-tested-by: Jens Axboe
Signed-off-by: Yinghai Lu
LKML-Reference:
Signed-off-by: H. Peter Anvin -
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (34 commits)
HWPOISON: Remove stray phrase in a comment
HWPOISON: Try to allocate migration page on the same node
HWPOISON: Don't do early filtering if filter is disabled
HWPOISON: Add a madvise() injector for soft page offlining
HWPOISON: Add soft page offline support
HWPOISON: Undefine short-hand macros after use to avoid namespace conflict
HWPOISON: Use new shake_page in memory_failure
HWPOISON: Use correct name for MADV_HWPOISON in documentation
HWPOISON: mention HWPoison in Kconfig entry
HWPOISON: Use get_user_page_fast in hwpoison madvise
HWPOISON: add an interface to switch off/on all the page filters
HWPOISON: add memory cgroup filter
memcg: add accessor to mem_cgroup.css
memcg: rename and export try_get_mem_cgroup_from_page()
HWPOISON: add page flags filter
mm: export stable page flags
HWPOISON: limit hwpoison injector to known page types
HWPOISON: add fs/device filters
HWPOISON: return 0 to indicate success reliably
HWPOISON: make semantics of IGNORED/DELAYED clear
...
16 Dec, 2009
3 commits
-
Fix node-oriented allocation handling in oom-kill.c I myself think of this
as a bugfix not as an ehnancement.In these days, things are changed as
- alloc_pages() eats nodemask as its arguments, __alloc_pages_nodemask().
- mempolicy don't maintain its own private zonelists.
(And cpuset doesn't use nodemask for __alloc_pages_nodemask())So, current oom-killer's check function is wrong.
This patch does
- check nodemask, if nodemask && nodemask doesn't cover all
node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
- Scan all zonelist under nodemask, if it hits cpuset's wall
this faiulre is from cpuset.
And
- modifies the caller of out_of_memory not to call oom if __GFP_THISNODE.
This doesn't change "current" behavior. If callers use __GFP_THISNODE
it should handle "page allocation failure" by itself.- handle __GFP_NOFAIL+__GFP_THISNODE path.
This is something like a FIXME but this gfpmask is not used now.[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki
Acked-by: David Rientjes
Cc: Daisuke Nishimura
Cc: KOSAKI Motohiro
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Most free pages in the buddy system have no PG_buddy set.
Introduce is_free_buddy_page() for detecting them reliably.CC: Nick Piggin
CC: Mel Gorman
Signed-off-by: Wu Fengguang
Signed-off-by: Andi Kleen -
Remove three degrees of obfuscation, left over from when we had
CONFIG_UNEVICTABLE_LRU. MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is
CONFIG_HAVE_MLOCK is CONFIG_MMU. rmap.o (and memory-failure.o) are only
built when CONFIG_MMU, so don't need such conditions at all.Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from
169 defconfigs: leave those to evolve in due course.Signed-off-by: Hugh Dickins
Cc: Izik Eidus
Cc: Andrea Arcangeli
Cc: Nick Piggin
Reviewed-by: KOSAKI Motohiro
Cc: Rik van Riel
Cc: Lee Schermerhorn
Cc: Andi Kleen
Cc: KAMEZAWA Hiroyuki
Cc: Wu Fengguang
Cc: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Nov, 2009
2 commits
-
Commit 341ce06f69abfafa31b9468410a13dbd60e2b237 ("page allocator:
calculate the alloc_flags for allocation only once") altered watermark
logic slightly by allowing rt_tasks that are handling an interrupt to set
ALLOC_HARDER. This patch brings the watermark logic more in line with
2.6.30.This change results in a reduction of the number high-order GFP_ATOMIC
allocation failures reported. See
http://www.gossamer-threads.com/lists/linux/kernel/1144153[rientjes@google.com: Spotted the problem]
Signed-off-by: Mel Gorman
Reviewed-by: Pekka Enberg
Reviewed-by: Rik van Riel
Reviewed-by: KOSAKI Motohiro
Cc: David Rientjes
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
If a direct reclaim makes no forward progress, it considers whether it
should go OOM or not. Whether OOM is triggered or not, it may retry the
allocation afterwards. In times past, this would always wake kswapd as
well but currently, kswapd is not woken up after direct reclaim fails.
For order-0 allocations, this makes little difference but if there is a
heavy mix of higher-order allocations that direct reclaim is failing for,
it might mean that kswapd is not rewoken for higher orders as much as it
did previously.This patch wakes up kswapd when an allocation is being retried after a
direct reclaim failure. It would be expected that kswapd is already
awake, but this has the effect of telling kswapd to reclaim at the higher
order as well.Signed-off-by: Mel Gorman
Reviewed-by: Christoph Lameter
Reviewed-by: Pekka Enberg
Reviewed-by: KOSAKI Motohiro
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds