03 Oct, 2008

1 commit

  • When we initialise a compound page we initialise the page flags and head
    page pointer for all base pages spanned by that page. When we initialise
    a gigantic page (a page of order greater than or equal to MAX_ORDER) we
    have to initialise more than MAX_ORDER_NR_PAGES pages. Currently we
    assume that all elements of the mem_map in this page are contiguous in
    memory. However this is only guaranteed out to MAX_ORDER_NR_PAGES pages,
    and with SPARSEMEM enabled they will not be contiguous. This leads us to
    walk off the end of the first section and scribble on everything which
    follows, BAD.

    When we reach a MAX_ORDER_NR_PAGES boundary we must locate the next
    section of the mem_map. As gigantic pages can only be maximally aligned
    we know this will occur at an exact multiple of MAX_ORDER_NR_PAGES pages
    from the start of the page.

    This is a bug fix for the gigantic page support in hugetlbfs.

    Credit to Mel Gorman for spotting the issue.

    Signed-off-by: Andy Whitcroft
    Cc: Mel Gorman
    Cc: Jon Tollefson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
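
    A minimal sketch of the boundary-aware initialisation loop described above
    (illustrative only; it follows the 2.6.27-era compound-page helpers, and
    the function name here is hypothetical):

        #include <linux/mm.h>
        #include <linux/mmzone.h>

        /* Initialise the tail pages of a (possibly gigantic) compound page
         * without assuming the mem_map is contiguous beyond
         * MAX_ORDER_NR_PAGES. */
        static void prep_compound_page_sketch(struct page *page, unsigned long order)
        {
                int i;
                int nr_pages = 1 << order;
                struct page *p = page + 1;

                set_compound_order(page, order);
                __SetPageHead(page);
                for (i = 1; i < nr_pages; i++, p++) {
                        /* At each MAX_ORDER_NR_PAGES boundary the next struct
                         * page may live in a different SPARSEMEM section, so
                         * re-resolve it by pfn rather than trusting pointer
                         * arithmetic. */
                        if (unlikely((i & (MAX_ORDER_NR_PAGES - 1)) == 0))
                                p = pfn_to_page(page_to_pfn(page) + i);
                        __SetPageTail(p);
                        p->first_page = page;
                }
        }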
     

03 Sep, 2008

2 commits

  • WARNING: vmlinux.o(.data+0x1f5c0): Section mismatch in reference from the variable contig_page_data to the variable .init.data:bootmem_node_data
    The variable contig_page_data references
    the variable __initdata bootmem_node_data
    If the reference is valid then annotate the
    variable with __init* (see linux/init.h) or name the variable:
    *driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console,

    Signed-off-by: Marcin Slusarz
    Cc: Johannes Weiner
    Cc: Sean MacLennan
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marcin Slusarz
     
  • I have gotten to the root cause of the hugetlb badness I reported back on
    August 15th. My system has the following memory topology (note the
    overlapping node):

    Node 0 Memory: 0x8000000-0x44000000
    Node 1 Memory: 0x0-0x8000000 0x44000000-0x80000000

    setup_zone_migrate_reserve() scans the address range 0x0-0x8000000 looking
    for a pageblock to move onto the MIGRATE_RESERVE list. Finding no
    candidates, it happily continues the scan into 0x8000000-0x44000000. When
    a pageblock is found, the pages are moved to the MIGRATE_RESERVE list on
    the wrong zone. Oops.

    setup_zone_migrate_reserve() should skip pageblocks in overlapping nodes.

    Signed-off-by: Adam Litke
    Acked-by: Mel Gorman
    Cc: Dave Hansen
    Cc: Nishanth Aravamudan
    Cc: Andy Whitcroft
    Cc: [2.6.25.x, 2.6.26.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
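
    A sketch of the kind of guard described, as an excerpt of the
    setup_zone_migrate_reserve() pageblock scan of that era (variable names
    are that function's locals; illustrative only):

        for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
                if (!pfn_valid(pfn))
                        continue;
                page = pfn_to_page(pfn);

                /* Watch out for overlapping nodes: skip pageblocks that
                 * actually belong to another node's zone. */
                if (page_to_nid(page) != zone_to_nid(zone))
                        continue;

                /* ... existing MIGRATE_RESERVE selection logic ... */
        }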
     

13 Aug, 2008

1 commit


31 Jul, 2008

1 commit


28 Jul, 2008

1 commit


25 Jul, 2008

12 commits

  • - Change some naming
    * Magic -> types
    * MIX_INFO -> MIX_SECTION_INFO
    * Change definition of bootmem type from direct hex value

    - __free_pages_bootmem() becomes __meminit.

    Signed-off-by: Yasunori Goto
    Cc: Andy Whitcroft
    Cc: Badari Pulavarty
    Cc: Yinghai Lu
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • This patch contains the following cleanups:
    - make the following needlessly global variables static:
        - required_kernelcore
        - zone_movable_pfn[]
    - make the following needlessly global functions static:
        - move_freepages()
        - move_freepages_block()
        - setup_pageset()
        - find_usable_zone_for_movable()
        - adjust_zone_range_for_zone_movable()
        - __absent_pages_in_range()
        - find_min_pfn_for_node()
        - find_zone_movable_pfns_for_nodes()

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • alloc_pages_exact() is similar to alloc_pages(), except that it allocates
    the minimum number of pages to fulfill the request. This is useful if you
    want to allocate a very large buffer that is slightly larger than an even
    power-of-two number of pages. In that case, alloc_pages() will waste a
    lot of memory.

    I have a video driver that wants to allocate a 5MB buffer. alloc_pages()
    will waste 3MB of physically-contiguous memory.

    Signed-off-by: Timur Tabi
    Cc: Andi Kleen
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Timur Tabi
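
    A usage sketch of the new interface (the 5MB figure mirrors the example
    above; the surrounding function and variable names are hypothetical):

        #include <linux/errno.h>
        #include <linux/gfp.h>

        #define VIDEO_BUF_SIZE  (5 * 1024 * 1024)

        static void *video_buf;

        static int video_buf_alloc(void)
        {
                /* alloc_pages() would need an order-11 (8MB) block here and
                 * waste ~3MB; alloc_pages_exact() frees the unused tail
                 * pages back to the allocator. */
                video_buf = alloc_pages_exact(VIDEO_BUF_SIZE, GFP_KERNEL);
                if (!video_buf)
                        return -ENOMEM;
                return 0;
        }

        static void video_buf_free(void)
        {
                free_pages_exact(video_buf, VIDEO_BUF_SIZE);
        }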
     
  • hugetlb will need to get compound pages from bootmem to handle the case of
    them being greater than or equal to MAX_ORDER. Export the constructor
    function needed for this.

    Acked-by: Adam Litke
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • free_area_init_node() gets passed in the node id as well as the node
    descriptor. This is redundant as the function can trivially get the node
    descriptor itself by means of NODE_DATA() and the node's id.

    I checked all the users and NODE_DATA() seems to be usable everywhere
    from where this function is called.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • In __free_one_page(), the comment "Move the buddy up one level" appears
    attached to the break and by implication when the break is taken we are
    moving it up one level:

    if (!page_is_buddy(page, buddy, order))
            break; /* Move the buddy up one level. */

    In reality the inverse is true, we break out when we can no longer merge
    this page with its buddy. Looking back into pre-history (into the full
    git history) it appears that these two lines accidentally got joined as
    part of another change.

    Move the comment down where it belongs below the if and clarify its
    language.

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
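
    For context, a sketch of the merge loop with the comment in its corrected
    position (reconstructed from memory of the 2.6.26-era __free_one_page();
    simplified excerpt, locals belong to that function):

        while (order < MAX_ORDER - 1) {
                struct page *buddy = __page_find_buddy(page, page_idx, order);

                if (!page_is_buddy(page, buddy, order))
                        break;

                /* Our buddy is free, merge with it and move up one order. */
                list_del(&buddy->lru);
                zone->free_area[order].nr_free--;
                rmv_page_order(buddy);
                combined_idx = __find_combined_index(page_idx, order);
                page = page + (combined_idx - page_idx);
                page_idx = combined_idx;
                order++;
        }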
     
  • Two zonelist patch series largely rewrote __alloc_pages(). Now it is just
    a wrapper function; inlining it saves a function call.

    [akpm@linux-foundation.org: export __alloc_pages_internal]
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
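
    The resulting shape is roughly the following (sketch of the gfp.h wrapper
    after the change; simplified):

        /* include/linux/gfp.h */
        static inline struct page *
        __alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist)
        {
                return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
        }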
     
  • There are a lot of places that define either a single bootmem descriptor or an
    array of them. Use only one central array with MAX_NUMNODES items instead.

    Signed-off-by: Johannes Weiner
    Acked-by: Ralf Baechle
    Cc: Ingo Molnar
    Cc: Richard Henderson
    Cc: Russell King
    Cc: Tony Luck
    Cc: Hirokazu Takata
    Cc: Geert Uytterhoeven
    Cc: Kyle McMartin
    Cc: Paul Mackerras
    Cc: Paul Mundt
    Cc: David S. Miller
    Cc: Yinghai Lu
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
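
    The consolidated definition amounts to roughly this (sketch; placement and
    the example assignment are illustrative):

        /* mm/bootmem.c: one descriptor per possible node, instead of ad-hoc
         * definitions scattered across the architectures. */
        bootmem_data_t bootmem_node_data[MAX_NUMNODES] __initdata;

        /* Architectures then point their node at a slot, e.g.: */
        NODE_DATA(nid)->bdata = &bootmem_node_data[nid];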
     
  • This patch prints out the zonelists during boot for manual verification by the
    user if the mminit_loglevel is MMINIT_VERIFY or higher.

    Signed-off-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: Andy Whitcroft
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There are a number of different views of how much memory is currently
    active: the arch-independent zone-sizing view, the bootmem allocator's
    view and the memory model's view.

    Architectures register this information at different times and it is not
    necessarily kept in sync, particularly with respect to some SPARSEMEM
    limitations.

    This patch introduces mminit_validate_memmodel_limits() which is able to
    validate and correct PFN ranges with respect to the memory model. It is only
    SPARSEMEM that currently validates itself.

    Signed-off-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: Andy Whitcroft
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
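
    A sketch of what the SPARSEMEM clamp amounts to (illustrative; the real
    function also reports corrections through the mminit logging added in
    this series):

        void __meminit mminit_validate_memmodel_limits(unsigned long *start_pfn,
                                                       unsigned long *end_pfn)
        {
                unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS - PAGE_SHIFT);

                /* Clamp registered ranges to what SPARSEMEM can address. */
                if (*start_pfn > max_sparsemem_pfn) {
                        *start_pfn = max_sparsemem_pfn;
                        *end_pfn = max_sparsemem_pfn;
                }
                if (*end_pfn > max_sparsemem_pfn)
                        *end_pfn = max_sparsemem_pfn;
        }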
     
  • Print out information on how the page flags are being used if
    mminit_loglevel is MMINIT_VERIFY or higher, and unconditionally perform
    sanity checks on the flags regardless of loglevel.

    When the page flags are updated with section, node and zone information,
    a check is made to ensure the values can be retrieved correctly. Finally
    we confirm that pfn_to_page and page_to_pfn are the correct inverse
    functions.

    [akpm@linux-foundation.org: fix printk warnings]
    Signed-off-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: Andy Whitcroft
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
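
    The checks described boil down to something like this (sketch; the helper
    name here is hypothetical):

        static void check_page_links(struct page *page, struct zone *zone,
                                     unsigned long nid, unsigned long pfn)
        {
                /* Node, zone and pfn must all be recoverable from the flags. */
                BUG_ON(page_to_nid(page) != nid);
                BUG_ON(page_zonenum(page) != zone_idx(zone));
                BUG_ON(page_to_pfn(page) != pfn);
        }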
     
  • Boot initialisation is very complex, with significant numbers of
    architecture-specific routines, hooks and code ordering. While significant
    amounts of the initialisation are architecture-independent, that code
    trusts the data received from the architecture layer. This is a mistake,
    and has resulted in a number of difficult-to-diagnose bugs.

    This patchset adds some validation and tracing to memory initialisation. It
    also introduces a few basic defensive measures. The validation code can be
    explicitly disabled for embedded systems.

    This patch:

    Add additional debugging and verification code for memory initialisation.

    Once enabled, the verification checks are always run and, when required,
    additional debugging information may be output via the mminit_loglevel=
    command-line parameter.

    The verification code is placed in a new file mm/mm_init.c. Ideally other mm
    initialisation code will be moved here over time.

    Signed-off-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: Andy Whitcroft
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
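
    The logging hook is roughly of this shape (simplified sketch; the real
    macro also distinguishes warning output from debug output):

        /* mm/internal.h */
        extern int mminit_loglevel;

        #define mminit_dprintk(level, prefix, fmt, arg...)                   \
        do {                                                                 \
                if ((level) < mminit_loglevel)                               \
                        printk(KERN_DEBUG "mminit::" prefix " " fmt, ##arg); \
        } while (0)

    Booting with a sufficiently high mminit_loglevel= value then turns on the
    additional output; the verification checks themselves run whenever the
    config option is enabled.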
     

16 Jul, 2008

1 commit

  • Conflicts:

    arch/powerpc/Kconfig
    arch/s390/kernel/time.c
    arch/x86/kernel/apic_32.c
    arch/x86/kernel/cpu/perfctr-watchdog.c
    arch/x86/kernel/i8259_64.c
    arch/x86/kernel/ldt.c
    arch/x86/kernel/nmi_64.c
    arch/x86/kernel/smpboot.c
    arch/x86/xen/smp.c
    include/asm-x86/hw_irq_32.h
    include/asm-x86/hw_irq_64.h
    include/asm-x86/mach-default/irq_vectors.h
    include/asm-x86/mach-voyager/irq_vectors.h
    include/asm-x86/smp.h
    kernel/Makefile

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

08 Jul, 2008

7 commits


04 Jul, 2008

1 commit

  • The non-NUMA case of build_zonelist_cache() would initialize the
    zlcache_ptr for both node_zonelists[] entries to NULL, which is
    problematic since non-NUMA only has a single node_zonelists[] entry:
    trying to zero the non-existent second one just overwrote the nr_zones
    field instead.

    As kswapd uses this value to determine what reclaim work is necessary,
    the result is that kswapd never reclaims. This causes processes to
    stall frequently in low-memory situations as they always direct reclaim.
    This patch initialises zlcache_ptr correctly.

    Signed-off-by: Mel Gorman
    Tested-by: Dan Williams
    [ Simplified patch a bit ]
    Signed-off-by: Linus Torvalds

    Mel Gorman
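
    The non-NUMA stub after the fix is roughly (sketch):

        /* !CONFIG_NUMA: there is only node_zonelists[0] and no zonelist
         * cache, so just NULL the cache pointer and touch nothing else. */
        static void build_zonelist_cache(pg_data_t *pgdat)
        {
                pgdat->node_zonelists[0].zlcache_ptr = NULL;
        }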
     

26 Jun, 2008

1 commit


10 Jun, 2008

2 commits

  • Now we are using register_e820_active_regions() instead of calling
    add_active_range() directly. As a result, the end_pfn value recorded in
    early_node_map can differ from node_end_pfn.

    So we need to make shrink_active_range() smarter.

    shrink_active_range() is a generic MM function in mm/page_alloc.c but
    it is only used on 32-bit x86. Should we move it back to some file in
    arch/x86?

    Signed-off-by: Yinghai Lu
    Signed-off-by: Ingo Molnar

    Yinghai Lu
     
  • Minor source code cleanup of page flags in mm/page_alloc.c.
    Move the definition of the groups of bits to page-flags.h.

    The purpose of this clean up is that the next patch will
    conditionally add a page flag to the groups. Doing that
    in a header file is cleaner than adding #ifdefs to the
    C code.

    Signed-off-by: Russ Anderson
    Signed-off-by: Linus Torvalds

    Russ Anderson
     

03 Jun, 2008

1 commit


25 May, 2008

3 commits

  • Trying to add memory via add_memory() from within an initcall function
    results in

    bootmem alloc of 163840 bytes failed!
    Kernel panic - not syncing: Out of memory

    This is caused by zone_wait_table_init() which uses system_state to decide
    if it should use the bootmem allocator or not.

    When initcalls are handled the system_state is still SYSTEM_BOOTING but
    the bootmem allocator doesn't work anymore. So the allocation will fail.

    To fix this, use slab_is_available() as the indicator instead, like we do
    everywhere else.

    [akpm@linux-foundation.org: coding-style fix]
    Reviewed-by: Andy Whitcroft
    Cc: Dave Hansen
    Cc: Gerald Schaefer
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Yasunori Goto
    Signed-off-by: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
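
    The indicator change amounts to roughly this (excerpt-style sketch of the
    allocation branch in zone_wait_table_init(); locals belong to that
    function):

        if (!slab_is_available()) {
                /* Early boot: only the bootmem allocator works. */
                zone->wait_table = (wait_queue_head_t *)
                        alloc_bootmem_node(pgdat, alloc_size);
        } else {
                /* Memory hotplug, e.g. from an initcall: bootmem is already
                 * torn down, so fall back to vmalloc. */
                zone->wait_table = vmalloc(alloc_size);
        }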
     
  • When booting 2.6.26-rc3 on a multi-node x86_32 numa system we are seeing
    panics when trying node local allocations:

    BUG: unable to handle kernel NULL pointer dereference at 0000034c
    IP: [] get_page_from_freelist+0x4a/0x18e
    *pdpt = 00000000013a7001 *pde = 0000000000000000
    Oops: 0000 [#1] SMP
    Modules linked in:

    Pid: 0, comm: swapper Not tainted (2.6.26-rc3-00003-g5abc28d #82)
    EIP: 0060:[] EFLAGS: 00010282 CPU: 0
    EIP is at get_page_from_freelist+0x4a/0x18e
    EAX: c1371ed8 EBX: 00000000 ECX: 00000000 EDX: 00000000
    ESI: f7801180 EDI: 00000000 EBP: 00000000 ESP: c1371ec0
    DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    Process swapper (pid: 0, ti=c1370000 task=c12f5b40 task.ti=c1370000)
    Stack: 00000000 00000000 00000000 00000000 000612d0 000412d0 00000000 000412d0
    f7801180 f7c0101c f7c01018 c10426e4 f7c01018 00000001 00000044 00000000
    00000001 c12f5b40 00000001 00000010 00000000 000412d0 00000286 000412d0
    Call Trace:
    [] __alloc_pages_internal+0x99/0x378
    [] __alloc_pages+0x7/0x9
    [] kmem_getpages+0x66/0xef
    [] cache_grow+0x8f/0x123
    [] ____cache_alloc_node+0xb9/0xe4
    [] kmem_cache_alloc_node+0x92/0xd2
    [] setup_cpu_cache+0xaf/0x177
    [] kmem_cache_create+0x2c8/0x353
    [] kmem_cache_init+0x1ce/0x3ad
    [] start_kernel+0x178/0x1ee

    This occurs when we are scanning the zonelists looking for a ZONE_NORMAL
    page. In this system there is only ZONE_DMA and ZONE_NORMAL memory on
    node 0, all other nodes are mapped above 4GB physical. Here is a dump
    of the zonelists from this system:

    zonelists pgdat=c1400000
    0: c14006c0:2 f7c006c0:2 f7e006c0:2 c1400360:1 c1400000:0
    1: c14006c0:2 c1400360:1 c1400000:0
    zonelists pgdat=f7c00000
    0: f7c006c0:2 f7e006c0:2 c14006c0:2 c1400360:1 c1400000:0
    1: f7c006c0:2
    zonelists pgdat=f7e00000
    0: f7e006c0:2 c14006c0:2 f7c006c0:2 c1400360:1 c1400000:0
    1: f7e006c0:2

    When performing a node local allocation we call get_page_from_freelist()
    looking for a page. It in turn calls first_zones_zonelist() which returns
    a preferred_zone. Where there are no applicable zones this will be NULL.
    However we use this unconditionally, leading to this panic.

    Where there are no applicable zones there is no possibility of a successful
    allocation, so simply fail the allocation.

    Signed-off-by: Andy Whitcroft
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
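
    The fix boils down to checking the result before using it, roughly
    (excerpt-style sketch of the start of get_page_from_freelist() in that
    era; locals belong to that function):

        (void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
                                   &preferred_zone);
        /* No applicable zones means the allocation cannot succeed. */
        if (!preferred_zone)
                return NULL;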
     
  • In a zone's present pages number, account for all pages occupied by the
    memory map, including a partial one.

    Signed-off-by: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
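
    The accounting change is essentially rounding up instead of down
    (excerpt-style sketch from the zone setup in free_area_init_core()):

        /* Account every page the memmap touches, including a final partial
         * page. */
        memmap_pages = PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;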
     

15 May, 2008

1 commit

  • Trying to online a new memory section that was added via memory hotplug
    sometimes results in crashes when the new pages are added via __free_page.
    Reason for that is that the pageblock bitmap isn't initialized and hence
    contains random stuff. That means that get_pageblock_migratetype() also
    returns random stuff and therefore

    list_add(&page->lru,
             &zone->free_area[order].free_list[migratetype]);

    in __free_one_page() tries to do a list_add to something that isn't even
    necessarily a list.

    This happens since 86051ca5eaf5e560113ec7673462804c54284456 ("mm: fix
    usemap initialization") which makes sure that the pageblock bitmap gets
    only initialized for pages present in a zone. Unfortunately for hot-added
    memory the zones "grow" after the memmap and the pageblock memmap have
    been initialized. Which means that the new pages have an unitialized
    bitmap. To solve this the calls to grow_zone_span() and grow_pgdat_span()
    are moved to __add_zone() just before the initialization happens.

    The patch also moves the two functions since __add_zone() is the only
    caller and I didn't want to add a forward declaration.

    Signed-off-by: Heiko Carstens
    Cc: Andy Whitcroft
    Cc: Dave Hansen
    Cc: Gerald Schaefer
    Cc: KAMEZAWA Hiroyuki
    Cc: Yasunori Goto
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     

30 Apr, 2008

1 commit

  • We can see an ever repeating problem pattern with objects of any kind in the
    kernel:

    1) freeing of active objects
    2) reinitialization of active objects

    Both problems can be hard to debug because the crash happens at a point where
    we have no chance to decode the root cause anymore. One problem spot is
    kernel timers, where the detection of the problem often happens in interrupt
    context and usually causes the machine to panic.

    While working on a timer related bug report I had to hack specialized code
    into the timer subsystem to get a reasonable hint for the root cause. This
    debug hack was fine for temporary use, but far from a mergeable solution due
    to the intrusiveness into the timer code.

    The code further lacked the ability to detect and report the root cause
    instantly and keep the system operational.

    Keeping the system operational is important to get hold of the debug
    information without special debugging aids like serial consoles and special
    knowledge of the bug reporter.

    The problems described above are not restricted to timers, but timers tend
    to expose them, usually in a full system crash. Other objects are less
    explosive, but the symptoms caused by such mistakes can be even harder to
    debug.

    Instead of creating specialized debugging code for the timer subsystem a
    generic infrastructure is created which allows developers to verify their code
    and provides an easy to enable debug facility for users in case of trouble.

    The debugobjects core code keeps track of operations on static and dynamic
    objects by inserting them into a hashed list and sanity checking them on
    object operations and provides additional checks whenever kernel memory is
    freed.

    The tracked object operations are:
    - initializing an object
    - adding an object to a subsystem list
    - deleting an object from a subsystem list

    Each operation is sanity checked before it is executed, and the
    subsystem-specific code can provide a fixup function which makes it
    possible to prevent the damage the operation would cause. When a sanity
    check triggers, a warning message and a stack trace are printed.

    The list of operations can be extended if the need arises. For now it's
    limited to the requirements of the first user (timers).

    The core code enqueues the objects into hash buckets. The hash index is
    generated from the address of the object to simplify the lookup for the check
    on kfree/vfree. Each bucket has its own spinlock to avoid contention on a
    global lock.

    The debug code can be compiled in without being active. The runtime overhead
    is minimal and could be optimized by asm alternatives. A kernel command line
    option enables the debugging code.

    Thanks to Ingo Molnar for review, suggestions and cleanup patches.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Cc: Greg KH
    Cc: Randy Dunlap
    Cc: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
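
    From a subsystem's point of view the infrastructure looks roughly like
    this (sketch; the my_obj names and the trivial fixup are hypothetical,
    while the debug_object_* calls and the descriptor type are the real entry
    points):

        #include <linux/debugobjects.h>
        #include <linux/list.h>

        struct my_obj {
                struct list_head entry;
                /* ... */
        };

        static int my_obj_fixup_free(void *addr, enum debug_obj_state state)
        {
                /* Last-resort correction hook; nothing to fix in this sketch. */
                return 0;
        }

        static struct debug_obj_descr my_obj_debug_descr = {
                .name       = "my_obj",
                .fixup_free = my_obj_fixup_free,
        };

        static void my_obj_init(struct my_obj *obj)
        {
                debug_object_init(obj, &my_obj_debug_descr);
                INIT_LIST_HEAD(&obj->entry);
        }

        static void my_obj_add(struct my_obj *obj, struct list_head *head)
        {
                debug_object_activate(obj, &my_obj_debug_descr);
                list_add(&obj->entry, head);
        }

        static void my_obj_del(struct my_obj *obj)
        {
                debug_object_deactivate(obj, &my_obj_debug_descr);
                list_del(&obj->entry);
        }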
     

29 Apr, 2008

3 commits

  • Because of page order checks in __alloc_pages(), hugepage (and similarly
    large order) allocations will not retry unless explicitly marked
    __GFP_REPEAT. However, the current retry logic is nearly an infinite
    loop (or until reclaim does no progress whatsoever). For these costly
    allocations, that seems like overkill and could potentially never
    terminate. Mel observed that allowing current __GFP_REPEAT semantics for
    hugepage allocations essentially killed the system. I believe this is
    because we may continue to reclaim small orders of pages all over, but
    never have enough to satisfy the hugepage allocation request. This is
    clearly only a problem for large order allocations, of which hugepages
    are the most obvious (to me).

    Modify try_to_free_pages() to indicate how many pages were reclaimed.
    Use that information in __alloc_pages() to eventually fail a large
    __GFP_REPEAT allocation when we've reclaimed an order of pages equal to
    or greater than the allocation's order. This relies on lumpy reclaim
    functioning as advertised. Due to fragmentation, lumpy reclaim may not
    be able to free up the order needed in one invocation, so multiple
    iterations may be required. In other words, the more fragmented memory
    is, the more retry attempts __GFP_REPEAT will make (particularly for
    higher order allocations).

    This changes the semantics of __GFP_REPEAT subtly, but *only* for
    allocations > PAGE_ALLOC_COSTLY_ORDER. With this patch, for those size
    allocations, we will try up to some point (at least until 1<<order pages
    have been reclaimed), rather than retrying forever.

    Signed-off-by: Nishanth Aravamudan
    Cc: Andy Whitcroft
    Tested-by: Mel Gorman
    Cc: Dave Hansen
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
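
    The retry decision after this change is roughly (excerpt-style sketch of
    the __alloc_pages() retry path; locals belong to that function):

        pages_reclaimed += did_some_progress;
        do_retry = 0;
        if (!(gfp_mask & __GFP_NORETRY)) {
                if (order <= PAGE_ALLOC_COSTLY_ORDER) {
                        /* Small orders keep the old behaviour and retry. */
                        do_retry = 1;
                } else {
                        /* Costly orders with __GFP_REPEAT give up once at
                         * least 1 << order pages have been reclaimed. */
                        if ((gfp_mask & __GFP_REPEAT) &&
                            pages_reclaimed < (1 << order))
                                do_retry = 1;
                }
                if (gfp_mask & __GFP_NOFAIL)
                        do_retry = 1;
        }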
     
  • The definition and use of __GFP_REPEAT, __GFP_NOFAIL and __GFP_NORETRY in the
    core VM have somewhat differing comments as to their actual semantics.
    Annoyingly, the flags definition has inline and header comments, which might
    be interpreted as not being equivalent. Just add references to the header
    comments in the inline ones so they don't go out of sync in the future. In
    their use in __alloc_pages() clarify that the current implementation treats
    low-order allocations and __GFP_REPEAT allocations as distinct cases.

    To clarify, the flags' semantics are:

    __GFP_NORETRY means try no harder than one run through __alloc_pages

    __GFP_REPEAT means __GFP_NOFAIL

    __GFP_NOFAIL means repeat forever

    order <= PAGE_ALLOC_COSTLY_ORDER means __GFP_NOFAIL

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • usemap must be initialized only when pfn is within zone. If not, it corrupts
    memory.

    And this patch also reduces the number of calls to
    set_pageblock_migratetype() by changing the condition from
    (pfn & (pageblock_nr_pages - 1))
    to
    !(pfn & (pageblock_nr_pages - 1))
    as it should be called once per pageblock.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Shi Weihua
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
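
    The guarded call ends up looking roughly like this (excerpt-style sketch
    from memmap_init_zone() of that era; locals belong to that function):

        /* Only touch the pageblock bitmap for pfns that lie inside the zone,
         * and only once per pageblock (at its first pfn). */
        if ((z->zone_start_pfn <= pfn)
            && (pfn < z->zone_start_pfn + z->spanned_pages)
            && !(pfn & (pageblock_nr_pages - 1)))
                set_pageblock_migratetype(page, MIGRATE_MOVABLE);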
     

28 Apr, 2008

1 commit