03 Oct, 2008

1 commit

  • When we initialise a compound page we initialise the page flags and head
    page pointer for all base pages spanned by that page. When we initialise
    a gigantic page (a page of order greater than or equal to MAX_ORDER) we
    have to initialise more than MAX_ORDER_NR_PAGES pages. Currently we
    assume that all elements of the mem_map in this page are contiguous in
    memory. However this is only guaranteed out to MAX_ORDER_NR_PAGES pages,
    and with SPARSEMEM enabled they will not be contiguous. This leads us to
    walk off the end of the first section and scribble on everything which
    follows, BAD.

    When we reach a MAX_ORDER_NR_PAGES boundary we must locate the next
    section of the mem_map. As gigantic pages can only be maximally aligned
    we know this will occur at an exact multiple of MAX_ORDER_NR_PAGES pages
    from the start of the page.

    This is a bug fix for the gigantic page support in hugetlbfs.

    Credit to Mel Gorman for spotting the issue.

    Signed-off-by: Andy Whitcroft
    Cc: Mel Gorman
    Cc: Jon Tollefson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
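
    A minimal sketch of the boundary-aware initialisation loop described above
    (illustrative only; it follows the 2.6.27-era compound-page helpers, and
    the function name here is hypothetical):

        #include <linux/mm.h>
        #include <linux/mmzone.h>

        /* Initialise the tail pages of a (possibly gigantic) compound page
         * without assuming the mem_map is contiguous beyond
         * MAX_ORDER_NR_PAGES. */
        static void prep_compound_page_sketch(struct page *page, unsigned long order)
        {
                int i;
                int nr_pages = 1 << order;
                struct page *p = page + 1;

                set_compound_order(page, order);
                __SetPageHead(page);
                for (i = 1; i < nr_pages; i++, p++) {
                        /* At each MAX_ORDER_NR_PAGES boundary the next struct
                         * page may live in a different SPARSEMEM section, so
                         * re-resolve it by pfn rather than trusting pointer
                         * arithmetic. */
                        if (unlikely((i & (MAX_ORDER_NR_PAGES - 1)) == 0))
                                p = pfn_to_page(page_to_pfn(page) + i);
                        __SetPageTail(p);
                        p->first_page = page;
                }
        }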
     

03 Sep, 2008

2 commits

  • WARNING: vmlinux.o(.data+0x1f5c0): Section mismatch in reference from the variable contig_page_data to the variable .init.data:bootmem_node_data
    The variable contig_page_data references
    the variable __initdata bootmem_node_data
    If the reference is valid then annotate the
    variable with __init* (see linux/init.h) or name the variable:
    *driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console,

    Signed-off-by: Marcin Slusarz
    Cc: Johannes Weiner
    Cc: Sean MacLennan
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marcin Slusarz
     
  • I have gotten to the root cause of the hugetlb badness I reported back on
    August 15th. My system has the following memory topology (note the
    overlapping node):

    Node 0 Memory: 0x8000000-0x44000000
    Node 1 Memory: 0x0-0x8000000 0x44000000-0x80000000

    setup_zone_migrate_reserve() scans the address range 0x0-0x8000000 looking
    for a pageblock to move onto the MIGRATE_RESERVE list. Finding no
    candidates, it happily continues the scan into 0x8000000-0x44000000. When
    a pageblock is found, the pages are moved to the MIGRATE_RESERVE list on
    the wrong zone. Oops.

    setup_zone_migrate_reserve() should skip pageblocks in overlapping nodes.

    Signed-off-by: Adam Litke
    Acked-by: Mel Gorman
    Cc: Dave Hansen
    Cc: Nishanth Aravamudan
    Cc: Andy Whitcroft
    Cc: [2.6.25.x, 2.6.26.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
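
    A sketch of the kind of guard described, as an excerpt of the
    setup_zone_migrate_reserve() pageblock scan of that era (variable names
    are that function's locals; illustrative only):

        for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
                if (!pfn_valid(pfn))
                        continue;
                page = pfn_to_page(pfn);

                /* Watch out for overlapping nodes: skip pageblocks that
                 * actually belong to another node's zone. */
                if (page_to_nid(page) != zone_to_nid(zone))
                        continue;

                /* ... existing MIGRATE_RESERVE selection logic ... */
        }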
     

13 Aug, 2008

1 commit


31 Jul, 2008

1 commit


28 Jul, 2008

1 commit


25 Jul, 2008

12 commits

  • - Change some naming
    * Magic -> types
    * MIX_INFO -> MIX_SECTION_INFO
    * Change definition of bootmem type from direct hex value

    - __free_pages_bootmem() becomes __meminit.

    Signed-off-by: Yasunori Goto
    Cc: Andy Whitcroft
    Cc: Badari Pulavarty
    Cc: Yinghai Lu
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • This patch contains the following cleanups:
    - make the following needlessly global variables static:
        - required_kernelcore
        - zone_movable_pfn[]
    - make the following needlessly global functions static:
        - move_freepages()
        - move_freepages_block()
        - setup_pageset()
        - find_usable_zone_for_movable()
        - adjust_zone_range_for_zone_movable()
        - __absent_pages_in_range()
        - find_min_pfn_for_node()
        - find_zone_movable_pfns_for_nodes()

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • alloc_pages_exact() is similar to alloc_pages(), except that it allocates
    the minimum number of pages to fulfill the request. This is useful if you
    want to allocate a very large buffer that is slightly larger than an even
    power-of-two number of pages. In that case, alloc_pages() will waste a
    lot of memory.

    I have a video driver that wants to allocate a 5MB buffer. alloc_pages()
    will waste 3MB of physically-contiguous memory.

    Signed-off-by: Timur Tabi
    Cc: Andi Kleen
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Timur Tabi
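
    A usage sketch of the new interface (the 5MB figure mirrors the example
    above; the surrounding function and variable names are hypothetical):

        #include <linux/errno.h>
        #include <linux/gfp.h>

        #define VIDEO_BUF_SIZE  (5 * 1024 * 1024)

        static void *video_buf;

        static int video_buf_alloc(void)
        {
                /* alloc_pages() would need an order-11 (8MB) block here and
                 * waste ~3MB; alloc_pages_exact() frees the unused tail
                 * pages back to the allocator. */
                video_buf = alloc_pages_exact(VIDEO_BUF_SIZE, GFP_KERNEL);
                if (!video_buf)
                        return -ENOMEM;
                return 0;
        }

        static void video_buf_free(void)
        {
                free_pages_exact(video_buf, VIDEO_BUF_SIZE);
        }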
     
  • hugetlb will need to get compound pages from bootmem to handle the case of
    them being greater than or equal to MAX_ORDER. Export the constructor
    function needed for this.

    Acked-by: Adam Litke
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • free_area_init_node() gets passed in the node id as well as the node
    descriptor. This is redundant as the function can trivially get the node
    descriptor itself by means of NODE_DATA() and the node's id.

    I checked all the users and NODE_DATA() seems to be usable everywhere
    from where this function is called.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • In __free_one_page(), the comment "Move the buddy up one level" appears
    attached to the break and by implication when the break is taken we are
    moving it up one level:

    if (!page_is_buddy(page, buddy, order))
            break; /* Move the buddy up one level. */

    In reality the inverse is true, we break out when we can no longer merge
    this page with its buddy. Looking back into pre-history (into the full
    git history) it appears that these two lines accidentally got joined as
    part of another change.

    Move the comment down where it belongs below the if and clarify its
    language.

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
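
    For context, a sketch of the merge loop with the comment in its corrected
    position (reconstructed from memory of the 2.6.26-era __free_one_page();
    simplified excerpt, locals belong to that function):

        while (order < MAX_ORDER - 1) {
                struct page *buddy = __page_find_buddy(page, page_idx, order);

                if (!page_is_buddy(page, buddy, order))
                        break;

                /* Our buddy is free, merge with it and move up one order. */
                list_del(&buddy->lru);
                zone->free_area[order].nr_free--;
                rmv_page_order(buddy);
                combined_idx = __find_combined_index(page_idx, order);
                page = page + (combined_idx - page_idx);
                page_idx = combined_idx;
                order++;
        }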
     
  • Two zonelist patch series largely rewrote __alloc_pages(). Now it is just
    a wrapper function; inlining it saves a function call.

    [akpm@linux-foundation.org: export __alloc_pages_internal]
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
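
    The resulting shape is roughly the following (sketch of the gfp.h wrapper
    after the change; simplified):

        /* include/linux/gfp.h */
        static inline struct page *
        __alloc_pages(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist)
        {
                return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
        }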
     
  • There are a lot of places that define either a single bootmem descriptor or an
    array of them. Use only one central array with MAX_NUMNODES items instead.

    Signed-off-by: Johannes Weiner
    Acked-by: Ralf Baechle
    Cc: Ingo Molnar
    Cc: Richard Henderson
    Cc: Russell King
    Cc: Tony Luck
    Cc: Hirokazu Takata
    Cc: Geert Uytterhoeven
    Cc: Kyle McMartin
    Cc: Paul Mackerras
    Cc: Paul Mundt
    Cc: David S. Miller
    Cc: Yinghai Lu
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
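
    The consolidated definition amounts to roughly this (sketch; placement and
    the example assignment are illustrative):

        /* mm/bootmem.c: one descriptor per possible node, instead of ad-hoc
         * definitions scattered across the architectures. */
        bootmem_data_t bootmem_node_data[MAX_NUMNODES] __initdata;

        /* Architectures then point their node at a slot, e.g.: */
        NODE_DATA(nid)->bdata = &bootmem_node_data[nid];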
     
  • This patch prints out the zonelists during boot for manual verification by the
    user if the mminit_loglevel is MMINIT_VERIFY or higher.

    Signed-off-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: Andy Whitcroft
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There are a number of different views of how much memory is currently
    active: the arch-independent zone-sizing view, the bootmem allocator's
    view and the memory model's view.

    Architectures register this information at different times and it is not
    necessarily kept in sync, particularly with respect to some SPARSEMEM
    limitations.

    This patch introduces mminit_validate_memmodel_limits() which is able to
    validate and correct PFN ranges with respect to the memory model. It is only
    SPARSEMEM that currently validates itself.

    Signed-off-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: Andy Whitcroft
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
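
    A sketch of what the SPARSEMEM clamp amounts to (illustrative; the real
    function also reports corrections through the mminit logging added in
    this series):

        void __meminit mminit_validate_memmodel_limits(unsigned long *start_pfn,
                                                       unsigned long *end_pfn)
        {
                unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS - PAGE_SHIFT);

                /* Clamp registered ranges to what SPARSEMEM can address. */
                if (*start_pfn > max_sparsemem_pfn) {
                        *start_pfn = max_sparsemem_pfn;
                        *end_pfn = max_sparsemem_pfn;
                }
                if (*end_pfn > max_sparsemem_pfn)
                        *end_pfn = max_sparsemem_pfn;
        }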
     
  • Print out information on how the page flags are being used if
    mminit_loglevel is MMINIT_VERIFY or higher, and unconditionally perform
    sanity checks on the flags regardless of loglevel.

    When the page flags are updated with section, node and zone information,
    a check is made to ensure the values can be retrieved correctly. Finally
    we confirm that pfn_to_page and page_to_pfn are the correct inverse
    functions.

    [akpm@linux-foundation.org: fix printk warnings]
    Signed-off-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: Andy Whitcroft
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
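
    The checks described boil down to something like this (sketch; the helper
    name here is hypothetical):

        static void check_page_links(struct page *page, struct zone *zone,
                                     unsigned long nid, unsigned long pfn)
        {
                /* Node, zone and pfn must all be recoverable from the flags. */
                BUG_ON(page_to_nid(page) != nid);
                BUG_ON(page_zonenum(page) != zone_idx(zone));
                BUG_ON(page_to_pfn(page) != pfn);
        }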
     
  • Boot initialisation is very complex, with significant numbers of
    architecture-specific routines, hooks and code ordering. While significant
    amounts of the initialisation are architecture-independent, that code
    trusts the data received from the architecture layer. This is a mistake,
    and has resulted in a number of difficult-to-diagnose bugs.

    This patchset adds some validation and tracing to memory initialisation. It
    also introduces a few basic defensive measures. The validation code can be
    explicitly disabled for embedded systems.

    This patch:

    Add additional debugging and verification code for memory initialisation.

    Once enabled, the verification checks are always run and, when required,
    additional debugging information may be output via the mminit_loglevel=
    command-line parameter.

    The verification code is placed in a new file mm/mm_init.c. Ideally other mm
    initialisation code will be moved here over time.

    Signed-off-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: Andy Whitcroft
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
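
    The logging hook is roughly of this shape (simplified sketch; the real
    macro also distinguishes warning output from debug output):

        /* mm/internal.h */
        extern int mminit_loglevel;

        #define mminit_dprintk(level, prefix, fmt, arg...)                   \
        do {                                                                 \
                if ((level) < mminit_loglevel)                               \
                        printk(KERN_DEBUG "mminit::" prefix " " fmt, ##arg); \
        } while (0)

    Booting with a sufficiently high mminit_loglevel= value then turns on the
    additional output; the verification checks themselves run whenever the
    config option is enabled.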
     

16 Jul, 2008

1 commit

  • Conflicts:

    arch/powerpc/Kconfig
    arch/s390/kernel/time.c
    arch/x86/kernel/apic_32.c
    arch/x86/kernel/cpu/perfctr-watchdog.c
    arch/x86/kernel/i8259_64.c
    arch/x86/kernel/ldt.c
    arch/x86/kernel/nmi_64.c
    arch/x86/kernel/smpboot.c
    arch/x86/xen/smp.c
    include/asm-x86/hw_irq_32.h
    include/asm-x86/hw_irq_64.h
    include/asm-x86/mach-default/irq_vectors.h
    include/asm-x86/mach-voyager/irq_vectors.h
    include/asm-x86/smp.h
    kernel/Makefile

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

08 Jul, 2008

7 commits


04 Jul, 2008

1 commit

  • The non-NUMA case of build_zonelist_cache() would initialize the
    zlcache_ptr for both node_zonelists[] entries to NULL, which is
    problematic since non-NUMA only has a single node_zonelists[] entry:
    trying to zero the non-existent second one just overwrote the nr_zones
    field instead.

    As kswapd uses this value to determine what reclaim work is necessary,
    the result is that kswapd never reclaims. This causes processes to
    stall frequently in low-memory situations as they always direct reclaim.
    This patch initialises zlcache_ptr correctly.

    Signed-off-by: Mel Gorman
    Tested-by: Dan Williams
    [ Simplified patch a bit ]
    Signed-off-by: Linus Torvalds

    Mel Gorman
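
    The non-NUMA stub after the fix is roughly (sketch):

        /* !CONFIG_NUMA: there is only node_zonelists[0] and no zonelist
         * cache, so just NULL the cache pointer and touch nothing else. */
        static void build_zonelist_cache(pg_data_t *pgdat)
        {
                pgdat->node_zonelists[0].zlcache_ptr = NULL;
        }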
     

26 Jun, 2008

1 commit


10 Jun, 2008

2 commits

  • Now we are using register_e820_active_regions() instead of calling
    add_active_range() directly. As a result, the end_pfn value recorded in
    early_node_map can differ from node_end_pfn.

    So we need to make shrink_active_range() smarter.

    shrink_active_range() is a generic MM function in mm/page_alloc.c but
    it is only used on 32-bit x86. Should we move it back to some file in
    arch/x86?

    Signed-off-by: Yinghai Lu
    Signed-off-by: Ingo Molnar

    Yinghai Lu
     
  • Minor source code cleanup of page flags in mm/page_alloc.c.
    Move the definition of the groups of bits to page-flags.h.

    The purpose of this clean up is that the next patch will
    conditionally add a page flag to the groups. Doing that
    in a header file is cleaner than adding #ifdefs to the
    C code.

    Signed-off-by: Russ Anderson
    Signed-off-by: Linus Torvalds

    Russ Anderson
     

03 Jun, 2008

1 commit


25 May, 2008

3 commits

  • Trying to add memory via add_memory() from within an initcall function
    results in

    bootmem alloc of 163840 bytes failed!
    Kernel panic - not syncing: Out of memory

    This is caused by zone_wait_table_init() which uses system_state to decide
    if it should use the bootmem allocator or not.

    When initcalls are handled the system_state is still SYSTEM_BOOTING but
    the bootmem allocator doesn't work anymore. So the allocation will fail.

    To fix this, use slab_is_available() as the indicator instead, like we do
    everywhere else.

    [akpm@linux-foundation.org: coding-style fix]
    Reviewed-by: Andy Whitcroft
    Cc: Dave Hansen
    Cc: Gerald Schaefer
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Yasunori Goto
    Signed-off-by: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
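
    The indicator change amounts to roughly this (excerpt-style sketch of the
    allocation branch in zone_wait_table_init(); locals belong to that
    function):

        if (!slab_is_available()) {
                /* Early boot: only the bootmem allocator works. */
                zone->wait_table = (wait_queue_head_t *)
                        alloc_bootmem_node(pgdat, alloc_size);
        } else {
                /* Memory hotplug, e.g. from an initcall: bootmem is already
                 * torn down, so fall back to vmalloc. */
                zone->wait_table = vmalloc(alloc_size);
        }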
     
  • When booting 2.6.26-rc3 on a multi-node x86_32 numa system we are seeing
    panics when trying node local allocations:

    BUG: unable to handle kernel NULL pointer dereference at 0000034c
    IP: [] get_page_from_freelist+0x4a/0x18e
    *pdpt = 00000000013a7001 *pde = 0000000000000000
    Oops: 0000 [#1] SMP
    Modules linked in:

    Pid: 0, comm: swapper Not tainted (2.6.26-rc3-00003-g5abc28d #82)
    EIP: 0060:[] EFLAGS: 00010282 CPU: 0
    EIP is at get_page_from_freelist+0x4a/0x18e
    EAX: c1371ed8 EBX: 00000000 ECX: 00000000 EDX: 00000000
    ESI: f7801180 EDI: 00000000 EBP: 00000000 ESP: c1371ec0
    DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    Process swapper (pid: 0, ti=c1370000 task=c12f5b40 task.ti=c1370000)
    Stack: 00000000 00000000 00000000 00000000 000612d0 000412d0 00000000 000412d0
    f7801180 f7c0101c f7c01018 c10426e4 f7c01018 00000001 00000044 00000000
    00000001 c12f5b40 00000001 00000010 00000000 000412d0 00000286 000412d0
    Call Trace:
    [] __alloc_pages_internal+0x99/0x378
    [] __alloc_pages+0x7/0x9
    [] kmem_getpages+0x66/0xef
    [] cache_grow+0x8f/0x123
    [] ____cache_alloc_node+0xb9/0xe4
    [] kmem_cache_alloc_node+0x92/0xd2
    [] setup_cpu_cache+0xaf/0x177
    [] kmem_cache_create+0x2c8/0x353
    [] kmem_cache_init+0x1ce/0x3ad
    [] start_kernel+0x178/0x1ee

    This occurs when we are scanning the zonelists looking for a ZONE_NORMAL
    page. In this system there is only ZONE_DMA and ZONE_NORMAL memory on
    node 0, all other nodes are mapped above 4GB physical. Here is a dump
    of the zonelists from this system:

    zonelists pgdat=c1400000
    0: c14006c0:2 f7c006c0:2 f7e006c0:2 c1400360:1 c1400000:0
    1: c14006c0:2 c1400360:1 c1400000:0
    zonelists pgdat=f7c00000
    0: f7c006c0:2 f7e006c0:2 c14006c0:2 c1400360:1 c1400000:0
    1: f7c006c0:2
    zonelists pgdat=f7e00000
    0: f7e006c0:2 c14006c0:2 f7c006c0:2 c1400360:1 c1400000:0
    1: f7e006c0:2

    When performing a node local allocation we call get_page_from_freelist()
    looking for a page. It in turn calls first_zones_zonelist() which returns
    a preferred_zone. Where there are no applicable zones this will be NULL.
    However we use this unconditionally, leading to this panic.

    Where there are no applicable zones there is no possibility of a successful
    allocation, so simply fail the allocation.

    Signed-off-by: Andy Whitcroft
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
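
    The fix boils down to checking the result before using it, roughly
    (excerpt-style sketch of the start of get_page_from_freelist() in that
    era; locals belong to that function):

        (void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
                                   &preferred_zone);
        /* No applicable zones means the allocation cannot succeed. */
        if (!preferred_zone)
                return NULL;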
     
  • In a zone's present pages number, account for all pages occupied by the
    memory map, including a partial one.

    Signed-off-by: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
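
    The accounting change is essentially rounding up instead of down
    (excerpt-style sketch from the zone setup in free_area_init_core()):

        /* Account every page the memmap touches, including a final partial
         * page. */
        memmap_pages = PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;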
     

15 May, 2008

1 commit

  • Trying to online a new memory section that was added via memory hotplug
    sometimes results in crashes when the new pages are added via __free_page.
    Reason for that is that the pageblock bitmap isn't initialized and hence
    contains random stuff. That means that get_pageblock_migratetype() also
    returns random stuff and therefore

    list_add(&page->lru,
             &zone->free_area[order].free_list[migratetype]);

    in __free_one_page() tries to do a list_add to something that isn't even
    necessarily a list.

    This happens since 86051ca5eaf5e560113ec7673462804c54284456 ("mm: fix
    usemap initialization") which makes sure that the pageblock bitmap gets
    only initialized for pages present in a zone. Unfortunately for hot-added
    memory the zones "grow" after the memmap and the pageblock memmap have
    been initialized. Which means that the new pages have an unitialized
    bitmap. To solve this the calls to grow_zone_span() and grow_pgdat_span()
    are moved to __add_zone() just before the initialization happens.

    The patch also moves the two functions since __add_zone() is the only
    caller and I didn't want to add a forward declaration.

    Signed-off-by: Heiko Carstens
    Cc: Andy Whitcroft
    Cc: Dave Hansen
    Cc: Gerald Schaefer
    Cc: KAMEZAWA Hiroyuki
    Cc: Yasunori Goto
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     

30 Apr, 2008

1 commit

  • We can see an ever repeating problem pattern with objects of any kind in the
    kernel:

    1) freeing of active objects
    2) reinitialization of active objects

    Both problems can be hard to debug because the crash happens at a point where
    we have no chance to decode the root cause anymore. One problem spot is
    kernel timers, where the detection of the problem often happens in interrupt
    context and usually causes the machine to panic.

    While working on a timer related bug report I had to hack specialized code
    into the timer subsystem to get a reasonable hint for the root cause. This
    debug hack was fine for temporary use, but far from a mergeable solution due
    to the intrusiveness into the timer code.

    The code further lacked the ability to detect and report the root cause
    instantly and keep the system operational.

    Keeping the system operational is important to get hold of the debug
    information without special debugging aids like serial consoles and special
    knowledge of the bug reporter.

    The problems described above are not restricted to timers, but timers tend
    to expose them, usually in a full system crash. Other objects are less
    explosive, but the symptoms caused by such mistakes can be even harder to
    debug.

    Instead of creating specialized debugging code for the timer subsystem a
    generic infrastructure is created which allows developers to verify their code
    and provides an easy to enable debug facility for users in case of trouble.

    The debugobjects core code keeps track of operations on static and dynamic
    objects by inserting them into a hashed list and sanity checking them on
    object operations and provides additional checks whenever kernel memory is
    freed.

    The tracked object operations are:
    - initializing an object
    - adding an object to a subsystem list
    - deleting an object from a subsystem list

    Each operation is sanity checked before it is executed, and the
    subsystem-specific code can provide a fixup function which makes it
    possible to prevent the damage the operation would cause. When a sanity
    check triggers, a warning message and a stack trace are printed.

    The list of operations can be extended if the need arises. For now it's
    limited to the requirements of the first user (timers).

    The core code enqueues the objects into hash buckets. The hash index is
    generated from the address of the object to simplify the lookup for the check
    on kfree/vfree. Each bucket has its own spinlock to avoid contention on a
    global lock.

    The debug code can be compiled in without being active. The runtime overhead
    is minimal and could be optimized by asm alternatives. A kernel command line
    option enables the debugging code.

    Thanks to Ingo Molnar for review, suggestions and cleanup patches.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Cc: Greg KH
    Cc: Randy Dunlap
    Cc: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
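
    From a subsystem's point of view the infrastructure looks roughly like
    this (sketch; the my_obj names and the trivial fixup are hypothetical,
    while the debug_object_* calls and the descriptor type are the real entry
    points):

        #include <linux/debugobjects.h>
        #include <linux/list.h>

        struct my_obj {
                struct list_head entry;
                /* ... */
        };

        static int my_obj_fixup_free(void *addr, enum debug_obj_state state)
        {
                /* Last-resort correction hook; nothing to fix in this sketch. */
                return 0;
        }

        static struct debug_obj_descr my_obj_debug_descr = {
                .name       = "my_obj",
                .fixup_free = my_obj_fixup_free,
        };

        static void my_obj_init(struct my_obj *obj)
        {
                debug_object_init(obj, &my_obj_debug_descr);
                INIT_LIST_HEAD(&obj->entry);
        }

        static void my_obj_add(struct my_obj *obj, struct list_head *head)
        {
                debug_object_activate(obj, &my_obj_debug_descr);
                list_add(&obj->entry, head);
        }

        static void my_obj_del(struct my_obj *obj)
        {
                debug_object_deactivate(obj, &my_obj_debug_descr);
                list_del(&obj->entry);
        }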
     

29 Apr, 2008

3 commits

  • Because of page order checks in __alloc_pages(), hugepage (and similarly
    large order) allocations will not retry unless explicitly marked
    __GFP_REPEAT. However, the current retry logic is nearly an infinite
    loop (or until reclaim does no progress whatsoever). For these costly
    allocations, that seems like overkill and could potentially never
    terminate. Mel observed that allowing current __GFP_REPEAT semantics for
    hugepage allocations essentially killed the system. I believe this is
    because we may continue to reclaim small orders of pages all over, but
    never have enough to satisfy the hugepage allocation request. This is
    clearly only a problem for large order allocations, of which hugepages
    are the most obvious (to me).

    Modify try_to_free_pages() to indicate how many pages were reclaimed.
    Use that information in __alloc_pages() to eventually fail a large
    __GFP_REPEAT allocation when we've reclaimed an order of pages equal to
    or greater than the allocation's order. This relies on lumpy reclaim
    functioning as advertised. Due to fragmentation, lumpy reclaim may not
    be able to free up the order needed in one invocation, so multiple
    iterations may be required. In other words, the more fragmented memory
    is, the more retry attempts __GFP_REPEAT will make (particularly for
    higher order allocations).

    This changes the semantics of __GFP_REPEAT subtly, but *only* for
    allocations > PAGE_ALLOC_COSTLY_ORDER. With this patch, for those size
    allocations, we will try up to some point (at least until 1<<order pages
    have been reclaimed), rather than retrying forever.

    Signed-off-by: Nishanth Aravamudan
    Cc: Andy Whitcroft
    Tested-by: Mel Gorman
    Cc: Dave Hansen
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
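
    The retry decision after this change is roughly (excerpt-style sketch of
    the __alloc_pages() retry path; locals belong to that function):

        pages_reclaimed += did_some_progress;
        do_retry = 0;
        if (!(gfp_mask & __GFP_NORETRY)) {
                if (order <= PAGE_ALLOC_COSTLY_ORDER) {
                        /* Small orders keep the old behaviour and retry. */
                        do_retry = 1;
                } else {
                        /* Costly orders with __GFP_REPEAT give up once at
                         * least 1 << order pages have been reclaimed. */
                        if ((gfp_mask & __GFP_REPEAT) &&
                            pages_reclaimed < (1 << order))
                                do_retry = 1;
                }
                if (gfp_mask & __GFP_NOFAIL)
                        do_retry = 1;
        }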
     
  • The definition and use of __GFP_REPEAT, __GFP_NOFAIL and __GFP_NORETRY in the
    core VM have somewhat differing comments as to their actual semantics.
    Annoyingly, the flags definition has inline and header comments, which might
    be interpreted as not being equivalent. Just add references to the header
    comments in the inline ones so they don't go out of sync in the future. In
    their use in __alloc_pages() clarify that the current implementation treats
    low-order allocations and __GFP_REPEAT allocations as distinct cases.

    To clarify, the flags' semantics are:

    __GFP_NORETRY means try no harder than one run through __alloc_pages

    __GFP_REPEAT means __GFP_NOFAIL

    __GFP_NOFAIL means repeat forever

    order <= PAGE_ALLOC_COSTLY_ORDER means __GFP_NOFAIL

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • usemap must be initialized only when pfn is within zone. If not, it corrupts
    memory.

    And this patch also reduces the number of calls to
    set_pageblock_migratetype() by changing the condition from
    (pfn & (pageblock_nr_pages - 1))
    to
    !(pfn & (pageblock_nr_pages - 1))
    as it should be called once per pageblock.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Shi Weihua
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
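
    The guarded call ends up looking roughly like this (excerpt-style sketch
    from memmap_init_zone() of that era; locals belong to that function):

        /* Only touch the pageblock bitmap for pfns that lie inside the zone,
         * and only once per pageblock (at its first pfn). */
        if ((z->zone_start_pfn <= pfn)
            && (pfn < z->zone_start_pfn + z->spanned_pages)
            && !(pfn & (pageblock_nr_pages - 1)))
                set_pageblock_migratetype(page, MIGRATE_MOVABLE);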
     

28 Apr, 2008

1 commit