01 Aug, 2012

4 commits

  • sparse_index_init() uses the index_init_lock spinlock to protect root
    mem_section assignment. The lock is not necessary anymore because the
    function is called only during boot (during paging init which is executed
    only from a single CPU) and from the hotplug code (by add_memory() via
    arch_add_memory()) which uses mem_hotplug_mutex.

    The lock was introduced by 28ae55c9 ("sparsemem extreme: hotplug
    preparation") and sparse_index_init() was used only during boot at that
    time.

    Later, when the hotplug code (and add_memory()) was introduced, there was
    no synchronization, so it was probably possible to online multiple
    sections from the same root concurrently (though I am not 100% sure about
    that). The first
    synchronization has been added by 6ad696d2 ("mm: allow memory hotplug and
    hibernation in the same kernel") which was later replaced by the
    mem_hotplug_mutex - 20d6c96b ("mem-hotplug: introduce
    {un}lock_memory_hotplug()").

    Let's remove the lock, as it is not needed and only makes the code more
    confusing.

    [mhocko@suse.cz: changelog]
    Signed-off-by: Gavin Shan
    Reviewed-by: Michal Hocko
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
    __section_nr() was implemented to retrieve the memory section number
    corresponding to a given section descriptor. It's possible that the
    specified descriptor doesn't exist in the global array, so add a check
    for that and report an error in that case.
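
    A sketch of the checked lookup, approximating the patched mm/sparse.c:

    int __section_nr(struct mem_section *ms)
    {
            unsigned long root_nr;
            struct mem_section *root;

            for (root_nr = 0; root_nr < NR_SECTION_ROOTS; root_nr++) {
                    root = __nr_to_section(root_nr * SECTIONS_PER_ROOT);
                    if (!root)
                            continue;
                    if ((ms >= root) && (ms < (root + SECTIONS_PER_ROOT)))
                            break;
            }

            /* the added check: the descriptor must belong to some root */
            VM_BUG_ON(root_nr == NR_SECTION_ROOTS);

            return (root_nr * SECTIONS_PER_ROOT) + (ms - root);
    }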

    Signed-off-by: Gavin Shan
    Acked-by: David Rientjes
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
    With CONFIG_SPARSEMEM_EXTREME, the two levels of memory section
    descriptors are allocated from slab or bootmem. Let the slab/bootmem
    allocator zero the memory chunk at allocation time; we needn't clear it
    explicitly.
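
    A sketch of the resulting allocation, approximating the patched
    mm/sparse.c (both kzalloc_node() and the bootmem allocator return
    zeroed memory):

    static struct mem_section noinline __init_refok *sparse_index_alloc(int nid)
    {
            struct mem_section *section = NULL;
            unsigned long array_size = SECTIONS_PER_ROOT *
                                       sizeof(struct mem_section);

            if (slab_is_available()) {
                    if (node_state(nid, N_HIGH_MEMORY))
                            section = kzalloc_node(array_size, GFP_KERNEL, nid);
                    else
                            section = kzalloc(array_size, GFP_KERNEL);
            } else {
                    section = alloc_bootmem_node(NODE_DATA(nid), array_size);
            }

            return section;
    }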

    Signed-off-by: Gavin Shan
    Reviewed-by: Michal Hocko
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • On architectures with CONFIG_HUGETLB_PAGE_SIZE_VARIABLE set, such as
    Itanium, pageblock_order is a variable with default value of 0. It's set
    to the right value by set_pageblock_order() in function
    free_area_init_core().

    But pageblock_order may be used by sparse_init() before
    free_area_init_core() is called, along this path:

    sparse_init()
      -> sparse_early_usemaps_alloc_node()
        -> usemap_size()
          -> SECTION_BLOCKFLAGS_BITS
            -> ((1UL << (PFN_SECTION_SHIFT - pageblock_order)) *
                NR_PAGEBLOCK_BITS)

    The uninitialized pageblock_order causes memory waste because
    usemap_size() returns a much bigger value than is really needed.

    For example, on an Itanium platform,
    sparse_init() pageblock_order=0 usemap_size=24576
    free_area_init_core() before pageblock_order=0, usemap_size=24576
    free_area_init_core() after pageblock_order=12, usemap_size=8

    That means 24K of memory is wasted for each section, so fix it by calling
    set_pageblock_order() from sparse_init().

    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Cc: Tony Luck
    Cc: Yinghai Lu
    Cc: KAMEZAWA Hiroyuki
    Cc: Benjamin Herrenschmidt
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Keping Chen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     

12 Jul, 2012

2 commits

    After commit f5bf18fa22f8 ("bootmem/sparsemem: remove limit constraint
    in alloc_bootmem_section"), usemap allocations may easily be placed
    outside the optimal section that holds the node descriptor, even if
    there is space available in that section. This results in unnecessary
    hotplug dependencies: the node has to be unplugged before the section
    holding the usemap can be removed.

    The reason is that the bootmem allocator doesn't guarantee a linear
    search starting from the passed allocation goal but may start out at a
    much higher address absent an upper limit.

    Fix this by trying the allocation with the limit at the section end,
    then retrying without the limit if that fails. This keeps the fix from
    f5bf18fa22f8 of not panicking if the allocation does not fit in the
    section, but still makes sure to try to stay within the section at first.
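
    A sketch of the fixed allocation path, approximating the patched
    mm/sparse.c:

    static void * __init
    sparse_early_usemaps_alloc_pgdat_section(struct pglist_data *pgdat,
                                             unsigned long size)
    {
            unsigned long goal, limit;
            unsigned long *p;
            int nid;

            /* first constrain the allocation to the section that
             * holds the node descriptor... */
            goal = __pa(pgdat) & (PAGE_SECTION_MASK << PAGE_SHIFT);
            limit = goal + (1UL << PA_SECTION_SHIFT);
            nid = early_pfn_to_nid(goal >> PAGE_SHIFT);
    again:
            p = ___alloc_bootmem_node_nopanic(NODE_DATA(nid), size,
                                              SMP_CACHE_BYTES, goal, limit);
            if (!p && limit) {
                    /* ...but don't panic if it doesn't fit there */
                    limit = 0;
                    goto again;
            }
            return p;
    }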

    Signed-off-by: Yinghai Lu
    Signed-off-by: Johannes Weiner
    Cc: [3.3.x, 3.4.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • Commit 238305bb4d41 ("mm: remove sparsemem allocation details from the
    bootmem allocator") introduced a bug in the allocation goal calculation
    that put section usemaps not in the same section as the node
    descriptors, creating unnecessary hotplug dependencies between them:

    node 0 must be removed before remove section 16399
    node 1 must be removed before remove section 16399
    node 2 must be removed before remove section 16399
    node 3 must be removed before remove section 16399
    node 4 must be removed before remove section 16399
    node 5 must be removed before remove section 16399
    node 6 must be removed before remove section 16399

    The reason is that it applies PAGE_SECTION_MASK to the physical address
    of the node descriptor when finding a suitable place to put the usemap,
    when this mask is actually intended to be used with PFNs. Because the
    PFN mask is wider, the target address will point beyond the wanted
    section holding the node descriptor and the node must be offlined before
    the section holding the usemap can go.

    Fix this by extending the mask to address width before use.
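
    In code, the one-line fix amounts to (sketch):

    /* buggy: PAGE_SECTION_MASK is a PFN mask, but __pa(pgdat) is a
     * physical address, so too few low bits get cleared */
    goal = __pa(pgdat) & PAGE_SECTION_MASK;

    /* fixed: widen the PFN mask to address width before applying it */
    goal = __pa(pgdat) & (PAGE_SECTION_MASK << PAGE_SHIFT);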

    Signed-off-by: Yinghai Lu
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     

30 May, 2012

1 commit

  • alloc_bootmem_section() derives allocation area constraints from the
    specified sparsemem section. This is a bit specific for a generic memory
    allocator like bootmem, though, so move it over to sparsemem.

    As __alloc_bootmem_node_nopanic() already retries failed allocations with
    relaxed area constraints, the fallback code in mm/sparse.c can be removed
    and the code becomes a bit more compact overall.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Johannes Weiner
    Acked-by: Tejun Heo
    Acked-by: David S. Miller
    Cc: Yinghai Lu
    Cc: Gavin Shan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

22 Mar, 2012

1 commit

  • While testing AMS (Active Memory Sharing) / CMO (Cooperative Memory
    Overcommit) on powerpc, we tripped the following:

    kernel BUG at mm/bootmem.c:483!
    cpu 0x0: Vector: 700 (Program Check) at [c000000000c03940]
    pc: c000000000a62bd8: .alloc_bootmem_core+0x90/0x39c
    lr: c000000000a64bcc: .sparse_early_usemaps_alloc_node+0x84/0x29c
    sp: c000000000c03bc0
    msr: 8000000000021032
    current = 0xc000000000b0cce0
    paca = 0xc000000001d80000
    pid = 0, comm = swapper
    kernel BUG at mm/bootmem.c:483!
    enter ? for help
    [c000000000c03c80] c000000000a64bcc
    .sparse_early_usemaps_alloc_node+0x84/0x29c
    [c000000000c03d50] c000000000a64f10 .sparse_init+0x12c/0x28c
    [c000000000c03e20] c000000000a474f4 .setup_arch+0x20c/0x294
    [c000000000c03ee0] c000000000a4079c .start_kernel+0xb4/0x460
    [c000000000c03f90] c000000000009670 .start_here_common+0x1c/0x2c

    This is

    BUG_ON(limit && goal + size > limit);

    and after some debugging, it seems that

    goal = 0x7ffff000000
    limit = 0x80000000000

    and sparse_early_usemaps_alloc_node ->
    sparse_early_usemaps_alloc_pgdat_section calls

    return alloc_bootmem_section(usemap_size() * count, section_nr);

    This is on a system with 8TB available via the AMS pool, and as a quirk
    of AMS in firmware, all of that memory shows up in node 0. So, we end
    up with an allocation that will fail the goal/limit constraints: the
    window between goal and limit is only 0x1000000 bytes (16MB, one
    sparsemem section here), while the combined usemap allocation for all
    of node 0's sections is larger than that.

    In theory, we could "fall-back" to alloc_bootmem_node() in
    sparse_early_usemaps_alloc_node(), but since we actually have HOTREMOVE
    defined, we'll BUG_ON() instead. A simple solution appears to be to
    unconditionally remove the limit condition in alloc_bootmem_section,
    meaning allocations are allowed to cross section boundaries (necessary
    for systems of this size).

    Johannes Weiner pointed out that if alloc_bootmem_section() no longer
    guarantees section-locality, we need check_usemap_section_nr() to print
    possible cross-dependencies between node descriptors and the usemaps
    allocated through it. That makes the two loops in
    sparse_early_usemaps_alloc_node() identical, so re-factor the code a
    bit.

    [akpm@linux-foundation.org: code simplification]
    Signed-off-by: Nishanth Aravamudan
    Cc: Dave Hansen
    Cc: Anton Blanchard
    Cc: Paul Mackerras
    Cc: Ben Herrenschmidt
    Cc: Robert Jennings
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Cc: [3.3.1]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

31 Oct, 2011

1 commit


26 Jul, 2011

1 commit

  • These uses are read-only and in a subsequent patch I have a const struct
    page in my hand...

    [akpm@linux-foundation.org: fix warnings in lowmem_page_address()]
    Signed-off-by: Ian Campbell
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Campbell
     

31 Mar, 2011

1 commit


14 Jan, 2011

1 commit

  • PG_buddy can be converted to _mapcount == -2. So the PG_compound_lock can
    be added to page->flags without overflowing (because of the sparse section
    bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y. This also
    has to move the memory hotplug code from _mapcount to lru.next to avoid
    any risk of clashes. We can't use lru.next for PG_buddy removal, but
    memory hotplug can use lru.next even more easily than the mapcount
    instead.
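
    A sketch of the PG_buddy replacement, approximating the patched
    include/linux/mm.h:

    #define PAGE_BUDDY_MAPCOUNT_VALUE (-2)

    static inline int PageBuddy(struct page *page)
    {
            return atomic_read(&page->_mapcount) == PAGE_BUDDY_MAPCOUNT_VALUE;
    }

    static inline void __SetPageBuddy(struct page *page)
    {
            VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
            atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE);
    }

    static inline void __ClearPageBuddy(struct page *page)
    {
            VM_BUG_ON(!PageBuddy(page));
            atomic_set(&page->_mapcount, -1);
    }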

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 May, 2010

1 commit

  • We need to put mem_map high when virtual memmap is not used.

    before this patch
    free mem pfn range on first node:
    [ 0.000000] 19 - 1f
    [ 0.000000] 28 40 - 80 95
    [ 0.000000] 702 740 - 1000 1000
    [ 0.000000] 347c - 347e
    [ 0.000000] 34e7 3500 - 3b80 3b8b
    [ 0.000000] 73b8b 73bc0 - 73c00 73c00
    [ 0.000000] 73ddd - 73e00
    [ 0.000000] 73fdd - 74000
    [ 0.000000] 741dd - 74200
    [ 0.000000] 743dd - 74400
    [ 0.000000] 745dd - 74600
    [ 0.000000] 747dd - 74800
    [ 0.000000] 749dd - 74a00
    [ 0.000000] 74bdd - 74c00
    [ 0.000000] 74ddd - 74e00
    [ 0.000000] 74fdd - 75000
    [ 0.000000] 751dd - 75200
    [ 0.000000] 753dd - 75400
    [ 0.000000] 755dd - 75600
    [ 0.000000] 757dd - 75800
    [ 0.000000] 759dd - 75a00
    [ 0.000000] 79bdd 79c00 - 7d540 7d550
    [ 0.000000] 7f745 - 7f750
    [ 0.000000] 10000b 100040 - 2080000 2080000
    so only 79c00 - 7d540 is the major free block under 4g...

    after this patch, we will get
    [ 0.000000] 19 - 1f
    [ 0.000000] 28 40 - 80 95
    [ 0.000000] 702 740 - 1000 1000
    [ 0.000000] 347c - 347e
    [ 0.000000] 34e7 3500 - 3600 3600
    [ 0.000000] 37dd - 3800
    [ 0.000000] 39dd - 3a00
    [ 0.000000] 3bdd - 3c00
    [ 0.000000] 3ddd - 3e00
    [ 0.000000] 3fdd - 4000
    [ 0.000000] 41dd - 4200
    [ 0.000000] 43dd - 4400
    [ 0.000000] 45dd - 4600
    [ 0.000000] 47dd - 4800
    [ 0.000000] 49dd - 4a00
    [ 0.000000] 4bdd - 4c00
    [ 0.000000] 4ddd - 4e00
    [ 0.000000] 4fdd - 5000
    [ 0.000000] 51dd - 5200
    [ 0.000000] 53dd - 5400
    [ 0.000000] 95dd 9600 - 7d540 7d550
    [ 0.000000] 7f745 - 7f750
    [ 0.000000] 17000b 170040 - 2080000 2080000
    we will have 9600 - 7d540 as the major free block...

    The sparse-vmemmap path already uses __alloc_bootmem_node_high().
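
    A sketch of the non-vmemmap populate path after the change,
    approximating the mm/sparse.c code of that era:

    struct page __init *sparse_mem_map_populate(unsigned long pnum, int nid)
    {
            struct page *map;

            map = alloc_remap(nid, sizeof(struct page) * PAGES_PER_SECTION);
            if (map)
                    return map;

            /* the _high variant may place the allocation high,
             * keeping precious memory below 4g free */
            map = __alloc_bootmem_node_high(NODE_DATA(nid),
                            sizeof(struct page) * PAGES_PER_SECTION,
                            PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
            return map;
    }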

    Signed-off-by: Yinghai Lu
    Cc: Jiri Slaby
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Christoph Lameter
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the
    following script was used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following:

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition, and for others adding it to an
    implementation .h or embedding .c file was more appropriate. This
    step added inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of the
    specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

02 Mar, 2010

1 commit

    Stephen reported that a build (powerpc ppc64_defconfig) produced these
    warnings:

    mm/sparse.c: In function 'sparse_init':
    mm/sparse.c:488: warning: unused variable 'map_count'
    mm/sparse.c:484: warning: unused variable 'size2'
    mm/sparse.c:481: warning: unused variable 'map_map'
    mm/sparse.c: At top level:
    mm/sparse.c:442: warning: 'sparse_early_mem_maps_alloc_node' defined but not used

    Introduced by commit 9bdac914240759457175ac0d6529a37d2820bc4d
    ("sparsemem: Put mem map for one node together").

    Conditionalize the bits appropriately based on the setting of
    CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER.

    Reported-by: Stephen Rothwell
    Tested-by: Stephen Rothwell
    Signed-off-by: Yinghai Lu
    LKML-Reference:
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     

13 Feb, 2010

2 commits

  • Add vmemmap_alloc_block_buf for mem map only.

    It will fall back to the old way if it cannot get a block that big.
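
    A sketch of the helper, approximating the patched mm/sparse-vmemmap.c
    (vmemmap_buf and vmemmap_buf_end bound the preallocated per-node
    buffer):

    void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node)
    {
            void *ptr;

            /* no preallocated buffer: fall back to per-block allocation */
            if (!vmemmap_buf)
                    return vmemmap_alloc_block(size, node);

            /* carve the block out of the large per-node buffer */
            ptr = (void *)ALIGN((unsigned long)vmemmap_buf, size);
            if (ptr + size > vmemmap_buf_end)
                    return vmemmap_alloc_block(size, node);

            vmemmap_buf = ptr + size;

            return ptr;
    }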

    Before this patch, when a node has 128G of RAM installed, the memmap is
    split into two or more parts:
    [ 0.000000] [ffffea0000000000-ffffea003fffffff] PMD -> [ffff880100600000-ffff88013e9fffff] on node 1
    [ 0.000000] [ffffea0040000000-ffffea006fffffff] PMD -> [ffff88013ec00000-ffff88016ebfffff] on node 1
    [ 0.000000] [ffffea0070000000-ffffea007fffffff] PMD -> [ffff882000600000-ffff8820105fffff] on node 0
    [ 0.000000] [ffffea0080000000-ffffea00bfffffff] PMD -> [ffff882010800000-ffff8820507fffff] on node 0
    [ 0.000000] [ffffea00c0000000-ffffea00dfffffff] PMD -> [ffff882050a00000-ffff8820709fffff] on node 0
    [ 0.000000] [ffffea00e0000000-ffffea00ffffffff] PMD -> [ffff884000600000-ffff8840205fffff] on node 2
    [ 0.000000] [ffffea0100000000-ffffea013fffffff] PMD -> [ffff884020800000-ffff8840607fffff] on node 2
    [ 0.000000] [ffffea0140000000-ffffea014fffffff] PMD -> [ffff884060a00000-ffff8840709fffff] on node 2
    [ 0.000000] [ffffea0150000000-ffffea017fffffff] PMD -> [ffff886000600000-ffff8860305fffff] on node 3
    [ 0.000000] [ffffea0180000000-ffffea01bfffffff] PMD -> [ffff886030800000-ffff8860707fffff] on node 3
    [ 0.000000] [ffffea01c0000000-ffffea01ffffffff] PMD -> [ffff888000600000-ffff8880405fffff] on node 4
    [ 0.000000] [ffffea0200000000-ffffea022fffffff] PMD -> [ffff888040800000-ffff8880707fffff] on node 4
    [ 0.000000] [ffffea0230000000-ffffea023fffffff] PMD -> [ffff88a000600000-ffff88a0105fffff] on node 5
    [ 0.000000] [ffffea0240000000-ffffea027fffffff] PMD -> [ffff88a010800000-ffff88a0507fffff] on node 5
    [ 0.000000] [ffffea0280000000-ffffea029fffffff] PMD -> [ffff88a050a00000-ffff88a0709fffff] on node 5
    [ 0.000000] [ffffea02a0000000-ffffea02bfffffff] PMD -> [ffff88c000600000-ffff88c0205fffff] on node 6
    [ 0.000000] [ffffea02c0000000-ffffea02ffffffff] PMD -> [ffff88c020800000-ffff88c0607fffff] on node 6
    [ 0.000000] [ffffea0300000000-ffffea030fffffff] PMD -> [ffff88c060a00000-ffff88c0709fffff] on node 6
    [ 0.000000] [ffffea0310000000-ffffea033fffffff] PMD -> [ffff88e000600000-ffff88e0305fffff] on node 7
    [ 0.000000] [ffffea0340000000-ffffea037fffffff] PMD -> [ffff88e030800000-ffff88e0707fffff] on node 7

    after the patch we will get:
    [ 0.000000] [ffffea0000000000-ffffea006fffffff] PMD -> [ffff880100200000-ffff88016e5fffff] on node 0
    [ 0.000000] [ffffea0070000000-ffffea00dfffffff] PMD -> [ffff882000200000-ffff8820701fffff] on node 1
    [ 0.000000] [ffffea00e0000000-ffffea014fffffff] PMD -> [ffff884000200000-ffff8840701fffff] on node 2
    [ 0.000000] [ffffea0150000000-ffffea01bfffffff] PMD -> [ffff886000200000-ffff8860701fffff] on node 3
    [ 0.000000] [ffffea01c0000000-ffffea022fffffff] PMD -> [ffff888000200000-ffff8880701fffff] on node 4
    [ 0.000000] [ffffea0230000000-ffffea029fffffff] PMD -> [ffff88a000200000-ffff88a0701fffff] on node 5
    [ 0.000000] [ffffea02a0000000-ffffea030fffffff] PMD -> [ffff88c000200000-ffff88c0701fffff] on node 6
    [ 0.000000] [ffffea0310000000-ffffea037fffffff] PMD -> [ffff88e000200000-ffff88e0701fffff] on node 7

    -v2: change buf to vmemmap_buf instead according to Ingo
    also add CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER according to Ingo
    -v3: according to Andrew, use sizeof(name) instead of hard coded 15

    Signed-off-by: Yinghai Lu
    LKML-Reference:
    Cc: Christoph Lameter
    Acked-by: Christoph Lameter
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     
    Putting the usemaps for one node together instead of allocating them one
    by one saves some buffer space. It also helps systems that are going to
    use early_res instead of bootmem: fewer entries in early_res make
    searching faster on systems with more memory.

    Signed-off-by: Yinghai Lu
    LKML-Reference:
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     

22 Sep, 2009

1 commit

    To initialize a hot-added node, some pages must be allocated. At that
    time the node has no memory, so the allocation always fails. In such a
    case, allocate pages from other nodes.

    Signed-off-by: Shaohua Li
    Signed-off-by: Yakui Zhao
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

01 Apr, 2009

1 commit


01 Dec, 2008

1 commit


13 Aug, 2008

1 commit


27 Jul, 2008

1 commit


25 Jul, 2008

2 commits

    With this patch, usemaps are allocated on the section which holds the
    pgdat.

    Because a usemap is very small, the usemaps of many sections are
    allocated on a single page. If a section holds a usemap, it can't be
    removed until the other sections are removed. This dependency is not
    desirable for memory removal.

    The pgdat has a similar property. When a section holds the pgdat area,
    it must be the last section on the node to be removed. So, if section A
    has the pgdat and section B has the usemap for section A, neither
    section can be removed because of the mutual dependency.

    To solve this issue, this patch collects usemaps on the same section as
    the pgdat as much as possible. If no other section depends on it, that
    section can finally be removed.

    Signed-off-by: Yasunori Goto
    Cc: Mel Gorman
    Cc: Andy Whitcroft
    Cc: David Miller
    Cc: Badari Pulavarty
    Cc: Heiko Carstens
    Cc: Hiroyuki KAMEZAWA
    Cc: Tony Breeds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
    There are a number of different views of how much memory is currently
    active: the arch-independent zone-sizing view, the bootmem allocator's
    view, and the memory model's view.

    Architectures register this information at different times, and it is
    not necessarily in sync, particularly with respect to some SPARSEMEM
    limitations.

    This patch introduces mminit_validate_memmodel_limits(), which is able
    to validate and correct PFN ranges with respect to the memory model.
    Only SPARSEMEM currently validates itself.
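
    A simplified sketch of the validation, approximating mm/sparse.c (the
    real function also prints a diagnostic via mminit_dprintk()):

    void __meminit mminit_validate_memmodel_limits(unsigned long *start_pfn,
                                                   unsigned long *end_pfn)
    {
            unsigned long max_sparsemem_pfn =
                    1UL << (MAX_PHYSMEM_BITS - PAGE_SHIFT);

            /* do not allow an architecture to pass in PFN ranges
             * larger than sparsemem can represent */
            if (*start_pfn > max_sparsemem_pfn) {
                    WARN_ON_ONCE(1);
                    *start_pfn = max_sparsemem_pfn;
                    *end_pfn = max_sparsemem_pfn;
            } else if (*end_pfn > max_sparsemem_pfn) {
                    WARN_ON_ONCE(1);
                    *end_pfn = max_sparsemem_pfn;
            }
    }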

    Signed-off-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: Andy Whitcroft
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

30 Apr, 2008

2 commits

  • This:

    commit 86f6dae1377523689bd8468fed2f2dd180fc0560
    Author: Yasunori Goto
    Date: Mon Apr 28 02:13:33 2008 -0700

    memory hotplug: allocate usemap on the section with pgdat

    Usemaps are allocated on the section which has pgdat by this.

    Because usemap size is very small, many other sections usemaps are allocated
    on only one page. If a section has usemap, it can't be removed until removing
    other sections. This dependency is not desirable for memory removing.

    Pgdat has similar feature. When a section has pgdat area, it must be the last
    section for removing on the node. So, if section A has pgdat and section B
    has usemap for section A, Both sections can't be removed due to dependency
    each other.

    To solve this issue, this patch collects usemap on same section with pgdat.
    If other sections doesn't have any dependency, this section will be able to be
    removed finally.

    Signed-off-by: Yasunori Goto
    Cc: Badari Pulavarty
    Cc: Yinghai Lu
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    broke davem's sparc64 bootup. Revert it while we work out what went wrong.

    Cc: Yasunori Goto
    Cc: Badari Pulavarty
    Cc: Yinghai Lu
    Cc: "David S. Miller"
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • __FUNCTION__ is gcc-specific, use __func__

    Signed-off-by: Harvey Harrison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     

28 Apr, 2008

5 commits

    This patch frees memmaps which were allocated by bootmem.

    Freeing the usemap is not necessary: its pages may still be needed by
    other sections.

    If the section being removed is the last section on the node, that
    section is the final user of the usemap page (usemaps are allocated on
    its section by the previous patch). But it shouldn't be freed either,
    because the section must already be logically offline, with all of its
    pages isolated from the page allocator. If the page were freed, the
    page allocator might use memory that will be physically removed soon,
    which would be a disaster. So, this patch keeps it as it is.

    Signed-off-by: Yasunori Goto
    Cc: Badari Pulavarty
    Cc: Yinghai Lu
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
    With this patch, usemaps are allocated on the section which holds the
    pgdat.

    Because a usemap is very small, the usemaps of many sections are
    allocated on a single page. If a section holds a usemap, it can't be
    removed until the other sections are removed. This dependency is not
    desirable for memory removal.

    The pgdat has a similar property. When a section holds the pgdat area,
    it must be the last section on the node to be removed. So, if section A
    has the pgdat and section B has the usemap for section A, neither
    section can be removed because of the mutual dependency.

    To solve this issue, this patch collects usemaps on the same section as
    the pgdat. If no other section depends on it, that section can finally
    be removed.

    Signed-off-by: Yasunori Goto
    Cc: Badari Pulavarty
    Cc: Yinghai Lu
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
    To make the memmap easier to free, this patch aligns it to page size.
    The bootmem allocator may otherwise mix several objects in one page,
    which gets in the way of freeing the memmap on memory hot-remove.

    Signed-off-by: Yasunori Goto
    Cc: Badari Pulavarty
    Cc: Yinghai Lu
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
    This patch set frees pages which were allocated by bootmem for
    memory hot-remove. Some memory management structures are allocated by
    bootmem, e.g. the memmap.

    To remove memory physically, some of them must be freed according to
    circumstance. This patch set builds the basis for freeing those pages,
    and frees the memmaps.

    The basic idea is to use the remaining members of struct page to
    remember information about the bootmem user of each page (a section
    number or node id). When a section is being removed, the kernel can
    check this information. It helps solve the following issues:

    1) When the memmap of a section being removed was allocated on
    another section by bootmem, it should/can be freed.
    2) When the memmap of a section being removed was allocated on the
    same section, it shouldn't be freed, because the section must already
    be logically offline with all of its pages isolated from the page
    allocator. If it were freed, the page allocator might use memory
    that will be physically removed soon.
    3) When a section being removed holds another section's memmap, the
    kernel will be able to easily show the user which section should be
    removed first. (Not implemented yet.)
    4) In case 2) above, page isolation will be able to check and skip the
    memmap's pages during logical memory offline (offline_pages()).
    The current page isolation code fails in this case because such a
    page is just a reserved page and it can't tell whether the page may
    be removed or not. This patch makes that possible.
    (Not implemented yet.)
    5) Node information such as the pgdat has similar issues. They
    will become solvable by this as well.
    (Not implemented yet; node ids are remembered in the pages.)

    Fortunately, the current bootmem allocator just keeps the PageReserved
    flag and doesn't use any other members of struct page. The users of
    bootmem don't use them either.

    This patch:

    Register which node or section id owns each page allocated by bootmem,
    so the kernel can distinguish which node/section uses those pages.
    This is the basis for hot-removing sections or nodes.
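
    A sketch of the registration helper, approximating the
    mm/memory_hotplug.c code this series added:

    /* remember the owner of a bootmem page: stash a type in lru.next
     * and an id (section nr or node id) in page->private, and pin the
     * page with an extra reference */
    static void get_page_bootmem(unsigned long info, struct page *page,
                                 int type)
    {
            page->lru.next = (struct list_head *)type;
            SetPagePrivate(page);
            set_page_private(page, info);
            atomic_inc(&page->_count);
    }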

    Signed-off-by: Yasunori Goto
    Cc: Badari Pulavarty
    Cc: Yinghai Lu
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
    Add a generic helper function to remove section mappings and sysfs
    entries for the section of memory we are removing. offline_pages() has
    already correctly adjusted the zone and marked the pages reserved.

    TODO: Yasunori Goto is working on patches to free up allocations from bootmem.

    Signed-off-by: Badari Pulavarty
    Acked-by: Yasunori Goto
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     

27 Apr, 2008

2 commits

    On big systems with lots of memory, don't print out too much during
    bootup, and make it easy to see whether the ranges are contiguous.

    On a 256G, 8-socket system we will get
    [ffffe20000000000-ffffe20002bfffff] PMD -> [ffff810001400000-ffff810003ffffff] on node 0
    [ffffe2001c700000-ffffe2001c7fffff] potential offnode page_structs
    [ffffe20002c00000-ffffe2001c7fffff] PMD -> [ffff81000c000000-ffff8100255fffff] on node 0
    [ffffe20038700000-ffffe200387fffff] potential offnode page_structs
    [ffffe2001c800000-ffffe200387fffff] PMD -> [ffff810820200000-ffff81083c1fffff] on node 1
    [ffffe20040000000-ffffe2007fffffff] PUD ->ffff811027a00000 on node 2
    [ffffe20038800000-ffffe2003fffffff] PMD -> [ffff811020200000-ffff8110279fffff] on node 2
    [ffffe20054700000-ffffe200547fffff] potential offnode page_structs
    [ffffe20040000000-ffffe200547fffff] PMD -> [ffff811027c00000-ffff81103c3fffff] on node 2
    [ffffe20070700000-ffffe200707fffff] potential offnode page_structs
    [ffffe20054800000-ffffe200707fffff] PMD -> [ffff811820200000-ffff81183c1fffff] on node 3
    [ffffe20080000000-ffffe200bfffffff] PUD ->ffff81202fa00000 on node 4
    [ffffe20070800000-ffffe2007fffffff] PMD -> [ffff812020200000-ffff81202f9fffff] on node 4
    [ffffe2008c700000-ffffe2008c7fffff] potential offnode page_structs
    [ffffe20080000000-ffffe2008c7fffff] PMD -> [ffff81202fc00000-ffff81203c3fffff] on node 4
    [ffffe200a8700000-ffffe200a87fffff] potential offnode page_structs
    [ffffe2008c800000-ffffe200a87fffff] PMD -> [ffff812820200000-ffff81283c1fffff] on node 5
    [ffffe200c0000000-ffffe200ffffffff] PUD ->ffff813037a00000 on node 6
    [ffffe200a8800000-ffffe200bfffffff] PMD -> [ffff813020200000-ffff8130379fffff] on node 6
    [ffffe200c4700000-ffffe200c47fffff] potential offnode page_structs
    [ffffe200c0000000-ffffe200c47fffff] PMD -> [ffff813037c00000-ffff81303c3fffff] on node 6
    [ffffe200c4800000-ffffe200e07fffff] PMD -> [ffff813820200000-ffff81383c1fffff] on node 7

    instead of a very long print out...

    Signed-off-by: Yinghai Lu
    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner

    Yinghai Lu
     
  • vmemmap allocation currently has this layout:

    [ffffe20000000000-ffffe200001fffff] PMD ->ffff810001400000 on node 0
    [ffffe20000200000-ffffe200003fffff] PMD ->ffff810001800000 on node 0
    [ffffe20000400000-ffffe200005fffff] PMD ->ffff810001c00000 on node 0
    [ffffe20000600000-ffffe200007fffff] PMD ->ffff810002000000 on node 0
    [ffffe20000800000-ffffe200009fffff] PMD ->ffff810002400000 on node 0
    ...

    note that there is a 2M hole between them - not optimal.

    the root cause is that the usemap (24 bytes) is allocated after every
    2M mem_map, which pushes the next vmemmap (2M) to the next 2M
    alignment.

    solution: try to allocate the mem_map continuously.

    after the patch, we get:

    [ffffe20000000000-ffffe200001fffff] PMD ->ffff810001400000 on node 0
    [ffffe20000200000-ffffe200003fffff] PMD ->ffff810001600000 on node 0
    [ffffe20000400000-ffffe200005fffff] PMD ->ffff810001800000 on node 0
    [ffffe20000600000-ffffe200007fffff] PMD ->ffff810001a00000 on node 0
    [ffffe20000800000-ffffe200009fffff] PMD ->ffff810001c00000 on node 0
    ...

    which is the ideal layout.

    and usemaps will share a page because they are allocated continuously too:

    sparse_early_usemap_alloc: usemap = ffff810024e00000 size = 24
    sparse_early_usemap_alloc: usemap = ffff810024e00080 size = 24
    sparse_early_usemap_alloc: usemap = ffff810024e00100 size = 24
    sparse_early_usemap_alloc: usemap = ffff810024e00180 size = 24
    ...

    so we make the bootmem allocation more compact and use less memory
    for usemap => mission accomplished ;-)

    Signed-off-by: Yinghai Lu
    Signed-off-by: Ingo Molnar

    Yinghai Lu
     

16 Apr, 2008

1 commit

  • Fix memory corruption and crash on 32-bit x86 systems.

    If a !PAE x86 kernel is booted on a 32-bit system with more than 4GB of
    RAM, then we call memory_present() with a start/end that goes outside
    the scope of MAX_PHYSMEM_BITS.

    That causes this loop to happily walk over the limit of the sparse
    memory section map:

    for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
            unsigned long section = pfn_to_section_nr(pfn);
            struct mem_section *ms;

            sparse_index_init(section, nid);
            set_section_nid(section, nid);

            ms = __nr_to_section(section);
            if (!ms->section_mem_map)
                    ms->section_mem_map = sparse_encode_early_nid(nid) |
                                            SECTION_MARKED_PRESENT;
    }

    'ms' will be out of bounds and we'll corrupt a small amount of memory by
    encoding the node ID and writing SECTION_MARKED_PRESENT (==0x1) over it.

    The corruption might happen when encoding a non-zero node ID, or due to
    the SECTION_MARKED_PRESENT which is 0x1:

    mmzone.h:#define SECTION_MARKED_PRESENT (1UL<<0)

    Tested-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Mel Gorman
    Cc: Nick Piggin
    Cc: Andrew Morton
    Cc: Rafael J. Wysocki
    Cc: Yinghai Lu
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

06 Feb, 2008

2 commits

    Fix the following warning:

    WARNING: mm/built-in.o(.text+0x22069): Section mismatch in reference from the function sparse_early_usemap_alloc() to the function .init.text:__alloc_bootmem_node()

    The static function sparse_early_usemap_alloc() is used only by
    sparse_init(), and with sparse_init() annotated __init it is safe to
    annotate sparse_early_usemap_alloc() with __init too.
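
    The change, in essence (sketch):

    /* before: regular .text, referencing __init __alloc_bootmem_node() */
    static unsigned long *sparse_early_usemap_alloc(unsigned long pnum);

    /* after: safe, because the only caller, sparse_init(), is __init too */
    static unsigned long * __init sparse_early_usemap_alloc(unsigned long pnum);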

    Signed-off-by: Sam Ravnborg
    Cc: Andy Whitcroft
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sam Ravnborg
     
  • Checking if an address is a vmalloc address is done in a couple of places.
    Define a common version in mm.h and replace the other checks.
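
    The common version, approximating the patched include/linux/mm.h:

    static inline int is_vmalloc_addr(const void *x)
    {
    #ifdef CONFIG_MMU
            unsigned long addr = (unsigned long)x;

            return addr >= VMALLOC_START && addr < VMALLOC_END;
    #else
            return 0;
    #endif
    }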

    Again the include structures suck. The definition of VMALLOC_START and
    VMALLOC_END is not available in vmalloc.h since highmem.c cannot be included
    there.

    Signed-off-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

18 Dec, 2007

2 commits

  • Improve the error handling for mm/sparse.c::sparse_add_one_section(). And I
    see no reason to check 'usemap' until holding the 'pgdat_resize_lock'.

    [geoffrey.levand@am.sony.com: sparse_index_init() returns -EEXIST]
    Cc: Christoph Lameter
    Acked-by: Dave Hansen
    Cc: Rik van Riel
    Acked-by: Yasunori Goto
    Cc: Andy Whitcroft
    Signed-off-by: WANG Cong
    Signed-off-by: Geoff Levand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WANG Cong
     
  • Since sparse_index_alloc() can return NULL on memory allocation failure,
    we must deal with the failure condition when calling it.

    Signed-off-by: WANG Cong
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WANG Cong
     

30 Oct, 2007

1 commit

  • This reverts commit 2e1c49db4c640b35df13889b86b9d62215ade4b6.

    First off, testing in Fedora has shown it to cause boot failures,
    bisected down by Martin Ebourne, and reported by Dave Jobes. So the
    commit will likely be reverted in the 2.6.23 stable kernels.

    Secondly, in the 2.6.24 model, x86-64 has now grown support for
    SPARSEMEM_VMEMMAP, which disables the relevant code anyway, so while the
    bug is not visible any more, it's become invisible due to the code just
    being irrelevant and no longer enabled on the only architecture that
    this ever affected.

    Reported-by: Dave Jones
    Tested-by: Martin Ebourne
    Cc: Zou Nan hai
    Cc: Suresh Siddha
    Cc: Andrew Morton
    Acked-by: Andy Whitcroft
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

17 Oct, 2007

1 commit

    This patch avoids a panic when memory hot-add is executed with
    sparsemem-vmemmap. The current vmemmap-sparsemem code doesn't support
    memory hot-add; the vmemmap must be populated at hot-add time. This is
    for 2.6.23-rc2-mm2.

    Todo: # Even if this patch is applied, the message "[xxxx-xxxx]
    potential offnode page_structs" is still displayed. To allocate the
    memmap on its own node, the memmap (and pgdat) must already be
    initialized itself - a chicken-and-egg relationship.

    # vmemmap_unpopulate will be necessary for the following:
    - cancelling a hot-add due to an error.
    - unplug.
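
    A sketch of the populate path for a hot-added section, approximating
    the mm/sparse-vmemmap.c code of that era (the patch also relaxes the
    __init annotations to __meminit so this can run after boot):

    struct page * __meminit sparse_mem_map_populate(unsigned long pnum, int nid)
    {
            struct page *map = pfn_to_page(pnum * PAGES_PER_SECTION);
            int error = vmemmap_populate(map, PAGES_PER_SECTION, nid);

            if (error)
                    return NULL;

            return map;
    }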

    Signed-off-by: Yasunori Goto
    Cc: Andy Whitcroft
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto