13 Nov, 2013

2 commits

  • The Linux kernel cannot migrate pages used by the kernel. As a result,
    kernel pages cannot be hot-removed. So we cannot allocate hotpluggable
    memory for the kernel.

    ACPI SRAT (System Resource Affinity Table) contains the memory hotplug
    info. But before SRAT is parsed, memblock has already started to allocate
    memory for the kernel. So we need to prevent memblock from doing this.

    In a memory hotplug system, any NUMA node the kernel resides in should
    be treated as non-hotpluggable. And since a node on a modern server
    typically has at least 16GB of memory, memory near the kernel image is
    very likely non-hotpluggable.

    So the basic idea is: allocate memory upward, starting from the end of
    the kernel image. Since not much memory is allocated before SRAT is
    parsed, it is very likely to land in the same node as the kernel image.

    The current memblock can only allocate memory top-down, so this patch
    introduces a new bottom-up allocation mode. Later, when we allocate
    memory in this direction, we will limit the start address to above the
    kernel image.

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Tejun Heo
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • [Problem]

    The current Linux cannot migrate pages used by the kernel because of the
    kernel direct mapping. In Linux kernel space, va = pa + PAGE_OFFSET.
    When the pa is changed, we cannot simply update the pagetable and keep the
    va unmodified. So the kernel pages are not migratable.

    There are also other issues that make kernel pages non-migratable. For
    example, a physical address may be cached somewhere for later use, and
    it is not easy to update all such caches.

    When doing memory hotplug in Linux, we first migrate all the pages in one
    memory device somewhere else, and then remove the device. But if pages
    are used by the kernel, they are not migratable. As a result, memory used
    by the kernel cannot be hot-removed.

    Modifying the kernel direct mapping mechanism is too difficult. It may
    also degrade kernel performance and stability. So we take the following
    approach to memory hotplug.

    [What we are doing]

    In Linux, memory in one numa node is divided into several zones. One of
    the zones is ZONE_MOVABLE, which the kernel won't use.

    In order to implement memory hotplug in Linux, we are going to arrange
    all hotpluggable memory in ZONE_MOVABLE so that the kernel won't use
    this memory. To do this, we need ACPI's help.

    In ACPI, the SRAT (System Resource Affinity Table) contains NUMA info.
    The memory affinity structures in SRAT record every memory range in the
    system, along with flags specifying whether the range is hotpluggable.
    (Please refer to ACPI spec 5.0, section 5.2.16.)

    With the help of SRAT, we have to do the following two things to achieve our
    goal:

    1. When doing memory hot-add, allow users to arrange hotpluggable
    memory as ZONE_MOVABLE.
    (This has been done by the MOVABLE_NODE functionality in Linux.)

    2. When the system is booting, prevent the bootmem allocator from
    allocating hotpluggable memory for the kernel before memory
    initialization finishes.

    Problem 2 is the key problem we are going to solve. But before solving
    it, we need some preparation. Please see below.

    [Preparation]

    The bootloader has to load the kernel image into memory, and this
    memory must be non-hotpluggable; we cannot prevent that. So in a memory
    hotplug system, we can assume any node the kernel resides in is not
    hotpluggable.

    Before SRAT is parsed, we don't know which memory ranges are hotpluggable.
    But memblock has already started to work. In the current kernel,
    memblock allocates the following memory before SRAT is parsed:

    setup_arch()
    |->memblock_x86_fill() /* memblock is ready */
    |......
    |->early_reserve_e820_mpc_new() /* allocate memory under 1MB */
    |->reserve_real_mode() /* allocate memory under 1MB */
    |->init_mem_mapping() /* allocate page tables, about 2MB to map 1GB memory */
    |->dma_contiguous_reserve() /* specified by user, should be low */
    |->setup_log_buf() /* specified by user, several mega bytes */
    |->relocate_initrd() /* could be large, but will be freed after boot, should reorder */
    |->acpi_initrd_override() /* several mega bytes */
    |->reserve_crashkernel() /* could be large, should reorder */
    |......
    |->initmem_init() /* Parse SRAT */

    According to Tejun's advice, before SRAT is parsed, we should try our
    best to allocate memory near the kernel image. Since the whole node the
    kernel resides in won't be hotpluggable, and a node on a modern server
    may have at least 16GB of memory, allocating several megabytes around
    the kernel image won't cross into hotpluggable memory.

    [About this patchset]

    So this patchset is the preparation for problem 2 that we want to
    solve. It does the following:

    1. Make memblock able to allocate memory bottom-up.
       1) Keep all the memblock APIs' prototypes unmodified.
       2) When the direction is bottom-up, keep the start address above
          the end of the kernel image.

    2. Improve init_mem_mapping() to support allocating page tables in the
    bottom-up direction.

    3. Introduce a "movable_node" boot option to enable and disable this
    functionality.

    This patch (of 6):

    Create a new function, __memblock_find_range_top_down(), to factor the
    top-down allocation out of memblock_find_in_range_node(). This is
    preparation for the new bottom-up allocation mode introduced in the
    following patch.

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Acked-by: Tejun Heo
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     

12 Sep, 2013

1 commit

    The current early_pfn_to_nid() on architectures that support memblock
    walks memblock.memory one entry at a time, so lookups take too many
    tries near the end of the array.

    We can use the existing memblock_search() to find the node id for a
    given pfn, which saves some time on bigger systems that have many
    entries in the memblock.memory array.

    Here are the timing differences for several machines. In each case,
    less time was spent in __early_pfn_to_nid() with the patch.

                              3.11-rc5   with patch   difference (%)
                              --------   ----------   --------------
        UV1: 256 nodes  9TB:    411.66       402.47    -9.19 (2.23%)
        UV2: 255 nodes 16TB:   1141.02      1138.12    -2.90 (0.25%)
        UV2:  64 nodes  2TB:    128.15       126.53    -1.62 (1.26%)
        UV2:  32 nodes  2TB:    121.87       121.07    -0.80 (0.66%)

    Time is in seconds.

    Signed-off-by: Yinghai Lu
    Cc: Tejun Heo
    Acked-by: Russ Anderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     

10 Jul, 2013

1 commit


30 Apr, 2013

2 commits

    There is no comment for the nid parameter of memblock_insert_region().
    This patch adds one.

    Signed-off-by: Tang Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
    This came to light when calling the memblock allocator from the arc
    port (for copying the flattened DT). If a "0" alignment is passed, the
    allocator's round_up() call incorrectly rounds the size up to 0:

    round_up(num, alignto) => ((num - 1) | (alignto - 1)) + 1

    While the obvious allocation failure causes the kernel to panic, it is
    better to warn the caller so the code can be fixed.

    Tejun suggested that instead of BUG_ON(!align) - which might be
    ineffective due to pending console init and such - it is better to
    WARN_ON and continue the boot with a reasonable default alignment.

    A caller passing a bad @size need not be handled similarly, as the
    subsequent panic will indicate that anyhow.

    Signed-off-by: Vineet Gupta
    Cc: Yinghai Lu
    Cc: Wanpeng Li
    Cc: Ingo Molnar
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vineet Gupta
     

03 Mar, 2013

1 commit

  • Tim found:

    WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x6f/0x80()
    Hardware name: S2600CP
    sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
    smpboot: Booting Node 1, Processors #1
    Modules linked in:
    Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1
    Call Trace:
    set_cpu_sibling_map+0x279/0x449
    start_secondary+0x11d/0x1e5

    Don Morris reproduced on a HP z620 workstation, and bisected it to
    commit e8d195525809 ("acpi, memory-hotplug: parse SRAT before memblock
    is ready")

    It turns out movable_map has some problems, and it breaks several things

    1. numa_init is called several times, NOT just for srat, so the
       nodes_clear(numa_nodes_parsed)
       memset(&numa_meminfo, 0, sizeof(numa_meminfo))
       calls cannot simply be removed. The sequence to consider is:
       numaq, srat, amd, dummy, and the fallback path has to keep working.

    2. simply splitting acpi_numa_init into early_parse_srat:
       a. early_parse_srat is NOT called for ia64, so ia64 is broken.
       b. the loop
          for (i = 0; i < MAX_LOCAL_APIC; i++)
                  set_apicid_to_node(i, NUMA_NO_NODE)
          is still left in numa_init, so it just clears the result from
          early_parse_srat; it should be moved before that.
       c. it breaks ACPI_TABLE_OVERRIDE, as the acpi table scan is moved
          early, before the override from the INITRD is settled.

    3. the patch TITLE is totally misleading: there is NO x86 in the
       title, yet it changes critical x86 code. That caused the x86 folks
       not to pay attention and find the problem early. Those patches
       really should have been routed via tip/x86/mm.

    4. after that commit, the following ranges cannot use movable ram:
       a. real_mode code... well... funny, could legacy Node0 [0,1M) be
          hot-removed?
       b. initrd... it will be freed after booting, so it could be on
          movable ram...
       c. crashkernel for kdump: looks like we cannot put the kdump
          kernel above 4G anymore.
       d. init_mem_mapping: cannot put page tables high anymore.
       e. initmem_init: vmemmap cannot be on the high local node anymore.
          That is not good.

    If a node is hotpluggable, memory-related ranges like page tables and
    vmemmap could be on that node without problems, and should be on that
    node.

    We have workaround patches that could fix some of the problems, but
    some cannot be fixed.

    So just remove that offending commit and related ones including:

    f7210e6c4ac7 ("mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to
    protect movablecore_map in memblock_overlaps_region().")

    01a178a94e8e ("acpi, memory-hotplug: support getting hotplug info from
    SRAT")

    27168d38fa20 ("acpi, memory-hotplug: extend movablemem_map ranges to
    the end of node")

    e8d195525809 ("acpi, memory-hotplug: parse SRAT before memblock is
    ready")

    fb06bc8e5f42 ("page_alloc: bootmem limit with movablecore_map")

    42f47e27e761 ("page_alloc: make movablemem_map have higher priority")

    6981ec31146c ("page_alloc: introduce zone_movable_limit[] to keep
    movable limit for nodes")

    34b71f1e04fc ("page_alloc: add movable_memmap kernel parameter")

    4d59a75125d5 ("x86: get pg_data_t's memory from other node")

    Later we should have patches that make sure the kernel puts page
    tables and vmemmap on local node ram instead of pushing them down to
    node0. We also need to find a way to put other kernel-used ram on
    local node ram.

    Reported-by: Tim Gardner
    Reported-by: Don Morris
    Bisected-by: Don Morris
    Tested-by: Don Morris
    Signed-off-by: Yinghai Lu
    Cc: Tony Luck
    Cc: Thomas Renninger
    Cc: Tejun Heo
    Cc: Tang Chen
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     

24 Feb, 2013

2 commits

  • The definition of struct movablecore_map is protected by
    CONFIG_HAVE_MEMBLOCK_NODE_MAP but its use in memblock_overlaps_region()
    is not. So add CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect the use of
    movablecore_map in memblock_overlaps_region().

    Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
    Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Tang Chen
     
    Ensure the bootmem allocator will not allocate memory from areas that
    may be ZONE_MOVABLE. The map info comes from the movablecore_map boot
    option.

    Signed-off-by: Tang Chen
    Reviewed-by: Wen Congyang
    Reviewed-by: Lai Jiangshan
    Tested-by: Lin Feng
    Cc: Wu Jianguo
    Cc: Mel Gorman
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     

30 Jan, 2013

1 commit

    Use it to get the memory size under limit_pfn, replacing the local
    version in x86 reserved_initrd.

    -v2: remove an unneeded cast, as pointed out by HPA.

    Signed-off-by: Yinghai Lu
    Link: http://lkml.kernel.org/r/1359058816-7615-29-git-send-email-yinghai@kernel.org
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     

12 Jan, 2013

1 commit

    The memmove span covers from (next+1) to the end of the array, and the
    index of next is (i+1), so the index of (next+1) is (i+2). The size of
    the remaining array elements is therefore (type->cnt - (i + 2)).

    Since the remaining elements of the memblock array are moved forward
    by one element, this bug only adds one extra element to the move, so
    there is no write overflow, only a read overflow: one element may be
    read past the end of the array if the array happens to be full.
    Commonly it doesn't matter at all, but if the array happens to be
    located at the end of a memblock, it may cause an invalid read of a
    physical address that doesn't exist.

    There are two *happens to be* conditions here, so I think the
    probability is quite low; I don't know if anyone has been bitten by
    this bug before.

    Mostly I think it's user-invisible.

    Signed-off-by: Lin Feng
    Acked-by: Tejun Heo
    Reviewed-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lin Feng
     

25 Oct, 2012

1 commit

    We will not map partial pages, so we need to make sure the memblock
    allocator does not hand those bytes out.

    Also, use for_each_mem_pfn_range() to loop over the memory ranges to
    map, to keep them consistent.

    Signed-off-by: Yinghai Lu
    Link: http://lkml.kernel.org/r/CAE9FiQVZirvaBMFYRfXMmWEcHbKSicQEHz4VAwUv0xFCk51ZNw@mail.gmail.com
    Acked-by: Jacob Shin
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     

09 Oct, 2012

2 commits

    The following section mismatch warning is thrown during the build:

    WARNING: vmlinux.o(.text+0x32408f): Section mismatch in reference from the function memblock_type_name() to the variable .meminit.data:memblock
    The function memblock_type_name() references
    the variable __meminitdata memblock.
    This is often because memblock_type_name lacks a __meminitdata
    annotation or the annotation of memblock is wrong.

    This is because memblock_type_name references the memblock variable,
    which has the __meminitdata attribute. Hence the warning (even though
    the function is inline).

    [akpm@linux-foundation.org: remove inline]
    Signed-off-by: Raghavendra D Prabhu
    Cc: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raghavendra D Prabhu
     
    Use the existing interface function to set the NUMA node ID (NID) for
    regions, whether memory or reserved regions.

    Signed-off-by: Wanpeng Li
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Gavin Shan
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

05 Sep, 2012

1 commit


01 Aug, 2012

1 commit


12 Jul, 2012

1 commit

    memblock_free_reserved_regions() calls memblock_free(), but
    memblock_free() may itself double the reserved.regions array, which
    would free the old array range backing reserved.regions while it is
    still being used.

    Also, tj said there is another issue which could be related to this.

    | I don't think we're saving any noticeable
    | amount by doing this "free - give it to page allocator - reserve
    | again" dancing. We should just allocate regions aligned to page
    | boundaries and free them later when memblock is no longer in use.

    In that case, with DEBUG_PAGEALLOC, we get a panic:

    memblock_free: [0x0000102febc080-0x0000102febf080] memblock_free_reserved_regions+0x37/0x39
    BUG: unable to handle kernel paging request at ffff88102febd948
    IP: [] __next_free_mem_range+0x9b/0x155
    PGD 4826063 PUD cf67a067 PMD cf7fa067 PTE 800000102febd160
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    CPU 0
    Pid: 0, comm: swapper Not tainted 3.5.0-rc2-next-20120614-sasha #447
    RIP: 0010:[] [] __next_free_mem_range+0x9b/0x155

    See the discussion at https://lkml.org/lkml/2012/6/13/469

    So try to allocate with PAGE_SIZE alignment and free it later.

    Reported-by: Sasha Levin
    Acked-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     

21 Jun, 2012

2 commits

    __alloc_memory_core_early() asks memblock for a range of memory and
    then tries to reserve it. If the reserved region array lacks space for
    the new range, memblock_double_array() is called to allocate more
    space for the array. If memblock is used to allocate memory for the
    new array, it can end up using a range that overlaps with the range
    originally allocated in __alloc_memory_core_early(), leading to
    possible data corruption.

    With this patch memblock_double_array() now calls memblock_find_in_range()
    with a narrowed candidate range (in cases where the reserved.regions array
    is being doubled) so any memory allocated will not overlap with the
    original range that was being reserved. The range is narrowed by passing
    in the starting address and size of the previously allocated range. Then
    the range above the ending address is searched and if a candidate is not
    found, the range below the starting address is searched.

    Signed-off-by: Greg Pearson
    Signed-off-by: Yinghai Lu
    Acked-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Pearson
     
  • Fix kernel-doc warnings such as

    Warning(../mm/page_cgroup.c:432): No description found for parameter 'id'
    Warning(../mm/page_cgroup.c:432): Excess function parameter 'mem' description in 'swap_cgroup_record'

    Signed-off-by: Wanpeng Li
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

08 Jun, 2012

1 commit

  • At first glance one would think that memblock_is_region_memory()
    and memblock_is_region_reserved() would be implemented in the
    same way. Unfortunately they aren't and the former returns
    whether the region specified is a subset of a memory bank while
    the latter returns whether the region specified intersects with
    reserved memory.

    Document the two functions so that users aren't tempted to
    make the implementation the same between them and to clarify the
    purpose of the functions.

    Signed-off-by: Stephen Boyd
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/1337845521-32755-1-git-send-email-sboyd@codeaurora.org
    Signed-off-by: Ingo Molnar

    Stephen Boyd
     

30 May, 2012

2 commits

    The overall memblock is organized into memory regions and reserved
    regions, initially stored in predetermined arrays of struct
    memblock_region. The arrays may be enlarged when regions are newly
    added but no free space is left, in which case a double-sized array is
    created, either by the slab allocator or by the memblock allocator.
    Unfortunately, we didn't free the old array, which might have been
    allocated through the slab allocator before. That caused a memory
    leak.

    The patch introduces 2 variables to track where (slab or memblock) the
    memory and reserved region arrays come from. The array is deallocated
    by kfree() if it was allocated by the slab allocator, which fixes the
    memory leak.

    Signed-off-by: Gavin Shan
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
    The overall memblock is organized into memory regions and reserved
    regions, initially stored in predetermined arrays of struct
    memblock_region. The arrays may be enlarged when regions are newly
    added but there is not enough space, in which case we create a
    double-sized array to meet the requirement. However, the original
    implementation converted the VA (Virtual Address) of the newly
    allocated array to a PA (Physical Address) and then translated it back
    when the new array was allocated from slab. That's actually
    unnecessary.

    The patch removes the duplicate VA/PA conversion.

    Signed-off-by: Gavin Shan
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     

21 Apr, 2012

1 commit

  • Commit 24aa07882b ("memblock, x86: Replace memblock_x86_reserve/
    free_range() with generic ones") replaced x86 specific memblock
    operations with the generic ones; unfortunately, it lost zero length
    operation handling in the process making the kernel panic if somebody
    tries to reserve zero length area.

    There isn't much to be gained by being cranky to zero length operations
    and panicking is almost the worst response. Drop the BUG_ON() in
    memblock_reserve() and update memblock_add_region/isolate_range() so
    that all zero length operations are handled as noops.

    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org
    Reported-by: Valere Monseur
    Bisected-by: Joseph Freeman
    Tested-by: Joseph Freeman
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=43098
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

01 Mar, 2012

1 commit

    The memblock allocator aligns @size to @align to reduce the amount
    of fragmentation. Commit:

    7bd0b0f0da ("memblock: Reimplement memblock allocation using reverse free area iterator")

    broke it by incorrectly relocating the @size alignment into
    memblock_find_in_range_node(). As the aligned size is not
    propagated back to memblock_alloc_base_nid(), the actually
    reserved size isn't aligned.

    While this increases memory use for the memblock reserved array,
    it shouldn't cause any critical failure; however, it seems
    that the size aligning was hiding a use-beyond-allocation bug in
    sparc64, and losing the aligning causes a boot failure.

    The underlying problem is currently being debugged, but this is a
    proper fix in itself; it's already pretty late in the -rc cycle for
    boot failures, and reverting the change for debugging isn't
    difficult. Restore the size aligning by moving it to
    memblock_alloc_base_nid().

    Reported-by: Meelis Roos
    Signed-off-by: Tejun Heo
    Cc: David S. Miller
    Cc: Grant Likely
    Cc: Rob Herring
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20120228205621.GC3252@dhcp-172-17-108-109.mtv.corp.google.com
    Signed-off-by: Ingo Molnar
    LKML-Reference:

    Tejun Heo
     

16 Jan, 2012

1 commit

    7bd0b0f0da ("memblock: Reimplement memblock allocation using
    reverse free area iterator") implemented a simple top-down
    allocator using a reverse memblock iterator. To avoid underflow
    in the allocator loop, it simply raised the lower boundary to
    the requested size, under the assumption that the requested size
    would be far smaller than the available memblocks.

    This causes early page table allocation failures under certain
    configurations in Xen. Fix it by checking for underflow directly
    instead of bumping up the lower bound.

    Signed-off-by: Tejun Heo
    Reported-by: Konrad Rzeszutek Wilk
    Cc: rjw@sisk.pl
    Cc: xen-devel@lists.xensource.com
    Cc: Benjamin Herrenschmidt
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20120113181412.GA11112@google.com
    Signed-off-by: Ingo Molnar

    Tejun Heo
     

09 Dec, 2011

14 commits

    Now that all early memory information is in memblock when enabled, we
    can implement a reverse free area iterator and use it to implement a
    NUMA-aware allocator, which is then wrapped for the simpler variants,
    instead of the confusing and inefficient mending of information in a
    separate NUMA-aware allocator.

    Implement for_each_free_mem_range_reverse(), use it to reimplement
    memblock_find_in_range_node() which in turn is used by all allocators.

    The visible allocator interface is inconsistent and can probably use
    some cleanup too.

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
    Now all ARCH_POPULATES_NODE_MAP archs select HAVE_MEMBLOCK_NODE_MAP -
    there's no user of early_node_map[] left. Kill early_node_map[] and
    replace ARCH_POPULATES_NODE_MAP with HAVE_MEMBLOCK_NODE_MAP. Also,
    relocate for_each_mem_pfn_range() and helper from mm.h to memblock.h
    as page_alloc.c would no longer host an alternative implementation.

    This change is ultimately one to one mapping and shouldn't cause any
    observable difference; however, after the recent changes, there are
    some functions which now would fit memblock.c better than page_alloc.c
    and dependency on HAVE_MEMBLOCK_NODE_MAP instead of HAVE_MEMBLOCK
    doesn't make much sense on some of them. Further cleanups for
    functions inside HAVE_MEMBLOCK_NODE_MAP in mm.h would be nice.

    -v2: Fix compile bug introduced by mis-spelling
    CONFIG_HAVE_MEMBLOCK_NODE_MAP to CONFIG_MEMBLOCK_HAVE_NODE_MAP in
    mmzone.h. Reported by Stephen Rothwell.

    Signed-off-by: Tejun Heo
    Cc: Stephen Rothwell
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu
    Cc: Tony Luck
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Chen Liqin
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: "H. Peter Anvin"

    Tejun Heo
     
  • Implement memblock_add_node() which can add a new memblock memory
    region with specific node ID.

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • The only function of memblock_analyze() is now allowing resize of
    memblock region arrays. Rename it to memblock_allow_resize() and
    update its users.

    * The following users remain the same other than renaming.

    arm/mm/init.c::arm_memblock_init()
    microblaze/kernel/prom.c::early_init_devtree()
    powerpc/kernel/prom.c::early_init_devtree()
    openrisc/kernel/prom.c::early_init_devtree()
    sh/mm/init.c::paging_init()
    sparc/mm/init_64.c::paging_init()
    unicore32/mm/init.c::uc32_memblock_init()

    * In the following users, analyze was used to update total size which
    is no longer necessary.

    powerpc/kernel/machine_kexec.c::reserve_crashkernel()
    powerpc/kernel/prom.c::early_init_devtree()
    powerpc/mm/init_32.c::MMU_init()
    powerpc/mm/tlb_nohash.c::__early_init_mmu()
    powerpc/platforms/ps3/mm.c::ps3_mm_add_memory()
    powerpc/platforms/embedded6xx/wii.c::wii_memory_fixups()
    sh/kernel/machine_kexec.c::reserve_crashkernel()

    * x86/kernel/e820.c::memblock_x86_fill() was directly setting
    memblock_can_resize before populating memblock and calling analyze
    afterwards. Call memblock_allow_resize() before it starts populating.

    memblock_can_resize is now static inside memblock.c.

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu
    Cc: Russell King
    Cc: Michal Simek
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: Guan Xuetao
    Cc: "H. Peter Anvin"

    Tejun Heo
     
  • Total size of memory regions was calculated by memblock_analyze()
    requiring explicitly calling the function between operations which can
    change memory regions and possible users of total size, which is
    cumbersome and fragile.

    This patch makes each memblock_type track total size automatically
    with minor modifications to memblock manipulation functions and remove
    requirements on calling memblock_analyze(). [__]memblock_dump_all()
    now also dumps the total size of reserved regions.

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
    With recent updates, the basic memblock operations are robust enough
    that there's no reason for memblock_enforce_memory_limit() to directly
    manipulate the memblock region arrays. Reimplement it using
    __memblock_remove().

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
    Allow memblock users to specify a range where @base + @size overflows
    and automatically cap it at the maximum. This makes the interface more
    robust and makes specifying till-the-end-of-memory easier.

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
    __memblock_remove()'s open-coded region manipulation can be trivially
    replaced with memblock_isolate_range(). This increases code sharing
    and eases improving region tracking.

    This pulls memblock_isolate_range() out of HAVE_MEMBLOCK_NODE_MAP.
    Make it use memblock_get_region_node() instead of assuming rgn->nid is
    available.

    -v2: Fixed build failure on !HAVE_MEMBLOCK_NODE_MAP caused by direct
    rgn->nid access.

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
    memblock_set_node() operates in three steps - break regions crossing
    boundaries, set nid, and merge back regions. This patch separates the
    first part into a separate function - memblock_isolate_range() - which
    breaks regions crossing range boundaries and returns the index range
    of the regions properly contained in the specified memory range.

    This doesn't introduce any behavior change and will be used to further
    unify region handling.

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • memblock_init() initializes arrays for regions and memblock itself;
    however, all these can be done with struct initializers and
    memblock_init() can be removed. This patch kills memblock_init() and
    initializes memblock with struct initializer.

    The only difference is that the first dummy entries don't have .nid
    set to MAX_NUMNODES initially. This doesn't cause any behavior
    difference.

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu
    Cc: Russell King
    Cc: Michal Simek
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: Guan Xuetao
    Cc: "H. Peter Anvin"

    Tejun Heo
     
  • memblock no longer depends on having one more entry at the end during
    addition making the sentinel entries at the end of region arrays not
    too useful. Remove the sentinels. This eases further updates.

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • Add __memblock_dump_all() which dumps memblock configuration whether
    memblock_debug is enabled or not.

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • Make memblock_double_array(), __memblock_alloc_base() and
    memblock_alloc_nid() use memblock_reserve() instead of calling
    memblock_add_region() with reserved array directly. This eases
    debugging and updates to memblock_add_region().

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
    memblock_{add|remove|free|reserve}() return either 0 or -errno but had
    long as their return type. Change it to int. Also, drop 'extern' from
    all prototypes in memblock.h - they are unnecessary and used
    inconsistently (especially if mm.h is included in the picture).

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     

29 Nov, 2011

1 commit

  • Conflicts & resolutions:

    * arch/x86/xen/setup.c

    dc91c728fd "xen: allow extra memory to be in multiple regions"
    24aa07882b "memblock, x86: Replace memblock_x86_reserve/free..."

    conflicted on xen_add_extra_mem() updates. The resolution is
    trivial as the latter just want to replace
    memblock_x86_reserve_range() with memblock_reserve().

    * drivers/pci/intel-iommu.c

    166e9278a3f "x86/ia64: intel-iommu: move to drivers/iommu/"
    5dfe8660a3d "bootmem: Replace work_with_active_regions() with..."

    conflicted as the former moved the file under drivers/iommu/.
    Resolved by applying the changes from the latter on the moved
    file.

    * mm/Kconfig

    6661672053a "memblock: add NO_BOOTMEM config symbol"
    c378ddd53f9 "memblock, x86: Make ARCH_DISCARD_MEMBLOCK a config option"

    conflicted trivially. Both added config options. Just
    letting both add their own options resolves the conflict.

    * mm/memblock.c

    d1f0ece6cdc "mm/memblock.c: small function definition fixes"
    ed7b56a799c "memblock: Remove memblock_memory_can_coalesce()"

    conflicted. The former updates a function removed by the
    latter. The resolution is trivial.

    Signed-off-by: Tejun Heo

    Tejun Heo