13 Dec, 2012

1 commit

  • Currently a zone's present_pages is calculated as below, which is
    inaccurate and may cause trouble for memory hotplug.

    spanned_pages - absent_pages - memmap_pages - dma_reserve.

    While fixing bugs caused by the inaccurate zone->present_pages, we
    found that zone->present_pages has been abused. The field may have
    different meanings in different contexts:

    1) pages existing in a zone.
    2) pages managed by the buddy system.

    For more discussions about the issue, please refer to:
    http://lkml.org/lkml/2012/11/5/866
    https://patchwork.kernel.org/patch/1346751/

    This patchset introduces a new field named "managed_pages" in
    struct zone, which counts "pages managed by the buddy system", and
    reverts zone->present_pages to counting "physical pages existing in
    a zone", which also keeps it consistent with
    pgdat->node_present_pages.

    We will set an initial value for zone->managed_pages in function
    free_area_init_core() and will adjust it later if the initial value is
    inaccurate.

    For DMA/normal zones, the initial value is set to:

    (spanned_pages - absent_pages - memmap_pages - dma_reserve)

    Later zone->managed_pages will be adjusted to the accurate value when the
    bootmem allocator frees all free pages to the buddy system in function
    free_all_bootmem_node() and free_all_bootmem().

    The bootmem allocator doesn't touch highmem pages, so highmem zones'
    managed_pages is set to the accurate value "spanned_pages - absent_pages"
    in function free_area_init_core() and won't be updated anymore.
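
    A minimal standalone sketch of these accounting rules (the real
    code lives in free_area_init_core() and the bootmem teardown paths;
    struct zone here is a reduced stand-in and init_managed_pages() is
    a hypothetical helper):

    struct zone {
            unsigned long spanned_pages;  /* whole range, incl. holes  */
            unsigned long present_pages;  /* physical pages that exist */
            unsigned long managed_pages;  /* pages owned by the buddy  */
    };

    static void init_managed_pages(struct zone *z,
                                   unsigned long absent_pages,
                                   unsigned long memmap_pages,
                                   unsigned long dma_reserve,
                                   int is_highmem)
    {
            z->present_pages = z->spanned_pages - absent_pages;

            if (is_highmem)
                    /* bootmem never touches highmem: final value */
                    z->managed_pages = z->present_pages;
            else
                    /* provisional; corrected when the bootmem
                     * allocator releases its pages to the buddy */
                    z->managed_pages = z->present_pages
                                       - memmap_pages - dma_reserve;
    }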

    This patch also adds a new field "managed_pages" to /proc/zoneinfo
    and sysrq showmem.

    [akpm@linux-foundation.org: small comment tweaks]
    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Maciej Rutecki
    Tested-by: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     

18 Nov, 2012

1 commit

  • The NO_BOOTMEM version of free_all_bootmem_node() does not actually
    free any bootmem at all; it only calls
    register_page_bootmem_info_node() for online nodes instead.

    That is confusing.

    We can kill free_all_bootmem_node() after we kill its two callers
    in x86 and sparc.
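
    Per the description above, the NO_BOOTMEM variant had been reduced
    to bookkeeping, roughly as follows (a sketch, not the verbatim
    kernel source; the declarations exist only so the fragment stands
    alone):

    typedef struct pglist_data pg_data_t;
    void register_page_bootmem_info_node(pg_data_t *pgdat);

    unsigned long free_all_bootmem_node(pg_data_t *pgdat)
    {
            register_page_bootmem_info_node(pgdat);  /* bookkeeping   */
            return 0;                                /* frees nothing */
    }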

    Signed-off-by: Yinghai Lu
    Link: http://lkml.kernel.org/r/1353123563-3103-46-git-send-email-yinghai@kernel.org
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     

17 Nov, 2012

1 commit

  • Revert commit 7f1290f2f2a4 ("mm: fix-up zone present pages")

    That patch tried to fix an issue when calculating
    zone->present_pages, but it caused a regression on 32-bit systems
    with HIGHMEM. With that change, reset_zone_present_pages() resets
    all zone->present_pages to zero, and fixup_zone_present_pages() is
    called to recalculate zone->present_pages when the boot allocator
    frees core memory pages into the buddy allocator. Because highmem
    pages are not freed by the bootmem allocator, all highmem zones'
    present_pages become zero.

    Various options for improving the situation are being discussed but for
    now, let's return to the 3.6 code.

    Cc: Jianguo Wu
    Cc: Jiang Liu
    Cc: Petr Tesarik
    Cc: "Luck, Tony"
    Cc: Mel Gorman
    Cc: Yinghai Lu
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Acked-by: David Rientjes
    Tested-by: Chris Clayton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

09 Oct, 2012

2 commits

  • I think zone->present_pages indicates pages that the buddy system
    can manage; it should be:

    zone->present_pages = spanned pages - absent pages - bootmem pages,

    but is now:
    zone->present_pages = spanned pages - absent pages - memmap pages.

    spanned pages: total size, including holes.
    absent pages: holes.
    bootmem pages: pages used in system boot, managed by bootmem allocator.
    memmap pages: pages used by page structs.

    This may cause zone->present_pages to be less than it should be.
    For example, if numa node 1 has ZONE_NORMAL and ZONE_MOVABLE, its
    memmap and other bootmem will be allocated from ZONE_MOVABLE, so
    ZONE_NORMAL's present_pages should be spanned pages - absent pages.
    But it currently also subtracts memmap pages (free_area_init_core),
    which are actually allocated from ZONE_MOVABLE. When offlining all
    memory of a zone, this makes zone->present_pages go below 0; since
    present_pages is an unsigned long, the result is actually a very
    large integer. That indirectly causes zone->watermark[WMARK_MIN] to
    become a large integer (setup_per_zone_wmarks()), then causes
    totalreserve_pages to become a large integer
    (calculate_totalreserve_pages()), and finally causes memory
    allocation failures when forking processes (__vm_enough_memory()).

    [root@localhost ~]# dmesg
    -bash: fork: Cannot allocate memory
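
    A minimal user-space illustration of the unsigned wraparound
    described above (the variable names are illustrative):

    #include <stdio.h>

    int main(void)
    {
            unsigned long present_pages = 100;  /* pages in the zone */

            present_pages -= 150;  /* offline more than the zone has */

            /* the "negative" result wraps to a huge positive value,
             * which then inflates watermarks and totalreserve_pages */
            printf("present_pages = %lu\n", present_pages);
            /* prints 18446744073709551566 on 64-bit systems */
            return 0;
    }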

    I think the bug described in

    http://marc.info/?l=linux-mm&m=134502182714186&w=2

    is also caused by wrong zone present pages.

    This patch intends to fix up zone->present_pages when memory is
    freed to the buddy system on the x86_64 and IA64 platforms.

    Signed-off-by: Jianguo Wu
    Signed-off-by: Jiang Liu
    Reported-by: Petr Tesarik
    Tested-by: Petr Tesarik
    Cc: "Luck, Tony"
    Cc: Mel Gorman
    Cc: Yinghai Lu
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     
  • Commit 0ee332c14518 ("memblock: Kill early_node_map[]") removed
    early_node_map[]. Clean up the comments to comply with that change.

    Signed-off-by: Wanpeng Li
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Gavin Shan
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

12 Jul, 2012

2 commits

  • memblock_free_reserved_regions() calls memblock_free(), but
    memblock_free() may double the reserved.regions array too, so we
    could end up freeing the old range that backs reserved.regions
    itself.

    Also, tj said there is another bug which could be related to this:

    | I don't think we're saving any noticeable
    | amount by doing this "free - give it to page allocator - reserve
    | again" dancing. We should just allocate regions aligned to page
    | boundaries and free them later when memblock is no longer in use.

    In that case, with DEBUG_PAGEALLOC enabled, we get a panic:

    memblock_free: [0x0000102febc080-0x0000102febf080] memblock_free_reserved_regions+0x37/0x39
    BUG: unable to handle kernel paging request at ffff88102febd948
    IP: [] __next_free_mem_range+0x9b/0x155
    PGD 4826063 PUD cf67a067 PMD cf7fa067 PTE 800000102febd160
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    CPU 0
    Pid: 0, comm: swapper Not tainted 3.5.0-rc2-next-20120614-sasha #447
    RIP: 0010:[] [] __next_free_mem_range+0x9b/0x155

    See the discussion at https://lkml.org/lkml/2012/6/13/469

    So try to allocate with PAGE_SIZE alignment and free it later.
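
    A hedged sketch of the page-alignment arithmetic this implies
    (standalone, with a fixed 4 KiB PAGE_SIZE; page_align() is a
    hypothetical helper, not the memblock implementation):

    #include <stdio.h>

    #define PAGE_SIZE 4096UL

    /* round a byte count up to whole pages so the reserved.regions
     * array can later be handed back to the page allocator cleanly */
    static unsigned long page_align(unsigned long size)
    {
            return (size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
    }

    int main(void)
    {
            printf("%lu\n", page_align(5000));  /* 8192: two pages */
            return 0;
    }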

    Reported-by: Sasha Levin
    Acked-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Yinghai Lu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • After commit f5bf18fa22f8 ("bootmem/sparsemem: remove limit constraint
    in alloc_bootmem_section"), usemap allocations may easily be placed
    outside the optimal section that holds the node descriptor, even if
    there is space available in that section. This results in unnecessary
    hotplug dependencies that need to have the node unplugged before the
    section holding the usemap.

    The reason is that the bootmem allocator doesn't guarantee a linear
    search starting from the passed allocation goal but may start out at a
    much higher address absent an upper limit.

    Fix this by trying the allocation with the limit at the section end,
    then retry without if that fails. This keeps the fix from f5bf18fa22f8
    of not panicking if the allocation does not fit in the section, but
    still makes sure to try to stay within the section at first.
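
    A sketch of that try-bounded-then-retry pattern (the allocator name
    and signature are hypothetical, not the real bootmem API; a limit
    of 0 means "no upper bound"):

    #include <stddef.h>

    void *bootmem_alloc_bounded(size_t size, unsigned long goal,
                                unsigned long limit);

    void *alloc_usemap(size_t size, unsigned long goal,
                       unsigned long section_end)
    {
            /* first try to stay inside the section that holds the
             * node descriptor... */
            void *p = bootmem_alloc_bounded(size, goal, section_end);

            /* ...but rather than panic, retry without the limit */
            if (!p)
                    p = bootmem_alloc_bounded(size, goal, 0);
            return p;
    }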

    Signed-off-by: Yinghai Lu
    Signed-off-by: Johannes Weiner
    Cc: [3.3.x, 3.4.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     

30 May, 2012

3 commits

  • alloc_bootmem_section() derives allocation area constraints from the
    specified sparsemem section. This is a bit specific for a generic memory
    allocator like bootmem, though, so move it over to sparsemem.

    As __alloc_bootmem_node_nopanic() already retries failed allocations with
    relaxed area constraints, the fallback code in sparsemem.c can be removed
    and the code becomes a bit more compact overall.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Johannes Weiner
    Acked-by: Tejun Heo
    Acked-by: David S. Miller
    Cc: Yinghai Lu
    Cc: Gavin Shan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • While the panicking node-specific allocation function tries to satisfy
    node+goal, goal, node, anywhere, the non-panicking function still does
    node+goal, goal, anywhere.

    Make it simpler: define the panicking version in terms of the non-panicking
    one, like the node-agnostic interface, so they always behave the same way
    apart from how to deal with allocation failure.
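
    A sketch of the resulting shape (the names abbreviate the real
    __alloc_bootmem_node*() pair; the declarations only make the
    fragment self-contained):

    #include <stddef.h>

    void *alloc_node_nopanic(int nid, size_t size, unsigned long goal);
    void panic(const char *msg);

    /* the panicking variant is just the non-panicking one plus a
     * failure policy */
    void *alloc_node(int nid, size_t size, unsigned long goal)
    {
            void *ptr = alloc_node_nopanic(nid, size, goal);

            if (!ptr)
                    panic("out of early memory");
            return ptr;
    }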

    Signed-off-by: Johannes Weiner
    Acked-by: Yinghai Lu
    Acked-by: Tejun Heo
    Acked-by: David S. Miller
    Cc: Gavin Shan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • __alloc_bootmem_node and __alloc_bootmem_low_node documentation claims
    the functions panic on allocation failure. Do it.

    Signed-off-by: Johannes Weiner
    Acked-by: Yinghai Lu
    Acked-by: Tejun Heo
    Acked-by: David S. Miller
    Cc: Gavin Shan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

11 May, 2012

1 commit

  • Systems with 8 TBytes of memory or greater can hit a problem where
    only the first 8 TB of memory shows up. This is due to "int i"
    being smaller than "unsigned long start_aligned", causing the high
    bits to be dropped.

    The fix is to change `i' to unsigned long to match start_aligned
    and end_aligned.
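
    A small user-space demonstration of the truncation: 8 TB with
    4 KiB pages corresponds to PFN 2^31, which no longer fits in an
    int.

    #include <stdio.h>

    int main(void)
    {
            unsigned long start_aligned = 1UL << 31;  /* PFN at 8 TB */

            int i = start_aligned;  /* high bits dropped; on typical
                                     * two's-complement systems this
                                     * becomes INT_MIN */

            printf("start_aligned = %lu, i = %d\n", start_aligned, i);
            return 0;
    }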

    Thanks to Jack Steiner for assistance tracking this down.

    Signed-off-by: Russ Anderson
    Cc: Jack Steiner
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: David S. Miller
    Cc: Yinghai Lu
    Cc: Gavin Shan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Russ Anderson
     

26 Apr, 2012

1 commit

  • The comments above __alloc_bootmem_node() claim that the code will
    first try the allocation using 'goal' and if that fails it will
    try again but with the 'goal' requirement dropped.

    Unfortunately, this is not what the code does, so fix it to do so.

    This is important for nobootmem conversions to architectures such
    as sparc where MAX_DMA_ADDRESS is infinity.

    On such architectures all of the allocations done by generic spots,
    such as the sparse-vmemmap implementation, will pass in:

    __pa(MAX_DMA_ADDRESS)

    as the goal, and with the limit given as "-1" this will always fail
    unless we add the appropriate fallback logic here.
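
    A sketch of the fallback the fix adds (early_alloc() is a
    hypothetical helper standing in for the real bootmem internals):

    #include <stddef.h>

    void *early_alloc(size_t size, unsigned long goal,
                      unsigned long limit);

    void *alloc_with_goal_fallback(size_t size, unsigned long goal,
                                   unsigned long limit)
    {
            /* honor the goal if possible... */
            void *p = early_alloc(size, goal, limit);

            /* ...then retry with the goal requirement dropped, which
             * rescues the sparc case where the goal derived from
             * __pa(MAX_DMA_ADDRESS) can never be satisfied */
            if (!p)
                    p = early_alloc(size, 0, limit);
            return p;
    }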

    Signed-off-by: David S. Miller
    Acked-by: Yinghai Lu
    Signed-off-by: Linus Torvalds

    David Miller
     

29 Nov, 2011

1 commit

  • Conflicts & resolutions:

    * arch/x86/xen/setup.c

    dc91c728fd "xen: allow extra memory to be in multiple regions"
    24aa07882b "memblock, x86: Replace memblock_x86_reserve/free..."

    conflicted on xen_add_extra_mem() updates. The resolution is
    trivial, as the latter just wants to replace
    memblock_x86_reserve_range() with memblock_reserve().

    * drivers/pci/intel-iommu.c

    166e9278a3f "x86/ia64: intel-iommu: move to drivers/iommu/"
    5dfe8660a3d "bootmem: Replace work_with_active_regions() with..."

    conflicted as the former moved the file under drivers/iommu/.
    Resolved by applying the changes from the latter to the moved
    file.

    * mm/Kconfig

    6661672053a "memblock: add NO_BOOTMEM config symbol"
    c378ddd53f9 "memblock, x86: Make ARCH_DISCARD_MEMBLOCK a config option"

    conflicted trivially. Both added config options. Just
    letting both add their own options resolves the conflict.

    * mm/memblock.c

    d1f0ece6cdc "mm/memblock.c: small function definition fixes"
    ed7b56a799c "memblock: Remove memblock_memory_can_coalesce()"

    conflicted. The former updates a function removed by the
    latter. Resolution is trivial.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

15 Jul, 2011

4 commits

  • Other than sanity checks and debug messages, the x86-specific
    memblock reserve/free functions are simple wrappers around the
    generic versions - memblock_reserve()/memblock_free().

    This patch adds debug messages with caller identification to the
    generic versions, replaces the x86-specific ones with them, and
    kills them. arch/x86/include/asm/memblock.h and
    arch/x86/mm/memblock.c are empty after this change and are removed.

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/r/1310462166-31469-14-git-send-email-tj@kernel.org
    Cc: Yinghai Lu
    Cc: Benjamin Herrenschmidt
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: H. Peter Anvin

    Tejun Heo
     
  • __get_free_all_memory_range() walks memblock, calculates free memory
    areas and fills in the specified range. It can be easily replaced
    with for_each_free_mem_range().

    Convert free_low_memory_core_early() and
    add_highpages_with_active_regions() to for_each_free_mem_range().
    This leaves __get_free_all_memory_range() without any user. Kill it
    and related functions.
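
    A sketch of the iterator-based form, assuming the 2011-era
    signature of for_each_free_mem_range() (a flags argument was added
    in later kernels); release_range() is a hypothetical consumer
    standing in for the real freeing logic:

    phys_addr_t start, end;
    u64 i;

    for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
            release_range(start, end);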

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/r/1310462166-31469-10-git-send-email-tj@kernel.org
    Cc: Yinghai Lu
    Cc: Benjamin Herrenschmidt
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: H. Peter Anvin

    Tejun Heo
     
  • nobootmem is currently used only by x86, and on x86_32
    free_all_memory_core_early() silently freed only the low memory,
    because get_free_all_memory_range() in arch/x86/mm/memblock.c
    implicitly limited the range to max_low_pfn.

    Rename free_all_memory_core_early() to free_low_memory_core_early()
    and make it call __get_free_all_memory_range() and limit the range to
    max_low_pfn explicitly. This makes things clearer and also is
    consistent with the bootmem behavior.

    This leaves get_free_all_memory_range() without any user. Kill it.

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/r/1310462166-31469-9-git-send-email-tj@kernel.org
    Cc: Yinghai Lu
    Cc: Benjamin Herrenschmidt
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: H. Peter Anvin

    Tejun Heo
     
  • With the previous changes, generic NUMA aware memblock API has feature
    parity with memblock_x86_find_in_range_node(). There currently are
    two users - x86 setup_node_data() and __alloc_memory_core_early() in
    nobootmem.c.

    This patch converts the former to use memblock_alloc_nid() and the
    latter memblock_find_in_range_node(), and kills
    memblock_x86_find_in_range_node() and related functions, including
    find_memory_core_early() in page_alloc.c.

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/r/1310460395-30913-9-git-send-email-tj@kernel.org
    Cc: Yinghai Lu
    Cc: Benjamin Herrenschmidt
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: H. Peter Anvin

    Tejun Heo
     

14 Jul, 2011

1 commit

  • 25818f0f28 (memblock: Make MEMBLOCK_ERROR be 0) thankfully made
    MEMBLOCK_ERROR 0, and there is already code which expects the error
    return to be 0. There's no point in keeping MEMBLOCK_ERROR around.
    End its misery.
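
    With failure reported as 0, callers can test the result directly; a
    sketch (memblock_find_in_range() is the real function, declared
    here only so the fragment stands alone, and the phys_addr_t typedef
    is a simplification):

    typedef unsigned long long phys_addr_t;

    phys_addr_t memblock_find_in_range(phys_addr_t start,
                                       phys_addr_t end,
                                       phys_addr_t size,
                                       phys_addr_t align);

    int reserve_one_page(void)
    {
            phys_addr_t addr =
                    memblock_find_in_range(0, ~0ULL, 4096, 4096);

            if (!addr)  /* 0 now means failure; no sentinel needed */
                    return -1;
            return 0;
    }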

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/r/1310457490-3356-6-git-send-email-tj@kernel.org
    Cc: Yinghai Lu
    Cc: Benjamin Herrenschmidt
    Signed-off-by: H. Peter Anvin

    Tejun Heo
     


24 Mar, 2011

1 commit

  • Consolidate elfcorehdr_addr, setup_elfcorehdr and saved_max_pfn

    The Xen PV drivers in a crashed HVM guest cannot connect to the
    dom0 backend drivers because both frontend and backend drivers are
    still in the connected state. To run the connection reset function
    only in case of a crashdump, the is_kdump_kernel() function needs
    to be available to the PV driver modules.

    Consolidate elfcorehdr_addr, setup_elfcorehdr and saved_max_pfn
    into kernel/crash_dump.c. Also export elfcorehdr_addr to make
    is_kdump_kernel() usable for modules.

    Leave 'elfcorehdr' as early_param(). This changes powerpc from __setup()
    to early_param(). It adds an address range check from x86 also on ia64
    and powerpc.

    [akpm@linux-foundation.org: additional #includes]
    [akpm@linux-foundation.org: remove elfcorehdr_addr export]
    [akpm@linux-foundation.org: fix for Tejun's mm/nobootmem.c changes]
    Signed-off-by: Olaf Hering <olaf@aepfle.de>
    Cc: Russell King <rmk@arm.linux.org.uk>
    Cc: "Luck, Tony" <tony.luck@intel.com>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Paul Mundt <lethal@linux-sh.org>
    Cc: Ingo Molnar <mingo@elte.hu>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Olaf Hering
     

24 Feb, 2011

3 commits

  • Now that bootmem.c and nobootmem.c are separate, there's no reason to
    define __alloc_memory_core_early(), which is used only by nobootmem,
    inside #ifdef in page_alloc.c. Move it to nobootmem.c and make it
    static.

    This patch doesn't introduce any behavior change.

    -tj: Updated commit description.

    Signed-off-by: Yinghai Lu
    Acked-by: Andrew Morton
    Signed-off-by: Tejun Heo

    Yinghai Lu
     
  • Now that bootmem.c and nobootmem.c are separate, it's cleaner to
    define contig_page_data in each file than in page_alloc.c with #ifdef.
    Move it.

    This patch doesn't introduce any behavior change.

    -v2: According to Andrew, fixed the struct layout.
    -tj: Updated commit description.

    Signed-off-by: Yinghai Lu
    Acked-by: Andrew Morton
    Signed-off-by: Tejun Heo

    Yinghai Lu
     
  • mm/bootmem.c contained code paths for both the bootmem and
    no-bootmem configurations. They implement about the same set of
    APIs in different ways, and as a result bootmem.c contains a
    massive amount of #ifdef CONFIG_NO_BOOTMEM.

    Separate out CONFIG_NO_BOOTMEM code into mm/nobootmem.c. As the
    common part is relatively small, duplicate them in nobootmem.c instead
    of creating a common file or ifdef'ing in bootmem.c.

    The following are duplicated:

    * {min|max}_low_pfn, max_pfn, saved_max_pfn
    * free_bootmem_late()
    * ___alloc_bootmem()
    * __alloc_bootmem_low()

    The following are applicable only to nobootmem and are moved
    verbatim:

    * __free_pages_memory()
    * free_all_memory_core_early()

    The following are not applicable to nobootmem and are omitted in
    nobootmem.c:

    * reserve_bootmem_node()
    * reserve_bootmem()

    For the rest, the function bodies are split according to
    CONFIG_NO_BOOTMEM.

    Makefile is updated so that only either bootmem.c or nobootmem.c is
    built according to CONFIG_NO_BOOTMEM.

    This patch doesn't introduce any behavior change.

    -tj: Rewrote commit description.

    Suggested-by: Ingo Molnar
    Signed-off-by: Yinghai Lu
    Acked-by: Andrew Morton
    Signed-off-by: Tejun Heo

    Yinghai Lu