01 Oct, 2013

1 commit

  • This reverts commit cea27eb2a202 ("mm/memory-hotplug: fix lowmem count
    overflow when offline pages").

    The bug fixed by commit cea27eb was later fixed in a different way by
    commit 3dcc0571cd64 ("mm: correctly update zone->managed_pages"), which
    enhances memory_hotplug.c to adjust totalhigh_pages when hot-removing
    memory; for details please refer to:

    http://marc.info/?l=linux-mm&m=136957578620221&w=2

    As a result, commit cea27eb2a202 currently causes duplicated decreasing
    of totalhigh_pages, thus the revert.

    Signed-off-by: Joonyoung Shim
    Reviewed-by: Wanpeng Li
    Cc: Jiang Liu
    Cc: KOSAKI Motohiro
    Cc: Bartlomiej Zolnierkiewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonyoung Shim
     

12 Sep, 2013

16 commits

  • Set _mapcount to PAGE_BUDDY_MAPCOUNT_VALUE to mark the page as a buddy,
    instead of using the magic number -2.
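
    A minimal userspace sketch of the idea (the struct, helpers and values
    below are simplified stand-ins, not the kernel implementation):

    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define PAGE_BUDDY_MAPCOUNT_VALUE (-2)  /* named sentinel instead of a bare -2 */

    struct page {                           /* toy stand-in for struct page */
        int _mapcount;
        unsigned int order;
    };

    static void set_page_buddy(struct page *page, unsigned int order)
    {
        page->_mapcount = PAGE_BUDDY_MAPCOUNT_VALUE;
        page->order = order;
    }

    static bool page_is_buddy(const struct page *page)
    {
        return page->_mapcount == PAGE_BUDDY_MAPCOUNT_VALUE;
    }

    int main(void)
    {
        struct page p = { ._mapcount = -1 };    /* -1 means "no mappings" */

        set_page_buddy(&p, 3);
        assert(page_is_buddy(&p));
        printf("page is a buddy at order %u\n", p.order);
        return 0;
    }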

    Signed-off-by: Wang Sheng-Hui
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
  • This patch is based on KOSAKI's work, to which I have added a little more
    description; please refer to https://lkml.org/lkml/2012/6/14/74.

    I found that the system can enter a state in which a zone has lots of
    free pages but only order-0 and order-1 ones, which means the zone is
    heavily fragmented. A high-order allocation can then cause a long stall
    (e.g. 60 seconds) in the direct reclaim path, especially in an
    environment with no swap and no compaction. This problem happened on
    v3.4, but the issue still seems to exist in the current tree; the reason
    is that do_try_to_free_pages() enters a live lock:

    kswapd will go to sleep if the zones have been fully scanned and are still
    not balanced, because kswapd thinks there is little point in trying all
    over again and wants to avoid an infinite loop. Instead it changes the
    order from high-order to order-0, because kswapd considers order-0 the
    most important; see commit 73ce02e9 for details. If the watermarks are
    OK, kswapd goes back to sleep and may leave zone->all_unreclaimable = 0.
    It assumes high-order users can still perform direct reclaim if they wish.

    Direct reclaim continues to reclaim for a high order which is not a
    COSTLY_ORDER, without invoking the oom-killer, until kswapd turns on
    zone->all_unreclaimable. This is done to avoid a too-early oom-kill.
    So direct reclaim depends on kswapd to break this loop.

    In the worst case, direct reclaim may continue page reclaim forever while
    kswapd sleeps forever, until something like a watchdog detects it and
    finally kills the process. As described in:
    http://thread.gmane.org/gmane.linux.kernel.mm/103737

    We can't turn on zone->all_unreclaimable from the direct reclaim path,
    because the direct reclaim path doesn't take any lock and doing so would
    be racy. Thus this patch removes the zone->all_unreclaimable field
    completely and recalculates the zone's reclaimable state every time.

    Note: we can't take the approach of having direct reclaim look at
    zone->pages_scanned directly while kswapd continues to use
    zone->all_unreclaimable, because that is racy; commit 929bea7c71
    ("vmscan: all_unreclaimable() use zone->all_unreclaimable as a name")
    describes the details.
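
    A simplified model of the recalculated check (toy types and fields; the
    factor of 6 mirrors the heuristic described for this change, but treat
    the whole snippet as an illustration rather than the kernel code):

    #include <stdbool.h>
    #include <stdio.h>

    /* toy stand-in for struct zone with just the fields the check needs */
    struct zone {
        unsigned long pages_scanned;
        unsigned long nr_inactive;
        unsigned long nr_active;
    };

    static unsigned long zone_reclaimable_pages(const struct zone *z)
    {
        return z->nr_inactive + z->nr_active;
    }

    /*
     * Instead of a sticky zone->all_unreclaimable flag that only kswapd may
     * set, derive the state on demand: the zone is considered unreclaimable
     * once it has been scanned several times over without progress.
     */
    static bool zone_reclaimable(const struct zone *z)
    {
        return z->pages_scanned < zone_reclaimable_pages(z) * 6;
    }

    int main(void)
    {
        struct zone z = { .pages_scanned = 7000, .nr_inactive = 500, .nr_active = 500 };

        printf("reclaimable: %s\n", zone_reclaimable(&z) ? "yes" : "no");
        return 0;
    }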

    [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
    Cc: Aaditya Kumar
    Cc: Ying Han
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Bob Liu
    Cc: Neil Zhang
    Cc: Russell King - ARM Linux
    Reviewed-by: Michal Hocko
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Lisa Du
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lisa Du
     
  • cpuset_zone_allowed was changed to cpuset_zone_allowed_softwall and the
    comment was moved to __cpuset_node_allowed_softwall, so fix this comment.

    Signed-off-by: SeungHun Lee
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    SeungHun Lee
     
  • Currently, early_pfn_to_nid() on architectures that support memblock
    walks memblock.memory one entry at a time, so it takes too many tries
    near the end.

    We can use the existing memblock_search() to find the node id for a given
    pfn, which saves some time on bigger systems that have many entries in
    the memblock.memory array (a simplified sketch follows the timings below).

    Here are the timing differences for several machines. In each case with
    the patch less time was spent in __early_pfn_to_nid().

                                3.11-rc5   with patch   difference (%)
                                --------   ----------   --------------
    UV1: 256 nodes  9TB:          411.66       402.47    -9.19 (2.23%)
    UV2: 255 nodes 16TB:         1141.02      1138.12    -2.90 (0.25%)
    UV2:  64 nodes  2TB:          128.15       126.53    -1.62 (1.26%)
    UV2:  32 nodes  2TB:          121.87       121.07    -0.80 (0.66%)
    Time in seconds.
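
    A simplified sketch of the lookup: binary-search a sorted array of memory
    ranges instead of walking them one by one (the array and types are toy
    stand-ins for memblock.memory, not the real memblock_search()):

    #include <stdio.h>

    struct mem_range {                       /* toy memblock.memory entry */
        unsigned long start_pfn, end_pfn;    /* [start_pfn, end_pfn) */
        int nid;
    };

    /* Binary search over ranges sorted by start_pfn: O(log n) per lookup. */
    static int pfn_to_nid(const struct mem_range *r, int n, unsigned long pfn)
    {
        int lo = 0, hi = n - 1;

        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;

            if (pfn < r[mid].start_pfn)
                hi = mid - 1;
            else if (pfn >= r[mid].end_pfn)
                lo = mid + 1;
            else
                return r[mid].nid;
        }
        return -1;                           /* pfn not covered by any range */
    }

    int main(void)
    {
        struct mem_range map[] = {
            { 0x0000, 0x8000, 0 },
            { 0x8000, 0x10000, 1 },
        };

        printf("nid for pfn 0x9000: %d\n", pfn_to_nid(map, 2, 0x9000));
        return 0;
    }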

    Signed-off-by: Yinghai Lu
    Cc: Tejun Heo
    Acked-by: Russ Anderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • Until now we could not offline memory blocks which contain hugepages,
    because a hugepage was considered an unmovable page. With this patch
    series a hugepage becomes movable, so by using hugepage migration we can
    offline such memory blocks.

    What's different from other users of hugepage migration is that we need
    to decompose all the hugepages inside the target memory block into free
    buddy pages after hugepage migration, because otherwise free hugepages
    remaining in the memory block interfere with memory offlining. For this
    reason we introduce the new functions dissolve_free_huge_page() and
    dissolve_free_huge_pages().

    Other than that, what this patch does is straightforward: it adds the
    hugepage migration code, that is, hugepage handling in the functions
    which scan over pfns and collect the pages to be migrated, and a hugepage
    allocation function in alloc_migrate_target().

    As for larger hugepages (1GB on x86_64), hot-removing them is not easy
    because they are larger than a memory block, so for now we simply let
    such attempts fail.

    [yongjun_wei@trendmicro.com.cn: remove duplicated include]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Cc: Hillf Danton
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Wei Yongjun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Use "zone_is_empty()" instead of "if (zone->spanned_pages)".
    Simplify the code, no functional change.

    Signed-off-by: Xishi Qiu
    Cc: Cody P Schafer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • The main idea behind this patchset is to reduce the vmstat update overhead
    by avoiding interrupt enable/disable and the use of per cpu atomics.

    This patch (of 3):

    It is better to have a separate folding function because
    refresh_cpu_vm_stats() also does other things like expire pages in the
    page allocator caches.

    If we have a separate function then refresh_cpu_vm_stats() is only called
    from the local cpu which allows additional optimizations.

    The folding function is only called when a cpu is being downed, so no
    other processor will be accessing its counters; this also simplifies
    synchronization.
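
    A toy model of the split (plain arrays stand in for the per-cpu counters,
    and the function name is illustrative, not the kernel's):

    #include <stdio.h>

    #define NR_CPUS  4
    #define NR_ITEMS 2

    static long cpu_vm_stats[NR_CPUS][NR_ITEMS];    /* per-cpu diffs (toy) */
    static long global_vm_stats[NR_ITEMS];          /* global counters (toy) */

    /*
     * Fold the counters of a CPU that is going offline into the global
     * counters.  Because the CPU is down, nothing else updates its diffs,
     * so no atomics or interrupt disabling are needed on this path.
     */
    static void fold_offline_cpu_vm_stats(int cpu)
    {
        for (int i = 0; i < NR_ITEMS; i++) {
            global_vm_stats[i] += cpu_vm_stats[cpu][i];
            cpu_vm_stats[cpu][i] = 0;
        }
    }

    int main(void)
    {
        cpu_vm_stats[1][0] = 5;
        cpu_vm_stats[1][1] = -2;
        fold_offline_cpu_vm_stats(1);               /* CPU-hotplug (down) path */
        printf("global: %ld %ld\n", global_vm_stats[0], global_vm_stats[1]);
        return 0;
    }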

    [akpm@linux-foundation.org: fix UP build]
    Signed-off-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    CC: Tejun Heo
    Cc: Joonsoo Kim
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We rarely allocate a page with ALLOC_NO_WATERMARKS, and it is used in the
    slow path. To help the compiler optimize, add the unlikely() macro to the
    ALLOC_NO_WATERMARKS check.

    This patch has no effect right now, because gcc already optimizes this
    properly. But we cannot assume that gcc always gets it right, and nobody
    re-evaluates whether gcc still optimizes properly after their changes;
    for example, this is not optimized properly on v3.10. So adding a
    compiler hint here is reasonable.
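
    For reference, a sketch of what such a hint looks like; unlikely() in the
    kernel is built on GCC's __builtin_expect, but the flag value and the
    function body below are toys:

    #include <stdio.h>

    #define unlikely(x) __builtin_expect(!!(x), 0)  /* same idea as the kernel macro */

    #define ALLOC_NO_WATERMARKS 0x04                /* toy flag value */

    static int allocate_page(unsigned int alloc_flags)
    {
        if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS)) {
            /* rare, high-priority case: ignore the watermarks */
            return 1;
        }
        /* common case: respect the watermarks */
        return 0;
    }

    int main(void)
    {
        printf("%d %d\n", allocate_page(0), allocate_page(ALLOC_NO_WATERMARKS));
        return 0;
    }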

    Signed-off-by: Joonsoo Kim
    Acked-by: Johannes Weiner
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Each zone that holds userspace pages of one workload must be aged at a
    speed proportional to the zone size. Otherwise, the time an individual
    page gets to stay in memory depends on the zone it happened to be
    allocated in. Asymmetry in the zone aging creates rather unpredictable
    aging behavior and results in the wrong pages being reclaimed, activated
    etc.

    But exactly this happens right now because of the way the page allocator
    and kswapd interact. The page allocator uses per-node lists of all zones
    in the system, ordered by preference, when allocating a new page. When
    the first iteration does not yield any results, kswapd is woken up and the
    allocator retries. Due to the way kswapd reclaims zones below the high
    watermark while a zone can be allocated from when it is above the low
    watermark, the allocator may keep kswapd running while kswapd reclaim
    ensures that the page allocator can keep allocating from the first zone in
    the zonelist for extended periods of time. Meanwhile the other zones
    rarely see new allocations and thus get aged much slower in comparison.

    The result is that the occasional page placed in lower zones gets
    relatively more time in memory, even gets promoted to the active list
    after its peers have long been evicted. Meanwhile, the bulk of the
    working set may be thrashing on the preferred zone even though there may
    be significant amounts of memory available in the lower zones.

    Even the most basic test -- repeatedly reading a file slightly bigger than
    memory -- shows how broken the zone aging is. In this scenario, no single
    page should be able to stay in memory long enough to get referenced twice
    and activated, but activation happens in spades:

    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 0
    nr_active_file 8
    nr_inactive_file 1582
    nr_active_file 11994
    $ cat data data data data >/dev/null
    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 70
    nr_inactive_file 258753
    nr_active_file 443214
    nr_inactive_file 149793
    nr_active_file 12021

    Fix this with a very simple round robin allocator. Each zone is allowed a
    batch of allocations that is proportional to the zone's size, after which
    it is treated as full. The batch counters are reset when all zones have
    been tried and the allocator enters the slowpath and kicks off kswapd
    reclaim. Allocation and reclaim is now fairly spread out to all
    available/allowable zones:

    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 174
    nr_active_file 4865
    nr_inactive_file 53
    nr_active_file 860
    $ cat data data data data >/dev/null
    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 666622
    nr_active_file 4988
    nr_inactive_file 190969
    nr_active_file 937

    When zone_reclaim_mode is enabled, allocations will now spread out to all
    zones on the local node, not just the first preferred zone (which on a 4G
    node might be a tiny Normal zone).
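
    A toy model of the fair per-zone batching described above (the zone
    sizes, the batch proportion and the reset policy are illustrative only):

    #include <stdio.h>

    struct zone {
        const char *name;
        long managed_pages;
        long alloc_batch;       /* remaining fair share for this round */
    };

    /* Give every zone a batch proportional to its size. */
    static void reset_alloc_batches(struct zone *zones, int n)
    {
        for (int i = 0; i < n; i++)
            zones[i].alloc_batch = zones[i].managed_pages / 64;  /* toy proportion */
    }

    /* Allocate one page from the first zone that still has batch left. */
    static struct zone *alloc_page_fair(struct zone *zones, int n)
    {
        for (int i = 0; i < n; i++) {
            if (zones[i].alloc_batch > 0) {
                zones[i].alloc_batch--;
                return &zones[i];
            }
        }
        /* slowpath: every batch is used up; kswapd would be kicked here */
        reset_alloc_batches(zones, n);
        return alloc_page_fair(zones, n);
    }

    int main(void)
    {
        struct zone zones[] = {
            { "Normal", 64 * 1024, 0 },
            { "DMA32", 512 * 1024, 0 },
        };
        long counts[2] = { 0, 0 };

        reset_alloc_batches(zones, 2);
        for (int i = 0; i < 20000; i++)
            counts[alloc_page_fair(zones, 2) - zones]++;
        printf("Normal: %ld pages, DMA32: %ld pages\n", counts[0], counts[1]);
        return 0;
    }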

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Paul Bolle
    Cc: Zlatko Calusic
    Tested-by: Kevin Hilman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Allocations that do not have to respect the watermarks are rare
    high-priority events. Reorder the code such that per-zone dirty limits
    and future checks important only to regular page allocations are ignored
    in these extraordinary situations.

    Signed-off-by: Johannes Weiner
    Cc: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Paul Bolle
    Tested-by: Zlatko Calusic
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We should not compare loop+1 with the loop end inside the loop body. Just
    duplicate the two lines of code to avoid that check.

    This will help a bit when we have a huge number of pages on a system with
    16TiB of memory.
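
    The general pattern, as a hedged sketch (process() and the array are
    stand-ins; in the kernel the loop body clears page flags and prefetches
    the next struct page):

    #include <stdio.h>

    #define N 8

    static void process(int *item, int *next_hint)
    {
        if (next_hint)              /* e.g. prefetchw(next) in the kernel */
            (void)*next_hint;
        *item += 1;
    }

    /* Before: test "i + 1 < N" on every iteration just for the last element. */
    static void walk_with_branch(int *a)
    {
        for (int i = 0; i < N; i++)
            process(&a[i], i + 1 < N ? &a[i + 1] : NULL);
    }

    /* After: peel the final iteration out of the loop; no per-iteration test. */
    static void walk_peeled(int *a)
    {
        for (int i = 0; i < N - 1; i++)
            process(&a[i], &a[i + 1]);
        process(&a[N - 1], NULL);
    }

    int main(void)
    {
        int a[N] = { 0 }, b[N] = { 0 };

        walk_with_branch(a);
        walk_peeled(b);
        printf("%d %d\n", a[N - 1], b[N - 1]);
        return 0;
    }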

    Signed-off-by: Yinghai Lu
    Cc: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • In the current code, the value of fallback_migratetype that is printed
    using the mm_page_alloc_extfrag tracepoint, is the value of the
    migratetype *after* it has been set to the preferred migratetype (if the
    ownership was changed). Obviously that wouldn't have been the original
    intent. (We already have a separate 'change_ownership' field to tell
    whether the ownership of the pageblock was changed from the
    fallback_migratetype to the preferred type.)

    The intent of the fallback_migratetype field is to show the migratetype
    from which we borrowed pages in order to satisfy the allocation request.
    So fix the code to print that value correctly.

    Signed-off-by: Srivatsa S. Bhat
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Cody P Schafer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa S. Bhat
     
  • The free-page stealing code in __rmqueue_fallback() is somewhat hard to
    follow, and has an incredible amount of subtlety hidden inside!

    First off, there is a minor bug in the reporting of change-of-ownership of
    pageblocks. Under some conditions, we try to move up to
    'pageblock_nr_pages' pages to the preferred allocation list. But we
    change the ownership of that pageblock to the preferred type only if we
    manage to successfully move at least half of that pageblock (or if
    page_group_by_mobility_disabled is set).

    However, the current code ignores the latter part and sets the
    'migratetype' variable to the preferred type, irrespective of whether we
    actually changed the pageblock migratetype of that block or not. So, the
    page_alloc_extfrag tracepoint can end up printing incorrect info (i.e.,
    'change_ownership' might be shown as 1 when it must have been 0).

    So fixing this involves moving the update of the 'migratetype' variable to
    the right place. But looking closer, we observe that the 'migratetype'
    variable is used subsequently for checks such as "is_migrate_cma()".
    Obviously the intent there is to check if the *fallback* type is
    MIGRATE_CMA, but since we already set the 'migratetype' variable to
    start_migratetype, we end up checking if the *preferred* type is
    MIGRATE_CMA!!

    To make things more interesting, this actually doesn't cause a bug in
    practice, because we never change *anything* if the fallback type is CMA.

    So, restructure the code in such a way that it is trivial to understand
    what is going on, and also fix the above mentioned bug. And while at it,
    also add a comment explaining the subtlety behind the migratetype used in
    the call to expand().

    [akpm@linux-foundation.org: remove unneeded `inline', small coding-style fix]
    Signed-off-by: Srivatsa S. Bhat
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Cody P Schafer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa S. Bhat
     
  • Fix all errors reported by checkpatch and some small spelling mistakes.

    Signed-off-by: Pintu Kumar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pintu Kumar
     
  • set_pageblock_order() may be called during memory hotplug, so it needs to
    use '__paginginit' instead of '__init'.

    The related warning:

    The function __meminit .free_area_init_node() references
    a function __init .set_pageblock_order().
    If .set_pageblock_order is only used by .free_area_init_node then
    annotate .set_pageblock_order with a matching annotation.

    Signed-off-by: Chen Gang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • When PAGE_SHIFT > 20, the result of "20 - PAGE_SHIFT" is negative, and
    the previous calculation here generates an unexpected result. In
    addition, if PAGE_SIZE >= 1MB, the memory size in "numentries" is already
    an integral multiple of 1MB.
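
    A sketch of the guarded rounding (the values and the helper are
    illustrative; only the shape of the guard mirrors the described fix):

    #include <stdio.h>

    #define PAGE_SHIFT 12u                   /* the guard is what matters here */
    #define PAGE_SIZE  (1ul << PAGE_SHIFT)

    /* Round x up to a multiple of step, where step is a power of two. */
    static unsigned long round_up_pow2(unsigned long x, unsigned long step)
    {
        return (x + step - 1) & ~(step - 1);
    }

    int main(void)
    {
        unsigned long numentries = 300000;   /* page count to round up to 1MB */

        /*
         * The old form shifted by "20 - PAGE_SHIFT", which is no longer a
         * valid shift amount once PAGE_SHIFT > 20.  And when PAGE_SIZE is
         * 1MB or more, numentries is already a multiple of 1MB of memory,
         * so the rounding is simply unnecessary.
         */
        if (PAGE_SHIFT < 20)
            numentries = round_up_pow2(numentries, (1ul << 20) / PAGE_SIZE);

        printf("numentries rounded to %lu\n", numentries);
        return 0;
    }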

    Signed-off-by: Jerry Zhou
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerry Zhou
     

27 Aug, 2013

1 commit


10 Jul, 2013

4 commits

  • min_free_kbytes is currently updated during memory hotplug (by
    init_per_zone_wmark_min), which is the right thing to do in most cases,
    but it can be unexpected if the admin has increased the value to prevent
    allocation failures and the new min_free_kbytes is then decreased as a
    result of memory hotadd.

    This patch saves the user defined value and allows updating
    min_free_kbytes only if it is higher than the saved one.

    A warning is printed when the new value is ignored.
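
    A minimal sketch of the policy (variable and function names follow the
    description above; the sysctl plumbing and the exact warning text are
    simplified):

    #include <stdio.h>

    static int min_free_kbytes = 2048;
    static int user_min_free_kbytes = -1;   /* -1: never set by the admin */

    /* sysctl write path: remember that the admin chose this value */
    static void set_min_free_kbytes_from_user(int val)
    {
        user_min_free_kbytes = val;
        min_free_kbytes = val;
    }

    /* hotplug path: never lower a value the admin raised */
    static void init_per_zone_wmark_min(int computed)
    {
        if (computed > user_min_free_kbytes)
            min_free_kbytes = computed;
        else
            printf("ignoring computed min_free_kbytes %d; keeping user value %d\n",
                   computed, user_min_free_kbytes);
    }

    int main(void)
    {
        set_min_free_kbytes_from_user(65536);
        init_per_zone_wmark_min(4096);      /* hotadd would have lowered it */
        printf("min_free_kbytes = %d\n", min_free_kbytes);
        return 0;
    }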

    Signed-off-by: Michal Hocko
    Cc: Mel Gorman
    Acked-by: Zhang Yanfei
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • In __rmqueue_fallback(), current_order loops down from MAX_ORDER - 1 to
    the order passed. MAX_ORDER is typically 11 and pageblock_order is
    typically 9 on x86. Integer division truncates, so pageblock_order / 2
    is 4. For the first eight iterations, it's guaranteed that
    current_order >= pageblock_order / 2 if it even gets that far!

    So just remove the unlikely(), it's completely bogus.

    Signed-off-by: Zhang Yanfei
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • The callers of build_zonelists_node() always pass MAX_NR_ZONES - 1 as the
    zone_type argument, so we can use that value directly in
    build_zonelists_node() and remove the zone_type argument.

    Signed-off-by: Zhang Yanfei
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • When calculating pages in a node, for each zone in that node, we will
    have

    zone_spanned_pages_in_node
    --> get_pfn_range_for_nid
    zone_absent_pages_in_node
    --> get_pfn_range_for_nid

    That is to say, we call get_pfn_range_for_nid() to get the start_pfn and
    end_pfn of the node MAX_NR_ZONES * 2 times, which is totally unnecessary.
    If we call get_pfn_range_for_nid() once beforehand and add two extra
    arguments, node_start_pfn and node_end_pfn, to zone_*_pages_in_node, then
    we can remove the get_pfn_range_for_nid() calls from
    zone_*_pages_in_node.
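
    A sketch of the refactoring (function and argument names mirror the
    description; the bodies are stand-ins):

    #include <stdio.h>

    static void get_pfn_range_for_nid(int nid, unsigned long *start_pfn,
                                      unsigned long *end_pfn)
    {
        /* stand-in: pretend every node spans a fixed-size range */
        *start_pfn = 0x10000ul * nid;
        *end_pfn   = *start_pfn + 0x4000;
    }

    /* After the change the helpers receive the node range as arguments. */
    static unsigned long zone_spanned_pages_in_node(int nid, int zone,
                                                    unsigned long node_start_pfn,
                                                    unsigned long node_end_pfn)
    {
        (void)nid; (void)zone;
        return node_end_pfn - node_start_pfn;   /* stand-in calculation */
    }

    int main(void)
    {
        unsigned long start_pfn, end_pfn, total = 0;
        int nid = 1;

        get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);  /* once per node... */
        for (int zone = 0; zone < 4; zone++)               /* ...not once per helper call */
            total += zone_spanned_pages_in_node(nid, zone, start_pfn, end_pfn);
        printf("total spanned pages (toy): %lu\n", total);
        return 0;
    }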

    [akpm@linux-foundation.org: make definitions more readable]
    Signed-off-by: Zhang Yanfei
    Cc: Michal Hocko
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     

04 Jul, 2013

18 commits

  • Introduce helper function mem_init_print_info() to simplify mem_init()
    across different architectures, which also unifies the format and
    information printed.

    Function mem_init_print_info() calculates memory statistics information
    without walking each page, so it should be a little faster on some
    architectures.

    Also introduce another helper get_num_physpages() to kill the global
    variable num_physpages.

    Signed-off-by: Jiang Liu
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • As reported by https://bugzilla.kernel.org/show_bug.cgi?id=53501,
    "MemTotal" from /proc/meminfo means memory pages managed by the buddy
    system (managed_pages), but "MemTotal" from /sys/.../node/nodex/meminfo
    means physical pages present (present_pages) within the NUMA node.
    There's a difference between managed_pages and present_pages due to
    bootmem allocator and reserved pages.

    And Documentation/filesystems/proc.txt says
    MemTotal: Total usable ram (i.e. physical ram minus a few reserved
    bits and the kernel binary code)

    So change /sys/.../node/nodex/meminfo to report available pages within
    the node as "MemTotal".

    Signed-off-by: Jiang Liu
    Reported-by:
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: "H. Peter Anvin"
    Cc: "Michael S. Tsirkin"
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Marek Szyprowski
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tang Chen
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Wen Congyang
    Cc: Will Deacon
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Enhance adjust_managed_page_count() to adjust totalhigh_pages for
    highmem pages, and change code which directly adjusts totalram_pages to
    use adjust_managed_page_count() instead, because it adjusts
    totalram_pages, totalhigh_pages and zone->managed_pages together in a
    safe way.

    Remove inc_totalhigh_pages() and dec_totalhigh_pages() from the
    xen/balloon driver because adjust_managed_page_count() already adjusts
    totalhigh_pages.

    This patch also fixes two bugs:

    1) Enhance the virtio_balloon driver to adjust totalhigh_pages when
    reserving/unreserving pages.
    2) Enhance memory_hotplug.c to adjust totalhigh_pages when hot-removing
    memory.

    We still need to deal with modifications of totalram_pages in file
    arch/powerpc/platforms/pseries/cmm.c, but need help from PPC experts.
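
    A toy model of the consolidated accounting (a pthread mutex stands in for
    the kernel lock and the highmem test is simplified):

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct zone { long managed_pages; bool is_highmem; };

    static long totalram_pages, totalhigh_pages;
    static pthread_mutex_t managed_page_count_lock = PTHREAD_MUTEX_INITIALIZER;

    /*
     * One helper adjusts zone->managed_pages, totalram_pages and
     * totalhigh_pages together, so callers cannot update one counter and
     * forget the others (which is what led to the double accounting).
     */
    static void adjust_managed_page_count(struct zone *zone, long count)
    {
        pthread_mutex_lock(&managed_page_count_lock);
        zone->managed_pages += count;
        totalram_pages += count;
        if (zone->is_highmem)
            totalhigh_pages += count;
        pthread_mutex_unlock(&managed_page_count_lock);
    }

    int main(void)
    {
        struct zone highmem = { .managed_pages = 1000, .is_highmem = true };

        totalram_pages = 1000;
        totalhigh_pages = 1000;
        adjust_managed_page_count(&highmem, -256);  /* e.g. a balloon reserving pages */
        printf("ram=%ld high=%ld zone=%ld\n",
               totalram_pages, totalhigh_pages, highmem.managed_pages);
        return 0;
    }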

    [akpm@linux-foundation.org: remove ifdef, per Wanpeng Li, virtio_balloon.c cleanup, per Sergei]
    [akpm@linux-foundation.org: export adjust_managed_page_count() to modules, for drivers/virtio/virtio_balloon.c]
    Signed-off-by: Jiang Liu
    Cc: Chris Metcalf
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Konrad Rzeszutek Wilk
    Cc: Jeremy Fitzhardinge
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: "H. Peter Anvin"
    Cc:
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Marek Szyprowski
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: Yinghai Lu
    Cc: Russell King
    Cc: Sergei Shtylyov
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • In order to simplify management of totalram_pages and
    zone->managed_pages, make __free_pages_bootmem() only available at boot
    time. With this change applied, __free_pages_bootmem() will only be
    used by bootmem.c and nobootmem.c at boot time, so mark it as __init.
    Other callers of __free_pages_bootmem() have been converted to use
    free_reserved_page(), which handles totalram_pages and
    zone->managed_pages in a safer way.

    This patch also fixes a bug in free_pagetable() for x86_64, which should
    increase zone->managed_pages instead of zone->present_pages when freeing
    reserved pages.

    And now we have managed_pages_count_lock to protect totalram_pages and
    zone->managed_pages, so remove the redundant ppb_lock lock in
    put_page_bootmem(). This greatly simplifies the locking rules.

    Signed-off-by: Jiang Liu
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Yinghai Lu
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: "Michael S. Tsirkin"
    Cc:
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Marek Szyprowski
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tejun Heo
    Cc: Will Deacon
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Currently lock_memory_hotplug()/unlock_memory_hotplug() are used to
    protect totalram_pages and zone->managed_pages. Other than the memory
    hotplug driver, totalram_pages and zone->managed_pages may also be
    modified at runtime by other drivers, such as Xen balloon,
    virtio_balloon etc. For those cases, memory hotplug lock is a little
    too heavy, so introduce a dedicated lock to protect totalram_pages and
    zone->managed_pages.

    Now we have simplified locking rules for totalram_pages and
    zone->managed_pages:

    1) no locking for read accesses because they are unsigned long.
    2) no locking for write accesses at boot time in single-threaded context.
    3) serialize write accesses at runtime by acquiring the dedicated
    managed_page_count_lock.

    Also adjust zone->managed_pages when freeing reserved pages into the
    buddy system, to keep totalram_pages and zone->managed_pages consistent.

    [akpm@linux-foundation.org: don't export adjust_managed_page_count to modules (for now)]
    Signed-off-by: Jiang Liu
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: "H. Peter Anvin"
    Cc: "Michael S. Tsirkin"
    Cc:
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Marek Szyprowski
    Cc: Rusty Russell
    Cc: Tang Chen
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Wen Congyang
    Cc: Will Deacon
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Commit "mm: introduce new field 'managed_pages' to struct zone" assumes
    that all highmem pages will be freed into the buddy system by function
    mem_init(). But that's not always true, some architectures may reserve
    some highmem pages during boot. For example, PPC may allocate highmem
    pages for gigantic HugeTLB pages, and several architectures have code
    which checks the PageReserved flag to exclude highmem pages allocated
    during boot when freeing highmem pages into the buddy system.

    So treat highmem pages in the same way as normal pages, that is to:
    1) reset zone->managed_pages to zero in mem_init().
    2) recalculate managed_pages when freeing pages into the buddy system.

    Signed-off-by: Jiang Liu
    Cc: "H. Peter Anvin"
    Cc: Tejun Heo
    Cc: Joonsoo Kim
    Cc: Yinghai Lu
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Kamezawa Hiroyuki
    Cc: Marek Szyprowski
    Cc: "Michael S. Tsirkin"
    Cc:
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Konrad Rzeszutek Wilk
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tang Chen
    Cc: Thomas Gleixner
    Cc: Wen Congyang
    Cc: Will Deacon
    Cc: Yasuaki Ishimatsu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use zone->managed_pages instead of zone->present_pages to calculate
    default zonelist order because managed_pages means allocatable pages.

    Signed-off-by: Jiang Liu
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Marek Szyprowski
    Cc: "H. Peter Anvin"
    Cc: "Michael S. Tsirkin"
    Cc:
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tang Chen
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Wen Congyang
    Cc: Will Deacon
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Fix some trivial typos in comments.

    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Marek Szyprowski
    Cc: "H. Peter Anvin"
    Cc: "Michael S. Tsirkin"
    Cc:
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: Yinghai Lu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Address more review comments from the last round of code review.
    1) Enhance free_reserved_area() to support poisoning freed memory with
    pattern '0'. This could be used to get rid of poison_init_mem()
    on ARM64 (see the sketch below).
    2) A previous patch disabled memory poisoning for initmem on s390
    by mistake, so restore the original behavior.
    3) Remove the redundant PAGE_ALIGN() when calling free_reserved_area().
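
    A hedged sketch of the enhanced helper from item 1 (a userspace stand-in:
    memset and printf replace the kernel's page freeing and logging, and the
    range handling is simplified):

    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE 4096ul

    /* Toy stand-in: "freeing" a reserved page here is a no-op. */
    static void free_reserved_page(void *page) { (void)page; }

    /*
     * poison is a byte pattern; a negative value (e.g. -1) means "do not
     * poison", so pattern '0' can now be requested explicitly.
     */
    static unsigned long free_reserved_area(void *start, void *end, int poison,
                                            const char *s)
    {
        unsigned long pages = 0;
        char *pos;

        for (pos = start; pos + PAGE_SIZE <= (char *)end; pos += PAGE_SIZE, pages++) {
            if ((unsigned int)poison <= 0xFF)
                memset(pos, poison, PAGE_SIZE);
            free_reserved_page(pos);
        }
        if (pages && s)
            printf("Freeing %s memory: %luK\n", s, pages * PAGE_SIZE / 1024);
        return pages;
    }

    int main(void)
    {
        static char initmem[8 * 4096];

        free_reserved_area(initmem, initmem + sizeof(initmem), 0, "unused kernel");
        return 0;
    }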

    Signed-off-by: Jiang Liu
    Cc: Geert Uytterhoeven
    Cc: "H. Peter Anvin"
    Cc: "Michael S. Tsirkin"
    Cc:
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Marek Szyprowski
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tang Chen
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Wen Congyang
    Cc: Will Deacon
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Change signature of free_reserved_area() according to Russell King's
    suggestion to fix following build warnings:

    arch/arm/mm/init.c: In function 'mem_init':
    arch/arm/mm/init.c:603:2: warning: passing argument 1 of 'free_reserved_area' makes integer from pointer without a cast [enabled by default]
    free_reserved_area(__va(PHYS_PFN_OFFSET), swapper_pg_dir, 0, NULL);
    ^
    In file included from include/linux/mman.h:4:0,
    from arch/arm/mm/init.c:15:
    include/linux/mm.h:1301:22: note: expected 'long unsigned int' but argument is of type 'void *'
    extern unsigned long free_reserved_area(unsigned long start, unsigned long end,

    mm/page_alloc.c: In function 'free_reserved_area':
    >> mm/page_alloc.c:5134:3: warning: passing argument 1 of 'virt_to_phys' makes pointer from integer without a cast [enabled by default]
    In file included from arch/mips/include/asm/page.h:49:0,
    from include/linux/mmzone.h:20,
    from include/linux/gfp.h:4,
    from include/linux/mm.h:8,
    from mm/page_alloc.c:18:
    arch/mips/include/asm/io.h:119:29: note: expected 'const volatile void *' but argument is of type 'long unsigned int'
    mm/page_alloc.c: In function 'free_area_init_nodes':
    mm/page_alloc.c:5030:34: warning: array subscript is below array bounds [-Warray-bounds]

    Also address some minor code review comments.

    Signed-off-by: Jiang Liu
    Reported-by: Arnd Bergmann
    Cc: "H. Peter Anvin"
    Cc: "Michael S. Tsirkin"
    Cc:
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Marek Szyprowski
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tang Chen
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Wen Congyang
    Cc: Will Deacon
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • The memory-remove logic fails to account Total High Memory correctly when
    a memory block which contains highmem is offlined, as shown in the
    example below. The following patch fixes it.

    Before logic memory remove:

    MemTotal: 7603740 kB
    MemFree: 6329612 kB
    Buffers: 94352 kB
    Cached: 872008 kB
    SwapCached: 0 kB
    Active: 626932 kB
    Inactive: 519216 kB
    Active(anon): 180776 kB
    Inactive(anon): 222944 kB
    Active(file): 446156 kB
    Inactive(file): 296272 kB
    Unevictable: 0 kB
    Mlocked: 0 kB
    HighTotal: 7294672 kB
    HighFree: 5704696 kB
    LowTotal: 309068 kB
    LowFree: 624916 kB

    After logic memory remove:

    MemTotal: 7079452 kB
    MemFree: 5805976 kB
    Buffers: 94372 kB
    Cached: 872000 kB
    SwapCached: 0 kB
    Active: 626936 kB
    Inactive: 519236 kB
    Active(anon): 180780 kB
    Inactive(anon): 222944 kB
    Active(file): 446156 kB
    Inactive(file): 296292 kB
    Unevictable: 0 kB
    Mlocked: 0 kB
    HighTotal: 7294672 kB
    HighFree: 5181024 kB
    LowTotal: 4294752076 kB
    LowFree: 624952 kB

    [mhocko@suse.cz: fix CONFIG_HIGHMEM=n build]
    Signed-off-by: Wanpeng Li
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: [2.6.24+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • - check the length of the procfs data before copying it into a
    fixed-size array (see the sketch below).

    - when __parse_numa_zonelist_order() fails, save the error code for
    return.

    - 'char*' --> 'char *' coding style fix
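
    A hedged sketch of the first fix (the buffer size, names and accepted
    strings are illustrative; the real handler parses the numa_zonelist_order
    sysctl string):

    #include <stdio.h>
    #include <string.h>

    #define NUMA_ZONELIST_ORDER_LEN 16

    static char numa_zonelist_order[NUMA_ZONELIST_ORDER_LEN] = "default";

    /* Returns 0 on success, -1 on error (stand-in for -EINVAL). */
    static int zonelist_order_write(const char *user_buf, size_t len)
    {
        char saved_string[NUMA_ZONELIST_ORDER_LEN];

        /* Reject oversized input before touching the fixed-size buffer. */
        if (len >= sizeof(saved_string))
            return -1;

        strcpy(saved_string, numa_zonelist_order);      /* keep a copy to restore */
        memcpy(numa_zonelist_order, user_buf, len);
        numa_zonelist_order[len] = '\0';

        if (strcmp(numa_zonelist_order, "node") && strcmp(numa_zonelist_order, "zone")
            && strcmp(numa_zonelist_order, "default")) {
            /* parse failed: restore the old string and return the error */
            strcpy(numa_zonelist_order, saved_string);
            return -1;
        }
        return 0;
    }

    int main(void)
    {
        printf("%d\n", zonelist_order_write("node", 4));
        printf("%d\n", zonelist_order_write("this string is far too long", 28));
        return 0;
    }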

    Signed-off-by: Chen Gang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • When memory hotplug is triggered, we call pageset_init() on per-cpu
    pagesets which both contain pages and are in use, causing both the
    leakage of those pages and (potentially) bad behaviour if a page is
    allocated from a pageset while it is being cleared.

    Avoid this by factoring pageset_set_high_and_batch() (which contains all
    the logic needed to set a pageset's ->high and ->batch irrespective of
    system state) out of zone_pageset_init(), and using the new
    pageset_set_high_and_batch() instead of zone_pageset_init() in
    zone_pcp_update().
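
    A toy sketch of the factoring (the percpu_pagelist_fraction handling and
    the batch/high formulas are simplified stand-ins):

    #include <stdio.h>

    struct per_cpu_pageset { int count, high, batch; };
    struct zone { long managed_pages; struct per_cpu_pageset pcp; };

    static int percpu_pagelist_fraction;    /* 0 means "use the default sizing" */

    /* Only adjusts ->high and ->batch; never re-initialises a live pageset. */
    static void pageset_set_high_and_batch(struct zone *zone,
                                           struct per_cpu_pageset *p)
    {
        if (percpu_pagelist_fraction) {
            p->high = zone->managed_pages / percpu_pagelist_fraction;
            p->batch = p->high / 4;
        } else {
            p->batch = zone->managed_pages / 1024;      /* toy default */
            p->high = 6 * p->batch;
        }
    }

    /* Boot / new-zone path: full init, then sizing. */
    static void zone_pageset_init(struct zone *zone)
    {
        zone->pcp = (struct per_cpu_pageset){ 0 };
        pageset_set_high_and_batch(zone, &zone->pcp);
    }

    /* Hotplug path: the pageset may hold pages, so only resize it. */
    static void zone_pcp_update(struct zone *zone)
    {
        pageset_set_high_and_batch(zone, &zone->pcp);
    }

    int main(void)
    {
        struct zone z = { .managed_pages = 262144 };

        zone_pageset_init(&z);
        z.pcp.count = 17;               /* pages sitting in the pcp list */
        z.managed_pages += 65536;       /* memory hot-add grew the zone */
        zone_pcp_update(&z);            /* resized without dropping the 17 pages */
        printf("count=%d high=%d batch=%d\n", z.pcp.count, z.pcp.high, z.pcp.batch);
        return 0;
    }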

    Signed-off-by: Cody P Schafer
    Cc: Valdis Kletnieks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Signed-off-by: Cody P Schafer
    Cc: Gilad Ben-Yossef
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Previously, zone_pcp_update() called pageset_set_batch() directly,
    essentially assuming that percpu_pagelist_fraction == 0.

    Correct this by calling zone_pageset_init(), which chooses the
    appropriate ->batch and ->high calculations.

    Signed-off-by: Cody P Schafer
    Cc: Gilad Ben-Yossef
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Signed-off-by: Cody P Schafer
    Cc: Gilad Ben-Yossef
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Signed-off-by: Cody P Schafer
    Cc: Gilad Ben-Yossef
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Signed-off-by: Cody P Schafer
    Cc: Gilad Ben-Yossef
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer