19 Dec, 2014

1 commit

  • When the system boots up, the memory statistics in the dmesg log include
    the total reserved memory, for example:
    Memory: 458840k/458840k available, 65448k reserved, 0K highmem

    When CMA is enabled, the total reserved memory shown there still remains
    the same. However, the CMA memory is not really reserved: when we look at
    /proc/meminfo, the CMA memory is counted as part of free memory. This
    creates confusion. This patch corrects the problem by properly
    subtracting the CMA reserved memory from the total reserved memory in the
    dmesg log and reporting it separately.

    Below are dmesg snapshots from an ARM-based device with 512MB RAM and a
    single 12MB CMA region.

    Before this change:
    Memory: 458840k/458840k available, 65448k reserved, 0K highmem

    After this change:
    Memory: 458840k/458840k available, 53160k reserved, 12288k cma-reserved, 0K highmem
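
    A minimal sketch of the reporting idea (illustrative only, not the exact
    diff; the names free_kb, total_kb, reserved_kb, highmem_kb and the CMA
    page counter totalcma_pages are assumptions here):

        pr_info("Memory: %luK/%luK available, %luK reserved, %luK cma-reserved, %luK highmem\n",
                free_kb, total_kb,
                reserved_kb - (totalcma_pages << (PAGE_SHIFT - 10)),
                totalcma_pages << (PAGE_SHIFT - 10),
                highmem_kb);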

    Signed-off-by: Pintu Kumar
    Signed-off-by: Vishnu Pratap Singh
    Acked-by: Michal Nazarewicz
    Cc: Rafael Aquini
    Cc: Jerome Marchand
    Cc: Marek Szyprowski
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pintu Kumar
     

14 Dec, 2014

7 commits

  • Since commit 01cefaef40c4 ("mm: provide more accurate estimation of
    pages occupied by memmap"), the pages for the highmem zones' memmap are
    allocated from lowmem, so there is no need to reserve memmap space for
    highmem.

    A 2GB DDR3 ARM platform, before this change:
    On node 0 totalpages: 524288
    free_area_init_node: node 0, pgdat 80ccd380, node_mem_map 80d38000
    DMA zone: 3568 pages used for memmap
    DMA zone: 0 pages reserved
    DMA zone: 456704 pages, LIFO batch:31
    HighMem zone: 528 pages used for memmap
    HighMem zone: 67584 pages, LIFO batch:15

    After this change:
    On node 0 totalpages: 524288
    free_area_init_node: node 0, pgdat 80cd6f40, node_mem_map 80d42000
    DMA zone: 3568 pages used for memmap
    DMA zone: 0 pages reserved
    DMA zone: 456704 pages, LIFO batch:31
    HighMem zone: 67584 pages, LIFO batch:15

    Signed-off-by: Hongbo Zhong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhong Hongbo
     
  • The slab shrinkers are currently invoked from the zonelist walkers in
    kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
    eligible LRU pages and assemble a nodemask to pass to NUMA-aware
    shrinkers, which then again have to walk over the nodemask. This is
    redundant code, extra runtime work, and fairly inaccurate when it comes to
    the estimation of actually scannable LRU pages. The code duplication will
    only get worse when making the shrinkers cgroup-aware and requiring them
    to have out-of-band cgroup hierarchy walks as well.

    Instead, invoke the shrinkers from shrink_zone(), which is where all
    reclaimers end up, to avoid this duplication.

    Take the count for eligible LRU pages out of get_scan_count(), which
    considers many more factors than just the availability of swap space, like
    zone_reclaimable_pages() currently does. Accumulate the number over all
    visited lruvecs to get the per-zone value.

    Some nodes have multiple zones due to memory addressing restrictions. To
    avoid putting too much pressure on the shrinkers, only invoke them once
    for each such node, using the class zone of the allocation as the pivot
    zone.

    For now, this integrates the slab shrinking better into the reclaim logic
    and gets rid of duplicative invocations from kswapd, direct reclaim, and
    zone reclaim. It also prepares for cgroup-awareness, allowing
    memcg-capable shrinkers to be added at the lruvec level without much
    duplication of both code and runtime work.

    This changes kswapd behavior, which used to invoke the shrinkers for each
    zone, but with scan ratios gathered from the entire node, resulting in
    meaningless pressure quantities on multi-zone nodes.

    Zone reclaim behavior also changes. It used to shrink slabs until the
    same amount of pages were shrunk as were reclaimed from the LRUs. Now it
    merely invokes the shrinkers once with the zone's scan ratio, which makes
    the shrinkers go easier on caches that implement aging and would prefer
    feeding back pressure from recently used slab objects to unused LRU pages.

    [vdavydov@parallels.com: assure class zone is populated]
    Signed-off-by: Johannes Weiner
    Cc: Dave Chinner
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This is the page owner tracking code, which was introduced a long time
    ago. It has lived in Andrew's tree, but nobody tried to upstream it, so
    it remained as is. Our company actively uses this feature to debug
    memory leaks and to find memory hogs, so I decided to upstream it.

    This functionality helps us to know who allocated each page. When
    allocating a page, we store some information about the allocation in
    extra memory. Later, if we need to know the status of all pages, we can
    retrieve and analyze this stored information.

    In the previous version of this feature, the extra memory was
    statically defined in struct page, but in this version the extra memory
    is allocated outside of struct page. This lets us turn the feature on
    and off at boot time without considerable memory waste.

    Although we already have tracepoints for tracing page allocation and
    freeing, using them to analyze page owners is rather complex. We would
    need to enlarge the trace buffer to prevent it from being overwritten
    before the userspace program is launched, and the launched program
    would have to continually dump the trace buffer for later analysis,
    which changes system behaviour far more than just keeping the
    information in memory, so it is bad for debugging.

    Moreover, we can use the page_owner feature for further purposes. For
    example, it is used for the fragmentation statistics implemented in
    this patchset, and I also plan to implement a CMA failure debugging
    feature on top of this interface.

    I'd like to give credit to all the developers who contributed to this
    feature, but it's not easy because I don't know the exact history.
    Sorry about that. Below are the people who have "Signed-off-by" in the
    patches in Andrew's tree.

    Contributor:
    Alexander Nyberg
    Mel Gorman
    Dave Hansen
    Minchan Kim
    Michal Nazarewicz
    Andrew Morton
    Jungsoo Son

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Now we are prepared to avoid using debug-pagealloc at boot time. So
    introduce a new kernel parameter to disable debug-pagealloc at boot
    time, and make the related functions be disabled in that case.

    The only non-intuitive part is the change to the guard page functions.
    Because guard pages are effective only when debug-pagealloc is enabled,
    turning them off along with debug-pagealloc is the reasonable thing to
    do.
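
    A hedged usage note, assuming the parameter is the boolean
    debug_pagealloc= early parameter: a kernel built with
    CONFIG_DEBUG_PAGEALLOC can then choose the behaviour on its command
    line, for example

        debug_pagealloc=on

    while the default behaviour when the parameter is absent is defined by
    the patch itself.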

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Until now, debug-pagealloc has needed extra flags in struct page, so we
    had to recompile the whole kernel whenever we decided to use it. This
    is really painful, because recompiling takes time and sometimes a
    rebuild is not even possible due to third-party modules depending on
    struct page. So we couldn't use this good feature in many cases.

    Now we have the page extension feature, which allows us to attach extra
    flags outside of struct page. This gets rid of the third-party module
    issue mentioned above, and it allows us to decide at boot time whether
    we need the extra memory for the page extension. With these properties,
    we can avoid using debug-pagealloc at boot time, with low computational
    overhead, in a kernel built with CONFIG_DEBUG_PAGEALLOC. This will help
    our development process greatly.

    This patch is the preparation step to achieve the above goal.
    debug-pagealloc originally used an extra field of struct page; after
    this patch it uses a field of struct page_ext. Because the memory for
    page_ext is allocated later than the initialization of the page
    allocator with CONFIG_SPARSEMEM, we have to disable the debug-pagealloc
    feature temporarily until page_ext is initialized. This patch
    implements that.
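
    A minimal sketch of where such per-page debug state lives after the
    conversion (illustrative; lookup_page_ext() comes from the page
    extension code, and the flag name here is an assumption):

        static inline void set_page_poison(struct page *page)
        {
                struct page_ext *page_ext = lookup_page_ext(page);

                /* the debug state sits in page_ext->flags, not in struct page */
                __set_bit(PAGE_EXT_DEBUG_POISON, &page_ext->flags);
        }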

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • When we debug something, we'd like to attach some information to every
    page. For this purpose we sometimes modify struct page itself, but this
    has drawbacks. First, it requires a recompile, which makes us hesitate
    to use powerful debug features and slows down the development process.
    Second, it is sometimes impossible to rebuild the kernel due to
    third-party module dependencies. Third, system behaviour can change
    considerably after a recompile, because changing the size of struct
    page affects every part of the kernel; keeping struct page as it is
    makes it easier to reproduce an erroneous situation.

    This feature is intended to overcome the problems mentioned above. It
    allocates memory for extended data per page in a place other than
    struct page itself, and that memory can be accessed through the
    accessor functions provided by this code. During the boot process it
    checks whether allocating this huge chunk of memory is needed or not;
    if not, it avoids allocating memory at all. With this advantage, we can
    include this feature in the kernel by default and avoid the rebuild and
    the related problems.

    Until now, memcg used this technique, but memcg has since decided to
    embed its variable into struct page itself, and its code to extend
    struct page has been removed. I'd like to use this code to develop
    debug features, so this patch resurrects it.

    To help this work well, this patch introduces two callbacks for
    clients. One is the "need" callback, which is mandatory if the user
    wants to avoid a useless memory allocation at boot time. The other,
    optional one is the "init" callback, which is used to do proper
    initialization after the memory is allocated. A detailed explanation of
    the purpose of these functions is in the code comments; please refer to
    them.

    Everything else is the same as the previous extension code in memcg.
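
    A minimal sketch of a hypothetical client of these callbacks (the
    client names are made up for illustration; only the .need/.init layout
    follows the description above):

        static bool my_debug_enabled;   /* e.g. set from a boot parameter */

        static bool my_debug_need(void)
        {
                /* only pay the boot-time memory cost when requested */
                return my_debug_enabled;
        }

        static void my_debug_init(void)
        {
                /* the extended per-page memory is now allocated; finish setup */
        }

        static struct page_ext_operations my_debug_page_ext_ops = {
                .need = my_debug_need,
                .init = my_debug_init,
        };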

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The page guard is used by the debug-pagealloc feature. Currently it is
    open-coded, but I think that abstracting it makes the core page
    allocator code more readable.

    There is no functional difference.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Gioh Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

12 Dec, 2014

1 commit

  • Pull cgroup update from Tejun Heo:
    "cpuset got simplified a bit. cgroup core got a fix on unified
    hierarchy and grew some effective css related interfaces which will be
    used for blkio support for writeback IO traffic which is currently
    being worked on"

    * 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: implement cgroup_get_e_css()
    cgroup: add cgroup_subsys->css_e_css_changed()
    cgroup: add cgroup_subsys->css_released()
    cgroup: fix the async css offline wait logic in cgroup_subtree_control_write()
    cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write()
    cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask()
    cpuset: lock vs unlock typo
    cpuset: simplify cpuset_node_allowed API
    cpuset: convert callback_mutex to a spinlock

    Linus Torvalds
     

11 Dec, 2014

10 commits

  • Now that the external page_cgroup data structure and its lookup is
    gone, let the generic bad_page() check for page->mem_cgroup sanity.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: David S. Miller
    Cc: KAMEZAWA Hiroyuki
    Cc: "Kirill A. Shutemov"
    Cc: Tejun Heo
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memory cgroups used to have 5 per-page pointers. To allow users to
    disable that amount of overhead during runtime, those pointers were
    allocated in a separate array, with a translation layer between them and
    struct page.

    There is now only one page pointer remaining: the memcg pointer, that
    indicates which cgroup the page is associated with when charged. The
    complexity of runtime allocation and the runtime translation overhead is
    no longer justified to save that *potential* 0.19% of memory. With
    CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding
    after the page->private member and doesn't even increase struct page,
    and then this patch actually saves space. Remaining users that care can
    still compile their kernels without CONFIG_MEMCG.

       text    data    bss      dec    hex filename
    8828345 1725264 983040 11536649 b00909 vmlinux.old
    8827425 1725264 966656 11519345 afc571 vmlinux.new

    [mhocko@suse.cz: update Documentation/cgroups/memory.txt]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: David S. Miller
    Acked-by: KAMEZAWA Hiroyuki
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Cc: Joonsoo Kim
    Acked-by: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Signed-off-by: Wei Yuan
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yuan
     
  • The goal of memory compaction is to create high-order freepages through
    page migration. Page migration however puts pages on the per-cpu lru_add
    cache, which is later flushed to per-cpu pcplists, and only after pcplists
    are drained the pages can actually merge. This can happen due to the
    per-cpu caches becoming full through further freeing, or explicitly.

    During direct compaction, it is useful to do the draining explicitly so
    that pages merge as soon as possible and compaction can detect success
    immediately and keep the latency impact at minimum. However the current
    implementation is far from ideal. Draining is done only in
    __alloc_pages_direct_compact(), after all zones were already compacted,
    and the decisions to continue or stop compaction in individual zones was
    done without the last batch of migrations being merged. It is also
    missing the draining of lru_add cache before the pcplists.

    This patch moves the draining for direct compaction into compact_zone().
    It adds the missing lru_cache draining and uses the newly introduced
    single zone pcplists draining to reduce overhead and avoid impact on
    unrelated zones. Draining is only performed when it can actually lead to
    merging of a page of desired order (passed by cc->order). This means it
    is only done when migration occurred in the previously scanned cc->order
    aligned block(s) and the migration scanner is now pointing to the next
    cc->order aligned block.

    The patch has been tested with the stress-highalloc benchmark from
    mmtests. Although the overall allocation success rates of the benchmark
    were not affected, the number of detected compaction successes has
    doubled. This
    suggests that allocations were previously successful due to implicit
    merging caused by background activity, making a later allocation attempt
    succeed immediately, but not attributing the success to compaction. Since
    stress-highalloc always tries to allocate almost the whole memory, it
    cannot show the improvement in its reported success rate metric. However
    after this patch, compaction should detect success and terminate earlier,
    reducing the direct compaction latencies in a real scenario.

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Since commit 53853e2d2bfb ("mm, compaction: defer each zone individually
    instead of preferred zone"), compaction is deferred for each zone where
    sync direct compaction fails, and reset where it succeeds. However, it
    was observed that for the DMA zone, compaction often appeared to succeed
    while a subsequent allocation attempt would not, due to a different
    outcome of the watermark check.

    In order to properly defer compaction in this zone, the candidate zone
    has to be passed back to __alloc_pages_direct_compact() and compaction
    deferred in the zone after the allocation attempt fails.

    The large source of mismatch between watermark check in compaction and
    allocation was the lack of alloc_flags and classzone_idx values in
    compaction, which has been fixed in the previous patch. So with this
    problem fixed, we can simplify the code by removing the candidate_zone
    parameter and deferring in __alloc_pages_direct_compact().

    After this patch, the compaction activity during stress-highalloc
    benchmark is still somewhat increased, but it's negligible compared to the
    increase that occurred without the better watermark checking. This
    suggests that it is still possible to apparently succeed in compaction but
    fail to allocate, possibly due to parallel allocation activity.

    [akpm@linux-foundation.org: fix build]
    Suggested-by: Joonsoo Kim
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Compaction relies on zone watermark checks for decisions such as if it's
    worth to start compacting in compaction_suitable() or whether compaction
    should stop in compact_finished(). The watermark checks take
    classzone_idx and alloc_flags parameters, which are related to the memory
    allocation request. But from the context of compaction they are currently
    passed as 0, including the direct compaction which is invoked to satisfy
    the allocation request, and could therefore know the proper values.

    The lack of proper values can lead to mismatch between decisions taken
    during compaction and decisions related to the allocation request. Lack
    of proper classzone_idx value means that lowmem_reserve is not taken into
    account. This has manifested (during recent changes to deferred
    compaction) when DMA zone was used as fallback for preferred Normal zone.
    compaction_suitable() without proper classzone_idx would think that the
    watermarks are already satisfied, but watermark check in
    get_page_from_freelist() would fail. Because of this problem, deferring
    compaction has extra complexity that can be removed in the following
    patch.

    The issue (not confirmed in practice) with missing alloc_flags is opposite
    in nature. For allocations that include ALLOC_HIGH, ALLOC_HARDER or
    ALLOC_CMA in alloc_flags (the last includes all MOVABLE allocations on
    CMA-enabled systems), the watermark checking in compaction with 0 passed
    will be stricter than in get_page_from_freelist(). In these cases
    compaction might be running for a longer time than is really needed.

    Another issue with compaction_suitable() is that the check for "does
    the zone need compaction at all?" comes only after the check "does the
    zone have enough free pages to succeed compaction?". The latter
    considers extra pages for migration and can therefore in some
    situations fail and return COMPACT_SKIPPED, even though the high-order
    allocation would succeed and we should return COMPACT_PARTIAL.

    This patch fixes these problems by adding alloc_flags and classzone_idx to
    struct compact_control and related functions involved in direct compaction
    and watermark checking. Where possible, all other callers of
    compaction_suitable() pass proper values where those are known. This is
    currently limited to classzone_idx, which is sometimes known in kswapd
    context. However, the direct reclaim callers should_continue_reclaim()
    and compaction_ready() do not currently know the proper values, so the
    coordination between reclaim and compaction may still not be as
    accurate as it could be. This can be fixed later, if it's shown to be
    an issue.

    Additionally, the checks in compaction_suitable() are reordered to
    address the second issue described above.

    The effect of this patch should be slightly better high-order allocation
    success rates and/or less compaction overhead, depending on the type of
    allocations and presence of CMA. It allows simplifying deferred
    compaction code in a followup patch.

    When testing with stress-highalloc, there was some slight improvement
    (which might be just due to variance) in success rates of non-THP-like
    allocations.

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • This allows us to catch the bug fixed in the previous patch (mm: free
    compound page with correct order).

    Here we also verify whether a page is tail page or not -- tail pages are
    supposed to be freed along with their head, not by themselves.

    Signed-off-by: Yu Zhao
    Reviewed-by: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yu Zhao
     
  • CMA allocation drains the pcplists so that pages can merge back into
    the buddy allocator. Since it operates on a single zone, we can
    restrict the pcplists drain to that single zone, which is now possible.

    This change should make CMA allocations faster and stop them from
    disturbing unrelated pcplists.

    Signed-off-by: Vlastimil Babka
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Xishi Qiu
    Cc: Vladimir Davydov
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The functions for draining per-cpu pages back to buddy allocators
    currently always operate on all zones. There are however several cases
    where the drain is only needed in the context of a single zone, and
    spilling other pcplists is a waste of time both due to the extra
    spilling and later refilling.

    This patch introduces a new zone pointer parameter to drain_all_pages()
    and changes the dummy parameter of drain_local_pages() to also be a
    zone pointer. When NULL is passed, the functions operate on all zones
    as usual. Passing a specific zone pointer reduces the work to that
    single zone.

    All callers are updated to pass the NULL pointer in this patch.
    Conversion to single zone (where appropriate) is done in further
    patches.
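
    A short usage sketch of the resulting interface (illustrative only):

        /* drain the pcplists of every zone, as before this patch */
        drain_all_pages(NULL);

        /* drain only the pcplists holding pages of the given zone */
        drain_all_pages(zone);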

    Signed-off-by: Vlastimil Babka
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Xishi Qiu
    Cc: Vladimir Davydov
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Signed-off-by: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     

14 Nov, 2014

6 commits

  • One thing this patch does is fix the freepage accounting. If we clear a
    guard page and link it onto the isolate buddy list, we should not
    increase the freepage count. This patch adds a conditional branch to
    skip the counting in this case. Without this patch, the overcounting
    happens frequently when a guard order is set and CMA is used.

    Another thing fixed in this patch is the target whose order is reset.
    In __free_one_page(), we check whether the buddy page is a guard page
    or not, and if so, we should clear the guard attribute on the buddy
    page and reset its order to 0. But the current code resets the original
    page's order rather than the buddy's. Maybe this doesn't cause any
    problem, because the whole merged page's order will be re-assigned
    soon, but it is better to correct the code.
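
    A minimal sketch of the two fixes in the buddy merge loop of
    __free_one_page() (simplified; helper names approximate the surrounding
    code rather than quoting the exact diff):

        if (page_is_guard(buddy)) {
                /* clears the guard attribute and resets the *buddy's* order */
                clear_page_guard(zone, buddy, order, migratetype);
                /* skip the freepage count when it goes to the isolate list */
                if (!is_migrate_isolate(migratetype))
                        __mod_zone_freepage_state(zone, 1 << order, migratetype);
        }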

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Gioh Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Having the test_pages_isolated failure message as a warning confuses
    users into thinking that it is more serious than it really is. In
    reality, if called via CMA, the allocation will be retried, so a single
    test_pages_isolated failure does not prevent the allocation from
    succeeding.

    Demote the warning message to an info message and reformat it so that
    the text "failed" does not appear and the less worrying "PFNs busy" is
    used instead.

    This message is trivially reproducible on a 10GB x86 machine on 3.16.y
    kernels configured with CONFIG_DMA_CMA.

    Signed-off-by: Michal Nazarewicz
    Cc: Laurent Pinchart
    Cc: Peter Hurley
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Nazarewicz
     
  • The current pageblock isolation logic isolates each pageblock
    individually. This causes a freepage accounting problem if a freepage
    of pageblock order on an isolated pageblock is merged with another
    freepage on a normal pageblock. We can prevent such merging by
    restricting the maximum order of merging to pageblock order when the
    freepage is on an isolated pageblock.

    A side-effect of this change is that there can be an unmerged buddy
    freepage even after pageblock isolation finishes, because undoing
    pageblock isolation merely moves freepages from the isolate buddy list
    to the normal buddy list without considering merging. So the patch also
    makes undoing pageblock isolation consider freepage merging: on
    un-isolation, a freepage with more than pageblock order and its buddy
    are checked, and if they are on a normal pageblock, instead of just
    moving the freepage we isolate it and free it so that it gets merged.
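
    A minimal sketch of the merging restriction (simplified from the buddy
    merge loop in __free_one_page(); surrounding code is omitted):

        unsigned int max_order = MAX_ORDER - 1;

        /* never merge past the pageblock while it is isolated */
        if (is_migrate_isolate(migratetype))
                max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order);

        while (order < max_order) {
                /* ... find the buddy; stop if it is not a free page ... */
                /* ... otherwise remove it from its list and merge, then: ... */
                order++;
        }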

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Tang Chen
    Cc: Naoya Horiguchi
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Wen Congyang
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Laura Abbott
    Cc: Heesub Shin
    Cc: "Aneesh Kumar K.V"
    Cc: Ritesh Harjani
    Cc: Gioh Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • All the callers of __free_one_page() have similar freepage counting
    logic, so we can move it into __free_one_page(). This reduces the lines
    of code and helps future maintenance.

    This is also a preparation step for "mm/page_alloc: restrict max order
    of merging on isolated pageblock", which fixes the freepage counting
    problem for freepages of more than pageblock order.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Tang Chen
    Cc: Naoya Horiguchi
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Wen Congyang
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Laura Abbott
    Cc: Heesub Shin
    Cc: "Aneesh Kumar K.V"
    Cc: Ritesh Harjani
    Cc: Gioh Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • In free_pcppages_bulk(), we use the cached migratetype of a freepage to
    determine which buddy list the freepage will be added to. This
    information is stored when the freepage is added to the pcp list, so if
    isolation of this freepage's pageblock begins after it is stored, the
    cached information can be stale. In other words, it still holds the
    original migratetype rather than MIGRATE_ISOLATE.

    There are two problems caused by this stale information.

    One is that we can't keep these freepages from being allocated.
    Although the pageblock is isolated, the freepage is added to a normal
    buddy list, so it can be allocated without any restriction. The other
    problem is incorrect freepage accounting: freepages on an isolated
    pageblock should not be counted as free pages.

    Following is the code snippet in free_pcppages_bulk():

        /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
        __free_one_page(page, page_to_pfn(page), zone, 0, mt);
        trace_mm_page_pcpu_drain(page, 0, mt);
        if (likely(!is_migrate_isolate_page(page))) {
                __mod_zone_page_state(zone, NR_FREE_PAGES, 1);
                if (is_migrate_cma(mt))
                        __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
        }

    As the snippet shows, the current code already handles the second
    problem, incorrect freepage accounting, by re-fetching the pageblock
    migratetype through is_migrate_isolate_page(page).

    But because this re-fetched information isn't used for
    __free_one_page(), the first problem is not solved. This patch
    re-fetches the pageblock migratetype before __free_one_page() and uses
    it for __free_one_page().

    In addition to moving the re-fetch up, this patch applies an
    optimization: the migratetype is re-fetched only if there is an
    isolated pageblock. Pageblock isolation is a rare event, so we can
    avoid the re-fetch in the common case.

    This patch also corrects the migratetype in the tracepoint output.
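
    A minimal sketch of the resulting logic (simplified; a zone-level check,
    called has_isolate_pageblock() here, is the helper this series adds on
    top of the new per-zone isolation counter):

        mt = get_freepage_migratetype(page);
        /* re-fetch only when some pageblock in this zone is isolated */
        if (unlikely(has_isolate_pageblock(zone)))
                mt = get_pageblock_migratetype(page);

        __free_one_page(page, page_to_pfn(page), zone, 0, mt);
        trace_mm_page_pcpu_drain(page, 0, mt);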

    Signed-off-by: Joonsoo Kim
    Acked-by: Minchan Kim
    Acked-by: Michal Nazarewicz
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Tang Chen
    Cc: Naoya Horiguchi
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Wen Congyang
    Cc: Marek Szyprowski
    Cc: Laura Abbott
    Cc: Heesub Shin
    Cc: "Aneesh Kumar K.V"
    Cc: Ritesh Harjani
    Cc: Gioh Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Before describing the bugs themselves, I first explain the definition
    of a freepage:

    1. pages on a buddy list are counted as freepages.
    2. pages on the isolate-migratetype buddy list are *not* counted as
    freepages.
    3. pages on a CMA buddy list are also counted as CMA freepages.

    Now I describe the problems and the related patches.

    Patch 1: There are race conditions in getting the pageblock migratetype
    that result in misplacement of freepages on the buddy lists, an
    incorrect freepage count and unavailability of freepages.

    Patch 2: Freepages on the pcp list can carry stale cached information
    used to determine the migratetype of the buddy list they should go to.
    This causes misplacement of freepages on the buddy lists and an
    incorrect freepage count.

    Patch 4: Merging between freepages on pageblocks of different
    migratetypes causes a freepage accounting problem. This patch fixes it.

    Without patchset [3], the above problems don't happen in my CMA
    allocation test, because the CMA reserved pages aren't used at all, so
    there is no chance for the above race.

    With patchset [3], I did a simple CMA allocation test and got the
    result below:

    - Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
    - run kernel build (make -j16) in the background
    - 30 CMA allocation attempts (8MB * 30 = 240MB) at 5 second intervals
    - Result: more than 5000 freepages are missing from the count

    With patchset [3] and this patchset, I found that no freepages are
    missing from the count, so I conclude that the problems are solved.

    In my simple memory offlining test, these problems also occur in that
    environment.

    This patch (of 4):

    There are two paths that reach the core free function of the buddy
    allocator, __free_one_page(): one is free_one_page()->__free_one_page()
    and the other is
    free_hot_cold_page()->free_pcppages_bulk()->__free_one_page(). Each
    path has a race condition causing serious problems. This patch focuses
    on the first type of freepath; the following patch will solve the
    problem in the second type of freepath.

    In the first type of freepath, we get the migratetype of the page being
    freed without holding the zone lock, so it can be racy. There are two
    cases of this race.

    1. pages are added to the isolate buddy list after restoring the
    original migratetype

    CPU1                                          CPU2

    get migratetype => return MIGRATE_ISOLATE
    call free_one_page() with MIGRATE_ISOLATE

                                                  grab the zone lock
                                                  unisolate pageblock
                                                  release the zone lock

    grab the zone lock
    call __free_one_page() with MIGRATE_ISOLATE
    freepage go into isolate buddy list,
    although pageblock is already unisolated

    This may cause two problems. One is that we can't use this page anymore
    until the next isolation attempt on this pageblock, because the
    freepage is on the isolate buddy list. The other is that the freepage
    accounting can be wrong due to merging between different buddy lists:
    freepages on the isolate buddy list aren't counted as freepages, but
    ones on the normal buddy lists are. If a merge happens, a buddy
    freepage on a normal buddy list is inevitably moved to the isolate
    buddy list without any accounting adjustment, so the count becomes
    incorrect.

    2. pages are added to a normal buddy list while the pageblock is
    isolated. This is similar to the case above.

    This may also cause two problems. One is that we can't keep these
    freepages from being allocated: although the pageblock is isolated, the
    freepage is added to a normal buddy list, so it can be allocated
    without any restriction. The other problem is the same as in case 1,
    that is, incorrect freepage accounting.

    This race condition can be prevented by checking the migratetype again
    while holding the zone lock. Because that is a somewhat heavy operation
    and isn't needed in the common case, we want to avoid the recheck as
    much as possible. So this patch introduces a new field,
    nr_isolate_pageblock, in struct zone to check whether the zone has any
    isolated pageblock. With this we can avoid re-checking the migratetype
    in the common case and do it only if there is an isolated pageblock or
    the migratetype is MIGRATE_ISOLATE. This solves the problems mentioned
    above.

    Changes from v3:
    Add one more check in free_one_page() for whether the migratetype is
    MIGRATE_ISOLATE or not. Without this, the above-mentioned case 1 can
    still happen.
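
    A minimal sketch of the resulting check in free_one_page() (simplified;
    the helper name has_isolate_pageblock() stands for the test backed by
    the new nr_isolate_pageblock field):

        spin_lock(&zone->lock);
        /* recheck the pageblock only when it can actually matter */
        if (unlikely(has_isolate_pageblock(zone) ||
                     is_migrate_isolate(migratetype)))
                migratetype = get_pfnblock_migratetype(page, pfn);

        __free_one_page(page, pfn, zone, order, migratetype);
        spin_unlock(&zone->lock);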

    Signed-off-by: Joonsoo Kim
    Acked-by: Minchan Kim
    Acked-by: Michal Nazarewicz
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Tang Chen
    Cc: Naoya Horiguchi
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Wen Congyang
    Cc: Marek Szyprowski
    Cc: Laura Abbott
    Cc: Heesub Shin
    Cc: "Aneesh Kumar K.V"
    Cc: Ritesh Harjani
    Cc: Gioh Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

27 Oct, 2014

1 commit

  • The current cpuset API for checking whether a zone/node is allowed to
    allocate from looks rather awkward. We have hardwall and softwall
    versions of
    cpuset_node_allowed with the softwall version doing literally the same
    as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags.
    If it isn't, the softwall version may check the given node against the
    enclosing hardwall cpuset, which it needs to take the callback lock to
    do.

    Such a distinction was introduced by commit 02a0e53d8227 ("cpuset:
    rework cpuset_zone_allowed api"). Before, we had the only version with
    the __GFP_HARDWALL flag determining its behavior. The purpose of the
    commit was to avoid sleep-in-atomic bugs when someone would mistakenly
    call the function without the __GFP_HARDWALL flag for an atomic
    allocation. The suffixes introduced were intended to make the callers
    think before using the function.

    However, since the callback lock was converted from mutex to spinlock by
    the previous patch, the softwall check function cannot sleep, and these
    precautions are no longer necessary.

    So let's simplify the API back to the single check.

    Suggested-by: David Rientjes
    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Vladimir Davydov
     

22 Oct, 2014

1 commit

  • The PM freezer relies on having all tasks frozen by the time devices
    are getting frozen, so that no task will touch them while they are
    being frozen. But the OOM killer is allowed to kill an already frozen
    task in order to handle an OOM situation. To protect against late wake
    ups, the OOM killer is disabled after all tasks are frozen. This,
    however, still leaves a window open where a killed task didn't manage
    to die by the time freeze_processes() finishes.

    Reduce the race window by checking all tasks after the OOM killer has
    been disabled. This is still not completely race free, unfortunately,
    because oom_killer_disable cannot stop an already ongoing OOM kill, so
    a task might still wake up from the fridge and get killed without
    freeze_processes() noticing. Full synchronization of OOM and the
    freezer is, however, too heavyweight for this highly unlikely case.

    Introduce and check an oom_kills counter which gets incremented early
    when the allocator enters the __alloc_pages_may_oom path, and only
    check all the tasks if the counter changes during the freezing attempt.
    The counter is updated this early to reduce the race window, since the
    allocator checks oom_killer_disabled, which is set by the PM-freezing
    code. A false positive will push the PM freezer into a slow path, but
    that is not a big deal.

    Changes since v1
    - push the re-check loop out of freeze_processes into
    check_frozen_processes and invert the condition to make the code more
    readable as per Rafael

    Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring)
    Cc: 3.2+ # 3.2+
    Signed-off-by: Michal Hocko
    Signed-off-by: Rafael J. Wysocki

    Michal Hocko
     

14 Oct, 2014

1 commit

  • Pull x86 mm updates from Ingo Molnar:
    "This tree includes the following changes:

    - fix memory hotplug
    - fix hibernation bootup memory layout assumptions
    - fix hyperv numa guest kernel messages
    - remove dead code
    - update documentation"

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/mm: Update memory map description to list hypervisor-reserved area
    x86/mm, hibernate: Do not assume the first e820 area to be RAM
    x86/mm/numa: Drop dead code and rename setup_node_data() to setup_alloc_data()
    x86/mm/hotplug: Modify PGD entry when removing memory
    x86/mm/hotplug: Pass sync_global_pgds() a correct argument in remove_pagetable()
    x86: Remove set_pmd_pfn

    Linus Torvalds
     

10 Oct, 2014

12 commits

  • dump_page() and dump_vma() are not specific to page_alloc.c; move them
    out so page_alloc.c won't turn into the unofficial debug repository.

    Signed-off-by: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • Zones are allocated by the page allocator in either node or zone order.
    Node ordering is preferred in terms of locality and is applied
    automatically in one of three cases:

    1. If a node has only low memory

    2. If DMA/DMA32 is a high percentage of memory

    3. If low memory on a single node is greater than 70% of the node size

    Otherwise zone ordering is used to preserve low memory for devices that
    require it. Unfortunately a consequence of this is that applications
    running on a machine with balanced NUMA nodes will experience different
    performance characteristics depending on which node they happen to start
    from.

    The point of zone ordering is to protect lower zones for devices that
    require DMA/DMA32 memory. When NUMA was first introduced, this was
    critical as 32-bit NUMA machines existed and exhausting low memory
    triggered OOMs easily as so many allocations required low memory. On
    64-bit machines the primary concern is devices that are 32-bit only,
    which is less severe than the low memory exhaustion problem on 32-bit
    NUMA. It seems there are really few devices that depend on it.

    AGP -- I assume this is getting more rare but even then I think the allocations
    happen early in boot time where lowmem pressure is less of a problem

    DRM -- If the device is 32-bit only then there may be low pressure. I didn't
    evaluate these in detail but it looks like some of these are mobile
    graphics card. Not many NUMA laptops out there. DRM folk should know
    better though.

    Some TV cards -- Much demand for 32-bit capable TV cards on NUMA machines?

    B43 wireless card -- again not really a NUMA thing.

    I cannot find a good reason to incur a performance penalty on all 64-bit
    NUMA machines in case someone throws a brain-damaged TV or graphics card
    in there.
    This patch defaults to node-ordering on 64-bit NUMA machines. I was tempted
    to make it default everywhere but I understand that some embedded arches may
    be using 32-bit NUMA where I cannot predict the consequences.

    The performance impact depends on the workload and the characteristics of the
    machine and the machine I tested on had a large Normal zone on node 0 so the
    impact is within the noise for the majority of tests. The allocation stats
    show more allocation requests were from DMA32 and local node. Running SpecJBB
    with multiple JVMs and automatic NUMA balancing disabled the results were

    specjbb
    3.17.0-rc2 3.17.0-rc2
    vanilla nodeorder-v1r1
    Min 1 29534.00 ( 0.00%) 30020.00 ( 1.65%)
    Min 10 115717.00 ( 0.00%) 134038.00 ( 15.83%)
    Min 19 109718.00 ( 0.00%) 114186.00 ( 4.07%)
    Min 28 104459.00 ( 0.00%) 103639.00 ( -0.78%)
    Min 37 98245.00 ( 0.00%) 103756.00 ( 5.61%)
    Min 46 97198.00 ( 0.00%) 96197.00 ( -1.03%)
    Mean 1 30953.25 ( 0.00%) 31917.75 ( 3.12%)
    Mean 10 124432.50 ( 0.00%) 140904.00 ( 13.24%)
    Mean 19 116033.50 ( 0.00%) 119294.75 ( 2.81%)
    Mean 28 108365.25 ( 0.00%) 106879.50 ( -1.37%)
    Mean 37 102984.75 ( 0.00%) 106924.25 ( 3.83%)
    Mean 46 100783.25 ( 0.00%) 105368.50 ( 4.55%)
    Stddev 1 1260.38 ( 0.00%) 1109.66 ( 11.96%)
    Stddev 10 7434.03 ( 0.00%) 5171.91 ( 30.43%)
    Stddev 19 8453.84 ( 0.00%) 5309.59 ( 37.19%)
    Stddev 28 4184.55 ( 0.00%) 2906.63 ( 30.54%)
    Stddev 37 5409.49 ( 0.00%) 3192.12 ( 40.99%)
    Stddev 46 4521.95 ( 0.00%) 7392.52 (-63.48%)
    Max 1 32738.00 ( 0.00%) 32719.00 ( -0.06%)
    Max 10 136039.00 ( 0.00%) 148614.00 ( 9.24%)
    Max 19 130566.00 ( 0.00%) 127418.00 ( -2.41%)
    Max 28 115404.00 ( 0.00%) 111254.00 ( -3.60%)
    Max 37 112118.00 ( 0.00%) 111732.00 ( -0.34%)
    Max 46 108541.00 ( 0.00%) 116849.00 ( 7.65%)
    TPut 1 123813.00 ( 0.00%) 127671.00 ( 3.12%)
    TPut 10 497730.00 ( 0.00%) 563616.00 ( 13.24%)
    TPut 19 464134.00 ( 0.00%) 477179.00 ( 2.81%)
    TPut 28 433461.00 ( 0.00%) 427518.00 ( -1.37%)
    TPut 37 411939.00 ( 0.00%) 427697.00 ( 3.83%)
    TPut 46 403133.00 ( 0.00%) 421474.00 ( 4.55%)

    3.17.0-rc2 3.17.0-rc2
    vanillanodeorder-v1r1
    DMA allocs 0 0
    DMA32 allocs 57 1491992
    Normal allocs 32543566 30026383
    Movable allocs 0 0
    Direct pages scanned 0 0
    Kswapd pages scanned 0 0
    Kswapd pages reclaimed 0 0
    Direct pages reclaimed 0 0
    Kswapd efficiency 100% 100%
    Kswapd velocity 0.000 0.000
    Direct efficiency 100% 100%
    Direct velocity 0.000 0.000
    Percentage direct scans 0% 0%
    Zone normal velocity 0.000 0.000
    Zone dma32 velocity 0.000 0.000
    Zone dma velocity 0.000 0.000
    THP fault alloc 55164 52987
    THP collapse alloc 139 147
    THP splits 26 21
    NUMA alloc hit 4169066 4250692
    NUMA alloc miss 0 0

    Note that there were more DMA32 allocations with the patch applied. In this
    particular case there was no difference in numa_hit and numa_miss. The
    expectation is that DMA32 was being used at the low watermark instead of
    falling into the slow path. kswapd was not woken, but it is not woken
    for THP allocations anyway.

    On 32-bit, this patch defaults to zone-ordering as low memory depletion
    can be a serious problem on 32-bit large memory machines. If the default
    ordering was node then processes on node 0 will deplete the Normal zone
    due to normal activity. The problem is worse if CONFIG_HIGHPTE is not
    set. If combined with large amounts of dirty/writeback pages in Normal
    zone then there is also a high risk of OOM. The heuristics are removed
    as it's not clear they were ever important on 32-bit. They were only
    relevant for setting node-ordering on 64-bit.
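
    A minimal sketch of the new default (illustrative; the real patch also
    removes the heuristics mentioned above):

        #ifdef CONFIG_64BIT
        /* 64-bit NUMA: prefer locality, lowmem exhaustion is unlikely */
        static int default_zonelist_order(void)
        {
                return ZONELIST_ORDER_NODE;
        }
        #else
        /* 32-bit: protect lowmem from being depleted by node-local activity */
        static int default_zonelist_order(void)
        {
                return ZONELIST_ORDER_ZONE;
        }
        #endif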

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Since 2.6.24 there has been a paranoid check in move_freepages that looks
    up the zone of two pages. This is a very slow path and the only time I've
    seen this bug trigger recently is when memory initialisation was broken
    during patch development. Despite the fact it's a slow path, this patch
    converts the check to a VM_BUG_ON anyway as it has served its purpose by
    now.

    Signed-off-by: Mel Gorman
    Acked-by: David Rientjes
    Acked-by: Rik van Riel
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Page reclaim tests zone_is_reclaim_dirty(), but the site that actually
    sets this state does zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY), sending the
    reader through layers indirection just to track down a simple bit.

    Remove all zone flag wrappers and just use bitops against zone->flags
    directly. It's just as readable and the lines are barely any longer.

    Also rename ZONE_TAIL_LRU_DIRTY to ZONE_DIRTY to match ZONE_WRITEBACK, and
    remove the zone_flags_t typedef.
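
    For illustration, the resulting style is plain bitops on zone->flags
    (sketch, using the renamed ZONE_DIRTY flag):

        /* setting, testing and clearing are now ordinary bitops */
        set_bit(ZONE_DIRTY, &zone->flags);
        if (test_bit(ZONE_DIRTY, &zone->flags))
                clear_bit(ZONE_DIRTY, &zone->flags);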

    Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When entering the page_alloc slowpath, we wakeup kswapd on every pgdat
    according to the zonelist and high_zoneidx. However, this doesn't take
    nodemask into account, and could prematurely wakeup kswapd on some
    unintended nodes.

    This patch uses for_each_zone_zonelist_nodemask() instead of
    for_each_zone_zonelist() in wake_all_kswapds() to avoid the above
    situation.
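
    A minimal sketch of the resulting loop (a simplified wake_all_kswapds();
    the classzone index passed to wakeup_kswapd() is abbreviated here):

        static void wake_all_kswapds(unsigned int order, struct zonelist *zonelist,
                                     enum zone_type high_zoneidx, nodemask_t *nodemask)
        {
                struct zoneref *z;
                struct zone *zone;

                /* honour the allocation's nodemask while walking the zonelist */
                for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx, nodemask)
                        wakeup_kswapd(zone, order, zone_idx(zone));
        }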

    Signed-off-by: Weijie Yang
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
  • Introduce a helper to dump information about a VMA; this also makes
    dump_page_flags more generic and reuses it, so the output looks very
    similar to dump_page:

    [ 61.903437] vma ffff88070f88be00 start 00007fff25970000 end 00007fff25992000
    [ 61.903437] next ffff88070facd600 prev ffff88070face400 mm ffff88070fade000
    [ 61.903437] prot 8000000000000025 anon_vma ffff88070fa1e200 vm_ops (null)
    [ 61.903437] pgoff 7ffffffdd file (null) private_data (null)
    [ 61.909129] flags: 0x100173(read|write|mayread|maywrite|mayexec|growsdown|account)

    [akpm@linux-foundation.org: make dump_vma() require CONFIG_DEBUG_VM]
    [swarren@nvidia.com: fix dump_vma() compilation]
    Signed-off-by: Sasha Levin
    Reviewed-by: Naoya Horiguchi
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Vlastimil Babka
    Cc: Michel Lespinasse
    Cc: Minchan Kim
    Signed-off-by: Stephen Warren
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • The page allocator has gfp flags (like __GFP_WAIT) and alloc flags (like
    ALLOC_CPUSET) that have separate semantics.

    The function allocflags_to_migratetype() actually takes gfp flags, not
    alloc flags, and returns a migratetype. Rename it to
    gfpflags_to_migratetype().

    Signed-off-by: David Rientjes
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Reviewed-by: Naoya Horiguchi
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Async compaction aborts when it detects zone lock contention or
    need_resched() is true. David Rientjes has reported that in practice,
    most direct async compactions for THP allocation abort due to
    need_resched(). This means that a second direct compaction is never
    attempted, which might be OK for a page fault, but khugepaged is
    intended to attempt a sync compaction in such cases, and here it won't.

    This patch replaces "bool contended" in compact_control with an int that
    distinguishes between aborting due to need_resched() and aborting due to
    lock contention. This allows propagating the abort through all compaction
    functions as before, but passing the abort reason up to
    __alloc_pages_slowpath() which decides when to continue with direct
    reclaim and another compaction attempt.

    Another problem is that try_to_compact_pages() did not act upon the
    reported contention (both need_resched() or lock contention) immediately
    and would proceed with another zone from the zonelist. When
    need_resched() is true, that means initializing another zone compaction,
    only to check again need_resched() in isolate_migratepages() and aborting.
    For zone lock contention, the unintended consequence is that the lock
    contended status reported back to the allocator is determined by the
    last zone where compaction was attempted, which is rather arbitrary.

    This patch fixes the problem in the following way:
    - async compaction of a zone aborting due to need_resched() or fatal signal
    pending means that further zones should not be tried. We report
    COMPACT_CONTENDED_SCHED to the allocator.
    - aborting zone compaction due to lock contention means we can still try
    another zone, since it has a different set of locks. We report back
    COMPACT_CONTENDED_LOCK only if compaction was aborted due to lock
    contention in *all* zones where it was attempted.

    As a result of these fixes, khugepaged will proceed with second sync
    compaction as intended, when the preceding async compaction aborted due to
    need_resched(). Page fault compactions aborting due to need_resched()
    will spare some cycles previously wasted by initializing another zone
    compaction only to abort again. Lock contention will be reported only
    when compaction in all zones aborted due to lock contention, and therefore
    it's not a good idea to try again after reclaim.
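
    A sketch of the distinction as an enum carried in compact_control
    (COMPACT_CONTENDED_SCHED and COMPACT_CONTENDED_LOCK are the values
    named above; the "none" value and the propagation code are assumed):

        /* why did async compaction give up? */
        enum compact_contended {
                COMPACT_CONTENDED_NONE,  /* no contention detected */
                COMPACT_CONTENDED_SCHED, /* need_resched() or fatal signal pending */
                COMPACT_CONTENDED_LOCK,  /* zone lock or lru_lock was contended */
        };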

    In stress-highalloc from mmtests configured to use __GFP_NO_KSWAPD, this
    has improved number of THP collapse allocations by 10%, which shows
    positive effect on khugepaged. The benchmark's success rates are
    unchanged as it is not recognized as khugepaged. Numbers of compact_stall
    and compact_fail events have however decreased by 20%, with
    compact_success still a bit improved, which is good. With benchmark
    configured not to use __GFP_NO_KSWAPD, there is 6% improvement in THP
    collapse allocations, and only slight improvement in stalls and failures.

    [akpm@linux-foundation.org: fix warnings]
    Reported-by: David Rientjes
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • isolate_migratepages_range() is the main function of the compaction
    scanner, called either on a single pageblock by isolate_migratepages()
    during regular compaction, or on an arbitrary range by CMA's
    __alloc_contig_migrate_range(). It currently performs two pageblock-wide
    compaction suitability checks, and because of the CMA callpath, it tracks
    whether it crossed a pageblock boundary in order to repeat those checks.

    However, closer inspection shows that those checks are always true for CMA:
    - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
    - migrate_async_suitable() check is skipped because CMA uses sync compaction

    We can therefore move the compaction-specific checks to
    isolate_migratepages() and simplify isolate_migratepages_range().
    Furthermore, we can mimic the freepage scanner family of functions, which
    has isolate_freepages_block() function called both by compaction from
    isolate_freepages() and by CMA from isolate_freepages_range(), where each
    use-case adds own specific glue code. This allows further code
    simplification.

    Thus, we rename isolate_migratepages_range() to
    isolate_migratepages_block() and limit its functionality to a single
    pageblock (or its subset). For CMA, a new different
    isolate_migratepages_range() is created as a CMA-specific wrapper for the
    _block() function. The checks specific to compaction are moved to
    isolate_migratepages(). As part of the unification of these two families
    of functions, we remove the redundant zone parameter where applicable,
    since zone pointer is already passed in cc->zone.

    Furthermore, going back to compact_zone() and compact_finished() when
    pageblock is found unsuitable (now by isolate_migratepages()) is wasteful
    - the checks are meant to skip pageblocks quickly. The patch therefore
    also introduces a simple loop into isolate_migratepages() so that it does
    not return immediately on failed pageblock checks, but keeps going until
    isolate_migratepages_block() gets called once. Similarly to
    isolate_freepages(), the function periodically checks if it needs to
    reschedule or abort async compaction.

    [iamjoonsoo.kim@lge.com: fix isolated page counting bug in compaction]
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The compact_stall vmstat counter counts the number of allocations stalled
    by direct compaction. It does not count when all attempted zones had
    deferred compaction, but it does count when all zones skipped compaction.
    The skipping is decided based on very early check of
    compaction_suitable(), based on watermarks and memory fragmentation.
    Therefore it makes sense not to count skipped compactions as stalls.
    Moreover, compact_success or compact_fail is also already not being
    counted when compaction was skipped, so this patch changes the
    compact_stall counting to match the other two.

    Additionally, restructure __alloc_pages_direct_compact() code for better
    readability.

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • When direct sync compaction is often unsuccessful, it may become deferred
    for some time to avoid further useless attempts, both sync and async.
    Successful high-order allocations un-defer compaction, while further
    unsuccessful compaction attempts prolong the compaction deferred period.

    Currently the checking and setting deferred status is performed only on
    the preferred zone of the allocation that invoked direct compaction. But
    compaction itself is attempted on all eligible zones in the zonelist, so
    the behavior is suboptimal and may lead both to scenarios where 1)
    compaction is attempted uselessly, or 2) where it's not attempted despite
    good chances of succeeding, as shown on the examples below:

    1) A direct compaction with Normal preferred zone failed and set
    deferred compaction for the Normal zone. Another unrelated direct
    compaction with DMA32 as preferred zone will attempt to compact DMA32
    zone even though the first compaction attempt also included DMA32 zone.

    In another scenario, compaction with Normal preferred zone failed to
    compact Normal zone, but succeeded in the DMA32 zone, so it will not
    defer compaction. In the next attempt, it will try Normal zone which
    will fail again, instead of skipping Normal zone and trying DMA32
    directly.

    2) Kswapd will balance DMA32 zone and reset defer status based on
    watermarks looking good. A direct compaction with preferred Normal
    zone will skip compaction of all zones including DMA32 because Normal
    was still deferred. The allocation might have succeeded in DMA32, but
    won't.

    This patch makes compaction deferring work on individual zone basis
    instead of preferred zone. For each zone, it checks compaction_deferred()
    to decide if the zone should be skipped. If watermarks fail after
    compacting the zone, defer_compaction() is called. The zone where
    watermarks passed can still be deferred when the allocation attempt is
    unsuccessful. When allocation is successful, compaction_defer_reset() is
    called for the zone containing the allocated page. This approach should
    approximate calling defer_compaction() only on zones where compaction was
    attempted and did not yield an allocated page. There might be corner
    cases, but that is inevitable as long as the decision to stop compacting
    does not guarantee that a page will be allocated.
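
    A minimal sketch of the per-zone handling (heavily simplified; the
    actual compaction call and error handling are omitted):

        for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx, nodemask) {
                if (compaction_deferred(zone, order))
                        continue;               /* this zone is deferred, skip it */

                /* ... run direct compaction on this zone ... */
        }

        /* after compacting a zone whose watermark check still fails: */
        defer_compaction(zone, order);

        /* when the allocation succeeds, for the zone that provided the page: */
        compaction_defer_reset(zone, order, true);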

    Due to a new COMPACT_DEFERRED return value, some functions relying
    implicitly on COMPACT_SKIPPED = 0 had to be updated, with comments made
    more accurate. The did_some_progress output parameter of
    __alloc_pages_direct_compact() is removed completely, as the caller
    actually does not use it after compaction sets it - it is only considered
    when direct reclaim sets it.

    During testing on a two-node machine with a single very small Normal zone
    on node 1, this patch has improved success rates in stress-highalloc
    mmtests benchmark. The success rates here were previously made worse by
    commit 3a025760fc15 ("mm: page_alloc: spill to remote nodes before
    waking kswapd"), as kswapd was no longer resetting the deferred
    compaction for the Normal zone often enough, and the DMA32 zones on both
    nodes were thus not considered for compaction. On a different machine,
    success rates were
    improved with __GFP_NO_KSWAPD allocations.

    [akpm@linux-foundation.org: fix CONFIG_COMPACTION=n build]
    Signed-off-by: Vlastimil Babka
    Acked-by: Minchan Kim
    Reviewed-by: Zhang Yanfei
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The check for ALLOC_CMA in __alloc_pages_nodemask() derives migratetype
    from gfp_mask in each retry pass, although the migratetype variable
    already has the value determined and it does not change. Use the variable
    and perform the check only once. Also convert #ifdef CONFIG_CMA to
    IS_ENABLED.
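
    A minimal sketch of the resulting check (illustrative; the surrounding
    allocator code is omitted):

        int alloc_flags = ALLOC_WMARK_LOW | ALLOC_CPUSET;
        int migratetype = gfpflags_to_migratetype(gfp_mask);   /* determined once */

        if (IS_ENABLED(CONFIG_CMA) && migratetype == MIGRATE_MOVABLE)
                alloc_flags |= ALLOC_CMA;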

    Signed-off-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: "Srivatsa S. Bhat"
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka