16 Apr, 2015

3 commits

  • mm/compaction.c:250:13: warning: 'suitable_migration_target' defined but not used [-Wunused-function]

    Reported-by: Fengguang Wu
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When compaction is activated via /proc/sys/vm/compact_memory it should
    scan the whole zone. However, some platforms, for instance ARM, have the
    start_pfn of a zone at zero, so the first attempt to compact via /proc
    does nothing. The compaction scanner positions need to be reset first.
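
    For reference, a full-zone compaction can be triggered from userspace by
    writing to this file; a minimal standalone sketch (needs root):

        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
                /* Writing 1 asks the kernel to compact all zones on all nodes. */
                FILE *f = fopen("/proc/sys/vm/compact_memory", "w");

                if (!f) {
                        perror("compact_memory");
                        return EXIT_FAILURE;
                }
                fputs("1\n", f);
                fclose(f);
                return EXIT_SUCCESS;
        }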

    Signed-off-by: Gioh Kim
    Acked-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gioh Kim
     
  • Currently, pages which are marked as unevictable are protected from
    compaction, but not from other types of migration. The POSIX real time
    extension explicitly states that mlock() will prevent a major page
    fault, but the spirit of this is that mlock() should give a process the
    ability to control sources of latency, including minor page faults.
    However, the mlock manpage only explicitly says that a locked page will
    not be written to swap and this can cause some confusion. The
    compaction code today does not give a developer who wants to avoid swap
    but wants to have large contiguous areas available any method to achieve
    this state. This patch introduces a sysctl for controlling compaction
    behavior with respect to the unevictable lru. Users who demand no page
    faults after a page is present can set compact_unevictable_allowed to 0
    and users who need the large contiguous areas can enable compaction on
    locked memory by leaving the default value of 1.

    To illustrate this problem I wrote a quick test program that mmaps a
    large number of 1MB files filled with random data. These maps are
    created locked and read only. Then every other mmap is unmapped and I
    attempt to allocate huge pages to the static huge page pool. When the
    compact_unevictable_allowed sysctl is 0, I cannot allocate hugepages
    after fragmenting memory. When the value is set to 1, allocations
    succeed.
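
    A rough standalone sketch of such a test (paths, sizes and counts are
    illustrative, not the original program; the files are assumed to already
    exist filled with random data, and the run needs a sufficient
    RLIMIT_MEMLOCK):

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define NMAPS 512
        #define MAPSZ (1024 * 1024)

        int main(void)
        {
                void *maps[NMAPS];
                char path[64];
                int i;

                /* Map many 1MB files read-only and locked, pinning their pages. */
                for (i = 0; i < NMAPS; i++) {
                        snprintf(path, sizeof(path), "data/file-%d", i);
                        int fd = open(path, O_RDONLY);
                        if (fd < 0) { perror(path); return 1; }
                        maps[i] = mmap(NULL, MAPSZ, PROT_READ,
                                       MAP_PRIVATE | MAP_LOCKED | MAP_POPULATE, fd, 0);
                        close(fd);
                        if (maps[i] == MAP_FAILED) { perror("mmap"); return 1; }
                }

                /* Unmap every other mapping, leaving memory fragmented by
                 * the remaining locked pages. */
                for (i = 0; i < NMAPS; i += 2)
                        munmap(maps[i], MAPSZ);

                /* Now grow the static pool via /proc/sys/vm/nr_hugepages and
                 * compare the result with compact_unevictable_allowed 0 vs 1. */
                pause();
                return 0;
        }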

    Signed-off-by: Eric B Munson
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Acked-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Thomas Gleixner
    Cc: Christoph Lameter
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     

15 Apr, 2015

1 commit

  • Compaction has an anti-fragmentation rule for finishing: if no freepage is
    found on the buddy list of the requested migratetype, a freepage of at
    least pageblock order must exist before compaction may finish. This is
    meant to mitigate fragmentation, but it does not consider migratetype and
    it is far more restrictive than the page allocator's own
    anti-fragmentation logic.

    Ignoring migratetype can cause compaction to finish prematurely. For
    example, if the allocation request is for the unmovable migratetype, a
    freepage of CMA migratetype does not help that allocation and compaction
    should not stop. The current logic, however, regards this situation as
    "compaction is no longer needed" and finishes.

    Second, the condition is stricter than the page allocator's logic. The
    page allocator can steal a freepage from another migratetype and change a
    pageblock's migratetype under more relaxed conditions; that mechanism is
    designed to prevent fragmentation and can be reused here. Imposing a hard
    constraint only on compaction does not help much, since the page
    allocator would cause fragmentation again anyway.

    To solve these problems, this patch borrows the anti-fragmentation logic
    from the page allocator. It reduces premature compaction finishes in some
    cases and avoids excessive compaction work.

    The stress-highalloc test in mmtests with non-movable order-7 allocations
    shows a considerable increase in compaction success rate.

    Compaction success rate (Compaction success * 100 / Compaction stalls, %)
    31.82 : 42.20
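
    The metric used here and in the later results can be recomputed from the
    compaction counters in /proc/vmstat; a small standalone sketch, assuming
    the compact_stall and compact_success fields exported by kernels of this
    era:

        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
                char name[64];
                unsigned long long val, stall = 0, success = 0;
                FILE *f = fopen("/proc/vmstat", "r");

                if (!f) { perror("/proc/vmstat"); return 1; }
                while (fscanf(f, "%63s %llu", name, &val) == 2) {
                        if (!strcmp(name, "compact_stall"))
                                stall = val;
                        else if (!strcmp(name, "compact_success"))
                                success = val;
                }
                fclose(f);

                /* Compaction success rate = compact_success * 100 / compact_stall */
                if (stall)
                        printf("%.2f%%\n", success * 100.0 / stall);
                return 0;
        }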

    I tested it with five stress-highalloc runs without rebooting and found no
    further degradation in allocation success rate. That roughly means this
    patch does not result in more fragmentation.

    Vlastimil suggested the additional idea of testing for fallbacks only when
    the migration scanner has scanned a whole pageblock. It looked promising
    for fragmentation, because the chance of stealing increases when more
    free pages have been made in a given pageblock. I tested it, but it
    resulted in a decreased compaction success rate, roughly 38.00. My guess
    is that when the system is low on memory, the watermark check can fail
    due to a lack of order-0 free pages, so we sometimes never reach the
    fallback check even though migrate_pfn is aligned to pageblock_nr_pages.
    I could add code to cope with this situation, but it would make the code
    more complicated, so I have not included his idea in this patch.

    [akpm@linux-foundation.org: fix CONFIG_CMA=n build]
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

14 Feb, 2015

1 commit

  • Add kernel address sanitizer hooks to mark the addresses of allocated
    pages as accessible in the corresponding shadow region, and to mark freed
    pages as inaccessible.
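
    As a rough illustration of the shadow scheme these hooks rely on, here is
    a userspace model (not the kernel implementation): each 8 bytes of memory
    map to one shadow byte, 0 meaning accessible and a nonzero poison value
    meaning inaccessible.

        #include <stdio.h>
        #include <string.h>

        #define PAGE_SIZE    4096
        #define SHADOW_SCALE 8          /* one shadow byte covers 8 bytes */
        #define POISON_FREE  0xFF       /* illustrative poison value for freed pages */

        static unsigned char shadow[1 << 20];   /* toy shadow region */

        /* Model of marking a page's shadow accessible on allocation... */
        static void mark_alloc(unsigned long addr, int npages)
        {
                memset(shadow + addr / SHADOW_SCALE, 0,
                       npages * PAGE_SIZE / SHADOW_SCALE);
        }

        /* ...and inaccessible again on free. */
        static void mark_free(unsigned long addr, int npages)
        {
                memset(shadow + addr / SHADOW_SCALE, POISON_FREE,
                       npages * PAGE_SIZE / SHADOW_SCALE);
        }

        int main(void)
        {
                mark_alloc(0x10000, 1);
                mark_free(0x10000, 1);
                printf("shadow byte after free: 0x%02x\n",
                       shadow[0x10000 / SHADOW_SCALE]);
                return 0;
        }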

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

13 Feb, 2015

3 commits

  • The vmstat interfaces are good at hiding negative counts (at least when
    CONFIG_SMP); but if you peer behind the curtain, you find that
    nr_isolated_anon and nr_isolated_file soon go negative, and grow ever
    more negative: so they can absorb larger and larger numbers of isolated
    pages, yet still appear to be zero.

    I'm happy to avoid a congestion_wait() when too_many_isolated() myself;
    but I guess it's there for a good reason, in which case we ought to get
    too_many_isolated() working again.

    The imbalance comes from isolate_migratepages()'s ISOLATE_ABORT case:
    putback_movable_pages() decrements the NR_ISOLATED counts, but we forgot
    to call acct_isolated() to increment them.

    It is possible that the bug which this patch fixes could cause OOM kills
    when the system still has a lot of reclaimable page cache.

    Fixes: edc2ca612496 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()")
    Signed-off-by: Hugh Dickins
    Acked-by: Vlastimil Babka
    Acked-by: Joonsoo Kim
    Cc: [3.18+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Currently, freepage isolation in one pageblock doesn't consider how many
    freepages we have already isolated. When I traced the flow of compaction,
    it sometimes isolated more than 256 freepages in order to migrate just 32
    pages.

    With this patch, freepage isolation stops once we have isolated more
    freepages than pages to migrate. This slows down the free page scanner
    and makes the compaction success rate higher.

    The stress-highalloc test in mmtests with non-movable order-7 allocations
    shows an increase in compaction success rate.

    Compaction success rate (Compaction success * 100 / Compaction stalls, %)
    27.13 : 31.82

    PFN where both scanners meet on compaction completion
    (separate test due to enormous tracepoint buffer)
    (zone_start=4096, zone_end=1048576)
    586034 : 654378

    In fact, I don't fully understand why this patch produces such a good
    result. One concern was that the unused freepages are released to the pcp
    list and won't be isolated again on the next compaction attempt, which
    could decrease the compaction success rate. To counter this, I tested
    with pcp drain code added to release_freepages(), but it brought no
    improvement.

    In any case, this patch avoids wasting time isolating freepages that are
    not needed, so it seems reasonable.

    Vlastimil said:

    : I briefly tried it on top of the pivot-changing series and with order-9
    : allocations it reduced free page scanned counter by almost 10%. No effect
    : on success rates (maybe because pivot changing already took care of the
    : scanners meeting problem) but the scanning reduction is good on its own.
    :
    : It also explains why e14c720efdd7 ("mm, compaction: remember position
    : within pageblock in free pages scanner") had less than expected
    : improvements. It would only actually stop within pageblock in case of
    : async compaction detecting contention. I guess that's also why the
    : infinite loop problem fixed by 1d5bfe1ffb5b affected so relatively few
    : people.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Tested-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • What we want to check here is whether a high-order freepage exists on the
    buddy list of another migratetype, so that we can steal it without causing
    fragmentation. But the current code just checks cc->order, which is the
    allocation request order, so the check is wrong.

    Without this fix, non-movable synchronous compaction below pageblock order
    would not be stopped until compaction completes, because most pageblocks
    have movable migratetype and the high-order freepages made by compaction
    usually end up on the movable buddy list.

    There is a report related to this bug; see the link below.

    http://www.spinics.net/lists/linux-mm/msg81666.html

    Although the affected system still shows load spikes coming from
    compaction, this change makes it completely stable and responsive
    according to the reporter.

    The stress-highalloc test in mmtests with non-movable order-7 allocations
    doesn't show any notable difference in allocation success rate, but it
    shows a higher compaction success rate.

    Compaction success rate (Compaction success * 100 / Compaction stalls, %)
    18.47 : 28.94

    Fixes: 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page immediately when it is made available")
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: [3.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

12 Feb, 2015

5 commits

  • The compaction deferring logic is a heavy hammer that blocks the way to
    compaction. It doesn't consider overall system state, so it can wrongly
    prevent a user from compacting. In other words, even if the system has
    enough memory to compact, compaction may be skipped because of the
    deferring logic. This patch adds a new tracepoint to help understand how
    the deferring logic works. It will also help in checking compaction
    success and failure.

    Signed-off-by: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • It is not easy to analyze when and why compaction starts or finishes, or
    doesn't. With these new tracepoints, we can learn much more about the
    start/finish reasons of compaction. I found the following bug with these
    tracepoints.

    http://www.spinics.net/lists/linux-mm/msg81582.html

    Signed-off-by: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • It'd be useful to know the current range where compaction works, for
    detailed analysis. With it, we can know which pageblock we actually scan
    and isolate, how many pages we try in that pageblock, and roughly guess
    why it doesn't become a freepage of pageblock order.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We now have a tracepoint for the begin event of compaction that prints
    the start position of both scanners, but the tracepoint for the end event
    doesn't print the finish position of both scanners. It'd also be useful
    to know the finish positions, so this patch adds them. It will help in
    finding odd behavior or problems in compaction's internal logic.

    The mode is also added to both begin/end tracepoint output, since
    compaction behavior differs greatly depending on the mode.

    Lastly, the status format is changed to a string rather than a status
    number, for readability.

    [akpm@linux-foundation.org: fix sparse warning]
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Expand the usage of the struct alloc_context introduced in the previous
    patch to also cover calling try_to_compact_pages(), to reduce the number
    of its parameters. Since the function is in a different compilation unit,
    we need to move the alloc_context definition to the shared mm/internal.h
    header.

    With this change we get simpler code and small savings of code size and stack
    usage:

    add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-27 (-27)
    function old new delta
    __alloc_pages_direct_compact 283 256 -27
    add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-13 (-13)
    function old new delta
    try_to_compact_pages 582 569 -13

    Stack usage of __alloc_pages_direct_compact goes from 24 to none (per
    scripts/checkstack.pl).

    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Zhang Yanfei
    Cc: Minchan Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Cc: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

11 Dec, 2014

5 commits

  • The goal of memory compaction is to create high-order freepages through
    page migration. Page migration however puts pages on the per-cpu lru_add
    cache, which is later flushed to per-cpu pcplists, and only after pcplists
    are drained the pages can actually merge. This can happen due to the
    per-cpu caches becoming full through further freeing, or explicitly.

    During direct compaction, it is useful to do the draining explicitly so
    that pages merge as soon as possible and compaction can detect success
    immediately and keep the latency impact at minimum. However the current
    implementation is far from ideal. Draining is done only in
    __alloc_pages_direct_compact(), after all zones were already compacted,
    and the decisions to continue or stop compaction in individual zones was
    done without the last batch of migrations being merged. It is also
    missing the draining of lru_add cache before the pcplists.

    This patch moves the draining for direct compaction into compact_zone().
    It adds the missing lru_cache draining and uses the newly introduced
    single zone pcplists draining to reduce overhead and avoid impact on
    unrelated zones. Draining is only performed when it can actually lead to
    merging of a page of desired order (passed by cc->order). This means it
    is only done when migration occurred in the previously scanned cc->order
    aligned block(s) and the migration scanner is now pointing to the next
    cc->order aligned block.
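
    A rough sketch of that trigger condition (an illustrative standalone
    model, not the kernel code; the names only approximate the compact_control
    fields described above):

        #include <stdbool.h>
        #include <stdio.h>

        /* Drain pcplists only when the migration scanner has moved past the
         * cc->order aligned block where the last migration happened, so that
         * the drained pages can actually merge into a page of the desired
         * order. */
        static bool should_drain(unsigned long migrate_pfn,
                                 unsigned long last_migrated_pfn, int order)
        {
                unsigned long block_mask = ~((1UL << order) - 1);

                return last_migrated_pfn &&
                       (migrate_pfn & block_mask) > last_migrated_pfn;
        }

        int main(void)
        {
                /* order-9 blocks: scanner at 0x1200 has left the block of pfn 0x1050 */
                printf("%d\n", should_drain(0x1200, 0x1050, 9));
                printf("%d\n", should_drain(0x1080, 0x1050, 9)); /* still same block */
                return 0;
        }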

    The patch has been tested with stress-highalloc benchmark from mmtests.
    Although overall allocation success rates of the benchmark were not
    affected, the number of detected compaction successes has doubled. This
    suggests that allocations were previously successful due to implicit
    merging caused by background activity, making a later allocation attempt
    succeed immediately, but not attributing the success to compaction. Since
    stress-highalloc always tries to allocate almost the whole memory, it
    cannot show the improvement in its reported success rate metric. However
    after this patch, compaction should detect success and terminate earlier,
    reducing the direct compaction latencies in a real scenario.

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Compaction caches the migration and free scanner positions between
    compaction invocations, so that the whole zone gets eventually scanned and
    there is no bias towards the initial scanner positions at the
    beginning/end of the zone.

    The cached positions are continuously updated as scanners progress and the
    updating stops as soon as a page is successfully isolated. The reasoning
    behind this is that a pageblock where isolation succeeded is likely to
    succeed again in near future and it should be worth revisiting it.

    However, the downside is that potentially many pages are rescanned without
    successful isolation. At worst, there might be a page where isolation
    from LRU succeeds but migration fails (potentially always). So upon
    encountering this page, cached position would always stop being updated
    for no good reason. It might have been useful to let such page be
    rescanned with sync compaction after async one failed, but this is now
    handled by caching scanner position for async and sync mode separately
    since commit 35979ef33931 ("mm, compaction: add per-zone migration pfn
    cache for async compaction").

    After this patch, cached positions are updated unconditionally. In
    stress-highalloc benchmark, this has decreased the numbers of scanned
    pages by few percent, without affecting allocation success rates.

    To prevent free scanner from leaving free pages behind after they are
    returned due to page migration failure, the cached scanner pfn is changed
    to point to the pageblock of the returned free page with the highest pfn,
    before leaving compact_zone().

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Deferred compaction is employed to avoid compacting zone where sync direct
    compaction has recently failed. As such, it makes sense to only defer
    when a full zone was scanned, which is when compact_zone returns with
    COMPACT_COMPLETE. It's less useful to defer when compact_zone returns
    with apparent success (COMPACT_PARTIAL), followed by a watermark check
    failure, which can happen due to parallel allocation activity. It also
    does not make much sense to defer compaction which was completely skipped
    (COMPACT_SKIPPED) for being unsuitable in the first place.

    This patch therefore makes deferred compaction trigger only when
    COMPACT_COMPLETE is returned from compact_zone(). Results of
    stress-highalloc benchmark show the difference is within measurement error,
    so the issue is rather cosmetic.

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Since commit 53853e2d2bfb ("mm, compaction: defer each zone individually
    instead of preferred zone"), compaction is deferred for each zone where
    sync direct compaction fails, and reset where it succeeds. However, it
    was observed that for DMA zone compaction often appeared to succeed
    while subsequent allocation attempt would not, due to different outcome
    of watermark check.

    In order to properly defer compaction in this zone, the candidate zone
    has to be passed back to __alloc_pages_direct_compact() and compaction
    deferred in the zone after the allocation attempt fails.

    The large source of mismatch between watermark check in compaction and
    allocation was the lack of alloc_flags and classzone_idx values in
    compaction, which has been fixed in the previous patch. So with this
    problem fixed, we can simplify the code by removing the candidate_zone
    parameter and deferring in __alloc_pages_direct_compact().

    After this patch, the compaction activity during stress-highalloc
    benchmark is still somewhat increased, but it's negligible compared to the
    increase that occurred without the better watermark checking. This
    suggests that it is still possible to apparently succeed in compaction but
    fail to allocate, possibly due to parallel allocation activity.

    [akpm@linux-foundation.org: fix build]
    Suggested-by: Joonsoo Kim
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Compaction relies on zone watermark checks for decisions such as if it's
    worth to start compacting in compaction_suitable() or whether compaction
    should stop in compact_finished(). The watermark checks take
    classzone_idx and alloc_flags parameters, which are related to the memory
    allocation request. But from the context of compaction they are currently
    passed as 0, including the direct compaction which is invoked to satisfy
    the allocation request, and could therefore know the proper values.

    The lack of proper values can lead to mismatch between decisions taken
    during compaction and decisions related to the allocation request. Lack
    of proper classzone_idx value means that lowmem_reserve is not taken into
    account. This has manifested (during recent changes to deferred
    compaction) when DMA zone was used as fallback for preferred Normal zone.
    compaction_suitable() without proper classzone_idx would think that the
    watermarks are already satisfied, but watermark check in
    get_page_from_freelist() would fail. Because of this problem, deferring
    compaction has extra complexity that can be removed in the following
    patch.

    The issue (not confirmed in practice) with missing alloc_flags is opposite
    in nature. For allocations that include ALLOC_HIGH, ALLOC_HARDER or
    ALLOC_CMA in alloc_flags (the last includes all MOVABLE allocations on
    CMA-enabled systems) the watermark checking in compaction with 0 passed
    will be stricter than in get_page_from_freelist(). In these cases
    compaction might be running for a longer time than is really needed.

    Another issue in compaction_suitable() is that the check for "does the
    zone need compaction at all?" comes only after the check "does the zone
    have enough free pages to succeed compaction?". The latter considers extra
    pages for migration and can therefore in some situations fail and return
    COMPACT_SKIPPED, although the high-order allocation would succeed and we
    should return COMPACT_PARTIAL.
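
    A standalone model of the two checks in the corrected order (watermark
    details are simplified away; the 2 << order margin follows the heuristic
    used in this era, but treat the whole sketch as illustrative):

        #include <stdbool.h>
        #include <stdio.h>

        enum compact_result { COMPACT_SKIPPED, COMPACT_PARTIAL, COMPACT_CONTINUE };

        static enum compact_result suitable(unsigned long free_pages,
                                            unsigned long low_wmark,
                                            bool order_page_already_free,
                                            int order)
        {
                /* 1. If a page of the requested order is already free and the
                 *    watermark is met, the allocation should just succeed. */
                if (order_page_already_free && free_pages > low_wmark)
                        return COMPACT_PARTIAL;

                /* 2. Compaction itself needs spare order-0 pages to migrate
                 *    into, roughly 2 << order on top of the watermark. */
                if (free_pages < low_wmark + (2UL << order))
                        return COMPACT_SKIPPED;

                return COMPACT_CONTINUE;
        }

        int main(void)
        {
                printf("%d\n", suitable(1200, 1024, false, 9)); /* 0: skipped */
                printf("%d\n", suitable(4096, 1024, false, 9)); /* 2: continue */
                printf("%d\n", suitable(4096, 1024, true, 9));  /* 1: already fits */
                return 0;
        }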

    This patch fixes these problems by adding alloc_flags and classzone_idx to
    struct compact_control and related functions involved in direct compaction
    and watermark checking. Where possible, all other callers of
    compaction_suitable() pass proper values where those are known. This is
    currently limited to classzone_idx, which is sometimes known in kswapd
    context. However, the direct reclaim callers should_continue_reclaim()
    and compaction_ready() do not currently know the proper values, so the
    coordination between reclaim and compaction may still not be as accurate
    as it could be. This can be fixed later, if it's shown to be an issue.

    Additionally, the checks in compaction_suitable() are reordered to
    address the second issue described above.

    The effect of this patch should be slightly better high-order allocation
    success rates and/or less compaction overhead, depending on the type of
    allocations and presence of CMA. It allows simplifying deferred
    compaction code in a followup patch.

    When testing with stress-highalloc, there was some slight improvement
    (which might be just due to variance) in success rates of non-THP-like
    allocations.

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

14 Nov, 2014

2 commits

  • Several people have reported occasionally seeing processes stuck in
    compact_zone(), even triggering soft lockups, in 3.18-rc2+.

    Testing a revert of commit e14c720efdd7 ("mm, compaction: remember
    position within pageblock in free pages scanner") fixed the issue,
    although the stuck processes do not appear to involve the free scanner.

    Finally, by code inspection, the bug was found in isolate_migratepages()
    which uses a slightly different condition to detect if the migration and
    free scanners have met, than compact_finished(). That has not been a
    problem until commit e14c720efdd7 allowed the free scanner position
    between individual invocations to be in the middle of a pageblock.

    In a relatively rare case, the migration scanner position can end up at
    the beginning of a pageblock, with the free scanner position in the
    middle of the same pageblock. If it's the migration scanner's turn,
    isolate_migratepages() exits immediately (without updating the
    position), while compact_finished() decides to continue compaction,
    resulting in a potentially infinite loop. The system can recover only
    if another process creates enough high-order pages to make the watermark
    checks in compact_finished() pass.

    This patch fixes the immediate problem by bumping the migration
    scanner's position to meet the free scanner in isolate_migratepages(),
    when both are within the same pageblock. This causes compact_finished()
    to terminate properly. A more robust check in compact_finished() is
    planned as a cleanup for better future maintainability.

    Fixes: e14c720efdd73 ("mm, compaction: remember position within pageblock in free pages scanner")
    Signed-off-by: Vlastimil Babka
    Reported-by: P. Christeas
    Tested-by: P. Christeas
    Link: http://marc.info/?l=linux-mm&m=141508604232522&w=2
    Reported-by: Norbert Preining
    Tested-by: Norbert Preining
    Link: https://lkml.org/lkml/2014/11/4/904
    Reported-by: Pavel Machek
    Link: https://lkml.org/lkml/2014/11/7/164
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Commit 7d49d8868336 ("mm, compaction: reduce zone checking frequency in
    the migration scanner") has a side-effect that changes the iteration
    range calculation. Before the change, block_end_pfn is calculated using
    start_pfn, but now it blindly adds pageblock_nr_pages to the previous
    value.

    This causes the problem that isolation_start_pfn is larger than
    block_end_pfn when we isolate the page with more than pageblock order.
    In this case, isolation would fail due to an invalid range parameter.

    To prevent this, this patch skips the range until a proper target
    pageblock is reached. Without this patch, CMA with more than pageblock
    order always fails, but with this patch it succeeds.
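
    A small standalone illustration of the range arithmetic involved (the
    pageblock size is illustrative; on x86 a pageblock is typically order-9,
    i.e. 512 pages):

        #include <stdio.h>

        #define pageblock_nr_pages 512UL   /* illustrative: order-9 pageblocks */

        /* End pfn of the pageblock containing pfn, i.e. pfn rounded up to the
         * next pageblock boundary. Blindly doing "previous block_end +
         * pageblock_nr_pages" instead can leave block_end below a start pfn
         * further into the range, producing an invalid (end < start) window. */
        static unsigned long block_end_pfn(unsigned long pfn)
        {
                return (pfn / pageblock_nr_pages + 1) * pageblock_nr_pages;
        }

        int main(void)
        {
                unsigned long start_pfn = 1500; /* mid-pageblock start, as with CMA ranges */

                printf("block_end for pfn %lu is %lu\n",
                       start_pfn, block_end_pfn(start_pfn));
                return 0;
        }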

    Signed-off-by: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

30 Oct, 2014

1 commit

  • Commit edc2ca612496 ("mm, compaction: move pageblock checks up from
    isolate_migratepages_range()") commonizes the isolate_migratepages
    variants and makes them use isolate_migratepages_block().

    isolate_migratepages_block() can stop execution when enough pages are
    isolated, but there is no code in isolate_migratepages_range() to handle
    this case. As a result, even if isolate_migratepages_block() returns
    prematurely without checking all pages in the range, it is called again
    on the following pageblock and some pages in the previous range are never
    checked. CMA then fails frequently because of this.

    To fix this problem, this patch lets isolate_migratepages_range() know
    when enough pages have been isolated and stop the isolation in that case.

    Note that isolate_migratepages() has no such problem, because it always
    stops the isolation after just one call to isolate_migratepages_block().

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Minchan Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

10 Oct, 2014

13 commits

  • Sasha Levin reported a KASAN splash inside isolate_migratepages_range().
    The problem is in the function __is_movable_balloon_page(), which tests
    AS_BALLOON_MAP in page->mapping->flags. This function has no protection
    against anonymous pages, so as a result it tried to check address space
    flags inside struct anon_vma.

    Further investigation shows more problems in current implementation:

    * The special branch in __unmap_and_move() never works:
    balloon_page_movable() checks page flags and page_count. In
    __unmap_and_move() the page is locked and its reference counter is
    elevated, thus balloon_page_movable() always fails. As a result execution
    goes down the normal migration path. virtballoon_migratepage() returns
    MIGRATEPAGE_BALLOON_SUCCESS instead of MIGRATEPAGE_SUCCESS,
    move_to_new_page() thinks this is an error code and assigns
    newpage->mapping to NULL. The newly migrated page loses its connection
    with the balloon and all ability for further migration.

    * lru_lock is erroneously required in isolate_migratepages_range() for
    isolating ballooned pages. This function releases lru_lock periodically,
    which makes migration mostly impossible for some pages.

    * balloon_page_dequeue has a tight race with balloon_page_isolate:
    balloon_page_isolate can run in parallel with dequeue between picking the
    page from the list and taking the page lock. The race is rare because
    both use trylock_page() for locking.

    This patch fixes all of them.

    Instead of a fake mapping with a special flag, this patch uses a special
    state of page->_mapcount: PAGE_BALLOON_MAPCOUNT_VALUE = -256. The buddy
    allocator uses PAGE_BUDDY_MAPCOUNT_VALUE = -128 for a similar purpose.
    Storing the mark directly in struct page makes everything safer and
    easier.

    PagePrivate is used to mark pages present in the page list (i.e. not
    isolated, like PageLRU for normal pages). It replaces the special rules
    for the reference counter and makes balloon migration similar to the
    migration of normal pages. This flag is protected by the page lock
    together with the link to the balloon device.

    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Sasha Levin
    Link: http://lkml.kernel.org/p/53E6CEAA.9020105@oracle.com
    Cc: Rafael Aquini
    Cc: Andrey Ryabinin
    Cc: [3.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • C mm/compaction.o
    mm/compaction.c: In function isolate_freepages_block:
    mm/compaction.c:364:37: warning: flags may be used uninitialized in this function [-Wmaybe-uninitialized]
    && compact_unlock_should_abort(&cc->zone->lock, flags,
    ^

    Signed-off-by: Xiubo Li
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Minchan Kim
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiubo Li
     
  • struct compact_control currently converts the gfp mask to a migratetype,
    but we need the entire gfp mask in a follow-up patch.

    Pass the entire gfp mask as part of struct compact_control.

    Signed-off-by: David Rientjes
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The page allocator has gfp flags (like __GFP_WAIT) and alloc flags (like
    ALLOC_CPUSET) that have separate semantics.

    The function allocflags_to_migratetype() actually takes gfp flags, not
    alloc flags, and returns a migratetype. Rename it to
    gfpflags_to_migratetype().

    Signed-off-by: David Rientjes
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Reviewed-by: Naoya Horiguchi
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The migration scanner skips PageBuddy pages, but does not consider their
    order as checking page_order() is generally unsafe without holding the
    zone->lock, and acquiring the lock just for the check wouldn't be a good
    tradeoff.

    Still, this could avoid some iterations over the rest of the buddy page,
    and if we are careful, the race window between PageBuddy() check and
    page_order() is small, and the worst thing that can happen is that we skip
    too much and miss some isolation candidates. This is not that bad, as
    compaction can already fail for many other reasons like parallel
    allocations, and those have much larger race window.

    This patch therefore makes the migration scanner obtain the buddy page
    order and use it to skip the whole buddy page, if the order appears to be
    in the valid range.

    It's important that the page_order() is read only once, so that the value
    used in the checks and in the pfn calculation is the same. But in theory
    the compiler can replace the local variable by multiple inlines of
    page_order(). Therefore, the patch introduces page_order_unsafe() that
    uses ACCESS_ONCE to prevent this.
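
    For illustration, a userspace stand-in for that single-read pattern
    (ACCESS_ONCE is re-implemented here; MAX_ORDER and the skip arithmetic
    are simplified):

        #include <stdio.h>

        /* Userspace stand-in for the kernel's ACCESS_ONCE(): force a single
         * read through a volatile pointer so the compiler cannot re-read. */
        #define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

        #define MAX_ORDER 11   /* illustrative */

        static unsigned int stored_order = 5;  /* stands in for page_order() storage */

        int main(void)
        {
                unsigned long pfn = 4096;

                /* Read the (possibly racy) order exactly once and reuse the
                 * same value for both the validity check and the pfn skip. */
                unsigned int order = ACCESS_ONCE(stored_order);

                if (order > 0 && order < MAX_ORDER)
                        pfn += (1UL << order) - 1; /* skip rest of the buddy page */

                printf("next pfn to look at: %lu\n", pfn + 1);
                return 0;
        }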

    Testing with stress-highalloc from mmtests shows a 15% reduction in number
    of pages scanned by migration scanner. The reduction is >60% with
    __GFP_NO_KSWAPD allocations, along with success rates better by few
    percent.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Unlike the migration scanner, the free scanner remembers the beginning of
    the last scanned pageblock in cc->free_pfn. It might therefore be
    rescanning pages uselessly when called several times during a single
    compaction. This might have been useful when pages were returned to the
    buddy allocator after a failed migration, but this is no longer the case.

    This patch changes the meaning of cc->free_pfn so that if it points to a
    middle of a pageblock, that pageblock is scanned only from cc->free_pfn to
    the end. isolate_freepages_block() will record the pfn of the last page
    it looked at, which is then used to update cc->free_pfn.

    In the mmtests stress-highalloc benchmark, this has resulted in lowering
    the ratio between pages scanned by both scanners, from 2.5 free pages per
    migrate page, to 2.25 free pages per migrate page, without affecting
    success rates.

    With __GFP_NO_KSWAPD allocations, this appears to result in a worse ratio
    (2.1 instead of 1.8), but page migration successes increased by 10%, so
    this could mean that more useful work can be done until need_resched()
    aborts this kind of compaction.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Reviewed-by: Naoya Horiguchi
    Acked-by: David Rientjes
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Compaction scanners try to take the zone locks as late as possible by
    checking many page or pageblock properties opportunistically without the
    lock and skipping them if they are not suitable. For pages that pass the
    initial checks, some properties have to be checked again safely under the
    lock. However, if the lock was already held from a previous iteration in
    the initial checks, the rechecks are unnecessary.

    This patch therefore skips the rechecks when the lock was already held.
    This is now possible to do, since we don't (potentially) drop and
    reacquire the lock between the initial checks and the safe rechecks
    anymore.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Reviewed-by: Naoya Horiguchi
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Compaction scanners regularly check for lock contention and need_resched()
    through the compact_checklock_irqsave() function. However, if there is no
    contention, the lock can be held and IRQs disabled for a potentially long
    time.

    This has been addressed by commit b2eef8c0d091 ("mm: compaction: minimise
    the time IRQs are disabled while isolating pages for migration") for the
    migration scanner. However, the refactoring done by commit 2a1402aa044b
    ("mm: compaction: acquire the zone->lru_lock as late as possible") has
    changed the conditions so that the lock is dropped only when there's
    contention on the lock or need_resched() is true. Also, need_resched() is
    checked only when the lock is already held. The comment "give a chance to
    irqs before checking need_resched" is therefore misleading, as IRQs remain
    disabled when the check is done.

    This patch restores the behavior intended by commit b2eef8c0d091 and also
    tries to better balance and make more deterministic the time spent by
    checking for contention vs the time the scanners might run between the
    checks. It also avoids situations where checking has not been done often
    enough before. The result should be avoiding both too frequent and too
    infrequent contention checking, and especially the potentially
    long-running scans with IRQs disabled and no checking of need_resched() or
    for fatal signal pending, which can happen when many consecutive pages or
    pageblocks fail the preliminary tests and do not reach the later call site
    to compact_checklock_irqsave(), as explained below.

    Before the patch:

    In the migration scanner, compact_checklock_irqsave() was called each
    loop, if reached. If not reached, some lower-frequency checking could
    still be done if the lock was already held, but this would not result in
    aborting contended async compaction until reaching
    compact_checklock_irqsave() or end of pageblock. In the free scanner, it
    was similar but completely without the periodical checking, so lock can be
    potentially held until reaching the end of pageblock.

    After the patch, in both scanners:

    The periodical check is done as the first thing in the loop on each
    SWAP_CLUSTER_MAX aligned pfn, using the new compact_unlock_should_abort()
    function, which always unlocks the lock (if locked) and aborts async
    compaction if scheduling is needed. It also aborts any type of compaction
    when a fatal signal is pending.
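
    The resulting loop structure, reduced to a runnable skeleton (the abort
    test is stubbed out; SWAP_CLUSTER_MAX is 32 in kernels of this era):

        #include <stdbool.h>
        #include <stdio.h>

        #define SWAP_CLUSTER_MAX 32

        /* Stub for compact_unlock_should_abort(): drop the lock if held and
         * decide whether to abort (async compaction and need_resched(), or a
         * fatal signal). Always "keep going" in this toy version. */
        static bool unlock_should_abort(bool *locked)
        {
                *locked = false;
                return false;
        }

        int main(void)
        {
                bool locked = false;
                unsigned long pfn, scanned = 0;

                for (pfn = 0; pfn < 1024; pfn++) {
                        /* Periodic check, first thing in the loop, on every
                         * SWAP_CLUSTER_MAX aligned pfn. */
                        if (!(pfn % SWAP_CLUSTER_MAX) && unlock_should_abort(&locked))
                                break;
                        scanned++;      /* ...per-page work would go here... */
                }
                printf("scanned %lu pfns\n", scanned);
                return 0;
        }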

    The compact_checklock_irqsave() function is replaced with a slightly
    different compact_trylock_irqsave(). The biggest difference is that the
    function is not called at all if the lock is already held. The periodical
    need_resched() checking is left solely to compact_unlock_should_abort().
    The lock contention avoidance for async compaction is achieved by the
    periodical unlock by compact_unlock_should_abort() and by using trylock in
    compact_trylock_irqsave() and aborting when trylock fails. Sync
    compaction does not use trylock.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Async compaction aborts when it detects zone lock contention or
    need_resched() is true. David Rientjes has reported that in practice,
    most direct async compactions for THP allocation abort due to
    need_resched(). This means that a second direct compaction is never
    attempted, which might be OK for a page fault, but khugepaged is intended
    to attempt a sync compaction in such a case, and here it won't.

    This patch replaces "bool contended" in compact_control with an int that
    distinguishes between aborting due to need_resched() and aborting due to
    lock contention. This allows propagating the abort through all compaction
    functions as before, but passing the abort reason up to
    __alloc_pages_slowpath() which decides when to continue with direct
    reclaim and another compaction attempt.

    Another problem is that try_to_compact_pages() did not act upon the
    reported contention (both need_resched() or lock contention) immediately
    and would proceed with another zone from the zonelist. When
    need_resched() is true, that means initializing another zone compaction,
    only to check again need_resched() in isolate_migratepages() and aborting.
    For zone lock contention, the unintended consequence is that the lock
    contended status reported back to the allocator is determined by the last
    zone where compaction was attempted, which is rather arbitrary.

    This patch fixes the problem in the following way:
    - async compaction of a zone aborting due to need_resched() or a pending
      fatal signal means that further zones should not be tried. We report
      COMPACT_CONTENDED_SCHED to the allocator.
    - aborting zone compaction due to lock contention means we can still try
      another zone, since it has a different set of locks. We report back
      COMPACT_CONTENDED_LOCK only if compaction was aborted due to lock
      contention in *all* zones where it was attempted.

    As a result of these fixes, khugepaged will proceed with second sync
    compaction as intended, when the preceding async compaction aborted due to
    need_resched(). Page fault compactions aborting due to need_resched()
    will spare some cycles previously wasted by initializing another zone
    compaction only to abort again. Lock contention will be reported only
    when compaction in all zones aborted due to lock contention, and therefore
    it's not a good idea to try again after reclaim.

    In stress-highalloc from mmtests configured to use __GFP_NO_KSWAPD, this
    has improved number of THP collapse allocations by 10%, which shows
    positive effect on khugepaged. The benchmark's success rates are
    unchanged as it is not recognized as khugepaged. Numbers of compact_stall
    and compact_fail events have however decreased by 20%, with
    compact_success still a bit improved, which is good. With benchmark
    configured not to use __GFP_NO_KSWAPD, there is 6% improvement in THP
    collapse allocations, and only slight improvement in stalls and failures.

    [akpm@linux-foundation.org: fix warnings]
    Reported-by: David Rientjes
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The unification of the migrate and free scanner families of function has
    highlighted a difference in how the scanners ensure they only isolate
    pages of the intended zone. This is important for taking zone lock or lru
    lock of the correct zone. Due to nodes overlapping, it is however
    possible to encounter a different zone within the range of the zone being
    compacted.

    The free scanner, since its inception by commit 748446bb6b5a ("mm:
    compaction: memory compaction core"), has been checking the zone of the
    first valid page in a pageblock, and skipping the whole pageblock if the
    zone does not match.

    This checking was completely missing from the migration scanner at first,
    and later added by commit dc9086004b3d ("mm: compaction: check for
    overlapping nodes during isolation for migration") in a reaction to a bug
    report. But the zone comparison in the migration scanner is done once per
    scanned page, which is more defensive and thus more costly than a check
    per pageblock.

    This patch unifies the checking done in both scanners to once per
    pageblock, through a new pageblock_pfn_to_page() function, which also
    includes pfn_valid() checks. It is more defensive than the current free
    scanner checks, as it checks both the first and last page of the
    pageblock, but less defensive than the migration scanner's per-page
    checks.
    It assumes that node overlapping may result (on some architecture) in a
    boundary between two nodes falling into the middle of a pageblock, but
    that there cannot be a node0 node1 node0 interleaving within a single
    pageblock.

    The result is more code being shared and a bit less per-page CPU cost in
    the migration scanner.

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • isolate_migratepages_range() is the main function of the compaction
    scanner, called either on a single pageblock by isolate_migratepages()
    during regular compaction, or on an arbitrary range by CMA's
    __alloc_contig_migrate_range(). It currently performs two pageblock-wide
    compaction suitability checks, and because of the CMA callpath, it tracks
    if it crossed a pageblock boundary in order to repeat those checks.

    However, closer inspection shows that those checks are always true for CMA:
    - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
    - migrate_async_suitable() check is skipped because CMA uses sync compaction

    We can therefore move the compaction-specific checks to
    isolate_migratepages() and simplify isolate_migratepages_range().
    Furthermore, we can mimic the freepage scanner family of functions, which
    has isolate_freepages_block() function called both by compaction from
    isolate_freepages() and by CMA from isolate_freepages_range(), where each
    use-case adds its own specific glue code. This allows further code
    simplification.

    Thus, we rename isolate_migratepages_range() to
    isolate_migratepages_block() and limit its functionality to a single
    pageblock (or its subset). For CMA, a new different
    isolate_migratepages_range() is created as a CMA-specific wrapper for the
    _block() function. The checks specific to compaction are moved to
    isolate_migratepages(). As part of the unification of these two families
    of functions, we remove the redundant zone parameter where applicable,
    since zone pointer is already passed in cc->zone.

    Furthermore, going back to compact_zone() and compact_finished() when a
    pageblock is found unsuitable (now by isolate_migratepages()) is wasteful
    - the checks are meant to skip pageblocks quickly. The patch therefore
    also introduces a simple loop into isolate_migratepages() so that it does
    not return immediately on failed pageblock checks, but keeps going until
    isolate_migratepages_block() gets called once. Similarly to
    isolate_freepages(), the function periodically checks if it needs to
    reschedule or abort async compaction.

    [iamjoonsoo.kim@lge.com: fix isolated page counting bug in compaction]
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • isolate_freepages_block() rechecks if the pageblock is suitable to be a
    target for migration after it has taken the zone->lock. However, the
    check has been optimized to occur only once per pageblock, and
    compact_checklock_irqsave() might be dropping and reacquiring lock, which
    means somebody else might have changed the pageblock's migratetype
    meanwhile.

    Furthermore, nothing prevents the migratetype to change right after
    isolate_freepages_block() has finished isolating. Given how imperfect
    this is, it's simpler to just rely on the check done in
    isolate_freepages() without lock, and not pretend that the recheck under
    lock guarantees anything. It is just a heuristic after all.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • When direct sync compaction is often unsuccessful, it may become deferred
    for some time to avoid further useless attempts, both sync and async.
    Successful high-order allocations un-defer compaction, while further
    unsuccessful compaction attempts prolong the compaction deferred period.

    Currently the checking and setting deferred status is performed only on
    the preferred zone of the allocation that invoked direct compaction. But
    compaction itself is attempted on all eligible zones in the zonelist, so
    the behavior is suboptimal and may lead both to scenarios where 1)
    compaction is attempted uselessly, or 2) where it's not attempted despite
    good chances of succeeding, as shown on the examples below:

    1) A direct compaction with Normal preferred zone failed and set
    deferred compaction for the Normal zone. Another unrelated direct
    compaction with DMA32 as preferred zone will attempt to compact DMA32
    zone even though the first compaction attempt also included DMA32 zone.

    In another scenario, compaction with Normal preferred zone failed to
    compact Normal zone, but succeeded in the DMA32 zone, so it will not
    defer compaction. In the next attempt, it will try Normal zone which
    will fail again, instead of skipping Normal zone and trying DMA32
    directly.

    2) Kswapd will balance DMA32 zone and reset defer status based on
    watermarks looking good. A direct compaction with preferred Normal
    zone will skip compaction of all zones including DMA32 because Normal
    was still deferred. The allocation might have succeeded in DMA32, but
    won't.

    This patch makes compaction deferring work on individual zone basis
    instead of preferred zone. For each zone, it checks compaction_deferred()
    to decide if the zone should be skipped. If watermarks fail after
    compacting the zone, defer_compaction() is called. The zone where
    watermarks passed can still be deferred when the allocation attempt is
    unsuccessful. When allocation is successful, compaction_defer_reset() is
    called for the zone containing the allocated page. This approach should
    approximate calling defer_compaction() only on zones where compaction was
    attempted and did not yield an allocated page. There might be corner
    cases, but that is inevitable as long as the decision to stop compacting
    does not guarantee that a page will be allocated.
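
    A toy model of the per-zone deferral counters mentioned above (field
    names and the exponential backoff are loosely modelled on the kernel's
    scheme of this era, but the sketch is illustrative only):

        #include <stdbool.h>
        #include <stdio.h>

        #define COMPACT_MAX_DEFER_SHIFT 6

        struct zone_model {
                unsigned int considered;
                unsigned int defer_shift;
        };

        /* Sync direct compaction of this zone failed: back off harder. */
        static void defer_compaction(struct zone_model *z)
        {
                z->considered = 0;
                if (++z->defer_shift > COMPACT_MAX_DEFER_SHIFT)
                        z->defer_shift = COMPACT_MAX_DEFER_SHIFT;
        }

        /* An allocation succeeded from this zone: stop deferring. */
        static void compaction_defer_reset(struct zone_model *z)
        {
                z->considered = 0;
                z->defer_shift = 0;
        }

        /* Should compaction of this zone be skipped for now? */
        static bool compaction_deferred(struct zone_model *z)
        {
                unsigned int limit = 1U << z->defer_shift;

                if (++z->considered >= limit)
                        return false;   /* backed off long enough, try again */
                return true;            /* still backing off */
        }

        int main(void)
        {
                struct zone_model z = { 0, 0 };
                int i, skipped = 0;

                defer_compaction(&z);   /* back off for 1 << defer_shift attempts */
                for (i = 0; i < 8; i++)
                        skipped += compaction_deferred(&z);
                printf("skipped %d of 8 attempts\n", skipped);
                compaction_defer_reset(&z);
                return 0;
        }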

    Due to a new COMPACT_DEFERRED return value, some functions relying
    implicitly on COMPACT_SKIPPED = 0 had to be updated, with comments made
    more accurate. The did_some_progress output parameter of
    __alloc_pages_direct_compact() is removed completely, as the caller
    actually does not use it after compaction sets it - it is only considered
    when direct reclaim sets it.

    During testing on a two-node machine with a single very small Normal zone
    on node 1, this patch has improved success rates in the stress-highalloc
    mmtests benchmark. The success rates here were previously made worse by
    commit 3a025760fc15 ("mm: page_alloc: spill to remote nodes before waking
    kswapd") as kswapd was no longer resetting often enough the deferred
    compaction for the Normal zone, and DMA32 zones on both nodes were thus
    not considered for compaction. On different machine, success rates were
    improved with __GFP_NO_KSWAPD allocations.

    [akpm@linux-foundation.org: fix CONFIG_COMPACTION=n build]
    Signed-off-by: Vlastimil Babka
    Acked-by: Minchan Kim
    Reviewed-by: Zhang Yanfei
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

05 Jun, 2014

6 commits

  • Compaction uses compact_checklock_irqsave() function to periodically check
    for lock contention and need_resched() to either abort async compaction,
    or to free the lock, schedule and retake the lock. When aborting,
    cc->contended is set to signal the contended state to the caller. Two
    problems have been identified in this mechanism.

    First, compaction also calls cond_resched() directly in both scanners when
    no lock is yet taken. This call neither aborts async compaction nor sets
    cc->contended appropriately. This patch introduces a new
    compact_should_abort() function to achieve both. In isolate_freepages(),
    the check frequency is reduced to once per SWAP_CLUSTER_MAX pageblocks to
    match what the migration scanner does in the preliminary page checks. In
    case a pageblock is found suitable for calling isolate_freepages_block(),
    the checks within there are done at a higher frequency.

    Second, isolate_freepages() does not check if isolate_freepages_block()
    aborted due to contention, and advances to the next pageblock. This
    violates the principle of aborting on contention, and might result in
    pageblocks not being scanned completely, since the scanning cursor is
    advanced. This problem has been noticed in the code by Joonsoo Kim when
    reviewing related patches. This patch makes isolate_freepages_block()
    check the cc->contended flag and abort.

    In case isolate_freepages() has already isolated some pages before
    aborting due to contention, page migration will proceed, which is OK since
    we do not want to waste the work that has been done, and page migration
    has its own checks for contention. However, we do not want another
    isolation attempt by either of the scanners, so a cc->contended flag check
    is also added to compaction_alloc() and compact_finished() to make sure
    compaction
    is aborted right after the migration.
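
    A minimal C sketch of these extra bail-outs, with simplified stand-in
    types and names: once the contended flag is set, the free-page callback
    used during migration and the finish check both give up immediately.

        #include <stdbool.h>
        #include <stddef.h>

        #define COMPACT_PARTIAL_MODEL 1   /* stand-in for a "stop now" result */

        struct compact_control_model {
            bool contended;
            int nr_freepages;
        };

        /* Supplies a target page for each migration; NULL stops migration. */
        static void *compaction_alloc_model(struct compact_control_model *cc)
        {
            if (cc->contended)
                return NULL;              /* no further isolation attempts */
            return &cc->nr_freepages;     /* placeholder for a real free page */
        }

        static int compact_finished_model(struct compact_control_model *cc)
        {
            if (cc->contended)
                return COMPACT_PARTIAL_MODEL;  /* abort right after migration */
            return 0;                          /* keep compacting */
        }

        int main(void)
        {
            struct compact_control_model cc = { .contended = true };
            return compaction_alloc_model(&cc) == NULL &&
                   compact_finished_model(&cc) == COMPACT_PARTIAL_MODEL ? 0 : 1;
        }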

    The outcome of the patch should be reduced lock contention by async
    compaction and lower latencies for higher-order allocations where direct
    compaction is involved.

    [akpm@linux-foundation.org: fix typo in comment]
    Reported-by: Joonsoo Kim
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Michal Nazarewicz
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: Michal Nazarewicz
    Tested-by: Shawn Guo
    Tested-by: Kevin Hilman
    Tested-by: Stephen Warren
    Tested-by: Fabio Estevam
    Cc: David Rientjes
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The compaction free scanner in isolate_freepages() currently remembers the
    PFN of the highest pageblock where it successfully isolates pages, to be
    used as the starting pageblock for the next invocation. The rationale
    behind this is
    that page migration might return free pages to the allocator when
    migration fails and we don't want to skip them if the compaction
    continues.

    Since migration now returns free pages back to compaction code where they
    can be reused, this is no longer a concern. This patch changes
    isolate_freepages() so that the PFN for restarting is updated with each
    pageblock where isolation is attempted. Using stress-highalloc from
    mmtests, this resulted in 10% reduction of the pages scanned by the free
    scanner.
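
    A compact C sketch of the changed behaviour, assuming an illustrative
    pageblock size and a fake isolation helper: the restart PFN now follows
    every pageblock where isolation was attempted, whether or not it
    succeeded.

        #include <stdio.h>

        #define PAGEBLOCK_PAGES 512UL    /* stand-in for pageblock_nr_pages */

        /* Pretend isolation: every other pageblock yields one free page. */
        static unsigned long isolate_block_model(unsigned long pfn)
        {
            return (pfn / PAGEBLOCK_PAGES) % 2;
        }

        int main(void)
        {
            unsigned long restart_pfn = 8192;   /* scan proceeds downwards */
            unsigned long low_pfn = 4096;
            unsigned long nr_free = 0;

            for (unsigned long pfn = restart_pfn; pfn > low_pfn; pfn -= PAGEBLOCK_PAGES) {
                nr_free += isolate_block_model(pfn);
                restart_pfn = pfn;   /* now updated per attempt, success or not */
                if (nr_free >= 4)    /* isolated enough for this round */
                    break;
            }
            printf("free scanner restarts next time at pfn %lu\n", restart_pfn);
            return 0;
        }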

    Note that the somewhat similar functionality that records the highest
    successful pageblock in zone->compact_cached_free_pfn remains unchanged.
    This cache is used when the whole compaction is restarted, not for
    multiple invocations of the free scanner during single compaction.

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Bartlomiej Zolnierkiewicz
    Acked-by: Michal Nazarewicz
    Reviewed-by: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • During compaction, update_nr_listpages() has been used to count remaining
    non-migrated and free pages after a call to migrate_pages(). The
    freepages counting has become unnecessary, and it turns out that
    migratepages counting is also unnecessary in most cases.

    The only situation when it's needed to count cc->migratepages is when
    migrate_pages() returns with a negative error code. Otherwise, the
    non-negative return value is the number of pages that were not migrated,
    which is exactly the count of remaining pages in the cc->migratepages
    list.

    Furthermore, any non-zero count is only interesting for the
    mm_compaction_migratepages tracepoint, because after that all remaining
    unmigrated pages are put back and their count is set to 0.

    This patch therefore removes update_nr_listpages() completely, and changes
    the tracepoint definition so that the manual counting is done only when
    the tracepoint is enabled, and only when migrate_pages() returns a
    negative error code.
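
    The counting rule can be pictured with the following userspace-style C
    sketch; the helper names and the trace_enabled flag are stand-ins for the
    kernel's tracepoint machinery, not its actual API.

        #include <stdbool.h>
        #include <stdio.h>

        static bool trace_enabled = true;   /* stand-in for a tracepoint being on */

        /* Stand-in for walking cc->migratepages; here the length is passed in. */
        static int count_remaining_model(int pages_left_on_list)
        {
            return pages_left_on_list;
        }

        /* nr_all pages were handed to migrate_pages(), which returned ret. */
        static void trace_migratepages_model(int ret, int pages_left_on_list, int nr_all)
        {
            int nr_failed = 0;

            if (trace_enabled) {
                if (ret >= 0)
                    nr_failed = ret;   /* the return value already is the count */
                else
                    nr_failed = count_remaining_model(pages_left_on_list);
            }
            printf("nr_migrated=%d nr_failed=%d\n", nr_all - nr_failed, nr_failed);
        }

        int main(void)
        {
            trace_migratepages_model(3, 3, 32);     /* partial success */
            trace_migratepages_model(-12, 32, 32);  /* error: count the list */
            return 0;
        }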

    Furthermore, migrate_pages() and the tracepoints won't be called when
    there's nothing to migrate. This potentially avoids some wasted cycles
    and reduces the volume of uninteresting mm_compaction_migratepages events
    where "nr_migrated=0 nr_failed=0". In the stress-highalloc mmtests benchmark, this
    was about 75% of the events. The mm_compaction_isolate_migratepages event
    is better for determining that nothing was isolated for migration, and
    this one was just duplicating the info.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Bartlomiej Zolnierkiewicz
    Acked-by: Michal Nazarewicz
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Async compaction terminates prematurely when need_resched(), see
    compact_checklock_irqsave(). This check can never trigger, however, if
    the cond_resched() in isolate_migratepages_range() always takes care of
    the scheduling first.

    If the cond_resched() actually triggers, then terminate this pageblock
    scan for async compaction as well.

    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • We're going to want to manipulate the migration mode for compaction in the
    page allocator, and currently compact_control's sync field is only a bool.

    Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction
    depending on the value of this bool. Convert the bool to enum
    migrate_mode and pass the migration mode in directly. Later, we'll want
    to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault path to
    avoid unnecessary latency.

    This also alters compaction triggered from sysfs, either for the entire
    system or for a node, to force MIGRATE_SYNC.
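
    As a trimmed illustration, the C sketch below shows the shape of the
    conversion; the enum values echo the kernel's migrate_mode names, while
    compact_control_model is a cut-down stand-in for the real structure.

        #include <stdio.h>

        enum migrate_mode_model {
            MIGRATE_ASYNC,        /* opportunistic, never blocks */
            MIGRATE_SYNC_LIGHT,   /* may block, avoids the most expensive waits */
            MIGRATE_SYNC,         /* full synchronous migration */
        };

        /* Cut-down stand-in for compact_control with the new field. */
        struct compact_control_model {
            int order;
            enum migrate_mode_model mode;   /* was: bool sync */
        };

        int main(void)
        {
            /* Direct compaction from the allocator might start out async... */
            struct compact_control_model cc = { .order = 9, .mode = MIGRATE_ASYNC };

            /* ...while compaction triggered for the whole system or a node
             * forces full sync. */
            cc.mode = MIGRATE_SYNC;
            printf("order=%d mode=%d\n", cc.order, (int)cc.mode);
            return 0;
        }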

    [akpm@linux-foundation.org: fix build]
    [iamjoonsoo.kim@lge.com: use MIGRATE_SYNC in alloc_contig_range()]
    Signed-off-by: David Rientjes
    Suggested-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Greg Thelen
    Cc: Naoya Horiguchi
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Each zone has a cached migration scanner pfn for memory compaction so that
    subsequent calls to memory compaction can start where the previous call
    left off.

    Currently, the compaction migration scanner only updates the per-zone
    cached pfn when pageblocks were not skipped for async compaction. This
    creates a dependency on calling sync compaction to keep subsequent calls
    to async compaction from scanning an enormous number of non-MOVABLE
    pageblocks each time it is called. On large machines, this can be very
    expensive.

    This patch adds a per-zone cached migration scanner pfn only for async
    compaction. It is updated every time a pageblock has been scanned in its
    entirety and when no pages from it were successfully isolated. The cached
    migration scanner pfn for sync compaction is updated only when called for
    sync compaction.
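
    The idea can be summarized with the C sketch below, using illustrative
    field and function names rather than the kernel's exact ones: the async
    position advances after every fully scanned, unproductive pageblock,
    while the sync position moves only during sync compaction.

        #include <stdbool.h>
        #include <stdio.h>

        /* Cut-down stand-in for struct zone with one cached position per mode. */
        struct zone_model {
            unsigned long cached_migrate_pfn_async;
            unsigned long cached_migrate_pfn_sync;
        };

        /* Called after a pageblock was scanned in full without isolating anything. */
        static void update_cached_migrate_pfn_model(struct zone_model *z,
                                                    unsigned long next_pfn, bool sync)
        {
            z->cached_migrate_pfn_async = next_pfn;      /* async position always moves */
            if (sync)
                z->cached_migrate_pfn_sync = next_pfn;   /* sync moves only on sync runs */
        }

        int main(void)
        {
            struct zone_model zone = { 0, 0 };

            update_cached_migrate_pfn_model(&zone, 2048, false);  /* async-only pass */
            update_cached_migrate_pfn_model(&zone, 4096, true);   /* sync pass */
            printf("async=%lu sync=%lu\n",
                   zone.cached_migrate_pfn_async, zone.cached_migrate_pfn_sync);
            return 0;
        }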

    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Cc: Greg Thelen
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes