18 Mar, 2016

2 commits

  • Similarly to direct reclaim/compaction, kswapd attempts to combine
    reclaim and compaction to make a memory allocation of the given
    order available.

    The details differ from direct reclaim e.g. in having high watermark as
    a goal. The code involved in kswapd's reclaim/compaction decisions has
    evolved to be quite complex.

    Testing reveals that it doesn't actually work in at least one scenario,
    and closer inspection suggests that it could be greatly simplified
    without compromising on the goal (make high-order page available) or
    efficiency (don't reclaim too much). The simplification relies on
    doing all compaction in kcompactd, which is simply woken up when high
    watermarks are reached by kswapd's reclaim.

    The scenario where kswapd compaction doesn't work was found with mmtests
    test stress-highalloc configured to attempt order-9 allocations without
    direct reclaim, just waking up kswapd. There was no compaction attempt
    from kswapd during the whole test. Some added instrumentation shows
    what happens:

    - balance_pgdat() sets end_zone to Normal, as it's not balanced
    - reclaim is attempted on DMA zone, which sets nr_attempted to 99, but
      it cannot reclaim anything, so sc.nr_reclaimed is 0
    - for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so
      it merely checks if high watermarks were reached for base pages.
      This is true, so no reclaim is attempted. For DMA, testorder=0
      wasn't used, as compaction_suitable() returned COMPACT_SKIPPED
    - even though the pgdat_needs_compaction flag wasn't set to false, no
      compaction happens due to the condition sc.nr_reclaimed >
      nr_attempted being false (as 0 < 99)
    - priority-- due to nr_reclaimed being 0, repeat until priority reaches
      0; pgdat_balanced() is false as only the small zone DMA appears
      balanced (curiously in that check, watermark appears OK and
      compaction_suitable() returns COMPACT_PARTIAL, because a lower
      classzone_idx is used there)

    Now, even if it was decided that reclaim shouldn't be attempted on the
    DMA zone, the scenario would be the same, as (sc.nr_reclaimed=0 >
    nr_attempted=0) is also false. The condition really should use >= as
    the comment suggests. Then there is a mismatch in the check for setting
    pgdat_needs_compaction to false using low watermark, while the rest uses
    high watermark, and who knows what other subtlety. Hopefully this
    demonstrates that this is unsustainable.
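
    The two cases above can be replayed with a minimal userspace sketch.
    The helper names gate_strict() and gate_relaxed() are hypothetical and
    only model the comparison described here; they are not the kernel's
    code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-ins for the gate discussed above; the names are
 * illustrative, not the kernel's.
 */
static bool gate_strict(unsigned long nr_reclaimed, unsigned long nr_attempted)
{
	return nr_reclaimed > nr_attempted;	/* the condition as described */
}

static bool gate_relaxed(unsigned long nr_reclaimed, unsigned long nr_attempted)
{
	return nr_reclaimed >= nr_attempted;	/* what the comment suggests */
}

/* Replays the two cases from the text. */
static void replay_scenarios(void)
{
	/* DMA reclaim attempted 99 pages, reclaimed none: no compaction. */
	assert(!gate_strict(0, 99));
	/* No reclaim attempted at all: the strict form still says no. */
	assert(!gate_strict(0, 0));
	/* Only the >= form lets compaction run when nothing was attempted. */
	assert(gate_relaxed(0, 0));
	assert(!gate_relaxed(0, 99));
}
```

    Note that even the relaxed form would not help in the observed case
    (0 reclaimed of 99 attempted), which is part of why the logic is
    removed rather than patched.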

    Luckily we can simplify this a lot. The reclaim/compaction decisions
    make sense for direct reclaim scenario, but in kswapd, our primary goal
    is to reach high watermark in order-0 pages. Afterwards we can attempt
    compaction just once. Unlike direct reclaim, we don't reclaim extra
    pages (over the high watermark); the current code already disallows it
    for good reasons.

    After this patch, we simply wake up kcompactd to process the pgdat,
    after we have either succeeded or failed to reach the high watermarks in
    kswapd, which goes to sleep. We pass kswapd's order and classzone_idx,
    so kcompactd can apply the same criteria to determine which zones are
    worth compacting. Note that we use the classzone_idx from
    wakeup_kswapd(), not balanced_classzone_idx which can include higher
    zones that kswapd tried to balance too, but didn't consider them in
    pgdat_balanced().

    Since kswapd now cannot create high-order pages itself, we need to
    adjust how it determines the zones to be balanced. The key element here
    is adding a "highorder" parameter to zone_balanced, which, when set to
    false, makes it consider only order-0 watermark instead of the desired
    higher order (this was done previously by kswapd_shrink_zone(), but not
    elsewhere). This false is passed for example in pgdat_balanced().
    Importantly, wakeup_kswapd() uses true to make sure kswapd and thus
    kcompactd are woken up for a high-order allocation failure.
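
    A rough userspace model of the "highorder" switch, assuming a much
    simplified watermark check. struct toy_zone, toy_watermark_ok() and
    toy_zone_balanced() are hypothetical names for illustration; the real
    zone_balanced() takes more parameters and uses the kernel's watermark
    helpers.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_ORDERS 11	/* assumed number of buddy orders in this toy model */

/* Hypothetical, simplified view of a zone. */
struct toy_zone {
	unsigned long free_pages;		/* total free base pages */
	unsigned long free_area[NR_ORDERS];	/* free blocks per order */
	unsigned long high_wmark;		/* high watermark in base pages */
};

/*
 * Simplified watermark test: enough free base pages and, for order > 0,
 * at least one free block of the requested order or larger.
 */
static bool toy_watermark_ok(const struct toy_zone *z, unsigned int order)
{
	assert(order < NR_ORDERS);
	if (z->free_pages <= z->high_wmark)
		return false;
	if (order == 0)
		return true;
	for (unsigned int o = order; o < NR_ORDERS; o++)
		if (z->free_area[o])
			return true;
	return false;
}

/* With highorder == false only the order-0 watermark is considered. */
static bool toy_zone_balanced(const struct toy_zone *z, unsigned int order,
			      bool highorder)
{
	if (!highorder)
		order = 0;
	return toy_watermark_ok(z, order);
}
```

    In this model a fragmented zone above its high watermark counts as
    balanced for kswapd (highorder=false), while the wakeup path
    (highorder=true) still sees the order-9 request as unsatisfied.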

    The last thing is to decide what to do with pageblock_skip bitmap
    handling. Compaction maintains a pageblock_skip bitmap to record
    pageblocks where isolation recently failed. This bitmap can be reset in
    three ways:

    1) direct compaction is restarting after going through the full deferred cycle

    2) kswapd goes to sleep, and some other direct compaction has previously
    finished scanning the whole zone and set zone->compact_blockskip_flush.
    Note that a successful direct compaction clears this flag.

    3) compaction was invoked manually via trigger in /proc

    The case 2) is somewhat fuzzy to begin with, but after introducing
    kcompactd we should update it. The check for direct compaction in 1),
    and to set the flush flag in 2) use current_is_kswapd(), which doesn't
    work for kcompactd. Thus, this patch adds bool direct_compaction to
    compact_control to use in 2). For the case 1) we remove the check
    completely - unlike the former kswapd compaction, kcompactd does use the
    deferred compaction functionality, so flushing tied to restarting from
    deferred compaction makes sense here.

    Note that when kswapd goes to sleep, kcompactd is woken up, so it will
    see the flushed pageblock_skip bits. This is different from when the
    former kswapd compaction observed the bits and I believe it makes more
    sense. Kcompactd can afford to be more thorough than a direct
    compaction trying to limit allocation latency, or kswapd whose primary
    goal is to reclaim.

    For testing, I used stress-highalloc configured to do order-9
    allocations with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just
    on kswapd/kcompactd reclaim/compaction (the interfering kernel builds in
    phases 1 and 2 work as usual):

    stress-highalloc
                          4.5-rc1+before       4.5-rc1+after
                               -nodirect           -nodirect
    Success 1 Min         1.00 (  0.00%)      5.00 (-66.67%)
    Success 1 Mean        1.40 (  0.00%)      6.20 (-55.00%)
    Success 1 Max         2.00 (  0.00%)      7.00 (-16.67%)
    Success 2 Min         1.00 (  0.00%)      5.00 (-66.67%)
    Success 2 Mean        1.80 (  0.00%)      6.40 (-52.38%)
    Success 2 Max         3.00 (  0.00%)      7.00 (-16.67%)
    Success 3 Min        34.00 (  0.00%)     62.00 (  1.59%)
    Success 3 Mean       41.80 (  0.00%)     63.80 (  1.24%)
    Success 3 Max        53.00 (  0.00%)     65.00 (  2.99%)

    User                         3166.67             3181.09
    System                       1153.37             1158.25
    Elapsed                      1768.53             1799.37

                                  4.5-rc1+before   4.5-rc1+after
                                       -nodirect       -nodirect
    Direct pages scanned                   32938           32797
    Kswapd pages scanned                 2183166         2202613
    Kswapd pages reclaimed               2152359         2143524
    Direct pages reclaimed                 32735           32545
    Percentage direct scans                   1%              1%
    THP fault alloc                          579             612
    THP collapse alloc                       304             316
    THP splits                                 0               0
    THP fault fallback                       793             778
    THP collapse fail                         11              16
    Compaction stalls                       1013            1007
    Compaction success                        92              67
    Compaction failures                      920             939
    Page migrate success                  238457          721374
    Page migrate failure                   23021           23469
    Compaction pages isolated             504695         1479924
    Compaction migrate scanned            661390         8812554
    Compaction free scanned             13476658        84327916
    Compaction cost                          262             838

    After this patch we see improvements in allocation success rate
    (especially for phase 3) along with increased compaction activity. The
    compaction stalls (direct compaction) in the interfering kernel builds
    (probably THP's) also decreased somewhat thanks to kcompactd activity,
    yet THP alloc successes improved a bit.

    Note that elapsed and user time isn't so useful for this benchmark,
    because of the background interference being unpredictable. It's just
    to quickly spot some major unexpected differences. System time is
    somewhat more useful and that didn't increase.

    Also (after adjusting mmtests' ftrace monitor):

    Time kswapd awake                2547781     2269241
    Time kcompactd awake                   0      119253
    Time direct compacting            939937      557649
    Time kswapd compacting                 0           0
    Time kcompactd compacting              0      119099

    The decrease of overall time spent compacting appears to not match the
    increased compaction stats. I suspect the tasks get rescheduled and
    since the ftrace monitor doesn't see that, the reported time is wall
    time, not CPU time. But arguably direct compactors care about overall
    latency anyway, whether busy compacting or waiting for CPU doesn't
    matter. And that latency seems to have almost halved.

    It's also interesting how much time kswapd spent awake just going
    through all the priorities and failing to even try compacting, over and
    over.

    We can also configure stress-highalloc to perform both direct
    reclaim/compaction and wake up kswapd/kcompactd, by using
    GFP_KERNEL|__GFP_HIGH|__GFP_COMP:

    stress-highalloc
                          4.5-rc1+before       4.5-rc1+after
                                 -direct             -direct
    Success 1 Min         4.00 (  0.00%)      9.00 (-50.00%)
    Success 1 Mean        8.00 (  0.00%)     10.00 (-19.05%)
    Success 1 Max        12.00 (  0.00%)     11.00 ( 15.38%)
    Success 2 Min         4.00 (  0.00%)      9.00 (-50.00%)
    Success 2 Mean        8.20 (  0.00%)     10.00 (-16.28%)
    Success 2 Max        13.00 (  0.00%)     11.00 (  8.33%)
    Success 3 Min        75.00 (  0.00%)     74.00 (  1.33%)
    Success 3 Mean       75.60 (  0.00%)     75.20 (  0.53%)
    Success 3 Max        77.00 (  0.00%)     76.00 (  0.00%)

    User                         3344.73             3246.04
    System                       1194.24             1172.29
    Elapsed                      1838.04             1836.76

                                  4.5-rc1+before   4.5-rc1+after
                                         -direct         -direct
    Direct pages scanned                  125146          120966
    Kswapd pages scanned                 2119757         2135012
    Kswapd pages reclaimed               2073183         2108388
    Direct pages reclaimed                124909          120577
    Percentage direct scans                   5%              5%
    THP fault alloc                          599             652
    THP collapse alloc                       323             354
    THP splits                                 0               0
    THP fault fallback                       806             793
    THP collapse fail                         17              16
    Compaction stalls                       2457            2025
    Compaction success                       906             518
    Compaction failures                     1551            1507
    Page migrate success                 2031423         2360608
    Page migrate failure                   32845           40852
    Compaction pages isolated            4129761         4802025
    Compaction migrate scanned          11996712        21750613
    Compaction free scanned            214970969       344372001
    Compaction cost                         2271            2694

    In this scenario, this patch doesn't change the overall success rate as
    direct compaction already tries all it can. There's however significant
    reduction in direct compaction stalls (that is, the number of
    allocations that went into direct compaction). The number of successes
    (i.e. direct compaction stalls that ended up with successful
    allocation) is reduced by the same number. This means the offload to
    kcompactd is working as expected, and direct compaction is reduced
    either due to detecting contention, or compaction deferred by kcompactd.
    In the previous version of this patchset there was some apparent
    reduction of success rate, but the changes in this version (such as
    using sync compaction only), new baseline kernel, and/or averaging
    results from 5 executions (my bet), made this go away.

    Ftrace-based stats seem to roughly agree:

    Time kswapd awake                2532984     2326824
    Time kcompactd awake                   0      257916
    Time direct compacting            864839      735130
    Time kswapd compacting                 0           0
    Time kcompactd compacting              0      257585

    Signed-off-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Memory compaction can be currently performed in several contexts:

    - kswapd balancing a zone after a high-order allocation failure
    - direct compaction to satisfy a high-order allocation, including THP
      page fault attempts
    - khugepaged trying to collapse a hugepage
    - manually from /proc

    The purpose of compaction is two-fold. The obvious purpose is to
    satisfy a (pending or future) high-order allocation, and is easy to
    evaluate. The other purpose is to keep overall memory fragmentation low
    and help the anti-fragmentation mechanism. The success wrt the latter
    purpose is more difficult to evaluate.

    The current situation wrt the purposes has a few drawbacks:

    - compaction is invoked only when a high-order page or hugepage is not
    available (or manually). This might be too late for the purposes of
    keeping memory fragmentation low.
    - direct compaction increases latency of allocations. Again, it would
    be better if compaction was performed asynchronously to keep
    fragmentation low, before the allocation itself comes.
    - (a special case of the previous) the cost of compaction during THP
    page faults can easily offset the benefits of THP.
    - kswapd compaction appears to be complex, fragile and not working in
    some scenarios. It could also end up compacting for a high-order
    allocation request when it should be reclaiming memory for a later
    order-0 request.

    To improve the situation, we should be able to benefit from an
    equivalent of kswapd, but for compaction - i.e. a background thread
    which responds to fragmentation and the need for high-order allocations
    (including hugepages) somewhat proactively.

    One possibility is to extend the responsibilities of kswapd, which could
    however complicate its design too much. It should be better to let
    kswapd handle reclaim, as order-0 allocations are often more critical
    than high-order ones.

    Another possibility is to extend khugepaged, but this kthread is a
    single instance and tied to THP configs.

    This patch goes with the option of a new set of per-node kthreads called
    kcompactd, and lays the foundations, without introducing any new
    tunables. The lifecycle mimics kswapd kthreads, including the memory
    hotplug hooks.

    For compaction, kcompactd uses the standard compaction_suitable() and
    compact_finished() criteria and the deferred compaction functionality.
    Unlike direct compaction, it uses only sync compaction, as there's no
    allocation latency to minimize.

    This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
    compact/reclaim loop for high-order pages will be replaced by waking up
    kcompactd in the next patch with the description of what's wrong with
    the old approach.

    Waking up of the kcompactd threads is also tied to kswapd activity and
    follows these rules:
    - we don't want to affect any fastpaths, so wake up kcompactd only from
    the slowpath, as it's done for kswapd
    - if kswapd is doing reclaim, it's more important than compaction, so
    don't invoke kcompactd until kswapd goes to sleep
    - the target order used for kswapd is passed to kcompactd

    Possible future uses for kcompactd include the ability to wake up
    kcompactd on demand in special situations, such as when hugepages are
    not available (currently not done due to __GFP_NO_KSWAPD) or when a
    fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
    possible to perform periodic compaction with kcompactd.

    [arnd@arndb.de: fix build errors with kcompactd]
    [paul.gortmaker@windriver.com: don't use modular references for non modular code]
    Signed-off-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul Gortmaker
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

16 Mar, 2016

3 commits

  • There is a report of a performance drop caused by hugepage allocation,
    in which half of the CPU time is spent in pageblock_pfn_to_page()
    during compaction [1].

    In that workload, compaction is triggered to make a hugepage, but most
    pageblocks are unavailable for compaction due to pageblock type and
    skip bit, so compaction usually fails. The most costly operation in
    this case is finding a valid pageblock while scanning the whole zone
    range. To check if a pageblock is valid to compact, a valid pfn within
    the pageblock is required, and we can obtain it by calling
    pageblock_pfn_to_page(). This function checks whether the pageblock is
    in a single zone and returns a valid pfn if possible. The problem is
    that we need to check it every time before scanning a pageblock, even
    if we re-visit it, and this turns out to be very expensive in this
    workload.

    Although we have no way to skip this pageblock check on a system where
    a hole exists at an arbitrary position, we can use a cached value for
    zone continuity and just do pfn_to_page() on a system where no hole
    exists. This optimization considerably speeds up the above workload.
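
    The idea of caching zone continuity can be sketched in userspace as
    follows. Everything here (struct toy_zone, BLOCK_PAGES, the toy_*
    helpers and the slow_checks counter) is a hypothetical model, and the
    slow check is heavily simplified compared to what
    pageblock_pfn_to_page() really validates.

```c
#include <assert.h>
#include <stdbool.h>

#define BLOCK_PAGES 4UL	/* toy pageblock size, not the kernel's */

struct toy_zone {
	unsigned long start_pfn, end_pfn;	/* zone spans [start_pfn, end_pfn) */
	const bool *present;			/* present[pfn - start_pfn] */
	bool contiguous;			/* cached: no hole in any block */
	unsigned long slow_checks;		/* how often the slow path ran */
};

/* Slow path: validate both ends of the block (simplified). */
static bool toy_block_valid_slow(struct toy_zone *z, unsigned long start)
{
	unsigned long last = start + BLOCK_PAGES - 1;

	z->slow_checks++;
	if (start < z->start_pfn || last >= z->end_pfn)
		return false;
	return z->present[start - z->start_pfn] &&
	       z->present[last - z->start_pfn];
}

/* Compute the cached flag once, e.g. when the zone is set up. */
static void toy_set_contiguous(struct toy_zone *z)
{
	z->contiguous = true;
	for (unsigned long pfn = z->start_pfn; pfn < z->end_pfn;
	     pfn += BLOCK_PAGES) {
		if (!toy_block_valid_slow(z, pfn)) {
			z->contiguous = false;
			break;
		}
	}
}

/* Fast path first: a contiguous zone needs no per-block validation. */
static bool toy_block_valid(struct toy_zone *z, unsigned long start)
{
	assert(start % BLOCK_PAGES == 0);
	if (z->contiguous)
		return true;
	return toy_block_valid_slow(z, start);
}
```

    On a hole-free zone the per-block cost drops to a flag test; on a zone
    with holes every call still takes the slow path, as before.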

             Before     vs  After
    Max:  1096 MB/s     vs  1325 MB/s
    Min:   635 MB/s     vs  1015 MB/s
    Avg:   899 MB/s     vs  1194 MB/s

    Avg is improved by roughly 30% [2].

    [1]: http://www.spinics.net/lists/linux-mm/msg97378.html
    [2]: https://lkml.org/lkml/2015/12/9/23

    [akpm@linux-foundation.org: don't forget to restore zone->contiguous on error path, per Vlastimil]
    Signed-off-by: Joonsoo Kim
    Reported-by: Aaron Lu
    Acked-by: Vlastimil Babka
    Tested-by: Aaron Lu
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • pageblock_pfn_to_page() is used to check that there is a valid pfn and
    that all pages in the pageblock are in a single zone. If there is a
    hole in the pageblock, passing an arbitrary position to
    pageblock_pfn_to_page() could cause the whole pageblock scan to be
    skipped, instead of just the hole page. For deterministic behaviour,
    it's better to always pass a pageblock-aligned range to
    pageblock_pfn_to_page(). It will also help further optimization of
    pageblock_pfn_to_page() in the following patch.
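
    Deriving a pageblock-aligned range from an arbitrary pfn could look
    roughly like this. The pageblock order of 9, struct pfn_range and
    pageblock_range() are assumptions made for the sketch, as is clamping
    the end to the zone end.

```c
#include <assert.h>

#define PAGEBLOCK_NR_PAGES (1UL << 9)	/* assumed pageblock order of 9 */

struct pfn_range {
	unsigned long start, end;	/* covers [start, end) */
};

/*
 * Return the pageblock-aligned range containing pfn, clamped so that it
 * does not run past the end of the zone.
 */
static struct pfn_range pageblock_range(unsigned long pfn,
					unsigned long zone_end_pfn)
{
	struct pfn_range r;

	assert(pfn < zone_end_pfn);
	r.start = pfn & ~(PAGEBLOCK_NR_PAGES - 1);
	r.end = r.start + PAGEBLOCK_NR_PAGES;
	if (r.end > zone_end_pfn)
		r.end = zone_end_pfn;
	return r;
}
```

    For example, pfn 0x620 maps to the range [0x600, 0x800) regardless of
    where inside the pageblock the scanner happened to stop.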

    Signed-off-by: Joonsoo Kim
    Cc: Aaron Lu
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • free_pfn and compact_cached_free_pfn are the pointers that remember
    the restart position of the freepage scanner. When they are reset or
    invalid, we set them to zone_end_pfn because the freepage scanner works
    in the reverse direction. But, because the zone range is defined as
    [zone_start_pfn, zone_end_pfn), zone_end_pfn is invalid to access.
    Therefore, we should not store it in free_pfn and
    compact_cached_free_pfn. Instead, we need to store zone_end_pfn - 1 in
    them. There is one more thing we should consider. The freepage scanner
    scans in reverse by pageblock unit. If free_pfn and
    compact_cached_free_pfn are set to the middle of a pageblock, it
    regards that situation as having already scanned the front part of the
    pageblock, so we lose the opportunity to scan there. To fix this up,
    this patch does round_down() to guarantee that the reset position will
    be pageblock aligned.

    Note that thanks to the current pageblock_pfn_to_page() implementation,
    actual access to zone_end_pfn hasn't happened until now. But the
    following patch will change pageblock_pfn_to_page(), so this patch is
    needed from now on.
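
    The reset position described above amounts to rounding the last valid
    pfn down to a pageblock boundary. A minimal sketch, assuming a
    pageblock order of 9 and a hypothetical helper name reset_free_pfn():

```c
#include <assert.h>

#define PAGEBLOCK_NR_PAGES (1UL << 9)	/* assumed pageblock order of 9 */

/*
 * Restart position for the free scanner: the start of the pageblock that
 * contains the last valid pfn of the zone [zone_start_pfn, zone_end_pfn).
 */
static unsigned long reset_free_pfn(unsigned long zone_start_pfn,
				    unsigned long zone_end_pfn)
{
	unsigned long last_valid;

	assert(zone_end_pfn > zone_start_pfn);
	last_valid = zone_end_pfn - 1;			/* end is exclusive */
	return last_valid & ~(PAGEBLOCK_NR_PAGES - 1);	/* round down */
}
```

    With a zone ending at 0x1234 the scanner restarts at 0x1200, so the
    partial last pageblock is scanned from its beginning and zone_end_pfn
    itself is never dereferenced.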

    Signed-off-by: Joonsoo Kim
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Aaron Lu
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

15 Jan, 2016

2 commits

  • This patch uses is_via_compact_memory() to distinguish compaction from
    sysfs or sysctl. And, this patch also reduces indentation on
    compaction_defer_reset() by filtering these cases first before checking
    watermark.

    There is no functional change.

    Signed-off-by: Joonsoo Kim
    Acked-by: Yaowei Bai
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • sysctl_compaction_handler() is the handler function for the
    compact_memory tunable knob under /proc/sys/vm. Add the missing knob
    name to the comment to make it more accurate.

    No functional change.

    Signed-off-by: Yaowei Bai
    Acked-by: Vlastimil Babka
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     

06 Nov, 2015

3 commits

  • Compaction returns prematurely with COMPACT_PARTIAL when contended or
    when a fatal signal is pending. This is ok for the callers, but might
    be misleading in the traces, as the usual reason to return
    COMPACT_PARTIAL is that we think the allocation should succeed. After
    this patch we distinguish the premature ending condition in the
    mm_compaction_finished and mm_compaction_end tracepoints.

    The contended status covers the following reasons:
    - lock contention or need_resched() detected in async compaction
    - fatal signal pending
    - too many pages isolated in the zone (only for async compaction)
    Further distinguishing the exact reason seems unnecessary for now.

    Signed-off-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Steven Rostedt
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Some compaction tracepoints convert the integer return values to strings
    using the compaction_status_string array. This works for in-kernel
    printing, but not userspace trace printing of raw captured trace such as
    via trace-cmd report.

    This patch converts the private array to appropriate tracepoint macros
    that result in proper userspace support.

    trace-cmd output before:
    transhuge-stres-4235 [000] 453.149280: mm_compaction_finished: node=0
    zone=ffffffff81815d7a order=9 ret=

    after:
    transhuge-stres-4235 [000] 453.149280: mm_compaction_finished: node=0
    zone=ffffffff81815d7a order=9 ret=partial

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Steven Rostedt
    Cc: Joonsoo Kim
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Introduce is_via_compact_memory() helper indicating compacting via
    /proc/sys/vm/compact_memory to improve readability.

    To catch this situation in __compaction_suitable, use order as parameter
    directly instead of using struct compact_control.

    This patch has no functional changes.
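
    A minimal sketch of such a helper and its use, assuming the convention
    that manual compaction is marked by an order of -1. The names
    COMPACT_ALL_ORDER, via_compact_memory() and toy_compaction_suitable()
    are hypothetical, and the suitability check is reduced to a single
    boolean for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define COMPACT_ALL_ORDER (-1)	/* assumed marker for manual compaction */

/* True when compaction was requested manually for the whole zone. */
static bool via_compact_memory(int order)
{
	return order == COMPACT_ALL_ORDER;
}

/*
 * Toy suitability check taking the order directly: manual compaction
 * bypasses the watermark test, anything else depends on it.
 */
static bool toy_compaction_suitable(int order, bool watermark_ok)
{
	assert(order >= COMPACT_ALL_ORDER);
	if (via_compact_memory(order))
		return true;
	return watermark_ok;
}
```

    Passing the order rather than a whole control structure keeps the
    check usable from callers that have no such structure at hand.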

    Signed-off-by: Yaowei Bai
    Cc: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     

09 Sep, 2015

6 commits

  • We cache isolate_start_pfn before entering isolate_migratepages(). If
    a pageblock is skipped in isolate_migratepages() for whatever reason,
    cc->migrate_pfn can be far from isolate_start_pfn, hence we flush pages
    that were freed. For example, the following scenario can be possible:

    - assume order-9 compaction, pageblock order is 9
    - isolate_start_pfn is 0x200
    - isolate_migratepages()
      - skip a number of pageblocks
      - start to isolate from pfn 0x600
      - cc->migrate_pfn = 0x620
      - return
    - last_migrated_pfn is set to 0x200
    - check flushing condition
      - current_block_start is set to 0x600
      - last_migrated_pfn < current_block_start then do useless flush

    This wrong flush would not help the performance and success rate so this
    patch tries to fix it. One simple way to know the exact position where
    we start to isolate migratable pages is that we cache it in
    isolate_migratepages() before entering actual isolation. This patch
    implements that and fixes the problem.
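
    The scenario can be checked with a small userspace sketch of the flush
    test. block_start() and should_flush() are hypothetical helpers that
    only model the comparison described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Start of the order-aligned block that contains pfn. */
static unsigned long block_start(unsigned long pfn, unsigned int order)
{
	assert(order < 8 * sizeof(unsigned long));
	return pfn & ~((1UL << order) - 1);
}

/* Flush only if the last migrated pfn lies in an earlier block. */
static bool should_flush(unsigned long last_migrated_pfn,
			 unsigned long migrate_pfn, unsigned int order)
{
	return last_migrated_pfn < block_start(migrate_pfn, order);
}
```

    With the stale start position, should_flush(0x200, 0x620, 9) is true
    and triggers the useless flush. Once the position where isolation
    actually began (0x600) is cached instead, should_flush(0x600, 0x620, 9)
    is false.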

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The compaction free scanner is looking for PageBuddy() pages and
    skipping all others. For large compound pages such as THP or hugetlbfs,
    we can save a lot of iterations if we skip them at once using their
    compound_order(). This is generally unsafe and we can read a bogus
    value of order due to a race, but if we are careful, the only danger is
    skipping too much.

    When tested with stress-highalloc from mmtests on 4GB system with 1GB
    hugetlbfs pages, the vmstat compact_free_scanned count decreased by at
    least 15%.

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Acked-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The compaction migrate scanner tries to skip THP pages by their order,
    to reduce number of iterations for pages it cannot isolate. The check
    is only done if PageLRU() is true, which means it applies to THP pages,
    but not e.g. hugetlbfs pages or any other non-LRU compound pages, which
    we have to iterate by base pages.

    This limitation comes from the assumption that it's only safe to read
    compound_order() when we have the zone's lru_lock and THP cannot be
    split under us. But the only danger (after filtering out order values
    that are not below MAX_ORDER, to prevent overflows) is that we skip too
    much or too little after reading a bogus compound_order() due to a rare
    race. This is the same reasoning as patch 99c0fd5e51c4 ("mm,
    compaction: skip buddy pages by their order in the migrate scanner")
    introduced for unsafely reading PageBuddy() order.

    After this patch, all pages are tested for PageCompound() and we skip
    them by compound_order(). The test is done after the test for
    balloon_page_movable() as we don't want to assume if balloon pages (or
    other pages with own isolation and migration implementation if a generic
    API gets implemented) are compound or not.

    When tested with stress-highalloc from mmtests on 4GB system with 1GB
    hugetlbfs pages, the vmstat compact_migrate_scanned count decreased by
    15%.
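
    The skipping logic can be modelled in userspace as below. The value of
    TOY_MAX_ORDER and the helpers compound_skip() and scan_visits() are
    assumptions for illustration; the point is only that an order read
    without protection is trusted when it is below the maximum and ignored
    otherwise.

```c
#include <assert.h>

#define TOY_MAX_ORDER 11U	/* assumed limit, stands in for MAX_ORDER */

/*
 * Number of extra pfns to skip for a compound page whose order was read
 * without protection. Bogus orders are filtered out to avoid overflow.
 */
static unsigned long compound_skip(unsigned int racy_order)
{
	if (racy_order < TOY_MAX_ORDER)
		return (1UL << racy_order) - 1;
	return 0;
}

/*
 * Count how many pfns a scan of [0, end_pfn) visits when one compound
 * page of the given order starts at head_pfn.
 */
static unsigned long scan_visits(unsigned long end_pfn,
				 unsigned long head_pfn, unsigned int order)
{
	unsigned long visits = 0;

	assert(head_pfn < end_pfn);
	for (unsigned long pfn = 0; pfn < end_pfn; pfn++) {
		visits++;
		if (pfn == head_pfn)
			pfn += compound_skip(order);	/* pfn++ steps past the tail */
	}
	return visits;
}
```

    In this model an order-9 compound page at pfn 512 in a 1024-pfn range
    is handled in one iteration instead of 512, while a bogus order simply
    falls back to walking base pages.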

    [kirill.shutemov@linux.intel.com: change PageTransHuge checks to PageCompound for different series was squashed here]
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Acked-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Resetting the cached compaction scanner positions is now open-coded in
    __reset_isolation_suitable() and compact_finished(). Encapsulate the
    functionality in a new function reset_cached_positions().

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Handling the position where compaction free scanner should restart
    (stored in cc->free_pfn) got more complex with commit e14c720efdd7 ("mm,
    compaction: remember position within pageblock in free pages scanner").
    Currently the position is updated in each loop iteration of
    isolate_freepages(), although it should be enough to update it only when
    breaking from the loop. There's also an extra check outside the loop
    that updates the position in case we have met the migration scanner.

    This can be simplified if we move the test for having isolated enough
    from the for-loop header next to the test for contention, and
    determining the restart position only in these cases. We can reuse the
    isolate_start_pfn variable for this instead of setting cc->free_pfn
    directly. Outside the loop, we can simply set cc->free_pfn to current
    value of isolate_start_pfn without any extra check.

    Also add a VM_BUG_ON to catch possible mistake in the future, in case we
    later add a new condition that terminates isolate_freepages_block()
    prematurely without also considering the condition in
    isolate_freepages().

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Acked-by: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Assorted compaction cleanups and optimizations. The interesting patches
    are 4 and 5. In 4, skipping of compound pages in a single iteration is
    improved for migration scanner, so it works also for !PageLRU compound
    pages such as hugetlbfs, slab etc. Patch 5 introduces this kind of
    skipping in the free scanner. The trick is that we can read
    compound_order() without any protection, if we are careful to filter out
    values larger than MAX_ORDER. The only danger is that we skip too much.
    The same trick was already used for reading the freepage order in the
    migrate scanner.

    To demonstrate improvements of Patches 4 and 5 I've run stress-highalloc
    from mmtests, set to simulate THP allocations (including __GFP_COMP) on
    a 4GB system where 1GB was occupied by hugetlbfs pages. I'll include
    just the relevant stats:

                                     Patch 3     Patch 4     Patch 5

    Compaction stalls                   7523        7529        7515
    Compaction success                   323         304         322
    Compaction failures                 7200        7224        7192
    Page migrate success              247778      264395      240737
    Page migrate failure               15358       33184       21621
    Compaction pages isolated         906928      980192      909983
    Compaction migrate scanned       2005277     1692805     1498800
    Compaction free scanned         13255284    11539986     9011276
    Compaction cost                      288         305         277

    With 5 iterations per patch, the results are still noisy, but we can see
    that Patch 4 does reduce migrate_scanned by 15% thanks to skipping the
    hugetlbfs pages at once. Interestingly, free_scanned is also reduced
    and I have no idea why. Patch 5 further reduces free_scanned as
    expected, by 15%. Other stats are unaffected modulo noise.

    [1] https://lkml.org/lkml/2015/1/19/158

    This patch (of 5):

    Compaction should finish when the migration and free scanner meet, i.e.
    they reach the same pageblock. Currently however, the test in
    compact_finished() simply compares the exact pfns, which may yield
    a false negative when the free scanner position is in the middle of a
    pageblock and the migration scanner reaches the beginning of the same
    pageblock.

    This hasn't been a problem until commit e14c720efdd7 ("mm, compaction:
    remember position within pageblock in free pages scanner") allowed the
    free scanner position to be in the middle of a pageblock between
    invocations. The hot-fix 1d5bfe1ffb5b ("mm, compaction: prevent
    infinite loop in compact_zone") prevented the issue by adding a special
    check in the migration scanner to satisfy the current detection of
    scanners meeting.

    However, the proper fix is to make the detection more robust. This
    patch introduces the compact_scanners_met() function that returns true
    when the free scanner position is in the same or lower pageblock than
    the migration scanner. The special case in isolate_migratepages()
    introduced by 1d5bfe1ffb5b is removed.
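
    The difference between the two tests can be shown with a short
    userspace sketch. The pageblock order of 9 and the function names are
    assumptions, and scanners_met_exact() is only a guess at the shape of
    the old exact-pfn comparison.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGEBLOCK_ORDER 9U	/* assumed pageblock order */

/* Old-style test (assumed shape): compare the exact pfns. */
static bool scanners_met_exact(unsigned long free_pfn,
			       unsigned long migrate_pfn)
{
	return free_pfn <= migrate_pfn;
}

/* Robust test: the free scanner is in the same or a lower pageblock. */
static bool scanners_met_by_block(unsigned long free_pfn,
				  unsigned long migrate_pfn)
{
	return (free_pfn >> PAGEBLOCK_ORDER) <=
	       (migrate_pfn >> PAGEBLOCK_ORDER);
}

/* Free scanner mid-block at 0x650, migration scanner at block start 0x600. */
static void replay_false_negative(void)
{
	assert(!scanners_met_exact(0x650, 0x600));	/* false negative */
	assert(scanners_met_by_block(0x650, 0x600));	/* same pageblock */
	assert(!scanners_met_by_block(0x800, 0x600));	/* still apart */
}
```

    Comparing pageblock indexes makes the result independent of where
    inside a pageblock either scanner stopped.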

    Suggested-by: Joonsoo Kim
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Acked-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

16 Apr, 2015

3 commits

  • mm/compaction.c:250:13: warning: 'suitable_migration_target' defined but not used [-Wunused-function]

    Reported-by: Fengguang Wu
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When compaction is activated via /proc/sys/vm/compact_memory, it should
    scan the whole zone. However, some platforms, for instance ARM, have
    the start_pfn of a zone at zero, so the first attempt to compact via
    /proc doesn't work. The compaction scanner positions need to be reset
    first.

    Signed-off-by: Gioh Kim
    Acked-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gioh Kim
     
  • Currently, pages which are marked as unevictable are protected from
    compaction, but not from other types of migration. The POSIX real time
    extension explicitly states that mlock() will prevent a major page
    fault, but the spirit of this is that mlock() should give a process the
    ability to control sources of latency, including minor page faults.
    However, the mlock manpage only explicitly says that a locked page will
    not be written to swap and this can cause some confusion. The
    compaction code today does not give a developer who wants to avoid swap
    but wants to have large contiguous areas available any method to achieve
    this state. This patch introduces a sysctl for controlling compaction
    behavior with respect to the unevictable lru. Users who demand no page
    faults after a page is present can set compact_unevictable_allowed to 0
    and users who need the large contiguous areas can enable compaction on
    locked memory by leaving the default value of 1.

    To illustrate this problem I wrote a quick test program that mmaps a
    large number of 1MB files filled with random data. These maps are
    created locked and read only. Then every other mmap is unmapped and I
    attempt to allocate huge pages to the static huge page pool. When the
    compact_unevictable_allowed sysctl is 0, I cannot allocate hugepages
    after fragmenting memory. When the value is set to 1, allocations
    succeed.
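
    The effect of the sysctl on page isolation can be sketched as a single
    predicate. This is a simplified model, not the kernel code: struct
    page_model and compaction_may_isolate() are hypothetical names used
    only for illustration.

```c
#include <stdbool.h>

/* Hypothetical model of a page: only the property this sketch needs. */
struct page_model {
    bool unevictable;   /* e.g. memory locked with mlock() */
};

/*
 * Returns true if compaction may isolate the page for migration, given
 * the current value of the compact_unevictable_allowed sysctl.
 */
bool compaction_may_isolate(const struct page_model *page,
                            int compact_unevictable_allowed)
{
    if (page->unevictable && !compact_unevictable_allowed)
        return false;
    return true;
}
```

    In this model, a value of 0 leaves locked pages in place, which is why
    the hugepage allocation in the test above cannot succeed after memory
    is fragmented.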

    Signed-off-by: Eric B Munson
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Acked-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Thomas Gleixner
    Cc: Christoph Lameter
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     

15 Apr, 2015

1 commit

  • Compaction has an anti-fragmentation algorithm: if we don't find any
    freepage in the buddy list of the requested migratetype, a freepage of
    more than pageblock order is required to finish the compaction. This is
    meant to mitigate fragmentation, but it lacks migratetype consideration
    and is excessive compared to the page allocator's anti-fragmentation
    algorithm.

    Not considering migratetype can cause compaction to finish prematurely.
    For example, if the allocation request is for the unmovable migratetype,
    a freepage with the CMA migratetype doesn't help that allocation, and
    compaction should not be stopped. But the current logic regards this
    situation as compaction no longer being needed, and finishes it.

    Secondly, the condition is excessive compared to the page allocator's
    logic. In the page allocator we can steal a freepage from another
    migratetype and change the pageblock migratetype under more relaxed
    conditions. That logic is designed to prevent fragmentation, and we can
    use it here. Imposing a hard constraint only on compaction doesn't help
    much, since the page allocator would cause fragmentation again.

    To solve these problems, this patch borrows the anti-fragmentation logic
    from the page allocator. It will reduce premature compaction finishes in
    some cases and reduce excessive compaction work.

    stress-highalloc test in mmtests with non movable order 7 allocation shows
    considerable increase of compaction success rate.

    Compaction success rate (Compaction success * 100 / Compaction stalls, %)
    31.82 : 42.20

    I tested it with 5 runs of the stress-highalloc benchmark without
    rebooting and found no more degradation in allocation success rate than
    before. That roughly means that this patch doesn't result in more
    fragmentation.

    Vlastimil suggested an additional idea: only test for fallbacks when the
    migration scanner has scanned a whole pageblock. It looked good for
    fragmentation, because the chance of stealing increases as more free
    pages are made in a given pageblock. So I tested it, but it resulted in
    a decreased compaction success rate, roughly 38.00. I guess the reason
    is that if the system is low on memory, the watermark check can fail
    due to not enough order-0 free pages, so sometimes we can't reach the
    fallback check although migrate_pfn is aligned to pageblock_nr_pages. I
    could insert code to cope with this situation, but it makes the code
    more complicated, so I don't include his idea in this patch.

    [akpm@linux-foundation.org: fix CONFIG_CMA=n build]
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

14 Feb, 2015

1 commit

  • Add kernel address sanitizer hooks to mark allocated page's addresses as
    accessible in corresponding shadow region. Mark freed pages as
    inaccessible.

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

13 Feb, 2015

3 commits

  • The vmstat interfaces are good at hiding negative counts (at least when
    CONFIG_SMP); but if you peer behind the curtain, you find that
    nr_isolated_anon and nr_isolated_file soon go negative, and grow ever
    more negative: so they can absorb larger and larger numbers of isolated
    pages, yet still appear to be zero.

    I'm happy to avoid a congestion_wait() when too_many_isolated() myself;
    but I guess it's there for a good reason, in which case we ought to get
    too_many_isolated() working again.

    The imbalance comes from isolate_migratepages()'s ISOLATE_ABORT case:
    putback_movable_pages() decrements the NR_ISOLATED counts, but we forgot
    to call acct_isolated() to increment them.
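
    The imbalance can be reproduced with a toy counter. The sketch below
    is a userspace model under stated assumptions: struct
    isolated_counter and the *_model() helpers are hypothetical stand-ins
    for the NR_ISOLATED accounting, not kernel functions.

```c
/* Hypothetical stand-in for a per-zone NR_ISOLATED_* counter. */
struct isolated_counter {
    long nr_isolated;
};

/* Models acct_isolated(): account pages that were just isolated. */
void acct_isolated_model(struct isolated_counter *c, long nr_pages)
{
    c->nr_isolated += nr_pages;
}

/* Models putback_movable_pages(): pages go back, the counter drops. */
void putback_model(struct isolated_counter *c, long nr_pages)
{
    c->nr_isolated -= nr_pages;
}

/*
 * One aborted isolation round. With account == 0 the increment is
 * skipped, as in the buggy ISOLATE_ABORT path described above.
 */
void abort_round(struct isolated_counter *c, long nr_pages, int account)
{
    if (account)
        acct_isolated_model(c, nr_pages);
    putback_model(c, nr_pages);
}
```

    Every aborted round without the increment pushes the counter further
    below zero, so later isolations can be absorbed without the count ever
    looking too high.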

    It is possible that the bug which this patch fixes could cause OOM kills
    when the system still has a lot of reclaimable page cache.

    Fixes: edc2ca612496 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()")
    Signed-off-by: Hugh Dickins
    Acked-by: Vlastimil Babka
    Acked-by: Joonsoo Kim
    Cc: [3.18+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Currently, freepage isolation in one pageblock doesn't consider how many
    freepages we isolate. When I traced flow of compaction, compaction
    sometimes isolates more than 256 freepages to migrate just 32 pages.

    In this patch, freepage isolation is stopped at the point that we
    have more isolated freepages than pages isolated for migration. This
    slows down the free page scanner and makes the compaction success
    rate higher.
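
    A minimal sketch of the stop condition, assuming the free pages in one
    pageblock are presented as an array of free chunk sizes.
    isolate_freepages_block_model() is a hypothetical name, and whether
    the real comparison is strict or non-strict is an assumption here.

```c
#include <stddef.h>

/*
 * chunk_pages[i] is the size in pages of the i-th free chunk found in
 * one pageblock (hypothetical input). Isolation stops once at least
 * nr_migratepages free pages have been collected.
 * Returns the number of free pages isolated.
 */
unsigned long isolate_freepages_block_model(const unsigned long *chunk_pages,
                                            size_t nr_chunks,
                                            unsigned long nr_migratepages)
{
    unsigned long nr_freepages = 0;

    for (size_t i = 0; i < nr_chunks; i++) {
        /* Enough targets for the pages waiting to be migrated. */
        if (nr_freepages >= nr_migratepages)
            break;
        nr_freepages += chunk_pages[i];
    }
    return nr_freepages;
}
```

    For chunks of 16, 16, 8, 64 and 128 pages and 32 pages to migrate, the
    model isolates 32 free pages instead of all 232.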

    stress-highalloc test in mmtests with non movable order 7 allocation shows
    increase of compaction success rate.

    Compaction success rate (Compaction success * 100 / Compaction stalls, %)
    27.13 : 31.82

    pfn where both scanners meet when compaction completes
    (separate test due to enormous tracepoint buffer)
    (zone_start=4096, zone_end=1048576)
    586034 : 654378

    In fact, I didn't fully understand why this patch gives such a good
    result. One guess was that unused freepages are released to the pcp list
    and on the next compaction attempt we won't isolate them again, so the
    compaction success rate would decrease. To prevent this effect, I tested
    with pcp drain code added to release_freepages(), but it had no
    noticeable effect.

    Anyway, this patch reduces the time wasted isolating unneeded freepages,
    so it seems reasonable.

    Vlastimil said:

    : I briefly tried it on top of the pivot-changing series and with order-9
    : allocations it reduced free page scanned counter by almost 10%. No effect
    : on success rates (maybe because pivot changing already took care of the
    : scanners meeting problem) but the scanning reduction is good on its own.
    :
    : It also explains why e14c720efdd7 ("mm, compaction: remember position
    : within pageblock in free pages scanner") had less than expected
    : improvements. It would only actually stop within pageblock in case of
    : async compaction detecting contention. I guess that's also why the
    : infinite loop problem fixed by 1d5bfe1ffb5b affected so relatively few
    : people.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Tested-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • What we want to check here is whether there is a high-order freepage in
    the buddy list of another migratetype, in order to steal it without
    fragmentation. But the current code just checks cc->order, which is the
    allocation request order, so this is wrong.

    Without this fix, non-movable synchronous compaction below pageblock order
    would not stop until compaction is complete, because the migratetype of
    most pageblocks is movable and the high-order freepages made by
    compaction are usually on the movable type buddy list.

    There is a report related to this bug; see the link below.

    http://www.spinics.net/lists/linux-mm/msg81666.html

    Although the affected system still has load spikes coming from
    compaction, this fix makes the system completely stable and responsive
    according to the report.

    stress-highalloc test in mmtests with non movable order 7 allocation
    doesn't show any notable difference in allocation success rate, but it
    shows a higher compaction success rate.

    Compaction success rate (Compaction success * 100 / Compaction stalls, %)
    18.47 : 28.94

    Fixes: 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page immediately when it is made available")
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: [3.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

12 Feb, 2015

5 commits

  • The compaction deferring logic is a heavy hammer that blocks the way to
    compaction. It doesn't consider the overall system state, so it can
    wrongly prevent the user from doing compaction. In other words, even if
    the system has enough range of memory to compact, compaction would be
    skipped due to the deferring logic. This patch adds a new tracepoint to
    understand how the deferring logic works. It will also help to check
    compaction success and failure.

    Signed-off-by: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • It is not well analyzed when and why compaction starts or finishes.
    With these new tracepoints, we can know much more about the start/finish
    reasons of compaction. I was able to find the following bug with these
    tracepoints.

    http://www.spinics.net/lists/linux-mm/msg81582.html

    Signed-off-by: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • It'd be useful to know the current range where compaction works, for
    detailed analysis. With it, we can know the pageblock where we actually
    scan and isolate and how many pages we try in that pageblock, and can
    roughly guess why it doesn't become a freepage of pageblock order.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We now have a tracepoint for the begin event of compaction that prints
    the start position of both scanners, but the tracepoint for the end
    event doesn't print the finish position of both scanners. It'd also be
    useful to know the finish positions, so this patch adds them. It will
    help to find odd behavior or problems in compaction's internal logic.

    And mode is added to both begin/end tracepoint output, since according to
    mode, compaction behavior is quite different.

    And lastly, status format is changed to string rather than status number
    for readability.

    [akpm@linux-foundation.org: fix sparse warning]
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Expand the usage of the struct alloc_context introduced in the previous
    patch also for calling try_to_compact_pages(), to reduce the number of its
    parameters. Since the function is in different compilation unit, we need
    to move alloc_context definition in the shared mm/internal.h header.

    With this change we get simpler code and small savings of code size and stack
    usage:

    add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-27 (-27)
    function old new delta
    __alloc_pages_direct_compact 283 256 -27
    add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-13 (-13)
    function old new delta
    try_to_compact_pages 582 569 -13

    Stack usage of __alloc_pages_direct_compact goes from 24 to none (per
    scripts/checkstack.pl).

    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Zhang Yanfei
    Cc: Minchan Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Cc: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

11 Dec, 2014

5 commits

  • The goal of memory compaction is to create high-order freepages through
    page migration. Page migration however puts pages on the per-cpu lru_add
    cache, which is later flushed to per-cpu pcplists, and only after pcplists
    are drained the pages can actually merge. This can happen due to the
    per-cpu caches becoming full through further freeing, or explicitly.

    During direct compaction, it is useful to do the draining explicitly so
    that pages merge as soon as possible and compaction can detect success
    immediately and keep the latency impact at minimum. However the current
    implementation is far from ideal. Draining is done only in
    __alloc_pages_direct_compact(), after all zones were already compacted,
    and the decisions to continue or stop compaction in individual zones was
    done without the last batch of migrations being merged. It is also
    missing the draining of lru_add cache before the pcplists.

    This patch moves the draining for direct compaction into compact_zone().
    It adds the missing lru_cache draining and uses the newly introduced
    single zone pcplists draining to reduce overhead and avoid impact on
    unrelated zones. Draining is only performed when it can actually lead to
    merging of a page of desired order (passed by cc->order). This means it
    is only done when migration occurred in the previously scanned cc->order
    aligned block(s) and the migration scanner is now pointing to the next
    cc->order aligned block.
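
    The draining condition can be sketched as a pure function. This is an
    illustration under assumptions: should_drain_model() is a hypothetical
    name, and using 0 as the "no migration recorded" marker for
    last_migrated_pfn is a convention of the sketch.

```c
#include <stdbool.h>

/*
 * Returns true when a drain could let pages merge into a free page of
 * the requested order: some page was migrated (last_migrated_pfn != 0)
 * in an order-aligned block that the migration scanner has already
 * left behind.
 */
bool should_drain_model(unsigned long migrate_pfn,
                        unsigned long last_migrated_pfn,
                        int order)
{
    unsigned long block_start;

    if (order <= 0 || last_migrated_pfn == 0)
        return false;

    /* Start of the order-aligned block the scanner now points into. */
    block_start = migrate_pfn & ~((1UL << order) - 1);

    return last_migrated_pfn < block_start;
}
```

    For order 4 (16-page blocks) and a scanner at pfn 40, the current
    block starts at pfn 32: a last migration at pfn 20 triggers a drain,
    one at pfn 35 does not.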

    The patch has been tested with stress-highalloc benchmark from mmtests.
    Although overall allocation success rates of the benchmark were not
    affected, the number of detected compaction successes has doubled. This
    suggests that allocations were previously successful due to implicit
    merging caused by background activity, making a later allocation attempt
    succeed immediately, but not attributing the success to compaction. Since
    stress-highalloc always tries to allocate almost the whole memory, it
    cannot show the improvement in its reported success rate metric. However
    after this patch, compaction should detect success and terminate earlier,
    reducing the direct compaction latencies in a real scenario.

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Compaction caches the migration and free scanner positions between
    compaction invocations, so that the whole zone gets eventually scanned and
    there is no bias towards the initial scanner positions at the
    beginning/end of the zone.

    The cached positions are continuously updated as scanners progress and the
    updating stops as soon as a page is successfully isolated. The reasoning
    behind this is that a pageblock where isolation succeeded is likely to
    succeed again in near future and it should be worth revisiting it.

    However, the downside is that potentially many pages are rescanned without
    successful isolation. At worst, there might be a page where isolation
    from LRU succeeds but migration fails (potentially always). So upon
    encountering this page, cached position would always stop being updated
    for no good reason. It might have been useful to let such page be
    rescanned with sync compaction after async one failed, but this is now
    handled by caching scanner position for async and sync mode separately
    since commit 35979ef33931 ("mm, compaction: add per-zone migration pfn
    cache for async compaction").

    After this patch, cached positions are updated unconditionally. In
    stress-highalloc benchmark, this has decreased the numbers of scanned
    pages by few percent, without affecting allocation success rates.

    To prevent free scanner from leaving free pages behind after they are
    returned due to page migration failure, the cached scanner pfn is changed
    to point to the pageblock of the returned free page with the highest pfn,
    before leaving compact_zone().
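
    A rough sketch of that last adjustment, assuming 512-page pageblocks
    and a free scanner that moves from the end of the zone towards lower
    pfns. update_cached_free_pfn() and its array argument are hypothetical
    and exist only for this illustration.

```c
#include <stddef.h>

/* Assumed for illustration: 512 pages per pageblock. */
#define PAGEBLOCK_NR_PAGES 512UL

/*
 * returned_pfns[] holds the pfns of free pages given back after a
 * migration failure. The cached free scanner position is raised to the
 * start of the pageblock containing the highest returned pfn, so that
 * those pages are not left behind by the next scan.
 */
unsigned long update_cached_free_pfn(unsigned long cached_free_pfn,
                                     const unsigned long *returned_pfns,
                                     size_t nr_returned)
{
    unsigned long high_pfn = 0;

    if (nr_returned == 0)
        return cached_free_pfn;

    for (size_t i = 0; i < nr_returned; i++) {
        if (returned_pfns[i] > high_pfn)
            high_pfn = returned_pfns[i];
    }

    /* Round down to the start of the pageblock. */
    high_pfn &= ~(PAGEBLOCK_NR_PAGES - 1);

    return high_pfn > cached_free_pfn ? high_pfn : cached_free_pfn;
}
```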

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Deferred compaction is employed to avoid compacting zone where sync direct
    compaction has recently failed. As such, it makes sense to only defer
    when a full zone was scanned, which is when compact_zone returns with
    COMPACT_COMPLETE. It's less useful to defer when compact_zone returns
    with apparent success (COMPACT_PARTIAL), followed by a watermark check
    failure, which can happen due to parallel allocation activity. It also
    does not make much sense to defer compaction which was completely skipped
    (COMPACT_SKIPPED) for being unsuitable in the first place.

    This patch therefore makes deferred compaction trigger only when
    COMPACT_COMPLETE is returned from compact_zone(). Results of
    stress-highalloc benchmark show the difference is within measurement error,
    so the issue is rather cosmetic.
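
    The resulting rule is small enough to sketch directly. The enum values
    and the function name should_defer_compaction() are assumptions made
    for this illustration; only the three status names come from the text
    above.

```c
#include <stdbool.h>

/* Status names as used in the text; the numeric values are arbitrary. */
enum compact_status {
    COMPACT_SKIPPED,    /* zone was unsuitable, nothing was done */
    COMPACT_PARTIAL,    /* stopped early with apparent success */
    COMPACT_COMPLETE    /* the full zone was scanned */
};

/*
 * Defer only when sync compaction scanned the whole zone and still
 * did not help; any other outcome leaves the deferral state alone.
 */
bool should_defer_compaction(enum compact_status status, bool sync)
{
    return sync && status == COMPACT_COMPLETE;
}
```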

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Since commit 53853e2d2bfb ("mm, compaction: defer each zone individually
    instead of preferred zone"), compaction is deferred for each zone where
    sync direct compaction fails, and reset where it succeeds. However, it
    was observed that for DMA zone compaction often appeared to succeed
    while subsequent allocation attempt would not, due to different outcome
    of watermark check.

    In order to properly defer compaction in this zone, the candidate zone
    has to be passed back to __alloc_pages_direct_compact() and compaction
    deferred in the zone after the allocation attempt fails.

    The large source of mismatch between watermark check in compaction and
    allocation was the lack of alloc_flags and classzone_idx values in
    compaction, which has been fixed in the previous patch. So with this
    problem fixed, we can simplify the code by removing the candidate_zone
    parameter and deferring in __alloc_pages_direct_compact().

    After this patch, the compaction activity during stress-highalloc
    benchmark is still somewhat increased, but it's negligible compared to the
    increase that occurred without the better watermark checking. This
    suggests that it is still possible to apparently succeed in compaction but
    fail to allocate, possibly due to parallel allocation activity.

    [akpm@linux-foundation.org: fix build]
    Suggested-by: Joonsoo Kim
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Compaction relies on zone watermark checks for decisions such as whether
    it's worth starting compaction in compaction_suitable() or whether
    compaction should stop in compact_finished(). The watermark checks take
    classzone_idx and alloc_flags parameters, which are related to the memory
    allocation request. But from the context of compaction they are currently
    passed as 0, including the direct compaction which is invoked to satisfy
    the allocation request, and could therefore know the proper values.

    The lack of proper values can lead to mismatch between decisions taken
    during compaction and decisions related to the allocation request. Lack
    of proper classzone_idx value means that lowmem_reserve is not taken into
    account. This has manifested (during recent changes to deferred
    compaction) when DMA zone was used as fallback for preferred Normal zone.
    compaction_suitable() without proper classzone_idx would think that the
    watermarks are already satisfied, but watermark check in
    get_page_from_freelist() would fail. Because of this problem, deferring
    compaction has extra complexity that can be removed in the following
    patch.
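
    The role of classzone_idx can be shown with a simplified order-0
    watermark check. struct zone_model, its field values and
    watermark_ok_order0() are hypothetical; the real check also handles
    higher orders and alloc_flags, which this sketch leaves out.

```c
#include <stdbool.h>

#define NR_ZONES_MODEL 3   /* e.g. DMA, DMA32, Normal (assumed layout) */

/* Hypothetical, heavily simplified view of a zone. */
struct zone_model {
    unsigned long free_pages;
    unsigned long watermark;
    /* Pages kept back from allocations that prefer a higher zone. */
    unsigned long lowmem_reserve[NR_ZONES_MODEL];
};

/*
 * Order-0 watermark check: the reserve that applies depends on the
 * zone the allocation originally preferred (classzone_idx).
 */
bool watermark_ok_order0(const struct zone_model *z, int classzone_idx)
{
    return z->free_pages > z->watermark + z->lowmem_reserve[classzone_idx];
}
```

    With 3000 free pages, a watermark of 100 and a reserve of 4000 pages
    against Normal-zone requests, the check passes when classzone_idx is
    given as 0 but fails for the proper index 2, which is the mismatch
    described above.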

    The issue (not confirmed in practice) with missing alloc_flags is opposite
    in nature. For allocations that include ALLOC_HIGH, ALLOC_HIGHER or
    ALLOC_CMA in alloc_flags (the last includes all MOVABLE allocations on
    CMA-enabled systems) the watermark checking in compaction with 0 passed
    will be stricter than in get_page_from_freelist(). In these cases
    compaction might be running for a longer time than is really needed.

    Another issue in compaction_suitable() is that the check for "does the
    zone need compaction at all?" comes only after the check "does the zone
    have enough free pages to succeed compaction". The latter considers extra
    pages for migration and can therefore in some situations fail and return
    COMPACT_SKIPPED, although the high-order allocation would succeed and we
    should return COMPACT_PARTIAL.

    This patch fixes these problems by adding alloc_flags and classzone_idx to
    struct compact_control and related functions involved in direct compaction
    and watermark checking. Where possible, all other callers of
    compaction_suitable() pass proper values where those are known. This is
    currently limited to classzone_idx, which is sometimes known in kswapd
    context. However, the direct reclaim callers should_continue_reclaim()
    and compaction_ready() do not currently know the proper values, so the
    coordination between reclaim and compaction may still not be as accurate
    as it could. This can be fixed later, if it's shown to be an issue.

    Additionally, the checks in compaction_suitable() are reordered to address the
    second issue described above.
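
    The reordered checks can be sketched as follows. This is a simplified
    model, not the kernel function: the result of the allocation
    watermark check is passed in as a flag, COMPACT_CONTINUE and the
    headroom of 2 << order extra pages for migration are assumptions of
    the sketch, and the later fragmentation checks are left out.

```c
#include <stdbool.h>

enum compact_result {
    COMPACT_SKIPPED,    /* not enough free pages to compact */
    COMPACT_CONTINUE,   /* compaction should run (assumed name) */
    COMPACT_PARTIAL     /* allocation would already succeed */
};

enum compact_result compaction_suitable_model(bool alloc_watermark_ok,
                                              unsigned long free_pages,
                                              unsigned long watermark,
                                              unsigned int order)
{
    /* First: does the zone need compaction at all? */
    if (alloc_watermark_ok)
        return COMPACT_PARTIAL;

    /* Then: are there enough free pages to serve as migration targets? */
    if (free_pages <= watermark + (2UL << order))
        return COMPACT_SKIPPED;

    return COMPACT_CONTINUE;
}
```

    With the checks in this order, a zone that can already satisfy the
    high-order allocation but has little spare order-0 memory yields
    COMPACT_PARTIAL rather than COMPACT_SKIPPED.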

    The effect of this patch should be slightly better high-order allocation
    success rates and/or less compaction overhead, depending on the type of
    allocations and presence of CMA. It allows simplifying deferred
    compaction code in a followup patch.

    When testing with stress-highalloc, there was some slight improvement
    (which might be just due to variance) in success rates of non-THP-like
    allocations.

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

14 Nov, 2014

2 commits

  • Several people have reported occasionally seeing processes stuck in
    compact_zone(), even triggering soft lockups, in 3.18-rc2+.

    Testing a revert of commit e14c720efdd7 ("mm, compaction: remember
    position within pageblock in free pages scanner") fixed the issue,
    although the stuck processes do not appear to involve the free scanner.

    Finally, by code inspection, the bug was found in isolate_migratepages()
    which uses a slightly different condition to detect if the migration and
    free scanners have met, than compact_finished(). That has not been a
    problem until commit e14c720efdd7 allowed the free scanner position
    between individual invocations to be in the middle of a pageblock.

    In a relatively rare case, the migration scanner position can end up at
    the beginning of a pageblock, with the free scanner position in the
    middle of the same pageblock. If it's the migration scanner's turn,
    isolate_migratepages() exits immediately (without updating the
    position), while compact_finished() decides to continue compaction,
    resulting in a potentially infinite loop. The system can recover only
    if another process creates enough high-order pages to make the watermark
    checks in compact_finished() pass.

    This patch fixes the immediate problem by bumping the migration
    scanner's position to meet the free scanner in isolate_migratepages(),
    when both are within the same pageblock. This causes compact_finished()
    to terminate properly. A more robust check in compact_finished() is
    planned as a cleanup for better future maintainability.

    Fixes: e14c720efdd73 ("mm, compaction: remember position within pageblock in free pages scanner")
    Signed-off-by: Vlastimil Babka
    Reported-by: P. Christeas
    Tested-by: P. Christeas
    Link: http://marc.info/?l=linux-mm&m=141508604232522&w=2
    Reported-by: Norbert Preining
    Tested-by: Norbert Preining
    Link: https://lkml.org/lkml/2014/11/4/904
    Reported-by: Pavel Machek
    Link: https://lkml.org/lkml/2014/11/7/164
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Commit 7d49d8868336 ("mm, compaction: reduce zone checking frequency in
    the migration scanner") has a side-effect that changes the iteration
    range calculation. Before the change, block_end_pfn is calculated using
    start_pfn, but now it blindly adds pageblock_nr_pages to the previous
    value.

    This causes the problem that isolation_start_pfn is larger than
    block_end_pfn when we isolate the page with more than pageblock order.
    In this case, isolation would fail due to an invalid range parameter.

    To prevent this, this patch implements skipping the range until a proper
    target pageblock is met. Without this patch, CMA with more than
    pageblock order always fails but with this patch it will succeed.

    Signed-off-by: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

30 Oct, 2014

1 commit

  • Commit edc2ca612496 ("mm, compaction: move pageblock checks up from
    isolate_migratepages_range()") commonizes the isolate_migratepages
    variants and makes them use isolate_migratepages_block().

    isolate_migratepages_block() could stop the execution when enough pages
    are isolated, but there is no code in isolate_migratepages_range() to
    handle this case. As a result, even if isolate_migratepages_block()
    returns prematurely without checking all pages in the range,
    isolate_migratepages_block() is called repeatedly on the following
    pageblock, and some pages in the previous range are never checked.
    CMA then fails frequently because of this.

    To fix this problem, this patch lets isolate_migratepages_range() know
    when enough pages are isolated and stops the isolation in that case.

    Note that isolate_migratepages() has no such problem, because, it always
    stops the isolation after just one call of isolate_migratepages_block().

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Minchan Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

10 Oct, 2014

3 commits

  • Sasha Levin reported KASAN splash inside isolate_migratepages_range().
    Problem is in the function __is_movable_balloon_page() which tests
    AS_BALLOON_MAP in page->mapping->flags. This function has no protection
    against anonymous pages. As result it tried to check address space flags
    inside struct anon_vma.

    Further investigation shows more problems in current implementation:

    * Special branch in __unmap_and_move() never works:
    balloon_page_movable() checks page flags and page_count. In
    __unmap_and_move() page is locked, reference counter is elevated, thus
    balloon_page_movable() always fails. As a result execution goes to the
    normal migration path. virtballoon_migratepage() returns
    MIGRATEPAGE_BALLOON_SUCCESS instead of MIGRATEPAGE_SUCCESS,
    move_to_new_page() thinks this is an error code and assigns
    newpage->mapping to NULL. The newly migrated page loses its connection
    with the balloon and all ability for further migration.

    * lru_lock is erroneously required in isolate_migratepages_range() for
    isolating a ballooned page. This function releases lru_lock
    periodically, which makes migration mostly impossible for some pages.

    * balloon_page_dequeue has a tight race with balloon_page_isolate:
    balloon_page_isolate could be executed in parallel with dequeue between
    picking a page from the list and locking page_lock. The race is rare
    because they use trylock_page() for locking.

    This patch fixes all of them.

    Instead of fake mapping with special flag this patch uses special state of
    page->_mapcount: PAGE_BALLOON_MAPCOUNT_VALUE = -256. Buddy allocator uses
    PAGE_BUDDY_MAPCOUNT_VALUE = -128 for similar purpose. Storing mark
    directly in struct page makes everything safer and easier.
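
    The marking scheme can be sketched in userspace C. The two constants
    come from the text above; struct balloon_page_model, the plain int
    field, the helper names and the reset value of -1 for "no special
    state" are assumptions of this sketch (the kernel uses its own struct
    page and atomic accessors).

```c
#include <stdbool.h>

#define PAGE_BUDDY_MAPCOUNT_VALUE   (-128)
#define PAGE_BALLOON_MAPCOUNT_VALUE (-256)

/* Hypothetical cut-down page descriptor with only the mark field. */
struct balloon_page_model {
    int mapcount;   /* stands in for page->_mapcount */
};

bool page_is_balloon(const struct balloon_page_model *p)
{
    return p->mapcount == PAGE_BALLOON_MAPCOUNT_VALUE;
}

bool page_is_buddy(const struct balloon_page_model *p)
{
    return p->mapcount == PAGE_BUDDY_MAPCOUNT_VALUE;
}

void set_page_balloon(struct balloon_page_model *p)
{
    p->mapcount = PAGE_BALLOON_MAPCOUNT_VALUE;
}

void clear_page_balloon(struct balloon_page_model *p)
{
    p->mapcount = -1;   /* assumed "no special state" value */
}
```

    Because the mark lives in the page descriptor itself, the test never
    has to dereference page->mapping, which is what went wrong for
    anonymous pages in the report above.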

    PagePrivate is used to mark pages present in page list (i.e. not
    isolated, like PageLRU for normal pages). It replaces special rules for
    reference counter and makes balloon migration similar to migration of
    normal pages. This flag is protected by page_lock together with link to
    the balloon device.

    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Sasha Levin
    Link: http://lkml.kernel.org/p/53E6CEAA.9020105@oracle.com
    Cc: Rafael Aquini
    Cc: Andrey Ryabinin
    Cc: [3.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • C mm/compaction.o
    mm/compaction.c: In function isolate_freepages_block:
    mm/compaction.c:364:37: warning: flags may be used uninitialized in this function [-Wmaybe-uninitialized]
    && compact_unlock_should_abort(&cc->zone->lock, flags,
    ^

    Signed-off-by: Xiubo Li
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Minchan Kim
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiubo Li
     
  • struct compact_control currently converts the gfp mask to a migratetype,
    but we need the entire gfp mask in a follow-up patch.

    Pass the entire gfp mask as part of struct compact_control.

    Signed-off-by: David Rientjes
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes