12 Jan, 2017

1 commit

  • commit 6afcf8ef0ca0a69d014f8edb613d94821f0ae700 upstream.

    Since commit bda807d44454 ("mm: migrate: support non-lru movable page
    migration") isolate_migratepages_block() can isolate !PageLRU pages,
    which acct_isolated() then accounts as NR_ISOLATED_*. Accounting these
    non-lru pages as NR_ISOLATED_{ANON,FILE} doesn't make any sense and can
    mislead heuristics based on those counters, such as
    pgdat_reclaimable_pages() and too_many_isolated(), leading to
    unexpected stalls during direct reclaim without any good reason. Note
    that __alloc_contig_migrate_range() can isolate a lot of pages at once.

    On mobile devices such as an Android phone with 512MB of RAM, a large
    zram swap may be in use. In some cases zram (zsmalloc) holds many
    non-lru but movable pages, for example:

    MemTotal:       468148 kB
    Normal free:      5620 kB
    Free swap:        4736 kB
    Total swap:     409596 kB
    ZRAM:           164616 kB (zsmalloc non-lru pages)
    active_anon:     60700 kB
    inactive_anon:   60744 kB
    active_file:     34420 kB
    inactive_file:   37532 kB

    Fix this by accounting only LRU pages to NR_ISOLATED_* in
    isolate_migratepages_block(), right after they are isolated and we
    still know they were on the LRU. Drop acct_isolated() because it is
    called after the fact, when that information has been lost. Batching
    the per-cpu counters doesn't bring much improvement anyway. Also make
    sure that we uncharge only LRU pages when putting them back on the LRU
    in putback_movable_pages() and when unmap_and_move() migrates the page.
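
    To illustrate the rule this fix establishes, here is a minimal
    userspace C sketch (not the kernel code; names such as page_model and
    mod_isolated() are invented for illustration): the NR_ISOLATED_*
    counters are touched only for pages that were actually taken off an
    LRU list, so non-lru movable pages such as zsmalloc's never inflate
    them.

    #include <stdbool.h>
    #include <stdio.h>

    enum { NR_ISOLATED_ANON, NR_ISOLATED_FILE, NR_COUNTERS };
    static long node_stat[NR_COUNTERS];

    struct page_model {
        bool on_lru;   /* was the page on an LRU list when isolated? */
        bool is_file;  /* file-backed vs. anonymous */
    };

    /* Count isolated pages only if they came off an LRU list. */
    static void mod_isolated(const struct page_model *page, long delta)
    {
        if (!page->on_lru)
            return; /* non-lru movable page (e.g. zsmalloc): not counted */
        node_stat[page->is_file ? NR_ISOLATED_FILE : NR_ISOLATED_ANON] += delta;
    }

    int main(void)
    {
        struct page_model lru_page = { .on_lru = true, .is_file = false };
        struct page_model zram_page = { .on_lru = false, .is_file = false };

        mod_isolated(&lru_page, 1);   /* isolated for migration */
        mod_isolated(&zram_page, 1);  /* ignored */
        mod_isolated(&lru_page, -1);  /* put back or migrated */

        printf("NR_ISOLATED_ANON = %ld\n", node_stat[NR_ISOLATED_ANON]);
        return 0;
    }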

    [mhocko@suse.com: replace acct_isolated() with direct counting]
    Fixes: bda807d44454 ("mm: migrate: support non-lru movable page migration")
    Link: http://lkml.kernel.org/r/20161019080240.9682-1-mhocko@kernel.org
    Signed-off-by: Ming Ling
    Signed-off-by: Michal Hocko
    Acked-by: Minchan Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Ming Ling
     

08 Oct, 2016

12 commits

  • The fragmentation index and the vm.extfrag_threshold sysctl are meant
    as a heuristic to prevent excessive compaction for costly orders (i.e.
    THP). It's unlikely to make any difference for non-costly orders,
    especially with the default threshold. But we cannot afford any
    uncertainty for the non-costly orders, where the only alternative to
    successful reclaim/compaction is OOM. After the recent patches we are
    guaranteed maximum effort from compaction, without heuristics, before
    deciding on OOM, and fragindex is the last remaining heuristic.
    Therefore skip fragindex altogether for non-costly orders.
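
    A minimal sketch of the resulting decision, in plain C with invented
    names (fragindex_says_skip(), should_skip_compaction()); this is not
    the kernel's compaction_suitable(), just the shape of the check: the
    fragindex heuristic is consulted only above PAGE_ALLOC_COSTLY_ORDER.

    #include <stdbool.h>

    #define PAGE_ALLOC_COSTLY_ORDER 3

    /* Hypothetical stand-in for the fragindex/extfrag_threshold test. */
    static bool fragindex_says_skip(unsigned int order)
    {
        (void)order;
        return false;
    }

    static bool should_skip_compaction(unsigned int order)
    {
        /* Non-costly orders: never trust the heuristic, always try. */
        if (order <= PAGE_ALLOC_COSTLY_ORDER)
            return false;
        return fragindex_says_skip(order);
    }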

    Suggested-by: Michal Hocko
    Link: http://lkml.kernel.org/r/20160926162025.21555-5-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The compaction_zonelist_suitable() function tries to determine if
    compaction will be able to proceed after sufficient reclaim, i.e.
    whether there are enough reclaimable pages to provide enough order-0
    freepages for compaction.

    This addition of reclaimable pages to the free pages works well for
    the order-0 watermark check, but the fragmentation index check
    considers only truly free pages. Thus we can get a fragindex value
    close to 0, which indicates failure due to lack of memory, and wrongly
    decide that compaction won't be suitable even after reclaim.

    Instead of trying to somehow adjust fragindex for reclaimable pages,
    let's just skip it in compaction_zonelist_suitable().

    Link: http://lkml.kernel.org/r/20160926162025.21555-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Several people have reported premature OOMs for order-2 allocations
    (stack) due to OOM rework in 4.7. In the scenario (parallel kernel
    build and dd writing to two drives) many pageblocks get marked as
    Unmovable and compaction free scanner struggles to isolate free pages.
    Joonsoo Kim pointed out that the free scanner skips pageblocks that are
    not movable, to prevent filling them and forcing non-movable
    allocations to fall back to other pageblocks. Such a heuristic makes
    sense to help prevent long-term fragmentation, but premature OOMs are
    the relatively more urgent problem. As a compromise, this patch
    disables the heuristic only for the ultimate compaction priority.

    Link: http://lkml.kernel.org/r/20160906135258.18335-5-vbabka@suse.cz
    Reported-by: Ralf-Peter Rohbeck
    Reported-by: Arkadiusz Miskiewicz
    Reported-by: Olaf Hering
    Suggested-by: Joonsoo Kim
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The __compaction_suitable() function checks the low watermark plus a
    compact_gap() gap to decide if there's enough free memory to perform
    compaction. Then __isolate_free_page() uses the low watermark check to
    decide if a particular free page can be isolated. In the latter case,
    using the low watermark is needlessly pessimistic, as the free page
    isolations are only temporary. For __compaction_suitable() the higher
    watermark makes sense for high-order allocations, where more freepages
    increase the chance of success, and we can typically fail with some
    order-0 fallback when the system is struggling to reach that watermark.
    But for low-order allocations, forming the page should not be that
    hard. So using the low watermark here might just prevent compaction
    from even trying, and eventually lead to the OOM killer even though we
    are above the min watermarks.

    So after this patch, we use min watermark for non-costly orders in
    __compaction_suitable(), and for all orders in __isolate_free_page().
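
    As a rough sketch of the watermark selection described above
    (illustrative only; the costly-order cutoff and the enum are
    simplified, this is not the kernel's __compaction_suitable()):

    #define PAGE_ALLOC_COSTLY_ORDER 3

    enum wmark { WMARK_MIN, WMARK_LOW, WMARK_HIGH };

    /* Watermark used by the compaction suitability check. */
    static enum wmark compaction_suitable_wmark(unsigned int order)
    {
        /* Costly orders can afford to be picky; non-costly ones cannot. */
        return order > PAGE_ALLOC_COSTLY_ORDER ? WMARK_LOW : WMARK_MIN;
    }

    /* Freepage isolation is temporary, so the min watermark is enough. */
    static enum wmark isolate_free_page_wmark(void)
    {
        return WMARK_MIN;
    }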

    [vbabka@suse.cz: clarify __isolate_free_page() comment]
    Link: http://lkml.kernel.org/r/7ae4baec-4eca-e70b-2a69-94bea4fb19fa@suse.cz
    Link: http://lkml.kernel.org/r/20160810091226.6709-11-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The __compaction_suitable() function checks the low watermark plus a
    compact_gap() gap to decide if there's enough free memory to perform
    compaction. This check uses direct compactor's alloc_flags, but that's
    wrong, since these flags are not applicable for freepage isolation.

    For example, alloc_flags may indicate access to memory reserves, making
    compaction proceed, and then fail watermark check during the isolation.

    A similar problem exists for ALLOC_CMA, which may be part of
    alloc_flags but is not used during freepage isolation. In this case,
    however, it makes sense to use ALLOC_CMA both in
    __compaction_suitable() and __isolate_free_page(), since there's
    actually nothing preventing the freepage scanner from isolating pages
    in CMA pageblocks, with the assumption that a page that could be
    migrated once by compaction can also be migrated later by a CMA
    allocation. Thus we should count pages in CMA pageblocks when
    considering compaction suitability and when isolating freepages.

    To sum up, this patch should remove some false positives from
    __compaction_suitable(), and allow compaction to proceed when free pages
    required for compaction reside in the CMA pageblocks.

    Link: http://lkml.kernel.org/r/20160810091226.6709-10-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Compaction uses a watermark gap of (2UL << order) pages at various
    places and it's not immediately obvious why. Abstract it through a
    compact_gap() wrapper to create a single place with a thorough
    explanation.
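
    A simplified userspace rendering of the idea (a sketch, not the actual
    kernel helper): compaction isolates up to (1UL << order) source pages
    for migration and needs roughly as many free target pages to place
    them, hence a gap of twice the requested block size.

    static unsigned long compact_gap(unsigned int order)
    {
        /* source pages isolated for migration + free target pages */
        return 2UL << order;
    }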

    [vbabka@suse.cz: clarify the comment of compact_gap()]
    Link: http://lkml.kernel.org/r/7b6aed1f-fdf8-2063-9ff4-bbe4de712d37@suse.cz
    Link: http://lkml.kernel.org/r/20160810091226.6709-9-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The __compact_finished() function uses low watermark in a check that has
    to pass if the direct compaction is to finish and allocation should
    succeed. This is too pessimistic, as the allocation will typically use
    min watermark. It may happen that during compaction, we drop below the
    low watermark (due to parallel activity), but still form the target
    high-order page. By checking against low watermark, we might needlessly
    continue compaction.

    Similarly, __compaction_suitable() uses low watermark in a check whether
    allocation can succeed without compaction. Again, this is unnecessarily
    pessimistic.

    After this patch, these checks will use the direct compactor's
    alloc_flags to determine the watermark, which is effectively the min
    watermark.

    Link: http://lkml.kernel.org/r/20160810091226.6709-8-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • During reclaim/compaction loop, it's desirable to get a final answer
    from unsuccessful compaction so we can either fail the allocation or
    invoke the OOM killer. However, heuristics such as deferred compaction
    or pageblock skip bits can cause compaction to skip parts or whole zones
    and lead to premature OOMs, failures, or excessive reclaim/compaction
    retries.

    To remedy this, we introduce a new direct compaction priority called
    COMPACT_PRIO_SYNC_FULL, which instructs direct compaction to:

    - ignore deferred compaction status for a zone
    - ignore pageblock skip hints
    - ignore cached scanner positions and scan the whole zone

    The new priority should get eventually picked up by
    should_compact_retry() and this should improve success rates for costly
    allocations using __GFP_REPEAT, such as hugetlbfs allocations, and
    reduce some corner-case OOMs for non-costly allocations.
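
    Roughly, the new priority translates into compact_control settings
    along the lines of the following sketch (struct and field names are
    abridged from the description above, not copied from the source):

    #include <stdbool.h>

    enum compact_priority {
        COMPACT_PRIO_SYNC_FULL,   /* new: maximum effort, no heuristics */
        COMPACT_PRIO_SYNC_LIGHT,
        COMPACT_PRIO_ASYNC,
    };

    struct compact_control_sketch {
        bool ignore_skip_hint;    /* ignore pageblock skip bits */
        bool whole_zone;          /* ignore cached scanner positions */
        bool ignore_deferred;     /* ignore deferred-compaction status */
    };

    static void apply_priority(struct compact_control_sketch *cc,
                               enum compact_priority prio)
    {
        bool full = (prio == COMPACT_PRIO_SYNC_FULL);

        cc->ignore_skip_hint = full;
        cc->whole_zone = full;
        cc->ignore_deferred = full;
    }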

    Link: http://lkml.kernel.org/r/20160810091226.6709-6-vbabka@suse.cz
    [vbabka@suse.cz: use the MIN_COMPACT_PRIORITY alias]
    Link: http://lkml.kernel.org/r/d443b884-87e7-1c93-8684-3a3a35759fb1@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Joonsoo has reminded me that in a later patch changing watermark checks
    throughout compaction I forgot to update the checks in
    try_to_compact_pages() and kcompactd_do_work(). Closer inspection
    however shows that they are now redundant in the success case, because
    compact_zone() now reliably reports this with COMPACT_SUCCESS. So
    effectively the checks just repeat (a subset of) the checks that have
    just passed. So instead of checking watermarks again, just test the
    return value.

    Note it's also possible that compaction would declare failure e.g.
    because its find_suitable_fallback() is more strict than a simple
    watermark check, and then the watermark check we are removing would
    still succeed. After this patch this is not possible and it's arguably
    better, because for long-term fragmentation avoidance we should rather
    try a different zone than allocate with an unsuitable fallback. If
    compaction of all zones fails and the allocation is important enough,
    it will retry and succeed anyway.

    Also remove the stray "bool success" variable from kcompactd_do_work().

    Link: http://lkml.kernel.org/r/20160810091226.6709-5-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reported-by: Joonsoo Kim
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • COMPACT_PARTIAL has historically meant that compaction returned after
    doing some work without fully compacting a zone. It however didn't
    distinguish if compaction terminated because it succeeded in creating
    the requested high-order page. This has changed recently and now we
    only return COMPACT_PARTIAL when compaction thinks it succeeded, or the
    high-order watermark check in compaction_suitable() passes and no
    compaction needs to be done.

    So at this point we can make the return value clearer by renaming it to
    COMPACT_SUCCESS. The next patch will remove some redundant tests for
    success where compaction just returned COMPACT_SUCCESS.

    Link: http://lkml.kernel.org/r/20160810091226.6709-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Since kswapd compaction moved to kcompactd, compact_pgdat() is not
    called anymore, so we remove it. The only caller of __compact_pgdat()
    is compact_node(), so we merge them and remove code that was only
    reachable from kswapd.

    Link: http://lkml.kernel.org/r/20160810091226.6709-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "make direct compaction more deterministic".

    This is mostly a followup to Michal's oom detection rework, which
    highlighted the need for direct compaction to provide better feedback
    in the reclaim/compaction loop, so that it can reliably recognize when
    compaction cannot make further progress and the allocation should
    invoke the OOM killer or fail. We've discussed this at LSF/MM [1],
    where I proposed
    expanding the async/sync migration mode used in compaction to more
    general "priorities". This patchset adds one new priority that just
    overrides all the heuristics and makes compaction fully scan all zones.
    I don't currently think that we need more fine-grained priorities, but
    we'll see. Other than that there's some smaller fixes and cleanups,
    mainly related to the THP-specific hacks.

    I've tested this with stress-highalloc in GFP_KERNEL order-4 and
    THP-like order-9 scenarios. There's some improvement for compaction
    stats for the order-4, which is likely due to the better watermarks
    handling. In the previous version I reported mostly noise wrt
    compaction stats, and decreased direct reclaim - now the reclaim is
    without difference. I believe this is due to the less aggressive
    compaction priority increase in patch 6.

    "before" is a mmotm tree prior to 4.7 release plus the first part of the
    series that was sent and merged separately

                                      before        after
    order-4:

    Compaction stalls                  27216        30759
    Compaction success                 19598        25475
    Compaction failures                 7617         5283
    Page migrate success              370510       464919
    Page migrate failure               25712        27987
    Compaction pages isolated         849601      1041581
    Compaction migrate scanned     143146541    101084990
    Compaction free scanned        208355124    144863510
    Compaction cost                     1403         1210

    order-9:

    Compaction stalls                   7311         7401
    Compaction success                  1634         1683
    Compaction failures                 5677         5718
    Page migrate success              194657       183988
    Page migrate failure                4753         4170
    Compaction pages isolated         498790       456130
    Compaction migrate scanned        565371       524174
    Compaction free scanned          4230296      4250744
    Compaction cost                      215          203

    [1] https://lwn.net/Articles/684611/

    This patch (of 11):

    A recent patch has added a whole_zone flag that compaction sets when
    scanning starts from the zone boundary, in order to report that the
    zone has been fully scanned in one attempt. For allocations that want
    to try really hard or cannot fail, we will want to introduce a mode
    where scanning the whole zone is guaranteed regardless of the cached
    positions.

    This patch reuses the whole_zone flag in such a way that if it is
    already passed to compaction as true, the cached scanner positions are
    ignored. Employing this flag during the reclaim/compaction loop will be
    done in the next patch. This patch however converts compaction invoked
    from userspace via procfs to use this flag. Before this patch, the
    cached positions were first reset to zone boundaries and then read back
    from struct zone, so there was a window where a parallel compaction
    could replace the reset values, making the manual compaction less
    effective. Using the flag instead of performing the reset is more
    robust.
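
    The intended effect can be pictured with a small sketch (simplified C
    of the scanner setup; the struct names are invented and this is not
    the actual compact_zone()): when whole_zone is already set by the
    caller, the cached positions are never consulted.

    #include <stdbool.h>

    struct zone_sketch {
        unsigned long start_pfn, end_pfn;
        unsigned long cached_migrate_pfn, cached_free_pfn;
    };

    struct cc_sketch {
        bool whole_zone;   /* set by the caller to force a full scan */
        unsigned long migrate_pfn, free_pfn;
    };

    static void setup_scanners(struct cc_sketch *cc, const struct zone_sketch *z)
    {
        if (cc->whole_zone) {
            /* Caller demanded a full zone scan: ignore cached positions. */
            cc->migrate_pfn = z->start_pfn;
            cc->free_pfn = z->end_pfn - 1;
        } else {
            cc->migrate_pfn = z->cached_migrate_pfn;
            cc->free_pfn = z->cached_free_pfn;
        }
    }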

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20160810091226.6709-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

29 Jul, 2016

8 commits

  • Async compaction detects contention either due to failing trylock on
    zone->lock or lru_lock, or by need_resched(). Since 1f9efdef4f3f ("mm,
    compaction: khugepaged should not give up due to need_resched()") the
    code got quite complicated to distinguish these two up to the
    __alloc_pages_slowpath() level, so different decisions could be taken
    for khugepaged allocations.

    After the recent changes, khugepaged allocations don't check for
    contended compaction anymore, so we again don't need to distinguish
    lock and sched contention, and can simplify the currently convoluted
    code a lot.

    However, I believe it's also possible to simplify even more and
    completely remove the check for contended compaction after the initial
    async compaction for costly orders, which was originally aimed at THP
    page fault allocations. There are several reasons why this can be done
    now:

    - with the new defaults, THP page faults no longer do reclaim/compaction
      at all, unless the system admin has overridden the default, or the
      application has indicated via madvise that it can benefit from THPs.
      In both cases, it means that the potential extra latency is expected
      and worth the benefits.
    - even if reclaim/compaction proceeds after this patch where it
      previously wouldn't, the second compaction attempt is still async and
      will detect the contention and back off, if the contention persists
    - there are still heuristics like deferred compaction and pageblock skip
      bits in place that prevent excessive THP page fault latencies

    Link: http://lkml.kernel.org/r/20160721073614.24395-9-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • In the context of direct compaction, for some types of allocations we
    would like the compaction to either succeed or definitely fail while
    trying as hard as possible. Current async/sync_light migration mode is
    insufficient, as there are heuristics such as caching scanner positions,
    marking pageblocks as unsuitable or deferring compaction for a zone. At
    least the final compaction attempt should be able to override these
    heuristics.

    To communicate how hard compaction should try, we replace migration mode
    with a new enum compact_priority and change the relevant function
    signatures. In compact_zone_order() where struct compact_control is
    constructed, the priority is mapped to suitable control flags. This
    patch itself has no functional change, as the current priority levels
    are mapped back to the same migration modes as before. Expanding them
    will be done next.

    Note that !CONFIG_COMPACTION variant of try_to_compact_pages() is
    removed, as the only caller exists under CONFIG_COMPACTION.
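
    The mapping described above can be sketched as follows (simplified;
    in the actual patch the translation happens where struct
    compact_control is set up, and priority_to_mode() is an invented
    name):

    enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT, MIGRATE_SYNC };

    enum compact_priority { COMPACT_PRIO_SYNC_LIGHT, COMPACT_PRIO_ASYNC };

    /* No functional change yet: each priority maps to its old migration mode. */
    static enum migrate_mode priority_to_mode(enum compact_priority prio)
    {
        return prio == COMPACT_PRIO_ASYNC ? MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT;
    }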

    Link: http://lkml.kernel.org/r/20160721073614.24395-8-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • At present MIGRATE_SYNC_LIGHT is allowing __isolate_lru_page() to
    isolate a PageWriteback page, which __unmap_and_move() then rejects with
    -EBUSY: of course the writeback might complete in between, but that's
    not what we usually expect, so probably better not to isolate it.

    When tested by stress-highalloc from mmtests, this has reduced the
    number of page migrate failures by 60-70%.
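
    The behavioural change amounts to a check like the following sketch
    (a model of the decision, not the kernel's isolate-mode flag
    plumbing): only fully synchronous migration, which can wait for
    writeback, still isolates PageWriteback pages.

    #include <stdbool.h>

    enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT, MIGRATE_SYNC };

    static bool may_isolate_for_migration(bool page_writeback,
                                          enum migrate_mode mode)
    {
        /*
         * __unmap_and_move() would only reject a writeback page with
         * -EBUSY unless the mode is allowed to wait for writeback.
         */
        if (page_writeback && mode != MIGRATE_SYNC)
            return false;
        return true;
    }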

    Link: http://lkml.kernel.org/r/20160721073614.24395-2-vbabka@suse.cz
    Signed-off-by: Hugh Dickins
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If per-zone LRU accounting is available then there is no point
    approximating whether reclaim and compaction should retry based on pgdat
    statistics. This is effectively a revert of "mm, vmstat: remove zone
    and node double accounting by approximating retries" with the difference
    that inactive/active stats are still available. This preserves the
    history of why the approximation was required and why it had to be
    reverted to handle OOM kills on 32-bit systems.

    Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The number of LRU pages, dirty pages and writeback pages must be
    accounted for on both zones and nodes because of the reclaim retry
    logic, compaction retry logic and highmem calculations all depending on
    per-zone stats.

    Many lowmem allocations are immune from OOM kill due to a check in
    __alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
    03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The
    exception is costly high-order allocations or allocations that cannot
    fail. If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem
    allocations then it would fall through to __alloc_pages_direct_compact.

    This patch will blindly retry reclaim for zone-constrained allocations
    in should_reclaim_retry up to MAX_RECLAIM_RETRIES. This is not ideal
    but without per-zone stats there are not many alternatives. The impact
    is that zone-constrained allocations may be delayed before considering
    the OOM killer.

    As there is no guarantee that enough memory can ever be freed to
    satisfy compaction, this patch avoids retrying compaction for
    zone-constrained allocations.

    In combination, that means that the per-node stats can be used when
    deciding whether to continue reclaim using a rough approximation. While
    it is possible this will make the wrong decision on occasion, it will
    not infinite loop as the number of reclaim attempts is capped by
    MAX_RECLAIM_RETRIES.

    The final step is calculating the number of dirtyable highmem pages.
    As those calculations only care about the global count of file pages
    in highmem, this patch uses a global counter instead of per-zone
    stats, which is sufficient.

    In combination, this allows the per-zone LRU and dirty state counters to
    be removed.

    [mgorman@techsingularity.net: fix acct_highmem_file_pages()]
    Link: http://lkml.kernel.org/r/1468853426-12858-4-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-35-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Suggested by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both the zone and
    the node level. Most reclaim logic is based on the node counters but
    the retry
    logic uses the zone counters which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the LRU lists being node-based, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case that
    highmem pages have to be scanned and reclaimed to free lowmem slab
    pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Node-based reclaim requires node-based LRUs and locking. This is a
    preparation patch that just moves the lru_lock to the node so later
    patches are easier to review. It is a mechanical change but note this
    patch makes contention worse because the LRU lock is hotter and direct
    reclaim and kswapd can contend on the same lock even when reclaiming
    from different zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The caller __alloc_pages_direct_compact() already checked (order == 0)
    so there's no need to check again.

    Link: http://lkml.kernel.org/r/1465973568-3496-1-git-send-email-opensource.ganesh@gmail.com
    Signed-off-by: Ganesh Mahendran
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     

27 Jul, 2016

6 commits

  • This patch is motivated by Hugh's and Vlastimil's concern [1].

    There are two ways to get a freepage from the allocator. One is using
    the normal memory allocation API and the other is __isolate_free_page(),
    which is used internally for compaction and pageblock isolation. The
    latter usage is rather tricky since it doesn't do the whole
    post-allocation processing done by the normal API.

    One problematic thing I already know is that a poisoned page would not
    be checked if it is allocated by __isolate_free_page(). Perhaps there
    are more.

    We could add more debug logic for allocated pages in the future, and
    this separation would cause more problems. I'd like to fix this
    situation now. The solution is simple: this patch factors out common
    logic for newly allocated pages and uses it at all sites. This will
    solve the problem.

    [1] http://marc.info/?i=alpine.LSU.2.11.1604270029350.7066%40eggly.anvils%3E
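
    The idea can be sketched like this (illustrative names such as
    post_alloc_prep() and the individual stub checks are placeholders, not
    the actual helper introduced by the patch): both the normal allocation
    path and the internal __isolate_free_page()-style path funnel a
    freshly allocated page through one routine, so no debug check is
    skipped.

    struct page_stub { unsigned int order; };

    static void check_poison(struct page_stub *p)        { (void)p; }
    static void arch_alloc_hook(struct page_stub *p)     { (void)p; }
    static void set_page_owner_stub(struct page_stub *p) { (void)p; }

    /* One common place for everything a "new" page must go through,
     * called from both the normal allocator and internal isolation. */
    static void post_alloc_prep(struct page_stub *page, unsigned int order)
    {
        page->order = order;
        check_poison(page);        /* e.g. poisoned-page verification */
        arch_alloc_hook(page);
        set_page_owner_stub(page); /* debug bookkeeping */
    }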

    [iamjoonsoo.kim@lge.com: mm-page_alloc-introduce-post-allocation-processing-on-page-allocator-v3]
    Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com
    Link: http://lkml.kernel.org/r/1466150259-27727-9-git-send-email-iamjoonsoo.kim@lge.com
    Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Alexander Potapenko
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • It's not necessary to initialize page_owner while holding the zone
    lock. Doing so causes more contention on the zone lock, although it's
    not a big problem since it is just a debug feature. Still, it is better
    than before, so do it. This is also a preparation step for using
    stackdepot in the page owner feature: stackdepot allocates new pages
    when there is no reserved space, and holding the zone lock in that case
    would cause a deadlock.

    Link: http://lkml.kernel.org/r/1464230275-25791-2-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Alexander Potapenko
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We don't need to split freepages while holding the zone lock. It would
    cause more contention on the zone lock, which is not desirable.

    [rientjes@google.com: if __isolate_free_page() fails, avoid adding to freelist so we don't call map_pages() with it]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606211447001.43430@chino.kir.corp.google.com
    Link: http://lkml.kernel.org/r/1464230275-25791-1-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Alexander Potapenko
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We have squeezed the metadata of a zspage into the first page's
    descriptor. So, to get metadata from a subpage, we should get the first
    page first of all. But that makes it troublesome to implement the page
    migration feature of zsmalloc, because any place that gets the first
    page from a subpage can race with first page migration; IOW, the first
    page it got could be stale. To prevent that, I have tried several
    approaches, but they made the code complicated, so finally I concluded
    to separate the metadata from the first page. Of course, it consumes
    more memory: IOW, 16 bytes per zspage on 32-bit at the moment. It means
    we lose 1% in the *worst case* (40B/4096B), which I think is not bad at
    the cost of maintenance.

    Link: http://lkml.kernel.org/r/1464736881-24886-9-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Sergey Senozhatsky
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Now, the VM has a feature to migrate non-lru movable pages, so the
    balloon driver doesn't need custom migration hooks in migrate.c and
    compaction.c.

    Instead, this patch implements the page->mapping->a_ops->
    {isolate|migrate|putback} functions.

    With that, we can remove the hooks for ballooning from the general
    migration functions and make balloon compaction simple.

    [akpm@linux-foundation.org: compaction.h requires that the includer first include node.h]
    Link: http://lkml.kernel.org/r/1464736881-24886-4-git-send-email-minchan@kernel.org
    Signed-off-by: Gioh Kim
    Signed-off-by: Minchan Kim
    Acked-by: Vlastimil Babka
    Cc: Rafael Aquini
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • We have allowed migration for only LRU pages until now, and it was
    enough to make high-order pages. But recently, embedded systems (e.g.,
    webOS, Android) use lots of non-movable pages (e.g., zram, GPU memory),
    so we have seen several reports about the troubles of small high-order
    allocations. To fix the problem, there have been several efforts (e.g.,
    enhancing the compaction algorithm, SLUB fallback to order-0 pages,
    reserved memory, vmalloc and so on), but if there are lots of
    non-movable pages in the system, these solutions are void in the long
    run.

    So, this patch adds a facility to turn non-movable pages into movable
    ones. For the feature, this patch introduces migration-related
    functions in address_space_operations as well as some page flags.

    If a driver wants to make its own pages movable, it should define three
    functions, which are function pointers of struct
    address_space_operations (a minimal sketch follows the list below).

    1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

    What the VM expects from a driver's isolate_page function is to return
    *true* if the driver isolates the page successfully. On returning true,
    the VM marks the page as PG_isolated, so concurrent isolation on
    several CPUs skips the page. If a driver cannot isolate the page, it
    should return *false*.

    Once the page is successfully isolated, the VM uses the page.lru
    fields, so the driver shouldn't expect the values in those fields to be
    preserved.

    2. int (*migratepage) (struct address_space *mapping,
    struct page *newpage, struct page *oldpage, enum migrate_mode);

    After isolation, the VM calls the driver's migratepage with the
    isolated page. The job of migratepage is to move the content of the old
    page to the new page and set up the fields of struct page newpage. Keep
    in mind that you should indicate to the VM that the oldpage is no
    longer movable, via __ClearPageMovable() under page_lock, if you
    migrated the oldpage successfully and return 0. If the driver cannot
    migrate the page at the moment, it can return -EAGAIN. On -EAGAIN, the
    VM will retry page migration after a short time, because the VM
    interprets -EAGAIN as a "temporary migration failure". On returning any
    error other than -EAGAIN, the VM will give up the page migration
    without retrying.

    The driver shouldn't touch the page.lru field, which the VM is using in
    these functions.

    3. void (*putback_page)(struct page *);

    If migration fails on an isolated page, the VM should return the
    isolated page to the driver, so the VM calls the driver's putback_page
    with the page whose migration failed. In this function, the driver
    should put the isolated page back into its own data structure.

    4. non-lru movable page flags

    There are two page flags for supporting non-lru movable pages.

    * PG_movable

    A driver should use the function below to make a page movable, under
    page_lock.

    void __SetPageMovable(struct page *page, struct address_space *mapping)

    It takes an address_space argument for registering the family of
    migration functions which will be called by the VM. Strictly speaking,
    PG_movable is not a real flag of struct page; rather, the VM reuses the
    lower bits of page->mapping to represent it:

    #define PAGE_MAPPING_MOVABLE 0x2
    page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

    so a driver shouldn't access page->mapping directly. Instead, the
    driver should use page_mapping(), which masks off the low two bits of
    page->mapping so it can get the right struct address_space.

    For testing non-lru movable pages, the VM supports the __PageMovable
    function. However, it doesn't guarantee to identify a non-lru movable
    page, because the page->mapping field is unified with other variables
    in struct page. As well, if the driver releases the page after
    isolation by the VM, page->mapping doesn't have a stable value although
    it has PAGE_MAPPING_MOVABLE set (look at __ClearPageMovable). But
    __PageMovable is a cheap way to check whether a page is LRU or non-lru
    movable once the page has been isolated, because LRU pages can never
    have PAGE_MAPPING_MOVABLE in page->mapping. It is also good for just
    peeking to test for non-lru movable pages before the more expensive
    check with lock_page during pfn scanning to select a victim.

    For guaranteeing a non-lru movable page, the VM provides the
    PageMovable function. Unlike __PageMovable, PageMovable validates
    page->mapping and mapping->a_ops->isolate_page under lock_page. The
    lock_page prevents sudden destruction of page->mapping.

    A driver using __SetPageMovable should clear the flag via
    __ClearPageMovable under page_lock before releasing the page.

    * PG_isolated

    To prevent concurrent isolation among several CPUs, the VM marks an
    isolated page as PG_isolated under lock_page. So if a CPU encounters a
    PG_isolated non-lru movable page, it can skip it. A driver doesn't need
    to manipulate the flag because the VM will set/clear it automatically.
    Keep in mind that if a driver sees a PG_isolated page, it means the
    page has been isolated by the VM, so it shouldn't touch the page.lru
    field. PG_isolated is an alias of the PG_reclaim flag, so a driver
    shouldn't use that flag for its own purpose.
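
    A minimal sketch of what a driver implementation could look like,
    with stand-in types so it is self-contained; in a real driver the
    types and the registration via __SetPageMovable() come from the
    kernel headers, and the names my_isolate/my_migrate/my_putback and
    my_driver_aops are hypothetical.

    #include <stdbool.h>

    struct page;                     /* opaque stand-in */
    struct address_space;            /* opaque stand-in */
    typedef unsigned int isolate_mode_t;
    enum migrate_mode_sketch { MM_ASYNC, MM_SYNC_LIGHT, MM_SYNC };

    struct movable_aops_sketch {
        bool (*isolate_page)(struct page *page, isolate_mode_t mode);
        int  (*migratepage)(struct address_space *mapping, struct page *newpage,
                            struct page *oldpage, enum migrate_mode_sketch mode);
        void (*putback_page)(struct page *page);
    };

    /* Return true only if the page was taken off the driver's own lists;
     * the VM then marks it PG_isolated and may reuse page.lru. */
    static bool my_isolate(struct page *page, isolate_mode_t mode)
    {
        (void)page; (void)mode;
        return true;
    }

    /* Copy content oldpage -> newpage, fix up driver bookkeeping, mark the
     * old page no longer movable and return 0 on success; return -EAGAIN
     * for a temporary failure the VM should retry. */
    static int my_migrate(struct address_space *mapping, struct page *newpage,
                          struct page *oldpage, enum migrate_mode_sketch mode)
    {
        (void)mapping; (void)newpage; (void)oldpage; (void)mode;
        return 0;
    }

    /* Migration failed: take the still-isolated page back into the
     * driver's own data structures. */
    static void my_putback(struct page *page)
    {
        (void)page;
    }

    static const struct movable_aops_sketch my_driver_aops = {
        .isolate_page = my_isolate,
        .migratepage  = my_migrate,
        .putback_page = my_putback,
    };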

    [opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
    Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
    Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
    Signed-off-by: Gioh Kim
    Signed-off-by: Minchan Kim
    Signed-off-by: Ganesh Mahendran
    Acked-by: Vlastimil Babka
    Cc: Sergey Senozhatsky
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rafael Aquini
    Cc: Jonathan Corbet
    Cc: John Einar Reitan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

15 Jul, 2016

1 commit

  • It's possible to isolate some freepages in a pageblock and then fail
    split_free_page() due to the low watermark check. In this case, we hit
    VM_BUG_ON() because the freeing scanner terminated early without a
    contended lock or enough freepages.

    This should never have been a VM_BUG_ON() since it's not a fatal
    condition. It should have been a VM_WARN_ON() at best, or even handled
    gracefully.

    Regardless, we need to terminate anytime the full pageblock scan was not
    done. The logic belongs in isolate_freepages_block(), so handle its
    state gracefully by terminating the pageblock loop and making a note to
    restart at the same pageblock next time since it was not possible to
    complete the scan this time.

    [rientjes@google.com: don't rescan pages in a pageblock]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1607111244150.83138@chino.kir.corp.google.com
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606291436300.145590@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Reported-by: Minchan Kim
    Tested-by: Minchan Kim
    Cc: Joonsoo Kim
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

25 Jun, 2016

1 commit

  • If the memory compaction free scanner cannot successfully split a free
    page (only possible due to per-zone low watermark), terminate the free
    scanner rather than continuing to scan memory needlessly. If the
    watermark is insufficient for a free page of the requested order, then
    terminate the scanner, since all future splits will also likely fail.

    This prevents the compaction freeing scanner from scanning all memory on
    very large zones (very noticeable for zones > 128GB, for instance) when
    all splits will likely fail while holding zone->lock.

    compaction_alloc() iterating a 128GB zone has been benchmarked to take
    over 400ms on some systems whereas any free page isolated and ready to
    be split ends up failing in split_free_page() because of the low
    watermark check and thus the iteration continues.

    The next time compaction occurs, the freeing scanner will likely start
    at the end of the zone again since no success was made previously and we
    get the same lengthy iteration until the zone is brought above the low
    watermark. All thp page faults can take >400ms in such a state without
    this fix.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606211820350.97086@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

21 May, 2016

6 commits

  • While testing kcompactd on my platform (3G of memory, DMA zone only),
    I found that kcompactd never wakes up. It seems the zone index has
    already had 1 subtracted from it before, so the traversal here should
    use <=.

    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: Kirill A. Shutemov
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Zhuangluan Su
    Cc: Yiping Xu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Feng
     
  • "mm: consider compaction feedback also for costly allocation" has
    removed the upper bound for the reclaim/compaction retries based on the
    number of reclaimed pages for costly orders. While this is desirable,
    the patch missed a bad interaction between reclaim, compaction and the
    retry logic. Direct reclaim tries to get zones over the min watermark,
    while compaction backs off and returns COMPACT_SKIPPED when all zones
    are below the low watermark + 1<<order.
    Acked-by: Hillf Danton
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • COMPACT_COMPLETE now means that compaction and free scanner met. This
    is not very useful information if somebody just wants to use this
    feedback and make any decisions based on that. The current caller
    might just have happened to scan a tiny portion of the zone, and that
    could be the reason no suitable pages were compacted. Make sure we
    distinguish between full and partial zone walks.

    Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success
    and be optimistic in retrying.

    The existing users of COMPACT_COMPLETE are conservatively changed to
    use COMPACT_PARTIAL_SKIPPED as well, but some of them should probably
    be reconsidered to defer the compaction only for COMPACT_COMPLETE with
    the new semantics.

    This patch shouldn't introduce any functional changes.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Hillf Danton
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • try_to_compact_pages() can currently return COMPACT_SKIPPED even when
    compaction is deferred for some zone, just because zone DMA is skipped
    in 99% of cases due to watermark checks. This makes COMPACT_DEFERRED
    basically unusable for the page allocator as a feedback mechanism.

    Make sure we distinguish those two states properly and switch their
    ordering in the enum. This means that COMPACT_SKIPPED will be returned
    only when all eligible zones are skipped.

    As a result COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath
    will be more precise and we would bail out rather than reclaim.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Hillf Danton
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The compiler is complaining after "mm, compaction: change COMPACT_
    constants into enum"

    mm/compaction.c: In function `compact_zone':
    mm/compaction.c:1350:2: warning: enumeration value `COMPACT_DEFERRED' not handled in switch [-Wswitch]
    switch (ret) {
    ^
    mm/compaction.c:1350:2: warning: enumeration value `COMPACT_COMPLETE' not handled in switch [-Wswitch]
    mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NO_SUITABLE_PAGE' not handled in switch [-Wswitch]
    mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NOT_SUITABLE_ZONE' not handled in switch [-Wswitch]
    mm/compaction.c:1350:2: warning: enumeration value `COMPACT_CONTENDED' not handled in switch [-Wswitch]

    compaction_suitable is allowed to return only COMPACT_PARTIAL,
    COMPACT_SKIPPED and COMPACT_CONTINUE so other cases are simply
    impossible. Put a VM_BUG_ON to catch an impossible return value.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Hillf Danton
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The compaction code is doing weird dances between COMPACT_FOO -> int ->
    unsigned long, but there doesn't seem to be any reason for that.

    All functions which return/use one of those constants do not expect any
    other value, so it really makes sense to define an enum for them and
    make it clear that no other values are expected.

    This is a pure cleanup and shouldn't introduce any functional changes.
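
    In outline, the change is from loose integer constants to an enum
    along these lines (the value names are those mentioned elsewhere in
    this series; the exact set and ordering in the tree may differ):

    /* Before (conceptually): #define COMPACT_SKIPPED 1, ... passed around
     * as int or unsigned long. After: one enum, so no other values are
     * expected by the callers. */
    enum compact_result_sketch {
        COMPACT_DEFERRED,
        COMPACT_SKIPPED,
        COMPACT_CONTINUE,
        COMPACT_PARTIAL,
        COMPACT_COMPLETE,
        COMPACT_NO_SUITABLE_PAGE,
        COMPACT_NOT_SUITABLE_ZONE,
        COMPACT_CONTENDED,
    };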

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Hillf Danton
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

20 May, 2016

5 commits

  • The classzone_idx can be inferred from preferred_zoneref so remove the
    unnecessary field and save stack space.

    Signed-off-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • alloc_flags is a bitmask of flags but it is signed which does not
    necessarily generate the best code depending on the compiler. Even
    without an impact, it makes more sense that this be unsigned.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The goal of direct compaction is to quickly make a high-order page
    available for the pending allocation. Within an aligned block of pages
    of desired order, a single allocated page that cannot be isolated for
    migration means that the block cannot fully merge to a buddy page that
    would satisfy the allocation request. Therefore we can reduce the
    allocation stall by skipping the rest of the block immediately on
    isolation failure. For async compaction, this also means a higher
    chance of succeeding until it detects contention.

    We shouldn't, however, completely sacrifice the second objective of
    compaction, which is to reduce overall long-term memory fragmentation.
    As a compromise, perform the eager skipping only in direct async
    compaction, while sync compaction (including kcompactd) remains
    thorough.

    Testing was done using stress-highalloc from mmtests, configured for
    order-4 GFP_KERNEL allocations:

                        4.6-rc1            4.6-rc1
                        before             after
    Success 1 Min       24.00 (  0.00%)    27.00 (-12.50%)
    Success 1 Mean      30.20 (  0.00%)    31.60 ( -4.64%)
    Success 1 Max       37.00 (  0.00%)    35.00 (  5.41%)
    Success 2 Min       42.00 (  0.00%)    32.00 ( 23.81%)
    Success 2 Mean      44.00 (  0.00%)    44.80 ( -1.82%)
    Success 2 Max       48.00 (  0.00%)    52.00 ( -8.33%)
    Success 3 Min       91.00 (  0.00%)    92.00 ( -1.10%)
    Success 3 Mean      92.20 (  0.00%)    92.80 ( -0.65%)
    Success 3 Max       94.00 (  0.00%)    93.00 (  1.06%)

    We can see that success rates are unaffected by the skipping.

                 4.6-rc1    4.6-rc1
                 before     after
    User         2587.42    2566.53
    System        482.89     471.20
    Elapsed      1395.68    1382.00

    Times are not such a useful metric for this benchmark, as the main
    portion is the interfering kernel builds, but the results do hint at
    reduced system time.

                                4.6-rc1    4.6-rc1
                                before     after
    Direct pages scanned         163614     159608
    Kswapd pages scanned        2070139    2078790
    Kswapd pages reclaimed      2061707    2069757
    Direct pages reclaimed       163354     159505

    Reduced direct reclaim was unintended, but could be explained by more
    successful first attempt at (async) direct compaction, which is
    attempted before the first reclaim attempt in __alloc_pages_slowpath().

    Compaction stalls             33052      39853
    Compaction success            12121      19773
    Compaction failures           20931      20079

    Compaction is indeed more successful, and thus less likely to get
    deferred, so there are also more direct compaction stalls.

    Page migrate success              3781876     3326819
    Page migrate failure                45817       41774
    Compaction pages isolated         7868232     6941457
    Compaction migrate scanned      168160492   127269354
    Compaction migrate prescanned           0           0
    Compaction free scanned        2522142582  2326342620
    Compaction free direct alloc            0           0
    Compaction free dir. all. miss          0           0
    Compaction cost                      5252        4476

    The patch reduces migration scanned pages by 25% thanks to the eager
    skipping.

    [hughd@google.com: prevent nr_isolated_* from going negative]
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Minchan Kim
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Compaction drains the local pcplists each time migration scanner moves
    away from a cc->order aligned block where it isolated pages for
    migration, so that the pages freed by migrations can merge into higher
    orders.

    The detection is currently coarser than it could be. The
    cc->last_migrated_pfn variable should track the lowest pfn that was
    isolated for migration. But it is set to the pfn where
    isolate_migratepages_block() starts scanning, which is typically the
    first pfn of the pageblock. There, the scanner might fail to isolate
    several order-aligned blocks, and then isolate COMPACT_CLUSTER_MAX in
    another block. This would cause the pcplists drain to be performed,
    although the scanner didn't yet finish the block where it isolated from.

    This patch thus makes cc->last_migrated_pfn handling more accurate by
    setting it to the pfn of an actually isolated page in
    isolate_migratepages_block(). Although practical effects of this patch
    are likely low, it arguably makes the intent of the code more obvious.
    Also the next patch will make async direct compaction skip blocks more
    aggressively, and draining pcplists due to skipped blocks is wasteful.

    Signed-off-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Minchan Kim
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Compaction code has accumulated numerous instances of manual
    calculations of the first (inclusive) and last (exclusive) pfn of a
    pageblock (or a smaller block of given order), given a pfn within the
    pageblock.

    Wrap these calculations by introducing pageblock_start_pfn(pfn) and
    pageblock_end_pfn(pfn) macros.
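
    A sketch of the two helpers, assuming a pageblock_nr_pages constant
    that is a power of two (simplified from the pageblock_order-based
    definitions in the tree):

    #define pageblock_nr_pages  (1UL << 9)   /* example: order-9 pageblocks */

    /* First pfn of the pageblock containing pfn (inclusive). */
    #define pageblock_start_pfn(pfn)  ((pfn) & ~(pageblock_nr_pages - 1))

    /* First pfn after the pageblock containing pfn (exclusive). */
    #define pageblock_end_pfn(pfn)    (pageblock_start_pfn(pfn) + pageblock_nr_pages)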

    [vbabka@suse.cz: fix crash in get_pfnblock_flags_mask() from isolate_freepages():]
    Signed-off-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka