29 Jul, 2016

8 commits

  • Async compaction detects contention either due to failing trylock on
    zone->lock or lru_lock, or by need_resched(). Since 1f9efdef4f3f ("mm,
    compaction: khugepaged should not give up due to need_resched()") the
    code got quite complicated to distinguish these two up to the
    __alloc_pages_slowpath() level, so different decisions could be taken
    for khugepaged allocations.

    After the recent changes, khugepaged allocations don't check for
    contended compaction anymore, so we again don't need to distinguish lock
    and sched contention, and simplify the current convoluted code a lot.

    However, I believe it's also possible to simplify even more and
    completely remove the check for contended compaction after the initial
    async compaction for costly orders, which was originally aimed at THP
    page fault allocations. There are several reasons why this can be done
    now:

    - with the new defaults, THP page faults no longer do reclaim/compaction at
    all, unless the system admin has overridden the default, or application has
    indicated via madvise that it can benefit from THP's. In both cases, it
    means that the potential extra latency is expected and worth the benefits.
    - even if reclaim/compaction proceeds after this patch where it previously
    wouldn't, the second compaction attempt is still async and will detect the
    contention and back off, if the contention persists
    - there are still heuristics like deferred compaction and pageblock skip bits
    in place that prevent excessive THP page fault latencies

    Link: http://lkml.kernel.org/r/20160721073614.24395-9-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • In the context of direct compaction, for some types of allocations we
    would like the compaction to either succeed or definitely fail while
    trying as hard as possible. Current async/sync_light migration mode is
    insufficient, as there are heuristics such as caching scanner positions,
    marking pageblocks as unsuitable or deferring compaction for a zone. At
    least the final compaction attempt should be able to override these
    heuristics.

    To communicate how hard compaction should try, we replace migration mode
    with a new enum compact_priority and change the relevant function
    signatures. In compact_zone_order() where struct compact_control is
    constructed, the priority is mapped to suitable control flags. This
    patch itself has no functional change, as the current priority levels
    are mapped back to the same migration modes as before. Expanding them
    will be done next.

    Note that !CONFIG_COMPACTION variant of try_to_compact_pages() is
    removed, as the only caller exists under CONFIG_COMPACTION.
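
    Roughly, the new interface replaces the migrate_mode argument with a
    priority that compact_zone_order() maps back to control flags (a sketch;
    the enum names and the mapping here are illustrative, not the exact ones
    from the patch):

    enum compact_priority {
            COMPACT_PRIO_SYNC_LIGHT,
            COMPACT_PRIO_ASYNC,
    };

    /* In compact_zone_order(): translate priority into compact_control. */
    struct compact_control cc = {
            .order = order,
            .mode = (prio == COMPACT_PRIO_ASYNC) ?
                            MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT,
            /* ... */
    };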

    Link: http://lkml.kernel.org/r/20160721073614.24395-8-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • At present MIGRATE_SYNC_LIGHT is allowing __isolate_lru_page() to
    isolate a PageWriteback page, which __unmap_and_move() then rejects with
    -EBUSY: of course the writeback might complete in between, but that's
    not what we usually expect, so probably better not to isolate it.

    When tested by stress-highalloc from mmtests, this has reduced the
    number of page migrate failures by 60-70%.

    Link: http://lkml.kernel.org/r/20160721073614.24395-2-vbabka@suse.cz
    Signed-off-by: Hugh Dickins
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If per-zone LRU accounting is available then there is no point
    approximating whether reclaim and compaction should retry based on pgdat
    statistics. This is effectively a revert of "mm, vmstat: remove zone
    and node double accounting by approximating retries" with the difference
    that inactive/active stats are still available. This preserves the
    history of why the approximation was retried and why it had to be
    reverted to handle OOM kills on 32-bit systems.

    Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The number of LRU pages, dirty pages and writeback pages must be
    accounted for on both zones and nodes because of the reclaim retry
    logic, compaction retry logic and highmem calculations all depending on
    per-zone stats.

    Many lowmem allocations are immune from OOM kill due to a check in
    __alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
    03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The
    exception is costly high-order allocations or allocations that cannot
    fail. If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem
    allocations then it would fall through to __alloc_pages_direct_compact.

    This patch will blindly retry reclaim for zone-constrained allocations
    in should_reclaim_retry up to MAX_RECLAIM_RETRIES. This is not ideal
    but without per-zone stats there are not many alternatives. The impact
    is that zone-constrained allocations may delay before considering the
    OOM killer.

    As there is no guarantee enough memory can ever be freed to satisfy
    compaction, this patch avoids retrying compaction for zone-constrained
    allocations.

    In combination, that means that the per-node stats can be used when
    deciding whether to continue reclaim using a rough approximation. While
    it is possible this will make the wrong decision on occasion, it will
    not infinite loop as the number of reclaim attempts is capped by
    MAX_RECLAIM_RETRIES.

    The final step is calculating the number of dirtyable highmem pages. As
    those calculations only care about the global count of file pages in
    highmem, this patch uses a global counter instead of per-zone stats,
    which is sufficient.

    In combination, this allows the per-zone LRU and dirty state counters to
    be removed.

    [mgorman@techsingularity.net: fix acct_highmem_file_pages()]
    Link: http://lkml.kernel.org/r/1468853426-12858-4-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-35-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Suggested-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to the reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages on both the zone and
    the node. Most reclaim logic is based on the node counters, but the
    retry logic uses the zone counters, which do not distinguish inactive
    and active sizes. It would be possible to leave the LRU counters on a
    per-zone basis, but that is a heavier calculation across multiple cache
    lines and is updated much more frequently than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes some odd problems that are fixed
    in later patches, but doing it this way keeps the series easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the LRU lists now being per-node, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Node-based reclaim requires node-based LRUs and locking. This is a
    preparation patch that just moves the lru_lock to the node so later
    patches are easier to review. It is a mechanical change but note this
    patch makes contention worse because the LRU lock is hotter and direct
    reclaim and kswapd can contend on the same lock even when reclaiming
    from different zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The caller __alloc_pages_direct_compact() already checked (order == 0)
    so there's no need to check again.

    Link: http://lkml.kernel.org/r/1465973568-3496-1-git-send-email-opensource.ganesh@gmail.com
    Signed-off-by: Ganesh Mahendran
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     

27 Jul, 2016

6 commits

  • This patch is motivated by Hugh's and Vlastimil's concern [1].

    There are two ways to get a free page from the allocator. One is the
    normal memory allocation API and the other is __isolate_free_page(),
    which is used internally for compaction and pageblock isolation. The
    latter usage is rather tricky since it doesn't do the whole
    post-allocation processing done by the normal API.

    One problematic thing I already know is that a poisoned page would not
    be checked if it is allocated by __isolate_free_page(). Perhaps there
    are more.

    We could add more debug logic for allocated pages in the future, and
    this separation would cause more problems. I'd like to fix this
    situation now. The solution is simple: this patch commonizes the logic
    for newly allocated pages and uses it at all sites. This will solve the
    problem.

    [1] http://marc.info/?i=alpine.LSU.2.11.1604270029350.7066%40eggly.anvils%3E

    [iamjoonsoo.kim@lge.com: mm-page_alloc-introduce-post-allocation-processing-on-page-allocator-v3]
    Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com
    Link: http://lkml.kernel.org/r/1466150259-27727-9-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Alexander Potapenko
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • It's not necessary to initialize page_owner while holding the zone lock.
    Doing so would cause more contention on the zone lock, although it's not
    a big problem since it is just a debug feature. But it is better than
    before, so do it. This is also a preparation step for using stackdepot
    in the page owner feature. Stackdepot allocates new pages when there is
    no reserved space, and holding the zone lock in that case would cause a
    deadlock.

    Link: http://lkml.kernel.org/r/1464230275-25791-2-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Alexander Potapenko
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We don't need to split freepages while holding the zone lock. It causes
    more contention on the zone lock, which is not desirable.

    [rientjes@google.com: if __isolate_free_page() fails, avoid adding to freelist so we don't call map_pages() with it]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606211447001.43430@chino.kir.corp.google.com
    Link: http://lkml.kernel.org/r/1464230275-25791-1-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Alexander Potapenko
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We have squeezed the metadata of a zspage into the first page's
    descriptor, so to get metadata from a subpage we must get the first page
    first of all. But this makes it troublesome to implement the page
    migration feature of zsmalloc, because any place that gets the first
    page from a subpage can race with first page migration; IOW, the first
    page it got could be stale. To prevent that, I tried several
    approaches, but they made the code complicated, so finally I concluded
    to separate the metadata from the first page. Of course, it consumes
    more memory: 16 bytes per zspage on 32-bit at the moment. It means we
    lose 1% in the *worst case* (40B/4096B), which I think is not bad at the
    cost of maintainability.

    Link: http://lkml.kernel.org/r/1464736881-24886-9-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Sergey Senozhatsky
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Now that the VM has a feature to migrate non-LRU movable pages, balloon
    doesn't need custom migration hooks in migrate.c and compaction.c.

    Instead, this patch implements the page->mapping->a_ops->
    {isolate|migrate|putback} functions.

    With that, we could remove hooks for ballooning in general migration
    functions and make balloon compaction simple.

    [akpm@linux-foundation.org: compaction.h requires that the includer first include node.h]
    Link: http://lkml.kernel.org/r/1464736881-24886-4-git-send-email-minchan@kernel.org
    Signed-off-by: Gioh Kim
    Signed-off-by: Minchan Kim
    Acked-by: Vlastimil Babka
    Cc: Rafael Aquini
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • We have allowed migration for only LRU pages until now, and it was
    enough to make high-order pages. But recently, embedded systems (e.g.,
    webOS, Android) use lots of non-movable pages (e.g., zram, GPU memory),
    so we have seen several reports about trouble with small high-order
    allocations. To fix the problem, there have been several efforts (e.g.,
    enhancing the compaction algorithm, SLUB fallback to 0-order pages,
    reserved memory, vmalloc and so on), but if there are lots of
    non-movable pages in the system, those solutions are void in the long run.

    So, this patch supports a facility to turn non-movable pages into
    movable ones. For the feature, it adds migration-related functions to
    address_space_operations as well as some page flags.

    If a driver wants to make its own pages movable, it should define three
    functions, which are function pointers of struct
    address_space_operations.

    1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

    What the VM expects of the driver's isolate_page function is to return
    *true* if the driver isolates the page successfully. On returning true,
    the VM marks the page as PG_isolated so that concurrent isolation on
    several CPUs skips the page. If the driver cannot isolate the page, it
    should return *false*.

    Once a page is successfully isolated, the VM uses the page.lru fields,
    so the driver shouldn't expect those fields to be preserved.

    2. int (*migratepage) (struct address_space *mapping,
    struct page *newpage, struct page *oldpage, enum migrate_mode);

    After isolation, the VM calls the driver's migratepage with the isolated
    page. The job of migratepage is to move the content of the old page to
    the new page and set up the fields of struct page newpage. Keep in mind
    that you should indicate to the VM that the oldpage is no longer movable
    via __ClearPageMovable() under page_lock if you migrated the oldpage
    successfully and return 0. If the driver cannot migrate the page at the
    moment, it can return -EAGAIN. On -EAGAIN, the VM will retry page
    migration a short time later, because the VM interprets -EAGAIN as a
    "temporary migration failure". On returning any error other than
    -EAGAIN, the VM will give up on migrating the page without retrying.

    The driver shouldn't touch the page.lru field, which the VM uses in
    these functions.

    3. void (*putback_page)(struct page *);

    If migration fails on an isolated page, the VM should return the
    isolated page to the driver, so the VM calls the driver's putback_page
    with the page whose migration failed. In this function, the driver
    should put the isolated page back into its own data structure.

    4. non-lru movable page flags

    There are two page flags for supporting non-lru movable page.

    * PG_movable

    The driver should use the function below to make a page movable, under
    page_lock.

    void __SetPageMovable(struct page *page, struct address_space *mapping)

    It takes an address_space argument to register the family of migration
    functions which will be called by the VM. Strictly speaking, PG_movable
    is not a real flag of struct page; rather, the VM reuses page->mapping's
    lower bits to represent it.

    #define PAGE_MAPPING_MOVABLE 0x2
    page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

    so the driver shouldn't access page->mapping directly. Instead, the
    driver should use page_mapping(), which masks off the low bits of
    page->mapping so it can get the right struct address_space.
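
    For illustration, a hedged sketch of that encoding and decoding,
    mirroring the lines above (the PAGE_MAPPING_FLAGS mask shown here is an
    assumption about how the low bits are grouped):

    #define PAGE_MAPPING_MOVABLE 0x2
    #define PAGE_MAPPING_FLAGS   (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)

    /* Driver side: tag the mapping pointer when registering the a_ops. */
    page->mapping = (void *)((unsigned long)mapping | PAGE_MAPPING_MOVABLE);

    /* VM side (what page_mapping() does): mask the low bits off again. */
    mapping = (void *)((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);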

    For testing non-LRU movable pages, the VM provides the __PageMovable
    function. However, it doesn't guarantee identification of a non-LRU
    movable page, because the page->mapping field is unioned with other
    variables in struct page. As well, if the driver releases the page
    after isolation by the VM, page->mapping doesn't have a stable value
    even though it has PAGE_MAPPING_MOVABLE set (look at __ClearPageMovable).
    But __PageMovable is a cheap way to tell whether a page is LRU or
    non-LRU movable once the page has been isolated, because LRU pages can
    never have PAGE_MAPPING_MOVABLE in page->mapping. It is also good for
    just peeking at non-LRU movable pages before the more expensive check
    with lock_page during pfn scanning to select a victim.

    To positively identify a non-LRU movable page, the VM provides the
    PageMovable function. Unlike __PageMovable, PageMovable validates
    page->mapping and mapping->a_ops->isolate_page under lock_page. The
    lock_page prevents page->mapping from being suddenly destroyed.

    A driver using __SetPageMovable should clear the flag via
    __ClearPageMovable under page_lock before releasing the page.

    * PG_isolated

    To prevent concurrent isolation among several CPUs, the VM marks an
    isolated page as PG_isolated under lock_page, so a CPU that encounters a
    PG_isolated non-LRU movable page can skip it. The driver doesn't need
    to manipulate the flag because the VM sets and clears it automatically.
    Keep in mind that if the driver sees a PG_isolated page, it means the
    page has been isolated by the VM, so it shouldn't touch the page.lru
    field. PG_isolated is aliased with the PG_reclaim flag, so the driver
    shouldn't use that flag for its own purposes.
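
    Putting the pieces above together, a minimal driver-side sketch (the
    my_dev_* names and helpers are hypothetical; locking and error handling
    are elided):

    static bool my_dev_isolate(struct page *page, isolate_mode_t mode)
    {
            /* Return true only if the page was taken off the driver's lists. */
            return my_dev_try_grab(page);           /* hypothetical helper */
    }

    static int my_dev_migratepage(struct address_space *mapping,
                    struct page *newpage, struct page *oldpage,
                    enum migrate_mode mode)
    {
            if (!my_dev_copy(newpage, oldpage))     /* hypothetical helper */
                    return -EAGAIN;                 /* VM retries shortly */

            /* The old page is no longer movable from the driver's side. */
            __ClearPageMovable(oldpage);
            return 0;
    }

    static void my_dev_putback_page(struct page *page)
    {
            /* Migration failed: put the page back on the driver's lists. */
            my_dev_put_back(page);                  /* hypothetical helper */
    }

    static const struct address_space_operations my_dev_aops = {
            .isolate_page = my_dev_isolate,
            .migratepage  = my_dev_migratepage,
            .putback_page = my_dev_putback_page,
    };

    /* When the driver hands a page out, under page_lock: */
    __SetPageMovable(page, mapping);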

    [opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
    Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
    Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
    Signed-off-by: Gioh Kim
    Signed-off-by: Minchan Kim
    Signed-off-by: Ganesh Mahendran
    Acked-by: Vlastimil Babka
    Cc: Sergey Senozhatsky
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rafael Aquini
    Cc: Jonathan Corbet
    Cc: John Einar Reitan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

15 Jul, 2016

1 commit

  • It's possible to isolate some freepages in a pageblock and then fail
    split_free_page() due to the low watermark check. In this case, we hit
    VM_BUG_ON() because the freeing scanner terminated early without a
    contended lock or enough freepages.

    This should never have been a VM_BUG_ON() since it's not a fatal
    condition. It should have been a VM_WARN_ON() at best, or even handled
    gracefully.

    Regardless, we need to terminate anytime the full pageblock scan was not
    done. The logic belongs in isolate_freepages_block(), so handle its
    state gracefully by terminating the pageblock loop and making a note to
    restart at the same pageblock next time since it was not possible to
    complete the scan this time.

    [rientjes@google.com: don't rescan pages in a pageblock]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1607111244150.83138@chino.kir.corp.google.com
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606291436300.145590@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Reported-by: Minchan Kim
    Tested-by: Minchan Kim
    Cc: Joonsoo Kim
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

25 Jun, 2016

1 commit

  • If the memory compaction free scanner cannot successfully split a free
    page (only possible due to per-zone low watermark), terminate the free
    scanner rather than continuing to scan memory needlessly. If the
    watermark is insufficient for a free page of the requested order, then
    terminate the scanner since all future splits will also likely fail.

    This prevents the compaction freeing scanner from scanning all memory on
    very large zones (very noticeable for zones > 128GB, for instance) when
    all splits will likely fail while holding zone->lock.

    compaction_alloc() iterating a 128GB zone has been benchmarked to take
    over 400ms on some systems whereas any free page isolated and ready to
    be split ends up failing in split_free_page() because of the low
    watermark check and thus the iteration continues.

    The next time compaction occurs, the freeing scanner will likely start
    at the end of the zone again since no success was made previously and we
    get the same lengthy iteration until the zone is brought above the low
    watermark. All thp page faults can take >400ms in such a state without
    this fix.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606211820350.97086@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

21 May, 2016

6 commits

  • While testing kcompactd on my platform with 3G of memory and only a DMA
    zone, I found that kcompactd never wakes up. The zone index has already
    had 1 subtracted, so the traversal here should use <= when comparing
    against classzone_idx.
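
    Roughly, the corrected traversal (a sketch assuming the kcompactd zone
    loop and the cc.classzone_idx naming, not the verbatim diff):

    /* classzone_idx is already a zone index, so include it in the walk. */
    for (zoneid = 0; zoneid <= cc.classzone_idx; zoneid++) {
            zone = &pgdat->node_zones[zoneid];
            /* ... check and compact the zone ... */
    }
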
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: Kirill A. Shutemov
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Zhuangluan Su
    Cc: Yiping Xu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Feng
     
  • "mm: consider compaction feedback also for costly allocation" has
    removed the upper bound for the reclaim/compaction retries based on the
    number of reclaimed pages for costly orders. While this is desirable
    the patch did miss a mis interaction between reclaim, compaction and the
    retry logic. The direct reclaim tries to get zones over min watermark
    while compaction backs off and returns COMPACT_SKIPPED when all zones
    are below low watermark + 1<
    Acked-by: Hillf Danton
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • COMPACT_COMPLETE now means that the migration and free scanners have
    met. This is not very useful information if somebody just wants to use
    this feedback and make decisions based on it. The current caller might
    be a poor guy who just happened to scan a tiny portion of the zone, and
    that could be the reason no suitable pages were compacted. Make sure we
    distinguish full and partial zone walks.

    Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success
    and be optimistic in retrying.

    The existing users of COMPACT_COMPLETE are conservatively changed to use
    COMPACT_PARTIAL_SKIPPED as well, but some of them should probably be
    reconsidered to defer compaction only for COMPACT_COMPLETE with the new
    semantics.

    This patch shouldn't introduce any functional changes.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Hillf Danton
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • try_to_compact_pages() can currently return COMPACT_SKIPPED even when
    compaction is deferred for some zone, just because zone DMA is skipped
    in 99% of cases due to watermark checks. This makes COMPACT_DEFERRED
    basically unusable for the page allocator as a feedback mechanism.

    Make sure we distinguish those two states properly and switch their
    ordering in the enum. This means that COMPACT_SKIPPED will be returned
    only when all eligible zones are skipped.

    As a result COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath
    will be more precise and we would bail out rather than reclaim.
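
    The ordering matters because try_to_compact_pages() keeps the highest
    result it has seen across the eligible zones, roughly like this (a
    sketch; the surrounding zone loop and most arguments are elided):

    enum compact_result rc = COMPACT_SKIPPED, status;

    /* for each eligible zone in the zonelist ... */
    status = compact_zone_order(zone, order, /* ... */);
    rc = max(status, rc);   /* a deferred zone now outranks a skipped one */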

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Hillf Danton
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The compiler is complaining after "mm, compaction: change COMPACT_
    constants into enum"

    mm/compaction.c: In function `compact_zone':
    mm/compaction.c:1350:2: warning: enumeration value `COMPACT_DEFERRED' not handled in switch [-Wswitch]
    switch (ret) {
    ^
    mm/compaction.c:1350:2: warning: enumeration value `COMPACT_COMPLETE' not handled in switch [-Wswitch]
    mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NO_SUITABLE_PAGE' not handled in switch [-Wswitch]
    mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NOT_SUITABLE_ZONE' not handled in switch [-Wswitch]
    mm/compaction.c:1350:2: warning: enumeration value `COMPACT_CONTENDED' not handled in switch [-Wswitch]

    compaction_suitable is allowed to return only COMPACT_PARTIAL,
    COMPACT_SKIPPED and COMPACT_CONTINUE so other cases are simply
    impossible. Put a VM_BUG_ON to catch an impossible return value.
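
    One way to express that, and to silence the warning, is a default case
    asserting the impossible values (a sketch of the approach, not the exact
    hunk; the compaction_suitable() arguments are assumptions):

    ret = compaction_suitable(zone, cc->order, cc->alloc_flags,
                              cc->classzone_idx);
    switch (ret) {
    case COMPACT_PARTIAL:
    case COMPACT_SKIPPED:
            /* Compaction is likely to fail */
            return ret;
    case COMPACT_CONTINUE:
            /* Fall through to compaction */
            break;
    default:
            /* compaction_suitable() never returns the other values */
            VM_BUG_ON(1);
    }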

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Hillf Danton
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Compaction code does weird dances between COMPACT_FOO -> int ->
    unsigned long, but there doesn't seem to be any reason for that. All
    functions which return or use one of those constants expect no other
    value, so it really makes sense to define an enum for them and make it
    clear that no other values are expected.

    This is a pure cleanup and shouldn't introduce any functional changes.
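
    The resulting type looks roughly like this (assuming the enum was named
    compact_result; the value names are taken from the text above, and the
    order here is illustrative rather than the patch's exact one):

    enum compact_result {
            COMPACT_DEFERRED,
            COMPACT_SKIPPED,
            COMPACT_CONTINUE,
            COMPACT_PARTIAL,
            COMPACT_COMPLETE,
            COMPACT_NO_SUITABLE_PAGE,
            COMPACT_NOT_SUITABLE_ZONE,
            COMPACT_CONTENDED,
    };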

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Hillf Danton
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

20 May, 2016

5 commits

  • The classzone_idx can be inferred from preferred_zoneref so remove the
    unnecessary field and save stack space.

    Signed-off-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • alloc_flags is a bitmask of flags but it is signed which does not
    necessarily generate the best code depending on the compiler. Even
    without an impact, it makes more sense that this be unsigned.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The goal of direct compaction is to quickly make a high-order page
    available for the pending allocation. Within an aligned block of pages
    of desired order, a single allocated page that cannot be isolated for
    migration means that the block cannot fully merge to a buddy page that
    would satisfy the allocation request. Therefore we can reduce the
    allocation stall by skipping the rest of the block immediately on
    isolation failure. For async compaction, this also means a higher
    chance of succeeding until it detects contention.

    We however shouldn't completely sacrifice the second objective of
    compaction, which is to reduce overall long-term memory fragmentation.
    As a compromise, perform the eager skipping only in direct async
    compaction, while sync compaction (including kcompactd) remains
    thorough.

    Testing was done using stress-highalloc from mmtests, configured for
    order-4 GFP_KERNEL allocations:

    4.6-rc1 4.6-rc1
    before after
    Success 1 Min 24.00 ( 0.00%) 27.00 (-12.50%)
    Success 1 Mean 30.20 ( 0.00%) 31.60 ( -4.64%)
    Success 1 Max 37.00 ( 0.00%) 35.00 ( 5.41%)
    Success 2 Min 42.00 ( 0.00%) 32.00 ( 23.81%)
    Success 2 Mean 44.00 ( 0.00%) 44.80 ( -1.82%)
    Success 2 Max 48.00 ( 0.00%) 52.00 ( -8.33%)
    Success 3 Min 91.00 ( 0.00%) 92.00 ( -1.10%)
    Success 3 Mean 92.20 ( 0.00%) 92.80 ( -0.65%)
    Success 3 Max 94.00 ( 0.00%) 93.00 ( 1.06%)

    We can see that success rates are unaffected by the skipping.

    4.6-rc1 4.6-rc1
    before after
    User 2587.42 2566.53
    System 482.89 471.20
    Elapsed 1395.68 1382.00

    Times are not such a useful metric for this benchmark, as the main
    portion is the interfering kernel builds, but the results do hint at
    reduced system times.

    4.6-rc1 4.6-rc1
    before after
    Direct pages scanned 163614 159608
    Kswapd pages scanned 2070139 2078790
    Kswapd pages reclaimed 2061707 2069757
    Direct pages reclaimed 163354 159505

    Reduced direct reclaim was unintended, but could be explained by more
    successful first attempt at (async) direct compaction, which is
    attempted before the first reclaim attempt in __alloc_pages_slowpath().

    Compaction stalls 33052 39853
    Compaction success 12121 19773
    Compaction failures 20931 20079

    Compaction is indeed more successful, and thus less likely to get
    deferred, so there are also more direct compaction stalls.

    Page migrate success 3781876 3326819
    Page migrate failure 45817 41774
    Compaction pages isolated 7868232 6941457
    Compaction migrate scanned 168160492 127269354
    Compaction migrate prescanned 0 0
    Compaction free scanned 2522142582 2326342620
    Compaction free direct alloc 0 0
    Compaction free dir. all. miss 0 0
    Compaction cost 5252 4476

    The patch reduces migration scanned pages by 25% thanks to the eager
    skipping.

    [hughd@google.com: prevent nr_isolated_* from going negative]
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Minchan Kim
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Compaction drains the local pcplists each time migration scanner moves
    away from a cc->order aligned block where it isolated pages for
    migration, so that the pages freed by migrations can merge into higher
    orders.

    The detection is currently coarser than it could be. The
    cc->last_migrated_pfn variable should track the lowest pfn that was
    isolated for migration. But it is set to the pfn where
    isolate_migratepages_block() starts scanning, which is typically the
    first pfn of the pageblock. There, the scanner might fail to isolate
    several order-aligned blocks, and then isolate COMPACT_CLUSTER_MAX in
    another block. This would cause the pcplists drain to be performed, even
    though the scanner hadn't yet finished the block it was isolating from.

    This patch thus makes cc->last_migrated_pfn handling more accurate by
    setting it to the pfn of an actually isolated page in
    isolate_migratepages_block(). Although practical effects of this patch
    are likely low, it arguably makes the intent of the code more obvious.
    Also the next patch will make async direct compaction skip blocks more
    aggressively, and draining pcplists due to skipped blocks is wasteful.

    Signed-off-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Minchan Kim
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Compaction code has accumulated numerous instances of manual
    calculations of the first (inclusive) and last (exclusive) pfn of a
    pageblock (or a smaller block of given order), given a pfn within the
    pageblock.

    Wrap these calculations by introducing pageblock_start_pfn(pfn) and
    pageblock_end_pfn(pfn) macros.
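
    A sketch of what these wrappers boil down to (assuming round_down()/
    ALIGN() and pageblock_nr_pages, as used elsewhere in mm):

    /* First pfn of the pageblock containing pfn (inclusive). */
    #define pageblock_start_pfn(pfn)  round_down(pfn, pageblock_nr_pages)

    /* One past the last pfn of that pageblock (exclusive). */
    #define pageblock_end_pfn(pfn)    ALIGN((pfn) + 1, pageblock_nr_pages)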

    [vbabka@suse.cz: fix crash in get_pfnblock_flags_mask() from isolate_freepages():]
    Signed-off-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

06 May, 2016

2 commits

  • Assume memory47 is the last online block left in node1. This will hang:

    # echo offline > /sys/devices/system/node/node1/memory47/state

    After a couple of minutes, the following pops up in dmesg:

    INFO: task bash:957 blocked for more than 120 seconds.
    Not tainted 4.6.0-rc6+ #6
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    bash D ffff8800b7adbaf8 0 957 951 0x00000000
    Call Trace:
    schedule+0x35/0x80
    schedule_timeout+0x1ac/0x270
    wait_for_completion+0xe1/0x120
    kthread_stop+0x4f/0x110
    kcompactd_stop+0x26/0x40
    __offline_pages.constprop.28+0x7e6/0x840
    offline_pages+0x11/0x20
    memory_block_action+0x73/0x1d0
    memory_subsys_offline+0x47/0x60
    device_offline+0x86/0xb0
    store_mem_state+0xda/0xf0
    dev_attr_store+0x18/0x30
    sysfs_kf_write+0x37/0x40
    kernfs_fop_write+0x11d/0x170
    __vfs_write+0x37/0x120
    vfs_write+0xa9/0x1a0
    SyS_write+0x55/0xc0
    entry_SYSCALL_64_fastpath+0x1a/0xa4

    kcompactd is waiting for kcompactd_max_order > 0 when it's woken up to
    actually exit. Check kthread_should_stop() to break out of the wait.
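
    The fix amounts to letting the sleep condition observe
    kthread_should_stop(), along these lines (a sketch assuming the
    kcompactd wait uses wait_event_freezable() and the pgdat fields shown):

    static bool kcompactd_work_requested(pg_data_t *pgdat)
    {
            /* Also wake up when someone has asked the kthread to stop. */
            return pgdat->kcompactd_max_order > 0 || kthread_should_stop();
    }

    /* kcompactd() main loop, abbreviated: */
    while (!kthread_should_stop()) {
            wait_event_freezable(pgdat->kcompactd_wait,
                                 kcompactd_work_requested(pgdat));
            kcompactd_do_work(pgdat);
    }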

    Fixes: 698b1b306 ("mm, compaction: introduce kcompactd").
    Reported-by: Reza Arbab
    Tested-by: Reza Arbab
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • /proc/sys/vm/stat_refresh warns that nr_isolated_anon and
    nr_isolated_file go increasingly negative under compaction: this would
    add delay when there should be none, or no delay when there should be
    one. The bug in compaction was due to a recent mmotm patch, but a much
    older instance of the bug was also noticed in
    isolate_migratepages_range(), which is used for CMA and gigantic
    hugepage allocations.

    The bug is caused by putback_movable_pages() in an error path
    decrementing the isolated counters without them being previously
    incremented by acct_isolated(). Fix isolate_migratepages_range() by
    removing the error-path putback, thus reaching acct_isolated() with
    migratepages still isolated, and leaving putback to caller like most
    other places do.

    Fixes: edc2ca612496 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()")
    [vbabka@suse.cz: expanded the changelog]
    Signed-off-by: Hugh Dickins
    Signed-off-by: Vlastimil Babka
    Acked-by: Joonsoo Kim
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

18 Mar, 2016

2 commits

  • Similarly to direct reclaim/compaction, kswapd attempts to combine
    reclaim and compaction to attempt making memory allocation of given
    order available.

    The details differ from direct reclaim e.g. in having high watermark as
    a goal. The code involved in kswapd's reclaim/compaction decisions has
    evolved to be quite complex.

    Testing reveals that it doesn't actually work in at least one scenario,
    and closer inspection suggests that it could be greatly simplified
    without compromising on the goal (make high-order page available) or
    efficiency (don't reclaim too much). The simplification relies on
    doing all compaction in kcompactd, which is simply woken up when high
    watermarks are reached by kswapd's reclaim.

    The scenario where kswapd compaction doesn't work was found with mmtests
    test stress-highalloc configured to attempt order-9 allocations without
    direct reclaim, just waking up kswapd. There was no compaction attempt
    from kswapd during the whole test. Some added instrumentation shows
    what happens:

    - balance_pgdat() sets end_zone to Normal, as it's not balanced
    - reclaim is attempted on DMA zone, which sets nr_attempted to 99, but
    it cannot reclaim anything, so sc.nr_reclaimed is 0
    - for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so
    it merely checks if high watermarks were reached for base pages.
    This is true, so no reclaim is attempted. For DMA, testorder=0
    wasn't used, as compaction_suitable() returned COMPACT_SKIPPED
    - even though the pgdat_needs_compaction flag wasn't set to false, no
    compaction happens due to the condition sc.nr_reclaimed >
    nr_attempted being false (as 0 < 99)
    - priority-- due to nr_reclaimed being 0, repeat until priority reaches 0
    - pgdat_balanced() is false as only the small zone DMA appears
    balanced (curiously in that check, watermark appears OK and
    compaction_suitable() returns COMPACT_PARTIAL, because a lower
    classzone_idx is used there)

    Now, even if it was decided that reclaim shouldn't be attempted on the
    DMA zone, the scenario would be the same, as (sc.nr_reclaimed=0 >
    nr_attempted=0) is also false. The condition really should use >= as
    the comment suggests. Then there is a mismatch in the check for setting
    pgdat_needs_compaction to false using low watermark, while the rest uses
    high watermark, and who knows what other subtlety. Hopefully this
    demonstrates that this is unsustainable.

    Luckily we can simplify this a lot. The reclaim/compaction decisions
    make sense for direct reclaim scenario, but in kswapd, our primary goal
    is to reach high watermark in order-0 pages. Afterwards we can attempt
    compaction just once. Unlike direct reclaim, we don't reclaim extra
    pages (over the high watermark), the current code already disallows it
    for good reasons.

    After this patch, we simply wake up kcompactd to process the pgdat,
    after we have either succeeded or failed to reach the high watermarks in
    kswapd, which goes to sleep. We pass kswapd's order and classzone_idx,
    so kcompactd can apply the same criteria to determine which zones are
    worth compacting. Note that we use the classzone_idx from
    wakeup_kswapd(), not balanced_classzone_idx which can include higher
    zones that kswapd tried to balance too, but didn't consider them in
    pgdat_balanced().

    Since kswapd now cannot create high-order pages itself, we need to
    adjust how it determines the zones to be balanced. The key element here
    is adding a "highorder" parameter to zone_balanced, which, when set to
    false, makes it consider only order-0 watermark instead of the desired
    higher order (this was done previously by kswapd_shrink_zone(), but not
    elsewhere). This false is passed for example in pgdat_balanced().
    Importantly, wakeup_kswapd() uses true to make sure kswapd and thus
    kcompactd are woken up for a high-order allocation failure.

    The last thing is to decide what to do with pageblock_skip bitmap
    handling. Compaction maintains a pageblock_skip bitmap to record
    pageblocks where isolation recently failed. This bitmap can be reset by
    three ways:

    1) direct compaction is restarting after going through the full deferred cycle

    2) kswapd goes to sleep, and some other direct compaction has previously
    finished scanning the whole zone and set zone->compact_blockskip_flush.
    Note that a successful direct compaction clears this flag.

    3) compaction was invoked manually via trigger in /proc

    The case 2) is somewhat fuzzy to begin with, but after introducing
    kcompactd we should update it. The check for direct compaction in 1),
    and to set the flush flag in 2) use current_is_kswapd(), which doesn't
    work for kcompactd. Thus, this patch adds bool direct_compaction to
    compact_control to use in 2). For the case 1) we remove the check
    completely - unlike the former kswapd compaction, kcompactd does use the
    deferred compaction functionality, so flushing tied to restarting from
    deferred compaction makes sense here.

    Note that when kswapd goes to sleep, kcompactd is woken up, so it will
    see the flushed pageblock_skip bits. This is different from when the
    former kswapd compaction observed the bits and I believe it makes more
    sense. Kcompactd can afford to be more thorough than a direct
    compaction trying to limit allocation latency, or kswapd whose primary
    goal is to reclaim.

    For testing, I used stress-highalloc configured to do order-9
    allocations with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just
    on kswapd/kcompactd reclaim/compaction (the interfering kernel builds in
    phases 1 and 2 work as usual):

    stress-highalloc
    4.5-rc1+before 4.5-rc1+after
    -nodirect -nodirect
    Success 1 Min 1.00 ( 0.00%) 5.00 (-66.67%)
    Success 1 Mean 1.40 ( 0.00%) 6.20 (-55.00%)
    Success 1 Max 2.00 ( 0.00%) 7.00 (-16.67%)
    Success 2 Min 1.00 ( 0.00%) 5.00 (-66.67%)
    Success 2 Mean 1.80 ( 0.00%) 6.40 (-52.38%)
    Success 2 Max 3.00 ( 0.00%) 7.00 (-16.67%)
    Success 3 Min 34.00 ( 0.00%) 62.00 ( 1.59%)
    Success 3 Mean 41.80 ( 0.00%) 63.80 ( 1.24%)
    Success 3 Max 53.00 ( 0.00%) 65.00 ( 2.99%)

    User 3166.67 3181.09
    System 1153.37 1158.25
    Elapsed 1768.53 1799.37

    4.5-rc1+before 4.5-rc1+after
    -nodirect -nodirect
    Direct pages scanned 32938 32797
    Kswapd pages scanned 2183166 2202613
    Kswapd pages reclaimed 2152359 2143524
    Direct pages reclaimed 32735 32545
    Percentage direct scans 1% 1%
    THP fault alloc 579 612
    THP collapse alloc 304 316
    THP splits 0 0
    THP fault fallback 793 778
    THP collapse fail 11 16
    Compaction stalls 1013 1007
    Compaction success 92 67
    Compaction failures 920 939
    Page migrate success 238457 721374
    Page migrate failure 23021 23469
    Compaction pages isolated 504695 1479924
    Compaction migrate scanned 661390 8812554
    Compaction free scanned 13476658 84327916
    Compaction cost 262 838

    After this patch we see improvements in allocation success rate
    (especially for phase 3) along with increased compaction activity. The
    compaction stalls (direct compaction) in the interfering kernel builds
    (probably THP's) also decreased somewhat thanks to kcompactd activity,
    yet THP alloc successes improved a bit.

    Note that elapsed and user time isn't so useful for this benchmark,
    because of the background interference being unpredictable. It's just
    to quickly spot some major unexpected differences. System time is
    somewhat more useful and that didn't increase.

    Also (after adjusting mmtests' ftrace monitor):

    Time kswapd awake 2547781 2269241
    Time kcompactd awake 0 119253
    Time direct compacting 939937 557649
    Time kswapd compacting 0 0
    Time kcompactd compacting 0 119099

    The decrease of overall time spent compacting appears not to match the
    increased compaction stats. I suspect the tasks get rescheduled and,
    since the ftrace monitor doesn't see that, the reported time is wall
    time, not CPU time. But arguably direct compactors care about overall
    latency anyway; whether busy compacting or waiting for CPU doesn't
    matter. And that latency seems to be almost halved.

    It's also interesting how much time kswapd spent awake just going
    through all the priorities and failing to even try compacting, over and
    over.

    We can also configure stress-highalloc to perform both direct
    reclaim/compaction and wakeup kswapd/kcompactd, by using
    GFP_KERNEL|__GFP_HIGH|__GFP_COMP:

    stress-highalloc
    4.5-rc1+before 4.5-rc1+after
    -direct -direct
    Success 1 Min 4.00 ( 0.00%) 9.00 (-50.00%)
    Success 1 Mean 8.00 ( 0.00%) 10.00 (-19.05%)
    Success 1 Max 12.00 ( 0.00%) 11.00 ( 15.38%)
    Success 2 Min 4.00 ( 0.00%) 9.00 (-50.00%)
    Success 2 Mean 8.20 ( 0.00%) 10.00 (-16.28%)
    Success 2 Max 13.00 ( 0.00%) 11.00 ( 8.33%)
    Success 3 Min 75.00 ( 0.00%) 74.00 ( 1.33%)
    Success 3 Mean 75.60 ( 0.00%) 75.20 ( 0.53%)
    Success 3 Max 77.00 ( 0.00%) 76.00 ( 0.00%)

    User 3344.73 3246.04
    System 1194.24 1172.29
    Elapsed 1838.04 1836.76

    4.5-rc1+before 4.5-rc1+after
    -direct -direct
    Direct pages scanned 125146 120966
    Kswapd pages scanned 2119757 2135012
    Kswapd pages reclaimed 2073183 2108388
    Direct pages reclaimed 124909 120577
    Percentage direct scans 5% 5%
    THP fault alloc 599 652
    THP collapse alloc 323 354
    THP splits 0 0
    THP fault fallback 806 793
    THP collapse fail 17 16
    Compaction stalls 2457 2025
    Compaction success 906 518
    Compaction failures 1551 1507
    Page migrate success 2031423 2360608
    Page migrate failure 32845 40852
    Compaction pages isolated 4129761 4802025
    Compaction migrate scanned 11996712 21750613
    Compaction free scanned 214970969 344372001
    Compaction cost 2271 2694

    In this scenario, this patch doesn't change the overall success rate as
    direct compaction already tries all it can. There's however significant
    reduction in direct compaction stalls (that is, the number of
    allocations that went into direct compaction). The number of successes
    (i.e. direct compaction stalls that ended up with successful
    allocation) is reduced by the same number. This means the offload to
    kcompactd is working as expected, and direct compaction is reduced
    either due to detecting contention, or compaction deferred by kcompactd.
    In the previous version of this patchset there was some apparent
    reduction of success rate, but the changes in this version (such as
    using sync compaction only), new baseline kernel, and/or averaging
    results from 5 executions (my bet), made this go away.

    Ftrace-based stats seem to roughly agree:

    Time kswapd awake 2532984 2326824
    Time kcompactd awake 0 257916
    Time direct compacting 864839 735130
    Time kswapd compacting 0 0
    Time kcompactd compacting 0 257585

    Signed-off-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Memory compaction can be currently performed in several contexts:

    - kswapd balancing a zone after a high-order allocation failure
    - direct compaction to satisfy a high-order allocation, including THP
    page fault attemps
    - khugepaged trying to collapse a hugepage
    - manually from /proc

    The purpose of compaction is two-fold. The obvious purpose is to
    satisfy a (pending or future) high-order allocation, and is easy to
    evaluate. The other purpose is to keep overall memory fragmentation low
    and help the anti-fragmentation mechanism. The success wrt the latter
    purpose is more difficult to evaluate.

    The current situation wrt the purposes has a few drawbacks:

    - compaction is invoked only when a high-order page or hugepage is not
    available (or manually). This might be too late for the purposes of
    keeping memory fragmentation low.
    - direct compaction increases latency of allocations. Again, it would
    be better if compaction was performed asynchronously to keep
    fragmentation low, before the allocation itself comes.
    - (a special case of the previous) the cost of compaction during THP
    page faults can easily offset the benefits of THP.
    - kswapd compaction appears to be complex, fragile and not working in
    some scenarios. It could also end up compacting for a high-order
    allocation request when it should be reclaiming memory for a later
    order-0 request.

    To improve the situation, we should be able to benefit from an
    equivalent of kswapd, but for compaction - i.e. a background thread
    which responds to fragmentation and the need for high-order allocations
    (including hugepages) somewhat proactively.

    One possibility is to extend the responsibilities of kswapd, which could
    however complicate its design too much. It should be better to let
    kswapd handle reclaim, as order-0 allocations are often more critical
    than high-order ones.

    Another possibility is to extend khugepaged, but this kthread is a
    single instance and tied to THP configs.

    This patch goes with the option of a new set of per-node kthreads called
    kcompactd, and lays the foundations, without introducing any new
    tunables. The lifecycle mimics kswapd kthreads, including the memory
    hotplug hooks.

    For compaction, kcompactd uses the standard compaction_suitable() and
    compact_finished() criteria and the deferred compaction functionality.
    Unlike direct compaction, it uses only sync compaction, as there's no
    allocation latency to minimize.

    This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
    compact/reclaim loop for high-order pages will be replaced by waking up
    kcompactd in the next patch with the description of what's wrong with
    the old approach.

    Waking up of the kcompactd threads is also tied to kswapd activity and
    follows these rules:
    - we don't want to affect any fastpaths, so wake up kcompactd only from
    the slowpath, as it's done for kswapd
    - if kswapd is doing reclaim, it's more important than compaction, so
    don't invoke kcompactd until kswapd goes to sleep
    - the target order used for kswapd is passed to kcompactd

    Possible future uses for kcompactd include the ability to wake up
    kcompactd on demand in special situations, such as when hugepages are
    not available (currently not done due to __GFP_NO_KSWAPD) or when a
    fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
    possible to perform periodic compaction with kcompactd.

    [arnd@arndb.de: fix build errors with kcompactd]
    [paul.gortmaker@windriver.com: don't use modular references for non modular code]
    Signed-off-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul Gortmaker
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

16 Mar, 2016

3 commits

  • There is a report of a performance drop due to hugepage allocation in
    which half of the CPU time is spent in pageblock_pfn_to_page() during
    compaction [1].

    In that workload, compaction is triggered to make hugepages, but most
    pageblocks are unavailable for compaction due to their pageblock type
    and skip bit, so compaction usually fails. The most costly operation in
    this case is finding a valid pageblock while scanning the whole zone
    range. To check whether a pageblock is valid to compact, a valid pfn
    within the pageblock is required, and we can obtain it by calling
    pageblock_pfn_to_page(). This function checks whether the pageblock is
    in a single zone and returns a valid pfn if possible. The problem is
    that we need to check this every time before scanning a pageblock, even
    if we re-visit it, and this turns out to be very expensive in this
    workload.

    Although we have no way to skip this pageblock check on a system where
    holes exist at arbitrary positions, we can cache the zone's continuity
    and just do pfn_to_page() on a system where no hole exists. This
    optimization considerably speeds up the above workload.

    Before vs After
    Max: 1096 MB/s vs 1325 MB/s
    Min: 635 MB/s vs 1015 MB/s
    Avg: 899 MB/s vs 1194 MB/s

    Avg is improved by roughly 30% [2].

    [1]: http://www.spinics.net/lists/linux-mm/msg97378.html
    [2]: https://lkml.org/lkml/2015/12/9/23
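
    Roughly, the cached check looks like this (a sketch; the zone->contiguous
    flag and the __pageblock_pfn_to_page() slow-path name are assumptions
    about the patch's naming):

    static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
                            unsigned long end_pfn, struct zone *zone)
    {
            if (zone->contiguous)
                    return pfn_to_page(start_pfn);  /* no holes: fast path */

            /* Slow path: verify the whole range is valid and in one zone. */
            return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
    }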

    [akpm@linux-foundation.org: don't forget to restore zone->contiguous on error path, per Vlastimil]
    Signed-off-by: Joonsoo Kim
    Reported-by: Aaron Lu
    Acked-by: Vlastimil Babka
    Tested-by: Aaron Lu
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • pageblock_pfn_to_page() is used to check that there is a valid pfn and
    that all pages in the pageblock are in a single zone. If there is a
    hole in the pageblock, passing an arbitrary position to
    pageblock_pfn_to_page() could cause the whole pageblock scan to be
    skipped, instead of just skipping the hole page. For deterministic
    behaviour, it's better to always pass a pageblock-aligned range to
    pageblock_pfn_to_page(). It will also help further optimization of
    pageblock_pfn_to_page() in the following patch.

    Signed-off-by: Joonsoo Kim
    Cc: Aaron Lu
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • free_pfn and compact_cached_free_pfn are the pointers that remember the
    restart position of the freepage scanner. When they are reset or
    invalid, we set them to zone_end_pfn because the freepage scanner works
    in the reverse direction. But, because the zone range is defined as
    [zone_start_pfn, zone_end_pfn), zone_end_pfn is invalid to access.
    Therefore, we should not store it in free_pfn and
    compact_cached_free_pfn; instead, we need to store zone_end_pfn - 1 in
    them. There is one more thing to consider: the freepage scanner scans
    in reverse by pageblock units. If free_pfn and compact_cached_free_pfn
    are set to the middle of a pageblock, the scanner regards the front part
    of the pageblock as already scanned, so we lose the opportunity to scan
    there. To fix this up, this patch does a round_down() to guarantee that
    the reset position is pageblock aligned.

    Note that, thanks to the current pageblock_pfn_to_page() implementation,
    no actual access to zone_end_pfn happens today. But a following patch
    will change pageblock_pfn_to_page(), so this patch is needed from now
    on.
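
    In code, the reset becomes roughly (a sketch; the field and helper names
    are as used elsewhere in compaction, so treat them as assumptions here):

    /* Reset the free scanner to the last valid, pageblock-aligned pfn. */
    zone->compact_cached_free_pfn =
            round_down(zone_end_pfn(zone) - 1, pageblock_nr_pages);
    cc->free_pfn = zone->compact_cached_free_pfn;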

    Signed-off-by: Joonsoo Kim
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Aaron Lu
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

15 Jan, 2016

2 commits

  • This patch uses is_via_compact_memory() to distinguish compaction
    triggered via sysfs or sysctl. It also reduces indentation around
    compaction_defer_reset() by filtering these cases first, before checking
    the watermark.

    There is no functional change.

    Signed-off-by: Joonsoo Kim
    Acked-by: Yaowei Bai
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • sysctl_compaction_handler() is the handler function for the
    compact_memory tunable knob under /proc/sys/vm; add the missing knob
    name to the comment to make it more accurate.

    No functional change.

    Signed-off-by: Yaowei Bai
    Acked-by: Vlastimil Babka
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     

06 Nov, 2015

3 commits

  • Compaction returns prematurely with COMPACT_PARTIAL when contended or
    when a fatal signal is pending. This is ok for the callers, but might be
    misleading
    in the traces, as the usual reason to return COMPACT_PARTIAL is that we
    think the allocation should succeed. After this patch we distinguish the
    premature ending condition in the mm_compaction_finished and
    mm_compaction_end tracepoints.

    The contended status covers the following reasons:
    - lock contention or need_resched() detected in async compaction
    - fatal signal pending
    - too many pages isolated in the zone (only for async compaction)
    Further distinguishing the exact reason seems unnecessary for now.

    Signed-off-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Steven Rostedt
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Some compaction tracepoints convert the integer return values to strings
    using the compaction_status_string array. This works for in-kernel
    printing, but not userspace trace printing of raw captured trace such as
    via trace-cmd report.

    This patch converts the private array to appropriate tracepoint macros
    that result in proper userspace support.

    trace-cmd output before:
    transhuge-stres-4235 [000] 453.149280: mm_compaction_finished: node=0
    zone=ffffffff81815d7a order=9 ret=

    after:
    transhuge-stres-4235 [000] 453.149280: mm_compaction_finished: node=0
    zone=ffffffff81815d7a order=9 ret=partial
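
    The conversion boils down to using the standard __print_symbolic()
    tracepoint helper instead of a private string array, roughly (a sketch;
    the real patch shares the value/name pairs with TRACE_DEFINE_ENUM via
    helper macros):

    /* Inside the TP_printk() format of the compaction tracepoints: */
    __print_symbolic(__entry->status,
            { COMPACT_SKIPPED,  "skipped"  },
            { COMPACT_CONTINUE, "continue" },
            { COMPACT_PARTIAL,  "partial"  },
            { COMPACT_COMPLETE, "complete" })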

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Steven Rostedt
    Cc: Joonsoo Kim
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Introduce is_via_compact_memory() helper indicating compacting via
    /proc/sys/vm/compact_memory to improve readability.

    To catch this situation in __compaction_suitable, use order as parameter
    directly instead of using struct compact_control.

    This patch has no functional changes.
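
    The helper itself is tiny: compaction triggered via
    /proc/sys/vm/compact_memory uses order == -1 as its marker, so the check
    is roughly:

    /*
     * order == -1 is expected when compacting via
     * /proc/sys/vm/compact_memory
     */
    static inline bool is_via_compact_memory(int order)
    {
            return order == -1;
    }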

    Signed-off-by: Yaowei Bai
    Cc: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     

09 Sep, 2015

1 commit

  • We cache isolate_start_pfn before entering isolate_migratepages(). If
    a pageblock is skipped in isolate_migratepages() for whatever reason,
    cc->migrate_pfn can be far from isolate_start_pfn, and hence we flush
    pages that were freed. For example, the following scenario is possible:

    - assume order-9 compaction, pageblock order is 9
    - start_isolate_pfn is 0x200
    - isolate_migratepages()
    - skip a number of pageblocks
    - start to isolate from pfn 0x600
    - cc->migrate_pfn = 0x620
    - return
    - last_migrated_pfn is set to 0x200
    - check flushing condition
    - current_block_start is set to 0x600
    - last_migrated_pfn < current_block_start then do useless flush

    This wrong flush doesn't help performance or the success rate, so this
    patch fixes it. One simple way to know the exact position where we
    start isolating migratable pages is to cache it in
    isolate_migratepages() before entering the actual isolation. This patch
    implements that and fixes the problem.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim