26 Sep, 2014

40 commits

  • commit be9765722e6b7ece8263cbab857490332339bd6f upstream.

    Compaction uses the compact_checklock_irqsave() function to periodically
    check for lock contention and need_resched() to either abort async
    compaction, or to free the lock, schedule and retake the lock. When
    aborting, cc->contended is set to signal the contended state to the
    caller. Two problems have been identified in this mechanism.

    First, compaction also calls cond_resched() directly in both scanners
    when no lock is yet taken. This call neither aborts async compaction nor
    sets cc->contended appropriately. This patch introduces a new
    compact_should_abort() function to achieve both. In isolate_freepages(),
    the check frequency is reduced to once per SWAP_CLUSTER_MAX pageblocks
    to match what the migration scanner does in the preliminary page checks.
    In case a pageblock is found suitable for calling
    isolate_freepages_block(), the checks within it are done at a higher
    frequency.

    Second, isolate_freepages() does not check if isolate_freepages_block()
    aborted due to contention, and advances to the next pageblock. This
    violates the principle of aborting on contention, and might result in
    pageblocks not being scanned completely, since the scanning cursor is
    advanced. This problem has been noticed in the code by Joonsoo Kim when
    reviewing related patches. This patch makes isolate_freepages_block()
    check the cc->contended flag and abort.

    In case isolate_freepages() has already isolated some pages before
    aborting due to contention, page migration will proceed, which is OK
    since we do not want to waste the work that has been done, and page
    migration has its own checks for contention. However, we do not want
    another isolation attempt by either of the scanners, so a cc->contended
    flag check is also added to compaction_alloc() and compact_finished() to
    make sure compaction is aborted right after the migration.

    The outcome of the patch should be reduced lock contention by async
    compaction and lower latencies for higher-order allocations where direct
    compaction is involved.
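
    As a concrete sketch of the new helper (modelled on the description
    above; cc->mode assumes the bool-to-migrate_mode conversion from commit
    e0b9daeb further down this list, and details may differ from the exact
    upstream code):

    static inline bool compact_should_abort(struct compact_control *cc)
    {
            /* async compaction aborts when it would otherwise reschedule */
            if (need_resched()) {
                    if (cc->mode == MIGRATE_ASYNC) {
                            cc->contended = true;
                            return true;
                    }

                    cond_resched();
            }

            return false;
    }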

    [akpm@linux-foundation.org: fix typo in comment]
    Reported-by: Joonsoo Kim
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Michal Nazarewicz
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: Michal Nazarewicz
    Tested-by: Shawn Guo
    Tested-by: Kevin Hilman
    Tested-by: Stephen Warren
    Tested-by: Fabio Estevam
    Cc: David Rientjes
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Vlastimil Babka
     
  • commit e9ade569910a82614ff5f2c2cea2b65a8d785da4 upstream.

    The compaction free scanner in isolate_freepages() currently remembers PFN
    of the highest pageblock where it successfully isolates, to be used as the
    starting pageblock for the next invocation. The rationale behind this is
    that page migration might return free pages to the allocator when
    migration fails and we don't want to skip them if the compaction
    continues.

    Since migration now returns free pages back to compaction code where they
    can be reused, this is no longer a concern. This patch changes
    isolate_freepages() so that the PFN for restarting is updated with each
    pageblock where isolation is attempted. Using stress-highalloc from
    mmtests, this resulted in 10% reduction of the pages scanned by the free
    scanner.

    Note that the somewhat similar functionality that records the highest
    successful pageblock in zone->compact_cached_free_pfn remains unchanged.
    This cache is used when the whole compaction is restarted, not for
    multiple invocations of the free scanner during a single compaction.

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Bartlomiej Zolnierkiewicz
    Acked-by: Michal Nazarewicz
    Reviewed-by: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Vlastimil Babka
     
  • commit f8c9301fa5a2a8b873c67f2a3d8230d5c13f61b7 upstream.

    During compaction, update_nr_listpages() has been used to count
    remaining non-migrated and free pages after a call to migrate_pages().
    The freepages counting has become unnecessary, and it turns out that
    migratepages counting is also unnecessary in most cases.

    The only situation when it's needed to count cc->migratepages is when
    migrate_pages() returns with a negative error code. Otherwise, the
    non-negative return value is the number of pages that were not migrated,
    which is exactly the count of remaining pages in the cc->migratepages
    list.

    Furthermore, any non-zero count is only interesting for the tracepoint of
    mm_compaction_migratepages events, because after that all remaining
    unmigrated pages are put back and their count is set to 0.

    This patch therefore removes update_nr_listpages() completely, and changes
    the tracepoint definition so that the manual counting is done only when
    the tracepoint is enabled, and only when migrate_pages() returns a
    negative error code.

    Furthermore, migrate_pages() and the tracepoints won't be called when
    there's nothing to migrate. This potentially avoids some wasted cycles
    and reduces the volume of uninteresting mm_compaction_migratepages events
    where "nr_migrated=0 nr_failed=0". In the stress-highalloc mmtest, this
    was about 75% of the events. The mm_compaction_isolate_migratepages event
    is better for determining that nothing was isolated for migration, and
    this one was just duplicating the info.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Bartlomiej Zolnierkiewicz
    Acked-by: Michal Nazarewicz
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Vlastimil Babka
     
  • commit aeef4b83806f49a0c454b7d4578671b71045bee2 upstream.

    Async compaction terminates prematurely when need_resched(), see
    compact_checklock_irqsave(). This can never trigger, however, if the
    cond_resched() in isolate_migratepages_range() always takes care of the
    scheduling.

    If the cond_resched() actually triggers, then terminate this pageblock
    scan for async compaction as well.

    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    David Rientjes
     
  • commit e0b9daeb453e602a95ea43853dc12d385558ce1f upstream.

    We're going to want to manipulate the migration mode for compaction in the
    page allocator, and currently compact_control's sync field is only a bool.

    Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction
    depending on the value of this bool. Convert the bool to enum
    migrate_mode and pass the migration mode in directly. Later, we'll want
    to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault path to
    avoid unnecessary latency.

    This also alters compaction triggered from sysfs, either for the entire
    system or for a node, to force MIGRATE_SYNC.
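
    The enum itself is the existing migrate_mode from
    include/linux/migrate_mode.h; the compact_control change amounts to
    roughly the following sketch (comments paraphrased, not quoted):

    enum migrate_mode {
            MIGRATE_ASYNC,          /* never block */
            MIGRATE_SYNC_LIGHT,     /* may block, but avoid the costliest waits */
            MIGRATE_SYNC,           /* may block and wait for writeback */
    };

    struct compact_control {
            /* ... other fields unchanged ... */
            enum migrate_mode mode; /* replaces the old 'bool sync' */
    };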

    [akpm@linux-foundation.org: fix build]
    [iamjoonsoo.kim@lge.com: use MIGRATE_SYNC in alloc_contig_range()]
    Signed-off-by: David Rientjes
    Suggested-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Greg Thelen
    Cc: Naoya Horiguchi
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    David Rientjes
     
  • commit 35979ef3393110ff3c12c6b94552208d3bdf1a36 upstream.

    Each zone has a cached migration scanner pfn for memory compaction so that
    subsequent calls to memory compaction can start where the previous call
    left off.

    Currently, the compaction migration scanner only updates the per-zone
    cached pfn when pageblocks were not skipped for async compaction. This
    creates a dependency on calling sync compaction to avoid having
    subsequent calls to async compaction scan an enormous amount of
    non-MOVABLE pageblocks each time it is called. On large machines, this
    could be potentially very expensive.

    This patch adds a per-zone cached migration scanner pfn only for async
    compaction. It is updated every time a pageblock has been scanned in its
    entirety and no pages from it were successfully isolated. The cached
    migration scanner pfn for sync compaction is updated only when called
    for sync compaction.

    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Cc: Greg Thelen
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    David Rientjes
     
  • commit d53aea3d46d64e95da9952887969f7533b9ab25e upstream.

    Greg reported that he found isolated free pages were returned back to the
    VM rather than the compaction freelist. This will cause holes behind the
    free scanner and cause it to reallocate additional memory if necessary
    later.

    He detected the problem at runtime seeing that ext4 metadata pages (esp
    the ones read by "sbi->s_group_desc[i] = sb_bread(sb, block)") were
    constantly visited by compaction calls of migrate_pages(). These pages
    had a non-zero b_count which caused fallback_migrate_page() ->
    try_to_release_page() -> try_to_free_buffers() to fail.

    Memory compaction works by having a "freeing scanner" scan from one end of
    a zone which isolates pages as migration targets while another "migrating
    scanner" scans from the other end of the same zone which isolates pages
    for migration.

    When page migration fails for an isolated page, the target page is
    returned to the system rather than the freelist built by the freeing
    scanner. This may require the freeing scanner to continue scanning memory
    after suitable migration targets have already been returned to the system
    needlessly.

    This patch returns destination pages to the freeing scanner freelist when
    page migration fails. This prevents unnecessary work done by the freeing
    scanner but also encourages memory to be as compacted as possible at the
    end of the zone.

    Signed-off-by: David Rientjes
    Reported-by: Greg Thelen
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    David Rientjes
     
  • commit 68711a746345c44ae00c64d8dbac6a9ce13ac54a upstream.

    Memory migration uses a callback defined by the caller to determine how to
    allocate destination pages. When migration fails for a source page,
    however, it frees the destination page back to the system.

    This patch adds a memory migration callback defined by the caller to
    determine how to free destination pages. If a caller, such as memory
    compaction, builds its own freelist for migration targets, this can reuse
    already freed memory instead of scanning additional memory.

    If the caller provides a function to handle freeing of destination pages,
    it is called when page migration fails. If the caller passes NULL then
    freeing back to the system will be handled as usual. This patch
    introduces no functional change.
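
    A sketch of the resulting interface (treat the exact parameter list as
    an approximation of the upstream prototype):

    typedef void free_page_t(struct page *page, unsigned long private);

    int migrate_pages(struct list_head *from, new_page_t get_new_page,
                      free_page_t put_new_page, unsigned long private,
                      enum migrate_mode mode, int reason);

    Compaction can then pass a callback that returns failed destination
    pages to cc->freepages, while callers that pass NULL for put_new_page
    keep the old free-back-to-the-system behaviour.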

    Signed-off-by: David Rientjes
    Reviewed-by: Naoya Horiguchi
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    David Rientjes
     
  • commit c96b9e508f3d06ddb601dcc9792d62c044ab359e upstream.

    isolate_freepages() is currently somewhat hard to follow thanks to many
    look-alike variables. In particular, the name of the 'high_pfn' variable
    looks like it is related to the 'low_pfn' variable, but in fact it is not.

    This patch renames the 'high_pfn' variable to a hopefully less confusing name,
    and slightly changes its handling without a functional change. A comment made
    obsolete by recent changes is also updated.

    [akpm@linux-foundation.org: comment fixes, per Minchan]
    [iamjoonsoo.kim@lge.com: cleanups]
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: Dongjun Shin
    Cc: Sunghwan Yun
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Vlastimil Babka
     
  • commit 13fb44e4b0414d7e718433a49e6430d5b76bd46e upstream.

    Remove code lines currently not in use or never called.

    Signed-off-by: Heesub Shin
    Acked-by: Vlastimil Babka
    Cc: Dongjun Shin
    Cc: Sunghwan Yun
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: Dongjun Shin
    Cc: Sunghwan Yun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Heesub Shin
     
  • commit 29f175d125f0f3a9503af8a5596f93d714cceb08 upstream.

    Commit f9acc8c7b35a ("readahead: sanify file_ra_state names") left
    ra_submit with a single function call.

    Move ra_submit to internal.h and inline it to save some stack. Thanks
    to Andrew Morton for commenting on different versions.
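
    The result is roughly the following inline helper in mm/internal.h (a
    sketch; the callee and the file_ra_state fields are assumed from the
    existing readahead code):

    static inline unsigned long ra_submit(struct file_ra_state *ra,
                    struct address_space *mapping, struct file *filp)
    {
            return __do_page_cache_readahead(mapping, filp,
                                             ra->start, ra->size,
                                             ra->async_size);
    }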

    Signed-off-by: Fabian Frederick
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Fabian Frederick
     
  • commit 9e8c2af96e0d2d5fe298dd796fb6bc16e888a48d upstream.

    ... it does that itself (via kmap_atomic())

    Signed-off-by: Al Viro
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Al Viro
     
  • commit 67f9fd91f93c582b7de2ab9325b6e179db77e4d5 upstream.

    This patch removes read_cache_page_async() which wasn't really needed
    anywhere and simplifies the code around it a bit.

    read_cache_page_async() is useful when we want to read a page into the
    cache without waiting for it to complete. This happens when the
    appropriate callback 'filler' doesn't complete its read operation and
    releases the page lock immediately, and instead queues a different
    completion routine to do that. This never actually happened anywhere in
    the code.

    read_cache_page_async() had 3 different callers:

    - read_cache_page(), which is the sync version; it would just wait for
    the requested read to complete using wait_on_page_read().

    - JFFS2 would call it from jffs2_gc_fetch_page(), but the filler
    function it supplied doesn't do any async reads and would complete
    before the filler function returns - making it actually a sync read.

    - CRAMFS would call it using the read_mapping_page_async() wrapper, with
    a similar story to JFFS2 - the filler function doesn't do anything
    resembling async reads and would always complete before the filler
    function returns.

    To sum it up, the code in mm/filemap.c never took advantage of having
    read_cache_page_async(). While there are filler callbacks that do async
    reads (such as the block one), they were always invoked through the
    synchronous read_cache_page().

    This patch adds a mandatory wait for read to complete when adding a new
    page to the cache, and removes read_cache_page_async() and its wrappers.

    Signed-off-by: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Sasha Levin
     
  • commit 55231e5c898c5c03c14194001e349f40f59bd300 upstream.

    MADV_WILLNEED currently does not read swapped out shmem pages back in.

    Commit 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page
    cache radix trees") made find_get_page() filter exceptional radix tree
    entries but failed to convert all find_get_page() callers that WANT
    exceptional entries over to find_get_entry(). One of them is shmem swap
    readahead in madvise, which now skips over any swap-out records.

    Convert it to find_get_entry().

    Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees")
    Signed-off-by: Johannes Weiner
    Reported-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Johannes Weiner
     
  • commit 0cd6144aadd2afd19d1aca880153530c52957604 upstream.

    shmem mappings already contain exceptional entries where swap slot
    information is remembered.

    To be able to store eviction information for regular page cache, prepare
    every site dealing with the radix trees directly to handle entries other
    than pages.

    The common lookup functions will filter out non-page entries and return
    NULL for page cache holes, just as before. But provide a raw version of
    the API which returns non-page entries as well, and switch shmem over to
    use it.
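
    A caller of the raw lookup then distinguishes the entry types itself,
    along these lines (an illustrative sketch, not code from the patch):

    struct page *entry = find_get_entry(mapping, index);   /* raw lookup */

    if (!entry) {
            /* a true page cache hole */
    } else if (radix_tree_exceptional_entry(entry)) {
            /* not a page: e.g. a shmem swap slot (later, a shadow entry);
               no page reference is held for these */
    } else {
            /* a regular page cache page, with a reference held */
    }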

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Johannes Weiner
     
  • commit e7b563bb2a6f4d974208da46200784b9c5b5a47e upstream.

    The radix tree hole searching code is only used for page cache, for
    example the readahead code trying to get a picture of the area
    surrounding a fault.

    It sufficed to rely on the radix tree definition of holes, which is
    "empty tree slot". But this is about to change, as shadow page
    descriptors will be stored in the page cache after the actual pages get
    evicted from memory.

    Move the functions over to mm/filemap.c and make them native page cache
    operations, where they can later be adapted to handle the new definition
    of "page cache hole".

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Johannes Weiner
     
  • commit 6dbaf22ce1f1dfba33313198eb5bd989ae76dd87 upstream.

    Page cache radix tree slots are usually stabilized by the page lock, but
    shmem's swap cookies have no such thing. Because the overall truncation
    loop is lockless, the swap entry is currently confirmed by a tree lookup
    and then deleted by another tree lookup under the same tree lock region.

    Use radix_tree_delete_item() instead, which does the verification and
    deletion with only one lookup. This also allows removing the
    delete-only special case from shmem_radix_tree_replace().
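
    Conceptually, the lookup-verify-then-delete sequence under the tree lock
    collapses into a single call along these lines (a sketch of the pattern,
    not the exact shmem code):

    spin_lock_irq(&mapping->tree_lock);
    old = radix_tree_delete_item(&mapping->page_tree, index, expected);
    spin_unlock_irq(&mapping->tree_lock);
    if (old != expected)
            return -ENOENT;         /* the slot changed under us */
    /* the swap entry is confirmed and already removed from the tree */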

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Johannes Weiner
     
  • commit 50f5aa8a9b248fa4262cf379863ec9a531b49737 upstream.

    BUG_ON() is a big hammer, and should be used _only_ if there is some
    major corruption that you cannot possibly recover from, making it
    imperative that the current process (and possibly the whole machine) be
    terminated with extreme prejudice.

    The trivial sanity check in the vmacache code is *not* such a fatal
    error. Recovering from it is absolutely trivial, and using BUG_ON()
    just makes it harder to debug for no actual advantage.

    To make matters worse, the placement of the BUG_ON() (only if the range
    check matched) actually makes it harder to hit the sanity check to begin
    with, so _if_ there is a bug (and we just got a report from Srivatsa
    Bhat that this can indeed trigger), it is harder to debug not just
    because the machine is possibly dead, but because we don't have better
    coverage.

    BUG_ON() must *die*. Maybe we should add a checkpatch warning for it,
    because it is simply just about the worst thing you can ever do if you
    hit some "this cannot happen" situation.

    Reported-by: Srivatsa S. Bhat
    Cc: Davidlohr Bueso
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Linus Torvalds
     
  • commit 615d6e8756c87149f2d4c1b93d471bca002bd849 upstream.

    This patch is a continuation of efforts trying to optimize find_vma(),
    avoiding potentially expensive rbtree walks to locate a vma upon faults.
    The original approach (https://lkml.org/lkml/2013/11/1/410), where the
    largest vma was also cached, ended up being too specific and random, so
    further comparison with other approaches was needed. There are two
    things to consider when dealing with this: the cache hit rate and the
    latency of find_vma(). Improving the hit rate does not necessarily
    translate into finding the vma any faster, as the overhead of any fancy
    caching scheme can be too high to consider.

    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles in find_vma() by
    up to 250%, for workloads with good locality. On the other hand, this
    simple scheme is pretty much useless for workloads with poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.

    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number. The only expensive operation is in the rare case of a seq
    number overflow, where all caches that share the same address space are
    flushed. Upon a miss, the proposed replacement policy is based on the
    page number that contains the virtual address in question. Concretely,
    the following results are seen on an 80 core, 8 socket x86-64 box:

    1) System bootup: Most programs are single threaded, so the per-thread
    scheme does improve ~50% hit rate by just adding a few more slots to
    the cache.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   50.61% |            19.90 |
    | patched        |   73.45% |            13.58 |
    +----------------+----------+------------------+

    2) Kernel build: This one is already pretty good with the current
    approach as we're dealing with good locality.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   75.28% |            11.03 |
    | patched        |   88.09% |             9.31 |
    +----------------+----------+------------------+

    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   70.66% |            17.14 |
    | patched        |   91.15% |            12.57 |
    +----------------+----------+------------------+

    4) Ebizzy: There's a fair amount of variation from run to run, but this
    approach always shows nearly perfect hit rates, while baseline is just
    about non-existent. The amount of cycles can fluctuate anywhere from
    ~60 to ~116 for the baseline scheme, but this approach reduces it
    considerably. For instance, with 80 threads:

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |    1.06% |            91.54 |
    | patched        |   99.97% |            14.18 |
    +----------------+----------+------------------+
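
    A minimal user-space model of the per-thread scheme described above (the
    slot count, hash and field names here are illustrative assumptions, not
    the kernel's actual vmacache code):

    #include <stddef.h>

    #define VMACACHE_SIZE 4                 /* a few slots per thread */

    struct vma { unsigned long start, end; };

    struct mm {
            unsigned int seqnum;            /* bump to invalidate all caches */
    };

    struct thread_vmacache {
            unsigned int seqnum;            /* snapshot of mm->seqnum */
            struct vma *slots[VMACACHE_SIZE];
    };

    /* replacement policy: index by the page number containing the address */
    static size_t slot_of(unsigned long addr, unsigned int page_shift)
    {
            return (addr >> page_shift) & (VMACACHE_SIZE - 1);
    }

    static struct vma *cache_find(struct thread_vmacache *c, struct mm *mm,
                                  unsigned long addr, unsigned int page_shift)
    {
            if (c->seqnum != mm->seqnum)    /* stale: the map has changed */
                    return NULL;
            struct vma *v = c->slots[slot_of(addr, page_shift)];
            return (v && addr >= v->start && addr < v->end) ? v : NULL;
    }

    static void cache_update(struct thread_vmacache *c, struct mm *mm,
                             unsigned long addr, struct vma *v,
                             unsigned int page_shift)
    {
            c->seqnum = mm->seqnum;         /* revalidate on refill */
            c->slots[slot_of(addr, page_shift)] = v;
    }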

    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: Linus Torvalds
    Reviewed-by: Michel Lespinasse
    Cc: Oleg Nesterov
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Davidlohr Bueso
     
  • commit 83da7510058736c09a14b9c17ec7d851940a4332 upstream.

    Seems to be called with preemption enabled. Therefore it must use
    mod_zone_page_state instead.

    Signed-off-by: Christoph Lameter
    Reported-by: Grygorii Strashko
    Tested-by: Grygorii Strashko
    Cc: Tejun Heo
    Cc: Santosh Shilimkar
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Christoph Lameter
     
  • commit d5bc5fd3fcb7b8dfb431694a8c8052466504c10c upstream.

    The name `max_pass' is misleading, because this variable actually keeps
    the estimated number of freeable objects, not the maximal number of
    objects we can scan in this pass, which can be twice that. Rename it to
    reflect its actual meaning.

    Signed-off-by: Vladimir Davydov
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Vladimir Davydov
     
  • commit 99120b772b52853f9a2b829a21dd44d9b20558f1 upstream.

    When direct reclaim is executed by a process bound to a set of NUMA
    nodes, we should scan only those nodes when possible, but currently we
    will scan kmem from all online nodes even if the kmem shrinker is NUMA
    aware. That said, binding a process to a particular NUMA node won't
    prevent it from shrinking inode/dentry caches from other nodes, which is
    not good. Fix this.

    Signed-off-by: Vladimir Davydov
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Dave Chinner
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Vladimir Davydov
     
  • commit 7fcbbaf18392f0b17c95e2f033c8ccf87eecde1d upstream.

    In some testing I ran today (some fio jobs that spread over two nodes),
    we end up spending 40% of the time in filemap_check_errors(). That
    smells fishy. Looking further, this is basically what happens:

    blkdev_aio_read()
        generic_file_aio_read()
            filemap_write_and_wait_range()
                if (!mapping->nr_pages)
                    filemap_check_errors()

    and filemap_check_errors() always attempts two test_and_clear_bit() on
    the mapping flags, thus dirtying it for every single invocation. The
    patch below tests each of these bits before clearing them, avoiding this
    issue. In my test case (4-socket box), performance went from 1.7M IOPS
    to 4.0M IOPS.
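
    The fix amounts to guarding each test_and_clear_bit() with a plain
    test_bit(), so the cacheline is only dirtied when a flag is actually
    set. Roughly (a sketch of the patched function):

    int filemap_check_errors(struct address_space *mapping)
    {
            int ret = 0;

            if (test_bit(AS_ENOSPC, &mapping->flags) &&
                test_and_clear_bit(AS_ENOSPC, &mapping->flags))
                    ret = -ENOSPC;
            if (test_bit(AS_EIO, &mapping->flags) &&
                test_and_clear_bit(AS_EIO, &mapping->flags))
                    ret = -EIO;
            return ret;
    }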

    Signed-off-by: Jens Axboe
    Acked-by: Jeff Moyer
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Jens Axboe
     
  • commit d26914d11751b23ca2e8747725f2cae10c2f2c1b upstream.

    Since put_mems_allowed() is strictly optional - it's a seqcount retry -
    we don't need to evaluate the function if the allocation was in fact
    successful, saving an smp_rmb and some loads and comparisons on some
    relatively fast paths.

    Since the naming of get/put_mems_allowed() does suggest a mandatory
    pairing, rename the interface, as suggested by Mel, to resemble the
    seqcount interface.

    This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
    where it is important to note that the return value of the latter call
    is inverted from its previous incarnation.
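
    Callers then follow the usual seqcount pattern, e.g. in the allocator
    slowpath (a sketch; the allocation call itself is elided):

    unsigned int cookie;
    struct page *page = NULL;

    do {
            cookie = read_mems_allowed_begin();

            /* ... attempt the allocation under the current
               cpuset mems_allowed mask, setting 'page' ... */

    } while (!page && read_mems_allowed_retry(cookie));
    /* read_mems_allowed_retry() returns true when mems_allowed
       may have changed since read_mems_allowed_begin() */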

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 6d2be915e589b58cb11418cbe1f22ff90732b6ac upstream.

    Currently max_sane_readahead() returns zero on a cpu whose NUMA node has
    no local memory, which leads to readahead failure. Fix this readahead
    failure by returning the minimum of (requested pages, 512). Applications
    that need readahead, such as streaming applications, running on a
    memoryless cpu see a considerable boost in performance.

    Result:

    fadvise experiment with FADV_WILLNEED on a PPC machine having memoryless
    CPU with 1GB testfile (12 iterations) yielded around 46.66% improvement.

    fadvise experiment with FADV_WILLNEED on a x240 machine with 1GB
    testfile 32GB* 4G RAM numa machine (12 iterations) showed no impact on
    the normal NUMA cases w/ patch.

    Kernel      Avg       Stddev
    base        7.4975    3.92%
    patched     7.4174    3.26%
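
    The patched function reduces to roughly the following (a sketch; the
    512-page cap is expressed page-size-independently, per the note below):

    #define MAX_READAHEAD   ((512*4096)/PAGE_CACHE_SIZE)

    unsigned long max_sane_readahead(unsigned long nr)
    {
            return min(nr, MAX_READAHEAD);
    }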

    [Andrew: making return value PAGE_SIZE independent]
    Suggested-by: Linus Torvalds
    Signed-off-by: Raghavendra K T
    Acked-by: Jan Kara
    Cc: Wu Fengguang
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Raghavendra K T
     
  • commit 91ca9186484809c57303b33778d841cc28f696ed upstream.

    The cached pageblock hint should be ignored when triggering compaction
    through /proc/sys/vm/compact_memory so all eligible memory is isolated.
    Manually invoking compaction is known to be expensive, there's no need
    to skip pageblocks based on heuristics (mainly for debugging).

    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    David Rientjes
     
  • commit da1c67a76f7cf2b3404823d24f9f10fa91aa5dc5 upstream.

    The conditions that control the isolation mode in
    isolate_migratepages_range() do not change during the iteration, so
    extract them out and only define the value once.

    This actually does have an effect; gcc doesn't optimize it by itself
    because of cc->sync.

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    David Rientjes
     
  • commit b6c750163c0d138f5041d95fcdbd1094b6928057 upstream.

    It is just for clean-up to reduce code size and improve readability.
    There is no functional change.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Joonsoo Kim
     
  • commit c122b2087ab94192f2b937e47b563a9c4e688ece upstream.

    isolation_suitable() and migrate_async_suitable() are used to make sure
    that a pageblock range is fine to be migrated. They do not need to be
    called on every page. The current code does well if the pageblock is not
    suitable, but does not do well when it is suitable:

    1) It re-checks isolation_suitable() on each page of a pageblock that was
    already established as suitable.
    2) It re-checks migrate_async_suitable() on each page of a pageblock that
    was not entered through the next_pageblock: label, because
    last_pageblock_nr is not otherwise updated.

    This patch fixes situation by 1) calling isolation_suitable() only once
    per pageblock and 2) always updating last_pageblock_nr to the pageblock
    that was just checked.

    Additionally, move the PageBuddy() check after the pageblock unit check,
    since the pageblock check is the first thing we should do, and doing so
    makes things simpler.

    [vbabka@suse.cz: rephrase commit description]
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Joonsoo Kim
     
  • commit be1aa03b973c7dcdc576f3503f7a60429825c35d upstream.

    It is odd to drop the spinlock when we scan the (SWAP_CLUSTER_MAX - 1)th
    pfn page. This may result in the situation below while isolating
    migratepages.

    1. Try to isolate pfn pages 0x0 ~ 0x200.
    2. When low_pfn is 0x1ff, ((low_pfn+1) % SWAP_CLUSTER_MAX) == 0, so drop
    the spinlock.
    3. Then, to complete the isolation, retry to acquire the lock.

    I think that it is better to use the SWAP_CLUSTER_MAX-th pfn when
    checking the criteria for dropping the lock. This does no harm to pfn
    0x0, because at that point the 'locked' variable would be false.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Joonsoo Kim
     
  • commit 01ead5340bcf5f3a1cd2452c75516d0ef4d908d7 upstream.

    suitable_migration_target() checks whether a pageblock is suitable as a
    migration target. In isolate_freepages_block(), it is called on every
    page, which is inefficient. So make it be called once per pageblock.

    suitable_migration_target() also checks whether a page is high-order or
    not, but its criterion for high-order is the pageblock order. So calling
    it once within a pageblock range causes no problem.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Joonsoo Kim
     
  • commit 7d348b9ea64db0a315d777ce7d4b06697f946503 upstream.

    The purpose of compaction is to get a high-order page. Currently, if we
    find a high-order page while searching for a migration target page, we
    break it into order-0 pages and use them as migration targets. That is
    contrary to the purpose of compaction, so disallow high-order pages from
    being used as migration targets.

    Additionally, clean up the logic in suitable_migration_target() to
    simplify the code. There are no functional changes from this clean-up.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Joonsoo Kim
     
  • commit 119d6d59dcc0980dcd581fdadb6b2033b512a473 upstream.

    Page migration will fail for memory that is pinned in memory with, for
    example, get_user_pages(). In this case, it is unnecessary to take
    zone->lru_lock or isolate the page and pass it to page migration, which
    will ultimately fail.

    This is a racy check, the page can still change from under us, but in
    that case we'll just fail later when attempting to move the page.

    This avoids very expensive memory compaction when faulting transparent
    hugepages after pinning a lot of memory with a Mellanox driver.

    On a 128GB machine and pinning ~120GB of memory, before this patch we
    see the enormous disparity in the number of page migration failures
    because of the pinning (from /proc/vmstat):

    compact_pages_moved 8450
    compact_pagemigrate_failed 15614415

    0.05% of pages isolated are successfully migrated and explicitly
    triggering memory compaction takes 102 seconds. After the patch:

    compact_pages_moved 9197
    compact_pagemigrate_failed 7

    99.9% of pages isolated are now successfully migrated in this
    configuration and memory compaction takes less than one second.
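
    The check added to the migration scanner is roughly the following (a
    sketch; an elevated page count relative to the map count indicates
    extra pins such as get_user_pages() references):

    /*
     * Migration will fail if the page is pinned (e.g. by get_user_pages()),
     * so skip it here rather than taking zone->lru_lock and isolating it
     * only to fail later.  The check is racy: if the counts change under
     * us, migration simply fails later instead.
     */
    if (!page_mapping(page) &&
        page_count(page) > page_mapcount(page))
            continue;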

    Signed-off-by: David Rientjes
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Rik van Riel
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    David Rientjes
     
  • commit 943dca1a1fcbccb58de944669b833fd38a6c809b upstream.

    Yasuaki Ishimatsu reported that memory hot-add spent more than 5 _hours_
    on a 9TB memory machine since onlining memory sections is too slow. And
    we found out setup_zone_migrate_reserve spent >90% of the time.

    The problem is, setup_zone_migrate_reserve scans all pageblocks
    unconditionally, but it is only necessary if the number of reserved
    block was reduced (i.e. memory hot remove).

    Moreover, maximum MIGRATE_RESERVE per zone is currently 2. It means
    that the number of reserved pageblocks is almost always unchanged.

    This patch adds zone->nr_migrate_reserve_block to maintain the number of
    MIGRATE_RESERVE pageblocks and it reduces the overhead of
    setup_zone_migrate_reserve dramatically. The following table shows time
    of onlining a memory section.

    Amount of memory     | 128GB | 192GB | 256GB |
    ---------------------+-------+-------+-------+
    linux-3.12           |  23.9 |  31.4 |  44.5 |
    This patch           |   8.3 |   8.3 |   8.6 |
    Mel's proposal patch |  10.9 |  19.2 |  31.3 |
    ---------------------+-------+-------+-------+
    (millisecond)

    128GB : 4 nodes and each node has 32GB of memory
    192GB : 6 nodes and each node has 32GB of memory
    256GB : 8 nodes and each node has 32GB of memory

    (*1) Mel proposed his idea in the following thread.
    https://lkml.org/lkml/2013/10/30/272

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Yasuaki Ishimatsu
    Reported-by: Yasuaki Ishimatsu
    Tested-by: Yasuaki Ishimatsu
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Yasuaki Ishimatsu
     
  • commit ec97097bca147d5718a5d2c024d1ec740b10096d upstream.

    If a shrinker is not NUMA-aware, shrink_slab() should call it exactly
    once with nid=0, but currently it is not true: if node 0 is not set in
    the nodemask or if it is not online, we will not call such shrinkers at
    all. As a result some slabs will be left untouched under some
    circumstances. Let us fix it.

    Signed-off-by: Vladimir Davydov
    Reported-by: Dave Chinner
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Vladimir Davydov
     
  • commit 0b1fb40a3b1291f2f12f13f644ac95cf756a00e6 upstream.

    When reclaiming kmem, we currently don't scan slabs that have less than
    batch_size objects (see shrink_slab_node()):

    while (total_scan >= batch_size) {
            shrinkctl->nr_to_scan = batch_size;
            shrinker->scan_objects(shrinker, shrinkctl);
            total_scan -= batch_size;
    }

    If there are only a few shrinkers available, such a behavior won't cause
    any problems, because the batch_size is usually small, but if we have a
    lot of slab shrinkers, which is perfectly possible since FS shrinkers
    are now per-superblock, we can end up with hundreds of megabytes of
    practically unreclaimable kmem objects. For instance, mounting a
    thousand ext2 FS images with a hundred files in each and iterating over
    all the files using du(1) will result in about 200 MB of FS caches that
    cannot be dropped even with the aid of the vm.drop_caches sysctl!

    This problem was initially pointed out by Glauber Costa [*]. Glauber
    proposed to fix it by making the shrink_slab() always take at least one
    pass, to put it simply, turning the scan loop above to a do{}while()
    loop. However, this proposal was rejected, because it could result in
    more aggressive and frequent slab shrinking even under low memory
    pressure when total_scan is naturally very small.

    This patch is a slightly modified version of Glauber's approach.
    Similarly to Glauber's patch, it makes shrink_slab() scan less than
    batch_size objects, but only if the total number of objects we want to
    scan (total_scan) is greater than the total number of objects available
    (max_pass). Since total_scan is biased as half max_pass if the current
    delta change is small:

    if (delta < max_pass / 4)
            total_scan = min(total_scan, max_pass / 2);

    this is only possible if we are scanning at high prio. That said, this
    patch shouldn't change the vmscan behaviour if the memory pressure is
    low, but if we are tight on memory, we will do our best by trying to
    reclaim all available objects, which sounds reasonable.

    [*] http://www.spinics.net/lists/cgroups/msg06913.html
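
    With the patch, the loop head is relaxed along these lines (a sketch;
    the final batch is capped at the remaining total_scan):

    while (total_scan >= batch_size ||
           total_scan >= max_pass) {
            unsigned long nr_to_scan = min(batch_size, total_scan);

            shrinkctl->nr_to_scan = nr_to_scan;
            shrinker->scan_objects(shrinker, shrinkctl);
            total_scan -= nr_to_scan;
    }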

    Signed-off-by: Vladimir Davydov
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Dave Chinner
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Vladimir Davydov
     
  • commit 579f82901f6f41256642936d7e632f3979ad76d4 upstream.

    This is a patch to improve the swap readahead algorithm. It's from Hugh
    and I slightly changed it.

    Hugh's original changelog:

    swapin readahead does a blind readahead, whether or not the swapin is
    sequential. This may be ok on harddisk, because large reads have
    relatively small costs, and if the readahead pages are unneeded they can
    be reclaimed easily - though, what if their allocation forced reclaim of
    useful pages? But on SSD devices large reads are more expensive than
    small ones: if the readahead pages are unneeded, reading them in caused
    significant overhead.

    This patch adds very simplistic random read detection. Stealing the
    PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the
    vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages()
    simply looks at readahead's current success rate, and narrows or widens
    its readahead window accordingly. There is little science to its
    heuristic: it's about as stupid as can be whilst remaining effective.

    The table below shows elapsed times (in centiseconds) when running a
    single repetitive swapping load across a 1000MB mapping in 900MB ram
    with 1GB swap (the harddisk tests had taken painfully too long when I
    used mem=500M, but SSD shows similar results for that).

    Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his
    Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch
    which Shaohua showed to be defective; HughNew this Nov 14 patch, with
    page_cluster as usual at default of 3 (8-page reads); HughPC4 this same
    patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0
    (1-page reads: no readahead).

    HDD for swapping to harddisk, SSD for swapping to VertexII SSD. Seq for
    sequential access to the mapping, cycling five times around; Rand for
    the same number of random touches. Anon for a MAP_PRIVATE anon mapping;
    Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.

    One weakness of Shaohua's vma/anon_vma approach was that it did not
    optimize Shmem: seen below. Konstantin's approach was perhaps mistuned,
    50% slower on Seq: did not compete and is not shown below.

    HDD            Vanilla   Shaohua   HughOld   HughNew   HughPC4   HughPC0
    Seq Anon         73921     76210     75611     76904     78191    121542
    Seq Shmem        73601     73176     73855     72947     74543    118322
    Rand Anon       895392    831243    871569    845197    846496    841680
    Rand Shmem     1058375   1053486    827935    764955    764376    756489

    SSD            Vanilla   Shaohua   HughOld   HughNew   HughPC4   HughPC0
    Seq Anon         24634     24198     24673     25107     21614     70018
    Seq Shmem        24959     24932     25052     25703     22030     69678
    Rand Anon        43014     26146     28075     25989     26935     25901
    Rand Shmem       45349     45215     28249     24268     24138     24332

    These tests are, of course, two extremes of a very simple case: under
    heavier mixed loads I've not yet observed any consistent improvement or
    degradation, and wider testing would be welcome.

    Shaohua Li:

    Test shows Vanilla is slightly better in the sequential workload than
    Hugh's patch. I observed that with Hugh's patch the readahead size is
    sometimes shrunk too fast (from 8 to 1 immediately) in the sequential
    workload if there is no hit. And in such a case, continuing to do
    readahead is actually good.

    I don't prepare a sophisticated algorithm for the sequential workload
    because so far we can't guarantee sequential accessed pages are swap out
    sequentially. So I slightly change Hugh's heuristic - don't shrink
    readahead size too fast.

    Here is my test result (unit second, 3 runs average):
                Vanilla     Hugh      New
    Seq             356      370      360
    Random         4525     2447     2444

    Attached graph is the swapin/swapout throughput I collected with 'vmstat
    2'. The first part is running a random workload (till around 1200 of
    the x-axis) and the second part is running a sequential workload.
    swapin and swapout throughput are almost identical in steady state in
    both workloads. This is the expected behavior, while in Vanilla swapin
    is much bigger than swapout, especially in the random workload (because
    of wrong readahead).

    Original patches by: Shaohua Li and Konstantin Khlebnikov.
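
    The heuristic boils down to something like the following model (a
    simplified user-space sketch of the behaviour described above, not the
    actual swapin_nr_pages(); the names and constants are illustrative):

    /* readahead window is bounded by 2^page_cluster, as usual */
    static unsigned int swapin_window(unsigned int *hits,
                                      unsigned int last_window,
                                      unsigned int page_cluster)
    {
            unsigned int max_pages = 1u << page_cluster;
            unsigned int pages;

            if (max_pages <= 1)
                    return 1;

            /* widen with recent readahead hits, rounded up to a power of two */
            pages = *hits + 2;
            *hits = 0;
            if (pages == 2)
                    pages = 1;              /* no hits: fall back to one page */
            else
                    while (pages & (pages - 1))
                            pages++;        /* crude round-up to 2^n */

            if (pages > max_pages)
                    pages = max_pages;

            /* Shaohua's tweak: don't shrink the window too fast */
            if (pages < last_window / 2)
                    pages = last_window / 2;

            return pages;
    }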

    [fengguang.wu@intel.com: swapin_nr_pages() can be static]
    Signed-off-by: Hugh Dickins
    Signed-off-by: Shaohua Li
    Signed-off-by: Fengguang Wu
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Shaohua Li
     
  • commit 55b7c4c99f6a448f72179297fe6432544f220063 upstream.

    Compaction used to start its migrate and free page scanners at the zone's
    lowest and highest pfn, respectively. Later, caching was introduced to
    remember the scanners' progress across compaction attempts so that
    pageblocks are not re-scanned uselessly. Additionally, pageblocks where
    isolation failed are marked to be quickly skipped when encountered again
    in future compactions.

    Currently, both the reset of cached pfn's and clearing of the pageblock
    skip information for a zone is done in __reset_isolation_suitable().
    This function gets called when:

    - compaction is restarting after being deferred
    - compact_blockskip_flush flag is set in compact_finished() when the scanners
    meet (and not again cleared when direct compaction succeeds in allocation)
    and kswapd acts upon this flag before going to sleep

    This behavior is suboptimal for several reasons:

    - when direct sync compaction is called after async compaction fails (in the
    allocation slowpath), it will effectively do nothing, unless kswapd
    happens to process the compact_blockskip_flush flag meanwhile. This is racy
    and goes against the purpose of sync compaction to more thoroughly retry
    the compaction of a zone where async compaction has failed.
    The restart-after-deferring path cannot help here as deferring happens only
    after the sync compaction fails. It is also done only for the preferred
    zone, while the compaction might be done for a fallback zone.

    - the mechanism of marking pageblock to be skipped has little value since the
    cached pfn's are reset only together with the pageblock skip flags. This
    effectively limits pageblock skip usage to parallel compactions.

    This patch changes compact_finished() so that cached pfn's are reset
    immediately when the scanners meet. Clearing pageblock skip flags is
    unchanged, as are the other situations where cached pfn's are reset.
    This allows the sync-after-async compaction to retry pageblocks not
    marked as skipped, such as !MIGRATE_MOVABLE pageblocks that async
    compaction now skips without marking them.

    Signed-off-by: Vlastimil Babka
    Cc: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Vlastimil Babka
     
  • commit 50b5b094e683f8e51e82c6dfe97b1608cf97e6c0 upstream.

    Compaction temporarily marks pageblocks where it fails to isolate pages
    as to-be-skipped in further compactions, in order to improve efficiency.
    One of the reasons to fail isolating pages is that isolation is not
    attempted in pageblocks that are not of MIGRATE_MOVABLE (or CMA) type.

    The problem is that blocks skipped due to not being MIGRATE_MOVABLE in
    async compaction become skipped due to the temporary mark also in future
    sync compaction. Moreover, this may follow quite soon during
    __alloc_page_slowpath, without much time for kswapd to clear the
    pageblock skip marks. This goes against the idea that sync compaction
    should try to scan these blocks more thoroughly than the async
    compaction.

    The fix is to ensure in async compaction that these !MIGRATE_MOVABLE
    blocks are not marked to be skipped. Note this should not affect
    performance or locking impact of further async compactions, as skipping
    a block due to being !MIGRATE_MOVABLE is done soon after skipping a
    block marked to be skipped, both without locking.

    Signed-off-by: Vlastimil Babka
    Cc: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Vlastimil Babka
     
  • commit de6c60a6c115acaa721cfd499e028a413d1fcbf3 upstream.

    Currently there are several functions to manipulate the deferred
    compaction state variables. The remaining case where the variables are
    touched directly is when a successful allocation occurs in direct
    compaction, or is expected to be successful in the future by kswapd.
    Here, the lowest order that is expected to fail is updated, and in the
    case of successful allocation, the deferred status and counter are reset
    completely.

    Create a new function compaction_defer_reset() to encapsulate this
    functionality and make it easier to understand the code. No functional
    change.
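
    The new helper is roughly the following (a sketch; the zone field names
    are those used by the existing deferral logic):

    static inline void compaction_defer_reset(struct zone *zone, int order,
                    bool alloc_success)
    {
            if (alloc_success) {
                    zone->compact_considered = 0;
                    zone->compact_defer_shift = 0;
            }
            if (order >= zone->compact_order_failed)
                    zone->compact_order_failed = order + 1;
    }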

    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Vlastimil Babka