24 Feb, 2013

5 commits

  • Add two helpers, zone_end_pfn() and zone_spans_pfn(), to reduce code
    duplication (both are sketched after this entry).

    This also switches to using them in compaction (where an additional
    variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and
    kmemleak.

    Note that in compaction.c I avoid calling zone_end_pfn() repeatedly
    because I expect that at some point the synchronization issues with
    start_pfn & spanned_pages will need fixing, either by actually using
    the seqlock or by clever memory barrier usage.

    Signed-off-by: Cody P Schafer
    Cc: David Hansen
    Cc: Catalin Marinas
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
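
    A rough sketch of the two helpers as described above (close to what
    landed in include/linux/mmzone.h, though the details should be treated
    as approximate):

        static inline unsigned long zone_end_pfn(const struct zone *zone)
        {
                /* one past the last PFN spanned by the zone */
                return zone->zone_start_pfn + zone->spanned_pages;
        }

        static inline bool zone_spans_pfn(const struct zone *zone,
                                          unsigned long pfn)
        {
                return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
        }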
     
  • No functional change, but the only purpose of the offlining argument to
    migrate_pages() etc. was to ensure that __unmap_and_move() could migrate a
    KSM page for memory hotremove (which took ksm_thread_mutex) but not for
    other callers. Now that all cases are safe, remove the arg.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Several functions test MIGRATE_ISOLATE, and some of them sit on hot
    paths, but MIGRATE_ISOLATE is only used when CONFIG_MEMORY_ISOLATION is
    enabled (i.e. by CMA, memory hotplug and memory failure), which is not a
    common configuration. So let's not add unnecessary overhead and code when
    CONFIG_MEMORY_ISOLATION is disabled (the stub pattern is sketched after
    this entry).

    Signed-off-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
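
    The usual way to keep such checks free when the feature is compiled out
    is a pair of config-dependent helpers; a sketch of the pattern (close to
    what include/linux/page-isolation.h ended up providing, but treat the
    details as approximate):

        #ifdef CONFIG_MEMORY_ISOLATION
        static inline bool is_migrate_isolate(int migratetype)
        {
                return migratetype == MIGRATE_ISOLATE;
        }
        #else
        /* constant false lets the compiler drop the branch on hot paths */
        static inline bool is_migrate_isolate(int migratetype)
        {
                return false;
        }
        #endif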
     
  • These functions always return 0. Formalise this.

    Cc: Jason Liu
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Compaction uses the ALIGN macro incorrectly with the migrate scanner by
    adding pageblock_nr_pages to a PFN. It happened to work when initially
    implemented as the starting PFN was also aligned but with caching
    restarts and isolating in smaller chunks this is no longer always true.

    The impact is that the migrate scanner scans outside its current
    pageblock. As pfn_valid() is still checked properly it does not cause
    any failure, and the impact of the bug is that in some cases it will
    scan more than necessary when it crosses a pageblock boundary, but by no
    more than COMPACT_CLUSTER_MAX. It is highly unlikely this is even
    measurable, but it's still wrong so this patch addresses the problem
    (a worked example of the overshoot follows this entry).

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
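
    A worked example of the overshoot (an illustrative userspace demo, not
    the kernel patch itself; pageblock_nr_pages is assumed to be 512, and
    rounding up from pfn + 1 is shown as one way to bound the scan):

        #include <stdio.h>

        #define ALIGN(x, a)        (((x) + (a) - 1) & ~((a) - 1))
        #define pageblock_nr_pages 512UL

        int main(void)
        {
                /* mid-pageblock start, e.g. after a cached restart */
                unsigned long low_pfn = 1000;

                /* buggy: add a whole pageblock, then round up -> overshoots */
                unsigned long end_buggy =
                        ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
                /* fixed: round up to the end of the current pageblock only */
                unsigned long end_fixed =
                        ALIGN(low_pfn + 1, pageblock_nr_pages);

                /* prints "buggy end: 1536, fixed end: 1024" -- the buggy
                 * bound lets the scanner run into the next pageblock */
                printf("buggy end: %lu, fixed end: %lu\n",
                       end_buggy, end_fixed);
                return 0;
        }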
     

12 Jan, 2013

2 commits

  • Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
    waiting for POLLIN on a local TCP socket. It was easier to trigger if
    there was disk IO and dirty pages at the same time and he bisected it to
    commit 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page
    immediately when it is made available").

    The intention of that patch was to improve high-order allocations under
    memory pressure after changes made to reclaim in 3.6 drastically hurt
    THP allocations, but the approach was flawed. For Eric, the problem was
    that page->pfmemalloc was not being cleared for captured pages, leading
    to a poor interaction with swap-over-NFS support and causing packets to
    be dropped. However, I identified a few more problems with the patch
    including the fact that it can increase contention on zone->lock in some
    cases which could result in async direct compaction being aborted early.

    In retrospect the capture patch took the wrong approach. What it should
    have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
    was allocating for THP and avoided races that way. While the patch was
    shown to improve allocation success rates at the time, the benefit is
    marginal given the relative complexity, and it should be revisited from
    scratch in the context of the other reclaim-related changes that have
    taken place since the patch was first written and tested. This patch
    partially reverts commit 1fb3f8ca0e92 ("mm: compaction: capture a
    suitable high-order page immediately when it is made available").

    Reported-and-tested-by: Eric Wong
    Tested-by: Eric Dumazet
    Cc:
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When running the following command under a shell, it returns an error:

    sh/$ echo 1 > /proc/sys/vm/compact_memory
    sh/$ sh: write error: Bad address

    After strace, I found the following log:

    ...
    write(1, "1\n", 2) = 3
    write(1, "", 4294967295) = -1 EFAULT (Bad address)
    write(2, "echo: write error: Bad address\n", 31echo: write error: Bad address
    ) = 31

    This shows that the kernel returned 3 (COMPACT_COMPLETE) from the write
    to compact_memory instead of the number of bytes written.

    The fix is to make sysctl_compaction_handler return 0 instead of
    3 (COMPACT_COMPLETE) after compact_nodes() has finished (a sketch of the
    fixed handler follows this entry).

    Signed-off-by: Jason Liu
    Suggested-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Liu
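
    A rough sketch of the fixed handler shape (approximate, not the verbatim
    kernel code): the compaction result is discarded so that the write()
    issued by the shell succeeds.

        int sysctl_compaction_handler(struct ctl_table *table, int write,
                                      void __user *buffer, size_t *length,
                                      loff_t *ppos)
        {
                if (write)
                        compact_nodes();   /* trigger compaction on all nodes */

                /* never propagate COMPACT_* result codes to the sysctl layer */
                return 0;
        }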
     

21 Dec, 2012

1 commit

  • isolate_freepages_block() and isolate_migratepages_range() are used for
    CMA as well as compaction, so the build breaks for CONFIG_CMA &&
    !CONFIG_COMPACTION.

    This patch fixes it (the no-op stub pattern behind the akpm note below
    is sketched after this entry).

    [akpm@linux-foundation.org: add "do { } while (0)", per Mel]
    Signed-off-by: Minchan Kim
    Cc: Mel Gorman
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
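
    For reference, the "do { } while (0)" idiom mentioned in the akpm note
    is the usual shape of a no-op stub when a feature is compiled out; the
    symbol name below is purely illustrative, not taken from the patch:

        #ifdef CONFIG_COMPACTION
        static inline void compaction_only_hook(void)
        {
                /* real work when compaction is built in */
        }
        #else
        /* "do { } while (0)" keeps the stub usable as a single statement,
         * e.g. as the body of an if/else, when compaction is compiled out */
        #define compaction_only_hook() do { } while (0)
        #endif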
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for the workloads driven by perf, which is bad
    but potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

13 Dec, 2012

1 commit

  • compact_capture_page() is only used if compaction is enabled so it should
    be moved into the corresponding #ifdef.

    Signed-off-by: Thierry Reding
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thierry Reding
     

12 Dec, 2012

2 commits

  • The PATCH "mm: introduce compaction and migration for virtio ballooned pages"
    hacks around putback_lru_pages() in order to allow ballooned pages to be
    re-inserted on balloon page list as if a ballooned page was like a LRU page.

    As ballooned pages are not legitimate LRU pages, this patch introduces
    putback_movable_pages() to properly cope with cases where the isolated
    pageset contains ballooned pages and LRU pages, thus fixing the mentioned
    inelegant hack around putback_lru_pages().

    Signed-off-by: Rafael Aquini
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
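
    A sketch of the shape described above (approximate; the balloon helper
    names follow the description rather than any particular kernel version):

        void putback_movable_pages(struct list_head *l)
        {
                struct page *page;
                struct page *page2;

                list_for_each_entry_safe(page, page2, l, lru) {
                        list_del(&page->lru);
                        dec_zone_page_state(page, NR_ISOLATED_ANON +
                                            page_is_file_cache(page));

                        /* balloon pages are not LRU pages: hand them back
                         * to the balloon instead of an LRU list */
                        if (unlikely(balloon_page_movable(page)))
                                balloon_page_putback(page);
                        else
                                putback_lru_page(page);
                }
        }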
     
  • Memory fragmentation introduced by ballooning might reduce significantly
    the number of 2MB contiguous memory blocks that can be used within a guest,
    thus imposing performance penalties associated with the reduced number of
    transparent huge pages that could be used by the guest workload.

    This patch introduces the helper functions as well as the necessary changes
    to teach compaction and migration bits how to cope with pages which are
    part of a guest memory balloon, in order to make them movable by memory
    compaction procedures.

    Signed-off-by: Rafael Aquini
    Acked-by: Mel Gorman
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     

11 Dec, 2012

3 commits

  • Compaction already has tracepoints to count scanned and isolated pages
    but it requires that ftrace be enabled and if that information has to be
    written to disk then it can be disruptive. This patch adds vmstat counters
    for compaction called compact_migrate_scanned, compact_free_scanned and
    compact_isolated.

    With these counters, it is possible to define a basic cost model for
    compaction. This approximates how much work compaction is doing and can
    be compared with an oprofile showing TLB misses to see whether the cost
    of compaction is being offset by THP, for example. Minimally, a
    compaction patch can be evaluated in terms of whether it increases or
    decreases cost. The basic cost model looks like this (a worked
    computation follows this entry):

    Fundamental unit u: a word sizeof(void *)

    Ca = cost of struct page access = sizeof(struct page) / u

    Cmc = Cost migrate page copy = (Ca + PAGE_SIZE/u) * 2
    Cmf = Cost migrate failure = Ca * 2
    Ci = Cost page isolation = (Ca + Wi)
    where Wi is a constant that should reflect the approximate
    cost of the locking operation.

    Csm = Cost migrate scanning = Ca
    Csf = Cost free scanning = Ca

    Overall cost = (Csm * compact_migrate_scanned) +
    (Csf * compact_free_scanned) +
    (Ci * compact_isolated) +
    (Cmc * pgmigrate_success) +
    (Cmf * pgmigrate_failed)

    Where the values are read from /proc/vmstat.

    This is very basic and ignores certain costs such as the allocation cost
    to do a migrate page copy but any improvement to the model would still
    use the same vmstat counters.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Mel Gorman
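
    A worked computation of the model above (an illustrative userspace demo;
    sizeof(struct page) is assumed to be 64 bytes on a 64-bit system, Wi is
    arbitrarily set to 1, and the vmstat readings are made-up numbers):

        #include <stdio.h>

        int main(void)
        {
                const double u   = 8.0;                   /* word size */
                const double Ca  = 64.0 / u;              /* struct page access */
                const double Cmc = (Ca + 4096.0 / u) * 2; /* migrate page copy */
                const double Cmf = Ca * 2;                /* migrate failure */
                const double Wi  = 1.0;                   /* assumed lock cost */
                const double Ci  = Ca + Wi;               /* page isolation */
                const double Csm = Ca, Csf = Ca;          /* scanning costs */

                /* hypothetical /proc/vmstat readings */
                double migrate_scanned = 100000, free_scanned = 400000;
                double isolated = 20000, migrated = 18000, failed = 2000;

                double cost = Csm * migrate_scanned + Csf * free_scanned +
                              Ci * isolated + Cmc * migrated + Cmf * failed;

                printf("approximate compaction cost: %.0f units\n", cost);
                return 0;
        }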
     
  • The pgmigrate_success and pgmigrate_fail vmstat counters tell the user
    about migration activity but not the type or the reason. This patch adds
    a tracepoint to identify the type of page migration and why the page is
    being migrated.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Mel Gorman
     
  • The compact_pages_moved and compact_pagemigrate_failed events are
    convenient for determining if compaction is active and to what
    degree migration is succeeding but it's at the wrong level. Other
    users of migration may also want to know if migration is working
    properly and this will be particularly true for any automated
    NUMA migration. This patch moves the counters down to migration
    with the new events called pgmigrate_success and pgmigrate_fail.
    The compact_blocks_moved counter is removed because while it was
    useful for debugging initially, it's worthless now as no meaningful
    conclusions can be drawn from its value.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Mel Gorman
     

07 Dec, 2012

1 commit

  • Commit 0bf380bc70ec ("mm: compaction: check pfn_valid when entering a
    new MAX_ORDER_NR_PAGES block during isolation for migration") added a
    check for pfn_valid() when isolating pages for migration as the scanner
    does not necessarily start pageblock-aligned.

    Since commit c89511ab2f8f ("mm: compaction: Restart compaction from near
    where it left off"), the free scanner has the same problem. This patch
    makes sure that the pfn range passed to isolate_freepages_block() is
    within the same block so that pfn_valid() checks are unnecessary.

    In answer to Henrik's wondering why others have not reported this:
    reproducing this requires a large enough hole with the right alignment
    to have compaction walk into a PFN range with no memmap. Size and
    alignment depend on the memory model - 4M for FLATMEM and 128M for
    SPARSEMEM on x86. It needs a "lucky" machine.

    Reported-by: Henrik Rydberg
    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

20 Oct, 2012

1 commit

  • Thierry reported that the "iron out" patch for isolate_freepages_block()
    had problems because the check added by "mm: compaction: Iron out
    isolate_freepages_block() and isolate_freepages_range() -fix1" was too
    strict. It's possible that more pages than necessary are isolated but
    the check still fails, and I missed that this fix was not picked up
    before RC1. The same problem has been identified in 3.7-RC1 by Tony
    Prisk and should be addressed by the following patch.

    Signed-off-by: Mel Gorman
    Tested-by: Tony Prisk
    Reported-by: Thierry Reding
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

09 Oct, 2012

13 commits

  • Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
    contiguous memory space.

    This patch makes mlocked pages be migrated out. Of course, it can affect
    realtime processes but in CMA usecase, contiguous memory allocation failing
    is far worse than access latency to an mlocked page being variable while
    CMA is running. If someone wants to make the system realtime, he shouldn't
    enable CMA because stalls can still happen at random times.

    [akpm@linux-foundation.org: tweak comment text, per Mel]
    Signed-off-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Compaction caches if a pageblock was scanned and no pages were isolated so
    that the pageblocks can be skipped in the future to reduce scanning. This
    information is not cleared by the page allocator based on activity due to
    the impact it would have to the page allocator fast paths. Hence there is
    a requirement that something clear the cache or pageblocks will be skipped
    forever. Currently the cache is cleared if there were a number of recent
    allocation failures and it has not been cleared within the last 5 seconds.
    Time-based decisions like this are terrible as they have no relationship
    to VM activity and are basically a big hammer.

    Unfortunately, accurate heuristics would add cost to some hot paths so
    this patch implements a rough heuristic. There are two cases where the
    cache is cleared.

    1. If a !kswapd process completes a compaction cycle (migrate and free
    scanner meet), the zone is marked compact_blockskip_flush. When kswapd
    goes to sleep, it will clear the cache. This is expected to be the
    common case where the cache is cleared. It does not really matter if
    kswapd happens to be asleep or going to sleep when the flag is set as
    it will be woken on the next allocation request.

    2. If there have been multiple failures recently and compaction just
    finished being deferred then a process will clear the cache and start a
    full scan. This situation happens if there are multiple high-order
    allocation requests under heavy memory pressure.

    The clearing of the PG_migrate_skip bits and other scans is inherently
    racy but the race is harmless. For allocations that can fail such as THP,
    they will simply fail. For requests that cannot fail, they will retry the
    allocation. Tests indicated that scanning rates were roughly similar to
    when the time-based heuristic was used and the allocation success rates
    were similar.

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This is almost entirely based on Rik's previous patches and discussions
    with him about how this might be implemented.

    Order > 0 compaction stops when enough free pages of the correct page
    order have been coalesced. When doing subsequent higher order
    allocations, it is possible for compaction to be invoked many times.

    However, the compaction code always starts out looking for things to
    compact at the start of the zone, and for free pages to compact things to
    at the end of the zone.

    This can cause quadratic behaviour, with isolate_freepages starting at the
    end of the zone each time, even though previous invocations of the
    compaction code already filled up all free memory on that end of the zone.
    This can cause isolate_freepages to take enormous amounts of CPU with
    certain workloads on larger memory systems.

    This patch caches where the migration and free scanner should start from
    on subsequent compaction invocations using the pageblock-skip information.
    When compaction starts it begins from the cached restart points and will
    update the cached restart points until a page is isolated or a pageblock
    is skipped that would have been scanned by synchronous compaction.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When compaction was implemented it was known that scanning could
    potentially be excessive. The ideal was that a counter be maintained for
    each pageblock but maintaining this information would incur a severe
    penalty due to a shared writable cache line. It has reached the point
    where the scanning costs are a serious problem, particularly on
    long-lived systems where a large process starts and allocates a large
    number of THPs at the same time.

    Instead of using a shared counter, this patch adds another bit to the
    pageblock flags called PG_migrate_skip. If a pageblock is scanned by
    either migrate or free scanner and 0 pages were isolated, the pageblock is
    marked to be skipped in the future. When scanning, this bit is checked
    before any scanning takes place and the block is skipped if set (the
    helpers involved are sketched after this entry).

    The main difficulty with a patch like this is "when to ignore the cached
    information?" If it's ignored too often, the scanning rates will still be
    excessive. If the information is too stale then allocations will fail
    that might have otherwise succeeded. In this patch

    o CMA always ignores the information
    o If the migrate and free scanner meet then the cached information will
    be discarded if it's at least 5 seconds since the last time the cache
    was discarded
    o If there are a large number of allocation failures, discard the cache.

    The time-based heuristic is very clumsy but there are few choices for a
    better event. Depending solely on multiple allocation failures still
    allows excessive scanning when THP allocations are failing in quick
    succession due to memory pressure. Waiting until memory pressure is
    relieved would cause compaction to continually fail instead of using
    reclaim/compaction to try to allocate the page. The time-based mechanism
    is clumsy but a better option is not obvious.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Cc: Fengguang Wu
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Kyungmin Park
    Cc: Mark Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
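
    A rough sketch of the skip-bit usage described above (the names follow
    the description; treat the details as approximate rather than verbatim
    kernel code):

        /* honour the cached hint unless the caller (e.g. CMA) wants a
         * full scan regardless */
        static inline bool isolation_suitable(struct compact_control *cc,
                                              struct page *page)
        {
                if (cc->ignore_skip_hint)
                        return true;

                return !get_pageblock_skip(page);
        }

        /* mark a pageblock as not worth rescanning if nothing was
         * isolated from it */
        static void update_pageblock_skip(struct page *page,
                                          unsigned long nr_isolated)
        {
                if (!nr_isolated)
                        set_pageblock_skip(page);
        }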
     
  • This reverts commit 7db8889ab05b ("mm: have order > 0 compaction start
    off where it left") and commit de74f1cc ("mm: have order > 0 compaction
    start near a pageblock with free pages"). These patches were a good
    idea and tests confirmed that they massively reduced the amount of
    scanning but the implementation is complex and tricky to understand. A
    later patch will cache what pageblocks should be skipped and
    reimplements the concept of compact_cached_free_pfn on top for both
    migration and free scanners.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Compaction's free scanner acquires the zone->lock when checking for
    PageBuddy pages and isolating them. It does this even if there are no
    PageBuddy pages in the range.

    This patch defers acquiring the zone lock for as long as possible. If
    there are no free pages in the pageblock, the lock is not acquired at
    all, which reduces contention on zone->lock (the deferred-locking
    pattern is sketched after this entry).

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Acked-by: Minchan Kim
    Tested-by: Peter Ujfalusi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
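
    A sketch of the deferred acquisition in the free scanner (a fragment
    from the middle of the scanning loop, approximate shape only;
    compact_checklock_irqsave() is the contention-aware helper introduced
    earlier in this area):

        for (; blockpfn < end_pfn; blockpfn++, cursor++) {
                struct page *page = cursor;

                if (!pfn_valid_within(blockpfn))
                        continue;

                /* cheap check done without the lock: skip non-buddy pages */
                if (!PageBuddy(page))
                        continue;

                /* only now take zone->lock; async compaction aborts if the
                 * lock is contended or rescheduling is needed */
                locked = compact_checklock_irqsave(&cc->zone->lock, &flags,
                                                   locked, cc);
                if (!locked)
                        break;

                /* re-check under the lock; the page may have been allocated */
                if (!PageBuddy(page))
                        continue;

                /* ... isolate the free page ... */
        }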
     
  • Richard Davies and Shaohua Li have both reported lock contention problems
    in compaction on the zone and LRU locks as well as significant amounts of
    time being spent in compaction. This series aims to reduce lock
    contention and scanning rates to reduce that CPU usage. Richard reported
    at https://lkml.org/lkml/2012/9/21/91 that this series made a big
    difference to a problem he reported in August:

    http://marc.info/?l=kvm&m=134511507015614&w=2

    Patch 1 defers acquiring the zone->lru_lock as long as possible.

    Patch 2 defers acquiring the zone->lock as long as possible.

    Patch 3 reverts Rik's "skip-free" patches as the core concept gets
    reimplemented later and the remaining patches are easier to
    understand if this is reverted first.

    Patch 4 adds a pageblock-skip bit to the pageblock flags to cache what
    pageblocks should be skipped by the migrate and free scanners.
    This drastically reduces the amount of scanning compaction has
    to do.

    Patch 5 reimplements something similar to Rik's idea except it uses the
    pageblock-skip information to decide where the scanners should
    restart from and does not need to wrap around.

    I tested this on 3.6-rc6 + linux-next/akpm. Kernels tested were

    akpm-20120920 3.6-rc6 + linux-next/akpm as of September 20th, 2012
    lesslock Patches 1-6
    revert Patches 1-7
    cachefail Patches 1-8
    skipuseless Patches 1-9

    Stress high-order allocation tests looked ok. Success rates are more or
    less the same with the full series applied but there is an expectation
    that there is less opportunity to race with other allocation requests if
    there is less scanning. The time to complete the tests did not vary that
    much and are uninteresting as were the vmstat statistics so I will not
    present them here.

    Using ftrace I recorded how much scanning was done by compaction and got this

                                     3.6.0-rc6     3.6.0-rc6     3.6.0-rc6     3.6.0-rc6     3.6.0-rc6
                                 akpm-20120920      lockless   revert-v2r2     cachefail   skipuseless
    Total free scanned               360753976     515414028     565479007      17103281      18916589
    Total free isolated                2852429       3597369       4048601        670493        727840
    Total free efficiency              0.0079%       0.0070%       0.0072%       0.0392%       0.0385%
    Total migrate scanned            247728664     822729112    1004645830      17946827      14118903
    Total migrate isolated             2555324       3245937       3437501        616359        658616
    Total migrate efficiency           0.0103%       0.0039%       0.0034%       0.0343%       0.0466%

    The efficiency is worthless because of the nature of the test and the
    number of failures. The really interesting point as far as this patch
    series is concerned is the number of pages scanned. Note that reverting
    Rik's patches massively increases the number of pages scanned indicating
    that those patches really did make a difference to CPU usage.

    However, caching what pageblocks should be skipped has a much higher
    impact. With patches 1-8 applied, free page and migrate page scanning are
    both reduced by 95% in comparison to the akpm kernel. If the basic
    concept of Rik's patches is implemented on top, the free scanner barely
    changed but migrate scanning was further reduced. That
    said, tests on 3.6-rc5 indicated that the last patch had greater impact
    than what was measured here so it is a bit variable.

    One way or the other, this series has a large impact on the amount of
    scanning compaction does when there is a storm of THP allocations.

    This patch:

    Compaction's migrate scanner acquires the zone->lru_lock when scanning a
    range of pages looking for LRU pages to acquire. It does this even if
    there are no LRU pages in the range. If multiple processes are compacting
    then this can cause severe locking contention. To make matters worse
    commit b2eef8c0 ("mm: compaction: minimise the time IRQs are disabled
    while isolating pages for migration") releases the lru_lock every
    SWAP_CLUSTER_MAX pages that are scanned.

    This patch makes two changes to how the migrate scanner acquires the LRU
    lock. First, it only releases the LRU lock every SWAP_CLUSTER_MAX pages
    if the lock is contended. This reduces the number of times it
    unnecessarily disables and re-enables IRQs. The second is that it defers
    acquiring the LRU lock for as long as possible. If there are no LRU pages
    or the only LRU pages are transhuge then the LRU lock will not be acquired
    at all which reduces contention on zone->lru_lock.

    [minchan@kernel.org: augment comment]
    [akpm@linux-foundation.org: tweak comment text]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Parameters were added without documentation, tut tut.

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit c67fe3752abe ("mm: compaction: Abort async compaction if locks
    are contended or taking too long") addressed a lock contention problem
    in compaction by introducing compact_checklock_irqsave(), which
    effectively aborts async compaction in the event of lock contention.

    To preserve existing behaviour it also moved a fatal_signal_pending()
    check into compact_checklock_irqsave() but that is very misleading. It
    "hides" the check within a locking function but has nothing to do with
    locking as such. It just happens to work in a desirable fashion.

    This patch moves the fatal_signal_pending() check to
    isolate_migratepages_range() where it belongs. Arguably the same check
    should also happen when isolating pages for freeing but it's overkill.

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Shaohua Li
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • isolate_migratepages_range() might isolate no pages, for example when
    zone->lru_lock is contended and asynchronous compaction is running. In
    this case we should abort compaction; otherwise compact_zone will run a
    useless loop and make zone->lru_lock even more contended.

    An additional check is added to ensure that cc.migratepages and
    cc.freepages get properly drained when compaction is aborted.

    [minchan@kernel.org: Putback pages isolated for migration if aborting]
    [akpm@linux-foundation.org: compact_zone_order requires non-NULL arg contended]
    [akpm@linux-foundation.org: make compact_zone_order() require non-NULL arg `contended']
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Shaohua Li
    Signed-off-by: Mel Gorman
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • * Add ALLOC_CMA alloc flag and pass it to [__]zone_watermark_ok()
    (from Minchan Kim).

    * During the watermark check, decrease the number of available free
    pages by the number of free CMA pages if necessary (unmovable
    allocations cannot use pages from CMA areas); the adjustment is
    sketched after this entry.

    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
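
    A sketch of the watermark adjustment described above (a fragment in the
    spirit of __zone_watermark_ok(); the surrounding variables are
    simplified and the exact check is approximate):

        long free_pages = zone_page_state(z, NR_FREE_PAGES);

        /* if the caller may not allocate from CMA areas, the free CMA
         * pages cannot help satisfy this request */
        #ifdef CONFIG_CMA
        if (!(alloc_flags & ALLOC_CMA))
                free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
        #endif

        if (free_pages <= min + lowmem_reserve)
                return false;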
     
  • While compaction is migrating pages to free up large contiguous blocks
    for allocation it races with other allocation requests that may steal
    these blocks or break them up. This patch alters direct compaction to
    capture a suitable free page as soon as it becomes available to reduce
    this race. It uses similar logic to split_free_page() to ensure that
    watermarks are still obeyed.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Allocation success rates have been far lower since 3.4 due to commit
    fe2c2a106663 ("vmscan: reclaim at order 0 when compaction is enabled").
    This commit was introduced for good reasons and it was known in advance
    that the success rates would suffer but it was justified on the grounds
    that the high allocation success rates were achieved by aggressive
    reclaim. Success rates are expected to suffer even more in 3.6 due to
    commit 7db8889ab05b ("mm: have order > 0 compaction start off where it
    left") which testing has shown to severely reduce allocation success
    rates under load - to 0% in one case.

    This series aims to improve the allocation success rates without
    regressing the benefits of commit fe2c2a106663. The series is based on
    latest mmotm and takes into account the __GFP_NO_KSWAPD flag is going
    away.

    Patch 1 updates a stale comment seeing as I was in the general area.

    Patch 2 updates reclaim/compaction to reclaim pages scaled on the number
    of recent failures.

    Patch 3 captures suitable high-order pages freed by compaction to reduce
    races with parallel allocation requests.

    Patch 4 fixes the upstream commit [7db8889a: mm: have order > 0 compaction
    start off where it left] to enable compaction again

    Patch 5 identifies when compaction is taking too long due to contention
    and aborts.

    STRESS-HIGHALLOC
                        3.6-rc1-akpm       full-series
    Pass 1            36.00 ( 0.00%)    51.00 (15.00%)
    Pass 2            42.00 ( 0.00%)    63.00 (21.00%)
    while Rested      86.00 ( 0.00%)    86.00 ( 0.00%)

    From

    http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3/hydra/comparison.html

    I know that the allocation success rates in 3.3.6 were 78% in comparison
    to 36% in the current akpm tree. With the full series applied, the
    success rates are up to around 51% with some variability in the results.
    This is not as high a success rate but it does not reclaim excessively
    which is a key point.

    MMTests Statistics: vmstat
                            3.6-rc1-akpm    full-series
    Page Ins                     3050912        3078892
    Page Outs                    8033528        8039096
    Swap Ins                           0              0
    Swap Outs                          0              0

    Note that swap in/out rates remain at 0. In 3.3.6 with 78% success rates
    there were 71881 pages swapped out.

                            3.6-rc1-akpm    full-series
    Direct pages scanned           70942         122976
    Kswapd pages scanned         1366300        1520122
    Kswapd pages reclaimed       1366214        1484629
    Direct pages reclaimed         70936         105716
    Kswapd efficiency                99%            97%
    Kswapd velocity             1072.550       1182.615
    Direct efficiency                99%            85%
    Direct velocity               55.690         95.672

    The kswapd velocity changes very little as expected. kswapd velocity is
    around the 1000 pages/sec mark where as in kernel 3.3.6 with the high
    allocation success rates it was 8140 pages/second. Direct velocity is
    higher as a result of patch 2 of the series but this is expected and is
    acceptable. The direct reclaim and kswapd velocities change very little.

    If these get accepted for merging then there is a difficulty in how they
    should be handled. 7db8889a ("mm: have order > 0 compaction start off
    where it left") is broken but it is already in 3.6-rc1 and needs to be
    fixed. However, if just patch 4 from this series is applied then Jim
    Schutt's workload is known to break again as his workload also requires
    patch 5. While it would be preferred to have all these patches in 3.6 to
    improve compaction in general, it would at least be acceptable if just
    patches 4 and 5 were merged to 3.6 to fix a known problem without breaking
    compaction completely. On the face of it, that would force
    __GFP_NO_KSWAPD patches to be merged at the same time but I can do a
    version of this series with __GFP_NO_KSWAPD change reverted and then
    rebase it on top of this series. That might be best overall because I
    note that the __GFP_NO_KSWAPD patch should have removed
    deferred_compaction from page_alloc.c but did not, and fixing that causes
    collisions with this series.

    This patch:

    The comment about order applied when the check was order >
    PAGE_ALLOC_COSTLY_ORDER which has not been the case since c5a73c3d ("thp:
    use compaction for all allocation orders"). Fixing the comment while I'm
    in the general area.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

22 Aug, 2012

3 commits

  • Jim Schutt reported a problem that pointed at compaction contending
    heavily on locks. The workload is straightforward and, in his own words:

    The systems in question have 24 SAS drives spread across 3 HBAs,
    running 24 Ceph OSD instances, one per drive. FWIW these servers
    are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160
    Ceph Linux clients doing dd simultaneously to a Ceph file system
    backed by 12 of these servers.

    Early in the test everything looks fine

    procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    31 15 0 287216 576 38606628 0 0 2 1158 2 14 1 3 95 0 0
    27 15 0 225288 576 38583384 0 0 18 2222016 203357 134876 11 56 17 15 0
    28 17 0 219256 576 38544736 0 0 11 2305932 203141 146296 11 49 23 17 0
    6 18 0 215596 576 38552872 0 0 7 2363207 215264 166502 12 45 22 20 0
    22 18 0 226984 576 38596404 0 0 3 2445741 223114 179527 12 43 23 22 0

    and then it goes to pot

    procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    163 8 0 464308 576 36791368 0 0 11 22210 866 536 3 13 79 4 0
    207 14 0 917752 576 36181928 0 0 712 1345376 134598 47367 7 90 1 2 0
    123 12 0 685516 576 36296148 0 0 429 1386615 158494 60077 8 84 5 3 0
    123 12 0 598572 576 36333728 0 0 1107 1233281 147542 62351 7 84 5 4 0
    622 7 0 660768 576 36118264 0 0 557 1345548 151394 59353 7 85 4 3 0
    223 11 0 283960 576 36463868 0 0 46 1107160 121846 33006 6 93 1 1 0

    Note that system CPU usage is very high and blocks being written out
    have dropped by 42%. He analysed this with perf and found

    perf record -g -a sleep 10
    perf report --sort symbol --call-graph fractal,5
    34.63% [k] _raw_spin_lock_irqsave
    |
    |--97.30%-- isolate_freepages
    | compaction_alloc
    | unmap_and_move
    | migrate_pages
    | compact_zone
    | compact_zone_order
    | try_to_compact_pages
    | __alloc_pages_direct_compact
    | __alloc_pages_slowpath
    | __alloc_pages_nodemask
    | alloc_pages_vma
    | do_huge_pmd_anonymous_page
    | handle_mm_fault
    | do_page_fault
    | page_fault
    | |
    | |--87.39%-- skb_copy_datagram_iovec
    | | tcp_recvmsg
    | | inet_recvmsg
    | | sock_recvmsg
    | | sys_recvfrom
    | | system_call
    | | __recv
    | | |
    | | --100.00%-- (nil)
    | |
    | --12.61%-- memcpy
    --2.70%-- [...]

    There was other data but primarily it all shows that compaction is
    contending heavily on the zone->lock and zone->lru_lock.

    commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
    while isolating pages for migration] noted that it was possible for
    migration to hold the lru_lock for an excessive amount of time. Very
    broadly speaking this patch expands the concept.

    This patch introduces compact_checklock_irqsave() to check if a lock
    is contended or the process needs to be scheduled. If either condition
    is true then async compaction is aborted and the caller is informed.
    The page allocator will fail a THP allocation if compaction failed due
    to contention. This patch also introduces compact_trylock_irqsave(),
    which acquires the lock only if it is not contended and the process
    does not need to schedule (the contention check is sketched after this
    entry).

    Reported-by: Jim Schutt
    Tested-by: Jim Schutt
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
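
    An approximate sketch of the contention check described above (close in
    shape to what the patch adds, but not the verbatim code):

        static bool compact_checklock_irqsave(spinlock_t *lock,
                                              unsigned long *flags,
                                              bool locked,
                                              struct compact_control *cc)
        {
                if (need_resched() || spin_is_contended(lock)) {
                        if (locked) {
                                spin_unlock_irqrestore(lock, *flags);
                                locked = false;
                        }

                        /* async compaction aborts rather than waiting */
                        if (!cc->sync) {
                                cc->contended = true;
                                return false;
                        }

                        cond_resched();
                }

                if (!locked)
                        spin_lock_irqsave(lock, *flags);
                return true;
        }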
     
  • Commit 7db8889ab05b ("mm: have order > 0 compaction start off where it
    left") introduced a caching mechanism to reduce the amount work the free
    page scanner does in compaction. However, it has a problem. Consider
    two processes simultaneously scanning free pages:

                                              C
    Process A           M     S     F
                    |---------------------------------------|
    Process B           M                 FS

    C is zone->compact_cached_free_pfn
    S is cc->start_free_pfn
    M is cc->migrate_pfn
    F is cc->free_pfn

    In this diagram, Process A has just reached its migrate scanner, wrapped
    around and updated compact_cached_free_pfn accordingly.

    Simultaneously, Process B finishes isolating in a block and updates
    compact_cached_free_pfn again to the location of its free scanner.

    Process A moves to "end_of_zone - one_pageblock" and runs this check

        if (cc->order > 0 && (!cc->wrapped ||
                              zone->compact_cached_free_pfn >
                              cc->start_free_pfn))
                pfn = min(pfn, zone->compact_cached_free_pfn);

    compact_cached_free_pfn is above where it started so the free scanner
    skips almost the entire space it should have scanned. When there are
    multiple processes compacting it can end in a situation where the entire
    zone is not being scanned at all. Further, it is possible for two
    processes to ping-pong updates to compact_cached_free_pfn, which is just
    random.

    Overall, the end result wrecks allocation success rates.

    There is not an obvious way around this problem without introducing new
    locking and state so this patch takes a different approach.

    First, it gets rid of the skip logic because it's not clear that it
    matters if two free scanners happen to be in the same block but with
    racing updates it's too easy for it to skip over blocks it should not.

    Second, it updates compact_cached_free_pfn in a more limited set of
    circumstances.

    If a scanner has wrapped, it updates compact_cached_free_pfn to the end
    of the zone. When a wrapped scanner isolates a page, it updates
    compact_cached_free_pfn to point to the highest pageblock it
    can isolate pages from.

    If a scanner has not wrapped when it has finished isolating pages, it
    checks whether compact_cached_free_pfn is pointing to the end of the
    zone. If so, the value is updated to point to the highest pageblock
    that pages were isolated from. This value will not be updated again
    until a free page scanner wraps and resets compact_cached_free_pfn.

    This is not optimal and it can still race but the compact_cached_free_pfn
    will be pointing to or very near a pageblock with free pages.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit aff622495c9a ("vmscan: only defer compaction for failed order and
    higher") fixed the bad deferring policy but made a mistake in checking
    compact_order_failed in __compact_pgdat(), so it can't update
    compact_order_failed with the new order. This ends up preventing
    correct operation of the deferral policy. This patch fixes it.

    Signed-off-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

01 Aug, 2012

1 commit

  • Order > 0 compaction stops when enough free pages of the correct page
    order have been coalesced. When doing subsequent higher order
    allocations, it is possible for compaction to be invoked many times.

    However, the compaction code always starts out looking for things to
    compact at the start of the zone, and for free pages to compact things to
    at the end of the zone.

    This can cause quadratic behaviour, with isolate_freepages starting at the
    end of the zone each time, even though previous invocations of the
    compaction code already filled up all free memory on that end of the zone.

    This can cause isolate_freepages to take enormous amounts of CPU with
    certain workloads on larger memory systems.

    The obvious solution is to have isolate_freepages remember where it left
    off last time, and continue at that point the next time it gets invoked
    for an order > 0 compaction. This could cause compaction to fail if
    cc->free_pfn and cc->migrate_pfn are close together initially, in that
    case we restart from the end of the zone and try once more.

    Forced full (order == -1) compactions are left alone.

    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: s/laste/last/, use 80 cols]
    Signed-off-by: Rik van Riel
    Reported-by: Jim Schutt
    Tested-by: Jim Schutt
    Cc: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

12 Jul, 2012

1 commit

  • If page migration cannot charge the temporary page to the memcg,
    migrate_pages() will return -ENOMEM. This isn't considered in memory
    compaction however, and the loop continues to iterate over all
    pageblocks trying to isolate and migrate pages. If a small number of
    very large memcgs happen to be oom, however, these attempts will mostly
    be futile, leading to an enormous amount of CPU consumption due to the
    page migration failures.

    This patch will short-circuit and fail memory compaction if
    migrate_pages() returns -ENOMEM. COMPACT_PARTIAL is returned in case
    some migrations were successful so that the page allocator will retry
    (the check is sketched after this entry).

    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Minchan Kim
    Cc: Kamezawa Hiroyuki
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
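
    A sketch of the short-circuit described above (approximate; the real
    check sits in compact_zone() and the migrate_pages() arguments reflect
    the interface of that era):

        err = migrate_pages(&cc->migratepages, compaction_alloc,
                            (unsigned long)cc, false,
                            cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC);

        if (err) {
                putback_lru_pages(&cc->migratepages);
                cc->nr_migratepages = 0;

                /* a memcg charge failure means further attempts are likely
                 * futile: give up, but report partial progress so the page
                 * allocator still retries the allocation */
                if (err == -ENOMEM) {
                        ret = COMPACT_PARTIAL;
                        goto out;
                }
        }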
     

04 Jun, 2012

1 commit

  • This reverts commit 5ceb9ce6fe9462a298bb2cd5c9f1ca6cb80a0199.

    That commit seems to be the cause of the mm compaction list corruption
    issues that Dave Jones reported. The locking (or rather, the absence
    thereof) is dubious, as is the use of the 'page' variable once it has
    been found to be outside the pageblock range.

    So revert it for now, we can re-visit this for 3.6. If we even need to:
    as Minchan Kim says, "The patch wasn't a bug fix and even test workload
    was very theoretical".

    Reported-and-tested-by: Dave Jones
    Acked-by: Hugh Dickins
    Acked-by: KOSAKI Motohiro
    Acked-by: Minchan Kim
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Kyungmin Park
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

30 May, 2012

3 commits

  • Take lruvec further: pass it instead of zone to add_page_to_lru_list() and
    del_page_from_lru_list(), and have pagevec_lru_move_fn() pass lruvec down
    to its target functions.

    This cleanup eliminates a swathe of cruft in memcontrol.c, including
    mem_cgroup_lru_add_list(), mem_cgroup_lru_del_list() and
    mem_cgroup_lru_move_lists() - which never actually touched the lists.

    In their place, mem_cgroup_page_lruvec() to decide the lruvec, previously
    a side-effect of add, and mem_cgroup_update_lru_size() to maintain the
    lru_size stats.

    Whilst these are simplifications in their own right, the goal is to bring
    the evaluation of lruvec next to the spin_locking of the lrus, in
    preparation for a future patch.

    Signed-off-by: Hugh Dickins
    Cc: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Konstantin Khlebnikov
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • After patch "mm: forbid lumpy-reclaim in shrink_active_list()" we can
    completely remove anon/file and active/inactive lru type filters from
    __isolate_lru_page(), because isolation for 0-order reclaim always
    isolates pages from right lru list. And pages-isolation for lumpy
    shrink_inactive_list() or memory-compaction anyway allowed to isolate
    pages from all evictable lru lists.

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • When MIGRATE_UNMOVABLE pages are freed from MIGRATE_UNMOVABLE type
    pageblock (and some MIGRATE_MOVABLE pages are left in it) waiting until an
    allocation takes ownership of the block may take too long. The type of
    the pageblock remains unchanged so the pageblock cannot be used as a
    migration target during compaction.

    Fix it by:

    * Adding enum compact_mode (COMPACT_ASYNC_[MOVABLE,UNMOVABLE], and
    COMPACT_SYNC) and then converting the sync field in struct
    compact_control to use it (the enum is sketched after this entry).

    * Adding nr_pageblocks_skipped field to struct compact_control and
    tracking how many destination pageblocks were of MIGRATE_UNMOVABLE type.
    If COMPACT_ASYNC_MOVABLE mode compaction ran fully in
    try_to_compact_pages() (COMPACT_COMPLETE), it implies that there is no
    suitable page for allocation. In that case, check whether there were
    enough MIGRATE_UNMOVABLE pageblocks to try a second pass in
    COMPACT_ASYNC_UNMOVABLE mode.

    * Scanning the MIGRATE_UNMOVABLE pageblocks (during COMPACT_SYNC and
    COMPACT_ASYNC_UNMOVABLE compaction modes) and building a count based on
    finding PageBuddy pages, page_count(page) == 0 or PageLRU pages. If all
    pages within the MIGRATE_UNMOVABLE pageblock are in one of those three
    sets, change the whole pageblock type to MIGRATE_MOVABLE.

    My particular test case (on an ARM EXYNOS4 device with 512 MiB, which
    means 131072 standard 4KiB pages in the 'Normal' zone) is to:

    - allocate 120000 pages for kernel's usage
    - free every second page (60000 pages) of memory just allocated
    - allocate and use 60000 pages from user space
    - free remaining 60000 pages of kernel memory
    (now we have fragmented memory occupied mostly by user space pages)
    - try to allocate 100 order-9 (2048 KiB) pages for kernel's usage

    The results:
    - with compaction disabled I get 11 successful allocations
    - with compaction enabled - 14 successful allocations
    - with this patch I'm able to get all 100 successful allocations

    NOTE: If we can make kswapd aware of order-0 requests during compaction,
    we can enhance kswapd by changing the mode to COMPACT_ASYNC_FULL
    (COMPACT_ASYNC_MOVABLE + COMPACT_ASYNC_UNMOVABLE). Please see the
    following thread:

    http://marc.info/?l=linux-mm&m=133552069417068&w=2

    [minchan@kernel.org: minor cleanups]
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Marek Szyprowski
    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
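
    A sketch of the mode enum described above (derived from the description;
    the exact names and ordering may differ from the patch):

        /* how aggressively compaction may treat MIGRATE_UNMOVABLE blocks */
        enum compact_mode {
                COMPACT_ASYNC_MOVABLE,   /* async; movable target blocks only */
                COMPACT_ASYNC_UNMOVABLE, /* async; also scan unmovable blocks */
                COMPACT_SYNC,            /* full synchronous compaction */
        };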
     

21 May, 2012

1 commit

  • The MIGRATE_CMA migration type has two main characteristics:
    (i) only movable pages can be allocated from MIGRATE_CMA
    pageblocks and (ii) page allocator will never change migration
    type of MIGRATE_CMA pageblocks.

    This guarantees (to some degree) that a page in a MIGRATE_CMA
    pageblock can always be migrated somewhere else (unless there's no
    memory left in the system).

    It is designed to be used for allocating big chunks (eg. 10MiB)
    of physically contiguous memory. Once a driver requests
    contiguous memory, pages from MIGRATE_CMA pageblocks may be
    migrated away to create a contiguous block.

    To minimise the number of migrations, the MIGRATE_CMA migration type
    is the last type tried when the page allocator falls back to other
    migration types.

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Signed-off-by: Kyungmin Park
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz