08 Apr, 2014

6 commits

  • The conditions that control the isolation mode in
    isolate_migratepages_range() do not change during the iteration, so
    extract them out and only define the value once.

    This actually does have an effect; gcc doesn't perform this optimization
    itself because of cc->sync.
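
    As a rough standalone sketch of the hoisting (illustrative types and flag
    values, not the actual kernel code; scan_range() merely stands in for
    isolate_migratepages_range()):

    #include <stdbool.h>
    #include <stdio.h>

    #define ISOLATE_ASYNC_MIGRATE 0x1       /* illustrative flag value */

    struct compact_control { bool sync; };

    static void scan_range(struct compact_control *cc, unsigned long start,
                           unsigned long end)
    {
            /* The mode depends only on cc->sync, which cannot change during
             * the iteration, so compute it once instead of per page. */
            unsigned long mode = cc->sync ? 0 : ISOLATE_ASYNC_MIGRATE;
            unsigned long pfn;

            for (pfn = start; pfn < end; pfn++)
                    printf("pfn %lu isolated with mode %#lx\n", pfn, mode);
    }

    int main(void)
    {
            struct compact_control cc = { .sync = false };

            scan_range(&cc, 0, 4);
            return 0;
    }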

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This is just a clean-up to reduce code size and improve readability.
    There is no functional change.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • isolation_suitable() and migrate_async_suitable() are used to verify
    that a pageblock range is suitable for migration. They don't need to be
    called on every page. The current code handles the unsuitable case
    correctly, but not the suitable one:

    1) It re-checks isolation_suitable() on each page of a pageblock that was
    already established as suitable.
    2) It re-checks migrate_async_suitable() on each page of a pageblock that
    was not entered through the next_pageblock: label, because
    last_pageblock_nr is not otherwise updated.

    This patch fixes the situation by 1) calling isolation_suitable() only once
    per pageblock and 2) always updating last_pageblock_nr to the pageblock
    that was just checked.

    Additionally, move the PageBuddy() check after the pageblock unit check,
    since the pageblock check is the first thing we should do, and this makes
    things simpler.
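
    A simplified standalone sketch of the once-per-pageblock pattern
    (illustrative names; pageblock_suitable() and PAGEBLOCK_NR_PAGES stand in
    for isolation_suitable() and pageblock_nr_pages):

    #include <stdbool.h>
    #include <stdio.h>

    #define PAGEBLOCK_NR_PAGES 512          /* illustrative pageblock size */

    static bool pageblock_suitable(unsigned long pb)
    {
            return pb % 2 == 0;             /* stand-in for the real check */
    }

    int main(void)
    {
            unsigned long pfn, last_pageblock_nr = (unsigned long)-1;
            bool suitable = false;

            for (pfn = 0; pfn < 2048; pfn++) {
                    unsigned long pb = pfn / PAGEBLOCK_NR_PAGES;

                    /* Run the pageblock-level check only on entering a new
                     * pageblock, and always record the pageblock checked. */
                    if (pb != last_pageblock_nr) {
                            last_pageblock_nr = pb;
                            suitable = pageblock_suitable(pb);
                            printf("pageblock %lu: %s\n", pb,
                                   suitable ? "suitable" : "skip");
                    }
                    if (!suitable)
                            continue;
                    /* per-page isolation work would go here */
            }
            return 0;
    }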

    [vbabka@suse.cz: rephrase commit description]
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • It is odd to drop the spinlock when we scan the (SWAP_CLUSTER_MAX - 1)th
    pfn page. This may result in the situation below while isolating migrate
    pages:

    1. Try to isolate pfn pages 0x0 ~ 0x200.
    2. When low_pfn is 0x1ff, ((low_pfn+1) % SWAP_CLUSTER_MAX) == 0, so drop
    the spinlock.
    3. Then, to complete the isolation, retry to acquire the lock.

    It is better to use the SWAP_CLUSTER_MAXth pfn as the criterion for
    dropping the lock. This does no harm at pfn 0x0, because at that point
    the locked variable would be false.
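
    A worked standalone example of the two checks, assuming SWAP_CLUSTER_MAX
    is 32 (the code is illustrative, not the kernel loop itself):

    #include <stdio.h>

    #define SWAP_CLUSTER_MAX 32

    int main(void)
    {
            unsigned long low_pfn;

            for (low_pfn = 0x0; low_pfn < 0x200; low_pfn++) {
                    /* Old check: fires on the (SWAP_CLUSTER_MAX - 1)th pfn,
                     * e.g. at 0x1ff, just before the range completes. */
                    int old_drop = ((low_pfn + 1) % SWAP_CLUSTER_MAX) == 0;
                    /* New check: fires on the SWAP_CLUSTER_MAXth pfn; at 0x0
                     * it is harmless as the lock is not yet held there. */
                    int new_drop = (low_pfn % SWAP_CLUSTER_MAX) == 0;

                    if (old_drop || new_drop)
                            printf("pfn 0x%03lx: old=%d new=%d\n",
                                   low_pfn, old_drop, new_drop);
            }
            return 0;
    }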

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • suitable_migration_target() checks whether a pageblock is a suitable
    migration target. In isolate_freepages_block() it is called on every
    page, which is inefficient, so make it called only once per pageblock.

    suitable_migration_target() also checks whether the page is high-order,
    but its criterion for high-order is the pageblock order, so calling it
    once per pageblock range causes no problem.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The purpose of compaction is to get a high-order page. Currently, if we
    find a high-order page while searching for migration target pages, we
    break it into order-0 pages and use them as migration targets. This runs
    contrary to the purpose of compaction, so disallow high-order pages from
    being used as migration targets.

    Additionally, clean up the logic in suitable_migration_target() to
    simplify the code. There are no functional changes from this clean-up.
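
    A simplified sketch of the resulting policy (illustrative types; the real
    suitable_migration_target() works on struct page via PageBuddy(),
    page_order() and the pageblock migratetype):

    #include <stdbool.h>

    #define PAGEBLOCK_ORDER 9               /* illustrative */

    enum migratetype { MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_CMA };

    struct page {
            bool buddy;                     /* stand-in for PageBuddy()    */
            unsigned int order;             /* stand-in for page_order()   */
            enum migratetype mt;            /* pageblock migratetype       */
    };

    /* A free page of pageblock order (or larger) is exactly what compaction
     * is trying to create, so refuse to break it up as a migration target. */
    static bool suitable_migration_target(const struct page *page)
    {
            if (page->buddy && page->order >= PAGEBLOCK_ORDER)
                    return false;

            return page->mt == MIGRATE_MOVABLE || page->mt == MIGRATE_CMA;
    }

    int main(void)
    {
            struct page huge = { .buddy = true, .order = 10,
                                 .mt = MIGRATE_MOVABLE };

            return suitable_migration_target(&huge);  /* 0: rejected */
    }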

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

04 Apr, 2014

3 commits

  • Mark function as static in compaction.c because it is not used outside
    this file.

    This eliminates the following warning from mm/compaction.c:

    mm/compaction.c:1190:9: warning: no previous prototype for `sysfs_compact_node' [-Wmissing-prototypes]

    Signed-off-by: Rashika Kheria
    Reviewed-by: Josh Triplett
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rashika Kheria
     
  • Page migration will fail for memory that is pinned in memory with, for
    example, get_user_pages(). In this case, it is unnecessary to take
    zone->lru_lock or to isolate the page and pass it to page migration,
    which will ultimately fail.

    This check is racy (the page can still change from under us), but in that
    case we'll just fail later when attempting to move the page.
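
    The check is likely along the lines of comparing a page's reference count
    with its map count; a simplified standalone sketch (the fields stand in
    for page_count(), page_mapcount() and the absence of page_mapping()):

    #include <stdbool.h>

    struct page {
            int count;      /* stand-in for page_count(): total references */
            int mapcount;   /* stand-in for page_mapcount(): pte mappings  */
            bool anon;      /* stand-in for "no page_mapping()"            */
    };

    /* Racy heuristic: if an anonymous page holds more references than pte
     * mappings, something like get_user_pages() has pinned it and migration
     * would fail anyway, so don't bother isolating it.  A wrong guess only
     * means migration fails later, as it would have regardless. */
    static bool skip_pinned(const struct page *page)
    {
            return page->anon && page->count > page->mapcount;
    }

    int main(void)
    {
            struct page pinned = { .count = 3, .mapcount = 1, .anon = true };
            struct page plain  = { .count = 1, .mapcount = 1, .anon = true };

            return !(skip_pinned(&pinned) && !skip_pinned(&plain));
    }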

    This avoids very expensive memory compaction when faulting transparent
    hugepages after pinning a lot of memory with a Mellanox driver.

    On a 128GB machine with ~120GB of memory pinned, before this patch we
    see an enormous disparity in the number of page migration failures
    because of the pinning (from /proc/vmstat):

    compact_pages_moved 8450
    compact_pagemigrate_failed 15614415

    0.05% of pages isolated are successfully migrated and explicitly
    triggering memory compaction takes 102 seconds. After the patch:

    compact_pages_moved 9197
    compact_pagemigrate_failed 7

    99.9% of pages isolated are now successfully migrated in this
    configuration and memory compaction takes less than one second.

    Signed-off-by: David Rientjes
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Rik van Riel
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The cached pageblock hint should be ignored when triggering compaction
    through /proc/sys/vm/compact_memory so that all eligible memory is
    isolated. Manually invoking compaction is known to be expensive and is
    mainly done for debugging, so there's no need to skip pageblocks based
    on heuristics.

    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

11 Mar, 2014

1 commit

  • We received several reports of bad page state when freeing CMA pages
    previously allocated with alloc_contig_range:

    BUG: Bad page state in process Binder_A pfn:63202
    page:d21130b0 count:0 mapcount:1 mapping: (null) index:0x7dfbf
    page flags: 0x40080068(uptodate|lru|active|swapbacked)

    Based on the page state, it looks like the page was still in use. The
    page flags do not make sense for the use case though. Further debugging
    showed that despite alloc_contig_range returning success, at least one
    page in the range still remained in the buddy allocator.

    There is an issue with isolate_freepages_block. In strict mode (which
    CMA uses), if any pages in the range cannot be isolated,
    isolate_freepages_block should return failure (0). The current check
    keeps track of the total number of isolated pages and compares it
    against the size of the range:

    if (strict && nr_strict_required > total_isolated)
    total_isolated = 0;

    After taking the zone lock, if one of the pages in the range is not in
    the buddy allocator, we continue through the loop and do not increment
    total_isolated. If in the last iteration of the loop we isolate more
    than one page (e.g. the last page needed is a higher-order page), the
    check for total_isolated may pass and we fail to detect that a page was
    skipped. The fix is to bail out of the loop immediately if we are in
    strict mode. There's no benefit to continuing anyway since we need all
    pages to be isolated. Additionally, drop the error checking based on
    nr_strict_required and just check the pfn ranges. This matches what
    isolate_freepages_range does.
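
    A simplified sketch of the fixed control flow (illustrative helpers; the
    real isolate_freepages_block() isolates buddy pages of varying order
    under the zone lock):

    #include <stdbool.h>

    /* Stand-in for "this pfn is a free buddy page we can isolate". */
    static bool page_is_free(unsigned long pfn)
    {
            return pfn != 5;                /* pretend pfn 5 is not free */
    }

    static unsigned long isolate_free_range(unsigned long start,
                                            unsigned long end, bool strict)
    {
            unsigned long pfn, total_isolated = 0;

            for (pfn = start; pfn < end; pfn++) {
                    if (!page_is_free(pfn)) {
                            if (strict)
                                    break;  /* CMA needs every page */
                            continue;
                    }
                    total_isolated++;       /* may isolate >1 in real code */
            }

            /* Strict callers require the whole range: report failure (0) if
             * the scan stopped early, instead of comparing counts that a
             * final high-order page can satisfy by accident. */
            if (strict && pfn < end)
                    total_isolated = 0;

            return total_isolated;
    }

    int main(void)
    {
            return isolate_free_range(0, 10, true) != 0;  /* expect 0 */
    }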

    Signed-off-by: Laura Abbott
    Acked-by: Minchan Kim
    Cc: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Acked-by: Bartlomiej Zolnierkiewicz
    Acked-by: Michal Nazarewicz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     

24 Jan, 2014

2 commits

  • Developers occasionally try to optimise PFN scanners by using page_order
    but miss that, in general, it requires zone->lock. This has happened
    twice for compaction.c and been rejected both times. This patch
    clarifies the documentation of page_order and adds a note to
    compaction.c explaining why page_order is not used.

    [akpm@linux-foundation.org: tweaks]
    [lauraa@codeaurora.org: Corrected a page_zone(page)->lock reference]
    Signed-off-by: Mel Gorman
    Acked-by: Rafael Aquini
    Acked-by: Minchan Kim
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the
    page at various VM_BUG_ON sites, I've noticed that the page dump is quite
    useful to people debugging issues in mm.

    This patch adds VM_BUG_ON_PAGE(cond, page) which, beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.
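
    A standalone sketch of the idea (simplified: the in-kernel macro calls
    dump_page() and BUG(); here abort() stands in for BUG()):

    #include <stdio.h>
    #include <stdlib.h>

    struct page { unsigned long flags; int count; };

    static void dump_page(const struct page *page)
    {
            fprintf(stderr, "page flags:%#lx count:%d\n",
                    page->flags, page->count);
    }

    /* Like an assertion on a page, but dump the page's state first so the
     * report contains the information needed to debug the failure. */
    #define VM_BUG_ON_PAGE(cond, page)                                \
            do {                                                      \
                    if (cond) {                                       \
                            dump_page(page);                          \
                            abort();    /* the kernel would BUG() */  \
                    }                                                 \
            } while (0)

    int main(void)
    {
            struct page page = { .flags = 0, .count = 1 };

            VM_BUG_ON_PAGE(page.count != 1, &page);   /* passes silently */
            return 0;
    }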

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

22 Jan, 2014

6 commits

  • Compaction used to start its migrate and free page scanners at the zone's
    lowest and highest pfn, respectively. Later, caching was introduced to
    remember the scanners' progress across compaction attempts so that
    pageblocks are not re-scanned uselessly. Additionally, pageblocks where
    isolation failed are marked to be quickly skipped when encountered again
    in future compactions.

    Currently, both the reset of the cached pfn's and the clearing of the
    pageblock skip information for a zone are done in
    __reset_isolation_suitable(). This function gets called when:

    - compaction is restarting after being deferred
    - compact_blockskip_flush flag is set in compact_finished() when the scanners
    meet (and not again cleared when direct compaction succeeds in allocation)
    and kswapd acts upon this flag before going to sleep

    This behavior is suboptimal for several reasons:

    - when direct sync compaction is called after async compaction fails (in the
    allocation slowpath), it will effectively do nothing, unless kswapd
    happens to process the compact_blockskip_flush flag meanwhile. This is racy
    and goes against the purpose of sync compaction to more thoroughly retry
    the compaction of a zone where async compaction has failed.
    The restart-after-deferring path cannot help here as deferring happens only
    after the sync compaction fails. It is also done only for the preferred
    zone, while the compaction might be done for a fallback zone.

    - the mechanism of marking pageblock to be skipped has little value since the
    cached pfn's are reset only together with the pageblock skip flags. This
    effectively limits pageblock skip usage to parallel compactions.

    This patch changes compact_finished() so that cached pfn's are reset
    immediately when the scanners meet. Clearing pageblock skip flags is
    unchanged, as well as the other situations where cached pfn's are reset.
    This allows the sync-after-async compaction to retry pageblocks not
    marked as skipped, such as !MIGRATE_MOVABLE blocks that async compaction
    now skips without marking them.

    Signed-off-by: Vlastimil Babka
    Cc: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Compaction temporarily marks pageblocks where it fails to isolate pages
    as to-be-skipped in further compactions, in order to improve efficiency.
    One of the reasons to fail isolating pages is that isolation is not
    attempted in pageblocks that are not of MIGRATE_MOVABLE (or CMA) type.

    The problem is that blocks skipped due to not being MIGRATE_MOVABLE in
    async compaction become skipped due to the temporary mark also in future
    sync compaction. Moreover, this may follow quite soon during
    __alloc_pages_slowpath, without much time for kswapd to clear the
    pageblock skip marks. This goes against the idea that sync compaction
    should try to scan these blocks more thoroughly than the async
    compaction.

    The fix is to ensure in async compaction that these !MIGRATE_MOVABLE
    blocks are not marked to be skipped. Note this should not affect
    performance or locking impact of further async compactions, as skipping
    a block due to being !MIGRATE_MOVABLE is done soon after skipping a
    block marked to be skipped, both without locking.

    Signed-off-by: Vlastimil Babka
    Cc: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Compaction of a zone is finished when the migrate scanner (which begins
    at the zone's lowest pfn) meets the free page scanner (which begins at
    the zone's highest pfn). This is detected in compact_zone() and in the
    case of direct compaction, the compact_blockskip_flush flag is set so
    that kswapd later resets the cached scanner pfn's, and a new compaction
    may again start at the zone's borders.

    The meeting of the scanners can happen during either scanner's activity.
    However, it may currently fail to be detected when it occurs in the free
    page scanner, due to two problems. First, isolate_freepages() keeps
    free_pfn at the highest block where it isolated pages from, for the
    purposes of not missing the pages that are returned back to allocator
    when migration fails. Second, failing to isolate enough free pages due
    to scanners meeting results in -ENOMEM being returned by
    migrate_pages(), which makes compact_zone() bail out immediately without
    calling compact_finished() that would detect scanners meeting.

    This failure to detect scanners meeting might result in repeated
    attempts at compaction of a zone that keep starting from the cached
    pfn's close to the meeting point, and quickly failing through the
    -ENOMEM path, without the cached pfns being reset, over and over. This
    has been observed (through additional tracepoints) in the third phase of
    the mmtests stress-highalloc benchmark, where the allocator runs on an
    otherwise idle system. The problem was observed in the DMA32 zone,
    which was used as a fallback to the preferred Normal zone, but on the
    4GB system it was actually the largest zone. The problem is even
    amplified for such fallback zone - the deferred compaction logic, which
    could (after being fixed by a previous patch) reset the cached scanner
    pfn's, is only applied to the preferred zone and not for the fallbacks.

    The problem in the third phase of the benchmark was further amplified by
    commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") which
    resulted in a non-deterministic regression of the allocation success
    rate from ~85% to ~65%. This occurs in about half of benchmark runs,
    making bisection problematic. It is unlikely that the commit itself is
    buggy, but it should put more pressure on the DMA32 zone during phases 1
    and 2, which may leave it more fragmented in phase 3 and expose the bugs
    that this patch fixes.

    The fix is to make scanners meeting in isolate_freepages() stay that way,
    and to check in compact_zone() for scanners meeting when migrate_pages()
    returns -ENOMEM. The result is that compact_finished() also detects
    scanners meeting and sets the compact_blockskip_flush flag to make
    kswapd reset the scanner pfn's.

    The results in stress-highalloc benchmark show that the "regression" by
    commit 81c0a2bb515f in phase 3 no longer occurs, and phase 1 and 2
    allocation success rates are also significantly improved.

    Signed-off-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Compaction caches pfn's for its migrate and free scanners to avoid
    scanning the whole zone each time. In compact_zone(), the cached values
    are read to set up initial values for the scanners. There are several
    situations when these cached pfn's are reset to the first and last pfn
    of the zone, respectively. One of these situations is when a compaction
    has been deferred for a zone and is now being restarted during a direct
    compaction, which is also done in compact_zone().

    However, compact_zone() currently reads the cached pfn's *before*
    resetting them. This means the reset doesn't affect the compaction that
    performs it, and with good chance also subsequent compactions, as
    update_pageblock_skip() is likely to be called and update the cached
    pfn's to those being processed. Another chance for a successful reset is
    when a direct compaction detects that the migration and free scanners
    meet (which has its own problems addressed by another patch) and sets the
    compact_blockskip_flush flag, which kswapd uses to do the reset before it
    goes to sleep.

    This is clearly a bug that results in non-deterministic behavior, so
    this patch moves the cached pfn reset to be performed *before* the
    values are read.
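
    A sketch of the corrected ordering (simplified fields; the real reset
    goes through __reset_isolation_suitable() and also covers the free
    scanner's cached pfn and the pageblock skip bits):

    #include <stdbool.h>
    #include <stdio.h>

    struct zone {
            unsigned long start_pfn, end_pfn;
            unsigned long cached_migrate_pfn;  /* stand-in for the cache */
            bool compaction_deferred;
    };

    static void reset_cached_pfns(struct zone *z)
    {
            z->cached_migrate_pfn = z->start_pfn;
    }

    int main(void)
    {
            struct zone z = {
                    .start_pfn = 0, .end_pfn = 1024,
                    .cached_migrate_pfn = 900,     /* stale, near the end */
                    .compaction_deferred = true,
            };
            unsigned long migrate_pfn;

            /* Fixed order: reset *before* reading the cached value, so a
             * restarted compaction really begins at the zone border. */
            if (z.compaction_deferred)
                    reset_cached_pfns(&z);

            migrate_pfn = z.cached_migrate_pfn;
            printf("migrate scanner starts at pfn %lu\n", migrate_pfn);
            return 0;
    }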

    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Currently there are several functions to manipulate the deferred
    compaction state variables. The remaining case where the variables are
    touched directly is when a successful allocation occurs in direct
    compaction, or is expected to be successful in the future by kswapd.
    Here, the lowest order that is expected to fail is updated, and in the
    case of a successful allocation, the deferred status and counter are
    reset completely.

    Create a new function compaction_defer_reset() to encapsulate this
    functionality and make it easier to understand the code. No functional
    change.
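
    A sketch of what such a helper can look like (simplified struct zone
    fields; treat it as illustrative rather than the exact kernel
    definition):

    #include <stdbool.h>

    struct zone {
            unsigned int compact_considered;
            unsigned int compact_defer_shift;
            int compact_order_failed;
    };

    /* Raise the lowest order expected to fail; on an actual successful
     * allocation also clear the deferral counters completely. */
    static void compaction_defer_reset(struct zone *zone, int order,
                                       bool alloc_success)
    {
            if (alloc_success) {
                    zone->compact_considered = 0;
                    zone->compact_defer_shift = 0;
            }
            if (order >= zone->compact_order_failed)
                    zone->compact_order_failed = order + 1;
    }

    int main(void)
    {
            struct zone z = { .compact_considered = 3,
                              .compact_defer_shift = 2,
                              .compact_order_failed = 2 };

            compaction_defer_reset(&z, 4, true);
            return !(z.compact_considered == 0 &&
                     z.compact_order_failed == 5);
    }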

    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The broad goal of the series is to improve allocation success rates for
    huge pages through memory compaction, while trying not to increase the
    compaction overhead. The original objective was to reintroduce
    capturing of high-order pages freed by the compaction, before they are
    split by concurrent activity. However, several bugs and opportunities
    for simple improvements were found in the current implementation, mostly
    through extra tracepoints (which are however too ugly for now to be
    considered for sending).

    The patches mostly deal with two mechanisms that reduce compaction
    overhead: caching the progress of the migrate and free scanners, and
    marking pageblocks where isolation failed so that they are skipped
    during further scans.

    Patch 1 (from mgorman) adds tracepoints that allow calculating the time
    spent in compaction and potentially debugging scanner pfn values.

    Patch 2 encapsulates some functionality for handling deferred compaction
    for better maintainability, without a functional change.

    Patch 3 fixes a bug where cached scanner pfn's are sometimes reset only after
    they have been read to initialize a compaction run.

    Patch 4 fixes a bug where scanners meeting is sometimes not properly detected
    and can lead to multiple compaction attempts quitting early without
    doing any work.

    Patch 5 improves the chances of sync compaction to process pageblocks that
    async compaction has skipped due to being !MIGRATE_MOVABLE.

    Patch 6 improves the chances of sync direct compaction to actually do anything
    when called after async compaction fails during allocation slowpath.

    The impact of the patches was validated using the mmtests
    stress-highalloc benchmark on an x86_64 machine with 4GB of memory.

    Due to instability of the results (mostly related to the bugs fixed by
    patches 2 and 3), 10 iterations were performed, taking min, mean and max
    values for success rates and mean values for time and vmstat-based
    metrics.

    First, the default GFP_HIGHUSER_MOVABLE allocations were tested with the
    patches stacked on top of v3.13-rc2. Patch 2 is OK to serve as baseline
    due to no functional changes in 1 and 2. Comments below.

    stress-highalloc
    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-nothp 3-nothp 4-nothp 5-nothp 6-nothp
    Success 1 Min 9.00 ( 0.00%) 10.00 (-11.11%) 43.00 (-377.78%) 43.00 (-377.78%) 33.00 (-266.67%)
    Success 1 Mean 27.50 ( 0.00%) 25.30 ( 8.00%) 45.50 (-65.45%) 45.90 (-66.91%) 46.30 (-68.36%)
    Success 1 Max 36.00 ( 0.00%) 36.00 ( 0.00%) 47.00 (-30.56%) 48.00 (-33.33%) 52.00 (-44.44%)
    Success 2 Min 10.00 ( 0.00%) 8.00 ( 20.00%) 46.00 (-360.00%) 45.00 (-350.00%) 35.00 (-250.00%)
    Success 2 Mean 26.40 ( 0.00%) 23.50 ( 10.98%) 47.30 (-79.17%) 47.60 (-80.30%) 48.10 (-82.20%)
    Success 2 Max 34.00 ( 0.00%) 33.00 ( 2.94%) 48.00 (-41.18%) 50.00 (-47.06%) 54.00 (-58.82%)
    Success 3 Min 65.00 ( 0.00%) 63.00 ( 3.08%) 85.00 (-30.77%) 84.00 (-29.23%) 85.00 (-30.77%)
    Success 3 Mean 76.70 ( 0.00%) 70.50 ( 8.08%) 86.20 (-12.39%) 85.50 (-11.47%) 86.00 (-12.13%)
    Success 3 Max 87.00 ( 0.00%) 86.00 ( 1.15%) 88.00 ( -1.15%) 87.00 ( 0.00%) 87.00 ( 0.00%)

    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-nothp 3-nothp 4-nothp 5-nothp 6-nothp
    User 6437.72 6459.76 5960.32 5974.55 6019.67
    System 1049.65 1049.09 1029.32 1031.47 1032.31
    Elapsed 1856.77 1874.48 1949.97 1994.22 1983.15

    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-nothp 3-nothp 4-nothp 5-nothp 6-nothp
    Minor Faults 253952267 254581900 250030122 250507333 250157829
    Major Faults 420 407 506 530 530
    Swap Ins 4 9 9 6 6
    Swap Outs 398 375 345 346 333
    Direct pages scanned 197538 189017 298574 287019 299063
    Kswapd pages scanned 1809843 1801308 1846674 1873184 1861089
    Kswapd pages reclaimed 1806972 1798684 1844219 1870509 1858622
    Direct pages reclaimed 197227 188829 298380 286822 298835
    Kswapd efficiency 99% 99% 99% 99% 99%
    Kswapd velocity 953.382 970.449 952.243 934.569 922.286
    Direct efficiency 99% 99% 99% 99% 99%
    Direct velocity 104.058 101.832 153.961 143.200 148.205
    Percentage direct scans 9% 9% 13% 13% 13%
    Zone normal velocity 347.289 359.676 348.063 339.933 332.983
    Zone dma32 velocity 710.151 712.605 758.140 737.835 737.507
    Zone dma velocity 0.000 0.000 0.000 0.000 0.000
    Page writes by reclaim 557.600 429.000 353.600 426.400 381.800
    Page writes file 159 53 7 79 48
    Page writes anon 398 375 345 346 333
    Page reclaim immediate 825 644 411 575 420
    Sector Reads 2781750 2769780 2878547 2939128 2910483
    Sector Writes 12080843 12083351 12012892 12002132 12010745
    Page rescued immediate 0 0 0 0 0
    Slabs scanned 1575654 1545344 1778406 1786700 1794073
    Direct inode steals 9657 10037 15795 14104 14645
    Kswapd inode steals 46857 46335 50543 50716 51796
    Kswapd skipped wait 0 0 0 0 0
    THP fault alloc 97 91 81 71 77
    THP collapse alloc 456 506 546 544 565
    THP splits 6 5 5 4 4
    THP fault fallback 0 1 0 0 0
    THP collapse fail 14 14 12 13 12
    Compaction stalls 1006 980 1537 1536 1548
    Compaction success 303 284 562 559 578
    Compaction failures 702 696 974 976 969
    Page migrate success 1177325 1070077 3927538 3781870 3877057
    Page migrate failure 0 0 0 0 0
    Compaction pages isolated 2547248 2306457 8301218 8008500 8200674
    Compaction migrate scanned 42290478 38832618 153961130 154143900 159141197
    Compaction free scanned 89199429 79189151 356529027 351943166 356326727
    Compaction cost 1566 1426 5312 5156 5294
    NUMA PTE updates 0 0 0 0 0
    NUMA hint faults 0 0 0 0 0
    NUMA hint local faults 0 0 0 0 0
    NUMA hint local percent 100 100 100 100 100
    NUMA pages migrated 0 0 0 0 0
    AutoNUMA cost 0 0 0 0 0

    Observations:

    - The "Success 3" line is allocation success rate with system idle
    (phases 1 and 2 are with background interference). I used to get stable
    values around 85% with vanilla 3.11. The lower min and mean values came
    with 3.12. This was bisected to commit 81c0a2bb ("mm: page_alloc: fair
    zone allocator policy") As explained in comment for patch 3, I don't
    think the commit is wrong, but that it makes the effect of compaction
    bugs worse. From patch 3 onwards, the results are OK and match the 3.11
    results.

    - Patch 4 also clearly helps phases 1 and 2, and exceeds any results
    I've seen with 3.11 (I didn't measure it that thoroughly then, but it
    was never above 40%).

    - Compaction cost and the number of scanned pages are higher, especially
    due to patch 4. However, keep in mind that patches 3 and 4 fix existing
    bugs in the current design of compaction overhead mitigation; they do
    not change it. If overhead is found unacceptable, then it should be
    decreased differently (and consistently, not due to random conditions)
    than the current implementation does. In contrast, patches 5 and 6
    (which are not strictly bug fixes) do not increase the overhead (but
    also do not increase success rates). This might be a limitation of the
    stress-highalloc benchmark as it's quite uniform.

    Another set of results is from configuring stress-highalloc to allocate
    with similar flags as THP uses:
    (GFP_HIGHUSER_MOVABLE|__GFP_NOMEMALLOC|__GFP_NORETRY|__GFP_NO_KSWAPD)

    stress-highalloc
    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-thp 3-thp 4-thp 5-thp 6-thp
    Success 1 Min 2.00 ( 0.00%) 7.00 (-250.00%) 18.00 (-800.00%) 19.00 (-850.00%) 26.00 (-1200.00%)
    Success 1 Mean 19.20 ( 0.00%) 17.80 ( 7.29%) 29.20 (-52.08%) 29.90 (-55.73%) 32.80 (-70.83%)
    Success 1 Max 27.00 ( 0.00%) 29.00 ( -7.41%) 35.00 (-29.63%) 36.00 (-33.33%) 37.00 (-37.04%)
    Success 2 Min 3.00 ( 0.00%) 8.00 (-166.67%) 21.00 (-600.00%) 21.00 (-600.00%) 32.00 (-966.67%)
    Success 2 Mean 19.30 ( 0.00%) 17.90 ( 7.25%) 32.20 (-66.84%) 32.60 (-68.91%) 35.70 (-84.97%)
    Success 2 Max 27.00 ( 0.00%) 30.00 (-11.11%) 36.00 (-33.33%) 37.00 (-37.04%) 39.00 (-44.44%)
    Success 3 Min 62.00 ( 0.00%) 62.00 ( 0.00%) 85.00 (-37.10%) 75.00 (-20.97%) 64.00 ( -3.23%)
    Success 3 Mean 66.30 ( 0.00%) 65.50 ( 1.21%) 85.60 (-29.11%) 83.40 (-25.79%) 83.50 (-25.94%)
    Success 3 Max 70.00 ( 0.00%) 69.00 ( 1.43%) 87.00 (-24.29%) 86.00 (-22.86%) 87.00 (-24.29%)

    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-thp 3-thp 4-thp 5-thp 6-thp
    User 6547.93 6475.85 6265.54 6289.46 6189.96
    System 1053.42 1047.28 1043.23 1042.73 1038.73
    Elapsed 1835.43 1821.96 1908.67 1912.74 1956.38

    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-thp 3-thp 4-thp 5-thp 6-thp
    Minor Faults 256805673 253106328 253222299 249830289 251184418
    Major Faults 395 375 423 434 448
    Swap Ins 12 10 10 12 9
    Swap Outs 530 537 487 455 415
    Direct pages scanned 71859 86046 153244 152764 190713
    Kswapd pages scanned 1900994 1870240 1898012 1892864 1880520
    Kswapd pages reclaimed 1897814 1867428 1894939 1890125 1877924
    Direct pages reclaimed 71766 85908 153167 152643 190600
    Kswapd efficiency 99% 99% 99% 99% 99%
    Kswapd velocity 1029.000 1067.782 1000.091 991.049 951.218
    Direct efficiency 99% 99% 99% 99% 99%
    Direct velocity 38.897 49.127 80.747 79.983 96.468
    Percentage direct scans 3% 4% 7% 7% 9%
    Zone normal velocity 351.377 372.494 348.910 341.689 335.310
    Zone dma32 velocity 716.520 744.414 731.928 729.343 712.377
    Zone dma velocity 0.000 0.000 0.000 0.000 0.000
    Page writes by reclaim 669.300 604.000 545.700 538.900 429.900
    Page writes file 138 66 58 83 14
    Page writes anon 530 537 487 455 415
    Page reclaim immediate 806 655 772 548 517
    Sector Reads 2711956 2703239 2811602 2818248 2839459
    Sector Writes 12163238 12018662 12038248 11954736 11994892
    Page rescued immediate 0 0 0 0 0
    Slabs scanned 1385088 1388364 1507968 1513292 1558656
    Direct inode steals 1739 2564 4622 5496 6007
    Kswapd inode steals 47461 46406 47804 48013 48466
    Kswapd skipped wait 0 0 0 0 0
    THP fault alloc 110 82 84 69 70
    THP collapse alloc 445 482 467 462 539
    THP splits 6 5 4 5 3
    THP fault fallback 3 0 0 0 0
    THP collapse fail 15 14 14 14 13
    Compaction stalls 659 685 1033 1073 1111
    Compaction success 222 225 410 427 456
    Compaction failures 436 460 622 646 655
    Page migrate success 446594 439978 1085640 1095062 1131716
    Page migrate failure 0 0 0 0 0
    Compaction pages isolated 1029475 1013490 2453074 2482698 2565400
    Compaction migrate scanned 9955461 11344259 24375202 27978356 30494204
    Compaction free scanned 27715272 28544654 80150615 82898631 85756132
    Compaction cost 552 555 1344 1379 1436
    NUMA PTE updates 0 0 0 0 0
    NUMA hint faults 0 0 0 0 0
    NUMA hint local faults 0 0 0 0 0
    NUMA hint local percent 100 100 100 100 100
    NUMA pages migrated 0 0 0 0 0
    AutoNUMA cost 0 0 0 0 0

    There are some differences from the previous results for THP-like allocations:

    - Here, the bad result for the unpatched kernel in phase 3 is much more
    consistent, staying between 65-70%, and is not related to the
    "regression" in 3.12. Still, there is the improvement from patch 4
    onwards, which brings it on par with simple GFP_HIGHUSER_MOVABLE
    allocations.

    - Compaction costs have increased, but nowhere near as much as in the
    non-THP case. Again, the patches should be worth the gained
    determinism.

    - Patches 5 and 6 somewhat increase the number of migrate-scanned pages.
    This is most likely due to the __GFP_NO_KSWAPD flag, which means the
    cached pfn's and pageblock skip bits are not reset by kswapd that often
    (at least in phase 3, where no concurrent activity would wake up kswapd)
    and the patches thus help the sync-after-async compaction. It doesn't,
    however, show that sync compaction helps that much with success rates,
    which can again be seen as a limitation of the benchmark scenario.

    This patch (of 6):

    Add two tracepoints for compaction begin and end of a zone. Using these
    it is possible to calculate how much time a workload is spending within
    compaction and potentially debug problems related to cached pfns for
    scanning. In combination with the direct reclaim and slab trace points it
    should be possible to estimate most allocation-related overhead for a
    workload.

    Signed-off-by: Mel Gorman
    Signed-off-by: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

19 Dec, 2013

1 commit

  • update_pageblock_skip() only makes sense for compaction, which isolates
    in pageblock units. If isolate_migratepages_range() is called by CMA, it
    tries to isolate regardless of pageblock units and doesn't reference
    get_pageblock_skip(), because of ignore_skip_hint. We should also
    respect ignore_skip_hint in update_pageblock_skip() to prevent it from
    setting the wrong information.
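
    A minimal sketch of the guard (simplified; the real
    update_pageblock_skip() also updates the zone's cached scanner pfn's):

    #include <stdbool.h>

    struct compact_control { bool ignore_skip_hint; };
    struct page { bool skip; };     /* stand-in for the pageblock skip bit */

    static void update_pageblock_skip(struct compact_control *cc,
                                      struct page *page,
                                      unsigned long nr_isolated)
    {
            /* Callers that ignore the skip hints (CMA) must not write them
             * either; otherwise per-page calls would set misleading
             * pageblock-granular information used by compaction. */
            if (cc->ignore_skip_hint)
                    return;

            if (!nr_isolated)
                    page->skip = true;
    }

    int main(void)
    {
            struct compact_control cma = { .ignore_skip_hint = true };
            struct page pb = { .skip = false };

            update_pageblock_skip(&cma, &pb, 0);
            return pb.skip;         /* expect 0: hint left untouched */
    }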

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Wanpeng Li
    Cc: Christoph Lameter
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Zhang Yanfei
    Cc: [3.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

01 Oct, 2013

1 commit

  • We've been getting warnings about an excessive amount of time spent
    allocating pages for migration during memory compaction without
    scheduling. isolate_freepages_block() already periodically checks for
    contended locks or the need to schedule, but isolate_freepages() never
    does.

    When a zone is massively long and no suitable targets can be found, this
    iteration can be quite expensive without ever doing cond_resched().

    Check periodically for the need to reschedule while the compaction free
    scanner iterates.
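
    A sketch of the shape of the fix, with a stub standing in for the
    kernel's cond_resched() (the loop bounds and pageblock size are
    illustrative):

    #include <stdio.h>

    #define PAGEBLOCK_NR_PAGES 512

    static void cond_resched_stub(void)
    {
            /* In the kernel this is cond_resched(); it yields the CPU when a
             * reschedule is pending, bounding the latency of long scans. */
    }

    int main(void)
    {
            unsigned long zone_end = 1UL << 20, pfn;

            /* Outer free-scanner loop, one step per pageblock, scanning from
             * the end of the zone downwards.  Even if no suitable pageblock
             * is ever found, the periodic call keeps the loop preemptible. */
            for (pfn = zone_end; pfn >= PAGEBLOCK_NR_PAGES;
                 pfn -= PAGEBLOCK_NR_PAGES)
                    cond_resched_stub();

            printf("scanned down to pfn %lu\n", pfn);
            return 0;
    }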

    Signed-off-by: David Rientjes
    Reviewed-by: Rik van Riel
    Reviewed-by: Wanpeng Li
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

12 Sep, 2013

1 commit

  • If kswapd was reclaiming for a high order and resets it to 0 due to
    fragmentation, it will still call compact_pgdat(). For the most part, this
    will fail a compaction_suitable() test and not compact but it is
    unnecessarily sloppy. It could be fixed in the caller but fix it in the
    API instead.

    [dhillf@gmail.com: pointed out that it was a potential problem]
    Signed-off-by: Mel Gorman
    Cc: Hillf Danton
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

24 Feb, 2013

5 commits

  • Add 2 helpers (zone_end_pfn() and zone_spans_pfn()) to reduce code
    duplication.

    This also switches to using them in compaction (where an additional
    variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and
    kmemleak.

    Note that in compaction.c I avoid calling zone_end_pfn() repeatedly
    because I expect at some point the synchronization issues with start_pfn &
    spanned_pages will need fixing, either by actually using the seqlock or
    clever memory barrier usage.
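
    A standalone sketch of the two helpers with a reduced struct zone (the
    in-kernel versions live in include/linux/mmzone.h):

    #include <stdbool.h>

    struct zone {
            unsigned long zone_start_pfn;
            unsigned long spanned_pages;
    };

    /* One past the last pfn the zone spans. */
    static inline unsigned long zone_end_pfn(const struct zone *zone)
    {
            return zone->zone_start_pfn + zone->spanned_pages;
    }

    /* True if pfn lies within [zone_start_pfn, zone_end_pfn). */
    static inline bool zone_spans_pfn(const struct zone *zone,
                                      unsigned long pfn)
    {
            return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
    }

    int main(void)
    {
            struct zone z = { .zone_start_pfn = 0x1000,
                              .spanned_pages = 0x4000 };

            return !(zone_spans_pfn(&z, 0x1000) &&
                     !zone_spans_pfn(&z, zone_end_pfn(&z)));
    }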

    Signed-off-by: Cody P Schafer
    Cc: David Hansen
    Cc: Catalin Marinas
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • No functional change, but the only purpose of the offlining argument to
    migrate_pages() etc, was to ensure that __unmap_and_move() could migrate a
    KSM page for memory hotremove (which took ksm_thread_mutex) but not for
    other callers. Now all cases are safe, remove the arg.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Several functions test MIGRATE_ISOLATE and some of those are hotpath, but
    MIGRATE_ISOLATE is used only if we enable CONFIG_MEMORY_ISOLATION (i.e.,
    CMA, memory-hotplug and memory-failure), which are not common config
    options. So let's not add unnecessary overhead and code when we don't
    enable CONFIG_MEMORY_ISOLATION.

    Signed-off-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • These functions always return 0. Formalise this.

    Cc: Jason Liu
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Compaction uses the ALIGN macro incorrectly with the migrate scanner by
    adding pageblock_nr_pages to a PFN. It happened to work when initially
    implemented as the starting PFN was also aligned but with caching
    restarts and isolating in smaller chunks this is no longer always true.

    The impact is that the migrate scanner scans outside its current
    pageblock. As pfn_valid() is still checked properly it does not cause
    any failure, and the impact of the bug is that in some cases it will scan
    more than necessary when it crosses a pageblock boundary, but by no more
    than COMPACT_CLUSTER_MAX. It is highly unlikely this is even measurable but
    it's still wrong so this patch addresses the problem.
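
    A worked example of the off-by-one, assuming the usual round-up ALIGN()
    semantics and an illustrative pageblock_nr_pages of 512:

    #include <stdio.h>

    #define ALIGN(x, a)        (((x) + (a) - 1) & ~((a) - 1))
    #define pageblock_nr_pages 512UL

    int main(void)
    {
            /* A migrate scanner restarting mid-pageblock from a cached pfn,
             * inside the pageblock covering pfns 512..1023. */
            unsigned long pfn = 1000;

            /* Buggy form: rounds up *after* adding a whole pageblock, so the
             * end lands one pageblock too far (1536) when pfn is unaligned;
             * it only gave the right answer when pfn was already aligned. */
            unsigned long bad_end = ALIGN(pfn + pageblock_nr_pages,
                                          pageblock_nr_pages);

            /* Intended value: the end of the pageblock containing pfn. */
            unsigned long good_end = ALIGN(pfn + 1, pageblock_nr_pages);

            printf("bad_end=%lu good_end=%lu\n", bad_end, good_end);
            return 0;
    }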

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

12 Jan, 2013

2 commits

  • Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
    waiting for POLLIN on a local TCP socket. It was easier to trigger if
    there was disk IO and dirty pages at the same time and he bisected it to
    commit 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page
    immediately when it is made available").

    The intention of that patch was to improve high-order allocations under
    memory pressure after changes made to reclaim in 3.6 drastically hurt
    THP allocations but the approach was flawed. For Eric, the problem was
    that page->pfmemalloc was not being cleared for captured pages leading
    to a poor interaction with swap-over-NFS support causing the packets to
    be dropped. However, I identified a few more problems with the patch
    including the fact that it can increase contention on zone->lock in some
    cases which could result in async direct compaction being aborted early.

    In retrospect the capture patch took the wrong approach. What it should
    have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
    was allocating for THP and avoided races that way. While the patch was
    showing to improve allocation success rates at the time, the benefit is
    marginal given the relative complexity and it should be revisited from
    scratch in the context of the other reclaim-related changes that have
    taken place since the patch was first written and tested. This patch
    partially reverts commit 1fb3f8ca0e92 ("mm: compaction: capture a
    suitable high-order page immediately when it is made available").

    Reported-and-tested-by: Eric Wong
    Tested-by: Eric Dumazet
    Cc:
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When running the following command under a shell, it returns an error:

    sh/$ echo 1 > /proc/sys/vm/compact_memory
    sh/$ sh: write error: Bad address

    After strace, I found the following log:

    ...
    write(1, "1\n", 2) = 3
    write(1, "", 4294967295) = -1 EFAULT (Bad address)
    write(2, "echo: write error: Bad address\n", 31echo: write error: Bad address
    ) = 31

    This shows the system returned 3 (COMPACT_COMPLETE) after writing data to
    compact_memory.

    The fix is to make the system just return 0 instead of 3
    (COMPACT_COMPLETE) from sysctl_compaction_handler after compact_nodes()
    has finished.
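
    A sketch of the handler shape after the fix (heavily simplified
    signature; the real sysctl_compaction_handler() takes the usual
    ctl_table/buffer/length arguments):

    #include <stdbool.h>

    #define COMPACT_COMPLETE 3      /* illustrative value from the report */

    static int compact_nodes(void)
    {
            return COMPACT_COMPLETE;   /* pretend all zones completed */
    }

    /* Run compaction for its side effects but report 0 to the write(2)
     * path.  Propagating COMPACT_COMPLETE (3) made write() appear to
     * consume 3 bytes of a 2-byte buffer, so the shell retried with a
     * bogus length and hit the EFAULT seen in the strace above. */
    static int sysctl_compaction_handler(bool write)
    {
            if (write)
                    compact_nodes();   /* status deliberately ignored */

            return 0;
    }

    int main(void)
    {
            return sysctl_compaction_handler(true);
    }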

    Signed-off-by: Jason Liu
    Suggested-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Liu
     

21 Dec, 2012

1 commit

  • isolate_freepages_block() and isolate_migratepages_range() are used for
    CMA as well as compaction, so the build breaks for CONFIG_CMA &&
    !CONFIG_COMPACTION.

    This patch fixes it.

    [akpm@linux-foundation.org: add "do { } while (0)", per Mel]
    Signed-off-by: Minchan Kim
    Cc: Mel Gorman
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for the workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

13 Dec, 2012

1 commit

  • compact_capture_page() is only used if compaction is enabled so it should
    be moved into the corresponding #ifdef.

    Signed-off-by: Thierry Reding
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thierry Reding
     

12 Dec, 2012

2 commits

  • The PATCH "mm: introduce compaction and migration for virtio ballooned pages"
    hacks around putback_lru_pages() in order to allow ballooned pages to be
    re-inserted on balloon page list as if a ballooned page was like a LRU page.

    As ballooned pages are not legitimate LRU pages, this patch introduces
    putback_movable_pages() to properly cope with cases where the isolated
    pageset contains ballooned pages and LRU pages, thus fixing the mentioned
    inelegant hack around putback_lru_pages().

    Signed-off-by: Rafael Aquini
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • Memory fragmentation introduced by ballooning might reduce significantly
    the number of 2MB contiguous memory blocks that can be used within a guest,
    thus imposing performance penalties associated with the reduced number of
    transparent huge pages that could be used by the guest workload.

    This patch introduces the helper functions as well as the necessary changes
    to teach compaction and migration bits how to cope with pages which are
    part of a guest memory balloon, in order to make them movable by memory
    compaction procedures.

    Signed-off-by: Rafael Aquini
    Acked-by: Mel Gorman
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     

11 Dec, 2012

3 commits

  • Compaction already has tracepoints to count scanned and isolated pages
    but it requires that ftrace be enabled and if that information has to be
    written to disk then it can be disruptive. This patch adds vmstat counters
    for compaction called compact_migrate_scanned, compact_free_scanned and
    compact_isolated.

    With these counters, it is possible to define a basic cost model for
    compaction. This approximates how much work compaction is doing and can
    be compared with an oprofile showing TLB misses to see if the cost of
    compaction is being offset by THP, for example. Minimally, a compaction
    patch can be evaluated in terms of whether it increases or decreases
    cost. The basic cost model looks like this:

    Fundamental unit u: a word sizeof(void *)

    Ca = cost of struct page access = sizeof(struct page) / u

    Cmc = Cost migrate page copy = (Ca + PAGE_SIZE/u) * 2
    Cmf = Cost migrate failure = Ca * 2
    Ci = Cost page isolation = (Ca + Wi)
    where Wi is a constant that should reflect the approximate
    cost of the locking operation.

    Csm = Cost migrate scanning = Ca
    Csf = Cost free scanning = Ca

    Overall cost = (Csm * compact_migrate_scanned) +
    (Csf * compact_free_scanned) +
    (Ci * compact_isolated) +
    (Cmc * pgmigrate_success) +
    (Cmf * pgmigrate_failed)

    Where the values are read from /proc/vmstat.

    This is very basic and ignores certain costs such as the allocation cost
    to do a migrate page copy but any improvement to the model would still
    use the same vmstat counters.
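
    A small worked example of the model, plugging in illustrative counter
    values (sizeof(struct page) of 64 bytes, the Wi locking constant, and the
    counts themselves are assumptions; real values come from /proc/vmstat):

    #include <stdio.h>

    int main(void)
    {
            /* Fundamental unit u and per-operation costs from the model. */
            double u   = sizeof(void *);
            double Ca  = 64.0 / u;              /* struct page access      */
            double Cmc = (Ca + 4096.0 / u) * 2; /* migrate page copy (4K)  */
            double Cmf = Ca * 2;                /* migrate failure         */
            double Wi  = 1.0;                   /* guessed locking cost    */
            double Ci  = Ca + Wi;               /* page isolation          */
            double Csm = Ca, Csf = Ca;          /* migrate/free scanning   */

            /* Illustrative counter values standing in for /proc/vmstat. */
            double compact_migrate_scanned = 42290478;
            double compact_free_scanned    = 89199429;
            double compact_isolated        = 2547248;
            double pgmigrate_success       = 1177325;
            double pgmigrate_fail          = 0;

            double cost = Csm * compact_migrate_scanned +
                          Csf * compact_free_scanned +
                          Ci  * compact_isolated +
                          Cmc * pgmigrate_success +
                          Cmf * pgmigrate_fail;

            printf("overall cost: %.0f word-sized units\n", cost);
            return 0;
    }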

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Mel Gorman
     
  • The pgmigrate_success and pgmigrate_fail vmstat counters tell the user
    about migration activity but not the type or the reason. This patch adds
    a tracepoint to identify the type of page migration and why the page is
    being migrated.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Mel Gorman
     
  • The compact_pages_moved and compact_pagemigrate_failed events are
    convenient for determining if compaction is active and to what
    degree migration is succeeding but it's at the wrong level. Other
    users of migration may also want to know if migration is working
    properly and this will be particularly true for any automated
    NUMA migration. This patch moves the counters down to migration
    with the new events called pgmigrate_success and pgmigrate_fail.
    The compact_blocks_moved counter is removed because while it was
    useful for debugging initially, it's worthless now as no meaningful
    conclusions can be drawn from its value.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Mel Gorman
     

07 Dec, 2012

1 commit

  • Commit 0bf380bc70ec ("mm: compaction: check pfn_valid when entering a
    new MAX_ORDER_NR_PAGES block during isolation for migration") added a
    check for pfn_valid() when isolating pages for migration as the scanner
    does not necessarily start pageblock-aligned.

    Since commit c89511ab2f8f ("mm: compaction: Restart compaction from near
    where it left off"), the free scanner has the same problem. This patch
    makes sure that the pfn range passed to isolate_freepages_block() is
    within the same block so that pfn_valid() checks are unnecessary.

    In answer to Henrik's wondering why others have not reported this:
    reproducing this requires a large enough hole with the right alignment
    to have compaction walk into a PFN range with no memmap. Size and
    alignment depend on the memory model - 4M for FLATMEM and 128M for
    SPARSEMEM on x86. It needs a "lucky" machine.

    Reported-by: Henrik Rydberg
    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

20 Oct, 2012

1 commit

  • Thierry reported that the "iron out" patch for isolate_freepages_block()
    had problems due to the strict check being too strict; this was addressed
    by "mm: compaction: Iron out isolate_freepages_block() and
    isolate_freepages_range() -fix1". It's possible that more pages than
    necessary are isolated but the check still fails, and I missed that this
    fix was not picked up before RC1. The same problem has been identified
    in 3.7-RC1 by Tony Prisk and should be addressed by the following patch.

    Signed-off-by: Mel Gorman
    Tested-by: Tony Prisk
    Reported-by: Thierry Reding
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

09 Oct, 2012

1 commit

  • Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
    contiguous memory space.

    This patch makes mlocked pages be migrated out. Of course, it can affect
    realtime processes, but in the CMA use case, a contiguous memory
    allocation failing is far worse than the access latency to an mlocked
    page being variable while CMA is running. If someone wants to make the
    system realtime, he shouldn't enable CMA because stalls can still happen
    at random times.

    [akpm@linux-foundation.org: tweak comment text, per Mel]
    Signed-off-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim