08 Apr, 2014

6 commits

  • The conditions that control the isolation mode in
    isolate_migratepages_range() do not change during the iteration, so
    extract them out and only define the value once.

    This actually does have an effect; gcc doesn't perform this optimization
    itself because of cc->sync.
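
    As a rough standalone sketch of the hoisting (illustrative types and flag
    values, not the actual kernel code; scan_range() merely stands in for
    isolate_migratepages_range()):

    #include <stdbool.h>
    #include <stdio.h>

    #define ISOLATE_ASYNC_MIGRATE 0x1       /* illustrative flag value */

    struct compact_control { bool sync; };

    static void scan_range(struct compact_control *cc, unsigned long start,
                           unsigned long end)
    {
            /* The mode depends only on cc->sync, which cannot change during
             * the iteration, so compute it once instead of per page. */
            unsigned long mode = cc->sync ? 0 : ISOLATE_ASYNC_MIGRATE;
            unsigned long pfn;

            for (pfn = start; pfn < end; pfn++)
                    printf("pfn %lu isolated with mode %#lx\n", pfn, mode);
    }

    int main(void)
    {
            struct compact_control cc = { .sync = false };

            scan_range(&cc, 0, 4);
            return 0;
    }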

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This is just a clean-up to reduce code size and improve readability.
    There is no functional change.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • isolation_suitable() and migrate_async_suitable() are used to verify
    that a pageblock range is suitable for migration. They don't need to be
    called on every page. The current code handles the unsuitable case
    correctly, but not the suitable one:

    1) It re-checks isolation_suitable() on each page of a pageblock that was
    already established as suitable.
    2) It re-checks migrate_async_suitable() on each page of a pageblock that
    was not entered through the next_pageblock: label, because
    last_pageblock_nr is not otherwise updated.

    This patch fixes the situation by 1) calling isolation_suitable() only once
    per pageblock and 2) always updating last_pageblock_nr to the pageblock
    that was just checked.

    Additionally, move the PageBuddy() check after the pageblock unit check,
    since the pageblock check is the first thing we should do, and this makes
    things simpler.
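
    A simplified standalone sketch of the once-per-pageblock pattern
    (illustrative names; pageblock_suitable() and PAGEBLOCK_NR_PAGES stand in
    for isolation_suitable() and pageblock_nr_pages):

    #include <stdbool.h>
    #include <stdio.h>

    #define PAGEBLOCK_NR_PAGES 512          /* illustrative pageblock size */

    static bool pageblock_suitable(unsigned long pb)
    {
            return pb % 2 == 0;             /* stand-in for the real check */
    }

    int main(void)
    {
            unsigned long pfn, last_pageblock_nr = (unsigned long)-1;
            bool suitable = false;

            for (pfn = 0; pfn < 2048; pfn++) {
                    unsigned long pb = pfn / PAGEBLOCK_NR_PAGES;

                    /* Run the pageblock-level check only on entering a new
                     * pageblock, and always record the pageblock checked. */
                    if (pb != last_pageblock_nr) {
                            last_pageblock_nr = pb;
                            suitable = pageblock_suitable(pb);
                            printf("pageblock %lu: %s\n", pb,
                                   suitable ? "suitable" : "skip");
                    }
                    if (!suitable)
                            continue;
                    /* per-page isolation work would go here */
            }
            return 0;
    }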

    [vbabka@suse.cz: rephrase commit description]
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • It is odd to drop the spinlock when we scan the (SWAP_CLUSTER_MAX - 1)th
    pfn page. This may result in the situation below while isolating migrate
    pages:

    1. Try to isolate pfn pages 0x0 ~ 0x200.
    2. When low_pfn is 0x1ff, ((low_pfn+1) % SWAP_CLUSTER_MAX) == 0, so drop
    the spinlock.
    3. Then, to complete the isolation, retry to acquire the lock.

    It is better to use the SWAP_CLUSTER_MAXth pfn as the criterion for
    dropping the lock. This does no harm at pfn 0x0, because at that point
    the locked variable would be false.
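
    A worked standalone example of the two checks, assuming SWAP_CLUSTER_MAX
    is 32 (the code is illustrative, not the kernel loop itself):

    #include <stdio.h>

    #define SWAP_CLUSTER_MAX 32

    int main(void)
    {
            unsigned long low_pfn;

            for (low_pfn = 0x0; low_pfn < 0x200; low_pfn++) {
                    /* Old check: fires on the (SWAP_CLUSTER_MAX - 1)th pfn,
                     * e.g. at 0x1ff, just before the range completes. */
                    int old_drop = ((low_pfn + 1) % SWAP_CLUSTER_MAX) == 0;
                    /* New check: fires on the SWAP_CLUSTER_MAXth pfn; at 0x0
                     * it is harmless as the lock is not yet held there. */
                    int new_drop = (low_pfn % SWAP_CLUSTER_MAX) == 0;

                    if (old_drop || new_drop)
                            printf("pfn 0x%03lx: old=%d new=%d\n",
                                   low_pfn, old_drop, new_drop);
            }
            return 0;
    }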

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • suitable_migration_target() checks whether a pageblock is a suitable
    migration target. In isolate_freepages_block() it is called on every
    page, which is inefficient, so make it called only once per pageblock.

    suitable_migration_target() also checks whether the page is high-order,
    but its criterion for high-order is the pageblock order, so calling it
    once per pageblock range causes no problem.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The purpose of compaction is to get a high-order page. Currently, if we
    find a high-order page while searching for migration target pages, we
    break it into order-0 pages and use them as migration targets. This runs
    contrary to the purpose of compaction, so disallow high-order pages from
    being used as migration targets.

    Additionally, clean up the logic in suitable_migration_target() to
    simplify the code. There are no functional changes from this clean-up.
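
    A simplified sketch of the resulting policy (illustrative types; the real
    suitable_migration_target() works on struct page via PageBuddy(),
    page_order() and the pageblock migratetype):

    #include <stdbool.h>

    #define PAGEBLOCK_ORDER 9               /* illustrative */

    enum migratetype { MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_CMA };

    struct page {
            bool buddy;                     /* stand-in for PageBuddy()    */
            unsigned int order;             /* stand-in for page_order()   */
            enum migratetype mt;            /* pageblock migratetype       */
    };

    /* A free page of pageblock order (or larger) is exactly what compaction
     * is trying to create, so refuse to break it up as a migration target. */
    static bool suitable_migration_target(const struct page *page)
    {
            if (page->buddy && page->order >= PAGEBLOCK_ORDER)
                    return false;

            return page->mt == MIGRATE_MOVABLE || page->mt == MIGRATE_CMA;
    }

    int main(void)
    {
            struct page huge = { .buddy = true, .order = 10,
                                 .mt = MIGRATE_MOVABLE };

            return suitable_migration_target(&huge);  /* 0: rejected */
    }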

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

04 Apr, 2014

3 commits

  • Mark function as static in compaction.c because it is not used outside
    this file.

    This eliminates the following warning from mm/compaction.c:

    mm/compaction.c:1190:9: warning: no previous prototype for `sysfs_compact_node' [-Wmissing-prototypes]

    Signed-off-by: Rashika Kheria
    Reviewed-by: Josh Triplett
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rashika Kheria
     
  • Page migration will fail for memory that is pinned in memory with, for
    example, get_user_pages(). In this case, it is unnecessary to take
    zone->lru_lock or to isolate the page and pass it to page migration,
    which will ultimately fail.

    This check is racy (the page can still change from under us), but in that
    case we'll just fail later when attempting to move the page.
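
    The check is likely along the lines of comparing a page's reference count
    with its map count; a simplified standalone sketch (the fields stand in
    for page_count(), page_mapcount() and the absence of page_mapping()):

    #include <stdbool.h>

    struct page {
            int count;      /* stand-in for page_count(): total references */
            int mapcount;   /* stand-in for page_mapcount(): pte mappings  */
            bool anon;      /* stand-in for "no page_mapping()"            */
    };

    /* Racy heuristic: if an anonymous page holds more references than pte
     * mappings, something like get_user_pages() has pinned it and migration
     * would fail anyway, so don't bother isolating it.  A wrong guess only
     * means migration fails later, as it would have regardless. */
    static bool skip_pinned(const struct page *page)
    {
            return page->anon && page->count > page->mapcount;
    }

    int main(void)
    {
            struct page pinned = { .count = 3, .mapcount = 1, .anon = true };
            struct page plain  = { .count = 1, .mapcount = 1, .anon = true };

            return !(skip_pinned(&pinned) && !skip_pinned(&plain));
    }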

    This avoids very expensive memory compaction when faulting transparent
    hugepages after pinning a lot of memory with a Mellanox driver.

    On a 128GB machine with ~120GB of memory pinned, before this patch we
    see an enormous disparity in the number of page migration failures
    because of the pinning (from /proc/vmstat):

    compact_pages_moved 8450
    compact_pagemigrate_failed 15614415

    0.05% of pages isolated are successfully migrated and explicitly
    triggering memory compaction takes 102 seconds. After the patch:

    compact_pages_moved 9197
    compact_pagemigrate_failed 7

    99.9% of pages isolated are now successfully migrated in this
    configuration and memory compaction takes less than one second.

    Signed-off-by: David Rientjes
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Rik van Riel
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The cached pageblock hint should be ignored when triggering compaction
    through /proc/sys/vm/compact_memory so that all eligible memory is
    isolated. Manually invoking compaction is known to be expensive and is
    mainly done for debugging, so there's no need to skip pageblocks based
    on heuristics.

    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

11 Mar, 2014

1 commit

  • We received several reports of bad page state when freeing CMA pages
    previously allocated with alloc_contig_range:

    BUG: Bad page state in process Binder_A pfn:63202
    page:d21130b0 count:0 mapcount:1 mapping: (null) index:0x7dfbf
    page flags: 0x40080068(uptodate|lru|active|swapbacked)

    Based on the page state, it looks like the page was still in use. The
    page flags do not make sense for the use case though. Further debugging
    showed that despite alloc_contig_range returning success, at least one
    page in the range still remained in the buddy allocator.

    There is an issue with isolate_freepages_block. In strict mode (which
    CMA uses), if any pages in the range cannot be isolated,
    isolate_freepages_block should return failure (0). The current check
    keeps track of the total number of isolated pages and compares it
    against the size of the range:

    if (strict && nr_strict_required > total_isolated)
    total_isolated = 0;

    After taking the zone lock, if one of the pages in the range is not in
    the buddy allocator, we continue through the loop and do not increment
    total_isolated. If in the last iteration of the loop we isolate more
    than one page (e.g. the last page needed is a higher-order page), the
    check for total_isolated may pass and we fail to detect that a page was
    skipped. The fix is to bail out of the loop immediately if we are in
    strict mode. There's no benefit to continuing anyway since we need all
    pages to be isolated. Additionally, drop the error checking based on
    nr_strict_required and just check the pfn ranges. This matches what
    isolate_freepages_range does.
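
    A simplified sketch of the fixed control flow (illustrative helpers; the
    real isolate_freepages_block() isolates buddy pages of varying order
    under the zone lock):

    #include <stdbool.h>

    /* Stand-in for "this pfn is a free buddy page we can isolate". */
    static bool page_is_free(unsigned long pfn)
    {
            return pfn != 5;                /* pretend pfn 5 is not free */
    }

    static unsigned long isolate_free_range(unsigned long start,
                                            unsigned long end, bool strict)
    {
            unsigned long pfn, total_isolated = 0;

            for (pfn = start; pfn < end; pfn++) {
                    if (!page_is_free(pfn)) {
                            if (strict)
                                    break;  /* CMA needs every page */
                            continue;
                    }
                    total_isolated++;       /* may isolate >1 in real code */
            }

            /* Strict callers require the whole range: report failure (0) if
             * the scan stopped early, instead of comparing counts that a
             * final high-order page can satisfy by accident. */
            if (strict && pfn < end)
                    total_isolated = 0;

            return total_isolated;
    }

    int main(void)
    {
            return isolate_free_range(0, 10, true) != 0;  /* expect 0 */
    }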

    Signed-off-by: Laura Abbott
    Acked-by: Minchan Kim
    Cc: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Acked-by: Bartlomiej Zolnierkiewicz
    Acked-by: Michal Nazarewicz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     

24 Jan, 2014

2 commits

  • Developers occasionally try to optimise PFN scanners by using page_order
    but miss that, in general, it requires zone->lock. This has happened
    twice for compaction.c and been rejected both times. This patch
    clarifies the documentation of page_order and adds a note to
    compaction.c explaining why page_order is not used.

    [akpm@linux-foundation.org: tweaks]
    [lauraa@codeaurora.org: Corrected a page_zone(page)->lock reference]
    Signed-off-by: Mel Gorman
    Acked-by: Rafael Aquini
    Acked-by: Minchan Kim
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the
    page at various VM_BUG_ON sites, I've noticed that the page dump is quite
    useful to people debugging issues in mm.

    This patch adds VM_BUG_ON_PAGE(cond, page) which, beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.
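
    A standalone sketch of the idea (simplified: the in-kernel macro calls
    dump_page() and BUG(); here abort() stands in for BUG()):

    #include <stdio.h>
    #include <stdlib.h>

    struct page { unsigned long flags; int count; };

    static void dump_page(const struct page *page)
    {
            fprintf(stderr, "page flags:%#lx count:%d\n",
                    page->flags, page->count);
    }

    /* Like an assertion on a page, but dump the page's state first so the
     * report contains the information needed to debug the failure. */
    #define VM_BUG_ON_PAGE(cond, page)                                \
            do {                                                      \
                    if (cond) {                                       \
                            dump_page(page);                          \
                            abort();    /* the kernel would BUG() */  \
                    }                                                 \
            } while (0)

    int main(void)
    {
            struct page page = { .flags = 0, .count = 1 };

            VM_BUG_ON_PAGE(page.count != 1, &page);   /* passes silently */
            return 0;
    }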

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

22 Jan, 2014

6 commits

  • Compaction used to start its migrate and free page scanners at the zone's
    lowest and highest pfn, respectively. Later, caching was introduced to
    remember the scanners' progress across compaction attempts so that
    pageblocks are not re-scanned uselessly. Additionally, pageblocks where
    isolation failed are marked to be quickly skipped when encountered again
    in future compactions.

    Currently, both the reset of the cached pfn's and the clearing of the
    pageblock skip information for a zone are done in
    __reset_isolation_suitable(). This function gets called when:

    - compaction is restarting after being deferred
    - compact_blockskip_flush flag is set in compact_finished() when the scanners
    meet (and not again cleared when direct compaction succeeds in allocation)
    and kswapd acts upon this flag before going to sleep

    This behavior is suboptimal for several reasons:

    - when direct sync compaction is called after async compaction fails (in the
    allocation slowpath), it will effectively do nothing, unless kswapd
    happens to process the compact_blockskip_flush flag meanwhile. This is racy
    and goes against the purpose of sync compaction to more thoroughly retry
    the compaction of a zone where async compaction has failed.
    The restart-after-deferring path cannot help here as deferring happens only
    after the sync compaction fails. It is also done only for the preferred
    zone, while the compaction might be done for a fallback zone.

    - the mechanism of marking pageblock to be skipped has little value since the
    cached pfn's are reset only together with the pageblock skip flags. This
    effectively limits pageblock skip usage to parallel compactions.

    This patch changes compact_finished() so that cached pfn's are reset
    immediately when the scanners meet. Clearing pageblock skip flags is
    unchanged, as well as the other situations where cached pfn's are reset.
    This allows the sync-after-async compaction to retry pageblocks not
    marked as skipped, such as !MIGRATE_MOVABLE blocks that async compaction
    now skips without marking them.

    Signed-off-by: Vlastimil Babka
    Cc: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Compaction temporarily marks pageblocks where it fails to isolate pages
    as to-be-skipped in further compactions, in order to improve efficiency.
    One of the reasons to fail isolating pages is that isolation is not
    attempted in pageblocks that are not of MIGRATE_MOVABLE (or CMA) type.

    The problem is that blocks skipped due to not being MIGRATE_MOVABLE in
    async compaction become skipped due to the temporary mark also in future
    sync compaction. Moreover, this may follow quite soon during
    __alloc_pages_slowpath, without much time for kswapd to clear the
    pageblock skip marks. This goes against the idea that sync compaction
    should try to scan these blocks more thoroughly than the async
    compaction.

    The fix is to ensure in async compaction that these !MIGRATE_MOVABLE
    blocks are not marked to be skipped. Note this should not affect
    performance or locking impact of further async compactions, as skipping
    a block due to being !MIGRATE_MOVABLE is done soon after skipping a
    block marked to be skipped, both without locking.

    Signed-off-by: Vlastimil Babka
    Cc: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Compaction of a zone is finished when the migrate scanner (which begins
    at the zone's lowest pfn) meets the free page scanner (which begins at
    the zone's highest pfn). This is detected in compact_zone() and in the
    case of direct compaction, the compact_blockskip_flush flag is set so
    that kswapd later resets the cached scanner pfn's, and a new compaction
    may again start at the zone's borders.

    The meeting of the scanners can happen during either scanner's activity.
    However, it may currently fail to be detected when it occurs in the free
    page scanner, due to two problems. First, isolate_freepages() keeps
    free_pfn at the highest block where it isolated pages from, for the
    purposes of not missing the pages that are returned back to allocator
    when migration fails. Second, failing to isolate enough free pages due
    to scanners meeting results in -ENOMEM being returned by
    migrate_pages(), which makes compact_zone() bail out immediately without
    calling compact_finished() that would detect scanners meeting.

    This failure to detect scanners meeting might result in repeated
    attempts at compaction of a zone that keep starting from the cached
    pfn's close to the meeting point, and quickly failing through the
    -ENOMEM path, without the cached pfns being reset, over and over. This
    has been observed (through additional tracepoints) in the third phase of
    the mmtests stress-highalloc benchmark, where the allocator runs on an
    otherwise idle system. The problem was observed in the DMA32 zone,
    which was used as a fallback to the preferred Normal zone, but on the
    4GB system it was actually the largest zone. The problem is even
    amplified for such fallback zone - the deferred compaction logic, which
    could (after being fixed by a previous patch) reset the cached scanner
    pfn's, is only applied to the preferred zone and not for the fallbacks.

    The problem in the third phase of the benchmark was further amplified by
    commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") which
    resulted in a non-deterministic regression of the allocation success
    rate from ~85% to ~65%. This occurs in about half of benchmark runs,
    making bisection problematic. It is unlikely that the commit itself is
    buggy, but it should put more pressure on the DMA32 zone during phases 1
    and 2, which may leave it more fragmented in phase 3 and expose the bugs
    that this patch fixes.

    The fix is to make scanners meeting in isolate_freepages() stay that way,
    and to check in compact_zone() for scanners meeting when migrate_pages()
    returns -ENOMEM. The result is that compact_finished() also detects
    scanners meeting and sets the compact_blockskip_flush flag to make
    kswapd reset the scanner pfn's.

    The results in stress-highalloc benchmark show that the "regression" by
    commit 81c0a2bb515f in phase 3 no longer occurs, and phase 1 and 2
    allocation success rates are also significantly improved.

    Signed-off-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Compaction caches pfn's for its migrate and free scanners to avoid
    scanning the whole zone each time. In compact_zone(), the cached values
    are read to set up initial values for the scanners. There are several
    situations when these cached pfn's are reset to the first and last pfn
    of the zone, respectively. One of these situations is when a compaction
    has been deferred for a zone and is now being restarted during a direct
    compaction, which is also done in compact_zone().

    However, compact_zone() currently reads the cached pfn's *before*
    resetting them. This means the reset doesn't affect the compaction that
    performs it, and with good chance also subsequent compactions, as
    update_pageblock_skip() is likely to be called and update the cached
    pfn's to those being processed. Another chance for a successful reset is
    when a direct compaction detects that the migration and free scanners
    meet (which has its own problems addressed by another patch) and sets the
    compact_blockskip_flush flag, which kswapd uses to do the reset before it
    goes to sleep.

    This is clearly a bug that results in non-deterministic behavior, so
    this patch moves the cached pfn reset to be performed *before* the
    values are read.
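
    A sketch of the corrected ordering (simplified fields; the real reset
    goes through __reset_isolation_suitable() and also covers the free
    scanner's cached pfn and the pageblock skip bits):

    #include <stdbool.h>
    #include <stdio.h>

    struct zone {
            unsigned long start_pfn, end_pfn;
            unsigned long cached_migrate_pfn;  /* stand-in for the cache */
            bool compaction_deferred;
    };

    static void reset_cached_pfns(struct zone *z)
    {
            z->cached_migrate_pfn = z->start_pfn;
    }

    int main(void)
    {
            struct zone z = {
                    .start_pfn = 0, .end_pfn = 1024,
                    .cached_migrate_pfn = 900,     /* stale, near the end */
                    .compaction_deferred = true,
            };
            unsigned long migrate_pfn;

            /* Fixed order: reset *before* reading the cached value, so a
             * restarted compaction really begins at the zone border. */
            if (z.compaction_deferred)
                    reset_cached_pfns(&z);

            migrate_pfn = z.cached_migrate_pfn;
            printf("migrate scanner starts at pfn %lu\n", migrate_pfn);
            return 0;
    }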

    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Currently there are several functions to manipulate the deferred
    compaction state variables. The remaining case where the variables are
    touched directly is when a successful allocation occurs in direct
    compaction, or is expected to be successful in the future by kswapd.
    Here, the lowest order that is expected to fail is updated, and in the
    case of a successful allocation, the deferred status and counter are
    reset completely.

    Create a new function compaction_defer_reset() to encapsulate this
    functionality and make it easier to understand the code. No functional
    change.
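
    A sketch of what such a helper can look like (simplified struct zone
    fields; treat it as illustrative rather than the exact kernel
    definition):

    #include <stdbool.h>

    struct zone {
            unsigned int compact_considered;
            unsigned int compact_defer_shift;
            int compact_order_failed;
    };

    /* Raise the lowest order expected to fail; on an actual successful
     * allocation also clear the deferral counters completely. */
    static void compaction_defer_reset(struct zone *zone, int order,
                                       bool alloc_success)
    {
            if (alloc_success) {
                    zone->compact_considered = 0;
                    zone->compact_defer_shift = 0;
            }
            if (order >= zone->compact_order_failed)
                    zone->compact_order_failed = order + 1;
    }

    int main(void)
    {
            struct zone z = { .compact_considered = 3,
                              .compact_defer_shift = 2,
                              .compact_order_failed = 2 };

            compaction_defer_reset(&z, 4, true);
            return !(z.compact_considered == 0 &&
                     z.compact_order_failed == 5);
    }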

    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The broad goal of the series is to improve allocation success rates for
    huge pages through memory compaction, while trying not to increase the
    compaction overhead. The original objective was to reintroduce
    capturing of high-order pages freed by the compaction, before they are
    split by concurrent activity. However, several bugs and opportunities
    for simple improvements were found in the current implementation, mostly
    through extra tracepoints (which are however too ugly for now to be
    considered for sending).

    The patches mostly deal with two mechanisms that reduce compaction
    overhead: caching the progress of the migrate and free scanners, and
    marking pageblocks where isolation failed so that they are skipped
    during further scans.

    Patch 1 (from mgorman) adds tracepoints that allow calculating the time
    spent in compaction and potentially debugging scanner pfn values.

    Patch 2 encapsulates some functionality for handling deferred compaction
    for better maintainability, without a functional change.

    Patch 3 fixes a bug where cached scanner pfn's are sometimes reset only after
    they have been read to initialize a compaction run.

    Patch 4 fixes a bug where scanners meeting is sometimes not properly detected
    and can lead to multiple compaction attempts quitting early without
    doing any work.

    Patch 5 improves the chances of sync compaction to process pageblocks that
    async compaction has skipped due to being !MIGRATE_MOVABLE.

    Patch 6 improves the chances of sync direct compaction to actually do anything
    when called after async compaction fails during allocation slowpath.

    The impact of the patches was validated using the mmtests
    stress-highalloc benchmark on an x86_64 machine with 4GB of memory.

    Due to instability of the results (mostly related to the bugs fixed by
    patches 2 and 3), 10 iterations were performed, taking min, mean and max
    values for success rates and mean values for time and vmstat-based
    metrics.

    First, the default GFP_HIGHUSER_MOVABLE allocations were tested with the
    patches stacked on top of v3.13-rc2. Patch 2 is OK to serve as baseline
    due to no functional changes in 1 and 2. Comments below.

    stress-highalloc
    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-nothp 3-nothp 4-nothp 5-nothp 6-nothp
    Success 1 Min 9.00 ( 0.00%) 10.00 (-11.11%) 43.00 (-377.78%) 43.00 (-377.78%) 33.00 (-266.67%)
    Success 1 Mean 27.50 ( 0.00%) 25.30 ( 8.00%) 45.50 (-65.45%) 45.90 (-66.91%) 46.30 (-68.36%)
    Success 1 Max 36.00 ( 0.00%) 36.00 ( 0.00%) 47.00 (-30.56%) 48.00 (-33.33%) 52.00 (-44.44%)
    Success 2 Min 10.00 ( 0.00%) 8.00 ( 20.00%) 46.00 (-360.00%) 45.00 (-350.00%) 35.00 (-250.00%)
    Success 2 Mean 26.40 ( 0.00%) 23.50 ( 10.98%) 47.30 (-79.17%) 47.60 (-80.30%) 48.10 (-82.20%)
    Success 2 Max 34.00 ( 0.00%) 33.00 ( 2.94%) 48.00 (-41.18%) 50.00 (-47.06%) 54.00 (-58.82%)
    Success 3 Min 65.00 ( 0.00%) 63.00 ( 3.08%) 85.00 (-30.77%) 84.00 (-29.23%) 85.00 (-30.77%)
    Success 3 Mean 76.70 ( 0.00%) 70.50 ( 8.08%) 86.20 (-12.39%) 85.50 (-11.47%) 86.00 (-12.13%)
    Success 3 Max 87.00 ( 0.00%) 86.00 ( 1.15%) 88.00 ( -1.15%) 87.00 ( 0.00%) 87.00 ( 0.00%)

    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-nothp 3-nothp 4-nothp 5-nothp 6-nothp
    User 6437.72 6459.76 5960.32 5974.55 6019.67
    System 1049.65 1049.09 1029.32 1031.47 1032.31
    Elapsed 1856.77 1874.48 1949.97 1994.22 1983.15

    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-nothp 3-nothp 4-nothp 5-nothp 6-nothp
    Minor Faults 253952267 254581900 250030122 250507333 250157829
    Major Faults 420 407 506 530 530
    Swap Ins 4 9 9 6 6
    Swap Outs 398 375 345 346 333
    Direct pages scanned 197538 189017 298574 287019 299063
    Kswapd pages scanned 1809843 1801308 1846674 1873184 1861089
    Kswapd pages reclaimed 1806972 1798684 1844219 1870509 1858622
    Direct pages reclaimed 197227 188829 298380 286822 298835
    Kswapd efficiency 99% 99% 99% 99% 99%
    Kswapd velocity 953.382 970.449 952.243 934.569 922.286
    Direct efficiency 99% 99% 99% 99% 99%
    Direct velocity 104.058 101.832 153.961 143.200 148.205
    Percentage direct scans 9% 9% 13% 13% 13%
    Zone normal velocity 347.289 359.676 348.063 339.933 332.983
    Zone dma32 velocity 710.151 712.605 758.140 737.835 737.507
    Zone dma velocity 0.000 0.000 0.000 0.000 0.000
    Page writes by reclaim 557.600 429.000 353.600 426.400 381.800
    Page writes file 159 53 7 79 48
    Page writes anon 398 375 345 346 333
    Page reclaim immediate 825 644 411 575 420
    Sector Reads 2781750 2769780 2878547 2939128 2910483
    Sector Writes 12080843 12083351 12012892 12002132 12010745
    Page rescued immediate 0 0 0 0 0
    Slabs scanned 1575654 1545344 1778406 1786700 1794073
    Direct inode steals 9657 10037 15795 14104 14645
    Kswapd inode steals 46857 46335 50543 50716 51796
    Kswapd skipped wait 0 0 0 0 0
    THP fault alloc 97 91 81 71 77
    THP collapse alloc 456 506 546 544 565
    THP splits 6 5 5 4 4
    THP fault fallback 0 1 0 0 0
    THP collapse fail 14 14 12 13 12
    Compaction stalls 1006 980 1537 1536 1548
    Compaction success 303 284 562 559 578
    Compaction failures 702 696 974 976 969
    Page migrate success 1177325 1070077 3927538 3781870 3877057
    Page migrate failure 0 0 0 0 0
    Compaction pages isolated 2547248 2306457 8301218 8008500 8200674
    Compaction migrate scanned 42290478 38832618 153961130 154143900 159141197
    Compaction free scanned 89199429 79189151 356529027 351943166 356326727
    Compaction cost 1566 1426 5312 5156 5294
    NUMA PTE updates 0 0 0 0 0
    NUMA hint faults 0 0 0 0 0
    NUMA hint local faults 0 0 0 0 0
    NUMA hint local percent 100 100 100 100 100
    NUMA pages migrated 0 0 0 0 0
    AutoNUMA cost 0 0 0 0 0

    Observations:

    - The "Success 3" line is allocation success rate with system idle
    (phases 1 and 2 are with background interference). I used to get stable
    values around 85% with vanilla 3.11. The lower min and mean values came
    with 3.12. This was bisected to commit 81c0a2bb ("mm: page_alloc: fair
    zone allocator policy") As explained in comment for patch 3, I don't
    think the commit is wrong, but that it makes the effect of compaction
    bugs worse. From patch 3 onwards, the results are OK and match the 3.11
    results.

    - Patch 4 also clearly helps phases 1 and 2, and exceeds any results
    I've seen with 3.11 (I didn't measure it that thoroughly then, but it
    was never above 40%).

    - Compaction cost and the number of scanned pages are higher, especially
    due to patch 4. However, keep in mind that patches 3 and 4 fix existing
    bugs in the current design of compaction overhead mitigation; they do
    not change it. If overhead is found unacceptable, then it should be
    decreased differently (and consistently, not due to random conditions)
    than the current implementation does. In contrast, patches 5 and 6
    (which are not strictly bug fixes) do not increase the overhead (but
    also do not increase success rates). This might be a limitation of the
    stress-highalloc benchmark as it's quite uniform.

    Another set of results is from configuring stress-highalloc to allocate
    with similar flags as THP uses:
    (GFP_HIGHUSER_MOVABLE|__GFP_NOMEMALLOC|__GFP_NORETRY|__GFP_NO_KSWAPD)

    stress-highalloc
    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-thp 3-thp 4-thp 5-thp 6-thp
    Success 1 Min 2.00 ( 0.00%) 7.00 (-250.00%) 18.00 (-800.00%) 19.00 (-850.00%) 26.00 (-1200.00%)
    Success 1 Mean 19.20 ( 0.00%) 17.80 ( 7.29%) 29.20 (-52.08%) 29.90 (-55.73%) 32.80 (-70.83%)
    Success 1 Max 27.00 ( 0.00%) 29.00 ( -7.41%) 35.00 (-29.63%) 36.00 (-33.33%) 37.00 (-37.04%)
    Success 2 Min 3.00 ( 0.00%) 8.00 (-166.67%) 21.00 (-600.00%) 21.00 (-600.00%) 32.00 (-966.67%)
    Success 2 Mean 19.30 ( 0.00%) 17.90 ( 7.25%) 32.20 (-66.84%) 32.60 (-68.91%) 35.70 (-84.97%)
    Success 2 Max 27.00 ( 0.00%) 30.00 (-11.11%) 36.00 (-33.33%) 37.00 (-37.04%) 39.00 (-44.44%)
    Success 3 Min 62.00 ( 0.00%) 62.00 ( 0.00%) 85.00 (-37.10%) 75.00 (-20.97%) 64.00 ( -3.23%)
    Success 3 Mean 66.30 ( 0.00%) 65.50 ( 1.21%) 85.60 (-29.11%) 83.40 (-25.79%) 83.50 (-25.94%)
    Success 3 Max 70.00 ( 0.00%) 69.00 ( 1.43%) 87.00 (-24.29%) 86.00 (-22.86%) 87.00 (-24.29%)

    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-thp 3-thp 4-thp 5-thp 6-thp
    User 6547.93 6475.85 6265.54 6289.46 6189.96
    System 1053.42 1047.28 1043.23 1042.73 1038.73
    Elapsed 1835.43 1821.96 1908.67 1912.74 1956.38

    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-thp 3-thp 4-thp 5-thp 6-thp
    Minor Faults 256805673 253106328 253222299 249830289 251184418
    Major Faults 395 375 423 434 448
    Swap Ins 12 10 10 12 9
    Swap Outs 530 537 487 455 415
    Direct pages scanned 71859 86046 153244 152764 190713
    Kswapd pages scanned 1900994 1870240 1898012 1892864 1880520
    Kswapd pages reclaimed 1897814 1867428 1894939 1890125 1877924
    Direct pages reclaimed 71766 85908 153167 152643 190600
    Kswapd efficiency 99% 99% 99% 99% 99%
    Kswapd velocity 1029.000 1067.782 1000.091 991.049 951.218
    Direct efficiency 99% 99% 99% 99% 99%
    Direct velocity 38.897 49.127 80.747 79.983 96.468
    Percentage direct scans 3% 4% 7% 7% 9%
    Zone normal velocity 351.377 372.494 348.910 341.689 335.310
    Zone dma32 velocity 716.520 744.414 731.928 729.343 712.377
    Zone dma velocity 0.000 0.000 0.000 0.000 0.000
    Page writes by reclaim 669.300 604.000 545.700 538.900 429.900
    Page writes file 138 66 58 83 14
    Page writes anon 530 537 487 455 415
    Page reclaim immediate 806 655 772 548 517
    Sector Reads 2711956 2703239 2811602 2818248 2839459
    Sector Writes 12163238 12018662 12038248 11954736 11994892
    Page rescued immediate 0 0 0 0 0
    Slabs scanned 1385088 1388364 1507968 1513292 1558656
    Direct inode steals 1739 2564 4622 5496 6007
    Kswapd inode steals 47461 46406 47804 48013 48466
    Kswapd skipped wait 0 0 0 0 0
    THP fault alloc 110 82 84 69 70
    THP collapse alloc 445 482 467 462 539
    THP splits 6 5 4 5 3
    THP fault fallback 3 0 0 0 0
    THP collapse fail 15 14 14 14 13
    Compaction stalls 659 685 1033 1073 1111
    Compaction success 222 225 410 427 456
    Compaction failures 436 460 622 646 655
    Page migrate success 446594 439978 1085640 1095062 1131716
    Page migrate failure 0 0 0 0 0
    Compaction pages isolated 1029475 1013490 2453074 2482698 2565400
    Compaction migrate scanned 9955461 11344259 24375202 27978356 30494204
    Compaction free scanned 27715272 28544654 80150615 82898631 85756132
    Compaction cost 552 555 1344 1379 1436
    NUMA PTE updates 0 0 0 0 0
    NUMA hint faults 0 0 0 0 0
    NUMA hint local faults 0 0 0 0 0
    NUMA hint local percent 100 100 100 100 100
    NUMA pages migrated 0 0 0 0 0
    AutoNUMA cost 0 0 0 0 0

    There are some differences from the previous results for THP-like allocations:

    - Here, the bad result for the unpatched kernel in phase 3 is much more
    consistent, staying between 65-70%, and is not related to the
    "regression" in 3.12. Still, there is the improvement from patch 4
    onwards, which brings it on par with simple GFP_HIGHUSER_MOVABLE
    allocations.

    - Compaction costs have increased, but nowhere near as much as in the
    non-THP case. Again, the patches should be worth the gained
    determinism.

    - Patches 5 and 6 somewhat increase the number of migrate-scanned pages.
    This is most likely due to the __GFP_NO_KSWAPD flag, which means the
    cached pfn's and pageblock skip bits are not reset by kswapd that often
    (at least in phase 3, where no concurrent activity would wake up kswapd)
    and the patches thus help the sync-after-async compaction. It doesn't,
    however, show that sync compaction helps that much with success rates,
    which can again be seen as a limitation of the benchmark scenario.

    This patch (of 6):

    Add two tracepoints for compaction begin and end of a zone. Using these
    it is possible to calculate how much time a workload is spending within
    compaction and potentially debug problems related to cached pfns for
    scanning. In combination with the direct reclaim and slab trace points it
    should be possible to estimate most allocation-related overhead for a
    workload.

    Signed-off-by: Mel Gorman
    Signed-off-by: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

19 Dec, 2013

1 commit

  • update_pageblock_skip() only makes sense for compaction, which isolates
    in pageblock units. If isolate_migratepages_range() is called by CMA, it
    tries to isolate regardless of pageblock units and doesn't reference
    get_pageblock_skip(), because of ignore_skip_hint. We should also
    respect ignore_skip_hint in update_pageblock_skip() to prevent it from
    setting the wrong information.
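
    A minimal sketch of the guard (simplified; the real
    update_pageblock_skip() also updates the zone's cached scanner pfn's):

    #include <stdbool.h>

    struct compact_control { bool ignore_skip_hint; };
    struct page { bool skip; };     /* stand-in for the pageblock skip bit */

    static void update_pageblock_skip(struct compact_control *cc,
                                      struct page *page,
                                      unsigned long nr_isolated)
    {
            /* Callers that ignore the skip hints (CMA) must not write them
             * either; otherwise per-page calls would set misleading
             * pageblock-granular information used by compaction. */
            if (cc->ignore_skip_hint)
                    return;

            if (!nr_isolated)
                    page->skip = true;
    }

    int main(void)
    {
            struct compact_control cma = { .ignore_skip_hint = true };
            struct page pb = { .skip = false };

            update_pageblock_skip(&cma, &pb, 0);
            return pb.skip;         /* expect 0: hint left untouched */
    }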

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Wanpeng Li
    Cc: Christoph Lameter
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Zhang Yanfei
    Cc: [3.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

01 Oct, 2013

1 commit

  • We've been getting warnings about an excessive amount of time spent
    allocating pages for migration during memory compaction without
    scheduling. isolate_freepages_block() already periodically checks for
    contended locks or the need to schedule, but isolate_freepages() never
    does.

    When a zone is massively long and no suitable targets can be found, this
    iteration can be quite expensive without ever doing cond_resched().

    Check periodically for the need to reschedule while the compaction free
    scanner iterates.
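
    A sketch of the shape of the fix, with a stub standing in for the
    kernel's cond_resched() (the loop bounds and pageblock size are
    illustrative):

    #include <stdio.h>

    #define PAGEBLOCK_NR_PAGES 512

    static void cond_resched_stub(void)
    {
            /* In the kernel this is cond_resched(); it yields the CPU when a
             * reschedule is pending, bounding the latency of long scans. */
    }

    int main(void)
    {
            unsigned long zone_end = 1UL << 20, pfn;

            /* Outer free-scanner loop, one step per pageblock, scanning from
             * the end of the zone downwards.  Even if no suitable pageblock
             * is ever found, the periodic call keeps the loop preemptible. */
            for (pfn = zone_end; pfn >= PAGEBLOCK_NR_PAGES;
                 pfn -= PAGEBLOCK_NR_PAGES)
                    cond_resched_stub();

            printf("scanned down to pfn %lu\n", pfn);
            return 0;
    }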

    Signed-off-by: David Rientjes
    Reviewed-by: Rik van Riel
    Reviewed-by: Wanpeng Li
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

12 Sep, 2013

1 commit

  • If kswapd was reclaiming for a high order and resets it to 0 due to
    fragmentation, it will still call compact_pgdat(). For the most part, this
    will fail a compaction_suitable() test and not compact but it is
    unnecessarily sloppy. It could be fixed in the caller but fix it in the
    API instead.

    [dhillf@gmail.com: pointed out that it was a potential problem]
    Signed-off-by: Mel Gorman
    Cc: Hillf Danton
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

24 Feb, 2013

5 commits

  • Add 2 helpers (zone_end_pfn() and zone_spans_pfn()) to reduce code
    duplication.

    This also switches to using them in compaction (where an additional
    variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and
    kmemleak.

    Note that in compaction.c I avoid calling zone_end_pfn() repeatedly
    because I expect at some point the synchronization issues with start_pfn &
    spanned_pages will need fixing, either by actually using the seqlock or
    clever memory barrier usage.
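
    A standalone sketch of the two helpers with a reduced struct zone (the
    in-kernel versions live in include/linux/mmzone.h):

    #include <stdbool.h>

    struct zone {
            unsigned long zone_start_pfn;
            unsigned long spanned_pages;
    };

    /* One past the last pfn the zone spans. */
    static inline unsigned long zone_end_pfn(const struct zone *zone)
    {
            return zone->zone_start_pfn + zone->spanned_pages;
    }

    /* True if pfn lies within [zone_start_pfn, zone_end_pfn). */
    static inline bool zone_spans_pfn(const struct zone *zone,
                                      unsigned long pfn)
    {
            return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
    }

    int main(void)
    {
            struct zone z = { .zone_start_pfn = 0x1000,
                              .spanned_pages = 0x4000 };

            return !(zone_spans_pfn(&z, 0x1000) &&
                     !zone_spans_pfn(&z, zone_end_pfn(&z)));
    }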

    Signed-off-by: Cody P Schafer
    Cc: David Hansen
    Cc: Catalin Marinas
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • No functional change, but the only purpose of the offlining argument to
    migrate_pages() etc, was to ensure that __unmap_and_move() could migrate a
    KSM page for memory hotremove (which took ksm_thread_mutex) but not for
    other callers. Now all cases are safe, remove the arg.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Several functions test MIGRATE_ISOLATE and some of those are hotpath, but
    MIGRATE_ISOLATE is used only if we enable CONFIG_MEMORY_ISOLATION (i.e.,
    CMA, memory-hotplug and memory-failure), which are not common config
    options. So let's not add unnecessary overhead and code when we don't
    enable CONFIG_MEMORY_ISOLATION.

    Signed-off-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • These functions always return 0. Formalise this.

    Cc: Jason Liu
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Compaction uses the ALIGN macro incorrectly with the migrate scanner by
    adding pageblock_nr_pages to a PFN. It happened to work when initially
    implemented as the starting PFN was also aligned but with caching
    restarts and isolating in smaller chunks this is no longer always true.

    The impact is that the migrate scanner scans outside its current
    pageblock. As pfn_valid() is still checked properly it does not cause
    any failure, and the impact of the bug is that in some cases it will scan
    more than necessary when it crosses a pageblock boundary, but by no more
    than COMPACT_CLUSTER_MAX. It is highly unlikely this is even measurable but
    it's still wrong so this patch addresses the problem.
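
    A worked example of the off-by-one, assuming the usual round-up ALIGN()
    semantics and an illustrative pageblock_nr_pages of 512:

    #include <stdio.h>

    #define ALIGN(x, a)        (((x) + (a) - 1) & ~((a) - 1))
    #define pageblock_nr_pages 512UL

    int main(void)
    {
            /* A migrate scanner restarting mid-pageblock from a cached pfn,
             * inside the pageblock covering pfns 512..1023. */
            unsigned long pfn = 1000;

            /* Buggy form: rounds up *after* adding a whole pageblock, so the
             * end lands one pageblock too far (1536) when pfn is unaligned;
             * it only gave the right answer when pfn was already aligned. */
            unsigned long bad_end = ALIGN(pfn + pageblock_nr_pages,
                                          pageblock_nr_pages);

            /* Intended value: the end of the pageblock containing pfn. */
            unsigned long good_end = ALIGN(pfn + 1, pageblock_nr_pages);

            printf("bad_end=%lu good_end=%lu\n", bad_end, good_end);
            return 0;
    }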

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

12 Jan, 2013

2 commits

  • Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
    waiting for POLLIN on a local TCP socket. It was easier to trigger if
    there was disk IO and dirty pages at the same time and he bisected it to
    commit 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page
    immediately when it is made available").

    The intention of that patch was to improve high-order allocations under
    memory pressure after changes made to reclaim in 3.6 drastically hurt
    THP allocations but the approach was flawed. For Eric, the problem was
    that page->pfmemalloc was not being cleared for captured pages leading
    to a poor interaction with swap-over-NFS support causing the packets to
    be dropped. However, I identified a few more problems with the patch
    including the fact that it can increase contention on zone->lock in some
    cases which could result in async direct compaction being aborted early.

    In retrospect the capture patch took the wrong approach. What it should
    have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
    was allocating for THP and avoided races that way. While the patch was
    showing to improve allocation success rates at the time, the benefit is
    marginal given the relative complexity and it should be revisited from
    scratch in the context of the other reclaim-related changes that have
    taken place since the patch was first written and tested. This patch
    partially reverts commit 1fb3f8ca0e92 ("mm: compaction: capture a
    suitable high-order page immediately when it is made available").

    Reported-and-tested-by: Eric Wong
    Tested-by: Eric Dumazet
    Cc:
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When running the following command under a shell, it returns an error:

    sh/$ echo 1 > /proc/sys/vm/compact_memory
    sh/$ sh: write error: Bad address

    After strace, I found the following log:

    ...
    write(1, "1\n", 2) = 3
    write(1, "", 4294967295) = -1 EFAULT (Bad address)
    write(2, "echo: write error: Bad address\n", 31echo: write error: Bad address
    ) = 31

    This shows the system returned 3 (COMPACT_COMPLETE) after writing data to
    compact_memory.

    The fix is to make the system just return 0 instead of 3
    (COMPACT_COMPLETE) from sysctl_compaction_handler after compact_nodes()
    has finished.
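
    A sketch of the handler shape after the fix (heavily simplified
    signature; the real sysctl_compaction_handler() takes the usual
    ctl_table/buffer/length arguments):

    #include <stdbool.h>

    #define COMPACT_COMPLETE 3      /* illustrative value from the report */

    static int compact_nodes(void)
    {
            return COMPACT_COMPLETE;   /* pretend all zones completed */
    }

    /* Run compaction for its side effects but report 0 to the write(2)
     * path.  Propagating COMPACT_COMPLETE (3) made write() appear to
     * consume 3 bytes of a 2-byte buffer, so the shell retried with a
     * bogus length and hit the EFAULT seen in the strace above. */
    static int sysctl_compaction_handler(bool write)
    {
            if (write)
                    compact_nodes();   /* status deliberately ignored */

            return 0;
    }

    int main(void)
    {
            return sysctl_compaction_handler(true);
    }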

    Signed-off-by: Jason Liu
    Suggested-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Liu
     

21 Dec, 2012

1 commit

  • isolate_freepages_block() and isolate_migratepages_range() are used for
    CMA as well as compaction, so the build breaks for CONFIG_CMA &&
    !CONFIG_COMPACTION.

    This patch fixes it.

    [akpm@linux-foundation.org: add "do { } while (0)", per Mel]
    Signed-off-by: Minchan Kim
    Cc: Mel Gorman
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for the workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

13 Dec, 2012

1 commit

  • compact_capture_page() is only used if compaction is enabled so it should
    be moved into the corresponding #ifdef.

    Signed-off-by: Thierry Reding
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thierry Reding
     

12 Dec, 2012

2 commits

  • The PATCH "mm: introduce compaction and migration for virtio ballooned pages"
    hacks around putback_lru_pages() in order to allow ballooned pages to be
    re-inserted on balloon page list as if a ballooned page was like a LRU page.

    As ballooned pages are not legitimate LRU pages, this patch introduces
    putback_movable_pages() to properly cope with cases where the isolated
    pageset contains ballooned pages and LRU pages, thus fixing the mentioned
    inelegant hack around putback_lru_pages().

    Signed-off-by: Rafael Aquini
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • Memory fragmentation introduced by ballooning might reduce significantly
    the number of 2MB contiguous memory blocks that can be used within a guest,
    thus imposing performance penalties associated with the reduced number of
    transparent huge pages that could be used by the guest workload.

    This patch introduces the helper functions as well as the necessary changes
    to teach compaction and migration bits how to cope with pages which are
    part of a guest memory balloon, in order to make them movable by memory
    compaction procedures.

    Signed-off-by: Rafael Aquini
    Acked-by: Mel Gorman
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     

11 Dec, 2012

3 commits

  • Compaction already has tracepoints to count scanned and isolated pages
    but it requires that ftrace be enabled and if that information has to be
    written to disk then it can be disruptive. This patch adds vmstat counters
    for compaction called compact_migrate_scanned, compact_free_scanned and
    compact_isolated.

    With these counters, it is possible to define a basic cost model for
    compaction. This approximates how much work compaction is doing and can
    be compared with an oprofile showing TLB misses to see if the cost of
    compaction is being offset by THP, for example. Minimally, a compaction
    patch can be evaluated in terms of whether it increases or decreases
    cost. The basic cost model looks like this:

    Fundamental unit u: a word sizeof(void *)

    Ca = cost of struct page access = sizeof(struct page) / u

    Cmc = Cost migrate page copy = (Ca + PAGE_SIZE/u) * 2
    Cmf = Cost migrate failure = Ca * 2
    Ci = Cost page isolation = (Ca + Wi)
    where Wi is a constant that should reflect the approximate
    cost of the locking operation.

    Csm = Cost migrate scanning = Ca
    Csf = Cost free scanning = Ca

    Overall cost = (Csm * compact_migrate_scanned) +
    (Csf * compact_free_scanned) +
    (Ci * compact_isolated) +
    (Cmc * pgmigrate_success) +
    (Cmf * pgmigrate_failed)

    Where the values are read from /proc/vmstat.

    This is very basic and ignores certain costs such as the allocation cost
    to do a migrate page copy but any improvement to the model would still
    use the same vmstat counters.
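
    A small worked example of the model, plugging in illustrative counter
    values (sizeof(struct page) of 64 bytes, the Wi locking constant, and the
    counts themselves are assumptions; real values come from /proc/vmstat):

    #include <stdio.h>

    int main(void)
    {
            /* Fundamental unit u and per-operation costs from the model. */
            double u   = sizeof(void *);
            double Ca  = 64.0 / u;              /* struct page access      */
            double Cmc = (Ca + 4096.0 / u) * 2; /* migrate page copy (4K)  */
            double Cmf = Ca * 2;                /* migrate failure         */
            double Wi  = 1.0;                   /* guessed locking cost    */
            double Ci  = Ca + Wi;               /* page isolation          */
            double Csm = Ca, Csf = Ca;          /* migrate/free scanning   */

            /* Illustrative counter values standing in for /proc/vmstat. */
            double compact_migrate_scanned = 42290478;
            double compact_free_scanned    = 89199429;
            double compact_isolated        = 2547248;
            double pgmigrate_success       = 1177325;
            double pgmigrate_fail          = 0;

            double cost = Csm * compact_migrate_scanned +
                          Csf * compact_free_scanned +
                          Ci  * compact_isolated +
                          Cmc * pgmigrate_success +
                          Cmf * pgmigrate_fail;

            printf("overall cost: %.0f word-sized units\n", cost);
            return 0;
    }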

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Mel Gorman
     
  • The pgmigrate_success and pgmigrate_fail vmstat counters tell the user
    about migration activity but not the type or the reason. This patch adds
    a tracepoint to identify the type of page migration and why the page is
    being migrated.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Mel Gorman
     
  • The compact_pages_moved and compact_pagemigrate_failed events are
    convenient for determining if compaction is active and to what
    degree migration is succeeding but it's at the wrong level. Other
    users of migration may also want to know if migration is working
    properly and this will be particularly true for any automated
    NUMA migration. This patch moves the counters down to migration
    with the new events called pgmigrate_success and pgmigrate_fail.
    The compact_blocks_moved counter is removed because while it was
    useful for debugging initially, it's worthless now as no meaningful
    conclusions can be drawn from its value.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Mel Gorman
     

07 Dec, 2012

1 commit

  • Commit 0bf380bc70ec ("mm: compaction: check pfn_valid when entering a
    new MAX_ORDER_NR_PAGES block during isolation for migration") added a
    check for pfn_valid() when isolating pages for migration as the scanner
    does not necessarily start pageblock-aligned.

    Since commit c89511ab2f8f ("mm: compaction: Restart compaction from near
    where it left off"), the free scanner has the same problem. This patch
    makes sure that the pfn range passed to isolate_freepages_block() is
    within the same block so that pfn_valid() checks are unnecessary.

    In answer to Henrik's wondering why others have not reported this:
    reproducing this requires a large enough hole with the right alignment
    to have compaction walk into a PFN range with no memmap. Size and
    alignment depend on the memory model - 4M for FLATMEM and 128M for
    SPARSEMEM on x86. It needs a "lucky" machine.

    Reported-by: Henrik Rydberg
    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

20 Oct, 2012

1 commit

  • Thierry reported that the "iron out" patch for isolate_freepages_block()
    had problems due to the strict check being too strict; this was addressed
    by "mm: compaction: Iron out isolate_freepages_block() and
    isolate_freepages_range() -fix1". It's possible that more pages than
    necessary are isolated but the check still fails, and I missed that this
    fix was not picked up before RC1. The same problem has been identified
    in 3.7-RC1 by Tony Prisk and should be addressed by the following patch.

    Signed-off-by: Mel Gorman
    Tested-by: Tony Prisk
    Reported-by: Thierry Reding
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

09 Oct, 2012

1 commit

  • Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
    contiguous memory space.

    This patch makes mlocked pages be migrated out. Of course, it can affect
    realtime processes, but in the CMA use case, a contiguous memory
    allocation failing is far worse than the access latency to an mlocked
    page being variable while CMA is running. If someone wants to make the
    system realtime, he shouldn't enable CMA because stalls can still happen
    at random times.

    [akpm@linux-foundation.org: tweak comment text, per Mel]
    Signed-off-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim