26 Sep, 2014
40 commits
-
commit 615d6e8756c87149f2d4c1b93d471bca002bd849 upstream.
This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
thus further comparison with other approaches were needed. There are
two things to consider when dealing with this, the cache hit rate and
the latency of find_vma(). Improving the hit-rate does not necessarily
translate in finding the vma any faster, as the overhead of any fancy
caching schemes can be too high to consider.We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality. On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number. The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed. Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question. Concretely,
the following results are seen on an 80 core, 8 socket x86-64 box:1) System bootup: Most programs are single threaded, so the per-thread
scheme does improve ~50% hit rate by just adding a few more slots to
the cache.+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 50.61% | 19.90 |
| patched | 73.45% | 13.58 |
+----------------+----------+------------------+2) Kernel build: This one is already pretty good with the current
approach as we're dealing with good locality.+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 75.28% | 11.03 |
| patched | 88.09% | 9.31 |
+----------------+----------+------------------+3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 70.66% | 17.14 |
| patched | 91.15% | 12.57 |
+----------------+----------+------------------+4) Ebizzy: There's a fair amount of variation from run to run, but this
approach always shows nearly perfect hit rates, while baseline is just
about non-existent. The amounts of cycles can fluctuate between
anywhere from ~60 to ~116 for the baseline scheme, but this approach
reduces it considerably. For instance, with 80 threads:+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 1.06% | 91.54 |
| patched | 99.97% | 14.18 |
+----------------+----------+------------------+[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso
Reviewed-by: Rik van Riel
Acked-by: Linus Torvalds
Reviewed-by: Michel Lespinasse
Cc: Oleg Nesterov
Tested-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 83da7510058736c09a14b9c17ec7d851940a4332 upstream.
Seems to be called with preemption enabled. Therefore it must use
mod_zone_page_state instead.Signed-off-by: Christoph Lameter
Reported-by: Grygorii Strashko
Tested-by: Grygorii Strashko
Cc: Tejun Heo
Cc: Santosh Shilimkar
Cc: Ingo Molnar
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit d5bc5fd3fcb7b8dfb431694a8c8052466504c10c upstream.
The name `max_pass' is misleading, because this variable actually keeps
the estimate number of freeable objects, not the maximal number of
objects we can scan in this pass, which can be twice that. Rename it to
reflect its actual meaning.Signed-off-by: Vladimir Davydov
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 99120b772b52853f9a2b829a21dd44d9b20558f1 upstream.
When direct reclaim is executed by a process bound to a set of NUMA
nodes, we should scan only those nodes when possible, but currently we
will scan kmem from all online nodes even if the kmem shrinker is NUMA
aware. That said, binding a process to a particular NUMA node won't
prevent it from shrinking inode/dentry caches from other nodes, which is
not good. Fix this.Signed-off-by: Vladimir Davydov
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Rik van Riel
Cc: Dave Chinner
Cc: Glauber Costa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 7fcbbaf18392f0b17c95e2f033c8ccf87eecde1d upstream.
In some testing I ran today (some fio jobs that spread over two nodes),
we end up spending 40% of the time in filemap_check_errors(). That
smells fishy. Looking further, this is basically what happens:blkdev_aio_read()
generic_file_aio_read()
filemap_write_and_wait_range()
if (!mapping->nr_pages)
filemap_check_errors()and filemap_check_errors() always attempts two test_and_clear_bit() on
the mapping flags, thus dirtying it for every single invocation. The
patch below tests each of these bits before clearing them, avoiding this
issue. In my test case (4-socket box), performance went from 1.7M IOPS
to 4.0M IOPS.Signed-off-by: Jens Axboe
Acked-by: Jeff Moyer
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit d26914d11751b23ca2e8747725f2cae10c2f2c1b upstream.
Since put_mems_allowed() is strictly optional, its a seqcount retry, we
don't need to evaluate the function if the allocation was in fact
successful, saving a smp_rmb some loads and comparisons on some relative
fast-paths.Since the naming, get/put_mems_allowed() does suggest a mandatory
pairing, rename the interface, as suggested by Mel, to resemble the
seqcount interface.This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
where it is important to note that the return value of the latter call
is inverted from its previous incarnation.Signed-off-by: Peter Zijlstra
Signed-off-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 6d2be915e589b58cb11418cbe1f22ff90732b6ac upstream.
Currently max_sane_readahead() returns zero on the cpu whose NUMA node
has no local memory which leads to readahead failure. Fix this
readahead failure by returning minimum of (requested pages, 512). Users
running applications on a memory-less cpu which needs readahead such as
streaming application see considerable boost in the performance.Result:
fadvise experiment with FADV_WILLNEED on a PPC machine having memoryless
CPU with 1GB testfile (12 iterations) yielded around 46.66% improvement.fadvise experiment with FADV_WILLNEED on a x240 machine with 1GB
testfile 32GB* 4G RAM numa machine (12 iterations) showed no impact on
the normal NUMA cases w/ patch.Kernel Avg Stddev
base 7.4975 3.92%
patched 7.4174 3.26%[Andrew: making return value PAGE_SIZE independent]
Suggested-by: Linus Torvalds
Signed-off-by: Raghavendra K T
Acked-by: Jan Kara
Cc: Wu Fengguang
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 91ca9186484809c57303b33778d841cc28f696ed upstream.
The cached pageblock hint should be ignored when triggering compaction
through /proc/sys/vm/compact_memory so all eligible memory is isolated.
Manually invoking compaction is known to be expensive, there's no need
to skip pageblocks based on heuristics (mainly for debugging).Signed-off-by: David Rientjes
Acked-by: Mel Gorman
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit da1c67a76f7cf2b3404823d24f9f10fa91aa5dc5 upstream.
The conditions that control the isolation mode in
isolate_migratepages_range() do not change during the iteration, so
extract them out and only define the value once.This actually does have an effect, gcc doesn't optimize it itself because
of cc->sync.Signed-off-by: David Rientjes
Cc: Mel Gorman
Acked-by: Rik van Riel
Acked-by: Vlastimil Babka
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit b6c750163c0d138f5041d95fcdbd1094b6928057 upstream.
It is just for clean-up to reduce code size and improve readability.
There is no functional change.Signed-off-by: Joonsoo Kim
Acked-by: Vlastimil Babka
Cc: Mel Gorman
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit c122b2087ab94192f2b937e47b563a9c4e688ece upstream.
isolation_suitable() and migrate_async_suitable() is used to be sure
that this pageblock range is fine to be migragted. It isn't needed to
call it on every page. Current code do well if not suitable, but, don't
do well when suitable.1) It re-checks isolation_suitable() on each page of a pageblock that was
already estabilished as suitable.
2) It re-checks migrate_async_suitable() on each page of a pageblock that
was not entered through the next_pageblock: label, because
last_pageblock_nr is not otherwise updated.This patch fixes situation by 1) calling isolation_suitable() only once
per pageblock and 2) always updating last_pageblock_nr to the pageblock
that was just checked.Additionally, move PageBuddy() check after pageblock unit check, since
pageblock check is the first thing we should do and makes things more
simple.[vbabka@suse.cz: rephrase commit description]
Signed-off-by: Joonsoo Kim
Acked-by: Vlastimil Babka
Cc: Mel Gorman
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit be1aa03b973c7dcdc576f3503f7a60429825c35d upstream.
It is odd to drop the spinlock when we scan (SWAP_CLUSTER_MAX - 1) th
pfn page. This may results in below situation while isolating
migratepage.1. try isolate 0x0 ~ 0x200 pfn pages.
2. When low_pfn is 0x1ff, ((low_pfn+1) % SWAP_CLUSTER_MAX) == 0, so drop
the spinlock.
3. Then, to complete isolating, retry to aquire the lock.I think that it is better to use SWAP_CLUSTER_MAX th pfn for checking the
criteria about dropping the lock. This has no harm 0x0 pfn, because, at
this time, locked variable would be false.Signed-off-by: Joonsoo Kim
Acked-by: Vlastimil Babka
Cc: Mel Gorman
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 01ead5340bcf5f3a1cd2452c75516d0ef4d908d7 upstream.
suitable_migration_target() checks that pageblock is suitable for
migration target. In isolate_freepages_block(), it is called on every
page and this is inefficient. So make it called once per pageblock.suitable_migration_target() also checks if page is highorder or not, but
it's criteria for highorder is pageblock order. So calling it once
within pageblock range has no problem.Signed-off-by: Joonsoo Kim
Acked-by: Vlastimil Babka
Cc: Mel Gorman
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 7d348b9ea64db0a315d777ce7d4b06697f946503 upstream.
Purpose of compaction is to get a high order page. Currently, if we
find high-order page while searching migration target page, we break it
to order-0 pages and use them as migration target. It is contrary to
purpose of compaction, so disallow high-order page to be used for
migration target.Additionally, clean-up logic in suitable_migration_target() to simplify
the code. There is no functional changes from this clean-up.Signed-off-by: Joonsoo Kim
Acked-by: Vlastimil Babka
Cc: Mel Gorman
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 119d6d59dcc0980dcd581fdadb6b2033b512a473 upstream.
Page migration will fail for memory that is pinned in memory with, for
example, get_user_pages(). In this case, it is unnecessary to take
zone->lru_lock or isolating the page and passing it to page migration
which will ultimately fail.This is a racy check, the page can still change from under us, but in
that case we'll just fail later when attempting to move the page.This avoids very expensive memory compaction when faulting transparent
hugepages after pinning a lot of memory with a Mellanox driver.On a 128GB machine and pinning ~120GB of memory, before this patch we
see the enormous disparity in the number of page migration failures
because of the pinning (from /proc/vmstat):compact_pages_moved 8450
compact_pagemigrate_failed 156144150.05% of pages isolated are successfully migrated and explicitly
triggering memory compaction takes 102 seconds. After the patch:compact_pages_moved 9197
compact_pagemigrate_failed 799.9% of pages isolated are now successfully migrated in this
configuration and memory compaction takes less than one second.Signed-off-by: David Rientjes
Acked-by: Hugh Dickins
Acked-by: Mel Gorman
Cc: Joonsoo Kim
Cc: Rik van Riel
Cc: Greg Thelen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 943dca1a1fcbccb58de944669b833fd38a6c809b upstream.
Yasuaki Ishimatsu reported memory hot-add spent more than 5 _hours_ on
9TB memory machine since onlining memory sections is too slow. And we
found out setup_zone_migrate_reserve spent >90% of the time.The problem is, setup_zone_migrate_reserve scans all pageblocks
unconditionally, but it is only necessary if the number of reserved
block was reduced (i.e. memory hot remove).Moreover, maximum MIGRATE_RESERVE per zone is currently 2. It means
that the number of reserved pageblocks is almost always unchanged.This patch adds zone->nr_migrate_reserve_block to maintain the number of
MIGRATE_RESERVE pageblocks and it reduces the overhead of
setup_zone_migrate_reserve dramatically. The following table shows time
of onlining a memory section.Amount of memory | 128GB | 192GB | 256GB|
---------------------------------------------
linux-3.12 | 23.9 | 31.4 | 44.5 |
This patch | 8.3 | 8.3 | 8.6 |
Mel's proposal patch | 10.9 | 19.2 | 31.3 |
---------------------------------------------
(millisecond)128GB : 4 nodes and each node has 32GB of memory
192GB : 6 nodes and each node has 32GB of memory
256GB : 8 nodes and each node has 32GB of memory(*1) Mel proposed his idea by the following threads.
https://lkml.org/lkml/2013/10/30/272[akpm@linux-foundation.org: tweak comment]
Signed-off-by: KOSAKI Motohiro
Signed-off-by: Yasuaki Ishimatsu
Reported-by: Yasuaki Ishimatsu
Tested-by: Yasuaki Ishimatsu
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit ec97097bca147d5718a5d2c024d1ec740b10096d upstream.
If a shrinker is not NUMA-aware, shrink_slab() should call it exactly
once with nid=0, but currently it is not true: if node 0 is not set in
the nodemask or if it is not online, we will not call such shrinkers at
all. As a result some slabs will be left untouched under some
circumstances. Let us fix it.Signed-off-by: Vladimir Davydov
Reported-by: Dave Chinner
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Rik van Riel
Cc: Glauber Costa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 0b1fb40a3b1291f2f12f13f644ac95cf756a00e6 upstream.
When reclaiming kmem, we currently don't scan slabs that have less than
batch_size objects (see shrink_slab_node()):while (total_scan >= batch_size) {
shrinkctl->nr_to_scan = batch_size;
shrinker->scan_objects(shrinker, shrinkctl);
total_scan -= batch_size;
}If there are only a few shrinkers available, such a behavior won't cause
any problems, because the batch_size is usually small, but if we have a
lot of slab shrinkers, which is perfectly possible since FS shrinkers
are now per-superblock, we can end up with hundreds of megabytes of
practically unreclaimable kmem objects. For instance, mounting a
thousand of ext2 FS images with a hundred of files in each and iterating
over all the files using du(1) will result in about 200 Mb of FS caches
that cannot be dropped even with the aid of the vm.drop_caches sysctl!This problem was initially pointed out by Glauber Costa [*]. Glauber
proposed to fix it by making the shrink_slab() always take at least one
pass, to put it simply, turning the scan loop above to a do{}while()
loop. However, this proposal was rejected, because it could result in
more aggressive and frequent slab shrinking even under low memory
pressure when total_scan is naturally very small.This patch is a slightly modified version of Glauber's approach.
Similarly to Glauber's patch, it makes shrink_slab() scan less than
batch_size objects, but only if the total number of objects we want to
scan (total_scan) is greater than the total number of objects available
(max_pass). Since total_scan is biased as half max_pass if the current
delta change is small:if (delta < max_pass / 4)
total_scan = min(total_scan, max_pass / 2);this is only possible if we are scanning at high prio. That said, this
patch shouldn't change the vmscan behaviour if the memory pressure is
low, but if we are tight on memory, we will do our best by trying to
reclaim all available objects, which sounds reasonable.[*] http://www.spinics.net/lists/cgroups/msg06913.html
Signed-off-by: Vladimir Davydov
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Rik van Riel
Cc: Dave Chinner
Cc: Glauber Costa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 579f82901f6f41256642936d7e632f3979ad76d4 upstream.
This is a patch to improve swap readahead algorithm. It's from Hugh and
I slightly changed it.Hugh's original changelog:
swapin readahead does a blind readahead, whether or not the swapin is
sequential. This may be ok on harddisk, because large reads have
relatively small costs, and if the readahead pages are unneeded they can
be reclaimed easily - though, what if their allocation forced reclaim of
useful pages? But on SSD devices large reads are more expensive than
small ones: if the readahead pages are unneeded, reading them in caused
significant overhead.This patch adds very simplistic random read detection. Stealing the
PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the
vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages()
simply looks at readahead's current success rate, and narrows or widens
its readahead window accordingly. There is little science to its
heuristic: it's about as stupid as can be whilst remaining effective.The table below shows elapsed times (in centiseconds) when running a
single repetitive swapping load across a 1000MB mapping in 900MB ram
with 1GB swap (the harddisk tests had taken painfully too long when I
used mem=500M, but SSD shows similar results for that).Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his
Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch
which Shaohua showed to be defective; HughNew this Nov 14 patch, with
page_cluster as usual at default of 3 (8-page reads); HughPC4 this same
patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0
(1-page reads: no readahead).HDD for swapping to harddisk, SSD for swapping to VertexII SSD. Seq for
sequential access to the mapping, cycling five times around; Rand for
the same number of random touches. Anon for a MAP_PRIVATE anon mapping;
Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.One weakness of Shaohua's vma/anon_vma approach was that it did not
optimize Shmem: seen below. Konstantin's approach was perhaps mistuned,
50% slower on Seq: did not compete and is not shown below.HDD Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon 73921 76210 75611 76904 78191 121542
Seq Shmem 73601 73176 73855 72947 74543 118322
Rand Anon 895392 831243 871569 845197 846496 841680
Rand Shmem 1058375 1053486 827935 764955 764376 756489SSD Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon 24634 24198 24673 25107 21614 70018
Seq Shmem 24959 24932 25052 25703 22030 69678
Rand Anon 43014 26146 28075 25989 26935 25901
Rand Shmem 45349 45215 28249 24268 24138 24332These tests are, of course, two extremes of a very simple case: under
heavier mixed loads I've not yet observed any consistent improvement or
degradation, and wider testing would be welcome.Shaohua Li:
Test shows Vanilla is slightly better in sequential workload than Hugh's
patch. I observed with Hugh's patch sometimes the readahead size is
shrinked too fast (from 8 to 1 immediately) in sequential workload if
there is no hit. And in such case, continuing doing readahead is good
actually.I don't prepare a sophisticated algorithm for the sequential workload
because so far we can't guarantee sequential accessed pages are swap out
sequentially. So I slightly change Hugh's heuristic - don't shrink
readahead size too fast.Here is my test result (unit second, 3 runs average):
Vanilla Hugh New
Seq 356 370 360
Random 4525 2447 2444Attached graph is the swapin/swapout throughput I collected with 'vmstat
2'. The first part is running a random workload (till around 1200 of
the x-axis) and the second part is running a sequential workload.
swapin and swapout throughput are almost identical in steady state in
both workloads. These are expected behavior. while in Vanilla, swapin
is much bigger than swapout especially in random workload (because wrong
readahead).Original patches by: Shaohua Li and Konstantin Khlebnikov.
[fengguang.wu@intel.com: swapin_nr_pages() can be static]
Signed-off-by: Hugh Dickins
Signed-off-by: Shaohua Li
Signed-off-by: Fengguang Wu
Cc: Rik van Riel
Cc: Wu Fengguang
Cc: Minchan Kim
Cc: Konstantin Khlebnikov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 55b7c4c99f6a448f72179297fe6432544f220063 upstream.
Compaction used to start its migrate and free page scaners at the zone's
lowest and highest pfn, respectively. Later, caching was introduced to
remember the scanners' progress across compaction attempts so that
pageblocks are not re-scanned uselessly. Additionally, pageblocks where
isolation failed are marked to be quickly skipped when encountered again
in future compactions.Currently, both the reset of cached pfn's and clearing of the pageblock
skip information for a zone is done in __reset_isolation_suitable().
This function gets called when:- compaction is restarting after being deferred
- compact_blockskip_flush flag is set in compact_finished() when the scanners
meet (and not again cleared when direct compaction succeeds in allocation)
and kswapd acts upon this flag before going to sleepThis behavior is suboptimal for several reasons:
- when direct sync compaction is called after async compaction fails (in the
allocation slowpath), it will effectively do nothing, unless kswapd
happens to process the compact_blockskip_flush flag meanwhile. This is racy
and goes against the purpose of sync compaction to more thoroughly retry
the compaction of a zone where async compaction has failed.
The restart-after-deferring path cannot help here as deferring happens only
after the sync compaction fails. It is also done only for the preferred
zone, while the compaction might be done for a fallback zone.- the mechanism of marking pageblock to be skipped has little value since the
cached pfn's are reset only together with the pageblock skip flags. This
effectively limits pageblock skip usage to parallel compactions.This patch changes compact_finished() so that cached pfn's are reset
immediately when the scanners meet. Clearing pageblock skip flags is
unchanged, as well as the other situations where cached pfn's are reset.
This allows the sync-after-async compaction to retry pageblocks not
marked as skipped, such as blocks !MIGRATE_MOVABLE blocks that async
compactions now skips without marking them.Signed-off-by: Vlastimil Babka
Cc: Rik van Riel
Acked-by: Mel Gorman
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 50b5b094e683f8e51e82c6dfe97b1608cf97e6c0 upstream.
Compaction temporarily marks pageblocks where it fails to isolate pages
as to-be-skipped in further compactions, in order to improve efficiency.
One of the reasons to fail isolating pages is that isolation is not
attempted in pageblocks that are not of MIGRATE_MOVABLE (or CMA) type.The problem is that blocks skipped due to not being MIGRATE_MOVABLE in
async compaction become skipped due to the temporary mark also in future
sync compaction. Moreover, this may follow quite soon during
__alloc_page_slowpath, without much time for kswapd to clear the
pageblock skip marks. This goes against the idea that sync compaction
should try to scan these blocks more thoroughly than the async
compaction.The fix is to ensure in async compaction that these !MIGRATE_MOVABLE
blocks are not marked to be skipped. Note this should not affect
performance or locking impact of further async compactions, as skipping
a block due to being !MIGRATE_MOVABLE is done soon after skipping a
block marked to be skipped, both without locking.Signed-off-by: Vlastimil Babka
Cc: Rik van Riel
Acked-by: Mel Gorman
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit de6c60a6c115acaa721cfd499e028a413d1fcbf3 upstream.
Currently there are several functions to manipulate the deferred
compaction state variables. The remaining case where the variables are
touched directly is when a successful allocation occurs in direct
compaction, or is expected to be successful in the future by kswapd.
Here, the lowest order that is expected to fail is updated, and in the
case of successful allocation, the deferred status and counter is reset
completely.Create a new function compaction_defer_reset() to encapsulate this
functionality and make it easier to understand the code. No functional
change.Signed-off-by: Vlastimil Babka
Acked-by: Mel Gorman
Reviewed-by: Rik van Riel
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 0eb927c0ab789d3d7d69f68acb850f69d4e7c36f upstream.
The broad goal of the series is to improve allocation success rates for
huge pages through memory compaction, while trying not to increase the
compaction overhead. The original objective was to reintroduce
capturing of high-order pages freed by the compaction, before they are
split by concurrent activity. However, several bugs and opportunities
for simple improvements were found in the current implementation, mostly
through extra tracepoints (which are however too ugly for now to be
considered for sending).The patches mostly deal with two mechanisms that reduce compaction
overhead, which is caching the progress of migrate and free scanners,
and marking pageblocks where isolation failed to be skipped during
further scans.Patch 1 (from mgorman) adds tracepoints that allow calculate time spent in
compaction and potentially debug scanner pfn values.Patch 2 encapsulates the some functionality for handling deferred compactions
for better maintainability, without a functional change
type is not determined without being actually needed.Patch 3 fixes a bug where cached scanner pfn's are sometimes reset only after
they have been read to initialize a compaction run.Patch 4 fixes a bug where scanners meeting is sometimes not properly detected
and can lead to multiple compaction attempts quitting early without
doing any work.Patch 5 improves the chances of sync compaction to process pageblocks that
async compaction has skipped due to being !MIGRATE_MOVABLE.Patch 6 improves the chances of sync direct compaction to actually do anything
when called after async compaction fails during allocation slowpath.The impact of patches were validated using mmtests's stress-highalloc
benchmark with mmtests's stress-highalloc benchmark on a x86_64 machine
with 4GB memory.Due to instability of the results (mostly related to the bugs fixed by
patches 2 and 3), 10 iterations were performed, taking min,mean,max
values for success rates and mean values for time and vmstat-based
metrics.First, the default GFP_HIGHUSER_MOVABLE allocations were tested with the
patches stacked on top of v3.13-rc2. Patch 2 is OK to serve as baseline
due to no functional changes in 1 and 2. Comments below.stress-highalloc
3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
2-nothp 3-nothp 4-nothp 5-nothp 6-nothp
Success 1 Min 9.00 ( 0.00%) 10.00 (-11.11%) 43.00 (-377.78%) 43.00 (-377.78%) 33.00 (-266.67%)
Success 1 Mean 27.50 ( 0.00%) 25.30 ( 8.00%) 45.50 (-65.45%) 45.90 (-66.91%) 46.30 (-68.36%)
Success 1 Max 36.00 ( 0.00%) 36.00 ( 0.00%) 47.00 (-30.56%) 48.00 (-33.33%) 52.00 (-44.44%)
Success 2 Min 10.00 ( 0.00%) 8.00 ( 20.00%) 46.00 (-360.00%) 45.00 (-350.00%) 35.00 (-250.00%)
Success 2 Mean 26.40 ( 0.00%) 23.50 ( 10.98%) 47.30 (-79.17%) 47.60 (-80.30%) 48.10 (-82.20%)
Success 2 Max 34.00 ( 0.00%) 33.00 ( 2.94%) 48.00 (-41.18%) 50.00 (-47.06%) 54.00 (-58.82%)
Success 3 Min 65.00 ( 0.00%) 63.00 ( 3.08%) 85.00 (-30.77%) 84.00 (-29.23%) 85.00 (-30.77%)
Success 3 Mean 76.70 ( 0.00%) 70.50 ( 8.08%) 86.20 (-12.39%) 85.50 (-11.47%) 86.00 (-12.13%)
Success 3 Max 87.00 ( 0.00%) 86.00 ( 1.15%) 88.00 ( -1.15%) 87.00 ( 0.00%) 87.00 ( 0.00%)3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
2-nothp 3-nothp 4-nothp 5-nothp 6-nothp
User 6437.72 6459.76 5960.32 5974.55 6019.67
System 1049.65 1049.09 1029.32 1031.47 1032.31
Elapsed 1856.77 1874.48 1949.97 1994.22 1983.153.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
2-nothp 3-nothp 4-nothp 5-nothp 6-nothp
Minor Faults 253952267 254581900 250030122 250507333 250157829
Major Faults 420 407 506 530 530
Swap Ins 4 9 9 6 6
Swap Outs 398 375 345 346 333
Direct pages scanned 197538 189017 298574 287019 299063
Kswapd pages scanned 1809843 1801308 1846674 1873184 1861089
Kswapd pages reclaimed 1806972 1798684 1844219 1870509 1858622
Direct pages reclaimed 197227 188829 298380 286822 298835
Kswapd efficiency 99% 99% 99% 99% 99%
Kswapd velocity 953.382 970.449 952.243 934.569 922.286
Direct efficiency 99% 99% 99% 99% 99%
Direct velocity 104.058 101.832 153.961 143.200 148.205
Percentage direct scans 9% 9% 13% 13% 13%
Zone normal velocity 347.289 359.676 348.063 339.933 332.983
Zone dma32 velocity 710.151 712.605 758.140 737.835 737.507
Zone dma velocity 0.000 0.000 0.000 0.000 0.000
Page writes by reclaim 557.600 429.000 353.600 426.400 381.800
Page writes file 159 53 7 79 48
Page writes anon 398 375 345 346 333
Page reclaim immediate 825 644 411 575 420
Sector Reads 2781750 2769780 2878547 2939128 2910483
Sector Writes 12080843 12083351 12012892 12002132 12010745
Page rescued immediate 0 0 0 0 0
Slabs scanned 1575654 1545344 1778406 1786700 1794073
Direct inode steals 9657 10037 15795 14104 14645
Kswapd inode steals 46857 46335 50543 50716 51796
Kswapd skipped wait 0 0 0 0 0
THP fault alloc 97 91 81 71 77
THP collapse alloc 456 506 546 544 565
THP splits 6 5 5 4 4
THP fault fallback 0 1 0 0 0
THP collapse fail 14 14 12 13 12
Compaction stalls 1006 980 1537 1536 1548
Compaction success 303 284 562 559 578
Compaction failures 702 696 974 976 969
Page migrate success 1177325 1070077 3927538 3781870 3877057
Page migrate failure 0 0 0 0 0
Compaction pages isolated 2547248 2306457 8301218 8008500 8200674
Compaction migrate scanned 42290478 38832618 153961130 154143900 159141197
Compaction free scanned 89199429 79189151 356529027 351943166 356326727
Compaction cost 1566 1426 5312 5156 5294
NUMA PTE updates 0 0 0 0 0
NUMA hint faults 0 0 0 0 0
NUMA hint local faults 0 0 0 0 0
NUMA hint local percent 100 100 100 100 100
NUMA pages migrated 0 0 0 0 0
AutoNUMA cost 0 0 0 0 0Observations:
- The "Success 3" line is allocation success rate with system idle
(phases 1 and 2 are with background interference). I used to get stable
values around 85% with vanilla 3.11. The lower min and mean values came
with 3.12. This was bisected to commit 81c0a2bb ("mm: page_alloc: fair
zone allocator policy") As explained in comment for patch 3, I don't
think the commit is wrong, but that it makes the effect of compaction
bugs worse. From patch 3 onwards, the results are OK and match the 3.11
results.- Patch 4 also clearly helps phases 1 and 2, and exceeds any results
I've seen with 3.11 (I didn't measure it that thoroughly then, but it
was never above 40%).- Compaction cost and number of scanned pages is higher, especially due
to patch 4. However, keep in mind that patches 3 and 4 fix existing
bugs in the current design of compaction overhead mitigation, they do
not change it. If overhead is found unacceptable, then it should be
decreased differently (and consistently, not due to random conditions)
than the current implementation does. In contrast, patches 5 and 6
(which are not strictly bug fixes) do not increase the overhead (but
also not success rates). This might be a limitation of the
stress-highalloc benchmark as it's quite uniform.Another set of results is when configuring stress-highalloc t allocate
with similar flags as THP uses:
(GFP_HIGHUSER_MOVABLE|__GFP_NOMEMALLOC|__GFP_NORETRY|__GFP_NO_KSWAPD)stress-highalloc
3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
2-thp 3-thp 4-thp 5-thp 6-thp
Success 1 Min 2.00 ( 0.00%) 7.00 (-250.00%) 18.00 (-800.00%) 19.00 (-850.00%) 26.00 (-1200.00%)
Success 1 Mean 19.20 ( 0.00%) 17.80 ( 7.29%) 29.20 (-52.08%) 29.90 (-55.73%) 32.80 (-70.83%)
Success 1 Max 27.00 ( 0.00%) 29.00 ( -7.41%) 35.00 (-29.63%) 36.00 (-33.33%) 37.00 (-37.04%)
Success 2 Min 3.00 ( 0.00%) 8.00 (-166.67%) 21.00 (-600.00%) 21.00 (-600.00%) 32.00 (-966.67%)
Success 2 Mean 19.30 ( 0.00%) 17.90 ( 7.25%) 32.20 (-66.84%) 32.60 (-68.91%) 35.70 (-84.97%)
Success 2 Max 27.00 ( 0.00%) 30.00 (-11.11%) 36.00 (-33.33%) 37.00 (-37.04%) 39.00 (-44.44%)
Success 3 Min 62.00 ( 0.00%) 62.00 ( 0.00%) 85.00 (-37.10%) 75.00 (-20.97%) 64.00 ( -3.23%)
Success 3 Mean 66.30 ( 0.00%) 65.50 ( 1.21%) 85.60 (-29.11%) 83.40 (-25.79%) 83.50 (-25.94%)
Success 3 Max 70.00 ( 0.00%) 69.00 ( 1.43%) 87.00 (-24.29%) 86.00 (-22.86%) 87.00 (-24.29%)3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
2-thp 3-thp 4-thp 5-thp 6-thp
User 6547.93 6475.85 6265.54 6289.46 6189.96
System 1053.42 1047.28 1043.23 1042.73 1038.73
Elapsed 1835.43 1821.96 1908.67 1912.74 1956.383.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
2-thp 3-thp 4-thp 5-thp 6-thp
Minor Faults 256805673 253106328 253222299 249830289 251184418
Major Faults 395 375 423 434 448
Swap Ins 12 10 10 12 9
Swap Outs 530 537 487 455 415
Direct pages scanned 71859 86046 153244 152764 190713
Kswapd pages scanned 1900994 1870240 1898012 1892864 1880520
Kswapd pages reclaimed 1897814 1867428 1894939 1890125 1877924
Direct pages reclaimed 71766 85908 153167 152643 190600
Kswapd efficiency 99% 99% 99% 99% 99%
Kswapd velocity 1029.000 1067.782 1000.091 991.049 951.218
Direct efficiency 99% 99% 99% 99% 99%
Direct velocity 38.897 49.127 80.747 79.983 96.468
Percentage direct scans 3% 4% 7% 7% 9%
Zone normal velocity 351.377 372.494 348.910 341.689 335.310
Zone dma32 velocity 716.520 744.414 731.928 729.343 712.377
Zone dma velocity 0.000 0.000 0.000 0.000 0.000
Page writes by reclaim 669.300 604.000 545.700 538.900 429.900
Page writes file 138 66 58 83 14
Page writes anon 530 537 487 455 415
Page reclaim immediate 806 655 772 548 517
Sector Reads 2711956 2703239 2811602 2818248 2839459
Sector Writes 12163238 12018662 12038248 11954736 11994892
Page rescued immediate 0 0 0 0 0
Slabs scanned 1385088 1388364 1507968 1513292 1558656
Direct inode steals 1739 2564 4622 5496 6007
Kswapd inode steals 47461 46406 47804 48013 48466
Kswapd skipped wait 0 0 0 0 0
THP fault alloc 110 82 84 69 70
THP collapse alloc 445 482 467 462 539
THP splits 6 5 4 5 3
THP fault fallback 3 0 0 0 0
THP collapse fail 15 14 14 14 13
Compaction stalls 659 685 1033 1073 1111
Compaction success 222 225 410 427 456
Compaction failures 436 460 622 646 655
Page migrate success 446594 439978 1085640 1095062 1131716
Page migrate failure 0 0 0 0 0
Compaction pages isolated 1029475 1013490 2453074 2482698 2565400
Compaction migrate scanned 9955461 11344259 24375202 27978356 30494204
Compaction free scanned 27715272 28544654 80150615 82898631 85756132
Compaction cost 552 555 1344 1379 1436
NUMA PTE updates 0 0 0 0 0
NUMA hint faults 0 0 0 0 0
NUMA hint local faults 0 0 0 0 0
NUMA hint local percent 100 100 100 100 100
NUMA pages migrated 0 0 0 0 0
AutoNUMA cost 0 0 0 0 0There are some differences from the previous results for THP-like allocations:
- Here, the bad result for unpatched kernel in phase 3 is much more
consistent to be between 65-70% and not related to the "regression" in
3.12. Still there is the improvement from patch 4 onwards, which brings
it on par with simple GFP_HIGHUSER_MOVABLE allocations.- Compaction costs have increased, but nowhere near as much as the
non-THP case. Again, the patches should be worth the gained
determininsm.- Patches 5 and 6 somewhat increase the number of migrate-scanned pages.
This is most likely due to __GFP_NO_KSWAPD flag, which means the cached
pfn's and pageblock skip bits are not reset by kswapd that often (at
least in phase 3 where no concurrent activity would wake up kswapd) and
the patches thus help the sync-after-async compaction. It doesn't
however show that the sync compaction would help so much with success
rates, which can be again seen as a limitation of the benchmark
scenario.This patch (of 6):
Add two tracepoints for compaction begin and end of a zone. Using this it
is possible to calculate how much time a workload is spending within
compaction and potentially debug problems related to cached pfns for
scanning. In combination with the direct reclaim and slab trace points it
should be possible to estimate most allocation-related overhead for a
workload.Signed-off-by: Mel Gorman
Signed-off-by: Vlastimil Babka
Cc: Rik van Riel
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 71b54f8263860a37dd9f50f81880a9d681fd9c10 upstream.
When choosing between doing an address space or ranged flush,
the x86 implementation of flush_tlb_mm_range takes into account
whether there are any large pages in the range. A per-page
flush typically requires fewer entries than would covered by a
single large page and the check is redundant.There is one potential exception. THP migration flushes single
THP entries and it conceivably would benefit from flushing a
single entry instead of the mm. However, this flush is after a
THP allocation, copy and page table update potentially with any
other threads serialised behind it. In comparison to that, the
flush is noise. It makes more sense to optimise balancing to
require fewer flushes than to optimise the flush itself.This patch deletes the redundant huge page check.
Signed-off-by: Mel Gorman
Tested-by: Davidlohr Bueso
Reviewed-by: Rik van Riel
Signed-off-by: Andrew Morton
Cc: Alex Shi
Cc: Linus Torvalds
Cc: Peter Zijlstra
Link: http://lkml.kernel.org/n/tip-sgei1drpOcburujPsfh6ovmo@git.kernel.org
Signed-off-by: Ingo Molnar
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 15aa368255f249df0b2af630c9487bb5471bd7da upstream.
NR_TLB_LOCAL_FLUSH_ALL is not always accounted for correctly and
the comparison with total_vm is done before taking
tlb_flushall_shift into account. Clean it up.Signed-off-by: Mel Gorman
Tested-by: Davidlohr Bueso
Reviewed-by: Alex Shi
Reviewed-by: Rik van Riel
Signed-off-by: Andrew Morton
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Hugh Dickins
Link: http://lkml.kernel.org/n/tip-Iz5gcahrgskIldvukulzi0hh@git.kernel.org
Signed-off-by: Ingo Molnar
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit ec65993443736a5091b68e80ff1734548944a4b8 upstream.
Bisection between 3.11 and 3.12 fingered commit 9824cf97 ("mm:
vmstats: tlb flush counters") to cause overhead problems.The counters are undeniably useful but how often do we really
need to debug TLB flush related issues? It does not justify
taking the penalty everywhere so make it a debugging option.Signed-off-by: Mel Gorman
Tested-by: Davidlohr Bueso
Reviewed-by: Rik van Riel
Signed-off-by: Andrew Morton
Cc: Hugh Dickins
Cc: Alex Shi
Cc: Linus Torvalds
Cc: Peter Zijlstra
Link: http://lkml.kernel.org/n/tip-XzxjntugxuwpxXhcrxqqh53b@git.kernel.org
Signed-off-by: Ingo Molnar
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 0cbef29a782162a3896487901eca4550bfa397ef upstream.
When __rmqueue_fallback() doesn't find a free block with the required size
it splits a larger page and puts the rest of the page onto the free list.But it has one serious mistake. When putting back, __rmqueue_fallback()
always use start_migratetype if type is not CMA. However,
__rmqueue_fallback() is only called when all of the start_migratetype
queue is empty. That said, __rmqueue_fallback always puts back memory to
the wrong queue except try_to_steal_freepages() changed pageblock type
(i.e. requested size is smaller than half of page block). The end result
is that the antifragmentation framework increases fragmenation instead of
decreasing it.Mel's original anti fragmentation does the right thing. But commit
47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") broke it.This patch restores sane and old behavior. It also removes an incorrect
comment which was introduced by commit fef903efcf0c ("mm/page_alloc.c:
restructure free-page stealing code and fix a bug").Signed-off-by: KOSAKI Motohiro
Cc: Mel Gorman
Cc: Michal Nazarewicz
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 52c8f6a5aeb0bdd396849ecaa72d96f8175528f5 upstream.
In general, every tracepoint should be zero overhead if it is disabled.
However, trace_mm_page_alloc_extfrag() is one of exception. It evaluate
"new_type == start_migratetype" even if tracepoint is disabled.However, the code can be moved into tracepoint's TP_fast_assign() and
TP_fast_assign exist exactly such purpose. This patch does it.Signed-off-by: KOSAKI Motohiro
Acked-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit af248a0c67457e5c6d2bcf288f07b4b2ed064f1f upstream.
The kernel's readahead algorithm sometimes interprets random read
accesses as sequential and triggers unnecessary data prefecthing from
storage device (impacting random read average latency).In order to identify sequential cache read misses, the readahead
algorithm intends to check whether offset - previous offset == 1
(trivial sequential reads) or offset - previous offset == 0 (sequential
reads not aligned on page boundary):if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) current offset (which happens on random pattern), the if
condition is true and access is wrongly interpeted as sequential. An
unnecessary data prefetching is triggered, impacting the average random
read latency.Storing the previous offset value in a "pgoff_t" variable (unsigned
long) fixes the sequential read detection logic.Signed-off-by: Damien Ramonda
Reviewed-by: Fengguang Wu
Acked-by: Pierre Tardy
Acked-by: David Cohen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 18ab4d4ced0817421e6db6940374cc39d28d65da upstream.
Originally get_swap_page() started iterating through the singly-linked
list of swap_info_structs using swap_list.next or highest_priority_index,
which both were intended to point to the highest priority active swap
target that was not full. The first patch in this series changed the
singly-linked list to a doubly-linked list, and removed the logic to start
at the highest priority non-full entry; it starts scanning at the highest
priority entry each time, even if the entry is full.Replace the manually ordered swap_list_head with a plist, swap_active_head.
Add a new plist, swap_avail_head. The original swap_active_head plist
contains all active swap_info_structs, as before, while the new
swap_avail_head plist contains only swap_info_structs that are active and
available, i.e. not full. Add a new spinlock, swap_avail_lock, to protect
the swap_avail_head list.Mel Gorman suggested using plists since they internally handle ordering
the list entries based on priority, which is exactly what swap was doing
manually. All the ordering code is now removed, and swap_info_struct
entries and simply added to their corresponding plist and automatically
ordered correctly.Using a new plist for available swap_info_structs simplifies and
optimizes get_swap_page(), which no longer has to iterate over full
swap_info_structs. Using a new spinlock for swap_avail_head plist
allows each swap_info_struct to add or remove themselves from the
plist when they become full or not-full; previously they could not
do so because the swap_info_struct->lock is held when they change
from fullnot-full, and the swap_lock protecting the main
swap_active_head must be ordered before any swap_info_struct->lock.Signed-off-by: Dan Streetman
Acked-by: Mel Gorman
Cc: Shaohua Li
Cc: Steven Rostedt
Cc: Peter Zijlstra
Cc: Hugh Dickins
Cc: Dan Streetman
Cc: Michal Hocko
Cc: Christian Ehrhardt
Cc: Weijie Yang
Cc: Rik van Riel
Cc: Johannes Weiner
Cc: Bob Liu
Cc: Paul Gortmaker
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit a75f232ce0fe38bd01301899ecd97ffd0254316a upstream.
Add plist_requeue(), which moves the specified plist_node after all other
same-priority plist_nodes in the list. This is essentially an optimized
plist_del() followed by plist_add().This is needed by swap, which (with the next patch in this set) uses a
plist of available swap devices. When a swap device (either a swap
partition or swap file) are added to the system with swapon(), the device
is added to a plist, ordered by the swap device's priority. When swap
needs to allocate a page from one of the swap devices, it takes the page
from the first swap device on the plist, which is the highest priority
swap device. The swap device is left in the plist until all its pages are
used, and then removed from the plist when it becomes full.However, as described in man 2 swapon, swap must allocate pages from swap
devices with the same priority in round-robin order; to do this, on each
swap page allocation, swap uses a page from the first swap device in the
plist, and then calls plist_requeue() to move that swap device entry to
after any other same-priority swap devices. The next swap page allocation
will again use a page from the first swap device in the plist and requeue
it, and so on, resulting in round-robin usage of equal-priority swap
devices.Also add plist_test_requeue() test function, for use by plist_test() to
test plist_requeue() function.Signed-off-by: Dan Streetman
Cc: Steven Rostedt
Cc: Peter Zijlstra
Acked-by: Mel Gorman
Cc: Paul Gortmaker
Cc: Thomas Gleixner
Cc: Shaohua Li
Cc: Hugh Dickins
Cc: Dan Streetman
Cc: Michal Hocko
Cc: Christian Ehrhardt
Cc: Weijie Yang
Cc: Rik van Riel
Cc: Johannes Weiner
Cc: Bob Liu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit fd16618e12a05df79a3439d72d5ffdac5d34f3da upstream.
Add PLIST_HEAD() to plist.h, equivalent to LIST_HEAD() from list.h, to
define and initialize a struct plist_head.Add plist_for_each_continue() and plist_for_each_entry_continue(),
equivalent to list_for_each_continue() and list_for_each_entry_continue(),
to iterate over a plist continuing after the current position.Add plist_prev() and plist_next(), equivalent to (struct list_head*)->prev
and ->next, implemented by list_prev_entry() and list_next_entry(), to
access the prev/next struct plist_node entry. These are needed because
unlike struct list_head, direct access of the prev/next struct plist_node
isn't possible; the list must be navigated via the contained struct
list_head. e.g. instead of accessing the prev by list_prev_entry(node,
node_list) it can be accessed by plist_prev(node).Signed-off-by: Dan Streetman
Acked-by: Mel Gorman
Cc: Paul Gortmaker
Cc: Steven Rostedt
Cc: Thomas Gleixner
Cc: Shaohua Li
Cc: Hugh Dickins
Cc: Dan Streetman
Cc: Michal Hocko
Cc: Christian Ehrhardt
Cc: Weijie Yang
Cc: Rik van Riel
Cc: Johannes Weiner
Cc: Bob Liu
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit adfab836f4908deb049a5128082719e689eed964 upstream.
The logic controlling the singly-linked list of swap_info_struct entries
for all active, i.e. swapon'ed, swap targets is rather complex, because:- it stores the entries in priority order
- there is a pointer to the highest priority entry
- there is a pointer to the highest priority not-full entry
- there is a highest_priority_index variable set outside the swap_lock
- swap entries of equal priority should be used equallythis complexity leads to bugs such as: https://lkml.org/lkml/2014/2/13/181
where different priority swap targets are incorrectly used equally.That bug probably could be solved with the existing singly-linked lists,
but I think it would only add more complexity to the already difficult to
understand get_swap_page() swap_list iteration logic.The first patch changes from a singly-linked list to a doubly-linked list
using list_heads; the highest_priority_index and related code are removed
and get_swap_page() starts each iteration at the highest priority
swap_info entry, even if it's full. While this does introduce unnecessary
list iteration (i.e. Schlemiel the painter's algorithm) in the case where
one or more of the highest priority entries are full, the iteration and
manipulation code is much simpler and behaves correctly re: the above bug;
and the fourth patch removes the unnecessary iteration.The second patch adds some minor plist helper functions; nothing new
really, just functions to match existing regular list functions. These
are used by the next two patches.The third patch adds plist_requeue(), which is used by get_swap_page() in
the next patch - it performs the requeueing of same-priority entries
(which moves the entry to the end of its priority in the plist), so that
all equal-priority swap_info_structs get used equally.The fourth patch converts the main list into a plist, and adds a new plist
that contains only swap_info entries that are both active and not full.
As Mel suggested using plists allows removing all the ordering code from
swap - plists handle ordering automatically. The list naming is also
clarified now that there are two lists, with the original list changed
from swap_list_head to swap_active_head and the new list named
swap_avail_head. A new spinlock is also added for the new list, so
swap_info entries can be added or removed from the new list immediately as
they become full or not full.This patch (of 4):
Replace the singly-linked list tracking active, i.e. swapon'ed,
swap_info_struct entries with a doubly-linked list using struct
list_heads. Simplify the logic iterating and manipulating the list of
entries, especially get_swap_page(), by using standard list_head
functions, and removing the highest priority iteration logic.The change fixes the bug:
https://lkml.org/lkml/2014/2/13/181
in which different priority swap entries after the highest priority entry
are incorrectly used equally in pairs. The swap behavior is now as
advertised, i.e. different priority swap entries are used in order, and
equal priority swap targets are used concurrently.Signed-off-by: Dan Streetman
Acked-by: Mel Gorman
Cc: Shaohua Li
Cc: Hugh Dickins
Cc: Dan Streetman
Cc: Michal Hocko
Cc: Christian Ehrhardt
Cc: Weijie Yang
Cc: Rik van Riel
Cc: Johannes Weiner
Cc: Bob Liu
Cc: Steven Rostedt
Cc: Peter Zijlstra
Cc: Paul Gortmaker
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 70ef57e6c22c3323dce179b7d0d433c479266612 upstream.
We had a report about strange OOM killer strikes on a PPC machine
although there was a lot of swap free and a tons of anonymous memory
which could be swapped out. In the end it turned out that the OOM was a
side effect of zone reclaim which wasn't unmapping and swapping out and
so the system was pushed to the OOM. Although this sounds like a bug
somewhere in the kswapd vs. zone reclaim vs. direct reclaim
interaction numactl on the said hardware suggests that the zone reclaim
should not have been set in the first place:node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus:
node 2 size: 7168 MB
node 2 free: 6019 MB
node distances:
node 0 2
0: 10 40
2: 40 10So all the CPUs are associated with Node0 which doesn't have any memory
while Node2 contains all the available memory. Node distances cause an
automatic zone_reclaim_mode enabling.Zone reclaim is intended to keep the allocations local but this doesn't
make any sense on the memoryless nodes. So let's exclude such nodes for
init_zone_allows_reclaim which evaluates zone reclaim behavior and
suitable reclaim_nodes.Signed-off-by: Michal Hocko
Acked-by: David Rientjes
Acked-by: Nishanth Aravamudan
Tested-by: Nishanth Aravamudan
Acked-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 457c1b27ed56ec472d202731b12417bff023594a upstream.
Currently, I am seeing the following when I `mount -t hugetlbfs /none
/dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I think it's
related to the fact that hugetlbfs is properly not correctly setting
itself up in this state?:Unable to handle kernel paging request for data at address 0x00000031
Faulting instruction address: 0xc000000000245710
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048 NUMA pSeries
....In KVM guests on Power, in a guest not backed by hugepages, we see the
following:AnonHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 64 kBHPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
are not supported at boot-time, but this is only checked in
hugetlb_init(). Extract the check to a helper function, and use it in a
few relevant places.This does make hugetlbfs not supported (not registered at all) in this
environment. I believe this is fine, as there are no valid hugepages
and that won't change at runtime.[akpm@linux-foundation.org: use pr_info(), per Mel]
[akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
Signed-off-by: Nishanth Aravamudan
Reviewed-by: Aneesh Kumar K.V
Acked-by: Mel Gorman
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 887843961c4b4681ee993c36d4997bf4b4aa8253 upstream.
Fix some "Bad rss-counter state" reports on exit, arising from the
interaction between page migration and remap_file_pages(): zap_pte()
must count a migration entry when zapping it.And yes, it is possible (though very unusual) to find an anon page or
swap entry in a VM_SHARED nonlinear mapping: coming from that horrid
get_user_pages(write, force) case which COWs even in a shared mapping.Signed-off-by: Hugh Dickins
Tested-by: Sasha Levin sasha.levin@oracle.com>
Tested-by: Dave Jones davej@redhat.com>
Cc: Cyrill Gorcunov
Cc: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit da8c757b080ee84f219fa2368cb5dd23ac304fc0 upstream.
If echo -1 > /proc/vm/sys/min_free_kbytes, the system will hang. Changing
proc_dointvec() to proc_dointvec_minmax() in the
min_free_kbytes_sysctl_handler() can prevent this to happen.mhocko said:
: You can still do echo $BIG_VALUE > /proc/vm/sys/min_free_kbytes and make
: your machine unusable but I agree that proc_dointvec_minmax is more
: suitable here as we already have:
:
: .proc_handler = min_free_kbytes_sysctl_handler,
: .extra1 = &zero,
:
: It used to work properly but then 6fce56ec91b5 ("sysctl: Remove references
: to ctl_name and strategy from the generic sysctl table") has removed
: sysctl_intvec strategy and so extra1 is ignored.Signed-off-by: Han Pingtian
Acked-by: Michal Hocko
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 73293c2f900d0adbb6a415b312cd57976d5ae242 upstream.
We checked pfmemalloc by slab unit, not page unit. You can see this
in is_slab_pfmemalloc(). So other pages don't need to be set/cleared
pfmemalloc.And, therefore we should check pfmemalloc in page flag of first page,
but current implementation don't do that. virt_to_head_page(obj) just
return 'struct page' of that object, not one of first page, since the SLAB
don't use __GFP_COMP when CONFIG_MMU. To get 'struct page' of first page,
we first get a slab and try to get it via virt_to_head_page(slab->s_mem).Acked-by: Andi Kleen
Signed-off-by: Joonsoo Kim
Signed-off-by: Pekka Enberg
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 9f1b868a13ac36bd207a571f5ea1193d823ab18d upstream.
Khugepaged will scan/free HPAGE_PMD_NR normal pages and replace with a
hugepage which is allocated from the node of the first scanned normal
page, but this policy is too rough and may end with unexpected result to
upper users.The problem is the original page-balancing among all nodes will be
broken after hugepaged started. Thinking about the case if the first
scanned normal page is allocated from node A, most of other scanned
normal pages are allocated from node B or C.. But hugepaged will always
allocate hugepage from node A which will cause extra memory pressure on
node A which is not the situation before khugepaged started.This patch try to fix this problem by making khugepaged allocate
hugepage from the node which have max record of scaned normal pages hit,
so that the effect to original page-balancing can be minimized.The other problem is if normal scanned pages are equally allocated from
Node A,B and C, after khugepaged started Node A will still suffer extra
memory pressure.Andrew Davidoff reported a related issue several days ago. He wanted
his application interleaving among all nodes and "numactl
--interleave=all ./test" was used to run the testcase, but the result
wasn't not as expected.cat /proc/2814/numa_maps:
7f50bd440000 interleave:0-3 anon=51403 dirty=51403 N0=435 N1=435 N2=435 N3=50098The end result showed that most pages are from Node3 instead of
interleave among node0-3 which was unreasonable.This patch also fix this issue by allocating hugepage round robin from
all nodes have the same record, after this patch the result was as
expected:7f78399c0000 interleave:0-3 anon=51403 dirty=51403 N0=12723 N1=12723 N2=13235 N3=12722
The simple testcase is like this:
int main() {
char *p;
int i;
int j;for (i=0; i < 200; i++) {
p = (char *)malloc(1048576);
printf("malloc done\n");if (p == 0) {
printf("Out of memory\n");
return 1;
}
for (j=0; j < 1048576; j++) {
p[j] = 'A';
}
printf("touched memory\n");sleep(1);
}
printf("enter sleep\n");
while(1) {
sleep(100);
}
}[akpm@linux-foundation.org: make last_khugepaged_target_node local to khugepaged_find_target_node()]
Reported-by: Andrew Davidoff
Tested-by: Andrew Davidoff
Signed-off-by: Bob Liu
Cc: Andrea Arcangeli
Cc: Kirill A. Shutemov
Cc: Mel Gorman
Cc: Yasuaki Ishimatsu
Cc: Wanpeng Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby -
commit 10dc4155c7714f508fe2e4667164925ea971fb25 upstream.
Move alloc_hugepage() to a better place, no need for a seperate #ifndef
CONFIG_NUMASigned-off-by: Bob Liu
Reviewed-by: Yasuaki Ishimatsu
Acked-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Mel Gorman
Cc: Andrew Davidoff
Cc: Wanpeng Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Mel Gorman
Signed-off-by: Jiri Slaby