Commit 579f82901f6f41256642936d7e632f3979ad76d4

Authored by Shaohua Li
Committed by Linus Torvalds
1 parent fb951eb5e1

swap: add a simple detector for inappropriate swapin readahead

This is a patch to improve the swap readahead algorithm.  It's from Hugh,
and I slightly changed it.

Hugh's original changelog:

swapin readahead does a blind readahead, whether or not the swapin is
sequential.  This may be ok on hard disk, because large reads have
relatively small costs, and if the readahead pages are unneeded they can
be reclaimed easily - though, what if their allocation forced reclaim of
useful pages?  But on SSD devices large reads are more expensive than
small ones: if the readahead pages are unneeded, reading them in causes
significant overhead.

This patch adds very simplistic random read detection.  Stealing the
PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the
vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages()
simply looks at readahead's current success rate, and narrows or widens
its readahead window accordingly.  There is little science to its
heuristic: it's about as stupid as can be whilst remaining effective.
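
For illustration, a minimal sketch of that feedback loop (made-up names,
not the kernel code; the real swapin_nr_pages() in the diff below also
checks for an adjacent offset before dropping to a single page, and
applies the damping Shaohua describes further down):

	/*
	 * Simplified model: readahead hits counted since the last swapin
	 * decide how wide the next window may be, capped at 1 << page_cluster.
	 */
	static unsigned int next_window(unsigned int hits, unsigned int max_pages)
	{
		unsigned int pages = hits + 2;	/* the "+ 2" keeps some optimism */
		unsigned int roundup = 4;

		if (pages == 2)			/* no hits at all: read one page */
			return 1;
		while (roundup < pages)		/* round up to a power of two */
			roundup <<= 1;
		pages = roundup;

		return pages < max_pages ? pages : max_pages;
	}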

The table below shows elapsed times (in centiseconds) when running a
single repetitive swapping load across a 1000MB mapping in 900MB ram
with 1GB swap (the hard disk tests took painfully long when I used
mem=500M, but SSD shows similar results for that).

Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his
Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch
which Shaohua showed to be defective; HughNew this Nov 14 patch, with
page_cluster as usual at default of 3 (8-page reads); HughPC4 this same
patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0
(1-page reads: no readahead).

HDD for swapping to hard disk, SSD for swapping to VertexII SSD.  Seq for
sequential access to the mapping, cycling five times around; Rand for
the same number of random touches.  Anon for a MAP_PRIVATE anon mapping;
Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.

One weakness of Shaohua's vma/anon_vma approach was that it did not
optimize Shmem: seen below.  Konstantin's approach was perhaps mistuned,
50% slower on Seq: did not compete and is not shown below.

HDD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon     73921   76210   75611   76904   78191  121542
Seq Shmem    73601   73176   73855   72947   74543  118322
Rand Anon   895392  831243  871569  845197  846496  841680
Rand Shmem 1058375 1053486  827935  764955  764376  756489

SSD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon     24634   24198   24673   25107   21614   70018
Seq Shmem    24959   24932   25052   25703   22030   69678
Rand Anon    43014   26146   28075   25989   26935   25901
Rand Shmem   45349   45215   28249   24268   24138   24332

These tests are, of course, two extremes of a very simple case: under
heavier mixed loads I've not yet observed any consistent improvement or
degradation, and wider testing would be welcome.

Shaohua Li:

Testing shows Vanilla is slightly better than Hugh's patch in the
sequential workload.  I observed that with Hugh's patch the readahead
size is sometimes shrunk too fast (from 8 to 1 immediately) in the
sequential workload if there is no hit.  In such cases, continuing to do
readahead is actually beneficial.

I haven't prepared a sophisticated algorithm for the sequential workload,
because so far we can't guarantee that sequentially accessed pages are
swapped out sequentially.  So I slightly changed Hugh's heuristic: don't
shrink the readahead size too fast.
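
Distilled into a sketch (illustrative only; the actual logic is the
last_readahead_pages handling inside swapin_nr_pages() in the diff
below), the tweak floors each new window at half of the previous one:

	/* Illustrative only: never shrink faster than halving per swapin. */
	static unsigned int damp_window(unsigned int pages, unsigned int last_ra)
	{
		unsigned int floor = last_ra / 2;

		return pages < floor ? floor : pages;
	}

With page_cluster at its default of 3, a run of misses now shrinks the
window 8 -> 4 -> 2 -> 1 over successive swapins instead of collapsing
straight from 8 to 1.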

Here is my test result (unit: seconds, average of 3 runs):
	Vanilla		Hugh		New
Seq	356		370		360
Random	4525		2447		2444

The attached graph is the swapin/swapout throughput I collected with
'vmstat 2'.  The first part runs a random workload (until around 1200 on
the x-axis) and the second part runs a sequential workload.  Swapin and
swapout throughput are almost identical in steady state in both
workloads, which is the expected behaviour; whereas in Vanilla, swapin
is much bigger than swapout, especially in the random workload (because
of wrong readahead).

Original patches by: Shaohua Li and Konstantin Khlebnikov.

[fengguang.wu@intel.com: swapin_nr_pages() can be static]
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Showing 2 changed files with 62 additions and 5 deletions

include/linux/page-flags.h
@@ -228,9 +228,9 @@
 TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
 PAGEFLAG(MappedToDisk, mappedtodisk)
 
-/* PG_readahead is only used for file reads; PG_reclaim is only for writes */
+/* PG_readahead is only used for reads; PG_reclaim is only for writes */
 PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
-PAGEFLAG(Readahead, reclaim)	/* Reminder to do async read-ahead */
+PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim)
 
 #ifdef CONFIG_HIGHMEM
 /*
mm/swap_state.c
@@ -63,6 +63,8 @@
 	return ret;
 }
 
+static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
+
 void show_swap_cache_info(void)
 {
 	printk("%lu pages in swap cache\n", total_swapcache_pages());
@@ -286,8 +288,11 @@
 
 	page = find_get_page(swap_address_space(entry), entry.val);
 
-	if (page)
+	if (page) {
 		INC_CACHE_INFO(find_success);
+		if (TestClearPageReadahead(page))
+			atomic_inc(&swapin_readahead_hits);
+	}
 
 	INC_CACHE_INFO(find_total);
 	return page;
@@ -389,6 +394,50 @@
 	return found_page;
 }
 
+static unsigned long swapin_nr_pages(unsigned long offset)
+{
+	static unsigned long prev_offset;
+	unsigned int pages, max_pages, last_ra;
+	static atomic_t last_readahead_pages;
+
+	max_pages = 1 << ACCESS_ONCE(page_cluster);
+	if (max_pages <= 1)
+		return 1;
+
+	/*
+	 * This heuristic has been found to work well on both sequential and
+	 * random loads, swapping to hard disk or to SSD: please don't ask
+	 * what the "+ 2" means, it just happens to work well, that's all.
+	 */
+	pages = atomic_xchg(&swapin_readahead_hits, 0) + 2;
+	if (pages == 2) {
+		/*
+		 * We can have no readahead hits to judge by: but must not get
+		 * stuck here forever, so check for an adjacent offset instead
+		 * (and don't even bother to check whether swap type is same).
+		 */
+		if (offset != prev_offset + 1 && offset != prev_offset - 1)
+			pages = 1;
+		prev_offset = offset;
+	} else {
+		unsigned int roundup = 4;
+		while (roundup < pages)
+			roundup <<= 1;
+		pages = roundup;
+	}
+
+	if (pages > max_pages)
+		pages = max_pages;
+
+	/* Don't shrink readahead too fast */
+	last_ra = atomic_read(&last_readahead_pages) / 2;
+	if (pages < last_ra)
+		pages = last_ra;
+	atomic_set(&last_readahead_pages, pages);
+
+	return pages;
+}
+
 /**
  * swapin_readahead - swap in pages in hope we need them soon
  * @entry: swap entry of this memory
@@ -412,11 +461,16 @@
 		struct vm_area_struct *vma, unsigned long addr)
 {
 	struct page *page;
-	unsigned long offset = swp_offset(entry);
+	unsigned long entry_offset = swp_offset(entry);
+	unsigned long offset = entry_offset;
 	unsigned long start_offset, end_offset;
-	unsigned long mask = (1UL << page_cluster) - 1;
+	unsigned long mask;
 	struct blk_plug plug;
 
+	mask = swapin_nr_pages(offset) - 1;
+	if (!mask)
+		goto skip;
+
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
 	end_offset = offset | mask;
@@ -430,11 +484,14 @@
 						gfp_mask, vma, addr);
 		if (!page)
 			continue;
+		if (offset != entry_offset)
+			SetPageReadahead(page);
 		page_cache_release(page);
 	}
 	blk_finish_plug(&plug);
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
+skip:
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
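
For reference, the start_offset/end_offset arithmetic above reads a
naturally aligned cluster of pages around the faulting offset.  A tiny
standalone sketch with made-up values (not kernel code):

	#include <stdio.h>

	/*
	 * Illustrative only: how the window size returned by
	 * swapin_nr_pages() turns into the aligned offset range that
	 * swapin_readahead() actually reads.
	 */
	int main(void)
	{
		unsigned long offset = 0x2f5;		/* faulting swap offset */
		unsigned long pages  = 8;		/* window from swapin_nr_pages() */
		unsigned long mask   = pages - 1;	/* 0x7 */
		unsigned long start_offset = offset & ~mask;	/* 0x2f0 */
		unsigned long end_offset   = offset | mask;	/* 0x2f7 */

		printf("read offsets 0x%lx..0x%lx (%lu pages)\n",
		       start_offset, end_offset, end_offset - start_offset + 1);
		return 0;
	}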