Commit 579f82901f6f41256642936d7e632f3979ad76d4

Authored by Shaohua Li
Committed by Linus Torvalds
1 parent fb951eb5e1

swap: add a simple detector for inappropriate swapin readahead

This is a patch to improve the swap readahead algorithm.  It's from Hugh,
and I slightly changed it.

Hugh's original changelog:

swapin readahead does a blind readahead, whether or not the swapin is
sequential.  This may be ok on hard disk, because large reads have
relatively small costs, and if the readahead pages are unneeded they can
be reclaimed easily - though, what if their allocation forced reclaim of
useful pages?  But on SSD devices large reads are more expensive than
small ones: if the readahead pages are unneeded, reading them in causes
significant overhead.

This patch adds very simplistic random read detection.  Stealing the
PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the
vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages()
simply looks at readahead's current success rate, and narrows or widens
its readahead window accordingly.  There is little science to its
heuristic: it's about as stupid as can be whilst remaining effective.
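
For illustration, a minimal sketch of that feedback loop (made-up names,
not the kernel code; the real swapin_nr_pages() in the diff below also
checks for an adjacent offset before dropping to a single page, and
applies the damping Shaohua describes further down):

	/*
	 * Simplified model: readahead hits counted since the last swapin
	 * decide how wide the next window may be, capped at 1 << page_cluster.
	 */
	static unsigned int next_window(unsigned int hits, unsigned int max_pages)
	{
		unsigned int pages = hits + 2;	/* the "+ 2" keeps some optimism */
		unsigned int roundup = 4;

		if (pages == 2)			/* no hits at all: read one page */
			return 1;
		while (roundup < pages)		/* round up to a power of two */
			roundup <<= 1;
		pages = roundup;

		return pages < max_pages ? pages : max_pages;
	}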

The table below shows elapsed times (in centiseconds) when running a
single repetitive swapping load across a 1000MB mapping in 900MB ram
with 1GB swap (the hard disk tests took painfully long when I used
mem=500M, but SSD shows similar results for that).

Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his
Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch
which Shaohua showed to be defective; HughNew this Nov 14 patch, with
page_cluster as usual at default of 3 (8-page reads); HughPC4 this same
patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0
(1-page reads: no readahead).

HDD for swapping to hard disk, SSD for swapping to VertexII SSD.  Seq for
sequential access to the mapping, cycling five times around; Rand for
the same number of random touches.  Anon for a MAP_PRIVATE anon mapping;
Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.

One weakness of Shaohua's vma/anon_vma approach was that it did not
optimize Shmem: seen below.  Konstantin's approach was perhaps mistuned,
50% slower on Seq: did not compete and is not shown below.

HDD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon     73921   76210   75611   76904   78191  121542
Seq Shmem    73601   73176   73855   72947   74543  118322
Rand Anon   895392  831243  871569  845197  846496  841680
Rand Shmem 1058375 1053486  827935  764955  764376  756489

SSD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon     24634   24198   24673   25107   21614   70018
Seq Shmem    24959   24932   25052   25703   22030   69678
Rand Anon    43014   26146   28075   25989   26935   25901
Rand Shmem   45349   45215   28249   24268   24138   24332

These tests are, of course, two extremes of a very simple case: under
heavier mixed loads I've not yet observed any consistent improvement or
degradation, and wider testing would be welcome.

Shaohua Li:

Testing shows Vanilla is slightly better than Hugh's patch in the
sequential workload.  I observed that with Hugh's patch the readahead
size is sometimes shrunk too fast (from 8 to 1 immediately) in the
sequential workload if there is no hit.  In such cases, continuing to do
readahead is actually beneficial.

I haven't prepared a sophisticated algorithm for the sequential workload,
because so far we can't guarantee that sequentially accessed pages are
swapped out sequentially.  So I slightly changed Hugh's heuristic: don't
shrink the readahead size too fast.
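
Distilled into a sketch (illustrative only; the actual logic is the
last_readahead_pages handling inside swapin_nr_pages() in the diff
below), the tweak floors each new window at half of the previous one:

	/* Illustrative only: never shrink faster than halving per swapin. */
	static unsigned int damp_window(unsigned int pages, unsigned int last_ra)
	{
		unsigned int floor = last_ra / 2;

		return pages < floor ? floor : pages;
	}

With page_cluster at its default of 3, a run of misses now shrinks the
window 8 -> 4 -> 2 -> 1 over successive swapins instead of collapsing
straight from 8 to 1.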

Here is my test result (unit: seconds, average of 3 runs):
	Vanilla		Hugh		New
Seq	356		370		360
Random	4525		2447		2444

The attached graph is the swapin/swapout throughput I collected with
'vmstat 2'.  The first part runs a random workload (until around 1200 on
the x-axis) and the second part runs a sequential workload.  Swapin and
swapout throughput are almost identical in steady state in both
workloads, which is the expected behaviour; whereas in Vanilla, swapin
is much bigger than swapout, especially in the random workload (because
of wrong readahead).

Original patches by: Shaohua Li and Konstantin Khlebnikov.

[fengguang.wu@intel.com: swapin_nr_pages() can be static]
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Showing 2 changed files with 62 additions and 5 deletions

include/linux/page-flags.h
@@ -228,9 +228,9 @@
 TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
 PAGEFLAG(MappedToDisk, mappedtodisk)
 
-/* PG_readahead is only used for file reads; PG_reclaim is only for writes */
+/* PG_readahead is only used for reads; PG_reclaim is only for writes */
 PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
-PAGEFLAG(Readahead, reclaim)	/* Reminder to do async read-ahead */
+PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim)
 
 #ifdef CONFIG_HIGHMEM
 /*
mm/swap_state.c
@@ -63,6 +63,8 @@
 	return ret;
 }
 
+static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
+
 void show_swap_cache_info(void)
 {
 	printk("%lu pages in swap cache\n", total_swapcache_pages());
@@ -286,8 +288,11 @@
 
 	page = find_get_page(swap_address_space(entry), entry.val);
 
-	if (page)
+	if (page) {
 		INC_CACHE_INFO(find_success);
+		if (TestClearPageReadahead(page))
+			atomic_inc(&swapin_readahead_hits);
+	}
 
 	INC_CACHE_INFO(find_total);
 	return page;
@@ -389,6 +394,50 @@
 	return found_page;
 }
 
+static unsigned long swapin_nr_pages(unsigned long offset)
+{
+	static unsigned long prev_offset;
+	unsigned int pages, max_pages, last_ra;
+	static atomic_t last_readahead_pages;
+
+	max_pages = 1 << ACCESS_ONCE(page_cluster);
+	if (max_pages <= 1)
+		return 1;
+
+	/*
+	 * This heuristic has been found to work well on both sequential and
+	 * random loads, swapping to hard disk or to SSD: please don't ask
+	 * what the "+ 2" means, it just happens to work well, that's all.
+	 */
+	pages = atomic_xchg(&swapin_readahead_hits, 0) + 2;
+	if (pages == 2) {
+		/*
+		 * We can have no readahead hits to judge by: but must not get
+		 * stuck here forever, so check for an adjacent offset instead
+		 * (and don't even bother to check whether swap type is same).
+		 */
+		if (offset != prev_offset + 1 && offset != prev_offset - 1)
+			pages = 1;
+		prev_offset = offset;
+	} else {
+		unsigned int roundup = 4;
+		while (roundup < pages)
+			roundup <<= 1;
+		pages = roundup;
+	}
+
+	if (pages > max_pages)
+		pages = max_pages;
+
+	/* Don't shrink readahead too fast */
+	last_ra = atomic_read(&last_readahead_pages) / 2;
+	if (pages < last_ra)
+		pages = last_ra;
+	atomic_set(&last_readahead_pages, pages);
+
+	return pages;
+}
+
 /**
  * swapin_readahead - swap in pages in hope we need them soon
  * @entry: swap entry of this memory
@@ -412,11 +461,16 @@
 		struct vm_area_struct *vma, unsigned long addr)
 {
 	struct page *page;
-	unsigned long offset = swp_offset(entry);
+	unsigned long entry_offset = swp_offset(entry);
+	unsigned long offset = entry_offset;
 	unsigned long start_offset, end_offset;
-	unsigned long mask = (1UL << page_cluster) - 1;
+	unsigned long mask;
 	struct blk_plug plug;
 
+	mask = swapin_nr_pages(offset) - 1;
+	if (!mask)
+		goto skip;
+
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
 	end_offset = offset | mask;
@@ -430,11 +484,14 @@
 						gfp_mask, vma, addr);
 		if (!page)
 			continue;
+		if (offset != entry_offset)
+			SetPageReadahead(page);
 		page_cache_release(page);
 	}
 	blk_finish_plug(&plug);
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
+skip:
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
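
For reference, the start_offset/end_offset arithmetic above reads a
naturally aligned cluster of pages around the faulting offset.  A tiny
standalone sketch with made-up values (not kernel code):

	#include <stdio.h>

	/*
	 * Illustrative only: how the window size returned by
	 * swapin_nr_pages() turns into the aligned offset range that
	 * swapin_readahead() actually reads.
	 */
	int main(void)
	{
		unsigned long offset = 0x2f5;		/* faulting swap offset */
		unsigned long pages  = 8;		/* window from swapin_nr_pages() */
		unsigned long mask   = pages - 1;	/* 0x7 */
		unsigned long start_offset = offset & ~mask;	/* 0x2f0 */
		unsigned long end_offset   = offset | mask;	/* 0x2f7 */

		printf("read offsets 0x%lx..0x%lx (%lu pages)\n",
		       start_offset, end_offset, end_offset - start_offset + 1);
		return 0;
	}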