Commit c8d6553b9580188a1324486173d79c0f8642e870

Authored by Hugh Dickins
Committed by Linus Torvalds
1 parent cbf86cfe04

ksm: make KSM page migration possible

KSM page migration is already supported in the case of memory hotremove,
which takes the ksm_thread_mutex across all its migrations to keep life
simple.

But the new KSM NUMA merge_across_nodes knob introduces a problem, when
it's set to non-default 0: if a KSM page is migrated to a different NUMA
node, how do we migrate its stable node to the right tree?  And what if
that collides with an existing stable node?

So far there's no provision for that, and this patch does not attempt to
deal with it either.  But how will I test a solution, when I don't know
how to hotremove memory?  The best answer is to enable KSM page migration
in all cases now, and test more common cases.  With THP and compaction
added since KSM came in, page migration is now mainstream, and it's a
shame that a KSM page can frustrate freeing a page block.

Without worrying about merge_across_nodes 0 for now, this patch gets KSM
page migration working reliably for default merge_across_nodes 1 (but
leave the patch enabling it until near the end of the series).

It's much simpler than I'd originally imagined, and does not require an
additional tier of locking: page migration relies on the page lock, KSM
page reclaim relies on the page lock, the page lock is enough for KSM page
migration too.

Almost all the care has to be in get_ksm_page(): that's the function which
worries about when a stable node is stale and should be freed, now it also
has to worry about the KSM page being migrated.

The only new overhead is an additional put/get/lock/unlock_page when
stable_tree_search() arrives at a matching node: to make sure migration
respects the raised page count, and so does not migrate the page while
we're busy with it here.  That's probably avoidable, either by changing
internal interfaces from using kpage to stable_node, or by moving the
ksm_migrate_page() callsite into a page_freeze_refs() section (even if not
swapcache); but this works well, I've no urge to pull it apart now.

(Descents of the stable tree may pass through nodes whose KSM pages are
under migration: being unlocked, the raised page count does not prevent
that, nor need it: it's safe to memcmp against either old or new page.)
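
That holds because migration copies the contents across before stable_node->kpfn is switched over, and the old page is locked and frozen meanwhile, so either copy compares the same; memcmp_pages() only reads the bytes.  For reference, the existing helper in mm/ksm.c looks roughly like this (quoted from memory, not part of this patch):

static int memcmp_pages(struct page *page1, struct page *page2)
{
	char *addr1, *addr2;
	int ret;

	/* Map both pages and compare their full contents */
	addr1 = kmap_atomic(page1);
	addr2 = kmap_atomic(page2);
	ret = memcmp(addr1, addr2, PAGE_SIZE);
	kunmap_atomic(addr2);
	kunmap_atomic(addr1);
	return ret;
}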

You might worry about mremap, and whether page migration's rmap_walk to
remove migration entries will find all the KSM locations where it inserted
earlier: that should already be handled, by the satisfyingly heavy hammer
of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,).
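
For reference, that call already sits in mm/mremap.c's move_vma(), roughly as below (quoted from memory, not part of this patch): it unmerges any KSM pages in the old range before the move, so no rmap_items are left pointing at the old addresses.

	/*
	 * Unmerge any KSM pages in the range being moved, so that
	 * KSM rmap_items cannot be left pointing at the old addresses.
	 */
	err = ksm_madvise(vma, old_addr, old_addr + old_len,
			  MADV_UNMERGEABLE, &vm_flags);
	if (err)
		return err;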

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Showing 2 changed files with 77 additions and 22 deletions

--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -499,6 +499,7 @@
  * In which case we can trust the content of the page, and it
  * returns the gotten page; but if the page has now been zapped,
  * remove the stale node from the stable tree and return NULL.
+ * But beware, the stable node's page might be being migrated.
  *
  * You would expect the stable_node to hold a reference to the ksm page.
  * But if it increments the page's count, swapping out has to wait for
@@ -509,44 +510,77 @@
  * pointing back to this stable node. This relies on freeing a PageAnon
  * page to reset its page->mapping to NULL, and relies on no other use of
  * a page to put something that might look like our key in page->mapping.
- *
- * include/linux/pagemap.h page_cache_get_speculative() is a good reference,
- * but this is different - made simpler by ksm_thread_mutex being held, but
- * interesting for assuming that no other use of the struct page could ever
- * put our expected_mapping into page->mapping (or a field of the union which
- * coincides with page->mapping).
- *
- * Note: it is possible that get_ksm_page() will return NULL one moment,
- * then page the next, if the page is in between page_freeze_refs() and
- * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page
  * is on its way to being freed; but it is an anomaly to bear in mind.
  */
 static struct page *get_ksm_page(struct stable_node *stable_node, bool locked)
 {
 	struct page *page;
 	void *expected_mapping;
+	unsigned long kpfn;
 
-	page = pfn_to_page(stable_node->kpfn);
 	expected_mapping = (void *)stable_node +
 				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
-	if (page->mapping != expected_mapping)
+again:
+	kpfn = ACCESS_ONCE(stable_node->kpfn);
+	page = pfn_to_page(kpfn);
+
+	/*
+	 * page is computed from kpfn, so on most architectures reading
+	 * page->mapping is naturally ordered after reading node->kpfn,
+	 * but on Alpha we need to be more careful.
+	 */
+	smp_read_barrier_depends();
+	if (ACCESS_ONCE(page->mapping) != expected_mapping)
 		goto stale;
-	if (!get_page_unless_zero(page))
-		goto stale;
-	if (page->mapping != expected_mapping) {
+
+	/*
+	 * We cannot do anything with the page while its refcount is 0.
+	 * Usually 0 means free, or tail of a higher-order page: in which
+	 * case this node is no longer referenced, and should be freed;
+	 * however, it might mean that the page is under page_freeze_refs().
+	 * The __remove_mapping() case is easy, again the node is now stale;
+	 * but if page is swapcache in migrate_page_move_mapping(), it might
+	 * still be our page, in which case it's essential to keep the node.
+	 */
+	while (!get_page_unless_zero(page)) {
+		/*
+		 * Another check for page->mapping != expected_mapping would
+		 * work here too. We have chosen the !PageSwapCache test to
+		 * optimize the common case, when the page is or is about to
+		 * be freed: PageSwapCache is cleared (under spin_lock_irq)
+		 * in the freeze_refs section of __remove_mapping(); but Anon
+		 * page->mapping reset to NULL later, in free_pages_prepare().
+		 */
+		if (!PageSwapCache(page))
+			goto stale;
+		cpu_relax();
+	}
+
+	if (ACCESS_ONCE(page->mapping) != expected_mapping) {
 		put_page(page);
 		goto stale;
 	}
+
 	if (locked) {
 		lock_page(page);
-		if (page->mapping != expected_mapping) {
+		if (ACCESS_ONCE(page->mapping) != expected_mapping) {
 			unlock_page(page);
 			put_page(page);
 			goto stale;
 		}
 	}
 	return page;
+
 stale:
+	/*
+	 * We come here from above when page->mapping or !PageSwapCache
+	 * suggests that the node is stale; but it might be under migration.
+	 * We need smp_rmb(), matching the smp_wmb() in ksm_migrate_page(),
+	 * before checking whether node->kpfn has been changed.
+	 */
+	smp_rmb();
+	if (ACCESS_ONCE(stable_node->kpfn) != kpfn)
+		goto again;
 	remove_node_from_stable_tree(stable_node);
 	return NULL;
 }
@@ -1103,15 +1137,25 @@
 			return NULL;
 
 		ret = memcmp_pages(page, tree_page);
+		put_page(tree_page);
 
-		if (ret < 0) {
-			put_page(tree_page);
+		if (ret < 0)
 			node = node->rb_left;
-		} else if (ret > 0) {
-			put_page(tree_page);
+		else if (ret > 0)
 			node = node->rb_right;
-		} else
+		else {
+			/*
+			 * Lock and unlock the stable_node's page (which
+			 * might already have been migrated) so that page
+			 * migration is sure to notice its raised count.
+			 * It would be more elegant to return stable_node
+			 * than kpage, but that involves more changes.
+			 */
+			tree_page = get_ksm_page(stable_node, true);
+			if (tree_page)
+				unlock_page(tree_page);
 			return tree_page;
+		}
 	}
 
 	return NULL;
@@ -1903,6 +1947,14 @@
 	if (stable_node) {
 		VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
 		stable_node->kpfn = page_to_pfn(newpage);
+		/*
+		 * newpage->mapping was set in advance; now we need smp_wmb()
+		 * to make sure that the new stable_node->kpfn is visible
+		 * to get_ksm_page() before it can see that oldpage->mapping
+		 * has gone stale (or that PageSwapCache has been cleared).
+		 */
+		smp_wmb();
+		set_page_stable_node(oldpage, NULL);
 	}
 }
 #endif /* CONFIG_MIGRATION */
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -464,7 +464,10 @@
 
 	mlock_migrate_page(newpage, page);
 	ksm_migrate_page(newpage, page);
-
+	/*
+	 * Please do not reorder this without considering how mm/ksm.c's
+	 * get_ksm_page() depends upon ksm_migrate_page() and PageSwapCache().
+	 */
 	ClearPageSwapCache(page);
 	ClearPagePrivate(page);
 	set_page_private(page, 0);