Commit c8d6553b9580188a1324486173d79c0f8642e870

Authored by Hugh Dickins
Committed by Linus Torvalds
1 parent cbf86cfe04

ksm: make KSM page migration possible

KSM page migration is already supported in the case of memory hotremove,
which takes the ksm_thread_mutex across all its migrations to keep life
simple.

But the new KSM NUMA merge_across_nodes knob introduces a problem, when
it's set to non-default 0: if a KSM page is migrated to a different NUMA
node, how do we migrate its stable node to the right tree?  And what if
that collides with an existing stable node?

So far there's no provision for that, and this patch does not attempt to
deal with it either.  But how will I test a solution, when I don't know
how to hotremove memory?  The best answer is to enable KSM page migration
in all cases now, and test more common cases.  With THP and compaction
added since KSM came in, page migration is now mainstream, and it's a
shame that a KSM page can frustrate freeing a page block.

Without worrying about merge_across_nodes 0 for now, this patch gets KSM
page migration working reliably for default merge_across_nodes 1 (but
leave the patch enabling it until near the end of the series).

It's much simpler than I'd originally imagined, and does not require an
additional tier of locking: page migration relies on the page lock, KSM
page reclaim relies on the page lock, the page lock is enough for KSM page
migration too.

Almost all the care has to be in get_ksm_page(): that's the function which
worries about when a stable node is stale and should be freed, now it also
has to worry about the KSM page being migrated.

The only new overhead is an additional put/get/lock/unlock_page when
stable_tree_search() arrives at a matching node: to make sure migration
respects the raised page count, and so does not migrate the page while
we're busy with it here.  That's probably avoidable, either by changing
internal interfaces from using kpage to stable_node, or by moving the
ksm_migrate_page() callsite into a page_freeze_refs() section (even if not
swapcache); but this works well, I've no urge to pull it apart now.

(Descents of the stable tree may pass through nodes whose KSM pages are
under migration: being unlocked, the raised page count does not prevent
that, nor need it: it's safe to memcmp against either old or new page.)
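
That holds because migration copies the contents across before stable_node->kpfn is switched over, and the old page is locked and frozen meanwhile, so either copy compares the same; memcmp_pages() only reads the bytes.  For reference, the existing helper in mm/ksm.c looks roughly like this (quoted from memory, not part of this patch):

static int memcmp_pages(struct page *page1, struct page *page2)
{
	char *addr1, *addr2;
	int ret;

	/* Map both pages and compare their full contents */
	addr1 = kmap_atomic(page1);
	addr2 = kmap_atomic(page2);
	ret = memcmp(addr1, addr2, PAGE_SIZE);
	kunmap_atomic(addr2);
	kunmap_atomic(addr1);
	return ret;
}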

You might worry about mremap, and whether page migration's rmap_walk to
remove migration entries will find all the KSM locations where it inserted
earlier: that should already be handled, by the satisfyingly heavy hammer
of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,).
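
For reference, that call already sits in mm/mremap.c's move_vma(), roughly as below (quoted from memory, not part of this patch): it unmerges any KSM pages in the old range before the move, so no rmap_items are left pointing at the old addresses.

	/*
	 * Unmerge any KSM pages in the range being moved, so that
	 * KSM rmap_items cannot be left pointing at the old addresses.
	 */
	err = ksm_madvise(vma, old_addr, old_addr + old_len,
			  MADV_UNMERGEABLE, &vm_flags);
	if (err)
		return err;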

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Showing 2 changed files with 77 additions and 22 deletions

--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -499,6 +499,7 @@
  * In which case we can trust the content of the page, and it
  * returns the gotten page; but if the page has now been zapped,
  * remove the stale node from the stable tree and return NULL.
+ * But beware, the stable node's page might be being migrated.
  *
  * You would expect the stable_node to hold a reference to the ksm page.
  * But if it increments the page's count, swapping out has to wait for
@@ -509,44 +510,77 @@
  * pointing back to this stable node. This relies on freeing a PageAnon
  * page to reset its page->mapping to NULL, and relies on no other use of
  * a page to put something that might look like our key in page->mapping.
- *
- * include/linux/pagemap.h page_cache_get_speculative() is a good reference,
- * but this is different - made simpler by ksm_thread_mutex being held, but
- * interesting for assuming that no other use of the struct page could ever
- * put our expected_mapping into page->mapping (or a field of the union which
- * coincides with page->mapping).
- *
- * Note: it is possible that get_ksm_page() will return NULL one moment,
- * then page the next, if the page is in between page_freeze_refs() and
- * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page
  * is on its way to being freed; but it is an anomaly to bear in mind.
  */
 static struct page *get_ksm_page(struct stable_node *stable_node, bool locked)
 {
 	struct page *page;
 	void *expected_mapping;
+	unsigned long kpfn;
 
-	page = pfn_to_page(stable_node->kpfn);
 	expected_mapping = (void *)stable_node +
 				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
-	if (page->mapping != expected_mapping)
+again:
+	kpfn = ACCESS_ONCE(stable_node->kpfn);
+	page = pfn_to_page(kpfn);
+
+	/*
+	 * page is computed from kpfn, so on most architectures reading
+	 * page->mapping is naturally ordered after reading node->kpfn,
+	 * but on Alpha we need to be more careful.
+	 */
+	smp_read_barrier_depends();
+	if (ACCESS_ONCE(page->mapping) != expected_mapping)
 		goto stale;
-	if (!get_page_unless_zero(page))
-		goto stale;
-	if (page->mapping != expected_mapping) {
+
+	/*
+	 * We cannot do anything with the page while its refcount is 0.
+	 * Usually 0 means free, or tail of a higher-order page: in which
+	 * case this node is no longer referenced, and should be freed;
+	 * however, it might mean that the page is under page_freeze_refs().
+	 * The __remove_mapping() case is easy, again the node is now stale;
+	 * but if page is swapcache in migrate_page_move_mapping(), it might
+	 * still be our page, in which case it's essential to keep the node.
+	 */
+	while (!get_page_unless_zero(page)) {
+		/*
+		 * Another check for page->mapping != expected_mapping would
+		 * work here too. We have chosen the !PageSwapCache test to
+		 * optimize the common case, when the page is or is about to
+		 * be freed: PageSwapCache is cleared (under spin_lock_irq)
+		 * in the freeze_refs section of __remove_mapping(); but Anon
+		 * page->mapping reset to NULL later, in free_pages_prepare().
+		 */
+		if (!PageSwapCache(page))
+			goto stale;
+		cpu_relax();
+	}
+
+	if (ACCESS_ONCE(page->mapping) != expected_mapping) {
 		put_page(page);
 		goto stale;
 	}
+
 	if (locked) {
 		lock_page(page);
-		if (page->mapping != expected_mapping) {
+		if (ACCESS_ONCE(page->mapping) != expected_mapping) {
 			unlock_page(page);
 			put_page(page);
 			goto stale;
 		}
 	}
 	return page;
+
 stale:
+	/*
+	 * We come here from above when page->mapping or !PageSwapCache
+	 * suggests that the node is stale; but it might be under migration.
+	 * We need smp_rmb(), matching the smp_wmb() in ksm_migrate_page(),
+	 * before checking whether node->kpfn has been changed.
+	 */
+	smp_rmb();
+	if (ACCESS_ONCE(stable_node->kpfn) != kpfn)
+		goto again;
 	remove_node_from_stable_tree(stable_node);
 	return NULL;
 }
@@ -1103,15 +1137,25 @@
 			return NULL;
 
 		ret = memcmp_pages(page, tree_page);
+		put_page(tree_page);
 
-		if (ret < 0) {
-			put_page(tree_page);
+		if (ret < 0)
 			node = node->rb_left;
-		} else if (ret > 0) {
-			put_page(tree_page);
+		else if (ret > 0)
 			node = node->rb_right;
-		} else
+		else {
+			/*
+			 * Lock and unlock the stable_node's page (which
+			 * might already have been migrated) so that page
+			 * migration is sure to notice its raised count.
+			 * It would be more elegant to return stable_node
+			 * than kpage, but that involves more changes.
+			 */
+			tree_page = get_ksm_page(stable_node, true);
+			if (tree_page)
+				unlock_page(tree_page);
 			return tree_page;
+		}
 	}
 
 	return NULL;
@@ -1903,6 +1947,14 @@
 	if (stable_node) {
 		VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
 		stable_node->kpfn = page_to_pfn(newpage);
+		/*
+		 * newpage->mapping was set in advance; now we need smp_wmb()
+		 * to make sure that the new stable_node->kpfn is visible
+		 * to get_ksm_page() before it can see that oldpage->mapping
+		 * has gone stale (or that PageSwapCache has been cleared).
+		 */
+		smp_wmb();
+		set_page_stable_node(oldpage, NULL);
 	}
 }
 #endif /* CONFIG_MIGRATION */
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -464,7 +464,10 @@
 
 	mlock_migrate_page(newpage, page);
 	ksm_migrate_page(newpage, page);
-
+	/*
+	 * Please do not reorder this without considering how mm/ksm.c's
+	 * get_ksm_page() depends upon ksm_migrate_page() and PageSwapCache().
+	 */
 	ClearPageSwapCache(page);
 	ClearPagePrivate(page);
 	set_page_private(page, 0);