Commit 70b50f94f1644e2aa7cb374819cfd93f3c28d725

Authored by Andrea Arcangeli
Committed by Linus Torvalds
1 parent 994c0e9925

mm: thp: tail page refcounting fix

Michel, while working on the working set estimation code, noticed that
calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
wasn't safe if the pfn ended up being a tail page of a transparent
hugepage under splitting by __split_huge_page_refcount().
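
As a sketch of the pattern in question (the scan loop and its
start_pfn/end_pfn bounds below are hypothetical; only the helpers are
the usual kernel ones):

  /* Hypothetical pfn walker: the pattern a pfn-based scanner uses. */
  unsigned long pfn;

  for (pfn = start_pfn; pfn < end_pfn; pfn++) {
          struct page *page;

          if (!pfn_valid(pfn))
                  continue;
          page = pfn_to_page(pfn);
          /*
           * Before this fix this could succeed on a THP tail page
           * whose _count was transiently elevated, racing with
           * __split_huge_page_refcount().
           */
          if (!get_page_unless_zero(page))
                  continue;
          /* ... inspect the page ... */
          put_page(page);
  }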

He then found that the problem could also theoretically materialize with
page_cache_get_speculative() during the speculative radix tree lookups
that use get_page_unless_zero() on SMP, if the radix tree page is freed
and reallocated and get_user_pages() is called on it before
page_cache_get_speculative() has a chance to call get_page_unless_zero().
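
Schematically, the SMP fast path of page_cache_get_speculative()
reduces to get_page_unless_zero() under rcu_read_lock(); the function
below is a condensed illustration of that window, not the actual
pagemap.h/filemap.c code:

  /*
   * Condensed sketch of the lockless pagecache lookup: the recheck
   * and retry logic of the real code is elided.
   */
  static struct page *speculative_lookup_sketch(struct address_space *mapping,
                                                pgoff_t offset)
  {
          struct page *page;

          rcu_read_lock();
          page = radix_tree_lookup(&mapping->page_tree, offset);
          /*
           * By now the page may have been freed and reallocated, e.g.
           * as a THP tail pinned by get_user_pages().  With the old
           * scheme such a tail page could have a nonzero _count, so
           * the speculative grab below could succeed on a page that
           * is no longer a pagecache page at all.
           */
          if (page && !get_page_unless_zero(page))
                  page = NULL;
          rcu_read_unlock();
          return page;
  }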

So the best way to fix the problem is to keep page_tail->_count zero at
all times.  This will guarantee that get_page_unless_zero() can never
succeed on any tail page.  page_tail->_mapcount is guaranteed zero and
is unused for all tail pages of a compound page, so we can simply
account the tail page references there and transfer them to
tail_page->_count in __split_huge_page_refcount() (in addition to the
head_page->_mapcount).
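
This is sufficient because get_page_unless_zero() is just an
atomic_inc_not_zero() on page->_count (simplified below), so a tail
page whose _count is pinned at zero can never be grabbed speculatively:

  /* Simplified from include/linux/mm.h. */
  static inline int get_page_unless_zero(struct page *page)
  {
          return atomic_inc_not_zero(&page->_count);
  }

  /*
   * New invariant for every tail page of a THP:
   *
   *   atomic_read(&tail->_count)        == 0  at all times
   *   atomic_read(&tail->_mapcount) + 1 == number of gup pins on the tail
   *
   * so atomic_inc_not_zero() above always fails on tail pages, and the
   * pins parked in _mapcount are folded into _count only by
   * __split_huge_page_refcount() while it holds the compound_lock.
   */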

While debugging this s/_count/_mapcount/ change I also noticed that
get_page() is called by direct-io.c on pages returned by
get_user_pages().  That wasn't entirely safe because the two
atomic_inc()s in get_page() weren't atomic.  Other get_user_pages()
users, like the secondary-MMU page faults that establish the shadow
pagetables, never call a superfluous get_page() after get_user_pages()
returns.  It's safer to make get_page() universally safe for tail pages
and to use get_page_foll() within follow_page() (inside
get_user_pages()).  get_page_foll() is safe to do the refcounting for
tail pages without taking any locks because it is run within PT lock
protected critical sections (the PT lock for pte and page_table_lock
for pmd_trans_huge).
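
Schematically, the FOLL_GET path looks like the condensed sketch below
(a made-up wrapper around what the real mm/memory.c code in the diff
does); the point is that the PT lock spans both the pte read and the
refcount bump:

  /*
   * Condensed follow_page()/FOLL_GET shape: while the PT lock is held
   * the pte still maps the page, so __split_huge_page_refcount()
   * cannot split the THP from under us and get_page_foll() may bump
   * tail->_mapcount locklessly.
   */
  static struct page *follow_page_get_sketch(struct vm_area_struct *vma,
                                             unsigned long address, pmd_t *pmd)
  {
          struct mm_struct *mm = vma->vm_mm;
          struct page *page = NULL;
          spinlock_t *ptl;
          pte_t *ptep, pte;

          ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
          pte = *ptep;
          if (pte_present(pte)) {
                  page = vm_normal_page(vma, address, pte);
                  if (page)
                          get_page_foll(page);    /* FOLL_GET under PT lock */
          }
          pte_unmap_unlock(ptep, ptl);
          return page;
  }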

The standard get_page(), as invoked by direct-io, will instead now take
the compound_lock, but still only for tail pages.  The direct-io paths
are usually I/O bound and the compound_lock is per THP, so it is very
fine-grained and there is no risk of scalability issues with it.  A
simple direct-io benchmark with all the lockdep prove-locking and
spinlock debugging infrastructure enabled shows identical performance
and no overhead, so it's worth it.  Ideally direct-io should stop
calling get_page() on pages returned by get_user_pages().  The spinlock
in get_page() is already optimized away for no-THP builds, and doing
get_page() on tail pages returned by GUP is generally a rare operation
usually only run in I/O paths.
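
The pattern direct-io depends on is roughly the following (a
hypothetical GUP user, not the actual fs/direct-io.c code):

  /*
   * Hypothetical GUP user mimicking the direct-io pattern: the pages
   * come back from get_user_pages() already pinned, then the I/O path
   * takes an extra get_page() on each of them.  With THP some of these
   * pages can be tail pages, so get_page() must handle them (that's
   * the compound_lock slow path added by this patch).
   */
  static int pin_and_ref_user_pages(unsigned long uaddr, int nr_pages,
                                    struct page **pages)
  {
          int i, got;

          down_read(&current->mm->mmap_sem);
          got = get_user_pages(current, current->mm, uaddr, nr_pages,
                               1 /* write */, 0 /* force */, pages, NULL);
          up_read(&current->mm->mmap_sem);
          if (got <= 0)
                  return got;

          for (i = 0; i < got; i++)
                  get_page(pages[i]);     /* extra ref, maybe on a tail page */

          return got;
  }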

This new refcounting on page_tail->_mapcount, in addition to avoiding
new RCU critical sections, will also allow the working set estimation
code to work without any further complexity associated with tail page
refcounting under THP.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Showing 8 changed files with 171 additions and 84 deletions

arch/powerpc/mm/gup.c
... ... @@ -22,8 +22,9 @@
22 22 * __split_huge_page_refcount() cannot run
23 23 * from under us.
24 24 */
25   - VM_BUG_ON(atomic_read(&page->_count) < 0);
26   - atomic_inc(&page->_count);
  25 + VM_BUG_ON(page_mapcount(page) < 0);
  26 + VM_BUG_ON(atomic_read(&page->_count) != 0);
  27 + atomic_inc(&page->_mapcount);
27 28 }
28 29  
29 30 /*
arch/x86/mm/gup.c
... ... @@ -114,8 +114,9 @@
114 114 * __split_huge_page_refcount() cannot run
115 115 * from under us.
116 116 */
117   - VM_BUG_ON(atomic_read(&page->_count) < 0);
118   - atomic_inc(&page->_count);
  117 + VM_BUG_ON(page_mapcount(page) < 0);
  118 + VM_BUG_ON(atomic_read(&page->_count) != 0);
  119 + atomic_inc(&page->_mapcount);
119 120 }
120 121  
121 122 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
include/linux/mm.h
... ... @@ -356,36 +356,39 @@
356 356 return page;
357 357 }
358 358  
  359 +/*
  360 + * The atomic page->_mapcount, starts from -1: so that transitions
  361 + * both from it and to it can be tracked, using atomic_inc_and_test
  362 + * and atomic_add_negative(-1).
  363 + */
  364 +static inline void reset_page_mapcount(struct page *page)
  365 +{
  366 + atomic_set(&(page)->_mapcount, -1);
  367 +}
  368 +
  369 +static inline int page_mapcount(struct page *page)
  370 +{
  371 + return atomic_read(&(page)->_mapcount) + 1;
  372 +}
  373 +
359 374 static inline int page_count(struct page *page)
360 375 {
361 376 return atomic_read(&compound_head(page)->_count);
362 377 }
363 378  
  379 +extern bool __get_page_tail(struct page *page);
  380 +
364 381 static inline void get_page(struct page *page)
365 382 {
  383 + if (unlikely(PageTail(page)))
  384 + if (likely(__get_page_tail(page)))
  385 + return;
366 386 /*
367 387 * Getting a normal page or the head of a compound page
368   - * requires to already have an elevated page->_count. Only if
369   - * we're getting a tail page, the elevated page->_count is
370   - * required only in the head page, so for tail pages the
371   - * bugcheck only verifies that the page->_count isn't
372   - * negative.
  388 + * requires to already have an elevated page->_count.
373 389 */
374   - VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
  390 + VM_BUG_ON(atomic_read(&page->_count) <= 0);
375 391 atomic_inc(&page->_count);
376   - /*
377   - * Getting a tail page will elevate both the head and tail
378   - * page->_count(s).
379   - */
380   - if (unlikely(PageTail(page))) {
381   - /*
382   - * This is safe only because
383   - * __split_huge_page_refcount can't run under
384   - * get_page().
385   - */
386   - VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
387   - atomic_inc(&page->first_page->_count);
388   - }
389 392 }
390 393  
391 394 static inline struct page *virt_to_head_page(const void *x)
... ... @@ -801,21 +804,6 @@
801 804 if (unlikely(PageSwapCache(page)))
802 805 return page_private(page);
803 806 return page->index;
804   -}
805   -
806   -/*
807   - * The atomic page->_mapcount, like _count, starts from -1:
808   - * so that transitions both from it and to it can be tracked,
809   - * using atomic_inc_and_test and atomic_add_negative(-1).
810   - */
811   -static inline void reset_page_mapcount(struct page *page)
812   -{
813   - atomic_set(&(page)->_mapcount, -1);
814   -}
815   -
816   -static inline int page_mapcount(struct page *page)
817   -{
818   - return atomic_read(&(page)->_mapcount) + 1;
819 807 }
820 808  
821 809 /*
include/linux/mm_types.h
... ... @@ -62,10 +62,23 @@
62 62 struct {
63 63  
64 64 union {
65   - atomic_t _mapcount; /* Count of ptes mapped in mms,
66   - * to show when page is mapped
67   - * & limit reverse map searches.
68   - */
  65 + /*
  66 + * Count of ptes mapped in
  67 + * mms, to show when page is
  68 + * mapped & limit reverse map
  69 + * searches.
  70 + *
  71 + * Used also for tail pages
  72 + * refcounting instead of
  73 + * _count. Tail pages cannot
  74 + * be mapped and keeping the
  75 + * tail page _count zero at
  76 + * all times guarantees
  77 + * get_page_unless_zero() will
  78 + * never succeed on tail
  79 + * pages.
  80 + */
  81 + atomic_t _mapcount;
69 82  
70 83 struct {
71 84 unsigned inuse:16;
mm/huge_memory.c
... ... @@ -990,7 +990,7 @@
990 990 page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
991 991 VM_BUG_ON(!PageCompound(page));
992 992 if (flags & FOLL_GET)
993   - get_page(page);
  993 + get_page_foll(page);
994 994  
995 995 out:
996 996 return page;
... ... @@ -1202,6 +1202,7 @@
1202 1202 unsigned long head_index = page->index;
1203 1203 struct zone *zone = page_zone(page);
1204 1204 int zonestat;
  1205 + int tail_count = 0;
1205 1206  
1206 1207 /* prevent PageLRU to go away from under us, and freeze lru stats */
1207 1208 spin_lock_irq(&zone->lru_lock);
... ... @@ -1210,11 +1211,27 @@
1210 1211 for (i = 1; i < HPAGE_PMD_NR; i++) {
1211 1212 struct page *page_tail = page + i;
1212 1213  
1213   - /* tail_page->_count cannot change */
1214   - atomic_sub(atomic_read(&page_tail->_count), &page->_count);
1215   - BUG_ON(page_count(page) <= 0);
1216   - atomic_add(page_mapcount(page) + 1, &page_tail->_count);
1217   - BUG_ON(atomic_read(&page_tail->_count) <= 0);
  1214 + /* tail_page->_mapcount cannot change */
  1215 + BUG_ON(page_mapcount(page_tail) < 0);
  1216 + tail_count += page_mapcount(page_tail);
  1217 + /* check for overflow */
  1218 + BUG_ON(tail_count < 0);
  1219 + BUG_ON(atomic_read(&page_tail->_count) != 0);
  1220 + /*
  1221 + * tail_page->_count is zero and not changing from
  1222 + * under us. But get_page_unless_zero() may be running
  1223 + * from under us on the tail_page. If we used
  1224 + * atomic_set() below instead of atomic_add(), we
  1225 + * would then run atomic_set() concurrently with
  1226 + * get_page_unless_zero(), and atomic_set() is
  1227 + * implemented in C not using locked ops. spin_unlock
  1228 + * on x86 sometime uses locked ops because of PPro
  1229 + * errata 66, 92, so unless somebody can guarantee
  1230 + * atomic_set() here would be safe on all archs (and
  1231 + * not only on x86), it's safer to use atomic_add().
  1232 + */
  1233 + atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
  1234 + &page_tail->_count);
1218 1235  
1219 1236 /* after clearing PageTail the gup refcount can be released */
1220 1237 smp_mb();
... ... @@ -1232,10 +1249,7 @@
1232 1249 (1L << PG_uptodate)));
1233 1250 page_tail->flags |= (1L << PG_dirty);
1234 1251  
1235   - /*
1236   - * 1) clear PageTail before overwriting first_page
1237   - * 2) clear PageTail before clearing PageHead for VM_BUG_ON
1238   - */
  1252 + /* clear PageTail before overwriting first_page */
1239 1253 smp_wmb();
1240 1254  
1241 1255 /*
... ... @@ -1252,7 +1266,6 @@
1252 1266 * status is achieved setting a reserved bit in the
1253 1267 * pmd, not by clearing the present bit.
1254 1268 */
1255   - BUG_ON(page_mapcount(page_tail));
1256 1269 page_tail->_mapcount = page->_mapcount;
1257 1270  
1258 1271 BUG_ON(page_tail->mapping);
... ... @@ -1269,6 +1282,8 @@
1269 1282  
1270 1283 lru_add_page_tail(zone, page, page_tail);
1271 1284 }
  1285 + atomic_sub(tail_count, &page->_count);
  1286 + BUG_ON(atomic_read(&page->_count) <= 0);
1272 1287  
1273 1288 __dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
1274 1289 __mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);
mm/internal.h
... ... @@ -37,6 +37,52 @@
37 37 atomic_dec(&page->_count);
38 38 }
39 39  
  40 +static inline void __get_page_tail_foll(struct page *page,
  41 + bool get_page_head)
  42 +{
  43 + /*
  44 + * If we're getting a tail page, the elevated page->_count is
  45 + * required only in the head page and we will elevate the head
  46 + * page->_count and tail page->_mapcount.
  47 + *
  48 + * We elevate page_tail->_mapcount for tail pages to force
  49 + * page_tail->_count to be zero at all times to avoid getting
  50 + * false positives from get_page_unless_zero() with
  51 + * speculative page access (like in
  52 + * page_cache_get_speculative()) on tail pages.
  53 + */
  54 + VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
  55 + VM_BUG_ON(atomic_read(&page->_count) != 0);
  56 + VM_BUG_ON(page_mapcount(page) < 0);
  57 + if (get_page_head)
  58 + atomic_inc(&page->first_page->_count);
  59 + atomic_inc(&page->_mapcount);
  60 +}
  61 +
  62 +/*
  63 + * This is meant to be called as the FOLL_GET operation of
  64 + * follow_page() and it must be called while holding the proper PT
  65 + * lock while the pte (or pmd_trans_huge) is still mapping the page.
  66 + */
  67 +static inline void get_page_foll(struct page *page)
  68 +{
  69 + if (unlikely(PageTail(page)))
  70 + /*
  71 + * This is safe only because
  72 + * __split_huge_page_refcount() can't run under
  73 + * get_page_foll() because we hold the proper PT lock.
  74 + */
  75 + __get_page_tail_foll(page, true);
  76 + else {
  77 + /*
  78 + * Getting a normal page or the head of a compound page
  79 + * requires to already have an elevated page->_count.
  80 + */
  81 + VM_BUG_ON(atomic_read(&page->_count) <= 0);
  82 + atomic_inc(&page->_count);
  83 + }
  84 +}
  85 +
40 86 extern unsigned long highest_memmap_pfn;
41 87  
42 88 /*
mm/memory.c
... ... @@ -1503,7 +1503,7 @@
1503 1503 }
1504 1504  
1505 1505 if (flags & FOLL_GET)
1506   - get_page(page);
  1506 + get_page_foll(page);
1507 1507 if (flags & FOLL_TOUCH) {
1508 1508 if ((flags & FOLL_WRITE) &&
1509 1509 !pte_dirty(pte) && !PageDirty(page))
mm/swap.c
... ... @@ -78,39 +78,22 @@
78 78 {
79 79 if (unlikely(PageTail(page))) {
80 80 /* __split_huge_page_refcount can run under us */
81   - struct page *page_head = page->first_page;
82   - smp_rmb();
83   - /*
84   - * If PageTail is still set after smp_rmb() we can be sure
85   - * that the page->first_page we read wasn't a dangling pointer.
86   - * See __split_huge_page_refcount() smp_wmb().
87   - */
88   - if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
  81 + struct page *page_head = compound_trans_head(page);
  82 +
  83 + if (likely(page != page_head &&
  84 + get_page_unless_zero(page_head))) {
89 85 unsigned long flags;
90 86 /*
91   - * Verify that our page_head wasn't converted
92   - * to a a regular page before we got a
93   - * reference on it.
  87 + * page_head wasn't a dangling pointer but it
  88 + * may not be a head page anymore by the time
  89 + * we obtain the lock. That is ok as long as it
  90 + * can't be freed from under us.
94 91 */
95   - if (unlikely(!PageHead(page_head))) {
96   - /* PageHead is cleared after PageTail */
97   - smp_rmb();
98   - VM_BUG_ON(PageTail(page));
99   - goto out_put_head;
100   - }
101   - /*
102   - * Only run compound_lock on a valid PageHead,
103   - * after having it pinned with
104   - * get_page_unless_zero() above.
105   - */
106   - smp_mb();
107   - /* page_head wasn't a dangling pointer */
108 92 flags = compound_lock_irqsave(page_head);
109 93 if (unlikely(!PageTail(page))) {
110 94 /* __split_huge_page_refcount run before us */
111 95 compound_unlock_irqrestore(page_head, flags);
112 96 VM_BUG_ON(PageHead(page_head));
113   - out_put_head:
114 97 if (put_page_testzero(page_head))
115 98 __put_single_page(page_head);
116 99 out_put_single:
117 100  
118 101  
... ... @@ -121,16 +104,17 @@
121 104 VM_BUG_ON(page_head != page->first_page);
122 105 /*
123 106 * We can release the refcount taken by
124   - * get_page_unless_zero now that
125   - * split_huge_page_refcount is blocked on the
126   - * compound_lock.
  107 + * get_page_unless_zero() now that
  108 + * __split_huge_page_refcount() is blocked on
  109 + * the compound_lock.
127 110 */
128 111 if (put_page_testzero(page_head))
129 112 VM_BUG_ON(1);
130 113 /* __split_huge_page_refcount will wait now */
131   - VM_BUG_ON(atomic_read(&page->_count) <= 0);
132   - atomic_dec(&page->_count);
  114 + VM_BUG_ON(page_mapcount(page) <= 0);
  115 + atomic_dec(&page->_mapcount);
133 116 VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
  117 + VM_BUG_ON(atomic_read(&page->_count) != 0);
134 118 compound_unlock_irqrestore(page_head, flags);
135 119 if (put_page_testzero(page_head)) {
136 120 if (PageHead(page_head))
... ... @@ -159,6 +143,45 @@
159 143 __put_single_page(page);
160 144 }
161 145 EXPORT_SYMBOL(put_page);
  146 +
  147 +/*
  148 + * This function is exported but must not be called by anything other
  149 + * than get_page(). It implements the slow path of get_page().
  150 + */
  151 +bool __get_page_tail(struct page *page)
  152 +{
  153 + /*
  154 + * This takes care of get_page() if run on a tail page
  155 + * returned by one of the get_user_pages/follow_page variants.
  156 + * get_user_pages/follow_page itself doesn't need the compound
  157 + * lock because it runs __get_page_tail_foll() under the
  158 + * proper PT lock that already serializes against
  159 + * split_huge_page().
  160 + */
  161 + unsigned long flags;
  162 + bool got = false;
  163 + struct page *page_head = compound_trans_head(page);
  164 +
  165 + if (likely(page != page_head && get_page_unless_zero(page_head))) {
  166 + /*
  167 + * page_head wasn't a dangling pointer but it
  168 + * may not be a head page anymore by the time
  169 + * we obtain the lock. That is ok as long as it
  170 + * can't be freed from under us.
  171 + */
  172 + flags = compound_lock_irqsave(page_head);
  173 + /* here __split_huge_page_refcount won't run anymore */
  174 + if (likely(PageTail(page))) {
  175 + __get_page_tail_foll(page, false);
  176 + got = true;
  177 + }
  178 + compound_unlock_irqrestore(page_head, flags);
  179 + if (unlikely(!got))
  180 + put_page(page_head);
  181 + }
  182 + return got;
  183 +}
  184 +EXPORT_SYMBOL(__get_page_tail);
162 185  
163 186 /**
164 187 * put_pages_list() - release a list of pages