Commit cc9a6c8776615f9c194ccf0b63a0aa5628235545

Authored by Mel Gorman
Committed by Linus Torvalds
1 parent e845e19936

cpuset: mm: reduce large amounts of memory barrier related damage v3

Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.

[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths.  This was detected while
investigating a large page allocator slowdown introduced some time
after 2.6.32.  The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.

For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.

This patch replaces the full memory barriers on both the read and write
sides with a sequence counter, leaving only read barriers on the fast
path.  This is much cheaper on some architectures, including x86.  The
bulk of the patch is the retry logic needed when the nodemask changes in
a manner that could cause a false failure.
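
The read side now samples a cookie from the per-task seqcount, performs
the allocation and retries only if the allocation failed while the
nodemask was changing.  A condensed form of the new pattern, taken from
the mm/filemap.c hunk below (the wrapper name alloc_spread_page is
illustrative):

    static struct page *alloc_spread_page(gfp_t gfp)
    {
        unsigned int cpuset_mems_cookie;
        struct page *page;
        int n;

        do {
            /* Sample the mems_allowed sequence counter */
            cpuset_mems_cookie = get_mems_allowed();
            n = cpuset_mem_spread_node();
            page = alloc_pages_exact_node(n, gfp, 0);
            /*
             * Retry only when the allocation failed and the
             * nodemask was updated in parallel.
             */
        } while (!put_mems_allowed(cpuset_mems_cookie) && !page);

        return page;
    }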

While updating the nodemask, a check is made to see if a false failure
is a risk.  If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
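
On the write side, a condensed sketch of the kernel/cpuset.c change
below (the function name and the task_lock placement are reconstructed
from context):

    static void change_task_nodemask(struct task_struct *tsk,
                                     nodemask_t *newmems)
    {
        bool need_loop;

        task_lock(tsk);
        /*
         * A false failure is only a risk if the new nodemask shares no
         * nodes with the old one or a mempolicy must be rebound; only
         * then is the sequence count bumped.
         */
        need_loop = task_has_mempolicy(tsk) ||
                !nodes_intersects(*newmems, tsk->mems_allowed);

        if (need_loop)
            write_seqcount_begin(&tsk->mems_allowed_seq);

        nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
        mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
        mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2);
        tsk->mems_allowed = *newmems;

        if (need_loop)
            write_seqcount_end(&tsk->mems_allowed_seq);
        task_unlock(tsk);
    }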

In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%.  The
actual results were

                             3.3.0-rc3          3.3.0-rc3
                             rc3-vanilla        nobarrier-v2r1
    Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
    Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
    Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
    Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
    Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
    Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
    Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
    Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
    Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
    Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
    Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
    Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
    Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
    Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
    Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
    MMTests Statistics: duration
    Sys Time Running Test (seconds)             135.68    132.17
    User+Sys Time Running Test (seconds)         164.2    160.13
    Total Elapsed Time (seconds)                123.46    120.87

The overall improvement is small, but the system CPU time is much
improved and roughly in line with what oprofile reported (these
performance figures were taken without profiling, so some skew is
expected).  The actual number of page faults is noticeably improved.

For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.

To test the actual bug the commit fixed I opened two terminals.  The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data.  In a second window, the nodemask of the
cpuset was continually randomised in a loop.

Without the commit, the program would fail every so often (usually
within 10 seconds); with the commit, everything worked fine.  With this
patch applied it also worked fine, so the fix should be functionally
equivalent.
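
The faulting program might have looked something like this minimal
sketch (the 100M size matches the description above; everything else is
illustrative):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define SIZE (100UL << 20)  /* 100M of anonymous data */

    int main(void)
    {
        for (;;) {
            char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
            }
            /* Touch every page to force the faults */
            memset(buf, 1, SIZE);
            munmap(buf, SIZE);
        }
        return 0;
    }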

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Showing 12 changed files with 133 additions and 110 deletions

include/linux/cpuset.h
... ... @@ -89,42 +89,33 @@
89 89 extern void cpuset_print_task_mems_allowed(struct task_struct *p);
90 90  
91 91 /*
92   - * reading current mems_allowed and mempolicy in the fastpath must protected
93   - * by get_mems_allowed()
  92 + * get_mems_allowed is required when making decisions involving mems_allowed
  93 + * such as during page allocation. mems_allowed can be updated in parallel
  94 + * and depending on the new value an operation can fail potentially causing
  95 + * process failure. A retry loop with get_mems_allowed and put_mems_allowed
  96 + * prevents these artificial failures.
94 97 */
95   -static inline void get_mems_allowed(void)
  98 +static inline unsigned int get_mems_allowed(void)
96 99 {
97   - current->mems_allowed_change_disable++;
98   -
99   - /*
100   - * ensure that reading mems_allowed and mempolicy happens after the
101   - * update of ->mems_allowed_change_disable.
102   - *
103   - * the write-side task finds ->mems_allowed_change_disable is not 0,
104   - * and knows the read-side task is reading mems_allowed or mempolicy,
105   - * so it will clear old bits lazily.
106   - */
107   - smp_mb();
  100 + return read_seqcount_begin(&current->mems_allowed_seq);
108 101 }
109 102  
110   -static inline void put_mems_allowed(void)
  103 +/*
  104 + * If this returns false, the operation that took place after get_mems_allowed
  105 + * may have failed. It is up to the caller to retry the operation if
  106 + * appropriate.
  107 + */
  108 +static inline bool put_mems_allowed(unsigned int seq)
111 109 {
112   - /*
113   - * ensure that reading mems_allowed and mempolicy before reducing
114   - * mems_allowed_change_disable.
115   - *
116   - * the write-side task will know that the read-side task is still
117   - * reading mems_allowed or mempolicy, don't clears old bits in the
118   - * nodemask.
119   - */
120   - smp_mb();
121   - --ACCESS_ONCE(current->mems_allowed_change_disable);
  110 + return !read_seqcount_retry(&current->mems_allowed_seq, seq);
122 111 }
123 112  
124 113 static inline void set_mems_allowed(nodemask_t nodemask)
125 114 {
126 115 task_lock(current);
  116 + write_seqcount_begin(&current->mems_allowed_seq);
127 117 current->mems_allowed = nodemask;
  118 + write_seqcount_end(&current->mems_allowed_seq);
128 119 task_unlock(current);
129 120 }
130 121  
... ... @@ -234,12 +225,14 @@
234 225 {
235 226 }
236 227  
237   -static inline void get_mems_allowed(void)
  228 +static inline unsigned int get_mems_allowed(void)
238 229 {
  230 + return 0;
239 231 }
240 232  
241   -static inline void put_mems_allowed(void)
  233 +static inline bool put_mems_allowed(unsigned int seq)
242 234 {
  235 + return true;
243 236 }
244 237  
245 238 #endif /* !CONFIG_CPUSETS */
include/linux/init_task.h
... ... @@ -29,6 +29,13 @@
29 29 #define INIT_GROUP_RWSEM(sig)
30 30 #endif
31 31  
  32 +#ifdef CONFIG_CPUSETS
  33 +#define INIT_CPUSET_SEQ \
  34 + .mems_allowed_seq = SEQCNT_ZERO,
  35 +#else
  36 +#define INIT_CPUSET_SEQ
  37 +#endif
  38 +
32 39 #define INIT_SIGNALS(sig) { \
33 40 .nr_threads = 1, \
34 41 .wait_chldexit = __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
... ... @@ -192,6 +199,7 @@
192 199 INIT_FTRACE_GRAPH \
193 200 INIT_TRACE_RECURSION \
194 201 INIT_TASK_RCU_PREEMPT(tsk) \
  202 + INIT_CPUSET_SEQ \
195 203 }
include/linux/sched.h
... ... @@ -1514,7 +1514,7 @@
1514 1514 #endif
1515 1515 #ifdef CONFIG_CPUSETS
1516 1516 nodemask_t mems_allowed; /* Protected by alloc_lock */
1517   - int mems_allowed_change_disable;
  1517 + seqcount_t mems_allowed_seq; /* Sequence no to catch updates */
1518 1518 int cpuset_mem_spread_rotor;
1519 1519 int cpuset_slab_spread_rotor;
1520 1520 #endif
kernel/cpuset.c
... ... @@ -964,7 +964,6 @@
964 964 {
965 965 bool need_loop;
966 966  
967   -repeat:
968 967 /*
969 968 * Allow tasks that have access to memory reserves because they have
970 969 * been OOM killed to get memory anywhere.
... ... @@ -983,45 +982,19 @@
983 982 */
984 983 need_loop = task_has_mempolicy(tsk) ||
985 984 !nodes_intersects(*newmems, tsk->mems_allowed);
  985 +
  986 + if (need_loop)
  987 + write_seqcount_begin(&tsk->mems_allowed_seq);
  988 +
986 989 nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
987 990 mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
988 991  
989   - /*
990   - * ensure checking ->mems_allowed_change_disable after setting all new
991   - * allowed nodes.
992   - *
993   - * the read-side task can see an nodemask with new allowed nodes and
994   - * old allowed nodes. and if it allocates page when cpuset clears newly
995   - * disallowed ones continuous, it can see the new allowed bits.
996   - *
997   - * And if setting all new allowed nodes is after the checking, setting
998   - * all new allowed nodes and clearing newly disallowed ones will be done
999   - * continuous, and the read-side task may find no node to alloc page.
1000   - */
1001   - smp_mb();
1002   -
1003   - /*
1004   - * Allocation of memory is very fast, we needn't sleep when waiting
1005   - * for the read-side.
1006   - */
1007   - while (need_loop && ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
1008   - task_unlock(tsk);
1009   - if (!task_curr(tsk))
1010   - yield();
1011   - goto repeat;
1012   - }
1013   -
1014   - /*
1015   - * ensure checking ->mems_allowed_change_disable before clearing all new
1016   - * disallowed nodes.
1017   - *
1018   - * if clearing newly disallowed bits before the checking, the read-side
1019   - * task may find no node to alloc page.
1020   - */
1021   - smp_mb();
1022   -
1023 992 mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2);
1024 993 tsk->mems_allowed = *newmems;
  994 +
  995 + if (need_loop)
  996 + write_seqcount_end(&tsk->mems_allowed_seq);
  997 +
1025 998 task_unlock(tsk);
1026 999 }
1027 1000  
kernel/fork.c
... ... @@ -1237,6 +1237,7 @@
1237 1237 #ifdef CONFIG_CPUSETS
1238 1238 p->cpuset_mem_spread_rotor = NUMA_NO_NODE;
1239 1239 p->cpuset_slab_spread_rotor = NUMA_NO_NODE;
  1240 + seqcount_init(&p->mems_allowed_seq);
1240 1241 #endif
1241 1242 #ifdef CONFIG_TRACE_IRQFLAGS
1242 1243 p->irq_events = 0;
mm/filemap.c
... ... @@ -499,10 +499,13 @@
499 499 struct page *page;
500 500  
501 501 if (cpuset_do_page_mem_spread()) {
502   - get_mems_allowed();
503   - n = cpuset_mem_spread_node();
504   - page = alloc_pages_exact_node(n, gfp, 0);
505   - put_mems_allowed();
  502 + unsigned int cpuset_mems_cookie;
  503 + do {
  504 + cpuset_mems_cookie = get_mems_allowed();
  505 + n = cpuset_mem_spread_node();
  506 + page = alloc_pages_exact_node(n, gfp, 0);
  507 + } while (!put_mems_allowed(cpuset_mems_cookie) && !page);
  508 +
506 509 return page;
507 510 }
508 511 return alloc_pages(gfp, 0);
mm/hugetlb.c
... ... @@ -454,14 +454,16 @@
454 454 struct vm_area_struct *vma,
455 455 unsigned long address, int avoid_reserve)
456 456 {
457   - struct page *page = NULL;
  457 + struct page *page;
458 458 struct mempolicy *mpol;
459 459 nodemask_t *nodemask;
460 460 struct zonelist *zonelist;
461 461 struct zone *zone;
462 462 struct zoneref *z;
  463 + unsigned int cpuset_mems_cookie;
463 464  
464   - get_mems_allowed();
  465 +retry_cpuset:
  466 + cpuset_mems_cookie = get_mems_allowed();
465 467 zonelist = huge_zonelist(vma, address,
466 468 htlb_alloc_mask, &mpol, &nodemask);
467 469 /*
... ... @@ -488,10 +490,15 @@
488 490 }
489 491 }
490 492 }
491   -err:
  493 +
492 494 mpol_cond_put(mpol);
493   - put_mems_allowed();
  495 + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
  496 + goto retry_cpuset;
494 497 return page;
  498 +
  499 +err:
  500 + mpol_cond_put(mpol);
  501 + return NULL;
495 502 }
496 503  
497 504 static void update_and_free_page(struct hstate *h, struct page *page)
mm/mempolicy.c
... ... @@ -1850,18 +1850,24 @@
1850 1850 alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
1851 1851 unsigned long addr, int node)
1852 1852 {
1853   - struct mempolicy *pol = get_vma_policy(current, vma, addr);
  1853 + struct mempolicy *pol;
1854 1854 struct zonelist *zl;
1855 1855 struct page *page;
  1856 + unsigned int cpuset_mems_cookie;
1856 1857  
1857   - get_mems_allowed();
  1858 +retry_cpuset:
  1859 + pol = get_vma_policy(current, vma, addr);
  1860 + cpuset_mems_cookie = get_mems_allowed();
  1861 +
1858 1862 if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
1859 1863 unsigned nid;
1860 1864  
1861 1865 nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
1862 1866 mpol_cond_put(pol);
1863 1867 page = alloc_page_interleave(gfp, order, nid);
1864   - put_mems_allowed();
  1868 + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
  1869 + goto retry_cpuset;
  1870 +
1865 1871 return page;
1866 1872 }
1867 1873 zl = policy_zonelist(gfp, pol, node);
... ... @@ -1872,7 +1878,8 @@
1872 1878 struct page *page = __alloc_pages_nodemask(gfp, order,
1873 1879 zl, policy_nodemask(gfp, pol));
1874 1880 __mpol_put(pol);
1875   - put_mems_allowed();
  1881 + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
  1882 + goto retry_cpuset;
1876 1883 return page;
1877 1884 }
1878 1885 /*
... ... @@ -1880,7 +1887,8 @@
1880 1887 */
1881 1888 page = __alloc_pages_nodemask(gfp, order, zl,
1882 1889 policy_nodemask(gfp, pol));
1883   - put_mems_allowed();
  1890 + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
  1891 + goto retry_cpuset;
1884 1892 return page;
1885 1893 }
... ... @@ -1907,11 +1915,14 @@
1907 1915 {
1908 1916 struct mempolicy *pol = current->mempolicy;
1909 1917 struct page *page;
  1918 + unsigned int cpuset_mems_cookie;
1910 1919  
1911 1920 if (!pol || in_interrupt() || (gfp & __GFP_THISNODE))
1912 1921 pol = &default_policy;
1913 1922  
1914   - get_mems_allowed();
  1923 +retry_cpuset:
  1924 + cpuset_mems_cookie = get_mems_allowed();
  1925 +
1915 1926 /*
1916 1927 * No reference counting needed for current->mempolicy
1917 1928 * nor system default_policy
... ... @@ -1922,7 +1933,10 @@
1922 1933 page = __alloc_pages_nodemask(gfp, order,
1923 1934 policy_zonelist(gfp, pol, numa_node_id()),
1924 1935 policy_nodemask(gfp, pol));
1925   - put_mems_allowed();
  1936 +
  1937 + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
  1938 + goto retry_cpuset;
  1939 +
1926 1940 return page;
1927 1941 }
1928 1942 EXPORT_SYMBOL(alloc_pages_current);
mm/page_alloc.c
... ... @@ -2380,8 +2380,9 @@
2380 2380 {
2381 2381 enum zone_type high_zoneidx = gfp_zone(gfp_mask);
2382 2382 struct zone *preferred_zone;
2383   - struct page *page;
  2383 + struct page *page = NULL;
2384 2384 int migratetype = allocflags_to_migratetype(gfp_mask);
  2385 + unsigned int cpuset_mems_cookie;
2385 2386  
2386 2387 gfp_mask &= gfp_allowed_mask;
... ... @@ -2400,15 +2401,15 @@
2400 2401 if (unlikely(!zonelist->_zonerefs->zone))
2401 2402 return NULL;
2402 2403  
2403   - get_mems_allowed();
  2404 +retry_cpuset:
  2405 + cpuset_mems_cookie = get_mems_allowed();
  2406 +
2404 2407 /* The preferred zone is used for statistics later */
2405 2408 first_zones_zonelist(zonelist, high_zoneidx,
2406 2409 nodemask ? : &cpuset_current_mems_allowed,
2407 2410 &preferred_zone);
2408   - if (!preferred_zone) {
2409   - put_mems_allowed();
2410   - return NULL;
2411   - }
  2411 + if (!preferred_zone)
  2412 + goto out;
2412 2413  
2413 2414 /* First allocation attempt */
2414 2415 page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
... ... @@ -2418,9 +2419,19 @@
2418 2419 page = __alloc_pages_slowpath(gfp_mask, order,
2419 2420 zonelist, high_zoneidx, nodemask,
2420 2421 preferred_zone, migratetype);
2421   - put_mems_allowed();
2422 2422  
2423 2423 trace_mm_page_alloc(page, order, gfp_mask, migratetype);
  2424 +
  2425 +out:
  2426 + /*
  2427 + * When updating a task's mems_allowed, it is possible to race with
  2428 + * parallel threads in such a way that an allocation can fail while
  2429 + * the mask is being updated. If a page allocation is about to fail,
  2430 + * check if the cpuset changed during allocation and if so, retry.
  2431 + */
  2432 + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
  2433 + goto retry_cpuset;
  2434 +
2424 2435 return page;
2425 2436 }
2426 2437 EXPORT_SYMBOL(__alloc_pages_nodemask);
... ... @@ -2634,13 +2645,15 @@
2634 2645 bool skip_free_areas_node(unsigned int flags, int nid)
2635 2646 {
2636 2647 bool ret = false;
  2648 + unsigned int cpuset_mems_cookie;
2637 2649  
2638 2650 if (!(flags & SHOW_MEM_FILTER_NODES))
2639 2651 goto out;
2640 2652  
2641   - get_mems_allowed();
2642   - ret = !node_isset(nid, cpuset_current_mems_allowed);
2643   - put_mems_allowed();
  2653 + do {
  2654 + cpuset_mems_cookie = get_mems_allowed();
  2655 + ret = !node_isset(nid, cpuset_current_mems_allowed);
  2656 + } while (!put_mems_allowed(cpuset_mems_cookie));
2644 2657 out:
2645 2658 return ret;
2646 2659 }
mm/slab.c
... ... @@ -3284,12 +3284,10 @@
3284 3284 if (in_interrupt() || (flags & __GFP_THISNODE))
3285 3285 return NULL;
3286 3286 nid_alloc = nid_here = numa_mem_id();
3287   - get_mems_allowed();
3288 3287 if (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD))
3289 3288 nid_alloc = cpuset_slab_spread_node();
3290 3289 else if (current->mempolicy)
3291 3290 nid_alloc = slab_node(current->mempolicy);
3292   - put_mems_allowed();
3293 3291 if (nid_alloc != nid_here)
3294 3292 return ____cache_alloc_node(cachep, flags, nid_alloc);
3295 3293 return NULL;
... ... @@ -3312,14 +3310,17 @@
3312 3310 enum zone_type high_zoneidx = gfp_zone(flags);
3313 3311 void *obj = NULL;
3314 3312 int nid;
  3313 + unsigned int cpuset_mems_cookie;
3315 3314  
3316 3315 if (flags & __GFP_THISNODE)
3317 3316 return NULL;
3318 3317  
3319   - get_mems_allowed();
3320   - zonelist = node_zonelist(slab_node(current->mempolicy), flags);
3321 3318 local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
3322 3319  
  3320 +retry_cpuset:
  3321 + cpuset_mems_cookie = get_mems_allowed();
  3322 + zonelist = node_zonelist(slab_node(current->mempolicy), flags);
  3323 +
3323 3324 retry:
3324 3325 /*
3325 3326 * Look through allowed nodes for objects available
... ... @@ -3372,7 +3373,9 @@
3372 3373 }
3373 3374 }
3374 3375 }
3375   - put_mems_allowed();
  3376 +
  3377 + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !obj))
  3378 + goto retry_cpuset;
3376 3379 return obj;
3377 3380 }
3378 3381  
mm/slub.c
... ... @@ -1581,6 +1581,7 @@
1581 1581 struct zone *zone;
1582 1582 enum zone_type high_zoneidx = gfp_zone(flags);
1583 1583 void *object;
  1584 + unsigned int cpuset_mems_cookie;
1584 1585  
1585 1586 /*
1586 1587 * The defrag ratio allows a configuration of the tradeoffs between
... ... @@ -1604,23 +1605,32 @@
1604 1605 get_cycles() % 1024 > s->remote_node_defrag_ratio)
1605 1606 return NULL;
1606 1607  
1607   - get_mems_allowed();
1608   - zonelist = node_zonelist(slab_node(current->mempolicy), flags);
1609   - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
1610   - struct kmem_cache_node *n;
  1608 + do {
  1609 + cpuset_mems_cookie = get_mems_allowed();
  1610 + zonelist = node_zonelist(slab_node(current->mempolicy), flags);
  1611 + for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
  1612 + struct kmem_cache_node *n;
1611 1613  
1612   - n = get_node(s, zone_to_nid(zone));
  1614 + n = get_node(s, zone_to_nid(zone));
1613 1615  
1614   - if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
1615   - n->nr_partial > s->min_partial) {
1616   - object = get_partial_node(s, n, c);
1617   - if (object) {
1618   - put_mems_allowed();
1619   - return object;
  1616 + if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
  1617 + n->nr_partial > s->min_partial) {
  1618 + object = get_partial_node(s, n, c);
  1619 + if (object) {
  1620 + /*
  1621 + * Return the object even if
  1622 + * put_mems_allowed indicated that
  1623 + * the cpuset mems_allowed was
  1624 + * updated in parallel. It's a
  1625 + * harmless race between the alloc
  1626 + * and the cpuset update.
  1627 + */
  1628 + put_mems_allowed(cpuset_mems_cookie);
  1629 + return object;
  1630 + }
1620 1631 }
1621 1632 }
1622   - }
1623   - put_mems_allowed();
  1633 + } while (!put_mems_allowed(cpuset_mems_cookie));
1624 1634 #endif
1625 1635 return NULL;
1626 1636 }
mm/vmscan.c
... ... @@ -2343,7 +2343,6 @@
2343 2343 unsigned long writeback_threshold;
2344 2344 bool aborted_reclaim;
2345 2345  
2346   - get_mems_allowed();
2347 2346 delayacct_freepages_start();
2348 2347  
2349 2348 if (global_reclaim(sc))
... ... @@ -2407,7 +2406,6 @@
2407 2406  
2408 2407 out:
2409 2408 delayacct_freepages_end();
2410   - put_mems_allowed();
2411 2409  
2412 2410 if (sc->nr_reclaimed)
2413 2411 return sc->nr_reclaimed;