Commit cf417141cbb3a4ceb5cca15b2c1f099bd0a6603c

Authored by Max Krasnyansky
Committed by Ingo Molnar
1 parent b635acec48

sched, cpuset: rework sched domains and CPU hotplug handling (v4)

This is an updated version of my previous cpuset patch on top of
the latest mainline git.
The patch fixes CPU hotplug handling issues in the current cpusets code,
namely circular locking in rebuild_sched_domains() and unsafe access to
the cpu_online_map in the cpuset CPU hotplug handler.

This version includes changes suggested by Paul Jackson (naming, comments,
style, etc). I also got rid of the separate workqueue thread because it is
now safe to call get_online_cpus() from workqueue callbacks.

Here are some more details:

rebuild_sched_domains() is the only way to rebuild sched domains
correctly based on the current cpuset settings. What this means
is that we need to be able to call it from different contexts,
such as CPU hotplug.
Also, the latest scheduler code in -tip now calls rebuild_sched_domains()
directly from functions like arch_reinit_sched_domains().

In order to support that properly we need to rework cpuset locking
rules to avoid circular dependencies, which is what this patch does.
New lock nesting rules are explained in the comments.
We can now safely call rebuild_sched_domains() from virtually any
context. The only requirement is that it needs to be called under
get_online_cpus(). This allows cpu hotplug handlers and the scheduler
to call rebuild_sched_domains() directly.
The rest of the cpuset code now offloads sched domains rebuilds to
a workqueue (async_rebuild_sched_domains()).
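
For illustration only, here is a sketch of the new calling convention,
condensed from do_rebuild_sched_domains() in the diff below (the wrapper
name example_rebuild() is mine; everything else is from the patch):

	static void example_rebuild(void)
	{
		struct sched_domain_attr *attr;
		cpumask_t *doms;
		int ndoms;

		get_online_cpus();		/* pin cpu_online_map */

		/* cgroup_lock() now nests inside get_online_cpus() */
		cgroup_lock();
		ndoms = generate_sched_domains(&doms, &attr);
		cgroup_unlock();

		/* doms/attr are handed off; partition_sched_domains() frees them */
		partition_sched_domains(ndoms, doms, attr);

		put_online_cpus();
	}

Code that already holds cgroup_mutex cannot take the locks in that order,
so it calls async_rebuild_sched_domains(), which simply does
schedule_work(&rebuild_sched_domains_work).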

This version of the patch addresses comments from the previous review.
I fixed all mis-formatted comments and trailing spaces.

I also factored out the code that builds the domain masks and split up CPU and
memory hotplug handling. This was needed to simplify locking, to avoid unsafe
access to the cpu_online_map from the memory hotplug handler, and in general to
make things cleaner.
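
Roughly, the two handlers now reduce to this (a sketch of the hunks further
down; doms, attr and ndoms are the locals declared there):

	/* CPU hotplug: the notifier already runs within get_online_cpus() */
	cgroup_lock();
	top_cpuset.cpus_allowed = cpu_online_map;
	scan_for_empty_cpusets(&top_cpuset);
	ndoms = generate_sched_domains(&doms, &attr);
	cgroup_unlock();
	partition_sched_domains(ndoms, doms, attr);

	/* Memory hotplug: never touches cpu_online_map or the sched domains */
	cgroup_lock();
	top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
	scan_for_empty_cpusets(&top_cpuset);
	cgroup_unlock();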

The patch passes moderate testing (building a kernel with -j 16, creating and
removing domains, and bringing CPUs offline/online at the same time) on a
quad-core2 based machine.

It passes lockdep checks, even with preemptible RCU enabled.
This time I also tested it with the suspend/resume path, and everything works
as expected.

Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Acked-by: Paul Jackson <pj@sgi.com>
Cc: menage@google.com
Cc: a.p.zijlstra@chello.nl
Cc: vegard.nossum@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Showing 1 changed file with 182 additions and 130 deletions

... ... @@ -14,6 +14,8 @@
14 14 * 2003-10-22 Updates by Stephen Hemminger.
15 15 * 2004 May-July Rework by Paul Jackson.
16 16 * 2006 Rework by Paul Menage to use generic cgroups
  17 + * 2008 Rework of the scheduler domains and CPU hotplug handling
  18 + * by Max Krasnyansky
17 19 *
18 20 * This file is subject to the terms and conditions of the GNU General Public
19 21 * License. See the file COPYING in the main directory of the Linux
... ... @@ -236,9 +238,11 @@
236 238  
237 239 static DEFINE_MUTEX(callback_mutex);
238 240  
239   -/* This is ugly, but preserves the userspace API for existing cpuset
  241 +/*
  242 + * This is ugly, but preserves the userspace API for existing cpuset
240 243 * users. If someone tries to mount the "cpuset" filesystem, we
241   - * silently switch it to mount "cgroup" instead */
  244 + * silently switch it to mount "cgroup" instead
  245 + */
242 246 static int cpuset_get_sb(struct file_system_type *fs_type,
243 247 int flags, const char *unused_dev_name,
244 248 void *data, struct vfsmount *mnt)
... ... @@ -473,10 +477,9 @@
473 477 }
474 478  
475 479 /*
476   - * Helper routine for rebuild_sched_domains().
  480 + * Helper routine for generate_sched_domains().
477 481 * Do cpusets a, b have overlapping cpus_allowed masks?
478 482 */
479   -
480 483 static int cpusets_overlap(struct cpuset *a, struct cpuset *b)
481 484 {
482 485 return cpus_intersects(a->cpus_allowed, b->cpus_allowed);
... ... @@ -518,27 +521,16 @@
518 521 }
519 522  
520 523 /*
521   - * rebuild_sched_domains()
  524 + * generate_sched_domains()
522 525 *
523   - * This routine will be called to rebuild the scheduler's dynamic
524   - * sched domains:
525   - * - if the flag 'sched_load_balance' of any cpuset with non-empty
526   - * 'cpus' changes,
527   - * - or if the 'cpus' allowed changes in any cpuset which has that
528   - * flag enabled,
529   - * - or if the 'sched_relax_domain_level' of any cpuset which has
530   - * that flag enabled and with non-empty 'cpus' changes,
531   - * - or if any cpuset with non-empty 'cpus' is removed,
532   - * - or if a cpu gets offlined.
  526 + * This function builds a partial partition of the systems CPUs
  527 + * A 'partial partition' is a set of non-overlapping subsets whose
  528 + * union is a subset of that set.
  529 + * The output of this function needs to be passed to kernel/sched.c
  530 + * partition_sched_domains() routine, which will rebuild the scheduler's
  531 + * load balancing domains (sched domains) as specified by that partial
  532 + * partition.
533 533 *
534   - * This routine builds a partial partition of the systems CPUs
535   - * (the set of non-overlappping cpumask_t's in the array 'part'
536   - * below), and passes that partial partition to the kernel/sched.c
537   - * partition_sched_domains() routine, which will rebuild the
538   - * schedulers load balancing domains (sched domains) as specified
539   - * by that partial partition. A 'partial partition' is a set of
540   - * non-overlapping subsets whose union is a subset of that set.
541   - *
542 534 * See "What is sched_load_balance" in Documentation/cpusets.txt
543 535 * for a background explanation of this.
544 536 *
... ... @@ -547,13 +539,7 @@
547 539 * domains when operating in the severe memory shortage situations
548 540 * that could cause allocation failures below.
549 541 *
550   - * Call with cgroup_mutex held. May take callback_mutex during
551   - * call due to the kfifo_alloc() and kmalloc() calls. May nest
552   - * a call to the get_online_cpus()/put_online_cpus() pair.
553   - * Must not be called holding callback_mutex, because we must not
554   - * call get_online_cpus() while holding callback_mutex. Elsewhere
555   - * the kernel nests callback_mutex inside get_online_cpus() calls.
556   - * So the reverse nesting would risk an ABBA deadlock.
  542 + * Must be called with cgroup_lock held.
557 543 *
558 544 * The three key local variables below are:
559 545 * q - a linked-list queue of cpuset pointers, used to implement a
... ... @@ -588,10 +574,10 @@
588 574 * element of the partition (one sched domain) to be passed to
589 575 * partition_sched_domains().
590 576 */
591   -
592   -void rebuild_sched_domains(void)
  577 +static int generate_sched_domains(cpumask_t **domains,
  578 + struct sched_domain_attr **attributes)
593 579 {
594   - LIST_HEAD(q); /* queue of cpusets to be scanned*/
  580 + LIST_HEAD(q); /* queue of cpusets to be scanned */
595 581 struct cpuset *cp; /* scans q */
596 582 struct cpuset **csa; /* array of all cpuset ptrs */
597 583 int csn; /* how many cpuset ptrs in csa so far */
... ... @@ -601,23 +587,26 @@
601 587 int ndoms; /* number of sched domains in result */
602 588 int nslot; /* next empty doms[] cpumask_t slot */
603 589  
604   - csa = NULL;
  590 + ndoms = 0;
605 591 doms = NULL;
606 592 dattr = NULL;
  593 + csa = NULL;
607 594  
608 595 /* Special case for the 99% of systems with one, full, sched domain */
609 596 if (is_sched_load_balance(&top_cpuset)) {
610   - ndoms = 1;
611 597 doms = kmalloc(sizeof(cpumask_t), GFP_KERNEL);
612 598 if (!doms)
613   - goto rebuild;
  599 + goto done;
  600 +
614 601 dattr = kmalloc(sizeof(struct sched_domain_attr), GFP_KERNEL);
615 602 if (dattr) {
616 603 *dattr = SD_ATTR_INIT;
617 604 update_domain_attr_tree(dattr, &top_cpuset);
618 605 }
619 606 *doms = top_cpuset.cpus_allowed;
620   - goto rebuild;
  607 +
  608 + ndoms = 1;
  609 + goto done;
621 610 }
622 611  
623 612 csa = kmalloc(number_of_cpusets * sizeof(cp), GFP_KERNEL);
... ... @@ -680,63 +669,143 @@
680 669 }
681 670 }
682 671  
683   - /* Convert <csn, csa> to <ndoms, doms> */
  672 + /*
  673 + * Now we know how many domains to create.
  674 + * Convert <csn, csa> to <ndoms, doms> and populate cpu masks.
  675 + */
684 676 doms = kmalloc(ndoms * sizeof(cpumask_t), GFP_KERNEL);
685   - if (!doms)
686   - goto rebuild;
  677 + if (!doms) {
  678 + ndoms = 0;
  679 + goto done;
  680 + }
  681 +
  682 + /*
  683 + * The rest of the code, including the scheduler, can deal with
  684 + * dattr==NULL case. No need to abort if alloc fails.
  685 + */
687 686 dattr = kmalloc(ndoms * sizeof(struct sched_domain_attr), GFP_KERNEL);
688 687  
689 688 for (nslot = 0, i = 0; i < csn; i++) {
690 689 struct cpuset *a = csa[i];
  690 + cpumask_t *dp;
691 691 int apn = a->pn;
692 692  
693   - if (apn >= 0) {
694   - cpumask_t *dp = doms + nslot;
  693 + if (apn < 0) {
  694 + /* Skip completed partitions */
  695 + continue;
  696 + }
695 697  
696   - if (nslot == ndoms) {
697   - static int warnings = 10;
698   - if (warnings) {
699   - printk(KERN_WARNING
700   - "rebuild_sched_domains confused:"
701   - " nslot %d, ndoms %d, csn %d, i %d,"
702   - " apn %d\n",
703   - nslot, ndoms, csn, i, apn);
704   - warnings--;
705   - }
706   - continue;
  698 + dp = doms + nslot;
  699 +
  700 + if (nslot == ndoms) {
  701 + static int warnings = 10;
  702 + if (warnings) {
  703 + printk(KERN_WARNING
  704 + "rebuild_sched_domains confused:"
  705 + " nslot %d, ndoms %d, csn %d, i %d,"
  706 + " apn %d\n",
  707 + nslot, ndoms, csn, i, apn);
  708 + warnings--;
707 709 }
  710 + continue;
  711 + }
708 712  
709   - cpus_clear(*dp);
710   - if (dattr)
711   - *(dattr + nslot) = SD_ATTR_INIT;
712   - for (j = i; j < csn; j++) {
713   - struct cpuset *b = csa[j];
  713 + cpus_clear(*dp);
  714 + if (dattr)
  715 + *(dattr + nslot) = SD_ATTR_INIT;
  716 + for (j = i; j < csn; j++) {
  717 + struct cpuset *b = csa[j];
714 718  
715   - if (apn == b->pn) {
716   - cpus_or(*dp, *dp, b->cpus_allowed);
717   - b->pn = -1;
718   - if (dattr)
719   - update_domain_attr_tree(dattr
720   - + nslot, b);
721   - }
  719 + if (apn == b->pn) {
  720 + cpus_or(*dp, *dp, b->cpus_allowed);
  721 + if (dattr)
  722 + update_domain_attr_tree(dattr + nslot, b);
  723 +
  724 + /* Done with this partition */
  725 + b->pn = -1;
722 726 }
723   - nslot++;
724 727 }
  728 + nslot++;
725 729 }
726 730 BUG_ON(nslot != ndoms);
727 731  
728   -rebuild:
729   - /* Have scheduler rebuild sched domains */
  732 +done:
  733 + kfree(csa);
  734 +
  735 + *domains = doms;
  736 + *attributes = dattr;
  737 + return ndoms;
  738 +}
  739 +
  740 +/*
  741 + * Rebuild scheduler domains.
  742 + *
  743 + * Call with neither cgroup_mutex held nor within get_online_cpus().
  744 + * Takes both cgroup_mutex and get_online_cpus().
  745 + *
  746 + * Cannot be directly called from cpuset code handling changes
  747 + * to the cpuset pseudo-filesystem, because it cannot be called
  748 + * from code that already holds cgroup_mutex.
  749 + */
  750 +static void do_rebuild_sched_domains(struct work_struct *unused)
  751 +{
  752 + struct sched_domain_attr *attr;
  753 + cpumask_t *doms;
  754 + int ndoms;
  755 +
730 756 get_online_cpus();
731   - partition_sched_domains(ndoms, doms, dattr);
  757 +
  758 + /* Generate domain masks and attrs */
  759 + cgroup_lock();
  760 + ndoms = generate_sched_domains(&doms, &attr);
  761 + cgroup_unlock();
  762 +
  763 + /* Have scheduler rebuild the domains */
  764 + partition_sched_domains(ndoms, doms, attr);
  765 +
732 766 put_online_cpus();
  767 +}
733 768  
734   -done:
735   - kfree(csa);
736   - /* Don't kfree(doms) -- partition_sched_domains() does that. */
737   - /* Don't kfree(dattr) -- partition_sched_domains() does that. */
  769 +static DECLARE_WORK(rebuild_sched_domains_work, do_rebuild_sched_domains);
  770 +
  771 +/*
  772 + * Rebuild scheduler domains, asynchronously via workqueue.
  773 + *
  774 + * If the flag 'sched_load_balance' of any cpuset with non-empty
  775 + * 'cpus' changes, or if the 'cpus' allowed changes in any cpuset
  776 + * which has that flag enabled, or if any cpuset with a non-empty
  777 + * 'cpus' is removed, then call this routine to rebuild the
  778 + * scheduler's dynamic sched domains.
  779 + *
  780 + * The rebuild_sched_domains() and partition_sched_domains()
  781 + * routines must nest cgroup_lock() inside get_online_cpus(),
  782 + * but such cpuset changes as these must nest that locking the
  783 + * other way, holding cgroup_lock() for much of the code.
  784 + *
  785 + * So in order to avoid an ABBA deadlock, the cpuset code handling
  786 + * these user changes delegates the actual sched domain rebuilding
  787 + * to a separate workqueue thread, which ends up processing the
  788 + * above do_rebuild_sched_domains() function.
  789 + */
  790 +static void async_rebuild_sched_domains(void)
  791 +{
  792 + schedule_work(&rebuild_sched_domains_work);
738 793 }
739 794  
  795 +/*
  796 + * Accomplishes the same scheduler domain rebuild as the above
  797 + * async_rebuild_sched_domains(), however it directly calls the
  798 + * rebuild routine synchronously rather than calling it via an
  799 + * asynchronous work thread.
  800 + *
  801 + * This can only be called from code that is not holding
  802 + * cgroup_mutex (not nested in a cgroup_lock() call.)
  803 + */
  804 +void rebuild_sched_domains(void)
  805 +{
  806 + do_rebuild_sched_domains(NULL);
  807 +}
  808 +
740 809 /**
741 810 * cpuset_test_cpumask - test a task's cpus_allowed versus its cpuset's
742 811 * @tsk: task to test
... ... @@ -863,7 +932,7 @@
863 932 return retval;
864 933  
865 934 if (is_load_balanced)
866   - rebuild_sched_domains();
  935 + async_rebuild_sched_domains();
867 936 return 0;
868 937 }
869 938  
... ... @@ -1090,7 +1159,7 @@
1090 1159 if (val != cs->relax_domain_level) {
1091 1160 cs->relax_domain_level = val;
1092 1161 if (!cpus_empty(cs->cpus_allowed) && is_sched_load_balance(cs))
1093   - rebuild_sched_domains();
  1162 + async_rebuild_sched_domains();
1094 1163 }
1095 1164  
1096 1165 return 0;
... ... @@ -1131,7 +1200,7 @@
1131 1200 mutex_unlock(&callback_mutex);
1132 1201  
1133 1202 if (cpus_nonempty && balance_flag_changed)
1134   - rebuild_sched_domains();
  1203 + async_rebuild_sched_domains();
1135 1204  
1136 1205 return 0;
1137 1206 }
... ... @@ -1492,6 +1561,9 @@
1492 1561 default:
1493 1562 BUG();
1494 1563 }
  1564 +
  1565 + /* Unreachable but makes gcc happy */
  1566 + return 0;
1495 1567 }
1496 1568  
1497 1569 static s64 cpuset_read_s64(struct cgroup *cont, struct cftype *cft)
... ... @@ -1504,6 +1576,9 @@
1504 1576 default:
1505 1577 BUG();
1506 1578 }
  1579 +
  1580 + /* Unreachable but makes gcc happy */
  1581 + return 0;
1507 1582 }
1508 1583  
1509 1584  
... ... @@ -1692,15 +1767,9 @@
1692 1767 }
1693 1768  
1694 1769 /*
1695   - * Locking note on the strange update_flag() call below:
1696   - *
1697 1770 * If the cpuset being removed has its flag 'sched_load_balance'
1698 1771 * enabled, then simulate turning sched_load_balance off, which
1699   - * will call rebuild_sched_domains(). The get_online_cpus()
1700   - * call in rebuild_sched_domains() must not be made while holding
1701   - * callback_mutex. Elsewhere the kernel nests callback_mutex inside
1702   - * get_online_cpus() calls. So the reverse nesting would risk an
1703   - * ABBA deadlock.
  1772 + * will call async_rebuild_sched_domains().
1704 1773 */
1705 1774  
1706 1775 static void cpuset_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
... ... @@ -1719,7 +1788,7 @@
1719 1788 struct cgroup_subsys cpuset_subsys = {
1720 1789 .name = "cpuset",
1721 1790 .create = cpuset_create,
1722   - .destroy = cpuset_destroy,
  1791 + .destroy = cpuset_destroy,
1723 1792 .can_attach = cpuset_can_attach,
1724 1793 .attach = cpuset_attach,
1725 1794 .populate = cpuset_populate,
... ... @@ -1811,7 +1880,7 @@
1811 1880 }
1812 1881  
1813 1882 /*
1814   - * If common_cpu_mem_hotplug_unplug(), below, unplugs any CPUs
  1883 + * If CPU and/or memory hotplug handlers, below, unplug any CPUs
1815 1884 * or memory nodes, we need to walk over the cpuset hierarchy,
1816 1885 * removing that CPU or node from all cpusets. If this removes the
1817 1886 * last CPU or node from a cpuset, then move the tasks in the empty
... ... @@ -1903,35 +1972,6 @@
1903 1972 }
1904 1973  
1905 1974 /*
1906   - * The cpus_allowed and mems_allowed nodemasks in the top_cpuset track
1907   - * cpu_online_map and node_states[N_HIGH_MEMORY]. Force the top cpuset to
1908   - * track what's online after any CPU or memory node hotplug or unplug event.
1909   - *
1910   - * Since there are two callers of this routine, one for CPU hotplug
1911   - * events and one for memory node hotplug events, we could have coded
1912   - * two separate routines here. We code it as a single common routine
1913   - * in order to minimize text size.
1914   - */
1915   -
1916   -static void common_cpu_mem_hotplug_unplug(int rebuild_sd)
1917   -{
1918   - cgroup_lock();
1919   -
1920   - top_cpuset.cpus_allowed = cpu_online_map;
1921   - top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
1922   - scan_for_empty_cpusets(&top_cpuset);
1923   -
1924   - /*
1925   - * Scheduler destroys domains on hotplug events.
1926   - * Rebuild them based on the current settings.
1927   - */
1928   - if (rebuild_sd)
1929   - rebuild_sched_domains();
1930   -
1931   - cgroup_unlock();
1932   -}
1933   -
1934   -/*
1935 1975 * The top_cpuset tracks what CPUs and Memory Nodes are online,
1936 1976 * period. This is necessary in order to make cpusets transparent
1937 1977 * (of no affect) on systems that are actively using CPU hotplug
... ... @@ -1939,40 +1979,52 @@
1939 1979 *
1940 1980 * This routine ensures that top_cpuset.cpus_allowed tracks
1941 1981 * cpu_online_map on each CPU hotplug (cpuhp) event.
  1982 + *
  1983 + * Called within get_online_cpus(). Needs to call cgroup_lock()
  1984 + * before calling generate_sched_domains().
1942 1985 */
1943   -
1944   -static int cpuset_handle_cpuhp(struct notifier_block *unused_nb,
  1986 +static int cpuset_track_online_cpus(struct notifier_block *unused_nb,
1945 1987 unsigned long phase, void *unused_cpu)
1946 1988 {
  1989 + struct sched_domain_attr *attr;
  1990 + cpumask_t *doms;
  1991 + int ndoms;
  1992 +
1947 1993 switch (phase) {
1948   - case CPU_UP_CANCELED:
1949   - case CPU_UP_CANCELED_FROZEN:
1950   - case CPU_DOWN_FAILED:
1951   - case CPU_DOWN_FAILED_FROZEN:
1952 1994 case CPU_ONLINE:
1953 1995 case CPU_ONLINE_FROZEN:
1954 1996 case CPU_DEAD:
1955 1997 case CPU_DEAD_FROZEN:
1956   - common_cpu_mem_hotplug_unplug(1);
1957 1998 break;
  1999 +
1958 2000 default:
1959 2001 return NOTIFY_DONE;
1960 2002 }
1961 2003  
  2004 + cgroup_lock();
  2005 + top_cpuset.cpus_allowed = cpu_online_map;
  2006 + scan_for_empty_cpusets(&top_cpuset);
  2007 + ndoms = generate_sched_domains(&doms, &attr);
  2008 + cgroup_unlock();
  2009 +
  2010 + /* Have scheduler rebuild the domains */
  2011 + partition_sched_domains(ndoms, doms, attr);
  2012 +
1962 2013 return NOTIFY_OK;
1963 2014 }
1964 2015  
1965 2016 #ifdef CONFIG_MEMORY_HOTPLUG
1966 2017 /*
1967 2018 * Keep top_cpuset.mems_allowed tracking node_states[N_HIGH_MEMORY].
1968   - * Call this routine anytime after you change
1969   - * node_states[N_HIGH_MEMORY].
1970   - * See also the previous routine cpuset_handle_cpuhp().
  2019 + * Call this routine anytime after node_states[N_HIGH_MEMORY] changes.
  2020 + * See also the previous routine cpuset_track_online_cpus().
1971 2021 */
1972   -
1973 2022 void cpuset_track_online_nodes(void)
1974 2023 {
1975   - common_cpu_mem_hotplug_unplug(0);
  2024 + cgroup_lock();
  2025 + top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
  2026 + scan_for_empty_cpusets(&top_cpuset);
  2027 + cgroup_unlock();
1976 2028 }
1977 2029 #endif
1978 2030  
... ... @@ -1987,7 +2039,7 @@
1987 2039 top_cpuset.cpus_allowed = cpu_online_map;
1988 2040 top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
1989 2041  
1990   - hotcpu_notifier(cpuset_handle_cpuhp, 0);
  2042 + hotcpu_notifier(cpuset_track_online_cpus, 0);
1991 2043 }
1992 2044  
1993 2045 /**