commit 68860ec10bcc07ab4f89f9d940e3b77ae5ca13b3
parent fb5eeeee44
Author: Paul Jackson <pj@sgi.com>
Commit: Linus Torvalds <torvalds@osdl.org>

[PATCH] cpusets: automatic numa mempolicy rebinding

This patch automatically updates a task's NUMA mempolicy when its cpuset
memory placement changes.  It does so within the context of the task,
without any need to support low-level external mempolicy manipulation.

If a system is not using cpusets, or is running with just the root
(all-encompassing) cpuset, then this remap is a no-op.  The following
applies only when a task is moved between cpusets, or when a cpuset's
memory placement is changed; otherwise, the main routine below,
rebind_policy(), is never called.

When mixing cpusets, scheduler affinity, and NUMA mempolicies, the
essential role of cpusets is to place jobs (several related tasks) on a set
of CPUs and Memory Nodes, the essential role of sched_setaffinity is to
manage a job's processor placement within its allowed cpuset, and the
essential role of NUMA mempolicy (mbind, set_mempolicy) is to manage a
job's memory placement within its allowed cpuset.

However, CPU affinity and NUMA memory placement are managed within the
kernel using absolute system-wide numbering, not cpuset-relative numbering.

This is fine until a job is migrated to a different cpuset, or, what
amounts to the same thing, a job's cpuset is moved to different CPUs and
Memory Nodes.

Then the CPU affinity and NUMA memory placement of the tasks in the job
need to be updated to preserve their cpuset-relative position.  This can
be done for CPU affinity using sched_setaffinity() from user code, since
one task can modify another's CPU affinity.  It cannot be done from an
external task for NUMA memory placement, as that can only be modified in
the context of the task using it.
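For example, an external job migrator could fix up another task's CPU
affinity along these lines (a hedged sketch, not part of this patch; the
single-CPU old-to-new swap is a deliberate simplification of the
n-th-set-bit remap discussed below):

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sched.h>

    /* Move 'pid' from 'oldcpu' to 'newcpu' in its affinity mask. */
    int migrate_task_affinity(pid_t pid, int oldcpu, int newcpu)
    {
        cpu_set_t mask;

        if (sched_getaffinity(pid, sizeof(mask), &mask) < 0)
            return -1;
        if (CPU_ISSET(oldcpu, &mask)) {
            CPU_CLR(oldcpu, &mask);
            CPU_SET(newcpu, &mask);
        }
        return sched_setaffinity(pid, sizeof(mask), &mask);
    }

No equivalent external fixup exists for mempolicies, hence the in-context
rebind below.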

However, it is easy enough to remap a task's NUMA mempolicy automatically
when a task is migrated, using the existing cpuset mechanism to trigger a
refresh of a task's memory placement after its cpuset has changed.  All
that is needed is the old and new nodemasks, and notice to the task that it
needs to rebind its mempolicy.  The task's mems_allowed has the old mask,
the task's cpuset has the new mask, and the existing
cpuset_update_current_mems_allowed() mechanism provides the notice.  The
bitmap/cpumask/nodemask remap operators provide the cpuset-relative
calculations.
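The remap semantics can be modeled in a few lines of plain C (a user-space
sketch of the kernel's node_remap(), simplified to a single unsigned long
per mask): the n-th set bit of the old mask maps to the (n mod w)-th set
bit of the new mask, where w is the weight of the new mask, and positions
not in the old mask map to themselves:

    /* Model of node_remap(): where does 'oldbit' land when the set
     * bits of 'old' are mapped, in order, onto the set bits of 'new'?
     */
    static int mask_remap(int oldbit, unsigned long old, unsigned long new)
    {
        int i, n = -1, w = __builtin_popcountl(new);

        if (!(old & (1UL << oldbit)) || w == 0)
            return oldbit;              /* identity map */
        /* Ordinal of oldbit among the set bits of old. */
        for (i = 0; i <= oldbit; i++)
            if (old & (1UL << i))
                n++;
        n %= w;
        /* Position of the n-th set bit of new. */
        for (i = 0; i < 64; i++)
            if ((new & (1UL << i)) && n-- == 0)
                return i;
        return oldbit;                  /* not reached: w > 0, n < w */
    }

With old = nodes 4-7 and new = nodes 12-15, oldbit 5 maps to 13, preserving
its cpuset-relative position.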

This patch leaves open a couple of issues:

 1) Updating vma and shmfs/tmpfs/hugetlbfs memory policies:

    These mempolicies may reference nodes outside of those allowed to
    the current task by its cpuset.  Tasks are migrated as part of jobs,
    which reside on what might be several cpusets in a subtree.  When such
    a job is migrated, all NUMA memory policy references to nodes within
    that cpuset subtree should be translated, and references to any nodes
    outside that subtree should be left untouched.  A future patch will
    provide the cpuset mechanism needed to mark such subtrees.  With that
    patch, we will be able to correctly migrate these other memory policies
    across a job migration.

 2) Updating cpuset, affinity and memory policies in user space:

    This is harder.  Any placement state stored in user space using
    system-wide numbering will be invalidated across a migration.  More
    work will be required to provide user code with a migration-safe means
    to manage its cpuset-relative placement, while preserving the current
    APIs that pass system-wide numbers, not cpuset-relative numbers, across
    the kernel-user boundary.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

 include/linux/mempolicy.h |  6 ++++++
 kernel/cpuset.c           |  4 ++++
 mm/mempolicy.c            | 64 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 74 insertions(+), 0 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -154,6 +154,7 @@
 
 extern void numa_default_policy(void);
 extern void numa_policy_init(void);
+extern void numa_policy_rebind(const nodemask_t *old, const nodemask_t *new);
 extern struct mempolicy default_policy;
 
 #else
@@ -223,6 +224,11 @@
 }
 
 static inline void numa_default_policy(void)
+{
+}
+
+static inline void numa_policy_rebind(const nodemask_t *old,
+                        const nodemask_t *new)
 {
 }
 
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -32,6 +32,7 @@
 #include <linux/kernel.h>
 #include <linux/kmod.h>
 #include <linux/list.h>
+#include <linux/mempolicy.h>
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/mount.h>
@@ -600,6 +601,7 @@
 
        if (current->cpuset_mems_generation != my_cpusets_mem_gen) {
                struct cpuset *cs;
+               nodemask_t oldmem = current->mems_allowed;
 
                down(&callback_sem);
                task_lock(current);
@@ -608,6 +610,8 @@
                current->cpuset_mems_generation = cs->mems_generation;
                task_unlock(current);
                up(&callback_sem);
+               if (!nodes_equal(oldmem, current->mems_allowed))
+                       numa_policy_rebind(&oldmem, &current->mems_allowed);
        }
 }
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -457,6 +457,7 @@
        struct vm_area_struct *vma = NULL;
        struct mempolicy *pol = current->mempolicy;
 
+       cpuset_update_current_mems_allowed();
        if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
                return -EINVAL;
        if (flags & MPOL_F_ADDR) {
@@ -1205,5 +1206,68 @@
 void numa_default_policy(void)
 {
        do_set_mempolicy(MPOL_DEFAULT, NULL);
+}
+
+/* Migrate a policy to a different set of nodes */
+static void rebind_policy(struct mempolicy *pol, const nodemask_t *old,
+                                       const nodemask_t *new)
+{
+       nodemask_t tmp;
+
+       if (!pol)
+               return;
+
+       switch (pol->policy) {
+       case MPOL_DEFAULT:
+               break;
+       case MPOL_INTERLEAVE:
+               nodes_remap(tmp, pol->v.nodes, *old, *new);
+               pol->v.nodes = tmp;
+               current->il_next = node_remap(current->il_next, *old, *new);
+               break;
+       case MPOL_PREFERRED:
+               pol->v.preferred_node = node_remap(pol->v.preferred_node,
+                                               *old, *new);
+               break;
+       case MPOL_BIND: {
+               nodemask_t nodes;
+               struct zone **z;
+               struct zonelist *zonelist;
+
+               nodes_clear(nodes);
+               for (z = pol->v.zonelist->zones; *z; z++)
+                       node_set((*z)->zone_pgdat->node_id, nodes);
+               nodes_remap(tmp, nodes, *old, *new);
+               nodes = tmp;
+
+               zonelist = bind_zonelist(&nodes);
+
+               /* If no mem, then zonelist is NULL and we keep old zonelist.
+                * If that old zonelist has no remaining mems_allowed nodes,
+                * then zonelist_policy() will "FALL THROUGH" to MPOL_DEFAULT.
+                */
+
+               if (zonelist) {
+                       /* Good - got mem - substitute new zonelist */
+                       kfree(pol->v.zonelist);
+                       pol->v.zonelist = zonelist;
+               }
+               break;
+       }
+       default:
+               BUG();
+               break;
+       }
+}
+
+/*
+ * Someone moved this task to different nodes.  Fixup mempolicies.
+ *
+ * TODO - fixup current->mm->vma and shmfs/tmpfs/hugetlbfs policies as well,
+ * once we have a cpuset mechanism to mark which cpuset subtree is migrating.
+ */
+void numa_policy_rebind(const nodemask_t *old, const nodemask_t *new)
+{
+       rebind_policy(current->mempolicy, old, new);
 }