Commit 10f39042711ba21773763f267b4943a2c66c8bef

Authored by Rik van Riel
Committed by Ingo Molnar
1 parent 20e07dea28

sched/numa, mm: Use active_nodes nodemask to limit numa migrations

Use the active_nodes nodemask to make smarter decisions on NUMA migrations.

In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:

  1) keep private memory local to each thread

  2) avoid excessive NUMA migration of pages

  3) distribute shared memory across the active nodes, to
     maximize memory bandwidth available to the workload

This patch accomplishes that by implementing the following policy for
NUMA migrations:

  1) always migrate on a private fault

  2) never migrate to a node that is not in the set of active nodes
     for the numa_group

  3) always migrate from a node outside of the set of active nodes,
     to a node that is in that set

  4) within the set of active nodes in the numa_group, only migrate
     from a node with more NUMA page faults, to a node with fewer
     NUMA page faults, with a 25% margin to avoid ping-ponging

This results in most pages of a workload ending up on the actively
used nodes, with reduced ping-ponging of pages between those nodes.

Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Link: http://lkml.kernel.org/r/1390860228-21539-6-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Showing 3 changed files with 71 additions and 28 deletions
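Editor's note: the four policy rules in the commit message map directly onto the new should_numa_migrate_memory() helper added to kernel/sched/fair.c below (the kernel version also keeps the existing two-stage cpupid filter in front of these checks). As a rough, self-contained userspace sketch of the same decision order, with hypothetical fake_numa_group and sketch_should_migrate() names and plain arrays standing in for the kernel's nodemask and fault statistics, not the kernel code itself:

#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 8

/* Hypothetical stand-ins for the numa_group state the patch consults:
 * an active-node set plus per-node NUMA fault counts. */
struct fake_numa_group {
	bool active[MAX_NODES];
	unsigned long faults[MAX_NODES];
};

/* Same decision order as the commit message: private fault first,
 * then the active-node checks, then the 25% hysteresis. */
static bool sketch_should_migrate(const struct fake_numa_group *ng,
				  bool private_fault, int src, int dst)
{
	if (private_fault)	/* 1) always migrate on a private fault */
		return true;
	if (!ng->active[dst])	/* 2) never migrate to a node outside the active set */
		return false;
	if (!ng->active[src])	/* 3) always pull pages into the active set */
		return true;
	/* 4) between two active nodes, migrate only toward the one with
	 *    clearly fewer faults (25% margin against ping-ponging). */
	return ng->faults[dst] < ng->faults[src] * 3 / 4;
}

int main(void)
{
	struct fake_numa_group ng = {
		.active = { [0] = true, [1] = true },	/* nodes 0 and 1 active */
		.faults = { [0] = 1000, [1] = 900, [2] = 50 },
	};

	printf("private fault       -> %d\n", sketch_should_migrate(&ng, true, 1, 2));	/* 1, rule 1 */
	printf("shared, dst node 2  -> %d\n", sketch_should_migrate(&ng, false, 0, 2));	/* 0, rule 2 */
	printf("shared, node 2 -> 0 -> %d\n", sketch_should_migrate(&ng, false, 2, 0));	/* 1, rule 3 */
	return 0;
}

The kernel implementation in the diff below follows the same order, but derives the private/shared distinction and the fault counts from the cpupid stored in the page and from the per-numa_group statistics.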

include/linux/sched.h
@@ -1589,6 +1589,8 @@
 extern pid_t task_numa_group_id(struct task_struct *p);
 extern void set_numabalancing_state(bool enabled);
 extern void task_numa_free(struct task_struct *p);
+extern bool should_numa_migrate_memory(struct task_struct *p, struct page *page,
+				       int src_nid, int dst_cpu);
 #else
 static inline void task_numa_fault(int last_node, int node, int pages,
 				   int flags)
@@ -1603,6 +1605,11 @@
 }
 static inline void task_numa_free(struct task_struct *p)
 {
+}
+static inline bool should_numa_migrate_memory(struct task_struct *p,
+					      struct page *page, int src_nid, int dst_cpu)
+{
+	return true;
 }
 #endif
 
kernel/sched/fair.c
@@ -954,6 +954,69 @@
 	return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
 }
 
+bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
+				int src_nid, int dst_cpu)
+{
+	struct numa_group *ng = p->numa_group;
+	int dst_nid = cpu_to_node(dst_cpu);
+	int last_cpupid, this_cpupid;
+
+	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
+
+	/*
+	 * Multi-stage node selection is used in conjunction with a periodic
+	 * migration fault to build a temporal task<->page relation. By using
+	 * a two-stage filter we remove short/unlikely relations.
+	 *
+	 * Using P(p) ~ n_p / n_t as per frequentist probability, we can equate
+	 * a task's usage of a particular page (n_p) per total usage of this
+	 * page (n_t) (in a given time-span) to a probability.
+	 *
+	 * Our periodic faults will sample this probability and getting the
+	 * same result twice in a row, given these samples are fully
+	 * independent, is then given by P(n)^2, provided our sample period
+	 * is sufficiently short compared to the usage pattern.
+	 *
+	 * This quadric squishes small probabilities, making it less likely we
+	 * act on an unlikely task<->page relation.
+	 */
+	last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
+	if (!cpupid_pid_unset(last_cpupid) &&
+			cpupid_to_nid(last_cpupid) != dst_nid)
+		return false;
+
+	/* Always allow migrate on private faults */
+	if (cpupid_match_pid(p, last_cpupid))
+		return true;
+
+	/* A shared fault, but p->numa_group has not been set up yet. */
+	if (!ng)
+		return true;
+
+	/*
+	 * Do not migrate if the destination is not a node that
+	 * is actively used by this numa group.
+	 */
+	if (!node_isset(dst_nid, ng->active_nodes))
+		return false;
+
+	/*
+	 * Source is a node that is not actively used by this
+	 * numa group, while the destination is. Migrate.
+	 */
+	if (!node_isset(src_nid, ng->active_nodes))
+		return true;
+
+	/*
+	 * Both source and destination are nodes in active
+	 * use by this numa group. Maximize memory bandwidth
+	 * by migrating from more heavily used groups, to less
+	 * heavily used ones, spreading the load around.
+	 * Use a 1/4 hysteresis to avoid spurious page movement.
+	 */
+	return group_faults(p, dst_nid) < (group_faults(p, src_nid) * 3 / 4);
+}
+
 static unsigned long weighted_cpuload(const int cpu);
 static unsigned long source_load(int cpu, int type);
 static unsigned long target_load(int cpu, int type);
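Editor's note, for intuition on the two-stage filter comment that moves into should_numa_migrate_memory() above: if a task accounts for a fraction P of a page's accesses, then under the comment's independence assumption the chance that two consecutive fault samples both attribute the page to that task is roughly P^2, so weak task<->page relations rarely survive both stages. A throwaway illustration of how squaring suppresses small probabilities (not kernel code):

#include <stdio.h>

int main(void)
{
	/* P: a task's share of all accesses to a page.
	 * P^2: chance two consecutive fault samples both hit that task. */
	const double shares[] = { 0.9, 0.5, 0.1, 0.01 };

	for (int i = 0; i < 4; i++)
		printf("P = %.2f -> P^2 = %.4f\n", shares[i], shares[i] * shares[i]);
	/* 0.90 -> 0.8100, 0.50 -> 0.2500, 0.10 -> 0.0100, 0.01 -> 0.0001:
	 * strong task<->page relations still pass, weak ones almost never
	 * cause a migration. */
	return 0;
}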
mm/mempolicy.c
@@ -2377,37 +2377,10 @@
 
 	/* Migrate the page towards the node whose CPU is referencing it */
 	if (pol->flags & MPOL_F_MORON) {
-		int last_cpupid;
-		int this_cpupid;
-
 		polnid = thisnid;
-		this_cpupid = cpu_pid_to_cpupid(thiscpu, current->pid);
 
-		/*
-		 * Multi-stage node selection is used in conjunction
-		 * with a periodic migration fault to build a temporal
-		 * task<->page relation. By using a two-stage filter we
-		 * remove short/unlikely relations.
-		 *
-		 * Using P(p) ~ n_p / n_t as per frequentist
-		 * probability, we can equate a task's usage of a
-		 * particular page (n_p) per total usage of this
-		 * page (n_t) (in a given time-span) to a probability.
-		 *
-		 * Our periodic faults will sample this probability and
-		 * getting the same result twice in a row, given these
-		 * samples are fully independent, is then given by
-		 * P(n)^2, provided our sample period is sufficiently
-		 * short compared to the usage pattern.
-		 *
-		 * This quadric squishes small probabilities, making
-		 * it less likely we act on an unlikely task<->page
-		 * relation.
-		 */
-		last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
-		if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
+		if (!should_numa_migrate_memory(current, page, curnid, thiscpu))
 			goto out;
-		}
 	}
 
 	if (curnid != polnid)
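
Editor's note, a quick sanity check on the 1/4 hysteresis in rule 4, which the fair.c code applies when both source and destination are in the active set: with made-up group fault counts, two similarly loaded nodes refuse to exchange pages in either direction, so the ping-ponging the commit message mentions cannot start. A minimal illustration (hypothetical migrates() helper, illustrative numbers only):

#include <stdio.h>

/* Rule 4 from the patch: migrate only if the destination node has fewer
 * than 3/4 of the source node's group faults. */
static int migrates(unsigned long src_faults, unsigned long dst_faults)
{
	return dst_faults < src_faults * 3 / 4;
}

int main(void)
{
	/* Two active nodes with similar fault counts: neither direction
	 * clears the 25% margin, so the pages stay where they are. */
	printf("1000 -> 800:  %d\n", migrates(1000, 800));	/* 0: 800 >= 750 */
	printf(" 800 -> 1000: %d\n", migrates(800, 1000));	/* 0: 1000 >= 600 */

	/* A clearly colder destination does clear the margin. */
	printf("1000 -> 700:  %d\n", migrates(1000, 700));	/* 1: 700 < 750 */
	return 0;
}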