Commit 10f39042711ba21773763f267b4943a2c66c8bef

Authored by Rik van Riel
Committed by Ingo Molnar
1 parent 20e07dea28

sched/numa, mm: Use active_nodes nodemask to limit numa migrations

Use the active_nodes nodemask to make smarter decisions on NUMA migrations.

In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:

  1) keep private memory local to each thread

  2) avoid excessive NUMA migration of pages

  3) distribute shared memory across the active nodes, to
     maximize memory bandwidth available to the workload

This patch accomplishes that by implementing the following policy for
NUMA migrations:

  1) always migrate on a private fault

  2) never migrate to a node that is not in the set of active nodes
     for the numa_group

  3) always migrate from a node outside of the set of active nodes,
     to a node that is in that set

  4) within the set of active nodes in the numa_group, only migrate
     from a node with more NUMA page faults, to a node with fewer
     NUMA page faults, with a 25% margin to avoid ping-ponging

This results in most pages of a workload ending up on the actively
used nodes, with reduced ping-ponging of pages between those nodes.

Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Link: http://lkml.kernel.org/r/1390860228-21539-6-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Showing 3 changed files with 71 additions and 28 deletions
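Editor's note: the four policy rules in the commit message map directly onto the new should_numa_migrate_memory() helper added to kernel/sched/fair.c below (the kernel version also keeps the existing two-stage cpupid filter in front of these checks). As a rough, self-contained userspace sketch of the same decision order, with hypothetical fake_numa_group and sketch_should_migrate() names and plain arrays standing in for the kernel's nodemask and fault statistics, not the kernel code itself:

#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 8

/* Hypothetical stand-ins for the numa_group state the patch consults:
 * an active-node set plus per-node NUMA fault counts. */
struct fake_numa_group {
	bool active[MAX_NODES];
	unsigned long faults[MAX_NODES];
};

/* Same decision order as the commit message: private fault first,
 * then the active-node checks, then the 25% hysteresis. */
static bool sketch_should_migrate(const struct fake_numa_group *ng,
				  bool private_fault, int src, int dst)
{
	if (private_fault)	/* 1) always migrate on a private fault */
		return true;
	if (!ng->active[dst])	/* 2) never migrate to a node outside the active set */
		return false;
	if (!ng->active[src])	/* 3) always pull pages into the active set */
		return true;
	/* 4) between two active nodes, migrate only toward the one with
	 *    clearly fewer faults (25% margin against ping-ponging). */
	return ng->faults[dst] < ng->faults[src] * 3 / 4;
}

int main(void)
{
	struct fake_numa_group ng = {
		.active = { [0] = true, [1] = true },	/* nodes 0 and 1 active */
		.faults = { [0] = 1000, [1] = 900, [2] = 50 },
	};

	printf("private fault       -> %d\n", sketch_should_migrate(&ng, true, 1, 2));	/* 1, rule 1 */
	printf("shared, dst node 2  -> %d\n", sketch_should_migrate(&ng, false, 0, 2));	/* 0, rule 2 */
	printf("shared, node 2 -> 0 -> %d\n", sketch_should_migrate(&ng, false, 2, 0));	/* 1, rule 3 */
	return 0;
}

The kernel implementation in the diff below follows the same order, but derives the private/shared distinction and the fault counts from the cpupid stored in the page and from the per-numa_group statistics.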

include/linux/sched.h
@@ -1589,6 +1589,8 @@
 extern pid_t task_numa_group_id(struct task_struct *p);
 extern void set_numabalancing_state(bool enabled);
 extern void task_numa_free(struct task_struct *p);
+extern bool should_numa_migrate_memory(struct task_struct *p, struct page *page,
+				       int src_nid, int dst_cpu);
 #else
 static inline void task_numa_fault(int last_node, int node, int pages,
 				   int flags)
@@ -1603,6 +1605,11 @@
 }
 static inline void task_numa_free(struct task_struct *p)
 {
+}
+static inline bool should_numa_migrate_memory(struct task_struct *p,
+					      struct page *page, int src_nid, int dst_cpu)
+{
+	return true;
 }
 #endif
 
kernel/sched/fair.c
@@ -954,6 +954,69 @@
 	return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
 }
 
+bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
+				int src_nid, int dst_cpu)
+{
+	struct numa_group *ng = p->numa_group;
+	int dst_nid = cpu_to_node(dst_cpu);
+	int last_cpupid, this_cpupid;
+
+	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
+
+	/*
+	 * Multi-stage node selection is used in conjunction with a periodic
+	 * migration fault to build a temporal task<->page relation. By using
+	 * a two-stage filter we remove short/unlikely relations.
+	 *
+	 * Using P(p) ~ n_p / n_t as per frequentist probability, we can equate
+	 * a task's usage of a particular page (n_p) per total usage of this
+	 * page (n_t) (in a given time-span) to a probability.
+	 *
+	 * Our periodic faults will sample this probability and getting the
+	 * same result twice in a row, given these samples are fully
+	 * independent, is then given by P(n)^2, provided our sample period
+	 * is sufficiently short compared to the usage pattern.
+	 *
+	 * This quadric squishes small probabilities, making it less likely we
+	 * act on an unlikely task<->page relation.
+	 */
+	last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
+	if (!cpupid_pid_unset(last_cpupid) &&
+			cpupid_to_nid(last_cpupid) != dst_nid)
+		return false;
+
+	/* Always allow migrate on private faults */
+	if (cpupid_match_pid(p, last_cpupid))
+		return true;
+
+	/* A shared fault, but p->numa_group has not been set up yet. */
+	if (!ng)
+		return true;
+
+	/*
+	 * Do not migrate if the destination is not a node that
+	 * is actively used by this numa group.
+	 */
+	if (!node_isset(dst_nid, ng->active_nodes))
+		return false;
+
+	/*
+	 * Source is a node that is not actively used by this
+	 * numa group, while the destination is. Migrate.
+	 */
+	if (!node_isset(src_nid, ng->active_nodes))
+		return true;
+
+	/*
+	 * Both source and destination are nodes in active
+	 * use by this numa group. Maximize memory bandwidth
+	 * by migrating from more heavily used groups, to less
+	 * heavily used ones, spreading the load around.
+	 * Use a 1/4 hysteresis to avoid spurious page movement.
+	 */
+	return group_faults(p, dst_nid) < (group_faults(p, src_nid) * 3 / 4);
+}
+
 static unsigned long weighted_cpuload(const int cpu);
 static unsigned long source_load(int cpu, int type);
 static unsigned long target_load(int cpu, int type);
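Editor's note, for intuition on the two-stage filter comment that moves into should_numa_migrate_memory() above: if a task accounts for a fraction P of a page's accesses, then under the comment's independence assumption the chance that two consecutive fault samples both attribute the page to that task is roughly P^2, so weak task<->page relations rarely survive both stages. A throwaway illustration of how squaring suppresses small probabilities (not kernel code):

#include <stdio.h>

int main(void)
{
	/* P: a task's share of all accesses to a page.
	 * P^2: chance two consecutive fault samples both hit that task. */
	const double shares[] = { 0.9, 0.5, 0.1, 0.01 };

	for (int i = 0; i < 4; i++)
		printf("P = %.2f -> P^2 = %.4f\n", shares[i], shares[i] * shares[i]);
	/* 0.90 -> 0.8100, 0.50 -> 0.2500, 0.10 -> 0.0100, 0.01 -> 0.0001:
	 * strong task<->page relations still pass, weak ones almost never
	 * cause a migration. */
	return 0;
}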
mm/mempolicy.c
@@ -2377,37 +2377,10 @@
 
 	/* Migrate the page towards the node whose CPU is referencing it */
 	if (pol->flags & MPOL_F_MORON) {
-		int last_cpupid;
-		int this_cpupid;
-
 		polnid = thisnid;
-		this_cpupid = cpu_pid_to_cpupid(thiscpu, current->pid);
 
-		/*
-		 * Multi-stage node selection is used in conjunction
-		 * with a periodic migration fault to build a temporal
-		 * task<->page relation. By using a two-stage filter we
-		 * remove short/unlikely relations.
-		 *
-		 * Using P(p) ~ n_p / n_t as per frequentist
-		 * probability, we can equate a task's usage of a
-		 * particular page (n_p) per total usage of this
-		 * page (n_t) (in a given time-span) to a probability.
-		 *
-		 * Our periodic faults will sample this probability and
-		 * getting the same result twice in a row, given these
-		 * samples are fully independent, is then given by
-		 * P(n)^2, provided our sample period is sufficiently
-		 * short compared to the usage pattern.
-		 *
-		 * This quadric squishes small probabilities, making
-		 * it less likely we act on an unlikely task<->page
-		 * relation.
-		 */
-		last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
-		if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
+		if (!should_numa_migrate_memory(current, page, curnid, thiscpu))
 			goto out;
-		}
 	}
 
 	if (curnid != polnid)
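
Editor's note, a quick sanity check on the 1/4 hysteresis in rule 4, which the fair.c code applies when both source and destination are in the active set: with made-up group fault counts, two similarly loaded nodes refuse to exchange pages in either direction, so the ping-ponging the commit message mentions cannot start. A minimal illustration (hypothetical migrates() helper, illustrative numbers only):

#include <stdio.h>

/* Rule 4 from the patch: migrate only if the destination node has fewer
 * than 3/4 of the source node's group faults. */
static int migrates(unsigned long src_faults, unsigned long dst_faults)
{
	return dst_faults < src_faults * 3 / 4;
}

int main(void)
{
	/* Two active nodes with similar fault counts: neither direction
	 * clears the 25% margin, so the pages stay where they are. */
	printf("1000 -> 800:  %d\n", migrates(1000, 800));	/* 0: 800 >= 750 */
	printf(" 800 -> 1000: %d\n", migrates(800, 1000));	/* 0: 1000 >= 600 */

	/* A clearly colder destination does clear the margin. */
	printf("1000 -> 700:  %d\n", migrates(1000, 700));	/* 1: 700 < 750 */
	return 0;
}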