Commit 8932a63d5edb02f714d50c26583152fe0a97a69c
Parent: d8169d4c36
rcu: Reduce cache-miss initialization latencies for large systems
Commit #0209f649 (rcu: limit rcu_node leaf-level fanout) set an upper limit of 16 on the leaf-level fanout for the rcu_node tree. This was needed to reduce lock contention that was induced by the synchronization of scheduling-clock interrupts, which was in turn needed to improve energy efficiency for moderate-sized lightly loaded servers.

However, reducing the leaf-level fanout means that there are more leaf-level rcu_node structures in the tree, which in turn means that RCU's grace-period initialization incurs more cache misses. This is not a problem on moderate-sized servers with only a few tens of CPUs, but becomes a major source of real-time latency spikes on systems with many hundreds of CPUs. In addition, the workloads running on these large systems tend to be CPU-bound, which eliminates the energy-efficiency advantages of synchronizing scheduling-clock interrupts. Therefore, these systems need maximal values for the rcu_node leaf-level fanout.

This commit addresses this problem by introducing a new kernel configuration parameter named RCU_FANOUT_LEAF that directly controls the leaf-level fanout. This parameter defaults to 16 to handle the common case of moderate-sized lightly loaded servers, but may be set higher on larger systems.

Reported-by: Mike Galbraith <efault@gmx.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Showing 3 changed files with 31 additions and 8 deletions
init/Kconfig
@@ -458,6 +458,33 @@
 	  Select a specific number if testing RCU itself.
 	  Take the default if unsure.
 
+config RCU_FANOUT_LEAF
+	int "Tree-based hierarchical RCU leaf-level fanout value"
+	range 2 RCU_FANOUT if 64BIT
+	range 2 RCU_FANOUT if !64BIT
+	depends on TREE_RCU || TREE_PREEMPT_RCU
+	default 16
+	help
+	  This option controls the leaf-level fanout of hierarchical
+	  implementations of RCU, and allows trading off cache misses
+	  against lock contention.  Systems that synchronize their
+	  scheduling-clock interrupts for energy-efficiency reasons will
+	  want the default because the smaller leaf-level fanout keeps
+	  lock contention levels acceptably low.  Very large systems
+	  (hundreds or thousands of CPUs) will instead want to set this
+	  value to the maximum value possible in order to reduce the
+	  number of cache misses incurred during RCU's grace-period
+	  initialization.  These systems tend to run CPU-bound, and thus
+	  are not helped by synchronized interrupts, and thus tend to
+	  skew them, which reduces lock contention enough that large
+	  leaf-level fanouts work well.
+
+	  Select a specific number if testing RCU itself.
+
+	  Select the maximum permissible value for large systems.
+
+	  Take the default if unsure.
+
 config RCU_FANOUT_EXACT
 	bool "Disable tree-based hierarchical RCU auto-balancing"
 	depends on TREE_RCU || TREE_PREEMPT_RCU
kernel/rcutree.c
@@ -2418,7 +2418,7 @@
 
 	for (i = NUM_RCU_LVLS - 1; i > 0; i--)
 		rsp->levelspread[i] = CONFIG_RCU_FANOUT;
-	rsp->levelspread[0] = RCU_FANOUT_LEAF;
+	rsp->levelspread[0] = CONFIG_RCU_FANOUT_LEAF;
 }
 #else /* #ifdef CONFIG_RCU_FANOUT_EXACT */
 static void __init rcu_init_levelspread(struct rcu_state *rsp)
kernel/rcutree.h
@@ -29,18 +29,14 @@
 #include <linux/seqlock.h>
 
 /*
- * Define shape of hierarchy based on NR_CPUS and CONFIG_RCU_FANOUT.
+ * Define shape of hierarchy based on NR_CPUS, CONFIG_RCU_FANOUT, and
+ * CONFIG_RCU_FANOUT_LEAF.
  * In theory, it should be possible to add more levels straightforwardly.
  * In practice, this did work well going from three levels to four.
  * Of course, your mileage may vary.
  */
 #define MAX_RCU_LVLS 4
-#if CONFIG_RCU_FANOUT > 16
-#define RCU_FANOUT_LEAF 16
-#else /* #if CONFIG_RCU_FANOUT > 16 */
-#define RCU_FANOUT_LEAF (CONFIG_RCU_FANOUT)
-#endif /* #else #if CONFIG_RCU_FANOUT > 16 */
-#define RCU_FANOUT_1 (RCU_FANOUT_LEAF)
+#define RCU_FANOUT_1 (CONFIG_RCU_FANOUT_LEAF)
 #define RCU_FANOUT_2 (RCU_FANOUT_1 * CONFIG_RCU_FANOUT)
 #define RCU_FANOUT_3 (RCU_FANOUT_2 * CONFIG_RCU_FANOUT)
 #define RCU_FANOUT_4 (RCU_FANOUT_3 * CONFIG_RCU_FANOUT)