Commit f1d507beeab1d1d60a1c58eac7dc81522c6f4629
1 parent: d21670acab
rcu: improve the RCU CPU-stall warning documentation
The existing Documentation/RCU/stallwarn.txt has proven unhelpful, so rework it a bit. In particular, show how to interpret the stall-warning messages.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Showing 1 changed file with 71 additions and 23 deletions
Documentation/RCU/stallwarn.txt
... | ... | @@ -3,57 +3,105 @@ |
3 | 3 | The CONFIG_RCU_CPU_STALL_DETECTOR kernel config parameter enables |
4 | 4 | RCU's CPU stall detector, which detects conditions that unduly delay |
5 | 5 | RCU grace periods. The stall detector's idea of what constitutes |
6 | -"unduly delayed" is controlled by a pair of C preprocessor macros: | |
6 | +"unduly delayed" is controlled by a set of C preprocessor macros: | |
7 | 7 | |
8 | 8 | RCU_SECONDS_TILL_STALL_CHECK |
9 | 9 | |
10 | 10 | This macro defines the period of time that RCU will wait from |
11 | 11 | the beginning of a grace period until it issues an RCU CPU |
12 | - stall warning. It is normally ten seconds. | |
12 | + stall warning. This time period is normally ten seconds. | |
13 | 13 | |
14 | 14 | RCU_SECONDS_TILL_STALL_RECHECK |
15 | 15 | |
16 | 16 | This macro defines the period of time that RCU will wait after |
17 | - issuing a stall warning until it issues another stall warning. | |
18 | - It is normally set to thirty seconds. | |
17 | + issuing a stall warning until it issues another stall warning | |
18 | + for the same stall. This time period is normally set to thirty | |
19 | + seconds. | |
19 | 20 | |
20 | 21 | RCU_STALL_RAT_DELAY |
21 | 22 | |
22 | - The CPU stall detector tries to make the offending CPU rat on itself, | |
23 | - as this often gives better-quality stack traces. However, if | |
24 | - the offending CPU does not detect its own stall in the number | |
25 | - of jiffies specified by RCU_STALL_RAT_DELAY, then other CPUs will | |
26 | - complain. This is normally set to two jiffies. | |
23 | + The CPU stall detector tries to make the offending CPU print its | |
24 | + own warnings, as this often gives better-quality stack traces. | |
25 | + However, if the offending CPU does not detect its own stall in | |
26 | + the number of jiffies specified by RCU_STALL_RAT_DELAY, then | |
27 | + some other CPU will complain. This delay is normally set to | |
28 | + two jiffies. | |
27 | 29 | |
28 | -The following problems can result in an RCU CPU stall warning: | |
30 | +When a CPU detects that it is stalling, it will print a message similar | |
31 | +to the following: | |
29 | 32 | |
33 | +INFO: rcu_sched_state detected stall on CPU 5 (t=2500 jiffies) | |
34 | + | |
35 | +This message indicates that CPU 5 detected that it was causing a stall, | |
36 | +and that the stall was affecting RCU-sched. This message will normally be | |
37 | +followed by a stack dump of the offending CPU. On TREE_RCU kernel builds, | |
38 | +RCU and RCU-sched are implemented by the same underlying mechanism, | |
39 | +while on TREE_PREEMPT_RCU kernel builds, RCU is instead implemented | |
40 | +by rcu_preempt_state. | |
41 | + | |
42 | +On the other hand, if the offending CPU fails to print out a stall-warning | |
43 | +message quickly enough, some other CPU will print a message similar to | |
44 | +the following: | |
45 | + | |
46 | +INFO: rcu_bh_state detected stalls on CPUs/tasks: { 3 5 } (detected by 2, 2502 jiffies) | |
47 | + | |
48 | +This message indicates that CPU 2 detected that CPUs 3 and 5 were both | |
49 | +causing stalls, and that the stall was affecting RCU-bh. This message | |
50 | +will normally be followed by stack dumps for each CPU. Please note that | |
51 | +TREE_PREEMPT_RCU builds can be stalled by tasks as well as by CPUs, | |
52 | +and that the tasks will be indicated by PID, for example, "P3421". | |
53 | +It is even possible for a rcu_preempt_state stall to be caused by both | |
54 | +CPUs -and- tasks, in which case the offending CPUs and tasks will all | |
55 | +be called out in the list. | |
56 | + | |
57 | +Finally, if the grace period ends just as the stall warning starts | |
58 | +printing, there will be a spurious stall-warning message: | |
59 | + | |
60 | +INFO: rcu_bh_state detected stalls on CPUs/tasks: { } (detected by 4, 2502 jiffies) | |
61 | + | |
62 | +This is rare, but does happen from time to time in real life. | |
63 | + | |
64 | +So your kernel printed an RCU CPU stall warning. The next question is | |
65 | +"What caused it?" The following problems can result in RCU CPU stall | |
66 | +warnings: | |
67 | + | |
30 | 68 | o A CPU looping in an RCU read-side critical section. |
31 | 69 | |
32 | -o A CPU looping with interrupts disabled. | |
70 | +o A CPU looping with interrupts disabled. This condition can | |
71 | + result in RCU-sched and RCU-bh stalls. | |
33 | 72 | |
34 | -o A CPU looping with preemption disabled. | |
73 | +o A CPU looping with preemption disabled. This condition can | |
74 | + result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh | |
75 | + stalls. | |
35 | 76 | |
77 | +o A CPU looping with bottom halves disabled. This condition can | |
78 | + result in RCU-sched and RCU-bh stalls. | |
79 | + | |
36 | 80 | o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the kernel |
37 | 81 | without invoking schedule(). |
38 | 82 | |
39 | 83 | o A bug in the RCU implementation. |
40 | 84 | |
41 | 85 | o A hardware failure. This is quite unlikely, but has occurred |
42 | - at least once in a former life. A CPU failed in a running system, | |
86 | + at least once in real life. A CPU failed in a running system, | |
43 | 87 | becoming unresponsive, but not causing an immediate crash. |
44 | 88 | This resulted in a series of RCU CPU stall warnings, eventually |
45 | 89 | leading the realization that the CPU had failed. |
46 | 90 | |
47 | -The RCU, RCU-sched, and RCU-bh implementations have CPU stall warning. | |
48 | -SRCU does not do so directly, but its calls to synchronize_sched() will | |
49 | -result in RCU-sched detecting any CPU stalls that might be occurring. | |
91 | +The RCU, RCU-sched, and RCU-bh implementations have CPU stall | |
92 | +warning. SRCU does not have its own CPU stall warnings, but its | |
93 | +calls to synchronize_sched() will result in RCU-sched detecting | |
94 | +RCU-sched-related CPU stalls. Please note that RCU only detects | |
95 | +CPU stalls when there is a grace period in progress. No grace period, | |
96 | +no CPU stall warnings. | |
50 | 97 | |
51 | -To diagnose the cause of the stall, inspect the stack traces. The offending | |
52 | -function will usually be near the top of the stack. If you have a series | |
53 | -of stall warnings from a single extended stall, comparing the stack traces | |
54 | -can often help determine where the stall is occurring, which will usually | |
55 | -be in the function nearest the top of the stack that stays the same from | |
56 | -trace to trace. | |
98 | +To diagnose the cause of the stall, inspect the stack traces. | |
99 | +The offending function will usually be near the top of the stack. | |
100 | +If you have a series of stall warnings from a single extended stall, | |
101 | +comparing the stack traces can often help determine where the stall | |
102 | +is occurring, which will usually be in the function nearest the top of | |
103 | +that portion of the stack which remains the same from trace to trace. | |
104 | +If you can reliably trigger the stall, ftrace can be quite helpful. | |
57 | 105 | |
58 | 106 | RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE. |
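
The reworked text describes three message shapes: a CPU reporting its own stall, another CPU reporting a list of stalled CPUs and tasks, and the spurious variant with an empty list. As a rough illustration of how those messages might be pulled apart when scanning a log, here is a minimal Python sketch. It is not part of the commit or of the kernel. The regular expressions are inferred only from the sample lines quoted in the diff, and the names StallWarning and parse_stall_warning are hypothetical. Other kernel versions may format these messages differently.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Patterns inferred from the sample messages in stallwarn.txt (an assumption,
# not a kernel-defined grammar).
SELF_RE = re.compile(
    r"INFO: (?P<flavor>\w+) detected stall on CPU (?P<cpu>\d+) "
    r"\(t=(?P<jiffies>\d+) jiffies\)"
)
OTHER_RE = re.compile(
    r"INFO: (?P<flavor>\w+) detected stalls on CPUs/tasks: "
    r"\{(?P<culprits>[^}]*)\} "
    r"\(detected by (?P<detector>\d+), (?P<jiffies>\d+) jiffies\)"
)


@dataclass
class StallWarning:
    """Hypothetical container for one parsed stall-warning line."""
    flavor: str      # e.g. "rcu_sched_state", "rcu_bh_state"
    detector: int    # CPU that printed the message
    jiffies: int     # stall duration reported in the message
    cpus: List[int] = field(default_factory=list)
    tasks: List[int] = field(default_factory=list)  # PIDs, shown as "P<pid>"

    @property
    def spurious(self) -> bool:
        # An empty list means the grace period ended as the warning printed.
        return not self.cpus and not self.tasks


def parse_stall_warning(line: str) -> Optional[StallWarning]:
    """Return a StallWarning for a recognized line, else None."""
    m = SELF_RE.search(line)
    if m:
        # The offending CPU detected its own stall, so it is also the detector.
        cpu = int(m.group("cpu"))
        return StallWarning(m.group("flavor"), cpu,
                            int(m.group("jiffies")), cpus=[cpu])
    m = OTHER_RE.search(line)
    if m is None:
        return None
    warning = StallWarning(m.group("flavor"), int(m.group("detector")),
                           int(m.group("jiffies")))
    for token in m.group("culprits").split():
        if token.startswith("P") and token[1:].isdigit():
            warning.tasks.append(int(token[1:]))   # task, listed by PID
        elif token.isdigit():
            warning.cpus.append(int(token))        # CPU number
    return warning
```

Calling parse_stall_warning() on each line of a console log and skipping the results whose spurious property is true would leave only the warnings that name a CPU or task, which are the ones whose stack dumps are worth comparing.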