Commit f1d507beeab1d1d60a1c58eac7dc81522c6f4629

Authored by Paul E. McKenney
1 parent d21670acab

rcu: improve the RCU CPU-stall warning documentation

The existing Documentation/RCU/stallwarn.txt has proven unhelpful, so
rework it a bit.  In particular, show how to interpret the stall-warning
messages.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Showing 1 changed file with 71 additions and 23 deletions

Documentation/RCU/stallwarn.txt
@@ -3,57 +3,105 @@
 The CONFIG_RCU_CPU_STALL_DETECTOR kernel config parameter enables
 RCU's CPU stall detector, which detects conditions that unduly delay
 RCU grace periods. The stall detector's idea of what constitutes
-"unduly delayed" is controlled by a pair of C preprocessor macros:
+"unduly delayed" is controlled by a set of C preprocessor macros:
 
 RCU_SECONDS_TILL_STALL_CHECK
 
  This macro defines the period of time that RCU will wait from
  the beginning of a grace period until it issues an RCU CPU
- stall warning. It is normally ten seconds.
+ stall warning. This time period is normally ten seconds.
 
 RCU_SECONDS_TILL_STALL_RECHECK
 
  This macro defines the period of time that RCU will wait after
- issuing a stall warning until it issues another stall warning.
- It is normally set to thirty seconds.
+ issuing a stall warning until it issues another stall warning
+ for the same stall. This time period is normally set to thirty
+ seconds.
 
 RCU_STALL_RAT_DELAY
 
- The CPU stall detector tries to make the offending CPU rat on itself,
- as this often gives better-quality stack traces. However, if
- the offending CPU does not detect its own stall in the number
- of jiffies specified by RCU_STALL_RAT_DELAY, then other CPUs will
- complain. This is normally set to two jiffies.
+ The CPU stall detector tries to make the offending CPU print its
+ own warnings, as this often gives better-quality stack traces.
+ However, if the offending CPU does not detect its own stall in
+ the number of jiffies specified by RCU_STALL_RAT_DELAY, then
+ some other CPU will complain. This delay is normally set to
+ two jiffies.
 
-The following problems can result in an RCU CPU stall warning:
+When a CPU detects that it is stalling, it will print a message similar
+to the following:
 
+INFO: rcu_sched_state detected stall on CPU 5 (t=2500 jiffies)
+
+This message indicates that CPU 5 detected that it was causing a stall,
+and that the stall was affecting RCU-sched. This message will normally be
+followed by a stack dump of the offending CPU. On TREE_RCU kernel builds,
+RCU and RCU-sched are implemented by the same underlying mechanism,
+while on TREE_PREEMPT_RCU kernel builds, RCU is instead implemented
+by rcu_preempt_state.
+
+On the other hand, if the offending CPU fails to print out a stall-warning
+message quickly enough, some other CPU will print a message similar to
+the following:
+
+INFO: rcu_bh_state detected stalls on CPUs/tasks: { 3 5 } (detected by 2, 2502 jiffies)
+
+This message indicates that CPU 2 detected that CPUs 3 and 5 were both
+causing stalls, and that the stall was affecting RCU-bh. This message
+will normally be followed by stack dumps for each CPU. Please note that
+TREE_PREEMPT_RCU builds can be stalled by tasks as well as by CPUs,
+and that the tasks will be indicated by PID, for example, "P3421".
+It is even possible for a rcu_preempt_state stall to be caused by both
+CPUs -and- tasks, in which case the offending CPUs and tasks will all
+be called out in the list.
+
+Finally, if the grace period ends just as the stall warning starts
+printing, there will be a spurious stall-warning message:
+
+INFO: rcu_bh_state detected stalls on CPUs/tasks: { } (detected by 4, 2502 jiffies)
+
+This is rare, but does happen from time to time in real life.
+
+So your kernel printed an RCU CPU stall warning. The next question is
+"What caused it?" The following problems can result in RCU CPU stall
+warnings:
+
 o A CPU looping in an RCU read-side critical section.
 
-o A CPU looping with interrupts disabled.
+o A CPU looping with interrupts disabled. This condition can
+ result in RCU-sched and RCU-bh stalls.
 
-o A CPU looping with preemption disabled.
+o A CPU looping with preemption disabled. This condition can
+ result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh
+ stalls.
 
+o A CPU looping with bottom halves disabled. This condition can
+ result in RCU-sched and RCU-bh stalls.
+
 o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the kernel
  without invoking schedule().
 
 o A bug in the RCU implementation.
 
 o A hardware failure. This is quite unlikely, but has occurred
- at least once in a former life. A CPU failed in a running system,
+ at least once in real life. A CPU failed in a running system,
  becoming unresponsive, but not causing an immediate crash.
  This resulted in a series of RCU CPU stall warnings, eventually
  leading the realization that the CPU had failed.
 
-The RCU, RCU-sched, and RCU-bh implementations have CPU stall warning.
-SRCU does not do so directly, but its calls to synchronize_sched() will
-result in RCU-sched detecting any CPU stalls that might be occurring.
+The RCU, RCU-sched, and RCU-bh implementations have CPU stall
+warning. SRCU does not have its own CPU stall warnings, but its
+calls to synchronize_sched() will result in RCU-sched detecting
+RCU-sched-related CPU stalls. Please note that RCU only detects
+CPU stalls when there is a grace period in progress. No grace period,
+no CPU stall warnings.
 
-To diagnose the cause of the stall, inspect the stack traces. The offending
-function will usually be near the top of the stack. If you have a series
-of stall warnings from a single extended stall, comparing the stack traces
-can often help determine where the stall is occurring, which will usually
-be in the function nearest the top of the stack that stays the same from
-trace to trace.
+To diagnose the cause of the stall, inspect the stack traces.
+The offending function will usually be near the top of the stack.
+If you have a series of stall warnings from a single extended stall,
+comparing the stack traces can often help determine where the stall
+is occurring, which will usually be in the function nearest the top of
+that portion of the stack which remains the same from trace to trace.
+If you can reliably trigger the stall, ftrace can be quite helpful.
 
 RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE.
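
The message formats described in the patch above can also be read mechanically. The following is a minimal Python sketch, not part of the commit, that extracts the RCU state name, the stalled CPUs and tasks, the detecting CPU, and the elapsed jiffies from a stall-warning line. The function name parse_stall_warning, the regular expressions, and the default of HZ=250 used to convert jiffies to seconds are all assumptions made for this example. The patterns are modeled only on the three sample messages in the patch, so other kernel versions may well format these lines differently.

```python
import re

# Stall reported by the stalled CPU itself, for example:
#   INFO: rcu_sched_state detected stall on CPU 5 (t=2500 jiffies)
SELF_RE = re.compile(
    r"INFO: (?P<state>rcu_\w+_state) detected stall on CPU (?P<cpu>\d+) "
    r"\(t=(?P<jiffies>\d+) jiffies\)"
)

# Stall reported by some other CPU, for example:
#   INFO: rcu_bh_state detected stalls on CPUs/tasks: { 3 5 } (detected by 2, 2502 jiffies)
OTHER_RE = re.compile(
    r"INFO: (?P<state>rcu_\w+_state) detected stalls on CPUs/tasks: "
    r"\{(?P<who>[^}]*)\} \(detected by (?P<by>\d+), (?P<jiffies>\d+) jiffies\)"
)


def parse_stall_warning(line, hz=250):
    """Parse one stall-warning line; return a dict, or None if it does not match.

    hz is an assumed value of the kernel's HZ setting, used only to
    convert jiffies to seconds. Adjust it to match the kernel in question.
    """
    m = SELF_RE.search(line)
    if m:
        cpu = int(m.group("cpu"))
        cpus, tasks, detected_by = [cpu], [], cpu
    else:
        m = OTHER_RE.search(line)
        if m is None:
            return None
        tokens = m.group("who").split()
        # Bare numbers are CPUs; "P<pid>" entries are blocked tasks.
        cpus = [int(t) for t in tokens if t.isdigit()]
        tasks = [int(t[1:]) for t in tokens
                 if t.startswith("P") and t[1:].isdigit()]
        detected_by = int(m.group("by"))
    jiffies = int(m.group("jiffies"))
    return {
        "state": m.group("state"),
        "cpus": cpus,
        "tasks": tasks,
        "detected_by": detected_by,
        "jiffies": jiffies,
        "seconds": jiffies / hz,
        # An empty list corresponds to the spurious warning described above.
        "spurious": not cpus and not tasks,
    }
```

With the assumed HZ=250, the 2500 jiffies in the first sample message correspond to ten seconds, which matches the documented default for RCU_SECONDS_TILL_STALL_CHECK. The 2502 jiffies in the other two samples are consistent with the additional two-jiffy RCU_STALL_RAT_DELAY.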