Commit 23b5c8fa01b723c70a20d6e4ef4ff54c7656d6e1

Authored by Paul E. McKenney
1 parent 4305ce7894

rcu: Decrease memory-barrier usage based on semi-formal proof

(Note: this was reverted, and is now being re-applied in pieces, with
this being the fifth and final piece.  See below for the reason that
it is now felt to be safe to re-apply this.)

Commit d09b62d fixed grace-period synchronization, but left some smp_mb()
invocations in rcu_process_callbacks() that are no longer needed, though
sheer paranoia prevented them from being removed.  This commit removes
them and provides a proof of correctness in their absence.  It also adds
a memory barrier to rcu_report_qs_rsp() immediately before the update to
rsp->completed in order to handle the theoretical possibility that the
compiler or CPU might move massive quantities of code into a lock-based
critical section.  This also proves that the sheer paranoia was not
entirely unjustified, at least from a theoretical point of view.
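
The barrier in question is placed immediately before the store to
rsp->completed.  A minimal sketch of the intended placement follows;
the surrounding rcu_report_qs_rsp() code is abbreviated and assumed,
not copied from the patch:

	/*
	 * Sketch only: rcu_report_qs_rsp() is abbreviated; the point
	 * here is the placement of the new barrier.
	 */
	static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
	{
		WARN_ON_ONCE(!rcu_gp_in_progress(rsp));

		/*
		 * Keep code preceding the quiescent state from being
		 * reordered past the update to ->completed, even if the
		 * compiler or CPU pulls it into this critical section.
		 */
		smp_mb();
		rsp->completed = rsp->gpnum;
		rcu_start_gp(rsp, flags);  /* Releases root rcu_node's ->lock. */
	}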

In addition, the old dyntick-idle synchronization depended on the fact
that grace periods were many milliseconds in duration, so that it could
be assumed that no dyntick-idle CPU could reorder a memory reference
across an entire grace period.  Unfortunately for this design, the
addition of expedited grace periods breaks this assumption, which has
the unfortunate side-effect of requiring atomic operations in the
functions that track dyntick-idle state for RCU.  (There is some hope
that the algorithms used in user-level RCU might be applied here, but
some work is required to handle the NMIs that user-space applications
can happily ignore.  For the short term, better safe than sorry.)
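
Because the counter is now a single atomic_t, a remote CPU can sample
another CPU's dyntick-idle state with full memory ordering by using a
value-returning atomic operation, as the new
dyntick_save_progress_counter() in the diff below does.  A minimal
sketch of the convention (even value means dyntick-idle, odd means
not; the helper name is made up for illustration):

	/*
	 * Illustrative helper only (not part of the patch).  The
	 * atomic_add_return(0, ...) idiom is used because value-returning
	 * atomics imply full memory ordering on both sides.
	 */
	static int cpu_in_dyntick_idle(struct rcu_dynticks *rdtp)
	{
		int snap = atomic_add_return(0, &rdtp->dynticks);

		return (snap & 0x1) == 0;  /* Even value => dyntick-idle. */
	}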

This proof assumes that neither compiler nor CPU will allow a lock
acquisition and release to be reordered, as doing so can result in
deadlock.  The proof is as follows (a simplified sketch of the
lock-chaining pattern appears after the list):

1.	A given CPU declares a quiescent state under the protection of
	its leaf rcu_node's lock.

2.	If there is more than one level of rcu_node hierarchy, the
	last CPU to declare a quiescent state will also acquire the
	->lock of the next rcu_node up in the hierarchy, but only
	after releasing the lower level's lock.  The acquisition of this
	lock clearly cannot occur prior to the acquisition of the leaf
	node's lock.

3.	Step 2 repeats until we reach the root rcu_node structure.
	Please note again that only one lock is held at a time through
	this process.  The acquisition of the root rcu_node's ->lock
	must occur after the release of that of the leaf rcu_node.

4.	At this point, we set the ->completed field in the rcu_state
	structure in rcu_report_qs_rsp().  However, if the rcu_node
	hierarchy contains only one rcu_node, then in theory the code
	preceding the quiescent state could leak into the critical
	section.  We therefore precede the update of ->completed with a
	memory barrier.  All CPUs will therefore agree that any updates
	preceding any report of a quiescent state will have happened
	before the update of ->completed.

5.	Regardless of whether a new grace period is needed, rcu_start_gp()
	will propagate the new value of ->completed to all of the leaf
	rcu_node structures, under the protection of each rcu_node's ->lock.
	If a new grace period is needed immediately, this propagation
	will occur in the same critical section that ->completed was
	set in, but courtesy of the memory barrier in #4 above, is still
	seen to follow any pre-quiescent-state activity.

6.	When a given CPU invokes __rcu_process_gp_end(), it becomes
	aware of the end of the old grace period and therefore makes
	any RCU callbacks that were waiting on that grace period eligible
	for invocation.

	If this CPU is the same one that detected the end of the grace
	period, and if there is but a single rcu_node in the hierarchy,
	we will still be in the single critical section.  In this case,
	the memory barrier in step #4 guarantees that all callbacks will
	be seen to execute after each CPU's quiescent state.

	On the other hand, if this is a different CPU, it will acquire
	the leaf rcu_node's ->lock, and will again be serialized after
	each CPU's quiescent state for the old grace period.
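
A loose sketch of the lock-chaining pattern in steps 1 through 4,
modeled on (but not copied from) rcu_report_qs_rnp() and
rcu_report_qs_rsp(); bitmask handling, preemptible-RCU details, and
error paths are omitted, and last_cpu_to_report_qs() is a made-up
placeholder:

	/*
	 * Loose sketch of steps 1-4; assumes the leaf rcu_node's ->lock
	 * is already held on entry.  Each ->lock is released before the
	 * parent's ->lock is acquired, so only one ->lock is held at a
	 * time.
	 */
	static void report_qs_sketch(struct rcu_state *rsp, struct rcu_node *rnp,
				     unsigned long flags)
	{
		for (;;) {
			if (!last_cpu_to_report_qs(rnp)) {  /* placeholder */
				raw_spin_unlock_irqrestore(&rnp->lock, flags);
				return;  /* Other CPUs still owe quiescent states. */
			}
			if (rnp->parent == NULL)
				break;  /* Reached the root rcu_node. */
			raw_spin_unlock_irqrestore(&rnp->lock, flags);
			rnp = rnp->parent;
			raw_spin_lock_irqsave(&rnp->lock, flags);  /* Steps 2-3. */
		}
		smp_mb();  /* Step 4: pre-QS accesses before ->completed update. */
		rsp->completed = rsp->gpnum;
		raw_spin_unlock_irqrestore(&rnp->lock, flags);
	}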

On the strength of this proof, this commit therefore removes the memory
barriers from rcu_process_callbacks() and adds one to rcu_report_qs_rsp().
The effect is to reduce the number of memory barriers by one and to
reduce the frequency of execution from about once per scheduling tick
per CPU to once per grace period.

This was reverted due to hangs found during testing by Yinghai Lu and
Ingo Molnar.  Frederic Weisbecker supplied Yinghai with tracing that
located the underlying problem, and Frederic also provided the fix.

The underlying problem was that the HARDIRQ_ENTER() macro from
lib/locking-selftest.c invoked irq_enter(), which in turn invokes
rcu_irq_enter(), but HARDIRQ_EXIT() invoked __irq_exit(), which
does not invoke rcu_irq_exit().  This situation resulted in calls
to rcu_irq_enter() that were not balanced by the required calls to
rcu_irq_exit().  Therefore, after these locking selftests completed,
RCU's dyntick-idle nesting count was a large number (for example,
72), which caused RCU to conclude that the affected CPU was not in
dyntick-idle mode when in fact it was.

RCU would therefore incorrectly wait for this dyntick-idle CPU, resulting
in hangs.
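
The imbalance can be seen in the selftest's helper macros, paraphrased
below (the exact definitions in lib/locking-selftest.c differ in
detail):

	/*
	 * Paraphrased from lib/locking-selftest.c.  irq_enter() invokes
	 * rcu_irq_enter(), but __irq_exit() does not invoke
	 * rcu_irq_exit(), so each ENTER/EXIT pair leaves RCU's
	 * dyntick-idle nesting count one higher than before.
	 */
	#define HARDIRQ_ENTER()		\
		local_irq_disable();	\
		irq_enter();		\
		WARN_ON(!in_irq());

	#define HARDIRQ_EXIT()		\
		__irq_exit();		\
		local_irq_enable();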

In contrast, with Frederic's patch, which replaces the irq_enter()
in HARDIRQ_ENTER() with an __irq_enter(), these tests don't ever call
either rcu_irq_enter() or rcu_irq_exit(), which works because the CPU
running the test is already marked as not being in dyntick-idle mode.
This means that the rcu_irq_enter() and rcu_irq_exit() calls remain
balanced (neither is ever made), so RCU has no problem working out
which CPUs are in dyntick-idle mode and which are not.
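
In other words, Frederic's change amounts to a one-line substitution
in the macro paraphrased above:

	/* Paraphrased: __irq_enter() does not call rcu_irq_enter(). */
	#define HARDIRQ_ENTER()		\
		local_irq_disable();	\
		__irq_enter();		\
		WARN_ON(!in_irq());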

The reason that the imbalance was not noticed before the barrier patch
was applied is that the old implementation of rcu_enter_nohz() ignored
the nesting depth.  This could still result in delays, but much shorter
ones.  Whenever there was a delay, RCU would IPI the CPU with the
unbalanced nesting level, which would eventually result in rcu_enter_nohz()
being called, which in turn would force RCU to see that the CPU was in
dyntick-idle mode.

The reason that very few people noticed the problem is that the mismatched
irq_enter() vs. __irq_exit() occurred only when the kernel was built with
CONFIG_DEBUG_LOCKING_API_SELFTESTS.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>

Showing 5 changed files with 67 additions and 89 deletions

Documentation/RCU/trace.txt
... ... @@ -99,18 +99,11 @@
99 99  
100 100 o "dt" is the current value of the dyntick counter that is incremented
101 101 when entering or leaving dynticks idle state, either by the
102   - scheduler or by irq. The number after the "/" is the interrupt
103   - nesting depth when in dyntick-idle state, or one greater than
104   - the interrupt-nesting depth otherwise.
105   -
106   - This field is displayed only for CONFIG_NO_HZ kernels.
107   -
108   -o "dn" is the current value of the dyntick counter that is incremented
109   - when entering or leaving dynticks idle state via NMI. If both
110   - the "dt" and "dn" values are even, then this CPU is in dynticks
111   - idle mode and may be ignored by RCU. If either of these two
112   - counters is odd, then RCU must be alert to the possibility of
113   - an RCU read-side critical section running on this CPU.
  102 + scheduler or by irq. This number is even if the CPU is in
  103 + dyntick idle mode and odd otherwise. The number after the first
  104 + "/" is the interrupt nesting depth when in dyntick-idle state,
  105 + or one greater than the interrupt-nesting depth otherwise.
  106 + The number after the second "/" is the NMI nesting depth.
114 107  
115 108 This field is displayed only for CONFIG_NO_HZ kernels.
116 109  
kernel/rcutree.c
... ... @@ -162,7 +162,7 @@
162 162 #ifdef CONFIG_NO_HZ
163 163 DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
164 164 .dynticks_nesting = 1,
165   - .dynticks = 1,
  165 + .dynticks = ATOMIC_INIT(1),
166 166 };
167 167 #endif /* #ifdef CONFIG_NO_HZ */
168 168  
169 169  
170 170  
... ... @@ -321,13 +321,25 @@
321 321 unsigned long flags;
322 322 struct rcu_dynticks *rdtp;
323 323  
324   - smp_mb(); /* CPUs seeing ++ must see prior RCU read-side crit sects */
325 324 local_irq_save(flags);
326 325 rdtp = &__get_cpu_var(rcu_dynticks);
327   - if (--rdtp->dynticks_nesting == 0)
328   - rdtp->dynticks++;
329   - WARN_ON_ONCE(rdtp->dynticks & 0x1);
  326 + if (--rdtp->dynticks_nesting) {
  327 + local_irq_restore(flags);
  328 + return;
  329 + }
  330 + /* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
  331 + smp_mb__before_atomic_inc(); /* See above. */
  332 + atomic_inc(&rdtp->dynticks);
  333 + smp_mb__after_atomic_inc(); /* Force ordering with next sojourn. */
  334 + WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
330 335 local_irq_restore(flags);
  336 +
  337 + /* If the interrupt queued a callback, get out of dyntick mode. */
  338 + if (in_irq() &&
  339 + (__get_cpu_var(rcu_sched_data).nxtlist ||
  340 + __get_cpu_var(rcu_bh_data).nxtlist ||
  341 + rcu_preempt_needs_cpu(smp_processor_id())))
  342 + set_need_resched();
331 343 }
332 344  
333 345 /*
334 346  
... ... @@ -343,11 +355,16 @@
343 355  
344 356 local_irq_save(flags);
345 357 rdtp = &__get_cpu_var(rcu_dynticks);
346   - rdtp->dynticks++;
347   - rdtp->dynticks_nesting++;
348   - WARN_ON_ONCE(!(rdtp->dynticks & 0x1));
  358 + if (rdtp->dynticks_nesting++) {
  359 + local_irq_restore(flags);
  360 + return;
  361 + }
  362 + smp_mb__before_atomic_inc(); /* Force ordering w/previous sojourn. */
  363 + atomic_inc(&rdtp->dynticks);
  364 + /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
  365 + smp_mb__after_atomic_inc(); /* See above. */
  366 + WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
349 367 local_irq_restore(flags);
350   - smp_mb(); /* CPUs seeing ++ must see later RCU read-side crit sects */
351 368 }
352 369  
353 370 /**
354 371  
... ... @@ -361,11 +378,15 @@
361 378 {
362 379 struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
363 380  
364   - if (rdtp->dynticks & 0x1)
  381 + if (rdtp->dynticks_nmi_nesting == 0 &&
  382 + (atomic_read(&rdtp->dynticks) & 0x1))
365 383 return;
366   - rdtp->dynticks_nmi++;
367   - WARN_ON_ONCE(!(rdtp->dynticks_nmi & 0x1));
368   - smp_mb(); /* CPUs seeing ++ must see later RCU read-side crit sects */
  384 + rdtp->dynticks_nmi_nesting++;
  385 + smp_mb__before_atomic_inc(); /* Force delay from prior write. */
  386 + atomic_inc(&rdtp->dynticks);
  387 + /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
  388 + smp_mb__after_atomic_inc(); /* See above. */
  389 + WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
369 390 }
370 391  
371 392 /**
372 393  
... ... @@ -379,11 +400,14 @@
379 400 {
380 401 struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
381 402  
382   - if (rdtp->dynticks & 0x1)
  403 + if (rdtp->dynticks_nmi_nesting == 0 ||
  404 + --rdtp->dynticks_nmi_nesting != 0)
383 405 return;
384   - smp_mb(); /* CPUs seeing ++ must see prior RCU read-side crit sects */
385   - rdtp->dynticks_nmi++;
386   - WARN_ON_ONCE(rdtp->dynticks_nmi & 0x1);
  406 + /* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
  407 + smp_mb__before_atomic_inc(); /* See above. */
  408 + atomic_inc(&rdtp->dynticks);
  409 + smp_mb__after_atomic_inc(); /* Force delay to next write. */
  410 + WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
387 411 }
388 412  
389 413 /**
... ... @@ -394,13 +418,7 @@
394 418 */
395 419 void rcu_irq_enter(void)
396 420 {
397   - struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
398   -
399   - if (rdtp->dynticks_nesting++)
400   - return;
401   - rdtp->dynticks++;
402   - WARN_ON_ONCE(!(rdtp->dynticks & 0x1));
403   - smp_mb(); /* CPUs seeing ++ must see later RCU read-side crit sects */
  421 + rcu_exit_nohz();
404 422 }
405 423  
406 424 /**
... ... @@ -412,19 +430,7 @@
412 430 */
413 431 void rcu_irq_exit(void)
414 432 {
415   - struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
416   -
417   - if (--rdtp->dynticks_nesting)
418   - return;
419   - smp_mb(); /* CPUs seeing ++ must see prior RCU read-side crit sects */
420   - rdtp->dynticks++;
421   - WARN_ON_ONCE(rdtp->dynticks & 0x1);
422   -
423   - /* If the interrupt queued a callback, get out of dyntick mode. */
424   - if (in_irq() &&
425   - (__this_cpu_read(rcu_sched_data.nxtlist) ||
426   - __this_cpu_read(rcu_bh_data.nxtlist)))
427   - set_need_resched();
  433 + rcu_enter_nohz();
428 434 }
429 435  
430 436 #ifdef CONFIG_SMP
... ... @@ -436,19 +442,8 @@
436 442 */
437 443 static int dyntick_save_progress_counter(struct rcu_data *rdp)
438 444 {
439   - int ret;
440   - int snap;
441   - int snap_nmi;
442   -
443   - snap = rdp->dynticks->dynticks;
444   - snap_nmi = rdp->dynticks->dynticks_nmi;
445   - smp_mb(); /* Order sampling of snap with end of grace period. */
446   - rdp->dynticks_snap = snap;
447   - rdp->dynticks_nmi_snap = snap_nmi;
448   - ret = ((snap & 0x1) == 0) && ((snap_nmi & 0x1) == 0);
449   - if (ret)
450   - rdp->dynticks_fqs++;
451   - return ret;
  445 + rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);
  446 + return 0;
452 447 }
453 448  
454 449 /*
455 450  
... ... @@ -459,16 +454,11 @@
459 454 */
460 455 static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
461 456 {
462   - long curr;
463   - long curr_nmi;
464   - long snap;
465   - long snap_nmi;
  457 + unsigned long curr;
  458 + unsigned long snap;
466 459  
467   - curr = rdp->dynticks->dynticks;
468   - snap = rdp->dynticks_snap;
469   - curr_nmi = rdp->dynticks->dynticks_nmi;
470   - snap_nmi = rdp->dynticks_nmi_snap;
471   - smp_mb(); /* force ordering with cpu entering/leaving dynticks. */
  460 + curr = (unsigned long)atomic_add_return(0, &rdp->dynticks->dynticks);
  461 + snap = (unsigned long)rdp->dynticks_snap;
472 462  
473 463 /*
474 464 * If the CPU passed through or entered a dynticks idle phase with
... ... @@ -478,8 +468,7 @@
478 468 * read-side critical section that started before the beginning
479 469 * of the current RCU grace period.
480 470 */
481   - if ((curr != snap || (curr & 0x1) == 0) &&
482   - (curr_nmi != snap_nmi || (curr_nmi & 0x1) == 0)) {
  471 + if ((curr & 0x1) == 0 || ULONG_CMP_GE(curr, snap + 2)) {
483 472 rdp->dynticks_fqs++;
484 473 return 1;
485 474 }
kernel/rcutree.h
... ... @@ -84,11 +84,9 @@
84 84 * Dynticks per-CPU state.
85 85 */
86 86 struct rcu_dynticks {
87   - int dynticks_nesting; /* Track nesting level, sort of. */
88   - int dynticks; /* Even value for dynticks-idle, else odd. */
89   - int dynticks_nmi; /* Even value for either dynticks-idle or */
90   - /* not in nmi handler, else odd. So this */
91   - /* remains even for nmi from irq handler. */
  87 + int dynticks_nesting; /* Track irq/process nesting level. */
  88 + int dynticks_nmi_nesting; /* Track NMI nesting level. */
  89 + atomic_t dynticks; /* Even value for dynticks-idle, else odd. */
92 90 };
93 91  
94 92 /* RCU's kthread states for tracing. */
... ... @@ -284,7 +282,6 @@
284 282 /* 3) dynticks interface. */
285 283 struct rcu_dynticks *dynticks; /* Shared per-CPU dynticks state. */
286 284 int dynticks_snap; /* Per-GP tracking for dynticks. */
287   - int dynticks_nmi_snap; /* Per-GP tracking for dynticks_nmi. */
288 285 #endif /* #ifdef CONFIG_NO_HZ */
289 286  
290 287 /* 4) reasons this CPU needed to be kicked by force_quiescent_state */
kernel/rcutree_plugin.h
... ... @@ -1520,7 +1520,6 @@
1520 1520 {
1521 1521 int c = 0;
1522 1522 int snap;
1523   - int snap_nmi;
1524 1523 int thatcpu;
1525 1524  
1526 1525 /* Check for being in the holdoff period. */
1527 1526  
... ... @@ -1531,10 +1530,10 @@
1531 1530 for_each_online_cpu(thatcpu) {
1532 1531 if (thatcpu == cpu)
1533 1532 continue;
1534   - snap = per_cpu(rcu_dynticks, thatcpu).dynticks;
1535   - snap_nmi = per_cpu(rcu_dynticks, thatcpu).dynticks_nmi;
  1533 + snap = atomic_add_return(0, &per_cpu(rcu_dynticks,
  1534 + thatcpu).dynticks);
1536 1535 smp_mb(); /* Order sampling of snap with end of grace period. */
1537   - if (((snap & 0x1) != 0) || ((snap_nmi & 0x1) != 0)) {
  1536 + if ((snap & 0x1) != 0) {
1538 1537 per_cpu(rcu_dyntick_drain, cpu) = 0;
1539 1538 per_cpu(rcu_dyntick_holdoff, cpu) = jiffies - 1;
1540 1539 return rcu_needs_cpu_quick_check(cpu);
kernel/rcutree_trace.c
... ... @@ -69,10 +69,10 @@
69 69 rdp->passed_quiesc, rdp->passed_quiesc_completed,
70 70 rdp->qs_pending);
71 71 #ifdef CONFIG_NO_HZ
72   - seq_printf(m, " dt=%d/%d dn=%d df=%lu",
73   - rdp->dynticks->dynticks,
  72 + seq_printf(m, " dt=%d/%d/%d df=%lu",
  73 + atomic_read(&rdp->dynticks->dynticks),
74 74 rdp->dynticks->dynticks_nesting,
75   - rdp->dynticks->dynticks_nmi,
  75 + rdp->dynticks->dynticks_nmi_nesting,
76 76 rdp->dynticks_fqs);
77 77 #endif /* #ifdef CONFIG_NO_HZ */
78 78 seq_printf(m, " of=%lu ri=%lu", rdp->offline_fqs, rdp->resched_ipi);
79 79  
... ... @@ -141,9 +141,9 @@
141 141 rdp->qs_pending);
142 142 #ifdef CONFIG_NO_HZ
143 143 seq_printf(m, ",%d,%d,%d,%lu",
144   - rdp->dynticks->dynticks,
  144 + atomic_read(&rdp->dynticks->dynticks),
145 145 rdp->dynticks->dynticks_nesting,
146   - rdp->dynticks->dynticks_nmi,
  146 + rdp->dynticks->dynticks_nmi_nesting,
147 147 rdp->dynticks_fqs);
148 148 #endif /* #ifdef CONFIG_NO_HZ */
149 149 seq_printf(m, ",%lu,%lu", rdp->offline_fqs, rdp->resched_ipi);
... ... @@ -167,7 +167,7 @@
167 167 {
168 168 seq_puts(m, "\"CPU\",\"Online?\",\"c\",\"g\",\"pq\",\"pqc\",\"pq\",");
169 169 #ifdef CONFIG_NO_HZ
170   - seq_puts(m, "\"dt\",\"dt nesting\",\"dn\",\"df\",");
  170 + seq_puts(m, "\"dt\",\"dt nesting\",\"dt NMI nesting\",\"df\",");
171 171 #endif /* #ifdef CONFIG_NO_HZ */
172 172 seq_puts(m, "\"of\",\"ri\",\"ql\",\"b\",\"ci\",\"co\",\"ca\"\n");
173 173 #ifdef CONFIG_TREE_PREEMPT_RCU