Commit 23b5c8fa01b723c70a20d6e4ef4ff54c7656d6e1

Authored by Paul E. McKenney
1 parent 4305ce7894

rcu: Decrease memory-barrier usage based on semi-formal proof

(Note: this was reverted, and is now being re-applied in pieces, with
this being the fifth and final piece.  See below for the reason that
it is now felt to be safe to re-apply this.)

Commit d09b62d fixed grace-period synchronization, but left some smp_mb()
invocations in rcu_process_callbacks() that are no longer needed, though
sheer paranoia prevented them from being removed.  This commit removes
them and provides a proof of correctness in their absence.  It also adds
a memory barrier to rcu_report_qs_rsp() immediately before the update to
rsp->completed in order to handle the theoretical possibility that the
compiler or CPU might move massive quantities of code into a lock-based
critical section.  This also proves that the sheer paranoia was not
entirely unjustified, at least from a theoretical point of view.
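
The barrier in question is placed immediately before the store to
rsp->completed.  A minimal sketch of the intended placement follows;
the surrounding rcu_report_qs_rsp() code is abbreviated and assumed,
not copied from the patch:

	/*
	 * Sketch only: rcu_report_qs_rsp() is abbreviated; the point
	 * here is the placement of the new barrier.
	 */
	static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
	{
		WARN_ON_ONCE(!rcu_gp_in_progress(rsp));

		/*
		 * Keep code preceding the quiescent state from being
		 * reordered past the update to ->completed, even if the
		 * compiler or CPU pulls it into this critical section.
		 */
		smp_mb();
		rsp->completed = rsp->gpnum;
		rcu_start_gp(rsp, flags);  /* Releases root rcu_node's ->lock. */
	}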

In addition, the old dyntick-idle synchronization depended on the fact
that grace periods were many milliseconds in duration, so that it could
be assumed that no dyntick-idle CPU could reorder a memory reference
across an entire grace period.  Unfortunately for this design, the
addition of expedited grace periods breaks this assumption, which has
the unfortunate side-effect of requiring atomic operations in the
functions that track dyntick-idle state for RCU.  (There is some hope
that the algorithms used in user-level RCU might be applied here, but
some work is required to handle the NMIs that user-space applications
can happily ignore.  For the short term, better safe than sorry.)
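
Because the counter is now a single atomic_t, a remote CPU can sample
another CPU's dyntick-idle state with full memory ordering by using a
value-returning atomic operation, as the new
dyntick_save_progress_counter() in the diff below does.  A minimal
sketch of the convention (even value means dyntick-idle, odd means
not; the helper name is made up for illustration):

	/*
	 * Illustrative helper only (not part of the patch).  The
	 * atomic_add_return(0, ...) idiom is used because value-returning
	 * atomics imply full memory ordering on both sides.
	 */
	static int cpu_in_dyntick_idle(struct rcu_dynticks *rdtp)
	{
		int snap = atomic_add_return(0, &rdtp->dynticks);

		return (snap & 0x1) == 0;  /* Even value => dyntick-idle. */
	}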

This proof assumes that neither compiler nor CPU will allow a lock
acquisition and release to be reordered, as doing so can result in
deadlock.  The proof is as follows (a simplified sketch of the
lock-chaining pattern appears after the list):

1.	A given CPU declares a quiescent state under the protection of
	its leaf rcu_node's lock.

2.	If there is more than one level of rcu_node hierarchy, the
	last CPU to declare a quiescent state will also acquire the
	->lock of the next rcu_node up in the hierarchy, but only
	after releasing the lower level's lock.  The acquisition of this
	lock clearly cannot occur prior to the acquisition of the leaf
	node's lock.

3.	Step 2 repeats until we reach the root rcu_node structure.
	Please note again that only one lock is held at a time through
	this process.  The acquisition of the root rcu_node's ->lock
	must occur after the release of that of the leaf rcu_node.

4.	At this point, we set the ->completed field in the rcu_state
	structure in rcu_report_qs_rsp().  However, if the rcu_node
	hierarchy contains only one rcu_node, then in theory the code
	preceding the quiescent state could leak into the critical
	section.  We therefore precede the update of ->completed with a
	memory barrier.  All CPUs will therefore agree that any updates
	preceding any report of a quiescent state will have happened
	before the update of ->completed.

5.	Regardless of whether a new grace period is needed, rcu_start_gp()
	will propagate the new value of ->completed to all of the leaf
	rcu_node structures, under the protection of each rcu_node's ->lock.
	If a new grace period is needed immediately, this propagation
	will occur in the same critical section that ->completed was
	set in, but courtesy of the memory barrier in #4 above, is still
	seen to follow any pre-quiescent-state activity.

6.	When a given CPU invokes __rcu_process_gp_end(), it becomes
	aware of the end of the old grace period and therefore makes
	any RCU callbacks that were waiting on that grace period eligible
	for invocation.

	If this CPU is the same one that detected the end of the grace
	period, and if there is but a single rcu_node in the hierarchy,
	we will still be in the single critical section.  In this case,
	the memory barrier in step #4 guarantees that all callbacks will
	be seen to execute after each CPU's quiescent state.

	On the other hand, if this is a different CPU, it will acquire
	the leaf rcu_node's ->lock, and will again be serialized after
	each CPU's quiescent state for the old grace period.
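
A loose sketch of the lock-chaining pattern in steps 1 through 4,
modeled on (but not copied from) rcu_report_qs_rnp() and
rcu_report_qs_rsp(); bitmask handling, preemptible-RCU details, and
error paths are omitted, and last_cpu_to_report_qs() is a made-up
placeholder:

	/*
	 * Loose sketch of steps 1-4; assumes the leaf rcu_node's ->lock
	 * is already held on entry.  Each ->lock is released before the
	 * parent's ->lock is acquired, so only one ->lock is held at a
	 * time.
	 */
	static void report_qs_sketch(struct rcu_state *rsp, struct rcu_node *rnp,
				     unsigned long flags)
	{
		for (;;) {
			if (!last_cpu_to_report_qs(rnp)) {  /* placeholder */
				raw_spin_unlock_irqrestore(&rnp->lock, flags);
				return;  /* Other CPUs still owe quiescent states. */
			}
			if (rnp->parent == NULL)
				break;  /* Reached the root rcu_node. */
			raw_spin_unlock_irqrestore(&rnp->lock, flags);
			rnp = rnp->parent;
			raw_spin_lock_irqsave(&rnp->lock, flags);  /* Steps 2-3. */
		}
		smp_mb();  /* Step 4: pre-QS accesses before ->completed update. */
		rsp->completed = rsp->gpnum;
		raw_spin_unlock_irqrestore(&rnp->lock, flags);
	}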

On the strength of this proof, this commit therefore removes the memory
barriers from rcu_process_callbacks() and adds one to rcu_report_qs_rsp().
The effect is to reduce the number of memory barriers by one and to
reduce the frequency of execution from about once per scheduling tick
per CPU to once per grace period.

This was reverted due to hangs found during testing by Yinghai Lu and
Ingo Molnar.  Frederic Weisbecker supplied Yinghai with tracing that
located the underlying problem, and Frederic also provided the fix.

The underlying problem was that the HARDIRQ_ENTER() macro from
lib/locking-selftest.c invoked irq_enter(), which in turn invokes
rcu_irq_enter(), but HARDIRQ_EXIT() invoked __irq_exit(), which
does not invoke rcu_irq_exit().  This situation resulted in calls
to rcu_irq_enter() that were not balanced by the required calls to
rcu_irq_exit().  Therefore, after these locking selftests completed,
RCU's dyntick-idle nesting count was a large number (for example,
72), which caused RCU to conclude that the affected CPU was not in
dyntick-idle mode when in fact it was.

RCU would therefore incorrectly wait for this dyntick-idle CPU, resulting
in hangs.
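
The imbalance can be seen in the selftest's helper macros, paraphrased
below (the exact definitions in lib/locking-selftest.c differ in
detail):

	/*
	 * Paraphrased from lib/locking-selftest.c.  irq_enter() invokes
	 * rcu_irq_enter(), but __irq_exit() does not invoke
	 * rcu_irq_exit(), so each ENTER/EXIT pair leaves RCU's
	 * dyntick-idle nesting count one higher than before.
	 */
	#define HARDIRQ_ENTER()		\
		local_irq_disable();	\
		irq_enter();		\
		WARN_ON(!in_irq());

	#define HARDIRQ_EXIT()		\
		__irq_exit();		\
		local_irq_enable();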

In contrast, with Frederic's patch, which replaces the irq_enter()
in HARDIRQ_ENTER() with an __irq_enter(), these tests don't ever call
either rcu_irq_enter() or rcu_irq_exit(), which works because the CPU
running the test is already marked as not being in dyntick-idle mode.
This means that the rcu_irq_enter() and rcu_irq_exit() calls remain
balanced (neither is ever made), so RCU has no problem working out
which CPUs are in dyntick-idle mode and which are not.
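
In other words, Frederic's change amounts to a one-line substitution
in the macro paraphrased above:

	/* Paraphrased: __irq_enter() does not call rcu_irq_enter(). */
	#define HARDIRQ_ENTER()		\
		local_irq_disable();	\
		__irq_enter();		\
		WARN_ON(!in_irq());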

The reason that the imbalance was not noticed before the barrier patch
was applied is that the old implementation of rcu_enter_nohz() ignored
the nesting depth.  This could still result in delays, but much shorter
ones.  Whenever there was a delay, RCU would IPI the CPU with the
unbalanced nesting level, which would eventually result in rcu_enter_nohz()
being called, which in turn would force RCU to see that the CPU was in
dyntick-idle mode.

The reason that very few people noticed the problem is that the mismatched
irq_enter() vs. __irq_exit() occurred only when the kernel was built with
CONFIG_DEBUG_LOCKING_API_SELFTESTS.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>

Showing 5 changed files with 67 additions and 89 deletions

Documentation/RCU/trace.txt
... ... @@ -99,18 +99,11 @@
99 99  
100 100 o "dt" is the current value of the dyntick counter that is incremented
101 101 when entering or leaving dynticks idle state, either by the
102   - scheduler or by irq. The number after the "/" is the interrupt
103   - nesting depth when in dyntick-idle state, or one greater than
104   - the interrupt-nesting depth otherwise.
105   -
106   - This field is displayed only for CONFIG_NO_HZ kernels.
107   -
108   -o "dn" is the current value of the dyntick counter that is incremented
109   - when entering or leaving dynticks idle state via NMI. If both
110   - the "dt" and "dn" values are even, then this CPU is in dynticks
111   - idle mode and may be ignored by RCU. If either of these two
112   - counters is odd, then RCU must be alert to the possibility of
113   - an RCU read-side critical section running on this CPU.
  102 + scheduler or by irq. This number is even if the CPU is in
  103 + dyntick idle mode and odd otherwise. The number after the first
  104 + "/" is the interrupt nesting depth when in dyntick-idle state,
  105 + or one greater than the interrupt-nesting depth otherwise.
  106 + The number after the second "/" is the NMI nesting depth.
114 107  
115 108 This field is displayed only for CONFIG_NO_HZ kernels.
116 109  
kernel/rcutree.c
... ... @@ -162,7 +162,7 @@
162 162 #ifdef CONFIG_NO_HZ
163 163 DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
164 164 .dynticks_nesting = 1,
165   - .dynticks = 1,
  165 + .dynticks = ATOMIC_INIT(1),
166 166 };
167 167 #endif /* #ifdef CONFIG_NO_HZ */
168 168  
169 169  
170 170  
... ... @@ -321,13 +321,25 @@
321 321 unsigned long flags;
322 322 struct rcu_dynticks *rdtp;
323 323  
324   - smp_mb(); /* CPUs seeing ++ must see prior RCU read-side crit sects */
325 324 local_irq_save(flags);
326 325 rdtp = &__get_cpu_var(rcu_dynticks);
327   - if (--rdtp->dynticks_nesting == 0)
328   - rdtp->dynticks++;
329   - WARN_ON_ONCE(rdtp->dynticks & 0x1);
  326 + if (--rdtp->dynticks_nesting) {
  327 + local_irq_restore(flags);
  328 + return;
  329 + }
  330 + /* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
  331 + smp_mb__before_atomic_inc(); /* See above. */
  332 + atomic_inc(&rdtp->dynticks);
  333 + smp_mb__after_atomic_inc(); /* Force ordering with next sojourn. */
  334 + WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
330 335 local_irq_restore(flags);
  336 +
  337 + /* If the interrupt queued a callback, get out of dyntick mode. */
  338 + if (in_irq() &&
  339 + (__get_cpu_var(rcu_sched_data).nxtlist ||
  340 + __get_cpu_var(rcu_bh_data).nxtlist ||
  341 + rcu_preempt_needs_cpu(smp_processor_id())))
  342 + set_need_resched();
331 343 }
332 344  
333 345 /*
334 346  
... ... @@ -343,11 +355,16 @@
343 355  
344 356 local_irq_save(flags);
345 357 rdtp = &__get_cpu_var(rcu_dynticks);
346   - rdtp->dynticks++;
347   - rdtp->dynticks_nesting++;
348   - WARN_ON_ONCE(!(rdtp->dynticks & 0x1));
  358 + if (rdtp->dynticks_nesting++) {
  359 + local_irq_restore(flags);
  360 + return;
  361 + }
  362 + smp_mb__before_atomic_inc(); /* Force ordering w/previous sojourn. */
  363 + atomic_inc(&rdtp->dynticks);
  364 + /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
  365 + smp_mb__after_atomic_inc(); /* See above. */
  366 + WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
349 367 local_irq_restore(flags);
350   - smp_mb(); /* CPUs seeing ++ must see later RCU read-side crit sects */
351 368 }
352 369  
353 370 /**
354 371  
... ... @@ -361,11 +378,15 @@
361 378 {
362 379 struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
363 380  
364   - if (rdtp->dynticks & 0x1)
  381 + if (rdtp->dynticks_nmi_nesting == 0 &&
  382 + (atomic_read(&rdtp->dynticks) & 0x1))
365 383 return;
366   - rdtp->dynticks_nmi++;
367   - WARN_ON_ONCE(!(rdtp->dynticks_nmi & 0x1));
368   - smp_mb(); /* CPUs seeing ++ must see later RCU read-side crit sects */
  384 + rdtp->dynticks_nmi_nesting++;
  385 + smp_mb__before_atomic_inc(); /* Force delay from prior write. */
  386 + atomic_inc(&rdtp->dynticks);
  387 + /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
  388 + smp_mb__after_atomic_inc(); /* See above. */
  389 + WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
369 390 }
370 391  
371 392 /**
372 393  
... ... @@ -379,11 +400,14 @@
379 400 {
380 401 struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
381 402  
382   - if (rdtp->dynticks & 0x1)
  403 + if (rdtp->dynticks_nmi_nesting == 0 ||
  404 + --rdtp->dynticks_nmi_nesting != 0)
383 405 return;
384   - smp_mb(); /* CPUs seeing ++ must see prior RCU read-side crit sects */
385   - rdtp->dynticks_nmi++;
386   - WARN_ON_ONCE(rdtp->dynticks_nmi & 0x1);
  406 + /* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
  407 + smp_mb__before_atomic_inc(); /* See above. */
  408 + atomic_inc(&rdtp->dynticks);
  409 + smp_mb__after_atomic_inc(); /* Force delay to next write. */
  410 + WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
387 411 }
388 412  
389 413 /**
... ... @@ -394,13 +418,7 @@
394 418 */
395 419 void rcu_irq_enter(void)
396 420 {
397   - struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
398   -
399   - if (rdtp->dynticks_nesting++)
400   - return;
401   - rdtp->dynticks++;
402   - WARN_ON_ONCE(!(rdtp->dynticks & 0x1));
403   - smp_mb(); /* CPUs seeing ++ must see later RCU read-side crit sects */
  421 + rcu_exit_nohz();
404 422 }
405 423  
406 424 /**
... ... @@ -412,19 +430,7 @@
412 430 */
413 431 void rcu_irq_exit(void)
414 432 {
415   - struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
416   -
417   - if (--rdtp->dynticks_nesting)
418   - return;
419   - smp_mb(); /* CPUs seeing ++ must see prior RCU read-side crit sects */
420   - rdtp->dynticks++;
421   - WARN_ON_ONCE(rdtp->dynticks & 0x1);
422   -
423   - /* If the interrupt queued a callback, get out of dyntick mode. */
424   - if (in_irq() &&
425   - (__this_cpu_read(rcu_sched_data.nxtlist) ||
426   - __this_cpu_read(rcu_bh_data.nxtlist)))
427   - set_need_resched();
  433 + rcu_enter_nohz();
428 434 }
429 435  
430 436 #ifdef CONFIG_SMP
... ... @@ -436,19 +442,8 @@
436 442 */
437 443 static int dyntick_save_progress_counter(struct rcu_data *rdp)
438 444 {
439   - int ret;
440   - int snap;
441   - int snap_nmi;
442   -
443   - snap = rdp->dynticks->dynticks;
444   - snap_nmi = rdp->dynticks->dynticks_nmi;
445   - smp_mb(); /* Order sampling of snap with end of grace period. */
446   - rdp->dynticks_snap = snap;
447   - rdp->dynticks_nmi_snap = snap_nmi;
448   - ret = ((snap & 0x1) == 0) && ((snap_nmi & 0x1) == 0);
449   - if (ret)
450   - rdp->dynticks_fqs++;
451   - return ret;
  445 + rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);
  446 + return 0;
452 447 }
453 448  
454 449 /*
455 450  
... ... @@ -459,16 +454,11 @@
459 454 */
460 455 static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
461 456 {
462   - long curr;
463   - long curr_nmi;
464   - long snap;
465   - long snap_nmi;
  457 + unsigned long curr;
  458 + unsigned long snap;
466 459  
467   - curr = rdp->dynticks->dynticks;
468   - snap = rdp->dynticks_snap;
469   - curr_nmi = rdp->dynticks->dynticks_nmi;
470   - snap_nmi = rdp->dynticks_nmi_snap;
471   - smp_mb(); /* force ordering with cpu entering/leaving dynticks. */
  460 + curr = (unsigned long)atomic_add_return(0, &rdp->dynticks->dynticks);
  461 + snap = (unsigned long)rdp->dynticks_snap;
472 462  
473 463 /*
474 464 * If the CPU passed through or entered a dynticks idle phase with
... ... @@ -478,8 +468,7 @@
478 468 * read-side critical section that started before the beginning
479 469 * of the current RCU grace period.
480 470 */
481   - if ((curr != snap || (curr & 0x1) == 0) &&
482   - (curr_nmi != snap_nmi || (curr_nmi & 0x1) == 0)) {
  471 + if ((curr & 0x1) == 0 || ULONG_CMP_GE(curr, snap + 2)) {
483 472 rdp->dynticks_fqs++;
484 473 return 1;
485 474 }
kernel/rcutree.h
... ... @@ -84,11 +84,9 @@
84 84 * Dynticks per-CPU state.
85 85 */
86 86 struct rcu_dynticks {
87   - int dynticks_nesting; /* Track nesting level, sort of. */
88   - int dynticks; /* Even value for dynticks-idle, else odd. */
89   - int dynticks_nmi; /* Even value for either dynticks-idle or */
90   - /* not in nmi handler, else odd. So this */
91   - /* remains even for nmi from irq handler. */
  87 + int dynticks_nesting; /* Track irq/process nesting level. */
  88 + int dynticks_nmi_nesting; /* Track NMI nesting level. */
  89 + atomic_t dynticks; /* Even value for dynticks-idle, else odd. */
92 90 };
93 91  
94 92 /* RCU's kthread states for tracing. */
... ... @@ -284,7 +282,6 @@
284 282 /* 3) dynticks interface. */
285 283 struct rcu_dynticks *dynticks; /* Shared per-CPU dynticks state. */
286 284 int dynticks_snap; /* Per-GP tracking for dynticks. */
287   - int dynticks_nmi_snap; /* Per-GP tracking for dynticks_nmi. */
288 285 #endif /* #ifdef CONFIG_NO_HZ */
289 286  
290 287 /* 4) reasons this CPU needed to be kicked by force_quiescent_state */
kernel/rcutree_plugin.h
... ... @@ -1520,7 +1520,6 @@
1520 1520 {
1521 1521 int c = 0;
1522 1522 int snap;
1523   - int snap_nmi;
1524 1523 int thatcpu;
1525 1524  
1526 1525 /* Check for being in the holdoff period. */
1527 1526  
... ... @@ -1531,10 +1530,10 @@
1531 1530 for_each_online_cpu(thatcpu) {
1532 1531 if (thatcpu == cpu)
1533 1532 continue;
1534   - snap = per_cpu(rcu_dynticks, thatcpu).dynticks;
1535   - snap_nmi = per_cpu(rcu_dynticks, thatcpu).dynticks_nmi;
  1533 + snap = atomic_add_return(0, &per_cpu(rcu_dynticks,
  1534 + thatcpu).dynticks);
1536 1535 smp_mb(); /* Order sampling of snap with end of grace period. */
1537   - if (((snap & 0x1) != 0) || ((snap_nmi & 0x1) != 0)) {
  1536 + if ((snap & 0x1) != 0) {
1538 1537 per_cpu(rcu_dyntick_drain, cpu) = 0;
1539 1538 per_cpu(rcu_dyntick_holdoff, cpu) = jiffies - 1;
1540 1539 return rcu_needs_cpu_quick_check(cpu);
kernel/rcutree_trace.c
... ... @@ -69,10 +69,10 @@
69 69 rdp->passed_quiesc, rdp->passed_quiesc_completed,
70 70 rdp->qs_pending);
71 71 #ifdef CONFIG_NO_HZ
72   - seq_printf(m, " dt=%d/%d dn=%d df=%lu",
73   - rdp->dynticks->dynticks,
  72 + seq_printf(m, " dt=%d/%d/%d df=%lu",
  73 + atomic_read(&rdp->dynticks->dynticks),
74 74 rdp->dynticks->dynticks_nesting,
75   - rdp->dynticks->dynticks_nmi,
  75 + rdp->dynticks->dynticks_nmi_nesting,
76 76 rdp->dynticks_fqs);
77 77 #endif /* #ifdef CONFIG_NO_HZ */
78 78 seq_printf(m, " of=%lu ri=%lu", rdp->offline_fqs, rdp->resched_ipi);
79 79  
... ... @@ -141,9 +141,9 @@
141 141 rdp->qs_pending);
142 142 #ifdef CONFIG_NO_HZ
143 143 seq_printf(m, ",%d,%d,%d,%lu",
144   - rdp->dynticks->dynticks,
  144 + atomic_read(&rdp->dynticks->dynticks),
145 145 rdp->dynticks->dynticks_nesting,
146   - rdp->dynticks->dynticks_nmi,
  146 + rdp->dynticks->dynticks_nmi_nesting,
147 147 rdp->dynticks_fqs);
148 148 #endif /* #ifdef CONFIG_NO_HZ */
149 149 seq_printf(m, ",%lu,%lu", rdp->offline_fqs, rdp->resched_ipi);
... ... @@ -167,7 +167,7 @@
167 167 {
168 168 seq_puts(m, "\"CPU\",\"Online?\",\"c\",\"g\",\"pq\",\"pqc\",\"pq\",");
169 169 #ifdef CONFIG_NO_HZ
170   - seq_puts(m, "\"dt\",\"dt nesting\",\"dn\",\"df\",");
  170 + seq_puts(m, "\"dt\",\"dt nesting\",\"dt NMI nesting\",\"df\",");
171 171 #endif /* #ifdef CONFIG_NO_HZ */
172 172 seq_puts(m, "\"of\",\"ri\",\"ql\",\"b\",\"ci\",\"co\",\"ca\"\n");
173 173 #ifdef CONFIG_TREE_PREEMPT_RCU