27 May, 2011

6 commits

  • (Note: this was reverted, and is now being re-applied in pieces, with
    this being the fifth and final piece. See below for the reason that
    it is now felt to be safe to re-apply this.)

    Commit d09b62d fixed grace-period synchronization, but left some smp_mb()
    invocations in rcu_process_callbacks() that were no longer needed; only
    sheer paranoia had prevented them from being removed. This commit removes
    them and provides a proof of correctness in their absence. It also adds
    a memory barrier to rcu_report_qs_rsp() immediately before the update to
    rsp->completed in order to handle the theoretical possibility that the
    compiler or CPU might move massive quantities of code into a lock-based
    critical section. This also proves that the sheer paranoia was not
    entirely unjustified, at least from a theoretical point of view.

    In addition, the old dyntick-idle synchronization depended on the fact
    that grace periods were many milliseconds in duration, so that it could
    be assumed that no dyntick-idle CPU could reorder a memory reference
    across an entire grace period. Unfortunately for this design, the
    addition of expedited grace periods breaks this assumption, which has
    the unfortunate side-effect of requiring atomic operations in the
    functions that track dyntick-idle state for RCU. (There is some hope
    that the algorithms used in user-level RCU might be applied here, but
    some work is required to handle the NMIs that user-space applications
    can happily ignore. For the short term, better safe than sorry.)
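
    A rough sketch of what the atomic approach implies (illustrative
    only; the names follow the kernel's rcu_dynticks scheme of this
    era, but the details are simplified): the per-CPU counter is kept
    even while the CPU is in dyntick-idle mode and odd otherwise, so a
    remote CPU can sample it with full ordering and decide whether a
    quiescent state has been passed through.

    static int cpu_passed_quiescent_state(atomic_t *dynticks, int snap)
    {
            int curr;

            /* atomic_add_return(0, ...) reads with full-barrier semantics. */
            curr = atomic_add_return(0, dynticks);

            /*
             * Even value: the CPU is currently in dyntick-idle mode.
             * Changed value: the CPU passed through dyntick-idle mode
             * at least once since the snapshot "snap" was taken.
             */
            return (curr & 0x1) == 0 || curr != snap;
    }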

    This proof assumes that neither compiler nor CPU will allow a lock
    acquisition and release to be reordered, as doing so can result in
    deadlock. The proof is as follows:

    1. A given CPU declares a quiescent state under the protection of
    its leaf rcu_node's lock.

    2. If there is more than one level of rcu_node hierarchy, the
    last CPU to declare a quiescent state will also acquire the
    ->lock of the next rcu_node up in the hierarchy, but only
    after releasing the lower level's lock. The acquisition of this
    lock clearly cannot occur prior to the acquisition of the leaf
    node's lock.

    3. Step 2 repeats until we reach the root rcu_node structure.
    Please note again that only one lock is held at a time through
    this process. The acquisition of the root rcu_node's ->lock
    must occur after the release of that of the leaf rcu_node.

    4. At this point, we set the ->completed field in the rcu_state
    structure in rcu_report_qs_rsp(). However, if the rcu_node
    hierarchy contains only one rcu_node, then in theory the code
    preceding the quiescent state could leak into the critical
    section. We therefore precede the update of ->completed with a
    memory barrier. All CPUs will therefore agree that any updates
    preceding any report of a quiescent state will have happened
    before the update of ->completed.

    5. Regardless of whether a new grace period is needed, rcu_start_gp()
    will propagate the new value of ->completed to all of the leaf
    rcu_node structures, under the protection of each rcu_node's ->lock.
    If a new grace period is needed immediately, this propagation
    will occur in the same critical section that ->completed was
    set in, but courtesy of the memory barrier in #4 above, is still
    seen to follow any pre-quiescent-state activity.

    6. When a given CPU invokes __rcu_process_gp_end(), it becomes
    aware of the end of the old grace period and therefore makes
    any RCU callbacks that were waiting on that grace period eligible
    for invocation.

    If this CPU is the same one that detected the end of the grace
    period, and if there is but a single rcu_node in the hierarchy,
    we will still be in the single critical section. In this case,
    the memory barrier in step #4 guarantees that all callbacks will
    be seen to execute after each CPU's quiescent state.

    On the other hand, if this is a different CPU, it will acquire
    the leaf rcu_node's ->lock, and will again be serialized after
    each CPU's quiescent state for the old grace period.

    On the strength of this proof, this commit therefore removes the memory
    barriers from rcu_process_callbacks() and adds one to rcu_report_qs_rsp().
    The effect is to reduce the number of memory barriers by one and to
    reduce the frequency of execution from about once per scheduling tick
    per CPU to once per grace period.
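
    Sketched in code, the mechanism the proof relies on looks roughly
    as follows. The function names match the kernel's, but the bodies
    are pared down to the locking and the new barrier, so treat this
    as a sketch rather than the actual implementation:

    /* Step 4: the barrier added just before the ->completed update. */
    static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
    {
            smp_mb();  /* Pre-quiescent-state accesses before ->completed. */
            rsp->completed = rsp->gpnum;
            rcu_start_gp(rsp, flags);  /* Releases root rcu_node's ->lock. */
    }

    /*
     * Steps 1-3: report a quiescent state up the rcu_node tree,
     * holding only one ->lock at a time.
     */
    static void rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
                                  struct rcu_node *rnp, unsigned long flags)
    {
            for (;;) {
                    rnp->qsmask &= ~mask;
                    if (rnp->qsmask != 0) {
                            /* Other CPUs still owe quiescent states. */
                            raw_spin_unlock_irqrestore(&rnp->lock, flags);
                            return;
                    }
                    mask = rnp->grpmask;
                    if (rnp->parent == NULL)
                            break;  /* Reached the root rcu_node. */
                    raw_spin_unlock_irqrestore(&rnp->lock, flags);
                    rnp = rnp->parent;
                    raw_spin_lock_irqsave(&rnp->lock, flags);
            }
            rcu_report_qs_rsp(rsp, flags);  /* Called with root ->lock held. */
    }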

    This was reverted due to hangs found during testing by Yinghai Lu and
    Ingo Molnar. Frederic Weisbecker supplied Yinghai with tracing that
    located the underlying problem, and Frederic also provided the fix.

    The underlying problem was that the HARDIRQ_ENTER() macro from
    lib/locking-selftest.c invoked irq_enter(), which in turn invokes
    rcu_irq_enter(), but HARDIRQ_EXIT() invoked __irq_exit(), which
    does not invoke rcu_irq_exit(). This situation resulted in calls
    to rcu_irq_enter() that were not balanced by the required calls to
    rcu_irq_exit(). Therefore, after these locking selftests completed,
    RCU's dyntick-idle nesting count was a large number (for example,
    72), which caused RCU to conclude that the affected CPU was not in
    dyntick-idle mode when in fact it was.

    RCU would therefore incorrectly wait for this dyntick-idle CPU, resulting
    in hangs.
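
    The shape of the mismatch, sketched from the description above
    (simplified, not the exact selftest source):

    #define HARDIRQ_ENTER()                         \
            local_irq_disable();                    \
            irq_enter();    /* calls rcu_irq_enter() */     \
            WARN_ON(!in_irq());

    #define HARDIRQ_EXIT()                          \
            __irq_exit();   /* no rcu_irq_exit() call */    \
            local_irq_enable();

    /*
     * Each simulated hardirq therefore leaves RCU's dyntick-idle
     * nesting count one higher than it found it.
     */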

    In contrast, with Frederic's patch, which replaces the irq_enter()
    in HARDIRQ_ENTER() with an __irq_enter(), these tests don't ever call
    either rcu_irq_enter() or rcu_irq_exit(), which works because the CPU
    running the test is already marked as not being in dyntick-idle mode.
    This means that RCU's dyntick-idle nesting count stays balanced, so
    RCU then has no problem working out which CPUs are in dyntick-idle
    mode and which are not.

    The reason that the imbalance was not noticed before the barrier patch
    was applied is that the old implementation of rcu_enter_nohz() ignored
    the nesting depth. This could still result in delays, but much shorter
    ones. Whenever there was a delay, RCU would IPI the CPU with the
    unbalanced nesting level, which would eventually result in rcu_enter_nohz()
    being called, which in turn would force RCU to see that the CPU was in
    dyntick-idle mode.

    The reason that very few people noticed the problem is that the mismatched
    irq_enter() vs. __irq_exit() occurred only when the kernel was built with
    CONFIG_DEBUG_LOCKING_API_SELFTESTS.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • The old version of rcu_enter_nohz() forced RCU into nohz mode even if
    the nesting count was non-zero. This change causes rcu_enter_nohz()
    to hold off for non-zero nesting counts.
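
    A minimal sketch of the new behavior, assuming the per-CPU
    rcu_dynticks bookkeeping described in the first entry above
    (simplified; warnings and error checks elided):

    void rcu_enter_nohz(void)
    {
            unsigned long flags;
            struct rcu_dynticks *rdtp;

            local_irq_save(flags);
            rdtp = &__get_cpu_var(rcu_dynticks);
            if (--rdtp->dynticks_nesting) {
                    /* Still nested: hold off, stay out of nohz mode. */
                    local_irq_restore(flags);
                    return;
            }
            /* Outermost exit: really enter dyntick-idle mode. */
            smp_mb__before_atomic_inc();
            atomic_inc(&rdtp->dynticks);    /* Counter now even: idle. */
            smp_mb__after_atomic_inc();
            local_irq_restore(flags);
    }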

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Condition the set_need_resched() in rcu_irq_exit() on in_irq(). This
    should be a no-op, because rcu_irq_exit() should only be called from irq.
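
    A hedged sketch of the conditional (simplified; the real function
    also updates dyntick state, and the callback check is condensed
    here into a single illustrative rcu_needs_cpu() test):

    void rcu_irq_exit(void)
    {
            /* ... dyntick-idle nesting bookkeeping elided ... */

            /*
             * If the interrupt queued RCU callbacks, force this CPU
             * out of dyntick-idle mode, but only when we really are
             * being called from irq context.
             */
            if (in_irq() && rcu_needs_cpu(smp_processor_id()))
                    set_need_resched();
    }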

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Second step of partitioning of commit e59fb3120b.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Add the memory barriers added by e59fb3120b.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • HARDIRQ_ENTER() maps to irq_enter() which calls rcu_irq_enter().
    But HARDIRQ_EXIT() maps to __irq_exit() which doesn't call
    rcu_irq_exit().

    So every locking selftest that simulates a hardirq creates an
    imbalance in RCU's extended-quiescent-state bookkeeping.

    As a result, after the first missing rcu_irq_exit(), the nesting
    count never returns to zero, so leaving subsequent interrupt
    handlers never returns the CPU to dyntick-idle mode. This means
    that RCU won't see the affected CPU as being in an extended
    quiescent state, resulting in long grace-period delays (as in
    grace periods extending for hours).

    To fix this, just use __irq_enter() to simulate the hardirq
    context. This is sufficient for the locking selftests as we
    don't need to exit any extended quiescent state or perform
    any check that irqs normally do when they wake up from idle.
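
    In sketch form, the substitution (simplified, not the exact
    selftest source):

    #define HARDIRQ_ENTER()                         \
            local_irq_disable();                    \
            __irq_enter();  /* hardirq context, no rcu_irq_enter() */ \
            WARN_ON(!in_irq());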

    As a side effect, this patch makes it possible to restore
    "rcu: Decrease memory-barrier usage based on semi-formal proof",
    which eventually helped find this bug.

    Reported-and-tested-by: Yinghai Lu
    Signed-off-by: Frederic Weisbecker
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Stable
    Signed-off-by: Paul E. McKenney

    Frederic Weisbecker
     

20 May, 2011

1 commit


08 May, 2011

33 commits