12 Dec, 2011

15 commits

  • Both TINY_RCU's and TREE_RCU's implementations of rcu_boost() access
    the ->boost_tasks and ->exp_tasks fields without preventing concurrent
    changes to these fields. This commit therefore applies ACCESS_ONCE in
    order to prevent compiler mischief.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
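
A minimal userspace sketch of the idea in the commit above, not the kernel's code: the macro below mirrors the well-known volatile-cast form of ACCESS_ONCE() and relies on the GNU `__typeof__` extension. The `rnp_sketch` structure and `boost_needed()` helper are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for ACCESS_ONCE(): forces exactly one load or store. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

struct list_node {
    struct list_node *next;
};

/* Hypothetical, trimmed-down node holding only the two fields discussed. */
struct rnp_sketch {
    struct list_node *exp_tasks;
    struct list_node *boost_tasks;
};

/*
 * Each field is loaded once into a local, so the compiler cannot re-read
 * it after the NULL check while another context updates it.
 */
static int boost_needed(struct rnp_sketch *rnp)
{
    struct list_node *exp = ACCESS_ONCE(rnp->exp_tasks);
    struct list_node *boost = ACCESS_ONCE(rnp->boost_tasks);

    return exp != NULL || boost != NULL;
}
```

The volatile access constrains only the compiler; it provides no ordering against other CPUs, which matches the stated goal of preventing "compiler mischief".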
     
  • This reverts commit 5342e269b2b58ee0b0b4168a94087faaa60d0567.

    The approach taken in this patch was deemed too abusive to mutexes,
    and thus too likely to result in maintenance problems in the future.
    Instead, we will disallow RCU read-side critical sections that partially
    overlap with interrupt-disabled code segments.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • If there are other CPUs active at a given point in time, then there is a
    limit to what a given CPU can do to advance the current RCU grace period.
    Beyond this limit, attempting to force the RCU grace period forward will
    do nothing but burn CPU cycles and consume energy.

    Therefore, this commit takes an adaptive approach to RCU_FAST_NO_HZ
    preparations for idle. It pushes the RCU core state machine for
    two cycles unconditionally, and then it will push from zero to three
    additional cycles, but only as long as the RCU core has work for this
    CPU to do immediately. The rcu_pending() function is used to check
    whether the RCU core has such work.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
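
The adaptive loop described above can be sketched in userspace as follows. This is a model under stated assumptions, not the kernel implementation: `cpu_model`, `model_rcu_pending()`, and `model_push_core()` are hypothetical stand-ins for the per-CPU state, rcu_pending(), and one cycle of the RCU core state machine.

```c
#include <assert.h>

#define UNCONDITIONAL_PASSES 2  /* always pushed, per the commit text */
#define MAX_EXTRA_PASSES     3  /* pushed only while work remains */

/* Hypothetical model: units of RCU core work this CPU can do right now. */
struct cpu_model {
    int immediate_work;
};

/* Stand-in for rcu_pending(): does the RCU core need this CPU now? */
static int model_rcu_pending(const struct cpu_model *cpu)
{
    return cpu->immediate_work > 0;
}

/* Stand-in for one cycle of the RCU core state machine. */
static void model_push_core(struct cpu_model *cpu)
{
    if (cpu->immediate_work > 0)
        cpu->immediate_work--;
}

/* Returns how many cycles were spent before giving up on pushing. */
static int prepare_for_idle(struct cpu_model *cpu)
{
    int passes = 0;
    int i;

    for (i = 0; i < UNCONDITIONAL_PASSES; i++) {
        model_push_core(cpu);
        passes++;
    }
    /* Zero to three more cycles, only while there is immediate work. */
    for (i = 0; i < MAX_EXTRA_PASSES && model_rcu_pending(cpu); i++) {
        model_push_core(cpu);
        passes++;
    }
    return passes;
}
```

The point of the bound is that extra cycles are spent only when they can make progress; a CPU with nothing to do stops after the two unconditional passes.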
     
  • The rcu_do_batch() function that invokes callbacks for TREE_RCU and
    TREE_PREEMPT_RCU normally throttles callback invocation to avoid degrading
    scheduling latency. However, as long as the CPU would otherwise be idle,
    there is no downside to continuing to invoke any callbacks that have passed
    through their grace periods. In fact, processing such callbacks in a
    timely manner has the benefit of increasing the probability that the
    CPU can enter the power-saving dyntick-idle mode.

    Therefore, this commit allows callback invocation to continue beyond the
    preset limit as long as the scheduler does not have some other task to
    run and as long as context is that of the idle task or the relevant
    RCU kthread.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
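
A small sketch of the throttling rule described above, assuming a simplified model in which callbacks are just counted. The `batch_ctx` structure and its field names are hypothetical; in the kernel the corresponding inputs come from the scheduler and from the current task's identity.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical inputs to one batch of callback invocation. */
struct batch_ctx {
    int  ready_callbacks;      /* callbacks whose grace period has ended */
    int  blimit;               /* preset per-batch limit */
    bool other_task_runnable;  /* scheduler has something else to run */
    bool idle_or_rcu_kthread;  /* running as idle task or RCU kthread */
};

/* Returns the number of callbacks invoked in this batch. */
static int do_batch(const struct batch_ctx *ctx)
{
    int invoked = 0;

    while (invoked < ctx->ready_callbacks) {
        invoked++;  /* stands in for invoking one callback */

        /*
         * Past the limit, keep going only if nothing else wants the
         * CPU and the context is the idle task or an RCU kthread.
         */
        if (invoked >= ctx->blimit &&
            (ctx->other_task_runnable || !ctx->idle_or_rcu_kthread))
            break;
    }
    return invoked;
}
```

Checking the limit after each invocation means the normal throttle still applies the moment another task becomes runnable.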
     
  • The current implementation of RCU_FAST_NO_HZ prevents CPUs from entering
    dyntick-idle state if they have RCU callbacks pending. Unfortunately,
    this has the side-effect of often preventing them from entering this
    state, especially if at least one other CPU is not in dyntick-idle state.
    However, the resulting per-tick wakeup is wasteful in many cases: if the
    CPU has already fully responded to the current RCU grace period, there
    will be nothing for it to do until this grace period ends, which will
    frequently take several jiffies.

    This commit therefore permits a CPU that has done everything that the
    current grace period has asked of it (rcu_pending() == 0) to enter
    dyntick-idle state even if it still has RCU callbacks pending. However,
    such a CPU posts a timer to wake it up several jiffies later (6 jiffies,
    based on experience with grace-period lengths).
    This wakeup is required to handle situations
    that can result in all CPUs being in dyntick-idle mode, thus failing
    to ever complete the current grace period. If a CPU wakes up before
    the timer goes off, then it cancels that timer, thus avoiding spurious
    wakeups.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
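
The idle-entry decision described above can be sketched as a pure function. This is an illustrative model only: `idle_decision`, `decide_idle()`, and the `IDLE_GP_DELAY` name are hypothetical, and only the 6-jiffy value is taken from the commit text.

```c
#include <assert.h>
#include <stdbool.h>

#define IDLE_GP_DELAY 6UL  /* jiffies; the value quoted in the commit text */

/* Hypothetical result of the idle-entry decision. */
struct idle_decision {
    bool enter_dyntick_idle;
    bool timer_armed;
    unsigned long timer_expires;  /* meaningful only if timer_armed */
};

static struct idle_decision decide_idle(bool rcu_pending_now,
                                        bool callbacks_queued,
                                        unsigned long now)
{
    struct idle_decision d = { false, false, 0 };

    if (rcu_pending_now)
        return d;  /* RCU still needs this CPU: no dyntick-idle yet */

    d.enter_dyntick_idle = true;
    if (callbacks_queued) {
        /* Guard against every CPU sleeping through the grace period. */
        d.timer_armed = true;
        d.timer_expires = now + IDLE_GP_DELAY;
    }
    return d;
}
```

A CPU that wakes for some other reason before `timer_expires` would cancel the timer, which this sketch leaves out.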
     
  • Re-enable interrupts across calls to quiescent-state functions and
    also across force_quiescent_state() to reduce latency.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • With the new implementation of RCU_FAST_NO_HZ, it was possible to hang
    RCU grace periods as follows:

    o CPU 0 attempts to go idle, cycles several times through the
    rcu_prepare_for_idle() loop, then goes dyntick-idle when
    RCU needs nothing more from it, while still having at least
    one RCU callback pending.

    o CPU 1 goes idle with no callbacks.

    Both CPUs can then stay in dyntick-idle mode indefinitely, preventing
    the RCU grace period from ever completing, possibly hanging the system.

    This commit therefore prevents CPUs that have RCU callbacks from entering
    dyntick-idle mode. This approach also eliminates the need for the
    end-of-grace-period IPIs used previously.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • If a CPU enters dyntick-idle mode with callbacks pending, it will need
    an IPI at the end of the grace period. However, if it exits dyntick-idle
    mode before the grace period ends, it will be needlessly IPIed at the
    end of the grace period.

    Therefore, this commit clears the per-CPU rcu_awake_at_gp_end flag
    when a CPU determines that it does not need it. This in turn requires
    disabling interrupts across much of rcu_prepare_for_idle() in order to
    avoid having nested interrupts clearing this state out from under us.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The earlier version would attempt to push callbacks through five times
    before going into dyntick-idle mode if callbacks remained, but the CPU
    had done all that it needed to do for the current RCU grace periods.
    This is wasteful: In most cases, once the CPU has done all that it
    needs to for the current RCU grace periods, it will make no further
    progress on the callbacks no matter how many times it loops through
    the RCU core processing and the idle-entry code.

    This commit therefore goes to dyntick-idle mode whenever the current
    CPU has done all it can for the current grace period.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit adds trace_rcu_prep_idle(), which is invoked from
    rcu_prepare_for_idle() and rcu_wake_cpu() to trace attempts on
    the part of RCU to force CPUs into dyntick-idle mode.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, RCU does not permit a CPU to enter dyntick-idle mode if that
    CPU has any RCU callbacks queued. This means that workloads for which
    each CPU wakes up and does some RCU updates every few ticks will never
    enter dyntick-idle mode. This can result in significant unnecessary power
    consumption, so this patch permits a given CPU to enter dyntick-idle mode
    if it has callbacks, but only if that same CPU has completed all current
    work for the RCU core. We use rcu_pending() to determine whether a given
    CPU has completed all current work for the RCU core.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Empty void functions do not need "return", so this commit removes it
    from rcu_report_exp_rnp().

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Paul E. McKenney

    Thomas Gleixner
     
  • When setting up an expedited grace period, if there were no readers, the
    task will awaken itself. This commit removes this useless self-awakening.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Paul E. McKenney

    Thomas Gleixner
     
  • When synchronize_sched_expedited() takes its second and subsequent
    snapshots of sync_sched_expedited_started, it subtracts 1. This
    means that even if the concurrent caller of synchronize_sched_expedited()
    that incremented to that value sees our successful completion, it
    will not be able to take advantage of it. This restriction is
    pointless, given that our full expedited grace period would have
    happened after the other guy started, and thus should be able to
    serve as a proxy for the other guy successfully executing
    try_stop_cpus().

    This commit therefore removes the subtraction of 1.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Because rcu_read_unlock_special() samples rcu_preempted_readers_exp(rnp)
    after dropping rnp->lock, the following sequence of events is possible:

    1. Task A exits its RCU read-side critical section, and removes
    itself from the ->blkd_tasks list, releases rnp->lock, and is
    then preempted. Task B remains on the ->blkd_tasks list, and
    blocks the current expedited grace period.

    2. Task B exits from its RCU read-side critical section and removes
    itself from the ->blkd_tasks list. Because it is the last task
    blocking the current expedited grace period, it ends that
    expedited grace period.

    3. Task A resumes, and samples rcu_preempted_readers_exp(rnp) which
    of course indicates that nothing is blocking the nonexistent
    expedited grace period. Task A is again preempted.

    4. Some other CPU starts an expedited grace period. There are several
    tasks blocking this expedited grace period queued on the
    same rcu_node structure that Task A was using in step 1 above.

    5. Task A examines its state and incorrectly concludes that it was
    the last task blocking the expedited grace period on the current
    rcu_node structure. It therefore reports completion up the
    rcu_node tree.

    6. The expedited grace period can then incorrectly complete before
    the tasks blocked on this same rcu_node structure exit their
    RCU read-side critical sections. Arbitrarily bad things happen.

    This commit therefore takes a snapshot of rcu_preempted_readers_exp(rnp)
    prior to dropping the lock, so that only the last task thinks that it is
    the last task, thus avoiding the failure scenario laid out above.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
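
The fix in the commit above comes down to sampling the "is anything still blocking?" condition while the lock is still held. The following single-threaded sketch models that ordering; `node_sketch` and `reader_exit()` are hypothetical names, and the boolean "lock" only checks that accesses happen in the locked region.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for one rcu_node structure. */
struct node_sketch {
    bool locked;        /* models rnp->lock */
    int  exp_blockers;  /* tasks blocking the expedited grace period */
};

static void node_lock(struct node_sketch *n)
{
    assert(!n->locked);
    n->locked = true;
}

static void node_unlock(struct node_sketch *n)
{
    assert(n->locked);
    n->locked = false;
}

/* Returns true if this caller must report completion up the tree. */
static bool reader_exit(struct node_sketch *n)
{
    bool empty_before, empty_now;

    node_lock(n);
    empty_before = (n->exp_blockers == 0);
    if (!empty_before)
        n->exp_blockers--;              /* remove ourselves from the list */
    empty_now = (n->exp_blockers == 0); /* snapshot taken under the lock */
    node_unlock(n);

    /* Decide from the snapshots, never from a re-read after unlock. */
    return !empty_before && empty_now;
}
```

Because both snapshots are taken inside the locked region, a later expedited grace period that queues new tasks on the same node cannot change the answer for a task that already left.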
     

29 Sep, 2011

14 commits

  • The purpose of rcu_needs_cpu_flush() was to iterate on pushing the
    current grace period in order to help the current CPU enter dyntick-idle
    mode. However, this can result in failures if the CPU starts entering
    dyntick-idle mode, but then backs out. In this case, the call to
    rcu_pending() from rcu_needs_cpu_flush() might end up announcing a
    non-existing quiescent state.

    This commit therefore removes rcu_needs_cpu_flush() in favor of letting
    the dyntick-idle machinery at the end of the softirq handler push the
    loop along via its call to rcu_pending().

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • RCU boost threads start life at RCU_BOOST_PRIO, while others remain
    at RCU_KTHREAD_PRIO. While here, change thread names to match other
    kthreads, and adjust rcu_yield() to not override the priority set by
    the user. This last change sets the stage for runtime changes to
    priority in the -rt tree.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Paul E. McKenney

    Mike Galbraith
     
  • Create a separate lockdep class for the rt_mutex used for RCU priority
    boosting and enable use of rt_mutex_lock() with irqs disabled. This
    prevents RCU priority boosting from falling prey to deadlocks when
    someone begins an RCU read-side critical section in preemptible state,
    but releases it with an irq-disabled lock held.

    Unfortunately, the scheduler's runqueue and priority-inheritance locks
    still must either completely enclose or be completely enclosed by any
    overlapping RCU read-side critical section.

    This version removes a redundant local_irq_restore() noted by
    Yong Zhang.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • It is possible for an RCU CPU stall to end just as it is detected, in
    which case the current code will uselessly dump all CPUs' stacks.
    This commit therefore checks for this condition and refrains from
    sending needless NMIs.

    And yes, the stall might also end just after we checked all CPUs and
    tasks, but in that case we would at least have given some clue as
    to which CPU/task was at fault.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Commit 7765be (Fix RCU_BOOST race handling current->rcu_read_unlock_special)
    introduced a new ->rcu_boosted field in the task structure. This is
    redundant because the existing ->rcu_boost_mutex will be non-NULL at
    any time that ->rcu_boosted is nonzero. Therefore, this commit removes
    ->rcu_boosted and tests ->rcu_boost_mutex instead.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • We only need to constrain the compiler if we are actually exiting
    the top-level RCU read-side critical section. This commit therefore
    moves the first barrier() call in __rcu_read_unlock() to inside the
    "if" statement, thus avoiding needless register flushes for inner
    rcu_read_unlock() calls.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • There is often a delay between the time that a CPU passes through a
    quiescent state and the time that this quiescent state is reported to the
    RCU core. It is quite possible that the grace period ended before the
    quiescent state could be reported, for example, some other CPU might have
    deduced that this CPU passed through dyntick-idle mode. It is critically
    important that quiescent state be counted only against the grace period
    that was in effect at the time that the quiescent state was detected.

    Previously, this was handled by recording the number of the last grace
    period to complete when passing through a quiescent state. The RCU
    core then checks this number against the current value, and rejects
    the quiescent state if there is a mismatch. However, one additional
    possibility must be accounted for, namely that the quiescent state was
    recorded after the prior grace period completed but before the current
    grace period started. In this case, the RCU core must reject the
    quiescent state, but the recorded number will match. This is handled
    when the CPU becomes aware of a new grace period -- at that point,
    it invalidates any prior quiescent state.

    This works, but is a bit indirect. The new approach records the current
    grace period, and the RCU core checks to see (1) that this is still the
    current grace period and (2) that this grace period has not yet ended.
    This approach simplifies reasoning about correctness, and this commit
    changes over to this new approach.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
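
The two-part check described above can be sketched as follows. The structures are hypothetical simplifications: `gp_state` models the RCU core's view (a grace period is in progress when `gpnum` differs from `completed`), and `cpu_qs` models the per-CPU record made when a quiescent state is detected.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the RCU core's grace-period counters. */
struct gp_state {
    unsigned long gpnum;      /* number of the current grace period */
    unsigned long completed;  /* number of the last one to complete */
};

/* Hypothetical per-CPU quiescent-state record. */
struct cpu_qs {
    bool passed_quiesce;
    unsigned long passed_quiesce_gpnum;  /* grace period current at QS time */
};

/* Record a quiescent state against the grace period the CPU knows about. */
static void note_qs(struct cpu_qs *c, unsigned long cpu_view_gpnum)
{
    c->passed_quiesce = true;
    c->passed_quiesce_gpnum = cpu_view_gpnum;
}

/* Accept only if (1) same grace period and (2) it has not yet ended. */
static bool qs_counts(const struct cpu_qs *c, const struct gp_state *gp)
{
    return c->passed_quiesce &&
           c->passed_quiesce_gpnum == gp->gpnum &&
           gp->completed != gp->gpnum;
}
```

Note that a quiescent state recorded between grace periods is rejected once the next grace period starts, without any separate invalidation step.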
     
  • Add trace events to record grace-period start and end, quiescent states,
    CPUs noticing grace-period start and end, grace-period initialization,
    call_rcu() invocation, tasks blocking in RCU read-side critical sections,
    tasks exiting those same critical sections, force_quiescent_state()
    detection of dyntick-idle and offline CPUs, CPUs entering and leaving
    dyntick-idle mode (except from NMIs), CPUs coming online and going
    offline, and CPUs being kicked for staying in dyntick-idle mode for too
    long (as in many weeks, even on 32-bit systems).

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    rcu: Add the rcu flavor to callback trace events

    The earlier trace events for registering RCU callbacks and for invoking
    them did not include the RCU flavor (rcu_bh, rcu_preempt, or rcu_sched).
    This commit adds the RCU flavor to those trace events.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Add event-trace markers to TREE_RCU kthreads to allow including these
    kthreads' CPU time in the utilization calculations.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • We now have kthreads only for flavors of RCU that support boosting,
    so update the now-misleading comments accordingly.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • In order to allow event tracing to distinguish between flavors of
    RCU, we need those names in the relevant RCU data structures. TINY_RCU
    has avoided them for memory-footprint reasons, so add them only if
    CONFIG_RCU_TRACE=y.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Pull the code that waits for an RCU grace period into a single function,
    which is then called by synchronize_rcu() and friends in the case of
    TREE_RCU and TREE_PREEMPT_RCU, and from rcu_barrier() and friends in
    the case of TINY_RCU and TINY_PREEMPT_RCU.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • There are a number of cases where RCU can find additional work
    for the per-CPU kthread within the context of that per-CPU kthread.
    In such cases, the per-CPU kthread is already running, so attempting
    to wake itself up does nothing except waste CPU cycles. This commit
    therefore checks to see if it is in the per-CPU kthread context,
    omitting the wakeup in this case.

    Signed-off-by: Shaohua Li
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Shaohua Li
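
The check described in the commit above can be sketched in a few lines. All names here are hypothetical: `thread_sketch` stands in for a task, `kthread_slot` for the per-CPU bookkeeping, and the `wakeups` counter replaces the real call that wakes the kthread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task. */
struct thread_sketch {
    const char *name;
};

/* Hypothetical per-CPU bookkeeping for the RCU kthread. */
struct kthread_slot {
    const struct thread_sketch *kthread;  /* NULL until it is created */
    int has_work;
    int wakeups;                          /* counts would-be wakeups */
};

/* Flag new work; wake the kthread unless the caller is that kthread. */
static void invoke_cpu_kthread(struct kthread_slot *slot,
                               const struct thread_sketch *current_task)
{
    slot->has_work = 1;
    if (slot->kthread != NULL && slot->kthread != current_task)
        slot->wakeups++;  /* stands in for waking the kthread */
}
```

Setting `has_work` unconditionally matters: when the kthread itself finds the work, it is already running and will see the flag on its next loop iteration.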
     
  • Commit a26ac2455ffc (move TREE_RCU from softirq to kthread) added
    per-CPU kthreads. However, kthread creation uses kthread_create(), which
    can put the kthread's stack and task struct on the wrong NUMA node.
    Therefore, use kthread_create_on_node() instead of kthread_create()
    so that the stacks and task structs are placed on the correct NUMA node.

    A similar change was carried out in commit 94dcf29a11b3 (kthread:
    use kthread_create_on_node()).

    Also change rcutorture's priority-boost-test kthread creation.

    Signed-off-by: Eric Dumazet
    CC: Tejun Heo
    CC: Rusty Russell
    CC: Andrew Morton
    CC: Andi Kleen
    CC: Ingo Molnar
    Signed-off-by: Paul E. McKenney

    Eric Dumazet
     

21 Jul, 2011

2 commits

  • The rcu_read_unlock_special() function relies on in_irq() to exclude
    scheduler activity from interrupt level. This fails because irq_exit()
    can invoke the scheduler after clearing the preempt_count() bits that
    in_irq() uses to determine that it is at interrupt level. This situation
    can result in failures as follows:

    $task                   IRQ             SoftIRQ

    rcu_read_lock()

    /* do stuff */

    <preempt> |= UNLOCK_BLOCKED

    rcu_read_unlock()
      --t->rcu_read_lock_nesting

                            irq_enter();
                            /* do stuff, don't use RCU */
                            irq_exit();
                              sub_preempt_count(IRQ_EXIT_OFFSET);
                              invoke_softirq()

                                            ttwu();
                                              spin_lock_irq(&pi->lock)
                                              rcu_read_lock();
                                              /* do stuff */
                                              rcu_read_unlock();
                                                rcu_read_unlock_special()
                                                  rcu_report_exp_rnp()
                                                    ttwu()
                                                      spin_lock_irq(&pi->lock) /* deadlock */

      rcu_read_unlock_special(t);

    Ed can trigger this easily because invoke_softirq() immediately
    does a ttwu() of ksoftirqd/# instead of doing the in-place softirq stuff
    first, but the above can happen even without that.

    Cure this by also excluding softirqs from the
    rcu_read_unlock_special() handler and ensuring the force_irqthreads
    ksoftirqd/# wakeup is done from full softirq context.

    [ Alternatively, delaying the ->rcu_read_lock_nesting decrement
    until after the special handling would make the thing more robust
    in the face of interrupts as well. And there is a separate patch
    for that. ]

    Cc: Thomas Gleixner
    Reported-and-tested-by: Ed Tomlinson
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney

    Peter Zijlstra
     
  • The addition of RCU read-side critical sections within runqueue and
    priority-inheritance lock critical sections introduced some deadlock
    cycles, for example, involving interrupts from __rcu_read_unlock()
    where the interrupt handlers call wake_up(). This situation can cause
    the instance of __rcu_read_unlock() invoked from interrupt to do some
    of the processing that would otherwise have been carried out by the
    task-level instance of __rcu_read_unlock(). When the interrupt-level
    instance of __rcu_read_unlock() is called with a scheduler lock held
    from interrupt-entry/exit situations where in_irq() returns false,
    deadlock can result.

    This commit resolves these deadlocks by using negative values of
    the per-task ->rcu_read_lock_nesting counter to indicate that an
    instance of __rcu_read_unlock() is in flight, which in turn prevents
    instances from interrupt handlers from doing any special processing.
    This patch is inspired by Steven Rostedt's earlier patch that similarly
    made __rcu_read_unlock() guard against interrupt-mediated recursion
    (see https://lkml.org/lkml/2011/7/15/326), but this commit refines
    Steven's approach to avoid the need for preemption disabling on the
    __rcu_read_unlock() fastpath and to also avoid the need for manipulating
    a separate per-CPU variable.

    This patch avoids the need for preempt_disable() by instead using negative
    values of the per-task ->rcu_read_lock_nesting counter. Note that nested
    rcu_read_lock()/rcu_read_unlock() pairs are still permitted, but they will
    never see ->rcu_read_lock_nesting go to zero, and will therefore never
    invoke rcu_read_unlock_special(), thus preventing them from seeing the
    RCU_READ_UNLOCK_BLOCKED bit should it be set in ->rcu_read_unlock_special.
    This patch also adds a check for ->rcu_read_lock_nesting being negative
    in rcu_check_callbacks(), thus preventing the RCU_READ_UNLOCK_NEED_QS
    bit from being set should a scheduling-clock interrupt occur while
    __rcu_read_unlock() is exiting from an outermost RCU read-side critical
    section.

    Of course, __rcu_read_unlock() can be preempted during the time that
    ->rcu_read_lock_nesting is negative. This could result in the setting
    of the RCU_READ_UNLOCK_BLOCKED bit after __rcu_read_unlock() checks it,
    and would also result in this task being queued on the corresponding
    rcu_node structure's blkd_tasks list. Therefore, some later RCU read-side
    critical section would enter rcu_read_unlock_special() to clean up --
    which could result in deadlock if that critical section happened to be in
    the scheduler where the runqueue or priority-inheritance locks were held.

    This situation is dealt with by making rcu_preempt_note_context_switch()
    check for negative ->rcu_read_lock_nesting, thus refraining from
    queuing the task (and from setting RCU_READ_UNLOCK_BLOCKED) if we are
    already exiting from the outermost RCU read-side critical section (in
    other words, we really are no longer actually in that RCU read-side
    critical section). In addition, rcu_preempt_note_context_switch()
    invokes rcu_read_unlock_special() to carry out the cleanup in this case,
    which clears out the ->rcu_read_unlock_special bits and dequeues the task
    (if necessary), in turn avoiding needless delay of the current RCU grace
    period and needless RCU priority boosting.

    It is still illegal to call rcu_read_unlock() while holding a scheduler
    lock if the prior RCU read-side critical section has ever had either
    preemption or irqs enabled. However, the common use case is legal,
    namely where the entire RCU read-side critical section executes with
    irqs disabled, for example, when the scheduler lock is held across the
    entire lifetime of the RCU read-side critical section.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
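
The negative-nesting technique described above can be modeled in userspace as follows. This is a sketch under stated assumptions, not the kernel function: `task_sketch` and the `sketch_*` helpers are hypothetical, the two instrumentation fields exist only so the behavior can be observed, and `barrier()` is written as the usual GCC/Clang empty asm with a memory clobber.

```c
#include <assert.h>
#include <limits.h>

/* Compiler-only barrier (GCC/Clang syntax). */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical per-task state. */
struct task_sketch {
    int nesting;                  /* models ->rcu_read_lock_nesting */
    int unlock_special;           /* models ->rcu_read_unlock_special */
    int special_calls;            /* instrumentation for this sketch */
    int nesting_seen_in_special;  /* instrumentation for this sketch */
};

/* Stand-in for the slow path that cleans up after preemption. */
static void sketch_unlock_special(struct task_sketch *t)
{
    t->special_calls++;
    t->nesting_seen_in_special = t->nesting;
    t->unlock_special = 0;
}

static void sketch_read_lock(struct task_sketch *t)
{
    t->nesting++;
    barrier();  /* keep the critical section after the increment */
}

static void sketch_read_unlock(struct task_sketch *t)
{
    if (t->nesting != 1) {
        --t->nesting;          /* inner unlock: never reaches zero here */
    } else {
        barrier();             /* critical section stays before this */
        t->nesting = INT_MIN;  /* outermost unlock is now "in flight" */
        barrier();             /* assign before loading the special bits */
        if (t->unlock_special)
            sketch_unlock_special(t);
        barrier();             /* special handling before final assign */
        t->nesting = 0;
    }
}
```

While `nesting` is negative, a nested lock/unlock pair from an interrupt handler only moves the counter between INT_MIN and INT_MIN + 1, so it takes the inner-unlock branch and never enters the special handling.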
     

20 Jul, 2011

3 commits

  • Given some common flag combinations, particularly -Os, gcc will inline
    rcu_read_unlock_special() despite its being in an unlikely() clause.
    Use noinline to prohibit this misoptimization.

    In addition, move the second barrier() in __rcu_read_unlock() so that
    it is not on the common-case code path. This will allow the compiler to
    generate better code for the common-case path through __rcu_read_unlock().

    Suggested-by: Linus Torvalds
    Signed-off-by: Paul E. McKenney
    Acked-by: Mathieu Desnoyers

    Paul E. McKenney
     
  • The RCU_BOOST commits for TREE_PREEMPT_RCU introduced an other-task
    write to a new RCU_READ_UNLOCK_BOOSTED bit in the task_struct structure's
    ->rcu_read_unlock_special field, but, as noted by Steven Rostedt, without
    correctly synchronizing all accesses to ->rcu_read_unlock_special.
    This could result in bits in ->rcu_read_unlock_special being spuriously
    set and cleared due to conflicting accesses, which in turn could result
    in deadlocks between the rcu_node structure's ->lock and the scheduler's
    rq and pi locks. These deadlocks would result from RCU incorrectly
    believing that the just-ended RCU read-side critical section had been
    preempted and/or boosted. If that RCU read-side critical section was
    executed with either rq or pi locks held, RCU's ensuing (incorrect)
    calls to the scheduler would cause the scheduler to attempt to once
    again acquire the rq and pi locks, resulting in deadlock. More complex
    deadlock cycles are also possible, involving multiple rq and pi locks
    as well as locks from multiple rcu_node structures.

    This commit fixes synchronization by creating a ->rcu_boosted field in
    task_struct that is accessed and modified only when holding the ->lock
    in the rcu_node structure on which the task is queued (on that rcu_node
    structure's ->blkd_tasks list). This results in tasks accessing only
    their own current->rcu_read_unlock_special fields, making unsynchronized
    access once again legal, and keeping the rcu_read_unlock() fastpath free
    of atomic instructions and memory barriers.

    The reason that the rcu_read_unlock() fastpath does not need to access
    the new current->rcu_boosted field is that this new field cannot
    be non-zero unless the RCU_READ_UNLOCK_BLOCKED bit is set in the
    current->rcu_read_unlock_special field. Therefore, rcu_read_unlock()
    need only test current->rcu_read_unlock_special: if that is zero, then
    current->rcu_boosted must also be zero.

    This bug does not affect TINY_PREEMPT_RCU because this implementation
    of RCU accesses current->rcu_read_unlock_special with irqs disabled,
    thus preventing races on the !SMP systems that TINY_PREEMPT_RCU runs on.

    Maybe-reported-by: Dave Jones
    Maybe-reported-by: Sergey Senozhatsky
    Reported-by: Steven Rostedt
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Steven Rostedt

    Paul E. McKenney
     
  • PREEMPT_RCU read-side critical sections blocking an expedited grace
    period invoke rcu_report_exp_rnp(). When the last such critical section
    has completed, rcu_report_exp_rnp() invokes the scheduler to wake up the
    task that invoked synchronize_rcu_expedited() -- needlessly holding the
    root rcu_node structure's lock while doing so, thus needlessly providing
    a way for RCU and the scheduler to deadlock.

    This commit therefore releases the root rcu_node structure's lock before
    calling wake_up().

    Reported-by: Ed Tomlinson
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
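
The ordering change in the commit above, deciding under the lock but waking only after releasing it, can be sketched as below. The names are hypothetical, the boolean "lock" is a single-threaded stand-in for the root rcu_node structure's lock, and the `woke_with_lock_held` flag exists only to make the ordering checkable.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the root rcu_node structure. */
struct root_sketch {
    bool locked;               /* models the root ->lock */
    int  blockers;             /* readers blocking the expedited GP */
    int  wakeups;              /* counts calls to the waker */
    bool woke_with_lock_held;  /* instrumentation for this sketch */
};

/* Stand-in for wake_up(); records whether the lock was held. */
static void sketch_wake_up(struct root_sketch *r)
{
    if (r->locked)
        r->woke_with_lock_held = true;
    r->wakeups++;
}

/* Called when one blocking reader finishes. */
static void report_exp(struct root_sketch *r)
{
    bool need_wakeup;

    r->locked = true;
    if (r->blockers > 0)
        r->blockers--;
    need_wakeup = (r->blockers == 0);  /* decide while holding the lock */
    r->locked = false;                 /* release before calling out */

    if (need_wakeup)
        sketch_wake_up(r);
}
```

Carrying the decision out of the locked region in a local variable is what lets the wakeup run without the lock, so the waker is free to take scheduler locks of its own.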
     

13 Jul, 2011

1 commit

  • Under some rare but real combinations of configuration parameters, RCU
    callbacks are posted during early boot that use kernel facilities that
    are not yet initialized. Therefore, when these callbacks are invoked,
    hard hangs and crashes ensue. This commit therefore prevents RCU
    callbacks from being invoked until after the scheduler is fully up and
    running, as in after multiple tasks have been spawned.

    It might well turn out that a better approach is to identify the specific
    RCU callbacks that are causing this problem, but that discussion will
    wait until such time as someone really needs an RCU callback to be invoked
    (as opposed to merely registered) during early boot.

    Reported-by: julie Sullivan
    Reported-by: RKK
    Signed-off-by: Paul E. McKenney
    Tested-by: Konrad Rzeszutek Wilk
    Tested-by: julie Sullivan
    Tested-by: RKK

    Paul E. McKenney
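
A minimal sketch of the deferral described above: callbacks may be registered at any time, but invocation is gated on a flag set once the scheduler is fully up. The queue layout, the `scheduler_fully_active` field name, and the `count_cb()` example callback are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback record, linked into a simple FIFO. */
struct cb {
    struct cb *next;
    void (*func)(struct cb *);
};

struct cb_queue {
    struct cb *head;
    struct cb **tail;
    bool scheduler_fully_active;  /* set once multiple tasks can run */
};

static void cbq_init(struct cb_queue *q)
{
    q->head = NULL;
    q->tail = &q->head;
    q->scheduler_fully_active = false;
}

/* Registration is always allowed, even during early boot. */
static void cbq_post(struct cb_queue *q, struct cb *c,
                     void (*func)(struct cb *))
{
    c->next = NULL;
    c->func = func;
    *q->tail = c;
    q->tail = &c->next;
}

/* Invocation is a no-op until the scheduler is up; returns count run. */
static int cbq_invoke(struct cb_queue *q)
{
    int n = 0;

    if (!q->scheduler_fully_active)
        return 0;
    while (q->head != NULL) {
        struct cb *c = q->head;

        q->head = c->next;
        c->func(c);
        n++;
    }
    q->tail = &q->head;
    return n;
}

/* Example callback used only to observe invocations. */
static int invoked_count;

static void count_cb(struct cb *c)
{
    (void)c;
    invoked_count++;
}
```

Callbacks posted early simply wait on the queue, which matches the commit's distinction between registering a callback and invoking it.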
     

17 Jun, 2011

1 commit

  • The commit "use softirq instead of kthreads except when RCU_BOOST=y"
    just applied #ifdef in place. This commit is a cleanup that moves
    the newly #ifdef'ed code to the header file kernel/rcutree_plugin.h.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

16 Jun, 2011

1 commit


15 Jun, 2011

2 commits

  • Commit a26ac2455ffcf3 (rcu: move TREE_RCU from softirq to kthread)
    introduced a performance regression. In an AIM7 test, this commit degraded
    performance by about 40%.

    The commit runs RCU callbacks in a kthread instead of softirq. We observed
    a high rate of context switches caused by this. Our test system has
    64 CPUs and HZ is 1000, so we saw more than 64k context switches per second
    caused by RCU's per-CPU kthread. A trace showed that most of
    the time the RCU per-CPU kthread doesn't actually handle any callbacks,
    but instead just does a very small amount of work handling grace periods.
    This means that RCU's per-CPU kthreads are making the scheduler do quite
    a bit of work in order to allow a very small amount of RCU-related
    processing to be done.

    Alex Shi's analysis determined that this slowdown is due to lock
    contention within the scheduler. Unfortunately, as Peter Zijlstra points
    out, the scheduler's real-time semantics require global action, which
    means that this contention is inherent in real-time scheduling. (Yes,
    perhaps someone will come up with a workaround -- otherwise, -rt is not
    going to do well on large SMP systems -- but this patch will work around
    this issue in the meantime. And "the meantime" might well be forever.)

    This patch therefore re-introduces softirq processing to RCU, but only
    for core RCU work. RCU callbacks are still executed in kthread context,
    so that only a small amount of RCU work runs in softirq context in the
    common case. This should minimize ksoftirqd execution, allowing us to
    skip boosting of ksoftirqd for CONFIG_RCU_BOOST=y kernels.

    Signed-off-by: Shaohua Li
    Tested-by: "Alex,Shi"
    Signed-off-by: Paul E. McKenney

    Shaohua Li
     
  • Make the functions creating the kthreads wake them up. Leverage the
    fact that the per-node and boost kthreads can run anywhere, thus
    dispensing with the need to wake them up once the incoming CPU has
    gone fully online.

    Signed-off-by: Paul E. McKenney
    Tested-by: Daniel J Blueman

    Paul E. McKenney
     

31 May, 2011

1 commit

  • Commit cc3ce5176d83 (rcu: Start RCU kthreads in TASK_INTERRUPTIBLE
    state) fudges a sleeping task's state, resulting in the scheduler seeing
    a TASK_UNINTERRUPTIBLE task going to sleep, but a TASK_INTERRUPTIBLE
    task waking up. The result is unbalanced load calculation.

    The problem that patch tried to address is that the RCU threads could
    stay in UNINTERRUPTIBLE state for quite a while, triggering the hung
    task detector due to on-demand wake-ups.

    Cure the problem differently by always giving the tasks at least one
    wake-up once the CPU is fully up and running. This will kick them out of
    the initial UNINTERRUPTIBLE state and into the regular INTERRUPTIBLE
    wait state.

    [ The alternative would be teaching kthread_create() to start threads as
    INTERRUPTIBLE but that needs a tad more thought. ]

    Reported-by: Damien Wyart
    Signed-off-by: Peter Zijlstra
    Acked-by: Paul E. McKenney
    Link: http://lkml.kernel.org/r/1306755291.1200.2872.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra