06 May, 2011

34 commits

  • Provide rcu_virt_note_context_switch() for virtualization use to note a
    quiescent state during guest entry.

    Signed-off-by: Gleb Natapov
    Signed-off-by: Paul E. McKenney

    Gleb Natapov
     
  • Signed integer overflow is undefined by the C standard, so move
    calculations to unsigned.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • rcu_sched_qs() currently calls local_irq_save()/local_irq_restore() up
    to three times.

    Remove irq masking from rcu_qsctr_help() / invoke_rcu_kthread()
    and do it once in rcu_sched_qs() / rcu_bh_qs().

    This generates smaller code as well.

    text    data   bss    dec    hex  filename
    2314     156    24   2494    9be  kernel/rcutiny.old.o
    2250     156    24   2430    97e  kernel/rcutiny.new.o

    Fix an outdated comment for rcu_qsctr_help() and move the
    invoke_rcu_kthread() definition before its first use.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Eric Dumazet
     
  • This commit marks a first step towards making call_rcu() have
    real-time behavior. If irqs are disabled, don't dive into the
    RCU core. Later on, this new early exit will wake up the
    per-CPU kthread, which first must be modified to handle the
    cases involving callback storms.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Although rcu_yield() drops the yielding task from real-time to normal
    priority, there is always the possibility that the competing tasks have
    been niced. So nice down to 19 in rcu_yield() to help ensure that other
    tasks have a better chance of running.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Many RCU callback functions just call kfree() on the base structure.
    These functions are trivial, but their size adds up, and furthermore,
    when they are used in a kernel module, that module must invoke the
    high-latency rcu_barrier() function at module-unload time.

    The kfree_rcu() function introduced by this commit addresses this issue.
    Rather than encoding a function address in the embedded rcu_head
    structure, kfree_rcu() instead encodes the offset of the rcu_head
    structure within the base structure. Because the functions are not
    allowed in the low-order 4096 bytes of kernel virtual memory, offsets
    up to 4095 bytes can be accommodated. If the offset is larger than
    4095 bytes, a compile-time error will be generated in __kfree_rcu().
    If this error is triggered, you can either fall back to use of call_rcu()
    or rearrange the structure to position the rcu_head structure into the
    first 4096 bytes.

    Note that the allowable offset might decrease in the future, for example,
    to allow something like kmem_cache_free_rcu().

    The new kfree_rcu() function can replace code as follows:

    call_rcu(&p->rcu, simple_kfree_callback);

    where "simple_kfree_callback()" might be defined as follows:

    void simple_kfree_callback(struct rcu_head *p)
    {
            struct foo *q = container_of(p, struct foo, rcu);

            kfree(q);
    }

    with the following:

    kfree_rcu(&p->rcu, rcu);

    Note that the "rcu" is the name of a field in the structure being
    freed. The reason for using this rather than passing in a pointer
    to the base structure is that the above approach allows better type
    checking.

    This commit is based on earlier work by Lai Jiangshan and Manfred Spraul:

    Lai's V1 patch: http://lkml.org/lkml/2008/9/18/1
    Manfred's patch: http://lkml.org/lkml/2009/1/2/115

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Manfred Spraul
    Signed-off-by: Paul E. McKenney
    Reviewed-by: David Howells
    Reviewed-by: Josh Triplett

    Lai Jiangshan
     
  • The "preemptible" spelling is preferable. May as well fix it.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Using __rcu_read_lock() in place of rcu_read_lock() leaves any debug
    state as it really should be, namely with the lock still held.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Lai Jiangshan
     
  • This applies a trick from TREE_RCU boosting to TINY_RCU, eliminating
    code and adding comments. The key point is that it is possible for
    the booster thread itself to work out whether there is a normal or
    expedited boost required based solely on local information. There
    is therefore no need for boost initiation to know or care what type
    of boosting is required. In addition, when boosting is complete for
    a given grace period, then by definition there cannot be any more
    boosting for that grace period. This allows eliminating yet more
    state and statistics.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • The ->boosted_this_gp field is a holdover from an earlier design that
    was to carry out multiple boost operations in parallel. It is not required
    by the current design, which boosts one task at a time.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Remove an extraneous semicolon, fix a bad comment, and fold an
    INIT_LIST_HEAD() into the preceding list_del() to get list_del_init().

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • This removes a couple of lines from invoke_rcu_cpu_kthread(), improving
    readability.

    Reported-by: Christoph Lameter
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Avoid additional multiple-warning confusion in memory-corruption scenarios.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • The CONFIG_DEBUG_OBJECTS_RCU_HEAD facility requires that on-stack RCU
    callbacks be flagged explicitly to debug-objects using the
    init_rcu_head_on_stack() and destroy_rcu_head_on_stack() functions.
    This commit applies those functions to the rcutorture code that tests
    RCU priority boosting.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Verify that rcu_head structures are aligned to a four-byte boundary.
    This check is enabled by CONFIG_DEBUG_OBJECTS_RCU_HEAD.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • The prohibition of DEBUG_OBJECTS_RCU_HEAD on !PREEMPT kernels was due
    to the fixup actions, so just produce a warning under !PREEMPT instead.

    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Mathieu Desnoyers
     
  • Increment a per-CPU counter on each pass through rcu_cpu_kthread()'s
    service loop, and add it to the rcudata trace output.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • This commit adds the age in jiffies of the current grace period along
    with the duration in jiffies of the longest grace period since boot
    to the rcu/rcugp debugfs file. It also adds an additional "O" state
    to kthread tracing to differentiate between the kthread waiting due to
    having nothing to do on the one hand and waiting due to being on the
    wrong CPU on the other hand.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The rcu_initiate_boost_trace() function mis-attributed refusals to
    initiate RCU priority boosting that were in fact due to its not yet
    being time to boost. This patch fixes the faulty comparison.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit documents the new debugfs rcu/rcutorture and rcu/rcuboost
    trace files. The description has been updated as suggested by Josh
    Triplett.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • It is not possible to accurately correlate rcutorture output with that
    of debugfs. This patch therefore adds a debugfs file that prints out
    the rcutorture version number, permitting easy correlation.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Add tracing to help debugging situations when RCU's kthreads are not
    running but are supposed to be.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • This commit adds an indication of the state of the callback queue using
    a string of four characters following the "ql=" integer queue length.
    The first character is "N" if there are callbacks that have been
    queued that are not yet ready to be handled by the next grace period, or
    "." otherwise. The second character is "R" if there are callbacks queued
    that are ready to be handled by the next grace period, or "." otherwise.
    The third character is "W" if there are callbacks waiting for the current
    grace period, or "." otherwise. Finally, the fourth character is "D"
    if there are callbacks that have been handled by a prior grace period
    and are waiting to be invoked, or "." otherwise.

    Note that callbacks that are in the process of being invoked are
    not shown. These callbacks would have been removed from the rcu_data
    structure's list by rcu_do_batch() prior to being executed. (These
    callbacks are also not reflected in the "ql=" total, FWIW.)

    Also, document the new callback-queue trace information.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • The trace.txt file had obsolete output for the debugfs rcu/rcudata
    file, so update it.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • This commit includes the total number of tasks boosted, the number
    boosted on behalf of each of normal and expedited grace periods, and
    statistics on attempts to initiate boosting that failed for various
    reasons.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • The n_rcu_torture_boost_allocerror and n_rcu_torture_boost_afferror
    statistics are not actually incremented anymore, so eliminate them.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • The scheduler does not appear to take kindly to having multiple
    real-time threads bound to a CPU that is going offline. So this
    commit is a temporary hack-around to avoid that happening.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • If you are doing CPU hotplug operations, it is best not to have
    real-time tasks running CPU-bound on the outgoing CPU. So this
    commit makes the per-CPU kthreads run at non-realtime priority
    during that time.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • The scheduler has had some heartburn in the past when too many real-time
    kthreads were affinitied to the outgoing CPU. So, this commit lightens
    the load by forcing the per-rcu_node and the boost kthreads off of the
    outgoing CPU. Note that RCU's per-CPU kthread remains on the outgoing
    CPU until the bitter end, as it must in order to preserve correctness.

    Also avoid disabling hardirqs across calls to set_cpus_allowed_ptr(),
    given that this function can block.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Add priority boosting for TREE_PREEMPT_RCU, similar to that for
    TINY_PREEMPT_RCU. This is enabled by the default-off RCU_BOOST
    kernel parameter. The priority to which to boost preempted
    RCU readers is controlled by the RCU_BOOST_PRIO kernel parameter
    (defaulting to real-time priority 1) and the time to wait before
    boosting the readers who are blocking a given grace period is
    controlled by the RCU_BOOST_DELAY kernel parameter (defaulting to
    500 milliseconds).
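    Under the conventional CONFIG_ prefix, a kernel configuration fragment
    selecting these options might look as follows (symbol names are taken
    from the text above; the exact spellings are an assumption):

```
CONFIG_RCU_BOOST=y
CONFIG_RCU_BOOST_PRIO=1
CONFIG_RCU_BOOST_DELAY=500
```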

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • If RCU priority boosting is to be meaningful, callback invocation must
    be boosted in addition to preempted RCU readers. Otherwise, in the
    presence of CPU-bound real-time threads, the grace period ends, but
    the callbacks don't get invoked. If the callbacks don't get invoked,
    the associated memory doesn't get freed, so the system is still
    subject to OOM.

    But it is not reasonable to priority-boost RCU_SOFTIRQ, so this commit
    moves the callback invocations to a kthread, which can be boosted easily.

    Also add comments and properly synchronize all accesses to
    rcu_cpu_kthread_task, as suggested by Lai Jiangshan.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Combine the current TREE_PREEMPT_RCU ->blocked_tasks[] lists in the
    rcu_node structure into a single ->blkd_tasks list with ->gp_tasks
    and ->exp_tasks tail pointers. This is in preparation for RCU priority
    boosting, which will add a third dimension to the combinatorial explosion
    in the ->blocked_tasks[] case, but simply a third pointer in the new
    ->blkd_tasks case.

    Also update the documentation to reflect the ->blocked_tasks[] merge.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Commit d09b62d fixed grace-period synchronization, but left some
    smp_mb() invocations in rcu_process_callbacks() that were no longer
    needed; sheer paranoia had prevented them from being removed. This
    commit removes them and provides a proof of correctness in their
    absence. It also adds
    a memory barrier to rcu_report_qs_rsp() immediately before the update to
    rsp->completed in order to handle the theoretical possibility that the
    compiler or CPU might move massive quantities of code into a lock-based
    critical section. This also proves that the sheer paranoia was not
    entirely unjustified, at least from a theoretical point of view.

    In addition, the old dyntick-idle synchronization depended on the fact
    that grace periods were many milliseconds in duration, so that it could
    be assumed that no dyntick-idle CPU could reorder a memory reference
    across an entire grace period. Unfortunately for this design, the
    addition of expedited grace periods breaks this assumption, which has
    the unfortunate side-effect of requiring atomic operations in the
    functions that track dyntick-idle state for RCU. (There is some hope
    that the algorithms used in user-level RCU might be applied here, but
    some work is required to handle the NMIs that user-space applications
    can happily ignore. For the short term, better safe than sorry.)

    This proof assumes that neither compiler nor CPU will allow a lock
    acquisition and release to be reordered, as doing so can result in
    deadlock. The proof is as follows:

    1. A given CPU declares a quiescent state under the protection of
    its leaf rcu_node's lock.

    2. If there is more than one level of rcu_node hierarchy, the
    last CPU to declare a quiescent state will also acquire the
    ->lock of the next rcu_node up in the hierarchy, but only
    after releasing the lower level's lock. The acquisition of this
    lock clearly cannot occur prior to the acquisition of the leaf
    node's lock.

    3. Step 2 repeats until we reach the root rcu_node structure.
    Please note again that only one lock is held at a time through
    this process. The acquisition of the root rcu_node's ->lock
    must occur after the release of that of the leaf rcu_node.

    4. At this point, we set the ->completed field in the rcu_state
    structure in rcu_report_qs_rsp(). However, if the rcu_node
    hierarchy contains only one rcu_node, then in theory the code
    preceding the quiescent state could leak into the critical
    section. We therefore precede the update of ->completed with a
    memory barrier. All CPUs will therefore agree that any updates
    preceding any report of a quiescent state will have happened
    before the update of ->completed.

    5. Regardless of whether a new grace period is needed, rcu_start_gp()
    will propagate the new value of ->completed to all of the leaf
    rcu_node structures, under the protection of each rcu_node's ->lock.
    If a new grace period is needed immediately, this propagation
    will occur in the same critical section that ->completed was
    set in, but courtesy of the memory barrier in #4 above, is still
    seen to follow any pre-quiescent-state activity.

    6. When a given CPU invokes __rcu_process_gp_end(), it becomes
    aware of the end of the old grace period and therefore makes
    any RCU callbacks that were waiting on that grace period eligible
    for invocation.

    If this CPU is the same one that detected the end of the grace
    period, and if there is but a single rcu_node in the hierarchy,
    we will still be in the single critical section. In this case,
    the memory barrier in step #4 guarantees that all callbacks will
    be seen to execute after each CPU's quiescent state.

    On the other hand, if this is a different CPU, it will acquire
    the leaf rcu_node's ->lock, and will again be serialized after
    each CPU's quiescent state for the old grace period.

    On the strength of this proof, this commit therefore removes the memory
    barriers from rcu_process_callbacks() and adds one to rcu_report_qs_rsp().
    The effect is to reduce the number of memory barriers by one and to
    reduce the frequency of execution from about once per scheduling tick
    per CPU to once per grace period.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • The RCU CPU stall warnings can now be controlled using the
    rcu_cpu_stall_suppress boot-time parameter or via the same parameter
    from sysfs. There is therefore no longer any reason to have
    kernel config parameters for this feature. This commit therefore
    removes the RCU_CPU_STALL_DETECTOR and RCU_CPU_STALL_DETECTOR_RUNNABLE
    kernel config parameters. The RCU_CPU_STALL_TIMEOUT parameter remains
    to allow the timeout to be tuned and the RCU_CPU_STALL_VERBOSE parameter
    remains to allow task-stall information to be suppressed if desired.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     

04 May, 2011

6 commits