17 Aug, 2017

1 commit

  • …isc.2017.08.17a', 'spin_unlock_wait_no.2017.08.17a', 'srcu.2017.07.27c' and 'torture.2017.07.24c' into HEAD

    doc.2017.08.17a: Documentation updates.
    fixes.2017.08.17a: RCU fixes.
    hotplug.2017.07.25b: CPU-hotplug updates.
    misc.2017.08.17a: Miscellaneous fixes outside of RCU (give or take conflicts).
    spin_unlock_wait_no.2017.08.17a: Remove spin_unlock_wait().
    srcu.2017.07.27c: SRCU updates.
    torture.2017.07.24c: Torture-test updates.

    Paul E. McKenney
     

25 Jul, 2017

2 commits


09 Jun, 2017

3 commits

  • This commit uses Tree RCU's rnp->lock wrappers to replace a few explicit
    memory barriers. This change also has the advantage of implementing
    SRCU's memory-ordering properties in roughly the same way as they are
    implemented in Tree RCU.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The call_srcu() docbook entry is currently in include/linux/srcu.h,
    which causes needless processing for each include point. This commit
    therefore moves this entry to kernel/rcu/srcutree.c, which the compiler
    reads only once. In addition, the srcu_batches_completed() function is
    used only within RCU and its torture-test suites. This commit therefore
    also moves this function's declaration from include/linux/srcutiny.h,
    include/linux/srcutree.h, and include/linux/srcuclassic.h to
    kernel/rcu/rcu.h.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The include/linux/rcupdate.h file contains a number of definitions that
    are used only to communicate between rcutorture, rcuperf, and the RCU code
    itself. There is no point in having these definitions exposed globally
    throughout the kernel, so this commit moves them to kernel/rcu/rcu.h.
    This change has the added benefit of shrinking rcupdate.h.

    Reported-by: Ingo Molnar
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

27 Apr, 2017

4 commits

  • On small systems, in the absence of readers, expedited SRCU grace
    periods can complete in less than a microsecond. This means that an
    eight-CPU system can have all CPUs doing synchronize_srcu() in a tight
    loop and almost always expedite. This might actually be desirable in
    some situations, but in general it is a good way to needlessly burn
    CPU cycles. And in those situations where it is desirable, your friend
    is the function synchronize_srcu_expedited().

    For other situations, this commit adds a kernel parameter that specifies
    a holdoff between completing the last SRCU grace period and auto-expediting
    the next. If the next grace period starts before the holdoff expires,
    that grace period is not auto-expedited. The holdoff defaults to 50
    microseconds and is specified in nanoseconds. A value of zero disables
    auto-expediting.
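
    The time check described above can be modelled roughly as follows. This
    is a userspace sketch under stated assumptions, not the kernel code: the
    names may_auto_expedite() and EXP_HOLDOFF_DEFAULT_NS are hypothetical,
    timestamps are assumed to come from a monotonic clock, and any other
    conditions the kernel applies before expediting are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for the default holdoff: 50 microseconds, in nanoseconds. */
#define EXP_HOLDOFF_DEFAULT_NS (50u * 1000u)

/*
 * Decide whether the next grace period may be auto-expedited.
 *   last_gp_end_ns: time at which the previous grace period completed
 *   now_ns:         time at which the next grace period is requested
 *   holdoff_ns:     tunable holdoff; zero disables auto-expediting
 * Assumes now_ns >= last_gp_end_ns (monotonic clock).
 */
static bool may_auto_expedite(uint64_t last_gp_end_ns, uint64_t now_ns,
                              uint64_t holdoff_ns)
{
    if (holdoff_ns == 0)
        return false;   /* auto-expediting is switched off */
    /* A request arriving before the holdoff expires is not expedited. */
    return now_ns - last_gp_end_ns >= holdoff_ns;
}
```

    With the default holdoff, a request arriving 49,999 ns after the previous
    grace period ended is not expedited, while one arriving 50,000 ns after
    it may be.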

    Signed-off-by: Paul E. McKenney
    Tested-by: Mike Galbraith

    Paul E. McKenney
     
  • Commit f60d231a87c5 ("srcu: Crude control of expedited grace periods")
    introduced a per-srcu_struct atomic counter to track outstanding
    requests for grace periods. This works, but represents a memory-contention
    bottleneck. This commit therefore uses the srcu_node combining tree
    to remove this bottleneck.

    This commit adds new ->srcu_gp_seq_needed_exp fields to the
    srcu_data, srcu_node, and srcu_struct structures, which track the
    farthest-in-the-future grace period that must be expedited, which in
    turn requires that all nearer-term grace periods also be expedited.
    Requests for expediting start with the srcu_data structure, run up
    through the srcu_node tree, and end at the srcu_struct structure.
    Note that it may be necessary to expedite a grace period that just
    now started, and this is handled by a new srcu_funnel_exp_start()
    function, which is invoked when the grace period itself is already
    on its way, but when that grace period was not marked as expedited.

    A new srcu_get_delay() function returns zero if there is at least one
    expedited SRCU grace period in flight, or SRCU_INTERVAL otherwise.
    This function is used to calculate delays: Normal grace periods
    are allowed to extend in order to cover more requests with a given
    grace-period computation, which decreases per-request overhead.
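
    The delay selection described above can be sketched in userspace as
    follows. The structure and function names here are hypothetical, the
    value of SRCU_INTERVAL_MODEL is an assumption, and treating "expedited
    grace period in flight" as "the current sequence number is behind the
    furthest one that must be expedited" is an assumption modelled on the
    field description above.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for SRCU_INTERVAL; the value is assumed. */
#define SRCU_INTERVAL_MODEL 1UL

/* Minimal model of the two srcu_struct fields involved. */
struct srcu_model {
    unsigned long gp_seq;            /* current grace-period sequence number */
    unsigned long gp_seq_needed_exp; /* furthest grace period to expedite */
};

/* Wrap-safe "a < b" for sequence numbers that are allowed to overflow. */
static bool seq_before(unsigned long a, unsigned long b)
{
    return ULONG_MAX / 2 < a - b;
}

/* Zero delay while an expedited grace period is pending, else the interval. */
static unsigned long get_delay_model(const struct srcu_model *sp)
{
    if (seq_before(sp->gp_seq, sp->gp_seq_needed_exp))
        return 0;
    return SRCU_INTERVAL_MODEL;
}
```

    The wrap-safe comparison matters because the sequence numbers are
    free-running unsigned counters: a plain "<" would give the wrong answer
    once the counter overflows.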

    Signed-off-by: Paul E. McKenney
    Tested-by: Mike Galbraith

    Paul E. McKenney
     
  • In the past, SRCU was simple enough that there was little point in
    making the rcutorture writer stall messages print the SRCU grace-period
    number state. With the advent of Tree SRCU, this has changed. This
    commit therefore makes Classic, Tiny, and Tree SRCU report this state
    to rcutorture as needed.

    Signed-off-by: Paul E. McKenney
    Tested-by: Mike Galbraith

    Paul E. McKenney
     
  • The current Tree SRCU implementation schedules a workqueue for every
    srcu_data covered by a given leaf srcu_node structure having callbacks,
    even if only one of those srcu_data structures actually contains
    callbacks. This is clearly inefficient for workloads that don't feature
    callbacks everywhere all the time. This commit therefore adds an array
    of masks that are used by the leaf srcu_node structures to track exactly
    which srcu_data structures contain callbacks.
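
    The idea can be sketched as a per-leaf bitmask with one bit per covered
    CPU. This is a simplified userspace model: struct leaf_model and both
    helper functions are hypothetical names, a single mask stands in for the
    array of masks, and a leaf is assumed to cover at most 64 CPUs.

```c
#include <stdint.h>

struct leaf_model {
    int grplo;         /* lowest CPU number covered by this leaf */
    int grphi;         /* highest CPU number covered by this leaf */
    uint64_t have_cbs; /* bit (cpu - grplo) is set if that CPU has callbacks */
};

/* Record that the given CPU's srcu_data now holds callbacks. */
static void leaf_note_cbs(struct leaf_model *leaf, int cpu)
{
    leaf->have_cbs |= UINT64_C(1) << (cpu - leaf->grplo);
}

/*
 * Collect only the CPUs whose bit is set, which are the ones that need
 * work scheduled. Writes up to max CPU numbers into out[], clears the
 * mask, and returns the number of CPUs written.
 */
static int leaf_cpus_to_schedule(struct leaf_model *leaf, int *out, int max)
{
    int n = 0;

    for (int cpu = leaf->grplo; cpu <= leaf->grphi && n < max; cpu++) {
        if (leaf->have_cbs & (UINT64_C(1) << (cpu - leaf->grplo)))
            out[n++] = cpu;
    }
    leaf->have_cbs = 0;
    return n;
}
```

    For a leaf covering CPUs 16 through 31 where only CPUs 18 and 31 have
    queued callbacks, this yields two entries instead of sixteen.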

    Signed-off-by: Paul E. McKenney
    Tested-by: Mike Galbraith

    Paul E. McKenney
     

21 Apr, 2017

1 commit

  • Peter Zijlstra proposed using SRCU to reduce mmap_sem contention [1,2].
    However, there are workloads that could result in a high volume of
    concurrent invocations of call_srcu(), which with current SRCU would
    result in excessive lock contention on the srcu_struct structure's
    ->queue_lock, which protects SRCU's callback lists. This commit therefore
    moves SRCU to per-CPU callback lists, thus greatly reducing contention.

    Because a given SRCU instance no longer has a single centralized callback
    list, starting grace periods and invoking callbacks are both more complex
    than in the single-list Classic SRCU implementation. Starting grace
    periods and handling callbacks are now handled using an srcu_node tree
    that is in some ways similar to the rcu_node trees used by RCU-bh,
    RCU-preempt, and RCU-sched (for example, the srcu_node tree shape is
    controlled by exactly the same Kconfig options and boot parameters that
    control the shape of the rcu_node tree).

    In addition, the old per-CPU srcu_array structure is now named srcu_data
    and contains an rcu_segcblist structure named ->srcu_cblist for its
    callbacks (and a spinlock to protect this). The srcu_struct gets
    an srcu_gp_seq that is used to associate callback segments with the
    corresponding completion-time grace-period number. These completion-time
    grace-period numbers are propagated up the srcu_node tree so that the
    grace-period workqueue handler can determine whether additional grace
    periods are needed on the one hand and where to look for callbacks that
    are ready to be invoked on the other.
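
    One way to picture a completion-time grace-period number is the sketch
    below. It is an illustration only: the two low-order state bits, the
    snapshot formula, and the names seq_snap() and seq_done() are
    assumptions, not details taken from this commit message.

```c
#include <limits.h>
#include <stdbool.h>

/* Assumed layout: low-order bits hold grace-period state, the rest count. */
#define SEQ_STATE_BITS 2
#define SEQ_STATE_MASK ((1UL << SEQ_STATE_BITS) - 1)

/*
 * Given the current sequence number, return the value the counter will
 * have once a full grace period has elapsed from now. If a grace period
 * is already in progress (state bits nonzero), it does not count, so the
 * result is the end of the following one.
 */
static unsigned long seq_snap(unsigned long seq)
{
    return (seq + 2 * SEQ_STATE_MASK + 1) & ~SEQ_STATE_MASK;
}

/* Wrap-safe check that the counter has reached the snapshot value. */
static bool seq_done(unsigned long cur, unsigned long snap)
{
    return cur - snap <= ULONG_MAX / 2;
}
```

    A callback segment tagged with seq_snap() of the counter at enqueue time
    becomes ready to invoke once seq_done() holds for that tag.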

    The srcu_barrier() function must now wait on all instances of the per-CPU
    ->srcu_cblist. Because each ->srcu_cblist is protected by ->lock,
    srcu_barrier() can remotely add the needed callbacks. In theory,
    it could also remotely start grace periods, but in practice doing so
    is complex and racy. And interestingly enough, it is never necessary
    for srcu_barrier() to start a grace period because srcu_barrier() only
    enqueues a callback when a callback is already present--and it turns out
    that a grace period has to have already been started for this pre-existing
    callback. Furthermore, it is only the callback that srcu_barrier()
    needs to wait on, not any particular grace period. Therefore, a new
    rcu_segcblist_entrain() function enqueues the srcu_barrier() function's
    callback into the same segment occupied by the last pre-existing callback
    in the list. The special case where all the pre-existing callbacks are
    on a different list (because they are in the process of being invoked)
    is handled by enqueuing srcu_barrier()'s callback into the RCU_DONE_TAIL
    segment, relying on the done-callbacks check that takes place after all
    callbacks are invoked.
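
    The entraining step can be sketched on a simplified segmented list. This
    is a userspace model with hypothetical type and function names, without
    the locking and memory ordering the kernel needs; the segment names
    follow the RCU_DONE_TAIL naming used above, and the exact set of
    segments is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Segment indices; DONE_TAIL corresponds to RCU_DONE_TAIL. */
enum { DONE_TAIL, WAIT_TAIL, NEXT_READY_TAIL, NEXT_TAIL, NSEGS };

struct cb {
    struct cb *next;
};

struct seglist {
    struct cb *head;          /* first callback, or NULL */
    struct cb **tails[NSEGS]; /* tails[i]: the ->next slot that ends segment i */
    long len;                 /* callbacks queued or still being invoked */
};

static void seglist_init(struct seglist *l)
{
    l->head = NULL;
    for (int i = 0; i < NSEGS; i++)
        l->tails[i] = &l->head;
    l->len = 0;
}

/*
 * Add c behind the last pre-existing callback, in that callback's segment.
 * Returns false without queuing if there is nothing to wait for.
 */
static bool seglist_entrain(struct seglist *l, struct cb *c)
{
    int i;

    if (l->len == 0)
        return false;
    l->len++;
    c->next = NULL;
    /* Find the last non-empty segment, falling back to DONE_TAIL. */
    for (i = NEXT_TAIL; i > DONE_TAIL; i--)
        if (l->tails[i] != l->tails[i - 1])
            break;
    *l->tails[i] = c;
    /* c now ends segment i and every later (empty) segment. */
    for (; i < NSEGS; i++)
        l->tails[i] = &c->next;
    return true;
}
```

    When every segment is empty but len is nonzero, which models callbacks
    that have been pulled off the list for invocation, the search falls
    through to DONE_TAIL and the new callback lands in that segment.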

    Note that the readers use the same algorithm as before. There is
    a separate srcu_idx that tells the readers which counter to increment.
    This unfortunately cannot be combined with srcu_gp_seq because they
    need to be incremented at different times.

    This commit introduces some ugly #ifdefs in rcutorture. These will go
    away when I feel good enough about Tree SRCU to ditch Classic SRCU.

    Some crude performance comparisons, courtesy of a quickly hacked rcuperf
    asynchronous-grace-period capability:

    Callback Queuing Overhead
    -------------------------
    # CPUS    Classic SRCU    Tree SRCU
    ------    ------------    ---------
         2        0.349 us     0.342 us
        16        31.66 us       0.4 us
        41       ---------     0.417 us

    The times are the 90th percentiles, a statistic that was chosen to reject
    the overheads of the occasional srcu_barrier() call needed to avoid OOMing
    the test machine. The rcuperf test hangs when running Classic SRCU at 41
    CPUs, hence the line of dashes. Despite the hacks to both the rcuperf code
    and the statistics, this is a convincing demonstration of Tree SRCU's
    performance and scalability advantages.

    [1] https://lwn.net/Articles/309030/
    [2] https://patchwork.kernel.org/patch/5108281/

    Signed-off-by: Paul E. McKenney
    [ paulmck: Fix initialization if synchronize_srcu_expedited() called first. ]

    Paul E. McKenney
     

19 Apr, 2017

1 commit