09 Jun, 2017

6 commits

  • The NO_HZ_FULL_SYSIDLE full-system-idle capability was added in 2013
    by commit 0edd1b1784cb ("nohz_full: Add full-system-idle state machine"),
    but has not been used. This commit therefore removes it.

    If it turns out to be needed later, this commit can always be reverted.

    Signed-off-by: Paul E. McKenney
    Cc: Frederic Weisbecker
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Acked-by: Linus Torvalds

    Paul E. McKenney
     
  • Anything that can be done with the RCU_KTHREAD_PRIO Kconfig option can
    also be done with the rcutree.kthread_prio kernel boot parameter.
    This commit therefore removes this Kconfig option.
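
    For example (the priority value here is purely illustrative), the same
    effect can be had by booting with:

        rcutree.kthread_prio=2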

    Reported-by: Linus Torvalds
    Signed-off-by: Paul E. McKenney
    Cc: Frederic Weisbecker
    Cc: Rik van Riel

    Paul E. McKenney
     
  • The RCU_TORTURE_TEST_SLOW_PREINIT, RCU_TORTURE_TEST_SLOW_PREINIT_DELAY,
    RCU_TORTURE_TEST_SLOW_INIT, RCU_TORTURE_TEST_SLOW_INIT_DELAY,
    RCU_TORTURE_TEST_SLOW_CLEANUP, and RCU_TORTURE_TEST_SLOW_CLEANUP_DELAY
    Kconfig options are only
    useful for torture testing, and there are the rcutree.gp_cleanup_delay,
    rcutree.gp_init_delay, and rcutree.gp_preinit_delay kernel boot parameters
    that rcutorture can use instead. The effect of these parameters is to
    artificially slow down grace period initialization and cleanup in order
    to make some types of race conditions happen more often.

    This commit therefore simplifies Tree RCU a bit by removing the Kconfig
    options and adding the corresponding kernel parameters to rcutorture's
    .boot files instead. However, this commit also leaves out the kernel
    parameters for TREE02, TREE04, and TREE07 in order to have about the
    same number of tests slowed as not slowed. TREE01, TREE03, TREE05,
    and TREE06 are slowed, and the rest are not slowed.
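
    For example, a slowed scenario's .boot file can contain settings such
    as the following (the delay values shown are illustrative rather than
    the commit's actual choices):

        rcutree.gp_preinit_delay=3
        rcutree.gp_init_delay=3
        rcutree.gp_cleanup_delay=3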

    Reported-by: Linus Torvalds
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The "__call_rcu(): Leaked duplicate callback" error message from
    __call_rcu() has proven to be unhelpful. This commit therefore changes
    it to "__call_rcu(): Double-freed CB" and adds the value of the pointer
    passed in. The value of the pointer improves debuggability by allowing
    correlation with tracing output, for example, the rcu:rcu_callback trace
    event.
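
    As a minimal sketch of the bug class being reported (struct and
    callback names here are hypothetical), queuing the same rcu_head twice
    before the first callback has been invoked is the "double free" in
    question:

        struct foo {
                struct rcu_head rh;
                /* ... payload ... */
        };

        static void foo_free_cb(struct rcu_head *rhp)
        {
                kfree(container_of(rhp, struct foo, rh));
        }

        struct foo *f = kmalloc(sizeof(*f), GFP_KERNEL);

        call_rcu(&f->rh, foo_free_cb);
        call_rcu(&f->rh, foo_free_cb);  /* duplicate: with
                                           CONFIG_DEBUG_OBJECTS_RCU_HEAD=y,
                                           this can provoke the "Double-freed
                                           CB" message, which now prints
                                           &f->rh */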

    Reported-by: Vegard Nossum
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The __rcu_is_watching() function is currently not used, aside from
    implementing the rcu_is_watching() function. This commit therefore
    eliminates __rcu_is_watching(), which has the beneficial side-effect
    of shrinking include/linux/rcupdate.h a bit.

    Reported-by: Ingo Molnar
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The include/linux/rcupdate.h file is included by more than 200
    files, so shrinking it should provide some build-time benefits.
    This commit therefore moves several docbook comments from rcupdate.h to
    kernel/rcu/update.c, kernel/rcu/tree.c, and kernel/rcu/tree_plugin.h, thus
    reducing the number of times that the compiler has to scan these comments.
    This likely provides only a small benefit, but every little bit helps.

    This commit also fixes a malformed bulleted list noted by the 0day
    Test Robot.

    Reported-by: Ingo Molnar
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

08 Jun, 2017

6 commits

  • Comments can be helpful, but assertions carry more force. This
    commit therefore adds lockdep_assert_held() and RCU_LOCKDEP_WARN()
    calls to enforce lock-held and interrupt-disabled preconditions.
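
    A minimal sketch of the pattern (the function and structure here are
    illustrative, not actual call sites from this commit):

        static void example_update(struct example_dev *dev)
        {
                lockdep_assert_held(&dev->lock);  /* splat unless held */
                RCU_LOCKDEP_WARN(!irqs_disabled(),
                                 "example_update() with IRQs enabled!");
                /* ... code relying on these preconditions ... */
        }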

    Reported-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit updates rcu_bootup_announce_oddness() to check additional
    Kconfig options and module/boot parameters.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit adds WARN_ON_ONCE() calls that trigger if either
    rcu_sched_qs() or rcu_bh_qs() are invoked with preemption enabled.
    In the immortal words of Peter Zijlstra: "these are much harder to ignore
    than comments".

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The synchronize_kernel() primitive was removed in favor of
    synchronize_sched() more than a decade ago, and it seems likely that
    rather few kernel hackers are familiar with it. Its continued presence
    is therefore providing more confusion than enlightenment. This commit
    therefore removes the reference from the synchronize_sched() header
    comment, and adds the corresponding information to the synchronize_rcu()
    header comment.

    Reported-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Although preemptible RCU allows its read-side critical sections to be
    preempted, general blocking is forbidden. The reason for this is that
    excessive preemption times can be handled by CONFIG_RCU_BOOST=y, but a
    voluntarily blocked task doesn't care how high you boost its priority.
    Because preemptible RCU is a global mechanism, one ill-behaved reader
    hurts everyone. Hence the prohibition against general blocking in
    RCU-preempt read-side critical sections. Preemption yes, blocking no.

    This commit enforces this prohibition.

    There is a special exception for the -rt patchset (which they kindly
    volunteered to implement): It is OK to block (as opposed to merely being
    preempted) within an RCU-preempt read-side critical section, but only if
    the blocking is subject to priority inheritance. This exception permits
    CONFIG_RCU_BOOST=y to get -rt RCU readers out of trouble.

    Why doesn't this exception also apply to mainline's rt_mutex? Because
    of the possibility that someone does general blocking while holding
    an rt_mutex. Yes, the priority boosting will affect the rt_mutex,
    but it won't help with the task doing general blocking while holding
    that rt_mutex.
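
    A sketch of what the enforcement can look like, assuming a check in
    the context-switch path along these lines:

        /* Complain if the task blocked, as opposed to being preempted,
         * while in an RCU-preempt read-side critical section: */
        WARN_ON_ONCE(!preempt && t->rcu_read_lock_nesting > 0);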

    Reported-by: Thomas Gleixner
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently rcu_barrier() uses call_rcu() to enqueue new callbacks
    on each CPU with a non-empty callback list. This works, but means
    that rcu_barrier() forces grace periods that are not otherwise needed.
    The key point is that rcu_barrier() never needs to wait for a grace
    period, but instead only for all pre-existing callbacks to be invoked.
    This means that rcu_barrier()'s new callbacks should be placed in
    the callback-list segment containing the last pre-existing callback.

    This commit makes this change using the new rcu_segcblist_entrain()
    function.
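
    A sketch of the revised per-CPU step (names approximate, details
    simplified):

        rdp->barrier_head.func = rcu_barrier_callback;
        if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head, 0))
                /* Entrained just after the last pre-existing callback,
                 * so no new grace period need be forced. */
                atomic_inc(&rsp->barrier_cpu_count);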

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

11 May, 2017

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The main changes are:

    - Debloat RCU headers

    - Parallelize SRCU callback handling (plus overlapping patches)

    - Improve the performance of Tree SRCU on a CPU-hotplug stress test

    - Documentation updates

    - Miscellaneous fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
    rcu: Open-code the rcu_cblist_n_lazy_cbs() function
    rcu: Open-code the rcu_cblist_n_cbs() function
    rcu: Open-code the rcu_cblist_empty() function
    rcu: Separately compile large rcu_segcblist functions
    srcu: Debloat the header
    srcu: Adjust default auto-expediting holdoff
    srcu: Specify auto-expedite holdoff time
    srcu: Expedite first synchronize_srcu() when idle
    srcu: Expedited grace periods with reduced memory contention
    srcu: Make rcutorture writer stalls print SRCU GP state
    srcu: Exact tracking of srcu_data structures containing callbacks
    srcu: Make SRCU be built by default
    srcu: Fix Kconfig botch when SRCU not selected
    rcu: Make non-preemptive schedule be Tasks RCU quiescent state
    srcu: Expedite srcu_schedule_cbs_snp() callback invocation
    srcu: Parallelize callback handling
    kvm: Move srcu_struct fields to end of struct kvm
    rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
    rcu: Use true/false in assignment to bool
    rcu: Use bool value directly
    ...

    Linus Torvalds
     

03 May, 2017

2 commits

  • Because the rcu_cblist_n_lazy_cbs() function just samples the ->len_lazy counter,
    and because the rcu_cblist structure is quite straightforward, it makes
    sense to open-code rcu_cblist_n_lazy_cbs(p) as p->len_lazy, cutting out
    a level of indirection. This commit makes this change.

    Reported-by: Ingo Molnar
    Signed-off-by: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Linus Torvalds

    Paul E. McKenney
     
  • Because the rcu_cblist_n_cbs() function just samples the ->len counter, and
    because the rcu_cblist structure is quite straightforward, it makes
    sense to open-code rcu_cblist_n_cbs(p) as p->len, cutting out a level
    of indirection. This commit makes this change.

    Reported-by: Ingo Molnar
    Signed-off-by: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Linus Torvalds

    Paul E. McKenney
     

02 May, 2017

1 commit

  • Because the rcu_cblist_empty() function just samples the ->head pointer, and
    because the rcu_cblist structure is quite straightforward, it makes
    sense to open-code rcu_cblist_empty(p) as !p->head, cutting out a
    level of indirection. This commit makes this change.

    Reported-by: Ingo Molnar
    Signed-off-by: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Linus Torvalds

    Paul E. McKenney
     

27 Apr, 2017

1 commit

  • In the past, SRCU was simple enough that there was little point in
    making the rcutorture writer stall messages print the SRCU grace-period
    number state. With the advent of Tree SRCU, this has changed. This
    commit therefore makes Classic, Tiny, and Tree SRCU report this state
    to rcutorture as needed.

    Signed-off-by: Paul E. McKenney
    Tested-by: Mike Galbraith

    Paul E. McKenney
     

21 Apr, 2017

3 commits

  • doc.2017.04.12a: Documentation updates
    fixes.2017.04.19a: Miscellaneous fixes
    srcu.2017.04.21a: Parallelize SRCU callback handling

    Paul E. McKenney
     
  • Currently, a call to schedule() acts as a Tasks RCU quiescent state
    only if a context switch actually takes place. However, just the
    call to schedule() guarantees that the calling task has moved off of
    whatever tracing trampoline it might previously have been on.
    This commit therefore plumbs schedule()'s "preempt" parameter into
    rcu_note_context_switch(), which then records the Tasks RCU quiescent
    state, but only if this call to schedule() was -not- due to a preemption.

    To avoid adding overhead to the common-case context-switch path,
    this commit hides the rcu_note_context_switch() check under an existing
    non-common-case check.
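
    A sketch of the plumbing (simplified, omitting the common-case hiding
    mentioned above):

        /* __schedule(preempt) passes its flag through: */
        rcu_note_context_switch(preempt);

        /* ... which records a Tasks RCU quiescent state only for
         * non-preemptive calls to schedule(): */
        void rcu_note_context_switch(bool preempt)
        {
                /* ... existing quiescent-state processing ... */
                if (!preempt)
                        rcu_note_voluntary_context_switch_lite(current);
        }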

    Suggested-by: Steven Rostedt
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Peter Zijlstra proposed using SRCU to reduce mmap_sem contention [1,2];
    however, there are workloads that could result in a high volume of
    concurrent invocations of call_srcu(), which with current SRCU would
    result in excessive lock contention on the srcu_struct structure's
    ->queue_lock, which protects SRCU's callback lists. This commit therefore
    moves SRCU to per-CPU callback lists, thus greatly reducing contention.

    Because a given SRCU instance no longer has a single centralized callback
    list, starting grace periods and invoking callbacks are both more complex
    than in the single-list Classic SRCU implementation. Starting grace
    periods and handling callbacks are now handled using an srcu_node tree
    that is in some ways similar to the rcu_node trees used by RCU-bh,
    RCU-preempt, and RCU-sched (for example, the srcu_node tree shape is
    controlled by exactly the same Kconfig options and boot parameters that
    control the shape of the rcu_node tree).

    In addition, the old per-CPU srcu_array structure is now named srcu_data
    and contains an rcu_segcblist structure named ->srcu_cblist for its
    callbacks (and a spinlock to protect this). The srcu_struct gets
    an srcu_gp_seq that is used to associate callback segments with the
    corresponding completion-time grace-period number. These completion-time
    grace-period numbers are propagated up the srcu_node tree so that the
    grace-period workqueue handler can determine both whether additional
    grace periods are needed and where to look for callbacks that are
    ready to be invoked.
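
    A rough sketch of the resulting layout (field names as given above,
    many details omitted):

        struct srcu_data {                         /* one instance per CPU */
                spinlock_t lock;                   /* protects ->srcu_cblist */
                struct rcu_segcblist srcu_cblist;  /* segmented callback list */
                /* ... */
        };

        struct srcu_struct {
                struct srcu_node *node;            /* combining tree */
                unsigned long srcu_gp_seq;         /* grace-period sequence number */
                struct srcu_data __percpu *sda;    /* per-CPU state (name assumed) */
                /* ... */
        };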

    The srcu_barrier() function must now wait on all instances of the per-CPU
    ->srcu_cblist. Because each ->srcu_cblist is protected by ->lock,
    srcu_barrier() can remotely add the needed callbacks. In theory,
    it could also remotely start grace periods, but in practice doing so
    is complex and racy. And interestingly enough, it is never necessary
    for srcu_barrier() to start a grace period because srcu_barrier() only
    enqueues a callback when a callback is already present--and it turns out
    that a grace period has to have already been started for this pre-existing
    callback. Furthermore, it is only the callback that srcu_barrier()
    needs to wait on, not any particular grace period. Therefore, a new
    rcu_segcblist_entrain() function enqueues the srcu_barrier() function's
    callback into the same segment occupied by the last pre-existing callback
    in the list. The special case where all the pre-existing callbacks are
    on a different list (because they are in the process of being invoked)
    is handled by enqueuing srcu_barrier()'s callback into the RCU_DONE_TAIL
    segment, relying on the done-callbacks check that takes place after all
    callbacks are invoked.

    The readers use the same algorithm as before. Note that there is a
    separate srcu_idx that tells the readers which counter to increment.
    This unfortunately cannot be combined with srcu_gp_seq because they
    need to be incremented at different times.

    This commit introduces some ugly #ifdefs in rcutorture. These will go
    away when I feel good enough about Tree SRCU to ditch Classic SRCU.

    Some crude performance comparisons, courtesy of a quickly hacked rcuperf
    asynchronous-grace-period capability:

    Callback Queuing Overhead
    -------------------------
    # CPUs    Classic SRCU    Tree SRCU
    ------    ------------    ---------
         2        0.349 us     0.342 us
        16        31.66 us       0.4 us
        41       ---------     0.417 us

    The times are the 90th percentiles, a statistic that was chosen to reject
    the overheads of the occasional srcu_barrier() call needed to avoid OOMing
    the test machine. The rcuperf test hangs when running Classic SRCU at 41
    CPUs, hence the line of dashes. Despite the hacks to both the rcuperf code
    and the statistics, this is a convincing demonstration of Tree SRCU's
    performance and scalability advantages.

    [1] https://lwn.net/Articles/309030/
    [2] https://patchwork.kernel.org/patch/5108281/

    Signed-off-by: Paul E. McKenney
    [ paulmck: Fix initialization if synchronize_srcu_expedited() called first. ]

    Paul E. McKenney
     

20 Apr, 2017

4 commits


19 Apr, 2017

12 commits

  • This commit makes the num_rcu_lvl[] array external so that SRCU can
    make use of it for initializing its upcoming srcu_node tree.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The levelcnt[] array is identical to num_rcu_lvl[], so this commit
    removes levelcnt[].

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit moves the rcu_init_levelspread() function from
    kernel/rcu/tree.c to kernel/rcu/rcu.h so that SRCU can access it. This is
    another step towards enabling SRCU to create its own combining tree.
    This commit is code-movement only, give or take knock-on adjustments.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit moves rcu_seq_start(), rcu_seq_end(), rcu_seq_snap(),
    and rcu_seq_done() from kernel/rcu/tree.c to kernel/rcu/rcu.h.
    This will allow SRCU to use these functions, which in turn will
    allow SRCU to move from a single global callback queue to a
    per-CPU callback queue.
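
    A sketch of how these helpers are typically used (variable names
    illustrative):

        unsigned long s;

        s = rcu_seq_snap(&gp_seq);  /* earliest GP that must complete */
        /* ... ensure a grace period is in progress ... */
        if (rcu_seq_done(&gp_seq, s)) {
                /* Callbacks enqueued before the snapshot may now
                 * be invoked. */
        }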

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This is primarily a code-movement commit in preparation for allowing
    SRCU to handle early-boot SRCU grace periods.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • RCU has only one multi-tail callback list, which is implemented via
    the nxtlist, nxttail, nxtcompleted, qlen_lazy, and qlen fields in the
    rcu_data structure, and whose operations are open-coded throughout the
    Tree RCU implementation. This has been more or less OK in the past,
    but upcoming callback-list optimizations in SRCU could really use
    a multi-tail callback list there as well.

    This commit therefore abstracts the multi-tail callback list handling
    into a new kernel/rcu/rcu_segcblist.h file, and uses this new API.
    The simple head-and-tail pointer callback list is also abstracted and
    applied everywhere except for the NOCB callback-offload lists. (Yes,
    the plan is to apply them there as well, but this commit is already
    bigger than would be good.)
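
    A conceptual sketch of the multi-tail list (simplified from the new
    header file):

        struct rcu_segcblist {
                struct rcu_head *head;
                struct rcu_head **tails[RCU_CBLIST_NSEGS];
                /* Segments, in order:
                 *   RCU_DONE_TAIL       -- grace period done, ready to invoke
                 *   RCU_WAIT_TAIL       -- waiting for the current grace period
                 *   RCU_NEXT_READY_TAIL -- waiting for the next grace period
                 *   RCU_NEXT_TAIL       -- not yet associated with a grace period
                 */
                unsigned long gp_seq[RCU_CBLIST_NSEGS];
                long len;       /* total callbacks queued */
                long len_lazy;  /* of which this many are lazy */
        };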

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The rcu_all_qs() and rcu_note_context_switch() functions do a series of checks,
    taking various actions to supply RCU with quiescent states, depending
    on the outcomes of the various checks. This is a bit much for scheduling
    fastpaths, so this commit creates a separate ->rcu_urgent_qs field in
    the rcu_dynticks structure that acts as a global guard for these checks.
    Thus, in the common case, rcu_all_qs() and rcu_note_context_switch()
    check the ->rcu_urgent_qs field, find it false, and simply return.
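
    A sketch of the resulting fastpath (simplified):

        if (!raw_cpu_read(rcu_dynticks.rcu_urgent_qs))
                return;  /* common case: RCU needs nothing from this CPU */
        /* ... otherwise fall through to the full series of checks ... */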

    Signed-off-by: Paul E. McKenney
    Cc: Peter Zijlstra

    Paul E. McKenney
     
  • The rcu_momentary_dyntick_idle() function scans the RCU flavors, checking
    that one of them still needs a quiescent state before doing an expensive
    atomic operation on the ->dynticks counter. However, this check reduces
    overhead only after a rare race condition, and increases complexity. This
    commit therefore removes the scan and the mechanism enabling the scan.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The rcu_qs_ctr variable is yet another isolated per-CPU variable,
    so this commit pulls it into the pre-existing rcu_dynticks per-CPU
    structure.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The rcu_sched_qs_mask variable is yet another isolated per-CPU variable,
    so this commit pulls it into the pre-existing rcu_dynticks per-CPU
    structure.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The current use of "RCU_TRACE(statement);" can cause odd bugs, especially
    where "statement" is a local-variable declaration, as it can leave a
    misplaced ";" in the source code. This commit therefore converts these
    to "RCU_TRACE(statement;)", which avoids the misplaced ";".

    Reported-by: Josh Triplett
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, IPIs are used to force other CPUs to invalidate their TLBs
    in response to a kernel virtual-memory mapping change. This works, but
    degrades both battery lifetime (for idle CPUs) and real-time response
    (for nohz_full CPUs), and in addition results in unnecessary IPIs due to
    the fact that CPUs executing in usermode are unaffected by stale kernel
    mappings. It would be better to cause a CPU executing in usermode to
    wait until it is entering kernel mode to do the flush, first to avoid
    interrupting usermode tasks and second to handle multiple flush requests
    with a single flush in the case of a long-running user task.

    This commit therefore reserves a bit at the bottom of the ->dynticks
    counter, which is checked upon exit from extended quiescent states.
    If it is set, it is cleared and then a new rcu_eqs_special_exit() macro is
    invoked, which, if not supplied, is an empty single-pass do-while loop.
    If this bottom bit is set on -entry- to an extended quiescent state,
    then a WARN_ON_ONCE() triggers.

    This bottom bit may be set using a new rcu_eqs_special_set() function,
    which returns true if the bit was set, or false if the CPU turned
    out to not be in an extended quiescent state. Please note that this
    function refuses to set the bit for a non-nohz_full CPU when that CPU
    is executing in usermode because usermode execution is tracked by RCU
    as a dyntick-idle extended quiescent state only for nohz_full CPUs.
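
    A sketch of the resulting ->dynticks layout and interface (simplified,
    names from the corresponding mainline implementation):

        #define RCU_DYNTICK_CTRL_MASK 0x1  /* bottom bit: request special
                                              handling on EQS exit */
        #define RCU_DYNTICK_CTRL_CTR  0x2  /* the counter itself advances
                                              in units of this */

        bool rcu_eqs_special_set(int cpu);  /* true: bit now set; false:
                                               CPU was not in an extended
                                               quiescent state */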

    Reported-by: Andy Lutomirski
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     

11 Apr, 2017

2 commits

  • Tracing uses rcu_irq_enter() as a way to make sure that RCU is watching when
    it needs to use rcu_read_lock() and friends. This is because tracing can
    happen as RCU is about to enter user space, or about to go idle, and RCU
    does not watch for RCU read side critical sections as it makes the
    transition.

    There is a small location within the RCU infrastructure in which
    rcu_irq_enter() itself will not work. If tracing were to occur in that
    section, it would break if it tried to use rcu_irq_enter().

    Originally, this happened with the stack_tracer, because it calls
    save_stack_trace() when it encounters stack usage greater than any it
    had encountered previously. There was a case where that happened in
    the RCU section where rcu_irq_enter() did not work, and lockdep
    complained loudly about it. To fix it, stack tracing added a way to
    be disabled, and RCU disabled stack tracing during the critical
    section in which rcu_irq_enter() was inoperable. This solution worked,
    but there are other users of rcu_irq_enter(), and it would be a good
    idea for RCU to provide a way to let others know that rcu_irq_enter()
    will not work, for example, in trace events.

    Another helpful aspect of this change is that it moves the per-CPU
    variable used in the RCU critical section into the same cache locality
    as the other RCU per-CPU variables used in that location.

    I'm keeping the stack_trace_disable() code, as that still could be used in
    the future by places that really need to disable it. And since it's only a
    static inline, it won't take up any kernel text if it is not used.

    Link: http://lkml.kernel.org/r/20170405093207.404f8deb@gandalf.local.home

    Acked-by: Paul E. McKenney
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • The tracing subsystem started using rcu_irq_enter() and rcu_irq_exit()
    (with my blessing) to allow the current _rcuidle alternative tracepoint
    name to be dispensed with while still maintaining good performance.
    Unfortunately, this causes RCU's dyntick-idle entry code's tracing to
    appear to RCU like an interrupt that occurs where RCU is not designed
    to handle interrupts.

    This commit fixes this problem by moving the zeroing of ->dynticks_nesting
    after the offending trace_rcu_dyntick() statement, which narrows the
    window of vulnerability to a pair of adjacent statements that are now
    marked with comments to that effect.

    Link: http://lkml.kernel.org/r/20170405093207.404f8deb@gandalf.local.home
    Link: http://lkml.kernel.org/r/20170405193928.GM1600@linux.vnet.ibm.com

    Reported-by: Steven Rostedt
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Steven Rostedt (VMware)

    Paul E. McKenney
     

02 Mar, 2017

2 commits