23 Aug, 2016

1 commit

  • The current implementation of expedited grace periods has the user
    task drive the grace period. This works, but has downsides: (1) The
    user task must awaken tasks piggybacking on this grace period, which
    can result in latencies rivaling that of the grace period itself, and
    (2) User tasks can receive signals, which interfere with RCU CPU stall
    warnings.

    This commit therefore uses workqueues to drive the grace periods, so
    that the user task need not do the awakening. A subsequent commit
    will remove the now-unnecessary code allowing for signals.
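
    As a rough sketch of the idea (not the actual patch; the structure and
    function names below are illustrative), the requesting task can queue a
    work item that drives the grace period and then simply sleep until that
    work completes, so the wakeups come from a workqueue worker rather than
    from the requesting user task:

        #include <linux/kernel.h>
        #include <linux/workqueue.h>
        #include <linux/completion.h>

        struct exp_gp_work {
                struct work_struct work;
                struct completion done;
        };

        static void exp_gp_workfn(struct work_struct *wp)
        {
                struct exp_gp_work *egw =
                        container_of(wp, struct exp_gp_work, work);

                /* ... drive the expedited grace period to completion ... */
                complete(&egw->done);            /* awaken the requester */
        }

        static void synchronize_exp_sketch(void)
        {
                struct exp_gp_work egw;

                INIT_WORK_ONSTACK(&egw.work, exp_gp_workfn);
                init_completion(&egw.done);
                queue_work(system_unbound_wq, &egw.work);
                wait_for_completion(&egw.done);  /* uninterruptible: no signals */
                destroy_work_on_stack(&egw.work);
        }

    Because the wait here is uninterruptible, the signal handling that the
    user-task-driven scheme required becomes unnecessary, which is what the
    subsequent commit mentioned above removes.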

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

01 Apr, 2016

2 commits


08 Dec, 2015

1 commit


05 Dec, 2015

2 commits

  • The Kconfig currently controlling compilation of this code is:

    init/Kconfig:config TREE_RCU_TRACE
    init/Kconfig: def_bool RCU_TRACE && ( TREE_RCU || PREEMPT_RCU )

    ...meaning that it currently is not being built as a module by anyone.

    Let's remove the modular code that is essentially orphaned, so that
    when reading the file there is no doubt it is builtin-only.

    Since module_init translates to device_initcall in the non-modular
    case, the init ordering remains unchanged with this commit. We could
    consider moving this to an earlier initcall if desired.

    We don't replace module.h with init.h since the file already has that.
    We also delete the moduleparam.h include that is left over from
    commit 64db4cfff99c04cd5f550357edcc8780f96b54a2 (""Tree RCU": scalable
    classic RCU implementation") since it is not needed here either.

    We morph some tags like MODULE_AUTHOR into the comments at the top of
    the file for documentation purposes.
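
    As a sketch of what such a demodularization typically looks like (the
    function name below is illustrative, not quoted from the patch), the
    module-specific constructs are replaced by their built-in equivalents:

        #include <linux/init.h>

        /* Formerly reachable via module_init(tree_trace_init), with
         * MODULE_AUTHOR() and friends; the authorship information now
         * lives in the comment block at the top of the file. */
        static int __init tree_trace_init(void)
        {
                /* ... create the debugfs tracing files ... */
                return 0;
        }

        /* module_init() maps to device_initcall() in the built-in case,
         * so this keeps the initialization ordering unchanged. */
        device_initcall(tree_trace_init);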

    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Reviewed-by: Josh Triplett
    Cc: Steven Rostedt
    Cc: Mathieu Desnoyers
    Cc: Lai Jiangshan
    Signed-off-by: Paul Gortmaker
    Signed-off-by: Paul E. McKenney

    Paul Gortmaker
     
  • Currently, the piggybacked-work checks carried out by sync_exp_work_done()
    atomically increment a small set of variables (the ->expedited_workdone0,
    ->expedited_workdone1, ->expedited_workdone2, ->expedited_workdone3
    fields in the rcu_state structure), which will form a memory-contention
    bottleneck given a sufficiently large number of CPUs concurrently invoking
    either synchronize_rcu_expedited() or synchronize_sched_expedited().

    This commit therefore moves these four fields to the per-CPU rcu_data
    structure, eliminating the memory contention. The show_rcuexp() function
    is changed accordingly to sum each field across the rcu_data structures.
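
    The shape of the change is roughly as follows (a sketch only; the field
    and function names here are illustrative): each CPU bumps its own
    counter on the hot path, and the statistics code sums across CPUs only
    when the counts are displayed:

        #include <linux/percpu.h>
        #include <linux/atomic.h>
        #include <linux/cpumask.h>

        struct exp_stats {                       /* per-CPU: no contention */
                atomic_long_t exp_workdone1;
                atomic_long_t exp_workdone2;
        };
        static DEFINE_PER_CPU(struct exp_stats, exp_stats);

        static void record_workdone1(void)       /* hot path: local increment */
        {
                atomic_long_inc(&this_cpu_ptr(&exp_stats)->exp_workdone1);
        }

        static unsigned long sum_workdone1(void) /* show_rcuexp() analog */
        {
                unsigned long sum = 0;
                int cpu;

                for_each_possible_cpu(cpu)
                        sum += atomic_long_read(
                                &per_cpu_ptr(&exp_stats, cpu)->exp_workdone1);
                return sum;
        }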

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

24 Nov, 2015

1 commit

  • The rule is that all acquisitions of the rcu_node structure's ->lock
    must provide transitivity: The lock is not acquired that frequently,
    and sorting out exactly which acquisitions required it and which did not
    would be a maintenance nightmare. This commit therefore supplies the needed
    transitivity to the remaining ->lock acquisitions.
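
    Concretely, the idiom is to follow the lock acquisition with
    smp_mb__after_unlock_lock(), which upgrades it to provide full
    (transitive) ordering; a minimal sketch, along the lines of what the
    kernel later wrapped up in raw_spin_lock_rcu_node() and friends:

        static inline void rcu_node_lock_sketch(struct rcu_node *rnp)
        {
                raw_spin_lock(&rnp->lock);
                smp_mb__after_unlock_lock();    /* Provide transitivity. */
        }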

    Reported-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

08 Oct, 2015

1 commit


07 Oct, 2015

1 commit

  • Commit 4cdfc175c25c89ee ("rcu: Move quiescent-state forcing
    into kthread") started the process of folding the old ->fqs_state into
    ->gp_state, but did not complete it. This situation does not cause
    any malfunction, but can result in extremely confusing trace output.
    This commit completes this task of eliminating ->fqs_state in favor
    of ->gp_state.

    The old ->fqs_state was also used to decide when to collect dyntick-idle
    snapshots. For this purpose, this commit adds a boolean variable to the
    grace-period kthread, which is set for the first call to rcu_gp_fqs() in
    a given grace period and cleared otherwise, as in the sketch below.
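
    A minimal sketch of that control flow (the names here are illustrative,
    not the kernel's own):

        #include <linux/types.h>

        static void gp_fqs_pass(bool first_time)
        {
                if (first_time) {
                        /* Snapshot each CPU's dyntick-idle state. */
                } else {
                        /* Recheck the snapshots and force quiescent states. */
                }
        }

        static void gp_fqs_loop_sketch(int fqs_passes)
        {
                bool first_gp_fqs = true;       /* first FQS pass of this GP? */

                while (fqs_passes-- > 0) {      /* until the GP completes */
                        gp_fqs_pass(first_gp_fqs);
                        first_gp_fqs = false;
                }
        }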

    Signed-off-by: Petr Mladek
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Petr Mladek
     

21 Sep, 2015

3 commits

  • This commit converts the rcu_data structure's ->cpu_no_qs field
    to a union. The bytewise side of this union allows individual access
    to indications as to whether this CPU needs to find a quiescent state
    for a normal (.norm) and/or expedited (.exp) grace period. The setwise
    side of the union allows testing whether or not a quiescent state is
    needed at all, for either type of grace period.

    For now, only .norm is used. A later commit will introduce the expedited
    usage.
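
    The definition is along the following lines (a sketch that follows the
    commit text; see kernel/rcu/tree.h for the authoritative version):

        #include <linux/types.h>

        union rcu_noqs {
                struct {
                        u8 norm;        /* QS still needed for a normal GP? */
                        u8 exp;         /* QS still needed for an expedited GP? */
                } b;                    /* Bytewise access. */
                u16 s;                  /* Setwise access. */
        };

    With this layout, rdp->cpu_no_qs.b.norm addresses the normal-grace-period
    indication by itself, while a single test of rdp->cpu_no_qs.s answers
    "is any quiescent state needed at all?"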

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit inverts the sense of the rcu_data structure's ->passed_quiesce
    field and renames it to ->cpu_no_qs. This will allow a later commit to
    use an "aggregate OR" operation to test expedited as well as normal grace
    periods without added overhead.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • An upcoming commit needs to invert the sense of the ->passed_quiesce
    rcu_data structure field, so this commit is taking this opportunity
    to clarify things a bit by renaming ->qs_pending to ->core_needs_qs.

    So if !rdp->core_needs_qs, then this CPU need not concern itself with
    quiescent states; in particular, it need not acquire its leaf rcu_node
    structure's ->lock to check. Otherwise, it needs to report the next
    quiescent state.
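
    In other words, the quiescent-state bookkeeping can begin with a cheap
    lockless check along these lines (a sketch, not the literal kernel code):

        if (!rdp->core_needs_qs)
                return;         /* RCU core needs no QS from this CPU. */

        /* Otherwise, report the next quiescent state, which may involve
         * acquiring the leaf rcu_node structure's ->lock. */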

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

18 Jul, 2015

6 commits

  • In the common case, there will be only one expedited grace period in
    the system at a given time, in which case it is not helpful to use
    funnel locking. This commit therefore adds a fastpath that bypasses
    funnel locking when the root ->exp_funnel_mutex is not held.
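
    A sketch of such a fastpath (simplified; rnp_root here names the root
    rcu_node structure, and the real code also rechecks whether the needed
    grace period has already completed):

        /* Uncontended case: if nobody holds the root-level mutex, take it
         * directly and skip the leaf-to-root funnel walk. */
        if (mutex_trylock(&rnp_root->exp_funnel_mutex))
                return rnp_root;        /* caller now holds the root mutex */

        /* Contended case: fall back to funnel locking from this CPU's leaf. */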

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The strictly rcu_node based funnel-locking scheme works well in many
    cases, but systems with CONFIG_RCU_FANOUT_LEAF=64 won't necessarily get
    all that much concurrency. This commit therefore extends the funnel
    locking into the per-CPU rcu_data structure, providing concurrency equal
    to the number of CPUs.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The rcu_seq operations were open-coded in _rcu_barrier(), so this commit
    replaces the open-coding with the shiny new rcu_seq operations.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Sequentially stopping the CPUs slows down expedited grace periods by
    at least a factor of two, based on rcutorture's grace-period-per-second
    rate. This is a conservative measure because rcutorture uses unusually
    long RCU read-side critical sections and because rcutorture periodically
    quiesces the system in order to test RCU's ability to ramp down to and
    up from the idle state. This commit therefore replaces stop_one_cpu()
    with stop_one_cpu_nowait(), using an atomic-counter scheme to determine
    when all CPUs have passed through the stopped state.
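
    The completion-detection scheme can be sketched as follows (illustrative
    only): initialize an atomic counter to the number of CPUs being stopped,
    have each stopper callback decrement it, and have the last CPU to finish
    wake up the waiting task:

        #include <linux/atomic.h>
        #include <linux/wait.h>

        static atomic_t cpus_remaining;         /* CPUs not yet stopped */
        static DECLARE_WAIT_QUEUE_HEAD(exp_stop_wq);

        static int exp_stop_cpu(void *arg)      /* runs on each stopped CPU */
        {
                /* ... per-CPU expedited quiescent-state work ... */
                if (atomic_dec_and_test(&cpus_remaining))
                        wake_up(&exp_stop_wq);  /* last CPU reports in */
                return 0;
        }

        static void wait_for_all_cpus_sketch(int ncpus)
        {
                atomic_set(&cpus_remaining, ncpus);
                /* ... stop_one_cpu_nowait(cpu, exp_stop_cpu, NULL, &work)
                 *     for each CPU, rather than sequential stop_one_cpu() ... */
                wait_event(exp_stop_wq, !atomic_read(&cpus_remaining));
        }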

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney

    Peter Zijlstra
     
  • This commit gets rid of synchronize_sched_expedited()'s mutex_trylock()
    polling loop in favor of a funnel-locking scheme based on the rcu_node
    tree. The work-done check is done at each level of the tree, allowing
    high-contention situations to be resolved quickly with reasonable levels
    of mutex contention.
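
    Funnel locking here means, roughly, that each task starts at its leaf
    rcu_node structure and acquires ->exp_funnel_mutex level by level toward
    the root, releasing the lower-level mutex once the next level up is held,
    and abandoning the climb as soon as the work-done check shows that some
    other task's grace period already covers it. A sketch (simplified, with
    a hypothetical work_already_done() standing in for that check):

        static bool work_already_done(unsigned long snap);      /* hypothetical */

        /* Returns with the root-level mutex held, or NULL if the needed
         * grace period completed while climbing the tree. */
        static struct rcu_node *exp_funnel_lock_sketch(struct rcu_node *rnp_leaf,
                                                       unsigned long snap)
        {
                struct rcu_node *rnp = rnp_leaf;

                mutex_lock(&rnp->exp_funnel_mutex);
                for (;;) {
                        if (work_already_done(snap)) {
                                mutex_unlock(&rnp->exp_funnel_mutex);
                                return NULL;    /* piggyback on the other GP */
                        }
                        if (!rnp->parent)
                                return rnp;     /* at the root: we drive the GP */
                        mutex_lock(&rnp->parent->exp_funnel_mutex);
                        mutex_unlock(&rnp->exp_funnel_mutex);
                        rnp = rnp->parent;
                }
        }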

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Now that synchronize_sched_expedited() has a mutex, it can use a simpler
    work-already-done detection scheme. This commit simplifies this scheme
    by using something similar to the sequence-locking counter scheme.
    A counter is incremented before and after each grace period, so that
    the counter is odd in the midst of the grace period and even otherwise.
    So if the counter has advanced to the second even number that is
    greater than or equal to the snapshot, the required grace period has
    already happened.
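
    The arithmetic can be sketched as follows (this mirrors the rcu_seq
    helpers in spirit; the kernel's versions also handle counter wrap):

        #include <linux/types.h>

        static unsigned long gpseq;     /* odd: GP in progress; even: idle */

        static void seq_start(void) { gpseq++; /* now odd  */ }
        static void seq_end(void)   { gpseq++; /* now even */ }

        /* Value the counter must reach before the caller's need is met:
         * the second even number greater than or equal to the current
         * counter value. */
        static unsigned long seq_snap(void)
        {
                return (gpseq + 3) & ~0x1UL;
        }

        /* Has a sufficient grace period completed since seq_snap()? */
        static bool seq_done(unsigned long s)
        {
                return gpseq >= s;      /* kernel compares wrap-safely */
        }

    For example, if gpseq is 4 (idle), seq_snap() returns 6: the grace period
    that ends by advancing the counter from 5 to 6 must start after the
    snapshot, so it suffices. If gpseq is 5 (a grace period already in
    flight), seq_snap() returns 8, skipping the in-flight grace period, which
    may have begun too early to cover the caller's updates.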

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

28 May, 2015

1 commit


13 Mar, 2015

1 commit

  • Races between CPU hotplug and grace periods can be difficult to resolve,
    so the ->onoff_mutex is used to exclude the two events. Unfortunately,
    this means that it is impossible for an outgoing CPU to perform the
    last bits of its offlining from its last pass through the idle loop,
    because sleeplocks cannot be acquired in that context.

    This commit avoids these problems by buffering online and offline events
    in a new ->qsmaskinitnext field in the leaf rcu_node structures. When a
    grace period starts, the events accumulated in this mask are applied to
    the ->qsmaskinit field, and, if needed, up the rcu_node tree. The special
    case of all CPUs corresponding to a given leaf rcu_node structure being
    offline while there are still elements in that structure's ->blkd_tasks
    list is handled using a new ->wait_blkd_tasks field. In this case,
    propagating the offline bits up the tree is deferred until the beginning
    of the grace period after all of the tasks have exited their RCU read-side
    critical sections and removed themselves from the list, at which point
    the ->wait_blkd_tasks flag is cleared. If one of that leaf rcu_node
    structure's CPUs comes back online before the list empties, then the
    ->wait_blkd_tasks flag is simply cleared.
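
    Stripped of the ->blkd_tasks special case and of propagation up the
    rcu_node tree (and with locking omitted), the buffering amounts to
    something like this sketch, with field names following the commit text:

        struct leaf_rnp_sketch {
                unsigned long qsmaskinit;      /* CPUs this GP must wait for   */
                unsigned long qsmaskinitnext;  /* buffered online/offline state */
        };

        /* CPU-hotplug path: no grace-period exclusion, no sleeplocks. */
        static void sketch_cpu_online(struct leaf_rnp_sketch *rnp,
                                      unsigned long mask)
        {
                rnp->qsmaskinitnext |= mask;
        }

        static void sketch_cpu_offline(struct leaf_rnp_sketch *rnp,
                                       unsigned long mask)
        {
                rnp->qsmaskinitnext &= ~mask;
        }

        /* Grace-period initialization: apply the buffered events. */
        static void sketch_gp_start(struct leaf_rnp_sketch *rnp)
        {
                rnp->qsmaskinit = rnp->qsmaskinitnext;
        }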

    This of course means that RCU's notion of which CPUs are offline can be
    out of date. This is OK because RCU need only wait on CPUs that were
    online at the time that the grace period started. In addition, RCU's
    force-quiescent-state actions will handle the case where a CPU goes
    offline after the grace period starts.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

16 Jan, 2015

1 commit

  • Although cond_resched_rcu_qs() only applies to TASKS_RCU, it is used
    in places where it would be useful for it to apply to the normal RCU
    flavors, rcu_preempt, rcu_sched, and rcu_bh. This is especially the
    case for workloads that aggressively overload the system, particularly
    those that generate large numbers of RCU updates on systems running
    NO_HZ_FULL CPUs. This commit therefore communicates quiescent states
    from cond_resched_rcu_qs() to the normal RCU flavors.
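
    In outline, the mechanism is a per-CPU counter that cond_resched_rcu_qs()
    increments and that the grace-period machinery snapshots at the start of
    each grace period; any change relative to the snapshot proves that the
    CPU passed through the quiescent point. A sketch (names approximate):

        #include <linux/types.h>
        #include <linux/percpu.h>

        static DEFINE_PER_CPU(unsigned long, qs_ctr);   /* bumped at each QS */

        /* Called from cond_resched_rcu_qs(): note a quiescent state cheaply. */
        static inline void note_quiescent_state(void)
        {
                this_cpu_inc(qs_ctr);
        }

        /* Grace-period side: has this CPU passed a quiescent state since
         * the snapshot (cf. ->rcu_qs_ctr_snap below) was taken? */
        static inline bool cpu_passed_qs(int cpu, unsigned long snap)
        {
                return per_cpu(qs_ctr, cpu) != snap;
        }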

    Note that it is unfortunately necessary to leave the old ->passed_quiesce
    mechanism in place to allow quiescent states that apply to only one
    flavor to be recorded. (Yes, we could decrement ->rcu_qs_ctr_snap in
    that case, but that is not so good for debugging of RCU internals.)
    In addition, if one of the RCU flavors' grace periods has stalled, this
    will invoke rcu_momentary_dyntick_idle(), resulting in a heavy-weight
    quiescent state visible from other CPUs.

    Reported-by: Sasha Levin
    Reported-by: Dave Jones
    Signed-off-by: Paul E. McKenney
    [ paulmck: Merge commit from Sasha Levin fixing a bug where __this_cpu()
    was used in preemptible code. ]

    Paul E. McKenney
     

18 Feb, 2014

2 commits

  • All of the RCU source files have the usual GPL header, which contains a
    long-obsolete postal address for FSF. To avoid the need to track the
    FSF office's movements, this commit substitutes the URL where GPL may
    be found.

    Reported-by: Greg KH
    Reported-by: Steven Rostedt
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • The ->n_force_qs_lh field is accessed without the benefit of any
    synchronization, so this commit adds the needed ACCESS_ONCE() wrappers.
    Yes, increments to ->n_force_qs_lh can be lost, but contention should
    be low and the field is strictly statistical in nature, so this is not
    a problem.
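
    The change amounts to converting plain accesses into marked ones, along
    these lines (illustrative; today's kernels would spell this with
    READ_ONCE()/WRITE_ONCE()):

        /* Before: a plain, unmarked read-modify-write. */
        rsp->n_force_qs_lh++;

        /* After: tell the compiler that this statistical counter is
         * accessed without synchronization; lost updates are acceptable. */
        ACCESS_ONCE(rsp->n_force_qs_lh)++;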

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     

04 Dec, 2013

1 commit

  • Dave Jones got the following lockdep splat:

    > ======================================================
    > [ INFO: possible circular locking dependency detected ]
    > 3.12.0-rc3+ #92 Not tainted
    > -------------------------------------------------------
    > trinity-child2/15191 is trying to acquire lock:
    > (&rdp->nocb_wq){......}, at: [] __wake_up+0x23/0x50
    >
    > but task is already holding lock:
    > (&ctx->lock){-.-...}, at: [] perf_event_exit_task+0x109/0x230
    >
    > which lock already depends on the new lock.
    >
    >
    > the existing dependency chain (in reverse order) is:
    >
    > -> #3 (&ctx->lock){-.-...}:
    > [] lock_acquire+0x93/0x200
    > [] _raw_spin_lock+0x40/0x80
    > [] __perf_event_task_sched_out+0x2df/0x5e0
    > [] perf_event_task_sched_out+0x93/0xa0
    > [] __schedule+0x1d2/0xa20
    > [] preempt_schedule_irq+0x50/0xb0
    > [] retint_kernel+0x26/0x30
    > [] tty_flip_buffer_push+0x34/0x50
    > [] pty_write+0x54/0x60
    > [] n_tty_write+0x32d/0x4e0
    > [] tty_write+0x158/0x2d0
    > [] vfs_write+0xc0/0x1f0
    > [] SyS_write+0x4c/0xa0
    > [] tracesys+0xdd/0xe2
    >
    > -> #2 (&rq->lock){-.-.-.}:
    > [] lock_acquire+0x93/0x200
    > [] _raw_spin_lock+0x40/0x80
    > [] wake_up_new_task+0xc2/0x2e0
    > [] do_fork+0x126/0x460
    > [] kernel_thread+0x26/0x30
    > [] rest_init+0x23/0x140
    > [] start_kernel+0x3f6/0x403
    > [] x86_64_start_reservations+0x2a/0x2c
    > [] x86_64_start_kernel+0xf1/0xf4
    >
    > -> #1 (&p->pi_lock){-.-.-.}:
    > [] lock_acquire+0x93/0x200
    > [] _raw_spin_lock_irqsave+0x4b/0x90
    > [] try_to_wake_up+0x31/0x350
    > [] default_wake_function+0x12/0x20
    > [] autoremove_wake_function+0x18/0x40
    > [] __wake_up_common+0x58/0x90
    > [] __wake_up+0x39/0x50
    > [] __call_rcu_nocb_enqueue+0xa8/0xc0
    > [] __call_rcu+0x140/0x820
    > [] call_rcu+0x1d/0x20
    > [] cpu_attach_domain+0x287/0x360
    > [] build_sched_domains+0xe5e/0x10a0
    > [] sched_init_smp+0x3b7/0x47a
    > [] kernel_init_freeable+0xf6/0x202
    > [] kernel_init+0xe/0x190
    > [] ret_from_fork+0x7c/0xb0
    >
    > -> #0 (&rdp->nocb_wq){......}:
    > [] __lock_acquire+0x191a/0x1be0
    > [] lock_acquire+0x93/0x200
    > [] _raw_spin_lock_irqsave+0x4b/0x90
    > [] __wake_up+0x23/0x50
    > [] __call_rcu_nocb_enqueue+0xa8/0xc0
    > [] __call_rcu+0x140/0x820
    > [] kfree_call_rcu+0x20/0x30
    > [] put_ctx+0x4f/0x70
    > [] perf_event_exit_task+0x12e/0x230
    > [] do_exit+0x30d/0xcc0
    > [] do_group_exit+0x4c/0xc0
    > [] SyS_exit_group+0x14/0x20
    > [] tracesys+0xdd/0xe2
    >
    > other info that might help us debug this:
    >
    > Chain exists of:
    > &rdp->nocb_wq --> &rq->lock --> &ctx->lock
    >
    > Possible unsafe locking scenario:
    >
    >        CPU0                    CPU1
    >        ----                    ----
    >   lock(&ctx->lock);
    >                                lock(&rq->lock);
    >                                lock(&ctx->lock);
    >   lock(&rdp->nocb_wq);
    >
    > *** DEADLOCK ***
    >
    > 1 lock held by trinity-child2/15191:
    > #0: (&ctx->lock){-.-...}, at: [] perf_event_exit_task+0x109/0x230
    >
    > stack backtrace:
    > CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92
    > ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40
    > ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0
    > ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0
    > Call Trace:
    > [] dump_stack+0x4e/0x82
    > [] print_circular_bug+0x200/0x20f
    > [] __lock_acquire+0x191a/0x1be0
    > [] ? get_lock_stats+0x19/0x60
    > [] ? native_sched_clock+0x24/0x80
    > [] lock_acquire+0x93/0x200
    > [] ? __wake_up+0x23/0x50
    > [] _raw_spin_lock_irqsave+0x4b/0x90
    > [] ? __wake_up+0x23/0x50
    > [] __wake_up+0x23/0x50
    > [] __call_rcu_nocb_enqueue+0xa8/0xc0
    > [] __call_rcu+0x140/0x820
    > [] ? local_clock+0x3f/0x50
    > [] kfree_call_rcu+0x20/0x30
    > [] put_ctx+0x4f/0x70
    > [] perf_event_exit_task+0x12e/0x230
    > [] do_exit+0x30d/0xcc0
    > [] ? trace_hardirqs_on_caller+0x115/0x1e0
    > [] ? trace_hardirqs_on+0xd/0x10
    > [] do_group_exit+0x4c/0xc0
    > [] SyS_exit_group+0x14/0x20
    > [] tracesys+0xdd/0xe2

    The underlying problem is that perf is invoking call_rcu() with the
    scheduler locks held, but in NOCB mode, call_rcu() will with high
    probability invoke the scheduler -- which just might want to use its
    locks. The reason that call_rcu() needs to invoke the scheduler is
    to wake up the corresponding rcuo callback-offload kthread, which
    does the job of starting up a grace period and invoking the callbacks
    afterwards.

    One solution (championed on a related problem by Lai Jiangshan) is to
    simply defer the wakeup to some point where scheduler locks are no longer
    held. Since we don't want to unnecessarily incur the cost of such
    deferral, the task before us is threefold:

    1. Determine when it is likely that a relevant scheduler lock is held.

    2. Defer the wakeup in such cases.

    3. Ensure that all deferred wakeups eventually happen, preferably
    sooner rather than later.

    We use irqs_disabled_flags() as a proxy for relevant scheduler locks
    being held. This works because the relevant locks are always acquired
    with interrupts disabled. We may defer more often than needed, but that
    is at least safe.

    The wakeup deferral is tracked via a new field in the per-CPU and
    per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.

    This flag is checked by the RCU core processing. The __rcu_pending()
    function now checks this flag, which causes rcu_check_callbacks()
    to initiate RCU core processing at each scheduling-clock interrupt
    where this flag is set. Of course this is not sufficient because
    scheduling-clock interrupts are often turned off (the things we used to
    be able to count on!). So the flags are also checked on entry to any
    state that RCU considers to be idle, which includes both NO_HZ_IDLE idle
    state and NO_HZ_FULL user-mode-execution state.
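
    Putting the pieces together, the enqueue-time decision is roughly the
    following (a sketch, not the literal kernel code):

        /* call_rcu() path, after queuing the callback for the rcuo kthread;
         * interrupts are disabled here and flags holds the caller's state. */
        if (irqs_disabled_flags(flags)) {
                rdp->nocb_defer_wakeup = true;  /* scheduler locks may be held */
        } else {
                wake_up(&rdp->nocb_wq);         /* safe to wake immediately */
        }

        /* Later, from RCU core processing or on entry to an idle state: */
        if (rdp->nocb_defer_wakeup) {
                rdp->nocb_defer_wakeup = false;
                wake_up(&rdp->nocb_wq);
        }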

    This approach should allow call_rcu() to be invoked regardless of what
    locks you might be holding, the key word being "should".

    Reported-by: Dave Jones
    Signed-off-by: Paul E. McKenney
    Cc: Peter Zijlstra

    Paul E. McKenney
     

16 Oct, 2013

1 commit