23 Aug, 2016

1 commit

  • The current implementation of expedited grace periods has the user
    task drive the grace period. This works, but has downsides: (1) The
    user task must awaken tasks piggybacking on this grace period, which
    can result in latencies rivaling that of the grace period itself, and
    (2) User tasks can receive signals, which interfere with RCU CPU stall
    warnings.

    This commit therefore uses workqueues to drive the grace periods, so
    that the user task need not do the awakening. A subsequent commit
    will remove the now-unnecessary code allowing for signals.
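
    A hedged sketch of the shape of this change, using illustrative names
    (struct rcu_exp_work, drive_expedited_gp(), next_expedited_seq()) rather
    than the kernel's actual identifiers: the requester queues a work item
    and sleeps on a completion, and the workqueue handler both drives the
    grace period and performs the wakeup.

    #include <linux/completion.h>
    #include <linux/workqueue.h>

    struct rcu_exp_work {
            struct work_struct work;
            unsigned long exp_seq;          /* which expedited GP is needed */
            struct completion done;
    };

    static void rcu_exp_gp_workfn(struct work_struct *wp)
    {
            struct rcu_exp_work *rewp = container_of(wp, struct rcu_exp_work, work);

            drive_expedited_gp(rewp->exp_seq);  /* hypothetical: select CPUs, wait, clean up */
            complete(&rewp->done);              /* the workqueue, not the user task, awakens waiters */
    }

    static void synchronize_exp_sketch(void)
    {
            struct rcu_exp_work rew;

            rew.exp_seq = next_expedited_seq(); /* hypothetical */
            init_completion(&rew.done);
            INIT_WORK_ONSTACK(&rew.work, rcu_exp_gp_workfn);
            queue_work(system_unbound_wq, &rew.work);
            wait_for_completion(&rew.done);     /* sleeps uninterruptibly: no signal handling needed */
            destroy_work_on_stack(&rew.work);
    }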

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

16 Jun, 2016

1 commit

  • In many cases in the RCU tree code, we iterate over the set of cpus for
    a leaf node described by rcu_node::grplo and rcu_node::grphi, checking
    per-cpu data for each cpu in this range. However, if the set of possible
    cpus is sparse, some cpus described in this range are not possible, and
    thus no per-cpu region will have been allocated (or initialised) for
    them by the generic percpu code.

    Erroneous accesses to a per-cpu area for these !possible cpus may fault
    or may hit other data, depending on the address generated when the
    erroneous per-CPU offset is applied. In practice, both cases have been
    observed on arm64 hardware (the former being silent, but detectable with
    additional patches).

    To avoid issues resulting from this, we must iterate over the set of
    *possible* cpus for a given leaf node. This patch adds a new helper,
    for_each_leaf_node_possible_cpu, to enable this. As iteration is often
    intertwined with rcu_node local bitmask manipulation, a new
    leaf_node_cpu_bit helper is added to make this simpler and more
    consistent. The RCU tree code is made to use both of these where
    appropriate.
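
    A sketch of the two helpers and a typical use; the actual definitions
    live in kernel/rcu/tree.h and may differ in detail.

    /* Iterate over only the *possible* CPUs covered by leaf rcu_node rnp. */
    #define for_each_leaf_node_possible_cpu(rnp, cpu) \
            for ((cpu) = cpumask_next((rnp)->grplo - 1, cpu_possible_mask); \
                 (cpu) <= (rnp)->grphi; \
                 (cpu) = cpumask_next((cpu), cpu_possible_mask))

    /* Bit in rnp's bitmasks (->qsmask, ->expmask, ...) corresponding to cpu. */
    #define leaf_node_cpu_bit(rnp, cpu) \
            (1UL << ((cpu) - (rnp)->grplo))

    /* Usage sketch: per-CPU areas exist for every CPU this loop visits. */
    static void leaf_scan_sketch(struct rcu_node *rnp)
    {
            int cpu;

            for_each_leaf_node_possible_cpu(rnp, cpu) {
                    if (!(READ_ONCE(rnp->expmask) & leaf_node_cpu_bit(rnp, cpu)))
                            continue;
                    /* ... safe to touch this CPU's rcu_data here ... */
            }
    }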

    Without this patch, running reboot at a shell can result in an oops
    like:

    [ 3369.075979] Unable to handle kernel paging request at virtual address ffffff8008b21b4c
    [ 3369.083881] pgd = ffffffc3ecdda000
    [ 3369.087270] [ffffff8008b21b4c] *pgd=00000083eca48003, *pud=00000083eca48003, *pmd=0000000000000000
    [ 3369.096222] Internal error: Oops: 96000007 [#1] PREEMPT SMP
    [ 3369.101781] Modules linked in:
    [ 3369.104825] CPU: 2 PID: 1817 Comm: NetworkManager Tainted: G W 4.6.0+ #3
    [ 3369.121239] task: ffffffc0fa13e000 ti: ffffffc3eb940000 task.ti: ffffffc3eb940000
    [ 3369.128708] PC is at sync_rcu_exp_select_cpus+0x188/0x510
    [ 3369.134094] LR is at sync_rcu_exp_select_cpus+0x104/0x510
    [ 3369.139479] pc : [] lr : [] pstate: 200001c5
    [ 3369.146860] sp : ffffffc3eb9435a0
    [ 3369.150162] x29: ffffffc3eb9435a0 x28: ffffff8008be4f88
    [ 3369.155465] x27: ffffff8008b66c80 x26: ffffffc3eceb2600
    [ 3369.160767] x25: 0000000000000001 x24: ffffff8008be4f88
    [ 3369.166070] x23: ffffff8008b51c3c x22: ffffff8008b66c80
    [ 3369.171371] x21: 0000000000000001 x20: ffffff8008b21b40
    [ 3369.176673] x19: ffffff8008b66c80 x18: 0000000000000000
    [ 3369.181975] x17: 0000007fa951a010 x16: ffffff80086a30f0
    [ 3369.187278] x15: 0000007fa9505590 x14: 0000000000000000
    [ 3369.192580] x13: ffffff8008b51000 x12: ffffffc3eb940000
    [ 3369.197882] x11: 0000000000000006 x10: ffffff8008b51b78
    [ 3369.203184] x9 : 0000000000000001 x8 : ffffff8008be4000
    [ 3369.208486] x7 : ffffff8008b21b40 x6 : 0000000000001003
    [ 3369.213788] x5 : 0000000000000000 x4 : ffffff8008b27280
    [ 3369.219090] x3 : ffffff8008b21b4c x2 : 0000000000000001
    [ 3369.224406] x1 : 0000000000000001 x0 : 0000000000000140
    ...
    [ 3369.972257] [] sync_rcu_exp_select_cpus+0x188/0x510
    [ 3369.978685] [] synchronize_rcu_expedited+0x64/0xa8
    [ 3369.985026] [] synchronize_net+0x24/0x30
    [ 3369.990499] [] dev_deactivate_many+0x28c/0x298
    [ 3369.996493] [] __dev_close_many+0x60/0xd0
    [ 3370.002052] [] __dev_close+0x28/0x40
    [ 3370.007178] [] __dev_change_flags+0x8c/0x158
    [ 3370.012999] [] dev_change_flags+0x20/0x60
    [ 3370.018558] [] do_setlink+0x288/0x918
    [ 3370.023771] [] rtnl_newlink+0x398/0x6a8
    [ 3370.029158] [] rtnetlink_rcv_msg+0xe4/0x220
    [ 3370.034891] [] netlink_rcv_skb+0xc4/0xf8
    [ 3370.040364] [] rtnetlink_rcv+0x2c/0x40
    [ 3370.045663] [] netlink_unicast+0x160/0x238
    [ 3370.051309] [] netlink_sendmsg+0x2f0/0x358
    [ 3370.056956] [] sock_sendmsg+0x18/0x30
    [ 3370.062168] [] ___sys_sendmsg+0x26c/0x280
    [ 3370.067728] [] __sys_sendmsg+0x44/0x88
    [ 3370.073027] [] SyS_sendmsg+0x10/0x20
    [ 3370.078153] [] el0_svc_naked+0x24/0x28

    Signed-off-by: Mark Rutland
    Reported-by: Dennis Chen
    Cc: Catalin Marinas
    Cc: Josh Triplett
    Cc: Lai Jiangshan
    Cc: Mathieu Desnoyers
    Cc: Steve Capper
    Cc: Steven Rostedt
    Cc: Will Deacon
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Paul E. McKenney

    Mark Rutland
     

22 Apr, 2016

1 commit


01 Apr, 2016

5 commits

  • Recent kernels can fail to awaken the grace-period kthread for
    quiescent-state forcing. This commit is a crude hack that does
    a wakeup if a scheduling-clock interrupt sees that it has been
    too long since force-quiescent-state (FQS) processing.
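
    A hedged sketch of the kind of check this implies in the scheduling-clock
    interrupt path; the field and helper names are approximations, and
    wake_gp_kthread() is a hypothetical wrapper for the actual wakeup.

    static void check_fqs_starvation_sketch(struct rcu_state *rsp)
    {
            unsigned long j = jiffies;

            /* Called from the scheduling-clock interrupt: if FQS processing
             * appears overdue, poke the grace-period kthread. */
            if (rcu_gp_in_progress(rsp) &&
                time_after(j, READ_ONCE(rsp->jiffies_force_qs) + HZ))
                    wake_gp_kthread(rsp);   /* hypothetical: crude, unconditional wakeup */
    }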

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The current expedited grace-period implementation makes subsequent grace
    periods wait on wakeups for the prior grace period. This does not fit
    the dictionary definition of "expedited", so this commit allows these two
    phases to overlap. Doing this requires four waitqueues rather than two
    because tasks can now be waiting on the previous, current, and next grace
    periods. The fourth waitqueue makes the bit masking work out nicely.
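
    A hedged sketch of the indexing (the structure and helpers are
    illustrative): the expedited sequence counter advances once at the start
    and once at the end of each grace period, so using its low-order bits to
    pick one of four queues keeps waiters for the previous, current, and next
    grace periods on separate queues.

    #include <linux/wait.h>

    struct exp_sketch {
            wait_queue_head_t exp_wq[4];    /* enough for previous/current/next */
            unsigned long exp_seq;          /* odd while an expedited GP runs   */
    };

    /* Low-order counter bits select a queue, so waiters on adjacent grace
     * periods never share one, and the 0x3 mask keeps the indexing cheap. */
    static inline int exp_wq_index(unsigned long s)
    {
            return (s >> 1) & 0x3;
    }

    static void exp_wait_sketch(struct exp_sketch *es, unsigned long snap)
    {
            wait_event(es->exp_wq[exp_wq_index(snap)],
                       ULONG_CMP_GE(READ_ONCE(es->exp_seq), snap));
    }

    static void exp_wake_sketch(struct exp_sketch *es, unsigned long completed)
    {
            /* Wake only the just-completed GP's queue; waiters for the next
             * grace period sit on a different queue and are left undisturbed. */
            wake_up_all(&es->exp_wq[exp_wq_index(completed)]);
    }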

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The current mutex-based funnel-locking approach used by expedited grace
    periods is subject to severe unfairness. The problem arises when a
    few tasks, making a path from leaves to root, all wake up before other
    tasks do. A new task can then follow this path all the way to the root,
    which needlessly delays tasks whose grace period is done, but who do
    not happen to acquire the lock quickly enough.

    This commit avoids this problem by maintaining per-rcu_node wait queues,
    along with a per-rcu_node counter that tracks the latest grace period
    sought by an earlier task to visit this node. If that grace period
    would satisfy the current task, instead of proceeding up the tree,
    it waits on the current rcu_node structure using a pair of wait queues
    provided for that purpose. This decouples awakening of old tasks from
    the arrival of new tasks.

    If the wakeups prove to be a bottleneck, additional kthreads can be
    brought to bear for that purpose.
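
    A hedged sketch of the per-node decision (->exp_seq_rq and ->exp_lock
    follow the commit's description but are not guaranteed to match the real
    names): if an earlier visitor has already requested a grace period that
    covers this task, wait here rather than climbing further.

    static bool exp_funnel_wait_here(struct rcu_node *rnp, unsigned long s)
    {
            /* Has an earlier visitor already requested a GP that covers us? */
            if (ULONG_CMP_GE(READ_ONCE(rnp->exp_seq_rq), s))
                    return true;    /* yes: wait on this rcu_node's wait queue */

            /* No: record our request and keep climbing toward the root. */
            spin_lock(&rnp->exp_lock);
            if (ULONG_CMP_LT(rnp->exp_seq_rq, s))
                    WRITE_ONCE(rnp->exp_seq_rq, s);
            spin_unlock(&rnp->exp_lock);
            return false;
    }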

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Just a name change to save a few lines and a bit of typing.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Commit cdacbe1f91264 ("rcu: Add fastpath bypassing funnel locking")
    turns out to be a pessimization at high load because it forces a tree
    full of tasks to wait for an expedited grace period that they probably
    do not need. This commit therefore removes this optimization.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

15 Mar, 2016

1 commit


25 Feb, 2016

2 commits

  • As of commit dae6e64d2bcfd ("rcu: Introduce proper blocking to no-CBs kthreads
    GP waits") the RCU subsystem started making use of wait queues.

    Here we convert all additions of RCU wait queues to use simple wait queues,
    since they don't need the extra overhead of the full wait queue features.

    Originally this was done for RT kernels[1], since we would get things like...

    BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
    in_atomic(): 1, irqs_disabled(): 1, pid: 8, name: rcu_preempt
    Pid: 8, comm: rcu_preempt Not tainted
    Call Trace:
    [] __might_sleep+0xd0/0xf0
    [] rt_spin_lock+0x24/0x50
    [] __wake_up+0x36/0x70
    [] rcu_gp_kthread+0x4d2/0x680
    [] ? __init_waitqueue_head+0x50/0x50
    [] ? rcu_gp_fqs+0x80/0x80
    [] kthread+0xdb/0xe0
    [] ? finish_task_switch+0x52/0x100
    [] kernel_thread_helper+0x4/0x10
    [] ? __init_kthread_worker+0x60/0x60
    [] ? gs_change+0xb/0xb

    ...and hence simple wait queues were deployed on RT out of necessity
    (as simple wait uses a raw lock), but mainline might as well take
    advantage of the more streamlined support as well.

    [1] This is a carry-forward of work from v3.10-rt; the original conversion
    was by Thomas on an earlier -rt version, and Sebastian extended it to the
    RCU waiters added after v3.10; here I've added a commit log, unified the
    RCU changes into one patch, and uprev'd it to match mainline RCU.
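
    A hedged sketch of the conversion, using the ~v4.6 swait API; gp_swq and
    some_condition below are stand-ins for the real queue and wakeup
    condition.

    #include <linux/swait.h>

    static DECLARE_SWAIT_QUEUE_HEAD(gp_swq);
    static bool some_condition;

    static void waiter_sketch(void)
    {
            /* Was: wait_event_interruptible() on a full wait_queue_head_t. */
            swait_event(gp_swq, READ_ONCE(some_condition));
    }

    static void waker_sketch(void)
    {
            WRITE_ONCE(some_condition, true);
            /* Was: wake_up(&gp_wq). swait's internal lock is a raw spinlock,
             * so this is safe in contexts where RT forbids sleeping locks. */
            swake_up(&gp_swq);
    }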

    Signed-off-by: Daniel Wagner
    Acked-by: Peter Zijlstra (Intel)
    Cc: linux-rt-users@vger.kernel.org
    Cc: Boqun Feng
    Cc: Marcelo Tosatti
    Cc: Steven Rostedt
    Cc: Paul Gortmaker
    Cc: Paolo Bonzini
    Cc: "Paul E. McKenney"
    Link: http://lkml.kernel.org/r/1455871601-27484-6-git-send-email-wagi@monom.org
    Signed-off-by: Thomas Gleixner

    Paul Gortmaker
     
  • rcu_nocb_gp_cleanup() is called while holding rnp->lock. Currently,
    this is okay because the wake_up_all() in rcu_nocb_gp_cleanup() will
    not enable the IRQs. lockdep is happy.

    After switching over to swait, this is no longer true: swake_up_all()
    enables IRQs while processing the waiters. __do_softirq() can then
    run and will eventually call rcu_process_callbacks(), which wants to
    grab rnp->lock.

    Let's move the rcu_nocb_gp_cleanup() call outside the lock before we
    switch over to swait.
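
    A hedged sketch of the resulting pattern: note what needs waking while
    holding the lock, drop the lock, then do the wakeup
    (note_gp_cleanup_locked() is a hypothetical helper).

    static void gp_cleanup_sketch(struct rcu_node *rnp)
    {
            struct swait_queue_head *sq;

            raw_spin_lock_irq(&rnp->lock);
            sq = note_gp_cleanup_locked(rnp);   /* record, but do not wake, under the lock */
            raw_spin_unlock_irq(&rnp->lock);
            swake_up_all(sq);                   /* may re-enable IRQs internally: safe here */
    }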

    If we were to hold rnp->lock and use swait, lockdep would report the
    following:

    =================================
    [ INFO: inconsistent lock state ]
    4.2.0-rc5-00025-g9a73ba0 #136 Not tainted
    ---------------------------------
    inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
    rcu_preempt/8 [HC0[0]:SC0[0]:HE1:SE1] takes:
    (rcu_node_1){+.?...}, at: [] rcu_gp_kthread+0xb97/0xeb0
    {IN-SOFTIRQ-W} state was registered at:
    [] __lock_acquire+0xd5f/0x21e0
    [] lock_acquire+0xdf/0x2b0
    [] _raw_spin_lock_irqsave+0x59/0xa0
    [] rcu_process_callbacks+0x141/0x3c0
    [] __do_softirq+0x14d/0x670
    [] irq_exit+0x104/0x110
    [] smp_apic_timer_interrupt+0x46/0x60
    [] apic_timer_interrupt+0x70/0x80
    [] rq_attach_root+0xa6/0x100
    [] cpu_attach_domain+0x16d/0x650
    [] build_sched_domains+0x942/0xb00
    [] sched_init_smp+0x509/0x5c1
    [] kernel_init_freeable+0x172/0x28f
    [] kernel_init+0xe/0xe0
    [] ret_from_fork+0x3f/0x70
    irq event stamp: 76
    hardirqs last enabled at (75): [] _raw_spin_unlock_irq+0x30/0x60
    hardirqs last disabled at (76): [] _raw_spin_lock_irq+0x1f/0x90
    softirqs last enabled at (0): [] copy_process.part.26+0x602/0x1cf0
    softirqs last disabled at (0): [< (null)>] (null)
    other info that might help us debug this:
    Possible unsafe locking scenario:
    CPU0
    ----
    lock(rcu_node_1);

    lock(rcu_node_1);
    *** DEADLOCK ***
    1 lock held by rcu_preempt/8:
    #0: (rcu_node_1){+.?...}, at: [] rcu_gp_kthread+0xb97/0xeb0
    stack backtrace:
    CPU: 0 PID: 8 Comm: rcu_preempt Not tainted 4.2.0-rc5-00025-g9a73ba0 #136
    Hardware name: Dell Inc. PowerEdge R820/066N7P, BIOS 2.0.20 01/16/2014
    0000000000000000 000000006d7e67d8 ffff881fb081fbd8 ffffffff818379e0
    0000000000000000 ffff881fb0812a00 ffff881fb081fc38 ffffffff8110813b
    0000000000000000 0000000000000001 ffff881f00000001 ffffffff8102fa4f
    Call Trace:
    [] dump_stack+0x4f/0x7b
    [] print_usage_bug+0x1db/0x1e0
    [] ? save_stack_trace+0x2f/0x50
    [] mark_lock+0x66d/0x6e0
    [] ? check_usage_forwards+0x150/0x150
    [] mark_held_locks+0x78/0xa0
    [] ? _raw_spin_unlock_irq+0x30/0x60
    [] trace_hardirqs_on_caller+0x168/0x220
    [] trace_hardirqs_on+0xd/0x10
    [] _raw_spin_unlock_irq+0x30/0x60
    [] swake_up_all+0xb7/0xe0
    [] rcu_gp_kthread+0xab1/0xeb0
    [] ? trace_hardirqs_on_caller+0xff/0x220
    [] ? _raw_spin_unlock_irq+0x41/0x60
    [] ? rcu_barrier+0x20/0x20
    [] kthread+0x104/0x120
    [] ? _raw_spin_unlock_irq+0x30/0x60
    [] ? kthread_create_on_node+0x260/0x260
    [] ret_from_fork+0x3f/0x70
    [] ? kthread_create_on_node+0x260/0x260

    Signed-off-by: Daniel Wagner
    Acked-by: Peter Zijlstra (Intel)
    Cc: linux-rt-users@vger.kernel.org
    Cc: Boqun Feng
    Cc: Marcelo Tosatti
    Cc: Steven Rostedt
    Cc: Paul Gortmaker
    Cc: Paolo Bonzini
    Cc: "Paul E. McKenney"
    Link: http://lkml.kernel.org/r/1455871601-27484-5-git-send-email-wagi@monom.org
    Signed-off-by: Thomas Gleixner

    Daniel Wagner
     

24 Feb, 2016

1 commit

  • In the patch

    "rcu: Add transitivity to remaining rcu_node ->lock acquisitions"

    all locking operations on rcu_node::lock were replaced with wrappers
    because of the need for transitivity, which means we should never
    write code that uses LOCK primitives alone (i.e., without a proper
    barrier following them) on rcu_node::lock outside those wrappers. We
    could detect this kind of misuse of rcu_node::lock in the future by
    adding the __private modifier to rcu_node::lock.

    To privatize rcu_node::lock, unlock wrappers are also needed. Replacing
    spinlock unlocks with these wrappers not only privatizes rcu_node::lock
    but also makes it easier to figure out critical sections of rcu_node.

    This patch adds the __private modifier to rcu_node::lock and wraps every
    access to it in ACCESS_PRIVATE(). In addition, unlock wrappers are
    added, and raw_spin_unlock(&rnp->lock) and its friends are replaced with
    those wrappers.
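
    A hedged sketch of the privatization plus the new unlock wrappers (the
    struct name is illustrative; the lock-side wrappers from the 24 Nov, 2015
    entry below are unchanged apart from using ACCESS_PRIVATE()).

    /* Privatize the field and route all access through wrappers. */
    struct rcu_node_sketch {
            raw_spinlock_t __private lock;  /* a bare rnp->lock access now trips sparse */
            /* ... */
    };

    #define raw_spin_unlock_rcu_node(rnp) \
            raw_spin_unlock(&ACCESS_PRIVATE(rnp, lock))

    #define raw_spin_unlock_irqrestore_rcu_node(rnp, flags) \
            raw_spin_unlock_irqrestore(&ACCESS_PRIVATE(rnp, lock), flags)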

    Signed-off-by: Boqun Feng
    Signed-off-by: Paul E. McKenney

    Boqun Feng
     

08 Dec, 2015

1 commit


06 Dec, 2015

1 commit

  • Currently, ->gp_state is printed as an integer, which slows debugging.
    This commit therefore prints a symbolic name in addition to the integer.
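
    A hedged sketch of the kind of mapping this implies; the actual state
    list lives in kernel/rcu/tree.h and may differ.

    static const char * const gp_state_names[] = {
            "RCU_GP_IDLE",
            "RCU_GP_WAIT_GPS",
            "RCU_GP_DONE_GPS",
            "RCU_GP_WAIT_FQS",
            "RCU_GP_DOING_FQS",
            "RCU_GP_CLEANUP",
            "RCU_GP_CLEANED",
    };

    static const char *gp_state_getname(int gs)
    {
            if (gs < 0 || gs >= (int)ARRAY_SIZE(gp_state_names))
                    return "???";
            return gp_state_names[gs];
    }

    /* Usage: pr_err("->gp_state: %d (%s)\n", gs, gp_state_getname(gs)); */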

    Signed-off-by: Paul E. McKenney
    [ paulmck: Updated to fix relational operator called out by Dan Carpenter. ]
    [ paulmck: More "const", as suggested by Josh Triplett. ]
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     

05 Dec, 2015

2 commits

  • Currently, the piggybacked-work checks carried out by sync_exp_work_done()
    atomically increment a small set of variables (the ->expedited_workdone0,
    ->expedited_workdone1, ->expedited_workdone2, ->expedited_workdone3
    fields in the rcu_state structure), which will form a memory-contention
    bottleneck given a sufficiently large number of CPUs concurrently invoking
    either synchronize_rcu_expedited() or synchronize_sched_expedited().

    This commit therefore moves these four fields to the per-CPU rcu_data
    structure, eliminating the memory contention. The show_rcuexp() function
    is also changed to sum up each field across the rcu_data structures.
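
    A hedged sketch of the resulting layout; the struct and helper names here
    are illustrative, while the field names follow the commit text.

    /* Counters live in per-CPU data, so the fast path never bounces a
     * shared cache line; the reader sums them only when displaying. */
    struct rcu_data_sketch {
            unsigned long expedited_workdone0;
            unsigned long expedited_workdone1;
            unsigned long expedited_workdone2;
            unsigned long expedited_workdone3;
    };
    static DEFINE_PER_CPU(struct rcu_data_sketch, rcu_data_sketch);

    static unsigned long sum_exp_workdone1(void)
    {
            unsigned long sum = 0;
            int cpu;

            for_each_possible_cpu(cpu)
                    sum += per_cpu_ptr(&rcu_data_sketch, cpu)->expedited_workdone1;
            return sum;
    }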

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Analogy with the ->qsmaskinitnext field might lead one to believe that
    ->expmaskinitnext tracks online CPUs. This belief is incorrect: Any CPU
    that has ever been online will have its bit set in the ->expmaskinitnext
    field. This commit therefore adds a comment to make this clear, at
    least to people who read comments.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

24 Nov, 2015

1 commit

  • Providing RCU's memory-ordering guarantees requires that the rcu_node
    tree's locking provide transitive memory ordering, which the Linux kernel's
    spinlocks currently do not provide unless smp_mb__after_unlock_lock()
    is used. Having a separate smp_mb__after_unlock_lock() after each and
    every lock acquisition is error-prone, hard to read, and a bit annoying,
    so this commit provides wrapper functions that pull in the
    smp_mb__after_unlock_lock() invocations.
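
    A hedged sketch of such a wrapper; the real ones are in kernel/rcu/tree.h
    and also cover the _irq, _irqsave, and _bh variants.

    #define raw_spin_lock_rcu_node(rnp)                                     \
    do {                                                                    \
            raw_spin_lock(&(rnp)->lock);                                    \
            smp_mb__after_unlock_lock();    /* supply transitive ordering */ \
    } while (0)

    #define raw_spin_lock_irqsave_rcu_node(rnp, flags)                      \
    do {                                                                    \
            raw_spin_lock_irqsave(&(rnp)->lock, flags);                     \
            smp_mb__after_unlock_lock();                                    \
    } while (0)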

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Paul E. McKenney

    Peter Zijlstra
     

08 Oct, 2015

4 commits

  • exp.2015.10.07a: Reduce OS jitter of RCU-sched expedited grace periods.
    fixes.2015.10.06a: Miscellaneous fixes.

    Paul E. McKenney
     
  • This commit makes the RCU CPU stall warning message print online/offline
    indications immediately after the CPU number. An "O" indicates that the
    CPU is globally offline ("." otherwise), an "o" indicates that RCU
    believes the CPU to be offline for the current grace period ("."
    otherwise), and an "N" indicates that RCU believes the CPU will be
    offline for the next grace period ("." otherwise). So for CPU 10, you
    would normally see "10-...:", indicating that everything believes the
    CPU is online.
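
    A hedged sketch of producing such an indicator string; the predicates and
    the output path are illustrative.

    static void print_cpu_stall_flags_sketch(int cpu, struct rcu_node *rnp)
    {
            unsigned long bit = 1UL << (cpu - rnp->grplo);

            pr_cont("%d-%c%c%c:", cpu,
                    cpu_online(cpu) ? '.' : 'O',              /* globally offline? */
                    (rnp->qsmaskinit & bit) ? '.' : 'o',      /* offline this GP?  */
                    (rnp->qsmaskinitnext & bit) ? '.' : 'N'); /* offline next GP?  */
    }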

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This reverts commit af859beaaba4 (rcu: Silence lockdep false positive
    for expedited grace periods). Because synchronize_rcu_expedited()
    no longer invokes synchronize_sched_expedited(), ->exp_funnel_mutex
    acquisition is no longer nested, so the false positive no longer happens.
    This commit therefore removes the extra lockdep data structures, as they
    are no longer needed.

    Paul E. McKenney
     
  • This commit switches synchronize_sched_expedited() from stop_one_cpu_nowait()
    to smp_call_function_single(), thus moving from an IPI and a pair of
    context switches to an IPI and a single pass through the scheduler.
    Of course, if the scheduler actually does decide to switch to a different
    task, there will still be a pair of context switches, but there would
    likely have been a pair of context switches anyway, just a bit later.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

07 Oct, 2015

4 commits

  • This commit corrects the comment for the values of the ->gp_state field,
    which previously incorrectly said that these were for the ->gp_flags
    field.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Commit 4cdfc175c25c89ee ("rcu: Move quiescent-state forcing
    into kthread") started the process of folding the old ->fqs_state into
    ->gp_state, but did not complete it. This situation does not cause
    any malfunction, but can result in extremely confusing trace output.
    This commit completes this task of eliminating ->fqs_state in favor
    of ->gp_state.

    The old ->fqs_state was also used to decide when to collect dyntick-idle
    snapshots. For this purpose, we add a boolean variable to the kthread,
    which is set on the first call to rcu_gp_fqs() for a given grace period
    and cleared otherwise.
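
    A purely illustrative sketch of the first-time-through logic described
    above; the helper names are hypothetical.

    static void rcu_gp_fqs_sketch(bool first_time)
    {
            if (first_time)
                    snapshot_dyntick_idle_state();  /* hypothetical: take snapshots  */
            else
                    recheck_dyntick_idle_state();   /* hypothetical: compare to them */
    }

    /* In the grace-period kthread's loop:
     *
     *      bool first_gp_fqs = true;
     *      ...
     *      rcu_gp_fqs_sketch(first_gp_fqs);
     *      first_gp_fqs = false;
     */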

    Signed-off-by: Petr Mladek
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Petr Mladek
     
  • We have had the call_rcu_func_t typedef for quite a while, but we still
    use explicit function pointer types in some places. These types can
    confuse cscope and can be hard to read. This patch therefore replaces
    these types with the call_rcu_func_t typedef.

    Signed-off-by: Boqun Feng
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Boqun Feng
     
  • Now that we have the rcu_callback_t typedef as the type of RCU callbacks,
    we should use it in call_rcu*() and friends as the parameter type. This
    could save us a few lines of code and makes it clear which functions
    require an RCU callback rather than some other callback as their argument.

    Besides, this can also help cscope to generate a better database for
    code reading.
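
    A hedged sketch of the two typedefs and the before/after prototypes;
    call_rcu_old() and call_rcu_new() are illustrative names, and the real
    typedef definitions live in the kernel's type headers.

    struct rcu_head;

    typedef void (*rcu_callback_t)(struct rcu_head *head);
    typedef void (*call_rcu_func_t)(struct rcu_head *head, rcu_callback_t func);

    /* Before: an explicit function-pointer type in each prototype. */
    void call_rcu_old(struct rcu_head *head, void (*func)(struct rcu_head *head));

    /* After: the typedef names exactly what kind of callback is expected. */
    void call_rcu_new(struct rcu_head *head, rcu_callback_t func);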

    Signed-off-by: Boqun Feng
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Boqun Feng
     

21 Sep, 2015

5 commits

  • This commit converts the rcu_data structure's ->cpu_no_qs field
    to a union. The bytewise side of this union allows individual access
    to indications as to whether this CPU needs to find a quiescent state
    for a normal (.norm) and/or expedited (.exp) grace period. The setwise
    side of the union allows testing whether or not a quiescent state is
    needed at all, for either type of grace period.

    For now, only .norm is used. A later commit will introduce the expedited
    usage.
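
    A sketch of the union along the lines described above; the real
    definition is in kernel/rcu/tree.h.

    union rcu_noqs {
            struct {
                    u8 norm;        /* QS still needed for a normal GP?     */
                    u8 exp;         /* QS still needed for an expedited GP? */
            } b;                    /* bytewise access */
            u16 s;                  /* setwise: nonzero iff any QS is still needed */
    };

    /* Usage sketch:
     *      rdp->cpu_no_qs.b.norm = true;   -- normal QS needed
     *      if (!rdp->cpu_no_qs.s)          -- neither normal nor expedited needed
     *              ...
     */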

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit inverts the sense of the rcu_data structure's ->passed_quiesce
    field and renames it to ->cpu_no_qs. This will allow a later commit to
    use an "aggregate OR" operation to test expedited as well as normal grace
    periods without added overhead.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • An upcoming commit needs to invert the sense of the ->passed_quiesce
    rcu_data structure field, so this commit is taking this opportunity
    to clarify things a bit by renaming ->qs_pending to ->core_needs_qs.

    So if !rdp->core_needs_qs, then this CPU need not concern itself with
    quiescent states; in particular, it need not acquire its leaf rcu_node
    structure's ->lock to check. Otherwise, it needs to report the next
    quiescent state.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, synchronize_sched_expedited() uses a single global counter
    to track the number of remaining context switches that the current
    expedited grace period must wait on. This is problematic on large
    systems, where the resulting memory contention can be pathological.
    This commit therefore makes synchronize_sched_expedited() instead use
    the combining tree in the same manner as synchronize_rcu_expedited(),
    keeping memory contention down to a dull roar.

    This commit creates a temporary function sync_sched_exp_select_cpus()
    that is very similar to sync_rcu_exp_select_cpus(). A later commit
    will consolidate these two functions, which becomes possible when
    synchronize_sched_expedited() switches from stop_one_cpu_nowait() to
    smp_call_function_single().

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit replaces sync_rcu_preempt_exp_init1() and
    sync_rcu_preempt_exp_init2() with sync_exp_reset_tree_hotplug()
    and sync_exp_reset_tree(), which will also be used by
    synchronize_sched_expedited(), and sync_rcu_exp_select_nodes(), which
    contains code specific to synchronize_rcu_expedited().

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

04 Aug, 2015

2 commits

  • RCU is the only thing that uses smp_mb__after_unlock_lock(), and is
    likely the only thing that ever will use it, so this commit makes this
    macro private to RCU.

    Signed-off-by: Paul E. McKenney
    Cc: Will Deacon
    Cc: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: "linux-arch@vger.kernel.org"

    Paul E. McKenney
     
  • In a CONFIG_PREEMPT=y kernel, synchronize_rcu_expedited()
    acquires the ->exp_funnel_mutex in rcu_preempt_state, then invokes
    synchronize_sched_expedited, which acquires the ->exp_funnel_mutex in
    rcu_sched_state. There can be no deadlock because rcu_preempt_state
    ->exp_funnel_mutex acquisition always precedes that of rcu_sched_state.
    But lockdep does not know that, so it gives false-positive splats.

    This commit therefore associates a separate lock_class_key structure
    with the rcu_sched_state structure's ->exp_funnel_mutex, allowing
    lockdep to see the lock ordering, avoiding the false positives.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

18 Jul, 2015

8 commits

  • In the common case, there will be only one expedited grace period in
    the system at a given time, in which case it is not helpful to use
    funnel locking. This commit therefore adds a fastpath that bypasses
    funnel locking when the root ->exp_funnel_mutex is not held.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The grace-period kthread sleeps waiting to do a force-quiescent-state
    scan, and when awakened sets rsp->gp_state to RCU_GP_DONE_FQS.
    However, this is confusing because the kthread has not done the
    force-quiescent-state, but is instead just starting to do it. This commit
    therefore renames RCU_GP_DONE_FQS to RCU_GP_DOING_FQS in order to make
    things a bit easier on reviewers.

    Reported-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Although synchronize_sched_expedited() historically has no RCU CPU stall
    warnings, the availability of the rcupdate.rcu_expedited boot parameter
    invalidates the old assumption that synchronize_sched()'s stall warnings
    would suffice. This commit therefore adds RCU CPU stall warnings to
    synchronize_sched_expedited().

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The strictly rcu_node based funnel-locking scheme works well in many
    cases, but systems with CONFIG_RCU_FANOUT_LEAF=64 won't necessarily get
    all that much concurrency. This commit therefore extends the funnel
    locking into the per-CPU rcu_data structure, providing concurrency equal
    to the number of CPUs.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The rcu_seq operations were open-coded in _rcu_barrier(), so this commit
    replaces the open-coding with the shiny new rcu_seq operations.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Sequentially stopping the CPUs slows down expedited grace periods by
    at least a factor of two, based on rcutorture's grace-period-per-second
    rate. This is a conservative measure because rcutorture uses unusually
    long RCU read-side critical sections and because rcutorture periodically
    quiesces the system in order to test RCU's ability to ramp down to and
    up from the idle state. This commit therefore replaces the stop_one_cpu()
    with stop_one_cpu_nowait(), using an atomic-counter scheme to determine
    when all CPUs have passed through the stopped state.
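
    A hedged sketch of the atomic-counter scheme with illustrative names:
    count the outstanding per-CPU stops up front, have each stopper callback
    decrement the counter, and wake the initiator when it reaches zero.
    Hotplug and error handling are omitted.

    #include <linux/atomic.h>
    #include <linux/cpu.h>
    #include <linux/percpu.h>
    #include <linux/stop_machine.h>
    #include <linux/wait.h>

    static atomic_t exp_stop_count;
    static DECLARE_WAIT_QUEUE_HEAD(exp_stop_wq);
    static DEFINE_PER_CPU(struct cpu_stop_work, exp_stop_work);

    static int exp_stop_cpu_fn(void *arg)       /* runs on each target CPU */
    {
            /* ... the context switch into the stopper thread is the QS ... */
            if (atomic_dec_and_test(&exp_stop_count))
                    wake_up(&exp_stop_wq);      /* last CPU through wakes the waiter */
            return 0;
    }

    static void exp_stop_all_cpus_sketch(void)
    {
            int cpu;

            get_online_cpus();                  /* keep the set of online CPUs stable */
            atomic_set(&exp_stop_count, num_online_cpus());
            for_each_online_cpu(cpu)
                    stop_one_cpu_nowait(cpu, exp_stop_cpu_fn, NULL,
                                        &per_cpu(exp_stop_work, cpu));
            wait_event(exp_stop_wq, !atomic_read(&exp_stop_count));
            put_online_cpus();
    }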

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney

    Peter Zijlstra
     
  • This commit gets rid of synchronize_sched_expedited()'s mutex_trylock()
    polling loop in favor of a funnel-locking scheme based on the rcu_node
    tree. The work-done check is done at each level of the tree, allowing
    high-contention situations to be resolved quickly with reasonable levels
    of mutex contention.
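
    A hedged sketch of the funnel walk, with my_leaf_rcu_node() and this
    particular sync_exp_work_done() signature standing in for the real code:
    climb from the leaf toward the root, always holding at most one level's
    mutex, and bail out at any level if the required grace period has already
    completed.

    static struct rcu_node *exp_funnel_lock_sketch(struct rcu_state *rsp,
                                                   unsigned long s)
    {
            struct rcu_node *rnp0, *rnp1 = NULL;

            for (rnp0 = my_leaf_rcu_node(rsp); rnp0; rnp0 = rnp0->parent) {
                    if (sync_exp_work_done(rsp, s))
                            goto unlock_out;    /* someone else's GP covered us */
                    mutex_lock(&rnp0->exp_funnel_mutex);
                    if (rnp1)
                            mutex_unlock(&rnp1->exp_funnel_mutex);
                    rnp1 = rnp0;
            }
            return rnp1;    /* root mutex held: this task drives the grace period */

    unlock_out:
            if (rnp1)
                    mutex_unlock(&rnp1->exp_funnel_mutex);
            return NULL;    /* required grace period already happened */
    }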

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Now that synchronize_sched_expedited() has a mutex, it can use a simpler
    work-already-done detection scheme. This commit simplifies this scheme
    by using something similar to the sequence-locking counter scheme.
    A counter is incremented before and after each grace period, so that
    the counter is odd in the midst of the grace period and even otherwise.
    So if the counter has advanced to the second even number that is
    greater than or equal to the snapshot, the required grace period has
    already happened.
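
    A self-contained sketch of the counter scheme described above; the
    in-kernel rcu_seq_* helpers embody the same idea. The caller is assumed
    to provide mutual exclusion and memory ordering for the start/end
    updates.

    static unsigned long exp_seq;

    static void exp_seq_start(void) { exp_seq++; /* now odd: GP in progress */ }
    static void exp_seq_end(void)   { exp_seq++; /* now even: GP complete   */ }

    /* Snapshot: the counter value at the end of the first full grace period
     * to start after the snapshot is taken. */
    static unsigned long exp_seq_snap(void)
    {
            return (READ_ONCE(exp_seq) + 3) & ~0x1UL;
    }

    /* The needed grace period has happened once the counter reaches the snapshot. */
    static bool exp_seq_done(unsigned long snap)
    {
            return ULONG_CMP_GE(READ_ONCE(exp_seq), snap);
    }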

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney