11 Jun, 2013

1 commit

  • This commit fixes a lockdep-detected deadlock by moving a wake_up()
    call out from a rnp->lock critical section. Please see below for
    the long version of this story.

    On Tue, 2013-05-28 at 16:13 -0400, Dave Jones wrote:

    > [12572.705832] ======================================================
    > [12572.750317] [ INFO: possible circular locking dependency detected ]
    > [12572.796978] 3.10.0-rc3+ #39 Not tainted
    > [12572.833381] -------------------------------------------------------
    > [12572.862233] trinity-child17/31341 is trying to acquire lock:
    > [12572.870390] (rcu_node_0){..-.-.}, at: [] rcu_read_unlock_special+0x9f/0x4c0
    > [12572.878859]
    > but task is already holding lock:
    > [12572.894894] (&ctx->lock){-.-...}, at: [] perf_lock_task_context+0x7d/0x2d0
    > [12572.903381]
    > which lock already depends on the new lock.
    >
    > [12572.927541]
    > the existing dependency chain (in reverse order) is:
    > [12572.943736]
    > -> #4 (&ctx->lock){-.-...}:
    > [12572.960032] [] lock_acquire+0x91/0x1f0
    > [12572.968337] [] _raw_spin_lock+0x40/0x80
    > [12572.976633] [] __perf_event_task_sched_out+0x2e7/0x5e0
    > [12572.984969] [] perf_event_task_sched_out+0x93/0xa0
    > [12572.993326] [] __schedule+0x2cf/0x9c0
    > [12573.001652] [] schedule_user+0x2e/0x70
    > [12573.009998] [] retint_careful+0x12/0x2e
    > [12573.018321]
    > -> #3 (&rq->lock){-.-.-.}:
    > [12573.034628] [] lock_acquire+0x91/0x1f0
    > [12573.042930] [] _raw_spin_lock+0x40/0x80
    > [12573.051248] [] wake_up_new_task+0xb7/0x260
    > [12573.059579] [] do_fork+0x105/0x470
    > [12573.067880] [] kernel_thread+0x26/0x30
    > [12573.076202] [] rest_init+0x23/0x140
    > [12573.084508] [] start_kernel+0x3f1/0x3fe
    > [12573.092852] [] x86_64_start_reservations+0x2a/0x2c
    > [12573.101233] [] x86_64_start_kernel+0xcc/0xcf
    > [12573.109528]
    > -> #2 (&p->pi_lock){-.-.-.}:
    > [12573.125675] [] lock_acquire+0x91/0x1f0
    > [12573.133829] [] _raw_spin_lock_irqsave+0x4b/0x90
    > [12573.141964] [] try_to_wake_up+0x31/0x320
    > [12573.150065] [] default_wake_function+0x12/0x20
    > [12573.158151] [] autoremove_wake_function+0x18/0x40
    > [12573.166195] [] __wake_up_common+0x58/0x90
    > [12573.174215] [] __wake_up+0x39/0x50
    > [12573.182146] [] rcu_start_gp_advanced.isra.11+0x4a/0x50
    > [12573.190119] [] rcu_start_future_gp+0x1c9/0x1f0
    > [12573.198023] [] rcu_nocb_kthread+0x114/0x930
    > [12573.205860] [] kthread+0xed/0x100
    > [12573.213656] [] ret_from_fork+0x7c/0xb0
    > [12573.221379]
    > -> #1 (&rsp->gp_wq){..-.-.}:
    > [12573.236329] [] lock_acquire+0x91/0x1f0
    > [12573.243783] [] _raw_spin_lock_irqsave+0x4b/0x90
    > [12573.251178] [] __wake_up+0x23/0x50
    > [12573.258505] [] rcu_start_gp_advanced.isra.11+0x4a/0x50
    > [12573.265891] [] rcu_start_future_gp+0x1c9/0x1f0
    > [12573.273248] [] rcu_nocb_kthread+0x114/0x930
    > [12573.280564] [] kthread+0xed/0x100
    > [12573.287807] [] ret_from_fork+0x7c/0xb0

    Notice the above call chain.

    rcu_start_future_gp() is called with the rnp->lock held. It then
    calls rcu_start_gp_advanced(), which does a wakeup.

    You can't do wakeups while holding the rnp->lock, because that would
    mean you could not do an rcu_read_unlock() while holding the rq lock,
    or any lock that was taken while holding the rq lock. The reason is
    shown below.

    > [12573.295067]
    > -> #0 (rcu_node_0){..-.-.}:
    > [12573.309293] [] __lock_acquire+0x1786/0x1af0
    > [12573.316568] [] lock_acquire+0x91/0x1f0
    > [12573.323825] [] _raw_spin_lock+0x40/0x80
    > [12573.331081] [] rcu_read_unlock_special+0x9f/0x4c0
    > [12573.338377] [] __rcu_read_unlock+0x96/0xa0
    > [12573.345648] [] perf_lock_task_context+0x143/0x2d0
    > [12573.352942] [] find_get_context+0x4e/0x1f0
    > [12573.360211] [] SYSC_perf_event_open+0x514/0xbd0
    > [12573.367514] [] SyS_perf_event_open+0x9/0x10
    > [12573.374816] [] tracesys+0xdd/0xe2

    Notice the above trace.

    perf took its own ctx->lock, which can be taken while holding the rq
    lock. While holding this lock, it did a rcu_read_unlock(). The
    perf_lock_task_context() basically looks like:

    rcu_read_lock();
    raw_spin_lock(&ctx->lock);
    rcu_read_unlock();

    What appears to have happened is that we were scheduled out after
    taking that first rcu_read_lock() but before taking the spinlock.
    When we were scheduled back in and took ctx->lock, the subsequent
    rcu_read_unlock() triggered the "special" code.

    rcu_read_unlock_special() takes the rnp->lock, which gives us a
    possible deadlock scenario:

    CPU0                    CPU1                    CPU2
    ----                    ----                    ----

                                                    rcu_nocb_kthread()
    lock(rq->lock);
                            lock(ctx->lock);
                                                    lock(rnp->lock);

                                                    wake_up();

                                                    lock(rq->lock);

                            rcu_read_unlock();

                            rcu_read_unlock_special();

                            lock(rnp->lock);
    lock(ctx->lock);

                       **** DEADLOCK ****

    > [12573.382068]
    > other info that might help us debug this:
    >
    > [12573.403229] Chain exists of:
    > rcu_node_0 --> &rq->lock --> &ctx->lock
    >
    > [12573.424471] Possible unsafe locking scenario:
    >
    > [12573.438499]        CPU0                    CPU1
    > [12573.445599]        ----                    ----
    > [12573.452691]   lock(&ctx->lock);
    > [12573.459799]                                lock(&rq->lock);
    > [12573.467010]                                lock(&ctx->lock);
    > [12573.474192]   lock(rcu_node_0);
    > [12573.481262]
    > *** DEADLOCK ***
    >
    > [12573.501931] 1 lock held by trinity-child17/31341:
    > [12573.508990] #0: (&ctx->lock){-.-...}, at: [] perf_lock_task_context+0x7d/0x2d0
    > [12573.516475]
    > stack backtrace:
    > [12573.530395] CPU: 1 PID: 31341 Comm: trinity-child17 Not tainted 3.10.0-rc3+ #39
    > [12573.545357] ffffffff825b4f90 ffff880219f1dbc0 ffffffff816e375b ffff880219f1dc00
    > [12573.552868] ffffffff816dfa5d ffff880219f1dc50 ffff88023ce4d1f8 ffff88023ce4ca40
    > [12573.560353] 0000000000000001 0000000000000001 ffff88023ce4d1f8 ffff880219f1dcc0
    > [12573.567856] Call Trace:
    > [12573.575011] [] dump_stack+0x19/0x1b
    > [12573.582284] [] print_circular_bug+0x200/0x20f
    > [12573.589637] [] __lock_acquire+0x1786/0x1af0
    > [12573.596982] [] ? sched_clock_cpu+0xb5/0x100
    > [12573.604344] [] lock_acquire+0x91/0x1f0
    > [12573.611652] [] ? rcu_read_unlock_special+0x9f/0x4c0
    > [12573.619030] [] _raw_spin_lock+0x40/0x80
    > [12573.626331] [] ? rcu_read_unlock_special+0x9f/0x4c0
    > [12573.633671] [] rcu_read_unlock_special+0x9f/0x4c0
    > [12573.640992] [] ? perf_lock_task_context+0x7d/0x2d0
    > [12573.648330] [] ? put_lock_stats.isra.29+0xe/0x40
    > [12573.655662] [] ? delay_tsc+0x90/0xe0
    > [12573.662964] [] __rcu_read_unlock+0x96/0xa0
    > [12573.670276] [] perf_lock_task_context+0x143/0x2d0
    > [12573.677622] [] ? __perf_event_enable+0x370/0x370
    > [12573.684981] [] find_get_context+0x4e/0x1f0
    > [12573.692358] [] SYSC_perf_event_open+0x514/0xbd0
    > [12573.699753] [] ? get_parent_ip+0xd/0x50
    > [12573.707135] [] ? trace_hardirqs_on_caller+0xfd/0x1c0
    > [12573.714599] [] SyS_perf_event_open+0x9/0x10
    > [12573.721996] [] tracesys+0xdd/0xe2

    This commit delays the wakeup via irq_work(), which is what
    perf and ftrace use to perform wakeups in critical sections.
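
    As a minimal sketch of this pattern (illustrative only; the queue and
    function names below are hypothetical, not the ones in the actual
    patch), the wakeup is queued as irq_work while the lock is held, and
    the real wake_up() then runs later from hard-interrupt context with
    no rnp->lock held:

    #include <linux/irq_work.h>
    #include <linux/wait.h>

    static DECLARE_WAIT_QUEUE_HEAD(gp_wq);      /* hypothetical wait queue */
    static struct irq_work gp_wakeup_work;      /* hypothetical irq_work   */

    static void gp_wakeup_func(struct irq_work *work)
    {
            /* Runs later, in hard-irq context, with no rnp->lock held. */
            wake_up(&gp_wq);
    }

    static void start_gp_with_rnp_lock_held(void)
    {
            /* Caller holds rnp->lock: must not call wake_up() directly. */
            irq_work_queue(&gp_wakeup_work);
    }

    static void gp_wakeup_init(void)
    {
            init_irq_work(&gp_wakeup_work, gp_wakeup_func);
    }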

    Reported-by: Dave Jones
    Signed-off-by: Steven Rostedt
    Signed-off-by: Paul E. McKenney

    Steven Rostedt
     

02 May, 2013

1 commit


19 Apr, 2013

1 commit

  • We need full-dynticks CPUs to also be RCU no-CBs CPUs so that we
    don't have to keep the tick running to handle RCU callbacks.

    Make sure the range passed to the nohz_full= boot parameter is a
    subset of rcu_nocbs=.

    The CPUs that fail to meet this requirement will be excluded from
    the nohz_full range. This is checked early during boot, before any
    CPU has had the opportunity to stop its tick.
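
    As a rough sketch of the shape of this check (the mask names follow
    kernel conventions, but the snippet is a simplification rather than
    the patch itself), CPUs requested via nohz_full= that are not also
    in rcu_nocbs= are simply dropped from the full-dynticks mask:

    #include <linux/cpumask.h>

    /* tick_nohz_full_mask: CPUs requested via nohz_full=
     * rcu_nocb_mask:       CPUs requested via rcu_nocbs=   */
    if (!cpumask_subset(tick_nohz_full_mask, rcu_nocb_mask))
            cpumask_and(tick_nohz_full_mask,
                        tick_nohz_full_mask, rcu_nocb_mask);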

    Suggested-by: Steven Rostedt
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

16 Apr, 2013

1 commit

  • Adaptive-ticks CPUs inform RCU when they enter kernel mode, but they do
    not necessarily turn the scheduler-clock tick back on. This state of
    affairs could result in RCU waiting on an adaptive-ticks CPU running
    for an extended period in kernel mode. Such a CPU will never run the
    RCU state machine, and could therefore indefinitely extend the RCU state
    machine, sooner or later resulting in an OOM condition.

    This patch, inspired by an earlier patch by Frederic Weisbecker, therefore
    causes RCU's force-quiescent-state processing to check for this condition
    and to send an IPI to CPUs that remain in that state for too long.
    "Too long" currently means about three jiffies by default, which is
    quite some time for a CPU to remain in the kernel without blocking.
    The rcutree.jiffies_till_first_fqs and rcutree.jiffies_till_next_fqs
    sysfs variables may be used to tune "too long" if needed.
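
    As a rough illustration only (gp_start_jiffies is a made-up stand-in
    for "when this grace period began"; only smp_send_reschedule() and
    the jiffies_till_*_fqs parameters come from the text above), the
    force-quiescent-state path conceptually does something like:

    #include <linux/jiffies.h>
    #include <linux/smp.h>

    if (time_after(jiffies, gp_start_jiffies + jiffies_till_first_fqs))
            smp_send_reschedule(cpu);  /* kick the holdout adaptive-ticks CPU */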

    Reported-by: Frederic Weisbecker
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett
    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Paul E. McKenney
     

26 Mar, 2013

6 commits

  • doc.2013.03.12a: Documentation changes.

    fixes.2013.03.13a: Miscellaneous fixes.

    idlenocb.2013.03.26b: Remove restrictions on no-CBs CPUs, make
    RCU_FAST_NO_HZ take advantage of numbered callbacks, add
    callback acceleration based on numbered callbacks.

    Paul E. McKenney
     
  • CPUs going idle will need to record the need for a future grace
    period, but won't actually need to block waiting on it. This commit
    therefore splits rcu_start_future_gp(), which does the recording, from
    rcu_nocb_wait_gp(), which now invokes rcu_start_future_gp() to do the
    recording, after which rcu_nocb_wait_gp() does the waiting.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • CPUs going idle need to be able to indicate their need for future grace
    periods. A mechanism for doing this already exists for no-callbacks
    CPUs, so the idea is to re-use that mechanism. This commit therefore
    moves the ->n_nocb_gp_requests field of the rcu_node structure out from
    under the CONFIG_RCU_NOCB_CPU #ifdef and renames it to ->need_future_gp.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Because RCU callbacks are now associated with the number of the grace
    period that they must wait for, CPUs can now take advance callbacks
    corresponding to grace periods that ended while a given CPU was in
    dyntick-idle mode. This eliminates the need to try forcing the RCU
    state machine while entering idle, thus reducing the CPU intensiveness
    of RCU_FAST_NO_HZ, which should increase its energy efficiency.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, the per-no-CBs-CPU kthreads are named "rcuo" followed by
    the CPU number, for example, "rcuo0" for CPU 0. This is problematic
    given that
    there are either two or three RCU flavors, each of which gets a per-CPU
    kthread with exactly the same name. This commit therefore introduces
    a one-letter abbreviation for each RCU flavor, namely 'b' for RCU-bh,
    'p' for RCU-preempt, and 's' for RCU-sched. This abbreviation is used
    to distinguish the "rcuo" kthreads, for example, for CPU 0 we would have
    "rcuob/0", "rcuop/0", and "rcuos/0".

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Tested-by: Dietmar Eggemann

    Paul E. McKenney
     
  • Currently, the no-CBs kthreads do repeated timed waits for grace periods
    to elapse. This is crude and energy inefficient, so this commit allows
    no-CBs kthreads to specify exactly which grace period they are waiting
    for and also allows them to block for the entire duration until the
    desired grace period completes.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

14 Mar, 2013

1 commit

  • If RCU's softirq handler is prevented from executing, an RCU CPU stall
    warning can result. Ways to prevent RCU's softirq handler from executing
    include: (1) CPU spinning with interrupts disabled, (2) infinite loop
    in some softirq handler, and (3) in -rt kernels, an infinite loop in a
    set of real-time threads running at priorities higher than that of RCU's
    softirq handler.

    Because this situation can be difficult to track down, this commit causes
    the count of RCU softirq handler invocations to be printed with RCU
    CPU stall warnings. This information does require some interpretation,
    as now documented in Documentation/RCU/stallwarn.txt.

    Reported-by: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Tested-by: Paul Gortmaker

    Paul E. McKenney
     

13 Mar, 2013

2 commits

  • Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, CPU 0 is constrained to not be a no-CBs CPU, and furthermore
    at least one no-CBs CPU must remain online at any given time. These
    restrictions are problematic in some situations, such as cases where
    all CPUs must run a real-time workload that needs to be insulated from
    OS jitter and latencies due to RCU callback invocation. This commit
    therefore provides no-CBs CPUs a (very crude and energy-inefficient)
    way to start and to wait for grace periods independently of the normal
    RCU callback mechanisms. This approach allows any or all of the CPUs to
    be designated as no-CBs CPUs, and allows any proper subset of the CPUs
    (whether no-CBs CPUs or not) to be offlined.

    This commit also provides a fix for a locking bug spotted by
    Xie ChanglongX.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

29 Jan, 2013

2 commits

  • …' and 'tiny.2013.01.29b' into HEAD

    doctorture.2013.01.11a: Changes to rcutorture and to RCU documentation.

    fixes.2013.01.26a: Miscellaneous fixes.

    tagcb.2013.01.24a: Tag RCU callbacks with grace-period number to
    simplify callback advancement.

    tiny.2013.01.29b: Enhancements to uniprocessor handling in tiny RCU.

    Paul E. McKenney
     
  • Tiny RCU has historically omitted RCU CPU stall warnings in order to
    reduce memory requirements; however, the lack of these warnings
    recently caused Thomas Gleixner some debugging pain. Therefore, this
    commit
    adds RCU CPU stall warnings to tiny RCU if RCU_TRACE=y. This keeps
    the memory footprint small, while still enabling CPU stall warnings
    in kernels built to enable them.

    Updated to include Josh Triplett's suggested use of RCU_STALL_COMMON
    config variable to simplify #if expressions.

    Reported-by: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     

27 Jan, 2013

1 commit


09 Jan, 2013

1 commit

  • Currently, callbacks are advanced each time the corresponding CPU
    notices a change in its leaf rcu_node structure's ->completed value
    (this value counts grace-period completions). This approach has worked
    quite well, but with the advent of RCU_FAST_NO_HZ, we cannot count on
    a given CPU seeing all the grace-period completions. When a CPU misses
    a grace-period completion that occurs while it is in dyntick-idle mode,
    this will delay invocation of its callbacks.

    In addition, acceleration of callbacks (when RCU realizes that a given
    callback need only wait until the end of the next grace period, rather
    than having to wait for a partial grace period followed by a full
    grace period) must be carried out extremely carefully. Insufficient
    acceleration will result in unnecessarily long grace-period latencies,
    while excessive acceleration will result in premature callback invocation.
    Changes that involve this tradeoff are therefore among the most
    nerve-wracking changes to RCU.

    This commit therefore explicitly tags groups of callbacks with the
    number of the grace period that they are waiting for. This means that
    callback-advancement and callback-acceleration functions are idempotent,
    so that excessive acceleration will merely waste a few CPU cycles. This
    also allows a CPU to take full advantage of any grace periods that have
    elapsed while it has been in dyntick-idle mode. It should also enable
    simultaneous simplifications to and optimizations of RCU_FAST_NO_HZ.
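
    A toy, self-contained illustration of the idempotence argument (plain
    user-space C, deliberately unrelated to the real RCU data structures):
    each callback group carries the grace-period number it waits for, so
    repeating the advancement pass with the same completed count changes
    nothing.

    #include <stdio.h>

    #define GROUPS 4

    static unsigned long wait_for[GROUPS] = { 101, 102, 103, 104 };
    static int ready[GROUPS];

    static void advance_callbacks(unsigned long completed)
    {
            for (int i = 0; i < GROUPS; i++)
                    if (!ready[i] && completed >= wait_for[i]) {
                            ready[i] = 1;  /* safe to repeat: no-op next time */
                            printf("group %d ready after GP %lu\n", i, completed);
                    }
    }

    int main(void)
    {
            advance_callbacks(102);  /* groups 0 and 1 become ready */
            advance_callbacks(102);  /* idempotent: nothing changes */
            advance_callbacks(104);  /* groups 2 and 3 become ready */
            return 0;
    }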

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

17 Nov, 2012

2 commits

  • Currently, callback invocations from callback-free CPUs are accounted to
    the CPU that registered the callback, but using the same field that is
    used for normal callbacks. This makes it impossible to determine from
    debugfs output whether callbacks are in fact being diverted. This commit
    therefore adds a separate ->n_nocbs_invoked field in the rcu_data structure
    in which diverted callback invocations are counted. RCU's debugfs tracing
    still displays normal callback invocations using ci=, but displays
    diverted callback invocations using nci=.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • RCU callback execution can add significant OS jitter and also can
    degrade both scheduling latency and, in asymmetric multiprocessors,
    energy efficiency. This commit therefore adds the ability for selected
    CPUs ("rcu_nocbs=" boot parameter) to have their callbacks offloaded
    to kthreads. If the "rcu_nocb_poll" boot parameter is also specified,
    these kthreads will do polling, removing the need for the offloaded
    CPUs to do wakeups. At least one CPU must be doing normal callback
    processing: currently CPU 0 cannot be selected as a no-CBs CPU.
    In addition, attempts to offline the last normal-CBs CPU will fail.
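
    For example (illustrative only), booting with "rcu_nocbs=1-7
    rcu_nocb_poll" would offload callbacks for CPUs 1-7 to kthreads that
    poll for work, leaving CPU 0 to do normal callback processing.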

    This feature was inspired by Jim Houston's and Joe Korty's JRCU, and
    this commit includes fixes to problems located by Fengguang Wu's
    kbuild test robot.

    [ paulmck: Added gfp.h include file as suggested by Fengguang Wu. ]

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

09 Nov, 2012

3 commits

  • This commit adds the counters to rcu_state and updates them in
    synchronize_rcu_expedited() to provide the data needed for debugfs
    tracing.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Tracing (debugfs) of expedited RCU primitives is required, which in turn
    requires that the relevant data be located where the tracing code can find
    it, not in its current static global variables in kernel/rcutree.c.
    This commit therefore moves sync_sched_expedited_started and
    sync_sched_expedited_done to the rcu_state structure, as fields
    ->expedited_start and ->expedited_done, respectively.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The ->onofflock field in the rcu_state structure at one time synchronized
    CPU-hotplug operations for RCU. However, its scope has decreased over time
    so that it now only protects the lists of orphaned RCU callbacks. This
    commit therefore renames it to ->orphan_lock to reflect its current use.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

09 Oct, 2012

1 commit

  • Kirill noted the following deadlock cycle on shutdown involving padata:

    > With commit 755609a9087fa983f567dc5452b2fa7b089b591f I've got deadlock on
    > poweroff.
    >
    > I guess it happens because of a race for cpu_hotplug.lock:
    >
    > CPU A                                 CPU B
    > disable_nonboot_cpus()
    > _cpu_down()
    > cpu_hotplug_begin()
    > mutex_lock(&cpu_hotplug.lock);
    > __cpu_notify()
    > padata_cpu_callback()
    > __padata_remove_cpu()
    > padata_replace()
    > synchronize_rcu()
    >                                       rcu_gp_kthread()
    >                                       get_online_cpus();
    >                                       mutex_lock(&cpu_hotplug.lock);

    It would of course be good to eliminate grace-period delays from
    CPU-hotplug notifiers, but that is a separate issue. Deadlock is
    not an appropriate diagnostic for excessive CPU-hotplug latency.

    Fortunately, grace-period initialization does not actually need to
    exclude all of the CPU-hotplug operation, but rather only RCU's own
    CPU_UP_PREPARE and CPU_DEAD CPU-hotplug notifiers. This commit therefore
    introduces a new per-rcu_state onoff_mutex that provides the required
    concurrency control in place of the get_online_cpus() that was previously
    in rcu_gp_init().
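
    A sketch of the resulting shape (simplified; the surrounding
    grace-period code and the notifier side are elided): grace-period
    initialization now serializes against only RCU's own hotplug
    notifiers via the new mutex rather than the global cpu_hotplug lock.

    /* In rcu_gp_init(), instead of get_online_cpus()/put_online_cpus(): */
    mutex_lock(&rsp->onoff_mutex);
    /* ... set up each rcu_node structure for the new grace period ... */
    mutex_unlock(&rsp->onoff_mutex);

    /* RCU's CPU_UP_PREPARE and CPU_DEAD notifiers take the same mutex. */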

    Reported-by: "Kirill A. Shutemov"
    Signed-off-by: Paul E. McKenney
    Tested-by: Kirill A. Shutemov

    Paul E. McKenney
     

26 Sep, 2012

3 commits

  • By default we don't want to enter an RCU extended quiescent state
    while running in userspace, because doing so adds some overhead
    (e.g., use of the syscall slowpath). Leave it off by default, ready
    to be enabled when a feature such as adaptive tickless needs it.

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Andrew Morton
    Cc: Avi Kivity
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Josh Triplett
    Cc: Kevin Hilman
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Sven-Thorsten Dietrich
    Cc: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
     
  • Allow calls to rcu_user_enter() even if we are already
    in userspace (as seen by RCU) and allow calls to rcu_user_exit()
    even if we are already in the kernel.

    This makes the APIs more flexible for architectures to call:
    exception entries, for example, won't need to know whether they came
    from userspace before calling rcu_user_exit().

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Andrew Morton
    Cc: Avi Kivity
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Josh Triplett
    Cc: Kevin Hilman
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Sven-Thorsten Dietrich
    Cc: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
     
  • The conflicts between kernel/rcutree.h and kernel/rcutree_plugin.h
    were due to adjacent insertions and deletions, which were resolved
    by simply accepting the changes on both branches.

    Paul E. McKenney
     

25 Sep, 2012

1 commit

  • …', 'hotplug.2012.09.23a' and 'idlechop.2012.09.23a' into HEAD

    bigrt.2012.09.23a contains additional commits to reduce scheduling latency
    from RCU on huge systems (many hundreds or thousands of CPUs).

    doctorture.2012.09.23a contains documentation changes and rcutorture fixes.

    fixes.2012.09.23a contains miscellaneous fixes.

    hotplug.2012.09.23a contains CPU-hotplug-related changes.

    idle.2012.09.23a fixes architectures for which RCU no longer considered
    the idle loop to be a quiescent state due to earlier
    adaptive-dynticks changes. Affected architectures are alpha,
    cris, frv, h8300, m32r, m68k, mn10300, parisc, score, xtensa,
    and ia64.

    Paul E. McKenney
     

23 Sep, 2012

8 commits

  • Currently, _rcu_barrier() relies on preempt_disable() to prevent
    any CPU from going offline, which in turn depends on CPU hotplug's
    use of __stop_machine().

    This patch therefore makes _rcu_barrier() use get_online_cpus() to
    block CPU-hotplug operations. This has the added benefit of removing
    the need for _rcu_barrier() to adopt callbacks: Because CPU-hotplug
    operations are excluded, there can be no callbacks to adopt. This
    commit simplifies the code accordingly.
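
    The overall shape, sketched for illustration (the completion handling
    and the per-CPU posting of the barrier callback are elided):

    get_online_cpus();             /* no CPU can come or go past this point */
    for_each_online_cpu(cpu) {
            /* post the barrier callback on each CPU that has callbacks */
    }
    put_online_cpus();
    /* then wait for all posted barrier callbacks to be invoked */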

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • The current quiescent-state detection algorithm is needlessly
    complex. It records the grace-period number corresponding to
    the quiescent state at the time of the quiescent state, which
    works, but it seems better to simply erase any record of previous
    quiescent states at the time that the CPU notices the new grace
    period. This has the further advantage of removing another piece
    of RCU for which lockless reasoning is required.

    Therefore, this commit makes this change.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Large systems running RCU_FAST_NO_HZ kernels see extreme memory
    contention on the rcu_state structure's ->fqslock field. This
    can be avoided by disabling RCU_FAST_NO_HZ, either at compile time
    or at boot time (via the nohz kernel boot parameter), but large
    systems will no doubt become sensitive to energy consumption.
    This commit therefore uses a combining-tree approach to spread the
    memory contention across new cache lines in the leaf rcu_node structures.
    This can be thought of as a tournament lock that has only a try-lock
    acquisition primitive.

    The effect on small systems is minimal, because such systems have
    an rcu_node "tree" consisting of a single node. In addition, this
    functionality is not used on fastpaths.
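
    A simplified sketch of the tournament idea (not the exact code;
    my_leaf_rcu_node() is a made-up stand-in for looking up this CPU's
    leaf rcu_node structure): a CPU walks from its leaf toward the root,
    at each level trying the ->fqslock and dropping out if another CPU
    already owns it.

    struct rcu_node *rnp, *rnp_old = NULL;

    for (rnp = my_leaf_rcu_node(); rnp != NULL; rnp = rnp->parent) {
            if (!raw_spin_trylock(&rnp->fqslock)) {
                    /* Lost this round: someone else is already pushing
                     * toward the root on our behalf, so just give up. */
                    if (rnp_old)
                            raw_spin_unlock(&rnp_old->fqslock);
                    return;
            }
            if (rnp_old)
                    raw_spin_unlock(&rnp_old->fqslock);  /* release lower level */
            rnp_old = rnp;
    }
    /* rnp_old is now the root: request quiescent-state forcing, then unlock. */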

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Moving quiescent-state forcing into a kthread dispenses with the need
    for the ->n_rp_need_fqs field, so this commit removes it.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • As the first step towards allowing quiescent-state forcing to be
    preemptible, this commit moves RCU quiescent-state forcing into the
    same kthread that is now used to initialize and clean up after grace
    periods. This is yet another step towards keeping scheduling
    latency down to a dull roar.

    Updated to change from raw_spin_lock_irqsave() to raw_spin_lock_irq()
    and to remove the now-unused rcu_state structure fields as suggested by
    Peter Zijlstra.

    Reported-by: Mike Galbraith
    Reported-by: Dimitri Sivanich
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The fields in the rcu_state structure that are protected by the
    root rcu_node structure's ->lock can share a cache line with the
    fields protected by ->onofflock. This can result in excessive
    memory contention on large systems, so this commit applies
    ____cacheline_internodealigned_in_smp to the ->onofflock field in
    order to segregate them.
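
    For illustration, the change amounts to declaring the field with the
    internode-alignment attribute so that it starts on its own cache line
    (a sketch of the declaration only):

    /* In struct rcu_state: */
    raw_spinlock_t onofflock ____cacheline_internodealigned_in_smp;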

    Signed-off-by: Dimitri Sivanich
    Signed-off-by: Paul E. McKenney
    Tested-by: Dimitri Sivanich
    Reviewed-by: Josh Triplett

    Dimitri Sivanich
     
  • In kernels built with CONFIG_RCU_FAST_NO_HZ=y, CPUs can accumulate a
    large number of lazy callbacks, which as the name implies will be slow
    to be invoked. This can be a problem on small-memory systems, where the
    default 6-second sleep for CPUs having only lazy RCU callbacks could well
    be fatal. This commit therefore installs an OOM handler that ensures that
    every CPU with lazy callbacks has at least one non-lazy callback, in turn
    ensuring timely advancement for these callbacks.

    Updated to fix bug that disabled OOM killing, noted by Lai Jiangshan.

    Updated to push the for_each_rcu_flavor() loop into rcu_oom_notify_cpu(),
    thus reducing the number of IPIs, as suggested by Steven Rostedt. Also
    to make the for_each_online_cpu() loop be preemptible. (Later, it might
    be good to use smp_call_function(), as suggested by Peter Zijlstra.)
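
    The shape of the mechanism, sketched for illustration
    (rcu_oom_notify_cpu() here stands in for the per-CPU work described
    above; the actual callback posting is elided):

    #include <linux/cpu.h>
    #include <linux/notifier.h>
    #include <linux/oom.h>
    #include <linux/sched.h>
    #include <linux/smp.h>

    static void rcu_oom_notify_cpu(void *unused)
    {
            /* Post one non-lazy callback per RCU flavor on this CPU. */
    }

    static int rcu_oom_notify(struct notifier_block *self,
                              unsigned long notused, void *nfreed)
    {
            int cpu;

            get_online_cpus();
            for_each_online_cpu(cpu) {
                    smp_call_function_single(cpu, rcu_oom_notify_cpu, NULL, 1);
                    cond_resched();  /* keep the loop preemptible */
            }
            put_online_cpus();
            return NOTIFY_OK;
    }

    static struct notifier_block rcu_oom_nb = {
            .notifier_call = rcu_oom_notify,
    };

    /* Registered once at boot: register_oom_notifier(&rcu_oom_nb); */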

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Tested-by: Sasha Levin
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • As the first step towards allowing grace-period initialization to be
    preemptible, this commit moves the RCU grace-period initialization
    into its own kthread. This is needed to keep large-system scheduling
    latency at reasonable levels.

    Also change raw_spin_lock_irqsave() to raw_spin_lock_irq() as suggested
    by Peter Zijlstra in review comments.

    Reported-by: Mike Galbraith
    Reported-by: Dimitri Sivanich
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     

13 Aug, 2012

2 commits

  • Bring RCU into the new-age CPU-hotplug fold by modifying RCU's per-CPU
    kthread code to use the new smp_hotplug_thread facility.

    [ tglx: Adapted it to use callbacks and to the simplified rcu yield ]

    Signed-off-by: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Srivatsa S. Bhat
    Cc: Rusty Russell
    Cc: Namhyung Kim
    Link: http://lkml.kernel.org/r/20120716103948.673354828@linutronix.de
    Signed-off-by: Thomas Gleixner

    Paul E. McKenney
     
  • The rcu_yield() code is amazing. It's there to avoid starvation of the
    system when lots of (boosting) work is to be done.

    Now looking at the code, its functionality is:

    Make the thread SCHED_OTHER and very nice, i.e. get it out of the way
    Arm a timer with 2 ticks
    schedule()

    Now if the system goes idle the rcu task returns, regains SCHED_FIFO
    and plugs on. If the system stays busy the timer fires and wakes a
    per-node kthread, which in turn makes the per-CPU thread SCHED_FIFO
    and brings it back onto the CPU. For the boosting thread the "make it FIFO"
    bit is missing and it just runs some magic boost checks. Now this is a
    lot of code with extra threads and complexity.

    It's way simpler to let the tasks when they detect overload schedule
    away for 2 ticks and defer the normal wakeup as long as they are in
    yielded state and the cpu is not idle.

    That solves the same problem, and the only difference is that when
    the CPU goes idle it's not guaranteed that the thread returns right
    away, but it won't be out longer than two ticks, so no harm is done.
    If that's an issue, then it is way simpler just to wake the task
    from idle, as RCU has callbacks there anyway.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Srivatsa S. Bhat
    Cc: Rusty Russell
    Cc: Namhyung Kim
    Reviewed-by: Paul E. McKenney
    Link: http://lkml.kernel.org/r/20120716103948.131256723@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

06 Jul, 2012

1 commit


03 Jul, 2012

2 commits

  • If the nohz= boot parameter disables nohz, then RCU_FAST_NO_HZ needs to
    also disable itself. This commit therefore checks for tick_nohz_enabled
    being zero, disabling rcu_prepare_for_idle() if so. This commit assumes
    that tick_nohz_enabled can change at runtime: If this is not the case,
    then a simpler approach suffices.
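
    Sketched for illustration (the function signature is simplified and
    the placement of the check is approximate):

    static void rcu_prepare_for_idle(int cpu)
    {
            if (!tick_nohz_enabled)
                    return;  /* nohz=off: nothing for RCU_FAST_NO_HZ to do */
            /* ... usual RCU_FAST_NO_HZ idle preparation ... */
    }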

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The arrival of TREE_PREEMPT_RCU some years back included some ugly
    code involving either #ifdef or #ifdef'ed wrapper functions to iterate
    over all non-SRCU flavors of RCU. This commit therefore introduces
    a for_each_rcu_flavor() iterator over the rcu_state structures for each
    flavor of RCU to clean up a bit of the ugliness.
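
    The iterator is essentially a thin wrapper around a list walk over
    the per-flavor rcu_state structures; roughly (a sketch that may
    differ in detail from the real macro):

    #define for_each_rcu_flavor(rsp) \
            list_for_each_entry((rsp), &rcu_struct_flavors, flavors)

    /* Usage: */
    struct rcu_state *rsp;

    for_each_rcu_flavor(rsp)
            do_something_per_flavor(rsp);  /* hypothetical per-flavor hook */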

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney