11 Jun, 2013

1 commit

  • This commit fixes a lockdep-detected deadlock by moving a wake_up()
    call out from a rnp->lock critical section. Please see below for
    the long version of this story.

    On Tue, 2013-05-28 at 16:13 -0400, Dave Jones wrote:

    > [12572.705832] ======================================================
    > [12572.750317] [ INFO: possible circular locking dependency detected ]
    > [12572.796978] 3.10.0-rc3+ #39 Not tainted
    > [12572.833381] -------------------------------------------------------
    > [12572.862233] trinity-child17/31341 is trying to acquire lock:
    > [12572.870390] (rcu_node_0){..-.-.}, at: [] rcu_read_unlock_special+0x9f/0x4c0
    > [12572.878859]
    > but task is already holding lock:
    > [12572.894894] (&ctx->lock){-.-...}, at: [] perf_lock_task_context+0x7d/0x2d0
    > [12572.903381]
    > which lock already depends on the new lock.
    >
    > [12572.927541]
    > the existing dependency chain (in reverse order) is:
    > [12572.943736]
    > -> #4 (&ctx->lock){-.-...}:
    > [12572.960032] [] lock_acquire+0x91/0x1f0
    > [12572.968337] [] _raw_spin_lock+0x40/0x80
    > [12572.976633] [] __perf_event_task_sched_out+0x2e7/0x5e0
    > [12572.984969] [] perf_event_task_sched_out+0x93/0xa0
    > [12572.993326] [] __schedule+0x2cf/0x9c0
    > [12573.001652] [] schedule_user+0x2e/0x70
    > [12573.009998] [] retint_careful+0x12/0x2e
    > [12573.018321]
    > -> #3 (&rq->lock){-.-.-.}:
    > [12573.034628] [] lock_acquire+0x91/0x1f0
    > [12573.042930] [] _raw_spin_lock+0x40/0x80
    > [12573.051248] [] wake_up_new_task+0xb7/0x260
    > [12573.059579] [] do_fork+0x105/0x470
    > [12573.067880] [] kernel_thread+0x26/0x30
    > [12573.076202] [] rest_init+0x23/0x140
    > [12573.084508] [] start_kernel+0x3f1/0x3fe
    > [12573.092852] [] x86_64_start_reservations+0x2a/0x2c
    > [12573.101233] [] x86_64_start_kernel+0xcc/0xcf
    > [12573.109528]
    > -> #2 (&p->pi_lock){-.-.-.}:
    > [12573.125675] [] lock_acquire+0x91/0x1f0
    > [12573.133829] [] _raw_spin_lock_irqsave+0x4b/0x90
    > [12573.141964] [] try_to_wake_up+0x31/0x320
    > [12573.150065] [] default_wake_function+0x12/0x20
    > [12573.158151] [] autoremove_wake_function+0x18/0x40
    > [12573.166195] [] __wake_up_common+0x58/0x90
    > [12573.174215] [] __wake_up+0x39/0x50
    > [12573.182146] [] rcu_start_gp_advanced.isra.11+0x4a/0x50
    > [12573.190119] [] rcu_start_future_gp+0x1c9/0x1f0
    > [12573.198023] [] rcu_nocb_kthread+0x114/0x930
    > [12573.205860] [] kthread+0xed/0x100
    > [12573.213656] [] ret_from_fork+0x7c/0xb0
    > [12573.221379]
    > -> #1 (&rsp->gp_wq){..-.-.}:
    > [12573.236329] [] lock_acquire+0x91/0x1f0
    > [12573.243783] [] _raw_spin_lock_irqsave+0x4b/0x90
    > [12573.251178] [] __wake_up+0x23/0x50
    > [12573.258505] [] rcu_start_gp_advanced.isra.11+0x4a/0x50
    > [12573.265891] [] rcu_start_future_gp+0x1c9/0x1f0
    > [12573.273248] [] rcu_nocb_kthread+0x114/0x930
    > [12573.280564] [] kthread+0xed/0x100
    > [12573.287807] [] ret_from_fork+0x7c/0xb0

    Notice the above call chain.

    rcu_start_future_gp() is called with the rnp->lock held. It then
    calls rcu_start_gp_advanced(), which does a wakeup.

    You can't do wakeups while holding the rnp->lock, because that would
    mean you could not do an rcu_read_unlock() while holding the rq lock,
    or any lock that was taken while holding the rq lock. The reason is
    shown below.

    > [12573.295067]
    > -> #0 (rcu_node_0){..-.-.}:
    > [12573.309293] [] __lock_acquire+0x1786/0x1af0
    > [12573.316568] [] lock_acquire+0x91/0x1f0
    > [12573.323825] [] _raw_spin_lock+0x40/0x80
    > [12573.331081] [] rcu_read_unlock_special+0x9f/0x4c0
    > [12573.338377] [] __rcu_read_unlock+0x96/0xa0
    > [12573.345648] [] perf_lock_task_context+0x143/0x2d0
    > [12573.352942] [] find_get_context+0x4e/0x1f0
    > [12573.360211] [] SYSC_perf_event_open+0x514/0xbd0
    > [12573.367514] [] SyS_perf_event_open+0x9/0x10
    > [12573.374816] [] tracesys+0xdd/0xe2

    Notice the above trace.

    perf took its own ctx->lock, which can be taken while holding the rq
    lock. While holding this lock, it did a rcu_read_unlock(). The
    perf_lock_task_context() basically looks like:

    rcu_read_lock();
    raw_spin_lock(&ctx->lock);
    rcu_read_unlock();

    What appears to have happened is that we were scheduled out after
    taking that first rcu_read_lock() but before taking the spinlock.
    When we were scheduled back in and took ctx->lock, the subsequent
    rcu_read_unlock() triggered the "special" code.

    rcu_read_unlock_special() takes the rnp->lock, which gives us a
    possible deadlock scenario:

    CPU0                    CPU1                    CPU2
    ----                    ----                    ----

                                                    rcu_nocb_kthread()
    lock(rq->lock);
                            lock(ctx->lock);
                                                    lock(rnp->lock);

                                                    wake_up();

                                                    lock(rq->lock);

                            rcu_read_unlock();

                            rcu_read_unlock_special();

                            lock(rnp->lock);
    lock(ctx->lock);

                       **** DEADLOCK ****

    > [12573.382068]
    > other info that might help us debug this:
    >
    > [12573.403229] Chain exists of:
    > rcu_node_0 --> &rq->lock --> &ctx->lock
    >
    > [12573.424471] Possible unsafe locking scenario:
    >
    > [12573.438499]        CPU0                    CPU1
    > [12573.445599]        ----                    ----
    > [12573.452691]   lock(&ctx->lock);
    > [12573.459799]                                lock(&rq->lock);
    > [12573.467010]                                lock(&ctx->lock);
    > [12573.474192]   lock(rcu_node_0);
    > [12573.481262]
    > *** DEADLOCK ***
    >
    > [12573.501931] 1 lock held by trinity-child17/31341:
    > [12573.508990] #0: (&ctx->lock){-.-...}, at: [] perf_lock_task_context+0x7d/0x2d0
    > [12573.516475]
    > stack backtrace:
    > [12573.530395] CPU: 1 PID: 31341 Comm: trinity-child17 Not tainted 3.10.0-rc3+ #39
    > [12573.545357] ffffffff825b4f90 ffff880219f1dbc0 ffffffff816e375b ffff880219f1dc00
    > [12573.552868] ffffffff816dfa5d ffff880219f1dc50 ffff88023ce4d1f8 ffff88023ce4ca40
    > [12573.560353] 0000000000000001 0000000000000001 ffff88023ce4d1f8 ffff880219f1dcc0
    > [12573.567856] Call Trace:
    > [12573.575011] [] dump_stack+0x19/0x1b
    > [12573.582284] [] print_circular_bug+0x200/0x20f
    > [12573.589637] [] __lock_acquire+0x1786/0x1af0
    > [12573.596982] [] ? sched_clock_cpu+0xb5/0x100
    > [12573.604344] [] lock_acquire+0x91/0x1f0
    > [12573.611652] [] ? rcu_read_unlock_special+0x9f/0x4c0
    > [12573.619030] [] _raw_spin_lock+0x40/0x80
    > [12573.626331] [] ? rcu_read_unlock_special+0x9f/0x4c0
    > [12573.633671] [] rcu_read_unlock_special+0x9f/0x4c0
    > [12573.640992] [] ? perf_lock_task_context+0x7d/0x2d0
    > [12573.648330] [] ? put_lock_stats.isra.29+0xe/0x40
    > [12573.655662] [] ? delay_tsc+0x90/0xe0
    > [12573.662964] [] __rcu_read_unlock+0x96/0xa0
    > [12573.670276] [] perf_lock_task_context+0x143/0x2d0
    > [12573.677622] [] ? __perf_event_enable+0x370/0x370
    > [12573.684981] [] find_get_context+0x4e/0x1f0
    > [12573.692358] [] SYSC_perf_event_open+0x514/0xbd0
    > [12573.699753] [] ? get_parent_ip+0xd/0x50
    > [12573.707135] [] ? trace_hardirqs_on_caller+0xfd/0x1c0
    > [12573.714599] [] SyS_perf_event_open+0x9/0x10
    > [12573.721996] [] tracesys+0xdd/0xe2

    This commit delays the wakeup via irq_work(), which is what
    perf and ftrace use to perform wakeups in critical sections.
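
    As a minimal sketch of this pattern (illustrative only; the queue and
    function names below are hypothetical, not the ones in the actual
    patch), the wakeup is queued as irq_work while the lock is held, and
    the real wake_up() then runs later from hard-interrupt context with
    no rnp->lock held:

    #include <linux/irq_work.h>
    #include <linux/wait.h>

    static DECLARE_WAIT_QUEUE_HEAD(gp_wq);      /* hypothetical wait queue */
    static struct irq_work gp_wakeup_work;      /* hypothetical irq_work   */

    static void gp_wakeup_func(struct irq_work *work)
    {
            /* Runs later, in hard-irq context, with no rnp->lock held. */
            wake_up(&gp_wq);
    }

    static void start_gp_with_rnp_lock_held(void)
    {
            /* Caller holds rnp->lock: must not call wake_up() directly. */
            irq_work_queue(&gp_wakeup_work);
    }

    static void gp_wakeup_init(void)
    {
            init_irq_work(&gp_wakeup_work, gp_wakeup_func);
    }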

    Reported-by: Dave Jones
    Signed-off-by: Steven Rostedt
    Signed-off-by: Paul E. McKenney

    Steven Rostedt
     

02 May, 2013

1 commit


19 Apr, 2013

1 commit

  • We need full-dynticks CPUs to also be RCU no-CBs CPUs so that we
    don't have to keep the tick running to handle RCU callbacks.

    Make sure the range passed to the nohz_full= boot parameter is a
    subset of rcu_nocbs=.

    The CPUs that fail to meet this requirement will be excluded from
    the nohz_full range. This is checked early during boot, before any
    CPU has had the opportunity to stop its tick.
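
    As a rough sketch of the shape of this check (the mask names follow
    kernel conventions, but the snippet is a simplification rather than
    the patch itself), CPUs requested via nohz_full= that are not also
    in rcu_nocbs= are simply dropped from the full-dynticks mask:

    #include <linux/cpumask.h>

    /* tick_nohz_full_mask: CPUs requested via nohz_full=
     * rcu_nocb_mask:       CPUs requested via rcu_nocbs=   */
    if (!cpumask_subset(tick_nohz_full_mask, rcu_nocb_mask))
            cpumask_and(tick_nohz_full_mask,
                        tick_nohz_full_mask, rcu_nocb_mask);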

    Suggested-by: Steven Rostedt
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

16 Apr, 2013

1 commit

  • Adaptive-ticks CPUs inform RCU when they enter kernel mode, but they do
    not necessarily turn the scheduler-clock tick back on. This state of
    affairs could result in RCU waiting on an adaptive-ticks CPU running
    for an extended period in kernel mode. Such a CPU will never run the
    RCU state machine, and could therefore indefinitely extend the RCU state
    machine, sooner or later resulting in an OOM condition.

    This patch, inspired by an earlier patch by Frederic Weisbecker, therefore
    causes RCU's force-quiescent-state processing to check for this condition
    and to send an IPI to CPUs that remain in that state for too long.
    "Too long" currently means about three jiffies by default, which is
    quite some time for a CPU to remain in the kernel without blocking.
    The rcutree.jiffies_till_first_fqs and rcutree.jiffies_till_next_fqs
    sysfs variables may be used to tune "too long" if needed.
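
    As a rough illustration only (gp_start_jiffies is a made-up stand-in
    for "when this grace period began"; only smp_send_reschedule() and
    the jiffies_till_*_fqs parameters come from the text above), the
    force-quiescent-state path conceptually does something like:

    #include <linux/jiffies.h>
    #include <linux/smp.h>

    if (time_after(jiffies, gp_start_jiffies + jiffies_till_first_fqs))
            smp_send_reschedule(cpu);  /* kick the holdout adaptive-ticks CPU */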

    Reported-by: Frederic Weisbecker
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett
    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Paul E. McKenney
     

26 Mar, 2013

6 commits

  • doc.2013.03.12a: Documentation changes.

    fixes.2013.03.13a: Miscellaneous fixes.

    idlenocb.2013.03.26b: Remove restrictions on no-CBs CPUs, make
    RCU_FAST_NO_HZ take advantage of numbered callbacks, add
    callback acceleration based on numbered callbacks.

    Paul E. McKenney
     
  • CPUs going idle will need to record the need for a future grace
    period, but won't actually need to block waiting on it. This commit
    therefore splits rcu_start_future_gp(), which does the recording, from
    rcu_nocb_wait_gp(), which now invokes rcu_start_future_gp() to do the
    recording, after which rcu_nocb_wait_gp() does the waiting.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • CPUs going idle need to be able to indicate their need for future grace
    periods. A mechanism for doing this already exists for no-callbacks
    CPUs, so the idea is to re-use that mechanism. This commit therefore
    moves the ->n_nocb_gp_requests field of the rcu_node structure out from
    under the CONFIG_RCU_NOCB_CPU #ifdef and renames it to ->need_future_gp.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Because RCU callbacks are now associated with the number of the grace
    period that they must wait for, CPUs can now take advance callbacks
    corresponding to grace periods that ended while a given CPU was in
    dyntick-idle mode. This eliminates the need to try forcing the RCU
    state machine while entering idle, thus reducing the CPU intensiveness
    of RCU_FAST_NO_HZ, which should increase its energy efficiency.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, the per-no-CBs-CPU kthreads are named "rcuo" followed by
    the CPU number, for example, "rcuo0" for CPU 0. This is problematic
    given that
    there are either two or three RCU flavors, each of which gets a per-CPU
    kthread with exactly the same name. This commit therefore introduces
    a one-letter abbreviation for each RCU flavor, namely 'b' for RCU-bh,
    'p' for RCU-preempt, and 's' for RCU-sched. This abbreviation is used
    to distinguish the "rcuo" kthreads, for example, for CPU 0 we would have
    "rcuob/0", "rcuop/0", and "rcuos/0".

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Tested-by: Dietmar Eggemann

    Paul E. McKenney
     
  • Currently, the no-CBs kthreads do repeated timed waits for grace periods
    to elapse. This is crude and energy inefficient, so this commit allows
    no-CBs kthreads to specify exactly which grace period they are waiting
    for and also allows them to block for the entire duration until the
    desired grace period completes.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

14 Mar, 2013

1 commit

  • If RCU's softirq handler is prevented from executing, an RCU CPU stall
    warning can result. Ways to prevent RCU's softirq handler from executing
    include: (1) CPU spinning with interrupts disabled, (2) infinite loop
    in some softirq handler, and (3) in -rt kernels, an infinite loop in a
    set of real-time threads running at priorities higher than that of RCU's
    softirq handler.

    Because this situation can be difficult to track down, this commit causes
    the count of RCU softirq handler invocations to be printed with RCU
    CPU stall warnings. This information does require some interpretation,
    as now documented in Documentation/RCU/stallwarn.txt.

    Reported-by: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Tested-by: Paul Gortmaker

    Paul E. McKenney
     

13 Mar, 2013

2 commits

  • Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, CPU 0 is constrained to not be a no-CBs CPU, and furthermore
    at least one no-CBs CPU must remain online at any given time. These
    restrictions are problematic in some situations, such as cases where
    all CPUs must run a real-time workload that needs to be insulated from
    OS jitter and latencies due to RCU callback invocation. This commit
    therefore provides no-CBs CPUs a (very crude and energy-inefficient)
    way to start and to wait for grace periods independently of the normal
    RCU callback mechanisms. This approach allows any or all of the CPUs to
    be designated as no-CBs CPUs, and allows any proper subset of the CPUs
    (whether no-CBs CPUs or not) to be offlined.

    This commit also provides a fix for a locking bug spotted by
    Xie ChanglongX.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

29 Jan, 2013

2 commits

  • …' and 'tiny.2013.01.29b' into HEAD

    doctorture.2013.01.11a: Changes to rcutorture and to RCU documentation.

    fixes.2013.01.26a: Miscellaneous fixes.

    tagcb.2013.01.24a: Tag RCU callbacks with grace-period number to
    simplify callback advancement.

    tiny.2013.01.29b: Enhancements to uniprocessor handling in tiny RCU.

    Paul E. McKenney
     
  • Tiny RCU has historically omitted RCU CPU stall warnings in order to
    reduce memory requirements; however, the lack of these warnings
    recently caused Thomas Gleixner some debugging pain. Therefore, this
    commit
    adds RCU CPU stall warnings to tiny RCU if RCU_TRACE=y. This keeps
    the memory footprint small, while still enabling CPU stall warnings
    in kernels built to enable them.

    Updated to include Josh Triplett's suggested use of RCU_STALL_COMMON
    config variable to simplify #if expressions.

    Reported-by: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     

27 Jan, 2013

1 commit


09 Jan, 2013

1 commit

  • Currently, callbacks are advanced each time the corresponding CPU
    notices a change in its leaf rcu_node structure's ->completed value
    (this value counts grace-period completions). This approach has worked
    quite well, but with the advent of RCU_FAST_NO_HZ, we cannot count on
    a given CPU seeing all the grace-period completions. When a CPU misses
    a grace-period completion that occurs while it is in dyntick-idle mode,
    this will delay invocation of its callbacks.

    In addition, acceleration of callbacks (when RCU realizes that a given
    callback need only wait until the end of the next grace period, rather
    than having to wait for a partial grace period followed by a full
    grace period) must be carried out extremely carefully. Insufficient
    acceleration will result in unnecessarily long grace-period latencies,
    while excessive acceleration will result in premature callback invocation.
    Changes that involve this tradeoff are therefore among the most
    nerve-wracking changes to RCU.

    This commit therefore explicitly tags groups of callbacks with the
    number of the grace period that they are waiting for. This means that
    callback-advancement and callback-acceleration functions are idempotent,
    so that excessive acceleration will merely waste a few CPU cycles. This
    also allows a CPU to take full advantage of any grace periods that have
    elapsed while it has been in dyntick-idle mode. It should also enable
    simultaneous simplifications to and optimizations of RCU_FAST_NO_HZ.
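
    A toy, self-contained illustration of the idempotence argument (plain
    user-space C, deliberately unrelated to the real RCU data structures):
    each callback group carries the grace-period number it waits for, so
    repeating the advancement pass with the same completed count changes
    nothing.

    #include <stdio.h>

    #define GROUPS 4

    static unsigned long wait_for[GROUPS] = { 101, 102, 103, 104 };
    static int ready[GROUPS];

    static void advance_callbacks(unsigned long completed)
    {
            for (int i = 0; i < GROUPS; i++)
                    if (!ready[i] && completed >= wait_for[i]) {
                            ready[i] = 1;  /* safe to repeat: no-op next time */
                            printf("group %d ready after GP %lu\n", i, completed);
                    }
    }

    int main(void)
    {
            advance_callbacks(102);  /* groups 0 and 1 become ready */
            advance_callbacks(102);  /* idempotent: nothing changes */
            advance_callbacks(104);  /* groups 2 and 3 become ready */
            return 0;
    }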

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

17 Nov, 2012

2 commits

  • Currently, callback invocations from callback-free CPUs are accounted to
    the CPU that registered the callback, but using the same field that is
    used for normal callbacks. This makes it impossible to determine from
    debugfs output whether callbacks are in fact being diverted. This commit
    therefore adds a separate ->n_nocbs_invoked field in the rcu_data structure
    in which diverted callback invocations are counted. RCU's debugfs tracing
    still displays normal callback invocations using ci=, but displays
    diverted callback invocations using nci=.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • RCU callback execution can add significant OS jitter and also can
    degrade both scheduling latency and, in asymmetric multiprocessors,
    energy efficiency. This commit therefore adds the ability for selected
    CPUs ("rcu_nocbs=" boot parameter) to have their callbacks offloaded
    to kthreads. If the "rcu_nocb_poll" boot parameter is also specified,
    these kthreads will do polling, removing the need for the offloaded
    CPUs to do wakeups. At least one CPU must be doing normal callback
    processing: currently CPU 0 cannot be selected as a no-CBs CPU.
    In addition, attempts to offline the last normal-CBs CPU will fail.
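
    For example (illustrative only), booting with "rcu_nocbs=1-7
    rcu_nocb_poll" would offload callbacks for CPUs 1-7 to kthreads that
    poll for work, leaving CPU 0 to do normal callback processing.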

    This feature was inspired by Jim Houston's and Joe Korty's JRCU, and
    this commit includes fixes to problems located by Fengguang Wu's
    kbuild test robot.

    [ paulmck: Added gfp.h include file as suggested by Fengguang Wu. ]

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

09 Nov, 2012

3 commits

  • This commit adds the counters to rcu_state and updates them in
    synchronize_rcu_expedited() to provide the data needed for debugfs
    tracing.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Tracing (debugfs) of expedited RCU primitives is required, which in turn
    requires that the relevant data be located where the tracing code can find
    it, not in its current static global variables in kernel/rcutree.c.
    This commit therefore moves sync_sched_expedited_started and
    sync_sched_expedited_done to the rcu_state structure, as fields
    ->expedited_start and ->expedited_done, respectively.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The ->onofflock field in the rcu_state structure at one time synchronized
    CPU-hotplug operations for RCU. However, its scope has decreased over time
    so that it now only protects the lists of orphaned RCU callbacks. This
    commit therefore renames it to ->orphan_lock to reflect its current use.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

09 Oct, 2012

1 commit

  • Kirill noted the following deadlock cycle on shutdown involving padata:

    > With commit 755609a9087fa983f567dc5452b2fa7b089b591f I've got deadlock on
    > poweroff.
    >
    > I guess it happens because of a race for cpu_hotplug.lock:
    >
    > CPU A                                 CPU B
    > disable_nonboot_cpus()
    > _cpu_down()
    > cpu_hotplug_begin()
    > mutex_lock(&cpu_hotplug.lock);
    > __cpu_notify()
    > padata_cpu_callback()
    > __padata_remove_cpu()
    > padata_replace()
    > synchronize_rcu()
    >                                       rcu_gp_kthread()
    >                                       get_online_cpus();
    >                                       mutex_lock(&cpu_hotplug.lock);

    It would of course be good to eliminate grace-period delays from
    CPU-hotplug notifiers, but that is a separate issue. Deadlock is
    not an appropriate diagnostic for excessive CPU-hotplug latency.

    Fortunately, grace-period initialization does not actually need to
    exclude all of the CPU-hotplug operation, but rather only RCU's own
    CPU_UP_PREPARE and CPU_DEAD CPU-hotplug notifiers. This commit therefore
    introduces a new per-rcu_state onoff_mutex that provides the required
    concurrency control in place of the get_online_cpus() that was previously
    in rcu_gp_init().
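
    A sketch of the resulting shape (simplified; the surrounding
    grace-period code and the notifier side are elided): grace-period
    initialization now serializes against only RCU's own hotplug
    notifiers via the new mutex rather than the global cpu_hotplug lock.

    /* In rcu_gp_init(), instead of get_online_cpus()/put_online_cpus(): */
    mutex_lock(&rsp->onoff_mutex);
    /* ... set up each rcu_node structure for the new grace period ... */
    mutex_unlock(&rsp->onoff_mutex);

    /* RCU's CPU_UP_PREPARE and CPU_DEAD notifiers take the same mutex. */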

    Reported-by: "Kirill A. Shutemov"
    Signed-off-by: Paul E. McKenney
    Tested-by: Kirill A. Shutemov

    Paul E. McKenney
     

26 Sep, 2012

3 commits

  • By default we don't want to enter an RCU extended quiescent state
    while running in userspace, because doing so adds some overhead
    (e.g., use of the syscall slowpath). Leave it off by default, ready
    to be enabled when a feature such as adaptive tickless needs it.

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Andrew Morton
    Cc: Avi Kivity
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Josh Triplett
    Cc: Kevin Hilman
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Sven-Thorsten Dietrich
    Cc: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
     
  • Allow calls to rcu_user_enter() even if we are already
    in userspace (as seen by RCU) and allow calls to rcu_user_exit()
    even if we are already in the kernel.

    This makes the APIs more flexible for architectures to call:
    exception entries, for example, won't need to know whether they came
    from userspace before calling rcu_user_exit().

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Andrew Morton
    Cc: Avi Kivity
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Josh Triplett
    Cc: Kevin Hilman
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Sven-Thorsten Dietrich
    Cc: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
     
  • The conflicts between kernel/rcutree.h and kernel/rcutree_plugin.h
    were due to adjacent insertions and deletions, which were resolved
    by simply accepting the changes on both branches.

    Paul E. McKenney
     

25 Sep, 2012

1 commit

  • …', 'hotplug.2012.09.23a' and 'idlechop.2012.09.23a' into HEAD

    bigrt.2012.09.23a contains additional commits to reduce scheduling latency
    from RCU on huge systems (many hundreds or thousands of CPUs).

    doctorture.2012.09.23a contains documentation changes and rcutorture fixes.

    fixes.2012.09.23a contains miscellaneous fixes.

    hotplug.2012.09.23a contains CPU-hotplug-related changes.

    idle.2012.09.23a fixes architectures for which RCU no longer considered
    the idle loop to be a quiescent state due to earlier
    adaptive-dynticks changes. Affected architectures are alpha,
    cris, frv, h8300, m32r, m68k, mn10300, parisc, score, xtensa,
    and ia64.

    Paul E. McKenney
     

23 Sep, 2012

8 commits

  • Currently, _rcu_barrier() relies on preempt_disable() to prevent
    any CPU from going offline, which in turn depends on CPU hotplug's
    use of __stop_machine().

    This patch therefore makes _rcu_barrier() use get_online_cpus() to
    block CPU-hotplug operations. This has the added benefit of removing
    the need for _rcu_barrier() to adopt callbacks: Because CPU-hotplug
    operations are excluded, there can be no callbacks to adopt. This
    commit simplifies the code accordingly.
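
    The overall shape, sketched for illustration (the completion handling
    and the per-CPU posting of the barrier callback are elided):

    get_online_cpus();             /* no CPU can come or go past this point */
    for_each_online_cpu(cpu) {
            /* post the barrier callback on each CPU that has callbacks */
    }
    put_online_cpus();
    /* then wait for all posted barrier callbacks to be invoked */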

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • The current quiescent-state detection algorithm is needlessly
    complex. It records the grace-period number corresponding to
    the quiescent state at the time of the quiescent state, which
    works, but it seems better to simply erase any record of previous
    quiescent states at the time that the CPU notices the new grace
    period. This has the further advantage of removing another piece
    of RCU for which lockless reasoning is required.

    Therefore, this commit makes this change.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Large systems running RCU_FAST_NO_HZ kernels see extreme memory
    contention on the rcu_state structure's ->fqslock field. This
    can be avoided by disabling RCU_FAST_NO_HZ, either at compile time
    or at boot time (via the nohz kernel boot parameter), but large
    systems will no doubt become sensitive to energy consumption.
    This commit therefore uses a combining-tree approach to spread the
    memory contention across new cache lines in the leaf rcu_node structures.
    This can be thought of as a tournament lock that has only a try-lock
    acquisition primitive.

    The effect on small systems is minimal, because such systems have
    an rcu_node "tree" consisting of a single node. In addition, this
    functionality is not used on fastpaths.
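
    A simplified sketch of the tournament idea (not the exact code;
    my_leaf_rcu_node() is a made-up stand-in for looking up this CPU's
    leaf rcu_node structure): a CPU walks from its leaf toward the root,
    at each level trying the ->fqslock and dropping out if another CPU
    already owns it.

    struct rcu_node *rnp, *rnp_old = NULL;

    for (rnp = my_leaf_rcu_node(); rnp != NULL; rnp = rnp->parent) {
            if (!raw_spin_trylock(&rnp->fqslock)) {
                    /* Lost this round: someone else is already pushing
                     * toward the root on our behalf, so just give up. */
                    if (rnp_old)
                            raw_spin_unlock(&rnp_old->fqslock);
                    return;
            }
            if (rnp_old)
                    raw_spin_unlock(&rnp_old->fqslock);  /* release lower level */
            rnp_old = rnp;
    }
    /* rnp_old is now the root: request quiescent-state forcing, then unlock. */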

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Moving quiescent-state forcing into a kthread dispenses with the need
    for the ->n_rp_need_fqs field, so this commit removes it.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • As the first step towards allowing quiescent-state forcing to be
    preemptible, this commit moves RCU quiescent-state forcing into the
    same kthread that is now used to initialize and clean up after grace
    periods. This is yet another step towards keeping scheduling
    latency down to a dull roar.

    Updated to change from raw_spin_lock_irqsave() to raw_spin_lock_irq()
    and to remove the now-unused rcu_state structure fields as suggested by
    Peter Zijlstra.

    Reported-by: Mike Galbraith
    Reported-by: Dimitri Sivanich
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The fields in the rcu_state structure that are protected by the
    root rcu_node structure's ->lock can share a cache line with the
    fields protected by ->onofflock. This can result in excessive
    memory contention on large systems, so this commit applies
    ____cacheline_internodealigned_in_smp to the ->onofflock field in
    order to segregate them.
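
    For illustration, the change amounts to declaring the field with the
    internode-alignment attribute so that it starts on its own cache line
    (a sketch of the declaration only):

    /* In struct rcu_state: */
    raw_spinlock_t onofflock ____cacheline_internodealigned_in_smp;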

    Signed-off-by: Dimitri Sivanich
    Signed-off-by: Paul E. McKenney
    Tested-by: Dimitri Sivanich
    Reviewed-by: Josh Triplett

    Dimitri Sivanich
     
  • In kernels built with CONFIG_RCU_FAST_NO_HZ=y, CPUs can accumulate a
    large number of lazy callbacks, which as the name implies will be slow
    to be invoked. This can be a problem on small-memory systems, where the
    default 6-second sleep for CPUs having only lazy RCU callbacks could well
    be fatal. This commit therefore installs an OOM handler that ensures that
    every CPU with lazy callbacks has at least one non-lazy callback, in turn
    ensuring timely advancement for these callbacks.

    Updated to fix bug that disabled OOM killing, noted by Lai Jiangshan.

    Updated to push the for_each_rcu_flavor() loop into rcu_oom_notify_cpu(),
    thus reducing the number of IPIs, as suggested by Steven Rostedt. Also
    to make the for_each_online_cpu() loop be preemptible. (Later, it might
    be good to use smp_call_function(), as suggested by Peter Zijlstra.)
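
    The shape of the mechanism, sketched for illustration
    (rcu_oom_notify_cpu() here stands in for the per-CPU work described
    above; the actual callback posting is elided):

    #include <linux/cpu.h>
    #include <linux/notifier.h>
    #include <linux/oom.h>
    #include <linux/sched.h>
    #include <linux/smp.h>

    static void rcu_oom_notify_cpu(void *unused)
    {
            /* Post one non-lazy callback per RCU flavor on this CPU. */
    }

    static int rcu_oom_notify(struct notifier_block *self,
                              unsigned long notused, void *nfreed)
    {
            int cpu;

            get_online_cpus();
            for_each_online_cpu(cpu) {
                    smp_call_function_single(cpu, rcu_oom_notify_cpu, NULL, 1);
                    cond_resched();  /* keep the loop preemptible */
            }
            put_online_cpus();
            return NOTIFY_OK;
    }

    static struct notifier_block rcu_oom_nb = {
            .notifier_call = rcu_oom_notify,
    };

    /* Registered once at boot: register_oom_notifier(&rcu_oom_nb); */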

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Tested-by: Sasha Levin
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • As the first step towards allowing grace-period initialization to be
    preemptible, this commit moves the RCU grace-period initialization
    into its own kthread. This is needed to keep large-system scheduling
    latency at reasonable levels.

    Also change raw_spin_lock_irqsave() to raw_spin_lock_irq() as suggested
    by Peter Zijlstra in review comments.

    Reported-by: Mike Galbraith
    Reported-by: Dimitri Sivanich
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     

13 Aug, 2012

2 commits

  • Bring RCU into the new-age CPU-hotplug fold by modifying RCU's per-CPU
    kthread code to use the new smp_hotplug_thread facility.

    [ tglx: Adapted it to use callbacks and to the simplified rcu yield ]

    Signed-off-by: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Srivatsa S. Bhat
    Cc: Rusty Russell
    Cc: Namhyung Kim
    Link: http://lkml.kernel.org/r/20120716103948.673354828@linutronix.de
    Signed-off-by: Thomas Gleixner

    Paul E. McKenney
     
  • The rcu_yield() code is amazing. It's there to avoid starvation of the
    system when lots of (boosting) work is to be done.

    Now looking at the code, its functionality is:

    Make the thread SCHED_OTHER and very nice, i.e. get it out of the way
    Arm a timer with 2 ticks
    schedule()

    Now if the system goes idle the rcu task returns, regains SCHED_FIFO
    and plugs on. If the system stays busy the timer fires and wakes a
    per-node kthread, which in turn makes the per-CPU thread SCHED_FIFO
    and brings it back onto the CPU. For the boosting thread the "make it FIFO"
    bit is missing and it just runs some magic boost checks. Now this is a
    lot of code with extra threads and complexity.

    It's way simpler to let the tasks when they detect overload schedule
    away for 2 ticks and defer the normal wakeup as long as they are in
    yielded state and the cpu is not idle.

    That solves the same problem, and the only difference is that when
    the CPU goes idle it's not guaranteed that the thread returns right
    away, but it won't be out longer than two ticks, so no harm is done.
    If that's an issue, then it is way simpler just to wake the task
    from idle, as RCU has callbacks there anyway.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Srivatsa S. Bhat
    Cc: Rusty Russell
    Cc: Namhyung Kim
    Reviewed-by: Paul E. McKenney
    Link: http://lkml.kernel.org/r/20120716103948.131256723@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

06 Jul, 2012

1 commit


03 Jul, 2012

2 commits

  • If the nohz= boot parameter disables nohz, then RCU_FAST_NO_HZ needs to
    also disable itself. This commit therefore checks for tick_nohz_enabled
    being zero, disabling rcu_prepare_for_idle() if so. This commit assumes
    that tick_nohz_enabled can change at runtime: If this is not the case,
    then a simpler approach suffices.
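
    Sketched for illustration (the function signature is simplified and
    the placement of the check is approximate):

    static void rcu_prepare_for_idle(int cpu)
    {
            if (!tick_nohz_enabled)
                    return;  /* nohz=off: nothing for RCU_FAST_NO_HZ to do */
            /* ... usual RCU_FAST_NO_HZ idle preparation ... */
    }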

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The arrival of TREE_PREEMPT_RCU some years back included some ugly
    code involving either #ifdef or #ifdef'ed wrapper functions to iterate
    over all non-SRCU flavors of RCU. This commit therefore introduces
    a for_each_rcu_flavor() iterator over the rcu_state structures for each
    flavor of RCU to clean up a bit of the ugliness.
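
    The iterator is essentially a thin wrapper around a list walk over
    the per-flavor rcu_state structures; roughly (a sketch that may
    differ in detail from the real macro):

    #define for_each_rcu_flavor(rsp) \
            list_for_each_entry((rsp), &rcu_struct_flavors, flavors)

    /* Usage: */
    struct rcu_state *rsp;

    for_each_rcu_flavor(rsp)
            do_something_per_flavor(rsp);  /* hypothetical per-flavor hook */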

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney