26 Jan, 2017

1 commit

  • commit 52d7e48b86fc108e45a656d8e53e4237993c481d upstream.

    The current preemptible RCU implementation goes through three phases
    during bootup. In the first phase, there is only one CPU that is running
    with preemption disabled, so that a no-op is a synchronous grace period.
    In the second mid-boot phase, the scheduler is running, but RCU has
    not yet gotten its kthreads spawned (and, for expedited grace periods,
    workqueues are not yet running). During this time, any attempt to do
    a synchronous grace period will hang the system (or complain bitterly,
    depending). In the third and final phase, RCU is fully operational and
    everything works normally.

    This has been OK for some time, but there have recently been some
    synchronous grace periods showing up during the second mid-boot phase.
    This code worked "by accident" for a while, but started failing as soon
    as expedited RCU grace periods switched over to workqueues in commit
    8b355e3bc140 ("rcu: Drive expedited grace periods from workqueue").
    Note that the code was buggy even before this commit, as it was subject
    to failure on real-time systems that forced all expedited grace periods
    to run as normal grace periods (for example, using the rcu_normal ksysfs
    parameter). The callchain from the failure case is as follows:

    early_amd_iommu_init()
    |-> acpi_put_table(ivrs_base);
    |-> acpi_tb_put_table(table_desc);
    |-> acpi_tb_invalidate_table(table_desc);
    |-> acpi_tb_release_table(...)
    |-> acpi_os_unmap_memory
    |-> acpi_os_unmap_iomem
    |-> acpi_os_map_cleanup
    |-> synchronize_rcu_expedited

    The kernel showing this callchain was built with CONFIG_PREEMPT_RCU=y,
    which caused the code to try using workqueues before they were
    initialized, which did not go well.

    This commit therefore reworks RCU to permit synchronous grace periods
    to proceed during this mid-boot phase. It is a fix for a regression
    introduced in v4.9, and is therefore being put forward
    post-merge-window in v4.10.

    This commit sets a flag from the existing rcu_scheduler_starting()
    function which causes all synchronous grace periods to take the expedited
    path. The expedited path now checks this flag, using the requesting task
    to drive the expedited grace period forward during the mid-boot phase.
    Finally, this flag is updated by a core_initcall() function named
    rcu_exp_runtime_mode(), which causes the runtime codepaths to be used.

    Note that this arrangement assumes that tasks are not sent POSIX signals
    (or anything similar) from the time that the first task is spawned
    through core_initcall() time.
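
    As a rough illustration of the phase gating described above, here is a
    minimal stand-alone C model; the enum, variable, and function names are
    invented for this sketch and are not the kernel's own.

    #include <stdio.h>

    enum boot_phase {
            PHASE_SINGLE_CPU,   /* phase 1: one CPU, preemption disabled  */
            PHASE_MID_BOOT,     /* phase 2: scheduler up, no kthreads yet */
            PHASE_RUNTIME,      /* phase 3: kthreads/workqueues running   */
    };

    static enum boot_phase phase = PHASE_SINGLE_CPU;

    /* Set from rcu_scheduler_starting() in the real code. */
    static void enter_mid_boot(void) { phase = PHASE_MID_BOOT; }

    /* Set from a core_initcall() such as rcu_exp_runtime_mode(). */
    static void enter_runtime(void) { phase = PHASE_RUNTIME; }

    static void model_synchronize_rcu(void)
    {
            switch (phase) {
            case PHASE_SINGLE_CPU:
                    /* Nothing else can hold an RCU read lock: no-op. */
                    break;
            case PHASE_MID_BOOT:
                    /* No kthreads or workqueues yet, so the requesting
                     * task drives the expedited grace period itself. */
                    printf("caller drives expedited grace period\n");
                    break;
            case PHASE_RUNTIME:
                    /* Normal operation: kthreads/workqueues do the work. */
                    printf("RCU kthreads drive the grace period\n");
                    break;
            }
    }

    int main(void)
    {
            model_synchronize_rcu();        /* phase 1 */
            enter_mid_boot();
            model_synchronize_rcu();        /* phase 2 */
            enter_runtime();
            model_synchronize_rcu();        /* phase 3 */
            return 0;
    }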

    Fixes: 8b355e3bc140 ("rcu: Drive expedited grace periods from workqueue")
    Reported-by: "Zheng, Lv"
    Reported-by: Borislav Petkov
    Signed-off-by: Paul E. McKenney
    Tested-by: Stan Kain
    Tested-by: Ivan
    Tested-by: Emanuel Castelo
    Tested-by: Bruno Pesavento
    Tested-by: Borislav Petkov
    Tested-by: Frederic Bezies
    Signed-off-by: Greg Kroah-Hartman

    Paul E. McKenney
     

16 Oct, 2016

1 commit

  • Pull gcc plugins update from Kees Cook:
    "This adds a new gcc plugin named "latent_entropy". It is designed to
    extract as much possible uncertainty from a running system at boot
    time as possible, hoping to capitalize on any possible variation in
    CPU operation (due to runtime data differences, hardware differences,
    SMP ordering, thermal timing variation, cache behavior, etc).

    At the very least, this plugin is a much more comprehensive example
    for how to manipulate kernel code using the gcc plugin internals"

    * tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    latent_entropy: Mark functions with __latent_entropy
    gcc-plugins: Add latent_entropy plugin

    Linus Torvalds
     

11 Oct, 2016

1 commit

  • The __latent_entropy gcc attribute can be used only on functions and
    variables. If it is on a function then the plugin will instrument it for
    gathering control-flow entropy. If the attribute is on a variable then
    the plugin will initialize it with random contents. The variable must
    be an integer, an integer array type or a structure with integer fields.

    These specific functions have been selected because they are init
    functions (to help gather boot-time entropy), are called at
    unpredictable times, or have variable loops, each of which provides
    some level of latent entropy.
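
    A hedged sketch of how the attribute is applied follows; the function
    and variable names are examples chosen for illustration, not taken
    from the patch itself. When the plugin is disabled, __latent_entropy
    expands to nothing.

    #include <linux/compiler.h>
    #include <linux/init.h>

    /* On a function: the plugin instruments it so that control flow
     * mixes additional state into the latent-entropy accumulator. */
    static int __init __latent_entropy example_driver_init(void)
    {
            return 0;
    }

    /* On a variable: the plugin seeds it with random contents at build
     * time. It must be an integer, an integer array, or a structure
     * with integer fields. */
    static unsigned long __latent_entropy example_seed[4];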

    Signed-off-by: Emese Revfy
    [kees: expanded commit message]
    Signed-off-by: Kees Cook

    Emese Revfy
     

23 Aug, 2016

4 commits

  • Up to now, RCU has assumed that the CPU-online process makes it from
    CPU_UP_PREPARE to set_cpu_online() within one jiffy. Given the recent
    rise of virtualized environments, this assumption is very clearly
    obsolete. Failing to meet this deadline can result in RCU paying
    attention to an incoming CPU for one jiffy, then ignoring it until the
    grace period following the one in which that CPU sets itself online.
    This situation might prove to be fatally disappointing to any RCU
    read-side critical sections that had the misfortune to execute during
    the time in which RCU was ignoring the slow-to-come-online CPU.

    This commit therefore updates RCU's internal CPU state-tracking
    information at notify_cpu_starting() time, thus providing RCU with
    an exact transition of the CPU's state from offline to online.

    Note that this means that incoming CPUs must not use RCU read-side
    critical sections (other than those of SRCU) until notify_cpu_starting()
    time. Note also that the CPU_STARTING notifiers -are- allowed to use
    RCU read-side critical sections. (Of course, CPU-hotplug notifiers are
    rapidly becoming obsolete, so you need to act fast!)

    If a given architecture or CPU family needs to use RCU read-side
    critical sections earlier, the call to rcu_cpu_starting() from
    notify_cpu_starting() will need to be architecture-specific, with
    architectures that need early use being required to hand-place
    the call to rcu_cpu_starting() at some point preceding the call to
    notify_cpu_starting().
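
    A hedged sketch of the arch-side ordering described above; the
    function name smp_callin_sketch() is invented for illustration.
    Generic code gains the rcu_cpu_starting() call in
    notify_cpu_starting(); an architecture needing RCU readers earlier
    would hand-place the call before it.

    #include <linux/cpu.h>
    #include <linux/cpumask.h>
    #include <linux/rcupdate.h>

    void smp_callin_sketch(unsigned int cpu)
    {
            /* ... low-level bringup of this CPU ... */

            rcu_cpu_starting(cpu);          /* RCU readers legal from here */
            notify_cpu_starting(cpu);       /* CPU_STARTING notifiers may
                                             * now use RCU readers too.   */
            set_cpu_online(cpu, true);
    }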

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, __note_gp_changes() checks to see if the CPU has slept through
    multiple grace periods. If it has, it resynchronizes that CPU's view
    of the grace-period state, which includes whether or not the current
    grace period needs a quiescent state from this CPU. This need (or lack
    thereof) is recorded in two places, rdp->cpu_no_qs.b.norm
    and rdp->core_needs_qs. The former tells RCU's context-switch code to
    go get a quiescent state and the latter says that it needs to be reported.
    The current code unconditionally sets the former to true, but correctly
    sets the latter.

    This does not result in failures, but it does unnecessarily increase
    the amount of work done on average at context-switch time. This commit
    therefore correctly sets both fields.
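
    A hedged sketch of the idea, not the exact kernel diff; the helper
    name is hypothetical, but the two fields are the ones named above,
    from the Tree RCU internals.

    /* Record whether the current grace period still needs a quiescent
     * state from this CPU, rather than unconditionally assuming it does. */
    static void note_need_for_qs(struct rcu_data *rdp, bool need_qs)
    {
            rdp->cpu_no_qs.b.norm = need_qs;  /* context switch: go get a QS */
            rdp->core_needs_qs = need_qs;     /* ...and it must be reported  */
    }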

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The Kconfig currently controlling compilation of tree.c is:

    init/Kconfig:config TREE_RCU
    init/Kconfig: bool

    ...and update.c and sync.c are "obj-y" meaning that none are ever
    built as a module by anyone.

    Since MODULE_ALIAS is a no-op for non-modular code, we can remove
    them from these files.

    We leave moduleparam.h behind since the files still instantiate some
    boot-time configuration parameters with module_param().
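
    A hedged illustration of the distinction, using an invented parameter
    name: module_param() remains useful in built-in ("obj-y") code because
    the value becomes settable on the kernel command line, whereas
    MODULE_ALIAS() is a no-op for non-modular code and can simply go.

    #include <linux/moduleparam.h>

    /* Still meaningful when built in: settable at boot time. */
    static int example_delay_jiffies;
    module_param(example_delay_jiffies, int, 0444);

    /* By contrast, MODULE_ALIAS("some-alias") is a no-op for non-modular
     * code, so such lines were removed from these files. */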

    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Mathieu Desnoyers
    Cc: Lai Jiangshan
    Signed-off-by: Paul Gortmaker
    Signed-off-by: Paul E. McKenney

    Paul Gortmaker
     
  • Commit abedf8e2419f ("rcu: Use simple wait queues where possible in
    rcutree") converts Tree RCU's wait queues to simple wait queues,
    but it incorrectly reverts the commit 2aa792e6faf1 ("rcu: Use
    rcu_gp_kthread_wake() to wake up grace period kthreads"). This can
    result in redundant self-wakeups.

    This commit therefore replaces the simple wait-queue wakeups with
    rcu_gp_kthread_wake(), thus avoiding the redundant wakeups.
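
    A hedged sketch of the kind of check the helper performs; the field
    names follow the Tree RCU code of that era, but treat the details as
    illustrative rather than the exact implementation.

    static void rcu_gp_kthread_wake_sketch(struct rcu_state *rsp)
    {
            if (current == rsp->gp_kthread ||   /* would be a self-wakeup  */
                !READ_ONCE(rsp->gp_flags) ||    /* nothing for it to do    */
                !rsp->gp_kthread)               /* kthread not yet spawned */
                    return;
            swake_up(&rsp->gp_wq);              /* simple-wait-queue wakeup */
    }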

    Signed-off-by: Jisheng Zhang
    Signed-off-by: Paul E. McKenney

    Jisheng Zhang
     

15 Jul, 2016

1 commit

  • Straightforward conversion to the state machine, though the question
    arises whether it really needs all these state transitions to work.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Anna-Maria Gleixner
    Reviewed-by: Sebastian Andrzej Siewior
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160713153337.982013161@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

16 Jun, 2016

2 commits

  • In many cases in the RCU tree code, we iterate over the set of cpus for
    a leaf node described by rcu_node::grplo and rcu_node::grphi, checking
    per-cpu data for each cpu in this range. However, if the set of possible
    cpus is sparse, some cpus described in this range are not possible, and
    thus no per-cpu region will have been allocated (or initialised) for
    them by the generic percpu code.

    Erroneous accesses to a per-cpu area for these !possible cpus may fault
    or may hit other data, depending on the address generated when the
    erroneous per-cpu offset is applied. In practice, both cases have been
    observed on arm64 hardware (the former being silent, but detectable with
    additional patches).

    To avoid issues resulting from this, we must iterate over the set of
    *possible* cpus for a given leaf node. This patch adds a new helper,
    for_each_leaf_node_possible_cpu, to enable this. As iteration is often
    intertwined with rcu_node local bitmask manipulation, a new
    leaf_node_cpu_bit helper is added to make this simpler and more
    consistent. The RCU tree code is made to use both of these where
    appropriate.
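
    A sketch close to what the patch adds (treat the exact definitions as
    illustrative): the iterator skips !possible cpus within the leaf's
    [grplo, grphi] range, and the bit helper maps a cpu back to its bit
    in the leaf's masks.

    #define for_each_leaf_node_possible_cpu(rnp, cpu) \
            for ((cpu) = cpumask_next((rnp)->grplo - 1, cpu_possible_mask); \
                 (cpu) <= (rnp)->grphi; \
                 (cpu) = cpumask_next((cpu), cpu_possible_mask))

    static inline unsigned long leaf_node_cpu_bit(struct rcu_node *rnp, int cpu)
    {
            return 1UL << (cpu - rnp->grplo);
    }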

    Without this patch, running reboot at a shell can result in an oops
    like:

    [ 3369.075979] Unable to handle kernel paging request at virtual address ffffff8008b21b4c
    [ 3369.083881] pgd = ffffffc3ecdda000
    [ 3369.087270] [ffffff8008b21b4c] *pgd=00000083eca48003, *pud=00000083eca48003, *pmd=0000000000000000
    [ 3369.096222] Internal error: Oops: 96000007 [#1] PREEMPT SMP
    [ 3369.101781] Modules linked in:
    [ 3369.104825] CPU: 2 PID: 1817 Comm: NetworkManager Tainted: G W 4.6.0+ #3
    [ 3369.121239] task: ffffffc0fa13e000 ti: ffffffc3eb940000 task.ti: ffffffc3eb940000
    [ 3369.128708] PC is at sync_rcu_exp_select_cpus+0x188/0x510
    [ 3369.134094] LR is at sync_rcu_exp_select_cpus+0x104/0x510
    [ 3369.139479] pc : [] lr : [] pstate: 200001c5
    [ 3369.146860] sp : ffffffc3eb9435a0
    [ 3369.150162] x29: ffffffc3eb9435a0 x28: ffffff8008be4f88
    [ 3369.155465] x27: ffffff8008b66c80 x26: ffffffc3eceb2600
    [ 3369.160767] x25: 0000000000000001 x24: ffffff8008be4f88
    [ 3369.166070] x23: ffffff8008b51c3c x22: ffffff8008b66c80
    [ 3369.171371] x21: 0000000000000001 x20: ffffff8008b21b40
    [ 3369.176673] x19: ffffff8008b66c80 x18: 0000000000000000
    [ 3369.181975] x17: 0000007fa951a010 x16: ffffff80086a30f0
    [ 3369.187278] x15: 0000007fa9505590 x14: 0000000000000000
    [ 3369.192580] x13: ffffff8008b51000 x12: ffffffc3eb940000
    [ 3369.197882] x11: 0000000000000006 x10: ffffff8008b51b78
    [ 3369.203184] x9 : 0000000000000001 x8 : ffffff8008be4000
    [ 3369.208486] x7 : ffffff8008b21b40 x6 : 0000000000001003
    [ 3369.213788] x5 : 0000000000000000 x4 : ffffff8008b27280
    [ 3369.219090] x3 : ffffff8008b21b4c x2 : 0000000000000001
    [ 3369.224406] x1 : 0000000000000001 x0 : 0000000000000140
    ...
    [ 3369.972257] [] sync_rcu_exp_select_cpus+0x188/0x510
    [ 3369.978685] [] synchronize_rcu_expedited+0x64/0xa8
    [ 3369.985026] [] synchronize_net+0x24/0x30
    [ 3369.990499] [] dev_deactivate_many+0x28c/0x298
    [ 3369.996493] [] __dev_close_many+0x60/0xd0
    [ 3370.002052] [] __dev_close+0x28/0x40
    [ 3370.007178] [] __dev_change_flags+0x8c/0x158
    [ 3370.012999] [] dev_change_flags+0x20/0x60
    [ 3370.018558] [] do_setlink+0x288/0x918
    [ 3370.023771] [] rtnl_newlink+0x398/0x6a8
    [ 3370.029158] [] rtnetlink_rcv_msg+0xe4/0x220
    [ 3370.034891] [] netlink_rcv_skb+0xc4/0xf8
    [ 3370.040364] [] rtnetlink_rcv+0x2c/0x40
    [ 3370.045663] [] netlink_unicast+0x160/0x238
    [ 3370.051309] [] netlink_sendmsg+0x2f0/0x358
    [ 3370.056956] [] sock_sendmsg+0x18/0x30
    [ 3370.062168] [] ___sys_sendmsg+0x26c/0x280
    [ 3370.067728] [] __sys_sendmsg+0x44/0x88
    [ 3370.073027] [] SyS_sendmsg+0x10/0x20
    [ 3370.078153] [] el0_svc_naked+0x24/0x28

    Signed-off-by: Mark Rutland
    Reported-by: Dennis Chen
    Cc: Catalin Marinas
    Cc: Josh Triplett
    Cc: Lai Jiangshan
    Cc: Mathieu Desnoyers
    Cc: Steve Capper
    Cc: Steven Rostedt
    Cc: Will Deacon
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Paul E. McKenney

    Mark Rutland
     
  • It is not always easy to determine the cause of an RCU stall just by
    analysing the RCU stall messages, especially when the problem is caused
    by indirect starvation of the RCU threads, for example, when preempt_rcu
    is not awakened due to the starvation of a timer softirq.

    We have been hard coding panic() in the RCU stall functions for
    some time while testing the kernel-rt. But this is not possible in
    some scenarios, like when supporting customers.

    This patch implements the sysctl kernel.panic_on_rcu_stall. If
    set to 1, the system will panic() when an RCU stall takes place,
    enabling the capture of a vmcore. The vmcore provides a way to analyze
    all kernel and task states, helping to point to the culprit and the
    solution for the stall.

    The kernel.panic_on_rcu_stall sysctl is disabled by default.
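
    A hedged sketch of the mechanism (the helper below is illustrative of
    the shape described above, not a copy of the patch): a __read_mostly
    flag backed by the sysctl, checked from the stall-warning path.

    int sysctl_panic_on_rcu_stall __read_mostly;    /* 0 = off (default) */

    static void panic_on_rcu_stall(void)
    {
            if (sysctl_panic_on_rcu_stall)
                    panic("RCU stall\n");   /* leave a vmcore behind */
    }

    At runtime the knob would be flipped with something like
    "sysctl -w kernel.panic_on_rcu_stall=1" on systems configured to
    capture a crash dump.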

    Changes from v1:
    - Fixed a typo in the git log
    - The if(sysctl_panic_on_rcu_stall) panic() is in a static function
    - Fixed the CONFIG_TINY_RCU compilation issue
    - The var sysctl_panic_on_rcu_stall is now __read_mostly

    Cc: Jonathan Corbet
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Mathieu Desnoyers
    Cc: Lai Jiangshan
    Acked-by: Christian Borntraeger
    Reviewed-by: Josh Triplett
    Reviewed-by: Arnaldo Carvalho de Melo
    Tested-by: "Luis Claudio R. Goncalves"
    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Paul E. McKenney

    Daniel Bristot de Oliveira
     

15 Jun, 2016

4 commits

  • People have been having some difficulty finding their way around the
    RCU code. This commit therefore pulls some of the expedited grace-period
    code from tree.c to a new tree_exp.h file. This commit is strictly code
    movement, with the exception of a forward declaration that was added
    for the sync_sched_exp_online_cleanup() function.

    A subsequent commit will move the remaining expedited grace-period code
    from tree_plugin.h to tree_exp.h.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • I think you'll find this condition is superfluous, as the whole function
    is under an #ifdef of that same condition.

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Paul E. McKenney

    Peter Zijlstra
     
  • In the past, RCU grace-period initialization excluded CPU-hotplug
    operations, but this is no longer the case. This commit therefore
    removes an outdated comment in rcu_gp_init() claiming that these
    operations are excluded.

    Reported-by: Lihao Liang
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The comment header for rcu_scheduler_active states that it is used
    to optimize synchronize_sched() at early boot. This is incorrect.
    The synchronize_sched() function instead checks the number of online
    CPUs. This commit therefore replaces the comment's synchronize_sched()
    with synchronize_rcu(), which really does use rcu_scheduler_active for
    this purpose.

    Reported-by: Lihao Liang
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

01 Apr, 2016

20 commits

  • This commit provides rcu_exp_batches_completed() and
    rcu_exp_batches_completed_sched() functions to allow torture-test modules
    to check how many expedited grace period batches have completed.
    These are analogous to the existing rcu_batches_completed(),
    rcu_batches_completed_bh(), and rcu_batches_completed_sched() functions.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • If it is necessary to kick the grace-period kthread, that is a good
    time to dump the trace buffer in order to learn why kicking was needed.
    This commit therefore does the dump.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Recent kernels can fail to awaken the grace-period kthread for
    quiescent-state forcing. This commit is a crude hack that does
    a wakeup if a scheduling-clock interrupt sees that it has been
    too long since force-quiescent-state (FQS) processing.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, the force-quiescent-state (FQS) code in rcu_gp_kthread() can
    advance the next FQS even if one was not executed last time. This can
    happen due to timeout-duration uncertainty. This commit therefore avoids
    advancing the FQS schedule unless an FQS was just executed. In the
    corner case where an FQS was not executed, but is due now, the code does
    a one-jiffy wait.

    This change prepares for kthread kicking.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Recent kernels can fail to awaken the grace-period kthread for
    quiescent-state forcing. This commit is a crude hack that does
    a wakeup any time a stall is detected.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The current expedited grace-period implementation makes subsequent grace
    periods wait on wakeups for the prior grace period. This does not fit
    the dictionary definition of "expedited", so this commit allows these two
    phases to overlap. Doing this requires four waitqueues rather than two
    because tasks can now be waiting on the previous, current, and next grace
    periods. The fourth waitqueue makes the bit masking work out nicely.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit pulls the grace-period-start counter adjustment and tracing
    from synchronize_rcu_expedited() and synchronize_sched_expedited()
    into exp_funnel_lock(), thus eliminating some code duplication.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit moves some duplicate code from synchronize_rcu_expedited()
    and synchronize_sched_expedited() into rcu_exp_gp_seq_snap(). This
    doesn't save lines of code, but does eliminate a "tell me twice" issue.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, synchronize_rcu_expedited() and rcu_sched_expedited() have
    significant duplicate code. This commit therefore consolidates some of
    this code into rcu_exp_wake(), which is now renamed to rcu_exp_wait_wake()
    in recognition of its added responsibilities.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit speeds up the low-contention case, especially for systems
    with large rcu_node trees, by attempting to directly acquire the
    ->exp_mutex. This fastpath checks the leaves and root first in
    order to avoid excessive memory contention on the mutex itself.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The current mutex-based funnel-locking approach used by expedited grace
    periods is subject to severe unfairness. The problem arises when a
    few tasks, making a path from leaves to root, all wake up before other
    tasks do. A new task can then follow this path all the way to the root,
    which needlessly delays tasks whose grace period is done, but who do
    not happen to acquire the lock quickly enough.

    This commit avoids this problem by maintaining per-rcu_node wait queues,
    along with a per-rcu_node counter that tracks the latest grace period
    sought by an earlier task to visit this node. If that grace period
    would satisfy the current task, instead of proceeding up the tree,
    it waits on the current rcu_node structure using a pair of wait queues
    provided for that purpose. This decouples awakening of old tasks from
    the arrival of new tasks.

    If the wakeups prove to be a bottleneck, additional kthreads can be
    brought to bear for that purpose.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Just a name change to save a few lines and a bit of typing.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The cpu_online() function can return values other than 0 and 1, which
    can result in subscript overflow when applied to a two-element array.
    This commit allows for this behavior by using "!!" on the return value
    from cpu_online() when used as a subscript.
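
    A minimal stand-alone illustration of the pattern (not the kernel's
    code): a predicate that may return any nonzero value for "true" must
    be collapsed to 0 or 1 with "!!" before indexing a two-element array.

    #include <stdio.h>

    static const char *const state_name[2] = { "offline", "online" };

    /* Stand-in predicate that returns a nonzero value other than 1. */
    static int fake_cpu_online(int cpu)
    {
            return cpu & 0x04;      /* "true" comes back as 4 */
    }

    int main(void)
    {
            int cpu = 6;

            /* state_name[fake_cpu_online(cpu)] would index element 4. */
            printf("cpu %d is %s\n", cpu, state_name[!!fake_cpu_online(cpu)]);
            return 0;
    }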

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Commit #cdacbe1f91264 ("rcu: Add fastpath bypassing funnel locking")
    turns out to be a pessimization at high load because it forces a tree
    full of tasks to wait for an expedited grace period that they probably
    do not need. This commit therefore removes this optimization.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Although cond_resched_rcu_qs() supplies quiescent states to all flavors
    of normal RCU grace periods, it does nothing for expedited RCU-sched
    grace periods. This commit therefore adds a check for a need for a
    quiescent state from the current CPU by an expedited RCU-sched grace
    period, and invokes rcu_sched_qs() to supply that quiescent state if so.

    Note that the check is racy in that we might be migrated to some other
    CPU just after checking the per-CPU variable. This is OK because the
    act of migration will do a context switch, which will supply the needed
    quiescent state. The only downside is that we might do an unnecessary
    call to rcu_sched_qs(), but the probability is low and the overhead
    is small.
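
    A hedged sketch of the check described above, added conceptually to
    the code behind cond_resched_rcu_qs(); the per-CPU flag name is
    hypothetical, standing in for however the expedited machinery marks
    this CPU as owing a quiescent state.

    static DEFINE_PER_CPU(bool, rcu_sched_exp_qs_pending);  /* hypothetical */

    static void rcu_all_qs_exp_check_sketch(void)
    {
            if (unlikely(__this_cpu_read(rcu_sched_exp_qs_pending)))
                    rcu_sched_qs();
            /* Racy if we migrate right after the check, but the
             * migration's context switch supplies the needed QS. */
    }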

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, synchronize_sched_expedited_wait() simply sets the ndetected
    variable to the rcu_print_task_exp_stall() return value. This means
    that if the last rcu_node structure has no stalled tasks, record of
    any stalled tasks in previous rcu_node structures is lost, which can
    in turn result in failure to dump out the blocking rcu_node structures.
    Or could, had the test been correct.

    This commit therefore adds the return value of rcu_print_task_exp_stall()
    to ndetected and corrects the later test for ndetected.
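
    A minimal stand-alone illustration of the accumulation fix, with an
    array standing in for the per-leaf rcu_node scan:

    #include <stdio.h>

    /* Stand-in for rcu_print_task_exp_stall(): stalled tasks per leaf. */
    static const int stalled_on_leaf[3] = { 2, 0, 0 };

    int main(void)
    {
            int ndetected = 0;

            for (int i = 0; i < 3; i++) {
                    /* Buggy form: ndetected = stalled_on_leaf[i];
                     * the last leaf (0 stalls) would erase the count. */
                    ndetected += stalled_on_leaf[i];
            }
            printf("stalled tasks detected: %d\n", ndetected);  /* 2 */
            return 0;
    }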

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, sync_sched_exp_handler() will force a reschedule unless
    this CPU has already checked in or unless a reschedule has already
    been called for. This is clearly wasteful if sync_sched_exp_handler()
    interrupted an idle CPU, so this commit immediately reports the
    quiescent state in that case.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit consolidates a couple of definitions and several calls for
    single-shot ftrace-buffer dumping.
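
    The usual single-shot pattern looks roughly like the sketch below (the
    macro name and details are illustrative): a static flag ensures that
    only the first caller actually dumps the ftrace buffer.

    #define rcu_ftrace_dump_sketch(oops_dump_mode) \
    do { \
            static atomic_t ___rfd_beenhere = ATOMIC_INIT(0); \
            \
            if (!atomic_read(&___rfd_beenhere) && \
                !atomic_xchg(&___rfd_beenhere, 1)) \
                    ftrace_dump(oops_dump_mode); \
    } while (0)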

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

16 Mar, 2016

1 commit

  • Pull cpu hotplug updates from Thomas Gleixner:
    "This is the first part of the ongoing cpu hotplug rework:

    - Initial implementation of the state machine

    - Runs all online and prepare down callbacks on the plugged cpu and
    not on some random processor

    - Replaces busy loop waiting with completions

    - Adds tracepoints so the states can be followed"

    More detailed commentary on this work from an earlier email:
    "What's wrong with the current cpu hotplug infrastructure?

    - Asymmetry

    The hotplug notifier mechanism is asymmetric versus the bringup and
    teardown. This is mostly caused by the notifier mechanism.

    - Largely undocumented dependencies

    While some notifiers use explicitly defined notifier priorities,
    we have quite some notifiers which use numerical priorities to
    express dependencies without any documentation why.

    - Control processor driven

    Most of the bringup/teardown of a cpu is driven by a control
    processor. While it is understandable that preparatory steps,
    like idle thread creation, memory allocation for and initialization
    of essential facilities, need to be done before a cpu can boot,
    there is no reason why everything else must run on a control
    processor. Before this patch series, bringup looks like this:

    Control CPU                          Booting CPU

    do preparatory steps
    kick cpu into life

                                         do low level init

    sync with booting cpu                sync with control cpu

    bring the rest up

    - All or nothing approach

    There is no way to do partial bringups. That's something which is
    really desired because we waste, e.g. at boot, a substantial amount of
    time just busy waiting for the cpu to come to life. That's stupid,
    as we could very well do the preparatory steps and the initial IPI for
    other cpus and then go back and do the necessary low level
    synchronization with the freshly booted cpu.

    - Minimal debuggability

    Due to the notifier based design, it's impossible to switch between
    two stages of the bringup/teardown back and forth in order to test
    the correctness. So in many hotplug notifiers the cancel
    mechanisms are either nonexistent or completely untested.

    - Notifier [un]registering is tedious

    To [un]register notifiers we need to protect against hotplug at
    every callsite. There is no mechanism by which bringup/teardown
    callbacks are issued on the online cpus, so every caller needs to
    do it itself. That also includes error rollback.

    What's the new design?

    The base of the new design is a symmetric state machine, where both
    the control processor and the booting/dying cpu execute a well
    defined set of states. Each state is symmetric in the end, except
    for some well defined exceptions, and the bringup/teardown can be
    stopped and reversed at almost all states.

    So the bringup of a cpu will look like this in the future:

    Control CPU                          Booting CPU

    do preparatory steps
    kick cpu into life

                                         do low level init

    sync with booting cpu                sync with control cpu

                                         bring itself up

    The synchronization step does not require the control cpu to wait.
    That mechanism can be done asynchronously via a worker or some
    other mechanism.

    The teardown can be made very similar, so that the dying cpu cleans
    up and brings itself down. Cleanups which need to be done after
    the cpu is gone, can be scheduled asynchronously as well.

    There is a long way to go, as we need to refactor the notion of when a
    cpu is available. Today we set the cpu online right after it comes
    out of the low level bringup, which is not really correct.

    The proper mechanism is to set it to available, i.e. cpu local
    threads, like softirqd, hotplug thread etc. can be scheduled on that
    cpu, and once it finished all booting steps, it's set to online, so
    general workloads can be scheduled on it. The reverse happens on
    teardown. First thing to do is to forbid scheduling of general
    workloads, then teardown all the per cpu resources and finally shut it
    off completely.

    This patch series implements the basic infrastructure for this at the
    core level. This includes the following:

    - Basic state machine implementation with well defined states, so
    ordering and prioritization can be expressed.

    - Interfaces to [un]register state callbacks

    This invokes the bringup/teardown callback on all online cpus with
    the proper protection in place and [un]installs the callbacks in
    the state machine array.

    For callbacks which have no particular ordering requirement we have
    a dynamic state space, so that drivers don't have to register an
    explicit hotplug state.

    If a callback fails, the code automatically does a rollback to the
    previous state.

    - Sysfs interface to drive the state machine to a particular step.

    This is only partially functional today. Full functionality and
    therefore testability will be achieved once we have converted all
    existing hotplug notifiers over to the new scheme.

    - Run all CPU_ONLINE/DOWN_PREPARE notifiers on the booting/dying
    processor:

    Control CPU                          Booting CPU

    do preparatory steps
    kick cpu into life

                                         do low level init

    sync with booting cpu                sync with control cpu
    wait for boot
                                         bring itself up

                                         Signal completion to control cpu

    In a previous step of this work we've done a full tree mechanical
    conversion of all hotplug notifiers to the new scheme. The balance
    is a net removal of about 4000 lines of code.

    This is not included in this series, as we decided to take a
    different approach. Instead of mechanically converting everything
    over, we will do a proper overhaul of the usage sites one by one so
    they nicely fit into the symmetric callback scheme.

    I decided to do that after I looked at the ugliness of some of the
    converted sites and figured out that their hotplug mechanism is
    completely buggered anyway. So there is no point in doing a
    mechanical conversion first, as we need to go through the usage
    sites one by one again in order to achieve a fully symmetric and
    testable behaviour"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
    cpu/hotplug: Document states better
    cpu/hotplug: Fix smpboot thread ordering
    cpu/hotplug: Remove redundant state check
    cpu/hotplug: Plug death reporting race
    rcu: Make CPU_DYING_IDLE an explicit call
    cpu/hotplug: Make wait for dead cpu completion based
    cpu/hotplug: Let upcoming cpu bring itself fully up
    arch/hotplug: Call into idle with a proper state
    cpu/hotplug: Move online calls to hotplugged cpu
    cpu/hotplug: Create hotplug threads
    cpu/hotplug: Split out the state walk into functions
    cpu/hotplug: Unpark smpboot threads from the state machine
    cpu/hotplug: Move scheduler cpu_online notifier to hotplug core
    cpu/hotplug: Implement setup/removal interface
    cpu/hotplug: Make target state writeable
    cpu/hotplug: Add sysfs state interface
    cpu/hotplug: Hand in target state to _cpu_up/down
    cpu/hotplug: Convert the hotplugged cpu work to a state machine
    cpu/hotplug: Convert to a state machine for the control processor
    cpu/hotplug: Add tracepoints
    ...

    Linus Torvalds
     

02 Mar, 2016

1 commit

  • Make the RCU CPU_DYING_IDLE callback an explicit function call, so it gets
    invoked at the proper place.

    Signed-off-by: Thomas Gleixner
    Cc: linux-arch@vger.kernel.org
    Cc: Rik van Riel
    Cc: Rafael Wysocki
    Cc: "Srivatsa S. Bhat"
    Cc: Peter Zijlstra
    Cc: Arjan van de Ven
    Cc: Sebastian Siewior
    Cc: Rusty Russell
    Cc: Steven Rostedt
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Andrew Morton
    Cc: Paul McKenney
    Cc: Linus Torvalds
    Cc: Paul Turner
    Link: http://lkml.kernel.org/r/20160226182341.870167933@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

25 Feb, 2016

1 commit

  • As of commit dae6e64d2bcfd ("rcu: Introduce proper blocking to no-CBs kthreads
    GP waits") the RCU subsystem started making use of wait queues.

    Here we convert all additions of RCU wait queues to use simple wait queues,
    since they don't need the extra overhead of the full wait queue features.

    Originally this was done for RT kernels[1], since we would get things like...

    BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
    in_atomic(): 1, irqs_disabled(): 1, pid: 8, name: rcu_preempt
    Pid: 8, comm: rcu_preempt Not tainted
    Call Trace:
    [] __might_sleep+0xd0/0xf0
    [] rt_spin_lock+0x24/0x50
    [] __wake_up+0x36/0x70
    [] rcu_gp_kthread+0x4d2/0x680
    [] ? __init_waitqueue_head+0x50/0x50
    [] ? rcu_gp_fqs+0x80/0x80
    [] kthread+0xdb/0xe0
    [] ? finish_task_switch+0x52/0x100
    [] kernel_thread_helper+0x4/0x10
    [] ? __init_kthread_worker+0x60/0x60
    [] ? gs_change+0xb/0xb

    ...and hence simple wait queues were deployed on RT out of necessity
    (as simple wait uses a raw lock), but mainline might as well take
    advantage of the more streamlined support as well.

    [1] This is a carry forward of work from v3.10-rt; the original conversion
    was by Thomas on an earlier -rt version, and Sebastian extended it to
    additional post-3.10 added RCU waiters; here I've added a commit log and
    unified the RCU changes into one, and uprev'd it to match mainline RCU.
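
    A hedged sketch of the flavor of conversion (illustrative structure
    and call sites, not the actual diff), using the simple wait-queue API
    of that era:

    #include <linux/swait.h>

    struct example_state {
            struct swait_queue_head gp_wq;  /* was: wait_queue_head_t gp_wq */
    };

    static void example_init(struct example_state *s)
    {
            init_swait_queue_head(&s->gp_wq);   /* was: init_waitqueue_head() */
    }

    static void example_wake(struct example_state *s)
    {
            swake_up(&s->gp_wq);                /* was: wake_up() */
    }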

    Signed-off-by: Daniel Wagner
    Acked-by: Peter Zijlstra (Intel)
    Cc: linux-rt-users@vger.kernel.org
    Cc: Boqun Feng
    Cc: Marcelo Tosatti
    Cc: Steven Rostedt
    Cc: Paul Gortmaker
    Cc: Paolo Bonzini
    Cc: "Paul E. McKenney"
    Link: http://lkml.kernel.org/r/1455871601-27484-6-git-send-email-wagi@monom.org
    Signed-off-by: Thomas Gleixner

    Paul Gortmaker