01 Sep, 2013

2 commits

  • Because RCU's quiescent-state-forcing mechanism is used to drive the
    full-system-idle state machine, and because this mechanism is executed
    by RCU's grace-period kthreads, this commit forces these kthreads to
    run on the timekeeping CPU (tick_do_timer_cpu). To do otherwise would
    mean that the RCU grace-period kthreads would force the system into
    non-idle state every time they drove the state machine, which would
    be just a bit on the futile side.

    Signed-off-by: Paul E. McKenney
    Cc: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Lai Jiangshan
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • This commit adds the state machine that takes the per-CPU idle data
    as input and produces a full-system-idle indication as output. This
    state machine is driven out of RCU's quiescent-state-forcing
    mechanism, which invokes rcu_sysidle_check_cpu() to collect per-CPU
    idle state and then rcu_sysidle_report() to drive the state machine.

    The full-system-idle state is sampled using rcu_sys_is_idle(), which
    also drives the state machine if RCU is idle (and does so by forcing
    RCU to become non-idle). This function returns true if all but the
    timekeeping CPU (tick_do_timer_cpu) are idle and have been idle long
    enough to avoid memory contention on the full_sysidle_state state
    variable. The rcu_sysidle_force_exit() function may be called externally
    to reset the state machine to the non-idle state.

    For large systems the state machine is driven out of RCU's
    force-quiescent-state logic, which provides good scalability at the price
    of millisecond-scale latencies on the transition to full-system-idle
    state. This is not so good for battery-powered systems, which are usually
    small enough that they don't need to care about scalability, but which
    do care deeply about energy efficiency. Small systems therefore drive
    the state machine directly out of the idle-entry code. The number of
    CPUs in a "small" system is defined by a new NO_HZ_FULL_SYSIDLE_SMALL
    Kconfig parameter, which defaults to 8. Note that this is a build-time
    definition.

    Signed-off-by: Paul E. McKenney
    Cc: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Lai Jiangshan
    [ paulmck: Use true and false for boolean constants per Lai Jiangshan. ]
    Reviewed-by: Josh Triplett
    [ paulmck: Simplify logic and provide better comments for memory barriers,
    based on review comments and questions by Lai Jiangshan. ]

    Paul E. McKenney
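The state machine described above can be modeled in plain userspace C. This is only a sketch: the real kernel code drives the transitions from the quiescent-state-forcing path and protects full_sysidle_state with atomics and memory barriers, all of which are omitted here, and the helper signatures below are illustrative rather than the kernel's.

```c
#include <assert.h>

/* Userspace model of the full-system-idle state machine (a sketch;
 * the kernel's version uses atomics and memory barriers). */
enum sysidle_state {
	RCU_SYSIDLE_NOT,	/* some non-timekeeping CPU is busy */
	RCU_SYSIDLE_SHORT,	/* all non-timekeeping CPUs just went idle */
	RCU_SYSIDLE_LONG,	/* ...and have stayed idle for a while */
	RCU_SYSIDLE_FULL,	/* ...long enough to declare full-system idle */
};

enum sysidle_state full_sysidle_state = RCU_SYSIDLE_NOT;

/* Advance one state per all-idle report; reset on any non-idle CPU. */
void rcu_sysidle_report(int all_nontimekeeping_cpus_idle)
{
	if (!all_nontimekeeping_cpus_idle)
		full_sysidle_state = RCU_SYSIDLE_NOT;
	else if (full_sysidle_state < RCU_SYSIDLE_FULL)
		full_sysidle_state++;
}

/* External reset back into the non-idle state. */
void rcu_sysidle_force_exit(void)
{
	full_sysidle_state = RCU_SYSIDLE_NOT;
}

/* Sample the state: true only once idle has persisted long enough. */
int rcu_sys_is_idle(void)
{
	return full_sysidle_state == RCU_SYSIDLE_FULL;
}
```

The point of the multi-step progression is exactly what the text describes: a single idle sample is not enough, so transient idles never reach the full-system-idle state.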
     

19 Aug, 2013

3 commits

  • This commit adds control variables and states for full-system idle.
    The system will progress through the states in numerical order when
    the system is fully idle (other than the timekeeping CPU), and reset
    down to the initial state if any non-timekeeping CPU goes non-idle.
    The current state is kept in full_sysidle_state.

    One flavor of RCU will be in charge of driving the state machine,
    defined by rcu_sysidle_state. This should be the busiest flavor of RCU.

    Signed-off-by: Paul E. McKenney
    Cc: Frederic Weisbecker
    Cc: Steven Rostedt
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • This commit adds the code that updates the rcu_dyntick structure's
    new fields to track the per-CPU idle state based on interrupts and
    transitions into and out of the idle loop (NMIs are ignored because NMI
    handlers cannot cleanly read out the time anyway). This code is similar
    to the code that maintains RCU's idea of per-CPU idleness, but differs
    in that RCU treats CPUs running in user mode as idle, whereas this new
    code does not.

    Signed-off-by: Paul E. McKenney
    Acked-by: Frederic Weisbecker
    Cc: Steven Rostedt
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • This commit adds fields to the rcu_dyntick structure that are used to
    detect idle CPUs. These new fields differ from the existing ones in
    that the existing ones consider a CPU executing in user mode to be idle,
    whereas the new ones consider CPUs executing in user mode to be busy.
    The handling of these new fields is otherwise quite similar to that for
    the existing fields. This commit also adds the initialization required
    for these fields.

    So, why is usermode execution treated differently, with RCU considering
    it a quiescent state equivalent to idle, while in contrast the new
    full-system idle state detection considers usermode execution to be
    non-idle?

    It turns out that although one of RCU's quiescent states is usermode
    execution, it is not a full-system idle state. This is because the
    purpose of the full-system idle state is not RCU, but rather determining
    when accurate timekeeping can safely be disabled. Whenever accurate
    timekeeping is required in a CONFIG_NO_HZ_FULL kernel, at least one
    CPU must keep the scheduling-clock tick going. If even one CPU is
    executing in user mode, accurate timekeeping is required, particularly for
    architectures where gettimeofday() and friends do not enter the kernel.
    Only when all CPUs are really and truly idle can accurate timekeeping be
    disabled, allowing all CPUs to turn off the scheduling clock interrupt,
    thus greatly improving energy efficiency.

    This naturally raises the question "Why is this code in RCU rather than in
    timekeeping?", and the answer is that RCU has the data and infrastructure
    to efficiently make this determination.

    Signed-off-by: Paul E. McKenney
    Acked-by: Frederic Weisbecker
    Cc: Steven Rostedt
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     

30 Jul, 2013

2 commits

  • Currently, RCU tracepoints save only a pointer to strings in the
    ring buffer. When displayed via the /sys/kernel/debug/tracing/trace file,
    they are referenced like printf's "%s", which looks at the address
    in the ring buffer and prints out the string it points to. This requires
    that the strings be constant and persistent in the kernel.

    This is a problem for tools like trace-cmd and perf, which read the
    binary data from the buffers but have no access to the kernel memory
    needed to find out what string is represented by the address in the buffer.

    By using the tracepoint_string infrastructure, the RCU tracepoint strings
    can be exported such that userspace tools can map the addresses to
    the strings.

    # cat /sys/kernel/debug/tracing/printk_formats
    0xffffffff81a4a0e8 : "rcu_preempt"
    0xffffffff81a4a0f4 : "rcu_bh"
    0xffffffff81a4a100 : "rcu_sched"
    0xffffffff818437a0 : "cpuqs"
    0xffffffff818437a6 : "rcu_sched"
    0xffffffff818437a0 : "cpuqs"
    0xffffffff818437b0 : "rcu_bh"
    0xffffffff818437b7 : "Start context switch"
    0xffffffff818437cc : "End context switch"
    0xffffffff818437a0 : "cpuqs"
    [...]

    Now userspace tools can display:

    rcu_utilization: Start context switch
    rcu_dyntick: Start 1 0
    rcu_utilization: End context switch
    rcu_batch_start: rcu_preempt CBs=0/5 bl=10
    rcu_dyntick: End 0 140000000000000
    rcu_invoke_callback: rcu_preempt rhp=0xffff880071c0d600 func=proc_i_callback
    rcu_invoke_callback: rcu_preempt rhp=0xffff880077b5b230 func=__d_free
    rcu_dyntick: Start 140000000000000 0
    rcu_invoke_callback: rcu_preempt rhp=0xffff880077563980 func=file_free_rcu
    rcu_batch_end: rcu_preempt CBs-invoked=3 idle=>c<>c<>c<>c<
    rcu_utilization: End RCU core
    rcu_grace_period: rcu_preempt 9741 start
    rcu_dyntick: Start 1 0
    rcu_dyntick: End 0 140000000000000
    rcu_dyntick: Start 140000000000000 0

    Instead of:

    rcu_utilization: ffffffff81843110
    rcu_future_grace_period: ffffffff81842f1d 9939 9939 9940 0 0 3 ffffffff81842f32
    rcu_batch_start: ffffffff81842f1d CBs=0/4 bl=10
    rcu_future_grace_period: ffffffff81842f1d 9939 9939 9940 0 0 3 ffffffff81842f3c
    rcu_grace_period: ffffffff81842f1d 9939 ffffffff81842f80
    rcu_invoke_callback: ffffffff81842f1d rhp=0xffff88007888aac0 func=file_free_rcu
    rcu_grace_period: ffffffff81842f1d 9939 ffffffff81842f95
    rcu_invoke_callback: ffffffff81842f1d rhp=0xffff88006aeb4600 func=proc_i_callback
    rcu_future_grace_period: ffffffff81842f1d 9939 9939 9940 0 0 3 ffffffff81842f32
    rcu_future_grace_period: ffffffff81842f1d 9939 9939 9940 0 0 3 ffffffff81842f3c
    rcu_invoke_callback: ffffffff81842f1d rhp=0xffff880071cb9fc0 func=__d_free
    rcu_grace_period: ffffffff81842f1d 9939 ffffffff81842f80
    rcu_invoke_callback: ffffffff81842f1d rhp=0xffff88007888ae80 func=file_free_rcu
    rcu_batch_end: ffffffff81842f1d CBs-invoked=4 idle=>c<>c<>c<>c<
    rcu_utilization: ffffffff8184311f

    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
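The address-to-string mapping that this export enables can be sketched in userspace C. The table below is an illustrative excerpt built from the printk_formats output shown above, not data from any particular kernel build, and the function name is hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative excerpt of a printk_formats-style table, as a
 * userspace tool such as trace-cmd might build it. */
struct fmt_entry {
	unsigned long addr;
	const char *str;
};

static const struct fmt_entry formats[] = {
	{ 0xffffffff81a4a0e8UL, "rcu_preempt" },
	{ 0xffffffff81a4a100UL, "rcu_sched" },
	{ 0xffffffff818437a0UL, "cpuqs" },
};

/* Return the string for addr, or NULL to fall back to raw output. */
const char *resolve_tracepoint_string(unsigned long addr)
{
	for (size_t i = 0; i < sizeof(formats) / sizeof(formats[0]); i++)
		if (formats[i].addr == addr)
			return formats[i].str;
	return NULL;
}
```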
     
  • The RCU_STATE_INITIALIZER() macro is used only in the rcutree.c file
    as well as the rcutree_plugin.h file. It is passed as an rvalue to
    a variable of a similar name, and a per_cpu variable is also created
    with a similar name.

    The uses of RCU_STATE_INITIALIZER() can be simplified to remove some
    of the duplicated code. Currently the three users of this
    macro have this format:

    struct rcu_state rcu_sched_state =
    RCU_STATE_INITIALIZER(rcu_sched, call_rcu_sched);
    DEFINE_PER_CPU(struct rcu_data, rcu_sched_data);

    Notice that "rcu_sched" is written three times. The same holds for
    the other two users. This can be condensed to just:

    RCU_STATE_INITIALIZER(rcu_sched, call_rcu_sched);

    by moving the rest into the macro itself.

    This also opens the door to allow the RCU tracepoint strings and
    their addresses to be exported so that userspace tracing tools can
    translate the contents of the pointers of the RCU tracepoints.
    The change will allow for helper code to be placed in the
    RCU_STATE_INITIALIZER() macro to export the name that is used.

    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
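The condensation described above can be modeled with a standalone macro. This is a simplified sketch: the real kernel macro initializes many more fields and uses DEFINE_PER_CPU() for the rcu_data variable, and the struct layouts below are placeholders:

```c
#include <assert.h>
#include <string.h>

/* Simplified model: the initializer macro defines both the rcu_state
 * variable and the companion data variable, so the flavor name is
 * written only once. */
struct rcu_state {
	const char *name;
	void (*call)(void);
};
struct rcu_data {
	int placeholder;
};

#define RCU_STATE_INITIALIZER(sname, cr)		\
	struct rcu_state sname##_state = {		\
		.name = #sname,				\
		.call = cr,				\
	};						\
	struct rcu_data sname##_data

void call_rcu_sched(void)
{
}

/* One line now produces both rcu_sched_state and rcu_sched_data. */
RCU_STATE_INITIALIZER(rcu_sched, call_rcu_sched);
```

Note how the stringizing operator (#sname) also captures the flavor name as a string, which is what later allows the tracepoint strings to be exported from the same macro.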
     

15 Jul, 2013

1 commit

  • The __cpuinit type of throwaway sections might have made sense
    some time ago when RAM was more constrained, but now the savings
    do not offset the cost and complications. For example, the fix in
    commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
    is a good example of the nasty type of bugs that can be created
    with improper use of the various __init prefixes.

    After a discussion on LKML[1] it was decided that cpuinit should go
    the way of devinit and be phased out. Once all the users are gone,
    we can then finally remove the macros themselves from linux/init.h.

    This removes all the drivers/rcu uses of the __cpuinit macros
    from all C files.

    [1] https://lkml.org/lkml/2013/5/20/589

    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Dipankar Sarma
    Reviewed-by: Josh Triplett
    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

11 Jun, 2013

5 commits


16 May, 2013

1 commit

  • When rcu_init() is called, we already have slab working; allocating
    bootmem at that point results in warnings and an allocation from
    slab anyway. This commit therefore changes alloc_bootmem_cpumask_var() to
    alloc_cpumask_var() in rcu_bootup_announce_oddness(), which is called
    from rcu_init().

    Signed-off-by: Sasha Levin
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett
    Tested-by: Robin Holt

    [paulmck: convert to zalloc_cpumask_var(), as suggested by Yinghai Lu.]

    Sasha Levin
     

15 May, 2013

1 commit

  • Commit c0f4dfd4f (rcu: Make RCU_FAST_NO_HZ take advantage of numbered
    callbacks) introduced a bug that can result in excessively long grace
    periods. This bug reverses the sense of the "if" statement checking
    for lazy callbacks, so that RCU takes a lazy approach when there are
    in fact non-lazy callbacks. This can result in excessive boot, suspend,
    and resume times.

    This commit therefore fixes the sense of this "if" statement.

    Reported-by: Borislav Petkov
    Reported-by: Bjørn Mork
    Reported-by: Joerg Roedel
    Signed-off-by: Paul E. McKenney
    Tested-by: Bjørn Mork
    Tested-by: Joerg Roedel

    Paul E. McKenney
     

06 May, 2013

1 commit

  • Pull 'full dynticks' support from Ingo Molnar:
    "This tree from Frederic Weisbecker adds a new, (exciting! :-) core
    kernel feature to the timer and scheduler subsystems: 'full dynticks',
    or CONFIG_NO_HZ_FULL=y.

    This feature extends the nohz variable-size timer tick feature from
    idle to busy CPUs (running at most one task) as well, potentially
    reducing the number of timer interrupts significantly.

    This feature got motivated by real-time folks and the -rt tree, but
    the general utility and motivation of full-dynticks runs wider than
    that:

    - HPC workloads get faster: CPUs running a single task should be able
    to utilize a maximum amount of CPU power. A periodic timer tick at
    HZ=1000 can cause a constant overhead of up to 1.0%. This feature
    removes that overhead - and speeds up the system by 0.5%-1.0% on
    typical distro configs even on modern systems.

    - Real-time workload latency reduction: CPUs running critical tasks
    should experience as little jitter as possible. The last remaining
    source of kernel-related jitter was the periodic timer tick.

    - A single task executing on a CPU is a pretty common situation,
    especially with an increasing number of cores/CPUs, so this feature
    helps desktop and mobile workloads as well.

    The cost of the feature is mainly related to increased timer
    reprogramming overhead when a CPU switches its tick period, and thus
    slightly longer to-idle and from-idle latency.

    Configuration-wise a third mode of operation is added to the existing
    two NOHZ kconfig modes:

    - CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named
    as a config option. This is the traditional Linux periodic tick
    design: there's a HZ tick going on all the time, regardless of
    whether a CPU is idle or not.

    - CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the
    periodic tick when a CPU enters idle mode.

    - CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the
    tick when a CPU is idle, also slows the tick down to 1 Hz (one
    timer interrupt per second) when only a single task is running on a
    CPU.

    The .config behavior is compatible: existing !CONFIG_NO_HZ and
    CONFIG_NO_HZ=y settings get translated to the new values, without the
    user having to configure anything. CONFIG_NO_HZ_FULL is turned off by
    default.

    This feature is based on a lot of infrastructure work that has been
    steadily going upstream in the last 2-3 cycles: related RCU support
    and non-periodic cputime support in particular is upstream already.

    This tree adds the final pieces and activates the feature. The pull
    request is marked RFC because:

    - it's marked 64-bit only at the moment - the 32-bit support patch is
    small but did not get ready in time.

    - it has a number of fresh commits that came in after the merge
    window. The overwhelming majority of commits are from before the
    merge window, but still some aspects of the tree are fresh and so I
    marked it RFC.

    - it's a pretty wide-reaching feature with lots of effects - and
    while the components have been in testing for some time, the full
    combination is still not very widely used. That it's default-off
    should reduce its regression abilities and obviously there are no
    known regressions with CONFIG_NO_HZ_FULL=y enabled either.

    - the feature is not completely idempotent: there is no 100%
    equivalent replacement for a periodic scheduler/timer tick. In
    particular there's ongoing work to map out and reduce its effects
    on scheduler load-balancing and statistics. This should not impact
    correctness though, there are no known regressions related to this
    feature at this point.

    - it's a pretty ambitious feature that with time will likely be
    enabled by most Linux distros, and we'd like you to make input on
    its design/implementation, if you dislike some aspect we missed.
    Without flaming us to crisp! :-)

    Future plans:

    - there's ongoing work to reduce 1Hz to 0Hz, to essentially shut off
    the periodic tick altogether when there's a single busy task on a
    CPU. We'd first like 1 Hz to be exposed more widely before we go
    for the 0 Hz target though.

    - once we reach 0 Hz we can remove the periodic tick assumption from
    nr_running>=2 as well, by essentially interrupting busy tasks only
    as frequently as the sched_latency constraints require us to do -
    once every 4-40 msecs, depending on nr_running.

    I am personally leaning towards biting the bullet and doing this in
    v3.10, like the -rt tree this effort has been going on for too long -
    but the final word is up to you as usual.

    More technical details can be found in Documentation/timers/NO_HZ.txt"

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
    sched: Keep at least 1 tick per second for active dynticks tasks
    rcu: Fix full dynticks' dependency on wide RCU nocb mode
    nohz: Protect smp_processor_id() in tick_nohz_task_switch()
    nohz_full: Add documentation.
    cputime_nsecs: use math64.h for nsec resolution conversion helpers
    nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
    nohz: Reduce overhead under high-freq idling patterns
    nohz: Remove full dynticks' superfluous dependency on RCU tree
    nohz: Fix unavailable tick_stop tracepoint in dynticks idle
    nohz: Add basic tracing
    nohz: Select wide RCU nocb for full dynticks
    nohz: Disable the tick when irq resume in full dynticks CPU
    nohz: Re-evaluate the tick for the new task after a context switch
    nohz: Prepare to stop the tick on irq exit
    nohz: Implement full dynticks kick
    nohz: Re-evaluate the tick from the scheduler IPI
    sched: New helper to prevent from stopping the tick in full dynticks
    sched: Kick full dynticks CPU that have more than one task enqueued.
    perf: New helper to prevent full dynticks CPUs from stopping tick
    perf: Kick full dynticks CPU if events rotation is needed
    ...

    Linus Torvalds
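The three tick modes described above are mutually exclusive Kconfig choices. An illustrative .config fragment selecting the traditional idle-dynticks behavior (the default translation of the old CONFIG_NO_HZ=y) would look like this:

```
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
```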
     

02 May, 2013

1 commit


19 Apr, 2013

1 commit

  • We need full-dynticks CPUs to also be RCU no-CBs CPUs so
    that we don't have to keep the tick to handle RCU
    callbacks.

    Make sure the range passed to the nohz_full= boot
    parameter is a subset of rcu_nocbs=.

    The CPUs that fail to meet this requirement will be
    excluded from the nohz_full range. This is checked
    early at boot time, before any CPU has the opportunity
    to stop its tick.

    Suggested-by: Steven Rostedt
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
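The boot-time subset check described above can be sketched with CPU masks modeled as plain bitmasks (the kernel uses struct cpumask and cpumask_and(); the function name here is illustrative):

```c
#include <assert.h>

/* Sketch: CPUs requested in nohz_full= but absent from rcu_nocbs=
 * are simply dropped from the nohz_full set. */
typedef unsigned long cpumask_t;

cpumask_t constrain_nohz_full(cpumask_t nohz_full, cpumask_t rcu_nocbs)
{
	return nohz_full & rcu_nocbs;	/* keep only no-CBs CPUs */
}
```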
     

16 Apr, 2013

1 commit

  • Adaptive-ticks CPUs inform RCU when they enter kernel mode, but they do
    not necessarily turn the scheduler-clock tick back on. This state of
    affairs could result in RCU waiting on an adaptive-ticks CPU running
    for an extended period in kernel mode. Such a CPU will never run the
    RCU state machine, and could therefore indefinitely extend the RCU
    grace period, sooner or later resulting in an OOM condition.

    This patch, inspired by an earlier patch by Frederic Weisbecker, therefore
    causes RCU's force-quiescent-state processing to check for this condition
    and to send an IPI to CPUs that remain in that state for too long.
    "Too long" currently means about three jiffies by default, which is
    quite some time for a CPU to remain in the kernel without blocking.
    The rcutree.jiffies_till_first_fqs and rcutree.jiffies_till_next_fqs
    module parameters may be used to tune "too long" if needed.

    Reported-by: Frederic Weisbecker
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett
    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Paul E. McKenney
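The "too long in kernel mode" decision above amounts to a simple jiffies comparison. The sketch below uses illustrative names, not the kernel's, and assumes a threshold of roughly three jiffies as the text states:

```c
#include <assert.h>

/* Sketch: once a CPU has been seen in kernel mode for more than
 * the threshold without a quiescent state, send it an IPI. */
int should_send_resched_ipi(unsigned long jiffies_now,
			    unsigned long entered_kernel,
			    unsigned long threshold)
{
	/* unsigned subtraction handles jiffies wraparound */
	return jiffies_now - entered_kernel > threshold;
}
```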
     

26 Mar, 2013

12 commits

  • doc.2013.03.12a: Documentation changes.

    fixes.2013.03.13a: Miscellaneous fixes.

    idlenocb.2013.03.26b: Remove restrictions on no-CBs CPUs, make
    RCU_FAST_NO_HZ take advantage of numbered callbacks, add
    callback acceleration based on numbered callbacks.

    Paul E. McKenney
     
  • CPUs going idle will need to record the need for a future grace
    period, but won't actually need to block waiting on it. This commit
    therefore splits rcu_start_future_gp(), which does the recording, from
    rcu_nocb_wait_gp(), which now invokes rcu_start_future_gp() to do the
    recording, after which rcu_nocb_wait_gp() does the waiting.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • CPUs going idle need to be able to indicate their need for future grace
    periods. A mechanism for doing this already exists for no-callbacks
    CPUs, so the idea is to re-use that mechanism. This commit therefore
    moves the ->n_nocb_gp_requests field of the rcu_node structure out from
    under the CONFIG_RCU_NOCB_CPU #ifdef and renames it to ->need_future_gp.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • If CPUs are to give prior notice of needed grace periods, it will be
    necessary to invoke rcu_start_gp() without dropping the root rcu_node
    structure's ->lock. This commit takes a second step in this direction
    by moving the release of this lock to rcu_start_gp()'s callers.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Dyntick-idle CPUs need to be able to pre-announce their need for grace
    periods. This can be done using something similar to the mechanism used
    by no-CB CPUs to announce their need for grace periods. This commit
    moves in this direction by renaming the no-CBs grace-period event tracing
    to suit the new future-grace-period needs.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Because RCU callbacks are now associated with the number of the grace
    period that they must wait for, CPUs can now take advance callbacks
    corresponding to grace periods that ended while a given CPU was in
    dyntick-idle mode. This eliminates the need to try forcing the RCU
    state machine while entering idle, thus reducing the CPU intensiveness
    of RCU_FAST_NO_HZ, which should increase its energy efficiency.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • RCU_FAST_NO_HZ operation is controlled by four compile-time C-preprocessor
    macros, but some use cases benefit greatly from runtime adjustment,
    particularly when tuning devices. This commit therefore creates the
    corresponding sysfs entries.

    Reported-by: Robin Randhawa
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, the per-no-CBs-CPU kthreads are named "rcuo" followed by
    the CPU number, for example, "rcuo0". This is problematic given that
    there are either two or three RCU flavors, each of which gets a per-CPU
    kthread with exactly the same name. This commit therefore introduces
    a one-letter abbreviation for each RCU flavor, namely 'b' for RCU-bh,
    'p' for RCU-preempt, and 's' for RCU-sched. This abbreviation is used
    to distinguish the "rcuo" kthreads, for example, for CPU 0 we would have
    "rcuob/0", "rcuop/0", and "rcuos/0".

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Tested-by: Dietmar Eggemann

    Paul E. McKenney
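The naming scheme above is just "rcuo", a one-letter flavor abbreviation, a slash, and the CPU number. A minimal sketch (the helper name is hypothetical; the kernel formats the name directly at kthread creation):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Compose a no-CBs kthread name: 'b' for RCU-bh, 'p' for
 * RCU-preempt, 's' for RCU-sched. */
void nocb_kthread_name(char *buf, size_t len, char flavor, int cpu)
{
	snprintf(buf, len, "rcuo%c/%d", flavor, cpu);
}
```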
     
  • Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, the no-CBs kthreads do repeated timed waits for grace periods
    to elapse. This is crude and energy inefficient, so this commit allows
    no-CBs kthreads to specify exactly which grace period they are waiting
    for and also allows them to block for the entire duration until the
    desired grace period completes.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, the only way to specify no-CBs CPUs is via the rcu_nocbs
    kernel command-line parameter. This is inconvenient in some cases,
    particularly for randconfig testing, so this commit adds a new set of
    kernel configuration parameters. CONFIG_RCU_NOCB_CPU_NONE (the default)
    retains the old behavior, CONFIG_RCU_NOCB_CPU_ZERO offloads callback
    processing from CPU 0 (along with any other CPUs specified by the
    rcu_nocbs boot-time parameter), and CONFIG_RCU_NOCB_CPU_ALL offloads
    callback processing from all CPUs.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

14 Mar, 2013

1 commit

  • If RCU's softirq handler is prevented from executing, an RCU CPU stall
    warning can result. Ways to prevent RCU's softirq handler from executing
    include: (1) CPU spinning with interrupts disabled, (2) infinite loop
    in some softirq handler, and (3) in -rt kernels, an infinite loop in a
    set of real-time threads running at priorities higher than that of RCU's
    softirq handler.

    Because this situation can be difficult to track down, this commit causes
    the count of RCU softirq handler invocations to be printed with RCU
    CPU stall warnings. This information does require some interpretation,
    as now documented in Documentation/RCU/stallwarn.txt.

    Reported-by: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Tested-by: Paul Gortmaker

    Paul E. McKenney
     

13 Mar, 2013

1 commit

  • Currently, CPU 0 is constrained to not be a no-CBs CPU, and furthermore
    at least one no-CBs CPU must remain online at any given time. These
    restrictions are problematic in some situations, such as cases where
    all CPUs must run a real-time workload that needs to be insulated from
    OS jitter and latencies due to RCU callback invocation. This commit
    therefore provides no-CBs CPUs a (very crude and energy-inefficient)
    way to start and to wait for grace periods independently of the normal
    RCU callback mechanisms. This approach allows any or all of the CPUs to
    be designated as no-CBs CPUs, and allows any proper subset of the CPUs
    (whether no-CBs CPUs or not) to be offlined.

    This commit also provides a fix for a locking bug spotted by Xie
    Changlong.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

09 Jan, 2013

2 commits

  • The as-documented rcu_nocb_poll will fail to enable this feature
    for two reasons: (1) there is an extra "s" in the documented
    name which is not in the code, and (2) since it uses module_param(),
    it really expects a prefix, akin to "rcutree.fanout_leaf",
    and that prefix isn't documented.

    However, there are several reasons why we might not want to
    simply fix the typo and add the prefix:

    1) we'd end up with rcutree.rcu_nocb_poll, and would then quite
    probably want to change it to rcutree.nocb_poll

    2) if we did #1, then the prefix wouldn't be consistent with the
    rcu_nocbs= parameter (i.e. one with, one without prefix)

    3) the use of module_param in a header file is less than desired,
    since it isn't immediately obvious that it will get processed
    via rcutree.c and get the prefix from that (although use of
    module_param_named() could clarify that.)

    4) the implied export of /sys/module/rcutree/parameters/rcu_nocb_poll
    data to userspace via module_param() doesn't really buy us anything,
    as it is read-only and we can tell if it is enabled already without
    it, since there is a printk at early boot telling us so.

    In light of all that, just change it from a module_param() to an
    early_setup() call, and worry about adding it to /sys later on if
    we decide to allow a dynamic setting of it.

    Also change the variable to be tagged as read_mostly, since it
    will be set at most once, at boot.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: Paul E. McKenney

    Paul Gortmaker
     
  • The wait_event() at the head of the rcu_nocb_kthread() can result in
    soft-lockup complaints if the CPU in question does not register RCU
    callbacks for an extended period. This commit therefore changes
    the wait_event() to a wait_event_interruptible().

    Reported-by: Frederic Weisbecker
    Signed-off-by: Paul Gortmaker
    Signed-off-by: Paul E. McKenney

    Paul Gortmaker
     

17 Nov, 2012

3 commits

  • Currently, callback invocations from callback-free CPUs are accounted to
    the CPU that registered the callback, but using the same field that is
    used for normal callbacks. This makes it impossible to determine from
    debugfs output whether callbacks are in fact being diverted. This commit
    therefore adds a separate ->n_nocbs_invoked field in the rcu_data structure
    in which diverted callback invocations are counted. RCU's debugfs tracing
    still displays normal callback invocations using ci=, but displays
    diverted callback invocations using nci=.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • RCU callback execution can add significant OS jitter and also can
    degrade both scheduling latency and, in asymmetric multiprocessors,
    energy efficiency. This commit therefore adds the ability for selected
    CPUs ("rcu_nocbs=" boot parameter) to have their callbacks offloaded
    to kthreads. If the "rcu_nocb_poll" boot parameter is also specified,
    these kthreads will do polling, removing the need for the offloaded
    CPUs to do wakeups. At least one CPU must be doing normal callback
    processing: currently CPU 0 cannot be selected as a no-CBs CPU.
    In addition, attempts to offline the last normal-CBs CPU will fail.

    This feature was inspired by Jim Houston's and Joe Korty's JRCU, and
    this commit includes fixes to problems located by Fengguang Wu's
    kbuild test robot.

    [ paulmck: Added gfp.h include file as suggested by Fengguang Wu. ]

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • …cu.2012.10.27a', 'stall.2012.11.13a', 'tracing.2012.11.08a' and 'idle.2012.10.24a' into HEAD

    urgent.2012.10.27a: Fix for RCU user-mode transition (already in -tip).

    doc.2012.11.08a: Documentation updates, most notably codifying the
    memory-barrier guarantees inherent to grace periods.

    fixes.2012.11.13a: Miscellaneous fixes.

    srcu.2012.10.27a: Allow statically allocated and initialized srcu_struct
    structures (courtesy of Lai Jiangshan).

    stall.2012.11.13a: Add more diagnostic information to RCU CPU stall
    warnings, and also decrease the stall timeout from 60 seconds to 21 seconds.

    hotplug.2012.11.08a: Minor updates to CPU hotplug handling.

    tracing.2012.11.08a: Improved debugfs tracing, courtesy of Michael Wang.

    idle.2012.10.24a: Updates to RCU idle/adaptive-idle handling, including
    a boot parameter that maps normal grace periods to expedited.

    Resolved conflict in kernel/rcutree.c due to side-by-side change.

    Paul E. McKenney
     

14 Nov, 2012

1 commit

  • This commit explicitly states the memory-ordering properties of the
    RCU grace-period primitives. Although these properties were in some
    sense implied by the fundamental property of RCU ("a grace period must
    wait for all pre-existing RCU read-side critical sections to complete"),
    stating them explicitly will be a great labor-saving device.

    Reported-by: Oleg Nesterov
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Oleg Nesterov

    Paul E. McKenney
     

09 Nov, 2012

1 commit

  • The ->onofflock field in the rcu_state structure at one time synchronized
    CPU-hotplug operations for RCU. However, its scope has decreased over time
    so that it now only protects the lists of orphaned RCU callbacks. This
    commit therefore renames it to ->orphan_lock to reflect its current use.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney