26 Jan, 2019

6 commits

  • It turns out that it is queue_delayed_work_on() rather than
    queue_work_on() that has difficulties when used concurrently with
    CPU-hotplug removal operations. It is therefore unnecessary to protect
    CPU identification and queue_work_on() with preempt_disable().

    This commit therefore removes the preempt_disable() and preempt_enable()
    from sync_rcu_exp_select_cpus(), which has the further benefit of reducing
    the number of changes that must be maintained in the -rt patchset.

    Reported-by: Thomas Gleixner
    Reported-by: Sebastian Siewior
    Suggested-by: Boqun Feng
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Now that _synchronize_rcu_expedited() has only one caller, and given that
    this is a tail call, this commit inlines _synchronize_rcu_expedited()
    into synchronize_rcu_expedited().

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Now that rcu_blocking_is_gp() makes the correct immediate-return
    decision for both PREEMPT and !PREEMPT, a single implementation of
    synchronize_rcu() will work correctly under both configurations.
    This commit therefore eliminates a few lines of code by consolidating
    the two implementations of synchronize_rcu().

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
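
    In outline, the consolidated function can then take the same shape under
    both configurations. The following is only a sketch of that shape, not
    the exact kernel code (debug assertions are omitted here):

    void synchronize_rcu(void)
    {
            if (rcu_blocking_is_gp())
                    return;         /* Caller's context already implies a grace period. */
            if (rcu_gp_is_expedited())
                    synchronize_rcu_expedited();
            else
                    wait_rcu_gp(call_rcu);
    }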
     
  • The CONFIG_PREEMPT=n and CONFIG_PREEMPT=y implementations of
    synchronize_rcu_expedited() are quite similar, and with small
    modifications to rcu_blocking_is_gp() can be made identical. This commit
    therefore makes this change in order to save a few lines of code and to
    reduce the amount of duplicate code.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Back when there could be multiple RCU flavors running in the same kernel
    at the same time, it was necessary to specify the expedited grace-period
    IPI handler at runtime. Now that there is only one RCU flavor, the
    IPI handler can be determined at build time. There is therefore no
    longer any reason for the RCU-preempt and RCU-sched IPI handlers to
    have different names, nor is there any reason to pass these handlers in
    function arguments and in the data structures enclosing workqueues.

    This commit therefore makes all these changes, pushing the specification
    of the expedited grace-period IPI handler down to the point of use.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • During expedited RCU grace-period initialization, IPIs are sent to
    all non-idle online CPUs. The IPI handler checks to see if the CPU is
    in a quiescent state, reporting one if so. This handler looks at three
    different cases: (1) The CPU is not in an rcu_read_lock()-based critical
    section, (2) The CPU is in the process of exiting an rcu_read_lock()-based
    critical section, and (3) The CPU is in an rcu_read_lock()-based critical
    section. In case (2), execution falls through into case (3).

    This is harmless from a functionality viewpoint, but can result in
    needless overhead during an improbable corner case. This commit therefore
    adds the "return" statement needed to prevent fall-through.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
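
    The three-way decision can be sketched as follows; the helper names are
    illustrative assumptions rather than the actual handler code:

    static void rcu_exp_handler_sketch(void *unused)
    {
            if (!cpu_in_rcu_reader()) {             /* Case (1): not in a reader. */
                    report_exp_quiescent_state();
                    return;
            }
            if (cpu_exiting_rcu_reader()) {         /* Case (2): reader on its way out. */
                    defer_report_to_rcu_read_unlock();
                    return;                         /* The "return" added by this commit. */
            }
            defer_report_to_rcu_read_unlock();      /* Case (3): still within a reader. */
    }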
     

13 Nov, 2018

1 commit

  • In PREEMPT kernels, an expedited grace period might send an IPI to a
    CPU that is executing an RCU read-side critical section. In that case,
    it would be nice if the rcu_read_unlock() directly interacted with the
    RCU core code to immediately report the quiescent state. And this does
    happen in the case where the reader has been preempted. But it would
    also be a nice performance optimization if immediate reporting also
    happened in the preemption-free case.

    This commit therefore adds an ->exp_hint field to the task_struct structure's
    ->rcu_read_unlock_special field. The IPI handler sets this hint when
    it has interrupted an RCU read-side critical section, and this causes
    the outermost rcu_read_unlock() call to invoke rcu_read_unlock_special(),
    which, if preemption is enabled, reports the quiescent state immediately.
    If preemption is disabled, then the report is required to be deferred
    until preemption (or bottom halves or interrupts or whatever) is re-enabled.

    Because this is a hint, it does nothing for more complicated cases. For
    example, if the IPI interrupts an RCU reader, but interrupts are disabled
    across the rcu_read_unlock(), but another rcu_read_lock() is executed
    before interrupts are re-enabled, the hint will already have been cleared.
    If you do crazy things like this, reporting will be deferred until some
    later RCU_SOFTIRQ handler, context switch, cond_resched(), or similar.

    Reported-by: Joel Fernandes
    Signed-off-by: Paul E. McKenney
    Acked-by: Joel Fernandes (Google)

    Paul E. McKenney
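
    The hint mechanism amounts to the following sketch; the exact layout of
    ->rcu_read_unlock_special shown here is an assumption, not the verified
    kernel definition:

    /* IPI handler, having interrupted an RCU read-side critical section: */
    WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);

    /* Outermost rcu_read_unlock(): any nonzero ->rcu_read_unlock_special,
     * including the hint, diverts into the slow path, which reports the
     * quiescent state immediately if preemption is enabled and defers it
     * otherwise. */
    if (READ_ONCE(t->rcu_read_unlock_special.s))
            rcu_read_unlock_special(t);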
     

12 Nov, 2018

1 commit

  • The CPU-selection code in sync_rcu_exp_select_cpus() disables preemption
    to prevent the cpu_online_mask from changing. However, this relies on
    the stop-machine mechanism in the CPU-hotplug offline code, which is not
    desirable (it would be good to someday remove the stop-machine mechanism).

    This commit therefore instead uses the relevant leaf rcu_node structure's
    ->ffmask, which has a bit set for all CPUs that are fully functional.
    A given CPU's bit is cleared very early during offline processing by
    rcutree_offline_cpu() and set very late during online processing by
    rcutree_online_cpu(). Therefore, if a CPU's bit is set in this mask, and
    preemption is disabled, we have to be before the synchronize_sched() in
    the CPU-hotplug offline code, which means that the CPU is guaranteed to be
    workqueue-ready throughout the duration of the enclosing preempt_disable()
    region of code.

    This also has the side-effect of using WORK_CPU_UNBOUND if all the CPUs for
    this leaf rcu_node structure are offline, which is an acceptable difference
    in behavior.

    Reported-by: Sebastian Andrzej Siewior
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
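
    The resulting CPU selection can be sketched as follows, assuming the
    ->ffmask, ->grplo, and ->grphi fields described above (the per-leaf work
    item shown as rnp->rew.rew_work is also an assumption):

    int cpu;

    preempt_disable();
    for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++)
            if (rnp->ffmask & (1UL << (cpu - rnp->grplo)))
                    break;                          /* First workqueue-ready CPU in this leaf. */
    if (cpu > rnp->grphi)
            cpu = WORK_CPU_UNBOUND;                 /* No ready CPUs: let the workqueue choose. */
    queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
    preempt_enable();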
     

31 Aug, 2018

13 commits

  • This commit moves ->dynticks from the rcu_dynticks structure to the
    rcu_data structure, replacing the field of the same name. It also updates
    the code to access ->dynticks from the rcu_data structure and to use the
    rcu_data structure directly, rather than following a pointer to the
    now-gone rcu_dynticks structure and its now-gone ->dynticks field. While
    in the area, this commit also fixes up comments.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit removes ->rcu_need_heavy_qs and ->rcu_urgent_qs from the
    rcu_dynticks structure and updates the code to access them from the
    rcu_data structure.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The resched_cpu() interface is quite handy, but it does acquire the
    specified CPU's runqueue lock, which does not come for free. This
    commit therefore substitutes the following when directing resched_cpu()
    at the current CPU:

    set_tsk_need_resched(current);
    set_preempt_need_resched();

    Signed-off-by: Paul E. McKenney
    Cc: Peter Zijlstra

    Paul E. McKenney
     
  • Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • There now is only one rcu_state structure in a given build of the Linux
    kernel, so there is no need to pass it as a parameter to RCU's rcu_node
    tree's accessor macros. This commit therefore removes the rsp parameter
    from those macros in kernel/rcu/rcu.h, and removes some now-unused rsp
    local variables while in the area.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • There now is only one rcu_state structure in a given build of the
    Linux kernel, so there is no need to pass it as a parameter to
    RCU's functions. This commit therefore removes the rsp parameter
    from the code in kernel/rcu/tree_exp.h, and removes all of the
    rsp local variables while in the area.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • There now is only one rcu_state structure in a given build of the
    Linux kernel, so there is no need to pass it as a parameter to RCU's
    functions. This commit therefore removes the rsp parameter from
    rcu_get_root().

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The rcu_state_p pointer references the default rcu_state structure,
    that is, the one that call_rcu() uses, as opposed to call_rcu_bh()
    and sometimes call_rcu_sched(). But there is now only one rcu_state
    structure, so that one structure is by definition the default, which
    means that the rcu_state_p pointer no longer serves any useful purpose.
    This commit therefore removes it.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The rcu_state structure's ->rda field was used to find the per-CPU
    rcu_data structures corresponding to that rcu_state structure. But now
    there is only one rcu_state structure (creatively named "rcu_state")
    and one set of per-CPU rcu_data structures (creatively named "rcu_data").
    Therefore, uses of the ->rda field can always be replaced by "rcu_data",
    and this commit makes that change and removes the ->rda field.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The rcu_state structure's ->call field references the corresponding RCU
    flavor's call_rcu() function. However, now that there is only ever one
    rcu_state structure in a given build of the Linux kernel, and that flavor
    uses plain old call_rcu(), there is not a lot of point in continuing to
    have the ->call field. This commit therefore removes it.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Now that RCU-preempt knows about preemption disabling, its implementation
    of synchronize_rcu() works for synchronize_sched(), and likewise for the
    other RCU-sched update-side API members. This commit therefore confines
    the RCU-sched update-side code to CONFIG_PREEMPT=n builds, and defines
    RCU-sched's update-side API members in terms of those of RCU-preempt.

    This means that any given build of the Linux kernel has only one
    update-side flavor of RCU, namely RCU-preempt for CONFIG_PREEMPT=y builds
    and RCU-sched for CONFIG_PREEMPT=n builds. This in turn means that kernels
    built with CONFIG_RCU_NOCB_CPU=y have only one rcuo kthread per CPU.

    Signed-off-by: Paul E. McKenney
    Cc: Andi Kleen

    Paul E. McKenney
     
  • The rcu_report_exp_rdp() function is always invoked with its "wake"
    argument set to "true", so this commit drops this parameter. The only
    potential call site that would use "false" is in the code driving the
    expedited grace period, and that code uses rcu_report_exp_cpu_mult()
    instead, which therefore retains its "wake" parameter.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit defers reporting of RCU-preempt quiescent states at
    rcu_read_unlock_special() time when any of interrupts, softirq, or
    preemption are disabled. These deferred quiescent states are reported
    at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
    offline operation. Of course, if another RCU read-side critical
    section has started in the meantime, the reporting of the quiescent
    state will be further deferred.

    This also means that disabling preemption, interrupts, and/or
    softirqs will act as an RCU-preempt read-side critical section.
    This is enforced by checking preempt_count() as needed.

    Some special cases must be handled on an ad-hoc basis, for example,
    context switch is a quiescent state even though both the scheduler and
    do_exit() disable preemption. In these cases, additional calls to
    rcu_preempt_deferred_qs() override the preemption disabling. Similar
    logic overrides disabled interrupts in rcu_preempt_check_callbacks()
    because in this case the quiescent state happened just before the
    corresponding scheduling-clock interrupt.

    In theory, this change lifts a long-standing restriction that required
    that if interrupts were disabled across a call to rcu_read_unlock()
    that the matching rcu_read_lock() also be contained within that
    interrupts-disabled region of code. Because the reporting of the
    corresponding RCU-preempt quiescent state is now deferred until
    after interrupts have been enabled, it is no longer possible for this
    situation to result in deadlocks involving the scheduler's runqueue and
    priority-inheritance locks. This may allow some code simplification that
    might reduce interrupt latency a bit. Unfortunately, in practice this
    would also defer deboosting a low-priority task that had been subjected
    to RCU priority boosting, so real-time-response considerations might
    well force this restriction to remain in place.

    Because RCU-preempt grace periods are now blocked not only by RCU
    read-side critical sections, but also by disabling of interrupts,
    preemption, and softirqs, it will be possible to eliminate RCU-bh and
    RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
    require some additional plumbing to provide the network denial-of-service
    guarantees that have been traditionally provided by RCU-bh. Once these
    are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
    into RCU-sched. This would mean that all kernels would have but
    one flavor of RCU, which would open the door to significant code
    cleanup.

    Moving to a single flavor of RCU would also have the beneficial effect
    of reducing the NOCB kthreads by at least a factor of two.

    Signed-off-by: Paul E. McKenney
    [ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
    from Joel Fernandes. ]
    [ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
    response to bug reports from kbuild test robot. ]
    [ paulmck: Fix bug located by kbuild test robot involving recursion
    via rcu_preempt_deferred_qs(). ]

    Paul E. McKenney
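
    The defer-or-report decision can be sketched as follows; the two action
    helpers are hypothetical stand-ins, with only the preempt_count() and
    irqs_disabled() tests following from the description above:

    bool cannot_report_now = irqs_disabled() ||
                             (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));

    if (cannot_report_now)
            mark_deferred_quiescent_state(t);       /* Reported later from RCU_SOFTIRQ, etc. */
    else
            report_quiescent_state_now(t);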
     

14 Aug, 2018

1 commit

  • Pull scheduler updates from Thomas Gleixner:

    - Cleanup and improvement of NUMA balancing

    - Refactoring and improvements to the PELT (Per Entity Load Tracking)
    code

    - Watchdog simplification and related cleanups

    - The usual pile of small incremental fixes and improvements

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
    watchdog: Reduce message verbosity
    stop_machine: Reflow cpu_stop_queue_two_works()
    sched/numa: Move task_numa_placement() closer to numa_migrate_preferred()
    sched/numa: Use group_weights to identify if migration degrades locality
    sched/numa: Update the scan period without holding the numa_group lock
    sched/numa: Remove numa_has_capacity()
    sched/numa: Modify migrate_swap() to accept additional parameters
    sched/numa: Remove unused task_capacity from 'struct numa_stats'
    sched/numa: Skip nodes that are at 'hoplimit'
    sched/debug: Reverse the order of printing faults
    sched/numa: Use task faults only if numa_group is not yet set up
    sched/numa: Set preferred_node based on best_cpu
    sched/numa: Simplify load_too_imbalanced()
    sched/numa: Evaluate move once per node
    sched/numa: Remove redundant field
    sched/debug: Show the sum wait time of a task group
    sched/fair: Remove #ifdefs from scale_rt_capacity()
    sched/core: Remove get_cpu() from sched_fork()
    sched/cpufreq: Clarify sugov_get_util()
    sched/sysctl: Remove unused sched_time_avg_ms sysctl
    ...

    Linus Torvalds
     

13 Jul, 2018

1 commit

  • Currently, the parallelized initialization of expedited grace periods uses
    the workqueue associated with each rcu_node structure's ->grplo field.
    This works fine unless that CPU is offline. This commit therefore uses
    the lowest-numbered online CPU corresponding to that rcu_node structure,
    or just queues the work on WORK_CPU_UNBOUND if there are no online CPUs
    corresponding to this rcu_node structure.

    Note that this patch uses cpu_is_offline() instead of the usual approach
    of checking bits in the rcu_node structure's ->qsmaskinitnext field. This
    is safe because preemption is disabled across both the cpu_is_offline()
    check and the call to queue_work_on().

    Signed-off-by: Boqun Feng
    [ paulmck: Disable preemption to close offline race window. ]
    Signed-off-by: Paul E. McKenney
    [ paulmck: Apply Peter Zijlstra feedback on CPU selection. ]
    Tested-by: Aneesh Kumar K.V

    Boqun Feng
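
    The selection logic reads roughly as follows; queue_work_on(),
    cpu_is_offline(), and WORK_CPU_UNBOUND are the real interfaces, while the
    per-leaf fields shown are assumptions:

    int cpu;

    preempt_disable();                              /* Close the race with CPU-hotplug removal. */
    for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++)
            if (!cpu_is_offline(cpu))
                    break;                          /* Lowest-numbered online CPU for this leaf. */
    if (cpu > rnp->grphi)
            cpu = WORK_CPU_UNBOUND;                 /* All offline: let the workqueue decide. */
    queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
    preempt_enable();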
     

26 Jun, 2018

1 commit

  • During expedited grace-period initialization, a work item is scheduled
    for each leaf rcu_node structure. However, that initialization code
    is itself (normally) executing from a workqueue, so one of the leaf
    rcu_node structures could just as well be handled by that pre-existing
    workqueue, and with less overhead. This commit therefore uses a
    shiny new rcu_is_leaf_node() macro to execute the last leaf rcu_node
    structure's initialization directly from the pre-existing workqueue.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
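
    In outline, the dispatch loop then looks something like the sketch below,
    where the "last leaf" test and the field names are assumptions standing in
    for the actual code:

    rcu_for_each_leaf_node(rsp, rnp) {
            if (this_is_the_last_leaf(rnp)) {
                    /* Last leaf: run the initialization right here, on the
                     * workqueue that is already executing this function. */
                    sync_rcu_exp_select_node_cpus(&rnp->rew.rew_work);
                    continue;
            }
            INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
            queue_work_on(rnp->grplo, rcu_par_gp_wq, &rnp->rew.rew_work);
    }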
     

20 Jun, 2018

1 commit

  • Since swait basically implemented exclusive waits only, make sure
    the API reflects that.

    $ git grep -l -e "\"
    -e "\" | while read file;
    do
    sed -i -e 's/\/&_one/g'
    -e 's/\/&_exclusive/g' $file;
    done

    With a few manual touch-ups.

    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Thomas Gleixner
    Acked-by: Linus Torvalds
    Cc: bigeasy@linutronix.de
    Cc: oleg@redhat.com
    Cc: paulmck@linux.vnet.ibm.com
    Cc: pbonzini@redhat.com
    Link: https://lkml.kernel.org/r/20180612083909.261946548@infradead.org

    Peter Zijlstra
     

16 May, 2018

5 commits

  • …orture.2018.05.15a' into HEAD

    exp.2018.05.15a: Parallelize expedited grace-period initialization.
    fixes.2018.05.15a: Miscellaneous fixes.
    lock.2018.05.15a: Decrease lock contention on root rcu_node structure,
    which is a step towards merging RCU flavors.
    torture.2018.05.15a: Torture-test updates.

    Paul E. McKenney
     
  • Commit ae91aa0adb14 ("rcu: Remove debugfs tracing") removed the
    RCU debugfs tracing code, but did not remove the no-longer used
    ->exp_workdone{0,1,2,3} fields in the srcu_data structure. This commit
    therefore removes these fields along with the code that uselessly
    updates them.

    Signed-off-by: Byungchul Park
    Signed-off-by: Paul E. McKenney
    Tested-by: Nicholas Piggin

    Byungchul Park
     
  • Currently, sync_rcu_preempt_exp_done() is called from some callsites
    without the corresponding rcu_node structure's ->lock held, which can
    introduce bugs, as in this scenario from Paul:

    o CPU 0 in sync_rcu_preempt_exp_done() reads ->exp_tasks and
    sees that it is NULL.

    o CPU 1 blocks within an RCU read-side critical section, so
    it enqueues the task and points ->exp_tasks at it and
    clears CPU 1's bit in ->expmask.

    o All other CPUs clear their bits in ->expmask.

    o CPU 0 reads ->expmask, sees that it is zero, so incorrectly
    concludes that all quiescent states have completed, despite
    the fact that ->exp_tasks is non-NULL.

    To fix this, sync_rcu_preempt_exp_done_unlocked() is introduced to replace
    the lockless callsites of sync_rcu_preempt_exp_done().

    Further, a lockdep annotation is added into sync_rcu_preempt_exp_done()
    to prevent mis-use in the future.

    Signed-off-by: Boqun Feng
    Signed-off-by: Paul E. McKenney
    Tested-by: Nicholas Piggin

    Boqun Feng
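
    The shape of the fix, as a sketch (the rcu_node locking wrappers shown here
    are the ones RCU uses elsewhere; treat the exact names as assumptions):

    static bool sync_rcu_preempt_exp_done_unlocked(struct rcu_node *rnp)
    {
            unsigned long flags;
            bool ret;

            raw_spin_lock_irqsave_rcu_node(rnp, flags);     /* Consistent view of ->exp_tasks/->expmask. */
            ret = sync_rcu_preempt_exp_done(rnp);
            raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
            return ret;
    }

    /* And within sync_rcu_preempt_exp_done() itself: */
    raw_lockdep_assert_held_rcu_node(rnp);                  /* Catch future lockless callers. */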
     
  • Since commit d9a3da0699b2 ("rcu: Add expedited grace-period support
    for preemptible RCU"), there are comments for some functions in
    rcu_report_exp_rnp()'s call chain saying that exp_mutex or its
    predecessors need to be held.

    However, exp_mutex and its predecessors were used only to synchronize
    between GPs, and it is clear that all variables visited by those functions
    are under the protection of the rcu_node structure's ->lock. Moreover,
    those functions are currently called without exp_mutex held, which does
    not seem to introduce any trouble.

    So this patch fixes this problem by updating the comments to match the
    current code.

    Signed-off-by: Boqun Feng
    Fixes: d9a3da0699b2 ("rcu: Add expedited grace-period support for preemptible RCU")
    Signed-off-by: Paul E. McKenney
    Tested-by: Nicholas Piggin

    Boqun Feng
     
  • The latency of RCU expedited grace periods grows with increasing numbers
    of CPUs, eventually failing to be all that expedited. Much of the growth
    in latency is in the initialization phase, so this commit uses workqueues
    to carry out this initialization concurrently on a rcu_node-by-rcu_node
    basis.

    This change makes use of a new rcu_par_gp_wq because flushing a work
    item from another work item running from the same workqueue can result
    in deadlock.

    Signed-off-by: Paul E. McKenney
    Tested-by: Nicholas Piggin

    Paul E. McKenney
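
    The arrangement can be sketched as follows; alloc_workqueue(),
    queue_work_on(), and flush_work() are the real interfaces, and the
    per-leaf field names are assumptions:

    /* Per-leaf initialization gets its own workqueue... */
    rcu_par_gp_wq = alloc_workqueue("rcu_par_gp", WQ_MEM_RECLAIM, 0);

    /* ...so the expedited-GP machinery, which itself runs from a different
     * workqueue, can queue and then flush each leaf's work item without
     * deadlocking on its own queue: */
    queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
    flush_work(&rnp->rew.rew_work);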
     

24 Feb, 2018

1 commit

  • RCU's expedited grace periods can participate in out-of-memory deadlocks
    due to all available system_wq kthreads being blocked and there not being
    memory available to create more. This commit prevents such deadlocks
    by allocating an RCU-specific workqueue_struct at early boot time, and
    providing it with a rescuer to ensure forward progress. This uses the
    shiny new init_rescuer() function provided by Tejun (but indirectly).

    This commit also causes SRCU to use this new RCU-specific
    workqueue_struct. Note that SRCU's use of workqueues never blocks them
    waiting for readers, so this should be safe from a forward-progress
    viewpoint. Note that this moves SRCU from system_power_efficient_wq
    to a normal workqueue. In the unlikely event that this results in
    measurable degradation, a separate power-efficient workqueue will be
    created for SRCU.

    Reported-by: Prateek Sood
    Reported-by: Tejun Heo
    Signed-off-by: Paul E. McKenney
    Acked-by: Tejun Heo

    Paul E. McKenney
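
    The key ingredient is the WQ_MEM_RECLAIM flag, which causes the workqueue
    code to attach a rescuer thread to the queue. A sketch of the early-boot
    creation and of redirecting work onto it (the rcu_gp_wq name and the work
    item in the second line are assumptions):

    rcu_gp_wq = alloc_workqueue("rcu_gp", WQ_MEM_RECLAIM, 0);   /* Rescuer guarantees progress. */
    queue_work(rcu_gp_wq, &rew.rew_work);                       /* Instead of system_wq. */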
     

21 Feb, 2018

3 commits

  • This commit reworks the first loop in sync_rcu_exp_select_cpus()
    to avoid doing unnecessary stores to other CPUs' rcu_data
    structures. This speeds up that first loop by roughly a factor of
    two on an old x86 system. In the case where the system is mostly
    idle, this loop incurs a large fraction of the overhead of the
    synchronize_rcu_expedited(). There is less benefit on busy systems
    because the overhead of the smp_call_function_single() in the second
    loop dominates in that case.

    However, it is not unusual to make configuration changes involving
    RCU grace periods (both expedited and normal) while the system is
    mostly idle, so this optimization is worth doing.

    While we are in the area, this commit also adds parentheses to arguments
    used by the for_each_leaf_node_possible_cpu() macro.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • If a CPU is transitioning to or from offline state, an expedited
    grace period may undergo a timed wait. This timed wait can unduly
    delay grace periods, so this commit adds a trace statement to make
    it visible.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit adds more tracing of expedited grace periods to enable
    improved debugging of slowdowns.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

26 Jul, 2017

1 commit

  • The updates of ->expmaskinitnext and ->ncpus are unsynchronized,
    with the value of ->ncpus being incremented long before the corresponding
    ->expmaskinitnext mask is updated. If an RCU expedited grace period
    sees ->ncpus change, it will update the ->expmaskinit masks from the new
    ->expmaskinitnext masks. But it is possible that ->ncpus has already
    been updated, but the ->expmaskinitnext masks still have their old values.
    For the current expedited grace period, no harm done. The CPU could not
    have been online before the grace period started, so there is no need to
    wait for its non-existent pre-existing readers.

    But the next RCU expedited grace period is in a world of hurt. The value
    of ->ncpus has already been updated, so this grace period will assume
    that the ->expmaskinitnext masks have not changed. But they have, and
    they won't be taken into account until the next never-been-online CPU
    comes online. This means that RCU will be ignoring some CPUs that it
    should be paying attention to.

    The solution is to update ->ncpus and ->expmaskinitnext while holding
    the ->lock for the rcu_node structure containing the ->expmaskinitnext
    mask. Because smp_store_release() is now used to update ->ncpus and
    smp_load_acquire() is now used to locklessly read it, if the expedited
    grace period sees ->ncpus change, then the updating CPU has to
    already be holding the corresponding ->lock. Therefore, when the
    expedited grace period later acquires that ->lock, it is guaranteed
    to see the new value of ->expmaskinitnext.

    On the other hand, if the expedited grace period loads ->ncpus just
    before an update, earlier full memory barriers guarantee that
    the incoming CPU isn't far enough along to be running any RCU readers.

    This commit therefore makes the required change.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
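
    The resulting ordering protocol can be sketched as follows, using the
    rsp-> naming of that era; the change-detection variable on the read side
    is an assumption:

    /* CPU-online path, holding the leaf rcu_node structure's ->lock: */
    raw_spin_lock_irqsave_rcu_node(rnp, flags);
    rnp->expmaskinitnext |= mask;                     /* Record the incoming CPU... */
    smp_store_release(&rsp->ncpus, rsp->ncpus + 1);   /* ...before publishing the new count. */
    raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

    /* Expedited grace-period path: */
    ncpus = smp_load_acquire(&rsp->ncpus);            /* Pairs with the store-release above. */
    if (ncpus != ncpus_seen_last_time) {
            /* A subsequent acquisition of rnp->lock is now guaranteed to see
             * the updated ->expmaskinitnext, so ->expmaskinit can be refreshed. */
    }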
     

19 Apr, 2017

2 commits

  • The expedited grace-period code contains several open-coded shifts that
    know the format of an rcu_seq grace-period counter, which is not
    particularly good style. This commit therefore creates a new
    rcu_seq_ctr() function that extracts the counter portion of the
    counter, and an rcu_seq_state() function that extracts the low-order
    state bit. This commit prepares for SRCU callback parallelization,
    which will require two state bits.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
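
    In outline, the two new accessors simply split the sequence value into its
    counter and state portions. This sketch assumes a single low-order state
    bit, as was the case when this commit was applied:

    #define RCU_SEQ_CTR_SHIFT   1   /* One state bit here; SRCU will need two. */
    #define RCU_SEQ_STATE_MASK  ((1 << RCU_SEQ_CTR_SHIFT) - 1)

    static inline unsigned long rcu_seq_ctr(unsigned long s)
    {
            return s >> RCU_SEQ_CTR_SHIFT;          /* Grace-period count. */
    }

    static inline unsigned long rcu_seq_state(unsigned long s)
    {
            return s & RCU_SEQ_STATE_MASK;          /* Low-order state bit(s). */
    }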
     
  • Expedited grace periods use workqueue handlers that wake up the requesters,
    but there is no lock mediating this wakeup. Therefore, memory barriers
    are required to ensure that the handler's memory references are seen by
    all to occur before synchronize_*_expedited() returns to its caller.
    Possibly detected by syzkaller.

    Reported-by: Dmitry Vyukov
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
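
    The general pattern being enforced, shown generically rather than as the
    actual RCU code (gp_done and gp_wq are hypothetical):

    /* Workqueue-handler side: */
    smp_mb();                           /* Order the handler's prior accesses... */
    WRITE_ONCE(gp_done, true);          /* ...before the requester can observe completion. */
    wake_up(&gp_wq);

    /* Requester side, in synchronize_*_expedited(): */
    wait_event(gp_wq, READ_ONCE(gp_done));
    smp_mb();                           /* Order the completion check before returning to the caller. */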