06 Apr, 2019

1 commit

  • [ Upstream commit a39f15b9644fac3f950f522c39e667c3af25c588 ]

    Since kprobe itself depends on RCU, probing on RCU debug
    routine can cause recursive breakpoint bugs.

    Prohibit probing on RCU debug routines.

    int3
    ->do_int3()
      ->ist_enter()
        ->RCU_LOCKDEP_WARN()
          ->debug_lockdep_rcu_enabled() -> int3
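
    A minimal sketch of the approach, assuming the fix marks the RCU debug
    helpers with NOKPROBE_SYMBOL() (the exact list of annotated functions
    is an assumption here; only debug_lockdep_rcu_enabled() is named in the
    chain above):

    /* Tell kprobes never to place a breakpoint in this routine, breaking
     * the int3 -> RCU_LOCKDEP_WARN() -> int3 recursion shown above. */
    NOKPROBE_SYMBOL(debug_lockdep_rcu_enabled);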

    Signed-off-by: Masami Hiramatsu
    Cc: Alexander Shishkin
    Cc: Andrea Righi
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mathieu Desnoyers
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/154998807741.31052.11229157537816341591.stgit@devbox
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Masami Hiramatsu
     

24 Mar, 2019

1 commit

  • commit 1d1f898df6586c5ea9aeaf349f13089c6fa37903 upstream.

    The rcu_gp_kthread_wake() function is invoked when it might be necessary
    to wake the RCU grace-period kthread. Because self-wakeups are normally
    a useless waste of CPU cycles, if rcu_gp_kthread_wake() is invoked from
    this kthread, it naturally refuses to do the wakeup.

    Unfortunately, natural though it might be, this heuristic fails when
    rcu_gp_kthread_wake() is invoked from an interrupt or softirq handler
    that interrupted the grace-period kthread just after the final check of
    the wait-event condition but just before the schedule() call. In this
    case, a wakeup is required, even though the call to rcu_gp_kthread_wake()
    is within the RCU grace-period kthread's context. Failing to provide
    this wakeup can result in grace periods failing to start, which in turn
    results in out-of-memory conditions.

    This race window is quite narrow, but it actually did happen during real
    testing. It would of course need to be fixed even if it was strictly
    theoretical in nature.
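
    A rough sketch of the fixed check (the rsp-> field names and the
    swake_up_one() call follow the 4.19-era RCU code and are assumptions
    here; the key addition is the !in_interrupt() && !in_serving_softirq()
    test noted below):

    static void rcu_gp_kthread_wake(struct rcu_state *rsp)
    {
            /* Skip the self-wakeup only when actually running in the
             * grace-period kthread's own task context, not when called
             * from an irq/softirq handler that interrupted it just
             * before schedule(). */
            if ((current == rsp->gp_kthread &&
                 !in_interrupt() && !in_serving_softirq()) ||
                !READ_ONCE(rsp->gp_flags) || !rsp->gp_kthread)
                    return;
            swake_up_one(&rsp->gp_wq);
    }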

    This patch does not Cc stable because it does not apply cleanly to
    earlier kernel versions.

    Fixes: 48a7639ce80c ("rcu: Make callers awaken grace-period kthread")
    Reported-by: "He, Bo"
    Co-developed-by: "Zhang, Jun"
    Co-developed-by: "He, Bo"
    Co-developed-by: "xiao, jin"
    Co-developed-by: Bai, Jie A
    Signed-off: "Zhang, Jun"
    Signed-off: "He, Bo"
    Signed-off: "xiao, jin"
    Signed-off: Bai, Jie A
    Signed-off-by: "Zhang, Jun"
    [ paulmck: Switch from !in_softirq() to "!in_interrupt() &&
    !in_serving_softirq()" to avoid redundant wakeups and to also handle the
    interrupt-handler scenario as well as the softirq-handler scenario that
    actually occurred in testing. ]
    Signed-off-by: Paul E. McKenney
    Link: https://lkml.kernel.org/r/CD6925E8781EFD4D8E11882D20FC406D52A11F61@SHSMSX104.ccr.corp.intel.com
    Signed-off-by: Greg Kroah-Hartman

    Zhang, Jun
     

13 Jan, 2019

1 commit

  • commit eb4c2382272ae7ae5d81fdfa5b7a6c86146eaaa4 upstream.

    The srcu_gp_start() function is called with the srcu_struct structure's
    ->lock held, but not with the srcu_data structure's ->lock. This is
    problematic because this function accesses and updates the srcu_data
    structure's ->srcu_cblist, which is protected by that lock. Failing to
    hold this lock can result in corruption of the SRCU callback lists,
    which in turn can result in arbitrarily bad results.

    This commit therefore makes srcu_gp_start() acquire the srcu_data
    structure's ->lock across the calls to rcu_segcblist_advance() and
    rcu_segcblist_accelerate(), thus preventing this corruption.
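
    Roughly, the callback-list manipulation in srcu_gp_start() then looks
    like this (a sketch based on the description above; the exact lock
    helper names vary by kernel version and are assumptions here):

    spin_lock_rcu_node(sdp);        /* srcu_data ->lock; irqs already off. */
    rcu_segcblist_advance(&sdp->srcu_cblist,
                          rcu_seq_current(&sp->srcu_gp_seq));
    (void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
                                   rcu_seq_snap(&sp->srcu_gp_seq));
    spin_unlock_rcu_node(sdp);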

    Reported-by: Bart Van Assche
    Reported-by: Christoph Hellwig
    Reported-by: Sebastian Kuzminsky
    Signed-off-by: Dennis Krein
    Signed-off-by: Paul E. McKenney
    Tested-by: Dennis Krein
    Cc: # 4.16.x
    Signed-off-by: Greg Kroah-Hartman

    Dennis Krein
     

01 Dec, 2018

1 commit

  • commit 92aa39e9dc77481b90cbef25e547d66cab901496 upstream.

    The per-CPU rcu_dynticks.rcu_urgent_qs variable communicates an urgent
    need for an RCU quiescent state from the force-quiescent-state processing
    within the grace-period kthread to context switches and to cond_resched().
    Unfortunately, such urgent needs are not communicated to need_resched(),
    which is sometimes used to decide when to invoke cond_resched(), for
    but one example, within the KVM vcpu_run() function. As of v4.15, this
    can result in synchronize_sched() being delayed by up to ten seconds,
    which can be problematic, to say nothing of annoying.

    This commit therefore checks rcu_dynticks.rcu_urgent_qs from within
    rcu_check_callbacks(), which is invoked from the scheduling-clock
    interrupt handler. If the current task is not an idle task and is
    not executing in usermode, a context switch is forced, and either way,
    the rcu_dynticks.rcu_urgent_qs variable is set to false. If the current
    task is an idle task, then RCU's dyntick-idle code will detect the
    quiescent state, so no further action is required. Similarly, if the
    task is executing in usermode, other code in rcu_check_callbacks() and
    its called functions will report the corresponding quiescent state.
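
    A sketch of the added check (variable and helper names follow the
    pre-4.20 RCU code described above and are assumptions here):

    /* In rcu_check_callbacks(), invoked from the scheduling-clock irq: */
    if (smp_load_acquire(this_cpu_ptr(&rcu_dynticks.rcu_urgent_qs))) {
            /* Idle and usermode are reported elsewhere; otherwise force
             * a context switch to obtain a quiescent state soon. */
            if (!rcu_is_cpu_rrupt_from_idle() && !user) {
                    set_tsk_need_resched(current);
                    set_preempt_need_resched();
            }
            __this_cpu_write(rcu_dynticks.rcu_urgent_qs, false);
    }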

    Reported-by: Marius Hillenbrand
    Reported-by: David Woodhouse
    Suggested-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney
    [ paulmck: Backported to make patch apply cleanly on older versions. ]
    Tested-by: Marius Hillenbrand
    Cc: # 4.12.x - 4.19.x
    Signed-off-by: Greg Kroah-Hartman

    Paul E. McKenney
     

14 Aug, 2018

1 commit

  • Pull scheduler updates from Thomas Gleixner:

    - Cleanup and improvement of NUMA balancing

    - Refactoring and improvements to the PELT (Per Entity Load Tracking)
    code

    - Watchdog simplification and related cleanups

    - The usual pile of small incremental fixes and improvements

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
    watchdog: Reduce message verbosity
    stop_machine: Reflow cpu_stop_queue_two_works()
    sched/numa: Move task_numa_placement() closer to numa_migrate_preferred()
    sched/numa: Use group_weights to identify if migration degrades locality
    sched/numa: Update the scan period without holding the numa_group lock
    sched/numa: Remove numa_has_capacity()
    sched/numa: Modify migrate_swap() to accept additional parameters
    sched/numa: Remove unused task_capacity from 'struct numa_stats'
    sched/numa: Skip nodes that are at 'hoplimit'
    sched/debug: Reverse the order of printing faults
    sched/numa: Use task faults only if numa_group is not yet set up
    sched/numa: Set preferred_node based on best_cpu
    sched/numa: Simplify load_too_imbalanced()
    sched/numa: Evaluate move once per node
    sched/numa: Remove redundant field
    sched/debug: Show the sum wait time of a task group
    sched/fair: Remove #ifdefs from scale_rt_capacity()
    sched/core: Remove get_cpu() from sched_fork()
    sched/cpufreq: Clarify sugov_get_util()
    sched/sysctl: Remove unused sched_time_avg_ms sysctl
    ...

    Linus Torvalds
     

13 Jul, 2018

35 commits

  • fixes1.2018.07.12b: Post-gp_seq miscellaneous fixes
    torture1.2018.07.12b: Post-gp_seq torture-test updates

    Paul E. McKenney
     
  • The rcutorture test module currently increments both the success and
    error counters for the barrier test upon error, which results in
    misleading statistics being printed. This commit therefore changes the
    code to increment the success counter only when the test actually passes.

    This change was tested by returning from the barrier callback without
    incrementing the callback counter, thus introducing what appeared to
    rcutorture to be rcu_barrier() failures.
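
    A sketch of the change in the barrier-test bookkeeping (the counter
    names are assumptions based on rcutorture's statistics):

    if (atomic_read(&barrier_cbs_invoked) != n_barrier_cbs) {
            n_rcu_torture_barrier_error++;
            WARN_ON_ONCE(1);
    } else {
            /* Count a success only when every callback was seen. */
            n_rcu_torture_barrier_successes++;
    }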

    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Paul E. McKenney

    Joel Fernandes (Google)
     
  • When rcutorture is built in to the kernel, an earlier patch detects
    that and raises the priority of RCU's kthreads to allow rcutorture's
    RCU priority boosting tests to succeed.

    However, if rcutorture is built as a module, those priorities must be
    raised manually via the rcutree.kthread_prio kernel boot parameter.
    If this manual step is not taken, rcutorture's RCU priority boosting
    tests will fail due to kthread starvation. One approach would be to
    raise the default priority, but that risks breaking existing users.
    Another approach would be to allow runtime adjustment of RCU's kthread
    priorities, but that introduces numerous "interesting" race conditions.
    This patch therefore instead detects too-low priorities, prints a
    message, and disables the RCU priority boosting tests in that case.

    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Paul E. McKenney

    Joel Fernandes (Google)
     
  • The get_seconds() call is deprecated because it overflows on 32-bit
    architectures. The algorithm in rcu_torture_stall() can deal with
    the overflow, but another problem here is that using a CLOCK_REALTIME
    stamp can lead to a false-positive stall warning when a settimeofday()
    happens concurrently.

    Using ktime_get_seconds() instead avoids those issues and will never
    overflow. The added cast to 'unsigned long', however, is necessary to
    make ULONG_CMP_LT() work correctly.
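
    The stall loop's end-of-test check then looks something like this
    (a sketch; the variable names are assumptions):

    unsigned long stop_at = (unsigned long)ktime_get_seconds() + stall_cpu;

    while (ULONG_CMP_LT((unsigned long)ktime_get_seconds(), stop_at))
            continue;  /* Induce the stall. */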

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul E. McKenney

    Arnd Bergmann
     
  • Currently, with RCU_BOOST disabled, I get no failures when forcing
    rcutorture to test RCU boost priority inversion. The reason seems to be
    that we don't check for failures if the callback never ran at all for
    the duration of the boost-test loop.

    Further, the 'rtb' and 'rtbf' counters seem to be used inconsistently.
    'rtb' is incremented at the start of each test and 'rtbf' is incremented
    per-CPU on each failure of call_rcu(), so it is possible that 'rtbf' > 'rtb'.

    To test the boost with rcutorture, I did following on a 4-CPU x86 machine:

    modprobe rcutorture test_boost=2
    sleep 20
    rmmod rcutorture

    With patch:
    rtbf: 8 rtb: 12

    Without patch:
    rtbf: 0 rtb: 2

    In summary this patch:
    - Increments failed and total test counters once per boost-test.
    - Checks for failure cases correctly.

    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Paul E. McKenney

    Joel Fernandes (Google)
     
  • Currently rcutorture is not able to torture RCU boosting properly. This
    is because the rcutorture's boost threads which are doing the torturing
    may be throttled due to RT throttling.

    This patch makes rcutorture use the right torture technique (unthrottled
    rcutorture boost tasks) for torturing RCU so that the test fails
    correctly when no boost is available.

    Currently this requires accessing sysctl_sched_rt_runtime directly, but
    that should be OK since rcutorture is test code. Such direct access is
    also only possible if rcutorture is built in, so make it conditional
    on that.
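
    A sketch of that direct access (the helper and the saved-value variable
    are hypothetical names; sysctl_sched_rt_runtime and IS_BUILTIN() are the
    kernel symbols being relied on):

    static int old_rt_runtime = -1;

    static void rcu_torture_disable_rt_throttle(void)
    {
            /* Only safe when rcutorture is built in; a module must not
             * stomp on the system-wide RT throttling setting. */
            if (!IS_BUILTIN(CONFIG_RCU_TORTURE_TEST))
                    return;
            old_rt_runtime = sysctl_sched_rt_runtime;
            sysctl_sched_rt_runtime = -1;   /* RT tasks may use 100% CPU. */
    }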

    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Paul E. McKenney

    Joel Fernandes (Google)
     
  • For RCU implementations supporting multiple types of reader protection,
    rcutorture currently randomly selects the combinations of types of
    protection for each phase of each reader. The problem with this is that,
    for example, given the four kinds of protection for RCU-sched
    (local_irq_disable(), local_bh_disable(), preempt_disable(), and
    rcu_read_lock_sched()), the reader will be protected by a single
    mechanism only 25% of the time. We really need heavier testing of
    single read-side mechanisms.

    This commit therefore uses only a single mechanism about 60% of the time,
    half of the time explicitly and one-eighth of the time by chance.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit enables rcutorture to test whether RCU properly aggregates
    different types of read-side critical sections into a larger section
    covering the set. It does this by extending an initial read-side
    critical section randomly for a random number of extensions. There is
    a new rcu_torture_ops field ->extendable that specifies what extensions
    are permitted for a given flavor of RCU (for example, SRCU does not
    permit any extensions, while RCU-sched permits all types). Note that
    if a given operation (for example, local_bh_disable()) extends an RCU
    read-side critical section, then rcutorture feels free to also start
    and end the critical section with that operation's type of disabling.

    Disabling operations include local_bh_disable(), local_irq_disable(),
    and preempt_disable(). This commit also adds a new "busted_srcud"
    torture type, which verifies rcutorture's ability to detect extensions
    of RCU read-side critical sections that are not handled. Gotta test
    the test, after all!

    Note that it is not legal to invoke local_bh_disable() with interrupts
    disabled, and this transition is avoided by overriding the random-number
    generator when it wants to call local_bh_disable() while interrupts
    are disabled. The code instead leaves both interrupts and bh/softirq
    disabled in this case.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit saves a few lines of code by making rcu_torture_timer()
    invoke rcu_torture_one_read(), thus completing the consolidation of
    code between rcu_torture_timer() and rcu_torture_reader().

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, the rcu_torture_timer() function uses a single global
    torture_random_state structure protected by a single global lock.
    This conflicts to some extent with performance and scalability,
    but even more with the goal of consolidating read-side testing
    with rcu_torture_reader(). This commit therefore creates a per-CPU
    torture_random_state structure for use by rcu_torture_timer() and
    eliminates the lock.
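
    A sketch of the per-CPU replacement (rcu_torture_timer_rand is the name
    mentioned in the tag below; the surrounding usage is an assumption):

    static DEFINE_PER_CPU(struct torture_random_state, rcu_torture_timer_rand);

    static void rcu_torture_timer(struct timer_list *unused)
    {
            /* No lock needed: each CPU owns its own random state. */
            (void)rcu_torture_one_read(this_cpu_ptr(&rcu_torture_timer_rand));
    }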

    Signed-off-by: Paul E. McKenney
    [ paulmck: Make rcu_torture_timer_rand static, per 0day Test Robot report. ]

    Paul E. McKenney
     
  • Currently, rcu_torture_timer() relies on a lock to guard updates to
    n_rcu_torture_timers. Unfortunately, consolidating code with
    rcu_torture_reader() will dispense with this lock. This commit
    therefore makes n_rcu_torture_timers be an atomic_long_t and uses
    atomic_long_inc() to carry out the update.
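
    A minimal sketch of the change described above:

    static atomic_long_t n_rcu_torture_timers;

    /* In rcu_torture_timer(), now lockless: */
    atomic_long_inc(&n_rcu_torture_timers);

    /* Statistics printing would read it via atomic_long_read(). */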

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit extracts the code executed on each pass through the loop
    in rcu_torture_reader() into a new rcu_torture_one_read() function.
    This new function will also be used by rcu_torture_timer().

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The torturing_tasks() function in rcuperf.c is not used, so this commit
    removes it.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Back when RCU had a debugfs interface, there was a test version and
    sequence number that allowed associating debugfs data with a particular
    test run, where the test run started with modprobe and ended with rmmod,
    which was how tests were run back on the old ABAT system within IBM.
    But rcutorture testing no longer runs on ABAT, and there is no longer an
    RCU debugfs interface, so there is no longer any need for test versions
    and sequence numbers.

    This commit therefore removes the rcutorture_record_test_transition()
    and rcutorture_record_progress() functions, and along with them the
    rcutorture_testseq and rcutorture_vernum variables that they update.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Some RCU bugs have been sensitive to the frequency of CPU-hotplug
    operations, which have been gradually increased over time. But this
    frequency is now at the one-second lower limit that can be specified using
    the rcutorture.onoff_interval kernel parameter. This commit therefore
    changes the units of rcutorture.onoff_interval from seconds to jiffies,
    and also sets the value specified for this kernel parameter in the TREE03
    rcutorture scenario to 200, which is 200 milliseconds for HZ=1000.
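
    The TREE03 boot parameters would then carry something like the following
    (the exact file contents are an assumption; 200 jiffies is 200 ms at
    HZ=1000):

    rcutorture.onoff_interval=200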

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The rcutorture RCU priority boosting tests fail even with CONFIG_RCU_BOOST
    set because rcutorture's threads run at the same priority as the default
    RCU kthreads (RT class with priority of 1).

    This patch checks if RCU torture is built into the kernel and if so,
    assigns RT priority 1 to the RCU threads, allowing the rcutorture boost
    tests to pass.

    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Paul E. McKenney

    Joel Fernandes (Google)
     
  • This commit adds the SRCU grace-period number to the rcutorture statistics
    printout, which allows it to be compared to the rcutorture "Writer stall
    state" message.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The ->dynticks_nmi_nesting field records the nesting depth of both
    interrupt and NMI handlers. Because the kernel can enter interrupts
    and never leave them (and vice versa) and because NMIs can interrupt
    manipulation of the ->dynticks_nmi_nesting field, the values in this
    field must be both chosen and manipulated very carefully. As a result,
    although the value is zero when the corresponding CPU is executing
    neither an interrupt nor an NMI handler, it is 4,611,686,018,427,387,906
    on 64-bit systems when there is a single level of interrupt/NMI handling
    in progress.

    This number is difficult to remember and interpret, so this commit
    switches the output to hexadecimal, resulting in the much nicer
    0x4000000000000002.
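
    For reference, the value decomposes as follows (assuming the usual
    DYNTICK_IRQ_NONIDLE definition of LONG_MAX/2 + 1):

    DYNTICK_IRQ_NONIDLE = LONG_MAX/2 + 1 = 0x4000000000000000
    one level of irq/NMI nesting adds 2  = 0x4000000000000002
                                         (= 4,611,686,018,427,387,906)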

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The current implementation of rcu_seq_diff() follows tradition in
    providing a rough-and-ready approximation of the number of elapsed grace
    periods between the two rcu_seq values. However, this difference is
    used to flag RCU-failure "near misses", which can be a valuable debugging
    aid, so more exactitude would be an improvement. This commit therefore
    improves the accuracy of rcu_seq_diff().

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, the ranges of jiffies_till_{first,next}_fqs are checked and
    adjusted again and again in the rcu_gp_kthread() loop at runtime.

    However, it is enough to check them only when they are set, not on every
    pass through the loop. This commit therefore handles the range checking
    at the time the values are set, for example, via sysfs.
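
    A sketch of checking at set time (the handler and ops names are
    assumptions; module_param_cb() is the standard way to hook a module
    parameter's store path, which is also what backs the sysfs write):

    static int param_set_first_fqs_jiffies(const char *val,
                                           const struct kernel_param *kp)
    {
            ulong j;
            int ret = kstrtoul(val, 0, &j);

            if (!ret)
                    WRITE_ONCE(*(ulong *)kp->arg,
                               clamp(j, 0UL, (ulong)HZ)); /* range-check once,
                                                             here; bound is
                                                             illustrative */
            return ret;
    }

    static const struct kernel_param_ops first_fqs_jiffies_ops = {
            .set = param_set_first_fqs_jiffies,
            .get = param_get_ulong,
    };

    module_param_cb(jiffies_till_first_fqs, &first_fqs_jiffies_ops,
                    &jiffies_till_first_fqs, 0644);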

    Signed-off-by: Byungchul Park
    Signed-off-by: Paul E. McKenney

    Byungchul Park
     
  • This commit adds any in-the-future ->gp_seq_needed fields to the
    diagnostics for an rcutorture writer stall warning message.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • At the end of rcu_tasks_kthread() there's a lonely
    schedule_timeout_uninterruptible() call with no apparent rationale for
    its existence. But there is. It is to keep the thread from going into
    a tight loop if there's some anomaly. That really needs a comment.
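
    That is, something like the following at the call site (the comment
    wording and the delay value here are assumptions):

    /* Paranoia: keep this kthread from tight-looping should the
     * grace-period or callback processing above misbehave. */
    schedule_timeout_uninterruptible(HZ/10);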

    Link: http://lkml.kernel.org/r/20180524223839.GU3803@linux.vnet.ibm.com
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Paul E. McKenney

    Steven Rostedt (VMware)
     
  • Joel Fernandes found that the synchronize_rcu_tasks() was taking a
    significant amount of time. He demonstrated it with the following test:

    # cd /sys/kernel/tracing
    # while [ 1 ]; do x=1; done &
    # echo '__schedule_bug:traceon' > set_ftrace_filter
    # time echo '!__schedule_bug:traceon' > set_ftrace_filter;

    real 0m1.064s
    user 0m0.000s
    sys 0m0.004s

    Where it takes a little over a second to perform the synchronize,
    because there's a loop that waits 1 second at a time for tasks to get
    through their quiescent points when there's a task that must be waited
    for.

    After discussion we came up with a simple way to wait for holdouts:
    increase the wait time on each iteration of the loop, capping it at a
    full second.

    With the new patch we have:

    # time echo '!__schedule_bug:traceon' > set_ftrace_filter;

    real 0m0.131s
    user 0m0.000s
    sys 0m0.004s

    Which drops it down to 13% of what the original wait time was.
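
    The holdout-wait loop, roughly, becomes the following (a sketch of the
    approach described above; holdouts_remain() and check_holdout_tasks()
    are stand-ins for the existing scan, and the starting divisor of 10 is
    an assumption):

    int fract = 10;                 /* First wait is HZ/10. */

    while (holdouts_remain()) {
            schedule_timeout_interruptible(HZ / fract);
            if (fract > 1)
                    fract--;        /* Wait longer each pass, up to a full second. */
            check_holdout_tasks();
    }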

    Link: http://lkml.kernel.org/r/20180523063815.198302-2-joel@joelfernandes.org
    Reported-by: Joel Fernandes (Google)
    Suggested-by: Joel Fernandes (Google)
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Paul E. McKenney

    Steven Rostedt (VMware)
     
  • rcu_seq_snap() may be tricky to decipher. Let's document how it works
    with an example to make it easier.
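
    For reference, the computation being documented is (assuming the current
    definition, with the low-order RCU_SEQ_STATE_MASK bits holding the
    grace-period phase):

    s = (READ_ONCE(*sp) + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK;

    With two phase bits, a snapshot taken while *sp == 0x9 (counter 2,
    phase 1, i.e. a grace period in flight) yields s == 0x10, the value the
    counter reaches once the in-flight grace period and one further full
    grace period have completed.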

    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Paul E. McKenney
    [ paulmck: Shrink comment as suggested by Peter Zijlstra. ]

    Joel Fernandes (Google)
     
  • Currently, rcu_check_gp_start_stall() waits for one second after the first
    request before complaining that a grace period has not yet started. This
    was desirable while testing the conversion from ->future_gp_needed[] to
    ->gp_seq_needed, but it is a bit on the hair-trigger side for production
    use under heavy load. This commit therefore makes this wait time be
    exactly that of the RCU CPU stall warning, allowing easy adjustment of
    both timeouts to suit the distribution or installation at hand.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The rcu_cpu_has_callbacks() function is now used in all configurations,
    so this commit removes the __maybe_unused.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This function is in rcuperf.c, which is not an include file, so there
    is no problem dropping the "inline", especially given that this function
    is invoked only twice per rcuperf run. This commit therefore delegates
    the inlining decision to the compiler by dropping the "inline".

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This function is in rcutorture.c, which is not an include file, so there
    is no problem dropping the "inline", especially given that this function
    is invoked only twice per rcutorture run. This commit therefore delegates
    the inlining decision to the compiler by dropping the "inline".

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • These functions are in kernel/rcu/tree.c, which is not an include file,
    so there is no problem dropping the "inline", especially given that these
    functions are nowhere near a fastpath. This commit therefore delegates
    the inlining decision to the compiler by dropping the "inline".

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • One danger of using __maybe_unused is that the compiler doesn't yell
    at you when you remove the last reference, witness rcu_bind_gp_kthread()
    and its local variable "cpu". This commit removes this local variable.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The rcu_kick_nohz_cpu() function is no longer used, and the functionality
    it used to provide is now provided by a call to resched_cpu() in the
    force-quiescent-state function rcu_implicit_dynticks_qs(). This commit
    therefore removes rcu_kick_nohz_cpu().

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The rcu_preempt_qs() function only applies to the CPU, not the task.
    A task really is allowed to invoke this function while in an RCU-preempt
    read-side critical section, but only if it has first added itself to
    some leaf rcu_node structure's ->blkd_tasks list.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The rcu_dynticks_momentary_idle() function is invoked only from
    rcu_momentary_dyntick_idle(), and neither function is particularly
    large. This commit therefore saves a few lines by inlining
    rcu_dynticks_momentary_idle() into rcu_momentary_dyntick_idle().

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • If any scheduling-clock interrupt interrupts an RCU-preempt read-side
    critical section, the interrupted task's ->rcu_read_unlock_special.b.need_qs
    field is set. This causes the outermost rcu_read_unlock() to incur the
    extra overhead of calling into rcu_read_unlock_special(). This commit
    reduces that overhead by setting ->rcu_read_unlock_special.b.need_qs only
    if the grace period has been in effect for more than one second.

    Why one second? Because this is comfortably smaller than the minimum
    RCU CPU stall-warning timeout of three seconds, but long enough that the
    .need_qs marking should happen quite rarely. And if your RCU read-side
    critical section has run on-CPU for a full second, it is not unreasonable
    to invest some CPU time in ending the grace period quickly.
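
    The gist of the change, as a sketch (the field names follow the
    RCU-preempt code of that era and are assumptions; the new part is the
    time_after() test):

    /* In the scheduling-clock path for a task in an RCU-preempt reader: */
    if (t->rcu_read_lock_nesting > 0 &&
        !t->rcu_read_unlock_special.b.need_qs &&
        time_after(jiffies, rcu_state_p->gp_start + HZ))
            t->rcu_read_unlock_special.b.need_qs = true;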

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The naming and comments associated with some RCU-tasks code make
    the faulty assumption that context switches due to cond_resched()
    are voluntary. As several people pointed out, this is not the case.
    This commit therefore updates function names and comments to better
    reflect current reality.

    Reported-by: Byungchul Park
    Reported-by: Joel Fernandes
    Reported-by: Steven Rostedt
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney