24 Feb, 2020

1 commit

  • [ Upstream commit 610dea36d3083a977e4f156206cbe1eaa2a532f0 ]

    Commit 18cd8c93e69e ("rcu/nocb: Print gp/cb kthread hierarchy if
    dump_tree") added print statements to rcu_organize_nocb_kthreads for
    debugging, but incorrectly guarded them, causing the function to always
    spew out its message.

    This patch fixes it by guarding both pr_alert statements with dump_tree,
    while also changing the second pr_alert to a pr_cont, to print the
    hierarchy in a single line (assuming that's how it was supposed to
    work).

    Fixes: 18cd8c93e69e ("rcu/nocb: Print gp/cb kthread hierarchy if dump_tree")
    Signed-off-by: Stefan Reiter
    [ paulmck: Make single-nocbs-CPU GP kthreads look less erroneous. ]
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Sasha Levin

    Stefan Reiter
     

11 Feb, 2020

2 commits

  • commit c51f83c315c392d9776c33eb16a2fe1349d65c7f upstream.

    The rcu_node structure's ->expmask field is updated only when holding the
    ->lock, but is also accessed locklessly. This means that all ->expmask
    updates must use WRITE_ONCE() and all reads carried out without holding
    ->lock must use READ_ONCE(). This commit therefore changes the lockless
    ->expmask read in rcu_read_unlock_special() to use READ_ONCE().
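
    The marked-access pattern described above can be sketched in user space.
    This is a simplified model only: the macro definitions below mirror the
    kernel's READ_ONCE()/WRITE_ONCE() in spirit, and the variable and helper
    names are hypothetical stand-ins, not the actual kernel code.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel's ONCE macros: the volatile cast
 * forces a single, untorn access and keeps the compiler from refetching,
 * caching, or fusing the load/store (GCC/Clang __typeof__ extension). */
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define READ_ONCE(x)       (*(volatile __typeof__(x) *)&(x))

static unsigned long expmask;           /* stand-in for rnp->expmask */

/* Updates happen under ->lock in the kernel; the store is still marked
 * so that lockless readers see a consistent value. */
void set_expmask(unsigned long mask)
{
        WRITE_ONCE(expmask, mask);
}

/* Lockless reader, as in rcu_read_unlock_special(). */
unsigned long read_expmask(void)
{
        return READ_ONCE(expmask);
}
```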

    Reported-by: syzbot+99f4ddade3c22ab0cf23@syzkaller.appspotmail.com
    Signed-off-by: Paul E. McKenney
    Acked-by: Marco Elver
    Signed-off-by: Greg Kroah-Hartman

    Paul E. McKenney
     
  • commit 6935c3983b246d5fbfebd3b891c825e65c118f2d upstream.

    The rcu_gp_fqs_check_wake() function uses rcu_preempt_blocked_readers_cgp()
    to read ->gp_tasks while other cpus might overwrite this field.

    We need READ_ONCE()/WRITE_ONCE() pairs to avoid compiler
    tricks and KCSAN splats like the following:

    BUG: KCSAN: data-race in rcu_gp_fqs_check_wake / rcu_preempt_deferred_qs_irqrestore

    write to 0xffffffff85a7f190 of 8 bytes by task 7317 on cpu 0:
    rcu_preempt_deferred_qs_irqrestore+0x43d/0x580 kernel/rcu/tree_plugin.h:507
    rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659
    __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394
    rcu_read_unlock include/linux/rcupdate.h:645 [inline]
    __ip_queue_xmit+0x3b0/0xa40 net/ipv4/ip_output.c:533
    ip_queue_xmit+0x45/0x60 include/net/ip.h:236
    __tcp_transmit_skb+0xdeb/0x1cd0 net/ipv4/tcp_output.c:1158
    __tcp_send_ack+0x246/0x300 net/ipv4/tcp_output.c:3685
    tcp_send_ack+0x34/0x40 net/ipv4/tcp_output.c:3691
    tcp_cleanup_rbuf+0x130/0x360 net/ipv4/tcp.c:1575
    tcp_recvmsg+0x633/0x1a30 net/ipv4/tcp.c:2179
    inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
    sock_recvmsg_nosec net/socket.c:871 [inline]
    sock_recvmsg net/socket.c:889 [inline]
    sock_recvmsg+0x92/0xb0 net/socket.c:885
    sock_read_iter+0x15f/0x1e0 net/socket.c:967
    call_read_iter include/linux/fs.h:1864 [inline]
    new_sync_read+0x389/0x4f0 fs/read_write.c:414

    read to 0xffffffff85a7f190 of 8 bytes by task 10 on cpu 1:
    rcu_gp_fqs_check_wake kernel/rcu/tree.c:1556 [inline]
    rcu_gp_fqs_check_wake+0x93/0xd0 kernel/rcu/tree.c:1546
    rcu_gp_fqs_loop+0x36c/0x580 kernel/rcu/tree.c:1611
    rcu_gp_kthread+0x143/0x220 kernel/rcu/tree.c:1768
    kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
    ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 10 Comm: rcu_preempt Not tainted 5.3.0+ #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    [ paulmck: Added another READ_ONCE() for RCU CPU stall warnings. ]
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

26 Jan, 2020

1 commit

  • [ Upstream commit b8889c9c89a2655a231dfed93cc9bdca0930ea67 ]

    We never set this to false. This probably doesn't affect most people's
    runtime because GCC will automatically initialize it to false at certain
    common optimization levels. But that behavior is related to a bug in
    GCC and obviously should not be relied on.

    Fixes: 5d6742b37727 ("rcu/nocb: Use rcu_segcblist for no-CBs CPUs")
    Signed-off-by: Dan Carpenter
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Sasha Levin

    Dan Carpenter
     

14 Aug, 2019

36 commits

  • When under overload conditions, __call_rcu_nocb_wake() will wake the
    no-CBs GP kthread any time the no-CBs CB kthread is asleep or there
    are no ready-to-invoke callbacks, but only after a timer delay. If the
    no-CBs GP kthread has a ->nocb_bypass_timer pending, the deferred wakeup
    from __call_rcu_nocb_wake() is redundant. This commit therefore makes
    __call_rcu_nocb_wake() avoid posting the redundant deferred wakeup if
    ->nocb_bypass_timer is pending. This requires adding a bit of ordering
    of timer actions.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, __call_rcu_nocb_wake() advances callbacks each time that it
    detects excessive numbers of callbacks, though only if it succeeds in
    conditionally acquiring its leaf rcu_node structure's ->lock. Despite
    the conditional acquisition of ->lock, this does increase contention.
    This commit therefore avoids advancing callbacks unless there are
    callbacks in ->cblist whose grace period has completed and advancing
    has not yet been done during this jiffy.

    Note that this decision does not take the presence of new callbacks
    into account. That is because on this code path, there will always be
    at least one new callback, namely the one we just enqueued.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, nocb_cb_wait() advances callbacks on each pass through its
    loop, though only if it succeeds in conditionally acquiring its leaf
    rcu_node structure's ->lock. Despite the conditional acquisition of
    ->lock, this does increase contention. This commit therefore avoids
    advancing callbacks unless there are callbacks in ->cblist whose grace
    period has completed.

    Note that nocb_cb_wait() doesn't worry about callbacks that have not
    yet been assigned a grace period. The idea is that the only reason for
    nocb_cb_wait() to advance callbacks is to allow it to continue invoking
    callbacks. Time will tell whether this is the correct choice.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • When callbacks are in full flow, the common case is waiting for a
    grace period, and this grace period will normally take a few jiffies to
    complete. It therefore isn't all that helpful for __call_rcu_nocb_wake()
    to do a synchronous wakeup in this case. This commit therefore turns this
    into a timer-based deferred wakeup of the no-CBs grace-period kthread.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit causes locking, sleeping, and callback state to be printed
    for no-CBs CPUs when the rcutorture writer is delayed sufficiently for
    rcutorture to complain.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Use of the rcu_data structure's segmented ->cblist for no-CBs CPUs
    takes advantage of unrelated grace periods, thus reducing the memory
    footprint in the face of floods of call_rcu() invocations. However,
    the ->cblist field is a more-complex rcu_segcblist structure which must
    be protected via locking. Even though there are only three entities
    which can acquire this lock (the CPU invoking call_rcu(), the no-CBs
    grace-period kthread, and the no-CBs callbacks kthread), the contention
    on this lock is excessive under heavy stress.

    This commit therefore greatly reduces contention by provisioning
    an rcu_cblist structure field named ->nocb_bypass within the
    rcu_data structure. Each no-CBs CPU is permitted only a limited
    number of enqueues onto the ->cblist per jiffy, controlled by a new
    nocb_nobypass_lim_per_jiffy kernel boot parameter that defaults to
    about 16 enqueues per millisecond (16 * 1000 / HZ). When that limit is
    exceeded, the CPU instead enqueues onto the new ->nocb_bypass.

    The ->nocb_bypass is flushed into the ->cblist every jiffy or when
    the number of callbacks on ->nocb_bypass exceeds qhimark, whichever
    happens first. During call_rcu() floods, this flushing is carried out
    by the CPU during the course of its call_rcu() invocations. However,
    a CPU could simply stop invoking call_rcu() at any time. The no-CBs
    grace-period kthread therefore carries out less-aggressive flushing
    (every few jiffies or when the number of callbacks on ->nocb_bypass
    exceeds (2 * qhimark), whichever comes first). This means that the
    no-CBs grace-period kthread cannot be permitted to do unbounded waits
    while there are callbacks on ->nocb_bypass. A ->nocb_bypass_timer is
    used to provide the needed wakeups.
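
    The per-jiffy bypass decision and the flush triggers described above can
    be modeled as follows. This is an illustrative sketch only: the struct,
    field, and function names are hypothetical, and only the limits
    (16 * 1000 / HZ and qhimark) follow the description above.

```c
#include <stdbool.h>

#define HZ 250                                   /* assumed config value */
static const long nocb_nobypass_lim_per_jiffy = 16 * 1000 / HZ;
static const long qhimark = 10000;               /* assumed default */

struct rdp_model {
        unsigned long last_jiffy;   /* jiffy of last budget reset */
        long nobypass_count;        /* enqueues onto ->cblist this jiffy */
        long bypass_len;            /* callbacks parked on ->nocb_bypass */
};

static struct rdp_model demo_rdp;

/* Returns true if this enqueue should go onto ->nocb_bypass. */
bool use_bypass(struct rdp_model *rdp, unsigned long now)
{
        if (rdp->last_jiffy != now) {            /* new jiffy: reset budget */
                rdp->last_jiffy = now;
                rdp->nobypass_count = 0;
        }
        if (++rdp->nobypass_count <= nocb_nobypass_lim_per_jiffy)
                return false;                    /* within budget: ->cblist */
        rdp->bypass_len++;
        return true;
}

/* Flush when a jiffy has passed or the bypass has grown past qhimark. */
bool bypass_needs_flush(const struct rdp_model *rdp, unsigned long now)
{
        return rdp->bypass_len &&
               (now != rdp->last_jiffy || rdp->bypass_len > qhimark);
}

/* Drive n enqueues in the same jiffy; returns how many took the bypass. */
long enqueue_n(struct rdp_model *rdp, unsigned long now, long n)
{
        long bypassed = 0;
        for (long i = 0; i < n; i++)
                bypassed += use_bypass(rdp, now);
        return bypassed;
}
```

    With HZ=250 the per-jiffy budget is 64 enqueues; the 65th enqueue in the
    same jiffy is diverted to the bypass, and the bypass becomes flushable as
    soon as the jiffy counter advances.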

    [ paulmck: Apply Coverity feedback reported by Colin Ian King. ]
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • When there are excessive numbers of callbacks, and when either the
    corresponding no-CBs callback kthread is asleep or there are no more
    ready-to-invoke callbacks, and when at least one callback is pending,
    __call_rcu_nocb_wake() will advance the callbacks, but refrain from
    awakening the corresponding no-CBs grace-period kthread. However,
    because rcu_advance_cbs_nowake() is used, it is possible (if a bit
    unlikely) that the needed advancement could not happen due to a grace
    period not being in progress. Plus there will always be at least one
    pending callback due to one having just now been enqueued.

    This commit therefore attempts to advance callbacks and awakens the
    no-CBs grace-period kthread when there are excessive numbers of callbacks
    posted and when the no-CBs callback kthread is not in a position to do
    anything helpful.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The sleep/wakeup of the no-CBs grace-period kthreads is synchronized
    using the ->nocb_lock of the first CPU corresponding to that kthread.
    This commit provides a separate ->nocb_gp_lock for this purpose, thus
    reducing contention on ->nocb_lock.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, nocb_cb_wait() unconditionally acquires the leaf rcu_node
    ->lock to advance callbacks when done invoking the previous batch.
    It does this while holding ->nocb_lock, which means that contention on
    the leaf rcu_node ->lock visits itself on the ->nocb_lock. This commit
    therefore makes this lock acquisition conditional, forgoing callback
    advancement when the leaf rcu_node ->lock is not immediately available.
    (In this case, the no-CBs grace-period kthread will eventually do any
    needed callback advancement.)

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, __call_rcu_nocb_wake() conditionally acquires the leaf rcu_node
    structure's ->lock, and only afterwards does rcu_advance_cbs_nowake()
    check to see if it is possible to advance callbacks without potentially
    needing to awaken the grace-period kthread. Given that the no-awaken
    check can be done locklessly, this commit reverses the order, so that
    rcu_advance_cbs_nowake() is invoked without holding the leaf rcu_node
    structure's ->lock and rcu_advance_cbs_nowake() checks the grace-period
    state before conditionally acquiring that lock, thus reducing the number
    of needless acquisitions of the leaf rcu_node structure's ->lock.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, when the square root of the number of CPUs is rounded down
    by int_sqrt(), this round-down is applied to the number of callback
    kthreads per grace-period kthreads. This makes almost no difference
    for large systems, but results in oddities such as three no-CBs
    grace-period kthreads for a five-CPU system, which is a bit excessive.
    This commit therefore causes the round-down to apply to the number of
    no-CBs grace-period kthreads, so that systems with from four to eight
    CPUs have only two no-CBs grace period kthreads.
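
    The arithmetic change can be sketched directly. The helper and function
    names below are illustrative only; int_sqrt_approx() models the kernel's
    int_sqrt() round-down behavior for small inputs.

```c
/* floor(sqrt(x)) for small non-negative x; models the kernel's int_sqrt(). */
static int int_sqrt_approx(int x)
{
        int r = 0;
        while ((r + 1) * (r + 1) <= x)
                r++;
        return r;
}

/* Old scheme: the round-down lands on the group size (CBs per GP kthread),
 * so the group count rounds up: five CPUs yield three GP kthreads. */
int ngroups_old(int nr_cpus)
{
        int stride = int_sqrt_approx(nr_cpus);
        return (nr_cpus + stride - 1) / stride;
}

/* New scheme: the round-down lands on the group count itself, so four to
 * eight CPUs all get exactly two no-CBs grace-period kthreads. */
int ngroups_new(int nr_cpus)
{
        return int_sqrt_approx(nr_cpus);
}
```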

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • A given rcu_data structure's ->nocb_lock can be acquired very frequently
    by the corresponding CPU and occasionally by the corresponding no-CBs
    grace-period and callbacks kthreads. In particular, these two kthreads
    will have frequent gaps between ->nocb_lock acquisitions that are roughly
    a grace period in duration. This means that any excessive ->nocb_lock
    contention will be due to the CPU's acquisitions, and this in turn
    enables a very naive contention-avoidance strategy to be quite effective.

    This commit therefore modifies rcu_nocb_lock() to first
    attempt a raw_spin_trylock(), and to atomically increment a
    separate ->nocb_lock_contended across a raw_spin_lock(). This new
    ->nocb_lock_contended field is checked in __call_rcu_nocb_wake() when
    interrupts are enabled, with a spin-wait for contending acquisitions
    to complete, thus allowing the kthreads a chance to acquire the lock.
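
    A user-space sketch of this contention-avoidance scheme follows. The
    kernel version uses raw_spin_trylock() on ->nocb_lock and a per-rcu_data
    ->nocb_lock_contended field; the atomic spinlock and all names below are
    hypothetical simplifications.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct nocb_lock {
        atomic_bool locked;
        atomic_int contended;           /* models ->nocb_lock_contended */
};

static struct nocb_lock demo_lock;      /* zero-initialized: unlocked */

static bool nocb_trylock(struct nocb_lock *nl)
{
        return !atomic_exchange(&nl->locked, true);
}

void rcu_nocb_lock_model(struct nocb_lock *nl)
{
        if (nocb_trylock(nl))
                return;                          /* fast path: uncontended */
        atomic_fetch_add(&nl->contended, 1);     /* advertise contention */
        while (!nocb_trylock(nl))                /* slow path: spin */
                ;
        atomic_fetch_sub(&nl->contended, 1);
}

void rcu_nocb_unlock_model(struct nocb_lock *nl)
{
        atomic_store(&nl->locked, false);
}

/* The CPU side can poll this with interrupts enabled and spin-wait until
 * contending kthreads have gotten through, as in __call_rcu_nocb_wake(). */
int nocb_lock_contended(struct nocb_lock *nl)
{
        return atomic_load(&nl->contended);
}
```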

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, the code provides an extra wakeup for the no-CBs grace-period
    kthread if one of its CPUs is generating excessive numbers of callbacks.
    But satisfying though it is to wake something up when things are going
    south, unless the thing being awakened can actually help solve the
    problem, that extra wakeup does nothing but consume additional CPU time,
    which is exactly what you don't want during a call_rcu() flood.

    This commit therefore avoids doing anything if the corresponding
    no-CBs callback kthread is going full tilt. Otherwise, if advancing
    callbacks immediately might help and if the leaf rcu_node structure's
    lock is immediately available, this commit invokes a new variant of
    rcu_advance_cbs() that advances callbacks only if doing so won't require
    awakening the grace-period kthread (not to be confused with any of the
    no-CBs grace-period kthreads).

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • It might be hard to imagine having more than two billion callbacks
    queued on a single CPU's ->cblist, but someone will do it sometime.
    This commit therefore makes __call_rcu_nocb_wake() handle this situation
    by upgrading local variable "len" from "int" to "long".
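
    A minimal demonstration of why the widening matters: a callback count
    above INT_MAX survives a long but is truncated by an int. The function
    names are illustrative, not the kernel's.

```c
/* Post-fix behavior: a long carries counts beyond two billion intact. */
long nocb_cb_count(long n_cbs)
{
        long len = n_cbs;
        return len;
}

/* Pre-fix behavior: squeezing the count through an int truncates it
 * (implementation-defined, but a wraparound on mainstream platforms). */
long nocb_cb_count_buggy(long n_cbs)
{
        int len = (int)n_cbs;
        return len;
}
```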

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, wake_nocb_gp_defer() simply stores whatever waketype was
    passed in, which can result in a RCU_NOCB_WAKE_FORCE being downgraded
    to RCU_NOCB_WAKE, which could in turn delay callback processing.
    This commit therefore adds a check so that wake_nocb_gp_defer() only
    updates ->nocb_defer_wakeup when the update increases the forcefulness,
    thus avoiding downgrades.
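
    The non-downgrading update reduces to a ratchet on the waketype enum,
    whose ordering encodes increasing forcefulness. This is a sketch of the
    check, not the kernel's wake_nocb_gp_defer() itself.

```c
/* Waketypes in order of increasing forcefulness, mirroring the kernel's
 * enum ordering. */
enum nocb_wake {
        RCU_NOCB_WAKE_NOT,
        RCU_NOCB_WAKE,
        RCU_NOCB_WAKE_FORCE,
};

/* Only ratchet ->nocb_defer_wakeup upward; a weaker request never
 * overwrites a stronger pending one. */
enum nocb_wake defer_wakeup_update(enum nocb_wake cur, enum nocb_wake req)
{
        return req > cur ? req : cur;
}
```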

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The __call_rcu_nocb_wake() function and its predecessors set
    ->qlen_last_fqs_check to zero for the first callback and to LONG_MAX / 2
    for forced reawakenings. The former can result in a too-quick reawakening
    when there are many callbacks ready to invoke and the latter prevents a
    second reawakening. This commit therefore sets ->qlen_last_fqs_check
    to the current number of callbacks in both cases. While in the area,
    this commit also moves both assignments under ->nocb_lock.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Historically, no-CBs CPUs allowed the scheduler-clock tick to be
    unconditionally disabled on any transition to idle or nohz_full userspace
    execution (see the rcu_needs_cpu() implementations). Unfortunately,
    the checks used by rcu_needs_cpu() are defeated now that no-CBs CPUs
    use ->cblist, which might make users of battery-powered devices rather
    unhappy. This commit therefore adds explicit rcu_segcblist_is_offloaded()
    checks to return to the historical energy-efficient semantics.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Some compilers complain that wait_gp_seq might be used uninitialized
    in nocb_gp_wait(). This cannot actually happen because when wait_gp_seq
    is uninitialized, needwait_gp must be false, which prevents wait_gp_seq
    from being used. But this analysis is apparently beyond some compilers,
    so this commit adds a bogus initialization of wait_gp_seq for the sole
    purpose of suppressing the false-positive warning.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit removes the obsolete nocb_q_count and nocb_q_count_lazy
    fields, also removing rcu_get_n_cbs_nocb_cpu(), adjusting
    rcu_get_n_cbs_cpu(), and making rcutree_migrate_callbacks() once again
    disable the ->cblist fields of offline CPUs.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently the RCU callbacks for no-CBs CPUs are queued on a series of
    ad-hoc linked lists, which means that these callbacks cannot benefit
    from "drive-by" grace periods, thus suffering needless delays prior
    to invocation. In addition, the no-CBs grace-period kthreads first
    wait for callbacks to appear and later wait for a new grace period,
    which means that callbacks appearing during a grace-period wait can
    be delayed. These delays increase memory footprint, and could even
    result in an out-of-memory condition.

    This commit therefore enqueues RCU callbacks from no-CBs CPUs on the
    rcu_segcblist structure that is already used by non-no-CBs CPUs. It also
    restructures the no-CBs grace-period kthread to be checking for incoming
    callbacks while waiting for grace periods. Also, instead of waiting
    for a new grace period, it waits for the closest grace period that will
    cause some of the callbacks to be safe to invoke. All of these changes
    reduce callback latency and thus the number of outstanding callbacks,
    in turn reducing the probability of an out-of-memory condition.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • As a first step towards making no-CBs CPUs use the ->cblist, this commit
    leaves the ->cblist enabled for these CPUs. The main reason to make
    no-CBs CPUs use ->cblist is to take advantage of callback numbering,
    which will reduce the effects of missed grace periods which in turn will
    reduce forward-progress problems for no-CBs CPUs.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The idea behind the checks for extended quiescent states at the end of
    __call_rcu_nocb() is to handle cases where call_rcu() is invoked directly
    from within an extended quiescent state, for example, from the idle loop.
    However, this will result in a timer-mediated deferred wakeup, which
    will cause the needed wakeup to happen within a jiffy or thereabouts.
    There should be no forward-progress concerns, and if there are, the proper
    response is to exit the extended quiescent state while executing the
    endless blast of call_rcu() invocations, for example, using RCU_NONIDLE().
    Given the more realistic case of an isolated call_rcu() invocation, there
    should be no problem.

    This commit therefore removes the checks for invoking call_rcu() within
    an extended quiescent state on no-CBs CPUs.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • RCU callback processing currently uses rcu_is_nocb_cpu() to determine
    whether or not the current CPU's callbacks are to be offloaded.
    This works, but it is not so good for cache locality. Plus use of
    ->cblist for offloaded callbacks will greatly increase the frequency
    of these checks. This commit therefore adds a ->offloaded flag to the
    rcu_segcblist structure to provide a more flexible and cache-friendly
    means of checking for callback offloading.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • NULLing the RCU_NEXT_TAIL pointer was a clever way to save a byte, but
    forward-progress considerations would require that this pointer be both
    NULL and non-NULL, which, absent a quantum-computer port of the Linux
    kernel, simply won't happen. This commit therefore creates a separate
    ->enabled flag to replace the current NULL checks.

    [ paulmck: Add include files per 0day test robot and -next. ]
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit causes the no-CBs grace-period/callback hierarchy to be
    printed to the console when the dump_tree kernel boot parameter is set.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit changes the name of the rcu_nocb_leader_stride kernel
    boot parameter to rcu_nocb_gp_stride in order to account for the new
    distinction between callback and grace-period no-CBs kthreads.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • The nocb_cb_wait() function traces a "FollowerSleep" trace_rcu_nocb_wake()
    event, which never was documented and is now misleading. This commit
    therefore changes "FollowerSleep" to "CBSleep", documents this, and
    updates the documentation for "Sleep" as well.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit renames rdp_leader to rdp_gp in order to account for the
    new distinction between callback and grace-period no-CBs kthreads.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit adjusts naming to account for the new distinction between
    callback and grace-period no-CBs kthreads.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit adjusts naming to account for the new distinction between
    callback and grace-period no-CBs kthreads. While in the area, it also
    updates local variables.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit adjusts naming to account for the new distinction between
    callback and grace-period no-CBs kthreads.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit adjusts naming to account for the new distinction between
    callback and grace-period no-CBs kthreads.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently, there is one no-CBs rcuo kthread per CPU, and these kthreads
    are divided into groups. The first rcuo kthread to come online in a
    given group is that group's leader, and the leader both waits for grace
    periods and invokes its CPU's callbacks. The non-leader rcuo kthreads
    only invoke callbacks.

    This works well in the real-time/embedded environments for which it was
    intended because such environments tend not to generate all that many
    callbacks. However, given huge floods of callbacks, it is possible for
    the leader kthread to be stuck invoking callbacks while its followers
    wait helplessly while their callbacks pile up. This is a good recipe
    for an OOM, and rcutorture's new callback-flood capability does generate
    such OOMs.

    One strategy would be to wait until such OOMs start happening in
    production, but similar OOMs have in fact happened starting in 2018.
    It would therefore be wise to take a more proactive approach.

    This commit therefore features per-CPU rcuo kthreads that do nothing
    but invoke callbacks. Instead of having one of these kthreads act as
    leader, each group has a separate rcuog kthread that handles grace periods
    for its group. Because these rcuog kthreads do not invoke callbacks,
    callback floods on one CPU no longer block callbacks from reaching the
    rcuc callback-invocation kthreads on other CPUs.

    This change does introduce additional kthreads, however:

    1. The number of additional kthreads is about the square root of
    the number of CPUs, so that a 4096-CPU system would have only
    about 64 additional kthreads. Note that recent changes
    decreased the number of rcuo kthreads by a factor of two
    (CONFIG_PREEMPT=n) or even three (CONFIG_PREEMPT=y), so
    this still represents a significant improvement on most systems.

    2. The leading "rcuo" of the rcuog kthreads should allow existing
    scripting to affinity these additional kthreads as needed, the
    same as for the rcuop and rcuos kthreads. (There are no longer
    any rcuob kthreads.)

    3. A state-machine approach was considered and rejected. Although
    this would allow the rcuo kthreads to continue their dual
    leader/follower roles, it complicates callback invocation
    and makes it more difficult to consolidate rcuo callback
    invocation with existing softirq callback invocation.

    The introduction of rcuog kthreads should thus be acceptable.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney