18 Nov, 2020

1 commit

  • During loss recovery, retransmitted packets are forced to use TCP
    timestamps to calculate the RTT samples, which have a millisecond
    granularity. BBR is designed using a microsecond granularity. As a
    result, multiple RTT samples could be truncated to the same RTT value
    during loss recovery. This is problematic, as BBR will not enter
    PROBE_RTT if the RTT sample is < the current
    min_rtt sample, rather than
    Signed-off-by: Neal Cardwell
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Yuchung Cheng
    Link: https://lore.kernel.org/r/20201116174412.1433277-1-sharpelletti.kdev@gmail.com
    Signed-off-by: Jakub Kicinski

    Ryan Sharpelletti
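
The granularity problem described above can be illustrated with a small userspace sketch. The helper names are hypothetical and are not taken from the kernel; the sketch only shows how distinct microsecond RTT samples collapse to a single value once they pass through a millisecond clock.

```c
#include <stdint.h>

/* Hypothetical helper: express a microsecond RTT at the granularity of a
 * millisecond timestamp clock (fractions of a millisecond are lost). */
uint32_t rtt_us_at_ms_granularity(uint32_t rtt_us)
{
    return (rtt_us / 1000) * 1000;
}

/* Returns 1 if two microsecond samples become indistinguishable after
 * truncation, 0 otherwise. */
int samples_collide(uint32_t a_us, uint32_t b_us)
{
    return rtt_us_at_ms_granularity(a_us) == rtt_us_at_ms_granularity(b_us);
}
```

With ties like these, whether the comparison against the current min_rtt is strict or non-strict decides whether a repeated value counts as a fresh minimum.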
     

21 Jan, 2020

1 commit

  • do_div() does a 64-by-32 division. Use div64_long() instead of it
    if the divisor is long, to avoid truncation to 32-bit.
    And as a nice side effect also cleans up the function a bit.

    Signed-off-by: Wen Yang
    Cc: Eric Dumazet
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: Hideaki YOSHIFUJI
    Cc: netdev@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Wen Yang
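
A minimal userspace sketch of the truncation hazard, assuming a 64-bit long. The two functions are stand-ins with hypothetical names: they model a 64-by-32 division and a full-width division, and are not the kernel's do_div() or div64_long().

```c
#include <assert.h>
#include <stdint.h>

/* Models a 64-by-32 division: the caller's divisor is narrowed to 32 bits
 * at the call site, so its upper bits are lost. */
uint64_t div_64_by_32(uint64_t dividend, uint32_t divisor)
{
    assert(divisor != 0);
    return dividend / divisor;
}

/* Models a division that keeps the full signed 64-bit divisor. */
int64_t div_64_by_64(int64_t dividend, int64_t divisor)
{
    assert(divisor != 0);
    return dividend / divisor;
}
```

For a divisor of 2^32 + 2, the narrowed variant divides by 2 instead, so the quotient is wrong by many orders of magnitude rather than slightly off.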
     

18 Dec, 2019

1 commit

  • sk->sk_pacing_shift can be read and written without lock
    synchronization. This patch adds annotations to
    document this fact and avoid future syzbot complaints.

    This might also avoid unexpected false sharing
    in sk_pacing_shift_update(), as the compiler
    could remove the conditional check and always
    write over sk->sk_pacing_shift :

    if (sk->sk_pacing_shift != val)
        sk->sk_pacing_shift = val;

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
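
The annotated update can be sketched in userspace as follows. READ_ONCE_U8 and WRITE_ONCE_U8 are simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE(), specialised to one byte, and struct fake_sock is a hypothetical placeholder for struct sock; the one-byte width of the field is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Volatile accesses: the compiler must perform exactly the loads and
 * stores written here and may not invent or drop any. */
#define READ_ONCE_U8(x)     (*(const volatile uint8_t *)&(x))
#define WRITE_ONCE_U8(x, v) (*(volatile uint8_t *)&(x) = (v))

struct fake_sock {
    uint8_t pacing_shift; /* stand-in for sk->sk_pacing_shift */
};

/* Returns 1 if a store was issued, 0 if the value was already current. */
int pacing_shift_update(struct fake_sock *sk, uint8_t val)
{
    assert(sk != NULL);
    if (READ_ONCE_U8(sk->pacing_shift) == val)
        return 0; /* skip the write so the cache line is not dirtied */
    WRITE_ONCE_U8(sk->pacing_shift, val);
    return 1;
}
```

Because the load and the store are both volatile, the conditional check cannot be optimised into an unconditional write.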
     

28 Sep, 2019

1 commit

  • There was a bug in the previous logic that attempted to ensure gain cycling
    gets inflight above BDP even for small BDPs. This code correctly raised and
    lowered target inflight values during the gain cycle. And this code
    correctly ensured that cwnd was raised when probing bandwidth. However, it
    did not correspondingly ensure that cwnd was *not* raised in this way when
    *not* probing for bandwidth. The result was that small-BDP flows that were
    always cwnd-bound could go for many cycles with a fixed cwnd, and not probe
    or yield bandwidth at all. This meant that multiple small-BDP flows could
    fail to converge in their bandwidth allocations.

    Fixes: 3c346b233c68 ("tcp_bbr: fix bw probing to raise in-flight data for very small BDPs")
    Signed-off-by: Kevin(Yudong) Yang
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Priyaranjan Jha
    Signed-off-by: David S. Miller

    Kevin(Yudong) Yang
     

31 Aug, 2019

1 commit


25 Jan, 2019

2 commits

  • Aggregation effects are extremely common with wifi, cellular, and cable
    modem link technologies, ACK decimation in middleboxes, and LRO and GRO
    in receiving hosts. The aggregation can happen in either direction,
    data or ACKs, but in either case the aggregation effect is visible
    to the sender in the ACK stream.

    Previously BBR's sending was often limited by cwnd under severe ACK
    aggregation/decimation because BBR sized the cwnd at 2*BDP. If packets
    were acked in bursts after long delays (e.g. one ACK acking 5*BDP after
    5*RTT), BBR's sending was halted after sending 2*BDP over 2*RTT, leaving
    the bottleneck idle for potentially long periods. Note that loss-based
    congestion control does not have this issue because when facing
    aggregation it continues increasing cwnd after bursts of ACKs, growing
    cwnd until the buffer is full.

    To achieve good throughput in the presence of aggregation effects, this
    algorithm allows the BBR sender to put extra data in flight to keep the
    bottleneck utilized during silences in the ACK stream that it has evidence
    to suggest were caused by aggregation.

    A summary of the algorithm: when a burst of packets are acked by a
    stretched ACK or a burst of ACKs or both, BBR first estimates the expected
    amount of data that should have been acked, based on its estimated
    bandwidth. Then the surplus ("extra_acked") is recorded in a windowed-max
    filter to estimate the recent level of observed ACK aggregation. Then cwnd
    is increased by the ACK aggregation estimate. The larger cwnd avoids BBR
    being cwnd-limited in the face of ACK silences that recent history suggests
    were caused by aggregation. As a sanity check, the ACK aggregation degree
    is upper-bounded by the cwnd (at the time of measurement) and a global max
    of BW * 100ms. The algorithm is further described by the following
    presentation:
    https://datatracker.ietf.org/meeting/101/materials/slides-101-iccrg-an-update-on-bbr-work-at-google-00

    In our internal testing, we observed a significant increase in BBR
    throughput (measured using netperf), in a basic wifi setup.
    - Host1 (sender on ethernet) -> AP -> Host2 (receiver on wifi)
    - 2.4 GHz -> BBR before: ~73 Mbps; BBR after: ~102 Mbps; CUBIC: ~100 Mbps
    - 5.0 GHz -> BBR before: ~362 Mbps; BBR after: ~593 Mbps; CUBIC: ~601 Mbps

    Also, this code is running globally on YouTube TCP connections and produced
    significant bandwidth increases for YouTube traffic.

    This is based on Ian Swett's max_ack_height_ algorithm from the
    QUIC BBR implementation.

    Signed-off-by: Priyaranjan Jha
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Priyaranjan Jha
     
  • Because bbr_target_cwnd() is really a general-purpose BBR helper for
    computing some volume of inflight data as a function of the estimated
    BDP, refactor it into following helper functions:
    - bbr_bdp()
    - bbr_quantization_budget()
    - bbr_inflight()

    Signed-off-by: Priyaranjan Jha
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Priyaranjan Jha
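
The ACK aggregation estimate summarised in the first commit of this date can be sketched roughly as below. This is a simplified illustration and not the kernel code: the struct and function names are hypothetical, bandwidth is given in packets per second, a plain running maximum replaces the windowed-max filter, and the global BW * 100ms bound is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct agg_state {
    uint64_t epoch_start_us; /* start of the current ACK sampling epoch */
    uint64_t epoch_acked;    /* packets acked since epoch_start_us */
    uint32_t max_extra;      /* running max of extra_acked (no aging) */
};

/* Process one ACK; returns this ACK's extra_acked sample in packets. */
uint32_t agg_update(struct agg_state *s, uint64_t now_us, uint32_t acked,
                    uint64_t bw_pps, uint32_t cwnd)
{
    uint64_t expected, extra;

    assert(s != NULL && now_us >= s->epoch_start_us);

    /* Data expected to be acked so far at the estimated bandwidth. */
    expected = bw_pps * (now_us - s->epoch_start_us) / 1000000;

    /* ACKs are arriving at or below the estimated rate: new epoch. */
    if (s->epoch_acked <= expected) {
        s->epoch_start_us = now_us;
        s->epoch_acked = 0;
        expected = 0;
    }

    s->epoch_acked += acked;
    extra = s->epoch_acked - expected; /* surplus over the expectation */
    if (extra > cwnd)
        extra = cwnd;                  /* sanity bound: at most cwnd */
    if (extra > s->max_extra)
        s->max_extra = (uint32_t)extra;
    return (uint32_t)extra;
}
```

A sender following this scheme would add max_extra to its cwnd target, so that a silence in the ACK stream of the size recently observed does not stall transmission.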
     

09 Nov, 2018

1 commit

  • Recently, in commit ab408b6dc744 ("tcp: switch tcp and sch_fq to new
    earliest departure time model"), the TCP BBR code switched to a new
    approach of using an explicit bbr_pacing_margin_percent for shaving a
    pacing rate "haircut", rather than the previous implicit
    approach. Update an old comment to reflect the new approach.

    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     

18 Oct, 2018

2 commits

  • Centralize the code that sets gains used for computing cwnd and pacing
    rate. This simplifies the code and makes it easier to change the state
    machine or (in the future) dynamically change the gain values and
    ensure that the correct gain values are always used.

    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Priyaranjan Jha
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Adjust TCP BBR for the new departure time pacing model in the recent
    commit ab408b6dc7449 ("tcp: switch tcp and sch_fq to new earliest
    departure time model").

    With TSQ and pacing at lower layers, there are often several skbs
    queued in the pacing layer, and thus there is less data "in the
    network" than "in flight".

    With departure time pacing at lower layers (e.g. fq or potential
    future NICs), the data in the pacing layer now has a pre-scheduled
    ("baked-in") departure time that cannot be changed, even if the
    congestion control algorithm decides to use a new pacing rate.

    This means that there can be a non-trivial lag between when BBR makes
    a pacing rate change and when the inter-skb pacing delays
    change. After a pacing rate change, the number of packets in the
    network can gradually evolve to be higher or lower, depending on
    whether the sending rate is higher or lower than the delivery
    rate. Thus ignoring this lag can cause significant overshoot, with the
    flow ending up with too many or too few packets in the network.

    This commit changes BBR to adapt its pacing rate based on the amount
    of data in the network that it estimates has already been "baked in"
    by previous departure time decisions. We estimate the number of our
    packets that will be in the network at the earliest departure time
    (EDT) for the next skb scheduled as:

    in_network_at_edt = inflight_at_edt - (EDT - now) * bw

    If we're increasing the amount of data in the network ("in_network"),
    then we want to know if the transmit of the EDT skb will push
    in_network above the target, so our answer includes
    bbr_tso_segs_goal() from the skb departing at EDT. If we're decreasing
    in_network, then we want to know if in_network will sink too low just
    before the EDT transmit, so our answer does not include the segments
    from the skb departing at EDT.

    Why do we treat pacing_gain > 1.0 case and pacing_gain < 1.0 case
    differently? The in_network curve is a step function: in_network goes
    up on transmits, and down on ACKs. To accurately predict when
    in_network will go beyond our target value, this will happen on
    different events, depending on whether we're concerned about
    in_network potentially going too high or too low:

    o if pushing in_network up (pacing_gain > 1.0),
    then in_network goes above target upon a transmit event

    o if pushing in_network down (pacing_gain < 1.0),
    then in_network goes below target upon an ACK event

    This commit changes the BBR state machine to use this estimated
    "packets in network" value to make its decisions.

    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
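
The in_network_at_edt estimate from the commit above can be sketched as follows. This is an illustration under assumptions, not the kernel function: the name and parameter list are hypothetical, bandwidth is in packets per second, and gains are fixed-point values with 256 representing 1.0.

```c
#include <assert.h>
#include <stdint.h>

#define GAIN_UNIT 256u /* fixed-point 1.0, an assumption of this sketch */

/* Estimate how many of our packets are in the network at the earliest
 * departure time (EDT) of the next skb. */
uint32_t in_network_at_edt(uint32_t inflight_now, uint64_t now_ns,
                           uint64_t edt_ns, uint64_t bw_pps,
                           uint32_t pacing_gain, uint32_t tso_segs)
{
    uint64_t interval_us, delivered;
    uint32_t inflight_at_edt = inflight_now;

    assert(pacing_gain > 0);
    if (edt_ns < now_ns)
        edt_ns = now_ns; /* an EDT in the past means "send now" */

    interval_us = (edt_ns - now_ns) / 1000;
    delivered = bw_pps * interval_us / 1000000; /* (EDT - now) * bw */

    /* When pushing in_network up, include the skb departing at EDT;
     * when draining, look at the level just before that transmit. */
    if (pacing_gain > GAIN_UNIT)
        inflight_at_edt += tso_segs;

    if (delivered >= inflight_at_edt)
        return 0;
    return (uint32_t)(inflight_at_edt - delivered);
}
```

The asymmetry mirrors the step-function argument: the upward threshold is crossed on a transmit, the downward one on an ACK.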
     

16 Oct, 2018

2 commits

  • There was a typo in this parameter name.

    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • sk_pacing_rate was introduced as a u32 field in 2013,
    effectively limiting per-flow pacing to 34Gbit.

    We believe it is time to allow TCP to pace high speed flows
    on 64bit hosts, as we now can reach 100Gbit on one TCP flow.

    This patch adds no cost for 32bit kernels.

    The tcpi_pacing_rate and tcpi_max_pacing_rate were already
    exported as 64bit, so the iproute2/ss command requires no changes.

    Unfortunately the SO_MAX_PACING_RATE socket option will stay
    32bit and we will need to add a new option to let applications
    control high pacing rates.

    State Recv-Q Send-Q Local Address:Port Peer Address:Port
    ESTAB 0 1787144 10.246.9.76:49992 10.246.9.77:36741
    timer:(on,003ms,0) ino:91863 sk:2
    skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0)
    ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
    rcvmss:536 advmss:1448
    cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177
    segs_in:3916318 data_segs_out:177279175
    bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
    send 28045.5Mbps lastrcv:73333
    pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
    busy:73333ms unacked:135 retrans:0/157 rcv_space:14480
    notsent:2085120 minrtt:0.013

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
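
The 34Gbit figure follows from simple arithmetic, assuming the field counts bytes per second (which is consistent with the number quoted). The helper names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a pacing rate in bytes/sec to bits/sec. */
uint64_t bytes_to_bits_per_sec(uint64_t bytes_per_sec)
{
    assert(bytes_per_sec <= UINT64_MAX / 8); /* guard against overflow */
    return bytes_per_sec * 8;
}

/* Would this rate survive being stored in a 32-bit field? */
int fits_in_u32(uint64_t bytes_per_sec)
{
    return bytes_per_sec <= UINT32_MAX;
}
```

The largest u32 value, 4294967295 bytes/sec, is about 34.36 Gbit/sec, while a 100Gbit flow needs 12.5e9 bytes/sec and does not fit.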
     

22 Sep, 2018

1 commit

  • TCP keeps track of tcp_wstamp_ns by itself, meaning sch_fq
    no longer has to do it.

    Thanks to this model, TCP can get more accurate RTT samples,
    since pacing no longer inflates them.

    This has the nice effect of removing some delays caused by the FQ
    quantum mechanism, which inflated max/P99 latencies.

    Also we might relax TCP Small Queue tight limits in the future,
    since this new model allows TCP to build bigger batches, because
    sch_fq (or a device with earliest departure time offload) ensures
    these packets will be delivered on time.

    Note that other protocols are not converted (they will probably
    never be), so sch_fq still supports SO_MAX_PACING_RATE.

    Tested:

    Test showing FQ pacing quantum artifact for low-rate flows,
    adding unexpected throttles for RPC flows, inflating max and P99 latencies.

    The parameters chosen here are to show what happens typically when
    a TCP flow has a reduced pacing rate (this can be caused by a reduced
    cwnd after a few losses, and/or an RTT above a few ms)

    MIBS="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY"
    Before :
    $ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
    MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
    Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
    19,82.78,5279,3825,482.02

    After :
    $ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
    MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
    Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
    20,49.94,128,63,3.18

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Aug, 2018

3 commits

  • This commit fixes a corner case where TCP BBR would enter PROBE_RTT
    mode but not reduce its cwnd. If a TCP receiver ACKed less than one
    full segment, the number of delivered/acked packets was 0, so that
    bbr_set_cwnd() would short-circuit and exit early, without cutting
    cwnd to the value we want for PROBE_RTT.

    The fix is to instead make sure that even when 0 full packets are
    ACKed, we do apply all the appropriate caps, including the cap that
    applies in PROBE_RTT mode.

    Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
    Signed-off-by: Kevin Yang
    Signed-off-by: Neal Cardwell
    Reviewed-by: Yuchung Cheng
    Reviewed-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Kevin Yang
     
  • This patch fixes the case where BBR does not exit PROBE_RTT mode when
    it restarts from idle. When BBR restarts from idle and if BBR is in
    PROBE_RTT mode, BBR should check if it's time to exit PROBE_RTT. If
    yes, then BBR should exit PROBE_RTT mode and restore the cwnd to its
    full value.

    Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
    Signed-off-by: Kevin Yang
    Signed-off-by: Neal Cardwell
    Reviewed-by: Yuchung Cheng
    Reviewed-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Kevin Yang
     
  • This patch adds a helper function bbr_check_probe_rtt_done() to
    1. check the condition to see if bbr should exit probe_rtt mode;
    2. process the logic of exiting probe_rtt mode.

    Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
    Signed-off-by: Kevin Yang
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Reviewed-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Kevin Yang
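
The cwnd corner case fixed in the first commit of this date can be sketched as below. The function is a hypothetical simplification and not bbr_set_cwnd() itself; the PROBE_RTT cap of 4 packets is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PROBE_RTT_CWND 4u /* assumed cwnd cap while in PROBE_RTT */

uint32_t sketch_set_cwnd(uint32_t cwnd, uint32_t acked, uint32_t target,
                         uint32_t clamp, bool in_probe_rtt)
{
    assert(clamp > 0);

    /* Move toward the target only when something was actually ACKed. */
    if (acked > 0)
        cwnd = (cwnd + acked < target) ? cwnd + acked : target;

    /* The caps below run unconditionally, even when acked == 0.
     * Returning early before this point was the essence of the bug. */
    if (cwnd > clamp)
        cwnd = clamp;
    if (in_probe_rtt && cwnd > PROBE_RTT_CWND)
        cwnd = PROBE_RTT_CWND;
    return cwnd;
}
```

With acked == 0 and PROBE_RTT active, the sketch still cuts cwnd to the cap, which is the behaviour the fix describes.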
     

03 Aug, 2018

1 commit


29 Jul, 2018

1 commit

  • For some very small BDPs (with just a few packets) there was a
    quantization effect where the target number of packets in flight
    during the super-unity-gain (1.25x) phase of gain cycling was
    implicitly truncated to a number of packets no larger than the normal
    unity-gain (1.0x) phase of gain cycling. This meant that in multi-flow
    scenarios some flows could get stuck with a lower bandwidth, because
    they did not push enough packets inflight to discover that there was
    more bandwidth available. This was really only an issue in multi-flow
    LAN scenarios, where RTTs and BDPs are low enough for this to be an
    issue.

    This fix ensures that gain cycling can raise inflight for small BDPs
    by ensuring that in PROBE_BW mode target inflight values with a
    super-unity gain are always greater than inflight values with a gain

    Acked-by: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Priyaranjan Jha
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
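
The quantization effect is easy to reproduce with integer arithmetic. The sketch below is not the kernel's code path; it only shows how a truncating fixed-point multiply can leave the 1.25x target equal to the 1.0x target for a very small BDP. The names and the fixed-point unit of 256 are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define GAIN_UNIT  256u                /* fixed-point 1.0 (assumption) */
#define GAIN_PROBE (GAIN_UNIT * 5 / 4) /* 1.25x */

/* Target inflight in packets, with the fractional part truncated. */
uint32_t target_truncated(uint32_t bdp_pkts, uint32_t gain)
{
    assert(gain > 0);
    return (uint32_t)(((uint64_t)bdp_pkts * gain) / GAIN_UNIT);
}
```

For a BDP of 3 packets, 3 * 1.25 = 3.75 truncates back to 3, so the probing phase puts no additional packet in flight; for a BDP of 100 the same computation yields 125.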
     

22 Jun, 2018

1 commit

  • This commit makes BBR use only the MSS (without any headers) to
    calculate pacing rates when internal TCP-layer pacing is used.

    This is necessary to achieve the correct pacing behavior in this case,
    since tcp_internal_pacing() uses only the payload length to calculate
    pacing delays.

    Signed-off-by: Kevin Yang
    Signed-off-by: Eric Dumazet
    Reviewed-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
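
Why the two sides must agree on packet size can be seen from the pacing delay arithmetic. The function below is a hypothetical sketch of a payload-based delay computation, not tcp_internal_pacing() itself.

```c
#include <assert.h>
#include <stdint.h>

/* Delay to wait after sending payload_len bytes at the given rate. */
uint64_t pacing_delay_ns(uint32_t payload_len, uint64_t rate_bytes_per_sec)
{
    assert(rate_bytes_per_sec > 0);
    return (uint64_t)payload_len * 1000000000ull / rate_bytes_per_sec;
}
```

If the rate for 1000 packets/sec is derived from a 1448-byte MSS (1448000 bytes/sec), a 1448-byte payload is spaced exactly 1 ms apart. If the rate were instead derived from a header-inclusive size, say 1500 bytes as an illustrative figure, the same payload would be spaced about 965 us apart and the flow would run faster than intended.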
     

02 May, 2018

1 commit

  • Previously the bbr->idle_restart tracking was zeroing out the
    bbr->idle_restart bit upon ACKs that did not SACK or ACK anything,
    e.g. receiving incoming data or receiver window updates. In such
    situations BBR would forget that this was a restart-from-idle
    situation, and if the min_rtt had expired it would unnecessarily enter
    PROBE_RTT (even though we were actually restarting from idle but had
    merely forgotten that fact).

    The fix is simple: we need to remember we are restarting from idle
    until we receive a S/ACK for some data (a S/ACK for the first flight
    of data we send as we are restarting).

    This commit is a stable candidate for kernels back as far as 4.9.

    Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Priyaranjan Jha
    Signed-off-by: Yousuk Seung
    Signed-off-by: David S. Miller

    Neal Cardwell
     

17 Mar, 2018

1 commit

  • Set tp->snd_ssthresh to BDP upon STARTUP exit. This allows us
    to check if a BBR flow exited STARTUP and the BDP at the
    time of STARTUP exit with SCM_TIMESTAMPING_OPT_STATS. Since BBR does not
    use snd_ssthresh this fix has no impact on BBR's behavior.

    Signed-off-by: Yousuk Seung
    Signed-off-by: Neal Cardwell
    Signed-off-by: Priyaranjan Jha
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Yousuk Seung
     

02 Mar, 2018

2 commits

  • Its value is computed then immediately used,
    there is no need to store it.

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This is the second part of dealing with suboptimal device gso parameters.
    In the first patch (350c9f484bde "tcp_bbr: better deal with suboptimal GSO")
    we dealt with devices having low gso_max_segs.

    Some devices lower gso_max_size from 64KB to 16KB (r8152 is an example).

    In order to probe an optimal cwnd, we want BBR not to be sensitive
    to whatever GSO constraint a device can have.

    This patch removes tso_segs_goal() CC callback in favor of
    min_tso_segs() for CC wanting to override sysctl_tcp_min_tso_segs

    Next patch will remove bbr->tso_segs_goal since it does not have
    to be persistent.

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Feb, 2018

1 commit

  • This commit fixes the pacing_gain to remain at BBR_UNIT (1.0) when
    using lt_bw and returning from the PROBE_RTT state to PROBE_BW.

    Previously, when using lt_bw, upon exiting PROBE_RTT and entering
    PROBE_BW the bbr_reset_probe_bw_mode() code could sometimes randomly
    end up with a cycle_idx of 0 and hence have bbr_advance_cycle_phase()
    set a pacing gain above 1.0. In such cases this would result in a
    pacing rate that is 1.25x higher than intended, potentially resulting
    in a high loss rate for a little while until we stop using the lt_bw a
    bit later.

    This commit is a stable candidate for kernels back as far as 4.9.

    Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Soheil Hassas Yeganeh
    Reported-by: Beyers Cronje
    Signed-off-by: David S. Miller

    Neal Cardwell
     

20 Jan, 2018

1 commit

  • A persistent connection may send a tiny amount of data (e.g. health-check)
    for a long period of time. BBR's windowed min RTT filter may only see
    RTT samples from delayed ACKs, causing BBR to grossly over-estimate
    the path delay depending on how much the ACK was delayed at the receiver.

    This patch skips RTT samples that are likely coming from delayed ACKs. Note
    that it is possible the sender never obtains a valid measure to set the
    min RTT. In this case BBR will continue to set cwnd to initial window
    which seems fine because the connection is a thin stream.

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Priyaranjan Jha
    Signed-off-by: David S. Miller

    Yuchung Cheng
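
One way to picture the filter change is the sketch below. It is an assumption-laden illustration, not the kernel logic: the struct and function names are hypothetical, and the rule that a sample lower than the current minimum is always accepted (since ACK delay can only inflate an RTT) is a choice made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct min_rtt_filter {
    uint32_t min_rtt_us; /* current windowed minimum */
};

/* Returns true if the sample replaced the stored minimum. */
bool min_rtt_update(struct min_rtt_filter *f, uint32_t rtt_us,
                    bool ack_delayed, bool window_expired)
{
    assert(f != NULL);
    /* A lower sample always wins. When the window has expired, only a
     * sample that is not from a delayed ACK may refresh the minimum. */
    if (rtt_us < f->min_rtt_us || (window_expired && !ack_delayed)) {
        f->min_rtt_us = rtt_us;
        return true;
    }
    return false;
}
```

Under this rule a health-check connection whose ACKs are all delayed keeps its old minimum rather than adopting an inflated one.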
     

09 Dec, 2017

3 commits

  • Fix BBR so that upon notification of a loss recovery undo BBR resets
    long-term bandwidth sampling.

    Under high reordering, reordering events can be interpreted as loss.
    If the reordering and spurious loss estimates are high enough, this
    can cause BBR to spuriously estimate that we are seeing loss rates
    high enough to trigger long-term bandwidth estimation. To avoid that
    problem, this commit resets long-term bandwidth sampling on loss
    recovery undo events.

    Signed-off-by: Neal Cardwell
    Reviewed-by: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Fix BBR so that upon notification of a loss recovery undo BBR resets
    the full pipe detection (STARTUP exit) state machine.

    Under high reordering, reordering events can be interpreted as loss.
    If the reordering and spurious loss estimates are high enough, this
    could previously cause BBR to spuriously estimate that the pipe is
    full.

    Since spurious loss recovery means that our overall sending will have
    slowed down spuriously, this commit gives a flow more time to probe
    robustly for bandwidth and decide the pipe is really full.

    Signed-off-by: Neal Cardwell
    Reviewed-by: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • This commit records the "full bw reached" decision in a new
    full_bw_reached bit. This is a pure refactor that does not change the
    current behavior, but enables subsequent fixes and improvements.

    In particular, this enables simple and clean fixes because the full_bw
    and full_bw_cnt can be unconditionally zeroed without worrying about
    forgetting that we estimated we filled the pipe in Startup. And it
    enables future improvements because multiple code paths can be used
    for estimating that we filled the pipe in Startup; any new code paths
    only need to set this bit when they think the pipe is full.

    Note that this fix intentionally reduces the width of the full_bw_cnt
    counter, since we have never used the most significant bit.

    Signed-off-by: Neal Cardwell
    Reviewed-by: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
     

16 Jul, 2017

5 commits

  • Fixes the following behavior: for connections that had no RTT sample
    at the time of initializing congestion control, BBR was initializing
    the pacing rate to a high nominal rate (based on a guess of RTT=1ms,
    in case this is LAN traffic). Then BBR never adjusted the pacing rate
    downward upon obtaining an actual RTT sample, if the connection never
    filled the pipe (e.g. all sends were small app-limited writes()).

    This fix adjusts the pacing rate upon obtaining the first RTT sample.

    Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Fix a corner case noticed by Eric Dumazet, where BBR's setting
    sk->sk_pacing_rate to 0 during initialization could theoretically
    cause packets in the sending host to hang if there were packets "in
    flight" in the pacing infrastructure at the time the BBR congestion
    control state is initialized. This could occur if the pacing
    infrastructure happened to race with bbr_init() in a way such that the
    pacer read the 0 rather than the immediately following non-zero pacing
    rate.

    Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
    Reported-by: Eric Dumazet
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Introduce a helper to initialize the BBR pacing rate unconditionally,
    based on the current cwnd and RTT estimate. This is a pure refactor,
    but is needed for two following fixes.

    Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Introduce a helper to convert a BBR bandwidth and gain factor to a
    pacing rate in bytes per second. This is a pure refactor, but is
    needed for two following fixes.

    Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • In bbr_set_pacing_rate(), which decides whether to cut the pacing
    rate, there was some code that considered exiting STARTUP to be
    equivalent to the notion of filling the pipe (i.e.,
    bbr_full_bw_reached()). Specifically, as the code was structured,
    exiting STARTUP and going into PROBE_RTT could cause us to cut the
    pacing rate down to something silly and low, based on whatever
    bandwidth samples we've had so far, when it's possible that all of
    them have been small app-limited bandwidth samples that are not
    representative of the bandwidth available in the path. (The code was
    correct at the time it was written, but the state machine changed
    without this spot being adjusted correspondingly.)

    Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
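
The two helpers introduced in this series (bandwidth and gain to pacing rate, and pacing-rate initialisation from cwnd and RTT) can be sketched together as below. The names, the packets-per-second bandwidth unit, and the fixed-point gain unit of 256 are assumptions of the sketch; any pacing margin or startup gain the kernel applies is left out.

```c
#include <assert.h>
#include <stdint.h>

#define GAIN_UNIT      256u  /* fixed-point 1.0 (assumption) */
#define NOMINAL_RTT_US 1000u /* the RTT=1ms guess from the commit text */

/* Convert a bandwidth (packets/sec) and gain to bytes per second. */
uint64_t rate_bytes_per_sec(uint64_t bw_pps, uint32_t mss, uint32_t gain)
{
    assert(mss > 0 && gain > 0);
    return bw_pps * mss * gain / GAIN_UNIT;
}

/* Initial pacing rate from cwnd and RTT; rtt_us == 0 means that no RTT
 * sample exists yet, so the nominal guess is used instead. */
uint64_t init_pacing_rate(uint32_t cwnd, uint32_t mss, uint32_t rtt_us,
                          uint32_t gain)
{
    uint64_t bw_pps;

    if (rtt_us == 0)
        rtt_us = NOMINAL_RTT_US;
    bw_pps = (uint64_t)cwnd * 1000000u / rtt_us;
    return rate_bytes_per_sec(bw_pps, mss, gain);
}
```

With cwnd = 10 and MSS = 1448, the nominal guess gives 14480000 bytes/sec, whereas a real 50 ms RTT gives 289600 bytes/sec. That factor of 50 is why the rate needs adjusting once the first RTT sample arrives.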
     

18 May, 2017

2 commits

  • TCP Timestamps option is defined in RFC 7323

    Traditionally on linux, it has been tied to the internal
    'jiffies' variable, because it had been a cheap and good enough
    generator.

    For TCP flows on the Internet, 1 ms resolution would be much better
    than 4ms or 10ms (HZ=250 or HZ=100 respectively)

    For TCP flows in the DC, Google has used usec resolution for more
    than two years with great success [1]

    Receive size autotuning (DRS) is indeed more precise and converges
    faster to optimal window size.

    This patch converts tp->tcp_mstamp to a plain u64 value storing
    a 1 usec TCP clock.

    This choice will allow us to upstream the 1 usec TS option as
    discussed in IETF 97.

    [1] https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Use tcp_jiffies32 instead of tcp_time_stamp, since
    tcp_time_stamp will soon be only used for TCP TS option.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 May, 2017

1 commit

  • BBR congestion control depends on pacing, and pacing is
    currently handled by sch_fq packet scheduler for performance reasons,
    and also because implementing pacing with FQ was convenient to truly
    avoid bursts.

    However there are many cases where this packet scheduler constraint
    is not practical.
    - Many linux hosts are not focusing on handling thousands of TCP
    flows in the most efficient way.
    - Some routers use fq_codel or other AQM, but still would like
    to use BBR for the few TCP flows they initiate/terminate.

    This patch implements an automatic fallback to internal pacing.

    Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.

    If sch_fq happens to be in the egress path, pacing is delegated to
    the qdisc, otherwise pacing is done by TCP itself.

    One advantage of pacing from TCP stack is to get more precise rtt
    estimations, and less work done from TX completion, since TCP Small
    queue limits are not generally hit. Setups with single TX queue but
    many cpus might even benefit from this.

    Note that unlike sch_fq, we do not take into account header sizes.
    Taking care of these headers would add additional complexity for
    no practical differences in behavior.

    Some performance numbers using 800 TCP_STREAM flows rate limited to
    ~48 Mbit per second on 40Gbit NIC.

    If MQ+pfifo_fast is used on the NIC :

    $ sar -n DEV 1 5 | grep eth
    14:48:44 eth0 725743.00 2932134.00 46776.76 4335184.68 0.00 0.00 1.00
    14:48:45 eth0 725349.00 2932112.00 46751.86 4335158.90 0.00 0.00 0.00
    14:48:46 eth0 725101.00 2931153.00 46735.07 4333748.63 0.00 0.00 0.00
    14:48:47 eth0 725099.00 2931161.00 46735.11 4333760.44 0.00 0.00 1.00
    14:48:48 eth0 725160.00 2931731.00 46738.88 4334606.07 0.00 0.00 0.00
    Average: eth0 725290.40 2931658.20 46747.54 4334491.74 0.00 0.00 0.40
    $ vmstat 1 5
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    4 0 0 259825920 45644 2708324 0 0 21 2 247 98 0 0 100 0 0
    4 0 0 259823744 45644 2708356 0 0 0 0 2400825 159843 0 19 81 0 0
    0 0 0 259824208 45644 2708072 0 0 0 0 2407351 159929 0 19 81 0 0
    1 0 0 259824592 45644 2708128 0 0 0 0 2405183 160386 0 19 80 0 0
    1 0 0 259824272 45644 2707868 0 0 0 32 2396361 158037 0 19 81 0 0

    Now use MQ+FQ :

    lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
    lpaa23:~# tc qdisc replace dev eth0 root mq

    $ sar -n DEV 1 5 | grep eth
    14:49:57 eth0 678614.00 2727930.00 43739.13 4033279.14 0.00 0.00 0.00
    14:49:58 eth0 677620.00 2723971.00 43674.69 4027429.62 0.00 0.00 1.00
    14:49:59 eth0 676396.00 2719050.00 43596.83 4020125.02 0.00 0.00 0.00
    14:50:00 eth0 675197.00 2714173.00 43518.62 4012938.90 0.00 0.00 1.00
    14:50:01 eth0 676388.00 2719063.00 43595.47 4020171.64 0.00 0.00 0.00
    Average: eth0 676843.00 2720837.40 43624.95 4022788.86 0.00 0.00 0.40
    $ vmstat 1 5
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    2 0 0 259832240 46008 2710912 0 0 21 2 223 192 0 1 99 0 0
    1 0 0 259832896 46008 2710744 0 0 0 0 1702206 198078 0 17 82 0 0
    0 0 0 259830272 46008 2710596 0 0 0 0 1696340 197756 1 17 83 0 0
    4 0 0 259829168 46024 2710584 0 0 16 0 1688472 197158 1 17 82 0 0
    3 0 0 259830224 46024 2710408 0 0 0 0 1692450 197212 0 18 82 0 0

    As expected, number of interrupts per second is very different.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Cc: Van Jacobson
    Cc: Jerry Chu
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Oct, 2016

1 commit


21 Sep, 2016

1 commit

  • This commit implements a new TCP congestion control algorithm: BBR
    (Bottleneck Bandwidth and RTT). A detailed description of BBR will be
    published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
    "BBR: Congestion-Based Congestion Control".

    BBR has significantly increased throughput and reduced latency for
    connections on Google's internal backbone networks and google.com and
    YouTube Web servers.

    BBR requires only changes on the sender side, not in the network or
    the receiver side. Thus it can be incrementally deployed on today's
    Internet, or in datacenters.

    The Internet has predominantly used loss-based congestion control
    (largely Reno or CUBIC) since the 1980s, relying on packet loss as the
    signal to slow down. While this worked well for many years, loss-based
    congestion control is unfortunately outdated in today's networks. On
    today's Internet, loss-based congestion control causes the infamous
    bufferbloat problem, often causing seconds of needless queuing delay,
    since it fills the bloated buffers in many last-mile links. On today's
    high-speed long-haul links using commodity switches with shallow
    buffers, loss-based congestion control has abysmal throughput because
    it over-reacts to losses caused by transient traffic bursts.

    In 1981 Kleinrock and Gail showed that the optimal operating point for
    a network maximizes delivered bandwidth while minimizing delay and
    loss, not only for single connections but for the network as a
    whole. Finding that optimal operating point has been elusive, since
    any single network measurement is ambiguous: network measurements are
    the result of both bandwidth and propagation delay, and those two
    cannot be measured simultaneously.

    While it is impossible to disambiguate any single bandwidth or RTT
    measurement, a connection's behavior over time tells a clearer
    story. BBR uses a measurement strategy designed to resolve this
    ambiguity. It combines these measurements with a robust servo loop
    using recent control systems advances to implement a distributed
    congestion control algorithm that reacts to actual congestion, not
    packet loss or transient queue delay, and is designed to converge with
    high probability to a point near the optimal operating point.

    In a nutshell, BBR creates an explicit model of the network pipe by
    sequentially probing the bottleneck bandwidth and RTT. On the arrival
    of each ACK, BBR derives the current delivery rate of the last round
    trip, and feeds it through a windowed max-filter to estimate the
    bottleneck bandwidth. Conversely, it uses a windowed min-filter to
    estimate the round trip propagation delay. The max-filtered bandwidth
    and min-filtered RTT estimates form BBR's model of the network pipe.
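
    The two filters described above can be sketched in a few lines of
    Python. This is not the kernel code: the kernel uses a compact
    fixed-size filter (lib/win_minmax.c), whereas this sketch swaps in a
    simple monotonic deque. The class name, the window lengths (10 round
    trips for bandwidth, 10 seconds for RTT), and all sample values are
    assumptions chosen for illustration.

```python
from collections import deque


class WindowedFilter:
    """Return the best sample seen within the last `window` time units."""

    def __init__(self, window, better):
        self.window = window
        self.better = better    # better(a, b) is True if a beats b
        self.samples = deque()  # (time, value) pairs, best value at the left

    def update(self, now, value):
        # Older samples that do not beat the new one can never win again.
        while self.samples and not self.better(self.samples[-1][1], value):
            self.samples.pop()
        self.samples.append((now, value))
        # Expire samples that have fallen out of the window.
        while self.samples[0][0] <= now - self.window:
            self.samples.popleft()
        return self.samples[0][1]


# Max-filter over delivery-rate samples, indexed by round trip.
bw_filter = WindowedFilter(window=10, better=lambda a, b: a > b)
# Min-filter over RTT samples, indexed by wall-clock seconds.
rtt_filter = WindowedFilter(window=10.0, better=lambda a, b: a < b)

# Hypothetical delivery-rate samples in Mbit/s, one per round trip.
for rnd, rate in enumerate([8.0, 9.5, 10.0, 9.0, 7.0]):
    btl_bw = bw_filter.update(rnd, rate)

# Hypothetical RTT samples in seconds.
for t, rtt in [(0.0, 0.045), (1.0, 0.040), (2.0, 0.060)]:
    min_rtt = rtt_filter.update(t, rtt)

print(btl_bw, min_rtt)
```

    Once the 10.0 Mbit/s sample ages out of the window, the max-filter
    falls back to the next-best recent sample, which is how the model
    follows a bottleneck whose bandwidth has dropped.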

    Using its model, BBR sets control parameters to govern sending
    behavior. The primary control is the pacing rate: BBR applies a gain
    multiplier to transmit faster or slower than the observed bottleneck
    bandwidth. The conventional congestion window (cwnd) is now the
    secondary control; the cwnd is set to a small multiple of the
    estimated BDP (bandwidth-delay product) in order to allow full
    utilization and bandwidth probing while bounding the potential amount
    of queue at the bottleneck.
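
    As a rough sketch of how the two controls follow from the model, the
    Python below derives a pacing rate and a cwnd from a bandwidth and
    RTT estimate. The function names, the cwnd gain of 2, the 4-packet
    floor, and the 1500-byte packet size are assumptions made for this
    example; the text above only says "a small multiple".

```python
import math


def bdp_bytes(btl_bw_bps, min_rtt_s):
    """Bandwidth-delay product in bytes."""
    return btl_bw_bps * min_rtt_s / 8


def pacing_rate_bps(btl_bw_bps, pacing_gain):
    """Primary control: send at gain times the estimated bottleneck rate."""
    return pacing_gain * btl_bw_bps


def cwnd_packets(btl_bw_bps, min_rtt_s, cwnd_gain=2.0, mss=1500, floor=4):
    """Secondary control: cap inflight data at a small multiple of the BDP."""
    bdp = bdp_bytes(btl_bw_bps, min_rtt_s)
    return max(floor, math.ceil(cwnd_gain * bdp / mss))


# Hypothetical path: 10 Mbit/s bottleneck, 40 ms round-trip propagation delay.
bdp = bdp_bytes(10_000_000, 0.040)
rate = pacing_rate_bps(10_000_000, 1.25)
cwnd = cwnd_packets(10_000_000, 0.040)
print(bdp, rate, cwnd)
```

    For this path the BDP is 50,000 bytes, so a gain of 2 allows about
    67 packets in flight, while a probing gain of 1.25 paces at
    12.5 Mbit/s.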

    When a BBR connection starts, it enters STARTUP mode and applies a
    high gain to perform an exponential search to quickly probe the
    bottleneck bandwidth (doubling its sending rate each round trip, like
    slow start). However, instead of continuing until it fills up the
    buffer (i.e. a loss), or until delay or ACK spacing reaches some
    threshold (like Hystart), it uses its model of the pipe to estimate
    when that pipe is full: it estimates the pipe is full when it notices
    the estimated bandwidth has stopped growing. At that point it exits
    STARTUP and enters DRAIN mode, where it reduces its pacing rate to
    drain the queue it estimates it has created.
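
    The "bandwidth has stopped growing" test can be sketched as follows.
    The growth threshold (25%) and the number of rounds (3) are not
    stated in this commit message; they are chosen here to mirror the
    constants in net/ipv4/tcp_bbr.c. The class and method names are
    hypothetical.

```python
class FullPipeDetector:
    """Declare the pipe full once bandwidth stops growing for a few rounds."""

    def __init__(self, growth_thresh=1.25, max_stalled_rounds=3):
        self.growth_thresh = growth_thresh
        self.max_stalled_rounds = max_stalled_rounds
        self.full_bw = 0.0       # best bandwidth seen at the last growth step
        self.stalled_rounds = 0  # consecutive rounds without enough growth

    def on_round(self, max_bw):
        """Call once per round trip; returns True when STARTUP should end."""
        if max_bw >= self.full_bw * self.growth_thresh:
            # Still growing quickly: remember the new baseline, start over.
            self.full_bw = max_bw
            self.stalled_rounds = 0
            return False
        self.stalled_rounds += 1
        return self.stalled_rounds >= self.max_stalled_rounds


# Hypothetical max-filtered bandwidth estimates (Mbit/s), one per round.
detector = FullPipeDetector()
estimates = [1.0, 2.0, 4.0, 8.0, 9.0, 9.5, 9.8]
decisions = [detector.on_round(bw) for bw in estimates]
print(decisions)
```

    In this trace the estimate doubles for four rounds, then grows by
    less than 25% for three consecutive rounds, so only the last call
    reports a full pipe.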

    Then BBR enters steady state. In steady state, PROBE_BW mode cycles
    between first pacing faster to probe for more bandwidth, then pacing
    slower to drain any queue that was created if no more bandwidth was
    available, and then cruising at the estimated bandwidth to utilize the
    pipe without creating excess queue. Occasionally, on an as-needed
    basis, it sends significantly slower to probe for RTT (PROBE_RTT
    mode).
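
    The PROBE_BW cycle described above can be pictured as a repeating
    table of pacing gains. The eight-phase table below (one probing
    phase, one draining phase, six cruising phases) is an assumption
    modelled on the gain table in net/ipv4/tcp_bbr.c, not something this
    commit message spells out; the function name is hypothetical.

```python
import itertools

# Probe faster, drain slower, then cruise at the estimated bandwidth.
PROBE_BW_GAINS = [1.25, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]


def probe_bw_pacing_rates(btl_bw, phases):
    """Pacing rate for each of the next `phases` phases of the gain cycle."""
    gains = itertools.islice(itertools.cycle(PROBE_BW_GAINS), phases)
    return [gain * btl_bw for gain in gains]


# Hypothetical 10 Mbit/s bottleneck estimate, ten consecutive phases.
rates = probe_bw_pacing_rates(10.0, 10)
mean_gain = sum(PROBE_BW_GAINS) / len(PROBE_BW_GAINS)
print(rates, mean_gain)
```

    The gains average to exactly 1.0 over a full cycle, so the
    connection sends at the estimated bandwidth on average while still
    probing for more.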

    BBR has been fully deployed on Google's wide-area backbone networks
    and we're experimenting with BBR on Google.com and YouTube on a global
    scale. Replacing CUBIC with BBR has resulted in significant
    improvements in network latency and application (RPC, browser, and
    video) metrics. For more details please refer to our upcoming ACM
    Queue publication.

    Example performance results, to illustrate the difference between BBR
    and CUBIC:

    Resilience to random loss (e.g. from shallow buffers):
    Consider a netperf TCP_STREAM test lasting 30 secs on an emulated
    path with a 10Gbps bottleneck, 100ms RTT, and 1% packet loss
    rate. CUBIC gets 3.27 Mbps, and BBR gets 9150 Mbps (2798x higher).

    Low latency with the bloated buffers common in today's last-mile links:
    Consider a netperf TCP_STREAM test lasting 120 secs on an emulated
    path with a 10Mbps bottleneck, 40ms RTT, and 1000-packet bottleneck
    buffer. Both fully utilize the bottleneck bandwidth, but BBR
    achieves this with a median RTT 25x lower (43 ms instead of 1.09
    secs).
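
    The ratios quoted above can be checked with a few lines of Python,
    and the same arithmetic suggests why the CUBIC RTT is so high in the
    second test. The 1500-byte packet size is an assumption made for
    this estimate; the other figures come from the text above.

```python
# Random-loss test: throughput ratio.
cubic_mbps, bbr_mbps = 3.27, 9150
speedup = bbr_mbps / cubic_mbps

# Bloated-buffer test: median RTT ratio.
cubic_rtt_s, bbr_rtt_s = 1.09, 0.043
rtt_ratio = cubic_rtt_s / bbr_rtt_s

# Queuing delay of a full 1000-packet buffer on a 10 Mbit/s link.
link_bps = 10e6
base_rtt_s = 0.040
buffer_packets = 1000
packet_bytes = 1500  # assumed packet size, not given in the text
full_buffer_delay_s = buffer_packets * packet_bytes * 8 / link_bps
bdp_packets = link_bps * base_rtt_s / 8 / packet_bytes

print(f"throughput ratio: {speedup:.0f}x")
print(f"RTT ratio: {rtt_ratio:.0f}x")
print(f"full buffer adds {full_buffer_delay_s:.2f} s; "
      f"BDP is about {bdp_packets:.0f} packets")
```

    Under that assumption a full buffer adds about 1.2 seconds of
    queuing delay on top of the 40 ms base RTT, which is in line with
    the 1.09-second median reported for CUBIC. The BDP is only about 33
    packets, so the buffer is roughly 30 times larger than the pipe.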

    Our long-term goal is to improve the congestion control algorithms
    used on the Internet. We are hopeful that BBR can help advance the
    efforts toward this goal, and motivate the community to do further
    research.

    Test results, performance evaluations, feedback, and BBR-related
    discussions are very welcome in the public e-mail list for BBR:

    https://groups.google.com/forum/#!forum/bbr-dev

    NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
    enabled, since pacing is integral to the BBR design and
    implementation. BBR without pacing would not function properly, and
    may incur unnecessarily high packet loss rates.

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell