03 Aug, 2018

1 commit

  • [ Upstream commit 383d470936c05554219094a4d364d964cb324827 ]

    For some very small BDPs (with just a few packets) there was a
    quantization effect where the target number of packets in flight
    during the super-unity-gain (1.25x) phase of gain cycling was
    implicitly truncated to a number of packets no larger than the normal
    unity-gain (1.0x) phase of gain cycling. This meant that in multi-flow
    scenarios some flows could get stuck with a lower bandwidth, because
    they did not push enough packets inflight to discover that there was
    more bandwidth available. This was really only an issue in multi-flow
    LAN scenarios, where RTTs and BDPs are low enough for this to be an
    issue.

    This fix ensures that gain cycling can raise inflight for small BDPs
    by ensuring that in PROBE_BW mode target inflight values with a
    super-unity gain are always greater than inflight values with a gain
    of 1.0 or less.
    Acked-by: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Priyaranjan Jha
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Neal Cardwell
     

19 May, 2018

1 commit

  • [ Upstream commit e6e6a278b1eaffa19d42186bfacd1ffc15a50b3f ]

    Previously the bbr->idle_restart tracking was zeroing out the
    bbr->idle_restart bit upon ACKs that did not SACK or ACK anything,
    e.g. receiving incoming data or receiver window updates. In such
    situations BBR would forget that this was a restart-from-idle
    situation, and if the min_rtt had expired it would unnecessarily enter
    PROBE_RTT (even though we were actually restarting from idle but had
    merely forgotten that fact).

    The fix is simple: we need to remember that we are restarting from
    idle until we receive an S/ACK for some data (an S/ACK for the first
    flight of data we send as we are restarting).
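
    A rough sketch of that rule, with standalone simplified types rather
    than the kernel's structs: the restart-from-idle flag is cleared
    only once the rate sample for an incoming ACK shows newly S/ACKed
    data.

    struct rate_sample_sketch {
            int delivered;                 /* packets newly S/ACKed by this ACK */
    };

    struct bbr_sketch {
            unsigned int idle_restart:1;   /* restarting after an idle period? */
    };

    static void bbr_note_ack(struct bbr_sketch *bbr,
                             const struct rate_sample_sketch *rs)
    {
            /* Previously cleared unconditionally; now keep the flag until the
             * first flight sent after idle is actually S/ACKed.
             */
            if (rs->delivered > 0)
                    bbr->idle_restart = 0;
    }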

    This commit is a stable candidate for kernels back as far as 4.9.

    Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Priyaranjan Jha
    Signed-off-by: Yousuk Seung
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Neal Cardwell
     

13 Feb, 2018

1 commit

  • [ Upstream commit 3aff3b4b986e51bcf4ab249e5d48d39596e0df6a ]

    This commit fixes the pacing_gain to remain at BBR_UNIT (1.0) when
    using lt_bw and returning from the PROBE_RTT state to PROBE_BW.

    Previously, when using lt_bw, upon exiting PROBE_RTT and entering
    PROBE_BW the bbr_reset_probe_bw_mode() code could sometimes randomly
    end up with a cycle_idx of 0 and hence have bbr_advance_cycle_phase()
    set a pacing gain above 1.0. In such cases this would result in a
    pacing rate that is 1.25x higher than intended, potentially resulting
    in a high loss rate for a little while until we stop using the lt_bw a
    bit later.
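
    A minimal sketch of the intent (simplified struct, illustrative
    helper name): while the long-term bandwidth estimate (lt_bw) is in
    use, pin pacing_gain to BBR_UNIT instead of reading it from the
    gain-cycle table at cycle_idx.

    #define BBR_SCALE 8
    #define BBR_UNIT  (1 << BBR_SCALE)
    #define CYCLE_LEN 8

    /* PROBE_BW pacing-gain cycle: probe at 5/4, drain at 3/4, then cruise. */
    static const int bbr_pacing_gain_tbl[CYCLE_LEN] = {
            BBR_UNIT * 5 / 4, BBR_UNIT * 3 / 4,
            BBR_UNIT, BBR_UNIT, BBR_UNIT, BBR_UNIT, BBR_UNIT, BBR_UNIT,
    };

    struct bbr_sketch {
            unsigned int lt_use_bw:1;      /* currently using the lt_bw estimate? */
            unsigned int cycle_idx;
            int pacing_gain;
    };

    static void bbr_pick_probe_bw_gain(struct bbr_sketch *bbr)
    {
            /* While lt_bw is in use, never pace above the long-term estimate. */
            bbr->pacing_gain = bbr->lt_use_bw ? BBR_UNIT
                                              : bbr_pacing_gain_tbl[bbr->cycle_idx];
    }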

    This commit is a stable candidate for kernels back as far as 4.9.

    Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Soheil Hassas Yeganeh
    Reported-by: Beyers Cronje
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Neal Cardwell
     

03 Jan, 2018

3 commits

  • [ Upstream commit 600647d467c6d04b3954b41a6ee1795b5ae00550 ]

    Fix BBR so that upon notification of a loss recovery undo BBR resets
    long-term bandwidth sampling.

    Under high reordering, reordering events can be interpreted as loss.
    If the reordering and spurious loss estimates are high enough, this
    can cause BBR to spuriously estimate that we are seeing loss rates
    high enough to trigger long-term bandwidth estimation. To avoid that
    problem, this commit resets long-term bandwidth sampling on loss
    recovery undo events.
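
    Sketched below with a simplified, standalone state struct: on a loss
    recovery undo, the long-term ("lt") sampling state is simply thrown
    away so a spurious policer-style estimate cannot stick.

    struct bbr_lt_sketch {
            unsigned int lt_use_bw:1;      /* using the long-term bw estimate? */
            unsigned int lt_is_sampling:1; /* currently taking lt samples?     */
            unsigned int lt_rtt_cnt;       /* round trips in the sample window */
            unsigned long lt_bw;           /* long-term bandwidth estimate     */
    };

    static void bbr_reset_lt_bw_sampling(struct bbr_lt_sketch *bbr)
    {
            bbr->lt_bw = 0;
            bbr->lt_use_bw = 0;
            bbr->lt_is_sampling = 0;
            bbr->lt_rtt_cnt = 0;
    }

    static void bbr_on_loss_recovery_undo(struct bbr_lt_sketch *bbr)
    {
            /* The losses that fed the lt estimate were spurious: start over. */
            bbr_reset_lt_bw_sampling(bbr);
    }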

    Signed-off-by: Neal Cardwell
    Reviewed-by: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Neal Cardwell
     
  • [ Upstream commit 2f6c498e4f15d27852c04ed46d804a39137ba364 ]

    Fix BBR so that upon notification of a loss recovery undo BBR resets
    the full pipe detection (STARTUP exit) state machine.

    Under high reordering, reordering events can be interpreted as loss.
    If the reordering and spurious loss estimates are high enough, this
    could previously cause BBR to spuriously estimate that the pipe is
    full.

    Since spurious loss recovery means that our overall sending will have
    slowed down spuriously, this commit gives a flow more time to probe
    robustly for bandwidth and decide the pipe is really full.
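
    A rough sketch (simplified struct): the undo hook also re-arms the
    STARTUP-exit ("full pipe") detector by zeroing its state.

    struct bbr_full_sketch {
            unsigned int full_bw;          /* bw seen at the last growth check  */
            unsigned int full_bw_cnt;      /* rounds without significant growth */
    };

    static void bbr_on_loss_recovery_undo(struct bbr_full_sketch *bbr)
    {
            /* Forget any "bandwidth has plateaued" evidence gathered during
             * the spurious recovery, so the flow keeps probing for bandwidth.
             */
            bbr->full_bw = 0;
            bbr->full_bw_cnt = 0;
    }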

    Signed-off-by: Neal Cardwell
    Reviewed-by: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Neal Cardwell
     
  • [ Upstream commit c589e69b508d29ed8e644dfecda453f71c02ec27 ]

    This commit records the "full bw reached" decision in a new
    full_bw_reached bit. This is a pure refactor that does not change the
    current behavior, but enables subsequent fixes and improvements.

    In particular, this enables simple and clean fixes because the full_bw
    and full_bw_cnt can be unconditionally zeroed without worrying about
    forgetting that we estimated we filled the pipe in Startup. And it
    enables future improvements because multiple code paths can be used
    for estimating that we filled the pipe in Startup; any new code paths
    only need to set this bit when they think the pipe is full.

    Note that this fix intentionally reduces the width of the full_bw_cnt
    counter, since we have never used the most significant bit.
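
    Sketched below with a simplified bitfield layout (the real struct
    packs more state): the decision lives in its own latched bit, so the
    estimator fields (full_bw, full_bw_cnt) can be cleared freely.

    struct bbr_sketch {
            unsigned int full_bw_reached:1; /* latched: STARTUP filled the pipe */
            unsigned int full_bw_cnt:2;     /* rounds without ~25% bw growth    */
            unsigned int full_bw;           /* bw seen at the last growth check */
    };

    static int bbr_full_bw_reached(const struct bbr_sketch *bbr)
    {
            return bbr->full_bw_reached;
    }

    static void bbr_check_full_bw_reached(struct bbr_sketch *bbr, unsigned int bw)
    {
            if (bbr_full_bw_reached(bbr))
                    return;
            if (bw >= bbr->full_bw + (bbr->full_bw >> 2)) { /* grew >= ~25%? */
                    bbr->full_bw = bw;
                    bbr->full_bw_cnt = 0;
                    return;
            }
            if (++bbr->full_bw_cnt >= 3)                    /* plateaued: done */
                    bbr->full_bw_reached = 1;
    }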

    Signed-off-by: Neal Cardwell
    Reviewed-by: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Neal Cardwell
     

16 Jul, 2017

5 commits

  • Fixes the following behavior: for connections that had no RTT sample
    at the time of initializing congestion control, BBR was initializing
    the pacing rate to a high nominal rate (based on a guess of RTT=1ms,
    in case this is LAN traffic). Then BBR never adjusted the pacing rate
    downward upon obtaining an actual RTT sample, if the connection never
    filled the pipe (e.g. all sends were small app-limited write() calls).

    This fix adjusts the pacing rate upon obtaining the first RTT sample.
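
    A hedged sketch of the trigger for that adjustment (the flag name
    has_seen_rtt and the helper name are illustrative): on the first ACK
    that carries a real RTT sample, the pacing rate is re-derived before
    the normal rate update.

    struct bbr_sketch {
            unsigned int has_seen_rtt:1;   /* pacing ever derived from a real RTT? */
    };

    static int bbr_should_reinit_pacing_rate(const struct bbr_sketch *bbr,
                                             unsigned int srtt_us)
    {
            /* True on the first ACK carrying a real RTT sample while the pacing
             * rate still rests on the RTT=1ms guess made at init time.
             */
            return !bbr->has_seen_rtt && srtt_us != 0;
    }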

    Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Fix a corner case noticed by Eric Dumazet, where BBR's setting
    sk->sk_pacing_rate to 0 during initialization could theoretically
    cause packets in the sending host to hang if there were packets "in
    flight" in the pacing infrastructure at the time the BBR congestion
    control state is initialized. This could occur if the pacing
    infrastructure happened to race with bbr_init() in a way such that the
    pacer read the 0 rather than the immediately following non-zero pacing
    rate.
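
    A minimal sketch of the idea, in standalone form: compute the
    nominal initial rate first and store it in a single write, so a
    racing pacer can never observe a transient 0.

    struct sock_sketch {
            unsigned long sk_pacing_rate;  /* read asynchronously by the pacer */
    };

    static void bbr_init_sketch(struct sock_sketch *sk, unsigned long nominal_rate)
    {
            /* The old code stored 0 first and the real rate shortly afterwards;
             * computing the rate up front removes that window entirely.
             */
            sk->sk_pacing_rate = nominal_rate;
    }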

    Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
    Reported-by: Eric Dumazet
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Introduce a helper to initialize the BBR pacing rate unconditionally,
    based on the current cwnd and RTT estimate. This is a pure refactor,
    but is needed for two following fixes.
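
    A standalone sketch of what such a helper computes (names and exact
    scaling are illustrative): a pacing rate in bytes per second derived
    from cwnd and RTT, with a 1 ms RTT guess when no sample exists yet.

    typedef unsigned long long u64;

    static u64 init_pacing_rate_from_rtt(unsigned int cwnd_pkts,
                                         unsigned int mss_bytes,
                                         unsigned int srtt_us,  /* 0 if no sample */
                                         unsigned int gain_q8)  /* 256 == 1.0 */
    {
            unsigned int rtt_us = srtt_us ? srtt_us : 1000;     /* guess ~1 ms */
            u64 bw = (u64)cwnd_pkts * mss_bytes * 1000000ULL / rtt_us; /* B/s */

            return bw * gain_q8 >> 8;      /* apply the Q8 gain */
    }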

    Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Introduce a helper to convert a BBR bandwidth and gain factor to a
    pacing rate in bytes per second. This is a pure refactor, but is
    needed for two following fixes.
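
    A standalone sketch of the conversion (scaling constants are
    illustrative of the fixed-point scheme, not copied from the source):

    typedef unsigned long long u64;

    #define GAIN_SCALE 8                   /* gains are Q8: 256 == 1.0       */
    #define BW_SCALE   24                  /* bw kept as packets/usec << 24  */

    static u64 bw_to_pacing_rate(u64 bw, int gain_q8, unsigned int mss_bytes)
    {
            u64 rate = bw;

            rate *= mss_bytes;             /* packets -> bytes               */
            rate *= gain_q8;
            rate >>= GAIN_SCALE;           /* undo the gain fixed point      */
            rate *= 1000000ULL;            /* per usec -> per second         */
            return rate >> BW_SCALE;       /* undo the bandwidth fixed point */
    }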

    Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • In bbr_set_pacing_rate(), which decides whether to cut the pacing
    rate, there was some code that considered exiting STARTUP to be
    equivalent to the notion of filling the pipe (i.e.,
    bbr_full_bw_reached()). Specifically, as the code was structured,
    exiting STARTUP and going into PROBE_RTT could cause us to cut the
    pacing rate down to something silly and low, based on whatever
    bandwidth samples we've had so far, when it's possible that all of
    them have been small app-limited bandwidth samples that are not
    representative of the bandwidth available in the path. (The code was
    correct at the time it was written, but the state machine changed
    without this spot being adjusted correspondingly.)
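
    A minimal sketch of the corrected condition (simplified types): the
    pacing rate may only be lowered once the pipe is known to be full,
    not merely because we have left STARTUP.

    struct bbr_sketch {
            unsigned int full_bw_reached:1;
    };

    static void bbr_maybe_set_pacing_rate(const struct bbr_sketch *bbr,
                                          unsigned long *sk_pacing_rate,
                                          unsigned long new_rate)
    {
            /* Before the pipe is known to be full, only ever raise the rate:
             * app-limited bandwidth samples must not drag the pacing rate down.
             */
            if (bbr->full_bw_reached || new_rate > *sk_pacing_rate)
                    *sk_pacing_rate = new_rate;
    }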

    Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
     

18 May, 2017

2 commits

  • The TCP Timestamps option is defined in RFC 7323.

    Traditionally on Linux it has been tied to the internal 'jiffies'
    variable, because that was a cheap and good-enough generator.

    For TCP flows on the Internet, 1 ms resolution would be much better
    than the 4 ms or 10 ms we get with HZ=250 or HZ=100, respectively.

    For TCP flows in the DC, Google has used usec resolution for more
    than two years with great success [1].

    Receive window autotuning (DRS) is indeed more precise and converges
    faster to the optimal window size.

    This patch converts tp->tcp_mstamp to a plain u64 value storing
    a 1 usec TCP clock.

    This choice will allow us to upstream the 1 usec TS option as
    discussed in IETF 97.

    [1] https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf
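
    As a userspace-flavoured illustration of the clock choice (helper
    name is hypothetical; the kernel derives the value from its own
    monotonic clock), the timestamp simply becomes a 64-bit count of
    microseconds:

    #include <stdint.h>
    #include <time.h>

    static inline uint64_t tcp_clock_us_sketch(void)
    {
            struct timespec ts;

            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (uint64_t)ts.tv_sec * 1000000ULL + (uint64_t)ts.tv_nsec / 1000;
    }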

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Use tcp_jiffies32 instead of tcp_time_stamp, since tcp_time_stamp
    will soon be used only for the TCP TS option.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 May, 2017

1 commit

  • BBR congestion control depends on pacing, and pacing is currently
    handled by the sch_fq packet scheduler, for performance reasons and
    also because implementing pacing with FQ was a convenient way to
    truly avoid bursts.

    However, there are many cases where this packet-scheduler constraint
    is not practical:
    - Many Linux hosts are not focused on handling thousands of TCP
      flows in the most efficient way.
    - Some routers use fq_codel or another AQM, but would still like to
      use BBR for the few TCP flows they initiate/terminate.

    This patch implements an automatic fallback to internal pacing.

    Pacing is requested either by BBR or by use of the
    SO_MAX_PACING_RATE socket option.

    If sch_fq happens to be in the egress path, pacing is delegated to
    the qdisc, otherwise pacing is done by TCP itself.
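
    A sketch of the fallback decision with illustrative, userspace-style
    names (not the kernel's): pacing is marked as needed by BBR or by
    SO_MAX_PACING_RATE, and TCP paces the packets itself only when no
    pacing-capable qdisc such as sch_fq has claimed the job.

    enum pacing_status {
            PACING_NONE,    /* nobody asked for pacing                        */
            PACING_NEEDED,  /* BBR or SO_MAX_PACING_RATE asked; TCP must pace */
            PACING_FQ,      /* sch_fq is on the egress path and paces for us  */
    };

    struct sock_sketch {
            enum pacing_status pacing_status;
    };

    /* Called when BBR (or the SO_MAX_PACING_RATE option) requests pacing. */
    static void sk_pacing_needed(struct sock_sketch *sk)
    {
            if (sk->pacing_status == PACING_NONE)
                    sk->pacing_status = PACING_NEEDED;
    }

    /* Transmit path: should TCP itself delay this packet to enforce the rate? */
    static int tcp_should_pace_internally(const struct sock_sketch *sk)
    {
            return sk->pacing_status == PACING_NEEDED;
    }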

    One advantage of pacing from the TCP stack is more precise RTT
    estimation and less work done at TX completion time, since TCP Small
    Queues limits are generally not hit. Setups with a single TX queue
    but many CPUs might even benefit from this.

    Note that, unlike sch_fq, we do not take header sizes into account.
    Accounting for these headers would add complexity for no practical
    difference in behavior.

    Some performance numbers, using 800 TCP_STREAM flows rate limited to
    ~48 Mbit per second on a 40Gbit NIC.

    If MQ+pfifo_fast is used on the NIC:

    $ sar -n DEV 1 5 | grep eth
    14:48:44 eth0 725743.00 2932134.00 46776.76 4335184.68 0.00 0.00 1.00
    14:48:45 eth0 725349.00 2932112.00 46751.86 4335158.90 0.00 0.00 0.00
    14:48:46 eth0 725101.00 2931153.00 46735.07 4333748.63 0.00 0.00 0.00
    14:48:47 eth0 725099.00 2931161.00 46735.11 4333760.44 0.00 0.00 1.00
    14:48:48 eth0 725160.00 2931731.00 46738.88 4334606.07 0.00 0.00 0.00
    Average: eth0 725290.40 2931658.20 46747.54 4334491.74 0.00 0.00 0.40
    $ vmstat 1 5
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    4 0 0 259825920 45644 2708324 0 0 21 2 247 98 0 0 100 0 0
    4 0 0 259823744 45644 2708356 0 0 0 0 2400825 159843 0 19 81 0 0
    0 0 0 259824208 45644 2708072 0 0 0 0 2407351 159929 0 19 81 0 0
    1 0 0 259824592 45644 2708128 0 0 0 0 2405183 160386 0 19 80 0 0
    1 0 0 259824272 45644 2707868 0 0 0 32 2396361 158037 0 19 81 0 0

    Now use MQ+FQ:

    lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
    lpaa23:~# tc qdisc replace dev eth0 root mq

    $ sar -n DEV 1 5 | grep eth
    14:49:57 eth0 678614.00 2727930.00 43739.13 4033279.14 0.00 0.00 0.00
    14:49:58 eth0 677620.00 2723971.00 43674.69 4027429.62 0.00 0.00 1.00
    14:49:59 eth0 676396.00 2719050.00 43596.83 4020125.02 0.00 0.00 0.00
    14:50:00 eth0 675197.00 2714173.00 43518.62 4012938.90 0.00 0.00 1.00
    14:50:01 eth0 676388.00 2719063.00 43595.47 4020171.64 0.00 0.00 0.00
    Average: eth0 676843.00 2720837.40 43624.95 4022788.86 0.00 0.00 0.40
    $ vmstat 1 5
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    2 0 0 259832240 46008 2710912 0 0 21 2 223 192 0 1 99 0 0
    1 0 0 259832896 46008 2710744 0 0 0 0 1702206 198078 0 17 82 0 0
    0 0 0 259830272 46008 2710596 0 0 0 0 1696340 197756 1 17 83 0 0
    4 0 0 259829168 46024 2710584 0 0 16 0 1688472 197158 1 17 82 0 0
    3 0 0 259830224 46024 2710408 0 0 0 0 1692450 197212 0 18 82 0 0

    As expected, number of interrupts per second is very different.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Cc: Van Jacobson
    Cc: Jerry Chu
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Sep, 2016

1 commit

  • This commit implements a new TCP congestion control algorithm: BBR
    (Bottleneck Bandwidth and RTT). A detailed description of BBR will be
    published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
    "BBR: Congestion-Based Congestion Control".

    BBR has significantly increased throughput and reduced latency for
    connections on Google's internal backbone networks and google.com and
    YouTube Web servers.

    BBR requires only changes on the sender side, not in the network or
    the receiver side. Thus it can be incrementally deployed on today's
    Internet, or in datacenters.

    The Internet has predominantly used loss-based congestion control
    (largely Reno or CUBIC) since the 1980s, relying on packet loss as the
    signal to slow down. While this worked well for many years, loss-based
    congestion control is unfortunately outdated in today's networks. On
    today's Internet, loss-based congestion control causes the infamous
    bufferbloat problem, often causing seconds of needless queuing delay,
    since it fills the bloated buffers in many last-mile links. On today's
    high-speed long-haul links using commodity switches with shallow
    buffers, loss-based congestion control has abysmal throughput because
    it over-reacts to losses caused by transient traffic bursts.

    In 1981 Kleinrock and Gale showed that the optimal operating point for
    a network maximizes delivered bandwidth while minimizing delay and
    loss, not only for single connections but for the network as a
    whole. Finding that optimal operating point has been elusive, since
    any single network measurement is ambiguous: network measurements are
    the result of both bandwidth and propagation delay, and those two
    cannot be measured simultaneously.

    While it is impossible to disambiguate any single bandwidth or RTT
    measurement, a connection's behavior over time tells a clearer
    story. BBR uses a measurement strategy designed to resolve this
    ambiguity. It combines these measurements with a robust servo loop
    using recent control systems advances to implement a distributed
    congestion control algorithm that reacts to actual congestion, not
    packet loss or transient queue delay, and is designed to converge with
    high probability to a point near the optimal operating point.

    In a nutshell, BBR creates an explicit model of the network pipe by
    sequentially probing the bottleneck bandwidth and RTT. On the arrival
    of each ACK, BBR derives the current delivery rate of the last round
    trip, and feeds it through a windowed max-filter to estimate the
    bottleneck bandwidth. Conversely it uses a windowed min-filter to
    estimate the round trip propagation delay. The max-filtered bandwidth
    and min-filtered RTT estimates form BBR's model of the network pipe.
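
    A naive standalone sketch of the filtering idea (the kernel uses an
    O(1) windowed minmax helper, but the effect is the same): keep the
    best bandwidth sample per round and take the maximum over the last N
    rounds; the min-RTT filter is the mirror image.

    #define BW_WIN_ROUNDS 10

    struct bw_max_filter {
            unsigned int bw[BW_WIN_ROUNDS];     /* best delivery rate per round */
            unsigned int round[BW_WIN_ROUNDS];  /* which round each slot holds  */
    };

    static void bw_filter_update(struct bw_max_filter *f,
                                 unsigned int round, unsigned int bw)
    {
            unsigned int slot = round % BW_WIN_ROUNDS;

            if (f->round[slot] != round) {      /* slot held an expired round */
                    f->round[slot] = round;
                    f->bw[slot] = 0;
            }
            if (bw > f->bw[slot])
                    f->bw[slot] = bw;
    }

    static unsigned int bw_filter_max(const struct bw_max_filter *f,
                                      unsigned int cur_round)
    {
            unsigned int i, best = 0;

            for (i = 0; i < BW_WIN_ROUNDS; i++)
                    if (cur_round - f->round[i] < BW_WIN_ROUNDS && f->bw[i] > best)
                            best = f->bw[i];
            return best;
    }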

    Using its model, BBR sets control parameters to govern sending
    behavior. The primary control is the pacing rate: BBR applies a gain
    multiplier to transmit faster or slower than the observed bottleneck
    bandwidth. The conventional congestion window (cwnd) is now the
    secondary control; the cwnd is set to a small multiple of the
    estimated BDP (bandwidth-delay product) in order to allow full
    utilization and bandwidth probing while bounding the potential amount
    of queue at the bottleneck.
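
    A rough sketch of how the two controls follow from the (max bw, min
    RTT) model, using Q8 gains (256 == 1.0) and simplified units:

    typedef unsigned long long u64;

    #define GAIN_UNIT 256                  /* 1.0 in Q8 fixed point */

    /* Primary control: pacing rate = estimated bottleneck bw scaled by a gain. */
    static u64 model_pacing_rate(u64 bw_bytes_per_sec, unsigned int pacing_gain_q8)
    {
            return bw_bytes_per_sec * pacing_gain_q8 / GAIN_UNIT;
    }

    /* Secondary control: cwnd = a small multiple of the estimated BDP. */
    static unsigned int model_cwnd(u64 bw_pkts_per_sec, unsigned int min_rtt_us,
                                   unsigned int cwnd_gain_q8)
    {
            u64 bdp_pkts = bw_pkts_per_sec * min_rtt_us / 1000000ULL;

            return (unsigned int)(bdp_pkts * cwnd_gain_q8 / GAIN_UNIT);
    }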

    When a BBR connection starts, it enters STARTUP mode and applies a
    high gain to perform an exponential search to quickly probe the
    bottleneck bandwidth (doubling its sending rate each round trip, like
    slow start). However, instead of continuing until it fills up the
    buffer (i.e. a loss), or until delay or ACK spacing reaches some
    threshold (like Hystart), it uses its model of the pipe to estimate
    when that pipe is full: it estimates the pipe is full when it notices
    the estimated bandwidth has stopped growing. At that point it exits
    STARTUP and enters DRAIN mode, where it reduces its pacing rate to
    drain the queue it estimates it has created.

    Then BBR enters steady state. In steady state, PROBE_BW mode cycles
    between first pacing faster to probe for more bandwidth, then pacing
    slower to drain any queue it created if no more bandwidth was
    available, and then cruising at the estimated bandwidth to utilize the
    pipe without creating excess queue. Occasionally, on an as-needed
    basis, it sends significantly slower to probe for RTT (PROBE_RTT
    mode).

    BBR has been fully deployed on Google's wide-area backbone networks
    and we're experimenting with BBR on Google.com and YouTube on a global
    scale. Replacing CUBIC with BBR has resulted in significant
    improvements in network latency and application (RPC, browser, and
    video) metrics. For more details please refer to our upcoming ACM
    Queue publication.

    Example performance results, to illustrate the difference between BBR
    and CUBIC:

    Resilience to random loss (e.g. from shallow buffers):
    Consider a netperf TCP_STREAM test lasting 30 secs on an emulated
    path with a 10Gbps bottleneck, 100ms RTT, and 1% packet loss
    rate. CUBIC gets 3.27 Mbps, and BBR gets 9150 Mbps (2798x higher).

    Low latency with the bloated buffers common in today's last-mile links:
    Consider a netperf TCP_STREAM test lasting 120 secs on an emulated
    path with a 10Mbps bottleneck, 40ms RTT, and 1000-packet bottleneck
    buffer. Both fully utilize the bottleneck bandwidth, but BBR
    achieves this with a median RTT 25x lower (43 ms instead of 1.09
    secs).

    Our long-term goal is to improve the congestion control algorithms
    used on the Internet. We are hopeful that BBR can help advance the
    efforts toward this goal, and motivate the community to do further
    research.

    Test results, performance evaluations, feedback, and BBR-related
    discussions are very welcome in the public e-mail list for BBR:

    https://groups.google.com/forum/#!forum/bbr-dev

    NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
    enabled, since pacing is integral to the BBR design and
    implementation. BBR without pacing would not function properly, and
    may incur unnecessary high packet loss rates.

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell