26 Jun, 2020

1 commit

  • Mirja Kuehlewind reported a bug in Linux TCP CUBIC Hystart, where
    the Hystart HYSTART_DELAY mechanism can exit slow start spuriously
    on an ACK when the minimum RTT of a connection goes down. Inspection
    of the existing code shows that this could happen in an example like
    the following:

    o The first 8 RTT samples in a round trip are 150ms, resulting in a
    curr_rtt of 150ms and a delay_min of 150ms.

    o The 9th RTT sample is 100ms. The curr_rtt does not change after the
    first 8 samples, so curr_rtt remains 150ms. But delay_min can be
    lowered at any time, so delay_min falls to 100ms. The code executes
    the HYSTART_DELAY comparison between curr_rtt of 150ms and delay_min
    of 100ms, and the curr_rtt is declared far enough above delay_min to
    force a (spurious) exit of Slow start.

    The fix here is simple: allow every RTT sample in a round trip to
    lower the curr_rtt.
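
    A minimal sketch of the corrected sampling logic in hystart_update()
    (shape based on the description above; field and macro names follow
    tcp_cubic.c conventions):

    /* HYSTART_DELAY: let every ACK in the round lower curr_rtt,
     * not only the first HYSTART_MIN_SAMPLES ACKs.
     */
    if (ca->curr_rtt > delay)
            ca->curr_rtt = delay;

    if (ca->sample_cnt < HYSTART_MIN_SAMPLES) {
            ca->sample_cnt++;
    } else if (ca->curr_rtt > ca->delay_min +
               HYSTART_DELAY_THRESH(ca->delay_min >> 3)) {
            /* curr_rtt now reflects the true round minimum, so this
             * comparison no longer fires just because delay_min dropped.
             */
            ca->found = 1;
            tp->snd_ssthresh = tp->snd_cwnd;
    }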

    Fixes: ae27e98a5152 ("[TCP] CUBIC v2.3")
    Reported-by: Mirja Kuehlewind
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
     

31 Dec, 2019

1 commit

  • Neal Cardwell suggested not changing ca->delay_min and applying the
    ACK delay cushion only while the Hystart ACK train is still under
    consideration. This avoids a 64-bit divide unless it is actually
    needed.
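
    A sketch of the resulting shape (helper and field names as in
    tcp_cubic.c; the exact cushion expression is an assumption based on
    this description): the pacing-rate divide moves into a helper that
    is invoked only from the ACK-train branch, and ca->delay_min itself
    is left untouched.

    static u32 hystart_ack_delay(const struct sock *sk)
    {
            unsigned long rate = READ_ONCE(sk->sk_pacing_rate);

            if (!rate)
                    return 0;       /* no divide when rate is unknown */
            /* time to send two 64KB TSO packets at the current rate,
             * capped at 1 ms
             */
            return min_t(u64, USEC_PER_MSEC,
                         div64_ul((u64)GSO_MAX_SIZE * 2 * USEC_PER_SEC, rate));
    }

    /* in the HYSTART_ACK_TRAIN branch only: */
    threshold = ca->delay_min + hystart_ack_delay(sk);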

    Tested:

    40Gbit(mlx4) testbed (with sch_fq as packet scheduler)

    $ echo -n 'file tcp_cubic.c +p' >/sys/kernel/debug/dynamic_debug/control
    $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
    14815
    16280
    15293
    15563
    11574
    15145
    14789
    18548
    16972
    12520
    TcpExtTCPHystartTrainDetect 10 0.0
    TcpExtTCPHystartTrainCwnd 1396 0.0
    $ dmesg | tail -10
    [ 4873.951350] hystart_ack_train (116 > 93) delay_min 24 (+ ack_delay 69) cwnd 80
    [ 4875.155379] hystart_ack_train (55 > 50) delay_min 21 (+ ack_delay 29) cwnd 160
    [ 4876.333921] hystart_ack_train (69 > 62) delay_min 23 (+ ack_delay 39) cwnd 130
    [ 4877.519037] hystart_ack_train (69 > 60) delay_min 22 (+ ack_delay 38) cwnd 130
    [ 4878.701559] hystart_ack_train (87 > 63) delay_min 24 (+ ack_delay 39) cwnd 160
    [ 4879.844597] hystart_ack_train (93 > 50) delay_min 21 (+ ack_delay 29) cwnd 216
    [ 4880.956650] hystart_ack_train (74 > 67) delay_min 20 (+ ack_delay 47) cwnd 108
    [ 4882.098500] hystart_ack_train (61 > 57) delay_min 23 (+ ack_delay 34) cwnd 130
    [ 4883.262056] hystart_ack_train (72 > 67) delay_min 21 (+ ack_delay 46) cwnd 130
    [ 4884.418760] hystart_ack_train (74 > 67) delay_min 29 (+ ack_delay 38) cwnd 152

    10Gbit(bnx2x) testbed (with sch_fq as packet scheduler)

    $ echo -n 'file tcp_cubic.c +p' >/sys/kernel/debug/dynamic_debug/control
    $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpk52 -l -4000000; done;nstat|egrep "Hystart"
    7050
    7065
    7100
    6900
    7202
    7263
    7189
    6869
    7463
    7034
    TcpExtTCPHystartTrainDetect 10 0.0
    TcpExtTCPHystartTrainCwnd 3199 0.0
    $ dmesg | tail -10
    [ 176.920012] hystart_ack_train (161 > 141) delay_min 83 (+ ack_delay 58) cwnd 264
    [ 179.144645] hystart_ack_train (164 > 159) delay_min 120 (+ ack_delay 39) cwnd 444
    [ 181.354527] hystart_ack_train (214 > 168) delay_min 125 (+ ack_delay 43) cwnd 436
    [ 183.539565] hystart_ack_train (170 > 147) delay_min 96 (+ ack_delay 51) cwnd 326
    [ 185.727309] hystart_ack_train (177 > 160) delay_min 61 (+ ack_delay 99) cwnd 128
    [ 187.947142] hystart_ack_train (184 > 167) delay_min 123 (+ ack_delay 44) cwnd 367
    [ 190.166680] hystart_ack_train (230 > 153) delay_min 116 (+ ack_delay 37) cwnd 444
    [ 192.327285] hystart_ack_train (210 > 206) delay_min 86 (+ ack_delay 120) cwnd 152
    [ 194.511392] hystart_ack_train (173 > 151) delay_min 94 (+ ack_delay 57) cwnd 239
    [ 196.736023] hystart_ack_train (149 > 146) delay_min 105 (+ ack_delay 41) cwnd 399

    Fixes: 42f3a8aaae66 ("tcp_cubic: tweak Hystart detection for short RTT flows")
    Signed-off-by: Eric Dumazet
    Reported-by: Neal Cardwell
    Link: https://www.spinics.net/lists/netdev/msg621886.html
    Link: https://www.spinics.net/lists/netdev/msg621797.html
    Acked-by: Neal Cardwell
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Dec, 2019

5 commits

  • For years we disabled Hystart ACK train detection at Google
    because it was fooled by TCP pacing.

    ACK train detection uses a simple heuristic, checking whether we
    receive an ACK past half the RTT, to exit slow start before hitting
    the bottleneck and experiencing massive drops.

    But pacing by design might delay packets up to RTT/2,
    so we need to tweak the Hystart logic to be aware of this
    extra delay.
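
    A sketch of the tweak (the SK_PACING_NONE test is how the stack
    exposes pacing status; the rest follows the description):

    /* ack train detection: did this ACK arrive past the threshold ? */
    threshold = ca->delay_min;
    /* Hystart historically triggers past delay_min/2, but pacing may
     * legitimately delay packets up to RTT/2 during slow start, so
     * only halve the threshold for non-paced flows.
     */
    if (sk->sk_pacing_status == SK_PACING_NONE)
            threshold >>= 1;

    if ((s32)(now - ca->round_start) > threshold)
            ca->found = 1;          /* end slow start */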

    Tested:
    Added a 100 usec delay at receiver.

    Before:
    nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
    9117
    7057
    9553
    8300
    7030
    6849
    9533
    10126
    6876
    8473
    TcpExtTCPHystartTrainDetect 10 0.0
    TcpExtTCPHystartTrainCwnd 1230 0.0

    After :
    nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
    9845
    10103
    10866
    11096
    11936
    11487
    11773
    12188
    11066
    11894
    TcpExtTCPHystartTrainDetect 10 0.0
    TcpExtTCPHystartTrainCwnd 6462 0.0

    Disabling Hystart ACK Train detection gives similar numbers

    echo 2 >/sys/module/tcp_cubic/parameters/hystart_detect
    nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
    11173
    10954
    12455
    10627
    11578
    11583
    11222
    10880
    10665
    11366

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • After switching ca->delay_min to usec resolution, we exit
    slow start prematurely for very low RTT flows, setting
    snd_ssthresh to 20.

    The reason is that delay_min is fed with the RTT of small packet
    trains. Then, as cwnd increases, TCP sends bigger TSO packets.

    LRO/GRO aggregation and/or interrupt mitigation strategies on the
    receiver tend to inflate RTT samples.

    Fix this by adding to delay_min the expected delay of
    two TSO packets, given the current pacing rate.
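
    A sketch of the cushion (the "two TSO packets" sizing follows the
    description; the exact constants upstream are an assumption):

    /* first time call, or link delay decreases */
    if (ca->delay_min == 0 || ca->delay_min > delay) {
            unsigned long rate = READ_ONCE(sk->sk_pacing_rate);

            /* Account for TSO/GRO inflation of RTT samples: add the
             * time needed to send two 64KB TSO packets at this rate.
             */
            if (rate)
                    delay += div64_ul((u64)GSO_MAX_SIZE * 2 * USEC_PER_SEC,
                                      rate);
            ca->delay_min = delay;
    }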

    Tested:

    Sender uses pfifo_fast qdisc

    Before :
    $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
    11348
    11707
    11562
    11428
    11773
    11534
    9878
    11693
    10597
    10968
    TcpExtTCPHystartTrainDetect 10 0.0
    TcpExtTCPHystartTrainCwnd 200 0.0

    After :
    $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
    14877
    14517
    15797
    18466
    17376
    14833
    17558
    17933
    16039
    18059
    TcpExtTCPHystartTrainDetect 10 0.0
    TcpExtTCPHystartTrainCwnd 1670 0.0

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The current 1 ms clock feeds ca->round_start, ca->delay_min, and
    ca->last_ack.

    This is quite problematic for data-center flows, where delay_min is
    way below 1 ms.

    It means Hystart train detection triggers every time the jiffies
    value is updated, since the expression
    "((s32)(now - ca->round_start) > ca->delay_min >> 4)" becomes true.

    This kind of random behavior can be avoided by reusing the existing
    usec timestamp that TCP keeps in tp->tcp_mstamp.
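
    The usec clock then becomes trivial, reusing the timestamp TCP
    already refreshes on every ACK:

    static inline u32 bictcp_clock_us(const struct sock *sk)
    {
            return tcp_sk(sk)->tcp_mstamp;  /* usec resolution */
    }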

    Note that a follow-up patch will tweak things a bit, because during
    slow start, GRO aggregation on receivers naturally increases the
    RTT as TSO packets gradually grow to ~64KB.

    To recap, right after this patch CUBIC Hystart detection is more
    aggressive: short RTT flows might exit slow start at cwnd = 20,
    instead of being possibly unbounded.

    The following patch addresses this problem.

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • If we initialize ca->curr_rtt to ~0U, we do not need to test for a
    zero value in hystart_update().

    We only read ca->curr_rtt if at least HYSTART_MIN_SAMPLES have
    been processed, and thus ca->curr_rtt will have a sane value.
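
    A sketch of the simplification:

    /* before: two tests per RTT sample */
    if (ca->curr_rtt == 0 || ca->curr_rtt > delay)
            ca->curr_rtt = delay;

    /* after: bictcp_hystart_reset() sets ca->curr_rtt = ~0U, so the
     * first sample of a round is always smaller and one test suffices.
     */
    if (ca->curr_rtt > delay)
            ca->curr_rtt = delay;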

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • We do not care which bit in ca->found is set.

    We avoid accessing hystart and hystart_detect unless really needed,
    possibly avoiding one cache line miss.
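
    A sketch of the reordered test in bictcp_acked() (a plain flag
    instead of a bitmask, checked before the module parameters):

    /* ca->found sits in the hot per-socket cache line; hystart and
     * hystart_low_window are module params in a possibly cold line.
     */
    if (!ca->found && tcp_in_slow_start(tp) && hystart &&
        tp->snd_cwnd >= hystart_low_window)
            hystart_update(sk, delay);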

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have MODULE_LICENSE("GPL*") inside, which was used in the initial
    scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only
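
    For tcp_cubic.c the result is a single comment at the top of the
    file:

    // SPDX-License-Identifier: GPL-2.0-only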

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

07 Aug, 2017

1 commit

  • Most TCP congestion controls use identical logic to undo cwnd,
    except BBR. This patch consolidates these similar functions into
    the one currently used by Reno and others.
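
    The shared helper is essentially the following (a sketch of the
    upstream tcp_reno_undo_cwnd() shape):

    u32 tcp_reno_undo_cwnd(struct sock *sk)
    {
            const struct tcp_sock *tp = tcp_sk(sk);

            /* revert to the congestion window prior to the loss event */
            return max(tp->snd_cwnd, tp->prior_cwnd);
    }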

    Suggested-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

18 May, 2017

2 commits


21 Apr, 2017

1 commit


12 May, 2016

1 commit

  • Replace the 2 arguments (cnt and rtt) of the congestion control
    modules' pkts_acked() function with a struct. This will allow adding
    more information without having to modify existing congestion
    control modules (tcp_nv in particular needs the bytes in flight
    when the packet was sent).

    As proposed by Neal Cardwell in his comments to the tcp_nv patch.
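
    The struct replacing the two arguments looks roughly like this (a
    sketch; bytes in flight can then be added as a third field without
    touching every module):

    struct ack_sample {
            u32 pkts_acked;         /* was the "cnt" argument */
            s32 rtt_us;             /* was the "rtt" argument */
    };

    /* congestion control op, now taking the sample struct */
    void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);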

    Signed-off-by: Lawrence Brakmo
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     

03 May, 2016

1 commit

  • We want to make the TCP stack preemptible, as draining the prequeue
    and backlog queues can take a lot of time.

    Many SNMP updates were assuming that BH (and preemption) were
    disabled.

    We need to convert some __NET_INC_STATS() calls to NET_INC_STATS()
    and some __TCP_INC_STATS() to TCP_INC_STATS().

    Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset()
    and tcp_v4_send_ack(), we add an explicit preempt-disabled section.
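
    For tcp_cubic's Hystart counters the conversion is mechanical, e.g.
    (a sketch):

    /* __NET_INC_STATS() assumes BH/preemption is already disabled;
     * NET_INC_STATS() is safe from a preemptible context.
     */
    NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPHYSTARTTRAINDETECT);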

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Apr, 2016

1 commit


18 Sep, 2015

1 commit

  • Tracking idle time in bictcp_cwnd_event() is imprecise, as epoch_start
    is normally set at ACK processing time, not at send time.

    Doing a proper fix would need an additional state variable, and does
    not seem worth the trouble, given that the CUBIC bug had been there
    forever before Jana noticed it.

    Let's simply not set epoch_start in the future, otherwise
    bictcp_update() could overflow and CUBIC would again
    grow cwnd too fast.
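
    A sketch of the guard, building on the idle-time shift shown in the
    entry below:

    if (ca->epoch_start && delta > 0) {
            ca->epoch_start += delta;
            /* never place epoch_start in the future */
            if (after(ca->epoch_start, now))
                    ca->epoch_start = now;
    }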

    This was detected thanks to a packetdrill test Neal wrote that was flaky
    before applying this fix.

    Fixes: 30927520dbae ("tcp_cubic: better follow cubic curve after idle period")
    Signed-off-by: Eric Dumazet
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Cc: Jana Iyengar
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Sep, 2015

1 commit

  • Jana Iyengar found an interesting issue on CUBIC :

    The epoch is only updated/reset initially and when experiencing
    losses. The delta "t" of now - epoch_start can grow arbitrarily
    large after app idle, as can the bic_target. Consequently the slope
    (inverse of ca->cnt) would be really large, and eventually ca->cnt
    would be lower-bounded to 2, giving delayed-ACK slow-start behavior.

    This particularly shows up when slow_start_after_idle is disabled,
    as a dangerous cwnd inflation (1.5 x RTT) after a few seconds of
    idle time.

    Jana's initial fix was to reset epoch_start if app limited, but Neal
    pointed out it would ask the CUBIC algorithm to recalculate the
    curve so that we again start growing steeply upward from where cwnd
    is now (as CUBIC does just after a loss). Ideally we'd want the cwnd
    growth curve to be the same shape, just shifted later in time by the
    amount of the idle period.
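
    The merged approach shifts the epoch by the idle time instead of
    resetting it (a sketch of the CA_EVENT_TX_START handler):

    static void bictcp_cwnd_event(struct sock *sk, enum tcp_ca_event event)
    {
            if (event == CA_EVENT_TX_START) {
                    struct bictcp *ca = inet_csk_ca(sk);
                    s32 delta = tcp_time_stamp - tcp_sk(sk)->lsndtime;

                    /* We were application limited (idle) for a while:
                     * shift epoch_start so cwnd resumes on the same
                     * cubic curve, translated in time by the idle period.
                     */
                    if (ca->epoch_start && delta > 0)
                            ca->epoch_start += delta;
            }
    }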

    Reported-by: Jana Iyengar
    Signed-off-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Cc: Stephen Hemminger
    Cc: Sangtae Ha
    Cc: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Jul, 2015

1 commit

  • Add a helper to test the slow start condition in various congestion
    control modules and other places. This prepares for a slight
    improvement in policy as to exactly when to slow start.
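
    The helper itself is a one-liner; the strictness of the comparison
    is the policy detail the message alludes to:

    static inline bool tcp_in_slow_start(const struct tcp_sock *tp)
    {
            return tp->snd_cwnd < tp->snd_ssthresh;
    }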

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

12 Mar, 2015

1 commit

  • Commit 814d488c6126 ("tcp: fix the timid additive increase on stretch
    ACKs") fixed a bug where tcp_cong_avoid_ai() would either credit a
    connection with an increase of snd_cwnd_cnt, or increase snd_cwnd, but
    not both, resulting in cwnd increasing by 1 packet on at most every
    alternate invocation of tcp_cong_avoid_ai().

    Although the commit correctly implemented the CUBIC algorithm, which
    can increase cwnd by as much as 1 packet per 1 packet ACKed (2x per
    RTT), in practice that could be too aggressive: in tests on network
    paths with small buffers, YouTube server retransmission rates nearly
    doubled.

    This commit restores CUBIC to a maximum cwnd growth rate of 1 packet
    per 2 packets ACKed (1.5x per RTT). In YouTube tests this restored
    retransmit rates to low levels.
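
    The restoration is a one-line clamp at the end of the ca->cnt
    computation in bictcp_update() (a sketch):

    /* The maximum rate of cwnd increase CUBIC allows is 1 packet per
     * 2 packets ACKed, i.e. cwnd grows at most 1.5x per RTT.
     */
    ca->cnt = max(ca->cnt, 2U);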

    Testing: This patch has been tested with datacenter netperf
    transfers and on live youtube.com and google.com servers.

    Fixes: 9cd981dcf174 ("tcp: fix stretch ACK bugs in CUBIC")
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     

29 Jan, 2015

3 commits

  • This patch fixes a bug in CUBIC that causes cwnd to increase slightly
    too slowly when multiple ACKs arrive in the same jiffy.

    If cwnd is supposed to increase at a rate of more than once per jiffy,
    then CUBIC was sometimes too slow. Because the bic_target is
    calculated for a future point in time, calculated with time in
    jiffies, the cwnd can increase over the course of the jiffy while the
    bic_target calculated as the proper CUBIC cwnd at time
    t=tcp_time_stamp+rtt does not increase, because tcp_time_stamp only
    increases on jiffy tick boundaries.

    So since the cnt is set to:
    ca->cnt = cwnd / (bic_target - cwnd);
    as cwnd increases but bic_target does not increase due to jiffy
    granularity, the cnt becomes too large, causing cwnd to increase
    too slowly.

    For example:
    - suppose at the beginning of a jiffy, cwnd=40, bic_target=44
    - so CUBIC sets:
    ca->cnt = cwnd / (bic_target - cwnd) = 40 / (44 - 40) = 40/4 = 10
    - suppose we get 10 acks, each for 1 segment, so tcp_cong_avoid_ai()
    increases cwnd to 41
    - so CUBIC sets:
    ca->cnt = cwnd / (bic_target - cwnd) = 41 / (44 - 41) = 41 / 3 = 13

    So now CUBIC will wait for 13 packets to be ACKed before increasing
    cwnd to 42, instead of 10 as it should.

    The fix is to avoid adjusting the slope (determined by ca->cnt)
    multiple times within a jiffy, and instead skip ahead to computing
    the Reno cwnd via the "TCP friendliness" code path.

    Reported-by: Eyal Perry
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Change CUBIC to properly handle stretch ACKs in additive increase mode
    by passing in the count of ACKed packets to tcp_cong_avoid_ai().

    In addition, because we are now precisely accounting for stretch ACKs,
    including delayed ACKs, we can now remove the delayed ACK tracking and
    estimation code that tracked recent delayed ACK behavior in
    ca->delayed_ack.
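
    With the prep patch below in place, the additive-increase helper can
    consume the real ACK count (a sketch of the upstream shape after the
    stretch-ACK series):

    void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w, u32 acked)
    {
            tp->snd_cwnd_cnt += acked;      /* credit full stretch count */
            if (tp->snd_cwnd_cnt >= w) {
                    u32 delta = tp->snd_cwnd_cnt / w;

                    tp->snd_cwnd_cnt -= delta * w;
                    tp->snd_cwnd += delta;
            }
            tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_cwnd_clamp);
    }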

    Reported-by: Eyal Perry
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • LRO, GRO, delayed ACKs, and middleboxes can cause "stretch ACKs" that
    cover more than the RFC-specified maximum of 2 packets. These stretch
    ACKs can cause serious performance shortfalls in common congestion
    control algorithms that were designed and tuned years ago with
    receiver hosts that were not using LRO or GRO, and were instead
    politely ACKing every other packet.

    This patch series fixes Reno and CUBIC to handle stretch ACKs.

    This patch prepares for the upcoming stretch ACK bug fix patches. It
    adds an "acked" parameter to tcp_cong_avoid_ai() to allow for future
    fixes to tcp_cong_avoid_ai() to correctly handle stretch ACKs, and
    changes all congestion control algorithms to pass in 1 for the ACKed
    count. It also changes tcp_slow_start() to return the number of packet
    ACK "credits" that were not processed in slow start mode, and can be
    processed by the congestion control module in additive increase mode.
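
    The reworked tcp_slow_start() then looks roughly like this; the
    return value is the leftover ACK credit:

    u32 tcp_slow_start(struct tcp_sock *tp, u32 acked)
    {
            u32 cwnd = min(tp->snd_cwnd + acked, tp->snd_ssthresh);

            /* credits not consumed before reaching ssthresh are
             * returned for use in additive increase mode
             */
            acked -= cwnd - tp->snd_cwnd;
            tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp);

            return acked;
    }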

    In future patches we will fix tcp_cong_avoid_ai() to handle stretch
    ACKs, and fix Reno and CUBIC handling of stretch ACKs in slow start
    and additive increase mode.

    Reported-by: Eyal Perry
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     

10 Dec, 2014

2 commits

  • In commit 2b4636a5f8ca ("tcp_cubic: make the delay threshold of HyStart
    less sensitive"), HYSTART_DELAY_MIN was changed to 4 ms.

    The remaining problem is that using delay_min + (delay_min/16) as the
    threshold is too sensitive.

    A variation of 6.25 % is too small for RTTs above 60 ms, which are
    not uncommon.

    Let's use 12.5 % instead (delay_min + (delay_min/8)).
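
    In code the change is a single shift (a sketch; HYSTART_DELAY_THRESH
    clamps its argument between the minimum and maximum thresholds):

    /* before: ca->delay_min >> 4, i.e. 6.25 % */
    if (ca->curr_rtt > ca->delay_min +
        HYSTART_DELAY_THRESH(ca->delay_min >> 3))   /* 12.5 % */
            ca->found |= HYSTART_DELAY;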

    Tested:
    80 ms RTT between peers, FQ/pacing packet scheduler on sender.
    10 bulk transfers of 10 seconds :

    nstat >/dev/null
    for i in `seq 1 10`
    do
    netperf -H remote -- -k THROUGHPUT | grep THROUGHPUT
    done
    nstat | grep Hystart

    With the 6.25 % threshold :

    THROUGHPUT=20.66
    THROUGHPUT=249.38
    THROUGHPUT=254.10
    THROUGHPUT=14.94
    THROUGHPUT=251.92
    THROUGHPUT=237.73
    THROUGHPUT=19.18
    THROUGHPUT=252.89
    THROUGHPUT=21.32
    THROUGHPUT=15.58
    TcpExtTCPHystartTrainDetect 2 0.0
    TcpExtTCPHystartTrainCwnd 4756 0.0
    TcpExtTCPHystartDelayDetect 5 0.0
    TcpExtTCPHystartDelayCwnd 180 0.0

    With the 12.5 % threshold :
    THROUGHPUT=251.09
    THROUGHPUT=247.46
    THROUGHPUT=250.92
    THROUGHPUT=248.91
    THROUGHPUT=250.88
    THROUGHPUT=249.84
    THROUGHPUT=250.51
    THROUGHPUT=254.15
    THROUGHPUT=250.62
    THROUGHPUT=250.89
    TcpExtTCPHystartTrainDetect 1 0.0
    TcpExtTCPHystartTrainCwnd 3175 0.0

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Tested-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • When deploying FQ pacing, one thing we noticed is that CUBIC Hystart
    triggers too soon.

    Having SNMP counters that give an idea of how often the various
    Hystart methods trigger is useful prior to any modification.

    This patch adds SNMP counters tracking how many times "ack train" or
    "delay" based Hystart triggered, and the cumulative sum of cwnd at
    the time Hystart decided to end slow start (SS).

    myhost:~# nstat -a | grep Hystart
    TcpExtTCPHystartTrainDetect 9 0.0
    TcpExtTCPHystartTrainCwnd 20650 0.0
    TcpExtTCPHystartDelayDetect 10 0.0
    TcpExtTCPHystartDelayCwnd 360 0.0

    That is: train detection was triggered 9 times, with an average cwnd
    of 20650/9 = 2294; delay detection was triggered 10 times, with an
    average cwnd of 360/10 = 36.
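
    The counters are bumped at the two detection sites, roughly (a
    sketch using the *_BH stats helpers of that era):

    /* ack-train detection fired */
    NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHYSTARTTRAINDETECT);
    NET_ADD_STATS_BH(sock_net(sk), LINUX_MIB_TCPHYSTARTTRAINCWND,
                     tp->snd_cwnd);

    /* delay detection fired */
    NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHYSTARTDELAYDETECT);
    NET_ADD_STATS_BH(sock_net(sk), LINUX_MIB_TCPHYSTARTDELAYCWND,
                     tp->snd_cwnd);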

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Sep, 2014

1 commit

  • Fix places where there is a space before a tab, long lines, awkward
    if(){ placement, double spacing, etc. Add a blank line after
    declarations/initializations.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    stephen hemminger
     

13 May, 2014

1 commit

  • Conflicts:
    drivers/net/ethernet/altera/altera_sgdma.c
    net/netlink/af_netlink.c
    net/sched/cls_api.c
    net/sched/sch_api.c

    The netlink conflict dealt with moving to netlink_capable() and
    netlink_ns_capable() in the 'net' tree vs. supporting 'tc' operations
    in non-init namespaces. These were simple transformations from
    netlink_capable to netlink_ns_capable.

    The Altera driver conflict was simply code removal overlapping some
    void pointer cast cleanups in net-next.

    Signed-off-by: David S. Miller

    David S. Miller
     

04 May, 2014

1 commit


01 May, 2014

1 commit

  • Commit b9f47a3aaeab ("tcp_cubic: limit delayed_ack ratio to prevent
    divide error") tried to prevent a divide error, but there is still a
    small chance that delayed_ack can reach zero: if the cnt parameter
    has a negative value, ratio + cnt can overflow and happen to be
    zero. As a result, min(ratio, ACK_RATIO_LIMIT) evaluates to zero.

    Some old kernels, such as 2.6.32, have a bug that can pass a
    negative parameter, which then ultimately leads to this divide
    error.

    Commit 5b35e1e6e9c ("tcp: fix tcp_trim_head() to adjust segment
    count with skb MSS") fixed the negative parameter issue. However, it
    is safer to also constrain the range of delayed_ack, to make sure we
    never hit a divide by zero.
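
    A sketch of the hardened update (names from the surrounding entries;
    the exact bounds follow the description):

    u32 ratio = ca->delayed_ack;

    ratio -= ca->delayed_ack >> ACK_RATIO_SHIFT;
    ratio += cnt;                   /* cnt may be bogus on old kernels */

    /* never let the smoothed ratio reach zero: divide-by-zero guard */
    ca->delayed_ack = clamp(ratio, 1U, ACK_RATIO_LIMIT);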

    CC: Stephen Hemminger
    Signed-off-by: Liu Yu
    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Liu Yu
     

27 Feb, 2014

1 commit

  • Upcoming congestion controls for TCP require usec resolution for RTT
    estimations. Millisecond resolution is simply not enough these days.

    FQ/pacing in DC environments also requires this change, for finer
    control and removal of the bimodal behavior caused by the current
    hack in tcp_update_pacing_rate() for 'small rtt'.

    TCP_CONG_RTT_STAMP is no longer needed.

    As Julian Anastasov pointed out, we need to keep user compatibility :
    tcp_metrics used to export RTT and RTTVAR in msec resolution,
    so we added RTT_US and RTTVAR_US. An iproute2 patch is needed
    to use the new attributes if provided by the kernel.

    In this example the ss command displays an srtt of 32 usec (10Gbit
    link):

    lpk51:~# ./ss -i dst lpk52
    Netid  State  Recv-Q  Send-Q  Local Address:Port    Peer Address:Port
    tcp    ESTAB  0       1       10.246.11.51:42959    10.246.11.52:64614
        cubic wscale:6,6 rto:201 rtt:0.032/0.001 ato:40 mss:1448 cwnd:10
        send 3620.0Mbps pacing_rate 7240.0Mbps unacked:1 rcv_rtt:993
        rcv_space:29559

    Updated iproute2 ip command displays :

    lpk51:~# ./ip tcp_metrics | grep 10.246.11.52
    10.246.11.52 age 561.914sec cwnd 10 rtt 274us rttvar 213us source 10.246.11.51

    Old binary displays :

    lpk51:~# ip tcp_metrics | grep 10.246.11.52
    10.246.11.52 age 561.914sec cwnd 10 rtt 250us rttvar 125us source 10.246.11.51

    With help from Julian Anastasov, Stephen Hemminger and Yuchung Cheng

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Cc: Stephen Hemminger
    Cc: Yuchung Cheng
    Cc: Larry Brakmo
    Cc: Julian Anastasov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

05 Nov, 2013

1 commit

  • Slow start now increases cwnd by 1 if an ACK acknowledges some
    packets, regardless of the number of packets. Consequently, slow
    start performance is highly dependent on the degree of the stretch
    ACKs caused by receiver or network ACK compression mechanisms
    (e.g., delayed-ACK, GRO, etc). But the slow start algorithm is meant
    to double the amount of packets sent per RTT, so it should process a
    stretch ACK of degree N as if it were N ACKs of degree 1, then exit
    when cwnd exceeds ssthresh. A follow-up patch will use the remainder
    of the N (if greater than 1) to adjust cwnd in the congestion
    avoidance phase.

    In addition, this patch retires the experimental limited slow start
    (LSS) feature. LSS has multiple drawbacks and questionable benefit.
    The fractional cwnd increase in LSS requires a loop in slow start
    even though it is rarely used. Configuring such an increase step via
    a global sysctl across different BDPs seems hard. Finally, and most
    importantly, the slow start overshoot concern is now better covered
    by Hybrid slow start (hystart), enabled by default.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

08 Aug, 2013

2 commits

  • While investigating a strange increase in retransmit rates on hosts
    ~24 days after boot, Van found that hystart was disabled if
    ca->epoch_start was 0, as the following condition is true when the
    tcp_time_stamp high-order bit is set:

    (s32)(tcp_time_stamp - ca->epoch_start) < HZ

    Quoting Van :

    At initialization & after every loss ca->epoch_start is set to zero so
    I believe that the above line will turn off hystart as soon as the 2^31
    bit is set in tcp_time_stamp & hystart will stay off for 24 days.
    I think we've observed that cubic's restart is too aggressive without
    hystart so this might account for the higher drop rate we observe.
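
    The fix is to treat epoch_start == 0 explicitly as "no epoch yet"
    (a sketch):

    /* Discard delay samples right after fast recovery,
     * but only once an epoch has actually started.
     */
    if (ca->epoch_start &&
        (s32)(tcp_time_stamp - ca->epoch_start) < HZ)
            return;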

    Diagnosed-by: Van Jacobson
    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Commit 17a6e9f1aa9 ("tcp_cubic: fix clock dependency") added an
    overflow error in bictcp_update(), in the following code:

    /* change the unit from HZ to bictcp_HZ */
    t = ((tcp_time_stamp + msecs_to_jiffies(ca->delay_min>>3) -
    ca->epoch_start) << BICTCP_HZ) / HZ;

    Because msecs_to_jiffies() returns an unsigned long, the compiler
    performs implicit type promotion.

    We really want to constrain (tcp_time_stamp - ca->epoch_start)
    to a signed 32-bit value, or else 't' takes unexpectedly high values.
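
    The fix computes the delta as a signed 32-bit value first, then
    widens explicitly (a sketch of the corrected sequence):

    u64 t;

    /* constrain the elapsed time to s32 before any promotion */
    t = (s32)(tcp_time_stamp - ca->epoch_start);
    t += msecs_to_jiffies(ca->delay_min >> 3);
    /* change the unit from HZ to bictcp_HZ */
    t <<= BICTCP_HZ;
    do_div(t, HZ);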

    This bug triggers an increase in retransmit rates ~24 days after
    boot [1], as the high-order bit of tcp_time_stamp flips.

    [1] for hosts with HZ=1000

    Big thanks to Van Jacobson for spotting this problem.

    Diagnosed-by: Van Jacobson
    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Cc: Stephen Hemminger
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Jan, 2012

1 commit

  • This patch fixes CUBIC so that cwnd reductions made during RTOs can be
    undone (just as they already can be undone when using the default/Reno
    behavior).

    When undoing cwnd reductions, BIC-derived congestion control modules
    were restoring the cwnd from last_max_cwnd. There were two problems
    with using last_max_cwnd to restore a cwnd during undo:

    (a) last_max_cwnd was set to 0 on state transitions into TCP_CA_Loss
    (by calling the module's reset() functions), so cwnd reductions from
    RTOs could not be undone.

    (b) when fast_convergence is enabled (which it is by default)
    last_max_cwnd does not actually hold the value of snd_cwnd before
    the loss; instead, it holds a scaled-down version of snd_cwnd.

    This patch makes the following changes:

    (1) upon undo, revert snd_cwnd to ca->loss_cwnd, which is already, as
    the existing comment notes, the "congestion window at last loss"

    (2) stop forgetting ca->loss_cwnd on TCP_CA_Loss events

    (3) use ca->last_max_cwnd to check if we're in slow start
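
    Change (1) reduces the undo hook to (a sketch):

    static u32 bictcp_undo_cwnd(struct sock *sk)
    {
            struct bictcp *ca = inet_csk_ca(sk);

            /* loss_cwnd is the congestion window at last loss */
            return max(tcp_sk(sk)->snd_cwnd, ca->loss_cwnd);
    }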

    Signed-off-by: Neal Cardwell
    Acked-by: Stephen Hemminger
    Acked-by: Sangtae Ha
    Signed-off-by: David S. Miller

    Neal Cardwell
     

09 May, 2011

1 commit

  • TCP CUBIC keeps a metric that estimates the amount of delayed
    acknowledgements to use in adjusting the window. If an abnormally
    large number of packets are acknowledged at once, the update could
    wrap and reach zero. This kind of ACK could only happen when there
    was a large window and a huge number of ACKs were lost.

    This patch limits the value of the delayed ack ratio. The choice of
    32 is just a conservative value, since normally it should be in the
    range of 1 to 4 packets.
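
    A sketch of the capped update (the 2014 entry above later tightens
    this min() into a clamp() with a lower bound of 1):

    u32 ratio = ca->delayed_ack;

    ratio += cnt - (ratio >> ACK_RATIO_SHIFT);
    ca->delayed_ack = min(ratio, ACK_RATIO_LIMIT);  /* ~32 packets cap */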

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    stephen hemminger
     

16 Mar, 2011

1 commit


15 Mar, 2011

5 commits