14 Oct, 2019

1 commit

  • Both tcp_v4_err() and tcp_v6_err() do the following operations
    while they do not own the socket lock :

    fastopen = tp->fastopen_rsk;
    snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una;

    The problem is that without an appropriate barrier, the compiler
    might reload tp->fastopen_rsk and trigger a NULL deref.

    Request sockets are protected by RCU, so we can simply add
    the missing annotations and barriers to solve the issue.

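    A minimal sketch of the annotated read (illustrative, not the
    verbatim patch) :

    /* rcu_dereference() keeps the compiler from reloading the pointer */
    fastopen = rcu_dereference(tp->fastopen_rsk);
    snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una;
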
    Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Oct, 2019

1 commit

  • The cited commit exposed an old retransmits_timed_out() bug
    which assumed it could call tcp_model_timeout() with
    TCP_RTO_MIN as rto_base for all states.

    But flows in SYN_SENT or SYN_RECV state use a different
    RTO base (1 sec instead of 200 ms, unless BPF chooses
    another value).

    This caused a reduction of SYN retransmits from 6 to 4 with
    the default /proc/sys/net/ipv4/tcp_syn_retries value.

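    A sketch of the corrected base selection in retransmits_timed_out()
    (tcp_timeout_init() is the BPF-overridable SYN RTO; illustrative) :

    unsigned int rto_base = TCP_RTO_MIN;

    if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))
            rto_base = tcp_timeout_init(sk); /* 1 sec unless BPF overrides */
    timeout = tcp_model_timeout(sk, boundary, rto_base);
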
    Fixes: a41e8a88b06e ("tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state")
    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Marek Majkowski
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Sep, 2019

1 commit

  • Yuchung Cheng and Marek Majkowski independently reported a weird
    behavior of TCP_USER_TIMEOUT option when used at connect() time.

    When the TCP_USER_TIMEOUT is reached, tcp_write_timeout()
    believes the flow should live, and the following condition
    in tcp_clamp_rto_to_user_timeout() programs one-jiffie timers :

    remaining = icsk->icsk_user_timeout - elapsed;
    if (remaining <= 0)
            return 1; /* user timeout has passed */

    Reported-by: Yuchung Cheng
    Reported-by: Marek Majkowski
    Cc: Jon Maxwell
    Link: https://marc.info/?l=linux-netdev&m=156940118307949&w=2
    Acked-by: Jon Maxwell
    Tested-by: Marek Majkowski
    Signed-off-by: Marek Majkowski
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Aug, 2019

1 commit

  • The current implementation of TCP MTU probing can considerably
    underestimate the MTU on lossy connections allowing the MSS to get down to
    48. We have found that in almost all of these cases on our networks these
    paths can handle much larger MTUs meaning the connections are being
    artificially limited. Even though TCP MTU probing can raise the MSS back
    up, we have seen this not happen in practice, leaving connections "stuck"
    with an MSS of 48 when heavy loss is present.

    Prior to pushing out this change we could not keep TCP MTU probing enabled
    because of the above reasons. Now, with a reasonable floor set, we've had
    it enabled for the past 6 months.

    The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
    administrators the ability to control the floor of MSS probing.

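    Roughly, the floor slots into the existing clamp in tcp_mtu_probing()
    (illustrative sketch, names per net/ipv4/tcp_timer.c) :

    mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
    mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
    mss = max(mss, net->ipv4.sysctl_tcp_mtu_probe_floor); /* new floor */
    mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
    icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
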
    Signed-off-by: Josh Hunt
    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Josh Hunt
     

16 Jun, 2019

1 commit

  • If MTU probing is enabled, tcp_mtu_probing() could very well end up
    with too small an MSS.

    Use the new sysctl tcp_min_snd_mss to make sure the MSS search
    is performed in an acceptable range.

    CVE-2019-11479 -- tcp mss hardcoded to 48

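    A sketch of the enforcement added to tcp_mtu_probing() (illustrative) :

    mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
    mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
    mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss); /* keep MSS sane */
    icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
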
    Signed-off-by: Eric Dumazet
    Reported-by: Jonathan Lemon
    Cc: Jonathan Looney
    Acked-by: Neal Cardwell
    Cc: Yuchung Cheng
    Cc: Tyler Hicks
    Cc: Bruce Curtis
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

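    As it now appears at the top of each converted C source file :

    // SPDX-License-Identifier: GPL-2.0-only
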
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

01 May, 2019

1 commit

  • The TCP sender would use a congestion window of 1 packet on the second SYN
    and SYNACK timeout, except for passive TCP Fast Open. This makes passive
    TFO too aggressive and unfair during congestion at handshake. This
    patch fixes this issue so TCP (fast open or not, passive or active)
    always conforms to RFC 6298.

    Note that tcp_enter_loss() is called only once during recurring
    timeouts. This is because during handshake, high_seq and snd_una
    are the same, so tcp_enter_loss() would incorrectly set the undo state
    variables multiple times.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

28 Jan, 2019

1 commit

  • Instead of using pingpong as a single bit of information, we refactor the
    code to treat it as a counter. When an interactive session is detected,
    we set the pingpong count to TCP_PINGPONG_THRESH, and when the pingpong
    count is >= TCP_PINGPONG_THRESH, we consider the session in pingpong mode.

    This patch is a pure refactor and sets foundation for the next patch.
    This patch itself does not change any pingpong logic.

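    A sketch of the resulting helpers (per include/net/inet_connection_sock.h;
    illustrative) :

    static inline void inet_csk_enter_pingpong_mode(struct sock *sk)
    {
            inet_csk(sk)->icsk_ack.pingpong = TCP_PINGPONG_THRESH;
    }

    static inline bool inet_csk_in_pingpong_mode(struct sock *sk)
    {
            return inet_csk(sk)->icsk_ack.pingpong >= TCP_PINGPONG_THRESH;
    }
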
    Signed-off-by: Wei Wang
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Wei Wang
     

18 Jan, 2019

6 commits

  • Previously when the sender fails to retransmit a data packet on
    timeout due to congestion in the local host (e.g. throttling in
    qdisc), it'll retry within an RTO up to 500ms.

    In low-RTT networks such as data-centers, RTO is often far
    below the default minimum of 200ms (and the 500ms cap). Then local
    host congestion could trigger a retry storm, pouring gas on the
    fire. Worse yet, the retry counter (icsk_retransmits) is not
    properly updated, so the aggressive retry may exceed the system
    limit (15 rounds) until the packet finally slips through.

    On such rare events, it's wise to retry more conservatively (500ms)
    and update the stats properly to reflect these incidents and follow
    the system limit. Note that this is consistent with the behavior
    when a keep-alive probe is dropped due to local congestion.

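    A sketch of the intended behavior in tcp_retransmit_timer()
    (illustrative, not the verbatim patch) :

    if (tcp_retransmit_skb(sk, tcp_rtx_queue_head(sk), 1) > 0) {
            /* Retransmission failed because of local congestion:
             * count the attempt and retry conservatively (500 ms).
             */
            icsk->icsk_retransmits++;
            inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
                                      TCP_RESOURCE_PROBE_INTERVAL,
                                      TCP_RTO_MAX);
            goto out;
    }
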
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Reviewed-by: Neal Cardwell
    Reviewed-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Previously we use the next unsent skb's timestamp to determine
    when to abort a socket stalling on window probes. This no longer
    works as skb timestamp reflects the last instead of the first
    transmission.

    Instead we can estimate how long the socket has been stalling
    with the probe count and the exponential backoff behavior.

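    A sketch of the estimate in tcp_probe_timer() (illustrative) :

    if (icsk->icsk_user_timeout) {
            /* reconstruct elapsed time from probes_out and the backoff */
            u32 elapsed = tcp_model_timeout(sk, icsk->icsk_probes_out,
                                            tcp_probe0_base(sk));

            if (elapsed >= icsk->icsk_user_timeout)
                    goto abort;
    }
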
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Reviewed-by: Neal Cardwell
    Reviewed-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Create a helper to model TCP exponential backoff for the next patch.
    This is a pure refactor with no behavior change.

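    The helper, as a sketch (per net/ipv4/tcp_timer.c) :

    static u32 tcp_model_timeout(struct sock *sk,
                                 u32 boundary,
                                 u32 rto_base)
    {
            u32 linear_backoff_thresh, timeout;

            /* RTO doubles each round until TCP_RTO_MAX, then stays flat */
            linear_backoff_thresh = ilog2(TCP_RTO_MAX / rto_base);
            if (boundary <= linear_backoff_thresh)
                    timeout = ((2 << boundary) - 1) * rto_base;
            else
                    timeout = ((2 << linear_backoff_thresh) - 1) * rto_base +
                              (boundary - linear_backoff_thresh) * TCP_RTO_MAX;
            return jiffies_to_msecs(timeout);
    }
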
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Reviewed-by: Neal Cardwell
    Reviewed-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch addresses a corner issue in the timeout behavior of a
    passive Fast Open socket. A passive Fast Open server may write
    and close the socket while it is re-trying the SYN-ACK to complete
    the handshake. After the handshake is completed, the server does
    not properly stamp the recovery start time (tp->retrans_stamp is
    0), and the socket may abort immediately on the very first FIN
    timeout, instead of retrying until it passes the system or user
    specified limit.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Reviewed-by: Neal Cardwell
    Reviewed-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Previously a TCP socket's retrans_stamp is not set if the
    retransmission has failed to send. As a result, if a socket is
    experiencing local issues retransmitting packets, determining when
    to abort the socket is complicated without knowing the starting time of
    the recovery, since retrans_stamp may remain zero.

    This complication causes sub-optimal behavior: TCP may use the
    latest, instead of the first, retransmission time to compute the
    elapsed time of a connection stalling due to local issues. TCP may
    then disregard the TCP retries settings and keep retrying until it
    finally succeeds: not a good idea when the local host is already strained.

    The simple fix is to always timestamp the start of a recovery.
    It's worth noting that retrans_stamp is also used to compare echo
    timestamp values to detect spurious recovery. This patch does
    not break that because retrans_stamp is still later than when the
    original packet was sent.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Reviewed-by: Neal Cardwell
    Reviewed-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Previously TCP only warns if its RTO timer fires and the
    retransmission queue is empty, but this causes a NULL pointer
    dereference later on. It's better to avoid such a catastrophic
    failure and simply exit with a warning.

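    A sketch of the guard in tcp_retransmit_timer() (illustrative) :

    struct sk_buff *skb = tcp_rtx_queue_head(sk);

    if (WARN_ON_ONCE(!skb))
            return; /* exit with a warning instead of dereferencing NULL */
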
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Reviewed-by: Neal Cardwell
    Reviewed-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

11 Jan, 2019

1 commit

  • Previously upon SYN timeouts the sender recomputes the txhash to
    try a different path. However, this does not apply to the initial
    timeout of SYN-data (active Fast Open). Therefore an active IPv6
    Fast Open connection may incur a one-second RTO penalty, taking on
    a new path only after the second SYN retransmission uses a new flow label.

    This patch removes this undesirable behavior so Fast Open changes
    the flow label just like regular connections do. This also helps
    avoid falsely disabling Fast Open on the sender, which triggers
    after two consecutive SYN timeouts on Fast Open.

    Signed-off-by: Yuchung Cheng
    Reviewed-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

01 Dec, 2018

2 commits

  • Previously the SNMP TCPTIMEOUTS counter had inconsistent accounting:
    1. It counts all SYN and SYN-ACK timeouts
    2. It counts timeouts in other states except recurring timeouts and
    timeouts after fast recovery or disorder state.

    Such selective accounting makes analysis difficult and complicated. For
    example the monitoring system needs to collect many other SNMP counters
    to infer the total amount of timeout events. This patch makes the
    TCPTIMEOUTS counter simply count all retransmit timeouts (SYN, data,
    or FIN).

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Previously there was an off-by-one bug in determining when to abort
    a stalled window-probing socket. This patch fixes that so it is
    consistent with tcp_write_timeout().

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

25 Nov, 2018

1 commit

  • When a qdisc setup including pacing FQ is dismantled and recreated,
    some TCP packets are sent earlier than instructed by TCP stack.

    TCP can be fooled when the ACK comes back, because the following
    operation can return a negative value.

    tcp_time_stamp(tp) - tp->rx_opt.rcv_tsecr;

    Some paths in the TCP stack were not dealing properly with this;
    this patch addresses four of them.

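    A sketch of the defensive pattern applied in those paths
    (illustrative) :

    s32 delta = tcp_time_stamp(tp) - tp->rx_opt.rcv_tsecr;

    if (delta < 0)
            return; /* packet left early; do not feed a bogus RTT sample */
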
    Fixes: ab408b6dc744 ("tcp: switch tcp and sch_fq to new earliest departure time model")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Nov, 2018

1 commit

  • Jean-Louis reported a TCP regression and bisected it to the recent
    SACK compression.

    After a loss episode (receiver not able to keep up and dropping
    packets because its backlog is full), the linux TCP stack sends
    a single SACK (DUPACK).

    The sender then waits a full RTO timer before recovering losses.

    While RFC 6675 says in section 5, "Algorithm Details",

    (2) If DupAcks < DupThresh but IsLost (HighACK + 1) returns true --
    indicating at least three segments have arrived above the current
    cumulative acknowledgment point, which is taken to indicate loss
    -- go to step (4).
    ...
    (4) Invoke fast retransmit and enter loss recovery as follows:

    there are old TCP stacks not implementing this strategy, and
    still counting the dupacks before starting fast retransmit.

    While these stacks probably perform poorly when receivers implement
    LRO/GRO, we should be a little more gentle to them.

    This patch makes sure we do not enable SACK compression unless
    3 dupacks have been sent since last rcv_nxt update.

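    A sketch of the gating in __tcp_ack_snd_check() (illustrative) :

    if (tp->compressed_ack_rcv_nxt != tp->rcv_nxt) {
            tp->compressed_ack_rcv_nxt = tp->rcv_nxt;
            tp->compressed_ack = 0; /* rcv_nxt advanced: restart counting */
    }
    /* send the first DupThresh (3) dupacks uncompressed */
    if (++tp->compressed_ack <= TCP_FASTRETRANS_THRESH)
            goto send_now;
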
    Ideally we should even rearm the timer to send one or two
    more DUPACK if no more packets are coming, but that will
    be work aiming for linux-4.21.

    Many thanks to Jean-Louis for bisecting the issue, providing
    packet captures and testing this patch.

    Fixes: 5d9f4262b7ea ("tcp: add SACK compression")
    Reported-by: Jean-Louis Dupond
    Tested-by: Jean-Louis Dupond
    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

16 Oct, 2018

1 commit

  • In the EDT design, I made the mistake of using tcp_wstamp_ns
    to store the last tcp_clock_ns() sample and to store the
    pacing virtual timer.

    This causes major regressions at high speed flows.

    Introduce tcp_clock_cache to store last tcp_clock_ns().
    This is needed because some arches have a slow high-resolution
    kernel time service.

    tcp_wstamp_ns is only updated when a packet is sent.

    Note that we can remove tcp_mstamp in the future since
    tcp_mstamp is essentially tcp_clock_cache/1000, so the
    apparent socket size increase is temporary.

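    A sketch of the refresh helper (per include/net/tcp.h; simplified) :

    static inline void tcp_mstamp_refresh(struct tcp_sock *tp)
    {
            u64 val = tcp_clock_ns();

            tp->tcp_clock_cache = val;                    /* nsec cache */
            tp->tcp_mstamp = div_u64(val, NSEC_PER_USEC); /* usec view  */
    }
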
    Fixes: 9799ccb0e984 ("tcp: add tcp_wstamp_ns socket field")
    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Oct, 2018

1 commit

  • In the recent TCP/EDT patch series, I switched TCP and sch_fq
    clocks from MONOTONIC to TAI, in order to match the choice made
    earlier for the sch_etf packet scheduler.

    But sure enough, this broke some setups where the TAI clock
    jumps forward (by almost 50 years...), as reported
    by Leonard Crestez.

    If we want to converge later, we'll probably need to add
    an skb field to differentiate the clock bases, or a socket option.

    In the meantime, a UDP application will need to use the CLOCK_MONOTONIC
    base for its SCM_TXTIME timestamps if using the fq packet scheduler.

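    For such an application, roughly (illustrative) :

    struct sock_txtime txtime = {
            .clockid = CLOCK_MONOTONIC, /* must match fq's clock base */
    };

    setsockopt(fd, SOL_SOCKET, SO_TXTIME, &txtime, sizeof(txtime));
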
    Fixes: 72b0094f9182 ("tcp: switch tcp_clock_ns() to CLOCK_TAI base")
    Fixes: 142537e41923 ("net_sched: sch_fq: switch to CLOCK_TAI")
    Fixes: fd2bca2aa789 ("tcp: switch internal pacing timer to CLOCK_TAI")
    Signed-off-by: Eric Dumazet
    Reported-by: Leonard Crestez
    Tested-by: Leonard Crestez
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Sep, 2018

2 commits

  • The next patch will use tcp_wstamp_ns to feed the internal
    TCP pacing timer, so switch to CLOCK_TAI to share the same base.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Switch internal TCP skb->skb_mstamp to skb->skb_mstamp_ns,
    from usec units to nsec units.

    Do not clear skb->tstamp before entering IP stacks in TX,
    so that qdiscs or devices can implement pacing based on the
    earliest departure time instead of socket sk->sk_pacing_rate.

    Packets are fed with tcp_wstamp_ns, and a following patch
    will update tcp_wstamp_ns when both TCP and sch_fq switch to
    the earliest departure time mechanism.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Jul, 2018

3 commits

  • Create the tcp_clamp_rto_to_user_timeout() helper routine to calculate
    the correct RTO, so that the TCP_USER_TIMEOUT socket option is more
    accurate. This takes into account suggestions and feedback from
    Eric Dumazet, Neal Cardwell and David Laight. Thanks to the 1st commit
    in this series we can avoid the msecs_to_jiffies() and jiffies_to_msecs()
    dance.

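    A sketch of the helper (illustrative) :

    static u32 tcp_clamp_rto_to_user_timeout(const struct sock *sk)
    {
            struct inet_connection_sock *icsk = inet_csk(sk);
            u32 elapsed, start_ts;

            start_ts = tcp_retransmit_stamp(sk);
            if (!icsk->icsk_user_timeout || !start_ts)
                    return icsk->icsk_rto;
            elapsed = tcp_time_stamp(tcp_sk(sk)) - start_ts;
            if (elapsed >= icsk->icsk_user_timeout)
                    return 1; /* user timeout has passed */
            return min_t(u32, icsk->icsk_rto,
                         msecs_to_jiffies(icsk->icsk_user_timeout - elapsed));
    }
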
    Signed-off-by: Jon Maxwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jon Maxwell
     
  • Create a separate helper routine, as per Neal Cardwell's suggestion,
    to be used by the final commit in this series and by
    retransmits_timed_out().

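    The helper, roughly (illustrative) :

    static u32 tcp_retransmit_stamp(const struct sock *sk)
    {
            u32 start_ts = tcp_sk(sk)->retrans_stamp;

            if (unlikely(!start_ts)) {
                    struct sk_buff *head = tcp_rtx_queue_head(sk);

                    if (!head)
                            return 0;
                    start_ts = tcp_skb_timestamp(head);
            }
            return start_ts;
    }
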
    Signed-off-by: Jon Maxwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jon Maxwell
     
  • This is a preparatory commit, part of a series that improves the
    accuracy of the socket TCP_USER_TIMEOUT option. Implement Eric Dumazet's
    idea to convert icsk->icsk_user_timeout from jiffies to msecs, to
    eliminate the msecs_to_jiffies() and jiffies_to_msecs() dance in the
    future.

    Signed-off-by: Jon Maxwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jon Maxwell
     

18 May, 2018

1 commit

  • When TCP receives an out-of-order packet, it immediately sends
    a SACK packet, generating network load but also forcing the
    receiver to send 1-MSS pathological packets, increasing its
    RTX queue length/depth, and thus processing time.

    Wifi networks suffer from this aggressive behavior, but generally
    speaking, all these SACK packets add fuel to the fire when networks
    are under congestion.

    This patch adds a high resolution timer and tp->compressed_ack counter.

    Instead of sending a SACK, we program this timer with a small delay,
    based on RTT and capped to 1 ms :

    delay = min(5% of RTT, 1 ms)

    If subsequent SACKs need to be sent while the timer has not yet
    expired, we simply increment tp->compressed_ack.

    When timer expires, a SACK is sent with the latest information.
    Whenever an ACK is sent (if data is sent, or if in-order
    data is received) timer is canceled.

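    Arming the timer looks roughly like this (illustrative sketch) :

    /* delay = min(5% of RTT, 1 ms); rcv_rtt_est.rtt_us is stored << 3 */
    delay = min_t(unsigned long, NSEC_PER_MSEC,
                  tp->rcv_rtt_est.rtt_us * (NSEC_PER_USEC >> 3) / 20);
    sock_hold(sk);
    hrtimer_start(&tp->compressed_ack_timer, ns_to_ktime(delay),
                  HRTIMER_MODE_REL_PINNED_SOFT);
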
    Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent
    if the sack blocks need to be shuffled, even if the timer has not
    expired.

    A new SNMP counter is added in the following patch.

    Two other patches add sysctls to allow changing the 1,000,000 and 44
    values that this commit hard-coded.

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Acked-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 May, 2018

1 commit

  • linux-4.16 got support for softirq-based hrtimers.
    TCP can switch its pacing hrtimer to this variant, since this
    avoids going through a tasklet and some atomic operations.

    Pacing timer logic then looks like the other (jiffies-based) TCP timers.

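    The switch itself is mostly a different hrtimer mode (illustrative) :

    hrtimer_init(&tcp_sk(sk)->pacing_timer, CLOCK_MONOTONIC,
                 HRTIMER_MODE_ABS_PINNED_SOFT);
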
    v2: use hrtimer_try_to_cancel() in tcp_clear_xmit_timers()
    to correctly release reference on socket if needed.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Mar, 2018

1 commit

  • When the connection is aborted, there is no point in
    keeping the packets on the write queue until the connection
    is closed.

    Similar to a27fd7a8ed38 ("tcp: purge write queue upon RST"),
    this is essential for a correct MSG_ZEROCOPY implementation,
    because userspace cannot call close(fd) before receiving
    zerocopy signals even when the connection is aborted.

    Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Neal Cardwell
    Reviewed-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
     

26 Jan, 2018

1 commit

  • Adds an optional call to the sock_ops BPF program based on whether the
    BPF_SOCK_OPS_RTO_CB_FLAG is set in bpf_sock_ops_flags.
    The BPF program is passed 2 arguments: icsk_retransmits and whether the
    RTO has expired.

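    A minimal sock_ops program using the new callback might look like
    this (illustrative sketch) :

    SEC("sockops")
    int watch_rto(struct bpf_sock_ops *skops)
    {
            switch (skops->op) {
            case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
                    /* opt this socket in to RTO callbacks */
                    bpf_sock_ops_cb_flags_set(skops,
                                              BPF_SOCK_OPS_RTO_CB_FLAG);
                    break;
            case BPF_SOCK_OPS_RTO_CB:
                    /* args[0] = icsk_retransmits, args[1] = RTO expired? */
                    break;
            }
            return 1;
    }
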
    Signed-off-by: Lawrence Brakmo
    Signed-off-by: Alexei Starovoitov

    Lawrence Brakmo
     

25 Jan, 2018

1 commit

  • When a tcp socket is closed, if it detects that its net namespace is
    exiting, close immediately and do not wait for FIN sequence.

    For normal sockets, a reference is taken to their net namespace, so it will
    never exit while the socket is open. However, kernel sockets do not take a
    reference to their net namespace, so it may begin exiting while the kernel
    socket is still open. In this case if the kernel socket is a tcp socket,
    it will stay open trying to complete its close sequence. The sock's dst(s)
    hold references to their interfaces, and those references are all
    transferred to the namespace's loopback interface when the real
    interfaces are taken down.
    When the namespace tries to take down its loopback interface, it hangs
    waiting for all references to the loopback interface to release, which
    results in messages like:

    unregister_netdevice: waiting for lo to become free. Usage count = 1

    These messages continue until the socket finally times out and closes.
    Since the net namespace cleanup holds the net_mutex while calling its
    registered pernet callbacks, any new net namespace initialization is
    blocked until the current net namespace finishes exiting.

    After this change, the tcp socket notices the exiting net namespace, and
    closes immediately, releasing its dst(s) and their reference to the
    loopback interface, which lets the net namespace continue exiting.

    Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
    Signed-off-by: Dan Streetman
    Signed-off-by: David S. Miller

    Dan Streetman
     

14 Dec, 2017

2 commits

  • Only the retransmit timer currently refreshes tcp_mstamp.

    We should do the same for delayed acks and keepalives.

    Even if RFC 7323 does not request it, this is consistent with what linux
    did in the past, when TS values were based on jiffies.

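    The fix is essentially one refresh near the top of each handler
    (illustrative) :

    /* in tcp_delack_timer_handler() and tcp_keepalive_timer() */
    tcp_mstamp_refresh(tcp_sk(sk));
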
    Fixes: 385e20706fac ("tcp: use tp->tcp_mstamp in output path")
    Signed-off-by: Eric Dumazet
    Cc: Soheil Hassas Yeganeh
    Cc: Mike Maloney
    Cc: Neal Cardwell
    Acked-by: Neal Cardwell
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Mike Maloney
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Prior to this patch, active Fast Open is paused on a specific
    destination IP address if the previous connections to the
    IP address have experienced recurring timeouts. But recent
    experiments by Microsoft (https://goo.gl/cykmn7) and Mozilla
    browsers indicate the issue is often caused by broken middle-boxes
    sitting close to the client. Therefore it is a much better user
    experience if Fast Open is disabled outright globally to avoid
    experiencing further timeouts on connections toward other
    destinations.

    This patch changes the destination-IP disablement to global
    disablement when a connection experiences recurring timeouts
    or aborts due to timeout. Repeated incidents would still
    exponentially increase the pause time, starting from an hour.
    This is extremely conservative but an unfortunate compromise to
    minimize bad experience due to broken middle-boxes.

    Reported-by: Dragana Damjanovic
    Reported-by: Patrick McManus
    Signed-off-by: Yuchung Cheng
    Reviewed-by: Wei Wang
    Reviewed-by: Neal Cardwell
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

18 Oct, 2017

1 commit

  • In preparation for unconditionally passing the struct timer_list pointer to
    all timer callbacks, switch to using the new timer_setup() and from_timer()
    to pass the timer pointer explicitly.

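    The conversion pattern, as applied to e.g. the TCP write timer
    (illustrative sketch) :

    timer_setup(&icsk->icsk_retransmit_timer, tcp_write_timer, 0);

    static void tcp_write_timer(struct timer_list *t)
    {
            struct inet_connection_sock *icsk =
                            from_timer(icsk, t, icsk_retransmit_timer);
            struct sock *sk = &icsk->icsk_inet.sk;

            /* handler body unchanged; the timer now carries its context */
    }
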
    Cc: "David S. Miller"
    Cc: Gerrit Renker
    Cc: Alexey Kuznetsov
    Cc: Hideaki YOSHIFUJI
    Cc: netdev@vger.kernel.org
    Cc: dccp@vger.kernel.org
    Signed-off-by: Kees Cook
    Signed-off-by: David S. Miller

    Kees Cook
     

07 Oct, 2017

1 commit

  • Using a linear list to store all skbs in write queue has been okay
    for quite a while : O(N) is not too bad when N < 500.

    Things get messy when N is on the order of 100,000 : Modern TCP stacks
    want 10Gbit+ of throughput even with 200 ms RTT flows.

    40 ns per cache line miss means a full scan can use 4 ms,
    blowing away CPU caches.

    SACK processing often can use various hints to avoid parsing
    whole retransmit queue. But with high packet losses and/or high
    reordering, hints no longer work.

    The sender has to process thousands of unfriendly SACKs, accumulating
    a huge socket backlog, burning a CPU and massively dropping packets.

    Using an rb-tree for the retransmit queue has been avoided for years
    because it added complexity and overhead, but now is the time
    to be more resistant and say no to quadratic behavior.

    1) RTX queue is no longer part of the write queue : already sent skbs
    are stored in one rb-tree.

    2) Since reaching the head of the write queue no longer needs
    sk->sk_send_head, we added a union of sk_send_head and tcp_rtx_queue;
    the helpers sketched below walk the new rb-tree.

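    The rb-tree is accessed through small helpers (per include/net/tcp.h;
    illustrative) :

    static inline struct sk_buff *tcp_rtx_queue_head(const struct sock *sk)
    {
            return skb_rb_first(&sk->tcp_rtx_queue);
    }

    static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
    {
            return skb_rb_last(&sk->tcp_rtx_queue);
    }
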
    Tested:

    On receiver :
    netem on ingress : delay 150ms 200us loss 1
    GRO disabled to force stress and SACK storms.

    for f in `seq 1 10`
    do
    ./netperf -H lpaa6 -l30 -- -K bbr -o THROUGHPUT|tail -1
    done | awk '{print $0} {sum += $0} END {printf "%7u\n",sum}'

    Before patch :

    323.87
    351.48
    339.59
    338.62
    306.72
    204.07
    304.93
    291.88
    202.47
    176.88
    2840

    After patch:

    1700.83
    2207.98
    2070.17
    1544.26
    2114.76
    2124.89
    1693.14
    1080.91
    2216.82
    1299.94
    18053

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet