26 Sep, 2020

2 commits

  • A pure refactor to move tcp_mark_skb_lost to tcp_input.c to prepare
    for the later loss marking consolidation.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • tcp_simple_retransmit(), used for path MTU discovery, may not adjust
    the retransmit hint properly: it deducts retrans_out before checking
    it to adjust the hint. This patch fixes this by using the correct
    routine, tcp_mark_skb_lost(), already used by the RACK loss
    detection.
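
    As a rough illustration of what such a mark-lost routine must keep
    consistent, here is a sketch with simplified stand-in types
    (skb_stub/tcp_stub are illustrative, not the kernel's structures):

      /* Sketch only: marking a lost retransmission must fix up
       * retrans_out as well as lost_out. */
      #define TCPCB_SACKED_RETRANS 0x02
      #define TCPCB_LOST           0x04

      struct skb_stub {
          unsigned int sacked;   /* TCPCB_* flag bits */
          unsigned int pcount;   /* packets covered by this (TSO) skb */
      };

      struct tcp_stub {
          unsigned int lost_out;
          unsigned int retrans_out;
      };

      static void mark_skb_lost(struct tcp_stub *tp, struct skb_stub *skb)
      {
          if (skb->sacked & TCPCB_SACKED_RETRANS) {
              /* A lost retransmission no longer counts as in flight. */
              skb->sacked &= ~TCPCB_SACKED_RETRANS;
              tp->retrans_out -= skb->pcount;
          }
          if (!(skb->sacked & TCPCB_LOST)) {
              skb->sacked |= TCPCB_LOST;
              tp->lost_out += skb->pcount;
          }
      }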

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

22 Sep, 2018

1 commit

  • There are a few places where TCP reads skb->skb_mstamp expecting
    a value in usec units.

    skb->tstamp (aka skb->skb_mstamp) will soon store CLOCK_TAI nsec value.

    Add tcp_skb_timestamp_us() to provide proper conversion when needed.
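
    A minimal sketch of the conversion such a helper performs (names
    here are illustrative; the kernel helper reads the timestamp off
    the skb itself):

      #include <stdint.h>

      #define NSEC_PER_USEC 1000ULL

      /* Once skb timestamps carry CLOCK_TAI nanoseconds, callers that
       * expect microseconds divide the raw value down. */
      static inline uint64_t skb_timestamp_us(uint64_t skb_mstamp_ns)
      {
          return skb_mstamp_ns / NSEC_PER_USEC;
      }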

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Aug, 2018

1 commit

  • Introduce a new TCP stat to record the number of reordering events
    seen and expose it in both tcp_info (TCP_INFO) and opt_stats
    (SOF_TIMESTAMPING_OPT_STATS).
    Applications can use this stat to track the frequency of reordering
    events, in addition to the existing reordering stat which tracks the
    magnitude of the latest reordering event.

    Note: this new stat tracks reordering events triggered by ACKs,
    which could often be fewer than the actual number of packets
    delivered out-of-order.
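
    The counter is exported as tcpi_reord_seen in struct tcp_info; a
    userspace reader could look roughly like the sketch below (header
    choices depend on your libc, and the field needs a kernel that
    includes this patch):

      #include <stdio.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <netinet/in.h>
      #include <linux/tcp.h>

      /* Print the reordering-event count for a connected TCP socket. */
      static void print_reord_seen(int fd)
      {
          struct tcp_info info;
          socklen_t len = sizeof(info);

          memset(&info, 0, sizeof(info));
          if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0)
              printf("reordering events seen: %u\n", info.tcpi_reord_seen);
      }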

    Signed-off-by: Wei Wang
    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Wei Wang
     

18 May, 2018

4 commits

  • Create and export a new helper tcp_rack_skb_timeout and move
    tcp_is_rack to prepare for the final RTO change.
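
    A sketch of the per-skb timeout computation such a helper
    centralizes (illustrative names; all values in usec):

      #include <stdint.h>

      /* Time left before this skb may be declared lost: <= 0 means
       * "lost now"; positive values feed the reordering timer. */
      static int64_t rack_skb_timeout_us(uint64_t now, uint64_t tx_stamp,
                                         uint64_t rack_rtt, uint64_t reo_wnd)
      {
          return (int64_t)(rack_rtt + reo_wnd) - (int64_t)(now - tx_stamp);
      }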

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Reviewed-by: Eric Dumazet
    Reviewed-by: Soheil Hassas Yeganeh
    Reviewed-by: Priyaranjan Jha
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • The previous approach for the lost and retransmit bits was to
    wipe the slate clean: zero all the lost and retransmit bits,
    correspondingly zero the lost_out and retrans_out counters, and
    then add back the lost bits (and correspondingly increment lost_out).

    The new approach is to treat this very much like marking packets
    lost in fast recovery. We don’t wipe the slate clean. We just say
    that for all packets that were not yet marked sacked or lost, we now
    mark them as lost in exactly the same way we do for fast recovery.

    This fixes the lost retransmit accounting at RTO time and greatly
    simplifies the RTO code by sharing much of the logic with Fast
    Recovery.
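
    A sketch of the new approach with simplified stand-in types (the
    real code also handles retransmission accounting and uses the
    kernel's queue iterators):

      #define TCPCB_SACKED_ACKED 0x01
      #define TCPCB_LOST         0x04

      struct rto_skb {
          struct rto_skb *next;    /* retransmit queue order */
          unsigned int sacked;     /* TCPCB_* bits */
          unsigned int pcount;
      };

      struct rto_tp { unsigned int lost_out; };

      /* No counter reset: just mark whatever is neither SACKed nor
       * already LOST, exactly as fast recovery would. */
      static void rto_mark_lost(struct rto_tp *tp, struct rto_skb *head)
      {
          struct rto_skb *skb;

          for (skb = head; skb; skb = skb->next) {
              if (skb->sacked & (TCPCB_SACKED_ACKED | TCPCB_LOST))
                  continue;            /* delivered or already counted */
              skb->sacked |= TCPCB_LOST;
              tp->lost_out += skb->pcount;
          }
      }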

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Reviewed-by: Eric Dumazet
    Reviewed-by: Soheil Hassas Yeganeh
    Reviewed-by: Priyaranjan Jha
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This is a rewrite of the NewReno loss recovery implementation that
    is simpler and standalone, for readability and better performance
    through less state.

    Note that NewReno here refers to RFC 6582, a modification to the
    fast recovery algorithm. In Linux it is used only if the connection
    does not support SACK. It should not be confused with Reno (AIMD)
    congestion control.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Reviewed-by: Eric Dumazet
    Reviewed-by: Soheil Hassas Yeganeh
    Reviewed-by: Priyaranjan Jha
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch adds support for the classic DUPACK threshold rule
    (#DupThresh) in RACK.

    When the number of packets SACKed is greater than or equal to the
    threshold, RACK sets the reordering window to zero, which
    immediately marks all the unsacked packets below the highest SACKed
    sequence as lost. Since this approach is known to not work well with
    reordering, RACK uses it only if no reordering has been observed.

    The DUPACK threshold rule is a particularly useful extension to the
    fast recoveries triggered by the RACK reordering timer: for example,
    data-center transfers where the RTT is much smaller than a timer
    tick, or high-RTT paths where the default RTT/4 may take too long.

    Note that this patch differs slightly from RFC6675. RFC6675
    considers a packet lost when at least #DupThresh higher-sequence
    packets are SACKed.

    With RACK, for connections that have seen reordering, RACK
    continues to use a dynamically-adaptive time-based reordering
    window to detect losses. But for connections on which we have not
    yet seen reordering, this patch considers a packet lost when at
    least one higher sequence packet is SACKed and the total number
    of SACKed packets is at least DupThresh. For example, suppose a
    connection has not seen reordering, and sends 10 packets, and
    packets 3, 5, 7 are SACKed. RFC6675 considers packets 1 and 2
    lost. RACK considers packets 1, 2, 4, 6 lost.

    There is some small risk of spurious retransmits here due to
    reordering. However, this is mostly limited to the first flight of
    a connection on which the sender receives SACKs from reordering.
    And RFC 6675 and FACK loss detection have a similar risk on the
    first flight with reordering (it's just that the risk of spurious
    retransmits from reordering was slightly narrower for those older
    algorithms due to the margin of 3*MSS).

    Also, the minimum reordering window is reduced from 1 msec to 0
    to recover more quickly on short-RTT transfers. Therefore RACK is
    more aggressive in marking packets lost during recovery, to reduce
    the reordering window timeouts.
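
    A sketch of the resulting reordering-window choice (illustrative;
    the adaptive reo_wnd scaling described elsewhere in this log is
    omitted here):

      #include <stdbool.h>

      #define DUPTHRESH 3   /* classic #DupThresh */

      /* Connections that have never seen reordering and have at least
       * DupThresh packets SACKed get a zero reordering window;
       * otherwise RACK stays time-based. */
      static unsigned int rack_reo_wnd_us(bool reordering_seen,
                                          unsigned int sacked_out,
                                          unsigned int min_rtt_us)
      {
          if (!reordering_seen && sacked_out >= DUPTHRESH)
              return 0;
          return min_rtt_us / 4;
      }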

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Reviewed-by: Eric Dumazet
    Reviewed-by: Soheil Hassas Yeganeh
    Reviewed-by: Priyaranjan Jha
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

09 Dec, 2017

3 commits

  • RACK skips an ACK unless it advances the most recently delivered
    TX timestamp (rack.mstamp). But since RACK also uses the most recent
    RTT to decide if a packet is lost, it should still run loss
    detection whenever the most recent RTT changes. For example, an ACK
    that does not advance the timestamp but triggers a cwnd undo due to
    reordering would then use the most recent (higher) RTT measurement
    to detect further losses.

    Signed-off-by: Yuchung Cheng
    Reviewed-by: Neal Cardwell
    Reviewed-by: Priyaranjan Jha
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • RACK should mark a packet lost when the remaining wait time is zero.

    Signed-off-by: Yuchung Cheng
    Reviewed-by: Neal Cardwell
    Reviewed-by: Priyaranjan Jha
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • RACK does not test the loss recovery state correctly when computing
    the reordering window. It assumes TCP is not in loss recovery if
    lost_out is zero. But lost_out can be zero during recovery, before
    tcp_rack_detect_loss() is called: an ACK may acknowledge all packets
    marked lost before it arrived, while tcp_rack_detect_loss() has not
    yet discovered new ones. The fix is to simply test the congestion
    state directly.

    Signed-off-by: Yuchung Cheng
    Reviewed-by: Neal Cardwell
    Reviewed-by: Priyaranjan Jha
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

05 Nov, 2017

1 commit

  • Currently TCP RACK loss detection does not work well if packets are
    reordered beyond its static reordering window (min_rtt/4). Under
    such reordering it may falsely trigger loss recoveries and reduce
    TCP throughput significantly.

    This patch improves that by increasing and reducing the reordering
    window based on DSACK, which is now supported by major TCP
    implementations. It makes RACK's reo_wnd adaptive, based on DSACK
    and the number of recoveries.

    - If a DSACK is received, increment reo_wnd by min_rtt/4 (upper
    bounded by srtt), since the spurious retransmission may have been
    caused by a reordering delay longer than reo_wnd.

    - Persist the current reo_wnd value for TCP_RACK_RECOVERY_THRESH
    (16) successful recoveries (which accounts for a full DSACK-based
    loss recovery undo). After that, reset it to the default
    (min_rtt/4).

    - reo_wnd is incremented at most once per RTT, so that the new DSACK
    we are reacting to was (approximately) caused by a spurious
    retransmit sent after reo_wnd was last updated.

    - reo_wnd is tracked in steps (of min_rtt/4) rather than as an
    absolute value, to account for changes in RTT.

    In our internal testing we observed a significant increase in
    throughput in scenarios where reordering exceeds min_rtt/4 (the
    previous static value).
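
    A sketch of this adaptation with simplified state (illustrative
    names; the kernel keeps the step count in its RACK state):

      #include <stdint.h>

      #define TCP_RACK_RECOVERY_THRESH 16

      /* reo_wnd kept as steps of min_rtt/4: bumped at most once per
       * RTT on a DSACK, decayed back to one step after 16 successful
       * recoveries. */
      struct reo_state {
          unsigned int steps;        /* effective wnd = steps * min_rtt/4 */
          unsigned int persist;      /* recoveries left before reset */
          uint64_t last_bump_us;
      };

      static void reo_on_dsack(struct reo_state *r, uint64_t now_us,
                               uint64_t rtt_us)
      {
          if (now_us - r->last_bump_us < rtt_us)
              return;                         /* at most once per RTT */
          r->steps++;
          r->persist = TCP_RACK_RECOVERY_THRESH;
          r->last_bump_us = now_us;
      }

      static void reo_on_recovery_end(struct reo_state *r)
      {
          if (r->persist && --r->persist == 0)
              r->steps = 1;                   /* back to min_rtt/4 */
      }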

    Signed-off-by: Priyaranjan Jha
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Priyaranjan Jha
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - the file had no licensing information in it,
    - the file was a */uapi/* one with no licensing information in it,
    - the file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX license identifier should be
    applied to a file was done in a spreadsheet of side-by-side results
    from the output of two independent scanners (ScanCode & Windriver)
    producing SPDX tag:value files, created by Philippe Ombredanne.
    Philippe prepared the base worksheet and did an initial spot review
    of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file-by-file comparison of the scanner
    results in the spreadsheet to determine which SPDX license
    identifier(s) should be applied to the file. She confirmed any
    determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    The criteria used to select files for SPDX license identifier
    tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they
      contained >5 lines of source.
    - The file already had some variant of a license header in it
      (even if <5 lines).

    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

06 Oct, 2017

2 commits

  • Refactor the RACK loop to improve readability and speed up the checks.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Use the new time-ordered list to speed up RACK. The detection
    logic is identical. But since the list is chronologically ordered
    by skb_mstamp and contains only skbs not yet acked or sacked,
    RACK can abort the loop upon hitting skbs that were sent more
    recently. On YouTube servers this patch reduces the iterations on
    the write queue by 40x. The improvement is even bigger with large
    BDP networks.
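
    A sketch of why the time-ordered list allows an early exit
    (stand-in types; deadline_us is the latest send time that can
    still be declared lost):

      #include <stdint.h>

      struct tx_skb {
          struct tx_skb *next;    /* ordered by send time, oldest first */
          uint64_t tx_stamp_us;
      };

      /* Once we meet an skb sent after the loss deadline, every later
       * skb was sent even more recently, so stop scanning. */
      static void rack_scan(struct tx_skb *oldest, uint64_t deadline_us)
      {
          struct tx_skb *skb;

          for (skb = oldest; skb; skb = skb->next) {
              if (skb->tx_stamp_us > deadline_us)
                  break;               /* all remaining skbs are newer */
              /* ... mark skb lost ... */
          }
      }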

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

20 Jul, 2017

1 commit

  • This patch adjusts the timeout formula used to schedule the TCP
    loss probe (TLP). The previous formula uses 2*SRTT, or
    1.5*RTT + DelayACKMax if only one packet is in flight, and keeps a
    lower bound of 10 msec, which is too large for short-RTT connections
    (e.g. within a data-center). The new formula,
    2*RTT + (inflight == 1 ? 200ms : 2 ticks), performs better for
    short and fast connections.
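
    The new formula expressed as a sketch (tick_us stands in for one
    jiffy; values illustrative, in usec):

      #include <stdint.h>

      #define TLP_DACK_MAX_US 200000ULL   /* 200 ms delayed-ACK allowance */

      /* With one packet in flight the receiver may delay its ACK, so
       * extra headroom is added; otherwise two ticks suffice. */
      static uint64_t tlp_timeout_us(uint64_t rtt_us, unsigned int inflight,
                                     uint64_t tick_us)
      {
          return 2 * rtt_us + (inflight == 1 ? TLP_DACK_MAX_US : 2 * tick_us);
      }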

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

18 May, 2017

2 commits

  • The TCP Timestamps option is defined in RFC 7323.

    Traditionally on Linux it has been tied to the internal
    'jiffies' variable, because that had been a cheap and good-enough
    generator.

    For TCP flows on the Internet, 1 ms resolution would be much better
    than 4 ms or 10 ms (HZ=250 or HZ=100, respectively).

    For TCP flows in the DC, Google has used usec resolution for more
    than two years with great success [1].

    Receive size autotuning (DRS) is indeed more precise and converges
    faster to the optimal window size.

    This patch converts tp->tcp_mstamp to a plain u64 value storing
    a 1 usec TCP clock.

    This choice will allow us to upstream the 1 usec TS option as
    discussed at IETF 97.

    [1] https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The idea is to later convert tp->tcp_mstamp to a full u64 counter
    with usec resolution, so that we can have a fine-grained TCP TS
    clock (RFC 7323) regardless of the HZ value.

    We try to refresh tp->tcp_mstamp only when necessary.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Apr, 2017

4 commits

  • I wrongly assumed tp->tcp_mstamp was up to date at the time
    tcp_rack_reo_timeout() was called.

    That is not true, since we only update tp->tcp_mstamp when receiving
    a packet (as initially done in commit 69e996c58a35 ("tcp: add
    tp->tcp_mstamp field")).

    Since tcp_rack_reo_timeout() is called by a timer and not an
    incoming packet, we need to refresh tp->tcp_mstamp.

    Fixes: 7c1c7308592f ("tcp: do not pass timestamp to tcp_rack_detect_loss()")
    Signed-off-by: Eric Dumazet
    Cc: Soheil Hassas Yeganeh
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • No longer needed, since tp->tcp_mstamp holds the information.

    This is needed to remove sack_state.ack_time in a following patch.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This is no longer used, since tcp_rack_detect_loss() takes
    the timestamp from tp->tcp_mstamp.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • We can use tp->tcp_mstamp as it contains a recent timestamp.

    This removes a call to skb_mstamp_get() from tcp_rack_reo_timeout().

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 Apr, 2017

1 commit

  • The lost retransmit SNMP stat under-counts retransmissions that use
    segment offloading (TSO). This patch fixes that, so all
    retransmission-related SNMP counters are consistent.

    Fixes: 10d3be569243 ("tcp-tso: do not split TSO packets at retransmit time")
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: Neal Cardwell
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

14 Jan, 2017

6 commits

  • This patch changes two things:

    1. Start fast recovery with RACK in addition to other heuristics
    (e.g., DUPACK threshold, FACK). Prior to this change, RACK
    was enabled to detect losses only after recovery had been
    started by other algorithms.

    2. Disable TCP early retransmit. RACK subsumes early retransmit
    with the new reordering timer feature. A later patch in this
    series removes the early retransmit code.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • The packets inside a jumbo skb (e.g., TSO) share the same skb
    timestamp, even though they are sent sequentially on the wire.
    Since RACK is based on time, it cannot detect that some packets
    inside the same skb are lost. However, we can leverage the packet
    sequence numbers as extended timestamps to detect losses.
    Therefore, when the RACK timestamp is identical to an skb's
    timestamp (i.e., one of the packets of the skb is acked or sacked),
    we use the sequence numbers of the acked and unacked packets to
    break ties.

    We can use the same sequence logic to advance the RACK xmit time as
    well, to detect more losses and avoid timeouts.
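
    A sketch of the tie-break predicate, mirroring the kernel's
    tcp_rack_sent_after() helper (seq_after is the usual wrap-safe
    sequence comparison):

      #include <stdbool.h>
      #include <stdint.h>

      /* Wrap-safe "seq1 is after seq2" comparison. */
      static bool seq_after(uint32_t seq1, uint32_t seq2)
      {
          return (int32_t)(seq1 - seq2) > 0;
      }

      /* Packet 1 counts as sent after packet 2 if its send time is
       * newer, or the times are identical (same TSO skb) and its end
       * sequence is higher. */
      static bool rack_sent_after(uint64_t t1, uint64_t t2,
                                  uint32_t seq1, uint32_t seq2)
      {
          return t1 > t2 || (t1 == t2 && seq_after(seq1, seq2));
      }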

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch makes RACK install a reordering timer when it suspects
    some packets might be lost, but wants to delay the decision
    a little bit to accommodate reordering.

    It does not create a new timer but instead repurposes the existing
    RTO timer, because both are meant to retransmit packets.
    Specifically it arms a timer ICSK_TIME_REO_TIMEOUT when
    the RACK timing check fails. The wait time is set to

    RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge

    This translates to expecting a packet (Packet) should take
    (RACK.RTT + RACK.reo_wnd + fudge) to deliver after it was sent.

    When there are multiple packets that need a timer, we use one timer
    with the maximum timeout. Therefore the timer conservatively uses
    the maximum window to expire N packets by one timeout, instead of
    N timeouts to expire N packets sent at different times.

    The fudge factor is 2 jiffies, to ensure that when the timer fires,
    all the suspected packets have exceeded the deadline and are marked
    lost by tcp_rack_detect_loss(). It has to be at least 1 jiffy
    because the clock may tick between calling
    icsk_reset_xmit_timer(timeout) and actually arming the timer. The
    second jiffy lower-bounds the timeout to 2 jiffies when reo_wnd is
    < 1 ms.

    When the reordering timer fires (tcp_rack_reo_timeout): If we aren't
    in Recovery we'll enter fast recovery and force fast retransmit.
    This is very similar to the early retransmit (RFC5827) except RACK
    is not constrained to only enter recovery for small outstanding
    flights.
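
    A sketch of the one-timer-for-many-packets idea (illustrative:
    remaining_us holds each suspect's remaining wait, jiffy_us is one
    tick in usec):

      #include <stdint.h>

      /* Arm one timer for the maximum remaining wait among all
       * suspected packets, plus the 2-jiffy fudge, so every suspect is
       * past its deadline when the timer fires. */
      static uint64_t reo_timer_us(const int64_t *remaining_us, int n,
                                   uint64_t jiffy_us)
      {
          int64_t max_wait = 0;
          int i;

          for (i = 0; i < n; i++)
              if (remaining_us[i] > max_wait)
                  max_wait = remaining_us[i];
          return (uint64_t)max_wait + 2 * jiffy_us;
      }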

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Record the most recent RTT in RACK. It is often identical to the
    "ca_rtt_us" values in tcp_clean_rtx_queue. But when the packet has
    been retransmitted, RACK chooses to believe the ACK is for the
    (latest) retransmitted packet if the RTT is over the minimum RTT.

    This requires passing the arrival time of the most recent ACK to
    RACK routines. The timestamp is now recorded in the "ack_time"
    in tcp_sacktag_state during the ACK processing.

    This patch does not change the RACK algorithm itself. It only adds
    the RTT variable to prepare for the next main patch.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Create a new helper tcp_rack_detect_loss to prepare for the
    upcoming RACK reordering timer patch.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Create a new helper tcp_rack_mark_skb_lost to prepare for the
    upcoming RACK reordering timer support.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

03 May, 2016

1 commit

  • We want to make the TCP stack preemptible, as draining the prequeue
    and backlog queues can take a lot of time.

    Many SNMP updates were assuming that BH (and preemption) was
    disabled.

    We need to convert some __NET_INC_STATS() calls to NET_INC_STATS()
    and some __TCP_INC_STATS() to TCP_INC_STATS().

    Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset()
    and tcp_v4_send_ack(), we add an explicit preempt-disabled section.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Oct, 2015

2 commits

  • This patch implements the second half of RACK, which uses the most
    recent transmit time among all delivered packets to detect losses.

    tcp_rack_mark_lost() is called upon receiving a dubious ACK.
    It then checks whether a not-yet-sacked packet was sent at least
    "reo_wnd" prior to the send time of the most recently delivered
    packet. If so, the packet is deemed lost.

    The "reo_wnd" reordering window starts with 1msec for fast loss
    detection and changes to min-RTT/4 when reordering is observed.
    We found 1msec accommodates well on tiny degree of reordering
    (
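
    The core test, as a sketch (illustrative names; all times in usec):

      #include <stdbool.h>
      #include <stdint.h>

      /* An unacked packet is deemed lost if it was sent at least
       * reo_wnd before the send time of the most recently delivered
       * packet (rack_mstamp). */
      static bool rack_lost(uint64_t pkt_tx_us, uint64_t rack_mstamp_us,
                            uint64_t reo_wnd_us)
      {
          return pkt_tx_us + reo_wnd_us <= rack_mstamp_us;
      }
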
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch is the first half of the RACK loss recovery.

    RACK loss recovery uses the notion of time instead of packet
    sequence (FACK) or counts (dupthresh). It's inspired by the
    previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
    transmit (new data packet) is sacked, the retransmitted sequences
    below the newly sacked sequence must have been lost,
    since at least one round trip time has elapsed.

    But that heuristic has several limitations:
    1) it can't detect tail drops, since it depends on limited transmit;
    2) it is disabled upon reordering (it assumes no reordering);
    3) it is only enabled in fast recovery, not timeout recovery.

    RACK (Recent ACK) addresses these limitations with the notion
    of time instead: a packet P1 is lost if a later packet P2 is
    s/acked, as at least one round trip has passed.

    Since RACK cares about the time sequence instead of the data
    sequence of packets, it can detect tail drops when a later
    retransmission is s/acked, while FACK or dupthresh can't. For
    reordering, RACK uses a dynamically adjusted reordering window
    ("reo_wnd") to reduce false positives on every (small) degree of
    reordering.

    This patch implements tcp_advanced_rack(), which tracks in
    tp->rack.mstamp the most recent transmission time among the packets
    that have been delivered (ACKed or SACKed). This timestamp is the
    key to determining which packets have been lost.

    Consider an example that the sender sends six packets:
    T1: P1 (lost)
    T2: P2
    T3: P3
    T4: P4
    T100: sack of P2. rack.mstamp = T2
    T101: retransmit P1
    T102: sack of P2,P3,P4. rack.mstamp = T4
    T205: ACK of P4 since the hole is repaired. rack.mstamp = T101

    We need to be careful about spurious retransmission because it may
    falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
    to falsely mark all packets lost, just like a spurious timeout.

    We identify spurious retransmissions by the ACK's TS echo value.
    If the TS option is not available but the retransmission was
    acknowledged less than min-RTT ago, it is likely spurious. We
    refrain from using the transmission time of such spurious
    retransmissions.

    The second half, implemented in the next patch, marks packets
    lost using the RACK timestamp.
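
    A sketch of the advance step with the spurious-retransmission guard
    (simplified: the spuriousness test itself is elided into a flag):

      #include <stdbool.h>
      #include <stdint.h>

      struct rack {
          uint64_t mstamp_us;  /* send time of most recently delivered pkt */
      };

      /* Advance the RACK timestamp on an (s)acked packet. A
       * retransmission's send time is skipped when the retransmit looks
       * spurious (TS echo predates it, or it was acked within min-RTT). */
      static void rack_advance(struct rack *r, uint64_t xmit_time_us,
                               bool retransmitted, bool likely_spurious)
      {
          if (retransmitted && likely_spurious)
              return;                 /* don't trust this send time */
          if (xmit_time_us > r->mstamp_us)
              r->mstamp_us = xmit_time_us;
      }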

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng