02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis, with 60,537 files
    assessed. Kate Stewart did a file-by-file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    should be applied to each file. She confirmed any determination that was
    not immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

20 Jul, 2017

1 commit

  • This patch adjusts the timeout formula used to schedule the TCP loss probe
    (TLP). The previous formula uses 2*SRTT, or 1.5*RTT + DelayACKMax if
    only one packet is in flight. It keeps a lower bound of 10 msec, which
    is too large for short RTT connections (e.g. within a data-center).
    The new formula is 2*RTT + (inflight == 1 ? 200ms : 2ticks), which
    performs better for short and fast connections.
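
    The new formula can be sketched in user-space C as follows. This is an
    illustration only, not the kernel code: the function name, the constants
    and the HZ=1000 tick length are assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical constants for this sketch; they are not kernel definitions. */
#define SKETCH_TICK_US       1000u    /* one timer tick, assuming HZ=1000 */
#define SKETCH_DELACK_MAX_US 200000u  /* the 200ms term from the formula */

/*
 * New probe timeout from the commit message:
 *   2*RTT + (inflight == 1 ? 200ms : 2 ticks)
 * All times are in microseconds.
 */
static uint64_t sketch_tlp_timeout_us(uint64_t rtt_us, unsigned int inflight)
{
    uint64_t slack_us = (inflight == 1) ? SKETCH_DELACK_MAX_US
                                        : 2u * SKETCH_TICK_US;

    return 2u * rtt_us + slack_us;
}
```

    Under these assumptions, a 100 usec RTT with several packets in flight
    gives a probe timeout of 2.2 msec, well below the old 10 msec floor.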

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

18 May, 2017

2 commits

  • The TCP Timestamps option is defined in RFC 7323.

    Traditionally on Linux, it has been tied to the internal
    'jiffies' variable, because it had been a cheap and good enough
    generator.

    For TCP flows on the Internet, 1 ms resolution would be much better
    than 4ms or 10ms (HZ=250 or HZ=100 respectively).

    For TCP flows in the DC, Google has used usec resolution for more
    than two years with great success [1].

    Receive size autotuning (DRS) is indeed more precise and converges
    faster to optimal window size.

    This patch converts tp->tcp_mstamp to a plain u64 value storing
    a 1 usec TCP clock.
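
    One way to see why a plain u64 usec clock is convenient: any coarser
    timestamp resolution can be derived from it by a division, independent
    of HZ. The helper below is a hypothetical user-space sketch; its name
    and parameters are not taken from the kernel.

```c
#include <stdint.h>

#define SKETCH_USEC_PER_SEC 1000000u

/*
 * Derive a coarser timestamp counter from a 1 usec clock.
 * ts_hz is the desired tick rate (hypothetical parameter); it must be
 * non-zero and no larger than SKETCH_USEC_PER_SEC.
 */
static uint32_t sketch_ts_from_usec(uint64_t clock_us, uint32_t ts_hz)
{
    uint32_t usec_per_tick = SKETCH_USEC_PER_SEC / ts_hz;

    return (uint32_t)(clock_us / usec_per_tick);
}
```

    With ts_hz = 1000 the counter advances every 1 ms; with ts_hz = 250 or
    100 it advances every 4 ms or 10 ms, matching the jiffies-based
    granularity mentioned above.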

    This choice will allow us to upstream the 1 usec TS option as
    discussed in IETF 97.

    [1] https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The idea is to later convert tp->tcp_mstamp to a full u64 counter
    using usec resolution, so that we can later have a fine-grained
    TCP TS clock (RFC 7323), regardless of the HZ value.

    We try to refresh tp->tcp_mstamp only when necessary.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Apr, 2017

4 commits

  • I wrongly assumed tp->tcp_mstamp was up to date at the time
    tcp_rack_reo_timeout() was called.

    This is not true, since we only update tp->tcp_mstamp when receiving
    a packet (as initially done in commit 69e996c58a35 ("tcp: add
    tp->tcp_mstamp field")).

    Since tcp_rack_reo_timeout() is called by a timer and not by an incoming
    packet, we need to refresh tp->tcp_mstamp.

    Fixes: 7c1c7308592f ("tcp: do not pass timestamp to tcp_rack_detect_loss()")
    Signed-off-by: Eric Dumazet
    Cc: Soheil Hassas Yeganeh
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • No longer needed, since tp->tcp_mstamp holds the information.

    This is needed to remove sack_state.ack_time in a following patch.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This is no longer used, since tcp_rack_detect_loss() takes
    the timestamp from tp->tcp_mstamp

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • We can use tp->tcp_mstamp as it contains a recent timestamp.

    This removes a call to skb_mstamp_get() from tcp_rack_reo_timeout()

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 Apr, 2017

1 commit

  • The lost retransmit SNMP stat under-counts retransmissions
    that use segment offloading. This patch fixes that so all
    retransmission-related SNMP counters are consistent.

    Fixes: 10d3be569243 ("tcp-tso: do not split TSO packets at retransmit time")
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: Neal Cardwell
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

14 Jan, 2017

6 commits

  • This patch changes two things:

    1. Start fast recovery with RACK in addition to other heuristics
    (e.g., DUPACK threshold, FACK). Prior to this change, RACK
    could detect losses only after recovery had been
    started by other algorithms.

    2. Disable TCP early retransmit. RACK subsumes early retransmit
    with the new reordering timer feature. A later patch in this
    series removes the early retransmit code.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • The packets inside a jumbo skb (e.g., TSO) share the same skb
    timestamp, even though they are sent sequentially on the wire. Since
    RACK is based on time, it cannot detect that some packets inside the
    same skb are lost. However, we can leverage the packet sequence
    numbers as extended timestamps to detect losses. Therefore, when
    RACK timestamp is identical to skb's timestamp (i.e., one of the
    packets of the skb is acked or sacked), we use the sequence numbers
    of the acked and unacked packets to break ties.

    We can use the same sequence logic to advance RACK xmit time as
    well to detect more losses and avoid timeout.
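
    The tie-breaking rule can be sketched as a comparison on
    (timestamp, sequence number) pairs. The helper names below are
    hypothetical, and the code is a user-space illustration rather than
    the kernel implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe test that sequence number a comes after b. */
static bool sketch_seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/*
 * True if the packet (t1, seq1) was sent after the packet (t2, seq2).
 * Packets from the same skb share one timestamp, so equal timestamps
 * are ordered by sequence number instead.
 */
static bool sketch_sent_after(uint64_t t1, uint64_t t2,
                              uint32_t seq1, uint32_t seq2)
{
    return t1 > t2 || (t1 == t2 && sketch_seq_after(seq1, seq2));
}
```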

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch makes RACK install a reordering timer when it suspects
    some packets might be lost, but wants to delay the decision
    a little bit to accommodate reordering.

    It does not create a new timer but instead repurposes the existing
    RTO timer, because both are meant to retransmit packets.
    Specifically it arms a timer ICSK_TIME_REO_TIMEOUT when
    the RACK timing check fails. The wait time is set to

    RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge

    This translates to expecting that a packet (Packet) should take
    (RACK.RTT + RACK.reo_wnd + fudge) to be delivered after it was sent.

    When there are multiple packets that need a timer, we use one timer
    with the maximum timeout. Therefore the timer conservatively uses
    the maximum window to expire N packets by one timeout, instead of
    N timeouts to expire N packets sent at different times.

    The fudge factor is 2 jiffies to ensure that when the timer fires, all
    the suspected packets have exceeded the deadline and are marked lost
    by tcp_rack_detect_loss(). It has to be at least 1 jiffy because the
    clock may tick between calling icsk_reset_xmit_timer(timeout) and
    actually hanging the timer. The second jiffy lower-bounds the timeout
    to 2 jiffies when reo_wnd is < 1ms.

    When the reordering timer fires (tcp_rack_reo_timeout): If we aren't
    in Recovery we'll enter fast recovery and force fast retransmit.
    This is very similar to the early retransmit (RFC5827) except RACK
    is not constrained to only enter recovery for small outstanding
    flights.
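
    The wait-time computation described above, including the choice of a
    single timer with the maximum timeout, can be sketched as follows. The
    struct and function names are hypothetical, all times are in
    microseconds, and packets whose deadline has already passed are assumed
    to be marked lost elsewhere, so they do not contribute to the timer.

```c
#include <stddef.h>
#include <stdint.h>

/* One not-yet-acked packet; the field name is an assumption. */
struct sketch_pkt {
    uint64_t xmit_time_us;
};

/*
 * RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time), without the fudge.
 * A result <= 0 means the packet has already exceeded its deadline.
 */
static int64_t sketch_rack_remaining_us(uint64_t rack_rtt_us,
                                        uint64_t reo_wnd_us,
                                        uint64_t now_us,
                                        uint64_t xmit_time_us)
{
    return (int64_t)(rack_rtt_us + reo_wnd_us) -
           (int64_t)(now_us - xmit_time_us);
}

/*
 * One timer for all pending packets: take the maximum remaining time
 * and add the fudge. Returns 0 when no packet needs a timer.
 */
static uint64_t sketch_rack_reo_timeout_us(const struct sketch_pkt *pkts,
                                           size_t n,
                                           uint64_t rack_rtt_us,
                                           uint64_t reo_wnd_us,
                                           uint64_t now_us,
                                           uint64_t fudge_us)
{
    int64_t max_remaining = 0;

    for (size_t i = 0; i < n; i++) {
        int64_t r = sketch_rack_remaining_us(rack_rtt_us, reo_wnd_us,
                                             now_us, pkts[i].xmit_time_us);
        if (r > max_remaining)
            max_remaining = r;
    }
    return max_remaining > 0 ? (uint64_t)max_remaining + fudge_us : 0;
}
```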

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Record the most recent RTT in RACK. It is often identical to the
    "ca_rtt_us" values in tcp_clean_rtx_queue. But when the packet has
    been retransmitted, RACK chooses to believe the ACK is for the
    (latest) retransmitted packet if the RTT is over minimum RTT.

    This requires passing the arrival time of the most recent ACK to
    RACK routines. The timestamp is now recorded in the "ack_time"
    in tcp_sacktag_state during the ACK processing.

    This patch does not change the RACK algorithm itself. It only adds
    the RTT variable to prepare the next main patch.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Create a new helper tcp_rack_detect_loss to prepare the upcoming
    RACK reordering timer patch.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Create a new helper tcp_rack_mark_skb_lost to prepare the
    upcoming RACK reordering timer support.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

03 May, 2016

1 commit

  • We want to make the TCP stack preemptible, as draining the prequeue
    and backlog queues can take a lot of time.

    Many SNMP updates were assuming that BH (and preemption) was disabled.

    We need to convert some __NET_INC_STATS() calls to NET_INC_STATS()
    and some __TCP_INC_STATS() calls to TCP_INC_STATS().

    Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset()
    and tcp_v4_send_ack(), we add an explicit preempt disabled section.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Apr, 2016

1 commit


21 Oct, 2015

2 commits

  • This patch implements the second half of RACK, which uses the most
    recent transmit time among all delivered packets to detect losses.

    tcp_rack_mark_lost() is called upon receiving a dubious ACK.
    It then checks if a not-yet-sacked packet was sent at least
    "reo_wnd" prior to the sent time of the most recently delivered packet.
    If so, the packet is deemed lost.
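
    The per-packet check can be sketched like this. It follows the wording
    of this message ("at least reo_wnd prior") and is not the kernel
    function; the name and parameters are hypothetical, and times are in
    microseconds.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * A not-yet-sacked packet is deemed lost if it was sent at least
 * reo_wnd_us before the most recently delivered packet was sent
 * (rack_mstamp_us).
 */
static bool sketch_rack_pkt_lost(uint64_t pkt_xmit_us,
                                 uint64_t rack_mstamp_us,
                                 uint64_t reo_wnd_us,
                                 bool sacked)
{
    if (sacked)
        return false;   /* already delivered, cannot be lost */
    if (rack_mstamp_us < pkt_xmit_us)
        return false;   /* sent after the most recently delivered packet */
    return rack_mstamp_us - pkt_xmit_us >= reo_wnd_us;
}
```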

    The "reo_wnd" reordering window starts with 1msec for fast loss
    detection and changes to min-RTT/4 when reordering is observed.
    We found 1msec accommodates well on tiny degree of reordering
    (
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch is the first half of the RACK loss recovery.

    RACK loss recovery uses the notion of time instead
    of packet sequence (FACK) or counts (dupthresh). It's inspired by the
    previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
    transmit (new data packet) is sacked, then the current retransmitted
    sequence below the newly sacked sequence must have been lost,
    since at least one round trip time has elapsed.

    But it has several limitations:
    1) it can't detect tail drops since it depends on limited transmit
    2) it is disabled upon reordering (assumes no reordering)
    3) it is only enabled in fast recovery but not in timeout recovery

    RACK (Recently ACK) addresses these limitations with the notion
    of time instead: a packet P1 is lost if a later packet P2 is s/acked,
    as at least one round trip has passed.

    Since RACK cares about the time sequence instead of the data sequence
    of packets, it can detect tail drops when a later retransmission is
    s/acked while FACK or dupthresh can't. For reordering, RACK uses a
    dynamically adjusted reordering window ("reo_wnd") to reduce false
    positives on every (small) degree of reordering.

    This patch implements tcp_advanced_rack() which tracks the
    most recent transmission time among the packets that have been
    delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
    is the key to determine which packet has been lost.

    Consider an example that the sender sends six packets:
    T1: P1 (lost)
    T2: P2
    T3: P3
    T4: P4
    T100: sack of P2. rack.mstamp = T2
    T101: retransmit P1
    T102: sack of P2,P3,P4. rack.mstamp = T4
    T205: ACK of P4 since the hole is repaired. rack.mstamp = T101

    We need to be careful about spurious retransmission because it may
    falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
    to falsely mark all packets lost, just like a spurious timeout.

    We identify spurious retransmission by the ACK's TS echo value.
    If TS option is not applicable but the retransmission is acknowledged
    less than min-RTT ago, it is likely to be spurious. We refrain from
    using the transmission time of these spurious retransmissions.
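
    The timeline and the min-RTT heuristic above can be sketched in
    user-space C. This models only the case without the TS option, and the
    struct, function name and parameters are hypothetical rather than the
    kernel's.

```c
#include <stdbool.h>
#include <stdint.h>

/* Most recent transmit time among delivered (ACKed or SACKed) packets. */
struct sketch_rack {
    uint64_t mstamp;
};

/*
 * Called for each newly delivered packet. Returns true if rack->mstamp
 * advanced. A retransmission acknowledged less than min_rtt after it was
 * sent is treated as likely spurious and its transmit time is ignored.
 */
static bool sketch_rack_advance(struct sketch_rack *rack,
                                uint64_t xmit_time, bool retransmitted,
                                uint64_t now, uint64_t min_rtt)
{
    if (xmit_time <= rack->mstamp)
        return false;   /* not newer than what is already recorded */
    if (retransmitted && now - xmit_time < min_rtt)
        return false;   /* likely a spurious retransmission */
    rack->mstamp = xmit_time;
    return true;
}
```

    Replaying the example with a min RTT of 90 time units: the SACK at T100
    sets mstamp to T2, the SACK at T102 moves it to T4, and the ACK at T205
    moves it to T101 because 104 units have passed since the retransmit.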

    The second half is implemented in the next patch, which marks packets
    lost using the RACK timestamp.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng