13 Jun, 2019

1 commit

  • Adding delays to TCP flows is crucial for studying behavior
    of TCP stacks, including congestion control modules.

    Linux offers the netem module, but it has impractical constraints:
    - Root access is required to change the qdisc
    - Hard to set up on egress when combined with a non-trivial qdisc like FQ
    - A single delay for all flows.

    EDT (Earliest Departure Time) adoption in the TCP stack allows us
    to enable a per-socket delay at a very small cost.

    Networking tools can now establish thousands of flows, each of them
    with a different delay, simulating real world conditions.

    This requires the FQ packet scheduler or an EDT-enabled NIC.

    This patch adds the TCP_TX_DELAY socket option, to set a delay in
    usec units.

    unsigned int tx_delay = 10000; /* 10 msec */

    setsockopt(fd, SOL_TCP, TCP_TX_DELAY, &tx_delay, sizeof(tx_delay));
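    A slightly fuller user-space sketch with error checking (the helper name
    is mine; TCP_TX_DELAY may be missing from older uapi headers, so the
    fallback value below is an assumption taken from current
    include/uapi/linux/tcp.h):

    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>

    #ifndef TCP_TX_DELAY
    #define TCP_TX_DELAY 37  /* assumption: value in current uapi headers */
    #endif

    /* Give one socket its own artificial transmit delay, in usec.
     * IPPROTO_TCP is the same level as SOL_TCP on Linux. */
    static int set_tx_delay(int fd, unsigned int delay_usec)
    {
            if (setsockopt(fd, IPPROTO_TCP, TCP_TX_DELAY,
                           &delay_usec, sizeof(delay_usec)) < 0) {
                    perror("setsockopt(TCP_TX_DELAY)");
                    return -1;
            }
            return 0;
    }

    Each socket can get a different delay, e.g. set_tx_delay(fd1, 10000) for
    10 msec on one flow and set_tx_delay(fd2, 50000) for 50 msec on another.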

    Note that FQ packet scheduler limits might need some tweaking:

    man tc-fq

    PARAMETERS
    limit
    Hard limit on the real queue size. When this limit is
    reached, new packets are dropped. If the value is lowered,
    packets are dropped so that the new limit is met. Default
    is 10000 packets.

    flow_limit
    Hard limit on the maximum number of packets queued per
    flow. Default value is 100.

    Use of the TCP_TX_DELAY option will increase the number of skbs in the
    FQ qdisc, so packets will be dropped if either of the above limits is hit.

    Use of a jump label makes this support free at runtime for hosts
    that never use the option.

    Also note that TSQ (TCP Small Queues) limits are slightly changed
    with this patch: we need to account for the fact that artificially
    delayed skbs must not stop us from providing more skbs to feed the
    pipe (netem uses skb_orphan_partial() for this purpose, but FQ
    cannot use this trick).

    Because of that, using big delays might very well trigger
    old bugs in the TSO auto-defer logic and/or sndbuf-limited detection.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Jun, 2019

1 commit

    When autoflowlabel is in action, skb_get_hash_flowi6()
    derives the flowlabel from a non-zero skb->hash.

    If skb->hash is zero, a flow dissection is performed.

    Since all TCP skbs sent from the ESTABLISHED state inherit their
    skb->hash from sk->sk_txhash, we had better keep a copy
    of sk->sk_txhash in the TIME_WAIT socket.

    After this patch, ACK or RST packets sent on behalf of
    a TIME_WAIT socket have the flowlabel that was previously
    used by the flow.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

01 May, 2019

1 commit

    Linux implements RFC6298 and uses an initial congestion window
    of 1 upon establishing the connection if the SYNACK packet is
    retransmitted 2 or more times. In cellular networks SYNACK timeouts
    are often spurious if the wireless radio was dormant or idle. Also,
    some network paths have an RTT longer than the default SYNACK timeout.
    In both cases falsely starting with a minimal cwnd is detrimental
    to performance.

    This patch avoids doing so when the final ACK's TCP timestamp
    indicates the original SYNACK was delivered. It remembers the
    original SYNACK timestamp when a SYNACK timeout has occurred and
    conveniently re-uses the function that detects spurious SYN timeouts.

    Note that a server may receive multiple SYNs and immediately
    retransmit SYNACKs without any SYNACK timeout. This often happens
    when the client's SYNs have timed out due to the wireless delay
    described above. In this case the server will still use the default
    initial congestion window (e.g. 10) because tp->undo_marker is reset
    in tcp_init_metrics(). This is an intentional design because packets
    are not lost but delayed.

    This patch only covers regular TCP passive open. Fast Open is
    supported in the next patch.
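    A minimal illustration of the timestamp check described above (names and
    types are mine, not the kernel's): the timestamp of the first SYNACK is
    remembered when the timeout fires, and if the handshake-completing ACK
    echoes that original value, the SYNACK was in fact delivered.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical sketch: if the timestamp echoed by the final ACK matches
     * the timestamp of the first SYNACK we sent (saved when the retransmit
     * timer fired), the original SYNACK reached the client, so the cwnd
     * reduction to 1 can be undone. */
    static bool synack_timeout_was_spurious(uint32_t first_synack_tsval,
                                            uint32_t final_ack_tsecr)
    {
            return final_ack_tsecr == first_synack_tsval;
    }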

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

27 Feb, 2019

1 commit


18 Jan, 2019

11 commits


01 Sep, 2018

1 commit

  • RFC 1337 says:
    ''Ignore RST segments in TIME-WAIT state.
    If the 2 minute MSL is enforced, this fix avoids all three hazards.''

    So with net.ipv4.tcp_rfc1337=1, the expected behaviour is to let the
    TIME-WAIT sk expire rather than removing it instantly when a reset is received.

    However, Linux will also re-start the TIME-WAIT timer.

    This causes connect() to fail when trying to re-use ports, or causes very
    long delays (until the SYN retry interval exceeds the MSL).

    packetdrill test case:
    // Demonstrate bogus rearming of TIME-WAIT timer in rfc1337 mode.
    `sysctl net.ipv4.tcp_rfc1337=1`

    0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    0.000 bind(3, ..., ...) = 0
    0.000 listen(3, 1) = 0

    0.100 < S 0:0(0) win 29200
    0.100 > S. 0:0(0) ack 1
    0.200 < . 1:1(0) ack 1 win 257
    0.200 accept(3, ..., ...) = 4

    // Receive first segment
    0.310 < P. 1:1001(1000) ack 1 win 46

    // Send one ACK
    0.310 > . 1:1(0) ack 1001

    // read 1000 bytes
    0.310 read(4, ..., 1000) = 1000

    // Application writes 100 bytes
    0.350 write(4, ..., 100) = 100
    0.350 > P. 1:101(100) ack 1001

    // ACK
    0.500 < . 1001:1001(0) ack 101 win 257

    // close the connection
    0.600 close(4) = 0
    0.600 > F. 101:101(0) ack 1001 win 244

    // Our side is in FIN_WAIT_1 & waits for ack to fin
    0.7 < . 1001:1001(0) ack 102 win 244

    // Our side is in FIN_WAIT_2 with no outstanding data.
    0.8 < F. 1001:1001(0) ack 102 win 244
    0.8 > . 102:102(0) ack 1002 win 244

    // Our side is now in TIME_WAIT state, send ack for fin.
    0.9 < F. 1002:1002(0) ack 102 win 244
    0.9 > . 102:102(0) ack 1002 win 244

    // Peer reopens with in-window SYN:
    1.000 < S 1000:1000(0) win 9200

    // Therefore, reply with ACK.
    1.000 > . 102:102(0) ack 1002 win 244

    // Peer sends RST for this ACK. Normally this RST results
    // in tw socket removal, but rfc1337=1 setting prevents this.
    1.100 < R 1002:1002(0) win 244

    // second syn. Due to rfc1337=1 expect another pure ACK.
    31.0 < S 1000:1000(0) win 9200
    31.0 > . 102:102(0) ack 1002 win 244

    // .. and another RST from peer.
    31.1 < R 1002:1002(0) win 244
    31.2 `echo no timer restart;ss -m -e -a -i -n -t -o state TIME-WAIT`

    // third syn after one minute. Time-Wait socket should have expired by now.
    63.0 < S 1000:1000(0) win 9200

    // so we expect a syn-ack & 3whs to proceed from here on.
    63.0 > S. 0:0(0) ack 1

    Without this patch, 'ss' shows restarts of the tw timer, and the last
    packet is thus just another pure ACK, more than one minute later.

    This restores the original code from commit 283fd6cf0be690a83
    ("Merge in ANK networking jumbo patch") in netdev-vger-cvs.git .

    For some reason the else branch was removed/lost in 1f28b683339f7
    ("Merge in TCP/UDP optimizations and [..]") and timer restart became
    unconditional.
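    A compact sketch of the restored behaviour (the enum and names are
    illustrative, not the kernel's):

    #include <stdbool.h>

    enum tw_rst_action { TW_KILL, TW_IGNORE };

    /* With rfc1337 enabled, an RST received in TIME-WAIT is ignored
     * entirely: the socket is neither removed nor is its timer re-armed,
     * so it expires after the normal 2*MSL period.  With rfc1337 disabled
     * (the default), the RST removes the TIME-WAIT socket immediately. */
    static enum tw_rst_action tw_handle_rst(bool rfc1337_enabled)
    {
            return rfc1337_enabled ? TW_IGNORE : TW_KILL;
    }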

    Reported-by: Michal Tesar
    Signed-off-by: Florian Westphal
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Florian Westphal
     

13 Jul, 2018

1 commit

  • Using get_seconds() for timestamps is deprecated since it can lead
    to overflows on 32-bit systems. While the interface generally doesn't
    overflow until year 2106, the specific implementation of the TCP PAWS
    algorithm breaks in 2038 when the intermediate signed 32-bit timestamps
    overflow.

    A related problem is that the local timestamps in CLOCK_REALTIME form
    lead to unexpected behavior when settimeofday is called to set the system
    clock backwards or forwards by more than 24 days.

    While the first problem could be solved by using an overflow-safe method
    of comparing the timestamps, a nicer solution is to use a monotonic
    clocksource with ktime_get_seconds() that simply doesn't overflow (at
    least not until 136 years after boot) and that doesn't change during
    settimeofday().

    To make 32-bit and 64-bit architectures behave the same way here, and
    also save a few bytes in the tcp_options_received structure, I'm changing
    the type to a 32-bit integer, which is now safe on all architectures.

    Finally, the ts_recent_stamp field also (confusingly) gets used to store
    a jiffies value in tcp_synq_overflow()/tcp_synq_no_recent_overflow().
    This is currently safe, but changing the type to 32-bit requires
    some small changes there to keep it working.
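    A user-space analogue of the change, for illustration only: take coarse
    timestamps from a monotonic clock and keep 32 bits, which behaves the
    same on 32-bit and 64-bit systems and is unaffected by settimeofday().

    #include <stdint.h>
    #include <time.h>

    /* Monotonic seconds since boot, truncated to 32 bits; wraps only after
     * ~136 years of uptime and never jumps when the wall clock is set. */
    static uint32_t monotonic_seconds(void)
    {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (uint32_t)ts.tv_sec;
    }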

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Arnd Bergmann
     

28 Jun, 2018

1 commit


05 Jun, 2018

1 commit


11 May, 2018

1 commit

  • This version has some suggestions by Eric Dumazet:

    - Use a local variable for the mark in IPv6 instead of ctl_sk to avoid SMP
    races.
    - Use the more elegant "IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark"
    statement.
    - Factorize code, as the sk_fullsock() check is not necessary.

    Aidan McGurn from Openwave Mobility systems reported the following bug:

    "Marked routing is broken on customer deployment. Its effects are large
    increase in Uplink retransmissions caused by the client never receiving
    the final ACK to their FINACK - this ACK misses the mark and routes out
    of the incorrect route."

    Currently, marks are added to sk_buffs for replies when the "fwmark_reflect"
    sysctl is enabled, but not for TW sockets that had sk->sk_mark set via
    setsockopt(SO_MARK..).

    Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the
    original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark location,
    then propagate it so that the skb gets sent with the correct mark. Do the same
    for resets. Give the "fwmark_reflect" sysctl precedence over sk->sk_mark so that
    netfilter rules are still honored.
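    An illustrative reduction of the mark-selection rule (user-space types;
    the real code uses the IP4_REPLY_MARK() helper mentioned above): the
    fwmark_reflect-derived mark wins, otherwise the stored socket or
    TIME_WAIT mark is used.

    #include <stdint.h>

    /* reflected_mark is non-zero only when the fwmark_reflect sysctl is on;
     * sk_or_tw_mark is sk->sk_mark, or tw->tw_mark for TIME_WAIT sockets. */
    static uint32_t reply_mark(uint32_t reflected_mark, uint32_t sk_or_tw_mark)
    {
            return reflected_mark ? reflected_mark : sk_or_tw_mark;
    }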

    Signed-off-by: Jon Maxwell
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jon Maxwell
     

01 Apr, 2018

1 commit


15 Feb, 2018

1 commit

  • 배석진 reported that in some situations, packets for a given 5-tuple
    end up being processed by different CPUs.

    This involves RPS, and fragmentation.

    배석진 is seeing packet drops when a SYN_RECV request socket is
    moved into the ESTABLISHED state. Other states are protected by the
    socket lock.

    This is caused by a CPU losing the race and simply not caring enough
    to retry.

    Since this seems to occur frequently, we can do better and perform
    a second lookup.

    Note that all needed memory barriers are already in the existing code,
    thanks to the spin_lock()/spin_unlock() pair in inet_ehash_insert()
    and reqsk_put(). The second lookup must find the new socket,
    unless it has already been accepted and closed by another cpu.

    Note that the fragmentation could be avoided in the first place by
    using a correct TCP MSS option in the SYN{ACK} packet, but this
    does not mean we cannot be more robust.
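    A deliberately abstract sketch of the pattern (none of these names are
    kernel functions; the helpers are hypothetical): when the first lookup
    loses the promotion race, look up again instead of dropping the packet.

    struct pkt;
    struct sock;

    /* Hypothetical helpers for the sketch only. */
    struct sock *lookup_socket(const struct pkt *p);
    int process_on_socket(struct sock *sk, struct pkt *p); /* <0: race lost */

    static int receive(struct pkt *p)
    {
            struct sock *sk = lookup_socket(p);
            if (!sk)
                    return -1;
            if (process_on_socket(sk, p) < 0) {
                    /* Another CPU just turned the request socket into a full
                     * ESTABLISHED socket: redo the lookup and try again
                     * rather than dropping the packet. */
                    sk = lookup_socket(p);
                    return sk ? process_on_socket(sk, p) : -1;
            }
            return 0;
    }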

    Many thanks to 배석진 for a very detailed analysis.

    Reported-by: 배석진
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Dec, 2017

1 commit


02 Dec, 2017

1 commit

  • Maciej Żenczykowski reported some panics in tcp_twsk_destructor()
    that might be caused by the following bug.

    The timewait timer is pinned to the cpu, because we want to transition
    the timewait refcount from 0 to 4 in one go, once everything has been
    initialized.

    At the time commit ed2e92394589 ("tcp/dccp: fix timewait races in timer
    handling") was merged, TCP was always running from the BH handler.

    After commit 5413d1babe8f ("net: do not block BH while processing
    socket backlog") we definitely can run tcp_time_wait() from process
    context.

    We need to block BH in the critical section so that the pinned timer
    still serves its purpose.

    This bug is more likely to happen under stress and when very small RTOs
    are used in datacenter flows.
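    A minimal kernel-context sketch of the idea, under the assumption that
    the critical section is the one that hashes the timewait socket and arms
    its timer (this is not the literal diff):

    local_bh_disable();
    /* hash the timewait socket and arm its timer here; with BH blocked the
     * timer stays pinned to this CPU and the refcount can go from 0 to its
     * final value in one step, as the original design assumed */
    local_bh_enable();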

    Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
    Signed-off-by: Eric Dumazet
    Reported-by: Maciej Żenczykowski
    Acked-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Nov, 2017

2 commits

  • Replace the reordering distance measurement in packet units with a
    sequence-based approach. Previously it tracked the number of "packets"
    toward the forward ACK (i.e. the highest SACKed sequence) in a state
    variable "fackets_out".

    Precisely measuring the reordering degree in packet distance has little
    benefit, as the degree constantly changes with factors like path, load,
    and congestion window. It is also complicated and prone to arcane bugs.
    This patch replaces it with a sequence-based approach that's much simpler.

    Signed-off-by: Yuchung Cheng
    Reviewed-by: Eric Dumazet
    Reviewed-by: Neal Cardwell
    Reviewed-by: Soheil Hassas Yeganeh
    Reviewed-by: Priyaranjan Jha
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • FACK loss detection has been disabled by default and the
    successor RACK subsumed FACK and can handle reordering better.
    This patch removes FACK to simplify TCP loss recovery.

    Signed-off-by: Yuchung Cheng
    Reviewed-by: Eric Dumazet
    Reviewed-by: Neal Cardwell
    Reviewed-by: Soheil Hassas Yeganeh
    Reviewed-by: Priyaranjan Jha
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

05 Nov, 2017

1 commit

  • Currently TCP RACK loss detection does not work well if packets are
    being reordered beyond its static reordering window (min_rtt/4). Under
    such reordering it may falsely trigger loss recoveries and reduce TCP
    throughput significantly.

    This patch improves that by increasing and reducing the reordering
    window based on DSACK, which is now supported in major TCP implementations.
    It makes RACK's reo_wnd adaptive, based on DSACK and the number of recoveries.

    - If a DSACK is received, increment reo_wnd by min_rtt/4 (upper bounded
    by srtt), since there is a possibility that the spurious retransmission
    was due to a reordering delay longer than reo_wnd.

    - Persist the current reo_wnd value for TCP_RACK_RECOVERY_THRESH (16)
    successful recoveries (this accounts for a full DSACK-based loss
    recovery undo). After that, reset it to the default (min_rtt/4).

    - reo_wnd is incremented at most once per RTT, so that the new DSACK
    we are reacting to is (approximately) due to a spurious retransmission
    sent after reo_wnd was last updated.

    - reo_wnd is tracked in steps (of min_rtt/4), rather than as an
    absolute value, to account for changes in rtt.

    In our internal testing, we observed a significant increase in throughput
    in scenarios where reordering exceeds min_rtt/4 (the previous static value).
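    A user-space sketch of the adaptation described above (field names and
    structure are mine; steps of min_rtt/4, upper-bounded by srtt, persisting
    for 16 recoveries):

    #include <stdint.h>

    struct rack {
            uint32_t reo_wnd_steps;   /* multiples of min_rtt/4, starts at 1 */
            uint32_t reo_wnd_persist; /* recoveries left before reset */
    };

    static uint32_t rack_reo_wnd_us(const struct rack *r,
                                    uint32_t min_rtt_us, uint32_t srtt_us)
    {
            uint32_t wnd = r->reo_wnd_steps * (min_rtt_us / 4);
            return wnd < srtt_us ? wnd : srtt_us;   /* upper bound: srtt */
    }

    /* Called at most once per RTT, when a DSACK arrives. */
    static void rack_on_dsack(struct rack *r)
    {
            r->reo_wnd_steps += 1;
            r->reo_wnd_persist = 16;  /* TCP_RACK_RECOVERY_THRESH */
    }

    /* Called after each successful recovery; decay back to the default. */
    static void rack_on_recovery(struct rack *r)
    {
            if (r->reo_wnd_persist)
                    r->reo_wnd_persist--;
            else
                    r->reo_wnd_steps = 1; /* default reo_wnd = min_rtt/4 */
    }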

    Signed-off-by: Priyaranjan Jha
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Priyaranjan Jha
     

28 Oct, 2017

1 commit


27 Oct, 2017

3 commits


26 Oct, 2017

1 commit

  • The SMC protocol [1] relies on the use of a new TCP experimental
    option [2, 3]. With this option, SMC capabilities are exchanged
    between peers during the TCP three way handshake. This patch adds
    support for this experimental option to TCP.

    References:
    [1] SMC-R Informational RFC: http://www.rfc-editor.org/info/rfc7609
    [2] Shared Use of TCP Experimental Options RFC 6994:
    https://tools.ietf.org/rfc/rfc6994.txt
    [3] IANA ExID SMCR:
    http://www.iana.org/assignments/tcp-parameters/tcp-parameters.xhtml#tcp-exids

    Signed-off-by: Ursula Braun
    Signed-off-by: David S. Miller

    Ursula Braun
     

24 Oct, 2017

1 commit


06 Oct, 2017

1 commit

  • This patch adds a new queue (list) that tracks the sent but not yet
    acked or SACKed skbs for a TCP connection. The list is chronologically
    ordered by skb->skb_mstamp (the head is the oldest sent skb).

    This list will be used to optimize TCP RACK recovery, which checks
    an skb's timestamp to judge whether it has been lost and needs to be
    retransmitted. Since the TCP write queue is ordered by sequence instead
    of sent time, RACK has to scan over the write queue to catch all
    eligible packets to detect lost retransmissions, and iterates through
    SACKed skbs repeatedly.

    Special care for rare events:
    1. TCP repair fakes skb transmission, so the send queue needs adjusting.
    2. SACK reneging would require re-inserting SACKed skbs into the
    send queue. For now I believe it's not worth the complexity to
    make RACK work perfectly on SACK reneging, so we do nothing here.
    3. Fast Open: currently for non-TFO, the send queue correctly queues
    the pure SYN packet. For TFO, which queues a pure SYN and
    then a data packet, the send queue only queues the data packet but
    not the pure SYN, due to the structure of the TFO code. This is okay
    because the SYN receiver would never respond with a SACK on a
    missing SYN (i.e. the SYN is never fast-retransmitted by SACK/RACK).

    In order not to grow sk_buff, we use a union for the new list and the
    _skb_refdst/destructor fields. This is a bit complicated because
    we need to make sure _skb_refdst and destructor are properly zeroed
    before the skb is cloned/copied at transmit, and before being freed.
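    A small user-space illustration of the data structure (names are mine):
    newly sent packets are appended at the tail, so a time-based scan can
    start at the head (the oldest) and stop at the first packet that is
    still too young to be declared lost.

    #include <stddef.h>
    #include <stdint.h>

    struct sent_pkt {
            uint64_t sent_us;        /* transmit timestamp */
            struct sent_pkt *next;   /* next, more recently sent packet */
    };

    struct sent_list {
            struct sent_pkt *head;   /* oldest un-acked packet */
            struct sent_pkt *tail;   /* newest; append point on transmit */
    };

    static void sent_list_append(struct sent_list *l, struct sent_pkt *p)
    {
            p->next = NULL;
            if (l->tail)
                    l->tail->next = p;
            else
                    l->head = p;
            l->tail = p;
    }

    /* Everything sent before 'deadline_us' is a loss candidate; because the
     * list is time-ordered, the scan stops at the first packet newer than
     * the deadline. */
    static const struct sent_pkt *first_recent_pkt(const struct sent_list *l,
                                                   uint64_t deadline_us)
    {
            const struct sent_pkt *p = l->head;
            while (p && p->sent_us < deadline_us)
                    p = p->next;
            return p;
    }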

    Signed-off-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Aug, 2017

1 commit

  • This reverts commit 45f119bf936b1f9f546a0b139c5b56f9bb2bdc78.

    Eric Dumazet says:
    We found at Google a significant regression caused by
    45f119bf936b1f9f546a0b139c5b56f9bb2bdc78 tcp: remove header prediction

    In typical RPC (TCP_RR), when a TCP socket receives data, we now call
    tcp_ack() while we used to not call it.

    This touches enough cache lines to cause a slowdown.

    So the problem does not seem to be HP removal itself but the tcp_ack()
    call. Therefore, it might be possible to remove HP after all, provided
    one finds a way to elide tcp_ack() for most cases.

    Reported-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

01 Aug, 2017

2 commits

  • Like prequeue, I am not sure this is overly useful nowadays.

    If we receive a train of packets, GRO will aggregate them if the
    headers are the same (HP predates GRO by several years), so we don't
    get a per-packet benefit, only a per-aggregated-packet one.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • prequeue is a tcp receive optimization that moves part of rx processing
    from bh to process context.

    This only works if the socket being processed belongs to a process that
    is blocked in recv on that socket.

    In practice, this doesn't happen that often anymore, because nowadays
    servers tend to use an event-driven (epoll) model.

    Even normal client applications (web browsers) commonly use many tcp
    connections in parallel.

    This has a measurable impact only in netperf (which uses plain recv and
    thus allows prequeue use) from a host to a locally running VM (~4%);
    there were no changes when using netperf between two physical hosts with
    ixgbe interfaces.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

02 Jul, 2017

1 commit


08 Jun, 2017

1 commit