24 Aug, 2018

1 commit

  • [ Upstream commit a69258f7aa2623e0930212f09c586fd06674ad79 ]

    After fixing the way DCTCP tracks delayed ACKs, the delayed-ACK
    related callbacks are no longer needed.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Acked-by: Lawrence Brakmo
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Yuchung Cheng
     

28 Jul, 2018

2 commits

  • [ Upstream commit 27cde44a259c380a3c09066fc4b42de7dde9b1ad ]

    Currently, when a DCTCP receiver delays an ACK and receives a
    data packet with a different CE mark from the previous one, it
    sends two immediate ACKs acking the previous and latest sequences
    respectively (for ECN accounting).

    Previously, sending the first ACK could cancel the delayed-ACK timer
    (tcp_event_ack_sent). This may subsequently prevent sending the
    second ACK to acknowledge the latest sequence (tcp_ack_snd_check).
    The culprit is that tcp_send_ack() assumes it always acknowledges
    the latest sequence, which is not true for the first special ACK.

    The fix is to not make that assumption in tcp_send_ack() and to check
    the actual ACK sequence before cancelling the delayed ACK. Further,
    it is safer to pass the ACK sequence number as a local variable into
    the tcp_send_ack() routine, instead of intercepting tp->rcv_nxt, to
    avoid future bugs like this.
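
    A minimal sketch of the idea (hedged, not the verbatim diff):
    tcp_event_ack_sent() learns which sequence the ACK actually
    acknowledged, and only cancels the delayed-ACK timer when that
    matches tp->rcv_nxt:

        static void tcp_event_ack_sent(struct sock *sk, unsigned int pkts,
                                       u32 rcv_nxt)
        {
                struct tcp_sock *tp = tcp_sk(sk);

                if (unlikely(rcv_nxt != tp->rcv_nxt))
                        return;  /* Special ACK sent by DCTCP to reflect ECN */
                tcp_dec_quickack_mode(sk, pkts);
                inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
        }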

    Reported-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yuchung Cheng
     
  • [ Upstream commit 2987babb6982306509380fc11b450227a844493b ]

    Refactor and create helpers to send the special ACK in DCTCP.
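
    A sketch of the resulting helpers, assuming the refactor splits
    tcp_send_ack() so the ACKed sequence can be passed explicitly:

        void __tcp_send_ack(struct sock *sk, u32 rcv_nxt)
        {
                /* build and transmit a pure ACK for rcv_nxt ... */
        }

        void tcp_send_ack(struct sock *sk)
        {
                __tcp_send_ack(sk, tcp_sk(sk)->rcv_nxt);
        }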

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yuchung Cheng
     

25 May, 2018

1 commit

  • [ Upstream commit 7f582b248d0a86bae5788c548d7bb5bca6f7691a ]

    syzkaller found a reliable way to crash the host, hitting a BUG()
    in __tcp_retransmit_skb().

    A malicious MSG_FASTOPEN is the root cause. We need to purge the
    write queue in tcp_connect_init() at the point where we initialize
    snd_una/write_seq.

    This patch also replaces the BUG() with a less intrusive WARN_ON_ONCE().
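
    A sketch of the two changes described above (hedged, not the verbatim
    diff):

        /* tcp_connect_init(): drop stale MSG_FASTOPEN skbs before
         * snd_una/write_seq are (re)initialized.
         */
        tcp_write_queue_purge(sk);
        tp->snd_una = tp->write_seq;

        /* __tcp_retransmit_skb(): fail gracefully instead of BUG() */
        if (unlikely(before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))) {
                WARN_ON_ONCE(1);
                return -EINVAL;
        }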

    kernel BUG at net/ipv4/tcp_output.c:2837!
    invalid opcode: 0000 [#1] SMP KASAN
    Dumping ftrace buffer:
    (ftrace buffer empty)
    Modules linked in:
    CPU: 0 PID: 5276 Comm: syz-executor0 Not tainted 4.17.0-rc3+ #51
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:__tcp_retransmit_skb+0x2992/0x2eb0 net/ipv4/tcp_output.c:2837
    RSP: 0000:ffff8801dae06ff8 EFLAGS: 00010206
    RAX: ffff8801b9fe61c0 RBX: 00000000ffc18a16 RCX: ffffffff864e1a49
    RDX: 0000000000000100 RSI: ffffffff864e2e12 RDI: 0000000000000005
    RBP: ffff8801dae073a0 R08: ffff8801b9fe61c0 R09: ffffed0039c40dd2
    R10: ffffed0039c40dd2 R11: ffff8801ce206e93 R12: 00000000421eeaad
    R13: ffff8801ce206d4e R14: ffff8801ce206cc0 R15: ffff8801cd4f4a80
    FS: 0000000000000000(0000) GS:ffff8801dae00000(0063) knlGS:00000000096bc900
    CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
    CR2: 0000000020000000 CR3: 00000001c47b6000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:

    tcp_retransmit_skb+0x2e/0x250 net/ipv4/tcp_output.c:2923
    tcp_retransmit_timer+0xc50/0x3060 net/ipv4/tcp_timer.c:488
    tcp_write_timer_handler+0x339/0x960 net/ipv4/tcp_timer.c:573
    tcp_write_timer+0x111/0x1d0 net/ipv4/tcp_timer.c:593
    call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
    expire_timers kernel/time/timer.c:1363 [inline]
    __run_timers+0x79e/0xc50 kernel/time/timer.c:1666
    run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
    __do_softirq+0x2e0/0xaf5 kernel/softirq.c:285
    invoke_softirq kernel/softirq.c:365 [inline]
    irq_exit+0x1d1/0x200 kernel/softirq.c:405
    exiting_irq arch/x86/include/asm/apic.h:525 [inline]
    smp_apic_timer_interrupt+0x17e/0x710 arch/x86/kernel/apic/apic.c:1052
    apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:863

    Fixes: cf60af03ca4e ("net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN)")
    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Reported-by: syzbot
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

09 Mar, 2018

2 commits

  • [ Upstream commit 350c9f484bde93ef229682eedd98cd5f74350f7f ]

    BBR uses tcp_tso_autosize() in an attempt to probe what the burst
    sizes would be, and to adjust cwnd in bbr_target_cwnd() with the
    following golden formula:

    /* Allow enough full-sized skbs in flight to utilize end systems. */
    cwnd += 3 * bbr->tso_segs_goal;

    But GSO can be lacking, or be constrained to very small
    units (ip link set dev ... gso_max_segs 2).

    What we really want is to have enough packets in flight so that both
    GSO and GRO are efficient.

    So when GSO is off or downgraded, we still want to have the same
    number of packets in flight as if GSO/TSO were fully operational, so
    that GRO can hopefully work efficiently.

    To fix this issue, we make tcp_tso_autosize() unaware of
    sk->sk_gso_max_segs.

    Only tcp_tso_segs() has to enforce the gso_max_segs limit.
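
    The resulting logic, roughly (tcp_tso_autosize() stops clamping, and
    the clamp moves to tcp_tso_segs()):

        static u32 tcp_tso_segs(struct sock *sk, unsigned int mss_now)
        {
                const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
                u32 tso_segs = ca_ops->tso_segs_goal ?
                               ca_ops->tso_segs_goal(sk, mss_now) : 0;

                if (!tso_segs)
                        tso_segs = tcp_tso_autosize(sk, mss_now,
                                                    sysctl_tcp_min_tso_segs);
                return min_t(u32, tso_segs, sk->sk_gso_max_segs);
        }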

    Tested:

    ethtool -K eth0 tso off gso off
    tc qd replace dev eth0 root pfifo_fast

    Before patch:
    for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
        691  (ss -temoi shows cwnd is stuck around 6 )
        667
        651
        631
        517

    After patch :
    # for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
       1733 (ss -temoi shows cwnd is around 386 )
       1778
       1746
       1781
       1718

    Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
    Signed-off-by: Eric Dumazet
    Reported-by: Oleksandr Natalenko
    Acked-by: Neal Cardwell
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 808cf9e38cd7923036a99f459ccc8cf2955e47af ]

    Avoid SKB coalescing if the eor bit is set in one of the relevant
    SKBs.
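
    A sketch of the check this implies, assuming a helper that walks the
    send queue head before coalescing in tcp_mtu_probe():

        static bool tcp_can_coalesce_send_queue_head(struct sock *sk, int len)
        {
                struct sk_buff *skb, *next;

                skb = tcp_send_head(sk);
                tcp_for_write_queue_from_safe(skb, next, sk) {
                        if (len <= skb->len)
                                break;
                        if (unlikely(TCP_SKB_CB(skb)->eor))
                                return false;   /* must not merge across EOR */
                        len -= skb->len;
                }
                return true;
        }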

    Fixes: c134ecb87817 ("tcp: Make use of MSG_EOR in tcp_sendmsg")
    Signed-off-by: Ilya Lesokhin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Ilya Lesokhin
     

17 Dec, 2017

1 commit

  • [ Upstream commit ed66dfaf236c04d414de1d218441296e57fb2bd2 ]

    Fix the TLP scheduling logic so that when scheduling a TLP probe, we
    ensure that the estimated time at which an RTO would fire accounts for
    the fact that ACKs indicating forward progress should push back RTO
    times.

    After the following fix:

    df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")

    we had an unintentional behavior change in the following kind of
    scenario: suppose the RTT variance has been very low recently. Then
    suppose we send out a flight of N packets and our RTT is 100ms:

    t=0: send a flight of N packets
    t=100ms: receive an ACK for N-1 packets

    The response before df92c8394e6e was:
    -> schedule a TLP for now + RTO_interval

    The response after df92c8394e6e is:
    -> schedule a TLP for t=0 + RTO_interval

    Since RTO_interval = srtt + RTT_variance, this means that we have
    scheduled a TLP timer at a point in the future that only accounts for
    RTT_variance. If the RTT_variance term is small, this means that the
    timer fires soon.

    Before df92c8394e6e this would not happen, because in that code, when
    we receive an ACK for a prefix of the flight, we did:

    1) Near the top of tcp_ack(), switch from TLP timer to RTO
       at write_queue_head->packet_tx_time + RTO_interval:
         if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
                 tcp_rearm_rto(sk);

    2) In tcp_clean_rtx_queue(), update the RTO to now + RTO_interval:
         if (flag & FLAG_ACKED) {
                 tcp_rearm_rto(sk);

    3) In tcp_ack(), after tcp_fastretrans_alert(), switch from RTO
       to TLP at now + RTO_interval:
         if (icsk->icsk_pending == ICSK_TIME_RETRANS)
                 tcp_schedule_loss_probe(sk);

    In df92c8394e6e we removed that 3-phase dance, and instead directly
    set the TLP timer once: we set the TLP timer in cases like this to
    write_queue_head->packet_tx_time + RTO_interval. So if the RTT
    variance is small, then this means that this is setting the TLP timer
    to fire quite soon. This means if the ACK for the tail of the flight
    takes longer than an RTT to arrive (often due to delayed ACKs), then
    the TLP timer fires too quickly.
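
    A sketch of the fix: callers tell tcp_schedule_loss_probe() whether
    the RTO base time is being advanced by this ACK, so the TLP math can
    use now + icsk_rto instead of the stale head-of-queue RTO estimate:

        bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto)
        {
                ...
                /* If the RTO formula yields an earlier time, use that time */
                rto_delta_us = advancing_rto ?
                                jiffies_to_usecs(inet_csk(sk)->icsk_rto) :
                                tcp_rto_delta_us(sk);
                if (rto_delta_us > 0)
                        timeout = min_t(u32, timeout,
                                        usecs_to_jiffies(rto_delta_us));
                ...
        }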

    Fixes: df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Neal Cardwell
     

03 Nov, 2017

1 commit

  • Christoph Paasch sent a patch to address the following issue:

    tcp_make_synack() leaves some TCP private info in skb->cb[],
    then sends the packet by means other than tcp_transmit_skb().

    tcp_transmit_skb() makes sure to clear skb->cb[] so as not to confuse
    the IPv4/IPv6 stacks, but we have no such cleanup for SYNACK.

    tcp_make_synack() should not use tcp_init_nondata_skb():

    tcp_init_nondata_skb() really should be limited to skbs put in the
    write/rtx queues (the ones that are only sent via tcp_transmit_skb()).

    This patch fixes the issue and should even save a few CPU cycles ;)
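
    The shape of the fix, as a hedged sketch: fill the TCP header directly
    instead of going through tcp_init_nondata_skb() and skb->cb[]:

        /* in tcp_make_synack(), instead of tcp_init_nondata_skb(): */
        th->syn = 1;
        th->ack = 1;
        th->seq = htonl(tcp_rsk(req)->snt_isn);
        /* skb->cb[] stays clean for the IPv4/IPv6 output paths */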

    Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
    Signed-off-by: Eric Dumazet
    Reported-by: Christoph Paasch
    Reviewed-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Nov, 2017

1 commit

  • Based on SNMP values provided by Roman, Yuchung made the observation
    that some crashes in tcp_sacktag_walk() might be caused by MTU probing.

    Looking at tcp_mtu_probe(), I found that when a new skb was placed
    in front of the write queue, we were not updating the TCP highest sack.

    If one skb is freed because all of its content was copied to the new skb
    (for MTU probing), then tp->highest_sack could point to a now-freed skb.

    Bad things would then happen, including infinite loops.

    This patch renames tcp_highest_sack_combine() and uses it
    from tcp_mtu_probe() to fix the bug.

    Note that I also removed one test against tp->sacked_out, since we
    want to replace tp->highest_sack regardless of that condition: keeping
    a stale pointer to a freed skb is a recipe for disaster.
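
    The renamed helper, with the tp->sacked_out test removed (a sketch of
    the upstream change):

        static inline void tcp_highest_sack_replace(struct sock *sk,
                                                    struct sk_buff *old,
                                                    struct sk_buff *new)
        {
                if (old == tcp_sk(sk)->highest_sack)
                        tcp_sk(sk)->highest_sack = new;
        }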

    Fixes: a47e5a988a57 ("[TCP]: Convert highest_sack to sk_buff to allow direct access")
    Signed-off-by: Eric Dumazet
    Reported-by: Alexei Starovoitov
    Reported-by: Roman Gushchin
    Reported-by: Oleksandr Natalenko
    Acked-by: Alexei Starovoitov
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Oct, 2017

1 commit

  • In the unlikely event that tcp_mtu_probe() is sending a packet, we
    want tp->tcp_mstamp to be as accurate as possible.

    This means we need to call tcp_mstamp_refresh() a bit earlier in
    tcp_write_xmit().
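
    Roughly, the refresh moves from the transmit loop to before the MTU
    probe (a sketch, not the verbatim diff):

        static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, ...)
        {
                ...
                tcp_mstamp_refresh(tcp_sk(sk)); /* moved before tcp_mtu_probe() */
                if (!push_one) {
                        /* Do MTU probing. */
                        result = tcp_mtu_probe(sk);
                        ...
                }
                while ((skb = tcp_send_head(sk))) {
                        ...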

    Fixes: 385e20706fac ("tcp: use tp->tcp_mstamp in output path")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 Oct, 2017

1 commit

  • The current implementation calls tcp_rate_skb_sent() from
    tcp_transmit_skb() only when it clones the skb. Not calling
    tcp_rate_skb_sent() is OK for all such code paths except
    __tcp_retransmit_skb(), which can transmit without cloning when the
    skb->data address is not aligned. This may rarely happen, e.g. when a
    small amount of data is sent initially and the receiver partially acks
    an odd number of bytes for some reason, possibly maliciously.
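
    A sketch of the affected path in __tcp_retransmit_skb(), assuming the
    fix adds the rate-sample hook to the copy branch:

        if (unlikely((NET_IP_ALIGN && ((unsigned long)skb->data & 3)) ||
                     skb_headroom(skb) >= 0xFFFF)) {
                struct sk_buff *nskb = __pskb_copy(skb, MAX_TCP_HEADER,
                                                   GFP_ATOMIC);

                err = nskb ? tcp_transmit_skb(sk, nskb, 0, GFP_ATOMIC) :
                             -ENOBUFS;
                if (!err)
                        tcp_rate_skb_sent(sk, skb);  /* was missing here */
        } else {
                err = tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC);
        }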

    Signed-off-by: Yousuk Seung
    Signed-off-by: Neal Cardwell
    Signed-off-by: Soheil Hassas Yeganeh
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yousuk Seung
     

23 Oct, 2017

1 commit

  • When retransmission on the TSQ handler was introduced in commit
    f9616c35a0d7 ("tcp: implement TSQ for retransmits"), the retransmitted
    skbs' timestamps were updated on the actual transmission. In the later
    commit 385e20706fac ("tcp: use tp->tcp_mstamp in output path"), this
    stopped being done. That commit's message says "We try to refresh
    tp->tcp_mstamp only when necessary", and at present this applies to
    tcp_tsq_handler and tcp_v4_mtu_reduced. The latter is okay, since it
    is rare enough.

    As for the former, even though a possible retransmission on the tasklet
    comes just after the destructor runs in NET_RX softirq handling, the
    time between them could be non-negligibly large, to the extent that
    tcp_rack_advance or RTO rearming is affected if other (remaining) RX,
    BLOCK and (preceding) TASKLET softirq handlings are unexpectedly heavy.

    So, in the same way tcp_write_timer_handler does, calling
    tcp_mstamp_refresh ensures the accuracy of the algorithms relying on it.
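
    A minimal sketch, mirroring tcp_write_timer_handler() (hedged; not the
    verbatim diff):

        /* in tcp_tsq_handler(), before (re)transmitting: */
        if (tp->lost_out > tp->retrans_out &&
            tp->snd_cwnd > tcp_packets_in_flight(tp)) {
                tcp_mstamp_refresh(tp);         /* the added call */
                tcp_xmit_retransmit_queue(sk);
        }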

    Fixes: 385e20706fac ("tcp: use tp->tcp_mstamp in output path")
    Signed-off-by: Koichiro Den
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Koichiro Den
     

20 Sep, 2017

1 commit

  • Our recent change exposed a bug in the TCP Fastopen client that
    syzkaller found right away [1]

    When we prepare skb with SYN+DATA, we attempt to transmit it,
    and we update socket state as if the transmit was a success.

    In socket RTX queue we have two skbs, one with the SYN alone,
    and a second one containing the DATA.

    When a (malicious) ACK comes in, we now complain that the second one
    had no skb_mstamp.

    The proper fix is to make sure that, if the transmit failed, we do not
    pretend we sent the DATA skb, and instead make it our send_head.

    When 3WHS completes, we can now send the DATA right away, without having
    to wait for a timeout.

    [1]
    WARNING: CPU: 0 PID: 100189 at net/ipv4/tcp_input.c:3117 tcp_clean_rtx_queue+0x2057/0x2ab0 net/ipv4/tcp_input.c:3117()

    WARN_ON_ONCE(last_ackt == 0);

    Modules linked in:
    CPU: 0 PID: 100189 Comm: syz-executor1 Not tainted
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    0000000000000000 ffff8800b35cb1d8 ffffffff81cad00d 0000000000000000
    ffffffff828a4347 ffff88009f86c080 ffffffff8316eb20 0000000000000d7f
    ffff8800b35cb220 ffffffff812c33c2 ffff8800baad2440 00000009d46575c0
    Call Trace:
    [] __dump_stack
    [] dump_stack+0xc1/0x124
    [] warn_slowpath_common+0xe2/0x150
    [] warn_slowpath_null+0x2e/0x40
    [] tcp_clean_rtx_queue+0x2057/0x2ab0 net/ipv4/tcp_input.c:3117
    [] tcp_ack+0x151d/0x3930
    [] tcp_rcv_state_process+0x1c69/0x4fd0
    [] tcp_v4_do_rcv+0x54f/0x7c0
    [] sk_backlog_rcv
    [] __release_sock+0x12b/0x3a0
    [] release_sock+0x5e/0x1c0
    [] inet_wait_for_connect
    [] __inet_stream_connect+0x545/0xc50
    [] tcp_sendmsg_fastopen
    [] tcp_sendmsg+0x2298/0x35a0
    [] inet_sendmsg+0xe5/0x520
    [] sock_sendmsg_nosec
    [] sock_sendmsg+0xcf/0x110

    Fixes: 8c72c65b426b ("tcp: update skb->skb_mstamp more carefully")
    Fixes: 783237e8daf1 ("net-tcp: Fast Open client - sending SYN-data")
    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Sep, 2017

1 commit

  • Remove tcp_may_send_now() and tcp_snd_test(), which are no longer used.

    Fixes: 840a3cbe8969 ("tcp: remove forward retransmit feature")
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

17 Sep, 2017

1 commit

  • Now that skb->skb_mstamp is updated later, we also need to call
    tcp_rate_skb_sent() after the update is done.
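
    A sketch of the resulting ordering in tcp_transmit_skb(), where oskb
    is the original (non-clone) skb:

        if (!err && oskb) {
                oskb->skb_mstamp = tp->tcp_mstamp;  /* update first */
                tcp_rate_skb_sent(sk, oskb);        /* then sample */
        }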

    Fixes: 8c72c65b426b ("tcp: update skb->skb_mstamp more carefully")
    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

16 Sep, 2017

1 commit

  • liujian reported a problem in TCP_USER_TIMEOUT processing, with a patch
    in tcp_probe_timer():
    https://www.spinics.net/lists/netdev/msg454496.html

    After investigation, the root cause of the problem is that we update
    skb->skb_mstamp of skbs in the write queue even if the attempt to send
    a clone or copy of it failed, one reason being a routing problem.

    This patch prevents this, solving liujian's issue.

    It also removes a potential RTT miscalculation: __tcp_retransmit_skb()
    does not OR TCP_SKB_CB(skb)->sacked with TCPCB_EVER_RETRANS when a
    failure happens, yet skb->skb_mstamp has been changed.

    A future ACK would then lead to a very small RTT sample, and min_rtt
    would be lowered to this too-small value.
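
    A sketch of the idea in tcp_transmit_skb(): remember the original skb
    and only touch its timestamp when the transmit attempt succeeded:

        err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);

        if (unlikely(err > 0)) {
                tcp_enter_cwr(sk);
                err = net_xmit_eval(err);
        }
        if (!err && oskb)                           /* success only */
                oskb->skb_mstamp = tp->tcp_mstamp;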

    Tested:

    # cat user_timeout.pkt
    --local_ip=192.168.102.64

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    +0 bind(3, ..., ...) = 0
    +0 listen(3, 1) = 0

    +0 `ifconfig tun0 192.168.102.64/16; ip ro add 192.0.2.1 dev tun0`

    +0 < S 0:0(0) win 0
    +0 > S. 0:0(0) ack 1

    +.1 < . 1:1(0) ack 1 win 65530
    +0 accept(3, ..., ...) = 4

    +0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [3000], 4) = 0
    +0 write(4, ..., 24) = 24
    +0 > P. 1:25(24) ack 1 win 29200
    +.1 < . 1:1(0) ack 25 win 65530

    //change the ipaddress
    +1 `ifconfig tun0 192.168.0.10/16`

    +1 write(4, ..., 24) = 24
    +1 write(4, ..., 24) = 24
    +1 write(4, ..., 24) = 24
    +1 write(4, ..., 24) = 24

    +0 `ifconfig tun0 192.168.102.64/16`
    +0 < . 1:2(1) ack 25 win 65530
    +0 `ifconfig tun0 192.168.0.10/16`

    +3 write(4, ..., 24) = -1

    # ./packetdrill user_timeout.pkt

    Signed-off-by: Eric Dumazet
    Reported-by: liujian
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Aug, 2017

1 commit

  • This reverts commit 45f119bf936b1f9f546a0b139c5b56f9bb2bdc78.

    Eric Dumazet says:
    We found at Google a significant regression caused by
    45f119bf936b1f9f546a0b139c5b56f9bb2bdc78 tcp: remove header prediction

    In typical RPC (TCP_RR), when a TCP socket receives data, we now call
    tcp_ack() while we used to not call it.

    This touches enough cache lines to cause a slowdown.

    So the problem does not seem to be the HP removal itself, but the
    tcp_ack() call. Therefore, it might be possible to remove HP after all,
    provided one finds a way to elide tcp_ack for most cases.

    Reported-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

10 Aug, 2017

1 commit

  • The UDP offload conflict is dealt with by simply taking what is
    in net-next where we have removed all of the UFO handling code
    entirely.

    The TCP conflict was a case of local variables in a function
    being removed from both net and net-next.

    In netvsc we had an assignment right next to where a missing
    set of u64 stats sync object inits was added.

    Signed-off-by: David S. Miller

    David S. Miller
     

09 Aug, 2017

1 commit

  • With the new TCP_FASTOPEN_CONNECT socket option, it is possible
    to call tcp_connect() while the socket's sk_dst_cache is either NULL
    or invalid.

    +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
    +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
    +0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
    +0 connect(4, ..., ...) = 0

    << sk->sk_dst_cache becomes obsolete, or even set to NULL >>

    +1 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000

    We need to refresh the route, otherwise bad things can happen,
    especially when syzkaller is running on the host :/
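
    The fix, roughly: refresh the route at the top of tcp_connect(),
    before anything is transmitted:

        int tcp_connect(struct sock *sk)
        {
                ...
                if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk))
                        return -EHOSTUNREACH; /* Routing failure or similar */

                tcp_connect_init(sk);
                ...
        }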

    Fixes: 19f6d3f3c8422 ("net/tcp-fastopen: Add new API support")
    Reported-by: Dmitry Vyukov
    Signed-off-by: Eric Dumazet
    Cc: Wei Wang
    Cc: Yuchung Cheng
    Acked-by: Wei Wang
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Aug, 2017

2 commits

  • Fix a TCP loss recovery performance bug raised recently on the netdev
    list, in two threads:

    (i) July 26, 2017: netdev thread "TCP fast retransmit issues"
    (ii) July 26, 2017: netdev thread:
    "[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
    outstanding TLP retransmission"

    The basic problem is that incoming TCP packets that did not indicate
    forward progress could cause the xmit timer (TLP or RTO) to be rearmed
    and pushed back in time. In certain corner cases this could result in
    the following problems noted in these threads:

    - Repeated ACKs coming in with bogus SACKs corrupted by middleboxes
    could cause TCP to repeatedly schedule TLPs forever. We kept
    sending TLPs after every ~200ms, which elicited bogus SACKs, which
    caused more TLPs, ad infinitum; we never fired an RTO to fill in
    the holes.

    - Incoming data segments could, in some cases, cause us to reschedule
    our RTO or TLP timer further out in time, for no good reason. This
    could cause repeated inbound data to result in stalls in outbound
    data, in the presence of packet loss.

    This commit fixes these bugs by changing the TLP and RTO ACK
    processing to:

    (a) Only reschedule the xmit timer once per ACK.

    (b) Only reschedule the xmit timer if tcp_clean_rtx_queue() deems the
    ACK indicates sufficient forward progress (a packet was
    cumulatively ACKed, or we got a SACK for a packet that was sent
    before the most recent retransmit of the write queue head).

    This brings us back into closer compliance with the RFCs, since, as
    the comment for tcp_rearm_rto() notes, we should only restart the RTO
    timer after forward progress on the connection. Previously we were
    restarting the xmit timer even in these cases where there was no
    forward progress.

    As a side benefit, this commit simplifies and speeds up the TCP timer
    arming logic. We had been calling inet_csk_reset_xmit_timer() three
    times on normal ACKs that cumulatively acknowledged some data:

    1) Once near the top of tcp_ack() to switch from TLP timer to RTO:
         if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
                 tcp_rearm_rto(sk);

    2) Once in tcp_clean_rtx_queue(), to update the RTO:
         if (flag & FLAG_ACKED) {
                 tcp_rearm_rto(sk);

    3) Once in tcp_ack() after tcp_fastretrans_alert() to switch from RTO
       to TLP:
         if (icsk->icsk_pending == ICSK_TIME_RETRANS)
                 tcp_schedule_loss_probe(sk);

    This commit, by only rescheduling the xmit timer once per ACK,
    simplifies the code and reduces CPU overhead.
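
    A sketch of the consolidated arming, assuming a FLAG_SET_XMIT_TIMER
    flag carried out of tcp_clean_rtx_queue():

        /* Rearm the RTO or TLP timer once per ACK, after forward progress */
        static void tcp_set_xmit_timer(struct sock *sk)
        {
                if (!tcp_schedule_loss_probe(sk))
                        tcp_rearm_rto(sk);
        }

        /* ... in tcp_ack(): */
        if (flag & FLAG_SET_XMIT_TIMER)
                tcp_set_xmit_timer(sk);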

    This commit was tested in an A/B test with Google web server
    traffic. SNMP stats and request latency metrics were within noise
    levels, substantiating that for normal web traffic patterns this is a
    rare issue. This commit was also tested with packetdrill tests to
    verify that it fixes the timer behavior in the corner cases discussed
    in the netdev threads mentioned above.

    This patch is a bug fix intended to be queued for -stable
    releases.

    Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)")
    Reported-by: Klavs Klavsen
    Reported-by: Mao Wenan
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Have tcp_schedule_loss_probe() base the TLP scheduling decision on
    when the RTO *should* fire. This is to enable the upcoming xmit
    timer fix in this series, where tcp_schedule_loss_probe() cannot
    assume that the last timer installed was an RTO timer (because we are
    no longer doing the "rearm RTO, rearm RTO, rearm TLP" dance on every
    ACK). So tcp_schedule_loss_probe() must independently figure out when
    an RTO would want to fire.

    In the new TLP implementation following in this series, we cannot
    assume that icsk_timeout was set based on an RTO; after processing a
    cumulative ACK the icsk_timeout we see can be from a previous TLP or
    RTO. So we need to independently recalculate the RTO time (instead of
    reading it out of icsk_timeout). Removing this dependency on the
    nature of icsk_timeout makes things a little easier to reason about
    anyway.

    Note that the old and new code should be equivalent, since they are
    both saying: "if the RTO is in the future, but at an earlier time than
    the normal TLP time, then set the TLP timer to fire when the RTO would
    have fired".
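
    The helper this implies, as a rough sketch based on the head of the
    write queue:

        static inline s64 tcp_rto_delta_us(const struct sock *sk)
        {
                const struct sk_buff *skb = tcp_write_queue_head(sk);
                u32 rto = inet_csk(sk)->icsk_rto;
                u64 rto_time_stamp_us = skb->skb_mstamp +
                                        jiffies_to_usecs(rto);

                return rto_time_stamp_us - tcp_sk(sk)->tcp_mstamp;
        }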

    Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)")
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     

01 Aug, 2017

1 commit

  • Like prequeue, I am not sure this is overly useful nowadays.

    If we receive a train of packets, GRO will aggregate them if the
    headers are the same (HP predates GRO by several years) so we don't
    get a per-packet benefit, only a per-aggregated-packet one.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

30 Jul, 2017

1 commit

  • When using CONFIG_UBSAN_SANITIZE_ALL, the TCP code produces a
    false-positive warning:

    net/ipv4/tcp_output.c: In function 'tcp_connect':
    net/ipv4/tcp_output.c:2207:40: error: array subscript is below array bounds [-Werror=array-bounds]
    tp->chrono_stat[tp->chrono_type - 1] += now - tp->chrono_start;
    ^~
    net/ipv4/tcp_output.c:2207:40: error: array subscript is below array bounds [-Werror=array-bounds]
    tp->chrono_stat[tp->chrono_type - 1] += now - tp->chrono_start;
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~

    I have opened a gcc bug for this, but distros have already shipped
    compilers with this problem, and it's not clear yet whether there is
    a way for gcc to avoid the warning. As the problem is related to the
    bitfield access, this introduces a temporary variable to store the old
    enum value.
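
    The workaround in tcp_chrono_set(), roughly:

        enum tcp_chrono old = tp->chrono_type;

        if (old > TCP_CHRONO_UNSPEC)
                tp->chrono_stat[old - 1] += now - tp->chrono_start;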

    I did not notice this warning earlier, since UBSAN is disabled when
    building with COMPILE_TEST, and that was always turned on in both
    allmodconfig and randconfig tests.

    Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81601
    Signed-off-by: Arnd Bergmann
    Signed-off-by: David S. Miller

    Arnd Bergmann
     

20 Jul, 2017

1 commit

  • This patch adjusts the timeout formula for scheduling the TCP loss
    probe (TLP). The previous formula uses 2*SRTT, or 1.5*RTT + DelayACKMax
    if only one packet is in flight, and keeps a lower bound of 10 msec,
    which is too large for short-RTT connections (e.g. within a data
    center). The new formula, 2*RTT + (inflight == 1 ? 200ms : 2 ticks),
    performs better for short and fast connections.
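
    A sketch of the new computation in tcp_schedule_loss_probe():

        u32 rtt = usecs_to_jiffies(tp->srtt_us >> 3);
        u32 timeout;

        /* 2*RTT, or TCP_TIMEOUT_INIT if no RTT sample is available yet */
        timeout = rtt << 1 ? : TCP_TIMEOUT_INIT;
        if (tp->packets_out == 1)
                timeout += TCP_RTO_MIN;         /* ~200ms, covers delayed ACK */
        else
                timeout += TCP_TIMEOUT_MIN;     /* 2 jiffies */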

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

04 Jul, 2017

1 commit

  • SYN-ACK responses on a server in response to a SYN from a client
    did not get the injected skb mark that was tagged on the SYN packet.

    Fixes: 84f39b08d786 ("net: support marking accepting TCP sockets")
    Reviewed-by: Lorenzo Colitti
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     

02 Jul, 2017

4 commits

  • Added support for changing the congestion control from SOCK_OPS BPF
    programs through the setsockopt BPF helper function. It also adds
    a new SOCK_OPS op, BPF_SOCK_OPS_NEEDS_ECN, that is needed for
    congestion controls, like DCTCP, that need to enable ECN in the
    SYN packets.
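
    A sketch of such a program (illustrative, in the style of the sample
    programs accompanying this series; the established callbacks come from
    the companion patch below):

        SEC("sockops")
        int bpf_cong(struct bpf_sock_ops *skops)
        {
                char cong[] = "dctcp";
                int rv = 0;

                switch (skops->op) {
                case BPF_SOCK_OPS_NEEDS_ECN:
                        rv = 1;         /* enable ECN on the SYN exchange */
                        break;
                case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
                case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
                        rv = bpf_setsockopt(skops, SOL_TCP, TCP_CONGESTION,
                                            cong, sizeof(cong));
                        break;
                }
                skops->reply = rv;
                return 1;
        }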

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     
  • Added callbacks to BPF SOCK_OPS type programs before an active
    connection is initialized and after a passive or active connection is
    established.

    The following patch demonstrates how they can be used to set send and
    receive buffer sizes.
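
    For example, a fragment in the same style as the samples (bufsize is
    an illustrative constant):

        int bufsize = 150000;
        ...
        case BPF_SOCK_OPS_TCP_CONNECT_CB:       /* active side, pre-SYN */
        case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
                rv = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF,
                                    &bufsize, sizeof(bufsize));
                rv += bpf_setsockopt(skops, SOL_SOCKET, SO_RCVBUF,
                                     &bufsize, sizeof(bufsize));
                break;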

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     
  • This patch adds support for setting the initial advertised window from
    within a BPF_SOCK_OPS program. This can be used to support larger
    initial cwnd values in environments where it is known to be safe.

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     
  • This patch adds support for setting per-connection SYN and
    SYN_ACK RTOs from within a BPF_SOCK_OPS program; for example,
    to set small RTOs when it is known that both hosts are within a
    datacenter.
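
    A fragment showing this op together with the initial-window op from
    the patch above (the units and values here are assumptions for
    illustration):

        case BPF_SOCK_OPS_TIMEOUT_INIT:
                rv = 10;        /* SYN(-ACK) RTO, in jiffies */
                break;
        case BPF_SOCK_OPS_RWND_INIT:
                rv = 40;        /* initial advertised window, in MSS units */
                break;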

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     

01 Jul, 2017

1 commit

  • The refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.
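
    The conversion pattern, for reference (field and variable names are
    illustrative):

        -       atomic_t        refcnt;
        +       refcount_t      refcnt;

        -       atomic_set(&p->refcnt, 1);
        +       refcount_set(&p->refcnt, 1);

        -       if (atomic_dec_and_test(&p->refcnt))
        +       if (refcount_dec_and_test(&p->refcnt))
                        kfree(p);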

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     

18 May, 2017

5 commits

  • The TCP Timestamps option is defined in RFC 7323.

    Traditionally on Linux, it has been tied to the internal
    'jiffies' variable, because that has been a cheap and good enough
    generator.

    For TCP flows on the Internet, 1 ms resolution would be much better
    than 4 ms or 10 ms (HZ=250 or HZ=100, respectively).

    For TCP flows in the DC, Google has used usec resolution for more
    than two years with great success [1]

    Receive size autotuning (DRS) is indeed more precise and converges
    faster to optimal window size.

    This patch converts tp->tcp_mstamp to a plain u64 value storing
    a 1 usec TCP clock.

    This choice will allow us to upstream the 1 usec TS option as
    discussed in IETF 97.

    [1] https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf
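
    The resulting clock helpers look roughly like this:

        static inline u64 tcp_clock_us(void)
        {
                return div_u64(local_clock(), NSEC_PER_USEC);
        }

        /* Refresh the socket clock, keeping it monotonically increasing */
        static inline void tcp_mstamp_refresh(struct tcp_sock *tp)
        {
                u64 val = tcp_clock_us();

                if (val > tp->tcp_mstamp)
                        tp->tcp_mstamp = val;
        }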

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • After this patch, all uses of tcp_time_stamp will require
    a change when we introduce 1 ms and/or 1 us TCP TS option.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • tcp_time_stamp will no longer be tied to jiffies.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Use tcp_jiffies32 instead of tcp_time_stamp, since
    tcp_time_stamp will soon be only used for TCP TS option.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Use tcp_jiffies32 instead of tcp_time_stamp, since
    tcp_time_stamp will soon be only used for TCP TS option.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet