18 Oct, 2018

1 commit

  • [ Upstream commit 1ad98e9d1bdf4724c0a8532fabd84bf3c457c2bc ]

    In normal SYN processing, packets are handled without listener
    lock and in RCU protected ingress path.

    But syzkaller is known to be able to trick us and SYN
    packets might be processed in process context, after being
    queued into socket backlog.

    In commit 06f877d613be ("tcp/dccp: fix other lockdep splats
    accessing ireq_opt") I made a very stupid fix, that happened
    to work mostly because of the regular path being RCU protected.

    Really the thing protecting ireq->ireq_opt is RCU read lock,
    and the pseudo request refcnt is not relevant.

    This patch extends what I did in commit 449809a66c1d ("tcp/dccp:
    block BH for SYN processing") by adding an extra rcu_read_{lock|unlock}
    pair in the paths that might be taken when processing SYN from
    socket backlog (thus possibly in process context).
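
    As a rough user-space model of the principle (an RCU read-side critical
    section, not a refcount, is what makes the pointer safe to read), here is
    a minimal sketch using liburcu; it only illustrates the pattern and is
    not the kernel code touched by this patch:

    #include <urcu.h>          /* userspace RCU (liburcu) -- an assumption for this sketch */
    #include <stdio.h>
    #include <stdlib.h>

    struct opts { int ttl; };
    static struct opts *shared_opts;        /* stand-in for ireq->ireq_opt */

    static void reader(void)
    {
            rcu_read_lock();                /* this, not a refcount, protects the pointer */
            struct opts *o = rcu_dereference(shared_opts);
            if (o)
                    printf("ttl=%d\n", o->ttl);
            rcu_read_unlock();
    }

    int main(void)
    {
            struct opts *o = malloc(sizeof(*o));
            o->ttl = 64;
            rcu_register_thread();
            rcu_assign_pointer(shared_opts, o);
            reader();
            rcu_assign_pointer(shared_opts, NULL);
            synchronize_rcu();              /* wait for readers before freeing */
            free(o);
            rcu_unregister_thread();
            return 0;
    }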

    Fixes: 06f877d613be ("tcp/dccp: fix other lockdep splats accessing ireq_opt")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

20 Sep, 2018

1 commit

  • Generalize private netem_rb_to_skb()

    TCP rtx queue will soon be converted to rb-tree,
    so we will need skb_rbtree_walk() helpers.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 18a4c0eab2623cc95be98a1e6af1ad18e7695977)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

03 Aug, 2018

5 commits

  • [ Upstream commit 15ecbe94a45ef88491ca459b26efdd02f91edb6d ]

    Larry Brakmo's proposal ( https://patchwork.ozlabs.org/patch/935233/
    tcp: force cwnd at least 2 in tcp_cwnd_reduction) made us rethink
    our recent patch removing ~16 quick acks after ECN events.

    tcp_enter_quickack_mode(sk, 1) makes sure one immediate ack is sent,
    but in the case the sender cwnd was lowered to 1, we do not want
    to have a delayed ack for the next packet we will receive.

    Fixes: 522040ea5fdd ("tcp: do not aggressively quick ack after ECN events")
    Signed-off-by: Eric Dumazet
    Reported-by: Neal Cardwell
    Cc: Lawrence Brakmo
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit f4c9f85f3b2cb7669830cd04d0be61192a4d2436 ]

    Refactor tcp_ecn_check_ce and __tcp_ecn_check_ce to accept struct sock*
    instead of tcp_sock* to clean up type casts. This is a pure refactor
    patch.

    Signed-off-by: Yousuk Seung
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yousuk Seung
     
  • [ Upstream commit 522040ea5fdd1c33bbf75e1d7c7c0422b96a94ef ]

    ECN signals currently force TCP to enter quickack mode for
    up to 16 (TCP_MAX_QUICKACKS) following incoming packets.

    We believe this is not needed, and only sending one immediate ack
    for the current packet should be enough.

    This should reduce the extra load noticed in DCTCP environments,
    after congestion events.

    This is part 2 of our effort to reduce pure ACK packets.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 9a9c9b51e54618861420093ae6e9b50a961914c5 ]

    We want to add finer control of the number of ACK packets sent after
    ECN events.

    This patch does not change current behavior; it only enables the
    following change.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit a3893637e1eb0ef5eb1bbc52b3a8d2dfa317a35d ]

    As explained in commit 9f9843a751d0 ("tcp: properly handle stretch
    acks in slow start"), TCP stacks have to consider how many packets
    are acknowledged in one single ACK, because of GRO, but also
    because of ACK compression or losses.

    We plan to add SACK compression in the following patch; we must
    therefore not call tcp_enter_quickack_mode().

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

28 Jul, 2018

6 commits

  • [ Upstream commit 58152ecbbcc6a0ce7fddd5bf5f6ee535834ece0c ]

    In case an skb in the out_of_order_queue is the result of
    multiple skbs coalescing, we would like to get proper gso_segs
    counter tracking, so that a future tcp_drop() can report an accurate
    number.

    I chose not to implement this tracking for skbs in the receive queue,
    since they are not dropped unless the socket is disconnected.
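
    A loose user-space model of the bookkeeping (struct and function names
    are invented for illustration): when one buffer absorbs another, its
    segment count should grow by the other's, treating a buffer without GSO
    metadata as a single segment.

    #include <stdio.h>

    struct buf { int len; int gso_segs; };  /* toy stand-in for sk_buff fields */

    /* Fold "from" into "to" while keeping the merged segment count accurate,
     * so that dropping "to" later can report how many segments it carried. */
    static void coalesce(struct buf *to, const struct buf *from)
    {
            to->len += from->len;
            to->gso_segs += from->gso_segs ? from->gso_segs : 1;
    }

    int main(void)
    {
            struct buf a = { 1500, 2 }, b = { 100, 0 };

            coalesce(&a, &b);
            printf("len=%d segs=%d\n", a.len, a.gso_segs);  /* prints len=1600 segs=3 */
            return 0;
    }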

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 8541b21e781a22dce52a74fef0b9bed00404a1cd ]

    In order to be able to give better diagnostics and detect
    malicious traffic, we need to have better sk->sk_drops tracking.

    Fixes: 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue")
    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 3d4bf93ac12003f9b8e1e2de37fe27983deebdcf ]

    In case an attacker feeds tiny packets completely out of order,
    tcp_collapse_ofo_queue() might scan the whole rb-tree, performing
    expensive copies, but not changing socket memory usage at all.

    1) Do not attempt to collapse tiny skbs.
    2) Add logic to exit early when too many tiny skbs are detected.

    We prefer not doing aggressive collapsing (which copies packets)
    for pathological flows, and to revert to tcp_prune_ofo_queue(), which
    will be less expensive.

    In the future, we might add the possibility of terminating flows
    that are proven to be malicious.
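
    A rough user-space model of the heuristic (the threshold and the cap are
    invented for illustration; the real logic lives in
    tcp_collapse_ofo_queue()):

    #include <stdbool.h>

    struct node { int truesize; int datalen; struct node *next; };

    /* Decide whether collapsing (copying) the queue is worthwhile: buffers
     * with very little payload are "tiny"; seeing too many of them suggests
     * a pathological flow, so bail out and let plain pruning handle it. */
    static bool worth_collapsing(const struct node *head)
    {
            int tiny = 0;

            for (const struct node *n = head; n; n = n->next) {
                    if (n->datalen < 80) {          /* illustrative "tiny" threshold */
                            if (++tiny > 128)       /* illustrative cap */
                                    return false;   /* fall back to pruning */
                    }
            }
            return true;
    }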

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit f4a3313d8e2ca9fd8d8f45e40a2903ba782607e7 ]

    Right after a TCP flow is created, receiving tiny out-of-order
    packets always hits the condition:

    if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
    tcp_clamp_window(sk);

    tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc
    (guarded by tcp_rmem[2]).

    Calling tcp_collapse_ofo_queue() in this case is not useful,
    and offers a O(N^2) surface attack to malicious peers.

    Better not to attempt anything before full queue capacity is reached,
    forcing the attacker to spend lots of resources and allowing us to
    detect the abuse more easily.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 72cd43ba64fc172a443410ce01645895850844c8 ]

    Juha-Matti Tilli reported that malicious peers could inject tiny
    packets in out_of_order_queue, forcing very expensive calls
    to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for
    every incoming packet. The out_of_order_queue rb-tree can contain
    thousands of nodes; iterating over all of them is not nice.

    Before linux-4.9, we would have pruned all packets in ofo_queue
    in one go, every XXXX packets. XXXX depends on sk_rcvbuf and skbs
    truesize, but is about 7000 packets with tcp_rmem[2] default of 6 MB.

    Since we plan to increase tcp_rmem[2] in the future to cope with
    modern BDPs, we cannot revert to the old behavior without great pain.

    Strategy taken in this patch is to purge ~12.5 % of the queue capacity.
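
    In user-space terms the strategy is roughly the sketch below (a plain
    list instead of an rb-tree, names invented); the point is the prune goal
    of about one eighth (12.5 %) of the receive budget:

    #include <stdlib.h>

    struct node { int truesize; struct node *next; };

    /* Free out-of-order nodes until ~12.5 % of sk_rcvbuf has been reclaimed,
     * instead of emptying the whole queue in one go. */
    static int prune_ooo(struct node **head, int sk_rcvbuf)
    {
            int goal = sk_rcvbuf >> 3;      /* one eighth of the capacity */
            int freed = 0;

            while (*head && freed < goal) {
                    struct node *victim = *head;

                    *head = victim->next;
                    freed += victim->truesize;
                    free(victim);
            }
            return freed;
    }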

    Fixes: 36a6503fedda ("tcp: refine tcp_prune_ofo_queue() to not drop all packets")
    Signed-off-by: Eric Dumazet
    Reported-by: Juha-Matti Tilli
    Acked-by: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit a0496ef2c23b3b180902dd185d0d63ccbc624cf8 ]

    Per DCTCP RFC8257 (Section 3.2) the ACK reflecting the CE status change
    has to be sent immediately so the sender can respond quickly:

    """ When receiving packets, the CE codepoint MUST be processed as follows:

    1. If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to
    true and send an immediate ACK.

    2. If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE
    to false and send an immediate ACK.
    """

    Previously the DCTCP implementation might continue to delay the ACK. This
    patch fixes that, implementing the RFC by forcing an immediate ACK.
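
    A toy user-space model of the receiver rule quoted above (names invented;
    this is only an illustration of the state machine, not the kernel code):

    #include <stdbool.h>
    #include <stdio.h>

    static bool dctcp_ce;   /* DCTCP.CE state tracked by the receiver */

    /* Whenever the observed CE codepoint differs from DCTCP.CE, flip the
     * state and send the ACK immediately instead of letting it be delayed. */
    static void rx_segment(bool ce_codepoint)
    {
            if (ce_codepoint != dctcp_ce) {
                    dctcp_ce = ce_codepoint;
                    printf("send immediate ACK (CE now %d)\n", dctcp_ce);
            } else {
                    printf("ACK may be delayed as usual\n");
            }
    }

    int main(void)
    {
            rx_segment(false);
            rx_segment(true);       /* CE set: must ACK at once */
            rx_segment(true);
            rx_segment(false);      /* CE cleared: must ACK at once */
            return 0;
    }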

    Tested with this packetdrill script provided by Larry Brakmo:

    0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
    0.000 bind(3, ..., ...) = 0
    0.000 listen(3, 1) = 0

    0.100 < [ect0] SEW 0:0(0) win 32792
    0.100 > SE. 0:0(0) ack 1
    0.110 < [ect0] . 1:1(0) ack 1 win 257
    0.200 accept(3, ..., ...) = 4
    +0 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0

    0.200 < [ect0] . 1:1001(1000) ack 1 win 257
    0.200 > [ect01] . 1:1(0) ack 1001

    0.200 write(4, ..., 1) = 1
    0.200 > [ect01] P. 1:2(1) ack 1001

    0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
    +0.005 < [ce] . 2001:3001(1000) ack 2 win 257

    +0.000 > [ect01] . 2:2(0) ack 2001
    // Previously the ACK below would be delayed by 40ms
    +0.000 > [ect01] E. 2:2(0) ack 3001

    +0.500 < F. 9501:9501(0) ack 4 win 257

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yuchung Cheng
     

22 Jul, 2018

1 commit

  • [ Upstream commit 1236f22fbae15df3736ab4a984c64c0c6ee6254c ]

    If SACK is not enabled and the first cumulative ACK after the RTO
    retransmission covers more than the retransmitted skb, a spurious
    FRTO undo will trigger (assuming FRTO is enabled for that RTO).
    The reason is that any non-retransmitted segment acknowledged will
    set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
    no indication that it would have been delivered for real (the
    scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
    case so the check for that bit won't help like it does with SACK).
    Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
    in tcp_process_loss.

    We need to use a stricter condition for the non-SACK case and check
    that none of the cumulatively ACKed segments were retransmitted
    to prove that progress is due to original transmissions. Only then
    keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
    non-SACK case.

    (FLAG_ORIG_SACK_ACKED is planned to be renamed to FLAG_ORIG_PROGRESS
    to better indicate its purpose but to keep this change minimal, it
    will be done in another patch).
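
    In rough pseudo-C the stricter rule reads like the sketch below
    (FLAG_ORIG_SACK_ACKED follows the description above;
    FLAG_RETRANS_DATA_ACKED is used here as an illustrative name for "some
    cumulatively ACKed segment had been retransmitted"):

    #define FLAG_ORIG_SACK_ACKED    0x1
    #define FLAG_RETRANS_DATA_ACKED 0x2     /* illustrative */

    /* Without SACK there is no per-segment scoreboard, so only trust
     * "original data made progress" when none of the newly ACKed segments
     * had been retransmitted; otherwise clear the flag so no FRTO undo. */
    static int filter_frto_flags(int flag, int sack_enabled)
    {
            if (!sack_enabled && (flag & FLAG_RETRANS_DATA_ACKED))
                    flag &= ~FLAG_ORIG_SACK_ACKED;
            return flag;
    }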

    Besides burstiness and congestion control violations, this problem
    can result in an RTO loop: when loss recovery is prematurely
    undone, only new data will be transmitted (if available) and
    the next retransmission can occur only after a new RTO, which in the
    case of multiple losses (that are not for consecutive packets) requires
    one RTO per loss to recover.

    Signed-off-by: Ilpo Järvinen
    Tested-by: Neal Cardwell
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Ilpo Järvinen
     

21 Jun, 2018

1 commit

  • commit 02db55718d53f9d426cee504c27fb768e9ed4ffe upstream.

    While rcvbuf is properly clamped by tcp_rmem[2], rcvwin
    is left at a potentially too large value.

    It has no serious effect, since:
    1) tcp_grow_window() has very strict checks.
    2) window_clamp can be mangled by user space to any value anyway.

    tcp_init_buffer_space() and companions use tcp_full_space();
    we use tcp_win_from_space() to avoid reloading sk->sk_rcvbuf.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Wei Wang
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Cc: Benjamin Gilbert
    Signed-off-by: Guenter Roeck
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

05 Jun, 2018

1 commit

  • commit 607065bad9931e72207b0cac365d7d4abc06bd99 upstream.

    When using large tcp_rmem[2] values (I did tests with 500 MB),
    I noticed overflows while computing rcvwin.

    Let's fix this before the following patch.
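
    A self-contained illustration of the overflow class involved (the
    multiplier is arbitrary for the example; the real expression is in
    tcp_rcv_space_adjust()):

    #include <stdio.h>

    int main(void)
    {
            int space = 500 * 1024 * 1024;          /* ~500 MB, as in the test above */
            long long wide = (long long)space * 8;  /* widen before multiplying */
            int narrow = (int)wide;                 /* what a 32-bit intermediate keeps */

            printf("64-bit product: %lld\n", wide);
            printf("truncated to 32 bits: %d\n", narrow);
            return 0;
    }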

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Wei Wang
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller
    [Backport: sysctl_tcp_rmem is not Namespace-ify'd in older kernels]
    Signed-off-by: Guenter Roeck
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

29 Apr, 2018

1 commit

  • [ Upstream commit 7e5a206ab686f098367b61aca989f5cdfa8114a3 ]

    The old code reads the "opsize" variable from out-of-bounds memory (first
    byte behind the segment) if a broken TCP segment ends directly after an
    opcode that is neither EOL nor NOP.

    The result of the read isn't used for anything, so the worst thing that
    could theoretically happen is a pagefault; and since the physmap is usually
    mostly contiguous, even that seems pretty unlikely.
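
    For context, a minimal user-space sketch of this kind of option walk,
    with the remaining-length check done before reading the option size (a
    model of the pattern, not the kernel parser itself):

    #include <stddef.h>

    #define TCPOPT_EOL 0
    #define TCPOPT_NOP 1

    /* Return a pointer to the option of kind "want", or NULL. Checking that
     * at least two bytes remain before reading opsize avoids reading past
     * the end of a malformed segment. */
    static const unsigned char *find_option(const unsigned char *ptr, int length, int want)
    {
            while (length > 0) {
                    int opcode = *ptr++;
                    int opsize;

                    if (opcode == TCPOPT_EOL)
                            return NULL;
                    if (opcode == TCPOPT_NOP) {
                            length--;
                            continue;
                    }
                    if (length < 2)                 /* no room left for a size byte */
                            return NULL;
                    opsize = *ptr++;
                    if (opsize < 2 || opsize > length)
                            return NULL;            /* malformed option */
                    if (opcode == want)
                            return ptr - 2;         /* start of the matching option */
                    ptr += opsize - 2;
                    length -= opsize;
            }
            return NULL;
    }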

    The following C reproducer triggers the uninitialized read - however, you
    can't actually see anything happen unless you put something like a
    pr_warn() in tcp_parse_md5sig_option() to print the opsize.

    ====================================
    #define _GNU_SOURCE
    /* NOTE: the original header names were lost in formatting; the list
       below is a plausible reconstruction covering what the code uses. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdarg.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <err.h>
    #include <sys/ioctl.h>
    #include <net/if.h>
    #include <linux/if_tun.h>
    #include <netinet/in.h>
    #include <netinet/ip.h>
    #include <netinet/tcp.h>
    #include <arpa/inet.h>

    void systemf(const char *command, ...) {
    char *full_command;
    va_list ap;
    va_start(ap, command);
    if (vasprintf(&full_command, command, ap) == -1)
    err(1, "vasprintf");
    va_end(ap);
    printf("systemf: <<<%s>>>\n", full_command);
    system(full_command);
    }

    char *devname;

    int tun_alloc(char *name) {
    int fd = open("/dev/net/tun", O_RDWR);
    if (fd == -1)
    err(1, "open tun dev");
    static struct ifreq req = { .ifr_flags = IFF_TUN|IFF_NO_PI };
    strcpy(req.ifr_name, name);
    if (ioctl(fd, TUNSETIFF, &req))
    err(1, "TUNSETIFF");
    devname = req.ifr_name;
    printf("device name: %s\n", devname);
    return fd;
    }

    /* The helper definitions below were garbled in formatting; this is a
    reconstruction matching their use in fix_ip_sum()/fix_tcp_sum(): a 16-bit
    ones'-complement checksum and an IPv4 address builder. */
    #define IPADDR(a,b,c,d) htonl(((a)<<24)+((b)<<16)+((c)<<8)+(d))

    void sum_accumulate(unsigned int *sum, void *data, int len) {
    unsigned short *words = data;
    for (int i = 0; i < len/2; i++)
    *sum += ntohs(words[i]);
    }

    unsigned short sum_final(unsigned int sum) {
    sum = (sum >> 16) + (sum & 0xffff);
    sum = (sum >> 16) + (sum & 0xffff);
    return htons(~sum);
    }

    void fix_ip_sum(struct iphdr *ip) {
    unsigned int sum = 0;
    sum_accumulate(&sum, ip, sizeof(*ip));
    ip->check = sum_final(sum);
    }

    void fix_tcp_sum(struct iphdr *ip, struct tcphdr *tcp) {
    unsigned int sum = 0;
    struct {
    unsigned int saddr;
    unsigned int daddr;
    unsigned char pad;
    unsigned char proto_num;
    unsigned short tcp_len;
    } fakehdr = {
    .saddr = ip->saddr,
    .daddr = ip->daddr,
    .proto_num = ip->protocol,
    .tcp_len = htons(ntohs(ip->tot_len) - ip->ihl*4)
    };
    sum_accumulate(&sum, &fakehdr, sizeof(fakehdr));
    sum_accumulate(&sum, tcp, tcp->doff*4);
    tcp->check = sum_final(sum);
    }

    int main(void) {
    int tun_fd = tun_alloc("inject_dev%d");
    systemf("ip link set %s up", devname);
    systemf("ip addr add 192.168.42.1/24 dev %s", devname);

    struct {
    struct iphdr ip;
    struct tcphdr tcp;
    unsigned char tcp_opts[20];
    } __attribute__((packed)) syn_packet = {
    .ip = {
    .ihl = sizeof(struct iphdr)/4,
    .version = 4,
    .tot_len = htons(sizeof(syn_packet)),
    .ttl = 30,
    .protocol = IPPROTO_TCP,
    /* FIXUP check */
    .saddr = IPADDR(192,168,42,2),
    .daddr = IPADDR(192,168,42,1)
    },
    .tcp = {
    .source = htons(1),
    .dest = htons(1337),
    .seq = 0x12345678,
    .doff = (sizeof(syn_packet.tcp)+sizeof(syn_packet.tcp_opts))/4,
    .syn = 1,
    .window = htons(64),
    .check = 0 /*FIXUP*/
    },
    .tcp_opts = {
    /* INVALID: trailing MD5SIG opcode after NOPs */
    1, 1, 1, 1, 1,
    1, 1, 1, 1, 1,
    1, 1, 1, 1, 1,
    1, 1, 1, 1, 19
    }
    };
    fix_ip_sum(&syn_packet.ip);
    fix_tcp_sum(&syn_packet.ip, &syn_packet.tcp);
    while (1) {
    int write_res = write(tun_fd, &syn_packet, sizeof(syn_packet));
    if (write_res != sizeof(syn_packet))
    err(1, "packet write failed");
    }
    }
    ====================================

    Fixes: cfb6eeb4c860 ("[TCP]: MD5 Signature Option (RFC2385) support.")
    Signed-off-by: Jann Horn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

09 Mar, 2018

3 commits

  • [ Upstream commit a27fd7a8ed3856faaf5a2ff1c8c5f00c0667aaa0 ]

    When the connection is reset, there is no point in
    keeping the packets on the write queue until the connection
    is closed.

    RFC 793 (page 70) and RFC 793-bis (page 64) both suggest
    purging the write queue upon RST:
    https://tools.ietf.org/html/draft-ietf-tcpm-rfc793bis-07

    Moreover, this is essential for a correct MSG_ZEROCOPY
    implementation, because userspace cannot call close(fd)
    before receiving zerocopy signals even when the connection
    is reset.

    Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
    Signed-off-by: Soheil Hassas Yeganeh
    Reviewed-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Soheil Hassas Yeganeh
     
  • [ Upstream commit fc68e171d376c322e6777a3d7ac2f0278b68b17f ]

    This reverts commit 89fe18e44f7ee5ab1c90d0dff5835acee7751427.

    While the patch could detect more spurious timeouts, it could cause
    poor TCP performance on broken middle-boxes that modify TCP packets
    (e.g. receive window, SACK options). Since the performance gain is
    much smaller than the potential loss, the best solution is
    to fully revert the change.

    Fixes: 89fe18e44f7e ("tcp: extend F-RTO to catch more spurious timeouts")
    Reported-by: Teodor Milkov
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yuchung Cheng
     
  • [ Upstream commit d4131f09770d9b7471c9da65e6ecd2477746ac5c ]

    This reverts commit cc663f4d4c97b7297fb45135ab23cfd508b35a77. While fixing
    some broken middle-boxes that modify receive window fields, it does not
    address middle-boxes that strip off SACK options. The best solution is
    to fully revert this patch and the root F-RTO enhancement.

    Fixes: cc663f4d4c97 ("tcp: restrict F-RTO to work-around broken middle-boxes")
    Reported-by: Teodor Milkov
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yuchung Cheng
     

22 Feb, 2018

1 commit

  • commit 4950276672fce5c241857540f8561c440663673d upstream.

    Patch series "kmemcheck: kill kmemcheck", v2.

    As discussed at LSF/MM, kill kmemcheck.

    KASan is a replacement that is able to work without the limitation of
    kmemcheck (single CPU, slow). KASan is already upstream.

    We are also not aware of any users of kmemcheck (or users who don't
    consider KASan as a suitable replacement).

    The only objection was that since KASAN wasn't supported by all GCC
    versions provided by distros at that time we should hold off for 2
    years, and try again.

    Now that 2 years have passed, and all distros provide gcc that supports
    KASAN, kill kmemcheck again for the very same reasons.

    This patch (of 4):

    Remove kmemcheck annotations, and calls to kmemcheck from the kernel.

    [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
    Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
    Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     

03 Jan, 2018

2 commits

  • [ Upstream commit 9ee11bd03cb1a5c3ca33c2bb70e7ed325f68890f ]

    When ms timestamp is used, current logic uses 1us in
    tcp_rcv_rtt_update() when the real rcv_rtt is within 1 - 999us.
    This could cause rcv_rtt underestimation.
    Fix it by always using a min value of 1ms if ms timestamp is used.
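
    A trivial sketch of the clamp described (function and parameter names
    are invented for illustration):

    /* With millisecond-granularity peer timestamps, a sub-millisecond sample
     * only means "less than 1 ms", so use 1 ms (1000 us) as the floor rather
     * than collapsing it to 1 us. */
    static long rcv_rtt_sample_us(long measured_us, int ms_timestamps)
    {
            if (ms_timestamps && measured_us < 1000)
                    return 1000;
            return measured_us;
    }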

    Fixes: 645f4c6f2ebd ("tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps")
    Signed-off-by: Wei Wang
    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Wei Wang
     
  • [ Upstream commit d4761754b4fb2ef8d9a1e9d121c4bec84e1fe292 ]

    Mark tcp_sock during a SACK reneging event and invalidate rate samples
    while marked. Such rate samples may overestimate bw by including packets
    that were SACKed before reneging.

    < ack 6001 win 10000 sack 7001:38001
    < ack 7001 win 0 sack 8001:38001 // Reneg detected
    > seq 7001:8001 // RTO, SACK cleared.
    < ack 38001 win 10000

    In above example the rate sample taken after the last ack will count
    7001-38001 as delivered while the actual delivery rate likely could
    be much lower i.e. 7001-8001.

    This patch adds a new field tcp_sock.sack_reneg and marks it when we
    declare SACK reneging and entering TCP_CA_Loss, and unmarks it after
    the last rate sample was taken before moving back to TCP_CA_Open. This
    patch also invalidates rate samples taken while tcp_sock.is_sack_reneg
    is set.

    Fixes: b9f64820fb22 ("tcp: track data delivery rate for a TCP connection")
    Signed-off-by: Yousuk Seung
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Eric Dumazet
    Acked-by: Priyaranjan Jha
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yousuk Seung
     

17 Dec, 2017

2 commits

  • [ Upstream commit ed66dfaf236c04d414de1d218441296e57fb2bd2 ]

    Fix the TLP scheduling logic so that when scheduling a TLP probe, we
    ensure that the estimated time at which an RTO would fire accounts for
    the fact that ACKs indicating forward progress should push back RTO
    times.

    After the following fix:

    df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")

    we had an unintentional behavior change in the following kind of
    scenario: suppose the RTT variance has been very low recently. Then
    suppose we send out a flight of N packets and our RTT is 100ms:

    t=0: send a flight of N packets
    t=100ms: receive an ACK for N-1 packets

    The response before df92c8394e6e was:
    -> schedule a TLP for now + RTO_interval

    The response after df92c8394e6e is:
    -> schedule a TLP for t=0 + RTO_interval

    Since RTO_interval = srtt + RTT_variance, this means that we have
    scheduled a TLP timer at a point in the future that only accounts for
    RTT_variance. If the RTT_variance term is small, this means that the
    timer fires soon.

    Before df92c8394e6e this would not happen, because in that code, when
    we receive an ACK for a prefix of flight, we did:

    1) Near the top of tcp_ack(), switch from TLP timer to RTO
    at write_queue_head->packet_tx_time + RTO_interval:
    if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
    tcp_rearm_rto(sk);

    2) In tcp_clean_rtx_queue(), update the RTO to now + RTO_interval:
    if (flag & FLAG_ACKED) {
    tcp_rearm_rto(sk);

    3) In tcp_ack() after tcp_fastretrans_alert() switch from RTO
    to TLP at now + RTO_interval:
    if (icsk->icsk_pending == ICSK_TIME_RETRANS)
    tcp_schedule_loss_probe(sk);

    In df92c8394e6e we removed that 3-phase dance, and instead directly
    set the TLP timer once: we set the TLP timer in cases like this to
    write_queue_head->packet_tx_time + RTO_interval. So if the RTT
    variance is small, then this means that this is setting the TLP timer
    to fire quite soon. This means if the ACK for the tail of the flight
    takes longer than an RTT to arrive (often due to delayed ACKs), then
    the TLP timer fires too quickly.
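
    A small worked example of the arithmetic, using the numbers from the
    scenario above (srtt 100 ms, tiny variance); the point is how little
    margin remains when the timer is anchored at the original transmit time:

    #include <stdio.h>

    int main(void)
    {
            int srtt_ms = 100, rttvar_ms = 5;
            int rto_ms = srtt_ms + rttvar_ms;       /* simplified RTO_interval, as above */
            int sent_at_ms = 0, ack_at_ms = 100;    /* flight at t=0, partial ACK at t=100 */

            printf("anchored at transmit time: fires at t=%d ms (%d ms after the ACK)\n",
                   sent_at_ms + rto_ms, sent_at_ms + rto_ms - ack_at_ms);
            printf("anchored at ACK time: fires at t=%d ms\n", ack_at_ms + rto_ms);
            return 0;
    }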

    Fixes: df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Neal Cardwell
     
  • [ Upstream commit 8632385022f2b05a6ca0b9e0f95575865de0e2ce ]

    When I switched rcv_rtt_est to high resolution timestamps, I forgot
    that tp->tcp_mstamp needed to be refreshed in tcp_rcv_space_adjust().

    Using an old timestamp leads to autotuning lags.

    Fixes: 645f4c6f2ebd ("tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps")
    Signed-off-by: Eric Dumazet
    Cc: Wei Wang
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

10 Nov, 2017

1 commit

  • This patch fixes the cause of a WARNING indicating TCP has pending
    retransmission in Open state in tcp_fastretrans_alert().

    The root cause is a bad interaction between path mtu probing,
    if enabled, and the RACK loss detection. Upon receiving a SACK
    above the sequence of the MTU probing packet, RACK could mark the
    probe packet lost in tcp_fastretrans_alert(), prior to calling
    tcp_simple_retransmit().

    tcp_simple_retransmit() only enters Loss state if it newly marks
    the probe packet lost. If the probe packet is already identified as
    lost by RACK, the sender remains in Open state with some packets
    marked lost and retransmitted. Then the next SACK would trigger
    the warning. The likely scenario is that the probe packet was
    lost due to its size or network congestion. The actual impact of
    this warning is small: the sender potentially enters fast recovery
    one ACK later.

    The simple fix is to always enter recovery (Loss) state if some
    packet is marked lost during path MTU probing.

    Fixes: a0370b3f3f2c ("tcp: enable RACK loss detection to trigger recovery")
    Reported-by: Oleksandr Natalenko
    Reported-by: Alexei Starovoitov
    Reported-by: Roman Gushchin
    Signed-off-by: Yuchung Cheng
    Reviewed-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

05 Nov, 2017

1 commit

  • Fixes DSACK-based undo when the sender is in Open state and
    an ACK advances snd_una.

    Example scenario:
    - Sender goes into recovery and makes some spurious rtx.
    - It comes out of recovery and enters into open state.
    - It sends some more packets, let's say 4.
    - The receiver sends an ACK for the first two, but this ACK is lost.
    - The sender receives an ACK for the first two, and a DSACK for the
    previous spurious rtx.

    Signed-off-by: Priyaranjan Jha
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Acked-by: Yousuk Seung
    Signed-off-by: David S. Miller

    Priyaranjan Jha
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.
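
    For a C source file the added identifier is a single comment on the
    first line, for example:

    // SPDX-License-Identifier: GPL-2.0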

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier was to be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source.
    - File already had some variant of a license header in it (even if <5
    lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

21 Oct, 2017

1 commit

  • syzkaller found another bug in DCCP/TCP stacks [1]

    For the reasons explained in commit ce1050089c96 ("tcp/dccp: fix
    ireq->pktopts race"), we need to make sure we do not access
    ireq->opt unless we own the request sock.

    Note the opt field is renamed to ireq_opt to ease grep games.

    [1]
    BUG: KASAN: use-after-free in ip_queue_xmit+0x1687/0x18e0 net/ipv4/ip_output.c:474
    Read of size 1 at addr ffff8801c951039c by task syz-executor5/3295

    CPU: 1 PID: 3295 Comm: syz-executor5 Not tainted 4.14.0-rc4+ #80
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:16 [inline]
    dump_stack+0x194/0x257 lib/dump_stack.c:52
    print_address_description+0x73/0x250 mm/kasan/report.c:252
    kasan_report_error mm/kasan/report.c:351 [inline]
    kasan_report+0x25b/0x340 mm/kasan/report.c:409
    __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:427
    ip_queue_xmit+0x1687/0x18e0 net/ipv4/ip_output.c:474
    tcp_transmit_skb+0x1ab7/0x3840 net/ipv4/tcp_output.c:1135
    tcp_send_ack.part.37+0x3bb/0x650 net/ipv4/tcp_output.c:3587
    tcp_send_ack+0x49/0x60 net/ipv4/tcp_output.c:3557
    __tcp_ack_snd_check+0x2c6/0x4b0 net/ipv4/tcp_input.c:5072
    tcp_ack_snd_check net/ipv4/tcp_input.c:5085 [inline]
    tcp_rcv_state_process+0x2eff/0x4850 net/ipv4/tcp_input.c:6071
    tcp_child_process+0x342/0x990 net/ipv4/tcp_minisocks.c:816
    tcp_v4_rcv+0x1827/0x2f80 net/ipv4/tcp_ipv4.c:1682
    ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:464 [inline]
    ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
    __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
    netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
    netif_receive_skb+0xae/0x390 net/core/dev.c:4611
    tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
    tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
    tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
    call_write_iter include/linux/fs.h:1770 [inline]
    new_sync_write fs/read_write.c:468 [inline]
    __vfs_write+0x68a/0x970 fs/read_write.c:481
    vfs_write+0x18f/0x510 fs/read_write.c:543
    SYSC_write fs/read_write.c:588 [inline]
    SyS_write+0xef/0x220 fs/read_write.c:580
    entry_SYSCALL_64_fastpath+0x1f/0xbe
    RIP: 0033:0x40c341
    RSP: 002b:00007f469523ec10 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 000000000040c341
    RDX: 0000000000000037 RSI: 0000000020004000 RDI: 0000000000000015
    RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
    R10: 00000000000f4240 R11: 0000000000000293 R12: 00000000004b7fd1
    R13: 00000000ffffffff R14: 0000000020000000 R15: 0000000000025000

    Allocated by task 3295:
    save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
    save_stack+0x43/0xd0 mm/kasan/kasan.c:447
    set_track mm/kasan/kasan.c:459 [inline]
    kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
    __do_kmalloc mm/slab.c:3725 [inline]
    __kmalloc+0x162/0x760 mm/slab.c:3734
    kmalloc include/linux/slab.h:498 [inline]
    tcp_v4_save_options include/net/tcp.h:1962 [inline]
    tcp_v4_init_req+0x2d3/0x3e0 net/ipv4/tcp_ipv4.c:1271
    tcp_conn_request+0xf6d/0x3410 net/ipv4/tcp_input.c:6283
    tcp_v4_conn_request+0x157/0x210 net/ipv4/tcp_ipv4.c:1313
    tcp_rcv_state_process+0x8ea/0x4850 net/ipv4/tcp_input.c:5857
    tcp_v4_do_rcv+0x55c/0x7d0 net/ipv4/tcp_ipv4.c:1482
    tcp_v4_rcv+0x2d10/0x2f80 net/ipv4/tcp_ipv4.c:1711
    ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:464 [inline]
    ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
    __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
    netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
    netif_receive_skb+0xae/0x390 net/core/dev.c:4611
    tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
    tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
    tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
    call_write_iter include/linux/fs.h:1770 [inline]
    new_sync_write fs/read_write.c:468 [inline]
    __vfs_write+0x68a/0x970 fs/read_write.c:481
    vfs_write+0x18f/0x510 fs/read_write.c:543
    SYSC_write fs/read_write.c:588 [inline]
    SyS_write+0xef/0x220 fs/read_write.c:580
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    Freed by task 3306:
    save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
    save_stack+0x43/0xd0 mm/kasan/kasan.c:447
    set_track mm/kasan/kasan.c:459 [inline]
    kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
    __cache_free mm/slab.c:3503 [inline]
    kfree+0xca/0x250 mm/slab.c:3820
    inet_sock_destruct+0x59d/0x950 net/ipv4/af_inet.c:157
    __sk_destruct+0xfd/0x910 net/core/sock.c:1560
    sk_destruct+0x47/0x80 net/core/sock.c:1595
    __sk_free+0x57/0x230 net/core/sock.c:1603
    sk_free+0x2a/0x40 net/core/sock.c:1614
    sock_put include/net/sock.h:1652 [inline]
    inet_csk_complete_hashdance+0xd5/0xf0 net/ipv4/inet_connection_sock.c:959
    tcp_check_req+0xf4d/0x1620 net/ipv4/tcp_minisocks.c:765
    tcp_v4_rcv+0x17f6/0x2f80 net/ipv4/tcp_ipv4.c:1675
    ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:464 [inline]
    ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
    __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
    netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
    netif_receive_skb+0xae/0x390 net/core/dev.c:4611
    tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
    tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
    tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
    call_write_iter include/linux/fs.h:1770 [inline]
    new_sync_write fs/read_write.c:468 [inline]
    __vfs_write+0x68a/0x970 fs/read_write.c:481
    vfs_write+0x18f/0x510 fs/read_write.c:543
    SYSC_write fs/read_write.c:588 [inline]
    SyS_write+0xef/0x220 fs/read_write.c:580
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Aug, 2017

2 commits

  • This reverts commit 45f119bf936b1f9f546a0b139c5b56f9bb2bdc78.

    Eric Dumazet says:
    We found at Google a significant regression caused by
    45f119bf936b1f9f546a0b139c5b56f9bb2bdc78 tcp: remove header prediction

    In typical RPC (TCP_RR), when a TCP socket receives data, we now call
    tcp_ack() while we used to not call it.

    This touches enough cache lines to cause a slowdown.

    so the problem does not seem to be HP removal itself but the tcp_ack()
    call. Therefore, it might be possible to remove HP after all, provided
    one finds a way to elide tcp_ack for most cases.

    Reported-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • This change was a followup to the header prediction removal,
    so first revert this as a prerequisite to back out hp removal.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

30 Aug, 2017

1 commit

  • Florian reported UDP xmit drops that could be root caused to the
    too small neigh limit.

    Current limit is 64 KB, meaning that even a single UDP socket would hit
    it, since its default sk_sndbuf comes from net.core.wmem_default
    (~212992 bytes on 64bit arches).

    Once ARP/ND resolution is in progress, we should allow a few more
    packets to be queued, at least for one producer.

    Once neigh arp_queue is filled, a rogue socket should hit its sk_sndbuf
    limit and either block in sendmsg() or return -EAGAIN.

    Signed-off-by: Eric Dumazet
    Reported-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Aug, 2017

1 commit

  • When SOF_TIMESTAMPING_RX_SOFTWARE is enabled for tcp sockets, return the
    timestamp corresponding to the highest sequence number data returned.

    Previously the skb->tstamp is overwritten when a TCP packet is placed
    in the out of order queue. While the packet is in the ooo queue, save the
    timestamp in TCP_SKB_CB. This space is shared with the gso_*
    options which are only used on the tx path, and a previously unused 4
    byte hole.

    When skbs are coalesced either in the sk_receive_queue or the
    out_of_order_queue, always choose the timestamp of the appended skb to
    maintain the invariant of returning the timestamp of the last byte in
    the recvmsg buffer.
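
    A generic sketch of the space-sharing idea (types and field names are
    invented; the real layout is TCP's per-skb control block):

    #include <stdint.h>

    /* Transmit-only GSO bookkeeping and the receive-side timestamp are never
     * needed at the same time, so they can share storage in the per-packet
     * control block instead of growing it. */
    struct cb_model {
            union {
                    struct {
                            uint16_t gso_segs;      /* tx path only */
                            uint16_t gso_size;      /* tx path only */
                    } tx;
                    uint64_t rx_tstamp_ns;          /* rx: saved while in the ooo queue */
            };
    };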

    Signed-off-by: Mike Maloney
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Mike Maloney
     

23 Aug, 2017

2 commits


22 Aug, 2017

1 commit


19 Aug, 2017

1 commit

  • In some situations tcp_send_loss_probe() can realize that it's unable
    to send a loss probe (TLP), and falls back to calling tcp_rearm_rto()
    to schedule an RTO timer. In such cases, sometimes tcp_rearm_rto()
    realizes that the RTO was eligible to fire immediately or at some
    point in the past (delta_us <= 0).
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     

10 Aug, 2017

1 commit

  • The UDP offload conflict is dealt with by simply taking what is
    in net-next where we have removed all of the UFO handling code
    entirely.

    The TCP conflict was a case of local variables in a function
    being removed from both net and net-next.

    In netvsc we had an assignment right next to where a missing
    set of u64 stats sync object inits were added.

    Signed-off-by: David S. Miller

    David S. Miller
     

07 Aug, 2017

1 commit

  • Using ssthresh to revert cwnd is less reliable when ssthresh is
    bounded to 2 packets. This patch uses an existing TCP variable,
    "prior_cwnd", which snapshots the cwnd right before entering fast
    recovery and RTO recovery in Reno. This fixes the issue discussed
    in netdev thread: "A buggy behavior for Linux TCP Reno and HTCP"
    https://www.spinics.net/lists/netdev/msg444955.html
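
    A sketch of the idea in plain C (names mirror the description above, not
    necessarily the kernel's):

    struct cc_state { unsigned int cwnd, prior_cwnd, ssthresh; };

    /* Snapshot cwnd right before recovery begins; on undo, restore from the
     * snapshot instead of deriving cwnd from a possibly clamped ssthresh. */
    static void enter_recovery(struct cc_state *s)
    {
            s->prior_cwnd = s->cwnd;
            s->ssthresh = s->cwnd / 2 > 2 ? s->cwnd / 2 : 2;   /* Reno-style halving */
            s->cwnd = s->ssthresh;
    }

    static void undo_recovery(struct cc_state *s)
    {
            s->cwnd = s->cwnd > s->prior_cwnd ? s->cwnd : s->prior_cwnd;
    }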

    Suggested-by: Neal Cardwell
    Reported-by: Wei Sun
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

04 Aug, 2017

1 commit

  • Fix a TCP loss recovery performance bug raised recently on the netdev
    list, in two threads:

    (i) July 26, 2017: netdev thread "TCP fast retransmit issues"
    (ii) July 26, 2017: netdev thread:
    "[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
    outstanding TLP retransmission"

    The basic problem is that incoming TCP packets that did not indicate
    forward progress could cause the xmit timer (TLP or RTO) to be rearmed
    and pushed back in time. In certain corner cases this could result in
    the following problems noted in these threads:

    - Repeated ACKs coming in with bogus SACKs corrupted by middleboxes
    could cause TCP to repeatedly schedule TLPs forever. We kept
    sending TLPs after every ~200ms, which elicited bogus SACKs, which
    caused more TLPs, ad infinitum; we never fired an RTO to fill in
    the holes.

    - Incoming data segments could, in some cases, cause us to reschedule
    our RTO or TLP timer further out in time, for no good reason. This
    could cause repeated inbound data to result in stalls in outbound
    data, in the presence of packet loss.

    This commit fixes these bugs by changing the TLP and RTO ACK
    processing to:

    (a) Only reschedule the xmit timer once per ACK.

    (b) Only reschedule the xmit timer if tcp_clean_rtx_queue() deems the
    ACK indicates sufficient forward progress (a packet was
    cumulatively ACKed, or we got a SACK for a packet that was sent
    before the most recent retransmit of the write queue head).

    This brings us back into closer compliance with the RFCs, since, as
    the comment for tcp_rearm_rto() notes, we should only restart the RTO
    timer after forward progress on the connection. Previously we were
    restarting the xmit timer even in these cases where there was no
    forward progress.

    As a side benefit, this commit simplifies and speeds up the TCP timer
    arming logic. We had been calling inet_csk_reset_xmit_timer() three
    times on normal ACKs that cumulatively acknowledged some data:

    1) Once near the top of tcp_ack() to switch from TLP timer to RTO:
    if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
    tcp_rearm_rto(sk);

    2) Once in tcp_clean_rtx_queue(), to update the RTO:
    if (flag & FLAG_ACKED) {
    tcp_rearm_rto(sk);

    3) Once in tcp_ack() after tcp_fastretrans_alert() to switch from RTO
    to TLP:
    if (icsk->icsk_pending == ICSK_TIME_RETRANS)
    tcp_schedule_loss_probe(sk);

    This commit, by only rescheduling the xmit timer once per ACK,
    simplifies the code and reduces CPU overhead.

    This commit was tested in an A/B test with Google web server
    traffic. SNMP stats and request latency metrics were within noise
    levels, substantiating that for normal web traffic patterns this is a
    rare issue. This commit was also tested with packetdrill tests to
    verify that it fixes the timer behavior in the corner cases discussed
    in the netdev threads mentioned above.

    This patch is a bug fix patch intended to be queued for -stable
    releases.

    Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)")
    Reported-by: Klavs Klavsen
    Reported-by: Mao Wenan
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell