29 Jan, 2020

1 commit

  • [ Upstream commit 2bec445f9bf35e52e395b971df48d3e1e5dc704a ]

    Latest commit 853697504de0 ("tcp: Fix highest_sack and highest_sack_seq")
    apparently allowed syzbot to trigger various crashes in TCP stack [1]

    I believe this commit only made things easier for syzbot to find
    its way into triggering use-after-frees. But really the bugs
    could lead to bad TCP behavior or plain crashes, even for
    non-malicious peers.

    I have audited all calls to tcp_rtx_queue_unlink() and
    tcp_rtx_queue_unlink_and_free() and made sure tp->highest_sack would be updated
    if we are removing from rtx queue the skb that tp->highest_sack points to.

    These updates were missing in three locations :

    1) tcp_clean_rtx_queue() [This one seems quite serious,
    I have no idea why this was not caught earlier]

    2) tcp_rtx_queue_purge() [Probably not a big deal for normal operations]

    3) tcp_send_synack() [Probably not a big deal for normal operations]
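
    The invariant described above can be sketched in user space as follows.
    This is a minimal illustration, not the kernel code: struct seg,
    struct rtx_queue, rtx_append() and rtx_unlink_and_free() are hypothetical
    stand-ins, and moving the cached pointer to the next segment is an
    assumed policy chosen for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel's sk_buff and retransmit queue. */
struct seg {
    unsigned int seq;
    struct seg *prev, *next;
};

struct rtx_queue {
    struct seg *head, *tail;
    struct seg *highest_sack;   /* cached pointer into the queue, may be NULL */
};

/* Append a newly allocated segment to the tail of the queue. */
static struct seg *rtx_append(struct rtx_queue *q, unsigned int seq)
{
    struct seg *s = calloc(1, sizeof(*s));

    assert(s != NULL);
    s->seq = seq;
    s->prev = q->tail;
    if (q->tail)
        q->tail->next = s;
    else
        q->head = s;
    q->tail = s;
    return s;
}

/*
 * Unlink and free a segment. If the cached pointer refers to it, move the
 * cache to the next segment (NULL when none is left) before freeing, so the
 * cache never points at freed memory.
 */
static void rtx_unlink_and_free(struct rtx_queue *q, struct seg *s)
{
    if (q->highest_sack == s)
        q->highest_sack = s->next;

    if (s->prev)
        s->prev->next = s->next;
    else
        q->head = s->next;
    if (s->next)
        s->next->prev = s->prev;
    else
        q->tail = s->prev;
    free(s);
}
```

    Every path that removes a segment goes through the one helper, which is
    the property the audit above restores for the three listed call sites.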

    [1]
    BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1864 [inline]
    BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1856 [inline]
    BUG: KASAN: use-after-free in tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891
    Read of size 4 at addr ffff8880a488d068 by task ksoftirqd/1/16

    CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.5.0-rc5-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x197/0x210 lib/dump_stack.c:118
    print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
    __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506
    kasan_report+0x12/0x20 mm/kasan/common.c:639
    __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:134
    tcp_highest_sack_seq include/net/tcp.h:1864 [inline]
    tcp_highest_sack_seq include/net/tcp.h:1856 [inline]
    tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891
    tcp_try_undo_partial net/ipv4/tcp_input.c:2730 [inline]
    tcp_fastretrans_alert+0xf74/0x23f0 net/ipv4/tcp_input.c:2847
    tcp_ack+0x2577/0x5bf0 net/ipv4/tcp_input.c:3710
    tcp_rcv_established+0x6dd/0x1e90 net/ipv4/tcp_input.c:5706
    tcp_v4_do_rcv+0x619/0x8d0 net/ipv4/tcp_ipv4.c:1619
    tcp_v4_rcv+0x307f/0x3b40 net/ipv4/tcp_ipv4.c:2001
    ip_protocol_deliver_rcu+0x5a/0x880 net/ipv4/ip_input.c:204
    ip_local_deliver_finish+0x23b/0x380 net/ipv4/ip_input.c:231
    NF_HOOK include/linux/netfilter.h:307 [inline]
    NF_HOOK include/linux/netfilter.h:301 [inline]
    ip_local_deliver+0x1e9/0x520 net/ipv4/ip_input.c:252
    dst_input include/net/dst.h:442 [inline]
    ip_rcv_finish+0x1db/0x2f0 net/ipv4/ip_input.c:428
    NF_HOOK include/linux/netfilter.h:307 [inline]
    NF_HOOK include/linux/netfilter.h:301 [inline]
    ip_rcv+0xe8/0x3f0 net/ipv4/ip_input.c:538
    __netif_receive_skb_one_core+0x113/0x1a0 net/core/dev.c:5148
    __netif_receive_skb+0x2c/0x1d0 net/core/dev.c:5262
    process_backlog+0x206/0x750 net/core/dev.c:6093
    napi_poll net/core/dev.c:6530 [inline]
    net_rx_action+0x508/0x1120 net/core/dev.c:6598
    __do_softirq+0x262/0x98c kernel/softirq.c:292
    run_ksoftirqd kernel/softirq.c:603 [inline]
    run_ksoftirqd+0x8e/0x110 kernel/softirq.c:595
    smpboot_thread_fn+0x6a3/0xa40 kernel/smpboot.c:165
    kthread+0x361/0x430 kernel/kthread.c:255
    ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

    Allocated by task 10091:
    save_stack+0x23/0x90 mm/kasan/common.c:72
    set_track mm/kasan/common.c:80 [inline]
    __kasan_kmalloc mm/kasan/common.c:513 [inline]
    __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:486
    kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:521
    slab_post_alloc_hook mm/slab.h:584 [inline]
    slab_alloc_node mm/slab.c:3263 [inline]
    kmem_cache_alloc_node+0x138/0x740 mm/slab.c:3575
    __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:198
    alloc_skb_fclone include/linux/skbuff.h:1099 [inline]
    sk_stream_alloc_skb net/ipv4/tcp.c:875 [inline]
    sk_stream_alloc_skb+0x113/0xc90 net/ipv4/tcp.c:852
    tcp_sendmsg_locked+0xcf9/0x3470 net/ipv4/tcp.c:1282
    tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1432
    inet_sendmsg+0x9e/0xe0 net/ipv4/af_inet.c:807
    sock_sendmsg_nosec net/socket.c:652 [inline]
    sock_sendmsg+0xd7/0x130 net/socket.c:672
    __sys_sendto+0x262/0x380 net/socket.c:1998
    __do_sys_sendto net/socket.c:2010 [inline]
    __se_sys_sendto net/socket.c:2006 [inline]
    __x64_sys_sendto+0xe1/0x1a0 net/socket.c:2006
    do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Freed by task 10095:
    save_stack+0x23/0x90 mm/kasan/common.c:72
    set_track mm/kasan/common.c:80 [inline]
    kasan_set_free_info mm/kasan/common.c:335 [inline]
    __kasan_slab_free+0x102/0x150 mm/kasan/common.c:474
    kasan_slab_free+0xe/0x10 mm/kasan/common.c:483
    __cache_free mm/slab.c:3426 [inline]
    kmem_cache_free+0x86/0x320 mm/slab.c:3694
    kfree_skbmem+0x178/0x1c0 net/core/skbuff.c:645
    __kfree_skb+0x1e/0x30 net/core/skbuff.c:681
    sk_eat_skb include/net/sock.h:2453 [inline]
    tcp_recvmsg+0x1252/0x2930 net/ipv4/tcp.c:2166
    inet_recvmsg+0x136/0x610 net/ipv4/af_inet.c:838
    sock_recvmsg_nosec net/socket.c:886 [inline]
    sock_recvmsg net/socket.c:904 [inline]
    sock_recvmsg+0xce/0x110 net/socket.c:900
    __sys_recvfrom+0x1ff/0x350 net/socket.c:2055
    __do_sys_recvfrom net/socket.c:2073 [inline]
    __se_sys_recvfrom net/socket.c:2069 [inline]
    __x64_sys_recvfrom+0xe1/0x1a0 net/socket.c:2069
    do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    The buggy address belongs to the object at ffff8880a488d040
    which belongs to the cache skbuff_fclone_cache of size 456
    The buggy address is located 40 bytes inside of
    456-byte region [ffff8880a488d040, ffff8880a488d208)
    The buggy address belongs to the page:
    page:ffffea0002922340 refcount:1 mapcount:0 mapping:ffff88821b057000 index:0x0
    raw: 00fffe0000000200 ffffea00022a5788 ffffea0002624a48 ffff88821b057000
    raw: 0000000000000000 ffff8880a488d040 0000000100000006 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff8880a488cf00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff8880a488cf80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffff8880a488d000: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
    ^
    ffff8880a488d080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff8880a488d100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

    Fixes: 853697504de0 ("tcp: Fix highest_sack and highest_sack_seq")
    Fixes: 50895b9de1d3 ("tcp: highest_sack fix")
    Fixes: 737ff314563c ("tcp: use sequence distance to detect reordering")
    Signed-off-by: Eric Dumazet
    Cc: Cambda Zhu
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

23 Jan, 2020

1 commit

  • [ Upstream commit e176b1ba476cf36f723cfcc7a9e57f3cb47dec70 ]

    When the packet pointed to by retransmit_skb_hint is unlinked by ACK,
    retransmit_skb_hint will be set to NULL in tcp_clean_rtx_queue().
    If packet loss is detected at this time, retransmit_skb_hint will be set
    to point to the current packet loss in tcp_verify_retransmit_hint(),
    then the packets that were previously marked lost but not retransmitted
    due to the restriction of cwnd will be skipped and cannot be
    retransmitted.

    To fix this, when retransmit_skb_hint is NULL, retransmit_skb_hint can
    be reset only after all marked lost packets are retransmitted
    (retrans_out >= lost_out), otherwise we need to traverse from
    tcp_rtx_queue_head in tcp_xmit_retransmit_queue().
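
    The rule in the paragraph above can be sketched as a small user-space
    function. The types and names here (struct seg, struct fake_tp,
    verify_retransmit_hint()) are hypothetical; only the condition on
    retrans_out and lost_out comes from the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct seg { uint32_t seq; };

struct fake_tp {
    const struct seg *retransmit_hint;  /* NULL means "start from queue head" */
    uint32_t retrans_out;               /* segments retransmitted, not yet acked */
    uint32_t lost_out;                  /* segments currently marked lost */
};

/* Wrap-safe "a is before b" comparison on 32-bit sequence numbers. */
static bool seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/*
 * Called when `lost` has just been marked lost. A NULL hint is replaced only
 * once every previously lost segment has been retransmitted; a non-NULL hint
 * is only ever moved backwards in sequence space.
 */
static void verify_retransmit_hint(struct fake_tp *tp, const struct seg *lost)
{
    assert(lost != NULL);
    if ((!tp->retransmit_hint && tp->retrans_out >= tp->lost_out) ||
        (tp->retransmit_hint &&
         seq_before(lost->seq, tp->retransmit_hint->seq)))
        tp->retransmit_hint = lost;
}
```

    When the hint stays NULL, the retransmit loop is expected to walk from the
    head of the queue, so earlier lost segments are not skipped.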

    Packetdrill to demonstrate:

    // Disable RACK and set max_reordering to keep things simple
    0 `sysctl -q net.ipv4.tcp_recovery=0`
    +0 `sysctl -q net.ipv4.tcp_max_reordering=3`

    // Establish a connection
    +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    +0 bind(3, ..., ...) = 0
    +0 listen(3, 1) = 0

    +.1 < S 0:0(0) win 32792
    +0 > S. 0:0(0) ack 1
    +.01 < . 1:1(0) ack 1 win 257
    +0 accept(3, ..., ...) = 4

    // Send 8 data segments
    +0 write(4, ..., 8000) = 8000
    +0 > P. 1:8001(8000) ack 1

    // Enter recovery and 1:3001 is marked lost
    +.01 < . 1:1(0) ack 1 win 257
    +0 < . 1:1(0) ack 1 win 257
    +0 < . 1:1(0) ack 1 win 257

    // Retransmit 1:1001, now retransmit_skb_hint points to 1001:2001
    +0 > . 1:1001(1000) ack 1

    // 1001:2001 was ACKed causing retransmit_skb_hint to be set to NULL
    +.01 < . 1:1(0) ack 2001 win 257
    // Now retransmit_skb_hint points to 4001:5001 which is now marked lost

    // BUG: 2001:3001 was not retransmitted
    +0 > . 2001:3001(1000) ack 1

    Signed-off-by: Pengcheng Yang
    Acked-by: Neal Cardwell
    Tested-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Pengcheng Yang
     

12 Jan, 2020

1 commit


14 Oct, 2019

6 commits

  • For the sake of tcp_poll(), there are a few places where we fetch
    sk->sk_sndbuf while this field can change from IRQ or another CPU.

    We need to add READ_ONCE() annotations, and also make sure write
    sides use corresponding WRITE_ONCE() to avoid store-tearing.

    Note that other transports probably need similar fixes.
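
    As a rough user-space illustration of the annotation pattern, the macros
    below are simplified volatile-cast versions of READ_ONCE() and
    WRITE_ONCE(); the kernel's real definitions are more involved.
    struct fake_sock, set_sndbuf() and stream_wspace() are hypothetical, and
    __typeof__ is a GNU C extension (GCC/Clang).

```c
#include <assert.h>

/* Simplified stand-ins: force exactly one load or store of a scalar. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct fake_sock {
    int sk_sndbuf;
};

/* Writer side: a single store the compiler may not split or repeat. */
static void set_sndbuf(struct fake_sock *sk, int val)
{
    WRITE_ONCE(sk->sk_sndbuf, val);
}

/* Reader side: load once into a local, then use only the local copy. */
static int stream_wspace(struct fake_sock *sk, int queued)
{
    int sndbuf = READ_ONCE(sk->sk_sndbuf);

    assert(queued >= 0);
    return sndbuf > queued ? sndbuf - queued : 0;
}
```

    Reading into a local first means the comparison and the subtraction see
    the same value even if another context updates the field in between.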

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • For the sake of tcp_poll(), there are a few places where we fetch
    sk->sk_rcvbuf while this field can change from IRQ or another CPU.

    We need to add READ_ONCE() annotations, and also make sure write
    sides use corresponding WRITE_ONCE() to avoid store-tearing.

    Note that other transports probably need similar fixes.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • There are two places where we fetch tp->urg_seq while
    this field can change from IRQ or another CPU.

    We need to add READ_ONCE() annotations, and also make
    sure the write side uses the corresponding WRITE_ONCE() to avoid
    store-tearing.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • There are a few places where we fetch tp->copied_seq while
    this field can change from IRQ or another CPU.

    We need to add READ_ONCE() annotations, and also make
    sure write sides use corresponding WRITE_ONCE() to avoid
    store-tearing.

    Note that tcp_inq_hint() was already using READ_ONCE(tp->copied_seq)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • There are a few places where we fetch tp->rcv_nxt while
    this field can change from IRQ or another CPU.

    We need to add READ_ONCE() annotations, and also make
    sure write sides use corresponding WRITE_ONCE() to avoid
    store-tearing.

    Note that tcp_inq_hint() was already using READ_ONCE(tp->rcv_nxt)

    syzbot reported :

    BUG: KCSAN: data-race in tcp_poll / tcp_queue_rcv

    write to 0xffff888120425770 of 4 bytes by interrupt on cpu 0:
    tcp_rcv_nxt_update net/ipv4/tcp_input.c:3365 [inline]
    tcp_queue_rcv+0x180/0x380 net/ipv4/tcp_input.c:4638
    tcp_rcv_established+0xbf1/0xf50 net/ipv4/tcp_input.c:5616
    tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1542
    tcp_v4_rcv+0x1a03/0x1bf0 net/ipv4/tcp_ipv4.c:1923
    ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
    ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
    NF_HOOK include/linux/netfilter.h:305 [inline]
    NF_HOOK include/linux/netfilter.h:299 [inline]
    ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
    dst_input include/net/dst.h:442 [inline]
    ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
    NF_HOOK include/linux/netfilter.h:305 [inline]
    NF_HOOK include/linux/netfilter.h:299 [inline]
    ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
    __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
    __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
    netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
    napi_skb_finish net/core/dev.c:5671 [inline]
    napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
    receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061

    read to 0xffff888120425770 of 4 bytes by task 7254 on cpu 1:
    tcp_stream_is_readable net/ipv4/tcp.c:480 [inline]
    tcp_poll+0x204/0x6b0 net/ipv4/tcp.c:554
    sock_poll+0xed/0x250 net/socket.c:1256
    vfs_poll include/linux/poll.h:90 [inline]
    ep_item_poll.isra.0+0x90/0x190 fs/eventpoll.c:892
    ep_send_events_proc+0x113/0x5c0 fs/eventpoll.c:1749
    ep_scan_ready_list.constprop.0+0x189/0x500 fs/eventpoll.c:704
    ep_send_events fs/eventpoll.c:1793 [inline]
    ep_poll+0xe3/0x900 fs/eventpoll.c:1930
    do_epoll_wait+0x162/0x180 fs/eventpoll.c:2294
    __do_sys_epoll_pwait fs/eventpoll.c:2325 [inline]
    __se_sys_epoll_pwait fs/eventpoll.c:2311 [inline]
    __x64_sys_epoll_pwait+0xcd/0x170 fs/eventpoll.c:2311
    do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 7254 Comm: syz-fuzzer Not tainted 5.3.0+ #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Both tcp_v4_err() and tcp_v6_err() do the following operations
    while they do not own the socket lock :

    fastopen = tp->fastopen_rsk;
    snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una;

    The problem is that without an appropriate barrier, the compiler
    might reload tp->fastopen_rsk and trigger a NULL deref.

    Request sockets are protected by RCU, so we can simply add
    the missing annotations and barriers to solve the issue.

    Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

16 Sep, 2019

1 commit

  • For receive-heavy cases on the server-side, we want to track the
    connection quality for individual client IPs. This counter, similar to
    the existing system-wide TCPOFOQueue counter in /proc/net/netstat,
    tracks out-of-order packet reception. By providing this counter in
    TCP_INFO, it will allow understanding to what degree receive-heavy
    sockets are experiencing out-of-order delivery and packet drops
    indicating congestion.

    Please note that this is similar to the counter in NetBSD TCP_INFO, and
    has the same name.

    Also note that we avoid increasing the size of the tcp_sock struct by
    taking advantage of a hole.

    Signed-off-by: Thomas Higdon
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Thomas Higdon
     

15 Sep, 2019

1 commit


12 Sep, 2019

1 commit

  • Fix tcp_ecn_withdraw_cwr() to clear the correct bit:
    TCP_ECN_QUEUE_CWR.

    Rationale: basically, TCP_ECN_DEMAND_CWR is a bit that is purely about
    the behavior of data receivers, and deciding whether to reflect
    incoming IP ECN CE marks as outgoing TCP th->ece marks. The
    TCP_ECN_QUEUE_CWR bit is purely about the behavior of data senders,
    and deciding whether to send CWR. The tcp_ecn_withdraw_cwr() function
    is only called from tcp_undo_cwnd_reduction() by data senders during
    an undo, so it should zero the sender-side state,
    TCP_ECN_QUEUE_CWR. It does not make sense to stop the reflection of
    incoming CE bits on incoming data packets just because outgoing
    packets were spuriously retransmitted.

    The bug has been reproduced with packetdrill to manifest in a scenario
    with RFC3168 ECN, with an incoming data packet with CE bit set and
    carrying a TCP timestamp value that causes cwnd undo. Before this fix,
    the IP CE bit was ignored and not reflected in the TCP ECE header bit,
    and sender sent a TCP CWR ('W') bit on the next outgoing data packet,
    even though the cwnd reduction had been undone. After this fix, the
    sender properly reflects the CE bit and does not set the W bit.
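
    A minimal sketch of the corrected bit handling follows. The flag values
    are assumed to mirror the kernel header, and struct fake_tp and
    ecn_withdraw_cwr() are hypothetical user-space stand-ins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values assumed to match include/net/tcp.h. */
#define TCP_ECN_OK          1
#define TCP_ECN_QUEUE_CWR   2   /* sender side: a CWR is queued to be sent */
#define TCP_ECN_DEMAND_CWR  4   /* receiver side: keep echoing ECE */

struct fake_tp {
    uint8_t ecn_flags;
};

/* Sender-side undo: drop the queued CWR, leave receiver-side state alone. */
static void ecn_withdraw_cwr(struct fake_tp *tp)
{
    assert(tp != NULL);
    tp->ecn_flags &= (uint8_t)~TCP_ECN_QUEUE_CWR;
}
```

    Clearing TCP_ECN_DEMAND_CWR here instead would stop the receiver-side
    echo, which is the behavior the rationale above argues against.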

    Note: the bug actually predates 2005 git history; this Fixes footer is
    chosen to be the oldest SHA1 I have tested (from Sep 2007) for which
    the patch applies cleanly (since before this commit the code was in a
    .h file).

    Fixes: bdf1ee5d3bd3 ("[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it")
    Signed-off-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Cc: Eric Dumazet
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     

31 Jul, 2019

2 commits


03 Jul, 2019

1 commit


18 Jun, 2019

1 commit


16 Jun, 2019

1 commit

  • Jonathan Looney reported that TCP can trigger the following crash
    in tcp_shifted_skb() :

    BUG_ON(tcp_skb_pcount(skb) < pcount);

    This can happen if the remote peer has advertised the smallest
    MSS that Linux TCP accepts: 48

    An skb can hold 17 fragments, and each fragment can hold 32KB
    on x86, or 64KB on PowerPC.

    This means that the 16-bit width of TCP_SKB_CB(skb)->tcp_gso_segs
    can overflow.

    Note that tcp_sendmsg() builds skbs with less than 64KB
    of payload, so this problem needs SACK to be enabled.
    SACK blocks allow TCP to coalesce multiple skbs in the retransmit
    queue, thus filling the 17 fragments to maximal capacity.

    CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs
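
    The arithmetic behind the overflow can be checked with a short sketch.
    The fragment count and fragment size come from the text above; the
    8-byte per-segment payload used in the usage note (an MSS of 48 less up
    to 40 bytes of TCP options) is an assumption that the commit message does
    not state. seg_count() and seg_count_u16() are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_FRAGS   17u             /* fragments per skb, from the text */
#define FRAG_BYTES  (32u * 1024u)   /* bytes per fragment on x86, from the text */

/* Number of segments needed to carry `bytes` at `payload` bytes each. */
static uint32_t seg_count(uint32_t bytes, uint32_t payload)
{
    assert(payload > 0);
    return (bytes + payload - 1) / payload;
}

/* What a 16-bit counter would hold after storing that count. */
static uint16_t seg_count_u16(uint32_t bytes, uint32_t payload)
{
    return (uint16_t)seg_count(bytes, payload);
}
```

    With 17 * 32768 = 557056 bytes and an assumed 8 bytes per segment,
    seg_count() gives 69632, which does not fit in 16 bits and wraps to 4096
    in seg_count_u16().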

    Fixes: 832d11c5cd07 ("tcp: Try to restore large SKBs while SACK processing")
    Signed-off-by: Eric Dumazet
    Reported-by: Jonathan Looney
    Acked-by: Neal Cardwell
    Reviewed-by: Tyler Hicks
    Cc: Yuchung Cheng
    Cc: Bruce Curtis
    Cc: Jonathan Lemon
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 Jun, 2019

1 commit


10 Jun, 2019

1 commit

  • Commit 794200d66273 ("tcp: undo cwnd on Fast Open spurious SYNACK
    retransmit") may cause tcp_fastretrans_alert() to warn about pending
    retransmission in Open state. This is triggered when the Fast Open
    server both sends data and has spurious SYNACK retransmission during
    the handshake, and the data packets were lost or reordered.

    The root cause is a bit complicated:

    (1) Upon receiving SYN-data: a full socket is created with
    snd_una = ISN + 1 by tcp_create_openreq_child()

    (2) On SYNACK timeout the server/sender enters CA_Loss state.

    (3) Upon receiving the final ACK to complete the handshake, sender
    does not mark FLAG_SND_UNA_ADVANCED since (1)

    Sender then calls tcp_process_loss since state is CA_loss by (2)

    (4) tcp_process_loss() does not invoke undo operations but instead
    marks REXMIT_LOST to force retransmission

    (5) tcp_rcv_synrecv_state_fastopen() calls tcp_try_undo_loss(). It
    changes state to CA_Open but has positive tp->retrans_out

    (6) Next ACK triggers the WARN_ON in tcp_fastretrans_alert()

    The step that goes wrong is (4), where the undo operation should
    have been invoked because the ACK successfully acknowledged the
    SYN sequence. This patch fixes that by specifically checking undo
    when the SYN-ACK sequence is acknowledged. Then after
    tcp_process_loss() the state would be further adjusted
    in tcp_fastretrans_alert() to avoid triggering the warning in (6).

    Fixes: 794200d66273 ("tcp: undo cwnd on Fast Open spurious SYNACK retransmit")
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

31 May, 2019

1 commit

  • The TCP option parsing routines in tcp_parse_options function could
    read one byte out of the buffer of the TCP options.

    1 while (length > 0) {
    2 int opcode = *ptr++;
    3 int opsize;
    4
    5 switch (opcode) {
    6 case TCPOPT_EOL:
    7 return;
    8 case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */
    9 length--;
    10 continue;
    11 default:
    12 opsize = *ptr++; //out of bound access

    If length = 1, the first access happens at line 2, and a second
    access then happens at line 12, one byte past the end of the
    options buffer.

    Therefore, the patch checks that the available data length is
    large enough to parse both the TCP option code and size.
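
    A user-space sketch of the bounds check is shown below. It is not the
    kernel function: count_tcp_options() is a hypothetical helper that only
    walks the buffer, and the extra checks on opsize are ordinary defensive
    parsing added for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TCPOPT_EOL 0
#define TCPOPT_NOP 1

/*
 * Walk a TCP option buffer and return the number of well-formed
 * kind/length options seen. Stops at the first malformed option.
 */
static int count_tcp_options(const uint8_t *ptr, size_t length)
{
    int count = 0;

    assert(ptr != NULL || length == 0);
    while (length > 0) {
        uint8_t opcode = *ptr++;
        uint8_t opsize;

        switch (opcode) {
        case TCPOPT_EOL:
            return count;
        case TCPOPT_NOP:
            length--;
            continue;
        default:
            if (length < 2)         /* no room left for the size byte */
                return count;
            opsize = *ptr++;
            if (opsize < 2)         /* size covers kind + size, so >= 2 */
                return count;
            if (opsize > length)    /* option claims more than remains */
                return count;
            count++;
            ptr += opsize - 2;
            length -= opsize;
        }
    }
    return count;
}
```

    The `length < 2` test is the step that corresponds to the fix: the size
    byte is only read when at least two bytes of option data remain.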

    Signed-off-by: Young Xiao
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Young Xiao
     

15 May, 2019

1 commit

  • Commit c7d13c8faa74 ("tcp: properly track retry time on
    passive Fast Open") sets the start of SYNACK retransmission
    time on passive Fast Open in "retrans_stamp". However the
    timestamp is not reset once the handshake has completed. As a
    result, future data packet retransmission may not update it in
    tcp_retransmit_skb(). This may lead to the socket being aborted
    unexpectedly early by retransmits_timed_out() since retrans_stamp
    remains the SYNACK rtx time.

    This bug only manifests on a passive TFO sender that a) suffered a
    SYNACK timeout and then b) stalls on the very first loss recovery. Any
    successful loss recovery would reset the timestamp to avoid this
    issue.

    Fixes: c7d13c8faa74 ("tcp: properly track retry time on passive Fast Open")
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

10 May, 2019

1 commit

  • User space can flip the clean_acked_data_enabled static branch
    on and off with TLS offload when CONFIG_TLS_DEVICE is enabled.
    jump_label.h suggests we use the delayed version in this case.

    Deferred branches now also don't take the branch mutex on
    decrement, so we avoid potential locking issues.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

01 May, 2019

7 commits

  • Relocate the congestion window initialization from tcp_init_metrics()
    to tcp_init_transfer() to improve code readability.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Use a helper to consolidate two identical code blocks for passive TFO.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch makes passive Fast Open revert the cwnd to the default
    initial cwnd (10 packets) if the SYNACK timeout is spurious.

    Passive Fast Open uses a full socket during handshake so it can
    use the existing undo logic to detect spurious retransmission
    by recording the first SYNACK timeout in key state variable
    retrans_stamp. Upon receiving the ACK of the SYNACK, if the socket
    has sent some data before the timeout, the spurious timeout
    is detected by tcp_try_undo_recovery() in tcp_process_loss()
    in tcp_ack().

    But if the socket has not sent any data yet, tcp_ack() does not
    execute the undo code since no data is acknowledged. The fix is to
    check such a case explicitly after tcp_ack() during the ACK processing
    in SYN_RECV state. In addition this is checked in FIN_WAIT_1 state
    in case the server closes the socket before the handshake completes.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Linux implements RFC6298 and uses an initial congestion window
    of 1 upon establishing the connection if the SYNACK packet is
    retransmitted 2 or more times. In cellular networks SYNACK timeouts
    are often spurious if the wireless radio was dormant or idle. Also,
    some network paths have an RTT longer than the default SYNACK timeout.
    In both cases, falsely starting with a minimal cwnd is detrimental
    to performance.

    This patch avoids doing so when the final ACK's TCP timestamp
    indicates the original SYNACK was delivered. It remembers the
    original SYNACK timestamp when SYNACK timeout has occurred and
    re-uses the function to detect spurious SYN timeout conveniently.

    Note that a server may receive multiple SYNs and immediately
    retransmit SYNACKs without any SYNACK timeout. This often happens
    when the client SYNs have timed out due to the wireless delay
    above. In this case the server will still use the default
    initial congestion window (e.g. 10) because tp->undo_marker is reset in
    tcp_init_metrics(). This is an intentional design because packets
    are not lost but delayed.

    This patch only covers regular TCP passive open. Fast Open is
    supported in the next patch.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Detecting spurious SYNACK timeout using timestamp option requires
    recording the exact SYNACK skb timestamp. Previously the SYNACK
    sent timestamp was stamped slightly earlier before the skb
    was transmitted. This patch uses the SYNACK skb transmission
    timestamp directly.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Linux implements RFC6298 and uses an initial congestion window of 1
    upon establishing the connection if the SYN packet is retransmitted 2
    or more times. In cellular networks SYN timeouts are often spurious
    if the wireless radio was dormant or idle. Also, some network paths
    have an RTT longer than the default SYN timeout. Having a minimal cwnd
    in both cases is detrimental to TCP startup performance.

    This patch extends the TCP undo feature (RFC3522 aka TCP Eifel) to detect
    spurious SYN timeouts via TCP timestamps. Since tp->retrans_stamp
    records the initial SYN timestamp instead of the first retransmission,
    we have to implement separate undo code. The detection
    also must happen before tcp_ack() as retrans_stamp is reset when
    the SYN is acknowledged.

    Note this patch covers both active regular and fast open.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Previously, if an active TCP open had a SYN timeout, it always undid the
    cwnd upon receiving the SYNACK. This is because tcp_clean_rtx_queue
    would reset tp->retrans_stamp when the SYN is acked, which then fools
    tcp_try_undo_loss and tcp_packet_delayed. Addressing this issue is
    required to properly support undo for spurious SYN timeout.

    Fixing this is tricky -- for active TCP open tp->retrans_stamp
    records the time when the handshake starts, not the first
    retransmission time as the name may suggest. The simplest fix is
    for tcp_packet_delayed to ensure it is valid before comparing with
    other timestamp.

    One side effect of this change affects active TCP Fast Open that
    incurred a SYN timeout. Upon receiving a SYN-ACK that only acknowledged
    the SYN, it would immediately retransmit unacknowledged data in
    tcp_ack() because the data is marked lost after the SYN timeout. But
    the retransmission would have an incorrect ack sequence number since
    rcv_nxt has not been updated yet in tcp_rcv_synsent_state_process(), so
    the retransmission needs to be properly handled by
    tcp_rcv_fastopen_synack() like before.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

18 Apr, 2019

1 commit


17 Apr, 2019

1 commit

  • For some reason, tcp_grow_window() correctly tests if enough room
    is present before attempting to increase tp->rcv_ssthresh,
    but does not prevent it from growing past tcp_space().

    This is causing hard-to-debug issues, like failing
    the (__tcp_select_window(sk) >= tp->rcv_wnd) test
    in __tcp_ack_snd_check(), causing ACK delays and possibly
    slow flows.

    Depending on tcp_rmem[2], MTU, skb->len/skb->truesize ratio,
    we can see the problem happening on "netperf -t TCP_RR -- -r 2000,2000"
    after about 60 round trips, when the active side no longer sends
    immediate acks.
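
    The intended bound can be sketched as a pure function. This is an
    illustration under assumptions, not the kernel code:
    grow_rcv_ssthresh() and its parameters are hypothetical, with `space`
    standing in for the value tcp_space() would return.

```c
#include <assert.h>

/*
 * Return the new rcv_ssthresh after trying to grow it by `incr`,
 * never exceeding the smaller of window_clamp and the free space.
 */
static int grow_rcv_ssthresh(int rcv_ssthresh, int incr,
                             int window_clamp, int space)
{
    int limit = window_clamp < space ? window_clamp : space;
    int room = limit - rcv_ssthresh;

    assert(incr >= 0);
    if (room <= 0)
        return rcv_ssthresh;    /* already at or past the limit */
    return rcv_ssthresh + (incr < room ? incr : room);
}
```

    For example, growing 1000 by 500 with a clamp of 4000 and 1200 bytes of
    space yields 1200 rather than 1500.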

    This bug predates git history.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Acked-by: Wei Wang
    Signed-off-by: David S. Miller

    Eric Dumazet
     

05 Apr, 2019

1 commit

  • Linux currently disables ECN for incoming connections when the SYN
    requests ECN and the IP header has ECT(0)/ECT(1) set, as some
    networks were reportedly mangling the ToS byte, hence could later
    trigger false congestion notifications.

    RFC8311 §4.3 relaxes RFC3168's requirements such that ECT can be set
    on TCP control packets (including SYNs). The main benefit of this
    is the decreased probability of losing a SYN in a congested
    ECN-capable network (i.e., it avoids the initial 1s timeout).
    Additionally, this allows the development of newer TCP extensions,
    such as AccECN.

    This patch relaxes the previous check, by enabling ECN on incoming
    connections using SYN+ECT if at least one bit of the reserved flags
    of the TCP header is set. Such a bit would indicate that the sender of
    the SYN is using a newer TCP feature than what the host implements,
    such as AccECN, and is thus implementing RFC8311. This enables
    end-hosts not supporting such extensions to still negotiate ECN, and
    to have some of the benefits of using ECN on control packets.
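
    The relaxed decision can be sketched as a boolean function. This is a
    simplified illustration: syn_ecn_ok() and its parameters are
    hypothetical, sysctl and route settings are ignored, and the reserved
    flags are modeled as a 4-bit field.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether to negotiate ECN for an incoming SYN.
 *   ece, cwr: TCP header flags; both set means the SYN requests ECN
 *   ect:      the IP header carries ECT(0) or ECT(1)
 *   res1:     reserved TCP header flag bits (4-bit field in this sketch)
 */
static bool syn_ecn_ok(bool ece, bool cwr, bool ect, unsigned int res1)
{
    assert(res1 <= 0xF);
    if (!(ece && cwr))
        return false;           /* the SYN did not request ECN */
    /* Previously: ECT on a SYN always disabled ECN. Now a set reserved
     * bit is taken as a sign that the ECT marking is deliberate. */
    return !ect || res1 != 0;
}
```

    A SYN with ECT and no reserved bit set is still refused ECN, which keeps
    the original protection against networks that mangle the ToS byte.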

    Signed-off-by: Olivier Tilmans
    Suggested-by: Bob Briscoe
    Cc: Koen De Schepper
    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Tilmans, Olivier (Nokia - BE/Antwerp)
     

20 Mar, 2019

1 commit

  • Since the request socket is created locally, it'd make more sense to
    use reqsk_free() instead of reqsk_put() in TFO and syncookies' error
    path.

    However, tcp_get_cookie_sock() may set ->rsk_refcnt before freeing the
    socket; tcp_conn_request() may also have non-null ->rsk_refcnt because
    of tcp_try_fastopen(). In both cases 'req' hasn't been exposed
    to the outside world and is safe to free immediately, but that'd
    trigger the WARN_ON_ONCE in reqsk_free().

    Define __reqsk_free() for these situations where we know nobody's
    referencing the socket, even though ->rsk_refcnt might be non-null.
    Now we can consolidate the error path of tcp_get_cookie_sock() and
    tcp_conn_request().

    Signed-off-by: Guillaume Nault
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Guillaume Nault
     

09 Mar, 2019

1 commit

  • Commit 7716682cc58e ("tcp/dccp: fix another race at listener
    dismantle") let inet_csk_reqsk_queue_add() fail, and adjusted
    {tcp,dccp}_check_req() accordingly. However, TFO and syncookies
    weren't modified, thus leaking allocated resources on error.

    Contrary to tcp_check_req(), in both syncookies and TFO cases,
    we need to drop the request socket. Also, since the child socket is
    created with inet_csk_clone_lock(), we have to unlock it and drop an
    extra reference (->sk_refcount is initially set to 2 and
    inet_csk_reqsk_queue_add() drops only one ref).

    For TFO, we also need to revert the work done by tcp_try_fastopen()
    (with reqsk_fastopen_remove()).

    Fixes: 7716682cc58e ("tcp/dccp: fix another race at listener dismantle")
    Signed-off-by: Guillaume Nault
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Guillaume Nault
     

26 Feb, 2019

2 commits


28 Jan, 2019

1 commit

  • Instead of using pingpong as a single bit of information, we refactor
    the code to treat it as a counter. When an interactive session is
    detected, we set the pingpong count to TCP_PINGPONG_THRESH. And when the
    pingpong count is >= TCP_PINGPONG_THRESH, we consider the session to be
    in pingpong mode.

    This patch is a pure refactor and lays the foundation for the next patch.
    This patch itself does not change any pingpong logic.
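
    The counter-based scheme can be sketched with three small helpers. The
    helper and struct names are hypothetical, and the threshold value of 3
    is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TCP_PINGPONG_THRESH 3   /* assumed value, for illustration only */

struct fake_ack_state {
    uint8_t pingpong;           /* counter rather than a single bit */
};

/* An interactive session was detected: jump straight to the threshold. */
static void enter_pingpong_mode(struct fake_ack_state *s)
{
    assert(s != NULL);
    s->pingpong = TCP_PINGPONG_THRESH;
}

/* Leave pingpong mode by resetting the counter. */
static void exit_pingpong_mode(struct fake_ack_state *s)
{
    assert(s != NULL);
    s->pingpong = 0;
}

/* The session counts as pingpong once the counter reaches the threshold. */
static bool in_pingpong_mode(const struct fake_ack_state *s)
{
    return s->pingpong >= TCP_PINGPONG_THRESH;
}
```

    Because entering sets the counter directly to the threshold, these
    helpers behave exactly like the old single bit, matching the claim that
    the refactor changes no logic.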

    Signed-off-by: Wei Wang
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Wei Wang
     

01 Dec, 2018

1 commit

  • Neal pointed out that non-SACK flows might suffer from ACK compression
    added in the following patch ("tcp: implement coalescing on backlog queue")

    Instead of tweaking tcp_add_backlog() we can take into
    account how many ACKs were coalesced; this information
    will be available in skb_shinfo(skb)->gso_segs

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

29 Nov, 2018

1 commit


28 Nov, 2018

1 commit