29 Jan, 2020

1 commit

  • [ Upstream commit 2bec445f9bf35e52e395b971df48d3e1e5dc704a ]

    Latest commit 853697504de0 ("tcp: Fix highest_sack and highest_sack_seq")
    apparently allowed syzbot to trigger various crashes in the TCP stack [1].

    I believe this commit only made it easier for syzbot to find
    its way into triggering use-after-frees. But really, the bugs
    could lead to bad TCP behavior or even plain crashes, even for
    non-malicious peers.

    I have audited all calls to tcp_rtx_queue_unlink() and
    tcp_rtx_queue_unlink_and_free() and made sure tp->highest_sack would be
    updated if we are removing from the rtx queue the skb that
    tp->highest_sack points to.

    These updates were missing in three locations :

    1) tcp_clean_rtx_queue() [This one seems quite serious,
    I have no idea why this was not caught earlier]

    2) tcp_rtx_queue_purge() [Probably not a big deal for normal operations]

    3) tcp_send_synack() [Probably not a big deal for normal operations]
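    A minimal sketch of the kind of guard these call sites need (illustrative
    only; the exact fix differs per call site, and whether the pointer is
    cleared or advanced depends on the caller):

    /* Before unlinking @skb from the rtx queue, make sure tp->highest_sack
     * does not keep pointing at memory that is about to be freed.
     */
    if (unlikely(tp->highest_sack == skb))
            tp->highest_sack = NULL;        /* or the next skb, per call site */
    tcp_rtx_queue_unlink(skb, sk);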

    [1]
    BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1864 [inline]
    BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1856 [inline]
    BUG: KASAN: use-after-free in tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891
    Read of size 4 at addr ffff8880a488d068 by task ksoftirqd/1/16

    CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.5.0-rc5-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x197/0x210 lib/dump_stack.c:118
    print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
    __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506
    kasan_report+0x12/0x20 mm/kasan/common.c:639
    __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:134
    tcp_highest_sack_seq include/net/tcp.h:1864 [inline]
    tcp_highest_sack_seq include/net/tcp.h:1856 [inline]
    tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891
    tcp_try_undo_partial net/ipv4/tcp_input.c:2730 [inline]
    tcp_fastretrans_alert+0xf74/0x23f0 net/ipv4/tcp_input.c:2847
    tcp_ack+0x2577/0x5bf0 net/ipv4/tcp_input.c:3710
    tcp_rcv_established+0x6dd/0x1e90 net/ipv4/tcp_input.c:5706
    tcp_v4_do_rcv+0x619/0x8d0 net/ipv4/tcp_ipv4.c:1619
    tcp_v4_rcv+0x307f/0x3b40 net/ipv4/tcp_ipv4.c:2001
    ip_protocol_deliver_rcu+0x5a/0x880 net/ipv4/ip_input.c:204
    ip_local_deliver_finish+0x23b/0x380 net/ipv4/ip_input.c:231
    NF_HOOK include/linux/netfilter.h:307 [inline]
    NF_HOOK include/linux/netfilter.h:301 [inline]
    ip_local_deliver+0x1e9/0x520 net/ipv4/ip_input.c:252
    dst_input include/net/dst.h:442 [inline]
    ip_rcv_finish+0x1db/0x2f0 net/ipv4/ip_input.c:428
    NF_HOOK include/linux/netfilter.h:307 [inline]
    NF_HOOK include/linux/netfilter.h:301 [inline]
    ip_rcv+0xe8/0x3f0 net/ipv4/ip_input.c:538
    __netif_receive_skb_one_core+0x113/0x1a0 net/core/dev.c:5148
    __netif_receive_skb+0x2c/0x1d0 net/core/dev.c:5262
    process_backlog+0x206/0x750 net/core/dev.c:6093
    napi_poll net/core/dev.c:6530 [inline]
    net_rx_action+0x508/0x1120 net/core/dev.c:6598
    __do_softirq+0x262/0x98c kernel/softirq.c:292
    run_ksoftirqd kernel/softirq.c:603 [inline]
    run_ksoftirqd+0x8e/0x110 kernel/softirq.c:595
    smpboot_thread_fn+0x6a3/0xa40 kernel/smpboot.c:165
    kthread+0x361/0x430 kernel/kthread.c:255
    ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

    Allocated by task 10091:
    save_stack+0x23/0x90 mm/kasan/common.c:72
    set_track mm/kasan/common.c:80 [inline]
    __kasan_kmalloc mm/kasan/common.c:513 [inline]
    __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:486
    kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:521
    slab_post_alloc_hook mm/slab.h:584 [inline]
    slab_alloc_node mm/slab.c:3263 [inline]
    kmem_cache_alloc_node+0x138/0x740 mm/slab.c:3575
    __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:198
    alloc_skb_fclone include/linux/skbuff.h:1099 [inline]
    sk_stream_alloc_skb net/ipv4/tcp.c:875 [inline]
    sk_stream_alloc_skb+0x113/0xc90 net/ipv4/tcp.c:852
    tcp_sendmsg_locked+0xcf9/0x3470 net/ipv4/tcp.c:1282
    tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1432
    inet_sendmsg+0x9e/0xe0 net/ipv4/af_inet.c:807
    sock_sendmsg_nosec net/socket.c:652 [inline]
    sock_sendmsg+0xd7/0x130 net/socket.c:672
    __sys_sendto+0x262/0x380 net/socket.c:1998
    __do_sys_sendto net/socket.c:2010 [inline]
    __se_sys_sendto net/socket.c:2006 [inline]
    __x64_sys_sendto+0xe1/0x1a0 net/socket.c:2006
    do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Freed by task 10095:
    save_stack+0x23/0x90 mm/kasan/common.c:72
    set_track mm/kasan/common.c:80 [inline]
    kasan_set_free_info mm/kasan/common.c:335 [inline]
    __kasan_slab_free+0x102/0x150 mm/kasan/common.c:474
    kasan_slab_free+0xe/0x10 mm/kasan/common.c:483
    __cache_free mm/slab.c:3426 [inline]
    kmem_cache_free+0x86/0x320 mm/slab.c:3694
    kfree_skbmem+0x178/0x1c0 net/core/skbuff.c:645
    __kfree_skb+0x1e/0x30 net/core/skbuff.c:681
    sk_eat_skb include/net/sock.h:2453 [inline]
    tcp_recvmsg+0x1252/0x2930 net/ipv4/tcp.c:2166
    inet_recvmsg+0x136/0x610 net/ipv4/af_inet.c:838
    sock_recvmsg_nosec net/socket.c:886 [inline]
    sock_recvmsg net/socket.c:904 [inline]
    sock_recvmsg+0xce/0x110 net/socket.c:900
    __sys_recvfrom+0x1ff/0x350 net/socket.c:2055
    __do_sys_recvfrom net/socket.c:2073 [inline]
    __se_sys_recvfrom net/socket.c:2069 [inline]
    __x64_sys_recvfrom+0xe1/0x1a0 net/socket.c:2069
    do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    The buggy address belongs to the object at ffff8880a488d040
    which belongs to the cache skbuff_fclone_cache of size 456
    The buggy address is located 40 bytes inside of
    456-byte region [ffff8880a488d040, ffff8880a488d208)
    The buggy address belongs to the page:
    page:ffffea0002922340 refcount:1 mapcount:0 mapping:ffff88821b057000 index:0x0
    raw: 00fffe0000000200 ffffea00022a5788 ffffea0002624a48 ffff88821b057000
    raw: 0000000000000000 ffff8880a488d040 0000000100000006 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff8880a488cf00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff8880a488cf80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffff8880a488d000: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
    ^
    ffff8880a488d080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff8880a488d100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

    Fixes: 853697504de0 ("tcp: Fix highest_sack and highest_sack_seq")
    Fixes: 50895b9de1d3 ("tcp: highest_sack fix")
    Fixes: 737ff314563c ("tcp: use sequence distance to detect reordering")
    Signed-off-by: Eric Dumazet
    Cc: Cambda Zhu
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

09 Jan, 2020

1 commit

  • [ Upstream commit 7c68fa2bddda6d942bd387c9ba5b4300737fd991 ]

    sk->sk_pacing_shift can be read and written without lock
    synchronization. This patch adds annotations to
    document this fact and avoid future syzbot complaints.

    This might also avoid unexpected false sharing
    in sk_pacing_shift_update(), as the compiler
    could remove the conditional check and always
    write over sk->sk_pacing_shift :

    if (sk->sk_pacing_shift != val)
            sk->sk_pacing_shift = val;
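    A sketch of the annotated update, assuming the usual READ_ONCE()/WRITE_ONCE()
    pattern (illustrative, not necessarily the exact patch):

    static inline void sk_pacing_shift_update(struct sock *sk, int val)
    {
            if (READ_ONCE(sk->sk_pacing_shift) != val)
                    WRITE_ONCE(sk->sk_pacing_shift, val);
    }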

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Eric Dumazet
     

05 Jan, 2020

2 commits

  • [ Upstream commit 1f85e6267caca44b30c54711652b0726fadbb131 ]

    Backport of commit fdfc5c8594c2 ("tcp: remove empty skb from
    write queue in error cases") in linux-4.14 stable triggered
    various bugs. One of them has been fixed in commit ba2ddb43f270
    ("tcp: Don't dequeue SYN/FIN-segments from write-queue"), but
    we still see crashes on some occasions.

    The root cause is that when tcp_sendmsg() has allocated a fresh
    skb and could not append a fragment before being blocked
    in sk_stream_wait_memory(), tcp_write_xmit() might be called
    and decide to send this fresh and empty skb.

    Sending an empty packet is not only silly, it might have caused
    many issues we had in the past with tp->packets_out being
    out of sync.
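    A hedged sketch of the kind of guard this implies in tcp_write_xmit()
    (illustrative; the exact placement and condition may differ):

    /* An empty skb (seq == end_seq) means a thread is sleeping in
     * sendmsg()/sk_stream_wait_memory(); do not transmit it.
     */
    if (unlikely(TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq))
            break;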

    Fixes: c65f7f00c587 ("[TCP]: Simplify SKB data portion allocation with NETIF_F_SG.")
    Signed-off-by: Eric Dumazet
    Cc: Christoph Paasch
    Acked-by: Neal Cardwell
    Cc: Jason Baron
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 853697504de043ff0bfd815bd3a64de1dce73dc7 ]

    From commit 50895b9de1d3 ("tcp: highest_sack fix"), the logic about
    setting tp->highest_sack to the head of the send queue was removed.
    Of course the logic is error prone, but it is logical. Before we
    remove the pointer to the highest sack skb and use the seq instead,
    we need to set tp->highest_sack to NULL when there is no skb after
    the last sack, and then replace NULL with the real skb when new skb
    inserted into the rtx queue, because the NULL means the highest sack
    seq is tp->snd_nxt. If tp->highest_sack is NULL and new data is sent,
    the next ACK with the SACK option will increase tp->reordering unexpectedly.

    This patch sets tp->highest_sack to the tail of the rtx queue if
    it's NULL and new data is sent. The patch keeps the rule that the
    highest_sack can only be maintained by sack processing, except for
    this only case.
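    A sketch of the idea when new data is sent (illustrative; the real hook
    point is wherever the newly sent skb is appended to the rtx queue):

    /* NULL means "highest sacked seq is tp->snd_nxt"; anchor the pointer on
     * the newly sent skb so later SACK processing does not misread it.
     */
    if (tp->highest_sack == NULL)
            tp->highest_sack = skb;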

    Fixes: 50895b9de1d3 ("tcp: highest_sack fix")
    Signed-off-by: Cambda Zhu
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Cambda Zhu
     

18 Dec, 2019

1 commit

  • [ Upstream commit 9424e2e7ad93ffffa88f882c9bc5023570904b55 ]

    Back in 2008, Adam Langley fixed the corner case of packets for flows
    having all of the following options : MD5 TS SACK

    Since MD5 needs 20 bytes, and TS needs 12 bytes, no sack block
    can be cooked from the remaining 8 bytes.

    tcp_established_options() correctly sets opts->num_sack_blocks
    to zero, but returns 36 instead of 32.

    This means TCP cooks packets with 4 extra bytes at the end
    of options, containing uninitialized bytes.
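    A rough sketch of the accounting and of the kind of fix described (constant
    names such as TCPOLEN_SACK_BASE_ALIGNED/TCPOLEN_SACK_PERBLOCK are the usual
    TCP option sizes; the exact patch may be shaped differently):

    /* MD5 (20) + TS (12) = 32 bytes used, 8 bytes remain.
     * One SACK block needs 4 (aligned base) + 8 = 12 bytes, so it cannot fit:
     * only account for the SACK option when at least one block fits.
     */
    if (unlikely(remaining < TCPOLEN_SACK_BASE_ALIGNED + TCPOLEN_SACK_PERBLOCK))
            return size;    /* 32, not 36 */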

    Fixes: 33ad798c924b ("tcp: options clean up")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Acked-by: Neal Cardwell
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

14 Oct, 2019

5 commits

    For the sake of tcp_poll(), there are a few places where we fetch
    sk->sk_wmem_queued while this field can change from IRQ or another cpu.

    We need to add READ_ONCE() annotations, and also make sure write
    sides use corresponding WRITE_ONCE() to avoid store-tearing.

    A sk_wmem_queued_add() helper is added so that we can convert to
    ADD_ONCE() or an equivalent in the future, if/when available.
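    A sketch of what such a helper presumably looks like (assuming the usual
    WRITE_ONCE() annotation; not necessarily the exact patch):

    static inline void sk_wmem_queued_add(struct sock *sk, int val)
    {
            WRITE_ONCE(sk->sk_wmem_queued, sk->sk_wmem_queued + val);
    }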

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
    There are a few places where we fetch tp->snd_nxt while
    this field can change from IRQ or another cpu.

    We need to add READ_ONCE() annotations, and also make
    sure write sides use corresponding WRITE_ONCE() to avoid
    store-tearing.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
    There are a few places where we fetch tp->write_seq while
    this field can change from IRQ or another cpu.

    We need to add READ_ONCE() annotations, and also make
    sure write sides use corresponding WRITE_ONCE() to avoid
    store-tearing.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
    There are a few places where we fetch tp->copied_seq while
    this field can change from IRQ or another cpu.

    We need to add READ_ONCE() annotations, and also make
    sure write sides use corresponding WRITE_ONCE() to avoid
    store-tearing.

    Note that tcp_inq_hint() was already using READ_ONCE(tp->copied_seq)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Both tcp_v4_err() and tcp_v6_err() do the following operations
    while they do not own the socket lock :

    fastopen = tp->fastopen_rsk;
    snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una;

    The problem is that without an appropriate barrier, the compiler
    might reload tp->fastopen_rsk and trigger a NULL deref.

    Request sockets are protected by RCU, so we can simply add
    the missing annotations and barriers to solve the issue.
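    A hedged sketch of the annotated read (illustrative; the full fix also
    converts the field to an RCU-protected pointer on the write side):

    /* lockless read: pairs with rcu_assign_pointer() on the write side */
    fastopen = rcu_dereference(tp->fastopen_rsk);
    snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una;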

    Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Sep, 2019

1 commit

    When TCP sends a TSO packet, adding a PSH flag on it
    reduces the sojourn time of the GRO packet in GRO receivers.

    This is particularly the case under pressure, since RX queues
    receive packets for many concurrent flows.

    A sender can give a hint to GRO engines when it is
    appropriate to flush a super-packet, especially when pacing
    is in the picture, since the next packet is probably delayed by
    one ms.

    Having fewer packets in the GRO engine reduces the chance
    of LRU eviction or inflated RTT, and reduces GRO cost.

    We found recently that we must not set the PSH flag on
    individual full-size MSS segments [1] :

    Under pressure (CWR state), it is better to let the packet sit
    for a small delay (depending on NAPI logic) so that the
    ACK packet is delayed, and thus the next packet we send is
    also delayed a bit. Eventually the bottleneck queue can
    be drained. DCTCP flows with CWND=1 have demonstrated
    the issue.

    This patch allows slowing down the aggregate traffic without
    involving high-resolution timers on senders and/or
    receivers.

    It has been used at Google for about four years,
    and has been discussed at various networking conferences.

    [1] segments smaller than MSS already have PSH flag set
    by tcp_sendmsg() / tcp_mark_push(), unless MSG_MORE
    has been requested by the user.
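    A hedged sketch of the idea (not the exact patch; "last_skb_of_burst" is a
    hypothetical placeholder for "no more queued data will follow right away"):

    /* Give GRO receivers a flush hint, but not under CWR, where we prefer
     * the ACK (and thus our next send) to stay delayed.
     */
    if (inet_csk(sk)->icsk_ca_state != TCP_CA_CWR && last_skb_of_burst)
            TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;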

    Signed-off-by: Eric Dumazet
    Cc: Soheil Hassas Yeganeh
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Cc: Daniel Borkmann
    Cc: Tariq Toukan
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 Sep, 2019

1 commit


29 Aug, 2019

1 commit

  • TCP associates tx timestamp requests with a byte in the bytestream.
    If merging skbs in tcp_mtu_probe, migrate the tstamp request.

    Similar to MSG_EOR, do not allow moving a timestamp from any segment
    in the probe but the last. This is to avoid merging multiple timestamps.
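    A hedged sketch of the migration when tcp_mtu_probe() copies @skb into the
    probe @nskb (illustrative only; the same file already has a similar helper,
    tcp_skb_collapse_tstamp(), used when collapsing skbs):

    /* if the merged skb carried a tx timestamp request, move it to nskb */
    if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_ANY_TSTAMP))
            tcp_skb_collapse_tstamp(nskb, skb);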

    Tested with the packetdrill script at
    https://github.com/wdebruij/packetdrill/commits/mtu_probe-1

    Link: http://patchwork.ozlabs.org/patch/1143278/#2232897
    Fixes: 4ed2d765dfac ("net-timestamp: TCP timestamping")
    Signed-off-by: Willem de Bruijn
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

20 Aug, 2019

1 commit


09 Aug, 2019

1 commit

  • sk_validate_xmit_skb() and drivers depend on the sk member of
    struct sk_buff to identify segments requiring encryption.
    Any operation which removes or does not preserve the original TLS
    socket, such as skb_orphan() or skb_clone(), will cause clear text
    leaks.

    Make the TCP socket underlying an offloaded TLS connection
    mark all skbs as decrypted, if TLS TX is in offload mode.
    Then in sk_validate_xmit_skb() catch skbs which have no socket
    (or a socket with no validation) and decrypted flag set.
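    A hedged sketch of the new check (close to the idea described above; the
    exact shape inside sk_validate_xmit_skb() may differ):

    #ifdef CONFIG_TLS_DEVICE
            /* a decrypted skb that lost its validating socket must never
             * reach the device as clear text
             */
            if (unlikely(skb->decrypted && (!sk || !sk->sk_validate_xmit_skb))) {
                    kfree_skb(skb);
                    return NULL;
            }
    #endif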

    Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
    sk->sk_validate_xmit_skb are slightly interchangeable right now,
    they all imply TLS offload. The new checks are guarded by
    CONFIG_TLS_DEVICE because that's the option guarding the
    sk_buff->decrypted member.

    Second, smaller issue with orphaning is that it breaks
    the guarantee that packets will be delivered to device
    queues in-order. All TLS offload drivers depend on that
    scheduling property. This means skb_orphan_partial()'s
    trick of preserving partial socket references will cause
    issues in the drivers. We need a full orphan, and as a
    result netem delay/throttling will cause all TLS offload
    skbs to be dropped.

    Reusing the sk_buff->decrypted flag also protects from
    leaking clear text when an incoming, decrypted skb is redirected
    (e.g. by TC).

    See commit 0608c69c9a80 ("bpf: sk_msg, sock{map|hash} redirect
    through ULP") for justification why the internal flag is safe.
    The only location which could leak the flag in is tcp_bpf_sendmsg(),
    which is taken care of by clearing the previously unused bit.

    v2:
    - remove superfluous decrypted mark copy (Willem);
    - remove the stale doc entry (Boris);
    - rely entirely on EOR marking to prevent coalescing (Boris);
    - use an internal sendpages flag instead of marking the socket
    (Boris).
    v3 (Willem):
    - reorganize the can_skb_orphan_partial() condition;
    - fix the flag leak-in through tcp_bpf_sendmsg.

    Signed-off-by: Jakub Kicinski
    Acked-by: Willem de Bruijn
    Reviewed-by: Boris Pismenny
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

31 Jul, 2019

1 commit


22 Jul, 2019

1 commit

  • Some applications set tiny SO_SNDBUF values and expect
    TCP to just work. Recent patches to address CVE-2019-11478
    broke them in case of losses, since retransmits might
    be prevented.

    We should allow these flows to make progress.

    This patch allows the first and last skb in retransmit queue
    to be split even if memory limits are hit.

    It also adds some room, since tcp_sendmsg()
    and tcp_sendpage() might overshoot sk_wmem_queued by about one full
    TSO skb (64KB in size). Note this allowance was already present
    in stable backports for kernels < 4.15.

    Note for < 4.15 backports :
    tcp_rtx_queue_tail() will probably look like :

    static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
    {
            struct sk_buff *skb = tcp_send_head(sk);

            return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
    }

    Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits")
    Signed-off-by: Eric Dumazet
    Reported-by: Andrew Prout
    Tested-by: Andrew Prout
    Tested-by: Jonathan Lemon
    Tested-by: Michal Kubecek
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Acked-by: Christoph Paasch
    Cc: Jonathan Looney
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Jun, 2019

2 commits


18 Jun, 2019

1 commit


16 Jun, 2019

3 commits

  • Some TCP peers announce a very small MSS option in their SYN and/or
    SYN/ACK messages.

    This forces the stack to send packets with a very high network/cpu
    overhead.

    Linux has enforced a minimal value of 48. Since this value includes
    the size of TCP options, and the options can consume up to 40
    bytes, this means that each segment can include only 8 bytes of payload.

    In some cases, it can be useful to increase the minimal value
    to a saner value.

    We still leave the default at 48 (TCP_MIN_SND_MSS), for compatibility
    reasons.

    Note that the TCP_MAXSEG socket option enforces a minimal value
    of TCP_MIN_MSS. David Miller increased this minimal value
    in commit c39508d6f118 ("tcp: Make TCP_MAXSEG minimum more correct.")
    from 64 to 88.

    We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS.
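    For example, assuming the sysctl introduced by this change is
    net.ipv4.tcp_min_snd_mss, an administrator could raise the floor with
    something like: sysctl -w net.ipv4.tcp_min_snd_mss=536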

    CVE-2019-11479 -- tcp mss hardcoded to 48

    Signed-off-by: Eric Dumazet
    Suggested-by: Jonathan Looney
    Acked-by: Neal Cardwell
    Cc: Yuchung Cheng
    Cc: Tyler Hicks
    Cc: Bruce Curtis
    Cc: Jonathan Lemon
    Signed-off-by: David S. Miller

    Eric Dumazet
     
    Jonathan Looney reported that a malicious peer can force a sender
    to fragment its retransmit queue into tiny skbs, inflating memory
    usage and/or overflowing 32-bit counters.

    TCP allows an application to queue up to sk_sndbuf bytes,
    so we need to give some allowance for non-malicious splitting
    of the retransmit queue.

    A new SNMP counter is added to monitor how many times TCP
    did not allow splitting an skb because the allowance was exceeded.

    Note that this counter might increase in cases where applications
    use the SO_SNDBUF socket option to lower sk_sndbuf.

    CVE-2019-11478 : tcp_fragment, prevent fragmenting a packet when the
    socket is already using more than half the allowed space
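    A hedged sketch of the kind of limit check this adds to tcp_fragment()
    (the exact threshold and SNMP counter name are assumed from the
    description, not verified against the final patch):

    /* Memory allowance exceeded : refuse to split this skb and account
     * the refusal in the new SNMP counter.
     */
    if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
            NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
            return -ENOMEM;
    }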

    Signed-off-by: Eric Dumazet
    Reported-by: Jonathan Looney
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Reviewed-by: Tyler Hicks
    Cc: Bruce Curtis
    Cc: Jonathan Lemon
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Jonathan Looney reported that TCP can trigger the following crash
    in tcp_shifted_skb() :

    BUG_ON(tcp_skb_pcount(skb) < pcount);

    This can happen if the remote peer has advertised the smallest
    MSS that Linux TCP accepts : 48

    An skb can hold 17 fragments, and each fragment can hold 32KB
    on x86, or 64KB on PowerPC.

    This means that the 16-bit width of TCP_SKB_CB(skb)->tcp_gso_segs
    can overflow.

    Note that tcp_sendmsg() builds skbs with less than 64KB
    of payload, so this problem needs SACK to be enabled.
    SACK blocks allow TCP to coalesce multiple skbs in the retransmit
    queue, thus filling the 17 fragments to maximal capacity.
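    A rough worked example of the overflow (x86 numbers): 17 fragments of 32KB
    give 17 * 32768 = 557,056 bytes in one coalesced skb; with an MSS of 48 and
    up to 40 bytes of options, each segment carries only 8 bytes of payload, so
    tcp_gso_segs would need 557,056 / 8 = 69,632 segments, which exceeds the
    65,535 maximum of a 16-bit field.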

    CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs

    Fixes: 832d11c5cd07 ("tcp: Try to restore large SKBs while SACK processing")
    Signed-off-by: Eric Dumazet
    Reported-by: Jonathan Looney
    Acked-by: Neal Cardwell
    Reviewed-by: Tyler Hicks
    Cc: Yuchung Cheng
    Cc: Bruce Curtis
    Cc: Jonathan Lemon
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Jun, 2019

1 commit

    Adding delays to TCP flows is crucial for studying the behavior
    of TCP stacks, including congestion control modules.

    Linux offers the netem module, but it has impractical constraints :
    - Need root access to change the qdisc
    - Hard to set up on egress if combined with a non-trivial qdisc like FQ
    - Single delay for all flows.

    EDT (Earliest Departure Time) adoption in the TCP stack allows us
    to enable a per-socket delay at a very small cost.

    Networking tools can now establish thousands of flows, each of them
    with a different delay, simulating real world conditions.

    This requires the FQ packet scheduler or an EDT-enabled NIC.

    This patch adds a TCP_TX_DELAY socket option, to set a delay in
    usec units.

    unsigned int tx_delay = 10000; /* 10 msec */

    setsockopt(fd, SOL_TCP, TCP_TX_DELAY, &tx_delay, sizeof(tx_delay));

    Note that FQ packet scheduler limits might need some tweaking :

    man tc-fq

    PARAMETERS
    limit
    Hard limit on the real queue size. When this limit is
    reached, new packets are dropped. If the value is lowered,
    packets are dropped so that the new limit is met. Default
    is 10000 packets.

    flow_limit
    Hard limit on the maximum number of packets queued per
    flow. Default value is 100.

    Use of the TCP_TX_DELAY option will increase the number of skbs in the FQ
    qdisc, so packets will be dropped if either of the previous limits is hit.

    Use of a jump label makes this support runtime-free, for hosts
    never using the option.

    Also note that TSQ (TCP Small Queues) limits are slightly changed
    with this patch : we need to account for the fact that artificially
    delayed skbs won't stop us from providing more skbs to feed the pipe
    (netem uses skb_orphan_partial() for this purpose, but FQ can not use
    this trick).

    Because of that, using big delays might very well trigger
    old bugs in TSO auto defer logic and/or sndbuf limited detection.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

01 May, 2019

1 commit

  • Detecting spurious SYNACK timeout using timestamp option requires
    recording the exact SYNACK skb timestamp. Previously the SYNACK
    sent timestamp was stamped slightly earlier before the skb
    was transmitted. This patch uses the SYNACK skb transmission
    timestamp directly.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

07 Apr, 2019

1 commit

    The non-null check on tskb is always false because it is in an else
    path of a check on tskb and hence tskb is null in this code block.
    This check is therefore redundant and can be removed, as can the
    coalesce label.

    if (tskb) {
            ...
    } else {
            ...
            if (unlikely(!skb)) {
                    if (tskb)   /* can never be true, redundant code */
                            goto coalesce;
                    return;
            }
    }

    Addresses-Coverity: ("Logically dead code")
    Signed-off-by: Colin Ian King
    Reviewed-by: Mukesh Ojha
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Colin Ian King
     

24 Mar, 2019

1 commit


27 Feb, 2019

2 commits


25 Feb, 2019

1 commit

    Three conflicts, one of which, for marvell10g.c, is non-trivial and
    requires some follow-up from Heiner or someone else.

    The issue is that Heiner converted the marvell10g driver over to
    use the generic c45 code as much as possible.

    However, in 'net' a bug fix appeared which makes sure that a new
    local mask (MDIO_AN_10GBT_CTRL_ADV_NBT_MASK) with value 0x01e0
    is cleared.

    Signed-off-by: David S. Miller

    David S. Miller
     

24 Feb, 2019

1 commit

  • syzbot reported a WARN_ON(!tcp_skb_pcount(skb))
    in tcp_send_loss_probe() [1]

    This was caused by TCP_REPAIR-sent skbs that inadvertently
    were missing a call to tcp_init_tso_segs().

    [1]
    WARNING: CPU: 1 PID: 0 at net/ipv4/tcp_output.c:2534 tcp_send_loss_probe+0x771/0x8a0 net/ipv4/tcp_output.c:2534
    Kernel panic - not syncing: panic_on_warn set ...
    CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.0.0-rc7+ #77
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:

    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x172/0x1f0 lib/dump_stack.c:113
    panic+0x2cb/0x65c kernel/panic.c:214
    __warn.cold+0x20/0x45 kernel/panic.c:571
    report_bug+0x263/0x2b0 lib/bug.c:186
    fixup_bug arch/x86/kernel/traps.c:178 [inline]
    fixup_bug arch/x86/kernel/traps.c:173 [inline]
    do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:271
    do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:290
    invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
    RIP: 0010:tcp_send_loss_probe+0x771/0x8a0 net/ipv4/tcp_output.c:2534
    Code: 88 fc ff ff 4c 89 ef e8 ed 75 c8 fb e9 c8 fc ff ff e8 43 76 c8 fb e9 63 fd ff ff e8 d9 75 c8 fb e9 94 f9 ff ff e8 bf 03 91 fb 0b e9 7d fa ff ff e8 b3 03 91 fb 0f b6 1d 37 43 7a 03 31 ff 89
    RSP: 0018:ffff8880ae907c60 EFLAGS: 00010206
    RAX: ffff8880a989c340 RBX: 0000000000000000 RCX: ffffffff85dedbdb
    RDX: 0000000000000100 RSI: ffffffff85dee0b1 RDI: 0000000000000005
    RBP: ffff8880ae907c90 R08: ffff8880a989c340 R09: ffffed10147d1ae1
    R10: ffffed10147d1ae0 R11: ffff8880a3e8d703 R12: ffff888091b90040
    R13: ffff8880a3e8d540 R14: 0000000000008000 R15: ffff888091b90860
    tcp_write_timer_handler+0x5c0/0x8a0 net/ipv4/tcp_timer.c:583
    tcp_write_timer+0x10e/0x1d0 net/ipv4/tcp_timer.c:607
    call_timer_fn+0x190/0x720 kernel/time/timer.c:1325
    expire_timers kernel/time/timer.c:1362 [inline]
    __run_timers kernel/time/timer.c:1681 [inline]
    __run_timers kernel/time/timer.c:1649 [inline]
    run_timer_softirq+0x652/0x1700 kernel/time/timer.c:1694
    __do_softirq+0x266/0x95a kernel/softirq.c:292
    invoke_softirq kernel/softirq.c:373 [inline]
    irq_exit+0x180/0x1d0 kernel/softirq.c:413
    exiting_irq arch/x86/include/asm/apic.h:536 [inline]
    smp_apic_timer_interrupt+0x14a/0x570 arch/x86/kernel/apic/apic.c:1062
    apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:807

    RIP: 0010:native_safe_halt+0x2/0x10 arch/x86/include/asm/irqflags.h:58
    Code: ff ff ff 48 89 c7 48 89 45 d8 e8 59 0c a1 fa 48 8b 45 d8 e9 ce fe ff ff 48 89 df e8 48 0c a1 fa eb 82 90 90 90 90 90 90 fb f4 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 f4 c3 90 90 90 90 90 90
    RSP: 0018:ffff8880a98afd78 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
    RAX: 1ffffffff1125061 RBX: ffff8880a989c340 RCX: 0000000000000000
    RDX: dffffc0000000000 RSI: 0000000000000001 RDI: ffff8880a989cbbc
    RBP: ffff8880a98afda8 R08: ffff8880a989c340 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
    R13: ffffffff889282f8 R14: 0000000000000001 R15: 0000000000000000
    arch_cpu_idle+0x10/0x20 arch/x86/kernel/process.c:555
    default_idle_call+0x36/0x90 kernel/sched/idle.c:93
    cpuidle_idle_call kernel/sched/idle.c:153 [inline]
    do_idle+0x386/0x570 kernel/sched/idle.c:262
    cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:353
    start_secondary+0x404/0x5c0 arch/x86/kernel/smpboot.c:271
    secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:243
    Kernel Offset: disabled
    Rebooting in 86400 seconds..

    Fixes: 79861919b889 ("tcp: fix TCP_REPAIR xmit queue setup")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Cc: Andrey Vagin
    Cc: Soheil Hassas Yeganeh
    Cc: Neal Cardwell
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Jan, 2019

2 commits

    In order to be more confident about an on-going interactive session, we
    increment the pingpong count by 1 for every interactive transaction and we
    adjust TCP_PINGPONG_THRESH to 3.
    This means we only consider a session to be in pingpong mode after we see
    3 interactive transactions, and only then start to activate delayed acks
    instead of quick ack mode.
    And in order to not over-count the credits, we only increase the pingpong
    count for the first packet sent in response to the previously received
    packet.
    This is mainly to prevent delaying acks right after some handshake
    protocol exchange when no real interactive traffic pattern follows.

    Signed-off-by: Wei Wang
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Wei Wang
     
    Instead of using pingpong as a single bit of information, we refactor the
    code to treat it as a counter. When an interactive session is detected,
    we set the pingpong count to TCP_PINGPONG_THRESH. And when the pingpong
    count is >= TCP_PINGPONG_THRESH, we consider the session to be in pingpong mode.

    This patch is a pure refactor and sets foundation for the next patch.
    This patch itself does not change any pingpong logic.
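    A hedged sketch of the counter-style helper this refactor suggests (the
    helper name, field layout and threshold value are assumptions based on the
    description above; the companion patch above raises the threshold to 3):

    #define TCP_PINGPONG_THRESH     1

    static inline bool tcp_in_pingpong_mode(const struct sock *sk)
    {
            return inet_csk(sk)->icsk_ack.pingpong >= TCP_PINGPONG_THRESH;
    }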

    Signed-off-by: Wei Wang
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Wei Wang
     

26 Jan, 2019

1 commit

  • Accept MSG_ZEROCOPY in all the TCP states that allow sendmsg. Remove
    the explicit check for ESTABLISHED and CLOSE_WAIT states.

    This requires correctly handling zerocopy state (uarg, sk_zckey) in
    all paths reachable from other TCP states, such as the EPIPE case
    in sk_stream_wait_connect, which a sendmsg() in an incorrect state will
    now hit. Most paths are already safe.

    The only extension needed is for TCP Fastopen active open. This can build
    an skb with data in tcp_send_syn_data. Pass the uarg along with other
    fastopen state, so that this skb also generates a zerocopy
    notification on release.

    Tested with active and passive tcp fastopen packetdrill scripts at
    https://github.com/wdebruij/packetdrill/commit/1747eef03d25a2404e8132817d0f1244fd6f129d

    Signed-off-by: Willem de Bruijn
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

18 Jan, 2019

3 commits

    Previously, when the sender fails to send an (original) data packet or
    window probes due to congestion in the local host (e.g. throttling
    in the qdisc), it'll retry within an RTO or two up to 500ms.

    In low-RTT networks such as data-centers, RTO is often far below
    the default minimum 200ms. Then local host congestion could trigger
    a retry storm pouring gas to the fire. Worse yet, the probe counter
    (icsk_probes_out) is not properly updated so the aggressive retry
    may exceed the system limit (15 rounds) until the packet finally
    slips through.

    On such rare events, it's wise to retry more conservatively
    (500ms) and update the stats properly to reflect these incidents
    and follow the system limit. Note that this is consistent with
    the behaviors when a keep-alive probe or RTO retry is dropped
    due to local congestion.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Reviewed-by: Neal Cardwell
    Reviewed-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Previously TCP socket's retrans_stamp is not set if the
    retransmission has failed to send. As a result if a socket is
    experiencing local issues to retransmit packets, determining when
    to abort a socket is complicated w/o knowning the starting time of
    the recovery since retrans_stamp may remain zero.

    This complication causes the sub-optimal behavior that TCP may use the
    latest, instead of the first, retransmission time to compute the
    elapsed time of a connection stalling due to local issues. Then TCP
    may disregard TCP retries settings and keep retrying until it finally
    succeeds: not a good idea when the local host is already strained.

    The simple fix is to always timestamp the start of a recovery.
    It's worth noting that retrans_stamp is also used to compare echo
    timestamp values to detect spurious recovery. This patch does
    not break that because retrans_stamp is still later than when the
    original packet was sent.
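    A hedged sketch of the idea (illustrative; the exact hook point differs):

    /* always record the start of the recovery, even if the retransmit
     * itself could not be sent due to local issues
     */
    if (!tp->retrans_stamp)
            tp->retrans_stamp = tcp_skb_timestamp(skb);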

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Reviewed-by: Neal Cardwell
    Reviewed-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
    Previously, TCP skbs were not always timestamped if the transmission
    failed due to memory or other local issues. This makes deciding
    when to abort a socket tricky and complicated because the first
    unacknowledged skb's timestamp may be 0 on TCP timeout.

    The straight-forward fix is to always timestamp skb on every
    transmission attempt. Also every skb retransmission needs to be
    flagged properly to avoid RTT under-estimation. This can happen
    upon receiving an ACK for the original packet when a previous
    (spurious) retransmission has failed.

    It's worth noting that this reverts to the old time-stamping
    style before commit 8c72c65b426b ("tcp: update skb->skb_mstamp more
    carefully") which addresses a problem in computing the elapsed time
    of a stalled window-probing socket. The problem will be addressed
    differently in the next patches with a simpler approach.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Reviewed-by: Neal Cardwell
    Reviewed-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

11 Dec, 2018

1 commit

  • In commit f9bfe4e6a9d0 ("tcp: lack of available data can also cause
    TSO defer") we moved the test in tcp_tso_should_defer() for packets
    with a FIN flag, and we mentioned that the same would be done
    later for the EOR flag.

    Both flags should be handled at the same time, after all other
    heuristics have been considered. They both mean that no more bytes
    can be added to this skb by an application.
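    A sketch of handling both flags together at the top of
    tcp_tso_should_defer() (illustrative; close to the described intent):

    /* If this skb cannot grow anymore, sending now is the only option. */
    if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) || TCP_SKB_CB(skb)->eor)
            goto send_now;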

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Dec, 2018

1 commit

  • Several conflicts, seemingly all over the place.

    I used Stephen Rothwell's sample resolutions for many of these, if not
    just to double check my own work, so definitely the credit largely
    goes to him.

    The NFP conflict consisted of a bug fix (moving operations
    past the rhashtable operation) while changing the initial
    argument in the function call in the moved code.

    The net/dsa/master.c conflict had to do with a bug fix intermixing of
    making dsa_master_set_mtu() static with the fixing of the tagging
    attribute location.

    cls_flower had a conflict because the dup reject fix from Or
    overlapped with the addition of port range classification.

    __set_phy_supported()'s conflict was relatively easy to resolve
    because Andrew fixed it in both trees, so it was just a matter
    of taking the net-next copy. Or at least I think it was :-)

    Joe Stringer's fix to the handling of netns id 0 in bpf_sk_lookup()
    intermixed with changes on how the sdif and caller_net are calculated
    in these code paths in net-next.

    The remaining BPF conflicts were largely about the addition of the
    __bpf_md_ptr stuff in 'net' overlapping with adjustments and additions
    to the relevant data structure where the MD pointer macros are used.

    Signed-off-by: David S. Miller

    David S. Miller