31 Jan, 2019

1 commit

  • [ Upstream commit 13d7f46386e060df31b727c9975e38306fa51e7a ]

    TCP transmission with MSG_ZEROCOPY fails if the peer closes its end of
    the connection and so transitions this socket to CLOSE_WAIT state.

    Transmission in close wait state is acceptable. Other similar tests in
    the stack (e.g., in FastOpen) accept both states. Relax this test, too.

    Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg276886.html
    Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg227390.html
    Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
    Reported-by: Marek Majkowski
    Signed-off-by: Willem de Bruijn
    CC: Yuchung Cheng
    CC: Neal Cardwell
    CC: Soheil Hassas Yeganeh
    CC: Alexey Kodanev
    Acked-by: Soheil Hassas Yeganeh
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
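
The relaxed state test described above can be pictured as a bitmask over socket states. The following is a minimal user-space sketch, not the kernel code: the `sketch_` names are hypothetical stand-ins, and the "strict" variant is only an assumption about what the earlier test looked like, inferred from the message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for TCP state numbers; only the states
 * needed for this illustration are listed. */
enum sketch_tcp_state {
    SKETCH_ESTABLISHED = 1,
    SKETCH_SYN_SENT    = 2,
    SKETCH_CLOSE       = 7,
    SKETCH_CLOSE_WAIT  = 8,
};

/* One bit per state, so a set of states becomes a bitmask. */
#define SKETCH_STATE_BIT(s) (1u << (s))

/* Assumed earlier behaviour: only ESTABLISHED may transmit. */
static bool zerocopy_allowed_strict(enum sketch_tcp_state state)
{
    return state == SKETCH_ESTABLISHED;
}

/* Relaxed behaviour: ESTABLISHED and CLOSE_WAIT may both transmit. */
static bool zerocopy_allowed_relaxed(enum sketch_tcp_state state)
{
    const unsigned int ok = SKETCH_STATE_BIT(SKETCH_ESTABLISHED) |
                            SKETCH_STATE_BIT(SKETCH_CLOSE_WAIT);

    /* Guard the shift below against out-of-range state numbers. */
    assert((int)state >= 0 && (int)state < 32);
    return (SKETCH_STATE_BIT(state) & ok) != 0;
}
```

With this model, a socket whose peer has closed (CLOSE_WAIT) is rejected by the strict test but accepted by the relaxed one.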
     

01 Dec, 2018

1 commit

  • commit 8873c064d1de579ea23412a6d3eee972593f142b upstream.

    syzkaller was able to hit the WARN_ON(sock_owned_by_user(sk));
    in tcp_close()

    While a socket is being closed, it is very possible other
    threads find it in rtnetlink dump.

    tcp_get_info() will acquire the socket lock for a short amount
    of time (slow = lock_sock_fast(sk)/unlock_sock_fast(sk, slow);),
    enough to trigger the warning.

    Fixes: 67db3e4bfbc9 ("tcp: no longer hold ehash lock while calling tcp_get_info()")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

26 Sep, 2018

1 commit

  • [ Upstream commit 5cf4a8532c992bb22a9ecd5f6d93f873f4eaccc2 ]

    According to the documentation in msg_zerocopy.rst, the SO_ZEROCOPY
    flag was introduced because send(2) ignores unknown message flags and
    any legacy application which was accidentally passing the equivalent of
    MSG_ZEROCOPY earlier should not see any new behaviour.

    Before commit f214f915e7db ("tcp: enable MSG_ZEROCOPY"), a send(2) call
    which passed the equivalent of MSG_ZEROCOPY without setting SO_ZEROCOPY
    would succeed. However, after that commit, it fails with -ENOBUFS. So
    it appears that the SO_ZEROCOPY flag fails to fulfill its intended
    purpose. Fix it.

    Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
    Signed-off-by: Vincent Whitchurch
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Vincent Whitchurch
     

24 Aug, 2018

1 commit

  • [ Upstream commit e56b8ce363a36fb7b74b80aaa5cc9084f2c908b4 ]

    Attempt to make cryptic TCP seq number error messages clearer by
    (1) identifying the source of the message as "TCP", (2) identifying the
    errors as "seq # bug", and (3) grouping the field identifiers and values
    by separating them with commas.

    E.g., the following message is changed from:

    recvmsg bug 2: copied 73BCB6CD seq 70F17CBE rcvnxt 73BCB9AA fl 0
    WARNING: CPU: 2 PID: 1501 at /linux/net/ipv4/tcp.c:1881 tcp_recvmsg+0x649/0xb90

    to:

    TCP recvmsg seq # bug 2: copied 73BCB6CD, seq 70F17CBE, rcvnxt 73BCB9AA, fl 0
    WARNING: CPU: 2 PID: 1501 at /linux/net/ipv4/tcp.c:2011 tcp_recvmsg+0x694/0xba0

    Suggested-by: 積丹尼 Dan Jacobson
    Signed-off-by: Randy Dunlap
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Randy Dunlap
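
As an illustration of the new layout, the reworked message can be reproduced with an ordinary format string. This is a user-space sketch only; `format_seq_bug` and its parameters are hypothetical, and the real kernel code emits the message through its own warning machinery.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Formats a message in the new style: "TCP" prefix, "seq # bug"
 * label, and comma-separated field/value pairs. Returns the value
 * of snprintf(), i.e. the length the full message would need. */
static int format_seq_bug(char *buf, size_t len, int which,
                          unsigned int copied, unsigned int seq,
                          unsigned int rcvnxt, unsigned int fl)
{
    assert(buf != NULL);
    return snprintf(buf, len,
                    "TCP recvmsg seq # bug %d: copied %X, seq %X, rcvnxt %X, fl %X",
                    which, copied, seq, rcvnxt, fl);
}
```

Called with the values from the example above, the buffer holds the same first line as the "to:" message.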
     

25 Jul, 2018

1 commit

  • [ Upstream commit acc2cf4e37174646a24cba42fa53c668b2338d4e ]

    When tcp_diag_destroy closes a TCP_NEW_SYN_RECV socket, it first
    frees it by calling inet_csk_reqsk_queue_drop_and_put in
    tcp_abort, and then frees it again by calling sock_gen_put.

    Since tcp_abort only has one caller, and all the other codepaths
    in tcp_abort don't free the socket, just remove the free in that
    function.

    Cc: David Ahern
    Tested: passes Android sock_diag_test.py, which exercises this codepath
    Fixes: d7226c7a4dd1 ("net: diag: Fix refcnt leak in error path destroying socket")
    Signed-off-by: Lorenzo Colitti
    Signed-off-by: Eric Dumazet
    Reviewed-by: David Ahern
    Tested-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Lorenzo Colitti
     

19 May, 2018

1 commit

  • [ Upstream commit 16ae6aa1705299789f71fdea59bfb119c1fbd9c0 ]

    The TCP repair sequence of operation is to first set the socket in
    repair mode, then inject the TCP stats into the socket with repair
    socket options, then call connect() to re-activate the socket. The
    connect syscall simply returns and sets the state to ESTABLISHED.
    As a result, Fast Open is meaningless for TCP repair.

    However allowing sendto() system call with MSG_FASTOPEN flag half-way
    during the repair operation could unexpectedly cause data to be
    sent, before the operation finishes changing the internal TCP stats
    (e.g. MSS). This in turn triggers TCP warnings on inconsistent
    packet accounting.

    The fix is to simply disallow Fast Open operation once the socket
    is in the repair mode.

    Reported-by: syzbot
    Signed-off-by: Yuchung Cheng
    Reviewed-by: Neal Cardwell
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yuchung Cheng
     

16 May, 2018

1 commit

  • commit bf2acc943a45d2b2e8a9f1a5ddff6b6e43cc69d9 upstream.

    syzbot is able to produce a nasty WARN_ON() in tcp_verify_left_out()
    with the following C-repro:

    socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
    setsockopt(3, SOL_TCP, TCP_REPAIR, [1], 4) = 0
    setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [-1], 4) = 0
    bind(3, {sa_family=AF_INET, sin_port=htons(20002), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
    sendto(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
    1242, MSG_FASTOPEN, {sa_family=AF_INET, sin_port=htons(20002), sin_addr=inet_addr("127.0.0.1")}, 16) = 1242
    setsockopt(3, SOL_TCP, TCP_REPAIR_WINDOW, "\4\0\0@+\205\0\0\377\377\0\0\377\377\377\177\0\0\0\0", 20) = 0
    writev(3, [{"\270", 1}], 1) = 1
    setsockopt(3, SOL_TCP, TCP_REPAIR_OPTIONS, "\10\0\0\0\0\0\0\0\0\0\0\0|\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 386) = 0
    writev(3, [{"\210v\r[\226\320t\231qwQ\204\264l\254\t\1\20\245\214p\350H\223\254;\\\37\345\307p$"..., 3144}], 1) = 3144

    The 3rd system call looks odd :
    setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [-1], 4) = 0

    This patch makes sure bound checking is using an unsigned compare.

    Fixes: ee9952831cfd ("tcp: Initial repair mode")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Cc: Pavel Emelyanov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

29 Apr, 2018

2 commits

  • Clear tp->packets_out when purging the write queue, otherwise
    tcp_rearm_rto() mistakenly assumes TCP write queue is not empty.
    This results in NULL pointer dereference.

    Also, remove the redundant `tp->packets_out = 0` from
    tcp_disconnect(), since tcp_disconnect() calls
    tcp_write_queue_purge().

    Fixes: a27fd7a8ed38 (tcp: purge write queue upon RST)
    Reported-by: Subash Abhinov Kasiviswanathan
    Reported-by: Sami Farin
    Tested-by: Sami Farin
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Acked-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: Greg Kroah-Hartman

    Soheil Hassas Yeganeh
     
  • [ Upstream commit 7212303268918b9a203aebeacfdbd83b5e87b20d ]

    syzbot/KMSAN reported an uninit-value in tcp_parse_options() [1]

    I believe this was caused by a TCP_MD5SIG being set on a live
    flow.

    This is highly unexpected, since TCP option space is limited.

    For instance, presence of TCP MD5 option automatically disables
    TCP TimeStamp option at SYN/SYNACK time, which we can not do
    once flow has been established.

    Really, adding/deleting an MD5 key only makes sense on sockets
    in CLOSE or LISTEN state.

    [1]
    BUG: KMSAN: uninit-value in tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720
    CPU: 1 PID: 6177 Comm: syzkaller192004 Not tainted 4.16.0+ #83
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:17 [inline]
    dump_stack+0x185/0x1d0 lib/dump_stack.c:53
    kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
    __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
    tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720
    tcp_fast_parse_options net/ipv4/tcp_input.c:3858 [inline]
    tcp_validate_incoming+0x4f1/0x2790 net/ipv4/tcp_input.c:5184
    tcp_rcv_established+0xf60/0x2bb0 net/ipv4/tcp_input.c:5453
    tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469
    sk_backlog_rcv include/net/sock.h:908 [inline]
    __release_sock+0x2d6/0x680 net/core/sock.c:2271
    release_sock+0x97/0x2a0 net/core/sock.c:2786
    tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464
    inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
    sock_sendmsg_nosec net/socket.c:630 [inline]
    sock_sendmsg net/socket.c:640 [inline]
    SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747
    SyS_sendto+0x8a/0xb0 net/socket.c:1715
    do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    RIP: 0033:0x448fe9
    RSP: 002b:00007fd472c64d38 EFLAGS: 00000216 ORIG_RAX: 000000000000002c
    RAX: ffffffffffffffda RBX: 00000000006e5a30 RCX: 0000000000448fe9
    RDX: 000000000000029f RSI: 0000000020a88f88 RDI: 0000000000000004
    RBP: 00000000006e5a34 R08: 0000000020e68000 R09: 0000000000000010
    R10: 00000000200007fd R11: 0000000000000216 R12: 0000000000000000
    R13: 00007fff074899ef R14: 00007fd472c659c0 R15: 0000000000000009

    Uninit was created at:
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
    kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
    kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
    kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
    slab_post_alloc_hook mm/slab.h:445 [inline]
    slab_alloc_node mm/slub.c:2737 [inline]
    __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
    __kmalloc_reserve net/core/skbuff.c:138 [inline]
    __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
    alloc_skb include/linux/skbuff.h:984 [inline]
    tcp_send_ack+0x18c/0x910 net/ipv4/tcp_output.c:3624
    __tcp_ack_snd_check net/ipv4/tcp_input.c:5040 [inline]
    tcp_ack_snd_check net/ipv4/tcp_input.c:5053 [inline]
    tcp_rcv_established+0x2103/0x2bb0 net/ipv4/tcp_input.c:5469
    tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469
    sk_backlog_rcv include/net/sock.h:908 [inline]
    __release_sock+0x2d6/0x680 net/core/sock.c:2271
    release_sock+0x97/0x2a0 net/core/sock.c:2786
    tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464
    inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
    sock_sendmsg_nosec net/socket.c:630 [inline]
    sock_sendmsg net/socket.c:640 [inline]
    SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747
    SyS_sendto+0x8a/0xb0 net/socket.c:1715
    do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    Fixes: cfb6eeb4c860 ("[TCP]: MD5 Signature Option (RFC2385) support.")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

01 Apr, 2018

1 commit

  • [ Upstream commit e05836ac07c77dd90377f8c8140bce2a44af5fe7 ]

    When the connection is aborted, there is no point in
    keeping the packets on the write queue until the connection
    is closed.

    Similar to a27fd7a8ed38 ('tcp: purge write queue upon RST'),
    this is essential for a correct MSG_ZEROCOPY implementation,
    because userspace cannot call close(fd) before receiving
    zerocopy signals even when the connection is aborted.

    Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Neal Cardwell
    Reviewed-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Soheil Hassas Yeganeh
     

13 Feb, 2018

1 commit

  • [ Upstream commit 9b42d55a66d388e4dd5550107df051a9637564fc ]

    A socket can be disconnected and transformed back into a listening
    socket. If sk_frag.page is not released at that point, it is cloned
    into a new socket by sk_clone_lock, but the reference count of the
    page is not increased, leading to a use-after-free or double-free issue.

    Signed-off-by: Li RongQing
    Cc: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Li RongQing
     

31 Jan, 2018

1 commit

  • [ Upstream commit 4ee806d51176ba7b8ff1efd81f271d7252e03a1d ]

    When a tcp socket is closed, if it detects that its net namespace is
    exiting, close immediately and do not wait for FIN sequence.

    For normal sockets, a reference is taken to their net namespace, so it will
    never exit while the socket is open. However, kernel sockets do not take a
    reference to their net namespace, so it may begin exiting while the kernel
    socket is still open. In this case if the kernel socket is a tcp socket,
    it will stay open trying to complete its close sequence. The sock's dst(s)
    hold a reference to their interface, which are all transferred to the
    namespace's loopback interface when the real interfaces are taken down.
    When the namespace tries to take down its loopback interface, it hangs
    waiting for all references to the loopback interface to release, which
    results in messages like:

    unregister_netdevice: waiting for lo to become free. Usage count = 1

    These messages continue until the socket finally times out and closes.
    Since the net namespace cleanup holds the net_mutex while calling its
    registered pernet callbacks, any new net namespace initialization is
    blocked until the current net namespace finishes exiting.

    After this change, the tcp socket notices the exiting net namespace, and
    closes immediately, releasing its dst(s) and their reference to the
    loopback interface, which lets the net namespace continue exiting.

    Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
    Signed-off-by: Dan Streetman
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Dan Streetman
     

03 Jan, 2018

1 commit

  • [ Upstream commit d4761754b4fb2ef8d9a1e9d121c4bec84e1fe292 ]

    Mark tcp_sock during a SACK reneging event and invalidate rate samples
    while marked. Such rate samples may overestimate bw by including packets
    that were SACKed before reneging.

    < ack 6001 win 10000 sack 7001:38001
    < ack 7001 win 0 sack 8001:38001 // Reneg detected
    > seq 7001:8001 // RTO, SACK cleared.
    < ack 38001 win 10000

    In the above example, the rate sample taken after the last ack will count
    7001-38001 as delivered, while the actual delivery rate could likely
    be much lower, i.e. 7001-8001.

    This patch adds a new field tcp_sock.is_sack_reneg and marks it when we
    declare SACK reneging and enter TCP_CA_Loss, and unmarks it after
    the last rate sample was taken before moving back to TCP_CA_Open. This
    patch also invalidates rate samples taken while tcp_sock.is_sack_reneg
    is set.

    Fixes: b9f64820fb22 ("tcp: track data delivery rate for a TCP connection")
    Signed-off-by: Yousuk Seung
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Eric Dumazet
    Acked-by: Priyaranjan Jha
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yousuk Seung
     

02 Sep, 2017

2 commits


31 Aug, 2017

1 commit

  • This reverts commit 45f119bf936b1f9f546a0b139c5b56f9bb2bdc78.

    Eric Dumazet says:
    We found at Google a significant regression caused by
    45f119bf936b1f9f546a0b139c5b56f9bb2bdc78 tcp: remove header prediction

    In typical RPC (TCP_RR), when a TCP socket receives data, we now call
    tcp_ack() while we used to not call it.

    This touches enough cache lines to cause a slowdown.

    So the problem does not seem to be HP removal itself but the tcp_ack()
    call. Therefore, it might be possible to remove HP after all, provided
    one finds a way to elide tcp_ack for most cases.

    Reported-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

26 Aug, 2017

2 commits

  • syzkaller got a hang in tcp stack, related to a bug in
    tcp_sendpage_locked()

    root@syzkaller:~# cat /proc/3059/stack
    __lock_sock+0x1dc/0x2f0
    lock_sock_nested+0xf3/0x110
    tcp_sendmsg+0x21/0x50
    inet_sendmsg+0x11f/0x5e0
    sock_sendmsg+0xca/0x110
    kernel_sendmsg+0x47/0x60
    sock_no_sendpage+0x1cc/0x280
    tcp_sendpage_locked+0x10b/0x160
    tcp_sendpage+0x43/0x60
    inet_sendpage+0x1aa/0x660
    kernel_sendpage+0x8d/0xe0
    sock_sendpage+0x8c/0xc0
    pipe_to_sendpage+0x290/0x3b0
    __splice_from_pipe+0x343/0x750
    splice_from_pipe+0x1e9/0x330
    generic_splice_sendpage+0x40/0x50
    SyS_splice+0x7b7/0x1610
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    Fixes: 306b13eb3cf9 ("proto_ops: Add locked held versions of sendmsg and sendpage")
    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • There are a few bugs around refcnt handling in the new BPF congestion
    control setsockopt:

    - The new ca is assigned to icsk->icsk_ca_ops even in the case where we
    cannot get a reference on it. This would lead to a use after free,
    since that ca is going away soon.

    - Changing the congestion control case doesn't release the refcnt on
    the previous ca.

    - In the reinit case, we first leak a reference on the old ca, then we
    call tcp_reinit_congestion_control on the ca that we have just
    assigned, leading to deinitializing the wrong ca (->release of the
    new ca on the old ca's data) and releasing the refcount on the ca
    that we actually want to use.

    This is visible by building (for example) BIC as a module and setting
    net.ipv4.tcp_congestion_control=bic, and using tcp_cong_kern.c from
    samples/bpf.

    This patch fixes the refcount issues, and moves reinit back into tcp
    core to avoid passing a ca pointer back to BPF.

    Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control")
    Signed-off-by: Sabrina Dubroca
    Acked-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Sabrina Dubroca
     

24 Aug, 2017

1 commit

  • When SOF_TIMESTAMPING_RX_SOFTWARE is enabled for tcp sockets, return the
    timestamp corresponding to the highest sequence number data returned.

    Previously, skb->tstamp was overwritten when a TCP packet was placed
    in the out-of-order queue. While the packet is in the ooo queue, save the
    timestamp in the TCP_SKB_CB. This space is shared with the gso_*
    options, which are only used on the tx path, and a previously unused
    4-byte hole.

    When skbs are coalesced either in the sk_receive_queue or the
    out_of_order_queue always choose the timestamp of the appended skb to
    maintain the invariant of returning the timestamp of the last byte in
    the recvmsg buffer.

    Signed-off-by: Mike Maloney
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Mike Maloney
     

17 Aug, 2017

1 commit


04 Aug, 2017

1 commit

  • Enable support for MSG_ZEROCOPY to the TCP stack. TSO and GSO are
    both supported. Only data sent to remote destinations is sent without
    copying. Packets looped onto a local destination have their payload
    copied to avoid unbounded latency.

    Tested:
    A 10x TCP_STREAM between two hosts showed a reduction in netserver
    process cycles by up to 70%, depending on packet size. Systemwide,
    savings are of course much less pronounced, at up to 20% best case.

    msg_zerocopy.sh 4 tcp:

    without zerocopy
    tx=121792 (7600 MB) txc=0 zc=n
    rx=60458 (7600 MB)

    with zerocopy
    tx=286257 (17863 MB) txc=286257 zc=y
    rx=140022 (17863 MB)

    This test opens a pair of sockets over veth. One calls send with
    64KB and optionally MSG_ZEROCOPY, and the other reads the initial
    bytes. The receiver truncates, so this is strictly an upper bound on
    what is achievable. It is more representative of sending data out of
    a physical NIC (when payload is not touched, either).

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

02 Aug, 2017

1 commit

  • Add new proto_ops sendmsg_locked and sendpage_locked that can be
    called when the socket lock is already held. Correspondingly, add
    kernel_sendmsg_locked and kernel_sendpage_locked as front end
    functions.

    These functions will be used in zero proxy so that we can take
    the socket lock in a ULP sendmsg/sendpage and then directly call the
    backend transport proto_ops functions.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

01 Aug, 2017

4 commits

  • Add the following stats into SCM_TIMESTAMPING_OPT_STATS control msg:
    TCP_NLA_PACING_RATE
    TCP_NLA_DELIVERY_RATE
    TCP_NLA_SND_CWND
    TCP_NLA_REORDERING
    TCP_NLA_MIN_RTT
    TCP_NLA_RECUR_RETRANS
    TCP_NLA_DELIVERY_RATE_APP_LMT

    Signed-off-by: Wei Wang
    Acked-by: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Wei Wang
     
  • Refactor the code to extract the function to compute delivery rate.
    This function will be used in later commit.

    Signed-off-by: Wei Wang
    Acked-by: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Wei Wang
     
  • Like prequeue, I am not sure this is overly useful nowadays.

    If we receive a train of packets, GRO will aggregate them if the
    headers are the same (HP predates GRO by several years) so we don't
    get a per-packet benefit, only a per-aggregated-packet one.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • prequeue is a tcp receive optimization that moves part of rx processing
    from bh to process context.

    This only works if the socket being processed belongs to a process that
    is blocked in recv on that socket.

    In practice, this doesn't happen anymore that often because nowadays
    servers tend to use an event driven (epoll) model.

    Even normal client applications (web browsers) commonly use many tcp
    connections in parallel.

    This has measurable impact only in netperf (which uses plain recv and
    thus allows prequeue use) from host to locally running vm (~4%), however,
    there were no changes when using netperf between two physical hosts with
    ixgbe interfaces.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

02 Jul, 2017

1 commit

  • Added support for changing congestion control for SOCK_OPS bpf
    programs through the setsockopt bpf helper function. It also adds
    a new SOCK_OPS op, BPF_SOCK_OPS_NEEDS_ECN, that is needed for
    congestion controls, like dctcp, that need to enable ECN in the
    SYN packets.

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     

01 Jul, 2017

2 commits


28 Jun, 2017

1 commit

  • If icsk_ulp_ops is unset, it dereferences a null ptr.
    Add a null ptr check.

    BUG: KASAN: null-ptr-deref in copy_to_user include/linux/uaccess.h:168 [inline]
    BUG: KASAN: null-ptr-deref in do_tcp_getsockopt.isra.33+0x24f/0x1e30 net/ipv4/tcp.c:3057
    Read of size 4 at addr 0000000000000020 by task syz-executor1/15452

    Signed-off-by: Dave Watson
    Reported-by: "Levin, Alexander (Sasha Levin)"
    Signed-off-by: David S. Miller

    Dave Watson
     

26 Jun, 2017

1 commit

  • We have to reset the sk->sk_rx_dst when we disconnect a TCP
    connection, because otherwise when we re-connect it this
    dst reference is simply overridden in tcp_finish_connect().

    This fixes a dst leak which leads to a loopback dev refcnt
    leak. It is a long-standing bug, Kevin reported a very similar
    (if not same) bug before. Thanks to Andrei for providing such
    a reliable reproducer which greatly narrows down the problem.

    Fixes: 41063e9dd119 ("ipv4: Early TCP socket demux.")
    Reported-by: Andrei Vagin
    Reported-by: Kevin Xu
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     

20 Jun, 2017

1 commit

  • Replace first padding in the tcp_md5sig structure with a new flag field
    and address prefix length so it can be specified when configuring a new
    key for TCP MD5 signature. The tcpm_flags field will only be used if the
    socket option is TCP_MD5SIG_EXT to avoid breaking existing programs, and
    tcpm_prefixlen only when the TCP_MD5SIG_FLAG_PREFIX flag is set.

    Signed-off-by: Bob Gilligan
    Signed-off-by: Eric Mowat
    Signed-off-by: Ivan Delalande
    Signed-off-by: David S. Miller

    Ivan Delalande
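
What a prefix length adds to a key can be illustrated with a plain address match: a key configured for a prefix applies to every peer inside it. The sketch below is IPv4-only with addresses in host byte order, and `md5_key_matches` is a hypothetical helper, not the kernel's lookup routine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if addr lies inside key_addr/prefixlen.
 * Both addresses are IPv4 values in host byte order. */
static bool md5_key_matches(uint32_t key_addr, unsigned int prefixlen,
                            uint32_t addr)
{
    uint32_t mask;

    assert(prefixlen <= 32);
    /* A /0 prefix matches everything; it is handled separately
     * because shifting a 32-bit value by 32 is undefined in C. */
    if (prefixlen == 0)
        mask = 0;
    else
        mask = UINT32_MAX << (32 - prefixlen);

    return (key_addr & mask) == (addr & mask);
}
```

For example, a key for 10.0.0.0/8 would cover 10.1.2.3 but not 11.0.0.1, whereas a /32 key covers exactly one peer.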
     

16 Jun, 2017

2 commits

  • Export do_tcp_sendpages and tcp_rate_check_app_limited, since tls will need to
    sendpages while the socket is already locked.

    tcp_sendpage is exported, but requires the socket lock to not be held already.

    Signed-off-by: Aviad Yehezkel
    Signed-off-by: Ilya Lesokhin
    Signed-off-by: Boris Pismenny
    Signed-off-by: Dave Watson
    Signed-off-by: David S. Miller

    Dave Watson
     
  • Add the infrastructure for attaching Upper Layer Protocols (ULPs) over TCP
    sockets. Based on a similar infrastructure in tcp_cong. The idea is that any
    ULP can add its own logic by changing the TCP proto_ops structure to its own
    methods.

    Example usage:

    setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));

    modules will call:
    tcp_register_ulp(&tcp_tls_ulp_ops);

    to register/unregister their ulp, with an init function and name.

    A list of registered ulps will be returned by tcp_get_available_ulp, which is
    hooked up to /proc. Example:

    $ cat /proc/sys/net/ipv4/tcp_available_ulp
    tls

    There is currently no functionality to remove or chain ULPs, but
    it should be possible to add these in the future if needed.

    Signed-off-by: Boris Pismenny
    Signed-off-by: Dave Watson
    Signed-off-by: David S. Miller

    Dave Watson
     

08 Jun, 2017

1 commit

  • DRAM supply shortage and poor memory pressure tracking in the TCP
    stack make any change in SO_SNDBUF/SO_RCVBUF (or equivalent autotuning
    limits) and tcp_mem[] quite hazardous.

    TCPMemoryPressures SNMP counter is an indication of tcp_mem sysctl
    limits being hit, but only tracking number of transitions.

    If TCP stack behavior under stress was perfect :
    1) It would maintain memory usage close to the limit.
    2) Memory pressure state would be entered for short times.

    We certainly prefer 100 events lasting 10ms compared to one event
    lasting 200 seconds.

    This patch adds a new SNMP counter tracking cumulative duration of
    memory pressure events, given in ms units.

    $ cat /proc/sys/net/ipv4/tcp_mem
    3088 4117 6176
    $ grep TCP /proc/net/sockstat
    TCP: inuse 180 orphan 0 tw 2 alloc 234 mem 4140
    $ nstat -n ; sleep 10 ; nstat |grep Pressure
    TcpExtTCPMemoryPressures 1700
    TcpExtTCPMemoryPressuresChrono 5209

    v2: Used EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL() as David
    instructed.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Jun, 2017

1 commit


01 Jun, 2017

1 commit

  • MTU probing initialization occurred only at connect() and at SYN or
    SYN-ACK reception, but the former sets MSS to either the default or the
    user set value (through TCP_MAXSEG sockopt) and the latter never happens
    with repaired sockets.

    The result was that, with MTU probing enabled and unless TCP_MAXSEG
    sockopt was used before connect(), probing would be stuck at
    tcp_base_mss value until tcp_probe_interval seconds have passed.

    Signed-off-by: Douglas Caetano dos Santos
    Signed-off-by: David S. Miller

    Douglas Caetano dos Santos
     

27 May, 2017

1 commit


26 May, 2017

1 commit

  • Fastopen API should be used to perform fastopen operations on the TCP
    socket. It does not make sense to use fastopen API to perform disconnect
    by calling it with AF_UNSPEC. The fastopen data path is also prone to
    race conditions and bugs when used with AF_UNSPEC.

    One issue reported and analyzed by Vegard Nossum is as follows:
    +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    Thread A: Thread B:
    ------------------------------------------------------------------------
    sendto()
    - tcp_sendmsg()
    - sk_stream_memory_free() = 0
    - goto wait_for_sndbuf
    - sk_stream_wait_memory()
    - sk_wait_event() // sleep
    | sendto(flags=MSG_FASTOPEN, dest_addr=AF_UNSPEC)
    | - tcp_sendmsg()
    | - tcp_sendmsg_fastopen()
    | - __inet_stream_connect()
    | - tcp_disconnect() //because of AF_UNSPEC
    | - tcp_transmit_skb()// send RST
    | - return 0; // no reconnect!
    | - sk_stream_wait_connect()
    | - sock_error()
    | - xchg(&sk->sk_err, 0)
    | - return -ECONNRESET
    - ... // wake up, see sk->sk_err == 0
    - skb_entail() on TCP_CLOSE socket

    If the connection is reopened then we will send a brand new SYN packet
    after thread A has already queued a buffer. At this point I think the
    socket internal state (sequence numbers etc.) becomes messed up.

    When the new connection is closed, the FIN-ACK is rejected because the
    sequence number is outside the window. The other side tries to
    retransmit, but __tcp_retransmit_skb() calls tcp_trim_head() on an
    empty skb which corrupts the skb data length and hits a BUG() in
    copy_and_csum_bits().
    +++++++++++++++++++++++++++++++++++++++++++++++++++++++++

    Hence, this patch adds a check for AF_UNSPEC in the fastopen data path
    and returns EOPNOTSUPP to the user if such a case happens.

    Fixes: cf60af03ca4e7 ("tcp: Fast Open client - sendmsg(MSG_FASTOPEN)")
    Reported-by: Vegard Nossum
    Signed-off-by: Wei Wang
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Wei Wang
     

23 May, 2017

1 commit