Eric Lee / smarc-fsl-linux-kernel

31 Jan, 2019

1 commit

ab6688714 tcp: allow MSG_ZEROCOPY transmission also in CLOSE_WAIT state ... Browse Code »

[ Upstream commit 13d7f46386e060df31b727c9975e38306fa51e7a ]

TCP transmission with MSG_ZEROCOPY fails if the peer closes its end of
the connection and so transitions this socket to CLOSE_WAIT state.

Transmission in close wait state is acceptable. Other similar tests in
the stack (e.g., in FastOpen) accept both states. Relax this test, too.

Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg276886.html
Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg227390.html
Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
Reported-by: Marek Majkowski
Signed-off-by: Willem de Bruijn
CC: Yuchung Cheng
CC: Neal Cardwell
CC: Soheil Hassas Yeganeh
CC: Alexey Kodanev
Acked-by: Soheil Hassas Yeganeh
Reviewed-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Willem de Bruijn
2019-01-31 15:13:42 +0800

01 Dec, 2018

1 commit

e6ddc2c3d tcp: do not release socket ownership in tcp_close() ... Browse Code »

commit 8873c064d1de579ea23412a6d3eee972593f142b upstream.

syzkaller was able to hit the WARN_ON(sock_owned_by_user(sk));
in tcp_close()

While a socket is being closed, it is very possible other
threads find it in rtnetlink dump.

tcp_get_info() will acquire the socket lock for a short amount
of time (slow = lock_sock_fast(sk)/unlock_sock_fast(sk, slow);),
enough to trigger the warning.

Fixes: 67db3e4bfbc9 ("tcp: no longer hold ehash lock while calling tcp_get_info()")
Signed-off-by: Eric Dumazet
Reported-by: syzbot
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Eric Dumazet
2018-12-01 16:42:51 +0800

26 Sep, 2018

1 commit

effa7afc5 tcp: really ignore MSG_ZEROCOPY if no SO_ZEROCOPY ... Browse Code »

[ Upstream commit 5cf4a8532c992bb22a9ecd5f6d93f873f4eaccc2 ]

According to the documentation in msg_zerocopy.rst, the SO_ZEROCOPY
flag was introduced because send(2) ignores unknown message flags and
any legacy application which was accidentally passing the equivalent of
MSG_ZEROCOPY earlier should not see any new behaviour.

Before commit f214f915e7db ("tcp: enable MSG_ZEROCOPY"), a send(2) call
which passed the equivalent of MSG_ZEROCOPY without setting SO_ZEROCOPY
would succeed. However, after that commit, it fails with -ENOBUFS. So
it appears that the SO_ZEROCOPY flag fails to fulfill its intended
purpose. Fix it.

Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
Signed-off-by: Vincent Whitchurch
Acked-by: Willem de Bruijn
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Vincent Whitchurch
2018-09-26 14:37:58 +0800

24 Aug, 2018

1 commit

90e7d6650 tcp: identify cryptic messages as TCP seq # bugs ... Browse Code »

[ Upstream commit e56b8ce363a36fb7b74b80aaa5cc9084f2c908b4 ]

Attempt to make cryptic TCP seq number error messages clearer by
(1) identifying the source of the message as "TCP", (2) identifying the
errors as "seq # bug", and (3) grouping the field identifiers and values
by separating them with commas.

E.g., the following message is changed from:

recvmsg bug 2: copied 73BCB6CD seq 70F17CBE rcvnxt 73BCB9AA fl 0
WARNING: CPU: 2 PID: 1501 at /linux/net/ipv4/tcp.c:1881 tcp_recvmsg+0x649/0xb90

to:

TCP recvmsg seq # bug 2: copied 73BCB6CD, seq 70F17CBE, rcvnxt 73BCB9AA, fl 0
WARNING: CPU: 2 PID: 1501 at /linux/net/ipv4/tcp.c:2011 tcp_recvmsg+0x694/0xba0

Suggested-by: 積丹尼 Dan Jacobson
Signed-off-by: Randy Dunlap
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

Randy Dunlap
2018-08-24 19:09:19 +0800

25 Jul, 2018

1 commit

cc0ab6475 net: diag: Don't double-free TCP_NEW_SYN_RECV sockets in tcp_abort ... Browse Code »

[ Upstream commit acc2cf4e37174646a24cba42fa53c668b2338d4e ]

When tcp_diag_destroy closes a TCP_NEW_SYN_RECV socket, it first
frees it by calling inet_csk_reqsk_queue_drop_and_and_put in
tcp_abort, and then frees it again by calling sock_gen_put.

Since tcp_abort only has one caller, and all the other codepaths
in tcp_abort don't free the socket, just remove the free in that
function.

Cc: David Ahern
Tested: passes Android sock_diag_test.py, which exercises this codepath
Fixes: d7226c7a4dd1 ("net: diag: Fix refcnt leak in error path destroying socket")
Signed-off-by: Lorenzo Colitti
Signed-off-by: Eric Dumazet
Reviewed-by: David Ahern
Tested-by: David Ahern
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Lorenzo Colitti
2018-07-25 17:25:09 +0800

19 May, 2018

1 commit

413d26276 tcp: ignore Fast Open on repair mode ... Browse Code »

[ Upstream commit 16ae6aa1705299789f71fdea59bfb119c1fbd9c0 ]

The TCP repair sequence of operation is to first set the socket in
repair mode, then inject the TCP stats into the socket with repair
socket options, then call connect() to re-activate the socket. The
connect syscall simply returns and set state to ESTABLISHED
mode. As a result Fast Open is meaningless for TCP repair.

However allowing sendto() system call with MSG_FASTOPEN flag half-way
during the repair operation could unexpectedly cause data to be
sent, before the operation finishes changing the internal TCP stats
(e.g. MSS). This in turn triggers TCP warnings on inconsistent
packet accounting.

The fix is to simply disallow Fast Open operation once the socket
is in the repair mode.

Reported-by: syzbot
Signed-off-by: Yuchung Cheng
Reviewed-by: Neal Cardwell
Reviewed-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Yuchung Cheng
2018-05-19 16:20:25 +0800

16 May, 2018

1 commit

8c12bd91b tcp: fix TCP_REPAIR_QUEUE bound checking ... Browse Code »

commit bf2acc943a45d2b2e8a9f1a5ddff6b6e43cc69d9 upstream.

syzbot is able to produce a nasty WARN_ON() in tcp_verify_left_out()
with following C-repro :

socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
setsockopt(3, SOL_TCP, TCP_REPAIR, [1], 4) = 0
setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [-1], 4) = 0
bind(3, {sa_family=AF_INET, sin_port=htons(20002), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
sendto(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1242, MSG_FASTOPEN, {sa_family=AF_INET, sin_port=htons(20002), sin_addr=inet_addr("127.0.0.1")}, 16) = 1242
setsockopt(3, SOL_TCP, TCP_REPAIR_WINDOW, "\4\0\0@+\205\0\0\377\377\0\0\377\377\377\177\0\0\0\0", 20) = 0
writev(3, [{"\270", 1}], 1) = 1
setsockopt(3, SOL_TCP, TCP_REPAIR_OPTIONS, "\10\0\0\0\0\0\0\0\0\0\0\0|\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 386) = 0
writev(3, [{"\210v\r[\226\320t\231qwQ\204\264l\254\t\1\20\245\214p\350H\223\254;\\\37\345\307p$"..., 3144}], 1) = 3144

The 3rd system call looks odd :
setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [-1], 4) = 0

This patch makes sure bound checking is using an unsigned compare.

Fixes: ee9952831cfd ("tcp: Initial repair mode")
Signed-off-by: Eric Dumazet
Reported-by: syzbot
Cc: Pavel Emelyanov
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Eric Dumazet
2018-05-16 16:10:24 +0800

29 Apr, 2018

2 commits

75020d631 tcp: clear tp->packets_out when purging write queue ... Browse Code »

Clear tp->packets_out when purging the write queue, otherwise
tcp_rearm_rto() mistakenly assumes TCP write queue is not empty.
This results in NULL pointer dereference.

Also, remove the redundant `tp->packets_out = 0` from
tcp_disconnect(), since tcp_disconnect() calls
tcp_write_queue_purge().

Fixes: a27fd7a8ed38 (tcp: purge write queue upon RST)
Reported-by: Subash Abhinov Kasiviswanathan
Reported-by: Sami Farin
Tested-by: Sami Farin
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
Acked-by: Yuchung Cheng
Acked-by: Neal Cardwell
Signed-off-by: Greg Kroah-Hartman

Soheil Hassas Yeganeh
2018-04-29 17:33:13 +0800
d5387e663 tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets ... Browse Code »

[ Upstream commit 7212303268918b9a203aebeacfdbd83b5e87b20d ]

syzbot/KMSAN reported an uninit-value in tcp_parse_options() [1]

I believe this was caused by a TCP_MD5SIG being set on live
flow.

This is highly unexpected, since TCP option space is limited.

For instance, presence of TCP MD5 option automatically disables
TCP TimeStamp option at SYN/SYNACK time, which we can not do
once flow has been established.

Really, adding/deleting an MD5 key only makes sense on sockets
in CLOSE or LISTEN state.

[1]
BUG: KMSAN: uninit-value in tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720
CPU: 1 PID: 6177 Comm: syzkaller192004 Not tainted 4.16.0+ #83
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x185/0x1d0 lib/dump_stack.c:53
kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
__msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720
tcp_fast_parse_options net/ipv4/tcp_input.c:3858 [inline]
tcp_validate_incoming+0x4f1/0x2790 net/ipv4/tcp_input.c:5184
tcp_rcv_established+0xf60/0x2bb0 net/ipv4/tcp_input.c:5453
tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469
sk_backlog_rcv include/net/sock.h:908 [inline]
__release_sock+0x2d6/0x680 net/core/sock.c:2271
release_sock+0x97/0x2a0 net/core/sock.c:2786
tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464
inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
sock_sendmsg_nosec net/socket.c:630 [inline]
sock_sendmsg net/socket.c:640 [inline]
SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747
SyS_sendto+0x8a/0xb0 net/socket.c:1715
do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x448fe9
RSP: 002b:00007fd472c64d38 EFLAGS: 00000216 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00000000006e5a30 RCX: 0000000000448fe9
RDX: 000000000000029f RSI: 0000000020a88f88 RDI: 0000000000000004
RBP: 00000000006e5a34 R08: 0000000020e68000 R09: 0000000000000010
R10: 00000000200007fd R11: 0000000000000216 R12: 0000000000000000
R13: 00007fff074899ef R14: 00007fd472c659c0 R15: 0000000000000009

Uninit was created at:
kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
slab_post_alloc_hook mm/slab.h:445 [inline]
slab_alloc_node mm/slub.c:2737 [inline]
__kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
__kmalloc_reserve net/core/skbuff.c:138 [inline]
__alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
alloc_skb include/linux/skbuff.h:984 [inline]
tcp_send_ack+0x18c/0x910 net/ipv4/tcp_output.c:3624
__tcp_ack_snd_check net/ipv4/tcp_input.c:5040 [inline]
tcp_ack_snd_check net/ipv4/tcp_input.c:5053 [inline]
tcp_rcv_established+0x2103/0x2bb0 net/ipv4/tcp_input.c:5469
tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469
sk_backlog_rcv include/net/sock.h:908 [inline]
__release_sock+0x2d6/0x680 net/core/sock.c:2271
release_sock+0x97/0x2a0 net/core/sock.c:2786
tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464
inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
sock_sendmsg_nosec net/socket.c:630 [inline]
sock_sendmsg net/socket.c:640 [inline]
SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747
SyS_sendto+0x8a/0xb0 net/socket.c:1715
do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: cfb6eeb4c860 ("[TCP]: MD5 Signature Option (RFC2385) support.")
Signed-off-by: Eric Dumazet
Reported-by: syzbot
Acked-by: Yuchung Cheng
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Eric Dumazet
2018-04-29 17:33:11 +0800

01 Apr, 2018

1 commit

e44c17330 tcp: purge write queue upon aborting the connection ... Browse Code »

[ Upstream commit e05836ac07c77dd90377f8c8140bce2a44af5fe7 ]

When the connection is aborted, there is no point in
keeping the packets on the write queue until the connection
is closed.

Similar to a27fd7a8ed38 ('tcp: purge write queue upon RST'),
this is essential for a correct MSG_ZEROCOPY implementation,
because userspace cannot call close(fd) before receiving
zerocopy signals even when the connection is aborted.

Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
Signed-off-by: Soheil Hassas Yeganeh
Signed-off-by: Neal Cardwell
Reviewed-by: Eric Dumazet
Signed-off-by: Yuchung Cheng
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Soheil Hassas Yeganeh
2018-04-01 00:10:38 +0800

13 Feb, 2018

1 commit

6d35430fd tcp: release sk_frag.page in tcp_disconnect ... Browse Code »

[ Upstream commit 9b42d55a66d388e4dd5550107df051a9637564fc ]

socket can be disconnected and gets transformed back to a listening
socket, if sk_frag.page is not released, which will be cloned into
a new socket by sk_clone_lock, but the reference count of this page
is increased, lead to a use after free or double free issue

Signed-off-by: Li RongQing
Cc: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Li RongQing
2018-02-13 17:19:47 +0800

31 Jan, 2018

1 commit

32e57f8c5 net: tcp: close sock if net namespace is exiting ... Browse Code »

[ Upstream commit 4ee806d51176ba7b8ff1efd81f271d7252e03a1d ]

When a tcp socket is closed, if it detects that its net namespace is
exiting, close immediately and do not wait for FIN sequence.

For normal sockets, a reference is taken to their net namespace, so it will
never exit while the socket is open. However, kernel sockets do not take a
reference to their net namespace, so it may begin exiting while the kernel
socket is still open. In this case if the kernel socket is a tcp socket,
it will stay open trying to complete its close sequence. The sock's dst(s)
hold a reference to their interface, which are all transferred to the
namespace's loopback interface when the real interfaces are taken down.
When the namespace tries to take down its loopback interface, it hangs
waiting for all references to the loopback interface to release, which
results in messages like:

unregister_netdevice: waiting for lo to become free. Usage count = 1

These messages continue until the socket finally times out and closes.
Since the net namespace cleanup holds the net_mutex while calling its
registered pernet callbacks, any new net namespace initialization is
blocked until the current net namespace finishes exiting.

After this change, the tcp socket notices the exiting net namespace, and
closes immediately, releasing its dst(s) and their reference to the
loopback interface, which lets the net namespace continue exiting.

Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
Signed-off-by: Dan Streetman
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Dan Streetman
2018-01-31 21:03:45 +0800

03 Jan, 2018

1 commit

f35318b28 tcp: invalidate rate samples during SACK reneging ... Browse Code »

[ Upstream commit d4761754b4fb2ef8d9a1e9d121c4bec84e1fe292 ]

Mark tcp_sock during a SACK reneging event and invalidate rate samples
while marked. Such rate samples may overestimate bw by including packets
that were SACKed before reneging.

< ack 6001 win 10000 sack 7001:38001
< ack 7001 win 0 sack 8001:38001 // Reneg detected
> seq 7001:8001 // RTO, SACK cleared.
< ack 38001 win 10000

In above example the rate sample taken after the last ack will count
7001-38001 as delivered while the actual delivery rate likely could
be much lower i.e. 7001-8001.

This patch adds a new field tcp_sock.sack_reneg and marks it when we
declare SACK reneging and entering TCP_CA_Loss, and unmarks it after
the last rate sample was taken before moving back to TCP_CA_Open. This
patch also invalidates rate samples taken while tcp_sock.is_sack_reneg
is set.

Fixes: b9f64820fb22 ("tcp: track data delivery rate for a TCP connection")
Signed-off-by: Yousuk Seung
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Acked-by: Soheil Hassas Yeganeh
Acked-by: Eric Dumazet
Acked-by: Priyaranjan Jha
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Yousuk Seung
2018-01-03 03:31:09 +0800

02 Sep, 2017

2 commits

db5bce32f net: prepare (struct ubuf_info)->refcnt conversion ... Browse Code »

In order to convert this atomic_t refcnt to refcount_t,
we need to init the refcount to one to not trigger
a 0 -> 1 transition.

This also removes one atomic operation in fast path.

v2: removed dead code in sock_zerocopy_put_abort()
as suggested by Willem.

Signed-off-by: Eric Dumazet
Acked-by: Willem de Bruijn
Signed-off-by: David S. Miller

Eric Dumazet
2017-09-02 11:22:03 +0800
6026e043d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Three cases of simple overlapping changes.

Signed-off-by: David S. Miller

David S. Miller
2017-09-02 08:42:05 +0800

31 Aug, 2017

1 commit

31770e34e tcp: Revert "tcp: remove header prediction" ... Browse Code »

This reverts commit 45f119bf936b1f9f546a0b139c5b56f9bb2bdc78.

Eric Dumazet says:
We found at Google a significant regression caused by
45f119bf936b1f9f546a0b139c5b56f9bb2bdc78 tcp: remove header prediction

In typical RPC (TCP_RR), when a TCP socket receives data, we now call
tcp_ack() while we used to not call it.

This touches enough cache lines to cause a slowdown.

so problem does not seem to be HP removal itself but the tcp_ack()
call. Therefore, it might be possible to remove HP after all, provided
one finds a way to elide tcp_ack for most cases.

Reported-by: Eric Dumazet
Signed-off-by: Florian Westphal
Signed-off-by: David S. Miller

Florian Westphal
2017-08-31 02:20:09 +0800

26 Aug, 2017

2 commits

bd9dfc54e tcp: fix hang in tcp_sendpage_locked() ... Browse Code »

syszkaller got a hang in tcp stack, related to a bug in
tcp_sendpage_locked()

root@syzkaller:~# cat /proc/3059/stack
[] __lock_sock+0x1dc/0x2f0
[] lock_sock_nested+0xf3/0x110
[] tcp_sendmsg+0x21/0x50
[] inet_sendmsg+0x11f/0x5e0
[] sock_sendmsg+0xca/0x110
[] kernel_sendmsg+0x47/0x60
[] sock_no_sendpage+0x1cc/0x280
[] tcp_sendpage_locked+0x10b/0x160
[] tcp_sendpage+0x43/0x60
[] inet_sendpage+0x1aa/0x660
[] kernel_sendpage+0x8d/0xe0
[] sock_sendpage+0x8c/0xc0
[] pipe_to_sendpage+0x290/0x3b0
[] __splice_from_pipe+0x343/0x750
[] splice_from_pipe+0x1e9/0x330
[] generic_splice_sendpage+0x40/0x50
[] SyS_splice+0x7b7/0x1610
[] entry_SYSCALL_64_fastpath+0x1f/0xbe

Fixes: 306b13eb3cf9 ("proto_ops: Add locked held versions of sendmsg and sendpage")
Signed-off-by: Eric Dumazet
Reported-by: Dmitry Vyukov
Cc: Tom Herbert
Signed-off-by: David S. Miller

Eric Dumazet
2017-08-26 08:22:01 +0800
ebfa00c57 tcp: fix refcnt leak with ebpf congestion control ... Browse Code »

There are a few bugs around refcnt handling in the new BPF congestion
control setsockopt:

- The new ca is assigned to icsk->icsk_ca_ops even in the case where we
cannot get a reference on it. This would lead to a use after free,
since that ca is going away soon.

- Changing the congestion control case doesn't release the refcnt on
the previous ca.

- In the reinit case, we first leak a reference on the old ca, then we
call tcp_reinit_congestion_control on the ca that we have just
assigned, leading to deinitializing the wrong ca (->release of the
new ca on the old ca's data) and releasing the refcount on the ca
that we actually want to use.

This is visible by building (for example) BIC as a module and setting
net.ipv4.tcp_congestion_control=bic, and using tcp_cong_kern.c from
samples/bpf.

This patch fixes the refcount issues, and moves reinit back into tcp
core to avoid passing a ca pointer back to BPF.

Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control")
Signed-off-by: Sabrina Dubroca
Acked-by: Lawrence Brakmo
Signed-off-by: David S. Miller

Sabrina Dubroca
2017-08-26 08:16:27 +0800

24 Aug, 2017

1 commit

98aaa913b tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE to TCP recvmsg ... Browse Code »

When SOF_TIMESTAMPING_RX_SOFTWARE is enabled for tcp sockets, return the
timestamp corresponding to the highest sequence number data returned.

Previously the skb->tstamp is overwritten when a TCP packet is placed
in the out of order queue. While the packet is in the ooo queue, save the
timestamp in the TCB_SKB_CB. This space is shared with the gso_*
options which are only used on the tx path, and a previously unused 4
byte hole.

When skbs are coalesced either in the sk_receive_queue or the
out_of_order_queue always choose the timestamp of the appended skb to
maintain the invariant of returning the timestamp of the last byte in
the recvmsg buffer.

Signed-off-by: Mike Maloney
Acked-by: Willem de Bruijn
Signed-off-by: David S. Miller

Mike Maloney
2017-08-24 11:30:47 +0800

17 Aug, 2017

1 commit

774c46732 tcp: Export tcp_{sendpage,sendmsg}_locked() for ipv6. ... Browse Code »

Signed-off-by: David S. Miller

David S. Miller
2017-08-17 06:41:34 +0800

04 Aug, 2017

1 commit

f214f915e tcp: enable MSG_ZEROCOPY ... Browse Code »

Enable support for MSG_ZEROCOPY to the TCP stack. TSO and GSO are
both supported. Only data sent to remote destinations is sent without
copying. Packets looped onto a local destination have their payload
copied to avoid unbounded latency.

Tested:
A 10x TCP_STREAM between two hosts showed a reduction in netserver
process cycles by up to 70%, depending on packet size. Systemwide,
savings are of course much less pronounced, at up to 20% best case.

msg_zerocopy.sh 4 tcp:

without zerocopy
tx=121792 (7600 MB) txc=0 zc=n
rx=60458 (7600 MB)

with zerocopy
tx=286257 (17863 MB) txc=286257 zc=y
rx=140022 (17863 MB)

This test opens a pair of sockets over veth, one one calls send with
64KB and optionally MSG_ZEROCOPY and on the other reads the initial
bytes. The receiver truncates, so this is strictly an upper bound on
what is achievable. It is more representative of sending data out of
a physical NIC (when payload is not touched, either).

Signed-off-by: Willem de Bruijn
Signed-off-by: David S. Miller

Willem de Bruijn
2017-08-04 12:37:30 +0800

02 Aug, 2017

1 commit

306b13eb3 proto_ops: Add locked held versions of sendmsg and sendpage ... Browse Code »

Add new proto_ops sendmsg_locked and sendpage_locked that can be
called when the socket lock is already held. Correspondingly, add
kernel_sendmsg_locked and kernel_sendpage_locked as front end
functions.

These functions will be used in zero proxy so that we can take
the socket lock in a ULP sendmsg/sendpage and then directly call the
backend transport proto_ops functions.

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller

Tom Herbert
2017-08-02 06:26:18 +0800

01 Aug, 2017

4 commits

bb7c19f96 tcp: add related fields into SCM_TIMESTAMPING_OPT_STATS ... Browse Code »

Add the following stats into SCM_TIMESTAMPING_OPT_STATS control msg:
TCP_NLA_PACING_RATE
TCP_NLA_DELIVERY_RATE
TCP_NLA_SND_CWND
TCP_NLA_REORDERING
TCP_NLA_MIN_RTT
TCP_NLA_RECUR_RETRANS
TCP_NLA_DELIVERY_RATE_APP_LMT

Signed-off-by: Wei Wang
Acked-by: Yuchung Cheng
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller

Wei Wang
2017-08-01 08:26:18 +0800
0263598c7 tcp: extract the function to compute delivery rate ... Browse Code »

Refactor the code to extract the function to compute delivery rate.
This function will be used in later commit.

Signed-off-by: Wei Wang
Acked-by: Yuchung Cheng
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller

Wei Wang
2017-08-01 08:26:18 +0800
45f119bf9 tcp: remove header prediction ... Browse Code »

Like prequeue, I am not sure this is overly useful nowadays.

If we receive a train of packets, GRO will aggregate them if the
headers are the same (HP predates GRO by several years) so we don't
get a per-packet benefit, only a per-aggregated-packet one.

Signed-off-by: Florian Westphal
Signed-off-by: David S. Miller

Florian Westphal
2017-08-01 05:37:49 +0800
e7942d063 tcp: remove prequeue support ... Browse Code »

prequeue is a tcp receive optimization that moves part of rx processing
from bh to process context.

This only works if the socket being processed belongs to a process that
is blocked in recv on that socket.

In practice, this doesn't happen anymore that often because nowadays
servers tend to use an event driven (epoll) model.

Even normal client applications (web browsers) commonly use many tcp
connections in parallel.

This has measureable impact only in netperf (which uses plain recv and
thus allows prequeue use) from host to locally running vm (~4%), however,
there were no changes when using netperf between two physical hosts with
ixgbe interfaces.

Signed-off-by: Florian Westphal
Signed-off-by: David S. Miller

Florian Westphal
2017-08-01 05:37:49 +0800

02 Jul, 2017

1 commit

91b5b21c7 bpf: Add support for changing congestion control ... Browse Code »

Added support for changing congestion control for SOCK_OPS bpf
programs through the setsockopt bpf helper function. It also adds
a new SOCK_OPS op, BPF_SOCK_OPS_NEEDS_ECN, that is needed for
congestion controls, like dctcp, that need to enable ECN in the
SYN packets.

Signed-off-by: Lawrence Brakmo
Signed-off-by: David S. Miller

Lawrence Brakmo
2017-07-02 07:15:14 +0800

01 Jul, 2017

2 commits

14afee4b6 net: convert sock.sk_wmem_alloc from atomic_t to refcount_t ... Browse Code »

refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova
Signed-off-by: Hans Liljestrand
Signed-off-by: Kees Cook
Signed-off-by: David Windsor
Signed-off-by: David S. Miller

Reshetova, Elena
2017-07-01 22:39:08 +0800
b07911593 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

A set of overlapping changes in macvlan and the rocker
driver, nothing serious.

Signed-off-by: David S. Miller

David S. Miller
2017-07-01 00:43:08 +0800

28 Jun, 2017

1 commit

d97af30f6 tcp: fix null ptr deref in getsockopt(..., TCP_ULP, ...) ... Browse Code »

If icsk_ulp_ops is unset, it dereferences a null ptr.
Add a null ptr check.

BUG: KASAN: null-ptr-deref in copy_to_user include/linux/uaccess.h:168 [inline]
BUG: KASAN: null-ptr-deref in do_tcp_getsockopt.isra.33+0x24f/0x1e30 net/ipv4/tcp.c:3057
Read of size 4 at addr 0000000000000020 by task syz-executor1/15452

Signed-off-by: Dave Watson
Reported-by: "Levin, Alexander (Sasha Levin)"
Signed-off-by: David S. Miller

Dave Watson
2017-06-28 03:39:11 +0800

26 Jun, 2017

1 commit

d747a7a51 tcp: reset sk_rx_dst in tcp_disconnect() ... Browse Code »

We have to reset the sk->sk_rx_dst when we disconnect a TCP
connection, because otherwise when we re-connect it this
dst reference is simply overridden in tcp_finish_connect().

This fixes a dst leak which leads to a loopback dev refcnt
leak. It is a long-standing bug, Kevin reported a very similar
(if not same) bug before. Thanks to Andrei for providing such
a reliable reproducer which greatly narrows down the problem.

Fixes: 41063e9dd119 ("ipv4: Early TCP socket demux.")
Reported-by: Andrei Vagin
Reported-by: Kevin Xu
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller

WANG Cong
2017-06-26 00:23:07 +0800

20 Jun, 2017

1 commit

8917a777b tcp: md5: add TCP_MD5SIG_EXT socket option to set a key address prefix ... Browse Code »

Replace first padding in the tcp_md5sig structure with a new flag field
and address prefix length so it can be specified when configuring a new
key for TCP MD5 signature. The tcpm_flags field will only be used if the
socket option is TCP_MD5SIG_EXT to avoid breaking existing programs, and
tcpm_prefixlen only when the TCP_MD5SIG_FLAG_PREFIX flag is set.

Signed-off-by: Bob Gilligan
Signed-off-by: Eric Mowat
Signed-off-by: Ivan Delalande
Signed-off-by: David S. Miller

Ivan Delalande
2017-06-20 01:51:34 +0800

16 Jun, 2017

2 commits

e3b5616a3 tcp: export do_tcp_sendpages and tcp_rate_check_app_limited functions ... Browse Code »

Export do_tcp_sendpages and tcp_rate_check_app_limited, since tls will need to
sendpages while the socket is already locked.

tcp_sendpage is exported, but requires the socket lock to not be held already.

Signed-off-by: Aviad Yehezkel
Signed-off-by: Ilya Lesokhin
Signed-off-by: Boris Pismenny
Signed-off-by: Dave Watson
Signed-off-by: David S. Miller

Dave Watson
2017-06-16 00:12:40 +0800
734942cc4 tcp: ULP infrastructure ... Browse Code »

Add the infrustructure for attaching Upper Layer Protocols (ULPs) over TCP
sockets. Based on a similar infrastructure in tcp_cong. The idea is that any
ULP can add its own logic by changing the TCP proto_ops structure to its own
methods.

Example usage:

setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));

modules will call:
tcp_register_ulp(&tcp_tls_ulp_ops);

to register/unregister their ulp, with an init function and name.

A list of registered ulps will be returned by tcp_get_available_ulp, which is
hooked up to /proc. Example:

$ cat /proc/sys/net/ipv4/tcp_available_ulp
tls

There is currently no functionality to remove or chain ULPs, but
it should be possible to add these in the future if needed.

Signed-off-by: Boris Pismenny
Signed-off-by: Dave Watson
Signed-off-by: David S. Miller

Dave Watson
2017-06-16 00:12:40 +0800

08 Jun, 2017

1 commit

060447511 tcp: add TCPMemoryPressuresChrono counter ... Browse Code »

DRAM supply shortage and poor memory pressure tracking in TCP
stack makes any change in SO_SNDBUF/SO_RCVBUF (or equivalent autotuning
limits) and tcp_mem[] quite hazardous.

TCPMemoryPressures SNMP counter is an indication of tcp_mem sysctl
limits being hit, but only tracking number of transitions.

If TCP stack behavior under stress was perfect :
1) It would maintain memory usage close to the limit.
2) Memory pressure state would be entered for short times.

We certainly prefer 100 events lasting 10ms compared to one event
lasting 200 seconds.

This patch adds a new SNMP counter tracking cumulative duration of
memory pressure events, given in ms units.

$ cat /proc/sys/net/ipv4/tcp_mem
3088 4117 6176
$ grep TCP /proc/net/sockstat
TCP: inuse 180 orphan 0 tw 2 alloc 234 mem 4140
$ nstat -n ; sleep 10 ; nstat |grep Pressure
TcpExtTCPMemoryPressures 1700
TcpExtTCPMemoryPressuresChrono 5209

v2: Used EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL() as David
instructed.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2017-06-08 23:26:19 +0800

07 Jun, 2017

1 commit

216fe8f02 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Just some simple overlapping changes in marvell PHY driver
and the DSA core code.

Signed-off-by: David S. Miller

David S. Miller
2017-06-07 10:20:08 +0800

01 Jun, 2017

1 commit

15e565152 tcp: reinitialize MTU probing when setting MSS in a TCP repair ... Browse Code »

MTU probing initialization occurred only at connect() and at SYN or
SYN-ACK reception, but the former sets MSS to either the default or the
user set value (through TCP_MAXSEG sockopt) and the latter never happens
with repaired sockets.

The result was that, with MTU probing enabled and unless TCP_MAXSEG
sockopt was used before connect(), probing would be stuck at
tcp_base_mss value until tcp_probe_interval seconds have passed.

Signed-off-by: Douglas Caetano dos Santos
Signed-off-by: David S. Miller

Douglas Caetano dos Santos
2017-06-01 00:28:59 +0800

27 May, 2017

1 commit

34aa83c2f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Overlapping changes in drivers/net/phy/marvell.c, bug fix in 'net'
restricting a HW workaround alongside cleanups in 'net-next'.

Signed-off-by: David S. Miller

David S. Miller
2017-05-27 08:46:35 +0800

26 May, 2017

1 commit

ba615f675 tcp: avoid fastopen API to be used on AF_UNSPEC ... Browse Code »

Fastopen API should be used to perform fastopen operations on the TCP
socket. It does not make sense to use fastopen API to perform disconnect
by calling it with AF_UNSPEC. The fastopen data path is also prone to
race conditions and bugs when using with AF_UNSPEC.

One issue reported and analyzed by Vegard Nossum is as follows:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Thread A: Thread B:
------------------------------------------------------------------------
sendto()
- tcp_sendmsg()
- sk_stream_memory_free() = 0
- goto wait_for_sndbuf
- sk_stream_wait_memory()
- sk_wait_event() // sleep
| sendto(flags=MSG_FASTOPEN, dest_addr=AF_UNSPEC)
| - tcp_sendmsg()
| - tcp_sendmsg_fastopen()
| - __inet_stream_connect()
| - tcp_disconnect() //because of AF_UNSPEC
| - tcp_transmit_skb()// send RST
| - return 0; // no reconnect!
| - sk_stream_wait_connect()
| - sock_error()
| - xchg(&sk->sk_err, 0)
| - return -ECONNRESET
- ... // wake up, see sk->sk_err == 0
- skb_entail() on TCP_CLOSE socket

If the connection is reopened then we will send a brand new SYN packet
after thread A has already queued a buffer. At this point I think the
socket internal state (sequence numbers etc.) becomes messed up.

When the new connection is closed, the FIN-ACK is rejected because the
sequence number is outside the window. The other side tries to
retransmit,
but __tcp_retransmit_skb() calls tcp_trim_head() on an empty skb which
corrupts the skb data length and hits a BUG() in copy_and_csum_bits().
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Hence, this patch adds a check for AF_UNSPEC in the fastopen data path
and return EOPNOTSUPP to user if such case happens.

Fixes: cf60af03ca4e7 ("tcp: Fast Open client - sendmsg(MSG_FASTOPEN)")
Reported-by: Vegard Nossum
Signed-off-by: Wei Wang
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Wei Wang
2017-05-26 01:30:34 +0800

23 May, 2017

1 commit

218b6a5b2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Browse Code »

David S. Miller
2017-05-23 11:32:48 +0800