Eric Lee / smarc-fsl-linux-kernel

24 Oct, 2020

1 commit

435ccfa89 tcp: Prevent low rmem stalls with SO_RCVLOWAT. ... Browse Code »

With SO_RCVLOWAT, under memory pressure,
it is possible to enter a state where:

1. We have not received enough bytes to satisfy SO_RCVLOWAT.
2. We have not entered buffer pressure (see tcp_rmem_pressure()).
3. But, we do not have enough buffer space to accept more packets.

In this case, we advertise 0 rwnd (due to #3) but the application does
not drain the receive queue (no wakeup because of #1 and #2) so the
flow stalls.

Modify the heuristic for SO_RCVLOWAT so that, if we are advertising
rwnd
Acked-by: Soheil Hassas Yeganeh
Acked-by: Neal Cardwell
Signed-off-by: Eric Dumazet
Link: https://lore.kernel.org/r/20201023184709.217614-1-arjunroy.kdev@gmail.com
Signed-off-by: Jakub Kicinski

Arjun Roy
2020-10-24 10:11:20 +0800

06 Oct, 2020

1 commit

8b0308fe3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net ... Browse Code »

Rejecting non-native endian BTF overlapped with the addition
of support for it.

The rest were more simple overlapping changes, except the
renesas ravb binding update, which had to follow a file
move as well as a YAML conversion.

Signed-off-by: David S. Miller

David S. Miller
2020-10-06 09:40:01 +0800

03 Oct, 2020

1 commit

cf83a17ed tcp: use sendpage_ok() to detect misused .sendpage ... Browse Code »

commit a10674bf2406 ("tcp: detecting the misuse of .sendpage for Slab
objects") adds the checks for Slab pages, but the pages don't have
page_count are still missing from the check.

Network layer's sendpage method is not designed to send page_count 0
pages neither, therefore both PageSlab() and page_count() should be
both checked for the sending page. This is exactly what sendpage_ok()
does.

This patch uses sendpage_ok() in do_tcp_sendpages() to detect misused
.sendpage, to make the code more robust.

Fixes: a10674bf2406 ("tcp: detecting the misuse of .sendpage for Slab objects")
Suggested-by: Eric Dumazet
Signed-off-by: Coly Li
Cc: Vasily Averin
Cc: David S. Miller
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller

Coly Li
2020-10-03 06:27:08 +0800

01 Oct, 2020

1 commit

b6b6d6533 inet: remove icsk_ack.blocked ... Browse Code »

TCP has been using it to work around the possibility of tcp_delack_timer()
finding the socket owned by user.

After commit 6f458dfb4092 ("tcp: improve latencies of timer triggered events")
we added TCP_DELACK_TIMER_DEFERRED atomic bit for more immediate recovery,
so we can get rid of icsk_ack.blocked

This frees space that following patch will reuse.

Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller

Eric Dumazet
2020-10-01 05:21:30 +0800

24 Sep, 2020

1 commit

6d772f328 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next ... Browse Code »

Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-09-23

The following pull-request contains BPF updates for your *net-next* tree.

We've added 95 non-merge commits during the last 22 day(s) which contain
a total of 124 files changed, 4211 insertions(+), 2040 deletions(-).

The main changes are:

1) Full multi function support in libbpf, from Andrii.

2) Refactoring of function argument checks, from Lorenz.

3) Make bpf_tail_call compatible with functions (subprograms), from Maciej.

4) Program metadata support, from YiFei.

5) bpf iterator optimizations, from Yonghong.
====================

Signed-off-by: David S. Miller

David S. Miller
2020-09-24 04:11:11 +0800

15 Sep, 2020

3 commits

afb83012c tcp: schedule EPOLLOUT after a partial sendmsg ... Browse Code »

For EPOLLET, applications must call sendmsg until they get EAGAIN.
Otherwise, there is no guarantee that EPOLLOUT is sent if there was
a failure upon memory allocation.

As a result on high-speed NICs, userspace observes multiple small
sendmsgs after a partial sendmsg until EAGAIN, since TCP can send
1-2 TSOs in between two sendmsg syscalls:

// One large partial send due to memory allocation failure.
sendmsg(20MB) = 2MB
// Many small sends until EAGAIN.
sendmsg(18MB) = 64KB
sendmsg(17.9MB) = 128KB
sendmsg(17.8MB) = 64KB
...
sendmsg(...) = EAGAIN
// At this point, userspace can assume an EPOLLOUT.

To fix this, set the SOCK_NOSPACE on all partial sendmsg scenarios
to guarantee that we send EPOLLOUT after partial sendmsg.

After this commit userspace can assume that it will receive an EPOLLOUT
after the first partial sendmsg. This EPOLLOUT will benefit from
sk_stream_write_space() logic delaying the EPOLLOUT until significant
space is available in write queue.

Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller

Soheil Hassas Yeganeh
2020-09-15 07:58:24 +0800
8ba3c9d1c tcp: return EPOLLOUT from tcp_poll only when notsent_bytes is half the limit ... Browse Code »

If there was any event available on the TCP socket, tcp_poll()
will be called to retrieve all the events. In tcp_poll(), we call
sk_stream_is_writeable() which returns true as long as we are at least
one byte below notsent_lowat. This will result in quite a few
spurious EPLLOUT and frequent tiny sendmsg() calls as a result.

Similar to sk_stream_write_space(), use __sk_stream_is_writeable
with a wake value of 1, so that we set EPOLLOUT only if half the
space is available for write.

Signed-off-by: Soheil Hassas Yeganeh
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Soheil Hassas Yeganeh
2020-09-15 07:58:24 +0800
c76c69565 mptcp: call tcp_cleanup_rbuf on subflows ... Browse Code »

That is needed to let the subflows announce promptly when new
space is available in the receive buffer.

tcp_cleanup_rbuf() is currently a static function, drop the
scope modifier and add a declaration in the TCP header.

Reviewed-by: Mat Martineau
Signed-off-by: Paolo Abeni
Signed-off-by: David S. Miller

Paolo Abeni
2020-09-15 04:28:02 +0800

11 Sep, 2020

2 commits

29a949325 tcp: simplify tcp_set_congestion_control(): Always reinitialize ... Browse Code »

Now that the previous patches ensure that all call sites for
tcp_set_congestion_control() want to initialize congestion control, we
can simplify tcp_set_congestion_control() by removing the reinit
argument and the code to support it.

Signed-off-by: Neal Cardwell
Signed-off-by: Eric Dumazet
Signed-off-by: Alexei Starovoitov
Acked-by: Yuchung Cheng
Acked-by: Kevin Yang
Cc: Lawrence Brakmo

Neal Cardwell
2020-09-11 11:53:01 +0800
8919a9b31 tcp: Only init congestion control if not initialized already ... Browse Code »

Change tcp_init_transfer() to only initialize congestion control if it
has not been initialized already.

With this new approach, we can arrange things so that if the EBPF code
sets the congestion control by calling setsockopt(TCP_CONGESTION) then
tcp_init_transfer() will not re-initialize the CC module.

This is an approach that has the following beneficial properties:

(1) This allows CC module customizations made by the EBPF called in
tcp_init_transfer() to persist, and not be wiped out by a later
call to tcp_init_congestion_control() in tcp_init_transfer().

(2) Does not flip the order of EBPF and CC init, to avoid causing bugs
for existing code upstream that depends on the current order.

(3) Does not cause 2 initializations for for CC in the case where the
EBPF called in tcp_init_transfer() wants to set the CC to a new CC
algorithm.

(4) Allows follow-on simplifications to the code in net/core/filter.c
and net/ipv4/tcp_cong.c, which currently both have some complexity
to special-case CC initialization to avoid double CC
initialization if EBPF sets the CC.

Signed-off-by: Neal Cardwell
Signed-off-by: Eric Dumazet
Signed-off-by: Alexei Starovoitov
Acked-by: Yuchung Cheng
Acked-by: Kevin Yang
Cc: Lawrence Brakmo

Neal Cardwell
2020-09-11 11:53:01 +0800

25 Aug, 2020

4 commits

267cf9fa4 tcp: bpf: Optionally store mac header in TCP_SAVE_SYN ... Browse Code »

This patch is adapted from Eric's patch in an earlier discussion [1].

The TCP_SAVE_SYN currently only stores the network header and
tcp header. This patch allows it to optionally store
the mac header also if the setsockopt's optval is 2.

It requires one more bit for the "save_syn" bit field in tcp_sock.
This patch achieves this by moving the syn_smc bit next to the is_mptcp.
The syn_smc is currently used with the TCP experimental option. Since
syn_smc is only used when CONFIG_SMC is enabled, this patch also puts
the "IS_ENABLED(CONFIG_SMC)" around it like the is_mptcp did
with "IS_ENABLED(CONFIG_MPTCP)".

The mac_hdrlen is also stored in the "struct saved_syn"
to allow a quick offset from the bpf prog if it chooses to start
getting from the network header or the tcp header.

[1]: https://lore.kernel.org/netdev/CANn89iLJNWh6bkH7DNhy_kmcAexuUCccqERqe7z2QsvPhGrYPQ@mail.gmail.com/

Suggested-by: Eric Dumazet
Signed-off-by: Martin KaFai Lau
Signed-off-by: Alexei Starovoitov
Reviewed-by: Eric Dumazet
Link: https://lore.kernel.org/bpf/20200820190123.2886935-1-kafai@fb.com

Martin KaFai Lau
2020-08-25 05:35:00 +0800
ca584ba07 tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt ... Browse Code »

This patch adds bpf_setsockopt(TCP_BPF_RTO_MIN) to allow bpf prog
to set the min rto of a connection. It could be used together
with the earlier patch which has added bpf_setsockopt(TCP_BPF_DELACK_MAX).

A later selftest patch will communicate the max delay ack in a
bpf tcp header option and then the receiving side can use
bpf_setsockopt(TCP_BPF_RTO_MIN) to set a shorter rto.

Signed-off-by: Martin KaFai Lau
Signed-off-by: Alexei Starovoitov
Reviewed-by: Eric Dumazet
Acked-by: John Fastabend
Link: https://lore.kernel.org/bpf/20200820190027.2884170-1-kafai@fb.com

Martin KaFai Lau
2020-08-25 05:35:00 +0800
2b8ee4f05 tcp: bpf: Add TCP_BPF_DELACK_MAX setsockopt ... Browse Code »

This change is mostly from an internal patch and adapts it from sysctl
config to the bpf_setsockopt setup.

The bpf_prog can set the max delay ack by using
bpf_setsockopt(TCP_BPF_DELACK_MAX). This max delay ack can be communicated
to its peer through bpf header option. The receiving peer can then use
this max delay ack and set a potentially lower rto by using
bpf_setsockopt(TCP_BPF_RTO_MIN) which will be introduced
in the next patch.

Another later selftest patch will also use it like the above to show
how to write and parse bpf tcp header option.

Signed-off-by: Martin KaFai Lau
Signed-off-by: Alexei Starovoitov
Reviewed-by: Eric Dumazet
Acked-by: John Fastabend
Link: https://lore.kernel.org/bpf/20200820190021.2884000-1-kafai@fb.com

Martin KaFai Lau
2020-08-25 05:34:59 +0800
70a217f19 tcp: Use a struct to represent a saved_syn ... Browse Code »

The TCP_SAVE_SYN has both the network header and tcp header.
The total length of the saved syn packet is currently stored in
the first 4 bytes (u32) of an array and the actual packet data is
stored after that.

A later patch will add a bpf helper that allows to get the tcp header
alone from the saved syn without the network header. It will be more
convenient to have a direct offset to a specific header instead of
re-parsing it. This requires to separately store the network hdrlen.
The total header length (i.e. network + tcp) is still needed for the
current usage in getsockopt. Although this total length can be obtained
by looking into the tcphdr and then get the (th->doff << 2), this patch
chooses to directly store the tcp hdrlen in the second four bytes of
this newly created "struct saved_syn". By using a new struct, it can
give a readable name to each individual header length.

Signed-off-by: Martin KaFai Lau
Signed-off-by: Alexei Starovoitov
Reviewed-by: Eric Dumazet
Acked-by: John Fastabend
Link: https://lore.kernel.org/bpf/20200820190014.2883694-1-kafai@fb.com

Martin KaFai Lau
2020-08-25 05:34:59 +0800

11 Aug, 2020

1 commit

f19008e67 tcp: correct read of TFO keys on big endian systems ... Browse Code »

When TFO keys are read back on big endian systems either via the global
sysctl interface or via getsockopt() using TCP_FASTOPEN_KEY, the values
don't match what was written.

For example, on s390x:

# echo "1-2-3-4" > /proc/sys/net/ipv4/tcp_fastopen_key
# cat /proc/sys/net/ipv4/tcp_fastopen_key
02000000-01000000-04000000-03000000

Instead of:

# cat /proc/sys/net/ipv4/tcp_fastopen_key
00000001-00000002-00000003-00000004

Fix this by converting to the correct endianness on read. This was
reported by Colin Ian King when running the 'tcp_fastopen_backup_key' net
selftest on s390x, which depends on the read value matching what was
written. I've confirmed that the test now passes on big and little endian
systems.

Signed-off-by: Jason Baron
Fixes: 438ac88009bc ("net: fastopen: robustness and endianness fixes for SipHash")
Cc: Ard Biesheuvel
Cc: Eric Dumazet
Reported-and-tested-by: Colin Ian King
Signed-off-by: David S. Miller

Jason Baron
2020-08-11 03:12:35 +0800

01 Aug, 2020

1 commit

48040793f tcp: add earliest departure time to SCM_TIMESTAMPING_OPT_STATS ... Browse Code »

This change adds TCP_NLA_EDT to SCM_TIMESTAMPING_OPT_STATS that reports
the earliest departure time(EDT) of the timestamped skb. By tracking EDT
values of the skb from different timestamps, we can observe when and how
much the value changed. This allows to measure the precise delay
injected on the sender host e.g. by a bpf-base throttler.

Signed-off-by: Yousuk Seung
Signed-off-by: Eric Dumazet
Acked-by: Neal Cardwell
Acked-by: Soheil Hassas Yeganeh
Acked-by: Yuchung Cheng
Signed-off-by: David S. Miller

Yousuk Seung
2020-08-01 08:00:44 +0800

29 Jul, 2020

1 commit

d3c481515 net: remove sockptr_advance ... Browse Code »

sockptr_advance never properly worked. Replace it with _offset variants
of copy_from_sockptr and copy_to_sockptr.

Fixes: ba423fdaa589 ("net: add a new sockptr_t type")
Reported-by: Jason A. Donenfeld
Reported-by: Ido Schimmel
Signed-off-by: Christoph Hellwig
Acked-by: Jason A. Donenfeld
Tested-by: Ido Schimmel
Signed-off-by: David S. Miller

Christoph Hellwig
2020-07-29 04:43:40 +0800

25 Jul, 2020

3 commits

a7b75c5a8 net: pass a sockptr_t into ->setsockopt ... Browse Code »

Rework the remaining setsockopt code to pass a sockptr_t instead of a
plain user pointer. This removes the last remaining set_fs(KERNEL_DS)
outside of architecture specific code.

Signed-off-by: Christoph Hellwig
Acked-by: Stefan Schmidt [ieee802154]
Acked-by: Matthieu Baerts
Signed-off-by: David S. Miller

Christoph Hellwig
2020-07-25 06:41:54 +0800
d38d2b00b net/tcp: switch do_tcp_setsockopt to sockptr_t ... Browse Code »

Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.

Signed-off-by: Christoph Hellwig
Signed-off-by: David S. Miller

Christoph Hellwig
2020-07-25 06:41:54 +0800
d4c19c491 net/tcp: switch ->md5_parse to sockptr_t ... Browse Code »

Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.

Signed-off-by: Christoph Hellwig
Signed-off-by: David S. Miller

Christoph Hellwig
2020-07-25 06:41:54 +0800

20 Jul, 2020

1 commit

3021ad529 net/ipv6: remove compat_ipv6_{get,set}sockopt ... Browse Code »

Handle the few cases that need special treatment in-line using
in_compat_syscall(). This also removes all the now unused
compat_{get,set}sockopt methods.

Signed-off-by: Christoph Hellwig
Signed-off-by: David S. Miller

Christoph Hellwig
2020-07-20 09:16:41 +0800

11 Jul, 2020

1 commit

71930d610 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net ... Browse Code »

All conflicts seemed rather trivial, with some guidance from
Saeed Mameed on the tc_ct.c one.

Signed-off-by: David S. Miller

David S. Miller
2020-07-11 15:46:00 +0800

10 Jul, 2020

1 commit

ce69e563b tcp: make sure listeners don't initialize congestion-control state ... Browse Code »

syzkaller found its way into setsockopt with TCP_CONGESTION "cdg".
tcp_cdg_init() does a kcalloc to store the gradients. As sk_clone_lock
just copies all the memory, the allocated pointer will be copied as
well, if the app called setsockopt(..., TCP_CONGESTION) on the listener.
If now the socket will be destroyed before the congestion-control
has properly been initialized (through a call to tcp_init_transfer), we
will end up freeing memory that does not belong to that particular
socket, opening the door to a double-free:

[ 11.413102] ==================================================================
[ 11.414181] BUG: KASAN: double-free or invalid-free in tcp_cleanup_congestion_control+0x58/0xd0
[ 11.415329]
[ 11.415560] CPU: 3 PID: 4884 Comm: syz-executor.5 Not tainted 5.8.0-rc2 #80
[ 11.416544] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[ 11.418148] Call Trace:
[ 11.418534]
[ 11.418834] dump_stack+0x7d/0xb0
[ 11.419297] print_address_description.constprop.0+0x1a/0x210
[ 11.422079] kasan_report_invalid_free+0x51/0x80
[ 11.423433] __kasan_slab_free+0x15e/0x170
[ 11.424761] kfree+0x8c/0x230
[ 11.425157] tcp_cleanup_congestion_control+0x58/0xd0
[ 11.425872] tcp_v4_destroy_sock+0x57/0x5a0
[ 11.426493] inet_csk_destroy_sock+0x153/0x2c0
[ 11.427093] tcp_v4_syn_recv_sock+0xb29/0x1100
[ 11.427731] tcp_get_cookie_sock+0xc3/0x4a0
[ 11.429457] cookie_v4_check+0x13d0/0x2500
[ 11.433189] tcp_v4_do_rcv+0x60e/0x780
[ 11.433727] tcp_v4_rcv+0x2869/0x2e10
[ 11.437143] ip_protocol_deliver_rcu+0x23/0x190
[ 11.437810] ip_local_deliver+0x294/0x350
[ 11.439566] __netif_receive_skb_one_core+0x15d/0x1a0
[ 11.441995] process_backlog+0x1b1/0x6b0
[ 11.443148] net_rx_action+0x37e/0xc40
[ 11.445361] __do_softirq+0x18c/0x61a
[ 11.445881] asm_call_on_stack+0x12/0x20
[ 11.446409]
[ 11.446716] do_softirq_own_stack+0x34/0x40
[ 11.447259] do_softirq.part.0+0x26/0x30
[ 11.447827] __local_bh_enable_ip+0x46/0x50
[ 11.448406] ip_finish_output2+0x60f/0x1bc0
[ 11.450109] __ip_queue_xmit+0x71c/0x1b60
[ 11.451861] __tcp_transmit_skb+0x1727/0x3bb0
[ 11.453789] tcp_rcv_state_process+0x3070/0x4d3a
[ 11.456810] tcp_v4_do_rcv+0x2ad/0x780
[ 11.457995] __release_sock+0x14b/0x2c0
[ 11.458529] release_sock+0x4a/0x170
[ 11.459005] __inet_stream_connect+0x467/0xc80
[ 11.461435] inet_stream_connect+0x4e/0xa0
[ 11.462043] __sys_connect+0x204/0x270
[ 11.465515] __x64_sys_connect+0x6a/0xb0
[ 11.466088] do_syscall_64+0x3e/0x70
[ 11.466617] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 11.467341] RIP: 0033:0x7f56046dc469
[ 11.467844] Code: Bad RIP value.
[ 11.468282] RSP: 002b:00007f5604dccdd8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
[ 11.469326] RAX: ffffffffffffffda RBX: 000000000068bf00 RCX: 00007f56046dc469
[ 11.470379] RDX: 0000000000000010 RSI: 0000000020000000 RDI: 0000000000000004
[ 11.471311] RBP: 00000000ffffffff R08: 0000000000000000 R09: 0000000000000000
[ 11.472286] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 11.473341] R13: 000000000041427c R14: 00007f5604dcd5c0 R15: 0000000000000003
[ 11.474321]
[ 11.474527] Allocated by task 4884:
[ 11.475031] save_stack+0x1b/0x40
[ 11.475548] __kasan_kmalloc.constprop.0+0xc2/0xd0
[ 11.476182] tcp_cdg_init+0xf0/0x150
[ 11.476744] tcp_init_congestion_control+0x9b/0x3a0
[ 11.477435] tcp_set_congestion_control+0x270/0x32f
[ 11.478088] do_tcp_setsockopt.isra.0+0x521/0x1a00
[ 11.478744] __sys_setsockopt+0xff/0x1e0
[ 11.479259] __x64_sys_setsockopt+0xb5/0x150
[ 11.479895] do_syscall_64+0x3e/0x70
[ 11.480395] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 11.481097]
[ 11.481321] Freed by task 4872:
[ 11.481783] save_stack+0x1b/0x40
[ 11.482230] __kasan_slab_free+0x12c/0x170
[ 11.482839] kfree+0x8c/0x230
[ 11.483240] tcp_cleanup_congestion_control+0x58/0xd0
[ 11.483948] tcp_v4_destroy_sock+0x57/0x5a0
[ 11.484502] inet_csk_destroy_sock+0x153/0x2c0
[ 11.485144] tcp_close+0x932/0xfe0
[ 11.485642] inet_release+0xc1/0x1c0
[ 11.486131] __sock_release+0xc0/0x270
[ 11.486697] sock_close+0xc/0x10
[ 11.487145] __fput+0x277/0x780
[ 11.487632] task_work_run+0xeb/0x180
[ 11.488118] __prepare_exit_to_usermode+0x15a/0x160
[ 11.488834] do_syscall_64+0x4a/0x70
[ 11.489326] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Wei Wang fixed a part of these CDG-malloc issues with commit c12014440750
("tcp: memset ca_priv data to 0 properly").

This patch here fixes the listener-scenario: We make sure that listeners
setting the congestion-control through setsockopt won't initialize it
(thus CDG never allocates on listeners). For those who use AF_UNSPEC to
reuse a socket, tcp_disconnect() is changed to cleanup afterwards.

(The issue can be reproduced at least down to v4.4.x.)

Cc: Wei Wang
Cc: Eric Dumazet
Fixes: 2b0a8c9eee81 ("tcp: add CDG congestion control")
Signed-off-by: Christoph Paasch
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Christoph Paasch
2020-07-10 04:07:45 +0800

03 Jul, 2020

1 commit

1ca0fafd7 tcp: md5: allow changing MD5 keys in all socket states ... Browse Code »

This essentially reverts commit 721230326891 ("tcp: md5: reject TCP_MD5SIG
or TCP_MD5SIG_EXT on established sockets")

Mathieu reported that many vendors BGP implementations can
actually switch TCP MD5 on established flows.

Quoting Mathieu :
Here is a list of a few network vendors along with their behavior
with respect to TCP MD5:

- Cisco: Allows for password to be changed, but within the hold-down
timer (~180 seconds).
- Juniper: When password is initially set on active connection it will
reset, but after that any subsequent password changes no network
resets.
- Nokia: No notes on if they flap the tcp connection or not.
- Ericsson/RedBack: Allows for 2 password (old/new) to co-exist until
both sides are ok with new passwords.
- Meta-Switch: Expects the password to be set before a connection is
attempted, but no further info on whether they reset the TCP
connection on a change.
- Avaya: Disable the neighbor, then set password, then re-enable.
- Zebos: Would normally allow the change when socket connected.

We can revert my prior change because commit 9424e2e7ad93 ("tcp: md5: fix potential
overestimation of TCP option space") removed the leak of 4 kernel bytes to
the wire that was the main reason for my patch.

While doing my investigations, I found a bug when a MD5 key is changed, leading
to these commits that stable teams want to consider before backporting this revert :

Commit 6a2febec338d ("tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()")
Commit e6ced831ef11 ("tcp: md5: refine tcp_md5_do_add()/tcp_md5_hash_key() barriers")

Fixes: 721230326891 "tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets"
Signed-off-by: Eric Dumazet
Reported-by: Mathieu Desnoyers
Signed-off-by: David S. Miller

Eric Dumazet
2020-07-03 05:07:49 +0800

02 Jul, 2020

1 commit

e6ced831e tcp: md5: refine tcp_md5_do_add()/tcp_md5_hash_key() barriers ... Browse Code »

My prior fix went a bit too far, according to Herbert and Mathieu.

Since we accept that concurrent TCP MD5 lookups might see inconsistent
keys, we can use READ_ONCE()/WRITE_ONCE() instead of smp_rmb()/smp_wmb()

Clearing all key->key[] is needed to avoid possible KMSAN reports,
if key->keylen is increased. Since tcp_md5_do_add() is not fast path,
using __GFP_ZERO to clear all struct tcp_md5sig_key is simpler.

data_race() was added in linux-5.8 and will prevent KCSAN reports,
this can safely be removed in stable backports, if data_race() is
not yet backported.

v2: use data_race() both in tcp_md5_hash_key() and tcp_md5_do_add()

Fixes: 6a2febec338d ("tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()")
Signed-off-by: Eric Dumazet
Cc: Mathieu Desnoyers
Cc: Herbert Xu
Cc: Marco Elver
Reviewed-by: Mathieu Desnoyers
Acked-by: Herbert Xu
Signed-off-by: David S. Miller

Eric Dumazet
2020-07-02 08:29:45 +0800

01 Jul, 2020

1 commit

6a2febec3 tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key() ... Browse Code »

MD5 keys are read with RCU protection, and tcp_md5_do_add()
might update in-place a prior key.

Normally, typical RCU updates would allocate a new piece
of memory. In this case only key->key and key->keylen might
be updated, and we do not care if an incoming packet could
see the old key, the new one, or some intermediate value,
since changing the key on a live flow is known to be problematic
anyway.

We only want to make sure that in the case key->keylen
is changed, cpus in tcp_md5_hash_key() wont try to use
uninitialized data, or crash because key->keylen was
read twice to feed sg_init_one() and ahash_request_set_crypt()

Fixes: 9ea88a153001 ("tcp: md5: check md5 signature without socket lock")
Signed-off-by: Eric Dumazet
Cc: Mathieu Desnoyers
Signed-off-by: David S. Miller

Eric Dumazet
2020-07-01 09:14:38 +0800

25 Jun, 2020

1 commit

aad4a0a95 tcp: Expose tcp_sock_set_keepidle_locked ... Browse Code »

This is preparation for usage in bpf_setsockopt.

v2:
- remove redundant EXPORT_SYMBOL (Alexei Starovoitov)

Signed-off-by: Dmitry Yakunin
Signed-off-by: Alexei Starovoitov
Link: https://lore.kernel.org/bpf/20200620153052.9439-2-zeil@yandex-team.ru

Dmitry Yakunin
2020-06-25 02:21:03 +0800

14 Jun, 2020

1 commit

96144c58a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net ... Browse Code »

Pull networking fixes from David Miller:

1) Fix cfg80211 deadlock, from Johannes Berg.

2) RXRPC fails to send norigications, from David Howells.

3) MPTCP RM_ADDR parsing has an off by one pointer error, fix from
Geliang Tang.

4) Fix crash when using MSG_PEEK with sockmap, from Anny Hu.

5) The ucc_geth driver needs __netdev_watchdog_up exported, from
Valentin Longchamp.

6) Fix hashtable memory leak in dccp, from Wang Hai.

7) Fix how nexthops are marked as FDB nexthops, from David Ahern.

8) Fix mptcp races between shutdown and recvmsg, from Paolo Abeni.

9) Fix crashes in tipc_disc_rcv(), from Tuong Lien.

10) Fix link speed reporting in iavf driver, from Brett Creeley.

11) When a channel is used for XSK and then reused again later for XSK,
we forget to clear out the relevant data structures in mlx5 which
causes all kinds of problems. Fix from Maxim Mikityanskiy.

12) Fix memory leak in genetlink, from Cong Wang.

13) Disallow sockmap attachments to UDP sockets, it simply won't work.
From Lorenz Bauer.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
net: ethernet: ti: ale: fix allmulti for nu type ale
net: ethernet: ti: am65-cpsw-nuss: fix ale parameters init
net: atm: Remove the error message according to the atomic context
bpf: Undo internal BPF_PROBE_MEM in BPF insns dump
libbpf: Support pre-initializing .bss global variables
tools/bpftool: Fix skeleton codegen
bpf: Fix memlock accounting for sock_hash
bpf: sockmap: Don't attach programs to UDP sockets
bpf: tcp: Recv() should return 0 when the peer socket is closed
ibmvnic: Flush existing work items before device removal
genetlink: clean up family attributes allocations
net: ipa: header pad field only valid for AP->modem endpoint
net: ipa: program upper nibbles of sequencer type
net: ipa: fix modem LAN RX endpoint id
net: ipa: program metadata mask differently
ionic: add pcie_print_link_status
rxrpc: Fix race between incoming ACK parser and retransmitter
net/mlx5: E-Switch, Fix some error pointer dereferences
net/mlx5: Don't fail driver on failure to create debugfs
net/mlx5e: CT: Fix ipv6 nat header rewrite actions
...

Linus Torvalds
2020-06-14 07:27:13 +0800

10 Jun, 2020

2 commits

3e4e28c5a mmap locking API: convert mmap_sem API comments ... Browse Code »

Convert comments that reference old mmap_sem APIs to reference
corresponding new mmap locking APIs instead.

Signed-off-by: Michel Lespinasse
Signed-off-by: Andrew Morton
Reviewed-by: Vlastimil Babka
Reviewed-by: Davidlohr Bueso
Reviewed-by: Daniel Jordan
Cc: David Rientjes
Cc: Hugh Dickins
Cc: Jason Gunthorpe
Cc: Jerome Glisse
Cc: John Hubbard
Cc: Laurent Dufour
Cc: Liam Howlett
Cc: Matthew Wilcox
Cc: Peter Zijlstra
Cc: Ying Han
Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com
Signed-off-by: Linus Torvalds

Michel Lespinasse
2020-06-10 00:39:14 +0800
d8ed45c5d mmap locking API: use coccinelle to convert mmap_sem rwsem call sites ... Browse Code »

This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Signed-off-by: Michel Lespinasse
Signed-off-by: Andrew Morton
Reviewed-by: Daniel Jordan
Reviewed-by: Laurent Dufour
Reviewed-by: Vlastimil Babka
Cc: Davidlohr Bueso
Cc: David Rientjes
Cc: Hugh Dickins
Cc: Jason Gunthorpe
Cc: Jerome Glisse
Cc: John Hubbard
Cc: Liam Howlett
Cc: Matthew Wilcox
Cc: Peter Zijlstra
Cc: Ying Han
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds

Michel Lespinasse
2020-06-10 00:39:14 +0800

09 Jun, 2020

1 commit

3763a24c7 net-zerocopy: use vm_insert_pages() for tcp rcv zerocopy ... Browse Code »

Use vm_insert_pages() for tcp receive zerocopy. Spin lock cycles (as
reported by perf) drop from a couple of percentage points to a fraction of
a percent. This results in a roughly 6% increase in efficiency, measured
roughly as zerocopy receive count divided by CPU utilization.

The intention of this patchset is to reduce atomic ops for tcp zerocopy
receives, which normally hits the same spinlock multiple times
consecutively.

[akpm@linux-foundation.org: suppress gcc-7.2.0 warning]
Link: http://lkml.kernel.org/r/20200128025958.43490-3-arjunroy.kdev@gmail.com
Signed-off-by: Arjun Roy
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
Cc: David Miller
Cc: Matthew Wilcox
Cc: Jason Gunthorpe
Cc: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: David S. Miller

Arjun Roy
2020-06-09 10:08:17 +0800

29 May, 2020

8 commits

480aeb963 tcp: add tcp_sock_set_keepcnt ... Browse Code »

Add a helper to directly set the TCP_KEEPCNT sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig
Signed-off-by: David S. Miller

Christoph Hellwig
2020-05-29 02:11:45 +0800
d41ecaac9 tcp: add tcp_sock_set_keepintvl ... Browse Code »

Add a helper to directly set the TCP_KEEPINTVL sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig
Signed-off-by: David S. Miller

Christoph Hellwig
2020-05-29 02:11:45 +0800
71c48eb81 tcp: add tcp_sock_set_keepidle ... Browse Code »

Add a helper to directly set the TCP_KEEP_IDLE sockopt from kernel
space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig
Signed-off-by: David S. Miller

Christoph Hellwig
2020-05-29 02:11:45 +0800
c488aeadc tcp: add tcp_sock_set_user_timeout ... Browse Code »

Add a helper to directly set the TCP_USER_TIMEOUT sockopt from kernel
space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig
Signed-off-by: David S. Miller

Christoph Hellwig
2020-05-29 02:11:45 +0800
557eadfcc tcp: add tcp_sock_set_syncnt ... Browse Code »

Add a helper to directly set the TCP_SYNCNT sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig
Acked-by: Sagi Grimberg
Signed-off-by: David S. Miller

Christoph Hellwig
2020-05-29 02:11:45 +0800
ddd061b8d tcp: add tcp_sock_set_quickack ... Browse Code »

Add a helper to directly set the TCP_QUICKACK sockopt from kernel space
without going through a fake uaccess. Cleanup the callers to avoid
pointless wrappers now that this is a simple function call.

Signed-off-by: Christoph Hellwig
Signed-off-by: David S. Miller

Christoph Hellwig
2020-05-29 02:11:45 +0800
12abc5ee7 tcp: add tcp_sock_set_nodelay ... Browse Code »

Add a helper to directly set the TCP_NODELAY sockopt from kernel space
without going through a fake uaccess. Cleanup the callers to avoid
pointless wrappers now that this is a simple function call.

Signed-off-by: Christoph Hellwig
Acked-by: Sagi Grimberg
Acked-by: Jason Gunthorpe
Signed-off-by: David S. Miller

Christoph Hellwig
2020-05-29 02:11:45 +0800
db10538a4 tcp: add tcp_sock_set_cork ... Browse Code »

Add a helper to directly set the TCP_CORK sockopt from kernel space
without going through a fake uaccess. Cleanup the callers to avoid
pointless wrappers now that this is a simple function call.

Signed-off-by: Christoph Hellwig
Signed-off-by: David S. Miller

Christoph Hellwig
2020-05-29 02:11:45 +0800

16 May, 2020

1 commit

da07f52d3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net ... Browse Code »

Move the bpf verifier trace check into the new switch statement in
HEAD.

Resolve the overlapping changes in hinic, where bug fixes overlap
the addition of VF support.

Signed-off-by: David S. Miller

David S. Miller
2020-05-16 04:48:59 +0800