01 Aug, 2020

2 commits

  • JOIN requests do not work in syncookie mode -- for HMAC validation, the
    peer's nonce and the mptcp token (needed to find the connection socket the
    join is meant for) are required, but this information is only present in
    the initial syn.

    So either we need to drop all JOIN requests once a listening socket enters
    syncookie mode, or we need to store enough state to reconstruct the request
    socket later.

    This adds a state table (1024 entries) to store the data present in the
    MP_JOIN syn request and the random nonce used for the cookie syn/ack.

    When an MP_JOIN ACK passes cookie validation, the table is consulted
    to rebuild the request socket.

    An alternate approach would be to "cancel" syn-cookie mode and force
    MP_JOIN to always use a syn queue entry.

    However, doing so brings the backlog over the configured queue limit.
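
    For illustration only -- the struct layout, field names and hash below are
    invented for this sketch and are not the kernel implementation -- a
    fixed-size table indexed by a 4-tuple hash could look like this in plain C:

    #include <stdbool.h>
    #include <stdint.h>

    #define JOIN_ENTRIES 1024            /* fixed-size table, as in the patch */

    struct join_entry {                  /* hypothetical layout */
        uint32_t saddr, daddr;           /* 4-tuple of the MP_JOIN syn */
        uint16_t sport, dport;
        uint32_t peer_nonce;             /* nonce carried in the MP_JOIN syn */
        uint32_t local_nonce;            /* random nonce used in the syn/ack */
        uint32_t token;                  /* identifies the mptcp connection */
        bool     valid;
    };

    static struct join_entry join_tbl[JOIN_ENTRIES];

    static unsigned int join_hash(uint32_t saddr, uint32_t daddr,
                                  uint16_t sport, uint16_t dport)
    {
        /* any reasonable 4-tuple hash works for the sketch */
        return (saddr ^ daddr ^ ((uint32_t)sport << 16 | dport)) % JOIN_ENTRIES;
    }

    /* store state when a MP_JOIN syn is answered with a cookie syn/ack */
    static void join_store(const struct join_entry *e)
    {
        unsigned int i = join_hash(e->saddr, e->daddr, e->sport, e->dport);

        join_tbl[i] = *e;                /* old entries are simply overwritten */
        join_tbl[i].valid = true;
    }

    /* ACK passed cookie validation: recover the state for the request socket */
    static bool join_lookup(uint32_t saddr, uint32_t daddr,
                            uint16_t sport, uint16_t dport,
                            struct join_entry *out)
    {
        unsigned int i = join_hash(saddr, daddr, sport, dport);

        if (!join_tbl[i].valid ||
            join_tbl[i].saddr != saddr || join_tbl[i].daddr != daddr ||
            join_tbl[i].sport != sport || join_tbl[i].dport != dport)
            return false;                /* slot reused or never filled */

        *out = join_tbl[i];
        join_tbl[i].valid = false;
        return true;
    }

    Overwriting on collision keeps the table bounded; a lost entry only means
    that particular pending join can no longer be reconstructed from the table.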

    v2: use req->syncookie, not (removed) want_cookie arg

    Suggested-by: Paolo Abeni
    Signed-off-by: Florian Westphal
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • Will be used to initialize the mptcp request socket when an MP_CAPABLE
    request was handled in syncookie mode, i.e. when a TCP ACK containing an
    MP_CAPABLE option carries a valid syncookie value.

    Normally (non-cookie case), MPTCP generates a unique 32-bit connection
    ID and stores it in the MPTCP token storage to be able to retrieve the
    mptcp socket for subflow joining.

    In the syncookie case, we do not want to store any state, so we just
    generate the unique ID and use it in the reply.

    This means there is a small window where another connection could generate
    the same token.

    When the cookie ACK comes back, we check that the token has not been
    registered in the meantime. If it has, the connection needs to fall back
    to TCP.
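
    A minimal sketch of that flow in plain C; token_exists(), token_insert()
    and rand32() are toy stand-ins for the real token storage and RNG, not
    kernel APIs:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* toy stand-in for the token container, keyed by token value */
    #define TOKEN_SLOTS 256
    static void *token_map[TOKEN_SLOTS];

    static bool token_exists(uint32_t token)
    {
        return token_map[token % TOKEN_SLOTS] != NULL;
    }

    static bool token_insert(uint32_t token, void *msk)
    {
        void **slot = &token_map[token % TOKEN_SLOTS];

        if (*slot)
            return false;
        *slot = msk;
        return true;
    }

    static uint32_t rand32(void)
    {
        return ((uint32_t)rand() << 16) ^ (uint32_t)rand();
    }

    /* syncookie syn: pick a token that is unique *right now*, store nothing */
    static uint32_t cookie_pick_token(void)
    {
        uint32_t token;

        do {
            token = rand32();
        } while (token == 0 || token_exists(token));

        return token;                    /* used in the syn/ack reply */
    }

    /* cookie ack: the token may have been claimed while we kept no state */
    static bool cookie_ack_claim_token(uint32_t token, void *msk)
    {
        /* insertion failure means a collision in the window: fall back to TCP */
        return token_insert(token, msk);
    }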

    Changes in v2:
    - use req->syncookie instead of passing 'want_cookie' arg to ->init_req()
    (Eric Dumazet)

    Signed-off-by: Florian Westphal
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Florian Westphal
     

24 Jul, 2020

1 commit

  • Currently accepted msk sockets become established only after
    accept() returns the new sk to user-space.

    As MP_JOIN requests are refused, per the RFC, on sockets that are not
    fully established, the above causes mp_join self-test
    instabilities.

    This change lets the msk enter the established state as soon as it
    receives the 3rd ack, and propagates the first subflow's
    fully-established status to the msk socket.

    Finally, we can change the subflow acceptance condition to
    take into account both the sock state and the msk
    fully-established flag.
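
    As a hedged sketch of the resulting check -- the type and helper names
    below are illustrative, not the exact kernel ones -- an MP_JOIN is only
    admitted when both conditions hold:

    #include <stdbool.h>

    #define TCP_ESTABLISHED 1            /* matches the kernel's state enum */

    struct toy_msk {                     /* illustrative msk state only */
        int  sk_state;
        bool fully_established;          /* set once the 1st subflow completes */
    };

    /* RFC 8684: MP_JOIN is only accepted on fully established connections */
    static bool can_accept_new_subflow(const struct toy_msk *msk)
    {
        return msk->sk_state == TCP_ESTABLISHED && msk->fully_established;
    }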

    Reviewed-by: Mat Martineau
    Tested-by: Christoph Paasch
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

10 Jul, 2020

1 commit

  • mptcp_token_iter_next() allows traversing all the MPTCP
    sockets inside the token container that belong to the given
    network namespace, with fairly standard iterator semantics.

    That will be used by the next patch, but the API is kept generic,
    as we plan to use it later for the PM as well.

    Additionally export mptcp_token_get_sock(), as it also
    will be used by the diag module.
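
    A hedged sketch of the iterator shape in plain C; the bucket/position
    cursors mirror the usual diag-style dump loop, and none of the names or
    signatures below are the kernel ones:

    struct toy_sock {
        struct toy_sock *next;
    };

    #define BUCKETS 16
    static struct toy_sock *table[BUCKETS];     /* toy token container */

    /* resumeable walk: 'slot' is the bucket, 'num' the position inside it */
    static struct toy_sock *toy_iter_next(long *slot, long *num)
    {
        for (; *slot < BUCKETS; (*slot)++, *num = 0) {
            struct toy_sock *sk = table[*slot];
            long pos = 0;

            for (; sk; sk = sk->next, pos++) {
                if (pos < *num)
                    continue;            /* already visited in a prior call */
                *num = pos + 1;          /* remember where to resume */
                return sk;
            }
        }
        return NULL;                     /* namespace fully traversed */
    }

    A dump would then simply loop:

        long slot = 0, num = 0;
        struct toy_sock *sk;

        while ((sk = toy_iter_next(&slot, &num)) != NULL)
            ;                            /* emit diag info for sk here */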

    Reviewed-by: Mat Martineau
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

08 Jul, 2020

1 commit

  • We can re-use the existing work queue to handle path management
    instead of a dedicated work queue. Just move pm_worker to protocol.c,
    call it from the mptcp worker and get rid of the msk lock (already held).

    Signed-off-by: Florian Westphal
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Florian Westphal
     

02 Jul, 2020

1 commit

  • When mptcp is used, userspace doesn't read from the tcp (subflow)
    socket but from the parent (mptcp) socket receive queue.

    skbs are moved from the subflow socket to the mptcp rx queue either from
    the 'data_ready' callback (if the mptcp socket can be locked), from a work
    queue, or from the socket receive function.

    This means tcp_rcv_space_adjust() is never called and thus no receive
    buffer size auto-tuning is done.

    An earlier (not merged) patch added tcp_rcv_space_adjust() calls to the
    function that moves skbs from subflow to mptcp socket.
    While this enabled autotuning, it also meant tuning was done even if
    userspace was reading the mptcp socket very slowly.

    This adds mptcp_rcv_space_adjust() and calls it after userspace has
    read data from the mptcp socket rx queue.

    It is very similar to tcp_rcv_space_adjust(), with two differences:

    1. The rtt estimate is the largest one observed on any subflow
    2. The rcvbuf size and window clamp of all subflows are adjusted
    to the mptcp-level rcvbuf.

    Otherwise, we get spurious drops at tcp (subflow) socket level if
    the skbs are not moved to the mptcp socket fast enough.
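
    A simplified sketch of those two differences; the types and the sizing
    rule below are placeholders, not the real autotuning logic:

    #include <stdint.h>

    struct toy_subflow {
        uint32_t rtt_us;                 /* per-subflow smoothed rtt estimate */
        uint32_t rcvbuf;
        uint32_t window_clamp;
        struct toy_subflow *next;
    };

    struct toy_msk {
        struct toy_subflow *subflows;
        uint32_t rcvbuf;
        uint64_t copied;                 /* bytes read by userspace recently */
    };

    static uint32_t toy_bufsize(uint64_t copied, uint32_t rtt_us)
    {
        /* placeholder: the real code scales the observed copy rate by the rtt */
        return rtt_us ? (uint32_t)(2 * copied) : 0;
    }

    /* called after userspace has read from the mptcp rx queue */
    static void toy_rcv_space_adjust(struct toy_msk *msk)
    {
        struct toy_subflow *sf;
        uint32_t rtt_us = 0, newbuf;

        /* difference 1: use the largest rtt seen on any subflow */
        for (sf = msk->subflows; sf; sf = sf->next)
            if (sf->rtt_us > rtt_us)
                rtt_us = sf->rtt_us;

        newbuf = toy_bufsize(msk->copied, rtt_us);
        if (newbuf <= msk->rcvbuf)
            return;
        msk->rcvbuf = newbuf;

        /* difference 2: push the mptcp-level value down to every subflow, so
         * the subflows do not drop skbs the mptcp socket could still queue */
        for (sf = msk->subflows; sf; sf = sf->next) {
            sf->rcvbuf = newbuf;
            sf->window_clamp = newbuf;
        }
    }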

    Before:
    time mptcp_connect.sh -t -f $((4*1024*1024)) -d 300 -l 0.01% -r 0 -e "" -m mmap
    [..]
    ns4 MPTCP -> ns3 (10.0.3.2:10108 ) MPTCP (duration 40823ms) [ OK ]
    ns4 MPTCP -> ns3 (10.0.3.2:10109 ) TCP (duration 23119ms) [ OK ]
    ns4 TCP -> ns3 (10.0.3.2:10110 ) MPTCP (duration 5421ms) [ OK ]
    ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP (duration 41446ms) [ OK ]
    ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP (duration 23427ms) [ OK ]
    ns4 TCP -> ns3 (dead:beef:3::2:10113) MPTCP (duration 5426ms) [ OK ]
    Time: 1396 seconds

    After:
    ns4 MPTCP -> ns3 (10.0.3.2:10108 ) MPTCP (duration 5417ms) [ OK ]
    ns4 MPTCP -> ns3 (10.0.3.2:10109 ) TCP (duration 5427ms) [ OK ]
    ns4 TCP -> ns3 (10.0.3.2:10110 ) MPTCP (duration 5422ms) [ OK ]
    ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP (duration 5415ms) [ OK ]
    ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP (duration 5422ms) [ OK ]
    ns4 TCP -> ns3 (dead:beef:3::2:10113) MPTCP (duration 5423ms) [ OK ]
    Time: 296 seconds

    Signed-off-by: Florian Westphal
    Reviewed-by: Matthieu Baerts
    Signed-off-by: David S. Miller

    Florian Westphal
     

30 Jun, 2020

2 commits

  • When an MPTCP client tries to connect to itself, tcp_finish_connect() is
    never reached. Because of this, depending on the socket's current state,
    multiple faulty behaviours can be observed:

    1) a WARN_ON() in subflow_data_ready() is hit
    WARNING: CPU: 2 PID: 882 at net/mptcp/subflow.c:911 subflow_data_ready+0x18b/0x230
    [...]
    CPU: 2 PID: 882 Comm: gh35 Not tainted 5.7.0+ #187
    [...]
    RIP: 0010:subflow_data_ready+0x18b/0x230
    [...]
    Call Trace:
    tcp_data_queue+0xd2f/0x4250
    tcp_rcv_state_process+0xb1c/0x49d3
    tcp_v4_do_rcv+0x2bc/0x790
    __release_sock+0x153/0x2d0
    release_sock+0x4f/0x170
    mptcp_shutdown+0x167/0x4e0
    __sys_shutdown+0xe6/0x180
    __x64_sys_shutdown+0x50/0x70
    do_syscall_64+0x9a/0x370
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    2) client is stuck forever in mptcp_sendmsg() because the socket is not
    TCP_ESTABLISHED

    crash> bt 4847
    PID: 4847 TASK: ffff88814b2fb100 CPU: 1 COMMAND: "gh35"
    #0 [ffff8881376ff680] __schedule at ffffffff97248da4
    #1 [ffff8881376ff778] schedule at ffffffff9724a34f
    #2 [ffff8881376ff7a0] schedule_timeout at ffffffff97252ba0
    #3 [ffff8881376ff8a8] wait_woken at ffffffff958ab4ba
    #4 [ffff8881376ff940] sk_stream_wait_connect at ffffffff96c2d859
    #5 [ffff8881376ffa28] mptcp_sendmsg at ffffffff97207fca
    #6 [ffff8881376ffbc0] sock_sendmsg at ffffffff96be1b5b
    #7 [ffff8881376ffbe8] sock_write_iter at ffffffff96be1daa
    #8 [ffff8881376ffce8] new_sync_write at ffffffff95e5cb52
    #9 [ffff8881376ffe50] vfs_write at ffffffff95e6547f
    #10 [ffff8881376ffe90] ksys_write at ffffffff95e65d26
    #11 [ffff8881376fff28] do_syscall_64 at ffffffff956088ba
    #12 [ffff8881376fff50] entry_SYSCALL_64_after_hwframe at ffffffff9740008c
    RIP: 00007f126f6956ed RSP: 00007ffc2a320278 RFLAGS: 00000217
    RAX: ffffffffffffffda RBX: 0000000020000044 RCX: 00007f126f6956ed
    RDX: 0000000000000004 RSI: 00000000004007b8 RDI: 0000000000000003
    RBP: 00007ffc2a3202a0 R8: 0000000000400720 R9: 0000000000400720
    R10: 0000000000400720 R11: 0000000000000217 R12: 00000000004004b0
    R13: 00007ffc2a320380 R14: 0000000000000000 R15: 0000000000000000
    ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b

    3) tcpdump captures show that DSS is exchanged even when MP_CAPABLE handshake
    didn't complete.

    $ tcpdump -tnnr bad.pcap
    IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S], seq 3208913911, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291694721,nop,wscale 7,mptcp capable v1], length 0
    IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S.], seq 3208913911, ack 3208913912, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291706876,nop,wscale 7,mptcp capable v1], length 0
    IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 1, win 512, options [nop,nop,TS val 3291706876 ecr 3291706876], length 0
    IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 3291707876 ecr 3291706876,mptcp dss fin seq 0 subseq 0 len 1,nop,nop], length 0
    IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 2, win 512, options [nop,nop,TS val 3291707876 ecr 3291707876], length 0

    Force a fallback to TCP in these cases, and adjust the main socket
    state to avoid hanging in mptcp_sendmsg().

    Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/35
    Reported-by: Christoph Paasch
    Suggested-by: Paolo Abeni
    Signed-off-by: Davide Caratti
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Davide Caratti
     
  • Keep using MPTCP sockets and use a "dummy mapping" in case of fallback
    to regular TCP. When fallback is triggered, skip the addition of the MPTCP
    option on send.

    Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/11
    Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/22
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Davide Caratti
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Davide Caratti
     

27 Jun, 2020

2 commits

  • Replace the radix tree with a hash table allocated
    at boot time. The radix tree has some shortcomings:
    a single lock is contended by all the mptcp operations,
    lookups currently take that lock, and traversing
    all the items would require a lock, too.

    With a hash table we instead trade a little memory to
    address all of the above: a per-bucket lock is used.

    To hash the MPTCP sockets, we re-use the msk's sk_node
    entry: the MPTCP sockets are never hashed by the stack.
    Replace the existing hash proto callbacks with a dummy
    implementation, annotating the above constraint.

    Additionally, refactor the token creation code to:

    - limit the number of consecutive attempts to a fixed
    maximum. Hitting a hash bucket with a long chain is
    considered a failed attempt

    - ensure accept() can no longer fail due to token management

    - fall back to TCP if token creation fails at connect() time
    (previously the connection was closed)
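
    A hedged sketch of the container shape and of the bounded creation loop,
    with pthread mutexes standing in for the kernel's per-bucket spinlocks;
    names and constants are illustrative only:

    #include <pthread.h>
    #include <stdint.h>
    #include <stdlib.h>

    struct token_node {                  /* not the kernel layout */
        uint32_t token;
        void *msk;
        struct token_node *next;
    };

    struct token_bucket {
        pthread_mutex_t lock;            /* per-bucket lock: no global contention */
        struct token_node *chain;
        unsigned int chain_len;
    };

    #define TOKEN_BUCKETS      1024      /* the real table is sized at boot */
    #define MAX_TOKEN_ATTEMPTS 4         /* bound on consecutive attempts */
    #define MAX_CHAIN_LEN      4         /* a long chain counts as a failure */

    static struct token_bucket tbl[TOKEN_BUCKETS];

    static void token_tbl_init(void)     /* run once at startup */
    {
        for (int i = 0; i < TOKEN_BUCKETS; i++)
            pthread_mutex_init(&tbl[i].lock, NULL);
    }

    /* 0 on success; -1 means "no token, the caller falls back to plain TCP" */
    static int token_new(void *msk, uint32_t (*rnd)(void), uint32_t *out)
    {
        for (int i = 0; i < MAX_TOKEN_ATTEMPTS; i++) {
            uint32_t token = rnd();
            struct token_bucket *b = &tbl[token % TOKEN_BUCKETS];

            pthread_mutex_lock(&b->lock);
            /* (the real code also rejects a token already present in the chain) */
            if (token && b->chain_len < MAX_CHAIN_LEN) {
                struct token_node *n = malloc(sizeof(*n));

                if (n) {
                    n->token = token;
                    n->msk = msk;
                    n->next = b->chain;
                    b->chain = n;
                    b->chain_len++;
                    pthread_mutex_unlock(&b->lock);
                    *out = token;
                    return 0;
                }
            }
            pthread_mutex_unlock(&b->lock);
        }
        return -1;
    }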

    v1 -> v2:
    - fix "no newline at end of file" - Jakub

    Signed-off-by: Paolo Abeni
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Add the missing annotation in some setup-only
    functions.

    Signed-off-by: Paolo Abeni
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     

19 Jun, 2020

1 commit

  • The msk ownership is transferred to the child socket at
    3rd ack time, so that we avoid more lookups later. If the
    request does not reach the 3rd ack, the MSK reference is
    dropped at request sock release time.

    As a side effect, fallback is now tracked by a NULL msk
    reference instead of a zeroed 'mp_join' field. This will
    simplify the next patch.

    Signed-off-by: Paolo Abeni
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     

23 May, 2020

1 commit

  • There is some ambiguity in the RFC as to whether the ADD_ADDR HMAC is
    the rightmost 64 bits of the entire hash or of the leftmost 160 bits
    of the hash. The intention, as clarified with the author of the RFC,
    is the entire hash.

    This change returns the entire hash from
    mptcp_crypto_hmac_sha (instead of only the first 160 bits), and moves
    any truncation/selection operation on the hash to the caller.
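
    For illustration, a hedged caller-side sketch; hmac_sha256() is only a
    declared stand-in for the crypto helper, and endianness handling is
    omitted. With the clarified interpretation, the ADD_ADDR field is the last
    8 bytes of the full 32-byte digest:

    #include <stdint.h>
    #include <string.h>

    #define SHA256_DIGEST_SIZE 32

    /* stand-in for the real HMAC-SHA256 primitive, declaration only */
    void hmac_sha256(const uint8_t *key, size_t key_len,
                     const uint8_t *msg, size_t msg_len,
                     uint8_t out[SHA256_DIGEST_SIZE]);

    /* truncation now lives with the caller: ADD_ADDR carries the rightmost
     * 64 bits of the *entire* 256-bit hash, not of a 160-bit prefix of it */
    static uint64_t add_addr_hmac(const uint8_t *key, size_t key_len,
                                  const uint8_t *msg, size_t msg_len)
    {
        uint8_t full[SHA256_DIGEST_SIZE];
        uint64_t trunc;

        hmac_sha256(key, key_len, msg, msg_len, full);
        memcpy(&trunc, &full[SHA256_DIGEST_SIZE - sizeof(trunc)], sizeof(trunc));
        return trunc;
    }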

    Fixes: 12555a2d97e5 ("mptcp: use rightmost 64 bits in ADD_ADDR HMAC")
    Reviewed-by: Christoph Paasch
    Reviewed-by: Mat Martineau
    Signed-off-by: Todd Malsbary
    Signed-off-by: David S. Miller

    Todd Malsbary
     

17 May, 2020

1 commit

  • RFC 8684 allows sending 32-bit DATA_ACKs as long as the peer is not
    sending 64-bit data-sequence numbers. The 64-bit DSN is only there for
    extreme scenarios when a very high throughput subflow is combined with a
    long-RTT subflow such that the high-throughput subflow wraps around the
    32-bit sequence number space within an RTT of the high-RTT subflow.

    It is thus a rare scenario and we should use the 32-bit DATA_ACK
    for as long as possible. It reduces the TCP-option overhead
    by 4 bytes, making space for an additional SACK block. It also makes
    tcpdumps much easier to read when the DSN and DATA_ACK are both either
    32-bit or 64-bit.

    Signed-off-by: Christoph Paasch
    Reviewed-by: Matthieu Baerts
    Signed-off-by: David S. Miller

    Christoph Paasch
     

01 May, 2020

1 commit

  • The mptcp_options_received structure carries several per-packet
    flags (mp_capable, mp_join, etc.). Such fields must
    be cleared on each packet, even on dropped ones or packets
    not carrying any MPTCP options, but the current mptcp
    code clears them only on TCP option reset.

    In several races/corner cases we end up with stray bits in the
    incoming options, leading to WARN_ON splats, e.g.:

    [ 171.164906] Bad mapping: ssn=32714 map_seq=1 map_data_len=32713
    [ 171.165006] WARNING: CPU: 1 PID: 5026 at net/mptcp/subflow.c:533 warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
    [ 171.167632] Modules linked in: ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel geneve ip6_udp_tunnel udp_tunnel macsec macvtap tap ipvlan macvlan 8021q garp mrp xfrm_interface veth netdevsim nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun binfmt_misc intel_rapl_msr intel_rapl_common rfkill kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 sunrpc ip_tables xfs libcrc32c crc32c_intel serio_raw virtio_console ata_generic virtio_blk virtio_net net_failover failover ata_piix libata
    [ 171.199464] CPU: 1 PID: 5026 Comm: repro Not tainted 5.7.0-rc1.mptcp_f227fdf5d388+ #95
    [ 171.200886] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
    [ 171.202546] RIP: 0010:warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
    [ 171.206537] Code: c1 ea 03 0f b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 1d 8b 55 3c 44 89 e6 48 c7 c7 20 51 13 95 e8 37 8b 22 fe 0b 48 83 c4 08 5b 5d 41 5c c3 89 4c 24 04 e8 db d6 94 fe 8b 4c
    [ 171.220473] RSP: 0018:ffffc90000150560 EFLAGS: 00010282
    [ 171.221639] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
    [ 171.223108] RDX: 0000000000000000 RSI: 0000000000000008 RDI: fffff5200002a09e
    [ 171.224388] RBP: ffff8880aa6e3c00 R08: 0000000000000001 R09: fffffbfff2ec9955
    [ 171.225706] R10: ffffffff9764caa7 R11: fffffbfff2ec9954 R12: 0000000000007fca
    [ 171.227211] R13: ffff8881066f4a7f R14: ffff8880aa6e3c00 R15: 0000000000000020
    [ 171.228460] FS: 00007f8623719740(0000) GS:ffff88810be00000(0000) knlGS:0000000000000000
    [ 171.230065] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 171.231303] CR2: 00007ffdab190a50 CR3: 00000001038ea006 CR4: 0000000000160ee0
    [ 171.232586] Call Trace:
    [ 171.233109]
    [ 171.233531] get_mapping_status (linux-mptcp/net/mptcp/subflow.c:691)
    [ 171.234371] mptcp_subflow_data_available (linux-mptcp/net/mptcp/subflow.c:736 linux-mptcp/net/mptcp/subflow.c:832)
    [ 171.238181] subflow_state_change (linux-mptcp/net/mptcp/subflow.c:1085 (discriminator 1))
    [ 171.239066] tcp_fin (linux-mptcp/net/ipv4/tcp_input.c:4217)
    [ 171.240123] tcp_data_queue (linux-mptcp/./include/linux/compiler.h:199 linux-mptcp/net/ipv4/tcp_input.c:4822)
    [ 171.245083] tcp_rcv_established (linux-mptcp/./include/linux/skbuff.h:1785 linux-mptcp/./include/net/tcp.h:1774 linux-mptcp/./include/net/tcp.h:1847 linux-mptcp/net/ipv4/tcp_input.c:5238 linux-mptcp/net/ipv4/tcp_input.c:5730)
    [ 171.254089] tcp_v4_rcv (linux-mptcp/./include/linux/spinlock.h:393 linux-mptcp/net/ipv4/tcp_ipv4.c:2009)
    [ 171.258969] ip_protocol_deliver_rcu (linux-mptcp/net/ipv4/ip_input.c:204 (discriminator 1))
    [ 171.260214] ip_local_deliver_finish (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/ipv4/ip_input.c:232)
    [ 171.261389] ip_local_deliver (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:252)
    [ 171.265884] ip_rcv (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:539)
    [ 171.273666] process_backlog (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/core/dev.c:6135)
    [ 171.275328] net_rx_action (linux-mptcp/net/core/dev.c:6572 linux-mptcp/net/core/dev.c:6640)
    [ 171.280472] __do_softirq (linux-mptcp/./arch/x86/include/asm/jump_label.h:25 linux-mptcp/./include/linux/jump_label.h:200 linux-mptcp/./include/trace/events/irq.h:142 linux-mptcp/kernel/softirq.c:293)
    [ 171.281379] do_softirq_own_stack (linux-mptcp/arch/x86/entry/entry_64.S:1083)
    [ 171.282358]

    We could address the issue by explicitly clearing the relevant fields
    in several places - tcp_parse_option, tcp_fast_parse_options,
    possibly others.

    Instead we move the MPTCP option parsing into the already existing
    mptcp ingress hook, so that the fields need to be cleared in a single
    place.

    This allows us to drop an MPTCP hook from the TCP code and to
    remove the quite large mptcp_options_received from the tcp_sock
    struct. On the flip side, MPTCP sockets will traverse the
    option space twice (in tcp_parse_option() and in
    mptcp_incoming_options()). That looks acceptable: we already
    do that for syn and 3rd-ack packets, plain TCP sockets will
    benefit from it, and even MPTCP sockets will gain better
    code locality, reducing the jumps between TCP and MPTCP code.
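
    As an illustration of the single clearing point -- the struct and function
    below are simplified stand-ins, not the kernel definitions:

    #include <string.h>

    struct toy_options_received {        /* simplified stand-in */
        unsigned int mp_capable : 1;
        unsigned int mp_join : 1;
        unsigned int dss : 1;
        /* ... sequence numbers, keys, etc. ... */
    };

    static void toy_incoming_options(struct toy_options_received *mp_opt,
                                     const unsigned char *opts, int len)
    {
        /* every packet starts from a clean slate here, including packets
         * that carry no MPTCP options at all and packets later dropped */
        memset(mp_opt, 0, sizeof(*mp_opt));

        /* ... walk opts[0..len) and set only the flags actually present ... */
        (void)opts;
        (void)len;
    }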

    v1 -> v2:
    - rebased on current '-net' tree

    Fixes: 648ef4b88673 ("mptcp: Implement MPTCP receive path")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

21 Apr, 2020

1 commit

  • We don't need them, as we can use the current ingress opt
    data instead. Setting them in syn_recv_sock() may cause an
    inconsistent mptcp socket status, as per the previous commit.

    Fixes: cc7972ea1932 ("mptcp: parse and emit MP_CAPABLE option according to v1 spec")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

02 Apr, 2020

1 commit

  • This is needed at least until proper MPTCP-level fin/reset
    signalling gets added:

    We wake the parent when a subflow changes state, but we should do this
    only when all subflows have closed, not just one.

    Schedule the mptcp worker and tell it to check eof state on all
    subflows.

    Only flag mptcp socket as closed and wake userspace processes blocking
    in poll if all subflows have closed.
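
    A hedged sketch of the check the worker performs; types and names are
    invented and the list handling is simplified:

    #include <stdbool.h>

    struct toy_subflow {
        bool rx_closed;                  /* subflow received fin/reset */
        struct toy_subflow *next;
    };

    struct toy_msk {
        struct toy_subflow *subflows;
        bool at_eof;
    };

    /* run from the mptcp worker after any subflow state change */
    static void toy_check_for_eof(struct toy_msk *msk)
    {
        struct toy_subflow *sf;

        for (sf = msk->subflows; sf; sf = sf->next)
            if (!sf->rx_closed)
                return;                  /* at least one subflow still open */

        if (!msk->at_eof) {
            msk->at_eof = true;
            /* the real code sets the socket state and wakes poll() waiters */
        }
    }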

    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

30 Mar, 2020

12 commits

  • Expose a new netlink family to userspace to control the PM, setting:

    - list of local addresses to be signalled.
    - list of local addresses used to create subflows.
    - maximum number of add_addr options to react to

    When the msk is fully established, the PM netlink attempts to
    announce the 'signal' list via the ADD_ADDR option. Since we
    currently lack the ADD_ADDR echo (and the related event), only the
    first addr is sent.

    After exhausting the 'announce' list, the PM tries to create a
    subflow for each addr in the 'local' list, waiting for each
    connection to complete before attempting the next one.

    The idea is to add an additional PM hook for the ADD_ADDR echo, to allow
    the PM netlink to announce multiple addresses in sequence.
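
    A hedged, plain-C sketch of that sequencing; announce_add_addr() and
    create_subflow_and_wait() are hypothetical helpers used only for
    illustration, not part of the actual PM netlink interface:

    #include <stdbool.h>
    #include <stddef.h>

    struct toy_addr { unsigned int id; };

    struct toy_pm {
        struct toy_addr signal[8]; size_t n_signal;  /* addresses to announce */
        struct toy_addr local[8];  size_t n_local;   /* addresses for subflows */
        size_t announced, connected;
    };

    /* hypothetical helpers, declarations only */
    void announce_add_addr(const struct toy_addr *a);
    bool create_subflow_and_wait(const struct toy_addr *a);

    /* run once the msk is fully established */
    static void toy_pm_worker(struct toy_pm *pm)
    {
        /* without ADD_ADDR echo support only the first address is announced */
        if (pm->n_signal && pm->announced == 0) {
            announce_add_addr(&pm->signal[0]);
            pm->announced = 1;
        }

        /* then create one subflow per 'local' address, strictly in sequence */
        while (pm->connected < pm->n_local) {
            if (!create_subflow_and_wait(&pm->local[pm->connected]))
                break;
            pm->connected++;
        }
    }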

    Co-developed-by: Matthieu Baerts
    Signed-off-by: Matthieu Baerts
    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Add ULP-specific diagnostic functions, so that subflow information can be
    dumped to userspace programs like 'ss'.

    v2 -> v3:
    - uapi: use bit macros appropriate for userspace

    Co-developed-by: Matthieu Baerts
    Signed-off-by: Matthieu Baerts
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Davide Caratti
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Davide Caratti
     
  • On a timeout event, schedule a work queue to do the retransmission.
    The retransmission code closely resembles the sendmsg() implementation and
    re-uses mptcp_sendmsg_frag, providing a dummy msghdr - for flags'
    sake - and peeking the relevant dfrag from the rtx head.

    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • After adding wmem accounting for the mptcp socket we could get
    into a situation where the mptcp socket can't transmit more data,
    and mptcp_clean_una doesn't reduce wmem even if snd_una has advanced,
    because it currently only removes entire dfrags.

    Allow advancing the dfrag head sequence and reduce wmem,
    even though this isn't correct (as we can't release the page).

    Because we will soon block on the mptcp sk when wmem is too large,
    call sk_stream_write_space() whenever we reduce the backlog, so that a
    userspace task blocked in sendmsg or poll will be woken up.

    This isn't an issue if the send buffer is large, but it is when
    SO_SNDBUF is used to reduce it to a lower value.

    Note we can still get a deadlock for low SO_SNDBUF values in
    case both sides of the connection write to the socket: both could
    be blocked due to wmem being too small -- and current mptcp stack
    will only increment mptcp ack_seq on recv.

    This doesn't happen with the selftest as it uses poll() and
    will always call recv if there is data to read.

    Signed-off-by: Florian Westphal
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • The timer will be used to schedule retransmissions. Its
    frequency is based on the current subflow RTO estimation and
    it is reset on every una_seq update.

    The timer is cleared for good by __mptcp_clear_xmit().

    Also clean the MPTCP rtx queue before each transmission.

    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Keep the send page fragment on an MPTCP-level retransmission queue.
    The queue entries are allocated inside the page frag allocator,
    acquiring an additional reference to the page for each list entry.

    Also switch to a custom page frag refill function, to ensure that
    the current page fragment can always host an MPTCP rtx queue entry.

    The MPTCP rtx queue is flushed at disconnect() and close() time.

    Note that now we need to call __mptcp_init_sock() regardless of mptcp
    enable status, as the destructor will try to walk the rtx_queue.

    v2 -> v3:
    - remove 'inline' in foo.c files (David S. Miller)

    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • So that we keep the MPTCP-level unacked sequence number consistent; since
    we update per-msk data, use an atomic64 cmpxchg() to protect
    against concurrent updates from multiple subflows.

    Initialize the snd_una at connect()/accept() time.
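
    A hedged sketch of that update using C11 atomics in place of the kernel's
    atomic64 helpers; any subflow may try to advance the shared value, but a
    stale ack can never move it backwards:

    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic uint64_t msk_snd_una;     /* MPTCP-level unacked sequence */

    /* called from any subflow when a DATA_ACK advancing to new_una arrives */
    static void toy_update_snd_una(uint64_t new_una)
    {
        uint64_t old = atomic_load(&msk_snd_una);

        /* retry until we win the race or observe an already-newer value;
         * a failed compare-exchange reloads 'old' with the current value */
        while (old < new_una &&
               !atomic_compare_exchange_weak(&msk_snd_una, &old, new_una))
            ;
    }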

    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Fill in more path manager functionality by adding a worker function and
    modifying the related stub functions to schedule the worker.

    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Peter Krystad
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Peter Krystad
     
  • Subflow creation may be initiated by the path manager when
    the primary connection is fully established and a remote
    address has been received via ADD_ADDR.

    Create an in-kernel sock and use kernel_connect() to
    initiate the connection.

    Passive sockets can't acquire the mptcp socket lock at
    subflow creation time, so an additional list protected by
    a new spinlock is used to track the MPJ subflows.

    Such a list is spliced into the conn_list tail every time the msk
    socket lock is acquired, so that it will not interfere
    with data flow on the original connection.

    Data flow and connection failover are not addressed by this commit.
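
    A hedged sketch of that two-list scheme in plain C; a pthread mutex stands
    in for the new spinlock, list handling is simplified and all names are
    invented:

    #include <pthread.h>
    #include <stddef.h>

    struct toy_subflow {
        struct toy_subflow *next;
    };

    struct toy_msk {
        /* conn_list: only touched with the msk socket lock held */
        struct toy_subflow *conn_list;

        /* join_list: passive MPJ subflows land here under the spinlock-like
         * lock, since they cannot take the msk socket lock at creation time */
        pthread_mutex_t join_lock;
        struct toy_subflow *join_list;
    };

    /* called from the path that creates a passive MPJ subflow */
    static void toy_track_join_subflow(struct toy_msk *msk, struct toy_subflow *sf)
    {
        pthread_mutex_lock(&msk->join_lock);
        sf->next = msk->join_list;
        msk->join_list = sf;
        pthread_mutex_unlock(&msk->join_lock);
    }

    /* called every time the msk socket lock is acquired */
    static void toy_splice_join_list(struct toy_msk *msk)
    {
        struct toy_subflow *pending, *next;

        pthread_mutex_lock(&msk->join_lock);
        pending = msk->join_list;
        msk->join_list = NULL;
        pthread_mutex_unlock(&msk->join_lock);

        /* append to the conn_list tail so existing subflows are undisturbed */
        for (; pending; pending = next) {
            struct toy_subflow **tail = &msk->conn_list;

            next = pending->next;
            while (*tail)
                tail = &(*tail)->next;
            pending->next = NULL;
            *tail = pending;
        }
    }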

    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Co-developed-by: Matthieu Baerts
    Signed-off-by: Matthieu Baerts
    Signed-off-by: Peter Krystad
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Peter Krystad
     
  • Process the MP_JOIN option in a SYN packet with the same flow
    as MP_CAPABLE, but when the third ACK is received add the
    subflow to the MPTCP socket's subflow list instead of adding it to
    the TCP socket accept queue.

    The subflow is added at the end of the subflow list so it will not
    interfere with the existing subflows operation and no data is
    expected to be transmitted on it.

    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Peter Krystad
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Peter Krystad
     
  • Add enough of a path manager interface to allow sending of ADD_ADDR
    when an incoming MPTCP connection is created. Capable of sending only
    a single IPv4 ADD_ADDR option. The 'pm_data' element of the connection
    sock will need to be expanded to handle multiple interfaces and IPv6.
    Partial processing of the incoming ADD_ADDR is included so the path
    manager notification of that event happens at the proper time, which
    involves validating the incoming address information.

    This is a skeleton interface definition for events generated by
    MPTCP.

    Co-developed-by: Matthieu Baerts
    Signed-off-by: Matthieu Baerts
    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Co-developed-by: Mat Martineau
    Signed-off-by: Mat Martineau
    Signed-off-by: Peter Krystad
    Signed-off-by: David S. Miller

    Peter Krystad
     
  • Add handling for sending and receiving the ADD_ADDR, ADD_ADDR6,
    and RM_ADDR suboptions.

    Co-developed-by: Matthieu Baerts
    Signed-off-by: Matthieu Baerts
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Peter Krystad
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Peter Krystad
     

15 Mar, 2020

1 commit

  • This change moves the mptcp socket allocation from mptcp_accept() to
    subflow_syn_recv_sock(), so that subflow->conn is now always set
    for the non-fallback scenario.

    It allows cleaning up mptcp_accept() a bit, reducing the additional
    locking, and will allow further cleanup in the next patch.

    Signed-off-by: Paolo Abeni
    Reviewed-by: Matthieu Baerts
    Signed-off-by: David S. Miller

    Paolo Abeni
     

04 Mar, 2020

1 commit

  • Instead of reading the MPTCP-level sequence number when sending DATA_FIN,
    store the data in the subflow so it can be safely accessed when the
    subflow TCP headers are written to the packet without the MPTCP-level
    lock held. This also allows the MPTCP-level socket to close individual
    subflows without closing the MPTCP connection.

    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Mat Martineau