01 Aug, 2020

2 commits

  • JOIN requests do not work in syncookie mode -- for HMAC validation, the
    peer's nonce and the mptcp token (needed to find the connection socket the
    join is meant for) are required, but this information is only present in
    the initial syn.

    So either we need to drop all JOIN requests once a listening socket enters
    syncookie mode, or we need to store enough state to reconstruct the request
    socket later.

    This adds a state table (1024 entries) to store the data present in the
    MP_JOIN syn request and the random nonce used for the cookie syn/ack.

    When an MP_JOIN ACK passes cookie validation, the table is consulted
    to rebuild the request socket.

    An alternate approach would be to "cancel" syn-cookie mode and force
    MP_JOIN to always use a syn queue entry.

    However, doing so brings the backlog over the configured queue limit.
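
    For illustration only -- the struct layout, field names and hash below are
    invented for this sketch and are not the kernel implementation -- a
    fixed-size table indexed by a 4-tuple hash could look like this in plain C:

    #include <stdbool.h>
    #include <stdint.h>

    #define JOIN_ENTRIES 1024            /* fixed-size table, as in the patch */

    struct join_entry {                  /* hypothetical layout */
        uint32_t saddr, daddr;           /* 4-tuple of the MP_JOIN syn */
        uint16_t sport, dport;
        uint32_t peer_nonce;             /* nonce carried in the MP_JOIN syn */
        uint32_t local_nonce;            /* random nonce used in the syn/ack */
        uint32_t token;                  /* identifies the mptcp connection */
        bool     valid;
    };

    static struct join_entry join_tbl[JOIN_ENTRIES];

    static unsigned int join_hash(uint32_t saddr, uint32_t daddr,
                                  uint16_t sport, uint16_t dport)
    {
        /* any reasonable 4-tuple hash works for the sketch */
        return (saddr ^ daddr ^ ((uint32_t)sport << 16 | dport)) % JOIN_ENTRIES;
    }

    /* store state when a MP_JOIN syn is answered with a cookie syn/ack */
    static void join_store(const struct join_entry *e)
    {
        unsigned int i = join_hash(e->saddr, e->daddr, e->sport, e->dport);

        join_tbl[i] = *e;                /* old entries are simply overwritten */
        join_tbl[i].valid = true;
    }

    /* ACK passed cookie validation: recover the state for the request socket */
    static bool join_lookup(uint32_t saddr, uint32_t daddr,
                            uint16_t sport, uint16_t dport,
                            struct join_entry *out)
    {
        unsigned int i = join_hash(saddr, daddr, sport, dport);

        if (!join_tbl[i].valid ||
            join_tbl[i].saddr != saddr || join_tbl[i].daddr != daddr ||
            join_tbl[i].sport != sport || join_tbl[i].dport != dport)
            return false;                /* slot reused or never filled */

        *out = join_tbl[i];
        join_tbl[i].valid = false;
        return true;
    }

    Overwriting on collision keeps the table bounded; a lost entry only means
    that particular pending join can no longer be reconstructed from the table.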

    v2: use req->syncookie, not (removed) want_cookie arg

    Suggested-by: Paolo Abeni
    Signed-off-by: Florian Westphal
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • Will be used to initialize the mptcp request socket when an MP_CAPABLE
    request was handled in syncookie mode, i.e. when a TCP ACK containing an
    MP_CAPABLE option carries a valid syncookie value.

    Normally (non-cookie case), MPTCP generates a unique 32-bit connection
    ID and stores it in the MPTCP token storage to be able to retrieve the
    mptcp socket for subflow joining.

    In the syncookie case, we do not want to store any state, so we just
    generate the unique ID and use it in the reply.

    This means there is a small window where another connection could generate
    the same token.

    When the cookie ACK comes back, we check that the token has not been
    registered in the meantime. If it has, the connection needs to fall back
    to TCP.
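
    A minimal sketch of that flow in plain C; token_exists(), token_insert()
    and rand32() are toy stand-ins for the real token storage and RNG, not
    kernel APIs:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* toy stand-in for the token container, keyed by token value */
    #define TOKEN_SLOTS 256
    static void *token_map[TOKEN_SLOTS];

    static bool token_exists(uint32_t token)
    {
        return token_map[token % TOKEN_SLOTS] != NULL;
    }

    static bool token_insert(uint32_t token, void *msk)
    {
        void **slot = &token_map[token % TOKEN_SLOTS];

        if (*slot)
            return false;
        *slot = msk;
        return true;
    }

    static uint32_t rand32(void)
    {
        return ((uint32_t)rand() << 16) ^ (uint32_t)rand();
    }

    /* syncookie syn: pick a token that is unique *right now*, store nothing */
    static uint32_t cookie_pick_token(void)
    {
        uint32_t token;

        do {
            token = rand32();
        } while (token == 0 || token_exists(token));

        return token;                    /* used in the syn/ack reply */
    }

    /* cookie ack: the token may have been claimed while we kept no state */
    static bool cookie_ack_claim_token(uint32_t token, void *msk)
    {
        /* insertion failure means a collision in the window: fall back to TCP */
        return token_insert(token, msk);
    }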

    Changes in v2:
    - use req->syncookie instead of passing 'want_cookie' arg to ->init_req()
    (Eric Dumazet)

    Signed-off-by: Florian Westphal
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Florian Westphal
     

24 Jul, 2020

1 commit

  • Currently accepted msk sockets become established only after
    accept() returns the new sk to user-space.

    As MP_JOIN requests are refused, per the RFC, on sockets that are not
    fully established, the above causes mp_join self-test
    instabilities.

    This change lets the msk enter the established state as soon as it
    receives the 3rd ack, and propagates the first subflow's
    fully-established status to the msk socket.

    Finally, we can change the subflow acceptance condition to
    take into account both the sock state and the msk
    fully-established flag.
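
    As a hedged sketch of the resulting check -- the type and helper names
    below are illustrative, not the exact kernel ones -- an MP_JOIN is only
    admitted when both conditions hold:

    #include <stdbool.h>

    #define TCP_ESTABLISHED 1            /* matches the kernel's state enum */

    struct toy_msk {                     /* illustrative msk state only */
        int  sk_state;
        bool fully_established;          /* set once the 1st subflow completes */
    };

    /* RFC 8684: MP_JOIN is only accepted on fully established connections */
    static bool can_accept_new_subflow(const struct toy_msk *msk)
    {
        return msk->sk_state == TCP_ESTABLISHED && msk->fully_established;
    }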

    Reviewed-by: Mat Martineau
    Tested-by: Christoph Paasch
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

10 Jul, 2020

1 commit

  • mptcp_token_iter_next() allows traversing all the MPTCP
    sockets inside the token container that belong to the given
    network namespace, with fairly standard iterator semantics.

    That will be used by the next patch, but the API is kept generic,
    as we plan to use it later for the PM as well.

    Additionally export mptcp_token_get_sock(), as it also
    will be used by the diag module.
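
    A hedged sketch of the iterator shape in plain C; the bucket/position
    cursors mirror the usual diag-style dump loop, and none of the names or
    signatures below are the kernel ones:

    struct toy_sock {
        struct toy_sock *next;
    };

    #define BUCKETS 16
    static struct toy_sock *table[BUCKETS];     /* toy token container */

    /* resumeable walk: 'slot' is the bucket, 'num' the position inside it */
    static struct toy_sock *toy_iter_next(long *slot, long *num)
    {
        for (; *slot < BUCKETS; (*slot)++, *num = 0) {
            struct toy_sock *sk = table[*slot];
            long pos = 0;

            for (; sk; sk = sk->next, pos++) {
                if (pos < *num)
                    continue;            /* already visited in a prior call */
                *num = pos + 1;          /* remember where to resume */
                return sk;
            }
        }
        return NULL;                     /* namespace fully traversed */
    }

    A dump would then simply loop:

        long slot = 0, num = 0;
        struct toy_sock *sk;

        while ((sk = toy_iter_next(&slot, &num)) != NULL)
            ;                            /* emit diag info for sk here */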

    Reviewed-by: Mat Martineau
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

08 Jul, 2020

1 commit

  • We can re-use the existing work queue to handle path management
    instead of a dedicated work queue. Just move pm_worker to protocol.c,
    call it from the mptcp worker and get rid of the msk lock (already held).

    Signed-off-by: Florian Westphal
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Florian Westphal
     

02 Jul, 2020

1 commit

  • When mptcp is used, userspace doesn't read from the tcp (subflow)
    socket but from the parent (mptcp) socket receive queue.

    skbs are moved from the subflow socket to the mptcp rx queue either from
    the 'data_ready' callback (if the mptcp socket can be locked), from a work
    queue, or from the socket receive function.

    This means tcp_rcv_space_adjust() is never called and thus no receive
    buffer size auto-tuning is done.

    An earlier (not merged) patch added tcp_rcv_space_adjust() calls to the
    function that moves skbs from subflow to mptcp socket.
    While this enabled autotuning, it also meant tuning was done even if
    userspace was reading the mptcp socket very slowly.

    This adds mptcp_rcv_space_adjust() and calls it after userspace has
    read data from the mptcp socket rx queue.

    It is very similar to tcp_rcv_space_adjust(), with two differences:

    1. The rtt estimate is the largest one observed on any subflow
    2. The rcvbuf size and window clamp of all subflows are adjusted
    to the mptcp-level rcvbuf.

    Otherwise, we get spurious drops at tcp (subflow) socket level if
    the skbs are not moved to the mptcp socket fast enough.
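
    A simplified sketch of those two differences; the types and the sizing
    rule below are placeholders, not the real autotuning logic:

    #include <stdint.h>

    struct toy_subflow {
        uint32_t rtt_us;                 /* per-subflow smoothed rtt estimate */
        uint32_t rcvbuf;
        uint32_t window_clamp;
        struct toy_subflow *next;
    };

    struct toy_msk {
        struct toy_subflow *subflows;
        uint32_t rcvbuf;
        uint64_t copied;                 /* bytes read by userspace recently */
    };

    static uint32_t toy_bufsize(uint64_t copied, uint32_t rtt_us)
    {
        /* placeholder: the real code scales the observed copy rate by the rtt */
        return rtt_us ? (uint32_t)(2 * copied) : 0;
    }

    /* called after userspace has read from the mptcp rx queue */
    static void toy_rcv_space_adjust(struct toy_msk *msk)
    {
        struct toy_subflow *sf;
        uint32_t rtt_us = 0, newbuf;

        /* difference 1: use the largest rtt seen on any subflow */
        for (sf = msk->subflows; sf; sf = sf->next)
            if (sf->rtt_us > rtt_us)
                rtt_us = sf->rtt_us;

        newbuf = toy_bufsize(msk->copied, rtt_us);
        if (newbuf <= msk->rcvbuf)
            return;
        msk->rcvbuf = newbuf;

        /* difference 2: push the mptcp-level value down to every subflow, so
         * the subflows do not drop skbs the mptcp socket could still queue */
        for (sf = msk->subflows; sf; sf = sf->next) {
            sf->rcvbuf = newbuf;
            sf->window_clamp = newbuf;
        }
    }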

    Before:
    time mptcp_connect.sh -t -f $((4*1024*1024)) -d 300 -l 0.01% -r 0 -e "" -m mmap
    [..]
    ns4 MPTCP -> ns3 (10.0.3.2:10108 ) MPTCP (duration 40823ms) [ OK ]
    ns4 MPTCP -> ns3 (10.0.3.2:10109 ) TCP (duration 23119ms) [ OK ]
    ns4 TCP -> ns3 (10.0.3.2:10110 ) MPTCP (duration 5421ms) [ OK ]
    ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP (duration 41446ms) [ OK ]
    ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP (duration 23427ms) [ OK ]
    ns4 TCP -> ns3 (dead:beef:3::2:10113) MPTCP (duration 5426ms) [ OK ]
    Time: 1396 seconds

    After:
    ns4 MPTCP -> ns3 (10.0.3.2:10108 ) MPTCP (duration 5417ms) [ OK ]
    ns4 MPTCP -> ns3 (10.0.3.2:10109 ) TCP (duration 5427ms) [ OK ]
    ns4 TCP -> ns3 (10.0.3.2:10110 ) MPTCP (duration 5422ms) [ OK ]
    ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP (duration 5415ms) [ OK ]
    ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP (duration 5422ms) [ OK ]
    ns4 TCP -> ns3 (dead:beef:3::2:10113) MPTCP (duration 5423ms) [ OK ]
    Time: 296 seconds

    Signed-off-by: Florian Westphal
    Reviewed-by: Matthieu Baerts
    Signed-off-by: David S. Miller

    Florian Westphal
     

30 Jun, 2020

2 commits

  • When an MPTCP client tries to connect to itself, tcp_finish_connect() is
    never reached. Because of this, depending on the socket's current state,
    multiple faulty behaviours can be observed:

    1) a WARN_ON() in subflow_data_ready() is hit
    WARNING: CPU: 2 PID: 882 at net/mptcp/subflow.c:911 subflow_data_ready+0x18b/0x230
    [...]
    CPU: 2 PID: 882 Comm: gh35 Not tainted 5.7.0+ #187
    [...]
    RIP: 0010:subflow_data_ready+0x18b/0x230
    [...]
    Call Trace:
    tcp_data_queue+0xd2f/0x4250
    tcp_rcv_state_process+0xb1c/0x49d3
    tcp_v4_do_rcv+0x2bc/0x790
    __release_sock+0x153/0x2d0
    release_sock+0x4f/0x170
    mptcp_shutdown+0x167/0x4e0
    __sys_shutdown+0xe6/0x180
    __x64_sys_shutdown+0x50/0x70
    do_syscall_64+0x9a/0x370
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    2) client is stuck forever in mptcp_sendmsg() because the socket is not
    TCP_ESTABLISHED

    crash> bt 4847
    PID: 4847 TASK: ffff88814b2fb100 CPU: 1 COMMAND: "gh35"
    #0 [ffff8881376ff680] __schedule at ffffffff97248da4
    #1 [ffff8881376ff778] schedule at ffffffff9724a34f
    #2 [ffff8881376ff7a0] schedule_timeout at ffffffff97252ba0
    #3 [ffff8881376ff8a8] wait_woken at ffffffff958ab4ba
    #4 [ffff8881376ff940] sk_stream_wait_connect at ffffffff96c2d859
    #5 [ffff8881376ffa28] mptcp_sendmsg at ffffffff97207fca
    #6 [ffff8881376ffbc0] sock_sendmsg at ffffffff96be1b5b
    #7 [ffff8881376ffbe8] sock_write_iter at ffffffff96be1daa
    #8 [ffff8881376ffce8] new_sync_write at ffffffff95e5cb52
    #9 [ffff8881376ffe50] vfs_write at ffffffff95e6547f
    #10 [ffff8881376ffe90] ksys_write at ffffffff95e65d26
    #11 [ffff8881376fff28] do_syscall_64 at ffffffff956088ba
    #12 [ffff8881376fff50] entry_SYSCALL_64_after_hwframe at ffffffff9740008c
    RIP: 00007f126f6956ed RSP: 00007ffc2a320278 RFLAGS: 00000217
    RAX: ffffffffffffffda RBX: 0000000020000044 RCX: 00007f126f6956ed
    RDX: 0000000000000004 RSI: 00000000004007b8 RDI: 0000000000000003
    RBP: 00007ffc2a3202a0 R8: 0000000000400720 R9: 0000000000400720
    R10: 0000000000400720 R11: 0000000000000217 R12: 00000000004004b0
    R13: 00007ffc2a320380 R14: 0000000000000000 R15: 0000000000000000
    ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b

    3) tcpdump captures show that DSS is exchanged even when MP_CAPABLE handshake
    didn't complete.

    $ tcpdump -tnnr bad.pcap
    IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S], seq 3208913911, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291694721,nop,wscale 7,mptcp capable v1], length 0
    IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S.], seq 3208913911, ack 3208913912, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291706876,nop,wscale 7,mptcp capable v1], length 0
    IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 1, win 512, options [nop,nop,TS val 3291706876 ecr 3291706876], length 0
    IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 3291707876 ecr 3291706876,mptcp dss fin seq 0 subseq 0 len 1,nop,nop], length 0
    IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 2, win 512, options [nop,nop,TS val 3291707876 ecr 3291707876], length 0

    Force a fallback to TCP in these cases, and adjust the main socket
    state to avoid hanging in mptcp_sendmsg().

    Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/35
    Reported-by: Christoph Paasch
    Suggested-by: Paolo Abeni
    Signed-off-by: Davide Caratti
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Davide Caratti
     
  • Keep using MPTCP sockets and use a "dummy mapping" in case of fallback
    to regular TCP. When fallback is triggered, skip the addition of the MPTCP
    option on send.

    Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/11
    Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/22
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Davide Caratti
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Davide Caratti
     

27 Jun, 2020

2 commits

  • Replace the radix tree with a hash table allocated
    at boot time. The radix tree has some shortcomings:
    a single lock is contended by all the mptcp operations,
    lookups currently take that lock, and traversing
    all the items would require a lock, too.

    With a hash table we instead trade a little memory to
    address all of the above: a per-bucket lock is used.

    To hash the MPTCP sockets, we re-use the msk's sk_node
    entry: the MPTCP sockets are never hashed by the stack.
    Replace the existing hash proto callbacks with a dummy
    implementation, annotating the above constraint.

    Additionally, refactor the token creation code to:

    - limit the number of consecutive attempts to a fixed
    maximum. Hitting a hash bucket with a long chain is
    considered a failed attempt

    - ensure accept() can no longer fail due to token management

    - fall back to TCP if token creation fails at connect() time
    (previously the connection was closed)
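
    A hedged sketch of the container shape and of the bounded creation loop,
    with pthread mutexes standing in for the kernel's per-bucket spinlocks;
    names and constants are illustrative only:

    #include <pthread.h>
    #include <stdint.h>
    #include <stdlib.h>

    struct token_node {                  /* not the kernel layout */
        uint32_t token;
        void *msk;
        struct token_node *next;
    };

    struct token_bucket {
        pthread_mutex_t lock;            /* per-bucket lock: no global contention */
        struct token_node *chain;
        unsigned int chain_len;
    };

    #define TOKEN_BUCKETS      1024      /* the real table is sized at boot */
    #define MAX_TOKEN_ATTEMPTS 4         /* bound on consecutive attempts */
    #define MAX_CHAIN_LEN      4         /* a long chain counts as a failure */

    static struct token_bucket tbl[TOKEN_BUCKETS];

    static void token_tbl_init(void)     /* run once at startup */
    {
        for (int i = 0; i < TOKEN_BUCKETS; i++)
            pthread_mutex_init(&tbl[i].lock, NULL);
    }

    /* 0 on success; -1 means "no token, the caller falls back to plain TCP" */
    static int token_new(void *msk, uint32_t (*rnd)(void), uint32_t *out)
    {
        for (int i = 0; i < MAX_TOKEN_ATTEMPTS; i++) {
            uint32_t token = rnd();
            struct token_bucket *b = &tbl[token % TOKEN_BUCKETS];

            pthread_mutex_lock(&b->lock);
            /* (the real code also rejects a token already present in the chain) */
            if (token && b->chain_len < MAX_CHAIN_LEN) {
                struct token_node *n = malloc(sizeof(*n));

                if (n) {
                    n->token = token;
                    n->msk = msk;
                    n->next = b->chain;
                    b->chain = n;
                    b->chain_len++;
                    pthread_mutex_unlock(&b->lock);
                    *out = token;
                    return 0;
                }
            }
            pthread_mutex_unlock(&b->lock);
        }
        return -1;
    }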

    v1 -> v2:
    - fix "no newline at end of file" - Jakub

    Signed-off-by: Paolo Abeni
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Add the missing annotation in some setup-only
    functions.

    Signed-off-by: Paolo Abeni
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     

19 Jun, 2020

1 commit

  • The msk ownership is transferred to the child socket at
    3rd ack time, so that we avoid more lookups later. If the
    request does not reach the 3rd ack, the MSK reference is
    dropped at request sock release time.

    As a side effect, fallback is now tracked by a NULL msk
    reference instead of a zeroed 'mp_join' field. This will
    simplify the next patch.

    Signed-off-by: Paolo Abeni
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     

23 May, 2020

1 commit

  • There is some ambiguity in the RFC as to whether the ADD_ADDR HMAC is
    the rightmost 64 bits of the entire hash or of the leftmost 160 bits
    of the hash. The intention, as clarified with the author of the RFC,
    is the entire hash.

    This change returns the entire hash from
    mptcp_crypto_hmac_sha (instead of only the first 160 bits), and moves
    any truncation/selection operation on the hash to the caller.
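
    For illustration, a hedged caller-side sketch; hmac_sha256() is only a
    declared stand-in for the crypto helper, and endianness handling is
    omitted. With the clarified interpretation, the ADD_ADDR field is the last
    8 bytes of the full 32-byte digest:

    #include <stdint.h>
    #include <string.h>

    #define SHA256_DIGEST_SIZE 32

    /* stand-in for the real HMAC-SHA256 primitive, declaration only */
    void hmac_sha256(const uint8_t *key, size_t key_len,
                     const uint8_t *msg, size_t msg_len,
                     uint8_t out[SHA256_DIGEST_SIZE]);

    /* truncation now lives with the caller: ADD_ADDR carries the rightmost
     * 64 bits of the *entire* 256-bit hash, not of a 160-bit prefix of it */
    static uint64_t add_addr_hmac(const uint8_t *key, size_t key_len,
                                  const uint8_t *msg, size_t msg_len)
    {
        uint8_t full[SHA256_DIGEST_SIZE];
        uint64_t trunc;

        hmac_sha256(key, key_len, msg, msg_len, full);
        memcpy(&trunc, &full[SHA256_DIGEST_SIZE - sizeof(trunc)], sizeof(trunc));
        return trunc;
    }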

    Fixes: 12555a2d97e5 ("mptcp: use rightmost 64 bits in ADD_ADDR HMAC")
    Reviewed-by: Christoph Paasch
    Reviewed-by: Mat Martineau
    Signed-off-by: Todd Malsbary
    Signed-off-by: David S. Miller

    Todd Malsbary
     

17 May, 2020

1 commit

  • RFC 8684 allows sending 32-bit DATA_ACKs as long as the peer is not
    sending 64-bit data-sequence numbers. The 64-bit DSN is only there for
    extreme scenarios when a very high throughput subflow is combined with a
    long-RTT subflow such that the high-throughput subflow wraps around the
    32-bit sequence number space within an RTT of the high-RTT subflow.

    It is thus a rare scenario and we should use the 32-bit DATA_ACK
    for as long as possible. It reduces the TCP-option overhead
    by 4 bytes, making space for an additional SACK block. It also makes
    tcpdumps much easier to read when the DSN and DATA_ACK are both either
    32-bit or 64-bit.

    Signed-off-by: Christoph Paasch
    Reviewed-by: Matthieu Baerts
    Signed-off-by: David S. Miller

    Christoph Paasch
     

01 May, 2020

1 commit

  • The mptcp_options_received structure carries several per-packet
    flags (mp_capable, mp_join, etc.). Such fields must
    be cleared on each packet, even on dropped ones or packets
    not carrying any MPTCP options, but the current mptcp
    code clears them only on TCP option reset.

    In several races/corner cases we end up with stray bits in the
    incoming options, leading to WARN_ON splats, e.g.:

    [ 171.164906] Bad mapping: ssn=32714 map_seq=1 map_data_len=32713
    [ 171.165006] WARNING: CPU: 1 PID: 5026 at net/mptcp/subflow.c:533 warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
    [ 171.167632] Modules linked in: ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel geneve ip6_udp_tunnel udp_tunnel macsec macvtap tap ipvlan macvlan 8021q garp mrp xfrm_interface veth netdevsim nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun binfmt_misc intel_rapl_msr intel_rapl_common rfkill kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 sunrpc ip_tables xfs libcrc32c crc32c_intel serio_raw virtio_console ata_generic virtio_blk virtio_net net_failover failover ata_piix libata
    [ 171.199464] CPU: 1 PID: 5026 Comm: repro Not tainted 5.7.0-rc1.mptcp_f227fdf5d388+ #95
    [ 171.200886] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
    [ 171.202546] RIP: 0010:warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
    [ 171.206537] Code: c1 ea 03 0f b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 1d 8b 55 3c 44 89 e6 48 c7 c7 20 51 13 95 e8 37 8b 22 fe 0b 48 83 c4 08 5b 5d 41 5c c3 89 4c 24 04 e8 db d6 94 fe 8b 4c
    [ 171.220473] RSP: 0018:ffffc90000150560 EFLAGS: 00010282
    [ 171.221639] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
    [ 171.223108] RDX: 0000000000000000 RSI: 0000000000000008 RDI: fffff5200002a09e
    [ 171.224388] RBP: ffff8880aa6e3c00 R08: 0000000000000001 R09: fffffbfff2ec9955
    [ 171.225706] R10: ffffffff9764caa7 R11: fffffbfff2ec9954 R12: 0000000000007fca
    [ 171.227211] R13: ffff8881066f4a7f R14: ffff8880aa6e3c00 R15: 0000000000000020
    [ 171.228460] FS: 00007f8623719740(0000) GS:ffff88810be00000(0000) knlGS:0000000000000000
    [ 171.230065] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 171.231303] CR2: 00007ffdab190a50 CR3: 00000001038ea006 CR4: 0000000000160ee0
    [ 171.232586] Call Trace:
    [ 171.233109]
    [ 171.233531] get_mapping_status (linux-mptcp/net/mptcp/subflow.c:691)
    [ 171.234371] mptcp_subflow_data_available (linux-mptcp/net/mptcp/subflow.c:736 linux-mptcp/net/mptcp/subflow.c:832)
    [ 171.238181] subflow_state_change (linux-mptcp/net/mptcp/subflow.c:1085 (discriminator 1))
    [ 171.239066] tcp_fin (linux-mptcp/net/ipv4/tcp_input.c:4217)
    [ 171.240123] tcp_data_queue (linux-mptcp/./include/linux/compiler.h:199 linux-mptcp/net/ipv4/tcp_input.c:4822)
    [ 171.245083] tcp_rcv_established (linux-mptcp/./include/linux/skbuff.h:1785 linux-mptcp/./include/net/tcp.h:1774 linux-mptcp/./include/net/tcp.h:1847 linux-mptcp/net/ipv4/tcp_input.c:5238 linux-mptcp/net/ipv4/tcp_input.c:5730)
    [ 171.254089] tcp_v4_rcv (linux-mptcp/./include/linux/spinlock.h:393 linux-mptcp/net/ipv4/tcp_ipv4.c:2009)
    [ 171.258969] ip_protocol_deliver_rcu (linux-mptcp/net/ipv4/ip_input.c:204 (discriminator 1))
    [ 171.260214] ip_local_deliver_finish (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/ipv4/ip_input.c:232)
    [ 171.261389] ip_local_deliver (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:252)
    [ 171.265884] ip_rcv (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:539)
    [ 171.273666] process_backlog (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/core/dev.c:6135)
    [ 171.275328] net_rx_action (linux-mptcp/net/core/dev.c:6572 linux-mptcp/net/core/dev.c:6640)
    [ 171.280472] __do_softirq (linux-mptcp/./arch/x86/include/asm/jump_label.h:25 linux-mptcp/./include/linux/jump_label.h:200 linux-mptcp/./include/trace/events/irq.h:142 linux-mptcp/kernel/softirq.c:293)
    [ 171.281379] do_softirq_own_stack (linux-mptcp/arch/x86/entry/entry_64.S:1083)
    [ 171.282358]

    We could address the issue by explicitly clearing the relevant fields
    in several places - tcp_parse_option, tcp_fast_parse_options,
    possibly others.

    Instead we move the MPTCP option parsing into the already existing
    mptcp ingress hook, so that the fields need to be cleared in a single
    place.

    This allows us to drop an MPTCP hook from the TCP code and to
    remove the quite large mptcp_options_received from the tcp_sock
    struct. On the flip side, MPTCP sockets will traverse the
    option space twice (in tcp_parse_option() and in
    mptcp_incoming_options()). That looks acceptable: we already
    do that for syn and 3rd-ack packets, plain TCP sockets will
    benefit from it, and even MPTCP sockets will gain better
    code locality, reducing the jumps between TCP and MPTCP code.
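
    As an illustration of the single clearing point -- the struct and function
    below are simplified stand-ins, not the kernel definitions:

    #include <string.h>

    struct toy_options_received {        /* simplified stand-in */
        unsigned int mp_capable : 1;
        unsigned int mp_join : 1;
        unsigned int dss : 1;
        /* ... sequence numbers, keys, etc. ... */
    };

    static void toy_incoming_options(struct toy_options_received *mp_opt,
                                     const unsigned char *opts, int len)
    {
        /* every packet starts from a clean slate here, including packets
         * that carry no MPTCP options at all and packets later dropped */
        memset(mp_opt, 0, sizeof(*mp_opt));

        /* ... walk opts[0..len) and set only the flags actually present ... */
        (void)opts;
        (void)len;
    }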

    v1 -> v2:
    - rebased on current '-net' tree

    Fixes: 648ef4b88673 ("mptcp: Implement MPTCP receive path")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

21 Apr, 2020

1 commit

  • We don't need them, as we can use the current ingress opt
    data instead. Setting them in syn_recv_sock() may cause an
    inconsistent mptcp socket status, as per the previous commit.

    Fixes: cc7972ea1932 ("mptcp: parse and emit MP_CAPABLE option according to v1 spec")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

02 Apr, 2020

1 commit

  • This is needed at least until proper MPTCP-level fin/reset
    signalling gets added:

    We wake the parent when a subflow changes state, but we should do this
    only when all subflows have closed, not just one.

    Schedule the mptcp worker and tell it to check eof state on all
    subflows.

    Only flag mptcp socket as closed and wake userspace processes blocking
    in poll if all subflows have closed.
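
    A hedged sketch of the check the worker performs; types and names are
    invented and the list handling is simplified:

    #include <stdbool.h>

    struct toy_subflow {
        bool rx_closed;                  /* subflow received fin/reset */
        struct toy_subflow *next;
    };

    struct toy_msk {
        struct toy_subflow *subflows;
        bool at_eof;
    };

    /* run from the mptcp worker after any subflow state change */
    static void toy_check_for_eof(struct toy_msk *msk)
    {
        struct toy_subflow *sf;

        for (sf = msk->subflows; sf; sf = sf->next)
            if (!sf->rx_closed)
                return;                  /* at least one subflow still open */

        if (!msk->at_eof) {
            msk->at_eof = true;
            /* the real code sets the socket state and wakes poll() waiters */
        }
    }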

    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

30 Mar, 2020

12 commits

  • Expose a new netlink family to userspace to control the PM, setting:

    - list of local addresses to be signalled.
    - list of local addresses used to create subflows.
    - maximum number of add_addr options to react to

    When the msk is fully established, the PM netlink attempts to
    announce the 'signal' list via the ADD_ADDR option. Since we
    currently lack the ADD_ADDR echo (and the related event), only the
    first addr is sent.

    After exhausting the 'announce' list, the PM tries to create a
    subflow for each addr in the 'local' list, waiting for each
    connection to complete before attempting the next one.

    The idea is to add an additional PM hook for the ADD_ADDR echo, to allow
    the PM netlink to announce multiple addresses in sequence.
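
    A hedged, plain-C sketch of that sequencing; announce_add_addr() and
    create_subflow_and_wait() are hypothetical helpers used only for
    illustration, not part of the actual PM netlink interface:

    #include <stdbool.h>
    #include <stddef.h>

    struct toy_addr { unsigned int id; };

    struct toy_pm {
        struct toy_addr signal[8]; size_t n_signal;  /* addresses to announce */
        struct toy_addr local[8];  size_t n_local;   /* addresses for subflows */
        size_t announced, connected;
    };

    /* hypothetical helpers, declarations only */
    void announce_add_addr(const struct toy_addr *a);
    bool create_subflow_and_wait(const struct toy_addr *a);

    /* run once the msk is fully established */
    static void toy_pm_worker(struct toy_pm *pm)
    {
        /* without ADD_ADDR echo support only the first address is announced */
        if (pm->n_signal && pm->announced == 0) {
            announce_add_addr(&pm->signal[0]);
            pm->announced = 1;
        }

        /* then create one subflow per 'local' address, strictly in sequence */
        while (pm->connected < pm->n_local) {
            if (!create_subflow_and_wait(&pm->local[pm->connected]))
                break;
            pm->connected++;
        }
    }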

    Co-developed-by: Matthieu Baerts
    Signed-off-by: Matthieu Baerts
    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Add ULP-specific diagnostic functions, so that subflow information can be
    dumped to userspace programs like 'ss'.

    v2 -> v3:
    - uapi: use bit macros appropriate for userspace

    Co-developed-by: Matthieu Baerts
    Signed-off-by: Matthieu Baerts
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Davide Caratti
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Davide Caratti
     
  • On a timeout event, schedule a work queue to do the retransmission.
    The retransmission code closely resembles the sendmsg() implementation and
    re-uses mptcp_sendmsg_frag, providing a dummy msghdr - for flags'
    sake - and peeking the relevant dfrag from the rtx head.

    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • After adding wmem accounting for the mptcp socket we could get
    into a situation where the mptcp socket can't transmit more data,
    and mptcp_clean_una doesn't reduce wmem even if snd_una has advanced,
    because it currently only removes entire dfrags.

    Allow advancing the dfrag head sequence and reduce wmem,
    even though this isn't correct (as we can't release the page).

    Because we will soon block on the mptcp sk when wmem is too large,
    call sk_stream_write_space() whenever we reduce the backlog, so that a
    userspace task blocked in sendmsg or poll will be woken up.

    This isn't an issue if the send buffer is large, but it is when
    SO_SNDBUF is used to reduce it to a lower value.

    Note we can still get a deadlock for low SO_SNDBUF values in
    case both sides of the connection write to the socket: both could
    be blocked due to wmem being too small -- and current mptcp stack
    will only increment mptcp ack_seq on recv.

    This doesn't happen with the selftest as it uses poll() and
    will always call recv if there is data to read.

    Signed-off-by: Florian Westphal
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • The timer will be used to schedule retransmissions. Its
    frequency is based on the current subflow RTO estimation and
    it is reset on every una_seq update.

    The timer is cleared for good by __mptcp_clear_xmit().

    Also clean the MPTCP rtx queue before each transmission.

    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Keep the send page fragment on an MPTCP-level retransmission queue.
    The queue entries are allocated inside the page frag allocator,
    acquiring an additional reference to the page for each list entry.

    Also switch to a custom page frag refill function, to ensure that
    the current page fragment can always host an MPTCP rtx queue entry.

    The MPTCP rtx queue is flushed at disconnect() and close() time.

    Note that now we need to call __mptcp_init_sock() regardless of mptcp
    enable status, as the destructor will try to walk the rtx_queue.

    v2 -> v3:
    - remove 'inline' in foo.c files (David S. Miller)

    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • So that we keep the MPTCP-level unacked sequence number consistent; since
    we update per-msk data, use an atomic64 cmpxchg() to protect
    against concurrent updates from multiple subflows.

    Initialize the snd_una at connect()/accept() time.
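
    A hedged sketch of that update using C11 atomics in place of the kernel's
    atomic64 helpers; any subflow may try to advance the shared value, but a
    stale ack can never move it backwards:

    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic uint64_t msk_snd_una;     /* MPTCP-level unacked sequence */

    /* called from any subflow when a DATA_ACK advancing to new_una arrives */
    static void toy_update_snd_una(uint64_t new_una)
    {
        uint64_t old = atomic_load(&msk_snd_una);

        /* retry until we win the race or observe an already-newer value;
         * a failed compare-exchange reloads 'old' with the current value */
        while (old < new_una &&
               !atomic_compare_exchange_weak(&msk_snd_una, &old, new_una))
            ;
    }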

    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Fill in more path manager functionality by adding a worker function and
    modifying the related stub functions to schedule the worker.

    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Peter Krystad
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Peter Krystad
     
  • Subflow creation may be initiated by the path manager when
    the primary connection is fully established and a remote
    address has been received via ADD_ADDR.

    Create an in-kernel sock and use kernel_connect() to
    initiate the connection.

    Passive sockets can't acquire the mptcp socket lock at
    subflow creation time, so an additional list protected by
    a new spinlock is used to track the MPJ subflows.

    Such a list is spliced into the conn_list tail every time the msk
    socket lock is acquired, so that it will not interfere
    with data flow on the original connection.

    Data flow and connection failover are not addressed by this commit.
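
    A hedged sketch of that two-list scheme in plain C; a pthread mutex stands
    in for the new spinlock, list handling is simplified and all names are
    invented:

    #include <pthread.h>
    #include <stddef.h>

    struct toy_subflow {
        struct toy_subflow *next;
    };

    struct toy_msk {
        /* conn_list: only touched with the msk socket lock held */
        struct toy_subflow *conn_list;

        /* join_list: passive MPJ subflows land here under the spinlock-like
         * lock, since they cannot take the msk socket lock at creation time */
        pthread_mutex_t join_lock;
        struct toy_subflow *join_list;
    };

    /* called from the path that creates a passive MPJ subflow */
    static void toy_track_join_subflow(struct toy_msk *msk, struct toy_subflow *sf)
    {
        pthread_mutex_lock(&msk->join_lock);
        sf->next = msk->join_list;
        msk->join_list = sf;
        pthread_mutex_unlock(&msk->join_lock);
    }

    /* called every time the msk socket lock is acquired */
    static void toy_splice_join_list(struct toy_msk *msk)
    {
        struct toy_subflow *pending, *next;

        pthread_mutex_lock(&msk->join_lock);
        pending = msk->join_list;
        msk->join_list = NULL;
        pthread_mutex_unlock(&msk->join_lock);

        /* append to the conn_list tail so existing subflows are undisturbed */
        for (; pending; pending = next) {
            struct toy_subflow **tail = &msk->conn_list;

            next = pending->next;
            while (*tail)
                tail = &(*tail)->next;
            pending->next = NULL;
            *tail = pending;
        }
    }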

    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Co-developed-by: Matthieu Baerts
    Signed-off-by: Matthieu Baerts
    Signed-off-by: Peter Krystad
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Peter Krystad
     
  • Process the MP_JOIN option in a SYN packet with the same flow
    as MP_CAPABLE, but when the third ACK is received add the
    subflow to the MPTCP socket's subflow list instead of adding it to
    the TCP socket accept queue.

    The subflow is added at the end of the subflow list so it will not
    interfere with the existing subflows operation and no data is
    expected to be transmitted on it.

    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Peter Krystad
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Peter Krystad
     
  • Add enough of a path manager interface to allow sending of ADD_ADDR
    when an incoming MPTCP connection is created. Capable of sending only
    a single IPv4 ADD_ADDR option. The 'pm_data' element of the connection
    sock will need to be expanded to handle multiple interfaces and IPv6.
    Partial processing of the incoming ADD_ADDR is included so the path
    manager notification of that event happens at the proper time, which
    involves validating the incoming address information.

    This is a skeleton interface definition for events generated by
    MPTCP.

    Co-developed-by: Matthieu Baerts
    Signed-off-by: Matthieu Baerts
    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Co-developed-by: Mat Martineau
    Signed-off-by: Mat Martineau
    Signed-off-by: Peter Krystad
    Signed-off-by: David S. Miller

    Peter Krystad
     
  • Add handling for sending and receiving the ADD_ADDR, ADD_ADDR6,
    and RM_ADDR suboptions.

    Co-developed-by: Matthieu Baerts
    Signed-off-by: Matthieu Baerts
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Peter Krystad
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Peter Krystad
     

15 Mar, 2020

1 commit

  • This change moves the mptcp socket allocation from mptcp_accept() to
    subflow_syn_recv_sock(), so that subflow->conn is now always set
    for the non-fallback scenario.

    It allows cleaning up mptcp_accept() a bit, reducing the additional
    locking, and will allow further cleanup in the next patch.

    Signed-off-by: Paolo Abeni
    Reviewed-by: Matthieu Baerts
    Signed-off-by: David S. Miller

    Paolo Abeni
     

04 Mar, 2020

1 commit

  • Instead of reading the MPTCP-level sequence number when sending DATA_FIN,
    store the data in the subflow so it can be safely accessed when the
    subflow TCP headers are written to the packet without the MPTCP-level
    lock held. This also allows the MPTCP-level socket to close individual
    subflows without closing the MPTCP connection.

    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Mat Martineau