16 Oct, 2020

1 commit


11 Oct, 2020

2 commits

  • The msk can close MP_JOIN subflows if the initial handshake
    fails. Currently such subflows are kept alive in the
    conn_list until the msk itself is closed.

    Beyond the wasted memory, we could end up sending the
    DATA_FIN and the DATA_FIN ack on such a socket, even after a
    reset.

    Fixes: 43b54c6ee382 ("mptcp: Use full MPTCP-level disconnect state machine")
    Reviewed-by: Mat Martineau
    Signed-off-by: Paolo Abeni
    Signed-off-by: Jakub Kicinski

    Paolo Abeni
     
  • Additional/MP_JOIN subflows that do not pass some initial handshake
    tests currently cause a fallback to TCP. That is an RFC violation:
    we should instead reset the subflow and leave the msk untouched.

    Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/91
    Fixes: f296234c98a8 ("mptcp: Add handling of incoming MP_JOIN requests")
    Reviewed-by: Mat Martineau
    Signed-off-by: Paolo Abeni
    Signed-off-by: Jakub Kicinski

    Paolo Abeni
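
The rule stated in the commit above can be sketched as a small decision function. This is only an illustration: the enum and function names are hypothetical, and the assumption that a failing initial (MP_CAPABLE) subflow still falls back to TCP is not taken from the commit text.

```c
#include <stdbool.h>

/* Hypothetical outcomes of a failed handshake check; names are illustrative. */
enum handshake_failure_action {
    ACTION_FALLBACK_TO_TCP, /* assumed: initial subflow continues as plain TCP */
    ACTION_RESET_SUBFLOW,   /* MP_JOIN subflow: reset it, leave the msk untouched */
};

/* Decide what to do when a subflow fails its initial handshake tests. */
enum handshake_failure_action on_handshake_failure(bool is_mp_join)
{
    return is_mp_join ? ACTION_RESET_SUBFLOW : ACTION_FALLBACK_TO_TCP;
}
```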
     

09 Oct, 2020

1 commit

  • Using packetdrill it's possible to observe the same MPTCP DSN being acked
    by different subflows with DACK4 and DACK8. This is in contrast with what
    is specified in RFC8684 §3.3.2: if an MPTCP endpoint transmits a 64-bit
    wide DSN, it MUST be acknowledged with a 64-bit wide DACK. Fix the
    'use_64bit_ack' variable to make it a property of MPTCP sockets, not TCP
    subflows.

    Fixes: a0c1d0eafd1e ("mptcp: Use 32-bit DATA_ACK when possible")
    Acked-by: Paolo Abeni
    Signed-off-by: Davide Caratti
    Reviewed-by: Mat Martineau
    Signed-off-by: Jakub Kicinski

    Davide Caratti
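
A minimal userspace sketch of why the flag belongs to the connection rather than to each subflow: every subflow that builds a DATA_ACK consults the same state and therefore picks the same width. The struct and function names are hypothetical, and making the flag sticky once a 64-bit DSN has been seen is an assumption, not necessarily what the kernel does.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical connection-level state shared by all subflows. */
struct msk_sketch {
    bool     use_64bit_ack; /* set once the peer has sent a 64-bit DSN */
    uint64_t ack_seq;       /* next expected data sequence number */
};

/* Record the DSN width used by the peer (assumption: the flag is sticky). */
void note_peer_dsn(struct msk_sketch *msk, bool dsn_is_64bit)
{
    if (dsn_is_64bit)
        msk->use_64bit_ack = true;
}

/* Write the DATA_ACK in network byte order; returns 4 or 8 bytes. */
size_t write_data_ack(const struct msk_sketch *msk, uint8_t *buf)
{
    size_t len = msk->use_64bit_ack ? 8 : 4;

    for (size_t i = 0; i < len; i++)
        buf[i] = (uint8_t)(msk->ack_seq >> (8 * (len - 1 - i)));
    return len;
}
```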
     

06 Oct, 2020

1 commit


04 Oct, 2020

1 commit

  • The MPTCP ADD_ADDR suboption with echo-flag=1 has no HMAC, so it is
    smaller than the one initially sent without echo-flag=1. We therefore
    need to use the correct size everywhere the echo bit is set.

    Before this patch, the wrong size was reserved but the correct number of
    bytes was written (and read): the remaining bytes contained garbage.

    Fixes: 6a6c05a8b016 ("mptcp: send out ADD_ADDR with echo flag")
    Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/95
    Reported-and-tested-by: Davide Caratti
    Acked-by: Geliang Tang
    Signed-off-by: Matthieu Baerts
    Signed-off-by: David S. Miller

    Matthieu Baerts
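
The size difference can be made concrete with a small length helper. This sketch follows the ADD_ADDR layout in RFC 8684 (4-byte header, address, optional port, 8-byte truncated HMAC that is omitted on an echo); the constant and function names are hypothetical and do not match the kernel's.

```c
#include <stdbool.h>
#include <stddef.h>

enum {
    ADD_ADDR_HDR_LEN  = 4,  /* kind, length, subtype/flags, address ID */
    ADD_ADDR_IPV4_LEN = 4,
    ADD_ADDR_IPV6_LEN = 16,
    ADD_ADDR_PORT_LEN = 2,
    ADD_ADDR_HMAC_LEN = 8,  /* truncated HMAC, absent when the echo flag is set */
};

/* Length in bytes of an ADD_ADDR suboption for the given variant. */
size_t add_addr_len(bool ipv6, bool has_port, bool echo)
{
    size_t len = ADD_ADDR_HDR_LEN;

    len += ipv6 ? ADD_ADDR_IPV6_LEN : ADD_ADDR_IPV4_LEN;
    if (has_port)
        len += ADD_ADDR_PORT_LEN;
    if (!echo)
        len += ADD_ADDR_HMAC_LEN;
    return len;
}
```

Reserving `add_addr_len(..., false)` bytes for an echo while writing only the echo-sized option leaves 8 unused bytes in the option space, which matches the garbage described above.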
     

30 Sep, 2020

1 commit

  • The peer may send a DATA_FIN mapping with either a 32-bit or 64-bit
    sequence number. When a 32-bit sequence number is received for the
    DATA_FIN, it must be expanded to 64 bits before comparing it to the
    last acked sequence number. This expansion was missing.

    Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/93
    Fixes: 3721b9b64676 ("mptcp: Track received DATA_FIN sequence number and add related helpers")
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Mat Martineau
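
One common way to expand a 32-bit sequence number is to pick the 64-bit value closest to a known 64-bit reference. The sketch below uses that approach; it is not the kernel's helper, and `expand_seq32` and its parameter names are hypothetical.

```c
#include <stdint.h>

/*
 * Expand a 32-bit sequence number to 64 bits by choosing the value
 * nearest to ref_seq (for example, the last acked sequence number).
 */
uint64_t expand_seq32(uint64_t ref_seq, uint32_t seq32)
{
    /*
     * Signed 32-bit distance from the low half of the reference.
     * The unsigned-to-signed conversion is implementation-defined in C,
     * but wraps as expected on mainstream two's-complement compilers.
     */
    int32_t delta = (int32_t)(seq32 - (uint32_t)ref_seq);

    return ref_seq + (uint64_t)(int64_t)delta;
}
```

Comparing the raw 32-bit value against a 64-bit counter fails as soon as the upper 32 bits of the counter are non-zero, which is why the expansion has to happen before the comparison.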
     

25 Sep, 2020

8 commits

  • This patch implemented the retransmission of ADD_ADDR when no ADD_ADDR
    echo is received. It added a timer with the announced address. When the
    timeout occurs, ADD_ADDR will be retransmitted.

    Suggested-by: Mat Martineau
    Suggested-by: Paolo Abeni
    Acked-by: Paolo Abeni
    Signed-off-by: Geliang Tang
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Geliang Tang
     
  • This patch added a new helper named mptcp_destroy_common containing the
    shared code between mptcp_destroy() and mptcp_sock_destruct().

    Suggested-by: Paolo Abeni
    Signed-off-by: Geliang Tang
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Geliang Tang
     
  • This patch implemented the local subflow removing function,
    mptcp_pm_remove_subflow, which simply called
    mptcp_pm_nl_rm_subflow_received under the PM spin lock.

    We use mptcp_pm_remove_subflow to remove a local subflow, so change its
    argument from remote_id to local_id.

    We check subflow->local_id in mptcp_pm_nl_rm_subflow_received to remove
    a subflow.

    Suggested-by: Matthieu Baerts
    Suggested-by: Paolo Abeni
    Suggested-by: Mat Martineau
    Signed-off-by: Geliang Tang
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Geliang Tang
     
  • This patch implements the remove announced addr and subflow logic in PM
    netlink.

    When the PM netlink removes an address, we traverse all the existing msk
    sockets to find the relevant sockets.

    We add a new list named anno_list in mptcp_pm_data, to record all the
    announced addrs. In the traversing, we check if it has been recorded.
    If it has been, we trigger the RM_ADDR signal.

    We also check if this address is in conn_list. If it is, we remove the
    subflow which is using this local address.

    Since we call mptcp_pm_free_anno_list in mptcp_destroy, we need to move
    __mptcp_init_sock before the mptcp_is_enabled check in mptcp_init_sock.

    Suggested-by: Matthieu Baerts
    Suggested-by: Paolo Abeni
    Suggested-by: Mat Martineau
    Acked-by: Paolo Abeni
    Signed-off-by: Geliang Tang
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Geliang Tang
     
  • When the ADD_ADDR suboption has been received, we need to send out the same
    ADD_ADDR suboption with echo-flag=1, and no HMAC.

    Suggested-by: Mat Martineau
    Reviewed-by: Mat Martineau
    Signed-off-by: Geliang Tang
    Signed-off-by: David S. Miller

    Geliang Tang
     
  • This patch added the RM_ADDR option parsing logic:

    We parsed the incoming options to find if the rm_addr option is received,
    and called mptcp_pm_rm_addr_received to schedule PM work to a new status,
    named MPTCP_PM_RM_ADDR_RECEIVED.

    PM work got this status, and called mptcp_pm_nl_rm_addr_received to handle
    it.

    In mptcp_pm_nl_rm_addr_received, we closed the subflow matching the rm_id,
    and updated PM counter.

    Suggested-by: Matthieu Baerts
    Suggested-by: Paolo Abeni
    Suggested-by: Mat Martineau
    Signed-off-by: Geliang Tang
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Geliang Tang
     
  • This patch added a new signal named rm_addr_signal in PM. On outgoing path,
    we called mptcp_pm_should_rm_signal to check if rm_addr_signal has been
    set. If it has been, we sent out the RM_ADDR option.

    Suggested-by: Matthieu Baerts
    Suggested-by: Paolo Abeni
    Signed-off-by: Geliang Tang
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Geliang Tang
     
  • This patch renamed addr_signal and the related functions with the explicit
    word "add".

    Suggested-by: Matthieu Baerts
    Suggested-by: Paolo Abeni
    Signed-off-by: Geliang Tang
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Geliang Tang
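
The ADD_ADDR retransmission commit in this group (retransmit when the timer expires without an echo) can be sketched in userspace as follows. All names and the timeout value are hypothetical; the real code uses kernel timers attached to entries of the announced-address list.

```c
#include <stdbool.h>
#include <stdint.h>

#define ADD_ADDR_TIMEOUT_MS 1000u /* hypothetical value, for illustration only */

/* One announced address waiting for its ADD_ADDR echo. */
struct add_addr_entry {
    uint8_t  addr_id;
    uint64_t deadline_ms;
    bool     echo_received;
    unsigned retransmits;
};

/* Called when ADD_ADDR is first sent: arm the timer. */
void add_addr_announce(struct add_addr_entry *e, uint8_t addr_id, uint64_t now_ms)
{
    e->addr_id = addr_id;
    e->deadline_ms = now_ms + ADD_ADDR_TIMEOUT_MS;
    e->echo_received = false;
    e->retransmits = 0;
}

/* Called when the peer echoes the ADD_ADDR: no more retransmissions. */
void add_addr_echo(struct add_addr_entry *e)
{
    e->echo_received = true;
}

/* Returns true when the caller should retransmit ADD_ADDR now. */
bool add_addr_timer_check(struct add_addr_entry *e, uint64_t now_ms)
{
    if (e->echo_received || now_ms < e->deadline_ms)
        return false;
    e->retransmits++;
    e->deadline_ms = now_ms + ADD_ADDR_TIMEOUT_MS; /* re-arm */
    return true;
}
```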
     

15 Sep, 2020

5 commits

  • Update the scheduler to a less trivial heuristic: cache
    the last used subflow, and try to send on it a reasonably
    long burst of data.

    When the burst or the subflow send space is exhausted, pick
    the subflow with the lower ratio between write space and
    send buffer - that is, the subflow with the greater relative
    amount of free space.

    v1 -> v2:
    - fix 32 bit build breakage due to 64bits div
    - fix checkpatch issues (uint64_t -> u64)

    Signed-off-by: Paolo Abeni
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • So that it can be accessed easily from the subflow creation
    helper. No functional change intended.

    Signed-off-by: Paolo Abeni
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • There is no need to use the tcp_read_sock(), we can
    simply drop the skb. Additionally try to look at the
    next buffer for in order data.

    This both simplifies the code and avoids unneeded indirect
    calls.

    Signed-off-by: Paolo Abeni
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Add an RB-tree to cope with OoO (at MPTCP level) data.
    __mptcp_move_skb() inserts "future" data into the RB tree,
    eventually coalescing skbs as allowed by the
    MPTCP DSN.

    To simplify sequence accounting, move the DSN inside
    the cb.

    After successfully enqueuing in sequence data, check
    if we can use any data from the RB tree.

    Additionally move the data_fin check after spooling
    data from the OoO tree, otherwise we could miss shutdown
    events.

    The RB tree code is copied as verbatim as possible
    from tcp_data_queue_ofo(), with a few simplifications
    due to the fact that MPTCP doesn't need to cope with
    sacks. All bugs here are added by me.

    Signed-off-by: Paolo Abeni
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • This is a prerequisite to allow receiving data from multiple
    subflows without re-injection.

    Instead of dropping the OoO - "future" data in
    subflow_check_data_avail(), call into __mptcp_move_skbs()
    and let the msk drop that.

    To avoid code duplication factor out the mptcp_subflow_discard_data()
    helper.

    Note that __mptcp_move_skbs() can now find multiple subflows
    with data avail (comprising to-be-discarded data), so must
    update the byte counter incrementally.

    v1 -> v2:
    - fix checkpatch issues (unsigned -> unsigned int)

    Signed-off-by: Paolo Abeni
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
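
The out-of-order handling described in this group (park "future" data, then drain it once the gap is filled) can be sketched as below. To stay short, the sketch swaps the RB tree for a sorted singly linked list and does no coalescing; all names are hypothetical and segments carry only a sequence number and a length.

```c
#include <stdint.h>
#include <stdlib.h>

/* One parked out-of-order segment, kept sorted by data sequence number. */
struct ooo_seg {
    uint64_t seq;
    uint32_t len;
    struct ooo_seg *next;
};

struct rx_state {
    uint64_t ack_seq;    /* next in-order data sequence number */
    struct ooo_seg *ooo; /* parked "future" segments */
};

/* Move any parked segment that has become in-order; count new bytes. */
static void ooo_drain(struct rx_state *rx, uint64_t *delivered)
{
    while (rx->ooo && rx->ooo->seq <= rx->ack_seq) {
        struct ooo_seg *s = rx->ooo;
        uint64_t end = s->seq + s->len;

        if (end > rx->ack_seq) {
            *delivered += end - rx->ack_seq;
            rx->ack_seq = end;
        }
        rx->ooo = s->next;
        free(s);
    }
}

/* Handle one incoming segment; returns the bytes newly delivered in order. */
uint64_t rx_segment(struct rx_state *rx, uint64_t seq, uint32_t len)
{
    uint64_t end = seq + len;
    uint64_t delivered = 0;

    if (end <= rx->ack_seq)
        return 0; /* entirely old data: drop */

    if (seq <= rx->ack_seq) {
        /* In sequence (possibly overlapping already received data). */
        delivered = end - rx->ack_seq;
        rx->ack_seq = end;
        ooo_drain(rx, &delivered);
        return delivered;
    }

    /* Future data: insert into the list, keeping it sorted by seq. */
    struct ooo_seg *n = malloc(sizeof(*n));
    if (!n)
        return 0; /* allocation failure: drop, the peer will retransmit */
    n->seq = seq;
    n->len = len;

    struct ooo_seg **pp = &rx->ooo;
    while (*pp && (*pp)->seq < seq)
        pp = &(*pp)->next;
    n->next = *pp;
    *pp = n;
    return 0;
}
```

A sorted list makes insertion linear in the number of parked segments; the RB tree used by the commit keeps that logarithmic.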
     

01 Aug, 2020

2 commits

  • JOIN requests do not work in syncookie mode -- for HMAC validation, the
    peer's nonce and the mptcp token (to obtain the desired connection socket
    the join is for) are required, but this information is only present in the
    initial syn.

    So either we need to drop all JOIN requests once a listening socket enters
    syncookie mode, or we need to store enough state to reconstruct the request
    socket later.

    This adds a state table (1024 entries) to store the data present in the
    MP_JOIN syn request and the random nonce used for the cookie syn/ack.

    When an MP_JOIN ACK passes cookie validation, the table is consulted
    to rebuild the request socket from it.

    An alternate approach would be to "cancel" syn-cookie mode and force
    MP_JOIN to always use a syn queue entry.

    However, doing so brings the backlog over the configured queue limit.

    v2: use req->syncookie, not (removed) want_cookie arg

    Suggested-by: Paolo Abeni
    Signed-off-by: Florian Westphal
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • Will be used to initialize the mptcp request socket when a MP_CAPABLE
    request was handled in syncookie mode, i.e. when a TCP ACK containing a
    MP_CAPABLE option is a valid syncookie value.

    Normally (non-cookie case), MPTCP will generate a unique 32 bit connection
    ID and store it in the MPTCP token storage to be able to retrieve the
    mptcp socket for subflow joining.

    In syncookie case, we do not want to store any state, so just generate the
    unique ID and use it in the reply.

    This means there is a small window where another connection could generate
    the same token.

    When the Cookie ACK comes back, we check that the token has not been
    registered in the meantime. If it was, the connection needs to fall back
    to TCP.

    Changes in v2:
    - use req->syncookie instead of passing 'want_cookie' arg to ->init_req()
    (Eric Dumazet)

    Signed-off-by: Florian Westphal
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Florian Westphal
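
The MP_JOIN state table from the first commit in this group can be pictured as a fixed array of 1024 slots indexed by a hash of the flow. The sketch below is a userspace illustration only: the key, the hash, the field set and the function names are all hypothetical, and no locking is shown.

```c
#include <stdbool.h>
#include <stdint.h>

#define JOIN_SLOTS 1024u

/* State remembered from the MP_JOIN syn while in syncookie mode. */
struct join_entry {
    uint32_t flow_key;    /* hypothetical key identifying the flow */
    uint32_t token;
    uint32_t peer_nonce;
    uint32_t local_nonce; /* random nonce used for the cookie syn/ack */
    bool     valid;
};

static struct join_entry join_table[JOIN_SLOTS];

/* Map a flow key to one of the 1024 slots (top 10 bits of the product). */
static uint32_t join_slot(uint32_t flow_key)
{
    return (flow_key * 2654435761u) >> 22;
}

/* Called for the MP_JOIN syn: overwrite whatever occupied the slot. */
void join_store(uint32_t flow_key, uint32_t token,
                uint32_t peer_nonce, uint32_t local_nonce)
{
    struct join_entry *e = &join_table[join_slot(flow_key)];

    e->flow_key = flow_key;
    e->token = token;
    e->peer_nonce = peer_nonce;
    e->local_nonce = local_nonce;
    e->valid = true;
}

/* Called after the ACK passed cookie validation: fetch and clear. */
bool join_consume(uint32_t flow_key, struct join_entry *out)
{
    struct join_entry *e = &join_table[join_slot(flow_key)];

    if (!e->valid || e->flow_key != flow_key)
        return false;
    *out = *e;
    e->valid = false;
    return true;
}
```

Because a slot is simply overwritten on collision, memory use stays bounded under a SYN flood, at the cost of occasionally losing a legitimate join.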
     

29 Jul, 2020

2 commits


24 Jul, 2020

1 commit

  • Currently accepted msk sockets become established only after
    accept() returns the new sk to user-space.

    As MP_JOIN requests are refused, as per the RFC spec, on sockets that
    are not fully established, the above causes instabilities in the
    mp_join self-tests.

    This change lets the msk enter the established status
    as soon as it receives the 3rd ack and propagates the first
    subflow's fully established status to the msk socket.

    Finally we can change the subflow acceptance condition to
    take into account both the sock state and the msk fully
    established flag.

    Reviewed-by: Mat Martineau
    Tested-by: Christoph Paasch
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

22 Jul, 2020

1 commit


10 Jul, 2020

1 commit

  • mptcp_token_iter_next() allows traversing all the MPTCP
    sockets inside the token container belonging to the given
    network namespace with quite standard iterator semantics.

    That will be used by the next patch, but keep the API generic,
    as we plan to use this later for PM's sake.

    Additionally export mptcp_token_get_sock(), as it also
    will be used by the diag module.

    Reviewed-by: Mat Martineau
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
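
The iterator semantics can be illustrated with a resumable cursor made of a bucket index and a position inside that bucket. This is a userspace sketch with hypothetical types and names; the kernel function additionally deals with RCU and socket reference counts, which are omitted here.

```c
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 4

/* Hypothetical stand-in for an MPTCP socket stored in the token container. */
struct msk_sketch {
    int netns;      /* network namespace identifier */
    uint32_t token;
    struct msk_sketch *next;
};

struct msk_sketch *buckets[NBUCKETS];

/*
 * Return the next socket belonging to netns, resuming from the cursor
 * (*s_slot, *s_num). Start with both set to 0; NULL means the end.
 */
struct msk_sketch *token_iter_next(int netns, long *s_slot, long *s_num)
{
    for (long slot = *s_slot; slot < NBUCKETS; slot++, *s_num = 0) {
        long pos = 0;

        for (struct msk_sketch *m = buckets[slot]; m; m = m->next, pos++) {
            if (pos < *s_num || m->netns != netns)
                continue; /* already visited, or foreign namespace */
            *s_slot = slot;
            *s_num = pos + 1;
            return m;
        }
    }
    *s_slot = NBUCKETS;
    *s_num = 0;
    return NULL;
}
```

A caller loops with `while ((m = token_iter_next(ns, &slot, &num)) != NULL)`, which is the usage pattern a dump-style consumer such as a diag module would need.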
     

08 Jul, 2020

1 commit

  • We can re-use the existing work queue to handle path management
    instead of a dedicated work queue. Just move pm_worker to protocol.c,
    call it from the mptcp worker and get rid of the msk lock (already held).

    Signed-off-by: Florian Westphal
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Florian Westphal
     

02 Jul, 2020

1 commit

  • When mptcp is used, userspace doesn't read from the tcp (subflow)
    socket but from the parent (mptcp) socket receive queue.

    skbs are moved from the subflow socket to the mptcp rx queue either from
    'data_ready' callback (if mptcp socket can be locked), a work queue, or
    the socket receive function.

    This means tcp_rcv_space_adjust() is never called and thus no receive
    buffer size auto-tuning is done.

    An earlier (not merged) patch added tcp_rcv_space_adjust() calls to the
    function that moves skbs from subflow to mptcp socket.
    While this enabled autotuning, it also meant tuning was done even if
    userspace was reading the mptcp socket very slowly.

    This adds mptcp_rcv_space_adjust() and calls it after userspace has
    read data from the mptcp socket rx queue.

    It's very similar to tcp_rcv_space_adjust, with two differences:

    1. The rtt estimate is the largest one observed on a subflow
    2. The rcvbuf size and window clamp of all subflows are adjusted
    to the mptcp-level rcvbuf.

    Otherwise, we get spurious drops at tcp (subflow) socket level if
    the skbs are not moved to the mptcp socket fast enough.

    Before:
    time mptcp_connect.sh -t -f $((4*1024*1024)) -d 300 -l 0.01% -r 0 -e "" -m mmap
    [..]
    ns4 MPTCP -> ns3 (10.0.3.2:10108 ) MPTCP (duration 40823ms) [ OK ]
    ns4 MPTCP -> ns3 (10.0.3.2:10109 ) TCP (duration 23119ms) [ OK ]
    ns4 TCP -> ns3 (10.0.3.2:10110 ) MPTCP (duration 5421ms) [ OK ]
    ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP (duration 41446ms) [ OK ]
    ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP (duration 23427ms) [ OK ]
    ns4 TCP -> ns3 (dead:beef:3::2:10113) MPTCP (duration 5426ms) [ OK ]
    Time: 1396 seconds

    After:
    ns4 MPTCP -> ns3 (10.0.3.2:10108 ) MPTCP (duration 5417ms) [ OK ]
    ns4 MPTCP -> ns3 (10.0.3.2:10109 ) TCP (duration 5427ms) [ OK ]
    ns4 TCP -> ns3 (10.0.3.2:10110 ) MPTCP (duration 5422ms) [ OK ]
    ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP (duration 5415ms) [ OK ]
    ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP (duration 5422ms) [ OK ]
    ns4 TCP -> ns3 (dead:beef:3::2:10113) MPTCP (duration 5423ms) [ OK ]
    Time: 296 seconds

    Signed-off-by: Florian Westphal
    Reviewed-by: Matthieu Baerts
    Signed-off-by: David S. Miller

    Florian Westphal
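
The two differences from tcp_rcv_space_adjust listed above can be sketched as two small helpers. The struct and function names are hypothetical, and the way the window clamp is derived from the buffer size is left to the caller here, which is a simplification.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-subflow receive state. */
struct subflow_sketch {
    uint32_t rtt_us;
    int rcvbuf;
    int window_clamp;
};

/* Difference 1: use the largest rtt estimate observed on any subflow. */
uint32_t largest_subflow_rtt(const struct subflow_sketch *sf, size_t n)
{
    uint32_t rtt = 0;

    for (size_t i = 0; i < n; i++)
        if (sf[i].rtt_us > rtt)
            rtt = sf[i].rtt_us;
    return rtt;
}

/* Difference 2: push the mptcp-level values down to every subflow. */
void propagate_rcvbuf(struct subflow_sketch *sf, size_t n,
                      int msk_rcvbuf, int msk_window_clamp)
{
    for (size_t i = 0; i < n; i++) {
        sf[i].rcvbuf = msk_rcvbuf;
        sf[i].window_clamp = msk_window_clamp;
    }
}
```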
     

30 Jun, 2020

2 commits

  • When an MPTCP client tries to connect to itself, tcp_finish_connect() is
    never reached. Because of this, depending on the socket's current state,
    multiple faulty behaviours can be observed:

    1) a WARN_ON() in subflow_data_ready() is hit
    WARNING: CPU: 2 PID: 882 at net/mptcp/subflow.c:911 subflow_data_ready+0x18b/0x230
    [...]
    CPU: 2 PID: 882 Comm: gh35 Not tainted 5.7.0+ #187
    [...]
    RIP: 0010:subflow_data_ready+0x18b/0x230
    [...]
    Call Trace:
    tcp_data_queue+0xd2f/0x4250
    tcp_rcv_state_process+0xb1c/0x49d3
    tcp_v4_do_rcv+0x2bc/0x790
    __release_sock+0x153/0x2d0
    release_sock+0x4f/0x170
    mptcp_shutdown+0x167/0x4e0
    __sys_shutdown+0xe6/0x180
    __x64_sys_shutdown+0x50/0x70
    do_syscall_64+0x9a/0x370
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    2) client is stuck forever in mptcp_sendmsg() because the socket is not
    TCP_ESTABLISHED

    crash> bt 4847
    PID: 4847 TASK: ffff88814b2fb100 CPU: 1 COMMAND: "gh35"
    #0 [ffff8881376ff680] __schedule at ffffffff97248da4
    #1 [ffff8881376ff778] schedule at ffffffff9724a34f
    #2 [ffff8881376ff7a0] schedule_timeout at ffffffff97252ba0
    #3 [ffff8881376ff8a8] wait_woken at ffffffff958ab4ba
    #4 [ffff8881376ff940] sk_stream_wait_connect at ffffffff96c2d859
    #5 [ffff8881376ffa28] mptcp_sendmsg at ffffffff97207fca
    #6 [ffff8881376ffbc0] sock_sendmsg at ffffffff96be1b5b
    #7 [ffff8881376ffbe8] sock_write_iter at ffffffff96be1daa
    #8 [ffff8881376ffce8] new_sync_write at ffffffff95e5cb52
    #9 [ffff8881376ffe50] vfs_write at ffffffff95e6547f
    #10 [ffff8881376ffe90] ksys_write at ffffffff95e65d26
    #11 [ffff8881376fff28] do_syscall_64 at ffffffff956088ba
    #12 [ffff8881376fff50] entry_SYSCALL_64_after_hwframe at ffffffff9740008c
    RIP: 00007f126f6956ed RSP: 00007ffc2a320278 RFLAGS: 00000217
    RAX: ffffffffffffffda RBX: 0000000020000044 RCX: 00007f126f6956ed
    RDX: 0000000000000004 RSI: 00000000004007b8 RDI: 0000000000000003
    RBP: 00007ffc2a3202a0 R8: 0000000000400720 R9: 0000000000400720
    R10: 0000000000400720 R11: 0000000000000217 R12: 00000000004004b0
    R13: 00007ffc2a320380 R14: 0000000000000000 R15: 0000000000000000
    ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b

    3) tcpdump captures show that DSS is exchanged even when MP_CAPABLE handshake
    didn't complete.

    $ tcpdump -tnnr bad.pcap
    IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S], seq 3208913911, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291694721,nop,wscale 7,mptcp capable v1], length 0
    IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S.], seq 3208913911, ack 3208913912, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291706876,nop,wscale 7,mptcp capable v1], length 0
    IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 1, win 512, options [nop,nop,TS val 3291706876 ecr 3291706876], length 0
    IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 3291707876 ecr 3291706876,mptcp dss fin seq 0 subseq 0 len 1,nop,nop], length 0
    IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 2, win 512, options [nop,nop,TS val 3291707876 ecr 3291707876], length 0

    Force a fallback to TCP in these cases, and adjust the main socket
    state to avoid hanging in mptcp_sendmsg().

    Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/35
    Reported-by: Christoph Paasch
    Suggested-by: Paolo Abeni
    Signed-off-by: Davide Caratti
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Davide Caratti
     
  • Keep using MPTCP sockets and use a "dummy mapping" in case of fallback
    to regular TCP. When fallback is triggered, skip addition of the MPTCP
    option on send.

    Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/11
    Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/22
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Davide Caratti
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Davide Caratti
     

27 Jun, 2020

2 commits

  • Replace the radix tree with a hash table allocated
    at boot time. The radix tree has some shortcomings:
    a single lock is contended by all the mptcp operations,
    the lookup currently uses such lock, and traversing
    all the items would require a lock, too.

    With a hash table instead we trade a little memory to
    address all the above - a per bucket lock is used.

    To hash the MPTCP sockets, we re-use the msk's sk_node
    entry: the MPTCP sockets are never hashed by the stack.
    Replace the existing hash proto callbacks with a dummy
    implementation, annotating the above constraint.

    Additionally refactor the token creation code to:

    - limit the number of consecutive attempts to a fixed
    maximum. Hitting a hash bucket with a long chain is
    considered a failed attempt

    - accept() can no longer fail due to token management.

    - if token creation fails at connect() time, we
    fall back to TCP (previously the connection was closed)

    v1 -> v2:
    - fix "no newline at end of file" - Jakub

    Signed-off-by: Paolo Abeni
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Add the missing annotation in some setup-only
    functions.

    Signed-off-by: Paolo Abeni
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
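
The bounded-retry token creation from the hash table commit in this group can be sketched as below. The table size, retry limit and chain limit are illustrative values, the names are hypothetical, `rand()` merely stands in for the real token derivation, and the per-bucket lock is only indicated by a comment.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define TOKEN_BUCKETS       8 /* must be a power of two */
#define TOKEN_MAX_RETRIES   4 /* illustrative limits */
#define TOKEN_MAX_CHAIN_LEN 4

struct token_bucket {
    /* A per-bucket lock would live here and guard the fields below. */
    int chain_len;
    uint32_t tokens[TOKEN_MAX_CHAIN_LEN];
};

struct token_table {
    struct token_bucket buckets[TOKEN_BUCKETS];
};

/* Try to register a token; a full chain or a duplicate is a failed attempt. */
bool token_try_insert(struct token_table *t, uint32_t token)
{
    struct token_bucket *b = &t->buckets[token & (TOKEN_BUCKETS - 1)];

    if (b->chain_len >= TOKEN_MAX_CHAIN_LEN)
        return false;
    for (int i = 0; i < b->chain_len; i++)
        if (b->tokens[i] == token)
            return false;
    b->tokens[b->chain_len++] = token;
    return true;
}

/* Create a token with a bounded number of attempts. */
bool token_new(struct token_table *t, uint32_t *out)
{
    for (int attempt = 0; attempt < TOKEN_MAX_RETRIES; attempt++) {
        uint32_t token = (uint32_t)rand(); /* stand-in for the real derivation */

        if (token_try_insert(t, token)) {
            *out = token;
            return true;
        }
    }
    return false; /* at connect() time the caller would fall back to TCP */
}
```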
     

26 Jun, 2020

1 commit


24 Jun, 2020

2 commits


19 Jun, 2020

1 commit

  • The msk ownership is transferred to the child socket at
    3rd ack time, so that we avoid more lookups later. If the
    request does not reach the 3rd ack, the MSK reference is
    dropped at request sock release time.

    As a side effect, fallback is now tracked by a NULL msk
    reference instead of zeroed 'mp_join' field. This will
    simplify the next patch.

    Signed-off-by: Paolo Abeni
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     

16 Jun, 2020

2 commits


25 May, 2020

1 commit