02 Apr, 2020

4 commits

  • Obtained with:

    $ make W=1 net/mptcp/token.o
    net/mptcp/token.c:53: warning: Function parameter or member 'req' not described in 'mptcp_token_new_request'
    net/mptcp/token.c:98: warning: Function parameter or member 'sk' not described in 'mptcp_token_new_connect'
    net/mptcp/token.c:133: warning: Function parameter or member 'conn' not described in 'mptcp_token_new_accept'
    net/mptcp/token.c:178: warning: Function parameter or member 'token' not described in 'mptcp_token_destroy_request'
    net/mptcp/token.c:191: warning: Function parameter or member 'token' not described in 'mptcp_token_destroy'

    Fixes: 79c0949e9a09 ("mptcp: Add key generation and token tree")
    Fixes: 58b09919626b ("mptcp: create msk early")
    Signed-off-by: Matthieu Baerts
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Matthieu Baerts
     
  • mptcp_subflow_data_available() is commonly called via
    ssk->sk_data_ready(); in this case the mptcp socket lock
    cannot be acquired.

    Therefore, while we can safely discard subflow data that
    was already received up to msk->ack_seq, we cannot be sure
    that 'subflow->data_avail' will still be valid at the time
    userspace wants to read the data -- a previous read on a
    different subflow might have carried this data already.

    In that (unlikely) event, msk->ack_seq will have been updated
    and will be ahead of the subflow dsn.

    We can check for this condition and skip/resync to the expected
    sequence number (see the sketch after this entry).

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
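
    A minimal sketch of the skip/resync check, assuming the field names
    shown here (msk->ack_seq, the subflow mapping fields) and a
    hypothetical subflow_discard_data() helper; before64() is the 64-bit
    sequence comparison helper from net/mptcp/protocol.h:

    /* Another subflow may already have carried (part of) this mapping to
     * userspace: if the msk-level ack_seq moved past the subflow dsn,
     * drop the stale span instead of reporting it as available.
     */
    u64 ack_seq = READ_ONCE(msk->ack_seq);
    u64 dsn = subflow->map_seq + subflow->map_offset;   /* current subflow dsn */

    if (before64(dsn, ack_seq)) {
            /* skip/resync to the expected sequence number */
            subflow_discard_data(ssk, ack_seq - dsn);   /* hypothetical helper */
    }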
     
  • This is needed at least until proper MPTCP-Level fin/reset
    signalling gets added:

    We wake parent when a subflow changes, but we should do this only
    when all subflows have closed, not just one.

    Schedule the mptcp worker and tell it to check eof state on all
    subflows.

    Only flag the mptcp socket as closed and wake userspace processes
    blocked in poll once all subflows have closed (see the sketch after
    this entry).

    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
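
    A minimal sketch of the eof check run by the worker, assuming the
    hypothetical rx_eof flag shown here; the msk is only flagged closed
    once no subflow can deliver further data:

    static void mptcp_check_for_eof(struct mptcp_sock *msk)
    {
            struct mptcp_subflow_context *subflow;
            struct sock *sk = (struct sock *)msk;
            int receivers = 0;

            /* walk every subflow on conn_list */
            mptcp_for_each_subflow(msk, subflow)
                    receivers += !subflow->rx_eof;      /* hypothetical flag */

            if (!receivers && !(sk->sk_shutdown & RCV_SHUTDOWN)) {
                    /* all subflows closed: mark the msk and wake pollers */
                    sk->sk_shutdown |= RCV_SHUTDOWN;
                    sk->sk_data_ready(sk);
            }
    }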
     
  • Christoph Paasch reports the following crash:

    general protection fault [..]
    CPU: 0 PID: 2874 Comm: syz-executor072 Not tainted 5.6.0-rc5 #62
    RIP: 0010:__pv_queued_spin_lock_slowpath kernel/locking/qspinlock.c:471
    [..]
    queued_spin_lock_slowpath arch/x86/include/asm/qspinlock.h:50 [inline]
    do_raw_spin_lock include/linux/spinlock.h:181 [inline]
    spin_lock_bh include/linux/spinlock.h:343 [inline]
    __mptcp_flush_join_list+0x44/0xb0 net/mptcp/protocol.c:278
    mptcp_shutdown+0xb3/0x230 net/mptcp/protocol.c:1882
    [..]

    The problem is that the socket passed to mptcp_shutdown() isn't an
    mptcp socket; it's a plain tcp_sk. Thus, trying to access mptcp_sk
    specific members accesses garbage.

    The root cause is that accept() returns a fallback (tcp) socket, not an
    mptcp one. There is code in getpeername to detect this and override the
    socket's stream_ops. But this will only run when the accept() caller
    provided a sockaddr struct. "accept(fd, NULL, 0)" will therefore result
    in mptcp stream ops, but with sock->sk pointing at a tcp_sk.

    Update the existing fallback handling to detect this as well.

    Moreover, mptcp_shutdown did not have fallback handling, and
    mptcp_poll did it too late, so add it there as well.

    Reported-by: Christoph Paasch
    Tested-by: Christoph Paasch
    Reviewed-by: Mat Martineau
    Signed-off-by: Matthieu Baerts
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

30 Mar, 2020

15 commits

  • Expose a new netlink family to userspace to control the PM, setting:

    - list of local addresses to be signalled.
    - list of local addresses used to create subflows.
    - maximum number of add_addr options to react to

    When the msk is fully established, the PM netlink attempts to
    announce the 'signal' list via the ADD_ADDR option. Since we
    currently lack the ADD_ADDR echo (and related event) only the
    first addr is sent.

    After exhausting the 'announce' list, the PM tries to create
    subflow for each addr in 'local' list, waiting for each
    connection to be completed before attempting the next one.

    The idea is to add an additional PM hook for ADD_ADDR echo, to allow
    the PM netlink to announce multiple addresses, in sequence.

    Co-developed-by: Matthieu Baerts
    Signed-off-by: Matthieu Baerts
    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Exported via the same /proc file as the Linux TCP MIB counters, so "netstat -s"
    or "nstat" will show them automatically.

    The MPTCP MIB counters are allocated in a distinct pcpu area in order to
    avoid bloating/wasting TCP pcpu memory.

    Counters are allocated once the first MPTCP socket is created in a
    network namespace and free'd on exit.

    If no sockets have been allocated, all-zero mptcp counters are shown.

    The MIB counter list is taken from the multipath-tcp.org kernel, but
    only a few counters have been picked up so far. The counter list can
    be extended at any time later on; a sketch of the per-cpu counter
    scheme follows this entry.

    v2 -> v3:
    - remove 'inline' in foo.c files (David S. Miller)

    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Florian Westphal
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Florian Westphal
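
    A minimal sketch of the lazy per-cpu allocation, with hypothetical
    struct/field names; alloc_percpu() and this_cpu_inc() are the standard
    per-cpu primitives, mirroring how the TCP SNMP macros work:

    struct mptcp_mib {
            unsigned long counters[MPTCP_MIB_MAX];      /* hypothetical enum */
    };

    static bool mptcp_mib_alloc(struct net *net)
    {
            /* done once, when the first MPTCP socket shows up in the netns */
            struct mptcp_mib __percpu *mib = alloc_percpu(struct mptcp_mib);

            if (!mib)
                    return false;
            net->mib.mptcp_statistics = mib;            /* hypothetical field */
            return true;
    }

    /* bumping a counter is a cheap local per-cpu increment */
    #define MPTCP_INC_STATS(net, field) \
            this_cpu_inc((net)->mib.mptcp_statistics->counters[field])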
     
  • Add ulp-specific diagnostic functions, so that subflow information can be
    dumped to userspace programs like 'ss'.

    v2 -> v3:
    - uapi: use bit macros appropriate for userspace

    Co-developed-by: Matthieu Baerts
    Signed-off-by: Matthieu Baerts
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Davide Caratti
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Davide Caratti
     
  • On timeout event, schedule a work queue to do the retransmission.
    Retransmission code closely resembles the sendmsg() implementation and
    re-uses mptcp_sendmsg_frag, providing a dummy msghdr - for flags'
    sake - and peeking the relevant dfrag from the rtx head (see the
    sketch after this entry).

    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
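
    A minimal sketch of the retransmission path, assuming the rtx_queue
    list and the mptcp_data_frag layout; the msghdr is a dummy carrying
    only flags, and the oldest unacked dfrag is peeked from the rtx head:

    struct msghdr msg = { .msg_flags = MSG_DONTWAIT };  /* flags' sake only */
    struct mptcp_data_frag *dfrag;

    /* peek (do not remove) the oldest unacked fragment */
    dfrag = list_first_entry_or_null(&msk->rtx_queue,
                                     struct mptcp_data_frag, list);
    if (!dfrag)
            return;

    lock_sock(ssk);
    /* then re-use the regular sendmsg fragment helper with this dummy
     * msghdr and the stored dfrag, e.g.:
     *   mptcp_sendmsg_frag(sk, ssk, &msg, dfrag, ...);
     * (exact argument list omitted here)
     */
    release_sock(ssk);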
     
  • This will simplify mptcp-level retransmission implementation
    in the next patch. If dfrag is provided by the caller, skip
    kernel space memory allocation and use data and metadata
    provided by the dfrag itself.

    Because a peer could ack data at TCP level but refrain from
    sending mptcp-level ACKs, we could grow the mptcp socket
    backlog indefinitely.

    We should thus block mptcp_sendmsg until the peer has acked some of the
    sent data.

    In order to be able to do so, increment the mptcp socket wmem_queued
    counter on memory allocation and decrement it when releasing the memory
    on mptcp-level ack reception (see the sketch after this entry).

    Because TCP performs sndbuf auto-tuning up to tcp_wmem_max[2], make
    this the mptcp sk_sndbuf limit.

    In the future we could experiment with autotuning as TCP does in
    tcp_sndbuf_expand().

    v2 -> v3:
    - remove 'inline' in foo.c files (David S. Miller)

    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
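
    A minimal sketch of the accounting, with hypothetical charge/uncharge
    helper names; sk_wmem_queued_add(), sk_stream_memory_free() and
    sk_stream_wait_memory() are the standard socket helpers:

    /* charge freshly queued data to the mptcp socket */
    static void mptcp_charge_wmem(struct sock *sk, int size)
    {
            sk_wmem_queued_add(sk, size);       /* grows sk->sk_wmem_queued */
    }

    /* uncharge once an mptcp-level ack covers it */
    static void mptcp_uncharge_wmem(struct sock *sk, int size)
    {
            sk_wmem_queued_add(sk, -size);
            sk_mem_uncharge(sk, size);
    }

    /* in mptcp_sendmsg(): block until the peer acks some of the data */
    while (!sk_stream_memory_free(sk)) {        /* wmem_queued vs sk_sndbuf */
            ret = sk_stream_wait_memory(sk, &timeo);
            if (ret)
                    goto do_error;              /* assumed error label */
    }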
     
  • After adding wmem accounting for the mptcp socket we could get
    into a situation where the mptcp socket can't transmit more data,
    and mptcp_clean_una doesn't reduce wmem even if snd_una has advanced
    because it currently will only remove entire dfrags.

    Allow advancing the dfrag head sequence and reduce wmem,
    even though this isn't correct (as we can't release the page).

    Because we will soon block on the mptcp sk in case wmem is too large,
    call sk_stream_write_space() in case we reduced the backlog, so a
    userspace task blocked in sendmsg or poll will be woken up (see the
    sketch after this entry).

    This isn't an issue if the send buffer is large, but it is when
    SO_SNDBUF is used to reduce it to a lower value.

    Note we can still get a deadlock for low SO_SNDBUF values in
    case both sides of the connection write to the socket: both could
    be blocked due to wmem being too small -- and current mptcp stack
    will only increment mptcp ack_seq on recv.

    This doesn't happen with the selftest as it uses poll() and
    will always call recv if there is data to read.

    Signed-off-by: Florian Westphal
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Florian Westphal
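
    A minimal sketch of trimming the head dfrag on a partial ack, with
    assumed dfrag field names; sk_stream_write_space() is the standard
    wakeup for writers blocked in sendmsg()/poll():

    /* snd_una landed inside the head dfrag: shrink the acked part so the
     * wmem accounting drops, even though the page cannot be released yet
     */
    if (after64(snd_una, dfrag->data_seq)) {
            u64 delta = snd_una - dfrag->data_seq;

            dfrag->data_seq += delta;
            dfrag->offset   += delta;
            dfrag->data_len -= delta;
            sk_wmem_queued_add(sk, -(int)delta);
            cleaned = true;
    }

    if (cleaned && sk_stream_is_writeable(sk))
            sk_stream_write_space(sk);          /* wake blocked senders */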
     
  • Charge the data on the rtx queue to the master MPTCP socket, too.
    Such memory is uncharged when the data is acked/dequeued.

    Also account mptcp sockets inuse via a protocol specific pcpu
    counter.

    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • The timer will be used to schedule retransmission. Its
    frequency is based on the current subflow RTO estimation and
    it is reset on every una_seq update (see the sketch after this
    entry).

    The timer is cleared for good by __mptcp_clear_xmit().

    Also clean the MPTCP rtx queue before each transmission.

    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
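
    A minimal sketch of (re)arming the timer, assuming the RTO is taken
    from one subflow; sk_reset_timer(), icsk_rto and the icsk retransmit
    timer field are the standard inet_connection_sock pieces:

    static void mptcp_reset_timer(struct sock *sk, const struct sock *ssk)
    {
            struct inet_connection_sock *icsk = inet_csk(sk);
            unsigned long tout = inet_csk(ssk)->icsk_rto;   /* subflow RTO */

            if (!tout)
                    tout = TCP_RTO_MIN;
            /* re-run on every una_seq update; __mptcp_clear_xmit() stops it */
            sk_reset_timer(sk, &icsk->icsk_retransmit_timer, jiffies + tout);
    }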
     
  • Keep the send page fragment on an MPTCP level retransmission queue.
    The queue entries are allocated inside the page frag allocator,
    acquiring an additional reference to the page for each list entry.

    Also switch to a custom page frag refill function, to ensure that
    the current page fragment can always host an MPTCP rtx queue entry
    (see the sketch after this entry).

    The MPTCP rtx queue is flushed at disconnect() and close() time.

    Note that now we need to call __mptcp_init_sock() regardless of mptcp
    enable status, as the destructor will try to walk the rtx_queue.

    v2 -> v3:
    - remove 'inline' in foo.c files (David S. Miller)

    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
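
    A minimal sketch of hosting the rtx entry inside the page fragment
    itself, with assumed mptcp_data_frag fields; skb_page_frag_refill()
    and get_page() are standard:

    /* custom refill: the frag must fit the queue entry plus some payload */
    static bool mptcp_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
    {
            return skb_page_frag_refill(sizeof(struct mptcp_data_frag) + 1,
                                        pfrag, sk->sk_allocation);
    }

    /* when queueing data: carve the entry from the frag and pin the page */
    dfrag = (struct mptcp_data_frag *)(page_address(pfrag->page) + pfrag->offset);
    dfrag->page   = pfrag->page;
    dfrag->offset = pfrag->offset + sizeof(*dfrag);     /* payload follows */
    get_page(dfrag->page);                              /* one ref per entry */
    list_add_tail(&dfrag->list, &msk->rtx_queue);       /* flushed at close() */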
     
  • Keep the per-msk unacked sequence number consistent: since we
    update per-msk data, use an atomic64 cmpxchg() to protect
    against concurrent updates from multiple subflows (see the
    sketch after this entry).

    Initialize snd_una at connect()/accept() time.

    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
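
    A minimal sketch of the cmpxchg loop, assuming msk->snd_una is an
    atomic64_t and using an after64()-style 64-bit sequence comparison;
    concurrent subflows can race, but snd_una only ever moves forward:

    static void mptcp_update_snd_una(struct mptcp_sock *msk, u64 new_snd_una)
    {
            u64 old_snd_una = atomic64_read(&msk->snd_una);

            while (after64(new_snd_una, old_snd_una)) {
                    u64 prev;

                    prev = atomic64_cmpxchg(&msk->snd_una,
                                            old_snd_una, new_snd_una);
                    if (prev == old_snd_una)
                            break;              /* we won the race */
                    old_snd_una = prev;         /* another subflow moved it */
            }
    }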
     
  • Fill in more path manager functionality by adding a worker function and
    modifying the related stub functions to schedule the worker.

    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Peter Krystad
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Peter Krystad
     
  • Subflow creation may be initiated by the path manager when
    the primary connection is fully established and a remote
    address has been received via ADD_ADDR.

    Create an in-kernel sock and use kernel_connect() to
    initiate connection.

    Passive sockets can't acquire the mptcp socket lock at
    subflow creation time, so an additional list protected by
    a new spinlock is used to track the MPJ subflows.

    This list is spliced into the conn_list tail every time the msk
    socket lock is acquired (see the sketch after this entry), so that
    it will not interfere with data flow on the original connection.

    Data flow and connection failover are not addressed by this commit.

    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Co-developed-by: Matthieu Baerts
    Signed-off-by: Matthieu Baerts
    Signed-off-by: Peter Krystad
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Peter Krystad
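
    A minimal sketch of the two halves of the scheme; the splice matches
    the __mptcp_flush_join_list() frame visible in the 02 Apr backtrace
    above, while the lock/list field names are assumptions:

    /* passive side, msk socket lock unavailable: park the MPJ subflow */
    spin_lock_bh(&msk->join_list_lock);
    list_add_tail(&subflow->node, &msk->join_list);
    spin_unlock_bh(&msk->join_list_lock);

    /* run whenever the msk socket lock is acquired */
    static void __mptcp_flush_join_list(struct mptcp_sock *msk)
    {
            if (likely(list_empty(&msk->join_list)))
                    return;

            spin_lock_bh(&msk->join_list_lock);
            list_splice_tail_init(&msk->join_list, &msk->conn_list);
            spin_unlock_bh(&msk->join_list_lock);
    }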
     
  • Process the MP_JOIN option in a SYN packet with the same flow
    as MP_CAPABLE but when the third ACK is received add the
    subflow to the MPTCP socket subflow list instead of adding it to
    the TCP socket accept queue.

    The subflow is added at the end of the subflow list so it will not
    interfere with the existing subflows' operation, and no data is
    expected to be transmitted on it.

    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Peter Krystad
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Peter Krystad
     
  • Add enough of a path manager interface to allow sending of ADD_ADDR
    when an incoming MPTCP connection is created. Capable of sending only
    a single IPv4 ADD_ADDR option. The 'pm_data' element of the connection
    sock will need to be expanded to handle multiple interfaces and IPv6.
    Partial processing of the incoming ADD_ADDR is included so the path
    manager notification of that event happens at the proper time, which
    involves validating the incoming address information.

    This is a skeleton interface definition for events generated by
    MPTCP.

    Co-developed-by: Matthieu Baerts
    Signed-off-by: Matthieu Baerts
    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Co-developed-by: Mat Martineau
    Signed-off-by: Mat Martineau
    Signed-off-by: Peter Krystad
    Signed-off-by: David S. Miller

    Peter Krystad
     
  • Add handling for sending and receiving the ADD_ADDR, ADD_ADDR6,
    and RM_ADDR suboptions.

    Co-developed-by: Matthieu Baerts
    Signed-off-by: Matthieu Baerts
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Peter Krystad
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Peter Krystad
     

24 Mar, 2020

1 commit

  • It's still possible for packetdrill to hang in mptcp_sendmsg(), when the
    MPTCP socket falls back to regular TCP (e.g. after receiving unsupported
    flags/version during the three-way handshake). Adjust the MPTCP socket
    state earlier, to ensure correct functionality of mptcp_sendmsg() even
    in case of TCP fallback.

    Fixes: 767d3ded5fb8 ("net: mptcp: don't hang before sending 'MP capable with data'")
    Fixes: 1954b86016cf ("mptcp: Check connection state before attempting send")
    Signed-off-by: Davide Caratti
    Acked-by: Paolo Abeni
    Reviewed-by: Matthieu Baerts
    Signed-off-by: David S. Miller

    Davide Caratti
     

22 Mar, 2020

1 commit

  • Fixes gcc '-Wunused-but-set-variable' warning:

    net/mptcp/options.c: In function 'mptcp_established_options_dss':
    net/mptcp/options.c:338:7: warning:
    variable 'can_ack' set but not used [-Wunused-but-set-variable]

    Commit dc093db5cc05 ("mptcp: drop unneeded checks")
    left this variable behind unused; remove it.

    Signed-off-by: YueHaibing
    Acked-by: Paolo Abeni
    Signed-off-by: David S. Miller

    YueHaibing
     

18 Mar, 2020

1 commit

  • After commit 58b09919626b ("mptcp: create msk early"), the
    msk socket is already available at subflow_syn_recv_sock()
    time. Let's move there the state update, to mirror more
    closely the first subflow state.

    The above will also help multiple subflow support.

    Signed-off-by: Paolo Abeni
    Reviewed-by: Matthieu Baerts
    Signed-off-by: David S. Miller

    Paolo Abeni
     

15 Mar, 2020

2 commits

  • After the previous patch subflow->conn is always != NULL and
    is never changed. We can drop a bunch of now unneeded checks.

    v1 -> v2:
    - rebased on top of commit 2398e3991bda ("mptcp: always
    include dack if possible.")

    Signed-off-by: Paolo Abeni
    Reviewed-by: Matthieu Baerts
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • This change moves the mptcp socket allocation from mptcp_accept() to
    subflow_syn_recv_sock(), so that subflow->conn is now always set
    for the non fallback scenario.

    It allows cleaning up mptcp_accept() a bit, reducing the additional
    locking, and will allow further cleanup in the next patch.

    Signed-off-by: Paolo Abeni
    Reviewed-by: Matthieu Baerts
    Signed-off-by: David S. Miller

    Paolo Abeni
     

12 Mar, 2020

1 commit

  • The following packetdrill script

    socket(..., SOCK_STREAM, IPPROTO_MPTCP) = 3
    fcntl(3, F_GETFL) = 0x2 (flags O_RDWR)
    fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
    connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
    > S 0:0(0)
    < S. 0:0(0) ack 1 win 65535
    > . 1:1(0) ack 1 win 256
    getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
    fcntl(3, F_SETFL, O_RDWR) = 0
    write(3, ..., 1000) = 1000

    doesn't transmit the 1KB data packet after a successful three-way
    handshake using mp_capable with data, as required by protocol v1, and
    write() hangs forever:

    PID: 973 TASK: ffff97dd399cae80 CPU: 1 COMMAND: "packetdrill"
    #0 [ffffa9b94062fb78] __schedule at ffffffff9c90a000
    #1 [ffffa9b94062fc08] schedule at ffffffff9c90a4a0
    #2 [ffffa9b94062fc18] schedule_timeout at ffffffff9c90e00d
    #3 [ffffa9b94062fc90] wait_woken at ffffffff9c120184
    #4 [ffffa9b94062fcb0] sk_stream_wait_connect at ffffffff9c75b064
    #5 [ffffa9b94062fd20] mptcp_sendmsg at ffffffff9c8e801c
    #6 [ffffa9b94062fdc0] sock_sendmsg at ffffffff9c747324
    #7 [ffffa9b94062fdd8] sock_write_iter at ffffffff9c7473c7
    #8 [ffffa9b94062fe48] new_sync_write at ffffffff9c302976
    #9 [ffffa9b94062fed0] vfs_write at ffffffff9c305685
    #10 [ffffa9b94062ff00] ksys_write at ffffffff9c305985
    #11 [ffffa9b94062ff38] do_syscall_64 at ffffffff9c004475
    #12 [ffffa9b94062ff50] entry_SYSCALL_64_after_hwframe at ffffffff9ca0008c
    RIP: 00007f959407eaf7 RSP: 00007ffe9e95a910 RFLAGS: 00000293
    RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f959407eaf7
    RDX: 00000000000003e8 RSI: 0000000001785fe0 RDI: 0000000000000008
    RBP: 0000000001785fe0 R8: 0000000000000000 R9: 0000000000000003
    R10: 0000000000000007 R11: 0000000000000293 R12: 00000000000003e8
    R13: 00007ffe9e95ae30 R14: 0000000000000000 R15: 0000000000000000
    ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b

    Fix it by ensuring that the socket state is TCP_ESTABLISHED on reception
    of the third ack.

    Fixes: 1954b86016cf ("mptcp: Check connection state before attempting send")
    Suggested-by: Paolo Abeni
    Signed-off-by: Davide Caratti
    Signed-off-by: David S. Miller

    Davide Caratti
     

06 Mar, 2020

1 commit

  • Currently a passive MPTCP socket can skip including the DACK
    option - if the peer sends data before accept() completes.

    The above happens because the msk 'can_ack' flag is set
    only after the accept() call.

    Such missing DACK option may cause - as per RFC spec -
    unwanted fallback to TCP.

    This change addresses the issue using the key material
    available in the current subflow, if any, to create a suitable
    dack option when the msk ack seq is not yet available (see the
    sketch after this entry).

    v1 -> v2:
    - advance the generated ack after the initial MPC packet

    Fixes: d22f4988ffec ("mptcp: process MP_CAPABLE data option")
    Signed-off-by: Paolo Abeni
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
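
    A minimal sketch of deriving the ack when msk->can_ack is not yet set;
    mptcp_crypto_key_sha() is the existing key->token/IDSN helper, while
    the option/field names around it are assumptions:

    if (!READ_ONCE(msk->can_ack)) {
            u64 ack_seq;

            /* derive the initial data sequence number from the peer key */
            mptcp_crypto_key_sha(subflow->remote_key, NULL, &ack_seq);
            ack_seq++;                          /* advance past the MPC data */
            opts->ext_copy.data_ack = ack_seq;
    } else {
            opts->ext_copy.data_ack = READ_ONCE(msk->ack_seq);
    }
    opts->ext_copy.use_ack = 1;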
     

04 Mar, 2020

3 commits

  • When a DATA_FIN is sent in a MPTCP DSS option that contains a data
    mapping, the DATA_FIN consumes one byte of space in the mapping. In this
    case, the DATA_FIN should only be included in the DSS option if its
    sequence number aligns with the end of the mapped data (see the sketch
    after this entry). Otherwise the subflow can send an incorrect implicit
    sequence number for the DATA_FIN, and the DATA_ACK for that sequence
    number would not close the MPTCP-level connection correctly.

    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Mat Martineau
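
    A minimal sketch of the alignment check, with assumed mapping field
    names; the DATA_FIN occupies one byte of data sequence space, so it is
    only folded into a mapping whose end matches its sequence number:

    /* data_fin_tx_seq: the DATA_FIN sequence number (assumed name) */
    if (data_fin_enabled &&
        mpext->data_seq + mpext->data_len == data_fin_tx_seq) {
            mpext->data_fin  = 1;
            mpext->data_len += 1;       /* the DATA_FIN itself is one byte */
    }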
     
  • Instead of reading the MPTCP-level sequence number when sending DATA_FIN,
    store the data in the subflow so it can be safely accessed when the
    subflow TCP headers are written to the packet without the MPTCP-level
    lock held. This also allows the MPTCP-level socket to close individual
    subflows without closing the MPTCP connection.

    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Mat Martineau
     
  • MPTCP should wait for an active connection or skip sending depending on
    the connection state, as TCP does (see the sketch after this entry).
    This happens before the possible passthrough to a regular TCP sendmsg
    because the subflow's socket type (MPTCP or TCP fallback) is not known
    until the connection is complete. This is also relevant at disconnect
    time, where data should not be sent in certain MPTCP-level connection
    states.

    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Mat Martineau
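
    A minimal sketch of the state check at the head of the send path;
    sk_stream_wait_connect() and the TCPF_* masks are the standard TCP
    helpers, the surrounding variables are assumed:

    /* as TCP does: only ESTABLISHED/CLOSE_WAIT sockets may carry new data */
    if (!((1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT))) {
            ret = sk_stream_wait_connect(sk, &timeo);
            if (ret)
                    goto out;   /* not connected, and not going to be */
    }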
     

27 Feb, 2020

7 commits

  • syzbot noted that the master MPTCP socket lacks the icsk_sync_mss
    callback, and was able to trigger a null pointer dereference:

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    PGD 8e171067 P4D 8e171067 PUD 93fa2067 PMD 0
    Oops: 0010 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 8984 Comm: syz-executor066 Not tainted 5.6.0-rc2-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:0x0
    Code: Bad RIP value.
    RSP: 0018:ffffc900020b7b80 EFLAGS: 00010246
    RAX: 1ffff110124ba600 RBX: 0000000000000000 RCX: ffff88809fefa600
    RDX: ffff8880994cdb18 RSI: 0000000000000000 RDI: ffff8880925d3140
    RBP: ffffc900020b7bd8 R08: ffffffff870225be R09: fffffbfff140652a
    R10: fffffbfff140652a R11: 0000000000000000 R12: ffff8880925d35d0
    R13: ffff8880925d3140 R14: dffffc0000000000 R15: 1ffff110124ba6ba
    FS: 0000000001a0b880(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffffffffffffd6 CR3: 00000000a6d6f000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    cipso_v4_sock_setattr+0x34b/0x470 net/ipv4/cipso_ipv4.c:1888
    netlbl_sock_setattr+0x2a7/0x310 net/netlabel/netlabel_kapi.c:989
    smack_netlabel security/smack/smack_lsm.c:2425 [inline]
    smack_inode_setsecurity+0x3da/0x4a0 security/smack/smack_lsm.c:2716
    security_inode_setsecurity+0xb2/0x140 security/security.c:1364
    __vfs_setxattr_noperm+0x16f/0x3e0 fs/xattr.c:197
    vfs_setxattr fs/xattr.c:224 [inline]
    setxattr+0x335/0x430 fs/xattr.c:451
    __do_sys_fsetxattr fs/xattr.c:506 [inline]
    __se_sys_fsetxattr+0x130/0x1b0 fs/xattr.c:495
    __x64_sys_fsetxattr+0xbf/0xd0 fs/xattr.c:495
    do_syscall_64+0xf7/0x1c0 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x440199
    Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007ffcadc19e48 EFLAGS: 00000246 ORIG_RAX: 00000000000000be
    RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440199
    RDX: 0000000020000200 RSI: 00000000200001c0 RDI: 0000000000000003
    RBP: 00000000006ca018 R08: 0000000000000003 R09: 00000000004002c8
    R10: 0000000000000009 R11: 0000000000000246 R12: 0000000000401a20
    R13: 0000000000401ab0 R14: 0000000000000000 R15: 0000000000000000
    Modules linked in:
    CR2: 0000000000000000

    Address the issue by adding a dummy icsk_sync_mss callback (see the
    sketch after this entry). To properly sync the subflows' mss and
    options list we need some additional infrastructure, which will land
    in net-next.

    Reported-by: syzbot+f4dfece964792d80b139@syzkaller.appspotmail.com
    Fixes: 2303f994b3e1 ("mptcp: Associate MPTCP context with TCP socket")
    Signed-off-by: Paolo Abeni
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
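
    A minimal sketch of the dummy callback; icsk_sync_mss is declared as
    int (*)(struct sock *, u32) in struct inet_connection_sock, so a no-op
    is enough to keep callers like cipso_v4_sock_setattr() off a NULL
    pointer (how it gets wired up at msk init time is assumed here):

    static int mptcp_sync_mss(struct sock *sk, u32 pmtu)
    {
            /* subflow mss/options syncing needs more infrastructure,
             * see the commit message; do nothing for now
             */
            return 0;
    }

    /* at msk init time */
    inet_csk(sk)->icsk_sync_mss = mptcp_sync_mss;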
     
  • Don't schedule the work queue right away, instead defer this
    to the lock release callback.

    This has the advantage that it will give the recv path a chance to
    complete -- this might have moved all pending packets from the
    subflow to the mptcp receive queue, which allows us to avoid the
    schedule_work() (see the sketch after this entry).

    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Signed-off-by: Paolo Abeni
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
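
    A minimal sketch of the deferral, assuming a hypothetical flag bit;
    the .release_cb hook in struct proto is the same mechanism TCP uses
    for tcp_release_cb(), invoked by release_sock() when the lock owner
    lets go:

    /* data path: the msk is owned, just remember that work is pending */
    if (sock_owned_by_user(sk)) {
            set_bit(MPTCP_WORK_PENDING, &msk->flags);   /* hypothetical bit */
            return;
    }

    /* .release_cb */
    static void mptcp_release_cb(struct sock *sk)
    {
            struct mptcp_sock *msk = mptcp_sk(sk);

            /* only schedule if the recv path did not already drain the
             * subflow while the lock was held
             */
            if (test_and_clear_bit(MPTCP_WORK_PENDING, &msk->flags) &&
                schedule_work(&msk->work))
                    sock_hold(sk);              /* the worker holds a ref */
    }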
     
  • We can't lock_sock() the mptcp socket from the subflow data_ready callback;
    it would result in an ABBA deadlock with the subflow socket lock.

    We can however grab the spinlock: if that succeeds and the mptcp socket
    is not owned at the moment, we can process the new skbs right away
    without deferring this to the work queue (see the sketch after this
    entry).

    This avoids the schedule_work and hence the small delay until the
    work item is processed.

    Signed-off-by: Florian Westphal
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Florian Westphal
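
    A minimal sketch of the fast path in the msk data_ready handler;
    bh_lock_sock() and sock_owned_by_user() are standard, the flag and
    the move_skbs_to_msk() helper are assumptions:

    static void mptcp_data_ready(struct sock *sk, struct sock *ssk)
    {
            struct mptcp_sock *msk = mptcp_sk(sk);

            set_bit(MPTCP_DATA_READY, &msk->flags);

            /* only the spinlock: lock_sock() here would ABBA-deadlock
             * against the subflow socket lock
             */
            bh_lock_sock(sk);
            if (!sock_owned_by_user(sk))
                    move_skbs_to_msk(msk, ssk);         /* process right away */
            else if (schedule_work(&msk->work))         /* defer to the worker */
                    sock_hold(sk);
            bh_unlock_sock(sk);

            sk->sk_data_ready(sk);                      /* wake POLLIN waiters */
    }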
     
  • Only used to discard stale data from the subflow, so move
    it where needed.

    Signed-off-by: Florian Westphal
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • If userspace never drains the receive buffers we must stop draining
    the subflow socket(s) at some point.

    This adds the needed rmem accounting for this (see the sketch after
    this entry). If the threshold is reached, we stop draining the subflows.

    Signed-off-by: Florian Westphal
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Florian Westphal
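
    A minimal sketch of the check that bounds the draining, using the
    standard sk_rmem_alloc/sk_rcvbuf accounting; skb_set_owner_r() is what
    charges a moved skb to the msk:

    /* stop pulling skbs out of the subflow once the msk budget is gone */
    if (atomic_read(&sk->sk_rmem_alloc) > READ_ONCE(sk->sk_rcvbuf))
            return false;       /* leave the rest queued on the subflow */

    skb_set_owner_r(skb, sk);   /* account the moved skb against the msk */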
     
  • If userspace is not reading data, all the mptcp-level acks contain the
    ack_seq from the last time userspace read data rather than the most
    recent in-sequence value.

    This causes pointless retransmissions for data that is already queued.

    The reason for this is that all the mptcp protocol level processing
    happens at mptcp_recv time.

    This adds a work queue to move skbs from the subflow sockets' receive
    queues to the mptcp socket receive queue (which was not used so far).

    This allows us to announce the correct mptcp ack sequence in a timely
    fashion, even when the application does not call recv() on the mptcp socket
    for some time.

    We still wake userspace tasks waiting for POLLIN immediately:
    If the mptcp level receive queue is empty (because the work queue is
    still pending) it can be filled from in-sequence subflow sockets at
    recv time without a need to wait for the worker.

    The skb_orphan when moving skbs from the subflow to the mptcp level is
    needed because the destructor (sock_rfree) relies on the skb->sk (ssk!)
    lock being taken (see the sketch after this entry).

    A followup patch will add the needed rmem accounting for the moved skbs.

    Other problem: in case the application behaves as expected and calls
    recv() as soon as the mptcp socket becomes readable, the work queue will
    only waste cpu cycles. This will also be addressed in followup patches.

    Signed-off-by: Florian Westphal
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Florian Westphal
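
    A minimal sketch of the move performed by the worker (mapping/offset
    bookkeeping omitted); __skb_unlink(), skb_orphan() and
    __skb_queue_tail() are the standard queue helpers:

    /* runs from the work queue, msk socket lock held */
    while ((skb = skb_peek(&ssk->sk_receive_queue)) != NULL) {
            __skb_unlink(skb, &ssk->sk_receive_queue);

            /* drop ssk's sock_rfree destructor: it relies on the ssk
             * lock, which is not held once the skb sits on the msk queue
             */
            skb_orphan(skb);

            __skb_queue_tail(&sk->sk_receive_queue, skb);   /* msk queue */

            /* simplified: real code walks the DSS mapping to advance ack_seq */
            msk->ack_seq += skb->len;
    }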
     
  • Will be extended with functionality in followup patches.
    The initial user is moving skbs from the subflows' receive queues to
    the mptcp-level receive queue.

    Signed-off-by: Paolo Abeni
    Signed-off-by: Florian Westphal
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni