10 Dec, 2020

1 commit

  • When cwnd is not a multiple of the TSO skb size of N*MSS, we can get
    into persistent scenarios where we have the following sequence:

    (1) ACK for full-sized skb of N*MSS arrives
    -> tcp_write_xmit() transmit full-sized skb with N*MSS
    -> move pacing release time forward
    -> exit tcp_write_xmit() because pacing time is in the future

    (2) TSQ callback or TCP internal pacing timer fires
    -> try to transmit next skb, but TSO deferral finds remainder of
    available cwnd is not big enough to trigger an immediate send
    now, so we defer sending until the next ACK.

    (3) repeat...

    So we can get into a case where we never mark ourselves as
    cwnd-limited for many seconds at a time, even with
    bulk/infinite-backlog senders, because:

    o In case (1) above, every time in tcp_write_xmit() we have enough
    cwnd to send a full-sized skb, we are not fully using the cwnd
    (because cwnd is not a multiple of the TSO skb size). So every time we
    send data, we are not cwnd limited, and so in the cwnd-limited
    tracking code in tcp_cwnd_validate() we mark ourselves as not
    cwnd-limited.

    o In case (2) above, every time in tcp_write_xmit() that we try to
    transmit the "remainder" of the cwnd but defer, we set the local
    variable is_cwnd_limited to true, but we do not send any packets, so
    sent_pkts is zero, so we don't call the cwnd-limited logic to update
    tp->is_cwnd_limited.
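
    The interaction between cases (1) and (2) can be reproduced with a toy
    model. This is only a sketch, not kernel code: struct toy_sender,
    toy_write_xmit() and toy_count_limited_rounds() are hypothetical names,
    and the transmit path is reduced to the two cases described above.

```c
#include <stdbool.h>

/* Toy sender state; names are illustrative, not kernel symbols. */
struct toy_sender {
	unsigned int cwnd;       /* congestion window, in packets */
	unsigned int in_flight;  /* packets sent but not yet ACKed */
	unsigned int tso_segs;   /* packets per full-sized TSO skb (N) */
	bool is_cwnd_limited;    /* stands in for tp->is_cwnd_limited */
};

/* One simplified pass of the transmit path. Returns packets sent. */
static unsigned int toy_write_xmit(struct toy_sender *s)
{
	unsigned int room = s->in_flight < s->cwnd ? s->cwnd - s->in_flight : 0;
	bool local_limited;
	unsigned int sent = 0;

	if (room >= s->tso_segs) {
		/* Case (1): enough cwnd for a full-sized skb. */
		s->in_flight += s->tso_segs;
		sent = s->tso_segs;
		local_limited = s->in_flight >= s->cwnd;
	} else {
		/* Case (2): remainder too small, defer until the next ACK. */
		local_limited = true;
	}

	/* The shared flag is only refreshed when something was sent. */
	if (sent > 0)
		s->is_cwnd_limited = local_limited;
	return sent;
}

/* Run `rounds` iterations of (1) then (2); count rounds that end marked limited. */
static unsigned int toy_count_limited_rounds(unsigned int cwnd,
					     unsigned int tso_segs,
					     unsigned int rounds)
{
	struct toy_sender s = { cwnd, 0, tso_segs, false };
	unsigned int i, limited = 0;

	if (tso_segs == 0 || cwnd < tso_segs)
		return 0;  /* the model assumes at least one full skb fits */

	/* Fill the window with as many full-sized skbs as fit. */
	while (toy_write_xmit(&s) > 0)
		;
	for (i = 0; i < rounds; i++) {
		s.in_flight -= tso_segs;  /* (1) ACK for one full skb ... */
		toy_write_xmit(&s);       /* ... lets one full skb go out */
		toy_write_xmit(&s);       /* (2) timer fires, send is deferred */
		if (s.is_cwnd_limited)
			limited++;
	}
	return limited;
}
```

    In this model a cwnd of 25 with 10-packet skbs never ends a round
    marked cwnd-limited, while a cwnd of 20 does so every round.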

    Fixes: ca8a22634381 ("tcp: make cwnd-limited checks measurement-based, and gentler")
    Reported-by: Ingemar Johansson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: Eric Dumazet
    Link: https://lore.kernel.org/r/20201209035759.1225145-1-ncardwell.kernel@gmail.com
    Signed-off-by: Jakub Kicinski

    Neal Cardwell
     

01 Oct, 2020

2 commits

  • Whenever a host is under very high memory pressure,
    __tcp_send_ack() skb allocation fails, and we set up
    a 200 ms (TCP_DELACK_MAX) timer before retrying.

    On hosts with a high number of TCP sockets, we can spend
    a considerable amount of CPU cycles in these attempts and
    add high pressure on various spinlocks in the mm layer,
    ultimately blocking threads attempting to free space
    from making any progress.

    This patch adds standard exponential backoff to avoid
    adding fuel to the fire.
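
    A minimal sketch of such a backoff, assuming the 200 ms base delay
    mentioned above. The cap MAX_DELAY_MS and the names struct ack_retry,
    next_ack_retry_delay_ms() and ack_retry_reset() are hypothetical and
    chosen for illustration; they are not taken from the patch.

```c
#define BASE_DELAY_MS 200u     /* base delay from the message (TCP_DELACK_MAX) */
#define MAX_DELAY_MS  120000u  /* assumed upper bound, for illustration only */

/* Per-socket retry state; hypothetical name. */
struct ack_retry {
	unsigned int retry;  /* number of doublings applied so far */
};

/* Return the delay before the next attempt and advance the backoff. */
static unsigned int next_ack_retry_delay_ms(struct ack_retry *st)
{
	unsigned int delay = BASE_DELAY_MS << st->retry;

	if (delay >= MAX_DELAY_MS)
		return MAX_DELAY_MS;  /* stay at the cap, stop doubling */
	st->retry++;
	return delay;
}

/* Call once an ACK was sent successfully. */
static void ack_retry_reset(struct ack_retry *st)
{
	st->retry = 0;
}
```

    Successive failures then wait 200 ms, 400 ms, 800 ms and so on until
    the cap is reached, instead of retrying every 200 ms.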

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • TCP has been using icsk_ack.blocked to work around the possibility of
    tcp_delack_timer() finding the socket owned by user.

    After commit 6f458dfb4092 ("tcp: improve latencies of timer triggered events")
    we added TCP_DELACK_TIMER_DEFERRED atomic bit for more immediate recovery,
    so we can get rid of icsk_ack.blocked.

    This frees space that the following patch will reuse.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 Sep, 2020

1 commit

  • SOCK_QUEUE_SHRUNK is currently used by TCP as a temporary state
    that remembers if some room has been made in the rtx queue
    by an incoming ACK packet.

    This is later used from tcp_check_space() before
    deciding whether to send EPOLLOUT.

    Problem is: if we receive SACK packets, and no packet
    is removed from the RTX queue, we can send fresh packets, thus
    moving them from the write queue to the rtx queue and eventually
    emptying the write queue.

    This stall can happen if TCP_NOTSENT_LOWAT is used.

    With this fix, we no longer risk stalling sends while holes
    are repaired, and we can fully use socket sndbuf.

    This also removes a cache line dirtying for typical RPC
    workloads.

    Fixes: c9bee3b7fdec ("tcp: TCP_NOTSENT_LOWAT socket option")
    Signed-off-by: Eric Dumazet
    Cc: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

25 Aug, 2020

3 commits

  • [ Note: The TCP changes here are mainly to implement the bpf
    pieces into the bpf_skops_*() functions introduced
    in the earlier patches. ]

    The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
    algorithm to be written in BPF. It opens up opportunities to allow
    a faster turnaround time in testing/releasing new congestion control
    ideas to production environment.

    The same flexibility can be extended to writing TCP header options.
    It is not uncommon that people want to test a new TCP header option
    to improve TCP performance. Another use case is a data center
    that has a more controlled environment and more flexibility in
    putting header options for internal-only use.

    For example, we want to test the idea of putting the maximum delay
    ACK in a TCP header option, which is similar to a draft RFC proposal [1].

    This patch introduces the necessary BPF API and uses it in the
    TCP stack to allow a BPF_PROG_TYPE_SOCK_OPS program to parse
    and write TCP header options. It currently supports most
    TCP packets except RST.

    Supported TCP header option:
    ───────────────────────────
    This patch allows the bpf-prog to write any option kind.
    Different bpf-progs can each write their own option by calling the new
    helper bpf_store_hdr_opt(). The helper will ensure there is no duplicated
    option in the header.

    By allowing the bpf-prog to write any option kind, this gives a lot of
    flexibility to the bpf-prog. Different bpf-progs can write their
    own option kinds. It could also allow the bpf-prog to support a
    recently standardized option on an older kernel.

    Sockops Callback Flags:
    ──────────────────────
    The bpf program will only be called to parse/write tcp header option
    if the following newly added callback flags are enabled
    in tp->bpf_sock_ops_cb_flags:
    BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG
    BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG
    BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG

    A few words on the PARSE CB flags. When the above PARSE CB flags are
    turned on, the bpf-prog will be called on packets received
    at a sk that has at least reached the ESTABLISHED state.
    The parsing of the SYN-SYNACK-ACK will be discussed in the
    "3 Way HandShake" section.

    The default is off for all of the above new CB flags, i.e. the bpf prog
    will not be called to parse or write bpf hdr option. There are
    detailed comments on these new cb flags in the UAPI bpf.h.

    sock_ops->skb_data and bpf_load_hdr_opt()
    ─────────────────────────────────────────
    sock_ops->skb_data and sock_ops->skb_data_end cover the whole
    TCP header and its options. They are read only.

    The new bpf_load_hdr_opt() helps to read a particular option "kind"
    from the skb_data.

    Please refer to the comment in UAPI bpf.h. It has details
    on what skb_data contains under different sock_ops->op.
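
    Looking up an option by "kind" amounts to walking the kind/length/value
    entries in the option area. The userspace sketch below shows that walk
    on a plain byte buffer. It is not the implementation of
    bpf_load_hdr_opt(); find_tcp_option() is a hypothetical name and the
    buffer is assumed to start at the first option byte.

```c
#include <stddef.h>
#include <stdint.h>

#define TCPOPT_EOL 0  /* end of option list */
#define TCPOPT_NOP 1  /* one-byte padding */

/*
 * Return a pointer to the first option of the given kind inside
 * opts[0..len), or NULL if it is absent or the option area is malformed.
 */
static const uint8_t *find_tcp_option(const uint8_t *opts, size_t len,
				      uint8_t kind)
{
	size_t i = 0;

	while (i < len) {
		uint8_t k = opts[i];
		uint8_t optlen;

		if (k == TCPOPT_EOL)
			return NULL;
		if (k == TCPOPT_NOP) {
			i++;  /* NOP has no length byte */
			continue;
		}
		/* Every other kind is followed by a length byte. */
		if (i + 1 >= len)
			return NULL;
		optlen = opts[i + 1];
		if (optlen < 2 || optlen > len - i)
			return NULL;  /* truncated or bogus length */
		if (k == kind)
			return &opts[i];
		i += optlen;
	}
	return NULL;
}
```

    The returned pointer addresses the kind byte, so the length is at
    offset 1 and the option data starts at offset 2.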

    3 Way HandShake
    ───────────────
    The bpf-prog can learn if it is sending SYN or SYNACK by reading the
    sock_ops->skb_tcp_flags.

    * Passive side

    When writing SYNACK (i.e. sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB),
    the received SYN skb will be available to the bpf prog. The bpf prog can
    use the SYN skb (which may carry the header option sent from the remote bpf
    prog) to decide what bpf header option should be written to the outgoing
    SYNACK skb. The SYN packet can be obtained by getsockopt(TCP_BPF_SYN*).
    More on this later. Also, the bpf prog can learn if it is in syncookie
    mode (by checking sock_ops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE).

    The bpf prog can store the received SYN pkt by using the existing
    bpf_setsockopt(TCP_SAVE_SYN). The example in a later patch does it.
    [ Note that the fullsock here is a listen sk, bpf_sk_storage
    is not very useful here since the listen sk will be shared
    by many concurrent connection requests.

    Extending bpf_sk_storage support to request_sock will add weight
    to the minisock and it is not necessarily better than storing the
    whole ~100 bytes SYN pkt. ]

    When the connection is established, the bpf prog will be called
    in the existing PASSIVE_ESTABLISHED_CB callback. At that time,
    the bpf prog can get the header option from the saved syn and
    then apply the needed operation to the newly established socket.
    The later patch will use the max delay ack specified in the SYN
    header and set the RTO of this newly established connection
    as an example.

    The received ACK (that concludes the 3WHS) will also be available to
    the bpf prog during PASSIVE_ESTABLISHED_CB through the sock_ops->skb_data.
    It could be useful in syncookie scenario. More on this later.

    There is an existing getsockopt "TCP_SAVED_SYN" to return the whole
    saved syn pkt which includes the IP[46] header and the TCP header.
    A few "TCP_BPF_SYN*" getsockopts have been added to allow specifying where to
    start getting from, e.g. starting from TCP header, or from IP[46] header.

    The new getsockopt(TCP_BPF_SYN*) will also know where it can get
    the SYN's packet from:
    - (a) the just received syn (available when the bpf prog is writing SYNACK)
    and it is the only way to get SYN during syncookie mode.
    or
    - (b) the saved syn (available in PASSIVE_ESTABLISHED_CB and also other
    existing CB).

    The bpf prog does not need to know where the SYN pkt is coming from.
    The getsockopt(TCP_BPF_SYN*) will hide these details.

    Similarly, a flag "BPF_LOAD_HDR_OPT_TCP_SYN" is also added to
    bpf_load_hdr_opt() to read a particular header option from the SYN packet.

    * Fastopen

    Fastopen should work the same as the regular non-fastopen case.
    This is tested in a later patch.

    * Syncookie

    For syncookie, the later example patch asks the active
    side's bpf prog to resend the header options in ACK. The server
    can use bpf_load_hdr_opt() to look at the options in this
    received ACK during PASSIVE_ESTABLISHED_CB.

    * Active side

    The bpf prog will get a chance to write the bpf header option
    in the SYN packet during WRITE_HDR_OPT_CB. The received SYNACK
    pkt will also be available to the bpf prog during the existing
    ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data
    and bpf_load_hdr_opt().

    * Turn off header CB flags after 3WHS

    If the bpf prog does not need to write/parse header options
    beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags
    to avoid being called for header options.
    Or the bpf-prog can choose to leave the UNKNOWN_HDR_OPT_CB_FLAG on
    so that the kernel will only call it when there is an option that
    the kernel cannot handle.

    [1]: draft-wang-tcpm-low-latency-opt-00
    https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200820190104.2885895-1-kafai@fb.com

    Martin KaFai Lau
     
  • The bpf prog needs to parse the SYN header to learn what options have
    been sent by the peer's bpf-prog before writing its options into SYNACK.
    This patch adds a "syn_skb" arg to tcp_make_synack() and send_synack().
    This syn_skb will eventually be made available (as read-only) to the
    bpf prog. This will be the only SYN packet available to the bpf
    prog during syncookie. For other regular cases, the bpf prog can
    also use the saved_syn.

    When writing options, the bpf prog will first be called to tell the
    kernel its required number of bytes. It is done by the new
    bpf_skops_hdr_opt_len(). The bpf prog will only be called when the new
    BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG is set in tp->bpf_sock_ops_cb_flags.
    When the bpf prog returns, the kernel will know how many bytes are needed
    and then update the "*remaining" arg accordingly. 4 byte alignment will
    be included in the "*remaining" before this function returns. The 4 byte
    aligned number of bytes will also be stored into the opts->bpf_opt_len.
    "bpf_opt_len" is a newly added member to the struct tcp_out_options.

    Then the new bpf_skops_write_hdr_opt() will call the bpf prog to write the
    header options. The bpf prog is only called if it has reserved spaces
    before (opts->bpf_opt_len > 0).
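
    The length reservation step can be pictured as follows. This is a
    sketch under stated assumptions: reserve_hdr_opt_space() and align4()
    are hypothetical names, and refusing a request that does not fit is an
    assumption, since the message does not say what happens in that case.

```c
/* Round n up to the next multiple of 4. */
static unsigned int align4(unsigned int n)
{
	return (n + 3u) & ~3u;
}

/*
 * Reserve `wanted` bytes of option space out of *remaining.
 * Returns the 4-byte aligned size that was reserved (the value a
 * field like bpf_opt_len would hold), or 0 if nothing was reserved.
 */
static unsigned int reserve_hdr_opt_space(unsigned int wanted,
					  unsigned int *remaining)
{
	unsigned int aligned = align4(wanted);

	if (wanted == 0 || aligned > *remaining)
		return 0;  /* assumed behavior: leave *remaining untouched */
	*remaining -= aligned;
	return aligned;
}
```

    A request for 5 bytes with 40 bytes remaining reserves 8 bytes and
    leaves 32, and the write step only runs when the reserved size is
    greater than zero.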

    The bpf prog is the last one getting a chance to reserve header space
    and to write the header option.

    These two functions are half implemented to highlight the changes in
    the TCP stack. The actual code preparing the bpf running context and
    invoking the bpf prog will be added in a later patch with other
    necessary bpf pieces.

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Eric Dumazet
    Link: https://lore.kernel.org/bpf/20200820190052.2885316-1-kafai@fb.com

    Martin KaFai Lau
     
  • This change is mostly from an internal patch and adapts it from sysctl
    config to the bpf_setsockopt setup.

    The bpf_prog can set the max delay ack by using
    bpf_setsockopt(TCP_BPF_DELACK_MAX). This max delay ack can be communicated
    to its peer through bpf header option. The receiving peer can then use
    this max delay ack and set a potentially lower rto by using
    bpf_setsockopt(TCP_BPF_RTO_MIN) which will be introduced
    in the next patch.

    Another later selftest patch will also use it like the above to show
    how to write and parse bpf tcp header option.

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Eric Dumazet
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20200820190021.2884000-1-kafai@fb.com

    Martin KaFai Lau
     

01 Aug, 2020

1 commit

  • Nowadays the output function has a 'synack_type' argument that tells us
    when the syn/ack is emitted via syncookies.

    The request already tells us when timestamps are supported, so check
    both to detect whether the special timestamp for tcp option encoding
    is needed.

    We could remove cookie_ts altogether, but a followup patch would
    otherwise need to adjust function signatures to pass 'want_cookie' to
    mptcp core.

    This way, the 'existing' bit can be used.

    Suggested-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

26 Jul, 2020

1 commit

  • The UDP reuseport conflict was a little bit tricky.

    The net-next code, via bpf-next, extracted the reuseport handling
    into a helper so that the BPF sk lookup code could invoke it.

    At the same time, the logic for reuseport handling of unconnected
    sockets changed via commit efc6b6f6c3113e8b203b9debfb72d81e0f3dcace
    which changed the logic to carry on the reuseport result into the
    rest of the lookup loop if we do not return immediately.

    This requires moving the reuseport_has_conns() logic into the callers.

    While we are here, get rid of inline directives as they do not belong
    in foo.c files.

    The other changes were cases of more straightforward overlapping
    modifications.

    Signed-off-by: David S. Miller

    David S. Miller
     

24 Jul, 2020

1 commit

  • Previously TLP may send multiple probes of new data in one
    flight. This happens when the sender is cwnd limited. After the
    initial TLP containing new data is sent, the sender receives another
    ACK that acks partial inflight. It may re-arm another TLP timer
    to send more, if no further ACK returns before the next TLP timeout
    (PTO) expires. The sender may in theory send a large number of TLPs
    until the send queue is depleted. This only happens if the sender sees
    such an irregular, uncommon ACK pattern. But it is generally undesirable
    behavior, especially during congestion.

    The original TLP design restricts the sender to one TLP probe per
    inflight, as published in "Reducing Web Latency: the Virtue of Gentle
    Aggression", SIGCOMM 2013. This patch changes TLP to send at most one
    probe per inflight.
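
    One way to picture the "at most one probe per inflight" rule is a
    guard that refuses a new probe while an earlier one is unacknowledged.
    The sketch below is a toy model with hypothetical names (struct
    toy_tlp, toy_tlp_timer_fired(), toy_tlp_ack()); it does not claim to
    mirror how the patch tracks the outstanding probe.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy probe state; names are illustrative, not kernel fields. */
struct toy_tlp {
	uint32_t snd_nxt;        /* next sequence number to send */
	uint32_t probe_end_seq;  /* snd_nxt right after the probe was sent */
	bool probe_outstanding;  /* a probe has been sent and not yet ACKed */
};

/* Called when the probe timer fires. Returns true if a probe is sent. */
static bool toy_tlp_timer_fired(struct toy_tlp *t, uint32_t probe_len)
{
	if (t->probe_outstanding)
		return false;          /* at most one probe per flight */
	t->snd_nxt += probe_len;       /* the probe carries new data */
	t->probe_end_seq = t->snd_nxt;
	t->probe_outstanding = true;
	return true;
}

/* Called for each cumulative ACK. */
static void toy_tlp_ack(struct toy_tlp *t, uint32_t ack_seq)
{
	/* Wrap-safe check that the ACK covers everything up to the probe. */
	if (t->probe_outstanding &&
	    (int32_t)(ack_seq - t->probe_end_seq) >= 0)
		t->probe_outstanding = false;
}
```

    A partial ACK leaves the guard in place, so a second timer expiry in
    the same flight sends nothing.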

    Note that if the sender is app-limited, TLP retransmits old data
    and does not have this issue.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

14 Jul, 2020

1 commit

  • Simple fixes which require no deep knowledge of the code.

    Cc: Paul Moore
    Cc: Alexey Kuznetsov
    Cc: Eric Dumazet
    Signed-off-by: Andrew Lunn
    Acked-by: Paul Moore
    Signed-off-by: David S. Miller

    Andrew Lunn
     

11 Jul, 2020

1 commit


02 Jul, 2020

1 commit

  • Whenever cookie_init_timestamp() has been used to encode
    ECN,SACK,WSCALE options, we can not remove the TS option in the SYNACK.

    Otherwise, tcp_synack_options() will still advertise options like WSCALE
    that we can not deduce later when receiving the packet from the client
    to complete 3WHS.
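
    The dependency on the TS option follows from where the options are
    stored: they travel in the timestamp value and come back in the
    client's echo. The sketch below illustrates that round trip. The bit
    layout and the names toy_encode_ts() and toy_decode_ts() are chosen
    for illustration and are assumptions, not a description of
    cookie_init_timestamp().

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative layout of the low timestamp bits (an assumption). */
#define TS_BITS        6u
#define TS_MASK        ((1u << TS_BITS) - 1u)
#define TS_WSCALE_MASK 0xfu
#define TS_SACK        (1u << 4)
#define TS_ECN         (1u << 5)

struct toy_opts {
	uint8_t wscale;
	bool sack_ok;
	bool ecn_ok;
};

/* Fold the negotiated options into the low bits of the timestamp. */
static uint32_t toy_encode_ts(uint32_t now, struct toy_opts o)
{
	uint32_t bits = o.wscale & TS_WSCALE_MASK;

	if (o.sack_ok)
		bits |= TS_SACK;
	if (o.ecn_ok)
		bits |= TS_ECN;
	return (now & ~TS_MASK) | bits;
}

/* Recover the options from the timestamp echoed by the client. */
static struct toy_opts toy_decode_ts(uint32_t tsecr)
{
	struct toy_opts o;

	o.wscale = (uint8_t)(tsecr & TS_WSCALE_MASK);
	o.sack_ok = (tsecr & TS_SACK) != 0;
	o.ecn_ok = (tsecr & TS_ECN) != 0;
	return o;
}
```

    If the SYNACK carries no TS option, there is no echoed value to decode
    when the final ACK arrives, which is the failure described above.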

    Note that modern linux TCP stacks won't use MD5+TS+SACK in a SYN packet,
    but we can not know for sure that all TCP stacks have the same logic.

    Before the fix a tcpdump would exhibit this wrong exchange :

    10:12:15.464591 IP C > S: Flags [S], seq 4202415601, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 456965269 ecr 0,nop,wscale 8], length 0
    10:12:15.464602 IP S > C: Flags [S.], seq 253516766, ack 4202415602, win 65535, options [nop,nop,md5 valid,mss 1400,nop,nop,sackOK,nop,wscale 8], length 0
    10:12:15.464611 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid], length 0
    10:12:15.464678 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid], length 12
    10:12:15.464685 IP S > C: Flags [.], ack 13, win 65535, options [nop,nop,md5 valid], length 0

    After this patch the exchange looks saner :

    11:59:59.882990 IP C > S: Flags [S], seq 517075944, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508483 ecr 0,nop,wscale 8], length 0
    11:59:59.883002 IP S > C: Flags [S.], seq 1902939253, ack 517075945, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508479 ecr 1751508483,nop,wscale 8], length 0
    11:59:59.883012 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 0
    11:59:59.883114 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 12
    11:59:59.883122 IP S > C: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508483], length 0
    11:59:59.883152 IP S > C: Flags [P.], seq 1:13, ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508483], length 12
    11:59:59.883170 IP C > S: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508484], length 0

    Of course, no SACK block will ever be added later, but nothing should break.
    Technically, we could remove the 4 nops included in MD5+TS options,
    but again some stacks could break seeing not conventional alignment.

    Fixes: 4957faade11b ("TCPCT part 1g: Responder Cookie => Initiator")
    Signed-off-by: Eric Dumazet
    Cc: Florian Westphal
    Cc: Mathieu Desnoyers
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Jun, 2020

2 commits


07 May, 2020

2 commits

  • As hinted in prior change ("tcp: refine tcp_pacing_delay()
    for very low pacing rates"), it is probably best arming
    the xmit timer only when all the packets have been scheduled,
    rather than when the head of rtx queue has been re-sent.

    This does matter for flows having extremely low pacing rates,
    since their tp->tcp_wstamp_ns could be far in the future.

    Note that the regular xmit path has a stronger limit
    in tcp_small_queue_check(), meaning it is less likely to
    go beyond the pacing horizon.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • With the addition of horizon feature to sch_fq, we noticed some
    suboptimal behavior of extremely low pacing rate TCP flows, especially
    when TCP is not aware of a drop happening in lower stacks.

    Back in commit 3f80e08f40cd ("tcp: add tcp_reset_xmit_timer() helper"),
    tcp_pacing_delay() was added to estimate an extra delay to add to standard
    rto timers.

    This patch removes the skb argument from this helper and
    tcp_reset_xmit_timer() because it makes more sense to simply
    consider the time at which next packet is allowed to be sent,
    instead of the time of whatever packet has been sent.

    This avoids arming RTO timer too soon and removes
    spurious horizon drops.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 May, 2020

1 commit

  • In commit 86de5921a3d5 ("tcp: defer SACK compression after DupThresh")
    I added a TCP_FASTRETRANS_THRESH bias to tp->compressed_ack in order
    to enable sack compression only after 3 dupacks.

    Since we plan to relax this rule for flows that involve
    stacks not requiring this old rule, this patch adds
    a distinct tp->dup_ack_counter.

    This means the TCP_FASTRETRANS_THRESH value is now used
    in a single location that a future patch can adjust:

    if (tp->dup_ack_counter < TCP_FASTRETRANS_THRESH) {
            tp->dup_ack_counter++;
            goto send_now;
    }
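
    The effect of that check can be seen in a small standalone model: the
    first three duplicate ACKs go out immediately, later ones become
    candidates for compression. Only the counter logic is modeled here;
    struct toy_ack_state and toy_on_ooo_segment() are hypothetical names
    and everything else about the compression path is omitted.

```c
#include <stdbool.h>

#define TCP_FASTRETRANS_THRESH 3

/* Toy receiver state; names are illustrative. */
struct toy_ack_state {
	unsigned int dup_ack_counter;  /* dupacks sent immediately so far */
	unsigned int compressed_ack;   /* ACKs withheld for compression */
};

/* Returns true if this out-of-order arrival triggers an immediate ACK. */
static bool toy_on_ooo_segment(struct toy_ack_state *s)
{
	if (s->dup_ack_counter < TCP_FASTRETRANS_THRESH) {
		s->dup_ack_counter++;
		return true;   /* corresponds to the "goto send_now" branch */
	}
	s->compressed_ack++;   /* later ACKs may be compressed */
	return false;
}
```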

    This patch also introduces tcp_sack_compress_send_ack()
    helper to ease following patch comprehension.

    This patch refines LINUX_MIB_TCPACKCOMPRESSED to not
    count the acks that we had to send if the timer expires
    or tcp_sack_compress_send_ack() is sending an ack.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 Apr, 2020

1 commit

  • In MPTCP, the receive window is shared across all subflows, because it
    refers to the mptcp-level sequence space.

    MPTCP receivers already place incoming packets on the mptcp socket
    receive queue and will charge it to the mptcp socket rcvbuf until
    userspace consumes the data.

    Update __tcp_select_window to use the occupancy of the parent/mptcp
    socket instead of the subflow socket in case the tcp socket is part
    of a logical mptcp connection.

    This commit doesn't change choice of initial window for passive or active
    connections.
    While it would be possible to change those as well, this adds complexity
    (especially when handling MP_JOIN requests). Furthermore, the MPTCP RFC
    specifically says that a MPTCP sender 'MUST NOT use the RCV.WND field
    of a TCP segment at the connection level if it does not also carry a DSS
    option with a Data ACK field.'

    SYN/SYNACK packets do not carry a DSS option with a Data ACK field.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

21 Mar, 2020

1 commit

  • In rare cases retransmit logic will make a full skb copy, which will not
    trigger the zeroing added in recent change
    b738a185beaa ("tcp: ensure skb->dev is NULL before leaving TCP stack").

    Cc: Eric Dumazet
    Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue")
    Fixes: 28f8bfd1ac94 ("netfilter: Support iif matches in POSTROUTING")
    Signed-off-by: Florian Westphal
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Florian Westphal
     

20 Mar, 2020

1 commit

  • skb->rbnode is sharing three skb fields : next, prev, dev

    When a packet is sent, TCP keeps the original skb (master)
    in a rtx queue, which was converted to rbtree a while back.

    __tcp_transmit_skb() is responsible for cloning the master skb
    and adding the TCP header to the clone before sending it
    to the network layer.

    skb_clone() already clears skb->next and skb->prev, but copies
    the master oskb->dev into the clone.

    We need to clear skb->dev, otherwise lower layers could interpret
    the value as a pointer to a netdev.
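
    The aliasing can be demonstrated with a cut-down structure. This is a
    userspace sketch with hypothetical types (struct toy_rb_node, struct
    toy_skb) whose layout only imitates the sharing described above; on
    common ABIs the third word of each view overlaps.

```c
#include <stddef.h>

/* Minimal stand-in for a red-black tree node. */
struct toy_rb_node {
	unsigned long parent_color;
	struct toy_rb_node *right;
	struct toy_rb_node *left;
};

/* Minimal stand-in for an skb whose list fields share storage with rbnode. */
struct toy_skb {
	union {
		struct {
			struct toy_skb *next;
			struct toy_skb *prev;
			void *dev;
		} list;
		struct toy_rb_node rbnode;  /* overlays next, prev and dev */
	} u;
	int len;
};

/* Clone as described above: next and prev are cleared, dev is copied. */
static struct toy_skb toy_clone(const struct toy_skb *master)
{
	struct toy_skb c = *master;

	c.u.list.next = NULL;
	c.u.list.prev = NULL;
	return c;
}

/* Clone plus the extra step: clear dev before handing the skb down. */
static struct toy_skb toy_clone_clear_dev(const struct toy_skb *master)
{
	struct toy_skb c = toy_clone(master);

	c.u.list.dev = NULL;
	return c;
}
```

    When the master sits in the tree with a non-NULL child pointer, the
    plain clone ends up with a non-NULL dev that is really a tree pointer.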

    This old bug surfaced recently when commit 28f8bfd1ac94
    ("netfilter: Support iif matches in POSTROUTING") was merged.

    Before this netfilter commit, skb->dev value was ignored and
    changed before reaching dev_queue_xmit().

    Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue")
    Fixes: 28f8bfd1ac94 ("netfilter: Support iif matches in POSTROUTING")
    Signed-off-by: Eric Dumazet
    Reported-by: Martin Zaharinov
    Cc: Florian Westphal
    Cc: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 Jan, 2020

1 commit


24 Jan, 2020

4 commits

  • This implements MP_CAPABLE options parsing and writing according
    to RFC 6824 bis / RFC 8684: MPTCP v1.

    Local key is sent on syn/ack, and both keys are sent on 3rd ack.
    MP_CAPABLE message lengths are updated accordingly. We need the skbuff to
    correctly emit the above, so we push the skbuff struct as an argument
    all the way from tcp code to the relevant mptcp callbacks.

    When processing incoming MP_CAPABLE + data, build a full blown DSS-like
    map info, to simplify later processing. On child socket creation, we
    need to record the remote key, if available.

    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Christoph Paasch
     
  • Add hooks to tcp_output.c to add MP_CAPABLE to an outgoing SYN request,
    to capture the MP_CAPABLE in the received SYN-ACK, and to add MP_CAPABLE to
    the final ACK of the three-way handshake.

    Use the .sk_rx_dst_set() handler in the subflow proto to capture when the
    responding SYN-ACK is received and notify the MPTCP connection layer.

    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Signed-off-by: Peter Krystad
    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Peter Krystad
     
  • Add hooks to parse and format the MP_CAPABLE option.

    This option is handled according to MPTCP version 0 (RFC6824).
    MPTCP version 1 MP_CAPABLE (RFC6824bis/RFC8684) will be added later in
    coordination with related code changes.

    Co-developed-by: Matthieu Baerts
    Signed-off-by: Matthieu Baerts
    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Co-developed-by: Davide Caratti
    Signed-off-by: Davide Caratti
    Signed-off-by: Peter Krystad
    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Peter Krystad
     
  • Latest commit 853697504de0 ("tcp: Fix highest_sack and highest_sack_seq")
    apparently allowed syzbot to trigger various crashes in TCP stack [1]

    I believe this commit only made things easier for syzbot to find
    its way into triggering use-after-frees. But really the bugs
    could lead to bad TCP behavior or even plain crashes even for
    non malicious peers.

    I have audited all calls to tcp_rtx_queue_unlink() and
    tcp_rtx_queue_unlink_and_free() and made sure tp->highest_sack would be updated
    if we are removing from rtx queue the skb that tp->highest_sack points to.

    These updates were missing in three locations :

    1) tcp_clean_rtx_queue() [This one seems quite serious,
    I have no idea why this was not caught earlier]

    2) tcp_rtx_queue_purge() [Probably not a big deal for normal operations]

    3) tcp_send_synack() [Probably not a big deal for normal operations]
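
    The invariant being restored is that a cached pointer into the queue
    must never outlive the element it points to. A toy version of an
    unlink that respects it is sketched below. The types and
    toy_rtx_unlink() are hypothetical, the queue is a plain linked list
    rather than an rb-tree, and moving the cached pointer to the next
    element is an assumption about one reasonable replacement.

```c
#include <stddef.h>

/* Toy queue element; illustrative only. */
struct toy_rtx_skb {
	struct toy_rtx_skb *next;
	unsigned int seq;
};

/* Toy socket with a cached pointer into its retransmit queue. */
struct toy_sock {
	struct toy_rtx_skb *rtx_head;      /* stand-in for the rtx queue */
	struct toy_rtx_skb *highest_sack;  /* cached pointer into that queue */
};

/* Unlink skb from the queue, fixing up the cached pointer first. */
static void toy_rtx_unlink(struct toy_sock *sk, struct toy_rtx_skb *skb)
{
	struct toy_rtx_skb **pp = &sk->rtx_head;

	/* Never leave highest_sack pointing at an skb about to go away. */
	if (sk->highest_sack == skb)
		sk->highest_sack = skb->next;  /* may become NULL */

	while (*pp && *pp != skb)
		pp = &(*pp)->next;
	if (*pp)
		*pp = skb->next;
	skb->next = NULL;
}
```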

    [1]
    BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1864 [inline]
    BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1856 [inline]
    BUG: KASAN: use-after-free in tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891
    Read of size 4 at addr ffff8880a488d068 by task ksoftirqd/1/16

    CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.5.0-rc5-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x197/0x210 lib/dump_stack.c:118
    print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
    __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506
    kasan_report+0x12/0x20 mm/kasan/common.c:639
    __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:134
    tcp_highest_sack_seq include/net/tcp.h:1864 [inline]
    tcp_highest_sack_seq include/net/tcp.h:1856 [inline]
    tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891
    tcp_try_undo_partial net/ipv4/tcp_input.c:2730 [inline]
    tcp_fastretrans_alert+0xf74/0x23f0 net/ipv4/tcp_input.c:2847
    tcp_ack+0x2577/0x5bf0 net/ipv4/tcp_input.c:3710
    tcp_rcv_established+0x6dd/0x1e90 net/ipv4/tcp_input.c:5706
    tcp_v4_do_rcv+0x619/0x8d0 net/ipv4/tcp_ipv4.c:1619
    tcp_v4_rcv+0x307f/0x3b40 net/ipv4/tcp_ipv4.c:2001
    ip_protocol_deliver_rcu+0x5a/0x880 net/ipv4/ip_input.c:204
    ip_local_deliver_finish+0x23b/0x380 net/ipv4/ip_input.c:231
    NF_HOOK include/linux/netfilter.h:307 [inline]
    NF_HOOK include/linux/netfilter.h:301 [inline]
    ip_local_deliver+0x1e9/0x520 net/ipv4/ip_input.c:252
    dst_input include/net/dst.h:442 [inline]
    ip_rcv_finish+0x1db/0x2f0 net/ipv4/ip_input.c:428
    NF_HOOK include/linux/netfilter.h:307 [inline]
    NF_HOOK include/linux/netfilter.h:301 [inline]
    ip_rcv+0xe8/0x3f0 net/ipv4/ip_input.c:538
    __netif_receive_skb_one_core+0x113/0x1a0 net/core/dev.c:5148
    __netif_receive_skb+0x2c/0x1d0 net/core/dev.c:5262
    process_backlog+0x206/0x750 net/core/dev.c:6093
    napi_poll net/core/dev.c:6530 [inline]
    net_rx_action+0x508/0x1120 net/core/dev.c:6598
    __do_softirq+0x262/0x98c kernel/softirq.c:292
    run_ksoftirqd kernel/softirq.c:603 [inline]
    run_ksoftirqd+0x8e/0x110 kernel/softirq.c:595
    smpboot_thread_fn+0x6a3/0xa40 kernel/smpboot.c:165
    kthread+0x361/0x430 kernel/kthread.c:255
    ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

    Allocated by task 10091:
    save_stack+0x23/0x90 mm/kasan/common.c:72
    set_track mm/kasan/common.c:80 [inline]
    __kasan_kmalloc mm/kasan/common.c:513 [inline]
    __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:486
    kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:521
    slab_post_alloc_hook mm/slab.h:584 [inline]
    slab_alloc_node mm/slab.c:3263 [inline]
    kmem_cache_alloc_node+0x138/0x740 mm/slab.c:3575
    __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:198
    alloc_skb_fclone include/linux/skbuff.h:1099 [inline]
    sk_stream_alloc_skb net/ipv4/tcp.c:875 [inline]
    sk_stream_alloc_skb+0x113/0xc90 net/ipv4/tcp.c:852
    tcp_sendmsg_locked+0xcf9/0x3470 net/ipv4/tcp.c:1282
    tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1432
    inet_sendmsg+0x9e/0xe0 net/ipv4/af_inet.c:807
    sock_sendmsg_nosec net/socket.c:652 [inline]
    sock_sendmsg+0xd7/0x130 net/socket.c:672
    __sys_sendto+0x262/0x380 net/socket.c:1998
    __do_sys_sendto net/socket.c:2010 [inline]
    __se_sys_sendto net/socket.c:2006 [inline]
    __x64_sys_sendto+0xe1/0x1a0 net/socket.c:2006
    do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Freed by task 10095:
    save_stack+0x23/0x90 mm/kasan/common.c:72
    set_track mm/kasan/common.c:80 [inline]
    kasan_set_free_info mm/kasan/common.c:335 [inline]
    __kasan_slab_free+0x102/0x150 mm/kasan/common.c:474
    kasan_slab_free+0xe/0x10 mm/kasan/common.c:483
    __cache_free mm/slab.c:3426 [inline]
    kmem_cache_free+0x86/0x320 mm/slab.c:3694
    kfree_skbmem+0x178/0x1c0 net/core/skbuff.c:645
    __kfree_skb+0x1e/0x30 net/core/skbuff.c:681
    sk_eat_skb include/net/sock.h:2453 [inline]
    tcp_recvmsg+0x1252/0x2930 net/ipv4/tcp.c:2166
    inet_recvmsg+0x136/0x610 net/ipv4/af_inet.c:838
    sock_recvmsg_nosec net/socket.c:886 [inline]
    sock_recvmsg net/socket.c:904 [inline]
    sock_recvmsg+0xce/0x110 net/socket.c:900
    __sys_recvfrom+0x1ff/0x350 net/socket.c:2055
    __do_sys_recvfrom net/socket.c:2073 [inline]
    __se_sys_recvfrom net/socket.c:2069 [inline]
    __x64_sys_recvfrom+0xe1/0x1a0 net/socket.c:2069
    do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    The buggy address belongs to the object at ffff8880a488d040
    which belongs to the cache skbuff_fclone_cache of size 456
    The buggy address is located 40 bytes inside of
    456-byte region [ffff8880a488d040, ffff8880a488d208)
    The buggy address belongs to the page:
    page:ffffea0002922340 refcount:1 mapcount:0 mapping:ffff88821b057000 index:0x0
    raw: 00fffe0000000200 ffffea00022a5788 ffffea0002624a48 ffff88821b057000
    raw: 0000000000000000 ffff8880a488d040 0000000100000006 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff8880a488cf00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff8880a488cf80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffff8880a488d000: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
    ^
    ffff8880a488d080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff8880a488d100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

    Fixes: 853697504de0 ("tcp: Fix highest_sack and highest_sack_seq")
    Fixes: 50895b9de1d3 ("tcp: highest_sack fix")
    Fixes: 737ff314563c ("tcp: use sequence distance to detect reordering")
    Signed-off-by: Eric Dumazet
    Cc: Cambda Zhu
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Jan, 2020

1 commit

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2020-01-22

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 92 non-merge commits during the last 16 day(s) which contain
    a total of 320 files changed, 7532 insertions(+), 1448 deletions(-).

    The main changes are:

    1) function by function verification and program extensions from Alexei.

    2) massive cleanup of selftests/bpf from Toke and Andrii.

    3) batched bpf map operations from Brian and Yonghong.

    4) tcp congestion control in bpf from Martin.

    5) bulking for non-map xdp_redirect from Toke.

    6) bpf_send_signal_thread helper from Yonghong.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Jan, 2020

3 commits

  • Update the SACK check to work with zero option space available, a case
    that's possible with MPTCP but not MD5+TS. Maintained only one
    conditional branch for insufficient SACK space.

    v1 -> v2:
    - Moves the check inside the SACK branch by taking recent SACK fix:

    9424e2e7ad93 (tcp: md5: fix potential overestimation of TCP option space)

    into account, but modifies it to work in MPTCP scenarios beyond the
    MD5+TS corner case.

    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Reviewed-by: Eric Dumazet
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Mat Martineau
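
The zero-option-space case described above can be illustrated with a simplified userspace model of the SACK sizing step. This is a sketch, not the kernel's tcp_established_options() code: the constant values are assumed to mirror the kernel's (40-byte option space, 4-byte SACK base, 8 bytes per SACK block), and the function names sack_blocks_unguarded() and sack_blocks_guarded() are hypothetical.

```c
#define MAX_TCP_OPTION_SPACE      40u
#define TCPOLEN_SACK_BASE_ALIGNED  4u
#define TCPOLEN_SACK_PERBLOCK      8u

static unsigned int min_uint(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/*
 * Unguarded sizing: when `used` is 40, `remaining` is 0 and the unsigned
 * subtraction (remaining - 4) wraps to a huge value, so every wanted
 * SACK block appears to fit.
 */
unsigned int sack_blocks_unguarded(unsigned int used, unsigned int eff_sacks)
{
    unsigned int remaining = MAX_TCP_OPTION_SPACE - used;

    return min_uint(eff_sacks,
                    (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
                    TCPOLEN_SACK_PERBLOCK);
}

/*
 * Guarded sizing: a single check rejects every case where not even one
 * SACK block (base + one block = 12 bytes) fits, including zero space.
 */
unsigned int sack_blocks_guarded(unsigned int used, unsigned int eff_sacks)
{
    unsigned int remaining = MAX_TCP_OPTION_SPACE - used;

    if (remaining < TCPOLEN_SACK_BASE_ALIGNED + TCPOLEN_SACK_PERBLOCK)
        return 0;

    return min_uint(eff_sacks,
                    (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
                    TCPOLEN_SACK_PERBLOCK);
}
```

With all 40 bytes already used, the unguarded variant reports that 3 of 3 wanted blocks fit, while the guarded variant reports 0.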
     
  • Coalesce and collapse of packets carrying MPTCP extensions is allowed
    when the newer packet has no extension or the extensions carried by both
    packets are equal.

    This allows merging of TSO packet trains and even cross-TSO packets, and
    does not require any additional action when moving data into existing
    SKBs.

    v3 -> v4:
    - allow collapsing, under mptcp_skb_can_collapse() constraint

    v5 -> v6:
    - clarify MPTCP skb extensions must always be cleared at allocation
    time

    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Mat Martineau
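
The collapse rule stated above (the newer packet has no extension, or both extensions are equal) can be written as a small predicate. The sketch below uses toy types: struct toy_mptcp_ext, struct toy_skb and the function names are hypothetical stand-ins, and the fields compared are assumptions chosen for illustration, not the kernel's actual extension layout.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the MPTCP skb extension. */
struct toy_mptcp_ext {
    uint64_t data_seq;
    uint32_t subflow_seq;
    uint16_t data_len;
};

/* Hypothetical skb: ext is NULL when the packet carries no extension. */
struct toy_skb {
    const struct toy_mptcp_ext *ext;
};

/* Two extensions match if both are absent or all fields are equal. */
static bool toy_ext_equal(const struct toy_mptcp_ext *a,
                          const struct toy_mptcp_ext *b)
{
    if (a == b)
        return true;
    if (a == NULL || b == NULL)
        return false;
    return a->data_seq == b->data_seq &&
           a->subflow_seq == b->subflow_seq &&
           a->data_len == b->data_len;
}

/* `from` is the newer packet being merged into the older packet `to`. */
bool toy_can_collapse(const struct toy_skb *to, const struct toy_skb *from)
{
    return from->ext == NULL || toy_ext_equal(to->ext, from->ext);
}
```

Note the asymmetry: a newer packet without an extension can always be merged, but a newer packet with an extension cannot be merged into an older packet that has none.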
     
  • This patch makes "struct tcp_congestion_ops" the first user
    of BPF STRUCT_OPS. It allows a tcp_congestion_ops to be
    implemented in BPF.

    A BPF-implemented tcp_congestion_ops can be used like a
    regular kernel tcp-cc through sysctl and setsockopt, e.g.:
    [root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion
    net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic
    net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic
    net.ipv4.tcp_congestion_control = bpf_cubic

    There have been attempts to move TCP CC to user space
    (e.g. CCP in TCP). The common arguments are faster turnaround,
    getting away from long-tail kernel versions in production, etc.,
    which are legitimate points.

    BPF has been a continuous effort to combine the upsides of kernel
    and userspace (e.g. XDP gains the performance advantage without
    bypassing the kernel). The recent BPF advancements (in particular
    the BTF-aware verifier, BPF trampoline, BPF CO-RE...) made
    implementing kernel struct ops (e.g. tcp cc) possible in BPF.
    This allows a faster turnaround for testing algorithms in
    production while leveraging the existing (and continually growing)
    BPF features/framework instead of building one specifically for
    userspace TCP CC.

    This patch allows write access to a few fields in tcp-sock
    (in bpf_tcp_ca_btf_struct_access()).

    The optional "get_info" is unsupported now. It can be added
    later. One possible way is to output the info with a btf-id
    to describe the content.

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20200109003508.3856115-1-kafai@fb.com

    Martin KaFai Lau
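
The message above shows the sysctl path; the setsockopt path it mentions could look like the sketch below. It relies only on the standard TCP_CONGESTION socket option. The helper name select_congestion_control() is hypothetical, and whether a name such as "bpf_cubic" is accepted depends on that algorithm being registered and allowed on the running system.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: ask the kernel to use congestion control `name`
 * on TCP socket `fd`, then optionally read back the active name into
 * `out` (a buffer of `out_len` bytes). Returns 0 on success, -1 on error.
 */
int select_congestion_control(int fd, const char *name,
                              char *out, socklen_t out_len)
{
    if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
                   name, (socklen_t)strlen(name)) < 0)
        return -1;

    if (out != NULL && out_len > 0) {
        socklen_t len = out_len;

        memset(out, 0, out_len);
        if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, out, &len) < 0)
            return -1;
        out[out_len - 1] = '\0'; /* keep the result NUL-terminated */
    }
    return 0;
}
```

A caller would pass a connected or unconnected TCP socket, for example select_congestion_control(fd, "bpf_cubic", buf, sizeof(buf)), and treat a -1 return as "algorithm not available or not permitted".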
     

31 Dec, 2019

1 commit

  • Commit 50895b9de1d3 ("tcp: highest_sack fix") removed the logic that
    set tp->highest_sack to the head of the send queue. That logic was
    error prone, but it served a purpose. Until we remove the pointer to
    the highest sack skb and use the seq instead, we need to set
    tp->highest_sack to NULL when there is no skb after the last sack,
    and then replace NULL with the real skb when a new skb is inserted
    into the rtx queue, because NULL means the highest sack seq is
    tp->snd_nxt. If tp->highest_sack is NULL and new data is sent,
    the next ACK with a sack option will increase tp->reordering unexpectedly.

    This patch sets tp->highest_sack to the tail of the rtx queue if
    it's NULL and new data is sent. The patch keeps the rule that the
    highest_sack can only be maintained by sack processing, except in
    this one case.

    Fixes: 50895b9de1d3 ("tcp: highest_sack fix")
    Signed-off-by: Cambda Zhu
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Cambda Zhu
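
The effect of a NULL tp->highest_sack can be shown with a toy model. The sketch below is not kernel code: struct hs_skb, struct hs_tcp and the function names are hypothetical, and it deliberately ignores every other input to the real highest-SACK lookup. It only models the rule from the message that NULL means "the highest sack seq is snd_nxt".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical segment: starting sequence number and length. */
struct hs_skb {
    uint32_t seq;
    uint32_t len;
};

/* Hypothetical connection state, reduced to the two relevant fields. */
struct hs_tcp {
    uint32_t snd_nxt;
    const struct hs_skb *highest_sack; /* NULL means "use snd_nxt" */
};

uint32_t hs_highest_sack_seq(const struct hs_tcp *tp)
{
    return tp->highest_sack ? tp->highest_sack->seq : tp->snd_nxt;
}

/*
 * Model of sending new data. With apply_fix set, a NULL highest_sack is
 * replaced by the newly sent segment (the new tail of the rtx queue), so
 * the reported sequence stays where it was before the send.
 */
void hs_new_data_sent(struct hs_tcp *tp, const struct hs_skb *skb,
                      int apply_fix)
{
    tp->snd_nxt = skb->seq + skb->len;
    if (apply_fix && tp->highest_sack == NULL)
        tp->highest_sack = skb;
}
```

Without the fix, sending 500 bytes at sequence 1000 silently moves the reported highest sack seq from 1000 to 1500, which is how a later SACK below that value can look like reordering. With the fix it stays at 1000.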
     

18 Dec, 2019

1 commit

  • sk->sk_pacing_shift can be read and written without lock
    synchronization. This patch adds annotations to
    document this fact and avoid future syzbot complaints.

    This might also avoid unexpected false sharing
    in sk_pacing_shift_update(), as the compiler
    could remove the conditional check and always
    write over sk->sk_pacing_shift :

    if (sk->sk_pacing_shift != val)
            sk->sk_pacing_shift = val;

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
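
An annotated form of the check-then-write pattern quoted above might look like the following userspace sketch. The TOY_READ_ONCE()/TOY_WRITE_ONCE() macros are simplified volatile-cast stand-ins for the kernel's READ_ONCE()/WRITE_ONCE(), they rely on the GCC/Clang __typeof__ extension, and struct ps_sock and ps_pacing_shift_update() are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel macros (GCC/Clang __typeof__). */
#define TOY_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define TOY_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical socket reduced to the one field of interest. */
struct ps_sock {
    uint8_t sk_pacing_shift;
};

/*
 * The volatile read keeps the comparison in place, so the store (and the
 * cache-line dirtying that comes with it) only happens when the value
 * really changes. Returns true if a store was performed.
 */
bool ps_pacing_shift_update(struct ps_sock *sk, uint8_t val)
{
    if (TOY_READ_ONCE(sk->sk_pacing_shift) == val)
        return false;
    TOY_WRITE_ONCE(sk->sk_pacing_shift, val);
    return true;
}
```

Because both accesses go through volatile lvalues, the compiler may not drop the conditional and write unconditionally, which is the false-sharing concern raised in the message.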
     

14 Dec, 2019

2 commits

  • Due to how tcp_sendmsg() is implemented, we can have an empty
    skb at the tail of the write queue.

    Most [1] tcp_write_queue_empty() callers want to know if there is
    anything to send (payload and/or FIN).

    Instead of checking if the sk_write_queue is empty, we need
    to test if tp->write_seq == tp->snd_nxt.

    [1] tcp_send_fin() was the only caller that expected to
    see if an skb was in the write queue; I have changed the code
    to reuse the tcp_write_queue_tail() result.

    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: Jakub Kicinski

    Eric Dumazet
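
The difference between the two emptiness tests can be seen in a toy model. The sketch below assumes a hypothetical struct wq_tcp that tracks only the two sequence numbers and a count of queued skbs; the function names are illustrative and not the kernel's.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical connection state for illustration only. */
struct wq_tcp {
    uint32_t write_seq;       /* next sequence the application will write */
    uint32_t snd_nxt;         /* next sequence to be sent */
    unsigned int queued_skbs; /* skbs sitting in the write queue */
};

/* List-based test: true whenever any skb is queued, even an empty one. */
bool wq_list_is_empty(const struct wq_tcp *tp)
{
    return tp->queued_skbs == 0;
}

/* Sequence-based test: true when there is nothing left to send. */
bool wq_nothing_to_send(const struct wq_tcp *tp)
{
    return tp->write_seq == tp->snd_nxt;
}
```

With an empty skb at the tail of the queue (queued_skbs is 1, write_seq equals snd_nxt), the list-based test reports "not empty" while the sequence-based test correctly reports that nothing is pending.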
     
  • Backport of commit fdfc5c8594c2 ("tcp: remove empty skb from
    write queue in error cases") in linux-4.14 stable triggered
    various bugs. One of them has been fixed in commit ba2ddb43f270
    ("tcp: Don't dequeue SYN/FIN-segments from write-queue"), but
    we still have crashes on some occasions.

    The root cause is that when tcp_sendmsg() has allocated a fresh
    skb and could not append a fragment before being blocked
    in sk_stream_wait_memory(), tcp_write_xmit() might be called
    and decide to send this fresh and empty skb.

    Sending an empty packet is not only silly, it might have caused
    many issues we had in the past with tp->packets_out being
    out of sync.

    Fixes: c65f7f00c587 ("[TCP]: Simplify SKB data portion allocation with NETIF_F_SG.")
    Signed-off-by: Eric Dumazet
    Cc: Christoph Paasch
    Acked-by: Neal Cardwell
    Cc: Jason Baron
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: Jakub Kicinski

    Eric Dumazet
     

07 Dec, 2019

1 commit

  • Back in 2008, Adam Langley fixed the corner case of packets for flows
    having all of the following options: MD5, TS and SACK.

    Since MD5 needs 20 bytes, and TS needs 12 bytes, no sack block
    can be cooked from the remaining 8 bytes.

    tcp_established_options() correctly sets opts->num_sack_blocks
    to zero, but returns 36 instead of 32.

    This means TCP cooks packets with 4 extra bytes at the end
    of options, containing uninitialized bytes.

    Fixes: 33ad798c924b ("tcp: options clean up")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Acked-by: Neal Cardwell
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
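
The 36-versus-32 arithmetic above can be reproduced with a simplified userspace model. This is a sketch rather than the real tcp_established_options(): the constants are assumed to match the kernel's aligned option lengths (MD5 20, timestamps 12, SACK base 4, 8 per SACK block, 40 bytes total), struct opt_result and options_size() are hypothetical, and the model assumes at least 4 bytes of option space remain when SACK is considered.

```c
#define MAX_TCP_OPTION_SPACE      40u
#define TCPOLEN_MD5SIG_ALIGNED    20u
#define TCPOLEN_TSTAMP_ALIGNED    12u
#define TCPOLEN_SACK_BASE_ALIGNED  4u
#define TCPOLEN_SACK_PERBLOCK      8u

/* Hypothetical result type: total option bytes and SACK blocks emitted. */
struct opt_result {
    unsigned int size;
    unsigned int num_sack_blocks;
};

/*
 * eff_sacks is the number of SACK blocks we would like to send.
 * With fixed == 0 the SACK base length is always added, even when no
 * block fits; with fixed != 0 it is added only if a block fits.
 */
struct opt_result options_size(int md5, int ts, unsigned int eff_sacks,
                               int fixed)
{
    struct opt_result r = {0u, 0u};

    if (md5)
        r.size += TCPOLEN_MD5SIG_ALIGNED;
    if (ts)
        r.size += TCPOLEN_TSTAMP_ALIGNED;

    if (eff_sacks) {
        unsigned int remaining = MAX_TCP_OPTION_SPACE - r.size;
        unsigned int fit = (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
                           TCPOLEN_SACK_PERBLOCK;

        r.num_sack_blocks = eff_sacks < fit ? eff_sacks : fit;
        if (!fixed || r.num_sack_blocks)
            r.size += TCPOLEN_SACK_BASE_ALIGNED +
                      r.num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
    }
    return r;
}
```

With MD5 and timestamps enabled, 8 bytes remain, so (8 - 4) / 8 rounds down to zero blocks. The unfixed path still adds the 4-byte SACK base and reports 36 bytes, while the fixed path reports 32.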
     

08 Nov, 2019

1 commit


14 Oct, 2019

4 commits

  • For the sake of tcp_poll(), there are a few places where we fetch
    sk->sk_wmem_queued while this field can change from IRQ or another CPU.

    We need to add READ_ONCE() annotations, and also make sure write
    sides use corresponding WRITE_ONCE() to avoid store-tearing.

    sk_wmem_queued_add() helper is added so that we can in
    the future convert to ADD_ONCE() or equivalent if/when
    available.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
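
A helper of the kind described above could be sketched in userspace as follows. The TOY_READ_ONCE()/TOY_WRITE_ONCE() macros are simplified volatile-cast stand-ins for the kernel's annotations (they use the GCC/Clang __typeof__ extension), and struct wm_sock and the function names are hypothetical.

```c
/* Simplified stand-ins for the kernel macros (GCC/Clang __typeof__). */
#define TOY_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define TOY_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical socket reduced to the one field of interest. */
struct wm_sock {
    int sk_wmem_queued;
};

/*
 * Writer side: callers are assumed to be serialized against each other
 * (for example by the socket lock), so a plain read of the old value is
 * enough; only the store is annotated so lockless readers never observe
 * a torn value.
 */
void wm_queued_add(struct wm_sock *sk, int val)
{
    TOY_WRITE_ONCE(sk->sk_wmem_queued, sk->sk_wmem_queued + val);
}

/* Lockless reader side, as a poll-style path might use it. */
int wm_queued_peek(const struct wm_sock *sk)
{
    return TOY_READ_ONCE(sk->sk_wmem_queued);
}
```

Routing every update through one helper means a future atomic-add style primitive could be adopted by changing a single function body.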
     
  • There are a few places where we fetch tp->snd_nxt while
    this field can change from IRQ or another CPU.

    We need to add READ_ONCE() annotations, and also make
    sure write sides use corresponding WRITE_ONCE() to avoid
    store-tearing.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • There are a few places where we fetch tp->write_seq while
    this field can change from IRQ or another CPU.

    We need to add READ_ONCE() annotations, and also make
    sure write sides use corresponding WRITE_ONCE() to avoid
    store-tearing.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • There are a few places where we fetch tp->copied_seq while
    this field can change from IRQ or another CPU.

    We need to add READ_ONCE() annotations, and also make
    sure write sides use corresponding WRITE_ONCE() to avoid
    store-tearing.

    Note that tcp_inq_hint() was already using READ_ONCE(tp->copied_seq)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet