21 Sep, 2016

1 commit

  • This commit exports two new fields in struct tcp_info:

    tcpi_delivery_rate: The most recent goodput, as measured by
    tcp_rate_gen(). If the socket is limited by the sending
    application (e.g., no data to send), it reports the highest
    measurement instead of the most recent. The unit is bytes per
    second (like other rate fields in tcp_info).

    tcpi_delivery_rate_app_limited: A boolean indicating if the goodput
    was measured when the socket's throughput was limited by the
    sending application.

    This delivery rate information can be useful for applications that
    want to know the current throughput the TCP connection is seeing,
    e.g. adaptive bitrate video streaming. It can also be very useful for
    debugging or troubleshooting.

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
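
A minimal sketch of how an application might read the two fields described above, assuming kernel UAPI headers recent enough to declare them in `<linux/tcp.h>`. The names `delivery_sample`, `read_delivery_rate` and `delivery_rate_mbps` are hypothetical helpers, not part of the commit. The structure is zeroed first so that an older kernel, which copies back a shorter `tcp_info`, simply leaves both fields at zero.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>   /* IPPROTO_TCP */
#include <linux/tcp.h>    /* struct tcp_info, TCP_INFO (kernel UAPI header) */

/* Hypothetical holder for the two fields exported by this commit. */
struct delivery_sample {
    uint64_t bytes_per_sec;  /* copy of tcpi_delivery_rate */
    int app_limited;         /* copy of tcpi_delivery_rate_app_limited */
};

/* Returns 0 on success, -1 if getsockopt() fails (errno is set). */
int read_delivery_rate(int fd, struct delivery_sample *out)
{
    struct tcp_info info;
    socklen_t len = sizeof(info);

    assert(out != NULL);
    memset(&info, 0, sizeof(info));
    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == -1)
        return -1;
    out->bytes_per_sec = info.tcpi_delivery_rate;
    out->app_limited = info.tcpi_delivery_rate_app_limited;
    return 0;
}

/* The kernel reports bytes per second; convert to megabits per second. */
double delivery_rate_mbps(uint64_t bytes_per_sec)
{
    return (double)bytes_per_sec * 8.0 / 1e6;
}
```

An adaptive bitrate client could, for instance, skip samples where `app_limited` is set, since those reflect what the application offered rather than what the path can carry.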
     

30 Jun, 2016

1 commit

  • We found that sometimes a restored tcp socket doesn't work.

    One cause of this bug is incorrect window parameters: in this case
    tcp_acceptable_seq() returns tcp_wnd_end(tp) instead of tp->snd_nxt. The
    other side drops packets with this seq, because seq is less than
    tp->rcv_nxt ( tcp_sequence() ).

    Data from a send queue is sent only if there is enough space in a
    window, so when we restore unacked data, we need to expand a window to
    fit this data.

    This was done in the first version of this patch:
    "tcp: extend window to fit all restored unacked data in a send queue"

    Then Alexey recommended restoring the window parameters instead of
    adjusting them according to the data in the send queue. This sounds
    reasonable.

    rcv_wnd has to be restored, because it was reported to the other side
    and the offered window is never shrunk.
    One of the reasons why snd_wnd needs to be restored was described above.

    Cc: Pavel Emelyanov
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Signed-off-by: Andrey Vagin
    Signed-off-by: David S. Miller

    Andrey Vagin
     

15 Mar, 2016

1 commit

  • Per RFC4898, tcpi_data_segs_in/out count segments received/sent
    that carry a positive-length data payload (this includes
    retransmitted segments carrying data). Unlike
    tcpi_segs_out/in, tcpi_data_segs_out/in exclude segments
    carrying no data (e.g. pure ack).

    The patch also updates the segs_in in tcp_fastopen_add_skb()
    so that segs_in >= data_segs_in property is kept.

    Together with retransmission data, tcpi_data_segs_out
    gives a better signal on the rxmit rate.

    v6: Rebase on the latest net-next

    v5: Eric pointed out that checking skb->len is still needed in
    tcp_fastopen_add_skb() because skb can carry a FIN without data.
    Hence, instead of open coding segs_in and data_segs_in, tcp_segs_in()
    helper is used. Comment is added to the fastopen case to explain why
    segs_in has to be reset and tcp_segs_in() has to be called before
    __skb_pull().

    v4: Add comment to the changes in tcp_fastopen_add_skb()
    and also add remark on this case in the commit message.

    v3: Add const modifier to the skb parameter in tcp_segs_in()

    v2: Rework based on recent fix by Eric:
    commit a9d99ce28ed3 ("tcp: fix tcpi_segs_in after connection establishment")

    Signed-off-by: Martin KaFai Lau
    Cc: Chris Rapier
    Cc: Eric Dumazet
    Cc: Marcelo Ricardo Leitner
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     

17 Feb, 2016

1 commit

  • tcpi_min_rtt reports the minimal rtt observed by the TCP stack for the
    flow, in microseconds. It may be ~0U if not yet known.

    tcpi_notsent_bytes reports the number of bytes in the write queue that
    were not yet sent.

    This is done in a single patch to not add a temporary 32bit padding hole
    in tcp_info.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 May, 2015

1 commit

  • This patch tracks the total number of inbound and outbound segments on a
    TCP socket. One may use these numbers to get an idea of connection
    quality when compared against the retransmissions.

    RFC4898 named these : tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut

    Each is a 32bit field and can be fetched either from TCP_INFO
    getsockopt() if one has a handle on a TCP socket, or from the inet_diag
    netlink facility (iproute2/ss patch will follow)

    Note that tp->segs_out was placed near tp->snd_nxt for good data
    locality and minimal performance impact, while tp->segs_in was placed
    near tp->bytes_received for the same reason.

    Joint work with Eric Dumazet.

    Note that received SYNs are accounted for on the listener, but sent
    SYNACKs are not.

    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
     

06 May, 2015

1 commit

  • This patch allows a server application to get the TCP SYN headers for
    its passive connections. This is useful if the server is doing
    fingerprinting of clients based on SYN packet contents.

    Two socket options are added: TCP_SAVE_SYN and TCP_SAVED_SYN.

    The first is used on a socket to enable saving the SYN headers
    for child connections. This can be set before or after the listen()
    call.

    The latter is used to retrieve the SYN headers for passive connections,
    if the parent listener has enabled TCP_SAVE_SYN.

    TCP_SAVED_SYN can be read only once: reading it frees the saved SYN
    headers.

    The data returned in TCP_SAVED_SYN are network (IPv4/IPv6) and TCP
    headers.

    The original patch was written by Tom Herbert; I changed it to not hold
    a full skb (and associated dst and conntracking reference).

    We have used such a patch for about 3 years at Google.

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Tested-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
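
The two options might be used roughly as follows. This is a sketch assuming `TCP_SAVE_SYN` and `TCP_SAVED_SYN` are available from `<linux/tcp.h>`; the three function names are hypothetical. The last helper relies only on the statement above that the returned data starts with the network (IPv4/IPv6) header, whose first nibble is the IP version.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>   /* IPPROTO_TCP */
#include <linux/tcp.h>    /* TCP_SAVE_SYN, TCP_SAVED_SYN */

/* Ask the kernel to keep SYN headers for children of this listener.
 * May be called before or after listen(). Returns 0 or -1. */
int enable_save_syn(int listen_fd)
{
    int one = 1;

    return setsockopt(listen_fd, IPPROTO_TCP, TCP_SAVE_SYN,
                      &one, sizeof(one));
}

/* Copy the saved headers of an accepted connection into buf.
 * Returns the number of bytes copied, or -1 on error. Reading
 * frees the kernel's copy, so call this once per connection. */
ssize_t read_saved_syn(int conn_fd, unsigned char *buf, size_t cap)
{
    socklen_t len = (socklen_t)cap;

    assert(buf != NULL);
    if (getsockopt(conn_fd, IPPROTO_TCP, TCP_SAVED_SYN, buf, &len) == -1)
        return -1;
    return (ssize_t)len;
}

/* IP version taken from the first header byte (4 or 6 for valid
 * headers); 0 if the buffer is empty. */
int saved_syn_ip_version(const unsigned char *buf, size_t len)
{
    return len > 0 ? buf[0] >> 4 : 0;
}
```

A fingerprinting server would call `enable_save_syn()` once on the listener and `read_saved_syn()` right after each accept().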
     

30 Apr, 2015

3 commits

  • Some Congestion Control modules can provide per flow information,
    but the current way to get this information is to use netlink.

    Like TCP_INFO, let's add TCP_CC_INFO so that applications can
    issue a getsockopt() if they have a socket file descriptor,
    instead of playing complex netlink games.

    Sample usage would be :

    union tcp_cc_info info;
    socklen_t len = sizeof(info);

    if (getsockopt(fd, SOL_TCP, TCP_CC_INFO, &info, &len) == -1)
            perror("getsockopt(TCP_CC_INFO)");

    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Acked-by: Neal Cardwell
    Acked-by: Daniel Borkmann
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
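
A slightly fuller sketch of the same call, wrapped in a hypothetical helper `read_cc_info`. It assumes `union tcp_cc_info` comes from `<linux/inet_diag.h>` and uses `IPPROTO_TCP`, which has the same value as `SOL_TCP` on Linux, so that the kernel UAPI headers suffice. Which union member is meaningful depends on the congestion control module in use, and a module that exports nothing is expected to leave the returned length at zero.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>        /* IPPROTO_TCP */
#include <linux/tcp.h>         /* TCP_CC_INFO */
#include <linux/inet_diag.h>   /* union tcp_cc_info */

/* Returns the number of bytes the congestion control module filled
 * in (possibly 0), or -1 if getsockopt() fails. */
int read_cc_info(int fd, union tcp_cc_info *info)
{
    socklen_t len = sizeof(*info);

    assert(info != NULL);
    memset(info, 0, sizeof(*info));
    if (getsockopt(fd, IPPROTO_TCP, TCP_CC_INFO, info, &len) == -1)
        return -1;
    return (int)len;
}
```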
     
  • This patch tracks the total number of payload bytes received on a TCP
    socket. This is the sum of all changes done to tp->rcv_nxt.

    RFC4898 named this : tcpEStatsAppHCThruOctetsReceived

    This is a 64bit field, and can be fetched either from TCP_INFO
    getsockopt() if one has a handle on a TCP socket, or from the inet_diag
    netlink facility (iproute2/ss patch will follow)

    Note that tp->bytes_received was placed near tp->rcv_nxt for
    best data locality and minimal performance impact.

    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Matt Mathis
    Cc: Eric Salo
    Cc: Martin Lau
    Cc: Chris Rapier
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This patch tracks the total number of bytes acked for a TCP socket.
    This is the sum of all changes done to tp->snd_una, and allows
    for precise tracking of delivered data.

    RFC4898 named this : tcpEStatsAppHCThruOctetsAcked

    This is a 64bit field, and can be fetched either from TCP_INFO
    getsockopt() if one has a handle on a TCP socket, or from the inet_diag
    netlink facility (iproute2/ss patch will follow)

    Note that tp->bytes_acked was placed near tp->snd_una for
    best data locality and minimal performance impact.

    Signed-off-by: Eric Dumazet
    Acked-by: Yuchung Cheng
    Cc: Matt Mathis
    Cc: Eric Salo
    Cc: Martin Lau
    Cc: Chris Rapier
    Signed-off-by: David S. Miller

    Eric Dumazet
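
One way to use a monotonically growing acked-bytes counter is to sample it twice and divide the difference by the elapsed time. The sketch below assumes the two samples were taken from the exported counter (`tcpi_bytes_acked` in `struct tcp_info`, a name not given in the message); `acked_bytes_per_sec` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Average delivery rate between two samples of the acked-bytes
 * counter, in bytes per second. Returns 0.0 for a non-positive
 * interval. The counter only grows, so "after" must not be smaller. */
double acked_bytes_per_sec(uint64_t acked_before, uint64_t acked_after,
                           double seconds)
{
    assert(acked_after >= acked_before);
    if (seconds <= 0.0)
        return 0.0;
    return (double)(acked_after - acked_before) / seconds;
}
```

The same arithmetic applies to the received-bytes counter on the receiving side.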
     

15 Feb, 2014

1 commit

  • Add two new fields to struct tcp_info, to report sk_pacing_rate
    and sk_max_pacing_rate to monitoring applications, such as ss from
    iproute2.

    User exported fields are 64bit, even though the kernel currently uses
    32bit fields.

    lpaa5:~# ss -i
    ..
    skmem:(r0,rb357120,t0,tb2097152,f1584,w1980880,o0,bl0) ts sack cubic
    wscale:6,6 rto:400 rtt:0.875/0.75 mss:1448 cwnd:1 ssthresh:12 send
    13.2Mbps pacing_rate 3336.2Mbps unacked:15 retrans:1/5448 lost:15
    rcv_space:29200

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

25 Jul, 2013

1 commit

  • The idea of this patch is to add an optional limit on the number of
    unsent bytes in TCP sockets, to reduce usage of kernel memory.

    TCP receiver might announce a big window, and TCP sender autotuning
    might allow a large amount of bytes in write queue, but this has little
    performance impact if a large part of this buffering is wasted :

    Write queue needs to be large only to deal with large BDP, not
    necessarily to cope with scheduling delays (incoming ACKS make room
    for the application to queue more bytes)

    For most workloads, a value of 128 KB or less is enough to give
    applications time to react to POLLOUT events
    (or to be woken up in a blocking sendmsg())

    This patch adds two ways to set the limit :

    1) Per socket option TCP_NOTSENT_LOWAT

    2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
    not using the TCP_NOTSENT_LOWAT socket option (or setting a zero value).
    The default value is UINT_MAX (0xFFFFFFFF), meaning this has no effect.

    This changes poll()/select()/epoll() to report POLLOUT
    only if the number of unsent bytes is below tp->notsent_lowat

    Note this might increase number of sendmsg()/sendfile() calls
    when using non blocking sockets,
    and increase number of context switches for blocking sockets.

    Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
    defined as :
    Specify the minimum number of bytes in the buffer until
    the socket layer will pass the data to the protocol)

    Tested:

    netperf sessions, and watching /proc/net/protocols "memory" column for TCP

    With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
    used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)

    lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
    lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
    TCPv6 1880 2 45458 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
    TCP 1696 508 45458 no 208 yes kernel y y y y y y y y y y y y y n y y y y y

    lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
    lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
    TCPv6 1880 2 20567 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
    TCP 1696 508 20567 no 208 yes kernel y y y y y y y y y y y y y n y y y y y

    Using 128KB has no bad effect on the throughput or cpu usage
    of a single flow, although there is an increase in context switches.

    A bonus is that we hold the socket lock for a shorter amount
    of time, which should improve latencies of ACK processing.

    lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
    lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
    OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
    Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
    Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand
    Size Size Size (sec) Util Util Util Util Demand Demand Units
    Final Final % Method % Method
    1651584 6291456 16384 20.00 17447.90 10^6bits/s 3.13 S -1.00 U 0.353 -1.000 usec/KB

    Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

    412,514 context-switches

    200.034645535 seconds time elapsed

    lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
    lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
    OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
    Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
    Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand
    Size Size Size (sec) Util Util Util Util Demand Demand Units
    Final Final % Method % Method
    1593240 6291456 16384 20.00 17321.16 10^6bits/s 3.35 S -1.00 U 0.381 -1.000 usec/KB

    Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

    2,675,818 context-switches

    200.029651391 seconds time elapsed

    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Acked-By: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Mar, 2013

1 commit

  • TCPCT uses option-number 253, which is reserved for experimental use
    and should not be used in production environments.
    Further, TCPCT does not fully implement RFC 6013.

    As a nice side-effect, removing TCPCT increases TCP's performance for
    very short flows:

    Measured by running an apache-benchmark with -c 100 -n 100000, sending
    HTTP requests for files of 1KB size:

    before this patch:
    average (among 7 runs) of 20845.5 Requests/Second
    after:
    average (among 7 runs) of 21403.6 Requests/Second

    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Christoph Paasch
     

14 Feb, 2013

1 commit

  • A timestamp can be set only if a socket is in repair mode.

    This patch adds a new socket option, TCP_TIMESTAMP, which allows getting
    and setting the current TCP timestamp.

    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Cc: Eric Dumazet
    Cc: Pavel Emelyanov
    Signed-off-by: Andrey Vagin
    Signed-off-by: David S. Miller

    Andrey Vagin
     

23 Oct, 2012

1 commit

  • Add a bit TCPI_OPT_SYN_DATA (32) to the socket option TCP_INFO:tcpi_options.
    It's set if the data in SYN (sent or received) is acked by SYN-ACK. A server
    or client application can use this information to check the Fast Open
    success rate.

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

13 Oct, 2012

1 commit