04 Oct, 2011

1 commit

  • Allows the ss command (iproute2) to display "ecnseen" if at least one
    packet with ECT(0), ECT(1), or CE set was received by this socket.

    "ecn" means ECN was negotiated at session establishment (TCP level)

    "ecnseen" means we received at least one packet with ECT fields set (IP
    level)

    ss -i
    ...
    ESTAB 0 0 192.168.20.110:22 192.168.20.144:38016
    ino:5950 sk:f178e400
    mem:(r0,w0,f0,t0) ts sack ecn ecnseen bic wscale:7,8 rto:210
    rtt:12.5/7.5 cwnd:10 send 9.3Mbps rcv_space:14480
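
    These two flags are also visible to applications through the TCP_INFO
    socket option; TCPI_OPT_ECN_SEEN in <linux/tcp.h> is the bit this commit
    exposes, alongside the existing TCPI_OPT_ECN. A minimal userspace sketch
    (not part of the patch itself):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <linux/tcp.h>

    static void print_ecn_state(int fd)
    {
        struct tcp_info info;
        socklen_t len = sizeof(info);

        if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0) {
            if (info.tcpi_options & TCPI_OPT_ECN)
                printf("ecn ");         /* negotiated at TCP level */
            if (info.tcpi_options & TCPI_OPT_ECN_SEEN)
                printf("ecnseen ");     /* ECT seen at IP level */
            printf("\n");
        }
    }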

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

25 Aug, 2011

1 commit

  • This patch implements Proportional Rate Reduction (PRR) for TCP.
    PRR is an algorithm that determines TCP's sending rate in fast
    recovery. PRR avoids excessive window reductions and aims for
    the actual congestion window size at the end of recovery to be as
    close as possible to the window determined by the congestion control
    algorithm. PRR also improves accuracy of the amount of data sent
    during loss recovery.

    The patch implements the recommended flavor of PRR called PRR-SSRB
    (Proportional rate reduction with slow start reduction bound) and
    replaces the existing rate halving algorithm. PRR improves upon the
    existing Linux fast recovery under a number of conditions including:
    1) burst losses where the losses implicitly reduce the amount of
    outstanding data (pipe) below the ssthresh value selected by the
    congestion control algorithm and,
    2) losses near the end of short flows where the application runs out of
    data to send.

    As an example, with the existing rate halving implementation a single
    loss event can cause a connection carrying short Web transactions to
    go into the slow start mode after the recovery. This is because during
    recovery Linux pulls the congestion window down to packets_in_flight+1
    on every ACK. A short Web response often runs out of new data to send
    and its pipe reduces to zero by the end of recovery when all its packets
    are drained from the network. Subsequent HTTP responses using the same
    connection will have to slow start to raise cwnd to ssthresh. PRR on
    the other hand aims for the cwnd to be as close as possible to ssthresh
    by the end of recovery.
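
    For concreteness, here is a sketch of the per-ACK PRR-SSRB computation
    as described in the draft (variable names are descriptive stand-ins,
    not the exact kernel identifiers):

    static int prr_sndcnt(int pipe, int ssthresh, int prior_cwnd,
                          int prr_delivered, int prr_out,
                          int newly_acked_sacked)
    {
        int sndcnt;

        if (pipe > ssthresh) {
            /* proportional part: release data at the
             * ssthresh/prior_cwnd rate, rounding up */
            sndcnt = (prr_delivered * ssthresh + prior_cwnd - 1) /
                     prior_cwnd - prr_out;
        } else {
            /* slow start reduction bound: grow back toward ssthresh,
             * but no faster than slow start would */
            int delivered = prr_delivered - prr_out;
            int limit = (delivered > newly_acked_sacked ?
                         delivered : newly_acked_sacked) + 1;
            sndcnt = ssthresh - pipe < limit ? ssthresh - pipe : limit;
        }
        return sndcnt > 0 ? sndcnt : 0;   /* cwnd becomes pipe + sndcnt */
    }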

    A description of PRR and a discussion of its performance can be found at
    the following links:
    - IETF Draft:
    http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01
    - IETF Slides:
    http://www.ietf.org/proceedings/80/slides/tcpm-6.pdf
    http://tools.ietf.org/agenda/81/slides/tcpm-2.pdf
    - Paper to appear in the Internet Measurement Conference (IMC) 2011:
    Improving TCP Loss Recovery
    Nandita Dukkipati, Matt Mathis, Yuchung Cheng

    Signed-off-by: Nandita Dukkipati
    Signed-off-by: David S. Miller

    Nandita Dukkipati
     

09 Jun, 2011

1 commit

  • This patch lowers the default initRTO from 3secs to 1sec per
    RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet
    has been retransmitted, AND the TCP timestamp option is not on.

    It also adds support to take RTT sample during 3WHS on the passive
    open side, just like its active open counterpart, and uses it, if
    valid, to seed the initRTO for the data transmission phase.

    The patch also resets ssthresh to its initial default at the
    beginning of the data transmission phase, and reduces cwnd to 1 if
    there has been MORE THAN ONE retransmission during 3WHS per RFC5681.
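
    A sketch of the fallback rule (TCP_TIMEOUT_INIT and TCP_TIMEOUT_FALLBACK
    are the kernel constants; the helper itself is illustrative, not the
    patch's code):

    #define TCP_TIMEOUT_INIT     (1 * HZ)   /* RFC2988bis initial RTO  */
    #define TCP_TIMEOUT_FALLBACK (3 * HZ)   /* conservative fallback   */

    static unsigned long seed_init_rto(int syn_retransmitted, int saw_tstamp)
    {
        /* a retransmitted SYN/SYN-ACK without timestamps means the RTT
         * sample is ambiguous, so stay with the conservative 3 secs */
        if (syn_retransmitted && !saw_tstamp)
            return TCP_TIMEOUT_FALLBACK;
        return TCP_TIMEOUT_INIT;
    }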

    Signed-off-by: H.K. Jerry Chu
    Signed-off-by: David S. Miller

    Jerry Chu
     

31 Aug, 2010

1 commit

  • This patch provides "user timeout" support as described in RFC793. The
    socket option is also needed for the local half of RFC5482 "TCP User
    Timeout Option".

    TCP_USER_TIMEOUT is a TCP-level socket option that takes an unsigned int.
    When > 0, it specifies the maximum amount of time in ms that transmitted
    data may remain unacknowledged before TCP forcefully closes the
    corresponding connection and returns ETIMEDOUT to the application. If 0
    is given, TCP continues to use the system default.

    Increasing the user timeout allows a TCP connection to survive extended
    periods without end-to-end connectivity. Decreasing it allows
    applications to "fail fast" if so desired; otherwise a failure may take
    up to 20 minutes with the current system defaults in a normal WAN
    environment.

    The socket option can be set during any state of a TCP connection, but
    is effective only during the synchronized states of a connection
    (ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, or LAST-ACK).
    Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option,
    TCP_USER_TIMEOUT will override keepalive in determining when to close a
    connection due to keepalive failure.

    The option does not change in any way when TCP retransmits a packet, nor
    when a keepalive probe is sent.

    This option, like many others, will be inherited by an acceptor from its
    listener.
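
    A minimal usage sketch (TCP_USER_TIMEOUT is declared in <linux/tcp.h>;
    newer libc headers carry it as well):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <linux/tcp.h>

    static int set_user_timeout(int fd)
    {
        unsigned int timeout_ms = 30000;   /* 0 restores the system default */

        return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                          &timeout_ms, sizeof(timeout_ms));
    }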

    Signed-off-by: H.K. Jerry Chu
    Signed-off-by: David S. Miller

    Jerry Chu
     

19 Feb, 2010

2 commits

  • This patch enables fast retransmission after a single dupACK for
    TCP if the stream is identified as thin. This reduces latencies
    for thin streams that are otherwise unable to trigger fast
    retransmissions due to high packet inter-arrival times. The
    mechanism is only active if enabled via socket option or sysctl
    and the stream is identified as thin.
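
    Following the same pattern as the TCP_USER_TIMEOUT sketch above, a
    socket opts in via TCP_THIN_DUPACK; the corresponding global knob is
    the net.ipv4.tcp_thin_dupack sysctl:

    int one = 1;

    setsockopt(fd, IPPROTO_TCP, TCP_THIN_DUPACK, &one, sizeof(one));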

    Signed-off-by: Andreas Petlund
    Signed-off-by: David S. Miller

    Andreas Petlund
     
  • This patch makes TCP use only linear timeouts if the stream is
    thin. This helps to avoid the very high latencies that thin
    streams suffer because of exponential backoff. The mechanism is
    only active if enabled via socket option or sysctl and the
    stream is identified as thin. A maximum of 6 linear timeouts is
    tried before exponential backoff is resumed.
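
    The per-socket switch here is TCP_THIN_LINEAR_TIMEOUTS, with
    net.ipv4.tcp_thin_linear_timeouts as the global sysctl; usage mirrors
    the sketch above:

    int one = 1;

    setsockopt(fd, IPPROTO_TCP, TCP_THIN_LINEAR_TIMEOUTS, &one, sizeof(one));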

    Signed-off-by: Andreas Petlund
    Signed-off-by: David S. Miller

    Andreas Petlund
     

03 Dec, 2009

2 commits

  • Data structures are carefully composed to require minimal additions.
    For example, the struct tcp_options_received cookie_plus variable fits
    between existing 16-bit and 8-bit variables, requiring no additional
    space (taking alignment into consideration). There are no additions to
    tcp_request_sock, and only 1 pointer in tcp_sock.

    This is a significantly revised implementation of an earlier (year-old)
    patch that no longer applies cleanly, with permission of the original
    author (Adam Langley):

    http://thread.gmane.org/gmane.linux.network/102586

    The principal difference is using a TCP option to carry the cookie nonce,
    instead of a user-configured offset in the data. This is more flexible and
    less subject to user configuration error. Such a cookie option has been
    suggested for many years, and is also useful without SYN data, allowing
    several related concepts to use the same extension option.

    "Re: SYN floods (was: does history repeat itself?)", September 9, 1996.
    http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html

    "Re: what a new TCP header might look like", May 12, 1998.
    ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail

    These functions will also be used in subsequent patches that implement
    additional features.

    Requires:
    TCPCT part 1a: add request_values parameter for sending SYNACK
    TCPCT part 1b: generate Responder Cookie secret
    TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS

    Signed-off-by: William.Allen.Simpson@gmail.com
    Signed-off-by: David S. Miller

    William Allen Simpson
     
  • Define a sysctl (tcp_cookie_size) to turn the cookie option default on
    and off globally, instead of a compile-time configuration option.

    Define per socket option (TCP_COOKIE_TRANSACTIONS) for setting constant
    data values, retrieving variable cookie values, and other facilities.

    Move inline tcp_clear_options() unchanged from net/tcp.h to linux/tcp.h,
    near its corresponding struct tcp_options_received (prior to changes).

    This is a straightforward re-implementation of an earlier (year-old)
    patch that no longer applies cleanly, with permission of the original
    author (Adam Langley):

    http://thread.gmane.org/gmane.linux.network/102586

    These functions will also be used in subsequent patches that implement
    additional features.

    Requires:
    net: TCP_MSS_DEFAULT, TCP_MSS_DESIRED
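
    For orientation, the global default would be toggled like other TCP
    knobs (assuming the usual net.ipv4 sysctl placement; the value is the
    default cookie size in bytes, 0 disables):

    sysctl -w net.ipv4.tcp_cookie_size=8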

    Signed-off-by: William.Allen.Simpson@gmail.com
    Signed-off-by: David S. Miller

    William Allen Simpson
     

14 Nov, 2009

1 commit

  • Define two symbols needed in both kernel and user space.

    Remove the old (somewhat incorrect) kernel variant that wasn't used in
    most cases. The default should apply to both RMSS and SMSS (RFC2581).

    Replace numeric constants with defined symbols.

    Stand-alone patch, originally developed for TCPCT.
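
    The two symbols, as defined in <linux/tcp.h>:

    #define TCP_MSS_DEFAULT   536U  /* IPv4 (RFC1122, RFC2581)           */
    #define TCP_MSS_DESIRED  1220U  /* IPv6 (tunneled), EDNS0 (RFC3226)  */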

    Signed-off-by: William.Allen.Simpson@gmail.com
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    William Allen Simpson
     

05 Nov, 2009

1 commit

  • This cleanup patch puts struct/union/enum opening braces on the
    first line to ease grep games.

    struct something
    {

    becomes:

    struct something {

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Apr, 2009

1 commit

  • last_synq_overflow eats 4 or 8 bytes in struct tcp_sock, even
    though it is only used when a listening socket's SYN queue
    is full.

    We can (ab)use rx_opt.ts_recent_stamp to store the same information;
    it is not used otherwise as long as a socket is in listen state.

    Move linger2 around to avoid splitting struct mtu_probe
    across a cacheline boundary on 32-bit arches.
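
    Roughly, this enables helpers of the following shape (a sketch, not the
    verbatim patch): a listener has no ts_recent, so the field can
    time-stamp the last SYN-queue overflow instead.

    static inline void tcp_synq_overflow(struct sock *sk)
    {
        tcp_sk(sk)->rx_opt.ts_recent_stamp = jiffies;
    }

    static inline int tcp_synq_no_recent_overflow(const struct sock *sk)
    {
        unsigned long last_overflow = tcp_sk(sk)->rx_opt.ts_recent_stamp;

        return time_after(jiffies, last_overflow + TCP_TIMEOUT_INIT);
    }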

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

16 Mar, 2009

2 commits

  • The result is very unlikely to change often, so we hardly
    need to divide again after doing so once for a connection.
    Yet, if a divide still becomes necessary, we detect that,
    do the right thing, and again settle into the non-divide
    state. This takes the u16 space which was previously
    occupied by the plain xmit_size_goal.

    This should take care of part of the tso vs non-tso difference
    we found earlier.
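
    The idea in miniature (illustrative code with hypothetical names; the
    real logic lives behind tcp_current_mss()): do the divide once, remember
    the result in segments, and reuse it while the mss is unchanged.

    struct goal_cache {
        unsigned int mss;
        unsigned short goal_segs;
    };

    static unsigned int xmit_size_goal(struct goal_cache *c,
                                       unsigned int max_goal_bytes,
                                       unsigned int mss_now)
    {
        if (!c->goal_segs || c->mss != mss_now) {
            c->goal_segs = max_goal_bytes / mss_now;  /* the divide */
            c->mss = mss_now;
        }
        return c->goal_segs * mss_now;  /* fast path: multiply only */
    }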

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     
  • There's very little need for most of the callsites to get
    tp->xmit_size_goal updated. That would cost us a divide as-is,
    so slice the function in two. Also, the only users of
    tp->xmit_size_goal are directly behind tcp_current_mss(),
    so there's no need to store that variable in tcp_sock
    at all! Dropping xmit_size_goal currently leaves a 16-bit
    hole, and some reorganization would again be necessary to
    change that (but I'm aiming to fill that hole with a u16
    xmit_size_goal_segs to cache the result of the remaining
    divide and win back the tso case).

    Bring the xmit_size_goal parts into tcp.c.

    Signed-off-by: Ilpo Järvinen
    Cc: Evgeniy Polyakov
    Cc: Ingo Molnar
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

08 Oct, 2008

1 commit

  • It all started from me noticing that this urgent check in
    tcp_clean_rtx_queue is unnecessarily inside the loop. Then
    I took a longer look at it and found out that the users of
    urg_mode can trivially do without it. Well, almost; there
    was one gotcha.

    Bonus: those funny people who use urg with >= 2^31 write_seq -
    snd_una could now rejoice too (that's the only purpose for the
    between() being there, otherwise a simple compare would have done
    the thing). Not that I assume that the rest of the tcp code
    happily lives with such mind-boggling numbers :-). Alas, it
    turned out to be impossible to set wmem to such numbers anyway;
    yes, I really tried a big sendfile after setting some wmem but
    nothing happened :-). ...tcp_wmem is an int and so is sk_sndbuf...
    So I hacked the variable to long and found out that it seems
    to work... :-)

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

21 Sep, 2008

2 commits

  • Both loops are quite similar, so they can be combined
    with little effort. As a result, forward_skb_hint becomes
    obsolete as well.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     
  • The main benefit of this is that we can then freely point
    retransmit_skb_hint anywhere we want, because there's no
    longer any need to know what count changes would be
    involved. Since this is really used only as a terminator,
    the unnecessary work is a one-time walk at most, and if
    some retransmissions are necessary after that point later
    on, the walk is not a full waste of time anyway.

    Since retransmit_high must be kept valid, all lost
    markers must ensure that.

    Now I have also learned how those "holes" in the
    rexmittable skbs can appear: mtu probing makes them. So
    I removed the misleading comment as well.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

19 Jul, 2008

1 commit

  • Remove redundant checks when setting eff_sacks and make the number of SACKs
    a compile-time constant. Now that the options code knows how many SACK
    blocks can fit in the header, we don't need the SACK code guessing at it.
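
    For orientation, the constant in question is TCP_NUM_SACKS in
    <linux/tcp.h> (shown here as a sketch; the exact call sites that use
    it differ):

    #define TCP_NUM_SACKS 4   /* max SACK blocks the options code will emit */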

    Signed-off-by: Adam Langley
    Signed-off-by: David S. Miller

    Adam Langley
     

13 Jun, 2008

1 commit

  • This reverts two changesets, ec3c0982a2dd1e671bad8e9d26c28dcba0039d87
    ("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and
    the follow-on bug fix 9ae27e0adbf471c7a6b80102e38e1d5a346b3b38
    ("tcp: Fix slab corruption with ipv6 and tcp6fuzz").

    This change causes several problems, first reported by Ingo Molnar
    as a distcc-over-loopback regression where connections were getting
    stuck.

    Ilpo Järvinen first spotted the locking problems: the new function
    added by this code, tcp_defer_accept_check(), only has the child
    socket locked, yet it modifies state of the parent listening
    socket.

    Fixing that is non-trivial at best, because we can't simply just grab
    the parent listening socket lock at this point, because it would
    create an ABBA deadlock. The normal ordering is parent listening
    socket --> child socket, but this code path would require the
    reverse lock ordering.

    Next is a problem noticed by Vitaliy Gusev, who noted:

    ----------------------------------------
    --- a/net/ipv4/tcp_timer.c
    +++ b/net/ipv4/tcp_timer.c
    @@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data)
            goto death;
        }

    +   if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
    +       tcp_send_active_reset(sk, GFP_ATOMIC);
    +       goto death;

    Here socket sk is not attached to listening socket's request queue. tcp_done()
    will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should
    release this sk) as socket is not DEAD. Therefore socket sk will be lost for
    freeing.
    ----------------------------------------

    Finally, Alexey Kuznetsov argues that there might not even be any
    real value or advantage to these new semantics even if we fix all
    of the bugs:

    ----------------------------------------
    Hiding from accept() sockets with only out-of-order data only
    is the only thing which is impossible with old approach. Is this really
    so valuable? My opinion: no, this is nothing but a new loophole
    to consume memory without control.
    ----------------------------------------

    So revert this thing for now.

    Signed-off-by: David S. Miller

    David S. Miller
     

29 May, 2008

1 commit

  • I tried to group recovery-related fields nearby (non-CA_Open related
    variables, to be more accurate) so that one to three cachelines need
    not be touched in CA_Open. These are now contiguously deployed:

    struct sk_buff_head out_of_order_queue; /* 1968 80 */
    /* --- cacheline 32 boundary (2048 bytes) --- */
    struct tcp_sack_block duplicate_sack[1]; /* 2048 8 */
    struct tcp_sack_block selective_acks[4]; /* 2056 32 */
    struct tcp_sack_block recv_sack_cache[4]; /* 2088 32 */
    /* --- cacheline 33 boundary (2112 bytes) was 8 bytes ago --- */
    struct sk_buff * highest_sack; /* 2120 8 */
    int lost_cnt_hint; /* 2128 4 */
    int retransmit_cnt_hint; /* 2132 4 */
    u32 lost_retrans_low; /* 2136 4 */
    u8 reordering; /* 2140 1 */
    u8 keepalive_probes; /* 2141 1 */

    /* XXX 2 bytes hole, try to pack */

    u32 prior_ssthresh; /* 2144 4 */
    u32 high_seq; /* 2148 4 */
    u32 retrans_stamp; /* 2152 4 */
    u32 undo_marker; /* 2156 4 */
    int undo_retrans; /* 2160 4 */
    u32 total_retrans; /* 2164 4 */

    ...and they're then followed by URG slowpath & keepalive related
    variables.

    The head of the out_of_order_queue is always needed for empty checks;
    if that's empty (and TCP is in CA_Open), the following ~200 bytes (on
    64-bit) shouldn't be needed for anything. If only the OFO queue exists
    but TCP is in CA_Open, selective_acks (and possibly duplicate_sack)
    are needed besides the out_of_order_queue, but the rest of the block
    again shouldn't be (i.e., the other direction had losses).

    As the cacheline boundaries depend on many factors in the preceding
    fields, trying to align with them in mind doesn't make much sense.

    Commented one ordering hazard.

    There are a number of lightly used u8/u16s that could be combined to
    save 2 bytes in total, so that the hole could be made to vanish
    (candidates include at least ecn_flags, urg_data, urg_mode,
    frto_counter, and nonagle).
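
    Layout dumps like the one above can be regenerated at any time with
    pahole from the dwarves package:

    pahole -C tcp_sock vmlinux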

    Signed-off-by: Ilpo Järvinen
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

22 May, 2008

1 commit

  • If the previous window was above the values representable in a u16,
    strange things will happen if an undo with the truncated value
    is called for. Alternatively, this could be fixed by some
    max trickery, but that would limit undo for high-speed flows.

    This adds a 16-bit hole, but there isn't anything to fill it with.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

22 Mar, 2008

1 commit

  • Change the TCP_DEFER_ACCEPT implementation so that it transitions a
    connection to ESTABLISHED after the handshake is complete instead of
    leaving it in SYN-RECV until some data arrives. Place the connection
    in the accept queue when the first data packet arrives from the slow
    path.

    Benefits:
    - an established connection is now reset if it never makes it
    to the accept queue

    - the diagnostic state of established matches the packet traces
    showing a completed handshake

    - TCP_DEFER_ACCEPT timeouts are expressed in seconds and can now be
    enforced with reasonable accuracy instead of rounding up to the next
    exponential backoff of the SYN-ACK retry.

    Signed-off-by: Patrick McManus
    Signed-off-by: David S. Miller

    Patrick McManus
     

29 Jan, 2008

3 commits

  • Key points of this patch are:

    - In case the new SACK information is advance-only, no skb
    processing is done below the previously discovered highest point.
    - Cases below the highest point are optimized too, since there's
    no need to always go up to the highest point (which is very likely
    still present in that SACK); this is not entirely true, though,
    because I'm dropping the fastpath_skb_hint, which could
    previously optimize those cases even better. Whether that's
    significant, I'm not too sure.

    Currently it will provide skipping by walking. Combined with
    RB-tree, all skipping would become fast too regardless of window
    size (can be done incrementally later).

    Previously a number of cases in TCP SACK processing failed to
    take advantage of the costly stored information in sack_recv_cache,
    most importantly expected events such as cumulative ACKs and new
    hole ACKs. Processing such ACKs results in rather long walks,
    building up latencies (which easily get nasty when the window is
    huge). Those latencies are often completely unnecessary
    compared with the amount of _new_ information received; usually
    a cumulative ACK carries no new information at all, yet TCP
    walks the whole queue unnecessarily, potentially taking a number
    of costly cache misses on the way, etc.!

    Since the inclusion of highest_sack, there's a lot of information
    that is very likely redundant (the SACK fastpath hint stuff,
    fackets_out, highest_sack), though there's no ultimate guarantee
    that it will remain the same the whole time (in all unearthly
    scenarios). Take advantage of this knowledge here, drop the
    fastpath hint, and use direct access to the highest SACKed skb as
    a replacement.

    Effectively "special cased" fastpath is dropped. This change
    adds some complexity to introduce better coveraged "fastpath",
    though the added complexity should make TCP behave more cache
    friendly.

    The current ACK's SACK blocks are compared against each cached
    block individually, and only ranges that are new are then scanned
    by the high-constant walk. For other parts of the write queue, even
    when in a previously known part of the SACK blocks, a faster skip
    function is used (if necessary at all). In addition, whenever
    possible, TCP fast-forwards to the highest_sack skb that was made
    available by an earlier patch. In the typical case, nothing but
    this fast-forward and the mandatory markings after it occur,
    making the access pattern quite similar to the former fastpath
    "special case".

    DSACKs are a special case that must always be walked.

    The copying from local blocks to recv_sack_cache could be more
    intelligent w.r.t. DSACKs, which are likely to be there only once,
    but that is left to a separate patch.
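
    Schematically, the new walk looks like this (illustrative C only;
    ranges_overlap() and walk_and_mark() are hypothetical stand-ins, while
    the real code keeps the cached blocks in recv_sack_cache):

    for (i = 0; i < used_sacks; i++) {
        u32 start = sack[i].start_seq, end = sack[i].end_seq;

        for (j = 0; j < cache_entries && before(start, end); j++) {
            if (!ranges_overlap(start, end, &cache[j]))
                continue;
            if (before(start, cache[j].start_seq))    /* genuinely new head */
                walk_and_mark(start, cache[j].start_seq);
            start = cache[j].end_seq;      /* fast-skip the known part */
        }
        if (before(start, end))
            walk_and_mark(start, end);     /* remaining new tail */
    }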

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     
  • Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     
  • It is going to replace the sack fastpath hint quite soon... :-)

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

16 Oct, 2007

1 commit

  • There is very little point in having a 32-bit snd_cwnd if this is
    not 32-bit as well, as a number of snd_cwnd incrementation formulas
    assume that snd_cwnd_cnt can be at least as large as snd_cwnd.

    Whether 32-bit is useful was discussed when e0ef57cc56c3c96
    was made:
    http://marc.info/?l=linux-netdev&m=117218144409825&w=2
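
    The kind of incrementation formula meant here is the additive-increase
    pattern found in the congestion-avoidance code (sketch):

    /* snd_cwnd_cnt accumulates per-ACK credit until it reaches snd_cwnd,
     * so it must be at least as wide as snd_cwnd: */
    if (tp->snd_cwnd_cnt >= tp->snd_cwnd) {
        tp->snd_cwnd_cnt = 0;
        tp->snd_cwnd++;
    } else {
        tp->snd_cwnd_cnt++;
    }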

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

26 Apr, 2007

5 commits

  • For the places where we need a pointer to the transport header, it is
    still legal to touch skb->h.raw directly if just adding to,
    subtracting from or setting it to another layer header.
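
    For example, this kind of direct manipulation remains fine (a sketch in
    the old union field names of the day):

    /* transport header derived from the network header plus the IP
     * header length: */
    skb->h.raw = skb->nh.raw + ip_hdrlen(skb);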

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • The ip_hdrlen() buddy, created to reduce the number of skb->h.th-> uses
    and to avoid the longer, open-coded equivalent.

    Ditched a no-op in bnx2 in the process.

    I wonder if we should have a BUG_ON(skb->h.th->doff < 5) in tcp_optlen()...
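
    The helper amounts to roughly this (in the field names of the time):

    static inline unsigned int tcp_optlen(const struct sk_buff *skb)
    {
        /* option bytes beyond the 20-byte fixed TCP header */
        return (skb->h.th->doff - 5) * 4;
    }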

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • Signed-off-by: David S. Miller

    David S. Miller
     
  • I noticed in oprofile study a cache miss in tcp_rcv_established() to read
    copied_seq.

    ffffffff80400a80: /* tcp_rcv_established total: 4034293  2.0400 */

     55493  0.0281 :ffffffff80400bc9:  mov  0x4c8(%r12),%eax   copied_seq
    543103  0.2746 :ffffffff80400bd1:  cmp  0x3e0(%r12),%eax   rcv_nxt

    if (tp->copied_seq == tp->rcv_nxt &&
        len - tcp_header_len <= tp->ucopy.len) {

    In this function, the cache line 0x4c0 -> 0x500 is used only to
    read the copied_seq field.

    rcv_wup and copied_seq should be next to the rcv_nxt field, to lower
    the number of active cache lines in hot paths (tcp_rcv_established(),
    tcp_poll(), ...).

    As you suggested, I changed tcp_create_openreq_child() so that these
    fields are changed together, to avoid adding a new store buffer stall.

    The patch is 64-bit friendly (no new hole due to alignment constraints).
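
    The resulting adjacency in struct tcp_sock (roughly, with the comments
    as they appear in the source):

    u32 rcv_nxt;     /* What we want to receive next        */
    u32 copied_seq;  /* Head of yet unread data             */
    u32 rcv_wup;     /* rcv_nxt on last window update sent  */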

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet