11 Nov, 2012

1 commit


04 Nov, 2012

1 commit

  • For passive TCP connections using the TCP_DEFER_ACCEPT facility,
    we incorrectly increment req->retrans each time the timeout triggers
    while no SYNACK is sent.

    SYNACKs are not sent for TCP_DEFER_ACCEPT connections that were
    established (for which we received the ACK from the client). Only the
    last SYNACK is sent, so that we can receive again an ACK from the
    client and move the req into the accept queue. We plan to change this
    later to avoid the useless retransmit (and the potential problem of
    this SYNACK being lost).

    TCP_INFO later gives wrong information to the user, claiming
    imaginary retransmits.

    Decouple req->retrans into two independent fields:

    num_retrans : number of retransmits
    num_timeout : number of timeouts

    num_timeout is the counter incremented at each timeout, regardless
    of whether a SYNACK was actually sent, and is used to compute the
    exponential timeout.

    Introduce an inet_rtx_syn_ack() helper that increments num_retrans
    only if ->rtx_syn_ack() succeeded.
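
    A minimal sketch of the helper (kernel C of this era; the exact
    ->rtx_syn_ack() signature varies across versions, so treat this as
    illustrative):

    int inet_rtx_syn_ack(struct sock *parent, struct request_sock *req)
    {
            int err = req->rsk_ops->rtx_syn_ack(parent, req);

            /* Count a retransmit only when a SYNACK actually went out. */
            if (!err)
                    req->num_retrans++;
            return err;
    }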

    Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
    when we re-send a SYNACK in answer to a (retransmitted) SYN.
    Prior to this patch, we were not counting these retransmits.

    Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
    only if a synack packet was successfully queued.

    Reported-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Cc: Julian Anastasov
    Cc: Vijay Subramanian
    Cc: Elliott Hughes
    Cc: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Oct, 2012

1 commit

  • Add a bit, TCPI_OPT_SYN_DATA (32), to the socket option
    TCP_INFO:tcpi_options. It is set if the data in the SYN (sent or
    received) is acked by the SYN-ACK. Server or client applications can
    use this information to check the Fast Open success rate.
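
    A minimal user-space sketch of checking the bit (TCPI_OPT_SYN_DATA
    may be missing from older libc headers, hence the fallback define;
    illustrative only):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    #ifndef TCPI_OPT_SYN_DATA
    #define TCPI_OPT_SYN_DATA 32    /* bit added by this commit */
    #endif

    /* Returns 1 if data in the SYN was acked by the SYN-ACK, 0 if not,
     * -1 on getsockopt() error.
     */
    int tfo_syn_data_acked(int fd)
    {
            struct tcp_info info;
            socklen_t len = sizeof(info);

            if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) < 0)
                    return -1;
            return !!(info.tcpi_options & TCPI_OPT_SYN_DATA);
    }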

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

23 Sep, 2012

1 commit

  • Previously, when using TCP Fast Open a server would return from
    tcp_check_req() before updating snt_synack based on TCP timestamp echo
    replies and whether or not we had retransmitted the SYNACK. The result
    was that (a) for TFO connections using timestamps we used an incorrect
    baseline SYNACK send time (the tcp_time_stamp of the SYNACK send
    instead of rcv_tsecr), and (b) for TFO connections without TCP
    timestamps that retransmit the SYNACK we took a SYNACK RTT sample
    when we should not have taken one.

    This fix merely moves the snt_synack update logic a bit earlier in the
    function, so that connections using TCP Fast Open will properly do
    these updates when the ACK for the SYNACK arrives.

    Moving this snt_synack update logic means that with TCP_DEFER_ACCEPT
    enabled we do a few instructions of wasted work on each bare ACK, but
    that seems OK.
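
    The moved update is roughly of the following shape (a sketch using
    this era's field names, placed before any TFO early return in
    tcp_check_req()):

    if (tmp_opt.saw_tstamp && tmp_opt.rcv_tsecr)
            tcp_rsk(req)->snt_synack = tmp_opt.rcv_tsecr; /* echo as baseline */
    else if (req->retrans)
            tcp_rsk(req)->snt_synack = 0; /* retransmitted: skip RTT sample */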

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     

21 Sep, 2012

1 commit

  • Both tcp_timewait_state_process and tcp_check_req use the same basic
    construct of

    struct tcp_options_received tmp_opt;
    tmp_opt.saw_tstamp = 0;

    then call

    tcp_parse_options

    However, if they are fed a frame containing a TCP SACK option, the
    code behaviour is undefined because opt_rx->sack_ok is undefined data.

    This ought to be documented if it is intentional.
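
    One defensive way to avoid the uninitialized read, sketched under the
    assumption that zeroing the whole structure is acceptable (not
    necessarily the fix that was merged):

    struct tcp_options_received tmp_opt;

    memset(&tmp_opt, 0, sizeof(tmp_opt)); /* sack_ok et al. now start at 0 */

    tcp_parse_options() can then be called exactly as before.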

    Signed-off-by: Alan Cox
    Signed-off-by: David S. Miller

    Alan Cox
     

01 Sep, 2012

1 commit

  • This patch builds on top of the previous patch to add support for
    TFO listeners. This includes:

    1. allocating, properly initializing, and managing the per listener
    fastopen_queue structure when TFO is enabled

    2. changes to the inet_csk_accept code to support TFO. E.g., the
    request_sock can no longer be freed upon accept(), not until 3WHS
    finishes

    3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
    if it's a TFO socket

    4. properly closing a TFO listener, and a TFO socket before 3WHS
    finishes

    5. supporting the TCP_FASTOPEN socket option (see the user-space
    sketch after this list)

    6. modifying tcp_check_req() to check a TFO socket as well as a
    request_sock

    7. supporting TCP's TFO cookie option

    8. adding a new SYN-ACK retransmit handler that uses the timer
    directly off the TFO socket rather than the listener socket. Note
    that the TFO server side will not retransmit anything other than the
    SYN-ACK until the 3WHS is completed.
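
    A user-space sketch of item 5, assuming the option value 23 used by
    this era of the kernel (fallback define for older libc headers;
    listen_with_tfo() is a hypothetical helper):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    #ifndef TCP_FASTOPEN
    #define TCP_FASTOPEN 23 /* socket option added by this series */
    #endif

    /* Enable TFO on a listening socket; qlen bounds how many TFO
     * connections may await 3WHS completion in the fastopen_queue.
     */
    int listen_with_tfo(int fd, int backlog)
    {
            int qlen = 16; /* illustrative value */

            if (setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN,
                           &qlen, sizeof(qlen)) < 0)
                    return -1;
            return listen(fd, backlog);
    }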

    The patch also contains an important function,
    "reqsk_fastopen_remove()", to manage the somewhat complex
    relationship between a listener, its request_sock, and the
    corresponding child socket. See the comment above the function for
    the details.

    Signed-off-by: H.K. Jerry Chu
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Eric Dumazet
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Jerry Chu
     

20 Aug, 2012

1 commit

  • This commit removes the sk_rx_dst_set calls from
    tcp_create_openreq_child(), because at that point the icsk_af_ops
    field of ipv6_mapped TCP sockets has not been set to its proper final
    value.

    Instead, to make sure we get the right sk_rx_dst_set variant
    appropriate for the address family of the new connection, we have
    tcp_v{4,6}_syn_recv_sock() directly call the appropriate function
    shortly after the call to tcp_create_openreq_child() returns.

    This also moves inet6_sk_rx_dst_set() to avoid a forward declaration
    with the new approach.

    Signed-off-by: Neal Cardwell
    Reported-by: Artem Savkov
    Cc: Eric Dumazet
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     

07 Aug, 2012

1 commit

  • IPv6 needs a cookie in dst_check() call.

    We need to add rx_dst_cookie and provide a family independent
    sk_rx_dst_set(sk, skb) method to properly support IPv6 TCP early demux.
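
    A sketch of the IPv6 flavor of the method (field and helper names per
    this era's kernel; treat the details as illustrative):

    static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
    {
            struct dst_entry *dst = skb_dst(skb);
            const struct rt6_info *rt = (const struct rt6_info *)dst;

            dst_hold(dst);
            sk->sk_rx_dst = dst;
            inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
            /* The cookie lets later dst_check() calls spot stale routes. */
            if (rt->rt6i_node)
                    inet6_sk(sk)->rx_dst_cookie = rt->rt6i_node->fn_sernum;
    }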

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Jul, 2012

1 commit

  • commit c6cffba4ffa2 (ipv4: Fix input route performance regression.)
    added various fatal races with dst refcounts.

    Crashes happen on TCP workloads if routes are added/deleted at the
    same time.

    The dst_free() calls from free_fib_info_rcu() are clearly racy.

    We instead need regular dst refcounting (dst_release()) and must make
    sure dst_release() is aware of RCU grace periods:

    Add a DST_RCU_FREE flag so that dst_release() respects an RCU grace
    period before dst destruction for cached dsts.

    Introduce a new inet_sk_rx_dst_set() helper, using
    atomic_inc_not_zero() to make sure we don't increase a zero refcount
    (on a dst currently waiting out an RCU grace period before
    destruction).
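
    The refcount rule the helper relies on, as a sketch (dst_try_hold()
    is a hypothetical wrapper; __refcnt follows this era's struct
    dst_entry):

    static bool dst_try_hold(struct dst_entry *dst)
    {
            /* Refuse to revive a dst whose refcount already reached zero
             * and is merely waiting out an RCU grace period before being
             * freed.
             */
            return atomic_inc_not_zero(&dst->__refcnt);
    }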

    rt_cache_route() must take a reference on the new cached route, and
    release it if it was not able to install it.

    With this patch, my machines survive various benchmarks.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Jul, 2012

1 commit

  • commit 92101b3b2e317 (ipv4: Prepare for change of rt->rt_iif encoding.)
    invalidated TCP early demux, because rx_dst_ifindex is not properly
    initialized and checked.

    Also remove the use of inet_iif(skb) in favor of skb->skb_iif.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Jul, 2012

1 commit

  • This patch implements the common code for both the client and server.

    1. TCP Fast Open option processing. Since Fast Open does not have an
    option number assigned by IANA yet, it shares the experimental option
    code 254 by implementing draft-ietf-tcpm-experimental-options
    with a 16-bit magic number 0xF989. This enables global experiments
    without clashing with the scarce (2) experimental option codes
    available for TCP.

    When the draft status becomes standard (maybe), the client should
    switch to the newly assigned option number while the server supports
    both numbers for the transition.

    2. The new sysctl tcp_fastopen

    3. A placeholder init function

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

12 Jul, 2012

1 commit

  • This introduces TSQ (TCP Small Queues).

    TSQ's goal is to reduce the number of TCP packets in xmit queues
    (qdisc & device queues), to reduce RTT and cwnd bias, part of the
    bufferbloat problem.

    sk->sk_wmem_alloc is not allowed to grow above a given limit,
    allowing no more than ~128KB [1] per TCP socket in the qdisc/dev
    layers at a given time.

    TSO packets are sized/capped to half the limit, so that we have two
    TSO packets in flight, allowing better bandwidth use.

    As a side effect, setting the limit to 40000 automatically reduces
    the standard gso max limit (65536) to 40000/2: it can help reduce
    latencies of high-priority packets, since TSO packets are smaller.

    This means we divert sock_wfree() to a tcp_wfree() handler, to
    queue/send following frames when skb_orphan() [2] is called for the
    already queued skbs.

    Results on my dev machines (tg3/ixgbe nics) are really impressive,
    using standard pfifo_fast, and with or without TSO/GSO.

    Without reduction of nominal bandwidth, we have reduction of buffering
    per bulk sender :
    < 1ms on Gbit (instead of 50ms with TSO)
    < 8ms on 100Mbit (instead of 132 ms)

    I no longer have 4 MBytes backlogged in the qdisc by a single netperf
    session, and socket autotuning on both sides no longer uses 4 MBytes.

    As the skb destructor cannot restart xmit itself (the qdisc lock
    might be taken at this point), we delegate the work to a tasklet. We
    use one tasklet per cpu for performance reasons.

    If the tasklet finds a socket owned by the user, it sets the
    TSQ_OWNED flag. This flag is tested in a new protocol method called
    from release_sock(), to eventually send new segments.

    [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
    [2] skb_orphan() is usually called at TX completion time,
    but some drivers call it in their start_xmit() handler.
    These drivers should at least use BQL, or else a single TCP
    session can still fill the whole NIC TX ring, since TSQ will
    have no effect.
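
    A sketch of driving the new tunable [1] from user space
    (set_tsq_limit() is a hypothetical helper):

    #include <stdio.h>

    /* Cap per-socket qdisc/dev bytes, e.g. at 40000 as discussed above. */
    int set_tsq_limit(int bytes)
    {
            FILE *f = fopen("/proc/sys/net/ipv4/tcp_limit_output_bytes", "w");

            if (!f)
                    return -1;
            fprintf(f, "%d\n", bytes);
            return fclose(f);
    }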

    Signed-off-by: Eric Dumazet
    Cc: Dave Taht
    Cc: Tom Herbert
    Cc: Matt Mathis
    Cc: Yuchung Cheng
    Cc: Nandita Dukkipati
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Jul, 2012

2 commits


20 Jun, 2012

1 commit

  • Input packet processing for local sockets involves two major demuxes.
    One for the route and one for the socket.

    But we can optimize this down to one demux for certain kinds of local
    sockets.

    Currently we only do this for established TCP sockets, but it could
    at least in theory be expanded to other kinds of connections.

    If a TCP socket is established then its identity is fully specified.

    This means that whatever input route was used during the three-way
    handshake must work equally well for the rest of the connection since
    the keys will not change.

    Once we move to established state, we cache the received packet's
    input route to use later.

    Like the existing cached route in sk->sk_dst_cache used for output
    packets, we have to check for route invalidations using dst->obsolete
    and dst->ops->check().

    Early demux occurs outside of a socket locked section, so when a route
    invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
    actually inside of established state packet processing and thus have
    the socket locked.
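
    The invalidation check is roughly the following sketch (run once the
    socket lock is held in established-state processing; names hedged):

    struct dst_entry *dst = sk->sk_rx_dst;

    if (dst && dst->ops->check(dst, 0) == NULL) {
            /* Route was invalidated since we cached it; drop it and let
             * a later packet re-establish sk->sk_rx_dst.
             */
            dst_release(dst);
            sk->sk_rx_dst = NULL;
    }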

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Jun, 2012

1 commit


09 Jun, 2012

1 commit

  • The get_peer method TCP uses is full of special cases that make no
    sense to accommodate, and it also gets in the way of doing more
    reasonable things here.

    First of all, if the socket doesn't have a usable cached route, there
    is no sense in trying to optimize timewait recycling.

    Likewise for the case where we have IP options, such as SRR enabled,
    that make the IP header destination address (and thus the destination
    address of the route key) differ from the connection's destination
    address.

    Just return a NULL peer in these cases, and thus we're also able to
    get rid of the clumsy inetpeer release logic.
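
    A sketch of the simplified method under these rules (names per this
    era's kernel; illustrative, not the exact patch):

    static struct inet_peer *tcp_v4_get_peer(struct sock *sk)
    {
            struct rtable *rt = (struct rtable *)__sk_dst_get(sk);
            struct inet_sock *inet = inet_sk(sk);

            /* No usable cached route, or IP options (e.g. SRR) made the
             * route key's destination differ from the connection's:
             * return no peer and skip timewait recycling.
             */
            if (!rt || inet->inet_daddr != rt->rt_dst)
                    return NULL;
            return inet_getpeer_v4(rt->rt_dst, 1);
    }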

    Signed-off-by: David S. Miller

    David S. Miller
     

18 May, 2012

1 commit


03 May, 2012

1 commit

  • This patch implements RFC 5827 early retransmit (ER) for TCP.
    It reduces the DUPACK threshold (dupthresh) when fewer than 4 packets
    are outstanding, so losses are recovered by fast recovery instead of
    a timeout.

    While the algorithm is simple, small but frequent network reordering
    makes this feature dangerous: the connection repeatedly enters
    false recovery and degrades performance. Therefore we implement
    a mitigation suggested in the appendix of the RFC that delays
    entering fast recovery by a small interval, i.e., RTT/4. Currently
    ER is conservative and is disabled for the rest of the connection
    after the first reordering event. A large-scale web server
    experiment on the performance impact of ER is summarized in
    section 6 of the paper "Proportional Rate Reduction for TCP",
    IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf

    Note that Linux has a similar feature called THIN_DUPACK. The
    differences are that THIN_DUPACK does not mitigate reordering and is
    only used after slow start. Currently ER is disabled if THIN_DUPACK
    is enabled. I would be happy to merge the THIN_DUPACK feature with
    ER if people think it's a good idea.

    ER is enabled by sysctl_tcp_early_retrans:
    0: Disables ER

    1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4.

    2: (Default) reduce dupthresh like mode 1. In addition, delay
    entering fast recovery by RTT/4.

    Note: mode 2 is implemented in the third part of this patch series.
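
    The mode-1 rule reduces to a small computation, sketched here with a
    hypothetical helper name:

    static unsigned int er_dupthresh(unsigned int packets_out,
                                     unsigned int dupthresh)
    {
            /* RFC 5827: with fewer than 4 packets outstanding, the usual
             * dupthresh of 3 can never be met, so lower it.
             */
            if (packets_out > 1 && packets_out < 4)
                    return packets_out - 1;
            return dupthresh; /* normally 3 */
    }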

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

01 Feb, 2012

1 commit

  • In order to be able to support proper RST messages for TCP MD5 flows,
    we need to allow access to MD5 keys without locking the listener
    socket.

    This conversion is a nice cleanup, and shrinks size of timewait sockets
    by 80 bytes.

    IPv6 code reuses generic code found in IPv4 instead of duplicating it.

    Control path uses GFP_KERNEL allocations instead of GFP_ATOMIC.

    Signed-off-by: Eric Dumazet
    Cc: Shawn Lu
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Dec, 2011

1 commit


01 Dec, 2011

1 commit

  • Rick Jones reported that the TCP_CONGESTION sockopt performed on a
    listener was ignored for its child sockets: right after accept(),
    the congestion control for the new socket is the system default one.

    This seems to be an oversight in the initial design (quoting Stephen).

    Based on prior investigation and a patch from Rick.
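
    A user-space sketch of the behaviour being fixed: set the algorithm
    on the listener and expect accept()ed children to inherit it
    (set_listener_cc() is a hypothetical helper):

    #include <string.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    int set_listener_cc(int lfd, const char *name)
    {
            /* With this fix, sockets returned by accept() inherit this
             * congestion control instead of the system default.
             */
            return setsockopt(lfd, IPPROTO_TCP, TCP_CONGESTION,
                              name, strlen(name));
    }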

    Reported-by: Rick Jones
    Signed-off-by: Eric Dumazet
    CC: Stephen Hemminger
    CC: Yuchung Cheng
    Tested-by: Rick Jones
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Nov, 2011

1 commit


09 Nov, 2011

1 commit


27 Oct, 2011

1 commit

  • commit 66b13d99d96a (ipv4: tcp: fix TOS value in ACK messages sent from
    TIME_WAIT) fixed IPv4 only.

    This part is for the IPv6 side, adding a tclass param to ip6_xmit()

    We alias tw_tclass and tw_tos if the socket family is INET6.

    [ if the socket is ipv4-mapped, only the IP_TOS socket option is used
    to fill the TOS field; TCLASS is not taken into account ]

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

25 Oct, 2011

1 commit


21 Oct, 2011

1 commit

  • Adding const qualifiers to pointers can ease code review and spot
    some bugs. It might allow the compiler to optimize code further.

    For example, is it legal to temporarily write a null cksum into the
    tcphdr in tcp_md5_hash_header()? I am afraid a sniffer could catch
    the temporary null value...

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Oct, 2011

1 commit

  • The transparent socket option setting was not copied to the time-wait
    socket when an inet socket was being replaced by a time-wait socket.
    This broke the --transparent option of the socket match and may have
    caused FIN packets belonging to sockets in FIN_WAIT2 or TIME_WAIT
    state to be dropped by the packet filter.
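
    The fix boils down to one copy when the timewait socket is created,
    roughly (a sketch around tcp_time_wait(); names hedged):

    tw->tw_transparent = inet_sk(sk)->transparent;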

    Signed-off-by: KOVACS Krisztian
    Signed-off-by: David S. Miller

    KOVACS Krisztian
     

09 Jun, 2011

1 commit

  • This patch lowers the default initRTO from 3secs to 1sec per
    RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet
    has been retransmitted, AND the TCP timestamp option is not on.

    It also adds support to take RTT sample during 3WHS on the passive
    open side, just like its active open counterpart, and uses it, if
    valid, to seed the initRTO for the data transmission phase.

    The patch also resets ssthresh to its initial default at the
    beginning of the data transmission phase, and reduces cwnd to 1 if
    there has been MORE THAN ONE retransmission during 3WHS per RFC5681.
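
    The constants involved look roughly like this sketch (include/net/tcp.h
    of this era; treat the exact spellings as illustrative):

    #define TCP_TIMEOUT_INIT     ((unsigned)(1*HZ)) /* initRTO per RFC2988bis */
    #define TCP_TIMEOUT_FALLBACK ((unsigned)(3*HZ)) /* retransmitted 3WHS, no timestamps */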

    Signed-off-by: H.K. Jerry Chu
    Signed-off-by: David S. Miller

    Jerry Chu
     

09 Dec, 2010

2 commits


02 Dec, 2010

1 commit

  • The only thing AF-specific about remembering the timestamp
    for a time-wait TCP socket is getting the peer.

    Abstract that behind a new timewait_sock_ops vector.

    Support for real IPV6 sockets is not filled in yet, but
    curiously this makes timewait recycling start to work
    for v4-mapped ipv6 sockets.

    Signed-off-by: David S. Miller

    David S. Miller
     

01 Dec, 2010

1 commit


24 Sep, 2010

1 commit


13 Jul, 2010

1 commit


12 Apr, 2010

1 commit


30 Mar, 2010

1 commit

  • …implicit slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the
    following script is used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It is put in the include block which contains
    core kernel includes, in the same order that the rest are ordered:
    alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints
    out an error message indicating which .h file needs to be added to
    the file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition, while adding it to an implementation .h
    or embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on
    step 6, I'm fairly confident about the coverage of this conversion
    patch. If there is a breakage, it's likely to be something in one of
    the arch headers, which should be easily discoverable on most builds
    of the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

22 Mar, 2010

1 commit

  • It's currently hard to diagnose when ACK frames are dropped because
    an application set TCP_DEFER_ACCEPT on its listening socket.

    See http://bugzilla.kernel.org/show_bug.cgi?id=15507

    This patch adds an SNMP value, named TCPDeferAcceptDrop

    netstat -s | grep TCPDeferAcceptDrop
    TCPDeferAcceptDrop: 0

    This counter is incremented every time we drop a pure ACK frame
    received by a socket in SYN_RECV state because its SYNACK retrans
    count is lower than the defer_accept value.
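
    The accounting point is roughly the following sketch, inside
    tcp_check_req() where defer_accept swallows the pure ACK (names
    hedged, not the verbatim patch):

    if (req->retrans < inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
        TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
            inet_rsk(req)->acked = 1;
            NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPDEFERACCEPTDROP);
            return NULL;
    }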

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 Mar, 2010

1 commit


16 Dec, 2009

1 commit

  • It creates a regression, triggering badness for SYN_RECV
    sockets, for example:

    [19148.022102] Badness at net/ipv4/inet_connection_sock.c:293
    [19148.022570] NIP: c02a0914 LR: c02a0904 CTR: 00000000
    [19148.023035] REGS: eeecbd30 TRAP: 0700 Not tainted (2.6.32)
    [19148.023496] MSR: 00029032 CR: 24002442 XER: 00000000
    [19148.024012] TASK = eee9a820[1756] 'privoxy' THREAD: eeeca000

    This is likely caused by the change in the 'estab' parameter
    passed to tcp_parse_options() when invoked by the functions
    in net/ipv4/tcp_minisocks.c.

    But even if that is fixed, the ->conn_request() changes made in
    this patch series are fundamentally wrong. They try to use the
    listening socket's 'dst' to probe the route settings. The
    listening socket doesn't even have a route, and you can't get the
    right route (the child request's one) until much later, after we
    set up all of the state, and it must be done by hand.

    This stuff really isn't ready, so the best thing to do is a
    full revert. This reverts the following commits:

    f55017a93f1a74d50244b1254b9a2bd7ac9bbf7d
    022c3f7d82f0f1c68018696f2f027b87b9bb45c2
    1aba721eba1d84a2defce45b950272cee1e6c72a
    cda42ebd67ee5fdf09d7057b5a4584d36fe8a335
    345cda2fd695534be5a4494f1b59da9daed33663
    dc343475ed062e13fc260acccaab91d7d80fd5b2
    05eaade2782fb0c90d3034fd7a7d5a16266182bb
    6a2a2d6bf8581216e08be15fcb563cfd6c430e1e

    Signed-off-by: David S. Miller

    David S. Miller