25 Jul, 2013

1 commit

  • The idea of this patch is to add an optional limit on the number of
    unsent bytes in TCP sockets, to reduce kernel memory usage.

    A TCP receiver might announce a big window, and TCP sender autotuning
    might allow a large number of bytes in the write queue, but this brings
    little performance benefit if a large part of this buffering is wasted:

    The write queue needs to be large only to deal with a large BDP, not
    to cope with scheduling delays (incoming ACKs make room for the
    application to queue more bytes).

    For most workloads, a value of 128 KB or less gives applications
    enough time to react to POLLOUT events (or to be woken up from a
    blocking sendmsg()).

    This patch adds two ways to set the limit:

    1) A per-socket option, TCP_NOTSENT_LOWAT

    2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
    not using the TCP_NOTSENT_LOWAT socket option (or setting it to zero).
    The default value is UINT_MAX (0xFFFFFFFF), meaning the limit has no
    effect.

    This changes poll()/select()/epoll() to report POLLOUT
    only if the number of unsent bytes is below tp->notsent_lowat.

    Note this might increase the number of sendmsg()/sendfile() calls
    on non-blocking sockets, and the number of context switches for
    blocking sockets.

    Note this is not related to SO_SNDLOWAT (SO_SNDLOWAT is defined as:
    specify the minimum number of bytes in the buffer until the socket
    layer will pass the data to the protocol).
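
    A minimal userspace sketch of the intended use (assumes a kernel and
    headers that define TCP_NOTSENT_LOWAT; error handling trimmed):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <poll.h>
    #include <sys/socket.h>

    /* Cap unsent bytes on this socket to 128 KB, then rely on poll():
     * POLLOUT is reported only once unsent bytes drop below the limit. */
    static int wait_writable_low_unsent(int fd)
    {
            unsigned int lowat = 128 * 1024;
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };

            if (setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                           &lowat, sizeof(lowat)) < 0)
                    return -1;

            return poll(&pfd, 1, -1);
    }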

    Tested:

    netperf sessions, and watching /proc/net/protocols "memory" column for TCP

    With 200 concurrent netperf -t TCP_STREAM sessions, the amount of kernel
    memory used by TCP buffers shrinks by ~55% (20567 pages instead of 45458).

    lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
    lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
    TCPv6 1880 2 45458 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
    TCP 1696 508 45458 no 208 yes kernel y y y y y y y y y y y y y n y y y y y

    lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
    lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
    TCPv6 1880 2 20567 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
    TCP 1696 508 20567 no 208 yes kernel y y y y y y y y y y y y y n y y y y y

    Using 128 KB has no adverse effect on the throughput or CPU usage
    of a single flow, although it increases the number of context switches.

    A bonus is that we hold the socket lock for a shorter time, which
    should improve ACK-processing latencies.

    lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
    lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
    OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
    Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
    Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
    Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
    Final       Final                                             %     Method %      Method
    1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB

    Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

    412,514 context-switches

    200.034645535 seconds time elapsed

    lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
    lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
    OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
    Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
    Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
    Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
    Final       Final                                             %     Method %      Method
    1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB

    Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

    2,675,818 context-switches

    200.029651391 seconds time elapsed

    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Acked-By: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 May, 2013

1 commit

  • tcp_timeout_skb() was intended to trigger fast recovery on timeout;
    unfortunately, in reality it often causes spurious retransmission
    storms during fast recovery. The telltale sign is a fast retransmit
    over the highest sacked sequence (SND.FACK).

    Currently the RTO timer re-arming (as in RFC6298) offers a nice cushion
    to avoid spurious timeout: when SND.UNA advances the sender re-arms
    RTO and extends the timeout by icsk_rto. The sender does not offset
    the time elapsed since the packet at SND.UNA was sent.

    But if the next (DUP)ACK arrives later than ~RTTVAR and triggers
    tcp_fastretrans_alert(), then tcp_timeout_skb() will mark as lost any
    packet sent before the icsk_rto interval, including packets above the
    highest sacked sequence. Most likely a large part of the scoreboard
    will be marked.

    If most packets are not lost then the subsequent DUPACKs with new SACK
    blocks will cause the sender to continue to retransmit packets beyond
    SND.FACK spuriously. Even if only one packet is lost the sender may
    falsely retransmit almost the entire window.

    The situation becomes common in the world of bufferbloat: the RTT
    continues to grow as the queue builds up but RTTVAR remains small and
    close to the minimum 200ms. If a data packet is lost and the DUPACK
    triggered by the next data packet is slightly delayed, then a spurious
    retransmission storm forms.

    As the original comment on tcp_timeout_skb() suggests: the usefulness
    of this feature is questionable. It also wastes cycles walking the
    sack scoreboard and is actually harmful because of false recovery.

    It's time to remove this.

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Acked-by: Nandita Dukkipati
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

21 Mar, 2013

2 commits

  • This patch implements F-RTO (forward RTO recovery):

    When the first retransmission after timeout is acknowledged, F-RTO
    sends new data instead of old data. If the next ACK acknowledges
    some never-retransmitted data, then the timeout was spurious and the
    congestion state is reverted. Otherwise if the next ACK selectively
    acknowledges the new data, then the timeout was genuine and the
    loss recovery continues. This idea applies to recurring timeouts
    as well. While F-RTO sends different data during timeout recovery,
    it does not (and should not) change the congestion control.

    The implementation follows the three steps of the SACK-enhanced
    algorithm (section 3) in RFC5682. Step 1 is in tcp_enter_loss(). Steps 2
    and 3 are in tcp_process_loss(). The basic version is not supported
    because the SACK-enhanced version also works for non-SACK connections.

    The new implementation is functionally on par with the old F-RTO
    implementation, except for one case where it increases undo events:
    In addition to the RFC algorithm, a spurious timeout may be detected
    without sending data in step 2, as long as the SACK confirms not
    all the original data are dropped. When this happens, the sender
    will undo the cwnd and perhaps enter fast recovery instead. This
    additional check increases the F-RTO undo events by 5x compared
    to the prior implementation on Google Web servers, since the sender
    often does not have new data to send for HTTP.

    Note that F-RTO may detect a spurious timeout before Eifel with
    timestamps does.

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch series refactors the F-RTO feature (RFC4138/5682).

    This is to simplify the loss recovery processing. Existing F-RTO
    was developed during the experimental stage (RFC4138) and has
    many experimental features. It takes a separate code path from
    the traditional timeout processing by overloading CA_Disorder
    instead of using CA_Loss state. This complicates CA_Disorder state
    handling because it's also used for handling dubious ACKs and undos.
    While the algorithm in the RFC does not change the congestion control,
    the implementation intercepts congestion control in various places
    (e.g., frto_cwnd in tcp_ack()).

    The new code implements newer F-RTO RFC5682 using CA_Loss processing
    path. F-RTO becomes a small extension in the timeout processing
    and interfaces with congestion control and Eifel undo modules.
    It lets the congestion control (module) determine how much to send
    independently. F-RTO only chooses what to send in order to detect
    spurious retransmission. If the timeout is found spurious, it invokes
    existing Eifel undo algorithms such as DSACK- or TCP-timestamp-based
    detection.

    The first patch removes all F-RTO code except sysctl_tcp_frto, which
    is left for the new implementation. Since CA_EVENT_FRTO is removed,
    TCP Westwood now computes ssthresh on the regular timeout
    (CA_EVENT_LOSS) event.

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

18 Mar, 2013

1 commit

  • TCPCT uses option number 253, which is reserved for experimental use
    and should not be used in production environments.
    Further, TCPCT does not fully implement RFC 6013.

    As a nice side-effect, removing TCPCT increases TCP's performance for
    very short flows:

    Doing an Apache benchmark with -c 100 -n 100000, sending HTTP requests
    for files of 1 KB size:

    before this patch:
    average (among 7 runs) of 20845.5 Requests/Second
    after:
    average (among 7 runs) of 21403.6 Requests/Second

    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Christoph Paasch
     

12 Mar, 2013

2 commits

  • This is the second of the TLP patch series; it augments the basic TLP
    algorithm with a loss detection scheme.

    This patch implements a mechanism for loss detection when a Tail
    loss probe retransmission plugs a hole, thereby masking packet loss
    from the sender. The loss detection algorithm relies on counting
    TLP dupacks, as outlined in Sec. 3 of:
    http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01

    The basic idea is: the sender keeps track of a TLP "episode" upon
    retransmission of a TLP packet. An episode ends when the sender receives
    an ACK above the SND.NXT (tracked by tlp_high_seq) at the time of the
    episode. We want to make sure that before the episode ends the sender
    receives a "TLP dupack", indicating that the TLP retransmission was
    unnecessary, so there was no loss/hole that needed plugging. If the
    sender gets no TLP dupack before the end of the episode, then it reduces
    ssthresh and the congestion window, because the TLP packet arriving at
    the receiver probably plugged a hole.

    Signed-off-by: Nandita Dukkipati
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Nandita Dukkipati
     
  • This patch series implement the Tail loss probe (TLP) algorithm described
    in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
    first patch implements the basic algorithm.

    TLP's goal is to reduce tail latency of short transactions. It achieves
    this by converting retransmission timeouts (RTOs) occurring due
    to tail losses (losses at the end of transactions) into fast recovery.
    TLP transmits one packet in two round-trips when a connection is in
    Open state and isn't receiving any ACKs. The transmitted packet, aka
    loss probe, can be either new or a retransmission. When there is tail
    loss, the ACK from a loss probe triggers FACK/early-retransmit based
    fast recovery, thus avoiding a costly RTO. In the absence of loss,
    there is no change in the connection state.

    PTO stands for probe timeout. It is a timer event indicating
    that an ACK is overdue, and it triggers a loss probe packet. The PTO
    value is set to max(2*SRTT, 10ms) and is adjusted to account for the
    delayed ACK timer when there is only one outstanding packet.

    TLP Algorithm

    On transmission of new data in Open state:
    -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
    -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
    -> PTO = min(PTO, RTO)

    Conditions for scheduling PTO:
    -> Connection is in Open state.
    -> Connection is either cwnd limited or no new data to send.
    -> Number of probes per tail loss episode is limited to one.
    -> Connection is SACK enabled.

    When PTO fires:
    new_segment_exists:
    -> transmit new segment.
    -> packets_out++. cwnd remains same.

    no_new_packet:
    -> retransmit the last segment.
    Its ACK triggers FACK or early retransmit based recovery.

    ACK path:
    -> rearm RTO at start of ACK processing.
    -> reschedule PTO if need be.
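
    A minimal sketch of the PTO arithmetic above (values in milliseconds;
    names are illustrative, not the kernel's):

    /* PTO = max(2*SRTT, 10ms) with >1 packet out, or
     * max(2*RTT, 1.5*RTT + 200ms) with exactly one, capped by RTO. */
    static unsigned int tlp_pto_ms(unsigned int srtt_ms, unsigned int rto_ms,
                                   unsigned int packets_out)
    {
            unsigned int pto = 2 * srtt_ms;

            if (packets_out > 1) {
                    if (pto < 10)
                            pto = 10;
            } else {
                    unsigned int delack = (3 * srtt_ms) / 2 + 200;

                    if (pto < delack)
                            pto = delack;
            }
            return pto < rto_ms ? pto : rto_ms;
    }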

    In addition, the patch includes a small variation to the Early Retransmit
    (ER) algorithm, such that ER and TLP together can in principle recover any
    N-degree of tail loss through fast recovery. TLP is controlled by the same
    sysctl as ER, tcp_early_retrans:
    tcp_early_retrans==0; disables TLP and ER.
    ==1; enables RFC5827 ER.
    ==2; delayed ER.
    ==3; TLP and delayed ER. [DEFAULT]
    ==4; TLP only.

    The TLP patch series has been extensively tested on Google Web servers.
    It is most effective for short Web transactions, where it reduced RTOs by 15%
    and improved HTTP response time (average by 6%, 99th percentile by 10%).
    The transmitted probes account for <0.5% of the overall transmissions.
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Nandita Dukkipati
     

14 Feb, 2013

1 commit

  • This functionality is used for restoring tcp sockets. A tcp timestamp
    depends on how long a system has been running, so it differs for each
    host. The solution is to set a per-socket offset.

    A per-socket offset for a TIME_WAIT socket is inherited from a proper
    tcp socket.

    tcp_request_sock doesn't have a timestamp offset, because repair
    mode for it is not implemented.

    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Cc: Eric Dumazet
    Cc: Pavel Emelyanov
    Signed-off-by: Andrey Vagin
    Signed-off-by: David S. Miller

    Andrey Vagin
     

06 Feb, 2013

1 commit

  • TCP Appropriate Byte Count was added by me, but later disabled.
    There is no point in maintaining it since it is a potential source
    of bugs and Linux already implements other better window protection
    heuristics.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     

09 Dec, 2012

1 commit

  • This patch adds support in the kernel for offloading in the NIC Tx and Rx
    checksumming for encapsulated packets (such as VXLAN and IP GRE).

    For Tx encapsulation offload, the driver will need to set the right bits
    in netdev->hw_enc_features. The protocol driver will have to set the
    skb->encapsulation bit and populate the inner headers, so the NIC driver will
    use those inner headers to calculate the csum in hardware.

    For Rx encapsulation offload, the driver will need to set again the
    skb->encapsulation flag and skb->ip_summed to CHECKSUM_UNNECESSARY.
    In that case the protocol driver should push the decapsulated packet up
    to the stack, again with CHECKSUM_UNNECESSARY. In either case, the
    protocol driver should set the skb->encapsulation flag back to zero.
    Finally, the protocol driver should have the NETIF_F_RXCSUM flag set in
    its features.
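
    A hedged kernel-style sketch of the Rx contract described above (an
    illustrative fragment, not from any actual driver):

    #include <linux/skbuff.h>

    /* On a decapsulation-capable NIC's Rx path: tell the stack the inner
     * headers are populated and the checksum was already verified by HW. */
    static void rx_mark_encap_csum_ok(struct sk_buff *skb)
    {
            skb->encapsulation = 1;
            skb->ip_summed = CHECKSUM_UNNECESSARY;
    }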

    Signed-off-by: Joseph Gasparakis
    Signed-off-by: Peter P Waskiewicz Jr
    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Joseph Gasparakis
     

23 Oct, 2012

1 commit

  • Add a bit, TCPI_OPT_SYN_DATA (32), to the socket option TCP_INFO:tcpi_options.
    It's set if the data in the SYN (sent or received) is acked by the SYN-ACK.
    A server or client application can use this information to check Fast
    Open success rates.
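
    A small sketch of how an application might test the bit (assumes
    headers from a kernel with this patch; error handling trimmed):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <linux/tcp.h>  /* struct tcp_info, TCP_INFO, TCPI_OPT_SYN_DATA */

    static void report_fastopen_result(int fd)
    {
            struct tcp_info info;
            socklen_t len = sizeof(info);

            if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0)
                    printf("SYN data %s acked by SYN-ACK\n",
                           (info.tcpi_options & TCPI_OPT_SYN_DATA) ?
                           "was" : "was not");
    }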

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

01 Sep, 2012

1 commit

  • This patch adds all the necessary data structures and support
    functions to implement the TFO server side. It also documents a number
    of flags for the sysctl_tcp_fastopen knob, and adds a few Linux
    extension MIBs.

    In addition, it includes the following:

    1. a new TCP_FASTOPEN socket option that an application must call to
    supply the max backlog allowed, in order to enable TFO on its listener
    (see the sketch after this list).

    2. A number of key data structures:
    "fastopen_rsk" in tcp_sock - for a big socket to access its
    request_sock for retransmission and ACK processing purposes. It is
    non-NULL iff the 3WHS has not completed.

    "fastopenq" in request_sock_queue - points to a per Fast Open
    listener data structure "fastopen_queue" to keep track of qlen (# of
    outstanding Fast Open requests) and max_qlen, among other things.

    "listener" in tcp_request_sock - to point to the original listener
    for book-keeping purpose, i.e., to maintain qlen against max_qlen
    as part of defense against IP spoofing attack.

    3. various data structures and functions, many in tcp_fastopen.c, to
    support server side Fast Open cookie operations, including
    /proc/sys/net/ipv4/tcp_fastopen_key to allow manual rekeying.
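
    A hedged sketch of item 1 above, enabling TFO on a listening socket
    (the queue length is illustrative; assumes headers defining TCP_FASTOPEN):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    static int enable_tfo(int listen_fd)
    {
            int qlen = 16;  /* max outstanding Fast Open requests */

            return setsockopt(listen_fd, IPPROTO_TCP, TCP_FASTOPEN,
                              &qlen, sizeof(qlen));
    }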

    Signed-off-by: H.K. Jerry Chu
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Eric Dumazet
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Jerry Chu
     

23 Jul, 2012

1 commit

  • ICMP messages generated in the output path when the frame length is
    bigger than the MTU are actually lost because the socket is owned by
    the user (doing the xmit).

    One example is the ipgre_tunnel_xmit() calling
    icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));

    We had a similar case fixed in commit a34a101e1e6 (ipv6: disable GSO on
    sockets hitting dst_allfrag).

    The problem with such a fix is that it relied on retransmit timers, so
    short tcp sessions paid too big a latency price.

    This patch uses the tcp_release_cb() infrastructure so that MTU
    reduction messages (ICMP messages) are not lost, and no extra delay
    is added in TCP transmits.

    Reported-by: Maciej Żenczykowski
    Diagnosed-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Cc: Nandita Dukkipati
    Cc: Tom Herbert
    Cc: Tore Anderson
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Jul, 2012

1 commit

  • The modern TCP stack highly depends on tcp_write_timer() having small
    latency, but the current implementation doesn't exactly meet that
    expectation.

    When a timer fires but finds the socket is owned by the user, it rearms
    itself for an additional delay hoping next run will be more
    successful.

    tcp_write_timer(), for example, uses a 50ms delay for the next try,
    which defeats many attempts to get predictable TCP behavior in terms
    of latency.

    Use the recently introduced tcp_release_cb(), so that the user owning
    the socket will call various handlers right before socket release.

    This will permit us to post a followup patch to address the
    tcp_tso_should_defer() syndrome (some deferred packets have to wait
    for the RTO timer to be transmitted, while cwnd should allow us to
    send them sooner).

    Signed-off-by: Eric Dumazet
    Cc: Tom Herbert
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Nandita Dukkipati
    Cc: H.K. Jerry Chu
    Cc: John Heffner
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Jul, 2012

3 commits

  • In trusted networks, e.g., intranet, data-center, the client does not
    need to use a Fast Open cookie to mitigate DoS attacks. In cookie-less
    mode, sendmsg() with the MSG_FASTOPEN flag will send SYN-data regardless
    of cookie availability.

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch implements sending SYN-data in tcp_connect(). The data is
    from tcp_sendmsg() with flag MSG_FASTOPEN (implemented in a later patch).

    The length of the cookie in tcp_fastopen_req, init'd to 0, controls the
    type of the SYN. If the cookie is not cached (len==0), the host sends a
    data-less SYN with the Fast Open cookie request option to solicit a
    cookie from the remote. If the cookie is available (len > 0), the host
    sends a SYN-data with the Fast Open cookie option. If the cookie length
    is negative, the SYN will not include any Fast Open option (for fallback
    operation).

    To deal with middleboxes that may drop SYN with data or experimental TCP
    option, the SYN-data is only sent once. SYN retransmits do not include
    data or Fast Open options. The connection will fall back to regular TCP
    handshake.
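
    A hedged client-side sketch of the flow above: sendto() with
    MSG_FASTOPEN replaces connect(), and the kernel picks between a
    cookie-request SYN and SYN-data based on the cached cookie:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    #ifndef MSG_FASTOPEN
    #define MSG_FASTOPEN 0x20000000        /* from linux/socket.h */
    #endif

    static ssize_t tfo_send(int fd, const void *buf, size_t len,
                            const struct sockaddr *addr, socklen_t alen)
    {
            /* Implicitly connects; data may ride on the SYN. */
            return sendto(fd, buf, len, MSG_FASTOPEN, addr, alen);
    }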

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch implements the common code for both the client and server.

    1. TCP Fast Open option processing. Since Fast Open does not have an
    option number assigned by IANA yet, it shares the experimental option
    code 254 by implementing draft-ietf-tcpm-experimental-options
    with a 16-bit magic number 0xF989. This enables global experiments
    without clashing with the scarce (2) experimental option codes
    available for TCP (see the byte-level sketch after this list).

    When the draft status becomes standard (maybe), the client should
    switch to the newly assigned option number while the server supports
    both numbers for the transition.

    2. The new sysctl tcp_fastopen

    3. A placeholder init function
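
    A byte-level sketch of the experimental-option encoding from item 1
    (the 8-byte cookie length is illustrative; TFO cookies may be 4-16 bytes):

    /* kind = 254 (experimental), total length, 16-bit magic, cookie */
    static const unsigned char tfo_exp_option[12] = {
            254,            /* experimental option code           */
            12,             /* kind + len + magic + 8-byte cookie */
            0xF9, 0x89,     /* Fast Open magic number             */
            /* remaining 8 bytes: the cookie (zeroed placeholder) */
    };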

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

12 Jul, 2012

1 commit

  • This introduces TSQ (TCP Small Queues).

    TSQ's goal is to reduce the number of TCP packets in xmit queues (qdisc &
    device queues), to reduce RTT and cwnd bias, part of the bufferbloat
    problem.

    sk->sk_wmem_alloc is not allowed to grow above a given limit,
    allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
    given time.

    TSO packets are sized/capped to half the limit, so that we have two
    TSO packets in flight, allowing better bandwidth use.

    As a side effect, setting the limit to 40000 automatically reduces the
    standard gso max limit (65536) to 40000/2: it can help to reduce
    latencies of high-prio packets, by having smaller TSO packets.

    This means we divert sock_wfree() to a tcp_wfree() handler, to
    queue/send following frames when skb_orphan() [2] is called for the
    already queued skbs.

    Results on my dev machines (tg3/ixgbe nics) are really impressive,
    using standard pfifo_fast, and with or without TSO/GSO.

    Without reduction of nominal bandwidth, we have a reduction of buffering
    per bulk sender:
    < 1ms on Gbit (instead of 50ms with TSO)
    < 8ms on 100Mbit (instead of 132 ms)

    I no longer have 4 MBytes backlogged in qdisc by a single netperf
    session, and socket autotuning on both sides no longer uses 4 MBytes.

    As the skb destructor cannot restart xmit itself (the qdisc lock might
    be taken at this point), we delegate the work to a tasklet. We use one
    tasklet per cpu for performance reasons.

    If the tasklet finds a socket owned by the user, it sets the TSQ_OWNED
    flag.
    This flag is tested in a new protocol method called from release_sock(),
    to eventually send new segments.

    [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
    [2] skb_orphan() is usually called at TX completion time,
    but some drivers call it in their start_xmit() handler.
    These drivers should at least use BQL, or else a single TCP
    session can still fill the whole NIC TX ring, since TSQ will
    have no effect.
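
    A simplified, hedged sketch of the core check (illustrative only; the
    real patch also sets a throttled flag so tcp_wfree() can reschedule
    transmission via the per-cpu tasklet):

    #include <net/sock.h>

    /* skbs handed to qdisc/device layers still count in sk_wmem_alloc
     * until TX completion frees them, so this bounds per-socket queuing
     * below the stack. */
    static bool tsq_should_defer(const struct sock *sk, unsigned int limit)
    {
            return atomic_read(&sk->sk_wmem_alloc) > limit;
    }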

    Signed-off-by: Eric Dumazet
    Cc: Dave Taht
    Cc: Tom Herbert
    Cc: Matt Mathis
    Cc: Yuchung Cheng
    Cc: Nandita Dukkipati
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Jun, 2012

1 commit

  • Conflicts:
    MAINTAINERS
    drivers/net/wireless/iwlwifi/pcie/trans.c

    The iwlwifi conflict was resolved by keeping the code added
    in 'net' that turns off the buggy chip feature.

    The MAINTAINERS conflict was merely overlapping changes: one
    change updated all the wireless web site URLs and the other
    changed some GIT trees to be Johannes's instead of John's.

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Jun, 2012

2 commits

  • I originally sent this patch to the trivial tree, but Jiri Kosina did
    not feel that it is fully appropriate there.

    Using linux/tcp.h from C++ results in:

    cat t.cc
    #include <linux/tcp.h>
    int main() { }

    g++ -c t.cc

    In file included from t.cc:1:
    /usr/include/linux/tcp.h:72: error: '__u32 __fswab32(__u32)' cannot appear in a constant-expression
    /usr/include/linux/tcp.h:72: error: a function call cannot appear in a constant-expression
    ...

    Attached trivial patch fixes this problem.

    Tested:
    - the t.cc above compiles with g++ and
    - the following program generates the same output before/after
    the patch:

    #include <linux/tcp.h>
    #include <stdio.h>

    int main ()
    {
    #define P(a) printf("%s: %08x\n", #a, (int)a)
            P(TCP_FLAG_CWR);
            P(TCP_FLAG_ECE);
            P(TCP_FLAG_URG);
            P(TCP_FLAG_ACK);
            P(TCP_FLAG_PSH);
            P(TCP_FLAG_RST);
            P(TCP_FLAG_SYN);
            P(TCP_FLAG_FIN);
            P(TCP_RESERVED_BITS);
            P(TCP_DATA_OFFSET);
    #undef P
            return 0;
    }

    Signed-off-by: Paul Pluzhnikov
    Signed-off-by: David S. Miller

    Paul Pluzhnikov
     
  • Since it's guaranteed that we will access the inetpeer if we're trying
    to do timewait recycling and TCP options were enabled on the
    connection, just cache the peer in the timewait socket.

    In the future, inetpeer lookups will be context dependent (per routing
    realm), and this helps facilitate that as well.

    Signed-off-by: David S. Miller

    David S. Miller
     

23 May, 2012

1 commit

  • Pull trivial updates from Jiri Kosina:
    "As usual, it's mostly typo fixes, redundant code elimination and some
    documentation updates."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (57 commits)
    edac, mips: don't change code that has been removed in edac/mips tree
    xtensa: Change mail addresses of Hannes Weiner and Oskar Schirmer
    lib: Change mail address of Oskar Schirmer
    net: Change mail address of Oskar Schirmer
    arm/m68k: Change mail address of Sebastian Hess
    i2c: Change mail address of Oskar Schirmer
    net: Fix tcp_build_and_update_options comment in struct tcp_sock
    atomic64_32.h: fix parameter naming mismatch
    Kconfig: replace "--- help ---" with "---help---"
    c2port: fix bogus Kconfig "default no"
    edac: Fix spelling errors.
    qla1280: Remove redundant NULL check before release_firmware() call
    remoteproc: remove redundant NULL check before release_firmware()
    qla2xxx: Remove redundant NULL check before release_firmware() call.
    aic94xx: Get rid of redundant NULL check before release_firmware() call
    tehuti: delete redundant NULL check before release_firmware()
    qlogic: get rid of a redundant test for NULL before call to release_firmware()
    bna: remove redundant NULL test before release_firmware()
    tg3: remove redundant NULL test before release_firmware() call
    typhoon: get rid of redundant conditional before all to release_firmware()
    ...

    Linus Torvalds
     

03 May, 2012

2 commits

  • This implements the advanced early retransmit (sysctl_tcp_early_retrans==2),
    which delays the fast retransmit by an interval of RTT/4. We borrow the
    RTO timer to implement the delay. If we receive another ACK or send
    a new packet, the timer is cancelled and restored to the original RTO
    value, offset by the time elapsed. When the delayed-ER timer fires,
    we enter fast recovery and perform fast retransmit.

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch implements RFC 5827 early retransmit (ER) for TCP.
    It reduces the DUPACK threshold (dupthresh) when fewer than 4 packets
    are outstanding, to recover losses by fast recovery instead of timeout.

    While the algorithm is simple, small but frequent network reordering
    makes this feature dangerous: the connection repeatedly enters
    false recovery and degrades performance. Therefore we implement
    a mitigation suggested in the appendix of the RFC that delays
    entering fast recovery by a small interval, i.e., RTT/4. Currently
    ER is conservative and is disabled for the rest of the connection
    after the first reordering event. A large-scale web server
    experiment on the performance impact of ER is summarized in
    section 6 of the paper "Proportional Rate Reduction for TCP",
    IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf

    Note that Linux has a similar feature called THIN_DUPACK. The
    differences are that THIN_DUPACK does not mitigate reordering and is
    only used after slow start. Currently ER is disabled if THIN_DUPACK is
    enabled. I would be happy to merge the THIN_DUPACK feature with ER if
    people think it's a good idea.

    ER is enabled by sysctl_tcp_early_retrans:
    0: Disables ER

    1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4.

    2: (Default) reduce dupthresh like mode 1. In addition, delay
    entering fast recovery by RTT/4.

    Note: mode 2 is implemented in the third part of this patch series.
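
    A small sketch of the mode-1 dupthresh reduction (names illustrative;
    per RFC 5827 the threshold drops to outstanding-1 when fewer than 4
    segments are outstanding):

    static unsigned int early_retrans_dupthresh(unsigned int packets_out,
                                                unsigned int dupthresh)
    {
            if (packets_out < 4)
                    return packets_out > 1 ? packets_out - 1 : 1;
            return dupthresh;       /* normally 3 duplicate ACKs */
    }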

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

26 Apr, 2012

1 commit

  • Don't pick __u8/__u16 values directly from raw pointers, but instead use
    an array of structures of code:value pairs. This is OK, since the buffer
    we take options from is not skb memory, but a user-to-kernel one.

    For those options which don't require any value now, require this to be
    zero (for potential future extension of this API).

    v2: Changed tcp_repair_opt to use two __u32-s as spotted by David Laight.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

22 Apr, 2012

2 commits

  • There are options which are set up on a socket while performing the
    TCP handshake. They need to be resurrected on a socket while repairing.
    A new sockoption accepts a buffer and parses it. The buffer should
    be a CODE:VALUE sequence of bytes, where CODE is a standard option
    code and VALUE is the respective value.

    Only 4 options should be handled on repaired socket.

    To read 3 out of 4 of these options the TCP_INFO sockoption can be
    used. An ability to get the last one (the mss_clamp) was added by
    the previous patch.

    Now the restore. Three of these options -- timestamp_ok, mss_clamp
    and snd_wscale -- are just restored on a socket.

    The sack_ok flag has 2 issues. First, whether or not to do sacks
    at all. This flag is just read and set back. No other sack info is
    saved or restored, since according to the standard and the code,
    dropping all sack-ed segments is OK; the sender will resubmit them
    again, so after the repair we will probably experience a pause in
    the connection. Next, the fack bit. It's just set back on a socket if
    the respective sysctl is set. No collected stats about packet flow
    are preserved. As far as I see (please correct me if I'm wrong) the
    fack-based congestion algorithm survives dropping all of the stats
    and repairs itself eventually, probably losing performance for
    that period.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • This includes (according to the previous description):

    * TCP_REPAIR sockoption

    This one just puts the socket in/out of the repair mode.
    Allowed for CAP_NET_ADMIN and for closed/established sockets only.
    When repair mode is turned off and the socket happens to be in
    the established state, a window probe is sent to the peer to
    'unlock' the connection.

    * TCP_REPAIR_QUEUE sockoption

    This one sets the queue which we're about to repair. The
    'no-queue' mode is set by default.

    * TCP_QUEUE_SEQ sockoption

    Sets the write_seq/rcv_nxt of a selected repaired queue.
    Allowed for TCP_CLOSE-d sockets only. When the socket changes
    its state the other seq-s are changed by the kernel according
    to the protocol rules (most of the existing code is actually
    reused).

    * Ability to forcibly bind a socket to a port

    The sk->sk_reuse is set to SK_FORCE_REUSE.

    * Immediate connect modification

    The connect syscall initializes the connection, then directly jumps
    to the code which finalizes it.

    * Silent close modification

    The close just aborts the connection (similar to SO_LINGER with 0
    time) but without sending any FIN/RST-s to peer.
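
    A hedged sketch of the restore sequence built from these sockoptions
    (constants from linux/tcp.h on kernels with this series; error handling
    and the other restore steps are omitted):

    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <linux/tcp.h>  /* TCP_REPAIR, TCP_REPAIR_QUEUE, TCP_QUEUE_SEQ */

    static void restore_sequences(int fd, unsigned int snd, unsigned int rcv)
    {
            int on = 1, off = 0, q;

            setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on));

            q = TCP_SEND_QUEUE;     /* select the queue to repair */
            setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q));
            setsockopt(fd, IPPROTO_TCP, TCP_QUEUE_SEQ, &snd, sizeof(snd));

            q = TCP_RECV_QUEUE;
            setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q));
            setsockopt(fd, IPPROTO_TCP, TCP_QUEUE_SEQ, &rcv, sizeof(rcv));

            /* connect() would now jump straight to the finalizing code;
             * turning repair off on an established socket sends the
             * window probe that 'unlocks' the connection. */
            setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &off, sizeof(off));
    }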

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

29 Feb, 2012

1 commit

  • There was an off-by-one error in the comments describing the
    highest_sack field in struct tcp_sock. The comments previously claimed
    that it was the "start sequence of the highest skb with SACKed
    bit". This commit fixes the comments to note that it is the "start
    sequence of the skb just *after* the highest skb with SACKed bit".

    Signed-off-by: Neal Cardwell
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Neal Cardwell
     

01 Feb, 2012

2 commits

  • This patch makes sure we use appropriate memory barriers before
    publishing tp->md5sig_info, allowing tcp_md5_do_lookup() to be used from
    tcp_v4_send_reset() without holding the socket lock (upcoming patch from
    Shawn Lu).

    Note we also need to respect an rcu grace period before its freeing,
    since we can free the socket without this grace period thanks to
    SLAB_DESTROY_BY_RCU.

    Signed-off-by: Eric Dumazet
    Cc: Shawn Lu
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • In order to be able to support proper RST messages for TCP MD5 flows, we
    need to allow access to MD5 keys without locking the listener socket.

    This conversion is a nice cleanup, and shrinks size of timewait sockets
    by 80 bytes.

    IPv6 code reuses generic code found in IPv4 instead of duplicating it.

    Control path uses GFP_KERNEL allocations instead of GFP_ATOMIC.

    Signed-off-by: Eric Dumazet
    Cc: Shawn Lu
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Oct, 2011

1 commit

  • Allows the ss command (iproute2) to display "ecnseen" if at least one
    packet with ECT(0), ECT(1) or CE was received by this socket.

    "ecn" means ECN was negotiated at session establishment (TCP level)

    "ecnseen" means we received at least one packet with ECT fields set (IP
    level)

    ss -i
    ...
    ESTAB 0 0 192.168.20.110:22 192.168.20.144:38016
    ino:5950 sk:f178e400
    mem:(r0,w0,f0,t0) ts sack ecn ecnseen bic wscale:7,8 rto:210
    rtt:12.5/7.5 cwnd:10 send 9.3Mbps rcv_space:14480

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

25 Aug, 2011

1 commit

  • This patch implements Proportional Rate Reduction (PRR) for TCP.
    PRR is an algorithm that determines TCP's sending rate in fast
    recovery. PRR avoids excessive window reductions and aims for
    the actual congestion window size at the end of recovery to be as
    close as possible to the window determined by the congestion control
    algorithm. PRR also improves accuracy of the amount of data sent
    during loss recovery.

    The patch implements the recommended flavor of PRR called PRR-SSRB
    (Proportional rate reduction with slow start reduction bound) and
    replaces the existing rate halving algorithm. PRR improves upon the
    existing Linux fast recovery under a number of conditions including:
    1) burst losses where the losses implicitly reduce the amount of
    outstanding data (pipe) below the ssthresh value selected by the
    congestion control algorithm and,
    2) losses near the end of short flows where application runs out of
    data to send.

    As an example, with the existing rate halving implementation a single
    loss event can cause a connection carrying short Web transactions to
    go into the slow start mode after the recovery. This is because during
    recovery Linux pulls the congestion window down to packets_in_flight+1
    on every ACK. A short Web response often runs out of new data to send
    and its pipe reduces to zero by the end of recovery when all its packets
    are drained from the network. Subsequent HTTP responses using the same
    connection will have to slow start to raise cwnd to ssthresh. PRR on
    the other hand aims for the cwnd to be as close as possible to ssthresh
    by the end of recovery.
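
    A hedged sketch of the per-ACK PRR-SSRB arithmetic, simplified from
    the draft's pseudocode (all names illustrative; RecoverFS is the
    packets in flight when recovery started):

    static int prr_sndcnt(int pipe, int ssthresh, int recover_fs,
                          int prr_delivered, int prr_out, int delivered_now)
    {
            int sndcnt;

            if (pipe > ssthresh) {
                    /* Proportional part: release about ssthresh/RecoverFS
                     * segments for every segment newly delivered. */
                    sndcnt = (prr_delivered * ssthresh + recover_fs - 1) /
                             recover_fs - prr_out;
            } else {
                    /* Slow start reduction bound: grow pipe toward
                     * ssthresh by at most one extra segment per ACK. */
                    int limit = prr_delivered - prr_out > delivered_now ?
                                prr_delivered - prr_out : delivered_now;

                    limit += 1;
                    sndcnt = ssthresh - pipe < limit ?
                             ssthresh - pipe : limit;
            }
            return sndcnt > 0 ? sndcnt : 0;
    }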

    A description of PRR and a discussion of its performance can be found at
    the following links:
    - IETF Draft:
    http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01
    - IETF Slides:
    http://www.ietf.org/proceedings/80/slides/tcpm-6.pdf
    http://tools.ietf.org/agenda/81/slides/tcpm-2.pdf
    - Paper to appear in Internet Measurements Conference (IMC) 2011:
    Improving TCP Loss Recovery
    Nandita Dukkipati, Matt Mathis, Yuchung Cheng

    Signed-off-by: Nandita Dukkipati
    Signed-off-by: David S. Miller

    Nandita Dukkipati