23 Sep, 2012

2 commits

  • When taking SYNACK RTT samples for servers using TCP Fast Open, fix
    the code to ensure that we only call tcp_valid_rtt_meas() after we
    receive the ACK that completes the 3-way handshake.

    Previously we were always taking an RTT sample in
    tcp_v4_syn_recv_sock(). However, for TCP Fast Open connections
    tcp_v4_conn_req_fastopen() calls tcp_v4_syn_recv_sock() at the time we
    receive the SYN. So for TFO we must wait until tcp_rcv_state_process()
    to take the RTT sample.

    To fix this, we wait until after TFO calls tcp_v4_syn_recv_sock()
    before we set the snt_synack timestamp, since tcp_synack_rtt_meas()
    already ensures that we only take a SYNACK RTT sample if snt_synack is
    non-zero. To be careful, we only take a snt_synack timestamp when
    a SYNACK transmit or retransmit succeeds.
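
    Below is a minimal standalone sketch of that rule (illustrative
    names and types, not the kernel's code): the timestamp is recorded
    only on a successful (re)transmit, and the sample is taken only
    when the stamp is non-zero.

        #include <stdbool.h>
        #include <stdint.h>

        struct synack_info { uint32_t snt_synack; /* 0 == no valid stamp */ };

        /* Stamp the request only when the SYNACK (re)transmit succeeded. */
        static void mark_synack_sent(struct synack_info *si, bool xmit_ok,
                                     uint32_t now)
        {
                if (xmit_ok)
                        si->snt_synack = now;
        }

        /* Sample the RTT only once the handshake-completing ACK arrives,
         * and only if a valid timestamp exists. */
        static bool synack_rtt(const struct synack_info *si, uint32_t now,
                               uint32_t *rtt)
        {
                if (!si->snt_synack)
                        return false;
                *rtt = now - si->snt_synack;
                return true;
        }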

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • In preparation for adding another spot where we compute the SYNACK
    RTT, extract this code so that it can be shared.

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     

04 Sep, 2012

1 commit

  • Use proportional rate reduction (PRR) algorithm to reduce cwnd in CWR state,
    in addition to Recovery state. Retire the current rate-halving in CWR.
    When losses are detected via ACKs in CWR state, the sender enters Recovery
    state but the cwnd reduction continues and does not restart.

    Rename and refactor the cwnd reduction functions, since both CWR and
    Recovery use the same algorithm:
    tcp_init_cwnd_reduction() is new and initializes the reduction state
    variables.
    tcp_cwnd_reduction() was previously tcp_update_cwnd_in_recovery().
    tcp_end_cwnd_reduction() was previously tcp_complete_cwr().

    The rate halving functions and logic such as tcp_cwnd_down(), tcp_min_cwnd(),
    and the cwnd moderation inside tcp_enter_cwr() are removed. The unused
    parameter, flag, in tcp_cwnd_reduction() is also removed.
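
    For reference, a standalone sketch of the proportional rate
    reduction step that CWR and Recovery now share, following RFC 6937
    (illustrative names, not the kernel's exact code):

        #include <stdint.h>

        /* Packets the sender may transmit on this ACK, so that cwnd
         * shrinks toward ssthresh in proportion to delivered data. */
        static uint32_t prr_sndcnt(uint64_t prr_delivered, uint32_t ssthresh,
                                   uint32_t prior_cwnd, uint32_t prr_out)
        {
                /* ceil(prr_delivered * ssthresh / prior_cwnd) - prr_out */
                uint64_t budget =
                        (prr_delivered * ssthresh + prior_cwnd - 1) / prior_cwnd;

                return budget > prr_out ? (uint32_t)(budget - prr_out) : 0;
        }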

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

01 Sep, 2012

3 commits

  • This patch builds on top of the previous patch to add support
    for TFO listeners. This includes:

    1. allocating, properly initializing, and managing the per listener
    fastopen_queue structure when TFO is enabled

    2. changes to the inet_csk_accept code to support TFO. E.g., the
    request_sock can no longer be freed upon accept(), not until 3WHS
    finishes

    3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
    if it's a TFO socket

    4. properly closing a TFO listener, and a TFO socket before 3WHS
    finishes

    5. supporting TCP_FASTOPEN socket option

    6. modifying tcp_check_req() to check a TFO socket as well as a
    request_sock

    7. supporting TCP's TFO cookie option

    8. adding a new SYN-ACK retransmit handler to use the timer directly
    off the TFO socket rather than the listener socket. Note that TFO
    server side will not retransmit anything other than SYN-ACK until
    the 3WHS is completed.

    The patch also contains an important function
    "reqsk_fastopen_remove()" to manage the somewhat complex relation
    between a listener, its request_sock, and the corresponding child
    socket. See the comment above the function for details.

    Signed-off-by: H.K. Jerry Chu
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Eric Dumazet
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Jerry Chu
     
  • This patch adds all the necessary data structure and support
    functions to implement TFO server side. It also documents a number
    of flags for the sysctl_tcp_fastopen knob, and adds a few Linux
    extension MIBs.

    In addition, it includes the following:

    1. a new TCP_FASTOPEN socket option that an application must set,
    supplying the maximum backlog allowed, in order to enable TFO on its
    listener.

    2. A number of key data structures:
    "fastopen_rsk" in tcp_sock - for a big socket to access its
    request_sock for retransmission and ack processing purpose. It is
    non-NULL iff 3WHS not completed.

    "fastopenq" in request_sock_queue - points to a per Fast Open
    listener data structure "fastopen_queue" to keep track of qlen (# of
    outstanding Fast Open requests) and max_qlen, among other things.

    "listener" in tcp_request_sock - to point to the original listener
    for book-keeping purpose, i.e., to maintain qlen against max_qlen
    as part of defense against IP spoofing attack.

    3. various data structures and functions, many in tcp_fastopen.c, to
    support server-side Fast Open cookie operations, including
    /proc/sys/net/ipv4/tcp_fastopen_key to allow manual rekeying.
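
    For orientation, a simplified sketch of how these pieces might
    relate (field names follow the description above; the real kernel
    definitions carry more fields):

        /* Per-listener Fast Open state, reached via request_sock_queue. */
        struct fastopen_queue_sketch {
                struct request_sock_sketch *rskq_rst_head; /* pending resets */
                int qlen;     /* # of outstanding (3WHS incomplete) requests */
                int max_qlen; /* cap supplied via the TCP_FASTOPEN sockopt */
        };

        /* Each TFO request keeps a backpointer to its listener so qlen
         * can be maintained against max_qlen (anti-spoofing defense). */
        struct request_sock_sketch {
                struct sock_sketch *listener;
        };

        /* The child tcp_sock keeps fastopen_rsk: non-NULL iff the 3WHS
         * has not completed yet. */
        struct tcp_sock_sketch {
                struct request_sock_sketch *fastopen_rsk;
        };

        struct sock_sketch {
                struct fastopen_queue_sketch *fastopenq;
        };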

    Signed-off-by: H.K. Jerry Chu
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Eric Dumazet
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Jerry Chu
     
  • Commit 9ad7c049 ("tcp: RFC2988bis + taking RTT sample from 3WHS for
    the passive open side") changed the initRTO from 3secs to 1sec in
    accordance to RFC6298 (former RFC2988bis). This reduced the time till
    the last SYN retransmission packet gets sent from 93secs to 31secs.

    RFC1122 states that the retransmission should be done for at least 3
    minutes, but this seems to be quite high.

    "However, the values of R1 and R2 may be different for SYN
    and data segments. In particular, R2 for a SYN segment MUST
    be set large enough to provide retransmission of the segment
    for at least 3 minutes. The application can close the
    connection (i.e., give up on the open attempt) sooner, of
    course."

    This patch increases the value of TCP_SYN_RETRIES to 6, providing a
    retransmission window of 63secs (with a 1sec initRTO and exponential
    backoff: 1 + 2 + 4 + 8 + 16 + 32 = 63).

    The comments for SYN and SYNACK retries have also been updated to
    describe the current settings. The same goes for the documentation file
    "Documentation/networking/ip-sysctl.txt".

    Signed-off-by: Alexander Bergmann
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alex Bergmann
     

25 Aug, 2012

1 commit


15 Aug, 2012

1 commit


10 Aug, 2012

1 commit

  • commit 5d299f3d3c8a2fb (net: ipv6: fix TCP early demux) added a
    regression for the ipv6_mapped case.

    [ 67.422369] SELinux: initialized (dev autofs, type autofs), uses genfs_contexts
    [ 67.449678] SELinux: initialized (dev autofs, type autofs), uses genfs_contexts
    [ 92.631060] BUG: unable to handle kernel NULL pointer dereference at (null)
    [ 92.631435] IP: [< (null)>] (null)
    [ 92.631645] PGD 0
    [ 92.631846] Oops: 0010 [#1] SMP
    [ 92.632095] Modules linked in: autofs4 sunrpc ipv6 dm_mirror dm_region_hash dm_log dm_multipath dm_mod video sbs sbshc battery ac lp parport sg snd_hda_intel snd_hda_codec snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device pcspkr snd_pcm_oss snd_mixer_oss snd_pcm snd_timer serio_raw button floppy snd i2c_i801 i2c_core soundcore snd_page_alloc shpchp ide_cd_mod cdrom microcode ehci_hcd ohci_hcd uhci_hcd
    [ 92.634294] CPU 0
    [ 92.634294] Pid: 4469, comm: sendmail Not tainted 3.6.0-rc1 #3
    [ 92.634294] RIP: 0010:[] [< (null)>] (null)
    [ 92.634294] RSP: 0018:ffff880245fc7cb0 EFLAGS: 00010282
    [ 92.634294] RAX: ffffffffa01985f0 RBX: ffff88024827ad00 RCX: 0000000000000000
    [ 92.634294] RDX: 0000000000000218 RSI: ffff880254735380 RDI: ffff88024827ad00
    [ 92.634294] RBP: ffff880245fc7cc8 R08: 0000000000000001 R09: 0000000000000000
    [ 92.634294] R10: 0000000000000000 R11: ffff880245fc7bf8 R12: ffff880254735380
    [ 92.634294] R13: ffff880254735380 R14: 0000000000000000 R15: 7fffffffffff0218
    [ 92.634294] FS: 00007f4516ccd6f0(0000) GS:ffff880256600000(0000) knlGS:0000000000000000
    [ 92.634294] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 92.634294] CR2: 0000000000000000 CR3: 0000000245ed1000 CR4: 00000000000007f0
    [ 92.634294] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 92.634294] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [ 92.634294] Process sendmail (pid: 4469, threadinfo ffff880245fc6000, task ffff880254b8cac0)
    [ 92.634294] Stack:
    [ 92.634294] ffffffff813837a7 ffff88024827ad00 ffff880254b6b0e8 ffff880245fc7d68
    [ 92.634294] ffffffff81385083 00000000001d2680 ffff8802547353a8 ffff880245fc7d18
    [ 92.634294] ffffffff8105903a ffff88024827ad60 0000000000000002 00000000000000ff
    [ 92.634294] Call Trace:
    [ 92.634294] [] ? tcp_finish_connect+0x2c/0xfa
    [ 92.634294] [] tcp_rcv_state_process+0x2b6/0x9c6
    [ 92.634294] [] ? sched_clock_cpu+0xc3/0xd1
    [ 92.634294] [] ? local_clock+0x2b/0x3c
    [ 92.634294] [] tcp_v4_do_rcv+0x63a/0x670
    [ 92.634294] [] release_sock+0x128/0x1bd
    [ 92.634294] [] __inet_stream_connect+0x1b1/0x352
    [ 92.634294] [] ? lock_sock_nested+0x74/0x7f
    [ 92.634294] [] ? wake_up_bit+0x25/0x25
    [ 92.634294] [] ? lock_sock_nested+0x74/0x7f
    [ 92.634294] [] ? inet_stream_connect+0x22/0x4b
    [ 92.634294] [] inet_stream_connect+0x33/0x4b
    [ 92.634294] [] sys_connect+0x78/0x9e
    [ 92.634294] [] ? sysret_check+0x1b/0x56
    [ 92.634294] [] ? __audit_syscall_entry+0x195/0x1c8
    [ 92.634294] [] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [ 92.634294] [] system_call_fastpath+0x16/0x1b
    [ 92.634294] Code: Bad RIP value.
    [ 92.634294] RIP [< (null)>] (null)
    [ 92.634294] RSP
    [ 92.634294] CR2: 0000000000000000
    [ 92.648982] ---[ end trace 24e2bed94314c8d9 ]---
    [ 92.649146] Kernel panic - not syncing: Fatal exception in interrupt

    Fix this using inet_sk_rx_dst_set(), and export this function in case
    IPv6 is modular.
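
    For illustration, a userspace-compilable sketch of what such a
    helper does (stand-in types; the real function operates on struct
    sock and struct sk_buff):

        /* Stand-in types; the kernel versions live in include/net/. */
        struct dst_entry { int refcnt; };
        struct sk_buff { struct dst_entry *dst; int skb_iif; };
        struct sock_sketch { struct dst_entry *sk_rx_dst; int rx_dst_ifindex; };

        static void dst_hold(struct dst_entry *dst) { dst->refcnt++; }

        /* inet_sk_rx_dst_set()-style helper: take a reference on the
         * skb's route and cache it, plus the input ifindex, on the
         * socket, so the IPv4 and ipv6_mapped paths share one place. */
        static void rx_dst_set(struct sock_sketch *sk, struct sk_buff *skb)
        {
                struct dst_entry *dst = skb->dst;

                dst_hold(dst);
                sk->sk_rx_dst = dst;
                sk->rx_dst_ifindex = skb->skb_iif;
        }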

    Reported-by: Andrew Morton
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Jul, 2012

1 commit

  • The modern TCP stack depends heavily on tcp_write_timer() having
    small latency, but the current implementation doesn't quite meet
    that expectation.

    When a timer fires but finds the socket is owned by the user, it rearms
    itself for an additional delay hoping next run will be more
    successful.

    tcp_write_timer(), for example, uses a 50ms delay for the next try,
    which defeats many attempts to get predictable TCP behavior in terms
    of latency.

    Use the recently introduced tcp_release_cb(), so that the user owning
    the socket will call various handlers right before socket release.

    This will permit us to post a followup patch to address the
    tcp_tso_should_defer() syndrome (some deferred packets have to wait
    for the RTO timer to be transmitted, while cwnd should allow us to
    send them sooner).
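
    A compilable sketch of the deferral pattern described above
    (illustrative names; the kernel's flag handling differs in detail):

        #include <stdatomic.h>
        #include <stdbool.h>

        enum { WRITE_TIMER_DEFERRED = 1u << 0 };

        static _Atomic unsigned int tsq_flags;

        /* Timer handler: if the user owns the socket, record a flag
         * instead of rearming for a 50ms retry. */
        static void write_timer_fired(bool sock_owned_by_user)
        {
                if (sock_owned_by_user) {
                        atomic_fetch_or(&tsq_flags, WRITE_TIMER_DEFERRED);
                        return;
                }
                /* ... run the (re)transmit logic directly ... */
        }

        /* Called from release_sock() right before the lock is dropped. */
        static void release_cb(void)
        {
                unsigned int flags = atomic_exchange(&tsq_flags, 0);

                if (flags & WRITE_TIMER_DEFERRED) {
                        /* ... run the deferred (re)transmit logic now ... */
                }
        }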

    Signed-off-by: Eric Dumazet
    Cc: Tom Herbert
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Nandita Dukkipati
    Cc: H.K. Jerry Chu
    Cc: John Heffner
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Jul, 2012

6 commits

  • In trusted networks, e.g., an intranet or a data center, the client
    does not need to use a Fast Open cookie to mitigate DoS attacks. In
    cookie-less mode, sendmsg() with the MSG_FASTOPEN flag will send
    SYN-data regardless of cookie availability.

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • On paths with firewalls dropping SYN packets with data or experimental
    TCP options, Fast Open connections will experience SYN timeouts and
    bad performance. The solution is to track such incidents in the cookie
    cache and disable Fast Open temporarily.

    Since only the original SYN includes data and/or the Fast Open option,
    the SYN-ACK has some tell-tale signs (tcp_rcv_fastopen_synack()) to
    detect such drops. If a path has recurring Fast Open SYN drops, Fast
    Open is disabled for 2^(recurring_losses) minutes, starting from four
    minutes up to roughly one and a half days. sendmsg() with the
    MSG_FASTOPEN flag will still succeed, but it behaves as connect()
    followed by write().

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • sendmsg() (or sendto()) with MSG_FASTOPEN is a combo of connect(2)
    and write(2). The application should replace connect() with it to
    send data in the opening SYN packet.

    For a blocking socket, sendmsg() blocks until all the data is
    buffered locally and the handshake is completed, as a connect() call
    does. It returns a similar errno to connect() if the TCP handshake
    fails.

    For a non-blocking socket, it returns the number of bytes queued (and
    transmitted in the SYN-data packet) if a cookie is available. If a
    cookie is not available, it transmits a data-less SYN packet with the
    Fast Open cookie request option and returns -EINPROGRESS like
    connect().

    Using MSG_FASTOPEN on a connecting or connected socket will result in
    similar errnos to repeated connect() calls. Therefore the application
    should only use this flag on new sockets.

    The buffer size of sendmsg() is independent of the MSS of the connection.
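
    For illustration, a minimal hypothetical client using this API; the
    address, port, and payload are placeholders, and MSG_FASTOPEN is
    defined locally in case older headers lack it:

        #include <sys/socket.h>
        #include <netinet/in.h>
        #include <arpa/inet.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        #ifndef MSG_FASTOPEN
        #define MSG_FASTOPEN 0x20000000
        #endif

        int main(void)
        {
                int fd = socket(AF_INET, SOCK_STREAM, 0);
                struct sockaddr_in addr;
                const char req[] = "GET / HTTP/1.0\r\n\r\n";
                ssize_t n;

                memset(&addr, 0, sizeof(addr));
                addr.sin_family = AF_INET;
                addr.sin_port = htons(80);
                inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr);

                /* Replaces connect() + write(): the payload rides in
                 * the SYN when a Fast Open cookie is cached. */
                n = sendto(fd, req, sizeof(req) - 1, MSG_FASTOPEN,
                           (struct sockaddr *)&addr, sizeof(addr));
                if (n < 0)
                        perror("sendto");
                close(fd);
                return 0;
        }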

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch implements sending SYN-data in tcp_connect(). The data is
    from tcp_sendmsg() with flag MSG_FASTOPEN (implemented in a later patch).

    The length of the cookie in tcp_fastopen_req, initialized to 0,
    controls the type of the SYN. If the cookie is not cached (len == 0),
    the host sends a data-less SYN with the Fast Open cookie request
    option to solicit a cookie from the remote. If a cookie is available
    (len > 0), the host sends a SYN-data with the Fast Open cookie
    option. If the cookie length is negative, the SYN will not include
    any Fast Open option (for fallback operations).
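
    A trivial standalone sketch of this three-way dispatch (names are
    illustrative):

        enum syn_type {
                SYN_COOKIE_REQUEST,   /* len == 0: solicit a cookie   */
                SYN_DATA_WITH_COOKIE, /* len > 0: send SYN-data       */
                SYN_PLAIN,            /* len < 0: no Fast Open option */
        };

        static enum syn_type classify_syn(int cookie_len)
        {
                if (cookie_len == 0)
                        return SYN_COOKIE_REQUEST;
                if (cookie_len > 0)
                        return SYN_DATA_WITH_COOKIE;
                return SYN_PLAIN;
        }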

    To deal with middleboxes that may drop SYN with data or experimental TCP
    option, the SYN-data is only sent once. SYN retransmits do not include
    data or Fast Open options. The connection will fall back to regular TCP
    handshake.

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • With help from Eric Dumazet, add Fast Open metrics to the TCP metrics
    cache. The basic ones are the MSS and the cookies. A later patch will
    cache more to handle unfriendly middleboxes.

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch implements the common code for both the client and server.

    1. TCP Fast Open option processing. Since Fast Open does not have an
    option number assigned by IANA yet, it shares the experimental option
    code 254 by implementing draft-ietf-tcpm-experimental-options
    with a 16-bit magic number 0xF989. This enables global experiments
    without clashing with the scarce experimental option codepoints (only
    two) available for TCP.

    When the draft becomes a standard (maybe), the client should
    switch to the newly assigned option number while the server supports
    both numbers for the transition. A sketch of the option layout
    appears after this list.

    2. The new sysctl tcp_fastopen

    3. A placeholder init function
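
    A sketch of the resulting option layout on the wire, assuming an
    8-byte cookie (the cookie bytes below are made up):

        #include <stdio.h>

        int main(void)
        {
                unsigned char opt[] = {
                        254,        /* kind: experimental option       */
                        12,         /* length: 2 + 2 magic + 8 cookie  */
                        0xF9, 0x89, /* magic identifying Fast Open     */
                        0xDE, 0xAD, 0xBE, 0xEF,
                        0x01, 0x02, 0x03, 0x04, /* example cookie      */
                };
                unsigned int i;

                for (i = 0; i < sizeof(opt); i++)
                        printf("%02x ", opt[i]);
                printf("\n");
                return 0;
        }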

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

17 Jul, 2012

1 commit

  • Implement the RFC 5961 mitigation against the Blind
    Reset attack using the RST bit.

    The idea is to validate the incoming RST sequence number
    against an exact match with RCV.NXT, instead of the previously
    accepted in-window range (RCV.NXT <= SEG.SEQ < RCV.NXT + RCV.WND).

    If the sequence is in the window but not an exact match, send
    a "challenge ACK", so that the other party can resend an
    RST with the appropriate sequence number.
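
    A standalone sketch of this validation (illustrative, using
    wrap-safe sequence arithmetic; not the kernel's exact code):

        #include <stdint.h>

        enum rst_action { RST_DROP, RST_CHALLENGE_ACK, RST_ACCEPT };

        static enum rst_action validate_rst(uint32_t seg_seq, uint32_t rcv_nxt,
                                            uint32_t rcv_wnd)
        {
                /* Entirely outside the window: drop silently. */
                if ((int32_t)(seg_seq - rcv_nxt) < 0 ||
                    (int32_t)(seg_seq - (rcv_nxt + rcv_wnd)) >= 0)
                        return RST_DROP;
                /* Exact match with RCV.NXT: accept the reset. */
                if (seg_seq == rcv_nxt)
                        return RST_ACCEPT;
                /* In window but inexact: make the peer prove itself. */
                return RST_CHALLENGE_ACK;
        }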

    Add a new sysctl, tcp_challenge_ack_limit, to limit the
    number of challenge ACKs sent per second.

    Add a new SNMP counter to count the number of challenge ACKs sent.
    (netstat -s | grep TCPChallengeACK)

    Signed-off-by: Eric Dumazet
    Cc: Kiran Kumar Kella
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Jul, 2012

1 commit

  • This introduces TSQ (TCP Small Queues).

    The goal of TSQ is to reduce the number of TCP packets in xmit queues
    (qdisc & device queues), to reduce RTT and cwnd bias, part of the
    bufferbloat problem.

    sk->sk_wmem_alloc is not allowed to grow above a given limit,
    allowing no more than ~128KB [1] per tcp socket in the qdisc/dev
    layers at a given time.

    TSO packets are sized/capped to half the limit, so that we have two
    TSO packets in flight, allowing better bandwidth use.

    As a side effect, setting the limit to 40000 automatically reduces
    the standard gso max limit (65536) to 40000/2 = 20000: smaller TSO
    packets can help reduce the latency of high-priority packets (see the
    sketch after the notes below).

    This means we divert sock_wfree() to a tcp_wfree() handler, to
    queue/send following frames when skb_orphan() [2] is called for the
    already queued skbs.

    Results on my dev machines (tg3/ixgbe nics) are really impressive,
    using standard pfifo_fast, and with or without TSO/GSO.

    Without reduction of nominal bandwidth, we have reduction of buffering
    per bulk sender :
    < 1ms on Gbit (instead of 50ms with TSO)
    < 8ms on 100Mbit (instead of 132 ms)

    I no longer have 4 MBytes backlogged in the qdisc by a single netperf
    session, and socket autotuning on both sides no longer uses 4 MBytes.

    As the skb destructor cannot restart transmission itself (the qdisc
    lock might be held at this point), we delegate the work to a tasklet,
    one tasklet per cpu for performance reasons.

    If the tasklet finds a socket owned by the user, it sets the
    TSQ_OWNED flag. This flag is tested in a new protocol method called
    from release_sock(), to eventually send new segments.

    [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
    [2] skb_orphan() is usually called at TX completion time,
    but some drivers call it in their start_xmit() handler.
    These drivers should at least use BQL, or else a single TCP
    session can still fill the whole NIC TX ring, since TSQ will
    have no effect.
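
    A trivial sketch of the sizing rule mentioned above (illustrative
    names):

        /* Cap each TSO packet at half the per-socket limit so two
         * packets fit in flight; e.g. a 40000-byte limit caps TSO at
         * 20000 instead of the 65536 default. */
        static unsigned int tso_size_cap(unsigned int limit_output_bytes,
                                         unsigned int gso_max_size)
        {
                unsigned int cap = limit_output_bytes / 2;

                return cap < gso_max_size ? cap : gso_max_size;
        }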

    Signed-off-by: Eric Dumazet
    Cc: Dave Taht
    Cc: Tom Herbert
    Cc: Matt Mathis
    Cc: Yuchung Cheng
    Cc: Nandita Dukkipati
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Jul, 2012

4 commits


28 Jun, 2012

3 commits

  • It's completely unnecessary.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This reverts commit c074da2810c118b3812f32d6754bd9ead2f169e7.

    This change has several unwanted side effects:

    1) Sockets will cache the DST_NOCACHE route in sk->sk_rx_dst and we'll
    thus never create a real cached route.

    2) All TCP traffic will use DST_NOCACHE and never use the routing
    cache at all.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • DDoS synflood attacks hit the IP route cache badly.

    On typical machines, this cache is allowed to hold up to 8 million
    dst entries, at 256 bytes each, for a total of 2GB of memory.

    rt_garbage_collect() triggers and tries to cleanup things.

    Eventually route cache is disabled but machine is under fire and might
    OOM and crash.

    This patch exploits the new TCP early demux to set a nocache
    boolean in case the incoming TCP frame is for a not-yet-ESTABLISHED
    or TIMEWAIT socket.

    This 'nocache' boolean is then used, in case the dst entry is not
    found in the route cache, to create an unhashed dst entry
    (DST_NOCACHE).

    SYN-cookie ACKs sent use a similar mechanism (ipv4: tcp: dont cache
    output dst for syncookies), so after this patch, a machine is able to
    absorb a DDoS synflood attack without polluting its IP route cache.

    Signed-off-by: Eric Dumazet
    Cc: Hans Schillstrom
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Jun, 2012

1 commit

  • Input packet processing for local sockets involves two major demuxes.
    One for the route and one for the socket.

    But we can optimize this down to one demux for certain kinds of local
    sockets.

    Currently we only do this for established TCP sockets, but it could
    at least in theory be expanded to other kinds of connections.

    If a TCP socket is established then its identity is fully specified.

    This means that whatever input route was used during the three-way
    handshake must work equally well for the rest of the connection since
    the keys will not change.

    Once we move to established state, we cache the receive packet's input
    route to use later.

    Like the existing cached route in sk->sk_dst_cache used for output
    packets, we have to check for route invalidations using dst->obsolete
    and dst->ops->check().

    Early demux occurs outside of a socket locked section, so when a route
    invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
    actually inside of established state packet processing and thus have
    the socket locked.
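
    A compilable sketch of that validity check (stand-in types; the
    kernel checks dst->obsolete and dst->ops->check() on its own types):

        #include <stddef.h>
        #include <stdint.h>

        struct dst_entry;
        struct dst_ops {
                struct dst_entry *(*check)(struct dst_entry *dst,
                                           uint32_t cookie);
        };
        struct dst_entry {
                int obsolete;
                struct dst_ops *ops;
        };

        /* Before reusing the cached input route, re-validate it just
         * like the cached output route in sk_dst_cache. */
        static struct dst_entry *rx_dst_check(struct dst_entry *dst,
                                              uint32_t cookie)
        {
                if (dst && dst->obsolete && !dst->ops->check(dst, cookie))
                        return NULL; /* invalidated: do a full lookup */
                return dst;
        }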

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Jun, 2012

1 commit


09 Jun, 2012

1 commit

  • The get_peer method TCP uses is full of special cases that make no
    sense to accommodate, and it also gets in the way of doing more
    reasonable things here.

    First of all, if the socket doesn't have a usable cached route, there
    is no sense in trying to optimize timewait recycling.

    Likewise for the case where we have IP options, such as SRR enabled,
    that make the IP header destination address (and thus the destination
    address of the route key) differ from that of the connection's
    destination address.

    Just return a NULL peer in these cases, and thus we're also able to
    get rid of the clumsy inetpeer release logic.

    Signed-off-by: David S. Miller

    David S. Miller
     

18 May, 2012

1 commit


11 May, 2012

1 commit


05 May, 2012

1 commit

  • It appears some networks play bad games with the two bits reserved
    for ECN. This can trigger false congestion notifications and very
    slow transfers.

    Since RFC 3168 (6.1.1) forbids SYN packets to carry ECT bits, we can
    disable TCP ECN negotiation if it happens we receive mangled ECT bits
    in the SYN packet.
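
    A standalone sketch of the resulting check (illustrative;
    INET_ECN_MASK covers the two ECN bits of the TOS byte):

        #include <stdbool.h>
        #include <stdint.h>

        #define INET_ECN_MASK 0x03 /* the two ECN bits */

        /* A SYN arriving with either ECN bit already set suggests a
         * mangling middlebox: skip ECN negotiation for this connection. */
        static bool syn_ecn_ok(uint8_t tos)
        {
                return (tos & INET_ECN_MASK) == 0;
        }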

    Signed-off-by: Eric Dumazet
    Cc: Perry Lorier
    Cc: Matt Mathis
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Wilmer van der Gaast
    Cc: Ankur Jain
    Cc: Tom Herbert
    Cc: Dave Täht
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 May, 2012

3 commits

  • Extend tcp coalescing by implementing it from tcp_queue_rcv(), the
    main receive function when the application is not blocked in
    recvmsg().

    Function tcp_queue_rcv() is moved a bit to allow it to be called from
    tcp_data_queue().

    This gives good results especially if GRO could not kick in, and if
    the skb head is a fragment.

    Signed-off-by: Eric Dumazet
    Cc: Alexander Duyck
    Cc: Neal Cardwell
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Implement the advanced early retransmit (sysctl_tcp_early_retrans==2),
    which delays the fast retransmit by an interval of RTT/4. We borrow
    the RTO timer to implement the delay. If we receive another ACK or
    send a new packet, the timer is cancelled and restored to the
    original RTO value, offset by the time elapsed. When the delayed-ER
    timer fires, we enter fast recovery and perform fast retransmit.

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch implements RFC 5827 early retransmit (ER) for TCP.
    It reduces the DUPACK threshold (dupthresh) when fewer than 4 packets
    are outstanding, so losses are recovered by fast recovery instead of
    a timeout.

    While the algorithm is simple, small but frequent network reordering
    makes this feature dangerous: the connection repeatedly enters
    false recovery and degrades performance. Therefore we implement
    a mitigation suggested in the appendix of the RFC that delays
    entering fast recovery by a small interval, i.e., RTT/4. Currently
    ER is conservative and is disabled for the rest of the connection
    after the first reordering event. A large-scale web server
    experiment on the performance impact of ER is summarized in
    section 6 of the paper "Proportional Rate Reduction for TCP",
    IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf

    Note that Linux has a similar feature called THIN_DUPACK. The
    differences are that THIN_DUPACK does not mitigate reordering and is
    only used after slow start. Currently ER is disabled if THIN_DUPACK
    is enabled. I would be happy to merge the THIN_DUPACK feature with ER
    if people think it's a good idea.

    ER is enabled by sysctl_tcp_early_retrans:
    0: Disables ER

    1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4.

    2: (Default) reduce dupthresh like mode 1. In addition, delay
    entering fast recovery by RTT/4.

    Note: mode 2 is implemented in the third part of this patch series.
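
    A standalone sketch of the mode-1 dupthresh adjustment (illustrative
    names; the kernel integrates this with its reordering state):

        /* Lower dupthresh when fewer than 4 packets are outstanding,
         * so fast recovery can still trigger; default_dupthresh is
         * normally 3. */
        static int er_dupthresh(int packets_out, int default_dupthresh)
        {
                if (packets_out > 1 && packets_out < 4)
                        return packets_out - 1;
                return default_dupthresh;
        }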

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

27 Apr, 2012

1 commit

  • Quoting Tore Anderson from
    https://bugzilla.kernel.org/show_bug.cgi?id=42572:

    When RTAX_FEATURE_ALLFRAG is set on a route, the effective TCP segment
    size does not take into account the size of the IPv6 Fragmentation
    header that needs to be included in outbound packets, causing every
    transmitted TCP segment to be fragmented across two IPv6 packets, the
    latter of which will only contain 8 bytes of actual payload.

    RTAX_FEATURE_ALLFRAG is typically set on a route in response to
    receiving an ICMPv6 Packet Too Big message indicating a Path MTU of
    less than 1280 bytes. 1280 bytes is the minimum IPv6 MTU; however,
    ICMPv6 PTBs with MTU < 1280 are still valid, in particular when an
    IPv6 packet is sent to an IPv4 destination through a stateless
    translator.
    Any ICMPv4 Need To Fragment packets originated from the IPv4 part of
    the path will be translated to ICMPv6 PTB which may then indicate an
    MTU of less than 1280.

    The Linux kernel refuses to reduce the effective MTU to anything below
    1280 bytes, instead it sets it to exactly 1280 bytes, and
    RTAX_FEATURE_ALLFRAG is also set. However, the TCP segment size appears
    to be set to 1240 bytes (1280 Path MTU - 40 bytes of IPv6 header),
    instead of 1232 (additionally taking into account the 8 bytes required
    by the IPv6 Fragmentation extension header).

    This in turn results in rather inefficient transmission, as every
    transmitted TCP segment now is split in two fragments containing
    1232+8 bytes of payload.

    After this patch, all outgoing packets that include a Fragmentation
    header are "atomic" or "non-fragmented" fragments, i.e., they have
    both Offset=0 and More Fragments=0.

    With help from David S. Miller

    Reported-by: Tore Anderson
    Signed-off-by: Eric Dumazet
    Cc: Maciej Żenczykowski
    Cc: Tom Herbert
    Tested-by: Tore Anderson
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Apr, 2012

3 commits

  • This commit moves the (substantial) common code shared between
    tcp_v4_init_sock() and tcp_v6_init_sock() to a new address-family
    independent function, tcp_init_sock().

    Centralizing this functionality should help avoid drift issues,
    e.g. where the IPv4 side is updated without a corresponding update to
    IPv6. There was already some drift: IPv4 initialized snd_cwnd to
    TCP_INIT_CWND, while the IPv6 side was still initializing snd_cwnd to
    2 (in this case it should not matter, since snd_cwnd is also
    initialized in tcp_init_metrics(), but the general risks and
    maintenance overhead remain).

    When diffing the old and new code, note that the new tcp_init_sock()
    function uses the order of steps from the tcp_v4_init_sock()
    implementation (the order is slightly different in
    tcp_v6_init_sock()).

    Signed-off-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • This includes (according to the previous description):

    * TCP_REPAIR sockoption

    This one just puts the socket in/out of the repair mode.
    Allowed for CAP_NET_ADMIN and for closed/established sockets only.
    When repair mode is turned off and the socket happens to be in
    the established state, a window probe is sent to the peer to
    'unlock' the connection.

    * TCP_REPAIR_QUEUE sockoption

    This one sets the queue which we're about to repair. The
    'no-queue' is set by default.

    * TCP_QUEUE_SEQ sockoption

    Sets the write_seq/rcv_nxt of a selected repaired queue.
    Allowed for TCP_CLOSE-d sockets only. When the socket changes
    its state, the other seqs are changed by the kernel according
    to the protocol rules (most of the existing code is actually
    reused).

    * Ability to forcibly bind a socket to a port

    The sk->sk_reuse is set to SK_FORCE_REUSE.

    * Immediate connect modification

    The connect syscall initializes the connection, then directly jumps
    to the code which finalizes it.

    * Silent close modification

    The close just aborts the connection (similar to SO_LINGER with 0
    timeout) but without sending any FIN/RST-s to the peer; see the usage
    sketch below.
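
    A hypothetical userspace sketch of driving these options (constants
    follow this series' uapi values, defined locally in case headers
    predate them):

        #include <sys/socket.h>
        #include <netinet/in.h>
        #include <netinet/tcp.h>
        #include <stdio.h>

        #ifndef TCP_REPAIR
        #define TCP_REPAIR       19
        #define TCP_REPAIR_QUEUE 20
        #define TCP_QUEUE_SEQ    21
        #endif
        #ifndef TCP_SEND_QUEUE
        #define TCP_SEND_QUEUE   2
        #endif

        int main(void)
        {
                int fd = socket(AF_INET, SOCK_STREAM, 0);
                int on = 1, queue = TCP_SEND_QUEUE;
                unsigned int seq = 1000000; /* arbitrary example value */

                /* Enter repair mode (CAP_NET_ADMIN; closed socket). */
                if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &on,
                               sizeof(on)) < 0)
                        perror("TCP_REPAIR");
                /* Pick the send queue, then restore its sequence. */
                setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &queue,
                           sizeof(queue));
                setsockopt(fd, IPPROTO_TCP, TCP_QUEUE_SEQ, &seq,
                           sizeof(seq));
                return 0;
        }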

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • This is just a preparation patch, which makes the code needed for
    TCP repair ready for use.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

17 Apr, 2012

1 commit

  • Commit b82d1bb4 inadvertently placed unrelated new code between
    TCPCB_EVER_RETRANS and TCPCB_RETRANS and the other macros that refer
    to the sacked field in struct tcp_skb_cb (probably because there
    was a misleading empty line there). This commit fixes up the
    formatting so that all macros related to the sacked field are
    adjacent again.

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     

16 Apr, 2012

1 commit