25 Aug, 2012

1 commit

  • The cwnd reduction in fast recovery is based on the number of packets
    newly delivered per ACK. For non-SACK connections every DUPACK signifies
    that a packet has been delivered, but the sender mistakenly skips
    counting them for the cwnd reduction.

    The fix is to compute newly_acked_sacked after DUPACKs are accounted
    in sacked_out for non-SACK connections.
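
    A minimal userspace model (not the kernel code) of the ordering the fix
    establishes: the DUPACK is counted into sacked_out first, so that the
    per-ACK delivery count used for the cwnd reduction reflects it. All
    structure and variable names here are illustrative.

    #include <stdio.h>

    struct conn {
        int sacked_out;   /* packets presumed delivered via DUPACKs */
    };

    static int newly_delivered(struct conn *c, int pkts_acked, int is_dupack)
    {
        int prior_sacked = c->sacked_out;

        if (is_dupack)
            c->sacked_out++;    /* account the DUPACK first ... */

        /* ... so the delta below reflects it (the bug computed this earlier) */
        return pkts_acked + c->sacked_out - prior_sacked;
    }

    int main(void)
    {
        struct conn c = { .sacked_out = 0 };

        /* A pure DUPACK acks no new data but still signals one delivery. */
        printf("delivered on DUPACK: %d\n", newly_delivered(&c, 0, 1)); /* 1 */
        return 0;
    }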

    Signed-off-by: Yuchung Cheng
    Acked-by: Nandita Dukkipati
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

07 Aug, 2012

1 commit

  • IPv6 needs a cookie in the dst_check() call.

    We need to add rx_dst_cookie and provide a family independent
    sk_rx_dst_set(sk, skb) method to properly support IPv6 TCP early demux.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Aug, 2012

2 commits

  • Merge Andrew's second set of patches:
    - MM
    - a few random fixes
    - a couple of RTC leftovers

    * emailed patches from Andrew Morton : (120 commits)
    rtc/rtc-88pm80x: remove unneed devm_kfree
    rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
    mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
    tmpfs: distribute interleave better across nodes
    mm: remove redundant initialization
    mm: warn if pg_data_t isn't initialized with zero
    mips: zero out pg_data_t when it's allocated
    memcg: fix memory accounting scalability in shrink_page_list
    mm/sparse: remove index_init_lock
    mm/sparse: more checks on mem_section number
    mm/sparse: optimize sparse_index_alloc
    memcg: add mem_cgroup_from_css() helper
    memcg: further prevent OOM with too many dirty pages
    memcg: prevent OOM with too many dirty pages
    mm: mmu_notifier: fix freed page still mapped in secondary MMU
    mm: memcg: only check anon swapin page charges for swap cache
    mm: memcg: only check swap cache pages for repeated charging
    mm: memcg: split swapin charge function into private and public part
    mm: memcg: remove needless !mm fixup to init_mm when charging
    mm: memcg: remove unneeded shmem charge type
    ...

    Linus Torvalds
     
  • This patch series is based on top of "Swap-over-NBD without deadlocking
    v15" as it depends on the same reservation of PF_MEMALLOC reserves logic.

    When a user or administrator requires swap for their application, they
    create a swap partition or file, format it with mkswap and activate it
    with swapon. In diskless systems this is not an option, so if swap is
    required then swapping over the network is considered. The two likely
    scenarios are blade servers used as part of a cluster, where the form
    factor or maintenance costs do not allow the use of disks, and thin
    clients.

    The Linux Terminal Server Project recommends the use of the Network Block
    Device (NBD) for swap but this is not always an option. There is no
    guarantee that the network attached storage (NAS) device is running Linux
    or supports NBD. However, it is likely that it supports NFS so there are
    users that want support for swapping over NFS despite any performance
    concern. Some distributions currently carry patches that support swapping
    over NFS but it would be preferable to support it in the mainline kernel.

    Patch 1 avoids a stream-specific deadlock that potentially affects TCP.

    Patch 2 is a small modification to SELinux to avoid using PF_MEMALLOC
    reserves.

    Patch 3 adds three helpers for filesystems to handle swap cache pages.
    For example, page_file_mapping() returns page->mapping for
    file-backed pages and the address_space of the underlying
    swap file for swap cache pages.

    Patch 4 adds two address_space_operations to allow a filesystem
    to pin all metadata relevant to a swapfile in memory. Upon
    successful activation, the swapfile is marked SWP_FILE and
    the address space operation ->direct_IO is used for writing
    and ->readpage for reading in swap pages.

    Patch 5 notes that patch 3 bolts filesystem-specific swapfile support
    onto the side and that the default handlers have different information
    available to them than the filesystem does. This patch refactors the
    code so that there are generic handlers for each of the new
    address_space operations.

    Patch 6 adds an API to allow a vector of kernel addresses to be
    translated to struct pages and pinned for IO.

    Patch 7 adds support for using highmem pages for swap by kmapping
    the pages before calling the direct_IO handler.

    Patch 8 updates NFS to use the helpers from patch 3 where necessary.

    Patch 9 avoids setting PG_private on PG_swapcache pages within NFS.

    Patch 10 implements the new swapfile-related address_space operations
    for NFS and teaches the direct IO handler how to manage
    kernel addresses.

    Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO
    where appropriate.

    Patch 12 fixes a NULL pointer dereference that occurs when using
    swap-over-NFS.

    With the patches applied, it is possible to mount a swapfile that is on an
    NFS filesystem. Swap performance is not great, with a swap stress test
    taking roughly twice as long to complete as when the swap device is
    backed by NBD.

    This patch: netvm: prevent a stream-specific deadlock

    It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
    that we're over the global rmem limit. This will prevent SOCK_MEMALLOC
    buffers from receiving data, which will prevent userspace from running,
    which is needed to reduce the buffered data.

    Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit. Once
    this change is applied, it is important that sockets that set
    SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
    If this happens, a warning is generated and the tokens reclaimed to avoid
    accounting errors until the bug is fixed.
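
    A minimal sketch in simplified userspace C of the receive-side exemption
    described above: a socket flagged for memalloc use keeps receiving even
    when ordinary sockets would be dropped for exceeding the rmem limit. The
    struct and field names are stand-ins, not the kernel's.

    #include <stdbool.h>

    struct toy_sock {
        long rmem_alloc;   /* receive memory currently charged to the socket */
        long rcvbuf;       /* per-socket receive buffer limit */
        bool memalloc;     /* stands in for the SOCK_MEMALLOC flag */
    };

    /* Should an incoming skb of 'size' bytes be dropped for memory reasons? */
    static bool toy_rmem_should_drop(const struct toy_sock *sk, long size)
    {
        /* SOCK_MEMALLOC sockets are exempted from the rmem limit so that
         * swap traffic can still make progress and eventually free memory. */
        if (sk->memalloc)
            return false;

        return sk->rmem_alloc + size > sk->rcvbuf;
    }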

    [davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Acked-by: Rik van Riel
    Cc: Trond Myklebust
    Cc: Neil Brown
    Cc: Christoph Hellwig
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

31 Jul, 2012

1 commit

  • commit c6cffba4ffa2 (ipv4: Fix input route performance regression.)
    added various fatal races with dst refcounts.

    crashes happen on tcp workloads if routes are added/deleted at the same
    time.

    The dst_free() calls from free_fib_info_rcu() are clearly racy.

    We need regular dst refcounting (dst_release()) instead, and must make
    sure dst_release() is aware of RCU grace periods:

    Add DST_RCU_FREE flag so that dst_release() respects an RCU grace period
    before dst destruction for cached dst

    Introduce a new inet_sk_rx_dst_set() helper, using atomic_inc_not_zero()
    to make sure we don't increase a zero refcount (on a dst currently
    waiting for an RCU grace period before destruction)
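
    The "take a reference only if the refcount is still non-zero" pattern
    mentioned above can be sketched with C11 atomics; this models the idea
    behind atomic_inc_not_zero(), it is not the kernel's dst code.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Succeed only while the object still has references, i.e. it is not
     * already on its way to destruction after an RCU grace period. */
    static bool ref_inc_not_zero(atomic_int *refcnt)
    {
        int old = atomic_load(refcnt);

        while (old != 0) {
            if (atomic_compare_exchange_weak(refcnt, &old, old + 1))
                return true;   /* reference taken */
            /* a failed CAS refreshed 'old'; retry */
        }
        return false;          /* refcount already hit zero: hands off */
    }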

    rt_cache_route() must take a reference on the new cached route, and
    release it if it was not able to install it.

    With this patch, my machines survive various benchmarks.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Jul, 2012

2 commits

  • Back in 2006, commit 1a2449a87b ("[I/OAT]: TCP recv offload to I/OAT")
    added support for receive offloading to IOAT dma engine if available.

    The code in tcp_rcv_established() tries to perform early DMA copy if
    applicable. It however does so without checking whether the userspace
    task is actually expecting the data in the buffer.

    This is not a problem under normal circumstances, but there is a corner
    case where this doesn't work -- and that's when MSG_TRUNC flag to
    recvmsg() is used.

    If the IOAT dma engine is not used, the code properly checks whether
    there is a valid ucopy.task and the socket is owned by userspace, but
    misses the check in the dmaengine case.

    This problem can be trivially observed in practice; 'tbench', for
    example, is a good reproducer, as it makes heavy use of MSG_TRUNC. On
    systems utilizing IOAT, you will soon find tbench waiting indefinitely in
    sk_wait_data(), because the data has already been early-copied in
    tcp_rcv_established() using the dma engine.

    This patch introduces the same check we are performing in the simple
    iovec copy case to the IOAT case as well. It fixes the indefinite
    recvmsg(MSG_TRUNC) hangs.
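
    The eligibility test described above can be written as a standalone
    predicate; the field names mirror the description (ucopy.task, socket
    owned by user) but this is an illustrative model, not the kernel code.

    #include <stdbool.h>
    #include <stddef.h>

    struct toy_ucopy {
        void *task;    /* the user task waiting in recvmsg(), if any */
        size_t len;    /* room left in the user-supplied buffer */
    };

    struct toy_tcp_sock {
        struct toy_ucopy ucopy;
        bool owned_by_user;    /* stands in for sock_owned_by_user(sk) */
    };

    /* Only attempt the early copy (iovec or DMA) when userspace is really
     * expecting the data; otherwise fall back to normal receive queueing. */
    static bool toy_early_copy_ok(const struct toy_tcp_sock *tp, size_t payload)
    {
        return tp->ucopy.task != NULL &&
               payload <= tp->ucopy.len &&
               tp->owned_by_user;
    }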

    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller

    Jiri Kosina
     
  • commit 92101b3b2e317 (ipv4: Prepare for change of rt->rt_iif encoding.)
    invalidated TCP early demux, because rx_dst_ifindex is not properly
    initialized and checked.

    Also remove the use of inet_iif(skb) in favor of skb->skb_iif

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Jul, 2012

1 commit

  • Use inet_iif() consistently, and for TCP record the input interface of
    cached RX dst in inet sock.

    rt->rt_iif is going to be encoded differently, so that we can
    legitimately cache input routes in the FIB info more aggressively.

    When the input interface is "use SKB device index" the rt->rt_iif will
    be set to zero.

    This forces us to move the TCP RX dst cache installation into the ipv4
    specific code, and rightly so, since doing the route caching for ipv6 is
    pointless at the moment: it is not inspected in the ipv6 input paths yet.

    Also, remove the unlikely on dst->obsolete, all ipv4 dsts have
    obsolete set to a non-zero value to force invocation of the check
    callback.

    Signed-off-by: David S. Miller

    David S. Miller
     

20 Jul, 2012

4 commits

  • In trusted networks, e.g., intranet, data-center, the client does not
    need to use a Fast Open cookie to mitigate DoS attacks. In cookie-less
    mode, sendmsg() with the MSG_FASTOPEN flag will send SYN-data regardless
    of cookie availability.
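
    A minimal client-side usage example: with MSG_FASTOPEN, sendto() on an
    unconnected socket replaces the usual connect() + send() pair and lets
    the kernel carry the data in the SYN when possible. The address is a
    placeholder and error handling is reduced to the bare minimum.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #ifndef MSG_FASTOPEN
    #define MSG_FASTOPEN 0x20000000    /* older libc headers may lack it */
    #endif

    int main(void)
    {
        const char req[] = "GET / HTTP/1.0\r\n\r\n";
        struct sockaddr_in srv = { 0 };
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0) {
            perror("socket");
            return 1;
        }
        srv.sin_family = AF_INET;
        srv.sin_port = htons(80);
        inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);   /* example address */

        /* No connect(): the payload rides in the SYN if Fast Open is
         * possible, otherwise the kernel falls back to a normal handshake. */
        if (sendto(fd, req, strlen(req), MSG_FASTOPEN,
                   (struct sockaddr *)&srv, sizeof(srv)) < 0)
            perror("sendto(MSG_FASTOPEN)");

        close(fd);
        return 0;
    }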

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • On paths with firewalls dropping SYN with data or experimental TCP options,
    Fast Open connections will experience SYN timeouts and bad performance.
    The solution is to track such incidents in the cookie cache and disable
    Fast Open temporarily.

    Since only the original SYN includes data and/or the Fast Open option, the
    SYN-ACK has some tell-tale sign (tcp_rcv_fastopen_synack()) to detect
    such drops. If a path has recurring Fast Open SYN drops, Fast Open is
    disabled for 2^(recurring_losses) minutes, starting from four minutes up
    to roughly one and a half days. sendmsg with the MSG_FASTOPEN flag will
    still succeed, but it behaves as connect() followed by write().
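
    The backoff schedule above works out as follows (a sketch only: 2^n
    minutes per recurring loss, with four minutes corresponding to n=2 and
    roughly a day and a half to n=11; the mapping of n to the stored loss
    counter and the exact cap are assumptions, not taken from the patch).

    #include <stdio.h>

    int main(void)
    {
        for (int n = 2; n <= 11; n++) {
            long minutes = 1L << n;
            printf("n=%2d  disable Fast Open for %4ld min (%.1f h)\n",
                   n, minutes, minutes / 60.0);
        }
        return 0;
    }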

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • On receiving the SYN-ACK after SYN-data, the client needs to
    a) update the cached MSS and cookie (if included in SYN-ACK)
    b) retransmit the data not yet acknowledged by the SYN-ACK in the final ACK of
    the handshake.

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch implements the common code for both the client and server.

    1. TCP Fast Open option processing. Since Fast Open does not have an
    option number assigned by IANA yet, it shares the experimental option
    code 254 by implementing draft-ietf-tcpm-experimental-options
    with a 16-bit magic number 0xF989. This enables global experiments
    without clashing with the scarce (only two) experimental option
    codepoints available for TCP (the on-wire layout is sketched after
    this list).

    When the draft status becomes standard (maybe), the client should
    switch to the newly assigned option number while the server supports
    both numbers during the transition.

    2. The new sysctl tcp_fastopen

    3. A placeholder init function
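
    A sketch of the on-wire layout described in item 1: kind 254, a length
    byte, the 16-bit magic 0xF989, then the cookie bytes (real TCP options
    are additionally padded to a 32-bit boundary). This encoder is
    illustrative and is not taken from the patch.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define TCPOPT_EXP            254       /* shared experimental option kind */
    #define TCPOPT_FASTOPEN_MAGIC 0xF989

    /* Write kind/len/magic/cookie into 'buf'; returns the option length. */
    static size_t encode_tfo_exp_option(uint8_t *buf, const uint8_t *cookie,
                                        size_t cookie_len)
    {
        size_t len = 2 + 2 + cookie_len;    /* kind + len + magic + cookie */

        buf[0] = TCPOPT_EXP;
        buf[1] = (uint8_t)len;
        buf[2] = TCPOPT_FASTOPEN_MAGIC >> 8;      /* 0xF9 */
        buf[3] = TCPOPT_FASTOPEN_MAGIC & 0xff;    /* 0x89 */
        memcpy(buf + 4, cookie, cookie_len);
        return len;
    }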

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

19 Jul, 2012

1 commit

  • Followup of commit 0c24604b68fc (tcp: implement RFC 5961 4.2)

    As reported by Vijay Subramanian, we should send a challenge ACK
    instead of a dup ack if a SYN flag is set on a packet received out of
    window.

    This permits the ratelimiting to work as intended, and the correct SNMP
    counters to be incremented.

    Suggested-by: Vijay Subramanian
    Signed-off-by: Eric Dumazet
    Acked-by: Vijay Subramanian
    Cc: Kiran Kumar Kella
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Jul, 2012

3 commits

  • Implement the RFC 5961 mitigation against Blind
    Reset attacks using the SYN bit.

    Section 4.2 of RFC 5961 advises to send a Challenge ACK and drop
    incoming packet, instead of resetting the session.

    Add a new SNMP counter to count the number of challenge ACKs sent
    in response to SYN packets.
    (netstat -s | grep TCPSYNChallenge)

    Remove obsolete TCPAbortOnSyn, since we no longer abort a TCP session
    because of a SYN flag.

    Signed-off-by: Eric Dumazet
    Cc: Kiran Kumar Kella
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Implement the RFC 5961 mitigation against Blind
    Reset attacks using the RST bit.

    The idea is to validate the incoming RST sequence number against the
    exact RCV.NXT value, instead of the previously accepted window:
    (RCV.NXT <= SEG.SEQ < RCV.NXT+RCV.WND)

    If the sequence is in the window but not an exact match, send
    a "challenge ACK", so that the other peer can resend an
    RST with the appropriate sequence.
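
    The acceptance logic can be sketched as a standalone predicate using
    32-bit sequence arithmetic. This models the description above (exact
    RCV.NXT match resets, in-window but inexact triggers a challenge ACK,
    everything else is dropped); it does not quote the kernel code.

    #include <stdint.h>
    #include <stdio.h>

    /* Standard mod-2^32 sequence comparison, as used throughout TCP. */
    static int seq_before(uint32_t a, uint32_t b) { return (int32_t)(a - b) < 0; }

    enum rst_action { RST_DROP, RST_CHALLENGE_ACK, RST_ACCEPT };

    static enum rst_action classify_rst(uint32_t seq, uint32_t rcv_nxt,
                                        uint32_t rcv_wnd)
    {
        if (seq == rcv_nxt)
            return RST_ACCEPT;           /* exact match: genuine reset */
        if (!seq_before(seq, rcv_nxt) && seq_before(seq, rcv_nxt + rcv_wnd))
            return RST_CHALLENGE_ACK;    /* in window but not exact */
        return RST_DROP;                 /* out of window */
    }

    int main(void)
    {
        printf("%d %d %d\n",
               classify_rst(1000, 1000, 65535),   /* accept    */
               classify_rst(2000, 1000, 65535),   /* challenge */
               classify_rst(900,  1000, 65535));  /* drop      */
        return 0;
    }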

    Add a new sysctl, tcp_challenge_ack_limit, to limit the
    number of challenge ACKs sent per second.

    Add a new SNMP counter to count the number of challenge ACKs sent.
    (netstat -s | grep TCPChallengeACK)

    Signed-off-by: Eric Dumazet
    Cc: Kiran Kumar Kella
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Add three SNMP TCP counters, to better track TCP behavior
    at the global level (netstat -s), when packets are received
    Out Of Order (OFO)

    TCPOFOQueue : Number of packets queued in OFO queue

    TCPOFODrop : Number of packets meant to be queued in OFO
    but dropped because the socket rcvbuf limit was hit.

    TCPOFOMerge : Number of packets in OFO that were merged with
    other packets.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

05 Jul, 2012

1 commit

  • When a dst_confirm() happens, mark the confirmation as pending in the
    dst. Then on the next packet out, when we have the neigh in-hand, do
    the update.

    This removes the dependency in dst_confirm() of dst's having an
    attached neigh.

    While we're here, remove the explicit 'dst' NULL check; all except 2
    or 3 call sites ensure it's not NULL, so just fix those cases up.

    Signed-off-by: David S. Miller

    David S. Miller
     

20 Jun, 2012

1 commit

  • Input packet processing for local sockets involves two major demuxes.
    One for the route and one for the socket.

    But we can optimize this down to one demux for certain kinds of local
    sockets.

    Currently we only do this for established TCP sockets, but it could
    at least in theory be expanded to other kinds of connections.

    If a TCP socket is established then its identity is fully specified.

    This means that whatever input route was used during the three-way
    handshake must work equally well for the rest of the connection since
    the keys will not change.

    Once we move to established state, we cache the receive packet's input
    route to use later.

    Like the existing cached route in sk->sk_dst_cache used for output
    packets, we have to check for route invalidations using dst->obsolete
    and dst->ops->check().

    Early demux occurs outside of a socket locked section, so when a route
    invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
    actually inside of established state packet processing and thus have
    the socket locked.

    Signed-off-by: David S. Miller

    David S. Miller
     

24 May, 2012

1 commit

  • Sergio Correia reported the following warning:

    WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110()

    WARN(skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq),
    "cleanup rbuf bug: copied %X seq %X rcvnxt %X\n",
    tp->copied_seq, TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt);

    It appears TCP coalescing, and more specifically commit b081f85c297
    (net: implement tcp coalescing in tcp_queue_rcv()) should take care of
    possible segment overlaps in the receive queue. This was properly done in
    the case of the out_of_order_queue by the caller.

    For example, the segment at the tail of the queue has sequence 1000-2000,
    and we add a segment with sequence 1500-2500.
    This can happen in the case of retransmits.

    In this case, just don't do the coalescing.
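
    With the numbers from the example, the required check is simply that the
    new segment must start at or after the end of the tail segment. A small
    standalone model (not the kernel code):

    #include <stdint.h>
    #include <stdio.h>

    static int seq_before(uint32_t a, uint32_t b) { return (int32_t)(a - b) < 0; }

    /* Coalesce only if the new segment does not overlap the tail segment. */
    static int can_coalesce(uint32_t tail_end_seq, uint32_t new_seq)
    {
        return !seq_before(new_seq, tail_end_seq);
    }

    int main(void)
    {
        /* Tail covers 1000-2000; a retransmit covering 1500-2500 overlaps,
         * so it must not be coalesced onto the tail. */
        printf("coalesce seq 1500 onto tail ending 2000? %d\n",
               can_coalesce(2000, 1500));   /* 0: skip coalescing */
        printf("coalesce seq 2000 onto tail ending 2000? %d\n",
               can_coalesce(2000, 2000));   /* 1: safe */
        return 0;
    }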

    Reported-by: Sergio Correia
    Signed-off-by: Eric Dumazet
    Tested-by: Sergio Correia
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 May, 2012

1 commit

  • Move tcp_try_coalesce() protocol independent part to
    skb_try_coalesce().

    skb_try_coalesce() can be used in IPv4 defrag and IPv6 reassembly,
    to build optimized skbs (less sk_buff, and possibly less 'headers')

    skb_try_coalesce() is zero copy, unless the copy can fit in the
    destination header (it's a rare case)

    kfree_skb_partial() is also moved to net/core/skbuff.c and exported,
    because IPv6 will need it in patch (ipv6: use skb coalescing in
    reassembly).

    Signed-off-by: Eric Dumazet
    Cc: Alexander Duyck
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 May, 2012

3 commits

  • As proposed by Eric, make the tcp_input.o thinner.

    add/remove: 1/1 grow/shrink: 1/4 up/down: 868/-1329 (-461)
    function                  old     new   delta
    tcp_try_rmem_schedule       -     864    +864
    tcp_ack                  4811    4815      +4
    tcp_validate_incoming     817     815      -2
    tcp_collapse              860     858      -2
    tcp_send_rcvq             555     353    -202
    tcp_data_queue           3435    3033    -402
    tcp_prune_queue           721       -    -721

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • As noted by Eric, no checks are performed on the data size we're
    putting in the read queue during repair. Thus, validate the given
    data size with the common rmem management routine.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • It actually works on the input queue and will use its read mem
    routines, thus it's better to have it in the tcp_input.c file.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

08 May, 2012

1 commit

  • Conflicts:
    drivers/net/ethernet/intel/e1000e/param.c
    drivers/net/wireless/iwlwifi/iwl-agn-rx.c
    drivers/net/wireless/iwlwifi/iwl-trans-pcie-rx.c
    drivers/net/wireless/iwlwifi/iwl-trans.h

    Resolved the iwlwifi conflict with mainline using 3-way diff posted
    by John Linville and Stephen Rothwell. In 'net' we added a bug
    fix to make iwlwifi report a more accurate skb->truesize but this
    conflicted with RX path changes that happened meanwhile in net-next.

    In e1000e a conflict arose in the validation code for settings of
    adapter->itr. 'net-next' had more sophisticated logic so that
    logic was used.

    Signed-off-by: David S. Miller

    David S. Miller
     

04 May, 2012

1 commit

  • This patch adds support for a skb_head_is_locked helper function. It is
    meant to be used any time we are considering transferring the head from
    skb->head to a paged frag. If the head is locked it means we cannot remove
    the head from the skb so it must be copied or we must take the skb as a
    whole.
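
    A sketch of what such a helper plausibly checks, based on the description
    above: the head is "locked" when it is not a page-backed head fragment or
    when other clones share it. Simplified stand-in types are used; this is
    an assumption-labelled sketch, not necessarily the exact kernel code.

    #include <stdbool.h>

    struct toy_skb {
        bool head_frag;    /* head was allocated from a page fragment */
        bool cloned;       /* other clones share this head via dataref */
    };

    /* The head cannot be stolen and turned into a paged frag unless it is a
     * head fragment and is unshared. */
    static bool toy_skb_head_is_locked(const struct toy_skb *skb)
    {
        return !skb->head_frag || skb->cloned;
    }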

    Signed-off-by: Alexander Duyck
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     

03 May, 2012

10 commits

  • This change cleans up the last bits of tcp_try_coalesce so that we only
    need one goto which jumps to the end of the function. The idea is to make
    the code more readable by putting things in a linear order so that we start
    execution at the top of the function, and end it at the bottom.

    I also made a slight tweak to the code for handling frags when we are a
    clone. Instead of an "if (clone)" branch containing the loop and an else
    that sets nr_frags to 0, I changed the logic so that if (!clone) we just
    set the number of frags to 0, which disables the for loop anyway.

    Signed-off-by: Alexander Duyck
    Cc: Eric Dumazet
    Cc: Jeff Kirsher
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • This change reorders the code related to the use of an skb->head_frag so it
    is placed before we check the rest of the frags. This allows the code to
    read more linearly instead of like some sort of loop.

    Signed-off-by: Alexander Duyck
    Cc: Eric Dumazet
    Cc: Jeff Kirsher
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • This patch addresses several issues in the way we were tracking the
    truesize in tcp_try_coalesce.

    First it was using ksize which prevents us from having a 0 sized head frag
    and getting a usable result. To resolve that this patch uses the end
    pointer which is set based off either ksize, or the frag_size supplied in
    build_skb. This allows us to compute the original truesize of the entire
    buffer and remove that value leaving us with just what was added as pages.

    The second issue was the use of skb->len if there is a mergeable head frag.
    We should only need to remove the size of a data-aligned sk_buff from our
    current skb->truesize to compute the delta for a buffer with a reused head.
    By using skb->len the value of truesize was being artificially reduced,
    which means that head frags could use more memory than buffers using
    standard allocations.

    Signed-off-by: Alexander Duyck
    Cc: Eric Dumazet
    Cc: Jeff Kirsher
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • This change is meant to prevent stealing the skb->head to use as a page in
    the event that the skb->head was cloned. This allows the other clones to
    track each other via shinfo->dataref.

    Without this we break down to two methods for tracking the reference count,
    one being dataref, the other being the page count. As a result it becomes
    difficult to track how many references there are to skb->head.

    Signed-off-by: Alexander Duyck
    Cc: Eric Dumazet
    Cc: Jeff Kirsher
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • Extend tcp coalescing by implementing it from tcp_queue_rcv(), the main
    receiver function when the application is not blocked in recvmsg().

    Function tcp_queue_rcv() is moved a bit to allow its call from
    tcp_data_queue()

    This gives good results, especially if GRO could not kick in and if the
    skb head is a fragment.

    Signed-off-by: Eric Dumazet
    Cc: Alexander Duyck
    Cc: Neal Cardwell
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Before stealing fragments or skb head, we must make sure skbs are not
    cloned.

    Alexander was worried about the destination skb being cloned: in bridge
    setups, a driver could be fooled if skb->data_len did not match the skb
    nr_frags.

    If source skb is cloned, we must take references on pages instead.

    Bug happened using tcpdump (if not using mmap())

    Introduce kfree_skb_partial() helper to cleanup code.
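
    The semantics of such a partial free can be sketched with a toy model
    (assumed behaviour, not the kernel function): when the head has been
    stolen its buffer lives on in the destination skb, so only the skb shell
    is released; otherwise both the shell and its data are freed as usual.

    #include <stdbool.h>
    #include <stdlib.h>

    struct toy_skb {
        void *head;    /* the data buffer, possibly stolen by coalescing */
    };

    static void toy_kfree_skb_partial(struct toy_skb *skb, bool head_stolen)
    {
        if (!head_stolen)
            free(skb->head);    /* nobody took the buffer: free it too */
        free(skb);              /* the shell is always released */
    }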

    Reported-by: Alexander Duyck
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • tcp_adv_win_scale default value is 2, meaning we expect a good citizen
    skb to have a skb->len / skb->truesize ratio of 75% (3/4)

    In 2.6 kernels we (mis)accounted for a typical MSS=1460 frame:
    1536 + 64 + 256 = 1856 'estimated truesize', and 1856 * 3/4 = 1392.
    So these skbs were considered as not bloated.

    With recent truesize fixes, a typical MSS=1460 frame truesize is now the
    more precise:
    2048 + 256 = 2304. But 2304 * 3/4 = 1728.
    So these skbs are not good citizens anymore, because 1460 < 1728

    (GRO can escape this problem because it builds skbs with a too low
    truesize.)

    This also means tcp advertises a too optimistic window for a given
    allocated rcvspace : When receiving frames, sk_rmem_alloc can hit
    sk_rcvbuf limit and we call tcp_prune_queue()/tcp_collapse() too often,
    especially when application is slow to drain its receive queue or in
    case of losses (netperf is fast, scp is slow). This is a major latency
    source.

    We should adjust the len/truesize ratio to 50% instead of 75%

    This patch:

    1) changes tcp_adv_win_scale default to 1 instead of 2

    2) increases tcp_rmem[2] limit from 4MB to 6MB to take into account
    better truesize tracking and to allow the autotuned tcp receive window to
    reach the same value as before. Note that the same amount of kernel
    memory is consumed compared to 2.6 kernels.
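
    The effect of the default change can be checked with the usual
    window-from-space arithmetic, where a positive tcp_adv_win_scale reserves
    1/2^scale of the receive space for overhead (scale 2 advertises 3/4 of
    the space, scale 1 advertises 1/2). A minimal standalone demo of that
    arithmetic applied to the truesize numbers quoted above:

    #include <stdio.h>

    /* Sketch of the win-from-space arithmetic for positive scale values. */
    static int win_from_space(int space, int adv_win_scale)
    {
        return space - (space >> adv_win_scale);
    }

    int main(void)
    {
        /* Old estimate vs. current truesize of an MSS=1460 frame, and what
         * a 3/4 (scale 2) vs 1/2 (scale 1) ratio expects from each. */
        printf("scale 2: 1856 -> %d, 2304 -> %d\n",
               win_from_space(1856, 2), win_from_space(2304, 2)); /* 1392, 1728 */
        printf("scale 1: 2304 -> %d\n", win_from_space(2304, 1)); /* 1152 */
        return 0;
    }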

    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Tom Herbert
    Cc: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Implement the advanced early retransmit (sysctl_tcp_early_retrans==2),
    which delays the fast retransmit by an interval of RTT/4. We borrow the
    RTO timer to implement the delay. If we receive another ACK or send
    a new packet, the timer is cancelled and restored to original RTO
    value offset by time elapsed. When the delayed-ER timer fires,
    we enter fast recovery and perform fast retransmit.

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch implements RFC 5827 early retransmit (ER) for TCP.
    It reduces the DUPACK threshold (dupthresh) if fewer than 4 packets are
    outstanding, to recover losses by fast recovery instead of a timeout.

    While the algorithm is simple, small but frequent network reordering
    makes this feature dangerous: the connection repeatedly enters
    false recovery and degrades performance. Therefore we implement
    a mitigation suggested in the appendix of the RFC that delays
    entering fast recovery by a small interval, i.e., RTT/4. Currently
    ER is conservative and is disabled for the rest of the connection
    after the first reordering event. A large scale web server
    experiment on the performance impact of ER is summarized in
    section 6 of the paper "Proportional Rate Reduction for TCP",
    IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf

    Note that Linux has a similar feature called THIN_DUPACK. The
    differences are that THIN_DUPACK does not mitigate reordering and is
    only used after slow start. Currently ER is disabled if THIN_DUPACK is
    enabled. I would be happy to merge the THIN_DUPACK feature with ER if
    people think it's a good idea.

    ER is enabled by sysctl_tcp_early_retrans:
    0: Disables ER

    1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4.

    2: (Default) reduce dupthresh like mode 1. In addition, delay
    entering fast recovery by RTT/4.

    Note: mode 2 is implemented in the third part of this patch series.
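
    A small standalone model of the dupthresh reduction (mode 1) described
    above; it follows the text rather than quoting the kernel, and the guard
    for a single outstanding packet is a modelling choice.

    #include <stdio.h>

    #define TCP_FASTRETRANS_THRESH 3    /* classic DUPACK threshold */

    static int dupthresh(int packets_out, int early_retrans_enabled)
    {
        /* With fewer than four packets in flight the classic threshold of
         * three DUPACKs may never be reached, so lower it to packets_out - 1. */
        if (early_retrans_enabled && packets_out < 4)
            return packets_out > 1 ? packets_out - 1 : 1;
        return TCP_FASTRETRANS_THRESH;
    }

    int main(void)
    {
        for (int out = 1; out <= 5; out++)
            printf("packets_out=%d -> dupthresh=%d\n", out, dupthresh(out, 1));
        return 0;
    }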

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This is a preparation patch that refactors the code to enter recovery
    into a new function tcp_enter_recovery(). It's needed to implement
    the delayed fast retransmit in ER.

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng