03 Jan, 2015

1 commit

  • Thomas Jarosch reported IPsec TCP stalls when a PMTU event occurs.

    In fact the problem was completely unrelated to IPsec. The bug is
    also reproducible if you just disable TSO/GSO.

    The problem is that when the MSS goes down, existing queued packets
    on the TX queue that have not yet been transmitted all look like
    TSO packets and get treated as such.

    This then triggers a bug where tcp_mss_split_point tells us to
    generate a zero-sized packet on the TX queue. Once that happens
    we're screwed because the zero-sized packet can never be removed
    by ACKs.
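
    As a rough illustration of the idea behind the fix (a simplified,
    standalone sketch, not the actual kernel patch), the device TSO segment
    limit should only ever be applied to skbs that genuinely carry more than
    one segment:

    /* Hypothetical helper; tcp_mss_split_point() in the kernel is richer. */
    static unsigned int split_point(unsigned int skb_len, unsigned int pcount,
                                    unsigned int mss_now, unsigned int max_segs)
    {
        if (pcount <= 1)
            return skb_len;            /* non-TSO packet: never split it */
        if (pcount <= max_segs)
            return skb_len;            /* already within the device limit */
        return max_segs * mss_now;     /* split at the device segment limit */
    }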

    Fixes: 1485348d242 ("tcp: Apply device TSO segment limit earlier")
    Reported-by: Thomas Jarosch
    Signed-off-by: Herbert Xu

    Signed-off-by: David S. Miller

    Herbert Xu
     

11 Dec, 2014

1 commit


10 Dec, 2014

2 commits

  • Commit 95bd09eb2750 ("tcp: TSO packets automatic sizing") tried to
    control TSO size, but did this at the wrong place (sendmsg() time)

    At sendmsg() time, we might have a pessimistic view of flow rate,
    and we end up building very small skbs (with 2 MSS per skb).

    This is bad because:

    - It sends small TSO packets even in Slow Start, where the rate quickly
    increases.
    - It tends to make the socket write queue very big, increasing tcp_ack()
    processing time, but also increasing memory needs, not necessarily
    accounted for, as fast clone overhead is currently ignored.
    - It lowers GRO efficiency and generates more ACK packets.

    Servers with a lot of short-lived connections suffer from this.

    Let's instead fill skbs as much as possible (64KB of payload), but split
    them at xmit time, when we have a precise idea of the flow rate.
    The skb split is actually quite efficient.

    The patch looks bigger than necessary, because the TCP Small Queues
    decision now has to take place after the eventual split.

    As Neal suggested, introduce a new tcp_tso_autosize() helper, so that
    tcp_tso_should_defer() can be synchronized on the same goal.

    Rename tp->xmit_size_goal_segs to tp->gso_segs, as this variable now
    contains the number of MSS that we can put in a GSO packet, and is no
    longer related to the autosizing goal.
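
    As a rough sketch of the autosizing idea (the 1 ms horizon and the units
    here are assumptions for illustration, not the kernel's exact constants),
    each GSO burst is sized from the flow rate and clamped between 2 MSS and
    the 64KB payload goal:

    static unsigned int autosize_segs(unsigned long rate_bytes_per_sec,
                                      unsigned int mss_now)
    {
        unsigned long bytes = rate_bytes_per_sec / 1000; /* ~1 ms of payload */
        unsigned int segs = bytes / mss_now;

        if (segs < 2)
            segs = 2;                       /* never cook less than 2 MSS */
        if (segs > 65536 / mss_now)
            segs = 65536 / mss_now;         /* cap at the 64KB payload goal */
        return segs;
    }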

    Tested:

    40 ms rtt link

    nstat >/dev/null
    netperf -H remote -l -2000000 -- -s 1000000
    nstat | egrep "IpInReceives|IpOutRequests|TcpOutSegs|IpExtOutOctets"

    Before patch :

    Recv   Send    Send
    Socket Socket  Message  Elapsed
    Size   Size    Size     Time     Throughput
    bytes  bytes   bytes    secs.    10^6bits/s

     87380 2000000 2000000    0.36      44.22
    IpInReceives                    600                0.0
    IpOutRequests                   599                0.0
    TcpOutSegs                      1397               0.0
    IpExtOutOctets                  2033249            0.0

    After patch :

    Recv   Send    Send
    Socket Socket  Message  Elapsed
    Size   Size    Size     Time     Throughput
    bytes  bytes   bytes    secs.    10^6bits/sec

     87380 2000000 2000000    0.36      44.27
    IpInReceives                    221                0.0
    IpOutRequests                   232                0.0
    TcpOutSegs                      1397               0.0
    IpExtOutOctets                  2013953            0.0

    Signed-off-by: Eric Dumazet
    Signed-off-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Note that the code _using_ ->msg_iter at that point will be very
    unhappy with anything other than unshifted iovec-backed iov_iter.
    We still need to convert users to proper primitives.

    Signed-off-by: Al Viro

    Al Viro
     

20 Nov, 2014

1 commit

  • While working on sk_forward_alloc problems reported by Denys
    Fedoryshchenko, we found that tcp connect() (and fastopen) do not call
    sk_wmem_schedule() for the SYN packet (and/or the SYN+DATA packet), so
    sk_forward_alloc is negative while connect() is in progress.

    We can fix this by calling the regular sk_stream_alloc_skb() both for the
    SYN packet (in tcp_connect()) and the syn_data packet in
    tcp_send_syn_data().

    Then, tcp_send_syn_data() can avoid copying syn_data, as we can simply
    manipulate syn_data->cb[] to remove the SYN flag (and increment seq).

    Instead of open coding memcpy_fromiovecend(), simply use this helper.

    This leaves clean fast clone skbs in the socket write queue.
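
    A simplified, standalone sketch of the accounting invariant the fix
    restores (the types and the page-sized reservation here are assumptions,
    not the real kernel helpers): every queued skb, including the SYN and
    SYN+DATA skbs, reserves forward-alloc memory before being charged.

    struct sk_acct {
        long forward_alloc;   /* bytes reserved but not yet charged */
        long wmem_queued;     /* bytes charged to queued skbs */
    };

    static void acct_schedule(struct sk_acct *sk, int size)
    {
        while (sk->forward_alloc < size)
            sk->forward_alloc += 4096;  /* pretend the memory pool granted a page */
    }

    static void acct_charge_skb(struct sk_acct *sk, int truesize)
    {
        acct_schedule(sk, truesize);    /* the step that was missing for SYN skbs */
        sk->forward_alloc -= truesize;  /* stays >= 0 thanks to the reservation */
        sk->wmem_queued   += truesize;
    }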

    This was tested against our fastopen packetdrill tests.

    Reported-by: Denys Fedoryshchenko
    Signed-off-by: Eric Dumazet
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Nov, 2014

1 commit

    In the DC world, GSO packets initially cooked by tcp_sendmsg() are usually
    big, as sk_pacing_rate is high.

    When the network is congested, cwnd can be smaller than the GSO packets
    found in the socket write queue. tcp_write_xmit() splits GSO packets
    using the available cwnd, and we end up sending a single GSO packet,
    consuming all of the available cwnd.

    With GRO aggregation on the receiver, we might handle a single GRO
    packet, sending back a single ACK.

    1) This single ACK might be lost, and TLP or RTO is then forced to
    attempt a retransmit.
    2) This ACK releases a full cwnd, and the sender sends another big GSO
    packet, in a ping-pong mode.

    This behavior does not fill the pipes in the best way, because of
    scheduling artifacts.

    Make sure we always have at least two GSO packets in flight.

    This allows us to safely increase GRO efficiency without risking
    spurious retransmits.
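
    A minimal sketch of the rule this implies (standalone parameters; the
    in-kernel logic lives in the TSO sizing and splitting paths): never let a
    single burst exceed half the congestion window.

    static unsigned int burst_limit(unsigned int cwnd_segs, unsigned int mss_now,
                                    unsigned int autosized_bytes)
    {
        unsigned int half_cwnd_bytes = (cwnd_segs / 2) * mss_now;

        if (half_cwnd_bytes < mss_now)
            half_cwnd_bytes = mss_now;             /* always allow one MSS */
        return autosized_bytes < half_cwnd_bytes ?
               autosized_bytes : half_cwnd_bytes;  /* min(autosize, cwnd/2) */
    }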

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

05 Nov, 2014

1 commit

    This patch allows setting ECN on a per-route basis in case the sysctl
    tcp_ecn is not set to 1. In other words, when ECN is set for specific
    routes, it provides a tcp_ecn=1 behaviour for that route while the rest
    of the stack acts according to the global settings.

    One can use 'ip route change dev $dev $net features ecn' to toggle this.

    Having a more fine-grained per-route setting can be beneficial for various
    reasons, for example, 1) within data centers, or 2) local ISPs may deploy
    ECN support for their own video/streaming services [1], etc.
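
    As a minimal sketch, the resulting decision composes roughly as follows
    (standalone parameters; the kernel reads the sysctl and the route's
    feature bits internally):

    static int use_ecn_for_new_connection(int sysctl_tcp_ecn,
                                          int route_has_ecn_feature)
    {
        return sysctl_tcp_ecn == 1 || route_has_ecn_feature;
    }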

    There was a recent measurement study/paper [2] which scanned the Alexa's
    publicly available top million websites list from a vantage point in US,
    Europe and Asia:

    Half of the Alexa list will now happily use ECN (tcp_ecn=2, most likely
    owing to commit 255cac91c3 ("tcp: extend ECN sysctl to allow server-side
    only ECN")); the break in on-path connectivity was found in about
    1 in 10,000 cases. Timeouts, rather than RSTs being received back, were
    much more common in the negotiation phase (and mostly seen in the Alexa
    middle band, ranks around 50k-150k): of the 12 thousand hosts on which
    there _may_ be ECN-linked connection failures, only 79 failed with an RST
    that did not also occur when ECN was not requested.

    It's unclear though, how much equipment in the wild actually marks CE
    when buffers start to fill up.

    We thought about a fallback to non-ECN for retransmitted SYNs as another
    global option (which could perhaps one day be made default), but as Eric
    points out, there's much more work needed to detect broken middleboxes.

    Two examples Eric mentioned are buggy firewalls that accept only a single
    SYN per flow, and middleboxes that successfully let an ECN flow establish,
    but later mark CE for all packets (so cwnd converges to 1).

    [1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15
    [2] http://ecn.ethz.ch/

    Joint work with Daniel Borkmann.

    Reference: http://thread.gmane.org/gmane.linux.network/335797
    Suggested-by: Hannes Frederic Sowa
    Acked-by: Eric Dumazet
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

31 Oct, 2014

1 commit

    Some drivers are unable to perform TX completions in a bounded time.
    They instead call skb_orphan().

    The problem is that skb_fclone_busy() has to detect this case; otherwise
    we block TCP retransmits and can freeze unlucky TCP sessions on
    mostly idle hosts.

    Signed-off-by: Eric Dumazet
    Fixes: 1f3279ae0c13 ("tcp: avoid retransmits of TCP packets hanging in host queues")
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Oct, 2014

1 commit

  • Pull networking fixes from David Miller:

    1) Include fixes for netrom and dsa (Fabian Frederick and Florian
    Fainelli)

    2) Fix FIXED_PHY support in stmmac, from Giuseppe CAVALLARO.

    3) Several SKB use after free fixes (vxlan, openvswitch, vxlan,
    ip_tunnel, fou), from Li RongQing.

    4) fec driver PTP support fixes from Luwei Zhou and Nimrod Andy.

    5) Use after free in virtio_net, from Michael S Tsirkin.

    6) Fix flow mask handling for megaflows in openvswitch, from Pravin B
    Shelar.

    7) ISDN gigaset and capi bug fixes from Tilman Schmidt.

    8) Fix route leak in ip_send_unicast_reply(), from Vasily Averin.

    9) Fix two eBPF JIT bugs on x86, from Alexei Starovoitov.

    10) TCP_SKB_CB() reorganization caused a few regressions, fixed by Cong
    Wang and Eric Dumazet.

    11) Don't overwrite end of SKB when parsing malformed sctp ASCONF
    chunks, from Daniel Borkmann.

    12) Don't call sock_kfree_s() with NULL pointers, this function also has
    the side effect of adjusting the socket memory usage. From Cong Wang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits)
    bna: fix skb->truesize underestimation
    net: dsa: add includes for ethtool and phy_fixed definitions
    openvswitch: Set flow-key members.
    netrom: use linux/uaccess.h
    dsa: Fix conversion from host device to mii bus
    tipc: fix bug in bundled buffer reception
    ipv6: introduce tcp_v6_iif()
    sfc: add support for skb->xmit_more
    r8152: return -EBUSY for runtime suspend
    ipv4: fix a potential use after free in fou.c
    ipv4: fix a potential use after free in ip_tunnel_core.c
    hyperv: Add handling of IP header with option field in netvsc_set_hash()
    openvswitch: Create right mask with disabled megaflows
    vxlan: fix a free after use
    openvswitch: fix a use after free
    ipv4: dst_entry leak in ip_send_unicast_reply()
    ipv4: clean up cookie_v4_check()
    ipv4: share tcp_v4_save_options() with cookie_v4_check()
    ipv4: call __ip_options_echo() in cookie_v4_check()
    atm: simplify lanai.c by using module_pci_driver
    ...

    Linus Torvalds
     

15 Oct, 2014

3 commits

  • Pull percpu consistent-ops changes from Tejun Heo:
    "Way back, before the current percpu allocator was implemented, static
    and dynamic percpu memory areas were allocated and handled separately
    and had their own accessors. The distinction has been gone for many
    years now; however, the now duplicate two sets of accessors remained
    with the pointer based ones - this_cpu_*() - evolving various other
    operations over time. During the process, we also accumulated other
    inconsistent operations.

    This pull request contains Christoph's patches to clean up the
    duplicate accessor situation. __get_cpu_var() uses are replaced with
    this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

    Unfortunately, the former sometimes is tricky thanks to C being a bit
    messy with the distinction between lvalues and pointers, which led to
    a rather ugly solution for cpumask_var_t involving the introduction of
    this_cpu_cpumask_var_ptr().

    This converts most of the uses but not all. Christoph will follow up
    with the remaining conversions in this merge window and hopefully
    remove the obsolete accessors"

    * 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
    irqchip: Properly fetch the per cpu offset
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
    ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
    Revert "powerpc: Replace __get_cpu_var uses"
    percpu: Remove __this_cpu_ptr
    clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
    sparc: Replace __get_cpu_var uses
    avr32: Replace __get_cpu_var with __this_cpu_write
    blackfin: Replace __get_cpu_var uses
    tile: Use this_cpu_ptr() for hardware counters
    tile: Replace __get_cpu_var uses
    powerpc: Replace __get_cpu_var uses
    alpha: Replace __get_cpu_var
    ia64: Replace __get_cpu_var uses
    s390: cio driver &__get_cpu_var replacements
    s390: Replace __get_cpu_var uses
    mips: Replace __get_cpu_var uses
    MIPS: Replace __get_cpu_var uses in FPU emulator.
    arm: Replace __this_cpu_ptr with raw_cpu_ptr
    ...

    Linus Torvalds
     
    TCP Small Queues tries to keep the number of packets in the qdisc
    as small as possible, and depends on a tasklet to feed the following
    packets at TX completion time.
    The choice of a tasklet was driven by latency requirements.

    Then, the TCP stack tries to avoid reorders by locking flows with
    outstanding packets in the qdisc to a given TX queue.

    What can happen is that many flows get attracted to a low-performing
    TX queue, and the CPU servicing TX completions has to feed packets for all
    of them, making this CPU 100% busy in softirq mode.

    This became particularly visible with the recent skb->xmit_more support.

    The strategy adopted in this patch is to detect when tcp_wfree() is called
    from ksoftirqd and let the outstanding queue for this flow drain
    before feeding additional packets, so that skb->ooo_okay can be set
    to allow select_queue() to select the optimal queue.

    Incoming ACKs are normally handled by different CPUs, so this patch
    gives these CPUs a better chance to take over the burden of feeding the
    qdisc with future packets.

    Tested:

    lpaa23:~# ./super_netperf 1400 --google-pacing-rate 3028000 -H lpaa24 -l 3600 &

    lpaa23:~# sar -n DEV 1 10 | grep eth1
    06:16:18 AM eth1 595448.00 1190564.00 38381.09 1760253.12 0.00 0.00 1.00
    06:16:19 AM eth1 594858.00 1189686.00 38340.76 1758952.72 0.00 0.00 0.00
    06:16:20 AM eth1 597017.00 1194019.00 38480.79 1765370.29 0.00 0.00 1.00
    06:16:21 AM eth1 595450.00 1190936.00 38380.19 1760805.05 0.00 0.00 0.00
    06:16:22 AM eth1 596385.00 1193096.00 38442.56 1763976.29 0.00 0.00 1.00
    06:16:23 AM eth1 598155.00 1195978.00 38552.97 1768264.60 0.00 0.00 0.00
    06:16:24 AM eth1 594405.00 1188643.00 38312.57 1757414.89 0.00 0.00 1.00
    06:16:25 AM eth1 593366.00 1187154.00 38252.16 1755195.83 0.00 0.00 0.00
    06:16:26 AM eth1 593188.00 1186118.00 38232.88 1753682.57 0.00 0.00 1.00
    06:16:27 AM eth1 596301.00 1192241.00 38440.94 1762733.09 0.00 0.00 0.00
    Average: eth1 595457.30 1190843.50 38381.69 1760664.84 0.00 0.00 0.50
    lpaa23:~# ./tc -s -d qd sh dev eth1 | grep backlog
    backlog 7606336b 2513p requeues 167982
    backlog 224072b 74p requeues 566
    backlog 581376b 192p requeues 5598
    backlog 181680b 60p requeues 1070
    backlog 5305056b 1753p requeues 110166 // Here, this TX queue is attracting flows
    backlog 157456b 52p requeues 1758
    backlog 672216b 222p requeues 3025
    backlog 60560b 20p requeues 24541
    backlog 448144b 148p requeues 21258

    lpaa23:~# echo 1 >/proc/sys/net/ipv4/tcp_tsq_enable_tcp_wfree_ksoftirqd_detect

    Immediate jump to full bandwidth, and traffic is properly
    shared across all TX queues.

    lpaa23:~# sar -n DEV 1 10 | grep eth1
    06:16:46 AM eth1 1397632.00 2795397.00 90081.87 4133031.26 0.00 0.00 1.00
    06:16:47 AM eth1 1396874.00 2793614.00 90032.99 4130385.46 0.00 0.00 0.00
    06:16:48 AM eth1 1395842.00 2791600.00 89966.46 4127409.67 0.00 0.00 1.00
    06:16:49 AM eth1 1395528.00 2791017.00 89946.17 4126551.24 0.00 0.00 0.00
    06:16:50 AM eth1 1397891.00 2795716.00 90098.74 4133497.39 0.00 0.00 1.00
    06:16:51 AM eth1 1394951.00 2789984.00 89908.96 4125022.51 0.00 0.00 0.00
    06:16:52 AM eth1 1394608.00 2789190.00 89886.90 4123851.36 0.00 0.00 1.00
    06:16:53 AM eth1 1395314.00 2790653.00 89934.33 4125983.09 0.00 0.00 0.00
    06:16:54 AM eth1 1396115.00 2792276.00 89984.25 4128411.21 0.00 0.00 1.00
    06:16:55 AM eth1 1396829.00 2793523.00 90030.19 4130250.28 0.00 0.00 0.00
    Average: eth1 1396158.40 2792297.00 89987.09 4128439.35 0.00 0.00 0.50

    lpaa23:~# tc -s -d qd sh dev eth1 | grep backlog
    backlog 7900052b 2609p requeues 173287
    backlog 878120b 290p requeues 589
    backlog 1068884b 354p requeues 5621
    backlog 996212b 329p requeues 1088
    backlog 984100b 325p requeues 115316
    backlog 956848b 316p requeues 1781
    backlog 1080996b 357p requeues 3047
    backlog 975016b 322p requeues 24571
    backlog 990156b 327p requeues 21274

    (All 8 TX queues get a fair share of the traffic)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
    TCP Small Queues (tcp_tsq_handler()) can hold one reference on
    sk->sk_wmem_alloc, preventing skb->ooo_okay from being set.

    We should relax the test done to set skb->ooo_okay to take care
    of this extra reference.

    The minimal truesize of an skb containing one byte of payload is
    SKB_TRUESIZE(1).

    Without this fix, we have a greater chance of locking flows onto the wrong
    transmit queue.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Oct, 2014

1 commit

    Let's use a proper structure to clearly document and implement
    skb fast clones.

    Then, we can more easily experiment with alternative layouts.

    This patch adds a new skb_fclone_busy() helper, used by TCP and xfrm,
    to stop leaking implementation details.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Sep, 2014

1 commit

    Suggested by Stephen. Also drop the inline keyword and let the compiler
    decide.

    gcc 4.7.3 decides to no longer inline tcp_ecn_check_ce, so split it up.
    The actual evaluation is no longer inlined, while the ECN_OK test is.

    Suggested-by: Stephen Hemminger
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

29 Sep, 2014

5 commits

  • This work adds the DataCenter TCP (DCTCP) congestion control
    algorithm [1], which has been first published at SIGCOMM 2010 [2],
    resp. follow-up analysis at SIGMETRICS 2011 [3] (and also, more
    recently as an informational IETF draft available at [4]).

    DCTCP is an enhancement to the TCP congestion control algorithm for
    data center networks. Typical data center workloads are, for example:
    i) partition/aggregate (queries; bursty, delay sensitive), ii) short
    messages e.g. 50KB-1MB (for coordination and control state; delay
    sensitive), and iii) large flows e.g. 1MB-100MB (data update;
    throughput sensitive). DCTCP has therefore been designed for such
    environments to provide/achieve the following three requirements:

    * High burst tolerance (incast due to partition/aggregate)
    * Low latency (short flows, queries)
    * High throughput (continuous data updates, large file
    transfers) with commodity, shallow buffered switches

    The basic idea of its design consists of two fundamentals: i) on the
    switch side, packets are marked when the switch's internal queue
    length exceeds a threshold K (K is chosen so that a large enough headroom
    for marked traffic is still available in the switch queue); ii) the
    sender/host side maintains a moving average of the fraction of marked
    packets, so each RTT, F is updated as follows:

    F := X / Y, where X is # of marked ACKs, Y is total # of ACKs
    alpha := (1 - g) * alpha + g * F, where g is a smoothing constant

    The resulting alpha (iow: probability that switch queue is congested)
    is then being used in order to adaptively decrease the congestion
    window W:

    W := (1 - (alpha / 2)) * W
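
    As a rough illustration, the control law above can be sketched as follows
    (a simplified floating-point version; the in-kernel module uses fixed-point
    arithmetic and different state names):

    struct dctcp_sketch {
        double alpha;   /* moving average of the marked fraction */
    };

    static void update_alpha(struct dctcp_sketch *ca,
                             unsigned int acked_total,
                             unsigned int acked_marked)
    {
        const double g = 1.0 / 16;   /* smoothing gain used in the paper */
        double f = acked_total ? (double)acked_marked / acked_total : 0.0;

        ca->alpha = (1 - g) * ca->alpha + g * f;   /* alpha := (1-g)*alpha + g*F */
    }

    static unsigned int cwnd_on_congestion(const struct dctcp_sketch *ca,
                                           unsigned int cwnd)
    {
        unsigned int new_cwnd = (unsigned int)(cwnd * (1 - ca->alpha / 2));

        return new_cwnd > 2 ? new_cwnd : 2;        /* W := (1 - alpha/2) * W */
    }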

    The means of marking packets on the switch side, and of receiving those
    marks at the sender, is ECN.

    RFC3168 describes a mechanism for using Explicit Congestion Notification
    from the switch for early detection of congestion, rather than waiting
    for segment loss to occur.

    However, this method only detects the presence of congestion, not
    the *extent*. In the presence of mild congestion, it reduces the TCP
    congestion window too aggressively and unnecessarily affects the
    throughput of long flows [4].

    DCTCP, as mentioned, enhances Explicit Congestion Notification (ECN)
    processing to estimate the fraction of bytes that encounter congestion,
    rather than simply detecting that some congestion has occurred. DCTCP
    then scales the TCP congestion window based on this estimate [4],
    thus it can derive multibit feedback from the information present in
    the single-bit sequence of marks in its control law. And thus act in
    *proportion* to the extent of congestion, not its *presence*.

    Switches therefore set the Congestion Experienced (CE) codepoint in
    packets when internal queue lengths exceed threshold K. As a result,
    DCTCP delivers the same or better throughput than normal TCP, while
    using 90% less buffer space.

    It was found in [2] that DCTCP enables the applications to handle 10x
    the current background traffic, without impacting foreground traffic.
    Moreover, a 10x increase in foreground traffic did not cause any
    timeouts, and thus largely eliminates TCP incast collapse problems.

    The algorithm itself has already seen deployments in large production
    data centers since then.

    We did a long-term stress-test and analysis in a data center, short
    summary of our TCP incast tests with iperf compared to cubic:

    This test measured DCTCP throughput and latency and compared it with
    CUBIC throughput and latency for an incast scenario. In this test, 19
    senders sent at maximum rate to a single receiver. The receiver simply
    ran iperf -s.

    The senders ran iperf -c -t 30. All senders started
    simultaneously (using local clocks synchronized by ntp).

    This test was repeated multiple times. Below shows the results from a
    single test. Other tests are similar. (DCTCP results were extremely
    consistent, CUBIC results show some variance induced by the TCP timeouts
    that CUBIC encountered.)

    For this test, we report statistics on the number of TCP timeouts,
    flow throughput, and traffic latency.

    1) Timeouts (total over all flows, and per flow summaries):

                 CUBIC        DCTCP
    Total        3227         25
    Mean         169.842      1.316
    Median       183          1
    Max          207          5
    Min          123          0
    Stddev       28.991       1.600

    Timeout data is taken by measuring the net change in netstat -s
    "other TCP timeouts" reported. As a result, the timeout measurements
    above are not restricted to the test traffic, and we believe that it
    is likely that all of the "DCTCP timeouts" are actually timeouts for
    non-test traffic. We report them nevertheless. CUBIC will also include
    some non-test timeouts, but they are drawfed by bona fide test traffic
    timeouts for CUBIC. Clearly DCTCP does an excellent job of preventing
    TCP timeouts. DCTCP reduces timeouts by at least two orders of
    magnitude and may well have eliminated them in this scenario.

    2) Throughput (per flow in Mbps):

                 CUBIC        DCTCP
    Mean         521.684      521.895
    Median       464          523
    Max          776          527
    Min          403          519
    Stddev       105.891      2.601
    Fairness     0.962        0.999

    Throughput data was simply the average throughput for each flow
    reported by iperf. By avoiding TCP timeouts, DCTCP is able to
    achieve much better per-flow results. In CUBIC, many flows
    experience TCP timeouts which makes flow throughput unpredictable and
    unfair. DCTCP, on the other hand, provides very clean predictable
    throughput without incurring TCP timeouts. Thus, the standard deviation
    of CUBIC throughput is dramatically higher than the standard deviation
    of DCTCP throughput.

    Mean throughput is nearly identical because even though cubic flows
    suffer TCP timeouts, other flows will step in and fill the unused
    bandwidth. Note that this test is something of a best case scenario
    for incast under CUBIC: it allows other flows to fill in for flows
    experiencing a timeout. Under situations where the receiver is issuing
    requests and then waiting for all flows to complete, flows cannot fill
    in for timed out flows and throughput will drop dramatically.

    3) Latency (in ms):

                 CUBIC        DCTCP
    Mean         4.0088       0.04219
    Median       4.055        0.0395
    Max          4.2          0.085
    Min          3.32         0.028
    Stddev       0.1666       0.01064

    Latency for each protocol was computed by running "ping -i 0.2
    " from a single sender to the receiver during the incast
    test. For DCTCP, "ping -Q 0x6 -i 0.2 " was used to ensure
    that traffic traversed the DCTCP queue and was not dropped when the
    queue size was greater than the marking threshold. The summary
    statistics above are over all ping metrics measured between the single
    sender, receiver pair.

    The latency results for this test show a dramatic difference between
    CUBIC and DCTCP. CUBIC intentionally overflows the switch buffer
    which incurs the maximum queue latency (more buffer memory will lead
    to high latency.) DCTCP, on the other hand, deliberately attempts to
    keep queue occupancy low. The result is a two orders of magnitude
    reduction of latency with DCTCP - even with a switch with relatively
    little RAM. Switches with larger amounts of RAM will incur increasing
    amounts of latency for CUBIC, but not for DCTCP.

    4) Convergence and stability test:

    This test measured the time that DCTCP took to fairly redistribute
    bandwidth when a new flow commences. It also measured DCTCP's ability
    to remain stable at a fair bandwidth distribution. DCTCP is compared
    with CUBIC for this test.

    At the commencement of this test, a single flow is sending at maximum
    rate (near 10 Gbps) to a single receiver. One second after that first
    flow commences, a new flow from a distinct server begins sending to
    the same receiver as the first flow. After the second flow has sent
    data for 10 seconds, the second flow is terminated. The first flow
    sends for an additional second. Ideally, the bandwidth would be evenly
    shared as soon as the second flow starts, and recover as soon as it
    stops.

    The results of this test are shown below. Note that the flow bandwidth
    for the two flows was measured near the same time, but not
    simultaneously.

    DCTCP performs nearly perfectly within the measurement limitations
    of this test: bandwidth is quickly distributed fairly between the two
    flows, remains stable throughout the duration of the test, and
    recovers quickly. CUBIC, in contrast, is slow to divide the bandwidth
    fairly, and has trouble remaining stable.

             CUBIC                           DCTCP

    Seconds  Flow 1  Flow 2         Seconds  Flow 1  Flow 2
      0       9.93    0               0       9.92    0
      0.5     9.87    0               0.5     9.86    0
      1       8.73    2.25            1       6.46    4.88
      1.5     7.29    2.8             1.5     4.9     4.99
      2       6.96    3.1             2       4.92    4.94
      2.5     6.67    3.34            2.5     4.93    5
      3       6.39    3.57            3       4.92    4.99
      3.5     6.24    3.75            3.5     4.94    4.74
      4       6       3.94            4       5.34    4.71
      4.5     5.88    4.09            4.5     4.99    4.97
      5       5.27    4.98            5       4.83    5.01
      5.5     4.93    5.04            5.5     4.89    4.99
      6       4.9     4.99            6       4.92    5.04
      6.5     4.93    5.1             6.5     4.91    4.97
      7       4.28    5.8             7       4.97    4.97
      7.5     4.62    4.91            7.5     4.99    4.82
      8       5.05    4.45            8       5.16    4.76
      8.5     5.93    4.09            8.5     4.94    4.98
      9       5.73    4.2             9       4.92    5.02
      9.5     5.62    4.32            9.5     4.87    5.03
      10      6.12    3.2             10      4.91    5.01
      10.5    6.91    3.11            10.5    4.87    5.04
      11      8.48    0               11      8.49    4.94
      11.5    9.87    0               11.5    9.9     0

    SYN/ACK ECT test:

    This test demonstrates the importance of ECT on SYN and SYN-ACK packets
    by measuring the connection probability in the presence of competing
    flows for a DCTCP connection attempt *without* ECT in the SYN packet.
    The test was repeated five times for each number of competing flows.

    Competing Flows                   1 |    2 |    4 |    8 |  16
    --------------------------------------------------------------
    Mean Connection Probability       1 | 0.67 | 0.45 | 0.28 |   0
    Median Connection Probability    1 | 0.65 | 0.45 | 0.25 |   0

    As the number of competing flows moves beyond 1, the connection
    probability drops rapidly.

    Enabling DCTCP with this patch requires the following steps:

    DCTCP must be running both on the sender and receiver side in your
    data center, i.e.:

    sysctl -w net.ipv4.tcp_congestion_control=dctcp

    Also, ECN functionality must be enabled on all switches in your
    data center for DCTCP to work. The default ECN marking threshold (K)
    heuristic on the switch for DCTCP is e.g., 20 packets (30KB) at
    1Gbps, and 65 packets (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]).

    In above tests, for each switch port, traffic was segregated into two
    queues. For any packet with a DSCP of 0x01 - or equivalently a TOS of
    0x04 - the packet was placed into the DCTCP queue. All other packets
    were placed into the default drop-tail queue. For the DCTCP queue,
    RED/ECN marking was enabled, here, with a marking threshold of 75 KB.
    For more details, however, we refer you to the paper [2], section 3.

    There are no code changes required to applications running in user
    space. DCTCP has been implemented in full *isolation* of the rest of
    the TCP code as its own congestion control module, so that it can run
    without a need to expose code to the core of the TCP stack, and thus
    nothing changes for non-DCTCP users.

    Changes in the CA framework code are minimal, and DCTCP algorithm
    operates on mechanisms that are already available in most Silicon.
    The gain (dctcp_shift_g) is currently a fixed constant (1/16) from
    the paper, but we leave the option that it can be chosen carefully
    to a different value by the user.

    In case DCTCP is being used and ECN support on peer site is off,
    DCTCP falls back after 3WHS to operate in normal TCP Reno mode.

    ss {-4,-6} -t -i diag interface:

    ... dctcp wscale:7,7 rto:203 rtt:2.349/0.026 mss:1448 cwnd:2054
    ssthresh:1102 ce_state 0 alpha 15 ab_ecn 0 ab_tot 735584
    send 10129.2Mbps pacing_rate 20254.1Mbps unacked:1822 retrans:0/15
    reordering:101 rcv_space:29200

    ... dctcp-reno wscale:7,7 rto:201 rtt:0.711/1.327 ato:40 mss:1448
    cwnd:10 ssthresh:1102 fallback_mode send 162.9Mbps pacing_rate
    325.5Mbps rcv_rtt:1.5 rcv_space:29200

    More information about DCTCP can be found in [1-4].

    [1] http://simula.stanford.edu/~alizade/Site/DCTCP.html
    [2] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
    [3] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
    [4] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00

    Joint work with Florian Westphal and Glenn Judd.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Florian Westphal
    Signed-off-by: Glenn Judd
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
    DataCenter TCP (DCTCP) determines cwnd growth based on ECN information
    and ACK properties, e.g. an ACK that updates the window is treated
    differently than a DUPACK.

    DCTCP also needs to know whether an ACK was a delayed ACK. Furthermore,
    DCTCP implements a CE state machine that keeps track of the CE markings
    of incoming packets.

    Therefore, extend the congestion control framework to provide these
    event types, so that DCTCP can be properly implemented as a normal
    congestion algorithm module outside of the core stack.

    Joint work with Daniel Borkmann and Glenn Judd.

    Signed-off-by: Florian Westphal
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Glenn Judd
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Florian Westphal
     
    This patch adds a flag to TCP congestion control algorithms that allows
    a congestion algorithm to request that IPv4/IPv6 sockets mark their
    transport packets as ECN capable, that is, ECT(0), when required.

    It is currently used and needed in DataCenter TCP (DCTCP), as it
    requires both peers to assert ECT on all IP packets sent - it
    uses ECN feedback (i.e. CE, Congestion Encountered information)
    from switches inside the data center to derive feedback to the
    end hosts.

    Therefore, simply add a new flag to icsk_ca_ops. Note that DCTCP's
    algorithm/behaviour slightly diverges from RFC3168, therefore this
    is only (!) enabled iff the assigned congestion control ops module
    has requested this. By that, we can tightly couple this logic really
    only to the provided congestion control ops.

    Joint work with Florian Westphal and Glenn Judd.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Florian Westphal
    Signed-off-by: Glenn Judd
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
    Our goal is no more than one cache line access per skb in
    a write or receive queue when doing the various walks.

    After recent TCP_SKB_CB() reorganizations, it is almost done.

    The last part is tcp_skb_pcount(), which currently uses
    skb_shinfo(skb)->gso_segs, a terrible choice, because it needs
    3 cache lines in the current kernel (skb->head, skb->end, and
    shinfo->gso_segs are all in 3 different cache lines, far from skb->cb).

    This very simple patch reuses the space currently taken by tcp_tw_isn,
    which is used only on the input path, as tcp_skb_pcount is only needed for
    skbs stored in the write queue.
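
    A sketch of the layout idea (field types simplified; not the exact kernel
    struct): input-only space in the control block is reused to cache the
    segment count needed on the output path, so tcp_skb_pcount() never has to
    touch skb_shinfo().

    struct tcp_skb_cb_sketch {
        unsigned int seq;
        unsigned int end_seq;
        union {
            unsigned int tcp_tw_isn;    /* input path only */
            unsigned int tcp_gso_segs;  /* output path: cached segment count */
        };
    };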

    This considerably speeds up tcp_ack(), granted we avoid shinfo->tx_flags
    to get SKBTX_ACK_TSTAMP, which seems possible.

    This also speeds up all sack processing in general.

    This speeds up tcp_sendmsg() because it no longer has to access/dirty
    shinfo.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
    TCP maintains lists of skbs in the write queue and in the receive queues
    (in-order and out-of-order queues).

    Scanning these lists both in the input and output paths usually requires
    access to skb->next, TCP_SKB_CB(skb)->seq, and TCP_SKB_CB(skb)->end_seq.

    These fields are currently in two different cache lines, meaning we
    waste a lot of memory bandwidth when these queues are big and flows
    have either packet drops or packet reorders.

    We can move TCP_SKB_CB(skb)->header to the end of TCP_SKB_CB, because
    this header is not used in the fast path. This allows TCP to search much
    faster in the skb lists.

    Even with regular flows, we save one cache line miss in fast path.

    Thanks to Christoph Paasch for noticing we need to clean up
    skb->cb[] (IPCB/IP6CB) before entering the IP stack in the TX path,
    and that I forgot IPCB use in tcp_v4_hnd_req() and tcp_v4_save_options().

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Sep, 2014

1 commit

  • While profiling TCP stack, I noticed one useless atomic operation
    in tcp_sendmsg(), caused by skb_header_release().

    It turns out all current skb_header_release() users have a fresh skb
    that no other user can see, so we can avoid one atomic operation.

    Introduce __skb_header_release() to clearly document this.

    This gave me a 1.5 % improvement on TCP_RR workload.
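
    A sketch of the idea with standalone types (the real helper sets
    skb->nohdr and skb_shinfo()->dataref): a freshly built skb is private to
    its creator, so the reference state can be initialized with plain stores
    instead of an atomic read-modify-write.

    struct skb_ref_state {
        int nohdr;      /* header may no longer be modified in place */
        int dataref;    /* packed payload + header reference counts */
    };

    static void header_release_fresh(struct skb_ref_state *skb)
    {
        skb->nohdr = 1;
        skb->dataref = (1 << 16) + 1;   /* one header ref, one payload ref */
    }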

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Sep, 2014

1 commit

    icsk_rto is a 32-bit field, and icsk_backoff can reach 15 by default,
    or more if some sysctls (e.g. tcp_retries2) are changed.

    It is better to use 64-bit arithmetic to perform the
    icsk_rto << icsk_backoff operations.

    As Joe Perches suggested, add a helper for this.

    Yuchung spotted the tcp_v4_err() case.
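
    A minimal sketch of such a helper (the 120 s cap and millisecond units
    are assumptions for illustration):

    #define RTO_MAX_MS (120 * 1000)

    static unsigned int rto_backoff_ms(unsigned int rto_ms, unsigned int backoff)
    {
        unsigned long long when = (unsigned long long)rto_ms << backoff;

        return when > RTO_MAX_MS ? RTO_MAX_MS : (unsigned int)when;
    }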

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Sep, 2014

1 commit

  • The TCP_SKB_CB(skb)->when field no longer exists as of recent change
    7faee5c0d514 ("tcp: remove TCP_SKB_CB(skb)->when"). And in any case,
    tcp_fragment() is called on already-transmitted packets from the
    __tcp_retransmit_skb() call site, so copying timestamps of any kind
    in this spot is quite sensible.

    Signed-off-by: Neal Cardwell
    Reported-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     

06 Sep, 2014

1 commit

  • After commit 740b0f1841f6 ("tcp: switch rtt estimations to usec resolution"),
    we no longer need to maintain timestamps in two different fields.

    TCP_SKB_CB(skb)->when can be removed, as the same information sits in
    skb_mstamp.stamp_jiffies.

    Signed-off-by: Eric Dumazet
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Aug, 2014

1 commit


15 Aug, 2014

2 commits

  • Make sure we use the correct address-family-specific function for
    handling MTU reductions from within tcp_release_cb().

    Previously AF_INET6 sockets were incorrectly always using the IPv6
    code path when sometimes they were handling IPv4 traffic and thus had
    an IPv4 dst.

    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Diagnosed-by: Willem de Bruijn
    Fixes: 563d34d057862 ("tcp: dont drop MTU reduction indications")
    Reviewed-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Neal Cardwell
     
    We don't know the right timestamp for repaired skbs. Wrong RTT estimations
    aren't good, because some congestion modules heavily depend on them.

    This patch adds the TCPCB_REPAIRED flag, which is included in
    TCPCB_RETRANS.

    Thanks to Eric for the advice how to fix this issue.

    This patch fixes the warning:
    [ 879.562947] WARNING: CPU: 0 PID: 2825 at net/ipv4/tcp_input.c:3078 tcp_ack+0x11f5/0x1380()
    [ 879.567253] CPU: 0 PID: 2825 Comm: socket-tcpbuf-l Not tainted 3.16.0-next-20140811 #1
    [ 879.567829] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    [ 879.568177] 0000000000000000 00000000c532680c ffff880039643d00 ffffffff817aa2d2
    [ 879.568776] 0000000000000000 ffff880039643d38 ffffffff8109afbd ffff880039d6ba80
    [ 879.569386] ffff88003a449800 000000002983d6bd 0000000000000000 000000002983d6bc
    [ 879.569982] Call Trace:
    [ 879.570264] [] dump_stack+0x4d/0x66
    [ 879.570599] [] warn_slowpath_common+0x7d/0xa0
    [ 879.570935] [] warn_slowpath_null+0x1a/0x20
    [ 879.571292] [] tcp_ack+0x11f5/0x1380
    [ 879.571614] [] tcp_rcv_established+0x1ed/0x710
    [ 879.571958] [] tcp_v4_do_rcv+0x10a/0x370
    [ 879.572315] [] release_sock+0x89/0x1d0
    [ 879.572642] [] do_tcp_setsockopt.isra.36+0x120/0x860
    [ 879.573000] [] ? rcu_read_lock_held+0x6e/0x80
    [ 879.573352] [] tcp_setsockopt+0x32/0x40
    [ 879.573678] [] sock_common_setsockopt+0x14/0x20
    [ 879.574031] [] SyS_setsockopt+0x80/0xf0
    [ 879.574393] [] system_call_fastpath+0x16/0x1b
    [ 879.574730] ---[ end trace a17cbc38eb8c5c00 ]---

    v2: move the setting of skb->when for repaired skbs into tcp_write_xmit(),
    where it is set for other skbs.

    Fixes: 431a91242d8d ("tcp: timestamp SYN+DATA messages")
    Fixes: 740b0f1841f6 ("tcp: switch rtt estimations to usec resolution")
    Cc: Eric Dumazet
    Cc: Pavel Emelyanov
    Cc: "David S. Miller"
    Signed-off-by: Andrey Vagin
    Signed-off-by: David S. Miller

    Andrey Vagin
     

14 Aug, 2014

1 commit

  • Bytestream timestamps are correlated with a single byte in the skbuff,
    recorded in skb_shinfo(skb)->tskey. When fragmenting skbuffs, ensure
    that the tskey is set for the fragment in which the tskey falls
    (seqno <= tskey < end_seqno).

    The original implementation did not address fragmentation in
    tcp_fragment or tso_fragment. Add code to inspect the sequence numbers
    and move both tskey and the relevant tx_flags if necessary.

    Reported-by: Eric Dumazet
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

17 Jul, 2014

1 commit


16 Jul, 2014

1 commit

  • Since Yuchung's 9b44190dc11 (tcp: refactor F-RTO), tcp_enter_cwr is always
    called with set_ssthresh = 1. Thus, we can remove this argument from
    tcp_enter_cwr. Further, as we remove this one, tcp_init_cwnd_reduction
    is then always called with set_ssthresh = true, and so we can get rid of
    this argument as well.

    Cc: Yuchung Cheng
    Signed-off-by: Christoph Paasch
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Christoph Paasch
     

08 Jul, 2014

2 commits

    The undo code assumes that, upon entering loss recovery, TCP
    1) always retransmits something, and
    2) the retransmission never fails locally (e.g., qdisc drop),

    so undo_marker is set in tcp_enter_recovery() and undo_retrans is
    incremented only when tcp_retransmit_skb() is successful.

    When the assumption is broken because TCP's cwnd is too small to
    retransmit, or because the retransmit fails locally, the next (DUP)ACK
    will incorrectly revert the cwnd and the congestion state in
    tcp_try_undo_dsack() or tcp_may_undo(). Subsequent (DUP)ACKs
    may re-enter the recovery state. The sender repeatedly enters and
    (incorrectly) exits recovery states if the retransmits continue to
    fail locally while receiving (DUP)ACKs.

    The fix is to initialize undo_retrans to -1 and start counting on
    the first retransmission. Always increment undo_retrans even if the
    retransmissions fail locally because they couldn't cause DSACKs to
    undo the cwnd reduction.
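
    A minimal sketch of the bookkeeping described above (standalone
    parameters, not the kernel fields themselves):

    static void note_retransmit_attempt(int *undo_retrans, int segs)
    {
        if (*undo_retrans < 0)
            *undo_retrans = 0;    /* first retransmission starts the count */
        *undo_retrans += segs;    /* counted even if the xmit fails locally,
                                   * since a local failure can never produce
                                   * a DSACK that would justify an undo */
    }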

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • For a connected socket we can precompute the flow hash for setting
    in skb->hash on output. This is a performance advantage over
    calculating the skb->hash for every packet on the connection. The
    computation is done using the common hash algorithm to be consistent
    with computations done for packets of the connection in other states
    where there is no socket (e.g. time-wait, syn-recv, syn-cookies).

    This patch adds sk_txhash to the sock structure. inet_set_txhash and
    ip6_set_txhash functions are added which are called from points in
    TCP and UDP where socket moves to established state.

    skb_set_hash_from_sk is a function which sets skb->hash from the
    sock txhash value. This is called in UDP and TCP transmit path when
    transmitting within the context of a socket.
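
    A rough sketch of the two helpers' roles (simplified standalone types,
    not the kernel structures):

    struct sock_sketch { unsigned int txhash; };
    struct skb_sketch  { unsigned int hash; int sw_hash; };

    static void set_txhash(struct sock_sketch *sk, unsigned int flow_hash)
    {
        sk->txhash = flow_hash ? flow_hash : 1;   /* keep it non-zero */
    }

    static void skb_hash_from_sock(struct skb_sketch *skb,
                                   const struct sock_sketch *sk)
    {
        if (sk->txhash) {
            skb->hash = sk->txhash;   /* reuse the precomputed flow hash */
            skb->sw_hash = 1;
        }
    }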

    Tested: ran super_netperf with 200 TCP_RR streams over a vxlan
    interface (in this case skb_get_hash called on every TX packet to
    create a UDP source port).

    Before fix:

    95.02% CPU utilization
    154/256/505 90/95/99% latencies
    1.13042e+06 tps

    Time in functions:
    0.28% skb_flow_dissect
    0.21% __skb_get_hash

    After fix:

    94.95% CPU utilization
    156/254/485 90/95/99% latencies
    1.15447e+06

    Neither __skb_get_hash nor skb_flow_dissect appear in perf

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

28 Jun, 2014

1 commit


13 Jun, 2014

2 commits

  • Pull networking updates from David Miller:

    1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.

    2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
    Benniston.

    3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
    Mork.

    4) BPF now has a "random" opcode, from Chema Gonzalez.

    5) Add more BPF documentation and improve test framework, from Daniel
    Borkmann.

    6) Support TCP fastopen over ipv6, from Daniel Lee.

    7) Add software TSO helper functions and use them to support software
    TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia.

    8) Support software TSO in fec driver too, from Nimrod Andy.

    9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.

    10) Handle broadcasts more gracefully over macvlan when there are large
    numbers of interfaces configured, from Herbert Xu.

    11) Allow more control over fwmark used for non-socket based responses,
    from Lorenzo Colitti.

    12) Do TCP congestion window limiting based upon measurements, from Neal
    Cardwell.

    13) Support busy polling in SCTP, from Neal Horman.

    14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.

    15) Bridge promisc mode handling improvements from Vlad Yasevich.

    16) Don't use inetpeer entries to implement ID generation any more, it
    performs poorly, from Eric Dumazet.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
    rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
    tcp: fixing TLP's FIN recovery
    net: fec: Add software TSO support
    net: fec: Add Scatter/gather support
    net: fec: Increase buffer descriptor entry number
    net: fec: Factorize feature setting
    net: fec: Enable IP header hardware checksum
    net: fec: Factorize the .xmit transmit function
    bridge: fix compile error when compiling without IPv6 support
    bridge: fix smatch warning / potential null pointer dereference
    via-rhine: fix full-duplex with autoneg disable
    bnx2x: Enlarge the dorq threshold for VFs
    bnx2x: Check for UNDI in uncommon branch
    bnx2x: Fix 1G-baseT link
    bnx2x: Fix link for KR with swapped polarity lane
    sctp: Fix sk_ack_backlog wrap-around problem
    net/core: Add VF link state control policy
    net/fsl: xgmac_mdio is dependent on OF_MDIO
    net/fsl: Make xgmac_mdio read error message useful
    net_sched: drr: warn when qdisc is not work conserving
    ...

    Linus Torvalds
     
    Fix a problem observed when losing a FIN segment that does not
    contain data. In such situations, TLP is unable to recover from
    *any* tail loss and instead adds at least PTO ms to the
    retransmission process, i.e., RTO = RTO + PTO.

    Signed-off-by: Per Hurtig
    Signed-off-by: Eric Dumazet
    Acked-by: Nandita Dukkipati
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Per Hurtig
     

11 Jun, 2014

1 commit


04 Jun, 2014

1 commit

  • …el/git/tip/tip into next

    Pull core locking updates from Ingo Molnar:
    "The main changes in this cycle were:

    - reduced/streamlined smp_mb__*() interface that allows more usecases
    and makes the existing ones less buggy, especially in rarer
    architectures

    - add rwsem implementation comments

    - bump up lockdep limits"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
    rwsem: Add comments to explain the meaning of the rwsem's count field
    lockdep: Increase static allocations
    arch: Mass conversion of smp_mb__*()
    arch,doc: Convert smp_mb__*()
    arch,xtensa: Convert smp_mb__*()
    arch,x86: Convert smp_mb__*()
    arch,tile: Convert smp_mb__*()
    arch,sparc: Convert smp_mb__*()
    arch,sh: Convert smp_mb__*()
    arch,score: Convert smp_mb__*()
    arch,s390: Convert smp_mb__*()
    arch,powerpc: Convert smp_mb__*()
    arch,parisc: Convert smp_mb__*()
    arch,openrisc: Convert smp_mb__*()
    arch,mn10300: Convert smp_mb__*()
    arch,mips: Convert smp_mb__*()
    arch,metag: Convert smp_mb__*()
    arch,m68k: Convert smp_mb__*()
    arch,m32r: Convert smp_mb__*()
    arch,ia64: Convert smp_mb__*()
    ...

    Linus Torvalds
     

23 May, 2014

1 commit

  • Experience with the recent e114a710aa50 ("tcp: fix cwnd limited
    checking to improve congestion control") has shown that there are
    common cases where that commit can cause cwnd to be much larger than
    necessary. This leads to TSO autosizing cooking skbs that are too
    large, among other things.

    The main problems seemed to be:

    (1) That commit attempted to predict the future behavior of the
    connection by looking at the write queue (if TSO or TSQ limit
    sending). That prediction sometimes overestimated future outstanding
    packets.

    (2) That commit always allowed cwnd to grow to twice the number of
    outstanding packets (even in congestion avoidance, where this is not
    needed).

    This commit improves both of these, by:

    (1) Switching to a measurement-based approach where we explicitly
    track the largest number of packets in flight during the past window
    ("max_packets_out"), and remember whether we were cwnd-limited at the
    moment we finished sending that flight.

    (2) Only allowing cwnd to grow to twice the number of outstanding
    packets ("max_packets_out") in slow start. In congestion avoidance
    mode we now only allow cwnd to grow if it was fully utilized.
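
    A minimal sketch of the resulting check (simplified standalone state; the
    kernel tracks the corresponding fields in the tcp_sock):

    struct cc_sketch {
        unsigned int cwnd;
        unsigned int max_packets_out;  /* peak in-flight over the last window */
        int is_cwnd_limited;           /* were we blocked by cwnd at that peak? */
    };

    static int may_raise_cwnd(const struct cc_sketch *tp, int in_slow_start)
    {
        if (in_slow_start)
            return tp->cwnd < 2 * tp->max_packets_out;  /* up to 2x in slow start */
        return tp->is_cwnd_limited;    /* CA: grow only if fully utilized */
    }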

    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     

14 May, 2014

2 commits

    To avoid large code duplication in IPv6, we first need to simplify
    the complicated SYN-ACK sending code in tcp_v4_conn_request().

    To use tcp_v4(6)_send_synack() to send all SYN-ACKs, we need to
    initialize the mini socket's receive window before trying to
    create the child socket and/or building the SYN-ACK packet. So we move
    that initialization from tcp_make_synack() to tcp_v4_conn_request()
    as a new function tcp_openreq_init_req_rwin().

    After this refactoring, the SYN-ACK sending code is simpler, and it becomes
    easier to implement Fast Open for IPv6.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Daniel Lee
    Signed-off-by: Jerry Chu
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Consolidate various cookie checking and generation code to simplify
    the fast open processing. The main goal is to reduce code duplication
    in tcp_v4_conn_request() for IPv6 support.

    Removes two experimental sysctl flags TFO_SERVER_ALWAYS and
    TFO_SERVER_COOKIE_NOT_CHKD used primarily for developmental debugging
    purposes.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Daniel Lee
    Signed-off-by: Jerry Chu
    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

13 May, 2014

1 commit

  • Conflicts:
    drivers/net/ethernet/altera/altera_sgdma.c
    net/netlink/af_netlink.c
    net/sched/cls_api.c
    net/sched/sch_api.c

    The netlink conflict dealt with moving to netlink_capable() and
    netlink_ns_capable() in the 'net' tree vs. supporting 'tc' operations
    in non-init namespaces. These were simple transformations from
    netlink_capable to netlink_ns_capable.

    The Altera driver conflict was simply code removal overlapping some
    void pointer cast cleanups in net-next.

    Signed-off-by: David S. Miller

    David S. Miller