06 Dec, 2016

2 commits

  • tsq_flags being in the same cache line as sk_wmem_alloc
    makes a lot of sense. Both fields are changed from tcp_wfree()
    and more generally by various TSQ-related functions.

    A prior patch made room in struct sock and added sk_tsq_flags;
    this patch deletes tsq_flags from struct tcp_sock.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This is a cleanup to ease code review of the following patches.

    The old 'enum tsq_flags' is renamed, and a new enumeration is added
    with the flag masks used in cmpxchg() operations, as opposed to the
    bit numbers used in single-bit operations.
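
    A minimal sketch of the split, with assumed/abridged names (the
    authoritative definitions live in include/linux/tcp.h and carry more
    entries):

      enum tsq_enum {                   /* bit numbers, for set_bit()-style ops */
          TSQ_THROTTLED,
          TSQ_QUEUED,
          TCP_TSQ_DEFERRED,
          /* ... more deferred-work bits ... */
      };

      enum tsq_flags {                  /* masks, for cmpxchg() on the flags word */
          TSQF_THROTTLED    = (1UL << TSQ_THROTTLED),
          TSQF_QUEUED       = (1UL << TSQ_QUEUED),
          TCPF_TSQ_DEFERRED = (1UL << TCP_TSQ_DEFERRED),
          /* ... */
      };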

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 Dec, 2016

1 commit

  • jiffies-based timestamps allow for easy inference of the number of devices
    behind NAT translators and also make tracking of hosts simpler.

    commit ceaa1fef65a7c2e ("tcp: adding a per-socket timestamp offset")
    added the main infrastructure needed for per-connection ts
    randomization; in particular, writing/reading the on-wire tcp header
    format takes the offset into account, so the rest of the stack can use
    the normal tcp_time_stamp (jiffies).

    So only two items are left:
    - add a tsoffset for request sockets
    - extend the TCP ISN generator to also return another 32-bit number
    in addition to the ISN.

    Re-using the ISN generator also means timestamps are still monotonically
    increasing for the same connection quadruple, i.e. PAWS will still work.

    Includes fixes from Eric Dumazet.

    Signed-off-by: Florian Westphal
    Acked-by: Eric Dumazet
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Florian Westphal
     

30 Nov, 2016

2 commits

  • This patch exports the sender chronograph stats via the socket
    SO_TIMESTAMPING channel. Currently we can instrument how long a
    particular application unit of data was queued in TCP by tracking
    SOF_TIMESTAMPING_TX_SOFTWARE and SOF_TIMESTAMPING_TX_SCHED. Having
    these sender chronograph stats exported simultaneously along with
    these timestamps allows further breaking down of the various sender
    limitations. For example, a video server can tell whether a particular
    chunk of video on a connection took a long time to deliver because
    TCP was constrained by a small receive window; before this patch that
    was not possible to tell without packet traces.

    To prepare these stats, the user needs to set
    SOF_TIMESTAMPING_OPT_STATS and SOF_TIMESTAMPING_OPT_TSONLY flags
    while requesting other SOF_TIMESTAMPING TX timestamps. When the
    timestamps are available in the error queue, the stats are returned
    in a separate control message of type SCM_TIMESTAMPING_OPT_STATS,
    in a list of TLVs (struct nlattr) of types TCP_NLA_BUSY_TIME,
    TCP_NLA_RWND_LIMITED and TCP_NLA_SNDBUF_LIMITED. The unit is microseconds.
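
    A hedged userspace sketch of the flow just described, assuming the
    uapi headers (linux/net_tstamp.h, linux/netlink.h, linux/tcp.h) carry
    these constants and that the stat values are 64-bit microsecond
    counters; the MSG_ERRQUEUE read and cmsg matching are omitted:

      #include <linux/net_tstamp.h>
      #include <linux/netlink.h>
      #include <linux/tcp.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/socket.h>

      /* Request TX timestamps plus the OPT_STATS payload on socket fd. */
      static void enable_tx_stats(int fd)
      {
          unsigned int val = SOF_TIMESTAMPING_TX_SOFTWARE |
                             SOF_TIMESTAMPING_TX_SCHED |
                             SOF_TIMESTAMPING_OPT_TSONLY |
                             SOF_TIMESTAMPING_OPT_STATS;

          setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
      }

      /* Walk the nlattr TLVs of one SCM_TIMESTAMPING_OPT_STATS cmsg taken
       * from a recvmsg(fd, &msg, MSG_ERRQUEUE) read. */
      static void dump_opt_stats(const struct cmsghdr *cm)
      {
          const char *p = (const char *)CMSG_DATA(cm);
          size_t left = cm->cmsg_len - CMSG_LEN(0);

          while (left >= NLA_HDRLEN) {
              const struct nlattr *nla = (const struct nlattr *)p;
              size_t adv = NLA_ALIGN(nla->nla_len);
              uint64_t us;

              if (nla->nla_len < NLA_HDRLEN || adv > left)
                  break;                            /* malformed attribute */
              if (nla->nla_type == TCP_NLA_BUSY_TIME ||
                  nla->nla_type == TCP_NLA_RWND_LIMITED ||
                  nla->nla_type == TCP_NLA_SNDBUF_LIMITED) {
                  memcpy(&us, p + NLA_HDRLEN, sizeof(us));
                  printf("stat %u: %llu usec\n", nla->nla_type,
                         (unsigned long long)us);
              }
              left -= adv;
              p += adv;
          }
      }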

    Signed-off-by: Francis Yan
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Francis Yan
     
  • This patch implements the skeleton of the TCP chronograph
    instrumentation on sender side limits:

    1) idle (unspec)
    2) busy sending data other than 3-4 below
    3) rwnd-limited
    4) sndbuf-limited

    The limits are enumerated in 'tcp_chrono'. Since a connection can in
    theory idle forever, we do not track the actual length of this
    uninteresting idle period. For the rest we track how long the sender
    spends in each limit. At any point during the lifetime of a
    connection, the sender must be in one of the four states.

    If multiple conditions worth tracking in a chronograph hold at once,
    then the highest-priority enum takes precedence over
    the other conditions, so that if something "more interesting"
    starts happening, we stop the previous chrono and start a new one.

    The time unit is jiffies (u32) in order to save space in tcp_sock.
    This implies the application must sample the stats at least once every
    ~49 days (with 1ms jiffies) to avoid counter wrap-around.
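
    A hedged sketch of the enumeration and bookkeeping described above;
    apart from 'tcp_chrono' the names are illustrative, and the in-tree
    code (include/linux/tcp.h, net/ipv4/tcp_output.c) differs in detail:

      enum tcp_chrono {
          TCP_CHRONO_UNSPEC,             /* 1) idle: length not tracked */
          TCP_CHRONO_BUSY,               /* 2) busy sending data */
          TCP_CHRONO_RWND_LIMITED,       /* 3) stalled by the receive window */
          TCP_CHRONO_SNDBUF_LIMITED,     /* 4) stalled by the send buffer */
          __TCP_CHRONO_MAX,
      };

      struct chrono_sketch {
          enum tcp_chrono type;                  /* state currently being timed */
          unsigned int start;                    /* jiffies when it began */
          unsigned int stat[__TCP_CHRONO_MAX];   /* accumulated jiffies per state */
      };

      /* On a state change, fold the elapsed time into the bucket of the state
       * we are leaving; idle periods are deliberately not accumulated. */
      static void chrono_switch(struct chrono_sketch *c,
                                enum tcp_chrono new_type, unsigned int now)
      {
          if (c->type != TCP_CHRONO_UNSPEC)
              c->stat[c->type] += now - c->start;
          c->type = new_type;
          c->start = now;
      }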

    Signed-off-by: Francis Yan
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Francis Yan
     

10 Nov, 2016

1 commit

  • We had various problems in the past in tcp_get_info() and used
    specific synchronization to avoid deadlocks.

    We would like to add more instrumentation points for TCP, and
    avoiding grabbing the socket lock in tcp_get_info() was too costly.

    Being able to lock the socket allows us to provide a consistent set
    of fields.

    inet_diag_dump_icsk() can make sure ehash locks are not
    held any more when tcp_get_info() is called.

    We can remove the syncp added in commit d654976cbf85
    ("tcp: fix a potential deadlock in tcp_get_info()"), but we need
    to use lock_sock_fast() instead of spin_lock_bh(), since the TCP input
    path can now run from process context.
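
    A hedged sketch of the resulting locking pattern, roughly what
    tcp_get_info() does after this change (the in-tree function fills many
    more fields):

      #include <linux/string.h>
      #include <linux/tcp.h>
      #include <net/sock.h>

      static void get_info_snapshot(struct sock *sk, struct tcp_info *info)
      {
          const struct tcp_sock *tp = tcp_sk(sk);
          bool slow;

          memset(info, 0, sizeof(*info));

          /* Caller must not hold any ehash lock here; lock_sock_fast() only
           * spins when the socket is not owned by process context. */
          slow = lock_sock_fast(sk);

          info->tcpi_bytes_acked = tp->bytes_acked;        /* consistent snapshot */
          info->tcpi_bytes_received = tp->bytes_received;

          unlock_sock_fast(sk, slow);
      }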

    Signed-off-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Sep, 2016

5 commits

  • This commit exports two new fields in struct tcp_info:

    tcpi_delivery_rate: The most recent goodput, as measured by
    tcp_rate_gen(). If the socket is limited by the sending
    application (e.g., no data to send), it reports the highest
    measurement instead of the most recent. The unit is bytes per
    second (like other rate fields in tcp_info).

    tcpi_delivery_rate_app_limited: A boolean indicating if the goodput
    was measured when the socket's throughput was limited by the
    sending application.

    This delivery rate information can be useful for applications that
    want to know the current throughput the TCP connection is seeing,
    e.g. adaptive bitrate video streaming. It can also be very useful for
    debugging or troubleshooting.
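
    A hedged userspace sketch of reading the two new fields, assuming the
    installed linux/tcp.h declares them:

      #include <linux/tcp.h>
      #include <netinet/in.h>            /* IPPROTO_TCP */
      #include <stdio.h>
      #include <sys/socket.h>

      /* Print the latest delivery-rate sample for a connected TCP socket. */
      static void print_delivery_rate(int fd)
      {
          struct tcp_info ti;
          socklen_t len = sizeof(ti);

          if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
              printf("delivery rate: %llu bytes/sec%s\n",
                     (unsigned long long)ti.tcpi_delivery_rate,
                     ti.tcpi_delivery_rate_app_limited ? " (app-limited)" : "");
      }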

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This commit adds code to track whether the delivery rate represented
    by each rate_sample was limited by the application.

    Upon each transmit, we store in the is_app_limited field in the skb a
    boolean bit indicating whether there is a known "bubble in the pipe":
    a point in the rate sample interval where the sender was
    application-limited, and did not transmit even though the cwnd and
    pacing rate allowed it.

    This logic marks the flow app-limited on a write if *all* of the
    following are true:

    1) There is less than 1 MSS of unsent data in the write queue
    available to transmit.

    2) There is no packet in the sender's queues (e.g. in fq or the NIC
    tx queue).

    3) The connection is not limited by cwnd.

    4) There are no lost packets to retransmit.

    The tcp_rate_check_app_limited() code in tcp_rate.c determines whether
    the connection is application-limited at the moment. If the flow is
    application-limited, it sets the tp->app_limited field. If the flow is
    application-limited then that means there is effectively a "bubble" of
    silence in the pipe now, and this silence will be reflected in a lower
    bandwidth sample for any rate samples from now until we get an ACK
    indicating this bubble has exited the pipe: specifically, until we get
    an ACK for the next packet we transmit.

    When we send each skb, we record in scb->tx.is_app_limited whether the
    resulting rate sample will be application-limited.

    The code in tcp_rate_gen() checks to see when it is safe to mark all
    known application-limited bubbles of silence as having exited the
    pipe. It does this by checking to see when the delivered count moves
    past the tp->app_limited marker. At this point it zeroes the
    tp->app_limited marker, as all known bubbles are out of the pipe.

    We make room for the tx.is_app_limited bit in the skb by borrowing a
    bit from the in_flight field used by NV to record the number of bytes
    in flight. The receive window in the TCP header is 16 bits, and the
    max receive window scaling shift factor is 14 (RFC 1323). So the max
    receive window offered by the TCP protocol is 2^(16+14) = 2^30. So we
    only need 30 bits for the tx.in_flight used by NV.
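
    A hedged sketch of how the four conditions above translate into the
    check; this approximates tcp_rate_check_app_limited() in
    net/ipv4/tcp_rate.c, and the exact expressions there may differ:

      #include <net/tcp.h>

      static void rate_check_app_limited_sketch(struct sock *sk)
      {
          struct tcp_sock *tp = tcp_sk(sk);

          if (/* 1) less than one MSS of unsent data in the write queue */
              tp->write_seq - tp->snd_nxt < tp->mss_cache &&
              /* 2) nothing queued in qdisc / NIC tx queues for this socket */
              sk_wmem_alloc_get(sk) < SKB_TRUESIZE(1) &&
              /* 3) not limited by cwnd */
              tcp_packets_in_flight(tp) < tp->snd_cwnd &&
              /* 4) no lost packets left to retransmit */
              tp->lost_out <= tp->retrans_out)
              /* remember where the "bubble" ends in delivered-count space;
               * use 1 so that 0 keeps meaning "not app-limited" */
              tp->app_limited =
                  (tp->delivered + tcp_packets_in_flight(tp)) ? : 1;
      }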

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
     
  • This patch generates data delivery rate (throughput) samples on a
    per-ACK basis. These rate samples can be used by congestion control
    modules, and specifically will be used by TCP BBR in later patches in
    this series.

    Key state:

    tp->delivered: Tracks the total number of data packets (original or not)
    delivered so far. This is an already-existing field.

    tp->delivered_mstamp: the last time tp->delivered was updated.

    Algorithm:

    A rate sample is calculated as (d1 - d0)/(t1 - t0) on a per-ACK basis:

    d1: the current tp->delivered after processing the ACK
    t1: the current time after processing the ACK

    d0: the prior tp->delivered when the acked skb was transmitted
    t0: the prior tp->delivered_mstamp when the acked skb was transmitted

    When an skb is transmitted, we snapshot d0 and t0 in its control
    block in tcp_rate_skb_sent().

    When an ACK arrives, it may SACK and ACK some skbs. For each SACKed
    or ACKed skb, tcp_rate_skb_delivered() updates the rate_sample struct
    to reflect the latest (d0, t0).

    Finally, tcp_rate_gen() generates a rate sample by storing
    (d1 - d0) in rs->delivered and (t1 - t0) in rs->interval_us.
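
    A hedged sketch of the per-ACK arithmetic just described; field names
    follow the commit text, while net/ipv4/tcp_rate.c is the authoritative
    implementation:

      struct rate_sample_sketch {
          unsigned int delivered;    /* d1 - d0: packets delivered in the interval */
          long interval_us;          /* t1 - t0: interval length in microseconds */
      };

      /* d0/t0 were snapshotted into the skb's control block at transmit time
       * (tcp_rate_skb_sent()); d1/t1 are the connection's current values when
       * the ACK is processed (tcp_rate_gen()). */
      static void rate_gen_sketch(struct rate_sample_sketch *rs,
                                  unsigned int d0, long long t0_us,
                                  unsigned int d1, long long t1_us)
      {
          rs->delivered = d1 - d0;
          rs->interval_us = (long)(t1_us - t0_us);
          /* a bandwidth estimate, if desired:
           *   bw_Bps = rs->delivered * mss * 1000000 / rs->interval_us */
      }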

    One caveat: if an skb was sent with no packets in flight, then
    tp->delivered_mstamp may be either invalid (if the connection is
    starting) or outdated (if the connection was idle). In that case,
    we'll re-stamp tp->delivered_mstamp.

    At first glance it seems t0 should always be the time when an skb was
    transmitted, but actually this could over-estimate the rate due to
    phase mismatch between transmit and ACK events. To track the delivery
    rate, we ensure that if packets are in flight then t0 and t1 are
    times at which packets were marked delivered.

    If the initial and final RTTs are different then one may be corrupted
    by some sort of noise. The noise we see most often is sending gaps
    caused by delayed, compressed, or stretched acks. This either affects
    both RTTs equally or artificially reduces the final RTT. We approach
    this by recording the info we need to compute the initial RTT
    (duration of the "send phase" of the window) when we recorded the
    associated inflight. Then, for a filter to avoid bandwidth
    overestimates, we generalize the per-sample bandwidth computation
    from:

    bw = delivered / ack_phase_rtt

    to the following:

    bw = delivered / max(send_phase_rtt, ack_phase_rtt)

    In large-scale experiments, this filtering approach incorporating
    send_phase_rtt is effective at avoiding bandwidth overestimates due to
    ACK compression or stretched ACKs.

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Count the number of packets that a TCP connection marks lost.

    Congestion control modules can use this loss rate information for more
    intelligent decisions about how fast to send.

    Specifically, this is used in TCP BBR policer detection. BBR uses a
    high packet loss rate as one signal in its policer detection and
    policer bandwidth estimation algorithm.

    The BBR policer detection algorithm cannot simply track retransmits,
    because a retransmit can be (and often is) an indicator of packets
    lost long, long ago. This is particularly true in a long CA_Loss
    period that repairs the initial massive losses when a policer kicks
    in.

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Refactor the TCP min_rtt code to reuse the new win_minmax library in
    lib/win_minmax.c to simplify the TCP code.

    This is a pure refactor: the functionality is exactly the same. We
    just moved the windowed min code to make TCP easier to read and
    maintain, and to allow other parts of the kernel to use the windowed
    min/max filter code.

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
     

09 Sep, 2016

1 commit

  • Over the years, TCP BDP has increased by several orders of magnitude,
    and some people are considering reaching the 2 Gbytes limit.

    Even with the current window scale limit of 14, ~1 Gbytes maps to ~740,000
    MSS.

    In the presence of packet losses (or reordering), TCP stores incoming packets
    in an out-of-order queue, and the number of skbs sitting there waiting for
    the missing packets to be received can be in the 10^5 range.

    Most packets are appended to the tail of this queue, and when
    packets can finally be transferred to the receive queue, we scan the queue
    from its head.

    However, in the presence of heavy losses, we might have to find an arbitrary
    point in this queue, involving a linear scan for every incoming packet,
    thrashing cpu caches.

    This patch converts it to an RB tree, to get bounded latencies.

    Yaogong wrote a preliminary patch about 2 years ago.
    Eric did the rebase, added ofo_last_skb cache, polishing and tests.

    Tested with the network dropping between 1 and 10% of packets, with good
    success (about a 30% increase in throughput in stress tests).

    Next step would be to also use an RB tree for the write queue at sender
    side ;)
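
    A hedged sketch of the insertion pattern for an rbtree-keyed out-of-order
    queue, using the kernel rbtree API; the in-tree tcp_data_queue_ofo()
    additionally handles overlapping segments, coalescing and the
    ofo_last_skb tail cache:

      #include <linux/rbtree.h>
      #include <net/tcp.h>

      static void ofo_insert_sketch(struct rb_root *root, struct sk_buff *skb)
      {
          struct rb_node **p = &root->rb_node, *parent = NULL;
          u32 seq = TCP_SKB_CB(skb)->seq;

          while (*p) {
              struct sk_buff *cur = rb_entry(*p, struct sk_buff, rbnode);

              parent = *p;
              if (before(seq, TCP_SKB_CB(cur)->seq))
                  p = &parent->rb_left;
              else
                  p = &parent->rb_right;
          }
          rb_link_node(&skb->rbnode, parent, p);    /* O(log n) insertion */
          rb_insert_color(&skb->rbnode, root);
      }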

    Signed-off-by: Yaogong Wang
    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Ilpo Järvinen
    Acked-By: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Yaogong Wang
     

15 Mar, 2016

1 commit

  • Per RFC4898, tcpi_data_segs_out/in count segments sent/received
    containing a positive-length data segment (that includes
    retransmission segments carrying data). Unlike
    tcpi_segs_out/in, tcpi_data_segs_out/in exclude segments
    carrying no data (e.g. pure ACKs).

    The patch also updates segs_in in tcp_fastopen_add_skb()
    so that the segs_in >= data_segs_in property is kept.

    Together with retransmission data, tcpi_data_segs_out
    gives a better signal on the retransmit rate.

    v6: Rebase on the latest net-next

    v5: Eric pointed out that checking skb->len is still needed in
    tcp_fastopen_add_skb() because skb can carry a FIN without data.
    Hence, instead of open coding segs_in and data_segs_in, tcp_segs_in()
    helper is used. Comment is added to the fastopen case to explain why
    segs_in has to be reset and tcp_segs_in() has to be called before
    __skb_pull().

    v4: Add comment to the changes in tcp_fastopen_add_skb()
    and also add remark on this case in the commit message.

    v3: Add const modifier to the skb parameter in tcp_segs_in()

    v2: Rework based on recent fix by Eric:
    commit a9d99ce28ed3 ("tcp: fix tcpi_segs_in after connection establishment")

    Signed-off-by: Martin KaFai Lau
    Cc: Chris Rapier
    Cc: Eric Dumazet
    Cc: Marcelo Ricardo Leitner
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     

11 Feb, 2016

1 commit

  • tcp_hdrlen is wasteful if you already have a pointer to struct tcphdr.
    This splits the size calculation into a helper function that can be
    used if a struct tcphdr is already available.
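
    A hedged sketch of the resulting split: the new helper takes the
    struct tcphdr directly, so callers that already have the header pointer
    skip the skb lookup done by tcp_hdrlen():

      static inline int __tcp_hdrlen(const struct tcphdr *th)
      {
          return th->doff * 4;               /* data offset is in 32-bit words */
      }

      static inline unsigned int tcp_hdrlen(const struct sk_buff *skb)
      {
          return __tcp_hdrlen(tcp_hdr(skb));
      }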

    Signed-off-by: Craig Gallek
    Signed-off-by: David S. Miller

    Craig Gallek
     

08 Feb, 2016

1 commit

  • This patch changes the accounting of how many packets are
    newly acked or sacked when the sender receives an ACK.

    The current approach basically computes

    newly_acked_sacked = (prior_packets - prior_sacked) -
                         (tp->packets_out - tp->sacked_out)

    where prior_packets and prior_sacked are snapshotted
    at the beginning of the ACK processing.

    The new approach tracks the delivery information via a new
    TCP state variable "delivered", which monotonically increases
    as new packets are delivered, in order or out-of-order.

    The reason for this change is that the current approach is
    brittle and produces negative or inaccurate estimates.

    1) For non-SACK connections, an ACK that advances SND.UNA
    could reset the DUPACK counters (tp->sacked_out) in
    tcp_process_loss() or tcp_fastretrans_alert(). This inflates
    the inflight suddenly and causes an under-estimate or even a
    negative estimate. Here is a real example:

                    before    after (processing ACK)
    packets_out         75        73
    sacked_out          23         0
    ca state          Loss      Open

    The old approach computes (75-23) - (73 - 0) = -21 delivered,
    while the new approach computes 1 delivered, since it
    considers the 2nd-24th packets to have been delivered OOO.

    2) An MSS change would re-count packets_out and sacked_out, so
    the estimate is inaccurate and can even become negative.
    E.g., the inflight is doubled when the MSS is halved.

    3) Spurious retransmissions signaled by DSACK are not accounted for.

    The new approach is simpler and more robust. For SACK connections,
    tp->delivered increments as packets are being acked or sacked in
    SACK and ACK processing.

    For non-sack connections, it's done in tcp_remove_reno_sacks() and
    tcp_add_reno_sack(). When an ACK advances the SND.UNA, tp->delivered
    is incremented by the number of packets ACKed (less the current
    number of DUPACKs received plus one packet hole). Upon receiving
    a DUPACK, tp->delivered is incremented assuming one out-of-order
    packet is delivered.

    Upon receiving a DSACK, tp->delivered is incremented assuming one
    retransmission is delivered in tcp_sacktag_write_queue().
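
    A hedged sketch of how a caller derives a per-ACK delivery count under
    the new scheme (this mirrors how tcp_ack() consumes tp->delivered; the
    helper name is illustrative):

      /* Snapshot tp->delivered before ACK processing, take the delta after:
       * the counter only ever increases, so the result cannot go negative. */
      static unsigned int newly_delivered(const struct tcp_sock *tp,
                                          unsigned int prior_delivered)
      {
          return tp->delivered - prior_delivered;
      }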

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

06 Nov, 2015

1 commit

  • For the reasons explained in commit ce1050089c96 ("tcp/dccp: fix
    ireq->pktopts race"), we need to make sure we do not access
    req->saved_syn unless we own the request sock.

    This fixes races for listeners using TCP_SAVE_SYN option.

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Reported-by: Ying Cai
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Oct, 2015

1 commit

  • Allowing an application to set an arbitrary limit for
    the list of recently RST fastopen sessions [1] is not wise,
    as it opens a way to deplete kernel memory.

    Cap the user-provided limit by the somaxconn sysctl,
    like the listen() backlog.

    [1] https://tools.ietf.org/html/rfc7413#section-5.1

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Oct, 2015

3 commits

  • This patch is the first half of the RACK loss recovery.

    RACK loss recovery uses the notion of time instead
    of packet sequence (FACK) or counts (dupthresh). It's inspired by the
    previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
    transmit (new data packet) is sacked, then the current retransmitted
    sequence below the newly sacked sequence must have been lost,
    since at least one round-trip time has elapsed.

    But it has several limitations:
    1) can't detect tail drops since it depends on limited transmit
    2) is disabled upon reordering (assumes no reordering)
    3) only enabled in fast recovery but not timeout recovery

    RACK (Recent ACK) addresses these limitations with the notion
    of time instead: a packet P1 is lost if a later packet P2 is s/acked,
    as at least one round trip has passed.

    Since RACK cares about the time sequence instead of the data sequence
    of packets, it can detect tail drops when a later retransmission is
    s/acked while FACK or dupthresh can't. For reordering, RACK uses a
    dynamically adjusted reordering window ("reo_wnd") to reduce false
    positives on every (small) degree of reordering.

    This patch implements tcp_advanced_rack() which tracks the
    most recent transmission time among the packets that have been
    delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
    is the key to determine which packet has been lost.

    Consider an example in which the sender sends the following packets:
    T1: P1 (lost)
    T2: P2
    T3: P3
    T4: P4
    T100: sack of P2. rack.mstamp = T2
    T101: retransmit P1
    T102: sack of P2,P3,P4. rack.mstamp = T4
    T205: ACK of P4 since the hole is repaired. rack.mstamp = T101

    We need to be careful about spurious retransmission because it may
    falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
    to falsely mark all packets lost, just like a spurious timeout.

    We identify spurious retransmission by the ACK's TS echo value.
    If TS option is not applicable but the retransmission is acknowledged
    less than min-RTT ago, it is likely to be spurious. We refrain from
    using the transmission time of these spurious retransmissions.

    The second half is implemented in the next patch that marks packet
    lost using RACK timestamp.
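
    A hedged sketch of the core rule described above; the in-tree helper
    in net/ipv4/tcp_recovery.c works on skb_mstamp values and performs the
    TS-echo / min-RTT spuriousness checks that are only summarized here:

      /* Keep the most recent transmit time among packets known delivered,
       * but ignore retransmissions that look spurious, so a bogus timestamp
       * cannot make every outstanding packet appear lost. */
      static void rack_advance_sketch(unsigned long long *rack_mstamp_us,
                                      unsigned long long xmit_time_us,
                                      int likely_spurious_retrans)
      {
          if (likely_spurious_retrans)
              return;
          if (xmit_time_us > *rack_mstamp_us)
              *rack_mstamp_us = xmit_time_us;
      }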

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Remove the existing lost retransmit detection because RACK subsumes
    it completely. This also stops the overloading of the ack_seq field of
    the skb control block.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Kathleen Nichols' algorithm for tracking the minimum RTT of a
    data stream over some measurement window. It uses constant space
    and constant time per update. Yet it almost always delivers
    the same minimum as an implementation that has to keep all
    the data in the window. The measurement window is tunable via
    sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.

    The algorithm keeps track of the best, 2nd best & 3rd best min
    values, maintaining an invariant that the measurement time of
    the n'th best >= that of the (n-1)'th best. It also makes sure that the
    three values are widely separated in the time window, since that bounds
    the worst-case error when the data is monotonically increasing
    over the window.

    Upon getting a new min, we can forget everything earlier because
    it has no value - the new min is less than everything else in the
    window by definition and it's the most recent. So we restart fresh
    on every new min and overwrite the 2nd & 3rd choices. The same
    property holds for the 2nd & 3rd best.

    Therefore we have to maintain two invariants to maximize the
    information in the samples: one on values (1st.v <= 2nd.v <= 3rd.v)
    and the other on measurement times (1st.t <= 2nd.t <= 3rd.t).
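
    A hedged sketch of the three-sample update; the in-tree code
    (tcp_update_rtt_min(), later generalized into lib/win_minmax.c) differs
    in detail but follows the same idea:

      struct min_sample { unsigned int v, t; };   /* value, time of measurement */

      static void windowed_min_update(struct min_sample m[3], unsigned int v,
                                      unsigned int now, unsigned int win)
      {
          if (v <= m[0].v) {                 /* new overall min: forget the rest */
              m[0].v = m[1].v = m[2].v = v;
              m[0].t = m[1].t = m[2].t = now;
          } else if (v <= m[1].v) {          /* new 2nd best (and therefore 3rd) */
              m[1].v = m[2].v = v;
              m[1].t = m[2].t = now;
          } else if (v <= m[2].v) {          /* new 3rd best */
              m[2].v = v;
              m[2].t = now;
          }
          if (now - m[0].t > win) {          /* best sample aged out of the window */
              m[0] = m[1];
              m[1] = m[2];
              m[2].v = v;
              m[2].t = now;
          }
      }
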
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

13 Oct, 2015

1 commit


30 Sep, 2015

1 commit

  • While auditing TCP stack for upcoming 'lockless' listener changes,
    I found I had to change fastopen_init_queue() to properly init the object
    before publishing it.

    Otherwise another cpu could try to lock the spinlock before it gets
    properly initialized.

    Instead of adding appropriate barriers, just remove dynamic memory
    allocations:
    - The structure is 28 bytes on 64-bit arches. Using an additional 8 bytes
    for holding a pointer seems overkill.
    - Two listeners can share the same cache line and performance would suffer.

    If we really want to save a few bytes, we would instead dynamically allocate
    the whole struct request_sock_queue in the future.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Sep, 2015

1 commit

  • Currently SYN/ACK RTT is measured in jiffies. On a LAN the SYN/ACK
    RTT is often measured as 0ms or sometimes 1ms, which would affect
    RTT estimation and the min RTT sampling used by some congestion controls.

    This patch improves SYN/ACK RTT to be usec resolution if platform
    supports it. While the timestamping of SYN/ACK is done in request
    sock, the RTT measurement is carefully arranged to avoid storing
    another u64 timestamp in tcp_sock.

    For a regular handshake without SYNACK retransmission, the RTT is sampled
    right after the child socket is created and right before the request
    sock is released (tcp_check_req() in tcp_minisocks.c).

    For Fast Open, the child socket is already created when the SYN/ACK is
    sent, so the RTT is sampled in tcp_rcv_state_process() after processing
    the final ACK and right before the request socket is released.

    If the SYN/ACK was retransmitted or a SYN cookie was used, we rely
    on TCP timestamps to measure the RTT. The sample is taken at the
    same place in tcp_rcv_state_process() after the timestamp values
    are validated in tcp_validate_incoming(). Note that we do not store
    the TS echo value in the request_sock for SYN cookies, because the value
    is already stored in tp->rx_opt used by tcp_ack_update_rtt().

    One side benefit is that the RTT measurement now happens before
    initializing congestion control (of the passive side). Therefore
    the congestion control can use the SYN/ACK RTT.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

18 Sep, 2015

1 commit

  • In commit b73c3d0e4f0e ("net: Save TX flow hash in sock and set in skbuf
    on xmit"), Tom provided a l4 hash to most outgoing TCP packets.

    We'd like to provide one as well for SYNACK packets, so that all packets
    of a given flow share the same txhash, to later enable the bonding driver to
    also use skb->hash to perform slave selection.

    Note that a SYNACK retransmit shuffles the tx hash, as Tom did
    in commit 265f94ff54d62 ("net: Recompute sk_txhash on negative routing
    advice") for established sockets.

    This has the nice effect of making TCP flows resilient to some kinds of black
    holes, even during the connection-establishment phase.

    Signed-off-by: Eric Dumazet
    Cc: Tom Herbert
    Cc: Mahesh Bandewar
    Acked-by: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 May, 2015

2 commits

  • Conflicts:
    drivers/net/ethernet/cadence/macb.c
    drivers/net/phy/phy.c
    include/linux/skbuff.h
    net/ipv4/tcp.c
    net/switchdev/switchdev.c

    Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD}
    renaming overlapping with net-next changes of various
    sorts.

    phy.c was a case of two changes, one adding a local
    variable to a function whilst the second was removing
    one.

    tcp.c overlapped a deadlock fix with the addition of new tcp_info
    statistic values.

    macb.c involved the addition of two zyncq device entries.

    skbuff.h involved adding back ipv4_daddr to nf_bridge_info
    whilst net-next changes put two other existing members of
    that struct into a union.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Taking socket spinlock in tcp_get_info() can deadlock, as
    inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i],
    while packet processing can use the reverse locking order.

    We could avoid this locking for TCP_LISTEN states, but lockdep would
    certainly get confused as all TCP sockets share same lockdep classes.

    [ 523.722504] ======================================================
    [ 523.728706] [ INFO: possible circular locking dependency detected ]
    [ 523.734990] 4.1.0-dbg-DEV #1676 Not tainted
    [ 523.739202] -------------------------------------------------------
    [ 523.745474] ss/18032 is trying to acquire lock:
    [ 523.750002] (slock-AF_INET){+.-...}, at: [] tcp_get_info+0x2c4/0x360
    [ 523.758129]
    [ 523.758129] but task is already holding lock:
    [ 523.763968] (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [] inet_diag_dump_icsk+0x1d5/0x6c0
    [ 523.774661]
    [ 523.774661] which lock already depends on the new lock.
    [ 523.774661]
    [ 523.782850]
    [ 523.782850] the existing dependency chain (in reverse order) is:
    [ 523.790326]
    -> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}:
    [ 523.796599] [] lock_acquire+0xbb/0x270
    [ 523.802565] [] _raw_spin_lock+0x38/0x50
    [ 523.808628] [] __inet_hash_nolisten+0x78/0x110
    [ 523.815273] [] tcp_v4_syn_recv_sock+0x24b/0x350
    [ 523.822067] [] tcp_check_req+0x3c1/0x500
    [ 523.828199] [] tcp_v4_do_rcv+0x239/0x3d0
    [ 523.834331] [] tcp_v4_rcv+0xa8e/0xc10
    [ 523.840202] [] ip_local_deliver_finish+0x133/0x3e0
    [ 523.847214] [] ip_local_deliver+0xaa/0xc0
    [ 523.853440] [] ip_rcv_finish+0x168/0x5c0
    [ 523.859624] [] ip_rcv+0x307/0x420

    Let's use the u64_sync infrastructure instead. As a bonus, 64-bit
    arches get optimized, as these are nops for them.

    Fixes: 0df48c26d841 ("tcp: add tcpi_bytes_acked to tcp_info")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 May, 2015

1 commit

  • This patch tracks the total number of inbound and outbound segments on a
    TCP socket. One may use this number to get an idea of connection
    quality when compared against the retransmissions.

    RFC4898 named these: tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut.

    These are 32-bit fields each and can be fetched both from TCP_INFO
    getsockopt() if one has a handle on a TCP socket, or from the inet_diag
    netlink facility (an iproute2/ss patch will follow).

    Note that tp->segs_out was placed near tp->snd_nxt for good data
    locality and minimal performance impact, while tp->segs_in was placed
    near tp->bytes_received for the same reason.

    Join work with Eric Dumazet.

    Note that received SYNs are accounted on the listener, but sent SYNACKs
    are not accounted.

    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
     

06 May, 2015

1 commit

  • This patch allows a server application to get the TCP SYN headers for
    its passive connections. This is useful if the server is doing
    fingerprinting of clients based on SYN packet contents.

    Two socket options are added: TCP_SAVE_SYN and TCP_SAVED_SYN.

    The first is used on a socket to enable saving the SYN headers
    for child connections. This can be set before or after the listen()
    call.

    The latter is used to retrieve the SYN headers for passive connections,
    if the parent listener has enabled TCP_SAVE_SYN.

    TCP_SAVED_SYN can be read once; reading it frees the saved SYN headers.

    The data returned in TCP_SAVED_SYN are network (IPv4/IPv6) and TCP
    headers.
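
    A hedged userspace sketch of the two options, assuming a kernel and
    linux/tcp.h that provide TCP_SAVE_SYN / TCP_SAVED_SYN:

      #include <linux/tcp.h>
      #include <netinet/in.h>
      #include <stdio.h>
      #include <sys/socket.h>

      static void save_syn_sketch(int listen_fd, int accepted_fd)
      {
          int one = 1;
          unsigned char syn[512];            /* network + TCP headers of the SYN */
          socklen_t len = sizeof(syn);

          /* Enable on the listener (before or after listen()) so that child
           * connections keep their SYN headers around. */
          setsockopt(listen_fd, IPPROTO_TCP, TCP_SAVE_SYN, &one, sizeof(one));

          /* Later, on a socket returned by accept(): */
          if (getsockopt(accepted_fd, IPPROTO_TCP, TCP_SAVED_SYN, syn, &len) == 0)
              printf("saved SYN headers: %u bytes\n", (unsigned)len);
          /* A second read fails: fetching TCP_SAVED_SYN frees the headers. */
      }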

    The original patch was written by Tom Herbert; I changed it to not hold
    a full skb (and the associated dst and conntrack references).

    We have used such a patch for about 3 years at Google.

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Tested-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Apr, 2015

2 commits

  • This patch tracks the total number of payload bytes received on a TCP socket.
    This is the sum of all changes done to tp->rcv_nxt.

    RFC4898 named this: tcpEStatsAppHCThruOctetsReceived.

    This is a 64bit field, and can be fetched both from TCP_INFO
    getsockopt() if one has a handle on a TCP socket, or from inet_diag
    netlink facility (iproute2/ss patch will follow)

    Note that tp->bytes_received was placed near tp->rcv_nxt for
    best data locality and minimal performance impact.

    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Matt Mathis
    Cc: Eric Salo
    Cc: Martin Lau
    Cc: Chris Rapier
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This patch tracks the total number of bytes acked for a TCP socket.
    This is the sum of all changes done to tp->snd_una, and allows
    for precise tracking of delivered data.

    RFC4898 named this: tcpEStatsAppHCThruOctetsAcked.

    This is a 64bit field, and can be fetched both from TCP_INFO
    getsockopt() if one has a handle on a TCP socket, or from inet_diag
    netlink facility (iproute2/ss patch will follow)

    Note that tp->bytes_acked was placed near tp->snd_una for
    best data locality and minimal performance impact.

    Signed-off-by: Eric Dumazet
    Acked-by: Yuchung Cheng
    Cc: Matt Mathis
    Cc: Eric Salo
    Cc: Martin Lau
    Cc: Chris Rapier
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Apr, 2015

2 commits

  • Fast Open has been using an experimental option with a magic number
    (RFC6994). This patch makes the client by default use the RFC7413
    option (34) to get and send Fast Open cookies. This patch makes
    the client solicit cookies from a given server first with the
    RFC7413 option. If that fails to elicit a cookie, then it tries
    the RFC6994 experimental option. If that also fails, it uses the
    RFC7413 option on all subsequent connect attempts. If the server
    returns a Fast Open cookie then the client caches the form of the
    option that successfully elicited a cookie, and uses that form on
    later connects when it presents that cookie.

    The idea is to gradually obsolete the use of experimental options as
    servers and clients upgrade, while keeping interoperability
    in the meantime.

    Signed-off-by: Daniel Lee
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Daniel Lee
     
  • Fast Open has been using the experimental option with a magic number
    (RFC6994) to request and grant Fast Open cookies. This patch enables
    the server to support the official IANA option 34 in RFC7413 in
    addition.

    The change has passed all existing Fast Open tests with both
    old and new options at Google.

    Signed-off-by: Daniel Lee
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Daniel Lee
     

18 Mar, 2015

1 commit


01 Mar, 2015

1 commit

  • TSO relies on the ability to defer sending a small number of packets.
    The heuristic is to wait for future ACKs in the hope of sending more packets at once.
    The current algorithm uses a per-socket tso_deferred field as a pseudo timer.

    This pseudo timer relies on future ACKs, but there is no guarantee
    we receive them in time.

    A fix would be to use a real timer, but the cost of such a timer is
    probably too expensive for typical cases.

    This patch changes the logic to test the time of the last transmit,
    because we should not add bursts of more than 1ms for any given flow.

    We've used this patch for about two years at Google, before FQ/pacing,
    as it reduces a fair number of bursts.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Feb, 2015

3 commits

  • Ensure that in state FIN_WAIT2 or TIME_WAIT, where the connection is
    represented by a tcp_timewait_sock, we rate limit dupacks in response
    to incoming packets (a) with TCP timestamps that fail PAWS checks, or
    (b) with sequence numbers that are out of the acceptable window.

    We do not send a dupack in response to out-of-window packets if it has
    been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we
    last sent a dupack in response to an out-of-window packet.

    Reported-by: Avery Fay
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Ensure that in state ESTABLISHED, where the connection is represented
    by a tcp_sock, we rate limit dupacks in response to incoming packets
    (a) with TCP timestamps that fail PAWS checks, or (b) with sequence
    numbers or ACK numbers that are out of the acceptable window.

    We do not send a dupack in response to out-of-window packets if it has
    been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we
    last sent a dupack in response to an out-of-window packet.

    There is already a similar (although global) rate-limiting mechanism
    for "challenge ACKs". When deciding whether to send a challenge ACK,
    we first consult the new per-connection rate limit, and then the
    global rate limit.
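
    A hedged sketch of the per-connection check these patches introduce
    (the in-tree helper is tcp_oow_rate_limited(); the names and jiffies
    handling here are simplified):

      /* Return nonzero when an out-of-window dupack should be suppressed
       * because one was already sent within the rate-limit interval. */
      static int oow_rate_limited_sketch(unsigned long *last_oow_ack,
                                         unsigned long ratelimit,
                                         unsigned long now_jiffies)
      {
          if (*last_oow_ack && now_jiffies - *last_oow_ack < ratelimit)
              return 1;                      /* too soon: stay silent */
          *last_oow_ack = now_jiffies;       /* record this response */
          return 0;
      }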

    Reported-by: Avery Fay
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • In the SYN_RECV state, where the TCP connection is represented by
    a tcp_request_sock, we now rate-limit SYNACKs in response to a client's
    retransmitted SYNs: we do not send a SYNACK in response to a client SYN
    if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms)
    since we last sent a SYNACK in response to a client's retransmitted
    SYN.

    This allows the vast majority of legitimate client connections to
    proceed unimpeded, even for the most aggressive platforms, iOS and
    MacOS, which actually retransmit SYNs at 1-second intervals several
    times in a row. They use SYN RTO timeouts following the progression:
    1,1,1,1,1,2,4,8,16,32.

    Reported-by: Avery Fay
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     

11 Dec, 2014

1 commit


10 Dec, 2014

2 commits

  • Commit 95bd09eb2750 ("tcp: TSO packets automatic sizing") tried to
    control TSO size, but did this at the wrong place (sendmsg() time).

    At sendmsg() time, we might have a pessimistic view of the flow rate,
    and we end up building very small skbs (with 2 MSS per skb).

    This is bad because:

    - It sends small TSO packets even in Slow Start where rate quickly
    increases.
    - It tends to make the socket write queue very big, increasing tcp_ack()
    processing time, but also increasing memory needs, not necessarily
    accounted for, as fast clone overhead is currently ignored.
    - Lower GRO efficiency and more ACK packets.

    Servers with a lot of short-lived connections suffer from this.

    Let's instead fill skbs as much as possible (64KB of payload), but split
    them at xmit time, when we have a precise idea of the flow rate.
    skb split is actually quite efficient.

    Patch looks bigger than necessary, because TCP Small Queue decision now
    has to take place after the eventual split.

    As Neal suggested, introduce a new tcp_tso_autosize() helper, so that
    tcp_tso_should_defer() can be synchronized on same goal.
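
    A hedged sketch of the sizing idea behind the new tcp_tso_autosize()
    helper: size a GSO packet to roughly one millisecond of payload at the
    current pacing rate, clamped to min/max segment bounds (the in-tree
    constants and clamping differ):

      static unsigned int tso_autosize_sketch(unsigned long long pacing_rate_Bps,
                                              unsigned int mss,
                                              unsigned int min_tso_segs,
                                              unsigned int gso_max_segs)
      {
          /* pacing rate is bytes/sec, so >> 10 is roughly 1 ms of payload */
          unsigned long long segs = (pacing_rate_Bps >> 10) / mss;

          if (segs < min_tso_segs)
              segs = min_tso_segs;
          if (segs > gso_max_segs)
              segs = gso_max_segs;
          return (unsigned int)segs;
      }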

    Rename tp->xmit_size_goal_segs to tp->gso_segs, as this variable
    contains the number of MSS that we can put in a GSO packet, and is not
    related to the autosizing goal anymore.

    Tested:

    40 ms rtt link

    nstat >/dev/null
    netperf -H remote -l -2000000 -- -s 1000000
    nstat | egrep "IpInReceives|IpOutRequests|TcpOutSegs|IpExtOutOctets"

    Before patch :

    Recv     Send     Send
    Socket   Socket   Message   Elapsed
    Size     Size     Size      Time      Throughput
    bytes    bytes    bytes     secs.     10^6bits/s

    87380    2000000  2000000   0.36      44.22

    IpInReceives            600        0.0
    IpOutRequests           599        0.0
    TcpOutSegs              1397       0.0
    IpExtOutOctets          2033249    0.0

    After patch :

    Recv     Send     Send
    Socket   Socket   Message   Elapsed
    Size     Size     Size      Time      Throughput
    bytes    bytes    bytes     secs.     10^6bits/sec

    87380    2000000  2000000   0.36      44.27

    IpInReceives            221        0.0
    IpOutRequests           232        0.0
    TcpOutSegs              1397       0.0
    IpExtOutOctets          2013953    0.0

    Signed-off-by: Eric Dumazet
    Signed-off-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Signed-off-by: Al Viro

    Al Viro