14 Nov, 2016

1 commit

  • With syzkaller's help, Marco Grassi found a bug in the TCP stack,
    crashing in tcp_collapse().

    The root cause is that sk_filter() can truncate the incoming skb,
    but the TCP stack was not really expecting this to happen; it was
    probably expecting a simple DROP or ACCEPT behavior.

    We first need to make sure no part of the TCP header can be removed,
    then adjust TCP_SKB_CB(skb)->end_seq to match the truncated payload.
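
    A minimal sketch of that approach, assuming the sk_filter_trim_cap()
    helper (which lets the caller set a floor on how much the filter may
    trim) and simplifying the real code:
    ~~~~~
    /* Run the socket filter, but never trim below the TCP header length,
     * so th->doff * 4 bytes of header always survive.
     */
    int tcp_filter(struct sock *sk, struct sk_buff *skb)
    {
            struct tcphdr *th = (struct tcphdr *)skb->data;

            return sk_filter_trim_cap(sk, skb, th->doff * 4);
    }

    /* In the caller, after the filter possibly shrank the payload,
     * recompute end_seq from the new skb->len:
     */
            th = (const struct tcphdr *)skb->data;
            TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq + th->syn +
                                       th->fin + skb->len - th->doff * 4;
    ~~~~~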

    Many thanks to syzkaller team and Marco for giving us a reproducer.

    Signed-off-by: Eric Dumazet
    Reported-by: Marco Grassi
    Reported-by: Vladis Dronov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Nov, 2016

1 commit

  • Andrey reported the following error report while running the syzkaller
    fuzzer:

    general protection fault: 0000 [#1] SMP KASAN
    Dumping ftrace buffer:
    (ftrace buffer empty)
    Modules linked in:
    CPU: 0 PID: 648 Comm: syz-executor Not tainted 4.9.0-rc3+ #333
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    task: ffff8800398c4480 task.stack: ffff88003b468000
    RIP: 0010:[] [< inline >]
    inet_exact_dif_match include/net/tcp.h:808
    RIP: 0010:[] []
    __inet_lookup_listener+0xb6/0x500 net/ipv4/inet_hashtables.c:219
    RSP: 0018:ffff88003b46f270 EFLAGS: 00010202
    RAX: 0000000000000004 RBX: 0000000000004242 RCX: 0000000000000001
    RDX: 0000000000000000 RSI: ffffc90000e3c000 RDI: 0000000000000054
    RBP: ffff88003b46f2d8 R08: 0000000000004000 R09: ffffffff830910e7
    R10: 0000000000000000 R11: 000000000000000a R12: ffffffff867fa0c0
    R13: 0000000000004242 R14: 0000000000000003 R15: dffffc0000000000
    FS: 00007fb135881700(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000020cc3000 CR3: 000000006d56a000 CR4: 00000000000006f0
    Stack:
    0000000000000000 000000000601a8c0 0000000000000000 ffffffff00004242
    424200003b9083c2 ffff88003def4041 ffffffff84e7e040 0000000000000246
    ffff88003a0911c0 0000000000000000 ffff88003a091298 ffff88003b9083ae
    Call Trace:
    [] tcp_v4_send_reset+0x584/0x1700 net/ipv4/tcp_ipv4.c:643
    [] tcp_v4_rcv+0x198b/0x2e50 net/ipv4/tcp_ipv4.c:1718
    [] ip_local_deliver_finish+0x332/0xad0
    net/ipv4/ip_input.c:216
    ...

    The TCP MD5 code has a path that calls __inet_lookup_listener() with a
    null skb, so inet{6}_exact_dif_match() needs to check skb for null
    before pulling the flag out of it.
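
    A sketch of the resulting guard, close to the helper in
    include/net/tcp.h but simplified here:
    ~~~~~
    static inline bool inet_exact_dif_match(struct net *net, struct sk_buff *skb)
    {
    #if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
            /* skb can be NULL on the MD5 path: test it before dereferencing */
            if (!net->ipv4.sysctl_tcp_l3mdev_accept &&
                skb && ipv4_l3mdev_skb(TCP_SKB_CB(skb)->header.h4.flags))
                    return true;
    #endif
            return false;
    }
    ~~~~~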

    Fixes: a04a480d4392 ("net: Require exact match for TCP socket lookups if
    dif is l3mdev")
    Reported-by: Andrey Konovalov
    Signed-off-by: David Ahern
    Tested-by: Andrey Konovalov
    Signed-off-by: David S. Miller

    David Ahern
     

17 Oct, 2016

1 commit

  • Currently, socket lookups for l3mdev (vrf) use cases can match a socket
    that is bound to a port but not a device (i.e., a global socket). If the
    sysctl tcp_l3mdev_accept is not set, this leads to ACK packets going out
    based on the main table even though the packet came in from an L3 domain.
    The end result is that the connection does not establish, creating
    confusion for users since the service is running and a socket shows in
    ss output. Fix by requiring an exact dif to sk_bound_dev_if match if the
    skb came through an interface enslaved to an l3mdev device and
    tcp_l3mdev_accept is not set.

    skbs arriving through an l3mdev interface are marked by setting a flag in
    inet{6}_skb_parm. The IPv6 variant is already set; this patch adds the
    flag for IPv4. Using an skb flag avoids a device lookup on the dif. The
    flag is set in the VRF driver using the IP{6}CB macros. For IPv4, the
    inet_skb_parm struct is moved into the skb cb per commit 971f10eca186, so
    the match function in the TCP stack needs to use TCP_SKB_CB. For IPv6,
    the move is done after the socket lookup, so IP6CB is used.

    The flags field in the inet_skb_parm struct needs to be widened to hold
    another flag. There is currently a 1-byte hole following the flags field,
    so it can be expanded to a u16 without increasing the size of the struct.
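
    A rough sketch of the two sides of this change (illustrative, simplified
    from the actual patch): the VRF driver marks the skb, and the TCP lookup
    path tests the mark without any device lookup:
    ~~~~~
    /* VRF driver, IPv4 receive path: record that the skb arrived through
     * an interface enslaved to an l3mdev device.
     */
            IPCB(skb)->flags |= IPSKB_L3SLAVE;

    /* TCP socket lookup: require an exact dif == sk_bound_dev_if match
     * when the flag is set and tcp_l3mdev_accept is off.
     */
    static inline bool ipv4_l3mdev_skb(u16 flags)
    {
            return flags & IPSKB_L3SLAVE;
    }
    ~~~~~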

    Fixes: 193125dbd8eb ("net: Introduce VRF device driver")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

21 Sep, 2016

7 commits

  • This commit introduces an optional new "omnipotent" hook,
    cong_control(), for congestion control modules. The cong_control()
    function is called at the end of processing an ACK (i.e., after
    updating sequence numbers, the SACK scoreboard, and loss
    detection). At that moment we have precise delivery rate information
    the congestion control module can use to control the sending behavior
    (using cwnd, TSO skb size, and pacing rate) in any CA state.

    This function can also be used by a congestion control that prefers
    not to use the default cwnd reduction approach (i.e., the PRR
    algorithm) during CA_Recovery to control the cwnd and sending rate
    during loss recovery.

    We take advantage of the fact that recent changes defer the
    retransmission or transmission of new data (e.g. by F-RTO) in recovery
    until the new tcp_cong_control() function is run.

    With this commit, we only run tcp_update_pacing_rate() if the
    congestion control is not using this new API. New congestion controls
    which use the new API do not want the TCP stack to run the default
    pacing rate calculation and overwrite whatever pacing rate they have
    chosen at initialization time.
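
    A minimal sketch of the dispatch described above, simplified from the
    ACK-processing path in tcp_input.c:
    ~~~~~
    static void tcp_cong_control(struct sock *sk, u32 ack, u32 acked_sacked,
                                 int flag, const struct rate_sample *rs)
    {
            const struct inet_connection_sock *icsk = inet_csk(sk);

            if (icsk->icsk_ca_ops->cong_control) {
                    /* The module controls cwnd and pacing itself in every state */
                    icsk->icsk_ca_ops->cong_control(sk, rs);
                    return;
            }

            if (tcp_in_cwnd_reduction(sk))
                    tcp_cwnd_reduction(sk, acked_sacked, flag);  /* default PRR */
            else if (tcp_may_raise_cwnd(sk, flag))
                    tcp_cong_avoid(sk, ack, acked_sacked);
            tcp_update_pacing_rate(sk);
    }
    ~~~~~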

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Currently the TCP send buffer expands to twice cwnd, in order to allow
    limited transmits in the CA_Recovery state. This assumes that cwnd
    does not increase in CA_Recovery.

    For some congestion control algorithms, like the upcoming BBR module,
    if the losses in recovery do not indicate congestion then we may
    continue to raise cwnd multiplicatively in recovery. In such cases the
    current multiplier will falsely limit the sending rate, much as if it
    were limited by the application.

    This commit adds an optional congestion control callback to use a
    different multiplier to expand the TCP send buffer. For congestion
    control modules that do not specify this callback, TCP continues to
    use the previous default of 2.
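
    A sketch of how the callback is consulted, simplified from
    tcp_sndbuf_expand():
    ~~~~~
    static void tcp_sndbuf_expand(struct sock *sk)
    {
            const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
            const struct tcp_sock *tp = tcp_sk(sk);
            u32 nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd);
            int per_mss = max_t(u32, tp->rx_opt.mss_clamp, tp->mss_cache) +
                          MAX_TCP_HEADER + sizeof(struct sk_buff);
            int sndmem;

            /* Old behavior: always buffer 2 * cwnd; now the CC module may
             * supply its own multiplier (e.g. BBR needs more headroom).
             */
            sndmem = ca_ops->sndbuf_expand ? ca_ops->sndbuf_expand(sk) : 2;
            sndmem *= nr_segs * per_mss;

            if (sk->sk_sndbuf < sndmem)
                    sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
    }
    ~~~~~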

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • To allow congestion control modules to use the default TSO auto-sizing
    algorithm as one of the ingredients in their own decision about TSO sizing:

    1) Export tcp_tso_autosize() so that CC modules can use it.

    2) Change tcp_tso_autosize() to allow callers to specify a minimum
    number of segments per TSO skb, in case the congestion control
    module has a different notion of the best floor for TSO skbs for
    the connection right now. For very low-rate paths or policed
    connections it can be appropriate to use smaller TSO skbs.

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Add the tso_segs_goal() function in tcp_congestion_ops to allow the
    congestion control module to specify the number of segments that
    should be in a TSO skb sent by tcp_write_xmit() and
    tcp_xmit_retransmit_queue(). The congestion control module can either
    request a particular number of segments per TSO skb that we transmit,
    or return 0 if it doesn't care.

    This allows the upcoming BBR congestion control module to select small
    TSO skb sizes if the module detects that the bottleneck bandwidth is
    very low, or that the connection is policed to a low rate.
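
    As an illustration, a BBR-style module might implement this hook by
    reusing the exported tcp_tso_autosize() with a small floor on slow
    paths (the rate threshold below is illustrative):
    ~~~~~
    /* Ask for small TSO skbs on slow or policed paths, otherwise let
     * tcp_tso_autosize() pick the goal from the pacing rate.
     */
    static u32 example_tso_segs_goal(struct sock *sk)
    {
            struct tcp_sock *tp = tcp_sk(sk);
            u32 min_segs;

            /* ~1.2 Mbit/s expressed in bytes/sec */
            min_segs = sk->sk_pacing_rate < 150000 ? 1 : 2;

            return tcp_tso_autosize(sk, tp->mss_cache, min_segs);
    }
    ~~~~~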

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • This commit adds code to track whether the delivery rate represented
    by each rate_sample was limited by the application.

    Upon each transmit, we store in the is_app_limited field in the skb a
    boolean bit indicating whether there is a known "bubble in the pipe":
    a point in the rate sample interval where the sender was
    application-limited, and did not transmit even though the cwnd and
    pacing rate allowed it.

    This logic marks the flow app-limited on a write if *all* of the
    following are true:

    1) There is less than 1 MSS of unsent data in the write queue
    available to transmit.

    2) There is no packet in the sender's queues (e.g. in fq or the NIC
    tx queue).

    3) The connection is not limited by cwnd.

    4) There are no lost packets to retransmit.

    The tcp_rate_check_app_limited() code in tcp_rate.c determines whether
    the connection is application-limited at the moment. If the flow is
    application-limited, it sets the tp->app_limited field. If the flow is
    application-limited then that means there is effectively a "bubble" of
    silence in the pipe now, and this silence will be reflected in a lower
    bandwidth sample for any rate samples from now until we get an ACK
    indicating this bubble has exited the pipe: specifically, until we get
    an ACK for the next packet we transmit.
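
    A sketch of that check, simplified from net/ipv4/tcp_rate.c, with the
    four conditions annotated:
    ~~~~~
    void tcp_rate_check_app_limited(struct sock *sk)
    {
            struct tcp_sock *tp = tcp_sk(sk);

            if (/* 1) less than 1 MSS of unsent data in the write queue */
                tp->write_seq - tp->snd_nxt < tp->mss_cache &&
                /* 2) nothing in the host's qdisc or NIC tx queues */
                sk_wmem_alloc_get(sk) < SKB_TRUESIZE(1) &&
                /* 3) not limited by cwnd */
                tcp_packets_in_flight(tp) < tp->snd_cwnd &&
                /* 4) no lost packets left to retransmit */
                tp->lost_out <= tp->retrans_out)
                    tp->app_limited =
                            (tp->delivered + tcp_packets_in_flight(tp)) ? : 1;
    }
    ~~~~~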

    Whenever we send an skb, we record in scb->tx.is_app_limited whether the
    resulting rate sample will be application-limited.

    The code in tcp_rate_gen() checks to see when it is safe to mark all
    known application-limited bubbles of silence as having exited the
    pipe. It does this by checking to see when the delivered count moves
    past the tp->app_limited marker. At this point it zeroes the
    tp->app_limited marker, as all known bubbles are out of the pipe.

    We make room for the tx.is_app_limited bit in the skb by borrowing a
    bit from the in_flight field used by NV to record the number of bytes
    in flight. The receive window in the TCP header is 16 bits, and the
    max receive window scaling shift factor is 14 (RFC 1323). So the max
    receive window offered by the TCP protocol is 2^(16+14) = 2^30. So we
    only need 30 bits for the tx.in_flight used by NV.

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
     
  • This patch generates data delivery rate (throughput) samples on a
    per-ACK basis. These rate samples can be used by congestion control
    modules, and specifically will be used by TCP BBR in later patches in
    this series.

    Key state:

    tp->delivered: Tracks the total number of data packets (original or not)
    delivered so far. This is an already-existing field.

    tp->delivered_mstamp: the last time tp->delivered was updated.

    Algorithm:

    A rate sample is calculated as (d1 - d0)/(t1 - t0) on a per-ACK basis:

    d1: the current tp->delivered after processing the ACK
    t1: the current time after processing the ACK

    d0: the prior tp->delivered when the acked skb was transmitted
    t0: the prior tp->delivered_mstamp when the acked skb was transmitted

    When an skb is transmitted, we snapshot d0 and t0 in its control
    block in tcp_rate_skb_sent().

    When an ACK arrives, it may SACK and ACK some skbs. For each SACKed
    or ACKed skb, tcp_rate_skb_delivered() updates the rate_sample struct
    to reflect the latest (d0, t0).

    Finally, tcp_rate_gen() generates a rate sample by storing
    (d1 - d0) in rs->delivered and (t1 - t0) in rs->interval_us.

    One caveat: if an skb was sent with no packets in flight, then
    tp->delivered_mstamp may be either invalid (if the connection is
    starting) or outdated (if the connection was idle). In that case,
    we'll re-stamp tp->delivered_mstamp.

    At first glance it seems t0 should always be the time when an skb was
    transmitted, but actually this could over-estimate the rate due to
    phase mismatch between transmit and ACK events. To track the delivery
    rate, we ensure that if packets are in flight then t0 and t1 are
    times at which packets were marked delivered.

    If the initial and final RTTs are different then one may be corrupted
    by some sort of noise. The noise we see most often is sending gaps
    caused by delayed, compressed, or stretched acks. This either affects
    both RTTs equally or artificially reduces the final RTT. We approach
    this by recording the info we need to compute the initial RTT
    (duration of the "send phase" of the window) when we recorded the
    associated inflight. Then, for a filter to avoid bandwidth
    overestimates, we generalize the per-sample bandwidth computation
    from:

    bw = delivered / ack_phase_rtt

    to the following:

    bw = delivered / max(send_phase_rtt, ack_phase_rtt)

    In large-scale experiments, this filtering approach incorporating
    send_phase_rtt is effective at avoiding bandwidth overestimates due to
    ACK compression or stretched ACKs.
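
    A compressed sketch of that guard as it appears in the rate sampler
    (helper names are approximate):
    ~~~~~
            /* rs->interval_us initially holds the send-phase duration
             * (snapshotted at transmit time); compare it with the
             * ack-phase duration measured now.
             */
            snd_us = rs->interval_us;                          /* send phase */
            ack_us = stamp_us_delta(now, rs->prior_mstamp);    /* ack phase  */
            rs->interval_us = max(snd_us, ack_us);

            /* consumers such as BBR then compute:
             *   bw = rs->delivered / rs->interval_us
             */
    ~~~~~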

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Refactor the TCP min_rtt code to reuse the new win_minmax library in
    lib/win_minmax.c to simplify the TCP code.

    This is a pure refactor: the functionality is exactly the same. We
    just moved the windowed min code to make TCP easier to read and
    maintain, and to allow other parts of the kernel to use the windowed
    min/max filter code.
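
    For reference, a brief sketch of how the TCP min_rtt update looks on top
    of the generic windowed filter (simplified):
    ~~~~~
    #include <linux/win_minmax.h>

    static void tcp_update_rtt_min(struct sock *sk, u32 rtt_us)
    {
            struct tcp_sock *tp = tcp_sk(sk);
            const u32 wlen = sysctl_tcp_min_rtt_wlen * HZ;   /* filter window */

            /* track the minimum RTT seen over the last wlen ticks */
            minmax_running_min(&tp->rtt_min, wlen, tcp_time_stamp,
                               rtt_us ? : jiffies_to_usecs(1));
    }

    /* readers use minmax_get(&tp->rtt_min) to fetch the current windowed min */
    ~~~~~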

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
     

09 Sep, 2016

1 commit

  • Over the years, the TCP BDP has increased by several orders of magnitude,
    and some people are considering reaching the 2 GB limit.

    Even with the current window scale limit of 14, ~1 GB maps to ~740,000
    MSS.

    In the presence of packet losses (or reordering), TCP stores incoming
    packets in an out-of-order queue, and the number of skbs sitting there
    waiting for the missing packets to be received can be in the 10^5 range.

    Most packets are appended to the tail of this queue, and when
    packets can finally be transferred to receive queue, we scan the queue
    from its head.

    However, in the presence of heavy losses, we might have to find an
    arbitrary point in this queue, involving a linear scan for every incoming
    packet and thrashing CPU caches.

    This patch converts the queue to an RB tree, to get bounded latencies.
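
    A minimal sketch of the new insertion path (simplified from
    tcp_data_queue_ofo(); overlap handling omitted):
    ~~~~~
            struct rb_node **p = &tp->out_of_order_queue.rb_node;
            struct rb_node *parent = NULL;
            u32 seq = TCP_SKB_CB(skb)->seq;

            while (*p) {
                    struct sk_buff *skb1 = rb_entry(*p, struct sk_buff, rbnode);

                    parent = *p;
                    if (before(seq, TCP_SKB_CB(skb1)->seq))
                            p = &parent->rb_left;
                    else
                            p = &parent->rb_right;
            }
            rb_link_node(&skb->rbnode, parent, p);
            rb_insert_color(&skb->rbnode, &tp->out_of_order_queue);

            /* tp->ooo_last_skb caches the highest-seq skb so the common
             * tail-append case stays O(1).
             */
    ~~~~~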

    Yaogong wrote a preliminary patch about 2 years ago.
    Eric did the rebase, added ofo_last_skb cache, polishing and tests.

    Tested with the network dropping between 1 and 10 % of packets, with good
    results (about a 30 % throughput increase in stress tests).

    Next step would be to also use an RB tree for the write queue at sender
    side ;)

    Signed-off-by: Yaogong Wang
    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Ilpo Järvinen
    Acked-By: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Yaogong Wang
     

30 Aug, 2016

1 commit


29 Aug, 2016

3 commits

  • When TCP operates in lossy environments (between 1 and 10 % packet
    losses), many SACK blocks can be exchanged, and I noticed we could
    drop them on busy senders, if these SACK blocks have to be queued
    into the socket backlog.

    While the main cause is the poor performance of RACK/SACK processing,
    we can try to avoid these drops of valuable information that can lead to
    spurious timeouts and retransmits.

    The cause of the drops is skb->truesize overestimation, caused by:

    - drivers allocating ~2048 (or more) bytes as a fragment to hold an
    Ethernet frame.

    - various pskb_may_pull() calls bringing the headers into skb->head
    might have pulled all the frame content, but skb->truesize could
    not be lowered, as the stack has no idea of each fragment truesize.

    The backlog drops are also more visible on bidirectional flows, since
    their sk_rmem_alloc can be quite big.

    Let's add some room for the backlog, as only the socket owner
    can selectively take action to lower memory needs, like collapsing
    receive queues or partial ofo pruning.
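
    A sketch of the headroom added for backlog processing, simplified from
    the resulting tcp_add_backlog():
    ~~~~~
    bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
    {
            /* Only the socket owner can collapse/prune queues, so allow
             * extra room beyond sk_rcvbuf before dropping.
             */
            u32 limit = sk->sk_rcvbuf + sk->sk_sndbuf + 64 * 1024;

            if (unlikely(sk_add_backlog(sk, skb, limit))) {
                    bh_unlock_sock(sk);
                    __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
                    return true;    /* caller frees the skb */
            }
            return false;
    }
    ~~~~~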

    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • In inet_stream_ops we set read_sock to tcp_read_sock and peek_len to
    tcp_peek_len (which is just a stub function that calls tcp_inq).

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Add a new function to the proto_ops structure. This includes moving the
    typedef for sk_read_actor into net.h and removing the definition from
    tcp.h.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

24 Aug, 2016

1 commit

  • TFO_SERVER_WO_SOCKOPT2 was intended for debugging purposes during
    Fast Open development. Remove this config option and also
    update/clean-up the documentation of the Fast Open sysctl.

    Reported-by: Piotr Jurkiewicz
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

19 Aug, 2016

1 commit

  • When tcp_sendmsg() allocates a fresh and empty skb, it puts it at the
    tail of the write queue using tcp_add_write_queue_tail().

    Then it attempts to copy user data into this fresh skb.

    If the copy fails, we undo the work and remove the fresh skb.

    Unfortunately, this undo lacks the change done to tp->highest_sack and
    we can leave a dangling pointer (to a freed skb).

    Later, tcp_xmit_retransmit_queue() can dereference this pointer and
    access freed memory. For regular kernels where memory is not unmapped,
    this might cause SACK bugs because tcp_highest_sack_seq() is buggy,
    returning garbage instead of tp->snd_nxt, but with various debug
    features like CONFIG_DEBUG_PAGEALLOC, this can crash the kernel.
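
    The fix is essentially to clear the cached pointer when the fresh skb is
    unlinked; a sketch, close to the helper in include/net/tcp.h:
    ~~~~~
    static inline void tcp_check_send_head(struct sock *sk,
                                           struct sk_buff *skb_unlinked)
    {
            if (sk->sk_send_head == skb_unlinked)
                    sk->sk_send_head = NULL;
            /* also forget the skb if it was cached as the highest SACKed one */
            if (tcp_sk(sk)->highest_sack == skb_unlinked)
                    tcp_sk(sk)->highest_sack = NULL;
    }
    ~~~~~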

    This bug was found by Marco Grassi thanks to syzkaller.

    Fixes: 6859d49475d4 ("[TCP]: Abstract tp->highest_sack accessing & point to next skb")
    Reported-by: Marco Grassi
    Signed-off-by: Eric Dumazet
    Cc: Ilpo Järvinen
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Acked-by: Neal Cardwell
    Reviewed-by: Cong Wang
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Jul, 2016

1 commit

  • Some arches have virtually mapped kernel stacks, or soon will.

    tcp_md5_hash_header() uses an automatic variable to copy the TCP header
    before mangling th->check and calling the crypto function, which might
    be problematic on such arches.

    David says that using percpu storage is also problematic on non-SMP
    builds.

    Just use kmalloc() to allocate scratch areas.
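
    A rough sketch of the idea (illustrative only; the real patch hangs the
    buffer off the per-cpu tcp_md5sig_pool):
    ~~~~~
            /* One kmalloc()-backed scratch area per possible CPU, sized to
             * hold the pseudo-header block plus a copy of the TCP header.
             */
            for_each_possible_cpu(cpu) {
                    void *scratch = kmalloc(sizeof(union tcp_md5sum_block) +
                                            sizeof(struct tcphdr), GFP_KERNEL);

                    if (!scratch)
                            return -ENOMEM;
                    per_cpu(tcp_md5sig_pool, cpu).scratch = scratch;
            }
    ~~~~~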

    Signed-off-by: Eric Dumazet
    Reported-by: Andy Lutomirski
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Jun, 2016

1 commit

  • In the previous commit 01f83d69844d307be2aa6fea88b0e8fe5cbdb2f4,
    the following comment was added:

    "When peer uses tiny windows, there is no use in packetizing to sub-MSS
    pieces for the sake of SWS or making sure there are enough packets in
    the pipe for fast recovery."

    The test should be > TCP_MSS_DEFAULT, not >= 512. This allows low-end
    devices that advertise an MSS of 536 (TCP_MSS_DEFAULT) to see better
    network performance by sending them 536 bytes of data at a time instead
    of being bounded to half the window size (268). Other network stacks work
    this way, e.g. HP-UX.
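
    The resulting helper, sketched (simplified from include/net/tcp.h):
    ~~~~~
    static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize)
    {
            int cutoff;

            /* old test was: if (tp->max_window >= 512) */
            if (tp->max_window > TCP_MSS_DEFAULT)
                    cutoff = (tp->max_window >> 1);
            else
                    cutoff = tp->max_window;

            if (cutoff && pktsize > cutoff)
                    return max_t(int, cutoff, 68U - tp->tcp_header_len);
            return pktsize;
    }
    ~~~~~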

    Signed-off-by: Shane Seymour
    Signed-off-by: David S. Miller

    Seymour, Shane M
     

11 Jun, 2016

1 commit

  • Add an in_flight field (bytes in flight when the packet was sent)
    to the tx component of tcp_skb_cb and make it available to
    congestion control modules' pkts_acked() function through the
    ack_sample function argument.

    Signed-off-by: Lawrence Brakmo
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     

12 May, 2016

2 commits

  • Currently the VRF driver uses the rx_handler to switch the skb device
    to the VRF device. Switching the dev prior to the ip / ipv6 layer
    means the VRF driver has to duplicate IP/IPv6 processing which adds
    overhead and makes features such as retaining the ingress device index
    more complicated than necessary.

    This patch moves the hook to the L3 layer just after the first NF_HOOK
    for PRE_ROUTING. This location makes exposing the original ingress device
    trivial (next patch) and allows adding other NF_HOOKs to the VRF driver
    in the future.

    dev_queue_xmit_nit is exported so that the VRF driver can cycle the skb
    with the switched device through the packet taps to maintain current
    behavior (tcpdump can be used on either the vrf device or the enslaved
    devices).

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • Replace 2 arguments (cnt and rtt) in the congestion control modules'
    pkts_acked() function with a struct. This will allow adding more
    information without having to modify existing congestion control
    modules (tcp_nv in particular needs bytes in flight when packet
    was sent).

    As proposed by Neal Cardwell in his comments to the tcp_nv patch.
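
    A sketch of the struct and the signature change (field names per the
    series; more fields were added later):
    ~~~~~
    struct ack_sample {
            u32 pkts_acked;
            s32 rtt_us;
    };

    /* old hook: void (*pkts_acked)(struct sock *sk, u32 num_acked, s32 rtt_us);
     * new hook: void (*pkts_acked)(struct sock *sk,
     *                              const struct ack_sample *sample);
     */
    ~~~~~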

    Signed-off-by: Lawrence Brakmo
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     

09 May, 2016

1 commit

  • Refactor tcp_skb_cb to create two overlapping areas to store
    state for incoming or outgoing skbs, based on comments by
    Neal Cardwell on the tcp_nv patch:

    AFAICT this patch would not require an increase in the size of
    sk_buff cb[] if it were to take advantage of the fact that the
    tcp_skb_cb header.h4 and header.h6 fields are only used in the packet
    reception code path, and this in_flight field is only used on the
    transmit side.
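
    A sketch of the resulting layout (simplified from include/net/tcp.h):
    ~~~~~
    struct tcp_skb_cb {
            __u32   seq;            /* ... other common fields elided ... */
            __u32   end_seq;

            union {
                    struct {
                            /* transmit-side state, e.g. the in_flight
                             * field added by the follow-up patch
                             */
                            __u32 in_flight;
                    } tx;
                    union {
                            struct inet_skb_parm    h4;
    #if IS_ENABLED(CONFIG_IPV6)
                            struct inet6_skb_parm   h6;
    #endif
                    } header;       /* receive-side IP header state */
            };
    };
    ~~~~~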

    Signed-off-by: Lawrence Brakmo
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     

29 Apr, 2016

1 commit

  • This patch adds an eor bit to the TCP_SKB_CB. When MSG_EOR
    is passed to tcp_sendmsg, the eor bit will be set on the skb
    containing the last byte of the userland message. The eor bit
    will prevent data from being appended to that skb in the future.

    The change in do_tcp_sendpages is to honor the eor set
    during the previous tcp_sendmsg(MSG_EOR) call.

    This patch handles the tcp_sendmsg case. The followup patches
    will handle other skb coalescing and fragment cases.

    One potential use case is to use MSG_EOR with
    SOF_TIMESTAMPING_TX_ACK to get more accurate
    TCP ACK timestamping on application protocols with
    multiple outgoing response messages (e.g. HTTP/2).

    Packetdrill script for testing:
    ~~~~~~
    +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
    +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
    +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    +0 bind(3, ..., ...) = 0
    +0 listen(3, 1) = 0

    0.100 < S 0:0(0) win 32792
    0.100 > S. 0:0(0) ack 1
    0.200 < . 1:1(0) ack 1 win 257
    0.200 accept(3, ..., ...) = 4
    +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

    0.200 write(4, ..., 14600) = 14600
    0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
    0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730

    0.200 > . 1:7301(7300) ack 1
    0.200 > P. 7301:14601(7300) ack 1

    0.300 < . 1:1(0) ack 14601 win 257
    0.300 > P. 14601:15331(730) ack 1
    0.300 > P. 15331:16061(730) ack 1

    0.400 < . 1:1(0) ack 16061 win 257
    0.400 close(4) = 0
    0.400 > F. 16061:16061(0) ack 1
    0.400 < F. 1:1(0) ack 16062 win 257
    0.400 > . 16062:16062(0) ack 2

    Signed-off-by: Martin KaFai Lau
    Cc: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Soheil Hassas Yeganeh
    Cc: Willem de Bruijn
    Cc: Yuchung Cheng
    Suggested-by: Eric Dumazet
    Acked-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     

28 Apr, 2016

4 commits

  • There is nothing related to BH in SNMP counters anymore,
    since linux-3.0.

    Rename helpers to use __ prefix instead of _BH prefix,
    for contexts where preemption is disabled.

    This more closely matches convention used to update
    percpu variables.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Rename NET_INC_STATS_BH() to __NET_INC_STATS()
    and NET_ADD_STATS_BH() to __NET_ADD_STATS()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Rename TCP_INC_STATS_BH() to __TCP_INC_STATS()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • In the old days (before linux-3.0), SNMP counters were duplicated,
    one for user context, and one for BH context.

    After commit 8f0ea0fe3a03 ("snmp: reduce percpu needs by 50%")
    we have a single copy, and what really matters is preemption being
    enabled or disabled, since we use this_cpu_inc() or __this_cpu_inc()
    respectively.

    We therefore kill SNMP_INC_STATS_USER(), SNMP_ADD_STATS_USER(),
    NET_INC_STATS_USER(), NET_ADD_STATS_USER(), SCTP_INC_STATS_USER(),
    SNMP_INC_STATS64_USER(), SNMP_ADD_STATS64_USER(), TCP_ADD_STATS_USER(),
    UDP_INC_STATS_USER(), UDP6_INC_STATS_USER(), and XFRM_INC_STATS_USER()

    Following patches will rename __BH helpers to make clear their
    usage is not tied to BH being disabled.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

25 Apr, 2016

1 commit

  • The Linux TCP stack painfully segments all TSO/GSO packets before
    retransmits.

    This was fine back in the days when TSO/GSO was emerging, with its
    bugs, but we believe the dark age is over.

    Keeping big packets both in the write queues and during stack traversal
    has a lot of benefits:
    - Less memory overhead, because write queues have fewer skbs.
    - Less cpu overhead at ACK processing.
    - Better SACK processing, as a lot of studies mentioned how
    awful Linux was at this ;)
    - Less cpu overhead to send the rtx packets
    (IP stack traversal, netfilter traversal, drivers...)
    - Better latencies in presence of losses.
    - Smaller spikes in fq like packet schedulers, as retransmits
    are not constrained by TCP Small Queues.

    1 % packet losses are common today, and at 100Gbit speeds, this
    translates to ~80,000 losses per second.
    Losses are often correlated, and we see many retransmit events
    leading to 1-MSS trains of packets, at a time when hosts are already
    under stress.

    Signed-off-by: Eric Dumazet
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Apr, 2016

1 commit


22 Apr, 2016

1 commit

  • After receiving sacks, tcp_shifted_skb() will collapse
    skbs if possible. tx_flags and tskey also have to be
    merged.

    This patch reuses the tcp_skb_collapse_tstamp() to handle
    them.

    BPF Output Before:
    ~~~~~
    (no ee_data entry is emitted, because the tskey is lost when the skbs
    are collapsed)

    BPF Output After:
    ~~~~~
    -2024 [007] d.s. 88.644374: : ee_data:14599

    Packetdrill Script:
    ~~~~~
    +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
    +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
    +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    +0 bind(3, ..., ...) = 0
    +0 listen(3, 1) = 0

    0.100 < S 0:0(0) win 32792
    0.100 > S. 0:0(0) ack 1
    0.200 < . 1:1(0) ack 1 win 257
    0.200 accept(3, ..., ...) = 4
    +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

    0.200 write(4, ..., 1460) = 1460
    +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
    0.200 write(4, ..., 13140) = 13140

    0.200 > P. 1:1461(1460) ack 1
    0.200 > . 1461:8761(7300) ack 1
    0.200 > P. 8761:14601(5840) ack 1

    0.300 < . 1:1(0) ack 1 win 257
    0.300 > P. 1:1461(1460) ack 1
    0.400 < . 1:1(0) ack 14601 win 257

    0.400 close(4) = 0
    0.400 > F. 14601:14601(0) ack 1
    0.500 < F. 1:1(0) ack 14602 win 257
    0.500 > . 14602:14602(0) ack 2

    Signed-off-by: Martin KaFai Lau
    Cc: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Soheil Hassas Yeganeh
    Cc: Willem de Bruijn
    Cc: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Tested-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     

16 Apr, 2016

1 commit

  • When removing sk_refcnt manipulation on synflood, I missed that
    using skb_set_owner_w() was racy, if sk->sk_wmem_alloc had already
    transitioned to 0.

    We should hold sk_refcnt instead, but this is a big deal under attack.
    (Doing so only increases performance from 3.2 Mpps to 3.8 Mpps.)

    In this patch, I chose to not attach a socket to syncookies skb.

    Performance is now 5 Mpps instead of 3.2 Mpps.

    The following patch will remove the last known false sharing in
    tcp_rcv_state_process().

    Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

05 Apr, 2016

2 commits

  • Goal: packets dropped by a listener are accounted for.

    This adds a tcp_listendrop() helper, and clears sk_drops in sk_clone_lock()
    so that children do not inherit their parent's drop count.

    Note that we no longer increment the LINUX_MIB_LISTENDROPS counter when
    sending a SYNCOOKIE, since the SYN packet generated a SYNACK.
    We already have a separate LINUX_MIB_SYNCOOKIESSENT counter.
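
    The helper itself is tiny; a sketch close to the version added to
    include/net/tcp.h:
    ~~~~~
    static inline void tcp_listendrop(const struct sock *sk)
    {
            /* per-listener drop count, visible in ss output */
            atomic_inc(&((struct sock *)sk)->sk_drops);
            /* plus the global SNMP counter */
            __NET_INC_STATS(sock_net(sk), LINUX_MIB_LISTENDROPS);
    }
    ~~~~~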

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Currently, to avoid a cache line miss when accessing skb_shinfo,
    tcp_ack_tstamp skips sockets that do not have the
    SOF_TIMESTAMPING_TX_ACK bit set in sk_tsflags. This is
    implemented based on an implicit assumption that the
    SOF_TIMESTAMPING_TX_ACK is set via socket options for the
    duration that ACK timestamps are needed.

    To implement per-write timestamps, this check should be
    removed and replaced with a per-packet alternative that
    quickly skips packets missing ACK timestamps marks without
    a cache-line miss.

    To enable per-packet marking without a cache line miss, use
    one bit in TCP_SKB_CB to mark whether an skb might need an
    ACK TX timestamp. Further checks in tcp_ack_tstamp are not
    modified and work as before.
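
    A sketch of both sides of that bit (simplified from the transmit and
    ACK paths):
    ~~~~~
            /* transmit path, tcp_tx_timestamp(): remember the request */
            if (sk->sk_tsflags & SOF_TIMESTAMPING_TX_ACK)
                    TCP_SKB_CB(skb)->txstamp_ack = 1;

            /* ACK path, tcp_ack_tstamp(): cheap per-skb test, no shinfo access */
            if (likely(!TCP_SKB_CB(skb)->txstamp_ack))
                    return;
    ~~~~~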

    Signed-off-by: Soheil Hassas Yeganeh
    Acked-by: Willem de Bruijn
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
     

03 Apr, 2016

1 commit

  • For non-SACK connections, cwnd is lowered to inflight plus 3 packets
    when the recovery ends. This is an optional feature in the NewReno
    RFC 2582 to reduce the potential burst when cwnd is "re-opened"
    after recovery and inflight is low.

    This feature is questionably effective because of PRR: when
    the recovery ends (i.e., snd_una == high_seq) NewReno holds the
    CA_Recovery state for another round trip to prevent false fast
    retransmits. But if the inflight is low, PRR will overwrite the
    moderated cwnd in tcp_cwnd_reduction() later regardless. So if a
    receiver responds with bogus ACKs (i.e., acking future data) to speed up
    transfer after recovery, it can only induce a burst up to a window's
    worth of data packets by acking up to SND.NXT. A restart from (short)
    idle or receiving stretched ACKs can both cause such bursts as well.

    On the other hand, if the recovery ends because the sender
    detects that the losses were spurious (e.g., due to reordering), this
    feature unconditionally lowers a reverted cwnd even though nothing
    was lost.

    In principle, the loss recovery module should not update cwnd.
    Furthermore, pacing is much more effective at reducing bursts. Hence this
    patch removes the cwnd moderation feature.

    v2 changes: revised commit message on bogus ACKs and burst, and
    missing signature

    Signed-off-by: Matt Mathis
    Signed-off-by: Neal Cardwell
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

20 Mar, 2016

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    1) Support more Realtek wireless chips, from Jes Sorenson.

    2) New BPF types for per-cpu hash and array maps, from Alexei
    Starovoitov.

    3) Make several TCP sysctls per-namespace, from Nikolay Borisov.

    4) Allow the use of SO_REUSEPORT in order to do per-thread processing
    of incoming TCP/UDP connections. The muxing can be done using a
    BPF program which hashes the incoming packet. From Craig Gallek.

    5) Add a multiplexer for TCP streams, to provide a message-based
    interface. BPF programs can be used to determine the message
    boundaries. From Tom Herbert.

    6) Add 802.1AE MACSEC support, from Sabrina Dubroca.

    7) Avoid factorial complexity when taking down an inetdev interface
    with lots of configured addresses. We were doing things like
    traversing the entire address list for each address removed, and
    flushing the entire netfilter conntrack table for every address as
    well.

    8) Add and use SKB bulk free infrastructure, from Jesper Brouer.

    9) Allow offloading u32 classifiers to hardware, and implement for
    ixgbe, from John Fastabend.

    10) Allow configuring IRQ coalescing parameters on a per-queue basis,
    from Kan Liang.

    11) Extend ethtool so that larger link mode masks can be supported.
    From David Decotigny.

    12) Introduce devlink, which can be used to configure port link types
    (ethernet vs Infiniband, etc.), port splitting, and switch device
    level attributes as a whole. From Jiri Pirko.

    13) Hardware offload support for flower classifiers, from Amir Vadai.

    14) Add "Local Checksum Offload". Basically, for a tunneled packet
    the checksum of the outer header is 'constant' (because with the
    checksum field filled into the inner protocol header, the payload
    of the outer frame checksums to 'zero'), and we can take advantage
    of that in various ways. From Edward Cree"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1548 commits)
    bonding: fix bond_get_stats()
    net: bcmgenet: fix dma api length mismatch
    net/mlx4_core: Fix backward compatibility on VFs
    phy: mdio-thunder: Fix some Kconfig typos
    lan78xx: add ndo_get_stats64
    lan78xx: handle statistics counter rollover
    RDS: TCP: Remove unused constant
    RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket
    net: smc911x: convert pxa dma to dmaengine
    team: remove duplicate set of flag IFF_MULTICAST
    bonding: remove duplicate set of flag IFF_MULTICAST
    net: fix a comment typo
    ethernet: micrel: fix some error codes
    ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use it
    bpf, dst: add and use dst_tclassid helper
    bpf: make skb->tc_classid also readable
    net: mvneta: bm: clarify dependencies
    cls_bpf: reset class and reuse major in da
    ldmvsw: Checkpatch sunvnet.c and sunvnet_common.c
    ldmvsw: Add ldmvsw.c driver code
    ...

    Linus Torvalds
     

18 Mar, 2016

1 commit

  • Pull crypto update from Herbert Xu:
    "Here is the crypto update for 4.6:

    API:
    - Convert remaining crypto_hash users to shash or ahash, also convert
    blkcipher/ablkcipher users to skcipher.
    - Remove crypto_hash interface.
    - Remove crypto_pcomp interface.
    - Add crypto engine for async cipher drivers.
    - Add akcipher documentation.
    - Add skcipher documentation.

    Algorithms:
    - Rename crypto/crc32 to avoid name clash with lib/crc32.
    - Fix bug in keywrap where we zero the wrong pointer.

    Drivers:
    - Support T5/M5, T7/M7 SPARC CPUs in n2 hwrng driver.
    - Add PIC32 hwrng driver.
    - Support BCM6368 in bcm63xx hwrng driver.
    - Pack structs for 32-bit compat users in qat.
    - Use crypto engine in omap-aes.
    - Add support for sama5d2x SoCs in atmel-sha.
    - Make atmel-sha available again.
    - Make sahara hashing available again.
    - Make ccp hashing available again.
    - Make sha1-mb available again.
    - Add support for multiple devices in ccp.
    - Improve DMA performance in caam.
    - Add hashing support to rockchip"

    * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (116 commits)
    crypto: qat - remove redundant arbiter configuration
    crypto: ux500 - fix checks of error code returned by devm_ioremap_resource()
    crypto: atmel - fix checks of error code returned by devm_ioremap_resource()
    crypto: qat - Change the definition of icp_qat_uof_regtype
    hwrng: exynos - use __maybe_unused to hide pm functions
    crypto: ccp - Add abstraction for device-specific calls
    crypto: ccp - CCP versioning support
    crypto: ccp - Support for multiple CCPs
    crypto: ccp - Remove check for x86 family and model
    crypto: ccp - memset request context to zero during import
    lib/mpi: use "static inline" instead of "extern inline"
    lib/mpi: avoid assembler warning
    hwrng: bcm63xx - fix non device tree compatibility
    crypto: testmgr - allow rfc3686 aes-ctr variants in fips mode.
    crypto: qat - The AE id should be less than the maximal AE number
    lib/mpi: Endianness fix
    crypto: rockchip - add hash support for crypto engine in rk3288
    crypto: xts - fix compile errors
    crypto: doc - add skcipher API documentation
    crypto: doc - update AEAD AD handling
    ...

    Linus Torvalds
     

15 Mar, 2016

1 commit

  • Per RFC 4898, tcpi_data_segs_out/in count segments sent/received
    containing a positive-length data segment (that includes
    retransmission segments carrying data). Unlike
    tcpi_segs_out/in, they exclude segments
    carrying no data (e.g. a pure ACK).

    The patch also updates segs_in in tcp_fastopen_add_skb()
    so that the segs_in >= data_segs_in property is kept.

    Together with retransmission data, tcpi_data_segs_out
    gives a better signal on the rxmit rate.
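
    The tcp_segs_in() helper referenced in the notes below is small; a
    sketch close to the version in include/net/tcp.h:
    ~~~~~
    static inline void tcp_segs_in(struct tcp_sock *tp, const struct sk_buff *skb)
    {
            u16 segs_in = max_t(u16, 1, skb_shinfo(skb)->gso_segs);

            tp->segs_in += segs_in;
            /* only count data-carrying segments in data_segs_in, so a FIN
             * or pure ACK does not bump it
             */
            if (skb->len > tcp_hdrlen(skb))
                    tp->data_segs_in += segs_in;
    }
    ~~~~~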

    v6: Rebase on the latest net-next

    v5: Eric pointed out that checking skb->len is still needed in
    tcp_fastopen_add_skb() because skb can carry a FIN without data.
    Hence, instead of open coding segs_in and data_segs_in, tcp_segs_in()
    helper is used. Comment is added to the fastopen case to explain why
    segs_in has to be reset and tcp_segs_in() has to be called before
    __skb_pull().

    v4: Add comment to the changes in tcp_fastopen_add_skb()
    and also add remark on this case in the commit message.

    v3: Add const modifier to the skb parameter in tcp_segs_in()

    v2: Rework based on recent fix by Eric:
    commit a9d99ce28ed3 ("tcp: fix tcpi_segs_in after connection establishment")

    Signed-off-by: Martin KaFai Lau
    Cc: Chris Rapier
    Cc: Eric Dumazet
    Cc: Marcelo Ricardo Leitner
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     

10 Mar, 2016

1 commit


23 Feb, 2016

1 commit