29 Apr, 2016

7 commits

  • This is never called with a NULL "buf" and anyway, we dereference 's' on
    the lines before so it would Oops before we reach the check.

    Signed-off-by: Dan Carpenter
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Dan Carpenter
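
The reasoning in the entry above (a NULL check that sits after a dereference can never fire) can be shown with a small user-space sketch. The type `demo_stream` and both functions are hypothetical stand-ins, not the kernel code touched by the patch, and the relationship between "buf" and 's' in the real function is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the structure behind 's'; not a kernel type. */
struct demo_stream {
	char *buf;
	size_t len;
};

/* Before: 's' is dereferenced first, so the later check cannot help. */
static size_t demo_len_before(const struct demo_stream *s)
{
	size_t len = s->len;	/* would already fault if s were NULL */

	if (s == NULL)		/* dead code: never reached with s == NULL */
		return 0;
	return len;
}

/* After: drop the check and state the precondition instead. */
static size_t demo_len_after(const struct demo_stream *s)
{
	assert(s != NULL);	/* assumption: callers never pass NULL */
	return s->len;
}
```

For any valid pointer both variants return the same value, which is why removing the check does not change behavior.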
     
  • There's no need to calculate rps hash if it was not enabled. So this
    patch exports rps_needed and checks it before trying to get rps
    hash. Tests (using pktgen to inject packets to guest) show this can
    improve pps by about 13% (when rps is disabled).

    Before:
    ~1150000 pps
    After:
    ~1300000 pps

    Cc: Michael S. Tsirkin
    Signed-off-by: Jason Wang
    ----
    Changes from V1:
    - Fix build when CONFIG_RPS is not set
    Signed-off-by: David S. Miller

    Jason Wang
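
As a rough illustration of the idea in the entry above, the sketch below skips the hash computation when a flag says nobody will use the result. All names (`demo_rps_needed`, `demo_flow_hash`, `demo_rx_hash`) are hypothetical user-space stand-ins, and the mixing function is a placeholder rather than the kernel's flow hash. The last helper only reproduces the arithmetic behind the quoted figures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real rps_needed is a kernel symbol. */
static bool demo_rps_needed;		/* false: RPS disabled */
static unsigned long demo_hash_calls;	/* how often a hash was computed */

static uint32_t demo_flow_hash(uint32_t saddr, uint32_t daddr)
{
	demo_hash_calls++;
	/* Placeholder mixing, not the kernel's flow hash. */
	return (saddr * 2654435761u) ^ (daddr * 40503u);
}

/* Compute a hash only when RPS would actually consume it. */
static uint32_t demo_rx_hash(uint32_t saddr, uint32_t daddr)
{
	if (!demo_rps_needed)
		return 0;
	return demo_flow_hash(saddr, daddr);
}

/* Integer percentage gain from 'before' to 'after' packets per second. */
static unsigned long demo_pps_gain_percent(unsigned long before,
					   unsigned long after)
{
	assert(before > 0 && after >= before);
	return (after - before) * 100 / before;
}
```

With the figures from the commit message, `demo_pps_gain_percent(1150000, 1300000)` evaluates to 13, matching the "about 13%" claim.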
     
  • When fragmenting a skb, the next_skb should carry
    the eor from prev_skb. The eor of prev_skb should
    also be reset.

    Packetdrill script for testing:
    ~~~~~~
    +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
    +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
    +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    +0 bind(3, ..., ...) = 0
    +0 listen(3, 1) = 0

    0.100 < S 0:0(0) win 32792
    0.100 > S. 0:0(0) ack 1
    0.200 < . 1:1(0) ack 1 win 257
    0.200 accept(3, ..., ...) = 4
    +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

    0.200 sendto(4, ..., 15330, MSG_EOR, ..., ...) = 15330
    0.200 sendto(4, ..., 730, 0, ..., ...) = 730

    0.200 > . 1:7301(7300) ack 1
    0.200 > . 7301:14601(7300) ack 1

    0.300 < . 1:1(0) ack 14601 win 257
    0.300 > P. 14601:15331(730) ack 1
    0.300 > P. 15331:16061(730) ack 1

    0.400 < . 1:1(0) ack 16061 win 257
    0.400 close(4) = 0
    0.400 > F. 16061:16061(0) ack 1
    0.400 < F. 1:1(0) ack 16062 win 257
    0.400 > . 16062:16062(0) ack 2

    Signed-off-by: Martin KaFai Lau
    Cc: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Soheil Hassas Yeganeh
    Cc: Willem de Bruijn
    Cc: Yuchung Cheng
    Acked-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Martin KaFai Lau
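
The fragmenting rule in the entry above can be sketched with a toy model in which a segment is just a length and an eor bit. `demo_seg` and `demo_fragment` are hypothetical names; the real code operates on skbs and TCP_SKB_CB, which this sketch does not reproduce.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an skb: only a length and the eor bit. */
struct demo_seg {
	size_t len;
	bool eor;
};

/* Split 'prev' after 'at' bytes; the remaining bytes go to 'next'. */
static void demo_fragment(struct demo_seg *prev, struct demo_seg *next,
			  size_t at)
{
	assert(at > 0 && at < prev->len);

	next->len = prev->len - at;
	prev->len = at;

	/* The message now ends in 'next', so the marker moves with it. */
	next->eor = prev->eor;
	prev->eor = false;
}
```

Splitting a 15330-byte segment that carries eor at 7300 bytes leaves a 7300-byte segment without eor followed by an 8030-byte segment with eor, so the message boundary stays at the last byte.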
     
  • This patch:
    1. Prevent next_skb from coalescing to the prev_skb if
    TCP_SKB_CB(prev_skb)->eor is set
    2. Update the TCP_SKB_CB(prev_skb)->eor if coalescing is
    allowed

    Packetdrill script for testing:
    ~~~~~~
    +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
    +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
    +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    +0 bind(3, ..., ...) = 0
    +0 listen(3, 1) = 0

    0.100 < S 0:0(0) win 32792
    0.100 > S. 0:0(0) ack 1
    0.200 < . 1:1(0) ack 1 win 257
    0.200 accept(3, ..., ...) = 4
    +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

    0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
    0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
    0.200 write(4, ..., 11680) = 11680

    0.200 > P. 1:731(730) ack 1
    0.200 > P. 731:1461(730) ack 1
    0.200 > . 1461:8761(7300) ack 1
    0.200 > P. 8761:13141(4380) ack 1

    0.300 < . 1:1(0) ack 1 win 257
    0.300 > P. 1:731(730) ack 1
    0.300 > P. 731:1461(730) ack 1
    0.400 < . 1:1(0) ack 13141 win 257

    0.400 close(4) = 0
    0.400 > F. 13141:13141(0) ack 1
    0.500 < F. 1:1(0) ack 13142 win 257
    0.500 > . 13142:13142(0) ack 2

    Signed-off-by: Martin KaFai Lau
    Cc: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Soheil Hassas Yeganeh
    Cc: Willem de Bruijn
    Cc: Yuchung Cheng
    Acked-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Martin KaFai Lau
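
The two coalescing rules listed in the entry above can be sketched in the same toy style. `demo_cseg` and `demo_try_coalesce` are hypothetical names, and the model ignores every other condition the kernel checks before merging two skbs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an skb: only a length and the eor bit. */
struct demo_cseg {
	size_t len;
	bool eor;
};

/* Try to merge 'next' into 'prev'; returns true if the merge happened. */
static bool demo_try_coalesce(struct demo_cseg *prev, struct demo_cseg *next)
{
	assert(prev != NULL && next != NULL);

	/* Rule 1: prev ends a message, so nothing may be merged into it. */
	if (prev->eor)
		return false;

	/* Rule 2: the merged segment inherits next's eor marker. */
	prev->len += next->len;
	prev->eor = next->eor;
	next->len = 0;
	next->eor = false;
	return true;
}
```

Two 730-byte segments that both carry eor therefore stay separate, which is the behavior the retransmissions in the packetdrill script check for.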
     
  • This patch adds an eor bit to the TCP_SKB_CB. When MSG_EOR
    is passed to tcp_sendmsg, the eor bit will be set at the skb
    containing the last byte of the userland's msg. The eor bit
    will prevent data from appending to that skb in the future.

    The change in do_tcp_sendpages is to honor the eor set
    during the previous tcp_sendmsg(MSG_EOR) call.

    This patch handles the tcp_sendmsg case. The followup patches
    will handle other skb coalescing and fragment cases.

    One potential use case is to use MSG_EOR with
    SOF_TIMESTAMPING_TX_ACK to get a more accurate
    TCP ack timestamping on application protocol with
    multiple outgoing response messages (e.g. HTTP2).

    Packetdrill script for testing:
    ~~~~~~
    +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
    +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
    +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    +0 bind(3, ..., ...) = 0
    +0 listen(3, 1) = 0

    0.100 < S 0:0(0) win 32792
    0.100 > S. 0:0(0) ack 1
    0.200 < . 1:1(0) ack 1 win 257
    0.200 accept(3, ..., ...) = 4
    +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

    0.200 write(4, ..., 14600) = 14600
    0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
    0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730

    0.200 > . 1:7301(7300) ack 1
    0.200 > P. 7301:14601(7300) ack 1

    0.300 < . 1:1(0) ack 14601 win 257
    0.300 > P. 14601:15331(730) ack 1
    0.300 > P. 15331:16061(730) ack 1

    0.400 < . 1:1(0) ack 16061 win 257
    0.400 close(4) = 0
    0.400 > F. 16061:16061(0) ack 1
    0.400 < F. 1:1(0) ack 16062 win 257
    0.400 > . 16062:16062(0) ack 2

    Signed-off-by: Martin KaFai Lau
    Cc: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Soheil Hassas Yeganeh
    Cc: Willem de Bruijn
    Cc: Yuchung Cheng
    Suggested-by: Eric Dumazet
    Acked-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Martin KaFai Lau
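
A minimal sketch of the append rule described in the entry above: data is appended to the tail segment unless that segment carries eor, and a send with MSG_EOR marks the segment holding its last byte. The queue type and `demo_send` are hypothetical; the model deliberately ignores MSS, TSO sizing and Nagle, so segment sizes will not match the packetdrill trace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_SEGS 8

/* Hypothetical model of a queued skb. */
struct demo_qseg {
	size_t len;
	bool eor;
};

/* Hypothetical write queue; not the kernel's sk_write_queue. */
struct demo_queue {
	struct demo_qseg segs[DEMO_MAX_SEGS];
	size_t count;
};

/* Queue 'len' bytes; 'eor' models MSG_EOR being passed on this call. */
static void demo_send(struct demo_queue *q, size_t len, bool eor)
{
	struct demo_qseg *tail = q->count ? &q->segs[q->count - 1] : NULL;

	/* Start a new segment if there is no tail or the tail is closed. */
	if (tail == NULL || tail->eor) {
		assert(q->count < DEMO_MAX_SEGS);
		tail = &q->segs[q->count++];
		tail->len = 0;
		tail->eor = false;
	}
	tail->len += len;
	tail->eor = eor;	/* the last byte of this message lives here */
}
```

In this model a plain write followed by a send with MSG_EOR shares one segment, but anything sent afterwards starts a new one, so an ack covering that segment corresponds to the end of exactly one application message.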
     
  • The SKBTX_ACK_TSTAMP flag is set in skb_shinfo->tx_flags when
    the timestamp of the TCP acknowledgement should be reported on
    error queue. Since accessing skb_shinfo is likely to incur a
    cache-line miss at the time of receiving the ack, the
    txstamp_ack bit was added in tcp_skb_cb, which is set iff
    the SKBTX_ACK_TSTAMP flag is set for an skb. This makes
    SKBTX_ACK_TSTAMP flag redundant.

    Remove the SKBTX_ACK_TSTAMP and instead use the txstamp_ack bit
    everywhere.

    Note that this frees one bit in shinfo->tx_flags.

    Signed-off-by: Soheil Hassas Yeganeh
    Acked-by: Martin KaFai Lau
    Suggested-by: Willem de Bruijn
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
     
  • Remove the redundant check for sk->sk_tsflags in tcp_tx_timestamp.

    tcp_tx_timestamp() receives the tsflags as a parameter. As a
    result the "sk->sk_tsflags || tsflags" is redundant, since
    tsflags already includes sk->sk_tsflags plus overrides from
    control messages.

    Signed-off-by: Soheil Hassas Yeganeh
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
     

28 Apr, 2016

18 commits

  • There is nothing related to BH in SNMP counters anymore,
    since linux-3.0.

    Rename helpers to use __ prefix instead of _BH prefix,
    for contexts where preemption is disabled.

    This more closely matches convention used to update
    percpu variables.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • IPv6 ICMP stats are atomics anyway.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Rename IP6_UPD_PO_STATS_BH() to __IP6_UPD_PO_STATS()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Rename IP6_INC_STATS_BH() to __IP6_INC_STATS()
    and IP6_ADD_STATS_BH() to __IP6_ADD_STATS()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Rename NET_INC_STATS_BH() to __NET_INC_STATS()
    and NET_ADD_STATS_BH() to __NET_ADD_STATS()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Rename IP_UPD_PO_STATS_BH() to __IP_UPD_PO_STATS()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Rename IP_ADD_STATS_BH() to __IP_ADD_STATS()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Rename ICMP6_INC_STATS_BH() to __ICMP6_INC_STATS()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Rename IP_INC_STATS_BH() to __IP_INC_STATS(), to
    better express this is used in non preemptible context.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Rename SCTP_INC_STATS_BH() to __SCTP_INC_STATS()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Remove misleading _BH suffix.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Rename TCP_INC_STATS_BH() to __TCP_INC_STATS()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Rename UDP_INC_STATS_BH() to __UDP_INC_STATS(),
    and UDP6_INC_STATS_BH() to __UDP6_INC_STATS()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Rename ICMP_INC_STATS_BH() to __ICMP_INC_STATS()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Rename DCCP_INC_STATS_BH() to __DCCP_INC_STATS()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • In the old days (before linux-3.0), SNMP counters were duplicated,
    one for user context, and one for BH context.

    After commit 8f0ea0fe3a03 ("snmp: reduce percpu needs by 50%")
    we have a single copy, and what really matters is preemption being
    enabled or disabled, since we use this_cpu_inc() or __this_cpu_inc()
    respectively.

    We therefore kill SNMP_INC_STATS_USER(), SNMP_ADD_STATS_USER(),
    NET_INC_STATS_USER(), NET_ADD_STATS_USER(), SCTP_INC_STATS_USER(),
    SNMP_INC_STATS64_USER(), SNMP_ADD_STATS64_USER(), TCP_ADD_STATS_USER(),
    UDP_INC_STATS_USER(), UDP6_INC_STATS_USER(), and XFRM_INC_STATS_USER()

    Following patches will rename _BH helpers to make clear their
    usage is not tied to BH being disabled.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Minor overlapping changes in the conflicts.

    In the macsec case, the change of the default ID macro
    name overlapped with the 64-bit netlink attribute alignment
    fixes in net-next.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Similar to 3bfd847203c6 ("net: Use passed in table for nexthop lookups")
    for IPv4, if the route spec contains a table id use that to lookup the
    next hop first and fall back to a full lookup if it fails (per the fix
    4c9bcd117918b ("net: Fix nexthop lookups")).

    Example:

    root@kenny:~# ip -6 ro ls table red
    local 2100:1::1 dev lo proto none metric 0 pref medium
    2100:1::/120 dev eth1 proto kernel metric 256 pref medium
    local 2100:2::1 dev lo proto none metric 0 pref medium
    2100:2::/120 dev eth2 proto kernel metric 256 pref medium
    local fe80::e0:f9ff:fe09:3cac dev lo proto none metric 0 pref medium
    local fe80::e0:f9ff:fe1c:b974 dev lo proto none metric 0 pref medium
    fe80::/64 dev eth1 proto kernel metric 256 pref medium
    fe80::/64 dev eth2 proto kernel metric 256 pref medium
    ff00::/8 dev red metric 256 pref medium
    ff00::/8 dev eth1 metric 256 pref medium
    ff00::/8 dev eth2 metric 256 pref medium
    unreachable default dev lo metric 240 error -113 pref medium

    root@kenny:~# ip -6 ro add table red 2100:3::/64 via 2100:1::64
    RTNETLINK answers: No route to host

    Route add fails even though 2100:1::64 is a reachable next hop:
    root@kenny:~# ping6 -I red 2100:1::64
    ping6: Warning: source address might be selected on device other than red.
    PING 2100:1::64(2100:1::64) from 2100:1::1 red: 56 data bytes
    64 bytes from 2100:1::64: icmp_seq=1 ttl=64 time=1.33 ms

    With this patch:
    root@kenny:~# ip -6 ro add table red 2100:3::/64 via 2100:1::64
    root@kenny:~# ip -6 ro ls table red
    local 2100:1::1 dev lo proto none metric 0 pref medium
    2100:1::/120 dev eth1 proto kernel metric 256 pref medium
    local 2100:2::1 dev lo proto none metric 0 pref medium
    2100:2::/120 dev eth2 proto kernel metric 256 pref medium
    2100:3::/64 via 2100:1::64 dev eth1 metric 1024 pref medium
    local fe80::e0:f9ff:fe09:3cac dev lo proto none metric 0 pref medium
    local fe80::e0:f9ff:fe1c:b974 dev lo proto none metric 0 pref medium
    fe80::/64 dev eth1 proto kernel metric 256 pref medium
    fe80::/64 dev eth2 proto kernel metric 256 pref medium
    ff00::/8 dev red metric 256 pref medium
    ff00::/8 dev eth1 metric 256 pref medium
    ff00::/8 dev eth2 metric 256 pref medium
    unreachable default dev lo metric 240 error -113 pref medium

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
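
The lookup order described in the entry above (try the table named in the route spec first, then fall back to a full lookup) can be sketched as follows. Everything here is a hypothetical simplification: routes are matched on an exact 32-bit key instead of an IPv6 prefix, table id 0 is taken to mean "no table given", and the "full lookup" is reduced to a search of a single main table, whereas the kernel walks its policy rules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_TABLE_MAIN 254u	/* assumption: id of the default table */

/* Hypothetical route entry; 'dst' stands in for an IPv6 prefix. */
struct demo_route {
	uint32_t table;
	uint32_t dst;
	const char *dev;
};

/* Search one table for an entry that covers 'dst'. */
static const struct demo_route *
demo_table_lookup(const struct demo_route *rt, size_t n,
		  uint32_t table, uint32_t dst)
{
	for (size_t i = 0; i < n; i++)
		if (rt[i].table == table && rt[i].dst == dst)
			return &rt[i];
	return NULL;
}

/* Resolve gateway 'gw': requested table first, then the fallback. */
static const struct demo_route *
demo_nexthop_lookup(const struct demo_route *rt, size_t n,
		    uint32_t table, uint32_t gw)
{
	const struct demo_route *hit = NULL;

	assert(rt != NULL || n == 0);
	if (table != 0)
		hit = demo_table_lookup(rt, n, table, gw);
	if (hit == NULL)
		hit = demo_table_lookup(rt, n, DEMO_TABLE_MAIN, gw);
	return hit;
}
```

With a gateway that is only reachable through the requested table, a lookup that ignores the table fails, mirroring the "No route to host" error in the session above, while the table-first lookup succeeds.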
     

27 Apr, 2016

8 commits


26 Apr, 2016

7 commits

  • It was a simple idea -- save IPv6 configured addresses on a link down
    so that IPv6 behaves similar to IPv4. As always the devil is in the
    details and the IPv6 stack has too many behavioral differences from IPv4
    making the simple idea more complicated than it needs to be.

    The current implementation for keeping IPv6 addresses can panic or spit
    out a warning in one of many paths:

    1. IPv6 route gets an IPv4 route as its 'next' which causes a panic in
    rt6_fill_node while handling a route dump request.

    2. rt->dst.obsolete is set to DST_OBSOLETE_DEAD hitting the WARN_ON in
    fib6_del

    3. Panic in fib6_purge_rt because rt6i_ref count is not 1.

    The root cause of all these is references related to the host route for
    an address that is retained.

    So, this patch deletes the host route every time the ifdown loop runs.
    Since the host route is deleted and will be re-generated on up there is
    no longer a need for the l3mdev fix up. On the 'admin up' side move
    addrconf_permanent_addr into the NETDEV_UP event handling so that it
    runs only once versus on UP and CHANGE events.

    All of the current panics and warnings appear to be related to
    addresses on the loopback device, but given the catastrophic nature when
    a bug is triggered this patch takes the conservative approach and evicts
    all host routes rather than trying to determine when it can be re-used
    and when it can not. That can be a later optimization if desired.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • This reverts commit 841645b5f2dfceac69b78fcd0c9050868d41ea61.

    Ok, this puts the feature back. I've decided to apply David A.'s
    bug fix and run with that rather than make everyone wait another
    whole release for this feature.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Support checksum neutral ILA as described in the ILA draft. The low
    order 16 bits of the identifier are used to contain the checksum
    adjustment value.

    The csum-mode parameter is added to describe checksum processing. There
    are three values:
    - adjust transport checksum (previous behavior)
    - do checksum neutral mapping
    - do nothing

    On output the csum-mode in the ila_params is checked and acted on. If
    mode is checksum neutral mapping then do mapping and set C-bit.

    On input, C-bit is checked. If it is set checksum-neutral mapping is
    done (regardless of csum-mode in ila params) and C-bit will be cleared.
    If it is not set then action in csum-mode is taken.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
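
The core of checksum-neutral mapping, as described in the entry above, is that the low order 16 bits of the identifier absorb the change caused by rewriting the locator, so the one's-complement sum over the whole address stays the same. The sketch below shows only that arithmetic. The function names are hypothetical, the address is modeled as eight 16-bit words, and C-bit handling and the csum-mode selection are left out.

```c
#include <assert.h>
#include <stdint.h>

/* One's-complement addition of two 16-bit words (end-around carry). */
static uint16_t demo_add1c(uint16_t a, uint16_t b)
{
	uint32_t s = (uint32_t)a + b;

	return (uint16_t)((s & 0xffffu) + (s >> 16));
}

/* One's-complement sum over 'n' 16-bit words. */
static uint16_t demo_sum1c(const uint16_t *w, int n)
{
	uint16_t s = 0;

	assert(n >= 0);
	for (int i = 0; i < n; i++)
		s = demo_add1c(s, w[i]);
	return s;
}

/*
 * Replace the locator (words 0..3) and fold the difference into word 7,
 * the low order 16 bits of the identifier, so that the sum over all
 * eight words is unchanged.
 */
static void demo_csum_neutral_map(uint16_t addr[8], const uint16_t new_loc[4])
{
	uint16_t old_sum = demo_sum1c(addr, 4);
	uint16_t new_sum = demo_sum1c(new_loc, 4);
	/* diff = old - new in one's-complement arithmetic */
	uint16_t diff = demo_add1c(old_sum, (uint16_t)~new_sum);

	for (int i = 0; i < 4; i++)
		addr[i] = new_loc[i];
	addr[7] = demo_add1c(addr[7], diff);
}
```

Because a transport checksum includes the addresses through the pseudo-header sum, keeping the address sum constant means the transport checksum does not need to be touched. The price is that the last 16 bits of the identifier no longer hold their original value until the reverse mapping is applied.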
     
  • Change model of xlat to be used only for input where lookup is done on
    the locator part of an address (comparing to locator_match as key
    in rhashtable). This is needed for checksum neutral translation
    which obfuscates the low order 16 bits of the identifier. It also
    permits hosts to be in multiple ILA domains (each locator can map
    to a different SIR address). A check is also added to disallow
    translating non-ILA addresses (check of type in identifier).

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Add structures for identifiers, locators, and an ila address which
    is composed of a locator and identifier and in6_addr can be cast to
    it. This includes a three bit type field and enums for the types defined
    in ILA I-D.

    In ILA lwt don't allow user to set a translation for a non-ILA
    address (type of identifier is zero meaning it is an IID). This also
    requires that the destination prefix is at least 65 bytes (64
    bit locator and first byte of identifier).

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • The memcpy of ipv6 header destination address to the skb control block
    (skb->cb) in header_create() results in corrupted memory when bt_xmit()
    is issued. The skb->cb is "released" in the return of header_create()
    making room for lower layer to manipulate the skb->cb.

    The value retrieved in bt_xmit is not persistent across header creation
    and sending, and the lower layer will overwrite portions of skb->cb,
    making the copied destination address wrong.

    The memory corruption will lead to non-working multicast as the first 4
    bytes of the copied destination address are replaced by a value that
    resolves into a non-multicast prefix.

    This fix removes the dependency on the skb control block between header
    creation and send, by moving the destination address memcpy to the send
    function path (setup_create, which is called from bt_xmit).

    Signed-off-by: Glenn Ruben Bakke
    Acked-by: Jukka Rissanen
    Signed-off-by: Marcel Holtmann
    Cc: stable@vger.kernel.org # 4.5+

    Glenn Ruben Bakke
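
The failure mode in the entry above comes from treating the control block as storage that survives between layers. The sketch below models it with a scratch array that another layer overwrites between the two steps. The packet type, the sizes and all function names are hypothetical and only illustrate the ordering problem, not the actual 6LoWPAN code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_CB_SIZE 48
#define DEMO_ADDR_LEN 16

/* Hypothetical packet: 'cb' is scratch space that each layer may reuse. */
struct demo_pkt {
	uint8_t cb[DEMO_CB_SIZE];
	uint8_t daddr[DEMO_ADDR_LEN];	/* destination from the IPv6 header */
};

/* Copy the destination address into the scratch area. */
static void demo_stash_daddr(struct demo_pkt *p)
{
	assert(DEMO_ADDR_LEN <= DEMO_CB_SIZE);
	memcpy(p->cb, p->daddr, DEMO_ADDR_LEN);
}

/* Stand-in for another layer that uses the start of cb for its own state. */
static void demo_lower_layer_touch(struct demo_pkt *p)
{
	memset(p->cb, 0xab, 4);
}

/* Returns 1 if the address read back from cb still matches the header. */
static int demo_xmit_reads_valid(const struct demo_pkt *p)
{
	return memcmp(p->cb, p->daddr, DEMO_ADDR_LEN) == 0;
}
```

Stashing the address before the other layer runs yields a copy whose first 4 bytes are wrong, which for a multicast destination turns it into a non-multicast prefix. Stashing it inside the send path, right before it is read, avoids relying on the scratch area across that boundary.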
     
  • rds-stress experiments with request size 256 bytes, 8K acks,
    using 16 threads show a 40% improvement when pskb_extract()
    replaces the {skb_clone(..); pskb_pull(..); pskb_trim(..);}
    pattern in the Rx path, so we leverage the perf gain with
    this commit.

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan