19 Jan, 2017

27 commits

  • Now that the DSA Ethernet switches are true Linux devices, the CPU
    switch is not necessarily the first one. If its address is higher than
    the second switch on the same MDIO bus, its index will be 1, not 0.

    Avoid any confusion by using dst->cpu_switch instead of dst->ds[0].

    Signed-off-by: Vivien Didelot
    Reviewed-by: Andrew Lunn
    Reviewed-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Vivien Didelot
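
As a rough illustration of why the tree should point at the CPU switch rather than assume index 0, here is a minimal userspace sketch. The struct and function names (`dsa_switch_sketch`, `dsa_tree_sketch`, `tree_add_switch`, `demo_cpu_switch`) and `MAX_SWITCHES` are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SWITCHES 4 /* hypothetical limit for this sketch */

/* Simplified stand-ins for the kernel's switch and switch-tree structures. */
struct dsa_switch_sketch {
	int index;        /* position in the tree, which depends on probe order */
	int has_cpu_port; /* non-zero if this switch is wired to the CPU */
};

struct dsa_tree_sketch {
	struct dsa_switch_sketch *ds[MAX_SWITCHES];
	struct dsa_switch_sketch *cpu_switch; /* NULL until a CPU port is found */
};

/* Record a switch at its index and remember the one that owns the CPU port. */
void tree_add_switch(struct dsa_tree_sketch *dst, struct dsa_switch_sketch *ds)
{
	dst->ds[ds->index] = ds;
	if (ds->has_cpu_port && dst->cpu_switch == NULL)
		dst->cpu_switch = ds;
}

/* Build a tree in which the CPU switch ended up with index 1, not 0. */
struct dsa_switch_sketch *demo_cpu_switch(void)
{
	static struct dsa_switch_sketch plain = { .index = 0, .has_cpu_port = 0 };
	static struct dsa_switch_sketch cpu = { .index = 1, .has_cpu_port = 1 };
	static struct dsa_tree_sketch dst;

	dst = (struct dsa_tree_sketch){ 0 };
	tree_add_switch(&dst, &plain);
	tree_add_switch(&dst, &cpu);
	return dst.cpu_switch; /* dst.ds[0] would be the wrong switch here */
}
```

A pointer also has a natural "unset" value (NULL), which is why the follow-up commit no longer needs a -1 sentinel.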
     
  • Store a dsa_switch pointer to the CPU switch in the tree instead of only
    its index. This avoids the need to initialize it to -1.

    Signed-off-by: Vivien Didelot
    Reviewed-by: Andrew Lunn
    Reviewed-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Vivien Didelot
     
  • Check "ch" for NULL first, then get ctlr.

    Signed-off-by: Ivan Khoronzhuk
    Signed-off-by: David S. Miller

    Ivan Khoronzhuk
     
  • Jason Wang says:

    ====================
    vhost_net tx batching

    This series tries to implement tx batching support for vhost. This is
    done by using MSG_MORE as a hint for the underlying socket. The backend
    (e.g. tap) can then batch the packets temporarily in a list and
    submit them all once the number of batched packets exceeds a limit.

    Tests show an obvious improvement for guest pktgen over
    mlx4(noqueue) on host:

                                    Mpps    -+%
    rx-frames = 0                   0.91    +0%
    rx-frames = 4                   1.00    +9.8%
    rx-frames = 8                   1.00    +9.8%
    rx-frames = 16                  1.01    +10.9%
    rx-frames = 32                  1.07    +17.5%
    rx-frames = 48                  1.07    +17.5%
    rx-frames = 64                  1.08    +18.6%
    rx-frames = 64 (no MSG_MORE)    0.91    +0%

    Changes from V4:
    - stick to NAPI_POLL_WEIGHT for rx-frames if the user specifies a value
      greater than it.
    Changes from V3:
    - use ethtool instead of a module parameter to control the maximum
      number of batched packets
    - avoid overhead when MSG_MORE is not set and no packet is queued
    Changes from V2:
    - remove useless queue limitation check (and we don't drop any packet now)
    Changes from V1:
    - drop NAPI handler since we don't use NAPI now
    - fix issues that may exceed the max pending of zerocopy
    - more improvement on available buffer detection
    - move the limitation of batched packets from vhost to tuntap
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • We can only process one packet at a time during sendmsg(). This often
    leads to bad cache utilization under heavy load. So this patch tries to do
    some batching during rx before submitting the packets to the host network
    stack. This is done by accepting MSG_MORE as a hint from the
    sendmsg() caller: if it is set, batch the packet temporarily in a
    linked list, and submit them all once MSG_MORE is cleared.

    Tests were done by pktgen (burst=128) in guest over mlx4(noqueue) on host:

                                    Mpps    -+%
    rx-frames = 0                   0.91    +0%
    rx-frames = 4                   1.00    +9.8%
    rx-frames = 8                   1.00    +9.8%
    rx-frames = 16                  1.01    +10.9%
    rx-frames = 32                  1.07    +17.5%
    rx-frames = 48                  1.07    +17.5%
    rx-frames = 64                  1.08    +18.6%
    rx-frames = 64 (no MSG_MORE)    0.91    +0%

    Users can change the per-device number of batched packets through
    ethtool -C rx-frames. NAPI_POLL_WEIGHT is used as the upper limit
    to prevent bh from being disabled for too long.

    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Jason Wang
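
The batching rule described above can be sketched in plain C. This is a simplified userspace model under stated assumptions, not the tun driver's code: `rx_batcher`, `rx_batcher_init` and `rx_packet` are hypothetical names, integers stand in for packets, and `RX_BATCH_LIMIT` plays the role of NAPI_POLL_WEIGHT as the cap on rx-frames.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RX_BATCH_LIMIT 64 /* stands in for NAPI_POLL_WEIGHT as the upper bound */

struct rx_batcher {
	int queue[RX_BATCH_LIMIT]; /* packet ids standing in for queued packets */
	size_t queued;             /* packets currently held back */
	size_t rx_frames;          /* per-device limit; 0 disables batching */
	size_t submitted;          /* packets handed to the "stack" so far */
	size_t flushes;            /* number of submissions to the "stack" */
};

void rx_batcher_init(struct rx_batcher *b, size_t rx_frames)
{
	b->queued = 0;
	b->submitted = 0;
	b->flushes = 0;
	/* Clamp the user-supplied value to the upper bound. */
	b->rx_frames = rx_frames > RX_BATCH_LIMIT ? RX_BATCH_LIMIT : rx_frames;
}

/* Handle one packet; 'more' plays the role of the MSG_MORE hint. */
void rx_packet(struct rx_batcher *b, int pkt, bool more)
{
	/* Fast path: batching is off, or no hint and nothing is pending. */
	if (b->rx_frames == 0 || (!more && b->queued == 0)) {
		b->submitted++;
		b->flushes++;
		return;
	}
	b->queue[b->queued++] = pkt;
	/* Flush when the hint is cleared or the batch limit is reached. */
	if (!more || b->queued >= b->rx_frames) {
		b->submitted += b->queued;
		b->queued = 0;
		b->flushes++;
	}
}
```

With rx_frames set to 4, four packets sent with the hint are submitted in a single flush, while a packet without the hint and with an empty queue takes the unbatched path.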
     
  • This patch tries to utilize tuntap rx batching by peeking at the tx
    virtqueue during transmission. If there are more available buffers in
    the virtqueue, set the MSG_MORE flag as a hint for the backend (e.g. tuntap)
    to batch the packets.

    Reviewed-by: Stefan Hajnoczi
    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Jason Wang
     
  • This patch makes several tweaks to vhost_vq_avail_empty() for
    better performance:

    - check the cached avail index first, which can avoid a userspace memory access.
    - use unlikely() for the failure of the userspace access
    - check vq->last_avail_idx instead of the cached avail index as the last
      step.

    This patch is needed for batching support, which needs to peek whether
    or not there are still available buffers in the ring.

    Reviewed-by: Stefan Hajnoczi
    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Jason Wang
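
The three tweaks listed above can be modelled in userspace as follows. This is a sketch under assumptions, not the vhost source: `vq_sketch`, `read_guest_idx` and `vq_avail_empty` are hypothetical names, and a plain pointer (NULL meaning "access fails") stands in for the userspace memory access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define unlikely(x) (x)
#endif

struct vq_sketch {
	uint16_t avail_idx;        /* cached copy of the guest's avail index */
	uint16_t last_avail_idx;   /* next entry the host will consume */
	const uint16_t *guest_idx; /* stands in for guest memory; NULL = fault */
	unsigned int guest_reads;  /* counts the "expensive" accesses */
};

/* Model of the costly userspace read; returns non-zero on failure. */
static int read_guest_idx(struct vq_sketch *vq, uint16_t *out)
{
	vq->guest_reads++;
	if (vq->guest_idx == NULL)
		return -1;
	*out = *vq->guest_idx;
	return 0;
}

bool vq_avail_empty(struct vq_sketch *vq)
{
	uint16_t idx;

	/* 1. The cached index already shows pending buffers: no guest read. */
	if (vq->avail_idx != vq->last_avail_idx)
		return false;
	/* 2. A failed access is rare and must not be reported as "empty". */
	if (unlikely(read_guest_idx(vq, &idx)))
		return false;
	/* 3. Refresh the cache and compare against last_avail_idx last. */
	vq->avail_idx = idx;
	return vq->avail_idx == vq->last_avail_idx;
}
```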
     
  • Relaxed ordering (RO) is a feature of the 82599 NIC; enabling it can
    improve performance on some CPU architectures, such as SPARC.
    Currently the 82599 driver enables RO for only one CPU architecture (SPARC),
    which does not generalize to other CPU architectures
    that really need the RO feature.
    This patch adds a common config option, CONFIG_ARCH_WANT_RELAX_ORDER, to enable RO,
    and defines CONFIG_ARCH_WANT_RELAX_ORDER in the sparc Kconfig first.

    Signed-off-by: Mao Wenan
    Reviewed-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Mao Wenan
     
  • David Ahern says:

    ====================
    net: ipv6: simplify rt6_fill_node

    Remove a couple of unnecessary input arguments to rt6_fill_node.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The prefix arg to rt6_fill_node is non-0 in only 1 path - rt6_dump_route
    where a user is requesting a prefix only dump. Simplify rt6_fill_node
    by removing the prefix arg and moving the prefix check to rt6_dump_route.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • All callers of rt6_fill_node pass 0 for nowait arg. Remove the arg and
    simplify rt6_fill_node accordingly.

    rt6_fill_node passes the nowait of 0 to ip6mr_get_route. Remove the
    nowait arg from it as well.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • Xin Long says:

    ====================
    sctp: add sender-side procedures for stream reconf ssn reset request chunk

    Patch 6/6 implements the sender-side procedures for the Outgoing
    and Incoming SSN Reset Request Parameter described in rfc6525
    sections 5.1.2 and 5.1.3.

    Patches 1-5/6 come before it to define some APIs and asoc members
    for it.

    Note that with this patchset, asoc->reconf_enable has no chance yet to
    be set, until the patch "sctp: add get and set sockopt for reconf_enable"
    is applied in the future, as we cannot enable it while sctp is not yet
    capable of processing reconf chunks.

    v1->v2:
    - put these into a smaller group.
    - rename some temporary variables in the code.
    - rename the titles of the commits and improve some changelogs.
    v2->v3:
    - re-split the patchset and make sure it has no dead code for review.
    v3->v4:
    - move sctp_make_reconf() into patch 1/6 to avoid kbuild warning.
    - drop unused struct sctp_strreset_req.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This patch implements the sender-side procedures for the Outgoing
    and Incoming SSN Reset Request Parameter described in rfc6525 sections
    5.1.2 and 5.1.3.

    It also adds the sockopt SCTP_RESET_STREAMS from rfc6525 section 6.3.2
    for users.

    Note that the new asoc member strreset_outstanding makes sure that
    only one reconf request chunk is in flight, as rfc6525 section 5.1.1
    demands.

    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     
  • This patch is to add sockopt SCTP_ENABLE_STREAM_RESET to get/set
    strreset_enable to indicate which reconf request type it supports,
    which is described in rfc6525 section 6.3.1.

    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     
  • This patch adds a reconf_enable field to asoc, ep and netns
    to indicate whether they support stream reset.

    When initializing, asoc reconf_enable gets its default value from ep
    reconf_enable, which in turn defaults to netns reconf_enable.

    It also adds reconf_capable to the asoc peer part to record whether the peer
    supports reconf_enable. The value is set if the ext params indicate reconf
    chunk support when processing the init chunk, just as rfc6525 section
    5.1.1 demands.

    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     
  • This patch adds a primitive based on the sctp primitive framework for
    sending stream reconf requests. It works like the other primitives
    and creates a SCTP_CMD_REPLY command to send the request chunk out.

    sctp_primitive_RECONF is the API for sending a reconf request
    chunk.

    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     
  • This patch adds a per-transport timer based on the sctp timer framework
    for stream reconf chunk retransmission. It starts after sending
    a reconf request chunk and stops after receiving the response chunk.

    If the timer expires, besides retransmitting the reconf request chunk,
    it also does the same things as the data RTO timer: it increases
    the appropriate error counts and performs threshold management, possibly
    destroying the asoc if sctp retransmission thresholds are exceeded, just
    as section 5.1.1 describes.

    This patch also adds asoc strreset_chunk, which is used to save the
    reconf request chunk so that it can be retransmitted, and to check whether
    the response is really for this request by comparing the information
    inside it with the response chunk.

    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     
  • This patch adds asoc strreset_outseq and strreset_inseq for
    saving the reconf request sequence numbers, initializes them when creating
    the assoc and processing init, and also defines the Incoming and Outgoing
    SSN Reset Request Parameters described in rfc6525 sections 4.1 and
    4.2. As they can be in the same chunk, as rfc6525 section 3.1-3
    describes, they are built in one function.

    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     
  • Josef Bacik says:

    ====================
    Rework inet_csk_get_port

    V3->V4:
    -Removed the random include of addrconf.h that is no longer needed.

    V2->V3:
    -Dropped the fastsock from the tb and instead just carry the saddrs, family, and
    ipv6 only flag.
    -Reworked the helper functions to deal with this change so I could still use
    them when checking the fast path.
    -Killed tb->num_owners as per Eric's request.
    -Attached a reproducer to the bottom of this email.

    V1->V2:
    -Added a new patch 'inet: collapse ipv4/v6 rcv_saddr_equal functions into one'
    at Hannes' suggestion.
    -Dropped ->bind_conflict and just use the new helper.
    -Fixed a compile bug from the original ->bind_conflict patch.

    The original description of the series follows:

    At some point recently the guys working on our load balancer added the ability
    to use SO_REUSEPORT. When they restarted their app with this option enabled
    they immediately hit a softlockup on what appeared to be the
    inet_bind_bucket->lock. Eventually what all of our debugging and discussion led
    us to was the fact that the application comes up without SO_REUSEPORT, shuts
    down which creates around 100k twsk's, and then comes up and tries to open a
    bunch of sockets using SO_REUSEPORT, which meant traversing the inet_bind_bucket
    owners list under the lock. Since this lock is needed for dealing with the
    twsk's and basically anything else related to connections we would softlockup,
    and sometimes not ever recover.

    To solve this problem I did what you see in Patch 5/5. Once we have a
    SO_REUSEPORT socket on the tb->owners list we know that the socket has no
    conflicts with any of the other sockets on that list. So we can add a copy of
    the sock_common (really all we need is the recv_saddr but it seemed ugly to copy
    just the ipv6, ipv4, and flag to indicate if we were ipv6 only in there so I've
    copied the whole common) in order to check subsequent SO_REUSEPORT sockets. If
    they match the previous one then we can skip the expensive
    inet_csk_bind_conflict check. This is what eliminated the soft lockup that we
    were seeing.

    Patches 1-4 are cleanups and re-workings. For instance when we specify port ==
    0 we need to find an open port, but we would do two passes through
    inet_csk_bind_conflict every time we found a possible port. We would also keep
    track of the smallest_port value in order to try and use it if we found no
    port our first run through. This however made no sense as it would have had to
    fail the first pass through inet_csk_bind_conflict, so would not actually pass
    the second pass through either. Finally I split the function into two functions
    in order to make it easier to read and to distinguish between the two behaviors.

    I have tested this on one of our load balancing boxes during peak traffic and it
    hasn't fallen over. But this is not my area, so obviously feel free to point
    out where I'm being stupid and I'll get it fixed up and retested. Thanks,
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • If we have non reuseport sockets on a tb we will set tb->fastreuseport to 0 and
    never set it again. Which means that in the future if we end up adding a bunch
    of reuseport sk's to that tb we'll have to do the expensive scan every time.
    Instead add the ipv4/ipv6 saddr fields to the bind bucket, as well as the family
    so we know what comparison to make, and the ipv6 only setting so we can make
    sure to compare with new sockets appropriately. Once one sk has made it onto
    the list we know that there are no potential bind conflicts on the owners list
    that match that sk's rcv_addr. So copy the sk's information into our bind
    bucket and set tb->fastreuseport to FASTREUSESOCK_STRICT so we know we have to do
    an extra check for subsequent reuseport sockets and skip the expensive bind
    conflict check.

    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
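
The fast path described above can be sketched as a small state machine. This is a userspace model under assumptions, loosely following the commit text: all names (`fastreuse_state`, `sock_sketch`, `bind_bucket_sketch`, `can_skip_conflict_scan`, `bucket_note_bound`) are hypothetical, only an IPv4 address is compared, and the family and ipv6-only fields mentioned in the message are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum fastreuse_state {
	FASTREUSE_OFF = 0, /* every new socket needs the full conflict scan */
	FASTREUSE_ANY,     /* all owners so far are reuseport sockets */
	FASTREUSE_STRICT,  /* only sockets matching the cached address may skip */
};

struct sock_sketch {
	bool reuseport;
	uint32_t uid;
	uint32_t rcv_saddr; /* IPv4 only in this sketch */
};

struct bind_bucket_sketch {
	enum fastreuse_state fastreuseport;
	uint32_t fastuid;
	uint32_t fast_rcv_saddr;
};

/* May this socket skip the expensive walk over the owners list? */
bool can_skip_conflict_scan(const struct bind_bucket_sketch *tb,
			    const struct sock_sketch *sk)
{
	if (!sk->reuseport || tb->fastreuseport == FASTREUSE_OFF)
		return false;
	if (sk->uid != tb->fastuid)
		return false;
	if (tb->fastreuseport == FASTREUSE_STRICT)
		return sk->rcv_saddr == tb->fast_rcv_saddr;
	return true;
}

/* Called after sk was bound without conflicts; updates the bucket's cache. */
void bucket_note_bound(struct bind_bucket_sketch *tb,
		       const struct sock_sketch *sk, bool bucket_was_empty)
{
	if (!sk->reuseport) {
		tb->fastreuseport = FASTREUSE_OFF;
		return;
	}
	if (bucket_was_empty)
		tb->fastreuseport = FASTREUSE_ANY;
	else if (can_skip_conflict_scan(tb, sk))
		return; /* the bucket already describes this socket */
	else
		tb->fastreuseport = FASTREUSE_STRICT;
	tb->fastuid = sk->uid;
	tb->fast_rcv_saddr = sk->rcv_saddr;
}
```

The point of the strict state is that a bucket that once held a non-reuseport socket can re-enter the fast path, but only for sockets that match the one already known to be conflict-free.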
     
  • inet_csk_get_port does two different things: it either scans for an open port,
    or it tries to see if the specified port is available for use. Since these two
    operations have different rules and are basically independent, let's split them
    into two different functions to make them both more readable.

    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
     
  • This is just wasted time, we've already found a tb that doesn't have a bind
    conflict, and we don't drop the head lock so scanning again isn't going to give
    us a different answer. Instead move the tb->reuse setting logic outside of the
    found_tb path and put it in the success: path. Then make it so that we don't
    goto again if we find a bind conflict in the found_tb path as we won't reach
    this anymore when we are scanning for an ephemeral port.

    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
     
  • In inet_csk_get_port we seem to be using smallest_port to figure out where the
    best place to look for a SO_REUSEPORT sk that matches with an existing set of
    SO_REUSEPORT's. However if we get to the logic

            if (smallest_size != -1) {
                    port = smallest_port;
                    goto have_port;
            }

    we will do a useless search, because we would have already done the
    inet_csk_bind_conflict for that port and it would have returned 1, otherwise we
    would have gone to found_tb and succeeded. Since this logic makes us do yet
    another trip through inet_csk_bind_conflict for a port we know won't work just
    delete this code and save us the time.

    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
     
  • The only difference between inet6_csk_bind_conflict and inet_csk_bind_conflict
    is how they check the rcv_saddr, so delete this call back and simply
    change inet_csk_bind_conflict to call inet_rcv_saddr_equal.

    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
     
  • We pass these per-protocol equal functions around in various places, but
    we can just have one function that checks the sk->sk_family and then do
    the right comparison function. I've also changed the ipv4 version to
    not cast to inet_sock since it is unneeded.

    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
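
A single comparison function that dispatches on the socket family might look like the sketch below. It is a userspace model with hypothetical names (`sk_sketch`, `rcv_saddr_equal_sketch`) and it assumes both sockets share a family; dual-stack and v4-mapped handling is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum family_sketch { FAM_INET, FAM_INET6 };

struct sk_sketch {
	enum family_sketch family;
	uint32_t v4_saddr;    /* 0 means wildcard */
	uint8_t v6_saddr[16]; /* all zeroes means wildcard */
};

static bool v4_equal(const struct sk_sketch *a, const struct sk_sketch *b,
		     bool match_wildcard)
{
	if (a->v4_saddr == b->v4_saddr)
		return true;
	if (a->v4_saddr == 0 || b->v4_saddr == 0)
		return match_wildcard;
	return false;
}

static bool v6_is_any(const uint8_t addr[16])
{
	static const uint8_t zero[16];

	return memcmp(addr, zero, sizeof(zero)) == 0;
}

static bool v6_equal(const struct sk_sketch *a, const struct sk_sketch *b,
		     bool match_wildcard)
{
	if (memcmp(a->v6_saddr, b->v6_saddr, 16) == 0)
		return true;
	if (v6_is_any(a->v6_saddr) || v6_is_any(b->v6_saddr))
		return match_wildcard;
	return false;
}

/* One entry point instead of per-protocol callbacks passed around. */
bool rcv_saddr_equal_sketch(const struct sk_sketch *a,
			    const struct sk_sketch *b, bool match_wildcard)
{
	if (a->family == FAM_INET6)
		return v6_equal(a, b, match_wildcard);
	return v4_equal(a, b, match_wildcard);
}
```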
     
  • This patch adds more info to the stmicro Kconfig files in order to make it clearer
    that the driver can be used by ethernet cards based on 10/100/1000/EQOS
    Synopsys IP Cores.

    EQOS was also added to stmmac/Kconfig, since dwmac4 is in fact EQoS,
    one of Synopsys' Ethernet IPs. More info at:
    https://www.synopsys.com/dw/ipdir.php?ds=dwc_ether_qos

    Signed-off-by: Joao Pinto
    Signed-off-by: David S. Miller

    jpinto
     
  • This patch adds bpf_xdp_adjust_head() support to mlx5e.

    1. rx_headroom is added to struct mlx5e_rq. It uses
    an existing 4 byte hole in the struct.
    2. The adjusted data length is checked against
    MLX5E_XDP_MIN_INLINE and MLX5E_SW2HW_MTU(rq->netdev->mtu).
    3. The macro MLX5E_SW2HW_MTU is moved from en_main.c to en.h.
    MLX5E_HW2SW_MTU is also moved to en.h for symmetry,
    but that is not required.

    v2:
    - Keep the xdp specific logic in mlx5e_xdp_handle()
    - Update dma_len after the sanity checks in mlx5e_xmit_xdp_frame()

    Signed-off-by: Martin KaFai Lau
    Acked-by: Saeed Mahameed
    Signed-off-by: David S. Miller

    Martin KaFai Lau
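
The length check in item 2 can be illustrated with a small sketch. The constant values here are assumptions chosen for illustration (a minimum of 18 bytes for an Ethernet header plus VLAN tag, and 22 bytes of overhead on top of the MTU); the driver's actual macro definitions should be taken from en.h. The function name `xdp_adjusted_len_ok` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration only; see the driver headers for the real ones. */
#define SKETCH_XDP_MIN_INLINE 18u            /* Ethernet header + VLAN tag */
#define SKETCH_SW2HW_MTU(mtu) ((mtu) + 22u)  /* + Ethernet, VLAN and FCS bytes */

/*
 * After an XDP program has moved the start of the packet, the new length
 * must still be long enough to inline the headers and short enough to fit
 * the hardware MTU.
 */
bool xdp_adjusted_len_ok(const uint8_t *data, const uint8_t *data_end,
			 unsigned int mtu)
{
	size_t len;

	if (data_end < data)
		return false;
	len = (size_t)(data_end - data);
	return len >= SKETCH_XDP_MIN_INLINE && len <= SKETCH_SW2HW_MTU(mtu);
}
```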
     

18 Jan, 2017

13 commits

  • Using a Mac OSX box as a client connecting to a Linux server, we have found
    that when certain applications (such as 'ab'), are abruptly terminated
    (via ^C), a FIN is sent followed by a RST packet on tcp connections. The
    FIN is accepted by the Linux stack but the RST is sent with the same
    sequence number as the FIN, and Linux responds with a challenge ACK per
    RFC 5961. The OSX client then sometimes (they are rate-limited) does not
    reply with any RST as would be expected on a closed socket.

    This results in sockets accumulating on the Linux server left mostly in
    the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
    This sequence of events can tie up a lot of resources on the Linux server
    since there may be a lot of data in write buffers at the time of the RST.
    Accepting a RST equal to rcv_nxt - 1, after we have already successfully
    processed a FIN, has made a significant difference for us in practice, by
    freeing up unneeded resources in a more expedient fashion.

    A packetdrill test demonstrating the behavior:

    // testing mac osx rst behavior

    // Establish a connection
    0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    0.000 bind(3, ..., ...) = 0
    0.000 listen(3, 1) = 0

    0.100 < S 0:0(0) win 32768
    0.100 > S. 0:0(0) ack 1
    0.200 < . 1:1(0) ack 1 win 32768
    0.200 accept(3, ..., ...) = 4

    // Client closes the connection
    0.300 < F. 1:1(0) ack 1 win 32768

    // now send rst with same sequence
    0.300 < R. 1:1(0) ack 1 win 32768

    // make sure we are in TCP_CLOSE
    0.400 %{
    assert tcpi_state == 7
    }%

    Signed-off-by: Jason Baron
    Cc: Eric Dumazet
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jason Baron
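
The acceptance rule described above can be sketched as a pure function. This is a simplified model, not the TCP stack's code: the enum and the name `rst_seq_acceptable` are hypothetical, and window checks and challenge-ACK handling are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum tcp_state_sketch {
	ST_ESTABLISHED,
	ST_CLOSE_WAIT,
	ST_LAST_ACK,
	ST_CLOSING,
};

/*
 * A RST is accepted when its sequence number is exactly rcv_nxt, or when a
 * FIN has already been processed (so rcv_nxt moved past it) and the RST
 * carries the same sequence number as that FIN, i.e. rcv_nxt - 1.
 */
bool rst_seq_acceptable(uint32_t seq, uint32_t rcv_nxt,
			enum tcp_state_sketch state)
{
	bool fin_processed = state == ST_CLOSE_WAIT ||
			     state == ST_LAST_ACK ||
			     state == ST_CLOSING;

	if (seq == rcv_nxt)
		return true;
	/* The cast keeps the subtraction wrapping modulo 2^32. */
	return fin_processed && seq == (uint32_t)(rcv_nxt - 1);
}
```

In the packetdrill trace above, the FIN at sequence 1 advances rcv_nxt to 2, so the RST at sequence 1 matches rcv_nxt - 1 while the socket is in CLOSE_WAIT.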
     
  • Make the needlessly global struct ethtool_ops ethoc_ethtool_ops static
    to fix a sparse warning.

    Signed-off-by: Tobias Klauser
    Signed-off-by: David S. Miller

    Tobias Klauser
     
  • Edward Cree says:

    ====================
    sfc: RX hash configuration

    This series improves support for getting and setting RX hashing
    configuration on Solarflare adapters through ethtool.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Ensures that we report the key and indirection table the NIC is using,
    rather than (if setting them failed earlier) what we wanted it to use.

    Signed-off-by: Edward Cree
    Signed-off-by: David S. Miller

    Edward Cree
     
  • Signed-off-by: Edward Cree
    Signed-off-by: David S. Miller

    Edward Cree
     
  • Signed-off-by: Ganesh Goudar
    Signed-off-by: David S. Miller

    Ganesh Goudar
     
  • inet_num is a u16, so use %hu instead of casting it to int. And
    sk_bound_dev_if is already an int, so it does not need a cast to int.

    Signed-off-by: Gao Feng
    Signed-off-by: David S. Miller

    Gao Feng
     
  • Use eth_zero_addr to assign the zero address to the given address array
    instead of memset when the fill value passed to memset is
    zero. This also makes the code clearer.

    Signed-off-by: Shyam Saini
    Signed-off-by: David S. Miller

    Shyam Saini
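
For readers unfamiliar with the helper, the substitution amounts to the following. This is a userspace stand-in that mirrors the behaviour of the kernel helper; `SKETCH_ETH_ALEN` and `eth_zero_addr_sketch` are names made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ETH_ALEN 6 /* length of an Ethernet (MAC) address in bytes */

/* Userspace stand-in: zero a six-byte Ethernet address. */
void eth_zero_addr_sketch(uint8_t *addr)
{
	memset(addr, 0x00, SKETCH_ETH_ALEN);
}

/* Returns 1 if all six bytes of the address are zero. */
int eth_addr_is_zero_sketch(const uint8_t *addr)
{
	static const uint8_t zero[SKETCH_ETH_ALEN];

	return memcmp(addr, zero, SKETCH_ETH_ALEN) == 0;
}
```

The named helper states the intent at the call site and removes the chance of passing the wrong length to memset.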
     
  • Changed type of csum field in struct igmpv3_query from __be16 to
    __sum16 to eliminate type warning, made same change in struct
    igmpv3_report for consistency.

    Fixed up an ntohs() where htons() should have been used instead.

    Signed-off-by: Lance Richardson
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Lance Richardson
     
  • David S. Miller
     
  • Robert Shearman says:

    ====================
    mpls: Packet stats

    This patchset records per-interface packet stats in the MPLS
    forwarding path and exports them using a nest of attributes rooted at a
    new IFLA_STATS_AF_SPEC attribute as part of RTM_GETSTATS messages:

    [IFLA_STATS_AF_SPEC]
      -> [AF_MPLS]
        -> [MPLS_STATS_LINK]
          -> struct mpls_link_stats

    The first patch adds the rtnl infrastructure for this, including new
    per-AF ops callbacks fill_stats_af and get_stats_af_size. The
    second patch records MPLS stats and makes use of the infrastructure to
    export them. The rtnl infrastructure could also be used to export IPv6
    stats in the future.

    Changes in v2:
    - make incrementing IPv6 stats in mpls_stats_inc_outucastpkts
    conditional on CONFIG_IPV6 to fix build with CONFIG_IPV6=n
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Having MPLS packet stats is useful for observing network operation and
    for diagnosing network problems. In the absence of anything better,
    RFC2863 and RFC3813 are used for guidance for which stats to expose
    and the semantics of them. In particular rx_noroutes maps to in
    unknown protos in RFC2863. The stats are exposed to userspace via
    AF_MPLS attributes embedded in the IFLA_STATS_AF_SPEC attribute of
    RTM_GETSTATS messages.

    All the introduced fields are 64-bit, even error ones, to ensure no
    overflow with long uptimes. Per-CPU counters are used to avoid
    cache-line contention on the commonly used fields. The other fields
    have also been made per-CPU for code to avoid performance problems in
    error conditions on the assumption that on some platforms the cost of
    atomic operations could be more expensive than sending the packet
    (which is what would be done in the success case). If that's not the
    case, we could instead not use per-CPU counters for these fields.

    Only unicast and non-fragment are exposed at the moment, but other
    counters can be exposed in the future either by adding to the end of
    struct mpls_link_stats or by additional netlink attributes in the
    AF_MPLS IFLA_STATS_AF_SPEC nested attribute.

    Signed-off-by: Robert Shearman
    Signed-off-by: David S. Miller

    Robert Shearman
     
  • Add the functionality for including address-family-specific per-link
    stats in RTM_GETSTATS messages. This is done through adding a new
    IFLA_STATS_AF_SPEC attribute under which address family attributes are
    nested and then the AF-specific attributes can be further nested. This
    follows the model of IFLA_AF_SPEC on RTM_*LINK messages and it has the
    advantage of presenting an easily extended hierarchy. The rtnl_af_ops
    structure is extended to provide AFs with the opportunity to fill and
    provide the size of their stats attributes.

    One alternative would have been to provide AFs with the ability to add
    attributes directly into the RTM_GETSTATS message without a nested
    hierarchy. I discounted this approach as it increases the rate at
    which the 32 attribute number space is used up and it makes
    implementation a little more tricky for stats dump resuming (at the
    moment the order in which attributes are added to the message has to
    match the numeric order of the attributes).

    Another alternative would have been to register per-AF RTM_GETSTATS
    handlers. I discounted this approach as I perceived a common use-case
    to be getting all the stats for an interface and this approach would
    necessitate multiple requests/dumps to retrieve them all.

    Signed-off-by: Robert Shearman
    Acked-by: Roopa Prabhu
    Signed-off-by: David S. Miller

    Robert Shearman
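
The size-then-fill pattern behind the two per-AF callbacks can be sketched in userspace as below. This is a model under assumptions, not the rtnetlink code: the callback signatures, the four-byte nest header (`NEST_HDR_LEN`), and the demo address family are all hypothetical, and real netlink attributes use struct nlattr with alignment rules that are ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NEST_HDR_LEN 4 /* hypothetical header: 2-byte family + 2-byte length */

struct af_stats_ops {
	uint16_t family;
	size_t (*get_stats_af_size)(void);
	/* Writes the payload; returns bytes written, or -1 if it does not fit. */
	int (*fill_stats_af)(unsigned char *buf, size_t room);
};

/* Space needed for the AF_SPEC nest: one header per AF that reports stats. */
size_t stats_af_spec_size(const struct af_stats_ops *ops, size_t n)
{
	size_t total = 0;

	for (size_t i = 0; i < n; i++) {
		size_t sz = ops[i].get_stats_af_size ?
			    ops[i].get_stats_af_size() : 0;
		if (sz > 0)
			total += NEST_HDR_LEN + sz;
	}
	return total;
}

/* Fill one nested entry per AF; returns bytes used, or -1 on overflow. */
int stats_af_spec_fill(const struct af_stats_ops *ops, size_t n,
		       unsigned char *buf, size_t room)
{
	size_t used = 0;

	for (size_t i = 0; i < n; i++) {
		size_t avail;
		uint16_t len;
		int written;

		if (!ops[i].fill_stats_af)
			continue;
		if (room - used < NEST_HDR_LEN)
			return -1;
		avail = room - used - NEST_HDR_LEN;
		written = ops[i].fill_stats_af(buf + used + NEST_HDR_LEN, avail);
		if (written < 0 || (size_t)written > avail || written > UINT16_MAX)
			return -1;
		if (written == 0)
			continue; /* nothing to report: do not emit an empty nest */
		len = (uint16_t)written;
		memcpy(buf + used, &ops[i].family, 2);
		memcpy(buf + used + 2, &len, 2);
		used += NEST_HDR_LEN + (size_t)written;
	}
	return (int)used;
}

/* Demo AF: eight bytes standing in for a per-link stats structure. */
size_t demo_af_size(void) { return 8; }

int demo_af_fill(unsigned char *buf, size_t room)
{
	if (room < 8)
		return -1;
	memset(buf, 0xab, 8);
	return 8;
}
```

Keeping each family's stats inside its own nested entry is what lets a single request return every family's counters without consuming top-level attribute numbers.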