10 Aug, 2018

40 commits

  • This is preparation for XDP TX and ndo_xdp_xmit.
    This allows napi handler to handle xdp_frames through xdp ring as well
    as sk_buff.

    v8:
    - Don't use xdp_frame pointer address to calculate skb->head and
    headroom.

    v7:
    - Use xdp_scrub_frame() instead of memset().

    v3:
    - Revert v2 change around rings and use a flag to differentiate skb and
    xdp_frame, since bulk skb xmit makes little performance difference
    for now.

    v2:
    - Use another ring instead of using flag to differentiate skb and
    xdp_frame. This approach makes bulk skb transmit possible in
    veth_xmit later.
    - Clear xdp_frame fields in skb->head.
    - Implement adjust_tail.

    Signed-off-by: Toshiaki Makita
    Acked-by: John Fastabend
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: Daniel Borkmann

    Toshiaki Makita
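
The v3 note above mentions using a flag to tell an skb from an xdp_frame on a shared ring. One common way to do this is to tag the low bit of the queued pointer. The following is a minimal userspace sketch of that idea, not the driver's code; `RING_XDP_FLAG`, `fake_skb`, `fake_frame` and the helper names are hypothetical, and the sketch assumes both object types are at least 2-byte aligned.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag: bit 0 of the pointer marks an XDP frame.
 * This relies on both object types being at least 2-byte aligned. */
#define RING_XDP_FLAG ((uintptr_t)1)

/* Stand-ins for the two kinds of object that share one ring. */
struct fake_skb   { int len; };
struct fake_frame { int len; };

/* Tag a frame pointer before it is queued on the ring. */
static inline void *ring_tag_xdp(struct fake_frame *frame)
{
    return (void *)((uintptr_t)frame | RING_XDP_FLAG);
}

/* True if the ring entry carries the XDP flag. */
static inline bool ring_is_xdp(const void *entry)
{
    return ((uintptr_t)entry & RING_XDP_FLAG) != 0;
}

/* Strip the flag to recover the original frame pointer. */
static inline struct fake_frame *ring_untag_xdp(void *entry)
{
    return (struct fake_frame *)((uintptr_t)entry & ~RING_XDP_FLAG);
}
```

A consumer would call `ring_is_xdp()` on each dequeued entry and either untag it as a frame or cast it back to an skb unchanged.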
     
  • xdp_frame has kernel pointers which should not be readable from bpf
    programs. When we want to reuse xdp_frame region but it may be read by
    bpf programs later, we can use this helper to clear kernel pointers.
    This is more efficient than calling memset() for the entire struct.

    Signed-off-by: Toshiaki Makita
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: Daniel Borkmann

    Toshiaki Makita
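
The difference between scrubbing and a full memset() can be sketched as follows. This is an illustration only: `struct frame_hdr` is a hypothetical stand-in, not the kernel's xdp_frame layout, and `frame_hdr_scrub()` is an assumed name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical frame header: scalar fields plus kernel-style pointers. */
struct frame_hdr {
    void    *data;      /* pointer: must not be readable later */
    uint16_t len;
    uint16_t headroom;
    uint32_t metasize;
    void    *dev_rx;    /* pointer: must not be readable later */
};

/* Clear only the pointer members instead of zeroing the whole struct. */
static inline void frame_hdr_scrub(struct frame_hdr *hdr)
{
    hdr->data = NULL;
    hdr->dev_rx = NULL;
}
```

Only two stores are needed here, and the scalar fields are left as they were.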
     
  • Oversized packets including GSO packets can be dropped if XDP is
    enabled on receiver side, so don't send such packets from peer.

    Drop TSO and SCTP fragmentation features so that veth devices themselves
    segment packets with XDP enabled. Also cap MTU accordingly.

    v4:
    - Don't auto-adjust MTU but cap max MTU.

    Signed-off-by: Toshiaki Makita
    Signed-off-by: Daniel Borkmann

    Toshiaki Makita
     
  • This is the basic implementation of veth driver XDP.

    Incoming packets are sent from the peer veth device in the form of skb,
    so this is generally doing the same thing as generic XDP.

    This by itself is not so useful, but it is a starting point to implement
    other useful veth XDP features like TX and REDIRECT.

    This introduces NAPI when XDP is enabled, because XDP now heavily
    relies on NAPI context. Use ptr_ring to emulate NIC ring. Tx function
    enqueues packets to the ring and peer NAPI handler drains the ring.

    Currently only one ring is allocated for each veth device, so it does
    not scale on multiqueue env. This can be resolved by allocating rings
    on the per-queue basis later.

    Note that NAPI is not used but netif_rx is used when XDP is not loaded,
    so this does not change the default behaviour.

    v6:
    - Check skb->len only when allocation is needed.
    - Add __GFP_NOWARN to alloc_page() as it can be triggered by external
    events.

    v3:
    - Fix race on closing the device.
    - Add extack messages in ndo_bpf.

    v2:
    - Squashed with the patch adding NAPI.
    - Implement adjust_tail.
    - Don't acquire consumer lock because it is guarded by NAPI.
    - Make poll_controller noop since it is unnecessary.
    - Register rxq_info on enabling XDP rather than on opening the device.

    Signed-off-by: Toshiaki Makita
    Signed-off-by: Daniel Borkmann

    Toshiaki Makita
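
The ring arrangement described above (the Tx function enqueues, the peer's poll handler drains up to a budget) can be sketched in userspace C. This is a single-threaded illustration with hypothetical names (`ptr_ring_sketch`, `ring_produce`, `ring_poll`, `RING_SIZE`); it has no locking or memory barriers and is not the kernel's ptr_ring implementation.

```c
#include <stddef.h>

#define RING_SIZE 4  /* hypothetical; kept tiny for illustration */

struct ptr_ring_sketch {
    void  *slot[RING_SIZE];  /* NULL means the slot is empty */
    size_t producer;         /* next slot the Tx side writes */
    size_t consumer;         /* next slot the poll side reads */
};

/* Tx side: returns 0 on success, -1 if the ring is full
 * (the caller would drop the packet). */
static int ring_produce(struct ptr_ring_sketch *r, void *pkt)
{
    if (pkt == NULL || r->slot[r->producer] != NULL)
        return -1;
    r->slot[r->producer] = pkt;
    r->producer = (r->producer + 1) % RING_SIZE;
    return 0;
}

/* Poll side: take at most `budget` entries and return how many were taken. */
static int ring_poll(struct ptr_ring_sketch *r, void **out, int budget)
{
    int done = 0;

    while (done < budget && r->slot[r->consumer] != NULL) {
        out[done++] = r->slot[r->consumer];
        r->slot[r->consumer] = NULL;  /* hand the slot back to the producer */
        r->consumer = (r->consumer + 1) % RING_SIZE;
    }
    return done;
}
```

Using NULL as the "empty" marker means the producer and consumer never need to read each other's index, only the slot contents.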
     
  • This is needed for veth XDP which does skb_copy_expand()-like operation.

    v2:
    - Drop skb_copy_header part because it has already been exported now.

    Signed-off-by: Toshiaki Makita
    Signed-off-by: Daniel Borkmann

    Toshiaki Makita
     
  • Jesper Dangaard Brouer says:

    ====================
    Background: cpumap moves the SKB allocation out of the driver code,
    and instead allocates it on the remote CPU, and invokes the regular
    kernel network stack with the newly allocated SKB.

    The idea behind the XDP CPU redirect feature is to use XDP as a
    load-balancer step in front of the regular kernel network stack. But
    the current sample code does not provide a good example of this. Part
    of the reason is that I have implemented this as part of the Suricata
    XDP load-balancer.

    Given this is the most frequent feature request I get, this patchset
    implements the same XDP load-balancing as Suricata does, which is a
    symmetric hash based on the IP-pairs + L4-protocol.

    The expected setup for the use-case is to reduce the number of NIC RX
    queues via ethtool (as XDP can handle more per core), and via
    smp_affinity assign these RX queues to a set of CPUs, which will be
    handling RX packets. The CPUs that run the regular network stack are
    supplied to the sample xdp_redirect_cpu tool by specifying
    the --cpu option multiple times on the cmdline.

    I do note that cpumap SKB creation is not feature complete yet, and
    more work is coming. E.g. given GRO is not implemented yet, do expect
    TCP workloads to be slower. My measurements do indicate UDP workloads
    are faster.
    ====================

    Signed-off-by: Daniel Borkmann

    Daniel Borkmann
     
  • This implements XDP CPU redirection load-balancing across available
    CPUs, based on hashing the IP-pairs + L4-protocol. This is equivalent
    to the xdp-cpu-redirect feature in Suricata, which is inspired by the
    Suricata 'ippair' hashing code.

    An important property is that the hashing is flow symmetric, meaning
    that if the source and destination get swapped then the selected CPU
    will remain the same. This helps locality by placing both directions
    of a flow on the same CPU, in a forwarding/routing scenario.

    The hashing INITVAL (15485863, the 10^6th prime number) was chosen
    fairly arbitrarily, but experiments with kernel tree pktgen scripts
    (pktgen_sample04_many_flows.sh +pktgen_sample05_flow_per_thread.sh)
    showed this improved the distribution.

    This patch also changes the default loaded XDP program to be this
    load-balancer. Based on different user feedback, this seems to be
    the expected behavior of the sample xdp_redirect_cpu.

    Link: https://github.com/OISF/suricata/commit/796ec08dd7a63
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Daniel Borkmann

    Jesper Dangaard Brouer
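
The flow-symmetry property described above can be illustrated with a small sketch. It assumes the two addresses are combined with a commutative operation (addition) before hashing, and it uses a stand-in 32-bit mixer rather than the hash function used by the sample; `mix32` and `flow_to_cpu` are hypothetical names, and `n_cpus` must be non-zero.

```c
#include <stdint.h>

#define INITVAL 15485863u  /* seed value quoted in the commit message */

/* Stand-in 32-bit mixer (murmur3-style finalizer).
 * NOT the hash function used by the sample program. */
static uint32_t mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

/* Pick a CPU index for a flow. Swapping saddr and daddr gives the same
 * result because the addresses are combined with commutative addition. */
static uint32_t flow_to_cpu(uint32_t saddr, uint32_t daddr,
                            uint8_t l4_proto, uint32_t n_cpus)
{
    uint32_t key  = saddr + daddr;       /* symmetric combination */
    uint32_t seed = INITVAL + l4_proto;  /* protocol folded into the seed */

    return mix32(key ^ seed) % n_cpus;
}
```

With this shape, both directions of a connection (A to B and B to A) land on the same CPU index, which is the locality property the commit message describes.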
     
  • Adjusted function call API to take an initval. This allows the API
    user to set the initial value, as a seed. This could also be used for
    inputting the previous hash.

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Daniel Borkmann

    Jesper Dangaard Brouer
     
  • This reverts commit 36e0f12bbfd3016f495904b35e41c5711707509f.

    The reverted commit adds a WARN to check against NULL entries in the
    mem_id_ht rhashtable. Any kernel path implementing the XDP (generic or
    driver) fast path is required to make a paired
    xdp_rxq_info_reg/xdp_rxq_info_unreg call for proper function. In
    addition, a driver using a different allocation scheme than the
    default MEM_TYPE_PAGE_SHARED is required to additionally call
    xdp_rxq_info_reg_mem_model.

    For MEM_TYPE_ZERO_COPY, an xdp_rxq_info_reg_mem_model call ensures
    that the mem_id_ht rhashtable has a properly inserted allocator id. If
    not, this would be a driver bug. A NULL pointer kernel OOPS is
    preferred to the WARN.

    Suggested-by: Jesper Dangaard Brouer
    Signed-off-by: Björn Töpel
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: Daniel Borkmann

    Björn Töpel
     
  • Overlapping changes in RXRPC, changing to ktime_get_seconds() whilst
    adding some tracepoints.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Jose Abreu says:

    ====================
    Add support for XGMAC2 in stmmac

    This series adds support for 10 Gigabit IP in stmmac. The IP is called XGMAC2
    and has many similarities with GMAC4. Due to this, it's relatively easy to
    incorporate this new IP into the stmmac driver by adding a new block and
    filling the necessary callbacks.

    The functionality added by this series is still reduced, but it's only a
    starting point which will later be expanded.

    I split the patches by functionality to ease the review. Only patch 8/9
    really enables the XGMAC2 block by adding a new compatible string.

    Version 4 addresses review comments of Florian Fainelli and Rob Herring.

    NOTE: Although the IP supports 10G, for now it was only possible to test it
    at 1G speed due to 10G PHY HW shipping problems. Here follows iperf3
    results at 1G:

    Connecting to host 192.168.0.10, port 5201
    [ 4] local 192.168.0.3 port 39178 connected to 192.168.0.10 port 5201
    [ ID] Interval Transfer Bandwidth Retr Cwnd
    [ 4] 0.00-1.00 sec 110 MBytes 920 Mbits/sec 0 482 KBytes
    [ 4] 1.00-2.00 sec 113 MBytes 946 Mbits/sec 0 482 KBytes
    [ 4] 2.00-3.00 sec 112 MBytes 937 Mbits/sec 0 482 KBytes
    [ 4] 3.00-4.00 sec 113 MBytes 946 Mbits/sec 0 482 KBytes
    [ 4] 4.00-5.00 sec 112 MBytes 935 Mbits/sec 0 482 KBytes
    [ 4] 5.00-6.00 sec 113 MBytes 946 Mbits/sec 0 482 KBytes
    [ 4] 6.00-7.00 sec 112 MBytes 937 Mbits/sec 0 482 KBytes
    [ 4] 7.00-8.00 sec 113 MBytes 946 Mbits/sec 0 482 KBytes
    [ 4] 8.00-9.00 sec 112 MBytes 937 Mbits/sec 0 482 KBytes
    [ 4] 9.00-10.00 sec 113 MBytes 946 Mbits/sec 0 482 KBytes
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval Transfer Bandwidth Retr
    [ 4] 0.00-10.00 sec 1.09 GBytes 940 Mbits/sec 0 sender
    [ 4] 0.00-10.00 sec 1.09 GBytes 938 Mbits/sec receiver
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Adds the documentation for XGMAC2 DT bindings.

    Signed-off-by: Jose Abreu
    Cc: David S. Miller
    Cc: Joao Pinto
    Cc: Giuseppe Cavallaro
    Cc: Alexandre Torgue
    Cc: Sergei Shtylyov
    Cc: devicetree@vger.kernel.org
    Cc: Rob Herring
    Signed-off-by: David S. Miller

    Jose Abreu
     
  • Add the bindings parsing for XGMAC2 IP block.

    Signed-off-by: Jose Abreu
    Cc: David S. Miller
    Cc: Joao Pinto
    Cc: Giuseppe Cavallaro
    Cc: Alexandre Torgue
    Signed-off-by: David S. Miller

    Jose Abreu
     
  • Now that we have all the XGMAC related callbacks, let's start integrating
    this IP block into the main driver.

    Also, we corrected the initialization flow to only start DMA after
    setting descriptors length.

    Signed-off-by: Jose Abreu
    Cc: David S. Miller
    Cc: Joao Pinto
    Cc: Giuseppe Cavallaro
    Cc: Alexandre Torgue
    Cc: Andrew Lunn
    Signed-off-by: David S. Miller

    Jose Abreu
     
  • XGMAC2 uses the same timestamping engine as GMAC4. Let's use the same
    callbacks.

    Signed-off-by: Jose Abreu
    Cc: David S. Miller
    Cc: Joao Pinto
    Cc: Giuseppe Cavallaro
    Cc: Alexandre Torgue
    Signed-off-by: David S. Miller

    Jose Abreu
     
  • Add the MDIO related functionality for the new IP block XGMAC2.

    Signed-off-by: Jose Abreu
    Cc: David S. Miller
    Cc: Joao Pinto
    Cc: Giuseppe Cavallaro
    Cc: Alexandre Torgue
    Cc: Andrew Lunn
    Cc: Florian Fainelli
    Signed-off-by: David S. Miller

    Jose Abreu
     
  • Add the descriptor related callbacks for the new IP block XGMAC2.

    Signed-off-by: Jose Abreu
    Cc: David S. Miller
    Cc: Joao Pinto
    Cc: Giuseppe Cavallaro
    Cc: Alexandre Torgue
    Signed-off-by: David S. Miller

    Jose Abreu
     
  • Add the DMA related callbacks for the new IP block XGMAC2.

    Signed-off-by: Jose Abreu
    Cc: David S. Miller
    Cc: Joao Pinto
    Cc: Giuseppe Cavallaro
    Cc: Alexandre Torgue
    Cc: Florian Fainelli
    Signed-off-by: David S. Miller

    Jose Abreu
     
  • Add the MAC related callbacks for the new IP block XGMAC2.

    Signed-off-by: Jose Abreu
    Cc: David S. Miller
    Cc: Joao Pinto
    Cc: Giuseppe Cavallaro
    Cc: Alexandre Torgue
    Signed-off-by: David S. Miller

    Jose Abreu
     
  • Add a new entry to HWIF table for XGMAC 2.10. For now we fill it with
    empty callbacks which will be added in later patches.

    Signed-off-by: Jose Abreu
    Cc: David S. Miller
    Cc: Joao Pinto
    Cc: Giuseppe Cavallaro
    Cc: Alexandre Torgue
    Signed-off-by: David S. Miller

    Jose Abreu
     
  • Andrew Lunn says:

    ====================
    More complete PHYLINK support for mv88e6xxx

    Previous patches added sufficient PHYLINK support to the mv88e6xxx
    that it did not break existing use cases, basically fixed-link phys.

    This patchset builds out the support so that SFP modules up to
    2.5Gbps can be supported on mv88e6390X, on ports 9 and 10. It also
    provides a framework which can be extended to support SFPs on ports
    2-8 of mv88e6390X, 10Gbps PHYs, and SFP support on the 6352 family.

    Russell King did much of the initial work, implementing the validate
    and mac_link_state calls. However, there is an important TODO in the
    commit message:

    needs to call phylink_mac_change() when the port link comes up/goes down.

    The remaining patches implement this, by adding more support for the
    SERDES interfaces, in particular, interrupt support so we get notified
    when the SERDES gains/loses sync.

    This has been tested on the ZII devel C, using a Clearfog as peer
    device.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • When a port changes CMODE, the SERDES interface being used can change.
    Disable interrupts for the old SERDES interface, and enable interrupts
    on the new.

    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • phylink wants to know when the MAC layer notices a change in the
    link. For the 6390 family, this is a change in the SERDES state.

    Add interrupt support for the SERDES interface used to implement
    SGMII/1000Base-X/2500Base-X. This is currently limited to ports 9 and
    10. Support for the 10G SERDES and other ports will be added later,
    building on this basic framework.

    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • An upcoming change will register interrupts for individual switch
    ports, using the mv88e6xxx_port as the interrupt context information.
    Add members to the mv88e6xxx_port structure so we can link it back to
    the mv88e6xxx_chip member the port belongs to and the port number of
    the port.

    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • The 6390 family has a number of SERDES interfaces per port. When the
    cmode changes, e.g. 1000Base-X to XAUI, the SERDES interface in use will
    also change. Power down the old SERDES interface and power up the new
    SERDES interface.

    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • The port's CMODE indicates the type of link between the MAC and the
    PHY. It is used often in the SERDES code. Rather than read it each
    time, cache its value.

    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • The 6390 has three different SERDES interface types. 2500Base-X is
    implemented by the SGMII/1000Base-X SERDES. So power on/off the
    correct SERDES.

    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • Add a helper for accessing SERDES registers of the 6390 family.

    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • There is a need to add more functions manipulating the SERDES
    interfaces. Cleanup the namespace.

    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • The 6390 has two SERDES interfaces, used by ports 9 and 10. The 6390X
    has eight SERDES interfaces. These allow ports 9 and 10 to do 10G. Or
    if lower speeds are used, some of the SERDES interfaces can be used by
    ports 2-8 for 1000Base-X.

    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • The 6390 family has 8 SERDES lanes. What ports use these lanes depends
    on how ports 9 and 10 are configured. If ports 9 and 10 do not make use
    of a lane, one of the lower ports can use it.

    Add a function to return the lane a port is using, if any, and simplify
    the code to power up/down the lane.

    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • Add rudimentary phylink support to mv88e6xxx.

    TODO:
    - needs to call phylink_mac_change() when the port link comes up/goes down.

    Signed-off-by: Russell King
    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Russell King
     
  • Add a helper for MAC drivers to use in their validate callback to deal
    with 2500BaseX vs 1000BaseX modes, where the hardware supports both
    but it is not possible to automatically select between them.

    This helper defaults to 1000BaseX, as that is the 802.3 standard, and
    will allow users to select 2500BaseX either by forcing the speed if
    AN is disabled, or by changing the advertising mask if AN is enabled.
    Disabling AN is not recommended as it is only the speed that we're
    interested in controlling, not the duplex or pause mode parameters.

    Signed-off-by: Russell King
    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Russell King
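
The selection rule described above can be sketched as a small function: default to 1000BaseX, and pick 2500BaseX only when it is advertised (AN enabled) or the speed is forced to 2500 (AN disabled). The types and names below (`link_state_sketch`, `basex_select_speed`, the `MODE_*` values) are hypothetical simplifications, not phylink's data structures.

```c
#include <stdbool.h>

enum basex_mode { MODE_1000BASEX, MODE_2500BASEX };

/* Hypothetical, simplified link state. */
struct link_state_sketch {
    enum basex_mode mode;
    bool an_enabled;
    bool adv_1000basex;  /* 1000BaseX present in the advertising mask */
    bool adv_2500basex;  /* 2500BaseX present in the advertising mask */
    int  speed;          /* Mb/s; only consulted when AN is disabled */
};

/* Resolve the 1000BaseX/2500BaseX ambiguity in place. */
static void basex_select_speed(struct link_state_sketch *st)
{
    bool want_2500 = st->an_enabled ? st->adv_2500basex
                                    : st->speed == 2500;

    if (want_2500) {
        st->adv_1000basex = false;
        st->mode = MODE_2500BASEX;
    } else {
        st->adv_2500basex = false;
        st->mode = MODE_1000BASEX;
    }
}
```

Because the two modes cannot be auto-selected, the sketch always leaves exactly one of them in the advertising mask.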
     
  • The 6185 can enable/disable 802.3z pause by setting the MyPause bit in
    the port status register. Add an op to support this.

    Signed-off-by: Russell King
    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • Ido Schimmel says:

    ====================
    mlxsw: Various updates

    Patches 1-3 update the driver to use a new firmware version. Due to a
    recently discovered issue, the version (and future ones) does not
    support matching on VLAN ID at egress. This is enforced in the driver
    and reported back to the user via extack.

    Patch 4 adds a new selftest for the recently introduced algorithmic
    TCAM.

    Patch 5 converts the driver to use SPDX identifiers.

    Patches 6-7 fix a bug in ethtool stats reporting and expose counters for
    all 16 TCs, following recent MC-aware changes that utilize TCs 8-15.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Before MC-aware mode was enabled in commit 7b8195306694 ("mlxsw:
    spectrum: Configure MC-aware mode on mlxsw ports"), only 8 traffic
    classes were used. Under MC-aware regime, however, besides using TCs
    0-7 for UC traffic, it additionally uses TCs 8-15 for BUM traffic. It
    is therefore desirable to show counters for these TCs as well.

    Update ethtool stats pool length, mlxsw_sp_port_get_strings() and
    mlxsw_sp_port_get_stats() to include artifacts for all 16 TCs. For
    consistency and simplicity, expose tc_no_buffer_discard_uc_tc for BUM
    TCs as well, even though it ought to stay at 0 all the time.

    Signed-off-by: Petr Machata
    Signed-off-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Petr Machata
     
  • The function mlxsw_sp_port_get_sset_count() is supposed to return the
    total number of ethtool strings that mlxsw supports. Specifically for
    names of statistic counters (the only string type that mlxsw supports
    as of now), that number is stored in MLXSW_SP_PORT_ETHTOOL_STATS_LEN.
    However, when adding RFC-2819 counters, that define wasn't updated to
    include the new counters. As a result, ethtool snips out the counters
    towards the end of the list, which contains per-TC counters, and only
    the first three traffic classes end up being reported.

    Fix by adding MLXSW_SP_PORT_HW_RFC_2819_STATS_LEN as appropriate.

    Fixes: 1222d15a01c7 ("mlxsw: spectrum: Expose counters for various packet sizes")
    Signed-off-by: Petr Machata
    Signed-off-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Petr Machata
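
The class of bug described above, where a total-length define omits a newly added group so that entries at the end of the list are cut off, can be sketched with made-up numbers. All sizes and names below are hypothetical and do not reflect the driver's actual counter counts.

```c
#include <stddef.h>

/* Hypothetical group sizes, chosen only for illustration. */
#define BASE_STATS_LEN    4
#define SIZE_STATS_LEN    3   /* the newly added group */
#define PER_TC_STATS_LEN  2
#define TC_COUNT          8

/* Buggy total: the new group was added to the list but not to the sum. */
#define TOTAL_LEN_BUGGY (BASE_STATS_LEN + PER_TC_STATS_LEN * TC_COUNT)
/* Fixed total: every group in the list is accounted for. */
#define TOTAL_LEN_FIXED (BASE_STATS_LEN + SIZE_STATS_LEN + \
                         PER_TC_STATS_LEN * TC_COUNT)

/* How many per-TC counters survive when a consumer reads only the first
 * total_len entries of the list (per-TC counters come last). */
static size_t tc_counters_reported(size_t total_len)
{
    size_t before_tc = BASE_STATS_LEN + SIZE_STATS_LEN;
    size_t wanted = PER_TC_STATS_LEN * TC_COUNT;
    size_t room;

    if (total_len <= before_tc)
        return 0;
    room = total_len - before_tc;
    return room < wanted ? room : wanted;
}
```

With these made-up sizes, the buggy total leaves room for fewer per-TC counters than exist, so the tail of the list is silently dropped.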
     
  • Signed-off-by: Jiri Pirko
    Signed-off-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • Recent FW fixes a bug and allows loading a newly flashed FW image after
    reset. So make sure the reset happens after flash. Indicate the need
    down to the PCI layer by -EAGAIN.

    Signed-off-by: Jiri Pirko
    Signed-off-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Jiri Pirko
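
The control flow described above, where an upper layer returns -EAGAIN to ask the bus layer for a reset after flashing, might look roughly like the sketch below. Everything here is hypothetical (`dev_sketch`, `core_init`, `bus_probe`); it only illustrates the "flash, signal -EAGAIN, reset, retry" pattern and is not the driver's code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical device state. */
struct dev_sketch {
    bool fw_needs_flash;  /* running FW is older than required */
    int  resets;          /* number of resets performed by the bus layer */
};

/* Upper layer: flash new FW if needed, then ask the caller for a reset. */
static int core_init(struct dev_sketch *d)
{
    if (d->fw_needs_flash) {
        d->fw_needs_flash = false;  /* pretend the flash succeeded */
        return -EAGAIN;             /* new image loads only after reset */
    }
    return 0;
}

/* Bus layer: on -EAGAIN, reset the device and retry initialization once. */
static int bus_probe(struct dev_sketch *d)
{
    int err = core_init(d);

    if (err == -EAGAIN) {
        d->resets++;                /* stand-in for a real device reset */
        err = core_init(d);
    }
    return err;
}
```

Retrying only once keeps the sketch from looping forever if initialization keeps requesting a reset.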