27 Mar, 2020

25 commits

  • A future change is going to implement a new devlink command to request
    a snapshot on demand. As part of this, the logic for handling the
    snapshot ids will be refactored. To simplify the snapshot id allocation
    function, move it to a separate function prefixed by `__`. This helper
    function will assume the lock is held.

    While no other callers will exist, it simplifies refactoring the logic
    because there is no need to complicate the function with gotos to handle
    unlocking on failure.

    Signed-off-by: Jacob Keller
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jacob Keller
     
  • The devlink_region_snapshot_create function returns -ENOMEM when the
    maximum number of snapshots has been reached. This is confusing because
    it is not an issue of being out of memory. Change this to use -ENOSPC
    instead.

    Reported-by: Jiri Pirko
    Signed-off-by: Jacob Keller
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jacob Keller
     
  • A future change is going to add a new devlink command to request
    a snapshot on demand. This function will want to call the
    devlink_region_snapshot_create function while already holding the
    devlink instance lock.

    Extract the logic of this function into a static function prefixed by
    `__` to indicate that it is an internal helper function. Modify the
    original function to be implemented in terms of the new locked
    function.

    Signed-off-by: Jacob Keller
    Reviewed-by: Jiri Pirko
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jacob Keller
     
  • The function documentation comment for devlink_region_snapshot_create
    included a literal tab character between 'future analyses' that was
    difficult to spot as it happened to only display as one space wide.

    Fix the comment to use a space here instead of a stray tab appearing in
    the middle of a sentence.

    Signed-off-by: Jacob Keller
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jacob Keller
     
  • It does not makes sense that two snapshots for a given region would use
    different destructors. Simplify snapshot creation by adding
    a .destructor op for regions.

    This operation will replace the data_destructor for the snapshot
    creation, and makes snapshot creation easier.

    Noticed-by: Jakub Kicinski
    Signed-off-by: Jacob Keller
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jacob Keller
     
  • Modify the devlink region code in preparation for adding new operations
    on regions.

    Create a devlink_region_ops structure, and move the name pointer from
    within the devlink_region structure into the ops structure (similar to
    the devlink_health_reporter_ops).

    This prepares the regions to enable support of additional operations in
    the future such as requesting snapshots, or accessing the region
    directly without a snapshot.

    In order to re-use the constant strings in the mlx4 driver their
    declaration must be changed to 'const char * const' to ensure the
    compiler realizes that both the data and the pointer cannot change.

    Signed-off-by: Jacob Keller
    Reviewed-by: Jakub Kicinski
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jacob Keller
     
  • Lorenzo Bianconi says:

    ====================
    veth: move ndo_xdp_xmit stats to peer veth_rq

    Move ndo_xdp_xmit ethtool stats accounting to peer veth_rq.
    Move XDP_TX accounting to veth_xdp_flush_bq routine.

    Changes since v1:
    - rename xdp_xmit[_err] counters to peer_tq_xdp_xmit[_err]
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Rely on 'remote' veth_rq to account ndo_xdp_xmit ethtool counters.
    Move XDP_TX accounting to veth_xdp_flush_bq routine.
    Remove 'rx' prefix in rx xdp ethool counters

    Signed-off-by: Lorenzo Bianconi
    Acked-by: Toshiaki Makita
    Signed-off-by: David S. Miller

    Lorenzo Bianconi
     
  • Substitute net_device point with veth_rq one in veth_xdp_flush_bq,
    veth_xdp_flush and veth_xdp_tx signature. This is a preliminary patch
    to account xdp_xmit counter on 'receiving' veth_rq

    Acked-by: Toshiaki Makita
    Signed-off-by: Lorenzo Bianconi
    Signed-off-by: David S. Miller

    Lorenzo Bianconi
     
  • Move away from the deprecated API and return the shiny new ERRPTR where
    useful.

    Signed-off-by: Wolfram Sang
    Signed-off-by: David S. Miller

    Wolfram Sang
     
  • Move away from the deprecated API and return the shiny new ERRPTR where
    useful.

    Signed-off-by: Wolfram Sang
    Signed-off-by: David S. Miller

    Wolfram Sang
     
  • Petr Machata says:

    ====================
    Implement stats_update callback for pedit and skbedit

    The stats_update callback is used for adding HW counters to the SW ones.
    Both skbedit and pedit actions are actually recognized by flow_offload.h,
    but do not implement these callbacks. As a consequence, the reported values
    are only the SW ones, even where there is a HW counter available.

    Patch #1 adds the callback to action skbedit, patch #2 adds it to action
    pedit. Patch #3 tweaks an skbedit selftest with a check that would have
    caught this problem.

    The pedit test is not likewise tweaked, because the iproute2 pedit action
    currently does not support JSON dumping. This will be addressed later.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Currently the test checks the observable effect of skbedit priority:
    queueing of packets at the correct qdisc band. It therefore misses the fact
    that the counters for offloaded rules are not updated. Add an extra check
    for the counter.

    Signed-off-by: Petr Machata
    Signed-off-by: David S. Miller

    Petr Machata
     
  • Implement this callback in order to get the offloaded stats added to the
    kernel stats.

    Reported-by: Alexander Petrovskiy
    Signed-off-by: Petr Machata
    Signed-off-by: David S. Miller

    Petr Machata
     
  • Implement this callback in order to get the offloaded stats added to the
    kernel stats.

    Reported-by: Alexander Petrovskiy
    Signed-off-by: Petr Machata
    Signed-off-by: David S. Miller

    Petr Machata
     
  • Ido Schimmel says:

    ====================
    mlxsw: Offload TC action pedit munge dsfield

    Petr says:

    The Spectrum switches allow packet prioritization based on DSCP on ingress,
    and update of DSCP on egress. This is configured through the DCB APP rules.
    For some use cases, assigning a custom DSCP value based on an ACL match is
    a better tool. To that end, offload FLOW_ACTION_MANGLE to permit changing
    of dsfield as a whole, or DSCP and ECN values in isolation.

    After fixing a commentary nit in patch #1, and mlxsw naming in patch #2,
    patches #3 and #4 add the offload to mlxsw.

    Patch #5 adds a forwarding selftest for pedit dsfield, applicable to SW as
    well as HW datapaths. Patch #6 adds a mlxsw-specific test to verify DSCP
    rewrite due to DCB APP rules is not performed on pedited packets.

    The tests only cover IPv4 dsfield setting. We have tests for IPv6 as well,
    but would like to postpone their contribution until the corresponding
    iproute patches have been accepted.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • When DSCP is updated through an offloaded pedit action, DSCP rewrite on
    egress should be disabled. Add a test that check that it is so.

    Signed-off-by: Petr Machata
    Signed-off-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Petr Machata
     
  • Add a test that runs packets with dsfield set, and test that pedit adjusts
    the DSCP or ECN parts or the whole field.

    Signed-off-by: Petr Machata
    Signed-off-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Petr Machata
     
  • Offload action pedit ex munge when used with a flower classifier. Only
    allow setting of DSCP, ECN, or the whole DSField in IPv4 and IPv6 packets.

    Signed-off-by: Petr Machata
    Reviewed-by: Jiri Pirko
    Signed-off-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Petr Machata
     
  • The QOS_ACTION is used for manipulating the QOS attributes of the packet.
    Add the defines and helpers related to DSCP and ECN fields, and dscp_rw.

    Signed-off-by: Petr Machata
    Reviewed-by: Jiri Pirko
    Signed-off-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Petr Machata
     
  • The original idea was to reuse this set of actions for ECN rewrite as well,
    but on second look, it's not such a great idea. These two items should each
    have its own command. Rename the existing enum to make it obvious that it
    belongs to switch_prio_cmd.

    Signed-off-by: Petr Machata
    Reviewed-by: Jiri Pirko
    Signed-off-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Petr Machata
     
  • This field references FLOW_ACTION_PACKET_EDIT. Such action does not exist
    though. Instead the field is used for FLOW_ACTION_MANGLE and _ADD.

    Signed-off-by: Petr Machata
    Reviewed-by: Jiri Pirko
    Signed-off-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Petr Machata
     
  • Saeed Mahameed says:

    ====================
    mlx5-updates-2020-03-25

    1) Cleanups from Dan Carpenter and wenxu.

    2) Paul and Roi, Some minor updates and fixes to E-Switch to address
    issues introduced in the previous reg_c0 updates series.

    3) Eli Cohen simplifies and improves flow steering matching group searches
    and flow table entries version management.

    4) Parav Pandit, improves devlink eswitch mode changes thread safety.
    By making devlink rely on driver for thread safety and introducing mlx5
    eswitch mode change protection.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • drivers/net/ethernet/atheros/atlx/atl2.c:40:19: warning: ‘atl2_driver_string’ defined but not used [-Wunused-const-variable=]
    static const char atl2_driver_string[] = "Atheros(R) L2 Ethernet Driver";
    ^~~~~~~~~~~~~~~~~~

    commit ea973742140b ("net/atheros: Clean atheros code from driver version")
    left behind this, remove it.

    Reported-by: Hulk Robot
    Signed-off-by: YueHaibing
    Signed-off-by: David S. Miller

    YueHaibing
     
  • In the commit f73b12812a3d
    ("tipc: improve throughput between nodes in netns"), we're missing a check
    to handle TIPC_DIRECT_MSG type, it's still using old sending mechanism for
    this message type. So, throughput improvement is not significant as
    expected.

    Besides that, when sending a large message with that type, we're also
    handle wrong receiving queue, it should be enqueued in socket receiving
    instead of multicast messages.

    Fix this by adding the missing case for TIPC_DIRECT_MSG.

    Fixes: f73b12812a3d ("tipc: improve throughput between nodes in netns")
    Reported-by: Tuong Lien
    Signed-off-by: Hoang Le
    Acked-by: Jon Maloy
    Signed-off-by: David S. Miller

    Hoang Le
     

26 Mar, 2020

15 commits

  • Currently eswitch mode change is occurring from 2 different execution
    contexts as below.
    1. sriov sysfs enable/disable
    2. devlink eswitch set commands

    Both of them need to access eswitch related data structures in
    synchronized manner.
    Without any synchronization below race condition exist.

    SR-IOV enable/disable with devlink eswitch mode change:

    cpu-0 cpu-1
    ----- -----
    mlx5_device_disable_sriov() mlx5_devlink_eswitch_mode_set()
    mlx5_eswitch_disable() esw_offloads_stop()
    esw_offloads_disable() mlx5_eswitch_disable()
    esw_offloads_disable()

    Hence, they are synchronized using a new mode_lock.
    eswitch's state_lock is not used as it can lead to a deadlock scenario
    below and state_lock is only for vport and fdb exclusive access.

    ip link set vf
    netlink rcv_msg() - Lock A
    rtnl_lock
    vfinfo()
    esw->state_lock() - Lock B
    devlink eswitch_set
    devlink_mutex
    esw->state_lock() - Lock B
    attach_netdev()
    register_netdev()
    rtnl_lock - Lock A

    Alternatives considered:
    1. Acquiring rtnl lock before taking esw->state_lock to follow similar
    locking sequence as ip link flow during eswitch mode set.
    rtnl lock is not good idea for two reasons.
    (a) Holding rtnl lock for several hundred device commands is not good
    idea.
    (b) It leads to below and more similar deadlocks.

    devlink eswitch_set
    devlink_mutex
    rtnl_lock - Lock A
    esw->state_lock() - Lock B
    eswitch_disable()
    reload()
    ib_register_device()
    ib_cache_setup_one()
    rtnl_lock()

    2. Exporting devlink lock may lead to undesired use of it in vendor
    driver(s) in future.

    3. Unloading representors outside of the mode_lock requires
    serialization with other process trying to enable the eswitch.

    4. Differing the representors life cycle to a different workqueue
    requires synchronization with func_change_handler workqueue.

    Reviewed-by: Roi Dayan
    Reviewed-by: Bodong Wang
    Reviewed-by: Mark Bloch
    Signed-off-by: Parav Pandit
    Signed-off-by: Saeed Mahameed

    Parav Pandit
     
  • Subsequent patch protects eswitch mode changes across sriov and devlink
    interfaces. It is desirable for eswitch to provide thread safe eswitch
    enable and disable APIs.
    Hence, extend eswitch enable API to optionally update num_vfs when
    requested.

    In subsequent patch, eswitch num_vfs are updated after all the eswitch
    users eswitch drops its reference count.

    Reviewed-by: Roi Dayan
    Reviewed-by: Bodong Wang
    Reviewed-by: Mark Bloch
    Signed-off-by: Parav Pandit
    Signed-off-by: Saeed Mahameed

    Parav Pandit
     
  • In order to check eswitch state under a lock, prepare code to split
    capability check and eswitch state check into two helper functions.

    Reviewed-by: Roi Dayan
    Reviewed-by: Bodong Wang
    Reviewed-by: Mark Bloch
    Signed-off-by: Parav Pandit
    Signed-off-by: Saeed Mahameed

    Parav Pandit
     
  • devlink_nl_cmd_eswitch_set_doit() doesn't hold devlink->lock mutex while
    invoking driver callback. This is likely due to eswitch mode setting
    involves adding/remove devlink ports, health reporters or
    other devlink objects for a devlink device.

    So it is driver responsiblity to ensure thread safe eswitch state
    transition happening via either sriov legacy enablement or via devlink
    eswitch set callback.

    Therefore, get() callback should also be invoked without holding
    devlink->lock mutex.
    Vendor driver can use same internal lock which it uses during eswitch
    mode set() callback.
    This makes get() and set() implimentation symmetric in devlink core and
    in vendor drivers.

    Hence, remove holding devlink->lock mutex during eswitch get() callback.

    Failing to do so results into below deadlock scenario when mlx5_core
    driver is improved to handle eswitch mode set critical section invoked
    by devlink and sriov sysfs interface in subsequent patch.

    devlink_nl_cmd_eswitch_set_doit()
    mlx5_eswitch_mode_set()
    mutex_lock(esw->mode_lock) lock); lock); mode_lock)
    Reviewed-by: Mark Bloch
    Signed-off-by: Parav Pandit
    Signed-off-by: Saeed Mahameed

    Parav Pandit
     
  • mlx5_unload_one() always returns 0.
    Simplify callers of mlx5_unload_one() and remove the dead code.

    Reviewed-by: Moshe Shemesh
    Signed-off-by: Parav Pandit
    Signed-off-by: Saeed Mahameed

    Parav Pandit
     
  • mlx5_register_device() doesn't check for any error and always returns 0.
    Simplify mlx5_register_device() to return void and its caller, remove
    dead code related to it.

    Reviewed-by: Moshe Shemesh
    Signed-off-by: Parav Pandit
    Signed-off-by: Saeed Mahameed

    Parav Pandit
     
  • Group version is used when modifying a rule is allowed
    (FLOW_ACT_NO_APPEND is clear) to detect a case where the rule was found
    but while the groups where unlocked a new FTE was added. In this case,
    the added FTE could be one with identical match value so we need to
    attempt again with group lock held.

    Change the code so version is retrieved only when FLOW_ACT_NO_APPEND is
    cleared. As result, later compare can also be avoided if FLOW_ACT_NO_APPEND
    is cleared.

    Also improve comments text.

    Signed-off-by: Eli Cohen
    Reviewed-by: Mark Bloch
    Signed-off-by: Saeed Mahameed

    Eli Cohen
     
  • FTE version is not used anywhere in the code so avoid incrementing it.

    Signed-off-by: Eli Cohen
    Reviewed-by: Mark Bloch
    Signed-off-by: Saeed Mahameed

    Eli Cohen
     
  • When adding a rule to a flow group we need increment the version of the
    group. Current code fails to do that and as a result, when trying to add
    a rule, we will fail to discover a case where an FTE with the same match
    value was added while we scanned the groups of the same match criteria,
    thus we may try to add an FTE that was already added.

    Signed-off-by: Eli Cohen
    Reviewed-by: Mark Bloch
    Signed-off-by: Saeed Mahameed

    Eli Cohen
     
  • Instead of using two different structs for searching groups with the
    same match, use a single struct and thus simplify the code, make it more
    readable and smaller size which means less code cache misses.

    text data bss dec hex
    before: 35524 2744 0 38268 957c
    after: 35038 2744 0 37782 9396

    When testing add 70000 rules, delete all the rules, and repeat three
    times taking the average, we get (time in seconds):

    Before the change: insert 16.80, delete 11.02
    After the change: insert 16.55, delete 10.95

    Signed-off-by: Eli Cohen
    Reviewed-by: Mark Bloch
    Reviewed-by: Maor Gottlieb
    Signed-off-by: Saeed Mahameed

    Eli Cohen
     
  • The correct type is u32.

    Fixes: d18296ffd9cc ("net/mlx5: E-Switch, Introduce global tables")
    Signed-off-by: Roi Dayan
    Reviewed-by: Paul Blakey
    Signed-off-by: Saeed Mahameed

    Roi Dayan
     
  • We allocate a temporary memory but forget to free it.

    Fixes: 11b717d61526 ("net/mlx5: E-Switch, Get reg_c0 value on CQE")
    Signed-off-by: Roi Dayan
    Reviewed-by: Paul Blakey
    Signed-off-by: Saeed Mahameed

    Roi Dayan
     
  • Register c0 loopback is needed to fully support chains and prios.

    Enable chains and prio only if loopback (of reg c1 which came together
    with c0), is enabled. To be able to check that, move enabling of loopback
    before eswitch chains init.

    Signed-off-by: Paul Blakey
    Reviewed-by: Roi Dayan
    Signed-off-by: Saeed Mahameed

    Paul Blakey
     
  • Reg c0/c1 matching, rewrite of regs c0/c1, and copy header of regs c1,B
    is needed for the restore table to function, might not be supported by
    firmware, and creation of the restore table or the copy header will
    fail.

    Check reg_c1 loopback support, as firmware which supports this,
    should have all of the above.

    Fixes: 11b717d61526 ("net/mlx5: E-Switch, Get reg_c0 value on CQE")
    Signed-off-by: Paul Blakey
    Reviewed-by: Roi Dayan
    Signed-off-by: Saeed Mahameed

    Paul Blakey
     
  • The function mlx5e_rep_setup_ft_cb check chain_index is zero twice.

    Signed-off-by: wenxu
    Signed-off-by: Saeed Mahameed

    wenxu