31 Mar, 2020

6 commits

  • The previous patch allowed device drivers to publish their default
    binding between packet trap policers and packet trap groups. However,
    some users might not be content with this binding and would like to
    change it.

    In case user space passed a packet trap policer identifier when setting
    a packet trap group, invoke the appropriate device driver callback and
    pass the new policer identifier.

    v2:
    * Check for presence of 'DEVLINK_ATTR_TRAP_POLICER_ID' in
    devlink_trap_group_set() and bail if not present
    * Add extack error message in case trap group was partially modified

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Acked-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • Packet trap groups are used to aggregate logically related packet traps.
    Currently, these groups allow user space to batch operations such as
    setting the trap action of all member traps.

    In order to prevent the CPU from being overwhelmed by too many trapped
    packets, it is desirable to bind a packet trap policer to these groups.
    For example, to limit all the packets that encountered an exception
    during routing to 10Kpps.

    Allow device drivers to bind default packet trap policers to packet trap
    groups when the latter are registered with devlink.

    The next patch will enable user space to change this default binding.

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • Devices capable of offloading the kernel's datapath and perform
    functions such as bridging and routing must also be able to send (trap)
    specific packets to the kernel (i.e., the CPU) for processing.

    For example, a device acting as a multicast-aware bridge must be able to
    trap IGMP membership reports to the kernel for processing by the bridge
    module.

    In most cases, the underlying device is capable of handling packet rates
    that are several orders of magnitude higher compared to those that can
    be handled by the CPU.

    Therefore, in order to prevent the underlying device from overwhelming
    the CPU, devices usually include packet trap policers that are able to
    police the trapped packets to rates that can be handled by the CPU.

    This patch allows capable device drivers to register their supported
    packet trap policers with devlink. User space can then tune the
    parameters of these policer (currently, rate and burst size) and read
    from the device the number of packets that were dropped by the policer,
    if supported.

    Subsequent patches in the series will allow device drivers to create
    default binding between these policers and packet trap groups and allow
    user space to change the binding.

    v2:
    * Add 'strict_start_type' in devlink policy
    * Have device drivers provide max/min rate/burst size for each policer.
    Use them to check validity of user provided parameters

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • On low memory system, run time dumps can consume too much memory. Add
    administrator ability to disable auto dumps per reporter as part of the
    error flow handle routine.

    This attribute is not relevant while executing
    DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET.

    By default, auto dump is activated for any reporter that has a dump method,
    as part of the reporter registration to devlink.

    Signed-off-by: Eran Ben Elisha
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Eran Ben Elisha
     
  • When health reporter is registered to devlink, devlink will implicitly set
    auto recover if and only if the reporter has a recover method. No reason
    to explicitly get the auto recover flag from the driver.

    Remove this flag from all drivers that called
    devlink_health_reporter_create.

    All existing health reporters set auto recovery to true if they have a
    recover method.

    Yet, administrator can unset auto recover via netlink command as prior to
    this patch.

    Signed-off-by: Eran Ben Elisha
    Reviewed-by: Jiri Pirko
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Eran Ben Elisha
     
  • The rest of the devlink code sets the extack message using
    NL_SET_ERR_MSG_MOD. Change the existing appearances of NL_SET_ERR_MSG
    to NL_SET_ERR_MSG_MOD.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     

27 Mar, 2020

9 commits

  • Implement support for the DEVLINK_CMD_REGION_NEW command for creating
    snapshots. This new command parallels the existing
    DEVLINK_CMD_REGION_DEL.

    In order for DEVLINK_CMD_REGION_NEW to work for a region, the new
    ".snapshot" operation must be implemented in the region's ops structure.

    The desired snapshot id must be provided. This helps avoid confusion on
    the purpose of DEVLINK_CMD_REGION_NEW, and keeps the API simpler.

    The requested id will be inserted into the xarray tracking the number of
    snapshots using each id. If this id is already used by another snapshot
    on any region, an error will be returned.

    Signed-off-by: Jacob Keller
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jacob Keller
     
  • Each snapshot created for a devlink region must have an id. These ids
    are supposed to be unique per "event" that caused the snapshot to be
    created. Drivers call devlink_region_snapshot_id_get to obtain a new id
    to use for a new event trigger. The id values are tracked per devlink,
    so that the same id number can be used if a triggering event creates
    multiple snapshots on different regions.

    There is no mechanism for snapshot ids to ever be reused. Introduce an
    xarray to store the count of how many snapshots are using a given id,
    replacing the snapshot_id field previously used for picking the next id.

    The devlink_region_snapshot_id_get() function will use xa_alloc to
    insert an initial value of 1 value at an available slot between 0 and
    U32_MAX.

    The new __devlink_snapshot_id_increment() and
    __devlink_snapshot_id_decrement() functions will be used to track how
    many snapshots currently use an id.

    Drivers must now call devlink_snapshot_id_put() in order to release
    their reference of the snapshot id after adding region snapshots.

    By tracking the total number of snapshots using a given id, it is
    possible for the decrement() function to erase the id from the xarray
    when it is not in use.

    With this method, a snapshot id can become reused again once all
    snapshots that referred to it have been deleted via
    DEVLINK_CMD_REGION_DEL, and the driver has finished adding snapshots.

    This work also paves the way to introduce a mechanism for userspace to
    request a snapshot.

    Signed-off-by: Jacob Keller
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jacob Keller
     
  • The devlink_snapshot_id_get() function returns a snapshot id. The
    snapshot id is a u32, so there is no way to indicate an error code.

    A future change is going to possibly add additional cases where this
    function could fail. Refactor the function to return the snapshot id in
    an argument, so that it can return zero or an error value.

    This ensures that snapshot ids cannot be confused with error values, and
    aids in the future refactor of snapshot id allocation management.

    Because there is no current way to release previously used snapshot ids,
    add a simple check ensuring that an error is reported in case the
    snapshot_id would over flow.

    Signed-off-by: Jacob Keller
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jacob Keller
     
  • A future change is going to implement a new devlink command to request
    a snapshot on demand. As part of this, the logic for handling the
    snapshot ids will be refactored. To simplify the snapshot id allocation
    function, move it to a separate function prefixed by `__`. This helper
    function will assume the lock is held.

    While no other callers will exist, it simplifies refactoring the logic
    because there is no need to complicate the function with gotos to handle
    unlocking on failure.

    Signed-off-by: Jacob Keller
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jacob Keller
     
  • The devlink_region_snapshot_create function returns -ENOMEM when the
    maximum number of snapshots has been reached. This is confusing because
    it is not an issue of being out of memory. Change this to use -ENOSPC
    instead.

    Reported-by: Jiri Pirko
    Signed-off-by: Jacob Keller
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jacob Keller
     
  • A future change is going to add a new devlink command to request
    a snapshot on demand. This function will want to call the
    devlink_region_snapshot_create function while already holding the
    devlink instance lock.

    Extract the logic of this function into a static function prefixed by
    `__` to indicate that it is an internal helper function. Modify the
    original function to be implemented in terms of the new locked
    function.

    Signed-off-by: Jacob Keller
    Reviewed-by: Jiri Pirko
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jacob Keller
     
  • The function documentation comment for devlink_region_snapshot_create
    included a literal tab character between 'future analyses' that was
    difficult to spot as it happened to only display as one space wide.

    Fix the comment to use a space here instead of a stray tab appearing in
    the middle of a sentence.

    Signed-off-by: Jacob Keller
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jacob Keller
     
  • It does not makes sense that two snapshots for a given region would use
    different destructors. Simplify snapshot creation by adding
    a .destructor op for regions.

    This operation will replace the data_destructor for the snapshot
    creation, and makes snapshot creation easier.

    Noticed-by: Jakub Kicinski
    Signed-off-by: Jacob Keller
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jacob Keller
     
  • Modify the devlink region code in preparation for adding new operations
    on regions.

    Create a devlink_region_ops structure, and move the name pointer from
    within the devlink_region structure into the ops structure (similar to
    the devlink_health_reporter_ops).

    This prepares the regions to enable support of additional operations in
    the future such as requesting snapshots, or accessing the region
    directly without a snapshot.

    In order to re-use the constant strings in the mlx4 driver their
    declaration must be changed to 'const char * const' to ensure the
    compiler realizes that both the data and the pointer cannot change.

    Signed-off-by: Jacob Keller
    Reviewed-by: Jakub Kicinski
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jacob Keller
     

26 Mar, 2020

1 commit

  • devlink_nl_cmd_eswitch_set_doit() doesn't hold devlink->lock mutex while
    invoking driver callback. This is likely due to eswitch mode setting
    involves adding/remove devlink ports, health reporters or
    other devlink objects for a devlink device.

    So it is driver responsiblity to ensure thread safe eswitch state
    transition happening via either sriov legacy enablement or via devlink
    eswitch set callback.

    Therefore, get() callback should also be invoked without holding
    devlink->lock mutex.
    Vendor driver can use same internal lock which it uses during eswitch
    mode set() callback.
    This makes get() and set() implimentation symmetric in devlink core and
    in vendor drivers.

    Hence, remove holding devlink->lock mutex during eswitch get() callback.

    Failing to do so results into below deadlock scenario when mlx5_core
    driver is improved to handle eswitch mode set critical section invoked
    by devlink and sriov sysfs interface in subsequent patch.

    devlink_nl_cmd_eswitch_set_doit()
    mlx5_eswitch_mode_set()
    mutex_lock(esw->mode_lock) lock); lock); mode_lock)
    Reviewed-by: Mark Bloch
    Signed-off-by: Parav Pandit
    Signed-off-by: Saeed Mahameed

    Parav Pandit
     

24 Mar, 2020

3 commits

  • Packet trap groups are now explicitly registered by drivers and not
    implicitly registered when the packet traps are registered. Therefore,
    there is no need to encode entire group structure the trap is associated
    with inside the trap structure.

    Instead, only pass the group identifier. Refer to it as initial group
    identifier, as future patches will allow user space to move traps
    between groups.

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • Now that drivers explicitly register their supported packet trap groups
    there is no for devlink to create them on-demand and destroy them when
    their reference count reaches zero.

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • Currently, packet trap groups are implicitly registered by drivers upon
    packet trap registration. When the traps are registered, each is
    associated with a group and the group is created by devlink, if it does
    not exist already.

    This makes it difficult for drivers to pass additional attributes for
    the groups.

    Therefore, as a preparation for future patches that require passing
    additional group attributes, add an API to explicitly register /
    unregister these groups.

    Next patches will convert existing drivers to use this API.

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     

13 Mar, 2020

1 commit


04 Mar, 2020

3 commits

  • Currently mlx5 PCI PF and VF devlink devices register their ports as
    physical port in non-representors mode.

    Introduce a new port flavour as virtual so that virtual devices can
    register 'virtual' flavour to make it more clear to users.

    An example of one PCI PF and 2 PCI virtual functions, each having
    one devlink port.

    $ devlink port show
    pci/0000:06:00.0/1: type eth netdev ens2f0 flavour physical port 0
    pci/0000:06:00.2/1: type eth netdev ens2f2 flavour virtual port 0
    pci/0000:06:00.3/1: type eth netdev ens2f3 flavour virtual port 0

    Reviewed-by: Jiri Pirko
    Signed-off-by: Parav Pandit
    Acked-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Parav Pandit
     
  • DEVLINK_ATTR_REGION_CHUNK_ADDR and DEVLINK_ATTR_REGION_CHUNK_LEN
    lack entries in the netlink policy. Corresponding nla_get_u64()s
    may read beyond the end of the message.

    Fixes: 4e54795a27f5 ("devlink: Add support for region snapshot read command")
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • DEVLINK_ATTR_PARAM_VALUE_DATA may have different types
    so it's not checked by the normal netlink policy. Make
    sure the attribute length is what we expect.

    Fixes: e3b7ca18ad7b ("devlink: Add param set command")
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

29 Feb, 2020

1 commit

  • The current codebase makes use of the zero-length array language
    extension to the C90 standard, but the preferred mechanism to declare
    variable-length types such as these ones is a flexible array member[1][2],
    introduced in C99:

    struct foo {
    int stuff;
    struct boo array[];
    };

    By making use of the mechanism above, we will get a compiler warning
    in case the flexible array does not occur last in the structure, which
    will help us prevent some kind of undefined behavior bugs from being
    inadvertently introduced[3] to the codebase from now on.

    Also, notice that, dynamic memory allocations won't be affected by
    this change:

    "Flexible array members have incomplete type, and so the sizeof operator
    may not be applied. As a quirk of the original implementation of
    zero-length arrays, sizeof evaluates to zero."[1]

    This issue was found with the help of Coccinelle.

    [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
    [2] https://github.com/KSPP/linux/issues/21
    [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: David S. Miller

    Gustavo A. R. Silva
     

28 Feb, 2020

1 commit


27 Feb, 2020

1 commit

  • list_for_each_entry_rcu() has built-in RCU and lock checking.

    Pass cond argument to list_for_each_entry_rcu() to silence
    false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled.

    The devlink->lock is held when devlink_dpipe_table_find()
    is called in non RCU read side section. Therefore, pass struct devlink
    to devlink_dpipe_table_find() for lockdep checking.

    Signed-off-by: Madhuparna Bhowmik
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Madhuparna Bhowmik
     

26 Feb, 2020

2 commits


25 Feb, 2020

1 commit


24 Feb, 2020

1 commit


19 Feb, 2020

1 commit


05 Feb, 2020

1 commit

  • commit fdd41ec21e15 ("devlink: Return right error code in case of errors
    for region read") modified the region read code to report errors
    properly in unexpected cases.

    In the case where the start_offset and ret_offset match, it unilaterally
    converted this into an error. This causes an issue for the "dump"
    version of the command. In this case, the devlink region dump will
    always report an invalid argument:

    000000000000ffd0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    000000000000ffe0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    devlink answers: Invalid argument
    000000000000fff0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

    This occurs because the expected flow for the dump is to return 0 after
    there is no further data.

    The simplest fix would be to stop converting the error code to -EINVAL
    if start_offset == ret_offset. However, avoid unnecessary work by
    checking for when start_offset is larger than the region size and
    returning 0 upfront.

    Fixes: fdd41ec21e15 ("devlink: Return right error code in case of errors for region read")
    Signed-off-by: Jacob Keller
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jacob Keller
     

25 Jan, 2020

1 commit

  • Devlink health recover notifications were added only on driver direct
    updates of health_state through devlink_health_reporter_state_update().
    Add notifications on updates of health_state by devlink flows of report
    and recover.

    Moved functions devlink_nl_health_reporter_fill() and
    devlink_recover_notify() to avoid forward declaration.

    Fixes: 97ff3bd37fac ("devlink: add devink notification when reporter update health state")
    Signed-off-by: Moshe Shemesh
    Signed-off-by: David S. Miller

    Moshe Shemesh
     

20 Jan, 2020

1 commit


19 Jan, 2020

3 commits

  • Add packet trap that can report NVE packets that the device decided to
    drop because their overlay source MAC is multicast.

    Signed-off-by: Amit Cohen
    Acked-by: Jiri Pirko
    Signed-off-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Amit Cohen
     
  • Add packet traps that can report packets that were dropped during tunnel
    decapsulation.

    Signed-off-by: Amit Cohen
    Acked-by: Jiri Pirko
    Signed-off-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Amit Cohen
     
  • Add packet trap that can report packets that reached the router, but are
    non-routable. For example, IGMP queries can be flooded by the device in
    layer 2 and reach the router. Such packets should not be routed and
    instead dropped.

    Signed-off-by: Amit Cohen
    Signed-off-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Amit Cohen
     

12 Jan, 2020

1 commit


11 Jan, 2020

1 commit

  • The commit cited below causes devlink to emit a warning if a type was
    not set on a devlink port for longer than 30 seconds to "prevent
    misbehavior of drivers". This proved to be problematic when
    unregistering the backing netdev. The flow is always:

    devlink_port_type_clear() // schedules the warning
    unregister_netdev() // blocking
    devlink_port_unregister() // cancels the warning

    The call to unregister_netdev() can block for long periods of time for
    various reasons: RTNL lock is contended, large amounts of configuration
    to unroll following dismantle of the netdev, etc. This results in
    devlink emitting a warning despite the driver behaving correctly.

    In emulated environments (of future hardware) which are usually very
    slow, the warning can also be emitted during port creation as more than
    30 seconds can pass between the time the devlink port is registered and
    when its type is set.

    In addition, syzbot has hit this warning [1] 1974 times since 07/11/19
    without being able to produce a reproducer. Probably because
    reproduction depends on the load or other bugs (e.g., RTNL not being
    released).

    To prevent bogus warnings, increase the timeout to 1 hour.

    [1] https://syzkaller.appspot.com/bug?id=e99b59e9c024a666c9f7450dc162a4b74d09d9cb

    Fixes: 136bf27fc0e9 ("devlink: add warning in case driver does not set port type")
    Signed-off-by: Ido Schimmel
    Reported-by: syzbot+b0a18ed7b08b735d2f41@syzkaller.appspotmail.com
    Reported-by: Alex Veber
    Tested-by: Alex Veber
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     

09 Jan, 2020

1 commit