29 Jul, 2020

24 commits

  • The pldmfw library is used to implement common logic needed to flash
    devices based on firmware files using the format described by the PLDM
    for Firmware Update standard.

    This library consists of logic to parse the PLDM file format from
    a firmware file object, as well as common logic for sending the relevant
    PLDM header data to the device firmware.

    A simple ops table is provided so that device drivers can implement
    device specific hardware interactions while keeping the common logic to
    the pldmfw library.
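
    The ops-table split described above can be sketched in plain C. This is
    a minimal userland illustration, not the pldmfw API: the struct, the
    callback names, and the demo driver are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ops table: names are illustrative, not the pldmfw API. */
struct fw_update_ops {
	/* Send a parsed header record to the device. */
	int (*send_header)(void *priv, const uint8_t *data, size_t len);
	/* Optional: tell the device that all headers have been sent. */
	int (*finalize)(void *priv);
};

/* Common logic: owns the flow and delegates hardware access to ops. */
static int fw_update_run(const struct fw_update_ops *ops, void *priv,
			 const uint8_t *hdr, size_t hdr_len)
{
	int err;

	assert(ops && ops->send_header);

	err = ops->send_header(priv, hdr, hdr_len);
	if (err)
		return err;

	/* finalize is optional in this sketch. */
	return ops->finalize ? ops->finalize(priv) : 0;
}

/* Example driver-side callback that only counts the bytes it was given. */
static int demo_send_header(void *priv, const uint8_t *data, size_t len)
{
	(void)data;
	*(size_t *)priv += len;
	return 0;
}

static const struct fw_update_ops demo_ops = {
	.send_header = demo_send_header,
	.finalize = NULL,
};
```

    A driver would pass its own table and private data to the common
    routine, so the parsing and sequencing code never touches hardware
    directly.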

    This library will be used by the Intel ice networking driver as part of
    implementing device flash update via devlink. The library aims to be
    vendor and device agnostic. For this reason, it has been placed in
    lib/pldmfw, in the hopes that other devices which use the PLDM firmware
    file format may benefit from it in the future. However, do note that not
    all features defined in the PLDM standard have been implemented.

    Signed-off-by: Jacob Keller
    Signed-off-by: David S. Miller

    Jacob Keller
     
  • Mat Martineau says:

    ====================
    mptcp: Exchange MPTCP DATA_FIN/DATA_ACK before TCP FIN

    This series allows the MPTCP-level connection to be closed with the
    peers exchanging DATA_FIN and DATA_ACK according to the state machine in
    appendix D of RFC 8684. The process is very similar to the TCP
    disconnect state machine.

    The prior code sends DATA_FIN only when TCP FIN packets are sent, and
    does not allow for the MPTCP-level connection to be half-closed.

    Patch 8 ("mptcp: Use full MPTCP-level disconnect state machine") is the
    core of the series. Earlier patches in the series have some small fixes
    and helpers in preparation, and the final four small patches do some
    cleanup.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The MPTCP socket's write_seq member can be read without the msk lock
    held, so use WRITE_ONCE() to store it.

    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Mat Martineau
     
  • The MPTCP socket's write_seq member should be read with READ_ONCE() when
    the msk lock is not held.
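
    A rough userland sketch of the pairing between this commit and the
    WRITE_ONCE() one. The macros below are simplified stand-ins for the
    kernel's (which carry additional checks) and rely on GNU C __typeof__;
    the struct and helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel macros (GNU C __typeof__). */
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical cut-down socket holding only the field of interest. */
struct demo_msk {
	uint64_t write_seq;
};

/* Writer side: the lock is held, but lockless readers may exist. */
static void demo_advance_write_seq(struct demo_msk *msk, uint64_t len)
{
	assert(msk);
	WRITE_ONCE(msk->write_seq, msk->write_seq + len);
}

/* Reader side: no lock held, so take exactly one snapshot. */
static uint64_t demo_snapshot_write_seq(const struct demo_msk *msk)
{
	assert(msk);
	return READ_ONCE(msk->write_seq);
}
```

    The volatile access keeps the compiler from splitting, merging, or
    repeating the load or store; it does not by itself order the access
    against other memory operations.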

    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Mat Martineau
     
  • Bare TCP ack skbs are freed right after MPTCP sees them, so the work to
    allocate, zero, and populate the MPTCP skb extension is wasted. Detect
    these skbs and do not add skb extensions to them.

    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Mat Martineau
     
  • The MPTCP state machine handles disconnections on non-fallback connections,
    but the mptcp_sock still needs to get notified when fallback subflows
    disconnect.

    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Mat Martineau
     
  • RFC 8684 appendix D describes the connection state machine for
    MPTCP. This patch implements the DATA_FIN / DATA_ACK exchanges and
    MPTCP-level socket state changes described in that appendix, rather than
    simply sending DATA_FIN along with TCP FIN when disconnecting subflows.

    DATA_FIN is now sent and acknowledged before shutting down the
    subflows. Received DATA_FIN information (if not part of a data packet)
    is written to the MPTCP socket when the incoming DSS option is parsed by
    the subflow, and the MPTCP worker is scheduled to process the
    flag. DATA_FIN received as part of a full DSS mapping will be handled
    when the mapping is processed.

    The DATA_FIN is acknowledged by the worker if the reader is caught
    up. If there is still data to be moved to the MPTCP-level queue, ack_seq
    will be incremented to account for the DATA_FIN when it reaches the end
    of the stream and a DATA_ACK will be sent to the peer.
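
    The receive-side rule in the last paragraph can be sketched as follows.
    This is an illustration under assumptions, not the patch's code: the
    struct and field names are hypothetical, and it relies on DATA_FIN
    occupying one data sequence number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical receive-side state; field names are illustrative. */
struct demo_rx {
	uint64_t ack_seq;          /* next expected data sequence number */
	bool     rcv_data_fin;     /* a DATA_FIN has been seen */
	uint64_t rcv_data_fin_seq; /* sequence number the DATA_FIN occupies */
};

/* Returns true when the DATA_FIN was consumed, meaning a DATA_ACK
 * should now be sent to the peer. */
static bool demo_check_data_fin(struct demo_rx *rx)
{
	assert(rx);

	if (!rx->rcv_data_fin)
		return false;

	/* Data is still pending ahead of the DATA_FIN: retry later. */
	if (rx->ack_seq != rx->rcv_data_fin_seq)
		return false;

	rx->ack_seq++;            /* DATA_FIN consumes one sequence number */
	rx->rcv_data_fin = false;
	return true;
}
```

    Calling the helper again after more data has been moved to the
    MPTCP-level queue covers the case where the reader was not yet caught up.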

    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Mat Martineau
     
  • After DATA_FIN has been sent, the peer will acknowledge it. An ack of
    the relevant MPTCP-level sequence number will update the MPTCP
    connection state appropriately.

    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Mat Martineau
     
  • This will be used to transition to the appropriate state on close and
    determine if a DATA_FIN needs to be sent for that state transition.
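
    A helper of this kind could look roughly like the sketch below, which
    follows the familiar TCP close transitions. The state names and the
    function are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state names mirroring the TCP-style states. */
enum demo_state {
	DEMO_ESTABLISHED,
	DEMO_FIN_WAIT1,
	DEMO_FIN_WAIT2,
	DEMO_CLOSE_WAIT,
	DEMO_LAST_ACK,
	DEMO_CLOSE,
};

/* Moves *state to its successor on close and returns true when that
 * transition requires sending a DATA_FIN. */
static bool demo_close_state(enum demo_state *state)
{
	assert(state);

	switch (*state) {
	case DEMO_ESTABLISHED:
		*state = DEMO_FIN_WAIT1;
		return true;
	case DEMO_CLOSE_WAIT:
		*state = DEMO_LAST_ACK;
		return true;
	case DEMO_FIN_WAIT1:
	case DEMO_FIN_WAIT2:
	case DEMO_LAST_ACK:
		/* Our DATA_FIN is already out; nothing more to send. */
		return false;
	default:
		*state = DEMO_CLOSE;
		return false;
	}
}
```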

    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Mat Martineau
     
  • Incoming DATA_FIN headers need to propagate the presence of the DATA_FIN
    bit and the associated sequence number to the MPTCP layer, even when
    arriving on a bare ACK that does not get added to the receive queue. Add
    structure members to store the DATA_FIN information and helpers to set
    and check those values.

    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Mat Martineau
     
  • Since DATA_FIN information is the same for every subflow, store it only
    in the mptcp_sock.

    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Mat Martineau
     
  • mptcp_close() acquires the msk lock, so it clearly should not be held
    before the function is called.

    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Mat Martineau
     
  • An MPTCP socket where sending has been shut down should not attempt to
    send additional data, since DATA_FIN has already been sent.

    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Mat Martineau
     
  • RFC 8684-compliant DATA_FIN needs to be sent and ack'd before subflows
    are closed with TCP FIN, so write DATA_FIN DSS headers whenever their
    transmission has been enabled by the MPTCP connection-level socket.

    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Mat Martineau
     
  • Christoph Hellwig says:

    ====================
    sockptr_t fixes v2

    a bunch of fixes for the sockptr_t conversion

    Changes since v1:
    - fix a user pointer dereference braino in bpfilter
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Make sure that not just the pointer itself but the whole range lies in
    the user address space. To do that, pass the length and use the
    access_ok helper to do the check.
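
    The difference between checking only the start address and checking the
    whole range can be illustrated in userland. The limit constant and both
    helpers below are hypothetical; the real check is done by access_ok.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical top of the user address range for this sketch. */
#define DEMO_USER_LIMIT UINT64_C(0x0000800000000000)

/* True when [addr, addr + len) lies entirely below the limit.
 * Written so that addr + len is never computed and cannot overflow. */
static bool demo_range_is_user(uint64_t addr, uint64_t len)
{
	if (len > DEMO_USER_LIMIT)
		return false;
	return addr <= DEMO_USER_LIMIT - len;
}

/* The weaker check that looks only at the start address. */
static bool demo_start_is_user(uint64_t addr)
{
	return addr < DEMO_USER_LIMIT;
}
```

    A pointer just below the limit passes the start-only check for any
    length, while the range check rejects it once the length crosses the
    boundary.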

    Fixes: 6d04fe15f78a ("net: optimize the sockptr_t for unified kernel/user address spaces")
    Reported-by: David Laight
    Signed-off-by: Christoph Hellwig
    Signed-off-by: David S. Miller

    Christoph Hellwig
     
  • sockptr_advance never worked properly. Replace it with _offset variants
    of copy_from_sockptr and copy_to_sockptr.
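
    The shape of the _offset approach can be sketched in userland: the
    offset is passed to the copy routine, and the caller's pointer value is
    left unchanged. The type and functions below are hypothetical stand-ins
    that use memcpy and do no user-memory checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Userland stand-in for a tagged kernel-or-user pointer. */
typedef struct {
	void *ptr;       /* kernel or user address */
	bool  is_kernel; /* which kind ptr holds */
} demo_sockptr_t;

/* Copy size bytes starting offset bytes into src. */
static int demo_copy_from_sockptr_offset(void *dst, demo_sockptr_t src,
					 size_t offset, size_t size)
{
	assert(dst && src.ptr);
	memcpy(dst, (const char *)src.ptr + offset, size);
	return 0;
}

/* Copy size bytes from src to offset bytes into dst. */
static int demo_copy_to_sockptr_offset(demo_sockptr_t dst, size_t offset,
				       const void *src, size_t size)
{
	assert(dst.ptr && src);
	memcpy((char *)dst.ptr + offset, src, size);
	return 0;
}
```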

    Fixes: ba423fdaa589 ("net: add a new sockptr_t type")
    Reported-by: Jason A. Donenfeld
    Reported-by: Ido Schimmel
    Signed-off-by: Christoph Hellwig
    Acked-by: Jason A. Donenfeld
    Tested-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Christoph Hellwig
     
  • While the kernel in general is not strict-aliasing safe, we can trivially
    be so in sockptr_is_null without affecting code generation, so always
    check the union member that was actually assigned.

    Reported-by: Jan Engelhardt
    Signed-off-by: Christoph Hellwig
    Signed-off-by: David S. Miller

    Christoph Hellwig
     
  • This was accidentally removed in an unrelated commit.

    Fixes: c2f12630c60f ("netfilter: switch nf_setsockopt to sockptr_t")
    Signed-off-by: Christoph Hellwig
    Signed-off-by: David S. Miller

    Christoph Hellwig
     
  • Ido Schimmel says:

    ====================
    mlxsw: Add support for QSFP-DD transceiver type

    This patch set from Vadim adds support for Quad Small Form Factor
    Pluggable Double Density (QSFP-DD) modules in mlxsw.

    Patch #1 enables dumping of QSFP-DD module information through ethtool.

    Patch #2 enables reading of temperature thresholds from QSFP-DD modules
    for hwmon and thermal zone purposes.

    Changes since v1 [1]:

    Only rebase on top of net-next. After discussing with Andrew and Adrian
    we agreed that current approach is OK and that in the future we can
    follow Andrew's suggestion to "make a new API where user space can
    request any pages it want, and specify the size of the page". This
    should allow us "to work around known issues when manufactures get their
    EEPROM wrong".

    [1] https://lore.kernel.org/netdev/20200626144724.224372-1-idosch@idosch.org/#t
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Allow reading of QSFP-DD transceiver temperature thresholds for hardware
    monitoring and thermal control.

    For this type, the thresholds are located in page 02h according to the
    "Module and Lane Thresholds" description in the Common Management
    Interface Specification.

    Signed-off-by: Vadim Pasternak
    Signed-off-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Vadim Pasternak
     
  • The Quad Small Form Factor Pluggable Double Density (QSFP-DD) hardware
    specification defines a form factor that supports up to 400 Gbps in
    aggregate over an 8x50-Gbps electrical interface. The QSFP-DD supports
    both optical and copper interfaces.

    Implementation is based on Common Management Interface Specification;
    Rev 4.0 May 8, 2019. Table 8-2 "Identifier and Status Summary (Lower
    Page)" from this spec defines "Id and Status" fields located at offsets
    00h - 02h. Bit 2 at offset 02h ("Flat_mem") specifies QSFP EEPROM memory
    mode, which could be "upper memory flat" or "paged". Flat memory mode is
    coded "1", and indicates that only page 00h is implemented in EEPROM.
    Paged memory is coded "0" and indicates that pages 00h, 01h, 02h, 10h
    and 11h are implemented. Pages 10h and 11h are currently not supported
    by the driver.

    "Flat" memory mode is used for the passive copper transceivers. For this
    type only page 00h (256 bytes) is available. "Paged" memory is used for
    the optical transceivers. For this type pages 00h (256 bytes), 01h (128
    bytes) and 02h (128 bytes) are available. Upper page 01h contains static
    advertising field, while upper page 02h contains the module-defined
    thresholds and lane-specific monitors.

    Extend enumerator 'mlxsw_reg_mcia_eeprom_module_info_id' with the
    additional field 'MLXSW_REG_MCIA_EEPROM_MODULE_INFO_TYPE_ID'. For the
    QSFP-DD transceiver type, this field indicates which memory mode is used.

    Expose a 256-byte buffer for QSFP-DD passive copper transceivers and
    a 512-byte buffer for optical ones.
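
    The buffer-size decision follows directly from the Flat_mem bit. A
    minimal sketch, taking the bit position and page sizes exactly as stated
    in this message; the macro and function names are hypothetical and are
    not the mlxsw ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CMIS_FLAT_MEM_OFFSET 0x02      /* byte holding Flat_mem */
#define DEMO_CMIS_FLAT_MEM_BIT    (1u << 2) /* bit 2, per the text above */
#define DEMO_QSFP_DD_LEN_FLAT     256       /* page 00h only */
#define DEMO_QSFP_DD_LEN_PAGED    512       /* 00h (256) + 01h (128) + 02h (128) */

/* lower_page must hold at least the first three bytes of page 00h. */
static size_t demo_qsfp_dd_eeprom_len(const uint8_t *lower_page)
{
	assert(lower_page);

	/* Flat_mem coded "1": only page 00h is implemented. */
	if (lower_page[DEMO_CMIS_FLAT_MEM_OFFSET] & DEMO_CMIS_FLAT_MEM_BIT)
		return DEMO_QSFP_DD_LEN_FLAT;

	/* Flat_mem coded "0": paged memory. */
	return DEMO_QSFP_DD_LEN_PAGED;
}
```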

    Signed-off-by: Vadim Pasternak
    Signed-off-by: Ido Schimmel
    Reviewed-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Vadim Pasternak
     
  • Saeed Mahameed says:

    ====================
    mlx5-updates-2020-07-28

    Misc and small updates to the mlx5 driver:

    1) Aya adds PCIe relaxed ordering support for mlx5 netdev queues.
    2) Eran refactors the pages data base to be per VF/function to speed up
    unload time.
    3) Parav changes eswitch steering initialization to account for
    total_vports rather than only active vports, and links non-uplink
    representors to the PCI device for a uniform naming scheme.

    4) Tariq makes trivial RX code improvements and adds missing indirect
    call wrappers.

    5) Small cleanup patches.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The .suspend() and .resume() callbacks are not defined for this driver.
    Still, their power management structure follows the legacy framework. To
    bring it under the generic framework, simply remove the binding of
    callbacks from "struct pci_driver".

    Change code indentation from space to tab in "struct pci_driver".

    Signed-off-by: Vaibhav Gupta
    Signed-off-by: David S. Miller

    Vaibhav Gupta
     

28 Jul, 2020

16 commits

  • list_for_each_entry is able to handle an empty list.
    The only effect of avoiding the loop is not initializing the
    index variable.
    Drop list_empty tests in cases where these variables are not
    used.

    Note that list_for_each_entry is defined in terms of list_first_entry,
    which indicates that it should not be used on an empty list. But in
    list_for_each_entry, the element obtained by list_first_entry is not
    really accessed, only the address of its list_head field is compared
    to the address of the list head, so the list_first_entry is safe.
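
    The argument above can be checked with a small userland re-creation of
    the list macros. The macros are simplified copies written for this
    sketch (GNU C __typeof__), and the demo_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_first_entry(ptr, type, member) \
	container_of((ptr)->next, type, member)
#define list_next_entry(pos, member) \
	container_of((pos)->member.next, __typeof__(*(pos)), member)
#define list_for_each_entry(pos, head, member)				\
	for (pos = list_first_entry(head, __typeof__(*pos), member);	\
	     &pos->member != (head);					\
	     pos = list_next_entry(pos, member))

struct demo_item {
	int value;
	struct list_head node;
};

static void demo_list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void demo_list_add_tail(struct list_head *new, struct list_head *head)
{
	new->prev = head->prev;
	new->next = head;
	head->prev->next = new;
	head->prev = new;
}

/* No list_empty() guard: on an empty list the body never runs,
 * because &it->node already equals head on the first comparison. */
static int demo_sum(struct list_head *head)
{
	struct demo_item *it;
	int sum = 0;

	assert(head);
	list_for_each_entry(it, head, node)
		sum += it->value;
	return sum;
}
```

    After the loop on an empty list, the iterator holds a computed address
    that is not a real entry, which is why the guard can only be dropped
    when the iterator is not used afterwards.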

    The semantic patch that makes this change is as follows (with another
    variant for the no brace case): (http://coccinelle.lip6.fr/)

    @@
    expression x,e;
    iterator name list_for_each_entry;
    statement S;
    identifier i;
    @@

    -if (!(list_empty(x))) {
    list_for_each_entry(i,x,...) S
    - }
    ... when != i
    ? i = e

    Signed-off-by: Julia Lawall
    Signed-off-by: Saeed Mahameed

    Julia Lawall
     
  • Replace the existing /* fall through */ comments and their variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
    fall-through markings where applicable.
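
    For readers unfamiliar with the macro, a userland approximation is shown
    below. The kernel provides fallthrough itself; the definition here is
    re-created for the sketch, and the example function is hypothetical.

```c
#include <assert.h>

/* Userland approximation; kernel code gets this macro from its headers. */
#if defined(__has_attribute)
# if __has_attribute(__fallthrough__)
#  define fallthrough __attribute__((__fallthrough__))
# endif
#endif
#ifndef fallthrough
# define fallthrough do {} while (0) /* fallthrough */
#endif

/* Hypothetical example: each level also grants the levels below it. */
static int demo_permissions(int level)
{
	int perms = 0;

	switch (level) {
	case 2:
		perms |= 4;
		fallthrough;
	case 1:
		perms |= 2;
		fallthrough;
	case 0:
		perms |= 1;
		break;
	default:
		break;
	}
	return perms;
}
```

    Unlike a comment, the statement form is visible to the compiler, so an
    unmarked fall-through can be reported as a warning.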

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Saeed Mahameed

    Gustavo A. R. Silva
     
  • There is no need to print on each unsuccessful matcher
    ip_version combination since it probably will happen when
    trying to create all the possible combinations.
    On a real failure we have a print in the calling function.

    Signed-off-by: Alex Vesker
    Signed-off-by: Saeed Mahameed

    Alex Vesker
     
  • The concept of Relaxed Ordering in the PCI Express environment allows
    switches in the path between the Requester and Completer to reorder some
    transactions just received before others that were previously enqueued.

    In the ETH driver, there is no question of write integrity since each
    memory segment is written only once per cycle. In addition, the driver
    doesn't access the memory shared with the hardware until the
    corresponding CQE arrives indicating all PCI transactions are done.

    A single TCP stream over ConnectX-4 LX, with an ARM CPU on a remote NUMA
    node, shows a 300% improvement in bandwidth.

    With relaxed ordering turned off: BW:10 [GB/s]
    With relaxed ordering turned on: BW:40 [GB/s]

    The driver turns relaxed ordering on according to the firmware
    capabilities and the return value of pcie_relaxed_ordering_enabled().

    Signed-off-by: Aya Levin
    Signed-off-by: Saeed Mahameed

    Aya Levin
     
  • Use the indirect call wrapper API macros for declaration and scope
    of the RX post WQEs functions.

    Signed-off-by: Tariq Toukan
    Reviewed-by: Maxim Mikityanskiy
    Signed-off-by: Saeed Mahameed

    Tariq Toukan
     
  • Move them from the generic header file "en.h", to the
    datapath header file "txrx.h".

    Signed-off-by: Tariq Toukan
    Reviewed-by: Maxim Mikityanskiy
    Signed-off-by: Saeed Mahameed

    Tariq Toukan
     
  • Instead of exposing the RQ datapath handlers (from en_rx.c) so that
    they are set in the control path (in en_main.c), wrap this logic
    in a single function in en_rx.c and expose it alone.

    Every profile will now have a pointer to the new mlx5e_rx_handlers
    structure, instead of directly pointing to the previously-exposed
    RQ handlers.

    This significantly improves locality and modularity of the driver,
    and allows many functions in en_rx.c to become static.

    Signed-off-by: Tariq Toukan
    Reviewed-by: Maxim Mikityanskiy
    Signed-off-by: Saeed Mahameed

    Tariq Toukan
     
  • Currently PF and VF representors are exposed as virtual devices.
    They are not linked to their parent PCI device the way the uplink
    representor is.
    Because of this, PF and VF representors cannot benefit from the
    systemd-defined naming scheme, which requires special handling
    by users.

    Hence, link the PF and VF representors to their parent PCI device,
    as is already done for the uplink representor netdevice.

    Example:
    udevadm output before linking to PCI device:
    $ udevadm test-builtin net_id /sys/class/net/eth6
    Load module index
    Network interface NamePolicy= disabled on kernel command line, ignoring.
    Parsed configuration file /usr/lib/systemd/network/99-default.link
    Created link configuration context.
    Using default interface naming scheme 'v243'.
    ID_NET_NAMING_SCHEME=v243
    Unload module index
    Unloaded link configuration context.

    udevadm output after linking to PCI device:
    $ udevadm test-builtin net_id /sys/class/net/eth6
    Load module index
    Network interface NamePolicy= disabled on kernel command line, ignoring.
    Parsed configuration file /usr/lib/systemd/network/99-default.link
    Created link configuration context.
    Using default interface naming scheme 'v243'.
    ID_NET_NAMING_SCHEME=v243
    ID_NET_NAME_PATH=enp0s8f0npf0vf0
    Unload module index
    Unloaded link configuration context.

    The concern raised in the past in thread [1] about seeing 10,000 lines
    of output is not applicable, as ndo ops for VF handling are not exposed
    for all the 100 representors for mlx5 devices.

    Additionally, alternative device naming [2], which overcomes the short
    device name limit, is also part of the latest systemd release v245.

    [1] https://marc.info/?l=linux-netdev&m=152657949117904&w=2
    [2] https://lwn.net/Articles/814068/

    Signed-off-by: Parav Pandit
    Reviewed-by: Roi Dayan
    Acked-by: Jiri Pirko
    Signed-off-by: Saeed Mahameed

    Parav Pandit
     
  • Currently the steering table and rx group initialization helper
    routines work on the total_vports passed as an input parameter.

    Both eswitch helpers work on the mlx5_eswitch and thereby have access
    to esw->total_vports. Hence use it directly instead of passing it
    via function input arguments.

    Signed-off-by: Parav Pandit
    Reviewed-by: Roi Dayan
    Reviewed-by: Bodong Wang
    Signed-off-by: Saeed Mahameed

    Parav Pandit
     
  • Total e-switch vports are already stored in mlx5_eswitch total_vports.
    Avoid keeping a copy in nvports and reuse the existing total_vports
    calculation.

    Signed-off-by: Parav Pandit
    Reviewed-by: Roi Dayan
    Reviewed-by: Bodong Wang
    Signed-off-by: Saeed Mahameed

    Parav Pandit
     
  • When eswitch is enabled, VFs might not be enabled. Hence, consider the
    maximum number of VFs.
    This further closes the gap in VF vport handling between ECPF and PF.

    Fixes: ea2128fd632c ("net/mlx5: E-switch, Reduce dependency on num_vfs during mode set")
    Signed-off-by: Parav Pandit
    Reviewed-by: Roi Dayan
    Reviewed-by: Bodong Wang
    Signed-off-by: Saeed Mahameed

    Parav Pandit
     
  • Add function ID to reclaim pages debug log for better user visibility.

    Signed-off-by: Avihu Hagag
    Reviewed-by: Eran Ben Elisha
    Signed-off-by: Saeed Mahameed

    Avihu Hagag
     
  • Per page request event, FW requests to allocate or release pages for a
    single function. The driver maintains a FW pages object per function, so
    there is no need to hold one global page data-base. Instead, have a page
    data-base per function, which improves the performance of the release
    flow in all cases, especially for "release all pages".

    As the range of function IDs is large and not sequential, use xarray to
    store a per function ID page data-base, where the function ID is the key.

    Upon first allocation of a page to a function ID, create the page
    data-base per function. This data-base will be released only at pagealloc
    mechanism cleanup.

    NIC: ConnectX-4 Lx
    CPU: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
    Test case: 32 VFs, measure release pages on one VF as part of FLR
    Before: 0.021 Sec
    After: 0.014 Sec

    The improvement depends on the number of VFs and their memory
    utilization. The time measurements above were taken on an idle system.

    Signed-off-by: Eran Ben Elisha
    Reviewed-by: Mark Bloch
    Signed-off-by: Saeed Mahameed

    Eran Ben Elisha
     
  • Replace the existing /* fall through */ comments and their variants with
    the new pseudo-keyword macro fallthrough[1].

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: David S. Miller

    Gustavo A. R. Silva
     
  • Tony Nguyen says:

    ====================
    1GbE Intel Wired LAN Driver Updates 2020-07-27

    This series contains updates to igc driver only.

    Sasha cleans up double definitions and unneeded or non-applicable
    registers, removes unused fields in structs, ensures the Receive
    Descriptor Minimum Threshold Count is cleared, and fixes a static
    checker error.

    v2: Remove fields from hw_stats in patches that removed their uses.
    Reworded patch descriptions for patches 1, 2, and 4.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Currently n_rq_elems is being assigned to params.elem_size instead of the
    field params.num_elems. Coverity is detecting this as a double assignment
    to params.elem_size and reporting an unused value on the first
    assignment. Fix this.
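
    The shape of the fix can be shown with a cut-down struct. Only the two
    field names come from this message; the struct, the function, and the
    values are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cut-down init params; only the two fields named in
 * the commit message are modeled. */
struct demo_chain_params {
	size_t num_elems;
	size_t elem_size;
};

/* Fixed form: each value goes to its own field. The buggy form stored
 * n_rq_elems in elem_size, then overwrote it with the real element
 * size, leaving num_elems unset. */
static struct demo_chain_params demo_init_params(size_t n_rq_elems,
						 size_t elem_size)
{
	struct demo_chain_params params = {
		.num_elems = n_rq_elems,
		.elem_size = elem_size,
	};

	assert(elem_size != 0);
	return params;
}
```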

    Addresses-Coverity: ("Unused value")
    Fixes: b6db3f71c976 ("qed: simplify chain allocation with init params struct")
    Signed-off-by: Colin Ian King
    Acked-by: Alexander Lobakin
    Signed-off-by: David S. Miller

    Colin Ian King