23 Feb, 2019

1 commit


20 Feb, 2019

8 commits

  • Add support for new LINK messages to allow adding and deleting rdma
    interfaces. This will be used initially for soft rdma drivers which
    instantiate device instances dynamically by the admin specifying a netdev
    device to use. The rdma_rxe module will be the first user of these
    messages.

    The design is modeled after RTNL_NEWLINK/DELLINK: rdma drivers register
    with the rdma core if they provide link add/delete functions. Each driver
    registers with a unique "type" string that is used to dispatch messages
    coming from user space. A new RDMA_NLDEV_ATTR is defined for the "type"
    string. User mode will pass 3 attributes in a NEWLINK message:
    RDMA_NLDEV_ATTR_DEV_NAME for the desired rdma device name to be created,
    RDMA_NLDEV_ATTR_LINK_TYPE for the "type" of link being added, and
    RDMA_NLDEV_ATTR_NDEV_NAME for the net_device interface to use for this
    link. The DELLINK message will contain the RDMA_NLDEV_ATTR_DEV_INDEX of
    the device to delete.

    Signed-off-by: Steve Wise
    Reviewed-by: Leon Romanovsky
    Reviewed-by: Michael J. Ruhl
    Signed-off-by: Jason Gunthorpe

    Steve Wise
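
    As a concrete sketch of the registration described in the entry above
    (rdma_link_ops, rdma_link_register() and the rxe_newlink signature are my
    reading of the API this series adds; treat the details as assumptions, not
    a definitive reference):

        /* Hypothetical rxe-side hookup for the new LINK messages */
        static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
        {
                /* allocate an ib_device named ibdev_name, bind it to ndev,
                 * and register it with the core
                 */
                return 0;
        }

        static struct rdma_link_ops rxe_link_ops = {
                .type    = "rxe",        /* matched against RDMA_NLDEV_ATTR_LINK_TYPE */
                .newlink = rxe_newlink,  /* invoked for a NEWLINK request */
        };

        /* in module init: rdma_link_register(&rxe_link_ops); */

    User space would then drive it with something like
    'rdma link add rxe0 type rxe netdev eth0' (iproute2 syntax, assuming
    matching rdma tool support for these messages).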
     
  • Since rxe allows unregistration from other threads, the rxe pointer can
    become invalid at any moment after ib_register_driver returns. This could
    cause a user-triggered use-after-free.

    Add another driver callback to be called right after the device becomes
    registered to complete any device setup required post-registration. This
    callback has enough core locking to prevent the device from becoming
    unregistered.

    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
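
    A sketch of what such a callback looks like on the driver side; the slot
    name ib_device_ops.enable_driver and its signature are assumptions for
    illustration, not a confirmed reference:

        static int rxe_enable_driver(struct ib_device *ib_dev)
        {
                /* finish post-registration setup while the core still holds
                 * enough locking to keep the device from unregistering
                 */
                return 0;
        }

        static const struct ib_device_ops rxe_dev_ops = {
                .enable_driver = rxe_enable_driver,
                /* ... other ops ... */
        };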
     
  • These APIs are intended to support drivers that exist outside the usual
    driver core probe()/remove() callbacks. Normally the driver core will
    prevent remove() from running concurrently with probe(); once this safety
    is lost, drivers need more support to get the locking and lifetimes right.

    ib_unregister_driver() is intended to be used during module_exit of a
    driver using these APIs. It unregisters all the associated ib_devices.

    ib_unregister_device_and_put() is to be used by a driver-specific removal
    function (i.e. removal by name, removal from a netdev notifier, or removal
    from netlink).

    ib_unregister_queued() is to be used from netdev notifier chains where
    RTNL is held.

    The locking is tricky here since once things become async it is possible
    to race unregister with registration. This is largely solved by relying on
    the registration refcount, unregistration will only ever work on something
    that has a positive registration refcount - and then an unregistration
    mutex serializes all competing unregistrations of the same device.

    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
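
    A hedged sketch of the module_exit flow this enables, continuing the
    hypothetical rxe example above (RDMA_DRIVER_RXE and rdma_link_unregister()
    are assumptions used for illustration):

        static void __exit rxe_module_exit(void)
        {
                /* stop accepting new link add requests first ... */
                rdma_link_unregister(&rxe_link_ops);
                /* ... then tear down every ib_device this driver registered */
                ib_unregister_driver(RDMA_DRIVER_RXE);
        }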
     
  • Several drivers need to find the ib_device from a given netdev. rxe needs
    this at speed in an unsleepable context, so the translation is implemented
    using an RCU-safe hash table.

    The hash table can have a many-to-one mapping. This is intended to support
    some future case where multiple IB drivers (i.e. iWarp and RoCE) connect to
    the same netdevs. driver_ids will need to be different to support this.

    In the process this makes the struct ib_device and ib_port_data RCU safe
    by deferring their kfrees.

    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
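
    A sketch of the lookup this makes possible; ib_device_get_by_netdev(),
    ib_device_put() and the RDMA_DRIVER_UNKNOWN wildcard are my reading of the
    API added here, so treat the exact signature as an assumption:

        static struct ib_device *find_ibdev(struct net_device *ndev)
        {
                struct ib_device *ibdev;

                /* RCU-based lookup, safe from an unsleepable context; a
                 * successful lookup returns a reference that the caller must
                 * later drop with ib_device_put()
                 */
                ibdev = ib_device_get_by_netdev(ndev, RDMA_DRIVER_UNKNOWN);
                return ibdev;
        }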
     
  • The associated netdev should not actually be very dynamic, so for most
    drivers there is no reason for a callback like this. Provide an API to
    inform the core code about the netdev affiliation and use a core-maintained
    data structure instead.

    This allows the core code to be more aware of the ndev relationship which
    will allow some new APIs based around this.

    This also uses locking that makes some kind of sense; many drivers had a
    confusing RCU lock, or missing locking, which isn't right.

    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
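
    A sketch of the driver-side call; ib_device_set_netdev() with
    (device, netdev, port) parameters is an assumption based on this
    description, and mydrv_* is a hypothetical driver:

        static int mydrv_bind_port(struct ib_device *ib_dev,
                                   struct net_device *ndev)
        {
                /* tell the core which netdev backs port 1; the core now owns
                 * the relationship and its locking
                 */
                return ib_device_set_netdev(ib_dev, ndev, 1);
        }

    Teardown would pass a NULL netdev for the same port.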
     
  • Like the other cases, there is no real reason to have another array just
    for the cache. This larger conversion gets its own patch.

    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • There is no reason to have three allocations of per-port data. Combine
    them together and make the lifetime for all the per-port data match the
    struct ib_device.

    Following patches will require more port-specific data; now there is a
    good place to put it.

    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • We have many loops iterating over all of the end port numbers on a struct
    ib_device, simplify them with a for_each helper.

    Reviewed-by: Parav Pandit
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
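
    Usage sketch, assuming the helper is the rdma_for_each_port() macro with
    an unsigned int iterator (mydrv_scan_ports is hypothetical):

        static void mydrv_scan_ports(struct ib_device *ib_dev)
        {
                unsigned int port;

                /* iterates over the device's valid end port numbers,
                 * replacing open-coded first-port..end-port loops
                 */
                rdma_for_each_port(ib_dev, port) {
                        /* per-port work */
                }
        }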
     

19 Feb, 2019

4 commits


16 Feb, 2019

3 commits


13 Feb, 2019

1 commit


10 Feb, 2019

1 commit

  • Due to concurrent work by myself and Jason, a normal fast-forward merge
    was not possible. This brings in a number of hfi1 changes, mainly the
    hfi1 TID RDMA support (roughly 10,000 LOC change), which was reviewed
    and integrated over a period of days.

    Signed-off-by: Doug Ledford

    Doug Ledford
     

09 Feb, 2019

10 commits

  • The locking here started out with a single lock that covered everything
    and then has lately veered into crazy town.

    The fundamental problem is that several places need to iterate over a
    linked list, but also need to drop their locks to avoid deadlock during
    client callbacks.

    xarray's restartable iteration offers a simple solution to the
    problem. Once all the lists are xarrays we can drop locks in the places
    that need that and rely on xarray to provide consistency and locking for
    the data structure.

    The resulting simplification is that each of the three lists has a
    dedicated rwsem that must be held when working with the list it
    covers. One data structure is no longer covered by multiple locks.

    The sleeping semaphore is selected because the read side generally needs
    to be held over something sleeping, and using RCU reader locking in those
    cases is overkill.

    In the process this simplifies the entire registration/unregistration flow
    to be the expected list of setups and the reversed list of matching
    teardowns, and the registration lock 'refcount' can now be revised to be
    released after the ULPs are removed, providing a very sane semantic for
    this feature.

    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
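
    A conceptual sketch of the restartable-iteration pattern described above,
    not the actual core code (the clients xarray, clients_rwsem, and the bare
    ->add() call are simplifications):

        static DECLARE_RWSEM(clients_rwsem);
        static DEFINE_XARRAY_FLAGS(clients, XA_FLAGS_ALLOC);

        static void notify_all_clients(struct ib_device *dev)
        {
                struct ib_client *client;
                unsigned long index;

                down_read(&clients_rwsem);
                xa_for_each(&clients, index, client) {
                        /* xa_for_each() restarts from the last index, so the
                         * lock can be dropped around the sleeping callback;
                         * the real code additionally pins the current entry
                         * so it cannot vanish while unlocked
                         */
                        up_read(&clients_rwsem);
                        if (client->add)
                                client->add(dev);
                        down_read(&clients_rwsem);
                }
                up_read(&clients_rwsem);
        }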
     
  • Now that we have a small ID for each client we can use xarray instead of
    linearly searching linked lists for client data. This will give much
    faster and more scalable client data lookup, and will let us revise the
    locking scheme.

    Since xarray can store the 'going_down' state using a mark, just entirely
    eliminate struct ib_client_data and directly store the client_data value in
    the xarray. However, this does require a special iterator, as we must still
    iterate over any NULL client_data values.

    Also eliminate the client_data_lock in favour of internal xarray locking.

    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
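
    A sketch of the mark idea using the generic xarray calls; xa_store() and
    xa_set_mark() are real xarray APIs, while the mark name and the standalone
    xarray are illustrative assumptions rather than the core's actual layout:

        #define CLIENT_DATA_REGISTERED XA_MARK_1   /* name assumed for illustration */

        static DEFINE_XARRAY(client_data);         /* stand-in for the per-device xarray */

        static int set_client_data(u32 client_id, void *data)
        {
                void *old;

                /* the client_data value itself lives at the client's ID ... */
                old = xa_store(&client_data, client_id, data, GFP_KERNEL);
                if (xa_is_err(old))
                        return xa_err(old);

                /* ... and an xarray mark replaces the old 'going_down' flag */
                xa_set_mark(&client_data, client_id, CLIENT_DATA_REGISTERED);
                return 0;
        }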
     
  • This gives each client a unique ID and will let us move client_data to use
    xarray, and revise the locking scheme.

    Clients have to be added/removed in strict FIFO/LIFO order as they
    interdepend. To support this, the client_ids are assigned to increase in
    FIFO order. The existing linked list is kept to support reverse iteration
    until xarray can get a reverse iteration API.

    Signed-off-by: Jason Gunthorpe
    Reviewed-by: Parav Pandit

    Jason Gunthorpe
     
  • This really has no purpose anymore; the refcount can be used to tell if
    the device is still registered. Keeping it around just invites misuse.

    Signed-off-by: Jason Gunthorpe
    Reviewed-by: Parav Pandit

    Jason Gunthorpe
     
  • Allocating PDs in IB/core allows us to simplify drivers and their error
    flows in their .alloc_pd() paths. The changes in .alloc_pd() go hand in
    hand with the relevant update in .dealloc_pd().

    We use this opportunity to convert .dealloc_pd() so that it cannot fail,
    as was suggested a long time ago; such failures do not happen in practice,
    as we have never seen a WARN_ON print.

    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe

    Leon Romanovsky
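
    Roughly what a converted driver looks like (mydrv_* is a hypothetical
    driver and the callback signatures are an approximation for this point in
    time; they continued to change later):

        struct mydrv_pd {
                struct ib_pd ibpd;      /* embedded core object, allocated by IB/core */
                u32 pdn;
        };

        static int mydrv_alloc_pd(struct ib_pd *ibpd, struct ib_ucontext *context,
                                  struct ib_udata *udata)
        {
                struct mydrv_pd *pd = container_of(ibpd, struct mydrv_pd, ibpd);

                pd->pdn = 0;            /* driver-private init only; no kzalloc here */
                return 0;
        }

        static void mydrv_dealloc_pd(struct ib_pd *ibpd)
        {
                /* void return: deallocation is no longer allowed to fail */
        }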
     
  • Add new macros to be used by drivers when registering their ops structure
    and by IB/core when calling the allocation routines, so that drivers won't
    need to perform kzalloc/kfree in their paths.

    The change in the allocation stage allows us to initialize common fields
    prior to calling into the drivers (e.g. restrack).

    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe

    Leon Romanovsky
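
    Continuing the hypothetical mydrv example, a sketch of how the new macros
    plug together (INIT_RDMA_OBJ_SIZE() and rdma_zalloc_drv_obj() are my
    reading of the macros added here; treat the exact spelling as an
    assumption):

        static const struct ib_device_ops mydrv_dev_ops = {
                .alloc_pd   = mydrv_alloc_pd,
                .dealloc_pd = mydrv_dealloc_pd,
                /* tells the core how large the driver's PD really is */
                INIT_RDMA_OBJ_SIZE(ib_pd, mydrv_pd, ibpd),
        };

        /* core side (conceptually): allocate the driver-sized object and
         * initialize common fields such as restrack before calling the
         * driver's .alloc_pd():
         *
         *      pd = rdma_zalloc_drv_obj(device, ib_pd);
         */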
     
  • When creating many MAD agents in a short period of time, receive packet
    processing can be delayed long enough to cause timeouts while new agents
    are being added to the atomic notifier chain with IRQs disabled. Notifier
    chain registration and unregistration is an O(n) operation. With large
    numbers of MAD agents being created and destroyed simultaneously the CPUs
    spend too much time with interrupts disabled.

    Instead of each MAD agent registering for its own LSM notification,
    maintain a list of agents internally and register once; this registration
    already existed for handling the PKeys. This list is write-mostly, so a
    normal spin lock is used rather than a read/write lock. All MAD agents must be
    checked, so a single list is used instead of breaking them down per
    device.

    Notifier calls are done under rcu_read_lock, so there isn't a risk of
    similar packet timeouts while checking the MAD agents' security settings
    when notified.

    Signed-off-by: Daniel Jurgens
    Reviewed-by: Parav Pandit
    Signed-off-by: Leon Romanovsky
    Acked-by: Paul Moore
    Signed-off-by: Jason Gunthorpe

    Daniel Jurgens
     
  • Move the security related fields above the u8s to eliminate a hole in the
    struct.

    pahole before:
    struct ib_mad_agent {
            ...
            u32                    hi_tid;          /*  48   4 */
            u32                    flags;           /*  52   4 */
            u8                     port_num;        /*  56   1 */
            u8                     rmpp_version;    /*  57   1 */

            /* XXX 6 bytes hole, try to pack */

            /* --- cacheline 1 boundary (64 bytes) --- */
            void *                 security;        /*  64   8 */
            bool                   smp_allowed;     /*  72   1 */
            bool                   lsm_nb_reg;      /*  73   1 */

            /* XXX 6 bytes hole, try to pack */

            struct notifier_block  lsm_nb;          /*  80  24 */

            /* XXX last struct has 4 bytes of padding */

            /* size: 104, cachelines: 2, members: 14 */
            ...
    };

    pahole after:
    struct ib_mad_agent {
            ...
            u32                    hi_tid;          /*  48   4 */
            u32                    flags;           /*  52   4 */
            void *                 security;        /*  56   8 */
            /* --- cacheline 1 boundary (64 bytes) --- */
            struct notifier_block  lsm_nb;          /*  64  24 */

            /* XXX last struct has 4 bytes of padding */

            u8                     port_num;        /*  88   1 */
            u8                     rmpp_version;    /*  89   1 */
            bool                   smp_allowed;     /*  90   1 */
            bool                   lsm_nb_reg;      /*  91   1 */

            /* size: 96, cachelines: 2, members: 14 */
            ...
    };

    Signed-off-by: Daniel Jurgens
    Reviewed-by: Parav Pandit
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe

    Daniel Jurgens
     
  • This allows drivers to know the tos was actively set by the application.

    Signed-off-by: Steve Wise
    Signed-off-by: Jason Gunthorpe

    Steve Wise
     
  • Define a new option in 'rdma_set_option' to override the calculated QP
    timeout when QP attributes are requested to modify a QP.

    At the same time, pack tos_set into a bitfield.

    Signed-off-by: Danit Goldberg
    Reviewed-by: Moni Shoua
    Signed-off-by: Leon Romanovsky
    Reviewed-by: Parav Pandit
    Signed-off-by: Jason Gunthorpe

    Danit Goldberg
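
    From user space this would be driven through librdmacm's rdma_set_option();
    a sketch assuming the new option is exposed as RDMA_OPTION_ID_ACK_TIMEOUT:

        #include <rdma/rdma_cma.h>

        /* the timeout uses the usual IB encoding of 4.096 us * 2^timeout */
        static int set_qp_ack_timeout(struct rdma_cm_id *id, uint8_t timeout)
        {
                return rdma_set_option(id, RDMA_OPTION_ID,
                                       RDMA_OPTION_ID_ACK_TIMEOUT,
                                       &timeout, sizeof(timeout));
        }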
     

06 Feb, 2019

8 commits

  • This patch integrates TID RDMA WRITE protocol into normal RDMA verbs
    framework. The TID RDMA WRITE protocol is an end-to-end protocol
    between the hfi1 drivers on two OPA nodes that converts a qualified
    RDMA WRITE request into a TID RDMA WRITE request to avoid data copying
    on the responder side.

    Reviewed-by: Mike Marciniszyn
    Signed-off-by: Mitko Haralanov
    Signed-off-by: Kaike Wan
    Signed-off-by: Dennis Dalessandro
    Signed-off-by: Doug Ledford

    Kaike Wan
     
  • The s_ack_queue is managed by two pointers into the ring:
    r_head_ack_queue and s_tail_ack_queue. r_head_ack_queue is the index of
    where the next received request is going to be placed and s_tail_ack_queue
    is the entry of the request currently being processed. This works
    perfectly fine for normal Verbs as the requests are processed one at a
    time and the s_tail_ack_queue is not moved until the request that it
    points to is fully completed.

    In this fashion, s_tail_ack_queue constantly chases r_head_ack_queue and
    the two pointers can easily be used to determine "queue full" and "queue
    empty" conditions.

    The detection of these two conditions is important in determining when an
    old entry can safely be overwritten with a newly received request and the
    resources associated with the old request safely released.

    When pipelined TID RDMA WRITE is introduced into this mix, things look
    very different. r_head_ack_queue is still the point at which a newly
    received request will be inserted, s_tail_ack_queue is still the
    currently processed request. However, with pipelined TID RDMA WRITE
    requests, s_tail_ack_queue moves to the next request once all TID RDMA
    WRITE responses for that request have been sent. The rest of the protocol
    for a particular request is managed by other pointers specific to TID RDMA
    - r_tid_tail and r_tid_ack - which point to the entries for which the next
    TID RDMA DATA packets are going to arrive and the request for which
    the next TID RDMA ACK packets are to be generated, respectively.

    What this means is that entries in the ring, which are "behind"
    s_tail_ack_queue (entries which s_tail_ack_queue has gone past) are no
    longer considered complete. This is where the problem is - a newly
    received request could potentially overwrite a still active TID RDMA WRITE
    request.

    The reason why the TID RDMA pointers trail s_tail_ack_queue is that the
    normal Verbs send engine uses s_tail_ack_queue as the pointer for the next
    response. Since TID RDMA WRITE responses are processed by the normal Verbs
    send engine, s_tail_ack_queue had to be moved to the next entry once all
    TID RDMA WRITE response packets were sent to get the desired pipelining
    between requests. Doing otherwise would mean that the normal Verbs send
    engine would not be able to send the TID RDMA WRITE responses for the next
    TID RDMA request until the current one is fully completed.

    This patch introduces the s_acked_ack_queue index to point to the next
    request to complete on the responder side. For requests other than TID
    RDMA WRITE, s_acked_ack_queue should always be kept in sync with
    s_tail_ack_queue. For TID RDMA WRITE request, it may fall behind
    s_tail_ack_queue.

    Reviewed-by: Mike Marciniszyn
    Signed-off-by: Mitko Haralanov
    Signed-off-by: Kaike Wan
    Signed-off-by: Dennis Dalessandro
    Signed-off-by: Doug Ledford

    Kaike Wan
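
    For reference, the "queue full"/"queue empty" tests that the two pointers
    make possible look roughly like this (a conceptual sketch of the ring-index
    arithmetic only, not the hfi1 implementation):

        /* head = r_head_ack_queue (next insert),
         * tail = s_tail_ack_queue (request being processed)
         */
        static bool ring_empty(u32 head, u32 tail)
        {
                return head == tail;
        }

        static bool ring_full(u32 head, u32 tail, u32 size)
        {
                /* inserting at head would land on the still-active tail */
                return ((head + 1) % size) == tail;
        }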
     
  • This patch adds the functions to build TID RDMA WRITE request.
    The work request opcode, packet opcode, and packet formats for TID
    RDMA WRITE protocol are also defined in this patch.

    Signed-off-by: Mitko Haralanov
    Signed-off-by: Mike Marciniszyn
    Signed-off-by: Ashutosh Dixit
    Signed-off-by: Kaike Wan
    Signed-off-by: Dennis Dalessandro
    Signed-off-by: Doug Ledford

    Kaike Wan
     
  • The RC retry timeout value is based on the estimated time for the
    response packet to come back. However, for TID RDMA READ request, due
    to the use of header suppression, the driver is normally not notified
    for each incoming response packet until the last TID RDMA READ response
    packet. Consequently, the retry timeout value should be extended to
    cover the transaction time for the entire length of a segment (default
    256K) instead of that for a single packet. This patch addresses the
    issue by introducing new retry timer functions to account for multiple
    packets and wrapper functions for backward compatibility.

    Reviewed-by: Mike Marciniszyn
    Signed-off-by: Kaike Wan
    Signed-off-by: Dennis Dalessandro
    Signed-off-by: Doug Ledford

    Kaike Wan
     
  • This patch adds the helper functions to build the TID RDMA READ request
    on the requester side. The key is to allocate TID resources (TID flow
    and TID entries) and send the resource information to the responder side
    along with the read request. Since the TID resources are limited, each
    TID RDMA READ request has to be split into segments with a default
    segment size of 256K. A software flow is allocated to track the data
    transaction for each segment. The work request opcode, packet opcode, and
    packet formats for TID RDMA READ protocol are also defined in this patch.

    Reviewed-by: Mike Marciniszyn
    Signed-off-by: Kaike Wan
    Signed-off-by: Dennis Dalessandro
    Signed-off-by: Doug Ledford

    Kaike Wan
     
  • TID entries are used by hfi1 hardware to receive data payload from
    incoming packets directly into a user buffer and thus avoid data copying
    by software. This patch implements the functions for TID allocation,
    freeing, and programming TID RcvArray entries in hardware for kernel
    clients. TID entries are managed via lists of TID groups similar to PSM.
    Furthermore, to track TID resource allocation for each request, software
    flows are also allocated and freed as needed. Since software flows
    consume a large amount of memory for tracking TID allocation and freeing,
    it is generally desirable to allocate them dynamically in the send queue
    and only for TID RDMA requests, but pre-allocate them for the receive queue
    because the send queue could have thousands of entries while the receive
    queue has only a limited number of entries.

    Signed-off-by: Mitko Haralanov
    Signed-off-by: Ashutosh Dixit
    Signed-off-by: Mike Marciniszyn
    Signed-off-by: Kaike Wan
    Signed-off-by: Dennis Dalessandro
    Signed-off-by: Doug Ledford

    Kaike Wan
     
  • This patch moves some RC helper functions into a header file so that
    they can be called from both RC and TID RDMA functions. In addition,
    a common function for rewinding a request is created in rdmavt so that
    it can be shared between the qib and hfi1 drivers.

    Reviewed-by: Mike Marciniszyn
    Signed-off-by: Mitko Haralanov
    Signed-off-by: Kaike Wan
    Signed-off-by: Dennis Dalessandro
    Signed-off-by: Doug Ledford

    Kaike Wan
     
  • Move the iwpm kdoc comments from the prototype declarations to above
    the function bodies. There are no functional changes in this patch.

    Signed-off-by: Steve Wise
    Signed-off-by: Jason Gunthorpe

    Steve Wise
     

05 Feb, 2019

4 commits

  • A soft iwarp driver that uses the host TCP stack via a kernel mode socket
    does not need port mapping. In fact, if the port map daemon, iwpmd, is
    running, then iwpmd must not try to create/bind a socket to the actual
    port for a soft iwarp connection, since the driver already has that socket
    bound.

    Yet if the soft iwarp driver wants to interoperate with hard iwarp devices
    that -are- using port mapping, then the soft iwarp driver's mappings still
    need to be maintained and advertised by the iwpm protocol.

    This patch enhances the rdma driver <-> iwcm interface to allow an iwarp
    driver to specify that it does not want port mapping. The iwpm
    kernel <-> iwpmd interface is also enhanced to pass up this information on
    map requests.

    Care is taken to interoperate with the current iwpmd version (ABI version
    3) and only use the new NL attributes if iwpmd supports ABI version 4.

    The ABI version define has also been created in rdma_netlink.h so both
    kernel and user code can share it. The iwcm and iwpmd negotiate the ABI
    version to use with a new HELLO netlink message.

    Signed-off-by: Steve Wise
    Reviewed-by: Tatyana Nikolova
    Signed-off-by: Jason Gunthorpe

    Steve Wise
     
  • Linux 5.0-rc5

    Needed to merge the include/uapi changes so we have an up to date
    single-tree for these files. Patches already posted are also expected to
    need this for dependencies.

    Jason Gunthorpe
     
  • Keeping single-line wrapper functions is not useful. Hence remove the
    ib_sg_dma_address() and ib_sg_dma_len() functions. This patch does not
    change any functionality.

    Signed-off-by: Bart Van Assche
    Signed-off-by: Jason Gunthorpe

    Bart Van Assche
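
    Call sites simply switch to the generic scatterlist accessors that the
    wrappers resolved to (fill_sge is a hypothetical helper for illustration):

        static void fill_sge(struct ib_sge *sge, struct scatterlist *sg)
        {
                /* before: sge->addr   = ib_sg_dma_address(ibdev, sg);
                 *         sge->length = ib_sg_dma_len(ibdev, sg);
                 */
                sge->addr   = sg_dma_address(sg);   /* generic scatterlist helpers */
                sge->length = sg_dma_len(sg);
        }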
     
  • Expose XRC ODP capabilities as part of the extended device capabilities.

    Signed-off-by: Moni Shoua
    Reviewed-by: Majd Dibbiny
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe

    Moni Shoua