23 Feb, 2019
1 commit
-
Following the PD conversion patch, do the same for ucontext allocations.
Signed-off-by: Leon Romanovsky
Signed-off-by: Jason Gunthorpe
20 Feb, 2019
8 commits
-
Add support for new LINK messages to allow adding and deleting rdma
interfaces. This will be used initially for soft rdma drivers which
instantiate device instances dynamically by the admin specifying a netdev
device to use. The rdma_rxe module will be the first user of these
messages.
The design is modeled after RTNL_NEWLINK/DELLINK: rdma drivers register
with the rdma core if they provide link add/delete functions. Each driver
registers with a unique "type" string, that is used to dispatch messages
coming from user space. A new RDMA_NLDEV_ATTR is defined for the "type"
string. User mode will pass 3 attributes in a NEWLINK message:
RDMA_NLDEV_ATTR_DEV_NAME for the desired rdma device name to be created,
RDMA_NLDEV_ATTR_LINK_TYPE for the "type" of link being added, and
RDMA_NLDEV_ATTR_NDEV_NAME for the net_device interface to use for this
link. The DELLINK message will contain the RDMA_NLDEV_ATTR_DEV_INDEX of
the device to delete.
Signed-off-by: Steve Wise
Reviewed-by: Leon Romanovsky
Reviewed-by: Michael J. Ruhl
Signed-off-by: Jason Gunthorpe
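A minimal sketch of the driver side of this, assuming the registration
helper is named rdma_link_register() and the ops structure carries the
"type" string plus a newlink callback as described above; the
rxe_net_add() helper is illustrative:

#include <rdma/rdma_netlink.h>

/* Illustrative only: signatures are assumptions based on the text above. */
static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
{
        /* Instantiate an rdma device named ibdev_name on top of ndev. */
        return rxe_net_add(ibdev_name, ndev);
}

static struct rdma_link_ops rxe_link_ops = {
        .type    = "rxe",        /* matched against RDMA_NLDEV_ATTR_LINK_TYPE */
        .newlink = rxe_newlink,  /* called to handle a NEWLINK request */
};

static int __init rxe_module_init(void)
{
        rdma_link_register(&rxe_link_ops);
        return 0;
}
-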
Since rxe allows unregistration from other threads the rxe pointer can
become invalid any moment after ib_register_driver returns. This could
cause a user triggered use after free.
Add another driver callback to be called right after the device becomes
registered to complete any device setup required post-registration. This
callback has enough core locking to prevent the device from becoming
unregistered.
Signed-off-by: Jason Gunthorpe
-
These APIs are intended to support drivers that exist outside the usual
driver core probe()/remove() callbacks. Normally the driver core will
prevent remove() from running concurrently with probe(); once this safety
is lost, drivers need more support to get the locking and lifetimes right.
ib_unregister_driver() is intended to be used during module_exit of a
driver using these APIs. It unregisters all the associated ib_devices.
ib_unregister_device_and_put() is to be used by a driver-specific removal
function (ie removal by name, removal from a netdev notifier, removal from
netlink).
ib_unregister_queued() is to be used from netdev notifier chains where
RTNL is held.
The locking is tricky here since once things become async it is possible
to race unregister with registration. This is largely solved by relying on
the registration refcount, unregistration will only ever work on something
that has a positive registration refcount - and then an unregistration
mutex serializes all competing unregistrations of the same device.
Signed-off-by: Jason Gunthorpe
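A rough sketch of how a driver outside the probe()/remove() model might
use these entry points, following the names above; the soft_* helpers and
the driver id constant are hypothetical:

/* Admin-triggered removal by name (e.g. from a netlink DELLINK handler). */
static int soft_dellink(const char *ibdev_name)
{
        struct ib_device *dev = soft_find_device_by_name(ibdev_name);

        if (!dev)
                return -ENODEV;
        ib_unregister_device_and_put(dev);      /* consumes the lookup ref */
        return 0;
}

/* Netdev notifier: RTNL is held, so only queue the unregistration. */
static int soft_netdev_event(struct notifier_block *nb, unsigned long event,
                             void *ptr)
{
        struct net_device *ndev = netdev_notifier_info_to_dev(ptr);
        struct ib_device *dev = soft_device_for_netdev(ndev);

        if (dev && event == NETDEV_UNREGISTER)
                ib_unregister_queued(dev);
        return NOTIFY_DONE;
}

/* module_exit: tear down every device this driver registered. */
static void __exit soft_exit(void)
{
        ib_unregister_driver(RDMA_DRIVER_SOFT_EXAMPLE);
}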
-
Several drivers need to find the ib_device from a given netdev. rxe needs
this at speed in an unsleepable context, so choose to implement the
translation using an RCU-safe hash table.
The hash table can have a many-to-one mapping. This is intended to support
some future case where multiple IB drivers (ie iWarp and RoCE) connect to
the same netdevs. driver_ids will need to be different to support this.
In the process this makes the struct ib_device and ib_port_data RCU safe
by deferring their kfrees.
Signed-off-by: Jason Gunthorpe
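A hedged sketch of the lookup described above, assuming the helper is
named ib_device_get_by_netdev() and takes a driver_id filter;
soft_handle_skb() is illustrative:

static void soft_handle_skb(struct sk_buff *skb)
{
        struct ib_device *ibdev;

        /* RCU-protected hash lookup, usable from an unsleepable context. */
        ibdev = ib_device_get_by_netdev(skb->dev, RDMA_DRIVER_RXE);
        if (!ibdev)
                return;

        /* ... deliver the packet to the matching rdma device ... */

        ib_device_put(ibdev);   /* drop the reference taken by the lookup */
}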
-
The associated netdev should not actually be very dynamic, so for most
drivers there is no reason for a callback like this. Provide an API to
inform the core code about the net dev affiliation and use a core
maintained data structure instead.
This allows the core code to be more aware of the ndev relationship, which
will allow some new APIs based around this.
This also uses locking that makes some kind of sense; many drivers had a
confusing RCU lock, or missing locking, which isn't right.
Signed-off-by: Jason Gunthorpe
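A small sketch of the intended driver usage, assuming the helper is named
ib_device_set_netdev(); struct soft_dev is illustrative:

static int soft_setup_port(struct soft_dev *sdev, struct net_device *ndev)
{
        /*
         * Associate port 1 of this rdma device with its backing netdev;
         * the core now owns the ndev <-> port bookkeeping instead of a
         * per-driver get_netdev callback.
         */
        return ib_device_set_netdev(&sdev->ib_dev, ndev, 1);
}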
-
Like the other cases there is no real reason to have another array just for
the cache. This larger conversion gets its own patch.
Signed-off-by: Jason Gunthorpe
-
There is no reason to have three allocations of per-port data. Combine
them together and make the lifetime for all the per-port data match the
struct ib_device.
Following patches will require more port-specific data; now there is a
good place to put it.
Signed-off-by: Jason Gunthorpe
-
We have many loops iterating over all of the end port numbers on a struct
ib_device; simplify them with a for_each helper.
Reviewed-by: Parav Pandit
Signed-off-by: Jason Gunthorpe
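A sketch of the helper in use, assuming it is spelled rdma_for_each_port();
reset_port() is a made-up per-port operation:

static void reset_all_ports(struct ib_device *ibdev)
{
        unsigned int port;

        /*
         * Replaces the open-coded bounds:
         *   for (port = rdma_start_port(ibdev);
         *        port <= rdma_end_port(ibdev); port++)
         */
        rdma_for_each_port(ibdev, port)
                reset_port(ibdev, port);
}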
19 Feb, 2019
4 commits
-
There is no need to expose internals of restrack DB to IB/core.
Signed-off-by: Leon Romanovsky
Signed-off-by: Jason Gunthorpe
-
XArray uses an internal lock for updates to the XArray. This means that our
external RW lock is needed to ensure that an entry is not deleted while we
are performing iteration over the list.
Signed-off-by: Leon Romanovsky
Signed-off-by: Jason Gunthorpe
-
Add a new general helper to get a restrack entry given its ID and
respective type.
Signed-off-by: Leon Romanovsky
Signed-off-by: Jason Gunthorpe
-
The addition of .doit callbacks poses a new access pattern to the resource
entries by some user-visible index. Back then, the legacy DB was
implemented as a hash because per-index access wasn't needed and XArray
wasn't accepted yet.
Acceptance of XArray together with per-index access requires a refresh
of the DB implementation.
Signed-off-by: Leon Romanovsky
Signed-off-by: Jason Gunthorpe
16 Feb, 2019
3 commits
-
Now that we have the udata passed to all the ib_xxx object creation APIs
and the additional macro 'rdma_udata_to_drv_context' to get the
ib_ucontext from ib_udata stored in uverbs_attr_bundle, we can finally
start to remove the dependency of the drivers on
ib_xxx->uobject->context.
Signed-off-by: Shamir Rabinovitch
Signed-off-by: Jason Gunthorpe
-
Helper function to get the driver's context out of ib_udata wrapped in
uverbs_attr_bundle for user objects, or NULL for kernel objects.
Signed-off-by: Shamir Rabinovitch
Signed-off-by: Jason Gunthorpe
-
Add ib_ucontext to the uverbs_attr_bundle sent down the ioctl and cmd flows
as soon as the flow has an ib_uobject.
In addition, remove the rdma_get_ucontext helper function that is only used by
ib_umem_get.
Signed-off-by: Shamir Rabinovitch
Signed-off-by: Jason Gunthorpe
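A hedged sketch of how a driver verb handler can now pull its own ucontext
out of the udata using the macro mentioned in the entries above; struct
soft_ucontext and soft_create_thing() are illustrative:

struct soft_ucontext {
        struct ib_ucontext ibuc;
        unsigned long flags;            /* driver-private state */
};

static int soft_create_thing(struct ib_pd *pd, struct ib_udata *udata)
{
        /* NULL when the object is created by a kernel ULP (no udata). */
        struct soft_ucontext *uctx =
                rdma_udata_to_drv_context(udata, struct soft_ucontext, ibuc);

        if (uctx)
                pr_debug("user context flags %lx\n", uctx->flags);
        return 0;
}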
13 Feb, 2019
1 commit
-
I had merged the hfi1-tid code into my local copy of for-next, but was
waiting on 0day testing before pushing it (I pushed it to my wip
branch). Having waited several days for 0day testing to show up, I'm
finally just going to push it out. In the meantime, though, Jason
pushed other stuff to for-next, so I needed to merge up the branches
before pushing.
Signed-off-by: Doug Ledford
10 Feb, 2019
1 commit
-
Due to concurrent work by myself and Jason, a normal fast forward merge
was not possible. This brings in a number of hfi1 changes, mainly the
hfi1 TID RDMA support (roughly 10,000 LOC change), which was reviewed
and integrated over a period of days.
Signed-off-by: Doug Ledford
09 Feb, 2019
10 commits
-
The locking here started out with a single lock that covered everything
and then has lately veered into crazy town.
The fundamental problem is that several places need to iterate over a
linked list, but also need to drop their locks to avoid deadlock during
client callbacks.
xarray's restartable iteration offers a simple solution to the
problem. Once all the lists are xarrays we can drop locks in the places
that need that and rely on xarray to provide consistency and locking for
the data structure.
The resulting simplification is that each of the three lists has a
dedicated rwsem that must be held when working with the list it
covers. One data structure is no longer covered by multiple locks.
The sleeping semaphore is selected because the read side generally needs
to be held over something sleeping, and using RCU reader locking in those
cases is overkill.
In the process this simplifies the entire registration/unregistration flow
to be the expected list of setups and the reversed list of matching
teardowns, and the registration lock 'refcount' can now be revised to be
released after the ULPs are removed, providing a very sane semantic for
this feature.
Signed-off-by: Jason Gunthorpe
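A generic illustration of the restartable-iteration pattern described
above (not the actual core code); it omits the refcounting that protects
the entry itself while the lock is dropped:

static void notify_all_clients(struct xarray *clients,
                               struct rw_semaphore *clients_rwsem,
                               struct ib_device *dev)
{
        struct ib_client *client;
        unsigned long index;

        down_read(clients_rwsem);
        xa_for_each(clients, index, client) {
                /*
                 * Drop the lock across the (possibly sleeping) callback;
                 * xa_for_each() resumes safely from 'index' afterwards.
                 */
                up_read(clients_rwsem);
                client->remove(dev, NULL);      /* client_data omitted here */
                down_read(clients_rwsem);
        }
        up_read(clients_rwsem);
}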
-
Now that we have a small ID for each client we can use xarray instead of
linearly searching linked lists for client data. This will give much
faster and scalable client data lookup, and will let us revise the
locking scheme.
Since xarray can store 'going_down' using a mark, just entirely eliminate
the struct ib_client_data and directly store the client_data value in the
xarray. However this does require a special iterator as we must still
iterate over any NULL client_data values.
Also eliminate the client_data_lock in favour of internal xarray locking.
Signed-off-by: Jason Gunthorpe
-
This gives each client a unique ID and will let us move client_data to use
xarray, and revise the locking scheme.
Clients have to be added/removed in strict FIFO/LIFO order as they
interdepend. To support this the client_ids are assigned to increase in
FIFO order. The existing linked list is kept to support reverse iteration
until xarray can get a reverse iteration API.
Signed-off-by: Jason Gunthorpe
Reviewed-by: Parav Pandit
-
This really has no purpose anymore; the refcount can be used to tell if the
device is still registered. Keeping it around just invites misuse.
Signed-off-by: Jason Gunthorpe
Reviewed-by: Parav Pandit
-
The PD allocation in IB/core allows us to simplify drivers and their
error flows in their .alloc_pd() paths. The changes in .alloc_pd() go hand
in hand with the relevant update in .dealloc_pd().
We will use this opportunity to convert .dealloc_pd() so it can't fail, as
was suggested a long time ago; failures are not happening, as we have
never seen a WARN_ON print.
Signed-off-by: Leon Romanovsky
Signed-off-by: Jason Gunthorpe
-
Add new macros to be used by drivers while registering the ops structure and
by IB/core while calling allocation routines, so drivers won't need to
perform kzalloc/kfree in their paths.
The change in the allocation stage allows us to initialize common fields
prior to calling into the drivers (e.g. restrack).
Signed-off-by: Leon Romanovsky
Signed-off-by: Jason Gunthorpe
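A hedged sketch of what this looks like for a PD: the macro spelling
follows the common INIT_RDMA_OBJ_SIZE pattern but should be treated as an
assumption, and struct soft_pd plus the handler parameter list are
illustrative:

struct soft_pd {
        struct ib_pd ibpd;      /* the core-owned object is embedded */
        u32 pdn;
};

/* The driver receives a core-allocated ib_pd instead of kzalloc'ing one. */
static int soft_alloc_pd(struct ib_pd *ibpd, struct ib_ucontext *ctx,
                         struct ib_udata *udata)
{
        struct soft_pd *pd = container_of(ibpd, struct soft_pd, ibpd);

        pd->pdn = 1;            /* driver-specific init only, no allocation */
        return 0;
}

static const struct ib_device_ops soft_dev_ops = {
        .alloc_pd = soft_alloc_pd,
        /*
         * Tells the core how much to allocate and where the ib_pd lives,
         * so common fields (e.g. restrack) can be set up before calling
         * into the driver.
         */
        INIT_RDMA_OBJ_SIZE(ib_pd, soft_pd, ibpd),
};
-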
When creating many MAD agents in a short period of time, receive packet
processing can be delayed long enough to cause timeouts while new agents
are being added to the atomic notifier chain with IRQs disabled. Notifier
chain registration and unregistration is an O(n) operation. With large
numbers of MAD agents being created and destroyed simultaneously the CPUs
spend too much time with interrupts disabled.
Instead of each MAD agent registering for its own LSM notification,
maintain a list of agents internally and register once, this registration
already existed for handling the PKeys. This list is write mostly, so a
normal spin lock is used vs a read/write lock. All MAD agents must be
checked, so a single list is used instead of breaking them down per
device.
Notifier calls are done under rcu_read_lock, so there isn't a risk of
similar packet timeouts while checking the MAD agents security settings
when notified.
Signed-off-by: Daniel Jurgens
Reviewed-by: Parav Pandit
Signed-off-by: Leon Romanovsky
Acked-by: Paul Moore
Signed-off-by: Jason Gunthorpe
-
Move the security related fields above the u8s to eliminate a hole in the
struct.
pahole before:
struct ib_mad_agent {
...
u32 hi_tid; /* 48 4 */
u32 flags; /* 52 4 */
u8 port_num; /* 56 1 */
u8 rmpp_version; /* 57 1 */
/* XXX 6 bytes hole, try to pack */
/* --- cacheline 1 boundary (64 bytes) --- */
void * security; /* 64 8 */
bool smp_allowed; /* 72 1 */
bool lsm_nb_reg; /* 73 1 */
/* XXX 6 bytes hole, try to pack */
struct notifier_block lsm_nb; /* 80 24 */
/* XXX last struct has 4 bytes of padding */
/* size: 104, cachelines: 2, members: 14 */
...
};
pahole after:
struct ib_mad_agent {
...
u32 hi_tid; /* 48 4 */
u32 flags; /* 52 4 */
void * security; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
struct notifier_block lsm_nb; /* 64 24 */
/* XXX last struct has 4 bytes of padding */
u8 port_num; /* 88 1 */
u8 rmpp_version; /* 89 1 */
bool smp_allowed; /* 90 1 */
bool lsm_nb_reg; /* 91 1 */
/* size: 96, cachelines: 2, members: 14 */
...
};
Signed-off-by: Daniel Jurgens
Reviewed-by: Parav Pandit
Signed-off-by: Leon Romanovsky
Signed-off-by: Jason Gunthorpe
-
This allows drivers to know the tos was actively set by the application.
Signed-off-by: Steve Wise
Signed-off-by: Jason Gunthorpe
-
Define new option in 'rdma_set_option' to override calculated QP timeout
when requested to provide QP attributes to modify a QP.
At the same time, pack tos_set to be a bitfield.
Signed-off-by: Danit Goldberg
Reviewed-by: Moni Shoua
Signed-off-by: Leon Romanovsky
Reviewed-by: Parav Pandit
Signed-off-by: Jason Gunthorpe
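A hedged usage sketch from the librdmacm side, assuming the new option is
exposed as RDMA_OPTION_ID_ACK_TIMEOUT at the RDMA_OPTION_ID level:

#include <rdma/rdma_cma.h>

static int override_qp_timeout(struct rdma_cm_id *id)
{
        uint8_t timeout = 14;   /* IB ack timeout exponent: 4.096 us * 2^14 */

        return rdma_set_option(id, RDMA_OPTION_ID, RDMA_OPTION_ID_ACK_TIMEOUT,
                               &timeout, sizeof(timeout));
}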
06 Feb, 2019
8 commits
-
This patch integrates TID RDMA WRITE protocol into normal RDMA verbs
framework. The TID RDMA WRITE protocol is an end-to-end protocol
between the hfi1 drivers on two OPA nodes that converts a qualified
RDMA WRITE request into a TID RDMA WRITE request to avoid data copying
on the responder side.
Reviewed-by: Mike Marciniszyn
Signed-off-by: Mitko Haralanov
Signed-off-by: Kaike Wan
Signed-off-by: Dennis Dalessandro
Signed-off-by: Doug Ledford
-
The s_ack_queue is managed by two pointers into the ring:
r_head_ack_queue and s_tail_ack_queue. r_head_ack_queue is the index of
where the next received request is going to be placed and s_tail_ack_queue
is the entry of the request currently being processed. This works
perfectly fine for normal Verbs as the requests are processed one at a
time and the s_tail_ack_queue is not moved until the request that it
points to is fully completed.
In this fashion, s_tail_ack_queue constantly chases r_head_ack_queue and
the two pointers can easily be used to determine "queue full" and "queue
empty" conditions.The detection of these two conditions are imported in determining when an
old entry can safely be overwritten with a new received request and the
resources associated with the old request be safely released.
When pipelined TID RDMA WRITE is introduced into this mix, things look
very different. r_head_ack_queue is still the point at which a newly
received request will be inserted, s_tail_ack_queue is still the
currently processed request. However, with pipelined TID RDMA WRITE
requests, s_tail_ack_queue moves to the next request once all TID RDMA
WRITE responses for that request have been sent. The rest of the protocol
for a particular request is managed by other pointers specific to TID RDMA
- r_tid_tail and r_tid_ack - which point to the entries for which the next
TID RDMA DATA packets are going to arrive and the request for which
the next TID RDMA ACK packets are to be generated, respectively.
What this means is that entries in the ring, which are "behind"
s_tail_ack_queue (entries which s_tail_ack_queue has gone past) are no
longer considered complete. This is where the problem is - a newly
received request could potentially overwrite a still active TID RDMA WRITE
request.
The reason why the TID RDMA pointers trail s_tail_ack_queue is that the
normal Verbs send engine uses s_tail_ack_queue as the pointer for the next
response. Since TID RDMA WRITE responses are processed by the normal Verbs
send engine, s_tail_ack_queue had to be moved to the next entry once all
TID RDMA WRITE response packets were sent to get the desired pipelining
between requests. Doing otherwise would mean that the normal Verbs send
engine would not be able to send the TID RDMA WRITE responses for the next
TID RDMA request until the current one is fully completed.
This patch introduces the s_acked_ack_queue index to point to the next
request to complete on the responder side. For requests other than TID
RDMA WRITE, s_acked_ack_queue should always be kept in sync with
s_tail_ack_queue. For TID RDMA WRITE request, it may fall behind
s_tail_ack_queue.
Reviewed-by: Mike Marciniszyn
Signed-off-by: Mitko Haralanov
Signed-off-by: Kaike Wan
Signed-off-by: Dennis Dalessandro
Signed-off-by: Doug Ledford
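A generic illustration (not hfi1 code) of the bookkeeping described above,
showing why slot reuse must now follow the s_acked index rather than
s_tail:

struct ack_ring {
        unsigned int size;      /* number of slots */
        unsigned int r_head;    /* next slot for a newly received request */
        unsigned int s_tail;    /* request currently being responded to */
        unsigned int s_acked;   /* oldest request not yet fully complete */
};

static unsigned int ring_next(const struct ack_ring *r, unsigned int i)
{
        return (i + 1) % r->size;
}

/*
 * Before TID RDMA WRITE, any slot behind s_tail could be reused. With
 * pipelined TID RDMA WRITE, only slots behind s_acked are free, because
 * s_tail may already have moved past requests that are still active.
 */
static bool ring_full(const struct ack_ring *r)
{
        return ring_next(r, r->r_head) == r->s_acked;
}
-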
This patch adds the functions to build TID RDMA WRITE request.
The work request opcode, packet opcode, and packet formats for TID
RDMA WRITE protocol are also defined in this patch.
Signed-off-by: Mitko Haralanov
Signed-off-by: Mike Marciniszyn
Signed-off-by: Ashutosh Dixit
Signed-off-by: Kaike Wan
Signed-off-by: Dennis Dalessandro
Signed-off-by: Doug Ledford
-
The RC retry timeout value is based on the estimated time for the
response packet to come back. However, for TID RDMA READ request, due
to the use of header suppression, the driver is normally not notified
for each incoming response packet until the last TID RDMA READ response
packet. Consequently, the retry timeout value should be extended to
cover the transaction time for the entire length of a segment (default
256K) instead of that for a single packet. This patch addresses the
issue by introducing new retry timer functions to account for multiple
packets and wrapper functions for backward compatibility.
Reviewed-by: Mike Marciniszyn
Signed-off-by: Kaike Wan
Signed-off-by: Dennis Dalessandro
Signed-off-by: Doug Ledford
-
This patch adds the helper functions to build the TID RDMA READ request
on the requester side. The key is to allocate TID resources (TID flow
and TID entries) and send the resource information to the responder side
along with the read request. Since the TID resources are limited, each
TID RDMA READ request has to be split into segments with a default
segment size of 256K. A software flow is allocated to track the data
transaction for each segment. The work request opcode, packet opcode, and
packet formats for TID RDMA READ protocol are also defined in this patch.
Reviewed-by: Mike Marciniszyn
Signed-off-by: Kaike Wan
Signed-off-by: Dennis Dalessandro
Signed-off-by: Doug Ledford
-
TID entries are used by hfi1 hardware to receive data payload from
incoming packets directly into a user buffer and thus avoid data copying
by software. This patch implements the functions for TID allocation,
freeing, and programming TID RcvArray entries in hardware for kernel
clients. TID entries are managed via lists of TID groups similar to PSM.
Furthermore, to track TID resource allocation for each request, software
flows are also allocated and freed as needed. Since software flows
consume a large amount of memory for tracking TID allocation and freeing,
it is generally desirable to allocate them dynamically in the send queue
and only for TID RDMA requests, but pre-allocate them for receive queue
because the send queue could have thousands of entries while the receive
queue has only a limited number of entries.
Signed-off-by: Mitko Haralanov
Signed-off-by: Ashutosh Dixit
Signed-off-by: Mike Marciniszyn
Signed-off-by: Kaike Wan
Signed-off-by: Dennis Dalessandro
Signed-off-by: Doug Ledford
-
This patch moves some RC helper functions into a header file so that
they can be called from both RC and TID RDMA functions. In addition,
a common function for rewinding a request is created in rdmavt so that
it can be shared between the qib and hfi1 drivers.
Reviewed-by: Mike Marciniszyn
Signed-off-by: Mitko Haralanov
Signed-off-by: Kaike Wan
Signed-off-by: Dennis Dalessandro
Signed-off-by: Doug Ledford
-
Move the iwpm kdoc comments from the prototype declarations to above
the function bodies. There are no functional changes in this patch.
Signed-off-by: Steve Wise
Signed-off-by: Jason Gunthorpe
05 Feb, 2019
4 commits
-
A soft iwarp driver that uses the host TCP stack via a kernel mode socket
does not need port mapping. In fact, if the port map daemon, iwpmd, is
running, then iwpmd must not try and create/bind a socket to the actual
port for a soft iwarp connection, since the driver already has that socket
bound.
Yet if the soft iwarp driver wants to interoperate with hard iwarp devices
that -are- using port mapping, then the soft iwarp driver's mappings still
need to be maintained and advertised by the iwpm protocol.
This patch enhances the rdma driver <-> iwcm interface to allow an iwarp
driver to specify that it does not want port mapping. The iwpm
kernel <-> iwpmd interface is also enhanced to pass up this information on
map requests.
Care is taken to interoperate with the current iwpmd version (ABI version
3) and only use the new NL attributes if iwpmd supports ABI version 4.
The ABI version define has also been created in rdma_netlink.h so both
kernel and user code can share it. The iwcm and iwpmd negotiate the ABI
version to use with a new HELLO netlink message.
Signed-off-by: Steve Wise
Reviewed-by: Tatyana Nikolova
Signed-off-by: Jason Gunthorpe
-
Linux 5.0-rc5
Needed to merge the include/uapi changes so we have an up to date
single-tree for these files. Patches already posted are also expected to
need this for dependencies.
-
Keeping single line wrapper functions is not useful. Hence remove the
ib_sg_dma_address() and ib_sg_dma_len() functions. This patch does not
change any functionality.
Signed-off-by: Bart Van Assche
Signed-off-by: Jason Gunthorpe
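A sketch of what a caller looks like after the conversion; since the
removed helpers simply forwarded to the generic scatterlist macros,
callers use sg_dma_address()/sg_dma_len() directly (map_sges() is
illustrative):

static void map_sges(struct ib_device *dev, struct scatterlist *sgl,
                     int nents, struct ib_sge *sge)
{
        struct scatterlist *sg;
        int i;

        for_each_sg(sgl, sg, nents, i) {
                sge[i].addr   = sg_dma_address(sg); /* was ib_sg_dma_address(dev, sg) */
                sge[i].length = sg_dma_len(sg);     /* was ib_sg_dma_len(dev, sg) */
                /* lkey is filled in by the caller */
        }
}
-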
Expose XRC ODP capabilities as part of the extended device capabilities.
Signed-off-by: Moni Shoua
Reviewed-by: Majd Dibbiny
Signed-off-by: Leon Romanovsky
Signed-off-by: Jason Gunthorpe
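A hedged sketch of how a ULP might test the new bits; the exact field name
and placement of the XRC ODP capabilities are assumptions here:

static bool xrc_odp_send_supported(struct ib_device *dev)
{
        return dev->attrs.odp_caps.per_transport_caps.xrc_odp_caps &
               IB_ODP_SUPPORT_SEND;
}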