25 Sep, 2017

2 commits

  • The ib_mr->length represents the length of the MR in bytes as per
    the IBTA spec 1.3 section 11.2.10.3 (REGISTER PHYSICAL MEMORY REGION).

    Currently the ib_mr->length field is defined as a 32-bit field only.
    This can truncate the length and cause failed WRs for consumers that
    register memory regions larger than 4GB and whose WRs access such
    MRs.

    This patch makes the length 64-bit to avoid such truncation.
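
    A minimal user-space sketch of the truncation described above. The
    struct and function names are hypothetical stand-ins, not the kernel
    definitions; only the 32-bit versus 64-bit width is taken from the text.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Hypothetical stand-ins for the old and new width of the length field. */
    struct mr_len32 { uint32_t length; };
    struct mr_len64 { uint64_t length; };

    /* Return the length that is actually kept by a 32-bit field. */
    static uint64_t stored_len32(uint64_t bytes)
    {
        struct mr_len32 mr = { .length = (uint32_t)bytes }; /* drops the high bits */
        return mr.length;
    }

    /* Return the length that is actually kept by a 64-bit field. */
    static uint64_t stored_len64(uint64_t bytes)
    {
        struct mr_len64 mr = { .length = bytes };
        return mr.length;
    }

    /* Usage: a 5 GiB registration only survives in the 64-bit field. */
    static void truncation_demo(void)
    {
        uint64_t five_gib = 5ULL << 30;

        assert(stored_len32(five_gib) == (1ULL << 30)); /* 5 GiB mod 4 GiB */
        assert(stored_len64(five_gib) == five_gib);
    }
    ```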

    Cc: Sagi Grimberg
    Cc: Chuck Lever
    Cc: Faisal Latif
    Fixes: 4c67e2bfc8b7 ("IB/core: Introduce new fast registration API")
    Signed-off-by: Ilya Lesokhin
    Signed-off-by: Parav Pandit
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Doug Ledford

    Parav Pandit
     
  • The tag matching functionality is implemented by mlx5 driver
    by extending XRQ, however this internal kernel information was
    exposed to user space applications with *xrq* name instead of *tm*.

    This patch renames *xrq* to *tm* to handle that.

    Fixes: 8d50505ada72 ("IB/uverbs: Expose XRQ capabilities")
    Signed-off-by: Leon Romanovsky
    Reviewed-by: Yishai Hadas
    Signed-off-by: Doug Ledford

    Leon Romanovsky
     

10 Sep, 2017

1 commit

  • Pull nfsd updates from Bruce Fields:
    "More RDMA work and some op-structure constification from Chuck Lever,
    and a small cleanup to our xdr encoding"

    * tag 'nfsd-4.14' of git://linux-nfs.org/~bfields/linux:
    svcrdma: Estimate Send Queue depth properly
    rdma core: Add rdma_rw_mr_payload()
    svcrdma: Limit RQ depth
    svcrdma: Populate tail iovec when receiving
    nfsd: Incoming xdr_bufs may have content in tail buffer
    svcrdma: Clean up svc_rdma_build_read_chunk()
    sunrpc: Const-ify struct sv_serv_ops
    nfsd: Const-ify NFSv4 encoding and decoding ops arrays
    sunrpc: Const-ify instances of struct svc_xprt_ops
    nfsd4: individual encoders no longer see error cases
    nfsd4: skip encoder in trivial error cases
    nfsd4: define ->op_release for compound ops
    nfsd4: opdesc will be useful outside nfs4proc.c
    nfsd4: move some nfsd4 op definitions to xdr4.h

    Linus Torvalds
     

09 Sep, 2017

1 commit

  • Allow interval trees to quickly check for overlaps to avoid unnecessary
    tree lookups in interval_tree_iter_first().

    As of this patch, all interval tree flavors will require using a
    'rb_root_cached' such that we can have the leftmost node easily
    available. While most users will make use of this feature, those with
    special functions (in addition to the generic insert, delete, search
    calls) will avoid using the cached option as they can do funky things
    with insertions -- for example, vma_interval_tree_insert_after().
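
    The quick check can be sketched roughly as follows. This is a
    simplified user-space illustration, assuming each node caches the
    largest end point of its subtree; the types and the it_may_overlap()
    name are hypothetical and do not mirror the kernel structures.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical node: interval [start, last] plus the max 'last' below it. */
    struct it_node {
        unsigned long start, last, subtree_last;
    };

    /* Hypothetical cached root: the tree root and its leftmost node. */
    struct it_root_cached {
        const struct it_node *root;
        const struct it_node *leftmost; /* node with the smallest start */
    };

    /*
     * Return false when [start, last] cannot overlap anything in the tree,
     * so the caller can skip the tree walk entirely.
     */
    static bool it_may_overlap(const struct it_root_cached *t,
                               unsigned long start, unsigned long last)
    {
        assert(start <= last);

        if (t->root == NULL)
            return false;
        if (t->root->subtree_last < start) /* query is right of every interval */
            return false;
        if (t->leftmost->start > last)     /* query is left of every interval */
            return false;
        return true;                       /* a real lookup is still needed */
    }
    ```

    Both comparisons are O(1) because the leftmost node is cached next to
    the root instead of being found by walking down the tree.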

    [jglisse@redhat.com: fix deadlock from typo vm_lock_anon_vma()]
    Link: http://lkml.kernel.org/r/20170808225719.20723-1-jglisse@redhat.com
    Link: http://lkml.kernel.org/r/20170719014603.19029-12-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Doug Ledford
    Acked-by: Michael S. Tsirkin
    Cc: David Airlie
    Cc: Jason Wang
    Cc: Christian Benvenuti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

06 Sep, 2017

1 commit

  • The amount of payload per MR depends on device capabilities and
    the memory registration mode in use. The new rdma_rw API hides both,
    making it difficult for ULPs to determine how large their transport
    send queues need to be.

    Expose the MR payload information via a new API.
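
    As a rough illustration of why a ULP needs this number: if one MR can
    cover a given number of pages, the count of MRs for a maximum-sized
    payload is a ceiling division. The helper below is a hypothetical
    sketch and is not the signature of the new API.

    ```c
    #include <assert.h>

    /*
     * Hypothetical helper: how many MRs are needed to move 'maxpages' pages
     * when a single MR can cover at most 'pages_per_mr' pages.
     */
    static unsigned int mrs_needed(unsigned int maxpages,
                                   unsigned int pages_per_mr)
    {
        assert(pages_per_mr > 0);
        return (maxpages + pages_per_mr - 1) / pages_per_mr; /* round up */
    }
    ```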

    Signed-off-by: Chuck Lever
    Acked-by: Doug Ledford
    Signed-off-by: J. Bruce Fields

    Chuck Lever
     

31 Aug, 2017

9 commits

  • In order to use the parsing tree, we need to assign the root
    to all drivers. Currently, we just assign the default parsing
    tree via ib_uverbs_add_one. The driver could override this by
    assigning a parsing tree prior to registering the device.

    Signed-off-by: Matan Barak
    Reviewed-by: Yishai Hadas
    Signed-off-by: Doug Ledford

    Matan Barak
     
  • In this phase, we don't want to change all the drivers to use
    flexible driver-specific attributes. Therefore, we add two default
    attributes: UHW_IN and UHW_OUT. These attributes are optional in some
    methods and they encode the driver specific command data. We add
    a function that extracts this data and creates the legacy udata over
    it.

    Driver's data should start from UVERBS_UDATA_DRIVER_DATA_FLAG. This
    turns on the first bit of the namespace, indicating this attribute
    belongs to the driver's namespace.

    Signed-off-by: Matan Barak
    Reviewed-by: Yishai Hadas
    Signed-off-by: Doug Ledford

    Matan Barak
     
  • Add a new ib_user_ioctl_verbs.h which exports all required ABI
    enums and structs to the user-space.
    Export the default types to user-space through this file.

    Signed-off-by: Matan Barak
    Reviewed-by: Yishai Hadas
    Signed-off-by: Doug Ledford

    Matan Barak
     
  • When some objects are destroyed, we need to extract their status at
    destruction. After the object's destruction, this status
    (e.g. events_reported) resides in the uobject. In order to have the
    latest and correct status, the underlying object should be destroyed,
    but we should keep the uobject alive and read this information off the
    uobject. We introduce a rdma_explicit_destroy function. This function
    destroys the class type object (for example, the IDR class type which
    destroys the underlying object as well) and then converts the uobject
    to be of a null class type. This uobject will then be destroyed as any
    other uobject once uverbs_finalize_object[s] is called.

    Signed-off-by: Matan Barak
    Reviewed-by: Yishai Hadas
    Signed-off-by: Doug Ledford

    Matan Barak
     
  • This patch adds macros for declaring objects, methods and
    attributes. These definitions are later used by downstream patches
    to declare some of the default types.

    Signed-off-by: Matan Barak
    Reviewed-by: Yishai Hadas
    Signed-off-by: Doug Ledford

    Matan Barak
     
  • Different drivers support different features and even a subset of the
    common uverbs implementation. Currently, this is handled as a bitmask
    in every driver that represents which kind of methods it supports, but
    doesn't go down to attribute granularity. Moreover, drivers might
    want to add their specific types, methods and attributes to let
    their user-space counter-parts be exposed to some more efficient
    abstractions. It means that existence of different features is
    validated syntactically via the parsing infrastructure rather than
    using a complex in-handler logic.

    In order to do that, we allow defining features and abstractions
    as parsing trees. These per-feature parsing trees could be merged
    into an efficient (perfect-hash based) parsing tree, which is later
    used by the parsing infrastructure.

    To sum it up, this makes a parse tree unique for a device and
    represents only the features this particular device supports.
    This is done by having a root specification tree per feature.
    Before a device registers itself as an IB device, it merges
    all these trees into one parsing tree. This parsing tree
    is used to parse all user-space commands.

    A future user-space application could read this parse tree. This
    tree represents which objects, methods and attributes are
    supported by this device.

    This is based on the idea of
    Jason Gunthorpe

    Signed-off-by: Matan Barak
    Reviewed-by: Yishai Hadas
    Signed-off-by: Doug Ledford

    Matan Barak
     
  • This adds the DEVICE object. This object supports creating the context
    that all objects are created from. Moreover, it supports executing
    methods which are related to the device itself, such as QUERY_DEVICE.
    This is a singleton object (per file instance).

    All standard objects are put in the root structure. This root will later
    on be used in drivers as the source for their whole parsing tree.
    Later on, when new features are added, these drivers could mix this root
    with other customized objects.

    Signed-off-by: Matan Barak
    Reviewed-by: Yishai Hadas
    Signed-off-by: Doug Ledford

    Matan Barak
     
  • Switch all uverbs_type_attrs_xxxx with DECLARE_UVERBS_OBJECT
    macros. This will be later used in order to embed the object
    specific methods in the objects as well.

    Signed-off-by: Matan Barak
    Reviewed-by: Yishai Hadas
    Signed-off-by: Doug Ledford

    Matan Barak
     
  • In this ioctl interface, processing a command starts by parsing the
    properties of the command and fetching the appropriate user objects
    before calling the handler.

    Parsing and validation are done according to a specifier declared by
    the driver's code. In the driver, all supported objects are declared.
    These objects are separated into different object namespaces. Dividing
    objects into namespaces is done at initialization by using the higher
    bits of the object ids. This initialization can mix objects declared
    in different places into one parsing tree used in this ioctl interface.

    For each object we list all supported methods. Similarly to objects,
    methods are separated to method namespaces too. Namespacing is done
    similarly to the objects case. This could be used in order to add
    methods to an existing object.

    Each method has a specific handler, which could be either a default
    handler or a driver specific handler.
    Along with the handler, a set of attributes is specified as well.
    Similarly to objects and methods, attributes are namespaced and hashed
    by their ids at initialization too. All supported attributes are
    subject to automatic fetching and validation. These attributes include
    the command, response and the method's related objects' ids.

    When these entities (objects, methods and attributes) are used, the
    high bits of the entities ids are used in order to calculate the hash
    bucket index. Then, these high bits are masked out in order to have a
    zero based index. Since we use these high bits for both bucketing and
    namespacing, we get a compact representation and O(1) array access.
    This is mandatory for efficient dispatching.
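
    The bucketing scheme above can be sketched as follows. This is an
    illustration only: it assumes 16-bit ids whose four upper bits select
    the namespace, and the macro, struct and function names are
    hypothetical.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Assumed layout: the top 4 bits of a 16-bit id select the namespace. */
    #define ID_NS_SHIFT    12
    #define ID_NS_MASK     0xF000u
    #define DRIVER_NS_FLAG (1u << ID_NS_SHIFT) /* first namespace bit set */

    struct attr_spec { const char *name; };

    /* One bucket per namespace, each a zero-based array of specs. */
    struct bucket {
        const struct attr_spec *attrs;
        unsigned int num;
    };

    /* High bits pick the bucket. */
    static unsigned int id_bucket(uint16_t id)
    {
        return (id & ID_NS_MASK) >> ID_NS_SHIFT;
    }

    /* Masking the high bits out leaves a zero-based index. */
    static unsigned int id_index(uint16_t id)
    {
        return id & ~ID_NS_MASK;
    }

    /* Two array accesses, no searching; NULL means "not supported". */
    static const struct attr_spec *lookup(const struct bucket *buckets,
                                          unsigned int num_buckets, uint16_t id)
    {
        unsigned int b = id_bucket(id);
        unsigned int i = id_index(id);

        assert(buckets != NULL);
        if (b >= num_buckets || i >= buckets[b].num)
            return NULL;
        return &buckets[b].attrs[i];
    }
    ```

    With a table of two buckets (default and driver), an id of
    DRIVER_NS_FLAG | 0 resolves to the first driver attribute, while plain
    id 0 resolves to the first default attribute.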

    Each attribute has a type (PTR_IN, PTR_OUT, IDR and FD) and a length.
    Attributes could be validated through some attributes, like:
    (*) Minimum size / Exact size
    (*) Fops for FD
    (*) Object type for IDR

    If an IDR/fd attribute is specified, the kernel also states the object
    type and the required access (NEW, WRITE, READ or DESTROY).
    All uobject/fd management is done automatically by the infrastructure:
    it fails concurrent commands when at least one of them requires
    exclusive access (WRITE/DESTROY), synchronizes actions with device
    removals (dissociate context events) and takes care of reference
    counting (increase/decrease) for concurrently invoked actions.
    The reference counts on the actual kernel objects
    shall be handled by the handlers.
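
    One way to picture the shared versus exclusive rule is a single
    counter per object. The sketch below uses C11 atomics in user space;
    it is an assumption about how such a rule could be implemented, and
    the names are hypothetical.

    ```c
    #include <assert.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    enum access { ACCESS_READ, ACCESS_WRITE, ACCESS_DESTROY };

    /* usecnt > 0 counts readers, -1 marks one exclusive holder, 0 is idle. */
    struct uobj { atomic_int usecnt; };

    /* Try to take the object; a false return means the command must fail. */
    static bool uobj_try_lock(struct uobj *u, enum access a)
    {
        int cur = atomic_load(&u->usecnt);

        if (a == ACCESS_READ) {
            /* Join other readers unless an exclusive holder is present. */
            while (cur >= 0) {
                if (atomic_compare_exchange_weak(&u->usecnt, &cur, cur + 1))
                    return true;
            }
            return false;
        }
        /* WRITE and DESTROY require the object to be completely idle. */
        cur = 0;
        return atomic_compare_exchange_strong(&u->usecnt, &cur, -1);
    }

    static void uobj_unlock(struct uobj *u, enum access a)
    {
        assert(atomic_load(&u->usecnt) != 0); /* must currently be held */
        if (a == ACCESS_READ)
            atomic_fetch_sub(&u->usecnt, 1);
        else
            atomic_store(&u->usecnt, 0);
    }
    ```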

     objects
    +--------+
    |        |
    |        |   methods                                                                 +--------+
    |        |   ns         method      method_spec                            +-----+   |len     |
    +--------+  +------+[d]+-------+   +----------------+[d]+-------------+    |attr1+-> |type    |
    | object +> |method+-> | spec  +-> +  attr_buckets  +-> |default_chain+--> +-----+   |idr_type|
    +--------+  +------+   |handler|   |                |   +-------------+    |attr2|   |access  |
    |        |  |      |   +-------+   +----------------+   |driver chain |    +-----+   +--------+
    |        |  |      |                                    +-------------+
    |        |  +------+
    |        |
    |        |
    |        |
    |        |
    |        |
    |        |
    |        |
    |        |
    |        |
    |        |
    +--------+

    [d] = Hash ids to groups using the high order bits

    The right types table is also chosen by using the high bits from
    the ids. Currently we have either default or driver specific groups.

    Once validation and object fetching (or creation) are completed, we
    call the handler:
    int (*handler)(struct ib_device *ib_dev, struct ib_uverbs_file *ufile,
                   struct uverbs_attr_bundle *ctx);

    ctx bundles attributes of different namespaces. Each element there
    is an array of attributes which corresponds to one namespace of
    attributes. For example, in the usually used case:

     ctx                               core
    +----------------------------+     +------------+
    | core:                      +---> | valid      |
    +----------------------------+     | cmd_attr   |
    | driver:                    |     +------------+
    |----------------------------+--+  | valid      |
                                    |  | cmd_attr   |
                                    |  +------------+
                                    |  | valid      |
                                    |  | obj_attr   |
                                    |  +------------+
                                    |
                                    |  drivers
                                    |  +------------+
                                    +> | valid      |
                                       | cmd_attr   |
                                       +------------+
                                       | valid      |
                                       | cmd_attr   |
                                       +------------+
                                       | valid      |
                                       | obj_attr   |
                                       +------------+

    Signed-off-by: Matan Barak
    Reviewed-by: Yishai Hadas
    Signed-off-by: Doug Ledford

    Matan Barak
     

30 Aug, 2017

2 commits

  • The new ioctl based infrastructure either commits or rolls back
    all objects of the method as one transaction. In order to do
    that, we introduce a notion of dealing with a collection of
    objects that are related to a specific method.

    This also requires adding a notion of a method and attribute.
    A method contains a hash of attributes, where each bucket
    contains several attributes. The attributes are hashed according
    to their namespace which resides in the four upper bits of the id.

    For example, an object could be a CQ, which has an action of CREATE_CQ.
    This action has multiple attributes. For example, the CQ's new handle
    and the comp_channel. Each layer in this hierarchy - objects, methods
    and attributes is split into namespaces. The basic example for that is
    one namespace representing the default entities and another one
    representing the driver specific entities.

    When declaring these methods and attributes, we actually declare
    their specifications. When a method is executed, we actually
    allocate some space to hold auxiliary information. This auxiliary
    information contains meta-data about the required objects, such
    as pointers to their type information, pointers to the uobjects
    themselves (if they exist), etc.
    The specification, along with the auxiliary information we allocated
    and filled, is given to the finalize_objects function.

    Signed-off-by: Matan Barak
    Reviewed-by: Yishai Hadas
    Signed-off-by: Doug Ledford

    Matan Barak
     
  • The ioctl infrastructure treats all user-objects in the same manner.
    It gets object ids from the user-space and, by using the object type
    and type attributes mentioned in the object specification, it executes
    the required method. Passing an object id from the user-space as
    an attribute is carried out in three stages. The first is carried out
    before the actual handler and the last is carried out afterwards.

    The different supported operations are read, write, destroy and create.
    In the first stage, the former three actions just fetch the object
    from the repository (by using its id) and lock it. The last action
    allocates a new uobject. Afterwards, the second stage is carried out
    when the handler itself carries out the required modification of the
    object. The last stage is carried out after the handler finishes and
    commits the result. The former two operations just unlock the object.
    Destroy calls the "free object" operation, taking into account the
    object's type and releases the uobject as well. Creation just adds the
    new uobject to the repository, making the object visible to the
    application.

    In order to abstract these details from the ioctl infrastructure
    layer, we add the uverbs_get_uobject_from_context and
    uverbs_finalize_object functions, which correspond to the first
    and last stages respectively.
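
    The first and last stages can be sketched in user space as below. The
    repository is a fixed array, every access takes a simple exclusive
    lock (a simplification), and all names are hypothetical rather than
    the kernel functions named above.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    enum access { ACC_READ, ACC_WRITE, ACC_DESTROY, ACC_NEW };

    #define REPO_SIZE 8

    /* Hypothetical user object: 'visible' means reachable by its id. */
    struct uobject { bool visible; bool locked; };

    static struct uobject repo[REPO_SIZE]; /* zero-initialized repository */

    /* Stage 1: fetch and lock an existing object, or reserve a new one. */
    static struct uobject *stage_get(int id, enum access a)
    {
        assert(id >= 0 && id < REPO_SIZE);
        struct uobject *u = &repo[id];

        if (u->locked)
            return NULL;            /* someone else is using this slot */
        if (a == ACC_NEW ? u->visible : !u->visible)
            return NULL;            /* NEW needs a free id, others a live one */
        u->locked = true;
        return u;
    }

    /* Stage 3: runs after the handler and commits or rolls back. */
    static void stage_finalize(struct uobject *u, enum access a, bool commit)
    {
        u->locked = false;
        if (a == ACC_NEW)
            u->visible = commit;    /* publish only if the handler succeeded */
        else if (a == ACC_DESTROY && commit)
            u->visible = false;     /* the object is gone */
        /* READ and WRITE only needed the unlock. */
    }
    ```

    The handler itself (the second stage) runs between the two calls and
    decides whether finalize is invoked with commit set to true or false.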

    Signed-off-by: Matan Barak
    Reviewed-by: Yishai Hadas
    Signed-off-by: Doug Ledford

    Matan Barak
     

29 Aug, 2017

5 commits

  • This patch adds a new SRQ type - IB_SRQT_TM. The new SRQ type supports tag
    matching and rendezvous offloads for MPI applications.

    When SRQ receives a message it will search through the matching list
    for the corresponding posted receive buffer. The process of searching
    the matching list is called tag matching.
    If tag matching results in a match, the received message is
    placed at the address specified by the receive buffer. If no
    match is found, the message is placed in a generic buffer until the
    corresponding receive buffer is posted. These messages are called
    unexpected and their set is called an unexpected list.
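
    The two lists can be sketched in software as follows. This is a
    simplified model with exact tag equality and small fixed-size lists;
    the types and function names are hypothetical and say nothing about
    how the hardware offload is implemented.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    #define LIST_MAX 4

    struct posted_recv { uint64_t tag; void *buf; int in_use; };
    struct unexpected  { uint64_t tag; int in_use; };

    struct tm_srq {
        struct posted_recv matching[LIST_MAX];   /* posted receive buffers */
        struct unexpected  unexpected[LIST_MAX]; /* arrivals with no match */
    };

    /*
     * A message arrives: return the matched buffer, or NULL when the message
     * went to the unexpected list (or was dropped because that list is full).
     */
    static void *tm_receive(struct tm_srq *srq, uint64_t tag)
    {
        for (size_t i = 0; i < LIST_MAX; i++) {
            if (srq->matching[i].in_use && srq->matching[i].tag == tag) {
                srq->matching[i].in_use = 0; /* a match consumes the entry */
                return srq->matching[i].buf;
            }
        }
        for (size_t i = 0; i < LIST_MAX; i++) {
            if (!srq->unexpected[i].in_use) {
                srq->unexpected[i].tag = tag;
                srq->unexpected[i].in_use = 1;
                break;
            }
        }
        return NULL;
    }

    /*
     * Post a receive: 1 if it matched an unexpected message that had already
     * arrived, 0 if it was added to the matching list, -1 if the list is full.
     */
    static int tm_post(struct tm_srq *srq, uint64_t tag, void *buf)
    {
        assert(buf != NULL);
        for (size_t i = 0; i < LIST_MAX; i++) {
            if (srq->unexpected[i].in_use && srq->unexpected[i].tag == tag) {
                srq->unexpected[i].in_use = 0;
                return 1;
            }
        }
        for (size_t i = 0; i < LIST_MAX; i++) {
            if (!srq->matching[i].in_use) {
                srq->matching[i] = (struct posted_recv){ tag, buf, 1 };
                return 0;
            }
        }
        return -1;
    }
    ```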

    Signed-off-by: Artemy Kovalyov
    Reviewed-by: Yossi Itigin
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Doug Ledford

    Artemy Kovalyov
     
  • Before this change CQ attached to SRQ was part of XRC specific extension.
    Moving CQ handle out makes it available to other types extending SRQ
    functionality.

    Signed-off-by: Artemy Kovalyov
    Reviewed-by: Yossi Itigin
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Doug Ledford

    Artemy Kovalyov
     
  • This patch adds the following TM XRQ capabilities:

    * max_rndv_hdr_size - Max size of rendezvous request message
    * max_num_tags - Max number of entries in tag matching list
    * max_ops - Max number of outstanding list operations
    * max_sge - Max number of SGE in tag matching entry
    * flags - the following flags are currently defined:
      - IB_TM_CAP_RC - Support tag matching on RC transport

    Signed-off-by: Artemy Kovalyov
    Reviewed-by: Yossi Itigin
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Doug Ledford

    Artemy Kovalyov
     
  • A destroy of an MR prior to destroying the QP can cause the following
    diagnostic if the QP is referencing the MR being de-registered:

    hfi1 0000:05:00.0: hfi1_0: rvt_dereg_mr timeout mr ffff880856210800 pd ffff880859b20b00

    The solution is that when a non-zero refcount is encountered while
    the MR is being destroyed, the QPs need to be iterated, looking for
    QPs in the same PD as the MR. If rvt_qp_mr_clean() detects that any
    such QP references the rkey/lkey, the QP needs to be put into an error
    state via a call to rvt_qp_error(), which will trigger the cleanup of
    any stuck references.

    This solution is as specified in IBTA 1.3 Volume 1 11.2.10.5.

    [This is reproduced with the 0.4.9 version of qperf and the rc_bw test]

    Reviewed-by: Dennis Dalessandro
    Signed-off-by: Mike Marciniszyn
    Signed-off-by: Dennis Dalessandro
    Signed-off-by: Doug Ledford

    Mike Marciniszyn
     
  • There are currently 3 spots in the qib and hfi1 driver that have
    knowledge of the internal QP hash list that should only be in
    scope to rdmavt QP code.

    Add an iterator API for processing all QPs to hide the
    nature of the RCU hashlist.

    The API consists of:
    - rvt_qp_iter_init()
      * For iterating QPs one at a time for seq_file semantics
    - rvt_qp_iter_next()
      * For iterating QPs one at a time for seq_file semantics
    - rvt_qp_iter()
      * For iterating all QPs

    The first two are used for things like seq_file prints.

    The last is for code that just needs to iterate all QPs
    in the system.
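
    A callback-style iterator of this kind could look roughly like the
    sketch below. The table layout, the qp_iter() signature and the
    example callback are all hypothetical; the point is only that callers
    never touch the hash buckets themselves.

    ```c
    #include <assert.h>
    #include <stddef.h>

    #define QP_HASH_SIZE 4

    /* Hypothetical QP with just enough state for the example. */
    struct qp {
        int pd;
        unsigned int key;
        int in_error;
        struct qp *next; /* hash chain, private to the table code */
    };

    struct qp_table { struct qp *hash[QP_HASH_SIZE]; };

    /* Visit every QP once; the bucket layout stays hidden in here. */
    static void qp_iter(struct qp_table *t, void *arg,
                        void (*cb)(struct qp *qp, void *arg))
    {
        assert(cb != NULL);
        for (int b = 0; b < QP_HASH_SIZE; b++)
            for (struct qp *q = t->hash[b]; q != NULL; q = q->next)
                cb(q, arg);
    }

    /* Example callback: flag QPs in a given PD that use a given key. */
    struct mr_ref { int pd; unsigned int key; };

    static void error_if_uses_mr(struct qp *qp, void *arg)
    {
        const struct mr_ref *mr = arg;

        if (qp->pd == mr->pd && qp->key == mr->key)
            qp->in_error = 1;
    }
    ```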

    Reviewed-by: Dennis Dalessandro
    Signed-off-by: Mike Marciniszyn
    Signed-off-by: Dennis Dalessandro
    Signed-off-by: Doug Ledford

    Mike Marciniszyn
     

25 Aug, 2017

4 commits

  • Cleanup patch prior to exporting the ib_device_cap_flags
    to user space. In this patch, we are aligning the
    indentation and removing the IB_DEVICE_INIT_TYPE and
    IB_DEVICE_RESERVED fields, because they are not used in the kernel.

    Signed-off-by: Leon Romanovsky
    Reviewed-by: Dennis Dalessandro
    Signed-off-by: Doug Ledford

    Leon Romanovsky
     
  • The functions ib_register_event_handler() and
    ib_unregister_event_handler() always returned success and they can't fail.

    Let's convert those functions to be void, remove redundant checks and
    clean up tons of goto statements.

    Signed-off-by: Leon Romanovsky
    Reviewed-by: Dennis Dalessandro
    Signed-off-by: Doug Ledford

    Leon Romanovsky
     
  • Pick up -rc fixes.

    Signed-off-by: Doug Ledford

    Doug Ledford
     
  • Commit 44c58487d51a ("IB/core: Define 'ib' and 'roce' rdma_ah_attr types")
    introduced the concept of type in ah_attr:
    * During ib_register_device, each port is checked for its type which
      is stored in ib_device's port_immutable array.
    * During uverbs' modify_qp, the type is inferred using the port number
      in ib_uverbs_qp_dest struct (address vector) by accessing the
      relevant port_immutable array and the type is passed on to
      providers.

    IB spec (version 1.3) enforces a valid port value only in Reset to
    Init. During Init to RTR, the address vector must be valid but port
    number is not mentioned as a field in the address vector, so its
    value is not validated, which leads to accesses to non-allocated
    memory when inferring the port type.

    Save the real port number in ib_qp during modify to Init (when the
    comp_mask indicates that the port number is valid) and use this value
    to infer the port type.

    Avoid copying the address vector fields if the matching bit is not set
    in the attr_mask. Address vector can't be modified before the port, so
    no valid flow is affected.

    Fixes: 44c58487d51a ('IB/core: Define 'ib' and 'roce' rdma_ah_attr types')
    Signed-off-by: Noa Osherovich
    Reviewed-by: Yishai Hadas
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Doug Ledford

    Noa Osherovich
     

23 Aug, 2017

8 commits


19 Aug, 2017

3 commits

  • This patch series primarily increases sizes of variables that hold
    lid values from 16 to 32 bits. Additionally, it adds a check in
    the IB mad stack to verify a properly formatted MAD when OPA
    extended LIDs are used.

    Signed-off-by: Don Hiatt
    Reviewed-by: Dennis Dalessandro
    Signed-off-by: Doug Ledford

    Hiatt, Don
     
  • Conflicts:
    drivers/infiniband/core/iwcm.c - The rdma_netlink patches in
    HEAD and the iwarp cm workqueue fix (don't use WQ_MEM_RECLAIM,
    we aren't safe for that context) touched the same code.

    Signed-off-by: Doug Ledford

    Doug Ledford
     
  • A sockaddr_in structure on the stack getting passed into rdma_ip2gid
    triggers this warning, since we memcpy into a larger sockaddr_in6
    structure:

    In function 'memcpy',
    inlined from 'rdma_ip2gid' at include/rdma/ib_addr.h:175:3,
    inlined from 'addr_event.isra.4.constprop' at drivers/infiniband/core/roce_gid_mgmt.c:693:2,
    inlined from 'inetaddr_event' at drivers/infiniband/core/roce_gid_mgmt.c:716:9:
    include/linux/string.h:305:4: error: call to '__read_overflow2' declared with attribute error: detected read beyond size of object passed as 2nd parameter

    The warning seems appropriate here, but the code is also clearly
    correct, so we really just want to shut up this instance of the
    output.

    The best way I found so far is to avoid the memcpy() call and instead
    replace it with a struct assignment.
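
    The idea can be sketched in user space with the standard socket
    types. This is an illustration, not the kernel helper: the gid16
    union and the ip_to_gid() name are hypothetical.

    ```c
    #include <arpa/inet.h>
    #include <assert.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Hypothetical 16-byte GID that can also be viewed as an IPv6 address. */
    union gid16 {
        unsigned char raw[16];
        struct in6_addr in6;
    };

    /* Convert a socket address to a GID; returns 0 on success, -1 otherwise. */
    static int ip_to_gid(const struct sockaddr *addr, union gid16 *gid)
    {
        assert(addr != NULL && gid != NULL);

        switch (addr->sa_family) {
        case AF_INET: {
            const struct sockaddr_in *in4 = (const struct sockaddr_in *)addr;

            /* Build the IPv4-mapped form ::ffff:a.b.c.d. */
            memset(gid->raw, 0, 10);
            gid->raw[10] = 0xff;
            gid->raw[11] = 0xff;
            memcpy(&gid->raw[12], &in4->sin_addr, 4);
            return 0;
        }
        case AF_INET6:
            /* Struct assignment instead of a 16-byte memcpy(). */
            gid->in6 = ((const struct sockaddr_in6 *)addr)->sin6_addr;
            return 0;
        default:
            return -1;
        }
    }
    ```

    The assignment copies exactly one struct in6_addr, so there is no
    length argument for a fortified memcpy() to check against the size of
    the smaller sockaddr_in object.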

    Fixes: 6974f0c4555e ("include/linux/string.h: add the option of fortified string.h functions")
    Cc: Daniel Micay
    Cc: Kees Cook
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Doug Ledford

    Arnd Bergmann
     

11 Aug, 2017

2 commits


10 Aug, 2017

2 commits

  • There is a need to forward the FW version to user space
    applications through RDMA netlink. In order to make it safe, we
    need to declare an nla_policy and limit the size of the FW string.

    The new define IB_FW_VERSION_NAME_MAX will limit the size of the
    FW version string. That define was chosen to be equal to
    ETHTOOL_FWVERS_LEN, because many drivers are limited by that
    value indirectly anyway.

    The introduction of this define allows us to remove the string size
    from get_fw_str function signature.
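
    A sketch of what a size-less signature could look like once every
    caller supplies a buffer of the agreed maximum size. The value 32 is
    an assumption (the text only says the define equals
    ETHTOOL_FWVERS_LEN), and the fw_ver struct and the demo names are
    hypothetical.

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <string.h>

    /* Assumed value; stands in for IB_FW_VERSION_NAME_MAX. */
    #define FW_VERSION_NAME_MAX 32

    /* Hypothetical firmware version triple. */
    struct fw_ver { unsigned int major, minor, sub; };

    /*
     * No size parameter: the caller always provides FW_VERSION_NAME_MAX
     * bytes, and the output is truncated to fit, including the NUL.
     */
    static void demo_get_fw_str(const struct fw_ver *v,
                                char str[FW_VERSION_NAME_MAX])
    {
        assert(v != NULL && str != NULL);
        snprintf(str, FW_VERSION_NAME_MAX, "%u.%u.%u",
                 v->major, v->minor, v->sub);
    }
    ```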

    Signed-off-by: Leon Romanovsky

    Leon Romanovsky
     
  • The .doit callback is used by netlink core to differentiate
    between get and set operations. The common convention is to use
    that call for command operations (SET, ADD, etc.) and/or
    access without the NLM_F_DUMP flag.

    This commit adds proper declaration and implementation
    to RDMA netlink.
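
    The dispatch convention can be sketched as below. The message and
    handler structs, the demo_ names and the -1 error value are
    hypothetical; the flag value is the one from linux/netlink.h,
    redefined locally so the example stays self-contained.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* NLM_F_ROOT | NLM_F_MATCH, as in linux/netlink.h. */
    #define DEMO_NLM_F_DUMP 0x300

    struct demo_msg { uint16_t flags; int op; };

    /* Per-operation callbacks: 'doit' for commands, 'dump' for listings. */
    struct demo_handler {
        int (*doit)(const struct demo_msg *msg);
        int (*dump)(const struct demo_msg *msg);
    };

    /* Route a message to dump when the dump flag is set, else to doit. */
    static int demo_rcv(const struct demo_handler *h, const struct demo_msg *msg)
    {
        assert(h != NULL && msg != NULL);

        if ((msg->flags & DEMO_NLM_F_DUMP) == DEMO_NLM_F_DUMP)
            return h->dump ? h->dump(msg) : -1;
        return h->doit ? h->doit(msg) : -1;
    }

    /* Sample callbacks that only report which path was taken. */
    static int demo_set(const struct demo_msg *msg)  { (void)msg; return 1; }
    static int demo_list(const struct demo_msg *msg) { (void)msg; return 2; }
    ```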

    Signed-off-by: Leon Romanovsky
    Reviewed-by: Steve Wise

    Leon Romanovsky