06 May, 2015

2 commits

  • Addresses the following kernel logs seen during boot of sparc systems:

    Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
    Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
    Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
    Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
    Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]

    Signed-off-by: David Ahern
    Signed-off-by: Doug Ledford

    David Ahern
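
    Illustrative sketch (not from the patch): on sparc, dereferencing a
    64-bit value through a pointer that is not 8-byte aligned traps and
    produces the "Kernel unaligned access" lines above. Copying the bytes
    out first (the kernel's get_unaligned() helpers do the same) avoids the
    unaligned load; the buffer layout below is hypothetical.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Risky on strict-alignment CPUs (sparc): casting an arbitrary byte
     * pointer to uint64_t * and dereferencing it can trap. */
    static uint64_t read_id_direct(const uint8_t *buf, size_t off)
    {
        return *(const uint64_t *)(buf + off);
    }

    /* Safe everywhere: copy the bytes without assuming alignment. */
    static uint64_t read_id_safe(const uint8_t *buf, size_t off)
    {
        uint64_t id;

        memcpy(&id, buf + off, sizeof(id));
        return id;
    }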
     
  • Signed-off-by: Honggang Li
    Acked-by: Sean Hefty
    Signed-off-by: Doug Ledford

    Honggang LI
     

05 May, 2015

1 commit

  • …necting peer to its clients

    Add functionality to enable the port mapper on the passive side to provide to its
    clients the actual (non-mapped) ip/tcp address information of the connecting peer

    1) Adding remote_info_cb() to process the address info of the connecting peer.
    The address info is provided by the user space port mapper service when
    the connection is initiated by the peer.
    2) Adding a hash list to store the remote address info.
    3) Adding functionality to add/remove the remote address info.
    After the info has been provided to the port mapper client,
    it is removed from the hash list.

    Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
    Reviewed-by: Steve Wise <swise@opengridcomputing.com>
    Signed-off-by: Doug Ledford <dledford@redhat.com>

    Tatyana Nikolova
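
    A minimal sketch of the bookkeeping described above, with hypothetical
    names: the remote (non-mapped) peer address is hashed under the mapped
    local address, handed to the port mapper client once, and then removed
    from the hash list.

    #include <stddef.h>
    #include <netinet/in.h>

    #define RINFO_BUCKETS 64

    /* Hypothetical entry: mapped local address as the key, the peer's
     * real (non-mapped) address as the payload. */
    struct rinfo {
        struct sockaddr_in mapped_loc;  /* key */
        struct sockaddr_in remote;      /* connecting peer's real address */
        struct rinfo *next;
    };

    static struct rinfo *rinfo_hash[RINFO_BUCKETS];

    static unsigned int rinfo_bucket(const struct sockaddr_in *k)
    {
        return (k->sin_addr.s_addr ^ k->sin_port) % RINFO_BUCKETS;
    }

    /* Store the info when the user-space port mapper reports a new peer. */
    static void rinfo_add(struct rinfo *ri)
    {
        unsigned int b = rinfo_bucket(&ri->mapped_loc);

        ri->next = rinfo_hash[b];
        rinfo_hash[b] = ri;
    }

    /* Hand the info to the client exactly once, then unlink it. */
    static struct rinfo *rinfo_take(const struct sockaddr_in *key)
    {
        struct rinfo **pp = &rinfo_hash[rinfo_bucket(key)];

        for (; *pp; pp = &(*pp)->next) {
            if ((*pp)->mapped_loc.sin_addr.s_addr == key->sin_addr.s_addr &&
                (*pp)->mapped_loc.sin_port == key->sin_port) {
                struct rinfo *ri = *pp;
                *pp = ri->next;
                return ri;
            }
        }
        return NULL;
    }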
     

06 Feb, 2015

1 commit

  • While commit 7e36ef8205ff ("IB/core: Temporarily disable
    ex_query_device uverb") is correct as it makes the extended
    QUERY_DEVICE uverb (which came as part of commit 5a77abf9a97a
    ("IB/core: Add support for extended query device caps") and commit
    860f10a799c8 ("IB/core: Add flags for on demand paging support")) not
    available to userspace, it doesn't address the initial issue regarding
    ib_copy_to_udata() [1][2].

    Additionally, further discussions around this new uverb seem to
    conclude it would require a different data structure than the one
    currently described in [3].

    Both of these issues require a revert of the changes, so this patch
    partially reverts commit 8cdd312cfed7 ("IB/mlx5: Implement the ODP
    capability query verb") and commit 860f10a799c8 ("IB/core: Add flags
    for on demand paging support") and fully reverts commit 5a77abf9a97a
    ("IB/core: Add support for extended query device caps").

    [1] "Re: [PATCH v3 06/17] IB/core: Add support for extended query device caps"
    http://mid.gmane.org/1418733236.2779.26.camel@opteya.com

    [2] "Re: [PATCH] IB/core: Temporarily disable ex_query_device uverb"
    http://mid.gmane.org/1423067503.3030.83.camel@opteya.com

    [3] "RE: [PATCH v1 1/5] IB/uverbs: ex_query_device: answer must not depend on request's comp_mask"
    http://mid.gmane.org/2807E5FD2F6FDA4886F6618EAC48510E0CC12C30@CRSMSX101.amr.corp.intel.com

    Cc: Eli Cohen
    Cc: Haggai Eran
    Cc: Ira Weiny
    Cc: Jason Gunthorpe
    Cc: Sagi Grimberg
    Cc: Shachar Raindel
    Signed-off-by: Yann Droneaud
    Signed-off-by: Roland Dreier

    Yann Droneaud
     

16 Dec, 2014

7 commits

  • * Add an interval tree implementation for ODP umems. Create an
    interval tree for each ucontext (including a count of the number of
    ODP MRs in this context, semaphore, etc.), and register ODP umems in
    the interval tree.
    * Add MMU notifiers handling functions, using the interval tree to
    notify only the relevant umems and underlying MRs.
    * Register to receive MMU notifier events from the MM subsystem upon
    ODP MR registration (and unregister accordingly).
    * Add a completion object to synchronize the destruction of ODP umems.
    * Add mechanism to abort page faults when there's a concurrent invalidation.

    The way we synchronize between concurrent invalidations and page
    faults is by keeping a counter of currently running invalidations, and
    a sequence number that is incremented whenever an invalidation is
    caught. The page fault code checks the counter and also verifies that
    the sequence number hasn't progressed before it updates the umem's
    page tables. This is similar to what the kvm module does.

    In order to prevent the case where we register a umem in the middle of
    an ongoing notifier, we also keep a per ucontext counter of the total
    number of active mmu notifiers. We only enable new umems when all the
    running notifiers complete.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Shachar Raindel
    Signed-off-by: Haggai Eran
    Signed-off-by: Yuval Dagan
    Signed-off-by: Roland Dreier

    Haggai Eran
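
    A sketch of the synchronization scheme described above, with
    hypothetical field names (the kvm MMU notifier code follows the same
    pattern): invalidations bump a counter while running and a sequence
    number when they finish; the fault path samples the sequence first and
    refuses to install page-table entries if either changed.

    #include <stdbool.h>

    /* Hypothetical per-umem state; updates are assumed to happen under
     * the umem's lock, which is elided here for brevity. */
    struct odp_sync {
        int notifier_count;          /* invalidations currently running */
        unsigned long notifier_seq;  /* bumped when an invalidation ends */
    };

    /* MMU notifier callbacks. */
    static void invalidate_start(struct odp_sync *s)
    {
        s->notifier_count++;
    }

    static void invalidate_end(struct odp_sync *s)
    {
        s->notifier_seq++;
        s->notifier_count--;
    }

    /* Page-fault path: sample the sequence before touching the pages ... */
    static unsigned long fault_begin(struct odp_sync *s)
    {
        return s->notifier_seq;
    }

    /* ... and only update the umem's page tables if no invalidation ran
     * (or is still running) in the meantime; otherwise retry the fault. */
    static bool fault_can_commit(struct odp_sync *s, unsigned long seq)
    {
        return s->notifier_count == 0 && s->notifier_seq == seq;
    }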
     
  • * Extend the umem struct to keep the ODP related data.
    * Allocate and initialize the ODP related information in the umem
    (page_list, dma_list) and free it as needed at the end of the run.
    * Store a reference to the process PID struct in the ucontext. Used to
    safely obtain the task_struct and the mm during fault handling,
    without preventing the task destruction if needed.
    * Add 2 helper functions: ib_umem_odp_map_dma_pages and
    ib_umem_odp_unmap_dma_pages. These functions get the DMA addresses
    of specific pages of the umem (and, currently, pin them).
    * Support for page faults only - IB core will keep the reference on
    the pages used and call put_page when freeing an ODP umem
    area. Invalidations support will be added in a later patch.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Shachar Raindel
    Signed-off-by: Haggai Eran
    Signed-off-by: Majd Dibbiny
    Signed-off-by: Roland Dreier

    Shachar Raindel
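
    Purely illustrative layout of the per-umem ODP bookkeeping described
    above; the field names are hypothetical stand-ins for the page_list and
    dma_list arrays and the PID reference mentioned in the commit.

    #include <stddef.h>
    #include <stdint.h>

    /* One slot per page covered by the registered range. */
    struct odp_umem_state {
        void     **page_list;  /* page pointers, once faulted in          */
        uint64_t  *dma_list;   /* matching DMA addresses for the device   */
        size_t     npages;     /* pages spanned by the umem               */
        void      *owner_pid;  /* reference to the registering task's pid,
                                * so fault handling can look up the mm
                                * without preventing task destruction     */
    };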
     
  • * Add a configuration option to enable on-demand paging support in
    the infiniband subsystem (CONFIG_INFINIBAND_ON_DEMAND_PAGING). In a
    later patch, this configuration option will select the MMU_NOTIFIER
    configuration option to enable mmu notifiers.
    * Add a flag for on demand paging (ODP) support in the IB device capabilities.
    * Add a flag to request ODP MR in the access flags to reg_mr.
    * Fail registrations done with the ODP flag when the low-level driver
    doesn't support this.
    * Change the conditions in which an MR will be writable to explicitly
    specify the access flags. This is to avoid making an MR writable just
    because it is an ODP MR.
    * Add ODP capabilities to the extended query device verb.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Shachar Raindel
    Signed-off-by: Haggai Eran
    Signed-off-by: Roland Dreier

    Sagi Grimberg
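
    A minimal sketch of the registration-time check implied by the bullets
    above ("Fail registrations done with the ODP flag when the low-level
    driver doesn't support this"). The flag names follow the kernel's
    IB_ACCESS_ON_DEMAND / IB_DEVICE_ON_DEMAND_PAGING convention, but the
    bit values here are illustrative only.

    #include <errno.h>

    /* Bit values are illustrative only. */
    #define IB_ACCESS_ON_DEMAND        (1 << 6)     /* reg_mr access flag */
    #define IB_DEVICE_ON_DEMAND_PAGING (1ULL << 31) /* device capability  */

    struct ib_device_caps {
        unsigned long long device_cap_flags;
    };

    /* Reject an ODP registration when the low-level driver has not
     * advertised on-demand paging support. */
    static int check_odp_access(const struct ib_device_caps *caps,
                                int access_flags)
    {
        if ((access_flags & IB_ACCESS_ON_DEMAND) &&
            !(caps->device_cap_flags & IB_DEVICE_ON_DEMAND_PAGING))
            return -EINVAL;
        return 0;
    }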
     
  • Add extensible query device capabilities verb to allow adding new features.
    ib_uverbs_ex_query_device is added and copy_query_dev_fields is used to
    copy capability fields to be used by both ib_uverbs_query_device and
    ib_uverbs_ex_query_device.

    Signed-off-by: Eli Cohen
    Signed-off-by: Haggai Eran
    Signed-off-by: Roland Dreier

    Eli Cohen
     
  • Add a helper function mlx5_ib_read_user_wqe to read information from
    user-space owned work queues. The function will be used in a later
    patch by the page-fault handling code in mlx5_ib.

    Signed-off-by: Haggai Eran

    [ Add stub for ib_umem_copy_from() for CONFIG_INFINIBAND_USER_MEM=n
    - Roland ]

    Signed-off-by: Roland Dreier

    Haggai Eran
     
  • In some drivers there's a need to read data from a user space area
    that was pinned using ib_umem when running from a different process
    context.

    The ib_umem_copy_from function allows reading data from the physical
    pages pinned in the ib_umem struct.

    Signed-off-by: Haggai Eran
    Signed-off-by: Roland Dreier

    Haggai Eran
     
  • In order to allow umems that do not pin memory, we need the umem to
    keep track of its region's address.

    This makes the offset field redundant, and so this patch removes it.

    Signed-off-by: Haggai Eran
    Signed-off-by: Roland Dreier

    Haggai Eran
     

09 Oct, 2014

1 commit

  • Expose more signature setting parameters. We modify the signature API
    to allow usage of some new execution parameters relevant to the data
    integrity feature.

    This patch modifies ib_sig_domain structure by:

    - Deprecate DIF type in signature API (operation will
    be determined by the parameters alone, no DIF type awareness)
    - Add APPTAG check bitmask (for input domain)
    - Add REFTAG remap (increment) flag for each domain
    - Add APPTAG/REFTAG escape options for each domain

    The mlx5 driver is modified to follow the new parameters in HW
    signature setup.

    At the moment the callers (iser/isert) hard-code new parameters (by
    DIF type). In the future, callers will retrieve them from the scsi
    command structure.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Roland Dreier

    Sagi Grimberg
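
    An illustrative (hypothetical) shape of the per-domain parameters
    listed above; the real ib_sig_domain layout in ib_verbs.h differs in
    detail, this only mirrors the bullet points.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical per-domain T10-DIF settings: no DIF-type enum,
     * behaviour is driven entirely by the fields. */
    struct sig_dif_params {
        uint16_t apptag_check_mask; /* which APPTAG bits to verify (input) */
        bool     ref_remap;         /* increment REFTAG per block          */
        bool     app_escape;        /* skip block when APPTAG == escape    */
        bool     ref_escape;        /* skip block when REFTAG == escape    */
        uint16_t app_tag;
        uint32_t ref_tag;
    };

    struct sig_domain_params {
        struct sig_dif_params mem;   /* memory-side domain */
        struct sig_dif_params wire;  /* wire-side domain   */
    };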
     

20 Sep, 2014

1 commit

  • In debugging an application that receives -ENOMEM from ib_reg_mr(), I
    found that ib_umem_get() can fail because the pinned_vm count has
    wrapped causing it to always be larger than the lock limit even with
    RLIMIT_MEMLOCK set to RLIM_INFINITY.

    The wrapping of pinned_vm occurs because the process that calls
    ib_reg_mr() will have its mm->pinned_vm count incremented. Later a
    different process with a different mm_struct than the one that
    allocated the ib_umem struct ends up releasing it, which results in
    decrementing the new process's mm->pinned_vm count past zero and
    wrapping.

    I'm not entirely sure what circumstances cause a different process to
    release the ib_umem than the one that allocated it but the kernel
    stack trace of the freeing process from my situation looks like the
    following:

    Call Trace:
    [] dump_stack+0x19/0x1b
    [] ib_umem_release+0x1f5/0x200 [ib_core]
    [] mlx4_ib_destroy_qp+0x241/0x440 [mlx4_ib]
    [] ib_destroy_qp+0x12c/0x170 [ib_core]
    [] ib_uverbs_close+0x259/0x4e0 [ib_uverbs]
    [] __fput+0xba/0x240
    [] ____fput+0xe/0x10
    [] task_work_run+0xc4/0xe0
    [] do_notify_resume+0x95/0xa0
    [] int_signal+0x12/0x17

    The following patch fixes the issue by storing the pid struct of the
    process that calls ib_umem_get() so that ib_umem_release and/or
    ib_umem_account() can properly decrement the pinned_vm count of the
    correct mm_struct.

    Signed-off-by: Shawn Bohrer
    Reviewed-by: Shachar Raindel
    Signed-off-by: Roland Dreier

    Shawn Bohrer
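
    A kernel-style sketch of the fix described in the last paragraph, with
    hypothetical helper names: ib_umem_get() records the caller's struct
    pid, and the release path resolves it back to the right task and mm
    before decrementing pinned_vm (get_task_pid(), get_pid_task() and
    get_task_mm() are the standard kernel APIs for this).

    #include <linux/pid.h>
    #include <linux/sched.h>
    #include <linux/mm.h>

    /* Sketch only: at registration time, remember who pinned the pages. */
    static void umem_record_owner(struct pid **owner)
    {
        *owner = get_task_pid(current, PIDTYPE_PID);
    }

    /* Sketch only: at release time, charge the *original* mm, not the mm
     * of whichever process happens to drop the last reference. */
    static void umem_uncharge_owner(struct pid *owner, unsigned long npages)
    {
        struct task_struct *task = get_pid_task(owner, PIDTYPE_PID);
        struct mm_struct *mm;

        put_pid(owner);
        if (!task)
            return;
        mm = get_task_mm(task);
        put_task_struct(task);
        if (!mm)
            return;

        down_write(&mm->mmap_sem);
        mm->pinned_vm -= npages;
        up_write(&mm->mmap_sem);
        mmput(mm);
    }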
     

14 Aug, 2014

1 commit


11 Aug, 2014

2 commits


02 Aug, 2014

1 commit

  • Memory re-registration is a feature that enables changing the
    attributes of a memory region registered by user-space, including PD,
    translation (address and length) and access flags.

    Add the required support in uverbs and the kernel verbs API.

    Signed-off-by: Matan Barak
    Signed-off-by: Or Gerlitz
    Signed-off-by: Roland Dreier

    Matan Barak
     

11 Jun, 2014

2 commits

  • … 'noio', 'ocrdma', 'qib', 'srp' and 'usnic' into for-next

    Roland Dreier
     
  • This patch adds iWARP Port Mapper (IWPM) Version 2 support. The iWARP
    Port Mapper implementation is based on the port mapper specification
    section in the Sockets Direct Protocol paper -
    http://www.rdmaconsortium.org/home/draft-pinkerton-iwarp-sdp-v1.0.pdf

    Existing iWARP RDMA providers use the same IP address as the native
    TCP/IP stack when creating RDMA connections. They need a mechanism to
    claim the TCP ports used for RDMA connections to prevent TCP port
    collisions when other host applications use TCP ports. The iWARP Port
    Mapper provides a standard mechanism to accomplish this. Without this
    service it is possible for an RDMA application to bind/listen on the same
    port which is already being used by a native TCP host application. If
    that happens, the incoming TCP connection data can be passed to the
    RDMA stack in error.

    The iWARP Port Mapper solution doesn't contain any changes to the
    existing network stack in kernel space. All the changes are
    contained within the infiniband tree and in user space.

    The iWARP Port Mapper service is implemented as a user space daemon
    process. Source for the IWPM service is located at
    http://git.openfabrics.org/git?p=~tnikolova/libiwpm-1.0.0/.git;a=summary

    When starting a connection, the iWARP driver (port mapper client) sends
    the IWPM service the local IP address and TCP port it has received from
    the RDMA application. The IWPM service performs a socket bind from user
    space to get an available TCP port, called a mapped port, and
    communicates it back to the client. In that sense, the IWPM service is
    used to map the TCP port which the RDMA application uses to any port
    available from the host TCP port space. The mapped ports are used in
    iWARP RDMA connections to avoid collisions with the native TCP stack,
    which is aware that these ports are taken. When an RDMA connection using
    a mapped port is terminated, the client notifies the IWPM service, which
    then releases the TCP port.

    The message exchange between the IWPM service and the iWARP drivers
    (between user space and kernel space) is implemented using netlink
    sockets.

    1) Netlink interface functions are added: ibnl_unicast() and
    ibnl_multicast() for sending netlink messages to user space

    2) The signature of the existing ibnl_put_msg() is changed to be more
    generic

    3) Two netlink clients are added: RDMA_NL_NES, RDMA_NL_C4IW
    corresponding to the two iWARP drivers, nes and cxgb4, which use
    the IWPM service

    4) Enums are added to enumerate the attributes in the netlink
    messages, which are exchanged between the user space IWPM service
    and the iWARP drivers

    Signed-off-by: Tatyana Nikolova
    Signed-off-by: Steve Wise
    Reviewed-by: PJ Waskiewicz

    [ Fold in range checking fixes and nlh_next removal as suggested by Dan
    Carpenter and Steve Wise. Fix sparse endianness in hash. - Roland ]

    Signed-off-by: Roland Dreier

    Tatyana Nikolova
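
    As an illustration of the user-space side described above ("performs a
    socket bind from user space to get an available TCP port"), a daemon
    can bind to port 0 and read back the port the kernel assigned; this is
    a generic sketch, not code from the libiwpm daemon.

    #include <stdint.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    /* Claim an arbitrary free TCP port on the given local IP and report it.
     * The daemon would keep the socket open so the native stack treats the
     * port as taken, and return the mapped port to the iWARP driver. */
    static int claim_mapped_port(const char *ip, uint16_t *mapped_port)
    {
        struct sockaddr_in addr = { .sin_family = AF_INET };
        socklen_t len = sizeof(addr);
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;
        inet_pton(AF_INET, ip, &addr.sin_addr);
        addr.sin_port = 0;                      /* let the kernel pick */

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
            close(fd);
            return -1;
        }
        *mapped_port = ntohs(addr.sin_port);
        return fd;                              /* keep open while mapped */
    }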
     

05 Jun, 2014

1 commit

  • Fix a few functions that are declared with __attribute_const__ in the
    ib_verbs.h header file but defined without it in verbs.c. This gets rid
    of the following sparse warnings:

    drivers/infiniband/core/verbs.c:51:5: error: symbol 'ib_rate_to_mult' redeclared with different type (originally declared at include/rdma/ib_verbs.h:469) - different modifiers
    drivers/infiniband/core/verbs.c:68:14: error: symbol 'mult_to_ib_rate' redeclared with different type (originally declared at include/rdma/ib_verbs.h:607) - different modifiers
    drivers/infiniband/core/verbs.c:85:5: error: symbol 'ib_rate_to_mbps' redeclared with different type (originally declared at include/rdma/ib_verbs.h:476) - different modifiers
    drivers/infiniband/core/verbs.c:111:1: error: symbol 'rdma_node_get_transport' redeclared with different type (originally declared at include/rdma/ib_verbs.h:84) - different modifiers

    Signed-off-by: Roland Dreier

    Roland Dreier
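
    A minimal illustration of what sparse is complaining about: the
    declaration and the definition of a function must carry the same
    attributes (__attribute_const__ expands to __attribute__((const))).
    The helper name below is hypothetical.

    /* Declaration, as in a header (a simplified stand-in for the
     * ib_verbs.h prototypes): marked const, i.e. no side effects. */
    int __attribute__((const)) rate_to_mult(int rate);

    /* Definition: must repeat the attribute, otherwise sparse reports
     * "redeclared with different type ... different modifiers". */
    int __attribute__((const)) rate_to_mult(int rate)
    {
        return 1 << rate;   /* placeholder body for the illustration */
    }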
     

03 Jun, 2014

1 commit

  • This addresses a problem where NFS client writes over IPoIB connected
    mode may deadlock on memory allocation/writeback.

    The problem is not directly memory reclamation. There is an indirect
    dependency between network filesystems writing back pages and
    ipoib_cm_tx_init() due to how a kworker is used. Page reclaim cannot
    make forward progress until ipoib_cm_tx_init() succeeds and it is
    stuck in page reclaim itself waiting for network transmission.
    Ordinarily this situation may be avoided by having the caller use
    GFP_NOFS but ipoib_cm_tx_init() does not have that information.

    To address this, take a general approach and add a new QP creation
    flag that tells the low-level hardware driver to use GFP_NOIO for the
    memory allocations related to the new QP.

    Use the new flag in the ipoib connected mode path, and if the driver
    doesn't support it, re-issue the QP creation without the flag.

    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Kosina
    Signed-off-by: Or Gerlitz
    Signed-off-by: Roland Dreier

    Or Gerlitz
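
    A kernel-style sketch of the fallback described in the last paragraph.
    IB_QP_CREATE_USE_GFP_NOIO is the creation flag this series adds (treat
    the exact identifier as an assumption); ipoib-style code retries
    without the flag when the provider rejects it.

    #include <linux/err.h>
    #include <linux/errno.h>
    #include <rdma/ib_verbs.h>

    /* Sketch: ask the provider to allocate the QP with GFP_NOIO so the
     * allocation cannot recurse into I/O (and deadlock against writeback);
     * fall back to a plain creation if the driver rejects the flag. */
    static struct ib_qp *create_tx_qp(struct ib_pd *pd,
                                      struct ib_qp_init_attr *attr)
    {
        struct ib_qp *qp;

        attr->create_flags |= IB_QP_CREATE_USE_GFP_NOIO;
        qp = ib_create_qp(pd, attr);
        if (IS_ERR(qp) && PTR_ERR(qp) == -EINVAL) {
            attr->create_flags &= ~IB_QP_CREATE_USE_GFP_NOIO;
            qp = ib_create_qp(pd, attr);
        }
        return qp;
    }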
     

03 Apr, 2014

1 commit


02 Apr, 2014

2 commits

  • The code that resolves the passive side source MAC within the rdma_cm
    connection request handler was both redundant and buggy, so remove it.

    It was redundant since later, when an RC QP is modified to RTR state,
    the resolution will take place in the ib_core module. It was buggy
    because this callback also deals with UD SIDR exchange, for which we
    incorrectly looked at the REQ member of the CM event and dereferenced
    a random value.

    Fixes: dd5f03beb4f7 ("IB/core: Ethernet L2 attributes in verbs/cm structures")
    Signed-off-by: Moni Shoua
    Signed-off-by: Or Gerlitz
    Signed-off-by: Roland Dreier

    Moni Shoua
     
  • The code is replaced by driver specific changes and avoids the pointer
    NULL test for drivers that don't overload these operations.

    Suggested-by:
    Reviewed-by: Dennis Dalessandro
    Tested-by: Vinod Kumar
    Signed-off-by: Mike Marciniszyn
    Signed-off-by: Roland Dreier

    Mike Marciniszyn
     

08 Mar, 2014

2 commits

  • Introduce a verbs interface for signature-related operations. A
    signature handover operation configures the layouts of data and
    protection attributes both in memory and wire domains.

    Signature operations are:

    - INSERT:
    Generate and insert protection information when handing over
    data from input space to output space.
    - validate and STRIP:
    Validate protection information and remove it when handing over
    data from input space to output space.
    - validate and PASS:
    Validate protection information and pass it when handing over
    data from input space to output space.

    Once the signature handover operation is done, the HCA will offload
    data integrity generation/validation while performing the actual data
    transfer.

    Additions:

    1. HCA signature capabilities in device attributes
    Verbs provider supporting signature handover operations fills
    relevant fields in device attributes structure returned by
    ib_query_device.

    2. QP creation flag IB_QP_CREATE_SIGNATURE_EN
    Creating a QP that will carry signature handover operations may
    require some special preparations from the verbs provider. So we
    add QP creation flag IB_QP_CREATE_SIGNATURE_EN to declare that the
    created QP may carry out signature handover operations. Expose
    signature support to verbs layer (no support for now).

    3. New send work request IB_WR_REG_SIG_MR
    Signature handover work request. This WR will define the signature
    handover properties of the memory/wire domains as well as the
    domains layout. The purpose of this work request is to bind all
    the needed information for the signature operation:

    - data to be transferred: wr->sg_list (ib_sge).
    * The raw data, pre-registered to a single MR (normally, before
    signature, this MR would have been used directly for the data
    transfer)
    - data protection guards: sig_handover.prot (ib_sge).
    * The data protection buffer, pre-registered to a single MR, which
    contains the data integrity guards of the raw data blocks.
    Note that it may not always exist, only in cases where the user is
    interested in storing protection guards in memory.
    - signature operation attributes: sig_handover.sig_attrs.
    * Tells the HCA how to validate/generate the protection information.

    Once the work request is executed, the memory region that will
    describe the signature transaction will be the sig_mr. The
    application can now go ahead and send the sig_mr.rkey or use the
    sig_mr.lkey for data transfer.

    4. New Verb ib_check_mr_status
    check_mr_status verb checks the status of the memory region post
    transaction. The first check that may be used is
    IB_MR_CHECK_SIG_STATUS, which will indicate if any signature
    errors are pending for a specific signature-enabled ib_mr. This
    verb is a lightweight check and is allowed to be taken from
    interrupt context. An application must call this verb after it is
    known that the actual data transfer has finished.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Roland Dreier

    Sagi Grimberg
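
    A hedged usage sketch of the new verb from point 4. The signature
    assumed here (MR, check mask, ib_mr_status out-parameter) should be
    verified against ib_verbs.h before relying on it.

    #include <linux/errno.h>
    #include <rdma/ib_verbs.h>

    /* After the data transfer completes, ask the HCA whether any signature
     * (data-integrity) errors were recorded for this signature-enabled MR. */
    static int check_sig_errors(struct ib_mr *sig_mr)
    {
        struct ib_mr_status mr_status;
        int ret;

        ret = ib_check_mr_status(sig_mr, IB_MR_CHECK_SIG_STATUS, &mr_status);
        if (ret)
            return ret;                 /* the verb itself failed */

        if (mr_status.fail_status & IB_MR_CHECK_SIG_STATUS)
            return -EIO;                /* integrity error pending */

        return 0;
    }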
     
  • This commit introduces verbs for creating/destroying memory
    regions which will allow new types of memory key operations such
    as protected memory registration.

    Indirect memory registration is registering several (one
    or more) pre-registered memory regions in a specific layout.
    The indirect region may potentially describe several regions
    and some repetition format between them.

    Protected Memory registration is registering a memory region
    with various data integrity attributes which will describe protection
    schemes that will be handled by the HCA in an offloaded manner.
    These memory regions will be applicable for a new REG_SIG_MR
    work request introduced later in this patchset.

    In the future these routines may replace or implement current memory
    regions creation routines existing today:
    - ib_reg_user_mr
    - ib_alloc_fast_reg_mr
    - ib_get_dma_mr
    - ib_dereg_mr

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Roland Dreier

    Sagi Grimberg
     

05 Mar, 2014

1 commit

  • This patch refactors the IB core umem code and vendor drivers to use a
    linear (chained) SG table instead of chunk list. With this change the
    relevant code becomes clearer—no need for nested loops to build and
    use umem.

    Signed-off-by: Shachar Raindel
    Signed-off-by: Yishai Hadas
    Signed-off-by: Roland Dreier

    Yishai Hadas
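
    A sketch of why the change simplifies consumers: with a single chained
    scatterlist, drivers iterate once with for_each_sg() instead of walking
    a list of chunks with a nested loop. This is an illustration, not code
    from the patch.

    #include <linux/scatterlist.h>

    /* After the refactoring: one linear/chained SG table per umem. */
    static void walk_umem_pages(struct scatterlist *sg_head, int nents)
    {
        struct scatterlist *sg;
        int i;

        for_each_sg(sg_head, sg, nents, i) {
            /* sg_dma_address(sg) / sg_dma_len(sg) describe each block;
             * previously this required an outer loop over the umem's
             * chunk list and an inner loop over each chunk's pages. */
        }
    }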
     

14 Feb, 2014

1 commit

  • For userspace RoCE UD QPs we need to know the GID format that the
    kernel uses, e.g. when working over older kernels. To that end, add a
    new port capability IB_PORT_IP_BASED_GIDS and report it when query
    port is issued.

    Signed-off-by: Moni Shoua
    Signed-off-by: Matan Barak
    Signed-off-by: Or Gerlitz
    Signed-off-by: Roland Dreier

    Moni Shoua
     

23 Jan, 2014

2 commits


19 Jan, 2014

2 commits

  • Currently, the IB core and specifically the RDMA-CM assumes that IBoE
    (RoCE) GIDs encode the MAC address (and possibly the VLAN ID) of the
    related Ethernet netdevice interface.

    Change GIDs to be treated as if they encode the interface IP address.

    Since Ethernet layer 2 address parameters are no longer encoded
    within GIDs, we have to extend the InfiniBand address structures (e.g.
    ib_ah_attr) with layer 2 address parameters, namely mac and vlan.

    Signed-off-by: Moni Shoua
    Signed-off-by: Or Gerlitz
    Signed-off-by: Roland Dreier

    Moni Shoua
     
  • Add the complementary RDMA_NODE_USNIC_UDP for RDMA_TRANSPORT_USNIC_UDP.

    Signed-off-by: Upinder Malhi
    Signed-off-by: Roland Dreier

    Upinder Malhi
     

15 Jan, 2014

3 commits

  • This patch adds support for Ethernet L2 attributes in the
    verbs/cm/cma structures.

    When dealing with L2 Ethernet, we should use smac, dmac, vlan ID and priority
    in a manner similar to how the IB L2 (and the L4 PKEY) attributes are used.

    Thus, those attributes were added to the following structures:

    * ib_ah_attr - added dmac
    * ib_qp_attr - added smac and vlan_id, (sl remains vlan priority)
    * ib_wc - added smac, vlan_id
    * ib_sa_path_rec - added smac, dmac, vlan_id
    * cm_av - added smac and vlan_id

    For the path record structure, extra care was taken to avoid the new
    fields when packing it into wire format, so we don't break the IB CM
    and SA wire protocol.

    On the active side, the CM fills its internal structures from the
    path provided by the ULP. We add code there to take the ETH L2
    attributes and place them into the CM address handle (struct cm_av).

    On the passive side, the CM fills its internal structures from the WC
    associated with the REQ message. We add code there to take the ETH L2
    attributes from the WC.

    When the HW driver provides the required ETH L2 attributes in the WC,
    it sets the IB_WC_WITH_SMAC and IB_WC_WITH_VLAN flags. The IB core
    code checks for the presence of these flags and, in their absence,
    performs address resolution in the ib_init_ah_from_wc() helper function.

    ib_modify_qp_is_ok is also updated to consider the link layer. Some
    parameters are mandatory for Ethernet link layer, while they are
    irrelevant for IB. Vendor drivers are modified to support the new
    function signature.

    Signed-off-by: Matan Barak
    Signed-off-by: Or Gerlitz
    Signed-off-by: Roland Dreier

    Matan Barak
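
    A small sketch of the passive-side check described above: if the
    provider filled the L2 fields in the work completion it sets the flags,
    otherwise the core falls back to address resolution. The smac/vlan_id
    fields and the two flags are taken from the commit text; the fallback
    helper here is a hypothetical stub.

    #include <linux/errno.h>
    #include <linux/if_ether.h>
    #include <linux/string.h>
    #include <rdma/ib_verbs.h>

    /* Hypothetical fallback: resolve the peer MAC from its GID/IP. */
    static int resolve_l2_from_gid(struct ib_wc *wc, u8 *dmac_out)
    {
        return -ENOENT;                         /* stubbed for the sketch */
    }

    /* Sketch of ib_init_ah_from_wc()-style logic for RoCE. */
    static int fill_dest_l2(struct ib_wc *wc, u8 *dmac_out, u16 *vlan_out)
    {
        if (wc->wc_flags & IB_WC_WITH_SMAC)
            memcpy(dmac_out, wc->smac, ETH_ALEN); /* peer's source MAC */
        else if (resolve_l2_from_gid(wc, dmac_out))
            return -EHOSTUNREACH;

        if (wc->wc_flags & IB_WC_WITH_VLAN)
            *vlan_out = wc->vlan_id;
        else
            *vlan_out = 0xffff;                   /* sketch: no VLAN tag */

        return 0;
    }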
     
  • This patch adds preliminary support for IB L2 device-managed steering,
    currently exposed only in the kernel.

    This flow spec can be used by low-level drivers that need to indicate
    the link layer type when creating device-managed flow rules.

    Signed-off-by: Matan Barak
    Signed-off-by: Or Gerlitz
    Signed-off-by: Roland Dreier

    Matan Barak
     
  • When creating an IPoIB UD QP, provide a hint to the low level driver
    that the QP should support flow-steering. This means that privileged
    user space applications can steer TCP/IP IPoIB traffic from the
    network stack, in a manner similar to what is done with Ethernet RAW_PACKET QPs.

    The hint is provided through a new QP creation flag called NETIF_QP.

    Signed-off-by: Matan Barak
    Signed-off-by: Or Gerlitz
    Signed-off-by: Roland Dreier

    Matan Barak
     

14 Jan, 2014

1 commit


17 Dec, 2013

1 commit

  • Userspace input buffer is not modified by kernel, so it can be 'const'.

    This is also a prerequisite to remove the implicit cast
    from INIT_UDATA().

    Link: http://marc.info/?i=cover.1386798254.git.ydroneaud@opteya.com
    Signed-off-by: Yann Droneaud
    Signed-off-by: Roland Dreier

    Yann Droneaud
     

18 Nov, 2013

2 commits

  • …s', 'ocrdma', 'qib' and 'srp' into for-next

    Roland Dreier
     
  • Commit 400dbc96583f ("IB/core: Infrastructure for extensible uverbs
    commands") added an infrastructure for extensible uverbs commands
    while later commit 436f2ad05a0b ("IB/core: Export ib_create/destroy_flow
    through uverbs") exported ib_create_flow()/ib_destroy_flow() functions
    using this new infrastructure.

    According to the commit 400dbc96583f, the purpose of this
    infrastructure is to support passing around provider (eg. hardware)
    specific buffers when userspace issue commands to the kernel, so that
    it would be possible to extend uverbs (eg. core) buffers independently
    from the provider buffers.

    But the new kernel command function prototypes were not modified to
    take advantage of this extension. This issue was exposed by Roland
    Dreier in a previous review[1].

    So the following patch is an attempt at a revised extensible command
    infrastructure.

    This improved extensible command infrastructure distinguishes the
    core (eg. legacy) command/response buffers from the provider
    (eg. hardware) command/response buffers: each extended command
    implementing function is given a struct ib_udata to hold the core
    (eg. uverbs) input and output buffers, and another struct ib_udata to
    hold the hw (eg. provider) input and output buffers.

    Having those buffers identified separately makes it easier to increase
    one buffer to support extension without having to add code to
    guess the exact size of each command/response part: this should make
    the extended functions more reliable.

    Additionally, instead of relying on the command identifier being greater
    than IB_USER_VERBS_CMD_THRESHOLD, the proposed infrastructure relies on
    unused bits in the command field: of the 32 bits provided by the command
    field, only 6 bits are really needed to encode the identifier of
    commands currently supported by the kernel. (Even using only 6 bits
    leaves room for about 23 new commands.)

    So this patch makes use of some high order bits in the command field to
    store flags, leaving enough room for more command identifiers than one
    will ever need (eg. 256).

    The new flags are used to specify if the command should be processed
    as an extended one or a legacy one. While designing the new command
    format, care was taken to make usage of flags itself extensible.

    Using high order bits of the command field ensures that newer
    libibverbs on an older kernel will properly fail when trying to call
    extended commands. On the other hand, older libibverbs on a newer kernel
    will never be able to issue calls to extended commands.

    The extended command header includes the optional response pointer so
    that output buffer length and output buffer pointer are located
    together in the command, allowing proper parameter checking. This
    should make implementing functions easier and safer.

    Additionally the extended header ensures 64-bit alignment, while making
    all sizes a multiple of 8 bytes, extending the maximum buffer sizes:

                               legacy     extended

    Maximum command buffer:    256KBytes  1024KBytes (512KBytes + 512KBytes)
    Maximum response buffer:   256KBytes  1024KBytes (512KBytes + 512KBytes)

    For the purpose of doing proper buffer size accounting, the header
    sizes are no longer taken into account in "in_words".

    One oddity of the current extensible infrastructure, reading the
    "legacy" command header twice, is fixed by removing the "legacy"
    command header from the extended command header: they are processed as
    two different parts of the command, memory is read once and
    information is not duplicated. This makes it clear that it is an
    extended command scheme and not a different command scheme.

    The proposed scheme will format input (command) and output (response)
    buffers this way:

    - command:

    legacy header +
    extended header +
    command data (core + hw):

    +----------------------------------------+
    | flags    |    00      00    | command  |
    |      in_words      |     out_words     |
    +----------------------------------------+
    |                 response               |
    |                 response               |
    | provider_in_words | provider_out_words |
    |                 padding                |
    +----------------------------------------+
    |                                        |
    .                                        .
    .             (in_words * 8)             .
    |                                        |
    +----------------------------------------+
    |                                        |
    .                                        .
    .        (provider_in_words * 8)         .
    |                                        |
    +----------------------------------------+

    - response, if present:

    +----------------------------------------+
    |                                        |
    .                                        .
    .             (out_words * 8)            .
    |                                        |
    +----------------------------------------+
    |                                        |
    .                                        .
    .        (provider_out_words * 8)        .
    |                                        |
    +----------------------------------------+

    The overall design is to ensure that the extensible infrastructure is
    itself extensible while being more reliable, with more input and bounds
    checking.

    Note:

    The unused field in the extended header would be a perfect candidate to
    hold the command "comp_mask" (eg. a bit field used to handle
    compatibility). This was suggested by Roland Dreier in a previous
    review[2]. But the "comp_mask" field is likely to be present in the uverb
    input and/or provider input, and likewise for the response, as noted by
    Matan Barak[3], so it doesn't make sense to put "comp_mask" in the
    header.

    [1]:
    http://marc.info/?i=CAL1RGDWxmM17W2o_era24A-TTDeKyoL6u3NRu_=t_dhV_ZA9MA@mail.gmail.com

    [2]:
    http://marc.info/?i=CAL1RGDXJtrc849M6_XNZT5xO1+ybKtLWGq6yg6LhoSsKpsmkYA@mail.gmail.com

    [3]:
    http://marc.info/?i=525C1149.6000701@mellanox.com

    Signed-off-by: Yann Droneaud
    Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com

    [ Convert "ret ? ret : 0" to the equivalent "ret". - Roland ]

    Signed-off-by: Roland Dreier

    Yann Droneaud
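
    A hedged sketch of the wire layout in the diagram above, expressed as C
    structures: the legacy header carries the command id and the core word
    counts, and the extended header carries the response pointer plus the
    provider word counts. Field names follow the diagram; check
    ib_user_verbs.h for the authoritative definitions.

    #include <linux/types.h>

    /* Legacy header: flags and the command id share the 32-bit command
     * field; only the low bits encode the command, high bits hold flags. */
    struct ib_uverbs_cmd_hdr {
        __u32 command;     /* flags | 00 00 | command                */
        __u16 in_words;    /* core command length, in 8-byte words   */
        __u16 out_words;   /* core response length, in 8-byte words  */
    };

    /* Extended header, present when the "extended" flag is set. */
    struct ib_uverbs_ex_cmd_hdr {
        __u64 response;           /* user pointer to the response buffer */
        __u16 provider_in_words;  /* hw (provider) command length        */
        __u16 provider_out_words; /* hw (provider) response length       */
        __u32 cmd_hdr_reserved;   /* padding, keeps the header 8-aligned */
    };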
     

16 Nov, 2013

1 commit