02 Nov, 2011

1 commit


14 Oct, 2011

6 commits

  • XRC TGT QPs are shared resources among multiple processes. Since the
    creating process may exit, allow other processes which share the same
    XRC domain to open an existing QP. This allows us to transfer
    ownership of an XRC TGT QP to another process.

    Signed-off-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Sean Hefty
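
    As a rough illustration of the ib_open_qp() usage described above (the
    ib_qp_open_attr field names follow the ib_verbs.h of this period, and
    my_qp_event_handler/my_ctx/tgt_qpn are hypothetical), a second process
    sharing the XRC domain might do:

      struct ib_qp_open_attr open_attr = {
              .event_handler = my_qp_event_handler,  /* hypothetical handler */
              .qp_context    = my_ctx,               /* hypothetical context */
              .qp_num        = tgt_qpn,              /* QP number shared out of band */
              .qp_type       = IB_QPT_XRC_TGT,
      };
      struct ib_qp *qp = ib_open_qp(xrcd, &open_attr);

      if (IS_ERR(qp))
              return PTR_ERR(qp);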
     
  • Allow user space to create XRC domains. Because XRCDs are expected to
    be shared among multiple processes, we use inodes to identify an XRCD.

    Based on patches by Jack Morgenstein

    Signed-off-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Sean Hefty
     
  • XRC TGT QPs are intended to be shared among multiple users and
    processes. Allow the destruction of an XRC TGT QP to be done explicitly
    through ib_destroy_qp() or when the XRCD is destroyed.

    To support destroying an XRC TGT QP, we need to track TGT QPs with the
    XRCD. When the XRCD is destroyed, all tracked XRC TGT QPs are also
    cleaned up.

    To avoid stale reference issues, if a user is holding a reference on a
    TGT QP, we increment a reference count on the QP. The user releases the
    reference by calling ib_release_qp. This releases any access to the QP
    from a user above verbs, but allows the QP to continue to exist until
    destroyed by the XRCD.

    Signed-off-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Sean Hefty
     
  • XRC ("eXtended reliable connected") is an IB transport that provides
    better scalability by allowing senders to specify which shared receive
    queue (SRQ) should be used to receive a message, which essentially
    allows one transport context (QP connection) to serve multiple
    destinations (as long as they share an adapter, of course).

    XRC communication is between an initiator (INI) QP and a target (TGT)
    QP. Target QPs are associated with SRQs through an XRCD. An XRC TGT QP
    behaves like a receive-only RD QP. XRC INI QPs behave similarly to RC
    QPs, except that work requests posted to an XRC INI QP must specify the
    remote SRQ that is the target of the work request.

    We define two new QP types for XRC, to distinguish between INI and TGT
    QPs, and update the core layer to support XRC QPs.

    This patch is derived from work by Jack Morgenstein

    Signed-off-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Sean Hefty
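
    A minimal sketch of creating the two new QP types (named IB_QPT_XRC_INI
    and IB_QPT_XRC_TGT in ib_verbs.h of this period); passing a NULL PD plus
    an xrcd for the TGT QP is an assumption based on the description above:

      /* XRC INI QP: send side, behaves like an RC QP */
      struct ib_qp_init_attr ini_attr = {
              .qp_type     = IB_QPT_XRC_INI,
              .send_cq     = cq,
              .recv_cq     = cq,        /* INI QPs do not receive data */
              .sq_sig_type = IB_SIGNAL_ALL_WR,
              .cap         = { .max_send_wr = 64, .max_send_sge = 1 },
      };
      struct ib_qp *ini_qp = ib_create_qp(pd, &ini_attr);

      /* XRC TGT QP: receive side, hangs off the shared XRC domain */
      struct ib_qp_init_attr tgt_attr = {
              .qp_type = IB_QPT_XRC_TGT,
              .xrcd    = xrcd,
      };
      struct ib_qp *tgt_qp = ib_create_qp(NULL, &tgt_attr);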
     
  • XRC ("eXtended reliable connected") is an IB transport that provides
    better scalability by allowing senders to specify which shared receive
    queue (SRQ) should be used to receive a message, which essentially
    allows one transport context (QP connection) to serve multiple
    destinations (as long as they share an adapter, of course).

    XRC defines SRQs that are specifically used by XRC connections. Expand
    the SRQ code to support XRC SRQs. An XRC SRQ is currently restricted to
    only XRC use according to the IB XRC Annex.

    Portions of this patch were derived from work by Jack Morgenstein.

    Signed-off-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Sean Hefty
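
    For illustration, an XRC SRQ would be allocated through the expanded SRQ
    code roughly as below; the srq_type/ext.xrc layout reflects the
    ib_srq_init_attr of this period and should be read as a sketch:

      struct ib_srq_init_attr srq_attr = {
              .attr     = { .max_wr = 256, .max_sge = 1 },
              .srq_type = IB_SRQT_XRC,
              .ext.xrc  = {
                      .xrcd = xrcd,     /* the shared XRC domain */
                      .cq   = recv_cq,  /* CQ for receive completions */
              },
      };
      struct ib_srq *srq = ib_create_srq(pd, &srq_attr);

      if (IS_ERR(srq))
              return PTR_ERR(srq);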
     
  • Currently, there is only a single ("basic") type of SRQ, but with XRC
    support we will add a second. Prepare for this by defining an SRQ type
    and setting all current users to IB_SRQT_BASIC.

    Signed-off-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Sean Hefty
     

13 Oct, 2011

1 commit

  • XRC ("eXtended reliable connected") is an IB transport that provides
    better scalability by allowing senders to specify which shared receive
    queue (SRQ) should be used to receive a message, which essentially
    allows one transport context (QP connection) to serve multiple
    destinations (as long as they share an adapter, of course).

    A few new concepts are introduced to support this. This patch adds:

    - A new device capability flag, IB_DEVICE_XRC, which low-level
    drivers set to indicate that a device supports XRC.
    - A new object type, XRC domains (struct ib_xrcd), and new verbs
    ib_alloc_xrcd()/ib_dealloc_xrcd(). XRCDs are used to limit which
    XRC SRQs an incoming message can target.

    This patch is derived from work by Jack Morgenstein.

    Signed-off-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Sean Hefty
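
    A hedged sketch of a kernel consumer probing the new capability flag and
    setting up a domain (error handling trimmed; "device" is assumed to come
    from the client's add() callback):

      struct ib_device_attr dev_attr;
      struct ib_xrcd *xrcd;
      int ret;

      ret = ib_query_device(device, &dev_attr);
      if (ret)
              return ret;
      if (!(dev_attr.device_cap_flags & IB_DEVICE_XRC))
              return -ENOSYS;           /* no XRC support on this device */

      xrcd = ib_alloc_xrcd(device);
      if (IS_ERR(xrcd))
              return PTR_ERR(xrcd);
      /* ... create XRC SRQs and QPs against xrcd ... */
      ib_dealloc_xrcd(xrcd);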
     

12 Oct, 2011

1 commit

  • Introduce support for the following extended speeds:

    FDR-10: a Mellanox proprietary link speed which is 10.3125 Gbps with
    64b/66b encoding rather than 8b/10b encoding.
    FDR: IBA extended speed 14.0625 Gbps.
    EDR: IBA extended speed 25.78125 Gbps.

    Signed-off-by: Marcel Apfelbaum
    Reviewed-by: Hal Rosenstock
    Reviewed-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Marcel Apfelbaum
     

27 Jul, 2011

1 commit

  • This allows us to move duplicated code in <asm/atomic.h>
    (atomic_inc_not_zero() for now) to <linux/atomic.h>.

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     

19 Jul, 2011

1 commit

  • Add IB GID change event type. This is needed for IBoE when the HW
    driver updates the GID table (e.g. when VLANs are added or deleted)
    and the change should be reflected in the IB core cache.

    Signed-off-by: Eli Cohen
    Signed-off-by: Or Gerlitz
    Signed-off-by: Roland Dreier

    Or Gerlitz
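
    A sketch of how a consumer might watch for the new event type through the
    existing async event machinery (the handler and variable names are
    placeholders):

      static void my_async_event(struct ib_event_handler *handler,
                                 struct ib_event *event)
      {
              if (event->event == IB_EVENT_GID_CHANGE)
                      pr_info("GID table changed on port %u\n",
                              event->element.port_num);
      }

      /* registration, e.g. from the client's add() callback */
      INIT_IB_EVENT_HANDLER(&my_handler, device, my_async_event);
      ib_register_event_handler(&my_handler);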
     

17 Jan, 2011

1 commit

  • * ib_wq is added, which is used as the common workqueue for infiniband
    instead of the system workqueue. All system workqueue usages
    including flush_scheduled_work() callers are converted to use and
    flush ib_wq.

    * cancel_delayed_work() + flush_scheduled_work() converted to
    cancel_delayed_work_sync().

    * qib_wq is removed and ib_wq is used instead.

    This is to prepare for deprecation of flush_scheduled_work().

    Signed-off-by: Tejun Heo
    Signed-off-by: Roland Dreier

    Tejun Heo
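
    Usage is the same as for any dedicated workqueue; a minimal sketch of a
    converted caller (my_work_fn is a placeholder):

      static DECLARE_WORK(my_work, my_work_fn);

      queue_work(ib_wq, &my_work);      /* instead of schedule_work() */
      /* ... */
      flush_workqueue(ib_wq);           /* instead of flush_scheduled_work() */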
     

28 Sep, 2010

1 commit

  • This patch allows ports to have different link layers:
    IB_LINK_LAYER_INFINIBAND or IB_LINK_LAYER_ETHERNET. This is required
    for adding IBoE (InfiniBand-over-Ethernet, aka RoCE) support. For
    devices that do not provide an implementation for querying the link
    layer property of a port, we return a default value based on the
    transport: RDMA_TRANSPORT_IB nodes will return IB_LINK_LAYER_INFINIBAND
    and RDMA_TRANSPORT_IWARP nodes will return IB_LINK_LAYER_ETHERNET.

    Signed-off-by: Eli Cohen
    Signed-off-by: Roland Dreier

    Eli Cohen
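
    Consumers can then branch on the per-port link layer; a sketch using the
    query helper added for this (rdma_port_get_link_layer(), per my reading of
    ib_verbs.h from this period), with hypothetical setup helpers:

      enum rdma_link_layer ll;

      ll = rdma_port_get_link_layer(device, port_num);
      if (ll == IB_LINK_LAYER_ETHERNET)
              setup_iboe_port(device, port_num);   /* hypothetical */
      else
              setup_ib_port(device, port_num);     /* hypothetical */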
     

05 Aug, 2010

1 commit

  • Change abbreviated IB_QPT_RAW_ETY to IB_QPT_RAW_ETHERTYPE to make
    the special QP type easier to understand.

    cf http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg04530.html

    Signed-off-by: Aleksey Senin
    Signed-off-by: Roland Dreier

    Aleksey Senin
     

22 May, 2010

1 commit

  • Add a new parameter to ib_register_device() so that low-level device
    drivers can pass in a pointer to a callback function that will be
    called for each port that is registered in sysfs. This allows
    low-level device drivers to create files in

    /sys/class/infiniband/<hca>/ports/<N>/

    without having to poke through the internals of the RDMA sysfs handling.

    There is no need for an unregister function since the kobject
    reference will go to zero when ib_unregister_device() is called.

    Signed-off-by: Ralph Campbell
    Signed-off-by: Roland Dreier

    Ralph Campbell
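
    A hedged sketch of a low-level driver using the new parameter; the
    callback signature follows the description above and my_port_attr_group
    is a hypothetical attribute group:

      static int my_port_callback(struct ib_device *ibdev, u8 port_num,
                                  struct kobject *kobj)
      {
              /* add driver-private files under this port's kobject */
              return sysfs_create_group(kobj, &my_port_attr_group);
      }

      ret = ib_register_device(&mydev->ibdev, my_port_callback);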
     

22 Apr, 2010

1 commit

  • - Add new IB_WR_MASKED_ATOMIC_CMP_AND_SWP and IB_WR_MASKED_ATOMIC_FETCH_AND_ADD
    send opcodes that can be used to post "masked atomic compare and
    swap" and "masked atomic fetch and add" work requests, respectively.
    - Add a masked_atomic_cap capability.
    - Add mask fields to the atomic struct of ib_send_wr.
    - Add new opcodes to ib_wc_opcode.

    The new operations are described more precisely below:

    * Masked Compare and Swap (MskCmpSwap)

    The MskCmpSwap atomic operation is an extension to the CmpSwap
    operation defined in the IB spec. MskCmpSwap allows the user to
    select a portion of the 64 bit target data for the “compare” check as
    well as to restrict the swap to a (possibly different) portion. The
    pseudo code below describes the operation:

    | atomic_response = *va
    | if (!((compare_add ^ *va) & compare_add_mask)) then
    |     *va = (*va & ~(swap_mask)) | (swap & swap_mask)
    |
    | return atomic_response

    The additional operands are carried in the Extended Transport Header.
    Atomic response generation and packet format for MskCmpSwap is as for
    standard IB Atomic operations.

    * Masked Fetch and Add (MFetchAdd)

    The MFetchAdd Atomic operation extends the functionality of the
    standard IB FetchAdd by allowing the user to split the target into
    multiple fields of selectable length. The atomic add is done
    independently on each one of these fields. A bit set in the
    field_boundary parameter specifies the field boundaries. The pseudo
    code below describes the operation:

    | bit_adder(ci, b1, b2, *co)
    | {
    |     value = ci + b1 + b2
    |     *co = !!(value & 2)
    |
    |     return value & 1
    | }
    |
    | #define MASK_IS_SET(mask, attr) (!!((mask)&(attr)))
    | bit_position = 1
    | carry = 0
    | atomic_response = 0
    |
    | for i = 0 to 63
    | {
    |     if ( i != 0 )
    |         bit_position = bit_position << 1
    |
    |     bit_add_res = bit_adder(carry, MASK_IS_SET(*va, bit_position),
    |                             MASK_IS_SET(compare_add, bit_position), &new_carry)
    |     if (bit_add_res)
    |         atomic_response |= bit_position
    |
    |     carry = ((new_carry) && (!MASK_IS_SET(compare_add_mask, bit_position)))
    | }
    |
    | return atomic_response

    Signed-off-by: Vladimir Sokolovsky
    Signed-off-by: Roland Dreier

    Vladimir Sokolovsky
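
    A sketch of posting a MskCmpSwap work request with the new mask fields
    (the compare_add_mask/swap_mask names follow the atomic struct extension
    described above; the surrounding values are placeholders):

      struct ib_send_wr wr = {
              .opcode     = IB_WR_MASKED_ATOMIC_CMP_AND_SWP,
              .send_flags = IB_SEND_SIGNALED,
              .sg_list    = &sge,       /* 8-byte local buffer for the response */
              .num_sge    = 1,
      };
      struct ib_send_wr *bad_wr;

      wr.wr.atomic.remote_addr      = remote_addr;
      wr.wr.atomic.rkey             = remote_rkey;
      wr.wr.atomic.compare_add      = compare_value;
      wr.wr.atomic.compare_add_mask = compare_mask;   /* bits taking part in the compare */
      wr.wr.atomic.swap             = swap_value;
      wr.wr.atomic.swap_mask        = swap_mask;      /* bits that may be swapped */

      ret = ib_post_send(qp, &wr, &bad_wr);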
     

25 Feb, 2010

1 commit


10 Dec, 2009

1 commit


15 Feb, 2009

1 commit


27 Jul, 2008

1 commit

  • Add per-device dma_mapping_ops support for CONFIG_X86_64, as the POWER
    architecture does:

    This enables us to cleanly fix the Calgary IOMMU issue that some devices
    are not behind the IOMMU (http://lkml.org/lkml/2008/5/8/423).

    I think that per-device dma_mapping_ops support would also be helpful for
    KVM people to support PCI passthrough, but Andi thinks that this makes
    PCI passthrough more difficult to support (see the above thread). So I
    CC'ed this to KVM camp. Comments are appreciated.

    A pointer to dma_mapping_ops is added to struct dev_archdata. If the
    pointer is non NULL, DMA operations in asm/dma-mapping.h use it. If it's
    NULL, the system-wide dma_ops pointer is used as before.

    If it's useful for KVM people, I plan to implement a mechanism to register
    a hook called when a new pci (or dma capable) device is created (it works
    with hot plugging). It enables IOMMUs to set up an appropriate
    dma_mapping_ops per device.

    The major obstacle is that dma_mapping_error doesn't take a pointer to the
    device unlike other DMA operations. So x86 can't have dma_mapping_ops per
    device. Note all the POWER IOMMUs use the same dma_mapping_error function
    so this is not a problem for POWER but x86 IOMMUs use different
    dma_mapping_error functions.

    The first patch adds the device argument to dma_mapping_error. The patch
    is trivial but large since it touches lots of drivers and dma-mapping.h in
    all the architectures.

    This patch:

    dma_mapping_error() doesn't take a pointer to the device unlike other DMA
    operations. So we can't have dma_mapping_ops per device.

    Note that POWER already has dma_mapping_ops per device, but all the POWER
    IOMMUs use the same dma_mapping_error function; x86 IOMMUs, by contrast,
    need the device argument.

    [akpm@linux-foundation.org: fix sge]
    [akpm@linux-foundation.org: fix svc_rdma]
    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: fix bnx2x]
    [akpm@linux-foundation.org: fix s2io]
    [akpm@linux-foundation.org: fix pasemi_mac]
    [akpm@linux-foundation.org: fix sdhci]
    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: fix sparc]
    [akpm@linux-foundation.org: fix ibmvscsi]
    Signed-off-by: FUJITA Tomonori
    Cc: Muli Ben-Yehuda
    Cc: Andi Kleen
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Avi Kivity
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    FUJITA Tomonori
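
    After this change the error check carries the device, so per-device ops
    can supply their own checker; a minimal sketch of the new calling
    convention:

      dma_addr_t mapping;

      mapping = dma_map_single(&pdev->dev, buf, len, DMA_TO_DEVICE);
      if (dma_mapping_error(&pdev->dev, mapping))   /* device argument is new */
              return -ENOMEM;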
     

15 Jul, 2008

5 commits

  • - Change the IB_DEVICE_ZERO_STAG flag to the transport-neutral name
    IB_DEVICE_LOCAL_DMA_LKEY, which is used by iWARP RNICs to indicate 0
    STag support and IB HCAs to indicate reserved L_Key support.

    - Add a u32 local_dma_lkey member to struct ib_device. Drivers fill
    this in with the appropriate local DMA L_Key (if they support it).

    - Fix up the drivers using this flag.

    Signed-off-by: Steve Wise
    Signed-off-by: Roland Dreier

    Steve Wise
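
    For illustration, a kernel consumer that has already queried the device
    attributes might use the new member roughly like this (a sketch, not code
    from the patch):

      struct ib_sge sge = {
              .addr   = dma_addr,
              .length = len,
      };

      if (dev_attr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
              sge.lkey = device->local_dma_lkey;   /* reserved L_Key / 0 STag */
      else
              sge.lkey = mr->lkey;                 /* fall back to a real MR */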
     
  • This patch adds a creation flag for QPs,
    IB_QP_CREATE_MULTICAST_BLOCK_LOOPBACK, which when set means that
    multicast sends from the QP to a group that the QP is attached to will
    not be looped back to the QP's receive queue. This can be used to
    save receive resources when a consumer does not want a local copy of
    multicast traffic; for example IPoIB must waste CPU time throwing away
    such local copies of multicast traffic.

    This patch also adds a device capability flag that shows whether a
    device supports this feature or not.

    Signed-off-by: Ron Livne
    Signed-off-by: Roland Dreier

    Ron Livne
     
  • This patch adds a sysfs attribute group called "proto_stats" under
    /sys/class/infiniband/$device/ and populates this group with protocol
    statistics if they exist for a given device. Currently, only iWARP
    stats are defined, but the code is designed to allow InfiniBand
    protocol stats if they become available. These stats are per-device
    and more importantly -not- per port.

    Details:

    - Add union rdma_protocol_stats in ib_verbs.h. This union allows
    defining transport-specific stats. Currently only iwarp stats are
    defined.

    - Add struct iw_protocol_stats to define the current set of iwarp
    protocol stats.

    - Add new ib_device method called get_proto_stats() to return protocol
    statistics.

    - Add logic in core/sysfs.c to create iwarp protocol stats attributes
    if the device is an RNIC and has a get_proto_stats() method.

    Signed-off-by: Steve Wise
    Signed-off-by: Roland Dreier

    Steve Wise
     
  • This patch adds support for the IB "base memory management extension"
    (BMME) and the equivalent iWARP operations (which the iWARP verbs
    mandate that all devices implement). The new operations are:

    - Allocate an ib_mr for use in fast register work requests.

    - Allocate/free physical buffer lists for use in fast register work
    requests. This allows device drivers to allocate this memory as
    needed for use in posting send requests (e.g. via dma_alloc_coherent).

    - New send queue work requests:
    * send with remote invalidate
    * fast register memory region
    * local invalidate memory region
    * RDMA read with invalidate local memory region (iWARP only)

    Consumer interface details:

    - A new device capability flag IB_DEVICE_MEM_MGT_EXTENSIONS is added
    to indicate device support for these features.

    - New send work request opcodes IB_WR_FAST_REG_MR, IB_WR_LOCAL_INV,
    IB_WR_RDMA_READ_WITH_INV are added.

    - A new consumer API function, ib_alloc_mr() is added to allocate
    fast register memory regions.

    - New consumer API functions, ib_alloc_fast_reg_page_list() and
    ib_free_fast_reg_page_list() are added to allocate and free
    device-specific memory for fast registration page lists.

    - A new consumer API function, ib_update_fast_reg_key(), is added to
    allow the key portion of the R_Key and L_Key of a fast registration
    MR to be updated. Consumers call this if desired before posting
    an IB_WR_FAST_REG_MR work request.

    Consumers can use this as follows:

    - MR is allocated with ib_alloc_mr().

    - Page list memory is allocated with ib_alloc_fast_reg_page_list().

    - MR R_Key/L_Key "key" field is updated with ib_update_fast_reg_key().

    - MR made VALID and bound to a specific page list via
    ib_post_send(IB_WR_FAST_REG_MR)

    - MR made INVALID via ib_post_send(IB_WR_LOCAL_INV),
    ib_post_send(IB_WR_RDMA_READ_WITH_INV) or an incoming send with
    invalidate operation.

    - MR is deallocated with ib_dereg_mr().

    - Page lists are deallocated via ib_free_fast_reg_page_list().

    Applications can allocate a fast register MR once, and then can
    repeatedly bind the MR to different physical block lists (PBLs) via
    posting work requests to a send queue (SQ). For each outstanding
    MR-to-PBL binding in the SQ pipe, a fast_reg_page_list needs to be
    allocated (the fast_reg_page_list is owned by the low-level driver
    from the consumer posting a work request until the request completes).
    Thus pipelining can be achieved while still allowing device-specific
    page_list processing.

    The 32-bit fast register memory key/STag is composed of a 24-bit index
    and an 8-bit key. The application can change the key each time it
    fast registers thus allowing more control over the peer's use of the
    key/STag (i.e. it can effectively be changed each time the rkey is
    rebound to a page list).

    Signed-off-by: Steve Wise
    Signed-off-by: Roland Dreier

    Steve Wise
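
    A rough sketch of the allocate/update-key/bind flow listed above; the
    helper names (ib_alloc_fast_reg_mr() etc.) and the fast_reg work-request
    fields reflect my reading of ib_verbs.h from this period and should be
    treated as assumptions rather than the patch's exact API:

      struct ib_mr *mr = ib_alloc_fast_reg_mr(pd, max_pages);
      struct ib_fast_reg_page_list *pl =
              ib_alloc_fast_reg_page_list(pd->device, max_pages);
      struct ib_send_wr wr = { .opcode = IB_WR_FAST_REG_MR };
      struct ib_send_wr *bad_wr;

      ib_update_fast_reg_key(mr, next_key++);      /* refresh the 8-bit key */

      /* fill pl->page_list[] with the PBL, then bind the MR to it via the SQ */
      wr.wr.fast_reg.iova_start    = iova;
      wr.wr.fast_reg.page_list     = pl;
      wr.wr.fast_reg.page_list_len = npages;
      wr.wr.fast_reg.page_shift    = PAGE_SHIFT;
      wr.wr.fast_reg.length        = npages * PAGE_SIZE;
      wr.wr.fast_reg.access_flags  = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
      wr.wr.fast_reg.rkey          = mr->rkey;

      ret = ib_post_send(qp, &wr, &bad_wr);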
     
  • Remove subversion $Id lines and improve readability by fixing other
    coding style problems pointed out by checkpatch.pl.

    Signed-off-by: Dotan Barak
    Signed-off-by: Roland Dreier

    Dotan Barak
     

10 Jun, 2008

1 commit

  • In 2.6.26, we added some support for send with invalidate work
    requests, including a device capability flag to indicate whether a
    device supports such requests. However, the support was incomplete:
    the completion structure was not extended with a field for the key
    contained in incoming send with invalidate requests.

    Full support for memory management extensions (send with invalidate,
    local invalidate, fast register through a send queue, etc) is planned
    for 2.6.27. Since send with invalidate is not very useful by itself,
    just remove the IB_DEVICE_SEND_W_INV bit before the 2.6.26 final
    release; we will add an IB_DEVICE_MEM_MGT_EXTENSIONS bit in 2.6.27,
    which makes things simpler for applications, since they will not have
    quite as confusing an array of fine-grained bits to check.

    Signed-off-by: Roland Dreier

    Roland Dreier
     

29 Apr, 2008

1 commit

  • Add a new parameter, dmasync, to the ib_umem_get() prototype. Use dmasync = 1
    when mapping user-allocated CQs with ib_umem_get().

    Signed-off-by: Arthur Kepner
    Cc: Tony Luck
    Cc: Jesse Barnes
    Cc: Jes Sorensen
    Cc: Randy Dunlap
    Cc: Roland Dreier
    Cc: James Bottomley
    Cc: David Miller
    Cc: Benjamin Herrenschmidt
    Cc: Grant Grundler
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arthur Kepner
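
    Callers pass the new flag as the final argument; a minimal sketch of the
    updated prototype in use (the surrounding names are placeholders):

      struct ib_umem *umem;

      /* dmasync = 1: this user-allocated CQ buffer needs synchronous mappings */
      umem = ib_umem_get(context, cmd.buf_addr, cmd.size,
                         IB_ACCESS_LOCAL_WRITE, 1);
      if (IS_ERR(umem))
              return PTR_ERR(umem);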
     

20 Apr, 2008

1 commit


17 Apr, 2008

5 commits

  • Add support for modifying CQ parameters for controlling event
    generation moderation.

    Signed-off-by: Eli Cohen
    Signed-off-by: Roland Dreier

    Eli Cohen
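
    The new verb takes a completion count and a period; a minimal sketch with
    illustrative moderation values (the exact semantics are device dependent):

      /* at most one completion event per 16 completions or per 128 us */
      ret = ib_modify_cq(cq, 16, 128);
      if (ret && ret != -ENOSYS)
              return ret;               /* device may not support moderation */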
     
  • Add a new IB_WR_SEND_WITH_INV send opcode that can be used to mark a
    "send with invalidate" work request as defined in the iWARP verbs and
    the InfiniBand base memory management extensions. Also put "imm_data"
    and a new "invalidate_rkey" member in a new "ex" union in struct
    ib_send_wr. The invalidate_rkey member can be used to pass in an
    R_Key/STag to be invalidated. Add this new union to struct
    ib_uverbs_send_wr. Add code to copy the invalidate_rkey field in
    ib_uverbs_post_send().

    Fix up low-level drivers to deal with the change to struct ib_send_wr,
    and just remove the imm_data initialization from net/sunrpc/xprtrdma/,
    since that code never does any send with immediate operations.

    Also, move the existing IB_DEVICE_SEND_W_INV flag to a new bit, since
    the iWARP drivers currently in the tree set the bit. The amso1100
    driver at least will silently fail to honor the IB_SEND_INVALIDATE bit
    if passed in as part of userspace send requests (since it does not
    implement kernel bypass work request queueing). Remove the flag from
    all existing drivers that set it until we know which ones are OK.

    The value chosen for the new flag is not consecutive with the existing
    flags, to avoid clashing with flags defined in the XRC patches, which are
    not merged yet but which are already in use and are likely to be merged soon.

    This resurrects a patch sent long ago by Mikkel Hagen.

    Signed-off-by: Roland Dreier

    Roland Dreier
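
    A minimal sketch of posting the new opcode with the "ex" union described
    above (peer_rkey and sge are placeholders):

      struct ib_send_wr wr = {
              .opcode             = IB_WR_SEND_WITH_INV,
              .send_flags         = IB_SEND_SIGNALED,
              .sg_list            = &sge,
              .num_sge            = 1,
              .ex.invalidate_rkey = peer_rkey,   /* R_Key/STag to invalidate */
      };
      struct ib_send_wr *bad_wr;

      ret = ib_post_send(qp, &wr, &bad_wr);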
     
  • LSO (large send offload) allows the networking stack to pass SKBs with
    data size larger than the MTU to the IPoIB driver and have the HCA HW
    fragment the data to multiple MSS-sized packets. Add a device
    capability flag IB_DEVICE_UD_TSO for devices that can perform TCP
    segmentation offload, a new send work request opcode IB_WR_LSO,
    header, hlen and mss fields for the work request structure, and a new
    IB_WC_LSO completion type.

    Signed-off-by: Eli Cohen
    Signed-off-by: Roland Dreier

    Eli Cohen
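
    A hedged sketch of an IPoIB-style LSO post using the new fields; their
    exact placement in the wr.ud struct is my reading of ib_verbs.h from this
    period:

      struct ib_send_wr wr = {
              .opcode     = IB_WR_LSO,
              .send_flags = IB_SEND_SIGNALED,
              .sg_list    = sg_list,            /* payload fragments */
              .num_sge    = nfrags,
      };

      wr.wr.ud.ah          = address_handle;
      wr.wr.ud.remote_qpn  = remote_qpn;
      wr.wr.ud.remote_qkey = qkey;
      wr.wr.ud.header      = skb->data;         /* IPoIB + IP/TCP headers */
      wr.wr.ud.hlen        = header_len;
      wr.wr.ud.mss         = skb_shinfo(skb)->gso_size;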
     
  • Add a create_flags member to struct ib_qp_init_attr that will allow a
    kernel verbs consumer to pass special flags when creating a QP.
    Add a flag value for telling low-level drivers that a QP will be used
    for IPoIB UD LSO. The create_flags member will also be useful for XRC
    and ehca low-latency QP support.

    Since no create_flags handling is implemented yet, add code to all
    low-level drivers to return -EINVAL if create_flags is non-zero.

    Signed-off-by: Eli Cohen
    Signed-off-by: Roland Dreier

    Eli Cohen
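
    For illustration, the IPoIB UD LSO case mentioned above would request the
    behaviour at QP creation time roughly as follows (flag name per the
    ib_qp_create_flags enum of this period):

      struct ib_qp_init_attr init_attr = {
              .qp_type      = IB_QPT_UD,
              .send_cq      = send_cq,
              .recv_cq      = recv_cq,
              .cap          = { .max_send_wr = 64, .max_recv_wr = 256,
                                .max_send_sge = 1, .max_recv_sge = 1 },
              .create_flags = IB_QP_CREATE_IPOIB_UD_LSO,
      };
      struct ib_qp *qp = ib_create_qp(pd, &init_attr);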
     
  • IDR IDs are signed, so struct ib_uobject.id should be signed. This
    avoids some sparse pointer signedness warnings.

    Signed-off-by: Roland Dreier

    Roland Dreier
     

09 Feb, 2008

2 commits


25 Jan, 2008

1 commit


02 Nov, 2007

1 commit


04 Aug, 2007

2 commits


19 May, 2007

1 commit