24 Feb, 2020

1 commit

  • [ Upstream commit 6b57cea9221b0247ad5111b348522625e489a8e4 ]

    Currently, when the low-level driver generates Pkey, GID, and port change
    events, they are delivered to the registered handlers in the order in
    which the handlers were registered.

    IB core and other ULPs such as IPoIB are interested in GID, LID, Pkey
    change events.

    Since all GID queries done by ULPs are serviced by the IB core, and the
    IB core defers cache updates to a work queue, it is possible for other
    clients to see stale cache data when they handle their own events.

    For example, the call tree below shows how ipoib can call
    rdma_query_gid() while the cache update is still sitting in the WQ.

    mlx5_ib_handle_event()
      ib_dispatch_event()
        ib_cache_event()
          queue_work() -> slow cache update

    [..]
    ipoib_event()
      queue_work()
    [..]
    work handler
      ipoib_ib_dev_flush_light()
        __ipoib_ib_dev_flush()
          ipoib_dev_addr_changed_valid()
            rdma_query_gid()
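
    A minimal userspace model of the race (all names hypothetical, not the
    kernel's): the cache update is deferred to a work queue while handlers
    run in registration order, so a handler can observe the stale value.

    #include <stddef.h>

    struct event_handler {
            void (*cb)(int event);
            struct event_handler *next;
    };

    static struct event_handler *handlers; /* kept in registration order */
    static int cached_gid = -1;            /* stands in for the GID cache */

    /* queued with queue_work() in the trace above; runs later */
    static void cache_update_work(int event)
    {
            cached_gid = event;
    }

    static void dispatch_event(int event)
    {
            struct event_handler *h;

            /* cache_update_work() has only been queued at this point, so a
             * handler that queries the cache here still sees cached_gid
             * from before the event, just like rdma_query_gid() above */
            for (h = handlers; h != NULL; h = h->next)
                    h->cb(event);
    }
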
    Signed-off-by: Leon Romanovsky
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin

    Parav Pandit
     

15 Feb, 2020

1 commit

  • commit ca95c1411198c2d87217c19d44571052cdc94725 upstream.

    Verify that the MR access flags passed from userspace are all supported;
    otherwise, return an error.

    Fixes: 4fca03778351 ("IB/uverbs: Move ib_access_flags and ib_read_counters_flags to uapi")
    Link: https://lore.kernel.org/r/1578506740-22188-6-git-send-email-yishaih@mellanox.com
    Signed-off-by: Michael Guralnik
    Signed-off-by: Yishai Hadas
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Greg Kroah-Hartman

    Michael Guralnik
     

18 Dec, 2019

1 commit

  • commit ecdfdfdbe4d4c74029f2b416b7ee6d0aeb56364a upstream.

    If dev->dma_device->params == NULL then the maximum DMA segment size is 64
    KB. See also the dma_get_max_seg_size() implementation. This patch fixes
    the following kernel warning:

    DMA-API: infiniband rxe0: mapping sg segment longer than device claims to support [len=126976] [max=65536]
    WARNING: CPU: 4 PID: 4848 at kernel/dma/debug.c:1220 debug_dma_map_sg+0x3d9/0x450
    RIP: 0010:debug_dma_map_sg+0x3d9/0x450
    Call Trace:
    srp_queuecommand+0x626/0x18d0 [ib_srp]
    scsi_queue_rq+0xd02/0x13e0 [scsi_mod]
    __blk_mq_try_issue_directly+0x2b3/0x3f0
    blk_mq_request_issue_directly+0xac/0xf0
    blk_insert_cloned_request+0xdf/0x170
    dm_mq_queue_rq+0x43d/0x830 [dm_mod]
    __blk_mq_try_issue_directly+0x2b3/0x3f0
    blk_mq_request_issue_directly+0xac/0xf0
    blk_mq_try_issue_list_directly+0xb8/0x170
    blk_mq_sched_insert_requests+0x23c/0x3b0
    blk_mq_flush_plug_list+0x529/0x730
    blk_flush_plug_list+0x21f/0x260
    blk_mq_make_request+0x56b/0xf20
    generic_make_request+0x196/0x660
    submit_bio+0xae/0x290
    blkdev_direct_IO+0x822/0x900
    generic_file_direct_write+0x110/0x200
    __generic_file_write_iter+0x124/0x2a0
    blkdev_write_iter+0x168/0x270
    aio_write+0x1c4/0x310
    io_submit_one+0x971/0x1390
    __x64_sys_io_submit+0x12a/0x390
    do_syscall_64+0x6f/0x2e0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
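
    A paraphrased model of the fallback described above (simplified structs,
    not verbatim kernel source): with no per-device DMA parameters set, the
    maximum segment size defaults to 64 KB, so the 126976-byte segment in
    the warning trips the DMA debug check.

    struct device_dma_parameters { unsigned int max_segment_size; };
    struct device { struct device_dma_parameters *dma_parms; };

    static unsigned int max_seg_size(const struct device *dev)
    {
            if (dev->dma_parms && dev->dma_parms->max_segment_size)
                    return dev->dma_parms->max_segment_size;
            return 64 * 1024;       /* the 64 KB default named above */
    }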

    Link: https://lore.kernel.org/r/20191025225830.257535-2-bvanassche@acm.org
    Cc:
    Fixes: 0b5cb3300ae5 ("RDMA/srp: Increase max_segment_size")
    Signed-off-by: Bart Van Assche
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     

23 Oct, 2019

1 commit

  • The issue is in drivers/infiniband/core/uverbs_std_types_cq.c in the
    UVERBS_HANDLER(UVERBS_METHOD_CQ_CREATE) function. We check that:

    if (attr.comp_vector >= attrs->ufile->device->num_comp_vectors) {

    But we don't check if "attr.comp_vector" is negative. It could
    potentially lead to an array underflow. My concern would be where
    cq->vector is used in the create_cq() function from the cxgb4 driver.

    And really "attr.comp_vector" is appears as a u32 to user space so that's
    the right type to use.
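
    A self-contained model of the two checks (hypothetical structs): with a
    signed field, the bound check lets -1 through; declaring the field as
    the u32 it is on the uAPI makes one unsigned comparison reject every
    out-of-range value.

    #include <errno.h>
    #include <stdint.h>

    struct attr_signed { int comp_vector; };
    struct attr_fixed  { uint32_t comp_vector; };

    static int check_buggy(struct attr_signed a, int num_comp_vectors)
    {
            if (a.comp_vector >= num_comp_vectors)
                    return -EINVAL;
            return 0;  /* comp_vector == -1 passes, then underflows an array */
    }

    static int check_fixed(struct attr_fixed a, uint32_t num_comp_vectors)
    {
            if (a.comp_vector >= num_comp_vectors)
                    return -EINVAL;
            return 0;  /* former negatives become huge values and are rejected */
    }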

    Fixes: 9ee79fce3642 ("IB/core: Add completion queue (cq) object actions")
    Link: https://lore.kernel.org/r/20191011133419.GA22905@mwanda
    Signed-off-by: Dan Carpenter
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Dan Carpenter
     

22 Sep, 2019

2 commits

  • Pull RDMA subsystem updates from Jason Gunthorpe:
    "This cycle mainly saw lots of bug fixes and clean up code across the
    core code and several drivers, few new functional changes were made.

    - Many cleanup and bug fixes for hns

    - Various small bug fixes and cleanups in hfi1, mlx5, usnic, qed,
    bnxt_re, efa

    - Share the query_port code between all the iWarp drivers

    - General rework and cleanup of the ODP MR umem code to fit better
    with the mmu notifier get/put scheme

    - Support rdma netlink in non init_net name spaces

    - mlx5 support for XRC devx and DC ODP"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (99 commits)
    RDMA: Fix double-free in srq creation error flow
    RDMA/efa: Fix incorrect error print
    IB/mlx5: Free mpi in mp_slave mode
    IB/mlx5: Use the original address for the page during free_pages
    RDMA/bnxt_re: Fix spelling mistake "missin_resp" -> "missing_resp"
    RDMA/hns: Package operations of rq inline buffer into separate functions
    RDMA/hns: Optimize cmd init and mode selection for hip08
    IB/hfi1: Define variables as unsigned long to fix KASAN warning
    IB/{rdmavt, hfi1, qib}: Add a counter for credit waits
    IB/hfi1: Add traces for TID RDMA READ
    RDMA/siw: Relax from kmap_atomic() use in TX path
    IB/iser: Support up to 16MB data transfer in a single command
    RDMA/siw: Fix page address mapping in TX path
    RDMA: Fix goto target to release the allocated memory
    RDMA/usnic: Avoid overly large buffers on stack
    RDMA/odp: Add missing cast for 32 bit
    RDMA/hns: Use devm_platform_ioremap_resource() to simplify code
    Documentation/infiniband: update name of some functions
    RDMA/cma: Fix false error message
    RDMA/hns: Fix wrong assignment of qp_access_flags
    ...

    Linus Torvalds
     
  • Pull hmm updates from Jason Gunthorpe:
    "This is more cleanup and consolidation of the hmm APIs and the very
    strongly related mmu_notifier interfaces. Many places across the tree
    using these interfaces are touched in the process. Beyond that a
    cleanup to the page walker API and a few memremap related changes
    round out the series:

    - General improvement of hmm_range_fault() and related APIs, more
    documentation, bug fixes from testing, API simplification &
    consolidation, and unused API removal

    - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE,
    and make them internal kconfig selects

    - Hoist a lot of code related to mmu notifier attachment out of
    drivers by using a refcount get/put attachment idiom and remove the
    convoluted mmu_notifier_unregister_no_release() and related APIs.

    - General API improvement for the migrate_vma API and revision of its
    only user in nouveau

    - Annotate mmu_notifiers with lockdep and sleeping region debugging

    Two series unrelated to HMM or mmu_notifiers came along due to
    dependencies:

    - Allow pagemap's memremap_pages family of APIs to work without
    providing a struct device

    - Make walk_page_range() and related use a constant structure for
    function pointers"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits)
    libnvdimm: Enable unit test infrastructure compile checks
    mm, notifier: Catch sleeping/blocking for !blockable
    kernel.h: Add non_block_start/end()
    drm/radeon: guard against calling an unpaired radeon_mn_unregister()
    csky: add missing brackets in a macro for tlb.h
    pagewalk: use lockdep_assert_held for locking validation
    pagewalk: separate function pointers from iterator data
    mm: split out a new pagewalk.h header from mm.h
    mm/mmu_notifiers: annotate with might_sleep()
    mm/mmu_notifiers: prime lockdep
    mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
    mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports
    mm/hmm: hmm_range_fault() infinite loop
    mm/hmm: hmm_range_fault() NULL pointer bug
    mm/hmm: fix hmm_range_fault()'s handling of swapped out pages
    mm/mmu_notifiers: remove unregister_no_release
    RDMA/odp: remove ib_ucontext from ib_umem
    RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
    RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
    RDMA/mlx5: Use ib_umem_start instead of umem.address
    ...

    Linus Torvalds
     

14 Sep, 2019

2 commits


22 Aug, 2019

9 commits

  • At this point the ucontext is only being stored to access the ib_device,
    so just store the ib_device directly instead. This is more natural and
    logical as the umem has nothing to do with the ucontext.
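
    The shape of the change, sketched with stand-in struct names:

    struct ib_ucontext;
    struct ib_device;

    /* before: the device was only reachable through the ucontext */
    struct umem_before { struct ib_ucontext *context; };
    /* after: store the ib_device directly */
    struct umem_after { struct ib_device *ibdev; };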

    Link: https://lore.kernel.org/r/20190806231548.25242-8-jgg@ziepe.ca
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • This is a significant simplification, no extra list is kept per FD, and
    the interval tree is now shared between all the ucontexts, reducing
    overhead if there are multiple ucontexts active.

    Link: https://lore.kernel.org/r/20190806231548.25242-7-jgg@ziepe.ca
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • Jason Gunthorpe says:

    ====================
    This is a collection of general cleanups for ODP to clarify some of the
    flows around umem creation and use of the interval tree.
    ====================

    The branch is based on v5.3-rc5 due to dependencies

    * odp_fixes:
    RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
    RDMA/mlx5: Use ib_umem_start instead of umem.address
    RDMA/core: Make invalidate_range a device operation
    RDMA/odp: Use kvcalloc for the dma_list and page_list
    RDMA/odp: Check for overflow when computing the umem_odp end
    RDMA/odp: Provide ib_umem_odp_release() to undo the allocs
    RDMA/odp: Split creating a umem_odp from ib_umem_get
    RDMA/odp: Make the three ways to create a umem_odp clear
    RMDA/odp: Consolidate umem_odp initialization
    RDMA/odp: Make it clearer when a umem is an implicit ODP umem
    RDMA/odp: Iterate over the whole rbtree directly
    RDMA/odp: Use the common interval tree library instead of generic
    RDMA/mlx5: Fix MR npages calculation for IB_ACCESS_HUGETLB

    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
    The callback function 'invalidate_range' is implemented in a driver, so
    the place for it is in the ib_device_ops structure and not in
    ib_ucontext.
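
    Sketched as types (the callback signature is an assumption based on this
    description, not quoted from the header):

    struct ib_umem_odp;

    struct ib_device_ops_sketch {
            /* previously a member of ib_ucontext */
            void (*invalidate_range)(struct ib_umem_odp *umem_odp,
                                     unsigned long start, unsigned long end);
    };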

    Signed-off-by: Moni Shoua
    Reviewed-by: Guy Levi
    Reviewed-by: Jason Gunthorpe
    Link: https://lore.kernel.org/r/20190819111710.18440-11-leon@kernel.org
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe

    Moni Shoua
     
    Since the page size can be extended in the ODP case by IB_ACCESS_HUGETLB,
    the existing overflow checks done by ib_umem_get() are not sufficient.
    Check for overflow again.

    Further, remove the unchecked math from the inlines and just use the
    precomputed value stored in the interval_tree_node.
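
    A minimal model of the extra check (hypothetical helper;
    __builtin_add_overflow stands in for the kernel's check_add_overflow()):

    #include <stdbool.h>
    #include <stdint.h>

    static bool umem_end_overflows(uint64_t addr, uint64_t length,
                                   uint64_t page_size)
    {
            uint64_t end;

            if (__builtin_add_overflow(addr, length, &end))
                    return true;
            /* rounding up to a huge page boundary can overflow as well */
            return __builtin_add_overflow(end, page_size - 1, &end);
    }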

    Link: https://lore.kernel.org/r/20190819111710.18440-9-leon@kernel.org
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
    This is the last creation API that is overloaded for both normal and ODP
    umems; there is very little code sharing, and a driver has to be
    specifically ready for a umem_odp to be created in order to use the ODP
    version.

    Link: https://lore.kernel.org/r/20190819111710.18440-7-leon@kernel.org
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • The three paths to build the umem_odps are kind of muddled, they are:
    - As a normal ib_mr umem
    - As a child in an implicit ODP umem tree
    - As the root of an implicit ODP umem tree

    Only the first two are actually umems; the last is an abuse.

    The implicit case can only be triggered by explicit driver request; it
    should never be co-mingled with the normal case. While we are here, give
    the functions sensible names and add some comments to make this clearer.
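
    One constructor per case, sketched with the kind of split described here
    (signatures abbreviated and assumed, not quoted from the header):

    struct ib_ucontext;
    struct ib_umem_odp;

    /* root of an implicit ODP tree: not really a umem at all */
    struct ib_umem_odp *ib_umem_odp_alloc_implicit(struct ib_ucontext *ctx,
                                                   int access);
    /* child under an implicit root */
    struct ib_umem_odp *ib_umem_odp_alloc_child(struct ib_umem_odp *root,
                                                unsigned long addr,
                                                unsigned long length);
    /* the normal case: an ODP umem backing a regular MR */
    struct ib_umem_odp *ib_umem_odp_get(struct ib_ucontext *ctx,
                                        unsigned long addr,
                                        unsigned long length, int access);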

    Link: https://lore.kernel.org/r/20190819111710.18440-6-leon@kernel.org
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • Implicit ODP umems are special, they don't have any page lists, they don't
    exist in the interval tree and they are never DMA mapped.

    Instead of trying to guess this based on a zero length, use an explicit
    flag.

    Further, do not allow non-implicit umems to be 0 size.
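
    As a struct sketch (stand-in fields):

    struct odp_umem_sketch {
            unsigned long addr;
            unsigned long length;  /* now required to be non-zero */
            int is_implicit_odp;   /* set explicitly at creation, never guessed */
    };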

    Link: https://lore.kernel.org/r/20190819111710.18440-4-leon@kernel.org
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • ODP is working with userspace VA's in the interval tree which always fit
    into an unsigned long, so we can use the common code.

    This comes at the cost of a 16-byte increase in the ib_umem_odp struct
    size, due to storing the interval tree start/last in addition to the
    umem addr/length. However, these values were already being computed and
    are performance-critical for the interval lookup, so this seems like a
    worthwhile trade-off.

    Removes 2k of .text from the kernel.
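
    The assumed layout behind the 16-byte figure (rb-tree linkage omitted):

    struct interval_tree_node_sketch {
            unsigned long start;   /* == umem address, precomputed */
            unsigned long last;    /* == address + length - 1, precomputed */
    };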

    Link: https://lore.kernel.org/r/20190819111710.18440-2-leon@kernel.org
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

21 Aug, 2019

2 commits

    task_active_pid_ns() is the wrong API for checking the PID namespace
    because it carries some restrictions and returns the PID namespace in
    which the process was allocated. That created mismatches with the
    current namespace, which can be different.

    Rewrite the whole rdma_is_visible_in_pid_ns() logic to provide reliable
    results without any relation to the allocated PID namespace.

    Fixes: 8be565e65fa9 ("RDMA/nldev: Factor out the PID namespace check")
    Fixes: 6a6c306a09b5 ("RDMA/restrack: Make is_visible_in_pid_ns() as an API")
    Reviewed-by: Mark Zhang
    Signed-off-by: Leon Romanovsky
    Link: https://lore.kernel.org/r/20190815083834.9245-4-leon@kernel.org
    Signed-off-by: Doug Ledford

    Leon Romanovsky
     
    There is no need to keep DEBUG defines for out-of-tree testing.

    Signed-off-by: Leon Romanovsky
    Link: https://lore.kernel.org/r/20190819114547.20704-1-leon@kernel.org
    Signed-off-by: Doug Ledford

    Leon Romanovsky
     

12 Aug, 2019

1 commit

    In order to improve readability, add an ib_port_phys_state enum to
    replace the use of magic numbers.
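
    The enum's likely shape, sketched from the IBTA port physical state
    encoding (check include/rdma/ib_verbs.h for the authoritative list):

    enum ib_port_phys_state {
            IB_PORT_PHYS_STATE_SLEEP = 1,
            IB_PORT_PHYS_STATE_POLLING = 2,
            IB_PORT_PHYS_STATE_DISABLED = 3,
            IB_PORT_PHYS_STATE_PORT_CONFIGURATION_TRAINING = 4,
            IB_PORT_PHYS_STATE_LINK_UP = 5,
            IB_PORT_PHYS_STATE_LINK_ERROR_RECOVERY = 6,
            IB_PORT_PHYS_STATE_PHY_TEST = 7,
    };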

    Signed-off-by: Kamal Heib
    Reviewed-by: Andrew Boyer
    Acked-by: Michal Kalderon
    Acked-by: Bernard Metzler
    Link: https://lore.kernel.org/r/20190807103138.17219-2-kamalheib1@gmail.com
    Signed-off-by: Doug Ledford

    Kamal Heib
     

06 Aug, 2019

1 commit


05 Aug, 2019

1 commit

  • Send and Receive completion is handled on a single CPU selected at
    the time each Completion Queue is allocated. Typically this is when
    an initiator instantiates an RDMA transport, or when a target
    accepts an RDMA connection.

    Some ULPs cannot open a connection per CPU to spread completion
    workload across available CPUs and MSI vectors. For such ULPs,
    provide an API that allows the RDMA core to select a completion
    vector based on the device's complement of available comp_vecs.

    ULPs that invoke ib_alloc_cq() with only comp_vector 0 are converted
    to use the new API so that their completion workloads interfere less
    with each other.
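
    A self-contained model of such a selection policy (illustrative, not the
    kernel's implementation): instead of every caller passing comp_vector 0,
    the core hands out vectors round-robin across the device's complement.

    #include <stdint.h>

    static uint32_t pick_comp_vector(uint32_t num_comp_vectors)
    {
            static uint32_t next;  /* simple round-robin cursor */

            if (num_comp_vectors == 0)
                    return 0;
            return next++ % num_comp_vectors;
    }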

    Suggested-by: Håkon Bugge
    Signed-off-by: Chuck Lever
    Reviewed-by: Leon Romanovsky
    Cc:
    Cc:
    Link: https://lore.kernel.org/r/20190729171923.13428.52555.stgit@manet.1015granger.net
    Signed-off-by: Doug Ledford

    Chuck Lever
     

01 Aug, 2019

2 commits

    Due to the complexity of client->remove() callbacks, it is desirable to
    not hold any locks while calling them. Remove the last one by tracking
    only the highest client ID and running backwards from there over the
    xarray.

    Since the only remaining purpose of that lock was to protect the linked
    list, it can be dropped.
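
    A kernel-context sketch of the iteration (simplified; the absence of any
    lock around the callback is the point):

    #include <linux/xarray.h>

    struct ib_client;
    void remove_client_context(struct ib_client *client);

    static void remove_all_clients(struct xarray *clients,
                                   unsigned long highest_client_id)
    {
            unsigned long id = highest_client_id;

            do {
                    struct ib_client *client = xa_load(clients, id);

                    if (client)
                            remove_client_context(client); /* may sleep */
            } while (id--);
    }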

    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Leon Romanovsky
    Link: https://lore.kernel.org/r/20190731081841.32345-3-leon@kernel.org
    Signed-off-by: Doug Ledford

    Jason Gunthorpe
     
  • lockdep reports:

    WARNING: possible circular locking dependency detected

    modprobe/302 is trying to acquire lock:
    0000000007c8919c ((wq_completion)ib_cm){+.+.}, at: flush_workqueue+0xdf/0x990

    but task is already holding lock:
    000000002d3d2ca9 (&device->client_data_rwsem){++++}, at: remove_client_context+0x79/0xd0 [ib_core]

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (&device->client_data_rwsem){++++}:
    down_read+0x3f/0x160
    ib_get_net_dev_by_params+0xd5/0x200 [ib_core]
    cma_ib_req_handler+0x5f6/0x2090 [rdma_cm]
    cm_process_work+0x29/0x110 [ib_cm]
    cm_req_handler+0x10f5/0x1c00 [ib_cm]
    cm_work_handler+0x54c/0x311d [ib_cm]
    process_one_work+0x4aa/0xa30
    worker_thread+0x62/0x5b0
    kthread+0x1ca/0x1f0
    ret_from_fork+0x24/0x30

    -> #1 ((work_completion)(&(&work->work)->work)){+.+.}:
    process_one_work+0x45f/0xa30
    worker_thread+0x62/0x5b0
    kthread+0x1ca/0x1f0
    ret_from_fork+0x24/0x30

    -> #0 ((wq_completion)ib_cm){+.+.}:
    lock_acquire+0xc8/0x1d0
    flush_workqueue+0x102/0x990
    cm_remove_one+0x30e/0x3c0 [ib_cm]
    remove_client_context+0x94/0xd0 [ib_core]
    disable_device+0x10a/0x1f0 [ib_core]
    __ib_unregister_device+0x5a/0xe0 [ib_core]
    ib_unregister_device+0x21/0x30 [ib_core]
    mlx5_ib_stage_ib_reg_cleanup+0x9/0x10 [mlx5_ib]
    __mlx5_ib_remove+0x3d/0x70 [mlx5_ib]
    mlx5_ib_remove+0x12e/0x140 [mlx5_ib]
    mlx5_remove_device+0x144/0x150 [mlx5_core]
    mlx5_unregister_interface+0x3f/0xf0 [mlx5_core]
    mlx5_ib_cleanup+0x10/0x3a [mlx5_ib]
    __x64_sys_delete_module+0x227/0x350
    do_syscall_64+0xc3/0x6a4
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    This is due to the read side of the client_data_rwsem being obtained
    recursively through a work queue flush during cm client removal.

    The lock is being held across the remove in remove_client_context() so
    that the function is a fence, once it returns the client is removed. This
    is required so that the two callers do not proceed with destruction until
    the client completes removal.

    Instead of using client_data_rwsem use the existing device unregistration
    refcount and add a similar client unregistration (client->uses) refcount.

    This will fence the two unregistration paths without holding any locks.
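
    An illustrative userspace model of the fence (not the kernel's code): a
    per-client use count plus a completion replaces holding a rwsem across
    the callbacks, so a workqueue flush inside remove() can no longer feed
    the lockdep cycle above.

    #include <pthread.h>
    #include <stdatomic.h>

    struct client {
            atomic_int uses;  /* 1 registration ref + transient users */
            pthread_mutex_t lock;
            pthread_cond_t uses_zero;
    };

    static void client_put(struct client *c)
    {
            if (atomic_fetch_sub(&c->uses, 1) == 1) {
                    pthread_mutex_lock(&c->lock);
                    pthread_cond_broadcast(&c->uses_zero);
                    pthread_mutex_unlock(&c->lock);
            }
    }

    static void client_remove_fence(struct client *c)
    {
            client_put(c);  /* drop the registration reference */
            pthread_mutex_lock(&c->lock);
            while (atomic_load(&c->uses) != 0)  /* no locks held here */
                    pthread_cond_wait(&c->uses_zero, &c->lock);
            pthread_mutex_unlock(&c->lock);
            /* from here the client is fully removed */
    }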

    Cc:
    Fixes: 921eab1143aa ("RDMA/devices: Re-organize device.c locking")
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Leon Romanovsky
    Link: https://lore.kernel.org/r/20190731081841.32345-2-leon@kernel.org
    Signed-off-by: Doug Ledford

    Jason Gunthorpe
     

30 Jul, 2019

1 commit


26 Jul, 2019

2 commits

    Now that the IB core supports binding an RDMA device to a specific net
    namespace, enable the IB core to accept netlink commands in non-init_net
    namespaces.

    This is done by having a per-net-namespace netlink socket.

    At present, only the netlink device-handling client RDMA_NL_NLDEV
    supports device handling in multiple net namespaces. Hence, do not
    accept netlink messages for other clients in non-init_net namespaces.
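
    A kernel-context sketch of the mechanism (simplified; error handling and
    the real socket configuration omitted): the socket is created and torn
    down by pernet_operations callbacks, one instance per namespace.

    #include <linux/errno.h>
    #include <linux/netlink.h>
    #include <net/net_namespace.h>
    #include <net/netns/generic.h>

    static unsigned int rdma_nl_net_id;

    struct rdma_nl_net { struct sock *nl_sock; };

    static __net_init int rdma_nl_net_init(struct net *net)
    {
            struct rdma_nl_net *rnet = net_generic(net, rdma_nl_net_id);
            struct netlink_kernel_cfg cfg = {};

            rnet->nl_sock = netlink_kernel_create(net, NETLINK_RDMA, &cfg);
            return rnet->nl_sock ? 0 : -ENOMEM;
    }

    static __net_exit void rdma_nl_net_exit(struct net *net)
    {
            struct rdma_nl_net *rnet = net_generic(net, rdma_nl_net_id);

            netlink_kernel_release(rnet->nl_sock);
    }

    static struct pernet_operations rdma_nl_net_ops = {
            .init = rdma_nl_net_init,
            .exit = rdma_nl_net_exit,
            .id   = &rdma_nl_net_id,
            .size = sizeof(struct rdma_nl_net),
    };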

    Link: https://lore.kernel.org/r/20190723070205.6247-1-leon@kernel.org
    Signed-off-by: Parav Pandit
    Signed-off-by: Leon Romanovsky
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Parav Pandit
     
    This allows rdma to work with CONFIG_KERNEL_HEADER_TEST and
    CONFIG_HEADERS_CHECK.

    Link: https://lore.kernel.org/r/20190722170126.GA16453@ziepe.ca
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

23 Jul, 2019

1 commit

    When an OPFN request is flushed, the request is completed without
    unreserving itself from the send queue. Subsequently, when a new request
    is posted, the following warning is triggered:

    WARNING: CPU: 4 PID: 8130 at rdmavt/qp.c:1761 rvt_post_send+0x72a/0x880 [rdmavt]
    Call Trace:
    [] dump_stack+0x19/0x1b
    [] __warn+0xd8/0x100
    [] warn_slowpath_null+0x1d/0x20
    [] rvt_post_send+0x72a/0x880 [rdmavt]
    [] ? account_entity_dequeue+0xae/0xd0
    [] ? __kmalloc+0x55/0x230
    [] ib_uverbs_post_send+0x37c/0x5d0 [ib_uverbs]
    [] ? rdma_lookup_put_uobject+0x26/0x60 [ib_uverbs]
    [] ib_uverbs_write+0x286/0x460 [ib_uverbs]
    [] ? security_file_permission+0x27/0xa0
    [] vfs_write+0xc0/0x1f0
    [] SyS_write+0x7f/0xf0
    [] system_call_fastpath+0x22/0x27

    This patch fixes the problem by moving rvt_qp_wqe_unreserve() into
    rvt_qp_complete_swqe() to simplify the code and make it less
    error-prone.
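
    The fix's shape, modeled with stand-in types (not rdmavt's actual
    structures): unreserving moves into the common completion helper, so
    every path that completes a WQE, including a flush, runs it.

    struct qp_sketch   { unsigned int s_reserved_used; };
    struct swqe_sketch { int reserved; };

    static void wqe_unreserve(struct qp_sketch *qp, struct swqe_sketch *wqe)
    {
            if (wqe->reserved) {
                    wqe->reserved = 0;
                    qp->s_reserved_used--;
            }
    }

    static void complete_swqe(struct qp_sketch *qp, struct swqe_sketch *wqe)
    {
            /* previously the caller's job, and skipped when an OPFN
             * request was flushed, leaking a reserved slot */
            wqe_unreserve(qp, wqe);
            /* ... generate the completion and advance the queue ... */
    }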

    Fixes: ca95f802ef51 ("IB/hfi1: Unreserve a reserved request when it is completed")
    Link: https://lore.kernel.org/r/20190715164528.74174.31364.stgit@awfm-01.aw.intel.com
    Cc:
    Reviewed-by: Mike Marciniszyn
    Reviewed-by: Dennis Dalessandro
    Signed-off-by: Kaike Wan
    Signed-off-by: Mike Marciniszyn
    Signed-off-by: Jason Gunthorpe

    Kaike Wan
     

09 Jul, 2019

3 commits

    5.4-rc1 will have new compile-time debugging to test that headers can be
    compiled stand-alone. Many rdma headers are already broken and excluded
    from the mechanism; however, to avoid compile failures during the merge
    window, fix enough so that the newly added header compiles cleanly.

    Fixes: 413d3347503b ("RDMA/counter: Add set/clear per-port auto mode support")
    Reported-by: Stephen Rothwell
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Mark Zhang

    Jason Gunthorpe
     
    Add an interface to the infiniband driver that applies rdma_dim adaptive
    moderation. There is now a special function for allocating an ib_cq that
    uses rdma_dim.

    Performance improvement (ConnectX-5 100GbE, x86) running the FIO
    benchmark over NVMf between two equal end-hosts with 56 cores across a
    Mellanox switch, using a null_blk device:

    IO READS without DIM:
    blk size | BW        | IOPS  | 99th percentile latency | 99.99th latency
    512B     | 3.8GiB/s  | 7.7M  | 1401 usec               | 2442 usec
    4k       | 7.0GiB/s  | 1.8M  | 4817 usec               | 6587 usec
    64k      | 10.7GiB/s | 175k  | 9896 usec               | 10028 usec

    IO WRITES without DIM:
    blk size | BW        | IOPS  | 99th percentile latency | 99.99th latency
    512B     | 3.6GiB/s  | 7.5M  | 1434 usec               | 2474 usec
    4k       | 6.3GiB/s  | 1.6M  | 938 usec                | 1221 usec
    64k      | 10.7GiB/s | 175k  | 8979 usec               | 12780 usec

    IO READS with DIM:
    blk size | BW        | IOPS  | 99th percentile latency | 99.99th latency
    512B     | 4GiB/s    | 8.2M  | 816 usec                | 889 usec
    4k       | 10.1GiB/s | 2.65M | 3359 usec               | 5080 usec
    64k      | 10.7GiB/s | 175k  | 9896 usec               | 10028 usec

    IO WRITES with DIM:
    blk size | BW        | IOPS  | 99th percentile latency | 99.99th latency
    512B     | 3.9GiB/s  | 8.1M  | 799 usec                | 922 usec
    4k       | 9.6GiB/s  | 2.5M  | 717 usec                | 1004 usec
    64k      | 10.7GiB/s | 176k  | 8586 usec               | 12256 usec

    The rdma_dim algorithm was designed to measure the effectiveness of
    moderation on the flow in a general way and thus should be appropriate
    for all RDMA storage protocols.

    rdma_dim is configured to be the default option based on performance
    improvement seen after extensive tests.
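
    A very loose, self-contained illustration of the idea (not the kernel's
    rdma_dim implementation): compare completion throughput between sample
    windows and step the moderation level toward whatever improves it.

    #include <stdint.h>

    /* returns the new moderation level in [0, max_level] */
    static int dim_step(uint64_t prev_rate, uint64_t cur_rate,
                        int level, int max_level)
    {
            if (cur_rate > prev_rate + prev_rate / 10)       /* >10% better */
                    return level < max_level ? level + 1 : level;
            if (cur_rate + cur_rate / 10 < prev_rate)        /* >10% worse */
                    return level > 0 ? level - 1 : level;
            return level;                                    /* steady */
    }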

    Signed-off-by: Yamin Friedman
    Reviewed-by: Max Gurtovoy
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Saeed Mahameed
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe

    Yamin Friedman
     
  • Userspace expects the IB_TM_CAP_RC bit to indicate that the device
    supports RC transport tag matching with rendezvous offload. However the
    firmware splits this into two capabilities for eager and rendezvous tag
    matching.

    Only if the FW supports both modes should userspace be told the tag
    matching capability is available.
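
    The derivation described above, as a one-line sketch (bit names
    illustrative, not the firmware's):

    #include <stdbool.h>

    static bool report_ib_tm_cap_rc(bool fw_tm_eager_rc, bool fw_tm_rndv_rc)
    {
            /* userspace sees the capability only when both modes exist */
            return fw_tm_eager_rc && fw_tm_rndv_rc;
    }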

    Cc: # 4.13
    Fixes: eb761894351d ("IB/mlx5: Fill XRQ capabilities")
    Signed-off-by: Danit Goldberg
    Reviewed-by: Yishai Hadas
    Reviewed-by: Artemy Kovalyov
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe

    Danit Goldberg
     

05 Jul, 2019

8 commits


29 Jun, 2019

1 commit