31 Jan, 2020

1 commit

  • Compiling the mlx5 driver without CONFIG_INFINIBAND_USER_ACCESS generates
    the following error.

    on x86_64:

    ld: drivers/infiniband/hw/mlx5/main.o: in function `mlx5_ib_handler_MLX5_IB_METHOD_VAR_OBJ_ALLOC':
    main.c:(.text+0x186d): undefined reference to `ib_uverbs_get_ucontext_file'
    ld: drivers/infiniband/hw/mlx5/main.o:(.rodata+0x2480): undefined reference to `uverbs_idr_class'
    ld: drivers/infiniband/hw/mlx5/main.o:(.rodata+0x24d8): undefined reference to `uverbs_destroy_def_handler'

    This is happening because some parts of the UAPI description are not
    static. This is a holdover from earlier code that relied on struct
    pointers to refer to object types; now object types are referenced by
    number. Remove the unused globals and add statics to the remaining UAPI
    description elements.

    Remove the redundant #ifdefs around mlx5_ib_*defs and the obsolete
    mlx5_ib_get_devx_tree().

    The compiler now trims a lot more unused code, including the above
    problematic definitions when !CONFIG_INFINIBAND_USER_ACCESS (a small
    sketch of the effect follows this entry).

    Fixes: 7be76bef320b ("IB/mlx5: Introduce VAR object and its alloc/destroy methods")
    Reported-by: Randy Dunlap
    Acked-by: Randy Dunlap
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
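    A minimal, purely illustrative sketch of why the static matters; the
    symbol and table names below are invented, not the mlx5 ones:

    /* Provided only when uverbs support is built in; with
     * CONFIG_INFINIBAND_USER_ACCESS=n nothing defines it.
     */
    extern int example_uverbs_handler(void);

    /* Before: a non-static global survives to link time, so its reference
     * to example_uverbs_handler() becomes an undefined symbol.
     */
    int (*example_uapi_defs_global[])(void) = { example_uverbs_handler };

    /* After: a static table with no remaining users is discarded by the
     * compiler, and the dangling reference disappears with it.
     */
    static int (*example_uapi_defs_static[])(void) = { example_uverbs_handler };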
     

26 Jan, 2020

2 commits

  • All accesses now use the new IBA accessor scheme, so delete the structs
    entirely and generate the structures from the schema file.

    Link: https://lore.kernel.org/r/20200116170037.30109-8-jgg@ziepe.ca
    Tested-by: Leon Romanovsky
    Reviewed-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • There is no separation between the RDMA-CM wire format as it is declared
    by IBTA and the kernel logic that implements the needed support. This
    situation causes many mistakes in conversion between the big-endian wire
    format and the CPU format used by the kernel. It also mixes RDMA core
    code with a combination of uXX and beXX variables.

    The idea is that all accesses to IBA definitions will go through special
    GET/SET macros to ensure that no conversion mistakes are made. The
    shifting and masking required to read a value are automatically deduced
    from the field offset description in the tables of the IBA
    specification.

    This starts with the CM MADs described in IBTA release 1.3 volume 1.

    To confirm that the new macros behave the same as the old accessors a
    self-test is included in this patch.

    Each macro replacing a straightforward struct field compile-time tests
    that the new field has the same offsetof() and width as the old field.

    For the fields with accessor functions there is a runtime test: the
    'all ones' value is placed in a dummy message and read back in several
    ways to confirm that both approaches give identical results.

    Later patches in this series delete the self test.

    This creates a tested table of new field names, old field name(s) and
    some meta information such as BE coding for the functions, which will be
    used in the next patches (a simplified sketch of the accessor idea
    follows this entry).

    Link: https://lore.kernel.org/r/20200116170037.30109-3-jgg@ziepe.ca
    Link: https://lore.kernel.org/r/20191212093830.316934-5-leon@kernel.org
    Signed-off-by: Leon Romanovsky
    Tested-by: Leon Romanovsky
    Reviewed-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe

    Leon Romanovsky
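    A simplified sketch of the accessor idea (illustrative names and field
    layout, not the kernel's actual IBA macros): a field is described once by
    the byte offset of its 32-bit word, the bit offset from the MSB and its
    width, and the shift/mask needed to access it is derived from that
    description while the wire data stays big-endian.

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>  /* htonl/ntohl: the wire format is big-endian */

    /* A field description expands to three arguments: word byte offset,
     * bit offset counted from the MSB, and width in bits.
     */
    #define EX_FIELD(byte_off, bit_off, width)  (byte_off), (bit_off), (width)
    #define EX_GET(field, msg)        ex_get((msg), field)
    #define EX_SET(field, msg, val)   ex_set((msg), field, (val))

    /* Example: an imaginary 8-bit field in the top byte of word 1. */
    #define EX_CM_EXAMPLE  EX_FIELD(4, 0, 8)

    static uint32_t ex_get(const void *msg, unsigned byte_off,
                           unsigned bit_off, unsigned width)
    {
        uint32_t word;

        memcpy(&word, (const char *)msg + byte_off, sizeof(word));
        word = ntohl(word);
        return (word >> (32 - bit_off - width)) &
               (width == 32 ? ~0u : (1u << width) - 1);
    }

    static void ex_set(void *msg, unsigned byte_off, unsigned bit_off,
                       unsigned width, uint32_t val)
    {
        uint32_t mask = (width == 32 ? ~0u : (1u << width) - 1)
                        << (32 - bit_off - width);
        uint32_t word;

        memcpy(&word, (char *)msg + byte_off, sizeof(word));
        word = ntohl(word);
        word = (word & ~mask) | ((val << (32 - bit_off - width)) & mask);
        word = htonl(word);
        memcpy((char *)msg + byte_off, &word, sizeof(word));
    }

    /* Usage:  uint32_t v = EX_GET(EX_CM_EXAMPLE, mad);
     *         EX_SET(EX_CM_EXAMPLE, mad, 0x7f);
     */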
     

21 Jan, 2020

1 commit

  • From https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma

    Leon Romanovsky says:

    ====================
    Use ODP MRs for kernel ULPs

    The following series extends the MR creation routines to allow creation
    of user MRs through kernel ULPs acting as a proxy. The immediate use case
    is to allow RDS to work over FS-DAX, which requires ODP (on-demand
    paging) MRs; such MRs could not be created prior to this series.

    The first part of this patchset extends RDMA with a special verb,
    ib_reg_user_mr(). The common use case for this function is a userspace
    application that allocates memory for HCA access while the responsibility
    to register the memory at the HCA lies with a kernel ULP. This ULP acts
    as an agent for the userspace application.

    The second part provides advise MR functionality for ULPs. This is an
    integral part of ODP flows and is used to trigger page faults in advance,
    preparing memory before the working set runs.

    The third part is the actual user of those in-kernel APIs.
    ====================

    * tag 'rds-odp-for-5.5':
    net/rds: Use prefetch for On-Demand-Paging MR
    net/rds: Handle ODP mr registration/unregistration
    net/rds: Detect need of On-Demand-Paging memory registration
    RDMA/mlx5: Fix handling of IOVA != user_va in ODP paths
    IB/mlx5: Mask out unsupported ODP capabilities for kernel QPs
    RDMA/mlx5: Don't fake udata for kernel path
    IB/mlx5: Add ODP WQE handlers for kernel QPs
    IB/core: Add interface to advise_mr for kernel users
    IB/core: Introduce ib_reg_user_mr
    IB: Allow calls to ib_umem_get from kernel ULPs

    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

17 Jan, 2020

3 commits

  • Add a new relaxed ordering access flag for memory regions. Using memory
    regions with relaxed ordering set can enhance performance.

    This access flag is handled in a best-effort manner: drivers should
    ignore it if they do not support setting relaxed ordering (a handling
    sketch follows this entry).

    Link: https://lore.kernel.org/r/1578506740-22188-9-git-send-email-yishaih@mellanox.com
    Signed-off-by: Michael Guralnik
    Signed-off-by: Yishai Hadas
    Signed-off-by: Jason Gunthorpe

    Michael Guralnik
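    A rough sketch of the best-effort handling with invented names (this is
    not any specific driver's code): hardware that cannot honour relaxed
    ordering simply drops the bit instead of failing the registration.

    #define EX_ACCESS_RELAXED_ORDERING  (1u << 20)   /* illustrative value */

    static unsigned int ex_filter_mr_flags(unsigned int access_flags,
                                           int hw_supports_relaxed_ordering)
    {
        if ((access_flags & EX_ACCESS_RELAXED_ORDERING) &&
            !hw_supports_relaxed_ordering)
            access_flags &= ~EX_ACCESS_RELAXED_ORDERING;  /* ignore, no error */

        return access_flags;
    }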
     
  • Define a range of access flags that are considered optional; both uverbs
    and drivers should accept them and use them only if applicable.

    This will be used, for example, for the relaxed ordering access flag,
    which drivers that do not support it can simply ignore.

    Link: https://lore.kernel.org/r/1578506740-22188-7-git-send-email-yishaih@mellanox.com
    Signed-off-by: Michael Guralnik
    Signed-off-by: Yishai Hadas
    Signed-off-by: Jason Gunthorpe

    Michael Guralnik
     
  • Verify that the MR access flags passed from userspace are all supported
    ones; otherwise return an error (a validation sketch follows this entry).

    Fixes: 4fca03778351 ("IB/uverbs: Move ib_access_flags and ib_read_counters_flags to uapi")
    Link: https://lore.kernel.org/r/1578506740-22188-6-git-send-email-yishaih@mellanox.com
    Signed-off-by: Michael Guralnik
    Signed-off-by: Yishai Hadas
    Signed-off-by: Jason Gunthorpe

    Michael Guralnik
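    A sketch of the check being added, using invented flag values (the real
    constants live in the uverbs uAPI headers): any bit outside the supported
    mask coming from userspace is rejected.

    #include <errno.h>

    #define EX_ACCESS_LOCAL_WRITE    (1u << 0)   /* illustrative values */
    #define EX_ACCESS_REMOTE_WRITE   (1u << 1)
    #define EX_ACCESS_REMOTE_READ    (1u << 2)
    #define EX_ACCESS_SUPPORTED      (EX_ACCESS_LOCAL_WRITE | \
                                      EX_ACCESS_REMOTE_WRITE | \
                                      EX_ACCESS_REMOTE_READ)

    static int ex_check_mr_access(unsigned int user_flags)
    {
        if (user_flags & ~EX_ACCESS_SUPPORTED)
            return -EINVAL;   /* unknown or unsupported access flag */

        return 0;
    }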
     

16 Jan, 2020

3 commits

  • Allow ULPs to call advise_mr, so they can control ODP regions in the
    same way as userspace applications do (a usage sketch follows this
    entry).

    Signed-off-by: Moni Shoua
    Signed-off-by: Leon Romanovsky

    Moni Shoua
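    A hedged usage sketch for the new in-kernel caller path; the prototype
    and the advice/flag constants are assumed from this series, and the pd,
    mr, addr and length values are placeholders from the caller's context.

    #include <rdma/ib_verbs.h>

    /* Ask the driver to prefault an ODP-backed range before it is used. */
    static int ex_prefetch_odp_range(struct ib_pd *pd, struct ib_mr *mr,
                                     u64 addr, u32 length)
    {
        struct ib_sge sge = {
            .addr   = addr,
            .length = length,
            .lkey   = mr->lkey,
        };

        return ib_advise_mr(pd, IB_UVERBS_ADVISE_MR_ADVICE_PREFETCH_WRITE,
                            IB_UVERBS_ADVISE_MR_FLAG_FLUSH, &sge, 1);
    }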
     
  • Add ib_reg_user_mr() for kernel ULPs to register user MRs.

    The common use case for this function is a userspace application that
    allocates memory for HCA access while the responsibility to register the
    memory at the HCA lies with a kernel ULP. This ULP acts as an agent for
    the userspace application.

    This function is intended to be used without a user context, so vendor
    drivers need to be aware that the reg_user_mr() device operation may be
    called with udata equal to NULL.

    Among all drivers, i40iw is the only one that relies on the presence of
    udata, so check for udata existence for that driver. (A usage sketch
    follows this entry.)

    Signed-off-by: Moni Shoua
    Reviewed-by: Guy Levi
    Signed-off-by: Leon Romanovsky

    Moni Shoua
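    A hedged sketch of how a ULP (e.g. RDS) might use the new verb on behalf
    of a userspace process; the prototype is assumed from this series and the
    variables are placeholders.

    #include <rdma/ib_verbs.h>

    static struct ib_mr *ex_register_user_buffer(struct ib_pd *pd,
                                                 u64 user_va, u64 length)
    {
        /* ODP access so FS-DAX backed memory can be registered without
         * pinning; the userspace address doubles as the IOVA here.
         */
        int access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
                     IB_ACCESS_ON_DEMAND;

        return ib_reg_user_mr(pd, user_va, length, user_va, access);
    }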
     
  • So far the assumption was that ib_umem_get() and ib_umem_odp_get()
    are called from flows that start in UVERBS and therefore have a user
    context. This assumption restricts flows that are initiated by ULPs
    and need the service that ib_umem_get() provides.

    This patch changes ib_umem_get() and ib_umem_odp_get() to take the IB
    device directly, relying on the fact that both UVERBS and ULPs set that
    field correctly.

    Reviewed-by: Guy Levi
    Signed-off-by: Moni Shoua
    Signed-off-by: Leon Romanovsky

    Moni Shoua
     

14 Jan, 2020

9 commits

  • Now that all callers provide a non-NULL attrs the ufile is redundant.
    Adjust things so that the context handling is done inside alloc_uobj,
    and the ib_uverbs_get_ucontext_file() is avoided if we already have the
    context.

    Link: https://lore.kernel.org/r/1578504126-9400-13-git-send-email-yishaih@mellanox.com
    Signed-off-by: Yishai Hadas
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • This is a struct ib_uwq_object pointer; instead of using container_of()
    all over the place, just store it with its actual type.

    Link: https://lore.kernel.org/r/1578504126-9400-10-git-send-email-yishaih@mellanox.com
    Signed-off-by: Yishai Hadas
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • This is a struct ib_usrq_object pointer; instead of using container_of()
    all over the place, just store it with its actual type.

    Link: https://lore.kernel.org/r/1578504126-9400-9-git-send-email-yishaih@mellanox.com
    Signed-off-by: Yishai Hadas
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • This is a struct ib_uqp_object pointer; instead of using container_of()
    all over the place, just store it with its actual type.

    Link: https://lore.kernel.org/r/1578504126-9400-8-git-send-email-yishaih@mellanox.com
    Signed-off-by: Yishai Hadas
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • This is a struct ib_ucq_object pointer; instead of using container_of()
    all over the place, just store it with its actual type.

    Link: https://lore.kernel.org/r/1578504126-9400-7-git-send-email-yishaih@mellanox.com
    Signed-off-by: Yishai Hadas
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • This is a leftover from an earlier version that creates a lot of
    complexity for error unwind, particularly for FD uobjects.

    The only reason this was done is so that anon_inode_get_file() could be
    called with the final fops and a fully setup uobject. Both need to be
    setup since unwinding anon_inode_get_file() via fput will call the
    driver's release().

    Now that the driver does not provide release, we no longer need to worry
    about this complicated sequence, simply create the struct file at the
    start and allow the core code's release function to deal with the abort
    case.

    This allows all the confusing error paths around commit to be removed.

    Link: https://lore.kernel.org/r/1578504126-9400-5-git-send-email-yishaih@mellanox.com
    Signed-off-by: Yishai Hadas
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • FD uobjects have a weird split between the struct file and uobject
    world. Simplify this to make them pure uobjects and use a generic release
    method for all struct file operations.

    This fixes the control flow so that mlx5_cmd_cleanup_async_ctx() is
    always called before erasing the linked list contents, making the
    concurrency simpler to understand.

    For this to work the uobject destruction must fence anything that it is
    cleaning up - the design must not rely on struct file lifetime.

    Only deliver_event() relies on the struct file when adding new events to
    the queue; add an is_destroyed check under lock to block it.

    Link: https://lore.kernel.org/r/1578504126-9400-3-git-send-email-yishaih@mellanox.com
    Signed-off-by: Yishai Hadas
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • dispatch_event_fd() runs from a notifier with minimal locking, and relies
    on RCU and a file refcount to keep the uobject and eventfd alive.

    As the next patch wants to remove the file_operations release function
    from the drivers, re-organize things so that the devx_event_notifier()
    path uses the existing RCU to manage the lifetime of the uobject and
    eventfd.

    Move the refcount puts to a call_rcu so that the objects are guaranteed to
    exist and remove the indirect file refcount.

    Link: https://lore.kernel.org/r/1578504126-9400-2-git-send-email-yishaih@mellanox.com
    Signed-off-by: Yishai Hadas
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • After device disassociation the uapi_objects are destroyed and freed,
    however it is still possible that core code can be holding a kref on the
    uobject. When it finally goes to uverbs_uobject_free() via the kref_put()
    it can trigger a use-after-free on the uapi_object.

    Since needs_kfree_rcu is a micro-optimization that only benefits file
    uobjects, just get rid of it. There is no harm in using kfree_rcu even if
    it isn't required, and the number of involved objects is small. (A small
    sketch of the pattern follows this entry.)

    Link: https://lore.kernel.org/r/20200113143306.GA28717@ziepe.ca
    Signed-off-by: Michael Guralnik
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
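    A generic sketch of the kfree_rcu() pattern (struct and field names are
    illustrative, not the uverbs ones): the kfree() is deferred past an RCU
    grace period, so a reader still holding the pointer under RCU never sees
    freed memory.

    #include <linux/slab.h>
    #include <linux/rcupdate.h>

    struct ex_uobject {
        struct rcu_head rcu;    /* needed by kfree_rcu() */
        /* ... object state ... */
    };

    static void ex_uobject_free(struct ex_uobject *uobj)
    {
        kfree_rcu(uobj, rcu);   /* instead of an immediate kfree(uobj) */
    }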
     

08 Jan, 2020

4 commits

  • This lock is used to protect the qp->open_list linked list. As a side
    effect it seems to also globally serialize the qp event_handler, but it
    isn't clear if that is a deliberate design.

    Link: https://lore.kernel.org/r/20191212113024.336702-5-leon@kernel.org
    Signed-off-by: Parav Pandit
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe

    Parav Pandit
     
  • Given that the ib_cache structure now has only a single member, merge the
    cache lock directly into ib_device.

    Link: https://lore.kernel.org/r/20191212113024.336702-4-leon@kernel.org
    Signed-off-by: Parav Pandit
    Signed-off-by: Leon Romanovsky
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Parav Pandit
     
  • Currently, when the low-level driver notifies Pkey, GID, and port change
    events, they are delivered to the registered handlers in the order in
    which the handlers were registered.

    IB core and other ULPs such as IPoIB are interested in GID, LID, and Pkey
    change events.

    Since all GID queries done by ULPs are serviced by IB core, and the IB
    core defers cache updates to a work queue, it is possible for other
    clients to see stale cache data when they handle their own events.

    For example, the call tree below shows how ipoib can call
    rdma_query_gid() concurrently with the cache update still sitting in the
    WQ.

    mlx5_ib_handle_event()
      ib_dispatch_event()
        ib_cache_event()
          queue_work() -> slow cache update

    [..]
    ipoib_event()
      queue_work()
        [..]
        work handler
          ipoib_ib_dev_flush_light()
            __ipoib_ib_dev_flush()
              ipoib_dev_addr_changed_valid()
                rdma_query_gid()

    Signed-off-by: Leon Romanovsky
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Parav Pandit
     
  • Sample trace events:

    kworker/u29:0-300 [007] 120.042217: cq_alloc: cq.id=4 nr_cqe=161 comp_vector=2 poll_ctx=WORKQUEUE
    <idle>-0 [002] 120.056292: cq_schedule: cq.id=4
    kworker/2:1H-482 [002] 120.056402: cq_process: cq.id=4 wake-up took 109 [us] from interrupt
    kworker/2:1H-482 [002] 120.056407: cq_poll: cq.id=4 requested 16, returned 1
    <idle>-0 [002] 120.067503: cq_schedule: cq.id=4
    kworker/2:1H-482 [002] 120.067537: cq_process: cq.id=4 wake-up took 34 [us] from interrupt
    kworker/2:1H-482 [002] 120.067541: cq_poll: cq.id=4 requested 16, returned 1
    <idle>-0 [002] 120.067657: cq_schedule: cq.id=4
    kworker/2:1H-482 [002] 120.067672: cq_process: cq.id=4 wake-up took 15 [us] from interrupt
    kworker/2:1H-482 [002] 120.067674: cq_poll: cq.id=4 requested 16, returned 1

    ...

    systemd-1 [002] 122.392653: cq_schedule: cq.id=4
    kworker/2:1H-482 [002] 122.392688: cq_process: cq.id=4 wake-up took 35 [us] from interrupt
    kworker/2:1H-482 [002] 122.392693: cq_poll: cq.id=4 requested 16, returned 16
    kworker/2:1H-482 [002] 122.392836: cq_poll: cq.id=4 requested 16, returned 16
    kworker/2:1H-482 [002] 122.392970: cq_poll: cq.id=4 requested 16, returned 16
    kworker/2:1H-482 [002] 122.393083: cq_poll: cq.id=4 requested 16, returned 16
    kworker/2:1H-482 [002] 122.393195: cq_poll: cq.id=4 requested 16, returned 3

    Several features to note in this output:
    - The WCE count and context type are reported at allocation time
    - The CPU and kworker for each CQ are evident
    - The CQ's restracker ID is tagged on each trace event
    - CQ poll scheduling latency is measured
    - Details about how often single completions occur versus multiple
      completions are evident
    - The cost of the ULP's completion handler is recorded

    Link: https://lore.kernel.org/r/20191218201815.30584.3481.stgit@manet.1015granger.net
    Signed-off-by: Chuck Lever
    Reviewed-by: Parav Pandit
    Signed-off-by: Jason Gunthorpe

    Chuck Lever
     

04 Jan, 2020

2 commits

  • Clean the code by deleting ARP functions, which are not called anyway.

    Link: https://lore.kernel.org/r/20191212093830.316934-46-leon@kernel.org
    Signed-off-by: Leon Romanovsky
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Leon Romanovsky
     
  • Comments need to be with the definition of rvt_restart_sge().

    Other comments were duplicated in sw/rdmavt/rc.c and were removed.

    Fixes: 385156c5f2a6 ("IB/hfi: Move RC functions into a header file")
    Link: https://lore.kernel.org/r/20191219211934.58387.88014.stgit@awfm-01.aw.intel.com
    Reviewed-by: Kaike Wan
    Signed-off-by: Mike Marciniszyn
    Signed-off-by: Dennis Dalessandro
    Signed-off-by: Jason Gunthorpe

    Mike Marciniszyn
     

13 Dec, 2019

1 commit


01 Dec, 2019

1 commit

  • Pull hmm updates from Jason Gunthorpe:
    "This is another round of bug fixing and cleanup. This time the focus
    is on the driver pattern to use mmu notifiers to monitor a VA range.
    This code is lifted out of many drivers and hmm_mirror directly into
    the mmu_notifier core and written using the best ideas from all the
    driver implementations.

    This removes many bugs from the drivers and has a very pleasing
    diffstat. More drivers can still be converted, but that is for another
    cycle.

    - A shared branch with RDMA reworking the RDMA ODP implementation

    - New mmu_interval_notifier API. This is focused on the use case of
    monitoring a VA and simplifies the process for drivers

    - A common seq-count locking scheme built into the
    mmu_interval_notifier API usable by drivers that call
    get_user_pages() or hmm_range_fault() with the VA range

    - Conversion of mlx5 ODP, hfi1, radeon, nouveau, AMD GPU, and Xen
    GntDev drivers to the new API. This deletes a lot of wonky driver
    code.

    - Two improvements for hmm_range_fault(), from testing done by Ralph"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
    mm/hmm: remove hmm_range_dma_map and hmm_range_dma_unmap
    mm/hmm: make full use of walk_page_range()
    xen/gntdev: use mmu_interval_notifier_insert
    mm/hmm: remove hmm_mirror and related
    drm/amdgpu: Use mmu_interval_notifier instead of hmm_mirror
    drm/amdgpu: Use mmu_interval_insert instead of hmm_mirror
    drm/amdgpu: Call find_vma under mmap_sem
    nouveau: use mmu_interval_notifier instead of hmm_mirror
    nouveau: use mmu_notifier directly for invalidate_range_start
    drm/radeon: use mmu_interval_notifier_insert
    RDMA/hfi1: Use mmu_interval_notifier_insert for user_exp_rcv
    RDMA/odp: Use mmu_interval_notifier_insert()
    mm/hmm: define the pre-processor related parts of hmm.h even if disabled
    mm/hmm: allow hmm_range to be used with a mmu_interval_notifier or hmm_mirror
    mm/mmu_notifier: add an interval tree notifier
    mm/mmu_notifier: define the header pre-processor parts even if disabled
    mm/hmm: allow snapshot of the special zero page

    Linus Torvalds
     

28 Nov, 2019

1 commit

  • Pull rdma updates from Jason Gunthorpe:
    "Again another fairly quiet cycle with few notable core code changes
    and the usual variety of driver bug fixes and small improvements.

    - Various driver updates and bug fixes for siw, bnxt_re, hns, qedr,
    iw_cxgb4, vmw_pvrdma, mlx5

    - Improvements in SRPT from working with iWarp

    - SRIOV VF support for bnxt_re

    - Skeleton kernel-doc files for drivers/infiniband

    - User visible counters for events related to ODP

    - Common code for tracking of mmap lifetimes so that drivers can link
    HW object lifetime to a VMA

    - ODP bug fixes and rework

    - RDMA READ support for efa

    - Removal of the very old cxgb3 driver"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (168 commits)
    RDMA/hns: Delete unnecessary callback functions for cq
    RDMA/hns: Rename the functions used inside creating cq
    RDMA/hns: Redefine the member of hns_roce_cq struct
    RDMA/hns: Redefine interfaces used in creating cq
    RDMA/efa: Expose RDMA read related attributes
    RDMA/efa: Support remote read access in MR registration
    RDMA/efa: Store network attributes in device attributes
    IB/hfi1: remove redundant assignment to variable ret
    RDMA/bnxt_re: Fix missing le16_to_cpu
    RDMA/bnxt_re: Fix stat push into dma buffer on gen p5 devices
    RDMA/bnxt_re: Fix chip number validation Broadcom's Gen P5 series
    RDMA/bnxt_re: Fix Kconfig indentation
    IB/mlx5: Implement callbacks for getting VFs GUID attributes
    IB/ipoib: Add ndo operation for getting VFs GUID attributes
    IB/core: Add interfaces to get VF node and port GUIDs
    net/core: Add support for getting VF GUIDs
    RDMA/qedr: Fix null-pointer dereference when calling rdma_user_mmap_get_offset
    RDMA/cm: Use refcount_t type for refcount variable
    IB/mlx5: Support extended number of strides for Striding RQ
    IB/mlx4: Update HW GID table while adding vlan GID
    ...

    Linus Torvalds
     

25 Nov, 2019

1 commit

  • Danit Goldberg says:

    ====================
    This series extends RTNETLINK to provide IB port and node GUIDs, which
    were configured for Infiniband VFs.

    The functionality to set VF GUIDs has existed for a long time; here we
    are adding the missing "get" so that netlink becomes symmetric and
    various cloud orchestration tools can manage such VFs more naturally.

    iproute2 was extended too, to present those GUIDs.

    - ip link show

    For example:
    - ip link set ib4 vf 0 node_guid 22:44:33:00:33:11:00:33
    - ip link set ib4 vf 0 port_guid 10:21:33:12:00:11:22:10
    - ip link show ib4
    ib4: mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 256
    link/infiniband 00:00:0a:2d:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:44:36:8d brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    vf 0 link/infiniband 00:00:0a:2d:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:44:36:8d brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff,
    spoof checking off, NODE_GUID 22:44:33:00:33:11:00:33, PORT_GUID 10:21:33:12:00:11:22:10, link-state disable, trust off, query_rss off
    ====================

    Based on the mlx5-next branch from
    git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux for
    dependencies

    * branch 'ib-guids': (35 commits)
    IB/mlx5: Implement callbacks for getting VFs GUID attributes
    IB/ipoib: Add ndo operation for getting VFs GUID attributes
    IB/core: Add interfaces to get VF node and port GUIDs
    net/core: Add support for getting VF GUIDs

    net/mlx5: Add new chain for netfilter flow table offload
    net/mlx5: Refactor creating fast path prio chains
    net/mlx5: Accumulate levels for chains prio namespaces
    net/mlx5: Define fdb tc levels per prio
    net/mlx5: Rename FDB_* tc related defines to FDB_TC_* defines
    net/mlx5: Simplify fdb chain and prio eswitch defines
    IB/mlx5: Load profile according to RoCE enablement state
    IB/mlx5: Rename profile and init methods
    net/mlx5: Handle "enable_roce" devlink param
    net/mlx5: Document flow_steering_mode devlink param
    devlink: Add new "enable_roce" generic device param
    net/mlx5: fix spelling mistake "metdata" -> "metadata"
    net/mlx5: fix kvfree of uninitialized pointer spec
    IB/mlx5: Introduce and use mlx5_core_is_vf()
    net/mlx5: E-switch, Enable metadata on own vport
    net/mlx5: Refactor ingress acl configuration
    ...

    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

24 Nov, 2019

1 commit

  • Replace the internal interval tree based mmu notifier with the new common
    mmu_interval_notifier_insert() API. This removes a lot of code and fixes a
    deadlock that can be triggered in ODP:

    zap_page_range()
      mmu_notifier_invalidate_range_start()
        [..]
        ib_umem_notifier_invalidate_range_start()
          down_read(&per_mm->umem_rwsem)
      unmap_single_vma()
        [..]
        __split_huge_page_pmd()
          mmu_notifier_invalidate_range_start()
            [..]
            ib_umem_notifier_invalidate_range_start()
              down_read(&per_mm->umem_rwsem) // DEADLOCK

          mmu_notifier_invalidate_range_end()
            up_read(&per_mm->umem_rwsem)
      mmu_notifier_invalidate_range_end()
        up_read(&per_mm->umem_rwsem)

    The umem_rwsem is held across the range_start/end as the ODP algorithm for
    invalidate_range_end cannot tolerate changes to the interval
    tree. However, due to the nested invalidation regions the second
    down_read() can deadlock if there are competing writers. The new core code
    provides an alternative scheme to solve this problem.

    Fixes: ca748c39ea3f ("RDMA/umem: Get rid of per_mm->notifier_count")
    Link: https://lore.kernel.org/r/20191112202231.3856-6-jgg@ziepe.ca
    Tested-by: Artemy Kovalyov
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

23 Nov, 2019

1 commit


17 Nov, 2019

1 commit


13 Nov, 2019

1 commit


07 Nov, 2019

3 commits

  • Delete MAD functions that were never implemented and never used.

    Link: https://lore.kernel.org/r/20191029062745.7932-2-leon@kernel.org
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe

    Leon Romanovsky
     
  • The rdma_user_mmap_io interface created a common way for drivers to
    correctly map hw resources and zap them once the ucontext is destroyed,
    enabling the drivers to safely free the hw resources.

    However, this meant the drivers needed to delay freeing the resource to
    the ucontext destroy phase to ensure it was no longer mapped. The new
    mechanism for a common way of handling user/driver address mapping
    enables notifying the driver once all umap_priv mappings are removed, so
    the hw resources can be freed as soon as they are done with instead of
    delaying that until ucontext destroy.

    Since not all drivers use the mechanism, NULL can be passed to the
    rdma_user_mmap_io interface to continue working as before. Drivers that
    use the mmap_xa interface can pass the entry being mapped to the
    rdma_user_mmap_io function so the two are linked together (a usage sketch
    follows this entry).

    Link: https://lore.kernel.org/r/20191030094417.16866-4-michal.kalderon@marvell.com
    Signed-off-by: Ariel Elior
    Signed-off-by: Michal Kalderon
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Michal Kalderon
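    A hedged sketch of the flow described above; the rdma_user_mmap_*
    prototypes are assumed from this series and the driver-side structure is
    invented for illustration.

    #include <rdma/ib_verbs.h>

    struct ex_mmap_entry {
        struct rdma_user_mmap_entry rdma_entry;  /* core-managed part */
        unsigned long pfn;                       /* hw resource to expose */
    };

    static int ex_driver_mmap(struct ib_ucontext *uctx,
                              struct vm_area_struct *vma)
    {
        struct rdma_user_mmap_entry *rentry;
        struct ex_mmap_entry *entry;
        int ret;

        /* Look up the entry previously published with
         * rdma_user_mmap_entry_insert(); userspace passed back the offset
         * obtained from rdma_user_mmap_get_offset().
         */
        rentry = rdma_user_mmap_entry_get(uctx, vma);
        if (!rentry)
            return -EINVAL;
        entry = container_of(rentry, struct ex_mmap_entry, rdma_entry);

        /* Passing the entry links it to the VMA, so the core can tell the
         * driver once every mapping of this resource is gone.
         */
        ret = rdma_user_mmap_io(uctx, vma, entry->pfn,
                                vma->vm_end - vma->vm_start,
                                pgprot_noncached(vma->vm_page_prot),
                                rentry);
        rdma_user_mmap_entry_put(rentry);
        return ret;
    }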
     
  • Create some common APIs for adding entries to an xa_mmap, searching for
    an entry, and freeing one.

    The general approach is copied from the EFA driver and improved to be
    more general and to do more to help the drivers. Integration with the
    core allows a reference-counted scheme with a free function so that the
    driver can know when its mmaps are all gone.

    This significant new functionality will help drivers have the correct
    lifetime model for mmap objects.

    Link: https://lore.kernel.org/r/20191030094417.16866-3-michal.kalderon@marvell.com
    Signed-off-by: Ariel Elior
    Signed-off-by: Michal Kalderon
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Michal Kalderon
     

29 Oct, 2019

4 commits

  • Jason Gunthorpe says:

    ====================
    In order to hoist the interval tree code out of the drivers and into the
    mmu_notifiers it is necessary for the drivers to not use the interval tree
    for other things.

    This series replaces the interval tree with an xarray and along the way
    re-aligns all the locking to use a sensible SRCU model where the 'update'
    step is done by modifying an xarray.

    The result is overall much simpler and with less locking in the critical
    path. Many functions were reworked for clarity and small details like
    using 'imr' to refer to the implicit MR make the entire code flow here
    more readable.

    This also squashes at least two race bugs on its own, and quite possibly
    more that haven't been identified.
    ====================

    Merge conflicts with the odp statistics patch resolved.

    * branch 'odp_rework':
    RDMA/odp: Remove broken debugging call to invalidate_range
    RDMA/mlx5: Do not race with mlx5_ib_invalidate_range during create and destroy
    RDMA/mlx5: Do not store implicit children in the odp_mkeys xarray
    RDMA/mlx5: Rework implicit ODP destroy
    RDMA/mlx5: Avoid double lookups on the pagefault path
    RDMA/mlx5: Reduce locking in implicit_mr_get_data()
    RDMA/mlx5: Use an xarray for the children of an implicit ODP
    RDMA/mlx5: Split implicit handling from pagefault_mr
    RDMA/mlx5: Set the HW IOVA of the child MRs to their place in the tree
    RDMA/mlx5: Lift implicit_mr_alloc() into the two routines that call it
    RDMA/mlx5: Rework implicit_mr_get_data
    RDMA/mlx5: Delete struct mlx5_priv->mkey_table
    RDMA/mlx5: Use a dedicated mkey xarray for ODP
    RDMA/mlx5: Split sig_err MR data into its own xarray
    RDMA/mlx5: Use SRCU properly in ODP prefetch

    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • Use SRCU in a sensible way by removing all MRs in the implicit tree from
    the two xarrays (the update operation), then a synchronize, followed by a
    normal single threaded teardown.

    This is only a little unusual from the normal pattern as there can still
    be some work pending in the unbound wq that may also require a workqueue
    flush. This is tracked with a single atomic, consolidating the redundant
    existing atomics and wait queue.

    For understandability the entire ODP implicit create/destroy flow now
    largely exists in a single pair of functions within odp.c, with a few
    support functions for tearing down an unused child.

    Link: https://lore.kernel.org/r/20191009160934.3143-13-jgg@ziepe.ca
    Reviewed-by: Artemy Kovalyov
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • Currently the child leaves are stored in the shared interval tree and
    every lookup for a child must be done under the interval tree rwsem.

    This is further complicated by dropping the rwsem during iteration (ie
    the odp_lookup(), odp_next() pattern), which requires a very tricky and
    difficult-to-understand locking scheme with SRCU.

    Instead, reserve the interval tree for the exclusive use of the mmu
    notifier related code in umem_odp.c and give each implicit MR an xarray
    containing all of its child MRs.

    Since the size of each child is 1GB of VA, a 1 level xarray will index
    64G of VA, and a 2 level will index 2TB, making the xarray a much better
    data structure choice than an interval tree.

    The locking properties of the xarray will be used in the next patches to
    rework the implicit ODP locking scheme into something simpler.

    At this point, the xarray is locked by the implicit MR's umem_mutex, and
    reads can also be locked by the odp_srcu. (A small indexing sketch
    follows this entry.)

    Link: https://lore.kernel.org/r/20191009160934.3143-10-jgg@ziepe.ca
    Reviewed-by: Artemy Kovalyov
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
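    An illustrative sketch of the data-structure choice (names invented for
    the example): each child MR covers a fixed 1GB chunk of VA, so a child's
    index in the parent's xarray is simply its chunk number, which is what
    keeps the tree shallow.

    #include <linux/xarray.h>

    #define EX_CHILD_SHIFT  30   /* each child MR spans 1GB of VA */

    static int ex_store_child(struct xarray *children, u64 va, void *child_mr)
    {
        unsigned long idx = va >> EX_CHILD_SHIFT;

        /* xa_store() returns the old entry or an xa_err()-encoded error. */
        return xa_err(xa_store(children, idx, child_mr, GFP_KERNEL));
    }

    static void *ex_lookup_child(struct xarray *children, u64 va)
    {
        return xa_load(children, va >> EX_CHILD_SHIFT);
    }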
     
  • If dev->dma_device->params == NULL then the maximum DMA segment size is
    64 KB; see also the dma_get_max_seg_size() implementation (a paraphrase
    of it follows this entry). This patch fixes the following kernel warning:

    DMA-API: infiniband rxe0: mapping sg segment longer than device claims to support [len=126976] [max=65536]
    WARNING: CPU: 4 PID: 4848 at kernel/dma/debug.c:1220 debug_dma_map_sg+0x3d9/0x450
    RIP: 0010:debug_dma_map_sg+0x3d9/0x450
    Call Trace:
    srp_queuecommand+0x626/0x18d0 [ib_srp]
    scsi_queue_rq+0xd02/0x13e0 [scsi_mod]
    __blk_mq_try_issue_directly+0x2b3/0x3f0
    blk_mq_request_issue_directly+0xac/0xf0
    blk_insert_cloned_request+0xdf/0x170
    dm_mq_queue_rq+0x43d/0x830 [dm_mod]
    __blk_mq_try_issue_directly+0x2b3/0x3f0
    blk_mq_request_issue_directly+0xac/0xf0
    blk_mq_try_issue_list_directly+0xb8/0x170
    blk_mq_sched_insert_requests+0x23c/0x3b0
    blk_mq_flush_plug_list+0x529/0x730
    blk_flush_plug_list+0x21f/0x260
    blk_mq_make_request+0x56b/0xf20
    generic_make_request+0x196/0x660
    submit_bio+0xae/0x290
    blkdev_direct_IO+0x822/0x900
    generic_file_direct_write+0x110/0x200
    __generic_file_write_iter+0x124/0x2a0
    blkdev_write_iter+0x168/0x270
    aio_write+0x1c4/0x310
    io_submit_one+0x971/0x1390
    __x64_sys_io_submit+0x12a/0x390
    do_syscall_64+0x6f/0x2e0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Link: https://lore.kernel.org/r/20191025225830.257535-2-bvanassche@acm.org
    Cc:
    Fixes: 0b5cb3300ae5 ("RDMA/srp: Increase max_segment_size")
    Signed-off-by: Bart Van Assche
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Bart Van Assche
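    For reference, a hedged paraphrase of the helper mentioned above (written
    as a sketch, not copied from the kernel source): without per-device
    dma_parms the segment size cap falls back to 64 KB, which is what
    produces the warning for a device that never called
    dma_set_max_seg_size().

    #include <linux/device.h>
    #include <linux/sizes.h>

    static unsigned int ex_max_seg_size(struct device *dev)
    {
        if (dev->dma_parms && dev->dma_parms->max_segment_size)
            return dev->dma_parms->max_segment_size;

        return SZ_64K;   /* the default cap behind the DMA-API warning */
    }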