20 Sep, 2015

1 commit

  • Pull rdma fixes from Doug Ledford:
    "The new hfi1 driver in staging/rdma has had a number of fixup patches
    since being added to the tree. This is the first batch of those fixes"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
    IB/hfi: Properly set permissions for user device files
    IB/hfi1: mask vs shift confusion
    IB/hfi1: clean up some defines
    IB/hfi1: info leak in get_ctxt_info()
    IB/hfi1: fix a locking bug
    IB/hfi1: checking for NULL instead of IS_ERR
    IB/hfi1: fix sdma_descq_cnt parameter parsing
    IB/hfi1: fix copy_to/from_user() error handling
    IB/hfi1: fix pstateinfo from returning improperly byteswapped value

    Linus Torvalds
     

18 Sep, 2015

1 commit


09 Sep, 2015

1 commit

  • Pull infiniband/rdma updates from Doug Ledford:
    "This is a fairly sizeable set of changes. I've put them through a
    decent amount of testing prior to sending the pull request due to
    that.

    There are still a few fixups that I know are coming, but I wanted to
    go ahead and get the big, sizable chunk into your hands sooner rather
    than waiting for those last few fixups.

    Of note is the fact that this creates what is intended to be a
    temporary area in the drivers/staging tree specifically for some
    cleanups and additions that are coming for the RDMA stack. We
    deprecated two drivers (ipath and amso1100) and are waiting to hear
    back if we can deprecate another one (ehca). We also put Intel's new
    hfi1 driver into this area because it needs to be refactored and a
    transfer library created out of the factored out code, and then it and
    the qib driver and the soft-roce driver should all be modified to use
    that library.

    I expect drivers/staging/rdma to be around for three or four kernel
    releases and then to go away as all of the work is completed and final
    deletions of deprecated drivers are done.

    Summary of changes for 4.3:

    - Create drivers/staging/rdma
    - Move amso1100 driver to staging/rdma and schedule for deletion
    - Move ipath driver to staging/rdma and schedule for deletion
    - Add hfi1 driver to staging/rdma and set TODO for move to regular
    tree
    - Initial support for namespaces to be used on RDMA devices
    - Add RoCE GID table handling to the RDMA core caching code
    - Infrastructure to support handling of devices with differing read
    and write scatter gather capabilities
    - Various iSER updates
    - Kill off unsafe usage of global mr registrations
    - Update SRP driver
    - Misc mlx4 driver updates
    - Support for the mr_alloc verb
    - Support for a netlink interface between kernel and user space cache
    daemon to speed path record queries and route resolution
    - Initial support for safe hot removal of verbs devices"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (136 commits)
    IB/ipoib: Suppress warning for send only join failures
    IB/ipoib: Clean up send-only multicast joins
    IB/srp: Fix possible protection fault
    IB/core: Move SM class defines from ib_mad.h to ib_smi.h
    IB/core: Remove unnecessary defines from ib_mad.h
    IB/hfi1: Add PSM2 user space header to header_install
    IB/hfi1: Add CSRs for CONFIG_SDMA_VERBOSITY
    mlx5: Fix incorrect wc pkey_index assignment for GSI messages
    IB/mlx5: avoid destroying a NULL mr in reg_user_mr error flow
    IB/uverbs: reject invalid or unknown opcodes
    IB/cxgb4: Fix if statement in pick_local_ip6adddrs
    IB/sa: Fix rdma netlink message flags
    IB/ucma: HW Device hot-removal support
    IB/mlx4_ib: Disassociate support
    IB/uverbs: Enable device removal when there are active user space applications
    IB/uverbs: Explicitly pass ib_dev to uverbs commands
    IB/uverbs: Fix race between ib_uverbs_open and remove_one
    IB/uverbs: Fix reference counting usage of event files
    IB/core: Make ib_dealloc_pd return void
    IB/srp: Create an insecure all physical rkey only if needed
    ...

    Linus Torvalds
     

04 Sep, 2015

2 commits

  • When the hfi1 driver was added these definitions were moved from the qib driver
    to ib_mad.h to be used by both qib and hfi1. They should have been moved to
    ib_smi.h instead.

    Fixes: d4ab347005fb ("IB/core: Add core header changes needed for OPA")
    Reviewed-by: Hal Rosenstock
    Signed-off-by: Ira Weiny
    Signed-off-by: Doug Ledford

    Ira Weiny
     
  • Remove the unused IB_NOTICE_REPRESS_* defines.

    When the hfi1 driver was added these definitions were moved from the qib driver
    to ib_mad.h. They should have been removed instead.

    Fixes: d4ab347005fb ("IB/core: Add core header changes needed for OPA")
    Signed-off-by: Ira Weiny
    Reviewed-by: Hal Rosenstock
    Signed-off-by: Doug Ledford

    Ira Weiny
     

31 Aug, 2015

16 commits

  • Enables uverbs_remove_one to succeed even when there are running IB
    applications working with the given IB device. This functionality
    allows a HW device to be unbound/reset even when running user space
    applications are using it.

    It exposes a new IB kernel API named 'disassociate_ucontext' which lets
    a driver detach its HW resources from a given user context without
    crashing/terminating the application. If a driver implements this API
    and registers with ib_uverbs, there is no dependency between its device
    and its uverbs_device. Upon calling remove_one of ib_uverbs, the call
    should return after disassociating the open HW resources, without waiting
    for clients to disconnect. If the driver does not implement this API,
    current behaviour is unchanged and uverbs_remove_one returns only when the
    last client has disconnected and the reference count on the uverbs device
    has dropped to 0.

    If the lower driver device was removed, any application will continue
    working over a zombie HCA, and further calls will end with an
    immediate error.

    Signed-off-by: Yishai Hadas
    Signed-off-by: Shachar Raindel
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Doug Ledford

    Yishai Hadas
     
  • The majority of callers never check the return value, and even if they
    did, they can't do anything about a failure.

    All possible failure cases represent a bug in the caller, so just
    WARN_ON inside the function instead.

    This fixes a few random errors:
    net/rd/iw.c infinite loops while it fails. (racing with EBUSY?)

    This also lays the groundwork to get rid of the error return from the
    drivers. Most drivers do not return an error, and the few that do are
    broken, since the error cannot be handled.

    Since uverbs can legitimately make use of EBUSY, open code the
    check.

    Signed-off-by: Jason Gunthorpe
    Reviewed-by: Chuck Lever
    Signed-off-by: Doug Ledford

    Jason Gunthorpe
     
  • The pd now has a local_dma_lkey member which completely replaces
    ib_get_dma_mr, use it instead.

    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Doug Ledford

    Jason Gunthorpe
     
  • Every single ULP requires a local_dma_lkey to do anything with
    a QP, so let us ensure one exists for every PD created.

    If the driver can supply a global local_dma_lkey then use that; otherwise
    ask the driver to create a local-use, all-physical-memory MR associated
    with the new PD.

    Signed-off-by: Jason Gunthorpe
    Reviewed-by: Sagi Grimberg
    Acked-by: Christoph Hellwig
    Reviewed-by: Steve Wise
    Reviewed-by: Ira Weiny
    Tested-by: Ira Weiny
    Signed-off-by: Doug Ledford

    Jason Gunthorpe
     
  • This patch adds a function to check if listeners for a netlink multicast
    group are present. It also adds a function to receive netlink response
    messages.

    Signed-off-by: Kaike Wan
    Signed-off-by: John Fleck
    Signed-off-by: Ira Weiny
    Signed-off-by: Doug Ledford

    Kaike Wan
     
  • get_netdev: get the net_device on the physical port of the IB transport port. In
    port aggregation mode it is required to return the netdev of the active port.

    modify_gid: notification of a change in the RoCE GID cache. Handle this by
    writing to the hardware GID table. Indexes in the cache and hardware tables
    may not match, so a translation is required when modifying a QP or creating
    an address handle.

    Signed-off-by: Moni Shoua
    Signed-off-by: Doug Ledford

    Moni Shoua
     
  • RoCE GIDs are based on IP addresses configured on Ethernet net-devices
    which relate to the RDMA (RoCE) device port.

    Currently, each of the low-level drivers that support RoCE (ocrdma,
    mlx4) manages its own RoCE port GID table. As there's nothing which is
    essentially vendor specific, we generalize that, and enhance the RDMA
    core GID cache to do this job.

    In order to populate the GID table, we listen for events:

    (a) netdev up/down/change_addr events - if a netdev is built onto
    our RoCE device, we need to add/delete its IPs. This involves
    adding all GIDs related to this ndev, add default GIDs, etc.

    (b) inet events - add new GIDs (according to the IP addresses)
    to the table.

    For programming the port RoCE GID table, providers must implement
    the add_gid and del_gid callbacks.

    RoCE GID management requires us to state the associated net_device
    alongside the GID. This information is necessary in order to manage
    the GID table. For example, when a net_device is removed, its
    associated GIDs need to be removed as well.

    RoCE mandates generating a default GID for each port, based on the
    related net-device's IPv6 link local. In contrast to the GID based on
    the regular IPv6 link-local (as we generate GID per IP address),
    the default GID is also available when the net device is down (in
    order to support loopback).

    Locking is done as follows:
    The patch modifies the GID table code both for new RoCE drivers
    implementing the add_gid/del_gid callbacks and for current RoCE and
    IB drivers that do not. The flows for updating the table are
    different, so the locking requirements are too.

    While updating RoCE GID table, protection against multiple writers is
    achieved via mutex_lock(&table->lock). Since writing to a table
    requires us to find an entry (possibly a free entry) in the table and
    then modify it, this mutex protects both the find_gid and write_gid
    ensuring the atomicity of the action.
    Each entry in the GID cache is protected by rwlock. In RoCE, writing
    (usually results from netdev notifier) involves invoking the vendor's
    add_gid and del_gid callbacks, which could sleep.
    Therefore, an invalid flag is added for each entry. Updates for RoCE are
    done via a workqueue, thus sleeping is permitted.

    In IB, updates are done in write_lock_irq(&device->cache.lock), thus
    write_gid isn't allowed to sleep and add_gid/del_gid are not called.

    When passing net-device into/out-of the GID cache, the device
    is always passed held (dev_hold).

    The code uses a single work item for updating all RDMA devices,
    following a netdev or inet notifier.

    The patch moves the cache from being a client (which was incorrect,
    as the cache is part of the IB infrastructure) to being explicitly
    initialized/freed when a device is registered/removed.

    Signed-off-by: Matan Barak
    Signed-off-by: Doug Ledford

    Matan Barak
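
The writer-side locking described above (one mutex covering both the lookup and the write) can be illustrated with a minimal user-space sketch. It is not the kernel code: `struct gid_table`, `find_gid_or_free()` and `gid_table_add()` are hypothetical names, a pthread mutex stands in for the kernel mutex, and the per-entry rwlock, the invalid flag and the add_gid/del_gid callbacks are left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

#define GID_LEN    16   /* a GID is 128 bits */
#define TABLE_SIZE 4    /* tiny table, for illustration only */

struct gid_entry {
    unsigned char gid[GID_LEN];
    bool in_use;
};

struct gid_table {
    pthread_mutex_t lock;   /* serializes writers (hypothetical layout) */
    struct gid_entry entries[TABLE_SIZE];
};

void gid_table_init(struct gid_table *t)
{
    assert(t != NULL);
    memset(t->entries, 0, sizeof(t->entries));
    pthread_mutex_init(&t->lock, NULL);
}

/* Return the index already holding gid, else the first free index,
 * else -1. The caller must hold t->lock. */
int find_gid_or_free(const struct gid_table *t, const unsigned char *gid)
{
    int free_ix = -1;

    for (int i = 0; i < TABLE_SIZE; i++) {
        if (t->entries[i].in_use) {
            if (memcmp(t->entries[i].gid, gid, GID_LEN) == 0)
                return i;
        } else if (free_ix < 0) {
            free_ix = i;
        }
    }
    return free_ix;
}

/* Find and write happen under the same mutex, so two writers can never
 * pick the same free slot. Returns the index used, or -1 if full. */
int gid_table_add(struct gid_table *t, const unsigned char *gid)
{
    int ix;

    assert(t != NULL && gid != NULL);
    pthread_mutex_lock(&t->lock);
    ix = find_gid_or_free(t, gid);
    if (ix >= 0) {
        memcpy(t->entries[ix].gid, gid, GID_LEN);
        t->entries[ix].in_use = true;
    }
    pthread_mutex_unlock(&t->lock);
    return ix;
}
```

Adding the same GID twice returns the same index, because the lookup and the write form a single critical section.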
     
  • Fully replaced by a more generic and suitable
    ib_alloc_mr.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Doug Ledford

    Sagi Grimberg
     
  • Use ib_alloc_mr with specific parameters.
    Change the existing callers.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Doug Ledford

    Sagi Grimberg
     
  • This was added with the idea of unifying all MR allocation
    and deallocation routines, but in fact we already have a single
    deallocation routine, ib_dereg_mr.

    And, move mlx5_ib_destroy_mr specific logic into mlx5_ib_dereg_mr
    (includes only signature stuff for now).

    And, fixup the only callers (iser/isert) accordingly.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Doug Ledford

    Sagi Grimberg
     
  • Now that there are no ib_cm clients using the compare_data feature for
    matching IB CM requests' private data, remove the compare_data parameter of
    ib_cm_listen and remove the code implementing the feature.

    Signed-off-by: Haggai Eran
    Signed-off-by: Doug Ledford

    Haggai Eran
     
  • The rdma_cm module will later use the P_Key from the BTH to de-mux
    requests.

    See discussion at:
    http://www.spinics.net/lists/netdev/msg336067.html

    Cc: Jason Gunthorpe
    Cc: Liran Liss
    Signed-off-by: Haggai Eran
    Signed-off-by: Doug Ledford

    Haggai Eran
     
  • Enabling network namespaces for RDMA CM will allow processes on different
    namespaces to listen on the same port. In order to leave namespace support
    out of the CM layer, this requires that multiple RDMA CM IDs will be able
    to share a single CM ID.

    This patch adds infrastructure to retrieve an existing listening ib_cm_id,
    based on its device and service ID, or create a new one if one does not
    already exist. It also adds a reference count for such instances
    (cm_id_private.listen_sharecount), and prevents cm_destroy_id from
    destroying a CM if it is still shared. See the relevant discussion [1].

    [1] Re: [PATCH v3 for-next 05/13] IB/cm: Reference count ib_cm_ids
    http://www.spinics.net/lists/netdev/msg328860.html

    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Haggai Eran
    Signed-off-by: Doug Ledford

    Haggai Eran
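
The find-or-create and share-count behaviour described above might look roughly like the following user-space sketch. Only the field name `listen_sharecount` comes from the commit text; `struct listen_id`, `listen_get()` and `listen_put()` are hypothetical, the list is unlocked, and the count here simply tracks all users, which may differ from the kernel's exact accounting.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a listening CM ID. */
struct listen_id {
    int device;                   /* illustrative device handle */
    uint64_t service_id;
    unsigned int listen_sharecount;
    struct listen_id *next;
};

/* Return the existing listener for (device, service_id) and bump its
 * share count, or create a new one with a count of 1. */
struct listen_id *listen_get(struct listen_id **head, int device,
                             uint64_t service_id)
{
    struct listen_id *p;

    for (p = *head; p != NULL; p = p->next) {
        if (p->device == device && p->service_id == service_id) {
            p->listen_sharecount++;
            return p;
        }
    }
    p = calloc(1, sizeof(*p));
    if (p == NULL)
        return NULL;
    p->device = device;
    p->service_id = service_id;
    p->listen_sharecount = 1;
    p->next = *head;
    *head = p;
    return p;
}

/* Drop one share. The listener is unlinked and freed only when the
 * last sharer goes away. Returns 1 if it was destroyed, else 0. */
int listen_put(struct listen_id **head, struct listen_id *id)
{
    struct listen_id **pp;

    assert(id != NULL && id->listen_sharecount > 0);
    if (--id->listen_sharecount > 0)
        return 0;
    for (pp = head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == id) {
            *pp = id->next;
            break;
        }
    }
    free(id);
    return 1;
}
```

Two callers asking for the same device and service ID get the same object back, and the first `listen_put()` leaves it alive for the second.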
     
  • Expose the service ID on an incoming CM or SIDR request to the event
    handler. This will allow the RDMA CM module to de-multiplex connection
    requests based on the information encoded in the service ID.

    Acked-by: Sean Hefty
    Signed-off-by: Haggai Eran
    Signed-off-by: Doug Ledford

    Haggai Eran
     
  • In the case of IPoIB, and maybe in other cases, the network device is
    managed by an upper-layer protocol (ULP). In order to expose this
    network device to other users of the IB device, let ULPs implement
    a callback that returns network device according to connection parameters.

    The IB device and port, together with the P_Key and the GID should
    be enough to uniquely identify the ULP net device. However, in current
    kernels there can be multiple IPoIB interfaces created with the same GID.
    Furthermore, such a configuration may be desirable to support ipvlan-like
    configurations for RDMA CM with IPoIB. To resolve the device in these
    cases the code will also take the IP address as an additional input.

    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Haggai Eran
    Signed-off-by: Yotam Kenneth
    Signed-off-by: Shachar Raindel
    Signed-off-by: Guy Shapiro
    Signed-off-by: Doug Ledford

    Yotam Kenneth
     
  • An ib_client callback that is called with the lists_rwsem locked only for
    read is protected from changes to the IB client lists, but not from
    ib_unregister_device() freeing its client data. This is because
    ib_unregister_device() will remove the device from the device list with
    lists_rwsem locked for write, but perform the rest of the cleanup,
    including the call to remove() without that lock.

    Mark client data that is undergoing de-registration with a new going_down
    flag in the client data context. Lock the client data list with lists_rwsem
    for write in addition to using the spinlock, so that functions calling the
    callback would be able to lock only lists_rwsem for read and let callbacks
    sleep.

    Since ib_unregister_client() now marks the client data context, no need for
    remove() to search the context again, so pass the client data directly to
    remove() callbacks.

    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Haggai Eran
    Signed-off-by: Doug Ledford

    Haggai Eran
     

29 Aug, 2015

2 commits

  • This functionality already exists via the max_sge_rd
    device capability.

    Signed-off-by: Steve Wise
    Signed-off-by: Doug Ledford

    Steve Wise
     
  • This patch adds the value of the CNP opcode to the existing list of enumerated
    opcodes in ib_pack.h

    Add common OPA header definitions for driver
    build:
    - opa_port_info.h
    - opa_smi.h
    - hfi1_user.h

    Additionally, ib_mad.h has additional definitions
    that are common to IB drivers, including:
    - trap support
    - cca support

    The qib driver has the duplication removed in favor
    of those in ib_mad.h

    Reviewed-by: Mike Marciniszyn
    Reviewed-by: John, Jubin
    Signed-off-by: Ira Weiny
    Signed-off-by: Dennis Dalessandro
    Signed-off-by: Doug Ledford

    Dennis Dalessandro
     

06 Aug, 2015

1 commit

  • The verbs are obsolete. The ib_rereg_phys_mr() verb is not used by
    kernel ULPs, and the last ib_reg_phys_mr() call site in the kernel
    tree has now been removed.

    Two staging tree call sites remain in the Lustre client. The Lustre
    team has been notified of the deprecation of reg_phys_mr.

    Signed-off-by: Chuck Lever
    Acked-by: Doug Ledford
    Signed-off-by: Anna Schumaker

    Chuck Lever
     

15 Jul, 2015

1 commit

  • Pursuant to Liran's comments on node_type on linux-rdma
    mailing list:

    In an effort to reform the RDMA core and ULPs to minimize use of
    node_type in struct ib_device, an additional bit is added to
    struct ib_device for is_switch (IB switch). This bit must
    be initialized by any IB switch device driver. This is a
    NEW requirement on such device drivers, all of which are
    "out of tree".

    In addition, an ib_switch helper was added to ib_verbs.h
    based on the is_switch device bit rather than node_type
    (although those should be consistent).

    The RDMA core (MAD, SMI, agent, sa_query, multicast, sysfs)
    as well as (IPoIB and SRP) ULPs are updated where
    appropriate to use this new helper. In some cases,
    the helper is now used under the covers of using
    rdma_[start end]_port rather than the open coding
    previously used.

    Reviewed-by: Sean Hefty
    Reviewed-By: Jason Gunthorpe
    Reviewed-by: Ira Weiny
    Tested-by: Ira Weiny
    Signed-off-by: Hal Rosenstock
    Signed-off-by: Doug Ledford

    Hal Rosenstock
     

13 Jun, 2015

12 commits

  • For devices which support OPA MADs

    1) Use previously defined SMP support functions.

    2) Pass correct base version to ib_create_send_mad when processing OPA MADs.

    3) Process out_mad_key_index returned by agents for a response. This is
    necessary because OPA SMP packets must carry a valid pkey.

    4) Carry the correct segment size (OPA vs IBTA) of RMPP messages within
    ib_mad_recv_wc.

    5) Handle variable length OPA MADs by:

    * Adjusting the 'fake' WC for locally routed SMP's to represent the
    proper incoming byte_len
    * out_mad_size is used from the local HCA agents
    1) when sending agent responses on the wire
    2) when passing responses through the local_completions
    function

    NOTE: wc.byte_len includes the GRH length and therefore is different
    from the in_mad_size specified to the local HCA agents.
    out_mad_size should _not_ include the GRH length as it is added

    Signed-off-by: Ira Weiny
    Signed-off-by: Doug Ledford

    Ira Weiny
     
  • Add OPA SMP processing functionality.

    Define the new OPA SMP format, create support functions for this format using
    the previously defined helper functions as appropriate.

    These functions are defined in this patch and used in the final OPA MAD support
    patch.

    Signed-off-by: Ira Weiny
    Signed-off-by: Doug Ledford

    Ira Weiny
     
  • This patch is the first of 3 which add processing of OPA MADs

    1) Add Intel Omni-Path Architecture defines
    2) Increase max management version to accommodate OPA
    3) update ib_create_send_mad
    If the device supports OPA MADs and the MAD being sent is the OPA base
    version alter the MAD size and sg lengths as appropriate

    Signed-off-by: Ira Weiny
    Signed-off-by: Doug Ledford

    Ira Weiny
     
  • Add OPA MAD support flags to the core capability immutable flags. In addition
    add the rdma_cap_opa_mad helper function for core functions to use to detect
    OPA MAD support.

    OPA MADs share a common header with IBTA MADs but with some differences for
    increased performance.

    Sharing a common header with IBTA MADs allows us to share most of the MAD
    processing code when dealing with OPA MADs in addition to supporting some IBTA
    MADs on OPA devices.

    OPA MADs differ in the following ways:

    1) MADs are variable size up to 2K
    IBTA defined MADs remain fixed at 256 bytes
    2) OPA SMPs must carry valid PKeys
    3) OPA SMP packets are a different format

    The MAD stack will use this new functionality to determine if OPA MAD
    processing should occur on individual device ports.

    Signed-off-by: Ira Weiny
    Signed-off-by: Doug Ledford

    Ira Weiny
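
The size difference listed above (fixed 256-byte IBTA MADs versus variable OPA MADs of up to 2K) can be sketched as a per-port length check. The sizes come from the commit text. The function names, the base-version constants and the 24-byte minimum (taken here as the common MAD header size) are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IB_MAD_SIZE       256u   /* fixed IBTA MAD size (from the text) */
#define OPA_MAD_MAX_SIZE  2048u  /* OPA MADs vary up to 2K (from the text) */
#define MAD_HDR_SIZE      24u    /* assumed common MAD header size */
#define BASE_VERSION_IB   0x01u  /* assumed value */
#define BASE_VERSION_OPA  0x80u  /* assumed value */

/* Largest MAD accepted for this base version on this port, or 0 when
 * the base version is not supported there. */
size_t mad_size_limit(uint8_t base_version, bool port_has_opa_mad)
{
    if (base_version == BASE_VERSION_IB)
        return IB_MAD_SIZE;
    if (base_version == BASE_VERSION_OPA && port_has_opa_mad)
        return OPA_MAD_MAX_SIZE;
    return 0;
}

/* IBTA MADs must be exactly 256 bytes; OPA MADs may be any length from
 * the header size up to the limit. */
bool mad_len_ok(uint8_t base_version, bool port_has_opa_mad, size_t len)
{
    size_t limit = mad_size_limit(base_version, port_has_opa_mad);

    if (limit == 0)
        return false;
    if (base_version == BASE_VERSION_IB)
        return len == IB_MAD_SIZE;
    return len >= MAD_HDR_SIZE && len <= limit;
}
```

Because the decision is made per port, an OPA-capable port still accepts ordinary 256-byte IBTA MADs.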
     
  • In order to support alternate sized MADs (and variable sized MADs on OPA
    devices) add in/out MAD size parameters to the process_mad core call.

    In addition, add an out_mad_pkey_index to communicate the pkey index the driver
    wishes the MAD stack to use when sending OPA MAD responses.

    The out MAD size and the out MAD PKey index are required by the MAD
    stack to generate responses on OPA devices.

    Furthermore, the in and out MAD parameters are made generic by specifying them
    as ib_mad_hdr rather than ib_mad.

    Drivers are modified as needed and are protected by BUG_ON checks if the MAD
    sizes passed to them are incorrect.

    Signed-off-by: Ira Weiny
    Signed-off-by: Doug Ledford

    Ira Weiny
     
  • Add max MAD size to the device immutable data set and have all drivers that
    support MADs report the current IB MAD size (IB_MGMT_MAD_SIZE) to the core.

    Verify MAD size data in both the MAD core and when reading the immutable data.

    OPA drivers will report alternate MAD sizes in subsequent patches.

    Signed-off-by: Ira Weiny
    Signed-off-by: Doug Ledford

    Ira Weiny
     
  • In preparation to support the new OPA MAD Base version, add a base version
    parameter to ib_create_send_mad and set it to IB_MGMT_BASE_VERSION for current
    users.

    Definition of the new base version and its processing will occur in later
    patches.

    Signed-off-by: Ira Weiny
    Signed-off-by: Doug Ledford

    Ira Weiny
     
  • Vendors should be able to pass vendor specific data to/from
    user-space via query_device uverb. In order to do this,
    we need to pass the vendors' specific udata.

    Signed-off-by: Matan Barak
    Signed-off-by: Or Gerlitz
    Signed-off-by: Doug Ledford

    Matan Barak
     
  • In order to expose timestamp we need to expose two new attributes in
    query_device to be used for CQ completion time-stamping:

    timestamp_mask - how many bits are valid in the timestamp, where timestamp
    values can be at most 64 bits.

    hca_core_clock - the frequency of the HCA in kHz. Timestamps are given in
    HW cycles, so this is necessary in order to convert cycles to seconds.

    This is added both to ib_query_device and its respective uverbs counterpart.

    Signed-off-by: Matan Barak
    Signed-off-by: Or Gerlitz
    Signed-off-by: Doug Ledford

    Matan Barak
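
As a worked illustration of how a consumer might combine the two attributes, the sketch below turns a pair of raw completion timestamps into nanoseconds. It assumes timestamp_mask is a bit mask of the valid low bits and hca_core_clock is in kHz. The function names are hypothetical and the delta tolerates at most one counter wraparound.

```c
#include <assert.h>
#include <stdint.h>

/* Build a mask with the low `bits` bits set (1..64). */
uint64_t ts_mask_from_bits(unsigned int bits)
{
    assert(bits >= 1 && bits <= 64);
    return bits == 64 ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);
}

/* Cycles elapsed between two raw timestamps. Unsigned subtraction
 * followed by masking handles a single wrap of the counter. */
uint64_t ts_delta_cycles(uint64_t start, uint64_t end, uint64_t mask)
{
    return (end - start) & mask;
}

/* At khz kilohertz one cycle lasts 1e6 / khz nanoseconds, so
 * ns = cycles * 1e6 / khz. Quotient and remainder are scaled
 * separately to delay 64-bit overflow. */
uint64_t cycles_to_ns(uint64_t cycles, uint64_t khz)
{
    assert(khz != 0);
    return (cycles / khz) * UINT64_C(1000000) +
           (cycles % khz) * UINT64_C(1000000) / khz;
}
```

For example, with a 156250 kHz clock, 156250 cycles correspond to one millisecond.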
     
  • Add CQ creation flag which dictates that the created CQ will report
    completion time-stamp value in the WC.

    Signed-off-by: Matan Barak
    Signed-off-by: Or Gerlitz
    Signed-off-by: Doug Ledford

    Matan Barak
     
  • Currently, ib_create_cq uses cqe and comp_vector instead
    of the extensible ib_cq_init_attr struct.

    Earlier patches already changed the vendors to work with
    ib_cq_init_attr. This patch changes the consumers too.

    Signed-off-by: Matan Barak
    Signed-off-by: Or Gerlitz
    Signed-off-by: Doug Ledford

    Matan Barak
     
  • Add a new ib_cq_init_attr structure which contains the
    previous cqe (minimum number of CQ entries) and comp_vector
    (completion vector) in addition to a new flags field.
    All vendors' create_cq callbacks are changed in order
    to work with the new API.

    This commit does not change any functionality.

    Signed-off-by: Matan Barak
    Signed-off-by: Or Gerlitz
    Reviewed-By: Devesh Sharma to patch #2
    Signed-off-by: Doug Ledford

    Matan Barak
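
The attribute-struct pattern described above can be sketched in user space as follows. The three fields mirror the ones named in the commit text (cqe, comp_vector, flags). The struct and function names, the flag value and the rejection of unknown flag bits are assumptions made for illustration, not the kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit standing in for a time-stamp creation flag. */
#define CQ_FLAG_TIMESTAMP (UINT32_C(1) << 0)
#define CQ_KNOWN_FLAGS    CQ_FLAG_TIMESTAMP

/* Creation attributes gathered in one struct, so that new options can
 * be added later without changing the create call's signature. */
struct cq_init_attr_sketch {
    unsigned int cqe;       /* minimum number of CQ entries */
    int comp_vector;        /* completion vector */
    uint32_t flags;
};

struct cq_sketch {
    unsigned int cqe;
    int comp_vector;
    int reports_timestamp;
};

/* Returns 0 on success, -1 on invalid attributes. */
int create_cq_sketch(const struct cq_init_attr_sketch *attr,
                     struct cq_sketch *cq)
{
    assert(cq != NULL);
    if (attr == NULL || attr->cqe == 0)
        return -1;
    if (attr->flags & ~CQ_KNOWN_FLAGS)
        return -1;          /* reject flag bits this sketch does not know */
    cq->cqe = attr->cqe;
    cq->comp_vector = attr->comp_vector;
    cq->reports_timestamp = (attr->flags & CQ_FLAG_TIMESTAMP) != 0;
    return 0;
}
```

A caller that leaves flags at zero gets the old behaviour, which is consistent with the commit's statement that it changes no functionality.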
     

11 Jun, 2015

1 commit


02 Jun, 2015

2 commits