01 Feb, 2017

3 commits

  • commit 0a475ef4226e305bdcffe12b401ca1eab06c4913 upstream.

    After setting the indirect_sg_entries module_param to a huge value (e.g. 500,000),
    srp_alloc_req_data() fails to allocate indirect descriptors for the request
    ring (kmalloc fails). This commit enforces the maximum value of
    indirect_sg_entries to be SG_MAX_SEGMENTS, as stated in the module param
    description.

    Fixes: 65e8617fba17 (scsi: rename SCSI_MAX_{SG, SG_CHAIN}_SEGMENTS)
    Fixes: c07d424d6118 (IB/srp: add support for indirect tables that don't fit in SRP_CMD)
    Signed-off-by: Israel Rukshin
    Signed-off-by: Max Gurtovoy
    Reviewed-by: Laurence Oberman
    Reviewed-by: Bart Van Assche
    Signed-off-by: Doug Ledford
    Signed-off-by: Greg Kroah-Hartman

    Israel Rukshin
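
    The clamp described in this commit can be sketched in userspace C as
    follows. This is a minimal illustration and not the upstream diff:
    clamp_indirect_sg_entries() is a hypothetical helper, and the value 2048
    used for SG_MAX_SEGMENTS is an assumption (the kernel constant depends on
    configuration).

```c
#include <assert.h>
#include <stdio.h>

/* Assumed value, for illustration only; the kernel constant depends on configuration. */
#define SG_MAX_SEGMENTS 2048u

/*
 * Hypothetical helper: return the value the driver would actually use.
 * A request above SG_MAX_SEGMENTS is clamped (with a warning) instead of
 * being passed on to an allocation that is likely to fail.
 */
static unsigned int clamp_indirect_sg_entries(unsigned int requested)
{
    if (requested > SG_MAX_SEGMENTS) {
        fprintf(stderr, "clamping indirect_sg_entries from %u to %u\n",
                requested, SG_MAX_SEGMENTS);
        return SG_MAX_SEGMENTS;
    }
    return requested;
}

/* Usage example: the oversized value from the commit message is clamped. */
void clamp_example(void)
{
    assert(clamp_indirect_sg_entries(500000u) == SG_MAX_SEGMENTS);
    assert(clamp_indirect_sg_entries(512u) == 512u);
}
```

    Clamping at load time keeps the failure close to its cause: the user sees
    one warning instead of a later allocation failure in the request ring.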
     
  • commit ad8e66b4a80182174f73487ed25fd2140cf43361 upstream.

    If the device supports arbitrary sg list mapping (device cap
    IB_DEVICE_SG_GAPS_REG set), we allocate the memory regions with
    IB_MR_TYPE_SG_GAPS.

    Fixes: 509c5f33f4f6 ("IB/srp: Prevent mapping failures")
    Signed-off-by: Israel Rukshin
    Signed-off-by: Max Gurtovoy
    Reviewed-by: Leon Romanovsky
    Reviewed-by: Mark Bloch
    Reviewed-by: Yuval Shaia
    Reviewed-by: Bart Van Assche
    Signed-off-by: Doug Ledford
    Signed-off-by: Greg Kroah-Hartman

    Israel Rukshin
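
    The selection rule in this commit amounts to one capability test. The
    sketch below shows it in standalone C. The enum and the flag are local
    stand-ins named after the kernel identifiers, the bit position is an
    assumption, and pick_mr_type() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the kernel definitions; the bit position is assumed. */
#define IB_DEVICE_SG_GAPS_REG (1ULL << 32)

enum ib_mr_type {
    IB_MR_TYPE_MEM_REG,   /* regular registration: no gaps allowed in the sg list */
    IB_MR_TYPE_SG_GAPS    /* registration that tolerates arbitrary sg lists */
};

/* Hypothetical helper: choose the MR type from the device capability flags. */
static enum ib_mr_type pick_mr_type(uint64_t device_cap_flags)
{
    if (device_cap_flags & IB_DEVICE_SG_GAPS_REG)
        return IB_MR_TYPE_SG_GAPS;
    return IB_MR_TYPE_MEM_REG;
}

/* Usage example: only a device that advertises the capability gets SG_GAPS. */
void pick_mr_type_example(void)
{
    assert(pick_mr_type(IB_DEVICE_SG_GAPS_REG) == IB_MR_TYPE_SG_GAPS);
    assert(pick_mr_type(0) == IB_MR_TYPE_MEM_REG);
}
```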
     
  • commit 1e5db6c31ade4150c2e2b1a21e39f776c38fea39 upstream.

    For devices that can register a page list that is bigger than
    USHRT_MAX, we actually take the wrong value for sg_tablesize.
    E.g. for CX4, max_fast_reg_page_list_len is 65536 (bigger than USHRT_MAX),
    so we set sg_tablesize to 0 by mistake. As a result, each IO that is
    bigger than 4k is split into "< 4k" chunks, which causes performance degradation.
    Remove the wrong sg_tablesize assignment, and use the value that was set in
    the address resolution handler with the needed casting.

    Signed-off-by: Max Gurtovoy
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Doug Ledford
    Signed-off-by: Greg Kroah-Hartman

    Max Gurtovoy
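
    The wrap-around behind this bug is easy to reproduce outside the kernel.
    The sketch below contrasts a plain narrowing cast with a clamp-then-cast.
    Both helpers are hypothetical and use uint16_t in place of unsigned short;
    the clamp is one possible remedy for illustration, not necessarily what
    the upstream patch does.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy pattern: a plain narrowing conversion wraps modulo 65536. */
static uint16_t sg_tablesize_truncated(uint32_t max_page_list_len)
{
    return (uint16_t)max_page_list_len;
}

/* Safer pattern: clamp to the range of the destination type, then cast. */
static uint16_t sg_tablesize_clamped(uint32_t max_page_list_len)
{
    if (max_page_list_len > UINT16_MAX)
        max_page_list_len = UINT16_MAX;
    return (uint16_t)max_page_list_len;
}

/* Usage example with the value quoted in the commit message. */
void sg_tablesize_example(void)
{
    assert(sg_tablesize_truncated(65536u) == 0);      /* the reported bug */
    assert(sg_tablesize_clamped(65536u) == 65535u);   /* largest representable */
}
```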
     

26 Jan, 2017

1 commit

  • commit 0b59970e7d96edcb3c7f651d9d48e1a59af3c3b0 upstream.

    Remove the warning print of "can't use of GFP_NOIO" to avoid prints in
    each QP creation when devices don't support IB_QP_CREATE_USE_GFP_NOIO.

    This print becomes more annoying when the IPoIB interface is configured
    to work in connected mode.

    Fixes: 09b93088d750 ('IB: Add a QP creation flag to use GFP_NOIO allocations')
    Signed-off-by: Kamal Heib
    Signed-off-by: Leon Romanovsky
    Reviewed-by: Yuval Shaia
    Signed-off-by: Doug Ledford
    Signed-off-by: Greg Kroah-Hartman

    Kamal Heib
     

09 Jan, 2017

1 commit

  • commit 11b642b84e8c43e8597de031678d15c08dd057bc upstream.

    This patch avoids that Coverity reports the following:

    Using uninitialized value port_attr.state when calling printk

    Fixes: commit 94232d9ce817 ("IPoIB: Start multicast join process only on active ports")
    Signed-off-by: Bart Van Assche
    Cc: Erez Shitrit
    Reviewed-by: Leon Romanovsky
    Signed-off-by: Doug Ledford
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     

14 Oct, 2016

1 commit

  • After the commit 9207f9d45b0a ("net: preserve IP control block
    during GSO segmentation"), the GSO CB and the IPoIB CB conflict.
    That destroys the IPoIB address information cached there,
    causing a severe performance regression, as better described here:

    http://marc.info/?l=linux-kernel&m=146787279825501&w=2

    This change moves the data cached by the IPoIB driver from the
    skb control block into the IPoIB hard header, as done before
    the commit 936d7de3d736 ("IPoIB: Stop lying about hard_header_len
    and use skb->cb to stash LL addresses").
    To avoid GRO issues on packet reception, the IPoIB driver
    stashes a dummy pseudo header into the skb, so that received
    packets actually have a hard header matching the declared length.
    To avoid changing the connected mode maximum mtu, the allocated
    head buffer size is increased by the pseudo header length.

    After this commit, IPoIB performance is back to its pre-regression
    value.

    v2 -> v3: rebased
    v1 -> v2: avoid changing the max mtu, increasing the head buf size

    Fixes: 9207f9d45b0a ("net: preserve IP control block during GSO segmentation")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
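
    The layout idea in this commit, keeping the link-layer address in front
    of the encapsulation header instead of in a control block that another
    layer may overwrite, can be sketched as below. The helper names are
    hypothetical, and the sizes (a 20-byte address followed by a 4-byte
    encapsulation header) are assumptions about the driver's layout rather
    than something stated in the commit message.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define INFINIBAND_ALEN  20                 /* IPoIB link-layer address length */
#define IPOIB_ENCAP_LEN   4                 /* assumed encapsulation header size */
#define IPOIB_PSEUDO_LEN INFINIBAND_ALEN    /* assumed pseudo header size */
#define IPOIB_HARD_LEN   (IPOIB_ENCAP_LEN + IPOIB_PSEUDO_LEN)

/*
 * Hypothetical helper: write the pseudo header (destination address)
 * followed by the encapsulation header (big-endian protocol, 2 reserved bytes).
 */
static void build_hard_header(uint8_t hdr[IPOIB_HARD_LEN],
                              const uint8_t daddr[INFINIBAND_ALEN],
                              uint16_t proto)
{
    assert(hdr != NULL && daddr != NULL);
    memcpy(hdr, daddr, INFINIBAND_ALEN);
    hdr[IPOIB_PSEUDO_LEN + 0] = (uint8_t)(proto >> 8);
    hdr[IPOIB_PSEUDO_LEN + 1] = (uint8_t)(proto & 0xff);
    hdr[IPOIB_PSEUDO_LEN + 2] = 0;
    hdr[IPOIB_PSEUDO_LEN + 3] = 0;
}

/* The transmit path reads the address back from the header itself. */
static const uint8_t *cached_daddr(const uint8_t hdr[IPOIB_HARD_LEN])
{
    assert(hdr != NULL);
    return hdr;
}
```

    Because the address travels inside the packet data, nothing that reuses
    the control block between the stack and the driver can clobber it.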
     

10 Oct, 2016

1 commit

  • Pull main rdma updates from Doug Ledford:
    "This is the main pull request for the rdma stack this release. The
    code has been through 0day and I had it tagged for linux-next testing
    for a couple days.

    Summary:

    - updates to mlx5

    - updates to mlx4 (two conflicts, both minor and easily resolved)

    - updates to iw_cxgb4 (one conflict, not so obvious to resolve,
    proper resolution is to keep the code in cxgb4_main.c as it is in
    Linus' tree as attach_uld was refactored and moved into
    cxgb4_uld.c)

    - improvements to uAPI (moved vendor specific API elements to uAPI
    area)

    - add hns-roce driver and hns and hns-roce ACPI reset support

    - conversion of all rdma code away from deprecated
    create_singlethread_workqueue

    - security improvement: remove unsafe ib_get_dma_mr (breaks lustre in
    staging)"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (75 commits)
    staging/lustre: Disable InfiniBand support
    iw_cxgb4: add fast-path for small REG_MR operations
    cxgb4: advertise support for FR_NSMR_TPTE_WR
    IB/core: correctly handle rdma_rw_init_mrs() failure
    IB/srp: Fix infinite loop when FMR sg[0].offset != 0
    IB/srp: Remove an unused argument
    IB/core: Improve ib_map_mr_sg() documentation
    IB/mlx4: Fix possible vl/sl field mismatch in LRH header in QP1 packets
    IB/mthca: Move user vendor structures
    IB/nes: Move user vendor structures
    IB/ocrdma: Move user vendor structures
    IB/mlx4: Move user vendor structures
    IB/cxgb4: Move user vendor structures
    IB/cxgb3: Move user vendor structures
    IB/mlx5: Move and decouple user vendor structures
    IB/{core,hw}: Add constant for node_desc
    ipoib: Make ipoib_warn ratelimited
    IB/mlx4/alias_GUID: Remove deprecated create_singlethread_workqueue
    IB/ipoib_verbs: Remove deprecated create_singlethread_workqueue
    IB/ipoib: Remove deprecated create_singlethread_workqueue
    ...

    Linus Torvalds
     

08 Oct, 2016

5 commits

  • Avoid that mapping an sg-list in which the first element has a
    non-zero offset triggers an infinite loop when using FMR. This
    patch makes the FMR mapping code similar to that of ib_sg_to_pages().

    Note: older Mellanox HCAs do not support non-zero offsets for FMR.
    See also commit 8c4037b501ac ("IB/srp: always avoid non-zero offsets
    into an FMR").

    Reported-by: Alex Estrin
    Signed-off-by: Bart Van Assche
    Cc:
    Signed-off-by: Doug Ledford

    Bart Van Assche
     
  • Signed-off-by: Bart Van Assche
    Signed-off-by: Doug Ledford

    Bart Van Assche
     
  • In certain cases it's possible to be flooded by warning messages. To
    cope with such situations make the ipoib_warn macro be ratelimited.
    To prevent accidental limiting of legitimate, bursty messages make
    the limit fairly liberal by allowing up to 100 messages in 10 seconds.

    Signed-off-by: Nikolay Borisov
    Signed-off-by: Doug Ledford

    kernel@kyup.com
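
    The "100 messages in 10 seconds" policy from this commit can be modelled
    with a small fixed-window limiter. The sketch below is a userspace
    approximation with hypothetical names. It is not the kernel's ratelimit
    implementation, and the time is passed in explicitly so that the
    behaviour is easy to check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Fixed-window rate limiter: at most `burst` events per `interval` seconds. */
struct ratelimit {
    long interval;   /* window length in seconds */
    int  burst;      /* events allowed per window */
    long begin;      /* start of the current window */
    int  printed;    /* events let through in the current window */
    bool started;    /* false until the first event arrives */
};

/* Returns true if the caller may emit a message at time `now` (seconds). */
static bool ratelimit_allow(struct ratelimit *rl, long now)
{
    assert(rl != NULL);
    if (!rl->started || now - rl->begin >= rl->interval) {
        rl->begin = now;      /* open a new window */
        rl->printed = 0;
        rl->started = true;
    }
    if (rl->printed < rl->burst) {
        rl->printed++;
        return true;
    }
    return false;             /* over budget: suppress this message */
}
```

    With interval = 10 and burst = 100, a burst of legitimate messages still
    gets through, while a flood is cut off until the next window opens.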
     
  • alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
    deprecated create_singlethread_workqueue(). This is the identity
    conversion.

    The workqueue "wq" queues multiple work items, viz. &priv->restart_task,
    &priv->cm.rx_reap_task, &priv->cm.skb_task, &priv->neigh_reap_task,
    &priv->ah_reap_task, &priv->mcast_task and &priv->carrier_on_task.
    The work items require strict execution ordering.
    Hence, an ordered dedicated workqueue has been used.

    WQ_MEM_RECLAIM has been set to ensure forward progress under
    memory pressure.

    Signed-off-by: Bhaktipriya Shridhar
    Signed-off-by: Doug Ledford

    Bhaktipriya Shridhar
     
  • alloc_ordered_workqueue() replaces deprecated
    create_singlethread_workqueue().

    The workqueue "ipoib_workqueue" is used for all flush operations
    for the device.

    WQ_MEM_RECLAIM has been set since the flush operations may need to
    complete in order for other network functions to continue, and
    the memory reclaim operation might need the network functioning in
    order to make progress.

    Signed-off-by: Bhaktipriya Shridhar
    Signed-off-by: Doug Ledford

    Bhaktipriya Shridhar
     

24 Sep, 2016

3 commits


17 Sep, 2016

1 commit

  • This fix solves a race between light flush and on the fly joins.
    Light flush doesn't set the device to down and doesn't unset the IPOIB_OPER_UP
    flag. This means that if an MC join is in progress while flushing
    and the QP was attached to the BC MGID, we can get a mismatch when
    re-attaching the QP to the BC MGID.

    The light flush would set the broadcast group to NULL causing an on
    the fly join to rejoin and reattach to the BC MCG as well as adding
    the BC MGID to the multicast list. The flush process would later on
    remove the BC MGID and detach it from the QP. On the next flush
    the BC MGID is present in the multicast list but not found when trying
    to detach it because of the previous double attach and single detach.

    [18332.714265] ------------[ cut here ]------------
    [18332.717775] WARNING: CPU: 6 PID: 3767 at drivers/infiniband/core/verbs.c:280 ib_dealloc_pd+0xff/0x120 [ib_core]
    ...
    [18332.775198] Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
    [18332.779411] 0000000000000000 ffff8800b50dfbb0 ffffffff813fed47 0000000000000000
    [18332.784960] 0000000000000000 ffff8800b50dfbf0 ffffffff8109add1 0000011832f58300
    [18332.790547] ffff880226a596c0 ffff880032482000 ffff880032482830 ffff880226a59280
    [18332.796199] Call Trace:
    [18332.798015] [] dump_stack+0x63/0x8c
    [18332.801831] [] __warn+0xd1/0xf0
    [18332.805403] [] warn_slowpath_null+0x1d/0x20
    [18332.809706] [] ib_dealloc_pd+0xff/0x120 [ib_core]
    [18332.814384] [] ipoib_transport_dev_cleanup+0xfc/0x1d0 [ib_ipoib]
    [18332.820031] [] ipoib_ib_dev_cleanup+0x98/0x110 [ib_ipoib]
    [18332.825220] [] ipoib_dev_cleanup+0x2d8/0x550 [ib_ipoib]
    [18332.830290] [] ipoib_uninit+0x2f/0x40 [ib_ipoib]
    [18332.834911] [] rollback_registered_many+0x1aa/0x2c0
    [18332.839741] [] rollback_registered+0x31/0x40
    [18332.844091] [] unregister_netdevice_queue+0x48/0x80
    [18332.848880] [] ipoib_vlan_delete+0x1fb/0x290 [ib_ipoib]
    [18332.853848] [] delete_child+0x7d/0xf0 [ib_ipoib]
    [18332.858474] [] dev_attr_store+0x18/0x30
    [18332.862510] [] sysfs_kf_write+0x3a/0x50
    [18332.866349] [] kernfs_fop_write+0x120/0x170
    [18332.870471] [] __vfs_write+0x28/0xe0
    [18332.874152] [] ? percpu_down_read+0x1f/0x50
    [18332.878274] [] vfs_write+0xa2/0x1a0
    [18332.881896] [] SyS_write+0x46/0xa0
    [18332.885632] [] do_syscall_64+0x57/0xb0
    [18332.889709] [] entry_SYSCALL64_slow_path+0x25/0x25
    [18332.894727] ---[ end trace 09ebbe31f831ef17 ]---

    Fixes: ee1e2c82c245 ("IPoIB: Refresh paths instead of flushing them on SM change events")
    Signed-off-by: Alex Vesker
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Doug Ledford

    Alex Vesker
     

03 Sep, 2016

2 commits

  • When a new CM connection is being requested, the ipoib driver copies data
    from the path pointer in the CM/tx object. The path object might be
    invalid at that point, and memory corruption will happen later when
    the CM driver tries to use that data.

    The next scenario demonstrates it:
    neigh_add_path --> ipoib_cm_create_tx -->
    queue_work (pointer to path is in the cm/tx struct)
    #while the work is still in the queue,
    #the port goes down and causes the ipoib_flush_paths:
    ipoib_flush_paths --> path_free --> kfree(path)
    #at this point the work scheduled starts.
    ipoib_cm_tx_start --> copy from the (invalid)path pointer:
    (memcpy(&pathrec, &p->path->pathrec, sizeof pathrec);)
    -> memory corruption.

    To fix that the driver now starts the CM/tx connection only if that
    specific path exists in the general paths database.
    This check is protected with the relevant locks, and uses the gid from
    the neigh member in the CM/tx object which is valid according to the ref
    count that was taken by the CM/tx.

    Fixes: 839fcaba35 ('IPoIB: Connected mode experimental support')
    Signed-off-by: Erez Shitrit
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Doug Ledford

    Erez Shitrit
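
    The fix in this commit follows a general pattern: deferred work keeps a
    key (here the GID) rather than a raw pointer, and looks the object up
    again when it finally runs. The sketch below models that pattern with a
    tiny in-memory path table. All names are hypothetical, and the locking
    that the real driver needs is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GID_LEN   16
#define MAX_PATHS  8

struct path {
    unsigned char gid[GID_LEN];
    int pathrec;     /* stand-in for the path record data */
    int in_use;
};

static struct path path_db[MAX_PATHS];

/* Look a path up by GID; returns NULL if it is not (or no longer) present. */
static struct path *path_find(const unsigned char gid[GID_LEN])
{
    for (size_t i = 0; i < MAX_PATHS; i++)
        if (path_db[i].in_use && memcmp(path_db[i].gid, gid, GID_LEN) == 0)
            return &path_db[i];
    return NULL;
}

static struct path *path_add(const unsigned char gid[GID_LEN], int pathrec)
{
    for (size_t i = 0; i < MAX_PATHS; i++) {
        if (!path_db[i].in_use) {
            memcpy(path_db[i].gid, gid, GID_LEN);
            path_db[i].pathrec = pathrec;
            path_db[i].in_use = 1;
            return &path_db[i];
        }
    }
    return NULL;     /* table full */
}

/* Models a flush: every path goes away while work may still be queued. */
static void path_flush(void)
{
    memset(path_db, 0, sizeof(path_db));
}

/* Deferred work item: carries the GID, not a pointer into path_db. */
struct cm_tx {
    unsigned char gid[GID_LEN];
};

/* Returns 0 and copies the record if the path still exists, -1 otherwise. */
static int cm_tx_start(const struct cm_tx *tx, int *pathrec_out)
{
    assert(tx != NULL && pathrec_out != NULL);
    struct path *p = path_find(tx->gid);
    if (p == NULL)
        return -1;   /* path was flushed: do not touch stale memory */
    *pathrec_out = p->pathrec;
    return 0;
}
```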
     
  • When the low level driver (lld) exercises hot unplug, it calls
    rdma_cm cma_remove_one, which fires a DEVICE_REMOVAL event to all cma
    consumers. If a consumer doesn't make sure it destroys all IB
    objects created on that IB device instance before it finishes
    processing the DEVICE_REMOVAL callback, rdma_cm will let the lld
    de-register with IB core and destroy the IB device instance. If the
    consumer then calls (say) ib_dereg_mr(), it will crash since that dev object
    is NULL.

    In the current implementation, iser-target just initiates the cleanup
    and returns from the DEVICE_REMOVAL callback. This deferred work creates a
    race between iser-target cleaning IB objects (say MR) and the lld destroying
    the IB device instance.

    This patch includes the following fixes:
    -> make sure that the consumer frees all IB objects associated with the device
    instance
    -> return non-zero from the callback to destroy the rdma_cm id

    Signed-off-by: Raju Rangoju
    Acked-by: Sagi Grimberg
    Signed-off-by: Doug Ledford

    Raju Rangoju
     

25 Aug, 2016

1 commit

  • If port_guid is set with the default subnet_prefix, and we then get a change
    event and run a port refresh, we don't update the port_guid. As a
    result, attempts to create a target device that uses the new
    subnet_prefix in the wwn will fail to find a match and be rejected by
    the ib_srpt driver. This makes it impossible to configure a port if it
    was initialized with a default subnet_prefix and later changed to any
    non-default subnet_prefix. Updating the port refresh task to always
    update the wwn based upon the current subnet_prefix solves this
    problem.

    Cc: Bart Van Assche
    Cc: nab@linux-iscsi.org
    Signed-off-by: Doug Ledford

    Doug Ledford
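
    Why a stale subnet_prefix breaks the match can be seen by formatting the
    port name from both halves of the GID. The sketch below assumes the name
    is "0x" followed by the 64-bit subnet prefix and the 64-bit interface ID
    in hex. That format, the helper name and the sample interface ID are
    assumptions for illustration only.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WWN_BUF_LEN 64

/* Hypothetical helper: build the port name from the current GID halves. */
static void format_port_wwn(char buf[WWN_BUF_LEN],
                            uint64_t subnet_prefix, uint64_t interface_id)
{
    assert(buf != NULL);
    snprintf(buf, WWN_BUF_LEN, "0x%016" PRIx64 "%016" PRIx64,
             subnet_prefix, interface_id);
}

/* Returns 1 if a name computed earlier still matches the current prefix. */
static int wwn_still_matches(const char *cached_name,
                             uint64_t current_prefix, uint64_t interface_id)
{
    char fresh[WWN_BUF_LEN];

    assert(cached_name != NULL);
    format_port_wwn(fresh, current_prefix, interface_id);
    return strcmp(cached_name, fresh) == 0;
}
```

    A name computed once under the default prefix stops matching as soon as
    the prefix changes, which is why the refresh task has to recompute it.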
     

23 Aug, 2016

1 commit


05 Aug, 2016

2 commits

  • Pull second round of rdma updates from Doug Ledford:
    "This can be split out into just two categories:

    - fixes to the RDMA R/W API in regards to SG list length limits
    (about 5 patches)

    - fixes/features for the Intel hfi1 driver (everything else)

    The hfi1 driver is still being brought to full feature support by
    Intel, and they have a lot of people working on it, so that amounts to
    almost the entirety of this pull request"

    * tag 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (84 commits)
    IB/hfi1: Add cache evict LRU list
    IB/hfi1: Fix memory leak during unexpected shutdown
    IB/hfi1: Remove unneeded mm argument in remove function
    IB/hfi1: Consistently call ops->remove outside spinlock
    IB/hfi1: Use evict mmu rb operation
    IB/hfi1: Add evict operation to the mmu rb handler
    IB/hfi1: Fix TID caching actions
    IB/hfi1: Make the cache handler own its rb tree root
    IB/hfi1: Make use of mm consistent
    IB/hfi1: Fix user SDMA racy user request claim
    IB/hfi1: Fix error condition that needs to clean up
    IB/hfi1: Release node on insert failure
    IB/hfi1: Validate SDMA user iovector count
    IB/hfi1: Validate SDMA user request index
    IB/hfi1: Use the same capability state for all shared contexts
    IB/hfi1: Prevent null pointer dereference
    IB/hfi1: Rename TID mmu_rb_* functions
    IB/hfi1: Remove unneeded empty check in hfi1_mmu_rb_unregister()
    IB/hfi1: Restructure hfi1_file_open
    IB/hfi1: Make iovec loop index easy to understand
    ...

    Linus Torvalds
     
  • Pull base rdma updates from Doug Ledford:
    "Round one of 4.8 code: while this is mostly normal, there is a new
    driver in here (the driver was hosted outside the kernel for several
    years and is actually a fairly mature and well coded driver). It
    amounts to 13,000 of the 16,000 lines of added code in here.

    Summary:

    - Updates/fixes for iw_cxgb4 driver
    - Updates/fixes for mlx5 driver
    - Add flow steering and RSS API
    - Add hardware stats to mlx4 and mlx5 drivers
    - Add firmware version API for RDMA driver use
    - Add the rxe driver (this is a software RoCE driver that makes any
    Ethernet device a RoCE device)
    - Fixes for i40iw driver
    - Support for send only multicast joins in the cma layer
    - Other minor fixes"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (72 commits)
    Soft RoCE driver
    IB/core: Support for CMA multicast join flags
    IB/sa: Add cached attribute containing SM information to SA port
    IB/uverbs: Fix race between uverbs_close and remove_one
    IB/mthca: Clean up error unwind flow in mthca_reset()
    IB/mthca: NULL arg to pci_dev_put is OK
    IB/hfi1: NULL arg to sc_return_credits is OK
    IB/mlx4: Add diagnostic hardware counters
    net/mlx4: Query performance and diagnostics counters
    net/mlx4: Add diagnostic counters capability bit
    Use smaller 512 byte messages for portmapper messages
    IB/ipoib: Report SG feature regardless of HW UD CSUM capability
    IB/mlx4: Don't use GFP_ATOMIC for CQ resize struct
    IB/hfi1: Disable by default
    IB/rdmavt: Disable by default
    IB/mlx5: Fix port counter ID association to QP offset
    IB/mlx5: Fix iteration overrun in GSI qps
    i40iw: Add NULL check for puda buffer
    i40iw: Change dup_ack_thresh to u8
    i40iw: Remove unnecessary check for moving CQ head
    ...

    Linus Torvalds
     

04 Aug, 2016

2 commits


03 Aug, 2016

3 commits

  • Signed-off-by: Bart Van Assche
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Tested-by: Steve Wise
    Tested-by: Laurence Oberman
    Cc: Parav Pandit
    Cc: Nicholas Bellinger
    Signed-off-by: Doug Ledford

    Bart Van Assche
     
  • Initialize first_wr to &send_wr. This allows removing a ternary
    operator and an else branch. This patch does not change the behavior
    of srpt_queue_response().

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Tested-by: Steve Wise
    Tested-by: Laurence Oberman
    Cc: Parav Pandit
    Cc: Nicholas Bellinger
    Signed-off-by: Doug Ledford

    Bart Van Assche
     
  • Limit the number of SG elements per work request to what the HCA
    and the queue pair support.

    Fixes: 34693573fde0 ("IB/srpt: Reduce QP buffer size")
    Reported-by: Parav Pandit
    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Sagi Grimberg
    Cc: Steve Wise
    Cc: Parav Pandit
    Cc: Nicholas Bellinger
    Cc: Laurence Oberman
    Cc: #v4.7+
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Doug Ledford

    Bart Van Assche
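
    Limiting the SG elements per work request comes down to taking the
    smallest of the relevant limits. The sketch below shows that, plus the
    resulting number of work requests for a transfer. All names are
    hypothetical, and how the real driver derives each limit is not shown in
    the commit message.

```c
#include <assert.h>

static unsigned int min_uint(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/* Hypothetical helper: the per-WR SG limit is the smallest of all limits. */
static unsigned int effective_max_sge(unsigned int wanted,
                                      unsigned int hca_max_sge,
                                      unsigned int qp_max_sge)
{
    return min_uint(wanted, min_uint(hca_max_sge, qp_max_sge));
}

/* Number of work requests needed to cover `nents` SG elements. */
static unsigned int wr_count(unsigned int nents, unsigned int max_sge)
{
    assert(max_sge > 0);
    /* Round up without risking overflow in nents + max_sge. */
    return nents / max_sge + (nents % max_sge != 0);
}
```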
     

24 Jun, 2016

2 commits

  • Using this allows for devices to specify the format of their
    firmware version rather than forcing a format.

    Reviewed-by: Dennis Dalessandro
    Signed-off-by: Ira Weiny
    Signed-off-by: Doug Ledford

    Ira Weiny
     
  • The memory needed for the send and receive queues associated with
    a QP is proportional to the max_sge parameter. The current value
    of that parameter is such that with an mlx4 HCA the QP buffer size
    is 8 MB. Since DMA is used for communication between HCA and CPU
    that buffer either has to be allocated coherently or map_single()
    must succeed for that buffer. Since large contiguous allocations
    are fragile and since the maximum segment size for e.g. swiotlb
    is 256 KB, reduce the max_sge parameter. This patch avoids that
    the following text appears on the console after SRP logout and
    relogin on a system equipped with multiple IB HCAs:

    mlx4_core 0000:05:00.0: swiotlb buffer is full (sz: 8388608 bytes)
    swiotlb: coherent allocation failed for device 0000:05:00.0 size=8388608
    CPU: 11 PID: 148 Comm: kworker/11:1 Not tainted 4.7.0-rc4-dbg+ #1
    Call Trace:
    [] dump_stack+0x67/0x92
    [] swiotlb_alloc_coherent+0x141/0x150
    [] x86_swiotlb_alloc_coherent+0x3e/0x50
    [] mlx4_buf_direct_alloc.isra.5+0x9a/0x120 [mlx4_core]
    [] mlx4_buf_alloc+0x165/0x1a0 [mlx4_core]
    [] create_qp_common.isra.29+0x57d/0xff0 [mlx4_ib]
    [] mlx4_ib_create_qp+0x12a/0x3f0 [mlx4_ib]
    [] ib_create_qp+0x3a/0x250 [ib_core]
    [] srpt_cm_handler+0x4bb/0xcad [ib_srpt]
    [] cm_process_work+0x20/0xf0 [ib_cm]
    [] cm_work_handler+0x1ac0/0x2059 [ib_cm]
    [] process_one_work+0x19d/0x490
    [] worker_thread+0x49/0x490
    [] kthread+0xea/0x100
    [] ret_from_fork+0x1f/0x40

    Fixes: b99f8e4d7bcd ("IB/srpt: convert to the generic RDMA READ/WRITE API")
    Signed-off-by: Bart Van Assche
    Cc: Laurence Oberman
    Cc: Christoph Hellwig
    Cc: Sagi Grimberg
    Signed-off-by: Doug Ledford

    Bart Van Assche
     

07 Jun, 2016

5 commits

  • ipoib_neigh_get unconditionally updates the "alive" variable member on
    any packet send. This prevents the neighbor garbage collection from
    cleaning out a dead neighbor entry if we are still queueing packets
    for it. If the queue for this neighbor is full, then don't update the
    alive timestamp. That way the neighbor can time out even if packets
    are still being queued as long as none of them are being sent.

    Fixes: b63b70d87741 ("IPoIB: Use a private hash table for path lookup in xmit path")
    Signed-off-by: Erez Shitrit
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Doug Ledford

    Erez Shitrit
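
    The rule in this commit, refreshing the "alive" timestamp only while the
    neighbor's queue still has room, can be sketched as below. The struct,
    the helper names and the queue limit of 3 are assumptions for
    illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PATH_REC_QUEUE 3   /* assumed per-neighbor queue limit */

struct neigh {
    unsigned long alive;       /* time of the last refresh */
    unsigned int  queue_len;   /* packets waiting for this neighbor */
};

/* Called on every lookup: refresh only if the queue is not already full. */
static void neigh_touch(struct neigh *n, unsigned long now)
{
    assert(n != NULL);
    if (n->queue_len < MAX_PATH_REC_QUEUE)
        n->alive = now;
}

/* Garbage collection check: has the neighbor been idle for too long? */
static int neigh_expired(const struct neigh *n, unsigned long now,
                         unsigned long timeout)
{
    assert(n != NULL);
    return now - n->alive > timeout;
}
```

    With a full queue the timestamp stays frozen, so a dead neighbor
    eventually expires even though packets keep arriving for it.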
     
  • Align locking usage when touching device address with rest
    of the kernel. Lock the bottom half when doing so using
    netif_addr_lock_bh.

    This also solves the following case as reported by lockdep:
    CPU0 CPU1
    ---- ----
    lock(_xmit_INFINIBAND);
    local_irq_disable();
    lock(&(&mc->mca_lock)->rlock);
    lock(_xmit_INFINIBAND);

    lock(&(&mc->mca_lock)->rlock);

    *** DEADLOCK ***

    Fixes: 492a7e67ff83 ("IB/IPoIB: Allow setting the device address")
    Signed-off-by: Mark Bloch
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Doug Ledford

    Mark Bloch
     
  • In ipoib_remove_one the driver holds the rtnl_lock and tries to do some
    operation like dev_change_flags or unregister_netdev. Meanwhile, a sysfs
    callback like ipoib_vlan_delete holds the sysfs mutex and tries to take the
    rtnl_lock via rtnl_trylock(), calling restart_syscall() if the lock is not
    free. At the same time ipoib_remove_one tries to get the sysfs lock in order to
    free its sysfs directory, and we get an a->b, b->a deadlock.

    Trace like the following:

    schedule+0x37/0x80
    schedule_preempt_disabled+0xe/0x10
    __mutex_lock_slowpath+0xb5/0x120
    mutex_lock+0x23/0x40
    rtnl_lock+0x15/0x20
    netdev_run_todo+0x17c/0x320
    rtnl_unlock+0xe/0x10
    ipoib_vlan_delete+0x11b/0x1b0 [ib_ipoib]
    delete_child+0x54/0x80 [ib_ipoib]
    dev_attr_store+0x18/0x30
    sysfs_kf_write+0x37/0x40
    mutex_lock+0x16/0x40
    SyS_write+0x55/0xc0
    entry_SYSCALL_64_fastpath+0x16/0x75
    And
    schedule+0x37/0x80
    __kernfs_remove+0x1a8/0x260
    ? wake_atomic_t_function+0x60/0x60
    kernfs_remove+0x25/0x40
    sysfs_remove_dir+0x50/0x80
    kobject_del+0x18/0x50
    device_del+0x19f/0x260
    netdev_unregister_kobject+0x6a/0x80
    rollback_registered_many+0x1fd/0x340
    rollback_registered+0x3c/0x70
    unregister_netdevice_queue+0x55/0xc0
    unregister_netdev+0x20/0x30
    ipoib_remove_one+0x114/0x1b0 [ib_ipoib]
    ib_unregister_client+0x4a/0x170 [ib_core]
    ? find_module_all+0x71/0xa0
    ipoib_cleanup_module+0x10/0x94 [ib_ipoib]
    SyS_delete_module+0x1b5/0x210
    entry_SYSCALL_64_fastpath+0x16/0x75

    The fix is to check the flag IPOIB_FLAG_INTF_ON_DESTROY in order to
    get out of the sysfs function.

    Fixes: 862096a8bbf8 ("IB/ipoib: Add more rtnl_link_ops callbacks")
    Fixes: 9baa0b036410 ("IB/ipoib: Add rtnl_link_ops support")
    Signed-off-by: Erez Shitrit
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Doug Ledford

    Erez Shitrit
     
  • Because patch "IB/srp: Move common code into the caller" was applied
    partially, srp_map_sg_dma() doesn't work properly. Fix this by
    applying the remainder of that patch. See also
    http://thread.gmane.org/gmane.linux.drivers.rdma/35803/focus=35811.

    Fixes: 3849e44d1c4b ("IB/srp: Move common code into the caller")
    Signed-off-by: Bart Van Assche
    Cc: Mike Marciniszyn
    Cc: Sagi Grimberg
    Cc: Christoph Hellwig
    Cc: Laurence Oberman
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Doug Ledford

    Bart Van Assche
     
  • Avoid that mapping fails due to use_fast_reg != 0 or use_fmr != 0
    if both member variables should be zero (if never_register == 1 or
    if neither FMR nor FR is supported). Remove an initialization that
    became superfluous due to changing a kmalloc() into a kzalloc()
    call.

    Fixes: 509c5f33f4f6 ("IB/srp: Prevent mapping failures")
    Cc: Sagi Grimberg
    Cc: Christoph Hellwig
    Cc: Laurence Oberman
    Signed-off-by: Bart Van Assche
    Reviewed-by: Leon Romanovsky
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Doug Ledford

    Bart Van Assche
     

29 May, 2016

2 commits

  • Pull SCSI target updates from Nicholas Bellinger:
    "Here are the outstanding target pending updates for v4.7-rc1.

    The highlights this round include:

    - Allow external PR/ALUA metadata path be defined at runtime via top
    level configfs attribute (Lee)
    - Fix target session shutdown bug for ib_srpt multi-channel (hch)
    - Make TFO close_session() and shutdown_session() optional (hch)
    - Drop se_sess->sess_kref + convert tcm_qla2xxx to internal kref
    (hch)
    - Add tcm_qla2xxx endpoint attribute for basic FC jammer (Laurence)
    - Refactor iscsi-target RX/TX PDU encode/decode into common code
    (Varun)
    - Extend iscsit_transport with xmit_pdu, release_cmd, get_rx_pdu,
    validate_parameters, and get_r2t_ttt for generic ISO offload
    (Varun)
    - Initial merge of cxgb iscsi-segment offload target driver (Varun)

    The bulk of the changes are Chelsio's new driver, along with a number
    of iscsi-target common code improvements made by Varun + Co along the
    way"

    * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (29 commits)
    iscsi-target: Fix early sk_data_ready LOGIN_FLAGS_READY race
    cxgbit: Use type ISCSI_CXGBIT + cxgbit tpg_np attribute
    iscsi-target: Convert transport drivers to signal rdma_shutdown
    iscsi-target: Make iscsi_tpg_np driver show/store use generic code
    tcm_qla2xxx Add SCSI command jammer/discard capability
    iscsi-target: graceful disconnect on invalid mapping to iovec
    target: need_to_release is always false, remove redundant check and kfree
    target: remove sess_kref and ->shutdown_session
    iscsi-target: remove usage of ->shutdown_session
    tcm_qla2xxx: introduce a private sess_kref
    target: make close_session optional
    target: make ->shutdown_session optional
    target: remove acl_stop
    target: consolidate and fix session shutdown
    cxgbit: add files for cxgbit.ko
    iscsi-target: export symbols
    iscsi-target: call complete on conn_logout_comp
    iscsi-target: clear tx_thread_active
    iscsi-target: add new offload transport type
    iscsi-target: use conn_transport->transport_type in text rsp
    ...

    Linus Torvalds
     
  • Pull more rdma updates from Doug Ledford:
    "This is the second group of code for the 4.7 merge window. It looks
    large, but only in one sense. I'll get to that in a minute. The list
    of changes here breaks down as follows:

    - Dynamic counter infrastructure in the IB drivers

    This is a sysfs based code to allow free form access to the
    hardware counters RDMA devices might support so drivers don't need
    to code this up repeatedly themselves

    - SendOnlyFullMember multicast support

    - IB router support

    - A couple misc fixes

    - The big item on the list: hfi1 driver updates, plus moving the hfi1
    driver out of staging

    There was a group of 15 patches in the hfi1 list that I thought I had
    in the first pull request but they weren't. So that added to the
    length of the hfi1 section here.

    As far as these go, everything but the hfi1 is pretty straight
    forward.

    The hfi1 is, if you recall, the driver that Al had complaints about
    how it used the write/writev interfaces in an overloaded fashion. The
    write portion of their interface behaved like the write handler in the
    IB stack proper and did bi-directional communications. The writev
    interface, on the other hand, only accepts SDMA request structures.
    The completions for those structures are sent back via an entirely
    different event mechanism.

    With the security patch, we put security checks on the write
    interface, however, we also knew they would be going away soon. Now,
    we've converted the write handler in the hfi1 driver to use ioctls
    from the IB reserved magic area for its bidirectional communications.
    With that change, Intel has addressed all of the items originally on
    their TODO when they went into staging (as well as many items added to
    the list later).

    As such, I moved them out, and since they were the last item in the
    staging/rdma directory, and I don't have immediate plans to use the
    staging area again, I removed the staging/rdma area.

    Because of the move out of staging, as well as a series of 5 patches
    in the hfi1 driver that removed code people thought should be done in
    a different way and was optional to begin with (a snoop debug
    interface, an eeprom driver for an eeprom connected directly to their
    hfi1 chip and not via an i2c bus, and a few other things like that),
    the line count, especially the removal count, is high"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (56 commits)
    staging/rdma: Remove the entire rdma subdirectory of staging
    IB/core: Make device counter infrastructure dynamic
    IB/hfi1: Fix pio map initialization
    IB/hfi1: Correct 8051 link parameter settings
    IB/hfi1: Update pkey table properly after link down or FM start
    IB/rdamvt: Fix rdmavt s_ack_queue sizing
    IB/rdmavt: Max atomic value should be a u8
    IB/hfi1: Fix hard lockup due to not using save/restore spin lock
    IB/hfi1: Add tracing support for send with invalidate opcode
    IB/hfi1, qib: Add ieth to the packet header definitions
    IB/hfi1: Move driver out of staging
    IB/hfi1: Do not free hfi1 cdev parent structure early
    IB/hfi1: Add trace message in user IOCTL handling
    IB/hfi1: Remove write(), use ioctl() for user cmds
    IB/hfi1: Add ioctl() interface for user commands
    IB/hfi1: Remove unused user command
    IB/hfi1: Remove snoop/diag interface
    IB/hfi1: Remove EPROM functionality from data device
    IB/hfi1: Remove UI char device
    IB/hfi1: Remove multiple device cdev
    ...

    Linus Torvalds
     

26 May, 2016

3 commits

  • In IB networks, and specifically in IPoIB/rdmacm traffic, the device
    address of an IPoIB interface is used as a means to exchange information
    between nodes needed for communication.

    Currently an IPoIB interface will always be created with a device
    address based on its node GUID without a way to change that.

    This change adds the ability to set the device address of an IPoIB
    interface by value. We use the set mac address ndo to do that.

    The flow breaks down into two cases:
    1) The GID value is already in the GID table:
    the interface will be able to set carrier up.

    2) The GID value is not yet in the GID table:
    the interface won't try to join the multicast group
    and will wait (listening for the GID_CHANGE event) until the GID is inserted.

    In order to track those changes, we add a new flag:
    * IPOIB_FLAG_DEV_ADDR_SET.

    When set, it means the dev_addr is based on a value in the GID
    table. This bit is cleared upon a dev_addr change triggered
    by the user and set after validation.

    Per the IB spec, the port GUID can't change while the module is loaded.
    The port GUID is the basis for the GID at index 0, which in turn is the
    basis for the default device address of an IPoIB interface.

    The issue is that some devices don't follow the spec and change
    the port GUID while the HCA is powered on. So, in order not to
    break userspace applications, we need to check whether the user
    wanted to control the device address. We assume that if the user
    sets the device address back to be based on GID index 0, they no
    longer wish to control it.

    In order to track this, we add an additional flag:
    * IPOIB_FLAG_DEV_ADDR_CTRL
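
    The interplay of the two flags described above can be illustrated with
    a small userspace sketch. This is not the driver code: the bit values,
    the flat GUID table and every sketch_* name are hypothetical, and the
    deferred GID_CHANGE handling is left out. It only mirrors the rules
    stated in this message.

    ```c
    #include <stdint.h>
    #include <string.h>

    #define SKETCH_GUID_LEN 8

    /* Hypothetical bit values; the real flag numbering is not given here. */
    enum {
        SKETCH_DEV_ADDR_SET  = 1u << 0, /* address matches a GID table entry */
        SKETCH_DEV_ADDR_CTRL = 1u << 1, /* user has taken control of the address */
    };

    /* Return the index of guid in a flat table of n 8-byte entries, or -1. */
    static int sketch_find_guid(const uint8_t *table, int n, const uint8_t *guid)
    {
        for (int i = 0; i < n; i++)
            if (memcmp(table + i * SKETCH_GUID_LEN, guid, SKETCH_GUID_LEN) == 0)
                return i;
        return -1;
    }

    /* Recompute the flags after the user asks for a new device address. */
    static unsigned sketch_user_set_addr(unsigned flags, const uint8_t *table,
                                         int n, const uint8_t *guid)
    {
        int idx = sketch_find_guid(table, n, guid);

        /* Cleared on every user-triggered change ... */
        flags &= ~(unsigned)SKETCH_DEV_ADDR_SET;
        /* ... and set again only once the value is found in the table. */
        if (idx >= 0)
            flags |= SKETCH_DEV_ADDR_SET;

        /* Going back to GID index 0 is read as giving up control. */
        if (idx == 0)
            flags &= ~(unsigned)SKETCH_DEV_ADDR_CTRL;
        else
            flags |= SKETCH_DEV_ADDR_CTRL;

        return flags;
    }
    ```

    In this sketch a value that is not yet in the table leaves only the
    CTRL bit set, which corresponds to case 2 above (wait for the GID to
    be inserted).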

    When setting the device address, there is no validation of the upper
    twelve bytes of the device address (flags, qpn, subnet prefix) as those
    bytes are not under the control of the user.
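
    As an illustration of the paragraph above, here is a minimal sketch
    of applying a requested address while leaving the upper twelve bytes
    alone. It assumes the usual 20-byte IPoIB link-layer address layout
    (1 byte flags, 3 bytes QPN, 8 bytes subnet prefix, 8 bytes GUID); the
    macro and function names are hypothetical and not taken from the
    driver.

    ```c
    #include <stdint.h>
    #include <string.h>

    #define SKETCH_ADDR_LEN    20 /* flags(1) + QPN(3) + subnet prefix(8) + GUID(8) */
    #define SKETCH_GUID_OFFSET 12 /* the GUID occupies the last 8 bytes */
    #define SKETCH_GUID_SIZE    8

    /*
     * Copy only the user-controllable part (the GUID) of the requested
     * address into the current one. The upper twelve bytes of req are
     * neither validated nor used.
     */
    static void sketch_apply_dev_addr(uint8_t cur[SKETCH_ADDR_LEN],
                                      const uint8_t req[SKETCH_ADDR_LEN])
    {
        memcpy(cur + SKETCH_GUID_OFFSET, req + SKETCH_GUID_OFFSET,
               SKETCH_GUID_SIZE);
    }
    ```

    Whatever the caller puts in the flags, QPN or subnet prefix bytes of
    the request has no effect on the resulting address in this sketch.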

    Signed-off-by: Mark Bloch
    Reviewed-by: Leon Romanovsky
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Doug Ledford

    Mark Bloch
     
    Check (via an SA query) whether the SM supports the new
    SendOnlyFullMember option for SendOnly multicast joins.
    If the SM supports that option, the driver uses the new join state to
    create such multicast groups, instead of the faked FullMember join state
    previously used for SendOnly MCGs.

    This check is performed at every invocation of mcast_restart task, to be
    sure that the driver stays in sync with the current state of the SM.

    Signed-off-by: Erez Shitrit
    Reviewed-by: Leon Romanovsky
    Signed-off-by: Doug Ledford

    Erez Shitrit
     
    Change struct ib_class_port_info to conform to IB Spec 1.3, in order
    to get a specific capability mask from the ClassPortInfo MAD.

    From the IB Spec, ClassPortInfo section:
    "CapabilityMask2 Bits 0-26: Additional class-specific capabilities...
    RespTimeValue the rest 5 bits"

    The new struct now has one field for capabilitymask2 (previously the
    reserved field) and the resp_time field.
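
    The 27/5 bit split quoted above can be sketched as follows. The
    sketch works on a host-order 32-bit value and assumes RespTimeValue
    sits in the five least significant bits; byte-order conversion of
    the on-wire MAD field is omitted, and the sketch_* names are
    hypothetical rather than the kernel's accessors.

    ```c
    #include <stdint.h>

    #define SKETCH_RESP_TIME_BITS 5u    /* width of RespTimeValue */
    #define SKETCH_RESP_TIME_MASK 0x1Fu /* low five bits */

    /* Combine a 27-bit CapabilityMask2 and a 5-bit RespTimeValue. */
    static uint32_t sketch_pack(uint32_t cap_mask2, uint8_t resp_time)
    {
        return (cap_mask2 << SKETCH_RESP_TIME_BITS) |
               (resp_time & SKETCH_RESP_TIME_MASK);
    }

    /* Extract the upper 27 bits. */
    static uint32_t sketch_cap_mask2(uint32_t word)
    {
        return word >> SKETCH_RESP_TIME_BITS;
    }

    /* Extract the lower 5 bits. */
    static uint8_t sketch_resp_time(uint32_t word)
    {
        return (uint8_t)(word & SKETCH_RESP_TIME_MASK);
    }
    ```

    Keeping both values behind accessors like these means callers never
    need to know where the boundary between the two fields lies.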

    It also fixes up the qib and srpt uses of the field that was
    repurposed as capabilitymask2:
    IB/qib: Change pma_get_classportinfo
    IB/srpt: Adjust the use of ib_class_port_info

    Signed-off-by: Erez Shitrit
    Reviewed-by: Leon Romanovsky
    Reviewed-by: Hal Rosenstock
    Signed-off-by: Doug Ledford

    Erez Shitrit
     

21 May, 2016

1 commit

  • Pull rdma updates from Doug Ledford:
    "Primary 4.7 merge window changes

    - Updates to the new Intel X722 iWARP driver
    - Updates to the hfi1 driver
    - Fixes for the iw_cxgb4 driver
    - Misc core fixes
    - Generic RDMA READ/WRITE API addition
    - SRP updates
    - Misc ipoib updates
    - Minor mlx5 updates"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (148 commits)
    IB/mlx5: Fire the CQ completion handler from tasklet
    net/mlx5_core: Use tasklet for user-space CQ completion events
    IB/core: Do not require CAP_NET_ADMIN for packet sniffing
    IB/mlx4: Fix unaligned access in send_reply_to_slave
    IB/mlx5: Report Scatter FCS device capability when supported
    IB/mlx5: Add Scatter FCS support for Raw Packet QP
    IB/core: Add Scatter FCS create flag
    IB/core: Add Raw Scatter FCS device capability
    IB/core: Add extended device capability flags
    i40iw: pass hw_stats by reference rather than by value
    i40iw: Remove unnecessary synchronize_irq() before free_irq()
    i40iw: constify i40iw_vf_cqp_ops structure
    IB/mlx5: Add UARs write-combining and non-cached mapping
    IB/mlx5: Allow mapping the free running counter on PROT_EXEC
    IB/mlx4: Use list_for_each_entry_safe
    IB/SA: Use correct free function
    IB/core: Fix a potential array overrun in CMA and SA agent
    IB/core: Remove unnecessary check in ibnl_rcv_msg
    IB/IWPM: Fix a potential skb leak
    RDMA/nes: replace custom print_hex_dump()
    ...

    Linus Torvalds