06 Apr, 2019

2 commits

  • [ Upstream commit f368ff188ae4b3ef6f740a15999ea0373261b619 ]

    When an application aborts the connection by moving QP from RTS to ERROR,
    then iw_cxgb4's modify_rc_qp() RTS->ERROR logic sets the
    *srqidxp to 0 via t4_set_wq_in_error(&qhp->wq, 0), and aborts the
    connection by calling c4iw_ep_disconnect().

    c4iw_ep_disconnect() does the following:
    1. sends up a close_complete_upcall(ep, -ECONNRESET) to libcxgb4.
    2. sends abort request CPL to hw.

    However, because close_complete_upcall() is sent before the ABORT_REQ
    is sent to hw, libcxgb4 fails to release the srqidx if the connection
    holds one: the srqidx is passed up to libcxgb4 only after the
    corresponding ABORT_RPL is processed by the kernel in abort_rpl().

    This patch handles the corner case by moving the call to
    close_complete_upcall() from c4iw_ep_disconnect() to abort_rpl(), so that
    libcxgb4 is notified about the -ECONNRESET only after abort_rpl() and
    can relinquish the srqidx properly.
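
    The ordering problem can be illustrated with a toy event model. This is
    only a sketch: the Endpoint class, its fields and the run() helper are
    hypothetical stand-ins, not the iw_cxgb4 code. The one property taken
    from the description is that the srqidx becomes visible to the library
    only once the ABORT_RPL has been processed.

```python
class Endpoint:
    """Toy model of a connection that holds an SRQ index (hypothetical)."""

    def __init__(self, srqidx, upcall_in_abort_rpl):
        self.hw_srqidx = srqidx      # index held until ABORT_RPL is processed
        self.reported_srqidx = None  # index visible to the library, if any
        self.upcall_in_abort_rpl = upcall_in_abort_rpl
        self.released = []           # indices the library managed to release

    def close_complete_upcall(self):
        # The library can only release an index it has been told about.
        if self.reported_srqidx is not None:
            self.released.append(self.reported_srqidx)

    def disconnect(self):
        if not self.upcall_in_abort_rpl:
            self.close_complete_upcall()  # old order: upcall before ABORT_REQ
        # ... the ABORT_REQ would be sent to the hardware here ...

    def abort_rpl(self):
        self.reported_srqidx = self.hw_srqidx  # index is passed up only now
        if self.upcall_in_abort_rpl:
            self.close_complete_upcall()       # new order: upcall after ABORT_RPL


def run(upcall_in_abort_rpl):
    """Run disconnect followed by the abort reply; return released indices."""
    ep = Endpoint(srqidx=7, upcall_in_abort_rpl=upcall_in_abort_rpl)
    ep.disconnect()
    ep.abort_rpl()
    return ep.released
```

    With the old ordering the model releases nothing; with the upcall moved
    into abort_rpl() the index is released.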

    Signed-off-by: Raju Rangoju
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin

    Raju Rangoju
     
  • [ Upstream commit 2612d723aadcf8281f9bf8305657129bd9f3cd57 ]

    Using CX-3 virtual functions, either from a bare-metal machine or
    pass-through from a VM, MAD packets are proxied through the PF driver.

    Since the VF drivers have separate name spaces for MAD Transaction Ids
    (TIDs), the PF driver has to re-map the TIDs and keep the bookkeeping
    in a cache.

    Following the RDMA Connection Manager (CM) protocol, it is clear when
    an entry has to be evicted from the cache. But life is not perfect:
    remote peers may die or be rebooted. Hence, a timeout is used to wipe
    out a cache entry once the PF driver assumes the remote peer has gone.

    During workloads where a high number of QPs are destroyed concurrently,
    an excessive number of CM DREQ retries has been observed.

    The problem can be demonstrated in a bare-metal environment, where two
    nodes have instantiated 8 VFs each. The nodes use dual-ported HCAs, so
    we have 16 vPorts per physical server.

    64 processes are associated with each vPort and create and destroy
    one QP for each of the remote 64 processes. That is, 1024 QPs per
    vPort, all in all 16K QPs. The QPs are created/destroyed using the
    CM.

    When tearing down these 16K QPs, excessive CM DREQ retries (and
    duplicates) are observed. With some cat/paste/awk wizardry on the
    infiniband_cm sysfs, we observe the following sums over the 16 vPorts
    on one of the nodes:

    cm_rx_duplicates:
    dreq 2102
    cm_rx_msgs:
    drep 1989
    dreq 6195
    rep 3968
    req 4224
    rtu 4224
    cm_tx_msgs:
    drep 4093
    dreq 27568
    rep 4224
    req 3968
    rtu 3968
    cm_tx_retries:
    dreq 23469

    Note that the active/passive side is equally distributed between the
    two nodes.

    Enabling pr_debug in cm.c gives tons of:

    [171778.814239] mlx4_ib_multiplex_cm_handler: id{slave: 1,sl_cm_id: 0xd393089f} is NULL!

    By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
    tear-down phase of the application is reduced from approximately 90 to
    50 seconds. Retries/duplicates are also significantly reduced:

    cm_rx_duplicates:
    dreq 2460
    []
    cm_tx_retries:
    dreq 3010
    req 47

    Increasing the timeout further didn't help, as these duplicates and
    retries stem from a too short CMA timeout, which was 20 (~4 seconds)
    on the systems. By increasing the CMA timeout to 22 (~17 seconds), the
    numbers fell to about 10 for both of them.

    Adjustment of the CMA timeout is not part of this commit.
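
    The approximate durations quoted above can be reproduced with a little
    arithmetic, assuming the CMA timeout uses the usual InfiniBand timeout
    encoding in which a value n stands for 4.096 us * 2^n. The helper name
    below is made up for illustration.

```python
def ib_timeout_seconds(exponent):
    """Duration in seconds encoded by an IB timeout exponent (4.096 us * 2^n)."""
    return 4.096e-6 * (2 ** exponent)


# Print the durations for the two CMA timeout values mentioned in the text.
for n in (20, 22):
    print(n, round(ib_timeout_seconds(n), 2))
```

    Under that assumption, 20 corresponds to roughly 4.3 seconds and 22 to
    roughly 17.2 seconds, which matches the figures in the message.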

    Signed-off-by: Håkon Bugge
    Acked-by: Jack Morgenstein
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin

    Håkon Bugge
     

27 Mar, 2019

1 commit

  • commit 5fc01fb846bce8fa6d5f95e2625b8ce0f8e86810 upstream.

    If cma_acquire_dev_by_src_ip() returns an error in addr_handler(), the
    device state changes back to RDMA_CM_ADDR_BOUND but the resolved source
    IP address is still left in place. After that, if rdma_destroy_id() is
    called after rdma_listen(), the device is freed without being removed
    from listen_any_list in cma_cancel_operation(). Revert to the previous
    IP address if acquiring the device fails.

    Reported-by: syzbot+f3ce716af730c8f96637@syzkaller.appspotmail.com
    Signed-off-by: Myungho Jung
    Reviewed-by: Parav Pandit
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Greg Kroah-Hartman

    Myungho Jung
     

24 Mar, 2019

1 commit

  • commit bc5add09764c123f58942a37c8335247e683d234 upstream.

    When disabling and removing a receive context, it is possible for an
    asynchronous event (i.e., an IRQ) to occur. Because of this, there is a race
    between cleaning up the context, and the context being used by the
    asynchronous event.

    cpu 0 (context cleanup)
        rc->ref_count-- (ref_count == 0)
        hfi1_rcd_free()
    cpu 1 (IRQ (with rcd index))
        rcd_get_by_index()
        lock
        ref_count+++

    Signed-off-by: Michael J. Ruhl
    Signed-off-by: Dennis Dalessandro
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Greg Kroah-Hartman

    Michael J. Ruhl
     

14 Mar, 2019

2 commits

  • [ Upstream commit 6ab4aba00f811a5265acc4d3eb1863bb3ca60562 ]

    The following BUG was reported by kasan:

    BUG: KASAN: use-after-free in ipoib_cm_tx_start+0x430/0x1390 [ib_ipoib]
    Read of size 80 at addr ffff88034c30bcd0 by task kworker/u16:1/24020

    Workqueue: ipoib_wq ipoib_cm_tx_start [ib_ipoib]
    Call Trace:
    dump_stack+0x9a/0xeb
    print_address_description+0xe3/0x2e0
    kasan_report+0x18a/0x2e0
    ? ipoib_cm_tx_start+0x430/0x1390 [ib_ipoib]
    memcpy+0x1f/0x50
    ipoib_cm_tx_start+0x430/0x1390 [ib_ipoib]
    ? kvm_clock_read+0x1f/0x30
    ? ipoib_cm_skb_reap+0x610/0x610 [ib_ipoib]
    ? __lock_is_held+0xc2/0x170
    ? process_one_work+0x880/0x1960
    ? process_one_work+0x912/0x1960
    process_one_work+0x912/0x1960
    ? wq_pool_ids_show+0x310/0x310
    ? lock_acquire+0x145/0x440
    worker_thread+0x87/0xbb0
    ? process_one_work+0x1960/0x1960
    kthread+0x314/0x3d0
    ? kthread_create_worker_on_cpu+0xc0/0xc0
    ret_from_fork+0x3a/0x50

    Allocated by task 0:
    kasan_kmalloc+0xa0/0xd0
    kmem_cache_alloc_trace+0x168/0x3e0
    path_rec_create+0xa2/0x1f0 [ib_ipoib]
    ipoib_start_xmit+0xa98/0x19e0 [ib_ipoib]
    dev_hard_start_xmit+0x159/0x8d0
    sch_direct_xmit+0x226/0xb40
    __dev_queue_xmit+0x1d63/0x2950
    neigh_update+0x889/0x1770
    arp_process+0xc47/0x21f0
    arp_rcv+0x462/0x760
    __netif_receive_skb_core+0x1546/0x2da0
    netif_receive_skb_internal+0xf2/0x590
    napi_gro_receive+0x28e/0x390
    ipoib_ib_handle_rx_wc_rss+0x873/0x1b60 [ib_ipoib]
    ipoib_rx_poll_rss+0x17d/0x320 [ib_ipoib]
    net_rx_action+0x427/0xe30
    __do_softirq+0x28e/0xc42

    Freed by task 26680:
    __kasan_slab_free+0x11d/0x160
    kfree+0xf5/0x360
    ipoib_flush_paths+0x532/0x9d0 [ib_ipoib]
    ipoib_set_mode_rss+0x1ad/0x560 [ib_ipoib]
    set_mode+0xc8/0x150 [ib_ipoib]
    kernfs_fop_write+0x279/0x440
    __vfs_write+0xd8/0x5c0
    vfs_write+0x15e/0x470
    ksys_write+0xb8/0x180
    do_syscall_64+0x9b/0x420
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    The buggy address belongs to the object at ffff88034c30bcc8
    which belongs to the cache kmalloc-512 of size 512
    The buggy address is located 8 bytes inside of
    512-byte region [ffff88034c30bcc8, ffff88034c30bec8)
    The buggy address belongs to the page:

    The following race between change mode and xmit flow is the reason for
    this use-after-free:

    Change mode     Send packet 1 to GID XX     Send packet 2 to GID XX
         |                     |                           |
       start                   |                           |
         |                     |                           |
         |                     |                           |
         |        Create new path for GID XX               |
         |           and update neigh path                 |
         |                     |                           |
         |                     |                           |
         |                     |                           |
    flush_paths                |                           |
                               |                           |
                   queue_work(cm.start_task)               |
                               |               Path for GID XX not found
                               |                    create new path
                               |
                               |
                   start_task runs with old
                         released path

    There is no locking to protect the lifetime of the path through the
    ipoib_cm_tx struct, so delete it entirely and always use the newly looked
    up path under the priv->lock.

    Fixes: 546481c2816e ("IB/ipoib: Fix memory corruption in ipoib cm mode connect flow")
    Signed-off-by: Feras Daoud
    Reviewed-by: Erez Shitrit
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin

    Feras Daoud
     
  • [ Upstream commit 904bba211acc2112fdf866e5a2bc6cd9ecd0de1b ]

    The work completion length for receiving a UD send with immediate is
    short by 4 bytes, causing applications using this opcode to fail.

    The UD receive logic incorrectly subtracts 4 bytes for the immediate
    value. These bytes are already included in the header length and are
    used to calculate the header/payload split, so the result is that these
    4 bytes are subtracted twice: once when the header length is subtracted
    from the overall length and once again in the UD opcode-specific path.

    Remove the extra subtraction when handling the opcode.
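
    A small worked example of the double subtraction, as a sketch only. The
    function name and the packet sizes used in the note below are
    hypothetical; the only assumption carried over from the description is
    that the header length already includes the 4-byte immediate.

```python
IMM_LEN = 4  # size of the immediate value in bytes


def wc_byte_len(total_len, hdr_len, has_immediate, buggy):
    """Payload length reported in the work completion.

    hdr_len is assumed to already include the immediate value.
    """
    payload = total_len - hdr_len  # first (correct) subtraction
    if buggy and has_immediate:
        payload -= IMM_LEN         # second, erroneous subtraction
    return payload
```

    For a 128-byte packet with a 28-byte header, the correct length is 100
    bytes, while the buggy path reports 96.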

    Fixes: 7724105686e7 ("IB/hfi1: add driver files")
    Reviewed-by: Michael J. Ruhl
    Signed-off-by: Brian Welty
    Signed-off-by: Mike Marciniszyn
    Signed-off-by: Dennis Dalessandro
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin

    Brian Welty
     

27 Feb, 2019

2 commits

  • commit 48396e80fb6526ea5ed267bd84f028bae56d2f9e upstream.

    Since .scsi_done() must only be called after scsi_queue_rq() has
    finished, make sure that the SRP initiator driver does not call
    .scsi_done() while scsi_queue_rq() is in progress. Although
    invoking sg_reset -d while I/O is in progress works fine with kernel
    v4.20 and before, that is not the case with kernel v5.0-rc1. This
    patch prevents the following crash from being triggered with kernel
    v5.0-rc1:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000138
    CPU: 0 PID: 360 Comm: kworker/0:1H Tainted: G B 5.0.0-rc1-dbg+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    Workqueue: kblockd blk_mq_run_work_fn
    RIP: 0010:blk_mq_dispatch_rq_list+0x116/0xb10
    Call Trace:
    blk_mq_sched_dispatch_requests+0x2f7/0x300
    __blk_mq_run_hw_queue+0xd6/0x180
    blk_mq_run_work_fn+0x27/0x30
    process_one_work+0x4f1/0xa20
    worker_thread+0x67/0x5b0
    kthread+0x1cf/0x1f0
    ret_from_fork+0x24/0x30

    Cc:
    Fixes: 94a9174c630c ("IB/srp: reduce lock coverage of command completion")
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     
  • [ Upstream commit 9d9f59b4204bc41896c866b3e5856e5b416aa199 ]

    As part of the audit process to update drivers to use rdma_restrack_add(),
    ensure that QP objects are cleared before access. This change fixes the
    crash observed when an uninitialized non-zero sgid attr is accessed by
    ib_destroy_qp().

    CPU: 3 PID: 74 Comm: kworker/u16:1 Not tainted 4.19.10-300.fc29.x86_64
    Workqueue: ipoib_wq ipoib_cm_tx_reap [ib_ipoib]
    RIP: 0010:rdma_put_gid_attr+0x9/0x30 [ib_core]
    RSP: 0018:ffffb7ad819dbde8 EFLAGS: 00010202
    RAX: 0000000000000000 RBX: ffff8d1bdf5a2e00 RCX: 0000000000002699
    RDX: 206c656e72656af8 RSI: ffff8d1bf7ae6160 RDI: 206c656e72656b20
    RBP: 0000000000000000 R08: 0000000000026160 R09: ffffffffc06b45bf
    R10: ffffe849887da000 R11: 0000000000000002 R12: ffff8d1be30cb400
    R13: ffff8d1bdf681800 R14: ffff8d1be2272400 R15: ffff8d1be30ca000
    FS: 0000000000000000(0000) GS:ffff8d1bf7ac0000(0000) knlGS:0000000000000000
    Trace:
    ib_destroy_qp+0xc9/0x240 [ib_core]
    ipoib_cm_tx_reap+0x1f9/0x4e0 [ib_ipoib]
    process_one_work+0x1a1/0x3a0
    worker_thread+0x30/0x380
    ? pwq_unbound_release_workfn+0xd0/0xd0
    kthread+0x112/0x130
    ? kthread_create_worker_on_cpu+0x70/0x70
    ret_from_fork+0x22/0x40

    Reported-by: Alexander Murashkin
    Tested-by: Alexander Murashkin
    Fixes: 1a1f460ff151 ("RDMA: Hold the sgid_attr inside the struct ib_ah/qp")
    Signed-off-by: Parav Pandit
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin

    Leon Romanovsky
     

13 Feb, 2019

2 commits

  • commit 09ce351dff8e7636af0beb72cd4a86c3904a0500 upstream.

    Fix potential memory corruption and panic in loopback for IB_WR_SEND
    variants.

    The code blindly assumes the posted length will fit in the fetched rwqe,
    which is not a valid assumption.

    Fix by adding a limit test, and triggering the appropriate send completion
    and putting the QP in an error state. This mimics the handling for
    non-loopback QPs.
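
    The limit test described above might be modelled as follows. This is a
    sketch under stated assumptions: the function name and the status and
    state strings are illustrative placeholders, not the rdmavt identifiers.

```python
def loopback_send(posted_len, rwqe_capacity):
    """Return (send_status, qp_state, bytes_copied) for a loopback SEND.

    The status and state strings are placeholders for illustration.
    """
    if posted_len > rwqe_capacity:
        # Limit test: refuse to copy past the end of the receive buffer,
        # complete the send with an error and move the QP to an error state.
        return ("ERROR_COMPLETION", "ERROR", 0)
    return ("SUCCESS", "RTS", posted_len)
```

    Without such a check, a posted length larger than the receive buffer
    would be copied blindly, which is the corruption the commit addresses.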

    Fixes: 15703461533a ("IB/{hfi1, qib, rdmavt}: Move ruc_loopback to rdmavt")
    Cc: #v4.20+
    Reviewed-by: Michael J. Ruhl
    Signed-off-by: Mike Marciniszyn
    Signed-off-by: Dennis Dalessandro
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Mike Marciniszyn

    Mike Marciniszyn
     
  • [ Upstream commit ca95f802ef5139722acc8d30aeaab6fe5bbe939e ]

    Currently, when a reserved operation is completed, its entry in the send
    queue will not be unreserved, which leads to the miscalculation of
    qp->s_avail and thus the triggering of a WARN_ON call trace. This patch
    fixes the problem by unreserving the reserved operation when it is
    completed.

    Fixes: 856cc4c237ad ("IB/hfi1: Add the capability for reserved operations")
    Reviewed-by: Mike Marciniszyn
    Signed-off-by: Kaike Wan
    Signed-off-by: Dennis Dalessandro
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin

    Kaike Wan
     

07 Feb, 2019

1 commit

  • commit 7709b0dc265f28695487712c45f02bbd1f98415d upstream.

    Applications that use the stack for execution purposes cause userspace PSM
    jobs to fail during mmap().

    Both Fortran (non-standard format parsing) and C (callback functions
    located in the stack) applications can be written such that stack
    execution is required. The linker notes this via the gnu_stack ELF flag.

    This causes READ_IMPLIES_EXEC to be set which forces all PROT_READ mmaps
    to have PROT_EXEC for the process.

    Checking for VM_EXEC bit and failing the request with EPERM is overly
    conservative and will break any PSM application using executable stacks.

    Cc: #v4.14+
    Fixes: 12220267645c ("IB/hfi: Protect against writable mmap")
    Reviewed-by: Mike Marciniszyn
    Reviewed-by: Dennis Dalessandro
    Reviewed-by: Ira Weiny
    Signed-off-by: Michael J. Ruhl
    Signed-off-by: Dennis Dalessandro
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Greg Kroah-Hartman

    Michael J. Ruhl
     

26 Jan, 2019

2 commits

  • [ Upstream commit 8036e90f92aae2784b855a0007ae2d8154d28b3c ]

    Acquiring the rtnl lock while holding usdev_lock could result in a
    deadlock.

    For example:

    usnic_ib_query_port()
    | mutex_lock(&us_ibdev->usdev_lock)
    | ib_get_eth_speed()
    | rtnl_lock()

    rtnl_lock()
    | usnic_ib_netdevice_event()
    | mutex_lock(&us_ibdev->usdev_lock)

    This commit moves the usdev_lock acquisition after the rtnl lock has been
    released.

    This is safe to do because usdev_lock is not protecting anything being
    accessed in ib_get_eth_speed(). Hence, the correct order of holding locks
    (rtnl -> usdev_lock) is not violated.
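
    The inversion can be checked mechanically by recording the order in
    which each code path takes its locks. The checker below is a minimal
    sketch; the helper names are hypothetical, and the "after" paths encode
    one reading of the description (rtnl released before usdev_lock is
    taken in the query path).

```python
def lock_order_edges(paths):
    """Collect (held, acquired) pairs from lists of locks taken in order."""
    edges = set()
    for path in paths:
        for i, held in enumerate(path):
            for acquired in path[i + 1:]:
                edges.add((held, acquired))
    return edges


def has_inversion(paths):
    """True if some pair of locks is taken in both orders."""
    edges = lock_order_edges(paths)
    return any((b, a) in edges for (a, b) in edges)


before = [
    ["usdev_lock", "rtnl"],  # usnic_ib_query_port() -> ib_get_eth_speed()
    ["rtnl", "usdev_lock"],  # usnic_ib_netdevice_event() under rtnl_lock()
]
after = [
    ["rtnl"],                # ib_get_eth_speed() runs first; rtnl is released
    ["usdev_lock"],          # usdev_lock is taken afterwards in the query path
    ["rtnl", "usdev_lock"],  # the netdevice event path is unchanged
]
```

    In this model has_inversion(before) reports the conflict, while
    has_inversion(after) does not.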

    Signed-off-by: Parvi Kaustubhi
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin

    Parvi Kaustubhi
     
  • [ Upstream commit b024dd0eba6e6d568f69d63c5e3153aba94c23e3 ]

    FRWR memory registration is done with a series of calls and WRs.
    1. ULP invokes ib_dma_map_sg()
    2. ULP invokes ib_map_mr_sg()
    3. ULP posts an IB_WR_REG_MR on the Send queue

    Step 2 generates an iova. It is permissible for ULPs to change this
    iova (with certain restrictions) between steps 2 and 3.

    rxe_map_mr_sg captures the MR's iova but later when rxe processes the
    REG_MR WR, it ignores the MR's iova field. If a ULP alters the MR's iova
    after step 2 but before step 3, rxe never captures that change.

    When the remote sends an RDMA Read targeting that MR, rxe looks up the
    R_key, but the altered iova does not match the iova stored in the MR,
    causing the RDMA Read request to fail.
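
    The failure mode can be sketched with a toy memory region. Everything
    here is a hypothetical model, not rxe code: "iova" is the value the ULP
    sees and may change, and "hw_iova" is the value used when a remote
    request is checked against the MR.

```python
class MemoryRegion:
    """Toy MR: 'iova' is what the ULP sees, 'hw_iova' what lookups use."""

    def __init__(self):
        self.iova = None
        self.hw_iova = None


def map_mr_sg(mr, iova):
    mr.iova = iova     # step 2: an iova is generated
    mr.hw_iova = iova  # the provider captures it at map time


def post_reg_mr(mr, honor_wr_iova):
    # Step 3: processing the REG_MR WR. Re-reading mr.iova here picks up
    # any change the ULP made after step 2.
    if honor_wr_iova:
        mr.hw_iova = mr.iova


def rdma_read_ok(mr, target_iova):
    return mr.hw_iova == target_iova


def scenario(honor_wr_iova):
    mr = MemoryRegion()
    map_mr_sg(mr, 0x1000)
    mr.iova = 0x1040   # ULP adjusts the iova between steps 2 and 3
    post_reg_mr(mr, honor_wr_iova)
    return rdma_read_ok(mr, 0x1040)
```

    When the REG_MR step ignores the MR's current iova, the remote read in
    this model fails; when it re-reads the field, the read succeeds.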

    Reported-by: Anna Schumaker
    Signed-off-by: Chuck Lever
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin

    Chuck Lever
     

23 Jan, 2019

2 commits

  • commit 6325e01b6cdf4636b721cf7259c1616e3cf28ce2 upstream.

    Since the IB_WR_REG_MR opcode value changed, let's set the PVRDMA device
    opcodes explicitly.

    Reported-by: Ruishuang Wang
    Fixes: 9a59739bd01f ("IB/rxe: Revise the ib_wr_opcode enum")
    Cc: stable@vger.kernel.org
    Reviewed-by: Bryan Tan
    Reviewed-by: Ruishuang Wang
    Reviewed-by: Vishnu Dasa
    Signed-off-by: Adit Ranadive
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Greg Kroah-Hartman

    Adit Ranadive
     
  • commit a9666c1cae8dbcd1a9aacd08a778bf2a28eea300 upstream.

    Unsafe global rkey is considered dangerous because it is a memory
    registration covering all memory in the system. Only users with a QP on
    the same PD can use the rkey, and generally those QPs will already know
    the value. However, out of caution, do not expose the value to
    unprivileged users on the local system. Require CAP_NET_ADMIN instead.

    Cc: # 4.16
    Fixes: 29cf1351d450 ("RDMA/nldev: provide detailed PD information")
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Greg Kroah-Hartman

    Leon Romanovsky
     

13 Jan, 2019

4 commits

  • commit ed041919f0d23c109d52cde8da6ddc211c52d67e upstream.

    This patch prevents KASAN from sporadically reporting the following:

    BUG: KASAN: use-after-free in rxe_run_task+0x1e/0x60 [rdma_rxe]
    Read of size 1 at addr ffff88801c50d8f4 by task check/24830

    CPU: 4 PID: 24830 Comm: check Not tainted 4.20.0-rc6-dbg+ #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    Call Trace:
    dump_stack+0x86/0xca
    print_address_description+0x71/0x239
    kasan_report.cold.5+0x242/0x301
    __asan_load1+0x47/0x50
    rxe_run_task+0x1e/0x60 [rdma_rxe]
    rxe_post_send+0x4bd/0x8d0 [rdma_rxe]
    srpt_zerolength_write+0xe1/0x160 [ib_srpt]
    srpt_close_ch+0x8b/0xe0 [ib_srpt]
    srpt_set_enabled+0xe7/0x150 [ib_srpt]
    srpt_tpg_enable_store+0xc0/0x100 [ib_srpt]
    configfs_write_file+0x157/0x1d0
    __vfs_write+0xd7/0x3d0
    vfs_write+0x102/0x290
    ksys_write+0xab/0x130
    __x64_sys_write+0x43/0x50
    do_syscall_64+0x71/0x210
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Allocated by task 13856:
    save_stack+0x43/0xd0
    kasan_kmalloc+0xc7/0xe0
    kasan_slab_alloc+0x11/0x20
    kmem_cache_alloc+0x105/0x320
    rxe_alloc+0xff/0x1f0 [rdma_rxe]
    rxe_create_qp+0x9f/0x160 [rdma_rxe]
    ib_create_qp+0xf5/0x690 [ib_core]
    rdma_create_qp+0x6a/0x140 [rdma_cm]
    srpt_cm_req_recv.cold.59+0x1588/0x237b [ib_srpt]
    srpt_rdma_cm_req_recv.isra.35+0x1d5/0x220 [ib_srpt]
    srpt_rdma_cm_handler+0x6f/0x100 [ib_srpt]
    cma_listen_handler+0x59/0x60 [rdma_cm]
    cma_ib_req_handler+0xd5b/0x2570 [rdma_cm]
    cm_process_work+0x2e/0x110 [ib_cm]
    cm_work_handler+0x2aae/0x502b [ib_cm]
    process_one_work+0x481/0x9e0
    worker_thread+0x67/0x5b0
    kthread+0x1cf/0x1f0
    ret_from_fork+0x24/0x30

    Freed by task 3440:
    save_stack+0x43/0xd0
    __kasan_slab_free+0x139/0x190
    kasan_slab_free+0xe/0x10
    kmem_cache_free+0xbc/0x330
    rxe_elem_release+0x66/0xe0 [rdma_rxe]
    rxe_destroy_qp+0x3f/0x50 [rdma_rxe]
    ib_destroy_qp+0x140/0x360 [ib_core]
    srpt_release_channel_work+0xdc/0x310 [ib_srpt]
    process_one_work+0x481/0x9e0
    worker_thread+0x67/0x5b0
    kthread+0x1cf/0x1f0
    ret_from_fork+0x24/0x30

    Cc: Sergey Gorenko
    Cc: Max Gurtovoy
    Cc: Laurence Oberman
    Cc:
    Signed-off-by: Bart Van Assche
    Signed-off-by: Doug Ledford
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     
  • commit e48d8ed9c6193502d849b35767fd18e20bbd7ba2 upstream.

    Error completions must still contain a valid wr_id and
    qp_num that the consumer can rely on. Fill these fields
    correctly in receive error completions.

    Reported-by: Walker Benjamin
    Cc: stable@vger.kernel.org
    Signed-off-by: Sagi Grimberg
    Reviewed-by: Zhu Yanjun
    Tested-by: Zhu Yanjun
    Signed-off-by: Doug Ledford
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg
     
  • [ Upstream commit 37fbd834b4e492dc41743830cbe435f35120abd8 ]

    When support for bonding of RoCE devices was added, there was
    necessarily a link between the RoCE device and the paired netdevice that
    was part of the bond. If you remove the mlx4_en module, that paired
    association is broken (the RoCE device is still present but the paired
    netdevice has been released). We need to account for this in
    is_upper_ndev_bond_master_filter() and filter out those links with a
    broken pairing or else we later oops in netdev_next_upper_dev_rcu().

    Fixes: 408f1242d940 ("IB/core: Delete lower netdevice default GID entries in bonding scenario")
    Signed-off-by: Mark Zhang
    Reviewed-by: Parav Pandit
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Doug Ledford
    Signed-off-by: Sasha Levin

    Mark Zhang
     
  • [ Upstream commit 47f07f03b5ee436fe074c4fb1fb28d013c36a0d8 ]

    Block creation of a DEVX UMEM with non-applicable access flags
    such as ODP, MW_BIND, etc.

    Specifically, when an ODP flag is used, the WARN call trace below is issued.

    [ 2510.404131] RIP: 0010:__mlx5_ib_populate_pas+0x207/0x220 [mlx5_ib]
    ...
    [ 2510.404143] Call Trace:
    [ 2510.404150] ? __kmalloc_node+0x1b3/0x280
    [ 2510.404156] ? _uverbs_alloc+0x63/0x90 [ib_uverbs]
    [ 2510.404158] ? _uverbs_alloc+0x63/0x90 [ib_uverbs]
    [ 2510.404162] mlx5_ib_populate_pas+0x53/0x60 [mlx5_ib]
    [ 2510.404167] mlx5_ib_handler_MLX5_IB_METHOD_DEVX_UMEM_REG+0x273/0x3f0 [mlx5_ib]

    Fixes: aeae94579caf ("IB/mlx5: Add DEVX support for memory registration")
    Signed-off-by: Yishai Hadas
    Reviewed-by: Artemy Kovalyov
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Doug Ledford
    Signed-off-by: Sasha Levin

    Yishai Hadas
     

10 Jan, 2019

1 commit

  • commit dbc2970caef74e8ff41923d302aa6fb5a4812d0e upstream.

    Incorrect sge sizing in the HFI PIO path will cause an Oops similar to
    this:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] hfi1_verbs_send_pio+0x3d8/0x530 [hfi1]
    PGD 0
    Oops: 0000 1 SMP
    Call Trace:
    ? hfi1_verbs_send_dma+0xad0/0xad0 [hfi1]
    hfi1_verbs_send+0xdf/0x250 [hfi1]
    ? make_rc_ack+0xa80/0xa80 [hfi1]
    hfi1_do_send+0x192/0x430 [hfi1]
    hfi1_do_send_from_rvt+0x10/0x20 [hfi1]
    rvt_post_send+0x369/0x820 [rdmavt]
    ib_uverbs_post_send+0x317/0x570 [ib_uverbs]
    ib_uverbs_write+0x26f/0x420 [ib_uverbs]
    ? security_file_permission+0x21/0xa0
    vfs_write+0xbd/0x1e0
    ? mntput+0x24/0x40
    SyS_write+0x7f/0xe0
    system_call_fastpath+0x16/0x1b

    Fix by adding the missing sizing check to correctly determine the sge
    length.

    Fixes: 7724105686e7 ("IB/hfi1: add driver files")
    Reviewed-by: Mike Marciniszyn
    Signed-off-by: Michael J. Ruhl
    Signed-off-by: Dennis Dalessandro
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Greg Kroah-Hartman

    Michael J. Ruhl
     

21 Dec, 2018

1 commit

  • commit 28a9a9e83ceae2cee25b9af9ad20d53aaa9ab951 upstream

    Packet queue state is overused to determine both SDMA descriptor
    availability and packet queue request state.

    cpu 0 ret = user_sdma_send_pkts(req, pcount);
    cpu 0 if (atomic_read(&pq->n_reqs))
    cpu 1 IRQ user_sdma_txreq_cb calls pq_update() (state to _INACTIVE)
    cpu 0 xchg(&pq->state, SDMA_PKT_Q_ACTIVE);

    At this point pq->n_reqs == 0 and pq->state is incorrectly
    SDMA_PKT_Q_ACTIVE. The close path will hang waiting for the state
    to return to _INACTIVE.

    This can also change the state from _DEFERRED to _ACTIVE. However,
    this is a mostly benign race.

    Remove the racy code path.

    Use n_reqs to determine if a packet queue is active or not.
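
    The interleaving listed above can be replayed step by step in a toy
    model. This is a sketch only: the PacketQueue class and helper names are
    hypothetical, and the steps are executed sequentially in the order the
    two CPUs are described as running them.

```python
INACTIVE, ACTIVE = "INACTIVE", "ACTIVE"


class PacketQueue:
    def __init__(self):
        self.n_reqs = 1      # one request in flight
        self.state = ACTIVE


def racy_interleaving():
    pq = PacketQueue()
    # cpu 0: user_sdma_send_pkts() returned; check for outstanding requests
    saw_requests = pq.n_reqs > 0
    # cpu 1 (IRQ): the completion callback retires the last request
    pq.n_reqs -= 1
    if pq.n_reqs == 0:
        pq.state = INACTIVE
    # cpu 0: acts on the stale check and marks the queue active again
    if saw_requests:
        pq.state = ACTIVE
    return pq


def is_active(pq):
    """Activity derived from n_reqs alone; no separate state can go stale."""
    return pq.n_reqs > 0
```

    After the replay, the model holds n_reqs == 0 with the state stuck at
    ACTIVE, whereas is_active() correctly reports the queue as idle.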

    Cc: # 4.19.x
    Reviewed-by: Mitko Haralanov
    Reviewed-by: Mike Marciniszyn
    Signed-off-by: Michael J. Ruhl
    Signed-off-by: Dennis Dalessandro
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin

    Michael J. Ruhl
     

17 Dec, 2018

8 commits

  • commit 36d842194a57f1b21fbc6a6875f2fa2f9a7f8679 upstream.

    When running with KASAN, the following trace is produced:

    [ 62.535888] ==================================================================
    [ 62.544930] BUG: KASAN: slab-out-of-bounds in get_hw_stats+0x122/0x230 [hfi1]
    [ 62.553856] Write of size 8 at addr ffff88080e8d6330 by task kworker/0:1/14

    [ 62.565333] CPU: 0 PID: 14 Comm: kworker/0:1 Not tainted 4.19.0-test-build-kasan+ #8
    [ 62.575087] Hardware name: Intel Corporation S2600KPR/S2600KPR, BIOS SE5C610.86B.01.01.0019.101220160604 10/12/2016
    [ 62.587951] Workqueue: events work_for_cpu_fn
    [ 62.594050] Call Trace:
    [ 62.598023] dump_stack+0xc6/0x14c
    [ 62.603089] ? dump_stack_print_info.cold.1+0x2f/0x2f
    [ 62.610041] ? kmsg_dump_rewind_nolock+0x59/0x59
    [ 62.616615] ? get_hw_stats+0x122/0x230 [hfi1]
    [ 62.622985] print_address_description+0x6c/0x23c
    [ 62.629744] ? get_hw_stats+0x122/0x230 [hfi1]
    [ 62.636108] kasan_report.cold.6+0x241/0x308
    [ 62.642365] get_hw_stats+0x122/0x230 [hfi1]
    [ 62.648703] ? hfi1_alloc_rn+0x40/0x40 [hfi1]
    [ 62.655088] ? __kmalloc+0x110/0x240
    [ 62.660695] ? hfi1_alloc_rn+0x40/0x40 [hfi1]
    [ 62.667142] setup_hw_stats+0xd8/0x430 [ib_core]
    [ 62.673972] ? show_hfi+0x50/0x50 [hfi1]
    [ 62.680026] ib_device_register_sysfs+0x165/0x180 [ib_core]
    [ 62.687995] ib_register_device+0x5a2/0xa10 [ib_core]
    [ 62.695340] ? show_hfi+0x50/0x50 [hfi1]
    [ 62.701421] ? ib_unregister_device+0x2e0/0x2e0 [ib_core]
    [ 62.709222] ? __vmalloc_node_range+0x2d0/0x380
    [ 62.716131] ? rvt_driver_mr_init+0x11f/0x2d0 [rdmavt]
    [ 62.723735] ? vmalloc_node+0x5c/0x70
    [ 62.729697] ? rvt_driver_mr_init+0x11f/0x2d0 [rdmavt]
    [ 62.737347] ? rvt_driver_mr_init+0x1f5/0x2d0 [rdmavt]
    [ 62.744998] ? __rvt_alloc_mr+0x110/0x110 [rdmavt]
    [ 62.752315] ? rvt_rc_error+0x140/0x140 [rdmavt]
    [ 62.759434] ? rvt_vma_open+0x30/0x30 [rdmavt]
    [ 62.766364] ? mutex_unlock+0x1d/0x40
    [ 62.772445] ? kmem_cache_create_usercopy+0x15d/0x230
    [ 62.780115] rvt_register_device+0x1f6/0x360 [rdmavt]
    [ 62.787823] ? rvt_get_port_immutable+0x180/0x180 [rdmavt]
    [ 62.796058] ? __get_txreq+0x400/0x400 [hfi1]
    [ 62.802969] ? memcpy+0x34/0x50
    [ 62.808611] hfi1_register_ib_device+0xde6/0xeb0 [hfi1]
    [ 62.816601] ? hfi1_get_npkeys+0x10/0x10 [hfi1]
    [ 62.823760] ? hfi1_init+0x89f/0x9a0 [hfi1]
    [ 62.830469] ? hfi1_setup_eagerbufs+0xad0/0xad0 [hfi1]
    [ 62.838204] ? pcie_capability_clear_and_set_word+0xcd/0xe0
    [ 62.846429] ? pcie_capability_read_word+0xd0/0xd0
    [ 62.853791] ? hfi1_pcie_init+0x187/0x4b0 [hfi1]
    [ 62.860958] init_one+0x67f/0xae0 [hfi1]
    [ 62.867301] ? hfi1_init+0x9a0/0x9a0 [hfi1]
    [ 62.873876] ? wait_woken+0x130/0x130
    [ 62.879860] ? read_word_at_a_time+0xe/0x20
    [ 62.886329] ? strscpy+0x14b/0x280
    [ 62.891998] ? hfi1_init+0x9a0/0x9a0 [hfi1]
    [ 62.898405] local_pci_probe+0x70/0xd0
    [ 62.904295] ? pci_device_shutdown+0x90/0x90
    [ 62.910833] work_for_cpu_fn+0x29/0x40
    [ 62.916750] process_one_work+0x584/0x960
    [ 62.922974] ? rcu_work_rcufn+0x40/0x40
    [ 62.928991] ? __schedule+0x396/0xdc0
    [ 62.934806] ? __sched_text_start+0x8/0x8
    [ 62.941020] ? pick_next_task_fair+0x68b/0xc60
    [ 62.947674] ? run_rebalance_domains+0x260/0x260
    [ 62.954471] ? __list_add_valid+0x29/0xa0
    [ 62.960607] ? move_linked_works+0x1c7/0x230
    [ 62.967077] ? trace_event_raw_event_workqueue_execute_start+0x140/0x140
    [ 62.976248] ? mutex_lock+0xa6/0x100
    [ 62.982029] ? __mutex_lock_slowpath+0x10/0x10
    [ 62.988795] ? __switch_to+0x37a/0x710
    [ 62.994731] worker_thread+0x62e/0x9d0
    [ 63.000602] ? max_active_store+0xf0/0xf0
    [ 63.006828] ? __switch_to_asm+0x40/0x70
    [ 63.012932] ? __switch_to_asm+0x34/0x70
    [ 63.019013] ? __switch_to_asm+0x40/0x70
    [ 63.025042] ? __switch_to_asm+0x34/0x70
    [ 63.031030] ? __switch_to_asm+0x40/0x70
    [ 63.037006] ? __schedule+0x396/0xdc0
    [ 63.042660] ? kmem_cache_alloc_trace+0xf3/0x1f0
    [ 63.049323] ? kthread+0x59/0x1d0
    [ 63.054594] ? ret_from_fork+0x35/0x40
    [ 63.060257] ? __sched_text_start+0x8/0x8
    [ 63.066212] ? schedule+0xcf/0x250
    [ 63.071529] ? __wake_up_common+0x110/0x350
    [ 63.077794] ? __schedule+0xdc0/0xdc0
    [ 63.083348] ? wait_woken+0x130/0x130
    [ 63.088963] ? finish_task_switch+0x1f1/0x520
    [ 63.095258] ? kasan_unpoison_shadow+0x30/0x40
    [ 63.101792] ? __init_waitqueue_head+0xa0/0xd0
    [ 63.108183] ? replenish_dl_entity.cold.60+0x18/0x18
    [ 63.115151] ? _raw_spin_lock_irqsave+0x25/0x50
    [ 63.121754] ? max_active_store+0xf0/0xf0
    [ 63.127753] kthread+0x1ae/0x1d0
    [ 63.132894] ? kthread_bind+0x30/0x30
    [ 63.138422] ret_from_fork+0x35/0x40

    [ 63.146973] Allocated by task 14:
    [ 63.152077] kasan_kmalloc+0xbf/0xe0
    [ 63.157471] __kmalloc+0x110/0x240
    [ 63.162804] init_cntrs+0x34d/0xdf0 [hfi1]
    [ 63.168883] hfi1_init_dd+0x29a3/0x2f90 [hfi1]
    [ 63.175244] init_one+0x551/0xae0 [hfi1]
    [ 63.181065] local_pci_probe+0x70/0xd0
    [ 63.186759] work_for_cpu_fn+0x29/0x40
    [ 63.192310] process_one_work+0x584/0x960
    [ 63.198163] worker_thread+0x62e/0x9d0
    [ 63.203843] kthread+0x1ae/0x1d0
    [ 63.208874] ret_from_fork+0x35/0x40

    [ 63.217203] Freed by task 1:
    [ 63.221844] __kasan_slab_free+0x12e/0x180
    [ 63.227844] kfree+0x92/0x1a0
    [ 63.232570] single_release+0x3a/0x60
    [ 63.238024] __fput+0x1d9/0x480
    [ 63.242911] task_work_run+0x139/0x190
    [ 63.248440] exit_to_usermode_loop+0x191/0x1a0
    [ 63.254814] do_syscall_64+0x301/0x330
    [ 63.260283] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    [ 63.270199] The buggy address belongs to the object at ffff88080e8d5500
    which belongs to the cache kmalloc-4096 of size 4096
    [ 63.287247] The buggy address is located 3632 bytes inside of
    4096-byte region [ffff88080e8d5500, ffff88080e8d6500)
    [ 63.303564] The buggy address belongs to the page:
    [ 63.310447] page:ffffea00203a3400 count:1 mapcount:0 mapping:ffff88081380e840 index:0x0 compound_mapcount: 0
    [ 63.323102] flags: 0x2fffff80008100(slab|head)
    [ 63.329775] raw: 002fffff80008100 0000000000000000 0000000100000001 ffff88081380e840
    [ 63.340175] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
    [ 63.350564] page dumped because: kasan: bad access detected

    [ 63.361974] Memory state around the buggy address:
    [ 63.369137] ffff88080e8d6200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ 63.379082] ffff88080e8d6280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ 63.389032] >ffff88080e8d6300: 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc
    [ 63.398944]                                       ^
    [ 63.406141] ffff88080e8d6380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    [ 63.416109] ffff88080e8d6400: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    [ 63.426099] ==================================================================

    The trace happens because get_hw_stats() assumes there is room in the
    memory allocated in init_cntrs() to accommodate the driver counters.
    Unfortunately, that routine only allocated space for the device
    counters.

    Fix by ensuring the allocation has room for the additional driver
    counters.
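
    As an illustration only (a Python model, not the driver's C code; every
    name below is hypothetical), the bug amounts to sizing a buffer for the
    device counters alone and then writing the driver counters past its end:

```python
# Hypothetical model of the sizing bug; names are illustrative, not hfi1 APIs.

def alloc_counters(num_dev, num_driver, include_driver):
    """Return a zeroed buffer; include_driver=False models the old sizing."""
    size = num_dev + (num_driver if include_driver else 0)
    return [0] * size


def fill_stats(buf, dev_values, driver_values):
    """Copy device counters, then driver counters, into buf."""
    needed = len(dev_values) + len(driver_values)
    if needed > len(buf):
        # In C this would be the out-of-bounds write that KASAN reports.
        raise IndexError("buffer too small: need %d, have %d" % (needed, len(buf)))
    for i, value in enumerate(list(dev_values) + list(driver_values)):
        buf[i] = value
    return needed
```

    With include_driver=True the buffer holds both counter sets; with False,
    fill_stats() raises instead of silently writing out of bounds.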

    Cc: # v4.14+
    Fixes: b7481944b06e9 ("IB/hfi1: Show statistics counters under IB stats interface")
    Reviewed-by: Mike Marciniczyn
    Reviewed-by: Mike Ruhl
    Signed-off-by: Piotr Stankiewicz
    Signed-off-by: Dennis Dalessandro
    Signed-off-by: Doug Ledford
    Signed-off-by: Greg Kroah-Hartman

    Piotr Stankiewicz
     
  • [ Upstream commit 75b7b86bdb0df37e08e44b6c1f99010967f81944 ]

    Memory windows are implemented with an indirect MKey. When a page fault
    event arrives for a MW MKey, we need to find the MR at the end of the list
    of indirect MKeys by iterating over all items from the first to the last.

    The offset calculated during this process has to be zeroed after the first
    iteration, or the next iteration will start from a wrong address, resulting
    in incorrect ODP faulting behavior.
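
    A simplified model of the offset handling (Python, not the mlx5 code;
    split_fault and its arguments are hypothetical) shows why the offset
    applies only to the first segment that the fault touches:

```python
# Hypothetical model: a fault range is split across consecutive segments.

def split_fault(segment_lengths, offset, length):
    """Return (segment index, start within segment, byte count) pieces."""
    pieces = []
    for idx, seg_len in enumerate(segment_lengths):
        if length == 0:
            break
        if offset >= seg_len:
            offset -= seg_len        # the fault starts in a later segment
            continue
        count = min(seg_len - offset, length)
        pieces.append((idx, offset, count))
        length -= count
        offset = 0                   # later segments are entered at their start
    return pieces
```

    Without the `offset = 0` line, every following segment would be entered
    at the stale offset, which is the wrong-address behavior described above.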

    Fixes: db570d7deafb ("IB/mlx5: Add ODP support to MW")
    Signed-off-by: Artemy Kovalyov
    Signed-off-by: Moni Shoua
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin

    Artemy Kovalyov
     
  • [ Upstream commit ca088320a02537f36c243ac21794525d8eabb3bd ]

    When an MR is re-registered with a change of memory location, the current
    hns driver assigns the first two PBL page addresses of the previously
    registered MR to the hardware. This leads to a PBL addressing error, as
    that PBL has already been released. Fix this wrong assignment by using the
    page addresses from the newly allocated PBL.

    Fixes: a2c80b7b4119 ("RDMA/hns: Add rereg mr support for hip08")
    Signed-off-by: Yixian Liu
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin

    Yixian Liu
     
  • [ Upstream commit 4f32fb921b153ae9ea280e02a3e91509fffc03d3 ]

    rdmavt uses a crazy system that loses the type checking when assigning
    functions to struct ib_device function pointers. Because of this, the
    signature of this function was not changed when the commit below revised
    things.

    Fix the signature so we are not calling a function pointer with a
    mismatched signature.

    Fixes: 477864c8fcd9 ("IB/core: Let create_ah return extended response to user")
    Signed-off-by: Kamal Heib
    Reviewed-by: Dennis Dalessandro
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin

    Kamal Heib
     
  • [ Upstream commit a6c66d6a08b88cc10aca9d3f65cfae31e7652a99 ]

    When bnxt_re_ib_reg() returns failure, the device structure gets
    freed. The driver then tries to access the device pointer
    after it is freed.

    [ 4871.034744] Failed to register with netedev: 0xffffffa1
    [ 4871.034765] infiniband (null): Failed to register with IB: 0xffffffea
    [ 4871.046430] ==================================================================
    [ 4871.046437] BUG: KASAN: use-after-free in bnxt_re_task+0x63/0x180 [bnxt_re]
    [ 4871.046439] Write of size 4 at addr ffff880fa8406f48 by task kworker/u48:2/17813

    [ 4871.046443] CPU: 20 PID: 17813 Comm: kworker/u48:2 Kdump: loaded Tainted: G B OE 4.20.0-rc1+ #42
    [ 4871.046444] Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 1.0.4 08/28/2014
    [ 4871.046447] Workqueue: bnxt_re bnxt_re_task [bnxt_re]
    [ 4871.046449] Call Trace:
    [ 4871.046454] dump_stack+0x91/0xeb
    [ 4871.046458] print_address_description+0x6a/0x2a0
    [ 4871.046461] kasan_report+0x176/0x2d0
    [ 4871.046463] ? bnxt_re_task+0x63/0x180 [bnxt_re]
    [ 4871.046466] bnxt_re_task+0x63/0x180 [bnxt_re]
    [ 4871.046470] process_one_work+0x216/0x5b0
    [ 4871.046471] ? process_one_work+0x189/0x5b0
    [ 4871.046475] worker_thread+0x4e/0x3d0
    [ 4871.046479] kthread+0x10e/0x140
    [ 4871.046480] ? process_one_work+0x5b0/0x5b0
    [ 4871.046482] ? kthread_stop+0x220/0x220
    [ 4871.046486] ret_from_fork+0x3a/0x50

    [ 4871.046492] The buggy address belongs to the page:
    [ 4871.046494] page:ffffea003ea10180 count:0 mapcount:0 mapping:0000000000000000 index:0x0
    [ 4871.046495] flags: 0x57ffffc0000000()
    [ 4871.046498] raw: 0057ffffc0000000 0000000000000000 ffffea003ea10188 0000000000000000
    [ 4871.046500] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
    [ 4871.046501] page dumped because: kasan: bad access detected

    Avoid accessing the device structure once it is freed.

    Fixes: 497158aa5f52 ("RDMA/bnxt_re: Fix the ib_reg failure cleanup")
    Signed-off-by: Selvin Xavier
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin

    Selvin Xavier
     
  • [ Upstream commit 3c4b1419c33c2417836a63f8126834ee36968321 ]

    The driver doesn't release the rtnl lock if registration with the
    L2 driver (bnxt_re_register_netdev) fails, and this causes a
    hang on the next attempt to take the lock.

    [ 371.635416] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 371.635417] kworker/u48:1 D 0 634 2 0x80000000
    [ 371.635423] Workqueue: bnxt_re bnxt_re_task [bnxt_re]
    [ 371.635424] Call Trace:
    [ 371.635426] ? __schedule+0x36b/0xbd0
    [ 371.635429] schedule+0x39/0x90
    [ 371.635430] schedule_preempt_disabled+0x11/0x20
    [ 371.635431] __mutex_lock+0x45b/0x9c0
    [ 371.635433] ? __mutex_lock+0x16d/0x9c0
    [ 371.635435] ? bnxt_re_ib_reg+0x2b/0xb30 [bnxt_re]
    [ 371.635438] ? wake_up_klogd+0x37/0x40
    [ 371.635442] bnxt_re_ib_reg+0x2b/0xb30 [bnxt_re]
    [ 371.635447] bnxt_re_task+0xfd/0x180 [bnxt_re]
    [ 371.635449] process_one_work+0x216/0x5b0
    [ 371.635450] ? process_one_work+0x189/0x5b0
    [ 371.635453] worker_thread+0x4e/0x3d0
    [ 371.635455] kthread+0x10e/0x140
    [ 371.635456] ? process_one_work+0x5b0/0x5b0
    [ 371.635458] ? kthread_stop+0x220/0x220
    [ 371.635460] ret_from_fork+0x3a/0x50
    [ 371.635477] INFO: task NetworkManager:1228 blocked for more than 120 seconds.
    [ 371.635478] Tainted: G B OE 4.20.0-rc1+ #42
    [ 371.635479] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

    Release the rtnl_lock correctly in the failure path.
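
    The pattern can be sketched as follows (a Python model with a
    threading.Lock standing in for the rtnl lock; register_device and its
    callback are hypothetical, not the driver's functions). Every exit from
    the function, including the failure return, drops the lock:

```python
import threading

rtnl = threading.Lock()  # stand-in for the kernel's rtnl lock


def register_device(register_netdev):
    """Run registration under the lock; release it on every exit path."""
    rtnl.acquire()
    try:
        rc = register_netdev()
        if rc:
            return rc    # failure path: the finally clause still unlocks
        return 0
    finally:
        rtnl.release()
```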

    Fixes: de5c95d0f518 ("RDMA/bnxt_re: Fix system crash during RDMA resource initialization")
    Signed-off-by: Selvin Xavier
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin

    Selvin Xavier
     
  • [ Upstream commit d52ef88a9f4be523425730da3239cf87bee936da ]

    Currently, when the MAC address is changed, GID entries are removed and
    added to reflect the new MAC address and new default GID entries,
    regardless of the netdev reg_state.

    When a bonding device is used and the underlying PCI device is removed,
    several netdevice events are generated. Two events of interest are the
    CHANGEADDR and UNREGISTER events on the lower (slave) netdevice of the
    bond netdevice.

    Sometimes the CHANGEADDR event is generated when the netdev state is
    UNREGISTERING (after the UNREGISTER event is generated). In this scenario,
    GID entries for default GIDs are added and never deleted, because GID
    entries are deleted only when netdev state is < UNREGISTERED.

    This leads to a non-zero reference count on the netdevice. Due to this,
    the PCI device unbind operation gets stuck.

    To avoid it, when changing the MAC address, add GID entries only if the
    netdev is in REGISTERED state.
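
    A minimal model of the rule (Python, not the IB core code; the GID table
    is a plain set and on_change_addr is a hypothetical name). The enum
    follows the ordering of the kernel's netdev reg_state values:

```python
from enum import IntEnum


class RegState(IntEnum):
    UNINITIALIZED = 0
    REGISTERED = 1
    UNREGISTERING = 2
    UNREGISTERED = 3
    RELEASED = 4


def on_change_addr(gid_table, netdev_state, new_default_gid):
    """Handle a MAC change: drop stale GIDs, re-add only for a live netdev."""
    gid_table.clear()
    if netdev_state == RegState.REGISTERED:
        gid_table.add(new_default_gid)
    return gid_table
```

    A CHANGEADDR that arrives while the netdev is UNREGISTERING therefore
    leaves the table empty, so no entry keeps a reference on the netdevice.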

    Fixes: 03db3a2d81e6 ("IB/core: Add RoCE GID table management")
    Signed-off-by: Parav Pandit
    Reviewed-by: Mark Bloch
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin

    Parav Pandit
     
  • [ Upstream commit 074fca3a18e7e1e0d4d7dcc9d7badc43b90232f4 ]

    Currently, for an IB_WR_LOCAL_INV WR, when the next fence is None, the
    current fence will be SMALL instead of Normal Fence.

    Without this patch, krping doesn't work on CX-5 devices and fails with the
    following error.

    Error messages from the CX-5 driver (server side):
    [ 710.434014] mlx5_0:dump_cqe:278:(pid 2712): dump error cqe
    [ 710.434016] 00000000 00000000 00000000 00000000
    [ 710.434016] 00000000 00000000 00000000 00000000
    [ 710.434017] 00000000 00000000 00000000 00000000
    [ 710.434018] 00000000 93003204 100000b8 000524d2
    [ 710.434019] krping: cq completion failed with wr_id 0 status 4 opcode 128 vender_err 32

    Fix the logic to set the correct fence type.

    Fixes: 6e8484c5cf07 ("RDMA/mlx5: set UMR wqe fence according to HCA cap")
    Signed-off-by: Majd Dibbiny
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin

    Majd Dibbiny
     

08 Dec, 2018

2 commits

  • commit db7a691a1551a748cb92d9c89c6b190ea87e28d5 upstream.

    If the firmware reports a connection width that is not 1x, 4x, 8x or 12x
    it causes the driver to fail during initialization.

    To prevent this failure every time a new width is introduced to the RDMA
    stack, we will set a default 4x width for widths which are unknown to
    the driver.

    This is needed to allow running old kernels with new firmware.
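
    The fallback can be modeled in a few lines (Python; widths are plain
    integers here rather than the driver's enum values, and effective_width
    is a hypothetical name):

```python
# Hypothetical model of the width fallback described above.
KNOWN_WIDTHS = (1, 4, 8, 12)
DEFAULT_WIDTH = 4


def effective_width(reported):
    """Return the reported width if known, otherwise fall back to 4x."""
    return reported if reported in KNOWN_WIDTHS else DEFAULT_WIDTH
```

    A width the driver has never heard of, such as a hypothetical 2x, maps
    to 4x instead of failing initialization.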

    Cc: # 4.1
    Fixes: 1b5daf11b015 ("IB/mlx5: Avoid using the MAD_IFC command under ISSI > 0 mode")
    Signed-off-by: Michael Guralnik
    Reviewed-by: Majd Dibbiny
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Greg Kroah-Hartman

    Michael Guralnik
     
  • commit 24c3456c8d5ee6fc1933ca40f7b4406130682668 upstream.

    If for some reason we failed to query the mr status, we need to make sure
    to provide sufficient information for an ambiguous error (guard error on
    sector 0).

    Fixes: 0a7a08ad6f5f ("IB/iser: Implement check_protection")
    Cc:
    Reported-by: Dan Carpenter
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg
     

01 Dec, 2018

1 commit

  • commit a0e0cb82804a6a21d9067022c2dfdf80d11da429 upstream.

    pq_update() can only be called in two places: from the completion
    function when the complete (npkts) sequence of packets has been
    submitted and processed, or from setup function if a subset of the
    packets were submitted (i.e. the error path).

    Currently both paths can call pq_update() if an error occurs. This
    race will cause the n_req value to go negative, hanging file_close(),
    or cause a crash by freeing the txlist more than once.

    Several variables are used to determine SDMA send state. Most of
    these are unnecessary, and code inspection shows that they race between
    the setup function and the completion function, in both the send path and
    the error path.

    The request 'status' value can be set by the setup or by the
    completion function. Code inspection shows this to be racy. Since the
    status is not needed in the completion code or by the caller, it has been
    removed.

    The request 'done' value races between usage by the setup and the
    completion function. The completion function does not need this.
    When the number of processed packets matches npkts, it is done.

    The 'has_error' value races between usage of the setup and the
    completion function. This can cause incorrect error handling and leave
    the n_req in an incorrect value (i.e. negative).

    Simplify the code by removing all of the unneeded state checks and
    variables.

    Clean up iovs node when it is freed.

    Eliminate race conditions in the error path:

    If all packets are submitted, the completion handler will set the
    completion status correctly (ok or aborted).

    If all packets are not submitted, the caller must wait until the
    submitted packets have completed, and then set the completion status.

    These two changes eliminate the race condition in the error path.

    Reviewed-by: Mitko Haralanov
    Reviewed-by: Mike Marciniszyn
    Signed-off-by: Michael J. Ruhl
    Signed-off-by: Dennis Dalessandro
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Greg Kroah-Hartman

    Michael J. Ruhl
     

14 Nov, 2018

8 commits

  • commit 013c2403bf32e48119aeb13126929f81352cc7ac upstream.

    Schedule MR cache work only after the bucket has been initialized.

    Cc: # 4.10
    Fixes: 49780d42dfc9 ("IB/mlx5: Expose MR cache for mlx5_ib")
    Signed-off-by: Artemy Kovalyov
    Reviewed-by: Majd Dibbiny
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Greg Kroah-Hartman

    Artemy Kovalyov
     
  • [ Upstream commit b97db58557f4aa6d9903f8e1deea6b3d1ed0ba43 ]

    Don't reset the resp opcode for a replayed read response.
    The resp opcode could be in the middle of a write or send
    sequence, when the duplicate read request was received.
    An example sequence is as follows:
    - Receive read request for 12KB PSN 20. Transmit read response
    first, middle and last with PSNs 20,21,22.
    - Receive write first PSN 23.
    At this point the resp psn is 24 and resp opcode is write first.
    - The sender notices that PSN 20 is dropped and retransmits.
    Receive read request for 12KB PSN 20. Transmit read response
    first, middle and last with PSNs 20,21,22. The resp opcode is
    set to -1, the resp psn remains 24.
    - Receive write first PSN 23. This is processed by duplicate_request().
    The resp opcode remains -1 and resp psn remains 24.
    - Receive write middle PSN 24. check_op_seq() reports a missing
    first error since the resp opcode is -1.

    When sending an ack for a duplicate send or write request,
    use the psn of the previous ack sent. Do not use the psn
    of a read response for the ack.
    An example sequence is as follows:
    - Receive write PSN 30. Transmit ACK for PSN 30.
    - Receive read request 4KB PSN 31. Transmit read response with
    PSN 31. The resp psn is now 32.
    - The sender notices that PSN 30 is dropped and retransmits.
    Receive write PSN 30. duplicate_request() sends an ACK with
    PSN 31. That is incorrect since PSN 31 was a read request.
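
    The first sequence above can be replayed against a toy responder model
    (Python, not the driver's code; the class, its methods and the
    reset_on_replay switch are hypothetical, and PSN wraparound is ignored):

```python
# Toy model of responder state for the first example sequence.
NO_OPCODE = -1


class Responder:
    def __init__(self, psn, reset_on_replay):
        self.psn = psn                    # next expected PSN
        self.opcode = NO_OPCODE           # opcode of the last in-order request
        self.reset_on_replay = reset_on_replay  # True models the old behaviour

    def read_request(self, psn, npkts):
        if psn < self.psn:                # duplicate: replay the response only
            if self.reset_on_replay:
                self.opcode = NO_OPCODE
            return "replayed"
        self.psn = psn + npkts
        self.opcode = "read"
        return "ok"

    def write_first(self, psn):
        if psn < self.psn:
            return "duplicate"
        self.psn = psn + 1
        self.opcode = "write_first"
        return "ok"

    def write_middle(self, psn):
        if self.opcode not in ("write_first", "write_middle"):
            return "missing first"
        self.psn = psn + 1
        self.opcode = "write_middle"
        return "ok"


def run(reset_on_replay):
    r = Responder(20, reset_on_replay)
    r.read_request(20, 3)     # responses with PSNs 20, 21, 22
    r.write_first(23)         # resp psn is now 24
    r.read_request(20, 3)     # retransmitted read request
    r.write_first(23)         # duplicate write first
    return r.write_middle(24)
```

    In this model run(True) ends in "missing first", while run(False), which
    leaves the opcode alone on a replayed read, accepts the write middle.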

    Signed-off-by: Vijay Immanuel
    Signed-off-by: Doug Ledford
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Vijay Immanuel
     
  • [ Upstream commit 99ed748e878a99c6c7b87bbec063eefd9e47cb42 ]

    The transition is allowed from any state and the attribute mask must be
    IB_QP_STATE.

    Fixes: c32a4f296e1d ("IB/mlx5: Add support for DC Initiator QP")
    Signed-off-by: Moni Shoua
    Reviewed-by: Artemy Kovalyov
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Moni Shoua
     
  • [ Upstream commit 9b8b2a323008aedd39a8debb861b825707f01420 ]

    Some InfiniBand network devices have multiple ports on the same PCI
    function. This initializes the `dev_port' sysfs field of those
    network interfaces with their port number.

    Prior to this the kernel erroneously used the `dev_id' sysfs
    field of those network interfaces to convey the port number to userspace.

    The use of `dev_id' was considered correct until Linux 3.15,
    when another field, `dev_port', was defined for this particular
    purpose and `dev_id' was reserved for distinguishing stacked ifaces
    (e.g: VLANs) with the same hardware address as their parent device.

    Similar fixes to net/mlx4_en and many other drivers, which started
    exporting this information through `dev_id' before 3.15, were accepted
    into the kernel 4 years ago.
    See 76a066f2a2a0 (`net/mlx4_en: Expose port number through sysfs').
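
    From userspace, the port number can then be read from the `dev_port'
    attribute. A small sketch (Python; read_dev_port is a hypothetical
    helper, and the sysfs_root parameter exists only so the function can be
    pointed at a test directory):

```python
import os


def read_dev_port(ifname, sysfs_root="/sys/class/net"):
    """Return the dev_port value of a network interface as an integer."""
    path = os.path.join(sysfs_root, ifname, "dev_port")
    with open(path) as f:
        # dev_port is exposed as a decimal number followed by a newline.
        return int(f.read().strip())
```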

    Signed-off-by: Arseny Maslennikov
    Signed-off-by: Doug Ledford
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Arseny Maslennikov
     
  • [ Upstream commit d455f29f6d76a5f94881ca1289aaa1e90617ff5d ]

    Fix a possible recursive lock warning. It's a false warning, as the locks
    are part of two different HW queue data structures, cmdq and creq. The
    debug kernel throws the following warning and stack trace.

    [ 783.914967] ============================================
    [ 783.914970] WARNING: possible recursive locking detected
    [ 783.914973] 4.19.0-rc2+ #33 Not tainted
    [ 783.914976] --------------------------------------------
    [ 783.914979] swapper/2/0 is trying to acquire lock:
    [ 783.914982] 000000002aa3949d (&(&hwq->lock)->rlock){..-.}, at: bnxt_qplib_service_creq+0x232/0x350 [bnxt_re]
    [ 783.914999]
    but task is already holding lock:
    [ 783.915002] 00000000be73920d (&(&hwq->lock)->rlock){..-.}, at: bnxt_qplib_service_creq+0x2a/0x350 [bnxt_re]
    [ 783.915013]
    other info that might help us debug this:
    [ 783.915016] Possible unsafe locking scenario:

    [ 783.915019] CPU0
    [ 783.915021] ----
    [ 783.915034] lock(&(&hwq->lock)->rlock);
    [ 783.915035] lock(&(&hwq->lock)->rlock);
    [ 783.915037]
    *** DEADLOCK ***

    [ 783.915038] May be due to missing lock nesting notation

    [ 783.915039] 1 lock held by swapper/2/0:
    [ 783.915040] #0: 00000000be73920d (&(&hwq->lock)->rlock){..-.}, at: bnxt_qplib_service_creq+0x2a/0x350 [bnxt_re]
    [ 783.915044]
    stack backtrace:
    [ 783.915046] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.19.0-rc2+ #33
    [ 783.915047] Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 1.0.4 08/28/2014
    [ 783.915048] Call Trace:
    [ 783.915049]
    [ 783.915054] dump_stack+0x90/0xe3
    [ 783.915058] __lock_acquire+0x106c/0x1080
    [ 783.915061] ? sched_clock+0x5/0x10
    [ 783.915063] lock_acquire+0xbd/0x1a0
    [ 783.915065] ? bnxt_qplib_service_creq+0x232/0x350 [bnxt_re]
    [ 783.915069] _raw_spin_lock_irqsave+0x4a/0x90
    [ 783.915071] ? bnxt_qplib_service_creq+0x232/0x350 [bnxt_re]
    [ 783.915073] bnxt_qplib_service_creq+0x232/0x350 [bnxt_re]
    [ 783.915078] tasklet_action_common.isra.17+0x197/0x1b0
    [ 783.915081] __do_softirq+0xcb/0x3a6
    [ 783.915084] irq_exit+0xe9/0x100
    [ 783.915085] do_IRQ+0x6a/0x120
    [ 783.915087] common_interrupt+0xf/0xf
    [ 783.915088]

    Use nested notation for the spin_lock to avoid this warning.

    Signed-off-by: Selvin Xavier
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Selvin Xavier
     
  • [ Upstream commit ed51efd2ce44091a858ad829f666727e7c95695e ]

    In the failure path, nq->bar_reg_iomem gets accessed without being
    initialized. Avoid this by calling bnxt_qplib_nq_stop_irq() only if the
    initialization is complete.

    Reported-by: Dan Carpenter
    Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
    Fixes: 6e04b1035689 ("RDMA/bnxt_re: Fix broken RoCE driver due to recent L2 driver changes")
    Signed-off-by: Selvin Xavier
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Selvin Xavier
     
  • [ Upstream commit 4d6e4d12da2c308f8f976d3955c45ee62539ac98 ]

    IPCB should be cleared before icmp_send, since it may contain data from
    previous layers and the data could be misinterpreted as ip header options,
    which later caused the ihl to be set to an invalid value and resulted in
    the following stack corruption:

    [ 1083.031512] ib0: packet len 57824 (> 2048) too long to send, dropping
    [ 1083.031843] ib0: packet len 37904 (> 2048) too long to send, dropping
    [ 1083.032004] ib0: packet len 4040 (> 2048) too long to send, dropping
    [ 1083.032253] ib0: packet len 63800 (> 2048) too long to send, dropping
    [ 1083.032481] ib0: packet len 23960 (> 2048) too long to send, dropping
    [ 1083.033149] ib0: packet len 63800 (> 2048) too long to send, dropping
    [ 1083.033439] ib0: packet len 63800 (> 2048) too long to send, dropping
    [ 1083.033700] ib0: packet len 63800 (> 2048) too long to send, dropping
    [ 1083.034124] ib0: packet len 63800 (> 2048) too long to send, dropping
    [ 1083.034387] ==================================================================
    [ 1083.034602] BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0xf08/0x1310
    [ 1083.034798] Write of size 4 at addr ffff880353457c5f by task kworker/u16:0/7
    [ 1083.034990]
    [ 1083.035104] CPU: 7 PID: 7 Comm: kworker/u16:0 Tainted: G O 4.19.0-rc5+ #1
    [ 1083.035316] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014
    [ 1083.035573] Workqueue: ipoib_wq ipoib_cm_skb_reap [ib_ipoib]
    [ 1083.035750] Call Trace:
    [ 1083.035888] dump_stack+0x9a/0xeb
    [ 1083.036031] print_address_description+0xe3/0x2e0
    [ 1083.036213] kasan_report+0x18a/0x2e0
    [ 1083.036356] ? __ip_options_echo+0xf08/0x1310
    [ 1083.036522] __ip_options_echo+0xf08/0x1310
    [ 1083.036688] icmp_send+0x7b9/0x1cd0
    [ 1083.036843] ? icmp_route_lookup.constprop.9+0x1070/0x1070
    [ 1083.037018] ? netif_schedule_queue+0x5/0x200
    [ 1083.037180] ? debug_show_all_locks+0x310/0x310
    [ 1083.037341] ? rcu_dynticks_curr_cpu_in_eqs+0x85/0x120
    [ 1083.037519] ? debug_locks_off+0x11/0x80
    [ 1083.037673] ? debug_check_no_obj_freed+0x207/0x4c6
    [ 1083.037841] ? check_flags.part.27+0x450/0x450
    [ 1083.037995] ? debug_check_no_obj_freed+0xc3/0x4c6
    [ 1083.038169] ? debug_locks_off+0x11/0x80
    [ 1083.038318] ? skb_dequeue+0x10e/0x1a0
    [ 1083.038476] ? ipoib_cm_skb_reap+0x2b5/0x650 [ib_ipoib]
    [ 1083.038642] ? netif_schedule_queue+0xa8/0x200
    [ 1083.038820] ? ipoib_cm_skb_reap+0x544/0x650 [ib_ipoib]
    [ 1083.038996] ipoib_cm_skb_reap+0x544/0x650 [ib_ipoib]
    [ 1083.039174] process_one_work+0x912/0x1830
    [ 1083.039336] ? wq_pool_ids_show+0x310/0x310
    [ 1083.039491] ? lock_acquire+0x145/0x3a0
    [ 1083.042312] worker_thread+0x87/0xbb0
    [ 1083.045099] ? process_one_work+0x1830/0x1830
    [ 1083.047865] kthread+0x322/0x3e0
    [ 1083.050624] ? kthread_create_worker_on_cpu+0xc0/0xc0
    [ 1083.053354] ret_from_fork+0x3a/0x50

    For instance, __ip_options_echo fails to proceed with an invalid srr and
    optlen passed from another layer via IPCB:

    [ 762.139568] IPv4: __ip_options_echo rr=0 ts=0 srr=43 cipso=0
    [ 762.139720] IPv4: ip_options_build: IPCB 00000000f3cd969e opt 000000002ccb3533
    [ 762.139838] IPv4: __ip_options_echo in srr: optlen 197 soffset 84
    [ 762.139852] IPv4: ip_options_build srr=0 is_frag=0 rr_needaddr=0 ts_needaddr=0 ts_needtime=0 rr=0 ts=0
    [ 762.140269] ==================================================================
    [ 762.140713] IPv4: __ip_options_echo rr=0 ts=0 srr=0 cipso=0
    [ 762.141078] BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0x12ec/0x1680
    [ 762.141087] Write of size 4 at addr ffff880353457c7f by task kworker/u16:0/7

    Signed-off-by: Denis Drozdov
    Reviewed-by: Erez Shitrit
    Reviewed-by: Feras Daoud
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Denis Drozdov
     
  • [ Upstream commit e54b6a3bcd1ec972b25a164bdf495d9e7120b107 ]

    Add missing check for failure of cm_init_av_by_path

    Fixes: e1444b5a163e ("IB/cm: Fix automatic path migration support")
    Reported-by: Slava Shwartsman
    Reviewed-by: Parav Pandit
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Leon Romanovsky