Eric Lee / smarc-fsl-linux-kernel

06 Apr, 2019

2 commits

5203cf8e2 iw_cxgb4: fix srqidx leak during connection abort ... Browse Code »

[ Upstream commit f368ff188ae4b3ef6f740a15999ea0373261b619 ]

When an application aborts the connection by moving QP from RTS to ERROR,
then iw_cxgb4's modify_rc_qp() RTS->ERROR logic sets the
*srqidxp to 0 via t4_set_wq_in_error(&qhp->wq, 0), and aborts the
connection by calling c4iw_ep_disconnect().

c4iw_ep_disconnect() does the following:
1. sends up a close_complete_upcall(ep, -ECONNRESET) to libcxgb4.
2. sends abort request CPL to hw.

But, since the close_complete_upcall() is sent before sending the
ABORT_REQ to hw, libcxgb4 would fail to release the srqidx if the
connection holds one. Because, the srqidx is passed up to libcxgb4 only
after corresponding ABORT_RPL is processed by kernel in abort_rpl().

This patch handle the corner-case by moving the call to
close_complete_upcall() from c4iw_ep_disconnect() to abort_rpl(). So that
libcxgb4 is notified about the -ECONNRESET only after abort_rpl(), and
libcxgb4 can relinquish the srqidx properly.

Signed-off-by: Raju Rangoju
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin

Raju Rangoju
2019-04-06 04:33:09 +0800
d3ec442d6 IB/mlx4: Increase the timeout for CM cache ... Browse Code »

[ Upstream commit 2612d723aadcf8281f9bf8305657129bd9f3cd57 ]

Using CX-3 virtual functions, either from a bare-metal machine or
pass-through from a VM, MAD packets are proxied through the PF driver.

Since the VF drivers have separate name spaces for MAD Transaction Ids
(TIDs), the PF driver has to re-map the TIDs and keep the book keeping
in a cache.

Following the RDMA Connection Manager (CM) protocol, it is clear when
an entry has to evicted form the cache. But life is not perfect,
remote peers may die or be rebooted. Hence, it's a timeout to wipe out
a cache entry, when the PF driver assumes the remote peer has gone.

During workloads where a high number of QPs are destroyed concurrently,
excessive amount of CM DREQ retries has been observed

The problem can be demonstrated in a bare-metal environment, where two
nodes have instantiated 8 VFs each. This using dual ported HCAs, so we
have 16 vPorts per physical server.

64 processes are associated with each vPort and creates and destroys
one QP for each of the remote 64 processes. That is, 1024 QPs per
vPort, all in all 16K QPs. The QPs are created/destroyed using the
CM.

When tearing down these 16K QPs, excessive CM DREQ retries (and
duplicates) are observed. With some cat/paste/awk wizardry on the
infiniband_cm sysfs, we observe as sum of the 16 vPorts on one of the
nodes:

cm_rx_duplicates:
dreq 2102
cm_rx_msgs:
drep 1989
dreq 6195
rep 3968
req 4224
rtu 4224
cm_tx_msgs:
drep 4093
dreq 27568
rep 4224
req 3968
rtu 3968
cm_tx_retries:
dreq 23469

Note that the active/passive side is equally distributed between the
two nodes.

Enabling pr_debug in cm.c gives tons of:

[171778.814239] mlx4_ib_multiplex_cm_handler: id{slave:
1,sl_cm_id: 0xd393089f} is NULL!

By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
tear-down phase of the application is reduced from approximately 90 to
50 seconds. Retries/duplicates are also significantly reduced:

cm_rx_duplicates:
dreq 2460
[]
cm_tx_retries:
dreq 3010
req 47

Increasing the timeout further didn't help, as these duplicates and
retries stems from a too short CMA timeout, which was 20 (~4 seconds)
on the systems. By increasing the CMA timeout to 22 (~17 seconds), the
numbers fell down to about 10 for both of them.

Adjustment of the CMA timeout is not part of this commit.

Signed-off-by: Håkon Bugge
Acked-by: Jack Morgenstein
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin

Håkon Bugge
2019-04-06 04:33:03 +0800

27 Mar, 2019

1 commit

9dd5053c8 RDMA/cma: Rollback source IP address if failing to acquire device ... Browse Code »

commit 5fc01fb846bce8fa6d5f95e2625b8ce0f8e86810 upstream.

If cma_acquire_dev_by_src_ip() returns error in addr_handler(), the
device state changes back to RDMA_CM_ADDR_BOUND but the resolved source
IP address is still left. After that, if rdma_destroy_id() is called
after rdma_listen(), the device is freed without removed from
listen_any_list in cma_cancel_operation(). Revert to the previous IP
address if acquiring device fails.

Reported-by: syzbot+f3ce716af730c8f96637@syzkaller.appspotmail.com
Signed-off-by: Myungho Jung
Reviewed-by: Parav Pandit
Signed-off-by: Jason Gunthorpe
Signed-off-by: Greg Kroah-Hartman

Myungho Jung
2019-03-27 13:14:42 +0800

24 Mar, 2019

1 commit

54674984d IB/hfi1: Close race condition on user context disable and close ... Browse Code »

commit bc5add09764c123f58942a37c8335247e683d234 upstream.

When disabling and removing a receive context, it is possible for an
asynchronous event (i.e IRQ) to occur. Because of this, there is a race
between cleaning up the context, and the context being used by the
asynchronous event.

cpu 0 (context cleanup)
rc->ref_count-- (ref_count == 0)
hfi1_rcd_free()
cpu 1 (IRQ (with rcd index))
rcd_get_by_index()
lock
ref_count+++
Signed-off-by: Michael J. Ruhl
Signed-off-by: Dennis Dalessandro
Signed-off-by: Jason Gunthorpe
Signed-off-by: Greg Kroah-Hartman

Michael J. Ruhl
2019-03-24 03:10:02 +0800

14 Mar, 2019

2 commits

1ee82160e IB/ipoib: Fix for use-after-free in ipoib_cm_tx_start ... Browse Code »

[ Upstream commit 6ab4aba00f811a5265acc4d3eb1863bb3ca60562 ]

The following BUG was reported by kasan:

BUG: KASAN: use-after-free in ipoib_cm_tx_start+0x430/0x1390 [ib_ipoib]
Read of size 80 at addr ffff88034c30bcd0 by task kworker/u16:1/24020

Workqueue: ipoib_wq ipoib_cm_tx_start [ib_ipoib]
Call Trace:
dump_stack+0x9a/0xeb
print_address_description+0xe3/0x2e0
kasan_report+0x18a/0x2e0
? ipoib_cm_tx_start+0x430/0x1390 [ib_ipoib]
memcpy+0x1f/0x50
ipoib_cm_tx_start+0x430/0x1390 [ib_ipoib]
? kvm_clock_read+0x1f/0x30
? ipoib_cm_skb_reap+0x610/0x610 [ib_ipoib]
? __lock_is_held+0xc2/0x170
? process_one_work+0x880/0x1960
? process_one_work+0x912/0x1960
process_one_work+0x912/0x1960
? wq_pool_ids_show+0x310/0x310
? lock_acquire+0x145/0x440
worker_thread+0x87/0xbb0
? process_one_work+0x1960/0x1960
kthread+0x314/0x3d0
? kthread_create_worker_on_cpu+0xc0/0xc0
ret_from_fork+0x3a/0x50

Allocated by task 0:
kasan_kmalloc+0xa0/0xd0
kmem_cache_alloc_trace+0x168/0x3e0
path_rec_create+0xa2/0x1f0 [ib_ipoib]
ipoib_start_xmit+0xa98/0x19e0 [ib_ipoib]
dev_hard_start_xmit+0x159/0x8d0
sch_direct_xmit+0x226/0xb40
__dev_queue_xmit+0x1d63/0x2950
neigh_update+0x889/0x1770
arp_process+0xc47/0x21f0
arp_rcv+0x462/0x760
__netif_receive_skb_core+0x1546/0x2da0
netif_receive_skb_internal+0xf2/0x590
napi_gro_receive+0x28e/0x390
ipoib_ib_handle_rx_wc_rss+0x873/0x1b60 [ib_ipoib]
ipoib_rx_poll_rss+0x17d/0x320 [ib_ipoib]
net_rx_action+0x427/0xe30
__do_softirq+0x28e/0xc42

Freed by task 26680:
__kasan_slab_free+0x11d/0x160
kfree+0xf5/0x360
ipoib_flush_paths+0x532/0x9d0 [ib_ipoib]
ipoib_set_mode_rss+0x1ad/0x560 [ib_ipoib]
set_mode+0xc8/0x150 [ib_ipoib]
kernfs_fop_write+0x279/0x440
__vfs_write+0xd8/0x5c0
vfs_write+0x15e/0x470
ksys_write+0xb8/0x180
do_syscall_64+0x9b/0x420
entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff88034c30bcc8
which belongs to the cache kmalloc-512 of size 512
The buggy address is located 8 bytes inside of
512-byte region [ffff88034c30bcc8, ffff88034c30bec8)
The buggy address belongs to the page:

The following race between change mode and xmit flow is the reason for
this use-after-free:

Change mode Send packet 1 to GID XX Send packet 2 to GID XX
| | |
start | |
| | |
| | |
| Create new path for GID XX |
| and update neigh path |
| | |
| | |
| | |
flush_paths | |
| |
queue_work(cm.start_task) |
| Path for GID XX not found
| create new path
|
|
start_task runs with old
released path

There is no locking to protect the lifetime of the path through the
ipoib_cm_tx struct, so delete it entirely and always use the newly looked
up path under the priv->lock.

Fixes: 546481c2816e ("IB/ipoib: Fix memory corruption in ipoib cm mode connect flow")
Signed-off-by: Feras Daoud
Reviewed-by: Erez Shitrit
Signed-off-by: Leon Romanovsky
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin

Feras Daoud
2019-03-14 05:02:28 +0800
43b0c9391 IB/{hfi1, qib}: Fix WC.byte_len calculation for UD_SEND_WITH_IMM ... Browse Code »

[ Upstream commit 904bba211acc2112fdf866e5a2bc6cd9ecd0de1b ]

The work completion length for a receiving a UD send with immediate is
short by 4 bytes causing application using this opcode to fail.

The UD receive logic incorrectly subtracts 4 bytes for immediate
value. These bytes are already included in header length and are used to
calculate header/payload split, so the result is these 4 bytes are
subtracted twice, once when the header length subtracted from the overall
length and once again in the UD opcode specific path.

Remove the extra subtraction when handling the opcode.

Fixes: 7724105686e7 ("IB/hfi1: add driver files")
Reviewed-by: Michael J. Ruhl
Signed-off-by: Brian Welty
Signed-off-by: Mike Marciniszyn
Signed-off-by: Dennis Dalessandro
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin

Brian Welty
2019-03-14 05:02:27 +0800

27 Feb, 2019

2 commits

293f2dcd0 RDMA/srp: Rework SCSI device reset handling ... Browse Code »

commit 48396e80fb6526ea5ed267bd84f028bae56d2f9e upstream.

Since .scsi_done() must only be called after scsi_queue_rq() has
finished, make sure that the SRP initiator driver does not call
.scsi_done() while scsi_queue_rq() is in progress. Although
invoking sg_reset -d while I/O is in progress works fine with kernel
v4.20 and before, that is not the case with kernel v5.0-rc1. This
patch avoids that the following crash is triggered with kernel
v5.0-rc1:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000138
CPU: 0 PID: 360 Comm: kworker/0:1H Tainted: G B 5.0.0-rc1-dbg+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Workqueue: kblockd blk_mq_run_work_fn
RIP: 0010:blk_mq_dispatch_rq_list+0x116/0xb10
Call Trace:
blk_mq_sched_dispatch_requests+0x2f7/0x300
__blk_mq_run_hw_queue+0xd6/0x180
blk_mq_run_work_fn+0x27/0x30
process_one_work+0x4f1/0xa20
worker_thread+0x67/0x5b0
kthread+0x1cf/0x1f0
ret_from_fork+0x24/0x30

Cc:
Fixes: 94a9174c630c ("IB/srp: reduce lock coverage of command completion")
Signed-off-by: Bart Van Assche
Signed-off-by: Jason Gunthorpe
Signed-off-by: Greg Kroah-Hartman

Bart Van Assche
2019-02-27 17:09:00 +0800
eeb370eae RDMA/mthca: Clear QP objects during their allocation ... Browse Code »

[ Upstream commit 9d9f59b4204bc41896c866b3e5856e5b416aa199 ]

As part of audit process to update drivers to use rdma_restrack_add()
ensure that QP objects is cleared before access. Such change fixes the
crash observed with uninitialized non zero sgid attr accessed by
ib_destroy_qp().

CPU: 3 PID: 74 Comm: kworker/u16:1 Not tainted 4.19.10-300.fc29.x86_64
Workqueue: ipoib_wq ipoib_cm_tx_reap [ib_ipoib]
RIP: 0010:rdma_put_gid_attr+0x9/0x30 [ib_core]
RSP: 0018:ffffb7ad819dbde8 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff8d1bdf5a2e00 RCX: 0000000000002699
RDX: 206c656e72656af8 RSI: ffff8d1bf7ae6160 RDI: 206c656e72656b20
RBP: 0000000000000000 R08: 0000000000026160 R09: ffffffffc06b45bf
R10: ffffe849887da000 R11: 0000000000000002 R12: ffff8d1be30cb400
R13: ffff8d1bdf681800 R14: ffff8d1be2272400 R15: ffff8d1be30ca000
FS: 0000000000000000(0000) GS:ffff8d1bf7ac0000(0000)
knlGS:0000000000000000
Trace:
ib_destroy_qp+0xc9/0x240 [ib_core]
ipoib_cm_tx_reap+0x1f9/0x4e0 [ib_ipoib]
process_one_work+0x1a1/0x3a0
worker_thread+0x30/0x380
? pwq_unbound_release_workfn+0xd0/0xd0
kthread+0x112/0x130
? kthread_create_worker_on_cpu+0x70/0x70
ret_from_fork+0x22/0x40

Reported-by: Alexander Murashkin
Tested-by: Alexander Murashkin
Fixes: 1a1f460ff151 ("RDMA: Hold the sgid_attr inside the struct ib_ah/qp")
Signed-off-by: Parav Pandit
Signed-off-by: Leon Romanovsky
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin

Leon Romanovsky
2019-02-27 17:08:54 +0800

13 Feb, 2019

2 commits

fc3f15c67 IB/hfi1: Add limit test for RC/UC send via loopback ... Browse Code »

commit 09ce351dff8e7636af0beb72cd4a86c3904a0500 upstream.

Fix potential memory corruption and panic in loopback for IB_WR_SEND
variants.

The code blindly assumes the posted length will fit in the fetched rwqe,
which is not a valid assumption.

Fix by adding a limit test, and triggering the appropriate send completion
and putting the QP in an error state. This mimics the handling for
non-loopback QPs.

Fixes: 15703461533a ("IB/{hfi1, qib, rdmavt}: Move ruc_loopback to rdmavt")
Cc: #v4.20+
Reviewed-by: Michael J. Ruhl
Signed-off-by: Mike Marciniszyn
Signed-off-by: Dennis Dalessandro
Signed-off-by: Jason Gunthorpe
Signed-off-by: Greg Kroah-Hartman
Signed-off-by: Mike Marciniszyn

Mike Marciniszyn
2019-02-13 02:47:26 +0800
91e46947d IB/hfi1: Unreserve a reserved request when it is completed ... Browse Code »

[ Upstream commit ca95f802ef5139722acc8d30aeaab6fe5bbe939e ]

Currently, When a reserved operation is completed, its entry in the send
queue will not be unreserved, which leads to the miscalculation of
qp->s_avail and thus the triggering of a WARN_ON call trace. This patch
fixes the problem by unreserving the reserved operation when it is
completed.

Fixes: 856cc4c237ad ("IB/hfi1: Add the capability for reserved operations")
Reviewed-by: Mike Marciniszyn
Signed-off-by: Kaike Wan
Signed-off-by: Dennis Dalessandro
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin

Kaike Wan
2019-02-13 02:47:06 +0800

07 Feb, 2019

1 commit

71ff3384d IB/hfi1: Remove overly conservative VM_EXEC flag check ... Browse Code »

commit 7709b0dc265f28695487712c45f02bbd1f98415d upstream.

Applications that use the stack for execution purposes cause userspace PSM
jobs to fail during mmap().

Both Fortran (non-standard format parsing) and C (callback functions
located in the stack) applications can be written such that stack
execution is required. The linker notes this via the gnu_stack ELF flag.

This causes READ_IMPLIES_EXEC to be set which forces all PROT_READ mmaps
to have PROT_EXEC for the process.

Checking for VM_EXEC bit and failing the request with EPERM is overly
conservative and will break any PSM application using executable stacks.

Cc: #v4.14+
Fixes: 12220267645c ("IB/hfi: Protect against writable mmap")
Reviewed-by: Mike Marciniszyn
Reviewed-by: Dennis Dalessandro
Reviewed-by: Ira Weiny
Signed-off-by: Michael J. Ruhl
Signed-off-by: Dennis Dalessandro
Signed-off-by: Jason Gunthorpe
Signed-off-by: Greg Kroah-Hartman

Michael J. Ruhl
2019-02-07 00:30:13 +0800

26 Jan, 2019

2 commits

6fa75685a IB/usnic: Fix potential deadlock ... Browse Code »

[ Upstream commit 8036e90f92aae2784b855a0007ae2d8154d28b3c ]

Acquiring the rtnl lock while holding usdev_lock could result in a
deadlock.

For example:

usnic_ib_query_port()
| mutex_lock(&us_ibdev->usdev_lock)
| ib_get_eth_speed()
| rtnl_lock()

rtnl_lock()
| usnic_ib_netdevice_event()
| mutex_lock(&us_ibdev->usdev_lock)

This commit moves the usdev_lock acquisition after the rtnl lock has been
released.

This is safe to do because usdev_lock is not protecting anything being
accessed in ib_get_eth_speed(). Hence, the correct order of holding locks
(rtnl -> usdev_lock) is not violated.

Signed-off-by: Parvi Kaustubhi
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin

Parvi Kaustubhi
2019-01-26 16:32:42 +0800
fded1b0e0 rxe: IB_WR_REG_MR does not capture MR's iova field ... Browse Code »

[ Upstream commit b024dd0eba6e6d568f69d63c5e3153aba94c23e3 ]

FRWR memory registration is done with a series of calls and WRs.
1. ULP invokes ib_dma_map_sg()
2. ULP invokes ib_map_mr_sg()
3. ULP posts an IB_WR_REG_MR on the Send queue

Step 2 generates an iova. It is permissible for ULPs to change this
iova (with certain restrictions) between steps 2 and 3.

rxe_map_mr_sg captures the MR's iova but later when rxe processes the
REG_MR WR, it ignores the MR's iova field. If a ULP alters the MR's iova
after step 2 but before step 3, rxe never captures that change.

When the remote sends an RDMA Read targeting that MR, rxe looks up the
R_key, but the altered iova does not match the iova stored in the MR,
causing the RDMA Read request to fail.

Reported-by: Anna Schumaker
Signed-off-by: Chuck Lever
Reviewed-by: Sagi Grimberg
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin

Chuck Lever
2019-01-26 16:32:36 +0800

23 Jan, 2019

2 commits

ec485378a RDMA/vmw_pvrdma: Return the correct opcode when creating WR ... Browse Code »

commit 6325e01b6cdf4636b721cf7259c1616e3cf28ce2 upstream.

Since the IB_WR_REG_MR opcode value changed, let's set the PVRDMA device
opcodes explicitly.

Reported-by: Ruishuang Wang
Fixes: 9a59739bd01f ("IB/rxe: Revise the ib_wr_opcode enum")
Cc: stable@vger.kernel.org
Reviewed-by: Bryan Tan
Reviewed-by: Ruishuang Wang
Reviewed-by: Vishnu Dasa
Signed-off-by: Adit Ranadive
Signed-off-by: Jason Gunthorpe
Signed-off-by: Greg Kroah-Hartman

Adit Ranadive
2019-01-23 04:40:34 +0800
836edf22f RDMA/nldev: Don't expose unsafe global rkey to regular user ... Browse Code »

commit a9666c1cae8dbcd1a9aacd08a778bf2a28eea300 upstream.

Unsafe global rkey is considered dangerous because it exposes memory
registered for all memory in the system. Only users with a QP on the same
PD can use the rkey, and generally those QPs will already know the
value. However, out of caution, do not expose the value to unprivleged
users on the local system. Require CAP_NET_ADMIN instead.

Cc: # 4.16
Fixes: 29cf1351d450 ("RDMA/nldev: provide detailed PD information")
Signed-off-by: Leon Romanovsky
Signed-off-by: Jason Gunthorpe
Signed-off-by: Greg Kroah-Hartman

Leon Romanovsky
2019-01-23 04:40:34 +0800

13 Jan, 2019

4 commits

f408aac31 RDMA/srpt: Fix a use-after-free in the channel release code ... Browse Code »

commit ed041919f0d23c109d52cde8da6ddc211c52d67e upstream.

This patch avoids that KASAN sporadically reports the following:

BUG: KASAN: use-after-free in rxe_run_task+0x1e/0x60 [rdma_rxe]
Read of size 1 at addr ffff88801c50d8f4 by task check/24830

CPU: 4 PID: 24830 Comm: check Not tainted 4.20.0-rc6-dbg+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
dump_stack+0x86/0xca
print_address_description+0x71/0x239
kasan_report.cold.5+0x242/0x301
__asan_load1+0x47/0x50
rxe_run_task+0x1e/0x60 [rdma_rxe]
rxe_post_send+0x4bd/0x8d0 [rdma_rxe]
srpt_zerolength_write+0xe1/0x160 [ib_srpt]
srpt_close_ch+0x8b/0xe0 [ib_srpt]
srpt_set_enabled+0xe7/0x150 [ib_srpt]
srpt_tpg_enable_store+0xc0/0x100 [ib_srpt]
configfs_write_file+0x157/0x1d0
__vfs_write+0xd7/0x3d0
vfs_write+0x102/0x290
ksys_write+0xab/0x130
__x64_sys_write+0x43/0x50
do_syscall_64+0x71/0x210
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Allocated by task 13856:
save_stack+0x43/0xd0
kasan_kmalloc+0xc7/0xe0
kasan_slab_alloc+0x11/0x20
kmem_cache_alloc+0x105/0x320
rxe_alloc+0xff/0x1f0 [rdma_rxe]
rxe_create_qp+0x9f/0x160 [rdma_rxe]
ib_create_qp+0xf5/0x690 [ib_core]
rdma_create_qp+0x6a/0x140 [rdma_cm]
srpt_cm_req_recv.cold.59+0x1588/0x237b [ib_srpt]
srpt_rdma_cm_req_recv.isra.35+0x1d5/0x220 [ib_srpt]
srpt_rdma_cm_handler+0x6f/0x100 [ib_srpt]
cma_listen_handler+0x59/0x60 [rdma_cm]
cma_ib_req_handler+0xd5b/0x2570 [rdma_cm]
cm_process_work+0x2e/0x110 [ib_cm]
cm_work_handler+0x2aae/0x502b [ib_cm]
process_one_work+0x481/0x9e0
worker_thread+0x67/0x5b0
kthread+0x1cf/0x1f0
ret_from_fork+0x24/0x30

Freed by task 3440:
save_stack+0x43/0xd0
__kasan_slab_free+0x139/0x190
kasan_slab_free+0xe/0x10
kmem_cache_free+0xbc/0x330
rxe_elem_release+0x66/0xe0 [rdma_rxe]
rxe_destroy_qp+0x3f/0x50 [rdma_rxe]
ib_destroy_qp+0x140/0x360 [ib_core]
srpt_release_channel_work+0xdc/0x310 [ib_srpt]
process_one_work+0x481/0x9e0
worker_thread+0x67/0x5b0
kthread+0x1cf/0x1f0
ret_from_fork+0x24/0x30

Cc: Sergey Gorenko
Cc: Max Gurtovoy
Cc: Laurence Oberman
Cc:
Signed-off-by: Bart Van Assche
Signed-off-by: Doug Ledford
Signed-off-by: Greg Kroah-Hartman

Bart Van Assche
2019-01-13 16:51:09 +0800
c03d6b0f9 rxe: fix error completion wr_id and qp_num ... Browse Code »

commit e48d8ed9c6193502d849b35767fd18e20bbd7ba2 upstream.

Error completions must still contain a valid wr_id and
qp_num such that the consumer can rely on. Correctly
fill these fields in receive error completions.

Reported-by: Walker Benjamin
Cc: stable@vger.kernel.org
Signed-off-by: Sagi Grimberg
Reviewed-by: Zhu Yanjun
Tested-by: Zhu Yanjun
Signed-off-by: Doug Ledford
Signed-off-by: Greg Kroah-Hartman

Sagi Grimberg
2019-01-13 16:51:09 +0800
8c9c37477 IB/core: Fix oops in netdev_next_upper_dev_rcu() ... Browse Code »

[ Upstream commit 37fbd834b4e492dc41743830cbe435f35120abd8 ]

When support for bonding of RoCE devices was added, there was
necessarily a link between the RoCE device and the paired netdevice that
was part of the bond. If you remove the mlx4_en module, that paired
association is broken (the RoCE device is still present but the paired
netdevice has been released). We need to account for this in
is_upper_ndev_bond_master_filter() and filter out those links with a
broken pairing or else we later oops in netdev_next_upper_dev_rcu().

Fixes: 408f1242d940 ("IB/core: Delete lower netdevice default GID entries in bonding scenario")
Signed-off-by: Mark Zhang
Reviewed-by: Parav Pandit
Signed-off-by: Leon Romanovsky
Signed-off-by: Doug Ledford
Signed-off-by: Sasha Levin

Mark Zhang
2019-01-13 16:50:57 +0800
b4592b838 IB/mlx5: Block DEVX umem from the non applicable cases ... Browse Code »

[ Upstream commit 47f07f03b5ee436fe074c4fb1fb28d013c36a0d8 ]

Blocks creating a DEVX UMEM with the non applicable access flags
as of ODP, MW_BIND, etc.

Specifically when an ODP flag is used below WARN call trace is issued.

[ 2510.404131] RIP: 0010:__mlx5_ib_populate_pas+0x207/0x220 [mlx5_ib]
...
[ 2510.404143] Call Trace:
[ 2510.404150] ? __kmalloc_node+0x1b3/0x280
[ 2510.404156] ? _uverbs_alloc+0x63/0x90 [ib_uverbs]
[ 2510.404158] ? _uverbs_alloc+0x63/0x90 [ib_uverbs]
[ 2510.404162] mlx5_ib_populate_pas+0x53/0x60 [mlx5_ib]
[ 2510.404167] mlx5_ib_handler_MLX5_IB_METHOD_DEVX_UMEM_REG+0x273/0x3f0 [mlx5_ib]

Fixes: aeae94579caf ("IB/mlx5: Add DEVX support for memory registration")
Signed-off-by: Yishai Hadas
Reviewed-by: Artemy Kovalyov
Signed-off-by: Leon Romanovsky
Signed-off-by: Doug Ledford
Signed-off-by: Sasha Levin

Yishai Hadas
2019-01-13 16:50:56 +0800

10 Jan, 2019

1 commit

91dea490a IB/hfi1: Incorrect sizing of sge for PIO will OOPs ... Browse Code »

commit dbc2970caef74e8ff41923d302aa6fb5a4812d0e upstream.

An incorrect sge sizing in the HFI PIO path will cause an OOPs similar to
this:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [] hfi1_verbs_send_pio+0x3d8/0x530 [hfi1]
PGD 0
Oops: 0000 1 SMP
Call Trace:
? hfi1_verbs_send_dma+0xad0/0xad0 [hfi1]
hfi1_verbs_send+0xdf/0x250 [hfi1]
? make_rc_ack+0xa80/0xa80 [hfi1]
hfi1_do_send+0x192/0x430 [hfi1]
hfi1_do_send_from_rvt+0x10/0x20 [hfi1]
rvt_post_send+0x369/0x820 [rdmavt]
ib_uverbs_post_send+0x317/0x570 [ib_uverbs]
ib_uverbs_write+0x26f/0x420 [ib_uverbs]
? security_file_permission+0x21/0xa0
vfs_write+0xbd/0x1e0
? mntput+0x24/0x40
SyS_write+0x7f/0xe0
system_call_fastpath+0x16/0x1b

Fix by adding the missing sizing check to correctly determine the sge
length.

Fixes: 7724105686e7 ("IB/hfi1: add driver files")
Reviewed-by: Mike Marciniszyn
Signed-off-by: Michael J. Ruhl
Signed-off-by: Dennis Dalessandro
Signed-off-by: Jason Gunthorpe
Signed-off-by: Greg Kroah-Hartman

Michael J. Ruhl
2019-01-10 00:38:36 +0800

21 Dec, 2018

1 commit

70b0baddd IB/hfi1: Remove race conditions in user_sdma send path ... Browse Code »

commit 28a9a9e83ceae2cee25b9af9ad20d53aaa9ab951 upstream

Packet queue state is over used to determine SDMA descriptor
availablitity and packet queue request state.

cpu 0 ret = user_sdma_send_pkts(req, pcount);
cpu 0 if (atomic_read(&pq->n_reqs))
cpu 1 IRQ user_sdma_txreq_cb calls pq_update() (state to _INACTIVE)
cpu 0 xchg(&pq->state, SDMA_PKT_Q_ACTIVE);

At this point pq->n_reqs == 0 and pq->state is incorrectly
SDMA_PKT_Q_ACTIVE. The close path will hang waiting for the state
to return to _INACTIVE.

This can also change the state from _DEFERRED to _ACTIVE. However,
this is a mostly benign race.

Remove the racy code path.

Use n_reqs to determine if a packet queue is active or not.

Cc: # 4.19.x
Reviewed-by: Mitko Haralanov
Reviewed-by: Mike Marciniszyn
Signed-off-by: Michael J. Ruhl
Signed-off-by: Dennis Dalessandro
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin

Michael J. Ruhl
2018-12-21 21:15:12 +0800

17 Dec, 2018

8 commits

1fd99ac17 IB/hfi1: Fix an out-of-bounds access in get_hw_stats ... Browse Code »

commit 36d842194a57f1b21fbc6a6875f2fa2f9a7f8679 upstream.

When running with KASAN, the following trace is produced:

[ 62.535888]

==================================================================
[ 62.544930] BUG: KASAN: slab-out-of-bounds in
gut_hw_stats+0x122/0x230 [hfi1]
[ 62.553856] Write of size 8 at addr ffff88080e8d6330 by task
kworker/0:1/14

[ 62.565333] CPU: 0 PID: 14 Comm: kworker/0:1 Not tainted
4.19.0-test-build-kasan+ #8
[ 62.575087] Hardware name: Intel Corporation S2600KPR/S2600KPR, BIOS
SE5C610.86B.01.01.0019.101220160604 10/12/2016
[ 62.587951] Workqueue: events work_for_cpu_fn
[ 62.594050] Call Trace:
[ 62.598023] dump_stack+0xc6/0x14c
[ 62.603089] ? dump_stack_print_info.cold.1+0x2f/0x2f
[ 62.610041] ? kmsg_dump_rewind_nolock+0x59/0x59
[ 62.616615] ? get_hw_stats+0x122/0x230 [hfi1]
[ 62.622985] print_address_description+0x6c/0x23c
[ 62.629744] ? get_hw_stats+0x122/0x230 [hfi1]
[ 62.636108] kasan_report.cold.6+0x241/0x308
[ 62.642365] get_hw_stats+0x122/0x230 [hfi1]
[ 62.648703] ? hfi1_alloc_rn+0x40/0x40 [hfi1]
[ 62.655088] ? __kmalloc+0x110/0x240
[ 62.660695] ? hfi1_alloc_rn+0x40/0x40 [hfi1]
[ 62.667142] setup_hw_stats+0xd8/0x430 [ib_core]
[ 62.673972] ? show_hfi+0x50/0x50 [hfi1]
[ 62.680026] ib_device_register_sysfs+0x165/0x180 [ib_core]
[ 62.687995] ib_register_device+0x5a2/0xa10 [ib_core]
[ 62.695340] ? show_hfi+0x50/0x50 [hfi1]
[ 62.701421] ? ib_unregister_device+0x2e0/0x2e0 [ib_core]
[ 62.709222] ? __vmalloc_node_range+0x2d0/0x380
[ 62.716131] ? rvt_driver_mr_init+0x11f/0x2d0 [rdmavt]
[ 62.723735] ? vmalloc_node+0x5c/0x70
[ 62.729697] ? rvt_driver_mr_init+0x11f/0x2d0 [rdmavt]
[ 62.737347] ? rvt_driver_mr_init+0x1f5/0x2d0 [rdmavt]
[ 62.744998] ? __rvt_alloc_mr+0x110/0x110 [rdmavt]
[ 62.752315] ? rvt_rc_error+0x140/0x140 [rdmavt]
[ 62.759434] ? rvt_vma_open+0x30/0x30 [rdmavt]
[ 62.766364] ? mutex_unlock+0x1d/0x40
[ 62.772445] ? kmem_cache_create_usercopy+0x15d/0x230
[ 62.780115] rvt_register_device+0x1f6/0x360 [rdmavt]
[ 62.787823] ? rvt_get_port_immutable+0x180/0x180 [rdmavt]
[ 62.796058] ? __get_txreq+0x400/0x400 [hfi1]
[ 62.802969] ? memcpy+0x34/0x50
[ 62.808611] hfi1_register_ib_device+0xde6/0xeb0 [hfi1]
[ 62.816601] ? hfi1_get_npkeys+0x10/0x10 [hfi1]
[ 62.823760] ? hfi1_init+0x89f/0x9a0 [hfi1]
[ 62.830469] ? hfi1_setup_eagerbufs+0xad0/0xad0 [hfi1]
[ 62.838204] ? pcie_capability_clear_and_set_word+0xcd/0xe0
[ 62.846429] ? pcie_capability_read_word+0xd0/0xd0
[ 62.853791] ? hfi1_pcie_init+0x187/0x4b0 [hfi1]
[ 62.860958] init_one+0x67f/0xae0 [hfi1]
[ 62.867301] ? hfi1_init+0x9a0/0x9a0 [hfi1]
[ 62.873876] ? wait_woken+0x130/0x130
[ 62.879860] ? read_word_at_a_time+0xe/0x20
[ 62.886329] ? strscpy+0x14b/0x280
[ 62.891998] ? hfi1_init+0x9a0/0x9a0 [hfi1]
[ 62.898405] local_pci_probe+0x70/0xd0
[ 62.904295] ? pci_device_shutdown+0x90/0x90
[ 62.910833] work_for_cpu_fn+0x29/0x40
[ 62.916750] process_one_work+0x584/0x960
[ 62.922974] ? rcu_work_rcufn+0x40/0x40
[ 62.928991] ? __schedule+0x396/0xdc0
[ 62.934806] ? __sched_text_start+0x8/0x8
[ 62.941020] ? pick_next_task_fair+0x68b/0xc60
[ 62.947674] ? run_rebalance_domains+0x260/0x260
[ 62.954471] ? __list_add_valid+0x29/0xa0
[ 62.960607] ? move_linked_works+0x1c7/0x230
[ 62.967077] ?
trace_event_raw_event_workqueue_execute_start+0x140/0x140
[ 62.976248] ? mutex_lock+0xa6/0x100
[ 62.982029] ? __mutex_lock_slowpath+0x10/0x10
[ 62.988795] ? __switch_to+0x37a/0x710
[ 62.994731] worker_thread+0x62e/0x9d0
[ 63.000602] ? max_active_store+0xf0/0xf0
[ 63.006828] ? __switch_to_asm+0x40/0x70
[ 63.012932] ? __switch_to_asm+0x34/0x70
[ 63.019013] ? __switch_to_asm+0x40/0x70
[ 63.025042] ? __switch_to_asm+0x34/0x70
[ 63.031030] ? __switch_to_asm+0x40/0x70
[ 63.037006] ? __schedule+0x396/0xdc0
[ 63.042660] ? kmem_cache_alloc_trace+0xf3/0x1f0
[ 63.049323] ? kthread+0x59/0x1d0
[ 63.054594] ? ret_from_fork+0x35/0x40
[ 63.060257] ? __sched_text_start+0x8/0x8
[ 63.066212] ? schedule+0xcf/0x250
[ 63.071529] ? __wake_up_common+0x110/0x350
[ 63.077794] ? __schedule+0xdc0/0xdc0
[ 63.083348] ? wait_woken+0x130/0x130
[ 63.088963] ? finish_task_switch+0x1f1/0x520
[ 63.095258] ? kasan_unpoison_shadow+0x30/0x40
[ 63.101792] ? __init_waitqueue_head+0xa0/0xd0
[ 63.108183] ? replenish_dl_entity.cold.60+0x18/0x18
[ 63.115151] ? _raw_spin_lock_irqsave+0x25/0x50
[ 63.121754] ? max_active_store+0xf0/0xf0
[ 63.127753] kthread+0x1ae/0x1d0
[ 63.132894] ? kthread_bind+0x30/0x30
[ 63.138422] ret_from_fork+0x35/0x40

[ 63.146973] Allocated by task 14:
[ 63.152077] kasan_kmalloc+0xbf/0xe0
[ 63.157471] __kmalloc+0x110/0x240
[ 63.162804] init_cntrs+0x34d/0xdf0 [hfi1]
[ 63.168883] hfi1_init_dd+0x29a3/0x2f90 [hfi1]
[ 63.175244] init_one+0x551/0xae0 [hfi1]
[ 63.181065] local_pci_probe+0x70/0xd0
[ 63.186759] work_for_cpu_fn+0x29/0x40
[ 63.192310] process_one_work+0x584/0x960
[ 63.198163] worker_thread+0x62e/0x9d0
[ 63.203843] kthread+0x1ae/0x1d0
[ 63.208874] ret_from_fork+0x35/0x40

[ 63.217203] Freed by task 1:
[ 63.221844] __kasan_slab_free+0x12e/0x180
[ 63.227844] kfree+0x92/0x1a0
[ 63.232570] single_release+0x3a/0x60
[ 63.238024] __fput+0x1d9/0x480
[ 63.242911] task_work_run+0x139/0x190
[ 63.248440] exit_to_usermode_loop+0x191/0x1a0
[ 63.254814] do_syscall_64+0x301/0x330
[ 63.260283] entry_SYSCALL_64_after_hwframe+0x44/0xa9

[ 63.270199] The buggy address belongs to the object at
ffff88080e8d5500
which belongs to the cache kmalloc-4096 of size 4096
[ 63.287247] The buggy address is located 3632 bytes inside of
4096-byte region [ffff88080e8d5500, ffff88080e8d6500)
[ 63.303564] The buggy address belongs to the page:
[ 63.310447] page:ffffea00203a3400 count:1 mapcount:0
mapping:ffff88081380e840 index:0x0 compound_mapcount: 0
[ 63.323102] flags: 0x2fffff80008100(slab|head)
[ 63.329775] raw: 002fffff80008100 0000000000000000 0000000100000001
ffff88081380e840
[ 63.340175] raw: 0000000000000000 0000000000070007 00000001ffffffff
0000000000000000
[ 63.350564] page dumped because: kasan: bad access detected

[ 63.361974] Memory state around the buggy address:
[ 63.369137] ffff88080e8d6200: 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00
[ 63.379082] ffff88080e8d6280: 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00
[ 63.389032] >ffff88080e8d6300: 00 00 00 00 00 00 fc fc fc fc fc fc fc
fc fc fc
[ 63.398944] ^
[ 63.406141] ffff88080e8d6380: fc fc fc fc fc fc fc fc fc fc fc fc fc
fc fc fc
[ 63.416109] ffff88080e8d6400: fc fc fc fc fc fc fc fc fc fc fc fc fc
fc fc fc
[ 63.426099]
==================================================================

The trace happens because get_hw_stats() assumes there is room in the
memory allocated in init_cntrs() to accommodate the driver counters.
Unfortunately, that routine only allocated space for the device
counters.

Fix by insuring the allocation has room for the additional driver
counters.

Cc: # v4.14+
Fixes: b7481944b06e9 ("IB/hfi1: Show statistics counters under IB stats interface")
Reviewed-by: Mike Marciniczyn
Reviewed-by: Mike Ruhl
Signed-off-by: Piotr Stankiewicz
Signed-off-by: Dennis Dalessandro
Signed-off-by: Doug Ledford
Signed-off-by: Greg Kroah-Hartman

Piotr Stankiewicz
2018-12-17 16:24:42 +0800
4f03e063a IB/mlx5: Fix page fault handling for MW ... Browse Code »

[ Upstream commit 75b7b86bdb0df37e08e44b6c1f99010967f81944 ]

Memory windows are implemented with an indirect MKey, when a page fault
event comes for a MW Mkey we need to find the MR at the end of the list of
the indirect MKeys by iterating on all items from the first to the last.

The offset calculated during this process has to be zeroed after the first
iteration or the next iteration will start from a wrong address, resulting
incorrect ODP faulting behavior.

Fixes: db570d7deafb ("IB/mlx5: Add ODP support to MW")
Signed-off-by: Artemy Kovalyov
Signed-off-by: Moni Shoua
Signed-off-by: Leon Romanovsky
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin

Artemy Kovalyov
2018-12-17 16:24:36 +0800
4c7d50c23 RDMA/hns: Bugfix pbl configuration for rereg mr ... Browse Code »

[ Upstream commit ca088320a02537f36c243ac21794525d8eabb3bd ]

Current hns driver assigned the first two PBL page addresses from previous
registered MR to the hardware when reregister MR changing the memory
locations occurred. This will lead to PBL addressing error as the PBL has
already been released. This patch fixes this wrong assignment by using the
page address from new allocated PBL.

Fixes: a2c80b7b4119 ("RDMA/hns: Add rereg mr support for hip08")
Signed-off-by: Yixian Liu
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin

Yixian Liu
2018-12-17 16:24:35 +0800
8653ffc34 RDMA/rdmavt: Fix rvt_create_ah function signature ... Browse Code »

[ Upstream commit 4f32fb921b153ae9ea280e02a3e91509fffc03d3 ]

rdmavt uses a crazy system that looses the type checking when assinging
functions to struct ib_device function pointers. Because of this the
signature to this function was not changed when the below commit revised
things.

Fix the signature so we are not calling a function pointer with a
mismatched signature.

Fixes: 477864c8fcd9 ("IB/core: Let create_ah return extended response to user")
Signed-off-by: Kamal Heib
Reviewed-by: Dennis Dalessandro
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin

Kamal Heib
2018-12-17 16:24:34 +0800
59315d0ca RDMA/bnxt_re: Avoid accessing the device structure after it is freed ... Browse Code »

[ Upstream commit a6c66d6a08b88cc10aca9d3f65cfae31e7652a99 ]

When bnxt_re_ib_reg returns failure, the device structure gets
freed. Driver tries to access the device pointer
after it is freed.

[ 4871.034744] Failed to register with netedev: 0xffffffa1
[ 4871.034765] infiniband (null): Failed to register with IB: 0xffffffea
[ 4871.046430] ==================================================================
[ 4871.046437] BUG: KASAN: use-after-free in bnxt_re_task+0x63/0x180 [bnxt_re]
[ 4871.046439] Write of size 4 at addr ffff880fa8406f48 by task kworker/u48:2/17813

[ 4871.046443] CPU: 20 PID: 17813 Comm: kworker/u48:2 Kdump: loaded Tainted: G B OE 4.20.0-rc1+ #42
[ 4871.046444] Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 1.0.4 08/28/2014
[ 4871.046447] Workqueue: bnxt_re bnxt_re_task [bnxt_re]
[ 4871.046449] Call Trace:
[ 4871.046454] dump_stack+0x91/0xeb
[ 4871.046458] print_address_description+0x6a/0x2a0
[ 4871.046461] kasan_report+0x176/0x2d0
[ 4871.046463] ? bnxt_re_task+0x63/0x180 [bnxt_re]
[ 4871.046466] bnxt_re_task+0x63/0x180 [bnxt_re]
[ 4871.046470] process_one_work+0x216/0x5b0
[ 4871.046471] ? process_one_work+0x189/0x5b0
[ 4871.046475] worker_thread+0x4e/0x3d0
[ 4871.046479] kthread+0x10e/0x140
[ 4871.046480] ? process_one_work+0x5b0/0x5b0
[ 4871.046482] ? kthread_stop+0x220/0x220
[ 4871.046486] ret_from_fork+0x3a/0x50

[ 4871.046492] The buggy address belongs to the page:
[ 4871.046494] page:ffffea003ea10180 count:0 mapcount:0 mapping:0000000000000000 index:0x0
[ 4871.046495] flags: 0x57ffffc0000000()
[ 4871.046498] raw: 0057ffffc0000000 0000000000000000 ffffea003ea10188 0000000000000000
[ 4871.046500] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[ 4871.046501] page dumped because: kasan: bad access detected

Avoid accessing the device structure once it is freed.

Fixes: 497158aa5f52 ("RDMA/bnxt_re: Fix the ib_reg failure cleanup")
Signed-off-by: Selvin Xavier
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin

Selvin Xavier
2018-12-17 16:24:34 +0800
f4515855b RDMA/bnxt_re: Fix system hang when registration with L2 driver fails ... Browse Code »

[ Upstream commit 3c4b1419c33c2417836a63f8126834ee36968321 ]

Driver doesn't release rtnl lock if registration with
L2 driver (bnxt_re_register_netdev) fais and this causes
hang while requesting for the next lock.

[ 371.635416] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 371.635417] kworker/u48:1 D 0 634 2 0x80000000
[ 371.635423] Workqueue: bnxt_re bnxt_re_task [bnxt_re]
[ 371.635424] Call Trace:
[ 371.635426] ? __schedule+0x36b/0xbd0
[ 371.635429] schedule+0x39/0x90
[ 371.635430] schedule_preempt_disabled+0x11/0x20
[ 371.635431] __mutex_lock+0x45b/0x9c0
[ 371.635433] ? __mutex_lock+0x16d/0x9c0
[ 371.635435] ? bnxt_re_ib_reg+0x2b/0xb30 [bnxt_re]
[ 371.635438] ? wake_up_klogd+0x37/0x40
[ 371.635442] bnxt_re_ib_reg+0x2b/0xb30 [bnxt_re]
[ 371.635447] bnxt_re_task+0xfd/0x180 [bnxt_re]
[ 371.635449] process_one_work+0x216/0x5b0
[ 371.635450] ? process_one_work+0x189/0x5b0
[ 371.635453] worker_thread+0x4e/0x3d0
[ 371.635455] kthread+0x10e/0x140
[ 371.635456] ? process_one_work+0x5b0/0x5b0
[ 371.635458] ? kthread_stop+0x220/0x220
[ 371.635460] ret_from_fork+0x3a/0x50
[ 371.635477] INFO: task NetworkManager:1228 blocked for more than 120 seconds.
[ 371.635478] Tainted: G B OE 4.20.0-rc1+ #42
[ 371.635479] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Release the rtnl_lock correctly in the failure path.

Fixes: de5c95d0f518 ("RDMA/bnxt_re: Fix system crash during RDMA resource initialization")
Signed-off-by: Selvin Xavier
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin

Selvin Xavier
2018-12-17 16:24:34 +0800
5a49ef983 RDMA/core: Add GIDs while changing MAC addr only for registered ndev ... Browse Code »

[ Upstream commit d52ef88a9f4be523425730da3239cf87bee936da ]

Currently when MAC address is changed, regardless of the netdev reg_state,
GID entries are removed and added to reflect the new MAC address and new
default GID entries.

When a bonding device is used and the underlying PCI device is removed
several netdevice events are generated. Two events of the interest are
CHANGEADDR and UNREGISTER event on lower(slave) netdevice of the bond
netdevice.

Sometimes CHANGEADDR event is generated when netdev state is
UNREGISTERING (after UNREGISTER event is generated). In this scenario, GID
entries for default GIDs are added and never deleted because GID entries
are deleted only when netdev state is < UNREGISTERED.

This leads to non zero reference count on the netdevice. Due to this, PCI
device unbind operation is getting stuck.

To avoid it, when changing mac address, add GID entries only if netdev is
in REGISTERED state.

Fixes: 03db3a2d81e6 ("IB/core: Add RoCE GID table management")
Signed-off-by: Parav Pandit
Reviewed-by: Mark Bloch
Signed-off-by: Leon Romanovsky
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin

Parav Pandit
2018-12-17 16:24:34 +0800
7c736fee5 RDMA/mlx5: Fix fence type for IB_WR_LOCAL_INV WR ... Browse Code »

[ Upstream commit 074fca3a18e7e1e0d4d7dcc9d7badc43b90232f4 ]

Currently, for IB_WR_LOCAL_INV WR, when the next fence is None, the
current fence will be SMALL instead of Normal Fence.

Without this patch krping doesn't work on CX-5 devices and throws
following error:

The error messages are from CX5 driver are: (from server side)
[ 710.434014] mlx5_0:dump_cqe:278:(pid 2712): dump error cqe
[ 710.434016] 00000000 00000000 00000000 00000000
[ 710.434016] 00000000 00000000 00000000 00000000
[ 710.434017] 00000000 00000000 00000000 00000000
[ 710.434018] 00000000 93003204 100000b8 000524d2
[ 710.434019] krping: cq completion failed with wr_id 0 status 4 opcode 128 vender_err 32

Fixed the logic to set the correct fence type.

Fixes: 6e8484c5cf07 ("RDMA/mlx5: set UMR wqe fence according to HCA cap")
Signed-off-by: Majd Dibbiny
Signed-off-by: Leon Romanovsky
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin

Majd Dibbiny
2018-12-17 16:24:34 +0800

08 Dec, 2018

2 commits

a99075642 IB/mlx5: Avoid load failure due to unknown link width ... Browse Code »

commit db7a691a1551a748cb92d9c89c6b190ea87e28d5 upstream.

If the firmware reports a connection width that is not 1x, 4x, 8x or 12x
it causes the driver to fail during initialization.

To prevent this failure every time a new width is introduced to the RDMA
stack, we will set a default 4x width for these widths which ar unknown to
the driver.

This is needed to allow to run old kernels with new firmware.

Cc: # 4.1
Fixes: 1b5daf11b015 ("IB/mlx5: Avoid using the MAD_IFC command under ISSI > 0 mode")
Signed-off-by: Michael Guralnik
Reviewed-by: Majd Dibbiny
Signed-off-by: Leon Romanovsky
Signed-off-by: Jason Gunthorpe
Signed-off-by: Greg Kroah-Hartman

Michael Guralnik
2018-12-08 19:59:07 +0800
61c963ab5 iser: set sector for ambiguous mr status errors ... Browse Code »

commit 24c3456c8d5ee6fc1933ca40f7b4406130682668 upstream.

If for some reason we failed to query the mr status, we need to make sure
to provide sufficient information for an ambiguous error (guard error on
sector 0).

Fixes: 0a7a08ad6f5f ("IB/iser: Implement check_protection")
Cc:
Reported-by: Dan Carpenter
Signed-off-by: Sagi Grimberg
Signed-off-by: Jason Gunthorpe
Signed-off-by: Greg Kroah-Hartman

Sagi Grimberg
2018-12-08 19:59:06 +0800

01 Dec, 2018

1 commit

6763372b8 IB/hfi1: Eliminate races in the SDMA send error path ... Browse Code »

commit a0e0cb82804a6a21d9067022c2dfdf80d11da429 upstream.

pq_update() can only be called in two places: from the completion
function when the complete (npkts) sequence of packets has been
submitted and processed, or from setup function if a subset of the
packets were submitted (i.e. the error path).

Currently both paths can call pq_update() if an error occurrs. This
race will cause the n_req value to go negative, hanging file_close(),
or cause a crash by freeing the txlist more than once.

Several variables are used to determine SDMA send state. Most of
these are unnecessary, and have code inspectible races between the
setup function and the completion function, in both the send path and
the error path.

The request 'status' value can be set by the setup or by the
completion function. This is code inspectibly racy. Since the status
is not needed in the completion code or by the caller it has been
removed.

The request 'done' value races between usage by the setup and the
completion function. The completion function does not need this.
When the number of processed packets matches npkts, it is done.

The 'has_error' value races between usage of the setup and the
completion function. This can cause incorrect error handling and leave
the n_req in an incorrect value (i.e. negative).

Simplify the code by removing all of the unneeded state checks and
variables.

Clean up iovs node when it is freed.

Eliminate race conditions in the error path:

If all packets are submitted, the completion handler will set the
completion status correctly (ok or aborted).

If all packets are not submitted, the caller must wait until the
submitted packets have completed, and then set the completion status.

These two change eliminate the race condition in the error path.

Reviewed-by: Mitko Haralanov
Reviewed-by: Mike Marciniszyn
Signed-off-by: Michael J. Ruhl
Signed-off-by: Dennis Dalessandro
Signed-off-by: Jason Gunthorpe
Signed-off-by: Greg Kroah-Hartman

Michael J. Ruhl
2018-12-01 16:37:30 +0800

14 Nov, 2018

8 commits

b8a7fdb12 IB/mlx5: Fix MR cache initialization ... Browse Code »

commit 013c2403bf32e48119aeb13126929f81352cc7ac upstream.

Schedule MR cache work only after bucket was initialized.

Cc: # 4.10
Fixes: 49780d42dfc9 ("IB/mlx5: Expose MR cache for mlx5_ib")
Signed-off-by: Artemy Kovalyov
Reviewed-by: Majd Dibbiny
Signed-off-by: Leon Romanovsky
Signed-off-by: Jason Gunthorpe
Signed-off-by: Greg Kroah-Hartman

Artemy Kovalyov
2018-11-14 03:08:43 +0800
9dd0fafb5 IB/rxe: fix for duplicate request processing and ack psns ... Browse Code »

[ Upstream commit b97db58557f4aa6d9903f8e1deea6b3d1ed0ba43 ]

Don't reset the resp opcode for a replayed read response.
The resp opcode could be in the middle of a write or send
sequence, when the duplicate read request was received.
An example sequence is as follows:
- Receive read request for 12KB PSN 20. Transmit read response
first, middle and last with PSNs 20,21,22.
- Receive write first PSN 23.
At this point the resp psn is 24 and resp opcode is write first.
- The sender notices that PSN 20 is dropped and retransmits.
Receive read request for 12KB PSN 20. Transmit read response
first, middle and last with PSNs 20,21,22. The resp opcode is
set to -1, the resp psn remains 24.
- Receive write first PSN 23. This is processed by duplicate_request().
The resp opcode remains -1 and resp psn remains 24.
- Receive write middle PSN 24. check_op_seq() reports a missing
first error since the resp opcode is -1.

When sending an ack for a duplicate send or write request,
use the psn of the previous ack sent. Do not use the psn
of a read response for the ack.
An example sequence is as follows:
- Receive write PSN 30. Transmit ACK for PSN 30.
- Receive read request 4KB PSN 31. Transmit read response with
PSN 31. The resp psn is now 32.
- The sender notices that PSN 30 is dropped and retransmits.
Receive write PSN 30. duplicate_request() sends an ACK with
PSN 31. That is incorrect since PSN 31 was a read request.

Signed-off-by: Vijay Immanuel
Signed-off-by: Doug Ledford
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

Vijay Immanuel
2018-11-14 03:08:38 +0800
3c1deb69a IB/mlx5: Allow transition of DCI QP to reset ... Browse Code »

[ Upstream commit 99ed748e878a99c6c7b87bbec063eefd9e47cb42 ]

The transition is allowed from any state and the atrribute mask must be
IB_QP_STATE.

Fixes: c32a4f296e1d ("IB/mlx5: Add support for DC Initiator QP")
Signed-off-by: Moni Shoua
Reviewed-by: Artemy Kovalyov
Signed-off-by: Leon Romanovsky
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

Moni Shoua
2018-11-14 03:08:37 +0800
24d4d9a43 IB/ipoib: Use dev_port to expose network interface port numbers ... Browse Code »

[ Upstream commit 9b8b2a323008aedd39a8debb861b825707f01420 ]

Some InfiniBand network devices have multiple ports on the same PCI
function. This initializes the `dev_port' sysfs field of those
network interfaces with their port number.

Prior to this the kernel erroneously used the `dev_id' sysfs
field of those network interfaces to convey the port number to userspace.

The use of `dev_id' was considered correct until Linux 3.15,
when another field, `dev_port', was defined for this particular
purpose and `dev_id' was reserved for distinguishing stacked ifaces
(e.g: VLANs) with the same hardware address as their parent device.

Similar fixes to net/mlx4_en and many other drivers, which started
exporting this information through `dev_id' before 3.15, were accepted
into the kernel 4 years ago.
See 76a066f2a2a0 (`net/mlx4_en: Expose port number through sysfs').

Signed-off-by: Arseny Maslennikov
Signed-off-by: Doug Ledford
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

Arseny Maslennikov
2018-11-14 03:08:37 +0800
6a89b3a67 RDMA/bnxt_re: Fix recursive lock warning in debug kernel ... Browse Code »

[ Upstream commit d455f29f6d76a5f94881ca1289aaa1e90617ff5d ]

Fix possible recursive lock warning. Its a false warning as the locks are
part of two differnt HW Queue data structure - cmdq and creq. Debug kernel
is throwing the following warning and stack trace.

[ 783.914967] ============================================
[ 783.914970] WARNING: possible recursive locking detected
[ 783.914973] 4.19.0-rc2+ #33 Not tainted
[ 783.914976] --------------------------------------------
[ 783.914979] swapper/2/0 is trying to acquire lock:
[ 783.914982] 000000002aa3949d (&(&hwq->lock)->rlock){..-.}, at: bnxt_qplib_service_creq+0x232/0x350 [bnxt_re]
[ 783.914999]
but task is already holding lock:
[ 783.915002] 00000000be73920d (&(&hwq->lock)->rlock){..-.}, at: bnxt_qplib_service_creq+0x2a/0x350 [bnxt_re]
[ 783.915013]
other info that might help us debug this:
[ 783.915016] Possible unsafe locking scenario:

[ 783.915019] CPU0
[ 783.915021] ----
[ 783.915034] lock(&(&hwq->lock)->rlock);
[ 783.915035] lock(&(&hwq->lock)->rlock);
[ 783.915037]
*** DEADLOCK ***

[ 783.915038] May be due to missing lock nesting notation

[ 783.915039] 1 lock held by swapper/2/0:
[ 783.915040] #0: 00000000be73920d (&(&hwq->lock)->rlock){..-.}, at: bnxt_qplib_service_creq+0x2a/0x350 [bnxt_re]
[ 783.915044]
stack backtrace:
[ 783.915046] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.19.0-rc2+ #33
[ 783.915047] Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 1.0.4 08/28/2014
[ 783.915048] Call Trace:
[ 783.915049]
[ 783.915054] dump_stack+0x90/0xe3
[ 783.915058] __lock_acquire+0x106c/0x1080
[ 783.915061] ? sched_clock+0x5/0x10
[ 783.915063] lock_acquire+0xbd/0x1a0
[ 783.915065] ? bnxt_qplib_service_creq+0x232/0x350 [bnxt_re]
[ 783.915069] _raw_spin_lock_irqsave+0x4a/0x90
[ 783.915071] ? bnxt_qplib_service_creq+0x232/0x350 [bnxt_re]
[ 783.915073] bnxt_qplib_service_creq+0x232/0x350 [bnxt_re]
[ 783.915078] tasklet_action_common.isra.17+0x197/0x1b0
[ 783.915081] __do_softirq+0xcb/0x3a6
[ 783.915084] irq_exit+0xe9/0x100
[ 783.915085] do_IRQ+0x6a/0x120
[ 783.915087] common_interrupt+0xf/0xf
[ 783.915088]

Use nested notation for the spin_lock to avoid this warning.

Signed-off-by: Selvin Xavier
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

Selvin Xavier
2018-11-14 03:08:33 +0800
0b27b2b27 RDMA/bnxt_re: Avoid accessing nq->bar_reg_iomem in failure case ... Browse Code »

[ Upstream commit ed51efd2ce44091a858ad829f666727e7c95695e ]

In the failure path, nq->bar_reg_iomem gets accessed without
initializing. Avoid this by calling the bnxt_qplib_nq_stop_irq only if the
initialization is complete.

Reported-by: Dan Carpenter
Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
Fixes: 6e04b1035689 ("RDMA/bnxt_re: Fix broken RoCE driver due to recent L2 driver changes")
Signed-off-by: Selvin Xavier
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

Selvin Xavier
2018-11-14 03:08:33 +0800
ebd7595dc IB/ipoib: Clear IPCB before icmp_send ... Browse Code »

[ Upstream commit 4d6e4d12da2c308f8f976d3955c45ee62539ac98 ]

IPCB should be cleared before icmp_send, since it may contain data from
previous layers and the data could be misinterpreted as ip header options,
which later caused the ihl to be set to an invalid value and resulted in
the following stack corruption:

[ 1083.031512] ib0: packet len 57824 (> 2048) too long to send, dropping
[ 1083.031843] ib0: packet len 37904 (> 2048) too long to send, dropping
[ 1083.032004] ib0: packet len 4040 (> 2048) too long to send, dropping
[ 1083.032253] ib0: packet len 63800 (> 2048) too long to send, dropping
[ 1083.032481] ib0: packet len 23960 (> 2048) too long to send, dropping
[ 1083.033149] ib0: packet len 63800 (> 2048) too long to send, dropping
[ 1083.033439] ib0: packet len 63800 (> 2048) too long to send, dropping
[ 1083.033700] ib0: packet len 63800 (> 2048) too long to send, dropping
[ 1083.034124] ib0: packet len 63800 (> 2048) too long to send, dropping
[ 1083.034387] ==================================================================
[ 1083.034602] BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0xf08/0x1310
[ 1083.034798] Write of size 4 at addr ffff880353457c5f by task kworker/u16:0/7
[ 1083.034990]
[ 1083.035104] CPU: 7 PID: 7 Comm: kworker/u16:0 Tainted: G O 4.19.0-rc5+ #1
[ 1083.035316] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014
[ 1083.035573] Workqueue: ipoib_wq ipoib_cm_skb_reap [ib_ipoib]
[ 1083.035750] Call Trace:
[ 1083.035888] dump_stack+0x9a/0xeb
[ 1083.036031] print_address_description+0xe3/0x2e0
[ 1083.036213] kasan_report+0x18a/0x2e0
[ 1083.036356] ? __ip_options_echo+0xf08/0x1310
[ 1083.036522] __ip_options_echo+0xf08/0x1310
[ 1083.036688] icmp_send+0x7b9/0x1cd0
[ 1083.036843] ? icmp_route_lookup.constprop.9+0x1070/0x1070
[ 1083.037018] ? netif_schedule_queue+0x5/0x200
[ 1083.037180] ? debug_show_all_locks+0x310/0x310
[ 1083.037341] ? rcu_dynticks_curr_cpu_in_eqs+0x85/0x120
[ 1083.037519] ? debug_locks_off+0x11/0x80
[ 1083.037673] ? debug_check_no_obj_freed+0x207/0x4c6
[ 1083.037841] ? check_flags.part.27+0x450/0x450
[ 1083.037995] ? debug_check_no_obj_freed+0xc3/0x4c6
[ 1083.038169] ? debug_locks_off+0x11/0x80
[ 1083.038318] ? skb_dequeue+0x10e/0x1a0
[ 1083.038476] ? ipoib_cm_skb_reap+0x2b5/0x650 [ib_ipoib]
[ 1083.038642] ? netif_schedule_queue+0xa8/0x200
[ 1083.038820] ? ipoib_cm_skb_reap+0x544/0x650 [ib_ipoib]
[ 1083.038996] ipoib_cm_skb_reap+0x544/0x650 [ib_ipoib]
[ 1083.039174] process_one_work+0x912/0x1830
[ 1083.039336] ? wq_pool_ids_show+0x310/0x310
[ 1083.039491] ? lock_acquire+0x145/0x3a0
[ 1083.042312] worker_thread+0x87/0xbb0
[ 1083.045099] ? process_one_work+0x1830/0x1830
[ 1083.047865] kthread+0x322/0x3e0
[ 1083.050624] ? kthread_create_worker_on_cpu+0xc0/0xc0
[ 1083.053354] ret_from_fork+0x3a/0x50

For instance __ip_options_echo is failing to proceed with invalid srr and
optlen passed from another layer via IPCB

[ 762.139568] IPv4: __ip_options_echo rr=0 ts=0 srr=43 cipso=0
[ 762.139720] IPv4: ip_options_build: IPCB 00000000f3cd969e opt 000000002ccb3533
[ 762.139838] IPv4: __ip_options_echo in srr: optlen 197 soffset 84
[ 762.139852] IPv4: ip_options_build srr=0 is_frag=0 rr_needaddr=0 ts_needaddr=0 ts_needtime=0 rr=0 ts=0
[ 762.140269] ==================================================================
[ 762.140713] IPv4: __ip_options_echo rr=0 ts=0 srr=0 cipso=0
[ 762.141078] BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0x12ec/0x1680
[ 762.141087] Write of size 4 at addr ffff880353457c7f by task kworker/u16:0/7

Signed-off-by: Denis Drozdov
Reviewed-by: Erez Shitrit
Reviewed-by: Feras Daoud
Signed-off-by: Leon Romanovsky
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

Denis Drozdov
2018-11-14 03:08:33 +0800
9f7a37efd RDMA/cm: Respect returned status of cm_init_av_by_path ... Browse Code »

[ Upstream commit e54b6a3bcd1ec972b25a164bdf495d9e7120b107 ]

Add missing check for failure of cm_init_av_by_path

Fixes: e1444b5a163e ("IB/cm: Fix automatic path migration support")
Reported-by: Slava Shwartsman
Reviewed-by: Parav Pandit
Signed-off-by: Leon Romanovsky
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

Leon Romanovsky
2018-11-14 03:08:32 +0800