07 Oct, 2020

6 commits

  • commit 38adf94e166e3cb4eb89683458ca578051e8218d upstream.

    Move the quirked chunk_sectors setting to the same location as noiob
    so that the setting is registered in a single place. Since the noiob
    value is only used locally, remove the member from struct nvme_ns.

    Signed-off-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Revanth Rajashekar
    Signed-off-by: Greg Kroah-Hartman

    Keith Busch
     
  • commit e08f2ae850929d40e66268ee47e443e7ea56eeb7 upstream.

    Introduce the new helper function nvme_lba_to_sect() to convert a device
    logical block number to a 512B sector number. Use this new helper in
    obvious places, cleaning up the code.
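    The helper is a simple left shift. A minimal userspace sketch of the
    arithmetic (the kernel version takes a struct nvme_ns and reads
    ns->lba_shift; here the shift is passed directly for illustration):

```c
#include <stdint.h>

#define SECTOR_SHIFT 9  /* 512-byte block-layer sectors */

/*
 * Convert a device logical block number to a 512B sector number.
 * lba_shift is 12 for a 4096B-block namespace, so each LBA spans
 * 1 << (12 - 9) = 8 sectors.
 */
static inline uint64_t nvme_lba_to_sect(uint32_t lba_shift, uint64_t lba)
{
    return lba << (lba_shift - SECTOR_SHIFT);
}
```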

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe
    Signed-off-by: Revanth Rajashekar
    Signed-off-by: Greg Kroah-Hartman

    Damien Le Moal
     
  • commit 314d48dd224897e35ddcaf5a1d7d133b5adddeb7 upstream.

    Rename nvme_block_nr() to nvme_sect_to_lba() and use SECTOR_SHIFT
    instead of its hard-coded value 9. Also add a comment to describe this
    helper.
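    The renamed helper is the inverse conversion, a right shift. A
    userspace sketch of the arithmetic (the kernel version reads
    ns->lba_shift from struct nvme_ns; the shift is a plain parameter
    here for illustration):

```c
#include <stdint.h>

#define SECTOR_SHIFT 9  /* replaces the hard-coded 9 */

/*
 * Convert a 512B sector number to a device logical block number,
 * given the namespace's block-size shift.
 */
static inline uint64_t nvme_sect_to_lba(uint32_t lba_shift, uint64_t sector)
{
    return sector >> (lba_shift - SECTOR_SHIFT);
}
```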

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe
    Signed-off-by: Revanth Rajashekar
    Signed-off-by: Greg Kroah-Hartman

    Damien Le Moal
     
  • [ Upstream commit 9e0e8dac985d4bd07d9e62922b9d189d3ca2fccf ]

    The lldd may have made calls to delete a remote port or local port,
    and the delete may still be in progress when the CLI then attempts to
    create a new controller. Currently this proceeds without error,
    although it cannot succeed.

    Fix this by validating that both the host port and remote port are
    present when a new controller is to be created.

    Signed-off-by: James Smart
    Reviewed-by: Himanshu Madhani
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    James Smart
     
  • [ Upstream commit 50b7c24390a53c78de546215282fb52980f1d7b7 ]

    Currently we use nvmeq->q_depth as the upper limit for a valid tag in
    nvme_handle_cqe(), but this is not correct: the number of available
    tags is recorded in the tagset, and it is not equal to nvmeq->q_depth.

    The nvme driver registers interrupts for queues before initializing the
    tagset, because it uses the number of successful request_irq() calls to
    configure the tagset parameters. This allows a race condition with the
    current tag validity check if the controller happens to produce an
    interrupt with a corrupted CQE before the tagset is initialized.

    Replace the driver's indirect tag check with the one already provided by
    the block layer.
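    A toy illustration of the bound change (the values below are invented
    for the example; the actual fix simply uses the block layer's
    blk_mq_tag_to_rq(), which returns NULL for tags it never issued):

```c
#include <stdbool.h>

/*
 * Toy model: the tagset hands out fewer tags than nvmeq->q_depth
 * (NVMe keeps one queue entry unused), so a corrupted CQE carrying
 * tag q_depth - 1 passes the old check but fails the tagset bound.
 */
static bool tag_valid_old(unsigned int tag, unsigned int q_depth)
{
    return tag < q_depth;   /* wrong upper limit */
}

static bool tag_valid_new(unsigned int tag, unsigned int nr_tags)
{
    return tag < nr_tags;   /* limit recorded in the tagset */
}
```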

    Signed-off-by: Xianting Tian
    Reviewed-by: Keith Busch
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Xianting Tian
     
  • [ Upstream commit 52a3974feb1a3eec25d8836d37a508b67b0a9cd0 ]

    Get and put the reference to the ctrl in the nvme_dev_open() and
    nvme_dev_release() before and after module get/put for ctrl in char
    device file operations.

    Introduce a char_dev release function that gets/puts the controller
    and module, which allows us to fix a potential Oops that can be easily
    reproduced with a passthru ctrl (although the problem also exists with
    pure user access):

    Entering kdb (current=0xffff8887f8290000, pid 3128) on processor 30 Oops: (null)
    due to oops @ 0xffffffffa01019ad
    CPU: 30 PID: 3128 Comm: bash Tainted: G W OE 5.8.0-rc4nvme-5.9+ #35
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.4
    RIP: 0010:nvme_free_ctrl+0x234/0x285 [nvme_core]
    Code: 57 10 a0 e8 73 bf 02 e1 ba 3d 11 00 00 48 c7 c6 98 33 10 a0 48 c7 c7 1d 57 10 a0 e8 5b bf 02 e1 8
    RSP: 0018:ffffc90001d63de0 EFLAGS: 00010246
    RAX: ffffffffa05c0440 RBX: ffff8888119e45a0 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffff8888177e9550 RDI: ffff8888119e43b0
    RBP: ffff8887d4768000 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: ffffc90001d63c90 R12: ffff8888119e43b0
    R13: ffff8888119e5108 R14: dead000000000100 R15: ffff8888119e5108
    FS: 00007f1ef27b0740(0000) GS:ffff888817600000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffffffa05c0470 CR3: 00000007f6bee000 CR4: 00000000003406e0
    Call Trace:
    device_release+0x27/0x80
    kobject_put+0x98/0x170
    nvmet_passthru_ctrl_disable+0x4a/0x70 [nvmet]
    nvmet_passthru_enable_store+0x4c/0x90 [nvmet]
    configfs_write_file+0xe6/0x150
    vfs_write+0xba/0x1e0
    ksys_write+0x5f/0xe0
    do_syscall_64+0x52/0xb0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f1ef1eb2840
    Code: Bad RIP value.
    RSP: 002b:00007fffdbff0eb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f1ef1eb2840
    RDX: 0000000000000002 RSI: 00007f1ef27d2000 RDI: 0000000000000001
    RBP: 00007f1ef27d2000 R08: 000000000000000a R09: 00007f1ef27b0740
    R10: 0000000000000001 R11: 0000000000000246 R12: 00007f1ef2186400
    R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000

    With this patch fix we take the module ref count in nvme_dev_open() and
    release that ref count in newly introduced nvme_dev_release().
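    The ordering can be sketched with toy refcounts (function names mirror
    the commit message; the real code operates on struct nvme_ctrl and the
    owning module):

```c
/* Toy counters standing in for the ctrl kref and the module refcount. */
static int ctrl_refs, module_refs;

static void nvme_get_ctrl(void)  { ctrl_refs++; }
static void nvme_put_ctrl(void)  { ctrl_refs--; }
static int  try_module_get(void) { module_refs++; return 1; }
static void module_put(void)     { module_refs--; }

/* Open pins the ctrl first, then the module; on failure it unwinds. */
static int nvme_dev_open(void)
{
    nvme_get_ctrl();
    if (!try_module_get()) {
        nvme_put_ctrl();
        return -1;
    }
    return 0;
}

/* The new release callback drops both references in reverse order. */
static void nvme_dev_release(void)
{
    module_put();
    nvme_put_ctrl();
}
```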

    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Chaitanya Kulkarni
     

01 Oct, 2020

7 commits

  • [ Upstream commit af5ad17854f96a6d3c9775e776bd01ab262672a1 ]

    When NVME_TCP is enabled and CRYPTO is disabled, it results in the
    following Kbuild warning:

    WARNING: unmet direct dependencies detected for CRYPTO_CRC32C
    Depends on [n]: CRYPTO [=n]
    Selected by [y]:
    - NVME_TCP [=y] && INET [=y] && BLK_DEV_NVME [=y]

    The reason is that NVME_TCP selects CRYPTO_CRC32C without depending on or
    selecting CRYPTO while CRYPTO_CRC32C is subordinate to CRYPTO.

    Honor the kconfig menu hierarchy to remove kconfig dependency warnings.
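    The shape of the fix (a sketch, not the exact upstream hunk; the
    prompt string is abbreviated) is to make NVME_TCP select CRYPTO
    alongside CRYPTO_CRC32C:

```kconfig
config NVME_TCP
	tristate "NVM Express over Fabrics TCP host driver"
	depends on INET
	depends on BLK_DEV_NVME
	select CRYPTO
	select CRYPTO_CRC32C
```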

    Fixes: 79fd751d61aa ("nvme: tcp: selects CRYPTO_CRC32C for nvme-tcp")
    Signed-off-by: Necip Fazil Yildiran
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Necip Fazil Yildiran
     
  • [ Upstream commit 05b29021fba5e725dd385151ef00b6340229b500 ]

    Commit 3b4b19721ec652 ("nvme: fix possible deadlock when I/O is
    blocked") reverted multipath head disk revalidation due to deadlocks
    caused by holding the bd_mutex during revalidate.

    Updating the multipath disk blockdev size is still required though for
    userspace to be able to observe any resizing while the device is
    mounted. Directly update the bdev inode size to avoid unnecessarily
    holding the bdev->bd_mutex.

    Fixes: 3b4b19721ec652 ("nvme: fix possible deadlock when I/O is blocked")

    Signed-off-by: Anthony Iliopoulos
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Anthony Iliopoulos
     
  • [ Upstream commit 3b4b19721ec652ad2c4fe51dfbe5124212b5f581 ]

    Revert fab7772bfbcf ("nvme-multipath: revalidate nvme_ns_head gendisk
    in nvme_validate_ns")

    When adding a new namespace to the head disk (via nvme_mpath_set_live)
    we will see a partition scan, which triggers I/O on the mpath device
    node. This process will usually be triggered from the scan_work, which
    holds the scan_lock. If that I/O blocks (e.g. an ANA change left us
    with only paths that are present but not accessible), this can
    deadlock on the head disk bd_mutex, as both the partition scan I/O
    takes it and head disk revalidation takes it to check for resize (also
    triggered from scan_work on a different path). See trace [1].

    The mpath disk revalidation was originally added to detect online disk
    size change, but this is no longer needed since commit cb224c3af4df
    ("nvme: Convert to use set_capacity_revalidate_and_notify") which already
    updates resize info without unnecessarily revalidating the disk (the
    mpath disk doesn't even implement .revalidate_disk fop).

    [1]:
    --
    kernel: INFO: task kworker/u65:9:494 blocked for more than 241 seconds.
    kernel: Tainted: G OE 5.3.5-050305-generic #201910071830
    kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    kernel: kworker/u65:9 D 0 494 2 0x80004000
    kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
    kernel: Call Trace:
    kernel: __schedule+0x2b9/0x6c0
    kernel: schedule+0x42/0xb0
    kernel: schedule_preempt_disabled+0xe/0x10
    kernel: __mutex_lock.isra.0+0x182/0x4f0
    kernel: __mutex_lock_slowpath+0x13/0x20
    kernel: mutex_lock+0x2e/0x40
    kernel: revalidate_disk+0x63/0xa0
    kernel: __nvme_revalidate_disk+0xfe/0x110 [nvme_core]
    kernel: nvme_revalidate_disk+0xa4/0x160 [nvme_core]
    kernel: ? evict+0x14c/0x1b0
    kernel: revalidate_disk+0x2b/0xa0
    kernel: nvme_validate_ns+0x49/0x940 [nvme_core]
    kernel: ? blk_mq_free_request+0xd2/0x100
    kernel: ? __nvme_submit_sync_cmd+0xbe/0x1e0 [nvme_core]
    kernel: nvme_scan_work+0x24f/0x380 [nvme_core]
    kernel: process_one_work+0x1db/0x380
    kernel: worker_thread+0x249/0x400
    kernel: kthread+0x104/0x140
    kernel: ? process_one_work+0x380/0x380
    kernel: ? kthread_park+0x80/0x80
    kernel: ret_from_fork+0x1f/0x40
    ...
    kernel: INFO: task kworker/u65:1:2630 blocked for more than 241 seconds.
    kernel: Tainted: G OE 5.3.5-050305-generic #201910071830
    kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    kernel: kworker/u65:1 D 0 2630 2 0x80004000
    kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
    kernel: Call Trace:
    kernel: __schedule+0x2b9/0x6c0
    kernel: schedule+0x42/0xb0
    kernel: io_schedule+0x16/0x40
    kernel: do_read_cache_page+0x438/0x830
    kernel: ? __switch_to_asm+0x34/0x70
    kernel: ? file_fdatawait_range+0x30/0x30
    kernel: read_cache_page+0x12/0x20
    kernel: read_dev_sector+0x27/0xc0
    kernel: read_lba+0xc1/0x220
    kernel: ? kmem_cache_alloc_trace+0x19c/0x230
    kernel: efi_partition+0x1e6/0x708
    kernel: ? vsnprintf+0x39e/0x4e0
    kernel: ? snprintf+0x49/0x60
    kernel: check_partition+0x154/0x244
    kernel: rescan_partitions+0xae/0x280
    kernel: __blkdev_get+0x40f/0x560
    kernel: blkdev_get+0x3d/0x140
    kernel: __device_add_disk+0x388/0x480
    kernel: device_add_disk+0x13/0x20
    kernel: nvme_mpath_set_live+0x119/0x140 [nvme_core]
    kernel: nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
    kernel: nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
    kernel: nvme_parse_ana_log+0xa1/0x180 [nvme_core]
    kernel: ? nvme_update_ns_ana_state+0x60/0x60 [nvme_core]
    kernel: nvme_mpath_add_disk+0x47/0x90 [nvme_core]
    kernel: nvme_validate_ns+0x396/0x940 [nvme_core]
    kernel: ? blk_mq_free_request+0xd2/0x100
    kernel: nvme_scan_work+0x24f/0x380 [nvme_core]
    kernel: process_one_work+0x1db/0x380
    kernel: worker_thread+0x249/0x400
    kernel: kthread+0x104/0x140
    kernel: ? process_one_work+0x380/0x380
    kernel: ? kthread_park+0x80/0x80
    kernel: ret_from_fork+0x1f/0x40
    --

    Fixes: fab7772bfbcf ("nvme-multipath: revalidate nvme_ns_head gendisk in nvme_validate_ns")
    Signed-off-by: Anton Eidelman
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Sagi Grimberg
     
  • [ Upstream commit 21f9024355e58772ec5d7fc3534aa5e29d72a8b6 ]

    In case rdma accept fails at nvmet_rdma_queue_connect(), release work
    is scheduled. Later on, a new RDMA CM event may arrive, since we
    didn't destroy the cm-id, and call nvmet_rdma_queue_connect_fail(),
    which schedules another release work. This causes
    nvmet_rdma_free_queue() to be called twice. To fix this, implicitly
    destroy the cm_id by returning a non-zero code, which guarantees that
    new rdma_cm events will not arrive afterwards. Also add a qp pointer
    to the nvmet_rdma_queue structure, so we can use it when the cm_id
    pointer is NULL or has been destroyed.

    Signed-off-by: Israel Rukshin
    Suggested-by: Sagi Grimberg
    Reviewed-by: Max Gurtovoy
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Israel Rukshin
     
  • [ Upstream commit ce1518139e6976cf19c133b555083354fdb629b8 ]

    Calling nvme_sysfs_delete() when the controller is in the middle of
    creation may cause several bugs. If the controller is in the NEW
    state, we remove the delete_controller file but don't delete the
    controller; the user will then not be able to use the nvme disconnect
    command on that controller again, although the controller may be
    active. Other bugs may happen if the controller is in the middle of
    the create_ctrl callback when nvme_do_delete_ctrl() starts, for
    example freeing the I/O tagset in nvme_do_delete_ctrl() before it was
    allocated in the create_ctrl callback.

    To fix all those races don't allow the user to delete the controller
    before it was fully created.

    Signed-off-by: Israel Rukshin
    Reviewed-by: Max Gurtovoy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin

    Israel Rukshin
     
  • [ Upstream commit b780d7415aacec855e2f2370cbf98f918b224903 ]

    In case nvme_sysfs_delete() is called by the user before the ctrl
    reference count is taken, the ctrl may be freed during creation,
    causing a bug. Take the reference as soon as the controller is
    externally visible, which happens in cdev_device_add() in
    nvme_init_ctrl(). Also take the reference count at the core layer
    instead of taking it in each transport separately.

    Signed-off-by: Israel Rukshin
    Reviewed-by: Max Gurtovoy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin

    Israel Rukshin
     
  • [ Upstream commit 764e9332098c0e60251386a507fe46ac91276120 ]

    The nvme multipath error handling defaults to controller reset if the
    error is unknown. There are, however, no existing nvme status codes that
    indicate a reset should be used, and resetting causes unnecessary
    disruption to the rest of IO.

    Change nvme's error handling to first check if failover should happen.
    If not, let the normal error handling take over rather than reset the
    controller.

    Based-on-a-patch-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: John Meneghini
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin

    John Meneghini
     

23 Sep, 2020

3 commits

  • [ Upstream commit ceb1e0874dba5cbfc4e0b4145796a4bfb3716e6a ]

    Cancel the async event work in case an async event has been queued up
    and nvme_tcp_submit_async_event() runs after the event has been freed.

    Signed-off-by: David Milburn
    Reviewed-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    David Milburn
     
  • [ Upstream commit 925dd04c1f9825194b9e444c12478084813b2b5d ]

    Cancel the async event work in case an async event has been queued up
    and nvme_rdma_submit_async_event() runs after the event has been freed.

    Signed-off-by: David Milburn
    Reviewed-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    David Milburn
     
  • [ Upstream commit e126e8210e950bb83414c4f57b3120ddb8450742 ]

    Cancel the async event work in case an async event has been queued up
    and nvme_fc_submit_async_event() runs after the event has been freed.

    Signed-off-by: David Milburn
    Reviewed-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    David Milburn
     

17 Sep, 2020

11 commits

  • [ Upstream commit 7ad92f656bddff4cf8f641e0e3b1acd4eb9644cb ]

    This patch addresses an IRQ free warning and a NULL pointer
    dereference when an nvme device hits a timeout error during
    initialization. The problem happens when nvme_timeout() is called
    while nvme_reset_work() is still executing. Fix it by setting the flag
    of the problematic request to NVME_REQ_CANCELLED before calling
    nvme_dev_disable(), to make sure __nvme_submit_sync_cmd() returns an
    error code and nvme_submit_sync_cmd() fails gracefully.
    The following is the console output.

    [ 62.472097] nvme nvme0: I/O 13 QID 0 timeout, disable controller
    [ 62.488796] nvme nvme0: could not set timestamp (881)
    [ 62.494888] ------------[ cut here ]------------
    [ 62.495142] Trying to free already-free IRQ 11
    [ 62.495366] WARNING: CPU: 0 PID: 7 at kernel/irq/manage.c:1751 free_irq+0x1f7/0x370
    [ 62.495742] Modules linked in:
    [ 62.495902] CPU: 0 PID: 7 Comm: kworker/u4:0 Not tainted 5.8.0+ #8
    [ 62.496206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-p4
    [ 62.496772] Workqueue: nvme-reset-wq nvme_reset_work
    [ 62.497019] RIP: 0010:free_irq+0x1f7/0x370
    [ 62.497223] Code: e8 ce 49 11 00 48 83 c4 08 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 44 89 f6 48 c70
    [ 62.498133] RSP: 0000:ffffa96800043d40 EFLAGS: 00010086
    [ 62.498391] RAX: 0000000000000000 RBX: ffff9b87fc458400 RCX: 0000000000000000
    [ 62.498741] RDX: 0000000000000001 RSI: 0000000000000096 RDI: ffffffff9693d72c
    [ 62.499091] RBP: ffff9b87fd4c8f60 R08: ffffa96800043bfd R09: 0000000000000163
    [ 62.499440] R10: ffffa96800043bf8 R11: ffffa96800043bfd R12: ffff9b87fd4c8e00
    [ 62.499790] R13: ffff9b87fd4c8ea4 R14: 000000000000000b R15: ffff9b87fd76b000
    [ 62.500140] FS: 0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
    [ 62.500534] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 62.500816] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
    [ 62.501165] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 62.501515] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 62.501864] Call Trace:
    [ 62.501993] pci_free_irq+0x13/0x20
    [ 62.502167] nvme_reset_work+0x5d0/0x12a0
    [ 62.502369] ? update_load_avg+0x59/0x580
    [ 62.502569] ? ttwu_queue_wakelist+0xa8/0xc0
    [ 62.502780] ? try_to_wake_up+0x1a2/0x450
    [ 62.502979] process_one_work+0x1d2/0x390
    [ 62.503179] worker_thread+0x45/0x3b0
    [ 62.503361] ? process_one_work+0x390/0x390
    [ 62.503568] kthread+0xf9/0x130
    [ 62.503726] ? kthread_park+0x80/0x80
    [ 62.503911] ret_from_fork+0x22/0x30
    [ 62.504090] ---[ end trace de9ed4a70f8d71e2 ]---
    [ 123.912275] nvme nvme0: I/O 12 QID 0 timeout, disable controller
    [ 123.914670] nvme nvme0: 1/0/0 default/read/poll queues
    [ 123.916310] BUG: kernel NULL pointer dereference, address: 0000000000000000
    [ 123.917469] #PF: supervisor write access in kernel mode
    [ 123.917725] #PF: error_code(0x0002) - not-present page
    [ 123.917976] PGD 0 P4D 0
    [ 123.918109] Oops: 0002 [#1] SMP PTI
    [ 123.918283] CPU: 0 PID: 7 Comm: kworker/u4:0 Tainted: G W 5.8.0+ #8
    [ 123.918650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-p4
    [ 123.919219] Workqueue: nvme-reset-wq nvme_reset_work
    [ 123.919469] RIP: 0010:__blk_mq_alloc_map_and_request+0x21/0x80
    [ 123.919757] Code: 66 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 63 ee 53 48 8b 47 68 89 ee 48 89 fb 8b4
    [ 123.920657] RSP: 0000:ffffa96800043d40 EFLAGS: 00010286
    [ 123.920912] RAX: ffff9b87fc4fee40 RBX: ffff9b87fc8cb008 RCX: 0000000000000000
    [ 123.921258] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b87fc618000
    [ 123.921602] RBP: 0000000000000000 R08: ffff9b87fdc2c4a0 R09: ffff9b87fc616000
    [ 123.921949] R10: 0000000000000000 R11: ffff9b87fffd1500 R12: 0000000000000000
    [ 123.922295] R13: 0000000000000000 R14: ffff9b87fc8cb200 R15: ffff9b87fc8cb000
    [ 123.922641] FS: 0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
    [ 123.923032] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 123.923312] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
    [ 123.923660] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 123.924007] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 123.924353] Call Trace:
    [ 123.924479] blk_mq_alloc_tag_set+0x137/0x2a0
    [ 123.924694] nvme_reset_work+0xed6/0x12a0
    [ 123.924898] process_one_work+0x1d2/0x390
    [ 123.925099] worker_thread+0x45/0x3b0
    [ 123.925280] ? process_one_work+0x390/0x390
    [ 123.925486] kthread+0xf9/0x130
    [ 123.925642] ? kthread_park+0x80/0x80
    [ 123.925825] ret_from_fork+0x22/0x30
    [ 123.926004] Modules linked in:
    [ 123.926158] CR2: 0000000000000000
    [ 123.926322] ---[ end trace de9ed4a70f8d71e3 ]---
    [ 123.926549] RIP: 0010:__blk_mq_alloc_map_and_request+0x21/0x80
    [ 123.926832] Code: 66 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 63 ee 53 48 8b 47 68 89 ee 48 89 fb 8b4
    [ 123.927734] RSP: 0000:ffffa96800043d40 EFLAGS: 00010286
    [ 123.927989] RAX: ffff9b87fc4fee40 RBX: ffff9b87fc8cb008 RCX: 0000000000000000
    [ 123.928336] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b87fc618000
    [ 123.928679] RBP: 0000000000000000 R08: ffff9b87fdc2c4a0 R09: ffff9b87fc616000
    [ 123.929025] R10: 0000000000000000 R11: ffff9b87fffd1500 R12: 0000000000000000
    [ 123.929370] R13: 0000000000000000 R14: ffff9b87fc8cb200 R15: ffff9b87fc8cb000
    [ 123.929715] FS: 0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
    [ 123.930106] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 123.930384] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
    [ 123.930731] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 123.931077] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

    Co-developed-by: Keith Busch
    Signed-off-by: Tong Zhang
    Reviewed-by: Keith Busch
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Tong Zhang
     
  • [ Upstream commit 2362acb6785611eda795bfc12e1ea6b202ecf62c ]

    If the controller becomes unresponsive in the middle of a reset, we
    will hang because we are waiting for the freeze to complete, but that
    cannot happen since we have commands inflight holding the
    q_usage_counter, and we can't blindly fail requests that time out.

    So give it a timeout, and if we cannot wait for the queue freeze
    before unfreezing, fail and let the error handling decide how to
    proceed (either schedule a reconnect or remove the controller).

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Sagi Grimberg
     
  • [ Upstream commit 0475a8dcbcee92a5d22e40c9c6353829fc6294b8 ]

    When a request times out in a LIVE state, we simply trigger error
    recovery and let the error recovery handle the request cancellation,
    however when a request times out in a non LIVE state, we make sure to
    complete it immediately as it might block controller setup or teardown
    and prevent forward progress.

    However, tearing down the entire set of I/O and admin queues causes a
    freeze/unfreeze imbalance (q->mq_freeze_depth) and is really overkill
    for what we actually need, which is just to fence controller teardown
    that may be running, stop the queue, and cancel the request if it is
    not already completed.

    Now that we have the controller teardown_lock, we can safely serialize
    request cancellation. This addresses a hang caused by calling extra
    queue freeze on controller namespaces, causing unfreeze to not complete
    correctly.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Smart
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Sagi Grimberg
     
  • [ Upstream commit 5110f40241d08334375eb0495f174b1d2c07657e ]

    In the timeout handler we may need to complete a request because the
    request that timed out may be an I/O that is a part of a serial sequence
    of controller teardown or initialization. In order to complete the
    request, we need to fence any other context that may compete with us
    and complete the request that is timing out.

    In this case, we could have a potential double completion in case
    a hard-irq or a different competing context triggered error recovery
    and is running inflight request cancellation concurrently with the
    timeout handler.

    Protect using a ctrl teardown_lock to serialize contexts that may
    complete a cancelled request due to error recovery or a reset.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Smart
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Sagi Grimberg
     
  • [ Upstream commit e5c01f4f7f623e768e868bcc08d8e7ceb03b75d0 ]

    If the controller becomes unresponsive in the middle of a reset, we
    will hang because we are waiting for the freeze to complete, but that
    cannot happen since we have commands inflight holding the
    q_usage_counter, and we can't blindly fail requests that time out.

    So give it a timeout, and if we cannot wait for the queue freeze
    before unfreezing, fail and let the error handling decide how to
    proceed (either schedule a reconnect or remove the controller).

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Sagi Grimberg
     
  • [ Upstream commit 236187c4ed195161dfa4237c7beffbba0c5ae45b ]

    When a request times out in a LIVE state, we simply trigger error
    recovery and let the error recovery handle the request cancellation,
    however when a request times out in a non LIVE state, we make sure to
    complete it immediately as it might block controller setup or teardown
    and prevent forward progress.

    However, tearing down the entire set of I/O and admin queues causes a
    freeze/unfreeze imbalance (q->mq_freeze_depth) and is really overkill
    for what we actually need, which is just to fence controller teardown
    that may be running, stop the queue, and cancel the request if it is
    not already completed.

    Now that we have the controller teardown_lock, we can safely serialize
    request cancellation. This addresses a hang caused by calling extra
    queue freeze on controller namespaces, causing unfreeze to not complete
    correctly.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Sagi Grimberg
     
  • [ Upstream commit d4d61470ae48838f49e668503e840e1520b97162 ]

    In the timeout handler we may need to complete a request because the
    request that timed out may be an I/O that is a part of a serial sequence
    of controller teardown or initialization. In order to complete the
    request, we need to fence any other context that may compete with us
    and complete the request that is timing out.

    In this case, we could have a potential double completion in case
    a hard-irq or a different competing context triggered error recovery
    and is running inflight request cancellation concurrently with the
    timeout handler.

    Protect using a ctrl teardown_lock to serialize contexts that may
    complete a cancelled request due to error recovery or a reset.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Sagi Grimberg
     
  • [ Upstream commit 7cf0d7c0f3c3b0203aaf81c1bc884924d8fdb9bd ]

    Users can detect whether the wait has completed and take appropriate
    action based on this information (e.g. whether to continue
    initialization or to fail and schedule another initialization
    attempt).

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Sagi Grimberg
     
  • [ Upstream commit d7144f5c4cf4de95fdc3422943cf51c06aeaf7a7 ]

    NVME_CTRL_NEW should never see any I/O, because in order to start
    initialization it has to transition to NVME_CTRL_CONNECTING and from
    there it will never return to this state.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Sagi Grimberg
     
  • [ Upstream commit a6ce7d7b4adaebc27ee7e78e5ecc378a1cfc221d ]

    When handling commands without in-capsule data, we assign the ttag
    assuming we already have the queue commands array allocated (based
    on the queue size information in the connect data payload). However,
    if the connect itself did not send the connect data in-capsule, we
    have yet to allocate the queue commands, and we will assign a bogus
    ttag and suffer a NULL dereference when we receive the corresponding
    h2cdata PDU.

    Fix this by checking whether we have already allocated commands before
    dereferencing them when handling h2cdata; if we haven't, this is
    certainly a connect, and we should use the preallocated connect
    command.
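    The check can be sketched with simplified stand-ins for the queue and
    command structures (names are illustrative, not the nvmet-tcp
    definitions):

```c
#include <stddef.h>

struct cmd { int ttag; };

struct queue {
    struct cmd *cmds;        /* allocated only after connect data arrives */
    size_t      nr_cmds;
    struct cmd  connect_cmd; /* preallocated for the connect itself */
};

/*
 * On an h2cdata PDU, only index into the commands array if it has been
 * allocated; otherwise this must be the connect, so use the preallocated
 * connect command. A ttag outside the array is rejected.
 */
static struct cmd *lookup_cmd(struct queue *q, int ttag)
{
    if (!q->cmds)
        return &q->connect_cmd;
    if (ttag < 0 || (size_t)ttag >= q->nr_cmds)
        return NULL;  /* bogus ttag */
    return &q->cmds[ttag];
}
```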

    Signed-off-by: Ziye Yang
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Ziye Yang
     
  • [ Upstream commit 73a5379937ec89b91e907bb315e2434ee9696a2c ]

    Right now we are failing requests based on the controller state (which
    is checked inline in nvmf_check_ready); however, we should definitely
    accept requests if the queue is live.

    When entering controller reset, we transition the controller into
    NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for non-mpath
    requests (have blk_noretry_request set).

    This is also the case for NVME_REQ_USER for the wrong reason. There
    shouldn't be any reason for us to reject this I/O in a controller reset.
    We do want to prevent passthru commands on the admin queue because we
    need the controller to fully initialize first before we let user passthru
    admin commands to be issued.

    In a non-mpath setup, this means that the requests will simply be
    requeued over and over forever not allowing the q_usage_counter to drop
    its final reference, causing controller reset to hang if running
    concurrently with heavy I/O.

    Fixes: 35897b920c8a ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready")
    Reviewed-by: James Smart
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Sagi Grimberg
     

10 Sep, 2020

3 commits

  • [ Upstream commit 192f6c29bb28bfd0a17e6ad331d09f1ec84143d0 ]

    If the driver has to unbind from the controller for an early failure
    before the subsystem has been set up, there won't be a subsystem holding
    the controller's instance, so the controller needs to free its own
    instance in this case.

    Fixes: 733e4b69d508d ("nvme: Assign subsys instance from first ctrl")
    Signed-off-by: Keith Busch
    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Keith Busch
     
  • [ Upstream commit 70e37988db94aba607d5491a94f80ba08e399b6b ]

    The use of 'spin_lock()' and 'spin_lock_irqsave()' is inconsistent
    in this function.

    Use 'spin_lock_irqsave()' here as well, as there is no guarantee that
    interrupts are disabled at that point, according to the surrounding
    code.

    Fixes: a97ec51b37ef ("nvmet_fc: Rework target side abort handling")
    Signed-off-by: Christophe JAILLET
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Christophe JAILLET
     
  • [ Upstream commit 0d3b6a8d213a30387b5104b2fb25376d18636f23 ]

    Per the NVMe spec, when the keep-alive timeout is set to zero, the
    keep-alive timer should be disabled.
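    The rule reduces to a single check; a hedged sketch with an
    illustrative helper name (not a kernel function):

```c
#include <stdbool.h>

/*
 * Per the spec, a keep-alive timeout (KATO) of zero means keep-alive
 * is disabled, so no timer should be armed.
 */
static bool keep_alive_enabled(unsigned int kato)
{
    return kato != 0;
}
```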

    Signed-off-by: Amit Engel
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Amit Engel
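
A minimal sketch of the intended behavior, using mock structs rather than the real nvmet types:

```c
#include <stdbool.h>

/* Illustrative sketch, not the real nvmet structs: kato is the
 * keep-alive timeout (in seconds) negotiated by the host. */
struct mock_ctrl {
    unsigned int kato;
    bool timer_armed;
};

static void start_keep_alive(struct mock_ctrl *ctrl)
{
    /* Per the spec, a zero timeout means keep-alive is disabled,
     * so never arm the timer in that case. */
    if (ctrl->kato == 0)
        return;
    ctrl->timer_armed = true;
}
```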
     

03 Sep, 2020

2 commits

  • [ Upstream commit 93eb0381e13d249a18ed4aae203291ff977e7ffb ]

    If there's only one usable, non-optimized path, nvme_round_robin_path()
    returns NULL, which is wrong. Fix it by falling back to "old", like in
    the single optimized path case. Also, if the active path isn't changed,
    there's no need to re-assign the pointer.

    Fixes: 3f6e3246db0e ("nvme-multipath: fix logic for non-optimized paths")
    Signed-off-by: Martin Wilck
    Signed-off-by: Martin George
    Reviewed-by: Keith Busch
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Martin Wilck
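
The fixed selection logic can be sketched as follows (a simplified mock; the real code walks an SRCU-protected namespace list):

```c
#include <stddef.h>

/* Simplified model of round-robin path selection; names are illustrative. */
enum state { OPTIMIZED, NONOPT, UNUSABLE };

struct path { enum state state; };

static struct path *round_robin(struct path **paths, size_t n,
                                struct path *old)
{
    struct path *found = NULL;

    for (size_t i = 0; i < n; i++) {
        if (paths[i]->state == OPTIMIZED)
            return paths[i];          /* best case: take it immediately */
        if (paths[i]->state == NONOPT && !found)
            found = paths[i];         /* remember first usable fallback */
    }
    /* The fix: with no optimized path, fall back to the current path
     * instead of returning NULL when nothing better was found. */
    return found ? found : old;
}
```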
     
  • [ Upstream commit f34448cd0dc697723fb5f4118f8431d9233b370d ]

    On an error exit path, a negative error code should be returned
    instead of a positive return value.

    Fixes: e399441de9115 ("nvme-fabrics: Add host support for FC transport")
    Cc: James Smart
    Signed-off-by: Tianjia Zhang
    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Tianjia Zhang
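
The kernel convention this restores is that error paths return a negative errno, which callers test with ret < 0. A minimal illustration with a hypothetical helper:

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: returns 0 on success, -ENOMEM on failure.
 * Returning a positive value on the error path would make the usual
 * "if (ret < 0)" checks in callers silently miss the error. */
static int setup_buffer(void **buf, size_t len)
{
    *buf = malloc(len);
    if (!*buf)
        return -ENOMEM;   /* negative errno, per kernel convention */
    return 0;
}
```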
     

19 Aug, 2020

5 commits

  • [ Upstream commit fbd6a42d8932e172921c7de10468a2e12c34846b ]

    When nvme_round_robin_path() finds a valid namespace we should be
    using it; falling back to __nvme_find_path() for non-optimized paths
    would cause its result to be ignored.

    Fixes: 75c10e732724 ("nvme-multipath: round-robin I/O policy")
    Signed-off-by: Martin Wilck
    Signed-off-by: Hannes Reinecke
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Hannes Reinecke
     
  • [ Upstream commit 3f6e3246db0e6f92e784965d9d0edb8abe6c6b74 ]

    Handle the special case where we have exactly one optimized path,
    which we should keep using in this case.

    Fixes: 75c10e732724 ("nvme-multipath: round-robin I/O policy")
    Signed-off-by: Martin Wilck
    Signed-off-by: Hannes Reinecke
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Martin Wilck
     
  • [ Upstream commit 9f98772ba307dd89a3d17dc2589f213d3972fc64 ]

    commit fe35ec58f0d3 ("block: update hctx map when use multiple maps")
    exposed an issue where we may hang trying to wait for queue freeze
    during I/O. We call blk_mq_update_nr_hw_queues which, in the case of
    multiple queue maps (which we now have for default/read/poll), attempts
    to freeze the queue. However, we never start a queue freeze when
    starting the reset, which means there are in-flight requests that
    entered the queue and that we will not complete once the queue is
    quiesced.

    So start a freeze before we quiesce the queue, and unfreeze the queue
    after we have successfully connected the I/O queues (and make sure to
    call blk_mq_update_nr_hw_queues only after we are sure that the queue
    was already frozen).

    This follows how the PCI driver handles resets.

    Fixes: fe35ec58f0d3 ("block: update hctx map when use multiple maps")
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Sagi Grimberg
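
The ordering the patch establishes can be sketched as a recorded call sequence (all function names below are mocks standing in for the blk-mq/nvme calls):

```c
/* Records the reset sequence so the required ordering is explicit.
 * Every function name here is a mock, not the real blk-mq/nvme API. */
static const char *seq[8];
static int nseq;
static void step(const char *s) { seq[nseq++] = s; }

static void reset_io_queues(void)
{
    step("start_freeze");            /* the fix: freeze before quiescing */
    step("quiesce");                 /* stop dispatching new requests */
    step("teardown_and_reconnect");  /* re-establish the I/O queues */
    step("wait_freeze");             /* all in-flight requests drained */
    step("update_nr_hw_queues");     /* safe only on a frozen queue */
    step("unfreeze");
}
```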
     
  • [ Upstream commit 2875b0aecabe2f081a8432e2bc85b85df0529490 ]

    commit fe35ec58f0d3 ("block: update hctx map when use multiple maps")
    exposed an issue where we may hang trying to wait for queue freeze
    during I/O. We call blk_mq_update_nr_hw_queues which, in the case of
    multiple queue maps (which we now have for default/read/poll), attempts
    to freeze the queue. However, we never start a queue freeze when
    starting the reset, which means there are in-flight requests that
    entered the queue and that we will not complete once the queue is
    quiesced.

    So start a freeze before we quiesce the queue, and unfreeze the queue
    after we have successfully connected the I/O queues (and make sure to
    call blk_mq_update_nr_hw_queues only after we are sure that the queue
    was already frozen).

    This follows how the PCI driver handles resets.

    Fixes: fe35ec58f0d3 ("block: update hctx map when use multiple maps")
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Sagi Grimberg
     
  • commit 5bedd3afee8eb01ccd256f0cd2cc0fa6f841417a upstream.

    Add a quirk for a device that does not support the Identify Namespace
    Identification Descriptor list despite claiming 1.3 compliance.

    Fixes: ea43d9709f72 ("nvme: fix identify error status silent ignore")
    Reported-by: Ingo Brunberg
    Signed-off-by: Christoph Hellwig
    Tested-by: Ingo Brunberg
    Reviewed-by: Sagi Grimberg
    Cc: Plamen Lyutov
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     

11 Aug, 2020

1 commit

  • [ Upstream commit 5611ec2b9814bc91f7b0a8d804c1fc152e2025d9 ]

    After commit 6e02318eaea5 ("nvme: add support for the Write Zeroes
    command"), SK hynix PC400 becomes very slow with the following error
    message:

    [ 224.567695] blk_update_request: operation not supported error, dev nvme1n1, sector 499384320 op 0x9:(WRITE_ZEROES) flags 0x1000000 phys_seg 0 prio class 0

    SK hynix PC400 has buggy firmware that treats NLB as a maximum value
    instead of a range, so the NLB passed isn't a valid value to the
    firmware.

    According to SK hynix, three commands are affected:
    - Write Zeroes
    - Compare
    - Write Uncorrectable

    Right now only Write Zeroes is implemented, so disable it completely
    on the SK hynix PC400.

    BugLink: https://bugs.launchpad.net/bugs/1872383
    Cc: kyounghwan sohn
    Signed-off-by: Kai-Heng Feng
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Kai-Heng Feng
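
A sketch of how such a quirk typically takes effect (mock structs; the flag name mirrors the kernel's NVME_QUIRK_DISABLE_WRITE_ZEROES, but this is illustrative code):

```c
#include <stdint.h>

/* Illustrative quirk handling, not the real driver structs. */
#define QUIRK_DISABLE_WRITE_ZEROES (1u << 0)

struct mock_ctrl {
    uint32_t quirks;
    uint32_t max_write_zeroes_sectors;
};

static void config_write_zeroes(struct mock_ctrl *ctrl)
{
    if (ctrl->quirks & QUIRK_DISABLE_WRITE_ZEROES)
        ctrl->max_write_zeroes_sectors = 0;    /* advertise no support */
    else
        ctrl->max_write_zeroes_sectors = 1024; /* arbitrary example limit */
}
```

With the limit set to zero, the block layer never issues Write Zeroes to the device, sidestepping the firmware bug entirely.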
     

05 Aug, 2020

1 commit

  • [ Upstream commit adc99fd378398f4c58798a1c57889872967d56a6 ]

    If the controller dies exactly while we are receiving the icresp, we
    hang because the icresp may never arrive. Make sure to set a high but
    finite limit.

    Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Sagi Grimberg
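
In userspace POSIX terms, bounding a blocking receive looks like this (the in-kernel driver sets the socket's receive timeout directly; this sketch uses setsockopt on a socketpair):

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Bound how long a blocking recv() may wait, so a dead peer cannot
 * hang us forever while we wait for a response that never arrives. */
static int set_recv_timeout(int fd, long ms)
{
    struct timeval tv = { .tv_sec  = ms / 1000,
                          .tv_usec = (ms % 1000) * 1000 };
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}
```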
     

16 Jul, 2020

1 commit

  • [ Upstream commit 032a9966a22a3596addf81dacf0c1736dfedc32a ]

    The completion vector index given during CQ creation can't exceed the
    number of vectors supported by the underlying RDMA device. This
    violation can currently occur, for example, when one tries to connect
    with N regular read/write queues and M poll queues where
    N + M > num_supported_vectors. This leads to a failure to establish a
    connection to the remote target. Instead, in that case, share a
    completion vector between queues.

    Signed-off-by: Max Gurtovoy
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Max Gurtovoy
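
Sharing vectors by wrapping the queue index is the simplest such scheme; a sketch (illustrative, not the driver's exact formula):

```c
/* Spread queues across the device's completion vectors, wrapping
 * around instead of failing when queues outnumber vectors. */
static int pick_comp_vector(int queue_idx, int num_comp_vectors)
{
    return queue_idx % num_comp_vectors;
}
```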