07 Oct, 2020

6 commits

  • commit 38adf94e166e3cb4eb89683458ca578051e8218d upstream.

    Move the quirked chunk_sectors setting to the same location as noiob
    so that the setting is registered in a single place. Since the noiob
    value is only used locally, remove the member from struct nvme_ns.

    Signed-off-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Revanth Rajashekar
    Signed-off-by: Greg Kroah-Hartman

    Keith Busch
     
  • commit e08f2ae850929d40e66268ee47e443e7ea56eeb7 upstream.

    Introduce the new helper function nvme_lba_to_sect() to convert a device
    logical block number to a 512B sector number. Use this new helper in
    obvious places, cleaning up the code.
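    The helper is a simple left shift. A minimal userspace sketch of the
    arithmetic (the kernel version takes a struct nvme_ns and reads
    ns->lba_shift; here the shift is passed directly for illustration):

```c
#include <stdint.h>

#define SECTOR_SHIFT 9  /* 512-byte block-layer sectors */

/*
 * Convert a device logical block number to a 512B sector number.
 * lba_shift is 12 for a 4096B-block namespace, so each LBA spans
 * 1 << (12 - 9) = 8 sectors.
 */
static inline uint64_t nvme_lba_to_sect(uint32_t lba_shift, uint64_t lba)
{
    return lba << (lba_shift - SECTOR_SHIFT);
}
```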

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe
    Signed-off-by: Revanth Rajashekar
    Signed-off-by: Greg Kroah-Hartman

    Damien Le Moal
     
  • commit 314d48dd224897e35ddcaf5a1d7d133b5adddeb7 upstream.

    Rename nvme_block_nr() to nvme_sect_to_lba() and use SECTOR_SHIFT
    instead of its hard-coded value 9. Also add a comment to describe this
    helper.
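    The renamed helper is the inverse conversion, a right shift. A
    userspace sketch of the arithmetic (the kernel version reads
    ns->lba_shift from struct nvme_ns; the shift is a plain parameter
    here for illustration):

```c
#include <stdint.h>

#define SECTOR_SHIFT 9  /* replaces the hard-coded 9 */

/*
 * Convert a 512B sector number to a device logical block number,
 * given the namespace's block-size shift.
 */
static inline uint64_t nvme_sect_to_lba(uint32_t lba_shift, uint64_t sector)
{
    return sector >> (lba_shift - SECTOR_SHIFT);
}
```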

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe
    Signed-off-by: Revanth Rajashekar
    Signed-off-by: Greg Kroah-Hartman

    Damien Le Moal
     
  • [ Upstream commit 9e0e8dac985d4bd07d9e62922b9d189d3ca2fccf ]

    The lldd may have made calls to delete a remote port or local port,
    and the delete may still be in progress when the CLI then attempts to
    create a new controller. Currently this proceeds without error,
    although it cannot succeed.

    Fix this by validating that both the host port and remote port are
    present when a new controller is to be created.

    Signed-off-by: James Smart
    Reviewed-by: Himanshu Madhani
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    James Smart
     
  • [ Upstream commit 50b7c24390a53c78de546215282fb52980f1d7b7 ]

    Currently we use nvmeq->q_depth as the upper limit for a valid tag in
    nvme_handle_cqe(), but this is not correct: the number of available
    tags is recorded in the tagset, and it is not equal to nvmeq->q_depth.

    The nvme driver registers interrupts for queues before initializing the
    tagset, because it uses the number of successful request_irq() calls to
    configure the tagset parameters. This allows a race condition with the
    current tag validity check if the controller happens to produce an
    interrupt with a corrupted CQE before the tagset is initialized.

    Replace the driver's indirect tag check with the one already provided by
    the block layer.
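    A toy illustration of the bound change (the values below are invented
    for the example; the actual fix simply uses the block layer's
    blk_mq_tag_to_rq(), which returns NULL for tags it never issued):

```c
#include <stdbool.h>

/*
 * Toy model: the tagset hands out fewer tags than nvmeq->q_depth
 * (NVMe keeps one queue entry unused), so a corrupted CQE carrying
 * tag q_depth - 1 passes the old check but fails the tagset bound.
 */
static bool tag_valid_old(unsigned int tag, unsigned int q_depth)
{
    return tag < q_depth;   /* wrong upper limit */
}

static bool tag_valid_new(unsigned int tag, unsigned int nr_tags)
{
    return tag < nr_tags;   /* limit recorded in the tagset */
}
```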

    Signed-off-by: Xianting Tian
    Reviewed-by: Keith Busch
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Xianting Tian
     
  • [ Upstream commit 52a3974feb1a3eec25d8836d37a508b67b0a9cd0 ]

    Get and put the reference to the ctrl in the nvme_dev_open() and
    nvme_dev_release() before and after module get/put for ctrl in char
    device file operations.

    Introduce a char_dev release function that gets/puts the controller
    and module, which allows us to fix a potential Oops that can be easily
    reproduced with a passthru ctrl (although the problem also exists with
    pure user access):

    Entering kdb (current=0xffff8887f8290000, pid 3128) on processor 30 Oops: (null)
    due to oops @ 0xffffffffa01019ad
    CPU: 30 PID: 3128 Comm: bash Tainted: G W OE 5.8.0-rc4nvme-5.9+ #35
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.4
    RIP: 0010:nvme_free_ctrl+0x234/0x285 [nvme_core]
    Code: 57 10 a0 e8 73 bf 02 e1 ba 3d 11 00 00 48 c7 c6 98 33 10 a0 48 c7 c7 1d 57 10 a0 e8 5b bf 02 e1 8
    RSP: 0018:ffffc90001d63de0 EFLAGS: 00010246
    RAX: ffffffffa05c0440 RBX: ffff8888119e45a0 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffff8888177e9550 RDI: ffff8888119e43b0
    RBP: ffff8887d4768000 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: ffffc90001d63c90 R12: ffff8888119e43b0
    R13: ffff8888119e5108 R14: dead000000000100 R15: ffff8888119e5108
    FS: 00007f1ef27b0740(0000) GS:ffff888817600000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffffffa05c0470 CR3: 00000007f6bee000 CR4: 00000000003406e0
    Call Trace:
    device_release+0x27/0x80
    kobject_put+0x98/0x170
    nvmet_passthru_ctrl_disable+0x4a/0x70 [nvmet]
    nvmet_passthru_enable_store+0x4c/0x90 [nvmet]
    configfs_write_file+0xe6/0x150
    vfs_write+0xba/0x1e0
    ksys_write+0x5f/0xe0
    do_syscall_64+0x52/0xb0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f1ef1eb2840
    Code: Bad RIP value.
    RSP: 002b:00007fffdbff0eb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f1ef1eb2840
    RDX: 0000000000000002 RSI: 00007f1ef27d2000 RDI: 0000000000000001
    RBP: 00007f1ef27d2000 R08: 000000000000000a R09: 00007f1ef27b0740
    R10: 0000000000000001 R11: 0000000000000246 R12: 00007f1ef2186400
    R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000

    With this patch fix we take the module ref count in nvme_dev_open() and
    release that ref count in newly introduced nvme_dev_release().
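    The ordering can be sketched with toy refcounts (function names mirror
    the commit message; the real code operates on struct nvme_ctrl and the
    owning module):

```c
/* Toy counters standing in for the ctrl kref and the module refcount. */
static int ctrl_refs, module_refs;

static void nvme_get_ctrl(void)  { ctrl_refs++; }
static void nvme_put_ctrl(void)  { ctrl_refs--; }
static int  try_module_get(void) { module_refs++; return 1; }
static void module_put(void)     { module_refs--; }

/* Open pins the ctrl first, then the module; on failure it unwinds. */
static int nvme_dev_open(void)
{
    nvme_get_ctrl();
    if (!try_module_get()) {
        nvme_put_ctrl();
        return -1;
    }
    return 0;
}

/* The new release callback drops both references in reverse order. */
static void nvme_dev_release(void)
{
    module_put();
    nvme_put_ctrl();
}
```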

    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Chaitanya Kulkarni
     

01 Oct, 2020

7 commits

  • [ Upstream commit af5ad17854f96a6d3c9775e776bd01ab262672a1 ]

    When NVME_TCP is enabled and CRYPTO is disabled, it results in the
    following Kbuild warning:

    WARNING: unmet direct dependencies detected for CRYPTO_CRC32C
    Depends on [n]: CRYPTO [=n]
    Selected by [y]:
    - NVME_TCP [=y] && INET [=y] && BLK_DEV_NVME [=y]

    The reason is that NVME_TCP selects CRYPTO_CRC32C without depending on or
    selecting CRYPTO while CRYPTO_CRC32C is subordinate to CRYPTO.

    Honor the kconfig menu hierarchy to remove kconfig dependency warnings.
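    The shape of the fix (a sketch, not the exact upstream hunk; the
    prompt string is abbreviated) is to make NVME_TCP select CRYPTO
    alongside CRYPTO_CRC32C:

```kconfig
config NVME_TCP
	tristate "NVM Express over Fabrics TCP host driver"
	depends on INET
	depends on BLK_DEV_NVME
	select CRYPTO
	select CRYPTO_CRC32C
```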

    Fixes: 79fd751d61aa ("nvme: tcp: selects CRYPTO_CRC32C for nvme-tcp")
    Signed-off-by: Necip Fazil Yildiran
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Necip Fazil Yildiran
     
  • [ Upstream commit 05b29021fba5e725dd385151ef00b6340229b500 ]

    Commit 3b4b19721ec652 ("nvme: fix possible deadlock when I/O is
    blocked") reverted multipath head disk revalidation due to deadlocks
    caused by holding the bd_mutex during revalidate.

    Updating the multipath disk blockdev size is still required though for
    userspace to be able to observe any resizing while the device is
    mounted. Directly update the bdev inode size to avoid unnecessarily
    holding the bdev->bd_mutex.

    Fixes: 3b4b19721ec652 ("nvme: fix possible deadlock when I/O is blocked")

    Signed-off-by: Anthony Iliopoulos
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Anthony Iliopoulos
     
  • [ Upstream commit 3b4b19721ec652ad2c4fe51dfbe5124212b5f581 ]

    Revert fab7772bfbcf ("nvme-multipath: revalidate nvme_ns_head gendisk
    in nvme_validate_ns")

    When adding a new namespace to the head disk (via nvme_mpath_set_live)
    we will see a partition scan, which triggers I/O on the mpath device
    node. This process will usually be triggered from the scan_work, which
    holds the scan_lock. If that I/O blocks (e.g. an ANA change left us
    with only paths that are present but not accessible), this can
    deadlock on the head disk bd_mutex, as both the partition scan I/O
    takes it and head disk revalidation takes it to check for resize (also
    triggered from scan_work on a different path). See trace [1].

    The mpath disk revalidation was originally added to detect online disk
    size change, but this is no longer needed since commit cb224c3af4df
    ("nvme: Convert to use set_capacity_revalidate_and_notify") which already
    updates resize info without unnecessarily revalidating the disk (the
    mpath disk doesn't even implement .revalidate_disk fop).

    [1]:
    --
    kernel: INFO: task kworker/u65:9:494 blocked for more than 241 seconds.
    kernel: Tainted: G OE 5.3.5-050305-generic #201910071830
    kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    kernel: kworker/u65:9 D 0 494 2 0x80004000
    kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
    kernel: Call Trace:
    kernel: __schedule+0x2b9/0x6c0
    kernel: schedule+0x42/0xb0
    kernel: schedule_preempt_disabled+0xe/0x10
    kernel: __mutex_lock.isra.0+0x182/0x4f0
    kernel: __mutex_lock_slowpath+0x13/0x20
    kernel: mutex_lock+0x2e/0x40
    kernel: revalidate_disk+0x63/0xa0
    kernel: __nvme_revalidate_disk+0xfe/0x110 [nvme_core]
    kernel: nvme_revalidate_disk+0xa4/0x160 [nvme_core]
    kernel: ? evict+0x14c/0x1b0
    kernel: revalidate_disk+0x2b/0xa0
    kernel: nvme_validate_ns+0x49/0x940 [nvme_core]
    kernel: ? blk_mq_free_request+0xd2/0x100
    kernel: ? __nvme_submit_sync_cmd+0xbe/0x1e0 [nvme_core]
    kernel: nvme_scan_work+0x24f/0x380 [nvme_core]
    kernel: process_one_work+0x1db/0x380
    kernel: worker_thread+0x249/0x400
    kernel: kthread+0x104/0x140
    kernel: ? process_one_work+0x380/0x380
    kernel: ? kthread_park+0x80/0x80
    kernel: ret_from_fork+0x1f/0x40
    ...
    kernel: INFO: task kworker/u65:1:2630 blocked for more than 241 seconds.
    kernel: Tainted: G OE 5.3.5-050305-generic #201910071830
    kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    kernel: kworker/u65:1 D 0 2630 2 0x80004000
    kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
    kernel: Call Trace:
    kernel: __schedule+0x2b9/0x6c0
    kernel: schedule+0x42/0xb0
    kernel: io_schedule+0x16/0x40
    kernel: do_read_cache_page+0x438/0x830
    kernel: ? __switch_to_asm+0x34/0x70
    kernel: ? file_fdatawait_range+0x30/0x30
    kernel: read_cache_page+0x12/0x20
    kernel: read_dev_sector+0x27/0xc0
    kernel: read_lba+0xc1/0x220
    kernel: ? kmem_cache_alloc_trace+0x19c/0x230
    kernel: efi_partition+0x1e6/0x708
    kernel: ? vsnprintf+0x39e/0x4e0
    kernel: ? snprintf+0x49/0x60
    kernel: check_partition+0x154/0x244
    kernel: rescan_partitions+0xae/0x280
    kernel: __blkdev_get+0x40f/0x560
    kernel: blkdev_get+0x3d/0x140
    kernel: __device_add_disk+0x388/0x480
    kernel: device_add_disk+0x13/0x20
    kernel: nvme_mpath_set_live+0x119/0x140 [nvme_core]
    kernel: nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
    kernel: nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
    kernel: nvme_parse_ana_log+0xa1/0x180 [nvme_core]
    kernel: ? nvme_update_ns_ana_state+0x60/0x60 [nvme_core]
    kernel: nvme_mpath_add_disk+0x47/0x90 [nvme_core]
    kernel: nvme_validate_ns+0x396/0x940 [nvme_core]
    kernel: ? blk_mq_free_request+0xd2/0x100
    kernel: nvme_scan_work+0x24f/0x380 [nvme_core]
    kernel: process_one_work+0x1db/0x380
    kernel: worker_thread+0x249/0x400
    kernel: kthread+0x104/0x140
    kernel: ? process_one_work+0x380/0x380
    kernel: ? kthread_park+0x80/0x80
    kernel: ret_from_fork+0x1f/0x40
    --

    Fixes: fab7772bfbcf ("nvme-multipath: revalidate nvme_ns_head gendisk in nvme_validate_ns")
    Signed-off-by: Anton Eidelman
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Sagi Grimberg
     
  • [ Upstream commit 21f9024355e58772ec5d7fc3534aa5e29d72a8b6 ]

    In case rdma accept fails at nvmet_rdma_queue_connect(), release work
    is scheduled. Later on, a new RDMA CM event may arrive, since we
    didn't destroy the cm-id, and call nvmet_rdma_queue_connect_fail(),
    which schedules another release work. This causes
    nvmet_rdma_free_queue() to be called twice. To fix this, implicitly
    destroy the cm_id by returning a non-zero code, which guarantees that
    new rdma_cm events will not arrive afterwards. Also add a qp pointer
    to the nvmet_rdma_queue structure, so we can use it when the cm_id
    pointer is NULL or has been destroyed.

    Signed-off-by: Israel Rukshin
    Suggested-by: Sagi Grimberg
    Reviewed-by: Max Gurtovoy
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Israel Rukshin
     
  • [ Upstream commit ce1518139e6976cf19c133b555083354fdb629b8 ]

    Calling nvme_sysfs_delete() when the controller is in the middle of
    creation may cause several bugs. If the controller is in the NEW
    state, we remove the delete_controller file but don't delete the
    controller; the user will then not be able to use the nvme disconnect
    command on that controller again, although the controller may be
    active. Other bugs may happen if the controller is in the middle of
    the create_ctrl callback when nvme_do_delete_ctrl() starts, for
    example freeing the I/O tagset in nvme_do_delete_ctrl() before it was
    allocated in the create_ctrl callback.

    To fix all those races don't allow the user to delete the controller
    before it was fully created.

    Signed-off-by: Israel Rukshin
    Reviewed-by: Max Gurtovoy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin

    Israel Rukshin
     
  • [ Upstream commit b780d7415aacec855e2f2370cbf98f918b224903 ]

    In case nvme_sysfs_delete() is called by the user before the ctrl
    reference count is taken, the ctrl may be freed during creation,
    causing a bug. Take the reference as soon as the controller is
    externally visible, which happens in cdev_device_add() in
    nvme_init_ctrl(). Also take the reference count at the core layer
    instead of taking it in each transport separately.

    Signed-off-by: Israel Rukshin
    Reviewed-by: Max Gurtovoy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin

    Israel Rukshin
     
  • [ Upstream commit 764e9332098c0e60251386a507fe46ac91276120 ]

    The nvme multipath error handling defaults to controller reset if the
    error is unknown. There are, however, no existing nvme status codes that
    indicate a reset should be used, and resetting causes unnecessary
    disruption to the rest of IO.

    Change nvme's error handling to first check if failover should happen.
    If not, let the normal error handling take over rather than reset the
    controller.

    Based-on-a-patch-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: John Meneghini
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin

    John Meneghini
     

23 Sep, 2020

3 commits

  • [ Upstream commit ceb1e0874dba5cbfc4e0b4145796a4bfb3716e6a ]

    Cancel the async event work in case an async event has been queued up
    and nvme_tcp_submit_async_event() runs after the event has been freed.

    Signed-off-by: David Milburn
    Reviewed-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    David Milburn
     
  • [ Upstream commit 925dd04c1f9825194b9e444c12478084813b2b5d ]

    Cancel the async event work in case an async event has been queued up
    and nvme_rdma_submit_async_event() runs after the event has been freed.

    Signed-off-by: David Milburn
    Reviewed-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    David Milburn
     
  • [ Upstream commit e126e8210e950bb83414c4f57b3120ddb8450742 ]

    Cancel the async event work in case an async event has been queued up
    and nvme_fc_submit_async_event() runs after the event has been freed.

    Signed-off-by: David Milburn
    Reviewed-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    David Milburn
     

17 Sep, 2020

11 commits

  • [ Upstream commit 7ad92f656bddff4cf8f641e0e3b1acd4eb9644cb ]

    This patch addresses an IRQ free warning and a NULL pointer
    dereference when an nvme device hits a timeout error during
    initialization. The problem happens when nvme_timeout() is called
    while nvme_reset_work() is still executing. Fix it by setting the flag
    of the problematic request to NVME_REQ_CANCELLED before calling
    nvme_dev_disable(), to make sure __nvme_submit_sync_cmd() returns an
    error code and nvme_submit_sync_cmd() fails gracefully.
    The following is the console output.

    [ 62.472097] nvme nvme0: I/O 13 QID 0 timeout, disable controller
    [ 62.488796] nvme nvme0: could not set timestamp (881)
    [ 62.494888] ------------[ cut here ]------------
    [ 62.495142] Trying to free already-free IRQ 11
    [ 62.495366] WARNING: CPU: 0 PID: 7 at kernel/irq/manage.c:1751 free_irq+0x1f7/0x370
    [ 62.495742] Modules linked in:
    [ 62.495902] CPU: 0 PID: 7 Comm: kworker/u4:0 Not tainted 5.8.0+ #8
    [ 62.496206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-p4
    [ 62.496772] Workqueue: nvme-reset-wq nvme_reset_work
    [ 62.497019] RIP: 0010:free_irq+0x1f7/0x370
    [ 62.497223] Code: e8 ce 49 11 00 48 83 c4 08 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 44 89 f6 48 c70
    [ 62.498133] RSP: 0000:ffffa96800043d40 EFLAGS: 00010086
    [ 62.498391] RAX: 0000000000000000 RBX: ffff9b87fc458400 RCX: 0000000000000000
    [ 62.498741] RDX: 0000000000000001 RSI: 0000000000000096 RDI: ffffffff9693d72c
    [ 62.499091] RBP: ffff9b87fd4c8f60 R08: ffffa96800043bfd R09: 0000000000000163
    [ 62.499440] R10: ffffa96800043bf8 R11: ffffa96800043bfd R12: ffff9b87fd4c8e00
    [ 62.499790] R13: ffff9b87fd4c8ea4 R14: 000000000000000b R15: ffff9b87fd76b000
    [ 62.500140] FS: 0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
    [ 62.500534] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 62.500816] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
    [ 62.501165] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 62.501515] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 62.501864] Call Trace:
    [ 62.501993] pci_free_irq+0x13/0x20
    [ 62.502167] nvme_reset_work+0x5d0/0x12a0
    [ 62.502369] ? update_load_avg+0x59/0x580
    [ 62.502569] ? ttwu_queue_wakelist+0xa8/0xc0
    [ 62.502780] ? try_to_wake_up+0x1a2/0x450
    [ 62.502979] process_one_work+0x1d2/0x390
    [ 62.503179] worker_thread+0x45/0x3b0
    [ 62.503361] ? process_one_work+0x390/0x390
    [ 62.503568] kthread+0xf9/0x130
    [ 62.503726] ? kthread_park+0x80/0x80
    [ 62.503911] ret_from_fork+0x22/0x30
    [ 62.504090] ---[ end trace de9ed4a70f8d71e2 ]---
    [ 123.912275] nvme nvme0: I/O 12 QID 0 timeout, disable controller
    [ 123.914670] nvme nvme0: 1/0/0 default/read/poll queues
    [ 123.916310] BUG: kernel NULL pointer dereference, address: 0000000000000000
    [ 123.917469] #PF: supervisor write access in kernel mode
    [ 123.917725] #PF: error_code(0x0002) - not-present page
    [ 123.917976] PGD 0 P4D 0
    [ 123.918109] Oops: 0002 [#1] SMP PTI
    [ 123.918283] CPU: 0 PID: 7 Comm: kworker/u4:0 Tainted: G W 5.8.0+ #8
    [ 123.918650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-p4
    [ 123.919219] Workqueue: nvme-reset-wq nvme_reset_work
    [ 123.919469] RIP: 0010:__blk_mq_alloc_map_and_request+0x21/0x80
    [ 123.919757] Code: 66 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 63 ee 53 48 8b 47 68 89 ee 48 89 fb 8b4
    [ 123.920657] RSP: 0000:ffffa96800043d40 EFLAGS: 00010286
    [ 123.920912] RAX: ffff9b87fc4fee40 RBX: ffff9b87fc8cb008 RCX: 0000000000000000
    [ 123.921258] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b87fc618000
    [ 123.921602] RBP: 0000000000000000 R08: ffff9b87fdc2c4a0 R09: ffff9b87fc616000
    [ 123.921949] R10: 0000000000000000 R11: ffff9b87fffd1500 R12: 0000000000000000
    [ 123.922295] R13: 0000000000000000 R14: ffff9b87fc8cb200 R15: ffff9b87fc8cb000
    [ 123.922641] FS: 0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
    [ 123.923032] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 123.923312] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
    [ 123.923660] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 123.924007] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 123.924353] Call Trace:
    [ 123.924479] blk_mq_alloc_tag_set+0x137/0x2a0
    [ 123.924694] nvme_reset_work+0xed6/0x12a0
    [ 123.924898] process_one_work+0x1d2/0x390
    [ 123.925099] worker_thread+0x45/0x3b0
    [ 123.925280] ? process_one_work+0x390/0x390
    [ 123.925486] kthread+0xf9/0x130
    [ 123.925642] ? kthread_park+0x80/0x80
    [ 123.925825] ret_from_fork+0x22/0x30
    [ 123.926004] Modules linked in:
    [ 123.926158] CR2: 0000000000000000
    [ 123.926322] ---[ end trace de9ed4a70f8d71e3 ]---
    [ 123.926549] RIP: 0010:__blk_mq_alloc_map_and_request+0x21/0x80
    [ 123.926832] Code: 66 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 63 ee 53 48 8b 47 68 89 ee 48 89 fb 8b4
    [ 123.927734] RSP: 0000:ffffa96800043d40 EFLAGS: 00010286
    [ 123.927989] RAX: ffff9b87fc4fee40 RBX: ffff9b87fc8cb008 RCX: 0000000000000000
    [ 123.928336] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b87fc618000
    [ 123.928679] RBP: 0000000000000000 R08: ffff9b87fdc2c4a0 R09: ffff9b87fc616000
    [ 123.929025] R10: 0000000000000000 R11: ffff9b87fffd1500 R12: 0000000000000000
    [ 123.929370] R13: 0000000000000000 R14: ffff9b87fc8cb200 R15: ffff9b87fc8cb000
    [ 123.929715] FS: 0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
    [ 123.930106] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 123.930384] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
    [ 123.930731] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 123.931077] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

    Co-developed-by: Keith Busch
    Signed-off-by: Tong Zhang
    Reviewed-by: Keith Busch
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Tong Zhang
     
  • [ Upstream commit 2362acb6785611eda795bfc12e1ea6b202ecf62c ]

    If the controller becomes unresponsive in the middle of a reset, we
    will hang because we are waiting for the freeze to complete, but that
    cannot happen since we have commands inflight holding the
    q_usage_counter, and we can't blindly fail requests that time out.

    So give it a timeout, and if we cannot wait for the queue freeze
    before unfreezing, fail and let the error handling decide how to
    proceed (either schedule a reconnect or remove the controller).

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Sagi Grimberg
     
  • [ Upstream commit 0475a8dcbcee92a5d22e40c9c6353829fc6294b8 ]

    When a request times out in a LIVE state, we simply trigger error
    recovery and let the error recovery handle the request cancellation,
    however when a request times out in a non LIVE state, we make sure to
    complete it immediately as it might block controller setup or teardown
    and prevent forward progress.

    However, tearing down the entire set of I/O and admin queues causes a
    freeze/unfreeze imbalance (q->mq_freeze_depth) and is really overkill
    for what we actually need, which is just to fence controller teardown
    that may be running, stop the queue, and cancel the request if it is
    not already completed.

    Now that we have the controller teardown_lock, we can safely serialize
    request cancellation. This addresses a hang caused by calling extra
    queue freeze on controller namespaces, causing unfreeze to not complete
    correctly.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Smart
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Sagi Grimberg
     
  • [ Upstream commit 5110f40241d08334375eb0495f174b1d2c07657e ]

    In the timeout handler we may need to complete a request because the
    request that timed out may be an I/O that is a part of a serial sequence
    of controller teardown or initialization. In order to complete the
    request, we need to fence any other context that may compete with us
    and complete the request that is timing out.

    In this case, we could have a potential double completion in case
    a hard-irq or a different competing context triggered error recovery
    and is running inflight request cancellation concurrently with the
    timeout handler.

    Protect using a ctrl teardown_lock to serialize contexts that may
    complete a cancelled request due to error recovery or a reset.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Smart
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Sagi Grimberg
     
  • [ Upstream commit e5c01f4f7f623e768e868bcc08d8e7ceb03b75d0 ]

    If the controller becomes unresponsive in the middle of a reset, we
    will hang because we are waiting for the freeze to complete, but that
    cannot happen since we have commands inflight holding the
    q_usage_counter, and we can't blindly fail requests that time out.

    So give it a timeout, and if we cannot wait for the queue freeze
    before unfreezing, fail and let the error handling decide how to
    proceed (either schedule a reconnect or remove the controller).

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Sagi Grimberg
     
  • [ Upstream commit 236187c4ed195161dfa4237c7beffbba0c5ae45b ]

    When a request times out in a LIVE state, we simply trigger error
    recovery and let the error recovery handle the request cancellation,
    however when a request times out in a non LIVE state, we make sure to
    complete it immediately as it might block controller setup or teardown
    and prevent forward progress.

    However, tearing down the entire set of I/O and admin queues causes a
    freeze/unfreeze imbalance (q->mq_freeze_depth) and is really overkill
    for what we actually need, which is just to fence controller teardown
    that may be running, stop the queue, and cancel the request if it is
    not already completed.

    Now that we have the controller teardown_lock, we can safely serialize
    request cancellation. This addresses a hang caused by calling extra
    queue freeze on controller namespaces, causing unfreeze to not complete
    correctly.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Sagi Grimberg
     
  • [ Upstream commit d4d61470ae48838f49e668503e840e1520b97162 ]

    In the timeout handler we may need to complete a request because the
    request that timed out may be an I/O that is a part of a serial sequence
    of controller teardown or initialization. In order to complete the
    request, we need to fence any other context that may compete with us
    and complete the request that is timing out.

    In this case, we could have a potential double completion in case
    a hard-irq or a different competing context triggered error recovery
    and is running inflight request cancellation concurrently with the
    timeout handler.

    Protect using a ctrl teardown_lock to serialize contexts that may
    complete a cancelled request due to error recovery or a reset.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Sagi Grimberg
     
  • [ Upstream commit 7cf0d7c0f3c3b0203aaf81c1bc884924d8fdb9bd ]

    Users can detect whether the wait has completed and take appropriate
    action based on this information (e.g. whether to continue
    initialization or to fail and schedule another initialization
    attempt).

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Sagi Grimberg
     
  • [ Upstream commit d7144f5c4cf4de95fdc3422943cf51c06aeaf7a7 ]

    NVME_CTRL_NEW should never see any I/O, because in order to start
    initialization it has to transition to NVME_CTRL_CONNECTING and from
    there it will never return to this state.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Sagi Grimberg
     
  • [ Upstream commit a6ce7d7b4adaebc27ee7e78e5ecc378a1cfc221d ]

    When handling commands without in-capsule data, we assign the ttag
    assuming we already have the queue commands array allocated (based
    on the queue size information in the connect data payload). However,
    if the connect itself did not send the connect data in-capsule, we
    have yet to allocate the queue commands, and we will assign a bogus
    ttag and suffer a NULL dereference when we receive the corresponding
    h2cdata PDU.

    Fix this by checking whether we have already allocated commands before
    dereferencing them when handling h2cdata; if we haven't, this is
    certainly a connect, and we should use the preallocated connect
    command.
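    The check can be sketched with simplified stand-ins for the queue and
    command structures (names are illustrative, not the nvmet-tcp
    definitions):

```c
#include <stddef.h>

struct cmd { int ttag; };

struct queue {
    struct cmd *cmds;        /* allocated only after connect data arrives */
    size_t      nr_cmds;
    struct cmd  connect_cmd; /* preallocated for the connect itself */
};

/*
 * On an h2cdata PDU, only index into the commands array if it has been
 * allocated; otherwise this must be the connect, so use the preallocated
 * connect command. A ttag outside the array is rejected.
 */
static struct cmd *lookup_cmd(struct queue *q, int ttag)
{
    if (!q->cmds)
        return &q->connect_cmd;
    if (ttag < 0 || (size_t)ttag >= q->nr_cmds)
        return NULL;  /* bogus ttag */
    return &q->cmds[ttag];
}
```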

    Signed-off-by: Ziye Yang
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Ziye Yang
     
  • [ Upstream commit 73a5379937ec89b91e907bb315e2434ee9696a2c ]

    Right now we are failing requests based on the controller state (which
    is checked inline in nvmf_check_ready); however, we should definitely
    accept requests if the queue is live.

    When entering controller reset, we transition the controller into
    NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for non-mpath
    requests (have blk_noretry_request set).

    This is also the case for NVME_REQ_USER for the wrong reason. There
    shouldn't be any reason for us to reject this I/O in a controller reset.
    We do want to prevent passthru commands on the admin queue because we
    need the controller to fully initialize first before we let user passthru
    admin commands to be issued.

    In a non-mpath setup, this means that the requests will simply be
    requeued over and over forever not allowing the q_usage_counter to drop
    its final reference, causing controller reset to hang if running
    concurrently with heavy I/O.

    Fixes: 35897b920c8a ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready")
    Reviewed-by: James Smart
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Sagi Grimberg
     

10 Sep, 2020

3 commits

  • [ Upstream commit 192f6c29bb28bfd0a17e6ad331d09f1ec84143d0 ]

    If the driver has to unbind from the controller for an early failure
    before the subsystem has been set up, there won't be a subsystem holding
    the controller's instance, so the controller needs to free its own
    instance in this case.

    Fixes: 733e4b69d508d ("nvme: Assign subsys instance from first ctrl")
    Signed-off-by: Keith Busch
    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Keith Busch
     
  • [ Upstream commit 70e37988db94aba607d5491a94f80ba08e399b6b ]

    The use of 'spin_lock()' and 'spin_lock_irqsave()' is inconsistent
    in this function.

    Use 'spin_lock_irqsave()' here as well, as there is no guarantee that
    interrupts are disabled at that point, according to the surrounding
    code.

    Fixes: a97ec51b37ef ("nvmet_fc: Rework target side abort handling")
    Signed-off-by: Christophe JAILLET
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Sasha Levin

    Christophe JAILLET
     
  • [ Upstream commit 0d3b6a8d213a30387b5104b2fb25376d18636f23 ]

    Per the NVMe spec, when the keep-alive timeout is set to zero, the
    keep-alive timer should be disabled.
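    The rule reduces to a single check; a hedged sketch with an
    illustrative helper name (not a kernel function):

```c
#include <stdbool.h>

/*
 * Per the spec, a keep-alive timeout (KATO) of zero means keep-alive
 * is disabled, so no timer should be armed.
 */
static bool keep_alive_enabled(unsigned int kato)
{
    return kato != 0;
}
```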

    Signed-off-by: Amit Engel
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Amit Engel
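
A minimal sketch of the intended behavior, using mock structs rather than the real nvmet types:

```c
#include <stdbool.h>

/* Illustrative sketch, not the real nvmet structs: kato is the
 * keep-alive timeout (in seconds) negotiated by the host. */
struct mock_ctrl {
    unsigned int kato;
    bool timer_armed;
};

static void start_keep_alive(struct mock_ctrl *ctrl)
{
    /* Per the spec, a zero timeout means keep-alive is disabled,
     * so never arm the timer in that case. */
    if (ctrl->kato == 0)
        return;
    ctrl->timer_armed = true;
}
```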
     

03 Sep, 2020

2 commits

  • [ Upstream commit 93eb0381e13d249a18ed4aae203291ff977e7ffb ]

    If there's only one usable, non-optimized path, nvme_round_robin_path()
    returns NULL, which is wrong. Fix it by falling back to "old", like in
    the single optimized path case. Also, if the active path isn't changed,
    there's no need to re-assign the pointer.

    Fixes: 3f6e3246db0e ("nvme-multipath: fix logic for non-optimized paths")
    Signed-off-by: Martin Wilck
    Signed-off-by: Martin George
    Reviewed-by: Keith Busch
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Martin Wilck
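
The fixed selection logic can be sketched as follows (a simplified mock; the real code walks an SRCU-protected namespace list):

```c
#include <stddef.h>

/* Simplified model of round-robin path selection; names are illustrative. */
enum state { OPTIMIZED, NONOPT, UNUSABLE };

struct path { enum state state; };

static struct path *round_robin(struct path **paths, size_t n,
                                struct path *old)
{
    struct path *found = NULL;

    for (size_t i = 0; i < n; i++) {
        if (paths[i]->state == OPTIMIZED)
            return paths[i];          /* best case: take it immediately */
        if (paths[i]->state == NONOPT && !found)
            found = paths[i];         /* remember first usable fallback */
    }
    /* The fix: with no optimized path, fall back to the current path
     * instead of returning NULL when nothing better was found. */
    return found ? found : old;
}
```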
     
  • [ Upstream commit f34448cd0dc697723fb5f4118f8431d9233b370d ]

    On an error exit path, a negative error code should be returned
    instead of a positive return value.

    Fixes: e399441de9115 ("nvme-fabrics: Add host support for FC transport")
    Cc: James Smart
    Signed-off-by: Tianjia Zhang
    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Tianjia Zhang
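
The kernel convention this restores is that error paths return a negative errno, which callers test with ret < 0. A minimal illustration with a hypothetical helper:

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: returns 0 on success, -ENOMEM on failure.
 * Returning a positive value on the error path would make the usual
 * "if (ret < 0)" checks in callers silently miss the error. */
static int setup_buffer(void **buf, size_t len)
{
    *buf = malloc(len);
    if (!*buf)
        return -ENOMEM;   /* negative errno, per kernel convention */
    return 0;
}
```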
     

19 Aug, 2020

5 commits

  • [ Upstream commit fbd6a42d8932e172921c7de10468a2e12c34846b ]

    When nvme_round_robin_path() finds a valid namespace we should be
    using it; falling back to __nvme_find_path() for non-optimized paths
    would cause its result to be ignored.

    Fixes: 75c10e732724 ("nvme-multipath: round-robin I/O policy")
    Signed-off-by: Martin Wilck
    Signed-off-by: Hannes Reinecke
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Hannes Reinecke
     
  • [ Upstream commit 3f6e3246db0e6f92e784965d9d0edb8abe6c6b74 ]

    Handle the special case where we have exactly one optimized path,
    which we should keep using in this case.

    Fixes: 75c10e732724 ("nvme-multipath: round-robin I/O policy")
    Signed-off-by: Martin Wilck
    Signed-off-by: Hannes Reinecke
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Martin Wilck
     
  • [ Upstream commit 9f98772ba307dd89a3d17dc2589f213d3972fc64 ]

    commit fe35ec58f0d3 ("block: update hctx map when use multiple maps")
    exposed an issue where we may hang trying to wait for queue freeze
    during I/O. We call blk_mq_update_nr_hw_queues which, in the case of
    multiple queue maps (which we now have for default/read/poll), attempts
    to freeze the queue. However, we never start a queue freeze when
    starting the reset, which means there are in-flight requests that
    entered the queue and that we will not complete once the queue is
    quiesced.

    So start a freeze before we quiesce the queue, and unfreeze the queue
    after we have successfully connected the I/O queues (and make sure to
    call blk_mq_update_nr_hw_queues only after we are sure that the queue
    was already frozen).

    This follows how the PCI driver handles resets.

    Fixes: fe35ec58f0d3 ("block: update hctx map when use multiple maps")
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Sagi Grimberg
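
The ordering the patch establishes can be sketched as a recorded call sequence (all function names below are mocks standing in for the blk-mq/nvme calls):

```c
/* Records the reset sequence so the required ordering is explicit.
 * Every function name here is a mock, not the real blk-mq/nvme API. */
static const char *seq[8];
static int nseq;
static void step(const char *s) { seq[nseq++] = s; }

static void reset_io_queues(void)
{
    step("start_freeze");            /* the fix: freeze before quiescing */
    step("quiesce");                 /* stop dispatching new requests */
    step("teardown_and_reconnect");  /* re-establish the I/O queues */
    step("wait_freeze");             /* all in-flight requests drained */
    step("update_nr_hw_queues");     /* safe only on a frozen queue */
    step("unfreeze");
}
```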
     
  • [ Upstream commit 2875b0aecabe2f081a8432e2bc85b85df0529490 ]

    commit fe35ec58f0d3 ("block: update hctx map when use multiple maps")
    exposed an issue where we may hang trying to wait for queue freeze
    during I/O. We call blk_mq_update_nr_hw_queues which, in the case of
    multiple queue maps (which we now have for default/read/poll), attempts
    to freeze the queue. However, we never start a queue freeze when
    starting the reset, which means there are in-flight requests that
    entered the queue and that we will not complete once the queue is
    quiesced.

    So start a freeze before we quiesce the queue, and unfreeze the queue
    after we have successfully connected the I/O queues (and make sure to
    call blk_mq_update_nr_hw_queues only after we are sure that the queue
    was already frozen).

    This follows how the PCI driver handles resets.

    Fixes: fe35ec58f0d3 ("block: update hctx map when use multiple maps")
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Sagi Grimberg
     
  • commit 5bedd3afee8eb01ccd256f0cd2cc0fa6f841417a upstream.

    Add a quirk for a device that does not support the Identify Namespace
    Identification Descriptor list despite claiming 1.3 compliance.

    Fixes: ea43d9709f72 ("nvme: fix identify error status silent ignore")
    Reported-by: Ingo Brunberg
    Signed-off-by: Christoph Hellwig
    Tested-by: Ingo Brunberg
    Reviewed-by: Sagi Grimberg
    Cc: Plamen Lyutov
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     

11 Aug, 2020

1 commit

  • [ Upstream commit 5611ec2b9814bc91f7b0a8d804c1fc152e2025d9 ]

    After commit 6e02318eaea5 ("nvme: add support for the Write Zeroes
    command"), SK hynix PC400 becomes very slow with the following error
    message:

    [ 224.567695] blk_update_request: operation not supported error, dev nvme1n1, sector 499384320 op 0x9:(WRITE_ZEROES) flags 0x1000000 phys_seg 0 prio class 0

    SK hynix PC400 has buggy firmware that treats NLB as a maximum value
    instead of a range, so the NLB passed isn't a valid value to the
    firmware.

    According to SK hynix, three commands are affected:
    - Write Zeroes
    - Compare
    - Write Uncorrectable

    Right now only Write Zeroes is implemented, so disable it completely
    on the SK hynix PC400.

    BugLink: https://bugs.launchpad.net/bugs/1872383
    Cc: kyounghwan sohn
    Signed-off-by: Kai-Heng Feng
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Kai-Heng Feng
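
A sketch of how such a quirk typically takes effect (mock structs; the flag name mirrors the kernel's NVME_QUIRK_DISABLE_WRITE_ZEROES, but this is illustrative code):

```c
#include <stdint.h>

/* Illustrative quirk handling, not the real driver structs. */
#define QUIRK_DISABLE_WRITE_ZEROES (1u << 0)

struct mock_ctrl {
    uint32_t quirks;
    uint32_t max_write_zeroes_sectors;
};

static void config_write_zeroes(struct mock_ctrl *ctrl)
{
    if (ctrl->quirks & QUIRK_DISABLE_WRITE_ZEROES)
        ctrl->max_write_zeroes_sectors = 0;    /* advertise no support */
    else
        ctrl->max_write_zeroes_sectors = 1024; /* arbitrary example limit */
}
```

With the limit set to zero, the block layer never issues Write Zeroes to the device, sidestepping the firmware bug entirely.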
     

05 Aug, 2020

1 commit

  • [ Upstream commit adc99fd378398f4c58798a1c57889872967d56a6 ]

    If the controller dies exactly while we are receiving the icresp, we
    hang because the icresp may never arrive. Make sure to set a high but
    finite limit.

    Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Sagi Grimberg
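
In userspace POSIX terms, bounding a blocking receive looks like this (the in-kernel driver sets the socket's receive timeout directly; this sketch uses setsockopt on a socketpair):

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Bound how long a blocking recv() may wait, so a dead peer cannot
 * hang us forever while we wait for a response that never arrives. */
static int set_recv_timeout(int fd, long ms)
{
    struct timeval tv = { .tv_sec  = ms / 1000,
                          .tv_usec = (ms % 1000) * 1000 };
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}
```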
     

16 Jul, 2020

1 commit

  • [ Upstream commit 032a9966a22a3596addf81dacf0c1736dfedc32a ]

    The completion vector index given during CQ creation can't exceed the
    number of vectors supported by the underlying RDMA device. This
    violation can currently occur, for example, when one tries to connect
    with N regular read/write queues and M poll queues where
    N + M > num_supported_vectors. This leads to a failure to establish a
    connection to the remote target. Instead, in that case, share a
    completion vector between queues.

    Signed-off-by: Max Gurtovoy
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Max Gurtovoy
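
Sharing vectors by wrapping the queue index is the simplest such scheme; a sketch (illustrative, not the driver's exact formula):

```c
/* Spread queues across the device's completion vectors, wrapping
 * around instead of failing when queues outnumber vectors. */
static int pick_comp_vector(int queue_idx, int num_comp_vectors)
{
    return queue_idx % num_comp_vectors;
}
```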