02 Sep, 2020

2 commits

  • blk-iocost calls blk_stat_enable_accounting() while holding an irqsafe
    lock, which triggers a lockdep splat because q->stats->lock isn't irqsafe.
    Let's make it irqsafe.

    Signed-off-by: Tejun Heo
    Fixes: cd006509b0a9 ("blk-iocost: account for IO size when testing latencies")
    Cc: stable@vger.kernel.org # v5.8+
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • ioc_pd_free() grabs the irq-safe ioc->lock without ensuring that irqs
    are disabled, even though it can be called with irqs either disabled or
    enabled. This has a small chance of causing A-A deadlocks and triggers
    lockdep splats. Use irqsave operations instead; a generic sketch of the
    idiom follows this entry.

    Signed-off-by: Tejun Heo
    Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
    Cc: stable@vger.kernel.org # v5.4+
    Signed-off-by: Jens Axboe

    Tejun Heo
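
    Both iocost fixes above are instances of the same rule: a lock that can
    be taken from contexts where interrupts may be either on or off must use
    the irqsave variants. A generic sketch of the idiom (illustrative names,
    not the actual blk-iocost code):

    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(example_lock);

    /*
     * May be called with interrupts enabled or disabled, so the irqsave
     * variant is required: it saves the caller's interrupt state in
     * `flags` and restores exactly that state on unlock, instead of
     * unconditionally re-enabling interrupts.
     */
    static void example_free_path(void)
    {
            unsigned long flags;

            spin_lock_irqsave(&example_lock, flags);
            /* ... touch state also reachable from irq context ... */
            spin_unlock_irqrestore(&example_lock, flags);
    }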
     

01 Sep, 2020

3 commits

  • We need to hold the whole-device bd_mutex to protect against another
    thread concurrently deleting our partition before we get to it, and thus
    causing a use-after-free (the locking pattern is sketched after this
    entry).

    Fixes: cddae808aeb7 ("block: pass a hd_struct to delete_partition")
    Reported-by: syzbot+6448f3c229bc52b82f69@syzkaller.appspotmail.com
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
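
    A simplified sketch of the pattern the fix describes: hold the whole-disk
    bd_mutex across the partition lookup and deletion so a concurrent deleter
    cannot free the partition underneath us (the real code lives in the block
    layer and involves more infrastructure than shown):

    /* Sketch: bdev is the whole-disk block device. */
    static int example_del_partition(struct block_device *bdev, int partno)
    {
            struct hd_struct *part;
            int ret = 0;

            mutex_lock(&bdev->bd_mutex);    /* fences concurrent deleters */
            part = disk_get_part(bdev->bd_disk, partno);
            if (!part) {
                    ret = -ENXIO;
                    goto out_unlock;
            }
            delete_partition(bdev->bd_disk, part);  /* still under bd_mutex */
            disk_put_part(part);
    out_unlock:
            mutex_unlock(&bdev->bd_mutex);
            return ret;
    }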
     
  • Commit e8c7d14ac6c3 ("block: revert back to synchronous request_queue removal")
    stopped releasing the request queue from wq context because that commit
    assumed all blk_put_queue() calls happen in a context that is allowed to
    sleep. However, this assumption isn't true: we release the disk's
    reference in the partition's percpu_ref ->release() callback, which is
    not allowed to sleep because ->release() runs via call_rcu().

    Fix this issue by moving the disk reference put into hd_struct_free_work()
    (the deferral pattern is sketched after this entry).

    Fixes: e8c7d14ac6c3 ("block: revert back to synchronous request_queue removal")
    Reported-by: Ilya Dryomov
    Signed-off-by: Ming Lei
    Tested-by: Ilya Dryomov
    Reviewed-by: Christoph Hellwig
    Cc: Luis Chamberlain
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Signed-off-by: Jens Axboe

    Ming Lei
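
    The underlying rule: a percpu_ref ->release() callback runs from
    call_rcu(), i.e. atomic context, so anything that might sleep (such as
    dropping the final disk reference) must be punted to process context. A
    generic sketch of that deferral (struct and function names are
    illustrative, not the actual hd_struct code):

    #include <linux/percpu-refcount.h>
    #include <linux/workqueue.h>
    #include <linux/genhd.h>
    #include <linux/slab.h>

    struct example_obj {
            struct percpu_ref ref;
            struct gendisk *disk;
            struct work_struct free_work;
    };

    /* Runs from a workqueue, where sleeping is allowed. */
    static void example_free_work(struct work_struct *work)
    {
            struct example_obj *obj =
                    container_of(work, struct example_obj, free_work);

            put_disk(obj->disk);    /* final put may release the queue: may sleep */
            kfree(obj);
    }

    /* Runs via call_rcu(): must not sleep, so only schedule the work. */
    static void example_ref_release(struct percpu_ref *ref)
    {
            struct example_obj *obj =
                    container_of(ref, struct example_obj, ref);

            INIT_WORK(&obj->free_work, example_free_work);
            queue_work(system_wq, &obj->free_work);
    }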
     
  • If a driver leaves the limit settings as the defaults, then we don't
    initialize bdi->io_pages. This means that file systems may need to
    work around bdi->io_pages == 0, which is somewhat messy.

    Initialize the default value just like we do for ->ra_pages (sketched
    after this entry).

    Cc: stable@vger.kernel.org
    Fixes: 9491ae4aade6 ("mm: don't cap request size based on read-ahead setting")
    Reported-by: OGAWA Hirofumi
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
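
    The fix amounts to giving ->io_pages the same readahead-based default
    that ->ra_pages already gets at setup time. A fragment-level sketch,
    assuming the VM_READAHEAD_PAGES default the readahead code uses:

    bdi->ra_pages = VM_READAHEAD_PAGES;     /* existing default */
    bdi->io_pages = VM_READAHEAD_PAGES;     /* new: never leave it 0 */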
     

30 Aug, 2020

1 commit

  • Pull NVMe fixes from Sagi:

    "- instance leak and io boundary fixes from Keith
    - fc locking fix from Christophe
    - various tcp/rdma reset during traffic fixes from Me
    - pci use-after-free fix from Tong
    - tcp target null deref fix from Ziye"

    * 'nvme-5.9-rc' of git://git.infradead.org/nvme:
    nvme-pci: cancel nvme device request before disabling
    nvme: only use power of two io boundaries
    nvme: fix controller instance leak
    nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'
    nvme: Fix NULL dereference for pci nvme controllers
    nvme-rdma: fix reset hang if controller died in the middle of a reset
    nvme-rdma: fix timeout handler
    nvme-rdma: serialize controller teardown sequences
    nvme-tcp: fix reset hang if controller died in the middle of a reset
    nvme-tcp: fix timeout handler
    nvme-tcp: serialize controller teardown sequences
    nvme: have nvme_wait_freeze_timeout return if it timed out
    nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
    nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu

    Jens Axboe
     

29 Aug, 2020

14 commits

  • This patch addresses an irq free warning and a NULL pointer dereference
    that occur when an nvme device hits a timeout error during
    initialization. The problem happens when nvme_timeout() is called while
    nvme_reset_work() is still executing. Fix it by setting the problematic
    request's flag to NVME_REQ_CANCELLED before calling nvme_dev_disable(),
    to make sure __nvme_submit_sync_cmd() returns an error code and
    nvme_submit_sync_cmd() fails gracefully. The following is the console
    output:

    [ 62.472097] nvme nvme0: I/O 13 QID 0 timeout, disable controller
    [ 62.488796] nvme nvme0: could not set timestamp (881)
    [ 62.494888] ------------[ cut here ]------------
    [ 62.495142] Trying to free already-free IRQ 11
    [ 62.495366] WARNING: CPU: 0 PID: 7 at kernel/irq/manage.c:1751 free_irq+0x1f7/0x370
    [ 62.495742] Modules linked in:
    [ 62.495902] CPU: 0 PID: 7 Comm: kworker/u4:0 Not tainted 5.8.0+ #8
    [ 62.496206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-p4
    [ 62.496772] Workqueue: nvme-reset-wq nvme_reset_work
    [ 62.497019] RIP: 0010:free_irq+0x1f7/0x370
    [ 62.497223] Code: e8 ce 49 11 00 48 83 c4 08 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 44 89 f6 48 c70
    [ 62.498133] RSP: 0000:ffffa96800043d40 EFLAGS: 00010086
    [ 62.498391] RAX: 0000000000000000 RBX: ffff9b87fc458400 RCX: 0000000000000000
    [ 62.498741] RDX: 0000000000000001 RSI: 0000000000000096 RDI: ffffffff9693d72c
    [ 62.499091] RBP: ffff9b87fd4c8f60 R08: ffffa96800043bfd R09: 0000000000000163
    [ 62.499440] R10: ffffa96800043bf8 R11: ffffa96800043bfd R12: ffff9b87fd4c8e00
    [ 62.499790] R13: ffff9b87fd4c8ea4 R14: 000000000000000b R15: ffff9b87fd76b000
    [ 62.500140] FS: 0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
    [ 62.500534] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 62.500816] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
    [ 62.501165] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 62.501515] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 62.501864] Call Trace:
    [ 62.501993] pci_free_irq+0x13/0x20
    [ 62.502167] nvme_reset_work+0x5d0/0x12a0
    [ 62.502369] ? update_load_avg+0x59/0x580
    [ 62.502569] ? ttwu_queue_wakelist+0xa8/0xc0
    [ 62.502780] ? try_to_wake_up+0x1a2/0x450
    [ 62.502979] process_one_work+0x1d2/0x390
    [ 62.503179] worker_thread+0x45/0x3b0
    [ 62.503361] ? process_one_work+0x390/0x390
    [ 62.503568] kthread+0xf9/0x130
    [ 62.503726] ? kthread_park+0x80/0x80
    [ 62.503911] ret_from_fork+0x22/0x30
    [ 62.504090] ---[ end trace de9ed4a70f8d71e2 ]---
    [ 123.912275] nvme nvme0: I/O 12 QID 0 timeout, disable controller
    [ 123.914670] nvme nvme0: 1/0/0 default/read/poll queues
    [ 123.916310] BUG: kernel NULL pointer dereference, address: 0000000000000000
    [ 123.917469] #PF: supervisor write access in kernel mode
    [ 123.917725] #PF: error_code(0x0002) - not-present page
    [ 123.917976] PGD 0 P4D 0
    [ 123.918109] Oops: 0002 [#1] SMP PTI
    [ 123.918283] CPU: 0 PID: 7 Comm: kworker/u4:0 Tainted: G W 5.8.0+ #8
    [ 123.918650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-p4
    [ 123.919219] Workqueue: nvme-reset-wq nvme_reset_work
    [ 123.919469] RIP: 0010:__blk_mq_alloc_map_and_request+0x21/0x80
    [ 123.919757] Code: 66 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 63 ee 53 48 8b 47 68 89 ee 48 89 fb 8b4
    [ 123.920657] RSP: 0000:ffffa96800043d40 EFLAGS: 00010286
    [ 123.920912] RAX: ffff9b87fc4fee40 RBX: ffff9b87fc8cb008 RCX: 0000000000000000
    [ 123.921258] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b87fc618000
    [ 123.921602] RBP: 0000000000000000 R08: ffff9b87fdc2c4a0 R09: ffff9b87fc616000
    [ 123.921949] R10: 0000000000000000 R11: ffff9b87fffd1500 R12: 0000000000000000
    [ 123.922295] R13: 0000000000000000 R14: ffff9b87fc8cb200 R15: ffff9b87fc8cb000
    [ 123.922641] FS: 0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
    [ 123.923032] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 123.923312] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
    [ 123.923660] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 123.924007] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 123.924353] Call Trace:
    [ 123.924479] blk_mq_alloc_tag_set+0x137/0x2a0
    [ 123.924694] nvme_reset_work+0xed6/0x12a0
    [ 123.924898] process_one_work+0x1d2/0x390
    [ 123.925099] worker_thread+0x45/0x3b0
    [ 123.925280] ? process_one_work+0x390/0x390
    [ 123.925486] kthread+0xf9/0x130
    [ 123.925642] ? kthread_park+0x80/0x80
    [ 123.925825] ret_from_fork+0x22/0x30
    [ 123.926004] Modules linked in:
    [ 123.926158] CR2: 0000000000000000
    [ 123.926322] ---[ end trace de9ed4a70f8d71e3 ]---
    [ 123.926549] RIP: 0010:__blk_mq_alloc_map_and_request+0x21/0x80
    [ 123.926832] Code: 66 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 63 ee 53 48 8b 47 68 89 ee 48 89 fb 8b4
    [ 123.927734] RSP: 0000:ffffa96800043d40 EFLAGS: 00010286
    [ 123.927989] RAX: ffff9b87fc4fee40 RBX: ffff9b87fc8cb008 RCX: 0000000000000000
    [ 123.928336] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b87fc618000
    [ 123.928679] RBP: 0000000000000000 R08: ffff9b87fdc2c4a0 R09: ffff9b87fc616000
    [ 123.929025] R10: 0000000000000000 R11: ffff9b87fffd1500 R12: 0000000000000000
    [ 123.929370] R13: 0000000000000000 R14: ffff9b87fc8cb200 R15: ffff9b87fc8cb000
    [ 123.929715] FS: 0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
    [ 123.930106] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 123.930384] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
    [ 123.930731] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 123.931077] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

    Co-developed-by: Keith Busch
    Signed-off-by: Tong Zhang
    Reviewed-by: Keith Busch
    Signed-off-by: Sagi Grimberg

    Tong Zhang
     
  • The kernel requires a power of two for boundaries because that's the
    only way it can efficiently split commands that cross them. A
    controller, however, may report a non-power of two boundary.

    The driver had been rounding the controller's value to one the kernel
    can use, but splitting on the wrong boundary provides no benefit on the
    device side, and incurs additional submission overhead from non-optimal
    splits.

    Don't provide any boundary hint if the controller's value can't be used,
    and log a warning when first scanning a disk whose reported IO boundary
    can't be used. Since the chunk sector logic has grown, move it to a
    separate function (sketched after this entry).

    Cc: Martin K. Petersen
    Signed-off-by: Keith Busch
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Sagi Grimberg

    Keith Busch
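
    A condensed, hedged sketch of the boundary logic after the change
    (function and field names simplified from the nvme driver):

    #include <linux/log2.h>

    static void example_set_chunk_sectors(struct nvme_ns *ns, u32 iob)
    {
            if (!iob)
                    return;

            if (!is_power_of_2(iob)) {
                    /*
                     * Splitting on a rounded boundary helps nothing on
                     * the device side and costs extra splits, so warn
                     * and give the block layer no hint at all.
                     */
                    pr_warn("%s: ignoring unaligned IO boundary: %u\n",
                            ns->disk->disk_name, iob);
                    return;
            }

            blk_queue_chunk_sectors(ns->queue, iob);
    }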
     
  • If the driver has to unbind from the controller for an early failure
    before the subsystem has been set up, there won't be a subsystem holding
    the controller's instance, so the controller needs to free its own
    instance in this case.

    Fixes: 733e4b69d508d ("nvme: Assign subsys instance from first ctrl")
    Signed-off-by: Keith Busch
    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg

    Keith Busch
     
  • The way 'spin_lock()' and 'spin_lock_irqsave()' are used is not consistent
    in this function.

    Use 'spin_lock_irqsave()' here as well, as there is no guarantee that
    interrupts are disabled at that point, according to the surrounding code.

    Fixes: a97ec51b37ef ("nvmet_fc: Rework target side abort handling")
    Signed-off-by: Christophe JAILLET
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg

    Christophe JAILLET
     
  • PCIe controllers do not have fabric opts; verify they exist before
    showing the ctrl_loss_tmo or reconnect_delay attributes.

    Fixes: 764075fdcb2f ("nvme: expose reconnect_delay and ctrl_loss_tmo via sysfs")
    Reported-by: Tobias Markus
    Reviewed-by: Keith Busch
    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
     
  • If the controller becomes unresponsive in the middle of a reset, we
    will hang because we are waiting for the freeze to complete, but that
    cannot happen since we have inflight commands holding the
    q_usage_counter, and we can't blindly fail requests that time out.

    So give the wait a timeout, and if the queue freeze cannot complete in
    time, unfreeze, fail, and have the error handling take care of how to
    proceed (either schedule a reconnect or remove the controller); a
    fragment of this pattern is sketched after this entry.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
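
    A sketch of the bounded wait this describes, assuming the
    nvme_wait_freeze_timeout() helper (changed elsewhere in this pull to
    report whether the freeze completed); this is a reset-path fragment, not
    the verbatim driver code:

    nvme_start_queues(ctrl);
    if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
            /*
             * The controller died mid-reset: inflight commands pin
             * q_usage_counter, so the freeze can never complete.  Fail
             * the reset and let error handling schedule a reconnect or
             * remove the controller.
             */
            ret = -ENODEV;
            goto out_failed;
    }
    nvme_unfreeze(ctrl);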
     
  • When a request times out in a LIVE state, we simply trigger error
    recovery and let the error recovery handle the request cancellation.
    However, when a request times out in a non-LIVE state, we make sure to
    complete it immediately, as it might block controller setup or teardown
    and prevent forward progress.

    Tearing down the entire set of I/O and admin queues causes a
    freeze/unfreeze imbalance (q->mq_freeze_depth) and is really overkill
    for what we actually need, which is to just fence controller teardown
    that may be running, stop the queue, and cancel the request if it is
    not already completed.

    Now that we have the controller teardown_lock, we can safely serialize
    request cancellation. This addresses a hang caused by calling an extra
    queue freeze on controller namespaces, causing unfreeze to not complete
    correctly.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Smart
    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
     
  • In the timeout handler we may need to complete a request because the
    request that timed out may be an I/O that is a part of a serial sequence
    of controller teardown or initialization. In order to complete the
    request, we need to fence any other context that may compete with us
    and complete the request that is timing out.

    In this case, we could have a potential double completion in case
    a hard-irq or a different competing context triggered error recovery
    and is running inflight request cancellation concurrently with the
    timeout handler.

    Protect using a per-controller teardown_lock to serialize contexts that
    may complete a cancelled request due to error recovery or a reset (a
    sketch follows this entry).

    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Smart
    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
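
    A sketch of the serialization the previous two entries rely on: every
    path that may cancel or complete inflight requests (teardown, error
    recovery, the timeout handler) takes the same per-controller mutex, so a
    request can only be completed once (structure and names simplified from
    the descriptions above):

    #include <linux/blk-mq.h>
    #include <linux/mutex.h>

    struct example_ctrl {
            struct mutex teardown_lock;     /* fences teardown vs. timeout */
            /* ... */
    };

    static void example_teardown_io_queues(struct example_ctrl *ctrl)
    {
            mutex_lock(&ctrl->teardown_lock);
            /* ... stop queues and cancel inflight requests ... */
            mutex_unlock(&ctrl->teardown_lock);
    }

    /* Timeout handler path for a request that expired while not LIVE. */
    static void example_complete_timed_out(struct example_ctrl *ctrl,
                                           struct request *rq)
    {
            mutex_lock(&ctrl->teardown_lock);
            /* Complete rq only if teardown has not already done so. */
            if (blk_mq_request_started(rq) && !blk_mq_request_completed(rq))
                    blk_mq_complete_request(rq);
            mutex_unlock(&ctrl->teardown_lock);
    }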
     
  • If the controller becomes unresponsive in the middle of a reset, we will
    hang because we are waiting for the freeze to complete, but that cannot
    happen since we have inflight commands holding the q_usage_counter, and
    we can't blindly fail requests that time out.

    So give the wait a timeout, and if the queue freeze cannot complete in
    time, unfreeze, fail, and have the error handling take care of how to
    proceed (either schedule a reconnect or remove the controller).

    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
     
  • When a request times out in a LIVE state, we simply trigger error
    recovery and let the error recovery handle the request cancellation.
    However, when a request times out in a non-LIVE state, we make sure to
    complete it immediately, as it might block controller setup or teardown
    and prevent forward progress.

    Tearing down the entire set of I/O and admin queues causes a
    freeze/unfreeze imbalance (q->mq_freeze_depth) and is really overkill
    for what we actually need, which is to just fence controller teardown
    that may be running, stop the queue, and cancel the request if it is
    not already completed.

    Now that we have the controller teardown_lock, we can safely serialize
    request cancellation. This addresses a hang caused by calling an extra
    queue freeze on controller namespaces, causing unfreeze to not complete
    correctly.

    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
     
  • In the timeout handler we may need to complete a request because the
    request that timed out may be an I/O that is a part of a serial sequence
    of controller teardown or initialization. In order to complete the
    request, we need to fence any other context that may compete with us
    and complete the request that is timing out.

    In this case, we could have a potential double completion in case
    a hard-irq or a different competing context triggered error recovery
    and is running inflight request cancellation concurrently with the
    timeout handler.

    Protect using a ctrl teardown_lock to serialize contexts that may
    complete a cancelled request due to error recovery or a reset.

    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
     
  • Users can detect whether the wait has completed or not and take
    appropriate action based on this information (e.g. whether to continue
    initialization or rather fail and schedule another initialization
    attempt).

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
     
  • NVME_CTRL_NEW should never see any I/O, because in order to start
    initialization it has to transition to NVME_CTRL_CONNECTING and from
    there it will never return to this state.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
     
  • When handling commands without in-capsule data, we assign the ttag
    assuming we already have the queue's command array allocated (based
    on the queue size information in the connect data payload). However,
    if the connect itself did not send the connect data in-capsule, we
    have yet to allocate the queue commands, so we will assign a bogus
    ttag and suffer a NULL dereference when we receive the corresponding
    h2cdata pdu.

    Fix this by checking whether we have already allocated commands before
    dereferencing the array when handling h2cdata; if we haven't, it is for
    sure a connect, and we should use the preallocated connect command (see
    the sketch after this entry).

    Signed-off-by: Ziye Yang
    Signed-off-by: Sagi Grimberg

    Ziye Yang
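
    A hedged sketch of the guard: if the h2cdata arrives before the queue's
    command array exists, the only command it can belong to is the
    preallocated connect command (field names follow the description,
    details simplified from the nvmet-tcp code):

    static struct nvmet_tcp_cmd *
    example_lookup_cmd(struct nvmet_tcp_queue *queue, u16 ttag)
    {
            /*
             * No command array yet: the connect did not carry in-capsule
             * data, so this data must be for the connect command itself.
             */
            if (unlikely(!queue->nr_cmds))
                    return &queue->connect;

            return &queue->cmds[ttag];
    }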
     

26 Aug, 2020

2 commits

  • The device size calculation was done before processing the loop
    configuration, which meant that we set the size on the underlying
    block device incorrectly in case lo_offset/lo_sizelimit were set in the
    configuration. Delay computing the size until we've set up the device
    parameters correctly (the ordering is sketched after this entry).

    Fixes: 3448914e8cc5 ("loop: Add LOOP_CONFIGURE ioctl")
    Reported-by: Lennart Poettering
    Tested-by: Yang Xu
    Signed-off-by: Martijn Coenen
    Signed-off-by: Jens Axboe

    Martijn Coenen
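
    The fix is purely an ordering change. Sketched below with assumed helper
    names (example_*); only lo_offset/lo_sizelimit are real loop fields:

    static int example_loop_configure(struct loop_device *lo,
                                      const struct loop_config *cfg)
    {
            int err;

            /* Apply the configuration first: sets lo_offset/lo_sizelimit. */
            err = example_apply_params(lo, cfg);
            if (err)
                    return err;

            /*
             * Only now compute the visible size (capacity minus offset,
             * capped by sizelimit) and push it to the block device.
             */
            example_set_size(lo, example_calculate_size(lo));
            return 0;
    }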
     
  • Suppose we configure the io timeout of nbd0 to 100s. Later, after we
    have finished using it, we configure nbd0 again and set the io timeout
    to 0. We expect it to time out after 30 seconds and keep retrying, but
    in fact we cannot change the timeout when we set it to 0: the timeout
    is still the original 100s.

    So change the timeout to the default 30s when it is set to zero (see
    the sketch after this entry). This also matches the behavior of commit
    2da22da57348 ("nbd: fix zero cmd timeout handling v2").

    This becomes more important if we reconfigure an nbd device with the io
    timeout set to zero, because it could take 30s to detect the new socket,
    and thus io could be completed more quickly compared to 100s.

    Signed-off-by: Hou Pu
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Hou Pu
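
    A sketch of the resulting behavior, modeled on the nbd timeout setter
    (30s being the default mentioned above):

    static void example_nbd_set_cmd_timeout(struct nbd_device *nbd, u64 timeout)
    {
            nbd->tag_set.timeout = timeout * HZ;
            if (timeout)
                    blk_queue_rq_timeout(nbd->disk->queue, timeout * HZ);
            else
                    /* zero now restores the default instead of keeping
                     * whatever timeout was configured before */
                    blk_queue_rq_timeout(nbd->disk->queue, 30 * HZ);
    }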
     

22 Aug, 2020

16 commits

  • REQ_FUA should be checked using rq->cmd_flags instead of req_op() (the
    distinction is sketched after this entry).

    Fixes: deb78b419dfda ("nullb: emulate cache")
    Signed-off-by: Hou Pu
    Signed-off-by: Jens Axboe

    Hou Pu
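
    The distinction in a two-line sketch: req_op() masks out everything but
    the operation bits, so it can never see REQ_FUA, which lives in the flag
    bits of cmd_flags:

    /* Wrong: the op field never contains flag bits, so always false. */
    if (req_op(rq) & REQ_FUA) { /* ... */ }

    /* Right: FUA is a request flag, test it on cmd_flags. */
    if (rq->cmd_flags & REQ_FUA) { /* ... */ }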
     
  • Per the NVMe spec, when the keep-alive timeout is set to zero the
    keep-alive timer should be disabled (sketched after this entry).

    Signed-off-by: Amit Engel
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Amit Engel
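
    Sketched, per the spec behavior described (names modeled on the nvmet
    keep-alive code, simplified):

    static void example_start_keep_alive(struct nvmet_ctrl *ctrl)
    {
            if (!ctrl->kato)
                    return; /* KATO of 0h: keep-alive disabled, no timer */

            schedule_delayed_work(&ctrl->ka_work, ctrl->kato * HZ);
    }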
     
  • If a command sent through nvme-multipath failed on a dying queue, resend
    it on another path.

    Signed-off-by: Chao Leng
    [hch: rebased on top of the completion refactoring]
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Mike Snitzer
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Chao Leng
     
  • Check the SCT sub-field for a path related status instead of enumerating
    individual status codes. As of NVMe 1.4 this adds "Internal Path Error"
    and "Controller Pathing Error" to the list, but it also future-proofs
    against additional status codes added to the category (the check is
    sketched after this entry).

    Suggested-by: Chao Leng
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mike Snitzer
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Christoph Hellwig
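
    The check reduces to masking the Status Code Type field. A sketch
    matching the description (per the NVMe spec, SCT occupies bits 10:8 of
    the 16-bit status, and type 3h is "path related status"):

    static inline bool example_is_path_error(u16 status)
    {
            /*
             * SCT == 0x3 covers every path related status code,
             * including ones added by future spec revisions.
             */
            return (status & 0x700) == 0x300;
    }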
     
  • Lift all the code that decides the disposition of a completed command
    out of nvme_complete_rq and nvme_failover_req into a new helper, which
    returns an enum of the potential actions. nvme_complete_rq then just
    switches on those and calls the proper helper for the action (the shape
    of this refactoring is sketched after this entry).

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mike Snitzer
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Christoph Hellwig
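
    A sketch of the refactoring's shape: a pure decision helper returning an
    action enum, switched on by the completion path (names and the exact
    decision logic are simplified from the description):

    enum example_disposition {
            EXAMPLE_COMPLETE,       /* finish the request with its status */
            EXAMPLE_RETRY,          /* requeue on the same path */
            EXAMPLE_FAILOVER,       /* multipath: resend on another path */
    };

    static enum example_disposition example_decide(struct request *req)
    {
            if (likely(nvme_req(req)->status == 0))
                    return EXAMPLE_COMPLETE;
            if (blk_noretry_request(req) ||
                nvme_req(req)->retries >= nvme_max_retries)
                    return EXAMPLE_COMPLETE;
            if (example_is_path_error(nvme_req(req)->status))
                    return EXAMPLE_FAILOVER; /* see the SCT sketch above */
            return EXAMPLE_RETRY;
    }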
     
  • nvme_end_request is a bit misnamed, as it wraps around the
    blk_mq_complete_* API. Its semantics are also non-trivial, so give it
    a more descriptive name and add a comment explaining the semantics.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Mike Snitzer
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Zoned block devices reuse the chunk_sectors queue limit to define zone
    boundaries. If such a device happens to also report an optimal
    boundary, do not use that to define the chunk_sectors, as that may
    intermittently interfere with io splitting and zone size queries.

    Signed-off-by: Keith Busch
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • All operations are based on the controller page size, not the host page
    size. Switch the dma pool to use the controller page size as well to
    avoid massive overallocations on large page size systems (sketched after
    this entry).

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Christoph Hellwig
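
    Sketched as a fragment (NVME_CTRL_PAGE_SIZE is the controller page size
    macro; the call mirrors the pci driver's pool setup):

    /*
     * Size PRP list allocations by what the device actually uses, not
     * the host PAGE_SIZE, which can be 64k on some architectures and
     * would waste most of each allocation.
     */
    dev->prp_page_pool = dma_pool_create("prp list page", dev->dev,
                                         NVME_CTRL_PAGE_SIZE,
                                         NVME_CTRL_PAGE_SIZE, 0);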
     
  • Recently nvme_dev.q_depth was changed from an int to a u16 type.

    This falls over for the queue depth calculation in nvme_pci_enable(),
    where NVME_CAP_MQES(dev->ctrl.cap) + 1 may overflow as a u16, as
    NVME_CAP_MQES() is itself a 16-bit value. That happens for me, and this
    is the result:

    root@ubuntu:/home/john# [148.272996] Unable to handle kernel NULL pointer
    dereference at virtual address 0000000000000010
    Mem abort info:
    ESR = 0x96000004
    EC = 0x25: DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    Data abort info:
    ISV = 0, ISS = 0x00000004
    CM = 0, WnR = 0
    user pgtable: 4k pages, 48-bit VAs, pgdp=00000a27bf3c9000
    [0000000000000010] pgd=0000000000000000, p4d=0000000000000000
    Internal error: Oops: 96000004 [#1] PREEMPT SMP
    Modules linked in: nvme nvme_core
    CPU: 56 PID: 256 Comm: kworker/u195:0 Not tainted
    5.8.0-next-20200812 #27
    Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 -
    V1.16.01 03/15/2019
    Workqueue: nvme-reset-wq nvme_reset_work [nvme]
    pstate: 80c00009 (Nzcv daif +PAN +UAO BTYPE=--)
    pc : __sg_alloc_table_from_pages+0xec/0x238
    lr : __sg_alloc_table_from_pages+0xc8/0x238
    sp : ffff800013ccbad0
    x29: ffff800013ccbad0 x28: ffff0a27b3d380a8
    x27: 0000000000000000 x26: 0000000000002dc2
    x25: 0000000000000dc0 x24: 0000000000000000
    x23: 0000000000000000 x22: ffff800013ccbbe8
    x21: 0000000000000010 x20: 0000000000000000
    x19: 00000000fffff000 x18: ffffffffffffffff
    x17: 00000000000000c0 x16: fffffe289eaf6380
    x15: ffff800011b59948 x14: ffff002bc8fe98f8
    x13: ff00000000000000 x12: ffff8000114ca000
    x11: 0000000000000000 x10: ffffffffffffffff
    x9 : ffffffffffffffc0 x8 : ffff0a27b5f9b6a0
    x7 : 0000000000000000 x6 : 0000000000000001
    x5 : ffff0a27b5f9b680 x4 : 0000000000000000
    x3 : ffff0a27b5f9b680 x2 : 0000000000000000
    x1 : 0000000000000001 x0 : 0000000000000000
    Call trace:
    __sg_alloc_table_from_pages+0xec/0x238
    sg_alloc_table_from_pages+0x18/0x28
    iommu_dma_alloc+0x474/0x678
    dma_alloc_attrs+0xd8/0xf0
    nvme_alloc_queue+0x114/0x160 [nvme]
    nvme_reset_work+0xb34/0x14b4 [nvme]
    process_one_work+0x1e8/0x360
    worker_thread+0x44/0x478
    kthread+0x150/0x158
    ret_from_fork+0x10/0x34
    Code: f94002c3 6b01017f 540007c2 11000486 (f8645aa5)
    ---[ end trace 89bb2b72d59bf925 ]---

    Fix by making the calculation use a u32 (the overflow is sketched after
    this entry).

    Also use u32 for nvme_queue.q_depth, as we assign this value from
    nvme_dev.q_depth, and nvme_dev.q_depth will possibly hold 65536 - this
    avoids the same crash as above.

    Fixes: 61f3b8963097 ("nvme-pci: use unsigned for io queue depth")
    Signed-off-by: John Garry
    Reviewed-by: Keith Busch
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    John Garry
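
    The arithmetic at fault, sketched: MQES is a 16-bit CAP field holding
    the maximum queue entries minus one, so the "+ 1" can wrap a u16 to
    zero:

    u16 mqes = 0xffff;              /* largest possible MQES value */
    u16 bad_depth = mqes + 1;       /* promoted to 65536, truncated to 0 */
    u32 good_depth = (u32)mqes + 1; /* 65536, as intended */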
     
  • When locking the ctrl->lock spinlock, IRQs need to be disabled to avoid
    a deadlock. The recently added spin_lock() calls produce the following
    lockdep warning when running the blktests test nvme/003:

    ================================
    WARNING: inconsistent lock state
    --------------------------------
    inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    ksoftirqd/2/22 [HC0[0]:SC1[1]:HE0:SE0] takes:
    ffff888276a8c4c0 (&ctrl->lock){+.?.}-{2:2}, at: nvme_keep_alive_end_io+0x50/0xc0
    {SOFTIRQ-ON-W} state was registered at:
    lock_acquire+0x164/0x500
    _raw_spin_lock+0x28/0x40
    nvme_get_effects_log+0x37/0x1c0
    nvme_init_identify+0x9e4/0x14f0
    nvme_reset_work+0xadd/0x2360
    process_one_work+0x66b/0xb70
    worker_thread+0x6e/0x6c0
    kthread+0x1e7/0x210
    ret_from_fork+0x22/0x30
    irq event stamp: 1449221
    hardirqs last enabled at (1449220): [] ktime_get+0xf9/0x140
    hardirqs last disabled at (1449221): [] _raw_spin_lock_irqsave+0x25/0x60
    softirqs last enabled at (1449210): [] __do_softirq+0x447/0x595
    softirqs last disabled at (1449215): [] run_ksoftirqd+0x35/0x50

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&ctrl->lock);
    <Interrupt>
      lock(&ctrl->lock);

    *** DEADLOCK ***

    no locks held by ksoftirqd/2/22.

    stack backtrace:
    CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 5.8.0-rc4-eid-vmlocalyes-dbg-00157-g7236657c6b3a #1450
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014
    Call Trace:
    dump_stack+0xc8/0x11a
    print_usage_bug.cold.63+0x235/0x23e
    mark_lock+0xa9c/0xcf0
    __lock_acquire+0xd9a/0x2b50
    lock_acquire+0x164/0x500
    _raw_spin_lock_irqsave+0x40/0x60
    nvme_keep_alive_end_io+0x50/0xc0
    blk_mq_end_request+0x158/0x210
    nvme_complete_rq+0x146/0x500
    nvme_loop_complete_rq+0x26/0x30 [nvme_loop]
    blk_done_softirq+0x187/0x1e0
    __do_softirq+0x118/0x595
    run_ksoftirqd+0x35/0x50
    smpboot_thread_fn+0x1d3/0x310
    kthread+0x1e7/0x210
    ret_from_fork+0x22/0x30

    Fixes: be93e87e7802 ("nvme: support for multiple Command Sets Supported and Effects log pages")
    Signed-off-by: Logan Gunthorpe
    Reviewed-by: Keith Busch
    Tested-by: Chaitanya Kulkarni
    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Logan Gunthorpe
     
  • Instead of calling blk_put_request(), which calls blk_mq_free_request(),
    call blk_mq_free_request() directly for NVMeOF passthru. This mainly
    avoids an extra function call in the completion path,
    nvmet_passthru_req_done().

    Signed-off-by: Chaitanya Kulkarni
    Reviewed-by: Keith Busch
    Reviewed-by: Logan Gunthorpe
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
     
  • In the existing NVMeOF passthru command handling, on failure of
    nvme_alloc_request() we error out with the rq value set to NULL. The
    error handling path then calls blk_put_request() without checking
    whether rq is NULL, which produces the following Oops:

    [ 1457.346861] BUG: kernel NULL pointer dereference, address: 0000000000000000
    [ 1457.347838] #PF: supervisor read access in kernel mode
    [ 1457.348464] #PF: error_code(0x0000) - not-present page
    [ 1457.349085] PGD 0 P4D 0
    [ 1457.349402] Oops: 0000 [#1] SMP NOPTI
    [ 1457.349851] CPU: 18 PID: 10782 Comm: kworker/18:2 Tainted: G OE 5.8.0-rc4nvme-5.9+ #35
    [ 1457.350951] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e3214
    [ 1457.352347] Workqueue: events nvme_loop_execute_work [nvme_loop]
    [ 1457.353062] RIP: 0010:blk_mq_free_request+0xe/0x110
    [ 1457.353651] Code: 3f ff ff ff 83 f8 01 75 0d 4c 89 e7 e8 1b db ff ff e9 2d ff ff ff 0f 0b eb ef 66 8
    [ 1457.355975] RSP: 0018:ffffc900035b7de0 EFLAGS: 00010282
    [ 1457.356636] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
    [ 1457.357526] RDX: ffffffffa060bd05 RSI: 0000000000000000 RDI: 0000000000000000
    [ 1457.358416] RBP: 0000000000000037 R08: 0000000000000000 R09: 0000000000000000
    [ 1457.359317] R10: 0000000000000000 R11: 000000000000006d R12: 0000000000000000
    [ 1457.360424] R13: ffff8887ffa68600 R14: 0000000000000000 R15: ffff8888150564c8
    [ 1457.361322] FS: 0000000000000000(0000) GS:ffff888814600000(0000) knlGS:0000000000000000
    [ 1457.362337] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1457.363058] CR2: 0000000000000000 CR3: 000000081c0ac000 CR4: 00000000003406e0
    [ 1457.363973] Call Trace:
    [ 1457.364296] nvmet_passthru_execute_cmd+0x150/0x2c0 [nvmet]
    [ 1457.364990] process_one_work+0x24e/0x5a0
    [ 1457.365493] ? __schedule+0x353/0x840
    [ 1457.365957] worker_thread+0x3c/0x380
    [ 1457.366426] ? process_one_work+0x5a0/0x5a0
    [ 1457.366948] kthread+0x135/0x150
    [ 1457.367362] ? kthread_create_on_node+0x60/0x60
    [ 1457.367934] ret_from_fork+0x22/0x30
    [ 1457.368388] Modules linked in: nvme_loop(OE) nvmet(OE) nvme_fabrics(OE) null_blk nvme(OE) nvme_corer
    [ 1457.368414] ata_piix crc32c_intel virtio_pci libata virtio_ring serio_raw t10_pi virtio floppy dm_]
    [ 1457.380849] CR2: 0000000000000000
    [ 1457.381288] ---[ end trace c6cab61bfd1f68fd ]---
    [ 1457.381861] RIP: 0010:blk_mq_free_request+0xe/0x110
    [ 1457.382469] Code: 3f ff ff ff 83 f8 01 75 0d 4c 89 e7 e8 1b db ff ff e9 2d ff ff ff 0f 0b eb ef 66 8
    [ 1457.384749] RSP: 0018:ffffc900035b7de0 EFLAGS: 00010282
    [ 1457.385393] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
    [ 1457.386264] RDX: ffffffffa060bd05 RSI: 0000000000000000 RDI: 0000000000000000
    [ 1457.387142] RBP: 0000000000000037 R08: 0000000000000000 R09: 0000000000000000
    [ 1457.388029] R10: 0000000000000000 R11: 000000000000006d R12: 0000000000000000
    [ 1457.388914] R13: ffff8887ffa68600 R14: 0000000000000000 R15: ffff8888150564c8
    [ 1457.389798] FS: 0000000000000000(0000) GS:ffff888814600000(0000) knlGS:0000000000000000
    [ 1457.390796] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1457.391508] CR2: 0000000000000000 CR3: 000000081c0ac000 CR4: 00000000003406e0
    [ 1457.392525] Kernel panic - not syncing: Fatal exception
    [ 1457.394138] Kernel Offset: disabled
    [ 1457.394677] ---[ end Kernel panic - not syncing: Fatal exception ]---

    Fix this Oops by adding a new goto label out_put_req and reordering the
    blk_put_request() call so that blk_put_request() is never called with rq
    set to NULL; the rest of the code is updated accordingly (the resulting
    label ladder is sketched after this entry).

    Fixes: 06b7164dfdc0 ("nvmet: add passthru code to process commands")
    Signed-off-by: Chaitanya Kulkarni
    Reviewed-by: Logan Gunthorpe
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
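
    A sketch of the resulting unwind order; every example_* name is an
    assumed helper, and the point is that blk_mq_free_request() is only
    reachable once rq is valid:

    static void example_execute_cmd(struct nvmet_req *req)
    {
            struct nvme_ns *ns;
            struct request *rq;
            u16 status;

            ns = example_get_ns(req);               /* assumed helper */
            if (!ns) {
                    status = NVME_SC_INVALID_NS | NVME_SC_DNR;
                    goto out;
            }

            rq = nvme_alloc_request(ns->queue, req->cmd, 0, NVME_QID_ANY);
            if (IS_ERR(rq)) {
                    status = NVME_SC_INTERNAL;
                    goto out_put_ns;        /* rq invalid: must not free it */
            }

            if (example_map_data(req, rq)) {        /* assumed helper */
                    status = NVME_SC_INTERNAL;
                    goto out_put_req;       /* rq valid: safe to free */
            }

            /* ... submit ... */
            return;

    out_put_req:
            blk_mq_free_request(rq);
    out_put_ns:
            nvme_put_ns(ns);
    out:
            nvmet_req_complete(req, status);
    }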
     
  • In the current implementation, before submitting the passthru command we
    may encounter errors, e.g. when getting the ns from the passthru
    controller or allocating a request from the passthru controller. All of
    the failure cases use a single goto label, fail_out.

    In the target code, we follow the pattern of having a separate label for
    each error-out case when setting up multiple things before the actual
    action.

    This patch follows the same pattern, renames the generic fail_out label
    to out_put_ns, and updates the error-out cases in
    nvmet_passthru_execute_cmd() where needed.

    Signed-off-by: Chaitanya Kulkarni
    Reviewed-by: Logan Gunthorpe
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
     
  • If we find an optimized path, we quit the loop immediately. Thus we can
    use just one variable for the next path, slightly simplifying the code.

    Signed-off-by: Martin Wilck
    Reviewed-by: Keith Busch
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin Wilck
     
  • If there's only one usable, non-optimized path, nvme_round_robin_path()
    returns NULL, which is wrong. Fix it by falling back to "old", like in
    the single optimized path case. Also, if the active path isn't changed,
    there's no need to re-assign the pointer (see the sketch after this
    entry).

    Fixes: 3f6e3246db0e ("nvme-multipath: fix logic for non-optimized paths")
    Signed-off-by: Martin Wilck
    Signed-off-by: Martin George
    Reviewed-by: Keith Busch
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin Wilck
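
    Sketched as the tail of the path selection loop (head->current_path is
    the real multipath field; the surrounding loop is omitted):

    if (!found)
            found = old;    /* lone usable non-optimized path: keep it */
    if (found != old)       /* skip a pointless pointer re-assignment */
            rcu_assign_pointer(head->current_path[node], found);
    return found;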
     
  • On an error exit path, a negative error code should be returned
    instead of a positive return value.

    Fixes: e399441de9115 ("nvme-fabrics: Add host support for FC transport")
    Cc: James Smart
    Signed-off-by: Tianjia Zhang
    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Tianjia Zhang