20 Jan, 2021

8 commits

  • commit ada831772188192243f9ea437c46e37e97a5975d upstream.

    We shouldn't call smp_processor_id() in a preemptible
    context, but the check here is advisory at best, so call
    __smp_processor_id() instead (a minimal sketch of the pattern follows this entry).

    Fixes: db5ad6b7f8cd ("nvme-tcp: try to send request in queue_rq context")
    Reported-by: Or Gerlitz
    Reported-by: Yi Zhang
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg
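
    A minimal sketch of the pattern, assuming a hypothetical io_cpu field used
    purely as a locality hint (raw_smp_processor_id() is the generic accessor
    that skips the preemption check; the commit itself names __smp_processor_id()):

        #include <linux/smp.h>

        /* Sketch only: the CPU id is used as an advisory hint, so the
         * unchecked accessor avoids the CONFIG_DEBUG_PREEMPT warning that
         * smp_processor_id() emits in a preemptible context. */
        static bool queue_is_local(int io_cpu)
        {
                return io_cpu == raw_smp_processor_id();
        }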
     
  • commit ca1ff67d0fb14f39cf0cc5102b1fbcc3b14f6fb9 upstream.

    When bios merge, we can get a request that spans multiple
    bios, and the overall request payload size is the sum of
    all of them. When we calculate how much we need to send
    from the current bio (and bvec), we did not take the
    iov_iter byte count cap into account.

    Since the introduction of multipage bvec support, a bvec can be split
    in the middle, which means that when we account for the last bvec send
    we should also apply the iov_iter byte count cap, as it may be
    lower than the last bvec size (see the sketch after this entry).

    Reported-by: Hao Wang
    Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
    Tested-by: Hao Wang
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg
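
    A minimal sketch of the capping described above (names hypothetical;
    iov_iter_count() returns the bytes the iterator still accounts for):

        #include <linux/bvec.h>
        #include <linux/kernel.h>
        #include <linux/uio.h>

        /* How much of the current bvec to send: never more than what the
         * iov_iter byte count still allows, even if the bvec is larger. */
        static size_t bvec_send_len(const struct bio_vec *bv,
                                    struct iov_iter *iter,
                                    size_t sent_in_bvec)
        {
                return min_t(size_t, bv->bv_len - sent_in_bvec,
                             iov_iter_count(iter));
        }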
     
  • commit 5ab25a32cd90ce561ac28b9302766e565d61304c upstream.

    Discovery controllers usually don't support the smart log page command,
    so when we connect to a discovery controller we see this warning:
    nvme nvme0: Failed to read smart log (error 24577)
    nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 192.168.123.1:8009
    nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"

    Introduce a new helper that tells whether the controller is a discovery
    controller and use it to skip nvme_init_hwmon (also use it in the
    other places where we check whether the controller is a discovery
    controller); a sketch of such a helper follows this entry.

    Fixes: 400b6a7b13a3 ("nvme: Add hardware monitoring support")
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg
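
    A sketch of such a helper, roughly following the fabrics options the core
    already carries (discovery_nqn is set when connecting to the well-known
    discovery NQN):

        static inline bool nvme_discovery_ctrl(struct nvme_ctrl *ctrl)
        {
                return ctrl->opts && ctrl->opts->discovery_nqn;
        }

    The hwmon/smart-log setup (and the other call sites mentioned above) can
    then simply return early when this helper is true.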
     
  • commit 7a84665619bb5da8c8b6517157875a1fd7632014 upstream.

    When setting the port traddr to INADDR_ANY, the listening cm_id->device
    is NULL. The associated IB device is known only when a connect request
    event arrives, so checking the T10-PI device capability should be done
    at that stage.

    Fixes: b09160c3996c ("nvmet-rdma: add metadata/T10-PI support")
    Signed-off-by: Israel Rukshin
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Max Gurtovoy
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Israel Rukshin
     
  • [ Upstream commit 19fce0470f05031e6af36e49ce222d0f0050d432 ]

    Recent patches changed calling sequences. nvme_fc_abort_outstanding_ios
    used to be called from a timeout or work context. Now it is being called
    in an io completion context, which can be an interrupt handler.
    Unfortunately, the abort outstanding ios routine attempts to stop nvme
    queues and calls nested routines that may sleep, which conflicts
    with the interrupt handler.

    Correct this by replacing the direct call with the scheduling of a work
    element; the abort outstanding ios routine is then called from the work
    element (a generic sketch of this pattern follows this entry).

    Fixes: 95ced8a2c72d ("nvme-fc: eliminate terminate_io use by nvme_fc_error_recovery")
    Signed-off-by: James Smart
    Reported-by: Daniel Wagner
    Tested-by: Daniel Wagner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    James Smart
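
    A generic sketch of the pattern described above - deferring the sleeping
    work from the completion (interrupt) context to a work element (struct
    and function names are hypothetical):

        #include <linux/workqueue.h>

        struct my_ctrl {
                struct work_struct ioerr_work;  /* INIT_WORK() at setup time */
        };

        static void my_ioerr_work(struct work_struct *work)
        {
                struct my_ctrl *ctrl =
                        container_of(work, struct my_ctrl, ioerr_work);

                /* process context: safe to stop queues and sleep here
                 * while aborting the outstanding I/Os */
        }

        /* in the I/O completion path (possibly an interrupt handler): */
        schedule_work(&ctrl->ioerr_work);       /* instead of the direct call */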
     
  • [ Upstream commit 9ceb7863537748c67fa43ac4f2f565819bbd36e4 ]

    When a queue is in the NVMET_RDMA_Q_CONNECTING state, it may have some
    requests on rsp_wait_list. If a disconnect occurs in this
    state, no one empties this list and returns the requests to the
    free_rsps list. Normally nvmet_rdma_queue_established() frees those
    requests after moving the queue to the NVMET_RDMA_Q_LIVE state, but in
    this case __nvmet_rdma_queue_disconnect() is called first. The
    crash happens in nvmet_rdma_free_rsps() when calling
    list_del(&rsp->free_list), because the request exists only on
    the wait list. To fix the issue, simply clear rsp_wait_list when
    destroying the queue (see the sketch after this entry).

    Signed-off-by: Israel Rukshin
    Reviewed-by: Max Gurtovoy
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Israel Rukshin
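
    A sketch of the cleanup described above (rsp_wait_list and free_rsps are
    from the entry; the wait_list member name and the put helper are
    assumptions):

        static void nvmet_rdma_drain_wait_list(struct nvmet_rdma_queue *queue)
        {
                struct nvmet_rdma_rsp *rsp, *tmp;

                list_for_each_entry_safe(rsp, tmp, &queue->rsp_wait_list,
                                         wait_list) {
                        list_del(&rsp->wait_list);
                        put_rsp_to_free_list(queue, rsp);  /* hypothetical helper */
                }
        }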
     
  • [ Upstream commit 62df80165d7f197c9c0652e7416164f294a96661 ]

    While handling the completion queue, keep a local copy of the command id
    from the DMA-accessible completion entry. This silences a time-of-check
    to time-of-use (TOCTOU) warning from KF/x[1], with respect to a
    Thunderclap[2] vulnerability analysis. The double-read impact appears
    benign.

    There may be a theoretical window for @command_id to be used as an
    adversary-controlled array-index-value for mounting a speculative
    execution attack, but that mitigation is saved for a potential follow-on.
    A man-in-the-middle attack on the data payload is out of scope for this
    analysis and is hopefully mitigated by filesystem integrity mechanisms.

    [1] https://github.com/intel/kernel-fuzzer-for-xen-project
    [2] http://thunderclap.io/thunderclap-paper-ndss2019.pdf
    Signed-off-by: Lalithambika Krishna Kumar
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Lalithambika Krishnakumar
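
    A sketch of the single-fetch idea (field names per the nvme pci driver;
    the completion entry lives in DMA-accessible memory, so it is read exactly
    once into a local variable; endianness handling is omitted here):

        struct nvme_completion *cqe = &nvmeq->cqes[idx];
        u16 command_id = READ_ONCE(cqe->command_id);

        /* only the local command_id is used from here on; the entry is
         * never re-read, closing the double-fetch window */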
     
  • [ Upstream commit 7ee5c78ca3895d44e918c38332921983ed678be0 ]

    A system with more than one of these SSDs will only have one usable.
    Hence the kernel fails to detect nvme devices due to duplicate cntlids.

    [ 6.274554] nvme nvme1: Duplicate cntlid 33 with nvme0, rejecting
    [ 6.274566] nvme nvme1: Removing after probe failure status: -22

    Adding the NVME_QUIRK_IGNORE_DEV_SUBNQN quirk resolves the issue (a sketch
    of such a quirk-table entry follows below).

    Signed-off-by: Gopal Tiwari
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Gopal Tiwari
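
    Adding such a quirk is a one-line entry in the driver's PCI id table; a
    sketch with made-up vendor/device ids:

        static const struct pci_device_id nvme_id_table[] = {
                /* hypothetical ids for the affected SSD model */
                { PCI_DEVICE(0x1234, 0x5678),
                        .driver_data = NVME_QUIRK_IGNORE_DEV_SUBNQN, },
                /* ... existing entries ... */
        };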
     

17 Jan, 2021

1 commit

  • commit 5c11f7d9f843bdd24cd29b95401938bc3f168070 upstream.

    We may send a request (with or without its data) from two paths:

    1. From our I/O context nvme_tcp_io_work which is triggered from:
    - queue_rq
    - r2t reception
    - socket data_ready and write_space callbacks
    2. Directly from queue_rq if the send_list is empty (because we want to
    save the context switch associated with scheduling our io_work).

    However, now that we have the send_mutex, we may run into a race
    condition where none of these contexts sends the pending payload to
    the controller. Both the io_work send path and the queue_rq send path
    opportunistically attempt to acquire the send_mutex; however, queue_rq only
    attempts to send a single request, and if the io_work context fails to
    acquire the send_mutex it will complete without rescheduling itself.

    The race can trigger with the following sequence:

    1. queue_rq sends a request (no in-capsule data) and blocks
    2. RX path receives the r2t - prepares the data PDU to send, adds the
    h2cdata PDU to the send_list and schedules io_work
    3. io_work triggers and cannot acquire the send_mutex (because of (1)),
    so it ends without rescheduling itself
    4. queue_rq completes its send and returns

    ==> no context will send the h2cdata - timeout.

    Fix this by having queue_rq send as much as it can from the send_list,
    so that if anything is left, it is because the socket buffer is
    full and the socket write_space callback will trigger, thus guaranteeing
    that a context will be scheduled to send the h2cdata PDU (a sketch of the
    drain loop follows this entry).

    Fixes: db5ad6b7f8cd ("nvme-tcp: try to send request in queue_rq context")
    Reported-by: Potnuri Bharat Teja
    Reported-by: Samuel Jones
    Signed-off-by: Sagi Grimberg
    Tested-by: Potnuri Bharat Teja
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg
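
    A sketch of the "send as much as possible" behavior described above
    (nvme_tcp_try_send() is the driver's existing per-iteration send routine;
    the wrapper below is a sketch, not necessarily the exact upstream code):

        static void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
        {
                int ret;

                /* Drain the send_list. If the socket buffer fills up, the
                 * write_space callback will later schedule a context that
                 * finishes sending whatever is left (e.g. the h2cdata PDU). */
                do {
                        ret = nvme_tcp_try_send(queue);
                } while (ret > 0);
        }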
     

09 Jan, 2021

1 commit

  • [ Upstream commit 5a7a9e038b032137ae9c45d5429f18a2ffdf7d42 ]

    Use the ib_dma_* helpers to skip the DMA translation instead. This
    removes the last user of dma_virt_ops and keeps the weird layering
    violation inside the RDMA core instead of burdening the DMA mapping
    subsystems with it. This also means the software RDMA drivers no longer
    have to mess with DMA parameters that are not relevant to them at all, and
    that in the future we can use PCI P2P transfers even for software RDMA, as
    there is no longer a fake first layer of DMA mapping that the P2P DMA
    support would have to deal with (see the sketch after this entry).

    Link: https://lore.kernel.org/r/20201106181941.1878556-8-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Tested-by: Mike Marciniszyn
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Sasha Levin

    Christoph Hellwig
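
    A sketch of the substitution described above - the ib_dma_* wrappers take
    the ib_device and skip the translation for software RDMA devices (buffer
    and length variables are hypothetical):

        u64 addr = ib_dma_map_single(ibdev, buf, len, DMA_TO_DEVICE);
        if (ib_dma_mapping_error(ibdev, addr))
                return -ENOMEM;
        /* ... post the work request ... */
        ib_dma_unmap_single(ibdev, addr, len, DMA_TO_DEVICE);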
     

14 Nov, 2020

3 commits

    xa_destroy() frees only internal data. The caller is responsible for
    freeing the external objects referenced by an xarray (see the sketch
    after this entry).

    Fixes: 1cf7a12e09aa4 ("nvme: use an xarray to lookup the Commands Supported and Effects log")
    Signed-off-by: Keith Busch
    Signed-off-by: Christoph Hellwig

    Keith Busch
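
    A sketch of the required pattern - walk the xarray and free the external
    objects before destroying it (the cels field name is an assumption based
    on the nvme core's command-effects cache):

        unsigned long idx;
        struct nvme_effects_log *effects;

        xa_for_each(&ctrl->cels, idx, effects)
                kfree(effects);
        xa_destroy(&ctrl->cels);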
     
    Remove the struct used for tracking known command effects logs in a
    list; this is now saved in an xarray that doesn't use these elements.
    Store the log directly rather than in a wrapper struct.

    Signed-off-by: Keith Busch
    Signed-off-by: Christoph Hellwig

    Keith Busch
     
    If the Doorbell Buffer Config command fails, 'dev->dbbuf_dbs != NULL' still
    holds (meaning OACS indicates that NVME_CTRL_OACS_DBBUF_SUPP is set), so
    nvme_dbbuf_update_and_check_event() will check the event even though dbbuf
    has not been set up successfully.

    This patch fixes the resulting mismatch among the dbbuf entries for the
    sq/cqs in case the dbbuf command fails.

    Signed-off-by: Minwoo Im
    Signed-off-by: Christoph Hellwig

    Minwoo Im
     

10 Nov, 2020

1 commit


05 Nov, 2020

1 commit

  • Pull NVMe fixes from Christoph:

    "nvme fixes for 5.10:

    - revert a nvme_queue size optimization (Keith Busch)
    - fabrics timeout races fixes (Chao Leng and Sagi Grimberg)"

    * tag 'nvme-5.10-2020-11-05' of git://git.infradead.org/nvme:
    nvme-tcp: avoid repeated request completion
    nvme-rdma: avoid repeated request completion
    nvme-tcp: avoid race between time out and tear down
    nvme-rdma: avoid race between time out and tear down
    nvme: introduce nvme_sync_io_queues
    Revert "nvme-pci: remove last_sq_tail"

    Jens Axboe
     

03 Nov, 2020

6 commits

    The request may be executed asynchronously, and rq->state may be
    changed to IDLE. To avoid repeated request completion, only
    MQ_RQ_COMPLETE is checked for rq->state in nvme_tcp_complete_timed_out.
    That is not safe, so we also need to check rq->state for IDLE.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Chao Leng
    Signed-off-by: Christoph Hellwig

    Sagi Grimberg
     
    The request may be executed asynchronously, and rq->state may be
    changed to IDLE. To avoid repeated request completion, only
    MQ_RQ_COMPLETE is checked for rq->state in nvme_rdma_complete_timed_out.
    That is not safe, so we also need to check rq->state for IDLE.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Chao Leng
    Signed-off-by: Christoph Hellwig

    Sagi Grimberg
     
    We currently use teardown_lock to serialize timeout handling and tear
    down. This can still go wrong: tear down first cancels all requests, but
    the timeout handler may then complete a request again, even though the
    request may already have been freed or restarted.

    To avoid the race between timeout and tear down, in the tear down process
    we first quiesce the queue, and then delete the timer and cancel
    the timeout work for the queue. With that, teardown_lock can be removed.

    Signed-off-by: Chao Leng
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig

    Chao Leng
     
    We currently use teardown_lock to serialize timeout handling and tear
    down. This can still go wrong: tear down first cancels all requests, but
    the timeout handler may then complete a request again, even though the
    request may already have been freed or restarted.

    To avoid the race between timeout and tear down, in the tear down process
    we first quiesce the queue, and then delete the timer and cancel
    the timeout work for the queue. With that, teardown_lock can be removed.

    Signed-off-by: Chao Leng
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig

    Chao Leng
     
    Introduce nvme_sync_io_queues() for scenarios that only need to sync the
    I/O queues rather than all queues (a sketch follows this entry).

    Signed-off-by: Chao Leng
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig

    Chao Leng
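
    A sketch of such a helper (blk_sync_queue() cancels the queue's timeout
    work and deletes its timer; list and lock names follow the nvme core):

        void nvme_sync_io_queues(struct nvme_ctrl *ctrl)
        {
                struct nvme_ns *ns;

                down_read(&ctrl->namespaces_rwsem);
                list_for_each_entry(ns, &ctrl->namespaces, list)
                        blk_sync_queue(ns->queue);
                up_read(&ctrl->namespaces_rwsem);
        }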
     
    Multiple CPUs may be mapped to the same hctx, allowing multiple
    submission contexts to attempt commit_rqs(). We need to verify we're
    not writing the same doorbell value multiple times, since that's a spec
    violation.

    Revert commit 54b2fcee1db041a83b52b51752dade6090cf952f.

    Link: https://bugzilla.redhat.com/show_bug.cgi?id=1878596
    Reported-by: "B.L. Jones"
    Signed-off-by: Keith Busch

    Keith Busch
     

31 Oct, 2020

1 commit

  • Pull block fixes from Jens Axboe:

    - null_blk zone fixes (Damien, Kanchan)

    - NVMe pull request from Christoph:
    - improve zone revalidation (Keith Busch)
    - gracefully handle zero length messages in nvme-rdma (zhenwei pi)
    - nvme-fc error handling fixes (James Smart)
    - nvmet tracing NULL pointer dereference fix (Chaitanya Kulkarni)

    - xsysace platform fixes (Andy)

    - scatterlist type cleanup (David)

    - blk-cgroup memory fixes (Gabriel)

    - nbd block size update fix (Ming)

    - Flush completion state fix (Ming)

    - bio_add_hw_page() iteration fix (Naohiro)

    * tag 'block-5.10-2020-10-30' of git://git.kernel.dk/linux-block:
    blk-mq: mark flush request as IDLE in flush_end_io()
    lib/scatterlist: use consistent sg_copy_buffer() return type
    xsysace: use platform_get_resource() and platform_get_irq_optional()
    null_blk: Fix locking in zoned mode
    null_blk: Fix zone reset all tracing
    nbd: don't update block size after device is started
    block: advance iov_iter on bio_add_hw_page failure
    null_blk: synchronization fix for zoned device
    nvmet: fix a NULL pointer dereference when tracing the flush command
    nvme-fc: remove nvme_fc_terminate_io()
    nvme-fc: eliminate terminate_io use by nvme_fc_error_recovery
    nvme-fc: remove err_work work item
    nvme-fc: track error_recovery while connecting
    nvme-rdma: handle unexpected nvme completion data length
    nvme: ignore zone validate errors on subsequent scans
    blk-cgroup: Pre-allocate tree node on blkg_conf_prep
    blk-cgroup: Fix memleak on error path

    Linus Torvalds
     

28 Oct, 2020

1 commit

  • There are two flows for handling RDMA_CM_EVENT_ROUTE_RESOLVED, either the
    handler triggers a completion and another thread does rdma_connect() or
    the handler directly calls rdma_connect().

    In all cases rdma_connect() needs to hold the handler_mutex, but when
    handler's are invoked this is already held by the core code. This causes
    ULPs using the 2nd method to deadlock.

    Provide a rdma_connect_locked() and have all ULPs call it from their
    handlers.

    Link: https://lore.kernel.org/r/0-v2-53c22d5c1405+33-rdma_connect_locking_jgg@nvidia.com
    Reported-and-tested-by: Guoqing Jiang
    Fixes: 2a7cec538169 ("RDMA/cma: Fix locking for the RDMA_CM_CONNECT state")
    Acked-by: Santosh Shilimkar
    Acked-by: Jack Wang
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Max Gurtovoy
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
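
    A sketch of the ULP side of this change - inside a cm event handler the
    handler_mutex is already held by the core, so the _locked variant must be
    used (the handler and parameter setup here are hypothetical):

        static int my_route_resolved_handler(struct rdma_cm_id *cm_id)
        {
                struct rdma_conn_param param = { };

                /* rdma_connect() would re-take handler_mutex and deadlock */
                return rdma_connect_locked(cm_id, &param);
        }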
     

27 Oct, 2020

7 commits

    When target side tracing is turned on and a flush command is issued from
    the host, it results in the following Oops.

    [ 856.789724] BUG: kernel NULL pointer dereference, address: 0000000000000068
    [ 856.790686] #PF: supervisor read access in kernel mode
    [ 856.791262] #PF: error_code(0x0000) - not-present page
    [ 856.791863] PGD 6d7110067 P4D 6d7110067 PUD 66f0ad067 PMD 0
    [ 856.792527] Oops: 0000 [#1] SMP NOPTI
    [ 856.792950] CPU: 15 PID: 7034 Comm: nvme Tainted: G OE 5.9.0nvme-5.9+ #71
    [ 856.793790] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e3214
    [ 856.794956] RIP: 0010:trace_event_raw_event_nvmet_req_init+0x13e/0x170 [nvmet]
    [ 856.795734] Code: 41 5c 41 5d c3 31 d2 31 f6 e8 4e 9b b8 e0 e9 0e ff ff ff 49 8b 55 00 48 8b 38 8b 0
    [ 856.797740] RSP: 0018:ffffc90001be3a60 EFLAGS: 00010246
    [ 856.798375] RAX: 0000000000000000 RBX: ffff8887e7d2c01c RCX: 0000000000000000
    [ 856.799234] RDX: 0000000000000020 RSI: 0000000057e70ea2 RDI: ffff8887e7d2c034
    [ 856.800088] RBP: ffff88869f710578 R08: ffff888807500d40 R09: 00000000fffffffe
    [ 856.800951] R10: 0000000064c66670 R11: 00000000ef955201 R12: ffff8887e7d2c034
    [ 856.801807] R13: ffff88869f7105c8 R14: 0000000000000040 R15: ffff88869f710440
    [ 856.802667] FS: 00007f6a22bd8780(0000) GS:ffff888813a00000(0000) knlGS:0000000000000000
    [ 856.803635] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 856.804367] CR2: 0000000000000068 CR3: 00000006d73e0000 CR4: 00000000003506e0
    [ 856.805283] Call Trace:
    [ 856.805613] nvmet_req_init+0x27c/0x480 [nvmet]
    [ 856.806200] nvme_loop_queue_rq+0xcb/0x1d0 [nvme_loop]
    [ 856.806862] blk_mq_dispatch_rq_list+0x123/0x7b0
    [ 856.807459] ? kvm_sched_clock_read+0x14/0x30
    [ 856.808025] __blk_mq_sched_dispatch_requests+0xc7/0x170
    [ 856.808708] blk_mq_sched_dispatch_requests+0x30/0x60
    [ 856.809372] __blk_mq_run_hw_queue+0x70/0x100
    [ 856.809935] __blk_mq_delay_run_hw_queue+0x156/0x170
    [ 856.810574] blk_mq_run_hw_queue+0x86/0xe0
    [ 856.811104] blk_mq_sched_insert_request+0xef/0x160
    [ 856.811733] blk_execute_rq+0x69/0xc0
    [ 856.812212] ? blk_mq_rq_ctx_init+0xd0/0x230
    [ 856.812784] nvme_execute_passthru_rq+0x57/0x130 [nvme_core]
    [ 856.813461] nvme_submit_user_cmd+0xeb/0x300 [nvme_core]
    [ 856.814099] nvme_user_cmd.isra.82+0x11e/0x1a0 [nvme_core]
    [ 856.814752] blkdev_ioctl+0x1dc/0x2c0
    [ 856.815197] block_ioctl+0x3f/0x50
    [ 856.815606] __x64_sys_ioctl+0x84/0xc0
    [ 856.816074] do_syscall_64+0x33/0x40
    [ 856.816533] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 856.817168] RIP: 0033:0x7f6a222ed107
    [ 856.817617] Code: 44 00 00 48 8b 05 81 cd 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 8
    [ 856.819901] RSP: 002b:00007ffca848f058 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
    [ 856.820846] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f6a222ed107
    [ 856.821726] RDX: 00007ffca848f060 RSI: 00000000c0484e43 RDI: 0000000000000003
    [ 856.822603] RBP: 0000000000000003 R08: 000000000000003f R09: 0000000000000005
    [ 856.823478] R10: 00007ffca848ece0 R11: 0000000000000202 R12: 00007ffca84912d3
    [ 856.824359] R13: 00007ffca848f4d0 R14: 0000000000000002 R15: 000000000067e900
    [ 856.825236] Modules linked in: nvme_loop(OE) nvmet(OE) nvme_fabrics(OE) null_blk nvme(OE) nvme_corel

    Move the nvmet_req_init() tracepoint until after we parse the command in
    nvmet_req_init(), so that we can get rid of the duplicate
    nvmet_find_namespace() call.
    Rename __assign_disk_name() -> __assign_req_name(). Now that we call the
    tracepoint after parsing the command, simplify the newly added
    __assign_req_name(), which fixes this bug.

    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Christoph Hellwig

    Chaitanya Kulkarni
     
    __nvme_fc_terminate_io() is now called from only one place, in reset_work.
    Consolidate and move the functionality of terminate_io into reset_work.

    In reset_work, rather than calling the create_association directly,
    schedule the connect work element to do its thing. After scheduling,
    flush the connect work element to continue with semantic of not
    returning until connect has been attempted at least once.

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig

    James Smart
     
  • nvme_fc_error_recovery() special cases handling when in CONNECTING state
    and calls __nvme_fc_terminate_io(). __nvme_fc_terminate_io() itself
    special cases CONNECTING state and calls the routine to abort outstanding
    ios.

    Simplify the sequence by putting the call to abort outstanding I/Os
    directly in nvme_fc_error_recovery.

    Move the location of __nvme_fc_abort_outstanding_ios(), and
    nvme_fc_terminate_exchange() which is called by it, to avoid adding
    function prototypes for nvme_fc_error_recovery().

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig

    James Smart
     
  • err_work was created to handle errors (mainly I/O timeouts) while in
    CONNECTING state. The flag for err_work_active is also unneeded.

    Remove err_work_active and err_work. The actions to abort I/Os are moved
    inline to nvme_error_recovery().

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig

    James Smart
     
  • Whenever there are errors during CONNECTING, the driver recovers by
    aborting all outstanding ios and counts on the io completion to fail them
    and thus the connection/association they are on. However, the connection
    failure depends on a failure state from the core routines, and not all
    commands issued by the core routines are guaranteed to cause such a
    failure: a command may return a failure status that is then ignored.

    As such, whenever the transport enters error_recovery while CONNECTING,
    it will set a new flag indicating an association failed. The
    create_association routine which creates and initializes the controller,
    will monitor the state of the flag as well as the core routine error
    status and ensure the association fails if there was an error.

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig

    James Smart
     
  • Receiving a zero length message leads to the following warnings because
    the CQE is processed twice:

    refcount_t: underflow; use-after-free.
    WARNING: CPU: 0 PID: 0 at lib/refcount.c:28

    RIP: 0010:refcount_warn_saturate+0xd9/0xe0
    Call Trace:

    nvme_rdma_recv_done+0xf3/0x280 [nvme_rdma]
    __ib_process_cq+0x76/0x150 [ib_core]
    ...

    Sanity check the received data length to avoid this.

    Thanks to Chao Leng & Sagi for suggestions.

    Signed-off-by: zhenwei pi
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig

    zhenwei pi
     
  • Revalidating nvme zoned namespaces requires IO commands, and there are
    controller states that prevent IO. For example, a sanitize in progress
    is required to fail all IO, but we don't want to remove a namespace
    we've previously added just because the controller is in such a state.
    Suppress the error in this case.

    Reported-by: Michael Nguyen
    Signed-off-by: Keith Busch
    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Christoph Hellwig

    Keith Busch
     

23 Oct, 2020

4 commits

  • We've had several complaints about a 10s reconnect delay (the default)
    when there was an error while there is connectivity to a subsystem.
    The max_reconnects and reconnect_delay are set in common code prior to
    calling the transport to create the controller.

    This change checks if the default reconnect delay is being used, and if
    so, it adjusts it to a shorter period (2s) for the nvme-fc transport.
    It does so by calculating the controller loss tmo window, changing the
    value of the reconnect delay, and then recalculating the maximum number
    of reconnect attempts allowed.

    Signed-off-by: James Smart
    Reviewed-by: Himanshu Madhani
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig

    James Smart
     
    On reconnect, the code currently does not freeze the controller before
    possibly updating the number of hw queues for the controller.

    Add the freeze before updating the number of hw queues. Note: the queues
    are already started and remain started through the reconnect.

    Signed-off-by: James Smart
    Reviewed-by: Himanshu Madhani
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig

    James Smart
     
  • The loop that backs out of hw io queue creation continues through index
    0, which corresponds to the admin queue as well.

    Fix the loop so it only proceeds through indexes 1..n which correspond to
    I/O queues.

    Signed-off-by: James Smart
    Reviewed-by: Himanshu Madhani
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig

    James Smart
     
  • Currently, an I/O timeout unconditionally invokes
    nvme_fc_error_recovery() which checks for LIVE or CONNECTING state. If
    live, the routine resets the controller which initiates a reconnect -
    which is valid. If CONNECTING, err_work is scheduled. Err_work then
    calls the terminate_io routine, which also checks for CONNECTING and
    noops any further action on outstanding I/O. The result is nothing
    happened to the timed out io. As such, if the command was dropped on
    the wire, it will never timeout / complete, and the connect process
    will hang.

    Change the behavior of the io timeout routine to unconditionally abort
    the I/O. I/O completion handling will note that an io failed due to an
    abort and will terminate the connection / association as needed. If the
    abort was unable to happen, continue with a call to
    nvme_fc_error_recovery(). To ensure something different happens in
    nvme_fc_error_recovery(), rework it so that it will abort all I/Os on the
    association to force a failure.

    As I/O aborts now may occur outside of delete_association, counting for
    completion must be wary and only count those aborted during
    delete_association when TERMIO is set on the controller.

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig

    James Smart
     

22 Oct, 2020

6 commits

  • By default, we set the passthru request allocation flag such that it
    returns the error in the following code path and we fail the I/O when
    BLK_MQ_REQ_NOWAIT is used for request allocation :-

    nvme_alloc_request()
     blk_mq_alloc_request()
      blk_mq_queue_enter()
       if (flag & BLK_MQ_REQ_NOWAIT)
            return -EBUSY;

    Reviewed-by: Logan Gunthorpe
    Signed-off-by: Christoph Hellwig

    Chaitanya Kulkarni
     
  • Clean up some confusing elements of nvmet_passthru_map_sg() by returning
    early if the request is greater than the maximum bio size. This allows
    us to drop the sg_cnt variable.

    This should not result in any functional change but makes the code
    clearer and more understandable. The original code allocated a truncated
    bio then would return EINVAL when bio_add_pc_page() filled that bio. The
    new code just returns EINVAL early if this would happen.

    Fixes: c1fef73f793b ("nvmet: add passthru code to process commands")
    Signed-off-by: Logan Gunthorpe
    Suggested-by: Douglas Gilbert
    Reviewed-by: Sagi Grimberg
    Cc: Christoph Hellwig
    Cc: Chaitanya Kulkarni
    Signed-off-by: Christoph Hellwig

    Logan Gunthorpe
     
    nvmet_passthru_map_sg() only supports mapping a single BIO, not a chain,
    so the effective maximum transfer should also be limited by
    BIO_MAX_PAGES (presently this works out to 1MB).

    For PCI passthru devices the max_sectors would typically be more
    limiting than BIO_MAX_PAGES, but this may not be true for all passthru
    devices (a sketch of the cap follows this entry).

    Fixes: c1fef73f793b ("nvmet: add passthru code to process commands")
    Suggested-by: Christoph Hellwig
    Signed-off-by: Logan Gunthorpe
    Cc: Christoph Hellwig
    Cc: Sagi Grimberg
    Cc: Chaitanya Kulkarni
    Signed-off-by: Christoph Hellwig

    Logan Gunthorpe
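
    A sketch of the cap described above (variable names hypothetical;
    BIO_MAX_PAGES worth of pages converted to 512-byte sectors):

        u32 bio_max_sectors = (BIO_MAX_PAGES * PAGE_SIZE) >> SECTOR_SHIFT;
        u32 max_hw_sectors = min_t(u32, ctrl_max_hw_sectors, bio_max_sectors);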
     
  • When connecting a controller with a zero kato value using the following
    command line

    nvme connect -t tcp -n NQN -a ADDR -s PORT --keep-alive-tmo=0

    the warning below can be reproduced:

    WARNING: CPU: 1 PID: 241 at kernel/workqueue.c:1627 __queue_delayed_work+0x6d/0x90
    with trace:
    mod_delayed_work_on+0x59/0x90
    nvmet_update_cc+0xee/0x100 [nvmet]
    nvmet_execute_prop_set+0x72/0x80 [nvmet]
    nvmet_tcp_try_recv_pdu+0x2f7/0x770 [nvmet_tcp]
    nvmet_tcp_io_work+0x63f/0xb2d [nvmet_tcp]
    ...

    This is caused by queuing up an uninitialized work. Although the
    keep-alive timer is disabled while allocating the controller (fixed in
    0d3b6a8d213a), ka_work still has a chance to run (called by
    nvmet_start_ctrl).

    Fixes: 0d3b6a8d213a ("nvmet: Disable keep-alive timer when kato is cleared to 0h")
    Signed-off-by: zhenwei pi
    Signed-off-by: Christoph Hellwig

    zhenwei pi
     
  • Like commit 5611ec2b9814 ("nvme-pci: prevent SK hynix PC400 from using
    Write Zeroes command"), Sandisk Skyhawk has the same issue:
    [ 6305.633887] blk_update_request: operation not supported error, dev nvme0n1, sector 340812032 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0

    So also disable Write Zeroes command on Sandisk Skyhawk.

    BugLink: https://bugs.launchpad.net/bugs/1899503
    Signed-off-by: Kai-Heng Feng
    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Christoph Hellwig

    Kai-Heng Feng
     
  • The request's rq_disk isn't set for passthrough IO commands, so tracing
    uses qid 0 for these which incorrectly decodes as an admin command. Use
    the request_queue's queuedata instead since that value is always set for
    the IO queues, and never set for the admin queue.

    Signed-off-by: Keith Busch
    Signed-off-by: Christoph Hellwig

    Keith Busch