20 Jan, 2021
8 commits
-
commit ada831772188192243f9ea437c46e37e97a5975d upstream.
We shouldn't call smp_processor_id() in a preemptible
context, but this is advisory at best, so instead
call __smp_processor_id().

Fixes: db5ad6b7f8cd ("nvme-tcp: try to send request in queue_rq context")
Reported-by: Or Gerlitz
Reported-by: Yi Zhang
Signed-off-by: Sagi Grimberg
Signed-off-by: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman -
commit ca1ff67d0fb14f39cf0cc5102b1fbcc3b14f6fb9 upstream.
When a bio merges, we can get a request that spans multiple
bios, and the overall request payload size is the sum of
all bios. When we calculate how much we need to send
from the existing bio (and bvec), we did not take into
account the iov_iter byte count cap.

Since multipage bvec support was added, bvecs can be split in the middle,
which means that when we account for the last bvec send we
should also take the iov_iter byte count cap as it might be
lower than the last bvec size.

Reported-by: Hao Wang
Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
Tested-by: Hao Wang
Signed-off-by: Sagi Grimberg
Signed-off-by: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman -
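
The capping rule described above can be sketched in user-space C. This is only an illustration under stated assumptions: `struct send_state` and `cur_send_len()` are hypothetical names, not the driver's own, and the three fields stand in for whatever the driver tracks per request.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of the per-request send state. */
struct send_state {
    size_t bvec_len;   /* length of the current (possibly multipage) bvec */
    size_t bvec_done;  /* bytes of that bvec already consumed */
    size_t iter_count; /* bytes the iov_iter still allows for this bio */
};

/* Bytes that may be sent from the current bvec in one step. */
static size_t cur_send_len(const struct send_state *s)
{
    size_t left_in_bvec;

    assert(s->bvec_done <= s->bvec_len);
    left_in_bvec = s->bvec_len - s->bvec_done;

    /* The iov_iter cap wins when the bio ends inside the bvec. */
    return left_in_bvec < s->iter_count ? left_in_bvec : s->iter_count;
}
```

With an 8192-byte bvec and only 4096 bytes left in the iterator, the sketch sends 4096 bytes rather than the full bvec.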
commit 5ab25a32cd90ce561ac28b9302766e565d61304c upstream.
Discovery controllers usually don't support smart log page command.
So when we connect to the discovery controller we see this warning:
nvme nvme0: Failed to read smart log (error 24577)
nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 192.168.123.1:8009
nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"

Introduce a new helper to understand if the controller is a discovery
controller and use this helper to skip nvme_init_hwmon (also use it in
other places that we check if the controller is a discovery controller).

Fixes: 400b6a7b13a3 ("nvme: Add hardware monitoring support")
Signed-off-by: Sagi Grimberg
Signed-off-by: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman -
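
A minimal user-space sketch of such a helper follows. It assumes a discovery controller can be recognised by comparing the subsystem NQN with the well-known discovery NQN from the log above; the real helper may use a different test. `struct ctrl`, `is_discovery_ctrl()` and `should_init_hwmon()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Well-known discovery NQN, as it appears in the log above. */
#define DISCOVERY_NQN "nqn.2014-08.org.nvmexpress.discovery"

/* Hypothetical controller: only the field this sketch needs. */
struct ctrl {
    const char *subsysnqn;
};

/* Assumption: a discovery controller is identified by its subsystem NQN. */
static bool is_discovery_ctrl(const struct ctrl *c)
{
    assert(c->subsysnqn != NULL);
    return strcmp(c->subsysnqn, DISCOVERY_NQN) == 0;
}

/* Hardware monitoring setup is skipped for discovery controllers. */
static bool should_init_hwmon(const struct ctrl *c)
{
    return !is_discovery_ctrl(c);
}
```

Keeping the check in one helper means every call site that cares about discovery controllers agrees on the definition.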
commit 7a84665619bb5da8c8b6517157875a1fd7632014 upstream.
When setting port traddr to INADDR_ANY, the listening cm_id->device
is NULL. The associated IB device is known only when a connect request
event arrives, so checking T10-PI device capability should be done
at this stage.

Fixes: b09160c3996c ("nvmet-rdma: add metadata/T10-PI support")
Signed-off-by: Israel Rukshin
Reviewed-by: Sagi Grimberg
Reviewed-by: Max Gurtovoy
Signed-off-by: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman -
[ Upstream commit 19fce0470f05031e6af36e49ce222d0f0050d432 ]
Recent patches changed calling sequences. nvme_fc_abort_outstanding_ios
used to be called from a timeout or work context. Now it is being called
in an io completion context, which can be an interrupt handler.
Unfortunately, the abort outstanding ios routine attempts to stop nvme
queues and nested routines that may try to sleep, which is in conflict
with the interrupt handler.

Correct this by replacing the direct call with the scheduling of a work
element; the abort outstanding ios routine will then be called from the
work element.

Fixes: 95ced8a2c72d ("nvme-fc: eliminate terminate_io use by nvme_fc_error_recovery")
Signed-off-by: James Smart
Reported-by: Daniel Wagner
Tested-by: Daniel Wagner
Signed-off-by: Christoph Hellwig
Signed-off-by: Sasha Levin -
[ Upstream commit 9ceb7863537748c67fa43ac4f2f565819bbd36e4 ]
When a queue is in NVMET_RDMA_Q_CONNECTING state, it may have some
requests at rsp_wait_list. In case a disconnect occurs at this
state, no one will empty this list and return the requests to the
free_rsps list. Normally nvmet_rdma_queue_established() frees those
requests after moving the queue to NVMET_RDMA_Q_LIVE state, but in
this case __nvmet_rdma_queue_disconnect() is called before. The
crash happens at nvmet_rdma_free_rsps() when calling
list_del(&rsp->free_list), because the request exists only at
the wait list. To fix the issue, simply clear rsp_wait_list when
destroying the queue.

Signed-off-by: Israel Rukshin
Reviewed-by: Max Gurtovoy
Signed-off-by: Christoph Hellwig
Signed-off-by: Sasha Levin -
[ Upstream commit 62df80165d7f197c9c0652e7416164f294a96661 ]
While handling the completion queue, keep a local copy of the command id
from the DMA-accessible completion entry. This silences a time-of-check
to time-of-use (TOCTOU) warning from KF/x[1], with respect to a
Thunderclap[2] vulnerability analysis. The double-read impact appears
benign.

There may be a theoretical window for @command_id to be used as an
adversary-controlled array-index-value for mounting a speculative
execution attack, but that mitigation is saved for a potential follow-on.
A man-in-the-middle attack on the data payload is out of scope for this
analysis and is hopefully mitigated by filesystem integrity mechanisms.

[1] https://github.com/intel/kernel-fuzzer-for-xen-project
[2] http://thunderclap.io/thunderclap-paper-ndss2019.pdf
Signed-off-by: Lalithambika Krishna Kumar
Signed-off-by: Christoph Hellwig
Signed-off-by: Sasha Levin -
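
The local-copy pattern can be illustrated with a small user-space sketch. `struct cqe`, `QUEUE_DEPTH` and `cqe_to_slot()` are hypothetical; the point is only that the device-writable field is read once and every later use goes through the checked copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_DEPTH 8 /* hypothetical number of request slots */

/* Completion entry in memory the device can rewrite at any time. */
struct cqe {
    volatile uint16_t command_id;
};

/* Returns the slot index for a completion, or -1 if the id is out of range. */
static int cqe_to_slot(const struct cqe *cqe)
{
    uint16_t command_id;

    assert(cqe != NULL);

    /* Read the device-visible field exactly once. */
    command_id = cqe->command_id;

    if (command_id >= QUEUE_DEPTH)
        return -1;

    /* Use the checked local copy, never cqe->command_id again. */
    return command_id;
}
```

Because the range check and the use share one read, a device that changes the entry in between cannot make the two disagree.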
[ Upstream commit 7ee5c78ca3895d44e918c38332921983ed678be0 ]
A system with more than one of these SSDs will only have one usable.
Hence the kernel fails to detect nvme devices due to duplicate cntlids.

[ 6.274554] nvme nvme1: Duplicate cntlid 33 with nvme0, rejecting
[ 6.274566] nvme nvme1: Removing after probe failure status: -22

Adding the NVME_QUIRK_IGNORE_DEV_SUBNQN quirk resolves the issue.
Signed-off-by: Gopal Tiwari
Signed-off-by: Christoph Hellwig
Signed-off-by: Sasha Levin
17 Jan, 2021
1 commit
-
commit 5c11f7d9f843bdd24cd29b95401938bc3f168070 upstream.
We may send a request (with or without its data) from two paths:
1. From our I/O context nvme_tcp_io_work which is triggered from:
   - queue_rq
   - r2t reception
   - socket data_ready and write_space callbacks
2. Directly from queue_rq if the send_list is empty (because we want to
   save the context switch associated with scheduling our io_work).

However, given that now we have the send_mutex, we may run into a race
condition where none of these contexts will send the pending payload to
the controller. Both the io_work send path and the queue_rq send path
opportunistically attempt to acquire the send_mutex; however, queue_rq only
attempts to send a single request, and if the io_work context fails to
acquire the send_mutex it will complete without rescheduling itself.

The race can trigger with the following sequence:
1. queue_rq sends request (no in-capsule data) and blocks
2. RX path receives r2t - prepares data PDU to send, adds h2cdata PDU
   to the send_list and schedules io_work
3. io_work triggers and cannot acquire the send_mutex - because of (1),
   ends without self rescheduling
4. queue_rq completes the send, and completes

==> no context will send the h2cdata - timeout.

Fix this by having queue_rq send as much as it can from the send_list
such that if it still has any left, it's because the socket buffer is
full and the socket write_space callback will trigger, thus guaranteeing
that a context will be scheduled to send the h2cdata PDU.

Fixes: db5ad6b7f8cd ("nvme-tcp: try to send request in queue_rq context")
Reported-by: Potnuri Bharat Teja
Reported-by: Samuel Jones
Signed-off-by: Sagi Grimberg
Tested-by: Potnuri Bharat Teja
Signed-off-by: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman
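
The "send as much as it can" rule from the fix above can be modelled with plain counters. This is a toy sketch, not driver code: `struct sim_queue` and `drain_send_list()` are hypothetical, PDUs and socket space are reduced to integers, and the `write_space_pending` flag stands in for the socket's write_space callback firing later.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a queue: counts instead of real PDUs and sockets. */
struct sim_queue {
    int send_list_len;        /* PDUs waiting on the send_list */
    int sock_space;           /* PDUs the socket buffer can still take */
    bool write_space_pending; /* socket full: write_space will fire later */
};

/* Send as much as possible; returns the number of PDUs sent. */
static int drain_send_list(struct sim_queue *q)
{
    int sent = 0;

    assert(q->send_list_len >= 0 && q->sock_space >= 0);
    while (q->send_list_len > 0 && q->sock_space > 0) {
        q->send_list_len--;
        q->sock_space--;
        sent++;
    }

    /*
     * Anything left means the socket is full, so write_space will
     * schedule a context that sends the rest.
     */
    q->write_space_pending = q->send_list_len > 0;
    return sent;
}
```

In this model a PDU can only be stranded on the list while `write_space_pending` is set, which is the guarantee the commit message describes.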
09 Jan, 2021
1 commit
-
[ Upstream commit 5a7a9e038b032137ae9c45d5429f18a2ffdf7d42 ]
Use the ib_dma_* helpers to skip the DMA translation instead. This
removes the last user of dma_virt_ops and keeps the weird layering
violation inside the RDMA core instead of burdening the DMA mapping
subsystems with it. This also means the software RDMA drivers now don't
have to mess with DMA parameters that are not relevant to them at all, and
that in the future we can use PCI P2P transfers even for software RDMA, as
there is no first fake layer of DMA mapping that the P2P DMA support.

Link: https://lore.kernel.org/r/20201106181941.1878556-8-hch@lst.de
Signed-off-by: Christoph Hellwig
Tested-by: Mike Marciniszyn
Signed-off-by: Jason Gunthorpe
Signed-off-by: Sasha Levin
14 Nov, 2020
3 commits
-
xa_destroy() frees only internal data. The caller is responsible for
freeing the external objects referenced by an xarray.

Fixes: 1cf7a12e09aa4 ("nvme: use an xarray to lookup the Commands Supported and Effects log")
Signed-off-by: Keith Busch
Signed-off-by: Christoph Hellwig -
Remove the struct used for tracking known command effects logs in a
list. This is now saved in an xarray that doesn't use these elements.
Store the log directly instead of the wrapper struct.

Signed-off-by: Keith Busch
Signed-off-by: Christoph Hellwig -
If the Doorbell Buffer Config command fails even though
'dev->dbbuf_dbs != NULL' (which means OACS indicates that
NVME_CTRL_OACS_DBBUF_SUPP is set), nvme_dbbuf_update_and_check_event()
will check the event even though it has not been successfully set.

This patch fixes the mismatch among dbbuf for sq/cqs in case the dbbuf
command fails.

Signed-off-by: Minwoo Im
Signed-off-by: Christoph Hellwig
10 Nov, 2020
1 commit
-
The offending commit breaks the BLKROSET ioctl because a device
revalidation will blindly override the BLKROSET setting. Hence,
we remove the disk rw setting in case NVME_NS_ATTR_RO is cleared
by the controller.

Fixes: 1293477f4f32 ("nvme: set gendisk read only based on nsattr")
Signed-off-by: Sagi Grimberg
Signed-off-by: Christoph Hellwig
05 Nov, 2020
1 commit
-
Pull NVMe fixes from Christoph:
"nvme fixes for 5.10:
- revert a nvme_queue size optimization (Keith Busch)
- fabrics timeout races fixes (Chao Leng and Sagi Grimberg)"

* tag 'nvme-5.10-2020-11-05' of git://git.infradead.org/nvme:
nvme-tcp: avoid repeated request completion
nvme-rdma: avoid repeated request completion
nvme-tcp: avoid race between time out and tear down
nvme-rdma: avoid race between time out and tear down
nvme: introduce nvme_sync_io_queues
Revert "nvme-pci: remove last_sq_tail"
03 Nov, 2020
6 commits
-
The request may be executed asynchronously, and rq->state may be
changed to IDLE. To avoid repeated request completion, only
MQ_RQ_COMPLETE of rq->state is checked in nvme_tcp_complete_timed_out.
That is not safe, so also check rq->state for IDLE.

Signed-off-by: Sagi Grimberg
Signed-off-by: Chao Leng
Signed-off-by: Christoph Hellwig -
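
The state check can be sketched as follows. The enum and `complete_timed_out()` are hypothetical user-space stand-ins modelled on the state names in the message; the sketch only shows that both IDLE and COMPLETE must block a second completion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified request states, modelled on the names in the text above. */
enum rq_state { RQ_IDLE, RQ_IN_FLIGHT, RQ_COMPLETE };

struct request {
    enum rq_state state;
    int completions; /* how often this request was completed */
};

/* Complete a timed-out request only if it is still in flight. */
static bool complete_timed_out(struct request *rq)
{
    assert(rq != NULL);

    /* Checking COMPLETE alone would let an IDLE request through. */
    if (rq->state == RQ_IDLE || rq->state == RQ_COMPLETE)
        return false;

    rq->state = RQ_COMPLETE;
    rq->completions++;
    return true;
}
```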
The request may be executed asynchronously, and rq->state may be
changed to IDLE. To avoid repeated request completion, only
MQ_RQ_COMPLETE of rq->state is checked in nvme_rdma_complete_timed_out.
That is not safe, so also check rq->state for IDLE.

Signed-off-by: Sagi Grimberg
Signed-off-by: Chao Leng
Signed-off-by: Christoph Hellwig -
Currently teardown_lock is used to serialize time out and tear down. This
can misbehave: tear down first cancels all requests, then time out may
complete a request again, although the request may already have been
freed or restarted.

To avoid the race between time out and tear down, in the tear down process
we first quiesce the queue, and then delete the timer and cancel
the time out work for the queue. At the same time we need to delete
teardown_lock.

Signed-off-by: Chao Leng
Reviewed-by: Sagi Grimberg
Signed-off-by: Christoph Hellwig -
Currently teardown_lock is used to serialize time out and tear down. This
can misbehave: tear down first cancels all requests, then time out may
complete a request again, although the request may already have been
freed or restarted.

To avoid the race between time out and tear down, in the tear down process
we first quiesce the queue, and then delete the timer and cancel
the time out work for the queue. At the same time we need to delete
teardown_lock.

Signed-off-by: Chao Leng
Reviewed-by: Sagi Grimberg
Signed-off-by: Christoph Hellwig -
Introduce nvme_sync_io_queues for scenarios that only need to sync the
I/O queues rather than all queues.

Signed-off-by: Chao Leng
Reviewed-by: Sagi Grimberg
Signed-off-by: Christoph Hellwig -
Multiple CPUs may be mapped to the same hctx, allowing multiple
submission contexts to attempt commit_rqs(). We need to verify we're
not writing the same doorbell value multiple times since that's a spec
violation.

Revert commit 54b2fcee1db041a83b52b51752dade6090cf952f.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1878596
Reported-by: "B.L. Jones"
Signed-off-by: Keith Busch
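
The duplicate-write check can be sketched in user space like this. `struct sim_sq` and `commit_sq_tail()` are hypothetical; a counter stands in for the actual doorbell register write, and `last_sq_tail` plays the role of the field the reverted commit had removed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical submission queue: a counter replaces the MMIO doorbell. */
struct sim_sq {
    uint16_t sq_tail;             /* current tail after queued commands */
    uint16_t last_sq_tail;        /* tail value last written to the doorbell */
    unsigned int doorbell_writes; /* number of simulated doorbell writes */
};

/* Ring the doorbell only if the tail moved since the last write. */
static bool commit_sq_tail(struct sim_sq *q)
{
    assert(q != NULL);
    if (q->sq_tail == q->last_sq_tail)
        return false; /* same value again: skip the write */

    q->doorbell_writes++;
    q->last_sq_tail = q->sq_tail;
    return true;
}
```

Two contexts committing back to back with no new commands in between produce a single doorbell write in this model.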
31 Oct, 2020
1 commit
-
Pull block fixes from Jens Axboe:
- null_blk zone fixes (Damien, Kanchan)
- NVMe pull request from Christoph:
  - improve zone revalidation (Keith Busch)
  - gracefully handle zero length messages in nvme-rdma (zhenwei pi)
  - nvme-fc error handling fixes (James Smart)
  - nvmet tracing NULL pointer dereference fix (Chaitanya Kulkarni)"
- xsysace platform fixes (Andy)
- scatterlist type cleanup (David)
- blk-cgroup memory fixes (Gabriel)
- nbd block size update fix (Ming)
- Flush completion state fix (Ming)
- bio_add_hw_page() iteration fix (Naohiro)
* tag 'block-5.10-2020-10-30' of git://git.kernel.dk/linux-block:
blk-mq: mark flush request as IDLE in flush_end_io()
lib/scatterlist: use consistent sg_copy_buffer() return type
xsysace: use platform_get_resource() and platform_get_irq_optional()
null_blk: Fix locking in zoned mode
null_blk: Fix zone reset all tracing
nbd: don't update block size after device is started
block: advance iov_iter on bio_add_hw_page failure
null_blk: synchronization fix for zoned device
nvmet: fix a NULL pointer dereference when tracing the flush command
nvme-fc: remove nvme_fc_terminate_io()
nvme-fc: eliminate terminate_io use by nvme_fc_error_recovery
nvme-fc: remove err_work work item
nvme-fc: track error_recovery while connecting
nvme-rdma: handle unexpected nvme completion data length
nvme: ignore zone validate errors on subsequent scans
blk-cgroup: Pre-allocate tree node on blkg_conf_prep
blk-cgroup: Fix memleak on error path
28 Oct, 2020
1 commit
-
There are two flows for handling RDMA_CM_EVENT_ROUTE_RESOLVED, either the
handler triggers a completion and another thread does rdma_connect() or
the handler directly calls rdma_connect().

In all cases rdma_connect() needs to hold the handler_mutex, but when
handlers are invoked this is already held by the core code. This causes
ULPs using the 2nd method to deadlock.

Provide a rdma_connect_locked() and have all ULPs call it from their
handlers.

Link: https://lore.kernel.org/r/0-v2-53c22d5c1405+33-rdma_connect_locking_jgg@nvidia.com
Reported-and-tested-by: Guoqing Jiang
Fixes: 2a7cec538169 ("RDMA/cma: Fix locking for the RDMA_CM_CONNECT state")
Acked-by: Santosh Shilimkar
Acked-by: Jack Wang
Reviewed-by: Christoph Hellwig
Reviewed-by: Max Gurtovoy
Reviewed-by: Sagi Grimberg
Signed-off-by: Jason Gunthorpe
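
The locking split can be modelled in user space. This is a toy sketch under stated assumptions: a boolean stands in for the non-recursive handler_mutex, `sim_connect()` returns -1 where a real mutex would deadlock, and all `sim_` names are hypothetical rather than the RDMA CM API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: a flag stands in for the non-recursive handler_mutex. */
struct sim_cm_id {
    bool handler_mutex_held;
    int connects;
};

/* Caller must already hold the handler mutex, as event handlers do. */
static int sim_connect_locked(struct sim_cm_id *id)
{
    assert(id->handler_mutex_held);
    id->connects++;
    return 0;
}

/* Takes the mutex itself; returns -1 where a real mutex would deadlock. */
static int sim_connect(struct sim_cm_id *id)
{
    int ret;

    if (id->handler_mutex_held)
        return -1;

    id->handler_mutex_held = true;
    ret = sim_connect_locked(id);
    id->handler_mutex_held = false;
    return ret;
}
```

A handler that runs with the mutex held therefore calls the `_locked` variant, while code running outside a handler keeps using the variant that takes the mutex.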
27 Oct, 2020
7 commits
-
When target side trace is turned on and a flush command is issued from the
host it results in the following Oops.

[ 856.789724] BUG: kernel NULL pointer dereference, address: 0000000000000068
[ 856.790686] #PF: supervisor read access in kernel mode
[ 856.791262] #PF: error_code(0x0000) - not-present page
[ 856.791863] PGD 6d7110067 P4D 6d7110067 PUD 66f0ad067 PMD 0
[ 856.792527] Oops: 0000 [#1] SMP NOPTI
[ 856.792950] CPU: 15 PID: 7034 Comm: nvme Tainted: G OE 5.9.0nvme-5.9+ #71
[ 856.793790] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e3214
[ 856.794956] RIP: 0010:trace_event_raw_event_nvmet_req_init+0x13e/0x170 [nvmet]
[ 856.795734] Code: 41 5c 41 5d c3 31 d2 31 f6 e8 4e 9b b8 e0 e9 0e ff ff ff 49 8b 55 00 48 8b 38 8b 0
[ 856.797740] RSP: 0018:ffffc90001be3a60 EFLAGS: 00010246
[ 856.798375] RAX: 0000000000000000 RBX: ffff8887e7d2c01c RCX: 0000000000000000
[ 856.799234] RDX: 0000000000000020 RSI: 0000000057e70ea2 RDI: ffff8887e7d2c034
[ 856.800088] RBP: ffff88869f710578 R08: ffff888807500d40 R09: 00000000fffffffe
[ 856.800951] R10: 0000000064c66670 R11: 00000000ef955201 R12: ffff8887e7d2c034
[ 856.801807] R13: ffff88869f7105c8 R14: 0000000000000040 R15: ffff88869f710440
[ 856.802667] FS: 00007f6a22bd8780(0000) GS:ffff888813a00000(0000) knlGS:0000000000000000
[ 856.803635] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 856.804367] CR2: 0000000000000068 CR3: 00000006d73e0000 CR4: 00000000003506e0
[ 856.805283] Call Trace:
[ 856.805613] nvmet_req_init+0x27c/0x480 [nvmet]
[ 856.806200] nvme_loop_queue_rq+0xcb/0x1d0 [nvme_loop]
[ 856.806862] blk_mq_dispatch_rq_list+0x123/0x7b0
[ 856.807459] ? kvm_sched_clock_read+0x14/0x30
[ 856.808025] __blk_mq_sched_dispatch_requests+0xc7/0x170
[ 856.808708] blk_mq_sched_dispatch_requests+0x30/0x60
[ 856.809372] __blk_mq_run_hw_queue+0x70/0x100
[ 856.809935] __blk_mq_delay_run_hw_queue+0x156/0x170
[ 856.810574] blk_mq_run_hw_queue+0x86/0xe0
[ 856.811104] blk_mq_sched_insert_request+0xef/0x160
[ 856.811733] blk_execute_rq+0x69/0xc0
[ 856.812212] ? blk_mq_rq_ctx_init+0xd0/0x230
[ 856.812784] nvme_execute_passthru_rq+0x57/0x130 [nvme_core]
[ 856.813461] nvme_submit_user_cmd+0xeb/0x300 [nvme_core]
[ 856.814099] nvme_user_cmd.isra.82+0x11e/0x1a0 [nvme_core]
[ 856.814752] blkdev_ioctl+0x1dc/0x2c0
[ 856.815197] block_ioctl+0x3f/0x50
[ 856.815606] __x64_sys_ioctl+0x84/0xc0
[ 856.816074] do_syscall_64+0x33/0x40
[ 856.816533] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 856.817168] RIP: 0033:0x7f6a222ed107
[ 856.817617] Code: 44 00 00 48 8b 05 81 cd 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 8
[ 856.819901] RSP: 002b:00007ffca848f058 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[ 856.820846] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f6a222ed107
[ 856.821726] RDX: 00007ffca848f060 RSI: 00000000c0484e43 RDI: 0000000000000003
[ 856.822603] RBP: 0000000000000003 R08: 000000000000003f R09: 0000000000000005
[ 856.823478] R10: 00007ffca848ece0 R11: 0000000000000202 R12: 00007ffca84912d3
[ 856.824359] R13: 00007ffca848f4d0 R14: 0000000000000002 R15: 000000000067e900
[ 856.825236] Modules linked in: nvme_loop(OE) nvmet(OE) nvme_fabrics(OE) null_blk nvme(OE) nvme_corel

Move the nvmet_req_init() tracepoint after we parse the command in
nvmet_req_init() so that we can get rid of the duplicate
nvmet_find_namespace() call.
Rename __assign_disk_name() -> __assign_req_name(). Now that we call
tracepoint after parsing the command simplify the newly added
__assign_req_name() which fixes this bug.

Signed-off-by: Chaitanya Kulkarni
Signed-off-by: Christoph Hellwig -
__nvme_fc_terminate_io() is now called from only one place, reset_work.
Consolidate and move the functionality of terminate_io into reset_work.

In reset_work, rather than calling the create_association directly,
schedule the connect work element to do its thing. After scheduling,
flush the connect work element to continue with the semantic of not
returning until connect has been attempted at least once.

Signed-off-by: James Smart
Signed-off-by: Christoph Hellwig -
nvme_fc_error_recovery() special cases handling when in CONNECTING state
and calls __nvme_fc_terminate_io(). __nvme_fc_terminate_io() itself
special cases CONNECTING state and calls the routine to abort outstanding
ios.

Simplify the sequence by putting the call to abort outstanding I/Os
directly in nvme_fc_error_recovery.

Move the location of __nvme_fc_abort_outstanding_ios(), and
nvme_fc_terminate_exchange() which is called by it, to avoid adding
function prototypes for nvme_fc_error_recovery().

Signed-off-by: James Smart
Signed-off-by: Christoph Hellwig -
err_work was created to handle errors (mainly I/O timeouts) while in
CONNECTING state. The flag for err_work_active is also unneeded.

Remove err_work_active and err_work. The actions to abort I/Os are moved
inline to nvme_error_recovery().

Signed-off-by: James Smart
Signed-off-by: Christoph Hellwig -
Whenever there are errors during CONNECTING, the driver recovers by
aborting all outstanding ios and counts on the io completion to fail them
and thus the connection/association they are on. However, the connection
failure depends on a failure state from the core routines. Not all
commands that are issued by the core routine are guaranteed to cause a
failure of the core routine. They may be treated as a failure status and
the status is then ignored.

As such, whenever the transport enters error_recovery while CONNECTING,
it will set a new flag indicating an association failed. The
create_association routine which creates and initializes the controller,
will monitor the state of the flag as well as the core routine error
status and ensure the association fails if there was an error.

Signed-off-by: James Smart
Signed-off-by: Christoph Hellwig -
Receiving a zero length message leads to the following warnings because
the CQE is processed twice:

refcount_t: underflow; use-after-free.
WARNING: CPU: 0 PID: 0 at lib/refcount.c:28

RIP: 0010:refcount_warn_saturate+0xd9/0xe0
Call Trace:
nvme_rdma_recv_done+0xf3/0x280 [nvme_rdma]
__ib_process_cq+0x76/0x150 [ib_core]
...

Sanity check the received data length to avoid this.

Thanks to Chao Leng & Sagi for suggestions.
Signed-off-by: zhenwei pi
Reviewed-by: Sagi Grimberg
Signed-off-by: Christoph Hellwig -
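
A length sanity check of this kind can be sketched as below. `struct sim_recv` and `handle_recv()` are hypothetical; the 16-byte constant is the size of an NVMe completion queue entry, and the sketch assumes that anything shorter cannot carry a completion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CQE_SIZE 16 /* an NVMe completion queue entry is 16 bytes */

/* Hypothetical receive event: only the received length matters here. */
struct sim_recv {
    size_t byte_len; /* bytes actually received */
    int processed;   /* how many times the CQE was processed */
};

/* Process the completion only if a whole CQE was received. */
static bool handle_recv(struct sim_recv *r)
{
    assert(r != NULL);
    if (r->byte_len < CQE_SIZE)
        return false; /* too short to hold a completion: reject */

    r->processed++;
    return true;
}
```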
Revalidating nvme zoned namespaces requires IO commands, and there are
controller states that prevent IO. For example, a sanitize in progress
is required to fail all IO, but we don't want to remove a namespace
we've previously added just because the controller is in such a state.
Suppress the error in this case.

Reported-by: Michael Nguyen
Signed-off-by: Keith Busch
Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Christoph Hellwig
23 Oct, 2020
4 commits
-
We've had several complaints about a 10s reconnect delay (the default)
when there was an error while there is connectivity to a subsystem.
The max_reconnects and reconnect_delay are set in common code prior to
calling the transport to create the controller.

This change checks if the default reconnect delay is being used, and if
so, it adjusts it to a shorter period (2s) for the nvme-fc transport.
It does so by calculating the controller loss tmo window, changing the
value of the reconnect delay, and then recalculating the maximum number
of reconnect attempts allowed.

Signed-off-by: James Smart
Reviewed-by: Himanshu Madhani
Reviewed-by: Hannes Reinecke
Signed-off-by: Christoph Hellwig -
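
The recalculation can be worked through with a small sketch. The 10s and 2s values come from the message; everything else is an assumption for illustration: `struct conn_opts` and `fc_adjust_reconnect()` are hypothetical names, and a negative `max_reconnects` is taken to mean "retry forever".

```c
#include <assert.h>

#define DEF_RECONNECT_DELAY 10 /* seconds, the default named in the text */
#define FC_RECONNECT_DELAY   2 /* seconds, the shorter nvme-fc value */

/* Hypothetical subset of the connect options. */
struct conn_opts {
    int reconnect_delay; /* seconds between reconnect attempts */
    int max_reconnects;  /* assumption: negative means unlimited */
};

/* Shorten the default delay while keeping the same loss window. */
static void fc_adjust_reconnect(struct conn_opts *o)
{
    int ctrl_loss_tmo;

    assert(o->reconnect_delay > 0);
    if (o->max_reconnects < 0 || o->reconnect_delay != DEF_RECONNECT_DELAY)
        return; /* unlimited retries or a user-chosen delay: leave alone */

    /* Window implied by the current settings, in seconds. */
    ctrl_loss_tmo = o->max_reconnects * o->reconnect_delay;

    o->reconnect_delay = FC_RECONNECT_DELAY;

    /* Round up so the window is never shortened. */
    o->max_reconnects = (ctrl_loss_tmo + o->reconnect_delay - 1) /
                        o->reconnect_delay;
}
```

For example, 60 attempts at 10s describe a 600s window, which becomes 300 attempts at 2s.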
On reconnect, the code currently does not freeze the controller before
possibly updating the number of hw queues for the controller.

Add the freeze before updating the number of hw queues. Note: the queues
are already started and remain started through the reconnect.

Signed-off-by: James Smart
Reviewed-by: Himanshu Madhani
Reviewed-by: Hannes Reinecke
Signed-off-by: Christoph Hellwig -
The loop that backs out of hw io queue creation continues through index
0, which corresponds to the admin queue as well.

Fix the loop so it only proceeds through indexes 1..n which correspond to
I/O queues.

Signed-off-by: James Smart
Reviewed-by: Himanshu Madhani
Reviewed-by: Hannes Reinecke
Signed-off-by: Christoph Hellwig -
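
The loop bound can be illustrated with an array where index 0 plays the admin queue. `backout_io_queues()` and the `live` array are hypothetical; the sketch assumes creation failed at `failed_idx` and every lower I/O queue must be torn down.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Tear down the I/O queues created before index failed_idx.
 * live[0] is the admin queue and must be left untouched.
 * Returns the number of queues torn down.
 */
static int backout_io_queues(bool *live, int failed_idx)
{
    int torn_down = 0;

    assert(failed_idx >= 1);

    /* Stop before index 0: that slot is the admin queue. */
    for (int i = failed_idx - 1; i > 0; i--) {
        live[i] = false;
        torn_down++;
    }
    return torn_down;
}
```

With the condition written as `i >= 0` instead, the same loop would also clear the admin queue slot, which is the bug described above.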
Currently, an I/O timeout unconditionally invokes
nvme_fc_error_recovery() which checks for LIVE or CONNECTING state. If
live, the routine resets the controller which initiates a reconnect -
which is valid. If CONNECTING, err_work is scheduled. Err_work then
calls the terminate_io routine, which also checks for CONNECTING and
noops any further action on outstanding I/O. The result is nothing
happened to the timed out io. As such, if the command was dropped on
the wire, it will never timeout / complete, and the connect process
will hang.

Change the behavior of the io timeout routine to unconditionally abort
the I/O. I/O completion handling will note that an io failed due to an
abort and will terminate the connection / association as needed. If the
abort was unable to happen, continue with a call to
nvme_fc_error_recovery(). To ensure something different happens in
nvme_fc_error_recovery() rework it so that it will abort all I/Os on the
association to force a failure.

As I/O aborts now may occur outside of delete_association, counting for
completion must be wary and only count those aborted during
delete_association when TERMIO is set on the controller.

Signed-off-by: James Smart
Signed-off-by: Christoph Hellwig
22 Oct, 2020
6 commits
-
By default, we set the passthru request allocation flag such that it
returns the error in the following code path and we fail the I/O when
BLK_MQ_REQ_NOWAIT is used for request allocation:

nvme_alloc_request()
 blk_mq_alloc_request()
  blk_mq_queue_enter()
   if (flag & BLK_MQ_REQ_NOWAIT)
        return -EBUSY;

Reviewed-by: Logan Gunthorpe
Signed-off-by: Christoph Hellwig -
Clean up some confusing elements of nvmet_passthru_map_sg() by returning
early if the request is greater than the maximum bio size. This allows
us to drop the sg_cnt variable.

This should not result in any functional change but makes the code
clearer and more understandable. The original code allocated a truncated
bio then would return EINVAL when bio_add_pc_page() filled that bio. The
new code just returns EINVAL early if this would happen.

Fixes: c1fef73f793b ("nvmet: add passthru code to process commands")
Signed-off-by: Logan Gunthorpe
Suggested-by: Douglas Gilbert
Reviewed-by: Sagi Grimberg
Cc: Christoph Hellwig
Cc: Chaitanya Kulkarni
Signed-off-by: Christoph Hellwig -
nvmet_passthru_map_sg() only supports mapping a single BIO, not a chain
so the effective maximum transfer should also be limited by
BIO_MAX_PAGES (presently this works out to 1MB).

For PCI passthru devices the max_sectors would typically be more
limiting than BIO_MAX_PAGES, but this may not be true for all passthru
devices.

Fixes: c1fef73f793b ("nvmet: add passthru code to process commands")
Suggested-by: Christoph Hellwig
Signed-off-by: Logan Gunthorpe
Cc: Christoph Hellwig
Cc: Sagi Grimberg
Cc: Chaitanya Kulkarni
Signed-off-by: Christoph Hellwig -
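
The 1MB figure can be checked with a short calculation. The constants are assumptions chosen to match it (256 pages per bio, 4 KiB pages, 512-byte sectors), and `passthru_max_sectors()` is a hypothetical helper that treats a device limit of 0 as "no limit reported".

```c
#include <assert.h>

#define SIM_BIO_MAX_PAGES 256 /* assumed page limit of a single bio */
#define SIM_PAGE_SHIFT     12 /* assumed 4 KiB pages */
#define SIM_SECTOR_SHIFT    9 /* 512-byte sectors */

/* Smaller of two limits, ignoring a limit of 0 ("not set"). */
static unsigned int min_not_zero_u(unsigned int a, unsigned int b)
{
    if (a == 0)
        return b;
    if (b == 0)
        return a;
    return a < b ? a : b;
}

/* Effective transfer limit in sectors for a single-bio mapping. */
static unsigned int passthru_max_sectors(unsigned int max_hw_sectors)
{
    /* 256 pages * 4 KiB = 1 MiB = 2048 sectors of 512 bytes. */
    unsigned int bio_limit = SIM_BIO_MAX_PAGES <<
                             (SIM_PAGE_SHIFT - SIM_SECTOR_SHIFT);

    assert(bio_limit == 2048);
    return min_not_zero_u(bio_limit, max_hw_sectors);
}
```

A device reporting 4096 sectors is capped at 2048 by the bio limit, while a device reporting 1024 sectors keeps its own, tighter limit.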
When connecting a controller with a zero kato value using the following
command line

nvme connect -t tcp -n NQN -a ADDR -s PORT --keep-alive-tmo=0

the warning below can be reproduced:
WARNING: CPU: 1 PID: 241 at kernel/workqueue.c:1627 __queue_delayed_work+0x6d/0x90
with trace:
mod_delayed_work_on+0x59/0x90
nvmet_update_cc+0xee/0x100 [nvmet]
nvmet_execute_prop_set+0x72/0x80 [nvmet]
nvmet_tcp_try_recv_pdu+0x2f7/0x770 [nvmet_tcp]
nvmet_tcp_io_work+0x63f/0xb2d [nvmet_tcp]
...

This is caused by queuing up an uninitialized work. Although the
keep-alive timer is disabled during allocating the controller (fixed in
0d3b6a8d213a), ka_work still has a chance to run (called by
nvmet_start_ctrl).

Fixes: 0d3b6a8d213a ("nvmet: Disable keep-alive timer when kato is cleared to 0h")
Signed-off-by: zhenwei pi
Signed-off-by: Christoph Hellwig -
Like commit 5611ec2b9814 ("nvme-pci: prevent SK hynix PC400 from using
Write Zeroes command"), Sandisk Skyhawk has the same issue:
[ 6305.633887] blk_update_request: operation not supported error, dev nvme0n1, sector 340812032 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0

So also disable Write Zeroes command on Sandisk Skyhawk.
BugLink: https://bugs.launchpad.net/bugs/1899503
Signed-off-by: Kai-Heng Feng
Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Christoph Hellwig -
The request's rq_disk isn't set for passthrough IO commands, so tracing
uses qid 0 for these which incorrectly decodes as an admin command. Use
the request_queue's queuedata instead since that value is always set for
the IO queues, and never set for the admin queue.

Signed-off-by: Keith Busch
Signed-off-by: Christoph Hellwig