02 Sep, 2020

2 commits

  • blk-iocost calls blk_stat_enable_accounting() while holding an irqsafe
    lock, which triggers a lockdep splat because q->stats->lock isn't irqsafe.
    Let's make it irqsafe.

    Signed-off-by: Tejun Heo
    Fixes: cd006509b0a9 ("blk-iocost: account for IO size when testing latencies")
    Cc: stable@vger.kernel.org # v5.8+
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • ioc_pd_free() grabs the irq-safe ioc->lock without ensuring that irqs
    are disabled, even though it can be called with irqs either disabled or
    enabled. This has a small chance of causing A-A deadlocks and triggers
    lockdep splats. Use irqsave operations instead; a generic sketch of the
    idiom follows this entry.

    Signed-off-by: Tejun Heo
    Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
    Cc: stable@vger.kernel.org # v5.4+
    Signed-off-by: Jens Axboe

    Tejun Heo
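
    Both iocost fixes above are instances of the same rule: a lock that can
    be taken from contexts where interrupts may be either on or off must use
    the irqsave variants. A generic sketch of the idiom (illustrative names,
    not the actual blk-iocost code):

    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(example_lock);

    /*
     * May be called with interrupts enabled or disabled, so the irqsave
     * variant is required: it saves the caller's interrupt state in
     * `flags` and restores exactly that state on unlock, instead of
     * unconditionally re-enabling interrupts.
     */
    static void example_free_path(void)
    {
            unsigned long flags;

            spin_lock_irqsave(&example_lock, flags);
            /* ... touch state also reachable from irq context ... */
            spin_unlock_irqrestore(&example_lock, flags);
    }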
     

01 Sep, 2020

3 commits

  • We need to hold the whole-device bd_mutex to protect against another
    thread concurrently deleting our partition before we get to it, and thus
    causing a use-after-free (the locking pattern is sketched after this
    entry).

    Fixes: cddae808aeb7 ("block: pass a hd_struct to delete_partition")
    Reported-by: syzbot+6448f3c229bc52b82f69@syzkaller.appspotmail.com
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
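
    A simplified sketch of the pattern the fix describes: hold the whole-disk
    bd_mutex across the partition lookup and deletion so a concurrent deleter
    cannot free the partition underneath us (the real code lives in the block
    layer and involves more infrastructure than shown):

    /* Sketch: bdev is the whole-disk block device. */
    static int example_del_partition(struct block_device *bdev, int partno)
    {
            struct hd_struct *part;
            int ret = 0;

            mutex_lock(&bdev->bd_mutex);    /* fences concurrent deleters */
            part = disk_get_part(bdev->bd_disk, partno);
            if (!part) {
                    ret = -ENXIO;
                    goto out_unlock;
            }
            delete_partition(bdev->bd_disk, part);  /* still under bd_mutex */
            disk_put_part(part);
    out_unlock:
            mutex_unlock(&bdev->bd_mutex);
            return ret;
    }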
     
  • Commit e8c7d14ac6c3 ("block: revert back to synchronous request_queue removal")
    stopped releasing the request queue from wq context because that commit
    assumed all blk_put_queue() calls happen in a context that is allowed to
    sleep. However, this assumption isn't true: we release the disk's
    reference in the partition's percpu_ref ->release() callback, which is
    not allowed to sleep because ->release() runs via call_rcu().

    Fix this issue by moving the disk reference put into hd_struct_free_work()
    (the deferral pattern is sketched after this entry).

    Fixes: e8c7d14ac6c3 ("block: revert back to synchronous request_queue removal")
    Reported-by: Ilya Dryomov
    Signed-off-by: Ming Lei
    Tested-by: Ilya Dryomov
    Reviewed-by: Christoph Hellwig
    Cc: Luis Chamberlain
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Signed-off-by: Jens Axboe

    Ming Lei
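
    The underlying rule: a percpu_ref ->release() callback runs from
    call_rcu(), i.e. atomic context, so anything that might sleep (such as
    dropping the final disk reference) must be punted to process context. A
    generic sketch of that deferral (struct and function names are
    illustrative, not the actual hd_struct code):

    #include <linux/percpu-refcount.h>
    #include <linux/workqueue.h>
    #include <linux/genhd.h>
    #include <linux/slab.h>

    struct example_obj {
            struct percpu_ref ref;
            struct gendisk *disk;
            struct work_struct free_work;
    };

    /* Runs from a workqueue, where sleeping is allowed. */
    static void example_free_work(struct work_struct *work)
    {
            struct example_obj *obj =
                    container_of(work, struct example_obj, free_work);

            put_disk(obj->disk);    /* final put may release the queue: may sleep */
            kfree(obj);
    }

    /* Runs via call_rcu(): must not sleep, so only schedule the work. */
    static void example_ref_release(struct percpu_ref *ref)
    {
            struct example_obj *obj =
                    container_of(ref, struct example_obj, ref);

            INIT_WORK(&obj->free_work, example_free_work);
            queue_work(system_wq, &obj->free_work);
    }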
     
  • If a driver leaves the limit settings as the defaults, then we don't
    initialize bdi->io_pages. This means that file systems may need to
    work around bdi->io_pages == 0, which is somewhat messy.

    Initialize the default value just like we do for ->ra_pages (sketched
    after this entry).

    Cc: stable@vger.kernel.org
    Fixes: 9491ae4aade6 ("mm: don't cap request size based on read-ahead setting")
    Reported-by: OGAWA Hirofumi
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
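
    The fix amounts to giving ->io_pages the same readahead-based default
    that ->ra_pages already gets at setup time. A fragment-level sketch,
    assuming the VM_READAHEAD_PAGES default the readahead code uses:

    bdi->ra_pages = VM_READAHEAD_PAGES;     /* existing default */
    bdi->io_pages = VM_READAHEAD_PAGES;     /* new: never leave it 0 */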
     

30 Aug, 2020

1 commit

  • Pull NVMe fixes from Sagi:

    "- instance leak and io boundary fixes from Keith
    - fc locking fix from Christophe
    - various tcp/rdma reset during traffic fixes from Me
    - pci use-after-free fix from Tong
    - tcp target null deref fix from Ziye"

    * 'nvme-5.9-rc' of git://git.infradead.org/nvme:
    nvme-pci: cancel nvme device request before disabling
    nvme: only use power of two io boundaries
    nvme: fix controller instance leak
    nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'
    nvme: Fix NULL dereference for pci nvme controllers
    nvme-rdma: fix reset hang if controller died in the middle of a reset
    nvme-rdma: fix timeout handler
    nvme-rdma: serialize controller teardown sequences
    nvme-tcp: fix reset hang if controller died in the middle of a reset
    nvme-tcp: fix timeout handler
    nvme-tcp: serialize controller teardown sequences
    nvme: have nvme_wait_freeze_timeout return if it timed out
    nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
    nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu

    Jens Axboe
     

29 Aug, 2020

14 commits

  • This patch addresses an irq free warning and a NULL pointer dereference
    that occur when an nvme device hits a timeout error during
    initialization. The problem happens when nvme_timeout() is called while
    nvme_reset_work() is still executing. Fix it by setting the problematic
    request's flag to NVME_REQ_CANCELLED before calling nvme_dev_disable(),
    to make sure __nvme_submit_sync_cmd() returns an error code and
    nvme_submit_sync_cmd() fails gracefully. The following is the console
    output:

    [ 62.472097] nvme nvme0: I/O 13 QID 0 timeout, disable controller
    [ 62.488796] nvme nvme0: could not set timestamp (881)
    [ 62.494888] ------------[ cut here ]------------
    [ 62.495142] Trying to free already-free IRQ 11
    [ 62.495366] WARNING: CPU: 0 PID: 7 at kernel/irq/manage.c:1751 free_irq+0x1f7/0x370
    [ 62.495742] Modules linked in:
    [ 62.495902] CPU: 0 PID: 7 Comm: kworker/u4:0 Not tainted 5.8.0+ #8
    [ 62.496206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-p4
    [ 62.496772] Workqueue: nvme-reset-wq nvme_reset_work
    [ 62.497019] RIP: 0010:free_irq+0x1f7/0x370
    [ 62.497223] Code: e8 ce 49 11 00 48 83 c4 08 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 44 89 f6 48 c70
    [ 62.498133] RSP: 0000:ffffa96800043d40 EFLAGS: 00010086
    [ 62.498391] RAX: 0000000000000000 RBX: ffff9b87fc458400 RCX: 0000000000000000
    [ 62.498741] RDX: 0000000000000001 RSI: 0000000000000096 RDI: ffffffff9693d72c
    [ 62.499091] RBP: ffff9b87fd4c8f60 R08: ffffa96800043bfd R09: 0000000000000163
    [ 62.499440] R10: ffffa96800043bf8 R11: ffffa96800043bfd R12: ffff9b87fd4c8e00
    [ 62.499790] R13: ffff9b87fd4c8ea4 R14: 000000000000000b R15: ffff9b87fd76b000
    [ 62.500140] FS: 0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
    [ 62.500534] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 62.500816] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
    [ 62.501165] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 62.501515] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 62.501864] Call Trace:
    [ 62.501993] pci_free_irq+0x13/0x20
    [ 62.502167] nvme_reset_work+0x5d0/0x12a0
    [ 62.502369] ? update_load_avg+0x59/0x580
    [ 62.502569] ? ttwu_queue_wakelist+0xa8/0xc0
    [ 62.502780] ? try_to_wake_up+0x1a2/0x450
    [ 62.502979] process_one_work+0x1d2/0x390
    [ 62.503179] worker_thread+0x45/0x3b0
    [ 62.503361] ? process_one_work+0x390/0x390
    [ 62.503568] kthread+0xf9/0x130
    [ 62.503726] ? kthread_park+0x80/0x80
    [ 62.503911] ret_from_fork+0x22/0x30
    [ 62.504090] ---[ end trace de9ed4a70f8d71e2 ]---
    [ 123.912275] nvme nvme0: I/O 12 QID 0 timeout, disable controller
    [ 123.914670] nvme nvme0: 1/0/0 default/read/poll queues
    [ 123.916310] BUG: kernel NULL pointer dereference, address: 0000000000000000
    [ 123.917469] #PF: supervisor write access in kernel mode
    [ 123.917725] #PF: error_code(0x0002) - not-present page
    [ 123.917976] PGD 0 P4D 0
    [ 123.918109] Oops: 0002 [#1] SMP PTI
    [ 123.918283] CPU: 0 PID: 7 Comm: kworker/u4:0 Tainted: G W 5.8.0+ #8
    [ 123.918650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-p4
    [ 123.919219] Workqueue: nvme-reset-wq nvme_reset_work
    [ 123.919469] RIP: 0010:__blk_mq_alloc_map_and_request+0x21/0x80
    [ 123.919757] Code: 66 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 63 ee 53 48 8b 47 68 89 ee 48 89 fb 8b4
    [ 123.920657] RSP: 0000:ffffa96800043d40 EFLAGS: 00010286
    [ 123.920912] RAX: ffff9b87fc4fee40 RBX: ffff9b87fc8cb008 RCX: 0000000000000000
    [ 123.921258] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b87fc618000
    [ 123.921602] RBP: 0000000000000000 R08: ffff9b87fdc2c4a0 R09: ffff9b87fc616000
    [ 123.921949] R10: 0000000000000000 R11: ffff9b87fffd1500 R12: 0000000000000000
    [ 123.922295] R13: 0000000000000000 R14: ffff9b87fc8cb200 R15: ffff9b87fc8cb000
    [ 123.922641] FS: 0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
    [ 123.923032] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 123.923312] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
    [ 123.923660] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 123.924007] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 123.924353] Call Trace:
    [ 123.924479] blk_mq_alloc_tag_set+0x137/0x2a0
    [ 123.924694] nvme_reset_work+0xed6/0x12a0
    [ 123.924898] process_one_work+0x1d2/0x390
    [ 123.925099] worker_thread+0x45/0x3b0
    [ 123.925280] ? process_one_work+0x390/0x390
    [ 123.925486] kthread+0xf9/0x130
    [ 123.925642] ? kthread_park+0x80/0x80
    [ 123.925825] ret_from_fork+0x22/0x30
    [ 123.926004] Modules linked in:
    [ 123.926158] CR2: 0000000000000000
    [ 123.926322] ---[ end trace de9ed4a70f8d71e3 ]---
    [ 123.926549] RIP: 0010:__blk_mq_alloc_map_and_request+0x21/0x80
    [ 123.926832] Code: 66 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 63 ee 53 48 8b 47 68 89 ee 48 89 fb 8b4
    [ 123.927734] RSP: 0000:ffffa96800043d40 EFLAGS: 00010286
    [ 123.927989] RAX: ffff9b87fc4fee40 RBX: ffff9b87fc8cb008 RCX: 0000000000000000
    [ 123.928336] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b87fc618000
    [ 123.928679] RBP: 0000000000000000 R08: ffff9b87fdc2c4a0 R09: ffff9b87fc616000
    [ 123.929025] R10: 0000000000000000 R11: ffff9b87fffd1500 R12: 0000000000000000
    [ 123.929370] R13: 0000000000000000 R14: ffff9b87fc8cb200 R15: ffff9b87fc8cb000
    [ 123.929715] FS: 0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
    [ 123.930106] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 123.930384] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
    [ 123.930731] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 123.931077] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

    Co-developed-by: Keith Busch
    Signed-off-by: Tong Zhang
    Reviewed-by: Keith Busch
    Signed-off-by: Sagi Grimberg

    Tong Zhang
     
  • The kernel requires a power of two for boundaries because that's the
    only way it can efficiently split commands that cross them. A
    controller, however, may report a non-power of two boundary.

    The driver had been rounding the controller's value to one the kernel
    can use, but splitting on the wrong boundary provides no benefit on the
    device side, and incurs additional submission overhead from non-optimal
    splits.

    Don't provide any boundary hint if the controller's value can't be used,
    and log a warning when first scanning a disk whose reported IO boundary
    can't be used. Since the chunk sector logic has grown, move it to a
    separate function (sketched after this entry).

    Cc: Martin K. Petersen
    Signed-off-by: Keith Busch
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Sagi Grimberg

    Keith Busch
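
    A condensed, hedged sketch of the boundary logic after the change
    (function and field names simplified from the nvme driver):

    #include <linux/log2.h>

    static void example_set_chunk_sectors(struct nvme_ns *ns, u32 iob)
    {
            if (!iob)
                    return;

            if (!is_power_of_2(iob)) {
                    /*
                     * Splitting on a rounded boundary helps nothing on
                     * the device side and costs extra splits, so warn
                     * and give the block layer no hint at all.
                     */
                    pr_warn("%s: ignoring unaligned IO boundary: %u\n",
                            ns->disk->disk_name, iob);
                    return;
            }

            blk_queue_chunk_sectors(ns->queue, iob);
    }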
     
  • If the driver has to unbind from the controller for an early failure
    before the subsystem has been set up, there won't be a subsystem holding
    the controller's instance, so the controller needs to free its own
    instance in this case.

    Fixes: 733e4b69d508d ("nvme: Assign subsys instance from first ctrl")
    Signed-off-by: Keith Busch
    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg

    Keith Busch
     
  • The way 'spin_lock()' and 'spin_lock_irqsave()' are used is not consistent
    in this function.

    Use 'spin_lock_irqsave()' here as well, as there is no guarantee that
    interrupts are disabled at that point, according to the surrounding code.

    Fixes: a97ec51b37ef ("nvmet_fc: Rework target side abort handling")
    Signed-off-by: Christophe JAILLET
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg

    Christophe JAILLET
     
  • PCIe controllers do not have fabric opts; verify they exist before
    showing the ctrl_loss_tmo or reconnect_delay attributes.

    Fixes: 764075fdcb2f ("nvme: expose reconnect_delay and ctrl_loss_tmo via sysfs")
    Reported-by: Tobias Markus
    Reviewed-by: Keith Busch
    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
     
  • If the controller becomes unresponsive in the middle of a reset, we
    will hang because we are waiting for the freeze to complete, but that
    cannot happen since we have inflight commands holding the
    q_usage_counter, and we can't blindly fail requests that time out.

    So give the wait a timeout, and if the queue freeze cannot complete in
    time, unfreeze, fail, and have the error handling take care of how to
    proceed (either schedule a reconnect or remove the controller); a
    fragment of this pattern is sketched after this entry.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
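
    A sketch of the bounded wait this describes, assuming the
    nvme_wait_freeze_timeout() helper (changed elsewhere in this pull to
    report whether the freeze completed); this is a reset-path fragment, not
    the verbatim driver code:

    nvme_start_queues(ctrl);
    if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
            /*
             * The controller died mid-reset: inflight commands pin
             * q_usage_counter, so the freeze can never complete.  Fail
             * the reset and let error handling schedule a reconnect or
             * remove the controller.
             */
            ret = -ENODEV;
            goto out_failed;
    }
    nvme_unfreeze(ctrl);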
     
  • When a request times out in a LIVE state, we simply trigger error
    recovery and let the error recovery handle the request cancellation.
    However, when a request times out in a non-LIVE state, we make sure to
    complete it immediately, as it might block controller setup or teardown
    and prevent forward progress.

    Tearing down the entire set of I/O and admin queues causes a
    freeze/unfreeze imbalance (q->mq_freeze_depth) and is really overkill
    for what we actually need, which is to just fence controller teardown
    that may be running, stop the queue, and cancel the request if it is
    not already completed.

    Now that we have the controller teardown_lock, we can safely serialize
    request cancellation. This addresses a hang caused by calling an extra
    queue freeze on controller namespaces, causing unfreeze to not complete
    correctly.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Smart
    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
     
  • In the timeout handler we may need to complete a request because the
    request that timed out may be an I/O that is a part of a serial sequence
    of controller teardown or initialization. In order to complete the
    request, we need to fence any other context that may compete with us
    and complete the request that is timing out.

    In this case, we could have a potential double completion in case
    a hard-irq or a different competing context triggered error recovery
    and is running inflight request cancellation concurrently with the
    timeout handler.

    Protect using a per-controller teardown_lock to serialize contexts that
    may complete a cancelled request due to error recovery or a reset (a
    sketch follows this entry).

    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Smart
    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
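
    A sketch of the serialization the previous two entries rely on: every
    path that may cancel or complete inflight requests (teardown, error
    recovery, the timeout handler) takes the same per-controller mutex, so a
    request can only be completed once (structure and names simplified from
    the descriptions above):

    #include <linux/blk-mq.h>
    #include <linux/mutex.h>

    struct example_ctrl {
            struct mutex teardown_lock;     /* fences teardown vs. timeout */
            /* ... */
    };

    static void example_teardown_io_queues(struct example_ctrl *ctrl)
    {
            mutex_lock(&ctrl->teardown_lock);
            /* ... stop queues and cancel inflight requests ... */
            mutex_unlock(&ctrl->teardown_lock);
    }

    /* Timeout handler path for a request that expired while not LIVE. */
    static void example_complete_timed_out(struct example_ctrl *ctrl,
                                           struct request *rq)
    {
            mutex_lock(&ctrl->teardown_lock);
            /* Complete rq only if teardown has not already done so. */
            if (blk_mq_request_started(rq) && !blk_mq_request_completed(rq))
                    blk_mq_complete_request(rq);
            mutex_unlock(&ctrl->teardown_lock);
    }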
     
  • If the controller becomes unresponsive in the middle of a reset, we will
    hang because we are waiting for the freeze to complete, but that cannot
    happen since we have inflight commands holding the q_usage_counter, and
    we can't blindly fail requests that time out.

    So give the wait a timeout, and if the queue freeze cannot complete in
    time, unfreeze, fail, and have the error handling take care of how to
    proceed (either schedule a reconnect or remove the controller).

    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
     
  • When a request times out in a LIVE state, we simply trigger error
    recovery and let the error recovery handle the request cancellation.
    However, when a request times out in a non-LIVE state, we make sure to
    complete it immediately, as it might block controller setup or teardown
    and prevent forward progress.

    Tearing down the entire set of I/O and admin queues causes a
    freeze/unfreeze imbalance (q->mq_freeze_depth) and is really overkill
    for what we actually need, which is to just fence controller teardown
    that may be running, stop the queue, and cancel the request if it is
    not already completed.

    Now that we have the controller teardown_lock, we can safely serialize
    request cancellation. This addresses a hang caused by calling an extra
    queue freeze on controller namespaces, causing unfreeze to not complete
    correctly.

    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
     
  • In the timeout handler we may need to complete a request because the
    request that timed out may be an I/O that is a part of a serial sequence
    of controller teardown or initialization. In order to complete the
    request, we need to fence any other context that may compete with us
    and complete the request that is timing out.

    In this case, we could have a potential double completion in case
    a hard-irq or a different competing context triggered error recovery
    and is running inflight request cancellation concurrently with the
    timeout handler.

    Protect using a ctrl teardown_lock to serialize contexts that may
    complete a cancelled request due to error recovery or a reset.

    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
     
  • Users can detect whether the wait has completed or not and take
    appropriate action based on this information (e.g. whether to continue
    initialization or rather fail and schedule another initialization
    attempt).

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
     
  • NVME_CTRL_NEW should never see any I/O, because in order to start
    initialization it has to transition to NVME_CTRL_CONNECTING and from
    there it will never return to this state.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
     
  • When handling commands without in-capsule data, we assign the ttag
    assuming we already have the queue's command array allocated (based
    on the queue size information in the connect data payload). However,
    if the connect itself did not send the connect data in-capsule, we
    have yet to allocate the queue commands, so we will assign a bogus
    ttag and suffer a NULL dereference when we receive the corresponding
    h2cdata pdu.

    Fix this by checking whether we have already allocated commands before
    dereferencing the array when handling h2cdata; if we haven't, it is for
    sure a connect, and we should use the preallocated connect command (see
    the sketch after this entry).

    Signed-off-by: Ziye Yang
    Signed-off-by: Sagi Grimberg

    Ziye Yang
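
    A hedged sketch of the guard: if the h2cdata arrives before the queue's
    command array exists, the only command it can belong to is the
    preallocated connect command (field names follow the description,
    details simplified from the nvmet-tcp code):

    static struct nvmet_tcp_cmd *
    example_lookup_cmd(struct nvmet_tcp_queue *queue, u16 ttag)
    {
            /*
             * No command array yet: the connect did not carry in-capsule
             * data, so this data must be for the connect command itself.
             */
            if (unlikely(!queue->nr_cmds))
                    return &queue->connect;

            return &queue->cmds[ttag];
    }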
     

26 Aug, 2020

2 commits

  • The device size calculation was done before processing the loop
    configuration, which meant that we set the size on the underlying
    block device incorrectly in case lo_offset/lo_sizelimit were set in the
    configuration. Delay computing the size until we've set up the device
    parameters correctly (the ordering is sketched after this entry).

    Fixes: 3448914e8cc5 ("loop: Add LOOP_CONFIGURE ioctl")
    Reported-by: Lennart Poettering
    Tested-by: Yang Xu
    Signed-off-by: Martijn Coenen
    Signed-off-by: Jens Axboe

    Martijn Coenen
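
    The fix is purely an ordering change. Sketched below with assumed helper
    names (example_*); only lo_offset/lo_sizelimit are real loop fields:

    static int example_loop_configure(struct loop_device *lo,
                                      const struct loop_config *cfg)
    {
            int err;

            /* Apply the configuration first: sets lo_offset/lo_sizelimit. */
            err = example_apply_params(lo, cfg);
            if (err)
                    return err;

            /*
             * Only now compute the visible size (capacity minus offset,
             * capped by sizelimit) and push it to the block device.
             */
            example_set_size(lo, example_calculate_size(lo));
            return 0;
    }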
     
  • Suppose we configure the io timeout of nbd0 to 100s. Later, after we
    have finished using it, we configure nbd0 again and set the io timeout
    to 0. We expect it to time out after 30 seconds and keep retrying, but
    in fact we cannot change the timeout when we set it to 0: the timeout
    is still the original 100s.

    So change the timeout to the default 30s when it is set to zero (see
    the sketch after this entry). This also matches the behavior of commit
    2da22da57348 ("nbd: fix zero cmd timeout handling v2").

    This becomes more important if we reconfigure an nbd device with the io
    timeout set to zero, because it could take 30s to detect the new socket,
    and thus io could be completed more quickly compared to 100s.

    Signed-off-by: Hou Pu
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Hou Pu
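
    A sketch of the resulting behavior, modeled on the nbd timeout setter
    (30s being the default mentioned above):

    static void example_nbd_set_cmd_timeout(struct nbd_device *nbd, u64 timeout)
    {
            nbd->tag_set.timeout = timeout * HZ;
            if (timeout)
                    blk_queue_rq_timeout(nbd->disk->queue, timeout * HZ);
            else
                    /* zero now restores the default instead of keeping
                     * whatever timeout was configured before */
                    blk_queue_rq_timeout(nbd->disk->queue, 30 * HZ);
    }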
     

22 Aug, 2020

16 commits

  • REQ_FUA should be checked using rq->cmd_flags instead of req_op() (the
    distinction is sketched after this entry).

    Fixes: deb78b419dfda ("nullb: emulate cache")
    Signed-off-by: Hou Pu
    Signed-off-by: Jens Axboe

    Hou Pu
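
    The distinction in a two-line sketch: req_op() masks out everything but
    the operation bits, so it can never see REQ_FUA, which lives in the flag
    bits of cmd_flags:

    /* Wrong: the op field never contains flag bits, so always false. */
    if (req_op(rq) & REQ_FUA) { /* ... */ }

    /* Right: FUA is a request flag, test it on cmd_flags. */
    if (rq->cmd_flags & REQ_FUA) { /* ... */ }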
     
  • Per the NVMe spec, when the keep-alive timeout is set to zero the
    keep-alive timer should be disabled (sketched after this entry).

    Signed-off-by: Amit Engel
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Amit Engel
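
    Sketched, per the spec behavior described (names modeled on the nvmet
    keep-alive code, simplified):

    static void example_start_keep_alive(struct nvmet_ctrl *ctrl)
    {
            if (!ctrl->kato)
                    return; /* KATO of 0h: keep-alive disabled, no timer */

            schedule_delayed_work(&ctrl->ka_work, ctrl->kato * HZ);
    }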
     
  • If a command sent through nvme-multipath failed on a dying queue, resend
    it on another path.

    Signed-off-by: Chao Leng
    [hch: rebased on top of the completion refactoring]
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Mike Snitzer
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Chao Leng
     
  • Check the SCT sub-field for a path related status instead of enumerating
    individual status codes. As of NVMe 1.4 this adds "Internal Path Error"
    and "Controller Pathing Error" to the list, but it also future-proofs
    against additional status codes added to the category (the check is
    sketched after this entry).

    Suggested-by: Chao Leng
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mike Snitzer
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Christoph Hellwig
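
    The check reduces to masking the Status Code Type field. A sketch
    matching the description (per the NVMe spec, SCT occupies bits 10:8 of
    the 16-bit status, and type 3h is "path related status"):

    static inline bool example_is_path_error(u16 status)
    {
            /*
             * SCT == 0x3 covers every path related status code,
             * including ones added by future spec revisions.
             */
            return (status & 0x700) == 0x300;
    }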
     
  • Lift all the code that decides the disposition of a completed command
    out of nvme_complete_rq and nvme_failover_req into a new helper, which
    returns an enum of the potential actions. nvme_complete_rq then just
    switches on those and calls the proper helper for the action (the shape
    of this refactoring is sketched after this entry).

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mike Snitzer
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Christoph Hellwig
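
    A sketch of the refactoring's shape: a pure decision helper returning an
    action enum, switched on by the completion path (names and the exact
    decision logic are simplified from the description):

    enum example_disposition {
            EXAMPLE_COMPLETE,       /* finish the request with its status */
            EXAMPLE_RETRY,          /* requeue on the same path */
            EXAMPLE_FAILOVER,       /* multipath: resend on another path */
    };

    static enum example_disposition example_decide(struct request *req)
    {
            if (likely(nvme_req(req)->status == 0))
                    return EXAMPLE_COMPLETE;
            if (blk_noretry_request(req) ||
                nvme_req(req)->retries >= nvme_max_retries)
                    return EXAMPLE_COMPLETE;
            if (example_is_path_error(nvme_req(req)->status))
                    return EXAMPLE_FAILOVER; /* see the SCT sketch above */
            return EXAMPLE_RETRY;
    }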
     
  • nvme_end_request is a bit misnamed, as it wraps around the
    blk_mq_complete_* API. Its semantics are also non-trivial, so give it
    a more descriptive name and add a comment explaining the semantics.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Mike Snitzer
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Zoned block devices reuse the chunk_sectors queue limit to define zone
    boundaries. If such a device happens to also report an optimal
    boundary, do not use that to define the chunk_sectors, as that may
    intermittently interfere with io splitting and zone size queries.

    Signed-off-by: Keith Busch
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • All operations are based on the controller page size, not the host page
    size. Switch the dma pool to use the controller page size as well to
    avoid massive overallocations on large page size systems (sketched after
    this entry).

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Christoph Hellwig
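
    Sketched as a fragment (NVME_CTRL_PAGE_SIZE is the controller page size
    macro; the call mirrors the pci driver's pool setup):

    /*
     * Size PRP list allocations by what the device actually uses, not
     * the host PAGE_SIZE, which can be 64k on some architectures and
     * would waste most of each allocation.
     */
    dev->prp_page_pool = dma_pool_create("prp list page", dev->dev,
                                         NVME_CTRL_PAGE_SIZE,
                                         NVME_CTRL_PAGE_SIZE, 0);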
     
  • Recently nvme_dev.q_depth was changed from an int to a u16 type.

    This falls over for the queue depth calculation in nvme_pci_enable(),
    where NVME_CAP_MQES(dev->ctrl.cap) + 1 may overflow as a u16, as
    NVME_CAP_MQES() is itself a 16-bit value. That happens for me, and this
    is the result:

    root@ubuntu:/home/john# [148.272996] Unable to handle kernel NULL pointer
    dereference at virtual address 0000000000000010
    Mem abort info:
    ESR = 0x96000004
    EC = 0x25: DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    Data abort info:
    ISV = 0, ISS = 0x00000004
    CM = 0, WnR = 0
    user pgtable: 4k pages, 48-bit VAs, pgdp=00000a27bf3c9000
    [0000000000000010] pgd=0000000000000000, p4d=0000000000000000
    Internal error: Oops: 96000004 [#1] PREEMPT SMP
    Modules linked in: nvme nvme_core
    CPU: 56 PID: 256 Comm: kworker/u195:0 Not tainted
    5.8.0-next-20200812 #27
    Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 -
    V1.16.01 03/15/2019
    Workqueue: nvme-reset-wq nvme_reset_work [nvme]
    pstate: 80c00009 (Nzcv daif +PAN +UAO BTYPE=--)
    pc : __sg_alloc_table_from_pages+0xec/0x238
    lr : __sg_alloc_table_from_pages+0xc8/0x238
    sp : ffff800013ccbad0
    x29: ffff800013ccbad0 x28: ffff0a27b3d380a8
    x27: 0000000000000000 x26: 0000000000002dc2
    x25: 0000000000000dc0 x24: 0000000000000000
    x23: 0000000000000000 x22: ffff800013ccbbe8
    x21: 0000000000000010 x20: 0000000000000000
    x19: 00000000fffff000 x18: ffffffffffffffff
    x17: 00000000000000c0 x16: fffffe289eaf6380
    x15: ffff800011b59948 x14: ffff002bc8fe98f8
    x13: ff00000000000000 x12: ffff8000114ca000
    x11: 0000000000000000 x10: ffffffffffffffff
    x9 : ffffffffffffffc0 x8 : ffff0a27b5f9b6a0
    x7 : 0000000000000000 x6 : 0000000000000001
    x5 : ffff0a27b5f9b680 x4 : 0000000000000000
    x3 : ffff0a27b5f9b680 x2 : 0000000000000000
    x1 : 0000000000000001 x0 : 0000000000000000
    Call trace:
    __sg_alloc_table_from_pages+0xec/0x238
    sg_alloc_table_from_pages+0x18/0x28
    iommu_dma_alloc+0x474/0x678
    dma_alloc_attrs+0xd8/0xf0
    nvme_alloc_queue+0x114/0x160 [nvme]
    nvme_reset_work+0xb34/0x14b4 [nvme]
    process_one_work+0x1e8/0x360
    worker_thread+0x44/0x478
    kthread+0x150/0x158
    ret_from_fork+0x10/0x34
    Code: f94002c3 6b01017f 540007c2 11000486 (f8645aa5)
    ---[ end trace 89bb2b72d59bf925 ]---

    Fix by making the calculation use a u32 (the overflow is sketched after
    this entry).

    Also use u32 for nvme_queue.q_depth, as we assign this value from
    nvme_dev.q_depth, and nvme_dev.q_depth will possibly hold 65536 - this
    avoids the same crash as above.

    Fixes: 61f3b8963097 ("nvme-pci: use unsigned for io queue depth")
    Signed-off-by: John Garry
    Reviewed-by: Keith Busch
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    John Garry
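
    The arithmetic at fault, sketched: MQES is a 16-bit CAP field holding
    the maximum queue entries minus one, so the "+ 1" can wrap a u16 to
    zero:

    u16 mqes = 0xffff;              /* largest possible MQES value */
    u16 bad_depth = mqes + 1;       /* promoted to 65536, truncated to 0 */
    u32 good_depth = (u32)mqes + 1; /* 65536, as intended */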
     
  • When locking the ctrl->lock spinlock, IRQs need to be disabled to avoid
    a deadlock. The recently added spin_lock() calls produce the following
    lockdep warning when running the blktests test nvme/003:

    ================================
    WARNING: inconsistent lock state
    --------------------------------
    inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    ksoftirqd/2/22 [HC0[0]:SC1[1]:HE0:SE0] takes:
    ffff888276a8c4c0 (&ctrl->lock){+.?.}-{2:2}, at: nvme_keep_alive_end_io+0x50/0xc0
    {SOFTIRQ-ON-W} state was registered at:
    lock_acquire+0x164/0x500
    _raw_spin_lock+0x28/0x40
    nvme_get_effects_log+0x37/0x1c0
    nvme_init_identify+0x9e4/0x14f0
    nvme_reset_work+0xadd/0x2360
    process_one_work+0x66b/0xb70
    worker_thread+0x6e/0x6c0
    kthread+0x1e7/0x210
    ret_from_fork+0x22/0x30
    irq event stamp: 1449221
    hardirqs last enabled at (1449220): [] ktime_get+0xf9/0x140
    hardirqs last disabled at (1449221): [] _raw_spin_lock_irqsave+0x25/0x60
    softirqs last enabled at (1449210): [] __do_softirq+0x447/0x595
    softirqs last disabled at (1449215): [] run_ksoftirqd+0x35/0x50

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&ctrl->lock);
    <Interrupt>
      lock(&ctrl->lock);

    *** DEADLOCK ***

    no locks held by ksoftirqd/2/22.

    stack backtrace:
    CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 5.8.0-rc4-eid-vmlocalyes-dbg-00157-g7236657c6b3a #1450
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014
    Call Trace:
    dump_stack+0xc8/0x11a
    print_usage_bug.cold.63+0x235/0x23e
    mark_lock+0xa9c/0xcf0
    __lock_acquire+0xd9a/0x2b50
    lock_acquire+0x164/0x500
    _raw_spin_lock_irqsave+0x40/0x60
    nvme_keep_alive_end_io+0x50/0xc0
    blk_mq_end_request+0x158/0x210
    nvme_complete_rq+0x146/0x500
    nvme_loop_complete_rq+0x26/0x30 [nvme_loop]
    blk_done_softirq+0x187/0x1e0
    __do_softirq+0x118/0x595
    run_ksoftirqd+0x35/0x50
    smpboot_thread_fn+0x1d3/0x310
    kthread+0x1e7/0x210
    ret_from_fork+0x22/0x30

    Fixes: be93e87e7802 ("nvme: support for multiple Command Sets Supported and Effects log pages")
    Signed-off-by: Logan Gunthorpe
    Reviewed-by: Keith Busch
    Tested-by: Chaitanya Kulkarni
    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Logan Gunthorpe
     
  • Instead of calling blk_put_request(), which calls blk_mq_free_request(),
    call blk_mq_free_request() directly for NVMeOF passthru. This mainly
    avoids an extra function call in the completion path,
    nvmet_passthru_req_done().

    Signed-off-by: Chaitanya Kulkarni
    Reviewed-by: Keith Busch
    Reviewed-by: Logan Gunthorpe
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
     
  • In the existing NVMeOF passthru command handling, on failure of
    nvme_alloc_request() we error out with the rq value set to NULL. The
    error handling path then calls blk_put_request() without checking
    whether rq is NULL, which produces the following Oops:

    [ 1457.346861] BUG: kernel NULL pointer dereference, address: 0000000000000000
    [ 1457.347838] #PF: supervisor read access in kernel mode
    [ 1457.348464] #PF: error_code(0x0000) - not-present page
    [ 1457.349085] PGD 0 P4D 0
    [ 1457.349402] Oops: 0000 [#1] SMP NOPTI
    [ 1457.349851] CPU: 18 PID: 10782 Comm: kworker/18:2 Tainted: G OE 5.8.0-rc4nvme-5.9+ #35
    [ 1457.350951] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e3214
    [ 1457.352347] Workqueue: events nvme_loop_execute_work [nvme_loop]
    [ 1457.353062] RIP: 0010:blk_mq_free_request+0xe/0x110
    [ 1457.353651] Code: 3f ff ff ff 83 f8 01 75 0d 4c 89 e7 e8 1b db ff ff e9 2d ff ff ff 0f 0b eb ef 66 8
    [ 1457.355975] RSP: 0018:ffffc900035b7de0 EFLAGS: 00010282
    [ 1457.356636] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
    [ 1457.357526] RDX: ffffffffa060bd05 RSI: 0000000000000000 RDI: 0000000000000000
    [ 1457.358416] RBP: 0000000000000037 R08: 0000000000000000 R09: 0000000000000000
    [ 1457.359317] R10: 0000000000000000 R11: 000000000000006d R12: 0000000000000000
    [ 1457.360424] R13: ffff8887ffa68600 R14: 0000000000000000 R15: ffff8888150564c8
    [ 1457.361322] FS: 0000000000000000(0000) GS:ffff888814600000(0000) knlGS:0000000000000000
    [ 1457.362337] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1457.363058] CR2: 0000000000000000 CR3: 000000081c0ac000 CR4: 00000000003406e0
    [ 1457.363973] Call Trace:
    [ 1457.364296] nvmet_passthru_execute_cmd+0x150/0x2c0 [nvmet]
    [ 1457.364990] process_one_work+0x24e/0x5a0
    [ 1457.365493] ? __schedule+0x353/0x840
    [ 1457.365957] worker_thread+0x3c/0x380
    [ 1457.366426] ? process_one_work+0x5a0/0x5a0
    [ 1457.366948] kthread+0x135/0x150
    [ 1457.367362] ? kthread_create_on_node+0x60/0x60
    [ 1457.367934] ret_from_fork+0x22/0x30
    [ 1457.368388] Modules linked in: nvme_loop(OE) nvmet(OE) nvme_fabrics(OE) null_blk nvme(OE) nvme_corer
    [ 1457.368414] ata_piix crc32c_intel virtio_pci libata virtio_ring serio_raw t10_pi virtio floppy dm_]
    [ 1457.380849] CR2: 0000000000000000
    [ 1457.381288] ---[ end trace c6cab61bfd1f68fd ]---
    [ 1457.381861] RIP: 0010:blk_mq_free_request+0xe/0x110
    [ 1457.382469] Code: 3f ff ff ff 83 f8 01 75 0d 4c 89 e7 e8 1b db ff ff e9 2d ff ff ff 0f 0b eb ef 66 8
    [ 1457.384749] RSP: 0018:ffffc900035b7de0 EFLAGS: 00010282
    [ 1457.385393] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
    [ 1457.386264] RDX: ffffffffa060bd05 RSI: 0000000000000000 RDI: 0000000000000000
    [ 1457.387142] RBP: 0000000000000037 R08: 0000000000000000 R09: 0000000000000000
    [ 1457.388029] R10: 0000000000000000 R11: 000000000000006d R12: 0000000000000000
    [ 1457.388914] R13: ffff8887ffa68600 R14: 0000000000000000 R15: ffff8888150564c8
    [ 1457.389798] FS: 0000000000000000(0000) GS:ffff888814600000(0000) knlGS:0000000000000000
    [ 1457.390796] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1457.391508] CR2: 0000000000000000 CR3: 000000081c0ac000 CR4: 00000000003406e0
    [ 1457.392525] Kernel panic - not syncing: Fatal exception
    [ 1457.394138] Kernel Offset: disabled
    [ 1457.394677] ---[ end Kernel panic - not syncing: Fatal exception ]---

    Fix this Oops by adding a new goto label out_put_req and reordering the
    blk_put_request() call so that blk_put_request() is never called with rq
    set to NULL; the rest of the code is updated accordingly (the resulting
    label ladder is sketched after this entry).

    Fixes: 06b7164dfdc0 ("nvmet: add passthru code to process commands")
    Signed-off-by: Chaitanya Kulkarni
    Reviewed-by: Logan Gunthorpe
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
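
    A sketch of the resulting unwind order; every example_* name is an
    assumed helper, and the point is that blk_mq_free_request() is only
    reachable once rq is valid:

    static void example_execute_cmd(struct nvmet_req *req)
    {
            struct nvme_ns *ns;
            struct request *rq;
            u16 status;

            ns = example_get_ns(req);               /* assumed helper */
            if (!ns) {
                    status = NVME_SC_INVALID_NS | NVME_SC_DNR;
                    goto out;
            }

            rq = nvme_alloc_request(ns->queue, req->cmd, 0, NVME_QID_ANY);
            if (IS_ERR(rq)) {
                    status = NVME_SC_INTERNAL;
                    goto out_put_ns;        /* rq invalid: must not free it */
            }

            if (example_map_data(req, rq)) {        /* assumed helper */
                    status = NVME_SC_INTERNAL;
                    goto out_put_req;       /* rq valid: safe to free */
            }

            /* ... submit ... */
            return;

    out_put_req:
            blk_mq_free_request(rq);
    out_put_ns:
            nvme_put_ns(ns);
    out:
            nvmet_req_complete(req, status);
    }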
     
  • In the current implementation, before submitting the passthru command we
    may encounter errors, e.g. when getting the ns from the passthru
    controller or allocating a request from the passthru controller. All of
    the failure cases use a single goto label, fail_out.

    In the target code, we follow the pattern of having a separate label for
    each error-out case when setting up multiple things before the actual
    action.

    This patch follows the same pattern, renames the generic fail_out label
    to out_put_ns, and updates the error-out cases in
    nvmet_passthru_execute_cmd() where needed.

    Signed-off-by: Chaitanya Kulkarni
    Reviewed-by: Logan Gunthorpe
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
     
  • If we find an optimized path, we quit the loop immediately. Thus we can
    use just one variable for the next path, slightly simplifying the code.

    Signed-off-by: Martin Wilck
    Reviewed-by: Keith Busch
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin Wilck
     
  • If there's only one usable, non-optimized path, nvme_round_robin_path()
    returns NULL, which is wrong. Fix it by falling back to "old", like in
    the single optimized path case. Also, if the active path isn't changed,
    there's no need to re-assign the pointer (see the sketch after this
    entry).

    Fixes: 3f6e3246db0e ("nvme-multipath: fix logic for non-optimized paths")
    Signed-off-by: Martin Wilck
    Signed-off-by: Martin George
    Reviewed-by: Keith Busch
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin Wilck
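
    Sketched as the tail of the path selection loop (head->current_path is
    the real multipath field; the surrounding loop is omitted):

    if (!found)
            found = old;    /* lone usable non-optimized path: keep it */
    if (found != old)       /* skip a pointless pointer re-assignment */
            rcu_assign_pointer(head->current_path[node], found);
    return found;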
     
  • On an error exit path, a negative error code should be returned
    instead of a positive return value.

    Fixes: e399441de9115 ("nvme-fabrics: Add host support for FC transport")
    Cc: James Smart
    Signed-off-by: Tianjia Zhang
    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Tianjia Zhang