08 May, 2019
1 commit
-
Pull block updates from Jens Axboe:
"Nothing major in this series, just fixes and improvements all over the
map. This contains:

- Series of fixes for sed-opal (David, Jonas)
- Fixes and performance tweaks for BFQ (via Paolo)
- Set of fixes for bcache (via Coly)
- Set of fixes for md (via Song)
- Enabling multi-page for passthrough requests (Ming)
- Queue release fix series (Ming)
- Device notification improvements (Martin)
- Propagate underlying device rotational status in loop (Holger)
- Removal of mtip32xx trim support, which has been disabled for years
(Christoph)
- Improvement and cleanup of nvme command handling (Christoph)
- Add block SPDX tags (Christoph)
- Cleanup/hardening of bio/bvec iteration (Christoph)
- A few NVMe pull requests (Christoph)
- Removal of CONFIG_LBDAF (Christoph)
- Various little fixes here and there"
* tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits)
block: fix mismerge in bvec_advance
block: don't drain in-progress dispatch in blk_cleanup_queue()
blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
blk-mq: always free hctx after request queue is freed
blk-mq: split blk_mq_alloc_and_init_hctx into two parts
blk-mq: free hw queue's resource in hctx's release handler
blk-mq: move cancel of requeue_work into blk_mq_release
blk-mq: grab .q_usage_counter when queuing request from plug code path
block: fix function name in comment
nvmet: protect discovery change log event list iteration
nvme: mark nvme_core_init and nvme_core_exit static
nvme: move command size checks to the core
nvme-fabrics: check more command sizes
nvme-pci: check more command sizes
nvme-pci: remove an unneeded variable initialization
nvme-pci: unquiesce admin queue on shutdown
nvme-pci: shutdown on timeout during deletion
nvme-pci: fix psdt field for single segment sgls
nvme-multipath: don't print ANA group state by default
nvme-multipath: split bios with the ns_head bio_set before submitting
...
04 May, 2019
2 commits
-
hctx is always released after the request queue is freed.

With the queue's kobject refcount held, it is safe for the driver to run the
queue, so one queue run might be scheduled after blk_sync_queue() is done.

So move the cancel of hctx->run_work into blk_mq_hw_sysfs_release() to
avoid running a released queue.

Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen
Cc: Christoph Hellwig
Cc: James E. J. Bottomley
Reviewed-by: Bart Van Assche
Reviewed-by: Hannes Reinecke
Tested-by: James Smart
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
-
Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
("blk-mq: Fix a use-after-free") fixes exactly this issue.

However, that commit introduces another issue. Before 45a9c9d909b2,
we were allowed to run the queue during queue cleanup as long as the
queue's kobj refcount was held. After that commit, the queue can't be
run during cleanup, otherwise an oops can be triggered easily because
some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().

We have invented ways of addressing this kind of issue before, such as:

8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

But they still can't cover all cases; recently James reported another such
issue: https://marc.info/?l=linux-scsi&m=155389088124782&w=2

This issue is quite hard to address the previous way, given that
scsi_run_queue() may run queues for other LUNs.

Fix the above issue by freeing the hctx's resources in its release
handler. This way is safe because tags aren't needed for freeing such
hctx resources, and it follows the typical design pattern for a
kobject's release handler.
Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen
Cc: Christoph Hellwig
Cc: James E. J. Bottomley
Reported-by: James Smart
Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Reviewed-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Tested-by: James Smart
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
01 May, 2019
1 commit
-
Various block layer files do not have any licensing information at all.
Add SPDX tags for the default kernel GPLv2 license to those.

Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
26 Apr, 2019
1 commit
-
The kobj_type default_attrs field is being replaced by the
default_groups field. Replace all of the ktype default_attrs fields in
the block subsystem with default_groups and use the ATTRIBUTE_GROUPS
macro to create the default groups.

Remove default_ctx_attrs[] because it doesn't contain any attributes.
This patch was tested by verifying that the sysfs files for the
attributes in the default groups were created.

Signed-off-by: Kimberly Brown
Reviewed-by: Bart Van Assche
Signed-off-by: Greg Kroah-Hartman
17 Dec, 2018
1 commit
-
Currently we only export hctx->type via sysfs, and there is no such info
in the hctx entry under debugfs. We often use debugfs to diagnose
queue mapping issues, so add the support in debugfs.

Queue mapping becomes a bit more complicated after multiple queue
mapping is supported; we may write blktests to verify that queue mapping
is valid based on blk-mq-debugfs.

Given that it isn't necessary to export hctx->type twice, remove the
export from sysfs.

Cc: Jeff Moyer
Cc: Mike Snitzer
Reviewed-by: Christoph Hellwig
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
05 Dec, 2018
1 commit
-
Having another indirect call in the fast path doesn't really help
in our post-spectre world. Also, having too many queue types is just
going to create confusion, so I'd rather manage them centrally.

Note that the queue type naming and ordering changes a bit: the
first index now is the default queue for everything not explicitly
marked, and the optional ones are the read and poll queues.

Reviewed-by: Sagi Grimberg
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
21 Nov, 2018
1 commit
-
Even though .mq_kobj, ctx->kobj and q->kobj share the same lifetime
from the block layer's view, they actually don't, because userspace may
grab a kobject at any time via sysfs.

This patch fixes the issue by the following approach:

1) introduce 'struct blk_mq_ctxs' for holding .mq_kobj and managing
all ctxs

2) free all allocated ctxs and the 'blk_mq_ctxs' instance in the release
handler of .mq_kobj

3) grab one ref of .mq_kobj before initializing each ctx->kobj, so that
.mq_kobj is always released after all ctxs are freed.

This patch fixes a kernel panic during boot when DEBUG_KOBJECT_RELEASE
is enabled.

Reported-by: Guenter Roeck
Cc: "jianchao.wang"
Tested-by: Guenter Roeck
Reviewed-by: Greg Kroah-Hartman
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
16 Nov, 2018
1 commit
-
Reviewed-by: Hannes Reinecke
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
08 Nov, 2018
1 commit
-
It can be useful for a user to verify what type a given hardware
queue is, so expose this information in sysfs.

Reviewed-by: Hannes Reinecke
Reviewed-by: Bart Van Assche
Reviewed-by: Keith Busch
Reviewed-by: Sagi Grimberg
Signed-off-by: Jens Axboe
25 May, 2018
1 commit
-
Convert the S_ symbolic permissions to their octal equivalents, as
using octal rather than symbolic permissions is preferred by many as
more readable.

See: https://lkml.org/lkml/2016/8/2/1945

Done with automated conversion via:
$ ./scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace

Miscellanea:
o Wrapped modified multi-line calls to a single line where appropriate
o Realigned modified multi-line calls to the open parenthesis

Signed-off-by: Joe Perches
Signed-off-by: Jens Axboe
15 Jan, 2018
1 commit
-
The original commit e9a823fb34a8b (block: fix warning when I/O elevator
is changed as request_queue is being removed) is pretty conflated.
"Conflated" because the resource being protected by q->sysfs_lock isn't
the queue_flags (it is the 'queue' kobj).

q->sysfs_lock serializes __elevator_change() (via elv_iosched_store)
from racing with blk_unregister_queue():
1) By holding q->sysfs_lock first, __elevator_change() can complete
before a racing blk_unregister_queue().
2) Conversely, __elevator_change() is testing for QUEUE_FLAG_REGISTERED
in case elv_iosched_store() loses the race with blk_unregister_queue();
it needs a way to know the 'queue' kobj isn't there.

Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is
held until after the 'queue' kobj is removed.

To do so, blk_mq_unregister_dev() must not also take q->sysfs_lock, so
rename __blk_mq_unregister_dev() to blk_mq_unregister_dev().

Also, blk_unregister_queue() should use q->queue_lock to protect against
any concurrent writes to q->queue_flags -- even though chances are the
queue is being cleaned up, so no concurrent writes are likely.

Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as request_queue is being removed")
Signed-off-by: Mike Snitzer
Reviewed-by: Ming Lei
Signed-off-by: Jens Axboe
04 May, 2017
2 commits
-
Originally, I tied debugfs registration/unregistration together with
sysfs. There's no reason to do this, and it's getting in the way of
letting schedulers define their own debugfs attributes. Instead, tie the
debugfs registration to the lifetime of the structures themselves.

The saner lifetimes mean we can also get rid of the extra mq directory
and move everything one level up. I.e., nvme0n1/mq/hctx0/tags is now
just nvme0n1/hctx0/tags.

Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
-
Preparation for adding more declarations.
Signed-off-by: Omar Sandoval
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe
27 Apr, 2017
4 commits
-
Hctx unregistration involves calling kobject_del(). kobject_del()
must not be called if kobject_add() has not been called. Hence in
the error path only unregister hctxs for which registration succeeded.

Signed-off-by: Bart Van Assche
Cc: Omar Sandoval
Cc: Hannes Reinecke
Signed-off-by: Jens Axboe
-
Since the blk_mq_debugfs_*register_hctxs() functions register and
unregister all attributes under the "mq" directory, rename these
into blk_mq_debugfs_*register_mq().

Signed-off-by: Bart Van Assche
Reviewed-by: Hannes Reinecke
Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe
-
A later patch will move the call of blk_mq_debugfs_register() to
a function to which the queue name is not passed as an argument.
To avoid having to add a 'name' argument to multiple callers, let
blk_mq_debugfs_register() look up the queue name.

Signed-off-by: Bart Van Assche
Reviewed-by: Omar Sandoval
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe
-
A later patch in this series will modify blk_mq_debugfs_register()
such that it uses q->kobj.parent to determine the name of a
request queue. Hence make sure that pointer is initialized
before blk_mq_debugfs_register() is called. To avoid lock inversion,
protect sysfs / debugfs registration with the queue's sysfs_lock
instead of the global mutex all_q_mutex.

Signed-off-by: Bart Van Assche
Reviewed-by: Hannes Reinecke
Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe
09 Mar, 2017
4 commits
-
It is obvious that hctx->cpumask is per-hctx, and both share the same
lifetime, so this patch moves the freeing of hctx->cpumask into the
release handler of the hctx's kobject.

Signed-off-by: Ming Lei
Tested-by: Peter Zijlstra (Intel)
Signed-off-by: Jens Axboe
-
This patch removes kobject_put() on the hctx in __blk_mq_unregister_dev(),
and tries to keep the lifetime consistent between the hctx and the hctx's
kobject.

Now blk_mq_sysfs_register() and blk_mq_sysfs_unregister() become
totally symmetrical, and the kobject's refcount drops to zero just
when the hctx is freed.

Signed-off-by: Ming Lei
Tested-by: Peter Zijlstra (Intel)
Signed-off-by: Jens Axboe
-
Currently, from the kobject view, both q->mq_kobj and ctx->kobj can
be released during one cycle of blk_mq_register_dev() and
blk_mq_unregister_dev(). Actually, the sw queue's lifetime is the
same as its request queue's, which is covered by request_queue->kobj.

So we don't need to call kobject_put() for the two kinds of
kobjects in __blk_mq_unregister_dev(); instead we do that
in the release handler of the request queue.

Signed-off-by: Ming Lei
Tested-by: Peter Zijlstra (Intel)
Signed-off-by: Jens Axboe
-
Both q->mq_kobj and the sw queues' kobjects should be initialized
only once, instead of doing that in each add_disk context.

Also, this patch removes the clearing of ctx in blk_mq_init_cpu_queues(),
because the percpu allocator zero-fills the allocated variable.

This patch fixes one issue[1] reported from Omar.

[1] kernel warning when doing unbind/bind on one scsi-mq device
[ 19.347924] kobject (ffff8800791ea0b8): tried to init an initialized object, something is seriously wrong.
[ 19.349781] CPU: 1 PID: 84 Comm: kworker/u8:1 Not tainted 4.10.0-rc7-00210-g53f39eeaa263 #34
[ 19.350686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014
[ 19.350920] Workqueue: events_unbound async_run_entry_fn
[ 19.350920] Call Trace:
[ 19.350920] dump_stack+0x63/0x83
[ 19.350920] kobject_init+0x77/0x90
[ 19.350920] blk_mq_register_dev+0x40/0x130
[ 19.350920] blk_register_queue+0xb6/0x190
[ 19.350920] device_add_disk+0x1ec/0x4b0
[ 19.350920] sd_probe_async+0x10d/0x1c0 [sd_mod]
[ 19.350920] async_run_entry_fn+0x48/0x150
[ 19.350920] process_one_work+0x1d0/0x480
[ 19.350920] worker_thread+0x48/0x4e0
[ 19.350920] kthread+0x101/0x140
[ 19.350920] ? process_one_work+0x480/0x480
[ 19.350920] ? kthread_create_on_node+0x60/0x60
[ 19.350920] ret_from_fork+0x2c/0x40

Cc: Omar Sandoval
Signed-off-by: Ming Lei
Tested-by: Peter Zijlstra (Intel)
Signed-off-by: Jens Axboe
03 Feb, 2017
1 commit
-
This needs to happen after we tear down blktrace.
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
27 Jan, 2017
5 commits
-
These counters aren't as out-of-place in sysfs as the other stuff, but
debugfs is a slightly better home for them.

Reviewed-by: Hannes Reinecke
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
-
These statistics _might_ be useful to userspace, but it's better not to
commit to an ABI for these yet. Also, the dispatched file in sysfs
couldn't be cleared, so make it clearable like the others in debugfs.

Reviewed-by: Hannes Reinecke
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
-
These are very tied to the blk-mq tag implementation, so exposing them
to sysfs isn't a great idea. Move the debugging information to debugfs
and add basic entries for the number of tags and the number of reserved
tags to sysfs.

Reviewed-by: Hannes Reinecke
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
-
These lists are only useful for debugging; they definitely don't belong
in sysfs. Putting them in debugfs also removes the limitation of a
single page of output.

Reviewed-by: Hannes Reinecke
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
-
In preparation for putting blk-mq debugging information in debugfs,
create a directory tree mirroring the one in sysfs:

# tree -d /sys/kernel/debug/block
/sys/kernel/debug/block
|-- nvme0n1
| `-- mq
| |-- 0
| | `-- cpu0
| |-- 1
| | `-- cpu1
| |-- 2
| | `-- cpu2
| `-- 3
| `-- cpu3
`-- vda
`-- mq
`-- 0
|-- cpu0
|-- cpu1
|-- cpu2
`-- cpu3

Also add the scaffolding for the actual files that will go in here,
either under the hardware queue or software queue directories.

Reviewed-by: Hannes Reinecke
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
18 Jan, 2017
1 commit
-
This adds a set of hooks that intercept the blk-mq path of
allocating/inserting/issuing/completing requests, allowing
us to develop a scheduler within that framework.

We reuse the existing elevator scheduler API on the registration
side, but augment that with the scheduler flagging support for
the blk-mq interface, and with a separate set of ops hooks for MQ
devices.

We split driver and scheduler tags, so we can run the scheduling
independently of device queue depth.

Signed-off-by: Jens Axboe
Reviewed-by: Bart Van Assche
Reviewed-by: Omar Sandoval
11 Nov, 2016
1 commit
-
For legacy block, we simply track them in the request queue. For
blk-mq, we track them on a per-sw-queue basis, which we can then
sum up through the hardware queues and finally to a per-device
state.

The stats are tracked in, roughly, 0.1s interval windows.

Add sysfs files to display the stats.

The feature is off by default, to avoid any extra overhead. In-kernel
users can turn it on by setting QUEUE_FLAG_STATS in the queue
flags. We currently don't turn it on if someone just reads any of
the stats files; that is something we could add as well.

Signed-off-by: Jens Axboe
21 Sep, 2016
1 commit
-
Enable devices without a gendisk instance to register themselves with
blk-mq and expose the associated multi-queue sysfs entries.

Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
17 Sep, 2016
1 commit
-
We currently account a '0' dispatch, and anything above that still falls
below the range set by BLK_MQ_MAX_DISPATCH_ORDER. If we dispatch more,
we don't account it.

Change the last bucket to be inclusive of anything above the range we
track, and have the sysfs file reflect that by including a '+' in the
output:

$ cat /sys/block/nvme0n1/mq/0/dispatched
0 1006
1 20229
2 1
4 0
8 0
16 0
32+ 0

Signed-off-by: Jens Axboe
Reviewed-by: Omar Sandoval
14 Sep, 2016
2 commits
-
Allow the io_poll statistics to be zeroed to make for easier logging
of polling events.

Signed-off-by: Stephen Bates
Acked-by: Christoph Hellwig
Signed-off-by: Jens Axboe
-
In order to help determine the effectiveness of polling in a running
system, it is useful to determine the ratio of how often the poll
function is called vs. how often the completion is checked. For this
reason we add a poll_considered variable and add it to the sysfs entry
for io_poll.

Signed-off-by: Stephen Bates
Acked-by: Christoph Hellwig
Signed-off-by: Jens Axboe
05 Aug, 2016
1 commit
-
If we fail registering any of the hardware queues, we call
into blk_mq_unregister_disk() with the hotplug mutex already
held. Since blk_mq_unregister_disk() attempts to acquire the
same mutex, we end up in a less than happy place.

Reported-by: Jinpu Wang
Signed-off-by: Jens Axboe
20 Mar, 2016
1 commit
-
queue_for_each_ctx() iterates over per_cpu variables under the assumption
that the possible cpu mask cannot have holes. That's wrong, as all cpumasks
can have holes. In case there are holes, the iteration ends up accessing
uninitialized memory and crashing as a result.

Replace the macro with a proper for_each_possible_cpu() loop and drop the
unused macro blk_ctx_sum(), which references queue_for_each_ctx().

Reported-by: Xiong Zhou
Signed-off-by: Thomas Gleixner
Signed-off-by: Jens Axboe
10 Feb, 2016
1 commit
-
The hardware's provided queue count may change at runtime with resource
provisioning. This patch allows a block driver to alter the number of
h/w queues available when its resource count changes.

The main part is a new blk-mq API to request a new number of h/w queues
for a given live tag set. The new API freezes all queues using that set,
then adjusts the allocated count prior to remapping these to CPUs.

The bulk of the rest just shifts where h/w contexts and all their
artifacts are allocated and freed.

The number of max h/w contexts is capped to the number of possible cpus,
since there is no use for more than that. As such, all pre-allocated
memory for pointers needs to account for the max possible rather than
the initial number of queues.

A side effect of this is that blk-mq will proceed successfully as
long as it can allocate at least one h/w context. Previously it would
fail request queue initialization if fewer than the requested number
were allocated.

Signed-off-by: Keith Busch
Reviewed-by: Christoph Hellwig
Tested-by: Jon Derrick
Signed-off-by: Jens Axboe
08 Nov, 2015
1 commit
-
Add basic support for polling for a specific IO to complete. This uses
the cookie that blk-mq passes back, which enables the block layer
to pass this cookie to the driver to spin for a specific request.

This will be combined with request latency tracking, so we can make
qualified decisions about when to poll and when not to. For now, for
benchmark purposes, we add a sysfs file that controls whether polling
is enabled or not.

Signed-off-by: Jens Axboe
Acked-by: Christoph Hellwig
Acked-by: Keith Busch
22 Oct, 2015
1 commit
-
Allow pmem, and other synchronous/bio-based block drivers, to fall back
on a per-cpu reference count managed by the core for tracking queue
live/dead state.

The existing per-cpu reference count for the blk_mq case is promoted to
be used in all block i/o scenarios. This involves initializing it by
default, waiting for it to drop to zero at exit, and holding a live
reference over the invocation of q->make_request_fn() in
generic_make_request(). The blk_mq code continues to take its own
reference per blk_mq request and retains the ability to freeze the
queue, but the check that the queue is frozen is moved to
generic_make_request().

This fixes crash signatures like the following:
BUG: unable to handle kernel paging request at ffff880140000000
[..]
Call Trace:
[] ? copy_user_handle_tail+0x5f/0x70
[] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
[] pmem_make_request+0xd1/0x200 [nd_pmem]
[] ? mempool_alloc+0x72/0x1a0
[] generic_make_request+0xd6/0x110
[] submit_bio+0x76/0x170
[] submit_bh_wbc+0x12f/0x160
[] submit_bh+0x12/0x20
[] jbd2_write_superblock+0x8d/0x170
[] jbd2_mark_journal_empty+0x5d/0x90
[] jbd2_journal_destroy+0x24b/0x270
[] ? put_pwq_unlocked+0x2a/0x30
[] ? destroy_workqueue+0x225/0x250
[] ext4_put_super+0x64/0x360
[] generic_shutdown_super+0x6a/0xf0

Cc: Jens Axboe
Cc: Keith Busch
Cc: Ross Zwisler
Suggested-by: Christoph Hellwig
Reviewed-by: Christoph Hellwig
Tested-by: Ross Zwisler
Signed-off-by: Dan Williams
Signed-off-by: Jens Axboe
30 Sep, 2015
1 commit
-
CPU hotplug handling for blk-mq (blk_mq_queue_reinit) acquires
all_q_mutex in blk_mq_queue_reinit_notify() and then removes sysfs
entries by blk_mq_sysfs_unregister(). Removing a sysfs entry blocks
until the active reference count of the kernfs_node reaches zero.

On the other hand, reading the blk_mq_hw_sysfs_cpu sysfs entry (e.g.
/sys/block/nullb0/mq/0/cpu_list) acquires all_q_mutex in
blk_mq_hw_sysfs_cpus_show().

If these happen at the same time, a deadlock can occur, because one
side can wait for the active reference to reach zero while holding
all_q_mutex, and the other tries to acquire all_q_mutex while holding
the active reference.

The reason all_q_mutex is acquired in blk_mq_hw_sysfs_cpus_show()
is to avoid reading an incomplete hctx->cpumask. Since reading a sysfs
entry for blk-mq needs to acquire q->sysfs_lock, we can avoid the
deadlock and reading an incomplete hctx->cpumask by holding
q->sysfs_lock while hctx->cpumask is being updated.

Signed-off-by: Akinobu Mita
Reviewed-by: Ming Lei
Cc: Ming Lei
Cc: Wanpeng Li
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe