11 May, 2017
1 commit
-
SCSI devices can return short writes on Write Same just like for normal
writes, so we need to handle this case for our special payload requests
as well.Signed-off-by: Christoph Hellwig
Reported-by: Abdul Haleem
Tested-by: Abdul Haleem
Signed-off-by: Jens Axboe
04 May, 2017
2 commits
-
Originally, I tied debugfs registration/unregistration together with
sysfs. There's no reason to do this, and it's getting in the way of
letting schedulers define their own debugfs attributes. Instead, tie the
debugfs registration to the lifetime of the structures themselves.The saner lifetimes mean we can also get rid of the extra mq directory
and move everything one level up. I.e., nvme0n1/mq/hctx0/tags is now
just nvme0n1/hctx0/tags.Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe -
Preparation for adding more declarations.
Signed-off-by: Omar Sandoval
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe
28 Apr, 2017
3 commits
-
The only difference between ->run_work and ->delay_work, is that
the latter is used to defer running a queue. This is done by
marking the queue stopped, and scheduling ->delay_work to run
sometime in the future. While the queue is stopped, direct runs
or runs through ->run_work will not run the queue.If we combine the handlers, then we need to handle two things:
1) If a delayed/stopped run is scheduled, then we should not run
the queue before that has been completed.
2) If a queue is delayed/stopped, the handler needs to restart
the queue. Normally a run of a queue with the stopped bit set
would be a no-op.Case 1 is handled by modifying a currently pending queue run
to the deadline set by the caller of blk_mq_delay_queue().
Subsequent attempts to queue a queue run will find the work
item already pending, and direct runs will see a stopped queue
as before.Case 2 is handled by adding a new bit, BLK_MQ_S_START_ON_RUN,
that tells the work handler that it should clear a stopped
queue and run the handler.Reviewed-by: Bart Van Assche
Signed-off-by: Jens Axboe -
This modifies (or adds, if not currently pending) an existing
delayed work item.Reviewed-by: Christoph Hellwig
Reviewed-by: Bart Van Assche
Signed-off-by: Jens Axboe -
They serve the exact same purpose. Get rid of the non-delayed
work variant, and just run it without delay for the normal case.Reviewed-by: Christoph Hellwig
Reviewed-by: Bart Van Assche
Reviewed-by: Ming Lei
Signed-off-by: Jens Axboe
27 Apr, 2017
1 commit
-
We currently call blk_mq_free_queue() from blk_cleanup_queue()
before we unregister the debugfs attributes for that queue in
blk_release_queue(). This leaves a window open during which
accessing most of the mq debugfs attributes would cause a
use-after-free. Additionally, the "state" attribute allows
running the queue, which we should not do after the queue has
entered the "dead" state. Fix both cases by unregistering the
debugfs attributes before freeing queue resources starts.Signed-off-by: Bart Van Assche
Reviewed-by: Hannes Reinecke
Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe
21 Apr, 2017
1 commit
-
Signed-off-by: Christoph Hellwig
Reviewed-by: Bart Van Assche
Acked-by: Roger Pau Monné
Reviewed-by: Konrad Rzeszutek Wilk
Signed-off-by: Jens Axboe
20 Apr, 2017
5 commits
-
Since only a single caller remains, inline blk_rq_set_prio(). Initialize
req->ioprio even if no I/O priority has been set in the bio nor in the
I/O context.Signed-off-by: Bart Van Assche
Reviewed-by: Adam Manzanares
Tested-by: Adam Manzanares
Reviewed-by: Christoph Hellwig
Cc: Matias Bjørling
Signed-off-by: Jens Axboe -
Export this function such that it becomes available to block
drivers.Signed-off-by: Bart Van Assche
Reviewed-by: Christoph Hellwig
Cc: Matias Bjørling
Cc: Adam Manzanares
Signed-off-by: Jens Axboe -
blk_insert_flush should be using __blk_end_request to start with.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
This function is not used anywhere in the kernel.
Signed-off-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
Signed-off-by: Jens Axboe -
Both functions are entirely unused.
Signed-off-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
Signed-off-by: Jens Axboe
07 Apr, 2017
1 commit
-
Currently only dm and md/raid5 bios trigger
trace_block_bio_complete(). Now that we have bio_chain() and
bio_inc_remaining(), it is not possible, in general, for a driver to
know when the bio is really complete. Only bio_endio() knows that.So move the trace_block_bio_complete() call to bio_endio().
Now trace_block_bio_complete() pairs with trace_block_bio_queue().
Any bio for which a 'queue' event is traced, will subsequently
generate a 'complete' event.There are a few cases where completion tracing is not wanted.
1/ If blk_update_request() has already generated a completion
trace event at the 'request' level, there is no point generating
one at the bio level too. In this case the bi_sector and bi_size
will have changed, so the bio level event would be wrong2/ If the bio hasn't actually been queued yet, but is being aborted
early, then a trace event could be confusing. Some filesystems
call bio_endio() but do not want tracing.3/ The bio_integrity code interposes itself by replacing bi_end_io,
then restoring it and calling bio_endio() again. This would produce
two identical trace events if left like that.To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
produce the trace event when this is set.
We address point 1 above by clearing the flag in blk_update_request().
We address point 2 above by only setting the flag when
generic_make_request() is called.
We address point 3 above by clearing the flag after generating a
completion event.When bio_split() is used on a bio, particularly in blk_queue_split(),
there is an extra complication. A new bio is split off the front, and
may be handle directly without going through generic_make_request().
The old bio, which has been advanced, is passed to
generic_make_request(), so it will trigger a trace event a second
time.
Probably the best result when a split happens is to see a single
'queue' event for the whole bio, then multiple 'complete' events - one
for each component. To achieve this was can:
- copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
- avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
This way, the split-off bio won't create a queue event, the original
won't either even if it re-submitted to generic_make_request(),
but both will produce completion events, each for their own range.So if generic_make_request() is called (which generates a QUEUED
event), then bi_endio() will create a single COMPLETE event for each
range that the bio is split into, unless the driver has explicitly
requested it not to.Signed-off-by: NeilBrown
Signed-off-by: Jens Axboe
05 Apr, 2017
1 commit
-
In 4.10 I introduced a patch that associates the ioc priority with
each request in the block layer. This work was done in the single queue
block layer code. This patch unifies ioc priority to request mapping across
the single/multi queue block layers.I have tested this patch with the null block device driver with the following
parameters.null_blk queue_mode=2 irqmode=0 use_per_node_hctx=1 nr_devices=1
I have not seen a performance regression with this patch and I would appreciate
any feedback or additional testing.I have also verified that io priorities are passed to the device when using
the SQ and MQ path to a SATA HDD that supports io priorities.Reviewed-by: Christoph Hellwig
Signed-off-by: Adam Manzanares
Reviewed-by: Bart Van Assche
Signed-off-by: Jens Axboe
29 Mar, 2017
3 commits
-
Before commit 780db2071a(blk-mq: decouble blk-mq freezing
from generic bypassing), the dying flag is checked before
entering queue, and Tejun converts the checking into .mq_freeze_depth,
and assumes the counter is increased just after dying flag
is set. Unfortunately we doesn't do that in blk_set_queue_dying().This patch calls blk_freeze_queue_start() in blk_set_queue_dying(),
so that we can block new I/O coming once the queue is set as dying.Given blk_set_queue_dying() is always called in remove path
of block device, and queue will be cleaned up later, we don't
need to worry about undoing the counter.Cc: Tejun Heo
Reviewed-by: Hannes Reinecke
Signed-off-by: Ming Lei
Reviewed-by: Johannes Thumshirn
Reviewed-by: Bart Van Assche
Signed-off-by: Jens Axboe -
As the .q_usage_counter is used by both legacy and
mq path, we need to block new I/O if queue becomes
dead in blk_queue_enter().So rename it and we can use this function in both
paths.Reviewed-by: Bart Van Assche
Reviewed-by: Hannes Reinecke
Signed-off-by: Ming Lei
Reviewed-by: Johannes Thumshirn
Signed-off-by: Jens Axboe -
Without the barrier, reading DEAD flag of .q_usage_counter
and reading .mq_freeze_depth may be reordered, then the
following wait_event_interruptible() may never return.Reviewed-by: Hannes Reinecke
Signed-off-by: Ming Lei
Reviewed-by: Johannes Thumshirn
Reviewed-by: Bart Van Assche
Signed-off-by: Jens Axboe
28 Mar, 2017
1 commit
-
Currently there is no way to know the request size when the request is
finished. Next patch will need this info. We could add extra field to
record the size, but blk_issue_stat has enough space to record it, so
this patch just overloads blk_issue_stat. With this, we will have 49bits
to track time, which still is very long time.Signed-off-by: Shaohua Li
Signed-off-by: Jens Axboe
22 Mar, 2017
3 commits
-
If a driver allocates a queue for stacked usage, then it does
not currently get stats allocated. This causes the later init
of, eg, writeback throttling to blow up. Move the init to the
queue allocation instead.Additionally, allow a NULL callback unregistration. This avoids
having the caller check for that, fixing another oops on
removal of a block device that doesn't have poll stats allocated.Fixes: 34dbad5d26e2 ("blk-stat: convert to callback-based statistics reporting")
Signed-off-by: Jens Axboe -
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe -
We always call wbt_exit() from blk_release_queue(), so these are
unnecessary.Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
12 Mar, 2017
1 commit
-
Commit 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()")
changed current->bio_list so that it did not contain *all* of the
queued bios, but only those submitted by the currently running
make_request_fn.There are two places which walk the list and requeue selected bios,
and others that check if the list is empty. These are no longer
correct.So redefine current->bio_list to point to an array of two lists, which
contain all queued bios, and adjust various code to test or walk both
lists.Signed-off-by: NeilBrown
Fixes: 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()")
Signed-off-by: Jens Axboe
09 Mar, 2017
2 commits
-
To avoid recursion on the kernel stack when stacked block devices
are in use, generic_make_request() will, when called recursively,
queue new requests for later handling. They will be handled when the
make_request_fn for the current bio completes.If any bios are submitted by a make_request_fn, these will ultimately
be handled seqeuntially. If the handling of one of those generates
further requests, they will be added to the end of the queue.This strict first-in-first-out behaviour can lead to deadlocks in
various ways, normally because a request might need to wait for a
previous request to the same device to complete. This can happen when
they share a mempool, and can happen due to interdependencies
particular to the device. Both md and dm have examples where this happens.These deadlocks can be erradicated by more selective ordering of bios.
Specifically by handling them in depth-first order. That is: when the
handling of one bio generates one or more further bios, they are
handled immediately after the parent, before any siblings of the
parent. That way, when generic_make_request() calls make_request_fn
for some particular device, we can be certain that all previously
submited requests for that device have been completely handled and are
not waiting for anything in the queue of requests maintained in
generic_make_request().An easy way to achieve this would be to use a last-in-first-out stack
instead of a queue. However this will change the order of consecutive
bios submitted by a make_request_fn, which could have unexpected consequences.
Instead we take a slightly more complex approach.
A fresh queue is created for each call to a make_request_fn. After it completes,
any bios for a different device are placed on the front of the main queue, followed
by any bios for the same device, followed by all bios that were already on
the queue before the make_request_fn was called.
This provides the depth-first approach without reordering bios on the same level.This, by itself, it not enough to remove all deadlocks. It just makes
it possible for drivers to take the extra step required themselves.To avoid deadlocks, drivers must never risk waiting for a request
after submitting one to generic_make_request. This includes never
allocing from a mempool twice in the one call to a make_request_fn.A common pattern in drivers is to call bio_split() in a loop, handling
the first part and then looping around to possibly split the next part.
Instead, a driver that finds it needs to split a bio should queue
(with generic_make_request) the second part, handle the first part,
and then return. The new code in generic_make_request will ensure the
requests to underlying bios are processed first, then the second bio
that was split off. If it splits again, the same process happens. In
each case one bio will be completely handled before the next one is attempted.With this is place, it should be possible to disable the
punt_bios_to_recover() recovery thread for many block devices, and
eventually it may be possible to remove it completely.Ref: http://www.spinics.net/lists/raid/msg54680.html
Tested-by: Jinpu Wang
Inspired-by: Lars Ellenberg
Signed-off-by: NeilBrown
Signed-off-by: Jens Axboe -
This reverts commit 0dba1314d4f81115dce711292ec7981d17231064. It causes
leaking of device numbers for SCSI when SCSI registers multiple gendisks
for one request_queue in succession. It can be easily reproduced using
Omar's script [1] on kernel with CONFIG_DEBUG_TEST_DRIVER_REMOVE.
Furthermore the protection provided by this commit is not needed anymore
as the problem it was fixing got also fixed by commit 165a5e22fafb
"block: Move bdi_unregister() to del_gendisk()".[1]: http://marc.info/?l=linux-block&m=148554717109098&w=2
Signed-off-by: Jan Kara
Acked-by: Dan Williams
Tested-by: Omar Sandoval
Signed-off-by: Jens Axboe
03 Mar, 2017
1 commit
-
Commit 6cd18e711dd8 "block: destroy bdi before blockdev is
unregistered." moved bdi unregistration (at that time through
bdi_destroy()) from blk_release_queue() to blk_cleanup_queue() because
it needs to happen before blk_unregister_region() call in del_gendisk()
for MD. SCSI though will free up the device number from sd_remove()
called through a maze of callbacks from device_del() in
__scsi_remove_device() before blk_cleanup_queue() and thus similar races
as described in 6cd18e711dd8 can happen for SCSI as well as reported by
Omar [1].Moving bdi_unregister() to del_gendisk() works for MD and fixes the
problem for SCSI since del_gendisk() gets called from sd_remove() before
freeing the device number.This also makes device_add_disk() (calling bdi_register_owner()) more
symmetric with del_gendisk().[1] http://marc.info/?l=linux-block&m=148554717109098&w=2
Tested-by: Lekshmi Pillai
Acked-by: Tejun Heo
Signed-off-by: Jan Kara
Tested-by: Omar Sandoval
Signed-off-by: Jens Axboe
18 Feb, 2017
1 commit
-
Signed-off-by: Jens Axboe
09 Feb, 2017
2 commits
-
Add a new merge strategy that merges discard bios into a request until the
maximum number of discard ranges (or the maximum discard size) is reached
from the plug merging code. I/O scheduler merging is not wired up yet
but might also be useful, although not for fast devices like NVMe which
are the only user for now.Note that for now we don't support limiting the size of each discard range,
but if needed that can be added later.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
Switch these constants to an enum, and make let the compiler ensure that
all callers of blk_try_merge and elv_merge handle all potential values.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
04 Feb, 2017
1 commit
-
If we end up doing a request-to-request merge when we have completed
a bio-to-request merge, we free the request from deep down in that
path. For blk-mq-sched, the merge path has to hold the appropriate
lock, but we don't need it for freeing the request. And in fact
holding the lock is problematic, since we are now calling the
mq sched put_rq_private() hook with the lock held. Other call paths
do not hold this lock.Fix this inconsistency by ensuring that the caller frees a merged
request. Then we can do it outside of the lock, making it both more
efficient and fixing the blk-mq-sched problem of invoking parts of
the scheduler with an unknown lock state.Reported-by: Paolo Valente
Signed-off-by: Jens Axboe
Reviewed-by: Omar Sandoval
03 Feb, 2017
1 commit
-
When I added the blk-mq debugging information to debugfs, I didn't
notice that blktrace also creates a "block" directory in debugfs. Make
them use the same dentry, now created in the core block code. Based on a
patch from Jens.Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
02 Feb, 2017
6 commits
-
Warnings of the following form occur because scsi reuses a devt number
while the block layer still has it referenced as the name of the bdi
[1]:WARNING: CPU: 1 PID: 93 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80
sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:192'
[..]
Call Trace:
dump_stack+0x86/0xc3
__warn+0xcb/0xf0
warn_slowpath_fmt+0x5f/0x80
? kernfs_path_from_node+0x4f/0x60
sysfs_warn_dup+0x62/0x80
sysfs_create_dir_ns+0x77/0x90
kobject_add_internal+0xb2/0x350
kobject_add+0x75/0xd0
device_add+0x15a/0x650
device_create_groups_vargs+0xe0/0xf0
device_create_vargs+0x1c/0x20
bdi_register+0x90/0x240
? lockdep_init_map+0x57/0x200
bdi_register_owner+0x36/0x60
device_add_disk+0x1bb/0x4e0
? __pm_runtime_use_autosuspend+0x5c/0x70
sd_probe_async+0x10d/0x1c0
async_run_entry_fn+0x39/0x170This is a brute-force fix to pass the devt release information from
sd_probe() to the locations where we register the bdi,
device_add_disk(), and unregister the bdi, blk_cleanup_queue().Thanks to Omar for the quick reproducer script [2]. This patch survives
where an unmodified kernel fails in a few seconds.[1]: https://marc.info/?l=linux-scsi&m=147116857810716&w=4
[2]: http://marc.info/?l=linux-block&m=148554717109098&w=2Cc: James Bottomley
Cc: Bart Van Assche
Cc: "Martin K. Petersen"
Cc: Jan Kara
Reported-by: Omar Sandoval
Tested-by: Omar Sandoval
Signed-off-by: Dan Williams
Reviewed-by: Christoph Hellwig
Reviewed-by: Bart Van Assche
Signed-off-by: Jens Axboe -
blk_get_backing_dev_info() is now a simple dereference. Remove that
function and simplify some code around that.Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe -
Currenly blk_get_backing_dev_info() is not safe to be called when the
block device is not open as bdev->bd_disk is NULL in that case. However
inode_to_bdi() uses this function and may be call called from flusher
worker or other writeback related functions without bdev being open
which leads to crashes such as:[113031.075540] Unable to handle kernel paging request for data at address 0x00000000
[113031.075614] Faulting instruction address: 0xc0000000003692e0
0:mon> t
[c0000000fb65f900] c00000000036cb6c writeback_sb_inodes+0x30c/0x590
[c0000000fb65fa10] c00000000036ced4 __writeback_inodes_wb+0xe4/0x150
[c0000000fb65fa70] c00000000036d33c wb_writeback+0x30c/0x450
[c0000000fb65fb40] c00000000036e198 wb_workfn+0x268/0x580
[c0000000fb65fc50] c0000000000f3470 process_one_work+0x1e0/0x590
[c0000000fb65fce0] c0000000000f38c8 worker_thread+0xa8/0x660
[c0000000fb65fd80] c0000000000fc4b0 kthread+0x110/0x130
[c0000000fb65fe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6cSigned-off-by: Jens Axboe
-
Instead of storing backing_dev_info inside struct request_queue,
allocate it dynamically, reference count it, and free it when the last
reference is dropped. Currently only request_queue holds the reference
but in the following patch we add other users referencing
backing_dev_info.Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe -
We will want to have struct backing_dev_info allocated separately from
struct request_queue. As the first step add pointer to backing_dev_info
to request_queue and convert all users touching it. No functional
changes in this patch.Reviewed-by: Christoph Hellwig
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe -
blk_set_queue_dying() does not acquire queue lock before it calls
blk_queue_for_each_rl(). This allows a racing blkg_destroy() to
remove blkg->q_node from the linked list and have
blk_queue_for_each_rl() loop infitely over the removed blkg->q_node
list node.Signed-off-by: Tahsin Erdogan
Signed-off-by: Jens Axboe
01 Feb, 2017
2 commits
-
Instead of keeping two levels of indirection for requests types, fold it
all into the operations. The little caveat here is that previously
cmd_type only applied to struct request, while the request and bio op
fields were set to plain REQ_OP_READ/WRITE even for passthrough
operations.Instead this patch adds new REQ_OP_* for SCSI passthrough and driver
private requests, althought it has to add two for each so that we
can communicate the data in/out nature of the request.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
This can be used to check for fs vs non-fs requests and basically
removes all knowledge of BLOCK_PC specific from the block layer,
as well as preparing for removing the cmd_type field in struct request.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
28 Jan, 2017
1 commit
-
These days we have the proper flags set since request allocation time.
Signed-off-by: Christoph Hellwig
Acked-by: Martin K. Petersen
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe