14 Nov, 2020
1 commit
-
To avoid a use-after-free on the flush request, we call its .end_io() from
both the timeout code path and __blk_mq_end_request().

When the flush request's ref hasn't dropped to zero, it is still in use and
we can't mark it as IDLE, so fix it by marking it IDLE only when its
refcount really drops to zero.

Fixes: 65ff5cd04551 ("blk-mq: mark flush request as IDLE in flush_end_io()")
Signed-off-by: Ming Lei
Cc: Yi Zhang
Signed-off-by: Jens Axboe
30 Oct, 2020
1 commit
-
Mark the flush request as IDLE in its .end_io(), aligning it with how
normal requests behave. The flush request stays in the in-flight tags if
we're not using an IO scheduler, so we need to change its state to IDLE.
Otherwise, we will hang in blk_mq_tagset_wait_completed_request() during
error recovery because the flush request's state is left as COMPLETED.

Reported-by: Yi Zhang
Signed-off-by: Ming Lei
Tested-by: Yi Zhang
Cc: Chao Leng
Cc: Sagi Grimberg
Signed-off-by: Jens Axboe
12 Aug, 2020
1 commit
-
In the case of the none scheduler, we share the data request's driver tag
for the flush request, so we have to mark the flush request as INFLIGHT to
avoid double-accounting this driver tag.

Fixes: 568f27006577 ("blk-mq: centralise related handling into blk_mq_get_driver_tag")
Reported-by: Matthew Wilcox
Signed-off-by: Ming Lei
Tested-by: Matthew Wilcox
Cc: Christoph Hellwig
Signed-off-by: Jens Axboe
17 Jul, 2020
1 commit
-
Commit 7520872c0cf4 ("block: don't defer flushes on blk-mq + scheduling")
tried to fix a deadlock caused by a cyclic wait between flush requests and
the data requests on fq->flush_data_in_flight. The former held all driver
tags and waited for data request completion, but the latter could not
complete while waiting for free driver tags.

After commit 923218f6166a ("blk-mq: don't allocate driver tag upfront
for flush rq"), flush requests no longer get a driver tag before being
queued into the flush queue.

* With an elevator, a flush request only gets a sched_tag before being
  inserted into the flush queue. It will not get a driver tag until it is
  issued to the driver, so the data requests on fq->flush_data_in_flight
  will complete in the end.

* Without an elevator, each flush request gets a driver tag when the
  request is allocated, so the data requests on fq->flush_data_in_flight
  don't need to worry about lacking a driver tag.

In both of these cases, the cyclic wait cannot occur, so we may allow
flush requests to be deferred.

Signed-off-by: Yufen Yu
Reviewed-by: Ming Lei
Signed-off-by: Jens Axboe
09 Jul, 2020
1 commit
-
Move the .nr_active update and request assignment into
blk_mq_get_driver_tag(); both are naturally done while getting the driver
tag.

Meanwhile, the blk-flush related code is simplified and the flush request
no longer needs to update the request table manually.

Signed-off-by: Ming Lei
Cc: Christoph Hellwig
Signed-off-by: Jens Axboe
02 Jul, 2020
1 commit
-
This reverts the following commits:

37f4a24c2469a10a4c16c641671bd766e276cf9f
723bf178f158abd1ce6069cb049581b3cb003aab
36a3df5a4574d5ddf59804fcd0c4e9654c514d9a

The last one is the culprit, but we have to go a bit deeper to get this
to revert cleanly. There's been a report that this breaks some MMC
setups [1], and also causes an issue with swap [2]. Until this can be
figured out, revert the offending commits.

[1] https://lore.kernel.org/linux-block/57fb09b1-54ba-f3aa-f82c-d709b0e6b281@samsung.com/
[2] https://lore.kernel.org/linux-block/20200702043721.GA1087@lca.pw/

Reported-by: Marek Szyprowski
Reported-by: Qian Cai
Signed-off-by: Jens Axboe
01 Jul, 2020
1 commit
-
Move the .nr_active update and request assignment into
blk_mq_get_driver_tag(); both are naturally done while getting the driver
tag.

Meanwhile, the blk-flush related code is simplified and the flush request
no longer needs to update the request table manually.

Signed-off-by: Ming Lei
Cc: Christoph Hellwig
Signed-off-by: Jens Axboe
29 Jun, 2020
1 commit
-
It is natural to release the driver tag when this request is completed by
the LLD or device, since its purpose is for LLD use.

One big benefit is that the released tag can be re-used more quickly, since
bio_endio() may take too long.

Meanwhile, we don't need to release the driver tag for flush requests.
Reviewed-by: Christoph Hellwig
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
22 May, 2020
2 commits
-
Neither of these can ever be NULL for a live block device.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
-
The argument isn't used by any caller, and drivers don't fill out
bi_sector for flush requests either.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
19 May, 2020
1 commit
-
The flush_queue_delayed flag was introduced to hold the queue if a flush
is running for a non-queueable flush drive by commit 3ac0cc450870
("hold queue if flush is running for non-queueable flush drive"),
but the non-mq parts of the flush code were removed by
commit 7e992f847a08 ("block: remove non mq parts from the flush code"),
along with the usage of the flush_queue_delayed flag.
Thus remove the unused flush_queue_delayed flag.

Signed-off-by: Baolin Wang
Reviewed-by: Ming Lei
Signed-off-by: Jens Axboe
28 Mar, 2020
1 commit
-
This reverts commit f10d9f617a65905c556c3b37c9b9646ae7d04ed7.
We can't have queues without a make_request_fn any more (and the
loop device uses blk-mq these days anyway).

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
12 Mar, 2020
2 commits
-
Remove the comment about return value, since it is not valid after
commit 404b8f5a03d84 ("block: cleanup kick/queued handling").

Signed-off-by: Guoqing Jiang
Reviewed-by: Nikolay Borisov
Reviewed-by: Bart Van Assche
Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Jens Axboe
-
Remove 'q' from arguments since it is not used anymore after
commit 7e992f847a08e ("block: remove non mq parts from the
flush code").

Signed-off-by: Guoqing Jiang
Reviewed-by: Nikolay Borisov
Reviewed-by: Bart Van Assche
Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Jens Axboe
25 Feb, 2020
1 commit
-
For some reason, a device may get into a state where it can't handle
FS requests, so STS_RESOURCE is always returned and the FS request
is added to hctx->dispatch. However, a passthrough request may
be required at that time to fix the problem. If the passthrough
request is added to the scheduler queue, there isn't any chance for
blk-mq to dispatch it, given that we prioritize requests in
hctx->dispatch. Then the FS IO request may never be completed, and an
IO hang is caused.

So the passthrough request has to be added to hctx->dispatch directly
to fix the IO hang.

Fix this issue by inserting passthrough requests into hctx->dispatch
directly, together with adding FS requests to the tail of
hctx->dispatch in blk_mq_dispatch_rq_list(). Actually we add FS requests
to the tail of hctx->dispatch by default, see blk_mq_request_bypass_insert().

This becomes consistent with the original legacy IO request path, in
which passthrough requests were always added to q->queue_head.

Cc: Dongli Zhang
Cc: Christoph Hellwig
Cc: Ewan D. Milne
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
21 Dec, 2019
1 commit
-
Avoid that running test nvme/012 from the blktests suite triggers the
following false positive lockdep complaint:

============================================
WARNING: possible recursive locking detected
5.0.0-rc3-xfstests-00015-g1236f7d60242 #841 Not tainted
--------------------------------------------
ksoftirqd/1/16 is trying to acquire lock:
000000000282032e (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

but task is already holding lock:
00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&(&fq->mq_flush_lock)->rlock);
  lock(&(&fq->mq_flush_lock)->rlock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

1 lock held by ksoftirqd/1/16:
 #0: 00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

stack backtrace:
CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.0.0-rc3-xfstests-00015-g1236f7d60242 #841
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
dump_stack+0x67/0x90
__lock_acquire.cold.45+0x2b4/0x313
lock_acquire+0x98/0x160
_raw_spin_lock_irqsave+0x3b/0x80
flush_end_io+0x4e/0x1d0
blk_mq_complete_request+0x76/0x110
nvmet_req_complete+0x15/0x110 [nvmet]
nvmet_bio_done+0x27/0x50 [nvmet]
blk_update_request+0xd7/0x2d0
blk_mq_end_request+0x1a/0x100
blk_flush_complete_seq+0xe5/0x350
flush_end_io+0x12f/0x1d0
blk_done_softirq+0x9f/0xd0
__do_softirq+0xca/0x440
run_ksoftirqd+0x24/0x50
smpboot_thread_fn+0x113/0x1e0
kthread+0x121/0x140
ret_from_fork+0x3a/0x50

Cc: Christoph Hellwig
Cc: Ming Lei
Cc: Hannes Reinecke
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe
22 Nov, 2019
1 commit
-
Requests that trigger flushing of the volatile writeback cache to disk
(barriers) have a significant effect on overall performance.

The block layer has a sophisticated engine for combining several flush
requests into one. But there are no statistics for the actual flushes
executed by the disk. Requests which trigger flushes are usually
barriers - zero-size writes.

This patch adds two iostat counters into /sys/class/block/$dev/stat and
/proc/diskstats - the count of completed flush requests and their total
time.

Signed-off-by: Konstantin Khlebnikov
Signed-off-by: Jens Axboe
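As a rough sketch of consuming the counters the patch above describes,
assuming the two flush fields (completed flushes, total flush time in ms)
are appended as the last two columns of a diskstats-style line. The sample
line and the field positions are illustrative assumptions, not an
authoritative description of the /proc/diskstats layout.

```shell
# Hypothetical diskstats-style line; only the last two fields matter here.
line="8 0 sda 9 0 1 2 3 4 5 6 0 7 8 0 0 0 0 42 120"

# Pull out the assumed flush counters (last two fields).
echo "$line" | awk '{ printf "flushes=%s flush_ms=%s\n", $(NF-1), $NF }'
# prints: flushes=42 flush_ms=120
```

On a real system the same awk expression would be pointed at
/proc/diskstats, one line per device, with the device name in field 3.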
27 Sep, 2019
1 commit
-
We got a NULL pointer dereference BUG in blk_mq_rq_timed_out()
as follows:

[ 108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
[ 108.827059] PGD 0 P4D 0
[ 108.827313] Oops: 0000 [#1] SMP PTI
[ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
[ 108.829503] Workqueue: kblockd blk_mq_timeout_work
[ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
[ 108.838191] Call Trace:
[ 108.838406] bt_iter+0x74/0x80
[ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450
[ 108.839074] ? __switch_to_asm+0x34/0x70
[ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.840273] ? syscall_return_via_sysret+0xf/0x7f
[ 108.840732] blk_mq_timeout_work+0x74/0x200
[ 108.841151] process_one_work+0x297/0x680
[ 108.841550] worker_thread+0x29c/0x6f0
[ 108.841926] ? rescuer_thread+0x580/0x580
[ 108.842344] kthread+0x16a/0x1a0
[ 108.842666] ? kthread_flush_work+0x170/0x170
[ 108.843100] ret_from_fork+0x35/0x40

The bug is caused by the race between the timeout handler and completion
of the flush request.

When the timeout handler blk_mq_rq_timed_out() tries to read
'req->q->mq_ops', the 'req' may have been completed and reinitialized by
the next flush request, which calls blk_rq_init() to clear 'req' to 0.

After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"),
normal requests' lifetimes are protected by a refcount. Until 'rq->ref'
drops to zero, the request cannot really be freed. Thus, these requests
cannot be reused before the timeout handler finishes.

However, the flush request has .end_io defined, and rq->end_io() is still
called even if 'rq->ref' hasn't dropped to zero. After that, the
'flush_rq' can be reused by the next flush request, resulting in the NULL
pointer dereference BUG.

We fix this problem by covering the flush request with 'rq->ref'.
If the refcount is not zero, flush_end_io() returns and waits for the
last holder to call it again. To record the request status, we add a new
field, 'rq_status', which will be used in flush_end_io().

Cc: Christoph Hellwig
Cc: Keith Busch
Cc: Bart Van Assche
Cc: stable@vger.kernel.org # v4.18+
Reviewed-by: Ming Lei
Reviewed-by: Bob Liu
Signed-off-by: Yufen Yu

-------
v2:
- move rq_status from struct request to struct blk_flush_queue
v3:
- remove unnecessary '{}' pair.
v4:
- let spinlock to protect 'fq->rq_status'
v5:
- move rq_status after flush_running_idx member of struct blk_flush_queue
Signed-off-by: Jens Axboe
01 May, 2019
1 commit
-
All these files have some form of the usual GPLv2 boilerplate. Switch
them to use SPDX tags instead.

Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
25 Mar, 2019
1 commit
-
Except for their arguments, blk_mq_put_driver_tag_hctx() and
blk_mq_put_driver_tag() are the same. We can just use the 'request'
argument to put the tag via blk_mq_put_driver_tag(), and then remove the
unused blk_mq_put_driver_tag_hctx().

Signed-off-by: Yufen Yu
Signed-off-by: Jens Axboe
30 Jan, 2019
1 commit
-
Florian reported an IO hang issue with fsync(). It should be
triggered by the following race condition.

data + post flush           a flush

blk_flush_complete_seq
  case REQ_FSEQ_DATA
    blk_flush_queue_rq
    issued to driver        blk_mq_dispatch_rq_list
                              try to issue a flush req
                              failed due to NON-NCQ command
                              .queue_rq return BLK_STS_DEV_RESOURCE

request completion
  req->end_io // doesn't check RESTART
    mq_flush_data_end_io
      case REQ_FSEQ_POSTFLUSH
        blk_kick_flush
          do nothing because previous flush
          has not been completed
        blk_mq_run_hw_queue
                            insert rq to hctx->dispatch
                            due to RESTART is still set, do nothing

To fix this, replace the blk_mq_run_hw_queue in mq_flush_data_end_io
with blk_mq_sched_restart to check and clear the RESTART flag.

Fixes: bd166ef1 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Reported-by: Florian Stecker
Tested-by: Florian Stecker
Signed-off-by: Jianchao Wang
Signed-off-by: Jens Axboe
16 Nov, 2018
2 commits
-
Various spots check for q->mq_ops being non-NULL; provide
a helper to do this instead.

Where the ->mq_ops != NULL check is redundant, remove it.

Since mq == rq-based now that legacy is gone, get rid of
queue_is_rq_based() and just use queue_is_mq() everywhere.

Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe
-
Only the mq locking is left in the flush state machine.
Reviewed-by: Hannes Reinecke
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
08 Nov, 2018
4 commits
-
We call blk_mq_map_queue() a lot, at least two times for each
request per IO, sometimes more. Since we now have an indirect
call as well in that function, cache the mapping so we don't
have to re-call blk_mq_map_queue() for the same request
multiple times.

Reviewed-by: Keith Busch
Reviewed-by: Sagi Grimberg
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe
-
Prep patch for being able to place requests based not just on
CPU location, but also on the type of request.

Reviewed-by: Hannes Reinecke
Reviewed-by: Keith Busch
Signed-off-by: Jens Axboe
-
Now that blk_flush_queue_rq() always returns false, we can
remove that return value. That bubbles through the stack,
allowing us to remove a bunch of state tracking around it.

Tested-by: Ming Lei
Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe
-
Reviewed-by: Hannes Reinecke
Tested-by: Ming Lei
Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe
14 Oct, 2018
1 commit
-
blk_mq_realloc_hw_ctxs could be invoked while updating hw queues.
At that moment, IO is blocked. Change the gfp flags from GFP_KERNEL
to GFP_NOIO to avoid a forever hang during memory allocation in
blk_mq_realloc_hw_ctxs.

Signed-off-by: Jianchao Wang
Signed-off-by: Jens Axboe
09 Jun, 2018
1 commit
-
A recent commit reused the original request flags for the flush
queue handling. However, for some of the kick-flush cases, the
original request was already completed. This caused a use-after-free
if blk-mq wasn't used.

Fixes: 84fca1b0c461 ("block: pass failfast and driver-specific flags to flush requests")
Reported-by: Dmitry Vyukov
Signed-off-by: Jens Axboe
06 Jun, 2018
1 commit
-
If flush requests are being sent to the device, we need to inherit the
failfast and driver-specific flags too, otherwise I/O will fail.

Reviewed-by: Christoph Hellwig
Signed-off-by: Hannes Reinecke
Signed-off-by: Jens Axboe
05 Nov, 2017
3 commits
-
The idea behind it is simple:

1) for the none scheduler, the driver tag has to be borrowed for the
flush rq, otherwise we may run out of tags and cause an IO hang. And
get/put driver tag is actually a noop for none, so reordering tags
isn't necessary at all.

2) for a real I/O scheduler, we need not allocate a driver tag upfront
for the flush rq. It works just fine to follow the same approach as
normal requests: allocate the driver tag for each rq just before calling
->queue_rq().

One driver-visible change is that the driver tag isn't shared in the
flush request sequence. That won't be a problem, since we always do that
in the legacy path.

Then the flush rq need not be treated specially wrt. get/put driver tag.
This cleans up the code - for instance, reorder_tags_to_front() can be
removed, and we needn't worry about request ordering in the dispatch list
for avoiding I/O deadlock.

Also we have to put the driver tag before requeueing.

Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
-
In the following patch, we will use RQF_FLUSH_SEQ to decide:

1) if the flag isn't set, the flush rq needs to be inserted via
blk_insert_flush()

2) otherwise, the flush rq needs to be dispatched directly, since
it is in the flush machinery now.

So we use blk_mq_request_bypass_insert() for requests bypassing the
flush machinery, just like the legacy path did.

Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
-
blk_insert_flush() should only insert the request, since a queue run
always follows it.

In the case of bypassing flush, we don't need to run the queue because
every blk_insert_flush() is followed by one queue run.

Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
26 Aug, 2017
1 commit
-
Normally I wouldn't bother with this, but in my opinion the comments are
the most important part of this whole file since without them no one
would have any clue how this insanity works.

Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
24 Aug, 2017
1 commit
-
This way we don't need a block_device structure to submit I/O. The
block_device has different lifetime rules from the gendisk and
request_queue, and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path, all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping, we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want the request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
21 Jun, 2017
1 commit
-
Instead of documenting the locking assumptions of most block layer
functions as a comment, use lockdep_assert_held() to verify locking
assumptions at runtime.

Signed-off-by: Bart Van Assche
Reviewed-by: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Omar Sandoval
Cc: Ming Lei
Signed-off-by: Jens Axboe
09 Jun, 2017
1 commit
-
Currently we use normal Linux errno values in the block layer, and while
we accept any error, a few have overloaded magic meanings. This patch
instead introduces a new blk_status_t value that holds block layer specific
status codes and explicitly explains their meaning. Helpers to convert from
and to the previous special meanings are provided for now, but I suspect
we want to get rid of them in the long run - those drivers that have an
errno input (e.g. networking) usually get errnos that don't know about
the special block layer overloads, and similarly returning them to userspace
will usually return something that strictly speaking isn't correct
for file system operations, but that's left as an exercise for later.

For now the set of errors is a very limited set that closely corresponds
to the previous overloaded errno values, but there is some low-hanging
fruit to improve it.

blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse
typechecking, so that we can easily catch places passing the wrong values.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
20 Apr, 2017
1 commit
-
blk_insert_flush should be using __blk_end_request to start with.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
25 Mar, 2017
1 commit
-
blkdev_issue_flush() is now always synchronous, and it no longer has a
flags argument. So remove the part of the comment about the WAIT flag.

Signed-off-by: Eric Biggers
Signed-off-by: Jens Axboe
18 Feb, 2017
1 commit
-
For blk-mq with scheduling, we can potentially end up with ALL
driver tags assigned and sitting on the flush queues. If we
defer because of an in-flight data request, then we can deadlock
if that data request doesn't already have a tag assigned.

This fixes a deadlock with running the xfs/297 xfstest, where
thousands of syncs can cause the drive queue to stall.

Signed-off-by: Jens Axboe
Reviewed-by: Omar Sandoval