21 Nov, 2018
1 commit
-
commit 8dc765d438f1e42b3e8227b3b09fad7d73f4ec9a upstream.
c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") has
already fixed this race, however the implied synchronize_rcu()
in blk_mq_quiesce_queue() can slow down LUN probe a lot, so it caused
a performance regression.

Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
tried to quiesce the queue only when queue initialization is done, to
avoid the unnecessary synchronize_rcu(), because it is usual to see
lots of nonexistent LUNs which need to be probed.

However, it turns out it isn't safe to quiesce the queue only when queue
initialization is done. When one SCSI command is completed,
the user that sent the command can be woken up immediately, and then the
scsi device may be removed while the run queue in scsi_end_request()
is still in progress, so a kernel panic can be caused.

In the Red Hat QE lab, there are several reports about this kind of kernel
panic triggered during kernel booting.

This patch tries to address the issue by grabbing one queue usage
counter during freeing one request and the following run queue.

Fixes: 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
Cc: Andrew Jones
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen
Cc: Christoph Hellwig
Cc: James E.J. Bottomley
Cc: stable
Cc: jianchao.wang
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman
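The idea of holding a queue usage reference across "free the request" and "run the queue" can be sketched in user space. This is only an illustration under stated assumptions: struct demo_usage_queue and every demo_* helper are hypothetical names, and a plain atomic counter stands in for the kernel's per-cpu reference.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a request queue; not the kernel's struct. */
struct demo_usage_queue {
	atomic_int usage;           /* models the queue usage counter */
	int runs;                   /* how many times the queue was run */
	bool freeable_mid_complete; /* sampled between "free" and "run" */
};

static void demo_queue_get(struct demo_usage_queue *q)
{
	atomic_fetch_add(&q->usage, 1);
}

static void demo_queue_put(struct demo_usage_queue *q)
{
	atomic_fetch_sub(&q->usage, 1);
}

/* Teardown may release the queue only when nobody holds a reference. */
static bool demo_queue_can_free(struct demo_usage_queue *q)
{
	return atomic_load(&q->usage) == 0;
}

/* Completion path: keep a reference until the queue has been run. */
static void demo_end_request(struct demo_usage_queue *q)
{
	demo_queue_get(q);
	/* The request would be freed here; the submitter may wake up now. */
	q->freeable_mid_complete = demo_queue_can_free(q);
	q->runs++;              /* run queue, still covered by the reference */
	demo_queue_put(q);
}
```

While demo_end_request() is between its two steps, demo_queue_can_free() reports false, so a concurrent teardown modelled this way would have to wait.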
22 Sep, 2018
1 commit
-
Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
updating properly on 4.18. This is because we started using ktime to
track elapsed time, and we convert nanoseconds to jiffies when we update
the partition counter. However, this gets rounded down, so any I/Os that
take less than a jiffy are not accounted for. Previously in this case,
the value of jiffies would sometimes increment while we were doing I/O,
so at least some I/Os were accounted for.

Let's convert the stats to use nanoseconds internally. We still report
milliseconds as before, now more accurately than ever. The value is
still truncated to 32 bits for backwards compatibility.

Fixes: 522a777566f5 ("block: consolidate struct request timestamp fields")
Cc: stable@vger.kernel.org
Reported-by: Klaus Kusche
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
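The rounding problem described above can be reproduced with plain integer arithmetic. The sketch below is not the kernel code: DEMO_HZ is an assumed tick rate and both helper names are hypothetical. It compares converting every I/O to jiffies before summing with summing nanoseconds and converting once.

```c
#include <stdint.h>

#define DEMO_HZ            250ULL        /* assumed tick rate: 4 ms per jiffy */
#define DEMO_NSEC_PER_SEC  1000000000ULL
#define DEMO_NSEC_PER_MSEC 1000000ULL

/* Old scheme: round each I/O down to whole jiffies, then accumulate. */
static uint64_t demo_busy_ms_via_jiffies(const uint64_t *io_ns, int n)
{
	uint64_t jiffies = 0;

	for (int i = 0; i < n; i++)
		jiffies += io_ns[i] * DEMO_HZ / DEMO_NSEC_PER_SEC;
	return jiffies * 1000ULL / DEMO_HZ;
}

/* New scheme: accumulate nanoseconds, convert once when reporting. */
static uint32_t demo_busy_ms_via_ns(const uint64_t *io_ns, int n)
{
	uint64_t total_ns = 0;

	for (int i = 0; i < n; i++)
		total_ns += io_ns[i];
	/* The reported value is truncated to 32 bits. */
	return (uint32_t)(total_ns / DEMO_NSEC_PER_MSEC);
}
```

With four I/Os of 1 ms each, the jiffies-based helper reports 0 ms because every I/O is shorter than one tick, while the nanosecond-based helper reports 4 ms.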
06 Sep, 2018
1 commit
-
It is possible to call fsync on a read-only handle (for example, fsck.ext2
does it when doing a read-only check), and this call results in a kernel
warning.

The patch b089cfd95d32 ("block: don't warn for flush on read-only device")
attempted to disable the warning, but it is buggy and it doesn't
(op_is_flush tests flags, but bio_op strips off the flags).

Signed-off-by: Mikulas Patocka
Fixes: 721c7fc701c7 ("block: fail op_is_write() requests to read-only partitions")
Cc: stable@vger.kernel.org # 4.18
Signed-off-by: Jens Axboe
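Why a flag test fails after the flags have been masked off can be shown in a few lines. The bit layout and the DEMO_* names below are assumptions made for this sketch only; they are not copied from the kernel headers.

```c
#include <stdbool.h>

/* Assumed layout: the low 8 bits hold the opcode, higher bits hold flags. */
#define DEMO_OP_BITS  8
#define DEMO_OP_MASK  ((1u << DEMO_OP_BITS) - 1)
#define DEMO_OP_WRITE 1u
#define DEMO_PREFLUSH (1u << (DEMO_OP_BITS + 9))
#define DEMO_FUA      (1u << (DEMO_OP_BITS + 10))

/* Models bio_op(): keeps only the opcode and drops every flag. */
static unsigned int demo_bio_op(unsigned int opf)
{
	return opf & DEMO_OP_MASK;
}

/* Models op_is_flush(): looks only at flag bits. */
static bool demo_op_is_flush(unsigned int opf)
{
	return (opf & (DEMO_PREFLUSH | DEMO_FUA)) != 0;
}
```

Calling demo_op_is_flush(demo_bio_op(opf)) can never return true, because the mask removes exactly the bits the test looks at; the test has to see the unmasked value.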
23 Aug, 2018
1 commit
-
Pull more block updates from Jens Axboe:
- Set of bcache fixes and changes (Coly)
- The flush warn fix (me)
- Small series of BFQ fixes (Paolo)
- wbt hang fix (Ming)
- blktrace fix (Steven)
- blk-mq hardware queue count update fix (Jianchao)
- Various little fixes
* tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block: (31 commits)
block/DAC960.c: make some arrays static const, shrinks object size
blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
blk-mq: init hctx sched after update ctx and hctx mapping
block: remove duplicate initialization
tracing/blktrace: Fix to allow setting same value
pktcdvd: fix setting of 'ret' error return for a few cases
block: change return type to bool
block, bfq: return nbytes and not zero from struct cftype .write() method
block, bfq: improve code of bfq_bfqq_charge_time
block, bfq: reduce write overcharge
block, bfq: always update the budget of an entity when needed
block, bfq: readd missing reset of parent-entity service
blk-wbt: fix IO hang in wbt_wait()
block: don't warn for flush on read-only device
bcache: add the missing comments for smp_mb()/smp_wmb()
bcache: remove unnecessary space before ioctl function pointer arguments
bcache: add missing SPDX header
bcache: move open brace at end of function definitions to next line
bcache: add static const prefix to char * array declarations
bcache: fix code comments style
...
18 Aug, 2018
1 commit
-
This patch removes the duplicate initialization of q->queue_head
in blk_alloc_queue_node(). It removes the second initialization
so that the initialization order stays the same as the declaration
order in struct request_queue.

Reviewed-by: Omar Sandoval
Signed-off-by: Chaitanya Kulkarni
Signed-off-by: Jens Axboe
15 Aug, 2018
2 commits
-
Pull block updates from Jens Axboe:
"First pull request for this merge window, there will also be a
followup request with some stragglers.

This pull request contains:
- Fix for a thundering herd issue in the wbt block code (Anchal
Agarwal)
- A few NVMe pull requests:
* Improved tracepoints (Keith)
* Larger inline data support for RDMA (Steve Wise)
* RDMA setup/teardown fixes (Sagi)
* Effects log support for NVMe target (Chaitanya Kulkarni)
* Buffered IO support for NVMe target (Chaitanya Kulkarni)
* TP4004 (ANA) support (Christoph)
* Various NVMe fixes
- Block io-latency controller support. Much needed support for
properly containing block devices. (Josef)
- Series improving how we handle sense information on the stack
(Kees)
- Lightnvm fixes and updates/improvements (Mathias/Javier et al)
- Zoned device support for null_blk (Matias)
- AIX partition fixes (Mauricio Faria de Oliveira)
- DIF checksum code made generic (Max Gurtovoy)
- Add support for discard in iostats (Michael Callahan / Tejun)
- Set of updates for BFQ (Paolo)
- Removal of async write support for bsg (Christoph)
- Bio page dirtying and clone fixups (Christoph)
- Set of bcache fix/changes (via Coly)
- Series improving blk-mq queue setup/teardown speed (Ming)
- Series improving merging performance on blk-mq (Ming)
- Lots of other fixes and cleanups from a slew of folks"
* tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
blkcg: Make blkg_root_lookup() work for queues in bypass mode
bcache: fix error setting writeback_rate through sysfs interface
null_blk: add lock drop/acquire annotation
Blk-throttle: reduce tail io latency when iops limit is enforced
block: paride: pd: mark expected switch fall-throughs
block: Ensure that a request queue is dissociated from the cgroup controller
block: Introduce blk_exit_queue()
blkcg: Introduce blkg_root_lookup()
block: Remove two superfluous #include directives
blk-mq: count the hctx as active before allocating tag
block: bvec_nr_vecs() returns value for wrong slab
bcache: trivial - remove tailing backslash in macro BTREE_FLAG
bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
bcache: set max writeback rate when I/O request is idle
bcache: add code comments for bset.c
bcache: fix mistaken comments in request.c
bcache: fix mistaken code comments in bcache.h
bcache: add a comment in super.c
bcache: avoid unncessary cache prefetch bch_btree_node_get()
bcache: display rate debug parameters to 0 when writeback is not running
... -
Don't warn for a flush issued to a read-only device. It's not strictly
a writable command, as it doesn't change any on-media data by itself.

Reported-by: Stefan Agner
Fixes: 721c7fc701c7 ("block: fail op_is_write() requests to read-only partitions")
Signed-off-by: Jens Axboe
09 Aug, 2018
1 commit
-
This patch does not change any functionality.
Signed-off-by: Bart Van Assche
Reviewed-by: Johannes Thumshirn
Cc: Christoph Hellwig
Cc: Ming Lei
Cc: Omar Sandoval
Cc: Alexandru Moise
Cc: Joseph Qi
Cc:
Signed-off-by: Jens Axboe
05 Aug, 2018
1 commit
-
It turns out that commit 721c7fc701c7 ("block: fail op_is_write()
requests to read-only partitions"), while obviously correct, causes
problems for some older lvm2 installations.

The reason is that the lvm snapshotting will continue to write to the
snapshot COW volume, even after the volume has been marked read-only.
End result: snapshot failure.

This has actually been fixed in newer versions of the lvm2 tool, but the
old tools still exist, and the breakage was reported both in the kernel
bugzilla and in the Debian bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=200439
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900442

The lvm2 fix is here

https://sourceware.org/git/?p=lvm2.git;a=commit;h=a6fdb9d9d70f51c49ad11a87ab4243344e6701a3

but until everybody has updated to recent versions, we'll have to weaken
the "never write to read-only partitions" check. It now allows the
write to happen, but causes a warning, something like this:

generic_make_request: Trying to write to read-only block-device dm-3 (partno X)
Modules linked in: nf_tables xt_cgroup xt_owner kvm_intel iwlmvm kvm irqbypass iwlwifi
CPU: 1 PID: 77 Comm: kworker/1:1 Not tainted 4.17.9-gentoo #3
Hardware name: LENOVO 20B6A019RT/20B6A019RT, BIOS GJET91WW (2.41 ) 09/21/2016
Workqueue: ksnaphd do_metadata
RIP: 0010:generic_make_request_checks+0x4ac/0x600
...
Call Trace:
generic_make_request+0x64/0x400
submit_bio+0x6c/0x140
dispatch_io+0x287/0x430
sync_io+0xc3/0x120
dm_io+0x1f8/0x220
do_metadata+0x1d/0x30
process_one_work+0x1b9/0x3e0
worker_thread+0x2b/0x3c0
kthread+0x113/0x130
ret_from_fork+0x35/0x40

Note that this is a "revert" in behavior only. I'm leaving alone the
actual code cleanups in commit 721c7fc701c7, but letting the previously
uncaught request go through with a warning instead of stopping it.

Fixes: 721c7fc701c7 ("block: fail op_is_write() requests to read-only partitions")
Reported-and-tested-by: WGH
Acked-by: Mike Snitzer
Cc: Sagi Grimberg
Cc: Ilya Dryomov
Cc: Jens Axboe
Cc: Zdenek Kabelac
Signed-off-by: Linus Torvalds
03 Aug, 2018
1 commit
-
Runtime PM isn't ready for blk-mq yet, and commit 765e40b675a9 ("block:
disable runtime-pm for blk-mq") tried to disable it. Unfortunately,
it can't take effect in that way since user space can still switch
it on via 'echo auto > /sys/block/sdN/device/power/control'.

This patch really disables runtime-pm for blk-mq by calling pm_runtime_disable()
and fixes all kinds of PM-related kernel crashes.

Cc: Tomas Janousek
Cc: Przemek Socha
Cc: Alan Stern
Cc:
Reviewed-by: Bart Van Assche
Reviewed-by: Christoph Hellwig
Tested-by: Patrick Steinhardt
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
30 Jul, 2018
1 commit
-
We found a memory use-after-free issue in __blk_drain_queue()
on kernel 4.14. After reading the latest kernel 4.18-rc6, we
think it has the same problem.

Memory is allocated for q->fq in blk_init_allocated_queue().
If the elevator init function returns an error, it will
run into the fail case and free q->fq.

Then __blk_drain_queue() uses the same memory after q->fq has been
freed, which leads to unpredictable behavior.

The patch sets q->fq to NULL in the fail case of
blk_init_allocated_queue().

Fixes: commit 7c94e1c157a2 ("block: introduce blk_flush_queue to drive flush machinery")
Cc:
Reviewed-by: Ming Lei
Reviewed-by: Bart Van Assche
Signed-off-by: xiao jin
Signed-off-by: Jens Axboe
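The pattern behind this fix, clearing a pointer right after freeing it so that later code can test for it, can be sketched as follows. The struct and function names are hypothetical, and the NULL check in the drain helper is an assumption of the sketch rather than a quote of the kernel code.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the flush queue and the request queue. */
struct demo_flush_queue { int pending; };
struct demo_fq_owner    { struct demo_flush_queue *fq; };

/* Init path: on elevator failure, free fq and clear the stale pointer. */
static int demo_init_allocated_queue(struct demo_fq_owner *q, int elevator_err)
{
	q->fq = calloc(1, sizeof(*q->fq));
	if (!q->fq)
		return -1;
	if (elevator_err) {
		free(q->fq);
		q->fq = NULL;   /* without this, q->fq would dangle */
		return elevator_err;
	}
	return 0;
}

/* Drain path: touch fq only if it still exists. */
static int demo_drain_queue(const struct demo_fq_owner *q)
{
	if (!q->fq)
		return 0;
	return q->fq->pending;
}
```

After a failed init, demo_drain_queue() sees a NULL pointer and returns without reading freed memory.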
18 Jul, 2018
1 commit
-
Add and use a new op_stat_group() function for indexing partition stat
fields rather than indexing them by rq_data_dir() or bio_data_dir().
This function works similarly to op_is_sync() in that it takes the
request::cmd_flags or bio::bi_opf flags and determines which stats
should get updated.

In addition, the second parameter to generic_start_io_acct() and
generic_end_io_acct() is now a REQ_OP rather than simply a read or
write bit and it uses op_stat_group() on the parameter to determine
the stat group.

Note that the partition in_flight counts are not part of the per-cpu
statistics and as such are not indexed via this function. They are now
indexed by op_is_write().

tj: Refreshed on top of v4.17. Updated to pass around REQ_OP.
Signed-off-by: Michael Callahan
Signed-off-by: Tejun Heo
Cc: Minchan Kim
Cc: Dan Williams
Cc: Joshua Morris
Cc: Philipp Reisner
Cc: Matias Bjorling
Cc: Kent Overstreet
Cc: Alasdair Kergon
Signed-off-by: Jens Axboe
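A small model may help to see the split between the stat group index and the in_flight index. Everything below is a sketch: the demo_* names, the flag bit and the two-group layout are assumptions, with the group helper simply delegating to the write test.

```c
#include <stdbool.h>

/* Assumed encoding: opcode in the low bits, flags above them. */
#define DEMO_REQ_OP_READ  0u
#define DEMO_REQ_OP_WRITE 1u
#define DEMO_REQ_SYNC     (1u << 11)   /* arbitrary flag bit for the demo */

/* In this sketch, write-type operations have the lowest bit set. */
static bool demo_op_is_write(unsigned int op)
{
	return (op & 1u) != 0;
}

/* Maps an op (with flags) to the index of the stat group to update. */
static int demo_op_stat_group(unsigned int op)
{
	return demo_op_is_write(op) ? 1 : 0;
}

struct demo_part_stats {
	unsigned long ios[2];        /* per-group counters */
	unsigned long in_flight[2];  /* indexed by read/write, not by group */
};

static void demo_start_io_acct(struct demo_part_stats *s, unsigned int op)
{
	s->ios[demo_op_stat_group(op)]++;
	s->in_flight[demo_op_is_write(op) ? 1 : 0]++;
}
```

Keeping the two index helpers separate means a later change that adds another stat group would not affect how in_flight is counted.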
09 Jul, 2018
5 commits
-
With gcc 4.9.0 and 7.3.0:
block/blk-core.c: In function 'blk_pm_allow_request':
block/blk-core.c:2747:2: warning: enumeration value 'RPM_ACTIVE' not handled in switch [-Wswitch]
switch (rq->q->rpm_status) {
^

Convert the return statement below the switch() block into a default
case to fix this.

Fixes: e4f36b249b4d4e75 ("block: fix peeking requests during PM")
Signed-off-by: Geert Uytterhoeven
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe
-
We don't really need to save this stuff in the core block code, we can
just pass the bio back into the helpers later on to derive the same
flags and update the rq->wbt_flags appropriately.

Signed-off-by: Josef Bacik
Signed-off-by: Jens Axboe
-
blkcg-qos is going to do essentially what wbt does, only on a cgroup
basis. Break out the common code that will be shared between blkcg-qos
and wbt into blk-rq-qos.* so they can both utilize the same
infrastructure.

Signed-off-by: Josef Bacik
Signed-off-by: Jens Axboe
-
The payload of struct request is stored in the request.bio chain if
the RQF_SPECIAL_PAYLOAD flag is not set and in request.special_vec if
RQF_SPECIAL_PAYLOAD has been set. However, blk_update_request()
iterates over req->bio whether or not RQF_SPECIAL_PAYLOAD has been
set. Additionally, the RQF_SPECIAL_PAYLOAD flag is ignored by
blk_rq_bytes() which means that the value returned by that function
is incorrect if the RQF_SPECIAL_PAYLOAD flag has been set. It is not
clear to me whether this is an oversight or whether this happened on
purpose. Anyway, document that it is known that both functions ignore
RQF_SPECIAL_PAYLOAD. See also commit f9d03f96b988 ("block: improve
handling of the magic discard payload").

Reviewed-by: Christoph Hellwig
Signed-off-by: Bart Van Assche
Cc: Ming Lei
Signed-off-by: Jens Axboe
-
SCSI probing may synchronously create and destroy a lot of request_queues
for non-existent devices. Any synchronize_rcu() in the queue creation or
destroy path may introduce long latency during booting, see the detailed
description in the comment of blk_register_queue().

This patch removes one synchronize_rcu() inside blk_cleanup_queue()
for this case. Commit c2856ae2f315d75 ("blk-mq: quiesce queue before freeing queue")
needs synchronize_rcu() for implementing blk_mq_quiesce_queue(), but
when the queue isn't initialized, it isn't necessary to do that since
only pass-through requests are involved, and there is no original issue in
scsi_execute() at all.

Without this patch and the previous one, it may take 20+ seconds for
virtio-scsi to complete disk probe. With the two patches, the time becomes
less than 100ms.

Fixes: c2856ae2f315d75 ("blk-mq: quiesce queue before freeing queue")
Reported-by: Andrew Jones
Cc: Omar Sandoval
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org
Cc: "Martin K. Petersen"
Cc: Christoph Hellwig
Tested-by: Andrew Jones
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
28 Jun, 2018
1 commit
-
This patch avoids that removing a path controlled by the dm-mpath driver
while mkfs is running triggers the following kernel bug:

kernel BUG at block/blk-core.c:3347!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 20 PID: 24369 Comm: mkfs.ext4 Not tainted 4.18.0-rc1-dbg+ #2
RIP: 0010:blk_end_request_all+0x68/0x70
Call Trace:
dm_softirq_done+0x326/0x3d0 [dm_mod]
blk_done_softirq+0x19b/0x1e0
__do_softirq+0x128/0x60d
irq_exit+0x100/0x110
smp_call_function_single_interrupt+0x90/0x330
call_function_single_interrupt+0xf/0x20
Fixes: f9d03f96b988 ("block: improve handling of the magic discard payload")
Reviewed-by: Ming Lei
Reviewed-by: Christoph Hellwig
Acked-by: Mike Snitzer
Signed-off-by: Bart Van Assche
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
Cc:
Signed-off-by: Jens Axboe
20 Jun, 2018
1 commit
-
Commit 0ba99ca4838b ("block: Add warning for bi_next not NULL in
bio_endio()") breaks the dm driver. end_clone_bio() detects whether
or not a bio is the last bio associated with a request by checking
the .bi_next field. Commit 0ba99ca4838b clears that field before
end_clone_bio() has had a chance to inspect that field. Hence revert
commit 0ba99ca4838b.

This patch avoids that KASAN reports the following complaint when
running the srp-test software (srp-test/run_tests -c -d -r 10 -t 02-mq):

==================================================================
BUG: KASAN: use-after-free in bio_advance+0x11b/0x1d0
Read of size 4 at addr ffff8801300e06d0 by task ksoftirqd/0/9

CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 4.18.0-rc1-dbg+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0xa4/0xf5
print_address_description+0x6f/0x270
kasan_report+0x241/0x360
__asan_load4+0x78/0x80
bio_advance+0x11b/0x1d0
blk_update_request+0xa7/0x5b0
scsi_end_request+0x56/0x320 [scsi_mod]
scsi_io_completion+0x7d6/0xb20 [scsi_mod]
scsi_finish_command+0x1c0/0x280 [scsi_mod]
scsi_softirq_done+0x19a/0x230 [scsi_mod]
blk_mq_complete_request+0x160/0x240
scsi_mq_done+0x50/0x1a0 [scsi_mod]
srp_recv_done+0x515/0x1330 [ib_srp]
__ib_process_cq+0xa0/0xf0 [ib_core]
ib_poll_handler+0x38/0xa0 [ib_core]
irq_poll_softirq+0xe8/0x1f0
__do_softirq+0x128/0x60d
run_ksoftirqd+0x3f/0x60
smpboot_thread_fn+0x352/0x460
kthread+0x1c1/0x1e0
ret_from_fork+0x24/0x30

Allocated by task 1918:
save_stack+0x43/0xd0
kasan_kmalloc+0xad/0xe0
kasan_slab_alloc+0x11/0x20
kmem_cache_alloc+0xfe/0x350
mempool_alloc_slab+0x15/0x20
mempool_alloc+0xfb/0x270
bio_alloc_bioset+0x244/0x350
submit_bh_wbc+0x9c/0x2f0
__block_write_full_page+0x299/0x5a0
block_write_full_page+0x16b/0x180
blkdev_writepage+0x18/0x20
__writepage+0x42/0x80
write_cache_pages+0x376/0x8a0
generic_writepages+0xbe/0x110
blkdev_writepages+0xe/0x10
do_writepages+0x9b/0x180
__filemap_fdatawrite_range+0x178/0x1c0
file_write_and_wait_range+0x59/0xc0
blkdev_fsync+0x46/0x80
vfs_fsync_range+0x66/0x100
do_fsync+0x3d/0x70
__x64_sys_fsync+0x21/0x30
do_syscall_64+0x77/0x230
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 9:
save_stack+0x43/0xd0
__kasan_slab_free+0x137/0x190
kasan_slab_free+0xe/0x10
kmem_cache_free+0xd3/0x380
mempool_free_slab+0x17/0x20
mempool_free+0x63/0x160
bio_free+0x81/0xa0
bio_put+0x59/0x60
end_bio_bh_io_sync+0x5d/0x70
bio_endio+0x1a7/0x360
blk_update_request+0xd0/0x5b0
end_clone_bio+0xa3/0xd0 [dm_mod]
bio_endio+0x1a7/0x360
blk_update_request+0xd0/0x5b0
scsi_end_request+0x56/0x320 [scsi_mod]
scsi_io_completion+0x7d6/0xb20 [scsi_mod]
scsi_finish_command+0x1c0/0x280 [scsi_mod]
scsi_softirq_done+0x19a/0x230 [scsi_mod]
blk_mq_complete_request+0x160/0x240
scsi_mq_done+0x50/0x1a0 [scsi_mod]
srp_recv_done+0x515/0x1330 [ib_srp]
__ib_process_cq+0xa0/0xf0 [ib_core]
ib_poll_handler+0x38/0xa0 [ib_core]
irq_poll_softirq+0xe8/0x1f0
__do_softirq+0x128/0x60d

The buggy address belongs to the object at ffff8801300e0640
which belongs to the cache bio-0 of size 200
The buggy address is located 144 bytes inside of
200-byte region [ffff8801300e0640, ffff8801300e0708)
The buggy address belongs to the page:
page:ffffea0004c03800 count:1 mapcount:0 mapping:ffff88015a563a00 index:0x0 compound_mapcount: 0
flags: 0x8000000000008100(slab|head)
raw: 8000000000008100 dead000000000100 dead000000000200 ffff88015a563a00
raw: 0000000000000000 0000000000330033 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff8801300e0580: fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc
ffff8801300e0600: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
>ffff8801300e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8801300e0700: fb fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff8801300e0780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================

Cc: Kent Overstreet
Fixes: 0ba99ca4838b ("block: Add warning for bi_next not NULL in bio_endio()")
Acked-by: Mike Snitzer
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe
07 Jun, 2018
1 commit
-
blk_partition_remap() will only clear bi_partno if an actual remapping
has happened. But flush request et al don't have an actual size, so
the remapping doesn't happen and bi_partno is never cleared.
So for stacked devices blk_partition_remap() will be called on each level.
If (as is the case for native nvme multipathing) one of the lower-level
devices does _not_ support partitioning, a spurious I/O error is generated.

Reviewed-by: Johannes Thumshirn
Reviewed-by: Sagi Grimberg
Reviewed-by: Christoph Hellwig
Signed-off-by: Hannes Reinecke
Signed-off-by: Jens Axboe
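The message above describes the failure but not the code change. One way to avoid it, assumed here rather than taken from the patch, is to clear the partition number on every remap call, whether or not the bio has a size. The struct and helper below are hypothetical.

```c
/* Hypothetical, simplified bio: just the fields this sketch needs. */
struct demo_remap_bio {
	unsigned long sector;      /* start sector */
	unsigned int  nr_sectors;  /* 0 for a flush-style request */
	unsigned int  partno;      /* 0 means "whole device" */
};

/*
 * Remap a partition-relative bio to the whole device.
 * The sector is shifted only when there is data, but the partition
 * number is cleared unconditionally (the assumed fix).
 */
static void demo_partition_remap(struct demo_remap_bio *bio,
				 unsigned long part_start)
{
	if (!bio->partno)
		return;                 /* already remapped */
	if (bio->nr_sectors)
		bio->sector += part_start;
	bio->partno = 0;
}
```

With this shape, a zero-length flush leaves the first level with partno already 0, so a lower-level device without partition support never sees a stale partition number.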
03 Jun, 2018
1 commit
-
If we end up splitting a bio and the queue goes away between
the initial submission and the later split submission, then we
can block forever in blk_queue_enter() waiting for the reference
to drop to zero. This will never happen, since we already hold
a reference.

Mark a split bio as already having entered the queue, so we can
just use the live non-blocking queue enter variant.

Thanks to Tetsuo Handa for the analysis.
Reported-by: syzbot+c4f9cebf9d651f6e54de@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe
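The choice between the blocking and the live enter variant can be modelled in user space. This is a sketch only: the flag name, the structs and the return convention (-1 meaning "would have to wait") are assumptions for illustration.

```c
#include <stdbool.h>

#define DEMO_BIO_QUEUE_ENTERED (1u << 0)  /* hypothetical flag bit */

struct demo_enter_queue { int usage; bool frozen; };
struct demo_enter_bio   { unsigned int flags; };

/* Normal enter: would have to wait while the queue is frozen. */
static int demo_queue_enter(struct demo_enter_queue *q)
{
	if (q->frozen)
		return -1;      /* models blocking until usage drops to zero */
	q->usage++;
	return 0;
}

/* Live enter: the caller already holds a reference, so never wait. */
static void demo_queue_enter_live(struct demo_enter_queue *q)
{
	q->usage++;
}

/* Submission: a bio marked as already entered takes the live path. */
static int demo_submit(struct demo_enter_queue *q, struct demo_enter_bio *bio)
{
	if (bio->flags & DEMO_BIO_QUEUE_ENTERED) {
		demo_queue_enter_live(q);
		return 0;
	}
	return demo_queue_enter(q);
}
```

In the frozen case, an unmarked split bio would wait for a reference count that its own parent keeps above zero; the marked one proceeds.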
01 Jun, 2018
3 commits
-
Both callers take just around so function call, so move it in.
Also remove the now pointless blk_mq_sched_init wrapper.

Signed-off-by: Christoph Hellwig
Reviewed-by: Damien Le Moal
Tested-by: Damien Le Moal
Signed-off-by: Jens Axboe
-
Reported-by: Damien Le Moal
Signed-off-by: Christoph Hellwig
Reviewed-by: Damien Le Moal
Tested-by: Damien Le Moal
Signed-off-by: Jens Axboe
-
No point in doing this in elevator_init.
Signed-off-by: Christoph Hellwig
Reported-by: Damien Le Moal
Reviewed-by: Damien Le Moal
Tested-by: Damien Le Moal
Signed-off-by: Jens Axboe
31 May, 2018
1 commit
-
Convert the core block functionality to embedded bio sets.
Reviewed-by: Christoph Hellwig
Signed-off-by: Kent Overstreet
Signed-off-by: Jens Axboe
29 May, 2018
1 commit
-
This patch simplifies the timeout handling by relying on the request
reference counting to ensure the iterator is operating on an inflight
and truly timed out request. Since the reference counting prevents the
tag from being reallocated, the block layer no longer needs to prevent
drivers from completing their requests while the timeout handler is
operating on it: a driver completing a request is allowed to proceed to
the next state without additional synchronization with the block layer.

This also removes any need for generation sequence numbers since the
request lifetime is prevented from being reallocated as a new sequence
while timeout handling is operating on it.

To enable this, a refcount is added to struct request so that request
users can be sure they're operating on the same request without it
changing while they're processing it. The request's tag won't be
released for reuse until both the timeout handler and the completion
are done with it.

Signed-off-by: Keith Busch
[hch: slight cleanups, added back submission side hctx lock, use cmpxchg
for completions]
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
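The lifetime rule described above, where the tag is released only after both the completion and the timeout handler drop their references, can be sketched with C11 atomics. The struct and helpers are hypothetical and much simpler than the kernel's refcount API.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical request: a reference count plus a "tag released" marker. */
struct demo_ref_request {
	atomic_int ref;
	int tag;
	bool tag_released;
};

/* Take a reference only if the request is still alive (ref > 0). */
static bool demo_ref_inc_not_zero(atomic_int *ref)
{
	int old = atomic_load(ref);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the last one to drop releases the tag. */
static void demo_request_put(struct demo_ref_request *rq)
{
	if (atomic_fetch_sub(&rq->ref, 1) == 1)
		rq->tag_released = true;
}
```

A timeout handler modelled this way first calls demo_ref_inc_not_zero(); if that fails, the request has already completed and been released, so the handler skips it.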
15 May, 2018
2 commits
-
Recently found a bug where a driver left bi_next not NULL and then
called bio_endio(), and then the submitter of the bio used
bio_copy_data() which was treating src and dst as lists of bios.

Fixed that bug by splitting out bio_list_copy_data(), but in case other
things are depending on bi_next in weird ways, add a warning to help
avoid more bugs like that in the future.

Signed-off-by: Kent Overstreet
Signed-off-by: Jens Axboe
-
Minor optimization - remove a pointer indirection when using fs_bio_set.
Signed-off-by: Kent Overstreet
Signed-off-by: Jens Axboe
14 May, 2018
4 commits
-
We just can't do I/O when doing block layer requests allocations,
so use GFP_NOIO instead of the even more limited __GFP_DIRECT_RECLAIM.

Signed-off-by: Christoph Hellwig
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe
-
blk_old_get_request already has it at hand, and in blk_queue_bio, which
is the fast path, it is constant.

Signed-off-by: Christoph Hellwig
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe
-
Switch everyone to blk_get_request_flags, and then rename
blk_get_request_flags to blk_get_request.

Signed-off-by: Christoph Hellwig
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe
-
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
09 May, 2018
3 commits
-
Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq

These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.

Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
-
struct blk_issue_stat squashes three things into one u64:
- The time the driver started working on a request
- The original size of the request (for the io.low controller)
- Flags for writeback throttling

It turns out that on x86_64, we have a 4 byte hole in struct request
which we can fill with the non-timestamp fields from blk_issue_stat,
simplifying things quite a bit.

Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
-
issue_stat is going to go away, so first make writeback throttling take
the containing request, update the internal wbt helpers accordingly, and
change rwb->sync_cookie to be the request pointer instead of the
issue_stat pointer. No functional change.

Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
08 May, 2018
2 commits
-
Commit 9c40cef2b799 ("sched: Move blk_schedule_flush_plug() out of
__schedule()") moved the blk_schedule_flush_plug() call out of the
interrupt/preempt disabled region in the scheduler. This allows to replace
local_irq_save/restore(flags) by local_irq_disable/enable() in
blk_flush_plug_list().

But it makes more sense to disable interrupts explicitly when the request
queue is locked and reenable them when the request queue is unlocked. This
shortens the interrupt disabled section which is important when the plug
list contains requests for more than one queue. The comment about
disabling interrupts around the loop is misleading as the called
functions can reenable interrupts unconditionally anyway and obfuscates the
scope badly:

local_irq_save(flags);
  spin_lock(q->queue_lock);
  ...
    queue_unplugged(q...);
      scsi_request_fn();
        spin_unlock_irq(q->queue_lock);

-------------------^^^ ????

        spin_lock_irq(q->queue_lock);
  spin_unlock(q->queue_lock);
local_irq_restore(flags);

Aside of that the detached interrupt disabling is a constant pain for
PREEMPT_RT as it requires patching and special casing when RT is enabled
while with the spin_*_irq() variants this happens automatically.

Signed-off-by: Thomas Gleixner
Cc: Peter Zijlstra
Cc: Tejun Heo
Cc: Jens Axboe
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20110622174919.025446432@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Jens Axboe
-
Commit 2fff8a924d4c ("block: Check locking assumptions at runtime") added a
lockdep_assert_held(q->queue_lock) which makes the WARN_ON() redundant
because lockdep will detect and warn about context violations.

The unconditional WARN_ON() does not provide real additional value, so it
can be removed.

Signed-off-by: Anna-Maria Gleixner
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Jens Axboe
17 Apr, 2018
1 commit
-
rq->gstate and rq->aborted_gstate both are zero before rqs are
allocated. If we have a small timeout, when the timer fires,
there could be rqs that were never allocated, and there could also
be an rq that has been allocated but not yet initialized and started. At
that moment, rq->gstate and rq->aborted_gstate are both 0, thus
blk_mq_terminate_expired will decide that the rq has timed out and
invoke .timeout early.

For scsi, this will cause scsi_times_out to be invoked before the
scsi_cmnd is initialized. scsi_cmnd->device is still NULL at
that moment, so we will crash.

Cc: Bart Van Assche
Cc: Tejun Heo
Cc: Ming Lei
Cc: Martin Steigerwald
Cc: stable@vger.kernel.org
Signed-off-by: Jianchao Wang
Signed-off-by: Jens Axboe
15 Apr, 2018
1 commit
-
When blk_queue_enter() waits for a queue to unfreeze, or unset the
PREEMPT_ONLY flag, do not allow it to be interrupted by a signal.

The PREEMPT_ONLY flag was introduced later in commit 3a0a529971ec
("block, scsi: Make SCSI quiesce and resume work reliably"). Note the SCSI
device is resumed asynchronously, i.e. after un-freezing userspace tasks.

So that commit exposed the bug as a regression in v4.15. A mysterious
SIGBUS (or -EIO) sometimes happened during the time the device was being
resumed. Most frequently, there was no kernel log message, and we saw Xorg
or Xwayland killed by SIGBUS.[1]

[1] E.g. https://bugzilla.redhat.com/show_bug.cgi?id=1553979

Without this fix, I get an IO error in this test:

# dd if=/dev/sda of=/dev/null iflag=direct & \
while killall -SIGUSR1 dd; do sleep 0.1; done & \
echo mem > /sys/power/state ; \
sleep 5; killall dd # stop after 5 seconds

The interruptible wait was added to blk_queue_enter in
commit 3ef28e83ab15 ("block: generic request_queue reference counting").
Before then, the interruptible wait was only in blk-mq, but I don't think
it could ever have been correct.

Reviewed-by: Bart Van Assche
Cc: stable@vger.kernel.org
Signed-off-by: Alan Jenkins
Signed-off-by: Jens Axboe
11 Apr, 2018
1 commit
-
Because blkcg_exit_queue() is now called from inside blk_cleanup_queue()
it is no longer safe to access cgroup information during or after the
blk_cleanup_queue() call. Hence protect the generic_make_request_checks()
call with blk_queue_enter() / blk_queue_exit().

Reported-by: Ming Lei
Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the block cgroup controller")
Signed-off-by: Bart Van Assche
Cc: Ming Lei
Cc: Joseph Qi
Signed-off-by: Jens Axboe