13 Jan, 2021
3 commits
-
[ Upstream commit 52abca64fd9410ea6c9a3a74eab25663b403d7da ]
blk_queue_enter() accepts BLK_MQ_REQ_PM requests independent of the runtime
power management state. Now that SCSI domain validation no longer depends
on this behavior, modify the behavior of blk_queue_enter() as follows:- Do not accept any requests while suspended.
- Only process power management requests while suspending or resuming.
Submitting BLK_MQ_REQ_PM requests to a device that is runtime suspended
causes runtime-suspended devices not to resume as they should. The request
which should cause a runtime resume instead gets issued directly, without
resuming the device first. Of course the device can't handle it properly,
the I/O fails, and the device remains suspended.The problem is fixed by checking that the queue's runtime-PM status isn't
RPM_SUSPENDED before allowing a request to be issued, and queuing a
runtime-resume request if it is. In particular, the inline
blk_pm_request_resume() routine is renamed blk_pm_resume_queue() and the
code is unified by merging the surrounding checks into the routine. If the
queue isn't set up for runtime PM, or there currently is no restriction on
allowed requests, the request is allowed. Likewise if the BLK_MQ_REQ_PM
flag is set and the status isn't RPM_SUSPENDED. Otherwise a runtime resume
is queued and the request is blocked until conditions are more suitable.[ bvanassche: modified commit message and removed Cc: stable because
without the previous patches from this series this patch would break
parallel SCSI domain validation + introduced queue_rpm_status() ]Link: https://lore.kernel.org/r/20201209052951.16136-9-bvanassche@acm.org
Cc: Jens Axboe
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Can Guo
Cc: Stanley Chu
Cc: Ming Lei
Cc: Rafael J. Wysocki
Reported-and-tested-by: Martin Kepplinger
Reviewed-by: Hannes Reinecke
Reviewed-by: Can Guo
Signed-off-by: Alan Stern
Signed-off-by: Bart Van Assche
Signed-off-by: Martin K. Petersen
Signed-off-by: Sasha Levin -
[ Upstream commit a4d34da715e3cb7e0741fe603dcd511bed067e00 ]
Remove flag RQF_PREEMPT and BLK_MQ_REQ_PREEMPT since these are no longer
used by any kernel code.Link: https://lore.kernel.org/r/20201209052951.16136-8-bvanassche@acm.org
Cc: Can Guo
Cc: Stanley Chu
Cc: Alan Stern
Cc: Ming Lei
Cc: Rafael J. Wysocki
Cc: Martin Kepplinger
Reviewed-by: Christoph Hellwig
Reviewed-by: Hannes Reinecke
Reviewed-by: Jens Axboe
Reviewed-by: Can Guo
Signed-off-by: Bart Van Assche
Signed-off-by: Martin K. Petersen
Signed-off-by: Sasha Levin -
[ Upstream commit 0854bcdcdec26aecdc92c303816f349ee1fba2bc ]
Introduce the BLK_MQ_REQ_PM flag. This flag makes the request allocation
functions set RQF_PM. This is the first step towards removing
BLK_MQ_REQ_PREEMPT.Link: https://lore.kernel.org/r/20201209052951.16136-3-bvanassche@acm.org
Cc: Alan Stern
Cc: Stanley Chu
Cc: Ming Lei
Cc: Rafael J. Wysocki
Cc: Can Guo
Reviewed-by: Christoph Hellwig
Reviewed-by: Hannes Reinecke
Reviewed-by: Jens Axboe
Reviewed-by: Can Guo
Signed-off-by: Bart Van Assche
Signed-off-by: Martin K. Petersen
Signed-off-by: Sasha Levin
14 Oct, 2020
1 commit
-
A zoned device with limited resources to open or activate zones may
return an error when the host exceeds those limits. The same command may
be successful if retried later, but the host needs to wait for specific
zone states before it should expect a retry to succeed. Have the block
layer provide an appropriate status for these conditions so applications
can distinuguish this error for special handling.Cc: linux-api@vger.kernel.org
Cc: Niklas Cassel
Reviewed-by: Christoph Hellwig
Reviewed-by: Damien Le Moal
Reviewed-by: Johannes Thumshirn
Reviewed-by: Martin K. Petersen
Signed-off-by: Keith Busch
Signed-off-by: Jens Axboe
09 Oct, 2020
1 commit
-
syzbot is reporting unkillable task [1], for the caller is failing to
handle a corrupted filesystem image which attempts to access beyond
the end of the device. While we need to fix the caller, flooding the
console with handle_bad_sector() message is unlikely useful.[1] https://syzkaller.appspot.com/bug?id=f1f49fb971d7a3e01bd8ab8cff2ff4572ccf3092
Signed-off-by: Tetsuo Handa
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe
06 Oct, 2020
1 commit
-
blk_crypto_rq_bio_prep() assumes its gfp_mask argument always includes
__GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed.However, blk_crypto_rq_bio_prep() might be called with GFP_ATOMIC via
setup_clone() in drivers/md/dm-rq.c.This case isn't currently reachable with a bio that actually has an
encryption context. However, it's fragile to rely on this. Just make
blk_crypto_rq_bio_prep() able to fail.Suggested-by: Satya Tangirala
Signed-off-by: Eric Biggers
Reviewed-by: Mike Snitzer
Reviewed-by: Satya Tangirala
Cc: Miaohe Lin
Signed-off-by: Jens Axboe
25 Sep, 2020
3 commits
-
Add QUEUE_FLAG_NOWAIT to allow a block device to advertise support for
REQ_NOWAIT. Bio-based devices may set QUEUE_FLAG_NOWAIT where
applicable.Update QUEUE_FLAG_MQ_DEFAULT to include QUEUE_FLAG_NOWAIT. Also
update submit_bio_checks() to verify it is set for REQ_NOWAIT bios.Reported-by: Konstantin Khlebnikov
Suggested-by: Christoph Hellwig
Signed-off-by: Mike Snitzer
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
Just checking SB_I_CGROUPWB for cgroup writeback support is enough.
Either the file system allocates its own bdi (e.g. btrfs), in which case
it is known to support cgroup writeback, or the bdi comes from the block
layer, which always supports cgroup writeback.Signed-off-by: Christoph Hellwig
Reviewed-by: Jan Kara
Reviewed-by: Johannes Thumshirn
Signed-off-by: Jens Axboe -
Set up a readahead size by default, as very few users have a good
reason to change it. This means code, ecryptfs, and orangefs now
set up the values while they were previously missing it, while ubifs,
mtd and vboxsf manually set it to 0 to avoid readahead.Signed-off-by: Christoph Hellwig
Reviewed-by: Jan Kara
Acked-by: David Sterba [btrfs]
Acked-by: Richard Weinberger [ubifs, mtd]
Signed-off-by: Jens Axboe
12 Sep, 2020
1 commit
-
These functions can be used to enable iostat for partitions on devices
like md, bcache.Signed-off-by: Song Liu
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe
04 Sep, 2020
1 commit
-
The per-hctx nr_active value can no longer be used to fairly assign a share
of tag depth per request queue for when using a shared sbitmap, as it does
not consider that the tags are shared tags over all hctx's.For this case, record the nr_active_requests per request_queue, and make
the judgement based on that value.Co-developed-with: Kashyap Desai
Signed-off-by: John Garry
Tested-by: Don Brace #SCSI resv cmds patches used
Tested-by: Douglas Gilbert
Signed-off-by: Jens Axboe
02 Sep, 2020
4 commits
-
If WRITE_ZERO/WRITE_SAME operation is not supported by the storage,
blk_cloned_rq_check_limits() will return IO error which will cause
device-mapper to fail the paths.Instead, if the queue limit is set to 0, return BLK_STS_NOTSUPP.
BLK_STS_NOTSUPP will be ignored by device-mapper and will not fail the
paths.Suggested-by: Martin K. Petersen
Signed-off-by: Ritika Srivastava
Reviewed-by: Martin K. Petersen
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
Replace returning legacy errno codes with blk_status_t in
blk_cloned_rq_check_limits().Signed-off-by: Ritika Srivastava
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
Replace various magic -1 constants for tags with BLK_MQ_NO_TAG.
Signed-off-by: Xianting Tian
Signed-off-by: Jens Axboe -
It's better to move bio merge related functions into blk-merge.c,
which contains all merge related functions.Reviewed-by: Christoph Hellwig
Signed-off-by: Baolin Wang
Signed-off-by: Jens Axboe
01 Sep, 2020
1 commit
-
If a driver leaves the limit settings as the defaults, then we don't
initialize bdi->io_pages. This means that file systems may need to
work around bdi->io_pages == 0, which is somewhat messy.Initialize the default value just like we do for ->ra_pages.
Cc: stable@vger.kernel.org
Fixes: 9491ae4aade6 ("mm: don't cap request size based on read-ahead setting")
Reported-by: OGAWA Hirofumi
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe
04 Aug, 2020
1 commit
-
Pull io_uring updates from Jens Axboe:
"Lots of cleanups in here, hardening the code and/or making it easier
to read and fixing bugs, but a core feature/change too adding support
for real async buffered reads. With the latter in place, we just need
buffered write async support and we're done relying on kthreads for
the fast path. In detail:- Cleanup how memory accounting is done on ring setup/free (Bijan)
- sq array offset calculation fixup (Dmitry)
- Consistently handle blocking off O_DIRECT submission path (me)
- Support proper async buffered reads, instead of relying on kthread
offload for that. This uses the page waitqueue to drive retries
from task_work, like we handle poll based retry. (me)- IO completion optimizations (me)
- Fix race with accounting and ring fd install (me)
- Support EPOLLEXCLUSIVE (Jiufei)
- Get rid of the io_kiocb unionizing, made possible by shrinking
other bits (Pavel)- Completion side cleanups (Pavel)
- Cleanup REQ_F_ flags handling, and kill off many of them (Pavel)
- Request environment grabbing cleanups (Pavel)
- File and socket read/write cleanups (Pavel)
- Improve kiocb_set_rw_flags() (Pavel)
- Tons of fixes and cleanups (Pavel)
- IORING_SQ_NEED_WAKEUP clear fix (Xiaoguang)"
* tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block: (127 commits)
io_uring: flip if handling after io_setup_async_rw
fs: optimise kiocb_set_rw_flags()
io_uring: don't touch 'ctx' after installing file descriptor
io_uring: get rid of atomic FAA for cq_timeouts
io_uring: consolidate *_check_overflow accounting
io_uring: fix stalled deferred requests
io_uring: fix racy overflow count reporting
io_uring: deduplicate __io_complete_rw()
io_uring: de-unionise io_kiocb
io-wq: update hash bits
io_uring: fix missing io_queue_linked_timeout()
io_uring: mark ->work uninitialised after cleanup
io_uring: deduplicate io_grab_files() calls
io_uring: don't do opcode prep twice
io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works
io_uring: batch put_task_struct()
tasks: add put_task_struct_many()
io_uring: return locked and pinned page accounting
io_uring: don't miscount pinned memory
io_uring: don't open-code recv kbuf managment
...
08 Jul, 2020
1 commit
-
If blk_mq_submit_bio flushes the plug list, bios for other disks can
show up on current->bio_list. As that doesn't involve any stacking of
block device it is entirely harmless and we should not warn about
this case.Fixes: ff93ea0ce763 ("block: shortcut __submit_bio_noacct for blk-mq drivers")
Reported-by: kernel test robot
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
03 Jul, 2020
1 commit
-
bio_alloc_bioset references current->bio_list[1], so we need to
initialize it for the blk-mq submission path as well.Fixes: ff93ea0ce763 ("block: shortcut __submit_bio_noacct for blk-mq drivers")
Reported-by: Qian Cai
Reported-by: Naresh Kamboju
Signed-off-by: Christoph Hellwig
Tested-by: Naresh Kamboju
Signed-off-by: Jens Axboe
01 Jul, 2020
8 commits
-
Now that submit_bio_noacct has a decent blk-mq fast path there is no
more need for this bypass.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
For blk-mq drivers bios can only be inserted for the same queue. So
bypass the complicated sorting logic in __submit_bio_noacct with
a blk-mq simpler submission helper.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
Split out a __submit_bio_noacct helper for the actual de-recursion
algorithm, and simplify the loop by using a continue when we can't
enter the queue for a bio.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
generic_make_request has always been very confusingly misnamed, so rename
it to submit_bio_noacct to make it clear that it is submit_bio minus
accounting and a few checks.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
The make_request_fn is a little weird in that it sits directly in
struct request_queue instead of an operation vector. Replace it with
a block_device_operations method called submit_bio (which describes much
better what it does). Also remove the request_queue argument to it, as
the queue can be derived pretty trivially from the bio.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
The variable is only used once, so just open code the bio_sector()
there.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
All registers disks must have a valid queue pointer, so don't bother to
log a warning for that case.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
The "generic_make_request: " prefix has no value, and will soon become
stale.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
29 Jun, 2020
1 commit
-
blkcg_bio_issue_check is a giant inline function that does three entirely
different things. Factor out the blk-cgroup related bio initalization
into a new helper, and the open code the sequence in the only caller,
relying on the fact that all the actual functionality is stubbed out for
non-cgroup builds.Acked-by: Tejun Heo
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
24 Jun, 2020
4 commits
-
We were only creating the request_queue debugfs_dir only
for make_request block drivers (multiqueue), but never for
request-based block drivers. We did this as we were only
creating non-blktrace additional debugfs files on that directory
for make_request drivers. However, since blktrace *always* creates
that directory anyway, we special-case the use of that directory
on blktrace. Other than this being an eye-sore, this exposes
request-based block drivers to the same debugfs fragile
race that used to exist with make_request block drivers
where if we start adding files onto that directory we can later
run a race with a double removal of dentries on the directory
if we don't deal with this carefully on blktrace.Instead, just simplify things by always creating the request_queue
debugfs_dir on request_queue registration. Rename the mutex also to
reflect the fact that this is used outside of the blktrace context.Signed-off-by: Luis Chamberlain
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
Commit dc9edc44de6c ("block: Fix a blk_exit_rl() regression") merged on
v4.12 moved the work behind blk_release_queue() into a workqueue after a
splat floated around which indicated some work on blk_release_queue()
could sleep in blk_exit_rl(). This splat would be possible when a driver
called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue()
as its final call) from an atomic context.blk_put_queue() decrements the refcount for the request_queue kobject, and
upon reaching 0 blk_release_queue() is called. Although blk_exit_rl() is
now removed through commit db6d99523560 ("block: remove request_list code")
on v5.0, we reserve the right to be able to sleep within
blk_release_queue() context.The last reference for the request_queue must not be called from atomic
context. *When* the last reference to the request_queue reaches 0 varies,
and so let's take the opportunity to document when that is expected to
happen and also document the context of the related calls as best as
possible so we can avoid future issues, and with the hopes that the
synchronous request_queue removal sticks.We revert back to synchronous request_queue removal because asynchronous
removal creates a regression with expected userspace interaction with
several drivers. An example is when removing the loopback driver, one
uses ioctls from userspace to do so, but upon return and if successful,
one expects the device to be removed. Likewise if one races to add another
device the new one may not be added as it is still being removed. This was
expected behavior before and it now fails as the device is still present
and busy still. Moving to asynchronous request_queue removal could have
broken many scripts which relied on the removal to have been completed if
there was no error. Document this expectation as well so that this
doesn't regress userspace again.Using asynchronous request_queue removal however has helped us find
other bugs. In the future we can test what could break with this
arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE.While at it, update the docs with the context expectations for the
request_queue / gendisk refcount decrement, and make these
expectations explicit by using might_sleep().Fixes: dc9edc44de6c ("block: Fix a blk_exit_rl() regression")
Suggested-by: Nicolai Stange
Signed-off-by: Luis Chamberlain
Reviewed-by: Christoph Hellwig
Reviewed-by: Bart Van Assche
Cc: Bart Van Assche
Cc: Omar Sandoval
Cc: Hannes Reinecke
Cc: Nicolai Stange
Cc: Greg Kroah-Hartman
Cc: Michal Hocko
Cc: yu kuai
Signed-off-by: Jens Axboe -
Let us clarify the context under which the helpers to increment the
refcount for the gendisk and request_queue can be called under. We
make this explicit on the places where we may sleep with might_sleep().We don't address the decrement context yet, as that needs some extra
work and fixes, but will be addressed in the next patch.Signed-off-by: Luis Chamberlain
Reviewed-by: Christoph Hellwig
Reviewed-by: Bart Van Assche
Signed-off-by: Jens Axboe -
This adds documentation for the gendisk / request_queue refcount
helpers.Signed-off-by: Luis Chamberlain
Reviewed-by: Christoph Hellwig
Reviewed-by: Bart Van Assche
Signed-off-by: Jens Axboe
22 Jun, 2020
1 commit
-
Provide a way for the caller to specify that IO should be marked
with REQ_NOWAIT to avoid blocking on allocation.Signed-off-by: Jens Axboe
03 Jun, 2020
2 commits
-
Pull block updates from Jens Axboe:
"Core block changes that have been queued up for this release:- Remove dead blk-throttle and blk-wbt code (Guoqing)
- Include pid in blktrace note traces (Jan)
- Don't spew I/O errors on wouldblock termination (me)
- Zone append addition (Johannes, Keith, Damien)
- IO accounting improvements (Konstantin, Christoph)
- blk-mq hardware map update improvements (Ming)
- Scheduler dispatch improvement (Salman)
- Inline block encryption support (Satya)
- Request map fixes and improvements (Weiping)
- blk-iocost tweaks (Tejun)
- Fix for timeout failing with error injection (Keith)
- Queue re-run fixes (Douglas)
- CPU hotplug improvements (Christoph)
- Queue entry/exit improvements (Christoph)
- Move DMA drain handling to the few drivers that use it (Christoph)
- Partition handling cleanups (Christoph)"
* tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block: (127 commits)
block: mark bio_wouldblock_error() bio with BIO_QUIET
blk-wbt: rename __wbt_update_limits to wbt_update_limits
blk-wbt: remove wbt_update_limits
blk-throttle: remove tg_drain_bios
blk-throttle: remove blk_throtl_drain
null_blk: force complete for timeout request
blk-mq: drain I/O when all CPUs in a hctx are offline
blk-mq: add blk_mq_all_tag_iter
blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx
blk-mq: use BLK_MQ_NO_TAG in more places
blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG
blk-mq: move more request initialization to blk_mq_rq_ctx_init
blk-mq: simplify the blk_mq_get_request calling convention
blk-mq: remove the bio argument to ->prepare_request
nvme: force complete cancelled requests
blk-mq: blk-mq: provide forced completion method
block: fix a warning when blkdev.h is included for !CONFIG_BLOCK builds
block: blk-crypto-fallback: remove redundant initialization of variable err
block: reduce part_stat_lock() scope
block: use __this_cpu_add() instead of access by smp_processor_id()
... -
Patch series "Change readahead API", v11.
This series adds a readahead address_space operation to replace the
readpages operation. The key difference is that pages are added to the
page cache as they are allocated (and then looked up by the filesystem)
instead of passing them on a list to the readpages operation and having
the filesystem add them to the page cache. It's a net reduction in code
for each implementation, more efficient than walking a list, and solves
the direct-write vs buffered-read problem reported by yu kuai at
http://lkml.kernel.org/r/20200116063601.39201-1-yukuai3@huawei.comThe only unconverted filesystems are those which use fscache. Their
conversion is pending Dave Howells' rewrite which will make the
conversion substantially easier. This should be completed by the end of
the year.I want to thank the reviewers/testers; Dave Chinner, John Hubbard, Eric
Biggers, Johannes Thumshirn, Dave Sterba, Zi Yan, Christoph Hellwig and
Miklos Szeredi have done a marvellous job of providing constructive
criticism.These patches pass an xfstests run on ext4, xfs & btrfs with no
regressions that I can tell (some of the tests seem a little flaky
before and remain flaky afterwards).This patch (of 25):
The readahead code is part of the page cache so should be found in the
pagemap.h file. force_page_cache_readahead is only used within mm, so
move it to mm/internal.h instead. Remove the parameter names where they
add no value, and rename the ones which were actively misleading.Signed-off-by: Matthew Wilcox (Oracle)
Signed-off-by: Andrew Morton
Reviewed-by: John Hubbard
Reviewed-by: Christoph Hellwig
Reviewed-by: William Kucharski
Reviewed-by: Johannes Thumshirn
Cc: Chao Yu
Cc: Cong Wang
Cc: Darrick J. Wong
Cc: Dave Chinner
Cc: Eric Biggers
Cc: Gao Xiang
Cc: Jaegeuk Kim
Cc: Joseph Qi
Cc: Junxiao Bi
Cc: Michal Hocko
Cc: Zi Yan
Cc: Miklos Szeredi
Link: http://lkml.kernel.org/r/20200414150233.24495-1-willy@infradead.org
Link: http://lkml.kernel.org/r/20200414150233.24495-2-willy@infradead.org
Signed-off-by: Linus Torvalds
29 May, 2020
1 commit
-
This reverts commit c58c1f83436b501d45d4050fd1296d71a9760bcb.
io_uring does do the right thing for this case, and we're still returning
-EAGAIN to userspace for the cases we don't support. Revert this change
to avoid doing endless spins of resubmits.Cc: stable@vger.kernel.org # v5.6
Reported-by: Bijan Mottahedeh
Signed-off-by: Jens Axboe
27 May, 2020
4 commits
-
We only need the stats lock (aka preempt_disable()) for updating the
states, not for looking up or dropping the hd_struct reference.Signed-off-by: Christoph Hellwig
Reviewed-by: Konstantin Khlebnikov
Signed-off-by: Jens Axboe -
Move the non-"new_io" branch of blk_account_io_start() into separate
function. Fix merge accounting for discards (they were counted as write
merges).The new blk_account_io_merge_bio() doesn't call update_io_ticks() unlike
blk_account_io_start(), as there is no reason for that.[hch: rebased]
Signed-off-by: Konstantin Khlebnikov
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
All callers are in blk-core.c, so move update_io_ticks over.
Signed-off-by: Christoph Hellwig
Reviewed-by: Konstantin Khlebnikov
Signed-off-by: Jens Axboe -
Add two new helpers to simplify I/O accounting for bio based drivers.
Currently these drivers use the generic_start_io_acct and
generic_end_io_acct helpers which have very cumbersome calling
conventions, don't actually return the time they started accounting,
and try to deal with accounting for partitions, which can't happen
for bio based drivers. The new helpers will be used to subsequently
replace uses of the old helpers.The main API is the bio based wrappes in blkdev.h, but for zram
which wants to account rw_page based I/O lower level routines are
provided as well.Signed-off-by: Christoph Hellwig
Reviewed-by: Konstantin Khlebnikov
Signed-off-by: Jens Axboe