06 Oct, 2020
3 commits
-
Move blk_mq_sched_try_merge to blk-merge.c, which allows marking
a lot of the merge infrastructure static there.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
Also move the definition from the public blkdev.h to the private
block/blk.h header.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
Also move the definition from the public blkdev.h to the private
block/blk.h header.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
02 Sep, 2020
3 commits
-
We can trivially derive the gendisk from the hd_struct.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
There is a lot of duplicated code when trying to merge a bio from the
plug list and the sw queue, so introduce a new helper to attempt to
merge a bio, which simplifies blk_bio_list_merge()
and blk_attempt_plug_merge().
Reviewed-by: Christoph Hellwig
Signed-off-by: Baolin Wang
Signed-off-by: Jens Axboe -
Move blk_mq_bio_list_merge() into blk-merge.c and
rename it to a generic name.
Reviewed-by: Christoph Hellwig
Signed-off-by: Baolin Wang
Signed-off-by: Jens Axboe
17 Jul, 2020
1 commit
-
This patch improves discard bio splitting for address and size alignment in
__blkdev_issue_discard(). An aligned discard bio may help the underlying
device controller to perform better discard and internal garbage
collection, and avoid unnecessary internal fragmentation.

The current discard bio split algorithm in __blkdev_issue_discard() may leave
non-discarded fragments on the device even when the discard bio's LBA and size
are both aligned to the device's discard granularity.

Here are example steps to reproduce the above problem:
- On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
with thin mode and give it to a Linux virtual machine.
- Inside the Linux virtual machine, if the virtual disk shows up as
/dev/sdb, fill data into the first 50GB by,
# dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
- Discard the 50GB range from offset 0 on /dev/sdb,
# blkdiscard /dev/sdb -o 0 -l 53687091200
- Observe the underlying mapping status of the device
# sg_get_lba_status /dev/sdb -m 1048 --lba=0
descriptor LBA: 0x0000000000000000 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000000000800 blocks: 16773120 deallocated
descriptor LBA: 0x0000000000fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000001000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000017ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000001800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000001fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000002000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000027ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000002800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000002fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000003000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000037ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000003800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000003fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000004000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000047ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000004800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000004fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000005000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000057ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000005800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000005fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000006000000 blocks: 6291456 deallocated
descriptor LBA: 0x0000000006600000 blocks: 0 deallocated

Although the discard bio starts at LBA 0 and has a size of 50<<30 bytes,
which is aligned to the device's discard granularity, the above list shows
many still-mapped 2048-sector fragments. The relevant part of
__blkdev_issue_discard() is,
[snipped]
58 		sector_t req_sects = min_t(sector_t, nr_sects,
59 				bio_allowed_max_sectors(q));
60
61 		WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
62
63 		bio = blk_next_bio(bio, 0, gfp_mask);
64 bio->bi_iter.bi_sector = sector;
65 bio_set_dev(bio, bdev);
66 bio_set_op_attrs(bio, op, 0);
67
68 bio->bi_iter.bi_size = req_sects << 9;
69 sector += req_sects;
70 nr_sects -= req_sects;
[snipped]
79 }
80
81 *biop = bio;
82 return 0;
83 }
84 EXPORT_SYMBOL(__blkdev_issue_discard);

At line 58-59, to discard a 50GB range, req_sects is set to the return value
of bio_allowed_max_sectors(q), which is 8388607 sectors. In the above
case the discard granularity is 2048 sectors; although the start LBA
and discard length are aligned to the discard granularity, req_sects never
has a chance to be aligned to the discard granularity. This is why some
still-mapped 2048-sector fragments remain in every 4 or 8 GB range.

If req_sects at line 58 is set to a value aligned to discard_granularity
and close to UINT_MAX, then all subsequent split bios inside the device
driver are (almost) aligned to the discard_granularity of the device
queue, and the 2048-sector still-mapped fragments disappear.

This patch introduces bio_aligned_discard_max_sectors() to return the
value which is aligned to q->limits.discard_granularity and closest
to UINT_MAX, then replaces bio_allowed_max_sectors() with
this new routine to decide a more proper split bio length.

But we still need to handle the situation when the discard start LBA is not
aligned to q->limits.discard_granularity; otherwise, even when the length is
aligned, the current code may still leave 2048-sector fragments around every 4GB
range. Therefore, to calculate req_sects, first the start LBA of the
discard range is checked (including the partition offset); if it is not
aligned to the discard granularity, the first split location makes
sure the following bio has bi_sector aligned to the discard granularity. Then
there won't be still-mapped fragments in the middle of the discard range.
__blkdev_issue_discard(). Now with this patch, after a discard with the same
command line mentioned previously, sg_get_lba_status returns,
descriptor LBA: 0x0000000000000000 blocks: 106954752 deallocated
descriptor LBA: 0x0000000006600000 blocks: 0 deallocated

We can see there are no 2048-sector segments anymore; everything is clean.
Reported-and-tested-by: Acshai Manoj
Signed-off-by: Coly Li
Reviewed-by: Hannes Reinecke
Reviewed-by: Ming Lei
Reviewed-by: Xiao Ni
Cc: Bart Van Assche
Cc: Christoph Hellwig
Cc: Enzo Matsumiya
Cc: Jens Axboe
Signed-off-by: Jens Axboe
09 Jul, 2020
1 commit
-
Move the .nr_active update and request assignment into blk_mq_get_driver_tag(),
as both are best done while getting the driver tag.

Meanwhile, the blk-flush related code is simplified and the flush request no
longer needs to update the request table manually.
Signed-off-by: Ming Lei
Cc: Christoph Hellwig
Signed-off-by: Jens Axboe
02 Jul, 2020
1 commit
-
This reverts the following commits:
37f4a24c2469a10a4c16c641671bd766e276cf9f
723bf178f158abd1ce6069cb049581b3cb003aab
36a3df5a4574d5ddf59804fcd0c4e9654c514d9a

The last one is the culprit, but we have to go a bit deeper to get this
to revert cleanly. There's been a report that this breaks some MMC
setups [1], and also causes an issue with swap [2]. Until this can be
figured out, revert the offending commits.

[1] https://lore.kernel.org/linux-block/57fb09b1-54ba-f3aa-f82c-d709b0e6b281@samsung.com/
[2] https://lore.kernel.org/linux-block/20200702043721.GA1087@lca.pw/

Reported-by: Marek Szyprowski
Reported-by: Qian Cai
Signed-off-by: Jens Axboe
01 Jul, 2020
3 commits
-
The make_request_fn is a little weird in that it sits directly in
struct request_queue instead of an operation vector. Replace it with
a block_device_operations method called submit_bio (which describes much
better what it does). Also remove the request_queue argument to it, as
the queue can be derived pretty trivially from the bio.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
The queue can be trivially derived from the bio, so pass one less
argument.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
Move the .nr_active update and request assignment into blk_mq_get_driver_tag(),
as both are best done while getting the driver tag.

Meanwhile, the blk-flush related code is simplified and the flush request no
longer needs to update the request table manually.
Signed-off-by: Ming Lei
Cc: Christoph Hellwig
Signed-off-by: Jens Axboe
29 Jun, 2020
1 commit
-
blkcg_bio_issue_check is a giant inline function that does three entirely
different things. Factor out the blk-cgroup related bio initialization
into a new helper, and open code the sequence in the only caller,
relying on the fact that all the actual functionality is stubbed out for
non-cgroup builds.

Acked-by: Tejun Heo
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
24 Jun, 2020
2 commits
-
We were only creating the request_queue debugfs_dir
for make_request block drivers (multiqueue), but never for
request-based block drivers. We did this as we were only
creating non-blktrace additional debugfs files in that directory
for make_request drivers. However, since blktrace *always* creates
that directory anyway, we special-case the use of that directory
in blktrace. Other than being an eyesore, this exposes
request-based block drivers to the same fragile debugfs
race that used to exist with make_request block drivers:
if we start adding files to that directory, we can later
run into a race with a double removal of dentries in the directory
if we don't deal with this carefully in blktrace.

Instead, just simplify things by always creating the request_queue
debugfs_dir on request_queue registration. Also rename the mutex to
reflect the fact that it is used outside of the blktrace context.
Signed-off-by: Luis Chamberlain
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
Move the call to blk_should_fake_timeout out of blk_mq_complete_request
and into the drivers, skipping call sites that are obvious error
handlers, and remove the now superfluous blk_mq_force_complete_rq helper.
This ensures we don't keep injecting errors into completions that just
terminate the Linux request after the hardware has been reset or the
command has been aborted.

Reviewed-by: Daniel Wagner
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
05 Jun, 2020
1 commit
-
For optimized block readers not holding a mutex, the "number of sectors"
64-bit value is protected from tearing on 32-bit architectures by a
sequence counter.

Disable preemption before entering that sequence counter's write side
critical section. Otherwise, the read side can preempt the write side
section and spin for the entire scheduler tick. If the reader belongs to
a real-time scheduling class, it can spin forever and the kernel will
livelock.

Fixes: c83f6bf98dc1 ("block: add partition resize function to blkpg ioctl")
Cc:
Signed-off-by: Ahmed S. Darwish
Reviewed-by: Sebastian Andrzej Siewior
Signed-off-by: Jens Axboe
30 May, 2020
1 commit
-
After the commit 5addeae1bedc4 ("blk-cgroup: remove blkcg_drain_queue"),
there is no caller of blk_throtl_drain, so let's remove it.
Signed-off-by: Guoqing Jiang
Signed-off-by: Jens Axboe
27 May, 2020
3 commits
-
Move the non-"new_io" branch of blk_account_io_start() into a separate
function. Fix merge accounting for discards (they were counted as write
merges).

The new blk_account_io_merge_bio() doesn't call update_io_ticks(), unlike
blk_account_io_start(), as there is no reason for that.

[hch: rebased]
Signed-off-by: Konstantin Khlebnikov
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
percpu variables have a perfectly fine working stub implementation
for UP kernels, so use that.
Signed-off-by: Christoph Hellwig
Reviewed-by: Konstantin Khlebnikov
Signed-off-by: Jens Axboe -
All callers are in blk-core.c, so move update_io_ticks over.
Signed-off-by: Christoph Hellwig
Reviewed-by: Konstantin Khlebnikov
Signed-off-by: Jens Axboe
19 May, 2020
4 commits
-
The flush_queue_delayed flag was introduced to hold the queue if a flush is
running for a non-queueable flush drive by commit 3ac0cc450870
("hold queue if flush is running for non-queueable flush drive"),
but the non-mq parts of the flush code were removed by
commit 7e992f847a08 ("block: remove non mq parts from the flush code"),
along with the usage of the flush_queue_delayed flag.
Thus remove the unused flush_queue_delayed flag.
Signed-off-by: Baolin Wang
Reviewed-by: Ming Lei
Signed-off-by: Jens Axboe -
part_inc_in_flight and part_dec_in_flight only have one caller each, and
those callers are purely for bio based drivers. Merge each function into
the only caller, and remove the superfluous blk-mq checks.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
blk_mq_make_request currently needs to grab a q_usage_counter
reference when allocating a request. This is because the block layer
grabs one before calling blk_mq_make_request, but also releases it as
soon as blk_mq_make_request returns. Remove the blk_queue_exit call
after blk_mq_make_request returns, and instead let it consume the
reference. This works perfectly fine for the block layer caller; only
device mapper needs an extra reference, as the old problem still
persists there. Open code blk_queue_enter_live in device mapper,
as there should be no other callers, and this allows better documenting
why we do a non-try get.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
14 May, 2020
1 commit
-
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manage encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.

We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.

We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.

Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.

Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala
Reviewed-by: Eric Biggers
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe
13 May, 2020
2 commits
-
Rename __bio_add_pc_page() to bio_add_hw_page() and explicitly pass in a
max_sectors argument.

This max_sectors argument can be used to specify constraints from the
hardware.
Signed-off-by: Christoph Hellwig
[ jth: rebased and made public for blk-map.c ]
Signed-off-by: Johannes Thumshirn
Reviewed-by: Daniel Wagner
Reviewed-by: Martin K. Petersen
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe -
The gendisk can't go away while there is IO activity, so don't hold
part0's refcount in the IO path.
Signed-off-by: Ming Lei
Reviewed-by: Christoph Hellwig
Cc: Yufen Yu
Cc: Christoph Hellwig
Cc: Hou Tao
Signed-off-by: Jens Axboe
25 Apr, 2020
1 commit
-
create_io_context just has a single caller, which also happens to not
even use the return value. Just open code it there.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
21 Apr, 2020
4 commits
-
The function has a single caller, so just open code it.
Signed-off-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
Signed-off-by: Jens Axboe -
Move hd_ref_init out of line, as it isn't anywhere near a fast path,
and rename the rcu ref freeing callbacks to be more descriptive.
Signed-off-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
Signed-off-by: Jens Axboe -
All callers have the hd_struct at hand, so pass it instead of performing
another lookup.
Signed-off-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
Signed-off-by: Jens Axboe -
Split each sub-command out into a separate helper, and move those helpers
to block/partitions/core.c instead of having a lot of partition
manipulation logic open coded in block/ioctl.c.
Signed-off-by: Christoph Hellwig
28 Mar, 2020
2 commits
-
The bio_map_* helpers are just the low-level helpers for the
blk_rq_map_* APIs. Move them together for better logical grouping,
as there isn't much overlap with other code in bio.c.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
Current make_request based drivers use either blk_alloc_queue_node or
blk_alloc_queue to allocate a queue, and then set up the make_request_fn
function pointer and a few parameters using the blk_queue_make_request
helper. Simplify this by passing the make_request pointer to
blk_alloc_queue, and while at it merge the _node variant into the main
helper by always passing a node_id, and remove the superfluous gfp_mask
parameter. A lower-level __blk_alloc_queue is kept for the blk-mq case.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
25 Mar, 2020
2 commits
-
These macros are just used by a few files. Move them out of genhd.h,
which is included everywhere, into a new standalone header.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
None of this needs to be exposed to drivers.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
24 Mar, 2020
1 commit
-
Move the sysfs _show methods that are used both on the full disk and
partition nodes to genhd.c instead of hiding them in the partitioning
code. Also move the declaration for these methods to block/blk.h so
that we don't expose them to drivers.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
12 Mar, 2020
1 commit
-
Remove 'q' from arguments since it is not used anymore after
commit 7e992f847a08e ("block: remove non mq parts from the
flush code").
Signed-off-by: Guoqing Jiang
Reviewed-by: Nikolay Borisov
Reviewed-by: Bart Van Assche
Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Jens Axboe
21 Dec, 2019
1 commit
-
Avoid that running test nvme/012 from the blktests suite triggers the
following false positive lockdep complaint:

============================================
WARNING: possible recursive locking detected
5.0.0-rc3-xfstests-00015-g1236f7d60242 #841 Not tainted
--------------------------------------------
ksoftirqd/1/16 is trying to acquire lock:
000000000282032e (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

but task is already holding lock:
00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

other info that might help us debug this:
Possible unsafe locking scenario:

CPU0
----
lock(&(&fq->mq_flush_lock)->rlock);
lock(&(&fq->mq_flush_lock)->rlock);

*** DEADLOCK ***

May be due to missing lock nesting notation

1 lock held by ksoftirqd/1/16:
#0: 00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

stack backtrace:
CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.0.0-rc3-xfstests-00015-g1236f7d60242 #841
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
dump_stack+0x67/0x90
__lock_acquire.cold.45+0x2b4/0x313
lock_acquire+0x98/0x160
_raw_spin_lock_irqsave+0x3b/0x80
flush_end_io+0x4e/0x1d0
blk_mq_complete_request+0x76/0x110
nvmet_req_complete+0x15/0x110 [nvmet]
nvmet_bio_done+0x27/0x50 [nvmet]
blk_update_request+0xd7/0x2d0
blk_mq_end_request+0x1a/0x100
blk_flush_complete_seq+0xe5/0x350
flush_end_io+0x12f/0x1d0
blk_done_softirq+0x9f/0xd0
__do_softirq+0xca/0x440
run_ksoftirqd+0x24/0x50
smpboot_thread_fn+0x113/0x1e0
kthread+0x121/0x140
ret_from_fork+0x3a/0x50

Cc: Christoph Hellwig
Cc: Ming Lei
Cc: Hannes Reinecke
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe
06 Dec, 2019
1 commit
-
Commit 7c20f11680a4 ("bio-integrity: stop abusing bi_end_io") moved
bio_integrity_free from bio_uninit() to bio_integrity_verify_fn()
and bio_endio(). This looks wrong because a bio may be freed
without calling bio_endio(); for example, blk_rq_unprep_clone() is
called from dm_mq_queue_rq() when the underlying queue of dm-mpath
is busy.

So a memory leak of bio integrity data is caused by commit 7c20f11680a4.
Fix this issue by re-adding bio_integrity_free() to bio_uninit().

Fixes: 7c20f11680a4 ("bio-integrity: stop abusing bi_end_io")
Reviewed-by: Christoph Hellwig
Signed-off-by: Justin Tee

[Add commit log, and simplify/fix the original patch written by Justin.]
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe