22 May, 2020

1 commit


19 May, 2020

13 commits


17 May, 2020

2 commits

  • Currently, informational messages within the block trace do not include
    the PID of the process reporting the message. With BFQ it is sometimes
    useful to have this information, and there's no good reason to omit it
    from the trace. So just fill in the pid when generating a note message.
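
    A minimal sketch of the idea; trace_note()'s exact argument list is an
    assumption based on kernel/trace/blktrace.c and is not verified here:

        /*
         * Hedged sketch: when emitting a BLK_TN_MESSAGE note, pass the
         * reporting task's pid instead of 0 so it shows up in the trace.
         */
        trace_note(bt, current->pid, BLK_TN_MESSAGE, buf, n, cgid);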

    Signed-off-by: Jan Kara
    Reviewed-by: Chaitanya Kulkarni
    Acked-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Signed-off-by: Christoph Hellwig
    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

14 May, 2020

7 commits

  • Blk-crypto delegates crypto operations to inline encryption hardware
    when available. The separately configurable blk-crypto-fallback contains
    a software fallback to the kernel crypto API - when enabled, blk-crypto
    will use this fallback for en/decryption when inline encryption hardware
    is not available.

    This means upper layers do not have to worry about whether the underlying
    device supports inline encryption before specifying an encryption context
    for a bio. It also allows for testing
    without actual inline encryption hardware - in particular, it makes it
    possible to test the inline encryption code in ext4 and f2fs simply by
    running xfstests with the inlinecrypt mount option, which in turn allows
    for things like the regular upstream regression testing of ext4 to cover
    the inline encryption code paths.

    For more details, refer to Documentation/block/inline-encryption.rst.
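
    A hedged sketch of the dispatch decision; the check helper here is
    illustrative, and blk_crypto_fallback_bio_prep() is assumed from the
    fallback code, not quoted from it:

        /*
         * Hedged sketch, not the actual blk-crypto code: if the queue can
         * handle the bio's crypto configuration in hardware, let the inline
         * encryption hardware do the work; otherwise hand the bio to the
         * software fallback built on the kernel crypto API.
         */
        if (hw_supports_bio_crypt_ctx(q, bio))          /* illustrative check */
                return true;                            /* hardware path */
        return blk_crypto_fallback_bio_prep(&bio);      /* software fallback */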

    Signed-off-by: Satya Tangirala
    Reviewed-by: Eric Biggers
    Signed-off-by: Jens Axboe

    Satya Tangirala
     
  • Whenever a device supports blk-integrity, make the kernel pretend that
    the device doesn't support inline encryption (essentially by setting the
    keyslot manager in the request queue to NULL).

    There's no hardware currently that supports both integrity and inline
    encryption. However, it seems possible that there will be such hardware
    in the near future (like the NVMe key per I/O support that might support
    both inline encryption and PI).

    But properly integrating both features is not trivial, and without
    real hardware that implements both, it is difficult to tell whether the
    majority of hardware that supports both will get it right.
    So it seems best not to support both features together right now, and
    to decide what to do at probe time.
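
    A minimal sketch of that probe-time decision; the placement and the
    warning text are assumptions, only the "set the keyslot manager to NULL"
    part comes from the description above:

        /*
         * Hedged sketch: when an integrity profile is registered for a disk,
         * pretend the device has no inline encryption hardware by dropping
         * the request queue's keyslot manager.
         */
        #ifdef CONFIG_BLK_INLINE_ENCRYPTION
                if (disk->queue->ksm) {
                        pr_warn("%s: integrity and inline encryption not supported together, disabling inline encryption\n",
                                disk->disk_name);
                        disk->queue->ksm = NULL;
                }
        #endif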

    Signed-off-by: Satya Tangirala
    Reviewed-by: Eric Biggers
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Satya Tangirala
     
  • We must have some way of letting a storage device driver know what
    encryption context it should use for en/decrypting a request. However,
    it's the upper layers (like the filesystem/fscrypt) that know about and
    manage encryption contexts. As such, when the upper layer submits a bio
    to the block layer, and this bio eventually reaches a device driver with
    support for inline encryption, the device driver will need to have been
    told the encryption context for that bio.

    We want to communicate the encryption context from the upper layer to the
    storage device along with the bio, when the bio is submitted to the block
    layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
    represent an encryption context (note that we can't use the bi_private
    field in struct bio for this, because that field is not meant to pass
    information across layers in the storage stack). We also introduce various
    functions to manipulate the bio_crypt_ctx and make the bio/request merging
    logic aware of the bio_crypt_ctx.

    We also make changes to blk-mq to make it handle bios with encryption
    contexts. blk-mq can merge many bios into the same request. These bios need
    to have contiguous data unit numbers (the necessary changes to blk-merge
    are also made to ensure this) - as such, it suffices to keep the data unit
    number of just the first bio, since that's all a storage driver needs to
    infer the data unit number to use for each data block in each bio in a
    request. blk-mq keeps track of the encryption context to be used for all
    the bios in a request with the request's rq_crypt_ctx. When the first bio
    is added to an empty request, blk-mq will program the encryption context
    of that bio into the request_queue's keyslot manager, and store the
    returned keyslot in the request's rq_crypt_ctx. All the functions to
    operate on encryption contexts are in blk-crypto.c.

    Upper layers only need to call bio_crypt_set_ctx with the encryption key,
    algorithm and data_unit_num; they don't have to worry about getting a
    keyslot for each encryption context, as blk-mq/blk-crypto handles that.
    Blk-crypto also makes it possible for request-based layered devices like
    dm-rq to make use of inline encryption hardware by cloning the
    rq_crypt_ctx and programming a keyslot in the new request_queue when
    necessary.

    Note that any user of the block layer can submit bios with an
    encryption context, such as filesystems, device-mapper targets, etc.
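
    A hedged sketch of how an upper layer might attach an encryption context
    to a bio; the bio_crypt_set_ctx() parameter list shown is an assumption
    based on the description above:

        /*
         * Hedged sketch of an upper layer (e.g. fscrypt) tagging a bio with
         * an encryption context before submission.  Getting a keyslot for the
         * context is handled later by blk-mq/blk-crypto, not here.
         */
        u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE] = { first_data_unit_number };

        bio_crypt_set_ctx(bio, key /* struct blk_crypto_key * */, dun,
                          GFP_NOIO);
        submit_bio(bio);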

    Signed-off-by: Satya Tangirala
    Reviewed-by: Eric Biggers
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Satya Tangirala
     
  • Inline Encryption hardware allows software to specify an encryption context
    (an encryption key, crypto algorithm, data unit num, data unit size) along
    with a data transfer request to a storage device, and the inline encryption
    hardware will use that context to en/decrypt the data. The inline
    encryption hardware is part of the storage device, and it conceptually sits
    on the data path between system memory and the storage device.

    Inline Encryption hardware implementations often function around the
    concept of "keyslots". These implementations often have a limited number
    of "keyslots", each of which can hold a key (we say that a key can be
    "programmed" into a keyslot). Requests made to the storage device may have
    a keyslot and a data unit number associated with them, and the inline
    encryption hardware will en/decrypt the data in the requests using the key
    programmed into that associated keyslot and the data unit number specified
    with the request.

    As keyslots are limited, programming keys may be expensive in many
    implementations, and multiple requests may use exactly the same encryption
    context, we introduce a Keyslot Manager to manage keyslots efficiently.

    We also introduce a blk_crypto_key, which will represent the key that's
    programmed into keyslots managed by keyslot managers. The keyslot manager
    also functions as the interface that upper layers will use to program keys
    into inline encryption hardware. For more information on the Keyslot
    Manager, refer to documentation found in block/keyslot-manager.c and
    linux/keyslot-manager.h.
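
    A hedged sketch of the two sides of the interface; the struct and function
    names approximate block/keyslot-manager.c as described above and may
    differ in detail, and the my_hw_* driver hooks are hypothetical:

        /*
         * Hedged sketch.  A driver with inline encryption hardware supplies
         * low-level operations for programming/evicting keys in its keyslots.
         */
        static const struct blk_ksm_ll_ops my_ksm_ops = {
                .keyslot_program = my_hw_program_key,   /* write key into a slot */
                .keyslot_evict   = my_hw_evict_key,     /* remove key from a slot */
        };

        /*
         * The block layer asks the keyslot manager for a slot holding the
         * bio's blk_crypto_key before dispatch, and releases it afterwards.
         */
        struct blk_ksm_keyslot *slot;

        if (blk_ksm_get_slot_for_key(q->ksm, key, &slot) != BLK_STS_OK)
                return;         /* defer or fail the request */
        /* ... dispatch the request using the keyslot number from 'slot' ... */
        blk_ksm_put_slot(slot);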

    Co-developed-by: Eric Biggers
    Signed-off-by: Eric Biggers
    Signed-off-by: Satya Tangirala
    Reviewed-by: Eric Biggers
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Satya Tangirala
     
  • The blk-crypto framework adds support for inline encryption. There are
    numerous changes throughout the storage stack. This patch documents the
    main design choices in the block layer, the API presented to users of
    the block layer (like fscrypt or layered devices) and the API presented
    to drivers for adding support for inline encryption.

    Signed-off-by: Satya Tangirala
    Reviewed-by: Eric Biggers
    Signed-off-by: Jens Axboe

    Satya Tangirala
     
  • When the QoS targets are met and nothing is being throttled, there's
    no way to tell how saturated the underlying device is - it could be
    almost entirely idle, at the cusp of saturation or anywhere in between.
    Given that there's no information, it's best to keep vrate as-is in
    this state. Before 7cd806a9a953 ("iocost: improve nr_lagging
    handling"), this was the case - if the device wasn't missing QoS
    targets and nothing was being throttled, busy_level was reset to zero.

    While fixing nr_lagging handling, 7cd806a9a953 ("iocost: improve
    nr_lagging handling") broke this. Now, while the device is hitting
    QoS targets and nothing is being throttled, vrate keeps getting
    adjusted according to the existing busy_level.

    If vrate started low, this led to vrate climbing until it hit the
    maximum whenever there was an IO issuer with limited request
    concurrency: vrate keeps getting adjusted upwards until the issuer can
    issue IOs without being throttled. From then on, QoS targets keep
    getting met, nothing on the system needs throttling, and vrate keeps
    getting increased due to the existing busy_level.

    This patch makes the following changes to the busy_level logic.

    * Reset busy_level if nr_shortages is zero to avoid the above
    scenario.

    * Make non-zero nr_lagging block lowering busy_level but still clear
    positive busy_level if there's a clear non-saturation signal - QoS
    targets are met and nr_shortages is zero. nr_lagging's role is to
    prevent adjusting vrate upwards while there are long-running
    commands; it shouldn't keep busy_level positive while there's a
    clear non-saturation signal.

    * Restructure code for clarity and add comments.
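
    A hedged sketch of the resulting busy_level rules; the variable names
    follow this message, not necessarily the exact code in block/blk-iocost.c:

        /* Hedged sketch of the rules above, not the actual blk-iocost code. */
        if (missing_qos_targets) {
                /* saturated: push busy_level up so vrate gets lowered */
                ioc->busy_level = max(ioc->busy_level, 0);
                ioc->busy_level++;
        } else if (nr_shortages && !nr_lagging) {
                /* issuers throttled while targets are met: raise vrate */
                ioc->busy_level = min(ioc->busy_level, 0);
                ioc->busy_level--;
        } else if (!nr_shortages) {
                /* no saturation signal either way: don't let vrate drift */
                ioc->busy_level = 0;
        }
        /* otherwise nr_lagging != 0 blocks lowering busy_level any further */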

    Signed-off-by: Tejun Heo
    Reported-by: Andy Newell
    Fixes: 7cd806a9a953 ("iocost: improve nr_lagging handling")
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blk_io_schedule() isn't called from a performance-sensitive code path,
    and it is easier to maintain if exported as a symbol.

    Also, blk_io_schedule() is only called by CONFIG_BLOCK code, so it is
    safe to do it this way. This also fixes a build failure when CONFIG_BLOCK
    is off.

    Cc: Christoph Hellwig
    Fixes: e6249cdd46e4 ("block: add blk_io_schedule() for avoiding task hung in sync dio")
    Reported-by: Satya Tangirala
    Tested-by: Satya Tangirala
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

13 May, 2020

16 commits

  • Synchronous direct I/O to a sequential write only zone can be issued using
    the new REQ_OP_ZONE_APPEND request operation. As dispatching multiple
    BIOs can potentially result in reordering, we cannot support asynchronous
    IO via this interface.

    We also can only dispatch up to queue_max_zone_append_sectors() via the
    new zone-append method and have to return a short write back to user-space
    in case an IO larger than queue_max_zone_append_sectors() has been issued.
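
    A hedged sketch of the size capping described above; this is illustrative
    only, not the actual zonefs/iomap submission code:

        /*
         * Hedged sketch: anything beyond what a single zone append command
         * can carry is reported back to user-space as a short write.
         */
        unsigned int max_bytes =
                queue_max_zone_append_sectors(q) << SECTOR_SHIFT;
        size_t count = min_t(size_t, iov_iter_count(iter), max_bytes);

        bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC;
        /* ... map 'count' bytes of the iter into the bio, submit, wait ... */
        return count;   /* may be smaller than requested: short write */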

    Signed-off-by: Johannes Thumshirn
    Acked-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     
  • Export bio_release_pages and bio_iov_iter_get_pages, so they can be used
    from modular code.

    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     
  • Support REQ_OP_ZONE_APPEND requests for null_blk devices with zoned
    mode enabled. Use the internally tracked zone write pointer position
    as the actual write position and return it using the command request
    __sector field in the case of an mq device and using the command BIO
    sector in the case of a BIO device.

    Signed-off-by: Damien Le Moal
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Emulate ZONE_APPEND for SCSI disks using a regular WRITE(16) command
    with a start LBA set to the target zone write pointer position.

    In order to always know the write pointer position of a sequential write
    zone, the write pointer of all zones is tracked using an array of 32-bit
    zone write pointer offsets attached to the scsi disk structure. Each
    entry of the array indicates a zone write pointer position relative to
    the zone start sector. The write pointer offsets are maintained in sync
    with the device as follows:
    1) the write pointer offset of a zone is reset to 0 when a
    REQ_OP_ZONE_RESET command completes.
    2) the write pointer offset of a zone is set to the zone size when a
    REQ_OP_ZONE_FINISH command completes.
    3) the write pointer offset of a zone is incremented by the number of
    512B sectors written when a write, write same or a zone append
    command completes.
    4) the write pointer offset of all zones is reset to 0 when a
    REQ_OP_ZONE_RESET_ALL command completes.
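
    A hedged sketch of the completion-time bookkeeping described in points
    1)-4) above; the field and lock names are assumptions, not the actual
    sd_zbc.c code:

        /* Hedged sketch: keep the cached write pointer offsets in sync. */
        spin_lock_irqsave(&sdkp->zones_wp_offset_lock, flags);
        switch (req_op(rq)) {
        case REQ_OP_ZONE_RESET:
                sdkp->zones_wp_offset[zno] = 0;
                break;
        case REQ_OP_ZONE_FINISH:
                sdkp->zones_wp_offset[zno] = zone_sectors;      /* zone size */
                break;
        case REQ_OP_WRITE:
        case REQ_OP_WRITE_SAME:
        case REQ_OP_ZONE_APPEND:
                sdkp->zones_wp_offset[zno] += good_bytes >> SECTOR_SHIFT;
                break;
        case REQ_OP_ZONE_RESET_ALL:
                memset(sdkp->zones_wp_offset, 0,
                       sdkp->nr_zones * sizeof(sdkp->zones_wp_offset[0]));
                break;
        }
        spin_unlock_irqrestore(&sdkp->zones_wp_offset_lock, flags);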

    Since the block layer does not write lock zones for zone append
    commands, to ensure a sequential ordering of the regular write commands
    used for the emulation, the target zone of a zone append command is
    locked when the function sd_zbc_prepare_zone_append() is called from
    sd_setup_read_write_cmnd(). If the zone write lock cannot be obtained
    (e.g. a zone append is in-flight or a regular write has already locked
    the zone), the zone append command dispatching is delayed by returning
    BLK_STS_ZONE_RESOURCE.

    To avoid the need for write locking all zones for REQ_OP_ZONE_RESET_ALL
    requests, use a spinlock to protect accesses and modifications of the
    zone write pointer offsets. This spinlock is initialized from sd_probe()
    using the new function sd_zbc_init().

    Co-developed-by: Damien Le Moal
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     
  • Factor sanity checks for zoned commands from sd_zbc_setup_zone_mgmt_cmnd().

    This will help with the introduction of an emulated ZONE_APPEND command.

    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     
  • Modify the interface of blk_revalidate_disk_zones() to add an optional
    driver callback function that a driver can use to extend processing
    done during zone revalidation. The callback, if defined, is executed
    with the device request queue frozen, after all zones have been
    inspected.
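
    A hedged sketch of the extended interface; the exact prototype is an
    assumption based on the description above:

        /*
         * Hedged sketch: optional driver callback, run with the queue frozen
         * after all zones have been inspected.
         */
        int blk_revalidate_disk_zones(struct gendisk *disk,
                        void (*update_driver_data)(struct gendisk *disk));

        /* a driver that needs no extra processing simply passes NULL */
        ret = blk_revalidate_disk_zones(disk, NULL);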

    Signed-off-by: Damien Le Moal
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Introduce blk_req_zone_write_trylock(), which either grabs the write-lock
    for a sequential zone or returns false, if the zone is already locked.
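
    A hedged usage sketch following the description above (the surrounding
    dispatch code is omitted):

        /*
         * Hedged sketch: a driver that cannot take the zone write lock defers
         * the command instead of dispatching it out of order.
         */
        if (!blk_req_zone_write_trylock(rq))
                return BLK_STS_ZONE_RESOURCE;   /* retry dispatch later */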

    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     
  • Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned
    block device. This is a no-merge write operation.

    A zone append write BIO must:
    * Target a zoned block device
    * Have a sector position indicating the start sector of the target zone
    * Target a sequential write zone
    * Not cross a zone boundary
    * Not be split, to ensure that a single range of LBAs is written with a
    single command.

    Implement these checks in generic_make_request_checks() using the
    helper function blk_check_zone_append(). To avoid zone append BIO
    splitting, introduce the new max_zone_append_sectors queue limit
    attribute and ensure that a BIO's size never exceeds this limit.
    Export this new limit through sysfs and check these limits in bio_full().

    Also, when an LLDD can't dispatch a request to a specific zone, it
    will return BLK_STS_ZONE_RESOURCE, indicating that this request needs
    to be delayed, e.g. because the zone it would be dispatched to is still
    write-locked. If this happens, set the request aside in a local list
    and continue trying to dispatch requests such as READ requests or
    WRITE/ZONE_APPEND requests targeting other zones. This way we can
    still keep a high queue depth without starving other requests, even if
    one request can't be served due to zone write-locking.

    Finally, make sure that the bio sector position indicates the actual
    write position as indicated by the device on completion.
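
    A hedged sketch of a submitter's view of the new operation; simplified,
    with names taken from the description above:

        /*
         * Hedged sketch: the bio targets the start of the zone; the device
         * picks the actual write location and the block layer reports it
         * back through the bio's sector on completion.
         */
        bio->bi_opf = REQ_OP_ZONE_APPEND;
        bio->bi_iter.bi_sector = zone_start_sector;
        WARN_ON_ONCE(bio_sectors(bio) > queue_max_zone_append_sectors(q));

        ret = submit_bio_wait(bio);
        if (!ret)
                written_sector = bio->bi_iter.bi_sector; /* where it landed */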

    Signed-off-by: Keith Busch
    [ jth: added zone-append specific add_page and merge_page helpers ]
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • Rename __bio_add_pc_page() to bio_add_hw_page() and explicitly pass in a
    max_sectors argument.

    This max_sectors argument can be used to specify constraints from the
    hardware.
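
    A hedged sketch of the renamed helper; the parameter list is an assumption
    based on the description above:

        int bio_add_hw_page(struct request_queue *q, struct bio *bio,
                            struct page *page, unsigned int len,
                            unsigned int offset, unsigned int max_sectors,
                            bool *same_page);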

    Signed-off-by: Christoph Hellwig
    [ jth: rebased and made public for blk-map.c ]
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Daniel Wagner
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • blk_queue_zone_is_seq() and blk_queue_zone_no() have not been called with
    CONFIG_BLK_DEV_ZONED disabled until now.

    The introduction of REQ_OP_ZONE_APPEND will change this, so we need to
    provide noop fallbacks for the !CONFIG_BLK_DEV_ZONED case.
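
    A hedged sketch of what such fallbacks look like; the return values are
    assumptions for the !CONFIG_BLK_DEV_ZONED case:

        #else /* CONFIG_BLK_DEV_ZONED */
        static inline unsigned int blk_queue_zone_no(struct request_queue *q,
                                                     sector_t sector)
        {
                return 0;
        }
        static inline bool blk_queue_zone_is_seq(struct request_queue *q,
                                                 sector_t sector)
        {
                return false;
        }
        #endif /* CONFIG_BLK_DEV_ZONED */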

    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     
  • Sync dio can be big, or may take a long time in discard or in case of
    IO failure.

    We have already prevented task hung warnings in submit_bio_wait() and
    blk_execute_rq(), so apply the same trick to prevent them from being
    triggered by sync dio.

    Add a blk_io_schedule() helper and use io_schedule_timeout() in it to
    prevent the task hung warning.
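
    A hedged sketch of the helper along the lines described above; the exact
    timeout derivation is an assumption:

        /*
         * Hedged sketch: sleep, but wake up well before the hung task
         * detector would fire, so long sync dio doesn't trigger a warning.
         */
        static inline void blk_io_schedule(void)
        {
                unsigned long timeout = sysctl_hung_task_timeout_secs * HZ / 2;

                if (timeout)
                        io_schedule_timeout(timeout);
                else
                        io_schedule();
        }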

    Signed-off-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Cc: Salman Qazi
    Cc: Jesse Barnes
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • The gendisk can't go away while there is IO activity, so don't hold
    part0's refcount in the IO path.

    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Yufen Yu
    Cc: Christoph Hellwig
    Cc: Hou Tao
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Put all fields accessed in the IO path together at the beginning of
    the struct, so that they can all be fetched in a single cacheline.

    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Yufen Yu
    Cc: Christoph Hellwig
    Cc: Hou Tao
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • The 'nr_sects_seq' seqcount is only needed in the case of 32-bit SMP,
    so only define it for 32-bit SMP.
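
    A hedged sketch of the conditional definition inside the partition
    structure; the surrounding fields are omitted and the exact preprocessor
    condition is an assumption:

        #if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
                seqcount_t nr_sects_seq;        /* only needed on 32-bit SMP */
        #endif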

    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Yufen Yu
    Cc: Christoph Hellwig
    Cc: Hou Tao
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • delete_partition() clears the cached .last_lookup partition. However,
    the .last_lookup cache may be repopulated by one IO path after it has
    been cleared by delete_partition(). Another IO path may then use the
    cached, being-deleted partition after hd_struct_free() has been called,
    triggering a use-after-free on the cached partition.

    Fix the issue with the following approach:

    1) always get the partition's refcount via hd_struct_try_get() before
    setting .last_lookup

    2) move clearing .last_lookup from delete_partition() to hd_struct_free(),
    which is the release handler of the partition's percpu-refcount, so that no
    IO path can cache a partition being deleted via .last_lookup.

    This is an alternative to Yufen's patch [1], which adds overhead in the
    fast path through an indirect lookup that may introduce one extra
    cacheline access in the IO path. This patch also relies on
    percpu-refcount protection, and it is easier to understand and verify.

    [1] https://lore.kernel.org/linux-block/20200109013551.GB9655@ming.t460p/T/#t
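
    A hedged sketch of step 1) above; the names follow the message and the
    surrounding partition-lookup code is simplified:

        /*
         * Hedged sketch: only cache a partition in .last_lookup if a
         * reference could be grabbed, so a partition whose refcount is
         * already dying can never be (re)cached by the IO path.
         */
        part = __disk_get_part(disk, partno);
        if (part && hd_struct_try_get(part))
                rcu_assign_pointer(ptbl->last_lookup, part);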

    Reported-by: Yufen Yu
    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Christoph Hellwig
    Cc: Hou Tao
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • When we increase the hardware queue count, blk_mq_update_queue_map()
    resets the mapping between cpus and hardware queues based on the new
    hardware queue count (set->nr_hw_queues). The mapping cannot be reset
    if blk_mq_realloc_hw_ctxs() encounters an error, but the fallback flow
    continues using it; blk_mq_map_swqueue() then touches invalid memory,
    because the mapping points to the wrong hctx.

    blktest block/030:

    null_blk: module loaded
    Increasing nr_hw_queues to 8 fails, fallback to 1
    ==================================================================
    BUG: KASAN: null-ptr-deref in blk_mq_map_swqueue+0x2f2/0x830
    Read of size 8 at addr 0000000000000128 by task nproc/8541

    CPU: 5 PID: 8541 Comm: nproc Not tainted 5.7.0-rc4-dbg+ #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
    rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
    Call Trace:
    dump_stack+0xa5/0xe6
    __kasan_report.cold+0x65/0xbb
    kasan_report+0x45/0x60
    check_memory_region+0x15e/0x1c0
    __kasan_check_read+0x15/0x20
    blk_mq_map_swqueue+0x2f2/0x830
    __blk_mq_update_nr_hw_queues+0x3df/0x690
    blk_mq_update_nr_hw_queues+0x32/0x50
    nullb_device_submit_queues_store+0xde/0x160 [null_blk]
    configfs_write_file+0x1c4/0x250 [configfs]
    __vfs_write+0x4c/0x90
    vfs_write+0x14b/0x2d0
    ksys_write+0xdd/0x180
    __x64_sys_write+0x47/0x50
    do_syscall_64+0x6f/0x310
    entry_SYSCALL_64_after_hwframe+0x49/0xb3
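
    A hedged sketch of the fix in the fallback path; the function names come
    from the message and trace above, the surrounding code and the failure
    check are simplified assumptions:

        /*
         * Hedged sketch: if blk_mq_realloc_hw_ctxs() failed to grow the
         * queues, fall back to the previous count and redo the cpu <-> hctx
         * mapping, so blk_mq_map_swqueue() never walks a mapping that was
         * built for the larger queue count.
         */
        if (hctxs_realloc_failed) {
                pr_warn("Increasing nr_hw_queues to %d fails, fallback to %d\n",
                        set->nr_hw_queues, prev_nr_hw_queues);
                set->nr_hw_queues = prev_nr_hw_queues;
                blk_mq_update_queue_map(set);
        }
        blk_mq_map_swqueue(q);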

    Signed-off-by: Weiping Zhang
    Tested-by: Bart van Assche
    Signed-off-by: Jens Axboe

    Weiping Zhang
     

11 May, 2020

1 commit