31 Mar, 2020
1 commit
-
Pull block updates from Jens Axboe:
- Online capacity resizing (Balbir)
- Number of hardware queue change fixes (Bart)
- null_blk fault injection addition (Bart)
- Cleanup of queue allocation, unifying the node/no-node API
(Christoph)- Cleanup of genhd, moving code to where it makes sense (Christoph)
- Cleanup of the partition handling code (Christoph)
- disk stat fixes/improvements (Konstantin)
- BFQ improvements (Paolo)
- Various fixes and improvements
* tag 'for-5.7/block-2020-03-29' of git://git.kernel.dk/linux-block: (72 commits)
block: return NULL in blk_alloc_queue() on error
block: move bio_map_* to blk-map.c
Revert "blkdev: check for valid request queue before issuing flush"
block: simplify queue allocation
bcache: pass the make_request methods to blk_queue_make_request
null_blk: use blk_mq_init_queue_data
block: add a blk_mq_init_queue_data helper
block: move the ->devnode callback to struct block_device_operations
block: move the part_stat* helpers from genhd.h to a new header
block: move block layer internals out of include/linux/genhd.h
block: move guard_bio_eod to bio.c
block: unexport get_gendisk
block: unexport disk_map_sector_rcu
block: unexport disk_get_part
block: mark part_in_flight and part_in_flight_rw static
block: mark block_depr static
block: factor out requeue handling from dispatch code
block/diskstats: replace time_in_queue with sum of request times
block/diskstats: accumulate all per-cpu counters in one pass
block/diskstats: more accurate approximation of io_ticks for slow disks
...
27 Mar, 2020
1 commit
-
There really isn't any good reason to stash a method directly into
struct gendisk. Move it together with the other block device
operations.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
25 Mar, 2020
7 commits
-
get_gendisk is not used by any modular code.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
disk_map_sector_rcu is not used by any modular code.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
disk_get_part is not used by any modular code.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
Column "time_in_queue" in diskstats is supposed to show total waiting time
of all requests. I.e. value should be equal to the sum of times from other
columns. But this is not true, because column "time_in_queue" is counted
separately in jiffies rather than in nanoseconds as other times.This patch removes redundant counter for "time_in_queue" and shows total
time of read, write, discard and flush requests.Signed-off-by: Konstantin Khlebnikov
Signed-off-by: Jens Axboe -
Reading /proc/diskstats iterates over all cpus for summing each field.
It's faster to sum all fields in one pass.Hammering /proc/diskstats with fio shows 2x performance improvement:
fio --name=test --numjobs=$JOBS --filename=/proc/diskstats \
--size=1k --bs=1k --fallocate=none --create_on_open=1 \
--time_based=1 --runtime=10 --invalidate=0 --group_reportJOBS=1 JOBS=10
Before: 7k iops 64k iops
After: 18k iops 120k iopsAlso this way code is more compact:
add/remove: 1/0 grow/shrink: 0/2 up/down: 194/-1540 (-1346)
Function old new delta
part_stat_read_all - 194 +194
diskstats_show 1344 631 -713
part_stat_show 1219 392 -827
Total: Before=14966947, After=14965601, chg -0.01%Signed-off-by: Konstantin Khlebnikov
Signed-off-by: Jens Axboe
24 Mar, 2020
3 commits
-
Move the sysfs _show methods that are used both on the full disk and
partition nodes to genhd.c instead of hiding them in the partitioning
code. Also move the declaration for these methods to block/blk.h so
that we don't expose them to drivers.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
Thes functions aren't really related to partition support, so move them
to a more suitable place.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
This function is only used by init/do_mounts.c, which can't be modular.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
19 Mar, 2020
1 commit
-
Allow block/genhd to notify user space (via udev) about disk size changes
using a new helper set_capacity_revalidate_and_notify(), which is a wrapper
on top of set_capacity(). set_capacity_revalidate_and_notify() will only
notify via udev if the current capacity or the target capacity is not zero
and iff the capacity changes.Suggested-by: Christoph Hellwig
Signed-off-by: Someswarudu Sangaraju
Signed-off-by: Balbir Singh
Reviewed-by: Bob Liu
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe
12 Mar, 2020
1 commit
-
Commit b72053072c0b ("block: allow partitions on host aware zone
devices") introduced the helper function disk_has_partitions() to check
if a given disk has valid partitions. However, since this function result
directly depends on the disk partition table length rather than the
actual existence of valid partitions in the table, it returns true even
after all partitions are removed from the disk. For host aware zoned
block devices, this results in zone management support to be kept
disabled even after removing all partitions.Fix this by changing disk_has_partitions() to walk through the partition
table entries and return true if and only if a valid non-zero size
partition is found.Fixes: b72053072c0b ("block: allow partitions on host aware zone devices")
Cc: stable@vger.kernel.org # 5.5
Reviewed-by: Damien Le Moal
Reviewed-by: Johannes Thumshirn
Reviewed-by: Christoph Hellwig
Signed-off-by: Shin'ichiro Kawasaki
Signed-off-by: Jens Axboe
22 Nov, 2019
1 commit
-
Requests that triggers flushing volatile writeback cache to disk (barriers)
have significant effect to overall performance.Block layer has sophisticated engine for combining several flush requests
into one. But there is no statistics for actual flushes executed by disk.
Requests which trigger flushes usually are barriers - zero-size writes.This patch adds two iostat counters into /sys/class/block/$dev/stat and
/proc/diskstats - count of completed flush requests and their total time.Signed-off-by: Konstantin Khlebnikov
Signed-off-by: Jens Axboe
06 Sep, 2019
1 commit
-
When elevator_init_mq() is called from blk_mq_init_allocated_queue(),
the only information known about the device is the number of hardware
queues as the block device scan by the device driver is not completed
yet for most drivers. The device type and elevator required features
are not set yet, preventing to correctly select the default elevator
most suitable for the device.This currently affects all multi-queue zoned block devices which default
to the "none" elevator instead of the required "mq-deadline" elevator.
These drives currently include host-managed SMR disks connected to a
smartpqi HBA and null_blk block devices with zoned mode enabled.
Upcoming NVMe Zoned Namespace devices will also be affected.Fix this by adding the boolean elevator_init argument to
blk_mq_init_allocated_queue() to control the execution of
elevator_init_mq(). Two cases exist:
1) elevator_init = false is used for calls to
blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this
case, a call to elevator_init_mq() is added to __device_add_disk(),
resulting in the delayed initialization of the queue elevator
after the device driver finished probing the device information. This
effectively allows elevator_init_mq() access to more information
about the device.
2) elevator_init = true preserves the current behavior of initializing
the elevator directly from blk_mq_init_allocated_queue(). This case
is used for the special request based DM devices where the device
gendisk is created before the queue initialization and device
information (e.g. queue limits) is already known when the queue
initialization is executed.Additionally, to make sure that the elevator initialization is never
done while requests are in-flight (there should be none when the device
driver calls device_add_disk()), freeze and quiesce the device request
queue before calling blk_mq_init_sched() in elevator_init_mq().Reviewed-by: Ming Lei
Signed-off-by: Damien Le Moal
Signed-off-by: Jens Axboe
17 Jul, 2019
1 commit
-
The runtime configurable module parameter files are located under
/sys/module/MODULENAME/parameters, not /sys/module/MODULENAME.Cc: Jens Axboe
Signed-off-by: Akinobu Mita
Signed-off-by: Jens Axboe
15 Jun, 2019
1 commit
-
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes, in particular in the
context in which this code is being used.So, replace the following form:
sizeof(*new_ptbl) + target * sizeof(new_ptbl->part[0])
with:
struct_size(new_ptbl, part, target)
Also, notice that variable size is unnecessary, hence it is removed.
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva
Signed-off-by: Jens Axboe
01 Jun, 2019
1 commit
-
This patch avoids that the kernel-doc tool warns about this function
header when building with W=1.Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe
01 May, 2019
1 commit
-
Various block layer files do not have any licensing information at all.
Add SPDX tags for the default kernel GPLv2 license to those.Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
22 Apr, 2019
1 commit
-
commit 2da78092dda "block: Fix dev_t minor allocation lifetime"
specifically moved blk_free_devt(dev->devt) call to part_release()
to avoid reallocating device number before the device is fully
shutdown.However, it can cause use-after-free on gendisk in get_gendisk().
We use md device as example to show the race scenes:Process1 Worker Process2
md_free
blkdev_open
del_gendisk
add delete_partition_work_fn() to wq
__blkdev_get
get_gendisk
put_disk
disk_release
kfree(disk)
find part from ext_devt_idr
get_disk_and_module(disk)
cause use after freedelete_partition_work_fn
put_device(part)
part_release
remove part from ext_devt_idrBefore is removed from ext_devt_idr by
delete_partition_work_fn(), we can find the devt and then access
gendisk by hd_struct pointer. But, if we access the gendisk after
it have been freed, it can cause in use-after-freeon gendisk in
get_gendisk().We fix this by adding a new helper blk_invalidate_devt() in
delete_partition() and del_gendisk(). It replaces hd_struct
pointer in idr with value 'NULL', and deletes the entry from
idr in part_release() as we do now.Thanks to Jan Kara for providing the solution and more clear comments
for the code.Fixes: 2da78092dda1 ("block: Fix dev_t minor allocation lifetime")
Cc: Al Viro
Reviewed-by: Bart Van Assche
Reviewed-by: Keith Busch
Reviewed-by: Jan Kara
Suggested-by: Jan Kara
Signed-off-by: Yufen Yu
Signed-off-by: Jens Axboe
13 Apr, 2019
3 commits
-
Drivers now report to the block layer if they support media change
events. If this is not the case, there's no need to allocate the event
structure, and all event handling code can effectively be skipped. This
simplifies code flow in particular for non-removable sd devices.This effectively reverts commit 75e3f3ee3c64 ("block: always allocate
genhd->ev if check_events is implemented").The sysfs files for the events are kept in place even if no events are
supported, as user space may rely on them being present. The only
difference is that an error code is now returned if the user tries to
set poll_msecs.Reviewed-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Signed-off-by: Martin Wilck
Signed-off-by: Jens Axboe -
Currently, an empty disk->events field tells the block layer not to
forward media change events to user space. This was done in commit
7c88a168da80 ("block: don't propagate unlisted DISK_EVENTs to userland")
in order to avoid events from "fringe" drivers to be forwarded to user
space. By doing so, the block layer lost the information which events
were supported by a particular block device, and most importantly,
whether or not a given device supports media change events at all.Prepare for not interpreting the "events" field this way in the future
any more. This is done by adding an additional field "event_flags" to
struct gendisk, and two flag bits that can be set to have the device
treated like one that had the "events" field set to a non-zero value
before. This applies only to the sd and sr drivers, which are changed to
set the new flags.The new flags are DISK_EVENT_FLAG_POLL to enforce polling of the device
for synchronous events, and DISK_EVENT_FLAG_UEVENT to tell the
blocklayer to generate udev events from kernel events.In order to add the event_flags field to struct gendisk, the events
field is converted to an "unsigned short"; it doesn't need to hold
values bigger than 2 anyway.This patch doesn't change behavior.
Reviewed-by: Christoph Hellwig
Signed-off-by: Martin Wilck
Signed-off-by: Jens Axboe -
The async_events field, intended to be used for drivers that support
asynchronous notifications about disk events (aka media change events),
isn't currently used by any driver, and apparently that has been that
way for a long time (if not forever). Remove it.Reviewed-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Signed-off-by: Martin Wilck
Signed-off-by: Jens Axboe
01 Mar, 2019
2 commits
-
Replace hard coded function name register_blkdev with __func__, to
improve robustness and to conform to the Linux kernel coding
style. Issue found using checkpatch.Signed-off-by: Keyur Patel
Signed-off-by: Jens Axboe -
If __device_add_disk-->bdi_register_owner-->bdi_register-->
bdi_register_va-->device_create_vargs fails, bdi->dev is still
NULL, __device_add_disk-->register_disk will visit bdi->dev->kobj.
This patch fixes that.Signed-off-by: zhengbin
Signed-off-by: Jens Axboe
10 Dec, 2018
4 commits
-
The previous patches deleted all the code that needed the second value
returned from part_in_flight - now the kernel only uses the first value.Consequently, part_in_flight (and blk_mq_in_flight) may be changed so that
it only returns one value.This patch just refactors the code, there's no functional change.
Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe -
Now when part_round_stats is gone, we can switch to per-cpu in-flight
counters.We use the local-atomic type local_t, so that if part_inc_in_flight or
part_dec_in_flight is reentrantly called from an interrupt, the value will
be correct.The other counters could be corrupted due to reentrant interrupt, but the
corruption only results in slight counter skew - the in_flight counter
must be exact, so it needs local_t.Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe -
We want to convert to per-cpu in_flight counters.
The function part_round_stats needs the in_flight counter every jiffy, it
would be too costly to sum all the percpu variables every jiffy, so it
must be deleted. part_round_stats is used to calculate two counters -
time_in_queue and io_ticks.time_in_queue can be calculated without part_round_stats, by adding the
duration of the I/O when the I/O ends (the value is almost as exact as the
previously calculated value, except that time for in-progress I/Os is not
counted).io_ticks can be approximated by increasing the value when I/O is started
or ended and the jiffies value has changed. If the I/Os take less than a
jiffy, the value is as exact as the previously calculated value. If the
I/Os take more than a jiffy, io_ticks can drift behind the previously
calculated value.Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe -
All of part_stat_* and related methods are used with preempt disabled,
so there is no need to pass cpu around to allow of them. Just call
smp_processor_id() as needed.Suggested-by: Jens Axboe
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe
16 Nov, 2018
1 commit
-
Various spots check for q->mq_ops being non-NULL, but provide
a helper to do this instead.Where the ->mq_ops != NULL check is redundant, remove it.
Since mq == rq-based now that legacy is gone, get rid of the
queue_is_rq_based() and just use queue_is_mq() everywhere.Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe
01 Oct, 2018
1 commit
-
Merge -rc6 in, for two reasons:
1) Resolve a trivial conflict in the blk-mq-tag.c documentation
2) A few important regression fixes went into upstream directly, so
they aren't in the 4.20 branch.Signed-off-by: Jens Axboe
* tag 'v4.19-rc6': (780 commits)
Linux 4.19-rc6
MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
cpufreq: qcom-kryo: Fix section annotations
perf/core: Add sanity check to deal with pinned event failure
xen/blkfront: correct purging of persistent grants
Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
selftests/powerpc: Fix Makefiles for headers_install change
blk-mq: I/O and timer unplugs are inverted in blktrace
dax: Fix deadlock in dax_lock_mapping_entry()
x86/boot: Fix kexec booting failure in the SEV bit detection code
bcache: add separate workqueue for journal_write to avoid deadlock
drm/amd/display: Fix Edid emulation for linux
drm/amd/display: Fix Vega10 lightup on S3 resume
drm/amdgpu: Fix vce work queue was not cancelled when suspend
Revert "drm/panel: Add device_link from panel device to DRM device"
xen/blkfront: When purging persistent grants, keep them in the buffer
clocksource/drivers/timer-atmel-pit: Properly handle error cases
block: fix deadline elevator drain for zoned block devices
ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
...Signed-off-by: Jens Axboe
28 Sep, 2018
1 commit
-
Update device_add_disk() to take an 'groups' argument so that
individual drivers can register a device with additional sysfs
attributes.
This avoids race condition the driver would otherwise have if these
groups were to be created with sysfs_add_groups().Signed-off-by: Martin Wilck
Signed-off-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Reviewed-by: Bart Van Assche
Signed-off-by: Jens Axboe
22 Sep, 2018
1 commit
-
Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
updating properly on 4.18. This is because we started using ktime to
track elapsed time, and we convert nanoseconds to jiffies when we update
the partition counter. However, this gets rounded down, so any I/Os that
take less than a jiffy are not accounted for. Previously in this case,
the value of jiffies would sometimes increment while we were doing I/O,
so at least some I/Os were accounted for.Let's convert the stats to use nanoseconds internally. We still report
milliseconds as before, now more accurately than ever. The value is
still truncated to 32 bits for backwards compatibility.Fixes: 522a777566f5 ("block: consolidate struct request timestamp fields")
Cc: stable@vger.kernel.org
Reported-by: Klaus Kusche
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
18 Jul, 2018
2 commits
-
Add tracking of REQ_OP_DISCARD ios to the partition statistics and
append them to the various stat files in /sys as well as
/proc/diskstats. These are tracked with the same four stats as reads
and writes:Number of discard ios completed.
Number of discard ios merged
Number of discard sectors completed
Milliseconds spent on discard requestsThis is done via adding a new STAT_DISCARD define to genhd.h and then
using it to index that stat field for discard requests.tj: Refreshed on top of v4.17 and other previous updates.
Signed-off-by: Michael Callahan
Signed-off-by: Tejun Heo
Cc: Andy Newell
Signed-off-by: Jens Axboe -
Add defines for STAT_READ and STAT_WRITE for indexing the partition
stat entries. This clarifies some fs/ code which has hardcoded 1 for
STAT_WRITE and will make it easier to extend the stats with additional
fields.tj: Refreshed on top of v4.17.
Signed-off-by: Michael Callahan
Signed-off-by: Tejun Heo
Cc: "Theodore Ts'o"
Cc: Jaegeuk Kim
Signed-off-by: Jens Axboe
05 Jun, 2018
1 commit
-
Pull procfs updates from Al Viro:
"Christoph's proc_create_... cleanups series"* 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (44 commits)
xfs, proc: hide unused xfs procfs helpers
isdn/gigaset: add back gigaset_procinfo assignment
proc: update SIZEOF_PDE_INLINE_NAME for the new pde fields
tty: replace ->proc_fops with ->proc_show
ide: replace ->proc_fops with ->proc_show
ide: remove ide_driver_proc_write
isdn: replace ->proc_fops with ->proc_show
atm: switch to proc_create_seq_private
atm: simplify procfs code
bluetooth: switch to proc_create_seq_data
netfilter/x_tables: switch to proc_create_seq_private
netfilter/xt_hashlimit: switch to proc_create_{seq,single}_data
neigh: switch to proc_create_seq_data
hostap: switch to proc_create_{seq,single}_data
bonding: switch to proc_create_seq_data
rtc/proc: switch to proc_create_single_data
drbd: switch to proc_create_single
resource: switch to proc_create_seq_data
staging/rtl8192u: simplify procfs code
jfs: simplify procfs code
...
25 May, 2018
1 commit
-
Convert the S_ symbolic permissions to their octal equivalents as
using octal and not symbolic permissions is preferred by many as more
readable.see: https://lkml.org/lkml/2016/8/2/1945
Done with automated conversion via:
$ ./scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplaceMiscellanea:
o Wrapped modified multi-line calls to a single line where appropriate
o Realign modified multi-line calls to open parenthesisSigned-off-by: Joe Perches
Signed-off-by: Jens Axboe
16 May, 2018
1 commit
-
Variants of proc_create{,_data} that directly take a struct seq_operations
argument and drastically reduces the boilerplate code in the callers.All trivial callers converted over.
Signed-off-by: Christoph Hellwig
26 Apr, 2018
1 commit
-
When the blk-mq inflight implementation was added, /proc/diskstats was
converted to use it, but /sys/block/$dev/inflight was not. Fix it by
adding another helper to count in-flight requests by data direction.Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe