06 Sep, 2019
1 commit
-
Introduce the definition of elevator features through the
elevator_features flags in the elevator_type structure. Each flag can
represent a feature supported by an elevator. The first feature defined
by this patch is support for zoned block device sequential write
constraint with the flag ELEVATOR_F_ZBD_SEQ_WRITE, which is implemented
by the mq-deadline elevator using zone write locking.Other possible features are IO priorities, write hints, latency targets
or single-LUN dual-actuator disks (for which the elevator could maintain
one LBA ordered list per actuator).The required_elevator_features field is also added to the request_queue
structure to allow a device driver to specify elevator feature flags
that an elevator must support for the correct operation of the device
(e.g. device drivers for zoned block devices can have the
ELEVATOR_F_ZBD_SEQ_WRITE flag as a required feature).
The helper function blk_queue_required_elevator_features() is
defined for setting this new field.With these two new fields in place, the elevator functions
elevator_match() and elevator_find() are modified to allow a user to set
only an elevator with a set of features that satisfies the device
required features. Elevators not matching the device requirements are
not shown in the device sysfs queue/scheduler file to prevent their use.The "none" elevator can always be selected as before.
Reviewed-by: Johannes Thumshirn
Reviewed-by: Christoph Hellwig
Reviewed-by: Ming Lei
Signed-off-by: Damien Le Moal
Signed-off-by: Jens Axboe
23 Jul, 2019
1 commit
-
blk_mq_sched_completed_request is a function that checks if the elevator
related to the request has started_request implemented, but currently, none of
the available IO schedulers implement started_request, so remove both.Signed-off-by: Marcos Paulo de Souza
Signed-off-by: Jens Axboe
11 Jul, 2019
2 commits
-
The elevator_name field in struct elevator_type is declared as an array
of characters (ELV_NAME_MAX size) but in practice used as a string
pointer with its initialization done statically within each
elevator elevator_type structure declaration.Change the declaration of elevator_name to the more appropriate
"const char *" type.Acked-by: Marcos Paulo de Souza
Signed-off-by: Damien Le Moal
Signed-off-by: Jens Axboe -
The ELV_MQUEUE_XXX definitions in include/linux/elevator.h are unused
since the removal of elevator_may_queue_fn in kernel 5.0. Remove these
definitions and also remove the documentation of elevator_may_queue_fn
in Documentiation/block/biodoc.txt.Acked-by: Marcos Paulo de Souza
Signed-off-by: Damien Le Moal
Signed-off-by: Jens Axboe
21 Jun, 2019
1 commit
-
We only need the number of segments in the blk-mq submission path.
Remove the field from struct bio, and return it from a variant of
blk_queue_split instead of that it can passed as an argument to
those functions that need the value.This also means we stop recounting segments except for cloning
and partial segments.To keep the number of arguments in this how path down remove
pointless struct request_queue arguments from any of the functions
that had it and grew a nr_segs argument.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
14 Apr, 2019
1 commit
-
A previous commit moved the shallow depth and BFQ depth map calculations
to be done at init time, moving it outside of the hotter IO path. This
potentially causes hangs if the users changes the depth of the scheduler
map, by writing to the 'nr_requests' sysfs file for that device.Add a blk-mq-sched hook that allows blk-mq to inform the scheduler if
the depth changes, so that the scheduler can update its internal state.Tested-by: Kai Krakow
Reported-by: Paolo Valente
Fixes: f0635b8a416e ("bfq: calculate shallow depths at init time")
Signed-off-by: Jens Axboe
08 Nov, 2018
2 commits
-
This is a remnant of when we had ops for both SQ and MQ
schedulers. Now it's just MQ, so get rid of the union.Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe -
This removes a bunch of core and elevator related code. On the core
front, we remove anything related to queue running, draining,
initialization, plugging, and congestions. We also kill anything
related to request allocation, merging, retrieval, and completion.Remove any checking for single queue IO schedulers, as they no
longer exist. This means we can also delete a bunch of code related
to request issue, adding, completion, etc - and all the SQ related
ops and helpers.Also kill the load_default_modules(), as all that did was provide
for a way to load the default single queue elevator.Tested-by: Ming Lei
Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe
28 Sep, 2018
1 commit
-
Commit 4bc6339a583c ("block: move blk_stat_add() to
__blk_mq_end_request()") consolidated some calls using ktime_get() so
we'd only need to call it once. Kyber's ->completed_request() hook also
calls ktime_get(), so let's move it to the same place, too.Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
01 Jun, 2018
1 commit
-
These are only used by the block core. Also move the declarations to
block/blk.h.Reported-by: Damien Le Moal
Signed-off-by: Christoph Hellwig
Reviewed-by: Damien Le Moal
Tested-by: Damien Le Moal
Signed-off-by: Jens Axboe
19 Jan, 2018
1 commit
-
These two functions are only called from inside the block layer so
unexport them.Reviewed-by: Christoph Hellwig
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe
15 Nov, 2017
1 commit
-
Pull core block layer updates from Jens Axboe:
"This is the main pull request for block storage for 4.15-rc1.Nothing out of the ordinary in here, and no API changes or anything
like that. Just various new features for drivers, core changes, etc.
In particular, this pull request contains:- A patch series from Bart, closing the whole on blk/scsi-mq queue
quescing.- A series from Christoph, building towards hidden gendisks (for
multipath) and ability to move bio chains around.- NVMe
- Support for native multipath for NVMe (Christoph).
- Userspace notifications for AENs (Keith).
- Command side-effects support (Keith).
- SGL support (Chaitanya Kulkarni)
- FC fixes and improvements (James Smart)
- Lots of fixes and tweaks (Various)- bcache
- New maintainer (Michael Lyle)
- Writeback control improvements (Michael)
- Various fixes (Coly, Elena, Eric, Liang, et al)- lightnvm updates, mostly centered around the pblk interface
(Javier, Hans, and Rakesh).- Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)
- Writeback series that fix the much discussed hundreds of millions
of sync-all units. This goes all the way, as discussed previously
(me).- Fix for missing wakeup on writeback timer adjustments (Yafang
Shao).- Fix laptop mode on blk-mq (me).
- {mq,name} tupple lookup for IO schedulers, allowing us to have
alias names. This means you can use 'deadline' on both !mq and on
mq (where it's called mq-deadline). (me).- blktrace race fix, oopsing on sg load (me).
- blk-mq optimizations (me).
- Obscure waitqueue race fix for kyber (Omar).
- NBD fixes (Josef).
- Disable writeback throttling by default on bfq, like we do on cfq
(Luca Miccio).- Series from Ming that enable us to treat flush requests on blk-mq
like any other request. This is a really nice cleanup.- Series from Ming that improves merging on blk-mq with schedulers,
getting us closer to flipping the switch on scsi-mq again.- BFQ updates (Paolo).
- blk-mq atomic flags memory ordering fixes (Peter Z).
- Loop cgroup support (Shaohua).
- Lots of minor fixes from lots of different folks, both for core and
driver code"* 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
nvme: fix visibility of "uuid" ns attribute
blk-mq: fixup some comment typos and lengths
ide: ide-atapi: fix compile error with defining macro DEBUG
blk-mq: improve tag waiting setup for non-shared tags
brd: remove unused brd_mutex
blk-mq: only run the hardware queue if IO is pending
block: avoid null pointer dereference on null disk
fs: guard_bio_eod() needs to consider partitions
xtensa/simdisk: fix compile error
nvme: expose subsys attribute to sysfs
nvme: create 'slaves' and 'holders' entries for hidden controllers
block: create 'slaves' and 'holders' entries for hidden gendisks
nvme: also expose the namespace identification sysfs files for mpath nodes
nvme: implement multipath access to nvme subsystems
nvme: track shared namespaces
nvme: introduce a nvme_ns_ids structure
nvme: track subsystems
block, nvme: Introduce blk_mq_req_flags_t
block, scsi: Make SCSI quiesce and resume work reliably
block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
...
02 Nov, 2017
1 commit
-
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.By default all files without license information are under the default
license of the kernel, which is GPL version 2.Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if
Reviewed-by: Philippe Ombredanne
Reviewed-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
26 Oct, 2017
1 commit
-
Since we now lookup elevator types with the appropriate multiqueue
capability, allow schedulers to register with an alias alongside
the real name. This is in preparation for allowing 'mq-deadline'
to register an alias of 'deadline' as well.Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe
19 Jun, 2017
2 commits
-
This patch makes sure we always allocate requests in the core blk-mq
code and use a common prepare_request method to initialize them for
both mq I/O schedulers. For Kyber and additional limit_depth method
is added that is called before allocating the request.Also because none of the intializations can really fail the new method
does not return an error - instead the bfq finish method is hardened
to deal with the no-IOC case.Last but not least this removes the abuse of RQF_QUEUE by the blk-mq
scheduling code as RQF_ELFPRIV is all that is needed now.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
No need to have two different callouts of bfq vs kyber.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
07 Jun, 2017
1 commit
-
gcc 7.1 reports the following warning:
block/elevator.c: In function ‘elv_register’:
block/elevator.c:898:5: warning: ‘snprintf’ output may be truncated before the last format character [-Wformat-truncation=]
"%s_io_cq", e->elevator_name);
^~~~~~~~~~
block/elevator.c:897:3: note: ‘snprintf’ output between 7 and 22 bytes into a destination of size 21
snprintf(e->icq_cache_name, sizeof(e->icq_cache_name),
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"%s_io_cq", e->elevator_name);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~The bug is that the name of the icq_cache is 6 characters longer than
the elevator name, but only ELV_NAME_MAX + 5 characters were reserved
for it --- so in the case of a maximum-length elevator name, the 'q'
character in "_io_cq" would be truncated by snprintf(). Fix it by
reserving ELV_NAME_MAX + 6 characters instead.Signed-off-by: Eric Biggers
Reviewed-by: Bart Van Assche
Signed-off-by: Jens Axboe
04 May, 2017
1 commit
-
This provides the infrastructure for schedulers to expose their internal
state through debugfs. We add a list of queue attributes and a list of
hctx attributes to struct elevator_type and wire them up when switching
schedulers.Signed-off-by: Omar Sandoval
Reviewed-by: Hannes ReineckeAdd missing seq_file.h header in blk-mq-debugfs.h
Signed-off-by: Jens Axboe
02 May, 2017
1 commit
-
Since commit 84253394927c ("remove the mg_disk driver") removed the
only caller of elevator_change(), also remove the elevator_change()
function itself.Signed-off-by: Bart Van Assche
Cc: Christoph Hellwig
Cc: Markus Trippelsdorf
Signed-off-by: Jens Axboe
15 Apr, 2017
1 commit
-
Currently, this callback is called right after put_request() and has no
distinguishable purpose. Instead, let's call it before put_request() as
soon as I/O has completed on the request, before we account it in
blk-stat. With this, Kyber can enable stats when it sees a latency
outlier and make sure the outlier gets accounted.Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
08 Apr, 2017
1 commit
-
Schedulers need to be informed when a hardware queue is added or removed
at runtime so they can allocate/free per-hardware queue data. So,
replace the blk_mq_sched_init_hctx_data() helper, which only makes sense
at init time, with .init_hctx() and .exit_hctx() hooks.Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
07 Apr, 2017
1 commit
-
In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
back to the original scheduler. However, at this point, we've already
torn down the original scheduler's tags, so this causes a crash. Doing
the fallback like the legacy elevator path is much harder for mq, so fix
it by just falling back to none, instead.Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
11 Feb, 2017
1 commit
-
bio is used in bfq-mq's get_rq_priv, to get the request group. We could
pass directly the group here, but I thought that passing the bio was
more general, giving the possibility to get other pieces of information
if needed.Signed-off-by: Paolo Valente
Signed-off-by: Jens Axboe
09 Feb, 2017
2 commits
-
Add a new merge strategy that merges discard bios into a request until the
maximum number of discard ranges (or the maximum discard size) is reached
from the plug merging code. I/O scheduler merging is not wired up yet
but might also be useful, although not for fast devices like NVMe which
are the only user for now.Note that for now we don't support limiting the size of each discard range,
but if needed that can be added later.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
Switch these constants to an enum, and make let the compiler ensure that
all callers of blk_try_merge and elv_merge handle all potential values.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
27 Jan, 2017
1 commit
-
When we invoke dispatch_requests(), the scheduler empties everything
into the passed in list. This isn't always a good thing, since it
means that we remove items that we could have potentially merged
with.Change the function to dispatch single requests at the time. If
we do that, we can backoff exactly at the point where the device
can't consume more IO, and leave the rest with the scheduler for
better merging and future dispatch decision making.Signed-off-by: Jens Axboe
Reviewed-by: Omar Sandoval
Tested-by: Hannes Reinecke
18 Jan, 2017
2 commits
-
This adds a set of hooks that intercepts the blk-mq path of
allocating/inserting/issuing/completing requests, allowing
us to develop a scheduler within that framework.We reuse the existing elevator scheduler API on the registration
side, but augment that with the scheduler flagging support for
the blk-mq interfce, and with a separate set of ops hooks for MQ
devices.We split driver and scheduler tags, so we can run the scheduling
independently of device queue depth.Signed-off-by: Jens Axboe
Reviewed-by: Bart Van Assche
Reviewed-by: Omar Sandoval -
Prep patch for adding MQ ops as well, since doing anon unions with
named initializers doesn't work on older compilers.Signed-off-by: Jens Axboe
Reviewed-by: Johannes Thumshirn
Reviewed-by: Bart Van Assche
Reviewed-by: Omar Sandoval
10 Dec, 2016
1 commit
-
Signed-off-by: Jens Axboe
Reviewed-by: Hannes Reinecke
28 Oct, 2016
1 commit
-
Now that we don't need the common flags to overflow outside the range
of a 32-bit type we can encode them the same way for both the bio and
request fields. This in addition allows us to place the operation
first (and make some room for more ops while we're at it) and to
stop having to shift around the operation values.In addition this allows passing around only one value in the block layer
instead of two (and eventuall also in the file systems, but we can do
that later) and thus clean up a lot of code.Last but not least this allows decreasing the size of the cmd_flags
field in struct request to 32-bits. Various functions passing this
value could also be updated, but I'd like to avoid the churn for now.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
21 Jul, 2016
1 commit
-
Before merging a bio into an existing request, io scheduler is called to
get its approval first. However, the requests that come from a plug
flush may get merged by block layer without consulting with io
scheduler.In case of CFQ, this can cause fairness problems. For instance, if a
request gets merged into a low weight cgroup's request, high weight cgroup
now will depend on low weight cgroup to get scheduled. If high weigt cgroup
needs that io request to complete before submitting more requests, then it
will also lose its timeslice.Following script demonstrates the problem. Group g1 has a low weight, g2
and g3 have equal high weights but g2's requests are adjacent to g1's
requests so they are subject to merging. Due to these merges, g2 gets
poor disk time allocation.cat > cfq-merge-repro.sh << "EOF"
#!/bin/bash
set -eIO_ROOT=/mnt-cgroup/io
mkdir -p $IO_ROOT
if ! mount | grep -qw $IO_ROOT; then
mount -t cgroup none -oblkio $IO_ROOT
ficd $IO_ROOT
for i in g1 g2 g3; do
if [ -d $i ]; then
rmdir $i
fi
donemkdir g1 && echo 10 > g1/blkio.weight
mkdir g2 && echo 495 > g2/blkio.weight
mkdir g3 && echo 495 > g3/blkio.weightRUNTIME=10
(echo $BASHPID > g1/cgroup.procs &&
fio --readonly --name name1 --filename /dev/sdb \
--rw read --size 64k --bs 64k --time_based \
--runtime=$RUNTIME --offset=0k &> /dev/null)&(echo $BASHPID > g2/cgroup.procs &&
fio --readonly --name name1 --filename /dev/sdb \
--rw read --size 64k --bs 64k --time_based \
--runtime=$RUNTIME --offset=64k &> /dev/null)&(echo $BASHPID > g3/cgroup.procs &&
fio --readonly --name name1 --filename /dev/sdb \
--rw read --size 64k --bs 64k --time_based \
--runtime=$RUNTIME --offset=256k &> /dev/null)&sleep $((RUNTIME+1))
for i in g1 g2 g3; do
echo ---- $i ----
cat $i/blkio.time
doneEOF
# ./cfq-merge-repro.sh
---- g1 ----
8:16 162
---- g2 ----
8:16 165
---- g3 ----
8:16 686After applying the patch:
# ./cfq-merge-repro.sh
---- g1 ----
8:16 90
---- g2 ----
8:16 445
---- g3 ----
8:16 471Signed-off-by: Tahsin Erdogan
Signed-off-by: Jens Axboe
08 Jun, 2016
1 commit
-
This patch converts the elevator code to use separate variables
for the operation and flags, and to check req_op for the REQ_OP.Signed-off-by: Mike Christie
Reviewed-by: Christoph Hellwig
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe
10 Jun, 2015
1 commit
-
A previous commit wanted to make CFQ default to IOPS mode on
non-rotational storage, however it did so when the queue was
initialized and the non-rotational flag is only set later on
in the probe.Add an elevator hook that gets called off the add_disk() path,
at that point we know that feature probing has finished, and
we can reliably check for the various flags that drivers can
set.Fixes: 41c0126b ("block: Make CFQ default to IOPS mode on SSDs")
Tested-by: Romain Francoise
Signed-off-by: Jens Axboe
23 Jun, 2014
1 commit
-
This reverts commit b5097e956a4d2919ee248d6481e4204c5568ed5c.
The original commit is buggy, we do use the registration functions
at runtime, for instance when loading IO schedulers through sysfs.Reported-by: Damien Wyart
12 Jun, 2014
1 commit
-
elv_abort_queue has no callers, and blk_abort_flushes is only called by
elv_abort_queue.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
11 Jun, 2014
1 commit
-
elv_register is only called by elevator init functions:
__init cfq_init
__init deadline_init
__init noop_initCc: Andrew Morton
Signed-off-by: Fabian Frederick
Signed-off-by: Jens Axboe
25 Feb, 2014
2 commits
-
rq_fifo_clear() reset the csd.list through INIT_LIST_HEAD for no clear
purpose. The csd.list doesn't need to be initialized as a list head
because it's only ever used as a list node.Lets remove this useless initialization.
Reviewed-by: Jan Kara
Cc: Andrew Morton
Cc: Christoph Hellwig
Cc: Ingo Molnar
Cc: Jan Kara
Cc: Jens Axboe
Signed-off-by: Frederic Weisbecker
Signed-off-by: Jens Axboe -
Block layer currently abuses rq->csd.list.next for storing fifo_time.
That is a terrible hack and completely unnecessary as well. Union
achieves the same space saving in a cleaner way.Signed-off-by: Jan Kara
Cc: Andrew Morton
Cc: Christoph Hellwig
Cc: Ingo Molnar
Cc: Jens Axboe
Signed-off-by: Frederic Weisbecker
Signed-off-by: Jens Axboe
03 Jul, 2013
1 commit
-
There's a race between elevator switching and normal io operation.
Because the allocation of struct elevator_queue and struct elevator_data
don't in a atomic operation.So there are have chance to use NULL
->elevator_data.
For example:
Thread A: Thread B
blk_queu_bio elevator_switch
spin_lock_irq(q->queue_block) elevator_alloc
elv_merge elevator_init_fnBecause call elevator_alloc, it can't hold queue_lock and the
->elevator_data is NULL.So at the same time, threadA call elv_merge and
nedd some info of elevator_data.So the crash happened.Move the elevator_alloc into func elevator_init_fn, it make the
operations in a atomic operation.Using the follow method can easy reproduce this bug
1:dd if=/dev/sdb of=/dev/null
2:while true;do echo noop > scheduler;echo deadline > scheduler;doneThe test method also use this method.
Signed-off-by: Jianpeng Ma
Signed-off-by: Jens Axboe
01 Mar, 2013
1 commit
-
Pull block IO core bits from Jens Axboe:
"Below are the core block IO bits for 3.9. It was delayed a few days
since my workstation kept crashing every 2-8h after pulling it into
current -git, but turns out it is a bug in the new pstate code (divide
by zero, will report separately). In any case, it contains:- The big cfq/blkcg update from Tejun and and Vivek.
- Additional block and writeback tracepoints from Tejun.
- Improvement of the should sort (based on queues) logic in the plug
flushing.- _io() variants of the wait_for_completion() interface, using
io_schedule() instead of schedule() to contribute to io wait
properly.- Various little fixes.
You'll get two trivial merge conflicts, which should be easy enough to
fix up"Fix up the trivial conflicts due to hlist traversal cleanups (commit
b67bfe0d42ca: "hlist: drop the node parameter from iterators").* 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
block: remove redundant check to bd_openers()
block: use i_size_write() in bd_set_size()
cfq: fix lock imbalance with failed allocations
drivers/block/swim3.c: fix null pointer dereference
block: don't select PERCPU_RWSEM
block: account iowait time when waiting for completion of IO request
sched: add wait_for_completion_io[_timeout]
writeback: add more tracepoints
block: add block_{touch|dirty}_buffer tracepoint
buffer: make touch_buffer() an exported function
block: add @req to bio_{front|back}_merge tracepoints
block: add missing block_bio_complete() tracepoint
block: Remove should_sort judgement when flush blk_plug
block,elevator: use new hashtable implementation
cfq-iosched: add hierarchical cfq_group statistics
cfq-iosched: collect stats from dead cfqgs
cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
block: RCU free request_queue
blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
...