10 Oct, 2016
2 commits
-
Pull blk-mq CPU hotplug update from Jens Axboe:
"This is the conversion of blk-mq to the new hotplug state machine"
* 'for-4.9/block-smp' of git://git.kernel.dk/linux-block:
blk-mq: fixup "Convert to new hotplug state machine"
blk-mq: Convert to new hotplug state machine
blk-mq/cpu-notif: Convert to new hotplug state machine
-
Pull blk-mq irq/cpu mapping updates from Jens Axboe:
"This is the block-irq topic branch for 4.9-rc. It's mostly from
Christoph, and it allows drivers to specify their own mappings, and
more importantly, to share the blk-mq mappings with the IRQ affinity
mappings. It's a good step towards making this work better out of the
box"
* 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
blk_mq: linux/blk-mq.h does not include all the headers it depends on
blk-mq: kill unused blk_mq_create_mq_map()
blk-mq: get rid of the cpumask in struct blk_mq_tags
nvme: remove the post_scan callout
nvme: switch to use pci_alloc_irq_vectors
blk-mq: provide a default queue mapping for PCI device
blk-mq: allow the driver to pass in a queue mapping
blk-mq: remove ->map_queue
blk-mq: only allocate a single mq_map per tag_set
blk-mq: don't redistribute hardware queues on a CPU hotplug event
23 Sep, 2016
1 commit
-
If a driver sets BLK_MQ_F_BLOCKING, it is allowed to block in its
->queue_rq() handler. For that case, blk-mq ensures that we always
call it from a safe context.
Signed-off-by: Jens Axboe
Tested-by: Josef Bacik
22 Sep, 2016
1 commit
-
Replace the block-mq notifier list management with the multi instance
facility in the cpu hotplug state machine.
Signed-off-by: Thomas Gleixner
Cc: Peter Zijlstra
Cc: linux-block@vger.kernel.org
Cc: rt@linutronix.de
Cc: Christoph Hellwig
Signed-off-by: Jens Axboe
21 Sep, 2016
1 commit
-
Enable devices without a gendisk instance to register themselves with
blk-mq and expose the associated multi-queue sysfs entries.
Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
17 Sep, 2016
1 commit
-
This is a generally useful data structure, so make it available to
anyone else who might want to use it. It's also a nice cleanup
separating the allocation logic from the rest of the tag handling logic.
The code is behind a new Kconfig option, CONFIG_SBITMAP, which is only
selected by CONFIG_BLOCK for now.
This should be a complete noop functionality-wise.
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
15 Sep, 2016
5 commits
-
Unused now that NVMe sets up irq affinity before calling into blk-mq.
Signed-off-by: Christoph Hellwig
Reviewed-by: Keith Busch
Signed-off-by: Jens Axboe
-
This allows drivers to specify their own queue mapping by overriding the
setup-time function that builds the mq_map. This can be used for
example to build the map based on the MSI-X vector mapping provided
by the core interrupt layer for PCI devices.
Signed-off-by: Christoph Hellwig
Reviewed-by: Keith Busch
Signed-off-by: Jens Axboe
-
All drivers use the default, so provide an inline version of it. If we
ever need another queue mapping we can add an optional method back,
although supporting it will also require major changes to the queue setup
code.
This provides better code generation, and better debuggability as well.
Signed-off-by: Christoph Hellwig
Reviewed-by: Keith Busch
Signed-off-by: Jens Axboe
-
The mapping is identical for all queues in a tag_set, so stop wasting
memory for building multiple. Note that for now I've kept the mq_map
pointer in the request_queue, but we'll need to investigate if we can
remove it without suffering too much from the additional pointer chasing.
The same would apply to the mq_ops pointer as well.
Signed-off-by: Christoph Hellwig
Reviewed-by: Keith Busch
Signed-off-by: Jens Axboe
-
blk_mq_delay_kick_requeue_list() provides the ability to kick the
q->requeue_list after a specified time. To do this the request_queue's
'requeue_work' member was changed to a delayed_work.
blk_mq_delay_kick_requeue_list() allows DM to defer processing requeued
requests when it doesn't make sense to immediately requeue them
(e.g. when all paths in a DM multipath have failed).
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe
14 Sep, 2016
2 commits
-
The blk_mq_alloc_single_hw_queue() is a prototype artifact that
should have been removed with
commit cdef54dd85ad66e77262ea57796a3e81683dd5d6
"blk-mq: remove alloc_hctx and free_hctx methods" where the last
users of it were deleted.
Fixes: cdef54dd85ad ("blk-mq: remove alloc_hctx and free_hctx methods")
Signed-off-by: Linus Walleij
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe
-
In order to help determine the effectiveness of polling in a running
system it is useful to determine the ratio of how often the poll
function is called vs how often the completion is checked. For this
reason we add a poll_considered variable and add it to the sysfs entry
for io_poll.
Signed-off-by: Stephen Bates
Acked-by: Christoph Hellwig
Signed-off-by: Jens Axboe
29 Aug, 2016
2 commits
-
Various cache line optimizations:
- Move delay_work towards the end. It's huge, and we don't use it
  a lot (only SCSI).
- Move the atomic state into the same cacheline as the dispatch
  list and lock.
- Rearrange a few members to pack it better.
- Shrink the max-order for dispatch accounting from 10 to 7. This
  means that ->dispatched[] and ->run now take up their own
  cacheline.
This shrinks struct blk_mq_hw_ctx down to 8 cachelines.
Signed-off-by: Jens Axboe
-
We don't need the larger delayed work struct, since we always run it
immediately.
Signed-off-by: Jens Axboe
08 Jul, 2016
1 commit
-
The new nvme-rdma driver will need to reinitialize all the tags as part of
the error recovery procedure (realloc the tag memory region). Add a helper
in blk-mq for it that can iterate over all requests in a tagset to make
this easier.
Signed-off-by: Sagi Grimberg
Tested-by: Ming Lin
Reviewed-by: Stephen Bates
Signed-off-by: Christoph Hellwig
Reviewed-by: Steve Wise
Tested-by: Steve Wise
Signed-off-by: Jens Axboe
06 Jul, 2016
1 commit
-
For some protocols like NVMe over Fabrics we need to be able to send
initialization commands to a specific queue.
Based on an earlier patch from Christoph Hellwig.
Signed-off-by: Ming Lin
[hch: disallow sleeping allocation, req_op fixes]
Signed-off-by: Christoph Hellwig
Reviewed-by: Keith Busch
Signed-off-by: Jens Axboe
13 Apr, 2016
2 commits
-
No caller outside the blk-mq code, so we can make it static.
Signed-off-by: Sagi Grimberg
Reviewed-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
Signed-off-by: Jens Axboe
-
It's useful to iterate on all the active tags in cases
where we will need to fail all the queues' IO.
Signed-off-by: Sagi Grimberg
[hch: carefully check for valid tagsets]
Reviewed-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
Signed-off-by: Jens Axboe
20 Mar, 2016
1 commit
-
queue_for_each_ctx() iterates over per_cpu variables under the assumption that
the possible cpu mask cannot have holes. That's wrong as all cpumasks can have
holes. In case there are holes the iteration ends up accessing uninitialized
memory and crashing as a result.
Replace the macro by a proper for_each_possible_cpu() loop and drop the
unused macro blk_ctx_sum() which references queue_for_each_ctx().
Reported-by: Xiong Zhou
Signed-off-by: Thomas Gleixner
Signed-off-by: Jens Axboe
10 Feb, 2016
1 commit
-
The hardware's provided queue count may change at runtime with resource
provisioning. This patch allows a block driver to alter the number of
h/w queues available when its resource count changes.
The main part is a new blk-mq API to request a new number of h/w queues
for a given live tag set. The new API freezes all queues using that set,
then adjusts the allocated count prior to remapping these to CPUs.
The bulk of the rest just shifts where h/w contexts and all their
artifacts are allocated and freed.
The number of max h/w contexts is capped to the number of possible cpus
since there is no use for more than that. As such, all pre-allocated
memory for pointers needs to account for the max possible rather than
the initial number of queues.
A side effect of this is that blk-mq will proceed successfully as
long as it can allocate at least one h/w context. Previously it would
fail request queue initialization if less than the requested number
was allocated.
Signed-off-by: Keith Busch
Reviewed-by: Christoph Hellwig
Tested-by: Jon Derrick
Signed-off-by: Jens Axboe
02 Dec, 2015
1 commit
-
We already have the reserved flag, and a nowait flag awkwardly encoded as
a gfp_t. Add a real flags argument to make the scheme more extensible and
allow for a nicer calling convention.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
08 Nov, 2015
1 commit
-
Add basic support for polling for specific IO to complete. This uses
the cookie that blk-mq passes back, which enables the block layer
to pass this cookie to the driver to spin for a specific request.
This will be combined with request latency tracking, so we can make
qualified decisions about when to poll and when not to. For now, for
benchmark purposes, we add a sysfs file that controls whether polling
is enabled or not.
Signed-off-by: Jens Axboe
Acked-by: Christoph Hellwig
Acked-by: Keith Busch
22 Oct, 2015
1 commit
-
Allow pmem, and other synchronous/bio-based block drivers, to fall back
on a per-cpu reference count managed by the core for tracking queue
live/dead state.
The existing per-cpu reference count for the blk_mq case is promoted to
be used in all block i/o scenarios. This involves initializing it by
default, waiting for it to drop to zero at exit, and holding a live
reference over the invocation of q->make_request_fn() in
generic_make_request(). The blk_mq code continues to take its own
reference per blk_mq request and retains the ability to freeze the
queue, but the check that the queue is frozen is moved to
generic_make_request().
This fixes crash signatures like the following:
BUG: unable to handle kernel paging request at ffff880140000000
[..]
Call Trace:
[] ? copy_user_handle_tail+0x5f/0x70
[] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
[] pmem_make_request+0xd1/0x200 [nd_pmem]
[] ? mempool_alloc+0x72/0x1a0
[] generic_make_request+0xd6/0x110
[] submit_bio+0x76/0x170
[] submit_bh_wbc+0x12f/0x160
[] submit_bh+0x12/0x20
[] jbd2_write_superblock+0x8d/0x170
[] jbd2_mark_journal_empty+0x5d/0x90
[] jbd2_journal_destroy+0x24b/0x270
[] ? put_pwq_unlocked+0x2a/0x30
[] ? destroy_workqueue+0x225/0x250
[] ext4_put_super+0x64/0x360
[] generic_shutdown_super+0x6a/0xf0
Cc: Jens Axboe
Cc: Keith Busch
Cc: Ross Zwisler
Suggested-by: Christoph Hellwig
Reviewed-by: Christoph Hellwig
Tested-by: Ross Zwisler
Signed-off-by: Dan Williams
Signed-off-by: Jens Axboe
01 Oct, 2015
2 commits
-
And replace blk_mq_tag_busy_iter with it - the driver use was replaced
with a new helper a while ago, and internal to the block layer we
only need the new version.
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
-
blk_mq_complete_request may be a no-op if the request has already
been completed by other means (e.g. a timeout or cancellation), but
currently drivers have to set rq->errors before calling
blk_mq_complete_request, which might leave us with the wrong error value.
Add an error parameter to blk_mq_complete_request so that we can
defer setting rq->errors until we know we won the race to complete the
request.
Signed-off-by: Christoph Hellwig
Reviewed-by: Sagi Grimberg
Signed-off-by: Jens Axboe
30 Sep, 2015
1 commit
-
There is a race between cpu hotplug handling and adding/deleting
gendisk for blk-mq, where both are trying to register and unregister
the same sysfs entries.
null_add_dev
  --> blk_mq_init_queue
    --> blk_mq_init_allocated_queue
      --> add to 'all_q_list' (*)
  --> add_disk
    --> blk_register_queue
      --> blk_mq_register_disk (++)
null_del_dev
  --> del_gendisk
    --> blk_unregister_queue
      --> blk_mq_unregister_disk (--)
  --> blk_cleanup_queue
    --> blk_mq_free_queue
      --> del from 'all_q_list' (*)
blk_mq_queue_reinit
  --> blk_mq_sysfs_unregister (-)
  --> blk_mq_sysfs_register (+)
While the request queue is on 'all_q_list' (*),
blk_mq_queue_reinit() can be called for the queue anytime by the CPU
hotplug callback. But blk_mq_sysfs_unregister (-) and
blk_mq_sysfs_register (+) in blk_mq_queue_reinit must not be called
before blk_mq_register_disk (++) or after blk_mq_unregister_disk (--)
has finished, because '/sys/block/*/mq/' does not exist then.
There is already a BLK_MQ_F_SYSFS_UP flag in hctx->flags which can
be used to track this sysfs state, but it only fixes the issue
partially.
In order to fix it completely, we need a per-queue flag instead of a
per-hctx flag, with appropriate locking. So this introduces
q->mq_sysfs_init_done, which is properly protected with all_q_mutex.
Also, we need to ensure that blk_mq_map_swqueue() is called with
all_q_mutex held. Since hctx->nr_ctx is reset temporarily and
updated in blk_mq_map_swqueue(), we should avoid
blk_mq_register_hctx() seeing the temporary hctx->nr_ctx value
in CPU hotplug handling or when adding/deleting a gendisk.
Signed-off-by: Akinobu Mita
Reviewed-by: Ming Lei
Cc: Ming Lei
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe
02 Jun, 2015
1 commit
-
Storage controllers may expose multiple block devices that share hardware
resources managed by blk-mq. This patch enhances the shared tags so a
low-level driver can access the shared resources not tied to the unshared
h/w contexts. This way the LLD can dynamically add and delete disks and
request queues without having to track all the request_queue hctx's to
iterate outstanding tags.
Signed-off-by: Keith Busch
Signed-off-by: Jens Axboe
17 Apr, 2015
1 commit
-
Commit 889fa31f00b2 was a bit too eager in reducing the loop count,
so we ended up missing queues in some configurations. Ensure that
our division rounds up, so that's not the case.
Reported-by: Guenter Roeck
Fixes: 889fa31f00b2 ("blk-mq: reduce unnecessary software queue looping")
Signed-off-by: Jens Axboe
10 Apr, 2015
1 commit
-
Casting to void and adding the size of the request is "shit code" and
only a "crazy monkey on crack" would write that. So let's clean it up.
Signed-off-by: Jens Axboe
13 Mar, 2015
2 commits
-
Rename blk_mq_run_queues to blk_mq_run_hw_queues, add async argument,
and export it.
DM's suspend support must be able to run the queue without starting
stopped hw queues.
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe
-
Add a variant of blk_mq_init_queue that allows a previously allocated
queue to be initialized. blk_mq_init_allocated_queue models
blk_init_allocated_queue -- which was also created for DM's use.
DM's approach to device creation requires a placeholder request_queue be
allocated for use with alloc_dev() but the decision about what type of
request_queue will be ultimately created is deferred until all component
devices referenced in the DM table are processed to determine the table
type (request-based, blk-mq request-based, or bio-based).
Also, because of DM's late finalization of the request_queue type
the call to blk_mq_register_disk() doesn't happen during alloc_dev().
Must export blk_mq_register_disk() so that DM can backfill the 'mq' dir
once the blk-mq queue is fully allocated.
Signed-off-by: Mike Snitzer
Reviewed-by: Ming Lei
Signed-off-by: Jens Axboe
13 Feb, 2015
1 commit
-
Pull core block IO changes from Jens Axboe:
"This contains:
 - A series from Christoph that cleans up and refactors various parts
   of the REQ_BLOCK_PC handling. Contributions in that series from
   Dongsu Park and Kent Overstreet as well.
 - CFQ:
   - A bug fix for cfq for realtime IO scheduling from Jeff Moyer.
   - A stable patch fixing a potential crash in CFQ in OOM
     situations. From Konstantin Khlebnikov.
 - blk-mq:
   - Add support for tag allocation policies, from Shaohua. This is
     a prep patch enabling libata (and other SCSI parts) to use the
     blk-mq tagging, instead of rolling their own.
   - Various little tweaks from Keith and Mike, in preparation for
     DM blk-mq support.
   - Minor little fixes or tweaks from me.
   - A double free error fix from Tony Battersby.
 - The partition 4k issue fixes from Matthew and Boaz.
 - Add support for zero+unprovision for blkdev_issue_zeroout() from
   Martin"
* 'for-3.20/core' of git://git.kernel.dk/linux-block: (27 commits)
block: remove unused function blk_bio_map_sg
block: handle the null_mapped flag correctly in blk_rq_map_user_iov
blk-mq: fix double-free in error path
block: prevent request-to-request merging with gaps if not allowed
blk-mq: make blk_mq_run_queues() static
dm: fix multipath regression due to initializing wrong request
cfq-iosched: handle failure of cfq group allocation
block: Quiesce zeroout wrapper
block: rewrite and split __bio_copy_iov()
block: merge __bio_map_user_iov into bio_map_user_iov
block: merge __bio_map_kern into bio_map_kern
block: pass iov_iter to the BLOCK_PC mapping functions
block: add a helper to free bio bounce buffer pages
block: use blk_rq_map_user_iov to implement blk_rq_map_user
block: simplify bio_map_kern
block: mark blk-mq devices as stackable
block: keep established cmd_flags when cloning into a blk-mq request
block: add blk-mq support to blk_insert_cloned_request()
block: require blk_rq_prep_clone() be given an initialized clone request
blk-mq: add tag allocation policy
...
11 Feb, 2015
1 commit
-
We no longer use it outside of blk-mq.c, so we can make it static
and stop exporting it. Additionally, kill the 'async' argument, as
there's only one user of it.
Signed-off-by: Jens Axboe
24 Jan, 2015
1 commit
-
This is the blk-mq part to support tag allocation policy. The default
allocation policy isn't changed (though it's not a strict FIFO). The new
policy is round-robin for libata. But it's a best-effort implementation. If
multiple tasks are competing, the tags returned will be mixed (which is
unavoidable even with !mq, as requests from different tasks can be
mixed in the queue)
Cc: Jens Axboe
Cc: Tejun Heo
Cc: Christoph Hellwig
Signed-off-by: Shaohua Li
Signed-off-by: Jens Axboe
08 Jan, 2015
4 commits
-
Adds a helper function a driver can use to abort requeued requests in
case any are pending when h/w queues are being removed.
Signed-off-by: Jens Axboe
-
Kicking requeued requests will start h/w queues in a work_queue, which
may alter the driver's requested state to temporarily stop them. This
patch exports a method to cancel the q->requeue_work so a driver can be
assured stopped h/w queues won't be started up before it is ready.
Signed-off-by: Keith Busch
Signed-off-by: Jens Axboe
-
Drivers can iterate over all allocated request tags, but their callback
needs a way to know if the driver started the request in the first place.
Signed-off-by: Keith Busch
Signed-off-by: Jens Axboe
-
We store it in the tag set, we don't need it in the hardware queue.
While removing cmd_size, place ->queue_num further down to avoid
a hole on 64-bit archs. It's not used in any fast paths, so we
can safely move it.
Signed-off-by: Jens Axboe
03 Jan, 2015
1 commit
-
Commit b4c6a028774b exported the start and unfreeze, but we need
the regular blk_mq_freeze_queue() for the loop conversion.
Signed-off-by: Jens Axboe