18 Jul, 2018

1 commit

  • In the case of the 'none' io scheduler, when the hw queue isn't busy it
    isn't necessary to enqueue the request to the sw queue and dequeue it
    again, because the request can be submitted to the hw queue right away
    without extra cost; meanwhile there shouldn't be many requests in the
    sw queue, so we don't need to worry about the effect on IO merging.

    There are still some single hw queue SCSI HBAs (HPSA, megaraid_sas, ...)
    which may be connected to high performance devices, so 'none' is often
    required for obtaining good performance.

    This patch improves IOPS and decreases CPU utilization on megaraid_sas,
    per Kashyap's test.

    Cc: Kashyap Desai
    Cc: Laurence Oberman
    Cc: Omar Sandoval
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Reported-by: Kashyap Desai
    Tested-by: Kashyap Desai
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
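
    A rough C sketch of the fast path this commit describes (illustrative
    only; the types and the issue_to_hw/enqueue_sw helpers are invented
    stand-ins for the real blk-mq paths):

      #include <stdbool.h>

      struct hw_ctx { bool busy; };               /* stand-in for blk_mq_hw_ctx     */
      struct request { int tag; };                /* stand-in for struct request    */

      /* invented stand-ins for the real dispatch/insert paths */
      static bool issue_to_hw(struct hw_ctx *hctx, struct request *rq)
      { (void)hctx; (void)rq; return true; }      /* pretend the driver accepted it */
      static void enqueue_sw(struct hw_ctx *hctx, struct request *rq)
      { (void)hctx; (void)rq; }                   /* pretend we queued it in sw queue */

      /*
       * With the 'none' scheduler there is nothing to merge against, so when the
       * hw queue is not busy the request can be issued directly instead of taking
       * a round trip through the per-CPU sw queue.
       */
      static void submit(struct hw_ctx *hctx, struct request *rq, bool has_elevator)
      {
              if (!has_elevator && !hctx->busy && issue_to_hw(hctx, rq))
                      return;                     /* went straight to the driver */
              enqueue_sw(hctx, rq);               /* normal sw-queue path        */
      }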
     

09 Jul, 2018

2 commits

  • set->mq_map is currently cleared if something goes wrong when
    establishing a queue map in blk-mq-pci.c. It's also cleared before
    updating a queue map in blk_mq_update_queue_map().

    This patch provides an API to clear set->mq_map to make that explicit.

    Signed-off-by: Minwoo Im
    Signed-off-by: Jens Axboe

    Minwoo Im
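
    The helper is presumably just a loop resetting the CPU-to-queue map; a
    hedged sketch (the structure below is a simplification, not the real
    blk_mq_tag_set):

      struct tag_set {
              unsigned int nr_cpus;
              unsigned int *mq_map;                /* mq_map[cpu] = hw queue index */
      };

      /* Reset every CPU's mapping back to queue 0, the default. */
      static void clear_mq_map(struct tag_set *set)
      {
              for (unsigned int cpu = 0; cpu < set->nr_cpus; cpu++)
                      set->mq_map[cpu] = 0;
      }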
     
  • We never pass 'wait' as true to blk_mq_get_driver_tag(), and hence
    we never change '**hctx' as well. The last use of these went away
    with the flush cleanup, commit 0c2a6fe4dc3e.

    So clean up the usage and remove the two extra parameters.

    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Tested-by: Andrew Jones
    Reviewed-by: Omar Sandoval
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

29 May, 2018

1 commit

  • This patch simplifies the timeout handling by relying on the request
    reference counting to ensure the iterator is operating on an inflight
    and truly timed out request. Since the reference counting prevents the
    tag from being reallocated, the block layer no longer needs to prevent
    drivers from completing their requests while the timeout handler is
    operating on it: a driver completing a request is allowed to proceed to
    the next state without additional synchronization with the block layer.

    This also removes any need for generation sequence numbers, since the
    request cannot be recycled into a new instance while timeout handling
    is operating on it.

    To enable this, a refcount is added to struct request so that request
    users can be sure they're operating on the same request without it
    changing while they're processing it. The request's tag won't be
    released for reuse until both the timeout handler and the completion
    are done with it.

    Signed-off-by: Keith Busch
    [hch: slight cleanups, added back submission side hctx lock, use cmpxchg
    for completions]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
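
    A small user-space model of the refcounting idea using C11 atomics
    (field and function names are invented, not the kernel's): the tag is
    only released once every holder, including the timeout handler, has
    dropped its reference.

      #include <stdatomic.h>

      struct req {
              atomic_int ref;                      /* models the new struct request refcount */
              int tag;
      };

      static void req_get(struct req *rq)
      {
              atomic_fetch_add(&rq->ref, 1);       /* e.g. taken by the timeout iterator */
      }

      /*
       * Drop a reference; the tag goes back to the allocator only when the last
       * holder lets go, so a timeout handler holding a ref never races with the
       * tag being recycled into a new request.
       */
      static void req_put(struct req *rq)
      {
              if (atomic_fetch_sub(&rq->ref, 1) == 1)
                      rq->tag = -1;                /* last ref: tag may now be reused */
      }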
     

26 Apr, 2018

1 commit

  • When the blk-mq inflight implementation was added, /proc/diskstats was
    converted to use it, but /sys/block/$dev/inflight was not. Fix it by
    adding another helper to count in-flight requests by data direction.

    Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

18 Jan, 2018

1 commit

  • blk_insert_cloned_request() is called in the fast path of a dm-rq driver
    (e.g. blk-mq request-based DM mpath). blk_insert_cloned_request() uses
    blk_mq_request_bypass_insert() to directly append the request to the
    blk-mq hctx->dispatch_list of the underlying queue.

    1) This way isn't efficient enough because the hctx spinlock is always
    used.

    2) With blk_insert_cloned_request(), we completely bypass underlying
    queue's elevator and depend on the upper-level dm-rq driver's elevator
    to schedule IO. But dm-rq currently can't get the underlying queue's
    dispatch feedback at all. Without knowing whether a request was issued
    or not (e.g. due to underlying queue being busy) the dm-rq elevator will
    not be able to provide effective IO merging (as a side-effect of dm-rq
    currently blindly destaging a request from its elevator only to requeue
    it after a delay, which kills any opportunity for merging). This
    obviously causes very bad sequential IO performance.

    Fix this by updating blk_insert_cloned_request() to use
    blk_mq_request_direct_issue(). blk_mq_request_direct_issue() allows a
    request to be issued directly to the underlying queue and returns the
    dispatch feedback (blk_status_t). If blk_mq_request_direct_issue()
    returns BLK_STS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
    to _not_ destage the request, thereby preserving the opportunity to
    merge IO.

    With this, request-based DM's blk-mq sequential IO performance is vastly
    improved (as much as 3X in mpath/virtio-scsi testing).

    Signed-off-by: Ming Lei
    [blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
    they were refactored to make them less fragile and easier to read/review]
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Ming Lei
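
    A sketch of the resulting control flow on the dm-rq side (illustrative
    C; direct_issue(), requeue_in_elevator() and the status values are
    invented names, not the kernel API):

      struct request;

      enum issue_status { STS_OK, STS_RESOURCE, STS_ERROR };

      /* stand-ins for blk_mq_request_direct_issue() and the dm-rq requeue/fail paths */
      enum issue_status direct_issue(struct request *clone);
      void requeue_in_elevator(struct request *orig);
      void fail_request(struct request *orig);

      /*
       * Issue the clone directly to the underlying queue and act on the feedback:
       * if the low-level queue is busy, keep the original request in dm-rq's
       * elevator (DM_MAPIO_REQUEUE) so it can still be merged, rather than
       * destaging it and requeueing after a delay.
       */
      void map_and_issue(struct request *orig, struct request *clone)
      {
              switch (direct_issue(clone)) {
              case STS_OK:
                      break;                       /* dispatched to the driver        */
              case STS_RESOURCE:
                      requeue_in_elevator(orig);   /* stay merge-able in the elevator */
                      break;
              default:
                      fail_request(orig);
                      break;
              }
      }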
     

10 Jan, 2018

3 commits

  • After the recent updates to use generation number and state based
    synchronization, we can easily replace REQ_ATOM_STARTED usages by
    adding an extra state to distinguish completed but not yet freed
    state.

    Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with
    blk_mq_rq_state() tests. REQ_ATOM_STARTED no longer has any users
    left and is removed.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
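
    Pictured roughly, the per-request state machine that replaces the bit
    looks like this (the helper below is illustrative; only the state names
    follow the commit text):

      /* blk-mq per-request states; MQ_RQ_COMPLETE is the newly added one. */
      enum mq_rq_state {
              MQ_RQ_IDLE,          /* allocated or already freed, not started */
              MQ_RQ_IN_FLIGHT,     /* issued to the driver                    */
              MQ_RQ_COMPLETE,      /* completed, but not yet freed            */
      };

      struct req { unsigned int gstate; };         /* low bits carry the state */

      /* "has this request started?" becomes a simple state test */
      static enum mq_rq_state rq_state(const struct req *rq)
      {
              return (enum mq_rq_state)(rq->gstate & 0x3);
      }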
     
  • With issue/complete and timeout paths now using the generation number
    and state based synchronization, blk_abort_request() is the only one
    which depends on REQ_ATOM_COMPLETE for arbitrating completion.

    There's no reason for blk_abort_request() to be a completely separate
    path. This patch makes blk_abort_request() piggyback on the timeout
    path instead of trying to terminate the request directly.

    This removes the last dependency on REQ_ATOM_COMPLETE in blk-mq.

    Note that this makes blk_abort_request() asynchronous - it initiates
    abortion but the actual termination will happen after a short while,
    even when the caller owns the request. AFAICS, SCSI and ATA should be
    fine with that and I think mtip32xx and dasd should be safe but not
    completely sure. It'd be great if people who know the drivers take a
    look.

    v2: - Add comment explaining the lack of synchronization around
    ->deadline update as requested by Bart.

    Signed-off-by: Tejun Heo
    Cc: Asai Thambi SP
    Cc: Stefan Haberland
    Cc: Jan Hoeppner
    Cc: Bart Van Assche
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, blk-mq timeout path synchronizes against the usual
    issue/completion path using a complex scheme involving atomic
    bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
    rules. Unfortunately, it contains quite a few holes.

    There's a complex dancing around REQ_ATOM_STARTED and
    REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
    they don't have a synchronization point across request recycle
    instances and it isn't clear what the barriers add.
    blk_mq_check_expired() can easily read STARTED from the N-2'th recycle
    instance, the deadline from the N-1'th, and run blk_mark_rq_complete()
    against the Nth instance.

    In fact, it's pretty easy to make blk_mq_check_expired() terminate a
    later instance of a request. If we induce 5 sec delay before
    time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
    2s, and issue back-to-back large IOs, blk-mq starts timing out
    requests spuriously pretty quickly. Nothing actually timed out. It
    just made the call on a recycle instance of a request and then
    terminated a later instance long after the original instance finished.
    The scenario isn't theoretical either.

    This patch replaces the broken synchronization mechanism with an RCU
    and generation number based one.

    1. Each request has a u64 generation + state value, which can be
    updated only by the request owner. Whenever a request becomes
    in-flight, the generation number gets bumped up too. This provides
    the basis for the timeout path to distinguish different recycle
    instances of the request.

    Also, marking a request in-flight and setting its deadline are
    protected with a seqcount so that the timeout path can fetch both
    values coherently.

    2. The timeout path fetches the generation, state and deadline. If
    the verdict is timeout, it records the generation into a dedicated
    request abortion field and does RCU wait.

    3. The completion path is also protected by RCU (from the previous
    patch) and checks whether the current generation number and state
    match the abortion field. If so, it skips completion.

    4. The timeout path, after RCU wait, scans requests again and
    terminates the ones whose generation and state still match the ones
    requested for abortion.

    At this point, the timeout path knows that it either lost the race
    (the generation number and state have changed) or the completion path
    will yield to it, so it can safely time out the request.

    While it's more lines of code, it's conceptually simpler, doesn't
    depend on direct use of subtle memory ordering or coherence, and
    hopefully doesn't terminate the wrong instance.

    While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
    between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
    removed yet as it's still used in other places. Future patches will
    move all state tracking to the new mechanism and remove all bitops in
    the hot paths.

    Note that this patch adds a comment explaining a race condition in
    BLK_EH_RESET_TIMER path. The race has always been there and this
    patch doesn't change it. It's just documenting the existing race.

    v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
    - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
    - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.

    v3: - Fixed possible extended seqcount / u64_stats_sync read looping
    spotted by Peter.
    - MQ_RQ_IDLE was incorrectly being set in complete_request instead
    of free_request. Fixed.

    v4: - Rebased on top of hctx_lock() refactoring patch.
    - Added comment explaining the use of hctx_lock() in completion path.

    v5: - Added comments requested by Bart.
    - Note the addition of BLK_EH_RESET_TIMER race condition in the
    commit message.

    Signed-off-by: Tejun Heo
    Cc: "jianchao.wang"
    Cc: Peter Zijlstra
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Signed-off-by: Jens Axboe

    Tejun Heo
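
    A compressed user-space model of the generation + sequence-count scheme
    (C11 atomics, all operations seq_cst for simplicity; this is not the
    kernel code): the owner bumps the generation and publishes the deadline
    under a sequence counter, and the timeout path only trusts a snapshot
    taken while the sequence was stable.

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdint.h>

      struct req {
              atomic_uint   seq;       /* even = stable, odd = update in progress     */
              atomic_ullong gen;       /* bumped each time the request goes in-flight */
              atomic_ullong deadline;
      };

      /* Owner side: start a new recycle instance of the request. */
      static void start_request(struct req *rq, uint64_t now, uint64_t timeout)
      {
              atomic_fetch_add(&rq->seq, 1);              /* seq becomes odd  */
              atomic_fetch_add(&rq->gen, 1);
              atomic_store(&rq->deadline, now + timeout);
              atomic_fetch_add(&rq->seq, 1);              /* seq becomes even */
      }

      /*
       * Timeout side: fetch generation and deadline coherently, retrying if a
       * writer was active.  The caller records the generation and only aborts the
       * request later if the generation (and state) still match this snapshot, so
       * a different recycle instance can never be terminated by mistake.
       */
      static bool snapshot(struct req *rq, uint64_t *gen, uint64_t *deadline)
      {
              unsigned int s = atomic_load(&rq->seq);
              if (s & 1)
                      return false;                       /* writer active, retry later */
              *gen      = atomic_load(&rq->gen);
              *deadline = atomic_load(&rq->deadline);
              return atomic_load(&rq->seq) == s;
      }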
     

15 Nov, 2017

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the main pull request for block storage for 4.15-rc1.

    Nothing out of the ordinary in here, and no API changes or anything
    like that. Just various new features for drivers, core changes, etc.
    In particular, this pull request contains:

    - A patch series from Bart, closing the hole on blk/scsi-mq queue
    quiescing.

    - A series from Christoph, building towards hidden gendisks (for
    multipath) and ability to move bio chains around.

    - NVMe
    - Support for native multipath for NVMe (Christoph).
    - Userspace notifications for AENs (Keith).
    - Command side-effects support (Keith).
    - SGL support (Chaitanya Kulkarni)
    - FC fixes and improvements (James Smart)
    - Lots of fixes and tweaks (Various)

    - bcache
    - New maintainer (Michael Lyle)
    - Writeback control improvements (Michael)
    - Various fixes (Coly, Elena, Eric, Liang, et al)

    - lightnvm updates, mostly centered around the pblk interface
    (Javier, Hans, and Rakesh).

    - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

    - Writeback series that fix the much discussed hundreds of millions
    of sync-all units. This goes all the way, as discussed previously
    (me).

    - Fix for missing wakeup on writeback timer adjustments (Yafang
    Shao).

    - Fix laptop mode on blk-mq (me).

    - {mq,name} tuple lookup for IO schedulers, allowing us to have
    alias names. This means you can use 'deadline' on both !mq and on
    mq (where it's called mq-deadline). (me).

    - blktrace race fix, oopsing on sg load (me).

    - blk-mq optimizations (me).

    - Obscure waitqueue race fix for kyber (Omar).

    - NBD fixes (Josef).

    - Disable writeback throttling by default on bfq, like we do on cfq
    (Luca Miccio).

    - Series from Ming that enable us to treat flush requests on blk-mq
    like any other request. This is a really nice cleanup.

    - Series from Ming that improves merging on blk-mq with schedulers,
    getting us closer to flipping the switch on scsi-mq again.

    - BFQ updates (Paolo).

    - blk-mq atomic flags memory ordering fixes (Peter Z).

    - Loop cgroup support (Shaohua).

    - Lots of minor fixes from lots of different folks, both for core and
    driver code"

    * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
    nvme: fix visibility of "uuid" ns attribute
    blk-mq: fixup some comment typos and lengths
    ide: ide-atapi: fix compile error with defining macro DEBUG
    blk-mq: improve tag waiting setup for non-shared tags
    brd: remove unused brd_mutex
    blk-mq: only run the hardware queue if IO is pending
    block: avoid null pointer dereference on null disk
    fs: guard_bio_eod() needs to consider partitions
    xtensa/simdisk: fix compile error
    nvme: expose subsys attribute to sysfs
    nvme: create 'slaves' and 'holders' entries for hidden controllers
    block: create 'slaves' and 'holders' entries for hidden gendisks
    nvme: also expose the namespace identification sysfs files for mpath nodes
    nvme: implement multipath access to nvme subsystems
    nvme: track shared namespaces
    nvme: introduce a nvme_ns_ids structure
    nvme: track subsystems
    block, nvme: Introduce blk_mq_req_flags_t
    block, scsi: Make SCSI quiesce and resume work reliably
    block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
    ...

    Linus Torvalds
     

11 Nov, 2017

2 commits

  • Currently we are inconsistent in when we decide to run the queue. Using
    blk_mq_run_hw_queues() we check if the hctx has pending IO before
    running it, but we don't do that from the individual queue run function,
    blk_mq_run_hw_queue(). This potentially results in a lot of extra and
    pointless queue runs, on flush requests and (much worse) in tag
    starvation situations. This is observable just by looking at top
    output, with lots of kworkers active. For the !async runs, it just
    adds to the CPU overhead of blk-mq.

    Move the has-pending check into the run function instead of having
    callers do it.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
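
    Roughly, the guard moves into the run helper itself, so every caller
    gets it for free (illustrative C; has_pending_io() and kick_queue_work()
    are invented stand-ins):

      #include <stdbool.h>

      struct hw_ctx;

      bool has_pending_io(struct hw_ctx *hctx);    /* any sw queue or dispatch-list work? */
      void kick_queue_work(struct hw_ctx *hctx, bool async);

      /* Run one hw queue, but only if it actually has IO pending; callers can no
       * longer forget the check, and pointless kworker wakeups go away. */
      void run_hw_queue(struct hw_ctx *hctx, bool async)
      {
              if (!has_pending_io(hctx))
                      return;
              kick_queue_work(hctx, async);
      }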
     
  • Several block layer and NVMe core functions accept a combination
    of BLK_MQ_REQ_* flags through the 'flags' argument but there is
    no verification at compile time whether the right type of block
    layer flags is passed. Make it possible for sparse to verify this.
    This patch does not change any functionality.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Tested-by: Oleksandr Natalenko
    Cc: linux-nvme@lists.infradead.org
    Cc: Christoph Hellwig
    Cc: Johannes Thumshirn
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
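
    The kernel's usual way to make such flags checkable is a __bitwise
    typedef, which sparse treats as a distinct type; a hedged sketch of the
    pattern (the typedef and flag names below are illustrative, not the
    actual blk_mq_req_flags_t definition):

      /* Under sparse (__CHECKER__), __bitwise makes the typedef a distinct type,
       * so passing a plain int or a flag of another flavour is reported. */
      #ifdef __CHECKER__
      #define __bitwise __attribute__((bitwise))
      #define __force   __attribute__((force))
      #else
      #define __bitwise
      #define __force
      #endif

      typedef unsigned int __bitwise mq_req_flags_t;

      #define MQ_REQ_NOWAIT    ((__force mq_req_flags_t)(1U << 0))  /* illustrative */
      #define MQ_REQ_RESERVED  ((__force mq_req_flags_t)(1U << 1))  /* illustrative */

      struct request_queue;
      struct request;

      /* callers must now pass mq_req_flags_t, not a bare unsigned int */
      struct request *get_request(struct request_queue *q, mq_req_flags_t flags);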
     

05 Nov, 2017

3 commits

  • We need this helper to put the driver tag for a flush rq, since the
    following patch will stop sharing the tag across the flush request
    sequence when an I/O scheduler is applied.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • The block flush code needs this function without running the queue, so
    add a parameter controlling whether we run it or not.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • It is enough to just check if we can get the budget via .get_budget().
    And we don't need to deal with device state change in .get_budget().

    For SCSI, one issue to be fixed is that we have to call
    scsi_mq_uninit_cmd() to free allocated resources if the SCSI device
    fails to handle the request. And it isn't enough to simply call
    blk_mq_end_request() to do that if this request is marked as
    RQF_DONTPREP.

    Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for blk-mq")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

01 Nov, 2017

2 commits

  • SCSI devices use a host-wide tagset, and the shared driver tag space is
    often quite big. However, there is also a queue depth for each LUN
    (.cmd_per_lun), which is often small; for example, on both lpfc and
    qla2xxx, .cmd_per_lun is just 3.

    So lots of requests may stay in the sw queue, and we always flush all
    of those belonging to the same hw queue and dispatch them all to the
    driver. Unfortunately this easily makes the queue busy because of the
    small .cmd_per_lun. Once these requests are flushed out, they have to
    stay in hctx->dispatch, no bio merge can happen on them, and sequential
    IO performance is harmed.

    This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request
    from a sw queue, so that we can dispatch requests in the scheduler's
    way. We can then avoid dequeueing too many requests from the sw queue,
    since we don't flush ->dispatch completely.

    This patch improves dispatching from sw queue by using the .get_budget
    and .put_budget callbacks.

    Reviewed-by: Omar Sandoval
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • For SCSI devices, there is often a per-request-queue depth, which needs
    to be respected before queuing one request.

    Currently blk-mq always dequeues the request first, then calls
    .queue_rq() to dispatch the request to the LLD. One obvious issue with this
    approach is that I/O merging may not be successful, because when the
    per-request-queue depth can't be respected, .queue_rq() has to return
    BLK_STS_RESOURCE, and then this request has to stay in hctx->dispatch
    list. This means it never gets a chance to be merged with other IO.

    This patch introduces .get_budget and .put_budget callbacks in
    blk_mq_ops, so that we can try to get a budget before dequeuing a
    request. If the budget for queueing I/O can't be satisfied, we don't
    need to dequeue the request at all. Hence the request can be left in
    the IO scheduler queue, for more merging opportunities.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
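
    The resulting dispatch loop looks roughly like this (illustrative C;
    get_budget/put_budget/next_request/queue_rq are stand-ins for the new
    blk_mq_ops callbacks and the driver dispatch, and error handling is
    heavily simplified):

      #include <stdbool.h>
      #include <stddef.h>

      struct request_queue;
      struct request;

      bool get_budget(struct request_queue *q);                /* e.g. reserve a .cmd_per_lun slot */
      void put_budget(struct request_queue *q);
      struct request *next_request(struct request_queue *q);   /* from scheduler/sw queue */
      bool queue_rq(struct request_queue *q, struct request *rq);

      /* Only dequeue a request once we know the device can take it; otherwise the
       * request stays in the IO scheduler queue where it can still be merged. */
      void dispatch(struct request_queue *q)
      {
              for (;;) {
                      if (!get_budget(q))
                              break;                       /* device busy: leave IO queued  */
                      struct request *rq = next_request(q);
                      if (!rq) {
                              put_budget(q);               /* nothing to send: give it back */
                              break;
                      }
                      if (!queue_rq(q, rq))
                              break;                       /* rejected: real code requeues rq
                                                              and sorts out the budget */
              }
      }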
     

12 Sep, 2017

1 commit

  • A NULL pointer crash was reported for the case of having the BFQ IO
    scheduler attached to the underlying blk-mq paths of a DM multipath
    device. The crash occurred in blk_mq_sched_insert_request()'s call to
    e->type->ops.mq.insert_requests().

    Paolo Valente correctly summarized why the crash occurred with:
    "the call chain (dm_mq_queue_rq -> map_request -> setup_clone ->
    blk_rq_prep_clone) creates a cloned request without invoking
    e->type->ops.mq.prepare_request for the target elevator e. The cloned
    request is therefore not initialized for the scheduler, but it is
    however inserted into the scheduler by blk_mq_sched_insert_request."

    All said, a request-based DM multipath device's IO scheduler should be
    the only one used -- when the original requests are issued to the
    underlying paths as cloned requests they are inserted directly in the
    underlying dispatch queue(s) rather than through an additional elevator.

    But commit bd166ef18 ("blk-mq-sched: add framework for MQ capable IO
    schedulers") switched blk_insert_cloned_request() from using
    blk_mq_insert_request() to blk_mq_sched_insert_request(). Which
    incorrectly added elevator machinery into a call chain that isn't
    supposed to have any.

    To fix this introduce a blk-mq private blk_mq_request_bypass_insert()
    that blk_insert_cloned_request() calls to insert the request without
    involving any elevator that may be attached to the cloned request's
    request_queue.

    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Cc: stable@vger.kernel.org
    Reported-by: Bart Van Assche
    Tested-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Jens Axboe
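
    Conceptually, the new helper just appends the cloned request to the hctx
    dispatch list under its lock and never touches the elevator; a
    user-space sketch with a tiny singly linked list (all names invented):

      #include <pthread.h>
      #include <stddef.h>

      struct request { struct request *next; };

      struct hw_ctx {
              pthread_mutex_t lock;            /* stand-in for the hctx spinlock */
              struct request *dispatch;        /* stand-in for hctx->dispatch    */
              struct request **dispatch_tail;  /* starts as &dispatch            */
      };

      /*
       * Append the request straight onto the dispatch list: no elevator callbacks
       * run, so a cloned request that was never prepared for a scheduler can be
       * inserted safely.
       */
      void bypass_insert(struct hw_ctx *hctx, struct request *rq)
      {
              rq->next = NULL;
              pthread_mutex_lock(&hctx->lock);
              *hctx->dispatch_tail = rq;
              hctx->dispatch_tail = &rq->next;
              pthread_mutex_unlock(&hctx->lock);
      }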
     

10 Aug, 2017

1 commit

  • We don't have to inc/dec some counter, since we can just
    iterate the tags. That makes inc/dec a noop, but means we
    have to iterate busy tags to get an in-flight count.

    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
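
    In effect the per-queue counter is replaced by a walk over the busy
    tags; a simplified model (the bitmap and per-tag direction array stand
    in for the real tag-set iteration):

      #include <stdbool.h>

      #define NR_TAGS 256

      struct tags {
              bool busy[NR_TAGS];          /* which driver tags are in use       */
              int  dir[NR_TAGS];           /* 0 = read, 1 = write, for busy tags */
      };

      /* Count in-flight requests per direction by iterating busy tags, instead of
       * atomically bumping a counter on every request start and completion. */
      static void inflight_rw(const struct tags *t, unsigned int inflight[2])
      {
              inflight[0] = inflight[1] = 0;
              for (int i = 0; i < NR_TAGS; i++)
                      if (t->busy[i])
                              inflight[t->dir[i]]++;
      }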
     

04 Jul, 2017

1 commit

  • Pull irq updates from Thomas Gleixner:
    "The irq department delivers:

    - Expand the generic infrastructure handling the irq migration on CPU
    hotplug and convert X86 over to it. (Thomas Gleixner)

    Aside of consolidating code this is a preparatory change for:

    - Finalizing the affinity management for multi-queue devices. The
    main change here is to shut down interrupts which are affine to an
    outgoing CPU and re-enable them when the CPU comes online again.
    That avoids moving interrupts pointlessly around and breaking and
    reestablishing affinities for no value. (Christoph Hellwig)

    Note: This contains also the BLOCK-MQ and NVME changes which depend
    on the rework of the irq core infrastructure. Jens acked them and
    agreed that they should go with the irq changes.

    - Consolidation of irq domain code (Marc Zyngier)

    - State tracking consolidation in the core code (Jeffy Chen)

    - Add debug infrastructure for hierarchical irq domains (Thomas
    Gleixner)

    - Infrastructure enhancement for managing generic interrupt chips via
    devmem (Bartosz Golaszewski)

    - Constification work all over the place (Tobias Klauser)

    - Two new interrupt controller drivers for MVEBU (Thomas Petazzoni)

    - The usual set of fixes, updates and enhancements all over the
    place"

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
    irqchip/or1k-pic: Fix interrupt acknowledgement
    irqchip/irq-mvebu-gicp: Allocate enough memory for spi_bitmap
    irqchip/gic-v3: Fix out-of-bound access in gic_set_affinity
    nvme: Allocate queues for all possible CPUs
    blk-mq: Create hctx for each present CPU
    blk-mq: Include all present CPUs in the default queue mapping
    genirq: Avoid unnecessary low level irq function calls
    genirq: Set irq masked state when initializing irq_desc
    genirq/timings: Add infrastructure for estimating the next interrupt arrival time
    genirq/timings: Add infrastructure to track the interrupt timings
    genirq/debugfs: Remove pointless NULL pointer check
    irqchip/gic-v3-its: Don't assume GICv3 hardware supports 16bit INTID
    irqchip/gic-v3-its: Add ACPI NUMA node mapping
    irqchip/gic-v3-its-platform-msi: Make of_device_ids const
    irqchip/gic-v3-its: Make of_device_ids const
    irqchip/irq-mvebu-icu: Add new driver for Marvell ICU
    irqchip/irq-mvebu-gicp: Add new driver for Marvell GICP
    dt-bindings/interrupt-controller: Add DT binding for the Marvell ICU
    genirq/irqdomain: Remove auto-recursive hierarchy support
    irqchip/MSI: Use irq_domain_update_bus_token instead of an open coded access
    ...

    Linus Torvalds
     

29 Jun, 2017

1 commit

  • Currently we only create hctx for online CPUs, which can lead to a lot
    of churn due to frequent soft offline / online operations. Instead
    allocate one for each present CPU to avoid this and dramatically simplify
    the code.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Cc: Keith Busch
    Cc: linux-block@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Christoph Hellwig
     

08 Apr, 2017

1 commit

  • We've added a considerable amount of fixes for stalls and issues
    with the blk-mq scheduling in the 4.11 series since forking
    off the for-4.12/block branch. We need to do improvements on
    top of that for 4.12, so pull in the previous fixes to make
    our lives easier going forward.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

07 Apr, 2017

1 commit

  • While dispatching requests, if we fail to get a driver tag, we mark the
    hardware queue as waiting for a tag and put the requests on a
    hctx->dispatch list to be run later when a driver tag is freed. However,
    blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
    queues if using a single-queue scheduler with a multiqueue device. If
    blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
    are processing. This means we end up using the hardware queue of the
    previous request, which may or may not be the same as that of the
    current request. If it isn't, the wrong hardware queue will end up
    waiting for a tag, and the requests will be on the wrong dispatch list,
    leading to a hang.

    The fix is twofold:

    1. Make sure we save which hardware queue we were trying to get a
    request for in blk_mq_get_driver_tag() regardless of whether it
    succeeds or not.
    2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
    blk_mq_hw_queue to make it clear that it must handle multiple
    hardware queues, since I've already messed this up on a couple of
    occasions.

    This didn't appear in testing with nvme and mq-deadline because nvme has
    more driver tags than the default number of scheduler tags. However,
    with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

22 Mar, 2017

1 commit

  • Currently, statistics are gathered in ~0.13s windows, and users grab the
    statistics whenever they need them. This is not ideal for both in-tree
    users:

    1. Writeback throttling wants its own dynamically sized window of
    statistics. Since the blk-stats statistics are reset after every
    window and the wbt windows don't line up with the blk-stats windows,
    wbt doesn't see every I/O.
    2. Polling currently grabs the statistics on every I/O. Again, depending
    on how the window lines up, we may miss some I/Os. It's also
    unnecessary overhead to get the statistics on every I/O; the hybrid
    polling heuristic would be just as happy with the statistics from the
    previous full window.

    This reworks the blk-stats infrastructure to be callback-based: users
    register a callback that they want called at a given time with all of
    the statistics from the window during which the callback was active.
    Users can dynamically bucketize the statistics. wbt and polling both
    currently use read vs. write, but polling can be extended to further
    subdivide based on request size.

    The callbacks are kept on an RCU list, and each callback has percpu
    stats buffers. There will only be a few users, so the overhead on the
    I/O completion side is low. The stats flushing is also simplified
    considerably: since the timer function is responsible for clearing the
    statistics, we don't have to worry about stale statistics.

    wbt is a trivial conversion. After the conversion, the windowing problem
    mentioned above is fixed.

    For polling, we register an extra callback that caches the previous
    window's statistics in the struct request_queue for the hybrid polling
    heuristic to use.

    Since we no longer have a single stats buffer for the request queue,
    this also removes the sysfs and debugfs stats entries. To replace those,
    we add a debugfs entry for the poll statistics.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
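
    A compact, single-threaded model of the callback idea (invented names;
    the real code keeps callbacks on an RCU list with percpu buffers): each
    user registers a bucketing function plus a callback, samples are
    accumulated per bucket, and the window timer hands the buckets to the
    callback and resets them.

      #include <stdint.h>

      #define NR_BUCKETS 2                                   /* e.g. read vs. write */

      struct stat { uint64_t nr; uint64_t total_ns; };

      struct stat_cb {
              int  (*bucket_fn)(int dir, uint32_t bytes);    /* which bucket a sample goes to  */
              void (*report_fn)(struct stat_cb *cb);         /* called when the window expires */
              struct stat stat[NR_BUCKETS];
      };

      /* completion path: account one sample into the owner's current window */
      static void stat_add(struct stat_cb *cb, int dir, uint32_t bytes, uint64_t lat_ns)
      {
              struct stat *s = &cb->stat[cb->bucket_fn(dir, bytes)];
              s->nr++;
              s->total_ns += lat_ns;
      }

      /* window timer: report and clear, so stale data never leaks into the next window */
      static void stat_window_expired(struct stat_cb *cb)
      {
              cb->report_fn(cb);
              for (int i = 0; i < NR_BUCKETS; i++)
                      cb->stat[i] = (struct stat){ 0, 0 };
      }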
     

09 Mar, 2017

2 commits

  • Currently from kobject view, both q->mq_kobj and ctx->kobj can
    be released during one cycle of blk_mq_register_dev() and
    blk_mq_unregister_dev(). Actually, the sw queue's lifetime is the
    same as its request queue's, which is covered by request_queue->kobj.

    So we don't need to call kobject_put() for the two kinds of
    kobject in __blk_mq_unregister_dev(); instead we do that
    in the release handler of the request queue.

    Signed-off-by: Ming Lei
    Tested-by: Peter Zijlstra (Intel)
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Both q->mq_kobj and the sw queues' kobjects should be initialized only
    once, instead of in each add_disk context.

    Also this patch removes the clearing of ctx in blk_mq_init_cpu_queues(),
    because the percpu allocator fills allocated variables with zeros.

    This patch fixes one issue[1] reported by Omar.

    [1] kernel warning when doing unbind/bind on one scsi-mq device

    [ 19.347924] kobject (ffff8800791ea0b8): tried to init an initialized object, something is seriously wrong.
    [ 19.349781] CPU: 1 PID: 84 Comm: kworker/u8:1 Not tainted 4.10.0-rc7-00210-g53f39eeaa263 #34
    [ 19.350686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014
    [ 19.350920] Workqueue: events_unbound async_run_entry_fn
    [ 19.350920] Call Trace:
    [ 19.350920] dump_stack+0x63/0x83
    [ 19.350920] kobject_init+0x77/0x90
    [ 19.350920] blk_mq_register_dev+0x40/0x130
    [ 19.350920] blk_register_queue+0xb6/0x190
    [ 19.350920] device_add_disk+0x1ec/0x4b0
    [ 19.350920] sd_probe_async+0x10d/0x1c0 [sd_mod]
    [ 19.350920] async_run_entry_fn+0x48/0x150
    [ 19.350920] process_one_work+0x1d0/0x480
    [ 19.350920] worker_thread+0x48/0x4e0
    [ 19.350920] kthread+0x101/0x140
    [ 19.350920] ? process_one_work+0x480/0x480
    [ 19.350920] ? kthread_create_on_node+0x60/0x60
    [ 19.350920] ret_from_fork+0x2c/0x40

    Cc: Omar Sandoval
    Signed-off-by: Ming Lei
    Tested-by: Peter Zijlstra (Intel)
    Signed-off-by: Jens Axboe

    Ming Lei
     

28 Jan, 2017

1 commit

  • This fixes a couple of problems:

    1. In the !CONFIG_DEBUG_FS case, the stub definitions were bogus.
    2. In the !CONFIG_BLOCK case, blk-mq-debugfs.c shouldn't be compiled at
    all.

    Fix the stub definitions and add a CONFIG_BLK_DEBUG_FS Kconfig option.

    Fixes: 07e4fead45e6 ("blk-mq: create debugfs directory tree")
    Signed-off-by: Omar Sandoval

    Augment Kconfig description.

    Signed-off-by: Jens Axboe

    Omar Sandoval
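
    The usual shape of such stubs, for reference (a hedged sketch; the
    function names and signatures are illustrative, not the exact
    blk-mq-debugfs prototypes):

      struct request_queue;

      #ifdef CONFIG_BLK_DEBUG_FS
      int  mq_debugfs_register(struct request_queue *q);
      void mq_debugfs_unregister(struct request_queue *q);
      #else
      /* !CONFIG_BLK_DEBUG_FS builds get harmless no-op inlines instead of bogus
       * or missing definitions, and blk-mq-debugfs.c isn't compiled at all. */
      static inline int mq_debugfs_register(struct request_queue *q)
      {
              return 0;
      }
      static inline void mq_debugfs_unregister(struct request_queue *q)
      {
      }
      #endif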