30 Jan, 2015

1 commit

  • The kobject memory inside blk-mq hctx/ctx shouldn't be freed
    before the kobject is released, because the driver core can access
    it freely until its release.

    We can't do that in the ctx/hctx/mq_kobj release handlers because
    they can be run before blk_cleanup_queue().

    Given that mq_kobj shouldn't have been introduced, this patch simply
    moves mq's release into blk_release_queue().
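
    A minimal C sketch of the lifetime rule at play, with all toy_*
    names invented for illustration (this is not the kernel's kobject
    API): memory embedding a refcounted object may only be freed once
    the last reference is dropped and the release callback has run.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy refcounted object; 'released' stands in for the release callback
 * having run, which is the only point at which freeing is safe. */
struct toy_kobj {
    int refcount;
    int *released;               /* set when the release callback runs */
};

/* Returns true only on the final put, when the caller may free the
 * backing memory. Freeing any earlier is a use-after-free, since
 * other code (here: the remaining reference holders) can still touch
 * the object. */
static bool toy_kobj_put(struct toy_kobj *k)
{
    if (--k->refcount == 0) {
        *k->released = 1;        /* release callback would run here */
        return true;
    }
    return false;
}
```

    The bug class fixed above is exactly the case where the embedding
    structure's memory is freed while the refcount is still nonzero.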

    Reported-by: Sasha Levin
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

01 Jan, 2015

1 commit

  • If the queue is dying, we can't expect new requests to complete and
    come in to wake up other tasks waiting for requests. So after we
    have marked it as dying, wake up everybody currently waiting
    for a request. Once they wake, they will retry their allocation
    and fail appropriately due to the state of the queue.
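
    A single-threaded C sketch of the idea, with all toy_* names
    invented for illustration: allocation on a dying queue fails
    immediately, and marking the queue dying "wakes" every recorded
    waiter so its retried allocation can fail cleanly.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Illustrative queue state: 'dying' is checked on every allocation. */
struct toy_queue {
    bool dying;
    int free_tags;
    int waiters;                 /* tasks blocked waiting for a tag */
};

/* Returns 0 on success, -EAGAIN if the caller would block,
 * -ENODEV once the queue has been marked dying. */
static int toy_alloc_request(struct toy_queue *q)
{
    if (q->dying)
        return -ENODEV;          /* fail instead of sleeping forever */
    if (q->free_tags == 0) {
        q->waiters++;            /* real code would sleep here */
        return -EAGAIN;
    }
    q->free_tags--;
    return 0;
}

/* Marking the queue dying wakes everybody currently waiting, so their
 * retried allocation fails with -ENODEV instead of hanging.
 * Returns the number of waiters woken. */
static int toy_mark_dying(struct toy_queue *q)
{
    int woken = q->waiters;
    q->dying = true;
    q->waiters = 0;              /* stands in for wake_up_all() */
    return woken;
}
```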

    Tested-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
     

09 Dec, 2014

1 commit


26 Sep, 2014

2 commits

  • These two temporary functions are introduced to hold flush
    initialization and de-initialization, so that we can
    introduce the 'flush queue' more easily in the following patch. And
    once the 'flush queue' and its allocation/free functions are ready,
    they will be removed for the sake of code readability.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • It is reasonable to allocate the flush request in blk_mq_init_flush().

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

23 Sep, 2014

1 commit


02 Jul, 2014

1 commit

  • blk_mq freezing is entangled with generic bypassing which bypasses
    blkcg and io scheduler and lets IO requests fall through the block
    layer to the drivers in FIFO order. This allows forward progress on
    IOs with the advanced features disabled so that those features can be
    configured or altered without worrying about stalling IO which may
    lead to deadlock through memory allocation.

    However, generic bypassing doesn't quite fit blk-mq. blk-mq currently
    doesn't make use of blkcg or ioscheds, and it maps bypassing to
    freezing, which blocks request processing and drains all the in-flight
    ones. This causes problems as bypassing assumes that request
    processing is online. blk-mq works around this by conditionally
    allowing request processing for the problem case - during queue
    initialization.

    Another weirdity is that except for during queue cleanup, bypassing
    started on the generic side prevents blk-mq from processing new
    requests but doesn't drain the in-flight ones. This shouldn't break
    anything but again highlights that something isn't quite right here.

    The root cause is conflating blk-mq freezing and generic bypassing
    which are two different mechanisms. The only intersecting purpose
    that they serve is during queue cleanup. Let's properly separate
    blk-mq freezing from generic bypassing and simply use it where
    necessary.

    * request_queue->mq_freeze_depth is added and
    blk_mq_[un]freeze_queue() now operate on this counter instead of
    ->bypass_depth. The replacement for QUEUE_FLAG_BYPASS isn't added
    but the counter is tested directly. This will be further updated by
    later changes.

    * blk_mq_drain_queue() is dropped and the "__" prefix is dropped from
    blk_mq_freeze_queue(). The queue cleanup path now calls
    blk_mq_freeze_queue() directly.

    * blk_queue_enter()'s fast path condition is simplified to simply
    check @q->mq_freeze_depth. Previously, the condition was

    !blk_queue_dying(q) &&
    (!blk_queue_bypass(q) || !blk_queue_init_done(q))

    mq_freeze_depth is incremented right after dying is set, and the
    blk_queue_init_done() exception isn't necessary as blk-mq doesn't
    start frozen, which only leaves the blk_queue_bypass() test, which
    can be replaced by the @q->mq_freeze_depth test.

    This change simplifies the code and reduces confusion in the area.
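
    A minimal C sketch of the counter semantics, using invented toy_*
    names rather than the real blk-mq API: freezing nests via a depth
    counter, and the fast-path entry check reduces to testing that
    counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative counter modeled after ->mq_freeze_depth. */
struct toy_mq {
    int freeze_depth;
    bool dying;
};

/* Freezing nests: each freeze bumps the depth (the real code would
 * also drain in-flight requests on the first freeze). */
static void toy_freeze_queue(struct toy_mq *q)
{
    q->freeze_depth++;
}

static void toy_unfreeze_queue(struct toy_mq *q)
{
    q->freeze_depth--;
}

/* Fast-path check reduced to one test: since dying implies the depth
 * was already incremented, testing the depth alone suffices. */
static bool toy_queue_enter_ok(const struct toy_mq *q)
{
    return q->freeze_depth == 0;
}
```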

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Nicholas A. Bellinger
    Signed-off-by: Jens Axboe

    Tejun Heo
     

04 Jun, 2014

2 commits


30 May, 2014

1 commit

  • Currently blk-mq registers all the hardware queues in sysfs,
    regardless of whether it uses them (e.g. whether they have CPU
    mappings) or not. The unused hardware queues lack the cpuX/
    directories, and their other sysfs entries (like active, pending,
    etc.) are all zeroes.

    Change this so that sysfs correctly reflects the current mappings
    of the hardware queues.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

28 May, 2014

1 commit


22 May, 2014

1 commit


21 May, 2014

1 commit

  • For request_fn based devices, the block layer exports a 'nr_requests'
    file through sysfs to allow adjusting the queue depth on the fly.
    Currently this returns -EINVAL for blk-mq, since it's not wired up.
    Wire this up for blk-mq, so that it now also allows dynamic
    adjustment of the allowed queue depth for any given block device
    managed by blk-mq.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

20 May, 2014

1 commit


09 May, 2014

1 commit

  • blk-mq currently uses percpu_ida for tag allocation. But that only
    works well if the ratio between tag space and number of CPUs is
    sufficiently high. For most devices and systems, that is not the
    case. The end result is that we either only utilize the tag space
    partially, or we end up attempting to fully exhaust it and run
    into lots of lock contention, with stealing between CPUs. This is
    not optimal.

    This new tagging scheme is a hybrid bitmap allocator. It uses
    two tricks to both be SMP friendly and allow full exhaustion
    of the space:

    1) We cache the last allocated (or freed) tag on a per blk-mq
    software context basis. This allows us to limit the space
    we have to search. The key element here is not caching it
    in the shared tag structure, otherwise we end up dirtying
    more shared cache lines on each allocate/free operation.

    2) The tag space is split into cache line sized groups, and
    each context will start off randomly in that space. Even up
    to full utilization of the space, this divides the tag users
    efficiently into cache line groups, avoiding dirtying the same
    one both between allocators and between allocator and freer.

    This scheme shows drastically better behaviour, both on small
    tag spaces and on large ones as well. It has been tested extensively
    to show better performance for all the cases blk-mq cares about.
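
    A toy C sketch of the two tricks, with invented toy_* names (the
    real implementation uses an actual bitmap with atomic operations):
    a shared map of tags plus a per-context hint that keeps each
    context searching near its own region and makes freed tags
    cache hot.

```c
#include <assert.h>
#include <string.h>

#define TOY_TAGS 64

/* Shared tag map: one byte per tag stands in for the real bitmap. */
struct toy_tagmap {
    unsigned char used[TOY_TAGS];
};

/* Per software context: caches the last allocated/freed tag so the
 * hint is NOT in the shared structure, avoiding dirtying shared
 * cache lines on every allocate/free. */
struct toy_ctx {
    unsigned int last_tag;
};

/* Scan from the context's hint, wrapping once: limits the search in
 * the common case but still allows full exhaustion of the space. */
static int toy_get_tag(struct toy_tagmap *map, struct toy_ctx *ctx)
{
    for (unsigned int i = 0; i < TOY_TAGS; i++) {
        unsigned int tag = (ctx->last_tag + i) % TOY_TAGS;
        if (!map->used[tag]) {
            map->used[tag] = 1;
            ctx->last_tag = tag;    /* per-context cache, not shared */
            return (int)tag;
        }
    }
    return -1;                      /* space fully exhausted */
}

static void toy_put_tag(struct toy_tagmap *map, struct toy_ctx *ctx,
                        unsigned int tag)
{
    map->used[tag] = 0;
    ctx->last_tag = tag;            /* freed tag is cache hot: reuse next */
}
```

    Starting different contexts at different offsets (the random start
    in point 2 above) is what keeps allocators out of each other's
    cache line groups even as utilization grows.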

    Signed-off-by: Jens Axboe

    Jens Axboe
     

25 Apr, 2014

1 commit

  • The blk-mq code is using its own version of the I/O completion affinity
    tunables, which causes a few issues:

    - the rq_affinity sysfs file doesn't work for blk-mq devices, even
    though it is still present, thus breaking existing tuning setups.
    - the rq_affinity = 1 mode, which is the default for legacy request
    based drivers, isn't implemented at all.
    - blk-mq drivers don't implement any completion affinity with the default
    flag settings.

    This patch removes the blk-mq ipi_redirect flag and sysfs file, as well
    as the internal BLK_MQ_F_SHOULD_IPI flag, and replaces them with code
    that respects the queue-wide rq_affinity flags and also implements the
    rq_affinity = 1 mode.

    This means I/O completion affinity can now only be tuned block-queue wide
    instead of per context, which seems more sensible to me anyway.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

24 Apr, 2014

1 commit

  • If a requeue event races with a timeout, we can get into a
    situation where we attempt to complete a request from the
    timeout handler when it's not started anymore. This causes a crash.
    So have the timeout handler check that REQ_ATOM_STARTED is still
    set on the request - if not, we ignore the event. If this happens,
    the request has now been marked as complete. As a consequence, we
    need to make sure to clear REQ_ATOM_COMPLETE in blk_mq_start_request(),
    to maintain proper request state.
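
    A minimal C sketch of the state handling, using invented toy_*
    names in place of the real REQ_ATOM_* infrastructure: the timeout
    handler bails out when the request is no longer started, and
    restarting a request clears the complete bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy request state bits mirroring the idea described above. */
enum { TOY_STARTED = 1 << 0, TOY_COMPLETE = 1 << 1 };

struct toy_rq {
    unsigned int atomic_flags;
};

/* Re-issuing a request must clear COMPLETE so state stays consistent
 * after an earlier ignored timeout marked the request complete. */
static void toy_start_request(struct toy_rq *rq)
{
    rq->atomic_flags |= TOY_STARTED;
    rq->atomic_flags &= ~TOY_COMPLETE;
}

/* A requeue racing with the timeout: the request is no longer started. */
static void toy_requeue_request(struct toy_rq *rq)
{
    rq->atomic_flags &= ~TOY_STARTED;
}

/* Returns true if the timeout was handled, false if it was ignored
 * because the request wasn't started anymore. */
static bool toy_timeout_handler(struct toy_rq *rq)
{
    if (!(rq->atomic_flags & TOY_STARTED))
        return false;               /* raced with requeue: ignore event */
    rq->atomic_flags |= TOY_COMPLETE;
    return true;
}
```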

    Signed-off-by: Jens Axboe

    Jens Axboe
     

16 Apr, 2014

3 commits

  • Add a new blk_mq_tag_set structure that gets set up before we initialize
    the queue. A single blk_mq_tag_set structure can be shared by multiple
    queues.

    Signed-off-by: Christoph Hellwig

    Modular export of blk_mq_{alloc,free}_tagset added by me.

    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Drivers shouldn't have to care about the block layer setting aside a
    request to implement the flush state machine. We already override the
    mq context and tag to make it more transparent, but so far we haven't
    dealt with the driver private data in the request. Make sure to
    override this as well, and while we're at it, add a proper helper
    sitting in blk-mq.c that implements the full impersonation.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Drivers can reach their private data easily using the blk_mq_rq_to_pdu
    helper and don't need req->special. By not initializing it, the code
    can be simplified nicely, and we also shave off a few more instructions
    from the I/O path.
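
    A toy C sketch of the layout behind this, with invented toy_* names
    (the real blk_mq_rq_to_pdu() helper is essentially the same pointer
    arithmetic on struct request): the driver's private data unit lives
    immediately after the request, so no req->special indirection is
    needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for struct request. */
struct toy_request {
    int tag;
};

/* Hypothetical driver-private command payload. */
struct toy_driver_cmd {
    int opcode;
};

/* The pdu sits right past the request: one pointer addition, no
 * pointer chase through a ->special field. */
static void *toy_rq_to_pdu(struct toy_request *rq)
{
    return rq + 1;
}

/* Request and pdu are allocated as one chunk, sized at init time
 * from the pdu size the driver declares. */
static struct toy_request *toy_alloc_rq(size_t pdu_size)
{
    return malloc(sizeof(struct toy_request) + pdu_size);
}
```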

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

21 Mar, 2014

2 commits


11 Feb, 2014

2 commits

  • Switch to using a preallocated flush_rq for blk-mq, similar to what's
    done with the old request path. This allows us to set up the request
    properly with a tag from the actually allowed range and ->rq_disk as
    needed by some drivers. To make life easier, we also switch to dynamic
    allocation of ->flush_rq for the old path.

    This effectively reverts most of

    "blk-mq: fix for flush deadlock"

    and

    "blk-mq: Don't reserve a tag for flush request"

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Rework I/O completions to work more like the old code path. blk_mq_end_io
    now stays out of the business of deferring completions to other CPUs
    and calling blk_mark_rq_complete. The latter is very important to allow
    completing requests that have timed out and thus are already marked
    complete; the former allows using the IPI callout even for driver
    specific completions instead of having to reimplement them.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

09 Jan, 2014

1 commit

  • __smp_call_function_single already avoids multiple IPIs by internally
    queueing up the items, and is now also available for non-SMP builds as
    a trivially correct stub, so there is no need to wrap it. If the
    additional lock roundtrip causes problems, my patch converting the
    generic IPI code to llists, which is waiting to get merged, will fix it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

01 Jan, 2014

2 commits


25 Oct, 2013

1 commit

  • Linux currently has two models for block devices:

    - The classic request_fn based approach, where drivers use struct
    request units for IO. The block layer provides various helper
    functionalities to let drivers share code, things like tag
    management, timeout handling, queueing, etc.

    - The "stacked" approach, where a driver squeezes in between the
    block layer and the IO submitter. Since this bypasses the IO stack,
    drivers generally have to manage everything themselves.

    With drivers being written for new high IOPS devices, the classic
    request_fn based driver doesn't work well enough. The design dates
    back to when both SMP and high IOPS were rare. It has problems with
    scaling to bigger machines, and runs into scaling issues even on
    smaller machines when you have IOPS in the hundreds of thousands
    per device.

    The stacked approach is then most often selected as the model
    for the driver. But this means that everybody has to re-invent
    everything, and along with that we get back all the problems
    that the shared approach solved.

    This commit introduces blk-mq, block multi queue support. The
    design is centered around per-cpu queues for queueing IO, which
    then funnel down into x number of hardware submission queues.
    We might have a 1:1 mapping between the two, or it might be
    an N:M mapping. That all depends on what the hardware supports.
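
    A trivial C sketch of such a mapping, with invented toy_* names:
    N software (per-CPU) contexts funnel into M hardware submission
    queues via a static map a driver might build at init time.

```c
#include <assert.h>

/* Illustrative sizes: 8 per-CPU software contexts, 2 hardware queues,
 * giving an N:M (here 4:1) mapping. A 1:1 mapping is just the case
 * where the counts match. */
#define TOY_CPUS 8
#define TOY_HWQS 2

/* Map a CPU's software context to a hardware queue. Real drivers can
 * supply their own mapping depending on what the hardware supports. */
static unsigned int toy_map_queue(unsigned int cpu)
{
    return cpu % TOY_HWQS;
}
```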

    blk-mq provides various helper functions, which include:

    - Scalable support for request tagging. Most devices need to
    be able to uniquely identify a request both in the driver and
    to the hardware. The tagging uses per-cpu caches for freed
    tags, to enable cache hot reuse.

    - Timeout handling without tracking requests on a per-device
    basis. Basically, the driver should be able to get a notification
    if a request happens to fail.

    - Optional support for non 1:1 mappings between issue and
    submission queues. blk-mq can redirect IO completions to the
    desired location.

    - Support for per-request payloads. Drivers almost always need
    to associate a request structure with some driver private
    command structure. Drivers can tell blk-mq this at init time,
    and then any request handed to the driver will have the
    required size of memory associated with it.

    - Support for merging of IO, and plugging. The stacked model
    gets neither of these. Even for high IOPS devices, merging
    sequential IO reduces per-command overhead and thus
    increases bandwidth.

    For now, this is provided as a potential 3rd queueing model, with
    the hope being that, as it matures, it can replace both the classic
    and stacked model. That would get us back to having just 1 real
    model for block devices, leaving the stacked approach to dm/md
    devices (as it was originally intended).

    Contributions in this patch from the following people:

    Shaohua Li
    Alexander Gordeev
    Christoph Hellwig
    Mike Christie
    Matias Bjorling
    Jeff Moyer

    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe