30 Jan, 2015

1 commit

  • The kobject memory inside blk-mq hctx/ctx shouldn't be freed
    before the kobject is released, because the driver core can access
    it freely until its release.

    We can't do that in the ctx/hctx/mq_kobj release handlers because
    they can be run before blk_cleanup_queue().

    Given that mq_kobj shouldn't have been introduced, this patch simply
    moves mq's release into blk_release_queue().
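
    A minimal C sketch of the lifetime rule at play, with all toy_*
    names invented for illustration (this is not the kernel's kobject
    API): memory embedding a refcounted object may only be freed once
    the last reference is dropped and the release callback has run.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy refcounted object; 'released' stands in for the release callback
 * having run, which is the only point at which freeing is safe. */
struct toy_kobj {
    int refcount;
    int *released;               /* set when the release callback runs */
};

/* Returns true only on the final put, when the caller may free the
 * backing memory. Freeing any earlier is a use-after-free, since
 * other code (here: the remaining reference holders) can still touch
 * the object. */
static bool toy_kobj_put(struct toy_kobj *k)
{
    if (--k->refcount == 0) {
        *k->released = 1;        /* release callback would run here */
        return true;
    }
    return false;
}
```

    The bug class fixed above is exactly the case where the embedding
    structure's memory is freed while the refcount is still nonzero.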

    Reported-by: Sasha Levin
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

01 Jan, 2015

1 commit

  • If the queue is dying, we can't expect new requests to complete and
    come in to wake up other tasks waiting for requests. So after we
    have marked it as dying, wake up everybody currently waiting
    for a request. Once they wake, they will retry their allocation
    and fail appropriately due to the state of the queue.
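
    A single-threaded C sketch of the idea, with all toy_* names
    invented for illustration: allocation on a dying queue fails
    immediately, and marking the queue dying "wakes" every recorded
    waiter so its retried allocation can fail cleanly.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Illustrative queue state: 'dying' is checked on every allocation. */
struct toy_queue {
    bool dying;
    int free_tags;
    int waiters;                 /* tasks blocked waiting for a tag */
};

/* Returns 0 on success, -EAGAIN if the caller would block,
 * -ENODEV once the queue has been marked dying. */
static int toy_alloc_request(struct toy_queue *q)
{
    if (q->dying)
        return -ENODEV;          /* fail instead of sleeping forever */
    if (q->free_tags == 0) {
        q->waiters++;            /* real code would sleep here */
        return -EAGAIN;
    }
    q->free_tags--;
    return 0;
}

/* Marking the queue dying wakes everybody currently waiting, so their
 * retried allocation fails with -ENODEV instead of hanging.
 * Returns the number of waiters woken. */
static int toy_mark_dying(struct toy_queue *q)
{
    int woken = q->waiters;
    q->dying = true;
    q->waiters = 0;              /* stands in for wake_up_all() */
    return woken;
}
```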

    Tested-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
     

09 Dec, 2014

1 commit


26 Sep, 2014

2 commits

  • These two temporary functions are introduced to hold flush
    initialization and de-initialization, so that we can
    introduce the 'flush queue' more easily in the following patch. And
    once the 'flush queue' and its allocation/free functions are ready,
    they will be removed for the sake of code readability.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • It is reasonable to allocate the flush request in blk_mq_init_flush().

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

23 Sep, 2014

1 commit


02 Jul, 2014

1 commit

  • blk_mq freezing is entangled with generic bypassing which bypasses
    blkcg and io scheduler and lets IO requests fall through the block
    layer to the drivers in FIFO order. This allows forward progress on
    IOs with the advanced features disabled so that those features can be
    configured or altered without worrying about stalling IO which may
    lead to deadlock through memory allocation.

    However, generic bypassing doesn't quite fit blk-mq. blk-mq currently
    doesn't make use of blkcg or ioscheds, and it maps bypassing to
    freezing, which blocks request processing and drains all the in-flight
    ones. This causes problems as bypassing assumes that request
    processing is online. blk-mq works around this by conditionally
    allowing request processing for the problem case - during queue
    initialization.

    Another weirdity is that except for during queue cleanup, bypassing
    started on the generic side prevents blk-mq from processing new
    requests but doesn't drain the in-flight ones. This shouldn't break
    anything but again highlights that something isn't quite right here.

    The root cause is conflating blk-mq freezing and generic bypassing
    which are two different mechanisms. The only intersecting purpose
    that they serve is during queue cleanup. Let's properly separate
    blk-mq freezing from generic bypassing and simply use it where
    necessary.

    * request_queue->mq_freeze_depth is added and
    blk_mq_[un]freeze_queue() now operate on this counter instead of
    ->bypass_depth. The replacement for QUEUE_FLAG_BYPASS isn't added
    but the counter is tested directly. This will be further updated by
    later changes.

    * blk_mq_drain_queue() is dropped and the "__" prefix is dropped from
    blk_mq_freeze_queue(). The queue cleanup path now calls
    blk_mq_freeze_queue() directly.

    * blk_queue_enter()'s fast path condition is simplified to simply
    check @q->mq_freeze_depth. Previously, the condition was

    !blk_queue_dying(q) &&
    (!blk_queue_bypass(q) || !blk_queue_init_done(q))

    mq_freeze_depth is incremented right after dying is set, and the
    blk_queue_init_done() exception isn't necessary as blk-mq doesn't
    start frozen, which only leaves the blk_queue_bypass() test, which
    can be replaced by the @q->mq_freeze_depth test.

    This change simplifies the code and reduces confusion in the area.
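
    A minimal C sketch of the counter semantics, using invented toy_*
    names rather than the real blk-mq API: freezing nests via a depth
    counter, and the fast-path entry check reduces to testing that
    counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative counter modeled after ->mq_freeze_depth. */
struct toy_mq {
    int freeze_depth;
    bool dying;
};

/* Freezing nests: each freeze bumps the depth (the real code would
 * also drain in-flight requests on the first freeze). */
static void toy_freeze_queue(struct toy_mq *q)
{
    q->freeze_depth++;
}

static void toy_unfreeze_queue(struct toy_mq *q)
{
    q->freeze_depth--;
}

/* Fast-path check reduced to one test: since dying implies the depth
 * was already incremented, testing the depth alone suffices. */
static bool toy_queue_enter_ok(const struct toy_mq *q)
{
    return q->freeze_depth == 0;
}
```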

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Nicholas A. Bellinger
    Signed-off-by: Jens Axboe

    Tejun Heo
     

04 Jun, 2014

2 commits


30 May, 2014

1 commit

  • Currently blk-mq registers all the hardware queues in sysfs,
    regardless of whether it uses them (e.g. whether they have CPU
    mappings) or not. The unused hardware queues lack the cpuX/
    directories, and their other sysfs entries (like active, pending,
    etc.) are all zeroes.

    Change this so that sysfs correctly reflects the current mappings
    of the hardware queues.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

28 May, 2014

1 commit


22 May, 2014

1 commit


21 May, 2014

1 commit

  • For request_fn based devices, the block layer exports a 'nr_requests'
    file through sysfs to allow adjusting the queue depth on the fly.
    Currently this returns -EINVAL for blk-mq, since it's not wired up.
    Wire this up for blk-mq, so that it now also allows dynamic
    adjustment of the allowed queue depth for any given block device
    managed by blk-mq.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

20 May, 2014

1 commit


09 May, 2014

1 commit

  • blk-mq currently uses percpu_ida for tag allocation. But that only
    works well if the ratio between tag space and number of CPUs is
    sufficiently high. For most devices and systems, that is not the
    case. The end result is that we either only utilize the tag space
    partially, or we end up attempting to fully exhaust it and run
    into lots of lock contention, with stealing between CPUs. This is
    not optimal.

    This new tagging scheme is a hybrid bitmap allocator. It uses
    two tricks to both be SMP friendly and allow full exhaustion
    of the space:

    1) We cache the last allocated (or freed) tag on a per blk-mq
    software context basis. This allows us to limit the space
    we have to search. The key element here is not caching it
    in the shared tag structure, otherwise we end up dirtying
    more shared cache lines on each allocate/free operation.

    2) The tag space is split into cache line sized groups, and
    each context will start off randomly in that space. Even up
    to full utilization of the space, this divides the tag users
    efficiently into cache line groups, avoiding dirtying the same
    one both between allocators and between allocator and freer.

    This scheme shows drastically better behaviour, both on small
    tag spaces and on large ones as well. It has been tested extensively
    to show better performance for all the cases blk-mq cares about.
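
    A toy C sketch of the two tricks, with invented toy_* names (the
    real implementation uses an actual bitmap with atomic operations):
    a shared map of tags plus a per-context hint that keeps each
    context searching near its own region and makes freed tags
    cache hot.

```c
#include <assert.h>
#include <string.h>

#define TOY_TAGS 64

/* Shared tag map: one byte per tag stands in for the real bitmap. */
struct toy_tagmap {
    unsigned char used[TOY_TAGS];
};

/* Per software context: caches the last allocated/freed tag so the
 * hint is NOT in the shared structure, avoiding dirtying shared
 * cache lines on every allocate/free. */
struct toy_ctx {
    unsigned int last_tag;
};

/* Scan from the context's hint, wrapping once: limits the search in
 * the common case but still allows full exhaustion of the space. */
static int toy_get_tag(struct toy_tagmap *map, struct toy_ctx *ctx)
{
    for (unsigned int i = 0; i < TOY_TAGS; i++) {
        unsigned int tag = (ctx->last_tag + i) % TOY_TAGS;
        if (!map->used[tag]) {
            map->used[tag] = 1;
            ctx->last_tag = tag;    /* per-context cache, not shared */
            return (int)tag;
        }
    }
    return -1;                      /* space fully exhausted */
}

static void toy_put_tag(struct toy_tagmap *map, struct toy_ctx *ctx,
                        unsigned int tag)
{
    map->used[tag] = 0;
    ctx->last_tag = tag;            /* freed tag is cache hot: reuse next */
}
```

    Starting different contexts at different offsets (the random start
    in point 2 above) is what keeps allocators out of each other's
    cache line groups even as utilization grows.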

    Signed-off-by: Jens Axboe

    Jens Axboe
     

25 Apr, 2014

1 commit

  • The blk-mq code is using its own version of the I/O completion affinity
    tunables, which causes a few issues:

    - the rq_affinity sysfs file doesn't work for blk-mq devices, even
    though it is still present, thus breaking existing tuning setups.
    - the rq_affinity = 1 mode, which is the default for legacy request
    based drivers, isn't implemented at all.
    - blk-mq drivers don't implement any completion affinity with the default
    flag settings.

    This patch removes the blk-mq ipi_redirect flag and sysfs file, as well
    as the internal BLK_MQ_F_SHOULD_IPI flag, and replaces them with code
    that respects the queue-wide rq_affinity flags and also implements the
    rq_affinity = 1 mode.

    This means I/O completion affinity can now only be tuned block-queue wide
    instead of per context, which seems more sensible to me anyway.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

24 Apr, 2014

1 commit

  • If a requeue event races with a timeout, we can get into a
    situation where we attempt to complete a request from the
    timeout handler when it's not started anymore. This causes a crash.
    So have the timeout handler check that REQ_ATOM_STARTED is still
    set on the request - if not, we ignore the event. If this happens,
    the request has now been marked as complete. As a consequence, we
    need to make sure to clear REQ_ATOM_COMPLETE in blk_mq_start_request(),
    to maintain proper request state.
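
    A minimal C sketch of the state handling, using invented toy_*
    names in place of the real REQ_ATOM_* infrastructure: the timeout
    handler bails out when the request is no longer started, and
    restarting a request clears the complete bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy request state bits mirroring the idea described above. */
enum { TOY_STARTED = 1 << 0, TOY_COMPLETE = 1 << 1 };

struct toy_rq {
    unsigned int atomic_flags;
};

/* Re-issuing a request must clear COMPLETE so state stays consistent
 * after an earlier ignored timeout marked the request complete. */
static void toy_start_request(struct toy_rq *rq)
{
    rq->atomic_flags |= TOY_STARTED;
    rq->atomic_flags &= ~TOY_COMPLETE;
}

/* A requeue racing with the timeout: the request is no longer started. */
static void toy_requeue_request(struct toy_rq *rq)
{
    rq->atomic_flags &= ~TOY_STARTED;
}

/* Returns true if the timeout was handled, false if it was ignored
 * because the request wasn't started anymore. */
static bool toy_timeout_handler(struct toy_rq *rq)
{
    if (!(rq->atomic_flags & TOY_STARTED))
        return false;               /* raced with requeue: ignore event */
    rq->atomic_flags |= TOY_COMPLETE;
    return true;
}
```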

    Signed-off-by: Jens Axboe

    Jens Axboe
     

16 Apr, 2014

3 commits

  • Add a new blk_mq_tag_set structure that gets set up before we initialize
    the queue. A single blk_mq_tag_set structure can be shared by multiple
    queues.

    Signed-off-by: Christoph Hellwig

    Modular export of blk_mq_{alloc,free}_tagset added by me.

    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Drivers shouldn't have to care about the block layer setting aside a
    request to implement the flush state machine. We already override the
    mq context and tag to make it more transparent, but so far we haven't
    dealt with the driver private data in the request. Make sure to
    override this as well, and while we're at it, add a proper helper
    sitting in blk-mq.c that implements the full impersonation.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Drivers can reach their private data easily using the blk_mq_rq_to_pdu
    helper and don't need req->special. By not initializing it, the code
    can be simplified nicely, and we also shave off a few more instructions
    from the I/O path.
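
    A toy C sketch of the layout behind this, with invented toy_* names
    (the real blk_mq_rq_to_pdu() helper is essentially the same pointer
    arithmetic on struct request): the driver's private data unit lives
    immediately after the request, so no req->special indirection is
    needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for struct request. */
struct toy_request {
    int tag;
};

/* Hypothetical driver-private command payload. */
struct toy_driver_cmd {
    int opcode;
};

/* The pdu sits right past the request: one pointer addition, no
 * pointer chase through a ->special field. */
static void *toy_rq_to_pdu(struct toy_request *rq)
{
    return rq + 1;
}

/* Request and pdu are allocated as one chunk, sized at init time
 * from the pdu size the driver declares. */
static struct toy_request *toy_alloc_rq(size_t pdu_size)
{
    return malloc(sizeof(struct toy_request) + pdu_size);
}
```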

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

21 Mar, 2014

2 commits


11 Feb, 2014

2 commits

  • Switch to using a preallocated flush_rq for blk-mq, similar to what's
    done with the old request path. This allows us to set up the request
    properly with a tag from the actually allowed range and ->rq_disk as
    needed by some drivers. To make life easier, we also switch to dynamic
    allocation of ->flush_rq for the old path.

    This effectively reverts most of

    "blk-mq: fix for flush deadlock"

    and

    "blk-mq: Don't reserve a tag for flush request"

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Rework I/O completions to work more like the old code path. blk_mq_end_io
    now stays out of the business of deferring completions to other CPUs
    and calling blk_mark_rq_complete. The latter is very important to allow
    completing requests that have timed out and thus are already marked
    complete; the former allows using the IPI callout even for driver
    specific completions instead of having to reimplement them.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

09 Jan, 2014

1 commit

  • __smp_call_function_single already avoids multiple IPIs by internally
    queueing up the items, and is now also available for non-SMP builds as
    a trivially correct stub, so there is no need to wrap it. If the
    additional lock roundtrip causes problems, my patch converting the
    generic IPI code to llists, which is waiting to get merged, will fix it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

01 Jan, 2014

2 commits


25 Oct, 2013

1 commit

  • Linux currently has two models for block devices:

    - The classic request_fn based approach, where drivers use struct
    request units for IO. The block layer provides various helper
    functionalities to let drivers share code, things like tag
    management, timeout handling, queueing, etc.

    - The "stacked" approach, where a driver squeezes in between the
    block layer and the IO submitter. Since this bypasses the IO stack,
    drivers generally have to manage everything themselves.

    With drivers being written for new high IOPS devices, the classic
    request_fn based driver doesn't work well enough. The design dates
    back to when both SMP and high IOPS were rare. It has problems with
    scaling to bigger machines, and runs into scaling issues even on
    smaller machines when you have IOPS in the hundreds of thousands
    per device.

    The stacked approach is then most often selected as the model
    for the driver. But this means that everybody has to re-invent
    everything, and along with that we get back all the problems
    that the shared approach solved.

    This commit introduces blk-mq, block multi queue support. The
    design is centered around per-cpu queues for queueing IO, which
    then funnel down into x number of hardware submission queues.
    We might have a 1:1 mapping between the two, or it might be
    an N:M mapping. That all depends on what the hardware supports.
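
    A trivial C sketch of such a mapping, with invented toy_* names:
    N software (per-CPU) contexts funnel into M hardware submission
    queues via a static map a driver might build at init time.

```c
#include <assert.h>

/* Illustrative sizes: 8 per-CPU software contexts, 2 hardware queues,
 * giving an N:M (here 4:1) mapping. A 1:1 mapping is just the case
 * where the counts match. */
#define TOY_CPUS 8
#define TOY_HWQS 2

/* Map a CPU's software context to a hardware queue. Real drivers can
 * supply their own mapping depending on what the hardware supports. */
static unsigned int toy_map_queue(unsigned int cpu)
{
    return cpu % TOY_HWQS;
}
```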

    blk-mq provides various helper functions, which include:

    - Scalable support for request tagging. Most devices need to
    be able to uniquely identify a request both in the driver and
    to the hardware. The tagging uses per-cpu caches for freed
    tags, to enable cache hot reuse.

    - Timeout handling without tracking requests on a per-device
    basis. Basically, the driver should be able to get a notification
    if a request happens to fail.

    - Optional support for non 1:1 mappings between issue and
    submission queues. blk-mq can redirect IO completions to the
    desired location.

    - Support for per-request payloads. Drivers almost always need
    to associate a request structure with some driver private
    command structure. Drivers can tell blk-mq this at init time,
    and then any request handed to the driver will have the
    required size of memory associated with it.

    - Support for merging of IO, and plugging. The stacked model
    gets neither of these. Even for high IOPS devices, merging
    sequential IO reduces per-command overhead and thus
    increases bandwidth.

    For now, this is provided as a potential 3rd queueing model, with
    the hope being that, as it matures, it can replace both the classic
    and stacked model. That would get us back to having just 1 real
    model for block devices, leaving the stacked approach to dm/md
    devices (as it was originally intended).

    Contributions in this patch from the following people:

    Shaohua Li
    Alexander Gordeev
    Christoph Hellwig
    Mike Christie
    Matias Bjorling
    Jeff Moyer

    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe