10 Oct, 2016

2 commits

  • Pull blk-mq CPU hotplug update from Jens Axboe:
    "This is the conversion of blk-mq to the new hotplug state machine"

    * 'for-4.9/block-smp' of git://git.kernel.dk/linux-block:
    blk-mq: fixup "Convert to new hotplug state machine"
    blk-mq: Convert to new hotplug state machine
    blk-mq/cpu-notif: Convert to new hotplug state machine

    Linus Torvalds
     
  • Pull blk-mq irq/cpu mapping updates from Jens Axboe:
    "This is the block-irq topic branch for 4.9-rc. It's mostly from
    Christoph, and it allows drivers to specify their own mappings, and
    more importantly, to share the blk-mq mappings with the IRQ affinity
    mappings. It's a good step towards making this work better out of the
    box"

    * 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
    blk_mq: linux/blk-mq.h does not include all the headers it depends on
    blk-mq: kill unused blk_mq_create_mq_map()
    blk-mq: get rid of the cpumask in struct blk_mq_tags
    nvme: remove the post_scan callout
    nvme: switch to use pci_alloc_irq_vectors
    blk-mq: provide a default queue mapping for PCI device
    blk-mq: allow the driver to pass in a queue mapping
    blk-mq: remove ->map_queue
    blk-mq: only allocate a single mq_map per tag_set
    blk-mq: don't redistribute hardware queues on a CPU hotplug event

    Linus Torvalds
     

22 Sep, 2016

1 commit


17 Sep, 2016

2 commits

  • Allocating your own per-cpu allocation hint separately makes for an
    awkward API. Instead, allocate the per-cpu hint as part of the struct
    sbitmap_queue. There's no point for a struct sbitmap_queue without the
    cache, but you can still use a bare struct sbitmap.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • This is a generally useful data structure, so make it available to
    anyone else who might want to use it. It's also a nice cleanup
    separating the allocation logic from the rest of the tag handling logic.

    The code is behind a new Kconfig option, CONFIG_SBITMAP, which is only
    selected by CONFIG_BLOCK for now.

    This should be a complete noop functionality-wise.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

15 Sep, 2016

2 commits

  • This allows drivers to specify their own queue mapping by overriding the
    setup-time function that builds the mq_map. This can be used for
    example to build the map based on the MSI-X vector mapping provided
    by the core interrupt layer for PCI devices.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
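
    An illustrative user-space sketch, not taken from the commit: the tag
    set carries one CPU-to-hardware-queue map, and an optional driver
    callback builds it at setup time. All names here (toy_tag_set,
    toy_map_queues_fn, toy_driver_map) are hypothetical stand-ins, and the
    "driver" mapping simply assumes each vector covers a contiguous block
    of CPUs rather than reading real MSI-X affinity.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS 8

struct toy_tag_set;
typedef int (*toy_map_queues_fn)(struct toy_tag_set *set);

struct toy_tag_set {
    unsigned int nr_hw_queues;
    unsigned int mq_map[TOY_NR_CPUS];  /* CPU -> hardware queue index */
    toy_map_queues_fn map_queues;      /* optional driver override (hypothetical) */
};

/* Default mapping: spread CPUs round-robin over the hardware queues. */
static int toy_default_map(struct toy_tag_set *set)
{
    for (unsigned int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        set->mq_map[cpu] = cpu % set->nr_hw_queues;
    return 0;
}

/*
 * Hypothetical driver mapping: assume each interrupt vector is affine to
 * a contiguous block of CPUs, and map those CPUs to the matching queue.
 */
static int toy_driver_map(struct toy_tag_set *set)
{
    unsigned int per_queue =
        (TOY_NR_CPUS + set->nr_hw_queues - 1) / set->nr_hw_queues;

    for (unsigned int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        set->mq_map[cpu] = cpu / per_queue;
    return 0;
}

/* Setup-time entry point: use the driver's callback if it provided one. */
static int toy_build_map(struct toy_tag_set *set)
{
    assert(set->nr_hw_queues > 0 && set->nr_hw_queues <= TOY_NR_CPUS);
    if (set->map_queues)
        return set->map_queues(set);
    return toy_default_map(set);
}
```

    With two queues, the default map alternates CPUs between queues, while
    the sketched driver map sends CPUs 0-3 to queue 0 and CPUs 4-7 to
    queue 1.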
     
  • All drivers use the default, so provide an inline version of it. If we
    ever need other queue mapping we can add an optional method back,
    although supporting it will also require major changes to the queue setup
    code.

    This provides better code generation, and better debuggability as well.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

10 Feb, 2016

1 commit

  • The hardware's provided queue count may change at runtime with resource
    provisioning. This patch allows a block driver to alter the number of
    h/w queues available when its resource count changes.

    The main part is a new blk-mq API to request a new number of h/w queues
    for a given live tag set. The new API freezes all queues using that set,
    then adjusts the allocated count prior to remapping these to CPUs.

    The bulk of the rest just shifts where h/w contexts and all their
    artifacts are allocated and freed.

    The number of max h/w contexts is capped to the number of possible cpus
    since there is no use for more than that. As such, all pre-allocated
    memory for pointers needs to account for the max possible rather than
    the initial number of queues.

    A side effect of this is that blk-mq will proceed successfully as
    long as it can allocate at least one h/w context. Previously it would
    fail request queue initialization if fewer than the requested number
    were allocated.

    Signed-off-by: Keith Busch
    Reviewed-by: Christoph Hellwig
    Tested-by: Jon Derrick
    Signed-off-by: Jens Axboe

    Keith Busch
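
    An illustrative sketch of the sequence described above, not the kernel
    code: cap the requested count at the number of possible CPUs, freeze,
    adjust the count, remap, and unfreeze. The names (toy_set,
    toy_update_nr_hw_queues) are hypothetical, and "freezing" is reduced
    to a counter because this model has no in-flight I/O to drain.

```c
#include <assert.h>

#define TOY_POSSIBLE_CPUS 4

struct toy_set {
    unsigned int nr_hw_queues;
    unsigned int mq_map[TOY_POSSIBLE_CPUS];  /* CPU -> hardware queue index */
    int frozen;                              /* nonzero while queues are frozen */
};

/* Rebuild the CPU -> queue map for the current queue count. */
static void toy_remap(struct toy_set *set)
{
    assert(set->nr_hw_queues > 0);
    for (unsigned int cpu = 0; cpu < TOY_POSSIBLE_CPUS; cpu++)
        set->mq_map[cpu] = cpu % set->nr_hw_queues;
}

/* Returns the queue count actually in effect after the call. */
static unsigned int toy_update_nr_hw_queues(struct toy_set *set,
                                            unsigned int requested)
{
    /* More queues than possible CPUs would never be used. */
    if (requested > TOY_POSSIBLE_CPUS)
        requested = TOY_POSSIBLE_CPUS;
    if (requested == 0 || requested == set->nr_hw_queues)
        return set->nr_hw_queues;

    set->frozen++;                  /* stand-in for freezing all queues */
    set->nr_hw_queues = requested;
    toy_remap(set);
    set->frozen--;                  /* stand-in for unfreezing */
    return set->nr_hw_queues;
}
```

    In this model, asking for 16 queues on a 4-CPU system yields 4, and
    the map is rebuilt before the queues are released again.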
     

02 Dec, 2015

1 commit


12 Nov, 2015

1 commit


10 Oct, 2015

1 commit


30 Sep, 2015

1 commit

  • Notifier callbacks for the CPU_ONLINE action can run on a CPU other
    than the one that was just onlined. So it is possible for a
    process running on the just-onlined CPU to insert a request and run
    the hw queue before the new mapping is established by
    blk_mq_queue_reinit_notify().

    This can cause a problem when the CPU has just been onlined for the
    first time since the request queue was initialized. At this time
    ctx->index_hw for the CPU, which is the index in hctx->ctxs[] for this
    ctx, is still zero before blk_mq_queue_reinit_notify() is called by
    the notifier callbacks for the CPU_ONLINE action.

    For example, there is a single hw queue (hctx) and two CPU queues
    (ctx0 for CPU0, and ctx1 for CPU1). Now CPU1 is just onlined and
    a request is inserted into ctx1->rq_list, which sets bit0 in the
    pending bitmap because ctx1->index_hw is still zero.

    Then, while running the hw queue, flush_busy_ctxs() finds bit0 set
    in the pending bitmap and tries to retrieve requests from
    hctx->ctxs[0]->rq_list. But hctx->ctxs[0] is a pointer to ctx0, so the
    request in ctx1->rq_list is ignored.

    Fix it by ensuring that the new mapping is established before the
    onlined CPU starts running.

    Signed-off-by: Akinobu Mita
    Reviewed-by: Ming Lei
    Cc: Jens Axboe
    Cc: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Akinobu Mita
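
    An illustrative single-threaded model of the example above, not the
    kernel code. The toy_* names are hypothetical, and a counter stands in
    for ctx->rq_list. Inserting into a context whose index_hw is still
    zero sets the wrong pending bit, so the flush looks at ctx0 and
    misses the request.

```c
#include <assert.h>

struct toy_ctx {
    unsigned int index_hw;  /* this ctx's slot in hctx->ctxs[] */
    int queued;             /* stand-in for the length of rq_list */
};

struct toy_hctx {
    struct toy_ctx *ctxs[2];
    unsigned int nr_ctx;
    unsigned long pending;  /* bit i set => ctxs[i] has queued requests */
};

/* Establish the mapping: give ctx its slot in the hardware context. */
static void toy_map_ctx(struct toy_hctx *hctx, struct toy_ctx *ctx)
{
    assert(hctx->nr_ctx < 2);
    ctx->index_hw = hctx->nr_ctx;
    hctx->ctxs[hctx->nr_ctx++] = ctx;
}

/* Queue one request and mark the ctx pending via its index_hw. */
static void toy_insert(struct toy_hctx *hctx, struct toy_ctx *ctx)
{
    ctx->queued++;
    hctx->pending |= 1UL << ctx->index_hw;
}

/* Collect requests from every ctx whose pending bit is set. */
static int toy_flush_busy_ctxs(struct toy_hctx *hctx)
{
    int flushed = 0;

    for (unsigned int i = 0; i < hctx->nr_ctx; i++) {
        if (!(hctx->pending & (1UL << i)))
            continue;
        flushed += hctx->ctxs[i]->queued;
        hctx->ctxs[i]->queued = 0;
        hctx->pending &= ~(1UL << i);
    }
    return flushed;
}
```

    If toy_insert() runs on ctx1 before toy_map_ctx() has been called for
    it, the flush returns 0 and the request stays stranded in ctx1. Once
    the mapping is established first, the flush picks it up.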
     

30 Jan, 2015

1 commit

  • The kobject memory inside blk-mq hctx/ctx shouldn't have been freed
    before the kobject is released because driver core can access it freely
    before its release.

    We can't do that in all ctx/hctx/mq_kobj's release handler because
    it can be run before blk_cleanup_queue().

    Given mq_kobj shouldn't have been introduced, this patch simply moves
    mq's release into blk_release_queue().

    Reported-by: Sasha Levin
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

01 Jan, 2015

1 commit

  • If the queue is dying, we can't expect new requests to complete and
    come in and wake up other tasks waiting for requests. So after we
    have marked it as dying, wake up everybody currently waiting
    for a request. Once they wake, they will retry their allocation
    and fail appropriately due to the state of the queue.

    Tested-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
     

09 Dec, 2014

1 commit


26 Sep, 2014

2 commits

  • These two temporary functions are introduced for holding flush
    initialization and de-initialization, so that we can
    introduce 'flush queue' more easily in the following patch. And
    once 'flush queue' and its allocation/free functions are ready,
    they will be removed for the sake of code readability.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • It is reasonable to allocate flush req in blk_mq_init_flush().

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

23 Sep, 2014

1 commit


02 Jul, 2014

1 commit

  • blk_mq freezing is entangled with generic bypassing which bypasses
    blkcg and io scheduler and lets IO requests fall through the block
    layer to the drivers in FIFO order. This allows forward progress on
    IOs with the advanced features disabled so that those features can be
    configured or altered without worrying about stalling IO which may
    lead to deadlock through memory allocation.

    However, generic bypassing doesn't quite fit blk-mq. blk-mq currently
    doesn't make use of blkcg or ioscheds and it maps bypassing to
    freezing, which blocks request processing and drains all the in-flight
    ones. This causes problems as bypassing assumes that request
    processing is online. blk-mq works around this by conditionally
    allowing request processing for the problem case - during queue
    initialization.

    Another oddity is that except for during queue cleanup, bypassing
    started on the generic side prevents blk-mq from processing new
    requests but doesn't drain the in-flight ones. This shouldn't break
    anything but again highlights that something isn't quite right here.

    The root cause is conflating blk-mq freezing and generic bypassing
    which are two different mechanisms. The only intersecting purpose
    that they serve is during queue cleanup. Let's properly separate
    blk-mq freezing from generic bypassing and simply use it where
    necessary.

    * request_queue->mq_freeze_depth is added and
    blk_mq_[un]freeze_queue() now operate on this counter instead of
    ->bypass_depth. The replacement for QUEUE_FLAG_BYPASS isn't added
    but the counter is tested directly. This will be further updated by
    later changes.

    * blk_mq_drain_queue() is dropped and "__" prefix is dropped from
    blk_mq_freeze_queue(). Queue cleanup path now calls
    blk_mq_freeze_queue() directly.

    * blk_queue_enter()'s fast path condition is simplified to simply
    check @q->mq_freeze_depth. Previously, the condition was

    !blk_queue_dying(q) &&
    (!blk_queue_bypass(q) || !blk_queue_init_done(q))

    mq_freeze_depth is incremented right after dying is set and
    blk_queue_init_done() exception isn't necessary as blk-mq doesn't
    start frozen, which only leaves the blk_queue_bypass() test which
    can be replaced by @q->mq_freeze_depth test.

    This change simplifies the code and reduces confusion in the area.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Nicholas A. Bellinger
    Signed-off-by: Jens Axboe

    Tejun Heo
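
    An illustrative sketch of the counter-based check, not the kernel
    code. The toy_* names are hypothetical, and the real implementation
    also handles locking and draining of in-flight requests, which this
    model omits. The old condition is reproduced from the message above
    for comparison.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_queue {
    int mq_freeze_depth;  /* greater than zero while frozen; freezes nest */
    bool dying;
};

static void toy_freeze_queue(struct toy_queue *q)
{
    q->mq_freeze_depth++;
}

static void toy_unfreeze_queue(struct toy_queue *q)
{
    assert(q->mq_freeze_depth > 0);
    q->mq_freeze_depth--;
}

/* Marking the queue dying also freezes it, so the depth covers that case. */
static void toy_mark_dying(struct toy_queue *q)
{
    q->dying = true;
    toy_freeze_queue(q);
}

/* Old fast path condition, as quoted in the commit message. */
static bool toy_enter_fast_old(bool dying, bool bypass, bool init_done)
{
    return !dying && (!bypass || !init_done);
}

/* New fast path: a single test of the freeze depth. */
static bool toy_enter_fast_new(const struct toy_queue *q)
{
    return q->mq_freeze_depth == 0;
}
```

    Because the depth is a counter, two nested freezes need two unfreezes
    before the fast path opens again.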
     

04 Jun, 2014

2 commits


30 May, 2014

1 commit

  • Currently blk-mq registers all the hardware queues in sysfs,
    regardless of whether it uses them (e.g. they have CPU mappings)
    or not. The unused hardware queues lack the cpux/ directories,
    and the other sysfs entries (like active, pending, etc) are all
    zeroes.

    Change this so that sysfs correctly reflects the current mappings
    of the hardware queues.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

28 May, 2014

1 commit


22 May, 2014

1 commit


21 May, 2014

1 commit

  • For request_fn based devices, the block layer exports a 'nr_requests'
    file through sysfs to allow adjusting of queue depth on the fly.
    Currently this returns -EINVAL for blk-mq, since it's not wired up.
    Wire this up for blk-mq, so that it now also allows dynamic
    adjustments of the allowed queue depth for any given block device
    managed by blk-mq.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

20 May, 2014

1 commit


09 May, 2014

1 commit

  • blk-mq currently uses percpu_ida for tag allocation. But that only
    works well if the ratio between tag space and number of CPUs is
    sufficiently high. For most devices and systems, that is not the
    case. The end result is that we either only utilize the tag space
    partially, or we end up attempting to fully exhaust it and run
    into lots of lock contention with stealing between CPUs. This is
    not optimal.

    This new tagging scheme is a hybrid bitmap allocator. It uses
    two tricks to both be SMP friendly and allow full exhaustion
    of the space:

    1) We cache the last allocated (or freed) tag on a per blk-mq
    software context basis. This allows us to limit the space
    we have to search. The key element here is not caching it
    in the shared tag structure, otherwise we end up dirtying
    more shared cache lines on each allocate/free operation.

    2) The tag space is split into cache line sized groups, and
    each context will start off randomly in that space. Even up
    to full utilization of the space, this divides the tag users
    efficiently into cache line groups, avoiding dirtying the same
    one both between allocators and between allocator and freer.

    This scheme shows drastically better behaviour, not only on small
    tag spaces but on large ones as well. It has been tested extensively
    to show better performance for all the cases blk-mq cares about.

    Signed-off-by: Jens Axboe

    Jens Axboe
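
    A much simplified, single-threaded sketch of the two tricks above, not
    the allocator from the commit. The toy_* names are hypothetical, a
    32-bit word stands in for a cache-line-sized group, and the starting
    hints are fixed by the caller instead of being chosen randomly. A real
    implementation would use atomic bit operations.

```c
#include <assert.h>

#define TOY_NR_TAGS       64
#define TOY_BITS_PER_WORD 32  /* stand-in for a cache-line-sized group */
#define TOY_NR_WORDS      (TOY_NR_TAGS / TOY_BITS_PER_WORD)

/* Shared tag bitmap: a set bit means the tag is in use. */
struct toy_tags {
    unsigned int word[TOY_NR_WORDS];
};

/* Per-context hint, kept outside the shared structure on purpose. */
struct toy_ctx {
    unsigned int last_tag;
};

/* Allocate a tag, searching from this context's hint and wrapping around. */
static int toy_get_tag(struct toy_tags *tags, struct toy_ctx *ctx)
{
    unsigned int start = ctx->last_tag % TOY_NR_TAGS;

    for (unsigned int n = 0; n < TOY_NR_TAGS; n++) {
        unsigned int tag = (start + n) % TOY_NR_TAGS;
        unsigned int bit = 1u << (tag % TOY_BITS_PER_WORD);
        unsigned int *w = &tags->word[tag / TOY_BITS_PER_WORD];

        if (!(*w & bit)) {
            *w |= bit;
            ctx->last_tag = tag + 1;  /* next search starts after this tag */
            return (int)tag;
        }
    }
    return -1;  /* tag space fully exhausted */
}

/* Free a tag and remember it as the hint, so it is reused while cache hot. */
static void toy_put_tag(struct toy_tags *tags, struct toy_ctx *ctx,
                        unsigned int tag)
{
    assert(tag < TOY_NR_TAGS);
    tags->word[tag / TOY_BITS_PER_WORD] &= ~(1u << (tag % TOY_BITS_PER_WORD));
    ctx->last_tag = tag;
}
```

    Two contexts whose hints start in different words allocate from
    different groups until the space fills up, and either one can still
    exhaust the whole space by wrapping around.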
     

25 Apr, 2014

1 commit

  • The blk-mq code is using its own version of the I/O completion affinity
    tunables, which causes a few issues:

    - the rq_affinity sysfs file doesn't work for blk-mq devices, even if it
    still is present, thus breaking existing tuning setups.
    - the rq_affinity = 1 mode, which is the default for legacy request based
    drivers, isn't implemented at all.
    - blk-mq drivers don't implement any completion affinity with the default
    flag settings.

    This patch removes the blk-mq ipi_redirect flag and sysfs file, as well
    as the internal BLK_MQ_F_SHOULD_IPI flag and replaces it with code that
    respects the queue-wide rq_affinity flags and also implements the
    rq_affinity = 1 mode.

    This means I/O completion affinity can now only be tuned block-queue wide
    instead of per context, which seems more sensible to me anyway.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

24 Apr, 2014

1 commit

  • If a requeue event races with a timeout, we can get into the
    situation where we attempt to complete a request from the
    timeout handler when it's no longer started. This causes a crash.
    So have the timeout handler check that REQ_ATOM_STARTED is still
    set on the request - if not, we ignore the event. If this happens,
    the request has now been marked as complete. As a consequence, we
    need to make sure to clear REQ_ATOM_COMPLETE in blk_mq_start_request(),
    so as to maintain proper request state.

    Signed-off-by: Jens Axboe

    Jens Axboe
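
    An illustrative single-threaded model of the state handling described
    above, not the kernel code. The flag and function names are
    hypothetical, and plain bit operations stand in for the atomic ones.
    The model assumes the timeout path marks the request complete before
    it checks whether the request is still started.

```c
#include <assert.h>
#include <stdbool.h>

enum {
    TOY_STARTED  = 1u << 0,  /* stand-in for REQ_ATOM_STARTED */
    TOY_COMPLETE = 1u << 1,  /* stand-in for REQ_ATOM_COMPLETE */
};

struct toy_request {
    unsigned int atomic_flags;
    int timeouts_handled;
};

/* Starting a request also clears a stale COMPLETE left by an ignored timeout. */
static void toy_start_request(struct toy_request *rq)
{
    rq->atomic_flags |= TOY_STARTED;
    rq->atomic_flags &= ~(unsigned int)TOY_COMPLETE;
}

/* Requeueing means the request is no longer started. */
static void toy_requeue_request(struct toy_request *rq)
{
    rq->atomic_flags &= ~(unsigned int)TOY_STARTED;
}

/* Returns true if the timeout was handled, false if it was ignored. */
static bool toy_rq_timed_out(struct toy_request *rq)
{
    rq->atomic_flags |= TOY_COMPLETE;  /* the timeout path claims the request */
    if (!(rq->atomic_flags & TOY_STARTED))
        return false;                  /* requeued meanwhile: ignore the event */
    rq->timeouts_handled++;
    return true;
}
```

    After an ignored timeout the request is left marked complete, which is
    why the start path in this sketch clears that bit again.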
     

16 Apr, 2014

3 commits

  • Add a new blk_mq_tag_set structure that gets set up before we initialize
    the queue. A single blk_mq_tag_set structure can be shared by multiple
    queues.

    Signed-off-by: Christoph Hellwig

    Modular export of blk_mq_{alloc,free}_tagset added by me.

    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Drivers shouldn't have to care about the block layer setting aside a
    request to implement the flush state machine. We already override the
    mq context and tag to make it more transparent, but so far haven't dealt
    with the driver private data in the request. Make sure to override this
    as well, and while we're at it add a proper helper sitting in blk-mq.c
    that implements the full impersonation.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Drivers can reach their private data easily using the blk_mq_rq_to_pdu
    helper and don't need req->special. By not initializing it, the code can
    be simplified nicely, and we also shave off a few more instructions from
    the I/O path.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
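
    An illustrative sketch of why no extra pointer is needed, assuming the
    driver's private data is allocated directly behind the request
    structure. The toy_* names are hypothetical; toy_rq_to_pdu() mirrors
    the idea behind blk_mq_rq_to_pdu, which locates the payload from the
    request's own address.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for the block layer's request structure. */
struct toy_request {
    int tag;
};

/* Hypothetical driver-private command that travels with each request. */
struct toy_cmd {
    int opcode;
};

/* Allocate a request with cmd_size bytes of driver payload behind it. */
static struct toy_request *toy_alloc_request(size_t cmd_size)
{
    return calloc(1, sizeof(struct toy_request) + cmd_size);
}

/* The payload starts right after the request, so pointer math finds it. */
static void *toy_rq_to_pdu(struct toy_request *rq)
{
    return rq + 1;
}
```

    A driver in this model would write
    `struct toy_cmd *cmd = toy_rq_to_pdu(rq);` wherever it needs its
    command. The sketch relies on the payload type needing no stricter
    alignment than the request structure itself.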
     

21 Mar, 2014

2 commits


11 Feb, 2014

2 commits

  • Switch to using a preallocated flush_rq for blk-mq similar to what's done
    with the old request path. This allows us to set up the request properly
    with a tag from the actually allowed range and ->rq_disk as needed by
    some drivers. To make life easier we also switch to dynamic allocation
    of ->flush_rq for the old path.

    This effectively reverts most of

    "blk-mq: fix for flush deadlock"

    and

    "blk-mq: Don't reserve a tag for flush request"

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Rework I/O completions to work more like the old code path. blk_mq_end_io
    now stays out of the business of deferring completions to other CPUs
    and calling blk_mark_rq_complete. The latter is very important to allow
    completing requests that have timed out and thus are already marked completed,
    the former allows using the IPI callout even for driver specific completions
    instead of having to reimplement them.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

09 Jan, 2014

1 commit

  • __smp_call_function_single already avoids multiple IPIs by internally
    queueing up the items, and now also is available for non-SMP builds as
    a trivially correct stub, so there is no need to wrap it. If the
    additional lock roundtrip causes problems, my patch to convert the
    generic IPI code to llists, which is waiting to get merged, will fix it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

01 Jan, 2014

2 commits


25 Oct, 2013

1 commit

  • Linux currently has two models for block devices:

    - The classic request_fn based approach, where drivers use struct
    request units for IO. The block layer provides various helper
    functionalities to let drivers share code, things like tag
    management, timeout handling, queueing, etc.

    - The "stacked" approach, where a driver squeezes in between the
    block layer and IO submitter. Since this bypasses the IO stack,
    drivers generally have to manage everything themselves.

    With drivers being written for new high IOPS devices, the classic
    request_fn based driver doesn't work well enough. The design dates
    back to when both SMP and high IOPS were rare. It has problems with
    scaling to bigger machines, and runs into scaling issues even on
    smaller machines when you have IOPS in the hundreds of thousands
    per device.

    The stacked approach is then most often selected as the model
    for the driver. But this means that everybody has to re-invent
    everything, and along with that we get all the problems again
    that the shared approach solved.

    This commit introduces blk-mq, block multi queue support. The
    design is centered around per-cpu queues for queueing IO, which
    then funnel down into x number of hardware submission queues.
    We might have a 1:1 mapping between the two, or it might be
    an N:M mapping. That all depends on what the hardware supports.

    blk-mq provides various helper functions, which include:

    - Scalable support for request tagging. Most devices need to
    be able to uniquely identify a request both in the driver and
    to the hardware. The tagging uses per-cpu caches for freed
    tags, to enable cache hot reuse.

    - Timeout handling without tracking requests on a per-device
    basis. Basically the driver should be able to get a notification,
    if a request happens to fail.

    - Optional support for non 1:1 mappings between issue and
    submission queues. blk-mq can redirect IO completions to the
    desired location.

    - Support for per-request payloads. Drivers almost always need
    to associate a request structure with some driver private
    command structure. Drivers can tell blk-mq this at init time,
    and then any request handed to the driver will have the
    required size of memory associated with it.

    - Support for merging of IO, and plugging. The stacked model
    gets neither of these. Even for high IOPS devices, merging
    sequential IO reduces per-command overhead and thus
    increases bandwidth.

    For now, this is provided as a potential 3rd queueing model, with
    the hope being that, as it matures, it can replace both the classic
    and stacked model. That would get us back to having just 1 real
    model for block devices, leaving the stacked approach to dm/md
    devices (as it was originally intended).

    Contributions in this patch from the following people:

    Shaohua Li
    Alexander Gordeev
    Christoph Hellwig
    Mike Christie
    Matias Bjorling
    Jeff Moyer

    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
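
    An illustrative model of the design described above, not the kernel
    code: I/O is queued on per-CPU software queues, which funnel into a
    smaller number of hardware queues when a hardware queue is run. The
    toy_* names and the modulo mapping are assumptions made for the
    sketch, and counters stand in for the request lists.

```c
#include <assert.h>

#define TOY_NR_CPUS 4
#define TOY_NR_HW   2

struct toy_mq {
    int sw_pending[TOY_NR_CPUS];  /* requests waiting in each per-CPU queue */
    int hw_issued[TOY_NR_HW];     /* requests handed to each hardware queue */
};

/* N:M mapping; with TOY_NR_HW == TOY_NR_CPUS this becomes 1:1. */
static unsigned int toy_cpu_to_hw(unsigned int cpu)
{
    return cpu % TOY_NR_HW;
}

/* Queue one request on the submitting CPU's software queue. */
static void toy_queue_io(struct toy_mq *mq, unsigned int cpu)
{
    assert(cpu < TOY_NR_CPUS);
    mq->sw_pending[cpu]++;
}

/* Drain every software queue mapped to this hardware queue. */
static int toy_run_hw_queue(struct toy_mq *mq, unsigned int hw)
{
    int moved = 0;

    assert(hw < TOY_NR_HW);
    for (unsigned int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
        if (toy_cpu_to_hw(cpu) != hw)
            continue;
        moved += mq->sw_pending[cpu];
        mq->sw_pending[cpu] = 0;
    }
    mq->hw_issued[hw] += moved;
    return moved;
}
```

    Submitters on different CPUs touch only their own software queue in
    this model. Only running a hardware queue touches the state shared by
    the CPUs mapped to it.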