13 Jan, 2019

1 commit

  • commit 7211aef86f79583e59b88a0aba0bc830566f7e8e upstream.

    For a zoned block device using mq-deadline, if a write request for a
    zone is received while another write was already dispatched for the same
    zone, dd_dispatch_request() will return NULL and the newly inserted
    write request is kept in the scheduler queue waiting for the ongoing
    zone write to complete. With this behavior, when no other request has
    been dispatched, rq_list in blk_mq_sched_dispatch_requests() is empty
    and blk_mq_sched_mark_restart_hctx() is not called. This in turn
    causes the call of blk_mq_sched_restart() from __blk_mq_free_request()
    to not run the queue when the already dispatched write request
    completes. The newly inserted write request stays stuck in the
    scheduler queue until eventually another request is submitted.

    This problem does not affect SCSI disks, as the SCSI stack handles
    queue restart on request completion. However, this problem can be
    triggered with the null_blk driver with zoned mode enabled.

    Fix this by always requesting a queue restart in dd_dispatch_request()
    if no request was dispatched while WRITE requests are queued.

    Fixes: 5700f69178e9 ("mq-deadline: Introduce zone locking support")
    Cc:
    Signed-off-by: Damien Le Moal
    Signed-off-by: Greg Kroah-Hartman

    Add missing export of blk_mq_sched_restart()

    Signed-off-by: Jens Axboe

    Damien Le Moal
     

21 Aug, 2018

1 commit

  • Currently, when updating nr_hw_queues, the IO scheduler's init_hctx
    will be invoked before the mapping between ctx and hctx is adapted
    correctly by blk_mq_map_swqueue. The IO scheduler's init_hctx (kyber)
    may depend on this mapping, get a wrong result, and finally panic.
    A simple way to fix this is to switch the IO scheduler to 'none'
    before updating nr_hw_queues, and then switch it back afterwards.
    blk_mq_sched_init_/exit_hctx are removed since nobody uses them
    any more.

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

18 Jul, 2018

1 commit

  • In the case of the 'none' io scheduler, when the hw queue isn't busy,
    it isn't necessary to enqueue a request to the sw queue and dequeue
    it again, because the request can be submitted to the hw queue right
    away without extra cost. Meanwhile there shouldn't be many requests
    in the sw queue, so we don't need to worry about the effect on IO
    merging.

    There are still some single hw queue SCSI HBAs (HPSA, megaraid_sas,
    ...) which may connect high performance devices, so 'none' is often
    required for obtaining good performance.

    This patch improves IOPS and decreases CPU utilization on
    megaraid_sas, per Kashyap's test.

    Cc: Kashyap Desai
    Cc: Laurence Oberman
    Cc: Omar Sandoval
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Reported-by: Kashyap Desai
    Tested-by: Kashyap Desai
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

09 Jul, 2018

3 commits

  • It isn't efficient to dequeue requests one by one from the sw queue,
    but we have to do that when the queue is busy, for better merge
    performance.

    This patch uses an Exponential Weighted Moving Average (EWMA) to
    figure out if the queue is busy, and then only dequeues requests one
    by one from the sw queue when the queue is busy.

    Fixes: b347689ffbca ("blk-mq-sched: improve dispatching from sw queue")
    Cc: Kashyap Desai
    Cc: Laurence Oberman
    Cc: Omar Sandoval
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Reported-by: Kashyap Desai
    Tested-by: Kashyap Desai
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Only attempt to merge a bio if ctx->rq_list isn't empty, because:

    1) for a high-performance SSD, dispatch will succeed most of the
    time, so there may be nothing left in ctx->rq_list; by not trying to
    merge over the sw queue when it is empty, we save one acquisition of
    ctx->lock

    2) we can't expect good merge performance on the per-cpu sw queue
    anyway, and missing one merge on the sw queue won't be a big deal
    since tasks can be scheduled from one CPU to another.

    Cc: Laurence Oberman
    Cc: Omar Sandoval
    Cc: Bart Van Assche
    Tested-by: Kashyap Desai
    Reported-by: Kashyap Desai
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • We have to remove synchronize_rcu() from blk_queue_cleanup(),
    otherwise a long delay can be caused during lun probe. To remove
    it, we have to avoid iterating set->tag_list in the IO path, e.g.
    in blk_mq_sched_restart().

    This patch reverts 5b79413946d (Revert "blk-mq: don't handle
    TAG_SHARED in restart"). We have fixed enough IO hang issues, and
    there isn't any reason to restart all queues sharing one tag set
    any more, for the following reasons:

    1) blk-mq core can deal with shared-tags case well via blk_mq_get_driver_tag(),
    which can wake up queues waiting for driver tag.

    2) SCSI is a bit special because it may return BLK_STS_RESOURCE if
    the queue, target or host isn't ready, but SCSI's built-in restart
    covers all these cases well; see scsi_end_request(): the queue will
    be rerun after any request initiated from this host/target is
    completed.

    In my test on scsi_debug(8 luns), this patch may improve IOPS by 20% ~ 30%
    when running I/O on these 8 luns concurrently.

    Fixes: 705cda97ee3a ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
    Cc: Omar Sandoval
    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Martin K. Petersen
    Cc: linux-scsi@vger.kernel.org
    Reported-by: Andrew Jones
    Tested-by: Andrew Jones
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

03 Jun, 2018

1 commit

  • We set up q->nr_requests when switching to a new scheduler, but we
    don't do it for 'none', so q->nr_requests may not be correct for
    'none'.

    This patch fixes this issue by always updating 'nr_requests' when
    switching to 'none'.

    Cc: Marco Patalano
    Cc: "Ewan D. Milne"
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

01 Jun, 2018

2 commits


31 May, 2018

1 commit


02 Feb, 2018

1 commit


18 Jan, 2018

1 commit


05 Jan, 2018

1 commit

  • Commit de1482974080
    ("blk-mq: introduce .get_budget and .put_budget in blk_mq_ops")
    changes the function to return bool type, and then commit 1f460b63d4b3
    ("blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCE")
    changes it back to void, but the comment remains.

    Signed-off-by: Liu Bo
    Signed-off-by: Jens Axboe

    Liu Bo
     

11 Nov, 2017

2 commits

  • Currently we are inconsistent in when we decide to run the queue. Using
    blk_mq_run_hw_queues() we check if the hctx has pending IO before
    running it, but we don't do that from the individual queue run function,
    blk_mq_run_hw_queue(). This results in a lot of extra and pointless
    queue runs, potentially, on flush requests and (much worse) on tag
    starvation situations. This is observable just looking at top output,
    with lots of kworkers active. For the !async runs, it just adds to the
    CPU overhead of blk-mq.

    Move the has-pending check into the run function instead of having
    callers do it.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This reverts commit 358a3a6bccb74da9d63a26b2dd5f09f1e9970e0b.

    We have cases that aren't covered 100% in the drivers, so for now
    we have to retain the shared tag restart loops.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

05 Nov, 2017

3 commits

  • The idea behind it is simple:

    1) for the none scheduler, a driver tag has to be borrowed for the
    flush rq, otherwise we may run out of tags and cause an IO hang. And
    get/put driver tag is actually a noop for none, so reordering tags
    isn't necessary at all.

    2) for a real I/O scheduler, we need not allocate a driver tag upfront
    for flush rq. It works just fine to follow the same approach as
    normal requests: allocate driver tag for each rq just before calling
    ->queue_rq().

    One driver visible change is that the driver tag isn't shared in the
    flush request sequence. That won't be a problem, since we always do that
    in legacy path.

    Then flush rq need not be treated specially wrt. get/put driver tag.
    This cleans up the code - for instance, reorder_tags_to_front() can be
    removed, and we needn't worry about request ordering in dispatch list
    for avoiding I/O deadlock.

    Also we have to put the driver tag before requeueing.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • In case of IO scheduler we always pre-allocate one driver tag before
    calling blk_insert_flush(), and flush request will be marked as
    RQF_FLUSH_SEQ once it is in flush machinery.

    So if RQF_FLUSH_SEQ isn't set, we call blk_insert_flush() to handle
    the request, otherwise the flush request is dispatched to ->dispatch
    list directly.

    This is a preparation patch for not preallocating a driver tag for flush
    requests, and for not treating flush requests as a special case. This is
    similar to what the legacy path does.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • It is enough to just check if we can get the budget via .get_budget().
    We don't need to deal with device state changes in .get_budget().

    For SCSI, one issue to be fixed is that we have to call
    scsi_mq_uninit_cmd() to free allocated resources if the SCSI device
    fails to handle the request. It isn't enough to simply call
    blk_mq_end_request() to do that if this request is marked as
    RQF_DONTPREP.

    Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for blk-mq")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

01 Nov, 2017

6 commits

  • SCSI restarts its queue in scsi_end_request() automatically, so we don't
    need to handle this case in blk-mq.

    Especially, no request will be dequeued in this case, so we needn't
    worry about IO hangs caused by restart vs. dispatch.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Now restart is used in the following cases, and TAG_SHARED is for
    SCSI only.

    1) .get_budget() returns BLK_STS_RESOURCE
    - if resources at the target/host level aren't satisfied, this SCSI
    device will be added to shost->starved_list, and the whole queue will
    be rerun (via SCSI's built-in RESTART) in scsi_end_request() after
    any request initiated from this host/target is completed. Note that
    host level resources can't be an issue for blk-mq at all.

    - the same is true if resource in the queue level isn't satisfied.

    - if there isn't an outstanding request on this queue, then SCSI's
    RESTART can't work (blk-mq's can't work either); the queue will be
    run after SCSI_QUEUE_DELAY, and finally all starved sdevs will be
    handled by SCSI's RESTART when this request is finished

    2) scsi_dispatch_cmd() returns BLK_STS_RESOURCE
    - if there isn't an in-progress request on this queue, the queue
    will be run after SCSI_QUEUE_DELAY

    - otherwise, SCSI's RESTART covers the rerun.

    3) blk_mq_get_driver_tag() fails
    - BLK_MQ_S_TAG_WAITING covers the cross-queue RESTART for driver tag
    allocation.

    In a word, SCSI's built-in RESTART is enough to cover the queue
    rerun, and we don't need to pay special attention to TAG_SHARED
    wrt. restart.

    In my test on scsi_debug(8 luns), this patch improves IOPS by 20% ~ 30% when
    running I/O on these 8 luns concurrently.

    Also, Roman Pen reported that the current RESTART is very expensive,
    especially when there are lots of LUNs attached to one host; in his
    test, RESTART cut IOPS in half.

    Fixes: https://marc.info/?l=linux-kernel&m=150832216727524&w=2
    Fixes: 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • SCSI devices use a host-wide tagset, and the shared driver tag space
    is often quite big. However, there is also a queue depth for each lun
    (.cmd_per_lun), which is often small; for example, on both lpfc and
    qla2xxx, .cmd_per_lun is just 3.

    So lots of requests may stay in the sw queue, and we always flush all
    of those belonging to the same hw queue and dispatch them all to the
    driver. Unfortunately it is easy to cause the queue to become busy
    because of the small .cmd_per_lun. Once these requests are flushed
    out, they have to stay in hctx->dispatch, no bio merging can happen
    on them, and sequential IO performance is harmed.

    This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request
    from a sw queue, so that we can dispatch them in scheduler's way. We can
    then avoid dequeueing too many requests from sw queue, since we don't
    flush ->dispatch completely.

    This patch improves dispatching from sw queue by using the .get_budget
    and .put_budget callbacks.

    Reviewed-by: Omar Sandoval
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • For SCSI devices, there is often a per-request-queue depth, which needs
    to be respected before queuing one request.

    Currently blk-mq always dequeues the request first, then calls
    .queue_rq() to dispatch the request to lld. One obvious issue with this
    approach is that I/O merging may not be successful, because when the
    per-request-queue depth can't be respected, .queue_rq() has to return
    BLK_STS_RESOURCE, and then this request has to stay in hctx->dispatch
    list. This means it never gets a chance to be merged with other IO.

    This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
    then we can try to get reserved budget first before dequeuing request.
    If the budget for queueing I/O can't be satisfied, we don't need to
    dequeue request at all. Hence the request can be left in the IO
    scheduler queue, for more merging opportunities.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • So that it becomes easy to support dispatching from the sw queue in
    the following patch.

    No functional change.

    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval
    Suggested-by: Christoph Hellwig # for simplifying dispatch logic
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • When the hw queue is busy, we shouldn't take requests from the scheduler
    queue any more, otherwise it is difficult to do IO merge.

    This patch fixes the awful IO performance on some SCSI devices (lpfc,
    qla2xxx, ...) when mq-deadline/kyber is used, by not taking requests
    if the hw queue is busy.

    Reviewed-by: Omar Sandoval
    Reviewed-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

04 Jul, 2017

1 commit

  • When mq-deadline is used, IOPS of sequential read and sequential
    write is observed to drop by more than 20% on sata (scsi-mq)
    devices, compared with using the 'none' scheduler.

    The reason is that the default nr_requests for the scheduler is
    too big for small queue-depth devices, and latency is increased
    a lot.

    Since the principle of taking 256 requests for the mq scheduler
    is based on a queue depth of 128, this patch changes it to
    double the size of min(hw queue_depth, 128).

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

23 Jun, 2017

1 commit


22 Jun, 2017

1 commit

  • If we have shared tags enabled, then every IO completion will trigger
    a full loop of every queue belonging to a tag set, and every hardware
    queue for each of those queues, even if nothing needs to be done.
    This causes a massive performance regression if you have a lot of
    shared devices.

    Instead of doing this huge full scan on every IO, add an atomic
    counter to the main queue that tracks how many hardware queues have
    been marked as needing a restart. With that, we can avoid looking for
    restartable queues, if we don't have to.

    Max reports that this restores performance. Before this patch, 4K
    IOPS was limited to 22-23K IOPS. With the patch, we are running at
    950-970K IOPS.

    Fixes: 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")
    Reported-by: Max Gurtovoy
    Tested-by: Max Gurtovoy
    Reviewed-by: Bart Van Assche
    Tested-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Jens Axboe
     

21 Jun, 2017

1 commit

  • Document the locking assumptions in functions that modify
    blk_mq_ctx.rq_list to make it easier for humans to verify
    this code.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

19 Jun, 2017

4 commits

  • It is required that no dispatch can happen any more once
    blk_mq_quiesce_queue() returns, and we don't have such a requirement
    on the APIs for stopping a queue.

    But blk_mq_quiesce_queue() still may not block/drain dispatch in
    the case of BLK_MQ_S_START_ON_RUN, so use the newly introduced
    QUEUE_FLAG_QUIESCED flag and evaluate it inside RCU read-side
    critical sections to fix this issue.

    Also, blk_mq_quiesce_queue() is implemented via stopping the queue,
    which limits its uses and makes races easy to cause, because any
    queue restart in other paths may break blk_mq_quiesce_queue(). With
    the introduced QUEUE_FLAG_QUIESCED flag, we don't need to depend on
    stopping the queue for quiescing any more.

    Signed-off-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • blk_mq_sched_assign_ioc now only handles the assignment of the ioc
    if the scheduler needs it (bfq only at the moment). The call to the
    per-request initializer is moved out so that it can be merged with
    a similar call for the kyber I/O scheduler.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Having these as separate helpers in a header really does not help
    readability, or my chances to refactor this code sanely.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Having them out of line in blk-mq-sched.c just makes the code flow
    unnecessarily complicated.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

27 May, 2017

1 commit


04 May, 2017

1 commit

  • This provides the infrastructure for schedulers to expose their internal
    state through debugfs. We add a list of queue attributes and a list of
    hctx attributes to struct elevator_type and wire them up when switching
    schedulers.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke

    Add missing seq_file.h header in blk-mq-debugfs.h

    Signed-off-by: Jens Axboe

    Omar Sandoval
     

02 May, 2017

1 commit


27 Apr, 2017

1 commit

  • At least one driver, mtip32xx, has a hard-coded dependency on the
    value of the reserved tag used for internal commands. While that
    should really be fixed up, for now let's ensure that we just bypass
    the scheduler tags for an allocation marked as reserved. They are
    used for housekeeping or error handling, so we can safely ignore
    them in the scheduler.

    Tested-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jens Axboe
     

21 Apr, 2017

1 commit


08 Apr, 2017

2 commits

  • Schedulers need to be informed when a hardware queue is added or removed
    at runtime so they can allocate/free per-hardware queue data. So,
    replace the blk_mq_sched_init_hctx_data() helper, which only makes sense
    at init time, with .init_hctx() and .exit_hctx() hooks.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • To improve scalability, if hardware queues are shared, restart
    a single hardware queue in round-robin fashion. Rename
    blk_mq_sched_restart_queues() to reflect the new semantics.
    Remove blk_mq_sched_mark_restart_queue() because this function
    has no callers. Remove flag QUEUE_FLAG_RESTART because this
    patch removes the code that uses this flag.

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

07 Apr, 2017

1 commit

  • In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
    back to the original scheduler. However, at this point, we've already
    torn down the original scheduler's tags, so this causes a crash. Doing
    the fallback like the legacy elevator path is much harder for mq, so fix
    it by just falling back to none, instead.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval