26 Sep, 2018

1 commit

  • [ Upstream commit b04f50ab8a74129b3041a2836c33c916be3c6667 ]

    Only attempt to merge a bio if the ctx->rq_list isn't empty, because:

    1) for high-performance SSDs, dispatch succeeds most of the time, so
    there is often nothing left in ctx->rq_list; by not trying to merge
    over an empty sw queue we save one acquisition of ctx->lock

    2) we can't expect good merge performance on the per-cpu sw queue
    anyway, and missing the occasional merge there is no big deal, since
    tasks can be scheduled from one CPU to another.
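
    A minimal sketch of the idea, assuming the blk-mq helper and field
    names of that era (blk_mq_attempt_merge() is the existing per-sw-queue
    merge helper); treat it as illustrative rather than the verbatim patch:

        static bool attempt_sw_queue_merge(struct request_queue *q,
                                           struct blk_mq_ctx *ctx,
                                           struct bio *bio)
        {
                bool merged;

                /*
                 * Lockless peek first: on fast devices the sw queue is
                 * usually empty after dispatch, so don't pay for
                 * ctx->lock in that case.
                 */
                if (list_empty_careful(&ctx->rq_list))
                        return false;

                spin_lock(&ctx->lock);
                merged = blk_mq_attempt_merge(q, ctx, bio);
                spin_unlock(&ctx->lock);
                return merged;
        }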

    Cc: Laurence Oberman
    Cc: Omar Sandoval
    Cc: Bart Van Assche
    Tested-by: Kashyap Desai
    Reported-by: Kashyap Desai
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

20 Dec, 2017

1 commit

  • [ Upstream commit 5e3d02bbafad38975099b5848f5ebadedcf7bb7e ]

    When the hw queue is busy, we shouldn't take any more requests from
    the scheduler queue; otherwise it is difficult to do IO merging.

    This patch fixes the awful IO performance seen on some SCSI devices
    (lpfc, qla2xxx, ...) when mq-deadline/kyber is used, by not taking
    requests while the hw queue is busy.
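
    A simplified sketch of the resulting dispatch order (helper names are
    illustrative, and some arguments of the real helpers are omitted):

        static void dispatch_sketch(struct blk_mq_hw_ctx *hctx)
        {
                struct request_queue *q = hctx->queue;
                LIST_HEAD(rq_list);
                bool can_go = true;

                /* requests left over from a previous busy dispatch go first */
                if (!list_empty_careful(&hctx->dispatch)) {
                        spin_lock(&hctx->lock);
                        list_splice_init(&hctx->dispatch, &rq_list);
                        spin_unlock(&hctx->lock);
                        can_go = blk_mq_dispatch_rq_list(q, &rq_list);
                }

                /*
                 * Only drain the scheduler if the device accepted them all;
                 * requests left in the scheduler remain mergeable.
                 */
                if (can_go)
                        blk_mq_do_dispatch_sched(hctx);
        }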

    Reviewed-by: Omar Sandoval
    Reviewed-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

04 Jul, 2017

1 commit

  • When mq-deadline is used, sequential-read and sequential-write IOPS
    are observed to drop by more than 20% on SATA (scsi-mq) devices,
    compared with the 'none' scheduler.

    The reason is that the default nr_requests for the scheduler is
    too big for small-queue-depth devices, which increases latency
    considerably.

    Since the rationale for giving the mq scheduler 256 requests is
    based on a queue depth of 128, this patch changes the default to
    twice min(hw queue_depth, 128).
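
    In code form, this is roughly the following (BLKDEV_MAX_RQ is 128):

        /* scheduler default: twice the smaller of hw depth and 128 */
        q->nr_requests = 2 * min_t(unsigned int,
                                   q->tag_set->queue_depth,
                                   BLKDEV_MAX_RQ);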

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

23 Jun, 2017

1 commit


22 Jun, 2017

1 commit

  • If we have shared tags enabled, then every IO completion will trigger
    a full loop of every queue belonging to a tag set, and every hardware
    queue for each of those queues, even if nothing needs to be done.
    This causes a massive performance regression if you have a lot of
    shared devices.

    Instead of doing this huge full scan on every IO, add an atomic
    counter to the main queue that tracks how many hardware queues have
    been marked as needing a restart. With that, we can avoid looking for
    restartable queues when we don't have to.

    Max reports that this restores performance. Before this patch, 4K
    IOPS were limited to 22-23K; with the patch, we are running at
    950-970K IOPS.
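
    A sketch of the mechanism, using the flag and field names from the
    fix (the scan itself is omitted):

        /* marking a hctx also bumps the queue-wide counter */
        static void mark_restart(struct blk_mq_hw_ctx *hctx)
        {
                if (!test_and_set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state) &&
                    (hctx->flags & BLK_MQ_F_TAG_SHARED))
                        atomic_inc(&hctx->queue->shared_hctx_restart);
        }

        void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
        {
                /*
                 * Fast path on IO completion: if no hctx is marked, skip
                 * the full scan over every queue sharing this tag set.
                 */
                if (!atomic_read(&hctx->queue->shared_hctx_restart))
                        return;

                /* slow path: scan the tag set and kick marked queues */
        }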

    Fixes: 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")
    Reported-by: Max Gurtovoy
    Tested-by: Max Gurtovoy
    Reviewed-by: Bart Van Assche
    Tested-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Jens Axboe
     

21 Jun, 2017

1 commit

  • Document the locking assumptions in functions that modify
    blk_mq_ctx.rq_list to make it easier for humans to verify
    this code.
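
    For example (the helper shown here is hypothetical; the actual change
    adds the assertion to the existing rq_list manipulation functions):

        static void ctx_rq_list_add(struct blk_mq_ctx *ctx, struct request *rq)
        {
                lockdep_assert_held(&ctx->lock);  /* caller holds ctx->lock */
                list_add_tail(&rq->queuelist, &ctx->rq_list);
        }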

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

19 Jun, 2017

4 commits

  • It is required that no dispatch can happen any more once
    blk_mq_quiesce_queue() returns, and we have no such requirement
    on the APIs for stopping a queue.

    But blk_mq_quiesce_queue() may still fail to block/drain dispatch in
    the case of BLK_MQ_S_START_ON_RUN, so fix this issue by using the
    newly introduced QUEUE_FLAG_QUIESCED flag and evaluating it inside
    RCU read-side critical sections.

    Also, blk_mq_quiesce_queue() is implemented by stopping the queue,
    which limits its uses and easily causes races, because any queue
    restart in other paths may break blk_mq_quiesce_queue(). With the
    introduced QUEUE_FLAG_QUIESCED flag, we no longer need to depend on
    stopping the queue for quiescing.
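
    A sketch of the dispatch-side check (the function name is
    illustrative; BLK_MQ_F_BLOCKING drivers use SRCU instead of plain
    RCU):

        static void run_hw_queue_sketch(struct blk_mq_hw_ctx *hctx)
        {
                if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
                        rcu_read_lock();
                        if (!blk_queue_quiesced(hctx->queue))
                                blk_mq_sched_dispatch_requests(hctx);
                        rcu_read_unlock();
                }
        }

    blk_mq_quiesce_queue() can then set the flag and wait for an RCU
    grace period, after which no dispatch can still be in flight.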

    Signed-off-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • blk_mq_sched_assign_ioc now only handles the assignment of the ioc if
    the scheduler needs it (bfq only at the moment). The call to the
    per-request initializer is moved out so that it can be merged with
    a similar call for the kyber I/O scheduler.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Having these as separate helpers in a header really does not help
    readability, or my chances of refactoring this code sanely.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Having them out of line in blk-mq-sched.c just makes the code flow
    unnecessarily complicated.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

27 May, 2017

1 commit


04 May, 2017

1 commit

  • This provides the infrastructure for schedulers to expose their internal
    state through debugfs. We add a list of queue attributes and a list of
    hctx attributes to struct elevator_type and wire them up when switching
    schedulers.
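
    The hook points, as they appear on struct elevator_type (a sketch;
    other members omitted):

        struct elevator_type {
                /* existing members omitted */
                const struct blk_mq_debugfs_attr *queue_debugfs_attrs;
                const struct blk_mq_debugfs_attr *hctx_debugfs_attrs;
        };

    Each blk_mq_debugfs_attr names a debugfs file and supplies its
    seq_file callbacks; the lists are registered when a scheduler is
    switched in and torn down when it is switched out.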

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke

    Add missing seq_file.h header in blk-mq-debugfs.h

    Signed-off-by: Jens Axboe

    Omar Sandoval
     

02 May, 2017

1 commit


27 Apr, 2017

1 commit

  • At least one driver, mtip32xx, has a hard-coded dependency on
    the value of the reserved tag used for internal commands. While
    that should really be fixed up, for now let's ensure that we just
    bypass the scheduler tags for an allocation marked as reserved. Such
    allocations are used for housekeeping or error handling, so we can
    safely ignore them in the scheduler.
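
    A sketch of the tag-space selection, assuming the hctx fields of that
    era (the helper name is illustrative):

        static struct blk_mq_tags *tags_for_alloc(struct blk_mq_hw_ctx *hctx,
                                                  unsigned int flags)
        {
                /*
                 * Reserved allocations bypass the scheduler tag space, so
                 * drivers that depend on exact reserved tag values keep
                 * seeing the hardware tags.
                 */
                if (hctx->sched_tags && !(flags & BLK_MQ_REQ_RESERVED))
                        return hctx->sched_tags;
                return hctx->tags;
        }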

    Tested-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jens Axboe
     

21 Apr, 2017

1 commit


08 Apr, 2017

2 commits

  • Schedulers need to be informed when a hardware queue is added or removed
    at runtime so they can allocate/free per-hardware queue data. So,
    replace the blk_mq_sched_init_hctx_data() helper, which only makes sense
    at init time, with .init_hctx() and .exit_hctx() hooks.
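
    A sketch of what a scheduler's hooks might look like (the per-hctx
    data here is hypothetical; the hook signatures follow
    elevator_mq_ops):

        struct sched_hctx_data {                /* hypothetical state */
                unsigned long dispatched;
        };

        static int sched_init_hctx(struct blk_mq_hw_ctx *hctx,
                                   unsigned int idx)
        {
                struct sched_hctx_data *d;

                d = kzalloc_node(sizeof(*d), GFP_KERNEL, hctx->numa_node);
                if (!d)
                        return -ENOMEM;
                hctx->sched_data = d;
                return 0;
        }

        static void sched_exit_hctx(struct blk_mq_hw_ctx *hctx,
                                    unsigned int idx)
        {
                kfree(hctx->sched_data);
                hctx->sched_data = NULL;
        }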

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • To improve scalability, if hardware queues are shared, restart
    a single hardware queue in round-robin fashion. Rename
    blk_mq_sched_restart_queues() to reflect the new semantics.
    Remove blk_mq_sched_mark_restart_queue() because this function
    has no callers. Remove flag QUEUE_FLAG_RESTART because this
    patch removes the code that uses this flag.
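
    A sketch of the per-hctx restart primitive the round-robin walk is
    built on (the walk over the tag set's queues is omitted):

        static bool restart_hctx(struct blk_mq_hw_ctx *hctx)
        {
                if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART,
                                       &hctx->state)) {
                        blk_mq_run_hw_queue(hctx, true);
                        return true;
                }
                return false;
        }

    The caller remembers where the previous scan stopped and resumes from
    the next queue, so a single completion no longer kicks every queue
    sharing the tag set.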

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

07 Apr, 2017

4 commits

  • In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
    back to the original scheduler. However, at this point, we've already
    torn down the original scheduler's tags, so this causes a crash. Doing
    the fallback like the legacy elevator path is much harder for mq, so fix
    it by just falling back to none, instead.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • If a new hardware queue is added at runtime, we don't allocate scheduler
    tags for it, leading to a crash. This hooks up the scheduler framework
    to blk_mq_{init,exit}_hctx() to make sure everything gets properly
    initialized/freed.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • As preparatory cleanup for the next couple of fixes, push
    blk_mq_sched_setup() and e->ops.mq.init_sched() into a helper.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • While dispatching requests, if we fail to get a driver tag, we mark the
    hardware queue as waiting for a tag and put the requests on a
    hctx->dispatch list to be run later when a driver tag is freed. However,
    blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
    queues if using a single-queue scheduler with a multiqueue device. If
    blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
    are processing. This means we end up using the hardware queue of the
    previous request, which may or may not be the same as that of the
    current request. If it isn't, the wrong hardware queue will end up
    waiting for a tag, and the requests will be on the wrong dispatch list,
    leading to a hang.

    The fix is twofold:

    1. Make sure we save which hardware queue we were trying to get a
    request for in blk_mq_get_driver_tag() regardless of whether it
    succeeds or not.
    2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
    blk_mq_hw_queue to make it clear that it must handle multiple
    hardware queues, since I've already messed this up on a couple of
    occasions.

    This didn't appear in testing with nvme and mq-deadline because nvme has
    more driver tags than the default number of scheduler tags. However,
    with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
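
    A simplified sketch of fix 1 (the shared-tag accounting of the real
    function is omitted):

        bool blk_mq_get_driver_tag(struct request *rq,
                                   struct blk_mq_hw_ctx **hctx, bool wait)
        {
                struct blk_mq_alloc_data data = {
                        .q = rq->q,
                        .hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu),
                        .flags = wait ? 0 : BLK_MQ_REQ_NOWAIT,
                };

                if (rq->tag < 0)
                        rq->tag = blk_mq_get_tag(&data);

                /* report the mapped hctx even if no tag was obtained */
                if (hctx)
                        *hctx = data.hctx;
                return rq->tag != -1;
        }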

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

02 Mar, 2017

3 commits


24 Feb, 2017

1 commit

  • In blk_mq_sched_dispatch_requests(), we call blk_mq_sched_mark_restart()
    after we dispatch requests left over on our hardware queue dispatch
    list. This is so we'll go back and dispatch requests from the scheduler.
    In this case, it's only necessary to restart the hardware queue that we
    are running; there's no reason to run other hardware queues just because
    we are using shared tags.

    So, split out blk_mq_sched_mark_restart() into two operations, one for
    just the hardware queue and one for the whole request queue. The core
    code only needs the hctx variant, but I/O schedulers will want to use
    both.

    This also requires adjusting blk_mq_sched_restart_queues() to always
    check the queue restart flag, not just when using shared tags.
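
    The two variants after the split, roughly (flag names as in blk-mq at
    the time):

        static inline void
        blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
        {
                if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
                        set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
        }

        static inline void
        blk_mq_sched_mark_restart_queue(struct blk_mq_hw_ctx *hctx)
        {
                struct request_queue *q = hctx->queue;

                blk_mq_sched_mark_restart_hctx(hctx);
                if (!test_bit(QUEUE_FLAG_RESTART, &q->queue_flags))
                        set_bit(QUEUE_FLAG_RESTART, &q->queue_flags);
        }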

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

23 Feb, 2017

1 commit


18 Feb, 2017

2 commits


11 Feb, 2017

1 commit

  • The bio is used in bfq-mq's get_rq_priv to get the request group. We
    could pass the group directly here, but I thought that passing the
    bio was more general, giving the possibility of getting other pieces
    of information if needed.
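
    The hook signature after the change, as a sketch:

        struct elevator_mq_ops {
                /* other hooks omitted */
                int (*get_rq_priv)(struct request_queue *q,
                                   struct request *rq,
                                   struct bio *bio);   /* bio now passed */
        };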

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

09 Feb, 2017

1 commit


04 Feb, 2017

1 commit

  • If we end up doing a request-to-request merge when we have completed
    a bio-to-request merge, we free the request from deep down in that
    path. For blk-mq-sched, the merge path has to hold the appropriate
    lock, but we don't need it for freeing the request. And in fact
    holding the lock is problematic, since we are now calling the
    mq sched put_rq_private() hook with the lock held. Other call paths
    do not hold this lock.

    Fix this inconsistency by ensuring that the caller frees a merged
    request. Then we can do it outside of the lock, making it both more
    efficient and fixing the blk-mq-sched problem of invoking parts of
    the scheduler with an unknown lock state.
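
    The caller-frees pattern, sketched: blk_mq_sched_try_merge() hands any
    merged-away request back through its last argument, and the caller
    frees it after dropping the lock:

        struct request *free = NULL;
        bool merged;

        spin_lock(&ctx->lock);
        merged = blk_mq_sched_try_merge(q, bio, &free);
        spin_unlock(&ctx->lock);

        if (free)
                blk_mq_free_request(free);      /* freed outside the lock */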

    Reported-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Reviewed-by: Omar Sandoval

    Jens Axboe
     

03 Feb, 2017

1 commit


28 Jan, 2017

3 commits


27 Jan, 2017

4 commits

  • When we invoke dispatch_requests(), the scheduler empties everything
    into the passed-in list. This isn't always a good thing, since it
    means that we remove items that we could have potentially merged
    with.

    Change the function to dispatch single requests at a time. If
    we do that, we can back off exactly at the point where the device
    can't consume more IO, and leave the rest with the scheduler for
    better merging and future dispatch decision making.
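
    The resulting dispatch loop, roughly, with the new one-at-a-time
    ->dispatch_request() hook:

        do {
                struct request *rq;

                rq = e->type->ops.mq.dispatch_request(hctx);
                if (!rq)
                        break;          /* scheduler has nothing more */
                list_add(&rq->queuelist, &rq_list);
                /* stop pulling as soon as the device stops consuming */
        } while (blk_mq_dispatch_rq_list(hctx, &rq_list));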

    Signed-off-by: Jens Axboe
    Reviewed-by: Omar Sandoval
    Tested-by: Hannes Reinecke

    Jens Axboe
     
  • If we have both multiple hardware queues and shared tag map between
    devices, we need to ensure that we propagate the hardware queue
    restart bit higher up. This is because we can get into a situation
    where we don't have any IO pending on a hardware queue, yet we fail
    getting a tag to start new IO. If that happens, it's not enough to
    mark the hardware queue as needing a restart, we need to bubble
    that up to the higher level queue as well.
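
    A sketch of the propagation (flag names as in blk-mq at the time):

        static inline void blk_mq_sched_mark_restart(struct blk_mq_hw_ctx *hctx)
        {
                if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
                        set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
                        if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
                                struct request_queue *q = hctx->queue;

                                /* bubble the restart up to the queue */
                                if (!test_bit(QUEUE_FLAG_RESTART,
                                              &q->queue_flags))
                                        set_bit(QUEUE_FLAG_RESTART,
                                                &q->queue_flags);
                        }
                }
        }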

    Signed-off-by: Jens Axboe
    Reviewed-by: Omar Sandoval
    Tested-by: Hannes Reinecke

    Jens Axboe
     
  • We don't trigger this from the normal IO path, since we always use
    blocking allocations from there. But Bart saw it testing multipath
    dm, since that is a heavy user of atomic request allocations in
    the map and clone path.

    Reported-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • If we come in from blk_mq_alloc_request() with NOWAIT set in flags,
    we must ensure that we don't later overwrite that in
    blk_mq_sched_get_request(). Initialize alloc_data->flags before
    passing it in.
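
    In sketch form, at the caller (the surrounding names follow blk-mq of
    that era; the exact hunk may differ):

        /* carry caller flags such as BLK_MQ_REQ_NOWAIT into the
         * scheduler path instead of starting from zeroed flags */
        struct blk_mq_alloc_data alloc_data = { .flags = flags };

        rq = blk_mq_sched_get_request(q, NULL, op, &alloc_data);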

    Reported-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Jens Axboe
     

18 Jan, 2017

1 commit

  • Add Kconfig entries to manage which devices get assigned an MQ
    scheduler, and add a blk-mq flag for drivers to opt out of scheduling.
    The latter is useful for admin-type queues that still allocate a
    blk-mq queue and tag set, but aren't used for normal IO.
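
    The opt-out flag in use, sketched for a hypothetical driver's
    internal/admin tag set (other tag-set setup is elided):

        static int mydrv_init_admin_tagset(struct blk_mq_tag_set *set)
        {
                set->nr_hw_queues = 1;
                set->queue_depth = 32;
                set->flags = BLK_MQ_F_NO_SCHED; /* no IO scheduler here */
                return blk_mq_alloc_tag_set(set);
        }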

    Signed-off-by: Jens Axboe
    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval

    Jens Axboe