10 Oct, 2020

1 commit

  • After commit 923218f6166a ("blk-mq: don't allocate driver tag upfront
    for flush rq"), blk_mq_submit_bio() calls blk_insert_flush() directly
    to handle flush requests, rather than going through
    blk_mq_sched_insert_request(), when an elevator is in use.

    As a result, every flush request either has the RQF_FLUSH_SEQ flag set
    by the time blk_mq_sched_insert_request() is called, or has already
    been inserted into hctx->dispatch. So remove the dead code path.
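
    A minimal sketch of the resulting submit-side branching (simplified;
    surrounding variables such as rq, q and data.hctx are assumed from
    context, and this is not the exact kernel code):

        if (op_is_flush(bio->bi_opf)) {
                /* flush/FUA: hand the request to the flush state machine */
                blk_insert_flush(rq);
                blk_mq_run_hw_queue(data.hctx, true);
        } else if (q->elevator) {
                /* everything else still goes through the elevator */
                blk_mq_sched_insert_request(rq, false, true, true);
        }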

    Signed-off-by: Yufen Yu
    Signed-off-by: Jens Axboe

    Yufen Yu
     

04 Sep, 2020

2 commits

  • Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
    multiple reply queues with a single set of hostwide tags.

    In addition, these drivers want to use the interrupt assignment provided
    by pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
    CPU hotplug may cause in-flight IO completions to not be serviced when an
    interrupt is shut down. That problem was solved in commit bf0beec0607d
    ("blk-mq: drain I/O when all CPUs in a hctx are offline").

    However, to take advantage of that blk-mq feature, the HBA HW queues are
    required to be mapped to the blk-mq hctxs; to do that, the HBA HW queues
    need to be exposed to the upper layer.

    In making that transition, the per-SCSI command request tags are no
    longer unique per Scsi host - they are just unique per hctx. As such, the
    HBA LLDD would have to generate this tag internally, which has a certain
    performance overhead.

    Another problem is that blk-mq assumes the host may accept
    (Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092ef ("scsi:
    core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
    counter, which had prevented the LLDD from being sent more than .can_queue
    commands, was removed; however, it should still be ensured that the block
    layer does not issue more than .can_queue commands to the Scsi host.

    To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
    which may be requested at init time.

    New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
    tagset to indicate whether the shared sbitmap should be used.
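
    A hedged sketch of how a driver might ask for the shared sbitmap when
    setting up its tag set (the hba fields and the BLK_MQ_F_SHOULD_MERGE
    usage are illustrative; only BLK_MQ_F_TAG_HCTX_SHARED comes from this
    change):

        struct blk_mq_tag_set *set = &hba->tag_set;

        set->nr_hw_queues = hba->nr_reply_queues;  /* expose HW queues */
        set->queue_depth  = hba->can_queue;        /* hostwide depth   */
        set->flags        = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_TAG_HCTX_SHARED;

        ret = blk_mq_alloc_tag_set(set);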

    Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
    is still allocated per hctx; the reason is that if tags and requests were
    only allocated for a single hctx - like hctx0 - it could break block
    drivers which expect a request to be associated with a specific hctx,
    i.e. not always hctx0. This will introduce extra memory usage.

    This change is based on work originally from Ming Lei in [1] and from
    Bart's suggestion in [2].

    [0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
    [1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
    [2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be

    Signed-off-by: John Garry
    Tested-by: Don Brace #SCSI resv cmds patches used
    Tested-by: Douglas Gilbert
    Signed-off-by: Jens Axboe

    John Garry
     
  • Pass hctx/tagset flags argument down to blk_mq_init_tags() and
    blk_mq_free_tags() for selective init/free.

    For now, make it include the alloc policy flag, which can be evaluated
    when needed (in blk_mq_init_tags()).

    Signed-off-by: John Garry
    Tested-by: Douglas Gilbert
    Signed-off-by: Jens Axboe

    John Garry
     

17 Aug, 2020

1 commit

  • The SCHED_RESTART code path is relied on to re-run the queue for requests
    left in hctx->dispatch. Meanwhile, the SCHED_RESTART flag is checked when
    adding requests to hctx->dispatch.

    Memory barriers have to be used to order the following two pairs of
    operations:

    1) adding requests to hctx->dispatch and checking SCHED_RESTART in
    blk_mq_dispatch_rq_list()

    2) clearing SCHED_RESTART and checking if there is request in hctx->dispatch
    in blk_mq_sched_restart().

    Without the added memory barriers, either:

    1) blk_mq_sched_restart() may miss requests added to hctx->dispatch while
    blk_mq_dispatch_rq_list() observes SCHED_RESTART and does not re-run the
    queue on the dispatch side,

    or

    2) blk_mq_dispatch_rq_list() still sees SCHED_RESTART and does not re-run
    the queue on the dispatch side, while the check for requests in
    hctx->dispatch from blk_mq_sched_restart() is missed.

    An IO hang in the ltp/fs_fill test was reported by the kernel test
    robot:

    https://lkml.org/lkml/2020/7/26/77

    It turns out to be caused by the above out-of-order operations, and the
    IO hang can no longer be observed after applying this patch.
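
    A rough sketch of the required pairing (simplified; the exact barrier
    placement in the kernel may differ):

        /* dispatch side, blk_mq_dispatch_rq_list() */
        spin_lock(&hctx->lock);
        list_splice_tail_init(list, &hctx->dispatch);    /* 1a: add rqs    */
        spin_unlock(&hctx->lock);
        smp_mb();                                        /* order 1a vs 1b */
        if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
                blk_mq_run_hw_queue(hctx, true);         /* 1b: check flag */

        /* restart side, blk_mq_sched_restart() */
        clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state); /* 2a: clear flag */
        smp_mb__after_atomic();                          /* order 2a vs 2b */
        if (!list_empty_careful(&hctx->dispatch))        /* 2b: check list */
                blk_mq_run_hw_queue(hctx, true);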

    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Reported-by: kernel test robot
    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: David Jeffery
    Cc:
    Signed-off-by: Jens Axboe

    Ming Lei
     

30 Jun, 2020

4 commits

  • More and more drivers want to get requests queued from the block layer
    in batches, such as mmc and tcp-based storage drivers; current in-tree
    users include virtio-scsi, virtio-blk and nvme.

    For the 'none' scheduler, we already support batching dispatch.

    But with an io scheduler, we only take one request from the scheduler at
    a time and pass that single request to blk_mq_dispatch_rq_list(), which
    makes batching dispatch impossible when an io scheduler is applied. One
    reason is that we don't want to hurt sequential IO performance, because
    the chance of IO merging is reduced if more requests are dequeued from
    the scheduler queue.

    Try to support batching dispatch for io schedulers by starting with the
    following simple approach:

    1) still make sure we can get budget before dequeueing a request

    2) use hctx->dispatch_busy to evaluate whether the queue is busy; if it
    is busy, fall back to non-batching dispatch, otherwise dequeue as many
    requests as possible from the scheduler and pass them to
    blk_mq_dispatch_rq_list().

    Wrt. 2), we use a similar policy for 'none', and it turns out that SCSI
    SSD performance is improved a lot.

    In the future, maybe we can develop a more intelligent algorithm for
    batching dispatch.
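
    A simplified sketch of the batching idea (the helper names and
    surrounding variables are assumed; not the exact kernel code):

        unsigned int max = hctx->dispatch_busy ? 1 : q->nr_requests;
        unsigned int count = 0;
        LIST_HEAD(rq_list);

        while (count < max) {
                struct request *rq;

                if (!blk_mq_get_dispatch_budget(q))     /* 1) budget first  */
                        break;
                rq = e->type->ops.dispatch_request(hctx);
                if (!rq) {
                        blk_mq_put_dispatch_budget(q);  /* unused budget    */
                        break;
                }
                list_add_tail(&rq->queuelist, &rq_list);
                count++;                                /* 2) batch if idle */
        }
        if (count)
                blk_mq_dispatch_rq_list(hctx, &rq_list, count);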

    Baolin has tested this patch and found that MMC performance is improved[3].

    [1] https://lore.kernel.org/linux-block/20200512075501.GF1531898@T590/#r
    [2] https://lore.kernel.org/linux-block/fe6bd8b9-6ed9-b225-f80c-314746133722@grimberg.me/
    [3] https://lore.kernel.org/linux-block/CADBw62o9eTQDJ9RvNgEqSpXmg6Xcq=2TxH0Hfxhp29uF2W=TXA@mail.gmail.com/

    Signed-off-by: Ming Lei
    Tested-by: Baolin Wang
    Reviewed-by: Christoph Hellwig
    Cc: Sagi Grimberg
    Cc: Baolin Wang
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Pass the obtained budget count to blk_mq_dispatch_rq_list(), and prepare
    for supporting fully batched submission.

    With the obtained budget count, it is easier to put back unused budgets
    in case of .queue_rq failure.

    Meanwhile, remove the old 'got_budget' parameter.

    Signed-off-by: Ming Lei
    Tested-by: Baolin Wang
    Reviewed-by: Christoph Hellwig
    Cc: Sagi Grimberg
    Cc: Baolin Wang
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • All requests in the 'list' of blk_mq_dispatch_rq_list belong to the same
    hctx, so it is better to pass the hctx instead of the request queue,
    because blk-mq's dispatch target is the hctx rather than the request
    queue.

    Signed-off-by: Ming Lei
    Tested-by: Baolin Wang
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Johannes Thumshirn
    Cc: Sagi Grimberg
    Cc: Baolin Wang
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • The blk-mq budget is abstracted from SCSI's device queue depth, and it
    is always per request queue rather than per hctx.

    It can be quite absurd to get budget from one hctx, then dequeue a
    request from the scheduler queue, and find that the request may not
    belong to this hctx, at least for bfq and deadline.

    So fix the mess and always pass the request queue to the get/put budget
    callbacks.
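
    A sketch of the resulting hook shape (excerpt-style; other members of
    the ops structure are elided):

        struct blk_mq_ops {
                /* ... other hooks elided ... */
                bool (*get_budget)(struct request_queue *q);
                void (*put_budget)(struct request_queue *q);
        };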

    Signed-off-by: Ming Lei
    Tested-by: Baolin Wang
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Douglas Anderson
    Reviewed-by: Sagi Grimberg
    Cc: Sagi Grimberg
    Cc: Baolin Wang
    Cc: Christoph Hellwig
    Cc: Douglas Anderson
    Signed-off-by: Jens Axboe

    Ming Lei
     

24 Apr, 2020

1 commit

  • Flushes bypass the I/O scheduler and get added to hctx->dispatch
    in blk_mq_sched_bypass_insert. This can happen while a kworker is running
    the hctx->run_work work item and is past the point in
    blk_mq_sched_dispatch_requests where hctx->dispatch is checked.

    The blk_mq_do_dispatch_sched call is not guaranteed to end in bounded time,
    because the I/O scheduler can feed an arbitrary number of commands.

    Since we have only one hctx->run_work, the commands waiting in
    hctx->dispatch will wait an arbitrary length of time for run_work to be
    rerun.

    A similar phenomenon exists with dispatches from the software queue.

    The solution is to poll hctx->dispatch in blk_mq_do_dispatch_sched and
    blk_mq_do_dispatch_ctx, return from the run_work handler, and let it be
    rerun.
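
    A simplified sketch of that polling (the return-value convention and the
    dispatch_one_from_elevator() helper are hypothetical, for illustration
    only):

        static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
        {
                int dispatched = 0;

                for (;;) {
                        /* requests are parked in hctx->dispatch: stop and
                         * let the run_work handler be rescheduled */
                        if (!list_empty_careful(&hctx->dispatch))
                                return -EAGAIN;

                        if (!dispatch_one_from_elevator(hctx))
                                break;
                        dispatched++;
                }

                return dispatched;
        }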

    Signed-off-by: Salman Qazi
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Salman Qazi
     

21 Apr, 2020

1 commit

  • If a thread running blk-mq code tries to get budget and fails, it
    immediately stops doing work and assumes that whenever budget is freed
    up, the queues will be kicked and whatever work the thread was trying
    to do will be tried again.

    One path where budget is freed and queues are kicked in the normal
    case can be seen in scsi_finish_command(). Specifically:
    - scsi_finish_command()
      - scsi_device_unbusy()
        - # Decrement "device_busy", AKA release budget
      - scsi_io_completion()
        - scsi_end_request()
          - blk_mq_run_hw_queues()

    The above is all well and good. The problem comes up when a thread
    claims the budget but then releases it without actually dispatching
    any work. Since we didn't schedule any work we'll never run the path
    of finishing work / kicking the queues.

    This isn't often actually a problem which is why this issue has
    existed for a while and nobody noticed. Specifically we only get into
    this situation when we unexpectedly found that we weren't going to do
    any work. Code that later receives new work kicks the queues. All
    good, right?

    The problem shows up, however, if timing is just wrong and we hit a
    race. To see this race let's think about the case where we only have
    a budget of 1 (only one thread can hold budget). Now imagine that a
    thread got budget and then decided not to dispatch work. It's about
    to call put_budget() but then the thread gets context switched out for
    a long, long time. While in this state, any and all kicks of the
    queue (like the when we received new work) will be no-ops because
    nobody can get budget. Finally the thread holding budget gets to run
    again and returns. All the normal kicks will have been no-ops and we
    have an I/O stall.

    As you can see from the above, you need just the right timing to see
    the race. To start with, it only happens if we thought we had work,
    actually managed to get the budget, but then actually didn't have any
    work. That's pretty rare to start with. Even then, there's
    usually a very small amount of time between realizing that there's no
    work and putting the budget. During this small amount of time new
    work has to come in and the queue kick has to make it all the way to
    trying to get the budget and fail. It's pretty unlikely.

    One case where this could have failed is illustrated by an example of
    threads running blk_mq_do_dispatch_sched():

    * Threads A and B both run has_work() at the same time with the same
    "hctx". Imagine has_work() is exact. There's no lock, so it's OK
    if Thread A and B both get back true.
    * Thread B gets interrupted for a long time right after it decides
    that there is work. Maybe its CPU gets an interrupt and the
    interrupt handler is slow.
    * Thread A runs, get budget, dispatches work.
    * Thread A's work finishes and budget is released.
    * Thread B finally runs again and gets budget.
    * Since Thread A already took care of the work and no new work has
    come in, Thread B will get NULL from dispatch_request(). I believe
    this is specifically why dispatch_request() is allowed to return
    NULL in the first place if has_work() must be exact.
    * Thread B will now be holding the budget and is about to call
    put_budget(), but hasn't called it yet.
    * Thread B gets interrupted for a long time (again). Dang interrupts.
    * Now Thread C (maybe with a different "hctx" but the same queue)
    comes along and runs blk_mq_do_dispatch_sched().
    * Thread C won't do anything because it can't get budget.
    * Finally Thread B will run again and put the budget without kicking
    any queues.

    Even though the example above is with blk_mq_do_dispatch_sched() I
    believe the race is possible any time someone is holding budget but
    doesn't do work.

    Unfortunately, the unlikely has become more likely if you happen to be
    using the BFQ I/O scheduler. BFQ, by design, sometimes returns "true"
    for has_work() but then NULL for dispatch_request() and stays in this
    state for a while (currently up to 9 ms). Suddenly you only need one
    race to hit, not two races in a row. With my current setup this is
    easy to reproduce in reboot tests and traces have actually shown that
    we hit a race similar to the one described above.

    Note that we only need to fix blk_mq_do_dispatch_sched() and
    blk_mq_do_dispatch_ctx() and not the other places that put budget. In
    other cases we know that we have work to do on at least one "hctx" and
    code already exists to kick that "hctx"'s queue. When that work
    finally finishes all the queues will be kicked using the normal flow.

    One last note is that (at least in the SCSI case) budget is shared by
    all "hctx"s that have the same queue. Thus we need to make sure to
    kick the whole queue, not just re-run dispatching on a single "hctx".
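
    A sketch of the fix idea (simplified; the delay constant and the exact
    call site are assumed):

        rq = e->type->ops.dispatch_request(hctx);
        if (!rq) {
                /* We took budget but got no work: give the budget back and
                 * kick every hctx of the queue after a short delay, since
                 * budget is shared queue-wide (e.g. SCSI device_busy). */
                blk_mq_put_dispatch_budget(hctx);
                blk_mq_delay_run_hw_queues(hctx->queue, BLK_MQ_BUDGET_DELAY);
                break;
        }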

    Signed-off-by: Douglas Anderson
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Douglas Anderson
     

12 Mar, 2020

1 commit

  • commit 01e99aeca397 ("blk-mq: insert passthrough request into
    hctx->dispatch directly") may end up adding flush requests to the tail
    of the dispatch list, depending on the 'at_head' parameter of
    blk_mq_sched_insert_request.

    It turns out this causes a performance regression on NCQ controllers,
    because flush is a non-NCQ command which can't be queued while any NCQ
    command is in flight. When the flush rq is added to the front of
    hctx->dispatch, it is easier for extra time to be added to the flush
    rq's latency, compared with adding it to the tail of the dispatch queue,
    because of S_SCHED_RESTART; the chance of flush merging is then
    increased, and fewer flush requests may be issued to the controller.

    So always insert flush requests at the front of the dispatch queue, just
    like before commit 01e99aeca397 ("blk-mq: insert passthrough request
    into hctx->dispatch directly") was applied.

    Cc: Damien Le Moal
    Cc: Shinichiro Kawasaki
    Reported-by: Shinichiro Kawasaki
    Fixes: 01e99aeca397 ("blk-mq: insert passthrough request into hctx->dispatch directly")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

25 Feb, 2020

1 commit

  • For some reason, a device may get into a state where it can't handle
    FS requests, so STS_RESOURCE is always returned and the FS request is
    added to hctx->dispatch. However, a passthrough request may be required
    at that time to fix the problem. If the passthrough request is added to
    the scheduler queue, there is no chance for blk-mq to dispatch it, given
    that we prioritize requests in hctx->dispatch. The FS IO request may
    then never complete, and an IO hang results.

    So the passthrough request has to be added to hctx->dispatch directly
    to fix the IO hang.

    Fix this issue by inserting passthrough requests into hctx->dispatch
    directly, together with adding FS requests to the tail of hctx->dispatch
    in blk_mq_dispatch_rq_list(). Actually we add FS requests to the tail of
    hctx->dispatch by default; see blk_mq_request_bypass_insert().

    This then becomes consistent with the original legacy IO request path,
    in which passthrough requests were always added to q->queue_head.
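
    A simplified sketch of the insert-side check in
    blk_mq_sched_insert_request() (details assumed; not the exact kernel
    code):

        if (blk_rq_is_passthrough(rq) || (rq->rq_flags & RQF_FLUSH_SEQ)) {
                /* skip the elevator and go straight to hctx->dispatch,
                 * then kick the hardware queue as usual */
                blk_mq_request_bypass_insert(rq, at_head, false);
                goto run;
        }
        /* FS requests keep going through the I/O scheduler */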

    Cc: Dongli Zhang
    Cc: Christoph Hellwig
    Cc: Ewan D. Milne
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

26 Sep, 2019

1 commit

  • Commit c48dac137a62 ("block: don't hold q->sysfs_lock in elevator_init_mq")
    removed q->sysfs_lock from elevator_init_mq(), but forgot to deal with the
    lockdep_assert_held() called in blk_mq_sched_free_requests(), which is run
    in the failure path of elevator_init_mq().

    blk_mq_sched_free_requests() is called in the following 3 functions:

    elevator_init_mq()
    elevator_exit()
    blk_cleanup_queue()

    In blk_cleanup_queue(), blk_mq_sched_free_requests() is called right
    after 'mutex_lock(&q->sysfs_lock)', i.e. with the lock held.

    So move the lockdep_assert_held() from blk_mq_sched_free_requests()
    into elevator_exit() to fix the report from syzbot.
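
    A sketch of where the assertion ends up (simplified; modelled on the
    helper shape in block/blk.h at the time):

        static inline void elevator_exit(struct request_queue *q,
                                         struct elevator_queue *e)
        {
                lockdep_assert_held(&q->sysfs_lock);

                blk_mq_sched_free_requests(q);
                __elevator_exit(q, e);
        }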

    Reported-by: syzbot+da3b7677bb913dc1b737@syzkaller.appspotmail.com
    Fixes: c48dac137a62 ("block: don't hold q->sysfs_lock in elevator_init_mq")
    Reviewed-by: Bart Van Assche
    Reviewed-by: Damien Le Moal
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

03 Jul, 2019

1 commit

  • No code that occurs between blk_mq_get_ctx() and blk_mq_put_ctx() depends
    on preemption being disabled for its correctness. Since removing the CPU
    preemption calls does not measurably affect performance, simplify the
    blk-mq code by removing the blk_mq_put_ctx() function and also by not
    disabling preemption in blk_mq_get_ctx().

    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

21 Jun, 2019

1 commit

  • We only need the number of segments in the blk-mq submission path.
    Remove the field from struct bio, and return it from a variant of
    blk_queue_split instead, so that it can be passed as an argument to
    those functions that need the value.

    This also means we stop recounting segments except for cloning
    and partial segments.

    To keep the number of arguments in this hot path down, remove the
    pointless struct request_queue argument from any of the functions that
    had one and grew a nr_segs argument.
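
    A hedged sketch of the resulting call shape in the submission path
    (argument names and the surrounding request allocation are assumed):

        unsigned int nr_segs;

        __blk_queue_split(q, &bio, &nr_segs);     /* split, report segments */
        /* ... allocate the request ... */
        blk_mq_bio_to_request(rq, bio, nr_segs);  /* no recount needed */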

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

13 Jun, 2019

1 commit

  • blk_mq_sched_free_requests() may be called in a failure path in which
    q->elevator may not be set up yet, so remove WARN_ON(!q->elevator) from
    blk_mq_sched_free_requests() to avoid the false positive.

    This function is actually safe to call in case of !q->elevator because
    hctx->sched_tags is checked.

    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Yi Zhang
    Fixes: c3e2219216c9 ("block: free sched's request pool in blk_cleanup_queue")
    Reported-by: syzbot+b9d0d56867048c7bcfde@syzkaller.appspotmail.com
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

07 Jun, 2019

1 commit

  • In theory, the IO scheduler belongs to the request queue, and the request
    pool of sched tags belongs to the request queue too.

    However, the current tag allocation interfaces are re-used for both
    driver tags and sched tags, and driver tags are definitely host wide and
    don't belong to any request queue, the same as their request pool. So we
    need the tagset instance for freeing requests of sched tags.

    Meanwhile, blk_mq_free_tag_set() often follows blk_cleanup_queue() in the
    non-BLK_MQ_F_TAG_SHARED case, which requires that the request pool of
    sched tags be freed before blk_mq_free_tag_set() is called.

    Commit 47cdee29ef9d94e ("block: move blk_exit_queue into __blk_release_queue")
    moved blk_exit_queue into __blk_release_queue to simplify the fast path
    in generic_make_request(), which then causes an oops when freeing
    requests of sched tags in __blk_release_queue().

    Fix the above issue by moving the freeing of the sched tags request pool
    into blk_cleanup_queue(); this is safe because the queue has been frozen
    and there are no in-queue requests at that time. Freeing the sched tags
    themselves has to be kept in the queue's release handler because there
    might be uncompleted dispatch activity which might refer to sched tags.
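
    A sketch of the ordering this establishes in blk_cleanup_queue()
    (simplified):

        /* queue is frozen and drained at this point */
        mutex_lock(&q->sysfs_lock);
        if (q->elevator)
                blk_mq_sched_free_requests(q);  /* free sched request pool */
        mutex_unlock(&q->sysfs_lock);
        /* the sched tags themselves are still freed later, from the
         * queue's release handler */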

    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Fixes: 47cdee29ef9d94e485eb08f962c74943023a5271 ("block: move blk_exit_queue into __blk_release_queue")
    Tested-by: Yi Zhang
    Reported-by: kernel test robot
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

04 May, 2019

1 commit

  • Just like aio/io_uring, we need to grab two refcounts for queuing one
    request: one for submission, another for completion.

    If the request isn't queued from the plug code path, the refcount grabbed
    in generic_make_request() serves for submission. In theory, this
    refcount should have been released after the submission (async run queue)
    is done. blk_freeze_queue() works together with blk_sync_queue() to
    avoid a race between queue cleanup and IO submission; given that async
    run queue activities are canceled because hctx->run_work is scheduled
    with the refcount held, it is fine not to hold the refcount when running
    the run queue work function to dispatch IO.

    However, if the request is staged in the plug list and finally queued
    from the plug code path, the refcount on the submission side is actually
    missed. We may then start to run the queue after the queue has been
    removed, because the queue's kobject refcount isn't guaranteed to be
    grabbed in the plug-list flushing context, and a kernel oops is
    triggered; see the following race:

    blk_mq_flush_plug_list():
        blk_mq_sched_insert_requests()
            insert requests to sw queue or scheduler queue
        blk_mq_run_hw_queue

    Because of the concurrent run queue, all requests inserted above may be
    completed before the above blk_mq_run_hw_queue is called. The queue can
    then be freed during the above blk_mq_run_hw_queue().

    Fix the issue by grabbing .q_usage_counter before calling
    blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is safe
    because the queue is definitely alive before inserting the request.
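
    A minimal sketch of the fix idea (simplified; the exact call site is
    assumed):

        percpu_ref_get(&q->q_usage_counter);
        blk_mq_sched_insert_requests(hctx, ctx, &rq_list, from_schedule);
        /* insert + run happen while the reference pins the queue */
        percpu_ref_put(&q->q_usage_counter);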

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reviewed-by: Bart Van Assche
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

05 Apr, 2019

1 commit

  • blk_mq_try_issue_directly() can return BLK_STS*_RESOURCE for requests that
    have been queued. If that happens when blk_mq_try_issue_directly() is called
    by the dm-mpath driver then dm-mpath will try to resubmit a request that is
    already queued and a kernel crash follows. Since it is nontrivial to fix
    blk_mq_request_issue_directly(), revert the blk_mq_request_issue_directly()
    changes that went into kernel v5.0.

    This patch reverts the following commits:
    * d6a51a97c0b2 ("blk-mq: replace and kill blk_mq_request_issue_directly") # v5.0.
    * 5b7a6f128aad ("blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests") # v5.0.
    * 7f556a44e61d ("blk-mq: refactor the code of issue request directly") # v5.0.

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: James Smart
    Cc: Dongli Zhang
    Cc: Laurence Oberman
    Cc:
    Reported-by: Laurence Oberman
    Tested-by: Laurence Oberman
    Fixes: 7f556a44e61d ("blk-mq: refactor the code of issue request directly") # v5.0.
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

01 Feb, 2019

1 commit

  • Currently, the queue mapping result is saved in a two-dimensional
    array. In the hot path, to get a hctx, we need to do the following:

    q->queue_hw_ctx[q->tag_set->map[type].mq_map[cpu]]

    This isn't very efficient. We could save the queue mapping result
    directly in the ctx, one entry per hctx type, like:

    ctx->hctxs[type]
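
    A hedged sketch of the cached lookup (simplified from the mapping
    helper; details assumed):

        static inline struct blk_mq_hw_ctx *
        blk_mq_map_queue(struct request_queue *q, unsigned int flags,
                         struct blk_mq_ctx *ctx)
        {
                enum hctx_type type = HCTX_TYPE_DEFAULT;

                if (flags & REQ_HIPRI)
                        type = HCTX_TYPE_POLL;
                else if ((flags & REQ_OP_MASK) == REQ_OP_READ)
                        type = HCTX_TYPE_READ;

                /* one dereference, no map[] walk in the hot path */
                return ctx->hctxs[type];
        }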

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

18 Dec, 2018

2 commits

  • When a request is added to the rq list of a sw queue (ctx), the rq may
    belong to a different type of hctx, especially after multiple queue
    mappings are introduced.

    So when dispatching requests from the sw queue via blk_mq_flush_busy_ctxs()
    or blk_mq_dequeue_from_ctx(), a request belonging to a hctx of another
    queue type can be dispatched to the current hctx when the read queue or
    poll queue is enabled.

    This patch fixes the issue by introducing per-queue-type lists.
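
    A sketch of the resulting per-type lists in the software queue
    (excerpt-style, other members elided; this matches the single-cacheline
    layout noted below):

        struct blk_mq_ctx {
                struct {
                        spinlock_t              lock;
                        struct list_head        rq_lists[HCTX_MAX_TYPES];
                } ____cacheline_aligned_in_smp;
                /* ... */
        };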

    Cc: Christoph Hellwig
    Signed-off-by: Ming Lei

    Changed by me to not use separately cacheline aligned lists, just
    place them all in the same cacheline where we had just the one list
    and lock before.

    Signed-off-by: Jens Axboe

    Ming Lei
     
  • For a zoned block device using mq-deadline, if a write request for a
    zone is received while another write was already dispatched for the same
    zone, dd_dispatch_request() will return NULL and the newly inserted
    write request is kept in the scheduler queue waiting for the ongoing
    zone write to complete. With this behavior, when no other request has
    been dispatched, rq_list in blk_mq_sched_dispatch_requests() is empty
    and blk_mq_sched_mark_restart_hctx() is not called. This in turn leads to
    the __blk_mq_free_request() call of blk_mq_sched_restart() not running the
    queue when the already dispatched write request completes. The newly
    inserted request stays stuck in the scheduler queue until eventually
    another request is submitted.

    This problem does not affect SCSI disks, as the SCSI stack handles queue
    restart on request completion. However, the problem can be triggered
    with the null_blk driver with zoned mode enabled.

    Fix this by always requesting a queue restart in dd_dispatch_request()
    if no request was dispatched while WRITE requests are queued.
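
    A sketch of the fix idea (simplified; close to, but not necessarily
    identical to, the final code):

        static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
        {
                struct deadline_data *dd = hctx->queue->elevator->elevator_data;
                struct request *rq;

                spin_lock(&dd->lock);
                rq = __dd_dispatch_request(dd);
                /* nothing dispatched but writes are queued: ask blk-mq to
                 * restart the queue when the in-flight write completes */
                if (!rq && blk_queue_is_zoned(hctx->queue) &&
                    !list_empty(&dd->fifo_list[WRITE]))
                        blk_mq_sched_mark_restart_hctx(hctx);
                spin_unlock(&dd->lock);

                return rq;
        }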

    Fixes: 5700f69178e9 ("mq-deadline: Introduce zone locking support")
    Cc:
    Signed-off-by: Damien Le Moal

    Add missing export of blk_mq_sched_restart()

    Signed-off-by: Jens Axboe

    Damien Le Moal
     

16 Dec, 2018

1 commit

  • It is not necessary to issue requests directly with bypass 'true'
    in blk_mq_sched_insert_requests and handle the non-issued requests
    itself. Just set bypass to 'false' and let blk_mq_try_issue_directly
    handle them completely. Remove the blk_rq_can_direct_dispatch check,
    because blk_mq_try_issue_directly can handle it well. If a request is
    not direct-issued successfully, insert the rest.

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

21 Nov, 2018

1 commit

  • If the first request allocated and issued by a process is a passthrough
    request, we don't set up an IO context for it. Ensure that
    blk_mq_sched_assign_ioc() ignores a NULL io_context.
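
    A minimal sketch of the guard (simplified):

        void blk_mq_sched_assign_ioc(struct request *rq)
        {
                struct io_context *ioc = current->io_context;

                /* passthrough requests may be issued before the task has
                 * ever acquired an io_context */
                if (!ioc)
                        return;

                /* ... look up / create the icq as before ... */
        }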

    Fixes: e2b3fa5af70c ("block: Remove bio->bi_ioc")
    Reported-by: Ming Lei
    Tested-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jens Axboe
     

20 Nov, 2018

1 commit

  • bio->bi_ioc is never set, so it is always NULL. Remove references to it in
    bio_disassociate_task() and in rq_ioc() and delete this field from
    struct bio. With this change, rq_ioc() always returns
    current->io_context without the need for a bio argument. Further
    simplify the code and make it more readable by also removing this
    helper, which also allows blk_mq_sched_assign_ioc() to be simplified by
    removing its bio argument.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Adam Manzanares
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

16 Nov, 2018

1 commit

  • With the legacy request path gone there is no good reason to keep
    queue_lock as a pointer, we can always use the embedded lock now.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig

    Fixed floppy and blk-cgroup missing conversions and half done edits.

    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

08 Nov, 2018

4 commits

  • It's somewhat strange to have a list insertion function that
    relies on the fact that the caller has mapped things correctly.
    Pass in the hardware queue directly for insertion, which makes
    for a much cleaner interface and implementation.

    Reviewed-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We call blk_mq_map_queue() a lot, at least two times for each
    request per IO, sometimes more, and we now have an indirect
    call in that function as well. Cache the mapping so we don't
    have to re-call blk_mq_map_queue() for the same request
    multiple times.
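
    A sketch of the caching (excerpt-style; other members of struct request
    are elided):

        struct request {
                struct request_queue    *q;
                struct blk_mq_ctx       *mq_ctx;
                struct blk_mq_hw_ctx    *mq_hctx;  /* cached mapping */
                /* ... */
        };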

    Reviewed-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The mapping used to be dependent on just the CPU location, but
    now it's a tuple of (type, cpu) instead. This is a prep patch
    for allowing a single software queue to map to multiple hardware
    queues. No functional changes in this patch.

    This changes the software queue count to an unsigned short
    to save a bit of space. We can still support 64K-1 CPUs,
    which should be enough. Add a check to catch a wrap.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Prep patch for being able to place requests based not just on
    CPU location, but also on the type of request.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe