08 Apr, 2022

1 commit

  • commit 572299f03afd676dd4e20669cdaf5ed0fe1379d4 upstream.

    When IO requests are made continuously and the target block device
    handles requests faster than they arrive, the request dispatch loop
    keeps on dispatching the arriving requests for a very long time, more
    than a minute. Since the loop runs as a workqueue worker task, the
    very long loop duration triggers a workqueue watchdog timeout and BUG [1].

    To avoid the very long loop duration, break the loop periodically. While
    an opportunity to dispatch requests still exists, check need_resched(). If
    need_resched() returns true, the dispatch loop has already consumed its
    time slice, so reschedule the dispatch work and break the loop. With heavy
    IO load, need_resched() does not return true for 20~30 seconds. To cover
    such a case, check the time spent in the dispatch loop with jiffies. If
    more than 1 second is spent, reschedule the dispatch work and break the
    loop.
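
    As a rough illustration, the bounded loop described above can be sketched
    as follows (a simplified sketch, not the upstream diff; need_resched(),
    jiffies and blk_mq_delay_run_hw_queue() are real kernel APIs, but the loop
    skeleton and the one-second budget are paraphrased from the text above):

        unsigned long end = jiffies + HZ;       /* 1 second budget for this run */

        do {
                /* ... pull a batch from the scheduler and dispatch it ... */

                /*
                 * Bail out periodically: either the worker has used up its
                 * time slice, or more than a second was spent in this loop.
                 */
                if (need_resched() || time_is_before_jiffies(end)) {
                        blk_mq_delay_run_hw_queue(hctx, 0);     /* requeue the dispatch work */
                        break;
                }
        } while (dispatched);   /* placeholder loop condition */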

    [1]

    [ 609.691437] BUG: workqueue lockup - pool cpus=10 node=1 flags=0x0 nice=-20 stuck for 35s!
    [ 609.701820] Showing busy workqueues and worker pools:
    [ 609.707915] workqueue events: flags=0x0
    [ 609.712615] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
    [ 609.712626] pending: drm_fb_helper_damage_work [drm_kms_helper]
    [ 609.712687] workqueue events_freezable: flags=0x4
    [ 609.732943] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
    [ 609.732952] pending: pci_pme_list_scan
    [ 609.732968] workqueue events_power_efficient: flags=0x80
    [ 609.751947] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
    [ 609.751955] pending: neigh_managed_work
    [ 609.752018] workqueue kblockd: flags=0x18
    [ 609.769480] pwq 21: cpus=10 node=1 flags=0x0 nice=-20 active=3/256 refcnt=4
    [ 609.769488] in-flight: 1020:blk_mq_run_work_fn
    [ 609.769498] pending: blk_mq_timeout_work, blk_mq_run_work_fn
    [ 609.769744] pool 21: cpus=10 node=1 flags=0x0 nice=-20 hung=35s workers=2 idle: 67
    [ 639.899730] BUG: workqueue lockup - pool cpus=10 node=1 flags=0x0 nice=-20 stuck for 66s!
    [ 639.909513] Showing busy workqueues and worker pools:
    [ 639.915404] workqueue events: flags=0x0
    [ 639.920197] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
    [ 639.920215] pending: drm_fb_helper_damage_work [drm_kms_helper]
    [ 639.920365] workqueue kblockd: flags=0x18
    [ 639.939932] pwq 21: cpus=10 node=1 flags=0x0 nice=-20 active=3/256 refcnt=4
    [ 639.939942] in-flight: 1020:blk_mq_run_work_fn
    [ 639.939955] pending: blk_mq_timeout_work, blk_mq_run_work_fn
    [ 639.940212] pool 21: cpus=10 node=1 flags=0x0 nice=-20 hung=66s workers=2 idle: 67

    Fixes: 6e6fcbc27e778 ("blk-mq: support batching dispatch in case of io")
    Signed-off-by: Shin'ichiro Kawasaki
    Cc: stable@vger.kernel.org # v5.10+
    Link: https://lore.kernel.org/linux-block/20220310091649.zypaem5lkyfadymg@shindev/
    Link: https://lore.kernel.org/r/20220318022641.133484-1-shinichiro.kawasaki@wdc.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Shin'ichiro Kawasaki
     

28 Jul, 2021

1 commit

  • If the blk_mq_sched_alloc_tags() -> blk_mq_alloc_rqs() call fails, then we
    call blk_mq_sched_free_tags() -> blk_mq_free_rqs().

    It is incorrect to do so, as any rqs would have already been freed in the
    blk_mq_alloc_rqs() call.

    Fix by only calling blk_mq_free_rq_map() directly.

    Fixes: 6917ff0b5bd41 ("blk-mq-sched: refactor scheduler initialization")
    Signed-off-by: John Garry
    Reviewed-by: Ming Lei
    Link: https://lore.kernel.org/r/1627378373-148090-1-git-send-email-john.garry@huawei.com
    Signed-off-by: Jens Axboe

    John Garry
     

25 Jun, 2021

1 commit

  • Lockdep complains about lock inversion between ioc->lock and bfqd->lock:

    bfqd -> ioc:
    put_io_context+0x33/0x90 -> ioc->lock grabbed
    blk_mq_free_request+0x51/0x140
    blk_put_request+0xe/0x10
    blk_attempt_req_merge+0x1d/0x30
    elv_attempt_insert_merge+0x56/0xa0
    blk_mq_sched_try_insert_merge+0x4b/0x60
    bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed
    blk_mq_sched_insert_requests+0xd6/0x2b0
    blk_mq_flush_plug_list+0x154/0x280
    blk_finish_plug+0x40/0x60
    ext4_writepages+0x696/0x1320
    do_writepages+0x1c/0x80
    __filemap_fdatawrite_range+0xd7/0x120
    sync_file_range+0xac/0xf0

    ioc->bfqd:
    bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed
    put_io_context_active+0x78/0xb0 -> ioc->lock grabbed
    exit_io_context+0x48/0x50
    do_exit+0x7e9/0xdd0
    do_group_exit+0x54/0xc0

    To avoid this inversion we change blk_mq_sched_try_insert_merge() to not
    free the merged request but rather leave that up to the caller, similarly
    to blk_mq_sched_try_merge(). And in bfq_insert_requests() we make sure
    to free all the merged requests after dropping bfqd->lock.
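
    Schematically, the pattern looks like the sketch below (illustrative, not
    the actual patch; the freeing helper and the new third argument of
    blk_mq_sched_try_insert_merge() are assumptions based on the commit text):

        /*
         * Collect merged-away requests while holding bfqd->lock and free them
         * only after the lock is dropped, so ioc->lock is never taken under
         * bfqd->lock.
         */
        static void sketch_bfq_insert_request(struct blk_mq_hw_ctx *hctx,
                                              struct request *rq, bool at_head)
        {
                struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
                LIST_HEAD(free);        /* requests to put after unlocking */

                spin_lock_irq(&bfqd->lock);
                /* blk_mq_sched_try_insert_merge(q, rq, &free) parks any merged
                 * request on 'free' instead of freeing it under the lock */
                spin_unlock_irq(&bfqd->lock);

                blk_mq_free_requests(&free);    /* drop ioc references outside bfqd->lock */
        }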

    Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
    Reviewed-by: Ming Lei
    Acked-by: Paolo Valente
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.cz
    Signed-off-by: Jens Axboe

    Jan Kara
     

18 Jun, 2021

2 commits

  • The insert_requests and dispatch_request elevator operations are
    mandatory for the correct execution of an elevator, and all implemented
    elevators (bfq, kyber and mq-deadline) implement them. As a result,
    there is no need to check for these operations before calling them when
    a queue has an elevator set. This simplifies the code in
    __blk_mq_sched_dispatch_requests() and blk_mq_sched_insert_request().

    To prevent out-of-tree elevators from crashing the kernel in case of a bad
    implementation, add a check in elv_register() to verify that these
    operations are implemented.
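
    A minimal sketch of such a registration-time check (illustrative; the
    placement inside elv_register() is abbreviated):

        int sketch_elv_register(struct elevator_type *e)
        {
                /* insert_requests and dispatch_request are mandatory hooks */
                if (!e->ops.insert_requests || !e->ops.dispatch_request)
                        return -EINVAL;

                /* ... rest of elv_register() ... */
                return 0;
        }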

    A small, probably not significant, IOPS improvement of 0.1% is observed
    with this patch applied (4.117 MIOPS to 4.123 MIOPS, average of 20 fio
    runs doing 4K random direct reads with psync and 32 jobs).

    Signed-off-by: Damien Le Moal
    Reviewed-by: Johannes Thumshirn
    Link: https://lore.kernel.org/r/20210618015922.713999-1-damien.lemoal@wdc.com
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • The tagset can't be used after blk_cleanup_queue() returns because
    freeing the tagset usually follows blk_cleanup_queue(). Commit d97e594c5166
    ("blk-mq: Use request queue-wide tags for tagset-wide sbitmap") adds a
    check on q->tag_set->flags in blk_mq_exit_sched(), and causes a
    use-after-free.

    Fix it by using hctx->flags instead.
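
    Roughly, the fix amounts to the following (simplified sketch, not the
    actual diff):

        /* blk_mq_exit_sched(), simplified: q->tag_set may already have been
         * freed by the driver here, but each hctx caches the same flags. */
        struct blk_mq_hw_ctx *hctx;
        unsigned int flags = 0;
        unsigned int i;

        queue_for_each_hw_ctx(q, hctx, i) {
                /* before the fix: flags = q->tag_set->flags;  (use-after-free) */
                flags = hctx->flags;
                /* ... free this hctx's sched tags ... */
        }

        /* 'flags' then decides whether the queue-wide sched tags used for the
         * tagset-wide sbitmap case also need tearing down */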

    Reported-by: syzbot+77ba3d171a25c56756ea@syzkaller.appspotmail.com
    Fixes: d97e594c5166 ("blk-mq: Use request queue-wide tags for tagset-wide sbitmap")
    Cc: John Garry
    Signed-off-by: Ming Lei
    Tested-by: John Garry
    Reviewed-by: John Garry
    Link: https://lore.kernel.org/r/20210609063046.122843-1-ming.lei@redhat.com
    Signed-off-by: Jens Axboe

    Ming Lei
     

04 Jun, 2021

1 commit

  • Provided the device driver does not implement dispatch budget accounting
    (which only SCSI does) the loop in __blk_mq_do_dispatch_sched() pulls
    requests from the IO scheduler as long as it is willing to give out any.
    That defeats scheduling heuristics inside the scheduler by creating a
    false impression that the device can take more IO when it in fact
    cannot.

    For example, with the BFQ IO scheduler on top of a virtio-blk device,
    setting the blkio cgroup weight has barely any impact on the observed
    throughput of async IO because __blk_mq_do_dispatch_sched() always sucks
    out all the IO queued in BFQ. BFQ first submits IO from higher weight
    cgroups but when that is all dispatched, it will give out IO of lower
    weight cgroups as well. And then we have to wait for all this IO to be
    dispatched to the disk (which means a lot of it actually has to complete)
    before the IO scheduler is queried again for dispatching more requests.
    This completely destroys any service differentiation.

    So grab a request tag for a request pulled out of the IO scheduler already
    in __blk_mq_do_dispatch_sched() and do not pull any more requests if we
    cannot get it, because we are unlikely to be able to dispatch it. That
    way only a single request is going to wait in the dispatch list for some
    tag to free.
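
    In sketch form, the change to the dispatch loop looks roughly like this
    (illustrative; the real code also deals with budgets, restart marking and
    the batching path):

        rq = e->type->ops.dispatch_request(hctx);
        if (!rq)
                break;

        if (!blk_mq_get_driver_tag(rq)) {
                /*
                 * No driver tag available, so the device is effectively full.
                 * Keep only this one request waiting on the dispatch list and
                 * stop pulling more from the IO scheduler.
                 */
                list_add(&rq->queuelist, &rq_list);
                break;
        }

        list_add_tail(&rq->queuelist, &rq_list);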

    Reviewed-by: Ming Lei
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20210603104721.6309-1-jack@suse.cz
    Signed-off-by: Jens Axboe

    Jan Kara
     

24 May, 2021

1 commit

  • The tags used for an IO scheduler are currently per hctx.

    As such, when q->nr_hw_queues grows, so does the request queue total IO
    scheduler tag depth.

    This may cause problems for SCSI MQ HBAs whose total driver depth is
    fixed.

    Ming and Yanhui report higher CPU usage and lower throughput in scenarios
    where the fixed total driver tag depth is appreciably lower than the total
    scheduler tag depth:
    https://lore.kernel.org/linux-block/440dfcfc-1a2c-bd98-1161-cec4d78c6dfc@huawei.com/T/#mc0d6d4f95275a2743d1c8c3e4dc9ff6c9aa3a76b

    In that scenario, since the scheduler tag is acquired first, much contention
    is introduced because a driver tag may not be available after we have
    acquired the sched tag.

    Improve this scenario by introducing request queue-wide tags for when
    a tagset-wide sbitmap is used. The static sched requests are still
    allocated per hctx, as requests are initialised per hctx, as in
    blk_mq_init_request(..., hctx_idx, ...) ->
    set->ops->init_request(.., hctx_idx, ...).

    For simplicity of resizing the request queue sbitmap when updating the
    request queue depth, just init it at the max possible size, so we don't
    need to deal with possibly swapping out a new sbitmap for the old one if
    we need to grow.

    Signed-off-by: John Garry
    Reviewed-by: Ming Lei
    Link: https://lore.kernel.org/r/1620907258-30910-3-git-send-email-john.garry@huawei.com
    Signed-off-by: Jens Axboe

    John Garry
     

11 May, 2021

1 commit

  • __blk_mq_sched_bio_merge() gets the ctx and hctx for the current CPU and
    passes the hctx to ->bio_merge(). kyber_bio_merge() then gets the ctx
    for the current CPU again and uses that to get the corresponding Kyber
    context in the passed hctx. However, the thread may be preempted between
    the two calls to blk_mq_get_ctx(), and the ctx returned the second time
    may no longer correspond to the passed hctx. This "works" accidentally
    most of the time, but it can cause us to read garbage if the second ctx
    came from an hctx with more ctx's than the first one (i.e., if
    ctx->index_hw[hctx->type] > hctx->nr_ctx).

    This manifested as this UBSAN array index out of bounds error reported
    by Jakub:

    UBSAN: array-index-out-of-bounds in ../kernel/locking/qspinlock.c:130:9
    index 13106 is out of range for type 'long unsigned int [128]'
    Call Trace:
    dump_stack+0xa4/0xe5
    ubsan_epilogue+0x5/0x40
    __ubsan_handle_out_of_bounds.cold.13+0x2a/0x34
    queued_spin_lock_slowpath+0x476/0x480
    do_raw_spin_lock+0x1c2/0x1d0
    kyber_bio_merge+0x112/0x180
    blk_mq_submit_bio+0x1f5/0x1100
    submit_bio_noacct+0x7b0/0x870
    submit_bio+0xc2/0x3a0
    btrfs_map_bio+0x4f0/0x9d0
    btrfs_submit_data_bio+0x24e/0x310
    submit_one_bio+0x7f/0xb0
    submit_extent_page+0xc4/0x440
    __extent_writepage_io+0x2b8/0x5e0
    __extent_writepage+0x28d/0x6e0
    extent_write_cache_pages+0x4d7/0x7a0
    extent_writepages+0xa2/0x110
    do_writepages+0x8f/0x180
    __writeback_single_inode+0x99/0x7f0
    writeback_sb_inodes+0x34e/0x790
    __writeback_inodes_wb+0x9e/0x120
    wb_writeback+0x4d2/0x660
    wb_workfn+0x64d/0xa10
    process_one_work+0x53a/0xa80
    worker_thread+0x69/0x5b0
    kthread+0x20b/0x240
    ret_from_fork+0x1f/0x30

    Only Kyber uses the hctx, so fix it by passing the request_queue to
    ->bio_merge() instead. BFQ and mq-deadline just use that, and Kyber can
    map the queues itself to avoid the mismatch.
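
    The reworked hook can be sketched as below (abbreviated; the merge attempt
    itself is elided), showing Kyber deriving a consistent ctx/hctx pair on
    its own:

        static bool sketch_kyber_bio_merge(struct request_queue *q, struct bio *bio,
                                           unsigned int nr_segs)
        {
                struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
                struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, bio->bi_opf, ctx);
                struct kyber_hctx_data *khd = hctx->sched_data;

                /* ctx and hctx now always match, even if the task migrated CPUs;
                 * merge against khd->kcqs[ctx->index_hw[hctx->type]] ... */
                return false;
        }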

    Fixes: a6088845c2bf ("block: kyber: make kyber more friendly with merging")
    Reported-by: Jakub Kicinski
    Signed-off-by: Omar Sandoval
    Link: https://lore.kernel.org/r/c7598605401a48d5cfeadebb678abd10af22b83f.1620691329.git.osandov@fb.com
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

29 Apr, 2021

1 commit

  • Pull SCSI updates from James Bottomley:
    "This consists of the usual driver updates (ufs, target, tcmu,
    smartpqi, lpfc, zfcp, qla2xxx, mpt3sas, pm80xx).

    The major core change is using a sbitmap instead of an atomic for
    queue tracking"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (412 commits)
    scsi: target: tcm_fc: Fix a kernel-doc header
    scsi: target: Shorten ALUA error messages
    scsi: target: Fix two format specifiers
    scsi: target: Compare explicitly with SAM_STAT_GOOD
    scsi: sd: Introduce a new local variable in sd_check_events()
    scsi: dc395x: Open-code status_byte(u8) calls
    scsi: 53c700: Open-code status_byte(u8) calls
    scsi: smartpqi: Remove unused functions
    scsi: qla4xxx: Remove an unused function
    scsi: myrs: Remove unused functions
    scsi: myrb: Remove unused functions
    scsi: mpt3sas: Fix two kernel-doc headers
    scsi: fcoe: Suppress a compiler warning
    scsi: libfc: Fix a format specifier
    scsi: aacraid: Remove an unused function
    scsi: core: Introduce enum scsi_disposition
    scsi: core: Modify the scsi_send_eh_cmnd() return value for the SDEV_BLOCK case
    scsi: core: Rename scsi_softirq_done() into scsi_complete()
    scsi: core: Remove an incorrect comment
    scsi: core: Make the scsi_alloc_sgtables() documentation more accurate
    ...

    Linus Torvalds
     

09 Apr, 2021

1 commit

  • list_sort() internally casts the comparison function passed to it
    to a different type with constant struct list_head pointers, and
    uses this pointer to call the functions, which trips indirect call
    Control-Flow Integrity (CFI) checking.

    Instead of removing the consts, this change defines the
    list_cmp_func_t type and changes the comparison function types of
    all list_sort() callers to use const pointers, thus avoiding type
    mismatches.

    Suggested-by: Nick Desaulniers
    Signed-off-by: Sami Tolvanen
    Reviewed-by: Nick Desaulniers
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Kees Cook
    Tested-by: Nick Desaulniers
    Tested-by: Nathan Chancellor
    Signed-off-by: Kees Cook
    Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.com

    Sami Tolvanen
     

05 Mar, 2021

1 commit

  • SCSI uses a global atomic variable to track queue depth for each
    LUN/request queue.

    This doesn't scale well when there are lots of CPU cores and the disk is
    very fast. It has been observed that IOPS is affected a lot by tracking
    queue depth via sdev->device_busy in the I/O path.

    Return a budget token from the .get_budget callback. The budget token can
    be passed to the driver so that we can replace the atomic variable with an
    sbitmap_queue and alleviate the scaling problems that way.
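
    A minimal sketch of the budget-token idea (illustrative, not the SCSI
    patch itself; struct sketch_dev and its budget_map field are hypothetical
    stand-ins for the driver's per-device state):

        static int sketch_get_budget(struct request_queue *q)
        {
                struct sketch_dev *sdev = q->queuedata;         /* hypothetical */

                /* >= 0: budget granted (the token), < 0: out of budget */
                return __sbitmap_queue_get(&sdev->budget_map);
        }

        static void sketch_put_budget(struct request_queue *q, int budget_token)
        {
                struct sketch_dev *sdev = q->queuedata;

                sbitmap_queue_clear(&sdev->budget_map, budget_token,
                                    raw_smp_processor_id());
        }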

    Link: https://lore.kernel.org/r/20210122023317.687987-9-ming.lei@redhat.com
    Cc: Omar Sandoval
    Cc: Kashyap Desai
    Cc: Sumanesh Samanta
    Cc: Ewan D. Milne
    Tested-by: Sumanesh Samanta
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Ming Lei
    Signed-off-by: Martin K. Petersen

    Ming Lei
     

02 Mar, 2021

1 commit

  • Commit a1ce35fa49852db60fc6e268038530be533c5b15 ("block: remove dead
    elevator code") removed all users of RQF_SORTED. However it is still
    defined, and there is one reference left to it (which in effect is
    dead code). Clear it all up.

    Signed-off-by: Jean Delvare
    Cc: Jens Axboe
    Cc: Ming Lei
    Cc: Omar Sandoval
    Cc: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Jean Delvare
     

22 Feb, 2021

1 commit


05 Dec, 2020

1 commit


10 Oct, 2020

1 commit

  • After commit 923218f6166a ("blk-mq: don't allocate driver tag upfront
    for flush rq"), blk_mq_submit_bio() will call blk_insert_flush()
    directly to handle flush requests rather than blk_mq_sched_insert_request()
    when an elevator is in use.

    Then, all flush requests either have the RQF_FLUSH_SEQ flag set when
    blk_mq_sched_insert_request() is called, or have been inserted into
    hctx->dispatch. So, remove the dead code path.

    Signed-off-by: Yufen Yu
    Signed-off-by: Jens Axboe

    Yufen Yu
     

04 Sep, 2020

2 commits

  • Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
    multiple reply queues with single hostwide tags.

    In addition, these drivers want to use interrupt assignment in
    pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
    CPU hotplug may cause in-flight IO completion to not be serviced when an
    interrupt is shutdown. That problem is solved in commit bf0beec0607d
    ("blk-mq: drain I/O when all CPUs in a hctx are offline").

    However, to take advantage of that blk-mq feature, the HBA HW queues are
    required to be mapped to the blk-mq hctx's; to do that, the HBA HW
    queues need to be exposed to the upper layer.

    In making that transition, the per-SCSI command request tags are no
    longer unique per Scsi host - they are just unique per hctx. As such, the
    HBA LLDD would have to generate this tag internally, which has a certain
    performance overhead.

    However another problem is that blk-mq assumes the host may accept
    (Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092ef ("scsi:
    core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
    counter was removed, which would stop the LLDD being sent more than
    .can_queue commands; however, it should still be ensured that the block
    layer does not issue more than .can_queue commands to the Scsi host.

    To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
    which may be requested at init time.

    New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
    tagset to indicate whether the shared sbitmap should be used.

    Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
    are still allocated per hctx; the reason for this is that if tags and
    requests were only allocated for a single hctx - like hctx0 - it may break
    block drivers which expect a request to be associated with a specific hctx,
    i.e. not always hctx0. This will introduce extra memory usage.

    This change is based on work originally from Ming Lei in [1] and from
    Bart's suggestion in [2].

    [0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
    [1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
    [2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be
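
    From an LLDD's point of view the opt-in is just a tagset flag; a hedged
    sketch (the numbers are made up, only BLK_MQ_F_TAG_HCTX_SHARED is the new
    bit described above):

        static int sketch_init_tagset(struct blk_mq_tag_set *set)
        {
                set->nr_hw_queues = 16;         /* e.g. one per reply queue */
                set->queue_depth  = 1024;       /* hostwide .can_queue */
                set->flags        = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_TAG_HCTX_SHARED;

                return blk_mq_alloc_tag_set(set);
        }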

    Signed-off-by: John Garry
    Tested-by: Don Brace #SCSI resv cmds patches used
    Tested-by: Douglas Gilbert
    Signed-off-by: Jens Axboe

    John Garry
     
  • Pass hctx/tagset flags argument down to blk_mq_init_tags() and
    blk_mq_free_tags() for selective init/free.

    For now, make it include the alloc policy flag, which can be evaluated
    when needed (in blk_mq_init_tags()).

    Signed-off-by: John Garry
    Tested-by: Douglas Gilbert
    Signed-off-by: Jens Axboe

    John Garry
     

17 Aug, 2020

1 commit

    The SCHED_RESTART code path is relied on to re-run the queue for dispatching
    requests in hctx->dispatch. Meantime the SCHED_RESTART flag is checked when
    adding requests to hctx->dispatch.

    Memory barriers have to be used for ordering the following two pairs of
    operations:

    1) adding requests to hctx->dispatch and checking SCHED_RESTART in
    blk_mq_dispatch_rq_list()

    2) clearing SCHED_RESTART and checking if there is a request in
    hctx->dispatch in blk_mq_sched_restart().

    Without the added memory barrier, either:

    1) blk_mq_sched_restart() may miss requests added to hctx->dispatch while
    blk_mq_dispatch_rq_list() observes SCHED_RESTART, and does not run the
    queue on the dispatch side,

    or

    2) blk_mq_dispatch_rq_list() still sees SCHED_RESTART, and does not run the
    queue on the dispatch side, while the check for requests in hctx->dispatch
    from blk_mq_sched_restart() is missed.

    IO hang in ltp/fs_fill test is reported by kernel test robot:

    https://lkml.org/lkml/2020/7/26/77

    It turns out to be caused by the above out-of-order operations, and the IO
    hang can no longer be observed after applying this patch.

    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Reported-by: kernel test robot
    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: David Jeffery
    Cc:
    Signed-off-by: Jens Axboe

    Ming Lei
     

30 Jun, 2020

4 commits

  • More and more drivers want to get requests queued from the block layer
    in batches, such as mmc and tcp-based storage drivers. Also, current
    in-tree users include virtio-scsi, virtio-blk and nvme.

    For none, we already support batching dispatch.

    But with an io scheduler, every time we just take one request from the
    scheduler and pass that single request to blk_mq_dispatch_rq_list(). This
    makes batching dispatch impossible when an io scheduler is applied. One
    reason is that we don't want to hurt sequential IO performance, because
    the IO merge chance is reduced if more requests are dequeued from the
    scheduler queue.

    Try to support batching dispatch for io scheduler by starting with the
    following simple approach:

    1) still make sure we can get budget before dequeueing request

    2) use hctx->dispatch_busy to evaluate if the queue is busy; if it is busy
    we fall back to non-batching dispatch, otherwise dequeue as many requests
    as possible from the scheduler and pass them to blk_mq_dispatch_rq_list().

    Wrt. 2), we use a similar policy for none, and it turns out that SCSI SSD
    performance got improved much.

    In the future, maybe we can develop a more intelligent algorithm for
    batching dispatch.
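
    The policy above can be sketched as follows (simplified; budget handling,
    error paths and the exact blk_mq_dispatch_rq_list() argument list are
    omitted or abbreviated):

        unsigned int max_dispatch, count = 0;
        LIST_HEAD(rq_list);

        if (hctx->dispatch_busy)
                max_dispatch = 1;                        /* busy: keep the old single-request behaviour */
        else
                max_dispatch = hctx->queue->nr_requests; /* idle enough: pull a batch */

        do {
                /* get budget, then rq = e->type->ops.dispatch_request(hctx);
                 * stop early if the scheduler has nothing more to give */
                list_add_tail(&rq->queuelist, &rq_list);
        } while (++count < max_dispatch);

        /* hand the whole batch to the driver in one go */
        blk_mq_dispatch_rq_list(hctx, &rq_list, count);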

    Baolin has tested this patch and found that MMC performance is improved[3].

    [1] https://lore.kernel.org/linux-block/20200512075501.GF1531898@T590/#r
    [2] https://lore.kernel.org/linux-block/fe6bd8b9-6ed9-b225-f80c-314746133722@grimberg.me/
    [3] https://lore.kernel.org/linux-block/CADBw62o9eTQDJ9RvNgEqSpXmg6Xcq=2TxH0Hfxhp29uF2W=TXA@mail.gmail.com/

    Signed-off-by: Ming Lei
    Tested-by: Baolin Wang
    Reviewed-by: Christoph Hellwig
    Cc: Sagi Grimberg
    Cc: Baolin Wang
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Pass obtained budget count to blk_mq_dispatch_rq_list(), and prepare
    for supporting fully batching submission.

    With the obtained budget count, it is easier to put extra budgets
    in case of .queue_rq failure.

    Meantime remove the old 'got_budget' parameter.

    Signed-off-by: Ming Lei
    Tested-by: Baolin Wang
    Reviewed-by: Christoph Hellwig
    Cc: Sagi Grimberg
    Cc: Baolin Wang
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • All requests in the 'list' of blk_mq_dispatch_rq_list belong to same
    hctx, so it is better to pass hctx instead of request queue, because
    blk-mq's dispatch target is hctx instead of request queue.

    Signed-off-by: Ming Lei
    Tested-by: Baolin Wang
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Johannes Thumshirn
    Cc: Sagi Grimberg
    Cc: Baolin Wang
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • The blk-mq budget is abstracted from SCSI's device queue depth, and it is
    always per-request-queue instead of per-hctx.

    It can be quite absurd to get a budget from one hctx, then dequeue a
    request from scheduler queue, and this request may not belong to this
    hctx, at least for bfq and deadline.

    So fix the mess and always pass request queue to get/put budget
    callback.

    Signed-off-by: Ming Lei
    Tested-by: Baolin Wang
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Douglas Anderson
    Reviewed-by: Sagi Grimberg
    Cc: Sagi Grimberg
    Cc: Baolin Wang
    Cc: Christoph Hellwig
    Cc: Douglas Anderson
    Signed-off-by: Jens Axboe

    Ming Lei
     

24 Apr, 2020

1 commit

  • Flushes bypass the I/O scheduler and get added to hctx->dispatch
    in blk_mq_sched_bypass_insert. This can happen while a kworker is running
    hctx->run_work work item and is past the point in
    blk_mq_sched_dispatch_requests where hctx->dispatch is checked.

    The blk_mq_do_dispatch_sched call is not guaranteed to end in bounded time,
    because the I/O scheduler can feed an arbitrary number of commands.

    Since we have only one hctx->run_work, the commands waiting in
    hctx->dispatch will wait an arbitrary length of time for run_work to be
    rerun.

    A similar phenomenon exists with dispatches from the software queue.

    The solution is to poll hctx->dispatch in blk_mq_do_dispatch_sched and
    blk_mq_do_dispatch_ctx, return from the run_work handler, and let it
    rerun.
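
    In sketch form (simplified; the plumbing that tells the run_work handler
    to requeue itself is folded into a direct delayed run here):

        do {
                if (!list_empty_careful(&hctx->dispatch)) {
                        /* commands are waiting in hctx->dispatch: stop feeding
                         * from the scheduler and let the queue be re-run */
                        blk_mq_delay_run_hw_queue(hctx, 0);
                        break;
                }

                /* ... get budget, pull one request from the IO scheduler,
                 *     dispatch it ... */
        } while (dispatched);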

    Signed-off-by: Salman Qazi
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Salman Qazi
     

21 Apr, 2020

1 commit

  • If ever a thread running blk-mq code tries to get budget and fails it
    immediately stops doing work and assumes that whenever budget is freed
    up that queues will be kicked and whatever work the thread was trying
    to do will be tried again.

    One path where budget is freed and queues are kicked in the normal
    case can be seen in scsi_finish_command(). Specifically:
    - scsi_finish_command()
    - scsi_device_unbusy()
    - # Decrement "device_busy", AKA release budget
    - scsi_io_completion()
    - scsi_end_request()
    - blk_mq_run_hw_queues()

    The above is all well and good. The problem comes up when a thread
    claims the budget but then releases it without actually dispatching
    any work. Since we didn't schedule any work we'll never run the path
    of finishing work / kicking the queues.

    This isn't often actually a problem which is why this issue has
    existed for a while and nobody noticed. Specifically we only get into
    this situation when we unexpectedly found that we weren't going to do
    any work. Code that later receives new work kicks the queues. All
    good, right?

    The problem shows up, however, if timing is just wrong and we hit a
    race. To see this race let's think about the case where we only have
    a budget of 1 (only one thread can hold budget). Now imagine that a
    thread got budget and then decided not to dispatch work. It's about
    to call put_budget() but then the thread gets context switched out for
    a long, long time. While in this state, any and all kicks of the
    queue (like the when we received new work) will be no-ops because
    nobody can get budget. Finally the thread holding budget gets to run
    again and returns. All the normal kicks will have been no-ops and we
    have an I/O stall.

    As you can see from the above, you need just the right timing to see
    the race. To start with, it only happens if we thought we had work,
    actually managed to get the budget, but then actually didn't
    have work. That's pretty rare to start with. Even then, there's
    usually a very small amount of time between realizing that there's no
    work and putting the budget. During this small amount of time new
    work has to come in and the queue kick has to make it all the way to
    trying to get the budget and fail. It's pretty unlikely.

    One case where this could have failed is illustrated by an example of
    threads running blk_mq_do_dispatch_sched():

    * Threads A and B both run has_work() at the same time with the same
    "hctx". Imagine has_work() is exact. There's no lock, so it's OK
    if Thread A and B both get back true.
    * Thread B gets interrupted for a long time right after it decides
    that there is work. Maybe its CPU gets an interrupt and the
    interrupt handler is slow.
    * Thread A runs, get budget, dispatches work.
    * Thread A's work finishes and budget is released.
    * Thread B finally runs again and gets budget.
    * Since Thread A already took care of the work and no new work has
    come in, Thread B will get NULL from dispatch_request(). I believe
    this is specifically why dispatch_request() is allowed to return
    NULL in the first place if has_work() must be exact.
    * Thread B will now be holding the budget and is about to call
    put_budget(), but hasn't called it yet.
    * Thread B gets interrupted for a long time (again). Dang interrupts.
    * Now Thread C (maybe with a different "hctx" but the same queue)
    comes along and runs blk_mq_do_dispatch_sched().
    * Thread C won't do anything because it can't get budget.
    * Finally Thread B will run again and put the budget without kicking
    any queues.

    Even though the example above is with blk_mq_do_dispatch_sched() I
    believe the race is possible any time someone is holding budget but
    doesn't do work.

    Unfortunately, the unlikely has become more likely if you happen to be
    using the BFQ I/O scheduler. BFQ, by design, sometimes returns "true"
    for has_work() but then NULL for dispatch_request() and stays in this
    state for a while (currently up to 9 ms). Suddenly you only need one
    race to hit, not two races in a row. With my current setup this is
    easy to reproduce in reboot tests and traces have actually shown that
    we hit a race similar to the one described above.

    Note that we only need to fix blk_mq_do_dispatch_sched() and
    blk_mq_do_dispatch_ctx() and not the other places that put budget. In
    other cases we know that we have work to do on at least one "hctx" and
    code already exists to kick that "hctx"'s queue. When that work
    finally finishes all the queues will be kicked using the normal flow.

    One last note is that (at least in the SCSI case) budget is shared by
    all "hctx"s that have the same queue. Thus we need to make sure to
    kick the whole queue, not just re-run dispatching on a single "hctx".
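
    Reduced to a sketch, the fix described above looks like this (illustrative;
    the put-budget helper's signature differs across kernel versions and the
    3 ms delay is only an example value):

        #define SKETCH_BUDGET_DELAY_MS  3       /* illustrative */

        rq = e->type->ops.dispatch_request(hctx);
        if (!rq) {
                /* we held budget but found nothing to dispatch */
                blk_mq_put_dispatch_budget(hctx);       /* signature varies by version */

                /*
                 * Someone may have raced with us, failed to get budget and
                 * given up; kick every hw queue sharing this budget so that
                 * IO is not stalled.
                 */
                blk_mq_delay_run_hw_queues(hctx->queue, SKETCH_BUDGET_DELAY_MS);
                break;
        }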

    Signed-off-by: Douglas Anderson
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Douglas Anderson
     

12 Mar, 2020

1 commit

  • commit 01e99aeca397 ("blk-mq: insert passthrough request into
    hctx->dispatch directly") may change to add flush request to the tail
    of dispatch by applying the 'add_head' parameter of
    blk_mq_sched_insert_request.

    Turns out this way causes a performance regression on NCQ controllers
    because flush is a non-NCQ command, which can't be queued when there is
    any in-flight NCQ command. When adding a flush rq to the front of
    hctx->dispatch, it is easier to introduce extra time to the flush rq's
    latency compared with adding to the tail of the dispatch queue because of
    S_SCHED_RESTART; then the chance of flush merge is increased, and fewer
    flush requests may be issued to the controller.

    So always insert flush request to the front of dispatch queue just like
    before applying commit 01e99aeca397 ("blk-mq: insert passthrough request
    into hctx->dispatch directly").

    Cc: Damien Le Moal
    Cc: Shinichiro Kawasaki
    Reported-by: Shinichiro Kawasaki
    Fixes: 01e99aeca397 ("blk-mq: insert passthrough request into hctx->dispatch directly")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

25 Feb, 2020

1 commit

  • For some reason, the device may be in a situation where it can't handle
    FS requests, so STS_RESOURCE is always returned and the FS request
    will be added to hctx->dispatch. However, a passthrough request may
    be required at that time to fix the problem. If the passthrough
    request is added to the scheduler queue, there isn't any chance for
    blk-mq to dispatch it given we prioritize requests in hctx->dispatch.
    Then the FS IO request may never be completed, and an IO hang is caused.

    So the passthrough request has to be added to hctx->dispatch directly
    to fix the IO hang.

    Fix this issue by inserting the passthrough request into hctx->dispatch
    directly, together with adding the FS request to the tail of
    hctx->dispatch in blk_mq_dispatch_rq_list(). Actually we add FS requests
    to the tail of hctx->dispatch by default, see blk_mq_request_bypass_insert().

    Then it becomes consistent with the original legacy IO request
    path, in which passthrough requests were always added to q->queue_head.

    Cc: Dongli Zhang
    Cc: Christoph Hellwig
    Cc: Ewan D. Milne
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

26 Sep, 2019

1 commit

  • Commit c48dac137a62 ("block: don't hold q->sysfs_lock in elevator_init_mq")
    removes q->sysfs_lock from elevator_init_mq(), but forgot to deal with the
    lockdep_assert_held() called in blk_mq_sched_free_requests(), which is
    run in the failure path of elevator_init_mq().

    blk_mq_sched_free_requests() is called in the following 3 functions:

    elevator_init_mq()
    elevator_exit()
    blk_cleanup_queue()

    In blk_cleanup_queue(), blk_mq_sched_free_requests() is followed exactly
    by 'mutex_lock(&q->sysfs_lock)'.

    So move the lockdep_assert_held() from blk_mq_sched_free_requests()
    into elevator_exit() to fix the report by syzbot.

    Reported-by: syzbot+da3b7677bb913dc1b737@syzkaller.appspotmail.com
    Fixes: c48dac137a62 ("block: don't hold q->sysfs_lock in elevator_init_mq")
    Reviewed-by: Bart Van Assche
    Reviewed-by: Damien Le Moal
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

03 Jul, 2019

1 commit

  • No code that occurs between blk_mq_get_ctx() and blk_mq_put_ctx() depends
    on preemption being disabled for its correctness. Since removing the CPU
    preemption calls does not measurably affect performance, simplify the
    blk-mq code by removing the blk_mq_put_ctx() function and also by not
    disabling preemption in blk_mq_get_ctx().

    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

21 Jun, 2019

1 commit

  • We only need the number of segments in the blk-mq submission path.
    Remove the field from struct bio, and return it from a variant of
    blk_queue_split instead, so that it can be passed as an argument to
    those functions that need the value.

    This also means we stop recounting segments except for cloning
    and partial segments.

    To keep the number of arguments in this hot path down, remove the
    pointless struct request_queue arguments from any of the functions
    that had it and grew a nr_segs argument.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

13 Jun, 2019

1 commit

  • blk_mq_sched_free_requests() may be called in a failure path in which
    q->elevator may not be set up yet, so remove WARN_ON(!q->elevator) from
    blk_mq_sched_free_requests() to avoid the false positive.

    This function is actually safe to call in case of !q->elevator because
    hctx->sched_tags is checked.

    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Yi Zhang
    Fixes: c3e2219216c9 ("block: free sched's request pool in blk_cleanup_queue")
    Reported-by: syzbot+b9d0d56867048c7bcfde@syzkaller.appspotmail.com
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

07 Jun, 2019

1 commit

  • In theory, the IO scheduler belongs to the request queue, and the request
    pool of sched tags belongs to the request queue too.

    However, the current tags allocation interfaces are re-used for both
    driver tags and sched tags, and driver tags are definitely host wide
    and don't belong to any request queue, same with their request pool.
    So we need the tagset instance for freeing requests of sched tags.

    Meantime, blk_mq_free_tag_set() often follows blk_cleanup_queue() in the
    non-BLK_MQ_F_TAG_SHARED case; this requires that the request pool of
    sched tags be freed before calling blk_mq_free_tag_set().

    Commit 47cdee29ef9d94e ("block: move blk_exit_queue into __blk_release_queue")
    moves blk_exit_queue into __blk_release_queue to simplify the fast
    path in generic_make_request(), which then causes an oops while freeing
    requests of sched tags in __blk_release_queue().

    Fix the above issue by moving the freeing of the sched tags request pool
    into blk_cleanup_queue(); this is safe because the queue has been frozen
    and there are no in-queue requests at that time. Freeing the sched tags
    themselves has to be kept in the queue's release handler because there
    might be un-completed dispatch activity which might still refer to the
    sched tags.

    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Fixes: 47cdee29ef9d94e485eb08f962c74943023a5271 ("block: move blk_exit_queue into __blk_release_queue")
    Tested-by: Yi Zhang
    Reported-by: kernel test robot
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei