13 Jan, 2021

2 commits

  • [ Upstream commit a4d34da715e3cb7e0741fe603dcd511bed067e00 ]

    Remove flag RQF_PREEMPT and BLK_MQ_REQ_PREEMPT since these are no longer
    used by any kernel code.

    Link: https://lore.kernel.org/r/20201209052951.16136-8-bvanassche@acm.org
    Cc: Can Guo
    Cc: Stanley Chu
    Cc: Alan Stern
    Cc: Ming Lei
    Cc: Rafael J. Wysocki
    Cc: Martin Kepplinger
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Jens Axboe
    Reviewed-by: Can Guo
    Signed-off-by: Bart Van Assche
    Signed-off-by: Martin K. Petersen
    Signed-off-by: Sasha Levin

    Bart Van Assche
     
  • [ Upstream commit 0854bcdcdec26aecdc92c303816f349ee1fba2bc ]

    Introduce the BLK_MQ_REQ_PM flag. This flag makes the request allocation
    functions set RQF_PM. This is the first step towards removing
    BLK_MQ_REQ_PREEMPT.

    Link: https://lore.kernel.org/r/20201209052951.16136-3-bvanassche@acm.org
    Cc: Alan Stern
    Cc: Stanley Chu
    Cc: Ming Lei
    Cc: Rafael J. Wysocki
    Cc: Can Guo
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Jens Axboe
    Reviewed-by: Can Guo
    Signed-off-by: Bart Van Assche
    Signed-off-by: Martin K. Petersen
    Signed-off-by: Sasha Levin

    Bart Van Assche
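
    As an aside on the BLK_MQ_REQ_PM change above, the following is a minimal
    standalone C sketch of the idea: an allocation-time flag is translated into
    a per-request flag when the request is set up. The struct, the helper name
    rq_ctx_init() and the bit values are illustrative stand-ins, not the
    kernel's actual definitions.

        #include <stdio.h>

        /* stand-in flag values; the real kernel values differ */
        #define BLK_MQ_REQ_PM   (1u << 2)   /* allocation-time flag */
        #define RQF_PM          (1u << 15)  /* per-request flag */

        struct request {
                unsigned int rq_flags;
        };

        /* model of the allocation path: BLK_MQ_REQ_PM is mapped to RQF_PM */
        static void rq_ctx_init(struct request *rq, unsigned int alloc_flags)
        {
                rq->rq_flags = 0;
                if (alloc_flags & BLK_MQ_REQ_PM)
                        rq->rq_flags |= RQF_PM;
        }

        int main(void)
        {
                struct request rq;

                rq_ctx_init(&rq, BLK_MQ_REQ_PM);
                printf("RQF_PM set: %s\n", (rq.rq_flags & RQF_PM) ? "yes" : "no");
                return 0;
        }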
     

25 Oct, 2020

1 commit

  • Pull block fixes from Jens Axboe:

    - NVMe pull request from Christoph:
        - rdma error handling fixes (Chao Leng)
        - fc error handling and reconnect fixes (James Smart)
        - fix the qid displace when tracing ioctl command (Keith Busch)
        - don't use BLK_MQ_REQ_NOWAIT for passthru (Chaitanya Kulkarni)
        - fix MTDT for passthru (Logan Gunthorpe)
        - blacklist Write Same on more devices (Kai-Heng Feng)
        - fix an uninitialized work struct (zhenwei pi)

    - lightnvm out-of-bounds fix (Colin)

    - SG allocation leak fix (Doug)

    - rnbd fixes (Gioh, Guoqing, Jack)

    - zone error translation fixes (Keith)

    - kerneldoc markup fix (Mauro)

    - zram lockdep fix (Peter)

    - Kill unused io_context members (Yufen)

    - NUMA memory allocation cleanup (Xianting)

    - NBD config wakeup fix (Xiubo)

    * tag 'block-5.10-2020-10-24' of git://git.kernel.dk/linux-block: (27 commits)
    block: blk-mq: fix a kernel-doc markup
    nvme-fc: shorten reconnect delay if possible for FC
    nvme-fc: wait for queues to freeze before calling update_hr_hw_queues
    nvme-fc: fix error loop in create_hw_io_queues
    nvme-fc: fix io timeout to abort I/O
    null_blk: use zone status for max active/open
    nvmet: don't use BLK_MQ_REQ_NOWAIT for passthru
    nvmet: cleanup nvmet_passthru_map_sg()
    nvmet: limit passthru MTDS by BIO_MAX_PAGES
    nvmet: fix uninitialized work for zero kato
    nvme-pci: disable Write Zeroes on Sandisk Skyhawk
    nvme: use queuedata for nvme_req_qid
    nvme-rdma: fix crash due to incorrect cqe
    nvme-rdma: fix crash when connect rejected
    block: remove unused members for io_context
    blk-mq: remove the calling of local_memory_node()
    zram: Fix __zram_bvec_{read,write}() locking order
    skd_main: remove unused including
    sgl_alloc_order: fix memory leak
    lightnvm: fix out-of-bounds write to array devices->info[]
    ...

    Linus Torvalds
     

20 Oct, 2020

1 commit

  • We don't need to check whether the node is a memoryless NUMA node before
    calling the allocator interface. SLUB (and SLAB, SLOB) relies on the page
    allocator to pick a node. The page allocator should deal with memoryless
    nodes just fine: it has zonelists constructed for each possible node, and
    it will automatically fall back to the node closest to the requested one,
    as long as __GFP_THISNODE is not enforced, of course.

    The code comment in kmem_cache_alloc_node() of SLAB also notes this:
    * Fallback to other node is possible if __GFP_THISNODE is not set.

    blk-mq code doesn't set __GFP_THISNODE, so we can remove the call to
    local_memory_node().

    Signed-off-by: Xianting Tian
    Signed-off-by: Jens Axboe

    Xianting Tian
     

15 Oct, 2020

1 commit

  • …/device-mapper/linux-dm

    Pull device mapper updates from Mike Snitzer:

    - Improve DM core's bio splitting to use blk_max_size_offset(). Also
    fix bio splitting for bios that were deferred to the worker thread
    due to a DM device being suspended.

    - Remove DM core's special handling of NVMe devices now that block core
    has internalized efficiencies drivers previously needed to be
    concerned about (via now removed direct_make_request).

    - Fix request-based DM to not bounce through indirect dm_submit_bio;
    instead have block core make direct call to blk_mq_submit_bio().

    - Various DM core cleanups to simplify and improve code.

    - Update DM crypt to not use drivers that set
    CRYPTO_ALG_ALLOCATES_MEMORY.

    - Fix DM raid's raid1 and raid10 discard limits for the purposes of
    linux-stable. But then remove DM raid's discard limits settings now
    that MD raid can efficiently handle large discards.

    - A couple small cleanups across various targets.

    * tag 'for-5.10/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
    dm: fix request-based DM to not bounce through indirect dm_submit_bio
    dm: remove special-casing of bio-based immutable singleton target on NVMe
    dm: export dm_copy_name_and_uuid
    dm: fix comment in __dm_suspend()
    dm: fold dm_process_bio() into dm_submit_bio()
    dm: fix missing imposition of queue_limits from dm_wq_work() thread
    dm snap persistent: simplify area_io()
    dm thin metadata: Remove unused local variable when create thin and snap
    dm raid: remove unnecessary discard limits for raid10
    dm raid: fix discard limits for raid1 and raid10
    dm crypt: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY
    dm: use dm_table_get_device_name() where appropriate in targets
    dm table: make 'struct dm_table' definition accessible to all of DM core
    dm: eliminate need for start_io_acct() forward declaration
    dm: simplify __process_abnormal_io()
    dm: push use of on-stack flush_bio down to __send_empty_flush()
    dm: optimize max_io_len() by inlining max_io_len_target_boundary()
    dm: push md->immutable_target optimization down to __process_bio()
    dm: change max_io_len() to use blk_max_size_offset()
    dm table: stack 'chunk_sectors' limit to account for target-specific splitting

    Linus Torvalds
     

14 Oct, 2020

1 commit

  • Pull block updates from Jens Axboe:

    - Series of merge handling cleanups (Baolin, Christoph)

    - Series of blk-throttle fixes and cleanups (Baolin)

    - Series cleaning up BDI, separating the block device from the
    backing_dev_info (Christoph)

    - Removal of bdget() as a generic API (Christoph)

    - Removal of blkdev_get() as a generic API (Christoph)

    - Cleanup of is-partition checks (Christoph)

    - Series reworking disk revalidation (Christoph)

    - Series cleaning up bio flags (Christoph)

    - bio crypt fixes (Eric)

    - IO stats inflight tweak (Gabriel)

    - blk-mq tags fixes (Hannes)

    - Buffer invalidation fixes (Jan)

    - Allow soft limits for zone append (Johannes)

    - Shared tag set improvements (John, Kashyap)

    - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

    - DM no-wait support (Mike, Konstantin)

    - Request allocation improvements (Ming)

    - Allow md/dm/bcache to use IO stat helpers (Song)

    - Series improving blk-iocost (Tejun)

    - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
    Xianting, Yang, Yufen, yangerkun)

    * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
    block: fix uapi blkzoned.h comments
    blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
    blk-mq: get rid of the dead flush handle code path
    block: get rid of unnecessary local variable
    block: fix comment and add lockdep assert
    blk-mq: use helper function to test hw stopped
    block: use helper function to test queue register
    block: remove redundant mq check
    block: invoke blk_mq_exit_sched no matter whether have .exit_sched
    percpu_ref: don't refer to ref->data if it isn't allocated
    block: ratelimit handle_bad_sector() message
    blk-throttle: Re-use the throtl_set_slice_end()
    blk-throttle: Open code __throtl_de/enqueue_tg()
    blk-throttle: Move service tree validation out of the throtl_rb_first()
    blk-throttle: Move the list operation after list validation
    blk-throttle: Fix IO hang for a corner case
    blk-throttle: Avoid tracking latency if low limit is invalid
    blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
    blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
    block: Remove redundant 'return' statement
    ...

    Linus Torvalds
     

08 Oct, 2020

1 commit

  • It is unnecessary to force request-based DM to call into bio-based
    dm_submit_bio (via indirect disk->fops->submit_bio) only to have it then
    call blk_mq_submit_bio().

    Fix this by establishing a request-based DM block_device_operations
    (dm_rq_blk_dops, which doesn't have .submit_bio) and updating
    dm_setup_md_queue() to set md->disk->fops to it for
    DM_TYPE_REQUEST_BASED.

    Remove the DM_TYPE_REQUEST_BASED conditional in dm_submit_bio() and
    unexport blk_mq_submit_bio().

    Fixes: c62b37d96b6eb ("block: move ->make_request_fn to struct block_device_operations")
    Signed-off-by: Mike Snitzer

    Mike Snitzer
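
    To illustrate the shape of the change above, here is a small standalone C
    model (not kernel code): request-based DM gets its own ops table that
    simply leaves .submit_bio unset, so submission takes the direct path
    instead of an indirect call. The struct layout and the printf stand-in for
    blk_mq_submit_bio() are assumptions made for the sketch.

        #include <stdio.h>
        #include <stddef.h>

        struct bio { int dummy; };

        struct block_device_operations {
                void (*submit_bio)(struct bio *bio);  /* NULL => direct blk-mq path */
        };

        static void dm_submit_bio(struct bio *bio)
        {
                (void)bio;
                printf("bio-based path: dm_submit_bio()\n");
        }

        /* bio-based DM keeps .submit_bio ... */
        static const struct block_device_operations dm_blk_dops = {
                .submit_bio = dm_submit_bio,
        };

        /* ... request-based DM omits it, as described above */
        static const struct block_device_operations dm_rq_blk_dops = {
                .submit_bio = NULL,
        };

        static void submit(const struct block_device_operations *fops, struct bio *bio)
        {
                if (fops->submit_bio)
                        fops->submit_bio(bio);                  /* indirect call */
                else
                        printf("direct blk-mq submission\n");   /* stands in for blk_mq_submit_bio() */
        }

        int main(void)
        {
                struct bio b;

                submit(&dm_blk_dops, &b);
                submit(&dm_rq_blk_dops, &b);
                return 0;
        }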
     

07 Oct, 2020

1 commit

  • According to Documentation/block/stat.rst, inflight should not include
    I/O requests that are in the queue but not yet dispatched to the device,
    but blk-mq counts as inflight any request that has a tag allocated,
    which, for queues without an elevator, happens at request allocation
    time, before the request is queued in the ctx (the default case in
    blk_mq_submit_bio).

    In addition, the current behavior differs between queues with an elevator
    and queues without one, since for the former the driver tag is allocated
    at dispatch time. A more precise approach is to only consider requests
    with state MQ_RQ_IN_FLIGHT.

    This effectively reverts commit 6131837b1de6 ("blk-mq: count allocated
    but not started requests in iostats inflight") to consolidate blk-mq
    behavior with itself (elevator case) and with original documentation,
    but it differs from the behavior used by the legacy path.

    This version differs from v1 by using blk_mq_rq_state to access the
    state attribute. Avoid using blk_mq_request_started, which was
    suggested, since we don't want to include MQ_RQ_COMPLETE.

    Signed-off-by: Gabriel Krisman Bertazi
    Cc: Omar Sandoval
    Signed-off-by: Jens Axboe

    Gabriel Krisman Bertazi
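
    As a rough illustration of the stricter accounting described above, here
    is a self-contained C sketch: only requests whose state is IN_FLIGHT are
    counted, rather than every request that merely has a tag. The enum mirrors
    blk-mq's MQ_RQ_* names, but the structure and counting helper are
    stand-ins, not the kernel implementation.

        #include <stdio.h>

        enum mq_rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE };

        struct request {
                enum mq_rq_state state;
        };

        static int count_inflight(const struct request *rqs, int nr)
        {
                int i, inflight = 0;

                for (i = 0; i < nr; i++)
                        if (rqs[i].state == MQ_RQ_IN_FLIGHT)
                                inflight++;
                return inflight;
        }

        int main(void)
        {
                struct request rqs[] = {
                        { MQ_RQ_IDLE },       /* tag allocated, not dispatched: no longer counted */
                        { MQ_RQ_IN_FLIGHT },  /* dispatched to the device: counted */
                        { MQ_RQ_COMPLETE },   /* completing: deliberately not counted */
                };

                printf("inflight = %d\n", count_inflight(rqs, 3));
                return 0;
        }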
     

06 Oct, 2020

1 commit

  • blk_crypto_rq_bio_prep() assumes its gfp_mask argument always includes
    __GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed.

    However, blk_crypto_rq_bio_prep() might be called with GFP_ATOMIC via
    setup_clone() in drivers/md/dm-rq.c.

    This case isn't currently reachable with a bio that actually has an
    encryption context. However, it's fragile to rely on this. Just make
    blk_crypto_rq_bio_prep() able to fail.

    Suggested-by: Satya Tangirala
    Signed-off-by: Eric Biggers
    Reviewed-by: Mike Snitzer
    Reviewed-by: Satya Tangirala
    Cc: Miaohe Lin
    Signed-off-by: Jens Axboe

    Eric Biggers
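
    The gist of the change above can be sketched in standalone C as follows:
    when the gfp mask does not allow blocking (the GFP_ATOMIC case), the
    allocation may return NULL and the prep function must return an error
    rather than assume success. The flag bits and the pool_alloc() and
    crypto_rq_bio_prep() names are illustrative assumptions, not the real API.

        #include <stdio.h>
        #include <stdlib.h>
        #include <errno.h>

        #define __GFP_DIRECT_RECLAIM  (1u << 0)   /* stand-in bit */
        #define GFP_ATOMIC            0u          /* no direct reclaim allowed */

        struct crypt_ctx { int dummy; };

        /* Stand-in for mempool_alloc(): may fail when blocking is not allowed.
         * Here the non-blocking case always fails, to exercise the error path. */
        static struct crypt_ctx *pool_alloc(unsigned int gfp)
        {
                if (!(gfp & __GFP_DIRECT_RECLAIM))
                        return NULL;
                return malloc(sizeof(struct crypt_ctx));
        }

        /* model of a prep function that can now fail instead of assuming success */
        static int crypto_rq_bio_prep(struct crypt_ctx **out, unsigned int gfp)
        {
                struct crypt_ctx *ctx = pool_alloc(gfp);

                if (!ctx)
                        return -ENOMEM;   /* propagate failure to the caller */
                *out = ctx;
                return 0;
        }

        int main(void)
        {
                struct crypt_ctx *ctx = NULL;
                int ret = crypto_rq_bio_prep(&ctx, GFP_ATOMIC);

                printf("prep %s\n", ret ? "failed (handled)" : "succeeded");
                free(ctx);
                return 0;
        }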
     

29 Sep, 2020

1 commit

  • blk-mq should call commit_rqs once 'bd.last != true' and no more requests
    will come (so that virtscsi can kick the virtqueue, for example). We
    already do that in blk_mq_dispatch_rq_list()/blk_mq_try_issue_list_directly()
    while the list is not empty and 'queued > 0'. However, the same situation
    can arise when the last request in the list goes through queue_rq and
    returns an error such as BLK_STS_IOERR, which does not requeue the
    request; that leaves the list empty, but commit_rqs still needs to be
    called (otherwise requests for virtscsi will sit there until some other
    request kicks the virtqueue).

    We found this problem by running an fsstress test while quickly and
    repeatedly offlining/onlining a virtscsi device.

    Fixes: d666ba98f849 ("blk-mq: add mq_ops->commit_rqs()")
    Reported-by: zhangyi (F)
    Signed-off-by: yangerkun
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    yangerkun
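
    A toy model of the fix above, in plain C: after walking the dispatch list,
    the driver is also kicked via commit_rqs() when the final request failed
    in ->queue_rq(), so nothing was flagged as "last". The types, the
    needs_kick flag and the exact condition are simplifications of the real
    blk-mq code, assumed for this sketch only.

        #include <stdbool.h>
        #include <stdio.h>

        enum blk_status { BLK_STS_OK, BLK_STS_IOERR };

        struct queue {
                int  queued;       /* requests handed to the driver so far */
                bool needs_kick;   /* set until the driver is kicked */
        };

        static enum blk_status queue_rq(struct queue *q, int rq, bool last)
        {
                if (rq < 0)
                        return BLK_STS_IOERR;      /* request fails and is not requeued */
                q->queued++;
                q->needs_kick = !last;             /* bd.last kicks the driver itself */
                return BLK_STS_OK;
        }

        static void commit_rqs(struct queue *q)
        {
                q->needs_kick = false;             /* e.g. virtscsi kicks the virtqueue */
        }

        static void dispatch_list(struct queue *q, const int *rqs, int nr)
        {
                enum blk_status ret = BLK_STS_OK;
                int i;

                for (i = 0; i < nr; i++)
                        ret = queue_rq(q, rqs[i], i == nr - 1);

                /* the fix: also kick when the last request errored out */
                if (q->queued && ret != BLK_STS_OK)
                        commit_rqs(q);
        }

        int main(void)
        {
                struct queue q = { 0, false };
                int rqs[] = { 1, 2, -1 };          /* the last request will fail */

                dispatch_list(&q, rqs, 3);
                printf("queued=%d, still needs kick: %s\n",
                       q.queued, q.needs_kick ? "yes" : "no");
                return 0;
        }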
     

28 Sep, 2020

1 commit

  • We found blk_mq_alloc_rq_maps() takes more time in kernel space when
    testing nvme device hot-plugging. The test and analysis are below.

    Debug code,
    1, blk_mq_alloc_rq_maps():
            u64 start, end;
            depth = set->queue_depth;
            start = ktime_get_ns();
            pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n",
                   current->pid, current->comm, current->nvcsw, current->nivcsw,
                   set->queue_depth, set->nr_hw_queues);
            do {
                    err = __blk_mq_alloc_rq_maps(set);
                    if (!err)
                            break;

                    set->queue_depth >>= 1;
                    if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
                            err = -ENOMEM;
                            break;
                    }
            } while (set->queue_depth);
            end = ktime_get_ns();
            pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld ns\n",
                   current->pid, current->comm,
                   current->nvcsw, current->nivcsw, end - start);

    2, __blk_mq_alloc_rq_maps():
            u64 start, end;
            for (i = 0; i < set->nr_hw_queues; i++) {
                    start = ktime_get_ns();
                    if (!__blk_mq_alloc_rq_map(set, i))
                            goto out_unwind;
                    end = ktime_get_ns();
                    pr_err("hw queue %d init cost time %lld ns\n", i, end - start);
            }

    Testing nvme hot-plugging with the above debug code, we found it costs more
    than 3ms in kernel space, without being scheduled out, when allocating rqs
    for all 16 hw queues with depth 1023; each hw queue costs about 140-250us.
    The time cost increases as the number of hw queues and the queue depth
    increase. And in an extreme case, if __blk_mq_alloc_rq_maps() returns
    -ENOMEM, it retries with "queue_depth >>= 1", consuming even more time.
    [ 428.428771] nvme nvme0: pci function 10000:01:00.0
    [ 428.428798] nvme 10000:01:00.0: enabling device (0000 -> 0002)
    [ 428.428806] pcieport 10000:00:00.0: can't derive routing for PCI INT A
    [ 428.428809] nvme 10000:01:00.0: PCI INT A: no GSI
    [ 432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, nr_hw_queues 1
    [ 432.593404] hw queue 0 init cost time 22883 ns
    [ 432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init cost time 35960 ns
    [ 432.595953] nvme nvme0: 16/0/0 default/read/poll queues
    [ 432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, nr_hw_queues 16
    [ 432.596203] hw queue 0 init cost time 242630 ns
    [ 432.596441] hw queue 1 init cost time 235913 ns
    [ 432.596659] hw queue 2 init cost time 216461 ns
    [ 432.596877] hw queue 3 init cost time 215851 ns
    [ 432.597107] hw queue 4 init cost time 228406 ns
    [ 432.597336] hw queue 5 init cost time 227298 ns
    [ 432.597564] hw queue 6 init cost time 224633 ns
    [ 432.597785] hw queue 7 init cost time 219954 ns
    [ 432.597937] hw queue 8 init cost time 150930 ns
    [ 432.598082] hw queue 9 init cost time 143496 ns
    [ 432.598231] hw queue 10 init cost time 147261 ns
    [ 432.598397] hw queue 11 init cost time 164522 ns
    [ 432.598542] hw queue 12 init cost time 143401 ns
    [ 432.598692] hw queue 13 init cost time 148934 ns
    [ 432.598841] hw queue 14 init cost time 147194 ns
    [ 432.598991] hw queue 15 init cost time 148942 ns
    [ 432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init cost time 3035099 ns
    [ 432.602611] nvme0n1: p1

    So use this patch to trigger a reschedule between each hw queue init, to
    avoid other threads getting stuck. We are not in atomic context when
    executing __blk_mq_alloc_rq_maps(), so it is safe to call cond_resched().

    Signed-off-by: Xianting Tian
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Xianting Tian
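
    The scheduling point added above can be pictured with this small userspace
    sketch, where sched_yield() stands in for the kernel's cond_resched() and
    the calloc() call is a placeholder for __blk_mq_alloc_rq_map(); the names
    and sizes are assumptions for illustration only.

        #include <sched.h>
        #include <stdbool.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* stand-in for __blk_mq_alloc_rq_map(): allocate 'depth' request slots */
        static bool alloc_rq_map(int depth)
        {
                void *p = calloc((size_t)depth, 64);

                if (!p)
                        return false;
                free(p);
                return true;
        }

        static int alloc_all_rq_maps(int nr_hw_queues, int depth)
        {
                int i;

                for (i = 0; i < nr_hw_queues; i++) {
                        if (!alloc_rq_map(depth))
                                return -1;
                        sched_yield();   /* kernel: cond_resched() between hw queues */
                }
                return 0;
        }

        int main(void)
        {
                if (alloc_all_rq_maps(16, 1023))
                        fprintf(stderr, "allocation failed\n");
                else
                        printf("all hw queue maps allocated\n");
                return 0;
        }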
     

11 Sep, 2020

1 commit

  • NVMe shares tagset between fabric queue and admin queue or between
    connect_q and NS queue, so hctx_may_queue() can be called to allocate
    request for these queues.

    Tags can be reserved in these tagsets. Before error recovery, there are
    often lots of in-flight requests which can't be completed, and a new
    reserved request may be needed in the error recovery path. However,
    hctx_may_queue() can always return false because there are too many
    in-flight requests which can't be completed during error handling.
    Finally, nothing can proceed.

    Fix this issue by always allowing reserved tag allocation in
    hctx_may_queue(). This is reasonable because reserved tags are supposed
    to always be available.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Cc: David Milburn
    Cc: Ewan D. Milne
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
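
    A compact userspace model of the fix above: the fairness check is skipped
    entirely for reserved tags, so error handling can still obtain a tag while
    regular tags are held by stuck requests. The fairness formula and field
    names here are simplified stand-ins for the real hctx_may_queue() logic.

        #include <stdbool.h>
        #include <stdio.h>

        struct hctx {
                int active_users;   /* queues sharing this hctx's tags */
                int depth;          /* regular tag depth */
                int in_use;         /* regular tags currently allocated */
        };

        static bool hctx_may_queue(const struct hctx *h, bool reserved)
        {
                int fair_share;

                if (reserved)
                        return true;    /* the fix: reserved tags are always available */

                if (h->active_users <= 1)
                        return true;
                fair_share = (h->depth + h->active_users - 1) / h->active_users;
                return h->in_use < fair_share;
        }

        int main(void)
        {
                /* regular tags exhausted while four queues share the hctx */
                struct hctx h = { .active_users = 4, .depth = 64, .in_use = 64 };

                printf("regular tag allowed:  %d\n", hctx_may_queue(&h, false));
                printf("reserved tag allowed: %d\n", hctx_may_queue(&h, true));
                return 0;
        }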
     

04 Sep, 2020

8 commits

  • High CPU utilization on "native_queued_spin_lock_slowpath" due to lock
    contention is possible for mq-deadline and bfq IO schedulers
    when nr_hw_queues is more than one.

    It is because kblockd work queue can submit IO from all online CPUs
    (through blk_mq_run_hw_queues()) even though only one hctx has pending
    commands.

    The elevator callback .has_work for the mq-deadline and bfq schedulers
    considers there to be pending work if there are any IOs on the request
    queue, but it does not account for the hctx context.

    Add a per-hctx 'elevator_queued' count to avoid triggering the elevator
    even though there are no requests queued on that hctx.

    [jpg: Relocated atomic_dec() in dd_dispatch_request(), update commit message per Kashyap]

    Signed-off-by: Kashyap Desai
    Signed-off-by: Hannes Reinecke
    Signed-off-by: John Garry
    Tested-by: Douglas Gilbert
    Signed-off-by: Jens Axboe

    Kashyap Desai
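
    The idea of the per-hctx counter above can be sketched as follows
    (standalone C, with a plain integer standing in for the kernel's atomic_t
    and no real scheduler lists): .has_work bails out cheaply when nothing has
    been queued on that hctx.

        #include <stdbool.h>
        #include <stdio.h>

        struct hctx {
                long elevator_queued;   /* kernel: an atomic_t updated on insert/dispatch */
        };

        static void sched_insert(struct hctx *h)   { h->elevator_queued++; }
        static void sched_dispatch(struct hctx *h) { if (h->elevator_queued) h->elevator_queued--; }

        static bool has_work(const struct hctx *h)
        {
                /* bail out before touching any shared scheduler state */
                if (h->elevator_queued == 0)
                        return false;
                return true;    /* the real callback would now inspect its lists under a lock */
        }

        int main(void)
        {
                struct hctx idle = { 0 }, busy = { 0 };

                sched_insert(&busy);
                printf("idle hctx has work: %d\n", has_work(&idle));
                printf("busy hctx has work: %d\n", has_work(&busy));
                sched_dispatch(&busy);
                printf("busy hctx after dispatch has work: %d\n", has_work(&busy));
                return 0;
        }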
     
  • When using a shared sbitmap, the number of active request queues per hctx
    should no longer be relied on when judging how to share the tag bitmap.

    Instead, maintain the number of active request queues per tag_set, and
    make the judgement based on that.

    Originally-from: Kashyap Desai
    Signed-off-by: John Garry
    Tested-by: Don Brace #SCSI resv cmds patches used
    Tested-by: Douglas Gilbert
    Signed-off-by: Jens Axboe

    John Garry
     
  • The per-hctx nr_active value can no longer be used to fairly assign a
    share of tag depth per request queue when using a shared sbitmap, as it
    does not consider that the tags are shared across all hctxs.

    For this case, record nr_active_requests per request_queue, and make the
    judgement based on that value.

    Co-developed-with: Kashyap Desai
    Signed-off-by: John Garry
    Tested-by: Don Brace #SCSI resv cmds patches used
    Tested-by: Douglas Gilbert
    Signed-off-by: Jens Axboe

    John Garry
     
  • Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
    multiple reply queues with single hostwide tags.

    In addition, these drivers want to use interrupt assignment in
    pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
    CPU hotplug may cause in-flight IO completion to not be serviced when an
    interrupt is shutdown. That problem is solved in commit bf0beec0607d
    ("blk-mq: drain I/O when all CPUs in a hctx are offline").

    However, to take advantage of that blk-mq feature, the HBA HW queues are
    required to be mapped to blk-mq hctxs; to do that, the HBA HW queues need
    to be exposed to the upper layer.

    In making that transition, the per-SCSI command request tags are no
    longer unique per Scsi host - they are just unique per hctx. As such, the
    HBA LLDD would have to generate this tag internally, which has a certain
    performance overhead.

    However another problem is that blk-mq assumes the host may accept
    (Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092ef ("scsi:
    core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
    counter was removed, which would stop the LLDD being sent more than
    .can_queue commands; however, it should still be ensured that the block
    layer does not issue more than .can_queue commands to the Scsi host.

    To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
    which may be requested at init time.

    New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
    tagset to indicate whether the shared sbitmap should be used.

    Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
    are still allocated per hctx; the reason for this is that if tags and
    requests were only allocated for a single hctx - like hctx0 - it may break
    block drivers which expect a request be associated with a specific hctx,
    i.e. not always hctx0. This will introduce extra memory usage.

    This change is based on work originally from Ming Lei in [1] and from
    Bart's suggestion in [2].

    [0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
    [1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
    [2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be

    Signed-off-by: John Garry
    Tested-by: Don Brace #SCSI resv cmds patches used
    Tested-by: Douglas Gilbert
    Signed-off-by: Jens Axboe

    John Garry
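
    Here is a standalone C model of the shared tag bitmap described above:
    each hctx keeps its own tags structure, but when the set is created
    "shared" (modelling BLK_MQ_F_TAG_HCTX_SHARED), every hctx's bitmap pointer
    refers to one set-wide bitmap. A plain unsigned long stands in for the
    kernel's sbitmap, and all names are assumptions made for this sketch.

        #include <stdbool.h>
        #include <stdio.h>

        #define NR_HCTX 4

        struct bitmap { unsigned long bits; };

        struct tags {
                struct bitmap  local;     /* per-hctx bitmap */
                struct bitmap *in_use;    /* what tag allocation actually uses */
        };

        struct tag_set {
                bool          hctx_shared;   /* models BLK_MQ_F_TAG_HCTX_SHARED */
                struct bitmap shared;        /* set-wide bitmap */
                struct tags   tags[NR_HCTX];
        };

        static void tag_set_init(struct tag_set *set, bool shared)
        {
                int i;

                set->hctx_shared = shared;
                set->shared.bits = 0;
                for (i = 0; i < NR_HCTX; i++) {
                        set->tags[i].local.bits = 0;
                        /* point every hctx at the common bitmap when sharing */
                        set->tags[i].in_use = shared ? &set->shared
                                                     : &set->tags[i].local;
                }
        }

        static int get_tag(struct tag_set *set, int hctx)
        {
                struct bitmap *bm = set->tags[hctx].in_use;
                int bit, nbits = (int)(8 * sizeof(bm->bits));

                for (bit = 0; bit < nbits; bit++) {
                        if (!(bm->bits & (1UL << bit))) {
                                bm->bits |= 1UL << bit;
                                return bit;
                        }
                }
                return -1;
        }

        int main(void)
        {
                struct tag_set set;
                int t0, t1;

                tag_set_init(&set, true);
                t0 = get_tag(&set, 0);    /* hctx0 and hctx1 draw from the same pool */
                t1 = get_tag(&set, 1);
                printf("hctx0 got tag %d, hctx1 got tag %d\n", t0, t1);
                return 0;
        }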
     
  • Introduce pointers for the blk_mq_tags regular and reserved bitmap tags,
    with the goal of later being able to use a common shared tag bitmap across
    all HW contexts in a set.

    Signed-off-by: John Garry
    Tested-by: Don Brace #SCSI resv cmds patches used
    Tested-by: Douglas Gilbert
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    John Garry
     
  • Pass hctx/tagset flags argument down to blk_mq_init_tags() and
    blk_mq_free_tags() for selective init/free.

    For now, make it include the alloc policy flag, which can be evaluated
    when needed (in blk_mq_init_tags()).

    Signed-off-by: John Garry
    Tested-by: Douglas Gilbert
    Signed-off-by: Jens Axboe

    John Garry
     
  • The function does not set the depth, but rather transitions from
    shared to non-shared queues and vice versa.

    So rename it to blk_mq_update_tag_set_shared() to better reflect
    its purpose.

    [jpg: take out some unrelated changes in blk_mq_init_bitmap_tags()]

    Signed-off-by: Hannes Reinecke
    Signed-off-by: John Garry
    Tested-by: Douglas Gilbert
    Signed-off-by: Jens Axboe

    Hannes Reinecke
     
  • BLK_MQ_F_TAG_SHARED actually means that tags are shared among request
    queues, all of which should belong to LUNs attached to the same HBA.

    So rename it to make that point explicit.

    [jpg: rebase a few times, add rnbd-clt.c change]

    Suggested-by: Bart Van Assche
    Signed-off-by: Ming Lei
    Signed-off-by: John Garry
    Tested-by: Douglas Gilbert
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Ming Lei
     

22 Aug, 2020

1 commit

  • c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list") was
    supposed to add requests which have been through ->queue_rq() to the hw
    queue dispatch list; however, it adds requests that ran out of budget or
    a driver tag to the hw queue too. This basically bypasses request merging,
    causes too many requests to be dispatched to the LLD, and unnecessarily
    increases %system.

    Fix this issue by adding requests that have not been through ->queue_rq()
    to the sw/scheduler queue instead; this is safe because ->queue_rq() has
    not been called on these requests yet.

    High %system can be observed on Azure storvsc devices, and even soft
    lockups have been observed. This patch reduces %system during heavy
    sequential IO and decreases the soft lockup risk.

    Fixes: c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list")
    Signed-off-by: Ming Lei
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Mike Snitzer
    Signed-off-by: Jens Axboe

    Ming Lei
     

17 Aug, 2020

2 commits

  • The SCHED_RESTART code path is relied on to re-run the queue for
    dispatching requests in hctx->dispatch. Meanwhile, the SCHED_RESTART flag
    is checked when adding requests to hctx->dispatch.

    Memory barriers have to be used to order the following two pairs of
    operations:

    1) adding requests to hctx->dispatch and checking SCHED_RESTART in
    blk_mq_dispatch_rq_list()

    2) clearing SCHED_RESTART and checking if there is a request in
    hctx->dispatch in blk_mq_sched_restart().

    Without the added memory barrier, either:

    1) blk_mq_sched_restart() may miss requests added to hctx->dispatch while
    blk_mq_dispatch_rq_list() observes SCHED_RESTART and does not re-run the
    queue on the dispatch side

    or

    2) blk_mq_dispatch_rq_list() still sees SCHED_RESTART and does not re-run
    the queue on the dispatch side, while the check for requests in
    hctx->dispatch from blk_mq_sched_restart() is missed.

    IO hang in ltp/fs_fill test is reported by kernel test robot:

    https://lkml.org/lkml/2020/7/26/77

    It turns out to be caused by the out-of-order operations above, and the IO
    hang can no longer be observed after applying this patch.

    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Reported-by: kernel test robot
    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: David Jeffery
    Cc:
    Signed-off-by: Jens Axboe

    Ming Lei
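
    The barrier pairing above can be sketched with C11 atomics in a
    single-threaded illustration: each function models one side, with a full
    fence between "publish my update" and "check the other side", so at least
    one side notices the other. The flag and counter are stand-ins, not real
    blk-mq state, and the fences here merely mirror the kernel's barrier
    placement rather than reproduce it.

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>

        static atomic_bool sched_restart;   /* models the SCHED_RESTART bit */
        static atomic_int  dispatch_len;    /* models the hctx->dispatch list length */

        /* blk_mq_dispatch_rq_list() side: add to ->dispatch, then check the flag */
        static bool dispatch_side_reruns(void)
        {
                atomic_fetch_add(&dispatch_len, 1);             /* add request to ->dispatch */
                atomic_thread_fence(memory_order_seq_cst);      /* pairs with the fence below */
                return !atomic_load(&sched_restart);            /* flag clear => re-run here */
        }

        /* blk_mq_sched_restart() side: clear the flag, then check ->dispatch */
        static bool restart_side_reruns(void)
        {
                atomic_store(&sched_restart, false);            /* clear SCHED_RESTART */
                atomic_thread_fence(memory_order_seq_cst);      /* pairs with the fence above */
                return atomic_load(&dispatch_len) > 0;          /* list non-empty => re-run */
        }

        int main(void)
        {
                atomic_store(&sched_restart, true);

                /* with the fences, at least one of the two sides decides to re-run */
                printf("dispatch side re-runs: %d\n", dispatch_side_reruns());
                printf("restart side re-runs:  %d\n", restart_side_reruns());
                return 0;
        }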
     
  • Fix a kernel-doc warning in block/blk-mq.c:

    ../block/blk-mq.c:1844: warning: Function parameter or member 'at_head' not described in 'blk_mq_request_bypass_insert'

    Fixes: 01e99aeca397 ("blk-mq: insert passthrough request into hctx->dispatch directly")
    Signed-off-by: Randy Dunlap
    Cc: André Almeida
    Cc: Jens Axboe
    Cc: Ming Lei
    Cc: linux-block@vger.kernel.org
    Signed-off-by: Jens Axboe

    Randy Dunlap
     

04 Aug, 2020

1 commit

  • Pull core block updates from Jens Axboe:
    "Good amount of cleanups and tech debt removals in here, and as a
    result, the diffstat shows a nice net reduction in code.

    - Softirq completion cleanups (Christoph)

    - Stop using ->queuedata (Christoph)

    - Cleanup bd claiming (Christoph)

    - Use check_events, moving away from the legacy media change
    (Christoph)

    - Use inode i_blkbits consistently (Christoph)

    - Remove old unused writeback congestion bits (Christoph)

    - Cleanup/unify submission path (Christoph)

    - Use bio_uninit consistently, instead of bio_disassociate_blkg
    (Christoph)

    - sbitmap cleared bits handling (John)

    - Request merging blktrace event addition (Jan)

    - sysfs add/remove race fixes (Luis)

    - blk-mq tag fixes/optimizations (Ming)

    - Duplicate words in comments (Randy)

    - Flush deferral cleanup (Yufen)

    - IO context locking/retry fixes (John)

    - struct_size() usage (Gustavo)

    - blk-iocost fixes (Chengming)

    - blk-cgroup IO stats fixes (Boris)

    - Various little fixes"

    * tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits)
    block: blk-timeout: delete duplicated word
    block: blk-mq-sched: delete duplicated word
    block: blk-mq: delete duplicated word
    block: genhd: delete duplicated words
    block: elevator: delete duplicated word and fix typos
    block: bio: delete duplicated words
    block: bfq-iosched: fix duplicated word
    iocost_monitor: start from the oldest usage index
    iocost: Fix check condition of iocg abs_vdebt
    block: Remove callback typedefs for blk_mq_ops
    block: Use non _rcu version of list functions for tag_set_list
    blk-cgroup: show global disk stats in root cgroup io.stat
    blk-cgroup: make iostat functions visible to stat printing
    block: improve discard bio alignment in __blkdev_issue_discard()
    block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
    block: defer flush request no matter whether we have elevator
    block: make blk_timeout_init() static
    block: remove retry loop in ioc_release_fn()
    block: remove unnecessary ioc nested locking
    block: integrate bd_start_claiming into __blkdev_get
    ...

    Linus Torvalds
     

28 Jul, 2020

1 commit

  • tag_set_list is only accessed under the tag_set_lock lock. There is no
    need to use the _rcu list functions.

    The _rcu list functions were introduced to allow read access to the
    tag_set_list protected under RCU, see 705cda97ee3a ("blk-mq: Make it
    safe to use RCU to iterate over blk_mq_tag_set.tag_list") and
    05b79413946d ("Revert "blk-mq: don't handle TAG_SHARED in restart"").
    Those changes got reverted later, but the cleanup commit missed a
    couple of places to undo the changes.

    Fixes: 97889f9ac24f ("blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()")
    Signed-off-by: Daniel Wagner
    Reviewed-by: Hannes Reinecke
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Daniel Wagner
     

09 Jul, 2020

2 commits

  • Move the .nr_active update and request assignment into
    blk_mq_get_driver_tag(); both are natural to do while getting the driver
    tag.

    Meanwhile, the blk-flush related code is simplified, and flush requests no
    longer need to update the request table manually.

    Signed-off-by: Ming Lei
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Current handling of the q->mq_ops->queue_rq result is a bit ugly:

    - two branches which need to 'continue' have to check whether the
    dispatch local list is empty, otherwise one bad request may
    be retrieved via 'rq = list_first_entry(list, struct request, queuelist);'

    - the 'if (unlikely(ret != BLK_STS_OK))' branch isn't easy to follow,
    since it is actually an error branch.

    Streamline this handling so the code becomes more readable, and at the
    same time avoid a potential kernel oops in case the last request in the
    local dispatch list fails.

    Fixes: fc17b6534eb8 ("blk-mq: switch ->queue_rq return value to blk_status_t")
    Signed-off-by: Ming Lei
    Reviewed-by: Johannes Thumshirn
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lei
     

07 Jul, 2020

1 commit

  • dm-multipath is the only user of blk_mq_queue_inflight(). When
    dm-multipath calls blk_mq_queue_inflight() to check if it has
    outstanding IO it can get a false negative. The reason for this is
    blk_mq_rq_inflight() doesn't consider requests that are no longer
    MQ_RQ_IN_FLIGHT but that are now MQ_RQ_COMPLETE (->complete isn't
    called or finished yet) as "inflight".

    This causes request-based dm-multipath's dm_wait_for_completion() to
    return before all outstanding dm-multipath requests have actually
    completed. This breaks DM multipath's suspend functionality because
    blk-mq requests complete after DM's suspend has finished -- which
    shouldn't happen.

    Fix this by considering any request not in the MQ_RQ_IDLE state
    (so either MQ_RQ_COMPLETE or MQ_RQ_IN_FLIGHT) as "inflight" in
    blk_mq_rq_inflight().

    Fixes: 3c94d83cb3526 ("blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()")
    Signed-off-by: Ming Lei
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Ming Lei
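
    A minimal sketch of the changed check above: a request counts as inflight
    as long as it is not back to IDLE, so requests that are still completing
    hold off suspend. The enum mirrors blk-mq's MQ_RQ_* states; the two helper
    functions are stand-ins written for this illustration.

        #include <stdbool.h>
        #include <stdio.h>

        enum mq_rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE };

        /* before: only IN_FLIGHT counted;  after: anything not IDLE counts */
        static bool rq_inflight_old(enum mq_rq_state s) { return s == MQ_RQ_IN_FLIGHT; }
        static bool rq_inflight_new(enum mq_rq_state s) { return s != MQ_RQ_IDLE; }

        int main(void)
        {
                enum mq_rq_state s = MQ_RQ_COMPLETE;   /* completing, ->complete not finished */

                printf("old check sees inflight: %d\n", rq_inflight_old(s));
                printf("new check sees inflight: %d\n", rq_inflight_new(s));
                return 0;
        }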
     

02 Jul, 2020

1 commit

  • This reverts the following commits:

    37f4a24c2469a10a4c16c641671bd766e276cf9f
    723bf178f158abd1ce6069cb049581b3cb003aab
    36a3df5a4574d5ddf59804fcd0c4e9654c514d9a

    The last one is the culprit, but we have to go a bit deeper to get this
    to revert cleanly. There's been a report that this breaks some MMC
    setups [1], and also causes an issue with swap [2]. Until this can be
    figured out, revert the offending commits.

    [1] https://lore.kernel.org/linux-block/57fb09b1-54ba-f3aa-f82c-d709b0e6b281@samsung.com/
    [2] https://lore.kernel.org/linux-block/20200702043721.GA1087@lca.pw/

    Reported-by: Marek Szyprowski
    Reported-by: Qian Cai
    Signed-off-by: Jens Axboe

    Jens Axboe
     

30 Jun, 2020

2 commits

  • More and more drivers want to get batches of requests queued from the
    block layer, such as mmc and tcp-based storage drivers. Current in-tree
    users include virtio-scsi, virtio-blk and nvme.

    For none, we already support batching dispatch.

    But for an io scheduler, every time we just take one request from the
    scheduler and pass that single request to blk_mq_dispatch_rq_list(). This
    makes batching dispatch impossible when an io scheduler is applied. One
    reason is that we don't want to hurt sequential IO performance, because
    the IO merge chance is reduced if more requests are dequeued from the
    scheduler queue.

    Try to support batching dispatch for io schedulers by starting with the
    following simple approach:

    1) still make sure we can get budget before dequeueing a request

    2) use hctx->dispatch_busy to evaluate if the queue is busy; if it is busy
    we fall back to non-batching dispatch, otherwise dequeue as many requests
    as possible from the scheduler and pass them to blk_mq_dispatch_rq_list().

    Wrt. 2), we use a similar policy for none, and it turns out that SCSI SSD
    performance improved a lot.

    In the future, maybe we can develop a more intelligent algorithm for
    batching dispatch.

    Baolin has tested this patch and found that MMC performance is improved[3].

    [1] https://lore.kernel.org/linux-block/20200512075501.GF1531898@T590/#r
    [2] https://lore.kernel.org/linux-block/fe6bd8b9-6ed9-b225-f80c-314746133722@grimberg.me/
    [3] https://lore.kernel.org/linux-block/CADBw62o9eTQDJ9RvNgEqSpXmg6Xcq=2TxH0Hfxhp29uF2W=TXA@mail.gmail.com/

    Signed-off-by: Ming Lei
    Tested-by: Baolin Wang
    Reviewed-by: Christoph Hellwig
    Cc: Sagi Grimberg
    Cc: Baolin Wang
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lei
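
    The batching policy above can be modelled with a short standalone C
    sketch: if the hctx does not look busy, grab budget first and dequeue up
    to that many requests as one list; otherwise fall back to one request at a
    time. The structures, the budget helper and the MAX_BATCH threshold are
    assumptions made for the sketch, not the kernel's actual values.

        #include <stdio.h>

        #define MAX_BATCH 8

        struct sched_queue {
                int nr_reqs;          /* requests sitting in the IO scheduler */
                int dispatch_busy;    /* models hctx->dispatch_busy (an EWMA in the kernel) */
        };

        static int get_budget(void) { return 1; }   /* stand-in: one budget unit per call */

        static int dequeue(struct sched_queue *q, int max)
        {
                int n = q->nr_reqs < max ? q->nr_reqs : max;

                q->nr_reqs -= n;
                return n;
        }

        static void dispatch(struct sched_queue *q)
        {
                int batch, budget = 0;

                if (q->dispatch_busy) {
                        batch = dequeue(q, 1);              /* busy: non-batching fallback */
                } else {
                        while (budget < MAX_BATCH && get_budget())
                                budget++;                   /* grab budget first ... */
                        batch = dequeue(q, budget);         /* ... then dequeue up to it */
                }
                printf("dispatched %d request(s) as one list\n", batch);
        }

        int main(void)
        {
                struct sched_queue idle_hw = { .nr_reqs = 5, .dispatch_busy = 0 };
                struct sched_queue busy_hw = { .nr_reqs = 5, .dispatch_busy = 3 };

                dispatch(&idle_hw);   /* batches: prints 5 */
                dispatch(&busy_hw);   /* falls back: prints 1 */
                return 0;
        }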
     
  • Pass obtained budget count to blk_mq_dispatch_rq_list(), and prepare
    for supporting fully batching submission.

    With the obtained budget count, it is easier to put extra budgets
    in case of .queue_rq failure.

    Meantime remove the old 'got_budget' parameter.

    Signed-off-by: Ming Lei
    Tested-by: Baolin Wang
    Reviewed-by: Christoph Hellwig
    Cc: Sagi Grimberg
    Cc: Baolin Wang
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lei