04 Sep, 2020

5 commits

  • When using a shared sbitmap, the number of active request queues per
    hctx should no longer be relied on when judging how to share the tag
    bitmap.

    Instead, maintain the number of active request queues per tag_set, and
    make the judgement based on that.

    Originally-from: Kashyap Desai
    Signed-off-by: John Garry
    Tested-by: Don Brace #SCSI resv cmds patches used
    Tested-by: Douglas Gilbert
    Signed-off-by: Jens Axboe

    John Garry
     
  • The per-hctx nr_active value can no longer be used to fairly assign a
    share of tag depth per request queue when using a shared sbitmap, as it
    does not account for the tags being shared across all hctx's.

    For this case, record nr_active_requests per request_queue, and make
    the judgement based on that value (see the sketch below).

    Co-developed-by: Kashyap Desai
    Signed-off-by: John Garry
    Tested-by: Don Brace #SCSI resv cmds patches used
    Tested-by: Douglas Gilbert
    Signed-off-by: Jens Axboe

    John Garry
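
    A minimal sketch of the resulting fairness check, loosely modelled on
    hctx_may_queue(); the field names match this series, but the helper
    itself is illustrative rather than the exact kernel code:

        /*
         * Sketch: may this queue allocate another tag when the sbitmap
         * is shared across all hctx's of the tag_set?
         */
        static inline bool queue_may_alloc_tag(struct request_queue *q,
                                               unsigned int depth)
        {
                /* active queues are now counted per tag_set, not per hctx */
                unsigned int users =
                        atomic_read(&q->tag_set->active_queues_shared_sbitmap);

                if (!users)
                        return true;

                /* nr_active_requests is accounted per request_queue */
                return atomic_read(&q->nr_active_requests_shared_sbitmap) <
                       max((depth + users - 1) / users, 4U);
        }

    Each active queue gets roughly an equal share of the total depth, with
    a small floor so that low-depth devices still make progress.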
     
  • blk-mq.h and blk-mq-tag.h include each other, which is less than ideal.

    Move hctx_may_queue() to blk-mq.h, as it is not really tag-specific code.

    This way, we can drop the blk-mq-tag.h include of blk-mq.h (the pattern
    is sketched below).

    Signed-off-by: John Garry
    Tested-by: Douglas Gilbert
    Signed-off-by: Jens Axboe

    John Garry
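
    The generic pattern used here, sketched with hypothetical headers: the
    header that only handles pointers forward-declares the type instead of
    including the other header, which breaks the cycle.

        /* tag.h - sketch: no longer includes core.h */
        #ifndef TAG_H
        #define TAG_H

        struct hw_ctx;                  /* forward declaration is enough */

        void tag_wakeup_all(struct hw_ctx *hctx);

        #endif

    hctx_may_queue() itself moves next to the hctx definition it operates
    on, since it is queueing logic that merely happens to look at tags.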
     
  • Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
    multiple reply queues with single hostwide tags.

    In addition, these drivers want to use interrupt assignment in
    pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
    CPU hotplug may cause in-flight IO completion to not be serviced when an
    interrupt is shut down. That problem is solved in commit bf0beec0607d
    ("blk-mq: drain I/O when all CPUs in a hctx are offline").

    However, to take advantage of that blk-mq feature, the HBA HW queues are
    required to be mapped to the blk-mq hctx's; to do that, the HBA HW
    queues need to be exposed to the upper layer.

    In making that transition, the per-SCSI command request tags are no
    longer unique per Scsi host - they are just unique per hctx. As such, the
    HBA LLDD would have to generate this tag internally, which has a certain
    performance overhead.

    Another problem is that blk-mq assumes the host may accept
    (Scsi_host.can_queue * #hw queues) commands. In commit 6eb045e092ef
    ("scsi: core: avoid host-wide host_busy counter for scsi_mq"), the Scsi
    host busy counter was removed; it had been what stopped the LLDD from
    being sent more than .can_queue commands. However, it should still be
    ensured that the block layer does not issue more than .can_queue
    commands to the Scsi host.

    To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
    which may be requested at init time.

    The new flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
    tagset to indicate that the shared sbitmap should be used (see the
    sketch below).

    Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and
    requests is still allocated per hctx; the reason for this is that if
    tags and requests were only allocated for a single hctx - like hctx0 -
    it may break block drivers which expect a request to be associated with
    a specific hctx, i.e. not always hctx0. This does introduce extra
    memory usage.

    This change is based on work originally from Ming Lei in [1] and from
    Bart's suggestion in [2].

    [0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
    [1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
    [2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be

    Signed-off-by: John Garry
    Tested-by: Don Brace #SCSI resv cmds patches used
    Tested-by: Douglas Gilbert
    Signed-off-by: Jens Axboe

    John Garry
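
    A hedged sketch of an LLDD opting in at tag set allocation time; the
    driver function and fields are hypothetical, but BLK_MQ_F_TAG_HCTX_SHARED
    and blk_mq_alloc_tag_set() are the interfaces this commit targets:

        #include <linux/blk-mq.h>
        #include <linux/string.h>

        static int hypo_setup_tags(struct blk_mq_tag_set *set,
                                   const struct blk_mq_ops *ops,
                                   unsigned int nr_hw_queues,
                                   unsigned int can_queue)
        {
                memset(set, 0, sizeof(*set));
                set->ops = ops;
                set->nr_hw_queues = nr_hw_queues;  /* expose HBA HW queues */
                set->queue_depth = can_queue;      /* hostwide tag space */
                set->numa_node = NUMA_NO_NODE;
                set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_TAG_HCTX_SHARED;

                return blk_mq_alloc_tag_set(set);
        }

    With the flag set, the tags of all hctx's share one sbitmap, so the
    block layer itself enforces the .can_queue limit across HW queues.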
     
  • Pass hctx/tagset flags argument down to blk_mq_init_tags() and
    blk_mq_free_tags() for selective init/free.

    For now, make it include the alloc policy flag, which can be evaluated
    when needed (in blk_mq_init_tags()).

    Signed-off-by: John Garry
    Tested-by: Douglas Gilbert
    Signed-off-by: Jens Axboe

    John Garry
     

02 Jul, 2020

1 commit

  • This reverts the following commits:

    37f4a24c2469a10a4c16c641671bd766e276cf9f
    723bf178f158abd1ce6069cb049581b3cb003aab
    36a3df5a4574d5ddf59804fcd0c4e9654c514d9a

    The last one is the culprit, but we have to go a bit deeper to get this
    to revert cleanly. There's been a report that this breaks some MMC
    setups [1], and also causes an issue with swap [2]. Until this can be
    figured out, revert the offending commits.

    [1] https://lore.kernel.org/linux-block/57fb09b1-54ba-f3aa-f82c-d709b0e6b281@samsung.com/
    [2] https://lore.kernel.org/linux-block/20200702043721.GA1087@lca.pw/

    Reported-by: Marek Szyprowski
    Reported-by: Qian Cai
    Signed-off-by: Jens Axboe

    Jens Axboe
     

01 Jul, 2020

1 commit


30 Jun, 2020

3 commits

  • Pass the obtained budget count to blk_mq_dispatch_rq_list(), and prepare
    for supporting fully batched submission.

    With the obtained budget count, it is easier to put back extra budgets
    in case of .queue_rq failure (see the sketch below).

    Meanwhile, remove the old 'got_budget' parameter.

    Signed-off-by: Ming Lei
    Tested-by: Baolin Wang
    Reviewed-by: Christoph Hellwig
    Cc: Sagi Grimberg
    Cc: Baolin Wang
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lei
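
    The idea, sketched (the helper name is illustrative; in the kernel the
    put side goes through the queue's .put_budget callback):

        /* hand back budgets that were obtained but never consumed,
         * e.g. because .queue_rq returned BLK_STS_RESOURCE part-way */
        static void put_unused_budgets(struct request_queue *q,
                                       unsigned int nr_budgets)
        {
                while (nr_budgets--)
                        blk_mq_put_dispatch_budget(q);
        }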
     
  • All requests in the 'list' of blk_mq_dispatch_rq_list() belong to the
    same hctx, so it is better to pass the hctx instead of the request
    queue, because blk-mq's dispatch target is the hctx rather than the
    request queue.

    Signed-off-by: Ming Lei
    Tested-by: Baolin Wang
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Johannes Thumshirn
    Cc: Sagi Grimberg
    Cc: Baolin Wang
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • The blk-mq budget is abstracted from scsi's device queue depth, and it
    is always per-request-queue rather than per-hctx.

    It can be quite absurd to get a budget from one hctx, then dequeue a
    request from the scheduler queue, only for that request to not belong
    to this hctx, at least for bfq and deadline.

    So fix the mess and always pass the request queue to the get/put budget
    callbacks (see the sketch below).

    Signed-off-by: Ming Lei
    Tested-by: Baolin Wang
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Douglas Anderson
    Reviewed-by: Sagi Grimberg
    Cc: Sagi Grimberg
    Cc: Baolin Wang
    Cc: Christoph Hellwig
    Cc: Douglas Anderson
    Signed-off-by: Jens Axboe

    Ming Lei
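
    A sketch of what the reworked hooks look like for a hypothetical driver
    whose budget is a per-device queue depth (as with SCSI); struct hypo_dev
    and both callbacks are illustrative:

        struct hypo_dev {
                atomic_t        busy;
                int             queue_depth;
        };

        static bool hypo_get_budget(struct request_queue *q)
        {
                struct hypo_dev *dev = q->queuedata;

                if (atomic_inc_return(&dev->busy) > dev->queue_depth) {
                        atomic_dec(&dev->busy);
                        return false;   /* over depth: no budget */
                }
                return true;
        }

        static void hypo_put_budget(struct request_queue *q)
        {
                struct hypo_dev *dev = q->queuedata;

                atomic_dec(&dev->busy);
        }

    Both hooks now key off the request queue alone, so it no longer matters
    which hctx the budget was obtained from.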
     

29 Jun, 2020

1 commit


07 Jun, 2020

1 commit

  • Allocation of the driver tag in the case of using a scheduler shares very
    little code with the "normal" tag allocation. Split out a new helper to
    streamline this path, and untangle it from the complex normal tag
    allocation.

    This also avoids failing driver tag allocation because of an inactive
    hctx during CPU hotplug, and fixes a potential hang risk.

    Fixes: bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline")
    Signed-off-by: Ming Lei
    Signed-off-by: Christoph Hellwig
    Tested-by: John Garry
    Cc: Dongli Zhang
    Cc: Hannes Reinecke
    Cc: Daniel Wagner
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

30 May, 2020

1 commit


27 Feb, 2020

1 commit

  • The struct blk_mq_hw_ctx pointer argument in blk_mq_put_tag(),
    blk_mq_poll_nsecs(), and blk_mq_poll_hybrid_sleep() is unused, so remove
    it.

    Overall object code size shows a minor reduction. Before:

      text    data    bss    dec      hex     filename
      27306   1312    0      28618    6fca    block/blk-mq.o
      4303    272     0      4575     11df    block/blk-mq-tag.o

    After:

      text    data    bss    dec      hex     filename
      27282   1312    0      28594    6fb2    block/blk-mq.o
      4311    272     0      4583     11e7    block/blk-mq-tag.o

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Hannes Reinecke
    Signed-off-by: John Garry
    --
    This minor patch had been carried as part of the blk-mq shared tags RFC,
    I'd rather not carry it anymore as it required rebasing, so now or never..
    Signed-off-by: Jens Axboe

    John Garry
     

25 Feb, 2020

1 commit

  • For some reason, a device may get into a state where it can't handle
    FS requests: STS_RESOURCE is always returned and the FS request is
    added to hctx->dispatch. However, a passthrough request may be required
    at that time to fix the problem. If the passthrough request is added to
    the scheduler queue, there isn't any chance for blk-mq to dispatch it,
    given that we prioritize requests in hctx->dispatch. The FS IO request
    may then never be completed, and an IO hang results.

    So the passthrough request has to be added to hctx->dispatch directly
    to fix the IO hang.

    Fix this issue by inserting passthrough requests into hctx->dispatch
    directly, together with adding FS requests to the tail of
    hctx->dispatch in blk_mq_dispatch_rq_list(). We already add FS requests
    to the tail of hctx->dispatch by default; see
    blk_mq_request_bypass_insert() and the sketch below.

    This then becomes consistent with the original legacy IO request
    path, in which passthrough requests are always added to q->queue_head.

    Cc: Dongli Zhang
    Cc: Christoph Hellwig
    Cc: Ewan D. Milne
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
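
    The gist as a sketch; in the kernel the check sits inside the insert
    paths themselves, and the wrapper below is illustrative:

        /* passthrough requests bypass the elevator and go straight to
         * hctx->dispatch, where they are dispatched with priority */
        static void insert_rq(struct request *rq, bool at_head, bool run_queue)
        {
                if (blk_rq_is_passthrough(rq)) {
                        blk_mq_request_bypass_insert(rq, at_head, run_queue);
                        return;
                }

                /* FS requests keep going through the scheduler queue */
                blk_mq_sched_insert_request(rq, at_head, run_queue, false);
        }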
     

07 Oct, 2019

1 commit


11 Jul, 2019

1 commit

  • Simultaneously writing to a sequential zone of a zoned block device
    from multiple contexts requires mutual exclusion for BIO issuing to
    ensure that writes happen sequentially. However, even for a well
    behaved user correctly implementing such synchronization, BIO plugging
    may interfere and result in BIOs from the different contexts being
    reordered if plugging is done outside of the mutual exclusion section,
    e.g. if the plug was started by a function higher in the call chain
    than the function issuing BIOs, as in the sequence below (a sketch of
    a plugging rule that avoids this follows):

    Context A                                    Context B

    blk_start_plug()
    ...
    seq_write_zone()
      mutex_lock(zone)
      bio-0->bi_iter.bi_sector = zone->wp
      zone->wp += bio_sectors(bio-0)
      submit_bio(bio-0)
      bio-1->bi_iter.bi_sector = zone->wp
      zone->wp += bio_sectors(bio-1)
      submit_bio(bio-1)
      mutex_unlock(zone)
      return
    -------------------------------------------> seq_write_zone()
                                                    mutex_lock(zone)
                                                    bio-2->bi_iter.bi_sector = zone->wp
                                                    zone->wp += bio_sectors(bio-2)
                                                    submit_bio(bio-2)
                                                    mutex_unlock(zone)
    Signed-off-by: Jens Axboe

    Damien Le Moal
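
    One way to avoid the interference, sketched below: simply refuse to
    plug writes aimed at a zoned device, so a plug started higher in the
    call chain can never hold back and reorder sequential zone writes. The
    helper name is illustrative; blk_queue_is_zoned() and op_is_write()
    are existing block layer primitives.

        #include <linux/blkdev.h>
        #include <linux/sched.h>

        static inline struct blk_plug *zoned_safe_plug(struct request_queue *q,
                                                       struct bio *bio)
        {
                /* regular devices and reads: plugging stays beneficial */
                if (!blk_queue_is_zoned(q) || !op_is_write(bio_op(bio)))
                        return current->plug;

                /* zoned write: submit immediately, preserving order */
                return NULL;
        }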
     

03 Jul, 2019

1 commit

  • No code that occurs between blk_mq_get_ctx() and blk_mq_put_ctx() depends
    on preemption being disabled for its correctness. Since removing the CPU
    preemption calls does not measurably affect performance, simplify the
    blk-mq code by removing the blk_mq_put_ctx() function and also by not
    disabling preemption in blk_mq_get_ctx().

    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

04 May, 2019

1 commit

  • Once blk_cleanup_queue() returns, tags shouldn't be used any more,
    because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
    ("blk-mq: Fix a use-after-free") fixes exactly this issue.

    However, that commit introduces another issue. Before 45a9c9d909b2,
    we were allowed to run the queue during queue cleanup if the queue's
    kobj refcount was held. After that commit, the queue can't be run
    during queue cleanup, otherwise an oops can be triggered easily,
    because some fields of the hctx are freed by blk_mq_free_queue() in
    blk_cleanup_queue().

    We have invented ways of addressing this kind of issue before, such as:

    8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
    c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

    But these still can't cover all cases; recently James reported another
    such issue:

    https://marc.info/?l=linux-scsi&m=155389088124782&w=2

    This issue can be quite hard to address via the previous approaches,
    given scsi_run_queue() may run requeues for other LUNs.

    Fix the above issue by freeing the hctx's resources in its release
    handler; this is safe because tags aren't needed for freeing such hctx
    resources.

    This approach follows the typical design pattern for kobject release
    handlers (see the sketch below).

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reported-by: James Smart
    Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
    Cc: stable@vger.kernel.org
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
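
    The pattern, sketched and simplified: the release handler is the only
    place the hctx resources are freed, and it cannot run while sysfs (or
    anyone else) still holds a reference, so a concurrent queue run can no
    longer trip over freed fields.

        static void hctx_release(struct kobject *kobj)
        {
                struct blk_mq_hw_ctx *hctx =
                        container_of(kobj, struct blk_mq_hw_ctx, kobj);

                free_cpumask_var(hctx->cpumask);
                kfree(hctx->ctxs);
                kfree(hctx);
        }

        static struct kobj_type hctx_ktype = {
                .release        = hctx_release,
        };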
     

05 Apr, 2019

1 commit

  • blk_mq_try_issue_directly() can return BLK_STS*_RESOURCE for requests that
    have been queued. If that happens when blk_mq_try_issue_directly() is called
    by the dm-mpath driver then dm-mpath will try to resubmit a request that is
    already queued and a kernel crash follows. Since it is nontrivial to fix
    blk_mq_request_issue_directly(), revert the blk_mq_request_issue_directly()
    changes that went into kernel v5.0.

    This patch reverts the following commits:
    * d6a51a97c0b2 ("blk-mq: replace and kill blk_mq_request_issue_directly") # v5.0.
    * 5b7a6f128aad ("blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests") # v5.0.
    * 7f556a44e61d ("blk-mq: refactor the code of issue request directly") # v5.0.

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: James Smart
    Cc: Dongli Zhang
    Cc: Laurence Oberman
    Reported-by: Laurence Oberman
    Tested-by: Laurence Oberman
    Fixes: 7f556a44e61d ("blk-mq: refactor the code of issue request directly") # v5.0.
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

25 Mar, 2019

1 commit

  • Except for their arguments, blk_mq_put_driver_tag_hctx() and
    blk_mq_put_driver_tag() are the same. We can just use the 'request'
    argument to put the tag via blk_mq_put_driver_tag(), and then remove
    the unused blk_mq_put_driver_tag_hctx().

    Signed-off-by: Yufen Yu
    Signed-off-by: Jens Axboe

    Yufen Yu
     

21 Mar, 2019

1 commit


15 Feb, 2019

1 commit

  • Pull in 5.0-rc6 to avoid a dumb merge conflict with fs/iomap.c.
    This is needed since io_uring is now based on the block branch,
    to avoid a conflict between the multi-page bvecs and the bits
    of io_uring that touch the core block parts.

    * tag 'v5.0-rc6': (525 commits)
    Linux 5.0-rc6
    x86/mm: Make set_pmd_at() paravirt aware
    MAINTAINERS: Update the ocores i2c bus driver maintainer, etc
    blk-mq: remove duplicated definition of blk_mq_freeze_queue
    Blk-iolatency: warn on negative inflight IO counter
    blk-iolatency: fix IO hang due to negative inflight counter
    MAINTAINERS: unify reference to xen-devel list
    x86/mm/cpa: Fix set_mce_nospec()
    futex: Handle early deadlock return correctly
    futex: Fix barrier comment
    net: dsa: b53: Fix for failure when irq is not defined in dt
    blktrace: Show requests without sector
    mips: cm: reprime error cause
    mips: loongson64: remove unreachable(), fix loongson_poweroff().
    sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
    geneve: should not call rt6_lookup() when ipv6 was disabled
    KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
    KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222)
    kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)
    signal: Better detection of synchronous signals
    ...

    Jens Axboe
     

09 Feb, 2019

1 commit


01 Feb, 2019

2 commits

  • Currently, we check whether the hctx type is supported every time
    in the hot path. Actually, this is not necessary: we can save the
    default hctx into ctx->hctxs when mapping the sw queues if the type
    is not supported, and then use ctx->hctxs[type] directly.

    We also needn't check whether poll is enabled or not, because the
    caller will have cleared REQ_HIPRI in that case.

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     
  • Currently, the queue mapping result is saved in a two-dimensional
    array. In the hot path, getting a hctx requires the following lookup:

    q->queue_hw_ctx[q->tag_set->map[type].mq_map[cpu]]

    This isn't very efficient. We can instead save the queue mapping result
    into the ctx directly, per hctx type (see the sketch below), like:

    ctx->hctxs[type]

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
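
    Both patches boil down to the lookup below; a simplified sketch of
    blk_mq_map_queue() after the change:

        static inline struct blk_mq_hw_ctx *map_queue(struct request_queue *q,
                                                      unsigned int flags,
                                                      struct blk_mq_ctx *ctx)
        {
                enum hctx_type type = HCTX_TYPE_DEFAULT;

                if (flags & REQ_HIPRI)
                        type = HCTX_TYPE_POLL;
                else if ((flags & REQ_OP_MASK) == REQ_OP_READ)
                        type = HCTX_TYPE_READ;

                /* one load; unsupported types already point at the
                 * default hctx, installed when sw queues were mapped */
                return ctx->hctxs[type];
        }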
     

18 Dec, 2018

1 commit

  • When a request is added to the rq list of a sw queue (ctx), the rq may
    be for a different type of hctx, especially after multiple queue
    mapping is introduced.

    So when dispatching requests from a sw queue via blk_mq_flush_busy_ctxs()
    or blk_mq_dequeue_from_ctx(), a request belonging to a different hctx
    queue type can be dispatched to the current hctx when the read queue or
    poll queue is enabled.

    This patch fixes the issue by introducing per-queue-type lists
    (sketched below).

    Cc: Christoph Hellwig
    Signed-off-by: Ming Lei

    Changed by me to not use separately cacheline aligned lists, just
    place them all in the same cacheline where we had just the one list
    and lock before.

    Signed-off-by: Jens Axboe

    Ming Lei
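
    The resulting layout, sketched; per Jens' note above, the per-type
    lists share the lock and cacheline that the single list occupied
    before:

        struct blk_mq_ctx {
                struct {
                        spinlock_t              lock;
                        struct list_head        rq_lists[HCTX_MAX_TYPES];
                } ____cacheline_aligned_in_smp;

                /* ... remaining fields unchanged ... */
        };

    Flushing a ctx then walks only the list matching the current hctx's
    type, so requests can no longer leak across queue types.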
     

17 Dec, 2018

1 commit


16 Dec, 2018

1 commit


10 Dec, 2018

1 commit

  • The previous patches deleted all the code that needed the second value
    returned from part_in_flight - now the kernel only uses the first value.

    Consequently, part_in_flight (and blk_mq_in_flight) may be changed so that
    it only returns one value.

    This patch just refactors the code, there's no functional change.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     

05 Dec, 2018

1 commit

  • Having another indirect call in the fast path doesn't really help
    in our post-spectre world. Also, having too many queue types is just
    going to create confusion, so I'd rather manage them centrally.

    Note that the queue type naming and ordering changes a bit - the
    first index now is the default queue for everything not explicitly
    marked; the optional ones are the read and poll queues (see the sketch
    below).

    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
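
    The centrally managed types, sketched from the naming and ordering
    described above:

        enum hctx_type {
                HCTX_TYPE_DEFAULT,      /* all I/O not otherwise accounted */
                HCTX_TYPE_READ,         /* just for READ I/O */
                HCTX_TYPE_POLL,         /* polled I/O of any kind */

                HCTX_MAX_TYPES,
        };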
     

30 Nov, 2018

1 commit

  • If we are issuing a list of requests, we know when we're at the last
    one. If we fail issuing, ensure that we call ->commit_rqs() to flush
    any potential previous requests (see the sketch below).

    Reviewed-by: Omar Sandoval
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
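
    The safety net, sketched from the list-issue path: if issuing stops
    early after requests were queued with bd->last == false, ring the
    doorbell explicitly.

        /* partial issue: anything already queued was told more was
         * coming, so ask the driver to commit what it has */
        if (!list_empty(list) && hctx->queue->mq_ops->commit_rqs)
                hctx->queue->mq_ops->commit_rqs(hctx);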
     

21 Nov, 2018

1 commit

  • Even though .mq_kobj, ctx->kobj and q->kobj share the same lifetime
    from the block layer's view, they actually don't, because userspace may
    grab a kobject at any time via sysfs.

    This patch fixes the issue by the following approach (sketched below):

    1) introduce 'struct blk_mq_ctxs' for holding .mq_kobj and managing
    all ctxs

    2) free all allocated ctxs and the 'blk_mq_ctxs' instance in release
    handler of .mq_kobj

    3) grab one ref of .mq_kobj before initializing each ctx->kobj, so that
    .mq_kobj is always released after all ctxs are freed.

    This patch fixes kernel panic issue during booting when DEBUG_KOBJECT_RELEASE
    is enabled.

    Reported-by: Guenter Roeck
    Cc: "jianchao.wang"
    Tested-by: Guenter Roeck
    Reviewed-by: Greg Kroah-Hartman
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
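
    The resulting arrangement, sketched and simplified:

        struct blk_mq_ctxs {
                struct kobject          kobj;
                struct blk_mq_ctx __percpu *queue_ctx;
        };

        /* runs only after every ctx->kobj (each holding a ref on
         * .mq_kobj) and any sysfs reference has been dropped */
        static void blk_mq_ctxs_release(struct kobject *kobj)
        {
                struct blk_mq_ctxs *ctxs =
                        container_of(kobj, struct blk_mq_ctxs, kobj);

                free_percpu(ctxs->queue_ctx);
                kfree(ctxs);
        }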
     

08 Nov, 2018

7 commits

  • We call blk_mq_map_queue() a lot, at least two times per request for
    each IO, and sometimes more. Since we now have an indirect call as well
    in that function, cache the mapping so we don't have to re-call
    blk_mq_map_queue() for the same request multiple times.

    Reviewed-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Add support for the tag set carrying multiple queue maps, and
    for the driver to inform blk-mq how many it wishes to support
    through setting set->nr_maps.

    This adds an mq_ops helper for drivers that support more than 1
    map, mq_ops->rq_flags_to_type(). The function takes request/bio
    flags and CPU, and returns a queue map index for that. We then
    use the type information in blk_mq_map_queue() to index the map
    set.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The mapping used to be dependent on just the CPU location, but
    now it's a tuple of (type, cpu) instead. This is a prep patch
    for allowing a single software queue to map to multiple hardware
    queues. No functional changes in this patch.

    This changes the software queue count to an unsigned short
    to save a bit of space. We can still support 64K-1 CPUs,
    which should be enough. Add a check to catch a wrap.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Prep patch for being able to place requests based not just on
    CPU location, but also on the type of request.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Doesn't do anything right now, but it's needed as a prep patch
    to get the interfaces right.

    While in there, correct the blk_mq_map_queue() CPU type to an unsigned
    int.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This is in preparation for allowing multiple sets of maps per
    queue, if so desired.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Bart Van Assche
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • It's just a pointer to set->mq_map, use that instead. Move the
    assignment a bit earlier, so we always know it's valid.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Bart Van Assche
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe