14 Nov, 2020

1 commit

  • To avoid a use-after-free on the flush request, we call its .end_io()
    from both the timeout code path and __blk_mq_end_request().

    While the flush request's refcount hasn't dropped to zero it is still in
    use, so we can't mark it as IDLE yet. Fix this by marking the request
    IDLE only when its refcount really drops to zero.
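
    A sketch of the resulting tail of flush_end_io(), building on the
    refcount guard added by the earlier timeout fix (the merged hunk may
    differ in detail):

        if (!refcount_dec_and_test(&flush_rq->ref)) {
                fq->rq_status = error;
                spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
                return;
        }
        /*
         * Only now is the flush request truly unused, so only now may it
         * be marked IDLE; before this point the timeout path may still
         * be looking at it.
         */
        WRITE_ONCE(flush_rq->state, MQ_RQ_IDLE);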

    Fixes: 65ff5cd04551 ("blk-mq: mark flush request as IDLE in flush_end_io()")
    Signed-off-by: Ming Lei
    Cc: Yi Zhang
    Signed-off-by: Jens Axboe

    Ming Lei
     

30 Oct, 2020

1 commit

  • Mark the flush request as IDLE in its .end_io(), aligning it with how
    normal requests behave. The flush request stays in the in-flight tags if
    we're not using an IO scheduler, so we need to change its state to IDLE.
    Otherwise we will hang in blk_mq_tagset_wait_completed_request() during
    error recovery, because the flush request's state is left as COMPLETED.
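
    For context, blk_mq_tagset_wait_completed_request() spins until no
    tagged request is still in COMPLETED state, roughly like this (a sketch
    of the helper as it existed at the time):

        static bool blk_mq_tagset_count_completed_rqs(struct request *rq,
                        void *data, bool reserved)
        {
                unsigned *count = data;

                if (blk_mq_request_completed(rq))
                        (*count)++;
                return true;
        }

        void blk_mq_tagset_wait_completed_request(struct blk_mq_tag_set *tagset)
        {
                while (true) {
                        unsigned count = 0;

                        blk_mq_tagset_busy_iter(tagset,
                                blk_mq_tagset_count_completed_rqs, &count);
                        if (!count)
                                break;
                        msleep(5);
                }
        }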

    Reported-by: Yi Zhang
    Signed-off-by: Ming Lei
    Tested-by: Yi Zhang
    Cc: Chao Leng
    Cc: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Ming Lei
     

12 Aug, 2020

1 commit

  • With the none scheduler we share the data request's driver tag with the
    flush request, so we have to mark the flush request as INFLIGHT to avoid
    accounting that driver tag twice.
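
    Roughly, in blk_kick_flush() (a sketch; RQF_MQ_INFLIGHT is the flag that
    in-flight accounting checks for requests holding a driver tag):

        if (!q->elevator) {
                flush_rq->tag = first_rq->tag;
                /*
                 * We borrow the data request's driver tag, so mark the
                 * flush request INFLIGHT to avoid counting the tag twice.
                 */
                flush_rq->rq_flags |= RQF_MQ_INFLIGHT;
        } else
                flush_rq->internal_tag = first_rq->internal_tag;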

    Fixes: 568f27006577 ("blk-mq: centralise related handling into blk_mq_get_driver_tag")
    Reported-by: Matthew Wilcox
    Signed-off-by: Ming Lei
    Tested-by: Matthew Wilcox
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lei
     

17 Jul, 2020

1 commit

  • Commit 7520872c0cf4 ("block: don't defer flushes on blk-mq + scheduling")
    tried to fix a deadlock caused by a cyclic wait between flush requests
    and the data requests on flush_data_in_flight: the former held all
    driver tags while waiting for the data requests to complete, but the
    latter could not complete because they were waiting for free driver
    tags.

    After commit 923218f6166a ("blk-mq: don't allocate driver tag upfront
    for flush rq"), flush requests no longer get a driver tag before being
    queued into the flush queue.

    * With an elevator, a flush request only gets a sched_tag before being
    inserted into the flush queue, and doesn't get a driver tag until it is
    issued to the driver, so the data requests on fq->flush_data_in_flight
    will complete eventually.

    * Without an elevator, each flush request gets a driver tag when the
    request is allocated, so the data requests on fq->flush_data_in_flight
    never lack driver tags.

    In both cases the cyclic wait cannot occur, so we may allow flush
    requests to be deferred again.

    Signed-off-by: Yufen Yu
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Yufen Yu
     

09 Jul, 2020

1 commit


02 Jul, 2020

1 commit

  • This reverts the following commits:

    37f4a24c2469a10a4c16c641671bd766e276cf9f
    723bf178f158abd1ce6069cb049581b3cb003aab
    36a3df5a4574d5ddf59804fcd0c4e9654c514d9a

    The last one is the culprit, but we have to go a bit deeper to get this
    to revert cleanly. There's been a report that this breaks some MMC
    setups [1], and also causes an issue with swap [2]. Until this can be
    figured out, revert the offending commits.

    [1] https://lore.kernel.org/linux-block/57fb09b1-54ba-f3aa-f82c-d709b0e6b281@samsung.com/
    [2] https://lore.kernel.org/linux-block/20200702043721.GA1087@lca.pw/

    Reported-by: Marek Szyprowski
    Reported-by: Qian Cai
    Signed-off-by: Jens Axboe

    Jens Axboe
     

01 Jul, 2020

1 commit


29 Jun, 2020

1 commit

  • It is natural to release the driver tag when a request is completed by
    the LLD or device, since the tag's purpose is LLD use.

    One big benefit is that the released tag can be reused sooner, given
    that bio_endio() may take quite a while.

    Meanwhile, we don't need to release the driver tag for flush requests.
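
    A sketch of the idea (hypothetical placement in __blk_mq_end_request();
    the merged patch wires the put into the completion paths):

        void __blk_mq_end_request(struct request *rq, blk_status_t error)
        {
                /*
                 * Let the driver tag be reused before bio_endio() runs;
                 * flush requests are skipped since the flush machinery
                 * manages their tag itself.
                 */
                if (!(rq->rq_flags & RQF_FLUSH_SEQ))
                        blk_mq_put_driver_tag(rq);

                /* ... then account the I/O and complete the bios ... */
        }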

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

22 May, 2020

2 commits


19 May, 2020

1 commit

  • The flush_queue_delayed flag was introduced by commit 3ac0cc450870
    ("hold queue if flush is running for non-queueable flush drive") to hold
    the queue while a flush is running on a non-queueable flush drive. The
    non-mq parts of the flush code, along with every use of the
    flush_queue_delayed flag, were removed by commit 7e992f847a08 ("block:
    remove non mq parts from the flush code"), so remove the now-unused
    flag.

    Signed-off-by: Baolin Wang
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Baolin Wang
     

28 Mar, 2020

1 commit


12 Mar, 2020

2 commits


25 Feb, 2020

1 commit

  • For some reason a device may get into a state in which it can't handle
    FS requests: STS_RESOURCE is always returned, and the FS request is
    re-added to hctx->dispatch. A passthrough request may be required at
    that point to fix the problem, but if the passthrough request is added
    to the scheduler queue, blk-mq never gets a chance to dispatch it, given
    that we prioritize requests in hctx->dispatch. The FS IO request may
    then never complete, causing an IO hang.

    So the passthrough request has to be added to hctx->dispatch directly
    to fix the IO hang.

    Fix this issue by inserting passthrough requests directly into
    hctx->dispatch, mirroring how blk_mq_dispatch_rq_list() adds FS requests
    to the tail of hctx->dispatch. (We already add FS requests to the tail
    of hctx->dispatch by default; see blk_mq_request_bypass_insert().)

    This also makes the behaviour consistent with the original legacy IO
    request path, in which passthrough requests were always added to
    q->queue_head.
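
    A sketch of the insert-side check (the at_head/run_queue plumbing is
    simplified here):

        /* in blk_mq_sched_insert_request() (sketch): */
        if (blk_rq_is_passthrough(rq)) {
                /*
                 * Bypass the scheduler queue: passthrough requests go on
                 * hctx->dispatch directly so they are always visible to
                 * the dispatch code.
                 */
                blk_mq_request_bypass_insert(rq, at_head, false);
                goto run;
        }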

    Cc: Dongli Zhang
    Cc: Christoph Hellwig
    Cc: Ewan D. Milne
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

21 Dec, 2019

1 commit

  • Avoid the following false-positive lockdep complaint, triggered by
    running test nvme/012 from the blktests suite:

    ============================================
    WARNING: possible recursive locking detected
    5.0.0-rc3-xfstests-00015-g1236f7d60242 #841 Not tainted
    --------------------------------------------
    ksoftirqd/1/16 is trying to acquire lock:
    000000000282032e (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

    but task is already holding lock:
    00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&(&fq->mq_flush_lock)->rlock);
    lock(&(&fq->mq_flush_lock)->rlock);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    1 lock held by ksoftirqd/1/16:
    #0: 00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

    stack backtrace:
    CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.0.0-rc3-xfstests-00015-g1236f7d60242 #841
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    dump_stack+0x67/0x90
    __lock_acquire.cold.45+0x2b4/0x313
    lock_acquire+0x98/0x160
    _raw_spin_lock_irqsave+0x3b/0x80
    flush_end_io+0x4e/0x1d0
    blk_mq_complete_request+0x76/0x110
    nvmet_req_complete+0x15/0x110 [nvmet]
    nvmet_bio_done+0x27/0x50 [nvmet]
    blk_update_request+0xd7/0x2d0
    blk_mq_end_request+0x1a/0x100
    blk_flush_complete_seq+0xe5/0x350
    flush_end_io+0x12f/0x1d0
    blk_done_softirq+0x9f/0xd0
    __do_softirq+0xca/0x440
    run_ksoftirqd+0x24/0x50
    smpboot_thread_fn+0x113/0x1e0
    kthread+0x121/0x140
    ret_from_fork+0x3a/0x50
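
    The fix gives every flush queue's lock its own lockdep class, so that
    acquiring the mq_flush_lock of a queue stacked on top of another (as the
    NVMe target setup in nvme/012 does) is no longer mistaken for recursion.
    A sketch, assuming a new 'struct lock_class_key key' member in struct
    blk_flush_queue:

        /* in blk_alloc_flush_queue(): */
        spin_lock_init(&fq->mq_flush_lock);
        /* one lock class per flush queue instead of a shared one */
        lockdep_register_key(&fq->key);
        lockdep_set_class(&fq->mq_flush_lock, &fq->key);

        /* in blk_free_flush_queue(): */
        lockdep_unregister_key(&fq->key);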

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

22 Nov, 2019

1 commit

  • Requests that trigger flushing of the volatile writeback cache to disk
    (barriers) have a significant effect on overall performance.

    The block layer has a sophisticated engine for combining several flush
    requests into one, but there are no statistics on the flushes actually
    executed by the disk. The requests that trigger flushes are usually
    barriers - zero-size writes.

    This patch adds two iostat counters to /sys/class/block/$dev/stat and
    /proc/diskstats - the count of completed flush requests and their total
    time.
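
    A sketch of the completion-side accounting (assuming a new STAT_FLUSH
    slot next to STAT_READ/STAT_WRITE/STAT_DISCARD in the per-disk stats):

        static void blk_account_io_flush(struct request *rq)
        {
                struct hd_struct *part = &rq->rq_disk->part0;

                part_stat_lock();
                part_stat_inc(part, ios[STAT_FLUSH]);
                part_stat_add(part, nsecs[STAT_FLUSH],
                              ktime_get_ns() - rq->start_time_ns);
                part_stat_unlock();
        }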

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe

    Konstantin Khlebnikov
     

27 Sep, 2019

1 commit

  • We got a NULL pointer dereference BUG in blk_mq_rq_timed_out(), as
    follows:

    [ 108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
    [ 108.827059] PGD 0 P4D 0
    [ 108.827313] Oops: 0000 [#1] SMP PTI
    [ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
    [ 108.829503] Workqueue: kblockd blk_mq_timeout_work
    [ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
    [ 108.838191] Call Trace:
    [ 108.838406] bt_iter+0x74/0x80
    [ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450
    [ 108.839074] ? __switch_to_asm+0x34/0x70
    [ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40
    [ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40
    [ 108.840273] ? syscall_return_via_sysret+0xf/0x7f
    [ 108.840732] blk_mq_timeout_work+0x74/0x200
    [ 108.841151] process_one_work+0x297/0x680
    [ 108.841550] worker_thread+0x29c/0x6f0
    [ 108.841926] ? rescuer_thread+0x580/0x580
    [ 108.842344] kthread+0x16a/0x1a0
    [ 108.842666] ? kthread_flush_work+0x170/0x170
    [ 108.843100] ret_from_fork+0x35/0x40

    The bug is caused by a race between the timeout handler and the
    completion of the flush request.

    When the timeout handler blk_mq_rq_timed_out() tries to read
    'req->q->mq_ops', the 'req' has already been completed and
    reinitialized by the next flush request, which calls blk_rq_init() to
    zero out 'req'.

    After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"), the
    lifetime of a normal request is protected by a refcount: the request
    cannot really be freed until 'rq->ref' drops to zero, so it cannot be
    reused before the timeout handler finishes.

    However, the flush request defines its own .end_io, and rq->end_io() is
    called even when 'rq->ref' hasn't dropped to zero. After that, the
    'flush_rq' can be reused for handling the next flush request, resulting
    in the NULL pointer dereference above.

    Fix this problem by covering the flush request with 'rq->ref' as well:
    if the refcount is not zero, flush_end_io() returns and waits for the
    last holder to call it again. To record the request status, add a new
    field, 'rq_status', which is consulted in flush_end_io().
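
    A sketch of the resulting guard at the top of flush_end_io()
    ('rq_status' lives in struct blk_flush_queue, protected by
    fq->mq_flush_lock):

        static void flush_end_io(struct request *flush_rq, blk_status_t error)
        {
                struct request_queue *q = flush_rq->q;
                struct blk_flush_queue *fq =
                        blk_get_flush_queue(q, flush_rq->mq_ctx);
                unsigned long flags;

                spin_lock_irqsave(&fq->mq_flush_lock, flags);
                if (!refcount_dec_and_test(&flush_rq->ref)) {
                        /* e.g. the timeout handler still holds a ref */
                        fq->rq_status = error;
                        spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
                        return;
                }
                if (fq->rq_status != BLK_STS_OK)
                        error = fq->rq_status;

                /* ... proceed with completing the flush sequence ... */
        }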

    Cc: Christoph Hellwig
    Cc: Keith Busch
    Cc: Bart Van Assche
    Cc: stable@vger.kernel.org # v4.18+
    Reviewed-by: Ming Lei
    Reviewed-by: Bob Liu
    Signed-off-by: Yufen Yu

    -------
    v2:
    - move rq_status from struct request to struct blk_flush_queue
    v3:
    - remove unnecessary '{}' pair.
    v4:
    - let spinlock to protect 'fq->rq_status'
    v5:
    - move rq_status after flush_running_idx member of struct blk_flush_queue
    Signed-off-by: Jens Axboe

    Yufen Yu
     

01 May, 2019

1 commit


25 Mar, 2019

1 commit

  • Except for their arguments, blk_mq_put_driver_tag_hctx() and
    blk_mq_put_driver_tag() are the same. We can just use the 'request'
    argument and put the tag with blk_mq_put_driver_tag(), then remove the
    now-unused blk_mq_put_driver_tag_hctx().

    Signed-off-by: Yufen Yu
    Signed-off-by: Jens Axboe

    Yufen Yu
     

30 Jan, 2019

1 commit

  • Florian reported an I/O hang when running fsync(). It is triggered by
    the following race between a data + post flush request and a concurrent
    flush:

    data + post flush                 a flush

    blk_flush_complete_seq
      case REQ_FSEQ_DATA
        blk_flush_queue_rq
      issued to driver                blk_mq_dispatch_rq_list
                                        try to issue a flush req
                                        failed due to NON-NCQ command
                                        .queue_rq return BLK_STS_DEV_RESOURCE

    request completion
      req->end_io // doesn't check RESTART
      mq_flush_data_end_io
        case REQ_FSEQ_POSTFLUSH
          blk_kick_flush
            do nothing because previous flush
            has not been completed
          blk_mq_run_hw_queue
                                        insert rq to hctx->dispatch
                                        due to RESTART is still set, do nothing

    To fix this, replace the blk_mq_run_hw_queue in mq_flush_data_end_io
    with blk_mq_sched_restart to check and clear the RESTART flag.
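
    A sketch of the fix in mq_flush_data_end_io() (blk_mq_sched_restart()
    clears the RESTART bit and reruns the hardware queue if needed):

        static void mq_flush_data_end_io(struct request *rq, blk_status_t error)
        {
                struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
                /* ... */

                spin_lock_irqsave(&fq->mq_flush_lock, flags);
                blk_flush_complete_seq(rq, fq, REQ_FSEQ_DATA, error);
                spin_unlock_irqrestore(&fq->mq_flush_lock, flags);

                /* was: blk_mq_run_hw_queue(hctx, true); */
                blk_mq_sched_restart(hctx);
        }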

    Fixes: bd166ef1 (blk-mq-sched: add framework for MQ capable IO schedulers)
    Reported-by: Florian Stecker
    Tested-by: Florian Stecker
    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

16 Nov, 2018

2 commits


08 Nov, 2018

4 commits


14 Oct, 2018

1 commit


09 Jun, 2018

1 commit

  • A recent commit reused the original request flags for the flush queue
    handling. However, for some of the kick-flush cases, the original
    request was already completed. This caused a use-after-free if blk-mq
    wasn't used.

    Fixes: 84fca1b0c461 ("block: pass failfast and driver-specific flags to flush requests")
    Reported-by: Dmitry Vyukov
    Signed-off-by: Jens Axboe

    Jens Axboe
     

06 Jun, 2018

1 commit


05 Nov, 2017

3 commits

  • The idea behind it is simple:

    1) for the none scheduler, the driver tag has to be borrowed for the
    flush rq, otherwise we may run out of tags and cause an IO hang. And
    get/put of the driver tag is actually a noop for none, so reordering
    tags isn't necessary at all.

    2) for a real I/O scheduler, we need not allocate a driver tag upfront
    for the flush rq. It works just fine to follow the same approach as
    normal requests: allocate the driver tag for each rq just before
    calling ->queue_rq().

    One driver-visible change is that the driver tag isn't shared in the
    flush request sequence. That won't be a problem, since we always do it
    that way in the legacy path.

    Then the flush rq need not be treated specially wrt. get/put of the
    driver tag. This cleans up the code - for instance,
    reorder_tags_to_front() can be removed, and we needn't worry about
    request ordering in the dispatch list for avoiding I/O deadlock.

    Also we have to put the driver tag before requeueing; the resulting
    blk_kick_flush() tag setup is sketched below.
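
    A sketch of the resulting tag setup in blk_kick_flush() (4.15-era
    blk-mq; the merged hunk may differ in detail):

        if (q->mq_ops) {
                struct blk_mq_hw_ctx *hctx;

                flush_rq->mq_ctx = first_rq->mq_ctx;

                if (!q->elevator) {
                        /* none: borrow the data rq's driver tag */
                        fq->orig_rq = first_rq;
                        flush_rq->tag = first_rq->tag;
                        hctx = blk_mq_map_queue(q, first_rq->mq_ctx->cpu);
                        blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq);
                } else {
                        /* real scheduler: only take a sched tag here; the
                         * driver tag is allocated at ->queue_rq() time */
                        flush_rq->internal_tag = first_rq->internal_tag;
                }
        }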

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • In the following patch, we will use RQF_FLUSH_SEQ to decide:

    1) if the flag isn't set, the flush rq needs to be inserted via
    blk_insert_flush()

    2) otherwise, the flush rq needs to be dispatched directly, since it is
    in the flush machinery now.

    So we use blk_mq_request_bypass_insert() for requests bypassing the
    flush machinery, just like the legacy path did.
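
    Condensed, the decision looks roughly like this (a hypothetical sketch;
    the second argument of blk_mq_request_bypass_insert() is the run_queue
    flag added later in this series):

        if (op_is_flush(rq->cmd_flags) && !(rq->rq_flags & RQF_FLUSH_SEQ)) {
                /* not yet in the flush machinery: route it there */
                blk_insert_flush(rq);
        } else {
                /* already sequenced by the flush machinery: bypass the
                 * scheduler and go straight to the dispatch list */
                blk_mq_request_bypass_insert(rq, false);
        }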

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • blk_insert_flush() should only insert the request, since a queue run
    always follows it.

    In the case of bypassing flush, we don't need to run the queue either,
    because every blk_insert_flush() is already followed by one queue run.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

26 Aug, 2017

1 commit


24 Aug, 2017

1 commit

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different lifetime rules from the gendisk and
    request_queue, and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping, we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want the request_queue or
    sometimes the gendisk, so this removes a layer of indirection all over
    the stack.
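
    Callers that do hold a block_device can switch to a helper along these
    lines (a sketch of the bio_set_dev() idea from this series):

        /* point a bio at a disk + partition instead of a block_device */
        #define bio_set_dev(bio, bdev)                          \
        do {                                                    \
                (bio)->bi_disk = (bdev)->bd_disk;               \
                (bio)->bi_partno = (bdev)->bd_partno;           \
        } while (0)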

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

21 Jun, 2017

1 commit

  • Instead of documenting the locking assumptions of most block layer
    functions as a comment, use lockdep_assert_held() to verify locking
    assumptions at runtime.
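
    For example, in blk-flush (a sketch of the pattern; the comment it
    replaces said "must be called with queue lock held"):

        void blk_insert_flush(struct request *rq)
        {
                struct request_queue *q = rq->q;

                /* verified at runtime instead of promised in a comment */
                lockdep_assert_held(q->queue_lock);
                /* ... */
        }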

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

09 Jun, 2017

1 commit

  • Currently we use normal Linux errno values in the block layer, and while
    we accept any error a few have overloaded magic meanings. This patch
    instead introduces a new blk_status_t value that holds block layer
    specific status codes and explicitly explains their meaning. Helpers to
    convert from and to the previous special meanings are provided for now,
    but I suspect we want to get rid of them in the long run - those
    drivers that have an errno input (e.g. networking) usually get errnos
    that don't know about the special block layer overloads, and similarly
    returning them to userspace will usually return something that strictly
    speaking isn't correct for file system operations, but that's left as
    an exercise for later.

    For now the set of errors is a very limited set that closely
    corresponds to the previous overloaded errno values, but there is some
    low-hanging fruit to improve it.

    blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse
    typechecking, so that we can easily catch places passing the wrong values.
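
    The shape of the new type and its boundary helpers (a sketch; only a
    few of the BLK_STS_* codes are shown):

        typedef u8 __bitwise blk_status_t;

        #define BLK_STS_OK              0
        #define BLK_STS_NOTSUPP         ((__force blk_status_t)1)
        #define BLK_STS_MEDIUM          ((__force blk_status_t)7)
        #define BLK_STS_IOERR           ((__force blk_status_t)10)

        /* convert at the boundaries where plain errnos still live */
        blk_status_t errno_to_blk_status(int errno);
        int blk_status_to_errno(blk_status_t status);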

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

20 Apr, 2017

1 commit


25 Mar, 2017

1 commit


18 Feb, 2017

1 commit

  • For blk-mq with scheduling, we can potentially end up with ALL driver
    tags assigned and sitting on the flush queues. If we defer because of an
    in-flight data request, then we can deadlock if that data request
    doesn't already have a tag assigned.

    This fixes a deadlock when running the xfs/297 xfstest, where thousands
    of syncs can cause the drive queue to stall.

    Signed-off-by: Jens Axboe
    Reviewed-by: Omar Sandoval

    Jens Axboe