21 Mar, 2020

1 commit

  • [ Upstream commit 01e99aeca3979600302913cef3f89076786f32c8 ]

    For some reason, a device may get into a state where it can't handle
    FS requests, so STS_RESOURCE is always returned and the FS request
    is added to hctx->dispatch. However, a passthrough request may be
    required at that time to fix the problem. If the passthrough request
    is added to the scheduler queue, blk-mq never gets a chance to
    dispatch it, because requests in hctx->dispatch are prioritized.
    The FS I/O request may then never complete, and an I/O hang results.

    So the passthrough request has to be added to hctx->dispatch directly
    to fix the I/O hang.

    Fix this issue by inserting the passthrough request into hctx->dispatch
    directly, together with adding the FS request to the tail of
    hctx->dispatch in blk_mq_dispatch_rq_list(). Actually we already add
    FS requests to the tail of hctx->dispatch by default; see
    blk_mq_request_bypass_insert().

    This makes it consistent with the original legacy I/O request path,
    in which passthrough requests are always added to q->queue_head.
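
    Schematically, the resulting insert decision looks like the sketch
    below (a simplified illustration of the behaviour described above,
    not the upstream diff; insert_request_sketch() and scheduler_insert()
    are hypothetical names):

        /*
         * Illustrative sketch only: route passthrough requests straight to
         * hctx->dispatch so they bypass the I/O scheduler.
         */
        static void insert_request_sketch(struct blk_mq_hw_ctx *hctx,
                                          struct request *rq, bool at_head)
        {
                if (blk_rq_is_passthrough(rq)) {
                        /* Add directly to the dispatch list, as
                         * blk_mq_request_bypass_insert() does. */
                        spin_lock(&hctx->lock);
                        if (at_head)
                                list_add(&rq->queuelist, &hctx->dispatch);
                        else
                                list_add_tail(&rq->queuelist, &hctx->dispatch);
                        spin_unlock(&hctx->lock);
                        return;
                }

                /* FS requests keep going through the I/O scheduler path. */
                scheduler_insert(hctx, rq, at_head);
        }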

    Cc: Dongli Zhang
    Cc: Christoph Hellwig
    Cc: Ewan D. Milne
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Ming Lei
     

12 Jan, 2020

1 commit

  • [ Upstream commit b3c6a59975415bde29cfd76ff1ab008edbf614a9 ]

    Avoid the following false-positive lockdep complaint, triggered by
    running test nvme/012 from the blktests suite:

    ============================================
    WARNING: possible recursive locking detected
    5.0.0-rc3-xfstests-00015-g1236f7d60242 #841 Not tainted
    --------------------------------------------
    ksoftirqd/1/16 is trying to acquire lock:
    000000000282032e (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

    but task is already holding lock:
    00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&(&fq->mq_flush_lock)->rlock);
    lock(&(&fq->mq_flush_lock)->rlock);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    1 lock held by ksoftirqd/1/16:
    #0: 00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

    stack backtrace:
    CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.0.0-rc3-xfstests-00015-g1236f7d60242 #841
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    dump_stack+0x67/0x90
    __lock_acquire.cold.45+0x2b4/0x313
    lock_acquire+0x98/0x160
    _raw_spin_lock_irqsave+0x3b/0x80
    flush_end_io+0x4e/0x1d0
    blk_mq_complete_request+0x76/0x110
    nvmet_req_complete+0x15/0x110 [nvmet]
    nvmet_bio_done+0x27/0x50 [nvmet]
    blk_update_request+0xd7/0x2d0
    blk_mq_end_request+0x1a/0x100
    blk_flush_complete_seq+0xe5/0x350
    flush_end_io+0x12f/0x1d0
    blk_done_softirq+0x9f/0xd0
    __do_softirq+0xca/0x440
    run_ksoftirqd+0x24/0x50
    smpboot_thread_fn+0x113/0x1e0
    kthread+0x121/0x140
    ret_from_fork+0x3a/0x50
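
    The report is a per-instance false positive: the mq_flush_lock of two
    different flush queues shares a single lock class. A common way to
    silence this kind of complaint is to give each instance its own lockdep
    class key, roughly as sketched below (an assumption about the shape of
    the fix, not necessarily this commit's exact change):

        struct blk_flush_queue {
                spinlock_t              mq_flush_lock;
                struct lock_class_key   key;    /* assumed per-instance key */
                /* ... */
        };

        static void init_flush_queue_lock(struct blk_flush_queue *fq)
        {
                spin_lock_init(&fq->mq_flush_lock);
                /*
                 * With a per-instance lock class, acquiring the flush lock
                 * of a second queue while holding the first one is no
                 * longer reported as recursive locking.
                 */
                lockdep_register_key(&fq->key);
                lockdep_set_class(&fq->mq_flush_lock, &fq->key);
        }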

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Bart Van Assche
     

27 Sep, 2019

1 commit

  • We got a NULL pointer dereference BUG_ON in blk_mq_rq_timed_out(),
    as follows:

    [ 108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
    [ 108.827059] PGD 0 P4D 0
    [ 108.827313] Oops: 0000 [#1] SMP PTI
    [ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
    [ 108.829503] Workqueue: kblockd blk_mq_timeout_work
    [ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
    [ 108.838191] Call Trace:
    [ 108.838406] bt_iter+0x74/0x80
    [ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450
    [ 108.839074] ? __switch_to_asm+0x34/0x70
    [ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40
    [ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40
    [ 108.840273] ? syscall_return_via_sysret+0xf/0x7f
    [ 108.840732] blk_mq_timeout_work+0x74/0x200
    [ 108.841151] process_one_work+0x297/0x680
    [ 108.841550] worker_thread+0x29c/0x6f0
    [ 108.841926] ? rescuer_thread+0x580/0x580
    [ 108.842344] kthread+0x16a/0x1a0
    [ 108.842666] ? kthread_flush_work+0x170/0x170
    [ 108.843100] ret_from_fork+0x35/0x40

    The bug is caused by a race between the timeout handler and completion
    of a flush request.

    When the timeout handler blk_mq_rq_timed_out() tries to read
    'req->q->mq_ops', the 'req' has already been completed and reinitialized
    by the next flush request, which calls blk_rq_init() to clear 'req' to 0.

    After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"), the
    lifetime of normal requests is protected by a refcount. Only once
    'rq->ref' drops to zero can the request really be freed. Thus, these
    requests cannot be reused before the timeout handler finishes.

    However, the flush request defines .end_io, and rq->end_io() is still
    called even if 'rq->ref' hasn't dropped to zero. After that, 'flush_rq'
    can be reused by handling of the next flush request, resulting in the
    NULL pointer dereference BUG_ON.

    We fix this problem by covering the flush request with 'rq->ref' as
    well. If the refcount is not zero, flush_end_io() returns and waits
    for the last holder to call it again. To record the request status,
    we add a new field, 'rq_status', which will be used in flush_end_io().
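
    In other words, flush_end_io() only tears down the flush sequence once
    it holds the last reference; a rough sketch of that guard (simplified
    from the description above, not the exact patch):

        static void flush_end_io_sketch(struct request *flush_rq, blk_status_t error)
        {
                struct blk_flush_queue *fq =
                        blk_get_flush_queue(flush_rq->q, flush_rq->mq_ctx);
                unsigned long flags;

                spin_lock_irqsave(&fq->mq_flush_lock, flags);
                if (!refcount_dec_and_test(&flush_rq->ref)) {
                        /*
                         * A timeout handler (or another holder) still
                         * references the request: record the status and let
                         * the last holder finish the work.
                         */
                        fq->rq_status = error;
                        spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
                        return;
                }

                /* ... complete the flush sequence as before ... */
                spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
        }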

    Cc: Christoph Hellwig
    Cc: Keith Busch
    Cc: Bart Van Assche
    Cc: stable@vger.kernel.org # v4.18+
    Reviewed-by: Ming Lei
    Reviewed-by: Bob Liu
    Signed-off-by: Yufen Yu

    -------
    v2:
    - move rq_status from struct request to struct blk_flush_queue
    v3:
    - remove unnecessary '{}' pair.
    v4:
    - let a spinlock protect 'fq->rq_status'
    v5:
    - move rq_status after flush_running_idx member of struct blk_flush_queue
    Signed-off-by: Jens Axboe

    Yufen Yu
     

25 Mar, 2019

1 commit

  • Except for their arguments, blk_mq_put_driver_tag_hctx() and
    blk_mq_put_driver_tag() are the same. We can just use the 'request'
    argument and put the tag with blk_mq_put_driver_tag(), and then remove
    the now unused blk_mq_put_driver_tag_hctx().

    Signed-off-by: Yufen Yu
    Signed-off-by: Jens Axboe

    Yufen Yu
     

30 Jan, 2019

1 commit

  • Florian reported an I/O hang when running fsync(). It is triggered by
    the following race condition.

    data + post flush                    a flush

    blk_flush_complete_seq
      case REQ_FSEQ_DATA
        blk_flush_queue_rq
      issued to driver
                                         blk_mq_dispatch_rq_list
                                           try to issue a flush req
                                           failed due to NON-NCQ command
                                           .queue_rq return BLK_STS_DEV_RESOURCE

    request completion
      req->end_io // doesn't check RESTART
      mq_flush_data_end_io
        case REQ_FSEQ_POSTFLUSH
          blk_kick_flush
            do nothing because previous flush
            has not been completed
        blk_mq_run_hw_queue
                                         insert rq to hctx->dispatch
                                         due to RESTART is still set, do nothing

    To fix this, replace the blk_mq_run_hw_queue in mq_flush_data_end_io
    with blk_mq_sched_restart to check and clear the RESTART flag.
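
    A minimal sketch of that change (illustrative only; the real function
    does more than shown, and the hctx lookup is abbreviated here):

        static void mq_flush_data_end_io(struct request *rq, blk_status_t error)
        {
                struct blk_mq_hw_ctx *hctx = rq->mq_hctx;  /* lookup assumed */

                /* ... flush sequencing under fq->mq_flush_lock ... */

                /*
                 * Was: blk_mq_run_hw_queue(hctx, true), which ignores the
                 * RESTART flag, so the requeued flush is never re-dispatched.
                 * blk_mq_sched_restart() checks and clears RESTART and
                 * re-runs the queue when needed.
                 */
                blk_mq_sched_restart(hctx);
        }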

    Fixes: bd166ef1 (blk-mq-sched: add framework for MQ capable IO schedulers)
    Reported-by: Florian Stecker
    Tested-by: Florian Stecker
    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

09 Jun, 2018

1 commit

  • A recent commit reused the original request flags for the flush
    queue handling. However, for some of the kick-flush cases, the
    original request had already been completed. This caused a
    use-after-free if blk-mq wasn't used.

    Fixes: 84fca1b0c461 ("block: pass failfast and driver-specific flags to flush requests")
    Reported-by: Dmitry Vyukov
    Signed-off-by: Jens Axboe

    Jens Axboe
     

05 Nov, 2017

3 commits

  • The idea behind it is simple:

    1) for the none scheduler, the driver tag has to be borrowed for the
    flush rq, otherwise we may run out of tags and cause an I/O hang. And
    get/put of the driver tag is actually a no-op for none, so reordering
    tags isn't necessary at all.

    2) for a real I/O scheduler, we need not allocate a driver tag upfront
    for the flush rq. It works just fine to follow the same approach as
    normal requests: allocate the driver tag for each rq just before calling
    ->queue_rq().

    One driver-visible change is that the driver tag isn't shared in the
    flush request sequence. That won't be a problem, since we always do
    that in the legacy path.

    Then the flush rq need not be treated specially wrt. get/put of the
    driver tag. This cleans up the code - for instance, reorder_tags_to_front()
    can be removed, and we needn't worry about request ordering in the
    dispatch list for avoiding I/O deadlock.

    Also we have to put the driver tag before requeueing.
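
    A rough sketch of the resulting tag policy in blk_kick_flush()
    (simplified; the bookkeeping around the tags is more involved in the
    real code):

        /* How the flush rq gets its tag under the two cases above. */
        if (!q->elevator) {
                /*
                 * none: borrow the driver tag of the request that triggered
                 * the flush, so the flush sequence cannot exhaust driver tags.
                 */
                flush_rq->tag = first_rq->tag;
        } else {
                /*
                 * real scheduler: only carry the scheduler tag here; the
                 * driver tag is allocated later, just before ->queue_rq().
                 */
                flush_rq->internal_tag = first_rq->internal_tag;
        }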

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • In the following patch, we will use RQF_FLUSH_SEQ to decide:

    1) if the flag isn't set, the flush rq needs to be inserted via
    blk_insert_flush()

    2) otherwise, the flush rq needs to be dispatched directly, since it
    is already in the flush machinery.

    So we use blk_mq_request_bypass_insert() for requests that bypass the
    flush machinery, just like the legacy path did.
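
    The decision described above boils down to something like the
    following (a sketch; the exact call sites and argument lists differ):

        if (!(rq->rq_flags & RQF_FLUSH_SEQ) && op_is_flush(rq->cmd_flags)) {
                /* Not yet part of a flush sequence: route through the
                 * flush machinery. */
                blk_insert_flush(rq);
        } else {
                /* Already inside the flush machinery (or bypassing it):
                 * dispatch directly, as the legacy path did. */
                blk_mq_request_bypass_insert(rq);       /* arguments simplified */
        }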

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • blk_insert_flush() should only insert the request, since a queue run
    always follows it.

    In the case of bypassing the flush machinery, we don't need to run the
    queue either, because every blk_insert_flush() call is already followed
    by a queue run.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

24 Aug, 2017

1 commit

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different lifetime rules from the gendisk and
    request_queue and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.
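
    Concretely, a bio now carries the gendisk plus a partition index
    instead of a block_device pointer. Setting up a bio for submission
    looks roughly like this (sketch; 'disk' and 'partno' are assumed to be
    in scope):

        struct bio *bio = bio_alloc(GFP_KERNEL, 1);

        /*
         * Instead of bio->bi_bdev = bdev, the bio references the gendisk
         * directly, plus the partition number that generic_make_request()
         * uses for partition remapping.
         */
        bio->bi_disk = disk;            /* struct gendisk * */
        bio->bi_partno = partno;        /* partition index, 0 for the whole disk */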

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

21 Jun, 2017

1 commit

  • Instead of documenting the locking assumptions of most block layer
    functions as a comment, use lockdep_assert_held() to verify locking
    assumptions at runtime.
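
    For example, instead of a "must be called with the queue lock held"
    comment, a function can state the assumption so that lockdep checks it
    at runtime (a generic illustration, not a specific hunk from this
    patch):

        static void some_queue_helper(struct request_queue *q)
        {
                /*
                 * Replaces the comment: lockdep complains at runtime if the
                 * queue lock is not actually held by the caller.
                 */
                lockdep_assert_held(q->queue_lock);

                /* ... work that relies on the queue lock ... */
        }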

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

09 Jun, 2017

1 commit

  • Currently we use normal Linux errno values in the block layer, and while
    we accept any error, a few have overloaded magic meanings. This patch
    instead introduces a new blk_status_t value that holds block layer specific
    status codes and explicitly explains their meaning. Helpers to convert from
    and to the previous special meanings are provided for now, but I suspect
    we want to get rid of them in the long run - those drivers that have an
    errno input (e.g. networking) usually get errnos that don't know about
    the special block layer overloads, and similarly returning them to userspace
    will usually return something that strictly speaking isn't correct
    for file system operations, but that's left as an exercise for later.

    For now the set of errors is a very limited set that closely corresponds
    to the previous overloaded errno values, but there is some low-hanging
    fruit to improve it.

    blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse
    typechecking, so that we can easily catch places passing the wrong values.
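
    For reference, the new type and a couple of its values look roughly
    like this (abridged; the full list of BLK_STS_* codes is longer):

        typedef u8 __bitwise blk_status_t;

        #define BLK_STS_OK              0
        #define BLK_STS_NOTSUPP         ((__force blk_status_t)1)
        #define BLK_STS_IOERR           ((__force blk_status_t)10)

        /* conversion helpers kept for the transition period */
        int blk_status_to_errno(blk_status_t status);
        blk_status_t errno_to_blk_status(int errno);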

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

18 Feb, 2017

1 commit

  • For blk-mq with scheduling, we can potentially end up with ALL
    driver tags assigned and sitting on the flush queues. If we
    defer because of an in-flight data request, then we can deadlock
    if that data request doesn't already have a tag assigned.

    This fixes a deadlock with running the xfs/297 xfstest, where
    thousands of syncs can cause the drive queue to stall.

    Signed-off-by: Jens Axboe
    Reviewed-by: Omar Sandoval

    Jens Axboe
     

01 Feb, 2017

1 commit

  • Instead of keeping two levels of indirection for request types, fold it
    all into the operations. The little caveat here is that previously
    cmd_type only applied to struct request, while the request and bio op
    fields were set to plain REQ_OP_READ/WRITE even for passthrough
    operations.

    Instead this patch adds new REQ_OP_* for SCSI passthrough and driver
    private requests, although it has to add two for each so that we
    can communicate the data in/out nature of the request.
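
    The new passthrough operations come in in/out pairs, roughly as below
    (abridged sketch of the additions to the request-op enum; numeric
    values omitted):

        enum req_opf {
                /* ... data ops such as REQ_OP_READ / REQ_OP_WRITE ... */

                /* SCSI passthrough, distinguishing data-in from data-out */
                REQ_OP_SCSI_IN,
                REQ_OP_SCSI_OUT,
                /* driver-private passthrough requests */
                REQ_OP_DRV_IN,
                REQ_OP_DRV_OUT,
        };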

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

18 Jan, 2017

1 commit

  • This adds a set of hooks that intercepts the blk-mq path of
    allocating/inserting/issuing/completing requests, allowing
    us to develop a scheduler within that framework.

    We reuse the existing elevator scheduler API on the registration
    side, but augment that with the scheduler flagging support for
    the blk-mq interface, and with a separate set of ops hooks for MQ
    devices.

    We split driver and scheduler tags, so we can run the scheduling
    independently of device queue depth.

    Signed-off-by: Jens Axboe
    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval

    Jens Axboe
     

14 Dec, 2016

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the main block pull request this series. Contrary to previous
    releases, I've kept the core and driver changes in the same branch. We
    always ended up having dependencies between the two for obvious
    reasons, so it makes more sense to keep them together. That said, I'll
    probably try and keep more topical branches going forward, especially
    for cycles that end up being as busy as this one.

    The major parts of this pull request are:

    - Improved support for O_DIRECT on block devices, with a small
    private implementation instead of using the pig that is
    fs/direct-io.c. From Christoph.

    - Request completion tracking in a scalable fashion. This is utilized
    by two components in this pull, the new hybrid polling and the
    writeback queue throttling code.

    - Improved support for polling with O_DIRECT, adding a hybrid mode
    that combines pure polling with an initial sleep. From me.

    - Support for automatic throttling of writeback queues on the block
    side. This uses feedback from the device completion latencies to
    scale the queue on the block side up or down. From me.

    - Support for SMR drives in the block layer and for SD. From Hannes
    and Shaun.

    - Multi-connection support for nbd. From Josef.

    - Cleanup of request and bio flags, so we have a clear split between
    which are bio (or rq) private, and which ones are shared. From
    Christoph.

    - A set of patches from Bart, that improve how we handle queue
    stopping and starting in blk-mq.

    - Support for WRITE_ZEROES from Chaitanya.

    - Lightnvm updates from Javier/Matias.

    - Support for FC for the nvme-over-fabrics code. From James Smart.

    - A bunch of fixes from a whole slew of people, too many to name
    here"

    * 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
    blk-stat: fix a few cases of missing batch flushing
    blk-flush: run the queue when inserting blk-mq flush
    elevator: make the rqhash helpers exported
    blk-mq: abstract out blk_mq_dispatch_rq_list() helper
    blk-mq: add blk_mq_start_stopped_hw_queue()
    block: improve handling of the magic discard payload
    blk-wbt: don't throttle discard or write zeroes
    nbd: use dev_err_ratelimited in io path
    nbd: reset the setup task for NBD_CLEAR_SOCK
    nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
    nvme-fabrics: Add target support for FC transport
    nvme-fabrics: Add host support for FC transport
    nvme-fabrics: Add FC transport LLDD api definitions
    nvme-fabrics: Add FC transport FC-NVME definitions
    nvme-fabrics: Add FC transport error codes to nvme.h
    Add type 0x28 NVME type code to scsi fc headers
    nvme-fabrics: patch target code in prep for FC transport support
    nvme-fabrics: set sqe.command_id in core not transports
    parser: add u64 number parser
    nvme-rdma: align to generic ib_event logging helper
    ...

    Linus Torvalds
     

09 Nov, 2016

1 commit

  • If we insert a flush request, we clear REQ_PREFLUSH and/or REQ_FUA,
    depending on flush settings. Since op_is_sync() factors those flags
    in for deciding whether this request is sync or not, we should
    set REQ_SYNC to avoid screwing up this accounting.

    This should be less fragile.
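
    The fix amounts to keeping the request marked synchronous when the
    flush flags are stripped, roughly (sketch of the relevant lines in
    blk_insert_flush()):

        /*
         * Strip the flush-related flags according to the queue's flush
         * settings, but keep the request counted as sync by op_is_sync().
         */
        rq->cmd_flags &= ~REQ_PREFLUSH;
        if (!(fflags & (1UL << QUEUE_FLAG_FUA)))
                rq->cmd_flags &= ~REQ_FUA;
        rq->cmd_flags |= REQ_SYNC;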

    Reported-by: Logan Gunthorpe
    Fixes: b685d3d65ac ("block: treat REQ_FUA and REQ_PREFLUSH as synchronous")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

03 Nov, 2016

1 commit

  • Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls
    are followed by kicking the requeue list. Hence add an argument to
    these two functions that allows the caller to kick the requeue list. This was
    proposed by Christoph Hellwig.
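
    After the change the two functions take the extra flag, along these
    lines (prototypes as described above; sketch):

        void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list);
        void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
                                        bool kick_requeue_list);

        /* typical caller: requeue and kick the requeue list in one go */
        blk_mq_requeue_request(rq, true);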

    Signed-off-by: Bart Van Assche
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

28 Oct, 2016

2 commits

  • Now that we don't need the common flags to overflow outside the range
    of a 32-bit type we can encode them the same way for both the bio and
    request fields. This in addition allows us to place the operation
    first (and make some room for more ops while we're at it) and to
    stop having to shift around the operation values.

    In addition this allows passing around only one value in the block layer
    instead of two (and eventually also in the file systems, but we can do
    that later) and thus clean up a lot of code.

    Last but not least this allows decreasing the size of the cmd_flags
    field in struct request to 32-bits. Various functions passing this
    value could also be updated, but I'd like to avoid the churn for now.
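
    With the single-value encoding, the operation occupies the low bits of
    the combined field and is extracted with a mask instead of a shift,
    along the lines of (abridged):

        /* low 8 bits hold the REQ_OP_* value, the rest are REQ_* flags */
        #define REQ_OP_BITS     8
        #define REQ_OP_MASK     ((1 << REQ_OP_BITS) - 1)

        #define bio_op(bio)     ((bio)->bi_opf & REQ_OP_MASK)
        #define req_op(req)     ((req)->cmd_flags & REQ_OP_MASK)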

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • A lot of the REQ_* flags are only used on struct requests, and only of
    use to the block layer and a few drivers that dig into struct request
    internals.

    This patch adds a new req_flags_t rq_flags field to struct request for
    them, and thus dramatically shrinks the number of common request flags. It
    also removes the unfortunate situation where we have to fit the fields
    from the same enum into 32 bits for struct bio and 64 bits for
    struct request.
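
    The new type and field look like this (abridged):

        typedef __u32 __bitwise req_flags_t;

        struct request {
                /* ... */
                req_flags_t rq_flags;   /* internal RQF_* flags, e.g.
                                           RQF_STARTED, RQF_FLUSH_SEQ */
                /* ... */
        };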

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Shaun Tancheff
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

15 Sep, 2016

1 commit

  • All drivers use the default, so provide an inline version of it. If we
    ever need other queue mappings we can add an optional method back,
    although supporting that will also require major changes to the queue setup
    code.

    This provides better code generation, and better debuggability as well.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

08 Jun, 2016

2 commits

  • To avoid confusion between REQ_OP_FLUSH, which is handled by
    request_fn drivers, and upper layers requesting the block layer
    perform a flush sequence along with possibly a WRITE, this patch
    renames REQ_FLUSH to REQ_PREFLUSH.

    Signed-off-by: Mike Christie
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Mike Christie
     
  • This adds a REQ_OP_FLUSH operation that is sent to request_fn
    based drivers by the block layer's flush code, instead of
    sending requests with the request->cmd_flags REQ_FLUSH bit set.

    Signed-off-by: Mike Christie
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Mike Christie