21 Mar, 2020

1 commit

  • [ Upstream commit 01e99aeca3979600302913cef3f89076786f32c8 ]

    For some reason, a device may get into a state in which it can't
    handle FS requests, so STS_RESOURCE is always returned and the FS
    requests are added to hctx->dispatch. However, a passthrough request
    may be required at that time to fix the problem. If the passthrough
    request is added to the scheduler queue, blk-mq has no chance to
    dispatch it, given that requests in hctx->dispatch are prioritized.
    The FS IO requests may then never complete, and an IO hang results.

    So the passthrough request has to be added to hctx->dispatch
    directly to fix the IO hang.

    Fix this issue by inserting passthrough requests into hctx->dispatch
    directly, together with adding FS requests to the tail of
    hctx->dispatch in blk_mq_dispatch_rq_list(). Actually, we already
    add FS requests to the tail of hctx->dispatch by default; see
    blk_mq_request_bypass_insert().

    This makes the behavior consistent with the original legacy IO
    request path, in which passthrough requests were always added to
    q->queue_head.
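
    A minimal sketch of the resulting insert-time decision, with
    simplified signatures (blk_rq_is_passthrough() and
    blk_mq_request_bypass_insert() are real blk-mq helpers; the wrapper
    and the elevator_insert() stand-in are illustrative only):

        /* Illustrative: passthrough requests bypass the elevator and go
         * straight to hctx->dispatch, which blk-mq services before the
         * scheduler queue, so they cannot starve behind STS_RESOURCE
         * retries of FS requests. */
        static void sched_insert(struct request *rq, bool at_head,
                                 bool run_queue)
        {
                if (blk_rq_is_passthrough(rq)) {
                        blk_mq_request_bypass_insert(rq, at_head, run_queue);
                        return;
                }
                /* FS requests keep going through the I/O scheduler. */
                elevator_insert(rq);    /* hypothetical stand-in */
        }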

    Cc: Dongli Zhang
    Cc: Christoph Hellwig
    Cc: Ewan D. Milne
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Ming Lei
     

11 Jul, 2019

1 commit

  • Simultaneously writing to a sequential zone of a zoned block device
    from multiple contexts requires mutual exclusion for BIO issuing to
    ensure that writes happen sequentially. However, even for a well
    behaved user correctly implementing such synchronization, BIO
    plugging may interfere and result in BIOs from the different
    contexts being reordered if plugging is done outside of the mutual
    exclusion section, e.g. if the plug was started by a function higher
    in the call chain than the function issuing the BIOs.

    Context A                  Context B

    | blk_start_plug()
    | ...
    | seq_write_zone()
    |   mutex_lock(zone)
    |   bio-0->bi_iter.bi_sector = zone->wp
    |   zone->wp += bio_sectors(bio-0)
    |   submit_bio(bio-0)
    |   bio-1->bi_iter.bi_sector = zone->wp
    |   zone->wp += bio_sectors(bio-1)
    |   submit_bio(bio-1)
    |   mutex_unlock(zone)
    |   return
    | -----------------------> | seq_write_zone()
                               |   mutex_lock(zone)
                               |   bio-2->bi_iter.bi_sector = zone->wp
                               |   zone->wp += bio_sectors(bio-2)
                               |   submit_bio(bio-2)
                               |   mutex_unlock(zone)
                               |
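
    The reordering is easy to model outside the kernel. Below is a
    minimal userspace sketch (illustrative only; a per-thread buffer
    stands in for the BIO plug, and none of these names come from the
    kernel): thread A stamps two sequential positions under the zone
    mutex but only buffers the submissions, so thread B's write can
    reach the "device" first.

        #include <pthread.h>
        #include <stdio.h>

        static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;
        static int wp;                 /* zone write pointer */

        static void submit(int sector) /* the "device" logs arrival order */
        {
                printf("device got write at sector %d\n", sector);
        }

        static void *writer_plugged(void *arg)
        {
                int plug[2], n = 0;

                (void)arg;
                /* The plug was started higher up the call chain. */
                pthread_mutex_lock(&zone_lock);
                plug[n++] = wp++;      /* bio-0: stamped, only buffered */
                plug[n++] = wp++;      /* bio-1 */
                pthread_mutex_unlock(&zone_lock);

                /* Flush outside the mutex: bio-2 may already be queued. */
                for (int i = 0; i < n; i++)
                        submit(plug[i]);
                return NULL;
        }

        static void *writer_direct(void *arg)
        {
                int s;

                (void)arg;
                pthread_mutex_lock(&zone_lock);
                s = wp++;              /* bio-2 */
                pthread_mutex_unlock(&zone_lock);
                submit(s);             /* can beat bio-0/bio-1 to the device */
                return NULL;
        }

        int main(void)
        {
                pthread_t a, b;

                pthread_create(&a, NULL, writer_plugged, NULL);
                pthread_create(&b, NULL, writer_direct, NULL);
                pthread_join(a, NULL);
                pthread_join(b, NULL);
                return 0;
        }
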
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

03 Jul, 2019

1 commit

  • No code that occurs between blk_mq_get_ctx() and blk_mq_put_ctx() depends
    on preemption being disabled for its correctness. Since removing the CPU
    preemption calls does not measurably affect performance, simplify the
    blk-mq code by removing the blk_mq_put_ctx() function and also by not
    disabling preemption in blk_mq_get_ctx().
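
    A minimal before/after sketch of what this means for the ctx lookup
    (simplified; __blk_mq_get_ctx() is the real internal helper, and the
    two definitions are shown side by side for comparison only):

        /* Before: the lookup pinned the task to its CPU until
         * blk_mq_put_ctx() called put_cpu(). */
        struct blk_mq_ctx *blk_mq_get_ctx(struct request_queue *q)
        {
                return __blk_mq_get_ctx(q, get_cpu()); /* disables preemption */
        }

        /* After: a plain CPU read suffices, since nothing between get
         * and put relied on staying on that CPU; blk_mq_put_ctx() is
         * gone entirely. */
        struct blk_mq_ctx *blk_mq_get_ctx(struct request_queue *q)
        {
                return __blk_mq_get_ctx(q, raw_smp_processor_id());
        }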

    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

04 May, 2019

1 commit

  • Once blk_cleanup_queue() returns, tags shouldn't be used any more,
    because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
    ("blk-mq: Fix a use-after-free") fixes exactly this issue.

    However, that commit introduces another issue. Before 45a9c9d909b2,
    we were allowed to run a queue during queue cleanup as long as the
    queue's kobj refcount was held. After that commit, a queue can't be
    run during queue cleanup, otherwise an oops is easily triggered,
    because some fields of the hctx are freed by blk_mq_free_queue() in
    blk_cleanup_queue().

    We have devised ways of addressing this kind of issue before, such
    as:

    8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
    c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

    But these still can't cover all cases; recently James reported
    another issue of this kind:

    https://marc.info/?l=linux-scsi&m=155389088124782&w=2

    This issue is quite hard to address with the previous approach,
    given that scsi_run_queue() may run requeues for other LUNs.

    Fix the above issue by freeing the hctx's resources in its release
    handler. This is safe because tags aren't needed for freeing those
    hctx resources.

    This approach follows the typical design pattern for kobject
    release handlers.
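
    A minimal sketch of the pattern (illustrative; the field names
    approximate the blk-mq hctx of that era, and the exact set of
    resources freed differs in the real patch):

        /* Illustrative: teardown no longer frees hctx resources
         * synchronously; they go away only when the last kobject
         * reference is dropped. */
        static void blk_mq_hw_sysfs_release(struct kobject *kobj)
        {
                struct blk_mq_hw_ctx *hctx = container_of(kobj,
                                struct blk_mq_hw_ctx, kobj);

                /* Safe here: no tags are needed for this, and no user
                 * can still hold a reference once release runs. */
                free_cpumask_var(hctx->cpumask);
                kfree(hctx->ctxs);
                kfree(hctx);
        }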

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reported-by: James Smart
    Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
    Cc: stable@vger.kernel.org
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

05 Apr, 2019

1 commit

  • blk_mq_try_issue_directly() can return BLK_STS*_RESOURCE for requests that
    have been queued. If that happens when blk_mq_try_issue_directly() is called
    by the dm-mpath driver then dm-mpath will try to resubmit a request that is
    already queued and a kernel crash follows. Since it is nontrivial to fix
    blk_mq_request_issue_directly(), revert the blk_mq_request_issue_directly()
    changes that went into kernel v5.0.

    This patch reverts the following commits:
    * d6a51a97c0b2 ("blk-mq: replace and kill blk_mq_request_issue_directly") # v5.0.
    * 5b7a6f128aad ("blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests") # v5.0.
    * 7f556a44e61d ("blk-mq: refactor the code of issue request directly") # v5.0.

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: James Smart
    Cc: Dongli Zhang
    Cc: Laurence Oberman
    Cc:
    Reported-by: Laurence Oberman
    Tested-by: Laurence Oberman
    Fixes: 7f556a44e61d ("blk-mq: refactor the code of issue request directly") # v5.0.
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

25 Mar, 2019

1 commit

  • Except for their arguments, blk_mq_put_driver_tag_hctx() and
    blk_mq_put_driver_tag() are the same. We can just use the 'request'
    argument and put the tag with blk_mq_put_driver_tag(), then remove
    the now-unused blk_mq_put_driver_tag_hctx().
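
    A minimal sketch of why the hctx variant is redundant (simplified;
    since the hardware-queue mapping is cached in the request, the hctx
    is recoverable from the request alone):

        /* Illustrative: the request already knows its hardware queue,
         * so a separate *_hctx(hctx, rq) variant adds nothing. */
        static void blk_mq_put_driver_tag(struct request *rq)
        {
                if (rq->tag == -1)
                        return;
                __blk_mq_put_driver_tag(rq->mq_hctx, rq);
        }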

    Signed-off-by: Yufen Yu
    Signed-off-by: Jens Axboe

    Yufen Yu
     

15 Feb, 2019

1 commit

  • Pull in 5.0-rc6 to avoid a dumb merge conflict with fs/iomap.c.
    This is needed since io_uring is now based on the block branch,
    to avoid a conflict between the multi-page bvecs and the bits
    of io_uring that touch the core block parts.

    * tag 'v5.0-rc6': (525 commits)
    Linux 5.0-rc6
    x86/mm: Make set_pmd_at() paravirt aware
    MAINTAINERS: Update the ocores i2c bus driver maintainer, etc
    blk-mq: remove duplicated definition of blk_mq_freeze_queue
    Blk-iolatency: warn on negative inflight IO counter
    blk-iolatency: fix IO hang due to negative inflight counter
    MAINTAINERS: unify reference to xen-devel list
    x86/mm/cpa: Fix set_mce_nospec()
    futex: Handle early deadlock return correctly
    futex: Fix barrier comment
    net: dsa: b53: Fix for failure when irq is not defined in dt
    blktrace: Show requests without sector
    mips: cm: reprime error cause
    mips: loongson64: remove unreachable(), fix loongson_poweroff().
    sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
    geneve: should not call rt6_lookup() when ipv6 was disabled
    KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
    KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222)
    kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)
    signal: Better detection of synchronous signals
    ...

    Jens Axboe
     

01 Feb, 2019

2 commits

  • Currently, we check whether the hctx type is supported every time
    in the hot path. This is not actually necessary: we can save the
    default hctx into ctx->hctxs when mapping the swqueues if the type
    is not supported, and then use ctx->hctxs[type] directly.

    We also needn't check whether polling is enabled, because in that
    case the caller clears REQ_HIPRI.
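
    A minimal sketch of the map-time fallback (illustrative only; the
    real change lives in blk-mq's swqueue mapping code):

        /* Illustrative: resolve unsupported types once, at map time,
         * so the hot path can index ctx->hctxs[type] unconditionally. */
        if (!q->tag_set->map[HCTX_TYPE_POLL].nr_queues)
                ctx->hctxs[HCTX_TYPE_POLL] = ctx->hctxs[HCTX_TYPE_DEFAULT];
        if (!q->tag_set->map[HCTX_TYPE_READ].nr_queues)
                ctx->hctxs[HCTX_TYPE_READ] = ctx->hctxs[HCTX_TYPE_DEFAULT];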

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     
  • Currently, the queue mapping result is saved in a two-dimensional
    array. In the hot path, getting an hctx requires the following
    lookup:

    q->queue_hw_ctx[q->tag_set->map[type].mq_map[cpu]]

    This isn't very efficient. We can instead save the queue mapping
    result in the ctx directly, per hctx type:

    ctx->hctxs[type]
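
    A minimal sketch of the resulting hot-path lookup (close to, but
    simplified from, the post-change blk_mq_map_queue(); the
    flag-to-type policy shown is illustrative):

        /* Illustrative: one array index replaces the chained
         * indirections through tag_set and mq_map. */
        static inline struct blk_mq_hw_ctx *
        blk_mq_map_queue(struct request_queue *q, unsigned int flags,
                         struct blk_mq_ctx *ctx)
        {
                enum hctx_type type = HCTX_TYPE_DEFAULT;

                if (flags & REQ_HIPRI)
                        type = HCTX_TYPE_POLL;
                else if ((flags & REQ_OP_MASK) == REQ_OP_READ)
                        type = HCTX_TYPE_READ;

                return ctx->hctxs[type];
        }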

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

18 Dec, 2018

1 commit

  • When a request is added to the rq list of a sw queue (ctx), the rq
    may come from a different type of hctx, especially after multi
    queue mapping is introduced.

    So when dispatching requests from a sw queue via
    blk_mq_flush_busy_ctxs() or blk_mq_dequeue_from_ctx(), a request
    belonging to a different hctx queue type can be dispatched to the
    current hctx if the read queue or poll queue is enabled.

    This patch fixes the issue by introducing per-queue-type lists, as
    sketched below.
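
    A minimal sketch of the resulting software-queue layout (simplified
    from struct blk_mq_ctx after this change; unrelated fields omitted):

        struct blk_mq_ctx {
                struct {
                        spinlock_t              lock;
                        /* one list per hctx type (default/read/poll) */
                        struct list_head        rq_lists[HCTX_MAX_TYPES];
                } ____cacheline_aligned_in_smp;
                /* ... remaining fields unchanged ... */
        };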

    Cc: Christoph Hellwig
    Signed-off-by: Ming Lei

    Changed by me to not use separately cacheline aligned lists, just
    place them all in the same cacheline where we had just the one list
    and lock before.

    Signed-off-by: Jens Axboe

    Ming Lei
     

10 Dec, 2018

1 commit

  • The previous patches deleted all the code that needed the second value
    returned from part_in_flight - now the kernel only uses the first value.

    Consequently, part_in_flight (and blk_mq_in_flight) may be changed so that
    it only returns one value.

    This patch just refactors the code, there's no functional change.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     

05 Dec, 2018

1 commit

  • Having another indirect call in the fast path doesn't really help
    in our post-Spectre world. Also, having too many queue types is
    just going to create confusion, so I'd rather manage them
    centrally.

    Note that the queue type naming and ordering changes a bit - the
    first index is now the default queue for everything not explicitly
    marked, and the optional ones are the read and poll queues.
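
    A sketch of the centralized types (this matches the enum the change
    introduces, with comments paraphrased):

        enum hctx_type {
                HCTX_TYPE_DEFAULT,   /* all I/O not otherwise accounted for */
                HCTX_TYPE_READ,      /* just for READ I/O */
                HCTX_TYPE_POLL,      /* polled I/O of any kind */

                HCTX_MAX_TYPES,
        };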

    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

30 Nov, 2018

1 commit

  • If we are issuing a list of requests, we know if we're at the last
    one. If we fail issuing, ensure that we call ->commit_rqs() to
    flush any potential previous requests.
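
    A minimal sketch of the dispatch-loop fix (illustrative; the
    surrounding loop is heavily simplified, while mq_ops->queue_rq()
    and mq_ops->commit_rqs() are the real hooks):

        /* Illustrative: on failure, earlier requests queued with
         * bd.last == false may still be sitting in the driver;
         * kick them via ->commit_rqs(). */
        ret = q->mq_ops->queue_rq(hctx, &bd);
        if (ret != BLK_STS_OK) {
                if (queued && q->mq_ops->commit_rqs)
                        q->mq_ops->commit_rqs(hctx);
                break;
        }
        queued++;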

    Reviewed-by: Omar Sandoval
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

21 Nov, 2018

1 commit

  • Even though .mq_kobj, ctx->kobj and q->kobj appear to share the
    same lifetime from the block layer's view, they actually don't,
    because userspace may grab a kobject at any time via sysfs.

    This patch fixes the issue by the following approach:

    1) introduce 'struct blk_mq_ctxs' for holding .mq_kobj and managing
    all ctxs

    2) free all allocated ctxs and the 'blk_mq_ctxs' instance in release
    handler of .mq_kobj

    3) grab one ref of .mq_kobj before initializing each ctx->kobj, so that
    .mq_kobj is always released after all ctxs are freed.

    This patch fixes kernel panic issue during booting when DEBUG_KOBJECT_RELEASE
    is enabled.

    Reported-by: Guenter Roeck
    Cc: "jianchao.wang"
    Tested-by: Guenter Roeck
    Reviewed-by: Greg Kroah-Hartman
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

08 Nov, 2018

7 commits

  • We call blk_mq_map_queue() a lot: at least twice for each request
    per IO, sometimes more. Since we now have an indirect call in that
    function as well, cache the mapping so we don't have to re-call
    blk_mq_map_queue() for the same request multiple times.
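
    A minimal sketch of the caching (illustrative; the field name
    mirrors what blk-mq ended up with, and the blk_mq_map_queue()
    signature varied across this series):

        /* Illustrative: resolve the mapping once, when the request is
         * set up... */
        rq->mq_hctx = blk_mq_map_queue(q, rq->cmd_flags, rq->mq_ctx);

        /* ...so hot paths read the cached pointer instead of
         * re-mapping: */
        struct blk_mq_hw_ctx *hctx = rq->mq_hctx;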

    Reviewed-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Add support for the tag set carrying multiple queue maps, and
    for the driver to inform blk-mq how many it wishes to support
    through setting set->nr_maps.

    This adds an mq_ops helper for drivers that support more than 1
    map, mq_ops->rq_flags_to_type(). The function takes request/bio
    flags and CPU, and returns a queue map index for that. We then
    use the type information in blk_mq_map_queue() to index the map
    set.
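
    A minimal sketch of the shape of the new pieces (simplified; the
    exact hook prototype in the tree at the time may differ slightly):

        struct blk_mq_ops {
                /* ... */
                /* map request/bio flags to an index into set->map[] */
                int (*rq_flags_to_type)(struct request_queue *q,
                                        unsigned int flags);
                /* ... */
        };

        /* Drivers opt in by sizing the map set, e.g.: */
        set->nr_maps = 2;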

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The mapping used to be dependent on just the CPU location, but
    now it's a tuple of (type, cpu) instead. This is a prep patch
    for allowing a single software queue to map to multiple hardware
    queues. No functional changes in this patch.

    This changes the software queue count to an unsigned short
    to save a bit of space. We can still support 64K-1 CPUs,
    which should be enough. Add a check to catch a wrap.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Prep patch for being able to place request based not just on
    CPU location, but also on the type of request.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Doesn't do anything right now, but it's needed as a prep patch
    to get the interfaces right.

    While in there, correct the blk_mq_map_queue() CPU type to an unsigned
    int.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This is in preparation for allowing multiple sets of maps per
    queue, if so desired.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Bart Van Assche
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • It's just a pointer to set->mq_map, use that instead. Move the
    assignment a bit earlier, so we always know it's valid.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Bart Van Assche
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
     

18 Jul, 2018

1 commit

  • In the case of the 'none' io scheduler, when the hw queue isn't
    busy, it isn't necessary to enqueue a request to the sw queue and
    then dequeue it again: the request can be submitted to the hw queue
    directly without extra cost. Meanwhile there shouldn't be many
    requests in the sw queue, so we don't need to worry about the
    effect on IO merging.

    There are still some single hw queue SCSI HBAs (HPSA, megaraid_sas,
    ...) which may connect high performance devices, so 'none' is often
    required for obtaining good performance.

    This patch improves IOPS and decreases CPU utilization on
    megaraid_sas, per Kashyap's test.
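
    A minimal sketch of the fast path this enables (illustrative; the
    hw-queue busy test is elided, while blk_mq_hctx_stopped() and the
    two insert/issue helpers are real blk-mq functions of that era):

        /* Illustrative: with no elevator and an idle hw queue, try to
         * issue directly instead of round-tripping the sw queue. */
        if (!q->elevator && !blk_mq_hctx_stopped(hctx) /* && !busy */)
                blk_mq_try_issue_directly(hctx, rq, &cookie);
        else
                blk_mq_sched_insert_request(rq, false, true, true);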

    Cc: Kashyap Desai
    Cc: Laurence Oberman
    Cc: Omar Sandoval
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Reported-by: Kashyap Desai
    Tested-by: Kashyap Desai
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

09 Jul, 2018

2 commits

  • set->mq_map is currently cleared if something goes wrong when
    establishing a queue map in blk-mq-pci.c. It's also cleared before
    updating a queue map in blk_mq_update_queue_map().

    This patch provides an API to clear set->mq_map to make this
    explicit.
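
    A sketch of the helper (this closely matches the
    blk_mq_clear_mq_map() the patch introduces):

        static inline void blk_mq_clear_mq_map(struct blk_mq_tag_set *set)
        {
                unsigned int cpu;

                /* 0 is the default fallback queue index */
                for_each_possible_cpu(cpu)
                        set->mq_map[cpu] = 0;
        }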

    Signed-off-by: Minwoo Im
    Signed-off-by: Jens Axboe

    Minwoo Im
     
  • We never pass 'wait' as true to blk_mq_get_driver_tag(), and hence
    we never change '**hctx' as well. The last use of these went away
    with the flush cleanup, commit 0c2a6fe4dc3e.

    So clean up the usage and remove the two extra parameters.

    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Tested-by: Andrew Jones
    Reviewed-by: Omar Sandoval
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

29 May, 2018

1 commit

  • This patch simplifies the timeout handling by relying on the
    request reference counting to ensure the iterator is operating on
    an inflight and truly timed out request. Since the reference
    counting prevents the tag from being reallocated, the block layer
    no longer needs to prevent drivers from completing their requests
    while the timeout handler is operating on them: a driver completing
    a request is allowed to proceed to the next state without
    additional synchronization with the block layer.

    This also removes any need for generation sequence numbers, since
    the reference counting prevents the request from being reallocated
    as a new instance while timeout handling is operating on it.

    To enable this, a refcount is added to struct request so that
    request users can be sure they're operating on the same request
    without it changing while they're processing it. The request's tag
    won't be released for reuse until both the timeout handler and the
    completion are done with it.
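
    A minimal sketch of the timeout-iterator side (simplified from the
    patch; the helper names approximate the real static functions and
    their prototypes are trimmed):

        /* Illustrative: only operate on a request whose reference we
         * took, so its tag cannot be recycled underneath the timeout
         * handler. */
        static bool blk_mq_check_expired(struct request *rq,
                                         unsigned long *next)
        {
                if (!refcount_inc_not_zero(&rq->ref))
                        return true;            /* being freed: skip it */

                if (blk_mq_req_expired(rq, next))
                        blk_mq_rq_timed_out(rq);

                if (refcount_dec_and_test(&rq->ref))
                        __blk_mq_free_request(rq); /* last ref frees tag */
                return true;
        }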

    Signed-off-by: Keith Busch
    [hch: slight cleanups, added back submission side hctx lock, use cmpxchg
    for completions]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     

26 Apr, 2018

1 commit

  • When the blk-mq inflight implementation was added, /proc/diskstats was
    converted to use it, but /sys/block/$dev/inflight was not. Fix it by
    adding another helper to count in-flight requests by data direction.

    Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

18 Jan, 2018

1 commit

  • blk_insert_cloned_request() is called in the fast path of a dm-rq driver
    (e.g. blk-mq request-based DM mpath). blk_insert_cloned_request() uses
    blk_mq_request_bypass_insert() to directly append the request to the
    blk-mq hctx->dispatch_list of the underlying queue.

    1) This way isn't efficient enough because the hctx spinlock is always
    used.

    2) With blk_insert_cloned_request(), we completely bypass underlying
    queue's elevator and depend on the upper-level dm-rq driver's elevator
    to schedule IO. But dm-rq currently can't get the underlying queue's
    dispatch feedback at all. Without knowing whether a request was issued
    or not (e.g. due to underlying queue being busy) the dm-rq elevator will
    not be able to provide effective IO merging (as a side-effect of dm-rq
    currently blindly destaging a request from its elevator only to requeue
    it after a delay, which kills any opportunity for merging). This
    obviously causes very bad sequential IO performance.

    Fix this by updating blk_insert_cloned_request() to use
    blk_mq_request_direct_issue(). blk_mq_request_direct_issue() allows
    a request to be issued directly to the underlying queue and returns
    the dispatch feedback (blk_status_t). If
    blk_mq_request_direct_issue() returns BLK_STS_RESOURCE the dm-rq
    driver will now use DM_MAPIO_REQUEUE to _not_ destage the request,
    thereby preserving the opportunity to merge IO.

    With this, request-based DM's blk-mq sequential IO performance is vastly
    improved (as much as 3X in mpath/virtio-scsi testing).
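
    A minimal sketch of the dm-rq consumer side (illustrative;
    simplified from dm-rq's request mapping logic, with error handling
    trimmed):

        /* Illustrative: act on the underlying queue's dispatch
         * feedback instead of blindly destaging and delay-requeueing. */
        ret = blk_insert_cloned_request(clone->q, clone);
        switch (ret) {
        case BLK_STS_OK:
                break;
        case BLK_STS_RESOURCE:
                /* Underlying queue busy: keep the request in dm-rq's
                 * elevator so it can still be merged with later IO. */
                return DM_MAPIO_REQUEUE;
        default:
                /* Real failure: complete the request with the error. */
                dm_complete_request(rq, ret);
        }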

    Signed-off-by: Ming Lei
    [blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
    they were refactored to make them less fragile and easier to read/review]
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Ming Lei
     

10 Jan, 2018

3 commits

  • After the recent updates to use generation number and state based
    synchronization, we can easily replace REQ_ATOM_STARTED usages by
    adding an extra state to distinguish completed but not yet freed
    state.

    Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with
    blk_mq_rq_state() tests. REQ_ATOM_STARTED no longer has any users
    left and is removed.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With issue/complete and timeout paths now using the generation number
    and state based synchronization, blk_abort_request() is the only one
    which depends on REQ_ATOM_COMPLETE for arbitrating completion.

    There's no reason for blk_abort_request() to be a completely separate
    path. This patch makes blk_abort_request() piggyback on the timeout
    path instead of trying to terminate the request directly.

    This removes the last dependency on REQ_ATOM_COMPLETE in blk-mq.

    Note that this makes blk_abort_request() asynchronous - it initiates
    abortion but the actual termination will happen after a short while,
    even when the caller owns the request. AFAICS, SCSI and ATA should be
    fine with that and I think mtip32xx and dasd should be safe but not
    completely sure. It'd be great if people who know the drivers take a
    look.

    v2: - Add comment explaining the lack of synchronization around
    ->deadline update as requested by Bart.
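
    A minimal sketch of the piggybacking (simplified; close to the
    blk-mq branch of the patched blk_abort_request()):

        /* Illustrative: instead of terminating the request directly,
         * expire its deadline and kick the regular timeout machinery;
         * the actual termination happens there, a short while later. */
        void blk_abort_request(struct request *req)
        {
                blk_rq_set_deadline(req, jiffies);
                mod_timer(&req->q->timeout, 0);
        }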

    Signed-off-by: Tejun Heo
    Cc: Asai Thambi SP
    Cc: Stefan Haberland
    Cc: Jan Hoeppner
    Cc: Bart Van Assche
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, the blk-mq timeout path synchronizes against the usual
    issue/completion path using a complex scheme involving atomic
    bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
    rules. Unfortunately, it contains quite a few holes.

    There's a complex dance around REQ_ATOM_STARTED and
    REQ_ATOM_COMPLETE between the issue/completion and timeout paths;
    however, they don't have a synchronization point across request
    recycle instances and it isn't clear what the barriers add.
    blk_mq_check_expired() can easily read STARTED from the N-2'th
    iteration, the deadline from the N-1'th, and run
    blk_mark_rq_complete() against the Nth instance.

    In fact, it's pretty easy to make blk_mq_check_expired() terminate
    a later instance of a request. If we induce a 5 sec delay before
    the time_after_eq() test in blk_mq_check_expired(), shorten the
    timeout to 2s, and issue back-to-back large IOs, blk-mq starts
    timing out requests spuriously pretty quickly. Nothing actually
    timed out; it just made the call on a recycled instance of a
    request and then terminated a later instance long after the
    original instance finished. The scenario isn't theoretical either.

    This patch replaces the broken synchronization mechanism with a RCU
    and generation number based one.

    1. Each request has a u64 generation + state value, which can be
    updated only by the request owner. Whenever a request becomes
    in-flight, the generation number gets bumped up too. This provides
    the basis for the timeout path to distinguish different recycle
    instances of the request.

    Also, marking a request in-flight and setting its deadline are
    protected with a seqcount so that the timeout path can fetch both
    values coherently.

    2. The timeout path fetches the generation, state and deadline. If
    the verdict is timeout, it records the generation into a dedicated
    request abortion field and does RCU wait.

    3. The completion path is also protected by RCU (from the previous
    patch) and checks whether the current generation number and state
    match the abortion field. If so, it skips completion.

    4. The timeout path, after RCU wait, scans requests again and
    terminates the ones whose generation and state still match the ones
    requested for abortion.

    By now, the timeout path knows that either the generation number
    and state changed if it lost the race or the completion will yield
    to it and can safely timeout the request.

    While it's more lines of code, it's conceptually simpler, doesn't
    depend on direct use of subtle memory ordering or coherence, and
    hopefully doesn't terminate the wrong instance.

    While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
    between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
    removed yet as it's still used in other places. Future patches will
    move all state tracking to the new mechanism and remove all bitops in
    the hot paths.

    Note that this patch adds a comment explaining a race condition in
    BLK_EH_RESET_TIMER path. The race has always been there and this
    patch doesn't change it. It's just documenting the existing race.

    v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
    - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
    - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.

    v3: - Fixed possible extended seqcount / u64_stats_sync read looping
    spotted by Peter.
    - MQ_RQ_IDLE was incorrectly being set in complete_request instead
    of free_request. Fixed.

    v4: - Rebased on top of hctx_lock() refactoring patch.
    - Added comment explaining the use of hctx_lock() in completion path.

    v5: - Added comments requested by Bart.
    - Note the addition of BLK_EH_RESET_TIMER race condition in the
    commit message.
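
    A minimal sketch of the generation+state encoding (simplified; the
    constants follow the scheme the patch describes, with the low bits
    holding the state and the rest the generation):

        /* Illustrative: one value carries both fields; only the
         * request owner updates it, and the generation is bumped every
         * time the request goes in-flight. */
        #define MQ_RQ_STATE_BITS        2
        #define MQ_RQ_STATE_MASK        ((1U << MQ_RQ_STATE_BITS) - 1)
        #define MQ_RQ_GEN_INC           (1U << MQ_RQ_STATE_BITS)

        enum mq_rq_state {
                MQ_RQ_IDLE              = 0,
                MQ_RQ_IN_FLIGHT         = 1,
        };

        static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
        {
                return (enum mq_rq_state)(rq->gstate & MQ_RQ_STATE_MASK);
        }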

    Signed-off-by: Tejun Heo
    Cc: "jianchao.wang"
    Cc: Peter Zijlstra
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Nov, 2017

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the main pull request for block storage for 4.15-rc1.

    Nothing out of the ordinary in here, and no API changes or anything
    like that. Just various new features for drivers, core changes, etc.
    In particular, this pull request contains:

    - A patch series from Bart, closing the hole on blk/scsi-mq queue
    quiescing.

    - A series from Christoph, building towards hidden gendisks (for
    multipath) and ability to move bio chains around.

    - NVMe
    - Support for native multipath for NVMe (Christoph).
    - Userspace notifications for AENs (Keith).
    - Command side-effects support (Keith).
    - SGL support (Chaitanya Kulkarni)
    - FC fixes and improvements (James Smart)
    - Lots of fixes and tweaks (Various)

    - bcache
    - New maintainer (Michael Lyle)
    - Writeback control improvements (Michael)
    - Various fixes (Coly, Elena, Eric, Liang, et al)

    - lightnvm updates, mostly centered around the pblk interface
    (Javier, Hans, and Rakesh).

    - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

    - Writeback series that fix the much discussed hundreds of millions
    of sync-all units. This goes all the way, as discussed previously
    (me).

    - Fix for missing wakeup on writeback timer adjustments (Yafang
    Shao).

    - Fix laptop mode on blk-mq (me).

    - {mq,name} tuple lookup for IO schedulers, allowing us to have
    alias names. This means you can use 'deadline' on both !mq and on
    mq (where it's called mq-deadline). (me).

    - blktrace race fix, oopsing on sg load (me).

    - blk-mq optimizations (me).

    - Obscure waitqueue race fix for kyber (Omar).

    - NBD fixes (Josef).

    - Disable writeback throttling by default on bfq, like we do on cfq
    (Luca Miccio).

    - Series from Ming that enables us to treat flush requests on
    blk-mq like any other request. This is a really nice cleanup.

    - Series from Ming that improves merging on blk-mq with schedulers,
    getting us closer to flipping the switch on scsi-mq again.

    - BFQ updates (Paolo).

    - blk-mq atomic flags memory ordering fixes (Peter Z).

    - Loop cgroup support (Shaohua).

    - Lots of minor fixes from lots of different folks, both for core and
    driver code"

    * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
    nvme: fix visibility of "uuid" ns attribute
    blk-mq: fixup some comment typos and lengths
    ide: ide-atapi: fix compile error with defining macro DEBUG
    blk-mq: improve tag waiting setup for non-shared tags
    brd: remove unused brd_mutex
    blk-mq: only run the hardware queue if IO is pending
    block: avoid null pointer dereference on null disk
    fs: guard_bio_eod() needs to consider partitions
    xtensa/simdisk: fix compile error
    nvme: expose subsys attribute to sysfs
    nvme: create 'slaves' and 'holders' entries for hidden controllers
    block: create 'slaves' and 'holders' entries for hidden gendisks
    nvme: also expose the namespace identification sysfs files for mpath nodes
    nvme: implement multipath access to nvme subsystems
    nvme: track shared namespaces
    nvme: introduce a nvme_ns_ids structure
    nvme: track subsystems
    block, nvme: Introduce blk_mq_req_flags_t
    block, scsi: Make SCSI quiesce and resume work reliably
    block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
    ...

    Linus Torvalds
     

11 Nov, 2017

2 commits

  • Currently we are inconsistent in when we decide to run the queue. Using
    blk_mq_run_hw_queues() we check if the hctx has pending IO before
    running it, but we don't do that from the individual queue run function,
    blk_mq_run_hw_queue(). This results in a lot of extra and pointless
    queue runs, potentially, on flush requests and (much worse) on tag
    starvation situations. This is observable just looking at top output,
    with lots of kworkers active. For the !async runs, it just adds to the
    CPU overhead of blk-mq.

    Move the has-pending check into the run function instead of having
    callers do it.
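
    A minimal sketch of the reshaped run function (simplified; the
    quiesce/RCU details of the real blk_mq_run_hw_queue() are elided):

        /* Illustrative: the has-pending check now lives inside the run
         * function, so every caller benefits from it. */
        bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
        {
                bool need_run;

                need_run = !blk_queue_quiesced(hctx->queue) &&
                           blk_mq_hctx_has_pending(hctx);
                if (need_run)
                        __blk_mq_delay_run_hw_queue(hctx, async, 0);

                return need_run;
        }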

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Several block layer and NVMe core functions accept a combination
    of BLK_MQ_REQ_* flags through the 'flags' argument but there is
    no verification at compile time whether the right type of block
    layer flags is passed. Make it possible for sparse to verify this.
    This patch does not change any functionality.
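
    A sketch of the mechanism (the standard sparse __bitwise pattern;
    the typedef and flag shown follow what the patch introduces):

        /* A __bitwise typedef makes sparse flag any mixing of these
         * flags with plain integers or other flag types. */
        typedef __u32 __bitwise blk_mq_req_flags_t;

        #define BLK_MQ_REQ_NOWAIT ((__force blk_mq_req_flags_t)(1 << 0))

        /* Functions then take blk_mq_req_flags_t instead of a bare
         * unsigned int: */
        struct request *blk_mq_alloc_request(struct request_queue *q,
                                             unsigned int op,
                                             blk_mq_req_flags_t flags);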

    Signed-off-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Tested-by: Oleksandr Natalenko
    Cc: linux-nvme@lists.infradead.org
    Cc: Christoph Hellwig
    Cc: Johannes Thumshirn
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

05 Nov, 2017

1 commit