18 Jul, 2018

1 commit

  • In the case of the 'none' io scheduler, when the hw queue isn't busy it
    isn't necessary to enqueue the request to the sw queue and dequeue it
    again, because the request can be submitted to the hw queue right away
    without extra cost; meanwhile there shouldn't be many requests in the
    sw queue, so we don't need to worry about the effect on IO merging.

    There are still some single hw queue SCSI HBAs (HPSA, megaraid_sas, ...)
    which may be connected to high performance devices, so 'none' is often
    required for obtaining good performance.

    This patch improves IOPS and decreases CPU utilization on megaraid_sas,
    per Kashyap's test.

    Cc: Kashyap Desai
    Cc: Laurence Oberman
    Cc: Omar Sandoval
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Reported-by: Kashyap Desai
    Tested-by: Kashyap Desai
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
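
    A rough C sketch of the fast path this commit describes (illustrative
    only; the types and the issue_to_hw/enqueue_sw helpers are invented
    stand-ins for the real blk-mq paths):

      #include <stdbool.h>

      struct hw_ctx { bool busy; };               /* stand-in for blk_mq_hw_ctx     */
      struct request { int tag; };                /* stand-in for struct request    */

      /* invented stand-ins for the real dispatch/insert paths */
      static bool issue_to_hw(struct hw_ctx *hctx, struct request *rq)
      { (void)hctx; (void)rq; return true; }      /* pretend the driver accepted it */
      static void enqueue_sw(struct hw_ctx *hctx, struct request *rq)
      { (void)hctx; (void)rq; }                   /* pretend we queued it in sw queue */

      /*
       * With the 'none' scheduler there is nothing to merge against, so when the
       * hw queue is not busy the request can be issued directly instead of taking
       * a round trip through the per-CPU sw queue.
       */
      static void submit(struct hw_ctx *hctx, struct request *rq, bool has_elevator)
      {
              if (!has_elevator && !hctx->busy && issue_to_hw(hctx, rq))
                      return;                     /* went straight to the driver */
              enqueue_sw(hctx, rq);               /* normal sw-queue path        */
      }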
     

09 Jul, 2018

2 commits

  • set->mq_map is currently cleared if something goes wrong when
    establishing a queue map in blk-mq-pci.c. It's also cleared before
    updating a queue map in blk_mq_update_queue_map().

    This patch provides an API to clear set->mq_map to make that explicit.

    Signed-off-by: Minwoo Im
    Signed-off-by: Jens Axboe

    Minwoo Im
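
    The helper is presumably just a loop resetting the CPU-to-queue map; a
    hedged sketch (the structure below is a simplification, not the real
    blk_mq_tag_set):

      struct tag_set {
              unsigned int nr_cpus;
              unsigned int *mq_map;                /* mq_map[cpu] = hw queue index */
      };

      /* Reset every CPU's mapping back to queue 0, the default. */
      static void clear_mq_map(struct tag_set *set)
      {
              for (unsigned int cpu = 0; cpu < set->nr_cpus; cpu++)
                      set->mq_map[cpu] = 0;
      }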
     
  • We never pass 'wait' as true to blk_mq_get_driver_tag(), and hence
    we never change '**hctx' as well. The last use of these went away
    with the flush cleanup, commit 0c2a6fe4dc3e.

    So clean up the usage and remove the two extra parameters.

    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Tested-by: Andrew Jones
    Reviewed-by: Omar Sandoval
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

29 May, 2018

1 commit

  • This patch simplifies the timeout handling by relying on the request
    reference counting to ensure the iterator is operating on an inflight
    and truly timed out request. Since the reference counting prevents the
    tag from being reallocated, the block layer no longer needs to prevent
    drivers from completing their requests while the timeout handler is
    operating on it: a driver completing a request is allowed to proceed to
    the next state without additional synchronization with the block layer.

    This also removes any need for generation sequence numbers, since the
    request cannot be recycled into a new instance while timeout handling
    is operating on it.

    To enable this, a refcount is added to struct request so that request
    users can be sure they're operating on the same request without it
    changing while they're processing it. The request's tag won't be
    released for reuse until both the timeout handler and the completion
    are done with it.

    Signed-off-by: Keith Busch
    [hch: slight cleanups, added back submission side hctx lock, use cmpxchg
    for completions]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
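
    A small user-space model of the refcounting idea using C11 atomics
    (field and function names are invented, not the kernel's): the tag is
    only released once every holder, including the timeout handler, has
    dropped its reference.

      #include <stdatomic.h>

      struct req {
              atomic_int ref;                      /* models the new struct request refcount */
              int tag;
      };

      static void req_get(struct req *rq)
      {
              atomic_fetch_add(&rq->ref, 1);       /* e.g. taken by the timeout iterator */
      }

      /*
       * Drop a reference; the tag goes back to the allocator only when the last
       * holder lets go, so a timeout handler holding a ref never races with the
       * tag being recycled into a new request.
       */
      static void req_put(struct req *rq)
      {
              if (atomic_fetch_sub(&rq->ref, 1) == 1)
                      rq->tag = -1;                /* last ref: tag may now be reused */
      }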
     

26 Apr, 2018

1 commit

  • When the blk-mq inflight implementation was added, /proc/diskstats was
    converted to use it, but /sys/block/$dev/inflight was not. Fix it by
    adding another helper to count in-flight requests by data direction.

    Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

18 Jan, 2018

1 commit

  • blk_insert_cloned_request() is called in the fast path of a dm-rq driver
    (e.g. blk-mq request-based DM mpath). blk_insert_cloned_request() uses
    blk_mq_request_bypass_insert() to directly append the request to the
    blk-mq hctx->dispatch_list of the underlying queue.

    1) This way isn't efficient enough because the hctx spinlock is always
    used.

    2) With blk_insert_cloned_request(), we completely bypass underlying
    queue's elevator and depend on the upper-level dm-rq driver's elevator
    to schedule IO. But dm-rq currently can't get the underlying queue's
    dispatch feedback at all. Without knowing whether a request was issued
    or not (e.g. due to underlying queue being busy) the dm-rq elevator will
    not be able to provide effective IO merging (as a side-effect of dm-rq
    currently blindly destaging a request from its elevator only to requeue
    it after a delay, which kills any opportunity for merging). This
    obviously causes very bad sequential IO performance.

    Fix this by updating blk_insert_cloned_request() to use
    blk_mq_request_direct_issue(). blk_mq_request_direct_issue() allows a
    request to be issued directly to the underlying queue and returns the
    dispatch feedback (blk_status_t). If blk_mq_request_direct_issue()
    returns BLK_STS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
    to _not_ destage the request, thereby preserving the opportunity to
    merge IO.

    With this, request-based DM's blk-mq sequential IO performance is vastly
    improved (as much as 3X in mpath/virtio-scsi testing).

    Signed-off-by: Ming Lei
    [blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
    they were refactored to make them less fragile and easier to read/review]
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Ming Lei
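
    A sketch of the resulting control flow on the dm-rq side (illustrative
    C; direct_issue(), requeue_in_elevator() and the status values are
    invented names, not the kernel API):

      struct request;

      enum issue_status { STS_OK, STS_RESOURCE, STS_ERROR };

      /* stand-ins for blk_mq_request_direct_issue() and the dm-rq requeue/fail paths */
      enum issue_status direct_issue(struct request *clone);
      void requeue_in_elevator(struct request *orig);
      void fail_request(struct request *orig);

      /*
       * Issue the clone directly to the underlying queue and act on the feedback:
       * if the low-level queue is busy, keep the original request in dm-rq's
       * elevator (DM_MAPIO_REQUEUE) so it can still be merged, rather than
       * destaging it and requeueing after a delay.
       */
      void map_and_issue(struct request *orig, struct request *clone)
      {
              switch (direct_issue(clone)) {
              case STS_OK:
                      break;                       /* dispatched to the driver        */
              case STS_RESOURCE:
                      requeue_in_elevator(orig);   /* stay merge-able in the elevator */
                      break;
              default:
                      fail_request(orig);
                      break;
              }
      }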
     

10 Jan, 2018

3 commits

  • After the recent updates to use generation number and state based
    synchronization, we can easily replace REQ_ATOM_STARTED usages by
    adding an extra state to distinguish completed but not yet freed
    state.

    Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with
    blk_mq_rq_state() tests. REQ_ATOM_STARTED no longer has any users
    left and is removed.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
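
    Pictured roughly, the per-request state machine that replaces the bit
    looks like this (the helper below is illustrative; only the state names
    follow the commit text):

      /* blk-mq per-request states; MQ_RQ_COMPLETE is the newly added one. */
      enum mq_rq_state {
              MQ_RQ_IDLE,          /* allocated or already freed, not started */
              MQ_RQ_IN_FLIGHT,     /* issued to the driver                    */
              MQ_RQ_COMPLETE,      /* completed, but not yet freed            */
      };

      struct req { unsigned int gstate; };         /* low bits carry the state */

      /* "has this request started?" becomes a simple state test */
      static enum mq_rq_state rq_state(const struct req *rq)
      {
              return (enum mq_rq_state)(rq->gstate & 0x3);
      }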
     
  • With issue/complete and timeout paths now using the generation number
    and state based synchronization, blk_abort_request() is the only one
    which depends on REQ_ATOM_COMPLETE for arbitrating completion.

    There's no reason for blk_abort_request() to be a completely separate
    path. This patch makes blk_abort_request() piggyback on the timeout
    path instead of trying to terminate the request directly.

    This removes the last dependency on REQ_ATOM_COMPLETE in blk-mq.

    Note that this makes blk_abort_request() asynchronous - it initiates
    abortion but the actual termination will happen after a short while,
    even when the caller owns the request. AFAICS, SCSI and ATA should be
    fine with that and I think mtip32xx and dasd should be safe but not
    completely sure. It'd be great if people who know the drivers take a
    look.

    v2: - Add comment explaining the lack of synchronization around
    ->deadline update as requested by Bart.

    Signed-off-by: Tejun Heo
    Cc: Asai Thambi SP
    Cc: Stefan Haberland
    Cc: Jan Hoeppner
    Cc: Bart Van Assche
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, blk-mq timeout path synchronizes against the usual
    issue/completion path using a complex scheme involving atomic
    bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
    rules. Unfortunately, it contains quite a few holes.

    There's a complex dancing around REQ_ATOM_STARTED and
    REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
    they don't have a synchronization point across request recycle
    instances and it isn't clear what the barriers add.
    blk_mq_check_expired() can easily read STARTED from the N-2'th recycle
    instance, the deadline from the N-1'th, and run blk_mark_rq_complete()
    against the Nth instance.

    In fact, it's pretty easy to make blk_mq_check_expired() terminate a
    later instance of a request. If we induce 5 sec delay before
    time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
    2s, and issue back-to-back large IOs, blk-mq starts timing out
    requests spuriously pretty quickly. Nothing actually timed out. It
    just made the call on a recycle instance of a request and then
    terminated a later instance long after the original instance finished.
    The scenario isn't theoretical either.

    This patch replaces the broken synchronization mechanism with an RCU
    and generation number based one.

    1. Each request has a u64 generation + state value, which can be
    updated only by the request owner. Whenever a request becomes
    in-flight, the generation number gets bumped up too. This provides
    the basis for the timeout path to distinguish different recycle
    instances of the request.

    Also, marking a request in-flight and setting its deadline are
    protected with a seqcount so that the timeout path can fetch both
    values coherently.

    2. The timeout path fetches the generation, state and deadline. If
    the verdict is timeout, it records the generation into a dedicated
    request abortion field and does RCU wait.

    3. The completion path is also protected by RCU (from the previous
    patch) and checks whether the current generation number and state
    match the abortion field. If so, it skips completion.

    4. The timeout path, after RCU wait, scans requests again and
    terminates the ones whose generation and state still match the ones
    requested for abortion.

    At this point, the timeout path knows that it either lost the race
    (the generation number and state have changed) or the completion path
    will yield to it, so it can safely time out the request.

    While it's more lines of code, it's conceptually simpler, doesn't
    depend on direct use of subtle memory ordering or coherence, and
    hopefully doesn't terminate the wrong instance.

    While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
    between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
    removed yet as it's still used in other places. Future patches will
    move all state tracking to the new mechanism and remove all bitops in
    the hot paths.

    Note that this patch adds a comment explaining a race condition in
    BLK_EH_RESET_TIMER path. The race has always been there and this
    patch doesn't change it. It's just documenting the existing race.

    v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
    - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
    - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.

    v3: - Fixed possible extended seqcount / u64_stats_sync read looping
    spotted by Peter.
    - MQ_RQ_IDLE was incorrectly being set in complete_request instead
    of free_request. Fixed.

    v4: - Rebased on top of hctx_lock() refactoring patch.
    - Added comment explaining the use of hctx_lock() in completion path.

    v5: - Added comments requested by Bart.
    - Note the addition of BLK_EH_RESET_TIMER race condition in the
    commit message.

    Signed-off-by: Tejun Heo
    Cc: "jianchao.wang"
    Cc: Peter Zijlstra
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Signed-off-by: Jens Axboe

    Tejun Heo
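
    A compressed user-space model of the generation + sequence-count scheme
    (C11 atomics, all operations seq_cst for simplicity; this is not the
    kernel code): the owner bumps the generation and publishes the deadline
    under a sequence counter, and the timeout path only trusts a snapshot
    taken while the sequence was stable.

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdint.h>

      struct req {
              atomic_uint   seq;       /* even = stable, odd = update in progress     */
              atomic_ullong gen;       /* bumped each time the request goes in-flight */
              atomic_ullong deadline;
      };

      /* Owner side: start a new recycle instance of the request. */
      static void start_request(struct req *rq, uint64_t now, uint64_t timeout)
      {
              atomic_fetch_add(&rq->seq, 1);              /* seq becomes odd  */
              atomic_fetch_add(&rq->gen, 1);
              atomic_store(&rq->deadline, now + timeout);
              atomic_fetch_add(&rq->seq, 1);              /* seq becomes even */
      }

      /*
       * Timeout side: fetch generation and deadline coherently, retrying if a
       * writer was active.  The caller records the generation and only aborts the
       * request later if the generation (and state) still match this snapshot, so
       * a different recycle instance can never be terminated by mistake.
       */
      static bool snapshot(struct req *rq, uint64_t *gen, uint64_t *deadline)
      {
              unsigned int s = atomic_load(&rq->seq);
              if (s & 1)
                      return false;                       /* writer active, retry later */
              *gen      = atomic_load(&rq->gen);
              *deadline = atomic_load(&rq->deadline);
              return atomic_load(&rq->seq) == s;
      }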
     

15 Nov, 2017

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the main pull request for block storage for 4.15-rc1.

    Nothing out of the ordinary in here, and no API changes or anything
    like that. Just various new features for drivers, core changes, etc.
    In particular, this pull request contains:

    - A patch series from Bart, closing the hole on blk/scsi-mq queue
    quiescing.

    - A series from Christoph, building towards hidden gendisks (for
    multipath) and ability to move bio chains around.

    - NVMe
    - Support for native multipath for NVMe (Christoph).
    - Userspace notifications for AENs (Keith).
    - Command side-effects support (Keith).
    - SGL support (Chaitanya Kulkarni)
    - FC fixes and improvements (James Smart)
    - Lots of fixes and tweaks (Various)

    - bcache
    - New maintainer (Michael Lyle)
    - Writeback control improvements (Michael)
    - Various fixes (Coly, Elena, Eric, Liang, et al)

    - lightnvm updates, mostly centered around the pblk interface
    (Javier, Hans, and Rakesh).

    - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

    - Writeback series that fix the much discussed hundreds of millions
    of sync-all units. This goes all the way, as discussed previously
    (me).

    - Fix for missing wakeup on writeback timer adjustments (Yafang
    Shao).

    - Fix laptop mode on blk-mq (me).

    - {mq,name} tuple lookup for IO schedulers, allowing us to have
    alias names. This means you can use 'deadline' on both !mq and on
    mq (where it's called mq-deadline). (me).

    - blktrace race fix, oopsing on sg load (me).

    - blk-mq optimizations (me).

    - Obscure waitqueue race fix for kyber (Omar).

    - NBD fixes (Josef).

    - Disable writeback throttling by default on bfq, like we do on cfq
    (Luca Miccio).

    - Series from Ming that enable us to treat flush requests on blk-mq
    like any other request. This is a really nice cleanup.

    - Series from Ming that improves merging on blk-mq with schedulers,
    getting us closer to flipping the switch on scsi-mq again.

    - BFQ updates (Paolo).

    - blk-mq atomic flags memory ordering fixes (Peter Z).

    - Loop cgroup support (Shaohua).

    - Lots of minor fixes from lots of different folks, both for core and
    driver code"

    * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
    nvme: fix visibility of "uuid" ns attribute
    blk-mq: fixup some comment typos and lengths
    ide: ide-atapi: fix compile error with defining macro DEBUG
    blk-mq: improve tag waiting setup for non-shared tags
    brd: remove unused brd_mutex
    blk-mq: only run the hardware queue if IO is pending
    block: avoid null pointer dereference on null disk
    fs: guard_bio_eod() needs to consider partitions
    xtensa/simdisk: fix compile error
    nvme: expose subsys attribute to sysfs
    nvme: create 'slaves' and 'holders' entries for hidden controllers
    block: create 'slaves' and 'holders' entries for hidden gendisks
    nvme: also expose the namespace identification sysfs files for mpath nodes
    nvme: implement multipath access to nvme subsystems
    nvme: track shared namespaces
    nvme: introduce a nvme_ns_ids structure
    nvme: track subsystems
    block, nvme: Introduce blk_mq_req_flags_t
    block, scsi: Make SCSI quiesce and resume work reliably
    block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
    ...

    Linus Torvalds
     

11 Nov, 2017

2 commits

  • Currently we are inconsistent in when we decide to run the queue. Using
    blk_mq_run_hw_queues() we check if the hctx has pending IO before
    running it, but we don't do that from the individual queue run function,
    blk_mq_run_hw_queue(). This potentially results in a lot of extra and
    pointless queue runs, on flush requests and (much worse) in tag
    starvation situations. This is observable just by looking at top
    output, with lots of kworkers active. For the !async runs, it just
    adds to the CPU overhead of blk-mq.

    Move the has-pending check into the run function instead of having
    callers do it.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
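
    Roughly, the guard moves into the run helper itself, so every caller
    gets it for free (illustrative C; has_pending_io() and kick_queue_work()
    are invented stand-ins):

      #include <stdbool.h>

      struct hw_ctx;

      bool has_pending_io(struct hw_ctx *hctx);    /* any sw queue or dispatch-list work? */
      void kick_queue_work(struct hw_ctx *hctx, bool async);

      /* Run one hw queue, but only if it actually has IO pending; callers can no
       * longer forget the check, and pointless kworker wakeups go away. */
      void run_hw_queue(struct hw_ctx *hctx, bool async)
      {
              if (!has_pending_io(hctx))
                      return;
              kick_queue_work(hctx, async);
      }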
     
  • Several block layer and NVMe core functions accept a combination
    of BLK_MQ_REQ_* flags through the 'flags' argument but there is
    no verification at compile time whether the right type of block
    layer flags is passed. Make it possible for sparse to verify this.
    This patch does not change any functionality.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Tested-by: Oleksandr Natalenko
    Cc: linux-nvme@lists.infradead.org
    Cc: Christoph Hellwig
    Cc: Johannes Thumshirn
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
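
    The kernel's usual way to make such flags checkable is a __bitwise
    typedef, which sparse treats as a distinct type; a hedged sketch of the
    pattern (the typedef and flag names below are illustrative, not the
    actual blk_mq_req_flags_t definition):

      /* Under sparse (__CHECKER__), __bitwise makes the typedef a distinct type,
       * so passing a plain int or a flag of another flavour is reported. */
      #ifdef __CHECKER__
      #define __bitwise __attribute__((bitwise))
      #define __force   __attribute__((force))
      #else
      #define __bitwise
      #define __force
      #endif

      typedef unsigned int __bitwise mq_req_flags_t;

      #define MQ_REQ_NOWAIT    ((__force mq_req_flags_t)(1U << 0))  /* illustrative */
      #define MQ_REQ_RESERVED  ((__force mq_req_flags_t)(1U << 1))  /* illustrative */

      struct request_queue;
      struct request;

      /* callers must now pass mq_req_flags_t, not a bare unsigned int */
      struct request *get_request(struct request_queue *q, mq_req_flags_t flags);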
     

05 Nov, 2017

3 commits

  • We need this helper to put the driver tag for a flush rq, since the
    following patch will stop sharing the tag across the flush request
    sequence when an I/O scheduler is applied.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • The block flush code needs this function without running the queue, so
    add a parameter controlling whether we run it or not.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • It is enough to just check if we can get the budget via .get_budget().
    And we don't need to deal with device state change in .get_budget().

    For SCSI, one issue to be fixed is that we have to call
    scsi_mq_uninit_cmd() to free allocated resources if the SCSI device
    fails to handle the request. And it isn't enough to simply call
    blk_mq_end_request() to do that if this request is marked as
    RQF_DONTPREP.

    Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for blk-mq")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

01 Nov, 2017

2 commits

  • SCSI devices use a host-wide tagset, and the shared driver tag space is
    often quite big. However, there is also a queue depth for each LUN
    (.cmd_per_lun), which is often small; for example, on both lpfc and
    qla2xxx, .cmd_per_lun is just 3.

    So lots of requests may stay in the sw queue, and we always flush all
    of those belonging to the same hw queue and dispatch them all to the
    driver. Unfortunately this easily makes the queue busy because of the
    small .cmd_per_lun. Once these requests are flushed out, they have to
    stay in hctx->dispatch, no bio merge can happen on them, and sequential
    IO performance is harmed.

    This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request
    from a sw queue, so that we can dispatch requests in the scheduler's
    way. We can then avoid dequeueing too many requests from the sw queue,
    since we don't flush ->dispatch completely.

    This patch improves dispatching from sw queue by using the .get_budget
    and .put_budget callbacks.

    Reviewed-by: Omar Sandoval
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • For SCSI devices, there is often a per-request-queue depth, which needs
    to be respected before queuing one request.

    Currently blk-mq always dequeues the request first, then calls
    .queue_rq() to dispatch the request to the LLD. One obvious issue with this
    approach is that I/O merging may not be successful, because when the
    per-request-queue depth can't be respected, .queue_rq() has to return
    BLK_STS_RESOURCE, and then this request has to stay in hctx->dispatch
    list. This means it never gets a chance to be merged with other IO.

    This patch introduces .get_budget and .put_budget callbacks in
    blk_mq_ops, so that we can try to get a budget before dequeuing a
    request. If the budget for queueing I/O can't be satisfied, we don't
    need to dequeue the request at all. Hence the request can be left in
    the IO scheduler queue, for more merging opportunities.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
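
    The resulting dispatch loop looks roughly like this (illustrative C;
    get_budget/put_budget/next_request/queue_rq are stand-ins for the new
    blk_mq_ops callbacks and the driver dispatch, and error handling is
    heavily simplified):

      #include <stdbool.h>
      #include <stddef.h>

      struct request_queue;
      struct request;

      bool get_budget(struct request_queue *q);                /* e.g. reserve a .cmd_per_lun slot */
      void put_budget(struct request_queue *q);
      struct request *next_request(struct request_queue *q);   /* from scheduler/sw queue */
      bool queue_rq(struct request_queue *q, struct request *rq);

      /* Only dequeue a request once we know the device can take it; otherwise the
       * request stays in the IO scheduler queue where it can still be merged. */
      void dispatch(struct request_queue *q)
      {
              for (;;) {
                      if (!get_budget(q))
                              break;                       /* device busy: leave IO queued  */
                      struct request *rq = next_request(q);
                      if (!rq) {
                              put_budget(q);               /* nothing to send: give it back */
                              break;
                      }
                      if (!queue_rq(q, rq))
                              break;                       /* rejected: real code requeues rq
                                                              and sorts out the budget */
              }
      }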
     

12 Sep, 2017

1 commit

  • A NULL pointer crash was reported for the case of having the BFQ IO
    scheduler attached to the underlying blk-mq paths of a DM multipath
    device. The crash occurred in blk_mq_sched_insert_request()'s call to
    e->type->ops.mq.insert_requests().

    Paolo Valente correctly summarized why the crash occurred with:
    "the call chain (dm_mq_queue_rq -> map_request -> setup_clone ->
    blk_rq_prep_clone) creates a cloned request without invoking
    e->type->ops.mq.prepare_request for the target elevator e. The cloned
    request is therefore not initialized for the scheduler, but it is
    however inserted into the scheduler by blk_mq_sched_insert_request."

    All said, a request-based DM multipath device's IO scheduler should be
    the only one used -- when the original requests are issued to the
    underlying paths as cloned requests they are inserted directly in the
    underlying dispatch queue(s) rather than through an additional elevator.

    But commit bd166ef18 ("blk-mq-sched: add framework for MQ capable IO
    schedulers") switched blk_insert_cloned_request() from using
    blk_mq_insert_request() to blk_mq_sched_insert_request(). Which
    incorrectly added elevator machinery into a call chain that isn't
    supposed to have any.

    To fix this introduce a blk-mq private blk_mq_request_bypass_insert()
    that blk_insert_cloned_request() calls to insert the request without
    involving any elevator that may be attached to the cloned request's
    request_queue.

    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Cc: stable@vger.kernel.org
    Reported-by: Bart Van Assche
    Tested-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Jens Axboe
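
    Conceptually, the new helper just appends the cloned request to the hctx
    dispatch list under its lock and never touches the elevator; a
    user-space sketch with a tiny singly linked list (all names invented):

      #include <pthread.h>
      #include <stddef.h>

      struct request { struct request *next; };

      struct hw_ctx {
              pthread_mutex_t lock;            /* stand-in for the hctx spinlock */
              struct request *dispatch;        /* stand-in for hctx->dispatch    */
              struct request **dispatch_tail;  /* starts as &dispatch            */
      };

      /*
       * Append the request straight onto the dispatch list: no elevator callbacks
       * run, so a cloned request that was never prepared for a scheduler can be
       * inserted safely.
       */
      void bypass_insert(struct hw_ctx *hctx, struct request *rq)
      {
              rq->next = NULL;
              pthread_mutex_lock(&hctx->lock);
              *hctx->dispatch_tail = rq;
              hctx->dispatch_tail = &rq->next;
              pthread_mutex_unlock(&hctx->lock);
      }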
     

10 Aug, 2017

1 commit

  • We don't have to inc/dec some counter, since we can just
    iterate the tags. That makes inc/dec a noop, but means we
    have to iterate busy tags to get an in-flight count.

    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
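
    In effect the per-queue counter is replaced by a walk over the busy
    tags; a simplified model (the bitmap and per-tag direction array stand
    in for the real tag-set iteration):

      #include <stdbool.h>

      #define NR_TAGS 256

      struct tags {
              bool busy[NR_TAGS];          /* which driver tags are in use       */
              int  dir[NR_TAGS];           /* 0 = read, 1 = write, for busy tags */
      };

      /* Count in-flight requests per direction by iterating busy tags, instead of
       * atomically bumping a counter on every request start and completion. */
      static void inflight_rw(const struct tags *t, unsigned int inflight[2])
      {
              inflight[0] = inflight[1] = 0;
              for (int i = 0; i < NR_TAGS; i++)
                      if (t->busy[i])
                              inflight[t->dir[i]]++;
      }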
     

04 Jul, 2017

1 commit

  • Pull irq updates from Thomas Gleixner:
    "The irq department delivers:

    - Expand the generic infrastructure handling the irq migration on CPU
    hotplug and convert X86 over to it. (Thomas Gleixner)

    Aside of consolidating code this is a preparatory change for:

    - Finalizing the affinity management for multi-queue devices. The
    main change here is to shut down interrupts which are affine to an
    outgoing CPU and re-enable them when the CPU comes online again.
    That avoids moving interrupts pointlessly around and breaking and
    reestablishing affinities for no value. (Christoph Hellwig)

    Note: This contains also the BLOCK-MQ and NVME changes which depend
    on the rework of the irq core infrastructure. Jens acked them and
    agreed that they should go with the irq changes.

    - Consolidation of irq domain code (Marc Zyngier)

    - State tracking consolidation in the core code (Jeffy Chen)

    - Add debug infrastructure for hierarchical irq domains (Thomas
    Gleixner)

    - Infrastructure enhancement for managing generic interrupt chips via
    devmem (Bartosz Golaszewski)

    - Constification work all over the place (Tobias Klauser)

    - Two new interrupt controller drivers for MVEBU (Thomas Petazzoni)

    - The usual set of fixes, updates and enhancements all over the
    place"

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
    irqchip/or1k-pic: Fix interrupt acknowledgement
    irqchip/irq-mvebu-gicp: Allocate enough memory for spi_bitmap
    irqchip/gic-v3: Fix out-of-bound access in gic_set_affinity
    nvme: Allocate queues for all possible CPUs
    blk-mq: Create hctx for each present CPU
    blk-mq: Include all present CPUs in the default queue mapping
    genirq: Avoid unnecessary low level irq function calls
    genirq: Set irq masked state when initializing irq_desc
    genirq/timings: Add infrastructure for estimating the next interrupt arrival time
    genirq/timings: Add infrastructure to track the interrupt timings
    genirq/debugfs: Remove pointless NULL pointer check
    irqchip/gic-v3-its: Don't assume GICv3 hardware supports 16bit INTID
    irqchip/gic-v3-its: Add ACPI NUMA node mapping
    irqchip/gic-v3-its-platform-msi: Make of_device_ids const
    irqchip/gic-v3-its: Make of_device_ids const
    irqchip/irq-mvebu-icu: Add new driver for Marvell ICU
    irqchip/irq-mvebu-gicp: Add new driver for Marvell GICP
    dt-bindings/interrupt-controller: Add DT binding for the Marvell ICU
    genirq/irqdomain: Remove auto-recursive hierarchy support
    irqchip/MSI: Use irq_domain_update_bus_token instead of an open coded access
    ...

    Linus Torvalds
     

29 Jun, 2017

1 commit

  • Currently we only create hctx for online CPUs, which can lead to a lot
    of churn due to frequent soft offline / online operations. Instead
    allocate one for each present CPU to avoid this and dramatically simplify
    the code.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Cc: Keith Busch
    Cc: linux-block@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Christoph Hellwig
     

08 Apr, 2017

1 commit

  • We've added a considerable amount of fixes for stalls and issues
    with the blk-mq scheduling in the 4.11 series since forking
    off the for-4.12/block branch. We need to do improvements on
    top of that for 4.12, so pull in the previous fixes to make
    our lives easier going forward.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

07 Apr, 2017

1 commit

  • While dispatching requests, if we fail to get a driver tag, we mark the
    hardware queue as waiting for a tag and put the requests on a
    hctx->dispatch list to be run later when a driver tag is freed. However,
    blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
    queues if using a single-queue scheduler with a multiqueue device. If
    blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
    are processing. This means we end up using the hardware queue of the
    previous request, which may or may not be the same as that of the
    current request. If it isn't, the wrong hardware queue will end up
    waiting for a tag, and the requests will be on the wrong dispatch list,
    leading to a hang.

    The fix is twofold:

    1. Make sure we save which hardware queue we were trying to get a
    request for in blk_mq_get_driver_tag() regardless of whether it
    succeeds or not.
    2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
    blk_mq_hw_queue to make it clear that it must handle multiple
    hardware queues, since I've already messed this up on a couple of
    occasions.

    This didn't appear in testing with nvme and mq-deadline because nvme has
    more driver tags than the default number of scheduler tags. However,
    with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

22 Mar, 2017

1 commit

  • Currently, statistics are gathered in ~0.13s windows, and users grab the
    statistics whenever they need them. This is not ideal for both in-tree
    users:

    1. Writeback throttling wants its own dynamically sized window of
    statistics. Since the blk-stats statistics are reset after every
    window and the wbt windows don't line up with the blk-stats windows,
    wbt doesn't see every I/O.
    2. Polling currently grabs the statistics on every I/O. Again, depending
    on how the window lines up, we may miss some I/Os. It's also
    unnecessary overhead to get the statistics on every I/O; the hybrid
    polling heuristic would be just as happy with the statistics from the
    previous full window.

    This reworks the blk-stats infrastructure to be callback-based: users
    register a callback that they want called at a given time with all of
    the statistics from the window during which the callback was active.
    Users can dynamically bucketize the statistics. wbt and polling both
    currently use read vs. write, but polling can be extended to further
    subdivide based on request size.

    The callbacks are kept on an RCU list, and each callback has percpu
    stats buffers. There will only be a few users, so the overhead on the
    I/O completion side is low. The stats flushing is also simplified
    considerably: since the timer function is responsible for clearing the
    statistics, we don't have to worry about stale statistics.

    wbt is a trivial conversion. After the conversion, the windowing problem
    mentioned above is fixed.

    For polling, we register an extra callback that caches the previous
    window's statistics in the struct request_queue for the hybrid polling
    heuristic to use.

    Since we no longer have a single stats buffer for the request queue,
    this also removes the sysfs and debugfs stats entries. To replace those,
    we add a debugfs entry for the poll statistics.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
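
    A compact, single-threaded model of the callback idea (invented names;
    the real code keeps callbacks on an RCU list with percpu buffers): each
    user registers a bucketing function plus a callback, samples are
    accumulated per bucket, and the window timer hands the buckets to the
    callback and resets them.

      #include <stdint.h>

      #define NR_BUCKETS 2                                   /* e.g. read vs. write */

      struct stat { uint64_t nr; uint64_t total_ns; };

      struct stat_cb {
              int  (*bucket_fn)(int dir, uint32_t bytes);    /* which bucket a sample goes to  */
              void (*report_fn)(struct stat_cb *cb);         /* called when the window expires */
              struct stat stat[NR_BUCKETS];
      };

      /* completion path: account one sample into the owner's current window */
      static void stat_add(struct stat_cb *cb, int dir, uint32_t bytes, uint64_t lat_ns)
      {
              struct stat *s = &cb->stat[cb->bucket_fn(dir, bytes)];
              s->nr++;
              s->total_ns += lat_ns;
      }

      /* window timer: report and clear, so stale data never leaks into the next window */
      static void stat_window_expired(struct stat_cb *cb)
      {
              cb->report_fn(cb);
              for (int i = 0; i < NR_BUCKETS; i++)
                      cb->stat[i] = (struct stat){ 0, 0 };
      }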
     

09 Mar, 2017

2 commits

  • Currently from kobject view, both q->mq_kobj and ctx->kobj can
    be released during one cycle of blk_mq_register_dev() and
    blk_mq_unregister_dev(). Actually, the sw queue's lifetime is the
    same as its request queue's, which is covered by request_queue->kobj.

    So we don't need to call kobject_put() for the two kinds of
    kobject in __blk_mq_unregister_dev(); instead we do that
    in the release handler of the request queue.

    Signed-off-by: Ming Lei
    Tested-by: Peter Zijlstra (Intel)
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Both q->mq_kobj and the sw queues' kobjects should be initialized only
    once, instead of in each add_disk context.

    Also this patch removes the clearing of ctx in blk_mq_init_cpu_queues(),
    because the percpu allocator fills allocated variables with zeros.

    This patch fixes one issue[1] reported by Omar.

    [1] kernel warning when doing unbind/bind on one scsi-mq device

    [ 19.347924] kobject (ffff8800791ea0b8): tried to init an initialized object, something is seriously wrong.
    [ 19.349781] CPU: 1 PID: 84 Comm: kworker/u8:1 Not tainted 4.10.0-rc7-00210-g53f39eeaa263 #34
    [ 19.350686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014
    [ 19.350920] Workqueue: events_unbound async_run_entry_fn
    [ 19.350920] Call Trace:
    [ 19.350920] dump_stack+0x63/0x83
    [ 19.350920] kobject_init+0x77/0x90
    [ 19.350920] blk_mq_register_dev+0x40/0x130
    [ 19.350920] blk_register_queue+0xb6/0x190
    [ 19.350920] device_add_disk+0x1ec/0x4b0
    [ 19.350920] sd_probe_async+0x10d/0x1c0 [sd_mod]
    [ 19.350920] async_run_entry_fn+0x48/0x150
    [ 19.350920] process_one_work+0x1d0/0x480
    [ 19.350920] worker_thread+0x48/0x4e0
    [ 19.350920] kthread+0x101/0x140
    [ 19.350920] ? process_one_work+0x480/0x480
    [ 19.350920] ? kthread_create_on_node+0x60/0x60
    [ 19.350920] ret_from_fork+0x2c/0x40

    Cc: Omar Sandoval
    Signed-off-by: Ming Lei
    Tested-by: Peter Zijlstra (Intel)
    Signed-off-by: Jens Axboe

    Ming Lei
     

28 Jan, 2017

1 commit

  • This fixes a couple of problems:

    1. In the !CONFIG_DEBUG_FS case, the stub definitions were bogus.
    2. In the !CONFIG_BLOCK case, blk-mq-debugfs.c shouldn't be compiled at
    all.

    Fix the stub definitions and add a CONFIG_BLK_DEBUG_FS Kconfig option.

    Fixes: 07e4fead45e6 ("blk-mq: create debugfs directory tree")
    Signed-off-by: Omar Sandoval

    Augment Kconfig description.

    Signed-off-by: Jens Axboe

    Omar Sandoval
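
    The usual shape of such stubs, for reference (a hedged sketch; the
    function names and signatures are illustrative, not the exact
    blk-mq-debugfs prototypes):

      struct request_queue;

      #ifdef CONFIG_BLK_DEBUG_FS
      int  mq_debugfs_register(struct request_queue *q);
      void mq_debugfs_unregister(struct request_queue *q);
      #else
      /* !CONFIG_BLK_DEBUG_FS builds get harmless no-op inlines instead of bogus
       * or missing definitions, and blk-mq-debugfs.c isn't compiled at all. */
      static inline int mq_debugfs_register(struct request_queue *q)
      {
              return 0;
      }
      static inline void mq_debugfs_unregister(struct request_queue *q)
      {
      }
      #endif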