21 Jun, 2018

1 commit

  • [ Upstream commit bf0ddaba65ddbb2715af97041da8e7a45b2d8628 ]

    When the blk-mq inflight implementation was added, /proc/diskstats was
    converted to use it, but /sys/block/$dev/inflight was not. Fix it by
    adding another helper to count in-flight requests by data direction.

    Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval
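
    The direction-split counting this fix describes can be pictured with a
    minimal userspace sketch; struct request, count_in_flight_rw() and the
    field names below are illustrative stand-ins, not the kernel's
    definitions:

    #include <stdbool.h>
    #include <stdio.h>

    enum data_dir { DIR_READ = 0, DIR_WRITE = 1 };

    struct request {
        bool in_flight;          /* tag busy: issued to the driver */
        enum data_dir dir;
    };

    /* Walk the busy tags and bucket by direction, in the spirit of a
     * blk_mq_in_flight_rw()-style helper feeding
     * /sys/block/$dev/inflight. */
    static void count_in_flight_rw(const struct request *rqs, int nr_tags,
                                   unsigned int inflight[2])
    {
        inflight[DIR_READ] = inflight[DIR_WRITE] = 0;
        for (int tag = 0; tag < nr_tags; tag++)
            if (rqs[tag].in_flight)
                inflight[rqs[tag].dir]++;
    }

    int main(void)
    {
        struct request rqs[4] = {
            { true, DIR_READ }, { true, DIR_WRITE },
            { false, DIR_READ }, { true, DIR_WRITE },
        };
        unsigned int inflight[2];

        count_in_flight_rw(rqs, 4, inflight);
        printf("%u %u\n", inflight[DIR_READ], inflight[DIR_WRITE]);
        return 0;
    }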
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to licenses
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to apply to
    a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    should be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
      lines of source.
    - File already had some variant of a license header in it (even if <5
      lines).

    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

12 Sep, 2017

1 commit

  • A NULL pointer crash was reported for the case of having the BFQ IO
    scheduler attached to the underlying blk-mq paths of a DM multipath
    device. The crash occurred in blk_mq_sched_insert_request()'s call to
    e->type->ops.mq.insert_requests().

    Paolo Valente correctly summarized why the crash occurred with:
    "the call chain (dm_mq_queue_rq -> map_request -> setup_clone ->
    blk_rq_prep_clone) creates a cloned request without invoking
    e->type->ops.mq.prepare_request for the target elevator e. The cloned
    request is therefore not initialized for the scheduler, but it is
    however inserted into the scheduler by blk_mq_sched_insert_request."

    All said, a request-based DM multipath device's IO scheduler should be
    the only one used -- when the original requests are issued to the
    underlying paths as cloned requests they are inserted directly in the
    underlying dispatch queue(s) rather than through an additional elevator.

    But commit bd166ef18 ("blk-mq-sched: add framework for MQ capable IO
    schedulers") switched blk_insert_cloned_request() from using
    blk_mq_insert_request() to blk_mq_sched_insert_request(), which
    incorrectly added elevator machinery into a call chain that isn't
    supposed to have any.

    To fix this, introduce a blk-mq private blk_mq_request_bypass_insert()
    that blk_insert_cloned_request() calls to insert the request without
    involving any elevator that may be attached to the cloned request's
    request_queue.

    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Cc: stable@vger.kernel.org
    Reported-by: Bart Van Assche
    Tested-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Jens Axboe
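
    The difference between the two insertion paths can be modelled in a few
    lines of userspace C; every type and function name below is an
    illustrative stand-in for the kernel's machinery:

    #include <stdio.h>

    struct request { const char *name; };

    struct elevator {
        /* valid for a request only if prepare_request() ran on it */
        void (*insert)(struct request *rq);
    };

    struct hw_queue {
        struct elevator *elevator;       /* may be NULL */
        struct request *dispatch[16];    /* plain dispatch list */
        int nr_dispatch;
    };

    /* normal path: route through the elevator when one is attached */
    static void sched_insert(struct hw_queue *hctx, struct request *rq)
    {
        if (hctx->elevator) {
            hctx->elevator->insert(rq);
            return;
        }
        hctx->dispatch[hctx->nr_dispatch++] = rq;
    }

    /* bypass path for cloned requests: the elevator never sees them,
     * so no scheduler-private request state is ever required */
    static void bypass_insert(struct hw_queue *hctx, struct request *rq)
    {
        hctx->dispatch[hctx->nr_dispatch++] = rq;
    }

    static void elevator_insert(struct request *rq)
    {
        printf("scheduler insert: %s\n", rq->name);
    }

    int main(void)
    {
        struct elevator e = { elevator_insert };
        struct hw_queue hctx = { &e, { 0 }, 0 };
        struct request normal = { "normal" }, clone = { "clone" };

        sched_insert(&hctx, &normal);   /* goes through the elevator */
        bypass_insert(&hctx, &clone);   /* straight to dispatch */
        printf("on dispatch list: %d\n", hctx.nr_dispatch);
        return 0;
    }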
     

10 Aug, 2017

1 commit

  • We don't have to inc/dec a counter, since we can just
    iterate the tags. That makes the inc/dec a no-op, but means we
    have to iterate the busy tags to get an in-flight count.

    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
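
    A toy version of the trade-off, with a single 64-bit word standing in
    for the real tag maps:

    #include <stdint.h>
    #include <stdio.h>

    /* 64 tags in one word; a set bit means the tag is in flight */
    static uint64_t busy_tags;

    static void start_request(int tag) { busy_tags |= 1ULL << tag; }
    static void end_request(int tag)   { busy_tags &= ~(1ULL << tag); }

    /* no counter to maintain: walk the busy bits when somebody asks */
    static unsigned int in_flight(void)
    {
        uint64_t w = busy_tags;
        unsigned int count = 0;

        while (w) {
            count++;
            w &= w - 1;    /* clear lowest set bit */
        }
        return count;
    }

    int main(void)
    {
        start_request(3);
        start_request(17);
        start_request(42);
        end_request(17);
        printf("in flight: %u\n", in_flight());    /* 2 */
        return 0;
    }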
     

04 Jul, 2017

1 commit

  • Pull irq updates from Thomas Gleixner:
    "The irq department delivers:

    - Expand the generic infrastructure handling the irq migration on CPU
    hotplug and convert X86 over to it. (Thomas Gleixner)

    Aside from consolidating code, this is a preparatory change for:

    - Finalizing the affinity management for multi-queue devices. The
    main change here is to shut down interrupts which are affine to an
    outgoing CPU and re-enable them when the CPU comes online again.
    That avoids moving interrupts pointlessly around and breaking and
    reestablishing affinities for no value. (Christoph Hellwig)

    Note: This also contains the BLOCK-MQ and NVME changes, which depend
    on the rework of the irq core infrastructure. Jens acked them and
    agreed that they should go with the irq changes.

    - Consolidation of irq domain code (Marc Zyngier)

    - State tracking consolidation in the core code (Jeffy Chen)

    - Add debug infrastructure for hierarchical irq domains (Thomas
    Gleixner)

    - Infrastructure enhancement for managing generic interrupt chips via
    devmem (Bartosz Golaszewski)

    - Constification work all over the place (Tobias Klauser)

    - Two new interrupt controller drivers for MVEBU (Thomas Petazzoni)

    - The usual set of fixes, updates and enhancements all over the
    place"

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
    irqchip/or1k-pic: Fix interrupt acknowledgement
    irqchip/irq-mvebu-gicp: Allocate enough memory for spi_bitmap
    irqchip/gic-v3: Fix out-of-bound access in gic_set_affinity
    nvme: Allocate queues for all possible CPUs
    blk-mq: Create hctx for each present CPU
    blk-mq: Include all present CPUs in the default queue mapping
    genirq: Avoid unnecessary low level irq function calls
    genirq: Set irq masked state when initializing irq_desc
    genirq/timings: Add infrastructure for estimating the next interrupt arrival time
    genirq/timings: Add infrastructure to track the interrupt timings
    genirq/debugfs: Remove pointless NULL pointer check
    irqchip/gic-v3-its: Don't assume GICv3 hardware supports 16bit INTID
    irqchip/gic-v3-its: Add ACPI NUMA node mapping
    irqchip/gic-v3-its-platform-msi: Make of_device_ids const
    irqchip/gic-v3-its: Make of_device_ids const
    irqchip/irq-mvebu-icu: Add new driver for Marvell ICU
    irqchip/irq-mvebu-gicp: Add new driver for Marvell GICP
    dt-bindings/interrupt-controller: Add DT binding for the Marvell ICU
    genirq/irqdomain: Remove auto-recursive hierarchy support
    irqchip/MSI: Use irq_domain_update_bus_token instead of an open coded access
    ...

    Linus Torvalds
     

29 Jun, 2017

1 commit

  • Currently we only create hctx for online CPUs, which can lead to a lot
    of churn due to frequent soft offline / online operations. Instead
    allocate one for each present CPU to avoid this and dramatically simplify
    the code.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Cc: Keith Busch
    Cc: linux-block@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Christoph Hellwig
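
    A minimal sketch of the allocation policy, with a hard-coded mask
    standing in for the kernel's present-CPU mask:

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NR_CPUS 8

    struct cpu_ctx { int cpu; /* per-CPU submission state lives here */ };

    int main(void)
    {
        /* CPUs physically present, whether or not they are online now */
        bool present[NR_CPUS] = { true, true, true, true };
        struct cpu_ctx *ctx[NR_CPUS] = { 0 };

        /* allocate once, for every present CPU */
        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
            if (!present[cpu])
                continue;
            ctx[cpu] = malloc(sizeof(*ctx[cpu]));
            ctx[cpu]->cpu = cpu;
        }

        /* a soft offline/online event finds its context already there,
         * so nothing needs to be torn down or reallocated */
        printf("ctx for cpu2: %p\n", (void *)ctx[2]);
        return 0;
    }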
     

08 Apr, 2017

1 commit

  • We've added a considerable amount of fixes for stalls and issues
    with the blk-mq scheduling in the 4.11 series since forking
    off the for-4.12/block branch. We need to do improvements on
    top of that for 4.12, so pull in the previous fixes to make
    our lives easier going forward.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

07 Apr, 2017

1 commit

  • While dispatching requests, if we fail to get a driver tag, we mark the
    hardware queue as waiting for a tag and put the requests on a
    hctx->dispatch list to be run later when a driver tag is freed. However,
    blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
    queues if using a single-queue scheduler with a multiqueue device. If
    blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
    are processing. This means we end up using the hardware queue of the
    previous request, which may or may not be the same as that of the
    current request. If it isn't, the wrong hardware queue will end up
    waiting for a tag, and the requests will be on the wrong dispatch list,
    leading to a hang.

    The fix is twofold:

    1. Make sure we save which hardware queue we were trying to get a
    request for in blk_mq_get_driver_tag() regardless of whether it
    succeeds or not.
    2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
    blk_mq_hw_queue to make it clear that it must handle multiple
    hardware queues, since I've already messed this up on a couple of
    occasions.

    This didn't appear in testing with nvme and mq-deadline because nvme has
    more driver tags than the default number of scheduler tags. However,
    with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
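
    Point 1 of the fix can be modelled in userspace; get_driver_tag() and
    struct alloc_data below are illustrative stand-ins for the kernel's
    versions:

    #include <stdbool.h>
    #include <stdio.h>

    struct hw_queue { int id; int free_tags; };
    struct request  { struct hw_queue *hctx; };

    /* everything the dispatch loop remembers between requests */
    struct alloc_data { struct hw_queue *hctx; };

    /* Record the hardware queue we tried, whether or not a tag was
     * available, so the caller marks the *right* queue as waiting. */
    static bool get_driver_tag(struct request *rq, struct alloc_data *data)
    {
        data->hctx = rq->hctx;          /* always updated */
        if (data->hctx->free_tags == 0)
            return false;               /* rq parks on data->hctx */
        data->hctx->free_tags--;
        return true;
    }

    int main(void)
    {
        struct hw_queue hq0 = { 0, 0 }, hq1 = { 1, 4 };
        struct request rq = { &hq0 };
        struct alloc_data data = { &hq1 };  /* stale, from the last rq */

        if (!get_driver_tag(&rq, &data))
            printf("queue %d waits for a tag\n", data.hctx->id); /* 0 */
        return 0;
    }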
     

22 Mar, 2017

1 commit

  • Currently, statistics are gathered in ~0.13s windows, and users grab the
    statistics whenever they need them. This is not ideal for either of the
    two in-tree users:

    1. Writeback throttling wants its own dynamically sized window of
    statistics. Since the blk-stats statistics are reset after every
    window and the wbt windows don't line up with the blk-stats windows,
    wbt doesn't see every I/O.
    2. Polling currently grabs the statistics on every I/O. Again, depending
    on how the window lines up, we may miss some I/Os. It's also
    unnecessary overhead to get the statistics on every I/O; the hybrid
    polling heuristic would be just as happy with the statistics from the
    previous full window.

    This reworks the blk-stats infrastructure to be callback-based: users
    register a callback that they want called at a given time with all of
    the statistics from the window during which the callback was active.
    Users can dynamically bucketize the statistics. wbt and polling both
    currently use read vs. write, but polling can be extended to further
    subdivide based on request size.

    The callbacks are kept on an RCU list, and each callback has percpu
    stats buffers. There will only be a few users, so the overhead on the
    I/O completion side is low. The stats flushing is also simplified
    considerably: since the timer function is responsible for clearing the
    statistics, we don't have to worry about stale statistics.

    wbt is a trivial conversion. After the conversion, the windowing problem
    mentioned above is fixed.

    For polling, we register an extra callback that caches the previous
    window's statistics in the struct request_queue for the hybrid polling
    heuristic to use.

    Since we no longer have a single stats buffer for the request queue,
    this also removes the sysfs and debugfs stats entries. To replace those,
    we add a debugfs entry for the poll statistics.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
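
    The shape of such a callback can be sketched in plain C; struct stat_cb,
    stat_add() and the read/write bucket function below are illustrative
    models, not the kernel API:

    #include <stdio.h>

    #define MAX_BUCKETS 2

    struct io_stat { unsigned long nr; unsigned long long sum; };

    struct stat_cb {
        /* splits samples into buckets; here: 0 = read, 1 = write */
        int (*bucket_fn)(int is_write);
        /* runs when the window expires, with that window's buckets */
        void (*timer_fn)(struct stat_cb *cb);
        struct io_stat buckets[MAX_BUCKETS];
    };

    static int rw_bucket(int is_write) { return is_write ? 1 : 0; }

    static void window_done(struct stat_cb *cb)
    {
        for (int i = 0; i < MAX_BUCKETS; i++)
            printf("bucket %d: %lu samples, %llu ns total\n", i,
                   cb->buckets[i].nr, cb->buckets[i].sum);
    }

    /* completion path: each sample lands in the active window's bucket */
    static void stat_add(struct stat_cb *cb, int is_write,
                         unsigned long long ns)
    {
        struct io_stat *s = &cb->buckets[cb->bucket_fn(is_write)];

        s->nr++;
        s->sum += ns;
    }

    int main(void)
    {
        struct stat_cb cb = { rw_bucket, window_done, { { 0, 0 } } };

        stat_add(&cb, 0, 100);
        stat_add(&cb, 0, 300);
        stat_add(&cb, 1, 500);
        cb.timer_fn(&cb);    /* the kernel fires this from a timer */
        return 0;
    }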
     

09 Mar, 2017

2 commits

  • From the kobject point of view, both q->mq_kobj and ctx->kobj can
    currently be released during one cycle of blk_mq_register_dev() and
    blk_mq_unregister_dev(). In reality, a software queue's lifetime is the
    same as its request queue's, which is covered by request_queue->kobj.

    So we don't need to call kobject_put() for these two kinds of
    kobjects in __blk_mq_unregister_dev(); instead, do that in the
    release handler of the request queue.

    Signed-off-by: Ming Lei
    Tested-by: Peter Zijlstra (Intel)
    Signed-off-by: Jens Axboe

    Ming Lei
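
    A toy model of the lifetime change (illustrative names; real kobjects
    carry their release handler in a kobj_type):

    #include <stdio.h>

    struct kobj { int refcount; const char *name; };

    static void kobj_put(struct kobj *k)
    {
        if (--k->refcount == 0)
            printf("released %s\n", k->name);
    }

    /* Before the fix, this dropped the refs, so a later re-register saw
     * already-released kobjects. Now it tears down sysfs entries only. */
    static void unregister_dev(struct kobj *mq_kobj)
    {
        (void)mq_kobj;
    }

    /* The final put moves here: the request queue's own release handler,
     * at the true end of the software queues' lifetime. */
    static void queue_release(struct kobj *mq_kobj)
    {
        kobj_put(mq_kobj);
    }

    int main(void)
    {
        struct kobj mq = { 1, "mq" };

        unregister_dev(&mq);   /* register/unregister cycles stay safe */
        queue_release(&mq);    /* prints: released mq */
        return 0;
    }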
     
  • Both q->mq_kobj and the software queues' kobjects should be initialized
    only once, rather than on each add_disk.

    This patch also removes the clearing of ctx in blk_mq_init_cpu_queues(),
    because the percpu allocator zero-fills the allocated memory.

    This patch fixes an issue[1] reported by Omar.

    [1] kernel warning when doing unbind/bind on a scsi-mq device

    [ 19.347924] kobject (ffff8800791ea0b8): tried to init an initialized object, something is seriously wrong.
    [ 19.349781] CPU: 1 PID: 84 Comm: kworker/u8:1 Not tainted 4.10.0-rc7-00210-g53f39eeaa263 #34
    [ 19.350686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014
    [ 19.350920] Workqueue: events_unbound async_run_entry_fn
    [ 19.350920] Call Trace:
    [ 19.350920] dump_stack+0x63/0x83
    [ 19.350920] kobject_init+0x77/0x90
    [ 19.350920] blk_mq_register_dev+0x40/0x130
    [ 19.350920] blk_register_queue+0xb6/0x190
    [ 19.350920] device_add_disk+0x1ec/0x4b0
    [ 19.350920] sd_probe_async+0x10d/0x1c0 [sd_mod]
    [ 19.350920] async_run_entry_fn+0x48/0x150
    [ 19.350920] process_one_work+0x1d0/0x480
    [ 19.350920] worker_thread+0x48/0x4e0
    [ 19.350920] kthread+0x101/0x140
    [ 19.350920] ? process_one_work+0x480/0x480
    [ 19.350920] ? kthread_create_on_node+0x60/0x60
    [ 19.350920] ret_from_fork+0x2c/0x40

    Cc: Omar Sandoval
    Signed-off-by: Ming Lei
    Tested-by: Peter Zijlstra (Intel)
    Signed-off-by: Jens Axboe

    Ming Lei
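
    The init-once rule can be sketched like this (illustrative names):

    #include <stdbool.h>
    #include <stdio.h>

    struct kobj { bool initialized; };

    static void kobject_init_once(struct kobj *k)
    {
        if (k->initialized) {
            printf("tried to init an initialized object\n");
            return;
        }
        k->initialized = true;
    }

    /* init exactly once, when the queue is allocated... */
    static void mq_kobjects_init(struct kobj *mq)
    {
        kobject_init_once(mq);
    }

    /* ...so each add_disk/register cycle only (re)creates sysfs entries
     * and never re-runs the init */
    static void register_dev(struct kobj *mq)
    {
        (void)mq;
    }

    int main(void)
    {
        struct kobj mq = { false };

        mq_kobjects_init(&mq);
        register_dev(&mq);   /* unbind/bind: no double-init warning */
        register_dev(&mq);
        return 0;
    }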
     

28 Jan, 2017

2 commits

  • This fixes a couple of problems:

    1. In the !CONFIG_DEBUG_FS case, the stub definitions were bogus.
    2. In the !CONFIG_BLOCK case, blk-mq-debugfs.c shouldn't be compiled at
    all.

    Fix the stub definitions and add a CONFIG_BLK_DEBUG_FS Kconfig option.

    Fixes: 07e4fead45e6 ("blk-mq: create debugfs directory tree")
    Signed-off-by: Omar Sandoval

    Augment Kconfig description.

    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Instead of letting the caller check this and handle the details
    of inserting a flush request, put the logic in the scheduler
    insertion function. This fixes direct flush insertion outside
    of the usual make_request_fn calls, like from dm via
    blk_insert_cloned_request().

    Signed-off-by: Jens Axboe

    Jens Axboe
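
    A toy version of the centralized check (all names illustrative):

    #include <stdbool.h>
    #include <stdio.h>

    struct request { bool is_flush; };

    static void insert_flush(struct request *rq)
    {
        (void)rq;
        puts("flush machinery");
    }

    static void elevator_insert(struct request *rq)
    {
        (void)rq;
        puts("elevator");
    }

    /* One insertion entry point: every caller, including cloned-request
     * paths like blk_insert_cloned_request(), hits the flush check here
     * instead of having to know about it. */
    static void sched_insert_request(struct request *rq)
    {
        if (rq->is_flush) {
            insert_flush(rq);
            return;
        }
        elevator_insert(rq);
    }

    int main(void)
    {
        struct request flush = { true }, normal = { false };

        sched_insert_request(&flush);
        sched_insert_request(&normal);
        return 0;
    }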
     

27 Jan, 2017

2 commits

  • If we have both multiple hardware queues and shared tag map between
    devices, we need to ensure that we propagate the hardware queue
    restart bit higher up. This is because we can get into a situation
    where we don't have any IO pending on a hardware queue, yet we fail
    to get a tag to start new IO. If that happens, it's not enough to
    mark the hardware queue as needing a restart, we need to bubble
    that up to the higher level queue as well.

    Signed-off-by: Jens Axboe
    Reviewed-by: Omar Sandoval
    Tested-by: Hannes Reinecke

    Jens Axboe
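
    The two-level restart hint can be sketched as follows (types
    illustrative):

    #include <stdbool.h>
    #include <stdio.h>

    struct queue { bool restart; };

    struct hw_queue {
        bool restart;
        bool shared_tags;    /* tag map shared with another device */
        struct queue *q;
    };

    /* tag allocation failed with no IO pending on this hctx */
    static void mark_restart(struct hw_queue *hctx)
    {
        hctx->restart = true;
        /* With shared tags the freeing completion may happen on another
         * queue entirely, so bubble the hint up to the device level. */
        if (hctx->shared_tags)
            hctx->q->restart = true;
    }

    int main(void)
    {
        struct queue q = { false };
        struct hw_queue hctx = { false, true, &q };

        mark_restart(&hctx);
        printf("hctx restart=%d, queue restart=%d\n",
               hctx.restart, q.restart);
        return 0;
    }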
     
  • In preparation for putting blk-mq debugging information in debugfs,
    create a directory tree mirroring the one in sysfs:

    # tree -d /sys/kernel/debug/block
    /sys/kernel/debug/block
    |-- nvme0n1
    |   `-- mq
    |       |-- 0
    |       |   `-- cpu0
    |       |-- 1
    |       |   `-- cpu1
    |       |-- 2
    |       |   `-- cpu2
    |       `-- 3
    |           `-- cpu3
    `-- vda
        `-- mq
            `-- 0
                |-- cpu0
                |-- cpu1
                |-- cpu2
                `-- cpu3

    Also add the scaffolding for the actual files that will go in here,
    either under the hardware queue or software queue directories.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

15 Dec, 2016

1 commit

  • Pull SCSI updates from James Bottomley:
    "This update includes the usual round of major driver updates (ncr5380,
    lpfc, hisi_sas, megaraid_sas, ufs, ibmvscsis, mpt3sas).

    There's also an assortment of minor fixes, mostly in error legs or
    other not very user visible stuff. The major change is the
    pci_alloc_irq_vectors replacement for the old pci_msix_.. calls; this
    effectively makes IRQ mapping generic for the drivers and allows
    blk_mq to use the information"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (256 commits)
    scsi: qla4xxx: switch to pci_alloc_irq_vectors
    scsi: hisi_sas: support deferred probe for v2 hw
    scsi: megaraid_sas: switch to pci_alloc_irq_vectors
    scsi: scsi_devinfo: remove synchronous ALUA for NETAPP devices
    scsi: be2iscsi: set errno on error path
    scsi: be2iscsi: set errno on error path
    scsi: hpsa: fallback to use legacy REPORT PHYS command
    scsi: scsi_dh_alua: Fix RCU annotations
    scsi: hpsa: use %phN for short hex dumps
    scsi: hisi_sas: fix free'ing in probe and remove
    scsi: isci: switch to pci_alloc_irq_vectors
    scsi: ipr: Fix runaway IRQs when falling back from MSI to LSI
    scsi: dpt_i2o: double free on error path
    scsi: cxlflash: Migrate scsi command pointer to AFU command
    scsi: cxlflash: Migrate IOARRIN specific routines to function pointers
    scsi: cxlflash: Cleanup queuecommand()
    scsi: cxlflash: Cleanup send_tmf()
    scsi: cxlflash: Remove AFU command lock
    scsi: cxlflash: Wait for active AFU commands to timeout upon tear down
    scsi: cxlflash: Remove private command pool
    ...

    Linus Torvalds
     

11 Nov, 2016

1 commit

  • For legacy block, we simply track completion stats in the request queue. For
    blk-mq, we track them on a per-sw queue basis, which we can then
    sum up through the hardware queues and finally to a per device
    state.

    The stats are tracked in, roughly, 0.1s interval windows.

    Add sysfs files to display the stats.

    The feature is off by default, to avoid any extra overhead. In-kernel
    users of it can turn it on by setting QUEUE_FLAG_STATS in the queue
    flags. We currently don't turn it on if someone just reads any of
    the stats files; that is something we could add as well.

    Signed-off-by: Jens Axboe

    Jens Axboe
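
    The layout can be modelled like this; in the kernel the per-software-queue
    buffers avoid cross-CPU contention on the completion path (names
    illustrative):

    #include <stdio.h>

    #define NR_CTX 4    /* software queues; one per CPU in the kernel */

    struct io_stat { unsigned long nr; unsigned long long sum; };

    /* completion path: each sample goes to its own software queue */
    static void ctx_add(struct io_stat *ctx, unsigned long long ns)
    {
        ctx->nr++;
        ctx->sum += ns;
    }

    /* readers sum software queues up to a device-wide view on demand */
    static struct io_stat device_stat(const struct io_stat ctx[NR_CTX])
    {
        struct io_stat total = { 0, 0 };

        for (int i = 0; i < NR_CTX; i++) {
            total.nr += ctx[i].nr;
            total.sum += ctx[i].sum;
        }
        return total;
    }

    int main(void)
    {
        struct io_stat ctx[NR_CTX] = { { 0, 0 } };

        ctx_add(&ctx[0], 120);
        ctx_add(&ctx[3], 80);
        struct io_stat t = device_stat(ctx);
        printf("%lu completions, %llu ns total\n", t.nr, t.sum);
        return 0;
    }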
     

09 Nov, 2016

1 commit

  • This will allow SCSI to have a single blk_mq_ops structure that either
    lets the LLDD map the queues to PCIe MSI-X vectors or use the default.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Jens Axboe
    Signed-off-by: Martin K. Petersen

    Christoph Hellwig
     

03 Nov, 2016

1 commit

  • Multiple functions test the BLK_MQ_S_STOPPED bit so introduce
    a helper function that performs this test.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Bart Van Assche
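
    The helper pattern reads roughly like this stand-alone sketch (not a
    copy of the kernel header):

    #include <stdbool.h>
    #include <stdio.h>

    #define BLK_MQ_S_STOPPED 0

    struct hw_queue { unsigned long state; };

    /* one named predicate instead of scattered bit tests */
    static inline bool hctx_stopped(const struct hw_queue *hctx)
    {
        return hctx->state & (1UL << BLK_MQ_S_STOPPED);
    }

    int main(void)
    {
        struct hw_queue hctx = { 1UL << BLK_MQ_S_STOPPED };

        printf("stopped: %d\n", hctx_stopped(&hctx));
        return 0;
    }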
     

10 Oct, 2016

2 commits

  • Pull blk-mq CPU hotplug update from Jens Axboe:
    "This is the conversion of blk-mq to the new hotplug state machine"

    * 'for-4.9/block-smp' of git://git.kernel.dk/linux-block:
    blk-mq: fixup "Convert to new hotplug state machine"
    blk-mq: Convert to new hotplug state machine
    blk-mq/cpu-notif: Convert to new hotplug state machine

    Linus Torvalds
     
  • Pull blk-mq irq/cpu mapping updates from Jens Axboe:
    "This is the block-irq topic branch for 4.9-rc. It's mostly from
    Christoph, and it allows drivers to specify their own mappings, and
    more importantly, to share the blk-mq mappings with the IRQ affinity
    mappings. It's a good step towards making this work better out of the
    box"

    * 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
    blk_mq: linux/blk-mq.h does not include all the headers it depends on
    blk-mq: kill unused blk_mq_create_mq_map()
    blk-mq: get rid of the cpumask in struct blk_mq_tags
    nvme: remove the post_scan callout
    nvme: switch to use pci_alloc_irq_vectors
    blk-mq: provide a default queue mapping for PCI device
    blk-mq: allow the driver to pass in a queue mapping
    blk-mq: remove ->map_queue
    blk-mq: only allocate a single mq_map per tag_set
    blk-mq: don't redistribute hardware queues on a CPU hotplug event

    Linus Torvalds
     

17 Sep, 2016

2 commits

  • Allocating your own per-cpu allocation hint separately makes for an
    awkward API. Instead, allocate the per-cpu hint as part of the struct
    sbitmap_queue. There's no point in a struct sbitmap_queue without the
    cache, but you can still use a bare struct sbitmap.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
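
    Why embedding the hint helps can be sketched with a toy 64-tag queue;
    the hint here is a plain field where the kernel uses a per-cpu one, and
    all names are illustrative:

    #include <stdint.h>
    #include <stdio.h>

    struct sb_queue {
        uint64_t bits;       /* toy 64-tag bitmap */
        unsigned int hint;   /* where the last successful get left off */
    };

    static int sb_get(struct sb_queue *sbq)
    {
        for (unsigned int i = 0; i < 64; i++) {
            unsigned int tag = (sbq->hint + i) % 64;

            if (!(sbq->bits & (1ULL << tag))) {
                sbq->bits |= 1ULL << tag;
                sbq->hint = tag + 1;    /* cached for the next get */
                return (int)tag;
            }
        }
        return -1;
    }

    static void sb_clear(struct sb_queue *sbq, unsigned int tag)
    {
        sbq->bits &= ~(1ULL << tag);
    }

    int main(void)
    {
        struct sb_queue sbq = { 0, 0 };

        printf("%d %d\n", sb_get(&sbq), sb_get(&sbq));   /* 0 1 */
        sb_clear(&sbq, 0);
        return 0;
    }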
     
  • This is a generally useful data structure, so make it available to
    anyone else who might want to use it. It's also a nice cleanup
    separating the allocation logic from the rest of the tag handling logic.

    The code is behind a new Kconfig option, CONFIG_SBITMAP, which is only
    selected by CONFIG_BLOCK for now.

    This should be a complete no-op functionality-wise.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

15 Sep, 2016

1 commit

  • This allows drivers to specify their own queue mapping by overriding the
    setup-time function that builds the mq_map. This can be used for
    example to build the map based on the MSI-X vector mapping provided
    by the core interrupt layer for PCI devices.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
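
    The override hook can be pictured with a toy tag set; the names and the
    hard-coded vector affinity below are illustrative:

    #include <stdio.h>

    #define NR_CPUS 4
    #define NR_HW_QUEUES 2

    struct tag_set {
        unsigned int mq_map[NR_CPUS];              /* cpu -> hw queue */
        int (*map_queues)(struct tag_set *set);    /* driver override */
    };

    /* default: spread CPUs evenly across hardware queues */
    static int default_map(struct tag_set *set)
    {
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
            set->mq_map[cpu] = cpu % NR_HW_QUEUES;
        return 0;
    }

    /* a driver override might instead follow MSI-X vector affinity */
    static int driver_map(struct tag_set *set)
    {
        int vector_of_cpu[NR_CPUS] = { 0, 0, 1, 1 }; /* from irq core */

        for (int cpu = 0; cpu < NR_CPUS; cpu++)
            set->mq_map[cpu] = vector_of_cpu[cpu];
        return 0;
    }

    int main(void)
    {
        struct tag_set set = { { 0 }, driver_map };

        (set.map_queues ? set.map_queues : default_map)(&set);
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
            printf("cpu%d -> hctx%u\n", cpu, set.mq_map[cpu]);
        return 0;
    }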