05 Aug, 2019

1 commit

  • Spread queues among present CPUs first, then build the mapping for the
    remaining non-present CPUs.

    This minimizes the number of dead queues, i.e. queues that are mapped
    only by non-present CPUs, and so avoids the bad IO performance caused
    by an unbalanced mapping between present CPUs and queues.

    A similar policy has been applied to managed IRQ affinity.

    Cc: Yi Zhang
    Reported-by: Yi Zhang
    Reviewed-by: Bob Liu
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
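
    The two-pass policy described above can be sketched as follows. This is
    a simplified model, not the kernel code: map_present_first(), NO_QUEUE
    and the plain bool array standing in for the present-CPU mask are all
    hypothetical names, and sibling handling is left out.

```c
#include <stdbool.h>

#define NO_QUEUE ((unsigned int)-1)	/* hypothetical "unmapped" marker */

/*
 * Hypothetical sketch: map[cpu] receives a queue index in [0, nr_queues).
 * present[cpu] says whether a possible CPU is present. nr_queues must be > 0.
 */
void map_present_first(unsigned int *map, const bool *present,
		       unsigned int nr_cpus, unsigned int nr_queues)
{
	unsigned int cpu, q = 0;

	for (cpu = 0; cpu < nr_cpus; cpu++)
		map[cpu] = NO_QUEUE;

	/* Pass 1: hand out one queue per present CPU while queues remain. */
	for (cpu = 0; cpu < nr_cpus && q < nr_queues; cpu++)
		if (present[cpu])
			map[cpu] = q++;

	/* Pass 2: spread every CPU that is still unmapped round-robin. */
	for (cpu = 0; cpu < nr_cpus; cpu++)
		if (map[cpu] == NO_QUEUE)
			map[cpu] = q++ % nr_queues;
}
```

    With 8 possible CPUs of which only CPUs 0-3 are present, and 4 queues,
    each queue gets one present CPU in the first pass, so no queue is
    served by non-present CPUs alone.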
     

01 Jun, 2019

2 commits


01 May, 2019

1 commit


08 Nov, 2018

2 commits

  • Add a queue offset to the tag map. This enables users to map
    iteratively, for each queue map type they support.

    Bump the maximum number of supported maps to 2; we're now fully
    able to support more than 1 map.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Jens Axboe
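
    A minimal sketch of how a per-map queue offset lets a driver build one
    map per queue type, assuming each map type owns a contiguous range of
    hardware queues. The struct and function names here are hypothetical,
    the field names are only modeled on the commit description, and the
    modulo spreading is a placeholder for whatever policy a driver uses.

```c
/* Hypothetical model of one queue map; not the kernel's definition. */
struct sketch_queue_map {
	unsigned int *mq_map;		/* per-CPU hardware queue index */
	unsigned int nr_queues;		/* queues owned by this map type */
	unsigned int queue_offset;	/* first queue owned by this map type */
};

/* Map every CPU into this map type's own range of hardware queues. */
void sketch_map_queues(struct sketch_queue_map *qmap, unsigned int nr_cpus)
{
	unsigned int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++)
		qmap->mq_map[cpu] = qmap->queue_offset +
				    cpu % qmap->nr_queues;
}
```

    Calling this once per map type, with offsets 0 and 2 for two maps of
    two queues each, sends 4 CPUs to queues 0,1,0,1 in the first map and
    2,3,2,3 in the second.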
     
  • This is in preparation for allowing multiple sets of maps per
    queue, if so desired.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Bart Van Assche
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Apr, 2018

1 commit

  • Since commit 4b855ad37194 ("blk-mq: Create hctx for each present CPU"),
    blk-mq doesn't remap queues after the CPU topology changes. That means
    that when some offline CPUs come online, they are still mapped to
    hctx 0, so hctx 0 may become the bottleneck of IO dispatch and
    completion.

    This patch sets up the mapping from the beginning, and aligns it with
    the queue mapping for PCI devices (blk_mq_pci_map_queues()).

    Cc: Stefan Haberland
    Cc: Keith Busch
    Cc: stable@vger.kernel.org
    Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
    Tested-by: Christian Borntraeger
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

25 Jul, 2017

1 commit

  • We already do this for PCI mappings, and the higher level code now
    expects that CPU on/offlining doesn't have an effect on the queue
    mappings.

    Signed-off-by: Christoph Hellwig
    Tested-by: Max Gurtovoy
    Reviewed-by: Max Gurtovoy
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 Jul, 2017

1 commit

  • Pull irq updates from Thomas Gleixner:
    "The irq department delivers:

    - Expand the generic infrastructure handling the irq migration on CPU
    hotplug and convert X86 over to it. (Thomas Gleixner)

    Aside of consolidating code this is a preparatory change for:

    - Finalizing the affinity management for multi-queue devices. The
    main change here is to shut down interrupts which are affine to an
    outgoing CPU and reenable them when the CPU comes online again.
    That avoids moving interrupts pointlessly around and breaking and
    reestablishing affinities for no value. (Christoph Hellwig)

    Note: This also contains the BLOCK-MQ and NVME changes which depend
    on the rework of the irq core infrastructure. Jens acked them and
    agreed that they should go with the irq changes.

    - Consolidation of irq domain code (Marc Zyngier)

    - State tracking consolidation in the core code (Jeffy Chen)

    - Add debug infrastructure for hierarchical irq domains (Thomas
    Gleixner)

    - Infrastructure enhancement for managing generic interrupt chips via
    devmem (Bartosz Golaszewski)

    - Constification work all over the place (Tobias Klauser)

    - Two new interrupt controller drivers for MVEBU (Thomas Petazzoni)

    - The usual set of fixes, updates and enhancements all over the
    place"

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
    irqchip/or1k-pic: Fix interrupt acknowledgement
    irqchip/irq-mvebu-gicp: Allocate enough memory for spi_bitmap
    irqchip/gic-v3: Fix out-of-bound access in gic_set_affinity
    nvme: Allocate queues for all possible CPUs
    blk-mq: Create hctx for each present CPU
    blk-mq: Include all present CPUs in the default queue mapping
    genirq: Avoid unnecessary low level irq function calls
    genirq: Set irq masked state when initializing irq_desc
    genirq/timings: Add infrastructure for estimating the next interrupt arrival time
    genirq/timings: Add infrastructure to track the interrupt timings
    genirq/debugfs: Remove pointless NULL pointer check
    irqchip/gic-v3-its: Don't assume GICv3 hardware supports 16bit INTID
    irqchip/gic-v3-its: Add ACPI NUMA node mapping
    irqchip/gic-v3-its-platform-msi: Make of_device_ids const
    irqchip/gic-v3-its: Make of_device_ids const
    irqchip/irq-mvebu-icu: Add new driver for Marvell ICU
    irqchip/irq-mvebu-gicp: Add new driver for Marvell GICP
    dt-bindings/interrupt-controller: Add DT binding for the Marvell ICU
    genirq/irqdomain: Remove auto-recursive hierarchy support
    irqchip/MSI: Use irq_domain_update_bus_token instead of an open coded access
    ...

    Linus Torvalds
     

29 Jun, 2017

2 commits

  • This patch performs sequential mapping between CPUs and queues.
    If the system has more CPUs than HWQs, there are still CPUs left
    to map to HWQs. On a hyperthreaded system, map the unmapped CPUs
    and their siblings to the same HWQ.
    This fixes a bug that left HWQs unmapped on a system with
    2 sockets, 18 cores per socket, 2 threads per core (72 CPUs in total)
    running NVMEoF (which opens up to a maximum of 64 HWQs).

    Performance results running fio (72 jobs, 128 iodepth)
    using null_blk (w/w.o patch):

    bs     IOPS(read submit_queues=72)  IOPS(write submit_queues=72)  IOPS(read submit_queues=24)  IOPS(write submit_queues=24)
    -----  ---------------------------  ----------------------------  ---------------------------  ----------------------------
    512    4890.4K/4723.5K              4524.7K/4324.2K               4280.2K/4264.3K              3902.4K/3909.5K
    1k     4910.1K/4715.2K              4535.8K/4309.6K               4296.7K/4269.1K              3906.8K/3914.9K
    2k     4906.3K/4739.7K              4526.7K/4330.6K               4301.1K/4262.4K              3890.8K/3900.1K
    4k     4918.6K/4730.7K              4556.1K/4343.6K               4297.6K/4264.5K              3886.9K/3893.9K
    8k     4906.4K/4748.9K              4550.9K/4346.7K               4283.2K/4268.8K              3863.4K/3858.2K
    16k    4903.8K/4782.6K              4501.5K/4233.9K               4292.3K/4282.3K              3773.1K/3773.5K
    32k    4885.8K/4782.4K              4365.9K/4184.2K               4307.5K/4289.4K              3780.3K/3687.3K
    64k    4822.5K/4762.7K              2752.8K/2675.1K               4308.8K/4312.3K              2651.5K/2655.7K
    128k   2388.5K/2313.8K              1391.9K/1375.7K               2142.8K/2152.2K              1395.5K/1374.2K

    Signed-off-by: Max Gurtovoy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Max Gurtovoy
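
    The sequential-plus-siblings policy described above might look roughly
    like this. It is only a sketch: map_with_siblings() and the
    first_sibling[] array (the lowest-numbered thread of each CPU's core,
    with first_sibling[cpu] <= cpu) are hypothetical stand-ins for the
    topology information the kernel would consult.

```c
/*
 * Hypothetical sketch. The first nr_queues CPUs are mapped sequentially.
 * Each later CPU either starts a new sequential slot (if it is the first
 * thread of its core) or reuses the queue of its first sibling.
 * nr_queues must be > 0.
 */
void map_with_siblings(unsigned int *map, const unsigned int *first_sibling,
		       unsigned int nr_cpus, unsigned int nr_queues)
{
	unsigned int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		if (cpu < nr_queues || first_sibling[cpu] == cpu)
			map[cpu] = cpu % nr_queues;
		else
			map[cpu] = map[first_sibling[cpu]];
	}
}
```

    For 8 CPUs whose sibling pairs are (0,4), (1,5), (2,6), (3,7) and 3
    queues, every queue is used and both threads of a core always land on
    the same queue.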
     
  • This way we get a nice distribution independent of the current cpu
    online / offline state.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Cc: Keith Busch
    Cc: linux-block@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Link: http://lkml.kernel.org/r/20170626102058.10200-2-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Christoph Hellwig
     

09 Nov, 2016

1 commit

  • This will allow SCSI to have a single blk_mq_ops structure that either
    lets the LLDD map the queues to PCIe MSIx vectors or use the default.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Jens Axboe
    Signed-off-by: Martin K. Petersen

    Christoph Hellwig
     

15 Sep, 2016

1 commit

  • This allows drivers to specify their own queue mapping by overriding the
    setup-time function that builds the mq_map. This can be used for
    example to build the map based on the MSI-X vector mapping provided
    by the core interrupt layer for PCI devices.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 Dec, 2015

1 commit

  • On architectures like powerpc, we can have CPUs without any local memory
    attached to them (a.k.a. memoryless nodes). In such cases the cpu to node
    mapping can result in memory allocation hints for block hctx->numa_node
    being populated with node values which do not have real memory.

    Instead use local_memory_node(), which is guaranteed to have memory.
    local_memory_node() is a noop on other architectures that do not support
    memoryless nodes.

    Signed-off-by: Raghavendra K T
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Raghavendra K T
     

30 Sep, 2015

1 commit

  • Notifier callbacks for the CPU_ONLINE action can run on a CPU other
    than the one which was just onlined. So it is possible for a
    process running on the just onlined CPU to insert a request and run
    the hw queue before the new mapping is established, which is done by
    blk_mq_queue_reinit_notify().

    This can cause a problem when the CPU has just been onlined for the
    first time since the request queue was initialized. At this time
    ctx->index_hw for the CPU, which is the index in hctx->ctxs[] for this
    ctx, is still zero before blk_mq_queue_reinit_notify() is called by the
    notifier callbacks for the CPU_ONLINE action.

    For example, there is a single hw queue (hctx) and two CPU queues
    (ctx0 for CPU0, and ctx1 for CPU1). Now CPU1 is just onlined, and
    a request is inserted into ctx1->rq_list and sets bit0 in the pending
    bitmap as ctx1->index_hw is still zero.

    Then, while running the hw queue, flush_busy_ctxs() finds bit0 is set
    in the pending bitmap and tries to retrieve requests from
    hctx->ctxs[0]->rq_list. But hctx->ctxs[0] is a pointer to ctx0, so the
    request in ctx1->rq_list is ignored.

    Fix it by ensuring that the new mapping is established before the
    onlined cpu starts running.

    Signed-off-by: Akinobu Mita
    Reviewed-by: Ming Lei
    Cc: Jens Axboe
    Cc: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Akinobu Mita
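
    The failure mode in the example above can be reproduced with a toy
    model. Everything below is a simplified, hypothetical stand-in for the
    real structures: requests are reduced to a counter, the hctx has a
    fixed two-slot ctxs[] array, and the function names are invented.

```c
/* Hypothetical per-CPU software queue: requests are just counted. */
struct sketch_ctx {
	unsigned int index_hw;		/* slot of this ctx in hctx->ctxs[] */
	unsigned int nr_queued;		/* requests waiting in this ctx */
};

/* Hypothetical hardware queue with a pending bitmap over its ctxs. */
struct sketch_hctx {
	struct sketch_ctx *ctxs[2];
	unsigned long pending;
};

/* Queue one request and mark the ctx pending using its index_hw. */
void sketch_insert_request(struct sketch_hctx *h, struct sketch_ctx *c)
{
	c->nr_queued++;
	h->pending |= 1UL << c->index_hw;
}

/* Drain every ctx whose pending bit is set; return requests found. */
unsigned int sketch_flush_busy(struct sketch_hctx *h)
{
	unsigned int bit, flushed = 0;

	for (bit = 0; bit < 2; bit++) {
		if (!(h->pending & (1UL << bit)))
			continue;
		h->pending &= ~(1UL << bit);
		flushed += h->ctxs[bit]->nr_queued;
		h->ctxs[bit]->nr_queued = 0;
	}
	return flushed;
}
```

    If ctx1 still has index_hw == 0 when a request is inserted, the flush
    looks at ctxs[0] (ctx0), finds nothing, and the request stays stranded
    in ctx1. Once index_hw is set to 1 before insertion, the flush finds it.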
     

27 May, 2015

1 commit

  • Rename topology_thread_cpumask() to topology_sibling_cpumask()
    for more consistency with scheduler code.

    Signed-off-by: Bartosz Golaszewski
    Reviewed-by: Thomas Gleixner
    Acked-by: Russell King
    Acked-by: Catalin Marinas
    Cc: Benoit Cousson
    Cc: Fenghua Yu
    Cc: Guenter Roeck
    Cc: Jean Delvare
    Cc: Jonathan Corbet
    Cc: Linus Torvalds
    Cc: Oleg Drokin
    Cc: Peter Zijlstra
    Cc: Rafael J. Wysocki
    Cc: Russell King
    Cc: Viresh Kumar
    Link: http://lkml.kernel.org/r/1432645896-12588-2-git-send-email-bgolaszewski@baylibre.com
    Signed-off-by: Ingo Molnar

    Bartosz Golaszewski
     

10 Dec, 2014

1 commit

  • Suppose that a system has two CPU sockets, three cores per socket,
    that it does not support hyperthreading and that four hardware
    queues are provided by a block driver. With the current algorithm
    this will lead to the following assignment of CPU cores to hardware
    queues:

    HWQ 0: 0 1
    HWQ 1: 2 3
    HWQ 2: 4 5
    HWQ 3: (none)

    This patch changes the queue assignment into:

    HWQ 0: 0 1
    HWQ 1: 2
    HWQ 2: 3 4
    HWQ 3: 5

    In other words, this patch has the following three effects:
    - All four hardware queues are used instead of only three.
    - CPU cores are spread more evenly over hardware queues. For the
    above example the range of the number of CPU cores associated
    with a single HWQ is reduced from [0..2] to [1..2].
    - If the number of HWQs is a multiple of the number of CPU sockets
    it is now guaranteed that all CPU cores associated with a single
    HWQ reside on the same CPU socket.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Sagi Grimberg
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Alexander Gordeev
    Signed-off-by: Jens Axboe

    Bart Van Assche
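
    One pair of formulas that reproduces the two assignment tables above
    (6 CPU cores, 4 hardware queues) is shown here. The function names are
    hypothetical and this is a sketch chosen to match the example, not
    necessarily the exact code of the patch.

```c
/* "Before" table: fixed groups of ceil(nr_cpus / nr_queues) CPUs. */
unsigned int old_cpu_to_queue(unsigned int nr_cpus, unsigned int nr_queues,
			      unsigned int cpu)
{
	return cpu / ((nr_cpus + nr_queues - 1) / nr_queues);
}

/* "After" table: scale the CPU index proportionally into the queue range. */
unsigned int new_cpu_to_queue(unsigned int nr_cpus, unsigned int nr_queues,
			      unsigned int cpu)
{
	return cpu * nr_queues / nr_cpus;
}
```

    With 6 CPUs and 4 queues the first formula yields 0 0 1 1 2 2 and
    leaves HWQ 3 empty, while the second yields 0 0 1 2 2 3 and uses all
    four queues.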
     

25 Nov, 2014

1 commit


29 May, 2014

1 commit


28 May, 2014

1 commit


16 Apr, 2014

1 commit


21 Mar, 2014

1 commit


25 Oct, 2013

1 commit

  • Linux currently has two models for block devices:

    - The classic request_fn based approach, where drivers use struct
    request units for IO. The block layer provides various helper
    functionalities to let drivers share code, things like tag
    management, timeout handling, queueing, etc.

    - The "stacked" approach, where a driver squeezes in between the
    block layer and IO submitter. Since this bypasses the IO stack,
    drivers generally have to manage everything themselves.

    With drivers being written for new high IOPS devices, the classic
    request_fn based driver doesn't work well enough. The design dates
    back to when both SMP and high IOPS were rare. It has problems with
    scaling to bigger machines, and runs into scaling issues even on
    smaller machines when you have IOPS in the hundreds of thousands
    per device.

    The stacked approach is then most often selected as the model
    for the driver. But this means that everybody has to re-invent
    everything, and along with that we get all the problems again
    that the shared approach solved.

    This commit introduces blk-mq, block multi queue support. The
    design is centered around per-cpu queues for queueing IO, which
    then funnel down into x number of hardware submission queues.
    We might have a 1:1 mapping between the two, or it might be
    an N:M mapping. That all depends on what the hardware supports.

    blk-mq provides various helper functions, which include:

    - Scalable support for request tagging. Most devices need to
    be able to uniquely identify a request both in the driver and
    to the hardware. The tagging uses per-cpu caches for freed
    tags, to enable cache hot reuse.

    - Timeout handling without tracking requests on a per-device
    basis. Basically the driver should be able to get a notification,
    if a request happens to fail.

    - Optional support for non 1:1 mappings between issue and
    submission queues. blk-mq can redirect IO completions to the
    desired location.

    - Support for per-request payloads. Drivers almost always need
    to associate a request structure with some driver private
    command structure. Drivers can tell blk-mq this at init time,
    and then any request handed to the driver will have the
    required size of memory associated with it.

    - Support for merging of IO, and plugging. The stacked model
    gets neither of these. Even for high IOPS devices, merging
    sequential IO reduces per-command overhead and thus
    increases bandwidth.

    For now, this is provided as a potential 3rd queueing model, with
    the hope being that, as it matures, it can replace both the classic
    and stacked model. That would get us back to having just 1 real
    model for block devices, leaving the stacked approach to dm/md
    devices (as it was originally intended).

    Contributions in this patch from the following people:

    Shaohua Li
    Alexander Gordeev
    Christoph Hellwig
    Mike Christie
    Matias Bjorling
    Jeff Moyer

    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe