15 Sep, 2016

1 commit

  • This allows drivers to specify their own queue mapping by overriding
    the setup-time function that builds the mq_map. This can be used, for
    example, to build the map based on the MSI-X vector mapping provided
    by the core interrupt layer for PCI devices.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe
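
    The idea can be pictured with a small userspace sketch (Python;
    the function name and data shapes are illustrative, not the
    kernel's actual API): build a cpu-to-queue map from the CPU
    affinity of each interrupt vector, so a CPU submits on the queue
    whose completions it will service.

```python
# Hedged sketch: derive a cpu -> hw queue map from the CPU affinity
# of each MSI-X vector, as a driver-supplied mapping might. Names and
# shapes here are illustrative, not the kernel's actual API.
def build_mq_map(nr_cpus, vector_affinity):
    """vector_affinity[q] is the set of CPUs whose interrupts are
    steered to vector (and thus hardware queue) q."""
    mq_map = [0] * nr_cpus          # default every CPU to queue 0
    for queue, cpus in enumerate(vector_affinity):
        for cpu in cpus:
            mq_map[cpu] = queue     # submit on the queue whose vector
                                    # will deliver the completion
    return mq_map
```

    For example, build_mq_map(4, [{0, 1}, {2, 3}]) maps CPUs 0-1 to
    queue 0 and CPUs 2-3 to queue 1.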


04 Dec, 2015

1 commit

  • On architectures like powerpc we can have CPUs without any local
    memory attached to them (a.k.a. memoryless nodes). In such cases the
    cpu-to-node mapping can populate the memory allocation hint in block's
    hctx->numa_node with node values that have no real memory.

    Instead use local_memory_node(), which is guaranteed to have memory.
    local_memory_node() is a no-op on architectures that do not support
    memoryless nodes.

    Signed-off-by: Raghavendra K T
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe
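
    The concept behind the fallback can be modeled in a few lines of
    Python (a hedged sketch only: the real kernel helper consults
    precomputed zonelists rather than searching distances like this):
    if the home node is memoryless, pick the closest node that does
    have memory.

```python
# Hedged model of the idea behind local_memory_node(): if the CPU's
# home node has no memory, fall back to the closest node that does.
# The real kernel helper uses precomputed zonelists; this is only the
# concept, with illustrative data structures.
def local_memory_node(node, distances, nodes_with_memory):
    """distances[a][b] is the NUMA distance from node a to node b."""
    if node in nodes_with_memory:
        return node                 # no-op when the node has memory
    return min(nodes_with_memory, key=lambda n: distances[node][n])
```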


30 Sep, 2015

1 commit

  • Notifier callbacks for the CPU_ONLINE action can run on a CPU other
    than the one that was just onlined. It is therefore possible for a
    process running on the just-onlined CPU to insert a request and run a
    hw queue before the new mapping is established by
    blk_mq_queue_reinit_notify().

    This can cause a problem when the CPU is onlined for the first time
    since the request queue was initialized. At that point ctx->index_hw
    for the CPU, which is the index in hctx->ctxs[] for this ctx, is still
    zero before blk_mq_queue_reinit_notify() is called by the notifier
    callbacks for the CPU_ONLINE action.

    For example, suppose there is a single hw queue (hctx) and two CPU
    queues (ctx0 for CPU0 and ctx1 for CPU1). Now CPU1 is just onlined, a
    request is inserted into ctx1->rq_list, and bit 0 is set in the
    pending bitmap because ctx1->index_hw is still zero.

    Then, while running the hw queue, flush_busy_ctxs() finds bit 0 set in
    the pending bitmap and tries to retrieve requests from
    hctx->ctxs[0]->rq_list. But hctx->ctxs[0] is a pointer to ctx0, so the
    request in ctx1->rq_list is ignored.

    Fix it by ensuring that the new mapping is established before the
    onlined CPU starts running.

    Signed-off-by: Akinobu Mita
    Reviewed-by: Ming Lei
    Cc: Jens Axboe
    Cc: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
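
    The race can be demonstrated with a toy Python model (userspace
    code with illustrative names, not the kernel implementation): a
    stale index_hw of zero makes the flush look in the wrong software
    queue, stranding the request.

```python
# Toy model of the race described above. hctx.ctxs[] maps
# index_hw -> software queue; a stale index_hw of zero makes
# flush_busy_ctxs() drain the wrong software queue.
class Ctx:
    def __init__(self, index_hw):
        self.index_hw = index_hw
        self.rq_list = []

def insert_request(ctx, rq, pending):
    ctx.rq_list.append(rq)
    pending.add(ctx.index_hw)       # sets the bit for the stale index

def flush_busy_ctxs(hctx_ctxs, pending):
    flushed = []
    for i in sorted(pending):
        flushed += hctx_ctxs[i].rq_list
        hctx_ctxs[i].rq_list.clear()
    pending.clear()
    return flushed

ctx0, ctx1 = Ctx(0), Ctx(0)         # ctx1.index_hw should be 1 after reinit
hctx_ctxs = [ctx0, ctx1]
pending = set()
insert_request(ctx1, "rq", pending) # bit 0 set, but the request is in ctx1
lost = flush_busy_ctxs(hctx_ctxs, pending)
# the flush looked at hctx_ctxs[0] (ctx0), so "rq" is stranded in ctx1
```

    Updating ctx1.index_hw to 1 before the insert (the effect of
    establishing the mapping first) makes the flush find the request.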


27 May, 2015

1 commit

  • Rename topology_thread_cpumask() to topology_sibling_cpumask()
    for more consistency with scheduler code.

    Signed-off-by: Bartosz Golaszewski
    Reviewed-by: Thomas Gleixner
    Acked-by: Russell King
    Acked-by: Catalin Marinas
    Cc: Benoit Cousson
    Cc: Fenghua Yu
    Cc: Guenter Roeck
    Cc: Jean Delvare
    Cc: Jonathan Corbet
    Cc: Linus Torvalds
    Cc: Oleg Drokin
    Cc: Peter Zijlstra
    Cc: Rafael J. Wysocki
    Cc: Russell King
    Cc: Viresh Kumar
    Link: http://lkml.kernel.org/r/1432645896-12588-2-git-send-email-bgolaszewski@baylibre.com
    Signed-off-by: Ingo Molnar


10 Dec, 2014

1 commit

  • Suppose that a system has two CPU sockets, three cores per socket,
    that it does not support hyperthreading and that four hardware
    queues are provided by a block driver. With the current algorithm
    this will lead to the following assignment of CPU cores to hardware
    queues:

    HWQ 0: 0 1
    HWQ 1: 2 3
    HWQ 2: 4 5
    HWQ 3: (none)

    This patch changes the queue assignment into:

    HWQ 0: 0 1
    HWQ 1: 2
    HWQ 2: 3 4
    HWQ 3: 5

    In other words, this patch has the following three effects:
    - All four hardware queues are used instead of only three.
    - CPU cores are spread more evenly over hardware queues. For the
      above example the range of the number of CPU cores associated
      with a single HWQ is reduced from [0..2] to [1..2].
    - If the number of HWQs is a multiple of the number of CPU sockets,
      it is now guaranteed that all CPU cores associated with a single
      HWQ reside on the same CPU socket.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Sagi Grimberg
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Alexander Gordeev
    Signed-off-by: Jens Axboe
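
    The new assignment can be reproduced with a short Python sketch
    (illustrative names only, and assuming, per the last point above,
    that the number of queues divides evenly across sockets): split
    the queues across sockets, then spread each socket's cores over
    that socket's share of queues.

```python
# Hedged sketch of the improved assignment: split the hardware queues
# evenly across sockets, then spread each socket's cores over that
# socket's share of queues. Illustrative only, not the kernel code.
def assign_queues(sockets, nr_queues):
    """sockets is a list of per-socket CPU core lists; assumes
    nr_queues is a multiple of the number of sockets."""
    per_socket = nr_queues // len(sockets)
    mapping = {q: [] for q in range(nr_queues)}
    for s, cores in enumerate(sockets):
        base = s * per_socket
        for i, cpu in enumerate(cores):
            # i * per_socket // len(cores) spreads the cores evenly,
            # giving the earlier queues the larger groups
            mapping[base + i * per_socket // len(cores)].append(cpu)
    return mapping
```

    Running it on the commit's example, two sockets of three cores
    each and four queues, yields {0: [0, 1], 1: [2], 2: [3, 4],
    3: [5]}, matching the new table above.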


25 Nov, 2014

1 commit


29 May, 2014

1 commit


28 May, 2014

1 commit


16 Apr, 2014

1 commit


21 Mar, 2014

1 commit


25 Oct, 2013

1 commit

  • Linux currently has two models for block devices:

    - The classic request_fn based approach, where drivers use struct
      request units for IO. The block layer provides various helper
      functionality to let drivers share code: things like tag
      management, timeout handling, queueing, etc.

    - The "stacked" approach, where a driver squeezes in between the
      block layer and the IO submitter. Since this bypasses the IO stack,
      drivers generally have to manage everything themselves.

    With drivers being written for new high-IOPS devices, the classic
    request_fn based model no longer works well enough. The design dates
    back to when both SMP and high IOPS were rare. It has problems
    scaling to bigger machines, and runs into scaling issues even on
    smaller machines once you have IOPS in the hundreds of thousands
    per device.

    The stacked approach is then most often selected as the model
    for the driver. But this means that everybody has to re-invent
    everything, and along with that we get all the problems again
    that the shared approach solved.

    This commit introduces blk-mq, block multi-queue support. The
    design is centered around per-cpu queues for queueing IO, which
    then funnel down into a number of hardware submission queues.
    We might have a 1:1 mapping between the two, or it might be
    an N:M mapping. That all depends on what the hardware supports.

    blk-mq provides various helper functions, which include:

    - Scalable support for request tagging. Most devices need to
      be able to uniquely identify a request both in the driver and
      to the hardware. The tagging uses per-cpu caches for freed
      tags, to enable cache-hot reuse.

    - Timeout handling without tracking requests on a per-device
      basis. Basically, the driver should be able to get a notification
      if a request happens to fail.

    - Optional support for non-1:1 mappings between issue and
      submission queues. blk-mq can redirect IO completions to the
      desired location.

    - Support for per-request payloads. Drivers almost always need
      to associate a request structure with some driver-private
      command structure. Drivers can tell blk-mq about this at init
      time, and then any request handed to the driver will have the
      required amount of memory associated with it.

    - Support for merging of IO, and for plugging. The stacked model
      gets neither of these. Even for high-IOPS devices, merging
      sequential IO reduces per-command overhead and thus
      increases bandwidth.

    For now, this is provided as a potential 3rd queueing model, with
    the hope being that, as it matures, it can replace both the classic
    and stacked model. That would get us back to having just 1 real
    model for block devices, leaving the stacked approach to dm/md
    devices (as it was originally intended).

    Contributions in this patch from the following people:

    Shaohua Li
    Alexander Gordeev
    Christoph Hellwig
    Mike Christie
    Matias Bjorling
    Jeff Moyer

    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
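
    The funneling described above can be pictured with a toy Python
    model (illustrative names only, not the kernel's API): several
    per-CPU software queues drain into fewer hardware queues through
    a cpu-to-queue map.

```python
# Toy model of the blk-mq shape described above: per-CPU software
# queues funnel into a smaller set of hardware queues through a
# cpu -> hw-queue map (here 4 CPUs onto 2 hw queues, an N:M mapping).
NR_CPUS, NR_HW_QUEUES = 4, 2
mq_map = [0, 0, 1, 1]                      # cpu -> hardware queue
ctxs = [[] for _ in range(NR_CPUS)]        # per-CPU software queues
hctxs = [[] for _ in range(NR_HW_QUEUES)]  # hardware dispatch lists

def submit(cpu, rq):
    ctxs[cpu].append(rq)                   # queue on the submitting CPU

def run_hw_queue(q):
    # drain every software queue mapped to hardware queue q
    for cpu in range(NR_CPUS):
        if mq_map[cpu] == q:
            hctxs[q] += ctxs[cpu]
            ctxs[cpu].clear()

submit(0, "r0"); submit(3, "r3")
run_hw_queue(0); run_hw_queue(1)
```

    With a different mq_map the same machinery gives a 1:1 mapping
    instead; the funnel shape is entirely determined by the map.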
