10 Oct, 2016

2 commits

  • Pull blk-mq CPU hotplug update from Jens Axboe:
    "This is the conversion of blk-mq to the new hotplug state machine"

    * 'for-4.9/block-smp' of git://git.kernel.dk/linux-block:
    blk-mq: fixup "Convert to new hotplug state machine"
    blk-mq: Convert to new hotplug state machine
    blk-mq/cpu-notif: Convert to new hotplug state machine

    Linus Torvalds
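
    As a hedged sketch of the registration pattern the conversion moves to
    (the CPUHP_BLK_MQ_DEAD state comes from this series, but the callback
    body and name string here are schematic, not blk-mq's actual code):

        #include <linux/cpuhotplug.h>

        /* Multi-instance teardown callback in the new-style state machine. */
        static int foo_cpu_dead(unsigned int cpu, struct hlist_node *node)
        {
                /* e.g. migrate pending work off the dead CPU's sw queue */
                return 0;
        }

        static int foo_register_hotplug(void)
        {
                /* Instances (one per hctx) are added separately later. */
                return cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD,
                                               "block/mq:dead", NULL,
                                               foo_cpu_dead);
        }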
     
  • Pull blk-mq irq/cpu mapping updates from Jens Axboe:
    "This is the block-irq topic branch for 4.9-rc. It's mostly from
    Christoph, and it allows drivers to specify their own mappings, and
    more importantly, to share the blk-mq mappings with the IRQ affinity
    mappings. It's a good step towards making this work better out of the
    box"

    * 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
    blk_mq: linux/blk-mq.h does not include all the headers it depends on
    blk-mq: kill unused blk_mq_create_mq_map()
    blk-mq: get rid of the cpumask in struct blk_mq_tags
    nvme: remove the post_scan callout
    nvme: switch to use pci_alloc_irq_vectors
    blk-mq: provide a default queue mapping for PCI device
    blk-mq: allow the driver to pass in a queue mapping
    blk-mq: remove ->map_queue
    blk-mq: only allocate a single mq_map per tag_set
    blk-mq: don't redistribute hardware queues on a CPU hotplug event

    Linus Torvalds
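
    For orientation, a minimal sketch of how a PCI driver can tie these
    pieces together, assuming the blk_mq_pci_map_queues() helper and the
    .map_queues hook added in this branch; the foo_* names and the
    driver_data layout are hypothetical:

        #include <linux/blk-mq.h>
        #include <linux/blk-mq-pci.h>
        #include <linux/pci.h>

        static int foo_map_queues(struct blk_mq_tag_set *set)
        {
                struct pci_dev *pdev = set->driver_data;

                /* Map each hw queue to the CPUs its MSI-X vector targets. */
                return blk_mq_pci_map_queues(set, pdev);
        }

        static int foo_init_irqs(struct pci_dev *pdev, unsigned int nr_queues)
        {
                /* One vector per queue, spread across the CPUs by the core. */
                return pci_alloc_irq_vectors(pdev, 1, nr_queues,
                                             PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
        }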
     

17 Sep, 2016

1 commit

  • This is a generally useful data structure, so make it available to
    anyone else who might want to use it. It's also a nice cleanup
    separating the allocation logic from the rest of the tag handling logic.

    The code is behind a new Kconfig option, CONFIG_SBITMAP, which is only
    selected by CONFIG_BLOCK for now.

    This should be a complete noop functionality-wise.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
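
    For reference, a hedged sketch of the core sbitmap calls this makes
    available (signatures as of this series; the demo function itself is
    illustrative):

        #include <linux/sbitmap.h>

        static int foo_sbitmap_demo(void)
        {
                struct sbitmap sb;
                int nr;

                /* 128 bits; a negative shift picks the word size for us. */
                if (sbitmap_init_node(&sb, 128, -1, GFP_KERNEL, NUMA_NO_NODE))
                        return -ENOMEM;

                /* Find and set a free bit; returns -1 if the map is full. */
                nr = sbitmap_get(&sb, 0, false);
                if (nr >= 0)
                        sbitmap_clear_bit(&sb, nr);

                sbitmap_free(&sb);
                return 0;
        }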
     

15 Sep, 2016

5 commits

  • Unused now that NVMe sets up irq affinity before calling into blk-mq.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This allows drivers to specify their own queue mapping by overriding
    the setup-time function that builds the mq_map. This can be used, for
    example, to build the map based on the MSI-X vector mapping provided
    by the core interrupt layer for PCI devices.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
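
    A minimal sketch of what such an override can look like, assuming the
    .map_queues hook and the per-set mq_map from this series; the naive
    modulo spread and the foo_ prefix are illustrative only:

        static int foo_map_queues(struct blk_mq_tag_set *set)
        {
                unsigned int cpu;

                /* mq_map is indexed by CPU and holds a hw queue index. */
                for_each_possible_cpu(cpu)
                        set->mq_map[cpu] = cpu % set->nr_hw_queues;
                return 0;
        }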
     
  • All drivers use the default, so provide an inline version of it. If we
    ever need another queue mapping we can add an optional method back,
    although supporting it will also require major changes to the queue
    setup code.

    This provides better code generation, and better debuggability as well.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • The mapping is identical for all queues in a tag_set, so stop wasting
    memory building multiple copies of it. Note that for now I've kept the
    mq_map pointer in the request_queue, but we'll need to investigate
    whether we can remove it without suffering too much from the
    additional pointer chasing.
    The same would apply to the mq_ops pointer as well.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • blk_mq_delay_kick_requeue_list() provides the ability to kick the
    q->requeue_list after a specified time. To do this the request_queue's
    'requeue_work' member was changed to a delayed_work.

    blk_mq_delay_kick_requeue_list() allows DM to defer processing requeued
    requests while it doesn't make sense to immediately requeue them
    (e.g. when all paths in a DM multipath have failed).

    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
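
    A hedged sketch of how a driver like DM multipath might use this,
    assuming the one-argument blk_mq_requeue_request() of this era and the
    milliseconds-based delay parameter; the 5000 ms value is illustrative:

        static void foo_requeue_later(struct request *rq)
        {
                blk_mq_requeue_request(rq);
                /* Process the requeue list after 5 seconds, not at once. */
                blk_mq_delay_kick_requeue_list(rq->q, 5000);
        }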
     

14 Sep, 2016

2 commits

  • blk_mq_alloc_single_hw_queue() is a prototype artifact that should
    have been removed with
    commit cdef54dd85ad66e77262ea57796a3e81683dd5d6
    ("blk-mq: remove alloc_hctx and free_hctx methods"), where the last
    users of it were deleted.

    Fixes: cdef54dd85ad ("blk-mq: remove alloc_hctx and free_hctx methods")
    Signed-off-by: Linus Walleij
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Linus Walleij
     
  • In order to help determine the effectiveness of polling in a running
    system, it is useful to determine the ratio of how often the poll
    function is called vs. how often the completion is checked. For this
    reason we add a poll_considered variable and add it to the sysfs entry
    for io_poll.

    Signed-off-by: Stephen Bates
    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Stephen Bates
     

29 Aug, 2016

2 commits

  • Various cache line optimizations:

    - Move delay_work towards the end. It's huge, and we don't use it
    a lot (only SCSI).

    - Move the atomic state into the same cacheline as the dispatch
    list and lock.

    - Rearrange a few members to pack it better.

    - Shrink the max-order for dispatch accounting from 10 to 7. This
    means that ->dispatched[] and ->run now take up their own
    cacheline.

    This shrinks struct blk_mq_hw_ctx down to 8 cachelines.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We don't need the larger delayed work struct, since we always run it
    immediately.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

08 Jul, 2016

1 commit

  • The new nvme-rdma driver will need to reinitialize all the tags as part of
    the error recovery procedure (realloc the tag memory region). Add a helper
    in blk-mq for it that can iterate over all requests in a tagset to make
    this easier.

    Signed-off-by: Sagi Grimberg
    Tested-by: Ming Lin
    Reviewed-by: Stephen Bates
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Steve Wise
    Tested-by: Steve Wise
    Signed-off-by: Jens Axboe

    Sagi Grimberg
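
    A hedged sketch of the intended use, assuming the helper is exposed as
    blk_mq_reinit_tagset() driving a per-request reinit_request op; the
    foo_* names are hypothetical:

        static int foo_reinit_request(void *data, struct request *rq)
        {
                /* Repoint per-request state at the reallocated tag memory. */
                return 0;
        }

        static const struct blk_mq_ops foo_mq_ops = {
                .reinit_request = foo_reinit_request,
        };

        static int foo_error_recovery(struct blk_mq_tag_set *set)
        {
                /* Invokes ->reinit_request for every request in the set. */
                return blk_mq_reinit_tagset(set);
        }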
     

06 Jul, 2016

1 commit

  • For some protocols like NVMe over Fabrics we need to be able to send
    initialization commands to a specific queue.

    Based on an earlier patch from Christoph Hellwig.

    Signed-off-by: Ming Lin
    [hch: disallow sleeping allocation, req_op fixes]
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Ming Lin
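
    A minimal sketch of allocating a request on a specific hardware queue,
    assuming the blk_mq_alloc_request_hctx() signature of this era; note
    BLK_MQ_REQ_NOWAIT, matching the "disallow sleeping allocation" note
    above:

        static struct request *foo_alloc_connect_cmd(struct request_queue *q,
                                                     unsigned int hctx_idx)
        {
                /* Returns an ERR_PTR instead of sleeping to wait for a tag. */
                return blk_mq_alloc_request_hctx(q, WRITE, BLK_MQ_REQ_NOWAIT,
                                                 hctx_idx);
        }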
     

20 Mar, 2016

1 commit

  • queue_for_each_ctx() iterates over per_cpu variables under the
    assumption that the possible cpu mask cannot have holes. That's wrong,
    as any cpumask can have holes. In case there are holes the iteration
    ends up accessing uninitialized memory and crashing as a result.

    Replace the macro by a proper for_each_possible_cpu() loop and drop the unused
    macro blk_ctx_sum() which references queue_for_each_ctx().

    Reported-by: Xiong Zhou
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Jens Axboe

    Thomas Gleixner
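
    The replacement pattern, sketched (the surrounding function is
    schematic):

        static void foo_walk_sw_queues(struct request_queue *q)
        {
                struct blk_mq_ctx *ctx;
                unsigned int i;

                /*
                 * Safe even when the possible mask has holes: per-cpu
                 * storage exists for every possible CPU.
                 */
                for_each_possible_cpu(i) {
                        ctx = per_cpu_ptr(q->queue_ctx, i);
                        /* ... operate on ctx ... */
                }
        }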
     

10 Feb, 2016

1 commit

  • The hardware's provided queue count may change at runtime with resource
    provisioning. This patch allows a block driver to alter the number of
    h/w queues available when its resource count changes.

    The main part is a new blk-mq API to request a new number of h/w queues
    for a given live tag set. The new API freezes all queues using that set,
    then adjusts the allocated count prior to remapping these to CPUs.

    The bulk of the rest just shifts where h/w contexts and all their
    artifacts are allocated and freed.

    The maximum number of h/w contexts is capped at the number of possible
    CPUs, since there is no use for more than that. As such, all
    pre-allocated memory for pointers needs to account for the maximum
    possible rather than the initial number of queues.

    A side effect of this is that blk-mq will proceed successfully as
    long as it can allocate at least one h/w context. Previously it would
    fail request queue initialization if fewer than the requested number
    were allocated.

    Signed-off-by: Keith Busch
    Reviewed-by: Christoph Hellwig
    Tested-by: Jon Derrick
    Signed-off-by: Jens Axboe

    Keith Busch
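
    A hedged sketch of the new API from a driver's perspective; struct
    foo_ctrl stands in for whatever structure owns the tag set:

        struct foo_ctrl {
                struct blk_mq_tag_set tagset;
        };

        static void foo_resources_changed(struct foo_ctrl *ctrl, int count)
        {
                /* Freezes all queues on the set, resizes, remaps to CPUs. */
                blk_mq_update_nr_hw_queues(&ctrl->tagset, count);
        }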
     

08 Nov, 2015

1 commit

  • Add basic support for polling for specific IO to complete. This uses
    the cookie that blk-mq passes back, which enables the block layer
    to pass this cookie to the driver to spin for a specific request.

    This will be combined with request latency tracking, so we can make
    qualified decisions about when to poll and when not to. For now, for
    benchmark purposes, we add a sysfs file that controls whether polling
    is enabled or not.

    Signed-off-by: Jens Axboe
    Acked-by: Christoph Hellwig
    Acked-by: Keith Busch

    Jens Axboe
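
    A schematic sketch of the resulting flow, assuming the blk_qc_t cookie
    that submit_bio() returns after this series; foo_bio_done() stands in
    for a caller's own completion check:

        static void foo_wait_polled(struct block_device *bdev,
                                    struct bio *bio, blk_qc_t cookie)
        {
                struct request_queue *q = bdev_get_queue(bdev);

                while (!foo_bio_done(bio)) {
                        /* True if the driver found and completed the request. */
                        if (!blk_poll(q, cookie))
                                io_schedule();
                }
        }

    Whether polling is attempted at all is gated by the io_poll sysfs file
    mentioned above.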
     

22 Oct, 2015

1 commit

  • Allow pmem, and other synchronous/bio-based block drivers, to fall
    back on a per-cpu reference count managed by the core for tracking
    queue live/dead state.

    The existing per-cpu reference count for the blk_mq case is promoted to
    be used in all block i/o scenarios. This involves initializing it by
    default, waiting for it to drop to zero at exit, and holding a live
    reference over the invocation of q->make_request_fn() in
    generic_make_request(). The blk_mq code continues to take its own
    reference per blk_mq request and retains the ability to freeze the
    queue, but the check that the queue is frozen is moved to
    generic_make_request().

    This fixes crash signatures like the following:

    BUG: unable to handle kernel paging request at ffff880140000000
    [..]
    Call Trace:
    [] ? copy_user_handle_tail+0x5f/0x70
    [] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
    [] pmem_make_request+0xd1/0x200 [nd_pmem]
    [] ? mempool_alloc+0x72/0x1a0
    [] generic_make_request+0xd6/0x110
    [] submit_bio+0x76/0x170
    [] submit_bh_wbc+0x12f/0x160
    [] submit_bh+0x12/0x20
    [] jbd2_write_superblock+0x8d/0x170
    [] jbd2_mark_journal_empty+0x5d/0x90
    [] jbd2_journal_destroy+0x24b/0x270
    [] ? put_pwq_unlocked+0x2a/0x30
    [] ? destroy_workqueue+0x225/0x250
    [] ext4_put_super+0x64/0x360
    [] generic_shutdown_super+0x6a/0xf0

    Cc: Jens Axboe
    Cc: Keith Busch
    Cc: Ross Zwisler
    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Tested-by: Ross Zwisler
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
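
    A schematic sketch of the enter/exit pairing this moves into
    generic_make_request(); the gfp-based blk_queue_enter() signature is
    assumed from this era, and error handling is trimmed:

        static void foo_submit(struct request_queue *q, struct bio *bio)
        {
                /* Take a live reference on the queue, or fail if dying. */
                if (blk_queue_enter(q, GFP_KERNEL)) {
                        bio_io_error(bio);
                        return;
                }

                q->make_request_fn(q, bio);
                blk_queue_exit(q);
        }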
     

01 Oct, 2015

2 commits

  • And replace blk_mq_tag_busy_iter with it - the driver use was replaced
    with a new helper a while ago, and internal to the block layer we only
    need the new version.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • blk_mq_complete_request may be a no-op if the request has already
    been completed by other means (e.g. a timeout or cancellation), but
    currently drivers have to set rq->errors before calling
    blk_mq_complete_request, which might leave us with the wrong error
    value.

    Add an error parameter to blk_mq_complete_request so that we can
    defer setting rq->errors until we know we won the race to complete
    the request.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Christoph Hellwig
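
    A minimal sketch of a driver completion path after this change; the
    foo_ wrapper is illustrative:

        static void foo_complete(struct request *rq, int error)
        {
                /*
                 * rq->errors is now set only by the winner of the
                 * completion race, inside blk_mq_complete_request().
                 */
                blk_mq_complete_request(rq, error);
        }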
     

30 Sep, 2015

1 commit

  • There is a race between cpu hotplug handling and adding/deleting
    gendisk for blk-mq, where both are trying to register and unregister
    the same sysfs entries.

    null_add_dev
        --> blk_mq_init_queue
            --> blk_mq_init_allocated_queue
                --> add to 'all_q_list' (*)
        --> add_disk
            --> blk_register_queue
                --> blk_mq_register_disk (++)

    null_del_dev
        --> del_gendisk
            --> blk_unregister_queue
                --> blk_mq_unregister_disk (--)
        --> blk_cleanup_queue
            --> blk_mq_free_queue
                --> del from 'all_q_list' (*)

    blk_mq_queue_reinit
        --> blk_mq_sysfs_unregister (-)
        --> blk_mq_sysfs_register (+)

    While the request queue is on 'all_q_list' (*), blk_mq_queue_reinit()
    can be called for the queue at any time by the CPU hotplug callback.
    But blk_mq_sysfs_unregister (-) and blk_mq_sysfs_register (+) in
    blk_mq_queue_reinit must not be called before blk_mq_register_disk (++)
    or after blk_mq_unregister_disk (--) has finished, because
    '/sys/block/*/mq/' does not exist then.

    There is already a BLK_MQ_F_SYSFS_UP flag in hctx->flags which can be
    used to track this sysfs state, but it only fixes the issue partially.

    In order to fix it completely, we need a per-queue flag instead of a
    per-hctx flag, with appropriate locking. So this introduces
    q->mq_sysfs_init_done, which is properly protected by all_q_mutex.

    Also, we need to ensure that blk_mq_map_swqueue() is called with
    all_q_mutex held. Since hctx->nr_ctx is reset temporarily and updated
    in blk_mq_map_swqueue(), we should avoid blk_mq_register_hctx() seeing
    the temporary hctx->nr_ctx value during CPU hotplug handling or while
    adding/deleting a gendisk.

    Signed-off-by: Akinobu Mita
    Reviewed-by: Ming Lei
    Cc: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Akinobu Mita
     

02 Jun, 2015

1 commit

  • Storage controllers may expose multiple block devices that share hardware
    resources managed by blk-mq. This patch enhances the shared tags so a
    low-level driver can access the shared resources not tied to the unshared
    h/w contexts. This way the LLD can dynamically add and delete disks and
    request queues without having to track all the request_queue hctx's to
    iterate outstanding tags.

    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     

17 Apr, 2015

1 commit

  • Commit 889fa31f00b2 was a bit too eager in reducing the loop count,
    so we ended up missing queues in some configurations. Ensure that
    our division rounds up, so that's not the case.

    Reported-by: Guenter Roeck
    Fixes: 889fa31f00b2 ("blk-mq: reduce unnecessary software queue looping")
    Signed-off-by: Jens Axboe

    Jens Axboe
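
    Schematically, the fix amounts to rounding the division up (the field
    names here are approximate, not lifted from the patch):

        /* A partial final word must still get its own map entry. */
        hctx->ctx_map.map_size = DIV_ROUND_UP(hctx->nr_ctx,
                                              hctx->ctx_map.bits_per_word);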
     

13 Mar, 2015

2 commits

  • Rename blk_mq_run_queues to blk_mq_run_hw_queues, add async argument,
    and export it.

    DM's suspend support must be able to run the queue without starting
    stopped hw queues.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
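
    The resulting call, sketched; async=true keeps the queue runs out of
    the caller's context:

        static void foo_resume(struct request_queue *q)
        {
                /*
                 * Kick all hw queues asynchronously; stopped ones are
                 * left alone, which is what DM's suspend support needs.
                 */
                blk_mq_run_hw_queues(q, true);
        }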
     
  • Add a variant of blk_mq_init_queue that allows a previously allocated
    queue to be initialized. blk_mq_init_allocated_queue models
    blk_init_allocated_queue -- which was also created for DM's use.

    DM's approach to device creation requires a placeholder request_queue be
    allocated for use with alloc_dev() but the decision about what type of
    request_queue will be ultimately created is deferred until all component
    devices referenced in the DM table are processed to determine the table
    type (request-based, blk-mq request-based, or bio-based).

    Also, because of DM's late finalization of the request_queue type, the
    call to blk_mq_register_disk() doesn't happen during alloc_dev().
    blk_mq_register_disk() must therefore be exported so that DM can
    backfill the 'mq' dir once the blk-mq queue is fully allocated.

    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Mike Snitzer
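
    A hedged sketch of the two-step setup this enables (error handling
    trimmed; the split mirrors DM's placeholder-queue flow):

        static int foo_setup_queue(struct blk_mq_tag_set *set,
                                   struct request_queue **out)
        {
                struct request_queue *q;

                /* Allocate a placeholder queue before the type is known. */
                q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE);
                if (!q)
                        return -ENOMEM;

                /* Later, once the table turns out to be blk-mq based: */
                if (IS_ERR(blk_mq_init_allocated_queue(set, q)))
                        return -ENOMEM;

                *out = q;
                return 0;
        }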
     

13 Feb, 2015

1 commit

  • Pull core block IO changes from Jens Axboe:
    "This contains:

    - A series from Christoph that cleans up and refactors various parts
    of the REQ_BLOCK_PC handling. Contributions in that series from
    Dongsu Park and Kent Overstreet as well.

    - CFQ:
        - A bug fix for cfq for realtime IO scheduling from Jeff Moyer.
        - A stable patch fixing a potential crash in CFQ in OOM
          situations. From Konstantin Khlebnikov.

    - blk-mq:
        - Add support for tag allocation policies, from Shaohua. This is
          a prep patch enabling libata (and other SCSI parts) to use the
          blk-mq tagging, instead of rolling their own.
        - Various little tweaks from Keith and Mike, in preparation for
          DM blk-mq support.
        - Minor little fixes or tweaks from me.
        - A double free error fix from Tony Battersby.

    - The partition 4k issue fixes from Matthew and Boaz.

    - Add support for zero+unprovision for blkdev_issue_zeroout() from
    Martin"

    * 'for-3.20/core' of git://git.kernel.dk/linux-block: (27 commits)
    block: remove unused function blk_bio_map_sg
    block: handle the null_mapped flag correctly in blk_rq_map_user_iov
    blk-mq: fix double-free in error path
    block: prevent request-to-request merging with gaps if not allowed
    blk-mq: make blk_mq_run_queues() static
    dm: fix multipath regression due to initializing wrong request
    cfq-iosched: handle failure of cfq group allocation
    block: Quiesce zeroout wrapper
    block: rewrite and split __bio_copy_iov()
    block: merge __bio_map_user_iov into bio_map_user_iov
    block: merge __bio_map_kern into bio_map_kern
    block: pass iov_iter to the BLOCK_PC mapping functions
    block: add a helper to free bio bounce buffer pages
    block: use blk_rq_map_user_iov to implement blk_rq_map_user
    block: simplify bio_map_kern
    block: mark blk-mq devices as stackable
    block: keep established cmd_flags when cloning into a blk-mq request
    block: add blk-mq support to blk_insert_cloned_request()
    block: require blk_rq_prep_clone() be given an initialized clone request
    blk-mq: add tag allocation policy
    ...

    Linus Torvalds
     

24 Jan, 2015

1 commit

  • This is the blk-mq part to support tag allocation policy. The default
    allocation policy isn't changed (though it's not a strict FIFO). The
    new policy is round-robin for libata, but it's a best-effort
    implementation: if multiple tasks are competing, the tags returned
    will be mixed (which is unavoidable even without mq, as requests from
    different tasks can be mixed in the queue).

    Cc: Jens Axboe
    Cc: Tejun Heo
    Cc: Christoph Hellwig
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
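
    A minimal sketch of how a conversion like libata's could ask for
    round-robin tags, assuming the policy-to-flag packing macros from this
    change:

        static void foo_set_rr_policy(struct blk_mq_tag_set *set)
        {
                /* Pack the allocation policy into the tag_set flags. */
                set->flags = BLK_MQ_F_SHOULD_MERGE |
                             BLK_ALLOC_POLICY_TO_MQ_FLAG(BLK_TAG_ALLOC_RR);
        }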
     
