14 Dec, 2014

1 commit

  • Pull block driver core update from Jens Axboe:
    "This is the pull request for the core block IO changes for 3.19. Not
    a huge round this time, mostly lots of little good fixes:

    - Fix a bug in sysfs blktrace interface causing a NULL pointer
    dereference, when enabled/disabled through that API. From Arianna
    Avanzini.

    - Various updates/fixes/improvements for blk-mq:

    - A set of updates from Bart, mostly fixing bugs in the tag
    handling.

    - Cleanup/code consolidation from Christoph.

    - Extend the queue_rq API to handle batched issuing of IO
    requests. NVMe will utilize this shortly. From me.

    - A few tag and request handling updates from me.

    - Cleanup of the preempt handling for running queues from Paolo.

    - Prevent running of unmapped hardware queues from Ming Lei.

    - Move the kdump memory limiting check to be in the correct
    location, from Shaohua.

    - Initialize all software queues at init time from Takashi. This
    prevents a kobject warning when CPUs are brought online that
    weren't online when a queue was registered.

    - Single writeback fix for I_DIRTY clearing from Tejun. Queued with
    the core IO changes, since it's just a single fix.

    - Version X of the __bio_add_page() segment addition retry from
    Maurizio. Hope the Xth time is the charm.

    - Documentation fixup for IO scheduler merging from Jan.

    - Introduce (and use) generic IO stat accounting helpers for non-rq
    drivers, from Gu Zheng.

    - Kill off artificial limiting of max sectors in a request from
    Christoph"

    * 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits)
    bio: modify __bio_add_page() to accept pages that don't start a new segment
    blk-mq: Fix uninitialized kobject at CPU hotplugging
    blktrace: don't let the sysfs interface remove trace from running list
    blk-mq: Use all available hardware queues
    blk-mq: Micro-optimize bt_get()
    blk-mq: Fix a race between bt_clear_tag() and bt_get()
    blk-mq: Avoid that __bt_get_word() wraps multiple times
    blk-mq: Fix a use-after-free
    blk-mq: prevent unmapped hw queue from being scheduled
    blk-mq: re-check for available tags after running the hardware queue
    blk-mq: fix hang in bt_get()
    blk-mq: move the kdump check to blk_mq_alloc_tag_set
    blk-mq: cleanup tag free handling
    blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq cpu map
    blk: introduce generic io stat accounting help function
    blk-mq: handle the single queue case in blk_mq_hctx_next_cpu
    genhd: check for int overflow in disk_expand_part_tbl()
    blk-mq: add blk_mq_free_hctx_request()
    blk-mq: export blk_mq_free_request()
    blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable
    ...

    Linus Torvalds
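
    As a hedged illustration of the generic IO stat accounting helpers called
    out above (Gu Zheng's item), here is a minimal sketch of how a bio-based,
    non-request driver might use them. The signatures are assumed from the
    3.19-era API, and the surrounding driver code is hypothetical:

    #include <linux/bio.h>
    #include <linux/blkdev.h>
    #include <linux/genhd.h>
    #include <linux/jiffies.h>

    /* hypothetical bio-based make_request path with IO stat accounting */
    static void my_make_request(struct request_queue *q, struct bio *bio)
    {
            struct gendisk *disk = bio->bi_bdev->bd_disk;
            unsigned long start = jiffies;
            int rw = bio_data_dir(bio);

            generic_start_io_acct(rw, bio_sectors(bio), &disk->part0);

            /* ... actually service the bio here ... */

            generic_end_io_acct(rw, &disk->part0, start);
            bio_endio(bio, 0);
    }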
     

12 Nov, 2014

3 commits

  • The queuecommand() callback functions in SCSI low-level drivers
    need to know which hardware context has been selected by the
    block layer. Since this information is not available in the
    request structure, and since passing the hctx pointer directly to
    the queuecommand callback function would require modification of
    all SCSI LLDs, add a function to the block layer that allows
    querying the hardware context index.

    Signed-off-by: Bart Van Assche
    Acked-by: Jens Axboe
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Christoph Hellwig

    Bart Van Assche
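
    The changelog above does not name the new helper; assuming it is the
    blk_mq_unique_tag() family that landed for scsi-mq, a minimal sketch of a
    queuecommand() path recovering the hardware context index might look like
    this (my_hba_submit() is a hypothetical HBA hook):

    #include <linux/blk-mq.h>
    #include <scsi/scsi_cmnd.h>
    #include <scsi/scsi_host.h>

    static int my_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd)
    {
            u32 unique = blk_mq_unique_tag(cmd->request);
            u16 hwq = blk_mq_unique_tag_to_hwq(unique);  /* hw context index */
            u16 tag = blk_mq_unique_tag_to_tag(unique);  /* tag within that hwq */

            /* hypothetical: submit on the HBA ring matching the hw context */
            return my_hba_submit(host, hwq, tag, cmd);
    }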
     
  • blk-mq is using preempt_disable/enable in order to ensure that the
    queue runners are placed on the right CPU. This does not work with
    the RT patches, because __blk_mq_run_hw_queue takes a non-raw
    spinlock within the preemption-disabled region. If there is contention
    on the lock, this violates the rules for preemption-disabled regions.

    While this should be easily fixable within the RT patches just by doing
    migrate_disable/enable, we can do better and document _why_ this
    particular region runs with disabled preemption. After the previous
    patch, it is trivial to switch it to get/put_cpu; the RT patches then
    can change it to get_cpu_light, which lets virtio-blk run under RT
    kernels.

    Cc: Jens Axboe
    Cc: Thomas Gleixner
    Reported-by: Clark Williams
    Tested-by: Clark Williams
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Jens Axboe

    Paolo Bonzini
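
    A minimal sketch of the get_cpu()/put_cpu() pattern adopted above; unlike a
    bare preempt_disable()/preempt_enable() pair, it makes clear that the
    section exists to stay on one CPU, which is what lets the RT patches
    substitute get_cpu_light(). The run/punt helpers are hypothetical
    stand-ins for the blk-mq internals:

    #include <linux/blk-mq.h>
    #include <linux/cpumask.h>
    #include <linux/smp.h>

    static void my_run_hw_queue(struct blk_mq_hw_ctx *hctx)
    {
            int cpu = get_cpu();    /* disables preemption, returns this CPU */

            if (cpumask_test_cpu(cpu, hctx->cpumask))
                    my_run_locally(hctx);           /* hypothetical */
            else
                    my_punt_to_mapped_cpu(hctx);    /* hypothetical */

            put_cpu();              /* re-enables preemption */
    }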
     
  • preempt_disable/enable surrounds every call to blk_mq_run_hw_queue,
    except the one in blk-flush.c. In fact that one is always asynchronous,
    and it does not need smp_processor_id().

    We can do the same for all other calls, avoiding preempt_disable when
    async is true. This avoids peppering blk-mq.c with preemption-disabled
    regions.

    Cc: Jens Axboe
    Cc: Thomas Gleixner
    Reported-by: Clark Williams
    Tested-by: Clark Williams
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Jens Axboe

    Paolo Bonzini
     

05 Nov, 2014

1 commit

  • q->mq_usage_counter is a percpu_ref which is killed and drained when
    the queue is frozen. On a CPU hotplug event, blk_mq_queue_reinit()
    which involves freezing the queue is invoked on all existing queues.
    Because percpu_ref killing and draining involve an RCU grace period,
    doing the above on one queue after another may take a long time if
    there are many queues on the system.

    This patch splits out initiation of freezing and waiting for its
    completion, and updates blk_mq_queue_reinit_notify() so that the
    queues are frozen in parallel instead of one after another. Note that
    freezing and unfreezing are moved from blk_mq_queue_reinit() to
    blk_mq_queue_reinit_notify().

    Signed-off-by: Tejun Heo
    Reported-by: Christian Borntraeger
    Tested-by: Christian Borntraeger
    Signed-off-by: Jens Axboe

    Tejun Heo
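
    A hedged sketch of the parallel freeze described above: initiating the
    percpu_ref kill on every queue first and only waiting afterwards lets the
    RCU grace periods overlap instead of serializing. The helper names follow
    the changelog; the real code lives in block/blk-mq.c:

    #include <linux/blkdev.h>
    #include <linux/list.h>

    static void freeze_all_queues_in_parallel(struct list_head *all_q_list)
    {
            struct request_queue *q;

            /* kick off the percpu_ref kill on every queue first ... */
            list_for_each_entry(q, all_q_list, all_q_node)
                    blk_mq_freeze_queue_start(q);

            /* ... then wait, so the grace periods are paid once, not per queue */
            list_for_each_entry(q, all_q_list, all_q_node)
                    blk_mq_freeze_queue_wait(q);
    }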
     

30 Oct, 2014

2 commits

  • Drivers can now tell blk-mq whether or not they take advantage of the
    deferred issue through 'last'. If they do, don't do queue-direct
    for sync IO. This is a preparation patch for the nvme conversion.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Since we have the notion of a 'last' request in a chain, we can use
    this to have the hardware optimize the issuing of requests. Add
    a list_head parameter to queue_rq that the driver can use to
    temporarily store hw commands for issue when 'last' is true. If we
    are doing a chain of requests, pass in a NULL list for the first
    request to force issue of that immediately, then batch the remainder
    for deferred issue until the last request has been sent.

    Instead of adding yet another argument to the hot ->queue_rq path,
    encapsulate the passed arguments in a blk_mq_queue_data structure.
    This is passed as a constant, and has been tested as faster than
    passing 4 (or even 3) args through ->queue_rq. Update drivers for
    the new ->queue_rq() prototype. There are no functional changes
    in this patch for drivers - if they don't use the passed in list,
    then they will just queue requests individually like before.

    Signed-off-by: Jens Axboe

    Jens Axboe
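
    A hedged sketch of a driver ->queue_rq() against the new prototype,
    assuming the blk_mq_queue_data layout described above (rq, list, last);
    my_queue_cmd() and my_ring_doorbell() are hypothetical driver hooks:

    #include <linux/blk-mq.h>

    static int mydrv_queue_rq(struct blk_mq_hw_ctx *hctx,
                              const struct blk_mq_queue_data *bd)
    {
            struct request *rq = bd->rq;

            blk_mq_start_request(rq);

            if (my_queue_cmd(hctx->driver_data, rq))        /* hypothetical */
                    return BLK_MQ_RQ_QUEUE_BUSY;    /* out of resources, retry later */

            if (bd->last)   /* end of the chain: kick the hardware once */
                    my_ring_doorbell(hctx->driver_data);    /* hypothetical */

            return BLK_MQ_RQ_QUEUE_OK;
    }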
     

19 Oct, 2014

1 commit

  • Pull core block layer changes from Jens Axboe:
    "This is the core block IO pull request for 3.18. Apart from the new
    and improved flush machinery for blk-mq, this is all mostly bug fixes
    and cleanups.

    - blk-mq timeout updates and fixes from Christoph.

    - Removal of REQ_END, also from Christoph. We pass it through the
    ->queue_rq() hook for blk-mq instead, freeing up one of the request
    bits. The space was overly tight on 32-bit, so Martin also killed
    REQ_KERNEL since it's no longer used.

    - blk integrity updates and fixes from Martin and Gu Zheng.

    - Update to the flush machinery for blk-mq from Ming Lei. Now we
    have a per hardware context flush request, which both cleans up the
    code and should scale better for flush intensive workloads on blk-mq.

    - Improve the error printing, from Rob Elliott.

    - Backing device improvements and cleanups from Tejun.

    - Fixup of a misplaced rq_complete() tracepoint from Hannes.

    - Make blk_get_request() return error pointers, fixing up issues
    where we NULL deref when a device goes bad or missing. From Joe
    Lawrence.

    - Prep work for drastically reducing the memory consumption of dm
    devices from Junichi Nomura. This allows creating clone bio sets
    without preallocating a lot of memory.

    - Fix a blk-mq hang on certain combinations of queue depths and
    hardware queues from me.

    - Limit memory consumption for blk-mq devices for crash dump
    scenarios and drivers that use crazy high depths (certain SCSI
    shared tag setups). We now just use a single queue and limited
    depth for that"

    * 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits)
    block: Remove REQ_KERNEL
    blk-mq: allocate cpumask on the home node
    bio-integrity: remove the needless fail handle of bip_slab creating
    block: include func name in __get_request prints
    block: make blk_update_request print prefix match ratelimited prefix
    blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio
    block: fix alignment_offset math that assumes io_min is a power-of-2
    blk-mq: Make bt_clear_tag() easier to read
    blk-mq: fix potential hang if rolling wakeup depth is too high
    block: add bioset_create_nobvec()
    block: use bio_clone_fast() in blk_rq_prep_clone()
    block: misplaced rq_complete tracepoint
    sd: Honor block layer integrity handling flags
    block: Replace strnicmp with strncasecmp
    block: Add T10 Protection Information functions
    block: Don't merge requests if integrity flags differ
    block: Integrity checksum flag
    block: Relocate bio integrity flags
    block: Add a disk flag to block integrity profile
    block: Add prefix to block integrity profile flags
    ...

    Linus Torvalds
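
    As a small illustration of the blk_get_request() change called out above,
    a hedged sketch of the new error-pointer convention (the surrounding
    helper is hypothetical):

    #include <linux/blkdev.h>
    #include <linux/err.h>

    static int my_send_internal_command(struct request_queue *q)
    {
            struct request *rq = blk_get_request(q, READ, GFP_KERNEL);

            if (IS_ERR(rq))                 /* was: if (!rq) */
                    return PTR_ERR(rq);     /* e.g. -ENODEV when the device is gone */

            /* ... fill in and issue rq ... */

            blk_put_request(rq);
            return 0;
    }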
     

26 Sep, 2014

7 commits

  • This patch supports running a single flush machinery for
    each blk-mq dispatch queue, so that:

    - current init_request and exit_request callbacks can
    cover flush requests too, so the buggy copying way of
    initializing the flush request's pdu can be fixed

    - flushing performance gets improved in the multi hw-queue case

    In a fio sync write test over virtio-blk (4 hw queues, ioengine=sync,
    iodepth=64, numjobs=4, bs=4K), it is observed that throughput
    increases a lot in my test environment:
    - throughput: +70% in case of virtio-blk over null_blk
    - throughput: +30% in case of virtio-blk over SSD image

    The multi virtqueue feature isn't merged into QEMU yet, and patches for
    the feature can be found in the tree below:

    git://kernel.ubuntu.com/ming/qemu.git v2.1.0-mq.4

    Simply passing 'num_queues=4 vectors=5' should be enough to
    enable the multi-queue (quad queue) feature for QEMU virtio-blk.

    Suggested-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • This patch adds a 'blk_mq_ctx' parameter to blk_get_flush_queue(),
    so that this function can find the corresponding blk_flush_queue
    bound to the current mq context, since the flush queue will become
    per hw-queue.

    For the legacy queue, the parameter can simply be 'NULL'.

    For the multiqueue case, the parameter should be set to the context
    from which the related request originates. With this context
    info, the hw queue and related flush queue can be found easily.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • The mission of the two helpers is now over, so just call
    blk_alloc_flush_queue() and blk_free_flush_queue() directly.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • This patch introduces 'struct blk_flush_queue' and puts all
    flush machinery related fields into this structure, so that

    - flush implementation details aren't exposed to driver
    - it is easy to convert to per dispatch-queue flush machinery

    This patch is basically a mechanical replacement.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • These two temporary functions are introduced for holding flush
    initialization and de-initialization, so that we can
    introduce the 'flush queue' more easily in the following patch. And
    once the 'flush queue' and its allocation/free functions are ready,
    they will be removed for the sake of code readability.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • It is reasonable to allocate flush req in blk_mq_init_flush().

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Failure to initialize one hctx isn't handled, so this patch
    introduces blk_mq_init_hctx() and its pair to handle it explicitly.
    This patch also makes the code cleaner.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

25 Sep, 2014

4 commits

  • blk-mq uses percpu_ref for its usage counter, which tracks the number
    of in-flight commands and is used to synchronously drain the queue on
    freeze. percpu_ref shutdown takes measurable wallclock time as it
    involves a sched RCU grace period. This means that draining a blk-mq
    queue takes measurable wallclock time. One would think that this shouldn't
    matter as queue shutdown should be a rare event which takes place
    asynchronously w.r.t. userland.

    Unfortunately, SCSI probing involves synchronously setting up and then
    tearing down a lot of request_queues back-to-back for non-existent
    LUNs. This means that SCSI probing may take more than ten seconds when
    scsi-mq is used.

    [ 0.949892] scsi host0: Virtio SCSI HBA
    [ 1.007864] scsi 0:0:0:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5
    [ 1.021299] scsi 0:0:1:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5
    [ 1.520356] tsc: Refined TSC clocksource calibration: 2491.910 MHz

    [ 16.186549] sd 0:0:0:0: Attached scsi generic sg0 type 0
    [ 16.190478] sd 0:0:1:0: Attached scsi generic sg1 type 0
    [ 16.194099] osd: LOADED open-osd 0.2.1
    [ 16.203202] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
    [ 16.208478] sd 0:0:0:0: [sda] Write Protect is off
    [ 16.211439] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    [ 16.218771] sd 0:0:1:0: [sdb] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
    [ 16.223264] sd 0:0:1:0: [sdb] Write Protect is off
    [ 16.225682] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

    This is also the reason why request_queues start in bypass mode which
    is ended on blk_register_queue() as shutting down a fully functional
    queue also involves an RCU grace period and the queues for non-existent
    SCSI devices never reach registration.

    blk-mq basically needs to do the same thing - start the mq in a
    degraded mode which is faster to shut down and then make it fully
    functional only after the queue reaches registration. percpu_ref
    recently grew facilities to force atomic operation until explicitly
    switched to percpu mode, which can be used for this purpose. This
    patch makes blk-mq initialize q->mq_usage_counter in atomic mode and
    switch it to percpu mode only once blk_register_queue() is reached.

    Note that this issue was previously worked around by 0a30288da1ae
    ("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during
    probe") for v3.17. The temp fix was reverted in preparation of adding
    persistent atomic mode to percpu_ref by 9eca80461a45 ("Revert "blk-mq,
    percpu_ref: implement a kludge for SCSI blk-mq stall during probe"").
    This patch and the prerequisite percpu_ref changes will be merged
    during v3.18 devel cycle.

    Signed-off-by: Tejun Heo
    Reported-by: Christoph Hellwig
    Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
    Fixes: add703fda981 ("blk-mq: use percpu_ref for mq usage count")
    Reviewed-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Johannes Weiner

    Tejun Heo
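
    A hedged sketch of the pattern the patch above adopts for
    q->mq_usage_counter: initialize the percpu_ref in atomic mode so tearing
    down a never-registered queue avoids the sched-RCU grace period, then
    switch to percpu mode at registration. The signature shown assumes the
    post-series percpu_ref_init(ref, release, flags, gfp) form, and the
    release callback is a hypothetical stub:

    #include <linux/gfp.h>
    #include <linux/percpu-refcount.h>

    static void my_release(struct percpu_ref *ref)
    {
            /* hypothetical: wake up whoever is draining the counter */
    }

    static struct percpu_ref usage_counter;

    static int my_init_queue_usage_counter(void)
    {
            return percpu_ref_init(&usage_counter, my_release,
                                   PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);
    }

    static void my_on_queue_registered(void)
    {
            /* queue reached registration: take the fast percpu path from now on */
            percpu_ref_switch_to_percpu(&usage_counter);
    }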
     
  • With the recent addition of percpu_ref_reinit(), percpu_ref now can be
    used as a persistent switch which can be turned on and off repeatedly
    where turning off maps to killing the ref and waiting for it to drain;
    however, there currently isn't a way to initialize a percpu_ref in its
    off (killed and drained) state, which can be inconvenient for certain
    persistent switch use cases.

    Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
    selection of operation mode; however, currently a newly initialized
    percpu_ref is always in percpu mode making it impossible to avoid the
    latency overhead of switching to atomic mode.

    This patch adds @flags to percpu_ref_init() and implements the
    following flags.

    * PERCPU_REF_INIT_ATOMIC : start ref in atomic mode
    * PERCPU_REF_INIT_DEAD : start ref killed and drained

    These flags should be able to serve the above two use cases.

    v2: target_core_tpg.c conversion was missing. Fixed.

    Signed-off-by: Tejun Heo
    Reviewed-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: Johannes Weiner

    Tejun Heo
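
    A hedged sketch of the "persistent switch" use case mentioned above:
    start the ref dead (killed and drained) and toggle it with
    percpu_ref_reinit()/percpu_ref_kill(); the release callback is a
    hypothetical stub:

    #include <linux/gfp.h>
    #include <linux/percpu-refcount.h>

    static void my_gate_release(struct percpu_ref *ref)
    {
            /* hypothetical: the switch has fully drained */
    }

    static struct percpu_ref my_gate;

    static int my_gate_setup(void)
    {
            /* starts off: tryget_live() fails until the first reinit */
            return percpu_ref_init(&my_gate, my_gate_release,
                                   PERCPU_REF_INIT_DEAD, GFP_KERNEL);
    }

    static void my_gate_open(void)
    {
            percpu_ref_reinit(&my_gate);    /* turn the switch on */
    }

    static void my_gate_close(void)
    {
            percpu_ref_kill(&my_gate);      /* off again; release runs once drained */
    }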
     
  • This reverts commit 0a30288da1aec914e158c2d7a3482a85f632750f, which
    was a temporary fix for SCSI blk-mq stall issue. The following
    patches will fix the issue properly by introducing atomic mode to
    percpu_ref.

    Signed-off-by: Tejun Heo
    Cc: Kent Overstreet
    Cc: Jens Axboe
    Cc: Christoph Hellwig

    Tejun Heo
     
  • …linux-block into for-3.18

    This is to receive 0a30288da1ae ("blk-mq, percpu_ref: implement a
    kludge for SCSI blk-mq stall during probe") which implements
    __percpu_ref_kill_expedited() to work around SCSI blk-mq stall. The
    commit will be reverted and patches to implement the proper fix will be added.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Kent Overstreet <kmo@daterainc.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Christoph Hellwig <hch@lst.de>

    Tejun Heo
     

24 Sep, 2014

1 commit

  • blk-mq uses percpu_ref for its usage counter, which tracks the number
    of in-flight commands and is used to synchronously drain the queue on
    freeze. percpu_ref shutdown takes measurable wallclock time as it
    involves a sched RCU grace period. This means that draining a blk-mq
    queue takes measurable wallclock time. One would think that this shouldn't
    matter as queue shutdown should be a rare event which takes place
    asynchronously w.r.t. userland.

    Unfortunately, SCSI probing involves synchronously setting up and then
    tearing down a lot of request_queues back-to-back for non-existent
    LUNs. This means that SCSI probing may take more than ten seconds
    when scsi-mq is used.

    This will be properly fixed by implementing a mechanism to keep
    q->mq_usage_counter in atomic mode till genhd registration; however,
    that involves rather big updates to percpu_ref which is difficult to
    apply late in the devel cycle (v3.17-rc6 at the moment). As a
    stop-gap measure till the proper fix can be implemented in the next
    cycle, this patch introduces __percpu_ref_kill_expedited() and makes
    blk_mq_freeze_queue() use it. This is heavy-handed but should work
    for testing the experimental SCSI blk-mq implementation.

    Signed-off-by: Tejun Heo
    Reported-by: Christoph Hellwig
    Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
    Fixes: add703fda981 ("blk-mq: use percpu_ref for mq usage count")
    Cc: Kent Overstreet
    Cc: Jens Axboe
    Tested-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Tejun Heo
     

23 Sep, 2014

13 commits

  • Moved blk_mq_rq_timed_out() definition to the private blk-mq.h header.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • It's not uncommon for crash dump kernels to be limited to 128MB or
    something low in that area. This is normally not a problem for
    devices as we don't use that much memory, but for some shared SCSI
    setups with huge queue depths, it can potentially fill most of
    memory with tons of request allocations. blk-mq does scale back
    when it fails to allocate memory, but it scales back just enough
    so that blk-mq succeeds. This could still leave the system with
    not enough memory to make any real progress.

    Check if we are in a kdump environment and limit the hardware
    queues and tag depth.

    Signed-off-by: Jens Axboe

    Jens Axboe
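
    A hedged sketch of the kdump clamp described above, roughly what
    blk_mq_alloc_tag_set() ends up doing; the depth of 64 is illustrative:

    #include <linux/blk-mq.h>
    #include <linux/crash_dump.h>
    #include <linux/kernel.h>

    static void my_clamp_tag_set_for_kdump(struct blk_mq_tag_set *set)
    {
            if (is_kdump_kernel()) {
                    /* a crash-dump kernel has very little memory to spare */
                    set->nr_hw_queues = 1;
                    set->queue_depth = min(64U, set->queue_depth);
            }
    }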
     
  • This patch removes two unnecessary blk_clear_rq_complete() calls;
    the REQ_ATOM_COMPLETE flag is cleared inside blk_mq_start_request(),
    so:

    - The blk_clear_rq_complete() in blk_flush_restore_request()
    isn't needed because the request will be freed later, and clearing
    it here may open a small race window with timeout.

    - The blk_clear_rq_complete() in blk_mq_requeue_request() isn't
    necessary either; even though REQ_ATOM_STARTED is cleared in
    __blk_mq_requeue_request(), in theory it still may cause a small
    race window with timeout since the two clear_bit() calls may be
    reordered.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Allow blk-mq to pass an argument to the timeout handler to indicate
    whether we're timing out a reserved or regular command. For many drivers
    those need to be handled differently.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
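
    A hedged sketch of a driver timeout handler using the new 'reserved'
    argument; my_abort() and my_reset() are hypothetical escalation hooks:

    #include <linux/blk-mq.h>
    #include <linux/blkdev.h>

    static enum blk_eh_timer_return my_timeout(struct request *rq, bool reserved)
    {
            if (reserved) {
                    my_reset(rq->q);        /* hypothetical: reserved command, heavy hammer */
                    return BLK_EH_HANDLED;
            }

            my_abort(rq);                   /* hypothetical: abort just this request */
            return BLK_EH_RESET_TIMER;      /* give the abort time to complete */
    }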
     
  • Duplicate the (small) timeout handler in blk-mq so that we can pass
    arguments more easily to the driver timeout handler. This enables
    the next patch.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Don't do a kmalloc from the timer to handle timeouts; chances are we could be
    under heavy load or similar and thus just miss out on the timeouts.
    Fortunately it is very easy to just iterate over all in use tags, and doing
    this properly actually cleans up the blk_mq_busy_iter API as well, and
    prepares us for the next patch by passing a reserved argument to the
    iterator.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Now that we've changed the driver API on the submission side, use the
    opportunity to fix up the name on the completion side to fit into the
    general scheme.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • When we call blk_mq_start_request from the core blk-mq code before calling into
    ->queue_rq there is a racy window where the timeout handler can hit before we've
    fully set up the driver specific part of the command.

    Move the call to blk_mq_start_request into the driver so the driver can start
    the request only once it is fully set up.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Pass an explicit parameter for the last request in a batch to ->queue_rq
    instead of using a request flag. Besides being a cleaner and non-stateful
    interface, this is also required for the next patch, which fixes the blk-mq
    I/O submission code to not start a timer too early.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
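
    A hedged sketch of the ordering the two entries above establish for
    ->queue_rq() (shown with the explicit 'last' argument): build the
    driver-private command first, then call blk_mq_start_request(), so the
    timeout handler never sees a half-constructed command. my_prep_cmd() and
    my_submit_cmd() are hypothetical driver hooks:

    #include <linux/blk-mq.h>

    static int my_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq,
                           bool last)
    {
            void *cmd = blk_mq_rq_to_pdu(rq);       /* driver pdu area */

            my_prep_cmd(cmd, rq);           /* hypothetical: fully set up the command */

            blk_mq_start_request(rq);       /* only now may the timeout legitimately fire */

            if (my_submit_cmd(hctx->driver_data, cmd, last))        /* hypothetical */
                    return BLK_MQ_RQ_QUEUE_BUSY;
            return BLK_MQ_RQ_QUEUE_OK;
    }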
     
  • Moving patches from for-linus to 3.18 instead, pull in the changes
    that will go to Linus today.

    Jens Axboe
     
  • When requests are retried due to hw or sw resource shortages,
    we often stop the associated hardware queue. So ensure that we
    restart the queues when running the requeue work, otherwise the
    queue run will be a no-op.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • __blk_mq_alloc_rq_maps() can be invoked multiple times if we scale
    back the queue depth when we are low on memory. So don't clear
    set->tags when we fail; this is handled directly in
    the parent function, blk_mq_alloc_tag_set().

    Reported-by: Robert Elliott
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We should not insert requests into the flush state machine from
    blk_mq_insert_request. All incoming flush requests come through
    blk_{m,s}q_make_request and are handled there, while blk_execute_rq_nowait
    should only be called for BLOCK_PC requests. All other callers
    deal with requests that already went through the flush state machine
    and shouldn't be reinserted into it.

    Reported-by: Robert Elliott
    Debugged-by: Ming Lei
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
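
    A hedged sketch of the only legitimate blk_execute_rq_nowait() usage
    described above: a BLOCK_PC request allocated with blk_get_request() and
    issued asynchronously. Setup details and error handling are elided, and
    the helper itself is hypothetical:

    #include <linux/blkdev.h>
    #include <linux/err.h>

    static void my_issue_pc_request(struct request_queue *q, struct gendisk *disk,
                                    rq_end_io_fn *done)
    {
            struct request *rq = blk_get_request(q, WRITE, GFP_KERNEL);

            if (IS_ERR(rq))
                    return;

            rq->cmd_type = REQ_TYPE_BLOCK_PC;       /* passthrough, bypasses the flush machinery */

            /* ... set up rq->cmd[], rq->timeout, data buffers, etc. ... */

            blk_execute_rq_nowait(q, disk, rq, 0 /* at_head */, done);
    }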