11 May, 2017

1 commit


04 May, 2017

2 commits

  • Originally, I tied debugfs registration/unregistration together with
    sysfs. There's no reason to do this, and it's getting in the way of
    letting schedulers define their own debugfs attributes. Instead, tie the
    debugfs registration to the lifetime of the structures themselves.

    The saner lifetimes mean we can also get rid of the extra mq directory
    and move everything one level up. I.e., nvme0n1/mq/hctx0/tags is now
    just nvme0n1/hctx0/tags.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Preparation for adding more declarations.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

28 Apr, 2017

3 commits

  • The only difference between ->run_work and ->delay_work is that
    the latter is used to defer running a queue. This is done by
    marking the queue stopped, and scheduling ->delay_work to run
    sometime in the future. While the queue is stopped, direct runs
    or runs through ->run_work will not run the queue.

    If we combine the handlers, then we need to handle two things:

    1) If a delayed/stopped run is scheduled, then we should not run
    the queue before that has been completed.
    2) If a queue is delayed/stopped, the handler needs to restart
    the queue. Normally a run of a queue with the stopped bit set
    would be a no-op.

    Case 1 is handled by modifying a currently pending queue run
    to the deadline set by the caller of blk_mq_delay_queue().
    Subsequent attempts to queue a queue run will find the work
    item already pending, and direct runs will see a stopped queue
    as before.

    Case 2 is handled by adding a new bit, BLK_MQ_S_START_ON_RUN,
    that tells the work handler that it should clear a stopped
    queue and run it.

    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Jens Axboe
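
    A minimal user-space sketch of the combined handler described above,
    using simplified flags in place of the real blk-mq state bits (the
    names mirror BLK_MQ_S_STOPPED and BLK_MQ_S_START_ON_RUN; everything
    else is illustrative):

    #include <stdio.h>

    /* Illustrative stand-ins for the hctx state bits. */
    enum {
        S_STOPPED      = 1 << 0,  /* queue is stopped, direct runs are no-ops */
        S_START_ON_RUN = 1 << 1,  /* a delayed run should restart the queue   */
    };

    struct hctx {
        unsigned int state;
    };

    static void run_queue(struct hctx *hctx)
    {
        printf("running queue\n");
    }

    /* Single work handler serving both immediate and delayed runs. */
    static void run_work_fn(struct hctx *hctx)
    {
        if (hctx->state & S_START_ON_RUN) {
            /* Case 2: a delayed run was scheduled; clear the stopped
             * state so the queue actually runs again. */
            hctx->state &= ~(S_STOPPED | S_START_ON_RUN);
        } else if (hctx->state & S_STOPPED) {
            /* Case 1: a normal run while stopped stays a no-op. */
            return;
        }
        run_queue(hctx);
    }

    int main(void)
    {
        struct hctx h = { .state = S_STOPPED | S_START_ON_RUN };

        run_work_fn(&h);   /* delayed run: restarts the queue and runs it */
        h.state = S_STOPPED;
        run_work_fn(&h);   /* stopped, no delayed run pending: no-op */
        return 0;
    }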
     
  • This modifies (or adds, if not currently pending) an existing
    delayed work item.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • They serve the exact same purpose. Get rid of the non-delayed
    work variant, and just run it without delay for the normal case.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jens Axboe
     

27 Apr, 2017

1 commit

  • We currently call blk_mq_free_queue() from blk_cleanup_queue()
    before we unregister the debugfs attributes for that queue in
    blk_release_queue(). This leaves a window open during which
    accessing most of the mq debugfs attributes would cause a
    use-after-free. Additionally, the "state" attribute allows
    running the queue, which we should not do after the queue has
    entered the "dead" state. Fix both cases by unregistering the
    debugfs attributes before freeing queue resources starts.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

21 Apr, 2017

1 commit


20 Apr, 2017

5 commits


07 Apr, 2017

1 commit

  • Currently only dm and md/raid5 bios trigger
    trace_block_bio_complete(). Now that we have bio_chain() and
    bio_inc_remaining(), it is not possible, in general, for a driver to
    know when the bio is really complete. Only bio_endio() knows that.

    So move the trace_block_bio_complete() call to bio_endio().

    Now trace_block_bio_complete() pairs with trace_block_bio_queue().
    Any bio for which a 'queue' event is traced will subsequently
    generate a 'complete' event.

    There are a few cases where completion tracing is not wanted.
    1/ If blk_update_request() has already generated a completion
    trace event at the 'request' level, there is no point generating
    one at the bio level too. In this case the bi_sector and bi_size
    will have changed, so the bio-level event would be wrong.

    2/ If the bio hasn't actually been queued yet, but is being aborted
    early, then a trace event could be confusing. Some filesystems
    call bio_endio() but do not want tracing.

    3/ The bio_integrity code interposes itself by replacing bi_end_io,
    then restoring it and calling bio_endio() again. This would produce
    two identical trace events if left like that.

    To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
    produce the trace event when this is set.
    We address point 1 above by clearing the flag in blk_update_request().
    We address point 2 above by only setting the flag when
    generic_make_request() is called.
    We address point 3 above by clearing the flag after generating a
    completion event.

    When bio_split() is used on a bio, particularly in blk_queue_split(),
    there is an extra complication. A new bio is split off the front, and
    may be handled directly without going through generic_make_request().
    The old bio, which has been advanced, is passed to
    generic_make_request(), so it will trigger a trace event a second
    time.
    Probably the best result when a split happens is to see a single
    'queue' event for the whole bio, then multiple 'complete' events - one
    for each component. To achieve this we can:
    - copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
    - avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
    This way, the split-off bio won't create a queue event, and neither
    will the original, even if it is re-submitted to generic_make_request(),
    but both will produce completion events, each for its own range.

    So if generic_make_request() is called (which generates a QUEUED
    event), then bio_endio() will create a single COMPLETE event for each
    range that the bio is split into, unless the driver has explicitly
    requested it not to.

    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
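
    A toy model of the BIO_TRACE_COMPLETION lifecycle described above; the
    struct and the printf calls standing in for the tracepoints are
    hypothetical, and only the set/clear points follow the commit:

    #include <stdbool.h>
    #include <stdio.h>

    struct bio {
        bool trace_completion;   /* stands in for BIO_TRACE_COMPLETION */
    };

    /* Queue time: emit a 'queue' event and arm completion tracing only
     * if the flag is not already set, so the re-submitted remainder of
     * a split bio does not produce a second 'queue' event. */
    static void submit(struct bio *bio)
    {
        if (!bio->trace_completion) {
            printf("trace: block_bio_queue\n");
            bio->trace_completion = true;
        }
    }

    /* Split: the bio split off the front inherits the flag so that it
     * also produces a 'complete' event for its own range. */
    static void split(struct bio *parent, struct bio *front)
    {
        front->trace_completion = parent->trace_completion;
    }

    /* Completion: emit exactly one 'complete' event, then clear the
     * flag so that an interposed ->bi_end_io calling bio_endio() again
     * does not duplicate it. */
    static void end_io(struct bio *bio)
    {
        if (bio->trace_completion) {
            printf("trace: block_bio_complete\n");
            bio->trace_completion = false;
        }
    }

    int main(void)
    {
        struct bio whole = { 0 }, front = { 0 };

        submit(&whole);    /* one 'queue' event for the whole bio      */
        split(&whole, &front);
        submit(&whole);    /* re-submitted remainder: no second event  */
        end_io(&front);    /* 'complete' for the split-off range       */
        end_io(&whole);    /* 'complete' for the remainder             */
        return 0;
    }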
     

05 Apr, 2017

1 commit

  • In 4.10 I introduced a patch that associates the ioc priority with
    each request in the block layer. This work was done in the single queue
    block layer code. This patch unifies ioc priority to request mapping across
    the single/multi queue block layers.

    I have tested this patch with the null block device driver with the following
    parameters.

    null_blk queue_mode=2 irqmode=0 use_per_node_hctx=1 nr_devices=1

    I have not seen a performance regression with this patch and I would appreciate
    any feedback or additional testing.

    I have also verified that io priorities are passed to the device when using
    the SQ and MQ path to a SATA HDD that supports io priorities.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Adam Manzanares
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Adam Manzanares
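
    One way to exercise this path from user space is to give the
    submitting process an I/O priority before issuing reads or writes.
    glibc has no ioprio_set() wrapper, so the sketch below uses the raw
    syscall and spells out the uapi constants (double-check them against
    include/uapi/linux/ioprio.h on your kernel):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Values from include/uapi/linux/ioprio.h. */
    #define IOPRIO_CLASS_SHIFT          13
    #define IOPRIO_PRIO_VALUE(cl, data) (((cl) << IOPRIO_CLASS_SHIFT) | (data))
    #define IOPRIO_WHO_PROCESS          1
    #define IOPRIO_CLASS_BE             2

    int main(void)
    {
        /* Best-effort class, level 0 (highest BE level), for the calling
         * process; I/O issued afterwards carries this priority down to
         * the request on both the SQ and MQ paths. */
        int prio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 0);

        if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, prio) < 0) {
            perror("ioprio_set");
            return 1;
        }
        printf("io priority set\n");
        return 0;
    }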
     

29 Mar, 2017

3 commits

  • Before commit 780db2071a ("blk-mq: decouble blk-mq freezing
    from generic bypassing"), the dying flag was checked before
    entering the queue. Tejun converted that check into one on
    .mq_freeze_depth, assuming the counter is increased just after
    the dying flag is set. Unfortunately we don't do that in
    blk_set_queue_dying().

    This patch calls blk_freeze_queue_start() in blk_set_queue_dying(),
    so that we can block new I/O coming once the queue is set as dying.

    Given that blk_set_queue_dying() is always called in the remove
    path of a block device, and the queue will be cleaned up later,
    we don't need to worry about undoing the counter.

    Cc: Tejun Heo
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Ming Lei
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • As .q_usage_counter is used by both the legacy and
    mq paths, we need to block new I/O in blk_queue_enter()
    if the queue becomes dead.

    So rename it so that we can use this function in both
    paths.

    Reviewed-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Ming Lei
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Without the barrier, reading the DEAD flag of .q_usage_counter
    and reading .mq_freeze_depth may be reordered, and then the
    following wait_event_interruptible() may never return.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Ming Lei
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Ming Lei
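
    A stripped-down C11 illustration of the ordering the barrier enforces:
    the names stand in for the DEAD state of .q_usage_counter and for
    .mq_freeze_depth, the fences play the role of the smp_wmb()/smp_rmb()
    pairing, and everything else is a simplification:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static atomic_bool ref_dead;       /* stand-in for the DEAD flag     */
    static atomic_int  freeze_depth;   /* stand-in for .mq_freeze_depth  */

    /* Writer (blk_freeze_queue_start()-like): bump the depth first, then
     * publish the DEAD state; the release fence orders the two stores. */
    static void freeze_start(void)
    {
        atomic_fetch_add_explicit(&freeze_depth, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_release);
        atomic_store_explicit(&ref_dead, true, memory_order_relaxed);
    }

    /* Reader (blk_queue_enter()-like): if the DEAD state is observed,
     * the acquire fence guarantees the non-zero freeze depth is seen
     * too, so the subsequent wait condition uses consistent values. */
    static bool saw_freeze(void)
    {
        if (!atomic_load_explicit(&ref_dead, memory_order_relaxed))
            return false;
        atomic_thread_fence(memory_order_acquire);   /* the added barrier */
        return atomic_load_explicit(&freeze_depth, memory_order_relaxed) > 0;
    }

    int main(void)
    {
        freeze_start();
        printf("freeze observed: %d\n", saw_freeze());
        return 0;
    }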
     

28 Mar, 2017

1 commit

  • Currently there is no way to know the request size when the request is
    finished. The next patch will need this info. We could add an extra field
    to record the size, but blk_issue_stat has enough space to record it, so
    this patch just overloads blk_issue_stat. With this, we will have 49 bits
    to track time, which is still a very long time.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
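
    A sketch of the kind of packing the commit describes, assuming the low
    49 bits hold the issue time and the remaining high bits hold a size
    bucket (the real layout lives in blk_issue_stat and may differ):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define TIME_BITS 49
    #define TIME_MASK ((UINT64_C(1) << TIME_BITS) - 1)

    /* Pack an issue timestamp and a request-size bucket into one u64. */
    static uint64_t pack_stat(uint64_t time_ns, uint64_t size_bucket)
    {
        return (size_bucket << TIME_BITS) | (time_ns & TIME_MASK);
    }

    static uint64_t stat_time(uint64_t stat) { return stat & TIME_MASK; }
    static uint64_t stat_size(uint64_t stat) { return stat >> TIME_BITS; }

    int main(void)
    {
        uint64_t s = pack_stat(123456789, 7);

        /* 2^49 ns is roughly 6.5 days, so 49 bits is still plenty for
         * tracking per-request issue-to-completion times. */
        printf("time=%" PRIu64 " bucket=%" PRIu64 "\n",
               stat_time(s), stat_size(s));
        return 0;
    }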
     

22 Mar, 2017

3 commits

  • If a driver allocates a queue for stacked usage, then it does
    not currently get stats allocated. This causes the later init
    of, e.g., writeback throttling to blow up. Move the init to the
    queue allocation instead.

    Additionally, allow a NULL callback unregistration. This avoids
    having the caller check for that, fixing another oops on
    removal of a block device that doesn't have poll stats allocated.

    Fixes: 34dbad5d26e2 ("blk-stat: convert to callback-based statistics reporting")
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Currently, statistics are gathered in ~0.13s windows, and users grab the
    statistics whenever they need them. This is not ideal for either of the
    in-tree users:

    1. Writeback throttling wants its own dynamically sized window of
    statistics. Since the blk-stats statistics are reset after every
    window and the wbt windows don't line up with the blk-stats windows,
    wbt doesn't see every I/O.
    2. Polling currently grabs the statistics on every I/O. Again, depending
    on how the window lines up, we may miss some I/Os. It's also
    unnecessary overhead to get the statistics on every I/O; the hybrid
    polling heuristic would be just as happy with the statistics from the
    previous full window.

    This reworks the blk-stats infrastructure to be callback-based: users
    register a callback that they want called at a given time with all of
    the statistics from the window during which the callback was active.
    Users can dynamically bucketize the statistics. wbt and polling both
    currently use read vs. write, but polling can be extended to further
    subdivide based on request size.

    The callbacks are kept on an RCU list, and each callback has percpu
    stats buffers. There will only be a few users, so the overhead on the
    I/O completion side is low. The stats flushing is also simplified
    considerably: since the timer function is responsible for clearing the
    statistics, we don't have to worry about stale statistics.

    wbt is a trivial conversion. After the conversion, the windowing problem
    mentioned above is fixed.

    For polling, we register an extra callback that caches the previous
    window's statistics in the struct request_queue for the hybrid polling
    heuristic to use.

    Since we no longer have a single stats buffer for the request queue,
    this also removes the sysfs and debugfs stats entries. To replace those,
    we add a debugfs entry for the poll statistics.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
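
    A much-simplified, single-threaded model of the callback scheme: a user
    registers a callback plus a bucketing function, samples accumulate per
    bucket, and a "timer" flush hands the finished window to the callback
    and clears it (no RCU list or percpu buffers here; all names are
    illustrative):

    #include <stdio.h>
    #include <string.h>

    #define MAX_BUCKETS 4

    struct stat_bucket {
        unsigned long nr;
        unsigned long long total;
    };

    struct stat_cb {
        int (*bucket_fn)(int op);            /* maps a sample to a bucket  */
        void (*timer_fn)(struct stat_cb *);  /* sees the finished window   */
        int nr_buckets;
        struct stat_bucket buckets[MAX_BUCKETS];
    };

    static void stat_add(struct stat_cb *cb, int op, unsigned long long value)
    {
        int b = cb->bucket_fn(op);

        if (b < 0 || b >= cb->nr_buckets)
            return;
        cb->buckets[b].nr++;
        cb->buckets[b].total += value;
    }

    /* The timer owns clearing, so consumers never see stale windows. */
    static void stat_flush(struct stat_cb *cb)
    {
        cb->timer_fn(cb);
        memset(cb->buckets, 0, sizeof(cb->buckets));
    }

    static int rw_bucket(int op) { return op & 1; }  /* 0 = read, 1 = write */

    static void print_window(struct stat_cb *cb)
    {
        for (int i = 0; i < cb->nr_buckets; i++)
            printf("bucket %d: nr=%lu total=%llu\n",
                   i, cb->buckets[i].nr, cb->buckets[i].total);
    }

    int main(void)
    {
        struct stat_cb cb = {
            .bucket_fn = rw_bucket, .timer_fn = print_window, .nr_buckets = 2,
        };

        stat_add(&cb, 0, 100);   /* a read sample  */
        stat_add(&cb, 1, 250);   /* a write sample */
        stat_flush(&cb);         /* window ends: callback sees both buckets */
        return 0;
    }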
     
  • We always call wbt_exit() from blk_release_queue(), so these are
    unnecessary.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

12 Mar, 2017

1 commit

  • Commit 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()")
    changed current->bio_list so that it did not contain *all* of the
    queued bios, but only those submitted by the currently running
    make_request_fn.

    There are two places which walk the list and requeue selected bios,
    and others that check if the list is empty. These are no longer
    correct.

    So redefine current->bio_list to point to an array of two lists, which
    contain all queued bios, and adjust various code to test or walk both
    lists.

    Signed-off-by: NeilBrown
    Fixes: 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()")
    Signed-off-by: Jens Axboe

    NeilBrown
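
    The shape of the fix, modeled with a hypothetical singly linked bio
    list: current->bio_list becomes a pointer to an array of two lists,
    and both the emptiness checks and the requeue walks must cover both
    halves:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    struct bio {
        int sector;
        struct bio *next;
    };

    struct bio_list {
        struct bio *head;
    };

    static bool bio_list_empty(const struct bio_list *bl)
    {
        return bl->head == NULL;
    }

    /* Emptiness must now be checked across both lists. */
    static bool all_empty(struct bio_list *lists)
    {
        return bio_list_empty(&lists[0]) && bio_list_empty(&lists[1]);
    }

    /* Walking the queued bios must also visit both lists. */
    static void walk_all(struct bio_list *lists)
    {
        for (int i = 0; i < 2; i++)
            for (struct bio *b = lists[i].head; b; b = b->next)
                printf("queued bio at sector %d\n", b->sector);
    }

    int main(void)
    {
        struct bio a = { .sector = 0 }, b = { .sector = 8 };
        struct bio_list lists[2] = { { &a }, { &b } };
        struct bio_list *current_bio_list = lists;  /* like current->bio_list */

        printf("all empty: %d\n", all_empty(current_bio_list));
        walk_all(current_bio_list);
        return 0;
    }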
     

09 Mar, 2017

2 commits

  • To avoid recursion on the kernel stack when stacked block devices
    are in use, generic_make_request() will, when called recursively,
    queue new requests for later handling. They will be handled when the
    make_request_fn for the current bio completes.

    If any bios are submitted by a make_request_fn, these will ultimately
    be handled sequentially. If the handling of one of those generates
    further requests, they will be added to the end of the queue.

    This strict first-in-first-out behaviour can lead to deadlocks in
    various ways, normally because a request might need to wait for a
    previous request to the same device to complete. This can happen when
    they share a mempool, and can happen due to interdependencies
    particular to the device. Both md and dm have examples where this happens.

    These deadlocks can be eradicated by more selective ordering of bios.
    Specifically by handling them in depth-first order. That is: when the
    handling of one bio generates one or more further bios, they are
    handled immediately after the parent, before any siblings of the
    parent. That way, when generic_make_request() calls make_request_fn
    for some particular device, we can be certain that all previously
    submitted requests for that device have been completely handled and are
    not waiting for anything in the queue of requests maintained in
    generic_make_request().

    An easy way to achieve this would be to use a last-in-first-out stack
    instead of a queue. However, this would change the order of consecutive
    bios submitted by a make_request_fn, which could have unexpected consequences.
    Instead we take a slightly more complex approach.
    A fresh queue is created for each call to a make_request_fn. After it completes,
    any bios for a different device are placed on the front of the main queue, followed
    by any bios for the same device, followed by all bios that were already on
    the queue before the make_request_fn was called.
    This provides the depth-first approach without reordering bios on the same level.

    This, by itself, is not enough to remove all deadlocks. It just makes
    it possible for drivers to take the extra step required themselves.

    To avoid deadlocks, drivers must never risk waiting for a request
    after submitting one to generic_make_request. This includes never
    allocating from a mempool twice in one call to a make_request_fn.

    A common pattern in drivers is to call bio_split() in a loop, handling
    the first part and then looping around to possibly split the next part.
    Instead, a driver that finds it needs to split a bio should queue
    (with generic_make_request) the second part, handle the first part,
    and then return. The new code in generic_make_request will ensure the
    requests to underlying devices are processed first, then the second bio
    that was split off. If it splits again, the same process happens. In
    each case one bio will be completely handled before the next one is attempted.

    With this in place, it should be possible to disable the
    punt_bios_to_recover() recovery thread for many block devices, and
    eventually it may be possible to remove it completely.

    Ref: http://www.spinics.net/lists/raid/msg54680.html
    Tested-by: Jinpu Wang
    Inspired-by: Lars Ellenberg
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
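
    A toy model of the reordering only (the devices, the split, and all
    names are hypothetical; none of this is kernel code): bios queued by
    the current driver are sorted so that those aimed at other devices run
    first, then re-submissions to the same device, then whatever was
    already pending, giving the depth-first order described above.

    #include <stdio.h>

    struct bio { int dev, id; };   /* a "bio" is just a (device, id) pair */

    #define MAXQ 16
    static struct bio pending[MAXQ];
    static int npending;
    static int active;             /* inside the outermost call? */

    static void make_request(struct bio b);

    /* Hypothetical stack: dev 2 splits a bio into two for dev 1, dev 1
     * remaps each bio to dev 0, and dev 0 completes it. */
    static void driver_fn(struct bio b)
    {
        if (b.dev == 0)
            printf("complete dev0 bio %d\n", b.id);
        else if (b.dev == 1)
            make_request((struct bio){ .dev = 0, .id = b.id });
        else {
            make_request((struct bio){ .dev = 1, .id = b.id * 10 });
            make_request((struct bio){ .dev = 1, .id = b.id * 10 + 1 });
        }
    }

    static void make_request(struct bio b)
    {
        if (active) {              /* recursive call: just queue the bio */
            pending[npending++] = b;
            return;
        }
        active = 1;
        pending[npending++] = b;
        while (npending) {
            struct bio cur = pending[0];
            struct bio old[MAXQ], fresh[MAXQ], merged[MAXQ];
            int nold = npending - 1, nfresh, n = 0;

            for (int i = 0; i < nold; i++)
                old[i] = pending[i + 1];
            npending = 0;
            driver_fn(cur);        /* may append new bios to pending[] */
            nfresh = npending;
            for (int i = 0; i < nfresh; i++)
                fresh[i] = pending[i];
            /* Depth-first order: other devices, then the same device,
             * then everything that was queued before this driver ran. */
            for (int i = 0; i < nfresh; i++)
                if (fresh[i].dev != cur.dev)
                    merged[n++] = fresh[i];
            for (int i = 0; i < nfresh; i++)
                if (fresh[i].dev == cur.dev)
                    merged[n++] = fresh[i];
            for (int i = 0; i < nold; i++)
                merged[n++] = old[i];
            npending = n;
            for (int i = 0; i < n; i++)
                pending[i] = merged[i];
        }
        active = 0;
    }

    int main(void)
    {
        make_request((struct bio){ .dev = 2, .id = 1 });
        return 0;
    }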
     
  • This reverts commit 0dba1314d4f81115dce711292ec7981d17231064. It causes
    leaking of device numbers for SCSI when SCSI registers multiple gendisks
    for one request_queue in succession. It can be easily reproduced using
    Omar's script [1] on kernel with CONFIG_DEBUG_TEST_DRIVER_REMOVE.
    Furthermore the protection provided by this commit is not needed anymore
    as the problem it was fixing was also fixed by commit 165a5e22fafb
    "block: Move bdi_unregister() to del_gendisk()".

    [1]: http://marc.info/?l=linux-block&m=148554717109098&w=2

    Signed-off-by: Jan Kara
    Acked-by: Dan Williams
    Tested-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jan Kara
     

03 Mar, 2017

1 commit

  • Commit 6cd18e711dd8 "block: destroy bdi before blockdev is
    unregistered." moved bdi unregistration (at that time through
    bdi_destroy()) from blk_release_queue() to blk_cleanup_queue() because
    it needs to happen before blk_unregister_region() call in del_gendisk()
    for MD. SCSI though will free up the device number from sd_remove()
    called through a maze of callbacks from device_del() in
    __scsi_remove_device() before blk_cleanup_queue() and thus similar races
    as described in 6cd18e711dd8 can happen for SCSI as well as reported by
    Omar [1].

    Moving bdi_unregister() to del_gendisk() works for MD and fixes the
    problem for SCSI since del_gendisk() gets called from sd_remove() before
    freeing the device number.

    This also makes device_add_disk() (calling bdi_register_owner()) more
    symmetric with del_gendisk().

    [1] http://marc.info/?l=linux-block&m=148554717109098&w=2

    Tested-by: Lekshmi Pillai
    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Tested-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jan Kara
     

18 Feb, 2017

1 commit


09 Feb, 2017

2 commits

  • Add a new merge strategy, used from the plug merging code, that merges
    discard bios into a request until the maximum number of discard ranges
    (or the maximum discard size) is reached. I/O scheduler merging is not
    wired up yet but might also be useful, although not for fast devices
    like NVMe, which are the only user for now.

    Note that for now we don't support limiting the size of each discard range,
    but if needed that can be added later.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Switch these constants to an enum, and let the compiler ensure that
    all callers of blk_try_merge and elv_merge handle all potential values.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
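
    The pattern in miniature: with an enum return type and a switch that
    has no default case, -Wswitch (enabled by -Wall) flags any caller that
    forgets to handle a newly added merge type. The enumerator names echo
    the ELEVATOR_*_MERGE constants, but the snippet is generic:

    #include <stdio.h>

    /* Previously plain #defines; as an enum the compiler knows the full
     * set of possible values. */
    enum elv_merge {
        ELEVATOR_NO_MERGE,
        ELEVATOR_FRONT_MERGE,
        ELEVATOR_BACK_MERGE,
    };

    static const char *describe(enum elv_merge m)
    {
        /* No default: if a new enumerator is added and this switch is
         * not updated, the compiler reports the unhandled case. */
        switch (m) {
        case ELEVATOR_NO_MERGE:    return "no merge";
        case ELEVATOR_FRONT_MERGE: return "front merge";
        case ELEVATOR_BACK_MERGE:  return "back merge";
        }
        return "unknown";
    }

    int main(void)
    {
        printf("%s\n", describe(ELEVATOR_BACK_MERGE));
        return 0;
    }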
     

04 Feb, 2017

1 commit

  • If we end up doing a request-to-request merge when we have completed
    a bio-to-request merge, we free the request from deep down in that
    path. For blk-mq-sched, the merge path has to hold the appropriate
    lock, but we don't need it for freeing the request. And in fact
    holding the lock is problematic, since we are now calling the
    mq sched put_rq_private() hook with the lock held. Other call paths
    do not hold this lock.

    Fix this inconsistency by ensuring that the caller frees a merged
    request. Then we can do it outside of the lock, making it both more
    efficient and fixing the blk-mq-sched problem of invoking parts of
    the scheduler with an unknown lock state.

    Reported-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Reviewed-by: Omar Sandoval

    Jens Axboe
     

03 Feb, 2017

1 commit


02 Feb, 2017

6 commits

  • Warnings of the following form occur because scsi reuses a devt number
    while the block layer still has it referenced as the name of the bdi
    [1]:

    WARNING: CPU: 1 PID: 93 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80
    sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:192'
    [..]
    Call Trace:
    dump_stack+0x86/0xc3
    __warn+0xcb/0xf0
    warn_slowpath_fmt+0x5f/0x80
    ? kernfs_path_from_node+0x4f/0x60
    sysfs_warn_dup+0x62/0x80
    sysfs_create_dir_ns+0x77/0x90
    kobject_add_internal+0xb2/0x350
    kobject_add+0x75/0xd0
    device_add+0x15a/0x650
    device_create_groups_vargs+0xe0/0xf0
    device_create_vargs+0x1c/0x20
    bdi_register+0x90/0x240
    ? lockdep_init_map+0x57/0x200
    bdi_register_owner+0x36/0x60
    device_add_disk+0x1bb/0x4e0
    ? __pm_runtime_use_autosuspend+0x5c/0x70
    sd_probe_async+0x10d/0x1c0
    async_run_entry_fn+0x39/0x170

    This is a brute-force fix to pass the devt release information from
    sd_probe() to the locations where we register the bdi,
    device_add_disk(), and unregister the bdi, blk_cleanup_queue().

    Thanks to Omar for the quick reproducer script [2]. This patch survives
    where an unmodified kernel fails in a few seconds.

    [1]: https://marc.info/?l=linux-scsi&m=147116857810716&w=4
    [2]: http://marc.info/?l=linux-block&m=148554717109098&w=2

    Cc: James Bottomley
    Cc: Bart Van Assche
    Cc: "Martin K. Petersen"
    Cc: Jan Kara
    Reported-by: Omar Sandoval
    Tested-by: Omar Sandoval
    Signed-off-by: Dan Williams
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Dan Williams
     
  • blk_get_backing_dev_info() is now a simple dereference. Remove that
    function and simplify some code around that.

    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Currently blk_get_backing_dev_info() is not safe to call when the
    block device is not open, as bdev->bd_disk is NULL in that case. However,
    inode_to_bdi() uses this function and may be called from the flusher
    worker or other writeback-related functions without the bdev being open,
    which leads to crashes such as:

    [113031.075540] Unable to handle kernel paging request for data at address 0x00000000
    [113031.075614] Faulting instruction address: 0xc0000000003692e0
    0:mon> t
    [c0000000fb65f900] c00000000036cb6c writeback_sb_inodes+0x30c/0x590
    [c0000000fb65fa10] c00000000036ced4 __writeback_inodes_wb+0xe4/0x150
    [c0000000fb65fa70] c00000000036d33c wb_writeback+0x30c/0x450
    [c0000000fb65fb40] c00000000036e198 wb_workfn+0x268/0x580
    [c0000000fb65fc50] c0000000000f3470 process_one_work+0x1e0/0x590
    [c0000000fb65fce0] c0000000000f38c8 worker_thread+0xa8/0x660
    [c0000000fb65fd80] c0000000000fc4b0 kthread+0x110/0x130
    [c0000000fb65fe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c

    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Instead of storing backing_dev_info inside struct request_queue,
    allocate it dynamically, reference count it, and free it when the last
    reference is dropped. Currently only request_queue holds the reference
    but in the following patch we add other users referencing
    backing_dev_info.

    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
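
    The lifetime scheme in miniature: a separately allocated, reference-
    counted object that is freed only when the last holder drops its
    reference. The names are illustrative, not the kernel's bdi API, and
    a real implementation would use a proper refcount/kref:

    #include <stdio.h>
    #include <stdlib.h>

    struct backing_dev {
        int refcnt;
    };

    static struct backing_dev *bdev_alloc(void)
    {
        struct backing_dev *b = calloc(1, sizeof(*b));

        if (b)
            b->refcnt = 1;      /* initial reference held by the queue */
        return b;
    }

    static struct backing_dev *bdev_get(struct backing_dev *b)
    {
        b->refcnt++;
        return b;
    }

    static void bdev_put(struct backing_dev *b)
    {
        if (--b->refcnt == 0) { /* last reference dropped: free it */
            printf("freeing bdi\n");
            free(b);
        }
    }

    int main(void)
    {
        struct backing_dev *queue_ref = bdev_alloc();
        struct backing_dev *other_ref = bdev_get(queue_ref);

        bdev_put(queue_ref);    /* queue teardown: object stays alive */
        bdev_put(other_ref);    /* last user gone: object is freed    */
        return 0;
    }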
     
  • We will want to have struct backing_dev_info allocated separately from
    struct request_queue. As a first step, add a pointer to backing_dev_info
    to request_queue and convert all users touching it. No functional
    changes in this patch.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • blk_set_queue_dying() does not acquire queue lock before it calls
    blk_queue_for_each_rl(). This allows a racing blkg_destroy() to
    remove blkg->q_node from the linked list and have
    blk_queue_for_each_rl() loop infinitely over the removed blkg->q_node
    list node.

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Jens Axboe

    Tahsin Erdogan
     

01 Feb, 2017

2 commits

  • Instead of keeping two levels of indirection for request types, fold it
    all into the operations. The little caveat here is that previously
    cmd_type only applied to struct request, while the request and bio op
    fields were set to plain REQ_OP_READ/WRITE even for passthrough
    operations.

    Instead this patch adds new REQ_OP_* for SCSI passthrough and driver
    private requests, although it has to add two for each so that we
    can communicate the data in/out nature of the request.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This can be used to check for fs vs non-fs requests and basically
    removes all BLOCK_PC-specific knowledge from the block layer,
    as well as preparing for removing the cmd_type field in struct request.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
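
    A sketch of the end state the two commits above describe: a single
    operation enum covering both fs and passthrough requests, plus one
    predicate that callers use instead of inspecting cmd_type. The
    enumerator values and the ordering trick are arbitrary here; the real
    definitions live in include/linux/blk_types.h:

    #include <stdbool.h>
    #include <stdio.h>

    /* Filesystem ops plus dedicated passthrough ops, replacing the old
     * cmd_type + REQ_OP_READ/WRITE combination. Two ops per passthrough
     * flavor encode the data direction. */
    enum req_op {
        REQ_OP_READ,
        REQ_OP_WRITE,
        REQ_OP_SCSI_IN,
        REQ_OP_SCSI_OUT,
        REQ_OP_DRV_IN,
        REQ_OP_DRV_OUT,
    };

    struct request {
        enum req_op op;
    };

    static bool blk_rq_is_passthrough(const struct request *rq)
    {
        return rq->op >= REQ_OP_SCSI_IN;  /* fs ops sort below passthrough */
    }

    int main(void)
    {
        struct request fs = { .op = REQ_OP_WRITE };
        struct request pc = { .op = REQ_OP_SCSI_OUT };

        printf("fs passthrough: %d\n", blk_rq_is_passthrough(&fs));
        printf("pc passthrough: %d\n", blk_rq_is_passthrough(&pc));
        return 0;
    }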
     

28 Jan, 2017

1 commit