12 Feb, 2019

3 commits

  • The __blk_mq_register_dev(), blk_mq_unregister_dev(),
    elv_register_queue() and elv_unregister_queue() calls need to be
    protected with sysfs_lock, but the other code in these functions does
    not. Hence protect only those calls with sysfs_lock. This patch fixes
    a locking inversion issue in blk_unregister_queue() and also in an
    error path of blk_register_queue(): it is not allowed to hold
    sysfs_lock around the kobject_del(&q->kobj) call (see the sketch
    below).

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    (cherry picked from commit 2c2086afc2b8b974fac32cb028e73dc27bfae442)

    Bart Van Assche
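
    A hedged sketch of the narrowed locking in blk_unregister_queue() that
    this commit describes (simplified; early returns and the blk-trace
    teardown are elided):

    void blk_unregister_queue(struct gendisk *disk)
    {
            struct request_queue *q = disk->queue;

            /* hold sysfs_lock only around the calls that need it ... */
            mutex_lock(&q->sysfs_lock);
            if (q->mq_ops)
                    blk_mq_unregister_dev(disk_to_dev(disk), q);
            mutex_unlock(&q->sysfs_lock);

            /* ... and drop it before kobject_del() to avoid the inversion */
            kobject_uevent(&q->kobj, KOBJ_REMOVE);
            kobject_del(&q->kobj);

            mutex_lock(&q->sysfs_lock);
            if (q->request_fn || (q->mq_ops && q->elevator))
                    elv_unregister_queue(q);
            mutex_unlock(&q->sysfs_lock);
    }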
     
  • For as long as I can remember DM has forced the block layer to allow
    the allocation and initialization of the request_queue to be distinct
    operations. The reason for this is that block/genhd.c:add_disk()
    requires that the request_queue (and associated bdi) be tied to the
    gendisk before add_disk() is called -- because add_disk() also deals
    with exposing the request_queue via blk_register_queue().

    DM's dynamic creation of arbitrary device types (and associated
    request_queue types) requires the DM device's gendisk be available so
    that DM table loads can establish a master/slave relationship with
    subordinate devices that are referenced by loaded DM tables -- using
    bd_link_disk_holder(). But until these DM tables, and their associated
    subordinate devices, are known, DM cannot know what type of
    request_queue it needs -- nor what its queue_limits should be.

    This chicken and egg scenario has created all manner of problems for DM
    and, at times, the block layer.

    Summary of changes:

    - Add device_add_disk_no_queue_reg() and an add_disk_no_queue_reg()
    variant that drivers may use to add a disk without also calling
    blk_register_queue(). The driver must call blk_register_queue() once
    its request_queue is fully initialized.

    - Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
    is not set. It won't be set if the driver used add_disk_no_queue_reg()
    but encountered an error and had to del_gendisk() before calling
    blk_register_queue().

    - Export blk_register_queue().

    These changes allow DM to use add_disk_no_queue_reg() to anchor its
    gendisk as the "master" for the master/slave relationships DM must
    establish with subordinate devices referenced in DM tables that get
    loaded. Once all "slave" devices for a DM device are known, its
    request_queue can be properly initialized and then advertised via
    sysfs -- the important improvement being that no request_queue
    resource initialization performed by blk_register_queue() is missed
    for DM devices anymore (see the usage sketch below).

    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe
    (cherry picked from commit fa70d2e2c4a0a54ced98260c6a176cc94c876d27)

    Mike Snitzer
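
    A hedged sketch of the driver-side flow this enables (function names
    from the commit; the step in between is illustrative):

    add_disk_no_queue_reg(disk);    /* expose the gendisk, skip queue reg */

    /* ... load tables, pick the request_queue type, set queue_limits ... */

    blk_register_queue(disk);       /* advertise the now-initialized queue */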
     
  • The original commit e9a823fb34a8b (block: fix warning when I/O elevator
    is changed as request_queue is being removed) is pretty conflated:
    the resource being protected by q->sysfs_lock isn't the queue_flags
    (it is the 'queue' kobj).

    q->sysfs_lock serializes __elevator_change() (via elv_iosched_store)
    against a racing blk_unregister_queue():
    1) By holding q->sysfs_lock first, __elevator_change() can complete
    before a racing blk_unregister_queue().
    2) Conversely, __elevator_change() tests for QUEUE_FLAG_REGISTERED so
    that, if elv_iosched_store() loses the race with blk_unregister_queue(),
    it has a way to know the 'queue' kobj is no longer there.

    Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is
    held until after the 'queue' kobj is removed.

    To do so blk_mq_unregister_dev() must not also take q->sysfs_lock. So
    rename __blk_mq_unregister_dev() to blk_mq_unregister_dev().

    Also, blk_unregister_queue() should use q->queue_lock to protect against
    any concurrent writes to q->queue_flags -- even though chances are the
    queue is being cleaned up so no concurrent writes are likely.

    Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as request_queue is being removed")
    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe
    (cherry picked from commit 667257e8b2988c0183ba23e2bcd6900e87961606)

    Mike Snitzer
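
    A hedged sketch of blk_unregister_queue() as of this commit (before
    the later narrowing shown above; simplified):

    mutex_lock(&q->sysfs_lock);

    spin_lock_irq(q->queue_lock);
    queue_flag_clear(QUEUE_FLAG_REGISTERED, q);  /* queue_lock guards flags */
    spin_unlock_irq(q->queue_lock);

    if (q->mq_ops)
            blk_mq_unregister_dev(disk_to_dev(disk), q);

    kobject_uevent(&q->kobj, KOBJ_REMOVE);
    kobject_del(&q->kobj);          /* 'queue' kobj is gone only now */

    mutex_unlock(&q->sysfs_lock);   /* held until past kobject_del() */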
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be
    applied to a file was done in a spreadsheet of side-by-side results
    from the output of two independent scanners (ScanCode & Windriver)
    producing SPDX tag:value files created by Philippe Ombredanne. Philippe
    prepared the base worksheet, and did an initial spot review of a few
    thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file-by-file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    should be applied to the file. She confirmed any determination that was
    not immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
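
    For reference, the identifier added by this patch is a single comment
    line at the top of each file, e.g. for a C source file:

    // SPDX-License-Identifier: GPL-2.0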
     

29 Aug, 2017

1 commit

  • There is a race between changing I/O elevator and request_queue removal
    which can trigger the warning in kobject_add_internal. A program can
    use sysfs to request a change of elevator at the same time another task
    is unregistering the request_queue the elevator would be attached to.
    An attempt is then made to connect the elevator's kobject to the
    request_queue in the object tree just after the request_queue has been
    removed from sysfs. This triggers the warning in kobject_add_internal
    as the request_queue no longer has a sysfs directory:

    kobject_add_internal failed for iosched (error: -2 parent: queue)
    ------------[ cut here ]------------
    WARNING: CPU: 3 PID: 14075 at lib/kobject.c:244 kobject_add_internal+0x103/0x2d0

    To fix this warning, we can check the QUEUE_FLAG_REGISTERED flag when
    changing the elevator and use the request_queue's sysfs_lock to
    serialize between clearing the flag and the elevator testing the flag.

    Signed-off-by: David Jeffery
    Tested-by: Ming Lei
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    David Jeffery
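
    A hedged sketch of the elevator-side check this commit describes
    (exact placement in __elevator_change() simplified):

    static int __elevator_change(struct request_queue *q, const char *name)
    {
            /* bail out if blk_unregister_queue() already removed the
             * queue from sysfs; serialized against it via q->sysfs_lock */
            if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags))
                    return -ENOENT;
            ...
    }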
     

15 Jun, 2017

1 commit

  • Avoid that the following complaint is reported:

    BUG: sleeping function called from invalid context at kernel/workqueue.c:2790
    in_atomic(): 1, irqs_disabled(): 0, pid: 41, name: rcuop/3
    1 lock held by rcuop/3/41:
    #0: (rcu_callback){......}, at: [] rcu_nocb_kthread+0x282/0x500
    Call Trace:
    dump_stack+0x86/0xcf
    ___might_sleep+0x174/0x260
    __might_sleep+0x4a/0x80
    flush_work+0x7e/0x2e0
    __cancel_work_timer+0x143/0x1c0
    cancel_work_sync+0x10/0x20
    blk_throtl_exit+0x25/0x60
    blkcg_exit_queue+0x35/0x40
    blk_release_queue+0x42/0x130
    kobject_put+0xa9/0x190

    This happens since we invoke callbacks that need to block from the
    queue release handler. Fix this by pushing the final release to
    a workqueue.

    Reported-by: Ross Zwisler
    Fixes: commit b425e5049258 ("block: Avoid that blk_exit_rl() triggers a use-after-free")
    Signed-off-by: Bart Van Assche
    Tested-by: Ross Zwisler

    Updated changelog
    Signed-off-by: Jens Axboe

    Bart Van Assche
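
    A hedged sketch of the deferral (assuming a release_work member added
    to struct request_queue, as described above):

    static void __blk_release_queue(struct work_struct *work)
    {
            struct request_queue *q =
                    container_of(work, struct request_queue, release_work);

            /* blocking teardown such as blkcg_exit_queue() is safe here */
            ...
    }

    static void blk_release_queue(struct kobject *kobj)
    {
            struct request_queue *q =
                    container_of(kobj, struct request_queue, kobj);

            INIT_WORK(&q->release_work, __blk_release_queue);
            schedule_work(&q->release_work);
    }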
     

02 Jun, 2017

1 commit

  • Since the introduction of .init_rq_fn() and .exit_rq_fn() it is
    essential that the memory allocated for struct request_queue
    stays around until all blk_exit_rl() calls have finished. Hence
    make blk_init_rl() take a reference on struct request_queue.

    This patch fixes the following crash:

    general protection fault: 0000 [#2] SMP
    CPU: 3 PID: 28 Comm: ksoftirqd/3 Tainted: G D 4.12.0-rc2-dbg+ #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
    task: ffff88013a108040 task.stack: ffffc9000071c000
    RIP: 0010:free_request_size+0x1a/0x30
    RSP: 0018:ffffc9000071fd38 EFLAGS: 00010202
    RAX: 6b6b6b6b6b6b6b6b RBX: ffff880067362a88 RCX: 0000000000000003
    RDX: ffff880067464178 RSI: ffff880067362a88 RDI: ffff880135ea4418
    RBP: ffffc9000071fd40 R08: 0000000000000000 R09: 0000000100180009
    R10: ffffc9000071fd38 R11: ffffffff81110800 R12: ffff88006752d3d8
    R13: ffff88006752d3d8 R14: ffff88013a108040 R15: 000000000000000a
    FS: 0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fa8ec1edb00 CR3: 0000000138ee8000 CR4: 00000000001406e0
    Call Trace:
    mempool_destroy.part.10+0x21/0x40
    mempool_destroy+0xe/0x10
    blk_exit_rl+0x12/0x20
    blkg_free+0x4d/0xa0
    __blkg_release_rcu+0x59/0x170
    rcu_process_callbacks+0x260/0x4e0
    __do_softirq+0x116/0x250
    smpboot_thread_fn+0x123/0x1e0
    kthread+0x109/0x140
    ret_from_fork+0x31/0x40

    Fixes: commit e9c787e65c0c ("scsi: allocate scsi_cmnd structures as part of struct request")
    Signed-off-by: Bart Van Assche
    Acked-by: Tejun Heo
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Cc: Jan Kara
    Cc: # v4.11+
    Signed-off-by: Jens Axboe

    Bart Van Assche
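
    A hedged sketch of the reference counting this adds:

    int blk_init_rl(struct request_list *rl, struct request_queue *q,
                    gfp_t gfp_mask)
    {
            ...
            if (!blk_get_queue(q))          /* pin q for the rl's lifetime */
                    return -ENODEV;
            ...
    }

    void blk_exit_rl(struct request_queue *q, struct request_list *rl)
    {
            if (rl->rq_pool) {
                    mempool_destroy(rl->rq_pool);
                    blk_put_queue(q);       /* drop blk_init_rl()'s reference */
            }
    }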
     

26 May, 2017

1 commit

  • The code in blk-mq-debugfs.c assumes that it is working on a blk-mq
    queue and is not intended to work on a blk-sq queue. Hence only
    register blk-mq debugfs attributes for blk-mq queues.

    Fixes: commit 9c1051aacde8 ("blk-mq: untangle debugfs and sysfs")
    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Ming Lei
    Reviewed-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Bart Van Assche
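
    A hedged sketch of the resulting guard in blk_register_queue():

    if (q->mq_ops) {
            blk_mq_register_dev(dev, q);
            blk_mq_debugfs_register(q);     /* blk-mq queues only */
    }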
     

04 May, 2017

2 commits

  • Originally, I tied debugfs registration/unregistration together with
    sysfs. There's no reason to do this, and it's getting in the way of
    letting schedulers define their own debugfs attributes. Instead, tie the
    debugfs registration to the lifetime of the structures themselves.

    The saner lifetimes mean we can also get rid of the extra mq directory
    and move everything one level up. I.e., nvme0n1/mq/hctx0/tags is now
    just nvme0n1/hctx0/tags.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Preparation for adding more declarations.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

27 Apr, 2017

1 commit

  • A later patch in this series will modify blk_mq_debugfs_register()
    such that it uses q->kobj.parent to determine the name of a
    request queue. Hence make sure that that pointer is initialized
    before blk_mq_debugfs_register() is called. To avoid lock inversion,
    protect sysfs / debugfs registration with the queue sysfs_lock
    instead of the global mutex all_q_mutex.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

19 Apr, 2017

1 commit

  • When CFQ is used as an elevator, it disables writeback throttling
    because they don't play well together. Later when a different elevator
    is chosen for the device, writeback throttling doesn't get enabled
    again as it should. Make sure CFQ enables writeback throttling (if it
    should be enabled by default) when we switch from it to another IO
    scheduler.

    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

08 Apr, 2017

1 commit

  • We've added a considerable amount of fixes for stalls and issues
    with the blk-mq scheduling in the 4.11 series since forking
    off the for-4.12/block branch. We need to do improvements on
    top of that for 4.12, so pull in the previous fixes to make
    our lives easier going forward.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

07 Apr, 2017

1 commit

  • In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
    back to the original scheduler. However, at this point, we've already
    torn down the original scheduler's tags, so this causes a crash. Doing
    the fallback like the legacy elevator path is much harder for mq, so fix
    it by just falling back to none, instead.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

29 Mar, 2017

2 commits

  • CONFIG_DEBUG_TEST_DRIVER_REMOVE found a possible leak of q->rq_wb when a
    request queue is reregistered. This has been a problem since wbt was
    introduced, but the WARN_ON(!list_empty(&stats->callbacks)) in the
    blk-stat rework exposed it. Fix it by cleaning up wbt when we unregister
    the queue.

    Fixes: 87760e5eef35 ("block: hook up writeback throttling")
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
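
    A hedged sketch of the cleanup (wbt_exit() being the blk-wbt teardown
    hook; placement per the commit's description):

    /* in blk_unregister_queue(), so a later re-register starts clean */
    wbt_exit(q);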
     
  • Now that the remaining drivers have been converted to one request queue
    per gendisk, let's warn if a request queue gets registered more than
    once. This will catch future drivers which might do it inadvertently or
    any old drivers that I may have missed.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

28 Mar, 2017

2 commits

  • The throtl_slice is 100ms by default. This is a long time for an SSD;
    a lot of IO can run in it. To give cgroups smoother throughput, we
    choose a small value (20ms) for SSDs.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • throtl_slice is important for blk-throttling. It's called a slice
    internally but it really is the time window over which blk-throttling
    samples data. blk-throttling makes decisions based on those samplings.
    An example is bandwidth measurement: a cgroup's bandwidth is measured
    over the time interval of throtl_slice.

    A small throtl_slice means cgroups have smoother throughput but burn
    more CPU. The 100ms default is not appropriate for all disks; a fast
    SSD can dispatch a lot of IOs in 100ms. This patch makes it tunable.

    Since throtl_slice isn't a time slice, the sysfs name
    'throttle_sample_time' reflects its character better.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

22 Mar, 2017

2 commits

  • Currently, statistics are gathered in ~0.13s windows, and users grab the
    statistics whenever they need them. This is not ideal for either of the
    in-tree users:

    1. Writeback throttling wants its own dynamically sized window of
    statistics. Since the blk-stats statistics are reset after every
    window and the wbt windows don't line up with the blk-stats windows,
    wbt doesn't see every I/O.
    2. Polling currently grabs the statistics on every I/O. Again, depending
    on how the window lines up, we may miss some I/Os. It's also
    unnecessary overhead to get the statistics on every I/O; the hybrid
    polling heuristic would be just as happy with the statistics from the
    previous full window.

    This reworks the blk-stats infrastructure to be callback-based: users
    register a callback that they want called at a given time with all of
    the statistics from the window during which the callback was active.
    Users can dynamically bucketize the statistics. wbt and polling both
    currently use read vs. write, but polling can be extended to further
    subdivide based on request size.

    The callbacks are kept on an RCU list, and each callback has percpu
    stats buffers. There will only be a few users, so the overhead on the
    I/O completion side is low. The stats flushing is also simplified
    considerably: since the timer function is responsible for clearing the
    statistics, we don't have to worry about stale statistics.

    wbt is a trivial conversion. After the conversion, the windowing problem
    mentioned above is fixed.

    For polling, we register an extra callback that caches the previous
    window's statistics in the struct request_queue for the hybrid polling
    heuristic to use.

    Since we no longer have a single stats buffer for the request queue,
    this also removes the sysfs and debugfs stats entries. To replace those,
    we add a debugfs entry for the poll statistics.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
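
    A hedged sketch of the callback API from a user's point of view
    (timer_fn, bucket_fn and data are the user's own; function names per
    this rework):

    struct blk_stat_callback *cb;

    /* two buckets, e.g. READ and WRITE; bucket_fn maps a request to one */
    cb = blk_stat_alloc_callback(timer_fn, bucket_fn, 2, data);
    blk_stat_add_callback(q, cb);           /* onto the queue's RCU list */
    blk_stat_activate_msecs(cb, 100);       /* arm a 100 ms window; timer_fn
                                               runs with that window's stats */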
     
  • The stats buckets will become generic soon, so make the existing users
    use the common READ and WRITE definitions instead of one internal to
    blk-stat.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

03 Mar, 2017

1 commit

  • For legacy scheduling, we always call ioc_exit_icq() with both the
    ioc and queue lock held. This poses a problem for blk-mq with
    scheduling, since the queue lock isn't what we use in the scheduler.
    And since we don't need the queue lock held for ioc exit there,
    don't grab it and leave any extra locking up to the blk-mq scheduler.

    Reported-by: Paolo Valente
    Tested-by: Paolo Valente
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

15 Feb, 2017

1 commit

  • When a new disk shows up, the sysfs queue directory is created before
    the elevator is registered. This allows a user to attempt a scheduler
    switch even though the initial registration hasn't completed yet.

    In one scenario, blk_register_queue() calls elv_register_queue() and
    right before cfq_registered_queue() is called, another process executes
    elevator_switch() and replaces q->elevator with deadline scheduler. When
    cfq_registered_queue() executes it interprets e->elevator_data as struct
    cfq_data even though it is actually struct deadline_data.

    Grab q->sysfs_lock in blk_register_queue() to synchronize with sysfs
    callers.

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Jens Axboe

    Tahsin Erdogan
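
    A hedged sketch of the registration side with the added
    synchronization (simplified):

    int blk_register_queue(struct gendisk *disk)
    {
            ...
            mutex_lock(&q->sysfs_lock);     /* keep elevator_switch() out */
            ...
            ret = elv_register_queue(q);
            ...
            mutex_unlock(&q->sysfs_lock);
            ...
    }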
     

09 Feb, 2017

1 commit

  • Add a new merge strategy, wired up from the plug merging code, that
    merges discard bios into a request until the maximum number of discard
    ranges (or the maximum discard size) is reached. I/O scheduler merging
    is not wired up yet but might also be useful, although not for fast
    devices like NVMe, which are the only users for now.

    Note that for now we don't support limiting the size of each discard range,
    but if needed that can be added later.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

07 Feb, 2017

1 commit

  • I noticed that when booting with a default blk-mq I/O scheduler, the
    /sys/block/*/queue/iosched directory was missing. However, switching
    after boot did create the directory. This is because we skip the initial
    elevator register/unregister when we don't have a ->request_fn(), but we
    should still do it for the ->mq_ops case.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

02 Feb, 2017

2 commits

  • Instead of storing backing_dev_info inside struct request_queue,
    allocate it dynamically, reference count it, and free it when the last
    reference is dropped. Currently only request_queue holds the reference
    but in the following patch we add other users referencing
    backing_dev_info.

    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
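
    A hedged sketch of the lifetime change (bdi_alloc()/bdi_put() being
    the allocation and reference-drop entry points this series introduces):

    q->backing_dev_info = bdi_alloc(gfp_mask);      /* refcounted object */
    if (!q->backing_dev_info)
            goto fail;
    ...
    bdi_put(q->backing_dev_info);   /* freed when the last ref is dropped */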
     
  • We will want to have struct backing_dev_info allocated separately from
    struct request_queue. As a first step, add a pointer to
    backing_dev_info to request_queue and convert all users touching it.
    No functional changes in this patch.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

14 Dec, 2016

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the main block pull request this series. Contrary to previous
    release, I've kept the core and driver changes in the same branch. We
    always ended up having dependencies between the two for obvious
    reasons, so makes more sense to keep them together. That said, I'll
    probably try and keep more topical branches going forward, especially
    for cycles that end up being as busy as this one.

    The major parts of this pull request is:

    - Improved support for O_DIRECT on block devices, with a small
    private implementation instead of using the pig that is
    fs/direct-io.c. From Christoph.

    - Request completion tracking in a scalable fashion. This is utilized
    by two components in this pull, the new hybrid polling and the
    writeback queue throttling code.

    - Improved support for polling with O_DIRECT, adding a hybrid mode
    that combines pure polling with an initial sleep. From me.

    - Support for automatic throttling of writeback queues on the block
    side. This uses feedback from the device completion latencies to
    scale the queue on the block side up or down. From me.

    - Support for SMR drives in the block layer and for SD. From Hannes
    and Shaun.

    - Multi-connection support for nbd. From Josef.

    - Cleanup of request and bio flags, so we have a clear split between
    which are bio (or rq) private, and which ones are shared. From
    Christoph.

    - A set of patches from Bart, that improve how we handle queue
    stopping and starting in blk-mq.

    - Support for WRITE_ZEROES from Chaitanya.

    - Lightnvm updates from Javier/Matias.

    - Support for FC for the nvme-over-fabrics code. From James Smart.

    - A bunch of fixes from a whole slew of people, too many to name
    here"

    * 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
    blk-stat: fix a few cases of missing batch flushing
    blk-flush: run the queue when inserting blk-mq flush
    elevator: make the rqhash helpers exported
    blk-mq: abstract out blk_mq_dispatch_rq_list() helper
    blk-mq: add blk_mq_start_stopped_hw_queue()
    block: improve handling of the magic discard payload
    blk-wbt: don't throttle discard or write zeroes
    nbd: use dev_err_ratelimited in io path
    nbd: reset the setup task for NBD_CLEAR_SOCK
    nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
    nvme-fabrics: Add target support for FC transport
    nvme-fabrics: Add host support for FC transport
    nvme-fabrics: Add FC transport LLDD api definitions
    nvme-fabrics: Add FC transport FC-NVME definitions
    nvme-fabrics: Add FC transport error codes to nvme.h
    Add type 0x28 NVME type code to scsi fc headers
    nvme-fabrics: patch target code in prep for FC transport support
    nvme-fabrics: set sqe.command_id in core not transports
    parser: add u64 number parser
    nvme-rdma: align to generic ib_event logging helper
    ...

    Linus Torvalds
     

13 Dec, 2016

1 commit

  • We ran into a funky issue, where someone doing 256K buffered reads saw
    128K requests at the device level. Turns out it is read-ahead capping
    the request size, since we use 128K as the default setting. This
    doesn't make a lot of sense - if someone is issuing 256K reads, they
    should see 256K reads, regardless of the read-ahead setting, if the
    underlying device can support a 256K read in a single command.

    This patch introduces a bdi hint, io_pages. This is the soft max IO
    size for the lower level; I've hooked it up to the bdev settings here.
    Read-ahead is modified to issue the maximum of the user request size
    and the read-ahead max size, but capped to the max request size on the
    device side. The latter is done to avoid reading ahead too much, if the
    application asks for a huge read. With this patch, the kernel behaves
    like the application expects.

    Link: http://lkml.kernel.org/r/1479498073-8657-1-git-send-email-axboe@fb.com
    Signed-off-by: Jens Axboe
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jens Axboe
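
    A hedged sketch of the read-ahead calculation described above (close
    to the upstream ondemand_readahead() change):

    unsigned long max_pages = ra->ra_pages;

    /* a large user read may override the read-ahead max, capped to what
     * the lower level advertises via the new bdi->io_pages hint */
    if (req_size > max_pages && bdi->io_pages > max_pages)
            max_pages = min(req_size, bdi->io_pages);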
     

01 Dec, 2016

1 commit

  • This adds a new block layer operation to zero out a range of
    LBAs. It allows implementing zeroing for devices that don't use
    either discard with a predictable zero pattern or WRITE SAME of zeroes.
    The prominent example of that is NVMe with the Write Zeroes command,
    but in the future, this should also help with improving the way
    zeroing discards work. For this operation, a suitable entry is
    exported in sysfs which indicates the maximum number of bytes allowed
    in one write zeroes operation by the device.

    Signed-off-by: Chaitanya Kulkarni
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
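
    A hedged sketch of the two sides of the interface (the setter is the
    driver-facing queue-limit helper; the sysfs name follows the commit's
    description):

    /* driver side: advertise write zeroes support as a queue limit */
    blk_queue_max_write_zeroes_sectors(q, max_sectors);

    /* user side: the limit is exported in bytes as
     *   /sys/block/<disk>/queue/write_zeroes_max_bytes */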
     

18 Nov, 2016

2 commits

  • The previous commit introduced the hybrid sleep/poll mode. Take
    that one step further, and use the completion latencies to
    automatically sleep for half the mean completion time. This is
    a good approximation.

    This changes the 'io_poll_delay' sysfs file a bit to expose the
    various options. Depending on the value, the polling code will
    behave differently:

    -1 Never enter hybrid sleep mode
    0 Use half of the completion mean for the sleep delay
    >0 Use this specific value as the sleep delay

    Signed-off-by: Jens Axboe
    Tested-By: Stephen Bates
    Reviewed-By: Stephen Bates

    Jens Axboe
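
    A hedged sketch of the delay selection the three values map to
    (mean_nsecs standing in for the tracked mean completion time):

    if (q->poll_nsec == -1)
            return false;                   /* never enter hybrid sleep */
    else if (q->poll_nsec > 0)
            nsecs = q->poll_nsec;           /* fixed, user-provided delay */
    else
            nsecs = mean_nsecs / 2;         /* half the completion mean */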
     
  • This patch enables a hybrid polling mode. Instead of polling after IO
    submission, we can induce an artificial delay, and then poll after that.
    For example, if the IO is presumed to complete in 8 usecs from now, we
    can sleep for 4 usecs, wake up, and then do our polling. This still puts
    a sleep/wakeup cycle in the IO path, but instead of the wakeup happening
    after the IO has completed, it'll happen before. With this hybrid
    scheme, we can achieve big latency reductions while still using the same
    (or less) amount of CPU.

    Signed-off-by: Jens Axboe
    Tested-By: Stephen Bates
    Reviewed-By: Stephen Bates

    Jens Axboe
     

11 Nov, 2016

2 commits

  • Enable throttling of buffered writeback to make it a lot smoother,
    with way less impact on other system activity. Background writeback
    should be, by definition, background activity. The fact that we flush
    huge bundles of it at a time means that it potentially has heavy
    impacts on foreground workloads, which isn't ideal. We can't easily
    limit the sizes of writes that we do, since that would impact file
    system layout in the presence of delayed allocation. So just throttle
    back buffered writeback, unless someone is waiting for it.

    The algorithm for when to throttle takes its inspiration from the
    CoDel network scheduling algorithm. Like CoDel, blk-wb monitors
    the minimum latencies of requests over a window of time. In that
    window of time, if the minimum latency of any request exceeds a
    given target, then a scale count is incremented and the queue depth
    is shrunk. The next monitoring window is shrunk accordingly. Unlike
    CoDel, if we hit a window that exhibits good behavior, then we
    simply increment the scale count and re-calculate the limits for that
    scale value. This prevents us from oscillating between a
    close-to-ideal value and max all the time, instead remaining in the
    windows where we get good behavior.

    Unlike CoDel, blk-wb allows the scale count to go negative. This
    happens if we primarily have writes going on. Unlike positive
    scale counts, this doesn't change the size of the monitoring window.
    When the heavy writers finish, blk-wb quickly snaps back to its
    stable state of a zero scale count.

    The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
    target to be met. It defaults to 2 msec for non-rotational storage, and
    75 msec for rotational storage. Setting this value to '0' disables
    blk-wb. Generally, a user would not have to touch this setting.

    We don't enable WBT on devices that are managed with CFQ, and have
    a non-root block cgroup attached. If we have a proportional share setup
    on this particular disk, then the wbt throttling will interfere with
    that. We don't have a strong need for wbt for that case, since we will
    rely on CFQ doing that for us.

    Signed-off-by: Jens Axboe

    Jens Axboe
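
    A hedged pseudocode sketch of the window logic described above (helper
    names are illustrative, not the exact blk-wbt internals):

    /* at the end of each monitoring window */
    if (window_min_latency > target_latency) {
            scale_down();           /* bump scale count, shrink queue depth */
            shrink_next_window();
    } else if (window_showed_good_behavior) {
            scale_up();             /* step the limits up one scale value
                                     * instead of jumping straight back to
                                     * max, which avoids oscillating */
    }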
     
  • For legacy block, we simply track them in the request queue. For
    blk-mq, we track them on a per-sw queue basis, which we can then
    sum up through the hardware queues and finally to a per device
    state.

    The stats are tracked in, roughly, 0.1s interval windows.

    Add sysfs files to display the stats.

    The feature is off by default, to avoid any extra overhead. In-kernel
    users of it can turn it on by setting QUEUE_FLAG_STATS in the queue
    flags. We currently don't turn it on if someone just reads any of
    the stats files, that is something we could add as well.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

19 Oct, 2016

1 commit

  • The queue limits already have a 'chunk_sectors' setting, so
    we should be presenting it via sysfs.

    Signed-off-by: Hannes Reinecke

    [Damien: Updated Documentation/ABI/testing/sysfs-block]

    Signed-off-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Shaun Tancheff
    Tested-by: Shaun Tancheff
    Signed-off-by: Jens Axboe

    Hannes Reinecke
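
    A hedged sketch of the sysfs plumbing (queue_var_show() being the
    helper blk-sysfs.c uses for such attributes):

    static ssize_t queue_chunk_sectors_show(struct request_queue *q, char *page)
    {
            return queue_var_show(q->limits.chunk_sectors, page);
    }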