18 Aug, 2018

1 commit

  • commit 4baa8bb13f41307f3eb62fe91f93a1a798ebef53 upstream.

    This commit fixes a bug that causes bfq to fail to guarantee a high
    responsiveness on some drives, if there is heavy random read+write I/O
    in the background. More precisely, such a failure allowed this bug to
    be found [1], but the bug may well cause other yet unreported
    anomalies.

    BFQ raises the weight of the bfq_queues associated with soft real-time
    applications, to privilege the I/O, and thus reduce latency, for these
    applications. This mechanism is named soft-real-time weight raising in
    BFQ. A soft real-time period may happen to be nested into an
    interactive weight raising period, i.e., it may happen that, when a
    bfq_queue switches to a soft real-time weight-raised state, the
    bfq_queue is already being weight-raised because it is deemed
    interactive too. In this case, BFQ saves, in a special variable
    wr_start_at_switch_to_srt, the time instant when the interactive
    weight-raising period started for the bfq_queue, i.e., the time
    instant when BFQ started to deem the bfq_queue interactive. This value
    is then used to check whether the interactive weight-raising period
    would still be in progress when the soft real-time weight-raising
    period ends. If so, interactive weight raising is restored for the
    bfq_queue. This restore is useful, in particular, because it prevents
    bfq_queues from losing their interactive weight raising prematurely,
    as a consequence of spurious, short-lived soft real-time
    weight-raising periods caused by wrong detections as soft real-time.

    If, instead, a bfq_queue switches to soft-real-time weight raising
    while it *is not* already in an interactive weight-raising period,
    then the variable wr_start_at_switch_to_srt has no meaning during the
    following soft real-time weight-raising period. Unfortunately, the
    handling of this case is wrong in BFQ: not only is the variable not
    flagged as meaningless, it is also set to the time when
    the switch to soft real-time weight-raising occurs. This may cause an
    interactive weight-raising period to be considered mistakenly as still
    in progress, and thus a spurious interactive weight-raising period to
    start for the bfq_queue, at the end of the soft-real-time
    weight-raising period. In particular the spurious interactive
    weight-raising period will be considered as still in progress, if the
    soft-real-time weight-raising period does not last very long. The
    bfq_queue will then be wrongly privileged and, if I/O bound, will
    unjustly steal bandwidth from truly interactive or soft real-time
    bfq_queues, harming responsiveness and low latency.

    This commit fixes this issue by just setting wr_start_at_switch_to_srt
    to minus infinity (farthest past time instant according to jiffies
    macros): when the soft-real-time weight-raising period ends, certainly
    no interactive weight-raising period will be considered as still in
    progress.
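
    The effect of the "minus infinity" value can be illustrated with a
    minimal userspace sketch. Everything below is an assumption made for
    illustration: the type ticks_t, the helper names and the offset used
    for the farthest past instant are hypothetical and are not taken from
    the BFQ sources.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's jiffies counter type. */
typedef unsigned long ticks_t;

/* Wrap-safe "a is later than b", in the spirit of the kernel's time_after(). */
static bool ticks_after(ticks_t a, ticks_t b)
{
    return (long)(b - a) < 0;
}

/*
 * "Minus infinity": an instant so far in the past that adding any realistic
 * weight-raising duration still yields a time before now. The offset used
 * here is an assumption made for this sketch.
 */
static ticks_t smallest_from_now(ticks_t now)
{
    return now - (ticks_t)(LONG_MAX / 2 - 1);
}

/* True if an interactive period started at wr_start would still be running. */
static bool interactive_wr_in_progress(ticks_t wr_start, ticks_t wr_duration,
                                       ticks_t now)
{
    return ticks_after(wr_start + wr_duration, now);
}
```

    With the saved start set to the switch time, a short soft real-time
    period makes the check report an interactive period in progress; with
    the saved start set to smallest_from_now(), the check reports none.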

    [1] Background I/O Type: Random - Background I/O mix: Reads and writes
    - Application to start: LibreOffice Writer in
    http://www.phoronix.com/scan.php?page=news_item&px=Linux-4.13-IO-Laptop

    Signed-off-by: Paolo Valente
    Signed-off-by: Angelo Ruocco
    Tested-by: Oleksandr Natalenko
    Tested-by: Lee Tibbert
    Tested-by: Mirko Montanari
    Signed-off-by: Jens Axboe
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Paolo Valente
     

03 Aug, 2018

1 commit

  • [ Upstream commit a12bffebc0c9d6a5851f062aaea3aa7c4adc6042 ]

    In bfq_requests_merged(), there is a deadlock because the lock on
    bfqq->bfqd->lock is held by the calling function, but the code of
    this function tries to grab the lock again.

    This deadlock is currently hidden by another bug (fixed by next commit
    for this source file), which causes the body of bfq_requests_merged()
    to be never executed.

    This commit removes the deadlock by removing the lock/unlock pair.

    Signed-off-by: Filippo Muzzini
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Filippo Muzzini
     

02 May, 2018

1 commit

  • commit 72961c4e6082be79825265d9193272b8a1634dec upstream.

    Even if we don't have an IO context attached to a request, we still
    need to clear the priv[0..1] pointers, as they could be pointing
    to previously used bic/bfqq structures. If we don't do so, we'll
    either corrupt memory on dispatching a request, or cause an
    imbalance in counters.
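
    A minimal sketch of the idea, assuming a simplified request layout:
    the struct and function names below are hypothetical and only mimic
    the role of the priv[0..1] pointers described above.

```c
#include <stddef.h>

/* Hypothetical, simplified request: only the fields relevant to the sketch. */
struct request_sketch {
    void *io_context;   /* NULL when no I/O context is attached */
    void *priv[2];      /* scheduler-private pointers (bic, bfqq) */
};

/*
 * Prepare a request for the scheduler. The private pointers are cleared
 * before any early return, so a recycled request can never carry stale
 * pointers from a previous use. Returns 1 if the pointers were set.
 */
static int prepare_request_sketch(struct request_sketch *rq,
                                  void *bic, void *bfqq)
{
    rq->priv[0] = NULL;
    rq->priv[1] = NULL;

    if (rq->io_context == NULL)
        return 0;

    rq->priv[0] = bic;
    rq->priv[1] = bfqq;
    return 1;
}
```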

    Inspired by a fix from Kees.

    Reported-by: Oleksandr Natalenko
    Reported-by: Kees Cook
    Cc: stable@vger.kernel.org
    Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     

25 Dec, 2017

1 commit

  • [ Upstream commit b5dc5d4d1f4ff9032eb6c21a3c571a1317dc9289 ]

    Similarly to CFQ, BFQ has its write-throttling heuristics, and it
    is better not to combine them with further write-throttling
    heuristics of a different nature.
    So this commit disables write-back throttling for a device if BFQ
    is used as I/O scheduler for that device.

    Signed-off-by: Luca Miccio
    Signed-off-by: Paolo Valente
    Tested-by: Oleksandr Natalenko
    Tested-by: Lee Tibbert
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Luca Miccio
     

10 Sep, 2017

1 commit

  • Pull followup block layer updates from Jens Axboe:
    "I ended up splitting the main pull request for this series into two,
    mainly because of clashes between NVMe fixes that went into 4.13 after
    the for-4.14 branches were split off. This pull request is mostly
    NVMe, but not exclusively. In detail, it contains:

    - Two pull requests for NVMe changes from Christoph. Nothing new on
    the feature front, basically just fixes all over the map for the
    core bits, transport, rdma, etc.

    - Series from Bart, cleaning up various bits in the BFQ scheduler.

    - Series of bcache fixes, which has been lingering for a release or
    two. Coly sent this in, but patches from various people in this
    area.

    - Set of patches for BFQ from Paolo himself, updating both
    documentation and fixing some corner cases in performance.

    - Series from Omar, attempting to now get the 4k loop support
    correct. Our confidence level is higher this time.

    - Series from Shaohua for loop as well, improving O_DIRECT
    performance and fixing a use-after-free"

    * 'for-4.14/block-postmerge' of git://git.kernel.dk/linux-block: (74 commits)
    bcache: initialize dirty stripes in flash_dev_run()
    loop: set physical block size to logical block size
    bcache: fix bch_hprint crash and improve output
    bcache: Update continue_at() documentation
    bcache: silence static checker warning
    bcache: fix for gc and write-back race
    bcache: increase the number of open buckets
    bcache: Correct return value for sysfs attach errors
    bcache: correct cache_dirty_target in __update_writeback_rate()
    bcache: gc does not work when triggering by manual command
    bcache: Don't reinvent the wheel but use existing llist API
    bcache: do not subtract sectors_to_gc for bypassed IO
    bcache: fix sequential large write IO bypass
    bcache: Fix leak of bdev reference
    block/loop: remove unused field
    block/loop: fix use after free
    bfq: Use icq_to_bic() consistently
    bfq: Suppress compiler warnings about comparisons
    bfq: Check kstrtoul() return value
    bfq: Declare local functions static
    ...

    Linus Torvalds
     

02 Sep, 2017

4 commits

  • Some code uses icq_to_bic() to convert an io_cq pointer to a
    bfq_io_cq pointer while other code uses a direct cast. Convert
    the code that uses a direct cast such that it uses icq_to_bic().
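
    As an illustration of why a single conversion helper is preferable to
    scattered casts, here is a userspace sketch of a container_of-style
    conversion. The struct layouts and the _sketch names are assumptions
    made for this example, not the kernel definitions.

```c
#include <stddef.h>

/* Hypothetical, simplified versions of the two structures. */
struct io_cq_sketch {
    int id;
};

struct bfq_io_cq_sketch {
    struct io_cq_sketch icq;    /* embedded generic part */
    int saved_state;            /* scheduler-specific part */
};

/*
 * Convert a pointer to the embedded member back to its container by
 * subtracting the member offset. Unlike a direct cast, this stays
 * correct even if the member is not the first field.
 */
static struct bfq_io_cq_sketch *icq_to_bic_sketch(struct io_cq_sketch *icq)
{
    return (struct bfq_io_cq_sketch *)
        ((char *)icq - offsetof(struct bfq_io_cq_sketch, icq));
}
```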

    Acked-by: Paolo Valente
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • This patch prevents the following warnings from being reported when
    building with W=1:

    block/bfq-iosched.c: In function 'bfq_back_seek_max_store':
    block/bfq-iosched.c:4860:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/bfq-iosched.c:4876:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, 0, INT_MAX, 0);
    ^~~~~~~~~~~~~~
    block/bfq-iosched.c: In function 'bfq_slice_idle_store':
    block/bfq-iosched.c:4860:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/bfq-iosched.c:4879:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 2);
    ^~~~~~~~~~~~~~
    block/bfq-iosched.c: In function 'bfq_slice_idle_us_store':
    block/bfq-iosched.c:4892:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/bfq-iosched.c:4899:1: note: in expansion of macro 'USEC_STORE_FUNCTION'
    USEC_STORE_FUNCTION(bfq_slice_idle_us_store, &bfqd->bfq_slice_idle, 0,
    ^~~~~~~~~~~~~~~~~~~

    Acked-by: Paolo Valente
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Make sysfs writes fail for invalid numbers instead of storing
    uninitialized data copied from the stack. This patch removes
    all uninitialized_var() occurrences from the BFQ source code.
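
    The pattern can be sketched in userspace with strtoul(), which is only
    an approximation of kstrtoul() (for instance, strtoul() accepts a
    leading minus sign). The function name store_ulong_sketch is
    hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse buf as a decimal number and store it in *dst only on success.
 * On invalid input, *dst is left untouched and a negative error code is
 * returned, so no uninitialized value can ever be stored.
 */
static int store_ulong_sketch(const char *buf, unsigned long *dst)
{
    char *end;
    unsigned long val;

    errno = 0;
    val = strtoul(buf, &end, 10);
    if (errno != 0 || end == buf)
        return -EINVAL;

    /* Tolerate one trailing newline, as produced by "echo 42 > attr". */
    if (*end == '\n')
        end++;
    if (*end != '\0')
        return -EINVAL;

    *dst = val;
    return 0;
}
```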

    Acked-by: Paolo Valente
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • This patch prevents gcc 7 from issuing a fall-through warning
    when building with W=1.

    Acked-by: Paolo Valente
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

31 Aug, 2017

1 commit

  • To provide a very smooth service, bfq starts to serve a bfq_queue
    only if the queue is 'eligible', i.e., if the same queue would
    have started to be served in the ideal, perfectly fair system that
    bfq simulates internally. This is obtained by associating each
    queue with a virtual start time, and by computing a special system
    virtual time quantity: a queue is eligible only if the system
    virtual time has reached the virtual start time of the
    queue. Finally, bfq guarantees that, when a new queue must be set
    in service, there is always at least one eligible entity for each
    active parent entity in the scheduler. To provide this guarantee,
    the function __bfq_lookup_next_entity pushes up, for each parent
    entity on which it is invoked, the system virtual time to the
    minimum among the virtual start times of the entities in the
    active tree for the parent entity (more precisely, the push up
    occurs if the system virtual time happens to be lower than all
    such virtual start times).

    There is however a circumstance in which __bfq_lookup_next_entity
    cannot push up the system virtual time for a parent entity, even
    if the system virtual time is lower than the virtual start times
    of all the child entities in the active tree. It happens if one of
    the child entities is in service. In fact, in such a case, there
    is already an eligible entity, the in-service one, even if it may
    not be present in the active tree (because in-service entities
    may be removed from the active tree).

    Unfortunately, in the last re-design of the
    hierarchical-scheduling engine, the reset of the pointer to the
    in-service entity for a given parent entity--reset to be done as a
    consequence of the expiration of the in-service entity--always
    happens after the function __bfq_lookup_next_entity has been
    invoked. This causes the function to think that there is still an
    entity in service for the parent entity, and then that the system
    virtual time cannot be pushed up, even if actually such a
    no-more-in-service entity has already been properly reinserted
    into the active tree (or in some other tree if no longer
    active). Yet, the system virtual time *had* to be pushed up, to be
    ready to correctly choose the next queue to serve. Because of the
    lack of this push up, bfq may wrongly set in service a queue that
    had been speculatively pre-computed as the possible
    next-in-service queue, but that would no more be the one to serve
    after the expiration and the reinsertion into the active trees of
    the previously in-service entities.

    This commit addresses this issue by making
    __bfq_lookup_next_entity properly push up the system virtual time
    if an expiration is occurring.
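
    The eligibility rule and the push-up of the system virtual time can be
    sketched as follows. This is a simplified model under stated
    assumptions: a flat array replaces the active tree, plain comparisons
    replace wrap-safe ones, and all names (sched_sketch, push_up_vtime,
    the expiration flag) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

typedef unsigned long long vtime_t;

struct sched_sketch {
    vtime_t vtime;          /* system virtual time */
    const vtime_t *start;   /* virtual start times of the active entities */
    size_t nr_active;       /* number of entries in start[] */
    bool in_service_set;    /* in-service pointer has not been reset yet */
};

/* An entity is eligible once the system virtual time reaches its start. */
static bool is_eligible(const struct sched_sketch *s, size_t i)
{
    return s->start[i] <= s->vtime;
}

/*
 * Push the system virtual time up to the minimum start time, so that at
 * least one active entity is eligible. The push-up is skipped while an
 * entity is in service, unless that entity is being expired.
 */
static void push_up_vtime(struct sched_sketch *s, bool expiration)
{
    vtime_t min_start;
    size_t i;

    if (s->nr_active == 0)
        return;
    if (s->in_service_set && !expiration)
        return;

    min_start = s->start[0];
    for (i = 1; i < s->nr_active; i++)
        if (s->start[i] < min_start)
            min_start = s->start[i];

    if (s->vtime < min_start)
        s->vtime = min_start;
}
```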

    Signed-off-by: Paolo Valente
    Tested-by: Lee Tibbert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Jens Axboe

    Paolo Valente
     

30 Aug, 2017

1 commit


29 Aug, 2017

1 commit


24 Aug, 2017

1 commit


11 Aug, 2017

2 commits

  • When a queue associated with a process remains empty, there are cases
    where throughput gets boosted if the device is idled to await the
    arrival of a new I/O request for that queue. Currently, BFQ assumes
    that one of these cases is when the device has no internal queueing
    (regardless of the properties of the I/O being served). Unfortunately,
    this condition has proved to be too general. So, this commit refines it
    as "the device has no internal queueing and is rotational".

    This refinement provides a significant throughput boost with random
    I/O, on flash-based storage without internal queueing. For example, on
    a HiKey board, throughput increases by up to 125%, growing, e.g., from
    6.9MB/s to 15.6MB/s with two or three random readers in parallel.
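
    The refinement amounts to one extra conjunct in the condition. A
    minimal sketch, with a hypothetical device descriptor and hypothetical
    function names:

```c
#include <stdbool.h>

/* Hypothetical device description used only for this sketch. */
struct device_sketch {
    bool has_internal_queueing;
    bool rotational;
};

/* Condition before the refinement: any device without internal queueing. */
static bool idling_boosts_throughput_old(const struct device_sketch *d)
{
    return !d->has_internal_queueing;
}

/* Refined condition: no internal queueing and rotational. */
static bool idling_boosts_throughput_new(const struct device_sketch *d)
{
    return !d->has_internal_queueing && d->rotational;
}
```

    For flash storage without internal queueing, the old condition holds
    while the refined one does not, which is the case the throughput
    figures above refer to.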

    Signed-off-by: Paolo Valente
    Signed-off-by: Luca Miccio
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • The logic that decides whether to idle the device is scattered across
    three functions. Almost all of the logic is in the function
    bfq_bfqq_may_idle, but (1) part of the decision is made in
    bfq_update_idle_window, and (2) the function bfq_bfqq_must_idle may
    switch off idling regardless of the output of bfq_bfqq_may_idle. In
    addition, both bfq_update_idle_window and bfq_bfqq_must_idle make
    their decisions as a function of parameters that are used, for similar
    purposes, also in bfq_bfqq_may_idle. This commit addresses these
    issues by moving all the logic into bfq_bfqq_may_idle.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

12 Jul, 2017

1 commit

  • There are mq devices (e.g., virtio-blk, nbd and loopback) which don't
    invoke blk_mq_run_hw_queues() after the completion of a request.
    If bfq is enabled on these devices and the slice_idle attribute or
    strict_guarantees attribute is set to zero, it is possible that,
    after a request completion, the remaining requests of a busy bfq queue
    will be stalled in the bfq scheduler until a new request arrives.

    To fix the scheduler latency problem, we need to check whether or not
    all issued requests have completed, and dispatch more requests to the
    driver if there is no request in the driver.

    The problem can be reproduced by running the following script
    on a virtio-blk device with nr_hw_queues as 1:

    #!/bin/sh

    dev=vdb
    # mount point for dev
    mp=/tmp/mnt
    cd $mp

    job=strict.job
    cat <<EOF > $job
    [global]
    direct=1
    bs=4k
    size=256M
    rw=write
    ioengine=libaio
    iodepth=128
    runtime=5
    time_based

    [1]
    filename=1.data

    [2]
    new_group
    filename=2.data
    EOF

    echo bfq > /sys/block/$dev/queue/scheduler
    echo 1 > /sys/block/$dev/queue/iosched/strict_guarantees
    fio $job

    Signed-off-by: Hou Tao
    Reviewed-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Hou Tao
     

04 Jul, 2017

1 commit

  • On each deactivation or re-scheduling (after being served) of a
    bfq_queue, BFQ invokes the function __bfq_entity_update_weight_prio(),
    to perform pending updates of ioprio, weight and ioprio class for the
    bfq_queue. BFQ also invokes this function on I/O-request dispatches,
    to raise or lower weights more quickly when needed, thereby improving
    latency. However, the entity representing the bfq_queue may be on the
    active (sub)tree of a service tree when this happens, and, although
    with a very low probability, the bfq_queue may happen to also have a
    pending change of its ioprio class. If both conditions hold when
    __bfq_entity_update_weight_prio() is invoked, then the entity moves to
    a sort of hybrid state: the new service tree for the entity, as
    returned by bfq_entity_service_tree(), differs from service tree on
    which the entity still is. The functions that handle activations and
    deactivations of entities do not cope with such a hybrid state (and
    would need to become more complex to cope).

    This commit addresses this issue by just making
    __bfq_entity_update_weight_prio() not perform also a possible pending
    change of ioprio class, when invoked on an I/O-request dispatch for a
    bfq_queue. Such a change is thus postponed to when
    __bfq_entity_update_weight_prio() is invoked on deactivation or
    re-scheduling of the bfq_queue.
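
    The postponement can be sketched with a flag that tells the update
    function whether it may also apply a pending class change. The struct
    and the function below are simplified, hypothetical stand-ins for the
    real entity and for __bfq_entity_update_weight_prio().

```c
#include <stdbool.h>

/* Hypothetical, simplified scheduling entity. */
struct entity_sketch {
    int weight;
    int new_weight;
    int ioprio_class;
    int new_ioprio_class;
    bool prio_changed;      /* some update is still pending */
};

/*
 * Apply pending updates. On dispatch the caller passes
 * update_class_too == false, so a pending class change stays pending
 * and the entity never ends up between two service trees.
 */
static void update_weight_prio_sketch(struct entity_sketch *e,
                                      bool update_class_too)
{
    if (!e->prio_changed)
        return;

    e->weight = e->new_weight;

    if (update_class_too)
        e->ioprio_class = e->new_ioprio_class;

    /* Clear the flag only if no class change is pending any longer. */
    if (e->ioprio_class == e->new_ioprio_class)
        e->prio_changed = false;
}
```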

    Reported-by: Marco Piazza
    Reported-by: Laurentiu Nicola
    Signed-off-by: Paolo Valente
    Tested-by: Marco Piazza
    Signed-off-by: Jens Axboe

    Paolo Valente
     

28 Jun, 2017

1 commit

  • This commit fixes a bug triggered by a non-trivial sequence of
    events. These events are briefly described in the next two
    paragraphs. The impatient, or those who are familiar with queue
    merging and splitting, can jump directly to the last paragraph.

    On each I/O-request arrival for a shared bfq_queue, i.e., for a
    bfq_queue that is the result of the merge of two or more bfq_queues,
    BFQ checks whether the shared bfq_queue has become seeky (i.e., if too
    many random I/O requests have arrived for the bfq_queue; if the device
    is non rotational, then random requests must be also small for the
    bfq_queue to be tagged as seeky). If the shared bfq_queue is actually
    detected as seeky, then a split occurs: the bfq I/O context of the
    process that has issued the request is redirected from the shared
    bfq_queue to a new non-shared bfq_queue. As a degenerate case, if the
    shared bfq_queue actually happens to be shared only by one process
    (because of previous splits), then no new bfq_queue is created: the
    state of the shared bfq_queue is just changed from shared to non
    shared.

    Regardless of whether a brand new non-shared bfq_queue is created, or
    the pre-existing shared bfq_queue is just turned into a non-shared
    bfq_queue, several parameters of the non-shared bfq_queue are set
    (restored) to the original values they had when the bfq_queue
    associated with the bfq I/O context of the process (that has just
    issued an I/O request) was merged with the shared bfq_queue. One of
    these parameters is the weight-raising state.

    If, on the split of a shared bfq_queue,
    1) a pre-existing shared bfq_queue is turned into a non-shared
    bfq_queue;
    2) the previously shared bfq_queue happens to be busy;
    3) the weight-raising state of the previously shared bfq_queue happens
    to change;
    the number of weight-raised busy queues changes. The field
    wr_busy_queues must then be updated accordingly, but such an update
    was missing. This commit adds the missing update.
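
    A sketch of the missing bookkeeping, under simplifying assumptions: a
    queue counts as weight-raised when its coefficient is greater than 1,
    and the struct and function names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical per-device and per-queue state for this sketch. */
struct bfqd_sketch {
    int wr_busy_queues;     /* number of busy, weight-raised queues */
};

struct bfqq_sketch {
    bool busy;
    unsigned int wr_coeff;  /* > 1 means the queue is weight-raised */
};

/*
 * Restore the saved weight-raising state on a split. If the queue is
 * busy and its weight-raising state changes, keep the device-wide
 * counter of weight-raised busy queues in sync.
 */
static void resume_wr_state_sketch(struct bfqd_sketch *d,
                                   struct bfqq_sketch *q,
                                   unsigned int saved_wr_coeff)
{
    unsigned int old_wr_coeff = q->wr_coeff;

    q->wr_coeff = saved_wr_coeff;

    if (!q->busy)
        return;

    if (old_wr_coeff > 1 && q->wr_coeff == 1)
        d->wr_busy_queues--;
    else if (old_wr_coeff == 1 && q->wr_coeff > 1)
        d->wr_busy_queues++;
}
```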

    Reported-by: Luca Miccio
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

19 Jun, 2017

3 commits

  • This patch makes sure we always allocate requests in the core blk-mq
    code and use a common prepare_request method to initialize them for
    both mq I/O schedulers. For Kyber an additional limit_depth method
    is added that is called before allocating the request.

    Also, because none of the initializations can really fail, the new method
    does not return an error - instead the bfq finish method is hardened
    to deal with the no-IOC case.

    Last but not least this removes the abuse of RQF_QUEUE by the blk-mq
    scheduling code as RQF_ELFPRIV is all that is needed now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • icq_to_bic is a container_of operation, so we need to check for NULL
    before it. Also move the check outside the spinlock while we're at
    it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • No need to have two different callouts of bfq vs kyber.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

08 Jun, 2017

1 commit

  • In blk-cgroup, operations on blkg objects are protected with the
    request_queue lock. This is no longer the lock that protects
    I/O-scheduler operations in blk-mq. In fact, the latter are now
    protected with a finer-grained per-scheduler-instance lock. As a
    consequence, although blkg lookups are also rcu-protected, blk-mq I/O
    schedulers may see inconsistent data when they access blkg and
    blkg-related objects. BFQ does access these objects, and does incur
    this problem, in the following case.

    The blkg_lookup performed in bfq_get_queue, being protected (only)
    through rcu, may happen to return the address of a copy of the
    original blkg. If this is the case, then the blkg_get performed in
    bfq_get_queue, to pin down the blkg, is useless: it does not prevent
    blk-cgroup code from destroying both the original blkg and all objects
    directly or indirectly referred by the copy of the blkg. BFQ accesses
    these objects, which typically causes a crash due to a NULL-pointer
    dereference or a memory-protection violation.

    Some additional protection mechanism should be added to blk-cgroup to
    address this issue. In the meantime, this commit provides a quick
    temporary fix for BFQ: cache (when safe) blkg data that might
    disappear right after a blkg_lookup.

    In particular, this commit exploits the following facts to achieve its
    goal without introducing further locks. Destroy operations on a blkg
    invoke, as a first step, hooks of the scheduler associated with the
    blkg. And these hooks are executed with bfqd->lock held for BFQ. As a
    consequence, for any blkg associated with the request queue an
    instance of BFQ is attached to, we are guaranteed that such a blkg is
    not destroyed, and that all the pointers it contains are consistent,
    while that instance is holding its bfqd->lock. A blkg_lookup performed
    with bfqd->lock held then returns a fully consistent blkg, which
    remains consistent as long as this lock is held. In more detail, this holds
    even if the returned blkg is a copy of the original one.

    Finally, also the object describing a group inside BFQ needs to be
    protected from destruction on the blkg_free of the original blkg
    (which invokes bfq_pd_free). This commit adds private refcounting for
    this object, to let it disappear only after no bfq_queue refers to it
    any longer.
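
    Private reference counting of this kind can be sketched as below. The
    sketch is single-threaded and the names are hypothetical; in the
    scenario described above, the counter would be manipulated with the
    scheduler lock held.

```c
#include <stdlib.h>

/* Hypothetical group object with a private reference counter. */
struct group_sketch {
    int ref;
};

/* Allocate a group holding one initial reference; NULL on failure. */
static struct group_sketch *group_alloc_sketch(void)
{
    struct group_sketch *g = calloc(1, sizeof(*g));

    if (g != NULL)
        g->ref = 1;
    return g;
}

/* Taken, for example, by each queue that starts referring to the group. */
static void group_get_sketch(struct group_sketch *g)
{
    g->ref++;
}

/*
 * Drop one reference. The object is freed only when the last reference
 * goes away. Returns 1 if the object was freed, 0 otherwise.
 */
static int group_put_sketch(struct group_sketch *g)
{
    if (--g->ref > 0)
        return 0;
    free(g);
    return 1;
}
```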

    This commit also removes or updates some stale comments on locking
    issues related to blk-cgroup operations.

    Reported-by: Tomas Konir
    Reported-by: Lee Tibbert
    Reported-by: Marco Piazza
    Signed-off-by: Paolo Valente
    Tested-by: Tomas Konir
    Tested-by: Lee Tibbert
    Tested-by: Marco Piazza
    Signed-off-by: Jens Axboe

    Paolo Valente
     

10 May, 2017

1 commit

  • The introduction of the BFQ and Kyber I/O schedulers has triggered a
    new wave of I/O benchmarks. Unfortunately, comments and discussions on
    these benchmarks confirm that there is still little awareness that it
    is very hard to achieve, at the same time, a low latency and a high
    throughput. In particular, virtually all benchmarks measure
    throughput, or throughput-related figures of merit, but, for BFQ, they
    use the scheduler in its default configuration. This configuration is
    geared, instead, toward a low latency. This is evidently a sign that
    BFQ documentation is still too unclear on this important aspect. This
    commit addresses this issue by stressing how BFQ configuration must be
    (easily) changed if the only goal is maximum throughput.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

20 Apr, 2017

1 commit

  • The call to bfq_check_ioprio_change will dereference bic, however,
    the null check for bic is after this call. Move the null
    check on bic to before the call to avoid any potential null
    pointer dereference issues.

    Detected by CoverityScan, CID#1430138 ("Dereference before null check")

    Signed-off-by: Colin Ian King
    Signed-off-by: Jens Axboe

    Colin Ian King
     

19 Apr, 2017

16 commits

  • The BFQ I/O scheduler features an optimal fair-queuing
    (proportional-share) scheduling algorithm, enriched with several
    mechanisms to boost throughput and reduce latency for interactive and
    real-time applications. This makes BFQ a large and complex piece of
    code. This commit addresses this issue by splitting BFQ into three
    main, independent components, and by moving each component into a
    separate source file:
    1. Main algorithm: handles the interaction with the kernel, and
    decides which requests to dispatch; it uses the following two further
    components to achieve its goals.
    2. Scheduling engine (Hierarchical B-WF2Q+ scheduling algorithm):
    computes the schedule, using weights and budgets provided by the above
    component.
    3. cgroups support: handles group operations (creation, destruction,
    move, ...).

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • When a bfq queue is set in service and when it is merged, a reference
    to the I/O context associated with the queue is taken. This reference
    is then released when the queue is deselected from service or
    split. More precisely, the release of the reference is postponed to
    when the scheduler lock is released, to avoid nesting between the
    scheduler and the I/O-context lock. In fact, such nesting would lead
    to deadlocks, because of other code paths that take the same locks in
    the opposite order. This postponing of I/O-context releases does
    complicate code.

    This commit addresses these issues by modifying the involved operations
    so that they no longer need to take the above I/O-context references.
    It then also removes every get and release of these references.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Many popular I/O-intensive services or applications spawn or
    reactivate many parallel threads/processes during short time
    intervals. Examples are systemd during boot or git grep. These
    services or applications benefit mostly from a high throughput: the
    quicker the I/O generated by their processes is cumulatively served,
    the sooner the target job of these services or applications gets
    completed. As a consequence, it is almost always counterproductive to
    weight-raise any of the queues associated to the processes of these
    services or applications: in most cases it would just lower the
    throughput, mainly because weight-raising also implies device idling.

    To address this issue, an I/O scheduler needs, first, to detect which
    queues are associated with these services or applications. In this
    respect, we have that, from the I/O-scheduler standpoint, these
    services or applications cause bursts of activations, i.e.,
    activations of different queues occurring shortly after each
    other. However, a shorter burst of activations may be caused also by
    the start of an application that does not consist of a lot of parallel
    I/O-bound threads (see the comments on the function bfq_handle_burst
    for details).

    In view of these facts, this commit introduces:
    1) a heuristic to detect (only) bursts of queue activations caused by
    services or applications consisting of many parallel I/O-bound
    threads;
    2) the prevention of device idling and weight-raising for the queues
    belonging to these bursts.
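
    A much simplified sketch of such a detection heuristic follows. The
    interval and threshold values, the struct and the function name are
    assumptions chosen for illustration; the logic in bfq_handle_burst is
    more elaborate (for example, this sketch does not retroactively flag
    the queues that joined the burst before the threshold was reached).

```c
#include <stdbool.h>

/* Illustrative values, assumed for this sketch. */
#define BURST_INTERVAL_MS   180UL
#define LARGE_BURST_THRESH  8U

/* Hypothetical per-device burst-tracking state. */
struct burst_sketch {
    unsigned long last_activation_ms;
    unsigned int burst_size;
    bool large_burst;
};

/*
 * Record the activation of a new queue at time now_ms. Activations that
 * follow each other within BURST_INTERVAL_MS belong to the same burst;
 * once the burst reaches LARGE_BURST_THRESH queues, it is flagged as
 * large. Returns true if the activated queue belongs to a large burst,
 * i.e., if idling and weight-raising should be avoided for it.
 */
static bool note_activation_sketch(struct burst_sketch *b,
                                   unsigned long now_ms)
{
    if (b->burst_size == 0 ||
        now_ms - b->last_activation_ms > BURST_INTERVAL_MS) {
        /* Too late to be part of the current burst: start a new one. */
        b->burst_size = 1;
        b->large_burst = false;
    } else if (!b->large_burst && ++b->burst_size >= LARGE_BURST_THRESH) {
        b->large_burst = true;
    }

    b->last_activation_ms = now_ms;
    return b->large_burst;
}
```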

    Signed-off-by: Arianna Avanzini
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Arianna Avanzini
     
  • This patch is basically the counterpart, for NCQ-capable rotational
    devices, of the previous patch. Exactly as the previous patch does on
    flash-based devices and for any workload, this patch disables device
    idling on rotational devices, but only for random I/O. In fact, only
    for queues doing random I/O does disabling idling boost the throughput
    on NCQ-capable rotational devices. To not break service guarantees,
    idling is disabled for NCQ-enabled rotational devices only when the
    same symmetry conditions considered in the previous patches hold.

    Signed-off-by: Paolo Valente
    Signed-off-by: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • This patch boosts the throughput on NCQ-capable flash-based devices,
    while still preserving latency guarantees for interactive and soft
    real-time applications. The throughput is boosted by just not idling
    the device when the in-service queue remains empty, even if the queue
    is sync and has a non-null idle window. This helps to keep the drive's
    internal queue full, which is necessary to achieve maximum
    performance. This solution to boost the throughput is a port of
    commits a68bbdd and f7d7b7a for CFQ.

    As already highlighted in a previous patch, allowing the device to
    prefetch and internally reorder requests trivially causes loss of
    control on the request service order, and hence on service guarantees.
    Fortunately, as discussed in detail in the comments on the function
    bfq_bfqq_may_idle(), if every process has to receive the same
    fraction of the throughput, then the service order enforced by the
    internal scheduler of a flash-based device is relatively close to that
    enforced by BFQ. In particular, it is close enough to let service
    guarantees be substantially preserved.

    Things change in an asymmetric scenario, i.e., if not every process
    has to receive the same fraction of the throughput. In this case, to
    guarantee the desired throughput distribution, the device must be
    prevented from prefetching requests. This is exactly what this patch
    does in asymmetric scenarios.
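
    The symmetric/asymmetric distinction can be sketched as a check on the
    weights of the active queues. This is a simplification with
    hypothetical names: it compares plain weights only and ignores the
    other conditions discussed in the comments on bfq_bfqq_may_idle().

```c
#include <stdbool.h>
#include <stddef.h>

/* True if all n queues are to receive the same share of the throughput. */
static bool scenario_is_symmetric(const unsigned int *weights, size_t n)
{
    size_t i;

    for (i = 1; i < n; i++)
        if (weights[i] != weights[0])
            return false;
    return true;
}

/*
 * In a symmetric scenario the device may be left free to prefetch and
 * reorder requests, so idling can be skipped. In an asymmetric one,
 * idling is needed to preserve the desired throughput distribution.
 */
static bool must_idle_when_queue_empties(const unsigned int *weights,
                                         size_t n)
{
    return !scenario_is_symmetric(weights, n);
}
```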

    Signed-off-by: Paolo Valente
    Signed-off-by: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • A seeky queue (i.e., a queue containing random requests) is assigned a
    very small device-idling slice, for throughput issues. Unfortunately,
    given the process associated with a seeky queue, this behavior causes
    the following problem: if the process, say P, performs sync I/O and
    has a higher weight than some other processes doing I/O and associated
    with non-seeky queues, then BFQ may fail to guarantee to P its
    reserved share of the throughput. The reason is that idling is key
    for providing service guarantees to processes doing sync I/O [1].

    This commit addresses this issue by allowing the device-idling slice
    to be reduced for a seeky queue only if the scenario happens to be
    symmetric, i.e., if all the queues are to receive the same share of
    the throughput.

    [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
    Scheduler", Proceedings of the First Workshop on Mobile System
    Technologies (MST-2015), May 2015.
    http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf

    Signed-off-by: Arianna Avanzini
    Signed-off-by: Riccardo Pizzetti
    Signed-off-by: Samuele Zecchini
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Arianna Avanzini
     
  • A set of processes may happen to perform interleaved reads, i.e.,
    read requests whose union would give rise to a sequential read pattern.
    There are two typical cases: first, processes reading fixed-size chunks
    of data at a fixed distance from each other; second, processes reading
    variable-size chunks at variable distances. The latter case occurs for
    example with QEMU, which splits the I/O generated by a guest into
    multiple chunks, and lets these chunks be served by a pool of I/O
    threads, iteratively assigning the next chunk of I/O to the first
    available thread. CFQ denotes as 'cooperating' a set of processes that
    are doing interleaved I/O, and when it detects cooperating processes,
    it merges their queues to obtain a sequential I/O pattern from the union
    of their I/O requests, and hence boost the throughput.

    Unfortunately, in the following frequent case, the mechanism
    implemented in CFQ for detecting cooperating processes and merging
    their queues is not responsive enough to also handle the fluctuating
    I/O pattern of the second type of processes. Suppose that one process
    of the second type issues a request close to the next request to serve
    of another process of the same type. At that time the two processes
    would be considered as cooperating. But, if the request issued by the
    first process is to be merged with some other already-queued request,
    then, from the moment at which this request arrives, to the moment
    when CFQ checks whether the two processes are cooperating, the two
    processes are likely to be already doing I/O in distant zones of the
    disk surface or device memory.

    CFQ, however, uses preemption to get a sequential read pattern out of
    the read requests performed by the second type of processes too. As a
    consequence, CFQ uses two different mechanisms to achieve the same
    goal: boosting the throughput with interleaved I/O.

    This patch introduces Early Queue Merge (EQM), a unified mechanism to
    get a sequential read pattern with both types of processes. The main
    idea is to immediately check whether a newly-arrived request lets some
    pair of processes become cooperating, both in the case of actual
    request insertion and, to be responsive with the second type of
    processes, in the case of request merge. Both types of processes are
    then handled by just merging their queues.

    Signed-off-by: Arianna Avanzini
    Signed-off-by: Mauro Andreolini
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Arianna Avanzini
     
  • This patch introduces a heuristic that reduces latency when the
    I/O-request pool is saturated. This goal is achieved by disabling
    device idling, for non-weight-raised queues, when there are weight-
    raised queues with pending or in-flight requests. In fact, as
    explained in more detail in the comment on the function
    bfq_bfqq_may_idle(), this reduces the rate at which processes
    associated with non-weight-raised queues grab requests from the pool,
    thereby increasing the probability that processes associated with
    weight-raised queues get a request immediately (or at least soon) when
    they need one. Along the same line, if there are weight-raised queues,
    then this patch halves the service rate of async (write) requests for
    non-weight-raised queues.

    Signed-off-by: Paolo Valente
    Signed-off-by: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • I/O schedulers typically allow NCQ-capable drives to prefetch I/O
    requests, as NCQ boosts the throughput exactly by prefetching and
    internally reordering requests.

    Unfortunately, as discussed in detail and shown experimentally in [1],
    this may cause fairness and latency guarantees to be violated. The
    main problem is that the internal scheduler of an NCQ-capable drive
    may postpone the service of some unlucky (prefetched) requests as long
    as it deems serving other requests more appropriate to boost the
    throughput.

    This patch addresses this issue by not disabling device idling for
    weight-raised queues, even if the device supports NCQ. This allows BFQ
    to start serving a new queue, and therefore allows the drive to
    prefetch new requests, only after the idling timeout expires. At that
    time, all the outstanding requests of the expired queue have almost
    certainly been served.

    [1] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
    http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-results.pdf

    Signed-off-by: Paolo Valente
    Signed-off-by: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • To guarantee a low latency also to the I/O requests issued by soft
    real-time applications, this patch introduces a further heuristic,
    which weight-raises (in the sense explained in the previous patch)
    also the queues associated with applications deemed as soft real-time.

    To be deemed as soft real-time, an application must meet two
    requirements. First, the application must not require an average
    bandwidth higher than the approximate bandwidth required to play back
    or record a compressed high-definition video. Second, the request
    pattern of the application must be isochronous, i.e., after issuing a
    request or a batch of requests, the application must stop issuing new
    requests until all its pending requests have been completed. After
    that, the application may issue a new batch, and so on.

    As for the second requirement, it is critical to require also that,
    after all the pending requests of the application have been completed,
    an adequate minimum amount of time elapses before the application
    starts issuing new requests. This also prevents greedy (i.e.,
    I/O-bound) applications from being incorrectly deemed, occasionally,
    as soft real-time. In fact, if *any amount of time* is fine, then even
    a greedy application may, paradoxically, meet both the above
    requirements, if: (1) the application performs random I/O and/or the
    device is slow, and (2) the CPU load is high. The reason is the
    following. First, if condition (1) is true, then, during the service
    of the application, the throughput may be low enough to let the
    application meet the bandwidth requirement. Second, if condition (2)
    is true as well, then the application may occasionally behave in an
    apparently isochronous way, because it may simply stop issuing
    requests while the CPUs are busy serving other processes.

    To address this issue, the heuristic leverages the simple fact that
    greedy applications issue *all* their requests as quickly as they can,
    whereas soft real-time applications spend some time processing data
    after each batch of requests is completed. In particular, the
    heuristic works as follows. First, according to the above isochrony
    requirement, the heuristic checks whether an application may be soft
    real-time, thereby giving the application the opportunity to be
    deemed as such, only when both the following two conditions happen to
    hold: 1) the queue associated with the application has expired and is
    empty, 2) there is no outstanding request of the application.

    Suppose that both conditions hold at time, say, t_c and that the
    application issues its next request at time, say, t_i. At time t_c the
    heuristic computes the next time instant, called soft_rt_next_start in
    the code, such that, only if t_i >= soft_rt_next_start, then both the
    next conditions will hold when the application issues its next
    request: 1) the application will meet the above bandwidth requirement,
    2) a given minimum time interval, say Delta, will have elapsed from
    time t_c (so as to filter out greedy applications).

    The current value of Delta is a little bit higher than the value that
    we have found, experimentally, to be adequate on a real,
    general-purpose machine. In particular we had to increase Delta to
    make the filter quite precise also in slower, embedded systems, and in
    KVM/QEMU virtual machines (details in the comments on the code).

    If the application actually issues its next request after time
    soft_rt_next_start, then its associated queue will be weight-raised
    for a relatively short time interval. If, during this time interval,
    the application proves again to meet the bandwidth and isochrony
    requirements, then the end of the weight-raising period for the queue
    is moved forward, and so on. Note that an application whose associated
    queue never happens to be empty when it expires will never have the
    opportunity to be deemed as soft real-time.
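
    The computation of soft_rt_next_start described above can be
    sketched as the later of two instants: the one from which the
    bandwidth requirement is met, and t_c + Delta. This is one reading
    of the two conditions, not necessarily the formula used in the
    kernel; the parameter names, the units, and the idea of measuring
    service from the instant the queue last became busy are assumptions
    made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch (not the kernel implementation).
 *
 * busy_start_ms:   when the queue last became backlogged (assumed reference)
 * service_sectors: sectors served to the queue since busy_start_ms
 * max_rate:        maximum soft real-time bandwidth, in sectors per second
 * t_c_ms:          instant at which the queue is empty with nothing outstanding
 * delta_ms:        minimum idle gap (Delta) used to filter greedy applications
 */
uint64_t soft_rt_next_start(uint64_t busy_start_ms, uint64_t service_sectors,
                            uint64_t max_rate, uint64_t t_c_ms,
                            uint64_t delta_ms)
{
    assert(max_rate != 0);

    /*
     * Earliest instant at which the average bandwidth since busy_start_ms
     * drops to max_rate or below (rounded up to a whole millisecond).
     */
    uint64_t bw_ok_ms = busy_start_ms +
        (service_sectors * 1000 + max_rate - 1) / max_rate;

    /* Earliest instant at which Delta has elapsed since t_c. */
    uint64_t gap_ok_ms = t_c_ms + delta_ms;

    return bw_ok_ms > gap_ok_ms ? bw_ok_ms : gap_ok_ms;
}
```

    With this sketch, a request arriving at t_i would qualify the queue
    for soft real-time weight raising only if t_i is not earlier than
    the returned value.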

    Signed-off-by: Paolo Valente
    Signed-off-by: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • This patch introduces a simple heuristic to load applications quickly,
    and to perform the I/O requested by interactive applications just as
    quickly. To this purpose, both a newly-created queue and a queue
    associated with an interactive application (we explain in a moment how
    BFQ decides whether the associated application is interactive),
    receive the following two special treatments:

    1) The weight of the queue is raised.

    2) The queue unconditionally enjoys device idling when it empties; in
    fact, if the requests of a queue are sync, then performing device
    idling for the queue is a necessary condition to guarantee that the
    queue receives a fraction of the throughput proportional to its weight
    (see [1] for details).

    For brevity, we call just weight-raising the combination of these
    two preferential treatments. For a newly-created queue,
    weight-raising starts immediately and lasts for a time interval that:
    1) depends on the device speed and type (rotational or
    non-rotational), and 2) is equal to the time needed to load (start up)
    a large-size application on that device, with cold caches and with no
    additional workload.

    Finally, as for guaranteeing a fast execution to interactive,
    I/O-related tasks (such as opening a file), consider that any
    interactive application blocks and waits for user input both after
    starting up and after executing some task. After a while, the user may
    trigger new operations, after which the application stops again, and
    so on. Accordingly, the low-latency heuristic weight-raises again a
    queue in case it becomes backlogged after being idle for a
    sufficiently long (configurable) time. The weight-raising then lasts
    for the same time as for a just-created queue.
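
    The trigger for interactive weight raising described above can be
    sketched as follows. The function and parameter names are
    hypothetical, and the idle threshold is left as a parameter because
    its value is configurable and not stated here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch (not a BFQ function): decide whether a queue that
 * has just become backlogged should start a weight-raising period.
 *
 * newly_created: the queue has just been created
 * now_ms:        current time
 * idle_since_ms: when the queue last became idle
 * min_idle_ms:   configurable minimum idle time for deeming the
 *                associated application interactive
 */
bool starts_weight_raising(bool newly_created, uint64_t now_ms,
                           uint64_t idle_since_ms, uint64_t min_idle_ms)
{
    if (newly_created)
        return true;
    if (now_ms < idle_since_ms)   /* inconsistent timestamps: do not raise */
        return false;
    return now_ms - idle_since_ms >= min_idle_ms;
}
```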

    According to our experiments, the combination of this low-latency
    heuristic and of the improvements described in the previous patch
    allows BFQ to guarantee a high application responsiveness.

    [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
    Scheduler", Proceedings of the First Workshop on Mobile System
    Technologies (MST-2015), May 2015.
    http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf

    Signed-off-by: Paolo Valente
    Signed-off-by: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • This patch deals with two sources of unfairness, which can also cause
    high latencies and throughput loss. The first source is related to
    write requests. Write requests tend to starve read requests, basically
    because, on one side, writes are slower than reads, whereas, on the
    other side, storage devices confuse schedulers by deceptively
    signaling the completion of write requests immediately after receiving
    them. This patch addresses this issue by just throttling writes. In
    particular, after a write request is dispatched for a queue, the
    budget of the queue is decremented by the number of sectors to write,
    multiplied by an (over)charge coefficient. The value of the
    coefficient is the result of our tuning with different devices.
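
    The write-throttling rule above amounts to the following charge per
    dispatched request. This is a sketch only: the function name is
    hypothetical, and the coefficient is passed as a parameter because
    its tuned value is not given in this description.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch: amount by which the budget of a queue is
 * decremented when a request of `sectors` sectors is dispatched.
 * Writes are (over)charged by `write_coeff`; reads are charged their size.
 */
uint64_t service_to_charge(uint64_t sectors, bool is_write,
                           uint64_t write_coeff)
{
    return is_write ? sectors * write_coeff : sectors;
}
```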

    The second source of unfairness has to do with slowness detection:
    when the in-service queue is expired, BFQ also checks whether the
    queue has been "too slow", i.e., has consumed its last-assigned budget
    at such a low rate that it would have been impossible to consume all
    of this budget within the maximum time slice T_max (Subsec. 3.5 in
    [1]). In this case, the queue is always (over)charged the whole
    budget, to reduce its utilization of the device. Both this overcharge
    and the slowness-detection criterion may cause unfairness.

    First, always charging a full budget to a slow queue is too coarse. It
    is much more accurate, and this patch lets BFQ do so, to charge an
    amount of service 'equivalent' to the amount of time during which the
    queue has been in service. As explained in more detail in the comments
    on the code, this enables BFQ to provide time fairness among slow
    queues.

    Secondly, because of ZBR, a queue may be deemed as slow when its
    associated process is performing I/O on the slowest zones of a
    disk. However, unless the process is truly too slow, not reducing the
    disk utilization of the queue is more profitable in terms of disk
    throughput than the opposite. A similar problem is caused by logical
    block mapping on non-rotational devices. For this reason, this patch
    lets a queue be charged time, and not budget, only if the queue has
    consumed less than 2/3 of its assigned budget. As an additional,
    important benefit, this tolerance allows BFQ to preserve enough
    elasticity to still perform bandwidth, and not time, distribution with
    slightly unlucky or quasi-sequential processes.

    Finally, for the same reasons as above, this patch makes slowness
    detection itself much less harsh: a queue is deemed slow only if it
    has consumed its budget at less than half of the peak rate.
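
    The three thresholds above (half of the peak rate, 2/3 of the
    budget, and a time-equivalent charge) can be sketched as follows.
    All names are hypothetical. In particular, scaling the maximum
    budget by the fraction of T_max spent in service is only one
    plausible way to compute an 'equivalent' amount of service, and is
    an assumption here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Slow: the queue consumed its budget at less than half of the peak rate. */
bool queue_is_slow(uint64_t sectors_served, uint64_t time_ms,
                   uint64_t peak_rate_sectors_per_ms)
{
    /* sectors/time < peak/2  <=>  2 * sectors < peak * time */
    return 2 * sectors_served < peak_rate_sectors_per_ms * time_ms;
}

/* Charge time, not budget, only if slow and under 2/3 of the budget used. */
bool charge_time_instead_of_service(bool slow, uint64_t sectors_served,
                                    uint64_t budget)
{
    return slow && 3 * sectors_served < 2 * budget;
}

/* Assumed conversion: service 'equivalent' to the time spent in service. */
uint64_t time_equivalent_service(uint64_t max_budget, uint64_t time_ms,
                                 uint64_t timeout_ms)
{
    assert(timeout_ms != 0);
    return max_budget * time_ms / timeout_ms;
}
```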

    [1] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
    http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-results.pdf

    Signed-off-by: Paolo Valente
    Signed-off-by: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Unless the maximum budget B_max that BFQ can assign to a queue is set
    explicitly by the user, BFQ automatically updates B_max. In
    particular, BFQ dynamically sets B_max to the number of sectors that
    can be read, at the current estimated peak rate, during the maximum
    time, T_max, allowed before a budget timeout occurs. In formulas, if
    we denote as R_est the estimated peak rate, then B_max = T_max *
    R_est. Hence, the more R_est exceeds the actual device peak rate, the
    higher the probability that processes unjustly incur budget timeouts.
    Besides, too high a value of B_max unnecessarily increases the
    deviation from an ideal, smooth service.
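
    As a worked instance of the formula above, the following sketch
    computes B_max from R_est and T_max. The function name and the
    units (sectors per millisecond, milliseconds) are assumptions chosen
    for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical sketch: B_max = T_max * R_est.
 *
 * rate_sectors_per_ms: estimated peak rate R_est
 * timeout_ms:          maximum time slice T_max before a budget timeout
 */
uint64_t max_budget_sectors(uint64_t rate_sectors_per_ms, uint64_t timeout_ms)
{
    return timeout_ms * rate_sectors_per_ms;
}
```

    For example, with a made-up R_est of 200 sectors/ms and T_max of
    125 ms, the result is 25000 sectors. If R_est overestimated the real
    peak rate by a factor of two, a sequential process could consume
    only about half of that budget before the timeout fires.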

    Unfortunately, it is not trivial to estimate the peak rate correctly:
    because of the presence of sw and hw queues between the scheduler and
    the device components that finally serve I/O requests, it is hard to
    say exactly when a given dispatched request is served inside the
    device, and for how long. As a consequence, it is hard to know
    precisely at what rate a given set of requests is actually served by
    the device.

    On the other hand, the dispatch time of any request is trivially
    available, and, from this piece of information, the "dispatch rate"
    of requests can be immediately computed. So, the idea in the next
    function is to use what is known, namely request dispatch times
    (plus, when useful, request completion times), to estimate what is
    unknown, namely in-device request service rate.

    The main issue is that, because of the above facts, the rate at
    which a certain set of requests is dispatched over a certain time
    interval can vary greatly with respect to the rate at which the
    same requests are then served. But, since the size of any
    intermediate queue is limited, and the service scheme is lossless
    (no request is silently dropped), the following obvious convergence
    property holds: the number of requests dispatched MUST become
    closer and closer to the number of requests completed as the
    observation interval grows. This is the key property used in
    this new version of the peak-rate estimator.

    Signed-off-by: Paolo Valente
    Signed-off-by: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • The feedback-loop algorithm used by BFQ to compute queue (process)
    budgets is basically a set of three update rules, one for each of the
    main reasons why a queue may be expired. If many processes suddenly
    switch from sporadic I/O to greedy and sequential I/O, then these
    rules are quite slow to assign large budgets to these processes, and
    hence to achieve a high throughput. On the other hand, BFQ assigns
    the maximum possible budget B_max to a just-created queue. This allows
    a high throughput to be achieved immediately if the associated process
    is I/O-bound and performs sequential I/O from the beginning. But it
    also increases the worst-case latency experienced by the first
    requests issued by the process, because the larger the budget of a
    queue waiting for service is, the later the queue will be served by
    B-WF2Q+ (Subsec 3.3 in [1]). This is detrimental for an interactive or
    soft real-time application.

    To tackle these throughput and latency problems, on one hand this
    patch changes the initial budget value to B_max/2. On the other hand,
    it re-tunes the three rules, adopting a more aggressive,
    multiplicative increase/linear decrease scheme. This scheme trades
    latency for throughput more than before, and tends to assign large
    budgets quickly to processes that are or become I/O-bound. For two of
    the expiration reasons, the new version of the rules also contains
    a few further minor improvements, briefly described below.

    *No more backlog.* In this case, the budget was larger than the number
    of sectors actually read/written by the process before it stopped
    doing I/O. Hence, to reduce latency for the possible future I/O
    requests of the process, the old rule simply set the next budget to
    the number of sectors actually consumed by the process. However, if
    there are still outstanding requests, then the process may not yet
    have issued its next request just because it is still waiting for the
    completion of some of the still outstanding ones. If this sub-case
    holds true, then the new rule, instead of decreasing the budget,
    doubles it, proactively, in the hope that: 1) a larger budget will fit
    the actual needs of the process, and 2) the process is sequential and
    hence a higher throughput will be achieved by serving the process
    longer after granting it access to the device.

    *Budget timeout*. The original rule set the new budget to the maximum
    value B_max, to maximize throughput and let all processes experiencing
    budget timeouts receive the same share of the device time. In our
    experiments we verified that this sudden jump to B_max did not provide
    appreciable benefits; rather it increased the latency of processes
    performing sporadic and short I/O. The new rule only doubles the
    budget.
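
    The two rules described above can be sketched as follows. Only the
    doubling behavior is taken from this description; capping at B_max
    follows from B_max being the maximum assignable budget. The
    decrease applied when there is no backlog and no outstanding
    request is a placeholder: the text only says the scheme uses a
    linear decrease, so the step is a hypothetical parameter. The rule
    for the third expiration reason is not described here and is
    omitted.

```c
#include <stdbool.h>
#include <stdint.h>

enum expire_reason { NO_MORE_BACKLOG, BUDGET_TIMEOUT };

static uint64_t min_u64(uint64_t a, uint64_t b)
{
    return a < b ? a : b;
}

/*
 * Hypothetical sketch of the budget update on queue expiration.
 *
 * budget:          budget the queue had when it expired
 * max_budget:      B_max
 * has_outstanding: the process still has requests in flight
 * decrease_step:   assumed linear-decrease step (placeholder value)
 */
uint64_t next_budget(enum expire_reason reason, uint64_t budget,
                     uint64_t max_budget, bool has_outstanding,
                     uint64_t decrease_step)
{
    switch (reason) {
    case NO_MORE_BACKLOG:
        if (has_outstanding)   /* process may just be waiting: double */
            return min_u64(2 * budget, max_budget);
        /* Assumed linear decrease; keep the budget if it is too small. */
        return budget > decrease_step ? budget - decrease_step : budget;
    case BUDGET_TIMEOUT:       /* double instead of jumping to B_max */
        return min_u64(2 * budget, max_budget);
    }
    return budget;
}
```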

    [1] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
    http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-results.pdf

    Signed-off-by: Paolo Valente
    Signed-off-by: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Add complete support for full hierarchical scheduling, with a cgroups
    interface. Full hierarchical scheduling is implemented through the
    'entity' abstraction: both bfq_queues, i.e., the internal BFQ queues
    associated with processes, and groups are represented in general by
    entities. Given the bfq_queues associated with the processes belonging
    to a given group, the entities representing these queues are children of
    the entity representing the group. At higher levels, if a group, say
    G, contains other groups, then the entity representing G is the parent
    entity of the entities representing the groups in G.

    Hierarchical scheduling is performed as follows: if the timestamps of
    a leaf entity (i.e., of a bfq_queue) change, and such a change lets
    the entity become the next-to-serve entity for its parent entity, then
    the timestamps of the parent entity are recomputed as a function of
    the budget of its new next-to-serve leaf entity. If the parent entity
    belongs, in its turn, to a group, and its new timestamps let it become
    the next-to-serve for its parent entity, then the timestamps of the
    latter parent entity are recomputed as well, and so on. When a new
    bfq_queue must be set in service, the reverse path is followed: the
    next-to-serve highest-level entity is chosen, then its next-to-serve
    child entity, and so on, until the next-to-serve leaf entity is
    reached, and the bfq_queue that this entity represents is set in
    service.
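
    The top-down selection described above can be sketched as a walk
    along next-to-serve pointers. The structure below is a deliberately
    reduced, hypothetical stand-in for the 'entity' abstraction; it
    omits timestamps, weights, and the per-group schedulers that decide
    which child is next to serve.

```c
#include <stddef.h>

/* Hypothetical, simplified entity: a group or a leaf (bfq_queue). */
struct entity {
    struct entity *next_in_service; /* NULL for a leaf */
    int queue_id;                   /* meaningful for leaves only */
};

/*
 * Starting from the highest-level entity, follow the next-to-serve child
 * at each level until a leaf is reached; that leaf represents the
 * bfq_queue to set in service. Returns NULL if root is NULL.
 */
const struct entity *pick_next_queue(const struct entity *root)
{
    const struct entity *e = root;

    while (e != NULL && e->next_in_service != NULL)
        e = e->next_in_service;
    return e;
}
```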

    Writeback is accounted for on a per-group basis, i.e., for each group,
    the async I/O requests of the processes of the group are enqueued in a
    distinct bfq_queue, and the entity associated with this queue is a
    child of the entity associated with the group.

    Weights can be assigned explicitly to groups and processes through the
    cgroups interface, differently from what happens, for single
    processes, if the cgroups interface is not used (as explained in the
    description of the previous patch). In particular, since each node has
    a full scheduler, each group can be assigned its own weight.

    Signed-off-by: Fabio Checconi
    Signed-off-by: Paolo Valente
    Signed-off-by: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Arianna Avanzini
     
  • We tag as v0 the version of BFQ containing only BFQ's engine plus
    hierarchical support. BFQ's engine is introduced by this commit, while
    hierarchical support is added by the next commit. We use the v0 tag to
    distinguish this minimal version of BFQ from the versions containing
    also the features and the improvements added by later commits. BFQ-v0
    coincides with the version of BFQ submitted a few years ago [1], apart
    from the introduction of preemption, described below.

    BFQ is a proportional-share I/O scheduler, whose general structure,
    plus a lot of code, are borrowed from CFQ.

    - Each process doing I/O on a device is associated with a weight and a
    (bfq_)queue.

    - BFQ grants exclusive access to the device, for a while, to one queue
    (process) at a time, and implements this service model by
    associating every queue with a budget, measured in number of
    sectors.

    - After a queue is granted access to the device, the budget of the
    queue is decremented, on each request dispatch, by the size of the
    request.

    - The in-service queue is expired, i.e., its service is suspended,
    only if one of the following events occurs: 1) the queue finishes
    its budget, 2) the queue empties, 3) a "budget timeout" fires.

    - The budget timeout prevents processes doing random I/O from
    holding the device for too long and dramatically reducing
    throughput.

    - Actually, as in CFQ, a queue associated with a process issuing
    sync requests may not be expired immediately when it empties. In
    contrast, BFQ may idle the device for a short time interval,
    giving the process the chance to go on being served if it issues
    a new request in time. Device idling typically boosts the
    throughput on rotational devices, if processes do synchronous
    and sequential I/O. In addition, under BFQ, device idling is
    also instrumental in guaranteeing the desired throughput
    fraction to processes issuing sync requests (see [2] for
    details).

    - With respect to idling for service guarantees, if several
    processes are competing for the device at the same time, but
    all processes (and groups, after the following commit) have
    the same weight, then BFQ guarantees the expected throughput
    distribution without ever idling the device. Throughput is
    thus as high as possible in this common scenario.

    - Queues are scheduled according to a variant of WF2Q+, named
    B-WF2Q+, and implemented using an augmented rb-tree to preserve an
    O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
    also ready for hierarchical scheduling. However, for a cleaner
    logical breakdown, the code that enables and completes
    hierarchical support is provided in the next commit, which focuses
    exactly on this feature.

    - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
    perfectly fair, and smooth service. In particular, B-WF2Q+
    guarantees that each queue receives a fraction of the device
    throughput proportional to its weight, even if the throughput
    fluctuates, and regardless of: the device parameters, the current
    workload and the budgets assigned to the queue.

    - The last, budget-independence, property (although probably
    counterintuitive at first) is definitely beneficial, for
    the following reasons:

    - First, with any proportional-share scheduler, the maximum
    deviation with respect to an ideal service is proportional to
    the maximum budget (slice) assigned to queues. As a consequence,
    BFQ can keep this deviation tight not only because of the
    accurate service of B-WF2Q+, but also because BFQ *does not*
    need to assign a larger budget to a queue to let the queue
    receive a higher fraction of the device throughput.

    - Second, BFQ is free to choose, for every process (queue), the
    budget that best fits the needs of the process, or best
    leverages the I/O pattern of the process. In particular, BFQ
    updates queue budgets with a simple feedback-loop algorithm that
    allows a high throughput to be achieved, while still providing
    tight latency guarantees to time-sensitive applications. When
    the in-service queue expires, this algorithm computes the next
    budget of the queue so as to:

    - Let large budgets be eventually assigned to the queues
    associated with I/O-bound applications performing sequential
    I/O: in fact, the longer these applications are served once
    they have gained access to the device, the higher the throughput is.

    - Let small budgets be eventually assigned to the queues
    associated with time-sensitive applications (which typically
    perform sporadic and short I/O), because the smaller the
    budget assigned to a queue waiting for service is, the sooner
    B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).

    - Weights can be assigned to processes only indirectly, through I/O
    priorities, and according to the relation:
    weight = 10 * (IOPRIO_BE_NR - ioprio).
    The next patch provides, instead, a cgroups interface through which
    weights can be assigned explicitly.

    - If several processes are competing for the device at the same time,
    but all processes and groups have the same weight, then BFQ
    guarantees the expected throughput distribution without ever idling
    the device. It uses preemption instead. Throughput is then much
    higher in this common scenario.

    - ioprio classes are served in strict priority order, i.e.,
    lower-priority queues are not served as long as there are
    higher-priority queues. Among queues in the same class, the
    bandwidth is distributed in proportion to the weight of each
    queue. A very thin extra bandwidth is however guaranteed to the Idle
    class, to prevent it from starving.

    - If the strict_guarantees parameter is set (default: unset), then BFQ
    - always performs idling when the in-service queue becomes empty;
    - forces the device to serve one I/O request at a time, by
    dispatching a new request only if there is no outstanding
    request.
    In the presence of differentiated weights or I/O-request sizes,
    both the above conditions are needed to guarantee that every
    queue receives its allotted share of the bandwidth (see
    Documentation/block/bfq-iosched.txt for more details). Setting
    strict_guarantees may evidently affect throughput.
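
    The ioprio-to-weight relation given in the list above, weight = 10 *
    (IOPRIO_BE_NR - ioprio), can be tried out with the following
    userspace sketch. The function name is hypothetical, and the
    constant is defined locally, assuming the usual 8 best-effort
    priority levels, rather than taken from kernel headers.

```c
/* Assumed number of best-effort I/O priority levels (0..7). */
#define SKETCH_IOPRIO_BE_NR 8

/* Hypothetical helper: weight = 10 * (IOPRIO_BE_NR - ioprio). */
int ioprio_to_weight(int ioprio)
{
    return 10 * (SKETCH_IOPRIO_BE_NR - ioprio);
}
```

    Under this assumption, the highest priority (ioprio 0) maps to
    weight 80 and the lowest (ioprio 7) to weight 10, so the highest
    priority level is entitled to eight times the throughput share of
    the lowest one.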

    [1] https://lkml.org/lkml/2008/4/1/234
    https://lkml.org/lkml/2008/11/11/148

    [2] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
    http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-results.pdf

    Signed-off-by: Fabio Checconi
    Signed-off-by: Paolo Valente
    Signed-off-by: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Paolo Valente