18 Aug, 2020

1 commit

  • Changes from v1:
    - update commit description with proper ref-accounting justification

    commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree")
    introduced a leak of bfq_group and blkcg_gq objects because of a get/put
    imbalance.
    In fact, the whole idea of the original commit is wrong, because a
    bfq_group entity cannot disappear under us: it is referenced by its child
    bfq_queues' entities from here:
    -> bfq_init_entity()
       -> bfqg_and_blkg_get(bfqg);
       -> entity->parent = bfqg->my_entity

    -> bfq_put_queue(bfqq)
       FINAL_PUT
       -> bfqg_and_blkg_put(bfqq_group(bfqq))
       -> kmem_cache_free(bfq_pool, bfqq);

    So a parent entity cannot disappear while a child entity is in the tree,
    and child entities already have proper protection.
    This patch reverts commit db37a34c563b ("block, bfq: get a ref to a group
    when adding it to a service tree").
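
    For reference, the pairing described above looks roughly as follows (a
    paraphrase of the mainline code, not a verbatim copy):

    /* Each child bfq_queue pins its parent bfq_group for its lifetime. */
    static void bfq_init_entity(struct bfq_entity *entity,
                                struct bfq_group *bfqg)
    {
            struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);

            if (bfqq)
                    bfqg_and_blkg_get(bfqg);        /* +1 on the parent */
            entity->parent = bfqg->my_entity;
            entity->sched_data = &bfqg->sched_data;
    }

    /* The matching put happens only on the final put of the queue: */
    void bfq_put_queue(struct bfq_queue *bfqq)
    {
            if (--bfqq->ref)        /* simplified: mainline does more */
                    return;
            bfqg_and_blkg_put(bfqq_group(bfqq));    /* -1 on the parent */
            kmem_cache_free(bfq_pool, bfqq);
    }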

    bfq_group leak trace caused by bad commit:
    -> blkg_alloc
       -> bfq_pd_alloc
          -> bfqg_get (+1)
    -> bfq_activate_bfqq
       -> bfq_activate_requeue_entity
          -> __bfq_activate_entity
             -> bfq_get_entity
                -> bfqg_and_blkg_get (+1)          <==== Note1
    -> bfq_del_bfqq_busy
       -> bfq_deactivate_entity+0x53/0xc0 [bfq]
          -> __bfq_deactivate_entity+0x1b8/0x210 [bfq]
             -> bfq_forget_entity(is_in_service = true)
                entity->on_st_or_in_serv = false   <==== Note2
                if (is_in_service)
                    return;  ==> do not touch reference
    -> blkcg_css_offline
       -> blkcg_destroy_blkgs
          -> blkg_destroy
             -> bfq_pd_offline
                -> __bfq_deactivate_entity
                   if (!entity->on_st_or_in_serv) /* true, because of Note2 */
                       return false;
    -> bfq_pd_free
       -> bfqg_put() (-1; but bfqg->ref == 2 because of Note1)
    So bfq_group and blkcg_gq will leak forever; see the test case below.

    ##TESTCASE_BEGIN:
    #!/bin/bash

    max_iters=${1:-100}
    #prep cgroup mounts
    mount -t tmpfs cgroup_root /sys/fs/cgroup
    mkdir /sys/fs/cgroup/blkio
    mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

    # Prepare blkdev
    grep blkio /proc/cgroups
    truncate -s 1M img
    losetup /dev/loop0 img
    echo bfq > /sys/block/loop0/queue/scheduler

    grep blkio /proc/cgroups
    for ((i=0;i<max_iters;i++))
    do
        mkdir -p /sys/fs/cgroup/blkio/a
        echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
        dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
        echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
        rmdir /sys/fs/cgroup/blkio/a
        grep blkio /proc/cgroups
    done
    ##TESTCASE_END:

    Fixes: db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree")
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
     

22 Mar, 2020

1 commit

  • A bfq_put_queue() may be invoked in __bfq_bic_change_cgroup(). The
    goal of this put is to release a process reference to a bfq_queue. But
    process-reference releases may trigger also some extra operation, and,
    to this goal, are handled through bfq_release_process_ref(). So, turn
    the invocation of bfq_put_queue() into an invocation of
    bfq_release_process_ref().

    Tested-by: cki-project@redhat.com
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

03 Feb, 2020

3 commits

  • BFQ schedules generic entities, which may represent either bfq_queues
    or groups of bfq_queues. When an entity is inserted into a service
    tree, a reference must be taken, to make sure that the entity does not
    disappear while still referred in the tree. Unfortunately, such a
    reference is mistakenly taken only if the entity represents a
    bfq_queue. This commit takes a reference also in case the entity
    represents a group.
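
    The change amounts to taking a group reference in the same place where
    queue references are already taken; roughly (paraphrased; note that this
    is the change reverted by the 18 Aug, 2020 entry above):

    static void bfq_get_entity(struct bfq_entity *entity)
    {
            struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);

            if (bfqq)
                    bfqq->ref++;    /* queue entity: as before */
            else                    /* group entity: now referenced too */
                    bfqg_and_blkg_get(container_of(entity, struct bfq_group,
                                                   entity));
    }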

    Tested-by: Oleksandr Natalenko
    Tested-by: Chris Evich
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • ifdefs around gets and puts of bfq groups reduce readability, remove them.

    Tested-by: Oleksandr Natalenko
    Reported-by: Jens Axboe
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • The flag on_st in the bfq_entity data structure is true if the entity
    is on a service tree or is in service. Yet the name of the field,
    confusingly, does not mention the second, very important case. Extend
    the name to mention the second case too.

    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

21 Nov, 2019

1 commit

    In most cases blk_tracing is not active, but the bfq_log_bfqq macro
    generates pid_str unconditionally, which results in significant overhead.
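
    The fix is to generate pid_str only when blktrace is actually consuming
    messages. A sketch of the guarded macro (paraphrased; bfq_pid_to_str()
    is the helper introduced by the SHARED patch further below):

    #define bfq_log_bfqq(bfqd, bfqq, fmt, args...) do {                 \
            char pid_str[MAX_PID_STR_LENGTH];                           \
            if (likely(!blk_trace_note_message_enabled((bfqd)->queue))) \
                    break;                                              \
            bfq_pid_to_str((bfqq)->pid, pid_str, MAX_PID_STR_LENGTH);   \
            blk_add_trace_msg((bfqd)->queue, "bfq%s%c " fmt, pid_str,   \
                              bfq_bfqq_sync((bfqq)) ? 'S' : 'A',        \
                              ##args);                                  \
    } while (0)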

    ## Test
    modprobe null_blk
    echo bfq > /sys/block/nullb0/queue/scheduler
    fio --name=t --ioengine=libaio --direct=1 --filename=/dev/nullb0 \
    --runtime=30 --time_based=1 --rw=write --iodepth=128 --bs=4k

    # Results
    |      | baseline | w/ patch | gain |
    | iops | 113.19K  | 126.42K  | +11% |

    Acked-by: Paolo Valente
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
     

08 Nov, 2019

2 commits

  • blkg_rwstat is now only used by bfq-iosched and blk-throtl when on
    cgroup1. Let's move it into its own files and gate it behind a config
    option.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • When used on cgroup1, bfq uses the blkg->stat_bytes and ->stat_ios
    from blk-cgroup core to populate six stat knobs. blk-cgroup core is
    moving away from blkg_rwstat to improve scalability and won't be able
    to support this usage.

    It isn't like the sharing gains all that much. Let's break it out to
    dedicated rwstat counters which are updated when on cgroup1. This
    makes use of bfqg_*rwstat*() helpers outside of
    CONFIG_BFQ_CGROUP_DEBUG. Move them out.

    v2: Compile fix when !CONFIG_BFQ_CGROUP_DEBUG.

    Signed-off-by: Tejun Heo
    Cc: Paolo Valente
    Signed-off-by: Jens Axboe

    Tejun Heo
     

07 Sep, 2019

1 commit

  • This adds to BFQ the missing per-device weight interfaces:
    blkio.bfq.weight_device on legacy and io.bfq.weight on unified. The
    implementation pretty closely resembles what we had in CFQ and the parsing code
    is basically reused.

    Tests
    =====

    Using two cgroups and three block devices, having weights setup as:

    Cgroup          test1          test2
    ============================================
    default         100            500
    sda             500            100
    sdb             default        default
    sdc             200            200

    cgroup v1 runs
    --------------

    sda.test1.out: READ: bw=913MiB/s
    sda.test2.out: READ: bw=183MiB/s

    sdb.test1.out: READ: bw=213MiB/s
    sdb.test2.out: READ: bw=1054MiB/s

    sdc.test1.out: READ: bw=650MiB/s
    sdc.test2.out: READ: bw=650MiB/s

    cgroup v2 runs
    --------------

    sda.test1.out: READ: bw=915MiB/s
    sda.test2.out: READ: bw=184MiB/s

    sdb.test1.out: READ: bw=216MiB/s
    sdb.test2.out: READ: bw=1069MiB/s

    sdc.test1.out: READ: bw=621MiB/s
    sdc.test2.out: READ: bw=622MiB/s

    Signed-off-by: Fam Zheng
    Acked-by: Tejun Heo
    Reviewed-by: Paolo Valente

    Signed-off-by: Jens Axboe

    Fam Zheng
     

25 Jun, 2019

1 commit

  • A bfq_queue Q may happen to be synchronized with another
    bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to
    receive new I/O. We call Q2 "waker queue".

    If I/O plugging is being performed for Q, and Q is not receiving any
    more I/O because of the above synchronization, then, thanks to BFQ's
    injection mechanism, the waker queue is likely to get served before
    the I/O-plugging timeout fires.

    Unfortunately, this fact may not be sufficient to guarantee a high
    throughput during the I/O plugging, because the inject limit for Q may
    be too low to guarantee a lot of injected I/O. In addition, the
    duration of the plugging, i.e., the time before Q finally receives new
    I/O, may not be minimized, because the waker queue may happen to be
    served only after other queues.

    To address these issues, this commit introduces the explicit detection
    of the waker queue, and the unconditional injection of a pending I/O
    request of the waker queue on each invocation of
    bfq_dispatch_request().

    One may be concerned that this systematic injection of I/O from the
    waker queue delays the service of Q's I/O. Fortunately, it doesn't. On
    the contrary, next Q's I/O is brought forward dramatically, for it is
    not blocked for milliseconds.
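
    The detection materializes as a new per-queue pointer; a sketch of the
    state added (field name as in mainline, surrounding machinery elided):

    struct bfq_queue {
            ...
            /*
             * The queue whose I/O completions unblock this queue's I/O,
             * i.e., the detected waker queue. On each dispatch, one
             * pending waker request may be injected unconditionally.
             */
            struct bfq_queue *waker_bfqq;
            ...
    };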

    Reported-by: Srivatsa S. Bhat (VMware)
    Tested-by: Srivatsa S. Bhat (VMware)
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

22 Apr, 2019

1 commit

  • Pull in v5.1-rc6 to resolve two conflicts. One is in BFQ, in just a
    comment, and is trivial. The other one is a conflict due to a later fix
    in the bio multi-page work, and needs a bit more care.

    * tag 'v5.1-rc6': (770 commits)
    Linux 5.1-rc6
    block: make sure that bvec length can't be overflow
    block: kill all_q_node in request_queue
    x86/cpu/intel: Lower the "ENERGY_PERF_BIAS: Set to normal" message's log priority
    coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
    mm/kmemleak.c: fix unused-function warning
    init: initialize jump labels before command line option parsing
    kernel/watchdog_hld.c: hard lockup message should end with a newline
    kcov: improve CONFIG_ARCH_HAS_KCOV help text
    mm: fix inactive list balancing between NUMA nodes and cgroups
    mm/hotplug: treat CMA pages as unmovable
    proc: fixup proc-pid-vm test
    proc: fix map_files test on F29
    mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n
    mm/memory_hotplug: do not unlock after failing to take the device_hotplug_lock
    mm: swapoff: shmem_unuse() stop eviction without igrab()
    mm: swapoff: take notice of completion sooner
    mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES
    mm: swapoff: shmem_find_swap_entries() filter out other types
    slab: store tagged freelist for off-slab slabmgmt
    ...

    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Apr, 2019

1 commit

  • The function bfq_bfqq_expire() invokes the function
    __bfq_bfqq_expire(), and the latter may free the in-service bfq-queue.
    If this happens, then no other instruction of bfq_bfqq_expire() must
    be executed, or a use-after-free will occur.

    Basing on the assumption that __bfq_bfqq_expire() invokes
    bfq_put_queue() on the in-service bfq-queue exactly once, the queue is
    assumed to be freed if its refcounter is equal to one right before
    invoking __bfq_bfqq_expire().

    But this assumption is false since commit 9dee8b3b057e ("block, bfq:
    fix queue removal from weights tree"): __bfq_bfqq_expire() may also
    invoke bfq_weights_tree_remove(), which, after that commit, may itself
    invoke bfq_put_queue(). So __bfq_bfqq_expire() may invoke
    bfq_put_queue() twice, and this is the actual case in which the
    in-service queue may happen to be freed.

    To address this issue, this commit moves the check on the refcounter
    of the queue right around the last bfq_put_queue() that may be invoked
    on the queue.
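
    In sketch form, the resulting pattern is (simplified; the mainline code
    spreads this across __bfq_bfqq_expire() and its callees):

    /*
     * Sample the refcount immediately before the last put this path may
     * perform, so that "ref was 1" really means "freed right here".
     */
    bool freed = (bfqq->ref == 1);  /* is ours the last reference? */

    bfq_put_queue(bfqq);            /* may free bfqq */
    if (freed)
            return;                 /* bfqq is gone: touch nothing else */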

    Fixes: 9dee8b3b057e ("block, bfq: fix queue removal from weights tree")
    Reported-by: Dmitrii Tcvetkov
    Reported-by: Douglas Anderson
    Tested-by: Dmitrii Tcvetkov
    Tested-by: Douglas Anderson
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

01 Apr, 2019

5 commits

  • bfq saves the state of a queue each time a merge occurs, to be
    able to resume such a state when the queue is associated again
    with its original process, on a split.

    Unfortunately, bfq does not also save & restore the weight of the
    queue. If the weight is not correctly restored when the queue is
    recycled, then the weight of the recycled queue could differ
    from the weight of the original queue.

    This commit adds the missing save & restore of the weight.
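
    A sketch of the change, assuming the bic (bfq_io_cq) already carries
    the other saved per-queue fields:

    /* on a merge, in the bfq_bfqq_save_state() path: */
    bic->saved_weight = bfqq->entity.orig_weight;

    /* on a split, in the bfq_bfqq_resume_state() path: */
    bfqq->entity.new_weight = bic->saved_weight;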

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Francesco Pollicino
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Francesco Pollicino
     
  • The function "bfq_log_bfqq" prints the pid of the process
    associated with the queue passed as input.

    Unfortunately, if the queue is shared, then more than one process
    is associated with the queue. The pid that gets printed in this
    case is the pid of one of the associated processes.
    Which process gets printed depends on the exact sequence of merge
    events the queue underwent. So printing such a pid is rather
    useless and above all is often rather confusing because it
    reports a random pid between those of the associated processes.

    This commit addresses this issue by printing SHARED instead of a pid
    if the queue is shared.
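
    One plausible shape of the change (hedged sketch; mainline adds a small
    helper along these lines and marks shared queues with pid == -1):

    static inline void bfq_pid_to_str(int pid, char *str, int len)
    {
            if (pid != -1)
                    snprintf(str, len, "%d", pid);
            else
                    snprintf(str, len, "SHARED-");  /* merged queue */
    }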

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Francesco Pollicino
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Francesco Pollicino
     
  • To boost throughput with a set of processes doing interleaved I/O
    (i.e., a set of processes whose individual I/O is random, but whose
    merged cumulative I/O is sequential), BFQ merges the queues associated
    with these processes, i.e., redirects the I/O of these processes into a
    common, shared queue. In the shared queue, I/O requests are ordered by
    their position on the medium, thus sequential I/O gets dispatched to
    the device when the shared queue is served.

    Queue merging costs execution time, because, to detect which queues to
    merge, BFQ must maintain a list of the head I/O requests of active
    queues, ordered by request positions. Measurements showed that this
    costs about 10% of BFQ's total per-request processing time.

    Request processing time becomes more and more critical as the speed of
    the underlying storage device grows. Yet, fortunately, queue merging
    is basically useless on the very devices that are so fast as to make
    request processing time critical. To reach a high throughput, these
    devices must have many requests queued at the same time. But, in this
    configuration, the internal scheduling algorithms of these devices
    also do the job of queue merging: they reorder requests so as to
    obtain an I/O pattern that is as sequential as possible. As a
    consequence, with processes doing interleaved I/O, the throughput
    reached by one such device is likely to be the same, with and without
    queue merging.

    In view of this fact, this commit disables queue merging, and all
    related housekeeping, for non-rotational devices with internal
    queueing. The total, single-lock-protected, per-request processing
    time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
    (time measured with simple code instrumentation, and using the
    throughput-sync.sh script of the S suite [1], in performance-profiling
    mode). To put this result into context, the total,
    single-lock-protected, per-request execution time of the lightest I/O
    scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is
    ~800 LOC, against ~10500 LOC for BFQ).

    Disabling merging provides a further, remarkable benefit in terms of
    throughput. Merging tends to make many workloads artificially more
    uneven, mainly because shared queues remain non-empty for
    incomparably more time than normal queues. So, if, e.g., one of the
    queues in a set of merged queues has a higher weight than a normal
    queue, then the shared queue may inherit such a high weight and, by
    staying almost always active, may force BFQ to perform I/O plugging
    most of the time. This evidently makes it harder for BFQ to let the
    device reach a high throughput.

    As a practical example of this problem, and of the benefits of this
    commit, we measured again the throughput in the nasty scenario
    considered in previous commit messages: dbench test (in the Phoronix
    suite), with 6 clients, on a filesystem with journaling, and with the
    journaling daemon enjoying a higher weight than normal processes. With
    this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a
    PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any
    of the other I/O schedulers. As such, this is also likely to be the
    maximum possible throughput reachable with this workload on this
    device, because I/O is mostly random, and the other schedulers
    basically just pass I/O requests to the drive as fast as possible.

    [1] https://github.com/Algodev-github/S

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Tested-by: Francesco Pollicino
    Signed-off-by: Alessio Masola
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • The processes associated with a bfq_queue, say Q, may happen to
    generate their cumulative I/O at a lower rate than the rate at which
    the device could serve the same I/O. This is rather probable, e.g., if
    only one process is associated with Q and the device is an SSD. It
    results in Q becoming often empty while in service. If BFQ is not
    allowed to switch to another queue when Q becomes empty, then, during
    the service of Q, there will be frequent "service holes", i.e., time
    intervals during which Q gets empty and the device can only consume
    the I/O already queued in its hardware queues. This easily causes
    considerable losses of throughput.

    To counter this problem, BFQ implements a request injection mechanism,
    which tries to fill the above service holes with I/O requests taken
    from other bfq_queues. The hard part in this mechanism is finding the
    right amount of I/O to inject, so as to both boost throughput and not
    break Q's bandwidth and latency guarantees. To this goal, the current
    version of this mechanism measures the bandwidth enjoyed by Q while it
    is being served, and tries to inject the maximum possible amount of
    extra service that does not cause Q's bandwidth to decrease too
    much.

    This solution has an important shortcoming. For bandwidth measurements
    to be stable and reliable, Q must remain in service for a much longer
    time than that needed to serve a single I/O request. Unfortunately,
    this does not hold with many workloads. This commit addresses this
    issue by changing the way the amount of injection allowed is
    dynamically computed. It tunes injection as a function of the service
    times of single I/O requests of Q, instead of Q's
    bandwidth. Single-request service times are evidently meaningful even
    if Q gets very few I/O requests completed while it is in service.

    As a testbed for this new solution, we measured the throughput reached
    by BFQ for one of the nastiest workloads and configurations for this
    scheduler: the workload generated by the dbench test (in the Phoronix
    suite), with 6 clients, on a filesystem with journaling, and with the
    journaling daemon enjoying a higher weight than normal processes.
    With this commit, the throughput grows from ~100 MB/s to ~150 MB/s on
    a PLEXTOR PX-256M5.

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Tested-by: Francesco Pollicino
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • In most cases, it is detrimental for throughput to plug I/O dispatch
    when the in-service bfq_queue becomes temporarily empty (plugging is
    performed to wait for the possible arrival, soon, of new I/O from the
    in-service queue). There is however a case where plugging is needed
    for service guarantees. If a bfq_queue, say Q, has a higher weight
    than some other active bfq_queue, and is sync, i.e., contains sync
    I/O, then, to guarantee that Q does receive a higher share of the
    throughput than other lower-weight queues, it is necessary to plug I/O
    dispatch when Q remains temporarily empty while being served.

    For this reason, BFQ performs I/O plugging when some active bfq_queue
    has a higher weight than some other active bfq_queue. But this is
    overkill. In fact, if the in-service bfq_queue actually
    has a weight lower than or equal to the other queues, then the queue
    *must not* be guaranteed a higher share of the throughput than the
    other queues. So, not plugging I/O cannot cause any harm to the
    queue. And can boost throughput.

    Taking advantage of this fact, this commit does not plug I/O for sync
    bfq_queues with a weight lower than or equal to the weights of the
    other queues. Here is an example of the resulting throughput boost
    with the dbench workload, which is particularly nasty for BFQ. With
    the dbench test in the Phoronix suite, BFQ reaches its lowest total
    throughput with 6 clients on a filesystem with journaling, in case the
    journaling daemon has a higher weight than normal processes. Before
    this commit, the total throughput was ~80 MB/sec on a PLEXTOR PX-256M5,
    after this commit it is ~100 MB/sec.

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

01 Feb, 2019

2 commits

  • When a new I/O request arrives for a bfq_queue, say Q, bfq checks
    whether that request is close to
    (a) the head request of some other queue waiting to be served, or
    (b) the last request dispatched for the in-service queue (in case Q
    itself is not the in-service queue)

    If a queue, say Q2, is found for which the above condition holds, then
    bfq merges Q and Q2, to hopefully get a more sequential I/O in the
    resulting merged queue, and thus a possibly higher throughput.

    Case (b) is checked by comparing the new request for Q with the last
    request dispatched, assuming that the latter necessarily belonged to the
    in-service queue. Unfortunately, this assumption is no longer always
    correct, since commit d0edc2473be9 ("block, bfq: inject other-queue I/O
    into seeky idle queues on NCQ flash").

    When the assumption does not hold, queues that must not be merged may be
    merged, causing unexpected loss of control on per-queue service
    guarantees.

    This commit solves this problem by adding an extra field, which stores
    the actual last request dispatched for the in-service queue, and by
    using this new field to correctly check case (b).
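
    The extra field is simply the position of that last dispatched request;
    a sketch of the addition (mainline calls it in_serv_last_pos):

    struct bfq_data {
            ...
            /*
             * On-disk position of the last request dispatched for the
             * in-service bfq_queue; used to check case (b) correctly.
             */
            sector_t in_serv_last_pos;
            ...
    };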

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • In asymmetric scenarios, i.e., when some bfq_queue or bfq_group needs to
    be guaranteed a different bandwidth than other bfq_queues or bfq_groups,
    these service guarantees can be provided only by plugging I/O dispatch,
    completely or partially, when the queue in service remains temporarily
    empty. A case where asymmetry is particularly strong is when some active
    bfq_queues belong to a higher-priority class than some other active
    bfq_queues. Unfortunately, this important case is not considered at all
    in the code for detecting asymmetric scenarios. This commit adds the
    missing logic.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

07 Dec, 2018

1 commit

  • Since commit '2d29c9f89fcd ("block, bfq: improve asymmetric scenarios
    detection")', if there are process groups with I/O requests waiting for
    completion, then BFQ tags the scenario as 'asymmetric'. This detection
    is needed for preserving service guarantees (for details, see the
    comments on the computation of the variable asymmetric_scenario in the
    function bfq_better_to_idle).

    Unfortunately, commit '2d29c9f89fcd ("block, bfq: improve asymmetric
    scenarios detection")' contains an error exactly in the updating of
    the number of groups with I/O requests waiting for completion: if a
    group has more than one descendant process, then the above number of
    groups, which is renamed from num_active_groups to a more appropriate
    num_groups_with_pending_reqs by this commit, may happen to be wrongly
    decremented multiple times, namely every time one of the descendant
    processes gets all its pending I/O requests completed.

    A correct, complete solution should work as follows. Consider a group
    that is inactive, i.e., that has no descendant process with pending
    I/O inside BFQ queues. Then suppose that num_groups_with_pending_reqs
    is still accounting for this group, because the group still has some
    descendant process with some I/O request still in
    flight. num_groups_with_pending_reqs should be decremented when the
    in-flight request of the last descendant process is finally completed
    (assuming that nothing else has changed for the group in the meantime,
    in terms of composition of the group and active/inactive state of
    child groups and processes). To accomplish this, an additional
    pending-request counter must be added to entities, and must be
    updated correctly.

    To avoid this additional field and operations, this commit resorts to
    the following tradeoff between simplicity and accuracy: for an
    inactive group that is still counted in num_groups_with_pending_reqs,
    this commit decrements num_groups_with_pending_reqs when the first
    descendant process of the group remains with no request waiting for
    completion.
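
    A sketch of the simplified decrement, close in shape to the fix (a
    per-entity flag prevents double decrements):

    void bfq_weights_tree_remove(struct bfq_data *bfqd,
                                 struct bfq_queue *bfqq)
    {
            struct bfq_entity *entity = bfqq->entity.parent;

            for_each_entity(entity) {       /* walk up the group tree */
                    struct bfq_sched_data *sd = entity->my_sched_data;

                    if (sd->next_in_service || sd->in_service_entity)
                            break;          /* group is still active */

                    if (entity->in_groups_with_pending_reqs) {
                            entity->in_groups_with_pending_reqs = false;
                            bfqd->num_groups_with_pending_reqs--;
                    }
            }
    }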

    This simplified scheme provides a fix to the unbalanced decrements
    introduced by 2d29c9f89fcd. Since this error was also caused by lack
    of comments on this non-trivial issue, this commit also adds related
    comments.

    Fixes: 2d29c9f89fcd ("block, bfq: improve asymmetric scenarios detection")
    Reported-by: Steven Barrett
    Tested-by: Steven Barrett
    Tested-by: Lucjan Lucjanov
    Reviewed-by: Federico Motta
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

14 Oct, 2018

1 commit

  • bfq defines as asymmetric a scenario where an active entity, say E
    (representing either a single bfq_queue or a group of other entities),
    has a higher weight than some other entities. If the entity E does sync
    I/O in such a scenario, then bfq plugs the dispatch of the I/O of the
    other entities in the following situation: E is in service but
    temporarily has no pending I/O request. In fact, without this plugging,
    all the times that E stops being temporarily idle, it may find the
    internal queues of the storage device already filled with an
    out-of-control number of extra requests, from other entities. So E may
    have to wait for the service of these extra requests, before finally
    having its own requests served. This may easily break service
    guarantees, with E getting less than its fair share of the device
    throughput. Usually, the end result is that E gets the same fraction of
    the throughput as the other entities, instead of getting more, according
    to its higher weight.

    Yet there are two other more subtle cases where E, even if its weight is
    actually equal to or even lower than the weight of any other active
    entities, may get less than its fair share of the throughput in case the
    above I/O plugging is not performed:
    1. other entities issue larger requests than E;
    2. other entities contain more active child entities than E (or in
    general tend to have more backlog than E).

    In the first case, other entities may get more service than E because
    they get larger requests, than those of E, served during the temporary
    idle periods of E. In the second case, other entities get more service
    because, by having many child entities, they have many requests ready
    for dispatching while E is temporarily idle.

    This commit addresses this issue by extending the definition of
    asymmetric scenario: a scenario is asymmetric when
    - active entities representing bfq_queues have differentiated weights,
    as in the original definition
    or (inclusive)
    - one or more entities representing groups of entities are active.

    This broader definition makes sure that I/O plugging will be performed
    in all the above cases, provided that there is at least one active
    group. Of course, this definition is very coarse, so it will trigger
    I/O plugging also in cases where it is not needed, such as, e.g.,
    multiple active entities with just one child each, and all with the same
    I/O-request size. The reason for this coarse definition is just that a
    finer-grained definition would be rather heavy to compute.

    On the opposite end, even this new definition does not trigger I/O
    plugging in all cases where there is no active group, and all bfq_queues
    have the same weight. So, in these cases some unfairness may occur if
    there are asymmetries in I/O-request sizes. We made this choice because
    I/O plugging may lower throughput, and probably a user that has not
    created any group cares more about throughput than about perfect
    fairness. At any rate, as for possible applications that may care about
    service guarantees, bfq already guarantees a high responsiveness and a
    low latency to soft real-time applications automatically.
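
    Concretely, the broadened detection reduces to a check of this shape
    (sketch; varied_queue_weights() is a hypothetical helper standing in
    for the existing test on the tree of distinct queue weights):

    /*
     * The scenario is symmetric only if no two active bfq_queues have
     * different weights and no group entity is active at all.
     */
    static bool bfq_symmetric_scenario(struct bfq_data *bfqd)
    {
            return !varied_queue_weights(bfqd) &&
                   bfqd->num_active_groups == 0;
    }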

    Signed-off-by: Federico Motta
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Federico Motta
     

15 Sep, 2018

1 commit

  • The Achilles' heel of BFQ is its failing to reach a high throughput
    with sync random I/O on flash storage with internal queueing, in case
    the processes doing I/O have differentiated weights.

    The cause of this failure is as follows. If at least two processes do
    sync I/O, and have a different weight from each other, then BFQ plugs
    I/O dispatching every time one of these processes, while it is being
    served, remains temporarily without pending I/O requests. This
    plugging is necessary to guarantee that every process enjoys a
    bandwidth proportional to its weight; but it empties the internal
    queue(s) of the drive. And this kills throughput with random I/O. So,
    if some processes have differentiated weights and do both sync and
    random I/O, the end result is a throughput collapse.

    This commit tries to counter this problem by injecting the service of
    other processes, in a controlled way, while the process in service
    happens to have no I/O. This injection is performed only if the medium
    is non rotational and performs internal queueing, and the process in
    service does random I/O (service injection might be beneficial for
    sequential I/O too, we'll work on that).

    As an example of the benefits of this commit, on a PLEXTOR PX-256M5S
    SSD, and with five processes having differentiated weights and doing
    sync random 4KB I/O, this commit makes the throughput with bfq grow by
    400%, from 25 to 100MB/s. This higher throughput is 10MB/s lower than
    that reached with the none scheduler. As some less random I/O is added
    to the mix, the throughput becomes equal to or higher than that with
    none.

    This commit is a very first attempt to recover throughput without
    losing control, and certainly has many limitations. One is, e.g., that
    the processes whose service is injected are not chosen so as to
    distribute the extra bandwidth they receive in accordance to their
    weights. Thus there might be loss of weighted fairness in some
    cases. Anyway, this loss concerns extra service, which would not have
    been received at all without this commit. Other limitations and issues
    will probably show up with usage.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

09 Jul, 2018

1 commit

  • To keep I/O throughput high as often as possible, BFQ performs
    I/O-dispatch plugging (aka device idling) only when beneficial exactly
    for throughput, or when needed for service guarantees (low latency,
    fairness). An important case where the latter condition holds is when
    the scenario is 'asymmetric' in terms of weights: i.e., when some
    bfq_queue or whole group of queues has a higher weight, and thus has
    to receive more service, than other queues or groups. Without dispatch
    plugging, lower-weight queues/groups may unjustly steal bandwidth from
    higher-weight queues/groups.

    To detect asymmetric scenarios, BFQ checks some sufficient
    conditions. One of these conditions is that active groups have
    different weights. BFQ controls this condition by maintaining a
    special set of unique weights of active groups
    (group_weights_tree). To this purpose, in the function
    bfq_active_insert/bfq_active_extract BFQ adds/removes the weight of a
    group to/from this set.

    Unfortunately, the function bfq_active_extract may happen to be
    invoked also for a group that is still active (to preserve the correct
    update of the next queue to serve, see comments in function
    bfq_no_longer_next_in_service() for details). In this case, removing
    the weight of the group makes the set group_weights_tree
    inconsistent. Service-guarantee violations follow.

    This commit addresses this issue by moving group_weights_tree
    insertions from their previous location (in bfq_active_insert) into
    the function __bfq_activate_entity, and by moving group_weights_tree
    extractions from bfq_active_extract to when the entity that represents
    a group remains thoroughly idle, i.e., with no request either enqueued
    or dispatched.

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

31 May, 2018

1 commit

  • BFQ computes the duration of weight raising for interactive
    applications automatically, using some reference parameters. In
    particular, BFQ uses the best durations (see comments in the code for
    how these durations have been assessed) for two classes of systems:
    slow and fast ones. Examples of slow systems are old phones or systems
    using micro HDDs. Fast systems are all the remaining ones. Using these
    parameters, BFQ computes the actual duration of the weight raising,
    for the system at hand, as a function of the relative speed of the
    system w.r.t. the speed of a reference system, belonging to the same
    class of systems as the system at hand.

    This slow vs fast differentiation proved to be useful in the past, but
    happens to have little meaning with current hardware. Even worse, it
    does cause problems in virtual systems, where the speed of the system
    can vary frequently, and so widely as to confuse the class-detection
    mechanism and, as we have verified experimentally, to cause BFQ to
    compute nonsensical weight-raising durations.

    This commit addresses this issue by removing the slow class and the
    class-detection mechanism.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

09 May, 2018

1 commit

  • cfq and bfq have some internal fields that use sched_clock() which can
    trivially use ktime_get_ns() instead. Their timestamp fields in struct
    request can also use ktime_get_ns(), which resolves the 8 year old
    comment added by commit 28f4197e5d47 ("block: disable preemption before
    using sched_clock()").

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

27 Mar, 2018

1 commit

  • If a storage device handled by BFQ happens to be slower than 7.5 KB/s
    for a certain amount of time (in the order of a second), then the
    estimated peak rate of the device, maintained in BFQ, becomes equal to
    0. The reason is the limited precision with which the rate is
    represented (details on the range of representable values in the
    comments introduced by this commit). This leads to a division-by-zero
    error where the estimated peak rate is used as divisor. Such a type of
    failure has been reported in [1].

    This commit addresses this issue by:
    1. Lower-bounding the estimated peak rate to 1 (see the sketch below)
    2. Adding and improving comments on the range of rates representable

    [1] https://www.spinics.net/lists/kernel/msg2739205.html
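
    The lower bound itself is a one-liner wherever the rate estimate is
    updated (sketch):

    /* never let the estimated peak rate reach 0: it is used as a divisor */
    bfqd->peak_rate = max_t(u32, 1, bfqd->peak_rate);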

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

18 Jan, 2018

2 commits

  • To maximise responsiveness, BFQ raises the weight, and performs device
    idling, for bfq_queues associated with processes deemed as
    interactive. In particular, weight raising has a maximum duration,
    equal to the time needed to start a large application. If a
    weight-raised process goes on doing I/O beyond this maximum duration,
    it loses weight-raising.

    This mechanism is evidently vulnerable to the following false
    positives: I/O-bound applications that will go on doing I/O for much
    longer than the duration of weight-raising. These applications have
    basically no benefit from being weight-raised at the beginning of
    their I/O. On the opposite end, while being weight-raised, these
    applications
    a) unjustly steal throughput from applications that may truly need
    low latency;
    b) make BFQ uselessly perform device idling; device idling results
    in loss of device throughput with most flash-based storage, and may
    increase latencies when used purposelessly.

    This commit adds a countermeasure to reduce both the above
    problems. To introduce this countermeasure, we provide the following
    extra piece of information (full details in the comments added by this
    commit). During the start-up of the large application used as a
    reference to set the duration of weight-raising, involved processes
    transfer at most ~110K sectors each. Accordingly, a process initially
    deemed as interactive has no right to be weight-raised any longer,
    once transferred 110K sectors or more.

    Based on this consideration, this commit early-ends weight-raising
    for a bfq_queue if the latter happens to have received an amount of
    service at least equal to 110K sectors (actually, a little bit more,
    to keep a safety margin). I/O-bound applications that reach a high
    throughput, such as file copy, get to this threshold much before the
    allowed weight-raising period finishes. Thus this early ending of
    weight-raising reduces the amount of time during which these
    applications cause the problems described above.
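
    In sketch form, assuming the threshold constant and the per-queue
    service counter that this commit introduces:

    /* ~110K sectors, plus a safety margin */
    static const unsigned long max_service_from_wr = 120000;

    /* in the weight-raising update path: */
    if (bfqq->wr_coeff > 1 &&
        bfqq->service_from_wr > max_service_from_wr)
            bfq_bfqq_end_wr(bfqq);  /* early-end weight raising */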

    Tested-by: Oleksandr Natalenko
    Tested-by: Holger Hoffstätte
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Asynchronous I/O can easily starve synchronous I/O (both sync reads
    and sync writes), by consuming all request tags. Similarly, storms of
    synchronous writes, such as those that sync(2) may trigger, can starve
    synchronous reads. In their turn, these two problems may also cause
    BFQ to lose control over latency for interactive and soft real-time
    applications. For example, on a PLEXTOR PX-256M5S SSD, LibreOffice
    Writer takes 0.6 seconds to start if the device is idle, but it takes
    more than 45 seconds (!) if there are sequential writes in the
    background.

    This commit addresses this issue by limiting the maximum percentage of
    tags that asynchronous I/O requests and synchronous write requests can
    consume. In particular, this commit grants a higher threshold to
    synchronous writes, to prevent the latter from being starved by
    asynchronous I/O.
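
    The limits are applied by shrinking the tag depth that each request
    class may use (paraphrased sketch; percentages as described above):

    /* carve per-class depth limits out of the total tag depth */
    unsigned int depth = bt->sb.depth;                   /* total tags */

    bfqd->word_depths[0][0] = max(depth >> 1, 1U);       /* async: <= 50% */
    bfqd->word_depths[0][1] = max((depth * 3) >> 2, 1U); /* sync wr: <= 75% */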

    According to the above test, LibreOffice Writer now starts in about
    1.2 seconds on average, regardless of the background workload, and
    apart from some rare outlier. To check this improvement, run, e.g.,
    sudo ./comm_startup_lat.sh bfq 5 5 seq 10 "lowriter --terminate_after_init"
    for the comm_startup_lat benchmark in the S suite [1].

    [1] https://github.com/Algodev-github/S

    Tested-by: Oleksandr Natalenko
    Tested-by: Holger Hoffstätte
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

06 Jan, 2018

1 commit

  • In BFQ and CFQ, two processes are said to be cooperating if they do
    I/O in such a way that the union of their I/O requests yields a
    sequential I/O pattern. To get such a sequential I/O pattern out of
    the non-sequential pattern of each cooperating process, BFQ and CFQ
    merge the queues associated with these processes. In more detail,
    cooperating processes, and thus their associated queues, usually
    start, or restart, to do I/O shortly after each other. This is the
    case, e.g., for the I/O threads of KVM/QEMU and of the dump
    utility. Based on this assumption, this commit allows a bfq_queue to
    be merged only during a short time interval (100ms) after it starts,
    or re-starts, to do I/O. This filtering provides two important
    benefits.

    First, it greatly reduces the probability that two non-cooperating
    processes have their queues merged by mistake, if they just happen to
    do I/O close to each other for a short time interval. These spurious
    merges cause loss of service guarantees. A low-weight bfq_queue may
    unjustly get more than its expected share of the throughput: if such a
    low-weight queue is merged with a high-weight queue, then the I/O for
    the low-weight queue is served as if the queue had a high weight. This
    may damage other high-weight queues unexpectedly. For instance,
    because of this issue, lxterminal occasionally took 7.5 seconds to
    start, instead of 6.5 seconds, when some sequential readers and
    writers did I/O in the background on a FUJITSU MHX2300BT HDD. The
    reason is that the bfq_queues associated with some of the readers or
    the writers were merged with the high-weight queues of some processes
    that had to do some urgent but little I/O. The readers then exploited
    the inherited high weight for all or most of their I/O, during the
    start-up of terminal. The filtering introduced by this commit
    eliminated any outlier caused by spurious queue merges in our start-up
    time tests.

    This filtering also provides a little boost of the throughput
    sustainable by BFQ: 3-4%, depending on the CPU. The reason is that,
    once a bfq_queue cannot be merged any longer, this commit makes BFQ
    stop updating the data needed to handle merging for the queue.
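
    The time-window check itself is small; a sketch close to the shape of
    the change:

    static const unsigned long bfq_merge_time_limit = HZ/10;   /* 100 ms */

    static bool bfq_too_late_for_merging(struct bfq_queue *bfqq)
    {
            return bfqq->service_from_backlogged > 0 &&
                    time_is_before_jiffies(bfqq->first_IO_time +
                                           bfq_merge_time_limit);
    }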

    Signed-off-by: Paolo Valente
    Signed-off-by: Angelo Ruocco
    Signed-off-by: Jens Axboe

    Paolo Valente
     

15 Nov, 2017

1 commit

  • BFQ currently creates, and updates, its own instance of the whole
    set of blkio statistics that cfq creates. Yet, from the comments
    of Tejun Heo in [1], it turned out that most of these statistics
    are meant/useful only for debugging. This commit makes BFQ create
    the latter, debugging statistics only if the option
    CONFIG_DEBUG_BLK_CGROUP is set.
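
    In practice the debug-only stat updates compile away entirely when the
    option is unset; a sketch of the gating pattern:

    #if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
    void bfqg_stats_update_io_add(struct bfq_group *bfqg,
                                  struct bfq_queue *bfqq, unsigned int op);
    /* ...full stat-updating implementations... */
    #else
    static inline void bfqg_stats_update_io_add(struct bfq_group *bfqg,
                                                struct bfq_queue *bfqq,
                                                unsigned int op) { }
    /* ...empty stubs: callers stay unconditional, updates vanish... */
    #endif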

    By doing so, this commit also enables BFQ to enjoy a high performance
    boost. The reason is that, if CONFIG_DEBUG_BLK_CGROUP is not set, then
    BFQ has to update far fewer statistics, and, in particular, not the
    heaviest to update. To give an idea of the benefits, if
    CONFIG_DEBUG_BLK_CGROUP is not set, then, on an Intel i7-4850HQ, and
    with 8 threads doing random I/O in parallel on null_blk (configured
    with 0 latency), the throughput of BFQ grows from 310 to 400 KIOPS
    (+30%). We have measured similar or even much higher boosts with other
    CPUs: e.g., +45% with an ARM CortexTM-A53 Octa-core. Our results have
    been obtained and can be reproduced very easily with the script in [1].

    [1] https://www.spinics.net/lists/linux-block/msg18943.html

    Suggested-by: Tejun Heo
    Suggested-by: Ulf Hansson
    Tested-by: Lee Tibbert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Luca Miccio
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Luca Miccio
     

31 Aug, 2017

1 commit

  • To provide a very smooth service, bfq starts to serve a bfq_queue
    only if the queue is 'eligible', i.e., if the same queue would
    have started to be served in the ideal, perfectly fair system that
    bfq simulates internally. This is obtained by associating each
    queue with a virtual start time, and by computing a special system
    virtual time quantity: a queue is eligible only if the system
    virtual time has reached the virtual start time of the
    queue. Finally, bfq guarantees that, when a new queue must be set
    in service, there is always at least one eligible entity for each
    active parent entity in the scheduler. To provide this guarantee,
    the function __bfq_lookup_next_entity pushes up, for each parent
    entity on which it is invoked, the system virtual time to the
    minimum among the virtual start times of the entities in the
    active tree for the parent entity (more precisely, the push up
    occurs if the system virtual time happens to be lower than all
    such virtual start times).

    There is however a circumstance in which __bfq_lookup_next_entity
    cannot push up the system virtual time for a parent entity, even
    if the system virtual time is lower than the virtual start times
    of all the child entities in the active tree. It happens if one of
    the child entities is in service. In fact, in such a case, there
    is already an eligible entity, the in-service one, even if it may
    not be present in the active tree (because in-service entities
    may be removed from the active tree).

    Unfortunately, in the last re-design of the
    hierarchical-scheduling engine, the reset of the pointer to the
    in-service entity for a given parent entity--reset to be done as a
    consequence of the expiration of the in-service entity--always
    happens after the function __bfq_lookup_next_entity has been
    invoked. This causes the function to think that there is still an
    entity in service for the parent entity, and then that the system
    virtual time cannot be pushed up, even if actually such a
    no-more-in-service entity has already been properly reinserted
    into the active tree (or in some other tree if no more
    active). Yet, the system virtual time *had* to be pushed up, to be
    ready to correctly choose the next queue to serve. Because of the
    lack of this push up, bfq may wrongly set in service a queue that
    had been speculatively pre-computed as the possible
    next-in-service queue, but that would no more be the one to serve
    after the expiration and the reinsertion into the active trees of
    the previously in-service entities.

    This commit addresses this issue by making
    __bfq_lookup_next_entity properly push up the system virtual time
    if an expiration is occurring.
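
    A sketch of the fix: the lookup is told whether an expiration is in
    progress, and disregards the stale in-service pointer in that case
    (paraphrased, details elided):

    static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
                                                     bool expiration)
    {
            ...
            /*
             * Consider the in-service entity only if no expiration is in
             * progress; otherwise let the virtual time be pushed up.
             */
            entity = __bfq_lookup_next_entity(st + class_idx,
                                              sd->in_service_entity &&
                                              !expiration);
            ...
    }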

    Signed-off-by: Paolo Valente
    Tested-by: Lee Tibbert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Jens Axboe

    Paolo Valente
     

11 Aug, 2017

1 commit

  • The logic that decides whether to idle the device is scattered across
    three functions. Almost all of the logic is in the function
    bfq_bfqq_may_idle, but (1) part of the decision is made in
    bfq_update_idle_window, and (2) the function bfq_bfqq_must_idle may
    switch off idling regardless of the output of bfq_bfqq_may_idle. In
    addition, both bfq_update_idle_window and bfq_bfqq_must_idle make
    their decisions as a function of parameters that are used, for similar
    purposes, also in bfq_bfqq_may_idle. This commit addresses these
    issues by moving all the logic into bfq_bfqq_may_idle.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

30 Jul, 2017

1 commit

  • Groups of BFQ queues are represented by generic entities in BFQ. When
    a queue belonging to a parent entity is deactivated, the parent entity
    may need to be deactivated too, in case the deactivated queue was the
    only active queue for the parent entity. This deactivation may need to
    be propagated upwards if the entity belongs, in its turn, to a further
    higher-level entity, and so on. In particular, the upward propagation
    of deactivation stops at the first parent entity that remains active
    even if one of its child entities has been deactivated.

    To decide whether the last non-deactivation condition holds for a
    parent entity, BFQ checks whether the field next_in_service is still
    not NULL for the parent entity, after the deactivation of one of its
    child entity. If it is not NULL, then there are certainly other active
    entities in the parent entity, and deactivations can stop.

    Unfortunately, this check misses a corner case: if in_service_entity
    is not NULL, then next_in_service may happen to be NULL, although the
    parent entity is evidently active. This happens if: 1) the entity
    pointed by in_service_entity is the only active entity in the parent
    entity, and 2) according to the definition of next_in_service, the
    in_service_entity cannot be considered as next_in_service. See the
    comments on the definition of next_in_service for details on this
    second point.

    Hitting the above corner case causes crashes.

    To address this issue, this commit:
    1) Extends the above check on only next_in_service to controlling both
    next_in_service and in_service_entity (if any of them is not NULL,
    then no further deactivation is performed), as sketched below
    2) Improves the (important) comments on how next_in_service is defined
    and updated; in particular it fixes a few rather obscure paragraphs
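
    The extended stop condition in the upward-deactivation loop looks
    roughly like this (sketch):

    for_each_entity(entity) {       /* propagate deactivation upwards */
            struct bfq_sched_data *sd = entity->my_sched_data;

            ...
            if (sd->next_in_service || sd->in_service_entity) {
                    /*
                     * The parent entity is still active: either another
                     * child is queued, or a child is in service.
                     */
                    break;
            }
            ...
    }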

    Reported-by: Eric Wheeler
    Reported-by: Rick Yiu
    Reported-by: Tom X Nguyen
    Signed-off-by: Paolo Valente
    Tested-by: Eric Wheeler
    Tested-by: Rick Yiu
    Tested-by: Laurentiu Nicola
    Tested-by: Tom X Nguyen
    Signed-off-by: Jens Axboe

    Paolo Valente
     

29 Jul, 2017

1 commit

  • Currently cfq/bfq/blk-throttle output cgroup info in trace in their own
    way. Now we have standard blktrace API for this, so convert them to use
    it.

    Note, this changes the behavior a little bit. cgroup info isn't output
    by default, we only do this with 'blk_cgroup' option enabled. cgroup
    info isn't output as a string by default too, we only do this with
    'blk_cgname' option enabled. Also, cgroup info is output in a different
    position of the note string. I think these behavior changes aren't a big
    issue (actually we make trace data shorter which is good), since the
    blktrace note is solely for debugging.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li