06 Apr, 2019

1 commit

  • [ Upstream commit 058fdecc6de7cdecbf4c59b851e80eb2d6c5295f ]

    When a new I/O request arrives for a bfq_queue, say Q, bfq checks
    whether that request is close to
    (a) the head request of some other queue waiting to be served, or
    (b) the last request dispatched for the in-service queue (in case Q
    itself is not the in-service queue)

    If a queue, say Q2, is found for which the above condition holds, then
    bfq merges Q and Q2, to hopefully get a more sequential I/O in the
    resulting merged queue, and thus a possibly higher throughput.

    Case (b) is checked by comparing the new request for Q with the last
    request dispatched, assuming that the latter necessarily belonged to the
    in-service queue. Unfortunately, this assumption is no longer always
    correct, since commit d0edc2473be9 ("block, bfq: inject other-queue I/O
    into seeky idle queues on NCQ flash").

    When the assumption does not hold, queues that must not be merged may be
    merged, causing unexpected loss of control on per-queue service
    guarantees.

    This commit solves this problem by adding an extra field, which stores
    the actual last request dispatched for the in-service queue, and by
    using this new field to correctly check case (b).

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Paolo Valente
     

09 Jul, 2018

1 commit

  • To keep I/O throughput high as often as possible, BFQ performs
    I/O-dispatch plugging (aka device idling) only when beneficial exactly
    for throughput, or when needed for service guarantees (low latency,
    fairness). An important case where the latter condition holds is when
    the scenario is 'asymmetric' in terms of weights: i.e., when some
    bfq_queue or whole group of queues has a higher weight, and thus has
    to receive more service, than other queues or groups. Without dispatch
    plugging, lower-weight queues/groups may unjustly steal bandwidth to
    higher-weight queues/groups.

    To detect asymmetric scenarios, BFQ checks some sufficient
    conditions. One of these conditions is that active groups have
    different weights. BFQ controls this condition by maintaining a
    special set of unique weights of active groups
    (group_weights_tree). To this purpose, in the function
    bfq_active_insert/bfq_active_extract BFQ adds/removes the weight of a
    group to/from this set.

    Unfortunately, the function bfq_active_extract may happen to be
    invoked also for a group that is still active (to preserve the correct
    update of the next queue to serve, see comments in function
    bfq_no_longer_next_in_service() for details). In this case, removing
    the weight of the group makes the set group_weights_tree
    inconsistent. Service-guarantee violations follow.

    This commit addresses this issue by moving group_weights_tree
    insertions from their previous location (in bfq_active_insert) into
    the function __bfq_activate_entity, and by moving group_weights_tree
    extractions from bfq_active_extract to when the entity that represents
    a group remains throughly idle, i.e., with no request either enqueued
    or dispatched.

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

31 May, 2018

1 commit

  • BFQ computes the duration of weight raising for interactive
    applications automatically, using some reference parameters. In
    particular, BFQ uses the best durations (see comments in the code for
    how these durations have been assessed) for two classes of systems:
    slow and fast ones. Examples of slow systems are old phones or systems
    using micro HDDs. Fast systems are all the remaining ones. Using these
    parameters, BFQ computes the actual duration of the weight raising,
    for the system at hand, as a function of the relative speed of the
    system w.r.t. the speed of a reference system, belonging to the same
    class of systems as the system at hand.

    This slow vs fast differentiation proved to be useful in the past, but
    happens to have little meaning with current hardware. Even worse, it
    does cause problems in virtual systems, where the speed of the system
    can vary frequently, and so widely to just confuse the class-detection
    mechanism, and, as we have verified experimentally, to cause BFQ to
    compute non-sensical weight-raising durations.

    This commit addresses this issue by removing the slow class and the
    class-detection mechanism.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

11 May, 2018

1 commit


09 May, 2018

1 commit

  • cfq and bfq have some internal fields that use sched_clock() which can
    trivially use ktime_get_ns() instead. Their timestamp fields in struct
    request can also use ktime_get_ns(), which resolves the 8 year old
    comment added by commit 28f4197e5d47 ("block: disable preemption before
    using sched_clock()").

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

27 Mar, 2018

1 commit

  • If a storage device handled by BFQ happens to be slower than 7.5 KB/s
    for a certain amount of time (in the order of a second), then the
    estimated peak rate of the device, maintained in BFQ, becomes equal to
    0. The reason is the limited precision with which the rate is
    represented (details on the range of representable values in the
    comments introduced by this commit). This leads to a division-by-zero
    error where the estimated peak rate is used as divisor. Such a type of
    failure has been reported in [1].

    This commit addresses this issue by:
    1. Lower-bounding the estimated peak rate to 1
    2. Adding and improving comments on the range of rates representable

    [1] https://www.spinics.net/lists/kernel/msg2739205.html

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

18 Jan, 2018

2 commits

  • To maximise responsiveness, BFQ raises the weight, and performs device
    idling, for bfq_queues associated with processes deemed as
    interactive. In particular, weight raising has a maximum duration,
    equal to the time needed to start a large application. If a
    weight-raised process goes on doing I/O beyond this maximum duration,
    it loses weight-raising.

    This mechanism is evidently vulnerable to the following false
    positives: I/O-bound applications that will go on doing I/O for much
    longer than the duration of weight-raising. These applications have
    basically no benefit from being weight-raised at the beginning of
    their I/O. On the opposite end, while being weight-raised, these
    applications
    a) unjustly steal throughput to applications that may truly need
    low latency;
    b) make BFQ uselessly perform device idling; device idling results
    in loss of device throughput with most flash-based storage, and may
    increase latencies when used purposelessly.

    This commit adds a countermeasure to reduce both the above
    problems. To introduce this countermeasure, we provide the following
    extra piece of information (full details in the comments added by this
    commit). During the start-up of the large application used as a
    reference to set the duration of weight-raising, involved processes
    transfer at most ~110K sectors each. Accordingly, a process initially
    deemed as interactive has no right to be weight-raised any longer,
    once transferred 110K sectors or more.

    Basing on this consideration, this commit early-ends weight-raising
    for a bfq_queue if the latter happens to have received an amount of
    service at least equal to 110K sectors (actually, a little bit more,
    to keep a safety margin). I/O-bound applications that reach a high
    throughput, such as file copy, get to this threshold much before the
    allowed weight-raising period finishes. Thus this early ending of
    weight-raising reduces the amount of time during which these
    applications cause the problems described above.

    Tested-by: Oleksandr Natalenko
    Tested-by: Holger Hoffstätte
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Asynchronous I/O can easily starve synchronous I/O (both sync reads
    and sync writes), by consuming all request tags. Similarly, storms of
    synchronous writes, such as those that sync(2) may trigger, can starve
    synchronous reads. In their turn, these two problems may also cause
    BFQ to loose control on latency for interactive and soft real-time
    applications. For example, on a PLEXTOR PX-256M5S SSD, LibreOffice
    Writer takes 0.6 seconds to start if the device is idle, but it takes
    more than 45 seconds (!) if there are sequential writes in the
    background.

    This commit addresses this issue by limiting the maximum percentage of
    tags that asynchronous I/O requests and synchronous write requests can
    consume. In particular, this commit grants a higher threshold to
    synchronous writes, to prevent the latter from being starved by
    asynchronous I/O.

    According to the above test, LibreOffice Writer now starts in about
    1.2 seconds on average, regardless of the background workload, and
    apart from some rare outlier. To check this improvement, run, e.g.,
    sudo ./comm_startup_lat.sh bfq 5 5 seq 10 "lowriter --terminate_after_init"
    for the comm_startup_lat benchmark in the S suite [1].

    [1] https://github.com/Algodev-github/S

    Tested-by: Oleksandr Natalenko
    Tested-by: Holger Hoffstätte
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

06 Jan, 2018

1 commit

  • In BFQ and CFQ, two processes are said to be cooperating if they do
    I/O in such a way that the union of their I/O requests yields a
    sequential I/O pattern. To get such a sequential I/O pattern out of
    the non-sequential pattern of each cooperating process, BFQ and CFQ
    merge the queues associated with these processes. In more detail,
    cooperating processes, and thus their associated queues, usually
    start, or restart, to do I/O shortly after each other. This is the
    case, e.g., for the I/O threads of KVM/QEMU and of the dump
    utility. Basing on this assumption, this commit allows a bfq_queue to
    be merged only during a short time interval (100ms) after it starts,
    or re-starts, to do I/O. This filtering provides two important
    benefits.

    First, it greatly reduces the probability that two non-cooperating
    processes have their queues merged by mistake, if they just happen to
    do I/O close to each other for a short time interval. These spurious
    merges cause loss of service guarantees. A low-weight bfq_queue may
    unjustly get more than its expected share of the throughput: if such a
    low-weight queue is merged with a high-weight queue, then the I/O for
    the low-weight queue is served as if the queue had a high weight. This
    may damage other high-weight queues unexpectedly. For instance,
    because of this issue, lxterminal occasionally took 7.5 seconds to
    start, instead of 6.5 seconds, when some sequential readers and
    writers did I/O in the background on a FUJITSU MHX2300BT HDD. The
    reason is that the bfq_queues associated with some of the readers or
    the writers were merged with the high-weight queues of some processes
    that had to do some urgent but little I/O. The readers then exploited
    the inherited high weight for all or most of their I/O, during the
    start-up of terminal. The filtering introduced by this commit
    eliminated any outlier caused by spurious queue merges in our start-up
    time tests.

    This filtering also provides a little boost of the throughput
    sustainable by BFQ: 3-4%, depending on the CPU. The reason is that,
    once a bfq_queue cannot be merged any longer, this commit makes BFQ
    stop updating the data needed to handle merging for the queue.

    Signed-off-by: Paolo Valente
    Signed-off-by: Angelo Ruocco
    Signed-off-by: Jens Axboe

    Paolo Valente
     

15 Nov, 2017

1 commit

  • BFQ currently creates, and updates, its own instance of the whole
    set of blkio statistics that cfq creates. Yet, from the comments
    of Tejun Heo in [1], it turned out that most of these statistics
    are meant/useful only for debugging. This commit makes BFQ create
    the latter, debugging statistics only if the option
    CONFIG_DEBUG_BLK_CGROUP is set.

    By doing so, this commit also enables BFQ to enjoy a high perfomance
    boost. The reason is that, if CONFIG_DEBUG_BLK_CGROUP is not set, then
    BFQ has to update far fewer statistics, and, in particular, not the
    heaviest to update. To give an idea of the benefits, if
    CONFIG_DEBUG_BLK_CGROUP is not set, then, on an Intel i7-4850HQ, and
    with 8 threads doing random I/O in parallel on null_blk (configured
    with 0 latency), the throughput of BFQ grows from 310 to 400 KIOPS
    (+30%). We have measured similar or even much higher boosts with other
    CPUs: e.g., +45% with an ARM CortexTM-A53 Octa-core. Our results have
    been obtained and can be reproduced very easily with the script in [1].

    [1] https://www.spinics.net/lists/linux-block/msg18943.html

    Suggested-by: Tejun Heo
    Suggested-by: Ulf Hansson
    Tested-by: Lee Tibbert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Luca Miccio
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Luca Miccio
     

31 Aug, 2017

1 commit

  • To provide a very smooth service, bfq starts to serve a bfq_queue
    only if the queue is 'eligible', i.e., if the same queue would
    have started to be served in the ideal, perfectly fair system that
    bfq simulates internally. This is obtained by associating each
    queue with a virtual start time, and by computing a special system
    virtual time quantity: a queue is eligible only if the system
    virtual time has reached the virtual start time of the
    queue. Finally, bfq guarantees that, when a new queue must be set
    in service, there is always at least one eligible entity for each
    active parent entity in the scheduler. To provide this guarantee,
    the function __bfq_lookup_next_entity pushes up, for each parent
    entity on which it is invoked, the system virtual time to the
    minimum among the virtual start times of the entities in the
    active tree for the parent entity (more precisely, the push up
    occurs if the system virtual time happens to be lower than all
    such virtual start times).

    There is however a circumstance in which __bfq_lookup_next_entity
    cannot push up the system virtual time for a parent entity, even
    if the system virtual time is lower than the virtual start times
    of all the child entities in the active tree. It happens if one of
    the child entities is in service. In fact, in such a case, there
    is already an eligible entity, the in-service one, even if it may
    not be not present in the active tree (because in-service entities
    may be removed from the active tree).

    Unfortunately, in the last re-design of the
    hierarchical-scheduling engine, the reset of the pointer to the
    in-service entity for a given parent entity--reset to be done as a
    consequence of the expiration of the in-service entity--always
    happens after the function __bfq_lookup_next_entity has been
    invoked. This causes the function to think that there is still an
    entity in service for the parent entity, and then that the system
    virtual time cannot be pushed up, even if actually such a
    no-more-in-service entity has already been properly reinserted
    into the active tree (or in some other tree if no more
    active). Yet, the system virtual time *had* to be pushed up, to be
    ready to correctly choose the next queue to serve. Because of the
    lack of this push up, bfq may wrongly set in service a queue that
    had been speculatively pre-computed as the possible
    next-in-service queue, but that would no more be the one to serve
    after the expiration and the reinsertion into the active trees of
    the previously in-service entities.

    This commit addresses this issue by making
    __bfq_lookup_next_entity properly push up the system virtual time
    if an expiration is occurring.

    Signed-off-by: Paolo Valente
    Tested-by: Lee Tibbert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Jens Axboe

    Paolo Valente
     

29 Aug, 2017

1 commit


11 Aug, 2017

1 commit

  • The logic that decides whether to idle the device is scattered across
    three functions. Almost all of the logic is in the function
    bfq_bfqq_may_idle, but (1) part of the decision is made in
    bfq_update_idle_window, and (2) the function bfq_bfqq_must_idle may
    switch off idling regardless of the output of bfq_bfqq_may_idle. In
    addition, both bfq_update_idle_window and bfq_bfqq_must_idle make
    their decisions as a function of parameters that are used, for similar
    purposes, also in bfq_bfqq_may_idle. This commit addresses these
    issues by moving all the logic into bfq_bfqq_may_idle.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

30 Jul, 2017

1 commit

  • Groups of BFQ queues are represented by generic entities in BFQ. When
    a queue belonging to a parent entity is deactivated, the parent entity
    may need to be deactivated too, in case the deactivated queue was the
    only active queue for the parent entity. This deactivation may need to
    be propagated upwards if the entity belongs, in its turn, to a further
    higher-level entity, and so on. In particular, the upward propagation
    of deactivation stops at the first parent entity that remains active
    even if one of its child entities has been deactivated.

    To decide whether the last non-deactivation condition holds for a
    parent entity, BFQ checks whether the field next_in_service is still
    not NULL for the parent entity, after the deactivation of one of its
    child entity. If it is not NULL, then there are certainly other active
    entities in the parent entity, and deactivations can stop.

    Unfortunately, this check misses a corner case: if in_service_entity
    is not NULL, then next_in_service may happen to be NULL, although the
    parent entity is evidently active. This happens if: 1) the entity
    pointed by in_service_entity is the only active entity in the parent
    entity, and 2) according to the definition of next_in_service, the
    in_service_entity cannot be considered as next_in_service. See the
    comments on the definition of next_in_service for details on this
    second point.

    Hitting the above corner case causes crashes.

    To address this issue, this commit:
    1) Extends the above check on only next_in_service to controlling both
    next_in_service and in_service_entity (if any of them is not NULL,
    then no further deactivation is performed)
    2) Improves the (important) comments on how next_in_service is defined
    and updated; in particular it fixes a few rather obscure paragraphs

    Reported-by: Eric Wheeler
    Reported-by: Rick Yiu
    Reported-by: Tom X Nguyen
    Signed-off-by: Paolo Valente
    Tested-by: Eric Wheeler
    Tested-by: Rick Yiu
    Tested-by: Laurentiu Nicola
    Tested-by: Tom X Nguyen
    Signed-off-by: Jens Axboe

    Paolo Valente
     

29 Jul, 2017

1 commit

  • Currently cfq/bfq/blk-throttle output cgroup info in trace in their own
    way. Now we have standard blktrace API for this, so convert them to use
    it.

    Note, this changes the behavior a little bit. cgroup info isn't output
    by default, we only do this with 'blk_cgroup' option enabled. cgroup
    info isn't output as a string by default too, we only do this with
    'blk_cgname' option enabled. Also cgroup info is output in different
    position of the note string. I think these behavior changes aren't a big
    issue (actually we make trace data shorter which is good), since the
    blktrace note is solely for debugging.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

12 Jul, 2017

1 commit


04 Jul, 2017

1 commit

  • On each deactivation or re-scheduling (after being served) of a
    bfq_queue, BFQ invokes the function __bfq_entity_update_weight_prio(),
    to perform pending updates of ioprio, weight and ioprio class for the
    bfq_queue. BFQ also invokes this function on I/O-request dispatches,
    to raise or lower weights more quickly when needed, thereby improving
    latency. However, the entity representing the bfq_queue may be on the
    active (sub)tree of a service tree when this happens, and, although
    with a very low probability, the bfq_queue may happen to also have a
    pending change of its ioprio class. If both conditions hold when
    __bfq_entity_update_weight_prio() is invoked, then the entity moves to
    a sort of hybrid state: the new service tree for the entity, as
    returned by bfq_entity_service_tree(), differs from service tree on
    which the entity still is. The functions that handle activations and
    deactivations of entities do not cope with such a hybrid state (and
    would need to become more complex to cope).

    This commit addresses this issue by just making
    __bfq_entity_update_weight_prio() not perform also a possible pending
    change of ioprio class, when invoked on an I/O-request dispatch for a
    bfq_queue. Such a change is thus postponed to when
    __bfq_entity_update_weight_prio() is invoked on deactivation or
    re-scheduling of the bfq_queue.

    Reported-by: Marco Piazza
    Reported-by: Laurentiu Nicola
    Signed-off-by: Paolo Valente
    Tested-by: Marco Piazza
    Signed-off-by: Jens Axboe

    Paolo Valente
     

08 Jun, 2017

1 commit

  • In blk-cgroup, operations on blkg objects are protected with the
    request_queue lock. This is no more the lock that protects
    I/O-scheduler operations in blk-mq. In fact, the latter are now
    protected with a finer-grained per-scheduler-instance lock. As a
    consequence, although blkg lookups are also rcu-protected, blk-mq I/O
    schedulers may see inconsistent data when they access blkg and
    blkg-related objects. BFQ does access these objects, and does incur
    this problem, in the following case.

    The blkg_lookup performed in bfq_get_queue, being protected (only)
    through rcu, may happen to return the address of a copy of the
    original blkg. If this is the case, then the blkg_get performed in
    bfq_get_queue, to pin down the blkg, is useless: it does not prevent
    blk-cgroup code from destroying both the original blkg and all objects
    directly or indirectly referred by the copy of the blkg. BFQ accesses
    these objects, which typically causes a crash for NULL-pointer
    dereference of memory-protection violation.

    Some additional protection mechanism should be added to blk-cgroup to
    address this issue. In the meantime, this commit provides a quick
    temporary fix for BFQ: cache (when safe) blkg data that might
    disappear right after a blkg_lookup.

    In particular, this commit exploits the following facts to achieve its
    goal without introducing further locks. Destroy operations on a blkg
    invoke, as a first step, hooks of the scheduler associated with the
    blkg. And these hooks are executed with bfqd->lock held for BFQ. As a
    consequence, for any blkg associated with the request queue an
    instance of BFQ is attached to, we are guaranteed that such a blkg is
    not destroyed, and that all the pointers it contains are consistent,
    while that instance is holding its bfqd->lock. A blkg_lookup performed
    with bfqd->lock held then returns a fully consistent blkg, which
    remains consistent until this lock is held. In more detail, this holds
    even if the returned blkg is a copy of the original one.

    Finally, also the object describing a group inside BFQ needs to be
    protected from destruction on the blkg_free of the original blkg
    (which invokes bfq_pd_free). This commit adds private refcounting for
    this object, to let it disappear only after no bfq_queue refers to it
    any longer.

    This commit also removes or updates some stale comments on locking
    issues related to blk-cgroup operations.

    Reported-by: Tomas Konir
    Reported-by: Lee Tibbert
    Reported-by: Marco Piazza
    Signed-off-by: Paolo Valente
    Tested-by: Tomas Konir
    Tested-by: Lee Tibbert
    Tested-by: Marco Piazza
    Signed-off-by: Jens Axboe

    Paolo Valente
     

20 Apr, 2017

1 commit

  • If we don't have CGROUPS enabled, the compile ends in the
    following misery:

    In file included from ../block/bfq-iosched.c:105:0:
    ../block/bfq-iosched.h:819:22: error: array type has incomplete element type
    extern struct cftype bfq_blkcg_legacy_files[];
    ^
    ../block/bfq-iosched.h:820:22: error: array type has incomplete element type
    extern struct cftype bfq_blkg_files[];
    ^

    Move the declarations under the right ifdef.

    Reported-by: Randy Dunlap
    Signed-off-by: Jens Axboe

    Jens Axboe
     

19 Apr, 2017

1 commit

  • The BFQ I/O scheduler features an optimal fair-queuing
    (proportional-share) scheduling algorithm, enriched with several
    mechanisms to boost throughput and reduce latency for interactive and
    real-time applications. This makes BFQ a large and complex piece of
    code. This commit addresses this issue by splitting BFQ into three
    main, independent components, and by moving each component into a
    separate source file:
    1. Main algorithm: handles the interaction with the kernel, and
    decides which requests to dispatch; it uses the following two further
    components to achieve its goals.
    2. Scheduling engine (Hierarchical B-WF2Q+ scheduling algorithm):
    computes the schedule, using weights and budgets provided by the above
    component.
    3. cgroups support: handles group operations (creation, destruction,
    move, ...).

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente